Aggregation Framework

Emily Stolfo #mongodbdays Aggregation Framework Ruby Engineer/Evangelist, 10gen @EmStolfo Tuesday, January 29, 13


Transcript of Aggregation Framework

Page 1: Aggregation Framework

Emily Stolfo

#mongodbdays

Aggregation Framework

Ruby Engineer/Evangelist, 10gen

@EmStolfo

Tuesday, January 29, 13

Page 2: Aggregation Framework

Agenda

• State of Aggregation
• Pipeline
• Usage and Limitations
• Optimization
• Sharding
• (Expressions)
• Looking Ahead


Page 3: Aggregation Framework

State of Aggregation


Page 4: Aggregation Framework

State of Aggregation

• We're storing our data in MongoDB
• We need to do ad-hoc reporting, grouping, common aggregations, etc.
• What are we using for this?


Page 5: Aggregation Framework

Data Warehousing


Page 6: Aggregation Framework

Data Warehousing

• SQL for reporting and analytics
• Infrastructure complications
  – Additional maintenance
  – Data duplication
  – ETL processes
  – Real time?


Page 7: Aggregation Framework

MapReduce


Page 8: Aggregation Framework

MapReduce

• Extremely versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregation tasks, such as
  – Averages
  – Summation
  – Grouping


Page 9: Aggregation Framework

MapReduce in MongoDB

• Implemented with JavaScript
  – Single-threaded
  – Difficult to debug
• Concurrency
  – Appearance of parallelism
  – Write locks


Page 10: Aggregation Framework

Aggregation Framework


Page 11: Aggregation Framework

Aggregation Framework

• Declared in JSON, executes in C++
• Flexible, functional, and simple
  – Operation pipeline
  – Computational expressions
• Works well with sharding


Page 12: Aggregation Framework

Enabling Developers

• Doing more within MongoDB, faster
• Refactoring MapReduce and groupings
  – Replace pages of JavaScript
  – Longer aggregation pipelines
• Quick aggregations from the shell


Page 13: Aggregation Framework

Pipeline


Page 14: Aggregation Framework

Pipeline

• Process a stream of documents
  – Original input is a collection
  – Final output is a result document
• Series of operators
  – Filter or transform data
  – Input/output chain

ps ax | grep mongod | head -n 1
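The shell pipe is the mental model here: each stage consumes whatever the previous stage emits. A minimal plain-JavaScript sketch of that chaining, assuming documents are held in an array; `runPipeline`, `matchLanguage`, and `takeFirst` are hypothetical helpers for illustration, not MongoDB APIs.

```javascript
// Sketch only: a pipeline is an ordered list of functions, each taking
// an array of documents and returning a new array of documents.
function runPipeline(docs, stages) {
  return stages.reduce((current, stage) => stage(current), docs);
}

// Hypothetical stages, loosely analogous to `grep` and `head -n 1`.
const matchLanguage = (lang) => (docs) => docs.filter((d) => d.language === lang);
const takeFirst = (n) => (docs) => docs.slice(0, n);

const pipelineBooks = [
  { title: "The Great Gatsby", language: "English" },
  { title: "War and Peace", language: "Russian" },
  { title: "Atlas Shrugged", language: "English" },
];

const pipelineResult = runPipeline(pipelineBooks, [
  matchLanguage("English"),
  takeFirst(1),
]);
console.log(pipelineResult); // a single document: The Great Gatsby
```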


Page 15: Aggregation Framework

Pipeline Operators

• $match
• $project
• $group
• $unwind
• $sort
• $limit
• $skip


Page 16: Aggregation Framework

{ _id: 375, title: "The Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

Example book data


Page 17: Aggregation Framework

$match

• Filter documents
• Uses existing query syntax
• (No geospatial operations or $where)


Page 18: Aggregation Framework

Matching Field Values

{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "Atlas Shrugged", pages: 1088, language: "English"}

{ $match: { language: "Russian"}}

{ title: "War and Peace", pages: 1440, language: "Russian"}


Page 19: Aggregation Framework

{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "Atlas Shrugged", pages: 1088, language: "English"}

Matching with Query Operators

{ $match: { pages: { $gt: 1000 }}}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "Atlas Shrugged", pages: 1088, language: "English"}
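The two $match examples can be mimicked in plain JavaScript to show what the filter does. This is a sketch under narrow assumptions: the hypothetical `matchStage` helper handles only exact-value matches and `$gt`, a tiny subset of the real query syntax.

```javascript
// Sketch only: keeps the documents that satisfy every condition in `query`.
function matchStage(docs, query) {
  return docs.filter((doc) =>
    Object.entries(query).every(([field, cond]) => {
      if (cond !== null && typeof cond === "object" && "$gt" in cond) {
        return doc[field] > cond.$gt; // { field: { $gt: value } }
      }
      return doc[field] === cond; // { field: value }
    })
  );
}

const matchBooks = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

const russianBooks = matchStage(matchBooks, { language: "Russian" });
const longBooks = matchStage(matchBooks, { pages: { $gt: 1000 } });
console.log(russianBooks.map((b) => b.title)); // War and Peace
console.log(longBooks.map((b) => b.title)); // War and Peace, Atlas Shrugged
```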


Page 20: Aggregation Framework

$project

• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields


Page 21: Aggregation Framework

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

Including and Excluding Fields

{ $project: { _id: 0, title: 1, language: 1}}

{ title: "Great Gatsby", language: "English"}


Page 22: Aggregation Framework

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

Renaming and Computing Fields

{ $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language"}}

{ _id: 375, avgChapterLength: 24.2222, lang: "English"}
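The reshaping in this example amounts to a per-document map. A plain-JavaScript sketch of the same projection; `projectStage` is a hypothetical helper, hard-coded to this one $project specification rather than a general implementation.

```javascript
// Sketch only: equivalent in spirit to
// { $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language" } }
function projectStage(docs) {
  return docs.map((doc) => ({
    _id: doc._id, // _id is carried through unless excluded with _id: 0
    avgChapterLength: doc.pages / doc.chapters, // computed field
    lang: doc.language, // renamed field
  }));
}

const projected = projectStage([
  { _id: 375, title: "Great Gatsby", pages: 218, chapters: 9, language: "English" },
]);
console.log(projected[0]); // _id 375, avgChapterLength about 24.22, lang "English"
```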


Page 23: Aggregation Framework

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

Creating Sub-Document Fields

{ $project: { title: 1, stats: { pages: "$pages", language: "$language" }}}

{ _id: 375, title: "Great Gatsby", stats: { pages: 218, language: "English" }}


Page 24: Aggregation Framework

$group

• Group documents by an ID
  – Field reference, object, constant
• Other output fields are computed
  – $max, $min, $avg, $sum
  – $addToSet, $push
  – $first, $last
• Processes all data in memory


Page 25: Aggregation Framework

{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "Atlas Shrugged", pages: 1088, language: "English"}

Calculating an Average

{ $group: { _id: "$language", avgPages: { $avg: "$pages" }}}

{ _id: "Russian", avgPages: 1440}

{ _id: "English", avgPages: 653}


Page 26: Aggregation Framework

{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "Atlas Shrugged", pages: 1088, language: "English"}

Summing Fields and Counting

{ $group: { _id: "$language", numTitles: { $sum: 1 }, sumPages: { $sum: "$pages" }}}

{ _id: "Russian", numTitles: 1, sumPages: 1440}

{ _id: "English", numTitles: 2, sumPages: 1306}
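The averaging, counting, and summing examples all follow the same pattern: bucket documents by the group key, then fold each bucket. A plain-JavaScript sketch of that idea; `groupByLanguage` is a hypothetical helper, and the output order here is simply first-seen order, which should not be read as a guarantee about $group.

```javascript
// Sketch only: groups by language and accumulates a count, a sum, and an average.
function groupByLanguage(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.language)) {
      groups.set(doc.language, { _id: doc.language, numTitles: 0, sumPages: 0 });
    }
    const group = groups.get(doc.language);
    group.numTitles += 1; // like { $sum: 1 }
    group.sumPages += doc.pages; // like { $sum: "$pages" }
  }
  // The average is derived from the two running totals, like { $avg: "$pages" }.
  return [...groups.values()].map((g) => ({ ...g, avgPages: g.sumPages / g.numTitles }));
}

const grouped = groupByLanguage([
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
]);
console.log(grouped);
```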


Page 27: Aggregation Framework

{ title: "The Great Gatsby", pages: 218, language: "English"}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "Atlas Shrugged", pages: 1088, language: "English"}

Collecting Distinct Values

{ $group: { _id: "$language", titles: { $addToSet: "$title" }}}

{ _id: "Russian", titles: [ "War and Peace" ]}

{ _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ]}


Page 28: Aggregation Framework

$unwind

• Applied to an array field
• Yield new documents for each array element
  – Array replaced by element value
  – Missing/empty fields → no output
  – Non-array fields → error
• Pipe to $group to aggregate array values


Page 29: Aggregation Framework

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ]}

Yielding Multiple Documents from One

{ $unwind: "$subjects" }

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island"}

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York"}

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s"}
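The $unwind rules listed earlier (one output document per element, nothing for missing or empty arrays, an error for non-arrays) can be sketched in plain JavaScript. `unwind` is a hypothetical helper; treating `null` the same as a missing field is an assumption of this sketch.

```javascript
// Sketch only: emits one copy of each document per array element.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    if (value === undefined || value === null) continue; // missing field: no output
    if (!Array.isArray(value)) {
      throw new TypeError(`field "${field}" is not an array`); // non-array: error
    }
    for (const element of value) {
      out.push({ ...doc, [field]: element }); // array replaced by element value
    }
  }
  return out; // an empty array contributes no documents
}

const unwound = unwind(
  [{ title: "The Great Gatsby", subjects: ["Long Island", "New York", "1920s"] }],
  "subjects"
);
console.log(unwound.map((d) => d.subjects)); // Long Island, New York, 1920s
```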


Page 30: Aggregation Framework

$sort, $limit, $skip

• Sort documents by one or more fields
  – Same order syntax as cursors
  – Waits for earlier pipeline operator to return
  – In-memory unless early and indexed
• Limit and skip follow cursor behavior


Page 31: Aggregation Framework

{ title: "The Great Gatsby" }

{ title: "Brave New World" }

{ title: "Grapes of Wrath" }

{ title: "Animal Farm" }

{ title: "Lord of the Flies" }

{ title: "Fathers and Sons" }

{ title: "Invisible Man" }

{ title: "Fahrenheit 451" }

Sort All the Documents in the Pipeline

{ $sort: { title: 1 }}

{ title: "Animal Farm" }

{ title: "Brave New World" }

{ title: "Fahrenheit 451" }

{ title: "Fathers and Sons" }

{ title: "Grapes of Wrath" }

{ title: "Invisible Man" }

{ title: "Lord of the Flies" }

{ title: "The Great Gatsby" }


Page 32: Aggregation Framework

{ title: "The Great Gatsby" }

{ title: "Brave New World" }

{ title: "Grapes of Wrath" }

{ title: "Animal Farm" }

{ title: "Lord of the Flies" }

{ title: "Fathers and Sons" }

{ title: "Invisible Man" }

{ title: "Fahrenheit 451" }

Limit Documents Through the Pipeline

{ $limit: 5 }

{ title: "The Great Gatsby" }

{ title: "Brave New World" }

{ title: "Grapes of Wrath" }

{ title: "Animal Farm" }

{ title: "Lord of the Flies" }


Page 33: Aggregation Framework

{ title: "The Great Gatsby" }

{ title: "Brave New World" }

{ title: "Grapes of Wrath" }

{ title: "Animal Farm" }

{ title: "Lord of the Flies" }

{ title: "Fathers and Sons" }

{ title: "Invisible Man" }

{ title: "Fahrenheit 451" }

Skip Over Documents in the Pipeline

{ $skip: 5 }

{ title: "Fathers and Sons" }

{ title: "Invisible Man" }

{ title: "Fahrenheit 451" }
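Chaining the three operators gives simple paging, and the order of the stages matters. A plain-JavaScript sketch using the same eight titles; `sortByTitle`, `skip`, and `limit` are hypothetical helpers, and the string comparison is plain code-unit ordering, which is an assumption rather than a statement about MongoDB's sort rules.

```javascript
// Sketch only: array equivalents of { $sort: { title: 1 } }, $skip, and $limit.
const sortByTitle = (docs) =>
  [...docs].sort((a, b) => (a.title < b.title ? -1 : a.title > b.title ? 1 : 0));
const skip = (docs, n) => docs.slice(n);
const limit = (docs, n) => docs.slice(0, n);

const shelf = [
  "The Great Gatsby", "Brave New World", "Grapes of Wrath", "Animal Farm",
  "Lord of the Flies", "Fathers and Sons", "Invisible Man", "Fahrenheit 451",
].map((title) => ({ title }));

// Sort first, then page: skip five sorted titles and keep the next two.
const sortedPage = limit(skip(sortByTitle(shelf), 5), 2);
// Skipping before sorting drops five documents in input order instead.
const unsortedTail = skip(shelf, 5);

console.log(sortedPage.map((d) => d.title)); // Invisible Man, Lord of the Flies
console.log(unsortedTail.map((d) => d.title)); // Fathers and Sons, Invisible Man, Fahrenheit 451
```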


Page 34: Aggregation Framework

Usage and Limitations


Page 35: Aggregation Framework

Usage

• collection.aggregate() method
  – Mongo shell
  – Most drivers
• aggregate database command


Page 36: Aggregation Framework

db.books.aggregate([ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}}])

{ result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1}

Collection


Page 37: Aggregation Framework

db.runCommand({ aggregate: "books", pipeline: [ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ]})

{ result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1}

Database Command


Page 38: Aggregation Framework

Limitations

• Result limited by BSON document size (16 MB)
  – Final command result
  – Intermediate shard results
• Pipeline operator memory limits
• Some BSON types unsupported
  – Binary, Code, deprecated types


Page 39: Aggregation Framework

Sharding


Page 40: Aggregation Framework

Sharding

• Split the pipeline at first $group or $sort
  – Shards execute pipeline up to that point
  – mongos merges results and continues
• Early $match may exclude shards
• CPU and memory implications for mongos


Page 41: Aggregation Framework

[ { $match: { /* filter by shard key */ }}, { $project: { /* select fields */ }}, { $group: { /* group by some field */ }}, { $sort: { /* sort by some field */ }}, { $project: { /* reshape result */ }}]

Sharding
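The split rule can be illustrated by cutting a pipeline array at its first $group or $sort stage. This is only a sketch of the description in the slides, not of the server's actual logic: `splitForSharding` is a hypothetical helper, and how the boundary stage itself is divided between the shards and mongos is deliberately left out.

```javascript
// Sketch only: stages before the first $group/$sort go to the shards,
// the boundary stage and everything after it go to the merging side.
function splitForSharding(pipeline) {
  const cut = pipeline.findIndex((stage) => "$group" in stage || "$sort" in stage);
  if (cut === -1) {
    return { shardStages: pipeline.slice(), mergeStages: [] };
  }
  return { shardStages: pipeline.slice(0, cut), mergeStages: pipeline.slice(cut) };
}

// Placeholder stage bodies; the field names are invented for illustration.
const shardedPipeline = [
  { $match: { region: "EU" } },
  { $project: { language: 1 } },
  { $group: { _id: "$language", numTitles: { $sum: 1 } } },
  { $sort: { numTitles: -1 } },
  { $project: { numTitles: 1 } },
];

const plan = splitForSharding(shardedPipeline);
console.log(plan.shardStages.map((s) => Object.keys(s)[0])); // $match, $project
console.log(plan.mergeStages.map((s) => Object.keys(s)[0])); // $group, $sort, $project
```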


Page 42: Aggregation Framework

Aggregation in a sharded cluster


Page 43: Aggregation Framework

Expressions


Page 44: Aggregation Framework

Expressions

• Return computed values
• Used with $project and $group
• Reference fields using $ (e.g. "$x")
• Expressions may be nested


Page 45: Aggregation Framework

Boolean Operators

• Input array of one or more values
  – $and, $or
  – Short-circuit logic
• Invert values with $not
• Evaluation of non-boolean types
  – null, undefined, zero ▶ false
  – Non-zero, strings, dates, objects ▶ true

{ $and: [true, false] } ▶ false
{ $or: ["foo", 0] } ▶ true
{ $not: null } ▶ true
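The coercion rules differ slightly from JavaScript's own truthiness: going by the list here, every string counts as true, including the empty string. A plain-JavaScript sketch of those rules; `toBool`, `and`, `or`, and `not` are hypothetical helpers.

```javascript
// Sketch only: null, undefined, zero, and false are false; everything else is true.
function toBool(value) {
  return !(value === null || value === undefined || value === 0 || value === false);
}

// Array.prototype.every/some stop at the first deciding element (short-circuit).
const and = (values) => values.every(toBool);
const or = (values) => values.some(toBool);
const not = (value) => !toBool(value);

console.log(and([true, false])); // false
console.log(or(["foo", 0])); // true
console.log(not(null)); // true
console.log(toBool("")); // true under these rules, unlike Boolean("") in JavaScript
```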


Page 46: Aggregation Framework

Comparison Operators

• Compare numbers, strings, and dates
• Input array with two operands
  – $cmp, $eq, $ne
  – $gt, $gte, $lt, $lte

{ $cmp: [3, 4] } ▶ -1
{ $eq: ["foo", "bar"] } ▶ false
{ $ne: ["foo", "bar"] } ▶ true
{ $gt: [9, 7] } ▶ true


Page 47: Aggregation Framework

Arithmetic Operators

• Input array of one or more numbers
  – $add, $multiply
• Input array of two operands
  – $subtract, $divide, $mod

{ $add: [1, 2, 3] } ▶ 6
{ $multiply: [2, 2, 2] } ▶ 8
{ $subtract: [10, 7] } ▶ 3
{ $divide: [10, 2] } ▶ 5
{ $mod: [8, 3] } ▶ 2


Page 48: Aggregation Framework

String Operators

• $strcasecmp case-insensitive comparison
  – $cmp is case-sensitive
• $toLower and $toUpper case change
• $substr for sub-string extraction
• Not encoding aware (assumes ASCII alphabet)

{ $strcasecmp: ["foo", "bar"] } ▶ 1
{ $substr: ["foo", 1, 2] } ▶ "oo"
{ $toUpper: "foo" } ▶ "FOO"
{ $toLower: "BAR" } ▶ "bar"
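For readers more used to JavaScript strings, the same results can be reproduced with a few lines. A sketch assuming ASCII input only; `strcasecmp` and `substr` are hypothetical helpers named after the operators, with `substr` taking a start offset and a length.

```javascript
// Sketch only: case-insensitive three-way comparison returning -1, 0, or 1.
function strcasecmp(a, b) {
  const x = a.toUpperCase();
  const y = b.toUpperCase();
  return x < y ? -1 : x > y ? 1 : 0;
}

// Start offset plus length, mirroring the { $substr: [string, start, length] } form.
const substr = (s, start, length) => s.slice(start, start + length);

console.log(strcasecmp("foo", "bar")); // 1
console.log(strcasecmp("FOO", "foo")); // 0
console.log(substr("foo", 1, 2)); // "oo"
console.log("foo".toUpperCase(), "BAR".toLowerCase()); // FOO bar
```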


Page 49: Aggregation Framework

Date Operators

• Extract values from date objects
  – $dayOfYear, $dayOfMonth, $dayOfWeek
  – $year, $month, $week
  – $hour, $minute, $second

{ $year: ISODate("2012-10-24T00:00:00.000Z") } ▶ 2012
{ $month: ISODate("2012-10-24T00:00:00.000Z") } ▶ 10
{ $dayOfMonth: ISODate("2012-10-24T00:00:00.000Z") } ▶ 24
{ $dayOfWeek: ISODate("2012-10-24T00:00:00.000Z") } ▶ 4
{ $dayOfYear: ISODate("2012-10-24T00:00:00.000Z") } ▶ 298
{ $week: ISODate("2012-10-24T00:00:00.000Z") } ▶ 43
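The date parts can be cross-checked with JavaScript's UTC date methods. A sketch under two assumptions: $dayOfWeek counts from 1 = Sunday, and $week follows the Sunday-based convention in which days before the year's first Sunday fall in week 0. `dateParts` is a hypothetical helper.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Sketch only: derives the date parts in UTC.
function dateParts(date) {
  const year = date.getUTCFullYear();
  const weekday = date.getUTCDay(); // 0 = Sunday
  // Zero-based number of whole days since January 1 of the same year.
  const dayIndex = Math.floor((date.getTime() - Date.UTC(year, 0, 1)) / MS_PER_DAY);
  return {
    year,
    month: date.getUTCMonth() + 1, // 1 = January
    dayOfMonth: date.getUTCDate(),
    dayOfWeek: weekday + 1, // 1 = Sunday
    dayOfYear: dayIndex + 1, // 1 = January 1
    week: Math.floor((dayIndex + 7 - weekday) / 7), // Sunday-based week number
  };
}

const parts = dateParts(new Date("2012-10-24T00:00:00.000Z"));
console.log(parts);
// year 2012, month 10, dayOfMonth 24, dayOfWeek 4, dayOfYear 298, week 43
```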


Page 50: Aggregation Framework

Conditional Operators

• $cond ternary operator
• $ifNull

{ $cond: [{ $eq: [1, 2] }, "same", "different"] } ▶ "different"
{ $ifNull: ["foo", "bar"] } ▶ "foo"
{ $ifNull: [null, "bar"] } ▶ "bar"
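Both operators map onto familiar JavaScript constructs. A minimal sketch; `cond` and `ifNull` are hypothetical helpers, and treating `undefined` like `null` in `ifNull` is an assumption of this sketch.

```javascript
// Sketch only: $cond behaves like the ternary operator.
const cond = (test, thenValue, elseValue) => (test ? thenValue : elseValue);

// Sketch only: $ifNull returns the fallback when the first value is null.
const ifNull = (value, fallback) =>
  value === null || value === undefined ? fallback : value;

console.log(cond(1 === 2, "same", "different")); // "different"
console.log(ifNull("foo", "bar")); // "foo"
console.log(ifNull(null, "bar")); // "bar"
```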


Page 51: Aggregation Framework

Looking Ahead


Page 52: Aggregation Framework

Framework Use Cases

• Basic aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing time series data


Page 53: Aggregation Framework

Extending the Framework

• Adding new pipeline operators, expressions
• $out and $tee for output control
  – https://jira.mongodb.org/browse/SERVER-3253


Page 54: Aggregation Framework

Future Enhancements

• Automatically move $match earlier if possible
• Pipeline explain facility
• Memory usage improvements
  – Grouping input sorted by _id
  – Sorting with limited output


Page 55: Aggregation Framework

Ruby Engineer/Evangelist, 10gen

@EmStolfo

Emily Stolfo

#mongodbdays

Thank You
