Aggregation Framework
Emily Stolfo
#mongodbdays
Aggregation Framework
Ruby Engineer/Evangelist, 10gen
@EmStolfo
Tuesday, January 29, 13
Agenda
• State of Aggregation
• Pipeline
• Usage and Limitations
• Optimization
• Sharding
• (Expressions)
• Looking Ahead
State of Aggregation
• We're storing our data in MongoDB
• We need to do ad-hoc reporting, grouping, common aggregations, etc.
• What are we using for this?
Data Warehousing
• SQL for reporting and analytics
• Infrastructure complications
  – Additional maintenance
  – Data duplication
  – ETL processes
  – Real time?
MapReduce
• Extremely versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregation tasks, such as
  – Averages
  – Summation
  – Grouping
MapReduce in MongoDB
• Implemented with JavaScript
  – Single-threaded
  – Difficult to debug
• Concurrency
  – Appearance of parallelism
  – Write locks
Aggregation Framework
• Declared in JSON, executes in C++
• Flexible, functional, and simple
  – Operation pipeline
  – Computational expressions
• Works well with sharding
Enabling Developers
• Doing more within MongoDB, faster
• Refactoring MapReduce and groupings
  – Replace pages of JavaScript
  – Longer aggregation pipelines
• Quick aggregations from the shell
Pipeline
• Process a stream of documents
  – Original input is a collection
  – Final output is a result document
• Series of operators
  – Filter or transform data
  – Input/output chain
ps ax | grep mongod | head -n 1
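The shell analogy above can be sketched in plain JavaScript. This is only an illustration of the input/output chain, not how MongoDB executes a pipeline; `runPipeline`, `grep`, and `head` are hypothetical helpers and the process list is made-up sample data.

```javascript
// Hypothetical helper: feed each stage's output into the next stage,
// the way a shell pipe connects commands.
function runPipeline(input, stages) {
  return stages.reduce((current, stage) => stage(current), input);
}

// Stand-ins for `grep PATTERN` and `head -n N`.
const grep = (pattern) => (lines) => lines.filter((line) => line.includes(pattern));
const head = (n) => (lines) => lines.slice(0, n);

// Made-up sample input standing in for the output of `ps ax`.
const processList = ["101 mongod --dbpath /data", "102 sshd", "103 mongod --shardsvr"];

const firstMongod = runPipeline(processList, [grep("mongod"), head(1)]);
console.log(firstMongod); // [ '101 mongod --dbpath /data' ]
```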
Pipeline Operators
• $match
• $project
• $group
• $unwind
• $sort
• $limit
• $skip
{ _id: 375, title: "The Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}
Example book data
$match
• Filter documents
• Uses existing query syntax
• (No geospatial operations or $where)
{ title: "The Great Gatsby", pages: 218, language: "English"}
{ title: "War and Peace", pages: 1440, language: "Russian"}
{ title: "Atlas Shrugged", pages: 1088, language: "English"}
Matching Field Values
{ $match: { language: "Russian"}}
{ title: "War and Peace", pages: 1440, language: "Russian"}
{ title: "The Great Gatsby", pages: 218, language: "English"}
{ title: "War and Peace", pages: 1440, language: "Russian"}
{ title: "Atlas Shrugged", pages: 1088, language: "English"}
Matching with Query Operators
{ $match: { pages: { $gt: 1000 }}}
{ title: "War and Peace", pages: 1440, language: "Russian"}
{ title: "Atlas Shrugged", pages: 1088, language: "English"}
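What the two $match examples do to the document stream can be approximated in plain JavaScript. This is a minimal sketch that handles only equality and `$gt`; `matchStage` is a hypothetical name, and the real query language supports far more than this.

```javascript
// Hypothetical matcher: supports only field equality and { $gt: n }.
function matchStage(docs, query) {
  return docs.filter((doc) =>
    Object.entries(query).every(([field, condition]) => {
      if (condition !== null && typeof condition === "object" && "$gt" in condition) {
        return doc[field] > condition.$gt; // query operator form
      }
      return doc[field] === condition; // plain field value form
    })
  );
}

const matchBooks = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

const russian = matchStage(matchBooks, { language: "Russian" }).map((b) => b.title);
const longBooks = matchStage(matchBooks, { pages: { $gt: 1000 } }).map((b) => b.title);
console.log(russian);   // [ 'War and Peace' ]
console.log(longBooks); // [ 'War and Peace', 'Atlas Shrugged' ]
```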
$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields
{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}
Including and Excluding Fields
{ $project: { _id: 0, title: 1, language: 1}}
{ title: "Great Gatsby", language: "English"}
{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}
Renaming and Computing Fields
{ $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language"}}
{ _id: 375, avgChapterLength: 24.2222, lang: "English"}
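The renaming and computing projection can be written out by hand for a single document. This is a sketch in plain JavaScript; `projectChapterStats` is a hypothetical function that hard-codes this one projection rather than interpreting a $project specification.

```javascript
// Hypothetical: the projection from the slide, applied to one document.
function projectChapterStats(doc) {
  return {
    _id: doc._id,                               // _id is kept unless excluded with _id: 0
    avgChapterLength: doc.pages / doc.chapters, // { $divide: ["$pages", "$chapters"] }
    lang: doc.language,                         // lang: "$language" renames the field
  };
}

const gatsby = { _id: 375, title: "Great Gatsby", pages: 218, chapters: 9, language: "English" };
const projected = projectChapterStats(gatsby);
console.log(projected.avgChapterLength.toFixed(4)); // 24.2222
console.log(Object.keys(projected));                // [ '_id', 'avgChapterLength', 'lang' ]
```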
{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}
Creating Sub-Document Fields
{ $project: { title: 1, stats: { pages: "$pages", language: "$language" }}}
{ _id: 375, title: "Great Gatsby", stats: { pages: 218, language: "English" }}
$group
• Group documents by an ID
  – Field reference, object, constant
• Other output fields are computed
  – $max, $min, $avg, $sum
  – $addToSet, $push
  – $first, $last
• Processes all data in memory
{ title: "The Great Gatsby", pages: 218, language: "English"}
{ title: "War and Peace", pages: 1440, language: "Russian"}
{ title: "Atlas Shrugged", pages: 1088, language: "English"}
Calculating an Average
{ $group: { _id: "$language", avgPages: { $avg: "$pages" }}}
{ _id: "Russian", avgPages: 1440}
{ _id: "English", avgPages: 653}
{ title: "The Great Gatsby", pages: 218, language: "English"}
{ title: "War and Peace", pages: 1440, language: "Russian"}
{ title: "Atlas Shrugged", pages: 1088, language: "English"}
Summing Fields and Counting
{ $group: { _id: "$language", numTitles: { $sum: 1 }, sumPages: { $sum: "$pages" }}}
{ _id: "Russian", numTitles: 1, sumPages: 1440}
{ _id: "English", numTitles: 2, sumPages: 1306}
{ title: "The Great Gatsby", pages: 218, language: "English"}
{ title: "War and Peace", pages: 1440, language: "Russian"}
{ title: "Atlas Shrugged", pages: 1088, language: "English"}
Collecting Distinct Values
{ $group: { _id: "$language", titles: { $addToSet: "$title" }}}
{ _id: "Russian", titles: [ "War and Peace" ]}
{ _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ]}
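The three $group examples share one idea: keep one bucket per distinct `_id` value and update its accumulators for every incoming document. A plain-JavaScript sketch of that idea follows; `groupByLanguage` is a hypothetical name, the grouping key is hard-coded, and the order of groups and of set members here is simply insertion order, which the slides do not promise.

```javascript
// Hypothetical in-memory grouping: one accumulator bucket per language.
function groupByLanguage(docs) {
  const buckets = new Map();
  for (const doc of docs) {
    if (!buckets.has(doc.language)) {
      buckets.set(doc.language, { numTitles: 0, sumPages: 0, titles: new Set() });
    }
    const bucket = buckets.get(doc.language);
    bucket.numTitles += 1;        // { $sum: 1 }
    bucket.sumPages += doc.pages; // { $sum: "$pages" }
    bucket.titles.add(doc.title); // { $addToSet: "$title" }
  }
  return [...buckets.entries()].map(([language, bucket]) => ({
    _id: language,
    numTitles: bucket.numTitles,
    sumPages: bucket.sumPages,
    avgPages: bucket.sumPages / bucket.numTitles, // { $avg: "$pages" }
    titles: [...bucket.titles],
  }));
}

const groupBooks = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

const groups = groupByLanguage(groupBooks);
const english = groups.find((g) => g._id === "English");
console.log(english.numTitles, english.sumPages, english.avgPages); // 2 1306 653
```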
$unwind
• Applied to an array field
• Yield new documents for each array element
  – Array replaced by element value
  – Missing/empty fields → no output
  – Non-array fields → error
• Pipe to $group to aggregate array values
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ]}
Yielding Multiple Documents from One
{ $unwind: "$subjects" }
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island"}
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York"}
{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s"}
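The $unwind rules listed earlier (one output document per element, nothing for missing or empty fields, an error for non-arrays) can be sketched in plain JavaScript. `unwind` is a hypothetical helper; treating only an absent field as "missing" is an assumption of this sketch.

```javascript
// Hypothetical helper mirroring the rules from the $unwind slide.
function unwind(docs, field) {
  const output = [];
  for (const doc of docs) {
    const value = doc[field];
    if (value === undefined) continue; // missing field: no output
    if (!Array.isArray(value)) {
      throw new TypeError(field + " is not an array"); // non-array field: error
    }
    for (const element of value) {
      output.push({ ...doc, [field]: element }); // array replaced by element value
    }
    // An empty array runs the loop zero times, so it also yields no output.
  }
  return output;
}

const unwound = unwind(
  [{ title: "The Great Gatsby", subjects: ["Long Island", "New York", "1920s"] }],
  "subjects"
);
console.log(unwound.map((d) => d.subjects)); // [ 'Long Island', 'New York', '1920s' ]
```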
$sort, $limit, $skip
• Sort documents by one or more fields
  – Same order syntax as cursors
  – Waits for earlier pipeline operator to return
  – In-memory unless early and indexed
• Limit and skip follow cursor behavior
{ title: "The Great Gatsby" }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" }
{ title: "Lord of the Flies" }
{ title: "Fathers and Sons" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }
Sort All the Documents in the Pipeline
{ $sort: { title: 1 }}
{ title: "Animal Farm" }
{ title: "Brave New World" }
{ title: "Fahrenheit 451" }
{ title: "Fathers and Sons" }
{ title: "Grapes of Wrath" }
{ title: "Invisible Man" }
{ title: "Lord of the Flies" }
{ title: "The Great Gatsby" }
{ title: "The Great Gatsby" }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" }
{ title: "Lord of the Flies" }
{ title: "Fathers and Sons" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }
Limit Documents Through the Pipeline
{ $limit: 5 }
{ title: "The Great Gatsby" }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" }
{ title: "Lord of the Flies" }
{ title: "The Great Gatsby" }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" }
{ title: "Lord of the Flies" }
{ title: "Fathers and Sons" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }
Skip Over Documents in the Pipeline
{ $skip: 5 }
{ title: "Fathers and Sons" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }
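Chaining the three operators gives simple paging. The sketch below mimics `{ $sort: { title: 1 } }, { $skip: 5 }, { $limit: 2 }` on the same eight titles using plain array operations; the helper names are hypothetical and the string comparison is ordinary JavaScript ordering, which is an assumption rather than MongoDB's comparison rules.

```javascript
const shelf = [
  "The Great Gatsby", "Brave New World", "Grapes of Wrath", "Animal Farm",
  "Lord of the Flies", "Fathers and Sons", "Invisible Man", "Fahrenheit 451",
].map((title) => ({ title }));

// Hypothetical stand-ins for the three stages.
const sortByTitle = (docs) =>
  [...docs].sort((a, b) => (a.title < b.title ? -1 : a.title > b.title ? 1 : 0));
const skipDocs = (docs, n) => docs.slice(n);
const limitDocs = (docs, n) => docs.slice(0, n);

// Sorting must see every document before it can emit the first one,
// so it runs to completion before skip and limit are applied.
const page = limitDocs(skipDocs(sortByTitle(shelf), 5), 2).map((d) => d.title);
console.log(page); // [ 'Invisible Man', 'Lord of the Flies' ]
```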
Usage and Limitations
Usage
• collection.aggregate() method
  – Mongo shell
  – Most drivers
• aggregate database command
db.books.aggregate([ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}}])
{ result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1}
Collection
db.runCommand({ aggregate: "books", pipeline: [ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ]})
{ result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1}
Database Command
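The two invocations differ only in packaging: the command form wraps the same pipeline array in a document that names the collection. A small sketch of that relationship, where `toAggregateCommand` is a hypothetical helper and not part of any driver:

```javascript
// Hypothetical helper: build the document that would be handed to db.runCommand().
function toAggregateCommand(collectionName, pipeline) {
  return { aggregate: collectionName, pipeline: pipeline };
}

// The same pipeline array that db.books.aggregate([...]) receives.
const countByLanguage = [
  { $project: { language: 1 } },
  { $group: { _id: "$language", numTitles: { $sum: 1 } } },
];

const aggregateCommand = toAggregateCommand("books", countByLanguage);
console.log(JSON.stringify(aggregateCommand, null, 2));
```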
Limitations
• Result limited by BSON document size
  – Final command result
  – Intermediate shard results
• Pipeline operator memory limits
• Some BSON types unsupported
  – Binary, Code, deprecated types
Sharding
• Split the pipeline at the first $group or $sort
  – Shards execute the pipeline up to that point
  – mongos merges results and continues
• An early $match may exclude shards
• CPU and memory implications for mongos
[ { $match: { /* filter by shard key */ }}, { $project: { /* select fields */ }}, { $group: { /* group by some field */ }}, { $sort: { /* sort by some field */ }}, { $project: { /* reshape result */ }}]
Aggregation in a sharded cluster
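The split rule can be sketched as a function over the pipeline array. This is a simplified model in plain JavaScript: `splitForShards` is a hypothetical name, and whether part of the $group or $sort work itself also happens on the shards is not modeled here.

```javascript
// Hypothetical model of the rule "split at the first $group or $sort".
function splitForShards(pipeline) {
  const isSplitPoint = (stage) => "$group" in stage || "$sort" in stage;
  const index = pipeline.findIndex(isSplitPoint);
  if (index === -1) {
    return { shardPart: pipeline, mergePart: [] }; // nothing forces a merge step
  }
  return {
    shardPart: pipeline.slice(0, index), // runs on each shard
    mergePart: pipeline.slice(index),    // continues on mongos after merging
  };
}

const shardedPipeline = [
  { $match: { /* filter by shard key */ } },
  { $project: { /* select fields */ } },
  { $group: { /* group by some field */ } },
  { $sort: { /* sort by some field */ } },
  { $project: { /* reshape result */ } },
];

const split = splitForShards(shardedPipeline);
console.log(split.shardPart.map((s) => Object.keys(s)[0])); // [ '$match', '$project' ]
console.log(split.mergePart.map((s) => Object.keys(s)[0])); // [ '$group', '$sort', '$project' ]
```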
Expressions
• Return computed values
• Used with $project and $group
• Reference fields using $ (e.g. "$x")
• Expressions may be nested
Boolean Operators
• Input array of one or more values
  – $and, $or
  – Short-circuit logic
• Invert values with $not
• Evaluation of non-boolean types
  – null, undefined, zero ▶ false
  – Non-zero, strings, dates, objects ▶ true
{ $and: [true, false] } ▶ false
{ $or: ["foo", 0] } ▶ true
{ $not: null } ▶ true
Comparison Operators
• Compare numbers, strings, and dates
• Input array with two operands
  – $cmp, $eq, $ne
  – $gt, $gte, $lt, $lte
{ $cmp: [3, 4] } ▶ -1
{ $eq: ["foo", "bar"] } ▶ false
{ $ne: ["foo", "bar"] } ▶ true
{ $gt: [9, 7] } ▶ true
Arithmetic Operators
• Input array of one or more numbers
  – $add, $multiply
• Input array of two operands
  – $subtract, $divide, $mod
{ $add: [1, 2, 3] } ▶ 6
{ $multiply: [2, 2, 2] } ▶ 8
{ $subtract: [10, 7] } ▶ 3
{ $divide: [10, 2] } ▶ 5
{ $mod: [8, 3] } ▶ 2
String Operators
• $strcasecmp case-insensitive comparison
  – $cmp is case-sensitive
• $toLower and $toUpper case change
• $substr for sub-string extraction
• Not encoding aware (assumes ASCII alphabet)
{ $strcasecmp: ["foo", "bar"] } ▶ 1
{ $substr: ["foo", 1, 2] } ▶ "oo"
{ $toUpper: "foo" } ▶ "FOO"
{ $toLower: "BAR" } ▶ "bar"
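The string operators map closely onto ordinary string functions. The sketch below uses hypothetical helper names and assumes ASCII input, in line with the "not encoding aware" caveat; the arguments to `substrExpr` are a start offset and a length.

```javascript
// Case-insensitive three-way comparison, assuming ASCII strings.
function strcasecmpExpr(a, b) {
  const x = a.toUpperCase();
  const y = b.toUpperCase();
  return x < y ? -1 : x > y ? 1 : 0;
}

// Start offset and length, as in { $substr: ["foo", 1, 2] }.
const substrExpr = (s, start, length) => s.slice(start, start + length);

console.log(strcasecmpExpr("foo", "bar")); // 1
console.log(strcasecmpExpr("FOO", "foo")); // 0
console.log(substrExpr("foo", 1, 2));      // oo
```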
Date Operators
• Extract values from date objects
  – $dayOfYear, $dayOfMonth, $dayOfWeek
  – $year, $month, $week
  – $hour, $minute, $second
{ $year: ISODate("2012-10-24T00:00:00.000Z") } ▶ 2012
{ $month: ISODate("2012-10-24T00:00:00.000Z") } ▶ 10
{ $dayOfMonth: ISODate("2012-10-24T00:00:00.000Z") } ▶ 24
{ $dayOfWeek: ISODate("2012-10-24T00:00:00.000Z") } ▶ 4
{ $dayOfYear: ISODate("2012-10-24T00:00:00.000Z") } ▶ 298
{ $week: ISODate("2012-10-24T00:00:00.000Z") } ▶ 43
Conditional Operators
• $cond ternary operator
• $ifNull
{ $cond: [{ $eq: [1, 2] }, "same", "different"] } ▶ "different"
{ $ifNull: ["foo", "bar"] } ▶ "foo"
{ $ifNull: [null, "bar"] } ▶ "bar"
Looking Ahead
Framework Use Cases
• Basic aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing time series data
Extending the Framework
• Adding new pipeline operators, expressions
• $out and $tee for output control
  – https://jira.mongodb.org/browse/SERVER-3253
Future Enhancements
• Automatically move $match earlier if possible
• Pipeline explain facility
• Memory usage improvements
  – Grouping input sorted by _id
  – Sorting with limited output
Ruby Engineer/Evangelist, 10gen
@EmStolfo
Emily Stolfo
#mongodbdays
Thank You