Aggregation Framework

Emily Stolfo

#mongodbdays

Ruby Engineer/Evangelist, 10gen

@EmStolfo

Tuesday, January 29, 13

Agenda

• State of Aggregation
• Pipeline
• Usage and Limitations
• Optimization
• Sharding
• (Expressions)
• Looking Ahead


State of Aggregation


• We're storing our data in MongoDB
• We need to do ad-hoc reporting, grouping, common aggregations, etc.
• What are we using for this?


Data Warehousing


• SQL for reporting and analytics
• Infrastructure complications
  – Additional maintenance
  – Data duplication
  – ETL processes
  – Real time?


MapReduce


• Extremely versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregation tasks, such as
  – Averages
  – Summation
  – Grouping


MapReduce in MongoDB

• Implemented with JavaScript
  – Single-threaded
  – Difficult to debug
• Concurrency
  – Appearance of parallelism
  – Write locks



Aggregation Framework

• Declared in JSON, executes in C++
• Flexible, functional, and simple
  – Operation pipeline
  – Computational expressions
• Works well with sharding


Enabling Developers

• Doing more within MongoDB, faster
• Refactoring MapReduce and groupings
  – Replace pages of JavaScript
  – Longer aggregation pipelines
• Quick aggregations from the shell


Pipeline


• Process a stream of documents
  – Original input is a collection
  – Final output is a result document
• Series of operators
  – Filter or transform data
  – Input/output chain

ps ax | grep mongod | head -n 1
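The shell pipe above can be imitated in a few lines of plain JavaScript. This is only a sketch of the idea of chaining operators over a stream. The names `runPipeline`, `grep` and `head` and the sample process list are made up for illustration and are not MongoDB APIs.

```javascript
// Each "stage" is a function from an array of documents to a new array.
function runPipeline(docs, stages) {
  // Feed the output of one stage into the next, left to right.
  return stages.reduce((stream, stage) => stage(stream), docs);
}

// Hypothetical input standing in for the output of `ps ax`.
const processes = [
  { pid: 101, command: "mongod --dbpath /data/db" },
  { pid: 102, command: "bash" },
  { pid: 103, command: "mongod --port 27018" },
];

// Rough analogues of `grep mongod` and `head -n 1`.
const grep = (word) => (docs) => docs.filter((d) => d.command.includes(word));
const head = (n) => (docs) => docs.slice(0, n);

const firstMongod = runPipeline(processes, [grep("mongod"), head(1)]);
console.log(firstMongod);
```

Each stage only sees the output of the stage before it, which is the same input/output chain the aggregation pipeline uses.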


Pipeline Operators

• $match
• $project
• $group
• $unwind
• $sort
• $limit
• $skip


Example book data

{ _id: 375, title: "The Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}


$match

• Filter documents
• Uses existing query syntax
• (No geospatial operations or $where)


Matching Field Values

{ $match: { language: "Russian"}}

{ title: "War and Peace", pages: 1440, language: "Russian"}

{ title: "The Great Gatsby", pages: 218, language: "English"}


{ title: "Atlas Shrugged", pages: 1088, language: "English"}





Matching with Query Operators

{ $match: { pages: { $gt: 1000 }}}
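To make the effect of the two $match stages concrete, here is a small in-memory imitation in plain JavaScript. It is a sketch, not how the server evaluates queries: the helper `match` is hypothetical and understands only exact values and `{ $gt: n }`.

```javascript
// The three sample books from the slides.
const books = [
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Hypothetical helper: keeps documents that satisfy every field condition.
function match(docs, query) {
  return docs.filter((doc) =>
    Object.entries(query).every(([field, cond]) => {
      if (cond !== null && typeof cond === "object" && "$gt" in cond) {
        return doc[field] > cond.$gt; // query operator form
      }
      return doc[field] === cond; // exact field value form
    })
  );
}

const russian = match(books, { language: "Russian" });
const longBooks = match(books, { pages: { $gt: 1000 } });
console.log(russian.map((b) => b.title));   // [ 'War and Peace' ]
console.log(longBooks.map((b) => b.title)); // [ 'War and Peace', 'Atlas Shrugged' ]
```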




$project

• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields


Including and Excluding Fields

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

{ $project: { _id: 0, title: 1, language: 1}}

{ title: "Great Gatsby", language: "English"}


Renaming and Computing Fields

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

{ $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language"}}

{ _id: 375, avgChapterLength: 24.2222, lang: "English"}


Creating Sub-Document Fields

{ _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English"}

{ $project: { title: 1, stats: { pages: "$pages", language: "$language" }}}

{ _id: 375, title: "Great Gatsby", stats: { pages: 218, language: "English" }}
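The reshaping done by $project can be pictured as a function that builds a new document from an old one. The plain JavaScript below is an illustrative sketch only. The function names are hypothetical, and each one hard-codes a single $project specification from the slides rather than interpreting arbitrary ones.

```javascript
const book = {
  _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true,
  pages: 218, chapters: 9,
  subjects: ["Long Island", "New York", "1920s"], language: "English",
};

// Stand-in for { $project: { _id: 0, title: 1, language: 1 } }.
function projectTitleAndLanguage(doc) {
  return { title: doc.title, language: doc.language };
}

// Stand-in for { $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] },
//                            lang: "$language" } }.
function projectChapterStats(doc) {
  return {
    _id: doc._id, // _id is carried along unless explicitly excluded
    avgChapterLength: doc.pages / doc.chapters, // computed field
    lang: doc.language, // renamed field
  };
}

// Stand-in for { $project: { title: 1, stats: { pages: "$pages", language: "$language" } } }.
function projectStatsSubDocument(doc) {
  return { _id: doc._id, title: doc.title, stats: { pages: doc.pages, language: doc.language } };
}

console.log(projectTitleAndLanguage(book));
console.log(projectChapterStats(book));
console.log(projectStatsSubDocument(book));
```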


$group

• Group documents by an ID
  – Field reference, object, constant
• Other output fields are computed
  – $max, $min, $avg, $sum
  – $addToSet, $push
  – $first, $last
• Processes all data in memory





Calculating an Average

{ $group: { _id: "$language", avgPages: { $avg: "$pages" }}}

{ _id: "Russian", avgPages: 1440}

{ _id: "English", avgPages: 653}



{ title: "War and Peace", pages: 1440, language: "Russian"}


Summing Fields and Counting

{ $group: { _id: "$language", numTitles: { $sum: 1 }, sumPages: { $sum: "$pages" }}}

{ _id: "Russian", numTitles: 1, sumPages: 1440}

{ _id: "English", numTitles: 2, sumPages: 1306}





Collecting Distinct Values

{ $group: { _id: "$language", titles: { $addToSet: "$title" }}}

{ _id: "Russian", titles: [ "War and Peace" ]}

{ _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ]}
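The three $group examples can be reproduced in memory with a single pass over the documents. The sketch below is plain JavaScript written for illustration, with a hypothetical `groupByLanguage` helper. It mirrors the accumulators `$sum`, `$avg` and `$addToSet` for this one grouping key and does not claim to match the server's implementation.

```javascript
const books = [
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

function groupByLanguage(docs) {
  const groups = new Map(); // one accumulator record per distinct _id
  for (const doc of docs) {
    if (!groups.has(doc.language)) {
      groups.set(doc.language, { numTitles: 0, sumPages: 0, titles: new Set() });
    }
    const g = groups.get(doc.language);
    g.numTitles += 1;        // { $sum: 1 }
    g.sumPages += doc.pages; // { $sum: "$pages" }
    g.titles.add(doc.title); // { $addToSet: "$title" }
  }
  return [...groups.entries()].map(([language, g]) => ({
    _id: language,
    numTitles: g.numTitles,
    sumPages: g.sumPages,
    avgPages: g.sumPages / g.numTitles, // { $avg: "$pages" }
    titles: [...g.titles].sort(), // sorted here only to keep the output stable
  }));
}

console.log(groupByLanguage(books));
```

Every group stays in the map until the last input document has been seen, which is why the slide notes that $group processes all data in memory.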


$unwind

• Applied to an array field
• Yield new documents for each array element
  – Array replaced by element value
  – Missing/empty fields → no output
  – Non-array fields → error
• Pipe to $group to aggregate array values


Yielding Multiple Documents from One

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ]}

{ $unwind: "$subjects" }

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island"}

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York"}

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s"}
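The $unwind rules listed earlier (one output per element, nothing for missing or empty fields, an error for non-arrays) fit in a short plain-JavaScript function. This is a sketch with a hypothetical `unwind` helper. Treating `null` the same as a missing field is an assumption made here, not something the slides state.

```javascript
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    // Missing field (and, by assumption, null): the document yields no output.
    if (value === undefined || value === null) continue;
    if (!Array.isArray(value)) {
      throw new Error("cannot unwind non-array field: " + field);
    }
    // An empty array falls through this loop and also yields no output.
    for (const element of value) {
      out.push({ ...doc, [field]: element }); // array replaced by element value
    }
  }
  return out;
}

const gatsby = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: ["Long Island", "New York", "1920s"],
};

const unwound = unwind([gatsby, { title: "Untagged" }], "subjects");
console.log(unwound);
```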


$sort, $limit, $skip

• Sort documents by one or more fields
  – Same order syntax as cursors
  – Waits for earlier pipeline operator to return
  – In-memory unless early and indexed
• Limit and skip follow cursor behavior


{ title: "The Great Gatsby" }

{ title: "Brave New World" }

{ title: "Grapes of Wrath" }

{ title: "Animal Farm" }

{ title: "Lord of the Flies" }

{ title: "Fathers and Sons" }

{ title: "Invisible Man" }

{ title: "Fahrenheit 451" }

Sort All the Documents in the Pipeline

{ $sort: { title: 1 }}

Limit Documents Through the Pipeline

{ $limit: 5 }

Skip Over Documents in the Pipeline

{ $skip: 5 }
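Applied to the eight titles listed above, the three stages behave like an array sort followed by slices. The plain JavaScript below is an in-memory sketch with hypothetical helper names. It uses simple code-point string comparison, which is enough for these titles but is an assumption about ordering in general.

```javascript
const books = [
  { title: "The Great Gatsby" }, { title: "Brave New World" },
  { title: "Grapes of Wrath" }, { title: "Animal Farm" },
  { title: "Lord of the Flies" }, { title: "Fathers and Sons" },
  { title: "Invisible Man" }, { title: "Fahrenheit 451" },
];

// { $sort: { title: 1 } }: needs every input document before it can emit one.
const sortByTitle = (docs) =>
  [...docs].sort((a, b) => (a.title < b.title ? -1 : a.title > b.title ? 1 : 0));
// { $limit: n } and { $skip: n }
const limit = (n) => (docs) => docs.slice(0, n);
const skip = (n) => (docs) => docs.slice(n);

const sorted = sortByTitle(books);
const firstFive = limit(5)(sorted);
const afterFive = skip(5)(sorted);
console.log(firstFive.map((b) => b.title));
console.log(afterFive.map((b) => b.title));
```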





Usage and Limitations


Usage

• collection.aggregate() method
  – Mongo shell
  – Most drivers
• aggregate database command


Collection

db.books.aggregate([ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}}])

{ result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1}


Database Command

db.runCommand({ aggregate: "books", pipeline: [ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ]})

{ result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1}
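Both forms carry the same pipeline; the command form simply names the collection alongside it. As a sketch, the command document can be assembled from a collection name and a pipeline array in plain JavaScript. `buildAggregateCommand` is a hypothetical helper for illustration and is not part of the shell or any driver.

```javascript
// Hypothetical helper: wraps a pipeline in the command document shape
// shown on the "Database Command" slide.
function buildAggregateCommand(collectionName, pipeline) {
  return { aggregate: collectionName, pipeline: pipeline };
}

const pipeline = [
  { $project: { language: 1 } },
  { $group: { _id: "$language", numTitles: { $sum: 1 } } },
];

const command = buildAggregateCommand("books", pipeline);
console.log(JSON.stringify(command, null, 2));
```

In a shell session the resulting object would be what gets passed to db.runCommand().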


Limitations

• Result limited by BSON document size
  – Final command result
  – Intermediate shard results
• Pipeline operator memory limits
• Some BSON types unsupported
  – Binary, Code, deprecated types


Sharding


• Split the pipeline at first $group or $sort
  – Shards execute pipeline up to that point
  – mongos merges results and continues
• Early $match may exclude shards
• CPU and memory implications for mongos


[ { $match: { /* filter by shard key */ }}, { $project: { /* select fields */ }}, { $group: { /* group by some field */ }}, { $sort: { /* sort by some field */ }}, { $project: { /* reshape result */ }}]


Aggregation in a sharded cluster
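The split rule from the Sharding slide can be written down as a small function over the pipeline array. This plain-JavaScript sketch models only that rule (cut at the first $group or $sort). The helper name `splitForSharding` is hypothetical, and the sketch makes no claim about how the server shares the work of the splitting stage itself between shards and mongos.

```javascript
function splitForSharding(pipeline) {
  const isSplitPoint = (stage) => "$group" in stage || "$sort" in stage;
  const index = pipeline.findIndex(isSplitPoint);
  if (index === -1) {
    // No $group or $sort: in this simplified model everything runs on the shards.
    return { shards: pipeline, mongos: [] };
  }
  return {
    shards: pipeline.slice(0, index), // executed on each shard
    mongos: pipeline.slice(index),    // continued after results are merged
  };
}

// Same stage order as the example pipeline above, with empty stage bodies.
const pipeline = [
  { $match: {} }, { $project: {} }, { $group: {} }, { $sort: {} }, { $project: {} },
];

const plan = splitForSharding(pipeline);
console.log(plan.shards.map((s) => Object.keys(s)[0])); // [ '$match', '$project' ]
console.log(plan.mongos.map((s) => Object.keys(s)[0])); // [ '$group', '$sort', '$project' ]
```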


Expressions


• Return computed values
• Used with $project and $group
• Reference fields using $ (e.g. "$x")
• Expressions may be nested
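Field references and nesting are easiest to see in a tiny recursive evaluator. The plain JavaScript below is a sketch written for illustration. `evaluate` is a hypothetical function that handles only top-level field references and a handful of arithmetic operators, and it does not reflect how the server evaluates expressions.

```javascript
function evaluate(expr, doc) {
  // "$x" is a reference to field x of the current document.
  if (typeof expr === "string" && expr.startsWith("$")) {
    return doc[expr.slice(1)];
  }
  // Numbers, plain strings, booleans and null are literals.
  if (expr === null || typeof expr !== "object") {
    return expr;
  }
  // { $op: [args...] }: evaluate the arguments first, which allows nesting.
  const [[op, args]] = Object.entries(expr);
  const values = args.map((arg) => evaluate(arg, doc));
  switch (op) {
    case "$add": return values.reduce((a, b) => a + b, 0);
    case "$multiply": return values.reduce((a, b) => a * b, 1);
    case "$subtract": return values[0] - values[1];
    case "$divide": return values[0] / values[1];
    case "$mod": return values[0] % values[1];
    default: throw new Error("unsupported operator: " + op);
  }
}

const book = { pages: 218, chapters: 9 };
console.log(evaluate({ $divide: ["$pages", "$chapters"] }, book));
console.log(evaluate({ $multiply: [{ $add: [1, 2, 3] }, 2] }, book)); // 12
```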


Boolean Operators

• Input array of one or more values
  – $and, $or
  – Short-circuit logic
• Invert values with $not
• Evaluation of non-boolean types
  – null, undefined, zero ▶ false
  – Non-zero, strings, dates, objects ▶ true

{ $and: [true, false] } ▶ false
{ $or: ["foo", 0] } ▶ true
{ $not: null } ▶ true
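The coercion rules above differ from JavaScript's own truthiness (an empty string is falsy in JavaScript, whereas the slide lists strings as true), so a direct sketch is useful. The helpers `toBool`, `and`, `or` and `not` are hypothetical names and follow only the rules stated on the slide.

```javascript
// Coerce a value to a boolean using the slide's rules.
function toBool(value) {
  if (typeof value === "boolean") return value;
  // null, undefined and zero are false; everything else is true.
  return !(value === null || value === undefined || value === 0);
}

// every() and some() stop at the first deciding element, which gives
// the short-circuit behaviour mentioned on the slide.
const and = (values) => values.every(toBool);
const or = (values) => values.some(toBool);
const not = (value) => !toBool(value);

console.log(and([true, false])); // false
console.log(or(["foo", 0]));     // true
console.log(not(null));          // true
```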


Comparison Operators

• Compare numbers, strings, and dates
• Input array with two operands
  – $cmp, $eq, $ne
  – $gt, $gte, $lt, $lte

{ $cmp: [3, 4] } ▶ -1
{ $eq: ["foo", "bar"] } ▶ false
{ $ne: ["foo", "bar"] } ▶ true
{ $gt: [9, 7] } ▶ true


Arithmetic Operators

• Input array of one or more numbers
  – $add, $multiply
• Input array of two operands
  – $subtract, $divide, $mod

{ $add: [1, 2, 3] } ▶ 6
{ $multiply: [2, 2, 2] } ▶ 8
{ $subtract: [10, 7] } ▶ 3
{ $divide: [10, 2] } ▶ 5
{ $mod: [8, 3] } ▶ 2


String Operators

• $strcasecmp case-insensitive comparison
  – $cmp is case-sensitive
• $toLower and $toUpper case change
• $substr for sub-string extraction
• Not encoding aware (assumes ASCII alphabet)

{ $strcasecmp: ["foo", "bar"] } ▶ 1
{ $substr: ["foo", 1, 2] } ▶ "oo"
{ $toUpper: "foo" } ▶ "FOO"
{ $toLower: "BAR" } ▶ "bar"
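For ASCII input, the string operators map closely onto ordinary JavaScript string methods. The sketch below uses hypothetical helper names and reproduces the four examples above. Note that JavaScript's case conversion is Unicode-aware, so results can differ from the ASCII-only behaviour described on the slide once non-ASCII text is involved.

```javascript
// Case-insensitive three-way comparison: -1, 0 or 1.
function strcasecmp(a, b) {
  const x = a.toUpperCase();
  const y = b.toUpperCase();
  return x < y ? -1 : x > y ? 1 : 0;
}

// Substring of `length` characters starting at zero-based index `start`.
const substr = (s, start, length) => s.slice(start, start + length);
const toUpper = (s) => s.toUpperCase();
const toLower = (s) => s.toLowerCase();

console.log(strcasecmp("foo", "bar")); // 1
console.log(substr("foo", 1, 2));      // "oo"
console.log(toUpper("foo"));           // "FOO"
console.log(toLower("BAR"));           // "bar"
```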


Date Operators

• Extract values from date objects
  – $dayOfYear, $dayOfMonth, $dayOfWeek
  – $year, $month, $week
  – $hour, $minute, $second

{ $year: ISODate("2012-10-24T00:00:00.000Z") } ▶ 2012
{ $month: ISODate("2012-10-24T00:00:00.000Z") } ▶ 10
{ $dayOfMonth: ISODate("2012-10-24T00:00:00.000Z") } ▶ 24
{ $dayOfWeek: ISODate("2012-10-24T00:00:00.000Z") } ▶ 4
{ $dayOfYear: ISODate("2012-10-24T00:00:00.000Z") } ▶ 298
{ $week: ISODate("2012-10-24T00:00:00.000Z") } ▶ 43
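The date parts can be cross-checked with JavaScript's UTC date methods. The sketch below uses a hypothetical `dateParts` helper and assumes the conventions documented for these operators: `$dayOfWeek` runs from 1 (Sunday) to 7, `$dayOfYear` starts at 1, and `$week` counts Sunday-based weeks in the manner of strftime's `%U`.

```javascript
function dateParts(date) {
  const msPerDay = 24 * 60 * 60 * 1000;
  const startOfYear = Date.UTC(date.getUTCFullYear(), 0, 1);
  // 1-based day of the year, computed in UTC.
  const dayOfYear = Math.floor((date.getTime() - startOfYear) / msPerDay) + 1;
  const weekday = date.getUTCDay(); // 0 = Sunday ... 6 = Saturday
  return {
    year: date.getUTCFullYear(),
    month: date.getUTCMonth() + 1, // JavaScript months are 0-based
    dayOfMonth: date.getUTCDate(),
    dayOfWeek: weekday + 1,
    dayOfYear: dayOfYear,
    // Week 1 starts on the first Sunday; earlier days fall in week 0.
    week: Math.floor((dayOfYear - 1 + 7 - weekday) / 7),
  };
}

const parts = dateParts(new Date("2012-10-24T00:00:00.000Z"));
console.log(parts);
```

2012 is a leap year, so the day-of-year count includes February 29.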


Conditional Operators

• $cond ternary operator
• $ifNull

{ $cond: [{ $eq: [1, 2] }, "same", "different"] } ▶ "different"

{ $ifNull: ["foo", "bar"] } ▶ "foo"
{ $ifNull: [null, "bar"] } ▶ "bar"
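Both conditional operators have direct plain-JavaScript counterparts, sketched below with hypothetical helper names. Treating `undefined` like `null` in `ifNull` is an assumption made here to stand in for a missing field.

```javascript
// $cond: [test, valueIfTrue, valueIfFalse]
const cond = (test, thenValue, elseValue) => (test ? thenValue : elseValue);

// $ifNull: [value, fallback]
const ifNull = (value, fallback) =>
  value === null || value === undefined ? fallback : value;

console.log(cond(1 === 2, "same", "different")); // "different"
console.log(ifNull("foo", "bar"));               // "foo"
console.log(ifNull(null, "bar"));                // "bar"
```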


Looking Ahead


Framework Use Cases

• Basic aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing time series data


Extending the Framework

• Adding new pipeline operators, expressions
• $out and $tee for output control
  – https://jira.mongodb.org/browse/SERVER-3253


Future Enhancements

• Automatically move $match earlier if possible
• Pipeline explain facility
• Memory usage improvements
  – Grouping input sorted by _id
  – Sorting with limited output


Thank You

