MongoDB, Hadoop and humongous data - MongoSV 2012

MongoDB, Hadoop & humongous data

description

Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig, and Streaming, you will learn how to do analytics and ETL on large datasets, with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can rejoice as well: there is a new way to write native Mongo MapReduce using the Hadoop Streaming interfaces.

Transcript of MongoDB, Hadoop and humongous data - MongoSV 2012

Page 1: MongoDB, Hadoop and humongous data - MongoSV 2012

MongoDB, Hadoop & humongous data

Page 2: MongoDB, Hadoop and humongous data - MongoSV 2012

Talking about

What is Humongous Data

Humongous Data & You

MongoDB & Data processing

Future of Humongous Data

Page 4: MongoDB, Hadoop and humongous data - MongoSV 2012

What is humongous data?

Page 5: MongoDB, Hadoop and humongous data - MongoSV 2012

2000: Google Inc. today announced it has released the largest search engine on the Internet.

Google’s new index, comprising more than 1 billion URLs

Page 6: MongoDB, Hadoop and humongous data - MongoSV 2012

Our indexing system for processing links indicates that we now count 1 trillion unique URLs

(and the number of individual web pages out there is growing by several billion pages per day).

2008

Page 7: MongoDB, Hadoop and humongous data - MongoSV 2012

An unprecedented amount of data is being created and is accessible

Page 8: MongoDB, Hadoop and humongous data - MongoSV 2012

Data Growth (millions of URLs)

Year: 2000 2001 2002 2003 2004 2005 2006 2007 2008
URLs:    1    4   10   24   55  120  250  500 1,000

Page 9: MongoDB, Hadoop and humongous data - MongoSV 2012

Truly Exponential Growth

Is hard for people to grasp

As a BBC reporter put it recently: "Your current PC is more powerful than the computer they had on board the first flight to the moon."

Page 10: MongoDB, Hadoop and humongous data - MongoSV 2012

Moore's Law applies to more than just CPUs

Boiled down, it says that things double at regular intervals.

It's exponential growth... and it applies to big data.

Page 11: MongoDB, Hadoop and humongous data - MongoSV 2012

How BIG is it?

Page 12: MongoDB, Hadoop and humongous data - MongoSV 2012

How BIG is it?

2008

Page 13: MongoDB, Hadoop and humongous data - MongoSV 2012

How BIG is it?

2008

2007

2006 2005

2004 2003

2002 2001

Page 14: MongoDB, Hadoop and humongous data - MongoSV 2012

Why all this talk about BIG Data now?

Page 15: MongoDB, Hadoop and humongous data - MongoSV 2012

In the past few years, open source software has emerged that enables 'us' to handle BIG Data.

Page 16: MongoDB, Hadoop and humongous data - MongoSV 2012

The Big Data Story

Page 17: MongoDB, Hadoop and humongous data - MongoSV 2012

Is actually two stories

Page 18: MongoDB, Hadoop and humongous data - MongoSV 2012

Doers & Tellers talking about different things

http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september

Page 19: MongoDB, Hadoop and humongous data - MongoSV 2012

Tellers

Page 20: MongoDB, Hadoop and humongous data - MongoSV 2012

Doers

Page 21: MongoDB, Hadoop and humongous data - MongoSV 2012

Doers talk a lot more about actual solutions

Page 22: MongoDB, Hadoop and humongous data - MongoSV 2012

They know it's a two-sided story

Processing

Storage

Page 23: MongoDB, Hadoop and humongous data - MongoSV 2012

Takeaways

MongoDB and Hadoop

MongoDB for storage & operations

Hadoop for processing & analytics

Page 24: MongoDB, Hadoop and humongous data - MongoSV 2012

MongoDB & Data Processing

Page 25: MongoDB, Hadoop and humongous data - MongoSV 2012

Applications have complex needs

MongoDB ideal operational database

MongoDB ideal for BIG data

Not a data processing engine, but provides processing functionality

Page 26: MongoDB, Hadoop and humongous data - MongoSV 2012

Many options for Processing Data

• Process in MongoDB using Map Reduce

• Process in MongoDB using Aggregation Framework

• Process outside MongoDB (using Hadoop)

Page 27: MongoDB, Hadoop and humongous data - MongoSV 2012

MongoDB Map Reduce

Data → Map() / emit(k,v) → Sort(k) → Group(k) → Reduce(k,values) → (k,v) → Finalize(k,v) → (k,v) → MongoDB

map iterates on documents; the document is $this

1 at a time per shard

Input matches output

Can run multiple times
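To make the pipeline concrete, here is a minimal sketch of driving this map/reduce from Python, counting hashtags. The test.live collection and tweet fields mirror the demo later in this talk; the Connection and map_reduce calls follow the pymongo 2.x API of the era, and the output collection name is made up.

#!/usr/bin/env python
# Sketch: MongoDB's built-in map/reduce driven via pymongo 2.x.
# The map and reduce bodies are JavaScript, since the server runs them.
from pymongo import Connection
from bson.code import Code

db = Connection()['test']

mapper = Code("""
function () {
    // 'this' is the current document, as noted above
    if (!this.entities || !this.entities.hashtags) return;
    this.entities.hashtags.forEach(function (tag) {
        emit(tag.text, 1);   // emit(k, v)
    });
}""")

reducer = Code("""
function (key, values) {
    // output shape matches input shape, so reduce can run multiple times
    return Array.sum(values);
}""")

out = db.live.map_reduce(mapper, reducer, out="hashtag_counts")
print out.find_one()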

Page 28: MongoDB, Hadoop and humongous data - MongoSV 2012

MongoDB Map Reduce

MongoDB map reduce is quite capable... but it has limits:

- JavaScript is not the best language for processing map reduce

- JavaScript is limited in external data processing libraries

- Adds load to the data store

Page 29: MongoDB, Hadoop and humongous data - MongoSV 2012

MongoDB Aggregation

Most uses of MongoDB Map Reduce were for aggregation

Aggregation Framework optimized for aggregate queries

Realtime aggregation, similar to SQL GROUP BY
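A quick sketch of what "similar to SQL GROUP BY" means in practice, again via pymongo 2.x (whose aggregate() returned a plain dict with a 'result' key). Grouping tweets by user.screen_name is an illustrative choice, not from the slides.

#!/usr/bin/env python
# Sketch: the Aggregation Framework as a realtime GROUP BY.
# Roughly: SELECT screen_name, COUNT(*) AS count
#          FROM live GROUP BY screen_name ORDER BY count DESC LIMIT 10
from pymongo import Connection

db = Connection()['test']
out = db.live.aggregate([
    {'$group': {'_id': '$user.screen_name', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
    {'$limit': 10},
])
for doc in out['result']:   # pymongo 2.x wrapped results in a dict
    print doc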

Page 30: MongoDB, Hadoop and humongous data - MongoSV 2012

MongoDB & Hadoop

MongoDB (single server or sharded cluster)

→ InputFormat creates a list of Input Splits, the same as Mongo's shard chunks (64MB)

→ RecordReader reads each split; runs on the same thread as map

→ Map(k1, v1, ctx) / ctx.write(k2, v2): many map operations, 1 at a time per input split

→ per split, map output is sorted (Sort(k2)) and combined (Combiner(k2, values2) → (k2, v3))

→ Partitioner(k2) assigns keys to the reducer threads

→ Reduce(k2, values3): runs once per key → (kf, vf)

→ OutputFormat writes the results back to MongoDB

Page 31: MongoDB, Hadoop and humongous data - MongoSV 2012

DEMO TIME

Page 32: MongoDB, Hadoop and humongous data - MongoSV 2012

DEMO

Install Hadoop MongoDB Plugin

Import tweets from twitter

Write mapper in Python using Hadoop streaming

Write reducer in Python using Hadoop streaming

Call myself a data scientist

Page 33: MongoDB, Hadoop and humongous data - MongoSV 2012

Installing Mongo-hadoop

hadoop_version='0.23'
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"

git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh

https://gist.github.com/1887726

Page 34: MongoDB, Hadoop and humongous data - MongoSV 2012

Grokking Twitter

curl \
  https://stream.twitter.com/1/statuses/sample.json \
  -u <login>:<password> \
  | mongoimport -d test -c live

... let it run for about 2 hours
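Before firing off a Hadoop job, it is worth a quick sanity check that the import produced usable documents; a minimal sketch against the test.live collection from the mongoimport above.

#!/usr/bin/env python
# Sketch: verify the Twitter import before running the Hadoop job.
from pymongo import Connection

db = Connection()['test']
print db.live.count()   # how many tweets landed
# one tweet that actually carries hashtags (same filter the
# aggregation demo uses later)
print db.live.find_one({'entities.hashtags.text': {'$exists': True}})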

Page 35: MongoDB, Hadoop and humongous data - MongoSV 2012

DEMO 1

Page 36: MongoDB, Hadoop and humongous data - MongoSV 2012

Map Hashtags in Python

#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    for doc in documents:
        for hashtag in doc['entities']['hashtags']:
            yield {'_id': hashtag['text'], 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."

Page 37: MongoDB, Hadoop and humongous data - MongoSV 2012

Reduce Hashtags in Python

#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key.encode('utf8'), 'count': _count}

BSONReducer(reducer)
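The streaming contract is easy to smoke-test locally before involving Hadoop: feed the mapper one fabricated tweet-shaped document and hand its output to the reducer by hand. A sketch, with the two functions restated inline (slightly simplified) because the real scripts wire themselves to stdin at import time.

#!/usr/bin/env python
# Sketch: local smoke test of the map/reduce logic, no Hadoop required.

def mapper(documents):
    for doc in documents:
        for hashtag in doc['entities']['hashtags']:
            yield {'_id': hashtag['text'], 'count': 1}

def reducer(key, values):
    return {'_id': key, 'count': sum(v['count'] for v in values)}

fake_doc = {'entities': {'hashtags': [{'text': 'mongodb'},
                                      {'text': 'mongodb'}]}}
pairs = list(mapper([fake_doc]))
print pairs                       # two {'_id': 'mongodb', 'count': 1} pairs
print reducer('mongodb', pairs)   # {'_id': 'mongodb', 'count': 2}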

Page 38: MongoDB, Hadoop and humongous data - MongoSV 2012

All together

hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
  -mapper examples/twitter/twit_hashtag_map.py \
  -reducer examples/twitter/twit_hashtag_reduce.py \
  -inputURI mongodb://127.0.0.1/test.live \
  -outputURI mongodb://127.0.0.1/test.twit_reduction \
  -file examples/twitter/twit_hashtag_map.py \
  -file examples/twitter/twit_hashtag_reduce.py

Page 39: MongoDB, Hadoop and humongous data - MongoSV 2012

Popular Hash Tags

db.twit_hashtags.find().sort({'count' : -1})

{ "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
{ "_id" : "teamfollowback", "count" : 200 }
{ "_id" : "RT", "count" : 150 }
{ "_id" : "Arsenal", "count" : 148 }
{ "_id" : "milars", "count" : 145 }
{ "_id" : "sanremo", "count" : 145 }
{ "_id" : "LoseMyNumberIf", "count" : 139 }
{ "_id" : "RelationshipsShould", "count" : 137 }
{ "_id" : "Bahrain", "count" : 129 }
{ "_id" : "bahrain", "count" : 125 }
{ "_id" : "oomf", "count" : 117 }
{ "_id" : "BabyKillerOcalan", "count" : 106 }
{ "_id" : "TeamFollowBack", "count" : 105 }
{ "_id" : "WhyDoPeopleThink", "count" : 102 }
{ "_id" : "np", "count" : 100 }

Page 40: MongoDB, Hadoop and humongous data - MongoSV 2012

DEMO 2

Page 41: MongoDB, Hadoop and humongous data - MongoSV 2012

Aggregation in Mongo 2.1

db.live.aggregate(
  { $unwind : "$entities.hashtags" },
  { $match : { "entities.hashtags.text" : { $exists : true } } },
  { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
  { $sort : { count : -1 } },
  { $limit : 10 }
)

Page 42: MongoDB, Hadoop and humongous data - MongoSV 2012

Popular Hash Tags

db.twit_hashtags.aggregate(a)

{
  "result" : [
    { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
    { "_id" : "teamfollowback", "count" : 200 },
    { "_id" : "RT", "count" : 150 },
    { "_id" : "Arsenal", "count" : 148 },
    { "_id" : "milars", "count" : 145 },
    { "_id" : "sanremo", "count" : 145 },
    { "_id" : "LoseMyNumberIf", "count" : 139 },
    { "_id" : "RelationshipsShould", "count" : 137 },
    { "_id" : "Bahrain", "count" : 129 },
    { "_id" : "bahrain", "count" : 125 }
  ],
  "ok" : 1
}

Page 43: MongoDB, Hadoop and humongous data - MongoSV 2012

Future of Humongous Data

Page 44: MongoDB, Hadoop and humongous data - MongoSV 2012

What is BIG?

BIG today is normal tomorrow

Page 45: MongoDB, Hadoop and humongous data - MongoSV 2012

Data Growth (millions of URLs)

Year: 2000 2001 2002 2003 2004 2005 2006 2007 2008  2009  2010  2011
URLs:    1    4   10   24   55  120  250  500 1,000 2,150 4,400 9,000


Page 47: MongoDB, Hadoop and humongous data - MongoSV 2012

2012: Generating over 250 million tweets per day

Page 48: MongoDB, Hadoop and humongous data - MongoSV 2012

MongoDB enables us to scale with the redefinition of BIG.

New processing tools like Hadoop & Storm are enabling us to process the new BIG.

Page 49: MongoDB, Hadoop and humongous data - MongoSV 2012

Hadoop is our first step

Page 50: MongoDB, Hadoop and humongous data - MongoSV 2012

MongoDB is committed to working with the best data tools, including Hadoop, Storm, Disco, Spark & more.

Page 51: MongoDB, Hadoop and humongous data - MongoSV 2012

Questions?

http://spf13.com
http://github.com/spf13
@spf13

download at github.com/mongodb/mongo-hadoop
