Introduction to Hadoop and MapReduce
Colin Su, Tagtoo
Advertisement System Architecture (now)
Advertisement System Architecture (future)
• Grid
• Ad Server
• Data Highway
• Stream Computing
Grid
• Core:
• Data mining
• Machine Learning
• Collects data from users and logs, and calculates the strategy
• Sorts our data into a proper form, so that we can use it anytime
Data -> Information
Ad Server
• Ranking
• According to the “information” in Grid, decides which ad should be shown
• show proper ads to website visitors
Data Highway
• Transfer your data to the proper place
Stream Computing
• Core:
• logging
• feedback
• anti-cheating
• pricing
• post-processes everything output by the Ad Server, and feeds useful information back to Grid
• serves as the entrance of the advertisement system
Hadoop
• an open-source software framework for distributed storage and processing of large data sets
• derives from Google’s MapReduce and Google File System (GFS) papers
• written in Java
• can be divided into 2 components:
• MapReduce
• HDFS (Hadoop distributed file system)
• a yellow elephant
Why Hadoop?
• moving computation is much cheaper and easier than moving data
• “Big Data”: the amount of data has become too large, so we need an effective way to manage it
• so does computation
• high fault-tolerance
• largely developed at Yahoo! in its early years
MapReduce
• a programming model for processing “large data sets” with a “parallel, distributed” algorithm on a cluster
• different from map/reduce, the functional programming concept, but both share the same idea: “divide and conquer”
• proposed by Google
Functional “map/reduce”
• map()/reduce() in Python
• map(function(elem), list) -> list
• reduce(function(elem1, elem2), list) -> single result
• e.g.
• map(lambda x: x*2, [1,2,3,4]) => [2,4,6,8]
• reduce(lambda x,y: x+y, [1,2,3,4]) => 10
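The two calls above behave as shown in Python 2. As a minimal sketch for Python 3, where map() returns a lazy iterator and reduce() has moved into functools, the same example could be written as follows (the variable names are only illustrative):

```python
from functools import reduce  # in Python 3, reduce() lives in functools

# map() applies the function to every element; list() materializes the iterator.
doubled = list(map(lambda x: x * 2, [1, 2, 3, 4]))

# reduce() folds the list into a single result, two elements at a time.
total = reduce(lambda x, y: x + y, [1, 2, 3, 4])

print(doubled)  # [2, 4, 6, 8]
print(total)    # 10
```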
Parallel “MapReduce” 5 Steps
• prepare the map() input for mappers
• mappers run the map() code -> generated intermediate pairs
• dispatch intermediate pairs to reducers
• reducers run the reduce() code, aggregate the results
• prepare output from the result of reduce()
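The five steps can be imitated in a single process to see how they fit together. The sketch below is not Hadoop code: run_mapreduce, map_initial and reduce_longest are hypothetical names, the sample words are made up, and the "mappers" and "reducers" are ordinary loops rather than machines in a cluster.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, num_mappers=2):
    # 1. Prepare the map() input: split the records into one chunk per mapper.
    chunks = [records[i::num_mappers] for i in range(num_mappers)]

    # 2. Each mapper runs map_fn and emits intermediate (key, value) pairs.
    intermediate = []
    for chunk in chunks:
        for record in chunk:
            intermediate.extend(map_fn(record))

    # 3. Dispatch: group the pairs by key so each key reaches one reducer.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # 4. Each reducer runs reduce_fn on one key and all of its values.
    reduced = {key: reduce_fn(key, values) for key, values in groups.items()}

    # 5. Prepare the output: here, a list of pairs sorted by key.
    return sorted(reduced.items())


# Hypothetical job: find the longest word for each initial letter.
def map_initial(word):
    return [(word[0].lower(), word)]

def reduce_longest(key, words):
    return max(words, key=len)

result = run_mapreduce(["grid", "hadoop", "google", "hdfs"], map_initial, reduce_longest)
print(result)  # [('g', 'google'), ('h', 'hadoop')]
```

Only map_fn and reduce_fn change from job to job; splitting, dispatching and collecting stay the same, which is the part a framework such as Hadoop takes over.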
Example of “MapReduce” Word Count
map() reduce()
Example of “MapReduce” Word Count
• Original Input
Apple Orange Mongo Orange Grapes Plum ...
Example of “MapReduce” Word Count
• Prepare data for mappers
Apple Orange Mongo
Orange Grapes Plum
...
Example of “MapReduce” Word Count
• map() to useful record
Apple Orange Mongo
(Apple, 1)
(Orange, 1)
(Mongo, 1)
Intermediate key/value pair
Example of “MapReduce” Word Count
• sort and shuffle
Unsorted intermediate pairs from the mappers:
(Apple, 1), (Orange, 1), (Mongo, 1), (Apple, 1), (Orange, 1), (Mongo, 1)
Sorted by key, then shuffled to reducers:
Reducer: (Apple, 1), (Apple, 1)
Reducer: (Orange, 1), (Orange, 1)
Reducer: (Mongo, 1), (Mongo, 1)
Example of “MapReduce” Word Count
• Reduce()
Reducer: (Apple, 1), (Apple, 1) -> (Apple, 2)
Reducer: (Orange, 1), (Orange, 1), (Orange, 1) -> (Orange, 3)
Example of “MapReduce” Word Count
• Generate Output
(Apple, 2)
(Orange, 3)
(Grapes, 1)
(Plum, 5)
WordCount.txt:
Apple 2
Orange 3
Grapes 1
Plum 5
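The word count walk-through can be reproduced in a few lines of plain Python. This is only a sketch under stated assumptions: it uses just the two input lines visible on the slides (the rest of the input is elided there), so its counts differ from the slide's final numbers, and map_words is a hypothetical helper name.

```python
from collections import defaultdict

# Only the two input lines shown on the slides; the remaining input is elided there.
lines = ["Apple Orange Mongo", "Orange Grapes Plum"]

def map_words(line):
    # map(): emit one intermediate (word, 1) pair per word.
    return [(word, 1) for word in line.split()]

intermediate = [pair for line in lines for pair in map_words(line)]

# Sort and shuffle: sort the pairs by key and group equal keys together.
grouped = defaultdict(list)
for word, one in sorted(intermediate):
    grouped[word].append(one)

# reduce(): sum the 1s collected for each word.
counts = {word: sum(ones) for word, ones in grouped.items()}

# Generate output in the "word count" layout of WordCount.txt.
output = "\n".join(f"{word} {count}" for word, count in counts.items())
print(output)
```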
Hadoop Infrastructure
• Pig: high-level programming language for writing MapReduce jobs
• Thrift: cross-language communication, just like Google’s Protocol Buffers
• ZooKeeper: cluster management
[Diagram labels: ZooKeeper, Hadoop, Pig, MapReduce, HDFS, Thrift, Other Services]