Introduction to MapReduce & hadoop

Introduction to Hadoop and MapReduce Colin Su, Tagtoo


Tagtoo internal seminar


Page 1: Introduction to MapReduce & hadoop

Introduction to Hadoop and MapReduce

Colin Su, Tagtoo

Page 2: Introduction to MapReduce & hadoop

Advertisement System Architecture (now)

Page 3: Introduction to MapReduce & hadoop

Advertisement System Architecture (future)

• Grid

• Ad Server

• Data Highway

• Stream Computing

Page 4: Introduction to MapReduce & hadoop

Grid

• Core:

• Data mining

• Machine Learning

• Collect data from users and logs, then compute the strategy

• Sort our data into a proper form, so that we can use it anytime

Data -> Information

Page 5: Introduction to MapReduce & hadoop

Ad Server

• Ranking

• According to the “information” in Grid, decide which ad should be shown

• show proper ads to website visitors

Page 6: Introduction to MapReduce & hadoop

Data Highway

• Transfer your data to the proper place

Page 7: Introduction to MapReduce & hadoop

Stream Computing

• Core:

• logging

• feedback

• anti-cheating

• pricing

• post-process everything emitted by the Ad Server, and feed useful information back to Grid

• serve as the entrance of the advertisement system

Page 8: Introduction to MapReduce & hadoop

Hadoop

• an open-source software framework for distributed storage and processing of large data sets

• derived from Google’s MapReduce and Google File System (GFS) papers

• written in Java

• can be divided into 2 components:

• MapReduce

• HDFS (Hadoop Distributed File System)

• a yellow elephant

Page 9: Introduction to MapReduce & hadoop

Why Hadoop?

• moving computation is much cheaper and easier than moving data

• “Big Data”: the amount of data has become too large, so we need an effective way to manage it

• so has the amount of computation

• high fault-tolerance

• created by Doug Cutting and Mike Cafarella, with much of its early development done at Yahoo!

Page 10: Introduction to MapReduce & hadoop

MapReduce

• a programming model for processing “large data sets” with a “parallel, distributed” algorithm on a cluster

• different from map/reduce, the functional programming concept, but both share the same idea: “divide and conquer”

• proposed by Google

Page 11: Introduction to MapReduce & hadoop

Functional “map/reduce”

• map()/reduce() in Python

• map(function(elem), list) -> list

• reduce(function(elem1, elem2), list) -> single result

• e.g.

• map(lambda x: x*2, [1,2,3,4]) => [2,4,6,8]

• reduce(lambda x,y: x+y, [1,2,3,4]) => 10
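The two calls above can be run as-is in Python 2. A minimal sketch for Python 3, where `map()` returns a lazy iterator and `reduce()` has moved to `functools`:

```python
from functools import reduce

# map() applies the function to every element; list() materializes the lazy iterator
doubled = list(map(lambda x: x * 2, [1, 2, 3, 4]))

# reduce() folds the list pairwise from the left: ((1 + 2) + 3) + 4
total = reduce(lambda x, y: x + y, [1, 2, 3, 4])

print(doubled)  # [2, 4, 6, 8]
print(total)    # 10
```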

Page 12: Introduction to MapReduce & hadoop

Parallel “MapReduce” 5 Steps

• prepare the map() input for mappers

• mappers run the map() code, generating intermediate key/value pairs

• dispatch intermediate pairs to reducers

• reducers run the reduce() code, aggregate the results

• prepare output from the result of reduce()
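The five steps above can be sketched as a single-process Python function. This is only an illustration of the data flow, not how Hadoop is implemented: `run_mapreduce`, `map_fn`, `reduce_fn` and `num_mappers` are hypothetical names, and everything runs sequentially in memory rather than on a cluster.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, num_mappers=2):
    # Step 1: prepare the map() input by splitting records into one chunk per mapper
    chunks = [records[i::num_mappers] for i in range(num_mappers)]

    # Step 2: each mapper runs map_fn and emits intermediate (key, value) pairs
    intermediate = []
    for chunk in chunks:
        for record in chunk:
            intermediate.extend(map_fn(record))

    # Step 3: dispatch pairs so that all values for one key reach the same reducer
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Step 4: each reducer runs reduce_fn on one key and its list of values
    reduced = [reduce_fn(key, values) for key, values in grouped.items()]

    # Step 5: prepare the output, here simply sorted by key
    return sorted(reduced)

# Usage: count words in two input lines
counts = run_mapreduce(
    ["Apple Orange Mongo", "Orange Grapes Plum"],
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: (word, sum(ones)),
)
print(counts)
```

On a real cluster, steps 2 and 4 run in parallel on different machines, which is where the speed-up comes from.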

Page 13: Introduction to MapReduce & hadoop

Example of “MapReduce” Word Count

map() reduce()

Page 14: Introduction to MapReduce & hadoop

Example of “MapReduce” Word Count

• Original Input

Apple Orange Mongo Orange Grapes Plum ...

Page 15: Introduction to MapReduce & hadoop

Example of “MapReduce” Word Count

• Prepare data for mappers

Apple Orange Mongo

Orange Grapes Plum

...

Page 16: Introduction to MapReduce & hadoop

Example of “MapReduce” Word Count

• map() each input line to useful records

Apple Orange Mongo

(Apple, 1)

(Orange, 1)

(Mongo, 1)

Intermediate key/value pair
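A mapper for this step might look like the sketch below. The function name `map_words` is an assumption made for illustration; it only splits on whitespace and does no normalization.

```python
def map_words(line):
    # Emit one (word, 1) pair per word; counting is left to the reducers
    for word in line.split():
        yield (word, 1)

pairs = list(map_words("Apple Orange Mongo"))
print(pairs)  # [('Apple', 1), ('Orange', 1), ('Mongo', 1)]
```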

Page 17: Introduction to MapReduce & hadoop

Example of “MapReduce” Word Count

• sort and shuffle

Unsorted: (Apple, 1) (Orange, 1) (Mongo, 1) (Apple, 1) (Orange, 1) (Mongo, 1)

Sorted, then shuffled to Reducers:

Reducer: (Apple, 1) (Apple, 1)

Reducer: (Orange, 1) (Orange, 1)

Reducer: (Mongo, 1) (Mongo, 1)
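The sort-and-shuffle step can be approximated in plain Python with `sorted()` and `itertools.groupby()`. This is a sketch under simplifying assumptions: `sort_and_shuffle` is a hypothetical helper, and every key gets its own group, whereas in Hadoop a partitioner decides which reducer receives each key.

```python
from itertools import groupby
from operator import itemgetter

def sort_and_shuffle(pairs):
    # Sort by key so that equal keys become adjacent
    ordered = sorted(pairs, key=itemgetter(0))
    # Group adjacent pairs by key; each group is the input of one reducer
    return {key: [value for _, value in group]
            for key, group in groupby(ordered, key=itemgetter(0))}

unsorted_pairs = [("Apple", 1), ("Orange", 1), ("Mongo", 1),
                  ("Apple", 1), ("Orange", 1), ("Mongo", 1)]
shuffled = sort_and_shuffle(unsorted_pairs)
print(shuffled)
# {'Apple': [1, 1], 'Mongo': [1, 1], 'Orange': [1, 1]}
```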

Page 18: Introduction to MapReduce & hadoop

Example of “MapReduce” Word Count

• reduce()

Reducer: (Apple, 1) (Apple, 1) -> (Apple, 2)

Reducer: (Orange, 1) (Orange, 1) (Orange, 1) -> (Orange, 3)
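A reducer for word count only has to add up the values it receives for one key. A minimal sketch, with `reduce_counts` as an illustrative name:

```python
def reduce_counts(word, counts):
    # Sum all the 1s emitted by the mappers for this word
    return (word, sum(counts))

apple = reduce_counts("Apple", [1, 1])
orange = reduce_counts("Orange", [1, 1, 1])
print(apple)   # ('Apple', 2)
print(orange)  # ('Orange', 3)
```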

Page 19: Introduction to MapReduce & hadoop

Example of “MapReduce” Word Count

• Generate Output

(Apple, 2)

(Orange, 3)

(Grapes, 1)

(Plum, 5)

WordCount.txt:

Apple 2

Orange 3

Grapes 1

Plum 5
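Turning the reducer results into the lines of the output file could be sketched as follows. The helper `format_output` is hypothetical, and the "word, space, count" layout is assumed from the slide.

```python
def format_output(results):
    # One "word count" line per reducer result
    return "\n".join(f"{word} {count}" for word, count in results) + "\n"

results = [("Apple", 2), ("Orange", 3), ("Grapes", 1), ("Plum", 5)]
text = format_output(results)
print(text, end="")
```

Writing `text` to a file named WordCount.txt would complete the step.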

Page 20: Introduction to MapReduce & hadoop

Hadoop Infrastructure

• Pig: a high-level programming language for writing MapReduce jobs

• Thrift: cross-language communication, similar to Google’s Protocol Buffers

• ZooKeeper: cluster coordination and management

[Diagram labels: ZooKeeper; Hadoop (several nodes); Pig; MapReduce; HDFS; Thrift; Other Services]