Introduction to MapReduce & Hadoop

Introduction to Hadoop and MapReduce

Colin Su, Tagtoo

Advertisement System Architecture (now)

Advertisement System Architecture (future)

• Grid

• Ad Server

• Data Highway

• Stream Computing

Grid

• Core:

• Data mining

• Machine Learning

• Collect data from users and logs, and work out the strategy from it

• Organize our data into a proper form, so that we can use it at any time

Data -> Information

Ad Server

• Ranking

• Based on the “information” in Grid, decide which ad should be shown

• show suitable ads to website visitors

Data Highway

• Transfer your data to the proper place

Stream Computing

• Core:

• logging

• feedback

• anti-cheating

• pricing

• post-process everything emitted by Ad Server, and feed useful information back to Grid

• serve as the entry point of the advertisement system

Hadoop

• an open-source software framework for distributed storage and processing of large data sets

• derived from Google’s MapReduce and Google File System (GFS) papers

• written in Java

• can be divided into 2 components:

• MapReduce

• HDFS (Hadoop distributed file system)

• a yellow elephant

Why Hadoop?

• moving computation is much cheaper and easier than moving data

• “Big Data”: the amount of data has become too large, so we need an effective way to manage it

• the same goes for computation

• high fault-tolerance

• created by Doug Cutting and Mike Cafarella; much of its early development happened at Yahoo!

MapReduce

• a programming model for processing “large data sets” with a “parallel, distributed” algorithm on a cluster

• different from map/reduce, the functional programming concept, but both share the same idea: “divide and conquer”

• proposed by Google

Functional “map/reduce”

• map()/reduce() in Python

• map(function(elem), list) -> list

• reduce(function(elem1, elem2), list) -> single result

• e.g.

• map(lambda x: x*2, [1,2,3,4]) => [2,4,6,8]

• reduce(lambda x,y: x+y, [1,2,3,4]) => 10
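The two calls above can be run as a short script. This is a minimal sketch for Python 3, where reduce() has moved into the functools module and map() returns a lazy iterator instead of a list; the variable names (numbers, doubled, total) are only illustrative.

```python
from functools import reduce  # in Python 3, reduce() lives in functools

numbers = [1, 2, 3, 4]

# map() applies the function to every element; in Python 3 it returns a lazy
# iterator, so list() is needed to materialize the result.
doubled = list(map(lambda x: x * 2, numbers))

# reduce() folds the list into a single value, combining two elements at a time.
total = reduce(lambda x, y: x + y, numbers)

print(doubled)  # [2, 4, 6, 8]
print(total)    # 10
```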

Parallel “MapReduce” 5 Steps

• prepare the map() input for mappers

• mappers run the map() code -> generate intermediate key/value pairs

• dispatch intermediate pairs to reducers

• reducers run the reduce() code, aggregate the results

• prepare output from the result of reduce()
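The five steps above can be sketched in plain, single-process Python. This is only an illustration of the data flow, not how Hadoop implements it: the names run_mapreduce, parity_mapper and sum_reducer are made up for this sketch, and a real cluster would run the mappers and reducers in parallel on different machines.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_mappers=2):
    # 1. Prepare the map() input: split the records into one chunk per mapper.
    chunks = [records[i::num_mappers] for i in range(num_mappers)]

    # 2. Mappers run map(): every record yields intermediate (key, value) pairs.
    intermediate = []
    for chunk in chunks:  # each iteration stands in for one mapper
        for record in chunk:
            intermediate.extend(map_fn(record))

    # 3. Dispatch (shuffle): group the pairs so each key goes to one reducer.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # 4. Reducers run reduce(): aggregate all values that share a key.
    reduced = {key: reduce_fn(key, values) for key, values in groups.items()}

    # 5. Prepare the output: here, simply a list of pairs sorted by key.
    return sorted(reduced.items())


def parity_mapper(x):
    # Hypothetical mapper: tag each number as "even" or "odd".
    return [("even" if x % 2 == 0 else "odd", x)]


def sum_reducer(key, values):
    # Hypothetical reducer: add up all values that arrived for one key.
    return sum(values)


result = run_mapreduce([1, 2, 3, 4], parity_mapper, sum_reducer)
print(result)  # [('even', 6), ('odd', 4)]
```

The map and reduce functions are passed in as parameters, so the same driver can be reused for other jobs by swapping them out.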

Example of “MapReduce” Word Count

map() reduce()

• Original Input

Apple Orange Mongo Orange Grapes Plum ...

• Prepare data for mappers

Apple Orange Mongo

Orange Grapes Plum

• map() to useful record

Apple Orange Mongo

(Apple, 1)

(Orange, 1)

(Mongo, 1)

Intermediate key/value pair

• sort and shuffle

[Figure: the intermediate pairs from the mappers, such as (Apple, 1), (Orange, 1) and (Mongo, 1), start out unsorted, are sorted by key, and are then shuffled to the reducers so that each reducer receives the pairs for one key.]

• reduce()

Reducer: (Apple, 1), (Apple, 1) -> (Apple, 2)

Reducer: (Orange, 1), ... -> (Orange, 3)

• Generate Output

(Apple, 2)

(Orange, 3)

(Grapes, 1)

(Plum, 5)

WordCount.txt:

Apple 2

Orange 3

Grapes 1

Plum 5
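The word count walkthrough can be tried out with a small in-memory sketch. It is loosely modeled on the tab-separated key/value lines that Hadoop Streaming passes between mapper and reducer, but everything here runs in one process; the function names mapper and reducer and the variable splits are assumptions made for this example. Only the two splits visible on the slide are used, so the counts differ from the slide's output, whose input is elided after "Plum".

```python
from itertools import groupby


def mapper(lines):
    # Emit one "word<TAB>1" line per word: the intermediate key/value pair.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    # The input must be sorted by key so that equal words are adjacent.
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


splits = ["Apple Orange Mongo", "Orange Grapes Plum"]  # one split per mapper
intermediate = list(mapper(splits))

# sorted() stands in for the sort-and-shuffle phase between map and reduce.
counts = list(reducer(sorted(intermediate)))

for word, count in counts:
    print(word, count)
```

Because the reducer relies on groupby(), it only works on sorted input, which is why the sort step sits between map() and reduce().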


Hadoop Infrastructure

• Pig: high-level data-flow language (Pig Latin) that compiles to MapReduce jobs

• Thrift: cross-language communication (RPC and serialization), similar to Google’s Protocol Buffers

• ZooKeeper: cluster coordination and management

[Figure: infrastructure diagram; recoverable labels: Hadoop, MapReduce, Thrift, Other Services]
