Data Management Using MapReduce · Reduce Function I User-de ned function I Accepts one...

Data Management Using MapReduce

M. Tamer Ozsu

University of Waterloo

CS742-Distributed & Parallel DBMS M. Tamer Ozsu 1 / 24

Basics

I For data analysis of very large data setsI Highly dynamic, irregular, schemaless, etc.I SQL too heavy

I “Embarrassingly parallel problems”I New, simple parallel programming model

I Data structured as (key, value) pairsI E.g. (doc-id, content), (word, count), etc.

I Functional programming style with two functions to be given:I Map(k1,v1) → list(k2,v2)I Reduce(k2, list (v2)) → list(v3)

I Implemented on a distributed file system (e.g., Google File System)on very large clusters


Map Function

I User-defined functionI Processes input key/value pairsI Produces a set of intermediate key/value pairs

I Map function I/OI Input: read a chunk from distributed file system (DFS)I Output: Write to intermediate file on local disk

I MapReduce libraryI Executes map functionI Groups together all intermediate values with the same key (i.e.,

generates a set of lists)I Passes these lists to reduce functions

I Effect of map functionI Processes and partitions input dataI Builds a distributed map (trnasparent to user)I Similar to “group by” operaton in SQL


Reduce Function

I User-defined functionI Accepts one intermediate key and a set of values for that key (i.e., a

list)I Merges these values together to form a (possibly) smaller setI Typically, zero or one output value is generated per invocation

I Reduce function I/OI Input: read from intermediate files using remote reads on local files of

corresponding mapper nodesI Output: Each reducees writes its output as a file back to DFS

I Effect of map functionI Similar to aggregation operaton in SQL


MapReduce Processing

...Inp

ut

dat

ase

t

Map

Map

Map

Map

(k1, v)

(k2, v)(k2, v)

(k2, v)

(k1, v)

(k1, v)

(k2, v)

Group by k

Group by k

(k1, (v, v, v))

(k1, (v, v, v, v)) Reduce

Reduce

Ou

tpu

td

ata

set


Example 1

Assume you are reading the monthly average temperatures for each of the12 months of a year for a bunch of cities, i.e., each input is the name ofthe city (key) and the average monthly temperature. Compute the averageannual temperature for each city.

I Map:Input: 〈City, Month, MonthAvgTemp〉

1. Create key/value pairs

Output: 〈City, MonthAvgTemp〉I Reduce:

Input: 〈City, MonthAvgTemp〉1. Sort to get 〈City, list(MonthAvgTemp)〉 (i.e., it combines in a list

the monthly average temperatures for a given city)2. Compute average over list(MonthAvgTemp)

Output: 〈City, AnnualAvgTemp〉


Example 2

I Consider EMP (ENAME, TITLE, CITY)

I Query:

SELECT CITY , COUNT(∗ )FROM EMPWHERE ENAME LIKE ”\%Smith ”GROUP BY CITY

I Map:

I n p u t : ( TID , emp ) , Output : ( CITY , 1 )i f emp .ENAME l i k e ”\%Smith ” return ( CITY , 1 )

I Reduce:

I n p u t : ( CITY , l i s t ( 1 ) ) , Output : ( CITY ,SUM( l i s t ( 1 ) ) )return ( CITY ,SUM( 1∗ ) )


MapReduce Architecture

Scheduler

Master

Input Module

Map Module

Combine Module

Partition Module

Map Process

Worker

Input Module

Map Module

Combine Module

Partition Module

Map Process

Worker

Input Module

Map Module

Combine Module

Partition Module

Map Process

Worker

Group Module

Reduce Module

Output Module

Reduce Process

Worker

Group Module

Reduce Module

Output Module

Reduce Process

Worker


Execution Flow with ArchitectureMapReduce: Simplified Data Processing on Large Clusters

7. When all map tasks and reduce tasks have been completed, the mas-ter wakes up the user program. At this point, the MapReduce callin the user program returns back to the user code.

After successful completion, the output of the mapreduce executionis available in the R output files (one per reduce task, with file namesspecified by the user). Typically, users do not need to combine these Routput files into one file; they often pass these files as input to anotherMapReduce call or use them from another distributed application thatis able to deal with input that is partitioned into multiple files.

3.2 Master Data StructuresThe master keeps several data structures. For each map task andreduce task, it stores the state (idle, in-progress, or completed) and theidentity of the worker machine (for nonidle tasks).

The master is the conduit through which the location of interme-diate file regions is propagated from map tasks to reduce tasks. There -fore, for each completed map task, the master stores the locations andsizes of the R intermediate file regions produced by the map task.Updates to this location and size information are received as map tasksare completed. The information is pushed incrementally to workersthat have in-progress reduce tasks.

3.3 Fault ToleranceSince the MapReduce library is designed to help process very largeamounts of data using hundreds or thousands of machines, the librarymust tolerate machine failures gracefully.

Handling Worker FailuresThe master pings every worker periodically. If no response is receivedfrom a worker in a certain amount of time, the master marks the workeras failed. Any map tasks completed by the worker are reset back to theirinitial idle state and therefore become eligible for scheduling on otherworkers. Similarly, any map task or reduce task in progress on a failedworker is also reset to idle and becomes eligible for rescheduling.

Completed map tasks are reexecuted on a failure because their out-put is stored on the local disk(s) of the failed machine and is thereforeinaccessible. Completed reduce tasks do not need to be reexecutedsince their output is stored in a global file system.

When a map task is executed first by worker A and then later exe-cuted by worker B (because A failed), all workers executing reducetasks are notified of the reexecution. Any reduce task that has notalready read the data from worker A will read the data from worker B.

MapReduce is resilient to large-scale worker failures. For example,during one MapReduce operation, network maintenance on a runningcluster was causing groups of 80 machines at a time to become unreach-able for several minutes. The MapReduce master simply re executed thework done by the unreachable worker machines and continued to makeforward progress, eventually completing the MapReduce operation.

Semantics in the Presence of FailuresWhen the user-supplied map and reduce operators are deterministicfunctions of their input values, our distributed implementation pro-duces the same output as would have been produced by a nonfaultingsequential execution of the entire program.

split 0

split 1

split 2

split 3

split 4

(1) fork

(3) read(4) local write

(1) fork(1) fork

(6) write

worker

worker

worker

Master

UserProgram

outputfile 0

outputfile 1

worker

worker

(2)assignmap

(2)assignreduce

(5) remote

(5) read

Inputfiles

Mapphasr

Intermediate files(on local disks)

Reducephase

Outputfiles

Fig. 1. Execution overview.

COMMUNICATIONS OF THE ACM January 2008/Vol. 51, No. 1 109

From: J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Comm. ACM, 2008.


CharacteristicsI Flexibility

I User can write any map and reduce function codeI No need to know how to parallelize

I ScalabilityI Elastic scalabilityI Automatic load balancing

I EfficiencyI Simple: parallel scanI No database loading

I Fault toleranceI Worker failure

I Master pings workers periodically; assumes failure if no responseI Tasks (both map and reduce) on failed workers scheduled on a different

worker nodeI Master failure

I Checkpoints of master stateI Recovery after failure → progress can halt

I Replication on distributed data store


HadoopI Most popular MapReduce implementation – developed by Yahoo!I Two components

I Processing engineI HDFS: Hadoop Distributed Storage System – others possibleI Can be deployed on the same machine or on different machines

I ProcessesI Job tracker: hosted on the master node and implements the scheduleI Task tracker: hosted on the worker nodes and accepts tasks from job tracker

and executes themI HDFS

I Name node: stores how data are partitioned, monitors the status of datanodes, and data dictionary

I Data node: Stores and manages data chunks assigned to it

Task Tracker Job Tracker Task Tracker

Data Node Name Node Data Node

Worker 1 Name Node Worker n

MapReduce

HDFS


Hadoop UDF FunctionsPhase Name Function

Map

InputFormat::getSplit Partition the input data into differentsplits. Each split is processed by a map-per and may consist of several chunks.

RecordReader::next Define how a split is divided into items.Each item is a key/value pair and used asthe input for the map function.

Mapper::map Users can customize the map function toprocess the input data. The input dataare transformed into some intermediatekey/value pairs.

WritableComparable::compareTo The comparison function for the key/-value pairs.

Job::setCombinerClass Specify how the key/value pair are aggre-gated locally.

Shuffle Job::setPartitionerClass Specify how the intermediate key/valuepairs are shuffled to different reducers.

ReduceJob::setGroupingComparatorClass Specify how the key/value pairs are

grouped in the reduce phase.Reducer::reduce Users write their own reduce functions

to perform the corresponding jobs.


MapReduce Implementations

Name Language File System Index Master Server MultipleJobSupport

Hadoop Java HDFS No Name Node and JobTracker

Yes

Disco PythonandErlang

DistributedIndex

DiscoServer

No No

Skynet Ruby MySQL orUnix FileSystem

No Any node in the cluster No

FileMap Shelland PerlScripts

Unix FileSystem

No Any node in the cluster No

Twister Java Unix FileSystem

No One master node in brokernetwork

Yes

Cascading Java HDFS No Name Node and JobTracker

Yes


MapReduce Languages

Hive Pig

Language Declarative SQL-like Dataflow

Data model Nested Nested

UDF Supported Supported

Data partition Supported Not supported

Interface Command line, web,JDBC/ODBC server

Command line

Query optimization Rule based Rule based

Metastore Supported Not supported


MapReduce Implementations of Database Operators

I Select and Project can be easily implemented in the map function

I Aggregation is not difficult (see next slide)

I Join requires more work

MapReduce join implementations

θ-join

Equi-join

Repartitionjoin

Semi-join Map-only join

Broadcast joinPartition

join

Similarity join Multi-way join

MultipleMapReduce

jobs

Replicatedjoin


Aggregation

Key Value

1 R1

2 R2

map

Aid Value

1 R1

2 R2

Mapper 1

R

Extracting aggrega-tion attribute (Aid)

Key Value

3 R3

4 R4

map

Aid Value

1 R3

2 R4

Mapper 2

R

Map Phase

Par

titi

onin

gby

Aid

(Rou

nd

Rob

in)

Aid Value

1R1

R3 red

uce Result

1, f(R1, R3)

Reducer 1

Grouping byAid

Applying the aggregationfunction for the tuples withthe same Aid

Aid Value

2R2

R4 red

uce Result

2, f(R2, R4)

Reducer 2

Reduce Phase


θ-Join

Baseline implementation of R(A,B) on S(B,C)

1. Partition R and assign each partition to mappers

2. Each mapper takes 〈a, b〉 tuples and converts them to a list ofkey/value pairs of the form (b, 〈a,R〉)

3. Each reducer pulls the pairs with the same key

4. Each reducer joins tuples of R with tuples of S


θ-JoinI If θ equals = (i.e., equijoin)

I Repartition joinI Semijoin-based joinI Map-only join

I If θ equals 6=

Key Value

1 R1

2 R2

3 R3

4 R4

map

Bid Tuple

1 ‘R’,1,R1

4 ‘R’,2,R2

3 ‘R’,3,R3

2 ‘R’,2,R4

Mapper 1

R

Randomly assigningbucket ID (Bid)

Key Value

1 S1

2 S2

3 S3

4 S4

map

Bid Tuple

4 ‘S’,1,S1

2 ‘S’,2,S2

3 ‘S’,3,S3

1 ‘S’,2,S4

Mapper 1

S

Map Phase

Par

titi

onin

gby

Bid

(Rou

nd

Rob

in) Bid Tuple

1‘R’,1,R1

‘S’,4,S4

3‘R’,3,R3

‘S’,3,S3

red

uce

Origin Tuple

‘R’ 1,R1

‘R’ 4,R4

Origin Tuple

‘S’ 3,S3

‘S’ 4,S4

Result

R1 S3

R1 S4

R4 S3

Reducer 1

Grouping byBid

Aggregatetuples basedon origins

Local θ-join (θ is‘ 6=′)

Reducer 2

Reduce Phase


Repartition Join

Key Value

1 R1

2 R2

3 R3

4 R4

map

Key Value

1 ‘R’,R1

2 ‘R’,R2

3 ‘R’,R3

4 ‘R’,R4

Mapper 1

R

Tagging origins

Key Value

1 S1

2 S2

3 S3

4 S4

map

Key Value

1 ‘S’,S1

2 ‘S’,S2

3 ‘S’,S3

4 ‘S’,S4

Mapper 1

S

Map Phase

Par

titi

onin

gby

key

(Rou

nd

Rob

in)

Key Tuple

1‘R’,R1

‘S’,S1

3‘R’,R3

‘S’,S3

red

uce

Result

R1 S1

R3 S3

Reducer 1

Grouping by keys Local join

Key Tuple

2‘R’,R2

‘S’,S2

4‘R’,R4

‘S’,S4

red

uce

Result

R2 S2

R4 S4

Reducer 2

Reduce Phase


Semijoin-based Join

Key Value

1 R1

3 R2

1 R3

4 R4

Map

Red

uce Key

1

3

4

R

Job 1Full MapReduce job

Extracting join keys

Key 1 3 4

Key Value

1 S1

2 S2

map

Key Value

1 S1

Mapper 1

S

Broadcasting keys of R to all thesplits of S and join S with keys of R

Key 1 3 4

Key Value

3 S3

4 S4

map

Key Value

3 S3

4 S4

Mapper 2

S

Job 2Map-only job

Key Value

1 S1

3 S3

4 S4

Key Value

1 R1

2 R2

map

Result

R1 S1

R2 S3

Mapper 1

S

R

Broadcasting the results of theprevious job (S) to all the splits

of R, and locally joining R with S

Mapper 2

Job 3Map-only job


Map-only Join

I Broadcast join: If inner relation � outer relation → no shufflingI Map phase similar to third job of semijoin-based joinI Each mapper loads the full inner table to build an in-memory hash;

scan the outer relation

I Trojan join: Relations are already co-partitioned on the join key → alltuples of both relations are co-located on the same node

I Co-partitioning implemented by one jobI Scheduler loads co-partitioned data chunks in the same mapper to

perform a local join

HR DataR HS DataS HR DataR HS DataS

Co-group Co-groupFooter· · ·

Co-partitioned Split


Multiway Join

I Multiple MapReduce jobsI R on S on T → (R on S) on TI Each join implemented in one MapReduce jobI Join ordering problem

I Replicated joinI Single MapReduce job


Replicated Join

Rid Value

1 C1

2 C2

map

Key Value

1, null ‘C’,C1

2, null ‘C’,C2

R

Mapper 1

Generating keys and tagging origins

Rid Sid Value

1 1 L1

1 2 L2

2 2 L3

map

Key Value

1,1 ‘L’,L1

1,2 ‘L’,L2

2,2 ‘L’,L3

T

Mapper 2

Sid Value

1 O1

2 O2

map

Key Value

null,1 ‘O’,O1

null,2 ‘O’,O2

S

Mapper 3

Map Phase

Par

titi

onin

gby

key

(Rou

nd

Rob

in)

Sh

uffl

ing

tup

les

tom

ult

iple

red

uce

rsif

nec

essa

ry

Key Value

1,null ‘C’, C1

1,1 ‘L’, L1

null,1 ‘O’, O1

red

uce Result

C1,L1,O1

Reducer 1

Joining the three tables locally

Key Value

1,null ‘C’, C1

1,2 ‘L’, L2

null,2 ‘O’, O2

red

uce Result

C1,L2,O2

Reducer 2

Key Value

2,null ‘C’, C2

null,1 ‘O’, O1 red

uce

Result

Reducer 3

Key Value

2,null ‘C’, C2

2,2 ‘L’, L3

null,2 ‘O’, O2

red

uce Result

C2,L2,O2

Reducer 4

Reduce Phase


DBMS on MapReduce

HadoopDB Llama Cheetah

Language SQL-like Simple interface SQL

Storage Row store Column store Hybrid store

Data com-pression

No Yes Yes

Data parti-tion

Horizontally parti-tioned

Vertically partitioned Horizontally par-titioned at chunklevel

Indexing Local index in eachdatabase instance

No index Local index for eachdata chunk

Query op-timization

Rule based optimiza-tion plus local op-timization by Post-greSQL

Column-based op-timization, latematerialization andprocessing multiwayjoin in one job

Multi-query optimiza-tion, materializedviews


Data Management Using MapReduce · Reduce Function I User-de ned function I Accepts one...

Documents

Transcript of Data Management Using MapReduce · Reduce Function I User-de ned function I Accepts one...