MapReduce Tutorial


Transcript of MapReduce Tutorial

Page 1: MapReduce Tutorial

Tutorial: MapReduce
Theory and Practice of Data-intensive Applications

Pietro Michiardi

Eurecom

Pietro Michiardi (Eurecom) Tutorial: MapReduce 1 / 191

Page 2: MapReduce Tutorial

Introduction

Introduction

Pietro Michiardi (Eurecom) Tutorial: MapReduce 2 / 191

Page 3: MapReduce Tutorial

Introduction

What is MapReduce

A programming model:
  - Inspired by functional programming
  - Allows expressing distributed computations on massive amounts of data

An execution framework:
  - Designed for large-scale data processing
  - Designed to run on clusters of commodity hardware

Pietro Michiardi (Eurecom) Tutorial: MapReduce 3 / 191

Page 4: MapReduce Tutorial

Introduction

What is this Tutorial About

Design of scalable algorithms with MapReduce
  - Applied algorithm design and case studies

In-depth description of MapReduce
  - Principles of functional programming
  - The execution framework

In-depth description of Hadoop
  - Architecture internals
  - Software components
  - Cluster deployments

Pietro Michiardi (Eurecom) Tutorial: MapReduce 4 / 191

Page 5: MapReduce Tutorial

Introduction Motivations

Motivations

Pietro Michiardi (Eurecom) Tutorial: MapReduce 5 / 191

Page 6: MapReduce Tutorial

Introduction Motivations

Big Data

Vast repositories of data
  - Web-scale processing
  - Behavioral data
  - Physics
  - Astronomy
  - Finance

“The fourth paradigm” of science [6]
  - Data-intensive processing is fast becoming a necessity
  - Design algorithms capable of scaling to real-world datasets

It’s not the algorithm, it’s the data! [2]
  - More data leads to better accuracy
  - With more data, accuracy of different algorithms converges

Pietro Michiardi (Eurecom) Tutorial: MapReduce 6 / 191

Page 7: MapReduce Tutorial

Introduction Big Ideas

Key Ideas Behind MapReduce

Pietro Michiardi (Eurecom) Tutorial: MapReduce 7 / 191

Page 8: MapReduce Tutorial

Introduction Big Ideas

Scale out, not up!

For data-intensive workloads, a large number of commodity servers is preferred over a small number of high-end servers
  - Cost of super-computers is not linear
  - But datacenter efficiency is a difficult problem to solve [3, 5]

Some numbers:
  - Data processed by Google every day: 20 PB
  - Data processed by Facebook every day: 15 TB

Pietro Michiardi (Eurecom) Tutorial: MapReduce 8 / 191

Page 9: MapReduce Tutorial

Introduction Big Ideas

Implications of Scaling Out

Processing data is quick, I/O is very slow
  - 1 HDD = 75 MB/sec
  - 1000 HDDs = 75 GB/sec

Sharing vs. shared nothing:
  - High-performance computing focus: distribute the workload
  - Shared nothing focus: distribute the data

Sharing is difficult:
  - Synchronization, deadlocks
  - Finite bandwidth to access data from SAN
  - Temporal dependencies are complicated (restarts)

Pietro Michiardi (Eurecom) Tutorial: MapReduce 9 / 191

Page 10: MapReduce Tutorial

Introduction Big Ideas

Failures are the norm, not the exception

LANL failure data [DSN 2006]
  - Data for 5000 machines, over 9 years
  - Hardware: 60%, Software: 20%, Network: 5%

DRAM error analysis [Sigmetrics 2009]
  - Data for 2.5 years
  - 8% of DIMMs affected by errors

Disk drive failure analysis [FAST 2007]
  - Utilization and temperature are major causes of failures

Amazon Web Services failure [April 2011]
  - Cascading effect

Pietro Michiardi (Eurecom) Tutorial: MapReduce 10 / 191

Page 11: MapReduce Tutorial

Introduction Big Ideas

Implications of Failures

Failures are part of everyday life
  - Mostly due to the scale and shared environment

Sources of failures
  - Hardware / Software
  - Preemption
  - Unavailability of a resource due to overload

Failure types
  - Permanent
  - Transient

Pietro Michiardi (Eurecom) Tutorial: MapReduce 11 / 191

Page 12: MapReduce Tutorial

Introduction Big Ideas

Move Processing to the Data

Drastic departure from the high-performance computing model
  - HPC: distinction between processing nodes and storage nodes
  - HPC: CPU-intensive tasks

Data-intensive workloads
  - Generally not processor demanding
  - The network becomes the bottleneck
  - MapReduce assumes processing and storage nodes to be colocated: data locality

Distributed filesystems are necessary

Pietro Michiardi (Eurecom) Tutorial: MapReduce 12 / 191

Page 13: MapReduce Tutorial

Introduction Big Ideas

Process Data Sequentially and Avoid Random Access

Data-intensive workloads
  - Relevant datasets are too large to fit in memory
  - Such data resides on disks

Disk performance is a bottleneck
  - Seek times for random disk access are the problem
    - Example: a 1 TB database with 10^10 100-byte records. Updating 1% of the records with random access requires 1 month; reading and rewriting the whole database sequentially would take 1 day (1)
  - Organize computation for sequential reads

(1) From a post by Ted Dunning on the Hadoop mailing list.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 13 / 191

Page 14: MapReduce Tutorial

Introduction Big Ideas

Implications of Data Access Patterns

MapReduce is designed for
  - Batch processing
  - Involving (mostly) full scans of the dataset

Typically, data is collected “elsewhere” and copied to the distributed filesystem

Data-intensive applications
  - Read and process the whole Internet dataset from a crawler
  - Read and process the whole social graph

Pietro Michiardi (Eurecom) Tutorial: MapReduce 14 / 191

Page 15: MapReduce Tutorial

Introduction Big Ideas

Hide System-level Details

Separate the what from the how
  - MapReduce abstracts away the “distributed” part of the system
  - Such details are handled by the framework

In-depth knowledge of the framework is key
  - Custom data readers/writers
  - Custom data partitioning
  - Memory utilization

Auxiliary components
  - Hadoop Pig
  - Hadoop Hive
  - Cascading

Pietro Michiardi (Eurecom) Tutorial: MapReduce 15 / 191

Page 16: MapReduce Tutorial

Introduction Big Ideas

Seamless Scalability

We can define scalability along two dimensions
  - In terms of data: given twice the amount of data, the same algorithm should take no more than twice as long to run
  - In terms of resources: given a cluster twice the size, the same algorithm should take no more than half as long to run

Embarrassingly parallel problems
  - Simple definition: independent (shared nothing) computations on fragments of the dataset
  - It’s not easy to decide whether a problem is embarrassingly parallel or not

MapReduce is a first attempt, not the final answer

Pietro Michiardi (Eurecom) Tutorial: MapReduce 16 / 191

Page 17: MapReduce Tutorial

Introduction Big Ideas

Part One

Pietro Michiardi (Eurecom) Tutorial: MapReduce 17 / 191

Page 18: MapReduce Tutorial

MapReduce Framework

The MapReduce Framework

Pietro Michiardi (Eurecom) Tutorial: MapReduce 18 / 191

Page 19: MapReduce Tutorial

MapReduce Framework Preliminaries

Preliminaries

Pietro Michiardi (Eurecom) Tutorial: MapReduce 19 / 191

Page 20: MapReduce Tutorial

MapReduce Framework Preliminaries

Divide and Conquer

A feasible approach to tackling large-data problems
  - Partition a large problem into smaller sub-problems
  - Independent sub-problems are executed in parallel
  - Combine intermediate results from each individual worker

The workers can be:
  - Threads in a processor core
  - Cores in a multi-core processor
  - Multiple processors in a machine
  - Many machines in a cluster

The implementation details of divide and conquer are complex

Pietro Michiardi (Eurecom) Tutorial: MapReduce 20 / 191

Page 21: MapReduce Tutorial

MapReduce Framework Preliminaries

Divide and Conquer: How to?

Decompose the original problem into smaller, parallel tasks

Schedule tasks on workers distributed in a cluster
  - Data locality
  - Resource availability

Ensure workers get the data they need

Coordinate synchronization among workers

Share partial results

Handle failures

Pietro Michiardi (Eurecom) Tutorial: MapReduce 21 / 191

Page 22: MapReduce Tutorial

MapReduce Framework Preliminaries

The MapReduce Approach

Shared memory approach (OpenMP, MPI, ...)
  - Developer needs to take care of (almost) everything
  - Synchronization, concurrency
  - Resource allocation

MapReduce: a shared nothing approach
  - Most of the above issues are taken care of
  - Problem decomposition and sharing partial results need particular attention
  - Optimizations (memory and network consumption) are tricky

Pietro Michiardi (Eurecom) Tutorial: MapReduce 22 / 191

Page 23: MapReduce Tutorial

MapReduce Framework Programming Model

The MapReduce Programming model

Pietro Michiardi (Eurecom) Tutorial: MapReduce 23 / 191

Page 24: MapReduce Tutorial

MapReduce Framework Programming Model

Functional Programming Roots

Key feature: higher-order functions
  - Functions that accept other functions as arguments
  - Map and fold

Figure: Illustration of map and fold.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 24 / 191

Page 25: MapReduce Tutorial

MapReduce Framework Programming Model

Functional Programming Roots

map phase:
  - Given a list, map takes as an argument a function f (that takes a single argument) and applies it to all elements in the list

fold phase:
  - Given a list, fold takes as arguments a function g (that takes two arguments) and an initial value
  - g is first applied to the initial value and the first item in the list
  - The result is stored in an intermediate variable, which is used as an input together with the next item to a second application of g
  - The process is repeated until all items in the list have been consumed
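As an aside (not part of the original slides), the same map and fold idea can be written with Java 8 streams: map applies f to every element independently, and reduce plays the role of fold, combining an initial value with each item through the two-argument function g.

    import java.util.Arrays;
    import java.util.List;

    public class MapFoldExample {
        public static void main(String[] args) {
            List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);

            // map: apply f (here, squaring) to every element in isolation
            // fold: aggregate with g (here, addition), starting from the initial value 0
            int sumOfSquares = data.stream()
                                   .map(x -> x * x)          // f
                                   .reduce(0, Integer::sum); // g with initial value 0

            System.out.println(sumOfSquares); // prints 55
        }
    }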

Pietro Michiardi (Eurecom) Tutorial: MapReduce 25 / 191

Page 26: MapReduce Tutorial

MapReduce Framework Programming Model

Functional Programming Roots

We can view map as a transformation over a dataset
  - This transformation is specified by the function f
  - Each functional application happens in isolation
  - The application of f to each element of a dataset can be parallelized in a straightforward manner

We can view fold as an aggregation operation
  - The aggregation is defined by the function g
  - Data locality: elements in the list must be “brought together”
  - If we can group elements of the list, the fold phase can also proceed in parallel

Associative and commutative operations
  - Allow performance gains through local aggregation and reordering

Pietro Michiardi (Eurecom) Tutorial: MapReduce 26 / 191

Page 27: MapReduce Tutorial

MapReduce Framework Programming Model

Functional Programming and MapReduce

Equivalence of MapReduce and functional programming:
  - The map of MapReduce corresponds to the map operation
  - The reduce of MapReduce corresponds to the fold operation

The framework coordinates the map and reduce phases:
  - How intermediate results are grouped for the reduce to happen in parallel

In practice:
  - A user-specified computation is applied (in parallel) to all input records of a dataset
  - Intermediate results are aggregated by another user-specified computation

Pietro Michiardi (Eurecom) Tutorial: MapReduce 27 / 191

Page 28: MapReduce Tutorial

MapReduce Framework Programming Model

What can we do with MapReduce?

MapReduce “implements” a subset of functional programming
  - The programming model appears quite limited

There are several important problems that can be adapted to MapReduce
  - In this tutorial we will focus on illustrative cases
  - We will see in detail “design patterns”
    - How to transform a problem and its input
    - How to save memory and bandwidth in the system

Pietro Michiardi (Eurecom) Tutorial: MapReduce 28 / 191

Page 29: MapReduce Tutorial

MapReduce Framework The Framework

Mappers and Reducers

Pietro Michiardi (Eurecom) Tutorial: MapReduce 29 / 191

Page 30: MapReduce Tutorial

MapReduce Framework The Framework

Data Structures

Key-value pairs are the basic data structure in MapReduce
  - Keys and values can be: integers, floats, strings, raw bytes
  - They can also be arbitrary data structures

The design of MapReduce algorithms involves:
  - Imposing the key-value structure on arbitrary datasets
    - E.g.: for a collection of Web pages, input keys may be URLs and values may be the HTML content
  - In some algorithms, input keys are not used; in others they uniquely identify a record
  - Keys can be combined in complex ways to design various algorithms

Pietro Michiardi (Eurecom) Tutorial: MapReduce 30 / 191

Page 31: MapReduce Tutorial

MapReduce Framework The Framework

A MapReduce job

The programmer defines a mapper and a reducer as follows (2):
  - map: (k1, v1) → [(k2, v2)]
  - reduce: (k2, [v2]) → [(k3, v3)]

A MapReduce job consists of:
  - A dataset stored on the underlying distributed filesystem, which is split into a number of files across machines
  - The mapper is applied to every input key-value pair to generate intermediate key-value pairs
  - The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs

(2) We use the convention [...] to denote a list.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 31 / 191

Page 32: MapReduce Tutorial

MapReduce Framework The Framework

Where the magic happens

Implicit between the map and reduce phases is a distributed “group by” operation on intermediate keys
  - Intermediate data arrive at each reducer in order, sorted by key
  - No ordering is guaranteed across reducers

Output keys from reducers are written back to the distributed filesystem
  - The output may consist of r distinct files, where r is the number of reducers
  - Such output may be the input to a subsequent MapReduce phase

Intermediate keys are transient:
  - They are not stored on the distributed filesystem
  - They are “spilled” to the local disk of each machine in the cluster

Pietro Michiardi (Eurecom) Tutorial: MapReduce 32 / 191

Page 33: MapReduce Tutorial

MapReduce Framework The Framework

A Simplified view of MapReduce

Figure: Mappers are applied to all input key-value pairs, to generate an arbitrary number of intermediate pairs. Reducers are applied to all intermediate values associated with the same intermediate key. Between the map and reduce phases lies a barrier that involves a large distributed sort and group by.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 33 / 191

Page 34: MapReduce Tutorial

MapReduce Framework The Framework

“Hello World” in MapReduce

Figure: Pseudo-code for the word count algorithm.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 34 / 191

Page 35: MapReduce Tutorial

MapReduce Framework The Framework

“Hello World” in MapReduce

Input:
  - Key-value pairs: (docid, doc) stored on the distributed filesystem
  - docid: unique identifier of a document
  - doc: the text of the document itself

Mapper:
  - Takes an input key-value pair and tokenizes the document
  - Emits intermediate key-value pairs: the word is the key and the integer 1 is the value

The framework:
  - Guarantees that all values associated with the same key (the word) are brought to the same reducer

The reducer:
  - Receives all values associated with some keys
  - Sums the values and writes output key-value pairs: the key is the word and the value is the number of occurrences
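The pseudo-code figure is not reproduced in this transcript; the following is a minimal sketch of word count written against the old (org.apache.hadoop.mapred) Java API. It assumes plain-text input, where the input key is the byte offset of each line (a LongWritable) rather than a docid; class and field names are illustrative.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    output.collect(word, ONE);
                }
            }
        }

        // Reducer: sums all counts received for a given word
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }
    }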

Pietro Michiardi (Eurecom) Tutorial: MapReduce 35 / 191

Page 36: MapReduce Tutorial

MapReduce Framework The Framework

Implementation and Execution Details

The partitioner is in charge of assigning intermediate keys (words) to reducers
  - Note that the partitioner can be customized

How many map and reduce tasks?
  - The framework essentially takes care of the number of map tasks
  - The designer/developer takes care of the number of reduce tasks

In this tutorial we will focus on Hadoop
  - Other implementations of the framework exist: Google, Disco, ...

Pietro Michiardi (Eurecom) Tutorial: MapReduce 36 / 191

Page 37: MapReduce Tutorial

MapReduce Framework The Framework

Restrictions

Using external resources
  - E.g.: data stores other than the distributed filesystem
  - Concurrent access by many map/reduce tasks

Side effects
  - Not allowed in functional programming
  - E.g.: preserving state across multiple inputs
  - State is kept internal

I/O and execution
  - External side effects using distributed data stores (e.g. BigTable)
  - A job may have no input (e.g. computing π) or no reducers, but never no mappers

Pietro Michiardi (Eurecom) Tutorial: MapReduce 37 / 191

Page 38: MapReduce Tutorial

MapReduce Framework The Framework

The Execution Framework

Pietro Michiardi (Eurecom) Tutorial: MapReduce 38 / 191

Page 39: MapReduce Tutorial

MapReduce Framework The Framework

The Execution Framework

MapReduce program, a.k.a. a job:
  - Code of mappers and reducers
  - Code for combiners and partitioners (optional)
  - Configuration parameters
  - All packaged together

A MapReduce job is submitted to the cluster
  - The framework takes care of everything else
  - Next, we will delve into the details

Pietro Michiardi (Eurecom) Tutorial: MapReduce 39 / 191

Page 40: MapReduce Tutorial

MapReduce Framework The Framework

Scheduling

Each job is broken into tasks
  - Map tasks work on fractions of the input dataset, as defined by the underlying distributed filesystem
  - Reduce tasks work on intermediate inputs and write back to the distributed filesystem

The number of tasks may exceed the number of available machines in a cluster
  - The scheduler takes care of maintaining something similar to a queue of pending tasks to be assigned to machines with available resources

Jobs to be executed in a cluster require scheduling as well
  - Different users may submit jobs
  - Jobs may be of various complexity
  - Fairness is generally a requirement

Pietro Michiardi (Eurecom) Tutorial: MapReduce 40 / 191

Page 41: MapReduce Tutorial

MapReduce Framework The Framework

Scheduling

The scheduler component can be customized
  - As of today, for Hadoop, there are various schedulers

Dealing with stragglers
  - Job execution time depends on the slowest map and reduce tasks
  - Speculative execution can help with slow machines
    - But data locality may be at stake

Dealing with skew in the distribution of values
  - E.g.: temperature readings from sensors
  - In this case, scheduling cannot help
  - It is possible to work on customized partitioning and sampling to solve such issues [Advanced Topic]

Pietro Michiardi (Eurecom) Tutorial: MapReduce 41 / 191

Page 42: MapReduce Tutorial

MapReduce Framework The Framework

Data/code co-location

How to feed data to the code
  - In MapReduce, this issue is intertwined with scheduling and the underlying distributed filesystem

How data locality is achieved
  - The scheduler starts the task on the node that holds a particular block of data required by the task
  - If this is not possible, tasks are started elsewhere, and data will cross the network
    - Note that input data is usually replicated
  - Distance rules [11] help dealing with bandwidth consumption
    - Same-rack scheduling

Pietro Michiardi (Eurecom) Tutorial: MapReduce 42 / 191

Page 43: MapReduce Tutorial

MapReduce Framework The Framework

Synchronization

In MapReduce, synchronization is achieved by the “shuffle and sort” barrier
  - Intermediate key-value pairs are grouped by key
  - This requires a distributed sort involving all mappers, and taking into account all reducers
  - If you have m mappers and r reducers, this phase involves up to m × r copying operations

IMPORTANT: the reduce operation cannot start until all mappers have finished
  - This is different from functional programming, which allows “lazy” aggregation
  - In practice, a common optimization is for reducers to pull data from mappers as soon as they finish

Pietro Michiardi (Eurecom) Tutorial: MapReduce 43 / 191

Page 44: MapReduce Tutorial

MapReduce Framework The Framework

Errors and faults

Using quite simple mechanisms, the MapReduce framework deals with:

Hardware failures
  - Individual machines: disks, RAM
  - Networking equipment
  - Power / cooling

Software failures
  - Exceptions, bugs

Corrupt and/or invalid input data

Pietro Michiardi (Eurecom) Tutorial: MapReduce 44 / 191

Page 45: MapReduce Tutorial

MapReduce Framework The Framework

Partitioners and Combiners

Pietro Michiardi (Eurecom) Tutorial: MapReduce 45 / 191

Page 46: MapReduce Tutorial

MapReduce Framework The Framework

Partitioners

Partitioners are responsible for:
  - Dividing up the intermediate key space
  - Assigning intermediate key-value pairs to reducers
  → They specify the reduce task to which an intermediate key-value pair must be copied

Hash-based partitioner
  - Computes the hash of the key modulo the number of reducers r
  - This ensures a roughly even partitioning of the key space
    - However, it ignores values: this can cause imbalance in the data processed by each reducer
  - When dealing with complex keys, even the base partitioner may need customization (a sketch follows)
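For illustration only (not from the slides), a custom partitioner in the old (org.apache.hadoop.mapred) API might look as follows; routing keys by their first letter is a made-up policy, used here just to show the mechanics.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical partitioner: all words starting with the same letter
    // are routed to the same reducer.
    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

        public void configure(JobConf job) {
            // no configuration needed for this example
        }

        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String s = key.toString();
            char first = s.isEmpty() ? ' ' : Character.toLowerCase(s.charAt(0));
            // mask the sign bit before the modulo, as the default HashPartitioner does
            return (first & Integer.MAX_VALUE) % numPartitions;
        }
    }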

Pietro Michiardi (Eurecom) Tutorial: MapReduce 46 / 191

Page 47: MapReduce Tutorial

MapReduce Framework The Framework

Combiners

Combiners are an (optional) optimization:
  - Allow local aggregation before the “shuffle and sort” phase
  - Each combiner operates in isolation

Essentially, combiners are used to save bandwidth
  - E.g.: word count program

Combiners can be implemented using local data structures
  - E.g., an associative array keeps intermediate computations and their aggregation
  - The map function only emits once all input records (even all input splits) are processed (see the sketch below)
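A sketch of the pattern described above, often called in-mapper combining, again using the old (org.apache.hadoop.mapred) API; the word-count use case and all names are illustrative.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // In-mapper combining for word count: counts are aggregated in a local
    // associative array and emitted only once the whole input split is done.
    public class InMapperCombiningMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final Map<String, Integer> counts = new HashMap<String, Integer>();
        private OutputCollector<Text, IntWritable> collector;

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            collector = output; // keep a handle for use in close()
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                Integer current = counts.get(token);
                counts.put(token, current == null ? 1 : current + 1);
            }
        }

        @Override
        public void close() throws IOException {
            if (collector == null) {
                return; // this mapper saw no input records
            }
            // emit the locally aggregated counts once all records have been seen
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                collector.collect(new Text(e.getKey()), new IntWritable(e.getValue()));
            }
        }
    }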

Pietro Michiardi (Eurecom) Tutorial: MapReduce 47 / 191

Page 48: MapReduce Tutorial

MapReduce Framework The Framework

Partitioners and Combiners, an Illustration

Figure: Complete view of MapReduce illustrating combiners and partitioners.

Note: in Hadoop, partitioners are executed before combiners.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 48 / 191

Page 49: MapReduce Tutorial

MapReduce Framework The Framework

The Distributed Filesystem

Pietro Michiardi (Eurecom) Tutorial: MapReduce 49 / 191

Page 50: MapReduce Tutorial

MapReduce Framework The Framework

Colocate data and computation!

As dataset sizes increase, more computing capacity is required for processing

As compute capacity grows, the link between the compute nodes and the storage nodes becomes a bottleneck
  - One could eventually think of special-purpose interconnects for high-performance networking
  - This is often a costly solution as cost does not increase linearly with performance

Key idea: abandon the separation between compute and storage nodes
  - This is exactly what happens in current implementations of the MapReduce framework
  - A distributed filesystem is not mandatory, but highly desirable

Pietro Michiardi (Eurecom) Tutorial: MapReduce 50 / 191

Page 51: MapReduce Tutorial

MapReduce Framework The Framework

Distributed filesystems

In this tutorial we will focus on HDFS, the Hadoop implementation of the Google distributed filesystem (GFS)

Distributed filesystems are not new!
  - HDFS builds upon previous results, tailored to the specific requirements of MapReduce
  - Write once, read many workloads
  - Does not handle concurrency, but allows replication
  - Optimized for throughput, not latency

Pietro Michiardi (Eurecom) Tutorial: MapReduce 51 / 191

Page 52: MapReduce Tutorial

MapReduce Framework The Framework

HDFS

Divide user data into blocks
  - Blocks are big! [64, 128] MB
  - Avoids problems related to metadata management

Replicate blocks across the local disks of nodes in the cluster
  - Replication is handled by storage nodes themselves (similar to chain replication) and follows distance rules

Master-slave architecture
  - NameNode: the master maintains the namespace (metadata, file-to-block mapping, location of blocks) and the overall health of the filesystem
  - DataNode: the slaves manage the data blocks

Pietro Michiardi (Eurecom) Tutorial: MapReduce 52 / 191

Page 53: MapReduce Tutorial

MapReduce Framework The Framework

HDFS, an Illustration

Figure: The architecture of HDFS.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 53 / 191

Page 54: MapReduce Tutorial

MapReduce Framework The Framework

HDFS I/O

A typical read from a client involves:
  1. Contact the NameNode to determine where the actual data is stored
  2. The NameNode replies with block identifiers and locations (i.e., which DataNodes)
  3. Contact the DataNodes to fetch the data

A typical write from a client involves:
  1. Contact the NameNode to update the namespace and verify permissions
  2. The NameNode allocates a new block on a suitable DataNode
  3. The client directly streams to the selected DataNode
  4. Currently, HDFS files are immutable

Data is never moved through the NameNode
  - Hence, there is no bottleneck
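For illustration (not in the original slides), a minimal HDFS client read with the FileSystem API; the NameNode is contacted only to resolve block locations when open() is called, and the bytes then stream directly from the DataNodes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Prints a text file stored in HDFS to standard output.
    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // picks up core-site.xml
            FileSystem fs = FileSystem.get(conf);              // HDFS client
            FSDataInputStream in = fs.open(new Path(args[0])); // NameNode lookup happens here
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);                      // data comes from DataNodes
            }
            reader.close();
        }
    }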

Pietro Michiardi (Eurecom) Tutorial: MapReduce 54 / 191

Page 55: MapReduce Tutorial

MapReduce Framework The Framework

HDFS Replication

By default, HDFS stores 3 separate copies of each block
  - This ensures reliability, availability and performance

Replication policy
  - Spread replicas across different racks
  - Robust against cluster node failures
  - Robust against rack failures

Block replication benefits MapReduce
  - Scheduling decisions can take replicas into account
  - Exploit better data locality

Pietro Michiardi (Eurecom) Tutorial: MapReduce 55 / 191

Page 56: MapReduce Tutorial

MapReduce Framework The Framework

HDFS: more on operational assumptions

A small number of large files is preferred over a large number of small files
  - Metadata may explode
  - Input splits for MapReduce are based on individual files
    → Mappers are launched for every file
      - High startup costs
      - Inefficient “shuffle and sort”

Workloads are batch oriented

Not fully POSIX compliant

Cooperative scenario

Pietro Michiardi (Eurecom) Tutorial: MapReduce 56 / 191

Page 57: MapReduce Tutorial

MapReduce Framework The Framework

Part Two

Pietro Michiardi (Eurecom) Tutorial: MapReduce 57 / 191

Page 58: MapReduce Tutorial

Hadoop MapReduce

Hadoop implementation of MapReduce

Pietro Michiardi (Eurecom) Tutorial: MapReduce 58 / 191

Page 59: MapReduce Tutorial

Hadoop MapReduce Preliminaries

Preliminaries

Pietro Michiardi (Eurecom) Tutorial: MapReduce 59 / 191

Page 60: MapReduce Tutorial

Hadoop MapReduce Preliminaries

From Theory to Practice

The story so far
  - Concepts behind the MapReduce framework
  - Overview of the programming model

Hadoop implementation of MapReduce
  - HDFS in detail
  - Hadoop I/O
  - Hadoop MapReduce
    - Implementation details
    - Types and Formats
    - Features in Hadoop
  - Hadoop Streaming: Dumbo

Hadoop Deployments

Pietro Michiardi (Eurecom) Tutorial: MapReduce 60 / 191

Page 61: MapReduce Tutorial

Hadoop MapReduce Preliminaries

Terminology

MapReduce:
  - Job: an execution of a Mapper and Reducer across a data set
  - Task: an execution of a Mapper or a Reducer on a slice of data
  - Task Attempt: an instance of an attempt to execute a task
  - Example:
    - Running “Word Count” across 20 files is one job
    - 20 files to be mapped = 20 map tasks + some number of reduce tasks
    - At least 20 attempts will be performed... more if a machine crashes

Task Attempts
  - A task is attempted at least once, possibly more
  - Multiple crashes on the same input imply discarding it
  - Multiple attempts may occur in parallel (speculative execution)
  - The Task ID from TaskInProgress is not a unique identifier

Pietro Michiardi (Eurecom) Tutorial: MapReduce 61 / 191

Page 62: MapReduce Tutorial

Hadoop MapReduce HDFS in details

HDFS in details

Pietro Michiardi (Eurecom) Tutorial: MapReduce 62 / 191

Page 63: MapReduce Tutorial

Hadoop MapReduce HDFS in details

The Hadoop Distributed Filesystem

Large dataset(s) outgrowing the storage capacity of a single physical machine
  - Need to partition it across a number of separate machines
  - Network-based system, with all its complications
  - Tolerate failures of machines

Hadoop Distributed Filesystem [10, 11]
  - Very large files
  - Streaming data access
  - Commodity hardware

Pietro Michiardi (Eurecom) Tutorial: MapReduce 63 / 191

Page 64: MapReduce Tutorial

Hadoop MapReduce HDFS in details

HDFS Blocks

(Big) files are broken into block-sized chunks
  - NOTE: a file that is smaller than a single block does not occupy a full block’s worth of underlying storage

Blocks are stored on independent machines
  - Reliability and parallel access

Why is a block so large?
  - Make transfer times larger than seek latency
  - E.g.: assume the seek time is 10 ms and the transfer rate is 100 MB/s; if you want the seek time to be 1% of the transfer time, then the block size should be 100 MB

Pietro Michiardi (Eurecom) Tutorial: MapReduce 64 / 191

Page 65: MapReduce Tutorial

Hadoop MapReduce HDFS in details

NameNodes and DataNodes

NameNode
  - Keeps metadata in RAM
  - Each block’s information occupies roughly 150 bytes of memory
  - Without the NameNode, the filesystem cannot be used
    - Persistence of metadata: synchronous and atomic writes to NFS

Secondary NameNode
  - Merges the namespace image with the edit log
  - A useful trick to recover from a failure of the NameNode is to use the NFS copy of metadata and switch the secondary to primary

DataNode
  - They store data and talk to clients
  - They report periodically to the NameNode the list of blocks they hold

Pietro Michiardi (Eurecom) Tutorial: MapReduce 65 / 191

Page 66: MapReduce Tutorial

Hadoop MapReduce HDFS in details

Anatomy of a File Read

The NameNode is only used to get block locations
  - Unresponsive DataNodes are discarded by clients
  - Batch reading of blocks is allowed

“External” clients
  - For each block, the NameNode returns a set of DataNodes holding a copy thereof
  - DataNodes are sorted according to their proximity to the client

“MapReduce” clients
  - TaskTrackers and DataNodes are colocated
  - For each block, the NameNode usually (3) returns the local DataNode

(3) Exceptions exist due to stragglers.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 66 / 191

Page 67: MapReduce Tutorial

Hadoop MapReduce HDFS in details

Anatomy of a File Write

Details on replication
  - Clients ask the NameNode for a list of suitable DataNodes
  - This list forms a pipeline: the first DataNode stores a copy of a block, then forwards it to the second, and so on

Replica placement
  - Tradeoff between reliability and bandwidth
  - Default placement:
    - First copy on the “same” node as the client, second replica is off-rack, third replica is on the same rack as the second but on a different node
    - Since Hadoop 0.21, replica placement can be customized

Pietro Michiardi (Eurecom) Tutorial: MapReduce 67 / 191

Page 68: MapReduce Tutorial

Hadoop MapReduce HDFS in details

Network Topology and HDFS

Pietro Michiardi (Eurecom) Tutorial: MapReduce 68 / 191

Page 69: MapReduce Tutorial

Hadoop MapReduce HDFS in details

HDFS Coherency Model

“Read your writes” is not guaranteed
  - The namespace is updated
  - Block contents may not be visible after a write is finished
  - Application design (other than MapReduce) should use sync() to force synchronization
  - sync() involves some overhead: a tradeoff between robustness/consistency and throughput

Multiple writers (for the same block) are not supported
  - Instead, different blocks can be written in parallel (using MapReduce)

Pietro Michiardi (Eurecom) Tutorial: MapReduce 69 / 191

Page 70: MapReduce Tutorial

Hadoop MapReduce Hadoop I/O

Hadoop I/O

Pietro Michiardi (Eurecom) Tutorial: MapReduce 70 / 191

Page 71: MapReduce Tutorial

Hadoop MapReduce Hadoop I/O

I/O operations in Hadoop

Reading and writing data
  - From/to HDFS
  - From/to local disk drives
  - Across machines (inter-process communication)

Customized tools for large amounts of data
  - Hadoop does not use Java native classes
  - Allows flexibility for dealing with custom data (e.g. binary)

What’s next
  - Overview of what Hadoop offers
  - For in-depth knowledge, see [11]

Pietro Michiardi (Eurecom) Tutorial: MapReduce 71 / 191

Page 72: MapReduce Tutorial

Hadoop MapReduce Hadoop I/O

Data Integrity

Every I/O operation on disks or the network may corrupt data
  - Users expect data not to be corrupted during storage or processing
  - Data integrity is usually achieved with checksums

HDFS transparently checksums all data during I/O
  - HDFS makes sure that the storage overhead is roughly 1%
  - DataNodes are in charge of checksumming
    - With replication, the last replica performs the check
    - Checksums are timestamped and logged for statistics on disks
  - Checksumming is also run periodically in a separate thread
    - Note that thanks to replication, error correction is possible

Pietro Michiardi (Eurecom) Tutorial: MapReduce 72 / 191

Page 73: MapReduce Tutorial

Hadoop MapReduce Hadoop I/O

Compression

Why use compression
  - Reduce storage requirements
  - Speed up data transfers (across the network or from disks)

Compression and input splits
  - IMPORTANT: use a compression format that supports splitting (e.g. bzip2)

Splittable files, Example 1
  - Consider an uncompressed file of 1 GB
  - HDFS will split it into 16 blocks of 64 MB each, to be processed by separate Mappers

Pietro Michiardi (Eurecom) Tutorial: MapReduce 73 / 191

Page 74: MapReduce Tutorial

Hadoop MapReduce Hadoop I/O

Compression

Splittable files, Example 2 (gzip)
  - Consider a compressed file of 1 GB
  - HDFS will split it into 16 blocks of 64 MB each
  - Creating an InputSplit for each block will not work, since it is not possible to read at an arbitrary point

What’s the problem?
  - This forces MapReduce to treat the file as a single split
  - Then, a single Mapper is fired by the framework
  - For this Mapper, only 1/16th of the data is local, the rest comes from the network

Which compression format to use?
  - Use bzip2
  - Otherwise, use SequenceFiles
  - See Chapter 4 (page 84) of [11]

Pietro Michiardi (Eurecom) Tutorial: MapReduce 74 / 191

Page 75: MapReduce Tutorial

Hadoop MapReduce Hadoop I/O

Serialization

Transforms structured objects into a byte stream
  - For transmission over the network: Hadoop uses RPC
  - For persistent storage on disks

Hadoop uses its own serialization format, Writable
  - Comparison of types is crucial (Shuffle and Sort phase): Hadoop provides a custom RawComparator, which avoids deserialization
  - Custom Writables give full control over the binary representation of data
  - “External” frameworks are also allowed: enter Avro

Fixed-length or variable-length encoding?
  - Fixed-length: when the distribution of values is uniform
  - Variable-length: when the distribution of values is not uniform

Pietro Michiardi (Eurecom) Tutorial: MapReduce 75 / 191

Page 76: MapReduce Tutorial

Hadoop MapReduce Hadoop I/O

Sequence Files

Specialized data structure to hold custom input data
  - Using blobs of binaries is not efficient

SequenceFiles
  - Provide a persistent data structure for binary key-value pairs
  - Also work well as containers for smaller files, so that the framework is happier (remember: better a few large files than lots of small files)
  - They come with the sync() method to introduce sync points that help manage InputSplits for MapReduce
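A short illustrative sketch (not from the slides) of writing a SequenceFile with the Hadoop API; the output path, the key/value types, and the number of records are arbitrary choices.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Writes binary key-value pairs into a SequenceFile; such a file can later
    // be consumed, one record at a time, by SequenceFileInputFormat.
    public class SequenceFileWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path(args[0]);

            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
            try {
                for (int i = 0; i < 100; i++) {
                    writer.append(new Text("record-" + i), new IntWritable(i));
                }
            } finally {
                writer.close();
            }
        }
    }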

Pietro Michiardi (Eurecom) Tutorial: MapReduce 76 / 191

Page 77: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

How Hadoop MapReduce Works

Pietro Michiardi (Eurecom) Tutorial: MapReduce 77 / 191

Page 78: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Anatomy of a MapReduce Job Run

Pietro Michiardi (Eurecom) Tutorial: MapReduce 78 / 191

Page 79: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Job Submission

JobClient class
  - The runJob() method creates a new instance of a JobClient
  - Then it calls submitJob() on this class

Simple verifications on the job
  - Is there an output directory?
  - Are there any input splits?
  - Can I copy the JAR of the job to HDFS?

NOTE: the JAR of the job is replicated 10 times

Pietro Michiardi (Eurecom) Tutorial: MapReduce 79 / 191

Page 80: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Job Initialization

The JobTracker is responsible for:
  - Creating an object for the job
  - Encapsulating its tasks
  - Bookkeeping of the tasks’ status and progress

This is where the scheduling happens
  - The JobTracker performs scheduling by maintaining a queue
  - Queueing disciplines are pluggable

Compute mappers and reducers
  - The JobTracker retrieves the input splits (computed by the JobClient)
  - Determines the number of Mappers based on the number of input splits
  - Reads the configuration file to set the number of Reducers

Pietro Michiardi (Eurecom) Tutorial: MapReduce 80 / 191

Page 81: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Task Assignment

Heartbeat-based mechanism
  - TaskTrackers periodically send heartbeats to the JobTracker
  - The TaskTracker is alive
  - Heartbeats also contain information on the availability of the TaskTracker to execute a task
  - The JobTracker piggybacks a task if the TaskTracker is available

Selecting a task
  - The JobTracker first needs to select a job (i.e. scheduling)
  - TaskTrackers have a fixed number of slots for map and reduce tasks
  - The JobTracker gives priority to map tasks (WHY?)

Data locality
  - The JobTracker is topology aware
    - Useful for map tasks
    - Unused for reduce tasks

Pietro Michiardi (Eurecom) Tutorial: MapReduce 81 / 191

Page 82: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Task Execution

Task assignment is done; now TaskTrackers can execute
  - Copy the JAR from HDFS
  - Create a local working directory
  - Create an instance of TaskRunner

TaskRunner launches a child JVM
  - This prevents bugs from stalling the TaskTracker
  - A new child JVM is created per InputSplit
    - Can be overridden by specifying the JVM Reuse option, which is very useful for custom, in-memory, combiners

Streaming and Pipes
  - User-defined map and reduce methods need not be in Java
  - Streaming and Pipes allow C++ or Python mappers and reducers
  - We will cover Dumbo

Pietro Michiardi (Eurecom) Tutorial: MapReduce 82 / 191

Page 83: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Handling Failures

In the real world, code is buggy, processes crash and machines fail

Task failure
  - Case 1: a map or reduce task throws a runtime exception
    - The child JVM reports back to the parent TaskTracker
    - The TaskTracker logs the error and marks the TaskAttempt as failed
    - The TaskTracker frees up a slot to run another task
  - Case 2: hanging tasks
    - The TaskTracker notices no progress updates (timeout = 10 minutes)
    - The TaskTracker kills the child JVM (4)
  - The JobTracker is notified of a failed task
    - It avoids rescheduling the task on the same TaskTracker
    - If a task fails 4 times, it is not re-scheduled (5)
    - Default behavior: if any task fails 4 times, the job fails

(4) With streaming, you need to take care of the orphaned process.
(5) An exception is made for speculative execution.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 83 / 191

Page 84: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Handling Failures

TaskTracker failure
  - Types: crash, running very slowly
  - Heartbeats will not be sent to the JobTracker
  - The JobTracker waits for a timeout (10 minutes), then removes the TaskTracker from its scheduling pool
  - The JobTracker needs to reschedule even completed tasks (WHY?)
  - The JobTracker needs to reschedule tasks in progress
  - The JobTracker may even blacklist a TaskTracker if too many tasks failed

JobTracker failure
  - Currently, Hadoop has no mechanism for this kind of failure
  - In future releases:
    - Multiple JobTrackers
    - Use ZooKeeper as a coordination mechanism

Pietro Michiardi (Eurecom) Tutorial: MapReduce 84 / 191

Page 85: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Scheduling

FIFO Scheduler (default behavior)
  - Each job uses the whole cluster
  - Not suitable for a shared production-level cluster
    - Long jobs monopolize the cluster
    - Short jobs can be held back and have no guarantees on execution time

Fair Scheduler
  - Every user gets a fair share of the cluster capacity over time
  - Jobs are placed into pools, one for each user
    - Users that submit more jobs have no more resources than others
    - Can guarantee a minimum capacity per pool
  - Supports preemption
  - “Contrib” module, requires manual installation

Capacity Scheduler
  - Hierarchical queues (mimic an organization)
  - FIFO scheduling in each queue
  - Supports priority

Pietro Michiardi (Eurecom) Tutorial: MapReduce 85 / 191

Page 86: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort

The MapReduce framework guarantees the input to every reducer to be sorted by key
  - The process by which the system sorts and transfers map outputs to reducers is known as the shuffle

The shuffle is the most important part of the framework, where the “magic” happens
  - A good understanding allows optimizing both the framework and the execution time of MapReduce jobs

Subject to continuous refinements

Pietro Michiardi (Eurecom) Tutorial: MapReduce 86 / 191

Page 87: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort: the Map Side

Pietro Michiardi (Eurecom) Tutorial: MapReduce 87 / 191

Page 88: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort: the Map Side

The output of a map task is not simply written to disk
  - In-memory buffering
  - Pre-sorting

Circular memory buffer
  - 100 MB by default
  - Threshold-based mechanism to spill buffer content to disk
  - Map output is written to the buffer while spilling to disk
  - If the buffer fills up while spilling, the map task is blocked

Disk spills
  - Written in round-robin to a local dir
  - Output data is partitioned corresponding to the reducers they will be sent to
  - Within each partition, data is sorted (in memory)
  - Optionally, if there is a combiner, it is executed just after the sort phase

Pietro Michiardi (Eurecom) Tutorial: MapReduce 88 / 191

Page 89: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort: the Map Side

More on spills and the memory buffer
  - Each time the buffer is full, a new spill file is created
  - Once the map task finishes, there are many spill files
  - Such spills are merged into a single partitioned and sorted output file

The output file partitions are made available to reducers over HTTP
  - There are 40 (default) threads dedicated to serving the file partitions to reducers

Pietro Michiardi (Eurecom) Tutorial: MapReduce 89 / 191

Page 90: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort: the Map Side

Pietro Michiardi (Eurecom) Tutorial: MapReduce 90 / 191

Page 91: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort: the Reduce Side

The map output file is located on the local disk of the TaskTracker

Another TaskTracker (in charge of a reduce task) requires input from many other TaskTrackers (that finished their map tasks)
  - How do reducers know which TaskTrackers to fetch map output from?
    - When a map task finishes, it notifies the parent TaskTracker
    - The TaskTracker notifies (with the heartbeat mechanism) the JobTracker
    - A thread in the reducer polls the JobTracker periodically
    - TaskTrackers do not delete local map output as soon as a reduce task has fetched it (WHY?)

Copy phase: a pull approach
  - There is a small number (5) of copy threads that can fetch map outputs in parallel

Pietro Michiardi (Eurecom) Tutorial: MapReduce 91 / 191

Page 92: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort: the Reduce Side

The map outputs are copied to the memory of the TaskTracker running the reducer (if they fit)
  - Otherwise they are copied to disk

Input consolidation
  - A background thread merges all partial inputs into larger, sorted files
  - Note that if compression was used (for map outputs, to save bandwidth), decompression will take place in memory

Sorting the input
  - When all map outputs have been copied, a merge phase starts
  - All map outputs are sorted, maintaining their sort ordering, in rounds

Pietro Michiardi (Eurecom) Tutorial: MapReduce 92 / 191

Page 93: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Hadoop MapReduce Types and Formats

Pietro Michiardi (Eurecom) Tutorial: MapReduce 93 / 191

Page 94: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

MapReduce Types

Input / output to mappers and reducers
  - map: (k1, v1) → [(k2, v2)]
  - reduce: (k2, [v2]) → [(k3, v3)]

In Hadoop, a mapper is created as follows:
  - void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)

Types:
  - K types implement WritableComparable
  - V types implement Writable

Pietro Michiardi (Eurecom) Tutorial: MapReduce 94 / 191

Page 95: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

What is a Writable

Hadoop defines its own classes for strings (Text), integers (IntWritable), etc.

All keys are instances of WritableComparable
  - Why comparable?

All values are instances of Writable
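As an illustration (not part of the slides), a hypothetical composite key implementing WritableComparable; the field names and the chosen sort order are made up.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // A composite key: keys must be WritableComparable so the framework can
    // both serialize them and sort them during shuffle and sort.
    public class PairWritable implements WritableComparable<PairWritable> {
        private String left;
        private int right;

        public PairWritable() { }                      // required no-arg constructor

        public PairWritable(String left, int right) {
            this.left = left;
            this.right = right;
        }

        public void write(DataOutput out) throws IOException {
            out.writeUTF(left);
            out.writeInt(right);
        }

        public void readFields(DataInput in) throws IOException {
            left = in.readUTF();
            right = in.readInt();
        }

        public int compareTo(PairWritable other) {     // defines the sort order
            int cmp = left.compareTo(other.left);
            if (cmp != 0) {
                return cmp;
            }
            return right < other.right ? -1 : (right == other.right ? 0 : 1);
        }

        @Override
        public int hashCode() {
            // used by the default HashPartitioner to assign keys to reducers
            return left.hashCode() * 163 + right;
        }
    }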

Pietro Michiardi (Eurecom) Tutorial: MapReduce 95 / 191

Page 96: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Getting Data to the Mapper

Pietro Michiardi (Eurecom) Tutorial: MapReduce 96 / 191

Page 97: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Reading Data

Datasets are specified by InputFormats
  - InputFormats define the input data (e.g. a file, a directory)
  - An InputFormat is a factory for RecordReader objects that extract key-value records from the input source

InputFormats identify partitions of the data that form an InputSplit
  - An InputSplit is a (reference to a) chunk of the input processed by a single map
    - The largest split is processed first
  - Each split is divided into records, and the map processes each record (a key-value pair) in turn
  - Splits and records are logical, they are not physically bound to a file

Pietro Michiardi (Eurecom) Tutorial: MapReduce 97 / 191

Page 98: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

The relationship between InputSplit and HDFS blocks

Pietro Michiardi (Eurecom) Tutorial: MapReduce 98 / 191

Page 99: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

FileInputFormat and Friends

TextInputFormat
  - Treats each newline-terminated line of a file as a value

KeyValueTextInputFormat
  - Maps newline-terminated text lines of “key” SEPARATOR “value”

SequenceFileInputFormat
  - Binary file of key-value pairs with some additional metadata

SequenceFileAsTextInputFormat
  - Same as before, but maps (k.toString(), v.toString())

Pietro Michiardi (Eurecom) Tutorial: MapReduce 99 / 191

Page 100: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Filtering File Inputs

FileInputFormat reads all files out of a specified directory and sends them to the mapper

Delegates filtering of this file list to a method that subclasses may override
  - Example: create your own “xyzFileInputFormat” to read *.xyz from a directory list

Pietro Michiardi (Eurecom) Tutorial: MapReduce 100 / 191

Page 101: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Record Readers

Each InputFormat provides its own RecordReader implementation

LineRecordReader
  - Reads a line from a text file

KeyValueRecordReader
  - Used by KeyValueTextInputFormat

Pietro Michiardi (Eurecom) Tutorial: MapReduce 101 / 191

Page 102: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Input Split Size

FileInputFormat divides large files into chunks
  - Exact size controlled by mapred.min.split.size

Record readers receive the file, offset, and length of the chunk
  - Example input:

      On the top of the Crumpetty Tree
      The Quangle Wangle sat,
      But his face you could not see,
      On account of his Beaver Hat.

    is read as the records:

      (0, On the top of the Crumpetty Tree)
      (33, The Quangle Wangle sat,)
      (57, But his face you could not see,)
      (89, On account of his Beaver Hat.)

Custom InputFormat implementations may override split size

Pietro Michiardi (Eurecom) Tutorial: MapReduce 102 / 191

Page 103: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Sending Data to Reducers

The map function receives an OutputCollector object
  - OutputCollector.collect() receives key-value elements

Any (WritableComparable, Writable) pair can be used

By default, the mapper output type is assumed to be the same as the reducer output type

Pietro Michiardi (Eurecom) Tutorial: MapReduce 103 / 191

Page 104: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

WritableComparator

Compares WritableComparable data
  - Will call the WritableComparable.compareTo() method
  - Can provide a fast path for serialized data

Configured through: JobConf.setOutputValueGroupingComparator()

Pietro Michiardi (Eurecom) Tutorial: MapReduce 104 / 191

Page 105: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Partitioner

int getPartition(key, value, numPartitions)
  - Outputs the partition number for a given key
  - One partition == all values sent to a single reduce task

HashPartitioner is used by default
  - Uses key.hashCode() to return the partition number

JobConf is used to set the Partitioner implementation

Pietro Michiardi (Eurecom) Tutorial: MapReduce 105 / 191

Page 106: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

The Reducer

void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)

Keys and values sent to one partition all go to the same reduce task

Calls are sorted by key
  - “Early” keys are reduced and output before “late” keys

Pietro Michiardi (Eurecom) Tutorial: MapReduce 106 / 191

Page 107: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Writing the Output

Pietro Michiardi (Eurecom) Tutorial: MapReduce 107 / 191

Page 108: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Writing the Output

Analogous to InputFormat

TextOutputFormat writes “key value <newline>” strings to the output file

SequenceFileOutputFormat uses a binary format to pack key-value pairs

NullOutputFormat discards output

Pietro Michiardi (Eurecom) Tutorial: MapReduce 108 / 191

Page 109: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Hadoop MapReduce Features

Pietro Michiardi (Eurecom) Tutorial: MapReduce 109 / 191

Page 110: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Developing a MapReduce Application

Pietro Michiardi (Eurecom) Tutorial: MapReduce 110 / 191

Page 111: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Preliminaries

Writing a program in MapReduce has a certain flow to it
  - Start by writing the map and reduce functions
    - Write unit tests to make sure they do what they should
  - Write a driver program to run a job
    - The job can be run from the IDE using a small subset of the data
    - The debugger of the IDE can be used
  - Eventually, you can unleash the job on a cluster
    - Debugging a distributed program is challenging

Once the job is running properly
  - Perform standard checks to improve performance
  - Perform task profiling

Pietro Michiardi (Eurecom) Tutorial: MapReduce 111 / 191

Page 112: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Configuration

Before writing a MapReduce program, we need to set up and configure the development environment
  - Components in Hadoop are configured with an ad hoc API
  - The Configuration class is a collection of properties and their values
  - Resources can be combined into a configuration (see the sketch below)
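A small illustrative sketch of the Configuration API (not from the slides); the resource file names and the custom property below are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    // Combining configuration resources: properties in later resources
    // override earlier ones (unless a property is marked final).
    public class ConfDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.addResource(new Path("conf/core-site.xml"));
            conf.addResource(new Path("conf/my-overrides.xml")); // hypothetical extra resource

            System.out.println(conf.get("fs.default.name", "file:///"));
            System.out.println(conf.getInt("my.custom.property", 42)); // default if unset
        }
    }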

Configuring the IDE
  - In the IDE, create a new project and add all the JAR files from the top level of the distribution and from the lib directory
  - For Eclipse there are also plugins available
  - Commercial IDEs also exist (Karmasphere)

Alternatives
  - Switch configurations (local, cluster)
  - Alternatives (see the Cloudera documentation for Ubuntu) are very effective

Pietro Michiardi (Eurecom) Tutorial: MapReduce 112 / 191

Page 113: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Local Execution

Use GenericOptionsParser, Tool and ToolRunner
  - These helper classes make it easy to intervene on job configurations
  - These are additional configurations to the core configuration

The run() method
  - Constructs and configures a JobConf object and launches it (a sketch follows)
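A sketch of such a driver (not reproduced from the slides), reusing the Map and Reduce classes from the earlier word-count sketch; the job name and paths are illustrative, and the old (org.apache.hadoop.mapred) API is assumed.

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Driver built on Tool/ToolRunner: ToolRunner parses generic options, so
    // configuration can be overridden from the command line,
    // e.g. -D mapred.reduce.tasks=0 for a quick local run.
    public class WordCountDriver extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            JobConf conf = new JobConf(getConf(), WordCountDriver.class);
            conf.setJobName("word count");

            conf.setMapperClass(WordCount.Map.class);    // classes from the earlier sketch
            conf.setReducerClass(WordCount.Reduce.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);                      // submit and wait for completion
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new WordCountDriver(), args));
        }
    }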

How many reducers?
  - In a local execution, there is a single (possibly no) reducer
  - Even if you set the number of reducers to more than one, the option will be ignored

Pietro Michiardi (Eurecom) Tutorial: MapReduce 113 / 191

Page 114: MapReduce Tutorial

Hadoop MapReduce Hadoop MapReduce in details

Cluster Execution

Packaging
Launching a Job
The WebUI
Hadoop Logs
Running Dependent Jobs, and Oozie

Pietro Michiardi (Eurecom) Tutorial: MapReduce 114 / 191

Page 115: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Hadoop Deployments

Pietro Michiardi (Eurecom) Tutorial: MapReduce 115 / 191

Page 116: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Setting up a Hadoop Cluster

Cluster deployment
  - Private cluster
  - Cloud-based cluster
  - AWS Elastic MapReduce

Outlook:
  - Cluster specification
    - Hardware
    - Network topology
  - Hadoop configuration
    - Memory considerations

Pietro Michiardi (Eurecom) Tutorial: MapReduce 116 / 191

Page 117: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Cluster Specification

Commodity hardware
  - Commodity ≠ low-end
    - False economy due to failure rate and maintenance costs
  - Commodity ≠ high-end
    - High-end machines perform better, which would imply a smaller cluster
    - A single machine failure would compromise a large fraction of the cluster

A 2010 specification:
  - 2 quad-cores
  - 16-24 GB ECC RAM
  - 4 × 1 TB SATA disks (6)
  - Gigabit Ethernet

(6) Why not use RAID instead of JBOD?

Pietro Michiardi (Eurecom) Tutorial: MapReduce 117 / 191

Page 118: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Cluster Specification

Example:
  - Assume your data grows by 1 TB per week
  - Assume you have three-way replication in HDFS
  → You need an additional 3 TB of raw storage per week
  - Allow for some overhead (temporary files, logs)
  → This is a new machine per week

How to dimension a cluster?
  - Obviously, you won’t buy a machine per week!
  - The idea is to project the above back-of-the-envelope calculation over the 2-year lifetime of your system
  → You would need a 100-machine cluster

Where should you put the various components?
  - Small cluster: the NameNode and the JobTracker can be colocated
  - Large cluster: requires more RAM at the NameNode

Pietro Michiardi (Eurecom) Tutorial: MapReduce 118 / 191

Page 119: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Cluster Specification

Should we use 64-bit or 32-bit machines?
  - The NameNode should run on a 64-bit machine: this avoids the 3 GB Java heap size limit on 32-bit machines
  - Other components should run on 32-bit machines to avoid the memory overhead of large pointers

What’s the role of Java?
  - Recent releases (Java 6) implement some optimizations to eliminate large pointer overhead
  → A cluster of 64-bit machines has no downside

Pietro Michiardi (Eurecom) Tutorial: MapReduce 119 / 191

Page 120: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Cluster Specification: Network Topology

Pietro Michiardi (Eurecom) Tutorial: MapReduce 120 / 191

Page 121: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Cluster Specification: Network Topology

Two-level network topology
  - Switch redundancy is not shown in the figure

Typical configuration
  - 30-40 servers per rack
  - 1 Gbit switch per rack
  - Core switch or router with 1 Gbit or better

Features
  - Aggregate bandwidth between nodes on the same rack is much larger than for nodes on different racks
  - Rack awareness
    - Hadoop should know the cluster topology
    - Benefits both HDFS (data placement) and MapReduce (locality)

Pietro Michiardi (Eurecom) Tutorial: MapReduce 121 / 191

Page 122: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Hadoop Configuration

There are a handful of files for controlling the operation of a Hadoop cluster
  - See the next slide for a summary table

Managing the configuration across several machines
  - All machines of a Hadoop cluster must be in sync!
  - What happens if you dispatch an update and some machines are down?
  - What happens when you add (new) machines to your cluster?
  - What if you need to patch MapReduce?

Common practice: use configuration management tools
  - Chef, Puppet, ...
  - Declarative language to specify configurations
  - They also allow installing software

Pietro Michiardi (Eurecom) Tutorial: MapReduce 122 / 191

Page 123: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Hadoop Configuration

Filename         | Format                   | Description
hadoop-env.sh    | Bash script              | Environment variables that are used in the scripts to run Hadoop.
core-site.xml    | Hadoop configuration XML | I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml    | Hadoop configuration XML | Settings for the namenode, the secondary namenode, and the datanodes.
mapred-site.xml  | Hadoop configuration XML | Settings for the jobtracker and the tasktrackers.
masters          | Plain text               | A list of machines that each run a secondary namenode.
slaves           | Plain text               | A list of machines that each run a datanode and a tasktracker.

Table: Hadoop Configuration Files

Pietro Michiardi (Eurecom) Tutorial: MapReduce 123 / 191

Page 124: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Hadoop Configuration: Memory Utilization

Hadoop uses a lot of memory
  - Default values, for a typical cluster configuration
    - DataNode: 1 GB
    - TaskTracker: 1 GB
    - Child JVM map tasks: 2 × 200 MB
    - Child JVM reduce tasks: 2 × 200 MB

All the moving parts of Hadoop (HDFS and MapReduce) can be individually configured
  - This is true for the cluster configuration but also for job-specific configurations (see the sketch below)

Hadoop is fast when using RAM
  - Generally, MapReduce jobs are not CPU-bound
  - Avoid I/O on disk as much as you can
  - Minimize network traffic
    - Customize the partitioner
    - Use compression (→ decompression happens in RAM)
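An illustrative sketch along these lines (not from the slides), using pre-0.21 property and method names; the heap size and codec below are arbitrary and would have to be tuned for a real cluster.

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    // Per-job tuning: heap of the child JVMs and compression of map output.
    public class JobTuning {
        public static void configure(JobConf conf) {
            conf.set("mapred.child.java.opts", "-Xmx512m"); // heap of the child JVMs (example value)
            conf.setCompressMapOutput(true);                // compress map output to save network bandwidth
            conf.setMapOutputCompressorClass(GzipCodec.class);
        }
    }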

Pietro Michiardi (Eurecom) Tutorial: MapReduce 124 / 191

Page 125: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Elephants in the cloud!

Many organizations run Hadoop in private clusters
I Pros and cons

Cloud-based Hadoop installations (Amazon-biased)
I Use Cloudera + Whirr
I Use Elastic MapReduce

Pietro Michiardi (Eurecom) Tutorial: MapReduce 125 / 191

Page 126: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Hadoop on EC2

Launch instances of a cluster on demand, paying by the hour
I You pay mainly for CPU time; in general, bandwidth used from within the datacenter is free

Apache Whirr project
I Launch, terminate, modify a running cluster
I Requires AWS credentials

Example
I Launch a cluster test-hadoop-cluster, with one master node (JobTracker and NameNode) and 5 worker nodes (DataNodes and TaskTrackers)
→ hadoop-ec2 launch-cluster test-hadoop-cluster 5
I See the project webpage and Chapter 9, page 290 [11]

Pietro Michiardi (Eurecom) Tutorial: MapReduce 126 / 191

Page 127: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

AWS Elastic MapReduce

Hadoop as a service
I Amazon handles everything, which becomes transparent
I How this is done internally remains a mystery

Focus on What, not How
I All you need to do is package a MapReduce job in a JAR and upload it using a Web interface
I Other job types are available: Python, Pig, Hive, ...
I Test your jobs locally!!!

Pietro Michiardi (Eurecom) Tutorial: MapReduce 127 / 191

Page 128: MapReduce Tutorial

Hadoop MapReduce Hadoop Deployments

Part Three

Pietro Michiardi (Eurecom) Tutorial: MapReduce 128 / 191

Page 129: MapReduce Tutorial

Algorithm Design

Algorithm Design in MapReduce

Pietro Michiardi (Eurecom) Tutorial: MapReduce 129 / 191

Page 130: MapReduce Tutorial

Algorithm Design Preliminaries

Preliminaries

Pietro Michiardi (Eurecom) Tutorial: MapReduce 130 / 191

Page 131: MapReduce Tutorial

Algorithm Design Preliminaries

Algorithm Design

Developing algorithms involves:
I Preparing the input data
I Implementing the mapper and the reducer
I Optionally, designing the combiner and the partitioner

How to recast existing algorithms in MapReduce?
I It is not always obvious how to express algorithms
I Data structures play an important role
I Optimization is hard
→ The designer needs to “bend” the framework

Learn by examples
I “Design patterns”
I Synchronization is perhaps the trickiest aspect

Pietro Michiardi (Eurecom) Tutorial: MapReduce 131 / 191

Page 132: MapReduce Tutorial

Algorithm Design Preliminaries

Algorithm Design

Aspects that are not under the control of the designer
I Where a mapper or reducer will run
I When a mapper or reducer begins or finishes
I Which input key-value pairs are processed by a specific mapper
I Which intermediate key-value pairs are processed by a specific reducer

Aspects that can be controlled
I Construct data structures as keys and values
I Execute user-specified initialization and termination code for mappers and reducers
I Preserve state across multiple input and intermediate keys in mappers and reducers
I Control the sort order of intermediate keys, and therefore the order in which a reducer will encounter particular keys
I Control the partitioning of the key space, and therefore the set of keys that will be encountered by a particular reducer

Pietro Michiardi (Eurecom) Tutorial: MapReduce 132 / 191

Page 133: MapReduce Tutorial

Algorithm Design Preliminaries

Algorithm Design

MapReduce jobs can be complex
I Many algorithms cannot be easily expressed as a single MapReduce job
I Decompose complex algorithms into a sequence of jobs
F Requires orchestrating data so that the output of one job becomes the input to the next
I Iterative algorithms require an external driver to check for convergence

Optimizations
I Scalability (linear)
I Resource requirements (storage and bandwidth)

Outline
I Local Aggregation
I Pairs and Stripes
I Order inversion
I Graph algorithms

Pietro Michiardi (Eurecom) Tutorial: MapReduce 133 / 191

Page 134: MapReduce Tutorial

Algorithm Design Local Aggregation

Local Aggregation

Pietro Michiardi (Eurecom) Tutorial: MapReduce 134 / 191

Page 135: MapReduce Tutorial

Algorithm Design Local Aggregation

Local Aggregation

In the context of data-intensive distributed processing, the most important aspect of synchronization is the exchange of intermediate results
I This involves copying intermediate results from the processes that produced them to those that consume them
I In general, this involves data transfers over the network
I In Hadoop, disk I/O is also involved, as intermediate results are written to disk

Network and disk latencies are expensive
I Reducing the amount of intermediate data translates into algorithmic efficiency

Combiners and preserving state across inputsI Reduce the number and size of key-value pairs to be shuffled

Pietro Michiardi (Eurecom) Tutorial: MapReduce 135 / 191

Page 136: MapReduce Tutorial

Algorithm Design Local Aggregation

Combiners

Combiners are a general mechanism to reduce the amount of intermediate data
I They can be thought of as “mini-reducers”

Example: word count
I Combiners aggregate term counts across the documents processed by each map task
I If combiners take advantage of all opportunities for local aggregation, we have at most m × V intermediate key-value pairs
F m: number of mappers
F V: number of unique terms in the collection
I Note: due to the Zipfian nature of term distributions, not all mappers will see all terms
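To make the example concrete, here is a minimal Hadoop (Java) sketch of word count in which the reducer doubles as the combiner; the class names and job wiring are illustrative assumptions, not part of the original slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: emits (term, 1) for every token of the input line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums partial counts; because the sum is commutative and
  // associative, the same class can be registered as the combiner
  // via job.setCombinerClass(IntSumReducer.class)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}

With the combiner enabled, each map task ships at most one (term, partial count) pair per unique term it has processed, which is the m × V bound discussed above.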

Pietro Michiardi (Eurecom) Tutorial: MapReduce 136 / 191

Page 137: MapReduce Tutorial

Algorithm Design Local Aggregation

Word Counting in MapReduce

Pietro Michiardi (Eurecom) Tutorial: MapReduce 137 / 191

Page 138: MapReduce Tutorial

Algorithm Design Local Aggregation

In-Mapper Combiners

In-Mapper Combiners, a possible improvement
I Hadoop does not guarantee that combiners will be executed

Use an associative array to accumulate intermediate results
I The array is used to tally up term counts within a single document
I The Emit method is called only after the whole input record (the document) has been processed

Example (see next slide)
I The code emits a key-value pair for each unique term in the document

Pietro Michiardi (Eurecom) Tutorial: MapReduce 138 / 191

Page 139: MapReduce Tutorial

Algorithm Design Local Aggregation

In-Mapper Combiners

Pietro Michiardi (Eurecom) Tutorial: MapReduce 139 / 191

Page 140: MapReduce Tutorial

Algorithm Design Local Aggregation

In-Mapper Combiners

Taking the idea one step further
I Exploit implementation details in Hadoop
I A Java mapper object is created for each map task
I JVM reuse must be enabled

Preserve state within and across calls to the Map method
I An Initialize (setup) method is used to create a persistent data structure
I A Close (cleanup) method is used to emit intermediate key-value pairs only when all the map tasks scheduled on one machine are done
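A minimal sketch of the simpler within-task variant, assuming the standard Hadoop Java API (setup and cleanup play the role of the Initialize and Close methods mentioned above); the across-task variant with JVM reuse follows the same structure but keeps the associative array in a static field.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: term counts are accumulated in an associative array
// that lives across calls to map(), and are emitted only in cleanup()
public class InMapperCombiningMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Map<String, Integer> counts;

  @Override
  protected void setup(Context context) {
    counts = new HashMap<String, Integer>();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String term = itr.nextToken();
      Integer current = counts.get(term);
      counts.put(term, current == null ? 1 : current + 1);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // One pair per unique term seen by this map task
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}

As noted on the following slides, the associative array must fit in memory; a common safeguard is to flush (emit and clear) it whenever it grows past a fixed number of entries.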

Pietro Michiardi (Eurecom) Tutorial: MapReduce 140 / 191

Page 141: MapReduce Tutorial

Algorithm Design Local Aggregation

In-Mapper Combiners

Pietro Michiardi (Eurecom) Tutorial: MapReduce 141 / 191

Page 142: MapReduce Tutorial

Algorithm Design Local Aggregation

In-Mapper Combiners

Summing up: a first “design pattern”, in-mapper combining
I Provides control over when local aggregation occurs
I The designer can determine how exactly aggregation is done

Efficiency vs. Combiners
I There is no additional overhead due to the materialization of key-value pairs
F No unnecessary object creation and destruction (garbage collection)
F No serialization and deserialization when memory bounded
I With ordinary combiners, mappers still need to emit all key-value pairs; combiners only reduce network traffic

Pietro Michiardi (Eurecom) Tutorial: MapReduce 142 / 191

Page 143: MapReduce Tutorial

Algorithm Design Local Aggregation

In-Mapper Combiners

Precautions
I In-mapper combining breaks the functional programming paradigm due to state preservation
I Preserving state across multiple instances implies that algorithm behavior might depend on execution order
F Ordering-dependent bugs are difficult to find

Scalability bottleneck
I The in-mapper combining technique strictly depends on having sufficient memory to store intermediate results
F And you don’t want the OS to deal with swapping
I Multiple threads compete for the same resources
I A possible solution: “block” and “flush”
F Implemented with a simple counter

Pietro Michiardi (Eurecom) Tutorial: MapReduce 143 / 191

Page 144: MapReduce Tutorial

Algorithm Design Local Aggregation

Further Remarks

The extent to which efficiency can be increased with local aggregation depends on the size of the intermediate key space
I Opportunities for aggregation arise when multiple values are associated with the same key

Local aggregation is also effective for dealing with reduce stragglers
I It reduces the number of values associated with frequently occurring keys

Pietro Michiardi (Eurecom) Tutorial: MapReduce 144 / 191

Page 145: MapReduce Tutorial

Algorithm Design Local Aggregation

Algorithmic correctness with local aggregation

The use of combiners must be thought out carefully
I In Hadoop, they are optional: the correctness of the algorithm cannot depend on the computation (or even the execution) of the combiners

In MapReduce, the reducer input key-value type must match the mapper output key-value type
I Hence, for combiners, both input and output key-value types must match the output key-value type of the mapper

Commutative and associative computations
I This is a special case, which worked for word counting
F There the combiner code is actually the reducer code
I In general, combiners and reducers are not interchangeable

Pietro Michiardi (Eurecom) Tutorial: MapReduce 145 / 191

Page 146: MapReduce Tutorial

Algorithm Design Local Aggregation

Algorithmic Correctness: an Example

Problem statement
I We have a large dataset where input keys are strings and input values are integers
I We wish to compute the mean of all integers associated with the same key
F In practice: the dataset can be a log from a website, where the keys are user IDs and the values are some measure of activity

Next, a baseline approach
I We use an identity mapper, which groups and sorts the input key-value pairs appropriately
I Reducers keep track of the running sum and the number of integers encountered
I The mean is emitted as the output of the reducer, with the input string as the key

Inefficiency problems in the shuffle phase

Pietro Michiardi (Eurecom) Tutorial: MapReduce 146 / 191

Page 147: MapReduce Tutorial

Algorithm Design Local Aggregation

Example: basic MapReduce to compute the mean of values

Pietro Michiardi (Eurecom) Tutorial: MapReduce 147 / 191

Page 148: MapReduce Tutorial

Algorithm Design Local Aggregation

Algorithmic Correctness: an Example

Note: the mean is not a distributive operation
I Mean(1,2,3,4,5) ≠ Mean(Mean(1,2), Mean(3,4,5))
I Hence: a combiner cannot output partial means and hope that the reducer will compute the correct final mean

Next, a failed attempt at solving the problem
I The combiner partially aggregates results by separating the two components needed to arrive at the mean
I The sum and the count of elements are packaged into a pair
I Using the same input string as the key, the combiner emits the pair

Pietro Michiardi (Eurecom) Tutorial: MapReduce 148 / 191

Page 149: MapReduce Tutorial

Algorithm Design Local Aggregation

Example: Wrong use of combiners

Pietro Michiardi (Eurecom) Tutorial: MapReduce 149 / 191

Page 150: MapReduce Tutorial

Algorithm Design Local Aggregation

Algorithmic Correctness: an Example

What’s wrong with the previous approach?
I Trivially, the input/output key-value types are not correct
I Remember that combiners are optimizations: the algorithm should work even when “removing” them

Executing the code omitting the combiner phase
I The output value type of the mapper is integer
I The reducer expects to receive a list of integers
I Instead, we make it expect a list of pairs

Next, a correct implementation of the combiner
I Note: the reducer is similar to the combiner!
I Exercise: verify the correctness
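Since the original pseudo-code figure is not reproduced here, the following Hadoop (Java) sketch shows one correct arrangement; the (sum, count) pair is packed into a Text value for brevity (a custom Writable would be cleaner), and the input is assumed to already consist of (string key, integer value) records.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanWithCombiner {

  // Mapper: each integer value becomes the partial pair (value, 1)
  public static class MeanMapper extends Mapper<Text, IntWritable, Text, Text> {
    @Override
    protected void map(Text key, IntWritable value, Context context)
        throws IOException, InterruptedException {
      context.write(key, new Text(value.get() + ":1"));
    }
  }

  // Combiner: sums partial (sum, count) pairs; its input and output types
  // both match the mapper output type, as required
  public static class MeanCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0, count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(":");
        sum += Long.parseLong(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      context.write(key, new Text(sum + ":" + count));
    }
  }

  // Reducer: performs the same aggregation, then emits the final mean
  public static class MeanReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0, count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(":");
        sum += Long.parseLong(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      context.write(key, new DoubleWritable((double) sum / count));
    }
  }
}

The job is correct whether or not the combiner runs, because the reducer can consume the mapper output directly.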

Pietro Michiardi (Eurecom) Tutorial: MapReduce 150 / 191

Page 151: MapReduce Tutorial

Algorithm Design Local Aggregation

Example: Correct use of combiners

Pietro Michiardi (Eurecom) Tutorial: MapReduce 151 / 191

Page 152: MapReduce Tutorial

Algorithm Design Local Aggregation

Algorithmic Correctness: an Example

Using in-mapper combining
I Inside the mapper, the partial sums and counts are held in memory (across inputs)
I Intermediate values are emitted only after the entire input split has been processed
I Similarly to before, the output value is a pair

Pietro Michiardi (Eurecom) Tutorial: MapReduce 152 / 191

Page 153: MapReduce Tutorial

Algorithm Design Pairs and Stripes

Pairs and Stripes

Pietro Michiardi (Eurecom) Tutorial: MapReduce 153 / 191

Page 154: MapReduce Tutorial

Algorithm Design Pairs and Stripes

Pairs and Stripes

A common approach in MapReduce: build complex keysI Data necessary for a computation are naturally brought together by

the framework

Two basic techniques:I Pairs: similar to the example on the averageI Stripes: uses in-mapper memory data structures

Next, we focus on a particular problem that benefits fromthese two methods

Pietro Michiardi (Eurecom) Tutorial: MapReduce 154 / 191

Page 155: MapReduce Tutorial

Algorithm Design Pairs and Stripes

Problem statement

The problem: building word co-occurrence matrices for large corpora
I The co-occurrence matrix of a corpus is a square n × n matrix
I n is the number of unique words (i.e., the vocabulary size)
I A cell mij contains the number of times the word wi co-occurs with word wj within a specific context
I Context: a sentence, a paragraph, a document, or a window of m words
I NOTE: the matrix may be symmetric in some cases

Motivation
I This problem is a basic building block for more complex operations
I Estimating the distribution of discrete joint events from a large number of observations
I Similar problems arise in other domains:
F Customers who buy this tend to also buy that

Pietro Michiardi (Eurecom) Tutorial: MapReduce 155 / 191

Page 156: MapReduce Tutorial

Algorithm Design Pairs and Stripes

Observations

Space requirements
I Clearly, the space requirement is O(n^2), where n is the size of the vocabulary
I For real-world (English) corpora, n can be hundreds of thousands of words, or even billions of words

So what’s the problem?I If the matrix can fit in the memory of a single machine, then just use

whatever naive implementationI Instead, if the matrix is bigger than the available memory, then

paging would kick in, and any naive implementation would break

CompressionI Such techniques can help in solving the problem on a single

machineI However, there are scalability problems

Pietro Michiardi (Eurecom) Tutorial: MapReduce 156 / 191

Page 157: MapReduce Tutorial

Algorithm Design Pairs and Stripes

Word co-occurrence: the Pairs approach

Input to the problem
I Key-value pairs in the form of a docid and a doc

The mapper:
I Processes each input document
I Emits key-value pairs with:
F Each co-occurring word pair as the key
F The integer one (the count) as the value
I This is done with two nested loops:
F The outer loop iterates over all words
F The inner loop iterates over all neighbors

The reducer:
I Receives pairs related to co-occurring words
F This requires modifying the partitioner
I Computes an absolute count of the joint event
I Emits the pair and the count as the final key-value output
F Basically, reducers emit the cells of the matrix
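A minimal Hadoop (Java) sketch of the Pairs approach; the co-occurring pair is encoded as a single Text key "wi:wj" rather than a custom paired Writable, and the window size and tokenization are illustrative assumptions.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CooccurrencePairs {

  // Mapper: two nested loops over the terms of the document; every
  // co-occurring (w, u) pair within the window is emitted with count 1
  public static class PairsMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private static final int WINDOW = 2;  // assumed context: +/- 2 words

    @Override
    protected void map(LongWritable docid, Text doc, Context context)
        throws IOException, InterruptedException {
      String[] terms = doc.toString().split("\\s+");
      for (int i = 0; i < terms.length; i++) {
        int start = Math.max(0, i - WINDOW);
        int end = Math.min(terms.length - 1, i + WINDOW);
        for (int j = start; j <= end; j++) {
          if (j == i) continue;
          context.write(new Text(terms[i] + ":" + terms[j]), ONE);
        }
      }
    }
  }

  // Reducer (also usable as combiner): sums the counts of each pair,
  // i.e., emits one cell of the co-occurrence matrix
  public static class PairsReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      context.write(pair, new IntWritable(sum));
    }
  }
}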

Pietro Michiardi (Eurecom) Tutorial: MapReduce 157 / 191

Page 158: MapReduce Tutorial

Algorithm Design Pairs and Stripes

Word co-occurrence: the Pairs approach

Pietro Michiardi (Eurecom) Tutorial: MapReduce 158 / 191

Page 159: MapReduce Tutorial

Algorithm Design Pairs and Stripes

Word co-occurrence: the Stripes approach

Input to the problem
I Key-value pairs in the form of a docid and a doc

The mapper:
I Same two-nested-loops structure as before
I Co-occurrence information is first stored in an associative array
I Emits key-value pairs with words as keys and the corresponding arrays as values

The reducer:
I Receives all associative arrays related to the same word
I Performs an element-wise sum of all associative arrays with the same key
I Emits key-value output in the form of word, associative array
F Basically, reducers emit rows of the co-occurrence matrix
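A corresponding Hadoop (Java) sketch of the Stripes approach, using MapWritable as the associative array; as above, the window size and tokenization are illustrative assumptions.

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CooccurrenceStripes {

  // Mapper: for every term, build an associative array (stripe) of
  // neighbor counts and emit (term, stripe)
  public static class StripesMapper
      extends Mapper<LongWritable, Text, Text, MapWritable> {

    private static final int WINDOW = 2;  // assumed context window

    @Override
    protected void map(LongWritable docid, Text doc, Context context)
        throws IOException, InterruptedException {
      String[] terms = doc.toString().split("\\s+");
      for (int i = 0; i < terms.length; i++) {
        MapWritable stripe = new MapWritable();
        int start = Math.max(0, i - WINDOW);
        int end = Math.min(terms.length - 1, i + WINDOW);
        for (int j = start; j <= end; j++) {
          if (j == i) continue;
          Text neighbor = new Text(terms[j]);
          IntWritable current = (IntWritable) stripe.get(neighbor);
          stripe.put(neighbor, new IntWritable(current == null ? 1 : current.get() + 1));
        }
        context.write(new Text(terms[i]), stripe);
      }
    }
  }

  // Reducer (also usable as combiner): element-wise sum of all stripes
  // with the same key, i.e., one row of the co-occurrence matrix
  public static class StripesReducer
      extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text term, Iterable<MapWritable> stripes, Context context)
        throws IOException, InterruptedException {
      MapWritable sum = new MapWritable();
      for (MapWritable stripe : stripes) {
        for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
          IntWritable current = (IntWritable) sum.get(e.getKey());
          int add = ((IntWritable) e.getValue()).get();
          sum.put(e.getKey(), new IntWritable(current == null ? add : current.get() + add));
        }
      }
      context.write(term, sum);
    }
  }
}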

Pietro Michiardi (Eurecom) Tutorial: MapReduce 159 / 191

Page 160: MapReduce Tutorial

Algorithm Design Pairs and Stripes

Word co-occurrence: the Stripes approach

Pietro Michiardi (Eurecom) Tutorial: MapReduce 160 / 191

Page 161: MapReduce Tutorial

Algorithm Design Pairs and Stripes

Pairs and Stripes, a comparison

The pairs approach
I Generates a large number of key-value pairs (also intermediate)
I The benefit from combiners is limited, as it is less likely for a mapper to process multiple occurrences of the same co-occurring pair
I Does not suffer from memory paging problems

The stripes approach
I More compact
I Generates fewer and shorter intermediate keys
F The framework has less sorting to do
I The values are more complex and have serialization/deserialization overhead
I Greatly benefits from combiners, as the key space is the vocabulary
I Suffers from memory paging problems, if not properly engineered

Pietro Michiardi (Eurecom) Tutorial: MapReduce 161 / 191

Page 162: MapReduce Tutorial

Algorithm Design Order Inversion

Order Inversion

Pietro Michiardi (Eurecom) Tutorial: MapReduce 162 / 191

Page 163: MapReduce Tutorial

Algorithm Design Order Inversion

Computing relative frequencies

“Relative” co-occurrence matrix construction
I Similar problem as before, same matrix
I Instead of absolute counts, we take into consideration the fact that some words appear more frequently than others
F Word wi may co-occur frequently with word wj simply because one of the two is very common
I We need to convert absolute counts to relative frequencies f(wj | wi)
F What proportion of the time does wj appear in the context of wi?

Formally, we compute:

f(wj | wi) = N(wi, wj) / Σw' N(wi, w')

I N(·, ·) is the number of times a co-occurring word pair is observed
I The denominator is called the marginal

Pietro Michiardi (Eurecom) Tutorial: MapReduce 163 / 191

Page 164: MapReduce Tutorial

Algorithm Design Order Inversion

Computing relative frequencies

The stripes approach
I In the reducer, the counts of all words that co-occur with the conditioning variable (wi) are available in the associative array
I Hence, the sum of all those counts gives the marginal
I Then we divide the joint counts by the marginal and we’re done

The pairs approach
I The reducer receives the pair (wi, wj) and the count
I From this information alone it is not possible to compute f(wj | wi)
I Fortunately, as for the mapper, the reducer can also preserve state across multiple keys
F We can buffer in memory all the words that co-occur with wi and their counts
F This is basically building the associative array of the stripes method

Pietro Michiardi (Eurecom) Tutorial: MapReduce 164 / 191

Page 165: MapReduce Tutorial

Algorithm Design Order Inversion

Computing relative frequencies: a basic approach

We must define the sort order of the pair

I In this way, the keys are first sorted by the left word, and then by the right word (in the pair)
I Hence, we can detect when all pairs associated with the word we are conditioning on (wi) have been seen
I At this point, we can use the in-memory buffer, compute the relative frequencies and emit

We must define an appropriate partitioner
I The default partitioner is based on the hash value of the intermediate key, modulo the number of reducers
I For a complex key, the raw byte representation is used to compute the hash value
F Hence, there is no guarantee that the pairs (dog, aardvark) and (dog, zebra) are sent to the same reducer
I What we want is that all pairs with the same left word are sent to the same reducer

Limitations of this approach
I Essentially, we reproduce the stripes method in the reducer, and we need to use a custom partitioner
I This algorithm would work, but presents the same memory-bottleneck problem as the stripes method

Pietro Michiardi (Eurecom) Tutorial: MapReduce 165 / 191

Page 166: MapReduce Tutorial

Algorithm Design Order Inversion

Computing relative frequencies: order inversion

The key is to properly sequence the data presented to reducers
I If it were possible to compute the marginal in the reducer before processing the joint counts, the reducer could simply divide the joint counts received from mappers by the marginal
I The notion of “before” and “after” can be captured in the ordering of key-value pairs
I The programmer can define the sort order of keys so that data needed earlier is presented to the reducer before data that is needed later

Pietro Michiardi (Eurecom) Tutorial: MapReduce 166 / 191

Page 167: MapReduce Tutorial

Algorithm Design Order Inversion

Computing relative frequencies: order inversion

Recall that mappers emit pairs of co-occurring words as keys

The mapper:
I additionally emits a “special” key of the form (wi, ∗)
I The value associated with the special key is one; it represents the contribution of the word pair to the marginal
I Using combiners, these partial marginal counts are aggregated before being sent to the reducers

The reducer:
I We must make sure that the special key-value pairs are processed before any other key-value pairs where the left word is wi
I We also need to modify the partitioner as before, i.e., it must take into account only the first word
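A possible Hadoop (Java) partitioner for this pattern, assuming the composite key is encoded as a Text of the form "wi:wj" (with "wi:*" as the special key), as in the earlier Pairs sketch:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Order-inversion partitioner: only the left word of the composite key
// decides the reducer, so all pairs conditioned on wi, including the
// special marginal key (wi, *), meet at the same reducer
public class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReducers) {
    String leftWord = key.toString().split(":")[0];
    return (leftWord.hashCode() & Integer.MAX_VALUE) % numReducers;
  }
}

With this Text encoding, the default byte-wise sort already places "wi:*" before "wi:aardvark", ..., because '*' precedes letters and digits in ASCII; with a custom paired key, a comparator would have to enforce the same order.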

Pietro Michiardi (Eurecom) Tutorial: MapReduce 167 / 191

Page 168: MapReduce Tutorial

Algorithm Design Order Inversion

Computing relative frequencies: order inversion

Memory requirements:
I Minimal, because only the marginal (an integer) needs to be stored
I No buffering of individual co-occurring words
I No scalability bottleneck

Key ingredients of order inversion
I Emit a special key-value pair to capture the marginal
I Control the sort order of the intermediate key, so that the special key-value pair is processed first
I Define a custom partitioner for routing intermediate key-value pairs
I Preserve state across multiple keys in the reducer

Pietro Michiardi (Eurecom) Tutorial: MapReduce 168 / 191

Page 169: MapReduce Tutorial

Algorithm Design Graph Algorithms

Graph Algorithms

Pietro Michiardi (Eurecom) Tutorial: MapReduce 169 / 191

Page 170: MapReduce Tutorial

Algorithm Design Graph Algorithms

Preliminaries and Data Structures

Pietro Michiardi (Eurecom) Tutorial: MapReduce 170 / 191

Page 171: MapReduce Tutorial

Algorithm Design Graph Algorithms

Motivations

Examples of graph problems
I Graph search
I Graph clustering
I Minimum spanning trees
I Matching problems
I Flow problems
I Element analysis: node and edge centralities

The problem: big graphs

Why MapReduce?
I Algorithms for the above problems on a single machine are not scalable
I Recently, Google designed a new system, Pregel, for large-scale (incremental) graph processing
I Even more recently, [7] indicated a fundamentally new design pattern to analyze graphs in MapReduce

Pietro Michiardi (Eurecom) Tutorial: MapReduce 171 / 191

Page 172: MapReduce Tutorial

Algorithm Design Graph Algorithms

Graph Representations

Basic data structuresI Adjacency matrixI Adjacency list

Are graphs sparse or dense?
I Determines which data structure to use
F Adjacency matrix: operations on incoming links are easy (column scan)
F Adjacency list: operations on outgoing links are easy
F The shuffle and sort phase can help, by grouping edges by their destination reducer
I [8] dispelled the notion of sparseness of real-world graphs

Pietro Michiardi (Eurecom) Tutorial: MapReduce 172 / 191

Page 173: MapReduce Tutorial

Algorithm Design Graph Algorithms

Parallel Breadth-First-Search

Pietro Michiardi (Eurecom) Tutorial: MapReduce 173 / 191

Page 174: MapReduce Tutorial

Algorithm Design Graph Algorithms

Parallel Breadth-First Search

Single-source shortest path
I Dijkstra’s algorithm uses a global priority queue
F It maintains a globally sorted list of nodes by current distance
I How to solve this problem in parallel?
F “Brute-force” approach: breadth-first search

Parallel BFS: intuition
I Flooding
I Iterative algorithm in MapReduce
I Shoehorn message-passing-style algorithms

Pietro Michiardi (Eurecom) Tutorial: MapReduce 174 / 191

Page 175: MapReduce Tutorial

Algorithm Design Graph Algorithms

Parallel Breadth-First Search

Pietro Michiardi (Eurecom) Tutorial: MapReduce 175 / 191

Page 176: MapReduce Tutorial

Algorithm Design Graph Algorithms

Parallel Breadth-First Search

Assumptions
I Connected, directed graph
I Data structure: adjacency list
I The distance to each node is stored alongside the adjacency list of that node

The pseudo-code
I We use n to denote the node id (an integer)
I We use N to denote the node adjacency list and current distance
I The algorithm works by mapping over all nodes
I Mappers emit a key-value pair for each neighbor on the node’s adjacency list
F The key: node id of the neighbor
F The value: the current distance to the node plus one
F If we can reach node n with a distance d, then we must be able to reach all the nodes connected to n with distance d + 1

Pietro Michiardi (Eurecom) Tutorial: MapReduce 176 / 191

Page 177: MapReduce Tutorial

Algorithm Design Graph Algorithms

Parallel Breadth-First Search

The pseudo-code (continued)
I After shuffle and sort, reducers receive keys corresponding to the destination node ids and distances corresponding to all paths leading to that node
I The reducer selects the shortest of these distances and updates the distance in the node data structure

Passing the graph along
I The mapper: emits the node adjacency list, with the node id as the key
I The reducer: must distinguish between the node data structure and the distance values
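A compact Hadoop (Java) sketch of one BFS iteration along these lines; the node record is encoded as a Text value "distance|adjacency-list" (Integer.MAX_VALUE meaning "not reached yet"), and the input is assumed to be read so that the key is the node id (e.g., with KeyValueTextInputFormat). This is an illustrative encoding, not the one used in the original pseudo-code.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ParallelBFS {

  public static class BFSMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text nodeId, Text record, Context context)
        throws IOException, InterruptedException {
      String[] parts = record.toString().split("\\|", -1);
      int distance = Integer.parseInt(parts[0]);
      context.write(nodeId, record);                 // pass the graph along
      if (distance == Integer.MAX_VALUE || parts[1].isEmpty()) return;
      for (String neighbor : parts[1].split(",")) {  // tentative distances
        context.write(new Text(neighbor), new Text(String.valueOf(distance + 1)));
      }
    }
  }

  public static class BFSReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text nodeId, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      int best = Integer.MAX_VALUE;
      String adjacency = "";
      for (Text value : values) {
        String v = value.toString();
        if (v.contains("|")) {                       // node data structure
          String[] parts = v.split("\\|", -1);
          best = Math.min(best, Integer.parseInt(parts[0]));
          adjacency = parts[1];
        } else {                                     // a distance message
          best = Math.min(best, Integer.parseInt(v));
        }
      }
      context.write(nodeId, new Text(best + "|" + adjacency));
    }
  }
}

A driver program re-runs this job, feeding each iteration’s output to the next, until no distance changes (which can be detected with a Hadoop counter, as discussed on the next slides).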

Pietro Michiardi (Eurecom) Tutorial: MapReduce 177 / 191

Page 178: MapReduce Tutorial

Algorithm Design Graph Algorithms

Parallel Breadth-First Search

MapReduce iterations
I The first time we run the algorithm, we “discover” all nodes connected to the source
I In the second iteration, we discover all nodes connected to those
→ Each iteration expands the “search frontier” by one hop
I How many iterations are needed before convergence?

This approach is suitable for small-world graphs
I The diameter of the network is small
I See [7] for advanced topics on the subject

Pietro Michiardi (Eurecom) Tutorial: MapReduce 178 / 191

Page 179: MapReduce Tutorial

Algorithm Design Graph Algorithms

Parallel Breadth-First Search

Checking the termination of the algorithm
I Requires a “driver” program which submits a job, checks the termination condition, and iterates if necessary
I In practice:
F Hadoop counters
F Side-data passed through the job configuration

Extensions
I Storing the actual shortest path
I Weighted edges (as opposed to unit distance)

Pietro Michiardi (Eurecom) Tutorial: MapReduce 179 / 191

Page 180: MapReduce Tutorial

Algorithm Design Graph Algorithms

The story so far

The graph structure is stored in adjacency lists
I This data structure can be augmented with additional information

The MapReduce framework
I Maps over the node data structures, involving only each node’s internal state and its local graph structure
I Map results are “passed” along outgoing edges
I The graph itself is passed from the mappers to the reducers
F This is a very costly operation for large graphs!
I Reducers aggregate over “same destination” nodes

Graph algorithms are generally iterative
I They require a driver program to check for termination

Pietro Michiardi (Eurecom) Tutorial: MapReduce 180 / 191

Page 181: MapReduce Tutorial

Algorithm Design Graph Algorithms

PageRank

Pietro Michiardi (Eurecom) Tutorial: MapReduce 181 / 191

Page 182: MapReduce Tutorial

Algorithm Design Graph Algorithms

Introduction

What is PageRank
I It’s a measure of the relevance of a Web page, based on the structure of the hyperlink graph
I Based on the concept of a random Web surfer

Formally we have:

P(n) = α (1/|G|) + (1 − α) Σm∈L(n) P(m)/C(m)

I |G| is the number of nodes in the graph
I α is the random jump factor
I L(n) is the set of pages that link to n
I C(m) is the out-degree of node m

Pietro Michiardi (Eurecom) Tutorial: MapReduce 182 / 191

Page 183: MapReduce Tutorial

Algorithm Design Graph Algorithms

PageRank in Details

PageRank is defined recursively, hence we need an iterative algorithm
I A node receives “contributions” from all pages that link to it

Consider the set of nodes L(n)
I A random surfer at m arrives at n with probability 1/C(m)
I Since the PageRank value of m is the probability that the random surfer is at m, the probability of arriving at n from m is P(m)/C(m)

To compute the PageRank of n we need to:
I Sum the contributions from all pages that link to n
I Take into account the random jump, which is uniform over all nodes in the graph

Pietro Michiardi (Eurecom) Tutorial: MapReduce 183 / 191

Page 184: MapReduce Tutorial

Algorithm Design Graph Algorithms

PageRank in MapReduce

Pietro Michiardi (Eurecom) Tutorial: MapReduce 184 / 191

Page 185: MapReduce Tutorial

Algorithm Design Graph Algorithms

PageRank in MapReduce

Pietro Michiardi (Eurecom) Tutorial: MapReduce 185 / 191

Page 186: MapReduce Tutorial

Algorithm Design Graph Algorithms

PageRank in MapReduce

Pietro Michiardi (Eurecom) Tutorial: MapReduce 186 / 191

Page 187: MapReduce Tutorial

Algorithm Design Graph Algorithms

PageRank in MapReduce

Sketch of the MapReduce algorithm
I The algorithm maps over the nodes
I For each node, it computes the PageRank mass that needs to be distributed to its neighbors
I Each fraction of the PageRank mass is emitted as the value, keyed by the node ids of the neighbors
I In the shuffle and sort phase, values are grouped by node id
F Also, we pass the graph structure from mappers to reducers (so that subsequent iterations can take place over the updated graph)
I The reducer updates the PageRank value of every single node
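A simplified Hadoop (Java) sketch of one such iteration, using the same "rank|adjacency-list" Text encoding as the BFS example; dangling-node mass and the separate job that redistributes it are deliberately omitted, and the configuration property names are illustrative assumptions.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankIteration {

  public static class PRMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text nodeId, Text record, Context context)
        throws IOException, InterruptedException {
      String[] parts = record.toString().split("\\|", -1);
      double rank = Double.parseDouble(parts[0]);
      String[] neighbors = parts[1].isEmpty() ? new String[0] : parts[1].split(",");
      // Pass the graph structure along for the next iteration
      context.write(nodeId, new Text("GRAPH|" + parts[1]));
      // Distribute this node's mass evenly over its out-links
      for (String neighbor : neighbors) {
        context.write(new Text(neighbor), new Text(String.valueOf(rank / neighbors.length)));
      }
    }
  }

  public static class PRReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text nodeId, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0.0;
      String adjacency = "";
      for (Text value : values) {
        String v = value.toString();
        if (v.startsWith("GRAPH|")) {
          adjacency = v.substring("GRAPH|".length());
        } else {
          sum += Double.parseDouble(v);
        }
      }
      // P(n) = alpha * (1/|G|) + (1 - alpha) * sum of incoming contributions
      long numNodes = context.getConfiguration().getLong("pagerank.num.nodes", 1L);
      double alpha = context.getConfiguration().getFloat("pagerank.alpha", 0.15f);
      double rank = alpha / numNodes + (1.0 - alpha) * sum;
      context.write(nodeId, new Text(rank + "|" + adjacency));
    }
  }
}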

Pietro Michiardi (Eurecom) Tutorial: MapReduce 187 / 191

Page 188: MapReduce Tutorial

Algorithm Design Graph Algorithms

PageRank in MapReduce

Implementation details
I Loss of PageRank mass for sink nodes
I Auxiliary state information
I One iteration of the algorithm
F Two MapReduce jobs: one to distribute the PageRank mass, the other for dangling nodes and random jumps
I Checking for convergence
F Requires a driver program
F When the PageRank updates are “stable” the algorithm stops

Further reading on convergence and attacks
I Convergence: [9, 4]
I Attacks: Adversarial Information Retrieval Workshop [1]

Pietro Michiardi (Eurecom) Tutorial: MapReduce 188 / 191

Page 189: MapReduce Tutorial

References

References I

[1] Adversarial information retrieval workshop.

[2] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disambiguation. In Proc. of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), 2001.

[3] Luiz André Barroso and Urs Hölzle. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Morgan & Claypool Publishers, 2009.

[4] Monica Bianchini, Marco Gori, and Franco Scarselli. Inside PageRank. In ACM Transactions on Internet Technology, 2005.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 189 / 191

Page 190: MapReduce Tutorial

References

References II

[5] James Hamilton. Cooperative expendable micro-slice servers (CEMS): Low cost, low power servers for internet-scale services. In Proc. of the 4th Biennial Conference on Innovative Data Systems Research (CIDR), 2009.

[6] Tony Hey, Stewart Tansley, and Kristin Tolle. The fourth paradigm: Data-intensive scientific discovery. Microsoft Research, 2009.

[7] Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, and Sergei Vassilvitskii. Filtering: a method for solving graph problems in MapReduce. In Proc. of SPAA, 2011.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 190 / 191

Page 191: MapReduce Tutorial

References

References III

[8] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: Densification laws, shrinking diameters and possible explanations. In Proc. of SIGKDD, 2005.

[9] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. In Stanford Digital Library Working Paper, 1999.

[10] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop distributed file system. In Proc. of the 26th IEEE Symposium on Massive Storage Systems and Technologies (MSST). IEEE, 2010.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 191 / 191

Page 192: MapReduce Tutorial

References

References IV

[11] Tom White. Hadoop: The Definitive Guide. O’Reilly / Yahoo Press, 2010.

Pietro Michiardi (Eurecom) Tutorial: MapReduce 192 / 191