Map/Reduce Programming Model


Ahmed Abdelsadek

Outline

•Introduction

•What is Map/Reduce?

•Framework Architecture

•Map/Reduce Algorithm Design

•Tools and Libraries built on top of Map/Reduce

Introduction

•Big Data

•Scaling ‘out’ not ‘up’

•Scaling ‘everything’ linearly with data size

•Data-intensive applications

Map/Reduce

•Origins
▫Google Map/Reduce
▫Hadoop Map/Reduce

•The Map and Reduce functions are both defined with respect to data structured in (key, value) pairs.

Mapper

•The Map function takes a key/value pair, processes it, and generates zero or more output key/value pairs.
•The input and output types of the mapper can be different from each other.

Reducer

•The Reduce function takes a key and the list of all values associated with it, processes them, and generates zero or more output key/value pairs.
•The input and output types of the reducer can be different from each other.

Mappers/Reducers

•map: (k1, v1) -> [(k2, v2)]
•reduce: (k2, [v2]) -> [(k3, v3)]

WordCount Example

•Problem: count the number of occurrences of every word in a text collection.

Map(docid a, doc d):
    for all term t in doc d do
        Emit(term t, count 1)

Reduce(term t, counts [c1, c2, ...]):
    sum = 0
    for all count c in counts [c1, c2, ...] do
        sum = sum + c
    Emit(term t, count sum)
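For concreteness, here is the same algorithm as a minimal sketch against the Hadoop Java API (whitespace tokenization is an assumption; the surrounding job setup appears with the combiner discussion later):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (term, 1) for every token in the input line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all partial counts emitted for this term.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}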

Map/Reduce Framework Architecture and Execution Overview

Architecture - Overview

•Map/Reduce runs on top of a distributed file system (DFS)

Data Flow

Job Timeline

Job Work Flow

Fault Tolerance

•Task fails
▫Re-execution
•TaskTracker fails
▫Remove the node from the pool of TaskTrackers
▫Re-schedule its tasks
•JobTracker fails
▫Single point of failure: the job fails

Map/Reduce Framework Features

•Locality
▫Move code to the data
•Task granularity
▫The number of map and reduce tasks should be much larger than the number of machines (but not too much larger!) to allow dynamic load balancing
•Backup tasks
▫Near the end of a job, schedule backup executions of in-progress tasks to avoid slow workers

Map/Reduce Framework Features

•Skipping bad records
▫When there are many failures on the same record
•Local execution
▫Debug in isolation
•Status information
▫Progress of computations
•User counters, reporting progress
▫Periodically propagated to the master node

Hadoop Streaming and Pipes

•APIs to MapReduce that allow you to write your map and reduce functions in languages other than Java
•Hadoop Streaming
▫Uses Unix standard streams as the interface between Hadoop and your program
▫You can use any language that can read standard input and write to standard output
•Hadoop Pipes (for C++)
▫Pipes uses sockets as the channel to communicate with the process running the C++ map or reduce function
▫JNI is not used

Keep in Mind

•The programmer has little control over many aspects of execution
▫Where a mapper or reducer runs (i.e., on which node in the cluster)
▫When a mapper or reducer begins or finishes
▫Which input key-value pairs are processed by a specific mapper
▫Which intermediate key-value pairs are processed by a specific reducer

Map/Reduce Algorithm Design

Partitioners

•Divide up the intermediate key space
•Simplest: hash value of the key mod the number of reducers
▫Assigns roughly the same number of keys to each reducer
▫Only considers the key and ignores the value
▫May yield large differences in the number of values sent to each reducer
•A more complex partitioning algorithm can handle the imbalance in the amount of data associated with each key
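As a reference point, a minimal sketch of the simple scheme above, mirroring Hadoop's default HashPartitioner:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SimpleHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then map
        // the key's hash into [0, numReduceTasks). The value is ignored.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}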

Combiners

•In the WordCount example, the amount of intermediate data is larger than the input collection itself
•Combiners are an optimization for local aggregation before the shuffle and sort phase
▫Compute a local count for a word over all the documents processed by the mapper
•Think of combiners as "mini-reducers"
▫However, combiners and reducers are not always interchangeable
•Combiner input and output pairs have the same types as the mapper output pairs
▫Which are also the same as the reducer input pairs
•A combiner may be invoked zero, one, or multiple times
•A combiner can emit any number of key-value pairs
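A hedged job-setup sketch showing how a combiner is attached (class names refer to the WordCount sketch above; reusing the reducer as the combiner is safe here because summation is commutative and associative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // Local aggregation: the reducer doubles as the combiner.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}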

Complete View of Map/Reduce

Local Aggregation

•Network and disk latency are high!
•Framework features that help local aggregation:
▫A single (Java) Mapper object handles multiple (key, value) pairs in an input split (preserving state across multiple calls of the map() method)
▫In-object data structures and counters can be shared
▫Initialization and finalization code runs around all map() calls in a single task
▫JVM reuse across multiple tasks on the same machine

Basic WordCount Example

Per-Document Aggregation

•Use an associative array inside the map() call to sum up term counts within a single document
•Emit a key-value pair for each unique term, instead of one for every term occurrence in the document
▫Substantial savings in the number of intermediate key-value pairs emitted

Per-Mapper Aggregation

•Use an associative array inside the Mapper object to sum up term counts across multiple documents (sketched below)
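A hedged sketch of this pattern (in-mapper combining) for WordCount: partial counts accumulate across every document the Mapper object sees and are flushed once in cleanup():

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    // State preserved across all map() calls in this task.
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void map(LongWritable key, Text value, Context context) {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Flush the aggregated counts once, after the last map() call.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}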

In-Mapper Combining

•Pros
▫More control over when local aggregation occurs and how exactly it takes place (recall: no guarantees on combiners)
▫More efficient than using actual combiners: no additional overhead from object creation and from serializing, reading, and writing the intermediate key-value pairs
•Cons
▫Breaks the functional programming model (not a big deal!)
▫Scalability bottleneck: needs sufficient memory to store the intermediate results
▫Solution: block and flush, i.e., emit after every N key-value pairs have been processed or every M bytes have been used

Correctness with Local Aggregation

•Combiners are viewed as optional optimizations
▫The correctness of the algorithm should not depend on their computations
•Combiners and reducers are not interchangeable
▫Unless the reduce computation is both commutative and associative
•Make sure of the semantics of your aggregation algorithm
▫Notice, for example, that computing a mean per key this way would break: a mean of partial means is not, in general, the overall mean

Pairs and Stripes

•In some problems, a common approach is to construct complex keys and values to achieve more efficiency
•Example: the problem of building a word co-occurrence matrix from a large document collection
▫Formally, the co-occurrence matrix of a corpus is a square N x N matrix, where N is the number of unique words in the corpus
▫Cell Mij contains the number of times word Wi co-occurred with word Wj

Pairs Approach

•Mapper: emits each co-occurring word pair as the key and the integer one as the value (sketched below)
•Reducer: sums up all the values associated with the same co-occurring word pair
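A minimal sketch of the pairs mapper, under two simplifying assumptions: the word pair is packed into a single tab-separated Text key rather than a custom WritableComparable, and "co-occurrence" means appearing in the same line:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            for (int j = 0; j < terms.length; j++) {
                if (i == j || terms[i].isEmpty() || terms[j].isEmpty()) continue;
                // Emit ((Wi, Wj), 1) for every co-occurring pair; the
                // reducer is a plain sum, as in WordCount.
                context.write(new Text(terms[i] + "\t" + terms[j]), ONE);
            }
        }
    }
}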

Pairs Approach

•The pairs algorithm generates a massive number of key-value pairs
•Combiners have few opportunities to perform local aggregation
•The sparsity of the key space also limits the effectiveness of in-memory combining

Stripes Approach

•Store co-occurrence information in an associative array
•Mapper: emits words as keys and associative arrays as values (sketched below)
•Reducer: element-wise sum of all associative arrays with the same key
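A hedged sketch of the stripes mapper, using Hadoop's MapWritable as the associative array (co-occurrence is again assumed to mean within the same line):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            if (terms[i].isEmpty()) continue;
            // Build the stripe {Wj: count} for this occurrence of Wi.
            MapWritable stripe = new MapWritable();
            for (int j = 0; j < terms.length; j++) {
                if (i == j || terms[j].isEmpty()) continue;
                Text neighbor = new Text(terms[j]);
                IntWritable old = (IntWritable) stripe.get(neighbor);
                stripe.put(neighbor, new IntWritable(old == null ? 1 : old.get() + 1));
            }
            context.write(new Text(terms[i]), stripe);
        }
    }
}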

Stripes Approach

•Much more compact representation
•Far fewer intermediate key-value pairs
•More opportunities to perform local aggregation
•May cause scalability bottlenecks, since each stripe must fit in memory

Which approach is faster?

•APW (Associated Press Worldstream): a corpus of 2.27 million documents totaling 5.7 GB

Computing Relative Frequencies

•In the previous example, the (Wi, Wj) co-occurrence count may be high just because one of the words is very common!
•Solution: compute relative frequencies
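In formula form (notation assumed here, matching the f(Wj | Wi) used below), the relative frequency of Wj given Wi is the pair count normalized by the marginal count of Wi:

    f(Wj | Wi) = N(Wi, Wj) / Σw' N(Wi, w')

where N(Wi, Wj) is the number of times Wi co-occurred with Wj.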

Relative Frequencies with Stripes

•Straightforward!
•In the reducer:
▫Sum the counts of all words that co-occur with the key word
▫Divide each count by that sum to get the relative frequency!
•Lessons:
▫Use of complex data structures to coordinate distributed computations
▫Appropriate structuring of keys and values brings together all the pieces of data required to perform a computation
•Drawback?
▫As before, this algorithm assumes that each associative array fits into memory (scalability bottleneck!)
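A hedged sketch of that reducer, paired with the StripesMapper above (emitting one stripe of frequencies per word; the output value type is an assumption):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

public class RelFreqStripesReducer
        extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    public void reduce(Text key, Iterable<MapWritable> stripes, Context context)
            throws IOException, InterruptedException {
        // Element-wise sum of all stripes for this word.
        Map<String, Integer> sums = new HashMap<>();
        for (MapWritable stripe : stripes) {
            for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                sums.merge(e.getKey().toString(),
                        ((IntWritable) e.getValue()).get(), Integer::sum);
            }
        }
        long total = 0;
        for (int c : sums.values()) total += c;
        // Divide each count by the marginal to get f(Wj | Wi).
        MapWritable freqs = new MapWritable();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            freqs.put(new Text(e.getKey()),
                    new DoubleWritable((double) e.getValue() / total));
        }
        context.write(key, freqs);
    }
}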

Relative Frequencies with Pairs

•The reducer receives (Wi, Wj) as the key and the counts as the value
▫From this alone it is not possible to compute f(Wj | Wi)
•Hint: reducers, like mappers, can preserve state across multiple keys
•Solution: at the reducer side, buffer in memory all the words that co-occur with Wi
▫In essence, building the associative array of the stripes approach
•Problem?
▫Word pairs can arrive in any arbitrary order!
•Solution: we must define the sort order of the pair
▫Keys are first sorted by the left word, and then by the right word
•So that, when the left word changes:
▫Sum, calculate and emit the results, and flush the memory

Relative Frequencies with Pairs

•Problem?
▫Pairs with the same left word may be sent to different reducers!
•Solution?
▫We must ensure that all pairs with the same left word are sent to the same reducer
•How?
▫A custom partitioner that pays attention to the left word and partitions based on its hash only (see the sketch below)
•Will it work?
▫Yeah!
•Drawback?
▫Still a scalability bottleneck!

Relative Frequencies with Pairs

•Another approach? One with no bottlenecks?
•Can we compute, or already "have", the sum before processing the pair counts?
•The notion of "before" and "after" can be seen in the ordering of the key-value pairs
•The insight lies in properly sequencing the data presented to the reducer
▫The programmer should define the sort order of keys so that data needed earlier is presented to the reducer earlier
•So now we need two things:
▫Compute the sum for a given word Wi
▫Send that sum to the reducer before any word pair where Wi is the left side

Relative Frequencies with Pairs

•How?
•To get the sum:
▫Modify the mapper to additionally emit a "special" key of (Wi, *), with a value of one
•To ensure the order:
▫Define the sort order of the keys so that pairs with the special symbol, of the form (Wi, *), are ordered before any other key-value pairs whose left word is Wi
•In addition:
▫The partitioner must pay attention to only the left word (see the sketch after this list)

Relative Frequencies with Pairs

•Example
•Memory bottlenecks?
▫No!

Order Inversion Design Pattern

•To summarize:
▫Emit a special key-value pair for getting the sum
▫Control the sort order of the intermediate key
▫Define a custom partitioner
▫Preserve state across multiple keys in the reducer
•A quite common pattern in many problems
•The key insight:
▫Convert the sequencing of computations into a sorting problem

Secondary Sort

•In addition to sorting by key, we also need to sort by value
•Implemented in Google's framework, but not in Hadoop
•Two main techniques:
▫Buffer all the values in memory and then sort
 May lead to too much memory consumption
▫Value-to-key conversion
 Move part of the value into the intermediate key to form a composite key
 We must define the intermediate key sort order
 We must define the partitioner so that all pairs associated with the same original key are sent to the same reducer
 The reducer will need to preserve state across multiple pairs
 May lead to too many intermediate pairs

Relational Joins

•For databases, data warehousing, and data analytics
•Semi-structured data
•Example of a join between two datasets (relations) S and T on a join key k: tuples have the form (k, si, Si) and (k, ti, Ti), where si and ti are the unique IDs of S and T respectively, and Si and Ti are the rest of the tuple attributes

Reduce-side Join

•One-to-one join
▫Emit the tuple's join attribute as the key and the rest of the attributes as the value
•One-to-many join
▫Buffer all tuples in memory, or
▫Use the value-to-key pattern

Reduce-side Join

•Many-to-many join
▫The previous algorithm works as well
▫The smaller set should come first
▫The reducer will buffer it in memory
•Lessons
▫The basic idea is to repartition the two datasets by the join key
▫Not efficient, since it shuffles both datasets across the network

Map-side Joins

•Assume the datasets are
▫Both sorted by the join key
▫Divided into the same number of files
▫Partitioned in the same manner by the join key
▫In each file, tuples are sorted by the join key
•We can perform a join by scanning through both datasets simultaneously
▫This is known as a merge join
•Parallelize by partitioning and sorting both datasets in the same way:
▫Map over one of the datasets (the larger one)
▫Inside the mapper, read the corresponding part of the other dataset
 (a non-local read)
▫Perform the merge join

Map-side Joins

•More efficient than a reduce-side join
▫Doesn't shuffle the datasets across the network
•Drawback:
▫Strong assumptions on the input file format
•Advice:
▫If used in a workflow with multiple Map/Reduce jobs, ensure the previous reducer writes its output in a convenient format.

Memory-backed Join

•If one of the datasets can fit in memory:
▫Load it in memory
▫Map over the other dataset
▫Use random access to tuples based on the join key
•Great performance improvement
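A hedged sketch of this join (the file name "small.tsv" and the tab-separated record layout are illustrative assumptions; in practice the small dataset might be shipped to each node via the distributed cache):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MemoryBackedJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {
    // The small dataset, keyed by join key, held fully in memory.
    private final Map<String, String> small = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("small.tsv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) small.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) return;
        // Random access by join key; emit the joined tuple on a match.
        String match = small.get(parts[0]);
        if (match != null) {
            context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
        }
    }
}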

Summary

•In-mapper combining
▫Aggregates partial results
▫Emits fewer intermediate pairs
•Pairs and stripes
▫Keep track of joint events, either one by one or in stripe fashion
•Order inversion
▫Convert the sequencing of computations into a sorting problem
•Value-to-key conversion
▫A scalable solution for secondary sorting
▫Move part of the value into the key

Before we go!

•Remember the limitations of the Map/Reduce model:
▫Map/Reduce is mainly designed for batch processing, not for online queries
▫It prevents modifying or adding input data while a job is running, as well as changing the number of machines
▫A Map/Reduce job has a single entry and a single exit
 We cannot keep it alive waiting for an event to trigger it
▫Map/Reduce works on flat files
 Lack of schema support

What’s Next?

Map/Reduce vs RDBMS

•A living debate in the databases and data analytics communities
•In 2008, D. DeWitt and M. Stonebraker wrote "MapReduce: A major step backwards"
▫A giant step backward in the programming paradigm
▫An implementation that uses brute force instead of indexing
▫Not novel at all: well-known techniques developed nearly 25 years ago
▫Missing most of the features that are routinely included in current DBMSs
▫Incompatible with all of the tools DBMS users have come to depend on
•MapReduce is missing features
▫Indexing, bulk loading, updates, transactions, integrity constraints, referential integrity, views
•MapReduce is incompatible with DBMS tools
▫Report writers, business intelligence tools, data mining tools, replication tools, database design tools

Map/Reduce vs RDBMS

•In 2010, the same authors and others wrote "MapReduce and Parallel DBMSs: Friends or Foes?", where they argue that
▫Map/Reduce is a complement to DBMSs, not a competitor
▫The two are used in different application domains
•Parallel DBMSs excel at efficient querying of large data sets
•MR-style systems excel at ETL (extract-transform-load) tasks

NoSQL

•Mechanisms for storage and retrieval of data that use looser consistency models than traditional relational databases
▫To achieve higher scalability and availability
•Usually in the form of a key-value store
•Built on top of distributed file systems
•Examples
▫Google BigTable
▫Apache HBase
▫Apache Cassandra
▫Amazon Dynamo

Tools on top of Hadoop

•Apache Pig
▫Apache Pig is a high-level procedural language platform developed to simplify querying large data sets in Apache Hadoop and MapReduce
▫Apache Pig features "Pig Latin", a relational data-flow language that enables SQL-like queries to be performed on distributed datasets within Hadoop applications
▫Pig originated as a Yahoo Research project
▫In 2007, Pig became an open source project of the Apache Software Foundation

Apache Pig

•Pig Latin Example

Apache Pig

•Pig execution flow

Tools on top of Hadoop

•Apache Hive
▫Hive is a data warehouse system for the open source Apache Hadoop project
▫Hive features a SQL-like HiveQL language that facilitates data analysis and summarization for large datasets stored in Hadoop-compatible file systems
▫Hive originated as a Facebook project
▫It later became an open source project under the Apache Software Foundation

Apache Hive

•HiveQL Example

Pig vs Hive

•They are/were independent projects, and there was no centrally coordinated goal
•They were in different spaces early on and have grown to overlap with time as both projects expand
•Some differences:
▫Pig Latin is procedural, whereas HiveQL is declarative
▫Pig Latin allows developers to insert their own code almost anywhere in the data pipeline
•Both compile to Map and Reduce jobs

Libraries on top of Hadoop

•Apache Mahout
▫A machine learning library for building scalable machine learning algorithms

Libraries on top of Hadoop

•HIPI (Hadoop Image Processing Interface)
▫A framework that provides an API for performing image processing tasks in a distributed computing environment

Summary

•Map/Reduce

•Framework Architecture

•Map/Reduce Algorithm Design

•Tools and Libraries built on top of Map/Reduce

Demo

•Starting the Hadoop cluster
•Copying data to HDFS
•Compiling our Java Map/Reduce code and creating the Jar file
•Submitting the Hadoop job
•Showing progress and dashboards
•Retrieving the output from HDFS
•Shutting down the Hadoop cluster

Appendix

•Study materials
▫"Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer
▫"Hadoop: The Definitive Guide" by Tom White
▫"MapReduce Design Patterns" by Donald Miner and Adam Shook

Questions?