How to Implement MapReduce

8/6/2019 How to Implement MapReduce

1/18

How to Implement MapReduce Using

Presented ByJamie Pitts

Thursday, December 9, 2010


2/18

A Problem Seeking A Solution

Given a corpus of html-stripped financial filings:Identify and count unique subjects.

Possible Solutions:

1. Use the unix toolbox.2. Create a set of unwieldy perl scripts.3. Solve it with MapReduce, perl, and Gearman!



3/18

What is MapReduce?

A programming model for processingand generating large datasets.

The original 2004 paper by Dean and Ghemawat



4/18

M.R. Execution Model

1. A Master process splits a corpus of documents into sets.

Each is assigned to a separate Mapper worker.

2. A Mapper worker iterates over the contents of the split. AMap function processes each. Keys are extracted and key/value pairs are stored as intermediate data.

3. When the mapping for a split is complete, the Master processassigns a Reducer worker to process the intermediate datafor that split.

4. A Reducer worker iterates over the intermediate data. AReduce function accepts a key and a set of values andreturns a value. Unique key/value pairs are stored asoutput data.

A large set of Input key/value pairs is reducedinto a smaller set of Output key/value pairs



5/18

M.R. Execution Model Diagram



6/18

Applying MapReduceCounting Unique Subjects in a Corpus of Financial Filings

Corpus of Filings,Split by Company ID

Mapper Reducer

Intermediate Data:

Redundant records ofSubjects and Counts

Output Data:

Unique Recordsof Subjects and Counts



7/18

What is Gearman?

Job Server implemented in pureperl and in C

Client and Workers can beimplemented in any language

Not tied to any particular

processing modelNo single point of failure

A protocol and server for distributing workacross multiple servers.

Current Maintainers: Brian Aker and Eric Day



8/18

Gearman Architecture

Job Server is the intermediarybetween Clients and Workers

Clients organize the work

Workers define and registerwork functions with Job Servers.

Servers are aware of all availableWorkers.

Clients and Workers are awareof all available Servers.

Multiple Servers can be run onone machine.

Code that Organizesthe Work

Code that Does the Work



9/18

Perl and GearmanUse Gearman::XS to Create a Pipeline Process

http://search.cpan.org/~dschoen/Gearman-XS/

http://search.cpan.org/~dschoen/Gearman-XS/http://search.cpan.org/~dschoen/Gearman-XS/


10/18

Perl and GearmanClient-Server-Worker Architecture

Gearman Job Servers

Gearman::XS::Client Gearman::XS::Worker

The Master uses Gearman clients tocontrol the MapReduce pipeline.

Mapper and Reducer workers areperl processes, ready to accept jobs.

Each computer has two Gearman Job Servers running:one each for handing mapping and reducing jobs to available workers.

Mappers ReducersSplit > Do Mapping > Do Reducing



11/18

ExampleProcessing Filings from Two Companies

Filings fromTwo Companies

Mapper Reducer

Intermediate Data:

Redundant records ofSubjects and Counts

Output Data:

Unique Recordsof Subjects and Counts

Mapper Reducer



12/18

Example: The Master ProcessSetting Up the Mapper and Reducer Gearman Clients

Connect the mapper client tothe gearman server listeningon 4730. More than one canbe defined here.

Connect the reducer client to

the gearman server listeningon 4731. More than one canbe defined here.

A unique reducer id isgenerated for each reducerclient. This is used toidentify the output.



13/18

Example: The Master ProcessSubmitting a Mapper Job For Each Split

The company IDs are pushedinto an array of splits. A jobshash for both mappers andreducers is defined.

A background mapper job for

each split is submitted togearman.

The job handle is a unique IDgenerated by gearman when

the job is submitted. This isused to identify the job in the

jobs hash.

The split is passed to the job asfrozen workload data.



14/18

Example: The Master ProcessJob Monitoring: Submitting Reducer Jobs When Mapper Jobs Are Done

Mapper and Reducer jobs aremonitored inside of a loop.

The job status is queried fromthe client using the jobhandle.

If the job is completed and if itis a mapper job, a reducer isnow run (detailed in the nextslide).

Completed jobs are deletedfrom the jobs hash.



15/18

Example: The Master ProcessDetail on Submitting the Reducer Job

A background reducer job issubmitted to gearman.

The job handle is a unique IDgenerated by gearman when

the job is submitted. This is

used to identify the job in thejobs hash.

The split and reducer_id ispassed to the job as frozenworkload data.



16/18

ExampleThe Result of Processing Filings from Two Companies

Mapper Reducer

1 AdWords1 Google Adsense1 Adwords1 Google Network1 AdSense1 AdSense1 AdSense1 Google Network

... Redundant Counts ...

Mapper Reducer

Our AdWords agreements aregenerally terminable at anytime by our advertisers.Google AdSense is the programthrough which we distributeour advertisers' AdWords adsfor display on the web sitesof our Google Networkmembers. Our AdSense program

includes AdSense for searchand AdSense for content.AdSense for search is ourservice for distributingrelevant ads from ouradvertisers for display withsearch results on our GoogleNetwork members' sites.

... Filings Text ...1 Ad Words14 AdMob299 AdSense152 AdWords

... Aggregated Counts ...



17/18

Thats It For Now

To Be Continued...

Presented ByJamie Pitts



18/18

Credits

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

Gearman (http://gearman.org)Brian Aker and Eric Day

Diagrams and other content were usedfrom the following sources:
http://gearman.org/http://gearman.org/

How to Implement MapReduce

Documents

Transcript of How to Implement MapReduce