How to Implement MapReduce

download How to Implement MapReduce

of 18

Transcript of How to Implement MapReduce

  • 8/6/2019 How to Implement MapReduce

    1/18

    How to Implement MapReduce Using

    Presented ByJamie Pitts

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    2/18

    A Problem Seeking A Solution

    Given a corpus of html-stripped financial filings:Identify and count unique subjects.

    Possible Solutions:

    1. Use the unix toolbox.2. Create a set of unwieldy perl scripts.3. Solve it with MapReduce, perl, and Gearman!

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    3/18

    What is MapReduce?

    A programming model for processingand generating large datasets.

    The original 2004 paper by Dean and Ghemawat

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    4/18

    M.R. Execution Model

    1. A Master process splits a corpus of documents into sets.

    Each is assigned to a separate Mapper worker.

    2. A Mapper worker iterates over the contents of the split. AMap function processes each. Keys are extracted and key/value pairs are stored as intermediate data.

    3. When the mapping for a split is complete, the Master processassigns a Reducer worker to process the intermediate datafor that split.

    4. A Reducer worker iterates over the intermediate data. AReduce function accepts a key and a set of values andreturns a value. Unique key/value pairs are stored asoutput data.

    A large set of Input key/value pairs is reducedinto a smaller set of Output key/value pairs

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    5/18

    M.R. Execution Model Diagram

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    6/18

    Applying MapReduceCounting Unique Subjects in a Corpus of Financial Filings

    Corpus of Filings,Split by Company ID

    Mapper Reducer

    Intermediate Data:

    Redundant records ofSubjects and Counts

    Output Data:

    Unique Recordsof Subjects and Counts

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    7/18

    What is Gearman?

    Job Server implemented in pureperl and in C

    Client and Workers can beimplemented in any language

    Not tied to any particular

    processing modelNo single point of failure

    A protocol and server for distributing workacross multiple servers.

    Current Maintainers: Brian Aker and Eric Day

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    8/18

    Gearman Architecture

    Job Server is the intermediarybetween Clients and Workers

    Clients organize the work

    Workers define and registerwork functions with Job Servers.

    Servers are aware of all availableWorkers.

    Clients and Workers are awareof all available Servers.

    Multiple Servers can be run onone machine.

    Code that Organizesthe Work

    Code that Does the Work

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    9/18

    Perl and GearmanUse Gearman::XS to Create a Pipeline Process

    http://search.cpan.org/~dschoen/Gearman-XS/

    Thursday, December 9, 2010

    http://search.cpan.org/~dschoen/Gearman-XS/http://search.cpan.org/~dschoen/Gearman-XS/
  • 8/6/2019 How to Implement MapReduce

    10/18

    Perl and GearmanClient-Server-Worker Architecture

    Gearman Job Servers

    Gearman::XS::Client Gearman::XS::Worker

    The Master uses Gearman clients tocontrol the MapReduce pipeline.

    Mapper and Reducer workers areperl processes, ready to accept jobs.

    Each computer has two Gearman Job Servers running:one each for handing mapping and reducing jobs to available workers.

    Mappers ReducersSplit > Do Mapping > Do Reducing

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    11/18

    ExampleProcessing Filings from Two Companies

    Filings fromTwo Companies

    Mapper Reducer

    Intermediate Data:

    Redundant records ofSubjects and Counts

    Output Data:

    Unique Recordsof Subjects and Counts

    Mapper Reducer

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    12/18

    Example: The Master ProcessSetting Up the Mapper and Reducer Gearman Clients

    Connect the mapper client tothe gearman server listeningon 4730. More than one canbe defined here.

    Connect the reducer client to

    the gearman server listeningon 4731. More than one canbe defined here.

    A unique reducer id isgenerated for each reducerclient. This is used toidentify the output.

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    13/18

    Example: The Master ProcessSubmitting a Mapper Job For Each Split

    The company IDs are pushedinto an array of splits. A jobshash for both mappers andreducers is defined.

    A background mapper job for

    each split is submitted togearman.

    The job handle is a unique IDgenerated by gearman when

    the job is submitted. This isused to identify the job in the

    jobs hash.

    The split is passed to the job asfrozen workload data.

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    14/18

    Example: The Master ProcessJob Monitoring: Submitting Reducer Jobs When Mapper Jobs Are Done

    Mapper and Reducer jobs aremonitored inside of a loop.

    The job status is queried fromthe client using the jobhandle.

    If the job is completed and if itis a mapper job, a reducer isnow run (detailed in the nextslide).

    Completed jobs are deletedfrom the jobs hash.

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    15/18

    Example: The Master ProcessDetail on Submitting the Reducer Job

    A background reducer job issubmitted to gearman.

    The job handle is a unique IDgenerated by gearman when

    the job is submitted. This is

    used to identify the job in thejobs hash.

    The split and reducer_id ispassed to the job as frozenworkload data.

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    16/18

    ExampleThe Result of Processing Filings from Two Companies

    Mapper Reducer

    1 AdWords1 Google Adsense1 Adwords1 Google Network1 AdSense1 AdSense1 AdSense1 Google Network

    ... Redundant Counts ...

    Mapper Reducer

    Our AdWords agreements aregenerally terminable at anytime by our advertisers.Google AdSense is the programthrough which we distributeour advertisers' AdWords adsfor display on the web sitesof our Google Networkmembers. Our AdSense program

    includes AdSense for searchand AdSense for content.AdSense for search is ourservice for distributingrelevant ads from ouradvertisers for display withsearch results on our GoogleNetwork members' sites.

    ... Filings Text ...1 Ad Words14 AdMob299 AdSense152 AdWords

    ... Aggregated Counts ...

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    17/18

    Thats It For Now

    To Be Continued...

    Presented ByJamie Pitts

    Thursday, December 9, 2010

  • 8/6/2019 How to Implement MapReduce

    18/18

    Credits

    MapReduce: Simplified Data Processing on Large Clusters

    Jeffrey Dean and Sanjay Ghemawat

    Gearman (http://gearman.org)Brian Aker and Eric Day

    Diagrams and other content were usedfrom the following sources:

    http://gearman.org/http://gearman.org/