How to Implement MapReduce
8/6/2019 How to Implement MapReduce
How to Implement MapReduce Using
Presented By Jamie Pitts
Thursday, December 9, 2010
A Problem Seeking A Solution
Given a corpus of HTML-stripped financial filings: identify and count unique subjects.
Possible Solutions:
1. Use the Unix toolbox.
2. Create a set of unwieldy Perl scripts.
3. Solve it with MapReduce, Perl, and Gearman!
What is MapReduce?
A programming model for processing and generating large datasets.
The original 2004 paper by Dean and Ghemawat
M.R. Execution Model
1. A Master process splits a corpus of documents into sets. Each is assigned to a separate Mapper worker.
2. A Mapper worker iterates over the contents of the split. A Map function processes each. Keys are extracted and key/value pairs are stored as intermediate data.
3. When the mapping for a split is complete, the Master process assigns a Reducer worker to process the intermediate data for that split.
4. A Reducer worker iterates over the intermediate data. A Reduce function accepts a key and a set of values and returns a value. Unique key/value pairs are stored as output data.
A large set of Input key/value pairs is reduced into a smaller set of Output key/value pairs.
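The four steps above can be sketched in a single process. This is a minimal illustration in Python, not the Perl and Gearman pipeline the talk builds; the names `map_fn`, `reduce_fn`, and `run_mapreduce`, and the choice to treat capitalized words as "subjects", are assumptions made for the example.

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit a (key, 1) pair for every capitalized word (an assumed stand-in for a subject)."""
    return [(word.strip(".,"), 1) for word in document.split() if word[:1].isupper()]

def reduce_fn(key, values):
    """Reduce: accept a key and its set of values, return a single value."""
    return sum(values)

def run_mapreduce(splits):
    """Master: map each split, then reduce that split's intermediate data."""
    output = defaultdict(int)
    for split in splits:                          # step 1: one split per mapper
        intermediate = defaultdict(list)
        for document in split:                    # step 2: map each document in the split
            for key, value in map_fn(document):
                intermediate[key].append(value)
        for key, values in intermediate.items():  # steps 3-4: reduce the split's data
            output[key] += reduce_fn(key, values)
    return dict(output)

# Two splits, each holding one tiny made-up document.
splits = [["AdSense shows ads. AdSense pays."], ["AdWords sells ads."]]
counts = run_mapreduce(splits)
```

Here the mappers and reducers run one after another in a loop; in the talk's design they are separate worker processes.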
M.R. Execution Model Diagram
Applying MapReduce
Counting Unique Subjects in a Corpus of Financial Filings
Corpus of Filings, Split by Company ID > Mapper > Intermediate Data > Reducer > Output Data
Intermediate Data: Redundant records of Subjects and Counts
Output Data: Unique Records of Subjects and Counts
What is Gearman?
A protocol and server for distributing work across multiple servers.
Job Server implemented in pure Perl and in C
Client and Workers can be implemented in any language
Not tied to any particular processing model
No single point of failure
Current Maintainers: Brian Aker and Eric Day
Gearman Architecture
Job Server is the intermediary between Clients and Workers.
Clients organize the work.
Workers define and register work functions with Job Servers.
Servers are aware of all available Workers.
Clients and Workers are aware of all available Servers.
Multiple Servers can be run on one machine.
Code that Organizes the Work
Code that Does the Work
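To make the roles concrete, here is a toy in-process model of the arrangement described above. It is not Gearman's protocol or API: `ToyJobServer`, `register`, and `submit` are hypothetical names, written in Python only to show which party registers functions and which party hands over work.

```python
class ToyJobServer:
    """Hypothetical stand-in for a job server: it only routes work to workers."""

    def __init__(self):
        self.functions = {}   # function name -> list of worker callables

    def register(self, name, worker_fn):
        # Workers define a function and register it with the server.
        self.functions.setdefault(name, []).append(worker_fn)

    def submit(self, name, workload):
        # Clients hand work to the server, which picks a registered worker.
        workers = self.functions.get(name)
        if not workers:
            raise LookupError(f"no worker registered for {name!r}")
        worker_fn = workers[0]
        workers.append(workers.pop(0))   # rotate so later jobs go to other workers
        return worker_fn(workload)

server = ToyJobServer()
# Code that does the work: a worker function registered under a name.
server.register("count_words", lambda text: len(text.split()))
# Code that organizes the work: a client submitting a job by function name.
result = server.submit("count_words", "clients organize the work")
```

The client never calls the worker directly; it only knows the function name and the server, which is the property the slide emphasizes.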
Perl and Gearman
Use Gearman::XS to Create a Pipeline Process
http://search.cpan.org/~dschoen/Gearman-XS/
Perl and Gearman
Client-Server-Worker Architecture
Gearman Job Servers
Gearman::XS::Client, Gearman::XS::Worker
The Master uses Gearman clients to control the MapReduce pipeline.
Mapper and Reducer workers are Perl processes, ready to accept jobs.
Each computer has two Gearman Job Servers running: one each for handing mapping and reducing jobs to available workers.
Mappers, Reducers
Split > Do Mapping > Do Reducing
Example
Processing Filings from Two Companies
Filings from Two Companies > Mapper > Intermediate Data > Reducer > Output Data (two Mapper/Reducer pairs)
Intermediate Data: Redundant records of Subjects and Counts
Output Data: Unique Records of Subjects and Counts
Example: The Master Process
Setting Up the Mapper and Reducer Gearman Clients
Connect the mapper client to the gearman server listening on 4730. More than one can be defined here.
Connect the reducer client to the gearman server listening on 4731. More than one can be defined here.
A unique reducer id is generated for each reducer client. This is used to identify the output.
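The code on this slide is not part of the transcript. As a rough sketch of the setup it describes, the Python below models only the configuration: which servers each client connects to and the unique id attached to a reducer client. The host `127.0.0.1`, the helper `make_client_config`, and the use of a UUID for the reducer id are assumptions; the slide gives only the ports.

```python
import uuid

# Assumed addresses: the slide names only the ports 4730 and 4731.
MAPPER_SERVERS = [("127.0.0.1", 4730)]    # more than one server can be listed
REDUCER_SERVERS = [("127.0.0.1", 4731)]

def make_client_config(servers, with_reducer_id=False):
    """Describe one client: the servers it connects to and, for reducers, a unique id."""
    config = {"servers": list(servers)}
    if with_reducer_id:
        # Hypothetical id scheme; the id is later used to identify the output.
        config["reducer_id"] = uuid.uuid4().hex
    return config

mapper_client = make_client_config(MAPPER_SERVERS)
reducer_client = make_client_config(REDUCER_SERVERS, with_reducer_id=True)
```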
Example: The Master Process
Submitting a Mapper Job For Each Split
The company IDs are pushed into an array of splits. A jobs hash for both mappers and reducers is defined.
A background mapper job for each split is submitted to gearman.
The job handle is a unique ID generated by gearman when the job is submitted. This is used to identify the job in the jobs hash.
The split is passed to the job as frozen workload data.
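The submission step might look roughly like the following Python sketch. `FakeClient` is a hypothetical stand-in that only records jobs, not a Gearman client; the handle format, the function name `mapper`, the company ids, and the use of `pickle` as the "freeze" step are all assumptions for illustration.

```python
import itertools
import pickle

class FakeClient:
    """Hypothetical stand-in for a Gearman client; it only records submitted jobs."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.queue = {}   # job handle -> (function name, workload bytes)

    def do_background(self, function_name, workload):
        handle = f"H:localhost:{next(self._ids)}"   # illustrative handle format
        self.queue[handle] = (function_name, workload)
        return handle

def submit_mapper_jobs(client, company_ids):
    """Submit one background mapper job per split and track it in a jobs hash."""
    splits = list(company_ids)   # the company IDs become the array of splits
    jobs = {}                    # job handle -> job description
    for split in splits:
        workload = pickle.dumps({"split": split})   # "frozen" workload data
        handle = client.do_background("mapper", workload)
        jobs[handle] = {"type": "mapper", "split": split}
    return jobs

mapper_client = FakeClient()
jobs = submit_mapper_jobs(mapper_client, ["company_a", "company_b"])
```

Keying the jobs hash by handle means the master needs nothing but the handle to look a job up again when it later polls for status.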
Example: The Master Process
Job Monitoring: Submitting Reducer Jobs When Mapper Jobs Are Done
Mapper and Reducer jobs are monitored inside of a loop.
The job status is queried from the client using the job handle.
If the job is completed and if it is a mapper job, a reducer is now run (detailed in the next slide).
Completed jobs are deleted from the jobs hash.
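The loop described here can be sketched as follows, again in Python and with stand-ins rather than real Gearman calls. The callables `is_done` and `submit_reducer` are hypothetical hooks representing the status query and the reducer submission.

```python
def monitor(jobs, is_done, submit_reducer):
    """Poll every tracked job until the jobs hash is empty.

    jobs:           dict of job handle -> {"type": ..., "split": ...}
    is_done:        callable(handle) -> bool, stands in for a status query
    submit_reducer: callable(split) -> new job handle
    """
    finished = []
    while jobs:
        for handle in list(jobs):        # copy the keys: the dict changes in the loop
            if not is_done(handle):
                continue
            job = jobs.pop(handle)       # completed jobs leave the jobs hash
            finished.append((job["type"], job["split"]))
            if job["type"] == "mapper":  # a finished mapper triggers its reducer
                new_handle = submit_reducer(job["split"])
                jobs[new_handle] = {"type": "reducer", "split": job["split"]}
    return finished

# Usage with stand-ins in which every job reports as complete on the first poll.
handles = iter(["r1", "r2"])
jobs = {"m1": {"type": "mapper", "split": "company_a"},
        "m2": {"type": "mapper", "split": "company_b"}}
order = monitor(jobs, is_done=lambda handle: True,
                submit_reducer=lambda split: next(handles))
```

A real master would pause between polls instead of spinning, and would need a policy for jobs that never complete; both are omitted here.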
Example: The Master Process
Detail on Submitting the Reducer Job
A background reducer job is submitted to gearman.
The job handle is a unique ID generated by gearman when the job is submitted. This is used to identify the job in the jobs hash.
The split and reducer_id are passed to the job as frozen workload data.
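As an illustration of the frozen workload, the Python sketch below serializes the split and reducer id on the master side and restores them on the worker side. `pickle` stands in for whatever serializer the talk's Perl code uses, and the function names and output naming scheme are hypothetical; the slides say only that the reducer id identifies the output.

```python
import pickle

def freeze_reducer_workload(split, reducer_id):
    """Master side: serialize the split and reducer id into one workload."""
    return pickle.dumps({"split": split, "reducer_id": reducer_id})

def reducer_job(workload):
    """Worker side: thaw the workload and derive a label for this job's output."""
    args = pickle.loads(workload)
    # Hypothetical naming scheme for the output of this reducer and split.
    return f"output-{args['reducer_id']}-{args['split']}"

workload = freeze_reducer_workload("company_a", "reducer-01")
output_name = reducer_job(workload)
```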
Example
The Result of Processing Filings from Two Companies
Filings Text > Mapper > Redundant Counts > Reducer > Aggregated Counts
... Filings Text ...
Our AdWords agreements are generally terminable at any time by our advertisers. Google AdSense is the program through which we distribute our advertisers' AdWords ads for display on the web sites of our Google Network members. Our AdSense program includes AdSense for search and AdSense for content. AdSense for search is our service for distributing relevant ads from our advertisers for display with search results on our Google Network members' sites.
... Redundant Counts ...
1 AdWords
1 Google Adsense
1 Adwords
1 Google Network
1 AdSense
1 AdSense
1 AdSense
1 Google Network
... Aggregated Counts ...
1 Ad Words
14 AdMob
299 AdSense
152 AdWords
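The reduce step on this slide amounts to summing the counts for each distinct subject. A small Python sketch over the eight redundant records shown above; the function name `aggregate` is an assumption, and the talk's own reducer is a Perl worker not reproduced in the transcript.

```python
from collections import Counter

# The redundant (count, subject) records shown on the slide.
redundant = [(1, "AdWords"), (1, "Google Adsense"), (1, "Adwords"), (1, "Google Network"),
             (1, "AdSense"), (1, "AdSense"), (1, "AdSense"), (1, "Google Network")]

def aggregate(records):
    """Sum the counts for each distinct subject; keys are compared exactly as written."""
    totals = Counter()
    for count, subject in records:
        totals[subject] += count
    return dict(totals)

totals = aggregate(redundant)
```

Because keys are compared exactly, spellings such as "AdWords" and "Adwords" remain separate subjects. The totals from these eight records are much smaller than the aggregated counts on the slide, which presumably cover far more text than the excerpt shown.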
That's It For Now
To Be Continued...
Presented By Jamie Pitts
Credits
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
Gearman (http://gearman.org), Brian Aker and Eric Day
Diagrams and other content were used from the following sources:
http://gearman.org/