My mapreduce1 presentation
Transcript of My mapreduce1 presentation
![Page 1: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/1.jpg)
MapReduce: Simplified Data Processing on Large Clusters
Google, Inc.
Presented by Noha El-Prince
Winter 2011
![Page 2: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/2.jpg)
Problem and Motivations
- Large data size
- Limited CPU power
- Difficulties of distributed, parallel computing
![Page 3: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/3.jpg)
MapReduce
- MapReduce is a software framework introduced by Google
- Enables automatic parallelization and distribution of large-scale computations
- Hides the details of parallelization, data distribution, load balancing, and fault tolerance
- Achieves high performance
![Page 4: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/4.jpg)
Outline
- MapReduce: Execution Example
- Programming Model
- MapReduce: Distributed Execution
- More Examples
- Customizations on Clusters
- Refinements
- Performance Measurement
- Conclusion and Future Work
- MapReduce in Other Companies
![Page 5: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/5.jpg)
Programming Model
- Raw data goes into the MapReduce library; reduced, processed data comes out.
- Map: (k, v) → (k', v') — produces the intermediate data
- Reduce: (k', <v'>*) → <k', v'>*
![Page 6: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/6.jpg)
Example
- Input:
  - Page 1: the weather is good
  - Page 2: today is good
  - Page 3: good weather is good
- Desired output: the frequency with which each word is encountered across all pages:
  (the 1), (is 3), (weather 2), (today 1), (good 4)
![Page 7: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/7.jpg)
Input data (read by M map workers):
  "the weather is good" / "today is good" / "good weather is good"
Intermediate data (emitted by the map workers):
  (the,1) (weather,1) (is,1) (good,1) (today,1) (is,1) (good,1) (good,1) (weather,1) (is,1) (good,1)
Grouped data (after group by key):
  (the,[1]) (weather,[1,1]) (is,[1,1,1]) (good,[1,1,1,1]) (today,[1])
Output data (produced by R reduce workers):
  (the,1) (weather,2) (is,3) (good,4) (today,1)

```
map(key, value):
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
```
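The word-count pseudocode above can be sketched as a minimal, single-machine Python version. The cluster machinery (workers, shuffle over the network) is elided, and the function names are illustrative:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit (word, 1) for every word in the document.
    for word in text.lower().split():
        yield (word, 1)

def group_by_key(pairs):
    # Shuffle step: collect all values for each intermediate key.
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def reduce_fn(key, values):
    # Reduce: sum the counts for one word.
    return (key, sum(values))

pages = {
    1: "the weather is good",
    2: "today is good",
    3: "good weather is good",
}

intermediate = [kv for doc, text in pages.items() for kv in map_fn(doc, text)]
counts = dict(reduce_fn(k, vs) for k, vs in group_by_key(intermediate).items())
print(counts)  # {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}
```

The explicit `group_by_key` step is what the MapReduce library performs between the map and reduce phases.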
![Page 8: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/8.jpg)
Programming Model
- Input: a set of key/value pairs
- Programmer specifies two functions:
  - Map: map(k, v) → <k', v'>
  - Reduce: reduce(k', <v'>*) → <k', v'>*
- All v' with the same k' are reduced together
![Page 9: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/9.jpg)
Distributed Execution Overview
- The user program forks a master and many workers.
- The master assigns map tasks and reduce tasks to idle workers.
- Each map worker reads its input split (Split 0, Split 1, Split 2, …), applies the map function, and writes intermediate data to its local disk.
- Reduce workers remote-read the intermediate data, sort it by key, apply the reduce function, and write the final output files (Output File 0, Output File 1).
![Page 10: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/10.jpg)
MapReduce Examples
Distributed Grep
- Search pattern (key): "virus"
- Input: web pages, e.g. "A…virus", "B…", "C…virus…"
- Map emits a (pattern, line) pair for every matching line: (virus, A…), (virus, B…)
- Reduce collects the matches: (virus, [A…, B…])
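A small Python sketch of distributed grep, run on one machine (file names and contents are made up for illustration):

```python
def grep_map(filename, lines, pattern):
    # Map: emit (pattern, matching line) for every line containing the pattern.
    for line in lines:
        if pattern in line:
            yield (pattern, f"{filename}: {line}")

def grep_reduce(key, values):
    # Reduce is the identity: just pass the matched lines through.
    return (key, list(values))

files = {"A": ["x virus y", "clean"], "B": ["virus here"], "C": ["nothing"]}
matches = [kv for name, lines in files.items() for kv in grep_map(name, lines, "virus")]
key, found = grep_reduce("virus", (v for _, v in matches))
print(found)  # ['A: x virus y', 'B: virus here']
```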
![Page 11: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/11.jpg)
MapReduce Examples
Count of URL Access Frequency
- Input: web server logs, e.g. www.cbc.com, www.cnn.com, www.bbc.com, www.cbc.com, www.cbc.com, www.bbc.com
- Map emits (URL, 1) per request; grouping yields (CBC, [1,1,1]), (CNN, [1]), (BBC, [1,1])
- Reduce sums the counts: (CBC, 3), (BBC, 2), (CNN, 1)
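The URL-frequency job is word count with URLs as the keys; a one-machine Python sketch using the slide's log lines:

```python
from collections import Counter

def url_map(log_line):
    # Map: emit (URL, 1) for each request in the access log.
    yield (log_line.strip(), 1)

logs = ["www.cbc.com", "www.cnn.com", "www.bbc.com",
        "www.cbc.com", "www.cbc.com", "www.bbc.com"]
pairs = [kv for line in logs for kv in url_map(line)]
freq = Counter()
for url, one in pairs:
    freq[url] += one  # reduce: sum the counts per URL
print(freq.most_common())  # [('www.cbc.com', 3), ('www.bbc.com', 2), ('www.cnn.com', 1)]
```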
![Page 12: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/12.jpg)
MapReduce Examples
Reverse Web-Link Graph
- Input: crawled pages with their links; e.g. www.youtube.com and www.disney.com (sources) both link to www.facebook.com (target)
- Map emits a (target, source) pair for each link: (facebook, youtube), (facebook, disney)
- Reduce concatenates all sources per target: (facebook, [youtube, disney])
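A Python sketch of the reverse web-link graph, with a toy crawl mirroring the slide (the URLs are illustrative):

```python
from collections import defaultdict

def link_map(source, outlinks):
    # Map: for every link source -> target, emit (target, source).
    for target in outlinks:
        yield (target, source)

crawl = {
    "www.youtube.com": ["www.facebook.com"],
    "www.disney.com": ["www.facebook.com"],
}
inlinks = defaultdict(list)  # reduce: concatenate all sources per target
for src, outs in crawl.items():
    for target, source in link_map(src, outs):
        inlinks[target].append(source)
print(dict(inlinks))  # {'www.facebook.com': ['www.youtube.com', 'www.disney.com']}
```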
![Page 13: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/13.jpg)
MapReduce Examples
Term-Vector per Host
- Input: the documents of one host (e.g. the facebook hostname)
- Map emits <facebook, word1>, <facebook, word2>, <facebook, word2>, …
- Reduce produces <facebook, [word2, …]>: a summary of the host's most popular words
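A Python sketch of the term-vector job; the reduce keeps only the top words per host (the host name, documents, and `top_n` cutoff are illustrative):

```python
from collections import Counter

def tv_map(host, document):
    # Map: emit (hostname, word) for every word in a page from that host.
    for word in document.split():
        yield (host, word)

def tv_reduce(host, words, top_n=2):
    # Reduce: keep only the most frequent words as the host's term vector.
    return (host, [w for w, _ in Counter(words).most_common(top_n)])

docs = [("facebook", "friends photos friends"), ("facebook", "photos photos")]
pairs = [kv for host, d in docs for kv in tv_map(host, d)]
host, vector = tv_reduce("facebook", [w for _, w in pairs])
print(vector)  # ['photos', 'friends']
```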
![Page 14: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/14.jpg)
MapReduce Examples
Inverted Index
- Input: a set of documents
- Map emits <word, docID> pairs: <word1, docID1>, <word2, docID1>, <word3, docID2>, <word1, docID2>, <word1, docID3>, …
- Reduce groups the postings per word: <word1, [docID1, docID2, docID3]>, <word2, [docID1]>
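A single-machine Python sketch of the inverted index (document IDs and contents are made up):

```python
from collections import defaultdict

def index_map(doc_id, text):
    # Map: emit (word, doc_id) once per distinct word in the document.
    for word in set(text.split()):
        yield (word, doc_id)

docs = {"doc1": "apple banana", "doc2": "banana cherry", "doc3": "apple"}
index = defaultdict(list)  # reduce: gather the doc IDs per word
for doc_id, text in docs.items():
    for word, d in index_map(doc_id, text):
        index[word].append(d)
postings = {w: sorted(ds) for w, ds in index.items()}
print(postings["apple"])  # ['doc1', 'doc3']
```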
![Page 15: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/15.jpg)
Outline
- MapReduce: Execution Example ✓
- Programming Model ✓
- MapReduce: Distributed Execution ✓
- More Examples ✓
- Customizations on Clusters
- Refinements
- Performance Measurement
- Conclusion & Future Work
- MapReduce in Other Companies
![Page 16: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/16.jpg)
Customizations on Clusters
- Coordination
- Scheduling
- Fault Tolerance
- Task Granularity
- Backup Tasks
![Page 17: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/17.jpg)
Customizations on Clusters: Coordination
The master keeps a data structure tracking each task's worker, state, and file:

| Task | Worker       | State       | File             |
|------|--------------|-------------|------------------|
| M    | 250.133.22.7 | completed   | Root/intFile.txt |
| M    | 250.133.22.8 | in progress | Root/intFile.txt |
| R    | 250.123.23.3 | idle        | Root/outFile.txt |
![Page 18: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/18.jpg)
Customizations on Clusters: Scheduling
Master scheduling policy (objective: conserve network bandwidth):
1. GFS divides each file into 64 MB blocks.
2. Input data are stored on the workers' local disks (managed by GFS).
   - Locality: the same cluster is used for both data storage and data processing.
3. GFS stores multiple copies of each block (typically 3) on different machines.
![Page 19: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/19.jpg)
Customizations on Clusters: Fault Tolerance
On worker failure:
- Detect failure via periodic heartbeats
- Re-execute completed and in-progress map tasks
- Re-execute in-progress reduce tasks
- Task completion is committed through the master
On master failure:
- Could handle, but don't yet (master failure unlikely)
- The MapReduce task is aborted and the client is notified
![Page 20: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/20.jpg)
Customizations on Clusters: Task Granularity (how are tasks divided?)
Rule of thumb: make M and R much larger than the number of worker machines.
- Improves dynamic load balancing
- Speeds recovery from worker failure
Usually R is smaller than M.
![Page 21: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/21.jpg)
Customizations on Clusters: Backup Tasks
- Problem of stragglers: machines that take a long time to complete one of the last few tasks.
- When a MapReduce operation is about to complete:
  - The master schedules backup executions of the remaining tasks.
  - A task is marked "complete" whenever either the primary or the backup execution completes.
- Effect: dramatically shortens job completion time.
![Page 22: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/22.jpg)
Outline
- MapReduce: Execution Example ✓
- Programming Model ✓
- MapReduce: Distributed Execution ✓
- More Examples ✓
- Customizations on Clusters ✓
- Refinements
- Performance Measurement
- Conclusion & Future Work
- Companies using MapReduce
![Page 23: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/23.jpg)
Refinements
- Partitioning functions
- Skipping bad records
- Status information
- Other refinements
![Page 24: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/24.jpg)
Refinements: Partitioning Function
- MapReduce users specify the number of reduce tasks/output files desired (R).
- For reduce, records with the same intermediate key must end up at the same worker.
- The system uses a default partition function, e.g. hash(key) mod R, which yields fairly well-balanced partitions.
- Sometimes it is useful to override it: e.g. hash(hostname(URL key)) mod R ensures that URLs from the same host end up in the same output file.
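The two partitioning strategies above can be sketched in Python. The value of R, the function names, and the use of Python's built-in `hash` in place of MapReduce's real hash function are all illustrative:

```python
R = 4  # number of reduce tasks / output files (illustrative)

def default_partition(key):
    # Default: spread keys evenly across the R reducers.
    return hash(key) % R

def host_partition(url):
    # Override: all URLs from the same host land on the same reducer,
    # so one output file holds a whole host's data.
    host = url.split("/")[2]  # e.g. "http://cnn.com/news" -> "cnn.com"
    return hash(host) % R

a = host_partition("http://cnn.com/news")
b = host_partition("http://cnn.com/sports")
print(a == b)  # True: same host, same partition
```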
![Page 25: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/25.jpg)
Refinements: Skipping Bad Records
- Map/Reduce functions sometimes fail for particular inputs.
- MapReduce gives special treatment to "bad" input data, i.e. input data that repeatedly crashes a task. The master, which tracks task crashes, recognizes such situations and, after a number of failed retries, decides to skip that piece of data.
- Effect: can work around bugs in third-party libraries.
![Page 26: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/26.jpg)
Refinements: Status Information
- Status pages show the progress of the computation, with links to the standard error and output files generated by each task.
- The user can predict how long the computation will take, add more resources if needed, and see which workers have failed.
- Useful for diagnosing bugs in user code.
![Page 27: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/27.jpg)
Other Refinements
- Combiner function: compression of the intermediate data; useful for saving network bandwidth.
- User-defined counters: periodically propagated from the worker machines to the master; useful for checking the behavior of MapReduce operations (shown on the master status page).
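The combiner refinement can be sketched in Python: it runs the reduce logic locally on a map worker's output before the shuffle, so far fewer pairs cross the network (the input text is illustrative):

```python
from collections import Counter

def mapper(text):
    # Map: emit (word, 1) pairs for one input split.
    return [(w, 1) for w in text.split()]

def combine(pairs):
    # Combiner: partially sum the counts on the map worker itself,
    # shrinking the intermediate data before it is sent to reducers.
    local = Counter()
    for w, c in pairs:
        local[w] += c
    return list(local.items())

raw = mapper("good good good weather good")
combined = combine(raw)
print(len(raw), len(combined))  # 5 2
```

This works here because word-count's reduce (summation) is associative and commutative, which is the condition for reusing it as a combiner.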
![Page 28: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/28.jpg)
Outline
- MapReduce: Execution Example ✓
- Programming Model ✓
- MapReduce: Distributed Execution ✓
- More Examples ✓
- Customizations on Clusters ✓
- Refinements ✓
- Performance Measurement
- Conclusion & Future Work
- Companies using MapReduce
![Page 29: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/29.jpg)
Performance
Tests were run on a cluster of 1,800 machines. Each machine had:
- 4 GB of memory
- Dual-processor 2 GHz Xeons with Hyper-Threading
- Dual 160 GB IDE disks
- A gigabit Ethernet link
- Bisection bandwidth of approximately 100-200 Gbps
Two benchmarks:
- Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
- Sort: sort 10^10 100-byte records
![Page 30: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/30.jpg)
Grep
- M = 15,000 (input split = 64 MB), R = 1; 1,764 workers
- Search pattern: 3 characters, found in 92,337 records
- 1,800 machines read 1 TB of data at a peak of ~31 GB/s
- Startup overhead is significant for short jobs: the entire computation takes ~80 s plus ~1 minute of startup
![Page 31: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/31.jpg)
Sort
- M = 15,000 (input split = 64 MB), R = 4,000; 1,746 workers
- Three runs: (a) normal execution, (b) no backup tasks, (c) 200 tasks killed
- Fig. (a): better than the TeraSort benchmark's reported result of 1,057 s
- (a) Locality optimization → input rate > shuffle rate and output rate; the output phase writes 2 copies of the sorted data → shuffle rate > output rate
- (b) 5 stragglers → the entire computation takes 44% longer than normal
![Page 32: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/32.jpg)
Experience: Rewrite of Production Indexing System
- New code is simpler and easier to understand
- MapReduce takes care of failures and slow machines
- Easy to make indexing faster by adding more machines
![Page 33: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/33.jpg)
Outline
- MapReduce: Execution Example ✓
- Programming Model ✓
- MapReduce: Distributed Execution ✓
- More Examples ✓
- Customizations on Clusters ✓
- Refinements ✓
- Performance Measurement ✓
- Conclusion & Future Work
- Companies using MapReduce
![Page 34: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/34.jpg)
Conclusion & Future Work
- MapReduce has proven to be a useful abstraction
- Greatly simplifies large-scale computations
- Fun to use: focus on the problem, let the library deal with the messy details
![Page 35: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/35.jpg)
MapReduce Advantages/Disadvantages
Now it's easy to program for many CPUs:
- Communication management is effectively gone; I/O scheduling is done for us
- Fault tolerance and monitoring: machine failures, suddenly slow machines, etc. are handled
- Can be much easier to design and program
- Can cascade several (many?) MapReduce tasks
But it restricts the class of solvable problems:
- It might be hard to express a problem in MapReduce
- Data parallelism is key: you need to be able to break the problem up by data chunks
- MapReduce is closed source (internal to Google), written in C++; Hadoop is an open-source, Java-based rewrite
![Page 36: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/36.jpg)
Outline
- MapReduce: Execution Example ✓
- Programming Model ✓
- MapReduce: Distributed Execution ✓
- More Examples ✓
- Customizations on Clusters ✓
- Refinements ✓
- Performance Measurement ✓
- Conclusion & Future Work ✓
- Companies using MapReduce
![Page 37: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/37.jpg)
Companies using MapReduce
Amazon: Amazon Elastic MapReduce
- A web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data
- It uses a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3)
- Allows you to use Hadoop with no hardware investment
- http://aws.amazon.com/elasticmapreduce/
![Page 38: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/38.jpg)
Companies using MapReduce
- Amazon: to build product search indices
- Facebook: processing of web logs, via both MapReduce and Hive
- IBM and Google: making large compute clusters available to higher-education and research organizations
- New York Times: large-scale image conversions
- Yahoo: MapReduce and Pig for web log processing, data-model training, web-map construction, and much, much more
- Many universities, for teaching parallel and large-data systems
And many more; see them all at http://wiki.apache.org/hadoop/PoweredBy
![Page 39: My mapreduce1 presentation](https://reader033.fdocuments.in/reader033/viewer/2022052618/554f5ecdb4c9058a148b4621/html5/thumbnails/39.jpg)