[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
![Page 1: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/1.jpg)
Introduction to Hadoop
Zak Stone <[email protected]>
PhD candidate, Harvard School of Engineering and Applied Sciences
Advisor: Todd Zickler (Computer Vision)
![Page 2: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/2.jpg)
Hadoop distributes data and computation across a large number of computers.
![Page 3: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/3.jpg)
Outline
1. Why should you care about Hadoop?
2. What exactly is Hadoop?
3. An overview of Hadoop Map-Reduce
4. The Hadoop Distributed File System (HDFS)
5. Hadoop advantages and disadvantages
6. Getting started with Hadoop
7. Useful resources
![Page 4: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/4.jpg)
![Page 5: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/5.jpg)
Why should you care? - Lots of Data
LOTS OF DATA EVERYWHERE
![Page 6: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/6.jpg)
Why should you care? - Lots of Data
LOTS!
![Page 7: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/7.jpg)
Why should you care? - Lots of Data
![Page 8: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/8.jpg)
Why should you care? - Even Grocery Stores Care
...
![Page 9: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/9.jpg)
Why Hadoop for big data?
• Most credible open-source toolset for large-scale, general-purpose computing
• Backed by ,
• Used by , , many others
• Increasing support from web services
• Hadoop closely imitates infrastructure developed by Google
• Hadoop processes petabytes daily, right now
![Page 10: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/10.jpg)
Why Hadoop for big data?
![Page 11: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/11.jpg)
DISCLAIMER
• Don’t use Hadoop if your data and computation fit on one machine
• Getting easier to use, but still complicated
http://www.wired.com/gadgetlab/2008/07/patent-crazines/
![Page 12: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/12.jpg)
![Page 13: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/13.jpg)
What exactly is Hadoop?
• Actually a growing collection of subprojects
![Page 14: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/14.jpg)
What exactly is Hadoop?
• Actually a growing collection of subprojects; focus on two right now
![Page 15: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/15.jpg)
![Page 16: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/16.jpg)
An overview of Hadoop Map-Reduce
Traditional Computing (one computer)
Hadoop (many computers)
![Page 17: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/17.jpg)
An overview of Hadoop Map-Reduce
(Actually more like this)
(many computers, little communication, stragglers and failures)
![Page 18: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/18.jpg)
Map-Reduce: Three phases
1. Map
2. Sort
3. Reduce
![Page 19: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/19.jpg)
Map-Reduce: Map phase
INPUT PAIR -> OUTPUT PAIRS
(key, value) -> (key, value), (key, value), (key, value) (zero or more output pairs)
Only specify operations on key-value pairs!
(each “elephant” works on an input pair; doesn’t know other elephants exist)
![Page 20: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/20.jpg)
Map-Reduce: Map phase, word-count example
(line1, “Hello there.”) -> (“hello”, 1), (“there”, 1)
(line2, “Why, hello.”) -> (“why”, 1), (“hello”, 1)
![Page 21: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/21.jpg)
Map-Reduce: Sort phase
(key1, value289), (key1, value43), (key1, value3)
(key2, value512), (key2, value11), (key2, value67)
...
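The grouping that the sort phase performs can be imitated on a single machine. This is only a minimal sketch, assuming the mapper output fits in memory; `sort_and_group` is a hypothetical helper for illustration, not a Hadoop API, and the real framework does this work across many machines.

```python
from itertools import groupby
from operator import itemgetter


def sort_and_group(pairs):
    """Sort (key, value) pairs by key and collect each key's values.

    Hypothetical in-memory stand-in for the sort phase.
    """
    # Sorting on the key alone is stable, so values keep their arrival order.
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


# Mapper output arrives in no particular key order.
mapped = [("key2", "value512"), ("key1", "value289"), ("key2", "value11"),
          ("key1", "value43"), ("key1", "value3"), ("key2", "value67")]
grouped = list(sort_and_group(mapped))
```

After this step every key appears once, paired with all of its values, which is the shape the reduce phase expects.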
![Page 22: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/22.jpg)
Map-Reduce: Sort phase, word-count example
(“hello”, 1)
(“there”, 1)
(“why”, 1)
(“hello”, 1)
![Page 23: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/23.jpg)
Map-Reduce: Reduce phase
(key1, value289), (key1, value43), (key1, value3) -> (key1, output1)
...
![Page 24: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/24.jpg)
Map-Reduce: Reduce phase, word-count example
(“hello”, 1), (“hello”, 1) -> (“hello”, 2)
(“there”, 1) -> (“there”, 1)
(“why”, 1) -> (“why”, 1)
![Page 25: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/25.jpg)
Map-Reduce: Code for word-count
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)
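To see the three phases work together, the word-count functions can be driven by a small single-process loop. This is a sketch under stated assumptions: `run_local` is a hypothetical driver written for illustration (it is not part of Hadoop or Dumbo), and the mapper here additionally lowercases words and strips punctuation, which is an assumption made so the result matches the “Hello there.” / “Why, hello.” example.

```python
import string
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # Assumption: normalize case and strip punctuation so that
    # "Hello" and "hello." are counted as the same word.
    for word in value.split():
        word = word.strip(string.punctuation).lower()
        if word:
            yield word, 1


def reducer(key, values):
    yield key, sum(values)


def run_local(mapper, reducer, inputs):
    """Hypothetical single-process driver: map, then sort, then reduce."""
    # Map: every input pair may produce zero or more output pairs.
    mapped = [pair for k, v in inputs for pair in mapper(k, v)]
    # Sort: bring pairs with the same key next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce: call the reducer once per key with all of that key's values.
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results


lines = [("line1", "Hello there."), ("line2", "Why, hello.")]
counts = run_local(mapper, reducer, lines)
print(counts)  # [('hello', 2), ('there', 1), ('why', 1)]
```

The mapper and reducer never refer to each other or to the driver, which is what lets a framework run them on many machines instead.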
![Page 26: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/26.jpg)
Seems like too much work for a word-count!
![Page 27: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/27.jpg)
Map-Reduce: Imagine word-count on the Web
![Page 28: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/28.jpg)
Map-Reduce: The main advantage
def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)
With Hadoop, this very same code could run on the entire Web! (In theory, at least)
![Page 29: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/29.jpg)
![Page 30: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/30.jpg)
HDFS: Hadoop Distributed File System
Data
(chunks of data on computers)
(each chunk replicated more than once for reliability)
![Page 31: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/31.jpg)
HDFS: Hadoop Distributed File System
(key1, value1), (key2, value2), ...
(key1, value1), (key2, value2), ...
. . .
Computation is local to the data
Key-value pairs processed independently in parallel
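The idea of splitting data into replicated chunks can be illustrated with a toy model. The sketch below is not how HDFS actually chooses where replicas go; the round-robin rule, the function names, the node names, and the replication value of 3 are all assumptions made for illustration.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into consecutive fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, nodes, replication=3):
    """Assign each chunk to `replication` distinct nodes, round-robin.

    Toy placement rule for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for chunk_id in range(num_chunks):
        placement[chunk_id] = [nodes[(chunk_id + r) % len(nodes)]
                               for r in range(replication)]
    return placement


def surviving_chunks(placement, failed_node):
    """Chunks that still have at least one replica after one node fails."""
    return [chunk_id for chunk_id, holders in placement.items()
            if any(node != failed_node for node in holders)]


data = b"0123456789" * 10              # 100 bytes of stand-in data
chunks = split_into_chunks(data, 16)   # six 16-byte chunks plus one 4-byte chunk
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(len(chunks), nodes, replication=3)
```

Because every chunk lives on more than one node, losing any single node in this model leaves all chunks readable, and a map task for a chunk can be scheduled on any node that already holds a copy.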
![Page 32: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/32.jpg)
HDFS: Inspired by the Google File System
![Page 33: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/33.jpg)
![Page 34: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/34.jpg)
Hadoop Map-Reduce and HDFS: Advantages
• Distribute data and computation
• Computation local to data avoids network overload
• Tasks are independent
• Easy to handle partial failures - entire nodes can fail and restart
• Avoid crawling horrors of failure-tolerant synchronous distributed systems
• Speculative execution to work around stragglers
• Linear scaling in the ideal case
• Designed for cheap, commodity hardware
• Simple programming model
• The “end-user” programmer only writes map-reduce tasks
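Why independent tasks make partial failures easy to handle can be shown with a small simulation. This is a hedged sketch, not Hadoop's scheduler: `run_with_retries` and `flaky` are hypothetical names, failures are injected deterministically, and a "retry" here simply calls the task again in the same process.

```python
def run_with_retries(tasks, max_attempts=3):
    """Run independent tasks, rerunning any that raise RuntimeError.

    `tasks` maps a task id to a zero-argument callable.
    Returns the results and the number of attempts each task needed.
    """
    results, attempts = {}, {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = task()
                break
            except RuntimeError:
                continue  # stands in for rescheduling the task elsewhere
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, attempts


def flaky(value, failures):
    """Build a task that fails `failures` times before returning `value`."""
    state = {"left": failures}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("simulated node failure")
        return value

    return task


tasks = {"t0": flaky(10, 0), "t1": flaky(20, 2), "t2": flaky(30, 1)}
results, attempts = run_with_retries(tasks)
```

No task needs to know that another one failed: a rerun only repeats that one task's work, so the final results are the same as in a failure-free run.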
![Page 35: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/35.jpg)
Hadoop Map-Reduce and HDFS: Disadvantages
• Still rough - software under active development
• e.g. HDFS only recently added support for append operations
• Programming model is very restrictive
• Lack of central data can be frustrating
• “Joins” of multiple datasets are tricky and slow
• No indices! Often, entire dataset gets copied in the process
• Cluster management is hard (debugging, distributing software, collecting logs...)
• Still single master, which requires care and may limit scaling
• Managing job flow isn’t trivial when intermediate data should be kept
• Optimal configuration of nodes not obvious (# mappers, # reducers, mem. limits)
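The point about joins can be made concrete with one common approach, often called a reduce-side join; the slides do not name a specific technique, so this choice is an assumption. Each mapper tags its records with the dataset they came from, and the reducer pairs up records that share a key. The datasets, field names, and function names below are invented for illustration, and a dictionary stands in for the sort phase.

```python
from collections import defaultdict


def map_users(user_id, name):
    # Tag each record with its source so the reducer can tell them apart.
    yield user_id, ("user", name)


def map_orders(order_id, record):
    user_id, item = record
    # Re-key orders by user id, the join key.
    yield user_id, ("order", item)


def join_reducer(user_id, tagged_values):
    names, items = [], []
    for tag, payload in tagged_values:
        (names if tag == "user" else items).append(payload)
    # Emit one joined row per (user, order) combination for this key.
    for name in names:
        for item in items:
            yield user_id, (name, item)


users = [(1, "ada"), (2, "grace")]
orders = [("o1", (1, "book")), ("o2", (2, "pen")), ("o3", (1, "lamp"))]

# Stand-in for the sort phase: collect every tagged value under its join key.
grouped = defaultdict(list)
for k, v in users:
    for key, tagged in map_users(k, v):
        grouped[key].append(tagged)
for k, v in orders:
    for key, tagged in map_orders(k, v):
        grouped[key].append(tagged)

joined = [row for key in sorted(grouped)
          for row in join_reducer(key, grouped[key])]
```

Note that every record of both datasets has to pass through the sort phase before any matching happens, which is one reason joins done this way tend to be slow without indices.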
![Page 36: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/36.jpg)
![Page 37: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/37.jpg)
Getting started: Installation options
• Cloudera virtual machine
• Your own virtual machine (install Ubuntu in VirtualBox, which is free)
• Elastic MapReduce on EC2
• StarCluster with Hadoop on EC2
• Cloudera’s distribution of Hadoop on EC2
• Install Cloudera’s distribution of Hadoop on your own machine
• Available for RPM and Debian deployments
• Or download Hadoop directly from http://hadoop.apache.org/
![Page 38: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/38.jpg)
Getting started: Language choices
• Hadoop is written in Java
• However, Hadoop Streaming allows mappers and reducers in any language!
• Binary data is a little tricky with Hadoop Streaming
• Could use base64 encoding, but TypedBytes are much better
• For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo
• The Python word-count example and others come with Dumbo
• Dumbo makes binary data with TypedBytes easy
• Also consider Hadoopy: https://github.com/bwhite/hadoopy
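With Hadoop Streaming, a mapper or reducer is an ordinary program that reads lines on standard input and writes tab-separated key and value lines on standard output, and the reducer receives its input sorted by key. The sketch below shows that text protocol for word-count as plain functions over lines so it can be tried without a cluster; the function names are hypothetical, and wrapping each one in a script that loops over `sys.stdin` and prints is left as an assumption about how you would deploy it.

```python
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' line per word of the input text."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Sum the counts for each run of identical keys in sorted input."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run that imitates "map | sort | reduce" on a tiny input.
sample = ["hello there", "why hello"]
local_result = list(reduce_lines(sorted(map_lines(sample))))
print(local_result)  # ['hello\t2', 'there\t1', 'why\t1']
```

Because everything travels as lines of text, binary values cannot be passed through this protocol unchanged, which is why the slides point to base64 or TypedBytes.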
![Page 39: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/39.jpg)
![Page 40: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/40.jpg)
Useful resources and tips
• The Hadoop homepage: http://hadoop.apache.org/
• Cloudera: http://cloudera.com/
• Dumbo: http://wiki.github.com/klbostee/dumbo
• Hadoopy: https://github.com/bwhite/hadoopy
• Amazon Elastic Compute Cloud Getting Started Guide: http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/
• Always test locally on a tiny dataset before running on a cluster!
![Page 41: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/41.jpg)
...
![Page 42: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)](https://reader034.fdocuments.in/reader034/viewer/2022042518/54c6fa2d4a795944168b45ea/html5/thumbnails/42.jpg)
Thanks for your attention!