Distributed batch processing with Hadoop
![Page 1: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/1.jpg)
Distributed batch processing with Hadoop
Ferran Galí i Reniu
@ferrangali
09/01/2014
![Page 2: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/2.jpg)
Ferran Galí i Reniu
● UPC - FIB
● Trovit
![Page 3: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/3.jpg)
Problem
● Too much data
  ○ 90% of all the data in the world has been generated in the last two years
  ○ Large Hadron Collider: 25 petabytes per year
  ○ Walmart: 1M transactions per hour
● Hard disks
  ○ Cheap!
  ○ Still slow access time
  ○ Writes are even slower
![Page 4: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/4.jpg)
Solutions
● Multiple Hard Disks
  ○ Work in parallel
  ○ We can reduce access time!
● How to deal with hardware failure?
● What if we need to combine data?
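A rough back-of-the-envelope calculation shows why reading from many disks in parallel helps. The figures used here (a 1 TB data set, 100 MB/s sustained transfer per disk, 100 disks) are illustrative assumptions, not numbers from the talk:

```python
# Illustrative assumptions, not measurements from the talk.
DATASET_MB = 1_000_000        # 1 TB expressed in MB (decimal units)
DISK_THROUGHPUT_MB_S = 100    # assumed sustained transfer rate of one disk


def read_time_seconds(dataset_mb: float, disks: int) -> float:
    """Time to scan the data set when it is split evenly across `disks`
    drives that are all read at once (ignores seeks and network cost)."""
    return dataset_mb / (DISK_THROUGHPUT_MB_S * disks)


single = read_time_seconds(DATASET_MB, disks=1)
parallel = read_time_seconds(DATASET_MB, disks=100)
print(f"1 disk:    {single / 3600:.1f} hours")
print(f"100 disks: {parallel / 60:.1f} minutes")
```

Under these assumptions a full scan drops from hours to a couple of minutes, which is also why the two follow-up questions matter: more disks mean more failures, and the data needed by one computation may sit on different disks.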
![Page 5: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/5.jpg)
![Page 6: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/6.jpg)
Hadoop
● Doug Cutting & Mike Cafarella
![Page 7: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/7.jpg)
Hadoop
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
October 2003
![Page 8: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/8.jpg)
Hadoop
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
December 2004
![Page 9: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/9.jpg)
Hadoop
● Doug Cutting & Mike Cafarella
● Yahoo!
![Page 10: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/10.jpg)
Hadoop
● HDFS
  ○ Storage
● MapReduce
  ○ Processing
● Ecosystem
![Page 11: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/11.jpg)
![Page 12: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/12.jpg)
HDFS
● Distributed storage
  ○ Managed across a network of commodity machines
● Blocks
  ○ About 128 MB
  ○ Large data sets
● Tolerance to node failure
  ○ Data replication
● Streaming data access
  ○ Read many times
  ○ Write once (batch)
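The block and replication bullets translate into simple arithmetic. The sketch below uses the 128 MB block size from the slide; the replication factor of 3 and the 1000 MB file are assumptions made for illustration (3 is the usual HDFS default, but it is configurable):

```python
import math

BLOCK_SIZE_MB = 128   # block size quoted on the slide
REPLICATION = 3       # assumed replication factor


def block_count(file_size_mb: float) -> int:
    """Number of HDFS blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def raw_storage_mb(file_size_mb: float) -> float:
    """Disk space used cluster-wide once every block is replicated.
    The last block only occupies as much space as the data it holds."""
    return file_size_mb * REPLICATION


size_mb = 1000  # hypothetical file
print(block_count(size_mb))     # 8 blocks: 7 full ones plus a 104 MB remainder
print(raw_storage_mb(size_mb))  # 3000 MB of raw disk across the cluster
```

Each of those blocks can live on a different machine, which is what later lets one map task run per block.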
![Page 13: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/13.jpg)
HDFS
● DataNodes (Workers)
  ○ Store blocks
● NameNode (Master)
  ○ Maintains metadata
  ○ Knows where the blocks are located
  ○ Makes DataNodes fault tolerant
  ○ Single point of failure
  ○ Secondary NameNode (checkpoints the metadata; not a hot standby)
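The master/worker split can be pictured with a toy in-memory model. This is only a sketch: `ToyNameNode` and its methods are hypothetical names, not part of the Hadoop API, and the placement policy (least-loaded node first, replication factor 2) is an assumption chosen to keep the example small.

```python
from collections import defaultdict


class ToyNameNode:
    """Keeps only metadata: which DataNodes hold a replica of each block."""

    def __init__(self, datanodes, replication=2):
        self.live = set(datanodes)
        self.replication = replication
        self.locations = defaultdict(set)   # block id -> DataNode names

    def _load_key(self, node):
        # Sort by number of replicas held, then by name for determinism.
        held = sum(node in nodes for nodes in self.locations.values())
        return (held, node)

    def add_block(self, block_id):
        # Place replicas on the least-loaded live DataNodes.
        targets = sorted(self.live, key=self._load_key)[: self.replication]
        self.locations[block_id].update(targets)

    def datanode_failed(self, node):
        """Forget the dead node and restore the replication factor."""
        self.live.discard(node)
        for nodes in self.locations.values():
            nodes.discard(node)
            spare = sorted(self.live - nodes, key=self._load_key)
            while len(nodes) < self.replication and spare:
                nodes.add(spare.pop(0))


namenode = ToyNameNode(["dn1", "dn2", "dn3"])
for block in ["blk_1", "blk_2", "blk_3"]:
    namenode.add_block(block)
namenode.datanode_failed("dn1")
print({block: sorted(nodes) for block, nodes in namenode.locations.items()})
```

Losing a DataNode is survivable because the mapping lets the master re-replicate its blocks elsewhere. Losing the `locations` mapping itself would make every block unfindable, which is the single point of failure the slide refers to.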
![Page 14: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/14.jpg)
HDFS
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
![Page 15: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/15.jpg)
HDFS
● Interfaces
  ○ Java
  ○ Command line interface
● Load
$> hadoop fs -put file.csv /user/hadoop/file.csv
● Extract
$> hadoop fs -get /user/hadoop/file.csv file.csv
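When the load and extract steps are scripted rather than typed, the same two commands can be wrapped in small helpers. This is a minimal sketch: the function names are hypothetical, and actually running the commands assumes a configured Hadoop client on the PATH.

```python
import subprocess
from typing import List


def hdfs_put_command(local_path: str, hdfs_path: str) -> List[str]:
    """Argument list for copying a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]


def hdfs_get_command(hdfs_path: str, local_path: str) -> List[str]:
    """Argument list for copying a file out of HDFS to local disk."""
    return ["hadoop", "fs", "-get", hdfs_path, local_path]


def run(command: List[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


# Usage, on a machine with a Hadoop client installed:
#   run(hdfs_put_command("file.csv", "/user/hadoop/file.csv"))
#   run(hdfs_get_command("/user/hadoop/file.csv", "file.csv"))
print(" ".join(hdfs_put_command("file.csv", "/user/hadoop/file.csv")))
```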
![Page 16: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/16.jpg)
![Page 17: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/17.jpg)
MapReduce
● Distributed processing paradigm
  ○ Moving computation is cheaper than moving data
● Map
  ○ Map(k1,v1) -> list(k2,v2)
● Reduce
  ○ Reduce(k2,list(v2)) -> list(v3)
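The two signatures can be made concrete with a single-process driver. The sketch below is not Hadoop code: `run_map_reduce` is a hypothetical helper that only mirrors the types, and the word-length histogram is an invented example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
V3 = TypeVar("V3")


def run_map_reduce(
    records: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], Iterable[V3]],
) -> List[V3]:
    # Map: every input pair may emit any number of intermediate pairs.
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    # Reduce: called once per distinct intermediate key, in key order.
    output: List[V3] = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


def length_mapper(_line_no: int, line: str):
    return [(len(word), 1) for word in line.split()]


def length_reducer(length: int, ones: List[int]):
    return [(length, sum(ones))]


records = [(1, "Java is great"), (2, "Hadoop is also great")]
histogram = run_map_reduce(records, length_mapper, length_reducer)
print(histogram)  # [(2, 2), (4, 2), (5, 2), (6, 1)]
```

The driver never looks inside the keys or values; everything problem-specific lives in the two functions passed in, which is the point of the paradigm.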
![Page 18: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/18.jpg)
Word Counter
map (Long key, String value)
    for each (String word in value)
        emit(word, 1);
reduce (String word, List values)
    emit(word, sum(values));
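The pseudocode maps almost line for line onto two Python generators. This is a sketch of the same logic, not the Hadoop Java API; `map_fn` and `reduce_fn` are illustrative names, and splitting on whitespace is an assumption about what "each word in value" means.

```python
from typing import Iterator, List, Tuple


def map_fn(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """key identifies the input line, value is the line's text."""
    for word in value.split():
        yield word, 1


def reduce_fn(word: str, values: List[int]) -> Iterator[Tuple[str, int]]:
    """values holds every 1 emitted for this word across all map calls."""
    yield word, sum(values)


print(list(map_fn(1, "Java is great")))  # [('Java', 1), ('is', 1), ('great', 1)]
print(list(reduce_fn("great", [1, 1])))  # [('great', 2)]
```

Note that the map function ignores its key entirely; only the reduce side depends on how the framework groups the emitted pairs.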
![Page 19: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/19.jpg)
Word Counter
Input:
Java is great
Hadoop is also great
![Page 20: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/20.jpg)
Word Counter - Map
Input:
Key  Value
1    Java is great
2    Hadoop is also great
![Page 24: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/24.jpg)
Word Counter - Map
map(1, "Java is great") emits:
Key     Value
Java    1
is      1
great   1
![Page 26: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/26.jpg)
Word Counter - Map
map(2, "Hadoop is also great") adds four more pairs, giving the full map output:
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1
![Page 29: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/29.jpg)
Word Count - Group & Sort
map(k1,v1) -> list(k2,v2)    reduce(k2,list(v2)) -> list(v3)
Map output:
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1
group:
Key     Value
Java    [1]
is      [1, 1]
great   [1, 1]
Hadoop  [1]
also    [1]
sort:
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]
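The group and sort steps between map and reduce can be sketched in a few lines. The function names are illustrative. The case-insensitive sort key is an assumption made to reproduce the order shown on the slide; a job using Hadoop's `Text` keys with the default comparator orders keys byte-wise, which would place "Hadoop" and "Java" before the lowercase words.

```python
from collections import defaultdict

map_output = [("Java", 1), ("is", 1), ("great", 1),
              ("Hadoop", 1), ("is", 1), ("also", 1), ("great", 1)]


def group(pairs):
    """Collect all values emitted for the same key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def sort_groups(grouped):
    """Order the groups by key, ignoring case (see the note above)."""
    return sorted(grouped.items(), key=lambda item: item[0].lower())


grouped = group(map_output)
print(grouped)
print(sort_groups(grouped))
```

The sorted list of `(key, values)` pairs is exactly what the reduce function is then called with, one call per pair.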
![Page 37: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/37.jpg)
Word Count - Reduce
reduce is called once per key, in sorted order:
Call                      Output
reduce("also", [1])       also 1
reduce("great", [1, 1])   great 2
reduce("Hadoop", [1])     Hadoop 1
reduce("is", [1, 1])      is 2
reduce("Java", [1])       Java 1
![Page 38: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/38.jpg)
Distributed?
● Map tasks
  ○ Each input block is processed by its own map task
● Reduce tasks
  ○ Partitioning when grouping
![Page 39: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/39.jpg)
Word Count - Partition
num partitions = 1
Map output:
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1
group:
Key     Value
Java    [1]
is      [1, 1]
great   [1, 1]
Hadoop  [1]
also    [1]
sort:
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]
![Page 40: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/40.jpg)
Word Count - Partition
num partitions = 2
Map output:
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1
First partition, group:
Key     Value
great   [1, 1]
Hadoop  [1]
also    [1]
First partition, sort:
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
Second partition, group:
Key     Value
Java    [1]
is      [1, 1]
Second partition, sort:
Key     Value
is      [1, 1]
Java    [1]
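Which partition a key lands in is decided by a partition function. The sketch below follows the shape of Hadoop's default `HashPartitioner` (clear the sign bit of the key's hash, then take it modulo the number of partitions). As an assumption for illustration it hashes keys with Java's `String.hashCode()` rule; a real job with `Text` keys hashes the encoded bytes, so its assignment may differ.

```python
from collections import defaultdict


def java_string_hash(text: str) -> int:
    """Java's String.hashCode(): h = 31*h + char, wrapped to a signed
    32-bit int. Assumes characters in the Basic Multilingual Plane."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def partition(key: str, num_partitions: int) -> int:
    """Clear the sign bit, then take the remainder."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_partitions


keys = ["Java", "is", "great", "Hadoop", "also"]
by_partition = defaultdict(list)
for key in keys:
    by_partition[partition(key, 2)].append(key)
print(dict(by_partition))
```

With two partitions this particular hash happens to put "Java" and "is" together and "great", "Hadoop" and "also" together, the same split as on the slide. The property that matters is that every occurrence of a key goes to the same partition, so a single reduce task sees all of its values.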
![Page 41: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/41.jpg)
Distributed?
● Map tasks
  ○ Each input block is processed by its own map task
● Reduce tasks
  ○ Partitioning when grouping
  ○ Each partition is processed by its own reduce task
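Put together, the job is one map task per block followed by one reduce task per partition. The sketch below imitates that on a single machine with a thread pool, purely as an illustration; on a cluster the tasks run on different nodes. The two one-line "blocks" and the byte-sum partitioner are assumptions chosen to keep the example small.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 2


def partition(key: str) -> int:
    # Stand-in partitioner: any deterministic function of the key works.
    return sum(key.encode("utf-8")) % NUM_PARTITIONS


def map_task(block):
    """Runs the word-count map over every record of one input block."""
    return [(word, 1) for _, line in block for word in line.split()]


def reduce_task(pairs):
    """Groups, sorts and sums the pairs routed to one partition."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    # Plain string sort, so capitalised words come first.
    return [(word, sum(grouped[word])) for word in sorted(grouped)]


blocks = [[(1, "Java is great")], [(2, "Hadoop is also great")]]

with ThreadPoolExecutor() as pool:
    map_outputs = list(pool.map(map_task, blocks))      # one task per block
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for output in map_outputs:                          # route pairs by key
        for pair in output:
            partitions[partition(pair[0])].append(pair)
    results = list(pool.map(reduce_task, partitions))   # one task per partition

print(results)
```

Each inner list corresponds to one reduce task's output, comparable to one `part-r-*` file of a real job.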
![Page 42: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/42.jpg)
MapReduce
● Job Tracker
  ○ Dispatches Map & Reduce Tasks
● Task Tracker
  ○ Executes Map & Reduce Tasks
![Page 43: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/43.jpg)
MapReduce
Example 1:
● Map
● Reduce
● Group & Partition
$> hadoop jar jug-hadoop.jar example1 /user/hadoop/input.txt /user/hadoop/output 2
$> hadoop fs -text /user/hadoop/output/part-r-*
http://github.com/ferrangali/jug-hadoop
![Page 44: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/44.jpg)
MapReduce
Example 2:
● Sorting
● n-Job workflow
$> hadoop jar jug-hadoop.jar example2 /user/hadoop/input.txt /user/hadoop/output 2
$> hadoop fs -text /user/hadoop/output/part-r-*
http://github.com/ferrangali/jug-hadoop
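The repository's code is not reproduced here. As an assumption about what a sorting workflow built from several jobs can look like, the sketch below chains two single-process jobs: the first counts words, the second re-keys each result by its count so that the framework's sort step orders words by frequency. `run_job` and the four functions are hypothetical names.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Minimal single-process job: map, group, sort by key, reduce."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


# Job 1: word count.
def count_map(_line_no, line):
    return [(word, 1) for word in line.split()]


def count_reduce(word, ones):
    return [(word, sum(ones))]


# Job 2: use the negated count as key so the sort puts frequent words first.
def swap_map(word, count):
    return [(-count, word)]


def swap_reduce(negated_count, words):
    return [(word, -negated_count) for word in sorted(words)]


lines = [(1, "Java is great"), (2, "Hadoop is also great")]
counts = run_job(lines, count_map, count_reduce)          # output of job 1
by_frequency = run_job(counts, swap_map, swap_reduce)     # job 2 reads it
print(by_frequency)
```

The second job does no real computation of its own; it exists only to make the framework sort on a different key, which is why workflows of several small jobs are common.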
![Page 45: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/45.jpg)
Big Data
![Page 46: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/46.jpg)
Big Data
● Too much data
  ○ Not a problem any more
● It’s just a matter of which tools to use
● New opportunity for businesses
![Page 47: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/47.jpg)
Big Data Platform
[Diagram: three stages, Consumption, Processing and Serving; labelled components: DB, logs, indexes, DB, NoSQL]
![Page 48: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/48.jpg)
Hadoop Ecosystem
![Page 49: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/49.jpg)
Hive
● Data Warehouse
● SQL-Like analysis system
SELECT word, COUNT(*)
FROM (SELECT EXPLODE(SPLIT(line, ' ')) AS word FROM docs) words
GROUP BY word
ORDER BY word ASC;
● Executes MapReduce underneath!
![Page 50: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/50.jpg)
HBase
● Based on BigTable
● Column-oriented database
● Random realtime read/write access
● Easy to bulk load from Hadoop
![Page 51: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/51.jpg)
Hadoop Ecosystem
● ZooKeeper
  ○ Centralized coordination system
● Pig
  ○ Data-flow language to analyze large data sets
● Kafka
  ○ Distributed messaging system
● Sqoop
  ○ Transfer between RDBMS and HDFS
● ...
![Page 52: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/52.jpg)
Hadoop - Who’s using it?
![Page 53: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/53.jpg)
Trovit
● What is it:
  ○ Vertical search engine
  ○ Real estate, cars, jobs, products, vacations
● Challenges:
  ○ Millions of documents to index
  ○ Traffic generates a huge amount of log files
![Page 54: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/54.jpg)
Trovit
● Legacy:
  ○ MySQL used to support document indexing
  ○ Didn’t scale!
● Batch processing:
  ○ Hadoop with a pipeline workflow
  ○ Problem solved!
● Real time processing:
  ○ Storm to improve freshness
● More challenges:
  ○ Content analysis
  ○ Traffic analysis
![Page 55: Distributed batch processing with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/554f60afb4c905bb178b46ea/html5/thumbnails/55.jpg)
Questions?
Distributed batch processing with Hadoop
@ferrangali