MapReduce with Hadoop

Vitalie Scurtu

What is Hadoop?
Hadoop is a set of open source frameworks for parallel and distributed computing:
• HDFS: A distributed file system.
• MapReduce: A technique and a framework for parallel computation on a cluster.
• ZooKeeper: A configuration service.
• And others: Hive, HBase, Mahout, Pig.
• Yahoo's Hadoop cluster was used to sort 1 terabyte of data in 209 seconds in the Terabyte Sort competition.

Why distributed computing?
• Reduced costs. Many ordinary computers are cheaper than one more powerful computer.
• Scalability. We can add new computers to the cluster at any time.
• Combined computing power and speed.
• Distributed algorithms.
• Stability
• Robust frameworks.

Configuring Hadoop
• It is written in Java and uses XML files for configuration.
• Installation is very simple.
• Every computer can become a part of the cluster.
• We need only 30 minutes to try a demo.
• Uses an advanced configuration system named ZooKeeper.
• cat /usr/local/hadoop/conf/slaves
hadoop-master
hadoop-slave01
hadoop-slave02
hadoop-slave03
hadoop-slave06
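As a rough illustration of the two kinds of configuration files mentioned above, the sketch below reads a Hadoop-style XML file and a slaves file. The property names follow the conventions of Hadoop releases from that period, but the values (host name, port, replication factor) are hypothetical and are not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration content; the host name and port are made up.
SAMPLE_CONFIG = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:54310</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
"""

# Host names as listed in the slaves file shown above.
SAMPLE_SLAVES = """hadoop-master
hadoop-slave01
hadoop-slave02
hadoop-slave03
hadoop-slave06
"""


def read_properties(xml_text):
    """Return the <property> entries of a Hadoop-style config as a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def read_slaves(text):
    """Return the host names listed one per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(read_properties(SAMPLE_CONFIG))
    print(read_slaves(SAMPLE_SLAVES))
```

Every value is kept as a string, since the XML itself carries no type information.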

HDFS: Hadoop Distributed File System
• Distributed file system
• Support for huge files (gigabytes to terabytes)
• Tolerates hardware failure through replication
• File access model is “Write-once-read-many”
• Cross-platform (Java)
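To make the combination of huge files and replication more concrete, here is a small arithmetic sketch. It assumes that a file is cut into fixed-size blocks and that every block is stored several times. The 64 MB block size and the replication factor of 3 are assumptions chosen for illustration; the slides do not state the values used on this cluster.

```python
def hdfs_footprint(file_size_bytes, block_size_bytes=64 * 1024**2, replication=3):
    """Return (number of blocks, number of stored block replicas) for one file.

    block_size_bytes and replication are assumed values, not from the slides.
    """
    # Integer ceiling division: a partly filled last block still counts as a block.
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    return blocks, blocks * replication


if __name__ == "__main__":
    one_terabyte = 1024**4
    blocks, replicas = hdfs_footprint(one_terabyte)
    print(f"1 TB file: {blocks} blocks, {replicas} block replicas in the cluster")
```

Under these assumptions, losing one machine leaves at least two other copies of every block it held.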

MapReduce
• A unique model for distributed computation; the main algorithm is divided into two stages
– Map
• Accepts key-value pairs (a dictionary) as input.
• Records must be independent (key A does not depend on key B).
• Performs the intermediate computations and prepares the data for the Reduce stage.
– Reduce
• Accepts as input collections of key-value pairs with intermediate results.
• Parallel sorting and grouping functions.
• Returns the final result.
– Map -> Reduce
• It is not only a distributed framework but also a development methodology, thanks to its unique formula. The constraints on the algorithm let the developer think about the implementation instead of the parallel computation. Once a problem is transformed into a MapReduce algorithm, the framework is applicable.
– Computation time, when all tasks of a stage run in parallel: max(time_of_each_map) + max(time_of_each_reduce)
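The Map and Reduce stages described above can be imitated on a single machine. The following is a minimal in-memory sketch, not Hadoop's implementation: the function names are made up for illustration, and the grouping step stands in for the sorting and grouping that the framework performs between the two stages.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply mapper to every (key, value) record independently of the others."""
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))
    return intermediate


def group_by_key(intermediate):
    """Collect intermediate values per key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply reducer to each (key, values) group and collect the final results."""
    return {key: reducer(key, values) for key, values in groups}


def run_mapreduce(records, mapper, reducer):
    return reduce_phase(group_by_key(map_phase(records, mapper)), reducer)


# Example job: histogram of token lengths over a few hypothetical posts.
def length_mapper(post_id, text):
    return [(len(token), 1) for token in text.split()]


def sum_reducer(length, counts):
    return sum(counts)


if __name__ == "__main__":
    posts = [(1, "a bb cc"), (2, "bb ddd")]
    print(run_mapreduce(posts, length_mapper, sum_reducer))
```

Because the mapper sees one record at a time, the records could be handed to different machines without changing the result, which is the independence requirement listed for the Map stage.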

MapReduce
[Diagram: the Input is split across Map1, Map2, Map3 and Map4; their results go to Reduce, which produces the Output.]
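For a job shaped like this one, with four map tasks and one reduce task, the computation-time formula max(time_of_each_map) + max(time_of_each_reduce) can be worked through with a few numbers. The task durations below are hypothetical, and the estimate assumes that all map tasks run in parallel and that framework overhead is ignored.

```python
def parallel_time(map_times, reduce_times):
    """Estimated time when all tasks of a stage run at once: slowest map + slowest reduce."""
    return max(map_times) + max(reduce_times)


def sequential_time(map_times, reduce_times):
    """Time when the same tasks run one after another on a single machine."""
    return sum(map_times) + sum(reduce_times)


if __name__ == "__main__":
    map_times = [40, 55, 35, 50]  # hypothetical durations of Map1..Map4, in seconds
    reduce_times = [20]           # hypothetical duration of the single Reduce task
    print("parallel estimate:", parallel_time(map_times, reduce_times), "s")
    print("sequential total: ", sequential_time(map_times, reduce_times), "s")
```

The slowest map task sets the pace of the whole map stage, so unevenly sized inputs reduce the benefit of adding machines.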

Example of Applications
• Problem: Extract all the texts from a database with 1 million posts and compute the number of occurrences of each token.
mapper.py <- Takes an id as input
-> Prints each token with its number of occurrences
reducer.py <- Takes as input a list of tokens with their occurrence counts
-> Sums the occurrences of each token and outputs the final result.
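The mapper and reducer described above might look roughly like the sketch below. It is not the author's code. The database is replaced by a small hypothetical dictionary, both functions live in one file so the example runs on its own, and the tab-separated "token, count" line format is the usual Hadoop Streaming convention, assumed here. In a real streaming job each function would be its own script reading from standard input.

```python
from collections import Counter
from itertools import groupby

# Hypothetical stand-in for the posts database: id -> text.
POSTS = {
    "1": "hadoop runs map and reduce",
    "2": "map then reduce then map",
}


def fetch_text(post_id):
    """Hypothetical lookup; a real mapper would query the database here."""
    return POSTS.get(post_id, "")


def mapper(lines):
    """For each input id, yield 'token<TAB>count' lines for that post."""
    for line in lines:
        post_id = line.strip()
        if not post_id:
            continue
        counts = Counter(fetch_text(post_id).lower().split())
        for token, count in sorted(counts.items()):
            yield f"{token}\t{count}"


def reducer(lines):
    """Sum the counts of each token; the input lines must be sorted by token."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for token, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{token}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # sorted() stands in for the sort that the framework runs between the stages.
    for line in reducer(sorted(mapper(["1", "2"]))):
        print(line)
```

The reducer relies on all lines for one token arriving next to each other, which is why the sort between the two stages matters.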

Experiment 1, 100K docs, 5 slaves
• Time without MapReduce
– 906.63user – 4.18system – 0:14:32 elapsed – 104%CPU (0avgtext+0avgdata 0maxresident)k
• Time with MapReduce
– 3.79user – 0.40system – 0:21:00 elapsed – 0%CPU (0avgtext+0avgdata 0maxresident)k
– 10/10/25 11:10:36 INFO streaming.StreamJob: map 0% reduce 0%
– 10/10/25 11:10:50 INFO streaming.StreamJob: map 16% reduce 0%
– 10/10/25 11:11:48 INFO streaming.StreamJob: map 33% reduce 0%
– 10/10/25 11:12:10 INFO streaming.StreamJob: map 49% reduce 0%
– 10/10/25 11:14:09 INFO streaming.StreamJob: map 66% reduce 0%
– 10/10/25 11:14:37 INFO streaming.StreamJob: map 82% reduce 0%
– 10/10/25 11:16:26 INFO streaming.StreamJob: map 83% reduce 0%
– 10/10/25 11:18:12 INFO streaming.StreamJob: map 83% reduce 17%
– 10/10/25 11:20:18 INFO streaming.StreamJob: map 99% reduce 17%
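Progress lines like the ones above are easier to compare once they are parsed. The sketch below pulls out the timestamp and the two percentages with a regular expression. The pattern is inferred from the lines shown here and is an assumption, not a documented log format.

```python
import re
from datetime import datetime

# Pattern inferred from the StreamJob lines above (two-digit year/month/day).
LINE_RE = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO streaming\.StreamJob:"
    r"\s+map\s+(\d+)%\s+reduce\s+(\d+)%"
)


def parse_progress(lines):
    """Return (timestamp, map_percent, reduce_percent) for every matching line."""
    progress = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
            progress.append((stamp, int(match.group(2)), int(match.group(3))))
    return progress


if __name__ == "__main__":
    log = [
        "10/10/25 11:10:36 INFO streaming.StreamJob: map 0% reduce 0%",
        "10/10/25 11:16:26 INFO streaming.StreamJob: map 83% reduce 0%",
        "10/10/25 11:20:18 INFO streaming.StreamJob: map 99% reduce 17%",
    ]
    progress = parse_progress(log)
    covered = (progress[-1][0] - progress[0][0]).total_seconds()
    print(f"log covers {covered:.0f} s, last line: map {progress[-1][1]}% reduce {progress[-1][2]}%")
```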

Experiment 2, 1M docs, 5 slaves
• Time without MapReduce
– 6892.08user – 25.03system – 1:56:37 elapsed – 98%CPU (0avgtext+0avgdata 0maxresident)k
• Time with MapReduce
– 6.30user – 0.98system – 3:26:18 elapsed – 0%CPU (0avgtext+0avgdata 0maxresident)k
– 10/10/26 15:04:36 INFO streaming.StreamJob: map 100% reduce 14%
– 10/10/26 15:04:37 INFO streaming.StreamJob: map 100% reduce 16%
– 10/10/26 15:04:39 INFO streaming.StreamJob: map 100% reduce 25%
– 10/10/26 15:04:40 INFO streaming.StreamJob: map 100% reduce 27%
– 10/10/26 15:04:42 INFO streaming.StreamJob: map 100% reduce 30%
– 10/10/26 15:04:44 INFO streaming.StreamJob: map 100% reduce 32%
– 10/10/26 15:04:45 INFO streaming.StreamJob: map 100% reduce 34%
– 10/10/26 15:04:48 INFO streaming.StreamJob: map 100% reduce 35%
– 10/10/26 15:07:29 INFO streaming.StreamJob: map 83% reduce 35%
– 10/10/26 15:07:35 INFO streaming.StreamJob: map 100% reduce 35%
– 10/10/26 15:09:57 INFO streaming.StreamJob: map 100% reduce 36%
– 10/10/26 15:09:59 INFO streaming.StreamJob: map 100% reduce 37%

Experiment 3, 1M docs, 3 slaves
• Time without MapReduce
– 6892.08user – 25.03system – 1:56:37 elapsed – 98%CPU (0avgtext+0avgdata 0maxresident)k
• Time with MapReduce
– 5.50user – 0.97system – 00:53:20 elapsed – 0%CPU (0avgtext+0avgdata 0maxresident)k
– 10/10/26 15:04:36 INFO streaming.StreamJob: map 100% reduce 14%
– 10/10/26 15:04:37 INFO streaming.StreamJob: map 100% reduce 16%
– 10/10/26 15:04:39 INFO streaming.StreamJob: map 100% reduce 25%
– 10/10/26 15:04:40 INFO streaming.StreamJob: map 100% reduce 27%
– 10/10/26 15:04:42 INFO streaming.StreamJob: map 100% reduce 30%
– 10/10/26 15:04:44 INFO streaming.StreamJob: map 100% reduce 32%
– 10/10/26 15:04:45 INFO streaming.StreamJob: map 100% reduce 34%
– 10/10/26 15:04:48 INFO streaming.StreamJob: map 100% reduce 35%
– 10/10/26 15:07:29 INFO streaming.StreamJob: map 83% reduce 35%
– 10/10/26 15:07:35 INFO streaming.StreamJob: map 100% reduce 35%
– 10/10/26 15:09:57 INFO streaming.StreamJob: map 100% reduce 36%
– 10/10/26 15:09:59 INFO streaming.StreamJob: map 100% reduce 37%
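The elapsed times of the three experiments can be compared directly. The sketch below converts the reported times to seconds and computes the ratio of the time without MapReduce to the time with it. A ratio below 1 means the MapReduce run took longer. Only the elapsed times come from the slides; the helper names are illustrative.

```python
def to_seconds(elapsed):
    """Convert an elapsed time such as '1:56:37' or '0:14:32' to seconds."""
    seconds = 0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def speedup(without_mr, with_mr):
    """Ratio of the elapsed time without MapReduce to the elapsed time with it."""
    return to_seconds(without_mr) / to_seconds(with_mr)


# (experiment, elapsed without MapReduce, elapsed with MapReduce), as reported above.
EXPERIMENTS = [
    ("Experiment 1, 100K docs, 5 slaves", "0:14:32", "0:21:00"),
    ("Experiment 2, 1M docs, 5 slaves", "1:56:37", "3:26:18"),
    ("Experiment 3, 1M docs, 3 slaves", "1:56:37", "00:53:20"),
]

if __name__ == "__main__":
    for name, without_mr, with_mr in EXPERIMENTS:
        print(f"{name}: speed-up {speedup(without_mr, with_mr):.2f}x")
```

With these figures, the MapReduce runs in Experiments 1 and 2 took longer than the run without MapReduce, while the run in Experiment 3 finished in less than half the time.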

What’s next?
• MapReduce can be applied to many problems and natural language processing applications. Examples:
– Sentiment analysis.
– Computing probabilities over huge data sets.
– Retrieval problems.
– Statistics and analysis of huge data sets.
• MapReduce is not only a framework; it is also a distributed computing methodology.