MapReduce with Hadoop

Vitalie Scurtu

What is Hadoop?

Hadoop is a set of open source frameworks for parallel and distributed computing:

• HDFS: Distributed file system

• MapReduce: A technique and a framework for parallel computation on a cluster.

• ZooKeeper: A configuration service.

• and others: Hive, HBase, Mahout, Pig.

• Yahoo's Hadoop cluster was used to sort 1 terabyte of data in 209 seconds in the Terabyte Sort competition.

Why distributed computing?

• Reduced costs. Several ordinary computers are cheaper than one more powerful computer.

• Scalability. We can add new computers to the cluster at any time.

• High combined computing power and speed.

• Distributed algorithms.

• Stability

• Robust frameworks.

Configuring Hadoop

• It is written in Java and uses XML files for configuration.

• Installation is very simple.

• Every computer can become a part of the cluster.

• To try a demo we need only 30 minutes.

• Uses an advanced configuration system named ZooKeeper.

• cat /usr/local/hadoop/conf/slaves

hadoop-master
hadoop-slave01
hadoop-slave02
hadoop-slave03
hadoop-slave06
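The slaves file lists one hostname per line. A minimal sketch of reading such a list, for example to check which machines belong to the cluster, could look like this. The helper `read_hosts` and the rule of skipping blank and `#` lines are assumptions made for illustration; they are not part of Hadoop.

```python
def read_hosts(text):
    """Return the hostnames listed in a slaves-style file, one per line.

    Blank lines and lines starting with '#' are skipped (an assumption
    made for this sketch).
    """
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts


# Contents of the slaves file shown on the slide.
SLAVES_FILE = """\
hadoop-master
hadoop-slave01
hadoop-slave02
hadoop-slave03
hadoop-slave06
"""

if __name__ == "__main__":
    for host in read_hosts(SLAVES_FILE):
        print(host)
```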

HDFS: Hadoop Distributed File System

• Distributed file system

• Support for huge files (gigabytes to terabytes)

• Tolerant of hardware failure through replication

• File access model is “Write-once-read-many”

• Cross-platform (Java)
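To make the replication idea concrete, here is a toy sketch. It cuts a file into blocks, stores each block on three nodes, and checks which blocks stay readable when nodes fail. The round-robin placement and the function names are assumptions for illustration only; HDFS uses its own placement policy. Three replicas per block matches the HDFS default replication factor.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Map each block index to `replication` distinct nodes (toy round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


if __name__ == "__main__":
    nodes = ["hadoop-slave01", "hadoop-slave02", "hadoop-slave03", "hadoop-slave06"]
    blocks = split_into_blocks(b"x" * 10, block_size=4)
    placement = place_replicas(len(blocks), nodes)
    # Two of the four nodes fail; every block still has a live replica.
    failed = {"hadoop-slave02", "hadoop-slave03"}
    print(readable_blocks(placement, failed))
```

With three replicas spread over four nodes, losing any two nodes still leaves every block readable in this toy layout.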

MapReduce

• A unique model for distributed computation; the main algorithm is divided into two stages.

– Map
• Accepts key-value pairs (a dictionary) as input.
• Records must be independent (key A does not depend on key B).
• Performs the intermediate computations and prepares the data for the Reduce stage.

– Reduce
• Accepts as input collections of key-value pairs holding intermediate results.
• Parallel sorting and grouping functions.
• Returns the final result.

– Map -> Reduce
• It is not only a distributed framework but also a development methodology, thanks to its unique formula. The constraints on the algorithm let the developer think about the implementation rather than the parallel computation. Once a problem is transformed into a MapReduce algorithm, the framework is applicable.

– Computation time (roughly, when all tasks of a stage run in parallel): max(time_of_each_map) + max(time_of_each_reduce)
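The Map -> Reduce flow can be sketched in a single process. This is a simplified model, not Hadoop's implementation: the names `map_reduce` and `estimated_time` are hypothetical, everything runs sequentially, and the time estimate simply applies the formula above while ignoring scheduling and data transfer.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over (key, value) records, group by key, then reduce."""
    # Map stage: each record is processed independently of the others.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Grouping stage: collect all intermediate values that share a key.
    groups = defaultdict(list)
    for out_key, out_value in intermediate:
        groups[out_key].append(out_value)

    # Reduce stage: one call per key, in sorted key order.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def estimated_time(map_times, reduce_times):
    """Apply the slide's formula: slowest map task plus slowest reduce task."""
    return max(map_times) + max(reduce_times)


def token_mapper(doc_id, text):
    """Emit (token, 1) for every token of one document."""
    return [(token, 1) for token in text.split()]


def sum_reducer(token, counts):
    """Add up all the counts emitted for one token."""
    return sum(counts)


if __name__ == "__main__":
    records = [(1, "a b a"), (2, "b c")]
    print(map_reduce(records, token_mapper, sum_reducer))
    print(estimated_time(map_times=[4, 7, 5, 6], reduce_times=[3, 2]))
```

Because each record is mapped independently, the loop in the Map stage is the part a cluster can spread over many machines.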

MapReduce

[Diagram: Input -> Map1, Map2, Map3, Map4 -> Reduce -> Output]

Example of Applications

• Problem: Extract all the texts from a database with 1 million posts and compute the number of occurrences of each token.

mapper.py <- Takes an id as input

-> Prints each token with its number of occurrences

reducer.py <- Takes as input a list of tokens with their occurrence counts

-> Sums the occurrences of each token and outputs the final result.
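The original mapper.py and reducer.py are not shown in the slides, so the following is only a sketch of how such a pair might look. It makes several assumptions: each input line is already `id<TAB>text` (instead of looking the id up in a database), tokens are split on whitespace and lowercased, and the reducer receives its lines sorted by token, as Hadoop Streaming arranges between the two stages. In a real streaming job the two functions would read `sys.stdin` and print their output; here the pipeline is simulated locally.

```python
from collections import Counter
from itertools import groupby


def map_lines(lines):
    """Mapper: for each 'id<TAB>text' line, emit 'token<TAB>count' lines."""
    for line in lines:
        _, _, text = line.rstrip("\n").partition("\t")
        for token, count in Counter(text.lower().split()).items():
            yield f"{token}\t{count}"


def reduce_lines(lines):
    """Reducer: sum the counts per token; input must be sorted by token."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for token, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{token}\t{sum(int(count) for _, count in group)}"


def sort_by_token(lines):
    """Stand-in for the sort that happens between the map and reduce stages."""
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])


if __name__ == "__main__":
    posts = ["1\tthe cat saw the dog", "2\tthe dog"]
    for line in reduce_lines(sort_by_token(map_lines(posts))):
        print(line)
```

Counting inside the mapper first means each post sends at most one line per distinct token to the reducer.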

Experiment 1, 100K docs, 5 slaves

• Time without MapReduce
– 906.63user
– 4.18system
– 0:14:32 elapsed
– 104%CPU (0avgtext+0avgdata 0maxresident)k

• Time with MapReduce
– 3.79user
– 0.40system
– 0:21:00 elapsed
– 0%CPU (0avgtext+0avgdata 0maxresident)k
– 10/10/25 11:10:36 INFO streaming.StreamJob: map 0% reduce 0%

Experiment 2, 1M docs, 5 slaves

• Time with MapReduce
– 6.30user
– 0.98system
– 3:26:18 elapsed
– 0%CPU (0avgtext+0avgdata 0maxresident)k

– 10/10/26 15:04:36 INFO streaming.StreamJob: map 100% reduce 14%
– 10/10/26 15:04:37 INFO streaming.StreamJob: map 100% reduce 16%
– 10/10/26 15:04:39 INFO streaming.StreamJob: map 100% reduce 25%
– 10/10/26 15:04:40 INFO streaming.StreamJob: map 100% reduce 27%
– 10/10/26 15:04:42 INFO streaming.StreamJob: map 100% reduce 30%
– 10/10/26 15:04:44 INFO streaming.StreamJob: map 100% reduce 32%
– 10/10/26 15:04:45 INFO streaming.StreamJob: map 100% reduce 34%
– 10/10/26 15:04:48 INFO streaming.StreamJob: map 100% reduce 35%
– 10/10/26 15:07:29 INFO streaming.StreamJob: map 83% reduce 35%
– 10/10/26 15:07:35 INFO streaming.StreamJob: map 100% reduce 35%
– 10/10/26 15:09:57 INFO streaming.StreamJob: map 100% reduce 36%
– 10/10/26 15:09:59 INFO streaming.StreamJob: map 100% reduce 37%

Experiment 3, 1M docs, 3 slaves

• Time with MapReduce
– 5.50user
– 0.97system
– 00:53:20 elapsed
– 0%CPU (0avgtext+0avgdata 0maxresident)k
– 10/10/26 15:04:36 INFO streaming.StreamJob: map 100% reduce 14%
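To compare the three experiments, it helps to convert the elapsed times to seconds. The sketch below only does that arithmetic on the figures reported above. The helper name `elapsed_seconds` and the run labels are assumptions, and the slides do not explain why the runs differ, so no cause is implied.

```python
def elapsed_seconds(stamp):
    """Convert an 'h:mm:ss' elapsed time to a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Elapsed times copied from Experiments 1 to 3.
RUNS = {
    "100K docs, no MapReduce": "0:14:32",
    "100K docs, 5 slaves": "0:21:00",
    "1M docs, 5 slaves": "3:26:18",
    "1M docs, 3 slaves": "00:53:20",
}

if __name__ == "__main__":
    for label, stamp in RUNS.items():
        print(f"{label}: {elapsed_seconds(stamp)} s")

    # How much longer the 5-slave run took than the 3-slave run on 1M docs.
    ratio = elapsed_seconds(RUNS["1M docs, 5 slaves"]) / elapsed_seconds(
        RUNS["1M docs, 3 slaves"]
    )
    print(f"5 slaves / 3 slaves on 1M docs: {ratio:.2f}x")
```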


What’s next?

• MapReduce can be applied to many problems and natural language processing applications. Examples:
– Sentiment analysis.
– Computing probabilities over huge data sets.
– Retrieval problems.
– Statistics and analysis of huge data sets.

• MapReduce is not only a framework; it is also a distributed computing methodology.
