MapReduce and the New Software Stack
Maruf Aytekin PhD Student
BAU Computer Engineering Department Besiktas/Istanbul January 5, 2015
Outline
• Introduction
• DFS
• MapReduce
• Examples
• Matrix Calculation on Hadoop
Introduction
Modern data-mining and machine-learning applications, often called "big-data analysis," require us to manage massive amounts of data quickly.
Important Examples
• The ranking of Web pages by importance, which involves an iterated matrix-vector multiplication where the dimension is many billions.
• Searches in social-networking sites, which involve graphs with hundreds of millions of nodes and many billions of edges.
• Processing large amounts of text or streams, such as for news recommendation.
New software stack
• Not a "supercomputer" (Beowulf, etc.)
• "Computing clusters": large collections of commodity hardware, including conventional processors ("compute nodes") connected by Ethernet cables or inexpensive switches.
Distributed File System
• The new form of file system which features much larger units than the disk blocks in a conventional operating system.
• Files can be enormous, possibly terabytes in size.
• Files are rarely updated.
Physical Organization
• Files are divided into chunks (typically tens of megabytes, e.g. 64 MB).
• Chunks are replicated (typically three copies, held on different compute nodes).
DFS Implementations
• The Google File System (GFS)
• Hadoop Distributed File System (HDFS)
• CloudStore, an open-source DFS originally from Kosmix
HDFS Architecture
Block Replication
MapReduce
A style of computing: a framework and programming pattern for large-scale parallel processing.
Implementations:
• MapReduce by Google (internal)
• Hadoop by the Apache Software Foundation
MapReduce
Operates exclusively on <key, value> pairs.
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
MapReduce Computation
MapReduce
In brief, a MapReduce computation executes as follows:
• Chunks from a DFS are given to Map tasks.
• These Map tasks turn the chunks into a sequence of <key, value> pairs.
• The <key, value> pairs from each Map task are collected by a master controller and sorted by key (Combine).
• The keys are divided among all the Reduce tasks, so all <key, value> pairs with the same key wind up at the same Reduce task.
• The Reduce tasks work on one key at a time, process all values for that key, and output the results as <key, value> pairs.
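The flow above can be captured in a minimal Python sketch, where grouping by key stands in for the master controller's sort/shuffle (this is an in-memory simulation, not Hadoop; the helper name is my own):

```python
from collections import defaultdict

def run_mapreduce(chunks, map_fn, reduce_fn):
    # Map phase: each chunk yields a sequence of (key, value) pairs.
    groups = defaultdict(list)
    for chunk in chunks:
        for key, value in map_fn(chunk):
            groups[key].append(value)  # the "shuffle": collect values by key
    # Reduce phase: each key is processed exactly once, with all of its values.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}
```

For example, `run_mapreduce(["a b a", "b c"], lambda chunk: [(w, 1) for w in chunk.split()], lambda k, vs: sum(vs))` returns `{"a": 2, "b": 2, "c": 1}`.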
Execution of MapReduce
Hello World
Word Count
• file01: Hello World Bye World • file02: Hello Hadoop Goodbye Hadoop
Word Count
For the given sample input, the first map emits:
< Hello, 1 > < World, 1 > < Bye, 1 > < World, 1 >
The second map emits:
< Hello, 1 > < Hadoop, 1 > < Goodbye, 1 > < Hadoop, 1 >
Combiner: after being sorted on the keys, the output of the first map is:
< Bye, 1 > < Hello, 1 > < World, 2 >
The output of the second map is:
< Goodbye, 1 > < Hadoop, 2 > < Hello, 1 >
Word Count
Thus the output of the job is: < Bye, 1 > < Goodbye, 1 > < Hadoop, 2 > < Hello, 2 > < World, 2 >
The Reducer implementation, via its reduce method, simply sums up the values, which are the occurrence counts for each key.
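The whole word-count example can be reproduced with a short Python simulation (Hadoop's real WordCount is written in Java; this only mirrors the logic):

```python
from collections import Counter

def word_count_map(text):
    # Map: emit (word, 1) for every word in the input chunk.
    return [(word, 1) for word in text.split()]

def combine(pairs):
    # Combiner/Reducer: sum the counts per word, sorted by key.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

file01 = "Hello World Bye World"
file02 = "Hello Hadoop Goodbye Hadoop"

out1 = combine(word_count_map(file01))  # [('Bye', 1), ('Hello', 1), ('World', 2)]
out2 = combine(word_count_map(file02))  # [('Goodbye', 1), ('Hadoop', 2), ('Hello', 1)]
final = combine(out1 + out2)            # the job output shown above
```

Running `combine` once more over both map outputs plays the role of the Reduce phase, since summing counts is associative.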
Matrix Calculation
P = M N, where M is an I × J matrix and N is a J × K matrix, with elements
p_ik = Σ_j m_ij n_jk
Matrix Data Model for MapReduce:
• Each element of M is represented as a tuple (i, j, m_ij)
• Each element of N is represented as a tuple (j, k, n_jk)
Matrix Data Files for MapReduce
Each line holds one matrix element.

M (format M, i, j, m_ij):
M,0,0,10.0
M,0,2,9.0
M,0,3,9.0
M,1,0,1.0
M,1,1,3.0
M,1,2,18.0
M,1,3,25.2
. . .

N (format N, j, k, n_jk):
N,0,0,1.0
N,0,2,3.0
N,0,4,2.0
N,1,0,2.0
N,3,2,-1.0
N,3,6,4.0
N,4,6,5.0
. . .
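Lines in this format can be parsed with a small helper (a sketch; the function and field names are my own):

```python
def parse_line(line):
    # One element per line: matrix name, row index, column index, value.
    name, row, col, value = line.strip().split(",")
    return (name, int(row), int(col), float(value))
```

For example, `parse_line("M,0,2,9.0")` returns `("M", 0, 2, 9.0)`.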
Map Task
• For each element m_ij of M, produce the key-value pair (j, (M, i, m_ij)).
• For each element n_jk of N, produce the key-value pair (j, (N, k, n_jk)).
Map Task Output
• Every element of M from column j and every element of N from row j ends up with the same key j.
Reduce Task
• For each key j, pair each value (M, i, m_ij) with each value (N, k, n_jk) to produce a partial product ((i, k), m_ij · n_jk).
• Summing the partial products for each pair (i, k) gives p_ik, an element of P = M N.
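These map and reduce tasks can be sketched in plain Python (an in-memory simulation, not a Hadoop job; the helper names are my own), using the (matrix, row, column, value) record format shown above:

```python
from collections import defaultdict

def matmul_map(record):
    # Key each element by the shared index j, so every m_ij meets every n_jk.
    name, r, c, v = record
    if name == "M":
        return [(c, ("M", r, v))]  # m_ij is keyed by its column index j
    else:
        return [(r, ("N", c, v))]  # n_jk is keyed by its row index j

def matmul_reduce(j, values):
    # Pair every (M, i, m_ij) with every (N, k, n_jk): partial products for (i, k).
    ms = [(i, v) for name, i, v in values if name == "M"]
    ns = [(k, v) for name, k, v in values if name == "N"]
    return [((i, k), mv * nv) for i, mv in ms for k, nv in ns]

def multiply(m_records, n_records):
    # Shuffle: group the mapped values by key j.
    groups = defaultdict(list)
    for rec in m_records + n_records:
        for j, val in matmul_map(rec):
            groups[j].append(val)
    # Aggregate: sum the partial products for each output cell (i, k).
    p = defaultdict(float)
    for j, values in groups.items():
        for cell, product in matmul_reduce(j, values):
            p[cell] += product
    return dict(p)
```

With M = [[1, 2], [3, 4]] and N = [[5, 6], [7, 8]] as records, `multiply` returns {(0, 0): 19.0, (0, 1): 22.0, (1, 0): 43.0, (1, 1): 50.0}. The final summation is what a second MapReduce step (keyed on (i, k)) would perform on a real cluster.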
Application
• Run the application on Hadoop
Thank you!
Q & A