MapReduce and the New Software Stack

MapReduce and the New Software Stack Maruf Aytekin PhD Student BAU Computer Engineering Department Besiktas/Istanbul January 5, 2015


Page 1: MapReduce and the New Software Stack

MapReduce and the New Software Stack

Maruf Aytekin PhD Student

BAU Computer Engineering Department Besiktas/Istanbul January 5, 2015

Page 2: MapReduce and the New Software Stack

Outline

• Introduction
• DFS
• MapReduce
• Examples
• Matrix Calculation on Hadoop

Page 3: MapReduce and the New Software Stack

Introduction

Modern data-mining and machine-learning applications, often called “big-data analysis”, require us to manage massive amounts of data quickly.

Page 4: MapReduce and the New Software Stack

Important Examples

• The ranking of Web pages by importance, which involves an iterated matrix-vector multiplication where the dimension is many billions.

• Searches in social-networking sites, which involve graphs with hundreds of millions of nodes and many billions of edges.

• Processing large amounts of text or data streams, as in news recommendation.
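The first example above, ranking pages by iterated matrix-vector multiplication, can be sketched at toy scale. This is a minimal illustration and not a production ranking algorithm: the three-page `transition` matrix is hypothetical, and the sketch leaves out refinements such as damping that real rankers rely on.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


def iterate(matrix, vector, steps):
    """Apply the matrix to the vector repeatedly."""
    for _ in range(steps):
        vector = mat_vec(matrix, vector)
    return vector


# Hypothetical 3-page web. Column j holds the probabilities of moving
# from page j to each other page, so every column sums to 1.
transition = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

# Start from a uniform distribution and iterate until the ranks settle.
ranks = iterate(transition, [1 / 3, 1 / 3, 1 / 3], steps=50)
```

At Web scale the matrix has billions of rows and columns, so neither it nor the vector fits on one machine, which is what motivates the distributed stack described next.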

Page 5: MapReduce and the New Software Stack

New software stack

• Not a “supercomputer” (Beowulf etc.)

• “Computing clusters”: large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches.

Page 6: MapReduce and the New Software Stack

Distributed File System

• A new form of file system that uses much larger units than the disk blocks in a conventional operating system.

• Files can be enormous, possibly a terabyte in size.

• Files are rarely updated.

Page 7: MapReduce and the New Software Stack

Physical Organization

• Files are divided into chunks

• Chunks are replicated
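The two points above can be sketched in a few lines. This is a toy illustration under stated assumptions: the function names, the node names, and the round-robin placement are all hypothetical. Real systems use far larger chunks (tens of megabytes) and try to place replicas on different racks.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, nodes, replication):
    """Assign every chunk to `replication` distinct nodes, round-robin.

    Round-robin is an assumption made for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


# A 10-byte "file" with a tiny chunk size so the result is easy to read.
chunks = split_into_chunks(b"x" * 10, chunk_size=4)
placement = place_replicas(
    len(chunks), ["node-a", "node-b", "node-c", "node-d"], replication=3
)
```

With three replicas per chunk, losing any single node still leaves two copies of every chunk available.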

Page 8: MapReduce and the New Software Stack

DFS Implementations

• The Google File System (GFS)

• Hadoop Distributed File System (HDFS)

• CloudStore, by Kosmix

Page 9: MapReduce and the New Software Stack

HDFS Architecture

Page 10: MapReduce and the New Software Stack

Block Replication

Page 11: MapReduce and the New Software Stack

MapReduce

MapReduce is a style of computing: both a programming pattern and a framework that implements it.

Implementations:

• MapReduce by Google (internal)

• Hadoop by the Apache Software Foundation

Page 12: MapReduce and the New Software Stack

MapReduce

Operates exclusively on <key, value> pairs.

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Page 13: MapReduce and the New Software Stack

MapReduce Computation

Page 14: MapReduce and the New Software Stack

MapReduce

In brief, a MapReduce computation executes as follows:

• Chunks from a DFS are given to Map tasks.

• These Map tasks turn the chunks into a sequence of <key, value> pairs.

• The <key, value> pairs from each Map task are collected by a master controller and sorted by key. This is the shuffle-and-sort step; a combiner, if one is configured, pre-aggregates each Map task's output locally before it.

• The keys are divided among all the Reduce tasks, so all <key, value> pairs with the same key wind up at the same Reduce task.

• Each Reduce task works on one key at a time: it processes all the values for that key and then outputs the results as <key, value> pairs.
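The steps above can be imitated in a single process. The sketch below is only an in-memory model of the data flow, not how Hadoop or Google's implementation is built. The names `run_mapreduce`, `parity_map` and `sum_reduce` are hypothetical, and hashing keys to pick a Reduce task is an assumption about partitioning.

```python
from collections import defaultdict


def run_mapreduce(chunks, map_fn, reduce_fn, num_reducers=2):
    """Run map, group-by-key, and reduce over in-memory chunks."""
    # Each chunk goes to one Map task, which emits (key, value) pairs.
    map_outputs = [list(map_fn(chunk)) for chunk in chunks]

    # Keys are divided among the Reduce tasks, so all pairs with the
    # same key end up in the same partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            partitions[hash(key) % num_reducers][key].append(value)

    # Each Reduce task handles its keys one at a time, in sorted order.
    results = []
    for partition in partitions:
        for key in sorted(partition):
            results.extend(reduce_fn(key, partition[key]))
    return results


def parity_map(chunk):
    """Hypothetical Map function: tag each number as even or odd."""
    for number in chunk:
        yield ("even" if number % 2 == 0 else "odd", number)


def sum_reduce(key, values):
    """Hypothetical Reduce function: add up all values for one key."""
    yield (key, sum(values))


totals = dict(run_mapreduce([[1, 2, 3], [4, 5, 6]], parity_map, sum_reduce))
```

Because the grouping step guarantees that one Reduce call sees every value for its key, the Map and Reduce functions themselves never need to coordinate with each other.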

Page 15: MapReduce and the New Software Stack

Execution of MapReduce

Page 16: MapReduce and the New Software Stack

Hello World

Word Count

• file01: Hello World Bye World

• file02: Hello Hadoop Goodbye Hadoop

Page 17: MapReduce and the New Software Stack

Word Count

For the given sample input, the first map emits:

< Hello, 1 > < World, 1 > < Bye, 1 > < World, 1 >

The second map emits:

< Hello, 1 > < Hadoop, 1 > < Goodbye, 1 > < Hadoop, 1 >

Combiner: after being sorted on the keys, the output of each map is passed through a local combiner for aggregation.

The output of the first map:

< Bye, 1 > < Hello, 1 > < World, 2 >

The output of the second map:

< Goodbye, 1 > < Hadoop, 2 > < Hello, 1 >
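The map and combiner stages for this sample input can be reproduced with a short sketch. It uses plain Python rather than the Hadoop API, and the function names are hypothetical.

```python
from collections import Counter


def word_count_map(text):
    """Emit a (word, 1) pair for every word in the input text."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Locally aggregate one map's output: sum counts per word, sorted by key."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


file01 = "Hello World Bye World"
file02 = "Hello Hadoop Goodbye Hadoop"

combined01 = combine(word_count_map(file01))
combined02 = combine(word_count_map(file02))
```

Combining locally means the first map ships three pairs instead of four, which at scale reduces the data moved across the network before the reduce phase.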

Page 18: MapReduce and the New Software Stack

Word Count

Thus the output of the job is:

< Bye, 1 > < Goodbye, 1 > < Hadoop, 2 > < Hello, 2 > < World, 2 >

The Reducer implementation, via the reduce method, just sums up the values, which are the occurrence counts for each key.
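A minimal sketch of that reduce step follows, starting from the two combiner outputs of the word-count example. It is plain Python, not the Hadoop Reducer class, and `word_count_reduce` is a hypothetical name.

```python
from collections import defaultdict


def word_count_reduce(word, counts):
    """Sum the occurrence counts collected for one word."""
    return (word, sum(counts))


# Combiner outputs of the two maps in the word-count example.
combined_outputs = [
    [("Bye", 1), ("Hello", 1), ("World", 2)],
    [("Goodbye", 1), ("Hadoop", 2), ("Hello", 1)],
]

# Group the counts by word, as the framework does before calling reduce.
grouped = defaultdict(list)
for output in combined_outputs:
    for word, count in output:
        grouped[word].append(count)

job_output = [word_count_reduce(word, grouped[word]) for word in sorted(grouped)]
```

Only "Hello" appears in both files, so it is the only key whose reduce call receives more than one value.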

Page 19: MapReduce and the New Software Stack

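Matrix Calculation

P = M N

[Figure: matrix M, with rows indexed by i and columns by j, multiplied by matrix N, with rows indexed by j and columns by k, gives matrix P, with rows indexed by i and columns by k. The entries P(1,1) and P(1,2) are marked.]

Matrix Data Model for MapReduce:

M (i, j, m_ij)    N (j, k, n_jk)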

Page 20: MapReduce and the New Software Stack

Matrix Data Files for MapReduce

Matrix M file, one entry per line as M, i, j, m_ij:

M,0,0,10.0
M,0,2,9.0
M,0,3,9.0
M,1,0,1.0
M,1,1,3.0
M,1,2,18.0
M,1,3,25.2
. . .

Matrix N file, one entry per line as N, j, k, n_jk:

N,0,0,1.0
N,0,2,3.0
N,0,4,2.0
N,1,0,2.0
N,3,2,-1.0
N,3,6,4.0
N,4,6,5.0
. . .
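Records in this format can be multiplied with a single map and reduce pass, using the definition p_ik = sum over j of m_ij * n_jk. The sketch below shows one common one-pass formulation in plain Python. It is an assumption that this matches the Map and Reduce slides, whose content is not in the transcript. The function names and the 2x2 sample matrices are hypothetical.

```python
from collections import defaultdict


def parse(line):
    """Turn 'M,0,1,2.0' into ('M', 0, 1, 2.0)."""
    name, row, col, value = line.split(",")
    return name, int(row), int(col), float(value)


def matrix_map(record, m_rows, n_cols):
    """Send every entry to each output cell (i, k) that needs it."""
    name, row, col, value = record
    if name == "M":  # record is (M, i, j, m_ij)
        for k in range(n_cols):
            yield (row, k), ("M", col, value)
    else:  # record is (N, j, k, n_jk)
        for i in range(m_rows):
            yield (i, col), ("N", row, value)


def matrix_reduce(key, values):
    """Compute one cell of P: pair entries by j and sum the products."""
    m_entries = {j: v for name, j, v in values if name == "M"}
    n_entries = {j: v for name, j, v in values if name == "N"}
    return key, sum(m_entries[j] * n_entries[j] for j in m_entries if j in n_entries)


def multiply(lines, m_rows, n_cols):
    """Group the map output by key, then reduce each output cell."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in matrix_map(parse(line), m_rows, n_cols):
            grouped[key].append(value)
    return dict(matrix_reduce(key, grouped[key]) for key in sorted(grouped))


# Hypothetical 2x2 inputs: M = [[1, 2], [3, 4]], N = [[5, 6], [7, 8]].
sample = [
    "M,0,0,1.0", "M,0,1,2.0", "M,1,0,3.0", "M,1,1,4.0",
    "N,0,0,5.0", "N,0,1,6.0", "N,1,0,7.0", "N,1,1,8.0",
]
product = multiply(sample, m_rows=2, n_cols=2)
```

Entries that are absent from a file are treated as zero, so sparse matrices need no special handling. The cost is that each entry of M is replicated once per column of N, and each entry of N once per row of M.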

Page 21: MapReduce and the New Software Stack

Map

Page 22: MapReduce and the New Software Stack

Reduce

Page 23: MapReduce and the New Software Stack

Example

Page 24: MapReduce and the New Software Stack

Map Task

Matrix M key, value pairs produced as follows:

Matrix N key, value pairs produced as follows:

Page 25: MapReduce and the New Software Stack

Map Task Output

Page 26: MapReduce and the New Software Stack

Reduce Task

P =

Page 27: MapReduce and the New Software Stack

Application

• Run the application on Hadoop

Page 28: MapReduce and the New Software Stack

Thank you!

Q & A