MapReduce and the Art of "Thinking Parallel"

Dr. Shailesh Kumar, Third Leap, Inc.

Three I's of a great product!

▪ Interface: Intuitive | Functional | Elegant
▪ Infrastructure: Storage | Computation | Network
▪ Intelligence: Learn | Predict | Adapt | Evolve

Drowning in Data, Starving for Knowledge

ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTAGCTGTCGCTCGGCTGCATGCCTAGTGCACCTACGCAGTATAAACAATAATAAATTTTACTGTCGTTGACAAGAAACGAGTAACTCGTCCCTCTTCTGCAGACTGCTTATTACGCGACCGTAAGCTAC…

How BIG is Big Data?

▪ 600 million tweets per DAY
▪ 100 hours per MINUTE
▪ 800+ websites per MINUTE
▪ 100 TB of data uploaded DAILY
▪ 3.5 billion queries PER DAY
▪ 300 million active customers

How BIG is Big Data?

▪ Better Sensors
  ▪ Higher resolution, real-time, diverse measurements, …
▪ Faster Communication
  ▪ Network infrastructure, compression technologies, …
▪ Cheaper Storage
  ▪ Cloud-based storage, large warehouses, NoSQL databases
▪ Massive Computation
  ▪ Cloud computing, MapReduce/Hadoop parallel-processing paradigms
▪ Intelligent Decisions
  ▪ Advances in Machine Learning and Artificial Intelligence

How did we get here?

The Evolution of “Computing”

Parallel Computing Basics

▪ Data Parallelism (distributed computing)
  ▪ Lots of data → break it into "chunks"
  ▪ Process each "chunk" of data in parallel
  ▪ Combine the results from each "chunk"
  ▪ MAPREDUCE = Data Parallelism

▪ Process Parallelism (data-flow computing)
  ▪ Lots of stages → set up a process graph
  ▪ Pass data through all the stages
  ▪ All stages run in parallel, each on different data
  ▪ Assembly line = Process Parallelism

Agenda

MAPREDUCE Background

Problem 1 – Similarity between all pairs of documents!

Problem 2 – Parallelizing K-Means clustering

Problem 3 – Finding all Maximal Cliques in a Graph

MAPREDUCE 101: A 4-stage Process

[Diagram: lots of data is split into Shards 1…N; Maps 1…K each process N/K shards; Combines 1…K locally aggregate each Map's output; Shuffles 1…K route keys to Reduces 1…R, which write Outputs 1…R.]

Each Map processes N/K shards.

MAPREDUCE 101: An example Task

▪ Count the total frequency of every word on the web
  ▪ Total number of documents > 20 billion
  ▪ Total number of unique words > 20 million

▪ Non-Parallel / Linear Implementation:

for each document d on the Web:
    for each unique word w in d:
        DocCount(w, d) = # of times w occurs in d
        WebCount(w) += DocCount(w, d)
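To make the four stages concrete, here is a minimal in-memory Python sketch of this word-count task; it is an illustration of the paradigm rather than the talk's code, and the function names and the three toy shards are made up:

from collections import defaultdict

def map_word_count(document):
    # MAP: emit a (word, 1) pair for every word occurrence in one document.
    for word in document.split():
        yield (word, 1)

def combine(pairs):
    # COMBINE: pre-aggregate each Map's output locally, before the shuffle.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

def shuffle(pairs):
    # SHUFFLE: group all values emitted anywhere for the same key.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped.items()

def reduce_word_count(word, counts):
    # REDUCE: an order-agnostic aggregation of one key's values.
    return (word, sum(counts))

shards = [["a b b", "c"], ["a c c"], ["b a"]]   # three toy "shards"
combined = []
for shard in shards:   # each Map/Combine pair runs independently, in parallel
    pairs = (p for doc in shard for p in map_word_count(doc))
    combined.extend(combine(pairs))
for word, counts in shuffle(combined):
    print(reduce_word_count(word, counts))      # ('a', 3) ('b', 3) ('c', 3)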

MAPREDUCE – MAP/COMBINE

Map-1 (Shard 1): A 10, B 7, C 9, D 3, B 4  →  Combine-1: A 10, B 11, C 9, D 3
Map-2 (Shard 2): A 3, D 1, C 4, D 9, B 6   →  Combine-2: A 3, B 6, C 4, D 10
Map-3 (Shard 3): B 3, D 5, C 4, A 6, A 3   →  Combine-3: A 9, B 3, C 4, D 5

MAPREDUCE – Shuffle/Reduce

Combine outputs: (A 10, B 11, C 9, D 3), (A 3, B 6, C 4, D 10), (A 9, B 3, C 4, D 5)

Shuffle (group by key across all combiners):
  Shuffle for keys A, C: A 10, A 3, A 9 and C 9, C 4, C 4
  Shuffle for keys B, D: B 11, B 6, B 3 and D 3, D 10, D 5

Reduce-1: A 22, C 17
Reduce-2: B 20, D 18

Key Questions in MAPREDUCE

▪ Is the task really "data-parallelizable"?
  ▪ Not for high-dependence tasks (e.g. the Fibonacci series)
  ▪ Not for recursive tasks (e.g. binary search)

▪ What key-value pairs should the MAP step output?
  ▪ Each Map processes only one data record at a time
  ▪ It can generate zero, one, or multiple key-value pairs

▪ How should the values of a key be combined in the REDUCE step?
  ▪ The key for Reduce is the same as the key of the Map output
  ▪ The Reduce function must be "order agnostic"

Other considerations

▪ Reliability/Robustness
  ▪ A processor or disk might fail while the job is running
▪ Optimization/Efficiency
  ▪ Allocate CPUs near the data shards to reduce network overhead
▪ Scale/Parallelism
  ▪ Parallelism grows roughly linearly with the number of machines
▪ Simplicity/Usability
  ▪ Just specify the Map task and the Reduce task and be done!
▪ Generality
  ▪ Lots of parallelizable tasks can be written in MapReduce
  ▪ With some creativity, many more than you can imagine!

Agenda

MAPREDUCE Background

Problem 1 – Similarity between all pairs of documents!

Problem 2 – Parallelizing K-Means clustering

Problem 3 – Finding all Maximal Cliques in a Graph

Similarity between all pairs of docs.

▪ Why bother?
  ▪ Document clustering, similar-document search, etc.

▪ Document represented as a "Bag-of-Tokens"
  ▪ A weight associated with each token in the vocabulary
  ▪ Most weights are zero – sparsity

▪ Cosine similarity between two documents

$d_i = \{w_1^i, w_2^i, \ldots, w_T^i\}, \quad d_j = \{w_1^j, w_2^j, \ldots, w_T^j\}$

$\mathrm{Sim}(d_i, d_j) = \sum_{t=1}^{T} w_t^i \times w_t^j$

Non-Parallel / Linear Implementation

for each document $d_i$:
    for each document $d_j$ ($j > i$):
        compute $\mathrm{Sim}(d_i, d_j) = \sum_{t=1}^{T} w_t^i \times w_t^j$

Complexity = $O(D^2 T \sigma)$, where $\sigma$ = sparsity factor $= 10^{-5}$ (the average fraction of the vocabulary present in a document), $D = O(10\text{B})$ documents, and $T = O(10\text{M})$ words.

Complexity = $O(10^{20+7-5}) = O(10^{22})$

Toy Example for doc-doc similarity: a classic "Join"

Documents = {W, X, Y, Z}, Words = {a, b, c, d, e}

Input:
  W → {⟨a,1⟩, ⟨b,2⟩, ⟨e,5⟩}
  X → {⟨a,3⟩, ⟨c,4⟩, ⟨d,5⟩}
  Y → {⟨b,6⟩, ⟨c,7⟩, ⟨d,8⟩}
  Z → {⟨a,9⟩, ⟨e,10⟩}

Output:
  ⟨W,X⟩ → Sim(W,X) = 3
  ⟨W,Y⟩ → Sim(W,Y) = 12
  ⟨W,Z⟩ → Sim(W,Z) = 59
  ⟨X,Y⟩ → Sim(X,Y) = 68
  ⟨X,Z⟩ → Sim(X,Z) = 27
  ⟨Y,Z⟩ → Sim(Y,Z) = 0
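As a sanity check, each similarity is just the sum of weight products over the words the two documents share; for instance, W and Z share only a and e:

$\mathrm{Sim}(W,Z) = 1 \times 9 + 5 \times 10 = 59$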

Reverse Indexing to the rescue

First convert the forward index (document → ⟨word, weight⟩ pairs):

W → {⟨a,1⟩, ⟨b,2⟩, ⟨e,5⟩}
X → {⟨a,3⟩, ⟨c,4⟩, ⟨d,5⟩}
Y → {⟨b,6⟩, ⟨c,7⟩, ⟨d,8⟩}
Z → {⟨a,9⟩, ⟨e,10⟩}

into a reverse index (word → ⟨document, weight⟩ pairs):

a → {⟨W,1⟩, ⟨X,3⟩, ⟨Z,9⟩}
b → {⟨W,2⟩, ⟨Y,6⟩}
c → {⟨X,4⟩, ⟨Y,7⟩}
d → {⟨X,5⟩, ⟨Y,8⟩}
e → {⟨W,5⟩, ⟨Z,10⟩}

Key/Value for the MAP-Step

Each Map processes one reverse-index entry and emits a partial product for every pair of documents sharing that word:

a → {⟨W,1⟩, ⟨X,3⟩, ⟨Z,9⟩} emits ⟨W,X⟩ → 3, ⟨W,Z⟩ → 9, ⟨X,Z⟩ → 27
b → {⟨W,2⟩, ⟨Y,6⟩} emits ⟨W,Y⟩ → 12
c → {⟨X,4⟩, ⟨Y,7⟩} emits ⟨X,Y⟩ → 28
d → {⟨X,5⟩, ⟨Y,8⟩} emits ⟨X,Y⟩ → 40
e → {⟨W,5⟩, ⟨Z,10⟩} emits ⟨W,Z⟩ → 50

Full Map output: ⟨W,X⟩ → 3, ⟨W,Y⟩ → 12, ⟨W,Z⟩ → 9, ⟨W,Z⟩ → 50, ⟨X,Y⟩ → 40, ⟨X,Y⟩ → 28, ⟨X,Z⟩ → 27

Value combining in the REDUCE-Step

Shuffled Map output:          Reduced output (sum per key):
⟨W,X⟩ → 3                     ⟨W,X⟩ → Sim(W,X) = 3
⟨W,Y⟩ → 12                    ⟨W,Y⟩ → Sim(W,Y) = 12
⟨W,Z⟩ → 9, 50                 ⟨W,Z⟩ → Sim(W,Z) = 59
⟨X,Y⟩ → 40, 28                ⟨X,Y⟩ → Sim(X,Y) = 68
⟨X,Z⟩ → 27                    ⟨X,Z⟩ → Sim(X,Z) = 27

(⟨Y,Z⟩ never appears in the Map output, so Sim(Y,Z) = 0 implicitly.)
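The whole scheme fits in a few lines of Python; the following sketch reproduces the toy example end-to-end (names like map_postings are illustrative, not from the talk):

from collections import defaultdict
from itertools import combinations

# Forward index from the toy example: document -> {word: weight}.
docs = {
    "W": {"a": 1, "b": 2, "e": 5},
    "X": {"a": 3, "c": 4, "d": 5},
    "Y": {"b": 6, "c": 7, "d": 8},
    "Z": {"a": 9, "e": 10},
}

# Reverse indexing: word -> [(document, weight), ...].
inverted = defaultdict(list)
for doc, weights in docs.items():
    for word, w in weights.items():
        inverted[word].append((doc, w))

def map_postings(postings):
    # MAP: one partial product per document pair sharing this word.
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield ((di, dj), wi * wj)

# SHUFFLE + REDUCE: group partial products by pair and sum them.
sims = defaultdict(int)
for postings in inverted.values():
    for pair, partial in map_postings(postings):
        sims[pair] += partial

print(sorted(sims.items()))
# [(('W','X'), 3), (('W','Y'), 12), (('W','Z'), 59), (('X','Y'), 68), (('X','Z'), 27)]

Pairs that never co-occur, like ⟨Y,Z⟩, are simply never emitted, which is exactly how the sparsity factor σ turns the $O(D^2 T)$ join into something tractable.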

Agenda

MAPREDUCE Background

Problem 1 – Similarity between all pairs of documents!

Problem 2 – Parallelizing K-Means clustering

Problem 3 – Finding all Maximal Cliques in a Graph

K-Means Clustering

[Figure: points in 2-D with current centers $m_1^{(t)}, m_2^{(t)}$; points with $\delta_{n,1}^{(t)} = 1$ and $\delta_{n,2}^{(t)} = 1$ pull the centers to their new positions $m_1^{(t+1)}, m_2^{(t+1)}$.]

assignments → centers:

$m_k^{(t+1)} \leftarrow \frac{\sum_{n=1}^{N} \delta_{n,k}^{(t)} x_n}{\sum_{n=1}^{N} \delta_{n,k}^{(t)}}$

centers → assignments:

$\delta_{n,k}^{(t+1)} = \mathbf{1}\left[\, k = \arg\min_{j=1 \ldots K} \Delta\big(x_n, m_j^{(t)}\big) \right]$

K-means clustering 101 – Non-parallel

E-Step – Update assignments from centers:

$\pi_n^{(t)} \leftarrow \arg\min_{k=1 \ldots K} \Delta\big(x_n, m_k^{(t)}\big)$

Cost: $O(NKD)$, where $N$ = number of data points, $K$ = number of clusters, $D$ = number of dimensions.

M-Step – Update centers from cluster assignments:

$m_k^{(t+1)} \leftarrow \frac{\sum_{n=1}^{N} \delta\big(\pi_n^{(t)} = k\big)\, x_n}{\sum_{n=1}^{N} \delta\big(\pi_n^{(t)} = k\big)}$

Cost: $O(ND)$.

K-Means MapReduce

MAP – every mapper is given the current centers $\{m_k^{(t)}\}_{k=1}^{K}$ and, for each point $x_n$, emits Key $= \pi_n^{(t)}$ → Value $= x_n$, where

$\pi_n^{(t)} = \arg\min_{k=1 \ldots K} \Delta\big(x_n, m_k^{(t)}\big)$

SHUFFLE – group the points by their assigned cluster key $\pi_n^{(t)}$.

REDUCE – average each cluster's points to produce the new centers $\{m_k^{(t+1)}\}_{k=1}^{K}$:

$m_k^{(t+1)} \leftarrow \frac{\sum_{n=1}^{N} \delta\big(\pi_n^{(t)} = k\big)\, x_n}{\sum_{n=1}^{N} \delta\big(\pi_n^{(t)} = k\big)}$

Iterative MapReduce: one job updates the cluster centers per iteration.
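A minimal in-memory Python sketch of one such iteration (the driver loop, the toy points, and kmeans_iteration are illustrative, not from the talk):

from collections import defaultdict

def kmeans_iteration(points, centers):
    # MAP: assign each point to its nearest center (key = cluster index).
    def nearest(p):
        return min(range(len(centers)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
    # SHUFFLE: group points by their assigned cluster key.
    groups = defaultdict(list)
    for p in points:
        groups[nearest(p)].append(p)
    # REDUCE: average each cluster's points to get the new center.
    new_centers = list(centers)
    for k, members in groups.items():
        new_centers[k] = tuple(sum(cs) / len(members) for cs in zip(*members))
    return new_centers

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers = [(0, 0), (10, 10)]
for _ in range(5):                 # driver: one MapReduce job per iteration
    centers = kmeans_iteration(points, centers)
print(centers)                     # [(0.0, 0.5), (10.0, 10.5)]

Note that the centers $\{m_k^{(t)}\}$ are small enough to hand to every mapper; only the data points are sharded.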

Agenda

MAPREDUCE Background

Problem 1 – Similarity between all pairs of documents!

Problem 2 – Parallelizing K-Means clustering

Problem 3 – Finding all Maximal Cliques in a Graph

Cliques: Useful structures in Graphs

Nodes        Edges
People       Co-Social
Products     Co-purchase
Movies       Co-like
Keywords     Co-occurrence
Documents    Similarity
Genes        Co-expressions
Neurons      Co-firing

Example concepts (keyword cliques) in IMDB:
▪ guitarist, rock-music, guitar, song, musician, rock-band, singer, electric-guitar, singing
▪ university, school, college, student, classroom, school-teacher, teacher, teacher-student-relationship
▪ judge, lawsuit, trial, lawyer, false-persecution, perjury, courtroom

Graph, Cliques, and Maximal Cliques

▪ Clique = a "fully connected" sub-graph
▪ Maximal Clique = a clique with no "super-clique"
▪ Finding all Maximal Cliques is NP-hard: $O(3^{n/3})$

[Figure: the running example graph on vertices {a, b, c, d, e, f, g, h}; its full edge set appears as the adjacency list under Iteration 1 below.]

Neighborhood of a Clique

In the example graph, f is connected to BOTH b and c, and g is connected to BOTH b and c, so

$N(\{b, c\}) = \{f, g\}$
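Computing $N(C)$ is just an intersection of adjacency sets; a one-function Python sketch, using the example graph's adjacency list (given in full under Iteration 1 below):

from functools import reduce

adj = {"a": {"b", "e"}, "b": {"a", "c", "e", "f", "g"},
       "c": {"b", "d", "f", "g"}, "d": {"c"}, "e": {"a", "b"},
       "f": {"b", "c", "g"}, "g": {"b", "c", "f"}, "h": set()}

def neighborhood(clique):
    # N(C) = vertices adjacent to EVERY member of C, excluding C itself.
    return reduce(set.intersection, (adj[v] for v in clique)) - set(clique)

print(neighborhood({"b", "c"}))    # {'f', 'g'}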

CLIQUEMAP: Clique (key) → Its Neighborhood (value)

{a} → {b, e}
{a, b} → {e}
{b, c} → {f, g}
{b, c, f} → {g}
{h} → ∅
{c, d} → ∅
{a, b, e} → ∅
{b, c, f, g} → ∅

Growing Cliques from CliqueMap

{b, c, f} → {g}

{b, c, f} is a clique, and g is connected to all of its members ⇒ {b, c, f, g} is a clique.

MapReduce for Maximal Cliques: CliqueMap of size k → size k + 1

Iteration 1 – Input: Adjacency List
{a} → {b, e}
{b} → {a, c, e, f, g}
{c} → {b, d, f, g}
{d} → {c}
{e} → {a, b}
{f} → {b, c, g}
{g} → {b, c, f}
{h} → ∅

Iteration 2
{a, b} → {e}
{a, e} → {b}
{b, c} → {f, g}
{b, e} → {a}
{b, f} → {c, g}
{b, g} → {c, f}
{c, f} → {b, g}
{c, g} → {b, f}
{f, g} → {b, c}
{c, d} → ∅

Iteration 3
{a, b, e} → ∅
{b, c, f} → {g}
{b, c, g} → {f}
{b, f, g} → {c}
{c, f, g} → {b}

Iteration 4
{b, c, f, g} → ∅

Key/Value for the MAP-Step

MAP – for each CliqueMap entry C → N(C) and each vertex v ∈ N(C), emit key C ∪ {v} with value N(C) \ {v}:

{a} → {b, e} emits:  {a, b} ⇒ {e} and {a, e} ⇒ {b}
{e} → {a, b} emits:  {a, e} ⇒ {b} and {b, e} ⇒ {a}
{b} → {a, c, e, f, g} emits:  {a, b} ⇒ {c, e, f, g}, {b, c} ⇒ {a, e, f, g}, {b, e} ⇒ {a, c, f, g}, {b, f} ⇒ {a, c, e, g}, {b, g} ⇒ {a, c, e, f}

SHUFFLE – group the emitted values by key:

{a, b} ⇒ {e}, {c, e, f, g}
{a, e} ⇒ {b}, {b}
{b, e} ⇒ {a, c, f, g}, {a}

Value combining in the REDUCE-Step

REDUCE = Intersection of all the values shuffled to a key:

{a, b} → {e} ∩ {c, e, f, g} = {e}
{a, e} → {b} ∩ {b} = {b}
{b, e} → {a, c, f, g} ∩ {a} = {a}

Value combining in the REDUCE-Step (continued)

From {c} → {b, d, f, g} and {d} → {c}, the Map step emits {c, d} ⇒ {b, f, g} and {c, d} ⇒ ∅, so:

{c, d} → {b, f, g} ∩ ∅ = ∅

From {b} → {a, c, e, f, g} and {c} → {b, d, f, g}, it emits {b, c} ⇒ {a, e, f, g} and {b, c} ⇒ {d, f, g}, so:

{b, c} → {a, e, f, g} ∩ {d, f, g} = {f, g}
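Putting Map (emit) and Reduce (intersect) together, one iteration of the clique-growing job can be sketched in Python; grow_cliques and the driver loop are illustrative, and everything runs in memory rather than on a cluster:

from collections import defaultdict

def grow_cliques(clique_map):
    # MAP: for each entry C -> N(C) and each v in N(C),
    # emit key C ∪ {v} with value N(C) \ {v}.
    emitted = defaultdict(list)
    for clique, nbrs in clique_map.items():
        for v in nbrs:
            emitted[clique | {v}].append(nbrs - {v})
    # REDUCE: intersect all values of a key to get the grown
    # clique's neighborhood (∅ means it cannot grow further).
    return {c: frozenset.intersection(*vals) for c, vals in emitted.items()}

adj = {"a": "be", "b": "acefg", "c": "bdfg", "d": "c",
       "e": "ab", "f": "bcg", "g": "bcf", "h": ""}
cmap = {frozenset(v): frozenset(n) for v, n in adj.items()}   # Iteration 1
while any(cmap.values()):
    for clique, nbhd in cmap.items():
        if not nbhd:
            print(set(clique), "is maximal")   # {'h'}, then {'c','d'}, {'a','b','e'}
    cmap = grow_cliques(cmap)
for clique in cmap:
    print(set(clique), "is maximal")           # {'b','c','f','g'}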

The "Art of Thinking Parallel" is about:

▪ Transforming the input data appropriately
  ▪ e.g. Reverse Indexing (doc-doc similarity)

▪ Breaking the problem into smaller ones
  ▪ e.g. Iterative MapReduce (clustering)

▪ Designing the Map step – the key/value output
  ▪ e.g. CliqueMaps in Maximal Cliques

▪ Designing the Reduce step – combining the values of a key
  ▪ e.g. Intersections in Maximal Cliques