Map reduce and the art of Thinking Parallel - Dr. Shailesh Kumar
hyderabad-scalability-meetup -
Three I’s of a great product!
Interface: Intuitive | Functional | Elegant
Infrastructure: Storage | Computation | Network
Intelligence: Learn | Predict | Adapt | Evolve
Drowning in Data, Starving for Knowledge
ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTAGCTGTCGCTCGGCTGCATGCCTAGTGCACCTACGCAGTATAAACAATAATAAATTTTACTGTCGTTGACAAGAAACGAGTAACTCGTCCCTCTTCTGCAGACTGCTTATTACGCGACCGTAAGCTAC…
How BIG is Big Data?
▪ 600 million tweets per DAY
▪ 100 hours per MINUTE
▪ 800+ websites per MINUTE
▪ 100 TB of data uploaded DAILY
▪ 3.5 billion queries PER DAY
▪ 300 million active customers
How did we get here?
▪ Better Sensors
  ▪ Higher resolution, real-time, diverse measurements, …
▪ Faster Communication
  ▪ Network infrastructure, compression technologies, …
▪ Cheaper Storage
  ▪ Cloud-based storage, large warehouses, NoSQL databases
▪ Massive Computation
  ▪ Cloud computing, MapReduce/Hadoop parallel processing paradigms
▪ Intelligent Decisions
  ▪ Advances in Machine Learning and Artificial Intelligence
Parallel Computing Basics
▪ Data Parallelism (distributed computing)
  ▪ Lots of data → break it into “chunks”
  ▪ Process each “chunk” of data in parallel
  ▪ Combine results from each “chunk”
  ▪ MAPREDUCE = data parallelism
▪ Process Parallelism (data flow computing)
  ▪ Lots of stages → set up a process graph
  ▪ Pass data through all stages
  ▪ All stages running in parallel on different data
  ▪ Assembly line = process parallelism
Agenda
MAPREDUCE Background
Problem 1 – Similarity between all pairs of documents!
Problem 2 – Parallelizing K-Means clustering
Problem 3 – Finding all Maximal Cliques in a Graph
MAPREDUCE 101: A 4-stage Process
[Diagram: Lots of data → Shard 1 … Shard N → Map 1 … Map K → Combine 1 … Combine K → Shuffle 1 … Shuffle K → Reduce 1 … Reduce R → Output 1 … Output R]
Each Map processes N/K shards
MAPREDUCE 101: An example Task
▪ Count total frequency of all words on the web
  ▪ Total number of documents > 20 billion
  ▪ Total number of unique words > 20 million
▪ Non-Parallel / Linear Implementation
    for each document d on the Web
        for each unique word w in d
            DocCount(w, d) = # times w occurred in d
            WebCount(w) += DocCount(w, d)
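The linear loop above can be sketched in a few lines of Python. This is only an illustration: the function names `doc_count` and `web_count`, the whitespace tokenizer, the lowercasing, and the three sample documents are assumptions, not part of the talk.

```python
from collections import Counter

def doc_count(document: str) -> Counter:
    """DocCount(w, d): how many times each word w occurs in document d.
    Assumption: words are lowercased and split on whitespace."""
    return Counter(document.lower().split())

def web_count(documents) -> Counter:
    """WebCount(w): accumulate DocCount(w, d) over every document, one at a time."""
    total = Counter()
    for d in documents:
        total.update(doc_count(d))
    return total

# Hypothetical stand-in for "the Web".
docs = ["the cat sat", "the dog sat down", "The end"]
counts = web_count(docs)
print(counts["the"], counts["sat"], counts["dog"])  # 3 2 1
```

The single outer loop is the bottleneck: every document passes through one process, which is what the Map and Reduce stages split up.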
MAPREDUCE – MAP/COMBINE
Shard  | Map output (Key Value)           | Combine output (Key Value)
Shard1 | Map-1: A 10, B 7, C 9, D 3, B 4  | Combine-1: A 10, B 11, C 9, D 3
Shard2 | Map-2: A 3, D 1, C 4, D 9, B 6   | Combine-2: A 3, B 6, C 4, D 10
Shard3 | Map-3: B 3, D 5, C 4, A 6, A 3   | Combine-3: A 9, B 3, C 4, D 5
MAPREDUCE – Shuffle/Reduce
Combine outputs (inputs to Shuffle 1, Shuffle 2, Shuffle 3):
  Combine-1: A 10, B 11, C 9, D 3
  Combine-2: A 3, B 6, C 4, D 10
  Combine-3: A 9, B 3, C 4, D 5
Shuffle outputs (Key Value):
  To Reduce 1: A 10, A 3, A 9, C 9, C 4, C 4
  To Reduce 2: B 11, B 6, B 3, D 3, D 10, D 5
Reduce outputs (Key Value):
  Reduce 1: A 22, C 17
  Reduce 2: B 20, D 18
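The toy numbers on the Map/Combine and Shuffle/Reduce slides can be replayed with a small in-memory simulation. This is a sketch under assumptions: the helper names (`combine`, `shuffle`, `reduce_bucket`) and the hard-coded partitioner that sends A and C to the first reducer are hypothetical, chosen only to reproduce the slide layout. A real framework would usually partition by hashing the key.

```python
from collections import defaultdict

# Map outputs per shard, copied from the toy example.
map_outputs = [
    [("A", 10), ("B", 7), ("C", 9), ("D", 3), ("B", 4)],   # Map-1
    [("A", 3), ("D", 1), ("C", 4), ("D", 9), ("B", 6)],    # Map-2
    [("B", 3), ("D", 5), ("C", 4), ("A", 6), ("A", 3)],    # Map-3
]

def combine(pairs):
    """Combine: sum values per key locally, inside one shard."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return dict(local)

def shuffle(combined, num_reducers, partition):
    """Shuffle: route all values of a key to the same reducer bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for part in combined:
        for key, value in part.items():
            buckets[partition(key)][key].append(value)
    return buckets

def reduce_bucket(bucket):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in bucket.items()}

# Hypothetical partitioner matching the slide: A, C -> Reduce 1; B, D -> Reduce 2.
def partition(key):
    return 0 if key in ("A", "C") else 1

combined = [combine(pairs) for pairs in map_outputs]
results = [reduce_bucket(b) for b in shuffle(combined, 2, partition)]
print(results)  # [{'A': 22, 'C': 17}, {'B': 20, 'D': 18}]
```

The combine step is optional for correctness here, since summing is associative. Its purpose is to shrink the data that crosses the network during the shuffle.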
Key Questions in MAPREDUCE
▪ Is the task really “data-parallelizable”?
  ▪ High-dependence tasks (e.g. Fibonacci series)
  ▪ Recursive tasks (e.g. Binary Search)
▪ What is the key-value pair output for the MAP step?
  ▪ Each map processes only one data record at a time
  ▪ It can generate none, one, or multiple key-value pairs
▪ How to combine values of a key in the REDUCE step?
  ▪ The key for reduce is the same as the key for Map output
  ▪ The reduce function must be “order agnostic”
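The “order agnostic” requirement can be illustrated with a brute-force check that folds the same values in every possible arrival order. The helper `is_order_agnostic` is a hypothetical name for this sketch. It only tests one small input, so it can refute order independence but not prove it in general.

```python
from functools import reduce
from itertools import permutations

def is_order_agnostic(fn, values):
    """Fold `values` with `fn` in every arrival order and compare the results."""
    results = {reduce(fn, perm) for perm in permutations(values)}
    return len(results) == 1

values = [11, 6, 3]  # the values shuffled to key B in the toy word-count example

def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

print(is_order_agnostic(add, values))       # True: 20 in every order
print(is_order_agnostic(subtract, values))  # False: 11-6-3 differs from 6-11-3
```

Sums, minima, maxima, and set intersections pass this kind of check, which is why they are safe reduce functions when the shuffle gives no ordering guarantee.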
Other considerations
▪ Reliability/Robustness
  ▪ A processor or disk might go bad during the process
▪ Optimization/Efficiency
  ▪ Allocate CPUs near data shards to reduce network overhead
▪ Scale/Parallelism
  ▪ Parallelization linearly proportional to number of machines
▪ Simplicity/Usability
  ▪ Just specify the Map task and the Reduce task and be done!
▪ Generality
  ▪ Lots of parallelizable tasks can be written in MapReduce
  ▪ With some creativity, many more than you can imagine!
Similarity between all pairs of docs.
▪ Why bother?
  ▪ Document clustering, similar document search, etc.
▪ Document represented as a “Bag-of-Tokens”
  ▪ A weight associated with each token in the vocabulary
  ▪ Most weights are zero – Sparsity
▪ Cosine Similarity between two documents
$$d_i = \{w_1^i, w_2^i, \ldots, w_T^i\}, \quad d_j = \{w_1^j, w_2^j, \ldots, w_T^j\}$$
$$\mathrm{Sim}(d_i, d_j) = \sum_{t=1}^{T} w_t^i \times w_t^j$$
Non-Parallel / Linear Implementation
    For each document $d_i$
        For each document $d_j$ ($j > i$)
            $\mathrm{Sim}(d_i, d_j) = \sum_{t=1}^{T} w_t^i \times w_t^j$
Complexity = $O(D^2 T \sigma)$
$\sigma$ = Sparsity factor = $10^{-5}$ = Average fraction of vocabulary per document
$D = O(10\text{B})$, $T = O(10\text{M})$
Complexity = $O(10^{20+7-5}) = O(10^{22})$
Toy Example for doc-doc similarity
A classic “Join”
Documents = {W, X, Y, Z}, Words = {a, b, c, d, e}
Input
W → {(a,1), (b,2), (e,5)}
X → {(a,3), (c,4), (d,5)}
Y → {(b,6), (c,7), (d,8)}
Z → {(a,9), (e,10)}
Output
(W,X) → Sim(W,X) = 3
(W,Y) → Sim(W,Y) = 12
(W,Z) → Sim(W,Z) = 59
(X,Y) → Sim(X,Y) = 68
(X,Z) → Sim(X,Z) = 27
(Y,Z) → Sim(Y,Z) = 0
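As a baseline, the nested-loop (non-parallel) version can be run on the toy documents. This is a minimal sketch: the sparse dict representation and the name `sim` are assumptions, and the score is the plain sum of weight products used in the deck, with no length normalization.

```python
from itertools import combinations

# Toy documents as sparse bags of tokens: token -> weight.
docs = {
    "W": {"a": 1, "b": 2, "e": 5},
    "X": {"a": 3, "c": 4, "d": 5},
    "Y": {"b": 6, "c": 7, "d": 8},
    "Z": {"a": 9, "e": 10},
}

def sim(di, dj):
    """Sum of weight products over tokens shared by both documents."""
    if len(di) > len(dj):          # iterate over the shorter document
        di, dj = dj, di
    return sum(w * dj[t] for t, w in di.items() if t in dj)

# Every pair (i, j) with j > i is visited, even pairs that share no token.
pairwise = {(i, j): sim(docs[i], docs[j]) for i, j in combinations(sorted(docs), 2)}
print(pairwise[("W", "Z")], pairwise[("X", "Y")], pairwise[("Y", "Z")])  # 59 68 0
```

Note that the pair (Y, Z) is still evaluated although it shares no token. At web scale most pairs are like this, which is the waste the reverse index removes.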
Reverse Indexing to the rescue
First convert the data to a reverse index
Documents → words:
W → {(a,1), (b,2), (e,5)}
X → {(a,3), (c,4), (d,5)}
Y → {(b,6), (c,7), (d,8)}
Z → {(a,9), (e,10)}
Words → documents (reverse index):
a → {(W,1), (X,3), (Z,9)}
b → {(W,2), (Y,6)}
c → {(X,4), (Y,7)}
d → {(X,5), (Y,8)}
e → {(W,5), (Z,10)}
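Building the reverse index from the toy documents might look like the sketch below. The function name `reverse_index` and the choice to visit documents in sorted order are assumptions made for readability. At scale this conversion would itself be a MapReduce job, with the token as the map output key.

```python
from collections import defaultdict

docs = {
    "W": {"a": 1, "b": 2, "e": 5},
    "X": {"a": 3, "c": 4, "d": 5},
    "Y": {"b": 6, "c": 7, "d": 8},
    "Z": {"a": 9, "e": 10},
}

def reverse_index(docs):
    """Turn doc -> {token: weight} into token -> [(doc, weight), ...]."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for token, weight in docs[doc_id].items():
            index[token].append((doc_id, weight))
    return dict(index)

index = reverse_index(docs)
print(index["a"])  # [('W', 1), ('X', 3), ('Z', 9)]
```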
Key/Value for the MAP-Step
a → {(W,1), (X,3), (Z,9)} ⇒ (W,X) → 3, (W,Z) → 9, (X,Z) → 27
b → {(W,2), (Y,6)} ⇒ (W,Y) → 12
c → {(X,4), (Y,7)} ⇒ (X,Y) → 28
d → {(X,5), (Y,8)} ⇒ (X,Y) → 40
e → {(W,5), (Z,10)} ⇒ (W,Z) → 50
Map output:
(W,X) → 3
(W,Y) → 12
(W,Z) → 9
(W,Z) → 50
(X,Y) → 40
(X,Y) → 28
(X,Z) → 27
Value combining in REDUCE-Step
Input:
(W,X) → 3
(W,Y) → 12
(W,Z) → 9
(W,Z) → 50
(X,Y) → 40
(X,Y) → 28
(X,Z) → 27
Output:
(W,X) → Sim(W,X) = 3
(W,Y) → Sim(W,Y) = 12
(W,Z) → Sim(W,Z) = 59
(X,Y) → Sim(X,Y) = 68
(X,Z) → Sim(X,Z) = 27
(Y,Z) → Sim(Y,Z) = 0
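The Map and Reduce steps for doc-doc similarity can be sketched over the toy reverse index as follows. The names `map_step`, `shuffle`, and `reduce_step` are hypothetical, and the shuffle is simulated with an in-memory dictionary rather than a real framework.

```python
from collections import defaultdict
from itertools import combinations

# Reverse index from the toy example: token -> [(doc, weight), ...].
index = {
    "a": [("W", 1), ("X", 3), ("Z", 9)],
    "b": [("W", 2), ("Y", 6)],
    "c": [("X", 4), ("Y", 7)],
    "d": [("X", 5), ("Y", 8)],
    "e": [("W", 5), ("Z", 10)],
}

def map_step(postings):
    """One record (one token) emits ((doc_i, doc_j), w_i * w_j) per document pair."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def shuffle(pairs):
    """Group emitted values by document-pair key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(values):
    """Sum the per-token products; addition is order agnostic."""
    return sum(values)

emitted = [kv for postings in index.values() for kv in map_step(postings)]
similarity = {key: reduce_step(vals) for key, vals in shuffle(emitted).items()}
print(similarity[("W", "Z")], similarity[("X", "Y")])  # 59 68
```

The pair (Y, Z) never appears as a key because no token is shared, so its similarity of 0 is implied by absence rather than computed. A token that occurs in n documents emits n(n-1)/2 pairs, so very frequent tokens dominate the cost and are often filtered out in practice.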
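K-Means Clustering
assignments → centers
$$m_k^{(t+1)} \leftarrow \frac{\sum_{n=1}^{N} \delta_{n,k}^{(t)} x_n}{\sum_{n=1}^{N} \delta_{n,k}^{(t)}}$$
centers → assignments
$$\delta_{n,k}^{(t+1)} = \left( k == \arg\min_{j=1 \ldots K} \left\{ \Delta\left(x_n, m_j^{(t)}\right) \right\} \right)$$
[Figure: data points with assignments $\delta_{n,1}^{(t)} = 1$ and $\delta_{n,2}^{(t)} = 1$; centers $m_1^{(t)}$, $m_2^{(t)}$ and updated centers $m_1^{(t+1)}$, $m_2^{(t+1)}$]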
K-means clustering 101 – Non-parallel
E-Step – Update assignments from centers
$$\pi_n^{(t)} \leftarrow \arg\min_{k=1 \ldots K} \left\{ \Delta\left(x_n, m_k^{(t)}\right) \right\}$$
$O(NKD)$: N = number of data points, K = number of clusters, D = number of dimensions
M-Step – Update centers from cluster assignments
$$m_k^{(t+1)} \leftarrow \frac{\sum_{n=1}^{N} \delta\left(\pi_n^{(t)} = k\right) x_n}{\sum_{n=1}^{N} \delta\left(\pi_n^{(t)} = k\right)}$$
$O(ND)$: N = number of data points, D = number of dimensions
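The two steps can be written out as a small single-machine loop. This is a sketch under assumptions not stated in the deck: Δ is taken to be squared Euclidean distance, an empty cluster keeps its previous center, and the loop stops when the centers no longer change. All function names and the sample points are illustrative.

```python
def sq_dist(x, m):
    """Assumed distance Δ: squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, m))

def e_step(points, centers):
    """Assign each point to the index of its nearest center: O(NKD)."""
    return [min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
            for x in points]

def m_step(points, assignments, centers):
    """Recompute each center as the mean of its assigned points: O(ND)."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [x for x, a in zip(points, assignments) if a == k]
        if not members:                  # assumption: keep the old center
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centers

def kmeans(points, centers, max_iters=100):
    for _ in range(max_iters):
        assignments = e_step(points, centers)
        updated = m_step(points, assignments, centers)
        if updated == centers:           # converged
            break
        centers = updated
    return centers, assignments

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, assignments = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
print(centers, assignments)  # [(0.0, 1.0), (10.0, 1.0)] [0, 0, 1, 1]
```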
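K-Means MapReduce
Input: current centers $\{m_k^{(t)}\}_{k=1}^{K}$
Map: $\pi_n^{(t)} = \arg\min_{k=1 \ldots K} \left\{ \Delta\left(x_n, m_k^{(t)}\right) \right\}$, emit Key = $\pi_n^{(t)}$ → Value = $x_n$
Shuffle: group by key $\pi_n^{(t)}$
Reduce: $m_k^{(t+1)} \leftarrow \frac{\sum_{n=1}^{N} \delta\left(\pi_n^{(t)} = k\right) x_n}{\sum_{n=1}^{N} \delta\left(\pi_n^{(t)} = k\right)}$
Output: updated centers $\{m_k^{(t+1)}\}_{k=1}^{K}$
Iterative MapReduce: Update Cluster Centers/iteration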
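One iteration of this scheme, and the outer loop that repeats it, might be sketched as below. The assumptions are the same kind as before: squared Euclidean distance for Δ, an in-memory dictionary standing in for the shuffle, and hypothetical names throughout. In a real deployment the current centers would be broadcast to every mapper at the start of each iteration.

```python
from collections import defaultdict

def sq_dist(x, m):
    """Assumed distance Δ: squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, m))

def map_step(x, centers):
    """Map: one data point in, one (nearest-center index, point) pair out."""
    key = min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
    return key, x

def shuffle(pairs):
    """Shuffle: collect all points that share a cluster key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(points):
    """Reduce: the new center is the mean of the points assigned to the key."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans_iteration(points, centers):
    grouped = shuffle(map_step(x, centers) for x in points)
    # Assumption: a cluster that received no points keeps its old center.
    return [reduce_step(grouped[k]) if k in grouped else centers[k]
            for k in range(len(centers))]

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(1.0, 1.0), (8.0, 1.0)]
for _ in range(10):                      # one MapReduce job per iteration
    new_centers = kmeans_iteration(points, centers)
    if new_centers == centers:
        break
    centers = new_centers
print(centers)  # [(0.0, 1.0), (10.0, 1.0)]
```

The mean is order agnostic, so the reduce is valid. A combiner could also pre-aggregate partial sums and counts per shard, provided the reduce then divides the total sum by the total count rather than averaging averages.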
Cliques: Useful structures in Graphs
Entity    | Relationship
People    | Co-Social
Products  | Co-purchase
Movies    | Co-like
Keywords  | Co-occurrence
Documents | Similarity
Genes     | Co-expressions
Neurons   | Co-firing
Example Concepts in IMDB
• guitarist, rock-music, guitar, song, musician, rock-band, singer, electric-guitar, singing
• university, school, college, student, classroom, school-teacher, teacher, teacher-student-relationship
• judge, lawsuit, trial, lawyer, false-persecution, perjury, courtroom
Graph, Cliques, and Maximal Cliques
Clique = a “fully connected” sub-graph
Maximal Clique = a clique with no “Super-clique”
Finding all Maximal Cliques is NP-hard: $O(3^{n/3})$
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
Neighborhood of a Clique
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
f is connected to BOTH b and c
g is connected to BOTH b and c
N({b,c}) = {f, g}
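The neighborhood of a clique is the set of nodes adjacent to every member, which is an intersection of adjacency sets. The sketch below uses the adjacency list that the deck gives as input for the example graph. The function names `is_clique` and `neighborhood` are illustrative assumptions.

```python
# Adjacency list of the example graph on nodes a..h.
graph = {
    "a": {"b", "e"},
    "b": {"a", "c", "e", "f", "g"},
    "c": {"b", "d", "f", "g"},
    "d": {"c"},
    "e": {"a", "b"},
    "f": {"b", "c", "g"},
    "g": {"b", "c", "f"},
    "h": set(),
}

def is_clique(graph, nodes):
    """True if every pair of distinct nodes is connected."""
    return all(v in graph[u] for u in nodes for v in nodes if u != v)

def neighborhood(graph, clique):
    """N(clique): nodes outside the clique that are connected to all its members."""
    common = set.intersection(*(graph[v] for v in clique))
    return common - set(clique)

print(sorted(neighborhood(graph, {"b", "c"})))       # ['f', 'g']
print(sorted(neighborhood(graph, {"b", "c", "f"})))  # ['g']
print(neighborhood(graph, {"a", "b", "e"}))          # set()
```

A clique whose neighborhood is empty cannot be extended by any node, so it is maximal.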
CLIQUEMAP: Clique (key) → Its Neighbor (value)
{a} → {b,e}
{a,b} → {e}
{b,c} → {f,g}
{b,c,f} → {g}
{h} → ∅
{c,d} → ∅
{a,b,e} → ∅
{b,c,f,g} → ∅
Growing Cliques from CliqueMap
{b,c,f} → {g}
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
{b,c,f} is a clique and g is connected to all of them ⇒ {b,c,f,g} is a clique
MapReduce for Maximal Cliques
CliqueMap of size k → size k + 1
Iteration 1 (Input: Adjacency List)
{a} → {b,e}
{b} → {a,c,e,f,g}
{c} → {b,d,f,g}
{d} → {c}
{e} → {a,b}
{f} → {b,c,g}
{g} → {b,c,f}
{h} → ∅
Iteration 2
{a,b} → {e}
{a,e} → {b}
{b,c} → {f,g}
{b,e} → {a}
{b,f} → {c,g}
{b,g} → {c,f}
{c,f} → {b,g}
{c,g} → {b,f}
{f,g} → {b,c}
{c,d} → ∅
Iteration 3
{a,b,e} → ∅
{b,c,f} → {g}
{b,c,g} → {f}
{b,f,g} → {c}
{c,f,g} → {b}
Iteration 4
{b,c,f,g} → ∅
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
Key/Value for the MAP-Step
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
MAP
{a} → {b,e} ⇒ {a,b} ⇒ {e}, {a,e} ⇒ {b}
{e} → {a,b} ⇒ {a,e} ⇒ {b}, {b,e} ⇒ {a}
{b} → {a,c,e,f,g} ⇒ {a,b} ⇒ {c,e,f,g}, {b,c} ⇒ {a,e,f,g}, {b,e} ⇒ {a,c,f,g}, {b,f} ⇒ {a,c,e,g}, {b,g} ⇒ {a,c,e,f}
SHUFFLE
{a,e} ⇒ {b}, {a,e} ⇒ {b}
{a,b} ⇒ {e}, {a,b} ⇒ {c,e,f,g}
{b,e} ⇒ {a,c,f,g}, {b,e} ⇒ {a}
Value combining in REDUCE-Step
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
SHUFFLE
{a,e} ⇒ {b}, {a,e} ⇒ {b}
{a,b} ⇒ {e}, {a,b} ⇒ {c,e,f,g}
{b,e} ⇒ {a,c,f,g}, {b,e} ⇒ {a}
REDUCE
{a,b} → {e} ∩ {c,e,f,g} = {e}
{b,e} → {a,c,f,g} ∩ {a} = {a}
{a,e} → {b} ∩ {b} = {b}
Reduce = Intersection
Value combining in REDUCE-Step
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
MAP
{c} → {b,d,f,g} ⇒ {c,d} ⇒ {b,f,g}
{d} → {c} ⇒ {c,d} ⇒ ∅
{b} → {a,c,e,f,g} ⇒ {b,c} ⇒ {a,e,f,g}
{c} → {b,d,f,g} ⇒ {b,c} ⇒ {d,f,g}
REDUCE
{c,d} → {b,f,g} ∩ ∅ = ∅
{b,c} → {a,e,f,g} ∩ {d,f,g} = {f,g}
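Putting the Map rule (grow a clique by one neighbor) and the Reduce rule (intersect the candidate neighborhoods) into a loop gives the iterative procedure from the iteration slides. The sketch below runs it in memory on the example graph. The function names, the use of `frozenset` keys, and the dictionary-based shuffle are assumptions made for illustration.

```python
from collections import defaultdict

graph = {
    "a": {"b", "e"}, "b": {"a", "c", "e", "f", "g"}, "c": {"b", "d", "f", "g"},
    "d": {"c"}, "e": {"a", "b"}, "f": {"b", "c", "g"}, "g": {"b", "c", "f"},
    "h": set(),
}

def map_step(clique, neighbors):
    """A size-k clique with neighborhood N emits, for each v in N,
    key = clique plus v (size k+1) and value = N without v."""
    for v in neighbors:
        yield clique | {v}, neighbors - {v}

def reduce_step(values):
    """Reduce = intersection of all values received for one key."""
    return frozenset.intersection(*values)

def maximal_cliques(graph):
    # Iteration 1: the adjacency list is the CliqueMap for cliques of size 1.
    clique_map = {frozenset([v]): frozenset(nbrs) for v, nbrs in graph.items()}
    maximal = []
    while clique_map:
        # A clique with an empty neighborhood cannot grow, so it is maximal.
        maximal += [c for c, nbrs in clique_map.items() if not nbrs]
        grouped = defaultdict(list)      # in-memory stand-in for the shuffle
        for clique, nbrs in clique_map.items():
            for key, value in map_step(clique, nbrs):
                grouped[key].append(value)
        clique_map = {key: reduce_step(vals) for key, vals in grouped.items()}
    return maximal

result = maximal_cliques(graph)
print(sorted("".join(sorted(c)) for c in result))  # ['abe', 'bcfg', 'cd', 'h']
```

Set intersection is order agnostic, so the reduce function meets the earlier requirement. The number of intermediate cliques can still grow exponentially on dense graphs, so this is a way to distribute the work, not a way around the worst-case bound.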
“Art of Thinking Parallel” is about
▪ Transforming the Input Data appropriately
  ▪ e.g. Reverse Indexing (doc-doc similarity)
▪ Breaking the problem into smaller ones
  ▪ e.g. Iterative MapReduce (clustering)
▪ Designing the Map step – Key/Value output
  ▪ e.g. CliqueMaps in Maximal Cliques
▪ Designing the Reduce step – Combine values of a key
  ▪ e.g. Intersections in Maximal Cliques