Map Reduce - University of British Columbia (laks/Map Reduce.pdf · 2012. 3. 12)
-
Map Reduce
Based on A. Rajaraman and J.D. Ullman. Mining Massive Data Sets. Cambridge U. Press, 2009; Ch. 2. Some figures "stolen" from their slides.
-
Big Data and Cluster Computing 1/2
• What’s “big data”?
– Very large: think several terabytes.
– Often beyond a single compute node's storage capacity.
– While there is no unique definition of big data, the kind we focus on here has two properties:
• Enormous in size (common to all kinds of big data!)
• Updates mostly take the form of appends; in-place updates are rare.
2
-
BD and CC 2/2
• Some examples of big data:
– Web graph
– Social networks
• Computations on big data are expensive:
– Computing PageRank: iterated matrix-vector products over tens of billions of web pages
– Finding your friends on Facebook (or other social networks): search over a graph with > 100M nodes and > 10B edges
– Similarity "search" in recommender systems
• Some non-examples:
– A bank accounts database, no matter how large (why?)
– Any update (i.e., modify) intensive database
– Online retail stores
3
-
Compute Node
[Figure: a single compute node: CPU, memory, disk.]
“Big Data” typically far exceeds the capabilities of a compute node.
4
-
Cluster Computing – Distributed File System
[Figure: racks of compute nodes (each with CPU, memory, disk), connected by switches.]
Each rack contains 16-64 nodes; 1 Gbps between any pair of nodes in a rack; 2-10 Gbps backbone between racks.
Examples: Google DFS; Hadoop DFS (Apache); CloudStore (open source; Kosmix, now part of Walmart Labs)
5
-
DFS
• Divide up the file into chunks (e.g., 64 MB) and replicate the chunks in different racks (e.g., 3 times).
– Gives redundancy and resiliency against failures.
• Divide up the computation into independent tasks; if task T fails, it can be restarted without affecting tasks T' ≠ T.
• These ideas underlie the Map Reduce paradigm: tolerant to hardware failure.
• Master file: records where the chunks of each file are. The master file is itself replicated; the directory of the DFS keeps track of the copies of the master file.
6
-
Map Reduce Schematic
[Figure: Input Chunks → Map Tasks, which emit (key, value) pairs (ki, vi) → Master Controller: group by key, producing pairs (k, [v1, ..., vm]) → Reduce Tasks → Combined Output.]
Chunk = {elements}. E.g.: tuple, doc, tweet, ... A map task may get > 1 chunk as input.
7
-
Example 1 – Count #words in a collection of documents
• Element = doc; key = word; value = frequency.
• Chunk = {docs}. Map task → (k, v) pairs, initially just (w, 1).
• Master Controller: group by word (across output from the various Map tasks) and merge all values.
• Reduce task: hash each key to some reduce task, which aggregates (in this case, sums) all values associated with that key.
• Output from all Reduce tasks is merged again.
• Typically, the Reduce function is associative and commutative, e.g., addition. Can "push" some of the Reduce functionality inside Map tasks. (Reduce still needs to do its job.)
• #Map Tasks & #Reduce Tasks are decided by the user.
8
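The word-count job above can be sketched in a few lines of plain Python; this is an in-memory illustration of the Map / group-by-key / Reduce phases, not a real MapReduce framework, and all function names here are illustrative.

```python
from collections import defaultdict

def map_task(doc):
    # Map: emit (w, 1) for every word w in the document.
    return [(w, 1) for w in doc.split()]

def group_by_key(pairs):
    # Master Controller: group values by key across all Map outputs.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_task(key, values):
    # Reduce: sum all counts associated with this word.
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [p for d in docs for p in map_task(d)]
counts = dict(reduce_task(k, vs) for k, vs in group_by_key(pairs).items())
# counts["the"] == 3, counts["fox"] == 2
```

Pushing Reduce functionality into Map would amount to running `group_by_key` and a partial sum inside each map task before emitting (a "combiner").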
-
Example 2 – Matrix Vector Multiplication
• At the core of Page Rank computation.
• x_{n×1} = M_{n×n} × v_{n×1}; x_i = Σ_j m_ij · v_j; n ≈ 10^10.
• M is extremely sparse (web link graph): ~10-15 non-zeros per row.
• Assume v will fit in memory. Then:
• Map: (chunk of M, v) → pairs (i, m_ij × v_j); what is the key for the terms in the sum expression for x_i?
• Reduce: add up all values for a given key i → x_i.
9
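A minimal sketch of this map/reduce, assuming the sparse matrix M is given as (i, j, m_ij) triples and v fits in memory (the representation and names are assumptions for illustration):

```python
from collections import defaultdict

def map_task(chunk, v):
    # Map: each nonzero m_ij contributes the pair (i, m_ij * v_j);
    # the row index i is the key.
    return [(i, m * v[j]) for (i, j, m) in chunk]

def reduce_task(i, values):
    # Reduce: x_i is the sum of all values with key i.
    return i, sum(values)

# Sparse M = {(i, j, m_ij)}: a small 2x2 example.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
v = [10.0, 100.0]

groups = defaultdict(list)
for i, val in map_task(M, v):
    groups[i].append(val)
x = dict(reduce_task(i, vs) for i, vs in groups.items())
# x[0] == 1*10 + 2*100 == 210.0; x[1] == 3*100 == 300.0
```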
-
Matrix Vector Mult – what if v won’t fit in memory?
[Figure: M × v, with v divided into stripes (color = stripe); each corresponding stripe of the matrix is divided up into chunks, and each chunk is processed together with the stripe of v it multiplies.]
10
-
Relational Algebra
• Review RA (from any database text).
• We discuss the MR implementation of RA not because we want to implement a DBMS over MR.
• Operations/computations over large networks can be captured using RA.
• Efficient MR implementation of RA ⇒ efficient implementation of a whole family of such computations over large networks.
• E.g., (node pairs connected by) paths of length two: PROJECT_{L1.From, L2.To}(RENAME(Link L1) JOIN_{To=From} RENAME(Link L2)).
• #friends of each user: GROUP-BY_{User: COUNT(Friend)}(Friends).
11
-
MR implementations of SELECT/PROJECT
• SELECT_C(R): Map: for each tuple t, if t satisfies C, output (t, t).
• Reduce: identity, i.e., simply pass on each incoming (key, value) pair.
• Can extract the relation by taking just the value (or just the key!).
• PROJECT_X(R): Map: for each tuple t, project it on X; let t' = t[X], then output (t', t').
• Reduce: transform (t', [t', ..., t']) into (t', t'), i.e., duplicate elimination.
• Optimization: can throw out encountered duplicates early in Map; still need dup-elim in Reduce.
12
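The SELECT and PROJECT implementations above can be sketched as follows, representing tuples as Python tuples and a relation as a list of tuples (an illustrative encoding, not from the slides):

```python
from collections import defaultdict

def select_map(t, condition):
    # SELECT_C Map: emit (t, t) iff t satisfies the condition C.
    return [(t, t)] if condition(t) else []

def project_map(t, attrs):
    # PROJECT_X Map: project t on attribute positions X; emit (t', t').
    tp = tuple(t[i] for i in attrs)
    return [(tp, tp)]

def project_reduce(key, values):
    # PROJECT_X Reduce: (t', [t', ..., t']) -> (t', t'): dup-elim.
    return key, key

R = [(1, 'a'), (2, 'b'), (2, 'c')]

# SELECT_{A=2}(R); Reduce is the identity, so Map output is the answer.
selected = [kv for t in R for kv in select_map(t, lambda tup: tup[0] == 2)]

# PROJECT_{A}(R) with duplicate elimination in Reduce.
groups = defaultdict(list)
for t in R:
    for k, v in project_map(t, [0]):
        groups[k].append(v)
projected = {project_reduce(k, vs)[1] for k, vs in groups.items()}
# projected == {(1,), (2,)}
```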
-
MR implementations of Set ops
• Union/Intersection/Minus: Map: turn each tuple t in R into (t, R) and each tuple t in S into (t, S). Merging could create (t, [R]), (t, [S]), (t, [R, S]), or (t, [S, R]).
• Reduce: the action depends on the operation. For union, turn any of those into (t, t); for minus, turn (t, [R]) into (t, t) and everything else into (t, NULL); for intersect, turn (t, [R, S]) (or (t, [S, R])) into (t, t) and everything else into (t, NULL).
13
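The tag-and-group scheme above can be sketched like this (relation names are carried as string tags; `None` plays the role of NULL and is filtered out at the end):

```python
from collections import defaultdict

def map_tag(relation, name):
    # Map: tag each tuple with the name of its relation.
    return [(t, name) for t in relation]

def group(pairs):
    g = defaultdict(list)
    for t, name in pairs:
        g[t].append(name)
    return g

def union_reduce(t, names):
    return t                     # any of [R], [S], [R,S], [S,R] -> t

def intersect_reduce(t, names):
    return t if set(names) == {'R', 'S'} else None

def minus_reduce(t, names):
    return t if names == ['R'] else None   # t appeared in R only

R = [(1,), (2,)]
S = [(2,), (3,)]
g = group(map_tag(R, 'R') + map_tag(S, 'S'))
union = {union_reduce(t, ns) for t, ns in g.items()}
inter = {r for t, ns in g.items() if (r := intersect_reduce(t, ns))}
minus = {r for t, ns in g.items() if (r := minus_reduce(t, ns))}
# union == {(1,), (2,), (3,)}; inter == {(2,)}; minus == {(1,)}
```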
-
MR implementations of Join
• Natural Join: the idea works also for equi-join. Consider, e.g., R(A, B) and S(B, C).
• Map: map each tuple (a, b) in R to (b, (R, a)) and each tuple (b, c) in S to (b, (S, c)). [Hadoop passes the Map output to Reduce tasks sorted on key.]
• Reduce: from each pair (b, [a set of pairs of the form (R, a) or (S, c)]), produce (b, [(a1, b, c1), (a1, b, c2), ..., (am, b, cn)]). The "value" of this key-value pair = the subset of the join with B = b.
• Boldface indicates a tuple of attributes/values.
• Typically, join selectivity is high, so the cost is close to linear in the total size of the two relations.
• What if a(nother) implementation of MR did not pass Map output sorted on key?
14
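A minimal sketch of this natural join for R(A, B) and S(B, C), keying on the shared attribute B and tagging each tuple with its relation (the encoding is illustrative):

```python
from collections import defaultdict

def map_R(a, b):
    # R(A, B) tuple (a, b) -> (b, (R, a))
    return (b, ('R', a))

def map_S(b, c):
    # S(B, C) tuple (b, c) -> (b, (S, c))
    return (b, ('S', c))

def reduce_join(b, tagged):
    # Pair every ('R', a) with every ('S', c) sharing this b.
    As = [x for tag, x in tagged if tag == 'R']
    Cs = [x for tag, x in tagged if tag == 'S']
    return [(a, b, c) for a in As for c in Cs]

R = [(1, 'x'), (2, 'x'), (3, 'y')]
S = [('x', 10), ('y', 20), ('z', 30)]

groups = defaultdict(list)
for a, b in R:
    k, v = map_R(a, b)
    groups[k].append(v)
for b, c in S:
    k, v = map_S(b, c)
    groups[k].append(v)
joined = [t for b, vs in groups.items() for t in reduce_join(b, vs)]
# joined contains (1, 'x', 10), (2, 'x', 10), (3, 'y', 20); 'z' joins nothing
```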
-
MR Implementation of Groupby
• Example: R(A, B, C). Want GB_{A: agg1(B1), ..., aggk(Bk)}(R), where the Bi are among the non-grouping attributes.
• Map: map each tuple (a, b, c) to (a, b), where b is the tuple of aggregated attribute values (B1, ..., Bk).
• Reduce: turn each (a, [b1, ..., bm]) into (a, agg1{b1[1], ..., bm[1]}, ..., aggk{b1[k], ..., bm[k]}).
• Optimization: if an agg is associative and commutative, and if the b's associated with the same a are encountered together, some of the computation can be pushed to Map.
15
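For the single-aggregate case, the group-by above can be sketched as follows, grouping R(A, B, C) on A and summing B (SUM standing in for an associative and commutative agg; the encoding is illustrative):

```python
from collections import defaultdict

def map_task(a, b, c):
    # Map: keep the grouping attribute and the aggregated attribute;
    # drop everything else (here, C).
    return (a, b)

def reduce_task(a, bs):
    # Reduce: (a, [b1, ..., bm]) -> (a, agg{b1, ..., bm}); agg = SUM.
    return (a, sum(bs))

R = [(1, 10, 'u'), (1, 20, 'v'), (2, 5, 'w')]
groups = defaultdict(list)
for a, b, c in R:
    k, v = map_task(a, b, c)
    groups[k].append(v)
result = dict(reduce_task(a, bs) for a, bs in groups.items())
# result == {1: 30, 2: 5}
```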
-
Matrix Mult via Join! 1/2
• First, we do this by composing two MR steps. View matrix M as M(I, J, V) triples and N as N(J, K, W) triples.
• Map: map M(i, j, m_ij) to (j, (M, i, m_ij)) and N(j, k, n_jk) to (j, (N, k, n_jk)).
• Reduce: from each (key, value) pair (j, [triples from M and from N]), and for each (M, i, m_ij) and (N, k, n_jk) in that set of triples, output (j, (i, k, m_ij × n_jk)).
16
-
Matrix Mult via Join 2/2
• Second MR:
• Map: from each (key, value) pair (j, [(i1,k1,v1), ..., (ip,kp,vp)]), produce the (key, value) pairs ((i1,k1), v1), ..., ((ip,kp), vp).
• Reduce: for each (key, value) pair ((i,k), [v1, ..., vm]), produce the output ((i,k), v1+ ... +vm). This is the value in row i and column k of M x N.
• Not the most efficient, but interesting: uses joins; composes MR like an algebraic operator!
17
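The two MR steps can be sketched end to end, with M and N as lists of (row, col, value) triples (an illustrative encoding; a real implementation would distribute both phases):

```python
from collections import defaultdict

def mr1(M, N):
    # Step 1 = join on j. Map: key both relations on j, tagged by source.
    groups = defaultdict(list)
    for i, j, m in M:
        groups[j].append(('M', i, m))
    for j, k, n in N:
        groups[j].append(('N', k, n))
    # Reduce: pair every M-triple with every N-triple sharing j,
    # emitting ((i, k), m_ij * n_jk) partial products.
    out = []
    for j, triples in groups.items():
        Ms = [(i, m) for tag, i, m in triples if tag == 'M']
        Ns = [(k, n) for tag, k, n in triples if tag == 'N']
        out.extend(((i, k), m * n) for i, m in Ms for k, n in Ns)
    return out

def mr2(pairs):
    # Step 2: Map is the identity (already keyed on (i, k));
    # Reduce sums the partial products for each cell.
    sums = defaultdict(float)
    for key, v in pairs:
        sums[key] += v
    return dict(sums)

M = [(0, 0, 1.0), (0, 1, 2.0)]   # 1x2 matrix [1 2]
N = [(0, 0, 3.0), (1, 0, 4.0)]   # 2x1 matrix [3; 4]
P = mr2(mr1(M, N))
# P[(0, 0)] == 1*3 + 2*4 == 11.0
```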
-
Matrix Mult in one MR step
18
[Figure: M (p × q) × N (q × r); entry m_ij in row i of M meets entry n_jk in column k of N.]
-
Matrix Mult in one MR step
19
Map: m_ij → ((i, 1), (M, j, m_ij)), ..., ((i, r), (M, j, m_ij))
Map: n_jk → ((1, k), (N, j, n_jk)), ..., ((p, k), (N, j, n_jk))
Reduce: for key (i, k), the input is ((i, k), (M, 1, m_i1)), ..., ((i, k), (M, q, m_iq)) and ((i, k), (N, 1, n_1k)), ..., ((i, k), (N, q, n_qk)); the output is ((i, k), Σ_j m_ij · n_jk).
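The one-step scheme can be sketched like this: each m_ij is replicated to every key (i, k) for k = 1..r, each n_jk to every key (i, k) for i = 1..p, and one Reduce per cell matches the entries on j (the encoding and function name are illustrative):

```python
from collections import defaultdict

def one_step_mm(M, N, p, r):
    # Map: m_ij -> ((i, k), (M, j, m_ij)) for all k; n_jk -> ((i, k), (N, j, n_jk)) for all i.
    groups = defaultdict(list)
    for i, j, m in M:                      # M is p x q
        for k in range(r):
            groups[(i, k)].append(('M', j, m))
    for j, k, n in N:                      # N is q x r
        for i in range(p):
            groups[(i, k)].append(('N', j, n))
    # Reduce: for each key (i, k), match M- and N-entries on j and
    # output the cell value sum_j m_ij * n_jk.
    result = {}
    for (i, k), entries in groups.items():
        ms = {j: v for tag, j, v in entries if tag == 'M'}
        ns = {j: v for tag, j, v in entries if tag == 'N'}
        result[(i, k)] = sum(ms[j] * ns[j] for j in ms if j in ns)
    return result

M = [(0, 0, 1.0), (0, 1, 2.0)]   # p = 1, q = 2
N = [(0, 0, 3.0), (1, 0, 4.0)]   # q = 2, r = 1
P = one_step_mm(M, N, p=1, r=1)
# P[(0, 0)] == 11.0, matching the two-step result
```

The replication factor (each m_ij copied r times, each n_jk copied p times) is the price of doing the whole multiplication in a single MR step.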