
Transcript of Map Reduce - University of British Columbia (laks/Map Reduce.pdf, 2012-03-12)

  • Map Reduce

    Based on A. Rajaraman and J.D. Ullman. Mining Massive Data Sets. Cambridge U. Press, 2009; Ch. 2. Some figures “stolen” from their slides.

  • Big Data and Cluster Computing 1/2

    • What’s “big data”?

    – Very large – think several Terabytes.

    – Often beyond the storage capacity of a single compute node.

    – While there is no unique definition of big data, the kind we focus on here has two properties:

    • Enormous in size (common to all kinds of big data!)

    • Updates mostly take the form of appends; in-place updates are rare.


  • Big Data and Cluster Computing 2/2

    • Some examples of big data:
    – Web graph
    – Social networks
    • Computations on big data are expensive:
    – Computing PageRank: iterated matrix-vector products over tens of billions of web pages
    – Finding your friends on Facebook (or other social networks): search over a graph with > 100M nodes and > 10B edges
    – Similarity “search” in recommender systems
    • Some non-examples:
    – A bank accounts database, no matter how large (why?)
    – Any update (i.e., modify) intensive database
    – Online retail stores


  • Compute Node

    [Figure: a single compute node - CPU, memory, and disk.]

    “Big Data” typically far exceeds the capabilities of a compute node.


  • Cluster Computing – Distributed File System

    [Figure: cluster architecture - racks of compute nodes (each with CPU, memory, and disk) connected by per-rack switches, with a backbone switch between racks.]
    • Each rack contains 16-64 nodes.
    • 1 Gbps between any pair of nodes in a rack; 2-10 Gbps backbone between racks.

    Examples: Google DFS; Hadoop DFS (Apache); CloudStore (open source; Kosmix, now part of Walmart Labs)


  • DFS

    • Divide up each file into chunks (e.g., 64 MB) and replicate the chunks in different racks (e.g., 3 times) – redundancy and resiliency against failures.

    • Divide up the computation into independent tasks; if task T fails, it can be restarted without affecting any other task T' ≠ T.
    ⇒ Map Reduce Paradigm.
    – Tolerant to hardware failure.
    – Master file: where are the chunks of this file?
    – The master file is replicated; the directory of the DFS keeps track of the master file copies.


  • Map Reduce Schematic

    [Figure: Map Reduce schematic - input chunks feed the Map tasks, which emit (key, value) pairs (ki, vi); the Master Controller groups them by key into pairs (k, [v1, ..., vm]) and hands each group to a Reduce task; the Reduce outputs are combined into the final output.]

    Chunk = {elements}, e.g., tuple, doc, tweet, ... A Map task may get more than one chunk as input.
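    The following is a minimal single-machine sketch of the data flow in the schematic (in Python; the names map_reduce, map_fn, and reduce_fn are ours, not from the slides): run Map over the input chunks, group the emitted (key, value) pairs by key as the Master Controller does, then apply Reduce to each group.

        from collections import defaultdict

        def map_reduce(chunks, map_fn, reduce_fn):
            # Map phase: each chunk yields (key, value) pairs.
            groups = defaultdict(list)            # k -> [v1, ..., vm]
            for chunk in chunks:
                for k, v in map_fn(chunk):
                    groups[k].append(v)
            # Grouping by key is the Master Controller's job; Reduce then
            # turns each (k, [v1, ..., vm]) into one output value.
            return {k: reduce_fn(k, vs) for k, vs in groups.items()}

    The examples on the following slides (word count, matrix-vector multiplication, the relational operators) are all instances of choosing map_fn and reduce_fn.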

  • Example 1 – Count #words in a collection of documents

    • Element = doc; key = word; value = frequency. • Chunk = {docs} Map task (k,v) pairs, initially just (w,

    1). • Master Controller: group by word (across ouput from

    various Map tasks) and merge all values. • Reduce task: hash each key to some reduce task, which

    aggregates (in this case, sums) all values associated with that key.

    • Output from all Reduce tasks merged again. • Typically, Reduce function is associative and commutative,

    e.g., addition. Can “push” some of the Reduce functionality inside Map

    tasks. (Reduce still needs to do its job.) • #Map Tasks & #Reduce Tasks decided by user.

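    Here is a runnable sketch of this word count (Python; the document list and all names are our own illustration, not the slides' code):

        from collections import defaultdict

        # Map: one document -> (word, 1) pairs.
        def map_fn(doc):
            for word in doc.split():
                yield (word, 1)

        docs = ["the cat sat", "the dog sat"]

        # Master Controller: group the Map output by key (the word).
        groups = defaultdict(list)
        for doc in docs:
            for word, one in map_fn(doc):
                groups[word].append(one)

        # Reduce: sum all the 1's associated with each word.
        counts = {word: sum(ones) for word, ones in groups.items()}
        print(counts)   # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}

    Pushing part of Reduce into Map here means pre-summing the (w, 1) pairs within each chunk (a combiner), which works because addition is associative and commutative.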

  • Example 2 – Matrix Vector Multiplication

    • At the core of the PageRank computation.
    • x = M v, where M is n × n and x, v are n × 1; x_i = Σ_j m_ij · v_j; n ≈ 10^10.
    • M is extremely sparse (web link graph): ~10-15 non-zeros per row.
    • Assume v will fit in memory. Then:
    • Map: (chunk of M, v) → pairs (i, m_ij · v_j); what is the key for the terms in the sum expression for x_i?
    • Reduce: add up all values for a given key i → x_i.

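    A small runnable sketch (Python; the toy M and v are our own) with M stored as sparse (i, j, m_ij) triples and v held in memory:

        from collections import defaultdict

        M = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]   # sparse (i, j, m_ij)
        v = [1.0, 4.0]                                 # fits in memory

        # Map: each nonzero m_ij -> key i, value m_ij * v_j.
        groups = defaultdict(list)
        for i, j, m_ij in M:
            groups[i].append(m_ij * v[j])

        # Reduce: x_i = sum of all values with key i.
        x = {i: sum(vals) for i, vals in groups.items()}
        print(x)   # {0: 2*1 + 1*4 = 6.0, 1: 3*4 = 12.0}

    The key is the row index i, since all terms of the sum for x_i must meet at the same Reduce task.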

  • Matrix Vector Mult – what if v won’t fit in memory?

    [Figure: x = M × v with v divided into stripes and the matrix divided into matching vertical stripes (one color per stripe); each stripe of the matrix is divided up into chunks.]

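    A sketch of the striped scheme (Python; the stripe boundaries and data are our own illustration): a Map task works on one chunk of a vertical stripe of M together with only the matching stripe of v, so the whole of v never needs to be in memory at once.

        from collections import defaultdict

        v = [1.0, 4.0, 2.0, 3.0]
        stripes = [(0, 2), (2, 4)]          # column ranges [lo, hi) of v
        M = [(0, 0, 2.0), (0, 3, 1.0), (1, 1, 3.0), (1, 2, 5.0)]

        groups = defaultdict(list)
        for lo, hi in stripes:
            # One vertical stripe of M, plus the matching stripe of v.
            chunk = [(i, j, m) for i, j, m in M if lo <= j < hi]
            v_stripe = v[lo:hi]
            for i, j, m_ij in chunk:        # Map: emit (i, m_ij * v_j)
                groups[i].append(m_ij * v_stripe[j - lo])

        # Reduce: as before, sum the partial products for each row i.
        x = {i: sum(vals) for i, vals in groups.items()}
        print(x)   # {0: 2*1 + 1*3 = 5.0, 1: 3*4 + 5*2 = 22.0}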

  • Relational Algebra

    • Review RA (from any database text).
    • We discuss the MR implementation of RA not because we want to implement a DBMS over MR.
    • Operations/computations over large networks can be captured using RA.
    • Efficient MR implementation of RA ⇒ efficient implementation of a whole family of such computations over large networks.
    • E.g., (node pairs connected by) paths of length two: PROJECT_{L1.From, L2.To}(RENAME(Link L1) JOIN_{To=From} RENAME(Link L2)).
    • #friends of each user: GROUP-BY_{User: COUNT(Friend)}(Friends).


  • MR implementations of SELECT/PROJECT

    • SELECT_C(R): Map: for each tuple t, if t satisfies C, output (t, t).
    • Reduce: identity, i.e., simply pass on each incoming (key, value) pair.
    • Can extract the relation by taking just the value (or just the key!).
    • PROJECT_X(R): Map: for each tuple t, project it on X; let t' = t[X], then output (t', t').
    • Reduce: transform (t', [t', ..., t']) into (t', t'), i.e., duplicate elimination.
    • Optimization: can throw out encountered duplicates early in Map; still need duplicate elimination in Reduce.

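    A sketch of both operators (Python; the relation R, the condition C, and the projection list X are our own examples, and set semantics are assumed):

        from collections import defaultdict

        R = [(1, 'a'), (2, 'b'), (2, 'b'), (3, 'c')]   # toy relation R(A, B)

        # SELECT_C(R): Map keeps only tuples satisfying C (here: A >= 2).
        def select_map(t):
            if t[0] >= 2:
                yield (t, t)

        # PROJECT_X(R): Map projects each tuple on X (here: X = {B}).
        def project_map(t):
            t_prime = (t[1],)
            yield (t_prime, t_prime)

        def run(map_fn, tuples):
            groups = defaultdict(list)
            for t in tuples:
                for k, v in map_fn(t):
                    groups[k].append(v)
            # Reduce: identity for SELECT; for PROJECT, collapsing each
            # (t', [t', ..., t']) to one t' is exactly duplicate elimination.
            return list(groups)

        print(run(select_map, R))    # [(2, 'b'), (3, 'c')]
        print(run(project_map, R))   # [('a',), ('b',), ('c',)]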

  • MR implementations of Set ops

    • Union/Intersection/Minus: Map: turn each tuple t in R into (t, R) and each tuple t in S into (t, S). Merging could create (t, [R]), (t, [S]), (t, [R, S]), or (t, [S, R]).
    • Reduce: the action depends on the operation. For union, turn any of those into (t, t); for minus, turn (t, [R]) into (t, t) and everything else into (t, NULL); for intersect, turn (t, [R, S]) or (t, [S, R]) into (t, t) and everything else into (t, NULL).

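    A runnable sketch of all three set operations (Python; the toy R and S are ours; instead of emitting (t, NULL), the sketch simply drops the tuple):

        from collections import defaultdict

        R = [(1,), (2,)]
        S = [(2,), (3,)]

        # Map: tag each tuple with the name of its source relation.
        groups = defaultdict(list)
        for t in R:
            groups[t].append('R')
        for t in S:
            groups[t].append('S')

        # Reduce: the action depends on the operation.
        union     = [t for t in groups]
        intersect = [t for t, tags in groups.items() if set(tags) == {'R', 'S'}]
        minus     = [t for t, tags in groups.items() if tags == ['R']]

        print(union)      # [(1,), (2,), (3,)]
        print(intersect)  # [(2,)]
        print(minus)      # [(1,)]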

  • MR implementations of Join

    • Natural Join: the idea works also for equi-join. Consider, e.g., R(A, B) and S(B, C).
    • Map: map each tuple (a, b) in R to (b, (R, a)) and each tuple (b, c) in S to (b, (S, c)). [Hadoop passes the Map output to the Reduce tasks sorted on key.]
    • Reduce: from each pair (b, [set of pairs of the form (R, a) or (S, c)]) produce (b, [(a1, b, c1), (a1, b, c2), ..., (am, b, cn)]). The “value” of this key-value pair = the subset of the join with B = b.
    • Boldface indicates a tuple of attributes/values.
    • Typically, join selectivity is high, so the cost is close to linear in the total size of the two relations.
    • What if a(nother) implementation of MR did not pass the Map output sorted on key?
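    A runnable sketch of the natural join (Python; the toy relations are ours; grouping by key stands in for Hadoop's sort on key):

        from collections import defaultdict

        R = [('a1', 'b1'), ('a2', 'b1')]   # R(A, B)
        S = [('b1', 'c1'), ('b2', 'c2')]   # S(B, C)

        # Map: key = join attribute b; value is tagged with its relation.
        groups = defaultdict(list)
        for a, b in R:
            groups[b].append(('R', a))
        for b, c in S:
            groups[b].append(('S', c))

        # Reduce: for each b, pair every (R, a) with every (S, c).
        join = []
        for b, vals in groups.items():
            r_side = [a for tag, a in vals if tag == 'R']
            s_side = [c for tag, c in vals if tag == 'S']
            join += [(a, b, c) for a in r_side for c in s_side]

        print(join)   # [('a1', 'b1', 'c1'), ('a2', 'b1', 'c1')]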

  • MR Implementation of Groupby

    • Example: R(A, B, C). Want GB_{A: agg1(B1), ..., aggk(Bk)}(R), where B1, ..., Bk are (not necessarily distinct) non-grouping attributes.
    • Map: map each tuple (a, b, c) to (a, b), where b is the tuple of aggregated attribute values.
    • Reduce: turn each (a, [b1, ..., bm]) into (a, agg1{b1[1], ..., bm[1]}, ..., aggk{b1[k], ..., bm[k]}).
    • Optimization: if an aggregate is associative and commutative, and if the b's associated with the same a are encountered together, some of the computation can be pushed to Map.

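    A sketch of the group-by (Python; the relation and the choice of aggregates, SUM(B) and MAX(C), are our own):

        from collections import defaultdict

        R = [('a1', 1, 10), ('a1', 2, 20), ('a2', 3, 30)]   # R(A, B, C)

        # Map: project each tuple onto the grouping attribute A (the key)
        # and the tuple of aggregated attribute values (B, C).
        groups = defaultdict(list)
        for a, b, c in R:
            groups[a].append((b, c))

        # Reduce: apply one aggregate per position, e.g. SUM(B), MAX(C).
        result = {a: (sum(b for b, _ in vals), max(c for _, c in vals))
                  for a, vals in groups.items()}
        print(result)   # {'a1': (3, 20), 'a2': (3, 30)}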

  • Matrix Mult via Join! 1/2

    • First, we will do this by composing two MR steps. View matrix M as M(I,J,V) triples and N as N(J,K,W) triples.

    • Map: Map M(i,j,m_ij) to (j,(M,i,m_ij)) and N(j,k,n_jk) to (j,(N,k,n_jk)).

    • Reduce: from each (key, value) pair (j, [triples from M and from N]), and for each (M, i, m_ij) and (N, k, n_jk) in that set of triples, output (j, (i, k, m_ij × n_jk)).


  • Matrix Mult via Join 2/2

    • Second MR:

    • Map: from each (key, value) pair (j, [(i1,k1,v1), ..., (ip,kp,vp)]), produce the (key, value) pairs ((i1,k1), v1), ..., ((ip,kp), vp).

    • Reduce: for each (key, value) pair ((i,k), [v1, ..., vm]), produce the output ((i,k), v1+ ... +vm). This is the value in row i and column k of M x N.

    • Not the most efficient, but interesting: it uses joins and composes MR steps like algebraic operators!

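    Both MR steps in one runnable sketch (Python; the toy matrices are ours), with M as M(I, J, V) triples and N as N(J, K, W) triples:

        from collections import defaultdict

        M = [(0, 0, 1.0), (0, 1, 2.0)]   # 1 x 2 matrix as (i, j, m_ij)
        N = [(0, 0, 3.0), (1, 0, 4.0)]   # 2 x 1 matrix as (j, k, n_jk)

        # First MR step: join on j and multiply.
        by_j = defaultdict(list)
        for i, j, v in M:
            by_j[j].append(('M', i, v))
        for j, k, w in N:
            by_j[j].append(('N', k, w))

        products = []                    # Reduce: emit ((i, k), m_ij * n_jk)
        for j, triples in by_j.items():
            ms = [(i, v) for tag, i, v in triples if tag == 'M']
            ns = [(k, w) for tag, k, w in triples if tag == 'N']
            products += [((i, k), v * w) for i, v in ms for k, w in ns]

        # Second MR step: group by (i, k) and sum.
        by_ik = defaultdict(list)
        for ik, v in products:
            by_ik[ik].append(v)
        P = {ik: sum(vs) for ik, vs in by_ik.items()}
        print(P)   # {(0, 0): 11.0}, i.e., 1*3 + 2*4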

  • Matrix Mult in one MR step

    [Figure: P = M × N, where M is p × q with entry m_ij in row i and N is q × r with entry n_jk in column k.]

  • Matrix Mult in one MR step

    • Map: send each m_ij to every key (i, k) for k = 1, ..., r, i.e., emit ((i, 1), (M, j, m_ij)), ..., ((i, r), (M, j, m_ij)); send each n_jk to every key (i, k) for i = 1, ..., p, i.e., emit ((1, k), (N, j, n_jk)), ..., ((p, k), (N, j, n_jk)).
    • Reduce: each key (i, k) receives ((i, k), (M, 1, m_i1)), ..., ((i, k), (M, q, m_iq)) and ((i, k), (N, 1, n_1k)), ..., ((i, k), (N, q, n_qk)); match the M-entries and N-entries on j and output ((i, k), Σ_j m_ij · n_jk).
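    The same scheme as a runnable sketch (Python; the dimensions and entries are our own toy example):

        from collections import defaultdict

        p, q, r = 2, 2, 2
        M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]   # p x q
        N = [(0, 0, 5.0), (0, 1, 6.0), (1, 0, 7.0), (1, 1, 8.0)]   # q x r

        # Map: replicate each m_ij to every key (i, k), and each n_jk
        # to every key (i, k), tagged with the source matrix and j.
        groups = defaultdict(list)
        for i, j, v in M:
            for k in range(r):
                groups[(i, k)].append(('M', j, v))
        for j, k, w in N:
            for i in range(p):
                groups[(i, k)].append(('N', j, w))

        # Reduce: for key (i, k), match M- and N-entries on j and sum.
        P = {}
        for (i, k), vals in groups.items():
            m_row = {j: v for tag, j, v in vals if tag == 'M'}
            n_col = {j: w for tag, j, w in vals if tag == 'N'}
            P[(i, k)] = sum(m_row[j] * n_col[j] for j in m_row if j in n_col)

        print(P)   # {(0, 0): 19.0, (0, 1): 22.0, (1, 0): 43.0, (1, 1): 50.0}

    Each m_ij is replicated r times and each n_jk p times; that replication is the price of doing the multiplication in a single MR step.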