1 QSX: Querying Social Graphs Graph algorithms in MapReduce MapReduce: an introduction BFS for distance queries PageRank Keyword search Subgraph isomorphism


QSX: Querying Social Graphs

Graph algorithms in MapReduce

MapReduce: an introduction

BFS for distance queries

PageRank

Keyword search

Subgraph isomorphism


Motivation for studying parallel algorithms

Graph queries are costly
• BFS for reachability: linear time O(|V| + |E|)
• kNN join: quadratic time
• Graph pattern matching by graph simulation: quadratic time

Worse still
• Regular path queries (finding simple paths): NP-complete
• Graph pattern matching via subgraph isomorphism: NP-complete

Can we efficiently evaluate graph queries?

Furthermore, real-life graphs are typically big

Facebook: 1.38 billion nodes and 140 billion links


The impact of the sheer volume of big data

Using an SSD of 6 GB/s, a linear scan of a data set D would take
• 1.9 days when D is of 1 PB (10^15 B)
• 5.28 years when D is of 1 EB (10^18 B)
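The two figures follow from dividing the data size by the scan rate. A minimal sketch of that arithmetic, assuming 6 GB/s means 6 × 10^9 bytes per second and a 365-day year (the function name `scan_seconds` is made up for illustration):

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def scan_seconds(size_bytes, bandwidth_bytes_per_s=6e9):
    """Time to read size_bytes once at the given sequential bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


pb_days = scan_seconds(1e15) / SECONDS_PER_DAY      # 1 PB
eb_years = scan_seconds(1e18) / SECONDS_PER_YEAR    # 1 EB
print(f"1 PB: {pb_days:.2f} days")
print(f"1 EB: {eb_years:.2f} years")
```

Under these assumptions the result is about 1.93 days for 1 PB and about 5.28 years for 1 EB.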

Is it feasible to query real-life big graphs?

A departure from classical computational complexity theory

Traditional computational complexity theory of almost 50 years:

• The good: polynomial time computable (PTIME)

• The bad: NP-hard (intractable)

• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…


Parallel query answering

We can do better provided more resources

10,000 processors

How to capitalize on the resources and reduce response time?

Using 10,000 SSDs of 6 GB/s each, a linear scan of D might take:
• 1.9 days/10,000 ≈ 16 seconds when D is of 1 PB (10^15 B)
• 5.28 years/10,000 ≈ 4.63 hours when D is of 1 EB (10^18 B)

Only ideally!
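The ideal case can be written down directly: the data is split evenly, all disks read at once, and coordination costs nothing. A small sketch under exactly those assumptions (the function name is hypothetical, and 6 GB/s is taken as 6 × 10^9 bytes per second):

```python
def parallel_scan_seconds(size_bytes, disks, bandwidth_bytes_per_s=6e9):
    """Ideal scan time: even split, concurrent reads, no coordination cost."""
    return size_bytes / (disks * bandwidth_bytes_per_s)


pb_seconds = parallel_scan_seconds(1e15, 10_000)        # 1 PB on 10,000 disks
eb_hours = parallel_scan_seconds(1e18, 10_000) / 3600   # 1 EB on 10,000 disks
print(f"1 PB: {pb_seconds:.1f} s, 1 EB: {eb_hours:.2f} h")
```

Under these assumptions 1 PB takes roughly 16.7 seconds and 1 EB roughly 4.63 hours. Real systems fall short of this because of skew, communication and stragglers, which is why the bound holds only ideally.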

[Figure: processors (P), each with its own memory (M) and database (DB), connected by an interconnection network]


Pipelined parallelism

Pipelining: a sequence of operations on each data item, each conducted by one processor

What are the limitations?

[Figure: the same architecture, with operations op1, op2, …, opn assigned to successive processors and the data flowing through them]


Parallel query answering

Given a big graph G, and n processors S1, …, Sn
• G is partitioned into fragments (G1, …, Gn)
• G is distributed to n processors: Gi is stored at Si

Dividing a big G into small fragments of manageable size

Each processor Si processes operations for a query on its local fragment Gi, in parallel
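As a rough illustration of partitioned evaluation, the sketch below splits a small adjacency-list graph into fragments and lets each "processor" answer a query on its own fragment. The partitioning rule (node id modulo n), the query (maximum out-degree) and the use of threads as stand-ins for processors are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(adj, n):
    """Split adjacency lists into n fragments (hypothetical rule: node id mod n)."""
    fragments = [dict() for _ in range(n)]
    for node, neighbors in adj.items():
        fragments[node % n][node] = neighbors
    return fragments


def local_max_out_degree(fragment):
    """Work done by one processor S_i on its local fragment G_i."""
    return max((len(nbrs) for nbrs in fragment.values()), default=0)


adj = {0: [1, 2], 1: [2], 2: [0, 1, 3], 3: [], 4: [0], 5: [4, 3]}
fragments = partition(adj, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    partial = list(pool.map(local_max_out_degree, fragments))
answer = max(partial)   # combine the partial answers
print(partial, answer)
```

This query needs no communication between fragments. Queries that follow edges across fragments, such as reachability, need additional rounds of message passing.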

[Figure: instead of evaluating Q(G) on the whole graph, Q is evaluated on the fragments G1, G2, …, Gn in parallel]


MapReduce


MapReduce

A programming model with two primitive functions:

Map: <k1, v1> → list(k2, v2)

Reduce: <k2, list(v2)> → list(k3, v3)

How does it work?

Input: a list of key-value pairs <k1, v1>

Map: applied to each pair, computes key-value pairs <k2, v2>

• The intermediate key-value pairs are hash-partitioned based on k2. Each partition (k2, list(v2)) is sent to a reducer

Reduce: takes a partition as input, and computes key-value pairs <k3, v3>

The process may reiterate: multiple map/reduce steps
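To make the data flow concrete, here is a minimal in-memory sketch of one MapReduce step. It is not Hadoop: `run_mapreduce` is a made-up helper that groups intermediate pairs by k2 in a dictionary instead of hash-partitioning them across machines, and the in-degree job is just an example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """One MapReduce step over an in-memory list of (k1, v1) pairs."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)        # shuffle: group intermediate values by k2
    output = []
    for k2 in sorted(groups):            # each group would go to one reducer
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def map_in_degree(node, adj_list):
    for target in adj_list:              # one <k2, v2> pair per out-edge
        yield target, 1


def reduce_in_degree(node, ones):
    yield node, sum(ones)                # <k3, v3> = (node, in-degree)


graph = [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]
in_degrees = run_mapreduce(graph, map_in_degree, reduce_in_degree)
print(in_degrees)
```

Iterating means feeding the output of one step back in as the input of the next.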

Google


Architecture (Hadoop)

No need to worry about how the data is stored and sent

[Figure: data flow from input <k1, v1> through the mappers to intermediate <k2, v2>, then through the reducers to output <k3, v3>]

• Input <k1, v1>: stored in DFS, partitioned in blocks (64 MB); one block for each mapper (a map task)
• Intermediate <k2, v2>: in local store of mappers; hash partition (k2)
• Reducers: aggregate results
• Multiple steps


Data partitioned parallelism

What parallelism?

[Figure: the same mapper/reducer data flow; the mappers compute in parallel on the input blocks, and the reducers compute in parallel on the partitions]


Popular in industry

Apache Hadoop, used by Facebook, Yahoo, …
• Hive, Facebook, HiveQL (SQL)
• PIG, Yahoo, Pig Latin (SQL like)
• SCOPE, Microsoft, SQL
• Cassandra, Facebook, CQL (no join)
• HBase, distributed, modeled on Google's BigTable
• MongoDB, document-oriented (NoSQL)

Scalability

– Yahoo!: 10,000 cores, for Web search queries (2008)

– Facebook: 100 PB, about half a PB per day

– Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3); the New York Times used 100 EC2 instances to process 4 TB of image data, for $240

Study Spark: https://spark.apache.org/


Advantages of MapReduce

Simple: one only needs to define two functions

no need to worry about how the data is stored, distributed and how the operations are scheduled

Fault tolerance: why?

Scalability: a large number of low-end machines
• scale out (scale horizontally): adding a new computer to a distributed software application; low-cost “commodity”
• scale up (scale vertically): upgrade, add (costly) resources to a single node

flexibility: independent of data models or schema

independence: it can work with various storage layers


Fault tolerance

Able to handle an average of 1.2 failures per analysis job

[Figure: the mapper/reducer data flow; the input blocks are triplicated]

Detecting failures and reassigning the tasks of failed nodes to healthy nodes

Redundancy checking to achieve load balancing


MapReduce algorithms

Input: query Q and graph G

Output: answers Q(G) to Q in G

map(key: node, value: (adjacency-list, others)) {
  computation;
  emit(mkey, mvalue)
}

reduce(key: __ , value: list[value]) {
  …
  emit(rkey, rvalue)
}

Compatibility:
• the input of reduce must match mkey, mvalue
• the input of map must match rkey, rvalue when multiple iterations of MapReduce are needed


Control flow

Copy files from input directory to staging dir 1; preprocessing

Iterations of MapReduce:

while (termination condition is not satisfied) do {
  a) map from staging dir 1;
  b) reduce into staging dir 2;
  c) move files from staging dir 2 to staging dir 1
}

Postprocessing; move files from staging dir 2 to output dir

Termination: non-MapReduce driver program

Functional programming: no global data structures accessible and mutable by all


BFS for distance queries


Dijkstra’s algorithm for distance queries

Distance: single-source shortest-path problem
• Input: A directed weighted graph G, and a node s in G
• Output: The lengths of shortest paths from s to all nodes in G

Dijkstra (G, s, w):
1. for all nodes v in V do
   a. d[v] ← ∞;
2. d[s] ← 0; Que ← V;
3. while Que is nonempty do
   a. u ← ExtractMin(Que);   (extract one with the minimum d[u])
   b. for all nodes v in adj(u) do
      a) if d[v] > d[u] + w(u, v) then d[v] ← d[u] + w(u, v);

Use a priority queue Que; w(u, v): weight of edge (u, v); d[u]: the distance from s to u

O(|V| log|V| + |E|).
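For reference, a runnable sketch of the sequential algorithm using Python's `heapq`. It uses lazy deletion rather than a decrease-key operation, so this version runs in O((|V| + |E|) log |V|); the bound quoted above needs a Fibonacci heap. The example graph is made up.

```python
import heapq
import math


def dijkstra(adj, s):
    """adj maps each node to a list of (neighbor, weight); returns distances from s."""
    d = {v: math.inf for v in adj}
    d[s] = 0
    heap = [(0, s)]
    done = set()
    while heap:
        du, u = heapq.heappop(heap)          # ExtractMin
        if u in done:
            continue                         # stale queue entry, skip it
        done.add(u)
        for v, w in adj[u]:
            if d[v] > du + w:                # relax edge (u, v)
                d[v] = du + w
                heapq.heappush(heap, (d[v], v))
    return d


adj = {"s": [("a", 4), ("b", 1)], "a": [("c", 1)], "b": [("a", 2), ("c", 5)], "c": []}
dist = dijkstra(adj, "s")
print(dist)
```

The algorithm is inherently sequential: each step depends on the globally smallest tentative distance, which is what makes a direct MapReduce translation awkward.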

MapReduce?


A MapReduce algorithm

Input: graph G, represented by adjacency lists

Node N:
• Node id: nid n
• N.distance: from start node s to N
• N.AdjList: [(m, w(n, m))], node id and weight of edge (n, m)

Key-value pairs:

Key: node id n

Value of node N (different structures):
• Distance: from start node s to n got so far
• Node N (id, AdjList, etc)


Mapper

Data-partitioned parallelism; parallel processing: all nodes are processed in parallel, each by a mapper

For each node m adjacent to n, emit a revised distance via n

Map (nid n, node N)
• d ← N.distance;
• emit(nid n, N);
• for each (m, w) in N.AdjList do
   • emit(nid m, d + w(n, m));   (revise distance of m via n)

Why emit(nid n, N)? To preserve the graph structure for iterative processing
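A sketch of the mapper as a Python generator. The `Node` class and the name `sssp_map` are illustrative choices, not part of any MapReduce API; the sample node is c with distance 5 and out-edges of weight 3, 9 and 2, values that are consistent with the pairs shown in the worked iterations of this lecture.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    nid: str
    distance: float = math.inf
    adj_list: list = field(default_factory=list)   # [(m, w(n, m)), ...]


def sssp_map(n, node):
    yield n, node                        # pass the graph structure along
    for m, w in node.adj_list:
        yield m, node.distance + w       # tentative distance of m via n


c = Node("c", 5, [("a", 3), ("b", 9), ("d", 2)])
emitted = list(sssp_map("c", c))
print(emitted[1:])
```

The first emitted pair carries the node itself; the others carry the tentative distances 8, 14 and 7 for a, b and d.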


Reducer

Current M.distance: minimum from all predecessors

list for m: distances from all predecessors so far; node M must exist (from Mapper)

Reduce (nid m, list)   (group by node id; each d in list is either a distance to m from a predecessor or node m)
• dmin ← ∞;
• for all d in list do
   • if IsNode(d)
      • then M ← d;   (always there. Why?)
   • else if d < dmin
      • then dmin ← d;   (minimum distance so far)
• M.distance ← min(M.distance, dmin);   (update M.distance for this round; the source s keeps distance 0 even if no predecessor reports a distance)
• emit(nid m, node M);
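A matching sketch of the reducer. As before, `Node` and `sssp_reduce` are illustrative names. The sketch takes the minimum of the stored distance and the incoming ones, so that the source keeps distance 0 even when no predecessor reports a finite distance.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    nid: str
    distance: float = math.inf
    adj_list: list = field(default_factory=list)   # [(m, w(n, m)), ...]


def sssp_reduce(m, values):
    node, d_min = None, math.inf
    for d in values:
        if isinstance(d, Node):          # IsNode(d): recover the graph structure
            node = d
        elif d < d_min:                  # otherwise d is a distance via a predecessor
            d_min = d
    node.distance = min(node.distance, d_min)
    yield m, node


a = Node("a", 10, [("b", 1), ("c", 2)])
[(key, updated)] = sssp_reduce("a", [10, a, 8])   # distances via s (10) and via c (8)
print(key, updated.distance)
```

The list for a mixes two distances with the node object itself, and the reducer keeps the smaller distance, 8.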


Iterations and termination

Termination control

Each MapReduce iteration advances the “known frontier” by one hop

Subsequent iterations include more and more reachable nodes as frontier expands

Multiple iterations are needed to explore entire graph

Termination: when the intermediate result no longer changes

For no node n, N.distance is changed in the last round

controlled by a non-MapReduce driver

Use a flag – inspected by non-MapReduce driver


Iteration 0: Base case

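mapper: (a,<s,10>) (c,<s,5>) edges

reducer: (a,<10, ...>) (c,<5, ...>)

[Figure: the "wave" at iteration 0 on the example graph with nodes s, a, b, c, d and edge weights 10, 5, 2, 3, 1, 9, 7, 2, 4, 6; s has distance 0]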


Iteration 1

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,12>) (b,<a,11>) (b,<c,14>) (d,<c,7>) edges

reducer: (a,<8, ...>) (c,<5, ...>) (b,<11, ...>) (d,<7, ...>)

group (a,<s,10>) and (a,<c,8>)

[Figure: the "wave" at iteration 1 on the example graph; distances shown: s = 0, a = 10, c = 5]


Iteration 2

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,12>) (b,<a,11>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,15>) edges

reducer: (a,<8>) (c,<5>) (b,<9>) (d,<7>)

[Figure: the "wave" at iteration 2 on the example graph; distances shown: s = 0, a = 8, c = 5, b = 9, d = 7]


Iteration 3

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,9>) (b,<a,11>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,15>) edges

reducer: (a,<8>) (c,<5>) (b,<9>) (d,<7>)

No change! Convergence!

[Figure: the example graph at iteration 3; distances: s = 0, a = 8, c = 5, b = 9, d = 7]
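The whole run can be reproduced with a compact in-memory simulation. The edge weights below are inferred from the (node, <predecessor, distance>) pairs in the iterations above; the edge from d to s with weight 7 is a guess based on the one weight left over in the figure. The map and reduce phases are folded into one function for brevity, and a plain loop plays the role of the non-MapReduce driver.

```python
import math
from collections import defaultdict

# Edge weights inferred from the trace; d -> s with weight 7 is an assumption.
EDGES = {
    "s": [("a", 10), ("c", 5)],
    "a": [("b", 1), ("c", 2)],
    "b": [("d", 4)],
    "c": [("a", 3), ("b", 9), ("d", 2)],
    "d": [("b", 6), ("s", 7)],
}


def one_round(dist):
    """One map/shuffle/reduce pass; returns new distances and a 'changed' flag."""
    inbox = defaultdict(list)
    for n, adj in EDGES.items():                 # map: emit d[n] + w(n, m) keyed by m
        for m, w in adj:
            inbox[m].append(dist[n] + w)
    new = {n: min([dist[n]] + inbox[n]) for n in EDGES}   # reduce: keep the minimum
    return new, new != dist


def shortest_distances(source):
    dist = {n: math.inf for n in EDGES}
    dist[source] = 0
    rounds, changed = 0, True
    while changed:                               # driver: stop when nothing changes
        dist, changed = one_round(dist)
        rounds += 1
    return dist, rounds


dist, rounds = shortest_distances("s")
print(dist, rounds)
```

With these weights the simulation needs four rounds, matching iterations 0 to 3, and the last round changes nothing. It ends with a = 8, b = 9, c = 5 and d = 7.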


Efficiency?

MapReduce explores all paths in parallel
• Each MapReduce iteration advances the “known frontier” by one hop
• Redundant work, since useful work is only done at the “frontier”

Dijkstra’s algorithm is more efficient
• At any step it only pursues edges from the minimum-cost path inside the frontier

Any other sources of inefficiency? Skew


A closer look

Data partitioned parallelism
• Local computation at each node in mapper, in parallel: attributes of the node, adjacent edges and local link structures
• Propagating computations: traversing the graph; this may involve iterative MapReduce

Tips:
• Adjacency lists
• Local computation in mapper
• Pass along partial results via outlinks, keyed by destination node
• Perform aggregation in reducer on inlinks to a node
• Iterate until convergence: controlled by external “driver”; need a way to test for convergence
• Pass graph structures between iterations


PageRank


PageRank

The likelihood that page v is visited by a random walk:

$$P(v) = \alpha \cdot \frac{1}{|V|} + (1 - \alpha) \sum_{u \in L(v)} \frac{P(u)}{C(u)}$$

• first term: random jump
• second term: following a link from other pages; L(v) is the set of pages that link to v, and C(u) is the number of out-links of u

Recursive computation: for each page v in G,
• compute P(v) by using P(u) for all u ∈ L(v)

until
• converge: no changes to any P(v), or
• after a fixed number of iterations

How to speed it up?


A MapReduce algorithm

Assume that α = 0 (no random jump)

Simplified version: $P(v) = \sum_{u \in L(v)} P(u)/C(u)$

Input: graph G, represented by adjacency lists

Node N:
• Node id: nid n
• N.rank: the current rank
• N.AdjList: [m], node id

Key: node id n

Value of node N:
• rank: a rank of a node
• Node N (id, AdjList, etc)


Mapper

Local computation in mapper; parallel processing: all nodes are processed in parallel, each by a mapper

Pass PageRank at n to successors of n

Map (nid n, node N)
• p ← N.rank/|N.AdjList|;   (P(u)/C(u))
• emit(nid n, N);
• for each m in N.AdjList do
   • emit(nid m, p);   (pass rank to neighbors)

emit(nid n, N): preserve graph structure for iterative processing


Reducer

Aggregation in reducer

list for m: P(u)/C(u) from all predecessors of m

m.rank at the end: $\sum_{u \in L(m)} P(u)/C(u)$

Reduce (nid m, list)
• s ← 0;
• for all p in list do
   • if IsNode(p)
      • then M ← p;   (recover graph structure)
      • else s ← s + p;   (sum up)
• M.rank ← s;   (with updated M.rank for this round)
• emit(nid m, node M);
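The mapper and reducer can be tried out on a tiny graph with the in-memory sketch below. It covers only the simplified version without the random jump. A node is represented as a `(rank, adj_list)` tuple, and the function names and the three-node graph are made up for the example.

```python
from collections import defaultdict


def pagerank_map(n, node):
    rank, adj_list = node
    yield n, node                            # preserve the graph structure
    for m in adj_list:
        yield m, rank / len(adj_list)        # P(u)/C(u) to each successor


def pagerank_reduce(m, values):
    node, s = None, 0.0
    for p in values:
        if isinstance(p, tuple):             # IsNode(p)
            node = p
        else:
            s += p                           # sum the shares from predecessors
    yield m, (s, node[1])


def one_round(graph):
    groups = defaultdict(list)
    for n, node in graph.items():
        for key, value in pagerank_map(n, node):
            groups[key].append(value)        # shuffle by destination node id
    return dict(kv for m, vals in groups.items() for kv in pagerank_reduce(m, vals))


graph = {"x": (0.5, ["y"]), "y": (0.25, ["x", "z"]), "z": (0.25, ["x"])}
after = one_round(graph)
print(after)
```

One round moves rank along the edges: x collects 0.125 from y and 0.25 from z, y collects 0.5 from x, and z collects 0.125 from y. The total stays 1 because every node has at least one out-link.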


PageRank in MapReduce

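[Figure: one MapReduce round of PageRank on the graph with adjacency lists n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]. Map sends each node's rank share to its successors; the shares are grouped by destination (n1, …, n5); Reduce sums them and restores the adjacency lists]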

Acknowledgments: some animation slides are borrowed from

www.cs.kent.edu/~jin/Cloud12Spring/GraphAlgorithms.pptx

Termination control: external driver
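A sketch of such an external driver on the five-node graph from the figure, again for the simplified version without the random jump. The initial rank 1/|V|, the tolerance and the round limit are assumptions, and map and reduce are folded into a single `one_round` function to keep the example short.

```python
ADJ = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}


def one_round(rank):
    """Map: send rank/out-degree along each edge. Reduce: sum per destination."""
    new = {n: 0.0 for n in ADJ}
    for n, successors in ADJ.items():
        for m in successors:
            new[m] += rank[n] / len(successors)
    return new


def pagerank(tolerance=1e-9, max_rounds=1000):
    rank = {n: 1 / len(ADJ) for n in ADJ}        # assumed start: uniform ranks
    for rounds in range(1, max_rounds + 1):      # the driver loop
        new = one_round(rank)
        delta = max(abs(new[n] - rank[n]) for n in ADJ)
        rank = new
        if delta < tolerance:                    # converged: no noticeable change
            break
    return rank, rounds


first = one_round({n: 0.2 for n in ADJ})
rank, rounds = pagerank()
print(first)
```

After the first round n1 holds 0.2/3, while n4 and n5 each hold 0.3. The driver stops either on convergence or after a fixed number of rounds, the two termination conditions named for the recursive computation.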


Keyword search


Distinct-root trees

A simplified version

Input: A list Q = (k1, …, km) of keywords, a directed graph G, and a positive integer k

Output: distinct trees that match Q bounded by k

Match: a subtree T = (r, (k1, p1, d1(r, p1)), …, (km, pm, dm(r, pm))) of G such that
• each keyword ki in Q is contained in a leaf pi of T
• pi is closest to r among all nodes that contain ki
• the distance from the root r of T to each leaf does not exceed k

dj(r, pj) ≤ k: k iterations (termination condition)


A MapReduce algorithm

Input: graph G, represented by adjacency lists

Node N:
• Node id: nid n
• N.((K1, P1, D1), …, (Km, Pm, Dm)): representing (n, (k1, p1, d1(n, p1)), …, (km, pm, dm(n, pm)))
• N.AdjList: [m], node id

Key: node id n

Preprocessing (can be done in MapReduce itself): for each i from 1 to m, in N.((K1, P1, D1), …, (Km, Pm, Dm)):
• Pi = ⊥ and Di = ∞ if N does not contain ki
• Pi = n and Di = 0 otherwise


Mapper

Pass information from successors: one hop forward

Map (nid n, node N)
• emit(nid n, N);
• for each m in N.AdjList do
   • emit(nid n, (M.(K1, P1, D1+1), …, M.(Km, Pm, Dm+1)));

m is the node id of node M

Local computation: shortcut one node

Contrast this to, e.g., PageRank


Reducer

Shortest distances within j hops

Invariant: in iteration j, N.((K1, P1, D1), …, (Km, Pm, Dm)) represents (n, (k1, p1, d1(n, p1)), …, (km, pm, dm(n, pm)))

Reduce (nid n, list)
• for i from 1 to m do   (N: the node represented by n; must be in list)
   • pi ← N.Pi; di ← N.Di;
• for i from 1 to m do   (group by keyword ki)
   • Si ← the set of all M.(Ki, Pi, Di) in list
   • if the smallest M.Di in Si is less than di then di ← that M.Di; pi ← the corresponding M.Pi;   (pick the one with the shortest distance to n)
• for i from 1 to m do
   • N.Pi ← pi; N.Di ← di;
• emit(nid n, node N);


Termination and post-processing

A different way of passing information during traversal

Termination: after k iterations, for a given positive integer k

Post-processing: upon termination, for each node n with N.((K1, P1, D1), …, (Km, Pm, Dm)):

If Pi ≠ ⊥ (i.e., a node containing ki was found) for all i from 1 to m, then N.((K1, P1, D1), …, (Km, Pm, Dm)) represents a valid match (n, (k1, p1, d1(n, p1)), …, (km, pm, dm(n, pm)))
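The three phases (preprocessing, k rounds, post-processing) can be sketched in memory as follows. This is a simulation rather than a MapReduce job: the tuples of a successor are read directly from a dictionary, whereas a distributed implementation would have to ship them keyed by the predecessor. `None` stands for "no node found yet", and the graph, the labels and the function name are made up.

```python
import math


def keyword_search(adj, labels, query, k):
    """Return {root: [(keyword, node, distance), ...]} for roots that reach
    every keyword of the query within k hops."""
    # Preprocessing: (P_i, D_i) = (n, 0) if n contains k_i, else (None, inf).
    state = {n: [(n, 0) if kw in labels[n] else (None, math.inf) for kw in query]
             for n in adj}
    for _ in range(k):                                   # terminate after k rounds
        new_state = {}
        for n, successors in adj.items():
            best = list(state[n])                        # start from N's own tuples
            for m in successors:                         # information from successors
                for i, (p, d) in enumerate(state[m]):
                    if d + 1 < best[i][1]:               # closer node containing k_i
                        best[i] = (p, d + 1)
            new_state[n] = best
        state = new_state
    # Post-processing: keep the nodes that found a node for every keyword.
    return {n: [(kw, p, d) for kw, (p, d) in zip(query, entries)]
            for n, entries in state.items()
            if all(p is not None for p, _ in entries)}


adj = {1: [2, 3], 2: [4], 3: [], 4: [], 5: [1]}
labels = {1: set(), 2: {"db"}, 3: {"graph"}, 4: {"graph"}, 5: set()}
matches = keyword_search(adj, labels, ["db", "graph"], k=1)
print(matches)
```

With k = 1, nodes 1 and 2 are valid roots. Node 5 becomes a root only when k is raised to 2, which shows how the bound k doubles as the termination condition.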


Graph pattern matching by subgraph isomorphism


Graph pattern matching by subgraph isomorphism

Input: a query Q and a data graph G

Output: all the matches of Q in G, i.e., all subgraphs of G that are isomorphic to Q
• a bijective function f on nodes: (u, u’) ∈ Q iff (f(u), f(u’)) ∈ G

NP-complete

MapReduce?


A MapReduce algorithm

Two MapReduce steps: preprocessing, and computation

Input: graph G, represented by adjacency lists

Node N:
• Node id: nid n
• N.Gd: the subgraph of G rooted at n, consisting of nodes within d hops of n, where d is the radius of Q
• N.AdjList: [m], node id

Key: node id n

Preprocessing: for each node n, computes N.Gd
• A MapReduce algorithm of d iterations
• adjacency lists are only used in the preprocessing step


Algorithm

Map (nid n, node N)
• compute all matches S of Q in N.Gd   (invoke any algorithm for subgraph isomorphism: VF2, Ullmann)
• emit(1, S);

Reduce (1, list)
• M ← the union of all sets in list   (not necessary; just to eliminate duplicates)
• emit(M, 1);

Just a conceptual level evaluation

Show the correctness? All and only isomorphic mappings? Yes, data locality

Parallel scalability? The more processors, the faster? Yes, as long as the number of processors does not exceed the number of nodes of G

Lot of redundant computations
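The conceptual algorithm can be tried on a toy graph with the sketch below. Several parts are assumptions: hops are counted ignoring edge direction, the radius d is passed in by the caller, and a brute-force search over permutations stands in for VF2 or Ullmann's algorithm, so it is only usable on very small inputs.

```python
from itertools import permutations


def ball(adj, n, d):
    """Nodes within d hops of n, ignoring edge direction (an assumption)."""
    undirected = {v: set(ws) for v, ws in adj.items()}
    for v, ws in adj.items():
        for w in ws:
            undirected[w].add(v)
    seen, frontier = {n}, {n}
    for _ in range(d):
        frontier = {w for v in frontier for w in undirected[v]} - seen
        seen |= frontier
    return seen


def matches_in(adj, nodes, q_nodes, q_edges):
    """Brute-force stand-in for VF2/Ullmann: all injective mappings f with
    (u, u') in Q iff (f(u), f(u')) in G, restricted to the given nodes."""
    found = set()
    for image in permutations(sorted(nodes), len(q_nodes)):
        f = dict(zip(q_nodes, image))
        if all(((u, v) in q_edges) == (f[v] in adj[f[u]])
               for u in q_nodes for v in q_nodes if u != v):
            found.add(tuple(sorted(f.items())))
    return found


def subgraph_isomorphism(adj, q_nodes, q_edges, d):
    per_node = [matches_in(adj, ball(adj, n, d), q_nodes, q_edges)
                for n in adj]                        # map: one task per node n
    return set().union(*per_node)                    # reduce: union removes duplicates


adj = {1: [2], 2: [3], 3: [1, 4], 4: []}
q_nodes = ["x", "y", "z"]
q_edges = {("x", "y"), ("y", "z"), ("z", "x")}       # a directed triangle, radius 1
result = subgraph_isomorphism(adj, q_nodes, q_edges, d=1)
print(sorted(result))
```

The triangle 1 → 2 → 3 → 1 is found from the balls around nodes 1, 2 and 3, which illustrates the redundant computation; the union in the reduce step leaves the three distinct mappings (the rotations of the triangle).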


Summing up


Summary and review

Why do we need parallel algorithms for querying graphs?

What is the MapReduce framework?

How to develop graph algorithms in MapReduce?

– Graph representation

– Local computation in mapper

– Aggregation in reducer

– Termination

Graph algorithms in MapReduce may not be efficient. Why?

Develop your own graph algorithms in MapReduce. Give correctness proof, complexity analysis and performance guarantees for your algorithms


Project (1)

A development project

Recall strongly connected components (Lecture 2)
• Implement a MapReduce algorithm that, given a graph G, computes all (maximal) strongly connected components of G
• Develop optimization strategies
• Experimentally evaluate your algorithm, especially its scalability with the size of G
• Write a survey on parallel algorithms for computing strongly connected components, as part of the related work.


Project (2)

A development project

Recall kNN joins (Lecture 2)
• Implement a MapReduce algorithm for evaluating kNN join queries
• Develop optimization strategies
• Experimentally evaluate your algorithm
• Write a survey on parallel algorithms for kNN queries and kNN join queries, as part of the related work.


Project (3)

A development project

Recall keyword search with Steiner-tree semantics (Lecture 2)
• Implement a MapReduce algorithm for keyword search with Steiner-tree semantics
• Develop optimization strategies
• Experimentally evaluate your algorithm
• Write a survey on parallel algorithms for keyword search, as part of the related work.


Papers for you to review

• W. Fan, F. Geerts, and F. Neven. Making Queries Tractable on Big Data with Preprocessing. VLDB 2013.

• Y. Tao, W. Lin, X. Xiao. Minimal MapReduce Algorithms (MMC). http://www.cse.cuhk.edu.hk/~taoyf/paper/sigmod13-mr.pdf

• L. Qin, J. Yu, L. Chang, H. Cheng, C. Zhang, X. Lin. Scalable big graph processing in MapReduce. SIGMOD 2014. http://www1.se.cuhk.edu.hk/~hcheng/paper/SIGMOD2014qin.pdf

• W. Lu, Y. Shen, S. Chen, B. Ooi. Efficient Processing of k Nearest Neighbor Joins using MapReduce. PVLDB 2012. http://arxiv.org/pdf/1207.0141.pdf

• V. Rastogi, A. Machanavajjhala, L. Chitnis, A. Sarma. Finding connected components in map-reduce in logarithmic rounds. ICDE 2013. http://arxiv.org/pdf/1203.5387.pdf