Graphs as Streams: Rethinking Graph Processing in the Streaming Era

Post on 16-Apr-2017

554 views 0 download

Transcript of Graphs as Streams: Rethinking Graph Processing in the Streaming Era

GRAPHS AS STREAMSRETHINKING GRAPH PROCESSING IN THE STREAMING ERA

Vasia Kalavri vasia@apache.org

@vkalavri

2

MODERN STREAMING TECHNOLOGY

➤ sub-second latencies

➤ high throughput

➤ dynamic topologies

➤ powerful semantics

➤ ecosystem integration

3

MORE THAN COUNTING WORDS

Complex Event Processing Online Machine Learning Streaming SQL

4

WHAT ABOUT GRAPH PROCESSING?

5

HOW WE’VE DONE GRAPH PROCESSING SO FAR

1. Load: read the graph from disk and partition it in memory

6

HOW WE’VE DONE GRAPH PROCESSING SO FAR

1. Load: read the graph from disk and partition it in memory

2. Compute: read and mutate the graph state

7

HOW WE’VE DONE GRAPH PROCESSING SO FAR

1. Load: read the graph from disk and partition it in memory

2. Compute: read and mutate the graph state

3. Store: write the final graph state back to disk

8

“ If what you need is to analyze a static graph over and over again then this model is great!

9

WHAT’S WRONG WITH THIS MODEL?

➤ It is slow ➤ wait until the computation is over before you see any result

➤ pre-processing and partitioning

➤ It is expensive ➤ lots of memory and CPU required in order to scale

➤ It requires re-computation for graph changes ➤ no efficient way to deal with updates

10

➤ Maintain the dynamic graph structure

➤ Provide up-to-date results with low latency

➤ Compute on fresh state only

11

GRAPH STREAMING CHALLENGES

12

ACADEMIA TO THE RESCUE

➤ Graph streaming in the 90s-00s ➤ input fits in secondary storage

➤ limited memory

➤ few passes over the input data

➤ compact graph representations and summaries

13

GRAPH SUMMARIES

➤ spanners ➤ connectivity, distance

➤ sparsifiers ➤ cut estimation

➤ neighborhood sketches

graph summary

~algorithm algorithmR1 R2

14

1

43

2

5

i=0

BATCH CONNECTED COMPONENTS

15

6

7

8

1

43

2

5

6

7

8

i=0

BATCH CONNECTED COMPONENTS

16

14

345

235

24

78

67

68

1

21

2

2

i=1

BATCH CONNECTED COMPONENTS

17

6

6

6

1

21

1

2

6

6

6

i=1

BATCH CONNECTED COMPONENTS

18

2

122

112

12

76

6

6

1

11

1

1

i=2

BATCH CONNECTED COMPONENTS

19

6

6

6

54

76

86

31

52

20

STREAM CONNECTED COMPONENTS

Graph Summary: Disjoint Set (Union-Find)

➤ Only store component IDs and vertex IDs

54

76

86

42

31

52

21

1

3

Cid = 1

54

76

86

42

43

31

52

22

1

3

Cid = 1

2

5

Cid = 2

54

76

86

42

43

87

31

52

23

1

3

Cid = 1

2

5

Cid = 2

4

54

76

86

42

43

87

41

31

52

24

1

3

Cid = 1

2

5

Cid = 2

4

6

7Cid = 6

54

76

86

42

43

87

41

5225

1

3

Cid = 1

2

5

Cid = 2

4

6

7Cid = 6

8

54

76

86

42

43

87

41

26

1

3

Cid = 1

2

5

Cid = 2

4

6

7Cid = 6

8

76

86

42

43

87

41

27

6

7Cid = 6

8

1

3

Cid = 1

2

5

Cid = 2

4

76

86

42

43

87

41

28

1

3

Cid = 1

2

5

4

6

7Cid = 6

8

DISTRIBUTED STREAM CONNECTED COMPONENTS

29

THE BAD NEWS

➤A slightly different motivation ➤ finite graph stored in disk vs. unbounded graph arriving in real-time ➤ some algorithms assume we know |V|, |E| ➤ most algorithms designed for single-node execution

30

THE GOOD NEWS

➤A quite different reality ➤ memory is getting bigger ➤ … and cheaper ➤ we know how to design distributed algorithms

31

GELLY-STREAM SINGLE-PASS STREAM GRAPH PROCESSING WITH APACHE FLINK

32

GELLY ON STREAMS

DataStreamDataSet

Distributed Dataflow

Deployment

Gelly Gelly-Stream

➤ Static Graphs

➤ Multi-Pass Algorithms

➤ Full Computations

➤ Dynamic Graphs

➤ Single-Pass Algorithms

➤ Approximate Computations

DataStream

33

DISTRIBUTED STREAM CONNECTED COMPONENTS

34

STREAM CONNECTED COMPONENTS WITH FLINK

DataStream<DisjointSet> cc = edgeStream .keyBy(0) .timeWindow(Time.of(100, TimeUnit.MILLISECONDS)) .fold(new DisjointSet(), new UpdateCC()) .flatMap(new Merger()) .setParallelism(1);

35

STREAM CONNECTED COMPONENTS WITH FLINK

DataStream<DisjointSet> cc = edgeStream .keyBy(0) .timeWindow(Time.of(100, TimeUnit.MILLISECONDS)) .fold(new DisjointSet(), new UpdateCC()) .flatMap(new Merger()) .setParallelism(1);

36

Partition the edge stream

STREAM CONNECTED COMPONENTS WITH FLINK

DataStream<DisjointSet> cc = edgeStream .keyBy(0) .timeWindow(Time.of(100, TimeUnit.MILLISECONDS)) .fold(new DisjointSet(), new UpdateCC()) .flatMap(new Merger()) .setParallelism(1);

37

Define the merging frequency

STREAM CONNECTED COMPONENTS WITH FLINK

DataStream<DisjointSet> cc = edgeStream .keyBy(0) .timeWindow(Time.of(100, TimeUnit.MILLISECONDS)) .fold(new DisjointSet(), new UpdateCC()) .flatMap(new Merger()) .setParallelism(1);

38

merge locally

STREAM CONNECTED COMPONENTS WITH FLINK

DataStream<DisjointSet> cc = edgeStream .keyBy(0) .timeWindow(Time.of(100, TimeUnit.MILLISECONDS)) .fold(new DisjointSet(), new UpdateCC()) .flatMap(new Merger()) .setParallelism(1);

39

merge globally

GELLY-STREAM STATUS

➤ Properties and Metrics ➤ Transformations ➤ Aggregations ➤ Discretization ➤ Neighborhood Aggregations

40

➤ Graph Streaming Algorithms ➤ Connected Components ➤ Bipartiteness Check ➤ Window Triangle Count ➤ Triangle Count Estimation ➤ Continuous Degree Aggregate

FEELING GELLY?

➤Gelly-Stream Repository

github.com/vasia/gelly-streaming

➤A list of graph streaming papers

citeulike.org/user/vasiakalavri/tag/graph-streaming

➤A related talk at FOSDEM’16

slideshare.net/vkalavri/gellystream-singlepass-graph-streaming-analytics-with-apache-flink

41

GRAPHS AS STREAMSRETHINKING GRAPH PROCESSING IN THE STREAMING ERA

Vasia Kalavri vasia@apache.org

@vkalavri