Distributed Graph Analytics with Gradoop
inovex Meetup Munich
Let's talk about Graph Databases
July 2016
Martin Junghanns (@kc1s) University of Leipzig – Database Research Group
Motivation
Graph = (Vertices, Edges)
"Graphs are everywhere"
Graph = (Users, Followers)
[Figure: users Alice, Bob, Eve, Dave, Carol, Mallory, Peggy and Trent connected by follower edges]
Graph = (Users, Friendships)
[Figure: users Alice, Bob, Eve, Dave, Carol, Mallory, Peggy and Trent connected by friendship edges]
Graph = (Users ∪ Bands, Friendships ∪ Likes)
[Figure: users Alice, Bob, Dave, Carol, Mallory and Peggy connected by friendship edges, with likes edges to the bands AC/DC and Metallica]
"Graphs are heterogeneous"
"Graphs can be analyzed"
[Figure: the users-and-bands graph annotated with numeric analysis results: 0.2, 0.28, 0.26, 0.33, 0.25, 0.26, 3.6, 2.82]
Assuming a social network • Heterogeneous data
1. Determine subgraph • Apply graph transformation
2. Find communities • Handle collections of graphs
3. Filter communities • Aggregation, Selection
4. Find common subgraph • Apply dedicated algorithm
"And let's not forget …"
"… Graphs are large."
Gradoop: "An open-source framework and research platform for efficient, distributed and domain-independent management and analytics of heterogeneous graph data."
[Figure: system landscape with the axes "Data Volume and Problem Complexity" and "Ease-of-use", positioning Graph Databases, Graph Processing Systems and Graph Dataflow Systems (Gelly)]
[Figure: Gradoop architecture with the following components]
• Distributed Graph Store (Apache HBase)
• Apache Flink Operator Implementation
• Apache Flink Distributed Operator Execution
• Extended Property Graph Model (EPGM)
• Graph Analytical Language (GrALa)
• I/O
• Distributed File System (Apache HDFS)
Extended Property Graph Model (EPGM)
• Vertices and directed Edges
• Logical Graphs
• Identifiers
• Type Labels
• Properties
[Figure: example EPGM graph with five vertices (ids 1 to 5), five edges (ids 1 to 5) and two logical graphs (ids 1 and 2)]
Vertices: Person (name: Alice, born: 1984); Band (name: Metallica, founded: 1981); Person (name: Bob); Person (name: Eve); Band (name: AC/DC, founded: 1973)
Edges: likes (since: 2014); likes (since: 2013); likes (since: 2015); knows; likes (since: 2014)
Logical graphs: 1|Community|interest:Heavy Metal; 2|Community|interest:Hard Rock
Operators
Operator overview
 | Unary | Binary
Logical Graph | Aggregation, Pattern Matching, Transformation, Grouping, Subgraph, Call | Combination, Overlap, Exclusion, Equality
Graph Collection | Limit, Selection, Distinct, Sort, Apply, Reduce, Call | Union, Intersection, Difference, Equality
Algorithms: Flink Gelly Library, BTG Extraction, Frequent Subgraphs, Adaptive Partitioning
Basic Binary Operators
• Combination
• Overlap
• Exclusion
[Figure: two overlapping logical graphs 1 and 2 and the results of combination, overlap and exclusion]
LogicalGraph graph3 = graph1.combine(graph2);
LogicalGraph graph4 = graph1.overlap(graph2);
LogicalGraph graph5 = graph1.exclude(graph2);
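The semantics of the three binary operators can be illustrated without Gradoop. The following is a minimal sketch in plain Java, assuming a logical graph is reduced to a set of vertex ids and a set of (source, target) edges. The `SimpleGraph` and `Edge` types are hypothetical and not part of the Gradoop API, and the example graphs are an assumption. Combination is modeled as the union of both graphs, overlap as their intersection, and exclusion as the part of the first graph that does not occur in the second, keeping an edge only if both of its endpoints remain.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class BinaryOperatorsSketch {

  /** Hypothetical directed edge, identified by its source and target vertex id. */
  static final class Edge {
    final int source;
    final int target;

    Edge(int source, int target) {
      this.source = source;
      this.target = target;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof Edge && ((Edge) o).source == source && ((Edge) o).target == target;
    }

    @Override
    public int hashCode() {
      return Objects.hash(source, target);
    }
  }

  /** Hypothetical stand-in for a logical graph: vertex ids plus edges. */
  static final class SimpleGraph {
    final Set<Integer> vertices = new HashSet<>();
    final Set<Edge> edges = new HashSet<>();

    SimpleGraph addEdge(int source, int target) {
      vertices.add(source);
      vertices.add(target);
      edges.add(new Edge(source, target));
      return this;
    }

    /** Combination: union of the vertices and edges of both graphs. */
    SimpleGraph combine(SimpleGraph other) {
      SimpleGraph result = new SimpleGraph();
      result.vertices.addAll(vertices);
      result.vertices.addAll(other.vertices);
      result.edges.addAll(edges);
      result.edges.addAll(other.edges);
      return result;
    }

    /** Overlap: vertices and edges contained in both graphs. */
    SimpleGraph overlap(SimpleGraph other) {
      SimpleGraph result = new SimpleGraph();
      result.vertices.addAll(vertices);
      result.vertices.retainAll(other.vertices);
      result.edges.addAll(edges);
      result.edges.retainAll(other.edges);
      return result;
    }

    /** Exclusion: vertices only in this graph, plus the edges between them. */
    SimpleGraph exclude(SimpleGraph other) {
      SimpleGraph result = new SimpleGraph();
      result.vertices.addAll(vertices);
      result.vertices.removeAll(other.vertices);
      for (Edge e : edges) {
        if (result.vertices.contains(e.source) && result.vertices.contains(e.target)) {
          result.edges.add(e);
        }
      }
      return result;
    }
  }

  public static void main(String[] args) {
    // Assumed example: both graphs share vertex 3.
    SimpleGraph graph1 = new SimpleGraph().addEdge(1, 2).addEdge(3, 2);
    SimpleGraph graph2 = new SimpleGraph().addEdge(3, 4).addEdge(3, 5).addEdge(5, 4);
    System.out.println("combine: " + graph1.combine(graph2).vertices);
    System.out.println("overlap: " + graph1.overlap(graph2).vertices);
    System.out.println("exclude: " + graph1.exclude(graph2).vertices);
  }
}
```

In the sketch, excluding `graph2` from `graph1` drops the edge from vertex 3 to vertex 2, because vertex 3 also belongs to `graph2`.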
Aggregation
[Figure: a UDF aggregates logical graph 3 and stores the result as a graph property: 3 | vertexCount: 5]
graph3 = graph3.aggregate("vertexCount", new AggregateFunction<Long>() {
  public DataSet<Long> execute(LogicalGraph g) {
    return Count.count(g.getVertices());
  }
});
Transformation
[Figure: a UDF transforms graph 3 while keeping its structure; the graph head "3 | vertexCount: 5" becomes "3 | Community | vCount: 5" and the vertex property name:Alice becomes f_name:Alice]
graph3 = graph3.transformEdges(new TransformationFunction<Edge>() {
  public Edge execute(Edge e) {
    e.setLabel(e.getLabel().equals("orange") ? "red" : e.getLabel());
    return e;
  }
});
Subgraph
[Figure: vertex and edge UDFs applied to graph 3 yield subgraph 4]
LogicalGraph graph4 = graph3.subgraph(
  new FilterFunction<Vertex>() {
    public boolean execute(Vertex v) {
      return v.getLabel().equals("green");
    }
  },
  new FilterFunction<Edge>() {
    public boolean execute(Edge e) {
      return e.getLabel().equals("orange");
    }
  });
Pattern Matching
[Figure: matching a pattern against graph 3 yields a graph collection]
GraphCollection collection = graph3.match("(:Green)-[:orange]->(:Orange)");
Grouping
[Figure: graph 3 is grouped by keys into graph 4 with super vertices 6 and 7; aggregates are added per group, e.g. count:2 and count:3 on the super vertices and max(a) on the super edges]
LogicalGraph grouped = graph3.groupBy()
  .useVertexLabel()
  .useEdgeLabel()
  .addVertexAggregate(new CountAggregator())
  .addEdgeAggregate(new MaxAggregator("a"));
Apply (e.g. Aggregation)
[Figure: an operator is applied to every graph of a collection; graphs 1 and 2 receive the properties vertexCount: 5 and vertexCount: 4]
collection = collection.apply(new Aggregation<>("vertexCount", new VertexCount()));
Selection
[Figure: the UDF predicate vertexCount > 4 keeps graph 1 (vertexCount: 5) and drops graph 2 (vertexCount: 4)]
GraphCollection filtered = collection.select(new FilterFunction<GraphHead>() {
  public boolean filter(GraphHead g) {
    return g.getPropertyValue("vertexCount").getLong() > 4L;
  }
});
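Apply and Selection are typically chained: first a value is computed for every graph in a collection, then the collection is filtered on that value. The following is a minimal sketch of that chain in plain Java, not the Gradoop API. The class `SelectionSketch`, its methods, and the assignment of vertices to the two graphs are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SelectionSketch {

  /** Apply step: computes the vertex count of every graph in the collection. */
  static Map<Integer, Long> vertexCounts(Map<Integer, List<Integer>> collection) {
    Map<Integer, Long> counts = new LinkedHashMap<>();
    collection.forEach((graphId, vertices) -> counts.put(graphId, (long) vertices.size()));
    return counts;
  }

  /** Selection step: keeps the ids of graphs whose vertex count exceeds the threshold. */
  static List<Integer> select(Map<Integer, Long> counts, long threshold) {
    return counts.entrySet().stream()
        .filter(entry -> entry.getValue() > threshold)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Assumed collection: graph id -> vertex ids.
    Map<Integer, List<Integer>> collection = new LinkedHashMap<>();
    collection.put(1, Arrays.asList(0, 1, 2, 3, 4));
    collection.put(2, Arrays.asList(5, 6, 7, 8));

    Map<Integer, Long> counts = vertexCounts(collection);
    System.out.println(select(counts, 4L)); // prints [1]
  }
}
```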
Call (e.g. Clustering)
[Figure: an algorithm is called on logical graph 1 and returns a graph collection (graphs 2 and 3)]
GraphCollection clustering = graph.callForCollection(new ClusteringAlgorithm());
Call (e.g. Page Rank)
[Figure: an algorithm is called on logical graph 1 and returns logical graph 2, whose vertices carry rank properties (e.g. rank:0.11, rank:0.25, rank:0.75, rank:1.29, rank:1.58)]
LogicalGraph pageRankGraph = graph.callForGraph(new PageRankAlgorithm());
Implementation
• Apache Flink
• Gradoop on Flink
Apache Flink
"Streaming Dataflow Engine that provides data distribution, communication and fault tolerance for distributed computations over data streams."
https://flink.apache.org/
[Figure: Apache Flink stack]
• Libraries and integrations: Hadoop MR, Table, Gelly, FlinkML, Zeppelin, Cascading, MRQL, Dataflow, Storm, SAMOA, GRADOOP
• APIs: DataSet (Batch), DataStream (Stream)
• Core: Streaming Dataflow Runtime
• Deployment: Local, Cluster (e.g. YARN), Cloud (e.g. EC2)
• Data Storage (e.g. Files, HDFS, S3, JDBC, HBase, Kafka, …)
• DataSet := Distributed Collection of Data Objects
• Transformation := Operation on DataSets (Higher-order function)
• Flink Program := Composition of Transformations
[Figure: a Flink program as a dataflow of DataSets connected by Transformations]
Hadoop-like Transformations
• map
• flatMap
• mapPartition
• reduce
• reduceGroup
• coGroup
Special Flink Operations
• iterate
• iterateDelta
SQL-like Transformations
• filter
• project
• cross
• union
• distinct
• first-N (limit)
• groupBy
• aggregate
• join
• leftOuterJoin
• rightOuterJoin
• fullOuterJoin
Word count example
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> text = env.fromElements( // or env.readTextFile("hdfs://…")
    "He who controls the past controls the future.",
    "He who controls the present controls the past.");

DataSet<Tuple2<String, Integer>> wordCounts = text
    .flatMap(new LineSplitter()) // splits the line and outputs (word, 1) tuples
    .groupBy(0)
    .sum(1);

wordCounts.print(); // trigger execution

Input: "He who controls the past controls the future." "He who controls the present controls the past."
flatMap: (He,1) (who,1) (controls,1) (the,1) (past,1) // ...
groupBy(0): [(He,1),(He,1)] [(who,1),(who,1)] [(future,1)] [(past,1),(past,1)] [(present,1)] // ...
sum(1): (He,2) (who,2) (future,1) (past,2) (present,1) // ...
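The `LineSplitter` used in the word count is not shown on the slide. As a rough illustration of the same flatMap, groupBy, sum pipeline, here is a minimal sketch that swaps Flink for `java.util.stream` on a single JVM. The class name `WordCountSketch` and the tokenization rule (splitting on non-word characters) are assumptions; the sketch does not reproduce Flink's distributed execution.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSketch {

  /** Splits lines into words and counts the occurrences of each word. */
  static Map<String, Integer> wordCounts(List<String> lines) {
    return lines.stream()
        // flatMap: split each line into words (non-word characters act as separators)
        .flatMap(line -> Arrays.stream(line.split("\\W+")))
        .filter(word -> !word.isEmpty())
        // groupBy(0) + sum(1): group by word and add up a 1 for every occurrence
        .collect(Collectors.groupingBy(
            word -> word, TreeMap::new, Collectors.summingInt(word -> 1)));
  }

  public static void main(String[] args) {
    List<String> text = Arrays.asList(
        "He who controls the past controls the future.",
        "He who controls the present controls the past.");
    // prints {He=2, controls=4, future=1, past=2, present=1, the=4, who=2}
    System.out.println(wordCounts(text));
  }
}
```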
[Figure: parallel execution of the word count between Source and Sink with three parallel instances]
Instance 1: flatMap (He,1) (who,1) (controls,1) → groupBy(0) [(He,1),(He,1)] [(who,1),(who,1)] → sum(1) (He,2) (who,2)
Instance 2: flatMap (the,1) (past,1) → groupBy(0) [(future,1)] [(past,1),(past,1)] → sum(1) (future,1) (past,2)
Instance 3: flatMap (future,1) (past,1) → groupBy(0) [(present,1)] → sum(1) (present,1)
Gradoop on Flink
EPGM Graph Representation
• EPGMGraphHead (POJO): Id, Label, Properties; stored as DataSet<EPGMGraphHead>
• EPGMVertex (POJO): Id, Label, Properties, Graphs; stored as DataSet<EPGMVertex>
• EPGMEdge (POJO): Id, Label, Properties, SourceId, TargetId, Graphs; stored as DataSet<EPGMEdge>
Field types (shown for EPGMVertex):
• Id: GradoopId := UUID 128-bit
• Label: String
• Properties: PropertyList := List<Property>, Property := (String, PropertyValue), PropertyValue := byte[]
• Graphs: GradoopIdSet := Set<GradoopId>
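To make the representation more concrete, the following is a minimal sketch of the three element types as plain Java classes. It is not the Gradoop implementation: the class names are hypothetical, `java.util.UUID` stands in for GradoopId, and properties are kept in a `Map<String, Object>` instead of the byte[]-backed PropertyValue.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

public class EpgmModelSketch {

  /** Logical graph metadata: id, label, properties. */
  static class GraphHead {
    final UUID id = UUID.randomUUID();
    final String label;
    final Map<String, Object> properties = new HashMap<>();

    GraphHead(String label) {
      this.label = label;
    }
  }

  /** Vertex: id, label, properties and the ids of the graphs that contain it. */
  static class Vertex {
    final UUID id = UUID.randomUUID();
    final String label;
    final Map<String, Object> properties = new HashMap<>();
    final Set<UUID> graphIds = new HashSet<>();

    Vertex(String label) {
      this.label = label;
    }
  }

  /** Directed edge: like a vertex, plus source and target vertex ids. */
  static class Edge {
    final UUID id = UUID.randomUUID();
    final String label;
    final UUID sourceId;
    final UUID targetId;
    final Map<String, Object> properties = new HashMap<>();
    final Set<UUID> graphIds = new HashSet<>();

    Edge(String label, UUID sourceId, UUID targetId) {
      this.label = label;
      this.sourceId = sourceId;
      this.targetId = targetId;
    }
  }

  public static void main(String[] args) {
    GraphHead community = new GraphHead("Community");
    community.properties.put("interest", "Heavy Metal");

    Vertex alice = new Vertex("Person");
    alice.properties.put("name", "Alice");
    alice.properties.put("born", 1984);
    alice.graphIds.add(community.id);

    Vertex metallica = new Vertex("Band");
    metallica.properties.put("name", "Metallica");
    metallica.properties.put("founded", 1981);
    metallica.graphIds.add(community.id);

    Edge likes = new Edge("likes", alice.id, metallica.id);
    likes.properties.put("since", 2014);
    likes.graphIds.add(community.id);

    System.out.println(alice.properties.get("name") + " -[" + likes.label + "]-> "
        + metallica.properties.get("name"));
  }
}
```

Storing graph membership as a set of ids on each vertex and edge lets one element belong to several logical graphs, which is what the Graphs column expresses.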
DataSet<EPGMGraphHead>
Id | Label | Properties
1 | Community | {interest:Heavy Metal}
2 | Community | {interest:Hard Rock}
DataSet<EPGMVertex>
Id | Label | Properties | Graphs
1 | Person | {name:Alice, born:1984} | {1}
2 | Band | {name:Metallica, founded:1981} | {1}
3 | Person | {name:Bob} | {1,2}
4 | Band | {name:AC/DC, founded:1973} | {2}
5 | Person | {name:Eve} | {2}
DataSet<EPGMEdge>
Id | Label | Source | Target | Properties | Graphs
1 | likes | 1 | 2 | {since:2014} | {1}
2 | likes | 3 | 2 | {since:2013} | {1}
3 | likes | 3 | 4 | {since:2015} | {2}
4 | knows | 3 | 5 | {} | {2}
5 | likes | 5 | 4 | {since:2014} | {2}
Grouping example
LogicalGraph grouped = graph1.combine(graph2).groupBy()
  .useVertexLabel()
  .useEdgeLabel()
  .addVertexAggregate(new CountAggregator())
  .addEdgeAggregate(new CountAggregator());
[Figure: the combined example graph is grouped into graph 4 with super vertices 6 (Person, count: 3) and 7 (Band, count: 2) and super edges likes (count: 4) and knows (count: 1)]
[Figure: dataflow of the grouping operator]
Vertices (V):
1. Map: Extract attributes
   (1,[Person],[]) (2,[Band],[]) (3,[Person],[]) (4,[Band],[]) (5,[Person],[])
2. GroupBy(1) + GroupReduce*: Assign vertices to groups, compute aggregates, create super vertex tuples, forward updated group members
   (-,6,[Person],[3]) (1,6,[],[]) (-,7,[Band],[2]) (2,7,[],[]) (3,6,[],[]) (4,7,[],[]) (5,6,[],[])
3. Filter + Map: Extract super vertex tuples, build super vertices
   v6 v7
4. Filter + Map: Extract group members, reduce memory footprint
   (1,6) (2,7) (3,6) (4,7) (5,6)
Edges (E):
1. Map: Extract attributes
   (1,1,2,[likes],[]) (2,3,2,[likes],[]) (3,3,4,[likes],[]) (4,3,5,[knows],[]) (5,5,4,[likes],[])
2. Join*: Replace Source/TargetId with corresponding super vertex id
   (1,6,7,[likes],[]) (2,6,7,[likes],[]) (3,6,7,[likes],[]) (4,6,6,[knows],[]) (5,6,7,[likes],[])
3. GroupBy(1,2,3) + GC + GR* + Map: Assign edges to groups, compute aggregates, build super edges
   e6 e7
*requires worker communication
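The grouping logic itself (group vertices by label, redirect edges to the groups of their endpoints, then group and count the edges) can be sketched on a single machine. The following plain-Java example is an illustration under assumptions, not Gradoop's Flink implementation: the class `GroupingSketch`, its method names and the string keys used for super edges are hypothetical. The input is the five-vertex, five-edge example graph from the tables above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class GroupingSketch {

  /** Counts vertices per label; each label corresponds to one super vertex. */
  static Map<String, Integer> superVertexCounts(Map<Integer, String> vertexLabels) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String label : vertexLabels.values()) {
      counts.merge(label, 1, Integer::sum);
    }
    return counts;
  }

  /**
   * Counts edges per (source group, edge label, target group).
   * edges[i] holds {sourceId, targetId}; edgeLabels[i] is the label of that edge.
   */
  static Map<String, Integer> superEdgeCounts(
      Map<Integer, String> vertexLabels, int[][] edges, String[] edgeLabels) {
    Map<String, Integer> counts = new TreeMap<>();
    for (int i = 0; i < edges.length; i++) {
      // Replace source and target ids with the group (here: the label) of each vertex.
      String sourceGroup = vertexLabels.get(edges[i][0]);
      String targetGroup = vertexLabels.get(edges[i][1]);
      String key = sourceGroup + " -[" + edgeLabels[i] + "]-> " + targetGroup;
      counts.merge(key, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    Map<Integer, String> vertexLabels = new HashMap<>();
    vertexLabels.put(1, "Person");
    vertexLabels.put(2, "Band");
    vertexLabels.put(3, "Person");
    vertexLabels.put(4, "Band");
    vertexLabels.put(5, "Person");

    int[][] edges = {{1, 2}, {3, 2}, {3, 4}, {3, 5}, {5, 4}};
    String[] edgeLabels = {"likes", "likes", "likes", "knows", "likes"};

    // prints {Band=2, Person=3}
    System.out.println(superVertexCounts(vertexLabels));
    // prints {Person -[knows]-> Person=1, Person -[likes]-> Band=4}
    System.out.println(superEdgeCounts(vertexLabels, edges, edgeLabels));
  }
}
```

The counts agree with the grouped result shown for the example graph (Person: 3, Band: 2, likes: 4, knows: 1). In the distributed dataflow, the grouping and join steps are the ones marked as requiring worker communication.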
EPGM API (Operators)
class LogicalGraph<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> {
  fromCollections(...) : LogicalGraph<G, V, E>
  fromDataSets(...) : LogicalGraph<G, V, E>
  fromGellyGraph(...) : LogicalGraph<G, V, E>
  getGraphHead() : DataSet<G>
  getVertices() : DataSet<V>
  getEdges() : DataSet<E>
  aggregate(...) : LogicalGraph<G, V, E>
  match(...) : GraphCollection<G, V, E>
  groupBy(...) : LogicalGraph<G, V, E>
  subgraph(...) : LogicalGraph<G, V, E>
  combine(...) : LogicalGraph<G, V, E>
  // ...
}
class GraphCollection<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> {
  fromCollections(...) : GraphCollection<G, V, E>
  fromDataSets(...) : GraphCollection<G, V, E>
  getGraphHeads() : DataSet<G>
  getVertices() : DataSet<V>
  getEdges() : DataSet<E>
  select(...) : GraphCollection<G, V, E>
  distinct() : GraphCollection<G, V, E>
  sortBy(...) : GraphCollection<G, V, E>
  union(...) : GraphCollection<G, V, E>
  difference(...) : GraphCollection<G, V, E>
  // ...
}
EPGM API (I/O)
interface DataSource<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> {
  getLogicalGraph(...) : LogicalGraph<G, V, E>
  getGraphCollection(...) : GraphCollection<G, V, E>
}
interface DataSink<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> {
  write(LogicalGraph<G, V, E>) : void
  write(GraphCollection<G, V, E>) : void
}
class GraphDataSource<...> implements DataSource<...> { }
class HBaseDataSource<...> implements DataSource<...> { }
class JSONDataSource<...> implements DataSource<...> { }
class TLFDataSource<...> implements DataSource<...> { }
class HBaseDataSink<...> implements DataSink<...> { }
class JSONDataSink<...> implements DataSink<...> { }
class TLFDataSink<...> implements DataSink<...> { }
Benchmark
1. Extract subgraph containing only Persons and knows relations
2. Transform Persons to keep only the necessary information
3. Find communities using Label Propagation
4. Aggregate vertex count for each community
5. Select communities with more than 50K users
6. Combine large communities to a single graph
7. Group graph by Persons' location and gender
8. Aggregate vertex and edge count of grouped graph
http://ldbcouncil.org/
https://git.io/vgozj
Dataset | # Vertices | # Edges | Disk size
Graphalytics.1 | 61,613 | 2,026,082 | 570 MB
Graphalytics.10 | 260,613 | 16,600,778 | 4.5 GB
Graphalytics.100 | 1,695,613 | 147,437,275 | 40.2 GB
Graphalytics.1000 | 12,775,613 | 1,363,747,260 | 372 GB
Graphalytics.10000 | 90,025,613 | 10,872,109,028 | 2.9 TB
• 16x Intel(R) Xeon(R) 2.50GHz 6 (12)
• 16x 48 GB RAM
• 1 Gigabit Ethernet
• Hadoop 2.6.0
• Flink 1.0-SNAPSHOT
• slots (per worker) 12
• jobmanager.heap.mb 2048
• taskmanager.heap.mb 40960
[Figure: Runtime [s] (axis 0 to 1200) over the number of workers (1, 2, 4, 8, 16) for Graphalytics.100]
[Figure: Speedup over the number of workers (1, 2, 4, 8, 16) for Graphalytics.100, compared with linear speedup]
[Figure: Runtime [s] on a logarithmic axis (1 to 10000)]
Summary
Release History
• 0.0.1 First Prototype (May 2015)
– Hadoop MapReduce and Giraph for operator implementations
– Too much complexity
– Performance loss through serialization in HDFS/HBase
• 0.0.2 Using Flink as execution layer (June 2015)
– Basic operators
• 0.1 December 2015
– System-side identifiers (UUID)
– Improved property handling
– More operator implementations (e.g., Equality, Bool operators)
– Code refactoring
• 0.2-SNAPSHOT August 2016
– Graph Pattern Matching
– Frequent Subgraph Mining
– Memory optimization (96-bit ID, Dictionary Encoding, …)
– Refactoring
Contributions welcome!
• Code
– I/O Formats (GraphML, DOT, …)
– Operators and Algorithms
– Tuning (Memory consumption, serialization, …)
– API improvements
• Use cases and data
– Business Intelligence
– Fraud Detection
– Pattern Mining
– …
Summary
• Extended Property Graph Model
– Schema flexible: Type Labels and Properties
– Logical Graphs / Graph Collections
• Graph and Collection Operators
– Combination to analytical workflows
• Implemented on Apache Flink
– Built-in scalability
– Combine with other libraries
www.gradoop.com
[1] Junghanns, M.; Petermann, A.; Teichmann, N.; Gomez, K.; Rahm, E., "Analyzing Extended Property Graphs with Apache Flink", Int. Workshop on Network Data Analytics (NDA), SIGMOD 2016.
[2] Petermann, A.; Junghanns, M., "Scalable Business Intelligence with Graph Collections", it – Special Issue on Big Data Analytics, 2016.
[3] Petermann, A.; Junghanns, M.; Müller, M.; Rahm, E., "Graph-based Data Integration and Business Intelligence with BIIIG", Proc. VLDB Conf. (Demo), 2014.