Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12
![Page 1: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/1.jpg)
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
Presented by Joseph Gonzalez
Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica
Strata 2014
*These slides are best viewed in PowerPoint with animation.
![Page 2: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/2.jpg)
Graphs are Central to Analytics
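[Figure: analytics pipeline over Raw Wikipedia (XML). A Text Table (Title, Body) feeds a Topic Model (LDA) that yields Word Topics (Word, Topic). Hyperlinks feed PageRank, which yields the Top 20 Pages (Title, PR). A Discussion Table (User, Disc.) and an Editor Graph feed Community Detection, which yields User Community (User, Com.). A Term-Doc Graph and Community Topic (Topic, Com.) complete the picture.]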
![Page 3: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/3.jpg)
PageRank: Identifying Leaders
$$R[i] = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} R[j]$$
Here R[i] is the rank of user i, and the sum is the weighted sum of the neighbors' ranks.
Update ranks in parallel
Iterate until convergence
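The update rule on this slide can be sketched on plain Scala collections. This is a minimal illustration, not the GraphX implementation: the object name `PageRankSketch` is hypothetical, and the weight w_ji = 0.85 / outDegree(j) is an assumed (common) choice that the slide does not specify.

```scala
object PageRankSketch {
  // edges are (j, i) pairs meaning j links to i.
  // Assumption: w_ji = 0.85 / outDegree(j); the slide leaves w_ji unspecified.
  def pageRank(edges: Seq[(Int, Int)], numIters: Int): Map[Int, Double] = {
    val vertices = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (src, es) => src -> es.size }
    var ranks: Map[Int, Double] = vertices.map(v => v -> 1.0).toMap
    for (_ <- 1 to numIters) {
      // Each edge j -> i contributes w_ji * R[j] to vertex i.
      val contribs = edges.map { case (j, i) => i -> 0.85 / outDegree(j) * ranks(j) }
      val sums = contribs.groupBy(_._1).map { case (i, cs) => i -> cs.map(_._2).sum }
      // R[i] = 0.15 + weighted sum; vertices without in-edges keep 0.15.
      ranks = vertices.map(v => v -> (0.15 + sums.getOrElse(v, 0.0))).toMap
    }
    ranks
  }
}
```

Every vertex is updated from the previous iteration's ranks only, which is what allows the updates to run in parallel.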
![Page 4: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/4.jpg)
The Graph-Parallel Pattern
Model / Alg. State
Computation depends only on the neighbors
![Page 5: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/5.jpg)
Many Graph-Parallel Algorithms
• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – Tensor Factorization
• Structured Prediction
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling
• Semi-supervised ML
  – Graph SSL
  – CoEM
• Community Detection
  – Triangle-Counting
  – K-core Decomposition
  – K-Truss
• Graph Analytics
  – PageRank
  – Personalized PageRank
  – Shortest Path
  – Graph Coloring
• Classification
  – Neural Networks
![Page 6: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/6.jpg)
Graph-Parallel Systems
Expose specialized APIs to simplify graph programming.
Exploit graph structure to achieve orders-of-magnitude performance gains over more general data-parallel systems.
![Page 7: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/7.jpg)
PageRank on the Live-Journal Graph
Runtime (in seconds, PageRank for 10 iterations):
GraphLab: 22
Naïve Spark: 354
Mahout/Hadoop: 1340
GraphLab is 60x faster than Hadoop
GraphLab is 16x faster than Spark
![Page 8: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/8.jpg)
Graphs are Central to Analytics
[Figure: the Raw Wikipedia (XML) analytics pipeline again, with its table stages (Text Table, Discussion Table, Top 20 Pages, Word Topics, User Community, Community Topic) and graph stages (Hyperlinks, Term-Doc Graph, Editor Graph) feeding PageRank, Topic Model (LDA), and Community Detection]
![Page 9: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/9.jpg)
Separate Systems to Support Each View
[Figure: Table View (a table of rows reduced to a result) beside Graph View (a dependency graph)]
![Page 10: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/10.jpg)
Having separate systems for each view is difficult to use and inefficient
![Page 11: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/11.jpg)
Difficult to Program and Use
Users must Learn, Deploy, and Manage multiple systems
Leads to brittle and often complex interfaces
![Page 12: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/12.jpg)
Inefficient
Extensive data movement and duplication across the network and file system
[Figure: XML input passing through HDFS between every stage of the pipeline]
Limited reuse of internal data structures across stages
![Page 13: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/13.jpg)
Solution: The GraphX Unified Approach
Enabling users to easily and efficiently express the entire graph analytics pipeline
New API: Blurs the distinction between Tables and Graphs
New System: Combines Data-Parallel and Graph-Parallel Systems
![Page 14: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/14.jpg)
GraphX Unified Representation
Tables and Graphs are composable views of the same physical data
[Figure: Graph View and Table View over the same data]
Each view has its own operators that exploit the semantics of the view to achieve efficient execution
![Page 15: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/15.jpg)
View a Graph as a Table
[Figure: Property Graph with vertices R, J, F, and I]

Vertex Property Table

| Id | Property (V) |
| --- | --- |
| rxin | (Stu., Berk.) |
| jegonzal | (PstDoc, Berk.) |
| franklin | (Prof., Berk) |
| istoica | (Prof., Berk) |

Edge Property Table

| SrcId | DstId | Property (E) |
| --- | --- | --- |
| rxin | jegonzal | Friend |
| franklin | rxin | Advisor |
| istoica | franklin | Coworker |
| franklin | jegonzal | PI |
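As a rough illustration of the two-table view, the property graph from this slide can be written as two ordinary Scala collections and queried with table operators. The object name `PropertyGraphTables` and the helper `outDegree` are hypothetical, and plain `Seq` stands in for a distributed table.

```scala
object PropertyGraphTables {
  type Id = String

  // Vertex Property Table: (Id, Property (V))
  val vertices: Seq[(Id, (String, String))] = Seq(
    ("rxin", ("Stu.", "Berk.")),
    ("jegonzal", ("PstDoc", "Berk.")),
    ("franklin", ("Prof.", "Berk")),
    ("istoica", ("Prof.", "Berk"))
  )

  // Edge Property Table: (SrcId, DstId, Property (E))
  val edges: Seq[(Id, Id, String)] = Seq(
    ("rxin", "jegonzal", "Friend"),
    ("franklin", "rxin", "Advisor"),
    ("istoica", "franklin", "Coworker"),
    ("franklin", "jegonzal", "PI")
  )

  // A table-style query over the graph: out-degree per source vertex.
  def outDegree: Map[Id, Int] =
    edges.groupBy(_._1).map { case (src, es) => src -> es.size }
}
```

Because the graph is just two tables, a structural question such as out-degree becomes a plain `groupBy` over the edge table.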
![Page 16: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/16.jpg)
Table Operators
Table (RDD) operators are inherited from Spark:
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
![Page 17: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/17.jpg)
Graph Operators

    class Graph [ V, E ] {
      def Graph(vertices: Table[ (Id, V) ],
                edges: Table[ (Id, Id, E) ])
      // Table Views -----------------
      def vertices: Table[ (Id, V) ]
      def edges: Table[ (Id, Id, E) ]
      def triplets: Table [ ((Id, V), (Id, V), E) ]
      // Transformations ------------------------------
      def reverse: Graph[V, E]
      def subgraph(pV: (Id, V) => Boolean,
                   pE: Edge[V,E] => Boolean): Graph[V,E]
      def mapV(m: (Id, V) => T ): Graph[T,E]
      def mapE(m: Edge[V,E] => T ): Graph[V,T]
      // Joins ----------------------------------------
      def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]
      def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]
      // Computation ----------------------------------
      def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],
                     reduceF: (T, T) => T): Graph[T, E]
    }
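To make the operator signatures concrete, here is a small in-memory sketch of a few of them. It is an assumption-laden toy, not the GraphX code: `MiniGraphX` is a hypothetical name, `Seq` replaces the distributed `Table`, `Id` is taken to be `Long`, and only `triplets`, `reverse`, `mapV`, `subgraph`, and `mrTriplets` are covered.

```scala
object MiniGraphX {
  type Id = Long

  // A triplet: an edge together with both endpoint properties.
  final case class Edge[V, E](src: (Id, V), dst: (Id, V), attr: E)

  final case class Graph[V, E](vertices: Seq[(Id, V)], edges: Seq[(Id, Id, E)]) {
    private lazy val vmap: Map[Id, V] = vertices.toMap

    // Join every edge with its source and destination vertex properties.
    def triplets: Seq[Edge[V, E]] =
      edges.map { case (s, d, e) => Edge((s, vmap(s)), (d, vmap(d)), e) }

    def reverse: Graph[V, E] =
      Graph(vertices, edges.map { case (s, d, e) => (d, s, e) })

    def mapV[T](m: (Id, V) => T): Graph[T, E] =
      Graph(vertices.map { case (id, v) => (id, m(id, v)) }, edges)

    // Keep vertices passing pV and edges passing pE whose endpoints both survive.
    def subgraph(pV: (Id, V) => Boolean, pE: Edge[V, E] => Boolean): Graph[V, E] = {
      val keptV = vertices.filter { case (id, v) => pV(id, v) }
      val ids = keptV.map(_._1).toSet
      val keptE = triplets
        .filter(t => ids(t.src._1) && ids(t.dst._1) && pE(t))
        .map(t => (t.src._1, t.dst._1, t.attr))
      Graph(keptV, keptE)
    }

    // Map over triplets to emit (Id, T) messages, then reduce messages per vertex.
    // Sketch limitation: vertices that receive no message are absent from the
    // result, so calling triplets on the returned graph is not supported here.
    def mrTriplets[T](mapF: Edge[V, E] => List[(Id, T)], reduceF: (T, T) => T): Graph[T, E] = {
      val reduced = triplets.flatMap(mapF).groupBy(_._1).toSeq
        .map { case (id, ms) => (id, ms.map(_._2).reduce(reduceF)) }
      Graph(reduced, edges)
    }
  }
}
```

For example, `g.mrTriplets[Int](t => List((t.dst._1, 1)), _ + _).vertices` would give the in-degree of every vertex that has at least one incoming edge.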
![Page 18: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/18.jpg)
Triplets Join Vertices and Edges
The triplets operator joins vertices and edges:
[Figure: Vertices (A, B, C, D) joined with Edges (A→B, A→C, B→C, C→D) give one triplet per edge]
The mrTriplets operator sums adjacent triplets.

    SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum
    FROM triplets AS t
    GROUP BY t.dstId
![Page 19: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/19.jpg)
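Map Reduce Triplets
Map-Reduce for each vertex
[Figure: on a graph with vertices A to F, mapF is applied to the triplets joining A with B and A with C, producing messages A1 and A2; reduceF(A1, A2) combines them into the new value for A]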
![Page 20: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/20.jpg)
Example: Oldest Follower
What is the age of the oldest follower for each user?

    val oldestFollowerAge = graph
      .mrTriplets(
        e => (e.dst.id, e.src.age), // Map
        (a, b) => max(a, b)         // Reduce
      )
      .vertices

[Figure: graph of users A to F annotated with the ages 23, 42, 30, 19, 75, and 16]
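The oldest-follower query can be tried out without Spark using a small stand-in for mrTriplets. Everything below is illustrative: `OldestFollowerSketch`, `User`, `Triplet`, the follow edges, and the assignment of ages to users are all hypothetical, since the slide's figure does not survive in text form.

```scala
object OldestFollowerSketch {
  type Id = String
  final case class User(id: Id, age: Int)
  // src follows dst
  final case class Triplet(src: User, dst: User)

  // Stand-in for mrTriplets: map each triplet to a message, reduce per target id.
  def mrTriplets[T](triplets: Seq[Triplet])(mapF: Triplet => (Id, T),
                                            reduceF: (T, T) => T): Map[Id, T] =
    triplets.map(mapF).groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(reduceF) }

  // Hypothetical data: which user has which age is an assumption.
  val a = User("A", 23); val b = User("B", 42); val c = User("C", 30)
  val d = User("D", 19); val e = User("E", 75); val f = User("F", 16)
  val follows: Seq[Triplet] =
    Seq(Triplet(b, a), Triplet(c, a), Triplet(e, d), Triplet(f, d))

  val oldestFollowerAge: Map[Id, Int] =
    mrTriplets[Int](follows)(
      t => (t.dst.id, t.src.age),         // Map: send follower age to the followed user
      (x: Int, y: Int) => math.max(x, y)  // Reduce: keep the largest age
    )
}
```

Users with no followers simply do not appear in the result map in this sketch.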
![Page 21: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/21.jpg)
We express the Pregel and GraphLab abstractions using the GraphX operators in less than 50 lines of code!
By composing these operators we can construct entire graph-analytics pipelines.
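One way to picture how a Pregel-style abstraction can sit on top of an mrTriplets-like operator is the loop below. It is a sketch under assumptions, not the authors' implementation: `PregelSketch`, `pregel`, `vprog`, `sendMsg`, and `mergeMsg` are names chosen for illustration, the graph lives in plain Scala collections, and connected components by minimum-label propagation is used as the example program.

```scala
object PregelSketch {
  type Id = Int

  // mrTriplets stand-in: run mapF on every (src, dst) pair, reduce messages per target.
  def mrTriplets[V, T](vertices: Map[Id, V], edges: Seq[(Id, Id)])(
      mapF: ((Id, V), (Id, V)) => List[(Id, T)],
      reduceF: (T, T) => T): Map[Id, T] =
    edges.flatMap { case (s, d) => mapF((s, vertices(s)), (d, vertices(d))) }
      .groupBy(_._1).map { case (id, ms) => id -> ms.map(_._2).reduce(reduceF) }

  // Repeat: deliver messages, update the vertices that received one, send again.
  def pregel[V, T](init: Map[Id, V], edges: Seq[(Id, Id)], maxIters: Int)(
      vprog: (Id, V, T) => V,
      sendMsg: ((Id, V), (Id, V)) => List[(Id, T)],
      mergeMsg: (T, T) => T): Map[Id, V] = {
    var state = init
    var msgs = mrTriplets(state, edges)(sendMsg, mergeMsg)
    var i = 0
    while (msgs.nonEmpty && i < maxIters) {
      state = state.map { case (id, v) =>
        id -> msgs.get(id).map(m => vprog(id, v, m)).getOrElse(v)
      }
      msgs = mrTriplets(state, edges)(sendMsg, mergeMsg)
      i += 1
    }
    state
  }

  // Example program: label every vertex with the smallest id in its component.
  def connectedComponents(vertexIds: Seq[Id], edges: Seq[(Id, Id)]): Map[Id, Id] =
    pregel[Id, Id](vertexIds.map(v => v -> v).toMap, edges, maxIters = 100)(
      (_, label, msg) => math.min(label, msg),
      (s, d) =>
        if (s._2 < d._2) List((d._1, s._2))      // push the smaller label forward
        else if (d._2 < s._2) List((s._1, d._2)) // or backward along the edge
        else Nil,                                // labels agree: nothing to send
      (x, y) => math.min(x, y))
}
```

The loop stops as soon as no triplet produces a message, which is the point where every vertex agrees with its neighbors.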
![Page 22: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/22.jpg)
DIY Demo this Afternoon
![Page 23: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/23.jpg)
GraphX System Design
![Page 24: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/24.jpg)
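Distributed Graphs as Tables (RDDs)
[Figure: a Property Graph with vertices A to F is stored as three tables split across Part. 1 and Part. 2. The Vertex Table (RDD) holds the vertex properties. The Edge Table (RDD) holds A→B, A→C, C→D, B→C in one partition and A→E, A→F, E→F, E→D in the other. The Routing Table (RDD) records which edge partitions reference each vertex: B and C map to 1, E and F map to 2, A and D map to both 1 and 2.]
2D Vertex Cut Heuristic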
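The idea behind a 2D vertex cut and the routing table it induces can be sketched as follows. This is an assumed, simplified scheme (a plain modulo grid over a perfect-square number of partitions), not the heuristic GraphX actually ships; `VertexCutSketch`, `partition2D`, and `routingTable` are hypothetical names.

```scala
object VertexCutSketch {
  type Id = Int

  // Place edge (src, dst) on a side x side grid: src picks the row, dst the column.
  // Assumption: numParts is a perfect square.
  def partition2D(src: Id, dst: Id, numParts: Int): Int = {
    val side = math.sqrt(numParts.toDouble).toInt
    val row = java.lang.Math.floorMod(src, side)
    val col = java.lang.Math.floorMod(dst, side)
    row * side + col
  }

  // Routing table: for each vertex, the set of edge partitions that reference it.
  def routingTable(edges: Seq[(Id, Id)], numParts: Int): Map[Id, Set[Int]] =
    edges.flatMap { case (s, d) =>
      val p = partition2D(s, d, numParts)
      Seq(s -> p, d -> p)
    }.groupBy(_._1).map { case (v, ps) => v -> ps.map(_._2).toSet }
}
```

With this layout a vertex can appear in at most one grid row as a source and one grid column as a destination, so it is replicated to at most 2 * side - 1 partitions no matter how many edges it has.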
![Page 25: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/25.jpg)
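Caching for Iterative mrTriplets
[Figure: the Vertex Table (RDD) with vertices A to F ships vertex data to a Mirror Cache next to each Edge Table (RDD) partition. The partition holding A→B, A→C, C→D, B→C caches A, B, C, D; the partition holding A→E, A→F, E→F, E→D caches A, D, E, F.]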
![Page 26: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/26.jpg)
Incremental Updates for Iterative mrTriplets
[Figure: the Vertex Table (RDD), the two Edge Table (RDD) partitions, and their Mirror Caches (A, B, C, D and A, D, E, F). Only the changed vertices, A and E, are sent to the mirror caches before the edge Scan.]
![Page 27: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/27.jpg)
Aggregation for Iterative mrTriplets
[Figure: the Vertex Table (RDD), the two Edge Table (RDD) partitions, and their Mirror Caches (A, B, C, D and A, D, E, F). After the Scan, each edge partition computes a Local Aggregate (for B, C, D, and F in the figure) and sends the resulting changes back to the Vertex Table.]
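The effect of aggregating locally before shipping results back to the vertex table can be illustrated with plain collections. This sketch is an assumption about the general technique (a per-partition combine followed by a global reduce), not GraphX internals; `LocalAggregateSketch` and `aggregate` are hypothetical names.

```scala
object LocalAggregateSketch {
  type Id = Char

  // msgsByPartition: the (target vertex, value) messages produced in each edge partition.
  // Returns (final per-vertex result, messages without local aggregation,
  //          messages actually shipped after local aggregation).
  def aggregate[T](msgsByPartition: Seq[Seq[(Id, T)]],
                   reduceF: (T, T) => T): (Map[Id, T], Int, Int) = {
    // Step 1: combine messages for the same vertex inside each partition.
    val local: Seq[Map[Id, T]] = msgsByPartition.map { part =>
      part.groupBy(_._1).map { case (id, ms) => id -> ms.map(_._2).reduce(reduceF) }
    }
    val unaggregated = msgsByPartition.map(_.size).sum
    val shipped = local.map(_.size).sum
    // Step 2: combine the partial results at the vertex table.
    val global = local.flatMap(_.toSeq).groupBy(_._1)
      .map { case (id, ms) => id -> ms.map(_._2).reduce(reduceF) }
    (global, unaggregated, shipped)
  }
}
```

For the eight edges on the slide, counting in-degree this way ships six partial sums instead of eight raw messages, and the saving grows with the number of edges per vertex inside a partition.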
![Page 28: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/28.jpg)
Reduction in Communication Due to Cached Updates
[Figure: Network Comm. (MB), log scale from 0.1 to 10000, versus Iteration (0 to 16), for Connected Components on Twitter Graph]
Most vertices are within 8 hops of all vertices in their comp.
![Page 29: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/29.jpg)
Benefit of Indexing Active Edges
[Figure: Runtime (Seconds), 0 to 30, versus Iteration (0 to 16), for Connected Components on Twitter Graph. Two series: Scan (Scan All Edges) and Indexed (Index of “Active” Edges).]
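The difference between scanning all edges and consulting an index of active edges can be sketched as below. The layout is an assumption made for illustration (a map from source vertex to its edges); `ActiveEdgesSketch` and its functions are hypothetical and do not describe the actual GraphX index structure.

```scala
object ActiveEdgesSketch {
  type Id = Char
  type Edge = (Id, Id)

  // Full scan: every edge is touched, even when few sources are active.
  // Returns (edges with an active source, number of edges touched).
  def scanAll(edges: Seq[Edge], active: Set[Id]): (Seq[Edge], Int) =
    (edges.filter(e => active(e._1)), edges.size)

  // Index edges by source vertex once, then reuse it across iterations.
  def buildIndex(edges: Seq[Edge]): Map[Id, Seq[Edge]] = edges.groupBy(_._1)

  // Indexed lookup: only edges of active sources are touched.
  def scanIndexed(index: Map[Id, Seq[Edge]], active: Set[Id]): (Seq[Edge], Int) = {
    val hit = active.toSeq.flatMap(v => index.getOrElse(v, Seq.empty))
    (hit, hit.size)
  }
}
```

Late iterations of an algorithm such as connected components have few active vertices, which is where touching only indexed edges pays off.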
![Page 30: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/30.jpg)
Join Elimination
Identify and bypass joins for unused triplets fields
» Example: PageRank only accesses source attribute
[Figure: Communication (MB), 0 to 14000, versus Iteration (0 to 20), for PageRank on Twitter. Two series: Three Way Join and Join Elimination.]
Factor of 2 reduction in communication
![Page 31: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/31.jpg)
Additional Query Optimizations
Indexing and Bitmaps:
» To accelerate joins across graphs
» To efficiently construct sub-graphs
Substantial Index and Data Reuse:
» Reuse routing tables across graphs and sub-graphs
» Reuse edge adjacency information and indices
![Page 32: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/32.jpg)
Performance Comparisons
Live-Journal: 69 Million Edges
Runtime (in seconds, PageRank for 10 iterations):
GraphLab: 22
GraphX: 68
Giraph: 207
Naïve Spark: 354
Mahout/Hadoop: 1340
GraphX is roughly 3x slower than GraphLab
![Page 33: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/33.jpg)
GraphX scales to larger graphs
Twitter Graph: 1.5 Billion Edges
Runtime (in seconds, PageRank for 10 iterations):
GraphLab: 203
GraphX: 451
Giraph: 749
GraphX is roughly 2x slower than GraphLab
» Scala + Java overhead: Lambdas, GC time, …
» No shared memory parallelism: 2x increase in comm.
![Page 34: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/34.jpg)
PageRank is just one stage…
What about a pipeline?
![Page 35: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/35.jpg)
A Small Pipeline in GraphX
[Figure: Raw Wikipedia (XML) → Hyperlinks → PageRank → Top 20 Pages, run as Spark Preprocess, Compute, and Spark Post. stages with HDFS between them]
[Figure: bar chart of Total Runtime (in Seconds), axis 0 to 1600, for GraphLab + Spark, GraphX, Giraph + Spark, and Spark; the values on the chart are 342, 375, 605, and 1492]
Timed end-to-end, GraphX is faster than GraphLab
![Page 36: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/36.jpg)
The GraphX Stack (Lines of Code)
PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40), LDA (120), K-core (51), Triangle Count (45), SVD (40)
Pregel (28) + GraphLab (50)
GraphX (3575)
Spark
![Page 37: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/37.jpg)
Status
Alpha release as part of Spark 0.9
Seeking collaborators and feedback
![Page 38: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/38.jpg)
Conclusion and Observations
Domain specific views: Tables and Graphs
» tables and graphs are first-class composable objects
» specialized operators which exploit view semantics
Single system that efficiently spans the pipeline
» minimize data movement and duplication
» eliminates need to learn and manage multiple systems
Graphs through the lens of database systems
» Graph-Parallel Pattern → Triplet joins in relational alg.
» Graph Systems → Distributed join optimizations
![Page 39: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/39.jpg)
Active Research
Static Data → Dynamic Data
» Apply GraphX unified approach to time evolving data
» Model and analyze relationships over time
Serving Graph Structured Data
» Allow external systems to interact with GraphX
» Unify distributed graph databases with relational database technology
![Page 40: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/40.jpg)
Thanks!
http://amplab.github.io/graphx/
![Page 41: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/41.jpg)
Graph Property 1: Real-World Graphs
Power-Law Degree Distribution
[Figure: log-log plot of Number of Vertices (count, 10^0 to 10^10) versus Degree (10^0 to 10^8) for the AltaVista WebGraph, 1.4B Vertices, 6.6B Edges]
Top 1% of vertices are adjacent to 50% of the edges!
More than 10^8 vertices have one neighbor.
Edges >> Vertices
[Figure: Ratio of Edges to Vertices (0 to 200) by Year, 2008 to 2012]
![Page 42: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/42.jpg)
Graph Property 2: Active Vertices
[Figure: Num-Vertices, log scale from 1 to 100000000, versus Number of Updates (0 to 70), for PageRank on Web Graph]
51% updated only once!
![Page 43: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/43.jpg)
Graphs are Essential to Data Mining and Machine Learning
Identify influential people and information
Find communities
Understand people’s shared interests
Model complex data dependencies
![Page 44: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/44.jpg)
Recommending Products
[Figure: bipartite graph of Users and Items connected by Ratings]
![Page 45: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/45.jpg)
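Recommending Products
Low-Rank Matrix Factorization:
[Figure: the Netflix ratings matrix (Users × Movies) is approximated by User Factors (U) × Movie Factors (M). As a graph, users f(1), f(2) and movies f(3), f(4), f(5) are connected by the ratings r13, r14, r24, r25, and each edge joins a user factor f(i) to a movie factor f(j).]
Iterate:
$$f[i] = \arg\min_{w \in \mathbb{R}^d} \sum_{j \in \mathrm{Nbrs}(i)} \left( r_{ij} - w^T f[j] \right)^2 + \lambda \|w\|_2^2$$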
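For readers who want the step behind "Iterate", the per-vertex problem is a ridge regression and has a closed-form solution. The derivation below is standard linear algebra rather than something stated on the slide; $I$ denotes the $d \times d$ identity matrix.

```latex
% Objective for vertex i, with neighbor factors f[j] held fixed:
%   J(w) = \sum_{j \in Nbrs(i)} (r_{ij} - w^T f[j])^2 + \lambda \|w\|_2^2
\begin{aligned}
\nabla_w J(w) &= -2 \sum_{j \in \mathrm{Nbrs}(i)} \left( r_{ij} - w^T f[j] \right) f[j] + 2 \lambda w = 0 \\
\Rightarrow\quad
\Big( \sum_{j \in \mathrm{Nbrs}(i)} f[j]\, f[j]^T + \lambda I \Big)\, w &= \sum_{j \in \mathrm{Nbrs}(i)} r_{ij}\, f[j] \\
\Rightarrow\quad
f[i] &= \Big( \sum_{j \in \mathrm{Nbrs}(i)} f[j]\, f[j]^T + \lambda I \Big)^{-1} \sum_{j \in \mathrm{Nbrs}(i)} r_{ij}\, f[j]
\end{aligned}
```

Both sums run only over the neighbors of vertex i, which is why this update fits the graph-parallel pattern: each vertex needs just its incident ratings and the adjacent factors.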
![Page 46: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/46.jpg)
Predicting User Behavior
[Figure: graph of users and their posts; a few users are labeled Liberal or Conservative and the remaining ones are unknown (?)]
Conditional Random Field, Belief Propagation
![Page 47: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/47.jpg)
Finding Communities
Count triangles passing through each vertex:
Measures “cohesiveness” of local community
[Figure: two example neighborhoods on vertices 1 to 4. More Triangles: Stronger Community. Fewer Triangles: Weaker Community.]
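Counting the triangles through each vertex can be sketched with neighbor-set intersections. This is a small single-machine illustration on Scala collections, not the distributed version; `TriangleCountSketch` and `trianglesPerVertex` are hypothetical names, and edges are treated as undirected.

```scala
object TriangleCountSketch {
  // edges are undirected pairs; self loops are ignored.
  def trianglesPerVertex(edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Build the neighbor set of every vertex.
    val nbrs: Map[Int, Set[Int]] = edges
      .filter { case (a, b) => a != b }
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1).map { case (v, ps) => v -> ps.map(_._2).toSet }

    nbrs.map { case (v, ns) =>
      // For each neighbor u, count the neighbors shared by u and v.
      // Every triangle through v is found twice (once from each other corner).
      val twice = ns.toSeq.map(u => (nbrs(u) & ns).size).sum
      v -> twice / 2
    }
  }
}
```

A vertex whose neighbors are themselves densely connected gets a high count, matching the slide's reading of triangles as a measure of local cohesiveness.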
![Page 48: Unifying Data-Parallel and Graph-Parallel Analytics · 2017-09-12 · GraphX:! Unifying Data-Parallel and Graph-Parallel Analytics ! Presented by Joseph Gonzalez"! Joint work with](https://reader035.fdocuments.in/reader035/viewer/2022070820/5f1c99a504248f2ff34317e7/html5/thumbnails/48.jpg)
Example Graph Analytics Pipeline
[Figure: three phases. Preprocessing: Raw Data (XML) → ETL → Initial Graph → Slice → Subgraph. Compute: PageRank, repeated. Post Proc.: Analyze → Top Users.]