GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 ·...

34
UC BERKELEY GraphX Graph Analytics in Spark Ankur Dave Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez, Reynold Xin, Daniel Crankshaw, Michael Franklin, and Ion Stoica UC BERKELEY

Transcript of GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 ·...

Page 1: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

UC  BERKELEY  

GraphXGraph Analytics in Spark

Ankur Dave���Graduate Student, UC Berkeley AMPLab

Joint work with Joseph Gonzalez, Reynold Xin, Daniel Crankshaw, Michael Franklin, and Ion Stoica

UC  BERKELEY  

Page 2: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Model & Dependencies

Architecture

Machine Learning Landscape

Large & Dense

Parameter ServerGraph-Parallel

SparseSmall & Dense

MapReduce

Page 3: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Model & Dependencies

Architecture

Machine Learning Landscape

Large & Dense

Parameter Server

SparseSmall & Dense

Spark DataflowFramework

GraphX

Page 4: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Graphs

Page 5: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Social Networks

Page 6: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Web Graphs

Page 7: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

⋆⋆⋆⋆⋆

⋆⋆⋆⋆

⋆⋆⋆⋆⋆⋆⋆⋆⋆

User-Item Graphs

Page 8: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Graph Algorithms

Page 9: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

PageRank

Page 10: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Triangle Counting

Page 11: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Collaborative FilteringU

sers

ProductsRatings

Use

rs

≈x

Products

f(i)

f(j)

⋆⋆⋆⋆⋆

Page 12: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Collaborative Filtering

r13

r14

r24

r25

f(1)

f(2)

f(3)

f(4)

f(5)

Use

r Fac

tors

Product Factors

f [i] = arg minw2Rd

X

j2Nbrs(i)

�rij � wT f [j]

�2+ �||w||22

Page 13: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

The Graph-Parallel Pattern

Page 14: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

The Graph-Parallel Pattern

Page 15: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

The Graph-Parallel Pattern

Page 16: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Collaborative Filtering» Alternating Least Squares» Stochastic Gradient Descent» Tensor Factorization

Structured Prediction» Loopy Belief Propagation» Max-Product Linear

Programs» Gibbs Sampling

Semi-supervised ML» Graph SSL » CoEM

Community Detection» Triangle-Counting» K-core Decomposition» K-Truss

Graph Analytics» PageRank» Personalized PageRank» Shortest Path» Graph Coloring

Classification» Neural Networks

Many Graph-Parallel Algorithms

Page 17: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PR

Link TableTitle Link

Editor GraphCommunityDetection

User Community

User Com.

EditorTable

Editor Title

Top CommunitiesCom. PR..

Modern Analytics

Page 18: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Tables

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PR

Link TableTitle Link

Editor GraphCommunityDetection

User Community

User Com.

Top CommunitiesCom. PR..

EditorTable

Editor Title

Page 19: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

EditorTable

Editor Title

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PagesTitle PR

Link TableTitle Link

Editor GraphCommunityDetection

User Community

User Com.

Top CommunitiesCom. PR..

Graphs

Page 20: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

The GraphX API

Page 21: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Vertex Property:•  User Profile•  Current PageRank Value

Edge Property:•  Weights•  Relationships•  Timestamps

Property Graphs

Page 22: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Graphtype  VertexId  =  Long    val  vertices:  RDD[(VertexId,  String)]  =      sc.parallelize(List(          (1L,  “Alice”),          (2L,  “Bob”),          (3L,  “Charlie”)))    class  Edge[ED](      val  srcId:  VertexId,      val  dstId:  VertexId,      val  attr:  ED)    val  edges:  RDD[Edge[String]]  =      sc.parallelize(List(          Edge(1L,  2L,  “coworker”),          Edge(2L,  3L,  “friend”)))    val  graph  =  Graph(vertices,  edges)  

Creating a Graph (Scala)

1

3

2

Alice

Bob

Charlie

coworker

friend

Page 23: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

class  Graph[VD,  ED]  {    //  Table  Views  -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐    def  vertices:  RDD[(VertexId,  VD)]    def  edges:  RDD[Edge[ED]]    def  triplets:  RDD[EdgeTriplet[VD,  ED]]    //  Transformations  -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐    def  mapVertices[VD2](f:  (VertexId,  VD)  =>  VD2):  Graph[VD2,  ED]    def  mapEdges[ED2](f:  Edge[ED]  =>  ED2):  Graph[VD2,  ED]    def  reverse:  Graph[VD,  ED]    def  subgraph(epred:  EdgeTriplet[VD,  ED]  =>  Boolean,                              vpred:  (VertexId,  VD)  =>  Boolean):  Graph[VD,  ED]    //  Joins  -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐    def  outerJoinVertices[U,  VD2]  

               (tbl:  RDD[(VertexId,  U)])                  (f:  (VertexId,  VD,  Option[U])  =>  VD2):  Graph[VD2,  ED]  

 //  Computation  -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐    def  mapReduceTriplets[A](  

               sendMsg:  EdgeTriplet[VD,  ED]  =>  Iterator[(VertexId,  A)],                  mergeMsg:  (A,  A)  =>  A):  RDD[(VertexId,  A)]    

Graph Operations (Scala)

Page 24: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

 //  Continued  from  previous  slide    def  pageRank(tol:  Double):  Graph[Double,  Double]    def  triangleCount():  Graph[Int,  ED]    def  connectedComponents():  Graph[VertexId,  ED]    //  ...and  more:  org.apache.spark.graphx.lib  

}  

Built-in Algorithms (Scala)

PageRank Triangle Count Connected Components

Page 25: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

RDD

The triplets view

Graph

1

3

2

Alice

Bob

Charlie

coworker

friend

class  Graph[VD,  ED]  {    def  triplets:  RDD[EdgeTriplet[VD,  ED]]  

}    class  EdgeTriplet[VD,  ED](      val  srcId:  VertexId,  val  dstId:  VertexId,  val  attr:  ED,      val  srcAttr:  VD,  val  dstAttr:  VD)  

srcAttr dstAttr attrAlice coworker BobBob friend Charlie

triplets  

Page 26: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

The subgraph transformationclass  Graph[VD,  ED]  {  

 def  subgraph(epred:  EdgeTriplet[VD,  ED]  =>  Boolean,                              vpred:  (VertexId,  VD)  =>  Boolean):  Graph[VD,  ED]  

}    graph.subgraph(epred  =  (edge)  =>  edge.attr  !=  “relative”)  

subgraph  

Graph

Alice Bob

Charlie

relativefriend

coworker

Davidrelative

Graph

Alice Bob

Charlie

friend

coworker

David

Page 27: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

The subgraph transformationclass  Graph[VD,  ED]  {  

 def  subgraph(epred:  EdgeTriplet[VD,  ED]  =>  Boolean,                              vpred:  (VertexId,  VD)  =>  Boolean):  Graph[VD,  ED]  

}    graph.subgraph(vpred  =  (id,  name)  =>  name  !=  “Bob”)  

subgraph  

Graph

Alice Bob

Charlie

relativefriend

coworker

Davidrelative

Graph

Alice

Charlie

relative

Davidrelative

Page 28: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Computation with mapReduceTripletsclass  Graph[VD,  ED]  {      def  mapReduceTriplets[A](          sendMsg:  EdgeTriplet[VD,  ED]  =>  Iterator[(VertexId,  A)],          mergeMsg:  (A,  A)  =>  A):  RDD[(VertexId,  A)]  }    graph.mapReduceTriplets(      edge  =>  Iterator(          (edge.srcId,  1),          (edge.dstId,  1)),      _  +  _)  

Graph

Alice Bob

Charlie

relativefriend

coworker

Davidrelative

mapReduceTriplets  

RDD

vertex id degreeAlice 2Bob 2Charlie 3David 1

upgrade to aggregateMessages  in Spark 1.2.0

Page 29: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

How GraphX Works

Page 30: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Part. 2

Part. 1

Vertex Table

(RDD)

B C

A D

F E

A D

Encoding Property Graphs as RDDs

D

Property Graph

B C

D

E

AA

F

Machine 1

Machine 2

Edge Table(RDD)

A B

A C

C D

B C

A E

A F

E F

E D

B

C

D

E

A

F

RoutingTable

(RDD)

B

C

D

E

A

F

1  

2  

1   2  

1   2  

1  

2  

Vertex Cut

Page 31: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Graph System OptimizationsSpecialized���

Data-StructuresVertex-CutsPartitioning

RemoteCaching / Mirroring

Message Combiners Active Set Tracking

Page 32: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

0500

100015002000250030003500

0100020003000400050006000700080009000

Twitter Graph (42M Vertices,1.5B Edges) UK-Graph (106M Vertices, 3.7B Edges)

PageRank Benchmark

GraphX performs comparably to ���state-of-the-art graph processing systems.

Runt

ime

(Sec

onds

)

EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE

7x 18x

Page 33: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

1.  Language supporta)  Java API: PR #3234b)  Python API: collaborating with Intel, SPARK-3789

2.  More algorithmsa)  LDA (topic modeling): PR #2388b)  Correlation clusteringc)  Your algorithm here?

3.  Speculativea)  Streaming/time-varying graphsb)  Graph database–like queries

Future of GraphX

Page 34: GraphX - Stanford Universitystanford.edu/~rezab/nips2014workshop/slides/ankur.pdf · 2014-12-13 · Ankur Dave! Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez,

Thanks!

[email protected]

[email protected]@eecs.berkeley.edu

[email protected]

http://spark.apache.org/graphx