Advanced Data Science with Apache Spark (Reza Zadeh, Stanford)


Transcript of Advanced Data Science with Apache Spark (Reza Zadeh, Stanford)

Page 1: Advanced Data Science with Apache Spark (Reza Zadeh, Stanford)

Reza Zadeh

Introduction to Distributed Optimization

@Reza_Zadeh | http://reza-zadeh.com

Page 2:

Optimization

At least two large classes of optimization problems humans can solve:

»  Convex
»  Spectral

Page 3:

Optimization Example: Gradient Descent
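As a reminder of the idea this slide names, here is a minimal single-machine sketch of gradient descent on a toy objective (the objective, step size, and iteration count are illustrative, not from the slides):

```python
# Minimize f(w) = (w - 3)^2 by repeatedly stepping against its gradient
# f'(w) = 2(w - 3). Toy example; real Spark jobs compute the gradient
# as a sum over a distributed dataset.

def gradient_descent(grad, w0, step=0.1, iters=100):
    """Move opposite the gradient a fixed number of times."""
    w = w0
    for _ in range(iters):
        w -= step * grad(w)
    return w

w = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
# w converges toward the minimizer w* = 3
```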

Page 4:

Logistic Regression

Already saw this with data scaling. Need to optimize with broadcast.

Page 5:

Model Broadcast: LR

Page 6:

Model Broadcast: LR

Call  sc.broadcast  on the model from the driver

Use it on the workers via  .value
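The broadcast pattern can be illustrated in plain Python. `sc.broadcast` and `.value` are the real Spark API; the `Broadcast` class below only simulates them, and the data points and weight are made up for illustration:

```python
# Simulated broadcast pattern for a logistic-regression gradient step.
# In real PySpark: bw = sc.broadcast(w); rdd.map(lambda pt: grad(pt, bw.value)).sum()
import math

class Broadcast:
    """Stand-in for a Spark broadcast variable: ship once, read via .value."""
    def __init__(self, value):
        self.value = value

def logistic_gradient(point, bw):
    """Per-point gradient of the logistic loss (1-D weights for simplicity)."""
    x, y = point
    w = bw.value                           # read-only access on the worker
    p = 1.0 / (1.0 + math.exp(-w * x))     # predicted probability
    return (p - y) * x

points = [(1.0, 1.0), (2.0, 1.0), (-1.0, 0.0)]   # illustrative (x, label) data
bw = Broadcast(0.5)                               # driver side: sc.broadcast(weights)
grad = sum(logistic_gradient(pt, bw) for pt in points)  # map + sum over the "RDD"
```

The point of broadcasting is that the model is shipped to each worker once per iteration, rather than once per task closure.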

Page 7:

Separable Updates

Can be generalized for:

»  Unconstrained optimization
»  Smooth or non-smooth objectives
»  LBFGS, Conjugate Gradient, Accelerated Gradient methods, …

Page 8:

Optimization Example: Spectral Program

Page 9:

Spark PageRank

Given a directed graph, compute node importance. Two RDDs:

»  Neighbors (a sparse graph/matrix)
»  Current guess (a vector)

Using cache(), keep the neighbor lists in RAM
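The iteration these slides describe can be sketched in plain Python: a "neighbors" table (the sparse graph) is combined with a "ranks" vector each step. Graph, damping factor, and iteration count below are illustrative assumptions, not from the slides:

```python
# Single-machine sketch of the PageRank loop. In Spark, `neighbors` and
# `ranks` are the two RDDs, and the inner loop is a join + reduceByKey.

def pagerank(neighbors, iters=20, d=0.85):
    ranks = {node: 1.0 for node in neighbors}        # initial guess vector
    for _ in range(iters):
        contribs = {node: 0.0 for node in neighbors}
        for node, outs in neighbors.items():          # the per-iteration "join"
            for dest in outs:
                contribs[dest] += ranks[node] / len(outs)
        ranks = {n: (1 - d) + d * c for n, c in contribs.items()}
    return ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}    # toy directed graph
ranks = pagerank(graph)
# "c" ends up more important than "b": it receives links from both "a" and "b"
```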

Page 10:

Spark PageRank

Using cache(), keep neighbor lists in RAM. Using partitioning, avoid repeated hashing.

[Diagram: the Neighbors RDD (id, edges) is partitioned once with partitionBy; each iteration joins it with the Ranks RDD (id, rank), and the result feeds the next join, repeatedly.]

Page 11:

Spark PageRank

Generalizes to Matrix Multiplication, opening many algorithms from Numerical Linear Algebra
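One way to see the generalization: the PageRank loop is repeated sparse matrix-vector multiplication, i.e. power iteration, which converges to the dominant eigenvector. A minimal dense sketch (matrix, starting vector, and iteration count are illustrative):

```python
# Power iteration: repeated matrix-vector products converge to the
# dominant eigenvector; the normalization factor approaches the
# dominant eigenvalue. Dense toy matrix here; Spark would keep the
# matrix as a distributed sparse RDD and the vector as another RDD.

def mat_vec(rows, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in rows]

def power_iteration(A, iters=100):
    v = [1.0] + [0.0] * (len(A) - 1)   # arbitrary nonzero starting vector
    lam = 0.0
    for _ in range(iters):
        w = mat_vec(A, v)
        lam = max(abs(x) for x in w)   # approximates the dominant eigenvalue
        v = [x / lam for x in w]       # renormalize to avoid overflow
    return v, lam

A = [[2.0, 1.0], [1.0, 2.0]]           # eigenvalues 3 and 1
v, lam = power_iteration(A)
# lam -> 3, v -> the dominant eigenvector [1, 1] (max-normalized)
```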

Page 12:

Partitioning for PageRank

Recall from the first lecture that network bandwidth is ~100× as expensive as memory bandwidth

One way Spark avoids using it is through locality-aware scheduling for RAM and disk

Another important tool is controlling the partitioning of RDD contents across nodes
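Why controlled partitioning helps can be shown in plain Python. In Spark this is `rdd.partitionBy(numPartitions)`; the hash-partition function below only simulates it, and the datasets are illustrative:

```python
# If two keyed datasets are hash-partitioned the same way, matching keys
# land in the same partition, so a join can proceed partition-by-partition
# with no all-to-all shuffle across the network.

def partition_by(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hash of the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

links = [("a", ["b"]), ("b", ["a"]), ("c", ["a"])]   # neighbor lists
ranks = [("a", 1.0), ("b", 1.0), ("c", 1.0)]         # rank vector
link_parts = partition_by(links, 4)
rank_parts = partition_by(ranks, 4)
# Every key occupies the same partition index in both datasets,
# so each "join" task only touches local data.
```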

Page 13:

Spark PageRank

Given a directed graph, compute node importance. Two RDDs:

»  Neighbors (a sparse graph/matrix)
»  Current guess (a vector)

Best representation for vector and matrix?

Page 14:

PageRank

Page 15:

Execution

Page 16:

Solution

Page 17:

New Execution

Page 18:

How it works

Each RDD has an optional Partitioner object

Any shuffle operation on an RDD with a Partitioner will respect that Partitioner

Any shuffle operation on two RDDs will take on the Partitioner of one of them, if one is set

Page 19:

Examples

Page 20:

Main Conclusion

Controlled partitioning can avoid unnecessary all-to-all communication, saving computation

Repeated joins generalize to repeated Matrix Multiplication, opening many algorithms from Numerical Linear Algebra

Page 21:

Performance

Page 22:

RDD partitioner

Page 23:

Custom Partitioning