Large-Scale Matrix Operations Using a Data Flow Engine


Page 1: Large-Scale Matrix Operations Using a Data Flow Engine

Matei Zaharia

Large-Scale Matrix Operations Using a Data Flow Engine

Page 2

Outline
Data flow vs. traditional network programming
Limitations of MapReduce
Spark computing engine
Matrix operations on Spark

Page 3

Problem
Data growing faster than processing speeds
Only solution is to parallelize on large clusters
» Wide use in both enterprises and web industry

How do we program these things?

Page 4

Traditional Network Programming
Message passing between nodes (e.g. MPI)
Very difficult to do at scale:
» How to split the problem across nodes?
  • Must consider network & data locality
» How to deal with failures? (inevitable at scale)
» Even worse: stragglers (a node that has not failed, but is slow)
Rarely used in commodity datacenters

Page 5

Data Flow Models
Restrict the programming interface so that the system can do more automatically
Express jobs as graphs of high-level operators
» System picks how to split each operator into tasks and where to run each task
» Run parts twice for fault recovery

Biggest example: MapReduce

[Diagram: three Map tasks feeding two Reduce tasks]

Page 6

MapReduce for Matrix Operations
» Matrix-vector multiply
» Power iteration (e.g. PageRank)
» Gradient descent methods
» Stochastic SVD
» Tall-skinny QR

Many others!
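To make the first item concrete, here is a minimal single-machine sketch (plain Python, not an actual MapReduce job) of how a sparse matrix-vector multiply decomposes into map and reduce phases: the map phase emits a (row, partial product) pair per nonzero, and the reduce phase sums partials by row key.

```python
from collections import defaultdict

def mapreduce_matvec(entries, x):
    """Sparse matrix-vector multiply, MapReduce-style.

    entries: iterable of (row, col, value) nonzeros
    x: dict mapping col -> vector entry
    """
    # Map phase: each nonzero emits a keyed partial product.
    emitted = [(row, value * x[col]) for row, col, value in entries]

    # Shuffle + reduce phase: sum partial products by row key.
    result = defaultdict(float)
    for row, partial in emitted:
        result[row] += partial
    return dict(result)

# A 2x2 example: [[1, 2], [0, 3]] times [1, 1] gives [3, 3]
entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
print(mapreduce_matvec(entries, {0: 1.0, 1: 1.0}))  # {0: 3.0, 1: 3.0}
```

Power iteration is then just this multiply applied repeatedly, which is why the one-pass structure of MapReduce becomes the bottleneck discussed later.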

Page 7

Why Use a Data Flow Engine?
Ease of programming
» High-level functions instead of message passing

Wide deployment
» More common than MPI, especially "near" data

Scalability to the very largest clusters
» Even the HPC world is now concerned about resilience

Page 8

Outline
Data flow vs. traditional network programming
Limitations of MapReduce
Spark computing engine
Matrix operations on Spark

Page 9

Limitations of MapReduce
MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms
No efficient primitives for data sharing
» State between steps goes to the distributed file system
» Slow due to replication & disk storage
» No control of data partitioning across steps

Page 10

Example: Iterative Apps
[Diagram: each iteration reads its input from the file system and writes its output back; repeated queries each re-read the input from the file system]
Commonly spend 90% of time doing I/O

Page 11

Example: PageRank
Repeatedly multiply a sparse matrix and a vector
Requires repeatedly hashing together the page adjacency lists and the rank vector

Neighbors(id, edges)
Ranks(id, rank)
[Diagram: the same file is grouped over and over across iterations 1, 2, 3]

Page 12

Spark Programming Model
Extends MapReduce with primitives for efficient data sharing
» "Resilient distributed datasets"

Open source in Apache Incubator
» Growing community with 100+ contributors

APIs in Java, Scala & Python

Page 13

Resilient Distributed Datasets (RDDs)
Collections of objects stored across a cluster
User-controlled partitioning & storage (memory, disk, …)
Automatically rebuilt on failure

urls = spark.textFile("hdfs://...")
records = urls.map(lambda s: (s, 1))
counts = records.reduceByKey(lambda a, b: a + b)
bigCounts = counts.filter(lambda (url, cnt): cnt > 10)

[Diagram: input file → map → reduce → filter; the filter output is known to be hash-partitioned]

bigCounts.cache()

bigCounts.filter(lambda (k, v): "news" in k).count()

bigCounts.join(otherPartitionedRDD)
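For readers without a cluster, the same count-then-filter pipeline can be mimicked on one machine with ordinary Python collections (the sample data below is illustrative, not from the talk; `Counter` plays the role of map + reduceByKey):

```python
from collections import Counter

# Stand-in for the HDFS text file of URLs
urls = ["news.com/a", "news.com/a", "blog.org/x"] * 6

# map + reduceByKey: count occurrences of each URL
counts = Counter(urls)

# filter: keep only URLs seen more than 10 times
big_counts = {url: c for url, c in counts.items() if c > 10}
print(big_counts)  # {'news.com/a': 12}
```

The point of the RDD version is that `cache()` keeps `bigCounts` in cluster memory and the known hash partitioning lets the later `join` avoid a shuffle, which a local dict cannot demonstrate.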

Page 14

Performance
[Chart: time per iteration (s), Hadoop vs. Spark]
Logistic Regression: 110 s (Hadoop) vs. 0.96 s (Spark)
K-Means Clustering: 155 s (Hadoop) vs. 4.1 s (Spark)

Page 15

Outline
Data flow vs. traditional network programming
Limitations of MapReduce
Spark computing engine
Matrix operations on Spark

Page 16

PageRank
Using cache(), keep neighbors in RAM
Using partitioning, avoid repeated hashing

Neighbors(id, edges)
Ranks(id, rank)
[Diagram: partitionBy on Neighbors, followed by one join with Ranks per iteration]

Page 17

PageRank
Using cache(), keep neighbors in RAM
Using partitioning, avoid repeated hashing

Neighbors(id, edges)
Ranks(id, rank)
[Diagram: after partitionBy, each join runs on the same node as its partition of Neighbors]

Page 18

PageRank
Using cache(), keep neighbors in RAM
Using partitioning, avoid repeated hashing

Neighbors(id, edges)
Ranks(id, rank)
[Diagram: partitionBy precedes the first join, so the later joins reuse that partitioning]

Page 19

PageRank Code

# RDD of (id, neighbors) pairs
links = spark.textFile(...).map(parsePage) \
             .partitionBy(128).cache()

# RDD of (id, rank)
ranks = links.mapValues(lambda v: 1.0)

for i in range(ITERATIONS):
    ranks = links.join(ranks).flatMap(
        lambda (id, (links, rank)):
            [(d, rank / len(links)) for d in links]
    ).reduceByKey(lambda a, b: a + b)
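A single-machine sketch of the same loop may help readers trace it: plain Python dicts stand in for the RDDs, and the update is the slide's simplified formulation (no damping factor). The toy graph below is illustrative.

```python
def pagerank(links, iterations=10):
    """links: dict mapping page -> list of outgoing neighbors."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        # join + flatMap: each page sends rank/len(neighbors) to each neighbor
        for page, neighbors in links.items():
            share = ranks[page] / len(neighbors)
            for dest in neighbors:
                contribs[dest] += share
        # reduceByKey: the summed contributions become the new ranks
        ranks = contribs
    return ranks

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links, iterations=50)
print(round(sum(ranks.values()), 6))  # total rank is conserved: 3.0
```

In the Spark version, `partitionBy(128)` plus `cache()` means the `links` side of the join never moves across the network after the first iteration, which is exactly the saving the next slide measures.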

Page 20

PageRank Results
[Chart: time per iteration (s); Hadoop: 72, Basic Spark: 23, Spark + Controlled Partitioning]

Page 21

Alternating Least Squares

1. Start with random A1, B1
2. Solve for A2 to minimize ||R – A2 B1^T||
3. Solve for B2 to minimize ||R – A2 B2^T||
4. Repeat until convergence

[Diagram: R ≈ A B^T]
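Hedged sketch of the alternating idea: for a rank-1 factorization, each step has the closed-form least-squares solution a_i = Σ_j R_ij b_j / Σ_j b_j² (and symmetrically for b). The pure-Python toy below iterates that update; it is not the Spark implementation, just the numerical core.

```python
def als_rank1(R, iterations=30):
    """Factor matrix R (list of rows) as an outer product a * b^T."""
    n, m = len(R), len(R[0])
    b = [1.0] * m  # arbitrary nonzero starting factor
    a = [0.0] * n
    for _ in range(iterations):
        # Solve for a with b fixed: a_i = sum_j R_ij b_j / sum_j b_j^2
        bb = sum(v * v for v in b)
        a = [sum(R[i][j] * b[j] for j in range(m)) / bb for i in range(n)]
        # Solve for b with a fixed, symmetrically
        aa = sum(v * v for v in a)
        b = [sum(R[i][j] * a[i] for i in range(n)) / aa for j in range(m)]
    return a, b

# R here is exactly rank 1 ([1, 2]^T times [1, 3]), so ALS recovers it
R = [[1.0, 3.0], [2.0, 6.0]]
a, b = als_rank1(R)
assert all(abs(a[i] * b[j] - R[i][j]) < 1e-6 for i in range(2) for j in range(2))
```

In the real setting R is a sparse ratings matrix with higher rank and missing entries, and each row of A (or B) is a small independent least-squares solve, which is what makes the computation easy to distribute.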

Page 22

ALS on Spark

Cache 2 copies of R in memory, one partitioned by rows and one by columns
Keep A & B partitioned the corresponding way
Operate on blocks to lower communication

[Diagram: R ≈ A B^T, blocked by rows and columns]

Joint work with Joey Gonzales, Virginia Smith

Page 23

ALS Results

[Chart: total time (s); Mahout / Hadoop: 4208, Spark (Scala): 481, GraphLab (C++): 297]

Page 24

Benefit for Users
Same engine performs data extraction, model training and interactive queries

[Diagram: with separate engines, each stage (parse, train, query) does its own DFS read and DFS write; with Spark, a single DFS read feeds parse, train and query in memory]

Page 25

Other Projects on Spark
MLlib: built-in Spark library for ML
» Includes ALS, K-means||, various SGD-based algorithms
» Franklin, Gonzales et al. [MLOSS '13]

MLI: MATLAB-like language for writing apps
» Basic ALS in 35 lines of code
» Evan Sparks, Ameet Talwalkar et al. [ICDM '13]

Page 26

Spark Community
100+ developers, 25+ companies contributing;
most active development community after Hadoop

Page 27

Conclusion
Data flow engines are becoming an important platform for matrix algorithms
Spark offers a simple programming model that greatly speeds these up
More info: spark.incubator.apache.org