4/12/13
Distributed Machine Learning and Graph Processing with Sparse Matrices
Shivaram Venkataraman*, Erik Bodzsar#, Indrajit Roy+, Alvin AuYoung+, Rob Schreiber+
*UC Berkeley, #U Chicago, +HP Labs
Big Data, Complex Algorithms
PageRank (Dominant eigenvector)
Recommendations (Matrix factorization)
Anomaly detection (Top-K eigenvalues)
User Importance (Vertex Centrality)
Machine learning + Graph algorithms
Iterative Linear Algebra Operations
PageRank
[Figure: a web graph with pages P, Q, R, S; its Web Graph Adjacency Matrix (entries as extracted: 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 …); and the resulting Page Rank vector (0.035, 0.006, 0.008, 0.032, …)]
PageRank Using Matrices
Power method: computes the dominant eigenvector
M*p = p, where M = web graph matrix and p = PageRank vector
Iterate
R
Array-oriented programming environment
Millions of users, thousands of free packages
Popular among statisticians, bio-informatics communities
PageRank Using Matrices
Power method: computes the dominant eigenvector
M*p = p, where M = web graph matrix and p = PageRank vector
Simplified algorithm: repeat { p = M*p }
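The simplified loop above can be sketched in plain Python to make the iteration concrete. This is a minimal illustration, not the authors' code: the 3-page matrix is hypothetical, the fixed iteration count is an assumption, and the renormalization step is added here only to keep p a probability vector.

```python
def power_iteration(M, iterations=50):
    """Repeat p = M*p, renormalizing so the entries of p sum to 1."""
    n = len(M)
    p = [1.0 / n] * n                      # start from a uniform vector
    for _ in range(iterations):
        # One dense matrix-vector product: p_new[i] = sum_j M[i][j] * p[j]
        p = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
        total = sum(p)
        p = [x / total for x in p]         # assumption: renormalize each round
    return p


# Hypothetical 3-page graph. Column j holds the out-link weights of page j,
# so every column sums to 1.
M = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
p = power_iteration(M)
```

For this small matrix the loop settles on the vector that satisfies M*p = p, which is the dominant eigenvector the slide refers to.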
PageRank Using Matrices
Power method: computes the dominant eigenvector
M = web graph matrix, p = PageRank vector
[Figure: the web graph matrix multiplied by the PageRank vector yields the PageRank vector, which is split into partitions P1, P2, …, PN/s]
Large-Scale Processing Frameworks
Data-parallel frameworks – MapReduce/Dryad (2004)
– Process each record in parallel
– Use case: Computing sufficient statistics
Graph-centric frameworks – Pregel/GraphLab (2010)
– Process each vertex in parallel
– Use case: Graphical models
Array-based frameworks – MadLINQ (2012)
– Process blocks of array in parallel
– Challenges with sparse matrices
Challenge 1 – Communication
R is single-threaded
Share data through pipes/network
Time-inefficient (sending copies)
Space-inefficient (extra copies)
[Figure: one R process holds the data; the other R processes on Server 1 and Server 2 each hold a copy of the data, obtained through local copies and network copies]
Challenge 2 – Sparse Matrices
Sparse matrices → Communication overhead
[Figure: block density (normalized, log scale from 1 to 10000) versus block ID (1 to 91) for LiveJournal, Netflix, and ClueWeb-1B]
Presto
Framework for large-scale iterative linear algebra
Extend R for scalability
Outline
• Motivation
• Programming model
• Design
• Applications and Results
darray
foreach
f (x)
PageRank Using Presto
M <- darray(dim=c(N,N), blocks=c(s,N))
P <- darray(dim=c(N,1), blocks=c(s,1))
while(..){
  foreach(i, 1:len,
    calculate(p=splits(P,i), m=splits(M,i),
              x=splits(P_old), z=splits(Z,i)) {
      p <- (m*x)+z
    }
  )
  P_old <- P
}
Callout: Create Distributed Array
[Figure: M (N × N) split into row blocks of s rows each; P_old, Z, and P each split into partitions P1, P2, …, PN/s]
PageRank Using Presto
M <- darray(dim=c(N,N), blocks=c(s,N))
P <- darray(dim=c(N,1), blocks=c(s,1))
while(..){
  foreach(i, 1:len,
    calculate(p=splits(P,i), m=splits(M,i),
              x=splits(P_old), z=splits(Z,i)) {
      p <- (m*x)+z
    }
  )
  P_old <- P
}
Callouts: Execute function in a cluster; Pass array partitions
[Figure: M (N × N) split into row blocks of s rows each; P_old, Z, and P each split into partitions P1, P2, …, PN/s]
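The partitioned update on this slide, where each task receives one row block of M, the whole of P_old, and one block of Z, can be mimicked on a single machine. The sketch below is plain Python rather than Presto: `split_rows`, `matvec`, and `pagerank_step` are hypothetical helper names, the 4 × 4 values are made up, and an ordinary loop stands in for the distributed `foreach`.

```python
def split_rows(rows, s):
    """Cut a list of rows (or scalars) into consecutive blocks of s items."""
    return [rows[k:k + s] for k in range(0, len(rows), s)]


def matvec(rows, x):
    """Multiply a block of matrix rows by the full vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def pagerank_step(M_blocks, P_old, Z_blocks):
    """Compute p_i = (m_i * x) + z_i for every partition i."""
    P_blocks = []
    for m, z in zip(M_blocks, Z_blocks):   # stands in for foreach over splits
        mx = matvec(m, P_old)              # needs all of P_old, one block of M
        P_blocks.append([v + zi for v, zi in zip(mx, z)])
    return P_blocks


# Hypothetical data: N = 4, s = 2, so there are N/s = 2 partitions.
M = [
    [0.0, 0.5, 0.0, 1.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.5, 0.0, 1.0, 0.0],
]
P_old = [0.25, 0.25, 0.25, 0.25]
Z = [0.125, 0.125, 0.125, 0.125]

P_blocks = pagerank_step(split_rows(M, 2), P_old, split_rows(Z, 2))
```

Each output partition depends only on its own block of M and Z, which is why the blocks can be computed independently; P_old is the one input every task needs in full.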
Breadth-first Search Using Matrices
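G = adjacency matrix, X = BFS vector
G:
    A B C D E
A   1 1 1 0 0
B   0 1 0 1 0
C   0 1 1 0 0
D   0 0 0 1 1
E   0 0 0 0 1
X (entries for A B C D E):
start               1 0 0 0 0
after iteration 1   * * * 0 0
after iteration 2   * * * * 0
after iteration 3   * * * * *
(* marks a nonzero entry, i.e. a vertex reached so far)
[Figure: the graph on vertices A, B, C, D, E, redrawn at each step]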
Simplified algorithm: repeat { X = G*X }
Matrix operations
Easy to express
Efficient to implement
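The BFS example can be replayed with the adjacency matrix from the slide. This is a sketch in plain Python, with `bfs_step` and `bfs_history` as hypothetical names. One assumption is worth stating: with rows of G read as source vertices, the update that reproduces the slide's sequence is the row-vector form (X times G, evaluated over booleans), whereas the slide writes the product simply as G*X.

```python
# Adjacency matrix from the slide, vertices ordered A, B, C, D, E.
# Row i lists the vertices reachable from vertex i in one step.
G = [
    [1, 1, 1, 0, 0],   # A
    [0, 1, 0, 1, 0],   # B
    [0, 1, 1, 0, 0],   # C
    [0, 0, 0, 1, 1],   # D
    [0, 0, 0, 0, 1],   # E
]


def bfs_step(G, X):
    """Boolean vector-matrix product: j is reached if some reached i links to j."""
    n = len(G)
    return [1 if any(X[i] and G[i][j] for i in range(n)) else 0
            for j in range(n)]


def bfs_history(G, start):
    """Repeat the step until the reached set stops growing."""
    X = [1 if v == start else 0 for v in range(len(G))]
    history = [X]
    while True:
        nxt = bfs_step(G, X)
        if nxt == X:                # fixed point: nothing new was reached
            return history
        history.append(nxt)
        X = nxt


history = bfs_history(G, 0)         # start from vertex A
```

The diagonal ones in G keep already-reached vertices set from one step to the next, so X accumulates the visited set instead of holding only the current frontier.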
Outline
• Motivation
• Programming model
• Design
• Applications and Results
Presto Architecture
[Figure: a Master and two Workers; each Worker has DRAM and an executor pool of R instances]
Dynamic Partitioning of Matrices
Profile execution
Partition
Size Invariants
invariant(Mat,vec)
Outline
• Motivation
• Programming model
• Design
• Applications and Results
demo
lj_matrix <- darray(dim=c(n,n), blocks=c(s,n))   # row blocks of s rows, aligned with out_vector
in_vector <- darray(dim=c(n,1), blocks=c(s,1), data=1/n)
out_vector <- darray(dim=c(n,1), blocks=c(s,1))
foreach(i, 1:length(splits(lj_matrix)),
  function(g = splits(lj_matrix, i),
           i = splits(in_vector),
           o = splits(out_vector, i)) {
    o <- g %*% i   # one row block times the whole input vector
    update(o)      # publish the modified output partition
  })
Examples
dotprod <- function(a,b) {
  # one scalar slot per partition of a
  tmp <- darray(dim=a@dim/a@blocks,c(1,1))
  foreach(i, 1:length(splits(a)),
    mult <- function(tmp = splits(tmp,i), a = splits(a,i), b = splits(b,i)) {
      tmp <- sum(a * b)   # partial dot product of partition i
      update(tmp)
    }, progress=FALSE)
  # sum the partial results held in tmp
  return(sum(getpartition(tmp)))
}
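The same two-phase pattern, per-partition partial sums followed by a final sum, can be sketched without Presto. This is plain Python: `split` and `dotprod` are hypothetical stand-ins, and a list plays the role of the `tmp` distributed array.

```python
def split(v, s):
    """Cut a vector into consecutive partitions of s elements."""
    return [v[k:k + s] for k in range(0, len(v), s)]


def dotprod(a_parts, b_parts):
    """Dot product of two vectors that are partitioned the same way."""
    tmp = [0] * len(a_parts)                 # one slot per partition
    for i, (a, b) in enumerate(zip(a_parts, b_parts)):
        # Phase 1: each partition computes its own partial sum independently.
        tmp[i] = sum(x * y for x, y in zip(a, b))
    return sum(tmp)                          # Phase 2: combine the partials


a = [1, 2, 3, 4, 5, 6]
b = [6, 5, 4, 3, 2, 1]
result = dotprod(split(a, 2), split(b, 2))
```

The approach relies on both vectors being split at the same boundaries, so that partition i of one lines up element for element with partition i of the other.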
Examples
reduce <- function(f,d) {
  # tree reduction: combine partitions pairwise, doubling the stride each round
  i <- 1
  n <- length(splits(d))
  repeat {
    step <- 2*i
    reducers <- floor((n-i-1)/step)+1
    foreach(j, 1:reducers,
      reduce.pair <- function(s1 = splits(d, (j-1)*step+1),
                              s2 = splits(d, (j-1)*step+1+i),
                              fun = f) {
        s1 <- fun(s1,s2)
        update(s1)
      }, progress=FALSE)
    i <- i*2
    if (1+i > n) { break }
  }
}
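The index arithmetic in `reduce` is easier to follow when run on a small list. The sketch below ports it to plain Python under a few assumptions: partitions are ordinary list items, indices are 0-based, and the inner loop is sequential where Presto would run the pairs of one round in parallel. The name `tree_reduce` is hypothetical.

```python
def tree_reduce(f, parts):
    """Combine partitions pairwise, doubling the stride i on every round."""
    parts = list(parts)                      # work on a copy
    n = len(parts)
    i = 1
    while i < n:
        step = 2 * i
        reducers = (n - i - 1) // step + 1   # pairs that exist in this round
        for j in range(1, reducers + 1):
            left = (j - 1) * step            # 0-based index of s1
            right = left + i                 # 0-based index of s2
            parts[left] = f(parts[left], parts[right])
        i *= 2
    return parts[0]                          # the result ends up in slot 0


total = tree_reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])
joined = tree_reduce(lambda x, y: x + y, ["a", "b", "c", "d", "e"])
```

With five partitions the rounds combine slots (0,1) and (2,3), then (0,2), then (0,4), so the number of rounds grows with log2(n) rather than with n.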
Applications Implemented in Presto
| Application | Algorithm | Presto LOC |
| --- | --- | --- |
| PageRank | Eigenvector calculation | 41 |
| Triangle counting | Top-K eigenvalues | 121 |
| Netflix recommendation | Matrix factorization | 130 |
| Centrality measure | Graph algorithm | 132 |
| k-path connectivity | Graph algorithm | 30 |
| k-means | Clustering | 71 |
| Sequence alignment | Smith-Waterman | 64 |

Fewer than 140 lines of code
Repartitioning Progress
[Figure: split size (GB, axis from 0 to 30) versus iteration count (1 to 15)]
Repartitioning benefits
[Figure: two panels, No Repartition and Repartition; each shows Transfer and Compute per worker on an axis from 0 to 160]
Presto
Versioning Distributed Arrays
Co-partitioning matrices
Locality-based scheduling
Caching partitions
Conclusion
Linear Algebra is a powerful abstraction
Easily express machine learning, graph algorithms
Challenges: Sparse matrices, Data sharing
Presto – prototype extends R
Blockus
• Expressive distributed computing systems are in-memory
• Being in-memory is problematic for (very) big data
– Expensive
– Fault tolerance problems
• Scale Presto vertically
• Eliminate memory limitation
Vertical scaling
• Use SSDs
– Low latency
– Fast small I/O
– Parallel I/O
• SSDs still significantly slower than memory
• Need to do better than OS swap!
[Figure: Read speed with 4 threads; read speed (MB/s, axis from 0 to 600) versus block size (16k to 16384k) for SSD and HDD]
Opportunities from Vertical Scaling
• Enable big data analytics on small systems
– Laptop!
– Small cluster
• Energy savings for extreme scale systems
• Reduced cost, increased fault tolerance
Related work: OS paging/buffer cache
• General purpose
• (almost) no application knowledge
• LRU caching, (conservative) read-ahead
• Reactive (do I/O on pagefault)
Related work: SSDAlloc
• C library, replace malloc with malloc_object
• Objects are stored on SSD
• Memory is used as a cache
• Advantage over OS paging: object-level caching
• Useful for web servers, etc.
Related work: GraphChi
• Vertex-level programming
• Iterative, update vertex neighborhoods in each iteration, for each vertex
• GraphChi: I/O optimized execution engine
• Make sure I/O is sequential (essential for HDDs, works for SSDs too)
Blockus Idea
• Use some form of application knowledge to optimize I/O
• Know future computation → prefetching
– From programmer hints, static analysis, history, etc.
• Know about parallelism → reorder computation to decrease I/O
• Block usage history: better caching (e.g. always keep popular parts of a graph in memory)
• Deeper application knowledge → reorganize data, change computation, etc.
Blockus architecture
[Figure: a Master, a Scheduler, and two Workers; each Worker has an executor pool of R instances, DRAM, SSDs, HDDs, and an I/O Engine]
• Worker I/O engine: executes all I/O operations
• Scheduler: performs I/O and task scheduling
Scheduler challenges
• Presto scheduler
– Assumes everything fits in DRAM
– Schedules each task on the worker which has the most bytes of its input arrays
– Transfers non-local input data greedily (no network scheduling)
• Blockus: better scheduling policies
– Load balancing (in memory)
– Intelligent Prefetching
– Intelligent computation prioritization
– Adaptive Caching
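The Presto placement rule listed above, which runs each task on the worker that already holds the most bytes of its inputs, can be written down in a few lines. This is an illustrative sketch only: `pick_worker`, the partition names, the worker names, and the sizes are all hypothetical, and ties are broken by list order as an assumption.

```python
def pick_worker(task_inputs, placement, sizes, workers):
    """Return the worker holding the most bytes of the task's input partitions."""
    local_bytes = {w: 0 for w in workers}
    for part in task_inputs:
        # Credit the partition's size to the worker that currently stores it.
        local_bytes[placement[part]] += sizes[part]
    # max() keeps the first maximum, so ties go to the earlier worker in the list.
    return max(workers, key=lambda w: local_bytes[w])


workers = ["w1", "w2"]
placement = {"M1": "w1", "Z1": "w1", "M2": "w2", "Z2": "w2", "P_old": "w2"}
sizes = {"M1": 800, "Z1": 50, "M2": 600, "Z2": 50, "P_old": 100}

first = pick_worker(["M1", "P_old", "Z1"], placement, sizes, workers)
second = pick_worker(["M2", "P_old", "Z2"], placement, sizes, workers)
```

The rule looks only at where data already sits. It does not consider memory pressure or how loaded a worker is, which is the gap the Blockus policies listed above aim at.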
Blockus task scheduling policies
• Reorder computation based on memory contents to minimize I/O
• Different reordering policies
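One possible reordering policy, offered as an assumption since the slides do not spell any policy out, is to run first the tasks whose inputs are already in memory. In the sketch below, `missing_bytes`, `reorder`, the task names, and the block sizes are hypothetical, and the ordering is computed once without modelling evictions or newly loaded blocks.

```python
def missing_bytes(task_inputs, in_memory, sizes):
    """Bytes that would have to be read from SSD/HDD before the task can run."""
    return sum(sizes[b] for b in task_inputs if b not in in_memory)


def reorder(tasks, in_memory, sizes):
    """Order task names by how much I/O they need, cheapest first."""
    return sorted(tasks,
                  key=lambda name: missing_bytes(tasks[name], in_memory, sizes))


sizes = {"A": 4, "B": 4, "C": 2, "D": 8}     # hypothetical block sizes
in_memory = {"A", "C"}                       # blocks currently cached in DRAM
tasks = {
    "t1": ["B", "D"],
    "t2": ["A", "C"],
    "t3": ["A", "B"],
}
order = reorder(tasks, in_memory, sizes)
```

A fuller policy would update the in-memory set as tasks run, since loading a block for one task can make a later task cheaper.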
Project ideas
• Fault tolerance by keeping track of lineage
• Smart communication (e.g. pipelines, broadcasting)
• Static analysis of programs to better understand dependencies
– Better garbage collection
– Better asynchronicity/reordering
• Recursive parallelism
• Matrix reordering
– To improve caching for out-of-core
– Load balancing for distributed system
• Distributed out-of-core computation
• Heterogeneous storage (different SSDs, HDDs, etc.)