Yucheng Low
Transcript of Yucheng Low
![Page 1: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/1.jpg)
Carnegie Mellon
A Framework for Machine Learning and Data Mining in the Cloud
Yucheng Low
Aapo Kyrola
Danny Bickson
Joseph Gonzalez
Carlos Guestrin
Joe Hellerstein
![Page 2: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/2.jpg)
Big Data is Everywhere
YouTube: 72 hours of video uploaded every minute
28 Million Wikipedia Pages
900 Million Facebook Users
6 Billion Flickr Photos
2
“… data a new class of economic asset, like currency or gold.”
“…growing at 50 percent a year…”
![Page 3: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/3.jpg)
How will we design and implement Big Learning systems?
Big Learning
3
![Page 4: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/4.jpg)
Shift Towards Use Of Parallelism in ML
GPUs Multicore Clusters Clouds Supercomputers
ML experts repeatedly solve the same parallel design challenges:
Race conditions, distributed state, communication…
Resulting code is very specialized: difficult to maintain, extend, debug…
Graduate students
Avoid these problems by using high-level abstractions
4
![Page 5: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/5.jpg)
CPU 1 CPU 2 CPU 3 CPU 4
MapReduce – Map Phase
5
![Page 6: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/6.jpg)
CPU 1 CPU 2 CPU 3 CPU 4
MapReduce – Map Phase
6
Embarrassingly parallel: each CPU performs an independent computation. No communication needed.
[Figure: four CPUs each emit an independent result, e.g. 12.9, 42.3, 21.3, 25.8]
![Page 7: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/7.jpg)
CPU 1 CPU 2 CPU 3 CPU 4
MapReduce – Map Phase
7
[Figure: each CPU emits a second independent result]
Embarrassingly parallel independent computation; no communication needed.
![Page 8: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/8.jpg)
CPU 1 CPU 2 CPU 3 CPU 4
MapReduce – Map Phase
8
[Figure: each CPU emits a third independent result]
Embarrassingly parallel independent computation; no communication needed.
![Page 9: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/9.jpg)
CPU 1 CPU 2
MapReduce – Reduce Phase
9
[Figure: the reduce phase combines the mapped values from all CPUs into aggregate statistics]
Example: from image features of faces labeled Attractive (A) or Ugly (U), compute "Attractive Face Statistics" and "Ugly Face Statistics".
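The picture above is the classic map/reduce pattern. As a concrete illustration, here is a minimal Python sketch (the toy data, labels, and the stand-in feature extractor are invented for this example, not taken from the slides): mappers process images independently and in parallel, and a reducer aggregates per-label statistics.

```python
from collections import defaultdict
from statistics import mean

# Toy "images": (label, raw pixel value). 'A' = attractive, 'U' = ugly.
# Data and the feature extractor are made up purely for illustration.
images = [('A', 42.3), ('U', 12.9), ('A', 84.3), ('U', 21.3),
          ('A', 67.5), ('U', 25.8), ('A', 84.4), ('U', 14.9)]

def map_phase(image):
    """Embarrassingly parallel: each image is processed independently."""
    label, pixels = image
    feature = pixels * 0.5          # stand-in for real feature extraction
    return label, feature

def reduce_phase(pairs):
    """Group the mapped values by label and compute per-label statistics."""
    groups = defaultdict(list)
    for label, feature in pairs:
        groups[label].append(feature)
    return {label: mean(vals) for label, vals in groups.items()}

mapped = [map_phase(img) for img in images]   # could run on many CPUs at once
print(reduce_phase(mapped))                   # {'A': ..., 'U': ...}
```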
![Page 10: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/10.jpg)
MapReduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel (MapReduce):
- Cross Validation
- Feature Extraction
- Computing Sufficient Statistics

Graph-Parallel:
- Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
- Semi-Supervised Learning: Label Propagation, CoEM
- Graph Analysis: PageRank, Triangle Counting
- Collaborative Filtering: Tensor Factorization

Is there more to Machine Learning?
![Page 11: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/11.jpg)
Carnegie Mellon
Exploit Dependencies
![Page 12: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/12.jpg)
12
[Figure: a social network in which one user's friends mostly like Hockey and another's mostly like Scuba Diving; a user connected to both groups is labeled Underwater Hockey]
![Page 13: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/13.jpg)
Graphs are Everywhere
Netflix: Collaborative Filtering (Users × Movies)
Wiki: Text Analysis (Docs × Words)
Social Network: Probabilistic Analysis
13
![Page 14: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/14.jpg)
Properties of Computation on Graphs
Dependency Graph
Iterative Computation
Local Updates (my interests depend on my friends' interests)
![Page 15: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/15.jpg)
ML Tasks Beyond Data-Parallelism
Data-Parallel (MapReduce):
- Cross Validation
- Feature Extraction
- Computing Sufficient Statistics

Graph-Parallel:
- Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
- Semi-Supervised Learning: Label Propagation, CoEM
- Graph Analysis: PageRank, Triangle Counting
- Collaborative Filtering: Tensor Factorization
15
![Page 16: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/16.jpg)
2010: Shared Memory
Algorithms implemented: Bayesian Tensor Factorization, Gibbs Sampling, Matrix Factorization, Lasso, SVM, Belief Propagation, PageRank, CoEM, SVD, LDA, Splash Sampler, Alternating Least Squares, Linear Solvers, …many others…
![Page 17: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/17.jpg)
22
Limited CPU Power. Limited Memory. Limited Scalability.
![Page 18: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/18.jpg)
Distributed Cloud
- Distributing State
- Data Consistency
- Fault Tolerance
23
Unlimited amount of computation resources! (up to funding limitations)
![Page 19: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/19.jpg)
The GraphLab Framework
Consistency Model
Graph-Based Data Representation
Update Functions (User Computation)
24
![Page 20: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/20.jpg)
Data Graph: data associated with vertices and edges
Vertex Data:
- User profile
- Current interest estimates
Edge Data:
- Relationship (friend, classmate, relative)
Graph:
- Social Network
25
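One way to picture the data graph for this social-network example is the sketch below (Python; the type and field names are illustrative assumptions, not GraphLab's actual data structures): user data lives on vertices, relationship data lives on edges, and the graph ties them together.

```python
from dataclasses import dataclass, field

@dataclass
class VertexData:
    profile: str = ""                                # user profile
    interests: dict = field(default_factory=dict)    # current interest estimates

@dataclass
class EdgeData:
    relationship: str = "friend"                     # friend, classmate, relative, ...

# The data graph: adjacency plus the data attached to vertices and edges.
vertices = {0: VertexData("alice"), 1: VertexData("bob")}
edges = {(0, 1): EdgeData("classmate")}
```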
![Page 21: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/21.jpg)
Distributed Graph
Partition the graph across multiple machines.
26
![Page 22: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/22.jpg)
Distributed Graph
Ghost vertices maintain adjacency structure and replicate remote data.
“ghost” vertices
27
![Page 23: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/23.jpg)
Distributed Graph
Cut efficiently using HPC Graph partitioning tools (ParMetis / Scotch / …)
“ghost” vertices
28
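A rough sketch of the partition-with-ghosts idea under stated assumptions (a given vertex-to-machine assignment and plain Python dictionaries instead of the real distributed data structures): each machine keeps the vertices it owns, the edges incident to them, and ghost copies of remote endpoints so local adjacency stays complete.

```python
def partition_with_ghosts(edges, owner):
    """owner maps vertex -> machine id. Returns, per machine, the vertices it
    owns, the edges it stores, and the remote endpoints it replicates as ghosts."""
    machines = {}
    for u, v in edges:
        for m in {owner[u], owner[v]}:
            part = machines.setdefault(m, {"owned": set(), "edges": [], "ghosts": set()})
            part["edges"].append((u, v))
            for x in (u, v):
                if owner[x] == m:
                    part["owned"].add(x)
                else:
                    part["ghosts"].add(x)   # ghost: read-only replica of a remote vertex
    return machines

edges = [(0, 1), (1, 2), (2, 3)]
owner = {0: "m1", 1: "m1", 2: "m2", 3: "m2"}
print(partition_with_ghosts(edges, owner))
```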
![Page 24: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/24.jpg)
The GraphLab Framework
Consistency Model
Graph-Based Data Representation
Update Functions (User Computation)
29
![Page 25: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/25.jpg)
Update Functions
User-defined program: applied to a vertex, transforms the data in the scope of that vertex.

Pagerank(scope) {
  // Update the current vertex data
  // Reschedule neighbors if needed
  if vertex.PageRank changes then reschedule_all_neighbors;
}
Dynamic computation
Update function applied (asynchronously) in parallel until convergence
Many schedulers available to prioritize computation
30
Why Dynamic?
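To make the dynamic, update-function style concrete, here is a minimal sequential Python sketch (this is not GraphLab's C++ API; the damping constant, tolerance, and toy graph are illustrative): a vertex recomputes its PageRank from its in-neighbours and reschedules its out-neighbours only when its own value changed noticeably, so work concentrates where it is still needed.

```python
from collections import deque

ALPHA, TOL = 0.15, 1e-4

def pagerank_update(v, rank, in_nbrs, out_nbrs, scheduler, scheduled):
    """Update one vertex; reschedule neighbours only if the value moved."""
    new_rank = ALPHA + (1 - ALPHA) * sum(rank[u] for u in in_nbrs[v])
    changed = abs(new_rank - rank[v]) > TOL
    rank[v] = new_rank
    if changed:                        # dynamic computation: focus work where needed
        for u in out_nbrs[v]:
            if u not in scheduled:
                scheduler.append(u)
                scheduled.add(u)

# Tiny example graph: 0 -> 1 -> 2 -> 0
out_nbrs = {0: [1], 1: [2], 2: [0]}
in_nbrs = {0: [2], 1: [0], 2: [1]}
rank = {0: 1.0, 1: 0.0, 2: 0.0}
scheduler, scheduled = deque(out_nbrs), set(out_nbrs)
while scheduler:                       # process until the scheduler is empty
    v = scheduler.popleft()
    scheduled.discard(v)
    pagerank_update(v, rank, in_nbrs, out_nbrs, scheduler, scheduled)
print(rank)
```

The loop drains the scheduler once no rank changes by more than TOL, which is the "process repeats until the scheduler is empty" behaviour described on the later scheduling slide.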
![Page 26: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/26.jpg)
Asynchronous Belief Propagation
[Figure: cumulative vertex updates across the graphical model; many updates in some regions, few updates in others]
Algorithm identifies and focuses on hidden sequential structure.
Graphical Model
Challenge = Boundaries
32
![Page 27: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/27.jpg)
33
Shared Memory Dynamic Schedule
[Figure: a shared-memory dynamic schedule; vertices a-k form the graph, and CPU 1 and CPU 2 pull scheduled vertices (a, h, b, i, ...) from a single shared Scheduler queue]
Process repeats until the scheduler is empty.
![Page 28: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/28.jpg)
Distributed Scheduling
[Figure: the graph is partitioned across machines, and each machine keeps its own scheduler queue over its local vertices]
Each machine maintains a schedule over the vertices it owns.
Distributed consensus is used to identify completion.
![Page 29: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/29.jpg)
Ensuring Race-Free Code
How much can computation overlap?
35
![Page 30: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/30.jpg)
The GraphLab Framework
Consistency Model
Graph-Based Data Representation
Update Functions (User Computation)
37
![Page 31: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/31.jpg)
PageRank Revisited
38
Pagerank(scope) {
…}
![Page 32: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/32.jpg)
Racing PageRank
This was actually encountered in user code.
39
![Page 33: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/33.jpg)
Bugs
40
Pagerank(scope) {
…}
![Page 34: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/34.jpg)
Bugs
41
Pagerank(scope) {
…}
[Fix shown on the slide: the update computes the new value into a temporary variable tmp and writes the vertex data once.]
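The slides only show that a temporary (tmp) is involved, so the following is a hedged reconstruction of the bug and the repair in plain Python: the buggy update overwrites the vertex value before the accumulation finishes, so a concurrently running neighbour can read a half-built rank; the fixed version accumulates into a local temporary and commits the final value with a single write.

```python
# Buggy version: vertex rank is both read by neighbours and written here,
# and it is overwritten *before* the accumulation is finished, so a
# concurrently running neighbour update can read a half-built value.
def pagerank_update_buggy(v, rank, in_nbrs, alpha=0.15):
    rank[v] = alpha                       # partial value is now visible
    for u in in_nbrs[v]:
        rank[v] += (1 - alpha) * rank[u]  # other CPUs may read mid-loop

# Fixed version: accumulate into a local temporary and commit once.
def pagerank_update_fixed(v, rank, in_nbrs, alpha=0.15):
    tmp = alpha
    for u in in_nbrs[v]:
        tmp += (1 - alpha) * rank[u]
    rank[v] = tmp                         # single write of the final value
```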
![Page 35: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/35.jpg)
Throughput != Performance
No consistency gives higher throughput (#updates/sec) but potentially slower convergence of the ML algorithm.
42
![Page 36: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/36.jpg)
Racing Collaborative Filtering
43
[Plot: RMSE (log scale) versus time for d=20 with consistency and "Racing" d=20 without consistency]
![Page 37: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/37.jpg)
Serializability
44
For every parallel execution, there exists a sequential execution of update functions which produces the same result.
[Figure: updates interleaved on CPU 1 and CPU 2 in a parallel execution correspond to some sequential ordering on a single CPU over time]
![Page 38: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/38.jpg)
Serializability Example
45
[Figure: two overlapping update-function scopes; inner regions are written, overlapping regions are only read]
Edge Consistency: update functions one vertex apart can be run in parallel, because overlapping regions are only read.
Stronger / weaker consistency levels are available.
User-tunable consistency levels trade off parallelism and consistency.
![Page 39: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/39.jpg)
Distributed Consistency
Solution 1: Graph Coloring
Solution 2: Distributed Locking
![Page 40: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/40.jpg)
Edge Consistency via Graph Coloring
Vertices of the same color are all at least one vertex apart. Therefore, all vertices of the same color can be run in parallel!
48
![Page 41: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/41.jpg)
Chromatic Distributed Engine
Over time, every machine executes tasks on all vertices of color 0, followed by ghost synchronization completion and a barrier; then all vertices of color 1, followed by another ghost synchronization and barrier; and so on.
49
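A minimal sketch of chromatic execution under stated assumptions (shared-memory threads stand in for machines, and forcing each pool.map to finish stands in for the ghost-synchronization-plus-barrier step): vertices of one color share no edges, so they can be updated in parallel, one color at a time.

```python
from concurrent.futures import ThreadPoolExecutor

def chromatic_engine(colors, update, num_workers=4):
    """colors: dict vertex -> color. Runs update(v) for every vertex, one color
    at a time; same-color vertices touch disjoint scopes, so they run in parallel."""
    by_color = {}
    for v, c in colors.items():
        by_color.setdefault(c, []).append(v)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for c in sorted(by_color):
            list(pool.map(update, by_color[c]))   # implicit barrier per color
            # (a distributed engine would also synchronize ghosts here)

# Example: a 4-cycle 0-1-2-3-0 is 2-colorable.
colors = {0: 0, 1: 1, 2: 0, 3: 1}
chromatic_engine(colors, lambda v: print("update", v))
```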
![Page 42: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/42.jpg)
Matrix Factorization: Netflix Collaborative Filtering
Alternating Least Squares Matrix Factorization
Model: 0.5 million nodes, 99 million edges
[Figure: the Netflix problem as a bipartite Users-Movies graph, factorized with latent dimension d]
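For reference, a small dense NumPy sketch of alternating least squares on a toy user/movie graph (the dimensions, regularization, and ratings are illustrative; this is not the GraphLab Netflix toolkit code): each user vertex solves a regularized least-squares problem against the movies it rated, then each movie vertex does the same against its users.

```python
import numpy as np

def als(ratings, d=5, n_iters=10, reg=0.1):
    """ratings: dict (user, movie) -> rating. Returns latent factor vectors."""
    users = sorted({u for u, _ in ratings})
    movies = sorted({m for _, m in ratings})
    U = {u: np.random.rand(d) for u in users}
    M = {m: np.random.rand(d) for m in movies}
    for _ in range(n_iters):
        for u in users:                       # update each user vertex
            nbrs = [(m, r) for (uu, m), r in ratings.items() if uu == u]
            A = sum(np.outer(M[m], M[m]) for m, _ in nbrs) + reg * np.eye(d)
            b = sum(r * M[m] for m, r in nbrs)
            U[u] = np.linalg.solve(A, b)
        for m in movies:                      # update each movie vertex
            nbrs = [(u, r) for (u, mm), r in ratings.items() if mm == m]
            A = sum(np.outer(U[u], U[u]) for u, _ in nbrs) + reg * np.eye(d)
            b = sum(r * U[u] for u, r in nbrs)
            M[m] = np.linalg.solve(A, b)
    return U, M

ratings = {("alice", "matrix"): 5, ("alice", "titanic"): 1, ("bob", "matrix"): 4}
U, M = als(ratings, d=3)
print(round(float(U["alice"] @ M["matrix"]), 2))   # predicted rating
```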
![Page 43: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/43.jpg)
51
Netflix Collaborative Filtering
[Plots: speedup versus # machines for D=100 and D=20 compared against Ideal scaling; runtime versus # machines for Hadoop, MPI, and GraphLab]
![Page 44: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/44.jpg)
The Cost of Hadoop
52
![Page 45: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/45.jpg)
CoEM (Rosie Jones, 2005): Named Entity Recognition Task
Is "Cat" an animal? Is "Istanbul" a place?
[Figure: a bipartite graph linking noun phrases ("the cat", "Australia", "Istanbul") to contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant")]
Vertices: 2 Million; Edges: 200 Million

| System | Resources | Runtime |
|---|---|---|
| Hadoop | 95 Cores | 7.5 hrs |
| GraphLab | 16 Cores | 30 min |
| Distributed GraphLab | 32 EC2 Nodes | 80 secs |

0.3% of Hadoop time
![Page 46: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/46.jpg)
Problems
- Requires a graph coloring to be available.
- Frequent barriers make it extremely inefficient for highly dynamic systems where only a small number of vertices are active in each round.
55
![Page 47: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/47.jpg)
Distributed Consistency
Solution 1: Graph Coloring
Solution 2: Distributed Locking
![Page 48: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/48.jpg)
Distributed Locking
Edge consistency can be guaranteed through locking: each vertex carries an RW lock.
57
![Page 49: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/49.jpg)
Consistency Through Locking
Acquire a write-lock on the center vertex and read-locks on adjacent vertices.
58
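A shared-memory sketch of that locking rule (plain Python mutexes stand in for the reader/writer locks, and acquiring locks in vertex-id order is an added assumption to avoid deadlock): an update grabs the locks covering its scope, runs, then releases them.

```python
import threading

class VertexLocks:
    """One lock per vertex; plain mutexes stand in for RW locks here.
    Locks are always acquired in vertex-id order to avoid deadlock."""
    def __init__(self, vertices):
        self.locks = {v: threading.Lock() for v in vertices}

    def lock_scope(self, center, neighbors):
        scope = sorted({center, *neighbors})
        for v in scope:
            self.locks[v].acquire()
        return scope

    def unlock_scope(self, scope):
        for v in reversed(scope):
            self.locks[v].release()

def run_update(locks, center, neighbors, update):
    scope = locks.lock_scope(center, neighbors)   # write on center, read on neighbors
    try:
        update(center, neighbors)
    finally:
        locks.unlock_scope(scope)

locks = VertexLocks(range(5))
run_update(locks, 2, [1, 3], lambda c, ns: print("updating", c, "reading", ns))
```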
![Page 50: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/50.jpg)
Consistency Through Locking
- Multicore setting: PThread RW-locks.
- Distributed setting: distributed locks. Challenge: latency.
Solution: Pipelining
[Figure: vertices A, B, C, D partitioned across Machine 1 and Machine 2; acquiring locks on remote vertices incurs network latency]
59
![Page 51: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/51.jpg)
No Pipelining
[Timeline: lock scope 1, process request 1, scope 1 acquired, update_function 1, release scope 1, process release 1; each step waits for the previous one to finish]
60
![Page 52: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/52.jpg)
Pipelining / Latency Hiding
Hide latency using pipelining.
[Timeline: requests for scopes 1, 2, and 3 are issued back-to-back; while requests 2 and 3 are still being processed, scope 1 is acquired, update_function 1 runs and scope 1 is released, then scope 2, and so on]
61
![Page 53: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/53.jpg)
Latency Hiding
Hide latency using request buffering.
[Timeline: as above, several scope requests are kept in flight at once]

Residual BP on a 190K vertex, 560K edge graph, 4 machines:
- No pipelining: 472 s
- Pipelining: 10 s
- 47x speedup
62
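One way to picture those pipelining numbers is the toy simulation below (a sketch, assuming an asynchronous lock service with a fixed round-trip latency; it is not GraphLab's engine): instead of blocking on one lock request at a time, several scope requests are kept in flight, so lock latency overlaps with useful work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def acquire_scope(scope_id, latency=0.1):
    """Pretend to ask a remote machine for the locks guarding one scope."""
    time.sleep(latency)            # simulated network round trip
    return scope_id

def update_function(scope_id):
    time.sleep(0.01)               # a tiny amount of real work

def run(scopes, pipeline_depth):
    start = time.time()
    with ThreadPoolExecutor(max_workers=pipeline_depth) as pool:
        pending = [pool.submit(acquire_scope, s) for s in scopes]
        for granted in as_completed(pending):   # run whichever scope is ready
            update_function(granted.result())
    return time.time() - start

scopes = list(range(20))
print("no pipelining :", round(run(scopes, pipeline_depth=1), 2), "s")
print("pipelining    :", round(run(scopes, pipeline_depth=10), 2), "s")
```

With a 0.1 s round trip and 20 scopes, the depth-1 run pays the latency serially while the depth-10 run overlaps most of it; the same qualitative effect as the 472 s versus 10 s result on the slide.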
![Page 54: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/54.jpg)
63
Video Cosegmentation
Segments mean the same
1740 Frames; Model: 10.5 million nodes, 31 million edges
Probabilistic Inference Task
![Page 55: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/55.jpg)
64
Video Coseg. Speedups
[Plot: speedup versus # machines, GraphLab compared against Ideal scaling]
![Page 56: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/56.jpg)
The GraphLab Framework
Consistency Model
Graph-Based Data Representation
Update Functions (User Computation)
65
![Page 57: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/57.jpg)
What if machines fail? How do we provide fault tolerance?
![Page 58: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/58.jpg)
Checkpoint
1: Stop the world
2: Write state to disk
![Page 59: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/59.jpg)
68
Snapshot Performance
[Plot: progress over time with no snapshot, with a snapshot, and with one slow machine; the snapshot time and the slow machine's effect are marked]
Because we have to stop the world, one slow machine slows everything down!
![Page 60: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/60.jpg)
How can we do better?
Take advantage of consistency
![Page 61: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/61.jpg)
70
Checkpointing
1985: Chandy and Lamport invented an asynchronous snapshotting algorithm for distributed systems.
[Figure: a running system in which some vertices have been snapshotted and the rest have not yet been snapshotted]
![Page 62: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/62.jpg)
71
Checkpointing
Fine-Grained Chandy-Lamport.
Easily implemented within GraphLab as an Update Function!
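A rough, fully sequential sketch of the fine-grained snapshot idea (the message channels and concurrency of the real Chandy-Lamport algorithm are simplified away, and the function names are invented for illustration): the first time the snapshot reaches a vertex it records that vertex's state and forwards the marker to its neighbours, so the snapshot spreads through the graph without stopping the world.

```python
from collections import deque

def snapshot(graph_data, neighbors, start_vertex):
    """Marker-style snapshot: each vertex saves its state the first time the
    snapshot reaches it, then forwards the marker to its neighbours."""
    saved, pending = {}, deque([start_vertex])
    while pending:
        v = pending.popleft()
        if v in saved:                 # already snapshotted: ignore the marker
            continue
        saved[v] = graph_data[v]       # record state before any later updates
        pending.extend(neighbors[v])   # forward the marker along every edge
    return saved

neighbors = {0: [1, 2], 1: [0], 2: [0]}
graph_data = {0: "a", 1: "b", 2: "c"}
print(snapshot(graph_data, neighbors, start_vertex=0))
```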
![Page 63: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/63.jpg)
72
Async. Snapshot Performance
[Plot: progress over time with the asynchronous snapshot, with and without one slow machine]
No penalty is incurred by the slow machine!
![Page 64: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/64.jpg)
Summary
- Extended the GraphLab abstraction to distributed systems.
- Two different methods of achieving consistency: Graph Coloring, and Distributed Locking with pipelining.
- Efficient implementations.
- Asynchronous fault tolerance with fine-grained Chandy-Lamport snapshots.
- Performance, usability, efficiency, scalability.
73
![Page 65: Yucheng Low](https://reader036.fdocuments.in/reader036/viewer/2022062521/568166a3550346895dda8fa3/html5/thumbnails/65.jpg)
Carnegie Mellon University
Release 2.1 + many toolkits: http://graphlab.org
PageRank: 40x faster than Hadoop
1B edges per second.
Triangle Counting: 282x faster than Hadoop
400M triangles per second.
Major Improvements to be published in OSDI 2012