A New Parallel Framework for Machine Learning
Carnegie Mellon
Joseph Gonzalez
Joint work with: Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O'Hallaron, Alex Smola
In ML we face BIG problems
  48 hours of video uploaded to YouTube every minute
  24 million Wikipedia pages
  750 million Facebook users
  6 billion Flickr photos

Parallelism: Hope for the Future
Wide array of different parallel architectures: GPUs, Multicore, Clusters, Mini Clouds, Clouds

New challenges for designing machine learning algorithms:
  Race conditions and deadlocks
  Managing distributed model state
New challenges for implementing machine learning algorithms:
  Parallel debugging and profiling
  Hardware-specific APIs
Core Question
How will we design and implement parallel learning systems?
Threads, Locks, & Messages
We could use…
Build each new learning system using low-level parallel primitives.

ML experts repeatedly solve the same parallel design challenges:
  Implement and debug a complex parallel system
  Tune for a specific parallel platform
  Two months later the conference paper contains: "We implemented ______ in parallel."
The resulting code:
  is difficult to maintain
  is difficult to extend
  couples the learning model to the parallel implementation
[cartoon: graduate students]
... a better answer:
Map-Reduce / Hadoop
Build learning algorithms on top of high-level parallel abstractions.
MapReduce – Map Phase
Embarrassingly parallel, independent computation: each CPU (CPU 1 … CPU 4) extracts image features from its own portion of the data.
No communication needed.

MapReduce – Reduce Phase
The image features are folded together into aggregate statistics (e.g., attractive face statistics vs. ugly face statistics).
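The two phases can be sketched in a few lines of C++. This is a toy stand-in, assuming each "image" reduces to a single score rather than real image features; the point is that every map call is independent (no communication), and only the reduce folds results together.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical per-image feature extraction: independent per item, so
// each CPU could run map_phase over its own shard with no communication.
double map_features(double image) { return image * 2.0; }

std::vector<double> map_phase(const std::vector<double>& images) {
    std::vector<double> out;
    for (double img : images) out.push_back(map_features(img));
    return out;
}

// Reduce phase: fold the mapped values into a summary statistic.
double reduce_sum(const std::vector<double>& mapped) {
    return std::accumulate(mapped.begin(), mapped.end(), 0.0);
}
```

Because the map calls share no state, splitting `images` across CPUs and concatenating the results gives the same answer as the sequential loop.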
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
  Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
  Graph-Parallel: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso
Is there more to Machine Learning?
Concrete Example: Label Propagation

Label Propagation Algorithm
Social Arithmetic:
  I Like: 60% Cameras, 40% Biking =
    50% × what I list on my profile (50% Cameras, 50% Biking)
  + 40% × what Sue Ann likes (80% Cameras, 20% Biking)
  + 10% × what Carlos likes (30% Cameras, 70% Biking)
Recurrence Algorithm: iterate until convergence
  Likes[i] = Σ_j W_ij × Likes[j]
Parallelism: compute all Likes[i] in parallel
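The social arithmetic above can be checked with a short sketch (struct and function names are illustrative, not a GraphLab API): one step of the recurrence is just a weighted mix of interest distributions.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// One interest distribution: {P(cameras), P(biking)}.
struct Likes { double cameras, biking; };

// One step of the label-propagation recurrence: my estimate is a
// weighted mix of my own profile and my friends' current estimates.
Likes mix(const std::vector<std::pair<double, Likes>>& weighted) {
    Likes out{0.0, 0.0};
    for (const auto& [w, l] : weighted) {
        out.cameras += w * l.cameras;
        out.biking  += w * l.biking;
    }
    return out;
}
```

Plugging in the slide's numbers (50% profile, 40% Sue Ann, 10% Carlos) reproduces the 60% Cameras / 40% Biking result.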
Properties of Graph-Parallel Algorithms:
  Dependency Graph
  Factored Computation: what I like depends on what my friends like
  Iterative Computation
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks (Feature Extraction, Cross Validation, Computing Sufficient Statistics), but can Map Reduce handle the graph-parallel side: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso?
Why not use Map-Reduce for Graph-Parallel Algorithms?
Data Dependencies
Map-Reduce does not efficiently express dependent data:
  User must code substantial data transformations
  Costly data replication
[figure: independent data rows assigned to processors]

Iterative Algorithms
Map-Reduce does not efficiently express iterative algorithms:
[figure: data processed by CPU 1–3 across repeated iterations, each ending in a barrier; every barrier waits for the slowest processor]
MapAbuse: Iterative MapReduce
Only a subset of the data needs computation:
[figure: across iterations, fewer and fewer data items actually need recomputation, yet every barrier-separated Map-Reduce pass touches all of the data]
MapAbuse: Iterative MapReduce
The system is not optimized for iteration:
[figure: every iteration additionally pays a startup penalty and a disk penalty]
Synchronous vs. Asynchronous
Example algorithm: if a neighbor is Red, then turn Red.

Synchronous computation (Map-Reduce): evaluate the condition on all vertices in every phase.
  4 phases, each with 9 computations → 36 computations

Asynchronous computation (wave-front): evaluate the condition only when a neighbor changes.
  4 phases with 8 computations in total
[figure: wave-front spreading at Time 0 … Time 4]
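The counts above can be reproduced with a tiny simulation. Assuming (to match the slide's figure) a 3×3 grid with Red seeded at a corner: the synchronous version sweeps all 9 vertices each phase, while the asynchronous wave-front evaluates a vertex only when a neighbor has changed.

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// "If a neighbor is Red, turn Red" on a 3x3 grid, Red seeded at (0,0).
using Grid = std::array<std::array<bool, 3>, 3>;

static std::vector<std::pair<int, int>> neighbors(int r, int c) {
    std::vector<std::pair<int, int>> n;
    if (r > 0) n.push_back({r - 1, c});
    if (r < 2) n.push_back({r + 1, c});
    if (c > 0) n.push_back({r, c - 1});
    if (c < 2) n.push_back({r, c + 1});
    return n;
}

// Synchronous (Map-Reduce style): every phase evaluates all 9 vertices.
int sync_evals() {
    Grid red{}; red[0][0] = true;
    int evals = 0;
    for (;;) {
        Grid next = red;
        bool changed = false;
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                for (auto [nr, nc] : neighbors(r, c))
                    if (red[nr][nc] && !red[r][c]) { next[r][c] = true; changed = true; }
        if (!changed) break;  // the final no-change sweep is not counted
        evals += 9;           // one evaluation per vertex per phase
        red = next;
    }
    return evals;
}

// Asynchronous (wave-front): evaluate a vertex only when a neighbor turned Red.
int async_evals() {
    Grid red{}; red[0][0] = true;
    Grid queued{}; queued[0][0] = true;
    std::queue<std::pair<int, int>> q; q.push({0, 0});
    int evals = 0;
    while (!q.empty()) {
        auto [r, c] = q.front(); q.pop();
        if (!red[r][c]) { red[r][c] = true; ++evals; }  // one evaluation, vertex turns Red
        for (auto [nr, nc] : neighbors(r, c))
            if (!queued[nr][nc]) { queued[nr][nc] = true; q.push({nr, nc}); }
    }
    return evals;
}
```

The wave-front does 8 evaluations in total (one per vertex that turns Red), versus 36 for four full synchronous sweeps.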
Data-Parallel Algorithms can be Inefficient
The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms.
[plot: runtime in seconds vs. number of CPUs (1–8); optimized in-memory MapReduce BP vs. asynchronous Splash BP]
The Need for a New Abstraction
Map-Reduce is not well suited for graph-parallel algorithms (Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso).
What is GraphLab?
The GraphLab Framework:
  Graph-Based Data Representation
  Update Functions (User Computation)
  Scheduler
  Consistency Model
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
  Graph: social network
  Vertex data: user profile text, current interest estimates
  Edge data: similarity weights
Implementing the Data Graph

Multicore setting (in memory): relatively straightforward.
  vertex_data(vid) → data
  edge_data(vid, vid) → data
  neighbors(vid) → vid_list
  Challenge: fast lookup, low overhead
  Solution: dense data structures; fixed Vdata & Edata types; immutable graph structure

Cluster setting (in memory): partition the graph (ParMETIS or random cuts) with cached ghosting.
  [figure: Node 1 and Node 2 each hold their partition plus cached ghost copies of boundary vertices]
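A minimal sketch of such a dense, immutable graph, using a CSR-style adjacency layout (illustrative only, not the actual GraphLab data structure): fixed `Vdata`/`Edata` types, structure frozen at construction, and neighbor lookup from a contiguous array.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <tuple>
#include <vector>

// Dense, immutable data graph: fixed vertex/edge data types, CSR
// adjacency for fast, low-overhead neighbor lookup.
template <typename VData, typename EData>
class DataGraph {
public:
    // edges: (source, target, edge data); structure is immutable afterwards.
    DataGraph(std::vector<VData> vdata,
              std::vector<std::tuple<int, int, EData>> edges)
        : vdata_(std::move(vdata)) {
        offsets_.assign(vdata_.size() + 1, 0);
        for (auto& [s, t, d] : edges) offsets_[s + 1]++;           // out-degree counts
        for (size_t i = 1; i < offsets_.size(); ++i) offsets_[i] += offsets_[i - 1];
        targets_.resize(edges.size());
        edata_.resize(edges.size());
        std::vector<int> cursor(offsets_.begin(), offsets_.end() - 1);
        for (auto& [s, t, d] : edges) {                            // fill dense arrays
            targets_[cursor[s]] = t;
            edata_[cursor[s]] = d;
            ++cursor[s];
        }
    }
    const VData& vertex_data(int vid) const { return vdata_[vid]; }
    // Neighbors of vid: a contiguous slice of the dense target array.
    std::vector<int> neighbors(int vid) const {
        return std::vector<int>(targets_.begin() + offsets_[vid],
                                targets_.begin() + offsets_[vid + 1]);
    }
    // Linear scan over vid's (typically short) adjacency list.
    const EData& edge_data(int src, int dst) const {
        for (int i = offsets_[src]; i < offsets_[src + 1]; ++i)
            if (targets_[i] == dst) return edata_[i];
        throw std::out_of_range("no such edge");
    }
private:
    std::vector<VData> vdata_;
    std::vector<int> offsets_, targets_;
    std::vector<EData> edata_;
};
```

The immutable structure is what makes the dense layout possible: since no edges are added later, offsets and targets can be packed once with no per-vertex pointer chasing.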
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex:

label_prop(i, scope) {
  // Get neighborhood data
  (Likes[i], W_ij, Likes[j]) ← scope;
  // Update the vertex data
  Likes[i] ← Σ_j W_ij × Likes[j];
  // Reschedule neighbors if needed
  if Likes[i] changes then
    reschedule_neighbors_of(i);
}
The Scheduler
The scheduler determines the order in which vertices are updated.
[figure: CPU 1 and CPU 2 pull vertices (a, b, …, k) from a shared scheduler queue; updates may push new vertices back into the queue]
The process repeats until the scheduler is empty.
Choosing a Schedule
GraphLab provides several different schedulers:
  Round Robin: vertices are updated in a fixed order
  FIFO: vertices are updated in the order they are added
  Priority: vertices are updated in priority order
The choice of schedule affects the correctness and parallel performance of the algorithm.
Obtain different algorithms by simply changing a flag!
  --scheduler=roundrobin
  --scheduler=fifo
  --scheduler=priority
Implementing the Schedulers

Multicore setting: challenging!
  Fine-grained locking, atomic operations
  Approximate FIFO/Priority: random placement, work stealing
  [figure: CPU 1–4, each with its own queue]

Cluster setting: a multicore scheduler on each node.
  Each node schedules only its "local" vertices; update functions for remote vertices (f(v1), f(v2)) are exchanged between nodes.
GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution of update functions which produces the same result.
[figure: a parallel timeline on CPU 1 and CPU 2 is equivalent to some single-CPU sequential timeline]
Ensuring Race-Free Code
How much can computation overlap?

Common problem: write-write race.
Processors running adjacent update functions simultaneously modify shared data; the final value depends on which CPU's write lands last.
Nuances of Sequential Consistency
Data consistency depends on the update function: some algorithms are "robust" to data races.
GraphLab solution: the user can choose from three consistency models (Full, Edge, Vertex), and GraphLab automatically enforces the user's choice.
[figure: overlapping scopes on CPU 1 and CPU 2 are unsafe; non-overlapping scopes are safe]
Consistency Rules
Full consistency guarantees sequential consistency for all update functions.

Full Consistency
Only update functions at least two vertices apart may run in parallel → reduced opportunities for parallelism.

Obtaining More Parallelism
Not all update functions will modify the entire scope!

Edge Consistency
The update function may write its vertex and adjacent edges but only read adjacent vertices, so updates on nearby vertices can safely overlap.

Obtaining Even More Parallelism
"Map" operations, e.g. feature extraction on vertex data, touch only the vertex itself.

Vertex Consistency
The update function may write only its own vertex data, allowing all vertices to run in parallel.
Consistency Through R/W Locks
Read/write locks:
  Full consistency: write-lock the center vertex and all of its neighbors
  Edge consistency: write-lock the center vertex, read-lock its neighbors
  Deadlock is avoided by acquiring locks in a canonical order.
Multicore setting: Pthread R/W locks. Distributed setting: distributed locking.

Prefetch Locks and Data
Allow computation to proceed while locks/data are requested: a lock pipeline between the data-graph partitions on Node 1 and Node 2.
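The canonical lock ordering can be illustrated by computing a lock-acquisition plan for one scope (a sketch: it returns the plan rather than taking real pthread R/W locks, and `LockReq` is an illustrative name, not GraphLab's). Because every CPU sorts its requests by vertex id, two overlapping scopes always acquire their common locks in the same order, which rules out deadlock cycles.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Edge consistency: write-lock the center vertex, read-lock neighbors.
struct LockReq { int vid; bool write; };

std::vector<LockReq> edge_consistency_plan(int center, std::vector<int> nbrs) {
    std::vector<LockReq> plan;
    plan.push_back({center, /*write=*/true});
    for (int n : nbrs) plan.push_back({n, /*write=*/false});
    // Canonical ordering: ascending vertex id, identical for every CPU.
    std::sort(plan.begin(), plan.end(),
              [](const LockReq& a, const LockReq& b) { return a.vid < b.vid; });
    return plan;
}
```

For full consistency the same plan would simply mark every request as a write lock.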
Consistency Through Scheduling
Edge consistency model: two vertices can be updated simultaneously if they do not share an edge.
Graph coloring: two vertices can be assigned the same color if they do not share an edge.
Execute all vertices of one color in parallel, then barrier; repeat for each color (Phase 1, Phase 2, Phase 3, …).
Global State and Computation
Not everything fits the graph metaphor:
  Shared weights (global model parameters)
  Global computation (aggregates over the whole graph)
Anatomy of a GraphLab Program:
  1) Define a C++ update function
  2) Build the data graph using the C++ graph object
  3) Set engine parameters: scheduler type, consistency model
  4) Add initial vertices to the scheduler
  5) Register global aggregates and set refresh intervals
  6) Run the engine on the graph [blocking C++ call]
  7) The final answer is stored in the graph
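These steps can be mirrored by a toy single-threaded engine (illustrative only: this is not the real GraphLab API, and all names here are made up). An update function reads its scope, writes its vertex, and reports whether it changed anything so the engine can reschedule neighbors; the engine drains a FIFO scheduler until it is empty.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Toy engine mirroring the anatomy above (hypothetical, single-threaded).
struct Engine {
    std::vector<double> vdata;                 // 2) data graph (vertex data only here)
    std::vector<std::vector<int>> nbrs;        //    adjacency lists
    std::function<bool(Engine&, int)> update;  // 1) update function; returns "changed?"
    std::queue<int> sched;                     // 3) FIFO scheduler

    void schedule(int v) { sched.push(v); }    // 4) add initial vertices
    void run() {                               // 6) blocking call
        while (!sched.empty()) {
            int v = sched.front(); sched.pop();
            if (update(*this, v))              // reschedule neighbors on change
                for (int n : nbrs[v]) schedule(n);
        }
    }                                          // 7) answer is left in vdata
};
```

As a usage example, a max-propagation update (each vertex takes the maximum over its neighborhood) converges to the global maximum everywhere, with the engine stopping once no update changes anything.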
Algorithms Implemented
  PageRank, Loopy Belief Propagation, Gibbs Sampling, CoEM, Graphical Model Parameter Learning, Probabilistic Matrix/Tensor Factorization, Alternating Least Squares, Lasso with Sparse Features, Support Vector Machines with Sparse Features, Label Propagation, …
The Code
API implemented in C++: Pthreads, GCC atomics, TCP/IP, MPI, in-house RPC

Multicore API: nearly complete implementation; experimental Matlab/Java/Python support
Cloud API: built and tested on EC2; no fault tolerance; still experimental

Available under the Apache 2.0 License
http://graphlab.org
Shared Memory Experiments
Setting: 16-core workstation
Loopy Belief Propagation
3D retinal image denoising
  Data graph: 1 million vertices, 3 million edges
  Update function: loopy BP update equation
  Scheduler: approximate priority
  Consistency model: edge consistency
Loopy Belief Propagation
[plot: speedup vs. number of CPUs (1–16); SplashBP tracks close to the optimal line]
SplashBP: 15.5x speedup on 16 cores
CoEM (Rosie Jones, 2005)
Named entity recognition task: is "Dog" an animal? Is "Catalina" a place?
  Noun phrases: the dog, Australia, Catalina Island, …
  Contexts: <X> ran quickly, travelled to <X>, <X> is pleasant, …
  Graph: 2 million vertices, 200 million edges
[plot: GraphLab CoEM speedup vs. number of CPUs (1–16), near optimal]
  Hadoop: 95 cores, 7.5 hrs
  GraphLab: 16 cores, 30 min (15x faster with 6x fewer CPUs!)
EC2 Experiments
Amazon EC2, high-performance nodes
Video Cosegmentation
Segments that mean the same thing are clustered jointly.
Model: 10.5 million nodes, 31 million edges
Gaussian EM clustering + BP on a 3D grid
[plots: video cosegmentation speedups; effect of prefetching data & locks]
Matrix Factorization
Netflix collaborative filtering: alternating least squares matrix factorization
Model: 0.5 million nodes, 99 million edges
[figure: sparse Users × Movies rating matrix factored with inner dimension d]
[plots: Netflix speedup with increasing size d of the factorization; the cost of Hadoop]
Summary
An abstraction tailored to machine learning:
  Targets graph-parallel algorithms
  Naturally expresses data/computational dependencies and dynamic iterative computation
  Simplifies parallel algorithm design
  Automatically ensures data consistency
  Achieves state-of-the-art parallel performance on a variety of problems
Current/Future Work
  Out-of-core storage
  Hadoop/HDFS integration: graph construction, graph storage, launching GraphLab from Hadoop, fault tolerance through HDFS checkpoints
  Sub-scope parallelism: address the challenge of very high-degree nodes
  Improved graph partitioning
  Support for dynamic graph structure
Check out GraphLab
http://graphlab.org
Documentation… Code… Tutorials…
Questions & Feedback