A New Parallel Framework for Machine Learning
Carnegie Mellon
Joseph Gonzalez
Joint work with: Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O'Hallaron, Alex Smola
In ML we face BIG problems
  48 hours of video uploaded to YouTube every minute
  24 million Wikipedia pages
  750 million Facebook users
  6 billion Flickr photos

Parallelism: Hope for the Future
Wide array of different parallel architectures: GPUs, Multicore, Clusters, Mini Clouds, Clouds

New challenges for designing machine learning algorithms:
  Race conditions and deadlocks
  Managing distributed model state
New challenges for implementing machine learning algorithms:
  Parallel debugging and profiling
  Hardware-specific APIs
Core Question
How will we design and implement parallel learning systems?
Threads, Locks, & Messages
We could use…
Build each new learning system using low-level parallel primitives.

ML experts repeatedly solve the same parallel design challenges:
  Implement and debug a complex parallel system
  Tune for a specific parallel platform
  Two months later the conference paper contains: "We implemented ______ in parallel."
The resulting code:
  is difficult to maintain
  is difficult to extend
  couples the learning model to the parallel implementation
[cartoon: graduate students]
... a better answer:
Map-Reduce / Hadoop
Build learning algorithms on top of high-level parallel abstractions.
MapReduce – Map Phase
Embarrassingly parallel, independent computation: each CPU (CPU 1 … CPU 4) extracts image features from its own portion of the data.
No communication needed.

MapReduce – Reduce Phase
The image features are folded together into aggregate statistics (e.g., attractive face statistics vs. ugly face statistics).
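The two phases can be sketched in a few lines of C++. This is a toy stand-in, assuming each "image" reduces to a single score rather than real image features; the point is that every map call is independent (no communication), and only the reduce folds results together.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical per-image feature extraction: independent per item, so
// each CPU could run map_phase over its own shard with no communication.
double map_features(double image) { return image * 2.0; }

std::vector<double> map_phase(const std::vector<double>& images) {
    std::vector<double> out;
    for (double img : images) out.push_back(map_features(img));
    return out;
}

// Reduce phase: fold the mapped values into a summary statistic.
double reduce_sum(const std::vector<double>& mapped) {
    return std::accumulate(mapped.begin(), mapped.end(), 0.0);
}
```

Because the map calls share no state, splitting `images` across CPUs and concatenating the results gives the same answer as the sequential loop.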
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
  Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
  Graph-Parallel: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso
Is there more to Machine Learning?
Concrete Example: Label Propagation

Label Propagation Algorithm
Social Arithmetic:
  I Like: 60% Cameras, 40% Biking =
    50% × what I list on my profile (50% Cameras, 50% Biking)
  + 40% × what Sue Ann likes (80% Cameras, 20% Biking)
  + 10% × what Carlos likes (30% Cameras, 70% Biking)
Recurrence Algorithm: iterate until convergence
  Likes[i] = Σ_j W_ij × Likes[j]
Parallelism: compute all Likes[i] in parallel
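The social arithmetic above can be checked with a short sketch (struct and function names are illustrative, not a GraphLab API): one step of the recurrence is just a weighted mix of interest distributions.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// One interest distribution: {P(cameras), P(biking)}.
struct Likes { double cameras, biking; };

// One step of the label-propagation recurrence: my estimate is a
// weighted mix of my own profile and my friends' current estimates.
Likes mix(const std::vector<std::pair<double, Likes>>& weighted) {
    Likes out{0.0, 0.0};
    for (const auto& [w, l] : weighted) {
        out.cameras += w * l.cameras;
        out.biking  += w * l.biking;
    }
    return out;
}
```

Plugging in the slide's numbers (50% profile, 40% Sue Ann, 10% Carlos) reproduces the 60% Cameras / 40% Biking result.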
Properties of Graph-Parallel Algorithms:
  Dependency Graph
  Factored Computation: what I like depends on what my friends like
  Iterative Computation
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks (Feature Extraction, Cross Validation, Computing Sufficient Statistics), but can Map Reduce handle the graph-parallel side: Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso?
Why not use Map-Reduce for Graph-Parallel Algorithms?
Data Dependencies
Map-Reduce does not efficiently express dependent data:
  User must code substantial data transformations
  Costly data replication
[figure: independent data rows assigned to processors]

Iterative Algorithms
Map-Reduce does not efficiently express iterative algorithms:
[figure: data processed by CPU 1–3 across repeated iterations, each ending in a barrier; every barrier waits for the slowest processor]
MapAbuse: Iterative MapReduce
Only a subset of the data needs computation:
[figure: across iterations, fewer and fewer data items actually need recomputation, yet every barrier-separated Map-Reduce pass touches all of the data]
MapAbuse: Iterative MapReduce
The system is not optimized for iteration:
[figure: every iteration additionally pays a startup penalty and a disk penalty]
Synchronous vs. Asynchronous
Example algorithm: if a neighbor is Red, then turn Red.

Synchronous computation (Map-Reduce): evaluate the condition on all vertices in every phase.
  4 phases, each with 9 computations → 36 computations

Asynchronous computation (wave-front): evaluate the condition only when a neighbor changes.
  4 phases with 8 computations in total
[figure: wave-front spreading at Time 0 … Time 4]
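The counts above can be reproduced with a tiny simulation. Assuming (to match the slide's figure) a 3×3 grid with Red seeded at a corner: the synchronous version sweeps all 9 vertices each phase, while the asynchronous wave-front evaluates a vertex only when a neighbor has changed.

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// "If a neighbor is Red, turn Red" on a 3x3 grid, Red seeded at (0,0).
using Grid = std::array<std::array<bool, 3>, 3>;

static std::vector<std::pair<int, int>> neighbors(int r, int c) {
    std::vector<std::pair<int, int>> n;
    if (r > 0) n.push_back({r - 1, c});
    if (r < 2) n.push_back({r + 1, c});
    if (c > 0) n.push_back({r, c - 1});
    if (c < 2) n.push_back({r, c + 1});
    return n;
}

// Synchronous (Map-Reduce style): every phase evaluates all 9 vertices.
int sync_evals() {
    Grid red{}; red[0][0] = true;
    int evals = 0;
    for (;;) {
        Grid next = red;
        bool changed = false;
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                for (auto [nr, nc] : neighbors(r, c))
                    if (red[nr][nc] && !red[r][c]) { next[r][c] = true; changed = true; }
        if (!changed) break;  // the final no-change sweep is not counted
        evals += 9;           // one evaluation per vertex per phase
        red = next;
    }
    return evals;
}

// Asynchronous (wave-front): evaluate a vertex only when a neighbor turned Red.
int async_evals() {
    Grid red{}; red[0][0] = true;
    Grid queued{}; queued[0][0] = true;
    std::queue<std::pair<int, int>> q; q.push({0, 0});
    int evals = 0;
    while (!q.empty()) {
        auto [r, c] = q.front(); q.pop();
        if (!red[r][c]) { red[r][c] = true; ++evals; }  // one evaluation, vertex turns Red
        for (auto [nr, nc] : neighbors(r, c))
            if (!queued[nr][nc]) { queued[nr][nc] = true; q.push({nr, nc}); }
    }
    return evals;
}
```

The wave-front does 8 evaluations in total (one per vertex that turns Red), versus 36 for four full synchronous sweeps.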
Data-Parallel Algorithms can be Inefficient
The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms.
[plot: runtime in seconds vs. number of CPUs (1–8); optimized in-memory MapReduce BP vs. asynchronous Splash BP]
The Need for a New Abstraction
Map-Reduce is not well suited for graph-parallel algorithms (Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso).
What is GraphLab?
The GraphLab Framework:
  Graph-Based Data Representation
  Update Functions (User Computation)
  Scheduler
  Consistency Model
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
  Graph: social network
  Vertex data: user profile text, current interest estimates
  Edge data: similarity weights
Implementing the Data Graph

Multicore setting (in memory): relatively straightforward.
  vertex_data(vid) → data
  edge_data(vid, vid) → data
  neighbors(vid) → vid_list
  Challenge: fast lookup, low overhead
  Solution: dense data structures; fixed Vdata & Edata types; immutable graph structure

Cluster setting (in memory): partition the graph (ParMETIS or random cuts) with cached ghosting.
  [figure: Node 1 and Node 2 each hold their partition plus cached ghost copies of boundary vertices]
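A minimal sketch of such a dense, immutable graph, using a CSR-style adjacency layout (illustrative only, not the actual GraphLab data structure): fixed `Vdata`/`Edata` types, structure frozen at construction, and neighbor lookup from a contiguous array.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <tuple>
#include <vector>

// Dense, immutable data graph: fixed vertex/edge data types, CSR
// adjacency for fast, low-overhead neighbor lookup.
template <typename VData, typename EData>
class DataGraph {
public:
    // edges: (source, target, edge data); structure is immutable afterwards.
    DataGraph(std::vector<VData> vdata,
              std::vector<std::tuple<int, int, EData>> edges)
        : vdata_(std::move(vdata)) {
        offsets_.assign(vdata_.size() + 1, 0);
        for (auto& [s, t, d] : edges) offsets_[s + 1]++;           // out-degree counts
        for (size_t i = 1; i < offsets_.size(); ++i) offsets_[i] += offsets_[i - 1];
        targets_.resize(edges.size());
        edata_.resize(edges.size());
        std::vector<int> cursor(offsets_.begin(), offsets_.end() - 1);
        for (auto& [s, t, d] : edges) {                            // fill dense arrays
            targets_[cursor[s]] = t;
            edata_[cursor[s]] = d;
            ++cursor[s];
        }
    }
    const VData& vertex_data(int vid) const { return vdata_[vid]; }
    // Neighbors of vid: a contiguous slice of the dense target array.
    std::vector<int> neighbors(int vid) const {
        return std::vector<int>(targets_.begin() + offsets_[vid],
                                targets_.begin() + offsets_[vid + 1]);
    }
    // Linear scan over vid's (typically short) adjacency list.
    const EData& edge_data(int src, int dst) const {
        for (int i = offsets_[src]; i < offsets_[src + 1]; ++i)
            if (targets_[i] == dst) return edata_[i];
        throw std::out_of_range("no such edge");
    }
private:
    std::vector<VData> vdata_;
    std::vector<int> offsets_, targets_;
    std::vector<EData> edata_;
};
```

The immutable structure is what makes the dense layout possible: since no edges are added later, offsets and targets can be packed once with no per-vertex pointer chasing.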
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex:

label_prop(i, scope) {
  // Get neighborhood data
  (Likes[i], W_ij, Likes[j]) ← scope;
  // Update the vertex data
  Likes[i] ← Σ_j W_ij × Likes[j];
  // Reschedule neighbors if needed
  if Likes[i] changes then
    reschedule_neighbors_of(i);
}
The Scheduler
The scheduler determines the order in which vertices are updated.
[figure: CPU 1 and CPU 2 pull vertices (a, b, …, k) from a shared scheduler queue; updates may push new vertices back into the queue]
The process repeats until the scheduler is empty.
Choosing a Schedule
GraphLab provides several different schedulers:
  Round Robin: vertices are updated in a fixed order
  FIFO: vertices are updated in the order they are added
  Priority: vertices are updated in priority order
The choice of schedule affects the correctness and parallel performance of the algorithm.
Obtain different algorithms by simply changing a flag!
  --scheduler=roundrobin
  --scheduler=fifo
  --scheduler=priority
Implementing the Schedulers

Multicore setting: challenging!
  Fine-grained locking, atomic operations
  Approximate FIFO/Priority: random placement, work stealing
  [figure: CPU 1–4, each with its own queue]

Cluster setting: a multicore scheduler on each node.
  Each node schedules only its "local" vertices; update functions for remote vertices (f(v1), f(v2)) are exchanged between nodes.
GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution of update functions which produces the same result.
[figure: a parallel timeline on CPU 1 and CPU 2 is equivalent to some single-CPU sequential timeline]
Ensuring Race-Free Code
How much can computation overlap?

Common problem: write-write race.
Processors running adjacent update functions simultaneously modify shared data; the final value depends on which CPU's write lands last.
Nuances of Sequential Consistency
Data consistency depends on the update function: some algorithms are "robust" to data races.
GraphLab solution: the user can choose from three consistency models (Full, Edge, Vertex), and GraphLab automatically enforces the user's choice.
[figure: overlapping scopes on CPU 1 and CPU 2 are unsafe; non-overlapping scopes are safe]
Consistency Rules
Full consistency guarantees sequential consistency for all update functions.

Full Consistency
Only update functions at least two vertices apart may run in parallel → reduced opportunities for parallelism.

Obtaining More Parallelism
Not all update functions will modify the entire scope!

Edge Consistency
The update function may write its vertex and adjacent edges but only read adjacent vertices, so updates on nearby vertices can safely overlap.

Obtaining Even More Parallelism
"Map" operations, e.g. feature extraction on vertex data, touch only the vertex itself.

Vertex Consistency
The update function may write only its own vertex data, allowing all vertices to run in parallel.
Consistency Through R/W Locks
Read/write locks:
  Full consistency: write-lock the center vertex and all of its neighbors
  Edge consistency: write-lock the center vertex, read-lock its neighbors
  Deadlock is avoided by acquiring locks in a canonical order.
Multicore setting: Pthread R/W locks. Distributed setting: distributed locking.

Prefetch Locks and Data
Allow computation to proceed while locks/data are requested: a lock pipeline between the data-graph partitions on Node 1 and Node 2.
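The canonical lock ordering can be illustrated by computing a lock-acquisition plan for one scope (a sketch: it returns the plan rather than taking real pthread R/W locks, and `LockReq` is an illustrative name, not GraphLab's). Because every CPU sorts its requests by vertex id, two overlapping scopes always acquire their common locks in the same order, which rules out deadlock cycles.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Edge consistency: write-lock the center vertex, read-lock neighbors.
struct LockReq { int vid; bool write; };

std::vector<LockReq> edge_consistency_plan(int center, std::vector<int> nbrs) {
    std::vector<LockReq> plan;
    plan.push_back({center, /*write=*/true});
    for (int n : nbrs) plan.push_back({n, /*write=*/false});
    // Canonical ordering: ascending vertex id, identical for every CPU.
    std::sort(plan.begin(), plan.end(),
              [](const LockReq& a, const LockReq& b) { return a.vid < b.vid; });
    return plan;
}
```

For full consistency the same plan would simply mark every request as a write lock.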
Consistency Through Scheduling
Edge consistency model: two vertices can be updated simultaneously if they do not share an edge.
Graph coloring: two vertices can be assigned the same color if they do not share an edge.
Execute all vertices of one color in parallel, then barrier; repeat for each color (Phase 1, Phase 2, Phase 3, …).
Global State and Computation
Not everything fits the graph metaphor:
  Shared weights (global model parameters)
  Global computation (aggregates over the whole graph)
Anatomy of a GraphLab Program:
  1) Define a C++ update function
  2) Build the data graph using the C++ graph object
  3) Set engine parameters: scheduler type, consistency model
  4) Add initial vertices to the scheduler
  5) Register global aggregates and set refresh intervals
  6) Run the engine on the graph [blocking C++ call]
  7) The final answer is stored in the graph
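These steps can be mirrored by a toy single-threaded engine (illustrative only: this is not the real GraphLab API, and all names here are made up). An update function reads its scope, writes its vertex, and reports whether it changed anything so the engine can reschedule neighbors; the engine drains a FIFO scheduler until it is empty.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Toy engine mirroring the anatomy above (hypothetical, single-threaded).
struct Engine {
    std::vector<double> vdata;                 // 2) data graph (vertex data only here)
    std::vector<std::vector<int>> nbrs;        //    adjacency lists
    std::function<bool(Engine&, int)> update;  // 1) update function; returns "changed?"
    std::queue<int> sched;                     // 3) FIFO scheduler

    void schedule(int v) { sched.push(v); }    // 4) add initial vertices
    void run() {                               // 6) blocking call
        while (!sched.empty()) {
            int v = sched.front(); sched.pop();
            if (update(*this, v))              // reschedule neighbors on change
                for (int n : nbrs[v]) schedule(n);
        }
    }                                          // 7) answer is left in vdata
};
```

As a usage example, a max-propagation update (each vertex takes the maximum over its neighborhood) converges to the global maximum everywhere, with the engine stopping once no update changes anything.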
Algorithms Implemented
  PageRank, Loopy Belief Propagation, Gibbs Sampling, CoEM, Graphical Model Parameter Learning, Probabilistic Matrix/Tensor Factorization, Alternating Least Squares, Lasso with Sparse Features, Support Vector Machines with Sparse Features, Label Propagation, …
The Code
API implemented in C++: Pthreads, GCC atomics, TCP/IP, MPI, in-house RPC

Multicore API: nearly complete implementation; experimental Matlab/Java/Python support
Cloud API: built and tested on EC2; no fault tolerance; still experimental

Available under the Apache 2.0 License
http://graphlab.org
Shared Memory Experiments
Setting: 16-core workstation
Loopy Belief Propagation
3D retinal image denoising
  Data graph: 1 million vertices, 3 million edges
  Update function: loopy BP update equation
  Scheduler: approximate priority
  Consistency model: edge consistency
Loopy Belief Propagation
[plot: speedup vs. number of CPUs (1–16); SplashBP tracks close to the optimal line]
SplashBP: 15.5x speedup on 16 cores
CoEM (Rosie Jones, 2005)
Named entity recognition task: is "Dog" an animal? Is "Catalina" a place?
  Noun phrases: the dog, Australia, Catalina Island, …
  Contexts: <X> ran quickly, travelled to <X>, <X> is pleasant, …
  Graph: 2 million vertices, 200 million edges
[plot: GraphLab CoEM speedup vs. number of CPUs (1–16), near optimal]
  Hadoop: 95 cores, 7.5 hrs
  GraphLab: 16 cores, 30 min (15x faster with 6x fewer CPUs!)
EC2 Experiments
Amazon EC2, high-performance nodes
Video Cosegmentation
Segments that mean the same thing are clustered jointly.
Model: 10.5 million nodes, 31 million edges
Gaussian EM clustering + BP on a 3D grid
[plots: video cosegmentation speedups; effect of prefetching data & locks]
Matrix Factorization
Netflix collaborative filtering: alternating least squares matrix factorization
Model: 0.5 million nodes, 99 million edges
[figure: sparse Users × Movies rating matrix factored with inner dimension d]
[plots: Netflix speedup with increasing size d of the factorization; the cost of Hadoop]
Summary
An abstraction tailored to machine learning:
  Targets graph-parallel algorithms
  Naturally expresses data/computational dependencies and dynamic iterative computation
  Simplifies parallel algorithm design
  Automatically ensures data consistency
  Achieves state-of-the-art parallel performance on a variety of problems
Current/Future Work
  Out-of-core storage
  Hadoop/HDFS integration: graph construction, graph storage, launching GraphLab from Hadoop, fault tolerance through HDFS checkpoints
  Sub-scope parallelism: address the challenge of very high-degree nodes
  Improved graph partitioning
  Support for dynamic graph structure
Check out GraphLab
http://graphlab.org
Documentation… Code… Tutorials…
Questions & Feedback