Scaling Machine Learning to Billions of Parameters - Spark Summit 2016


Transcript of Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Page 1: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Scaling Machine Learning To Billions of Parameters

Badri Bhaskar, Erik Ordentlich (joint with Andy Feng, Lee Yang, Peter Cnudde) Yahoo, Inc.

Page 2: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Outline

• Large scale machine learning (ML)
• Spark + Parameter Server
  – Architecture
  – Implementation
• Examples:
  – Distributed L-BFGS (Batch)
  – Distributed Word2vec (Sequential)
• Spark + Parameter Server on Hadoop Cluster

Page 3: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

LARGE SCALE ML

Page 4: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

[Figure: grid of model weights]

Web Scale ML: billions of features, hundreds of billions of examples
Big Model, Big Data

Ex: Yahoo word2vec - 120 billion parameters and 500 billion samples

Page 5: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

[Figure: grid of model weights, partitioned across stores]

Web Scale ML: billions of features, hundreds of billions of examples
Big Model, Big Data

Ex: Yahoo word2vec - 120 billion parameters and 500 billion samples

Store | Store | Store

Page 6: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

[Figure: workers training against the model partitioned across stores]

Web Scale ML: billions of features, hundreds of billions of examples
Big Model, Big Data

Ex: Yahoo word2vec - 120 billion parameters and 500 billion samples

Worker | Worker | Worker
Store | Store | Store

Page 7: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

[Figure: workers training against the model partitioned across stores]

Web Scale ML: billions of features, hundreds of billions of examples
Big Model, Big Data

Ex: Yahoo word2vec - 120 billion parameters and 500 billion samples

Worker | Worker | Worker
Store | Store | Store

Each example depends only on a tiny fraction of the model

Page 8: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Two Optimization Strategies

BATCH: one model, multiple epochs over the examples
Example: Gradient Descent, L-BFGS

SEQUENTIAL: a sequence of models, multiple random samples
Example: (Minibatch) stochastic gradient method, perceptron

Page 9: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Two Optimization Strategies

BATCH: one model, multiple epochs over the examples
Example: Gradient Descent, L-BFGS
• Small number of model updates
• Accurate
• Each epoch may be expensive
• Easy to parallelize

SEQUENTIAL: a sequence of models, multiple random samples
Example: (Minibatch) stochastic gradient method, perceptron

Page 10: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Two Optimization Strategies

BATCH: one model, multiple epochs over the examples
Example: Gradient Descent, L-BFGS
• Small number of model updates
• Accurate
• Each epoch may be expensive
• Easy to parallelize

SEQUENTIAL: a sequence of models, multiple random samples
Example: (Minibatch) stochastic gradient method, perceptron
• Requires lots of model updates
• Not as accurate, but often good enough
• A lot of progress in one pass* for big data
• Not trivial to parallelize

*also optimal in terms of generalization error (often with a lot of tuning)
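In update-rule form, the contrast between the two strategies looks as follows (standard formulations, stated here for concreteness; not from the slide itself):

```latex
% Batch: one update per full pass over all n examples
w_{t+1} = w_t - \eta_t \nabla f(w_t), \qquad f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)

% Sequential: one update per random minibatch B_t
w_{t+1} = w_t - \eta_t \cdot \frac{1}{|B_t|} \sum_{i \in B_t} \nabla f_i(w_t)
```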

Page 11: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Requirements

Page 12: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Requirements

✓ Support both batch and sequential optimization

Page 13: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Requirements

✓ Support both batch and sequential optimization
✓ Sequential training: Handle frequent updates to the model

Page 14: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Requirements

✓ Support both batch and sequential optimization
✓ Sequential training: Handle frequent updates to the model
✓ Batch training: 100+ passes ⇒ each pass must be fast

Page 15: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Parameter Server (PS)

[Figure: clients, each with its own data, pull the model M from and push model updates ΔM to a row of PS shards]

Training state stored in PS shards, asynchronous updates

Early work: Yahoo LDA by Smola and Narayanamurthy, based on memcached (2010); introduced in Google's DistBelief (2012); refined in Petuum / Bösen (2013) and by Mu Li et al. (2014)
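To make the picture concrete, here is a minimal sketch of the asynchronous cycle each client runs against the shards: pull the needed slice of M, compute locally, push ΔM without waiting on other clients. The PSClient trait and every name in it are hypothetical, not the actual Yahoo PS API.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

// Hypothetical client handle; illustrates the PS pattern, not the real API.
trait PSClient {
  def pull(keys: Seq[Long]): Future[Map[Long, Float]] // fetch a slice of model M
  def push(delta: Map[Long, Float]): Future[Unit]     // send additive update ΔM
}

// One client's training loop: examples are (feature indices, label) pairs.
def trainAsync(ps: PSClient, batches: Iterator[Seq[(Seq[Long], Float)]]): Unit =
  for (batch <- batches) {
    val keys = batch.flatMap(_._1).distinct             // touched coordinates only
    val model = Await.result(ps.pull(keys), 30.seconds) // pull a sparse slice
    ps.push(gradientStep(batch, model))                 // async push, no barrier
  }

// Logistic-loss gradient step over binary features (indices only).
def gradientStep(batch: Seq[(Seq[Long], Float)], model: Map[Long, Float],
                 lr: Float = 0.1f): Map[Long, Float] = {
  val delta = scala.collection.mutable.Map.empty[Long, Float].withDefaultValue(0f)
  for ((feats, label) <- batch) {
    val p = 1f / (1f + math.exp(-feats.map(model.getOrElse(_, 0f)).sum).toFloat)
    feats.foreach(k => delta(k) += lr * (label - p))    // accumulate sparse ΔM
  }
  delta.toMap
}
```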

Page 16: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

SPARK + PARAMETER SERVER

Page 17: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

ML in Spark alone

[Figure: Driver (holds the model) coordinating Executors, each with multiple cores]

Page 18: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

ML in Spark alone

[Figure: Driver (holds the model) coordinating Executors, each with multiple cores]

MLlib optimization

Page 19: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

ML in Spark alone

• Sequential:
  – Driver-based communication limits the frequency of model updates.
  – A large minibatch size limits the model update frequency, and convergence suffers.

[Figure: Driver (holds the model) coordinating Executors, each with multiple cores]

MLlib optimization

Page 20: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

ML in Spark alone

• Sequential:
  – Driver-based communication limits the frequency of model updates.
  – A large minibatch size limits the model update frequency, and convergence suffers.

• Batch:
  – Driver bandwidth can be a bottleneck.
  – Synchronous stage-wise processing limits throughput.

[Figure: Driver (holds the model) coordinating Executors, each with multiple cores]

MLlib optimization

Page 21: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

ML in Spark alone

• Sequential:
  – Driver-based communication limits the frequency of model updates.
  – A large minibatch size limits the model update frequency, and convergence suffers.

• Batch:
  – Driver bandwidth can be a bottleneck.
  – Synchronous stage-wise processing limits throughput.

[Figure: Driver (holds the model) coordinating Executors, each with multiple cores]

MLlib optimization

PS Architecture circumvents both limitations…

Page 22: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Spark + Parameter Server

Page 23: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

• Leverage Spark for HDFS I/O, distributed processing, fine-grained load balancing, failure recovery, in-memory operations

Spark + Parameter Server

Page 24: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

• Leverage Spark for HDFS I/O, distributed processing, fine-grained load balancing, failure recovery, in-memory operations

• Use PS to sync models, apply incremental updates during training, or sometimes even do some vector math.

Spark + Parameter Server

Page 25: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

• Leverage Spark for HDFS I/O, distributed processing, fine-grained load balancing, failure recovery, in-memory operations

• Use PS to sync models, apply incremental updates during training, or sometimes even do some vector math.

Spark + Parameter Server

[Figure: Driver controls multi-core Executors, which read from HDFS and talk to the PS shards]

Training state stored in PS shards

Page 26: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Yahoo PS

Page 27: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Yahoo PS

• Server
• Client API

Page 28: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Yahoo PS

• Server
• Client API
• Preallocated arrays (to minimize GC)

Page 29: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Yahoo PS

• Server
• Client API
• Preallocated arrays (to minimize GC)
• K-V store: in-memory; lock per key / lock-free; sync / async

Page 30: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Yahoo PS

• Server
• Client API
• Preallocated arrays (to minimize GC)
• K-V store: in-memory; lock per key / lock-free; sync / async
• Matrices: column-partitioned; supports BLAS

Page 31: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Yahoo PS

• Server
• Client API
• Preallocated arrays (to minimize GC)
• K-V store: in-memory; lock per key / lock-free; sync / async
• Matrices: column-partitioned; supports BLAS
• HDFS: export model, checkpoint

Page 32: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Yahoo PS

• Server
• Client API
• Preallocated arrays (to minimize GC)
• K-V store: in-memory; lock per key / lock-free; sync / async
• Matrices: column-partitioned; supports BLAS
• HDFS: export model, checkpoint
• UDFs: client-supplied aggregation, custom shard operations

Page 33: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Map PS API

• Distributed key-value store abstraction
• Supports batched operations in addition to the usual get and put
• Many operations return a future – you can operate asynchronously or block (see the sketch below)
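As a sketch, the shape of such an API in Scala might look like the following; the trait and method names are hypothetical, not the published Yahoo PS signatures.

```scala
import scala.concurrent.Future

// Illustrative only: a distributed key-value store abstraction with
// batched operations and future-returning calls, as the slide describes.
trait MapPS[K, V] {
  def get(key: K): Future[Option[V]]
  def put(key: K, value: V): Future[Unit]
  def multiGet(keys: Seq[K]): Future[Map[K, V]]   // batched get
  def multiPut(kvs: Map[K, V]): Future[Unit]      // batched put
  def increment(key: K, delta: V): Future[Unit]   // server-side additive update
}
```

Because every call returns a Future, a client can issue a batched multiGet, keep computing, and either block on the result or chain continuations.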

Page 34: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Matrix PS API

• Vector math (BLAS-style operations), in addition to everything the Map API provides
• Increment and fetch sparse vectors (e.g., for gradient aggregation)
• We use other custom operations on the shards (API not shown); a sketch follows
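Building on the hypothetical MapPS sketch above, a Matrix API might look like this; again, every name is illustrative, not the real Yahoo PS interface.

```scala
import scala.concurrent.Future

// Illustrative only: each row is a model vector, column-partitioned
// across shards, so the BLAS-style ops run where the data lives.
trait MatrixPS extends MapPS[Long, Array[Float]] {
  def dot(row1: Long, row2: Long): Future[Double]          // shard partials, summed
  def axpy(a: Float, xRow: Long, yRow: Long): Future[Unit] // y <- a*x + y on shards
  def scal(a: Float, xRow: Long): Future[Unit]             // x <- a*x on shards
  def incrementAndFetch(row: Long, indices: Array[Int],
                        deltas: Array[Float]): Future[Array[Float]] // sparse += then read
}
```

Here dot would be computed as partial dot products on each shard and summed at the client - the same pattern the word2vec section uses later.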

Page 35: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

EXAMPLES

Page 36: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Sponsored Search Advertising

Page 37: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Sponsored Search Advertising

Page 38: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Sponsored Search Advertising

[Figure: (query, user, ad, context) → features → model]

Example click model: logistic regression
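Spelled out, this is a standard logistic model over the joint feature vector (a textbook formulation, added here for reference):

```latex
\Pr(\text{click} \mid x) \;=\; \sigma\!\left(w^{\top} x\right)
\;=\; \frac{1}{1 + e^{-w^{\top} x}},
\qquad x = \text{features}(\text{query}, \text{user}, \text{ad}, \text{context})
```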

Page 39: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

L-BFGS Background

Page 40: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

L-BFGS Background

[Figure: optimization paths of Newton's method vs. Gradient Descent]

Using curvature information, you can converge faster…
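The contrast in update rules (standard formulas, added here for reference):

```latex
% Gradient descent: first-order step
w_{t+1} = w_t - \eta_t \,\nabla f(w_t)

% Newton's method: rescale the step by local curvature
w_{t+1} = w_t - \left[\nabla^2 f(w_t)\right]^{-1} \nabla f(w_t)

% L-BFGS: approximate the inverse Hessian from the last L pairs
s_k = w_{k+1} - w_k, \qquad y_k = \nabla f(w_{k+1}) - \nabla f(w_k)
```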

Page 41: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

L-BFGS Background

[Figure: Newton's method (exact, impractical) vs. Gradient Descent]

Using curvature information, you can converge faster…

Page 42: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

L-BFGS Background

Newton's method: exact, impractical. L-BFGS: approximate, practical.
– Step size computation: must satisfy some technical (Wolfe) conditions; adaptively determined from the data
– Inverse Hessian approximation: based on the history of the L previous gradients and model deltas

Using curvature information, you can converge faster than gradient descent…

Page 43: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

L-BFGS Background

Newton's method: exact, impractical. L-BFGS: approximate, practical.
– Step size computation: must satisfy some technical (Wolfe) conditions; adaptively determined from the data
– Inverse Hessian approximation: based on the history of the L previous gradients and model deltas

Using curvature information, you can converge faster than gradient descent…

Page 44: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

L-BFGS Background

Newton's method: exact, impractical. L-BFGS: approximate, practical.
– Step size computation: must satisfy some technical (Wolfe) conditions; adaptively determined from the data
– Inverse Hessian approximation: based on the history of the L previous gradients and model deltas

Using curvature information, you can converge faster than gradient descent…

Page 45: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

L-BFGS Background

Newton's method: exact, impractical. L-BFGS: approximate, practical.
– Step size computation: must satisfy some technical (Wolfe) conditions; adaptively determined from the data
– Inverse Hessian approximation: based on the history of the L previous gradients and model deltas

Using curvature information, you can converge faster than gradient descent…

Vector math used by the L-BFGS updates: dotprod, axpy (y ← ax + y), copy, scal
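These four primitives are exactly what the standard L-BFGS two-loop recursion needs. A self-contained sketch using plain local arrays (in the system described in this talk, the same operations run shard-side on the PS):

```scala
// Standard L-BFGS two-loop recursion built from the slide's primitives:
// dotprod, axpy (y <- a*x + y), copy, scal.
def dotprod(x: Array[Double], y: Array[Double]): Double = {
  var acc = 0.0; var i = 0
  while (i < x.length) { acc += x(i) * y(i); i += 1 }
  acc
}
def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
  var i = 0; while (i < x.length) { y(i) += a * x(i); i += 1 }
}
def scal(a: Double, x: Array[Double]): Unit = {
  var i = 0; while (i < x.length) { x(i) *= a; i += 1 }
}

/** grad: current gradient; s(i) = model deltas, y(i) = gradient deltas,
  * oldest first. Returns the descent direction -H_inv * grad. */
def twoLoop(grad: Array[Double],
            s: IndexedSeq[Array[Double]],
            y: IndexedSeq[Array[Double]]): Array[Double] = {
  val q = grad.clone()                                   // copy
  val rho = s.indices.map(i => 1.0 / dotprod(y(i), s(i)))
  val alpha = new Array[Double](s.length)
  for (i <- s.indices.reverse) {                         // first loop: newest -> oldest
    alpha(i) = rho(i) * dotprod(s(i), q)
    axpy(-alpha(i), y(i), q)
  }
  // Initial inverse-Hessian scaling from the most recent curvature pair
  scal(dotprod(s.last, y.last) / dotprod(y.last, y.last), q)
  for (i <- s.indices) {                                 // second loop: oldest -> newest
    val beta = rho(i) * dotprod(y(i), q)
    axpy(alpha(i) - beta, s(i), q)
  }
  scal(-1.0, q)                                          // negate: descent direction
  q
}
```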

Page 46: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

L-BFGS Background

Newton's method: exact, impractical. L-BFGS: approximate, practical.
– Step size computation: must satisfy some technical (Wolfe) conditions; adaptively determined from the data
– Inverse Hessian approximation: based on the history of the L previous gradients and model deltas

Using curvature information, you can converge faster than gradient descent…

Page 47: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed L-BFGS*

[Figure: Driver coordinates Executors (reading training data from HDFS) and PS shards (holding the state vectors)]

Step 1: Compute and update gradient
• Driver: coordinates executors
• Executors: compute gradient and loss
• PS: 1. incremental sparse gradient update; 2. fetch sparse portions of model

*Our design is very similar to Sandblaster L-BFGS; Jeff Dean et al., Large Scale Distributed Deep Networks (2012)


Page 49: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed L-BFGS

[Figure: Driver coordinates the PS shards]

Step 2: Build inverse Hessian approximation
• Driver: coordinates the PS for performing L-BFGS updates
• PS: actual L-BFGS updates (BLAS vector math)

Page 50: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed L-BFGS

[Figure: Driver coordinates Executors and PS shards]

Step 3: Compute losses and directional derivatives
• Driver: coordinates executors
• Executors: compute directional derivatives for the parallel line search
• PS: fetch sparse portions of the model

Page 51: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed L-BFGS

[Figure: Driver and PS shards]

Step 4: Line search and model update
• Driver: performs the line search and coordinates the model update
• PS: model update (BLAS vector math)

A driver-side skeleton of the four steps follows.
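Every name in this skeleton is illustrative (the talk does not publish these interfaces); the stubs stand in for the executor- and shard-side work shown in steps 1-4.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative skeleton of one distributed L-BFGS iteration (steps 1-4).
case class Example(indices: Array[Int], label: Float)

trait LbfgsPS extends Serializable {
  def buildDirection(): Unit           // step 2: two-loop recursion on the shards
  def applyUpdate(step: Double): Unit  // step 4: model <- model + step * direction
}

// Stubs for the distributed work sketched on the previous slides.
def computeAndPushGradient(ps: LbfgsPS, part: Iterator[Example]): Double = ???
def directionalStats(ps: LbfgsPS, data: RDD[Example], t: Double): (Double, Double) = ???
def chooseWolfeStep(stats: Seq[(Double, (Double, Double))]): Double = ???

def lbfgsIteration(sc: SparkContext, ps: LbfgsPS, data: RDD[Example]): Unit = {
  // Step 1: executors fetch sparse model slices and push sparse gradient
  // increments to the PS; only scalar losses return to the driver.
  val loss = data.mapPartitions(p => Iterator(computeAndPushGradient(ps, p))).sum()

  // Step 2: the driver only coordinates; shards do the BLAS math in place.
  ps.buildDirection()

  // Step 3: evaluate several candidate step sizes in one distributed pass
  // (parallel line search: loss and directional derivative per step).
  val steps = Seq(0.25, 0.5, 1.0)
  val stats = steps.map(t => (t, directionalStats(ps, data, t)))

  // Step 4: pick a step satisfying the Wolfe conditions; the PS applies it.
  ps.applyUpdate(chooseWolfeStep(stats))
}
```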

Page 52: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Speedup tricks

Page 53: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Speedup tricks

• Intersperse communication and computation

Page 54: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Speedup tricks

• Intersperse communication and computation
• Quicker convergence
  – Parallel line search for step size
  – Curvature for initial Hessian approximation*

*borrowed from vowpal wabbit

Page 55: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Speedup tricks

• Intersperse communication and computation
• Quicker convergence
  – Parallel line search for step size
  – Curvature for initial Hessian approximation*
• Network bandwidth reduction
  – Compressed integer arrays
  – Only store indices for binary data

*borrowed from vowpal wabbit

Page 56: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Speedup tricks

• Intersperse communication and computation
• Quicker convergence
  – Parallel line search for step size
  – Curvature for initial Hessian approximation*
• Network bandwidth reduction
  – Compressed integer arrays
  – Only store indices for binary data
• Matrix math on minibatch

*borrowed from vowpal wabbit

Page 57: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Speedup tricks

• Intersperse communication and computation
• Quicker convergence
  – Parallel line search for step size
  – Curvature for initial Hessian approximation*
• Network bandwidth reduction (sketched after the chart below)
  – Compressed integer arrays
  – Only store indices for binary data
• Matrix math on minibatch

[Bar chart: time (in seconds) per epoch vs. feature size in millions (10, 20, 100), comparing MLlib with PS + Spark; 1.6 × 10^8 examples, 100 executors, 10 cores. PS + Spark is dramatically faster at every feature size.]

*borrowed from vowpal wabbit
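The bandwidth items above are easy to picture: for binary features, an example's gradient support is just a set of indices, and a sorted index set compresses well with delta plus variable-length encoding. A minimal sketch of that idea (illustrative; not necessarily the codec actually used):

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Delta + varint compression of a sorted feature-index array: values are
// omitted entirely for binary data, and small gaps cost only one byte.
def encodeIndices(sortedIndices: Array[Int]): Array[Byte] = {
  val buf = new ByteArrayOutputStream()
  val out = new DataOutputStream(buf)
  var prev = 0
  for (idx <- sortedIndices) {
    var gap = idx - prev                  // encode the gap, not the index
    prev = idx
    while ((gap & ~0x7f) != 0) {          // 7 payload bits per byte,
      out.writeByte((gap & 0x7f) | 0x80)  // high bit marks continuation
      gap >>>= 7
    }
    out.writeByte(gap)
  }
  out.flush()
  buf.toByteArray
}
```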

Page 58: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word Embeddings

Page 59: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word Embeddings

Page 60: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word Embeddings

v(paris) = [0.13, -0.4, 0.22, …, -0.45]
v(lion) = [-0.23, -0.1, 0.98, …, 0.65]
v(quark) = [1.4, 0.32, -0.01, …, 0.023]
…

Page 61: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word2vec

Page 62: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word2vec

Page 63: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word2vec

• New techniques to compute vector representations of words from a corpus

Page 64: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word2vec

• New techniques to compute vector representations of words from a corpus

• Geometry of the vectors captures word semantics

Page 65: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word2vec

Page 66: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word2vec

• Skipgram with negative sampling:

Page 67: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word2vec

• Skipgram with negative sampling:
  – the training set includes pairs of words and neighbors in the corpus, along with randomly selected words for each neighbor

Page 68: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word2vec

• Skipgram with negative sampling:
  – the training set includes pairs of words and neighbors in the corpus, along with randomly selected words for each neighbor
  – determine vectors u(w), v(w) for each word w so that sigmoid(u(w)•v(w′)) is close to (minimizes the log loss of) the probability that w′ is a neighbor of w as opposed to a randomly selected word

Page 69: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word2vec

• Skipgram with negative sampling:
  – the training set includes pairs of words and neighbors in the corpus, along with randomly selected words for each neighbor
  – determine vectors u(w), v(w) for each word w so that sigmoid(u(w)•v(w′)) is close to (minimizes the log loss of) the probability that w′ is a neighbor of w as opposed to a randomly selected word
  – SGD involves computing many vector dot products, e.g., u(w)•v(w′), and vector linear combinations, e.g., u(w) += α v(w′) (spelled out below)
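In symbols, the standard skipgram-with-negative-sampling objective and one SGD step on a positive pair (textbook form, added here for reference):

```latex
% For a word w, a true neighbor w', and k negative samples w_1,...,w_k:
\ell(w, w') = -\log \sigma\big(u(w)\cdot v(w')\big)
              - \sum_{i=1}^{k} \log \sigma\big(-u(w)\cdot v(w_i)\big),
\qquad \sigma(z) = \frac{1}{1+e^{-z}}

% One SGD step on the positive pair is a dot product plus an axpy:
u(w) \mathrel{+}= \alpha\, v(w'), \quad
v(w') \mathrel{+}= \alpha\, u(w), \quad
\alpha = \eta\big(1 - \sigma(u(w)\cdot v(w'))\big)
```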

Page 70: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Word2vec Application at Yahoo

• Example training data:
gas_cap_replacement_for_car slc_679f037df54f5d9c41cab05bfae0926 gas_door_replacement_for_car slc_466145af16a40717c84683db3f899d0a fuel_door_covers adid_c_28540527225_285898621262 slc_348709d73214fdeb9782f8b71aff7b6e autozone_auto_parts adid_b_3318310706_280452370893 auoto_zone slc_8dcdab5d20a2caa02b8b1d1c8ccbd36b slc_58f979b6deb6f40c640f7ca8a177af2d

[Grbovic et al., SIGIR 2015 and SIGIR 2016 (to appear)]

Page 71: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

Page 72: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

• Needed a system to train a 200-million-vocabulary, 300-dimensional word2vec model using minibatch SGD

Page 73: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

• Needed a system to train a 200-million-vocabulary, 300-dimensional word2vec model using minibatch SGD
• Achieved this in a high-throughput, network-efficient way using our matrix-based PS server:

Page 74: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

• Needed a system to train a 200-million-vocabulary, 300-dimensional word2vec model using minibatch SGD
• Achieved this in a high-throughput, network-efficient way using our matrix-based PS server:
  – Vectors don't go over the network.

Page 75: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

• Needed a system to train a 200-million-vocabulary, 300-dimensional word2vec model using minibatch SGD
• Achieved this in a high-throughput, network-efficient way using our matrix-based PS server:
  – Vectors don't go over the network.
  – Most compute is on the PS servers, with clients aggregating partial results from the shards.

Page 76: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

[Figure: word2vec learners communicating with PS shards (each holding V1 row, V2 row, …, Vn row); training data from HDFS]

Page 77: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

[Figure: word2vec learners communicating with PS shards (each holding V1 row, V2 row, …, Vn row); training data from HDFS]

Each shard stores a part of every vector

Page 78: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

[Figure: learners send word indices and seeds to the PS shards]

Page 79: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

[Figure: shards do the negative sampling and compute the partial dot products u•v]

Page 80: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

[Figure: learners aggregate the partial results and compute the linear-combination coefficients (e.g., α…)]

Page 81: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

[Figure: shards update their portions of the vectors (v += αu, …)]
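Concretely, column partitioning means a dot product u(w)•v(w′) decomposes into per-shard partial sums over each shard's slice of dimensions, so only word indices, seeds, partial dot products, and scalar coefficients cross the network; the vectors themselves never do. A minimal sketch (all names hypothetical; a real shard would need Long-indexed storage for a 200M-row matrix):

```scala
// Illustrative column-partitioned shard: it owns dimensions [lo, hi) of
// every vector, so partial dot products and linear-combination updates
// are computed entirely locally.
class Shard(lo: Int, hi: Int, vocab: Int) {
  private val width = hi - lo
  private val data = new Array[Float](vocab * width) // preallocated, row-major

  def partialDot(row1: Int, row2: Int): Float = {
    var acc = 0f; var i = 0
    while (i < width) {
      acc += data(row1 * width + i) * data(row2 * width + i); i += 1
    }
    acc
  }
  def axpyRows(alpha: Float, src: Int, dst: Int): Unit = { // dst += alpha * src
    var i = 0
    while (i < width) { data(dst * width + i) += alpha * data(src * width + i); i += 1 }
  }
}

// Client side: the full dot product is just the sum of shard partials.
def fullDot(shards: Seq[Shard], u: Int, v: Int): Float =
  shards.map(_.partialDot(u, v)).sum
```

Shipping seeds rather than sampled word lists lets every shard regenerate the same negative samples locally, which is why the learners send only indices and seeds.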

Page 82: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

Page 83: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

• Network traffic is lower by a factor of #shards/dimension compared to a conventional PS-based system (1/20 to 1/100 for useful scenarios); e.g., with 300-dimensional vectors and 15 shards, that factor is 15/300 = 1/20.

Page 84: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

• Network traffic is lower by a factor of #shards/dimension compared to a conventional PS-based system (1/20 to 1/100 for useful scenarios).
• Trains a 200-million-word vocabulary on 55 billion words of search sessions in 2.5 days.

Page 85: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Distributed Word2vec

• Network traffic is lower by a factor of #shards/dimension compared to a conventional PS-based system (1/20 to 1/100 for useful scenarios).
• Trains a 200-million-word vocabulary on 55 billion words of search sessions in 2.5 days.
• In production for regular training in the Yahoo search ad serving system.

Page 86: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Other Projects using Spark + PS

• Online learning on PS
  – Personalization as a Service
  – Sponsored Search

• Factorization Machines
  – Large scale user profiling

Page 87: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

SPARK+PS ON HADOOP CLUSTER

Page 88: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Training Data on HDFS

HDFS

Page 89: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Launch PS Using Apache Slider on YARN

HDFS

YARN

Apache Slider

Parameter Servers

Page 90: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Launch Clients using Spark or Hadoop Streaming API

HDFS

YARN

Apache Slider

Parameter Servers

Apache Spark

Clients

Page 91: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Training

HDFS

YARN

Apache Slider

Parameter Servers

Apache Spark

Clients

Page 92: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Model Export

HDFS

YARN

Apache Slider

Parameter Server

Apache Spark

Clients


Page 94: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Summary

• Parameter server is indispensable for big models
• Spark + Parameter Server has proved to be a very flexible platform for our large scale computing needs
• Direct computation on the parameter servers accelerates training for our use cases

Page 95: Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Thank you! For more, contact [email protected].