08 distributed optimization



Motivation

• Given a trained model, inference / prediction is easy to distribute

• Not a full-blown “Big Data” problem

• What about model training in the face of Big (training) Data?

• Distributed training needed!

• Under the hood: ML as optimisation


ML and optimisation

‘Big Data’ ML:

• high training sample volumes

• high-dimensional data

• distributed data: collection, storage

ML methods are based on optimisation:

• write ML as a (typically convex) optimisation problem

• optimise.


Problem formalization

Problem:

• minimize J(𝜃), 𝜃 ∈ ℝ^d

• subject to J_i(𝜃) ≤ b_i, i = 1, …, m

with

• 𝜃 = (𝜃_1, …, 𝜃_d) ∈ ℝ^d the optimisation variable

• J : ℝ^d → ℝ the objective function

• J_i : ℝ^d → ℝ, i = 1, …, m the constraints

• constants b_1, …, b_m the bounds for the constraints.
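A concrete instance of this template (an illustrative example added here, not taken from the slides) is ridge regression in its constrained form: a convex least-squares objective with a single norm constraint.

```latex
% Constrained least squares (ridge regression) written in the generic form above;
% J is convex and there is one constraint function J_1 with bound b_1.
\[
\begin{aligned}
  \text{minimize}   \quad & J(\theta) = \tfrac{1}{n} \sum_{i=1}^{n} \bigl( y^{(i)} - \theta^{\top} x^{(i)} \bigr)^{2},
    \qquad \theta \in \mathbb{R}^{d} \\
  \text{subject to} \quad & J_1(\theta) = \lVert \theta \rVert_{2}^{2} \;\le\; b_1 .
\end{aligned}
\]
```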


Gradient descent

• Update the parameters in the opposite direction of the gradient of the objective function ∇_𝜃 J(𝜃) w.r.t. the parameters.

• The learning rate 𝜂 determines the size of the steps we take to reach a (local) minimum.

• We follow the direction of the slope of the surface created by the objective function downhill until we reach a valley. [NOTE: heavily based on Sebastian Ruder’s “An overview of gradient descent optimization algorithms”, 19 Jan 2016]


Batch gradient descent

• Idea: compute the gradient of the objective over the entire training set for each update; depending on the amount of data, this trades off the accuracy of the parameter update against the time it takes to perform it.

• Update: 𝜃 = 𝜃 - 𝜂 ∙ ∇_𝜃 J(𝜃)
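A minimal Python/NumPy sketch of this update, assuming (purely for illustration) a least-squares objective; the synthetic data, learning rate and epoch count are not from the slides:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_epochs=100):
    """Batch GD on the (assumed) least-squares objective J(theta) = 0.5 * mean((X @ theta - y) ** 2)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        # Gradient of J w.r.t. theta, computed over the ENTIRE training set.
        grad = X.T @ (X @ theta - y) / len(y)
        # theta = theta - eta * grad_theta J(theta)
        theta = theta - eta * grad
    return theta

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=200)
print(batch_gradient_descent(X, y))   # should approach [1, -2, 0.5]
```

Every update touches the full training set once, which is exactly what becomes expensive with Big (training) Data.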


Stochastic gradient descent

• Idea: perform a parameter update for each training example x^(i) and label y^(i)

• Update: 𝜃 = 𝜃 - 𝜂 ∙ ∇_𝜃 J(𝜃; x^(i), y^(i))

• Batch gradient descent performs redundant computations for large datasets (it recomputes gradients for similar examples before each update); SGD avoids this by performing one update at a time
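A matching per-example SGD sketch under the same illustrative least-squares setup as above; the per-epoch shuffling and the hyperparameter values are assumptions, not from the slides:

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=20, seed=0):
    """SGD on the illustrative least-squares objective: one update per example (x_i, y_i)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):         # visit the training examples in random order
            x_i, y_i = X[i], y[i]
            grad_i = (x_i @ theta - y_i) * x_i    # gradient of J(theta; x_i, y_i) = 0.5 * (x_i . theta - y_i)^2
            theta = theta - eta * grad_i          # theta = theta - eta * grad_theta J(theta; x_i, y_i)
    return theta
```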


Momentum gradient descent

• Idea: overcome oscillations in ravines (areas where the surface curves much more steeply in one dimension than in another) by adding a momentum term

• Update:

• v_t = 𝛾 v_{t-1} + 𝜂 ∙ ∇_𝜃 J(𝜃)

• 𝜃 = 𝜃 - v_t
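A minimal sketch of the momentum update; the quadratic "ravine" test objective and the hyperparameter values are illustrative assumptions:

```python
import numpy as np

def momentum_update(theta, v, grad, eta=0.005, gamma=0.9):
    """One momentum step: v_t = gamma * v_{t-1} + eta * grad; theta = theta - v_t."""
    v = gamma * v + eta * grad        # decaying accumulation of past gradients
    return theta - v, v

# Illustrative use on J(theta) = 0.5 * theta @ A @ theta, an ill-conditioned ("ravine") quadratic.
A = np.diag([1.0, 100.0])
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, v = momentum_update(theta, v, A @ theta)
print(theta)   # approaches the minimum at the origin
```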


Nesterov accelerated gradient

• Idea: 1. make a big jump in the direction of the previously accumulated gradient, then 2. measure the gradient at that look-ahead point and make a correction.

• Update:

• v_t = 𝛾 v_{t-1} + 𝜂 ∙ ∇_𝜃 J(𝜃 - 𝛾 v_{t-1})

• 𝜃 = 𝜃 - v_t
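A minimal sketch of the Nesterov update, differing from the momentum sketch only in where the gradient is evaluated; the test objective and hyperparameters are again illustrative:

```python
import numpy as np

def nag_update(theta, v, grad_fn, eta=0.005, gamma=0.9):
    """One Nesterov step: the gradient is evaluated at the look-ahead point theta - gamma * v_{t-1}."""
    v = gamma * v + eta * grad_fn(theta - gamma * v)   # v_t = gamma*v_{t-1} + eta*grad J(theta - gamma*v_{t-1})
    return theta - v, v                                # theta = theta - v_t

# Illustrative use on the same ill-conditioned quadratic as in the momentum sketch.
A = np.diag([1.0, 100.0])
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, v = nag_update(theta, v, lambda th: A @ th)
print(theta)
```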


Adagrad

• Idea: larger updates for infrequent and smaller updates for frequent parameters.

• Update: let g_{t,i} = ∇_𝜃 J(𝜃_{t,i}) be the gradient w.r.t. parameter 𝜃_i at time step t, and write 𝜃_{t+1} = 𝜃_t + 𝛥𝜃_t. Then:

• SGD: 𝛥𝜃_t = - 𝜂 ∙ g_t

• Adagrad: 𝛥𝜃_t = - 𝜂 / √(G_t + ϵ) ⊙ g_t, with G_t ∈ ℝ^{d⨉d} a diagonal matrix whose diagonal element (i, i) is the sum of the squares of the gradients w.r.t. 𝜃_i up to time step t, and ⊙ the element-wise matrix-vector multiplication.
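A minimal sketch of the Adagrad step; only the diagonal of G_t is stored (here as a vector), and the hyperparameter values are illustrative:

```python
import numpy as np

def adagrad_update(theta, G_diag, grad, eta=0.01, eps=1e-8):
    """One Adagrad step; G_diag is the diagonal of G_t, the running sum of squared gradients."""
    G_diag = G_diag + grad ** 2                         # accumulate squared gradients per parameter
    theta = theta - eta / np.sqrt(G_diag + eps) * grad  # per-parameter (adaptive) learning rates
    return theta, G_diag

# Illustrative use: rarely-updated parameters keep a larger effective learning rate.
theta, G_diag = np.zeros(3), np.zeros(3)
theta, G_diag = adagrad_update(theta, G_diag, grad=np.array([2.0, 0.0, 0.1]))
print(theta)
```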


Adadelta

• Idea: instead of accumulating all past squared gradients, restrict the window of accumulated past gradients to some fixed size w.

• The accumulation is defined recursively as decaying averages of past squared gradients and past squared updates: E[g²]_t = 𝛾 E[g²]_{t-1} + (1-𝛾) g_t² and E[𝛥𝜃²]_t = 𝛾 E[𝛥𝜃²]_{t-1} + (1-𝛾) 𝛥𝜃_t²

• Update: we replace the diagonal matrix G_t with the decaying average over past squared gradients E[g²]_t, giving 𝛥𝜃_t = - RMS[𝛥𝜃]_{t-1} / RMS[g]_t ⊙ g_t, where RMS[x]_t = √(E[x²]_t + ϵ)
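A minimal sketch of the Adadelta step following the two decaying averages above; the 𝛾 and ϵ values are illustrative:

```python
import numpy as np

def adadelta_update(theta, Eg2, Edx2, grad, gamma=0.9, eps=1e-6):
    """One Adadelta step: decaying averages of squared gradients and squared updates, no global learning rate."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2               # E[g^2]_t
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad     # Delta theta_t = -RMS[dx]_{t-1} / RMS[g]_t * g_t
    Edx2 = gamma * Edx2 + (1 - gamma) * dx ** 2               # E[dx^2]_t
    return theta + dx, Eg2, Edx2

# Illustrative use: theta and both accumulators share the same shape.
theta, Eg2, Edx2 = np.zeros(3), np.zeros(3), np.zeros(3)
theta, Eg2, Edx2 = adadelta_update(theta, Eg2, Edx2, grad=np.array([2.0, 0.0, 0.1]))
```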


RMSprop

• Idea: use the first update rule of Adadelta, i.e. divide the learning rate by a decaying average of squared gradients

• Update:

• E[g²]_t = 0.9 E[g²]_{t-1} + 0.1 g_t²

• 𝛥𝜃_t = - 𝜂 / √(E[g²]_t + ϵ) ⊙ g_t
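A minimal sketch of the RMSprop step with the fixed 0.9 / 0.1 decay from the slide; the learning rate and ϵ values are illustrative:

```python
import numpy as np

def rmsprop_update(theta, Eg2, grad, eta=0.001, eps=1e-8):
    """One RMSprop step with the fixed decay from the slide."""
    Eg2 = 0.9 * Eg2 + 0.1 * grad ** 2                 # E[g^2]_t = 0.9 E[g^2]_{t-1} + 0.1 g_t^2
    theta = theta - eta / np.sqrt(Eg2 + eps) * grad   # Delta theta_t = -eta / sqrt(E[g^2]_t + eps) * g_t
    return theta, Eg2
```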


Visualization and comparison

Adagrad, Adadelta, and RMSprop almost immediately head off in the right direction and converge similarly fast, while Momentum and NAG are led off-track, evoking the image of a ball rolling down the hill. NAG, however, is quickly able to correct its course due to its increased responsiveness by looking ahead, and heads to the minimum.


Conclusions

• Big Data ML requires (scalable, distributed) algorithms that process training points in small batches, performing effective incremental updates to the model (see the mini-batch sketch after this list)

• Final objective: a closed loop that trains models, compares them recursively

• Key challenge: evaluation metrics in the face of available resources (including data)
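A minimal sketch of the mini-batch processing mentioned in the first conclusion, under the same illustrative least-squares setup used earlier; the batch size and other hyperparameters are assumptions:

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.05, batch_size=32, n_epochs=20, seed=0):
    """Mini-batch SGD on the illustrative least-squares objective: incremental updates from small batches."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)   # gradient estimate from one small batch
            theta = theta - eta * grad                   # incremental model update
    return theta
```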