08 distributed optimization
Distributed optimisation
Motivation
• Given a trained model, ML prediction is easy to distribute: it is not a full-blown “Big Data” problem.
• What about model training in the face of Big (training) Data?
• Distributed training is needed!
• Under the hood: ML as optimisation.
ML and optimisation
‘Big Data’ ML:
• high training-sample volumes
• high-dimensional data
• distributed data collection and storage
Methods are based on optimisation:
• write ML as a (typically convex) optimisation problem
• optimise.
Problem formalization
Problem:
• minimize J(𝜃), 𝜃 ∈ ℝd
• subject to Ji(𝜃) ≤ bi, i = 1, …, m
with
• 𝜃 = (𝜃1, …, 𝜃d) ∈ ℝd the optimisation variable
• J : ℝd → ℝ the objective function
• Ji : ℝd → ℝ, i = 1, …, m the constraint functions
• constants b1, …, bm the bounds for the constraints.
Gradient descent
• Update the parameters in the direction opposite to the gradient of the objective function w.r.t. the parameters, ∇𝜃J(𝜃).
• The learning rate 𝜂 determines the size of the steps we take to reach a (local) minimum.
• We follow the downhill slope of the surface created by the objective function until we reach a valley. [NOTE: heavily based on Sebastian Ruder’s “An overview of gradient descent optimization algorithms”, 19 Jan 2016]
Batch gradient descent
• Idea: compute the gradient of the objective over the entire training set for each update; depending on the amount of data, this trades off the accuracy of the parameter update against the time it takes to perform one.
• Update: 𝜃 = 𝜃 - 𝜂 ∙ ∇𝜃J(𝜃)
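As a minimal sketch (not from the slides), the full-batch update on a hypothetical one-parameter least-squares fit, with data generated as y = 2x:

```python
# Batch gradient descent on a toy least-squares problem (illustrative setup):
# fit y = theta * x to data generated with theta = 2.
# J(theta) = (1/n) * sum((theta*x_i - y_i)^2), so
# grad J  = (2/n) * sum(x_i * (theta*x_i - y_i)).

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]            # ground truth: theta = 2

theta = 0.0
eta = 0.05                            # learning rate

for _ in range(50):
    # One update uses the gradient over the ENTIRE training set.
    grad = (2.0 / len(xs)) * sum(x * (theta * x - y) for x, y in zip(xs, ys))
    theta -= eta * grad

print(theta)
```

Each iteration touches every training point once, which is exactly what becomes expensive when the training set is large.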
Stochastic gradient descent
• Idea: perform a parameter update for each training example x(i) and label y(i)
• Update: 𝜃 = 𝜃 - 𝜂 ∙ ∇𝜃J(𝜃; x(i), y(i))
• Avoids the redundant computations that batch gradient descent performs for large datasets, at the price of noisier updates
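The same hypothetical least-squares fit as a sketch with per-example updates (shuffling order is an illustrative choice, not specified on the slide):

```python
import random

# Stochastic gradient descent: one parameter update per training
# example (x_i, y_i), on the toy problem y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]

theta = 0.0
eta = 0.01
rng = random.Random(0)

data = list(zip(xs, ys))
for _ in range(30):                        # epochs
    rng.shuffle(data)                      # visit examples in random order
    for x, y in data:
        grad = 2.0 * x * (theta * x - y)   # gradient of (theta*x - y)^2
        theta -= eta * grad                # update after EACH example

print(theta)
```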
Momentum gradient descent
• Idea: damp oscillations across the slopes of a ravine by adding a fraction 𝛾 of the previous update vector (momentum)
• Update:
• vt = 𝛾 vt-1 + 𝜂 ∙ ∇𝜃J(𝜃)
• 𝜃 = 𝜃 - vt
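The two update lines above, sketched on a hypothetical 1-D quadratic J(𝜃) = (𝜃 - 3)²:

```python
# Momentum gradient descent on J(theta) = (theta - 3)^2
# (illustrative objective; its gradient is 2 * (theta - 3)).
theta = 0.0
v = 0.0
eta, gamma = 0.05, 0.9               # learning rate, momentum term

for _ in range(300):
    grad = 2.0 * (theta - 3.0)
    v = gamma * v + eta * grad       # v_t = gamma * v_{t-1} + eta * grad
    theta -= v                       # theta = theta - v_t

print(theta)
```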
Nesterov accelerated gradient
• Idea: 1. make a big jump in the direction of the previous accumulated gradient, then 2. measure the gradient at that look-ahead point and make a correction.
• Update:
• vt = 𝛾 vt-1 + 𝜂 ∙ ∇𝜃J(𝜃-𝛾 vt-1)
• 𝜃 = 𝜃 - vt
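A sketch of the look-ahead evaluation on the same hypothetical quadratic J(𝜃) = (𝜃 - 3)²; the only change from plain momentum is where the gradient is measured:

```python
# Nesterov accelerated gradient on J(theta) = (theta - 3)^2.
theta = 0.0
v = 0.0
eta, gamma = 0.05, 0.9

for _ in range(300):
    lookahead = theta - gamma * v        # 1. big jump first...
    grad = 2.0 * (lookahead - 3.0)       # 2. ...measure the gradient there
    v = gamma * v + eta * grad           # v_t = gamma*v_{t-1} + eta*grad(theta - gamma*v_{t-1})
    theta -= v

print(theta)
```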
Adagrad
• Idea: larger updates for infrequent and smaller updates for frequent parameters.
• Update: let gt = ∇𝜃J(𝜃t) and 𝜃t+1 = 𝜃t + 𝛥𝜃t. Then:
• SGD: 𝛥𝜃t = - 𝜂 ∙ gt
• Adagrad: 𝛥𝜃t = - 𝜂 / √(Gt + ϵ) ⊙ gt
with Gt ∈ ℝd⨉d a diagonal matrix in which each diagonal element (i, i) is the sum of the squares of the gradients w.r.t. 𝜃i up to time step t, and ⊙ element-wise matrix–vector multiplication.
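In 1-D the diagonal matrix Gt collapses to a running sum of squared gradients; a sketch on the hypothetical quadratic J(𝜃) = (𝜃 - 3)²:

```python
import math

# Adagrad in 1-D on J(theta) = (theta - 3)^2 (illustrative objective).
# G accumulates squared gradients, so step sizes shrink most for
# parameters that have already received large gradients.
theta = 0.0
G = 0.0
eta, eps = 1.0, 1e-8

for _ in range(500):
    g = 2.0 * (theta - 3.0)
    G += g * g                             # running sum of squared gradients
    theta -= eta / math.sqrt(G + eps) * g  # per-parameter adaptive step

print(theta)
```

The monotonically growing G is exactly the weakness Adadelta addresses next: the effective learning rate can only shrink.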
Adadelta
• Idea: instead of accumulating all past squared gradients, restrict the window of accumulated past gradients to some fixed size w.
• The sum of squared gradients is replaced by a recursively defined decaying average, E[g2]t = 𝛾 E[g2]t-1 + (1-𝛾) gt2, and similarly for the squared updates: E[𝛥𝜃2]t = 𝛾 E[𝛥𝜃2]t-1 + (1-𝛾) 𝛥𝜃t2
• Update: replacing the diagonal matrix Gt with the decaying average over past squared gradients E[g2]t, and the learning rate with the RMS of past updates: 𝛥𝜃t = - RMS[𝛥𝜃]t-1 / RMS[g]t ⊙ gt
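A 1-D sketch on the same hypothetical quadratic J(𝜃) = (𝜃 - 3)² (the decay rate 𝜌 = 0.9 and ϵ are illustrative choices); note there is no learning rate at all:

```python
import math

# Adadelta in 1-D on J(theta) = (theta - 3)^2 (illustrative objective).
# The step is scaled by the ratio of the RMS of past updates to the
# RMS of past gradients; no global learning rate is needed.
theta = 0.0
E_g2 = 0.0      # decaying average of squared gradients
E_dx2 = 0.0     # decaying average of squared updates
rho, eps = 0.9, 1e-6

for _ in range(5000):
    g = 2.0 * (theta - 3.0)
    E_g2 = rho * E_g2 + (1.0 - rho) * g * g
    # delta_theta_t = - RMS[delta_theta]_{t-1} / RMS[g]_t * g_t
    dx = -math.sqrt(E_dx2 + eps) / math.sqrt(E_g2 + eps) * g
    E_dx2 = rho * E_dx2 + (1.0 - rho) * dx * dx
    theta += dx

print(theta)
```

Because RMS[𝛥𝜃] starts near √ϵ, the first steps are tiny and the method needs more iterations on this toy problem than the other sketches; it then hovers close to the minimum.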
RMSprop
• Idea: keep only Adadelta’s decaying average of squared gradients (its first update rule), combined with a learning rate 𝜂
• Update:
• E[g2]t = 0.9 E[g2]t-1 + 0.1 gt2
• 𝛥𝜃t = - 𝜂 / √(E[g2]t + ϵ) ⊙ gt
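The two lines above as a sketch, again on the hypothetical quadratic J(𝜃) = (𝜃 - 3)²:

```python
import math

# RMSprop in 1-D on J(theta) = (theta - 3)^2 (illustrative objective):
# a decaying average of squared gradients with an explicit learning rate.
theta = 0.0
E_g2 = 0.0
eta, eps = 0.01, 1e-8

for _ in range(2000):
    g = 2.0 * (theta - 3.0)
    E_g2 = 0.9 * E_g2 + 0.1 * g * g            # E[g^2]_t
    theta -= eta / math.sqrt(E_g2 + eps) * g   # adaptive step

print(theta)
```

Since g/√E[g²] is roughly ±1 once the average has warmed up, the effective step size stays near 𝜂, so the iterate approaches the minimum and then oscillates within about 𝜂 of it.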
Visualization and comparison
Adagrad, Adadelta, and RMSprop almost immediately head off in the right direction and converge similarly fast, while Momentum and NAG are led off-track, evoking the image of a ball rolling down the hill. NAG, however, is quickly able to correct its course, due to its increased responsiveness from looking ahead, and heads to the minimum.
Conclusions
• Big Data ML requires (scalable, distributed) algorithms that process training points in small batches, performing effective incremental updates to the model
• Final objective: a closed loop that trains models and compares them recursively
• Key challenge: evaluation metrics in the face of available resources (including data)