Gradient Descent – Brown University (cs.brown.edu/courses/cs195w/slides/graddes.pdf)
Michail Michailidis & Patrick Maiden
Gradient Descent
• Motivation
• Gradient Descent Algorithm
  ▫ Issues & Alternatives
• Stochastic Gradient Descent
• Parallel Gradient Descent
• HOGWILD!
Outline
• It is good for finding global minima/maxima if the function is convex
• It is good for finding local minima/maxima if the function is not convex
• It is used for optimizing many models in machine learning, in conjunction with:
  ▫ Neural Networks
  ▫ Linear Regression
  ▫ Logistic Regression
  ▫ the Back-propagation algorithm
  ▫ Support Vector Machines
Motivation
Function Example
• Derivative
• Partial Derivative
• Gradient Vector
Quickest-ever review of multivariate calculus
• Slope of the tangent line
• Easy when a function is univariate
Derivative
f(x) = x²
f′(x) = df/dx = 2x
f″(x) = d²f/dx² = 2
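As a quick numeric sanity check (my example, not from the slides), the derivative f′(x) = 2x can be verified with a central-difference approximation:

```python
def f(x):
    return x ** 2

def numeric_derivative(f, x, h=1e-6):
    # central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# f'(3) should be 2 * 3 = 6
approx = numeric_derivative(f, 3.0)
```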
For multivariate functions (e.g. two variables) we need partial derivatives – one per dimension. Examples of multivariate functions:
Partial Derivative – Multivariate Functions
f(x,y) = x² + y²
f(x,y) = cos²(x) + y²
f(x,y) = cos²(x) + cos²(y)
Convex!
f(x,y) = −x² − y²
Concave!
To visualize the partial derivative for each of the dimensions x and y, we can imagine a plane that "cuts" our surface along one of the dimensions, and once again we get the slope of the tangent line.
Partial Derivative – Cont'd
surface: f(x,y) = 9 − x² − y²
plane: y = 1
cut: f(x,1) = 8 − x²
slope / derivative of cut: f′(x) = −2x
Partial Derivative – Cont'd 2
f(x,y) = 9 − x² − y²
If we partially differentiate a function with respect to x, we pretend y is constant
f(x,y) = 9 − x² − c²
f_x = ∂f/∂x = −2x
f(x,y) = 9 − c² − y²
f_y = ∂f/∂y = −2y
Partial Derivative – Cont'd 3
The two tangent lines that pass through a point define the tangent plane at that point
• The gradient is the vector whose coordinates are the partial derivatives of the function:
• Note: the gradient vector is not parallel to the tangent plane
Gradient Vector
f(x,y) = 9 − x² − y²
∇f = (∂f/∂x) i + (∂f/∂y) j = (∂f/∂x, ∂f/∂y) = (−2x, −2y)
∂f/∂x = −2x    ∂f/∂y = −2y
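A small numeric check (my example, not from the slides) of the gradient above, using one central difference per dimension:

```python
def f(x, y):
    return 9 - x**2 - y**2

def numeric_gradient(f, x, y, h=1e-6):
    # partial derivatives via central differences, one per dimension
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return (dfdx, dfdy)

# analytic gradient at (1, 2) is (-2x, -2y) = (-2, -4)
grad = numeric_gradient(f, 1.0, 2.0)
```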
• Idea
  ▫ Start somewhere
  ▫ Take steps based on the gradient vector of the current position until convergence
• Convergence:
  ▫ happens when the change between two steps < ε
Gradient Descent Algorithm & Walkthrough
Gradient Descent Code (Python)
f(x) = x⁴ − 3x³ + 2
f′(x) = 4x³ − 9x²
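The Python code for this slide did not survive the transcript; here is a minimal sketch consistent with the walkthrough function above (the step size, tolerance, and starting point are my choices, not the slides'):

```python
def gradient_descent(df, x0, alpha=0.01, eps=1e-6, max_iter=10000):
    """Minimize a 1-D function given its derivative df."""
    x = x0
    for _ in range(max_iter):
        step = alpha * df(x)
        if abs(step) < eps:   # convergence: change between two steps < eps
            break
        x = x - step
    return x

# f(x) = x**4 - 3*x**3 + 2, with derivative f'(x) = 4*x**3 - 9*x**2;
# the local minimum is at x = 9/4 = 2.25
x_min = gradient_descent(lambda x: 4 * x**3 - 9 * x**2, x0=6.0)
```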
Gradient Descent Algorithm & Walkthrough
Potential issues of gradient descent – Convexity
f(x,y) = x² + y²
We need a convex function → so there is a global minimum:
Potential issues of gradient descent – Convexity (2)
• As we saw before, one parameter that needs to be set is the step size
• Bigger steps lead to faster convergence, right?
Potential issues of gradient descent – Step Size
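To make the "bigger steps" question concrete, here is a small sketch (my example, not from the slides) on f(x) = x², where f′(x) = 2x: a small step size converges, but a step size above 1 overshoots the minimum on every step and diverges:

```python
def step(x, alpha):
    # one gradient-descent step on f(x) = x**2, where f'(x) = 2x
    return x - alpha * 2 * x

x_small, x_big = 1.0, 1.0
for _ in range(20):
    x_small = step(x_small, 0.1)  # |x| shrinks toward the minimum at 0
    x_big = step(x_big, 1.1)      # |x| grows: each step overshoots the minimum
```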
• Newton's Method
  ▫ Approximates the function by a polynomial and jumps to the minimum of that approximation
  ▫ Needs the Hessian
• BFGS
  ▫ More complicated algorithm
  ▫ Commonly used in actual optimization packages
Alternative algorithms
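A minimal one-dimensional sketch of Newton's method (my example, not from the slides), applied to the walkthrough function f(x) = x⁴ − 3x³ + 2; in one dimension the Hessian reduces to the second derivative:

```python
def newton(df, d2f, x0, eps=1e-10, max_iter=50):
    """Find a stationary point of f by jumping to the minimum of the
    local quadratic approximation on each iteration."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)  # in higher dimensions this needs the Hessian
        x -= step
        if abs(step) < eps:
            break
    return x

# f'(x) = 4x^3 - 9x^2, f''(x) = 12x^2 - 18x; local minimum at x = 2.25
x_min = newton(lambda x: 4 * x**3 - 9 * x**2,
               lambda x: 12 * x**2 - 18 * x,
               x0=6.0)
```

Note how few iterations this takes compared to plain gradient descent, at the price of computing (and inverting) second-derivative information.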
• Motivation
  ▫ One way to think of gradient descent is as a minimization of a sum of functions:
    w = w − α ∇L(w) = w − α Σ_i ∇L_i(w)
    (L_i is the loss function evaluated on the i-th element of the dataset)
  ▫ On large datasets, it may be computationally expensive to iterate over the whole dataset, so pulling a subset of the data may perform better
  ▫ Additionally, sampling the data introduces "noise" that can help avoid "shallow local minima." This is good for optimizing non-convex functions. (Murphy)
Stochastic Gradient Descent
• Online learning algorithm
• Instead of going through the entire dataset on each iteration, randomly sample and update the model

Initialize w and α
Until convergence do:
    sample one example i from the dataset   // stochastic portion
    w = w − α ∇L_i(w)
return w
Stochastic Gradient Descent
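The pseudocode above can be sketched in Python. The model here is my own illustrative choice (a one-parameter least-squares fit of y = w·x, so dL_i/dw = 2(w·x − y)·x), as are the step size and step count:

```python
import random

def sgd(data, grad_loss_i, w0, alpha=0.005, n_steps=5000):
    """SGD: update w using the gradient of the loss on ONE random example."""
    w = w0
    for _ in range(n_steps):
        x, y = random.choice(data)            # stochastic portion
        w = w - alpha * grad_loss_i(w, x, y)  # w = w - alpha * grad L_i(w)
    return w

# Noiseless data from y = 3x; SGD should recover w = 3
data = [(x, 3.0 * x) for x in range(1, 11)]
w_fit = sgd(data, lambda w, x, y: 2 * (w * x - y) * x, w0=0.0)
```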
• Checking for convergence after each data example can be slow
• One can simulate stochasticity by reshuffling the dataset on each pass:

Initialize w and α
Until convergence do:
    shuffle the dataset of n elements   // simulating stochasticity
    for each example i in n:
        w = w − α ∇L_i(w)
return w
• This is generally faster than the classic iterative approach ("noise")
• However, you are still passing over the entire dataset each time
• An approach in the middle is to sample "batches", subsets of the entire dataset
  ▫ This can be parallelized!
Stochastic Gradient Descent (2)
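The in-between "batches" idea can be sketched like this (same illustrative y = w·x least-squares model as before; batch size, step size, and epoch count are my choices): each update averages the per-example gradients over a small batch rather than using one example or the whole dataset.

```python
import random

def minibatch_sgd(data, grad_loss_i, w0, alpha=0.005, batch_size=4, n_epochs=200):
    """Mini-batch SGD: average per-example gradients over small batches."""
    w = w0
    for _ in range(n_epochs):
        random.shuffle(data)  # note: shuffles the caller's list in place
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            g = sum(grad_loss_i(w, x, y) for x, y in batch) / len(batch)
            w = w - alpha * g
    return w

# Noiseless data from y = 3x; mini-batch SGD should recover w = 3
data = [(x, 3.0 * x) for x in range(1, 11)]
w_fit = minibatch_sgd(data, lambda w, x, y: 2 * (w * x - y) * x, w0=0.0)
```

The inner loop over a batch is what can be distributed across workers, as the next slide shows.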
• Training data is chunked into batches and distributed

Initialize w and α
Loop until convergence:
    generate a randomly sampled chunk of data m on each worker machine v:
        ∇L_v(w) = sum(∇L_i(w))   // compute gradient on the batch
    w = w − α · sum(∇L_v(w))     // update the global model w
return w
Parallel Gradient Descent
• Unclear why it is called this
• Idea:
  ▫ In parallel SGD, each batch needs to finish before starting the next pass
  ▫ In HOGWILD!, share the global model amongst all machines and update it on-the-fly
    - No need to wait for all worker machines to finish before starting the next epoch
    - Assumption: component-wise addition is atomic and does not require locking
HOGWILD! (Niu, et al. 2011)
Initialize global model w
On each worker machine:
    loop until convergence:
        draw a sample e from the complete dataset E
        get the current global state w and compute ∇L_e(w)
        for each component v in e:
            w_v = w_v − α b_vᵀ ∇L_e(w)   // b_v is the v-th standard basis vector; updates the global w
return w
HOGWILD! – Pseudocode
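A toy sketch of the lock-free idea using Python threads (my example, not from the slides or the paper): every worker updates single components of a shared model with no locking. In CPython, each list-element write is atomic under the GIL, which stands in for the paper's atomic component-wise update assumption; the loss here is an assumed separable quadratic, so each coordinate update contracts toward its target even under stale reads.

```python
import random
import threading

# Shared global model, updated lock-free by all workers (HOGWILD!-style)
w = [0.0, 0.0, 0.0]
targets = [1.0, 2.0, 3.0]   # minimize L(w) = sum_i (w_i - targets_i)^2
alpha = 0.1

def worker(n_steps=2000):
    for _ in range(n_steps):
        i = random.randrange(len(w))             # sample one component
        w[i] -= alpha * 2 * (w[i] - targets[i])  # component update, no lock

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Despite races between read and write, every completed write shrinks that coordinate's error, so the shared model still converges, which is the point of the lock-free scheme.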
HOGWILD! vs. Parallel SGD
Comparison
[Diagram: model states W0, W1, W2, …, Wx updated by gradients GA, GB, GC under each scheme]
Comparison
• RR – Round Robin: each machine updates x as it comes in; wait for all before starting the next pass
• AIG: like HOGWILD!, but does fine-grained locking of the variables that are going to be used
Comparison (2): SVM, Graph Cuts
Matrix Completion
• Having an idea of how gradient descent works informs your use of others' implementations
• There are very good implementations of the algorithm and other approaches to optimization in many languages
• Packages:
  ▫ Python: NumPy/SciPy
  ▫ Matlab: Matlab Optimization Toolbox, Pmtk3
Moral of the story
• R
  ▫ General-purpose optimization: optim()
  ▫ R Optimization Infrastructure (ROI)
• TupleWare
  ▫ Coming soon…
Partial Derivatives:
• http://msemac.redwoods.edu/~darnold/math50c/matlab/pderiv/index.xhtml
• http://mathinsight.org/nondifferentiable_discontinuous_partial_derivatives
• http://www.sv.vt.edu/classes/ESM4714/methods/df2D.html
• Gradient Vector Field Interactive Visualization: http://dlippman.imathas.com/g1/Grapher.html from https://www.khanacademy.org/math/calculus/partial_derivatives_topic/gradient/v/gradient-1
• http://simmakers.com/wp-content/uploads/Soft/gradient.gif
Gradient Descent:
• http://en.wikipedia.org/wiki/Gradient_descent
• http://www.youtube.com/watch?v=5u4G23_OohI (Stanford ML Lecture 2)
• http://en.wikipedia.org/wiki/Stochastic_gradient_descent
• Murphy, Machine Learning: A Probabilistic Perspective, 2012, MIT Press
• HOGWILD! paper: http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf
Resources