Gradient Descent – Brown University (cs.brown.edu/courses/cs195w/slides/graddes.pdf)
Michail Michailidis & Patrick Maiden
Gradient Descent
• Motivation
• Gradient Descent Algorithm
  ▫ Issues & Alternatives
• Stochastic Gradient Descent
• Parallel Gradient Descent
• HOGWILD!
Outline
• It is good for finding global minima/maxima if the function is convex
• It is good for finding local minima/maxima if the function is not convex
• It is used for optimizing many models in machine learning, in conjunction with:
  ▫ Neural Networks
  ▫ Linear Regression
  ▫ Logistic Regression
  ▫ the Back-propagation algorithm
  ▫ Support Vector Machines
Motivation
Function Example
• Derivative
• Partial Derivative
• Gradient Vector
Quickest-ever review of multivariate calculus
• Slope of the tangent line
• Easy when a function is univariate
Derivative
f(x) = x²
f′(x) = df/dx = 2x
f″(x) = d²f/dx² = 2
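As a quick numeric sanity check (my example, not from the slides), the derivative f′(x) = 2x can be verified with a central-difference approximation:

```python
def f(x):
    return x ** 2

def numeric_derivative(f, x, h=1e-6):
    # central-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# f'(3) should be 2 * 3 = 6
approx = numeric_derivative(f, 3.0)
```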
For multivariate functions (e.g. two variables) we need partial derivatives – one per dimension. Examples of multivariate functions:
Partial Derivative – Multivariate Functions
f(x,y) = x² + y²
f(x,y) = cos²(x) + y²
f(x,y) = cos²(x) + cos²(y)
Convex!
f(x,y) = −x² − y²
Concave!
To visualize the partial derivative for each of the dimensions x and y, we can imagine a plane that "cuts" our surface along one of the dimensions, and once again we get the slope of the tangent line.
Partial Derivative – Cont'd
surface: f(x,y) = 9 − x² − y²
plane: y = 1
cut: f(x,1) = 8 − x²
slope / derivative of cut: f′(x) = −2x
Partial Derivative – Cont'd 2
f(x,y) = 9 − x² − y²
If we partially differentiate a function with respect to x, we pretend y is constant
f(x,y) = 9 − x² − c²
f_x = ∂f/∂x = −2x
f(x,y) = 9 − c² − y²
f_y = ∂f/∂y = −2y
Partial Derivative – Cont'd 3
The two tangent lines that pass through a point define the tangent plane at that point
• The gradient is the vector whose coordinates are the partial derivatives of the function:
• Note: the gradient vector is not parallel to the tangent plane
Gradient Vector
f(x,y) = 9 − x² − y²
∇f = (∂f/∂x) i + (∂f/∂y) j = (∂f/∂x, ∂f/∂y) = (−2x, −2y)
∂f/∂x = −2x    ∂f/∂y = −2y
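A small numeric check (my example, not from the slides) of the gradient above, using one central difference per dimension:

```python
def f(x, y):
    return 9 - x**2 - y**2

def numeric_gradient(f, x, y, h=1e-6):
    # partial derivatives via central differences, one per dimension
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return (dfdx, dfdy)

# analytic gradient at (1, 2) is (-2x, -2y) = (-2, -4)
grad = numeric_gradient(f, 1.0, 2.0)
```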
• Idea
  ▫ Start somewhere
  ▫ Take steps based on the gradient vector of the current position until convergence
• Convergence:
  ▫ happens when the change between two steps < ε
Gradient Descent Algorithm & Walkthrough
Gradient Descent Code (Python)
f(x) = x⁴ − 3x³ + 2
f′(x) = 4x³ − 9x²
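The Python code for this slide did not survive the transcript; here is a minimal sketch consistent with the walkthrough function above (the step size, tolerance, and starting point are my choices, not the slides'):

```python
def gradient_descent(df, x0, alpha=0.01, eps=1e-6, max_iter=10000):
    """Minimize a 1-D function given its derivative df."""
    x = x0
    for _ in range(max_iter):
        step = alpha * df(x)
        if abs(step) < eps:   # convergence: change between two steps < eps
            break
        x = x - step
    return x

# f(x) = x**4 - 3*x**3 + 2, with derivative f'(x) = 4*x**3 - 9*x**2;
# the local minimum is at x = 9/4 = 2.25
x_min = gradient_descent(lambda x: 4 * x**3 - 9 * x**2, x0=6.0)
```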
Gradient Descent Algorithm & Walkthrough
Potential issues of gradient descent – Convexity
f(x,y) = x² + y²
We need a convex function → so there is a global minimum:
Potential issues of gradient descent – Convexity (2)
• As we saw before, one parameter that needs to be set is the step size
• Bigger steps lead to faster convergence, right?
Potential issues of gradient descent – Step Size
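To make the "bigger steps" question concrete, here is a small sketch (my example, not from the slides) on f(x) = x², where f′(x) = 2x: a small step size converges, but a step size above 1 overshoots the minimum on every step and diverges:

```python
def step(x, alpha):
    # one gradient-descent step on f(x) = x**2, where f'(x) = 2x
    return x - alpha * 2 * x

x_small, x_big = 1.0, 1.0
for _ in range(20):
    x_small = step(x_small, 0.1)  # |x| shrinks toward the minimum at 0
    x_big = step(x_big, 1.1)      # |x| grows: each step overshoots the minimum
```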
• Newton's Method
  ▫ Approximates the function by a polynomial and jumps to the minimum of that approximation
  ▫ Needs the Hessian
• BFGS
  ▫ More complicated algorithm
  ▫ Commonly used in actual optimization packages
Alternative algorithms
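A minimal one-dimensional sketch of Newton's method (my example, not from the slides), applied to the walkthrough function f(x) = x⁴ − 3x³ + 2; in one dimension the Hessian reduces to the second derivative:

```python
def newton(df, d2f, x0, eps=1e-10, max_iter=50):
    """Find a stationary point of f by jumping to the minimum of the
    local quadratic approximation on each iteration."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)  # in higher dimensions this needs the Hessian
        x -= step
        if abs(step) < eps:
            break
    return x

# f'(x) = 4x^3 - 9x^2, f''(x) = 12x^2 - 18x; local minimum at x = 2.25
x_min = newton(lambda x: 4 * x**3 - 9 * x**2,
               lambda x: 12 * x**2 - 18 * x,
               x0=6.0)
```

Note how few iterations this takes compared to plain gradient descent, at the price of computing (and inverting) second-derivative information.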
• Motivation
  ▫ One way to think of gradient descent is as a minimization of a sum of functions:
    w = w − α ∇L(w) = w − α Σ_i ∇L_i(w)
    (L_i is the loss function evaluated on the i-th element of the dataset)
  ▫ On large datasets, it may be computationally expensive to iterate over the whole dataset, so pulling a subset of the data may perform better
  ▫ Additionally, sampling the data introduces "noise" that can help avoid "shallow local minima." This is good for optimizing non-convex functions. (Murphy)
Stochastic Gradient Descent
• Online learning algorithm
• Instead of going through the entire dataset on each iteration, randomly sample and update the model

Initialize w and α
Until convergence do:
    sample one example i from the dataset   // stochastic portion
    w = w − α ∇L_i(w)
return w
Stochastic Gradient Descent
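The pseudocode above can be sketched in Python. The model here is my own illustrative choice (a one-parameter least-squares fit of y = w·x, so dL_i/dw = 2(w·x − y)·x), as are the step size and step count:

```python
import random

def sgd(data, grad_loss_i, w0, alpha=0.005, n_steps=5000):
    """SGD: update w using the gradient of the loss on ONE random example."""
    w = w0
    for _ in range(n_steps):
        x, y = random.choice(data)            # stochastic portion
        w = w - alpha * grad_loss_i(w, x, y)  # w = w - alpha * grad L_i(w)
    return w

# Noiseless data from y = 3x; SGD should recover w = 3
data = [(x, 3.0 * x) for x in range(1, 11)]
w_fit = sgd(data, lambda w, x, y: 2 * (w * x - y) * x, w0=0.0)
```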
• Checking for convergence after each data example can be slow
• One can simulate stochasticity by reshuffling the dataset on each pass:

Initialize w and α
Until convergence do:
    shuffle the dataset of n elements   // simulating stochasticity
    for each example i in n:
        w = w − α ∇L_i(w)
return w
• This is generally faster than the classic iterative approach ("noise")
• However, you are still passing over the entire dataset each time
• An approach in the middle is to sample "batches", subsets of the entire dataset
  ▫ This can be parallelized!
Stochastic Gradient Descent (2)
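The in-between "batches" idea can be sketched like this (same illustrative y = w·x least-squares model as before; batch size, step size, and epoch count are my choices): each update averages the per-example gradients over a small batch rather than using one example or the whole dataset.

```python
import random

def minibatch_sgd(data, grad_loss_i, w0, alpha=0.005, batch_size=4, n_epochs=200):
    """Mini-batch SGD: average per-example gradients over small batches."""
    w = w0
    for _ in range(n_epochs):
        random.shuffle(data)  # note: shuffles the caller's list in place
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            g = sum(grad_loss_i(w, x, y) for x, y in batch) / len(batch)
            w = w - alpha * g
    return w

# Noiseless data from y = 3x; mini-batch SGD should recover w = 3
data = [(x, 3.0 * x) for x in range(1, 11)]
w_fit = minibatch_sgd(data, lambda w, x, y: 2 * (w * x - y) * x, w0=0.0)
```

The inner loop over a batch is what can be distributed across workers, as the next slide shows.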
• Training data is chunked into batches and distributed

Initialize w and α
Loop until convergence:
    generate a randomly sampled chunk of data m on each worker machine v:
        ∇L_v(w) = sum(∇L_i(w))   // compute gradient on the batch
    w = w − α · sum(∇L_v(w))     // update the global model w
return w
Parallel Gradient Descent
• Unclear why it is called this
• Idea:
  ▫ In parallel SGD, each batch needs to finish before starting the next pass
  ▫ In HOGWILD!, share the global model amongst all machines and update it on-the-fly
    - No need to wait for all worker machines to finish before starting the next epoch
    - Assumption: component-wise addition is atomic and does not require locking
HOGWILD! (Niu, et al. 2011)
Initialize global model w
On each worker machine:
    loop until convergence:
        draw a sample e from the complete dataset E
        get the current global state w and compute ∇L_e(w)
        for each component v in e:
            w_v = w_v − α b_vᵀ ∇L_e(w)   // b_v is the v-th standard basis vector; updates the global w
return w
HOGWILD! – Pseudocode
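A toy sketch of the lock-free idea using Python threads (my example, not from the slides or the paper): every worker updates single components of a shared model with no locking. In CPython, each list-element write is atomic under the GIL, which stands in for the paper's atomic component-wise update assumption; the loss here is an assumed separable quadratic, so each coordinate update contracts toward its target even under stale reads.

```python
import random
import threading

# Shared global model, updated lock-free by all workers (HOGWILD!-style)
w = [0.0, 0.0, 0.0]
targets = [1.0, 2.0, 3.0]   # minimize L(w) = sum_i (w_i - targets_i)^2
alpha = 0.1

def worker(n_steps=2000):
    for _ in range(n_steps):
        i = random.randrange(len(w))             # sample one component
        w[i] -= alpha * 2 * (w[i] - targets[i])  # component update, no lock

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Despite races between read and write, every completed write shrinks that coordinate's error, so the shared model still converges, which is the point of the lock-free scheme.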
HOGWILD! vs. Parallel SGD
Comparison
[Diagram: model states W0, W1, W2, …, Wx updated by gradients GA, GB, GC under each scheme]
Comparison
• RR – Round Robin: each machine updates x as it comes in; wait for all before starting the next pass
• AIG: like HOGWILD!, but does fine-grained locking of the variables that are going to be used
Comparison (2): SVM, Graph Cuts
Matrix Completion
• Having an idea of how gradient descent works informs your use of others' implementations
• There are very good implementations of the algorithm and other approaches to optimization in many languages
• Packages:
  ▫ Python: NumPy/SciPy
  ▫ Matlab: Matlab Optimization Toolbox, Pmtk3
Moral of the story
• R
  ▫ General-purpose optimization: optim()
  ▫ R Optimization Infrastructure (ROI)
• TupleWare
  ▫ Coming soon…
Partial Derivatives:
• http://msemac.redwoods.edu/~darnold/math50c/matlab/pderiv/index.xhtml
• http://mathinsight.org/nondifferentiable_discontinuous_partial_derivatives
• http://www.sv.vt.edu/classes/ESM4714/methods/df2D.html
• Gradient Vector Field Interactive Visualization: http://dlippman.imathas.com/g1/Grapher.html from https://www.khanacademy.org/math/calculus/partial_derivatives_topic/gradient/v/gradient-1
• http://simmakers.com/wp-content/uploads/Soft/gradient.gif
Gradient Descent:
• http://en.wikipedia.org/wiki/Gradient_descent
• http://www.youtube.com/watch?v=5u4G23_OohI (Stanford ML Lecture 2)
• http://en.wikipedia.org/wiki/Stochastic_gradient_descent
• Murphy, Machine Learning: A Probabilistic Perspective, 2012, MIT Press
• HOGWILD! paper: http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf
Resources