Gradient Descent - Brown University
cs.brown.edu/courses/cs195w/slides/graddes.pdf

Transcript of the 29-page slide deck by Michail Michailidis & Patrick Maiden.

Page 1:

Michail Michailidis & Patrick Maiden

Gradient Descent

Page 2:

• Motivation
• Gradient Descent Algorithm
  ▫ Issues & Alternatives
• Stochastic Gradient Descent
• Parallel Gradient Descent
• HOGWILD!

Outline  

Page 3:

• It is good for finding global minima/maxima if the function is convex
• It is good for finding local minima/maxima if the function is not convex
• It is used for optimizing many models in machine learning:
  ▫ It is used in conjunction with:
    - Neural Networks
    - Linear Regression
    - Logistic Regression
    - Back-propagation algorithm
    - Support Vector Machines

Motivation

Page 4:

Function Example

Page 5:

• Derivative
• Partial Derivative
• Gradient Vector

Quickest ever review of multivariate calculus

Page 6:

• Slope of the tangent line
• Easy when a function is univariate

Derivative

f(x) = x^2

f'(x) = df/dx = 2x

f''(x) = d^2f/dx^2 = 2
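
As a quick sanity check (not from the slides), a central finite difference approximates the slope of the tangent line numerically:

def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    # central finite difference: slope of the secant through x-h and x+h
    return (f(x + h) - f(x - h)) / (2 * h)

print(numerical_derivative(f, 3.0))   # ~6.0, matching f'(3) = 2*3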

Page 7:

For multivariate functions (e.g. two variables) we need partial derivatives – one per dimension. Examples of multivariate functions:

Partial Derivative – Multivariate Functions

f(x, y) = x^2 + y^2   (Convex!)

f(x, y) = cos^2(x) + y^2

f(x, y) = cos^2(x) + cos^2(y)

f(x, y) = -x^2 - y^2   (Concave!)

Page 8:

To visualize the partial derivative for each of the dimensions x and y, we can imagine a plane that “cuts” our surface along the two dimensions, and once again we get the slope of the tangent line.

Partial Derivative – Cont’d

surface: f(x, y) = 9 - x^2 - y^2
plane: y = 1
cut: f(x, 1) = 8 - x^2

slope / derivative of cut: f'(x) = -2x

Page 9:

Partial Derivative – Cont’d 2

f(x, y) = 9 - x^2 - y^2

If we partially differentiate a function with respect to x, we pretend y is a constant (and vice versa):

f(x, y) = 9 - x^2 - c^2
f_x = ∂f/∂x = -2x

f(x, y) = 9 - c^2 - y^2
f_y = ∂f/∂y = -2y
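
These partials can be checked symbolically; a small sketch assuming SymPy is available (not part of the slides):

import sympy as sp

x, y = sp.symbols('x y')
f = 9 - x**2 - y**2

print(sp.diff(f, x))   # -2*x: y is treated as a constant
print(sp.diff(f, y))   # -2*y: x is treated as a constant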

Page 10:

Partial Derivative – Cont’d 3

The two tangent lines that pass through a point define the tangent plane at that point.

Page 11:

• The vector whose coordinates are the partial derivatives of the function

• Note: the gradient vector is not parallel to the tangent plane

Gradient Vector

f(x, y) = 9 - x^2 - y^2

∇f = (∂f/∂x)i + (∂f/∂y)j = (∂f/∂x, ∂f/∂y) = (-2x, -2y)

∂f/∂x = -2x    ∂f/∂y = -2y
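
As a small illustration (not from the slides), the gradient at a point can be evaluated as a NumPy vector; it points in the direction of steepest ascent:

import numpy as np

def grad_f(x, y):
    # gradient of f(x, y) = 9 - x^2 - y^2
    return np.array([-2.0 * x, -2.0 * y])

print(grad_f(1.0, 2.0))   # [-2. -4.], the direction of steepest ascent at (1, 2)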

Page 12:

• Idea
  ▫ Start somewhere
  ▫ Take steps based on the gradient vector of the current position till convergence

• Convergence
  ▫ happens when the change between two steps is < ε

Gradient Descent Algorithm & Walkthrough

Page 13:

Gradient Descent Code (Python)

f(x) = x^4 - 3x^3 + 2

f'(x) = 4x^3 - 9x^2
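
The Python listing itself did not survive extraction; below is a minimal sketch of the kind of loop the slide shows, using the f'(x) above (the step size alpha, threshold eps, and starting point are arbitrary choices, not the slide's):

def gradient_descent(df, x0, alpha=0.01, eps=1e-6, max_iters=100000):
    # minimize a univariate function given its derivative df
    x = x0
    for _ in range(max_iters):
        step = alpha * df(x)
        if abs(step) < eps:        # convergence: change between two steps < epsilon
            break
        x = x - step
    return x

df = lambda x: 4 * x**3 - 9 * x**2    # derivative of f(x) = x^4 - 3x^3 + 2
print(gradient_descent(df, x0=6.0))   # ~2.25, the local minimum of f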

Page 14:

Gradient Descent Algorithm & Walkthrough

Page 15:

Potential issues of gradient descent – Convexity

f(x, y) = x^2 + y^2

We need a convex function → so there is a global minimum:

Page 16:

Potential issues of gradient descent – Convexity (2)

Page 17:

• As we saw before, one parameter that needs to be set is the step size
• Bigger steps lead to faster convergence, right?

Potential issues of gradient descent – Step Size

Page 18:

• Newton’s Method
  ▫ Approximates the function with a polynomial and jumps to the minimum of that approximation
  ▫ Needs the Hessian
• BFGS
  ▫ More complicated algorithm
  ▫ Commonly used in actual optimization packages (see the sketch below)

Alternative algorithms
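
For instance, SciPy's generic minimize interface exposes BFGS; a sketch reusing the f(x) from the earlier slide (the gradient is estimated numerically when none is given):

from scipy.optimize import minimize

f = lambda x: x[0]**4 - 3 * x[0]**3 + 2   # same objective as before
result = minimize(f, x0=[6.0], method='BFGS')
print(result.x)   # ~[2.25], the same local minimum found above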

Page 19:

• Motivation
  ▫ One way to think of gradient descent is as minimization of a sum of functions:

    w = w - α∇L(w) = w - α Σ_i ∇L_i(w)

    (L_i is the loss function evaluated on the i-th element of the dataset)

  ▫ On large datasets, it may be computationally expensive to iterate over the whole dataset, so pulling a subset of the data may perform better

  ▫ Additionally, sampling the data introduces “noise” that helps avoid “shallow local minima.” This is good for optimizing non-convex functions. (Murphy)

Stochastic Gradient Descent

Page 20:

• Online learning algorithm
• Instead of going through the entire dataset on each iteration, randomly sample and update the model

Initialize w and α
Until convergence do:
    sample one example i from dataset    // stochastic portion
    w = w - α∇L_i(w)
return w

Stochastic Gradient Descent
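
A runnable sketch of this loop for least-squares linear regression (the data, step size, and the fixed step count standing in for the convergence test are all illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                # hypothetical dataset
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
alpha = 0.01
for _ in range(5000):
    i = rng.integers(len(X))                  # sample one example: the stochastic portion
    grad_i = (X[i] @ w - y[i]) * X[i]         # gradient of the squared loss on example i
    w = w - alpha * grad_i
print(w)   # close to [2.0, -1.0, 0.5]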

Page 21:

• Checking for convergence after each data example can be slow
• One can simulate stochasticity by reshuffling the dataset on each pass:

Initialize w and α
Until convergence do:
    shuffle dataset of n elements    // simulating stochasticity
    for each example i in n:
        w = w - α∇L_i(w)
return w

• This is generally faster than the classic iterative approach (“noise”)
• However, you are still passing over the entire dataset each time
• An approach in the middle is to sample “batches”, subsets of the entire dataset (see the sketch below)
  ▫ This can be parallelized!

Stochastic Gradient Descent (2)
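
The in-between “batch” variant from the last bullet, as a sketch reusing X, y, alpha, and rng from the previous snippet:

batch_size = 32
w = np.zeros(3)
for _ in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)   # sample a batch
    grad = (X[idx] @ w - y[idx]) @ X[idx] / batch_size         # average gradient over the batch
    w = w - alpha * grad
print(w)   # again close to [2.0, -1.0, 0.5]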

Page 22:

• Training data is chunked into batches and distributed

Initialize w and α
Loop until convergence:
    generate randomly sampled chunk of data m
    on each worker machine v:
        ∇L_v(w) = sum(∇L_i(w))      // compute gradient on batch
    w = w - α * sum(∇L_v(w))        // update global w model
return w

Parallel Gradient Descent
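
A thread-based sketch of one synchronized step of this scheme (the worker count and thread pool are illustrative choices; it reuses the X and y from the SGD snippet):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def chunk_gradient(X_chunk, y_chunk, w):
    # each worker computes the summed squared-loss gradient over its chunk
    return (X_chunk @ w - y_chunk) @ X_chunk

def parallel_step(X, y, w, alpha, n_workers=4):
    X_chunks = np.array_split(X, n_workers)
    y_chunks = np.array_split(y, n_workers)
    with ThreadPoolExecutor(n_workers) as ex:
        grads = ex.map(chunk_gradient, X_chunks, y_chunks, [w] * n_workers)
    return w - alpha * sum(grads) / len(X)    # combine workers' gradients, update global w

w = np.zeros(3)
for _ in range(500):
    w = parallel_step(X, y, w, alpha=0.01)
print(w)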

Page 23:

• Unclear why it is called this
• Idea:
  ▫ In Parallel SGD, each batch needs to finish before starting the next pass
  ▫ In HOGWILD!, share the global model amongst all machines and update on-the-fly
    - No need to wait for all worker machines to finish before starting the next epoch
    - Assumption: component-wise addition is atomic and does not require locking

 

HOGWILD! (Niu et al., 2011)

Page 24:

Initialize global model w
On each worker machine:
    loop until convergence:
        draw a sample e from complete dataset E
        get current global state w and compute ∇L_e(w)
        for each component v in e:
            w_v = w_v - α b_v^T ∇L_e(w)    // b_v is the vth standard basis vector
        update global w
return w

HOGWILD! – Pseudocode
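
A minimal lock-free sketch with Python threads, again reusing X and y from the SGD snippet (illustrative only: real HOGWILD! exploits sparse updates across cores, and the thread and step counts here are arbitrary):

import threading
import numpy as np

w = np.zeros(3)   # shared global model, updated by all workers without locks

def worker(seed, n_steps=2000, alpha=0.01):
    global w
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        i = rng.integers(len(X))           # draw a sample from the complete dataset
        grad = (X[i] @ w - y[i]) * X[i]    # read the current global state, no lock taken
        w -= alpha * grad                  # in-place, component-wise update of w

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(w)   # converges despite the unsynchronized updates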

Page 25:

Comparison

[Figure: HOGWILD! vs. Parallel SGD. Parallel SGD advances the shared model in synchronized rounds (W0 → W1 → W2), applying the workers' gradients GA, GB, GC together at each round; HOGWILD! applies each worker's gradient to the shared model Wx as soon as it is ready.]

Page 26:

Comparison  

• RR – Round Robin
  ▫ Each machine updates x as it comes in; wait for all before starting the next pass

• AIG
  ▫ Like HOGWILD! but does fine-grained locking of the variables that are going to be used

Page 27:

Comparison (2)

[Figures: experimental comparison on SVM, Graph Cuts, and Matrix Completion.]

Page 28:

• Having an idea of how gradient descent works informs your use of others’ implementations
• There are very good implementations of the algorithm and other approaches to optimization in many languages
• Packages:
  ▫ Python: NumPy/SciPy
  ▫ Matlab: Matlab Optimization Toolbox, Pmtk3
  ▫ R: general-purpose optimization with optim(), R Optimization Infrastructure (ROI)
  ▫ TupleWare: coming soon....

Moral of the story

Page 29:

Partial Derivatives:
• http://msemac.redwoods.edu/~darnold/math50c/matlab/pderiv/index.xhtml
• http://mathinsight.org/nondifferentiable_discontinuous_partial_derivatives
• http://www.sv.vt.edu/classes/ESM4714/methods/df2D.html
• Gradient Vector Field Interactive Visualization: http://dlippman.imathas.com/g1/Grapher.html from https://www.khanacademy.org/math/calculus/partial_derivatives_topic/gradient/v/gradient-1
• http://simmakers.com/wp-content/uploads/Soft/gradient.gif

Gradient Descent:
• http://en.wikipedia.org/wiki/Gradient_descent
• http://www.youtube.com/watch?v=5u4G23_OohI (Stanford ML Lecture 2)
• http://en.wikipedia.org/wiki/Stochastic_gradient_descent
• Murphy, Machine Learning: A Probabilistic Perspective, 2012, MIT Press
• HOGWILD! paper: http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf

Resources