Efficient Logistic Regression with Stochastic Gradient Descent
Optimization for Data Science: Stochastic Gradient Methods
Lecturer: Robert M. Gower & Alexandre Gramfort
Tutorials: Quentin Bertrand, Nidham Gazagnadou
Master 2 Data Science, Institut Polytechnique de Paris (IPP)
Solving the Finite Sum Training Problem
Recap: General Methods
Training Problem: minimize L(w), which has two parts.
● Gradient Descent
● Proximal gradient (ISTA)
● Fast proximal gradient (FISTA)
Finite Sum Training Problem
Each datum contributes a function f_i, so the optimization objective is a sum of terms:

min_w f(w) = (1/n) Σ_{i=1}^n f_i(w)

Can we use this sum structure?
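To make the sum structure concrete, here is a minimal NumPy sketch, assuming ℓ2-regularized logistic regression as the running example (the data, λ, and function names are illustrative, not the lecture's own code):

```python
import numpy as np

def datum_loss(w, x_i, y_i, lam):
    """f_i(w) = log(1 + exp(-y_i <x_i, w>)) + (lam/2) ||w||^2."""
    return np.log1p(np.exp(-y_i * (x_i @ w))) + 0.5 * lam * (w @ w)

def full_loss(w, X, y, lam):
    """f(w) = (1/n) sum_i f_i(w): the finite-sum training objective."""
    n = X.shape[0]
    return sum(datum_loss(w, X[i], y[i], lam) for i in range(n)) / n

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # toy data (illustrative)
y = np.sign(rng.standard_normal(100))
w = np.zeros(5)
print(full_loss(w, X, y, lam=0.1))  # equals log(2) at w = 0
```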
Stochastic Gradient Descent
Is it possible to design a method that uses only the gradient of a single datum function at each iteration?

Unbiased Estimate: Let j be a random index sampled from {1, …, n} uniformly at random. Then

E_j[∇f_j(w)] = (1/n) Σ_{i=1}^n ∇f_i(w) = ∇f(w),

so ∇f_j(w) is an unbiased estimate of the gradient.
EXE:
Stochastic Gradient Descent: sample j uniformly at random, then step along the negative stochastic gradient,

w^{k+1} = w^k − α ∇f_j(w^k),

so that the iterates approach the optimal point.
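The SGD update just described can be sketched as a short NumPy loop (an illustrative implementation on ℓ2-regularized logistic regression; the stepsize, λ, and toy data are assumptions, not the lecture's choices):

```python
import numpy as np

def grad_i(w, x_i, y_i, lam):
    """Gradient of f_i(w) = log(1 + exp(-y_i <x_i, w>)) + (lam/2)||w||^2."""
    s = -y_i / (1.0 + np.exp(y_i * (x_i @ w)))  # derivative of the logistic loss
    return s * x_i + lam * w

def sgd(X, y, lam=0.1, alpha=0.1, n_iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        j = rng.integers(n)                      # sample j uniformly from {0,...,n-1}
        w -= alpha * grad_i(w, X[j], y[j], lam)  # w^{k+1} = w^k - alpha * grad f_j(w^k)
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = np.sign(X @ w_true)                          # linearly separable toy labels
w = sgd(X, y)
print(np.mean(np.sign(X @ w) == y))              # training accuracy
```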
Assumptions for Convergence
● Strong Convexity: there exists μ > 0 such that f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (μ/2)‖y − x‖² for all x, y.
● Expected Bounded Stochastic Gradients: there exists B > 0 such that E_j[‖∇f_j(w)‖²] ≤ B² for all w.
Complexity / Convergence
Theorem: Under the above assumptions, SGD with constant stepsize 0 < α < 1/(2μ) satisfies

E[‖w^k − w*‖²] ≤ (1 − 2αμ)^k ‖w^0 − w*‖² + αB²/(2μ).

EXE: Do exercises on convergence of random sequences.
Proof: Expanding the SGD update,

‖w^{k+1} − w*‖² = ‖w^k − w*‖² − 2α⟨∇f_j(w^k), w^k − w*⟩ + α²‖∇f_j(w^k)‖².

Taking expectation with respect to j and using the unbiased estimator, then applying strong convexity (⟨∇f(w), w − w*⟩ ≥ μ‖w − w*‖²) and the bounded stochastic gradient assumption, gives

E_j[‖w^{k+1} − w*‖²] ≤ (1 − 2αμ)‖w^k − w*‖² + α²B².

Taking total expectation and unrolling the recursion yields the result. ∎
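With a constant stepsize the iterates settle into a neighborhood of the solution whose size grows with α. A toy numerical illustration (a scalar quadratic finite sum; every constant here is illustrative):

```python
import numpy as np

def run_sgd(alpha, c, n_iters=20000, seed=0):
    """SGD on f(w) = (1/n) sum_i (1/2)(w - c_i)^2, whose minimizer is mean(c).
    Returns the mean squared distance to w* over the second half of the run."""
    rng = np.random.default_rng(seed)
    w, w_star = 0.0, c.mean()
    tail = []
    for k in range(n_iters):
        j = rng.integers(len(c))
        w -= alpha * (w - c[j])          # gradient of f_j at w is (w - c_j)
        if k >= n_iters // 2:            # record squared error after burn-in
            tail.append((w - w_star) ** 2)
    return float(np.mean(tail))

c = np.array([-1.0, 0.0, 2.0, 3.0])
err_big, err_small = run_sgd(0.5, c), run_sgd(0.05, c)
print(err_big, err_small)  # the smaller stepsize gives a smaller neighborhood
```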
(Figures: Stochastic Gradient Descent runs with stepsizes α = 0.01, 0.1, 0.2, 0.5.)
Realistic Assumptions for Convergence
● Strong quasi-convexity: f(w*) ≥ f(w) + ⟨∇f(w), w* − w⟩ + (μ/2)‖w* − w‖² for all w.
● Each f_i is convex and L_i-smooth.
● Definition (Gradient Noise): σ² := (1/n) Σ_{i=1}^n ‖∇f_i(w*)‖².
Assumptions for Convergence
EXE: Calculate the L_i’s and L for the training problem.
HINT: A twice differentiable f_i is L_i-smooth if and only if ‖∇²f_i(w)‖ ≤ L_i for all w.
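As a worked hint for the exercise: for ℓ2-regularized logistic regression (an assumption, matching the lecture's running example), the Hessian is σ(1−σ)·x_i x_iᵀ + λI with σ(1−σ) ≤ 1/4, so L_i = ‖x_i‖²/4 + λ. A quick check:

```python
import numpy as np

def smoothness_constants(X, lam):
    """L_i = ||x_i||^2 / 4 + lam for the regularized logistic loss:
    the Hessian is s(1-s) x_i x_i^T + lam*I, and s(1-s) <= 1/4."""
    return 0.25 * np.sum(X ** 2, axis=1) + lam

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
L = smoothness_constants(X, lam=0.1)
print(L, L.max())   # L_max = max_i L_i bounds every L_i
```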
Relationship between smoothness constants
EXE: Show that L ≤ (1/n) Σ_{i=1}^n L_i ≤ max_i L_i =: L_max.
Proof: From the Hessian definition of smoothness,

‖∇²f(w)‖ = ‖(1/n) Σ_i ∇²f_i(w)‖ ≤ (1/n) Σ_i ‖∇²f_i(w)‖ ≤ (1/n) Σ_i L_i.

Furthermore, (1/n) Σ_i L_i ≤ max_i L_i. The final result now follows by taking the max over w, then the max over i. ∎
Complexity / Convergence
Theorem. Suppose each f_i is convex and L_i-smooth and f is μ-strongly quasi-convex. SGD with constant stepsize α ≤ 1/(2 L_max) satisfies

E[‖w^k − w*‖²] ≤ (1 − αμ)^k ‖w^0 − w*‖² + 2ασ²/μ.

EXE: The steps of the proof are given in the SGD_proof exercise list for homework!
R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, P. Richtarik (2019). SGD: General Analysis and Improved Rates. ICML 2019.
(Figure: Stochastic Gradient Descent, α = 0.5.) Two remedies for the noise:
1) Start with big steps and end with smaller steps
2) Try averaging the points
SGD with Shrinking Stepsize
Shrinking stepsize: use a stepsize α^k that decreases over the iterations,

w^{k+1} = w^k − α^k ∇f_j(w^k).

How should we sample j? Does this converge?
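A shrinking schedule is a one-line change to the SGD loop. A sketch on the same toy quadratic as before (the 1/k decay is one standard schedule and α_0 = 1.0 an illustrative choice):

```python
import numpy as np

def sgd_shrinking(c, alpha0=1.0, n_iters=5000, seed=0):
    """SGD on f(w) = (1/n) sum_i (1/2)(w - c_i)^2 with shrinking stepsize
    alpha_k = alpha0 / (1 + k): big steps first, smaller steps later."""
    rng = np.random.default_rng(seed)
    w = 0.0
    for k in range(n_iters):
        alpha_k = alpha0 / (1.0 + k)     # shrinking stepsize schedule
        j = rng.integers(len(c))
        w -= alpha_k * (w - c[j])        # gradient of f_j at w is (w - c_j)
    return w

c = np.array([-1.0, 0.0, 2.0, 3.0])
w_final = sgd_shrinking(c)
print(w_final, c.mean())  # the iterate approaches the minimizer mean(c)
```

On this particular quadratic with α_0 = 1.0 the iterate works out to be exactly the running average of the sampled c_j's, which makes the convergence easy to see.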
(Figures: SGD with shrinking stepsize compared with Gradient Descent.)
Complexity / Convergence
Theorem for shrinking stepsizes: with a suitably decreasing stepsize schedule (α^k ∝ 1/k), the neighborhood term vanishes and SGD converges, with E[‖w^k − w*‖²] = O(1/k).
(Figure: Stochastic Gradient Descent compared with Gradient Descent.) The SGD iterates are noisy. Take averages?
SGD with (late start) averaging: alongside the SGD iterates w^k, output the average of the iterates.
B. T. Polyak and A. B. Juditsky (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization.
Naively storing all iterates and re-averaging them is not efficient. How to make this efficient?
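The efficient fix is a running average, updated in O(d) per step with no stored history. A sketch on the toy quadratic (the stepsize and the late-start threshold are illustrative choices):

```python
import numpy as np

def sgd_late_averaging(c, alpha=0.1, n_iters=4000, start_avg=2000, seed=0):
    """SGD on f(w) = (1/n) sum_i (1/2)(w - c_i)^2, averaging iterates only
    after iteration start_avg, via the running-average update
    w_bar <- w_bar + (w - w_bar)/t, which costs O(d) and stores no history."""
    rng = np.random.default_rng(seed)
    w, w_bar, t = 0.0, 0.0, 0
    for k in range(n_iters):
        j = rng.integers(len(c))
        w -= alpha * (w - c[j])
        if k >= start_avg:               # late-start averaging kicks in here
            t += 1
            w_bar += (w - w_bar) / t
    return w, w_bar

c = np.array([-1.0, 0.0, 2.0, 3.0])
w_last, w_avg = sgd_late_averaging(c)
print(abs(w_last - c.mean()), abs(w_avg - c.mean()))
```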
Stochastic Gradient Descent with and without averaging: averaging starts slow, but can reach higher accuracy. Only use averaging towards the end? (Figure: averaging the last few iterates; averaging starts partway through the run.)
Comparison of GD and SGD for strongly convex problems:

|                        | SGD          | GD                      |
|------------------------|--------------|-------------------------|
| Iteration complexity   | O(1/(μ ε))   | O((L/μ) log(1/ε))       |
| Cost of an iteration   | O(d)         | O(n d)                  |
| Total complexity*      | O(d/(μ ε))   | O(n d (L/μ) log(1/ε))   |

What happens if n is big? What happens if ε is small?
Comparison SGD vs GD over time (figure): SGD makes rapid progress early on, while GD eventually overtakes it at higher accuracy. The Stochastic Average Gradient (SAG) method aims for the best of both.
M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.
Practical SGD for Sparse Data
Lazy SGD updates for Sparse Data
Finite Sum Training Problem with an L2 regularizer and a linear hypothesis:

f_i(w) = ℓ(⟨x_i, w⟩, y_i) + (λ/2)‖w‖².

Assume each data point x_i is s-sparse. How many operations does each SGD step cost? The step

w^{k+1} = (1 − αλ) w^k − α ℓ′(⟨x_i, w^k⟩, y_i) x_i

consists of a rescaling, O(d), plus the addition of a sparse vector, O(s).
To avoid the O(d) rescaling, store the weights implicitly as w^k = β^k v^k. The rescaling then becomes the scalar update β^{k+1} = (1 − αλ) β^k, and the sparse gradient is added to v^k on its s nonzero coordinates:

O(1) scaling + O(s) sparse add = O(s) update

EXE:
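The O(s) trick can be sketched by storing w implicitly as beta * v, so the dense rescaling becomes one scalar multiply (an illustrative implementation; the scalar g stands for the loss derivative ℓ′(⟨x_i, w⟩, y_i), and the class name is made up):

```python
import numpy as np

class LazyL2Weights:
    """Stores w = beta * v so the SGD step
    w <- (1 - alpha*lam) * w - alpha * g * x_i   (x_i is s-sparse)
    costs O(1) for the rescaling and O(s) for the sparse addition."""
    def __init__(self, d):
        self.beta = 1.0
        self.v = np.zeros(d)

    def step(self, alpha, lam, g, idx, vals):
        self.beta *= (1.0 - alpha * lam)               # O(1) rescaling
        self.v[idx] -= (alpha * g / self.beta) * vals  # O(s) sparse add

    def dense(self):
        return self.beta * self.v

# check against the naive O(d) dense update
d = 6
lazy = LazyL2Weights(d)
w = np.zeros(d)
for alpha, lam, g, idx, vals in [(0.1, 0.5, 2.0, np.array([1, 4]), np.array([1.0, -3.0])),
                                 (0.2, 0.5, -1.0, np.array([0, 4]), np.array([2.0, 1.0]))]:
    x = np.zeros(d); x[idx] = vals
    w = (1 - alpha * lam) * w - alpha * g * x          # naive dense step, O(d)
    lazy.step(alpha, lam, g, idx, vals)
print(np.allclose(lazy.dense(), w))  # → True: both representations agree
```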
Why Machine Learners Like SGD
The statistical learning problem: minimize the expected loss over an unknown expectation,

min_w E_{(x,y)}[ℓ(h_w(x), y)].

This is what we want to solve, though in practice we solve the finite-sum training problem over n samples. But SGD run on freshly sampled data points is a stochastic gradient method on the expected loss itself: SGD can solve the statistical learning problem!
Exercise list time! Please solve:
● stoch_ridge_reg_exe
● SGD_proof_exe

Appendix