Gradient Descent
Review: Gradient Descent
• In step 3, we have to solve the following optimization problem:
  θ* = arg min_θ L(θ)   (L: loss function, θ: parameters)
• Suppose that θ has two variables {θ1, θ2}
• Randomly start at θ^0 = [θ1^0, θ2^0]ᵀ, and compute ∇L(θ) = [∂L(θ1)/∂θ1, ∂L(θ2)/∂θ2]ᵀ
• [θ1^1, θ2^1]ᵀ = [θ1^0, θ2^0]ᵀ − η [∂L(θ1^0)/∂θ1, ∂L(θ2^0)/∂θ2]ᵀ   i.e.  θ^1 = θ^0 − η∇L(θ^0)
• [θ1^2, θ2^2]ᵀ = [θ1^1, θ2^1]ᵀ − η [∂L(θ1^1)/∂θ1, ∂L(θ2^1)/∂θ2]ᵀ   i.e.  θ^2 = θ^1 − η∇L(θ^1)
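A minimal sketch of this update rule in Python, assuming an illustrative two-parameter loss L(θ1, θ2) = θ1² + 2·θ2² that is not from the slides:

```python
import numpy as np

# Hypothetical loss L(theta1, theta2) = theta1^2 + 2*theta2^2, used only to make
# the iteration concrete; it is not the loss from the slides.
def L(theta):
    return theta[0] ** 2 + 2 * theta[1] ** 2

def grad_L(theta):
    # gradient [dL/dtheta1, dL/dtheta2]
    return np.array([2 * theta[0], 4 * theta[1]])

eta = 0.1                       # learning rate
theta = np.array([2.0, -3.0])   # starting point theta^0

for t in range(100):
    theta = theta - eta * grad_L(theta)   # theta^{t+1} = theta^t - eta * grad L(theta^t)

print(theta, L(theta))          # theta approaches (0, 0), the minimizer of L
```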
Review: Gradient Descent
Start at position θ^0
Compute gradient at θ^0; move to θ^1 = θ^0 − η∇L(θ^0)
Compute gradient at θ^1; move to θ^2 = θ^1 − η∇L(θ^1)
……
[Figure: trajectory θ^0 → θ^1 → θ^2 → θ^3 in the (θ1, θ2) plane, with the gradient ∇L(θ^i) and the movement −η∇L(θ^i) drawn at each point]
Gradient: the direction normal to the contour lines of the loss
Gradient Descent Tip 1: Tuning your learning rates
Learning Rate
θ^i = θ^{i−1} − η∇L(θ^{i−1})
Set the learning rate η carefully.
[Figure: loss vs. no. of parameter updates for different learning rates — very large (loss blows up), large (loss gets stuck high), small (loss decreases very slowly), just right (loss decreases quickly to a low value)]
If there are more than three parameters, you cannot visualize the loss surface over the parameters.
But you can always visualize the loss against the number of parameter updates, as sketched below.
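One way to do this is to record the loss after every update and plot it against the number of updates; a minimal sketch, assuming an illustrative quadratic loss and learning rates chosen only for the plot:

```python
import numpy as np
import matplotlib.pyplot as plt

def L(theta):                            # illustrative loss theta1^2 + 2*theta2^2
    return theta[0] ** 2 + 2 * theta[1] ** 2

def grad_L(theta):
    return np.array([2 * theta[0], 4 * theta[1]])

for eta in [0.01, 0.1, 0.45]:            # small, just right, large (illustrative values)
    theta = np.array([2.0, -3.0])
    losses = []
    for t in range(50):
        theta = theta - eta * grad_L(theta)
        losses.append(L(theta))
    plt.plot(losses, label=f"eta = {eta}")

plt.xlabel("No. of parameter updates")
plt.ylabel("Loss")
plt.legend()
plt.show()
```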
Adaptive Learning Rates
• Popular & simple idea: reduce the learning rate by some factor every few epochs.
  • At the beginning, we are far from the destination, so we use a larger learning rate.
  • After several epochs, we are close to the destination, so we reduce the learning rate.
  • E.g. 1/t decay: η^t = η / √(t + 1)   (see the sketch after this list)
• The learning rate cannot be one-size-fits-all.
  • Give different parameters different learning rates.
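A minimal sketch of the 1/t decay schedule, reusing the same illustrative loss and an assumed base learning rate η:

```python
import numpy as np

def grad_L(theta):                 # gradient of the illustrative loss theta1^2 + 2*theta2^2
    return np.array([2 * theta[0], 4 * theta[1]])

eta = 0.5                          # base learning rate (illustrative)
theta = np.array([2.0, -3.0])

for t in range(100):
    eta_t = eta / np.sqrt(t + 1)   # 1/t decay: eta^t = eta / sqrt(t + 1)
    theta = theta - eta_t * grad_L(theta)

print(theta)                       # larger steps at first, smaller steps as t grows
```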
Adagrad
• Divide the learning rate of each parameter by the root mean square of its previous derivatives.
σ^t: root mean square of the previous derivatives of parameter w (w is one parameter)
g^t = ∂L(θ^t)/∂w
Vanilla gradient descent:  w^{t+1} ← w^t − η^t g^t,   η^t = η / √(t + 1)
Adagrad:  w^{t+1} ← w^t − (η^t / σ^t) g^t,   where σ^t is parameter dependent
Adagrad
w^1 ← w^0 − (η^0 / σ^0) g^0    σ^0 = √( (g^0)² )
w^2 ← w^1 − (η^1 / σ^1) g^1    σ^1 = √( (1/2)[(g^0)² + (g^1)²] )
w^3 ← w^2 − (η^2 / σ^2) g^2    σ^2 = √( (1/3)[(g^0)² + (g^1)² + (g^2)²] )
……
w^{t+1} ← w^t − (η^t / σ^t) g^t    σ^t = √( (1/(t+1)) Σ_{i=0}^{t} (g^i)² )
σ^t: root mean square of the previous derivatives of parameter w
Adagrad
• Divide the learning rate of each parameter by the root mean square of its previous derivatives.
1/t decay:  η^t = η / √(t + 1)
σ^t = √( (1/(t+1)) Σ_{i=0}^{t} (g^i)² )
The √(t + 1) factors cancel, so the update simplifies to:
w^{t+1} ← w^t − (η^t / σ^t) g^t  =  w^t − ( η / √( Σ_{i=0}^{t} (g^i)² ) ) g^t
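A minimal per-parameter Adagrad sketch following this simplified update, again on an assumed illustrative loss:

```python
import numpy as np

def grad_L(theta):                 # gradient of the illustrative loss theta1^2 + 2*theta2^2
    return np.array([2 * theta[0], 4 * theta[1]])

eta = 1.0                          # base learning rate (illustrative)
theta = np.array([2.0, -3.0])
sum_sq = np.zeros_like(theta)      # accumulated (g^i)^2, kept separately for each parameter

for t in range(200):
    g = grad_L(theta)
    sum_sq += g ** 2                                    # accumulate squared gradients
    theta = theta - eta * g / (np.sqrt(sum_sq) + 1e-8)  # per-parameter step eta / sqrt(sum of g^2)
                                                        # (the small 1e-8 avoids division by zero;
                                                        #  a common practical detail, not on the slides)
print(theta)
```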
Contradiction?
Vanilla gradient descent:  w^{t+1} ← w^t − η^t g^t,   g^t = ∂L(θ^t)/∂w,  η^t = η / √(t + 1)
  → larger gradient, larger step
Adagrad:  w^{t+1} ← w^t − ( η / √( Σ_{i=0}^{t} (g^i)² ) ) g^t
  → larger gradient, larger step (g^t in the numerator), yet larger gradient, smaller step (the accumulated gradients in the denominator)
Intuitive Reason
• How surprising it is: dividing g^t by √( Σ_{i=0}^{t} (g^i)² ) measures the contrast between the current gradient and the previous ones.
w^{t+1} ← w^t − ( η / √( Σ_{i=0}^{t} (g^i)² ) ) g^t,   g^t = ∂L(θ^t)/∂w
Gradients 0.001, 0.001, 0.003, 0.002, then g^4 = 0.1: especially large by contrast → a relatively large step.
Gradients 10.8, 20.9, 31.7, 12.1, then g^4 = 0.1: especially small by contrast → a relatively small step.
Larger gradient, larger steps?
Consider a single-variable quadratic y = ax² + bx + c, whose minimum is at x = −b / (2a).
|dy/dx| = |2ax + b|
At a point x0, the first derivative is |2a·x0 + b|.
Best step (the distance from x0 to the minimum):  |x0 + b/(2a)| = |2a·x0 + b| / (2a)
So a larger 1st order derivative means we are farther from the minimum.
Comparison between different parameters
[Figure: loss contours over w1 and w2, with a cross-section along each axis; points a, b on one cross-section (gradient at a > gradient at b) and points c, d on the other (gradient at c > gradient at d)]
Within a single parameter, a larger 1st order derivative means farther from the minimum — but do not compare across parameters.
Second Derivative
y = ax² + bx + c,  minimum at x = −b / (2a)
|dy/dx| = |2ax + b|,   d²y/dx² = 2a
At x0, the best step  |x0 + b/(2a)| = |2a·x0 + b| / (2a)  can therefore be written as
The best step is  |First derivative| / Second derivative
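A quick numeric check of this formula on an assumed quadratic (a = 2, b = 3, c = 1 and x0 = 2 are illustrative values):

```python
# y = a*x^2 + b*x + c, with minimum at x = -b / (2a)
a, b, c = 2.0, 3.0, 1.0
x0 = 2.0

first_derivative = abs(2 * a * x0 + b)              # |dy/dx| at x0 -> 11.0
second_derivative = 2 * a                           # d2y/dx2      -> 4.0
best_step = first_derivative / second_derivative    # |first| / second -> 2.75

x_min = -b / (2 * a)                                # -0.75
# Stepping from x0 by |f'| / f'' (downhill) lands exactly on the minimum of the quadratic.
assert abs((x0 - best_step) - x_min) < 1e-12
print(best_step, x_min)
```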
Comparison between different parameters
[Figure: loss contours over w1 and w2, where one direction has a smaller second derivative and the other a larger one; points a, b on one cross-section (gradient at a > gradient at b) and points c, d on the other (gradient at c > gradient at d)]
Within a single parameter, a larger 1st order derivative means farther from the minimum, but do not compare across parameters: the best step is |First derivative| / Second derivative, so the second derivative matters as well.
Use first derivative to estimate second derivative: in a direction with a larger second derivative, the first derivatives sampled over many points tend to be larger, so the denominator √( Σ_{i=0}^{t} (g^i)² ) in Adagrad acts as a cheap estimate of the second derivative:
w^{t+1} ← w^t − ( η / √( Σ_{i=0}^{t} (g^i)² ) ) g^t
Gradient Descent Tip 2: Stochastic Gradient Descent
Make the training faster
Stochastic Gradient Descent
Gradient descent:  θ^i = θ^{i−1} − η∇L(θ^{i−1})
  L = Σ_n ( ŷ^n − (b + Σ_i w_i x_i^n) )²   — the loss is the summation over all training examples
Stochastic gradient descent: pick one example x^n and update  θ^i = θ^{i−1} − η∇L^n(θ^{i−1})   — faster!
  L^n = ( ŷ^n − (b + Σ_i w_i x_i^n) )²   — the loss for only one example
• Demo
Stochastic Gradient Descent
Gradient descent: sees all examples, and updates only after seeing all of them.
Stochastic gradient descent: sees only one example, and updates for each example — if there are 20 examples, it makes 20 updates in the time gradient descent makes one (20 times faster).
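A minimal SGD sketch for a linear model of this form, on a small synthetic dataset; the data, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))               # 20 examples with 2 features (illustrative data)
y = 1.0 + X @ np.array([3.0, -2.0])        # targets generated from known b and w, for checking

w = np.zeros(2)
b = 0.0
eta = 0.05                                 # learning rate (illustrative)

for epoch in range(100):
    for n in rng.permutation(len(X)):      # pick one example x^n at a time
        err = y[n] - (b + X[n] @ w)        # residual on this single example
        # gradient of L^n = (y^n - (b + w . x^n))^2 with respect to w and b
        w += eta * 2 * err * X[n]          # update immediately after seeing this example
        b += eta * 2 * err

print(b, w)                                # approaches b = 1.0, w = [3.0, -2.0]
```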
Gradient Descent Tip 3: Feature Scaling
Feature Scaling
Make different features have the same scaling.
Source of figure: http://cs231n.github.io/neural-networks-2/
y = b + w1·x1 + w2·x2
[Figure: distributions of x1 and x2 before and after scaling]
Feature Scaling
y = b + w1·x1 + w2·x2
If x1 takes values like 1, 2, …… while x2 takes values like 100, 200, ……, a small change in w2 changes the loss L far more than the same change in w1, so the contours of L in the (w1, w2) plane are elongated ellipses.
After scaling so that x1 and x2 both take values like 1, 2, ……, the contours of L become roughly circular, and gradient descent heads more directly toward the minimum.
Feature Scaling
Given the examples x^1, x^2, x^3, …, x^r, …, x^R, for each dimension i:
  compute the mean m_i and the standard deviation σ_i of that dimension,
  then set  x_i^r ← (x_i^r − m_i) / σ_i
Afterwards, the means of all dimensions are 0 and the variances are all 1.
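A minimal NumPy sketch of this standardization, where the rows of an assumed sample matrix play the role of the examples x^1 … x^R:

```python
import numpy as np

# Rows are the examples x^1 ... x^R, columns are the dimensions i (illustrative data).
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

m = X.mean(axis=0)            # mean m_i of each dimension
s = X.std(axis=0)             # standard deviation sigma_i of each dimension
X_scaled = (X - m) / s        # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))  # all ~0
print(X_scaled.var(axis=0))   # all ~1
```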
Gradient Descent Theory
Question
• When solving  θ* = arg min_θ L(θ)  by gradient descent:
• Each time we update the parameters, we obtain θ that makes L(θ) smaller.
  L(θ^0) > L(θ^1) > L(θ^2) > ⋯
Is this statement correct?
Warning of Math
Formal Derivation
• Suppose that θ has two variables {θ1, θ2}
[Figure: contours of L(θ) in the (θ1, θ2) plane with the starting point θ^0 and the successive points θ^1, θ^2]
Given a point, we can easily find the point with the smallest value nearby. How?
Taylor Series
• Taylor series: Let h(x) be any function infinitely differentiable around x = x0.
  h(x) = Σ_{k=0}^{∞} h^(k)(x0)/k! · (x − x0)^k
       = h(x0) + h′(x0)(x − x0) + h″(x0)/2! · (x − x0)² + ……
When x is close to x0:  h(x) ≈ h(x0) + h′(x0)(x − x0)
E.g. the Taylor series for h(x) = sin(x) around x0 = π/4:
  sin(x) = sin(π/4) + cos(π/4)(x − π/4) − sin(π/4)/2! · (x − π/4)² + ……
The approximation is good around π/4.
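A quick numeric check of the first-order approximation h(x) ≈ h(x0) + h′(x0)(x − x0) for sin(x) around x0 = π/4:

```python
import numpy as np

x0 = np.pi / 4                                   # expansion point
for dx in [0.01, 0.1, 1.0]:
    x = x0 + dx
    approx = np.sin(x0) + np.cos(x0) * (x - x0)  # h(x0) + h'(x0) (x - x0)
    print(dx, abs(np.sin(x) - approx))           # error grows as x moves away from x0
```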
Multivariable Taylor Series
h(x, y) = h(x0, y0) + ∂h(x0, y0)/∂x · (x − x0) + ∂h(x0, y0)/∂y · (y − y0)
          + something related to (x − x0)² and (y − y0)² + ……
When x and y are close to x0 and y0:
h(x, y) ≈ h(x0, y0) + ∂h(x0, y0)/∂x · (x − x0) + ∂h(x0, y0)/∂y · (y − y0)
Back to Formal Derivation
[Figure: contours of L(θ) in the (θ1, θ2) plane, with a small red circle centered at (a, b)]
Based on the Taylor series: if the red circle is small enough, then inside the red circle
  L(θ) ≈ s + u(θ1 − a) + v(θ2 − b)
where  s = L(a, b),  u = ∂L(a, b)/∂θ1,  v = ∂L(a, b)/∂θ2
Back to Formal Derivation
Based on the Taylor series: if the red circle is small enough, then inside the red circle
  L(θ) ≈ s + u(θ1 − a) + v(θ2 − b),   s = L(a, b)
Find θ1 and θ2 in the red circle minimizing L(θ), i.e. subject to
  (θ1 − a)² + (θ2 − b)² ≤ d²
Since s is a constant, we only need to minimize u(θ1 − a) + v(θ2 − b). Simple, right?
Gradient descent — two variables
Red circle (if the radius d is small):  L(θ) ≈ s + u(θ1 − a) + v(θ2 − b)
Write Δθ1 = θ1 − a and Δθ2 = θ2 − b. We want θ1 and θ2 inside the red circle, (θ1 − a)² + (θ2 − b)² ≤ d², minimizing L(θ).
To minimize L(θ), choose (Δθ1, Δθ2) pointing opposite to (u, v) and reaching the boundary of the circle:
  [Δθ1, Δθ2]ᵀ = −η [u, v]ᵀ,   i.e.   [θ1, θ2]ᵀ = [a, b]ᵀ − η [u, v]ᵀ
Back to Formal Derivation
Find θ1 and θ2 yielding the smallest value of L(θ) in the circle:
  [θ1, θ2]ᵀ = [a, b]ᵀ − η [u, v]ᵀ = [a, b]ᵀ − η [∂L(a, b)/∂θ1, ∂L(a, b)/∂θ2]ᵀ
This is gradient descent.
Based on the Taylor series, the approximation L(θ) ≈ s + u(θ1 − a) + v(θ2 − b) holds only if the red circle is small enough; it is not satisfied if the red circle (learning rate) is not small enough.
You can also consider the second-order term, e.g. Newton's method.
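As a minimal sketch of what using the second-order term can buy, here is a Newton step compared with a plain gradient step on an assumed quadratic loss (not from the slides):

```python
import numpy as np

# Illustrative quadratic loss L(theta) = theta1^2 + 10*theta2^2.
def grad_L(theta):
    return np.array([2 * theta[0], 20 * theta[1]])

H = np.array([[2.0, 0.0],        # Hessian (matrix of second derivatives); constant for a quadratic
              [0.0, 20.0]])

theta = np.array([3.0, 1.0])

gd_step = theta - 0.05 * grad_L(theta)                     # first-order (gradient descent) step
newton_step = theta - np.linalg.solve(H, grad_L(theta))    # Newton step: theta - H^{-1} grad

print(gd_step)      # moves part of the way toward the minimum
print(newton_step)  # lands exactly on the minimum (0, 0) for a quadratic
```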
End of Warning
More Limitations of Gradient Descent
[Figure: loss surface over w1 and w2, and the loss against the value of the parameter w, showing a plateau, a saddle point, and a local minimum]
Very slow at the plateau (∂L/∂w ≈ 0)
Stuck at a saddle point (∂L/∂w = 0)
Stuck at a local minimum (∂L/∂w = 0)
Acknowledgement
• Thanks to Victor Chen for spotting the text errors on the slides.