Gradient Descent - xiaohongliu.ca

Transcript
Page 1:

Gradient Descent

Page 2:

Review: Gradient Descent

โ€ข In step 3, we have to solve the following optimization problem:

θ^∗ = argmin_θ L(θ)        L: loss function,   θ: parameters

Suppose that ฮธ has two variables {ฮธ1, ฮธ2}

Randomly start at θ^0 = [θ1^0, θ2^0]^T

∇L(θ) = [ ∂L(θ)/∂θ1 , ∂L(θ)/∂θ2 ]^T

[θ1^1, θ2^1]^T = [θ1^0, θ2^0]^T − η [ ∂L(θ^0)/∂θ1 , ∂L(θ^0)/∂θ2 ]^T        i.e.  θ^1 = θ^0 − η∇L(θ^0)

[θ1^2, θ2^2]^T = [θ1^1, θ2^1]^T − η [ ∂L(θ^1)/∂θ1 , ∂L(θ^1)/∂θ2 ]^T        i.e.  θ^2 = θ^1 − η∇L(θ^1)

Page 3:

Review: Gradient Descent

Start at position θ^0

Compute gradient at θ^0

Move to θ^1 = θ^0 − η∇L(θ^0)

Compute gradient at θ^1

Move to θ^2 = θ^1 − η∇L(θ^1)

……

[Figure: the path θ^0 → θ^1 → θ^2 → θ^3 on the contour plot of the loss in the (θ1, θ2) plane; at each point the gradient ∇L(θ^t) points along the normal of the contour line, and the movement is in the opposite direction. Gradient: the direction normal to the contour lines of the loss.]
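A minimal sketch of this update loop in Python (NumPy); the loss and its gradient function grad_L below are hypothetical examples for illustration, not something defined on the slides:

```python
import numpy as np

# Hypothetical example loss: L(theta) = (theta1 - 1)^2 + 4*(theta2 + 2)^2
def grad_L(theta):
    """Gradient of the example loss, i.e. [dL/dtheta1, dL/dtheta2]."""
    return np.array([2 * (theta[0] - 1), 8 * (theta[1] + 2)])

eta = 0.1                                 # learning rate
theta = np.random.randn(2)                # randomly start at theta^0
for t in range(100):
    theta = theta - eta * grad_L(theta)   # theta^{t+1} = theta^t - eta * grad L(theta^t)
print(theta)                              # approaches the minimum at (1, -2)
```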

Page 4:

Gradient Descent Tip 1: Tuning your learning rates

Page 5:

Learning Rate

θ^i = θ^{i−1} − η∇L(θ^{i−1})

Set the learning rate η carefully.

[Figure: left, the loss surface over two parameters with update steps of different sizes; right, loss versus the number of parameter updates for learning rates that are very large, large, small, and just right.]

If there are more than three parameters, you cannot visualize the loss surface, but you can always visualize the curve of loss versus the number of parameter updates.

Page 6:

Adaptive Learning Rates

• Popular & simple idea: reduce the learning rate by some factor every few epochs.

  • At the beginning, we are far from the destination, so we use a larger learning rate.

  • After several epochs, we are close to the destination, so we reduce the learning rate.

  • E.g. 1/t decay: η^t = η / √(t + 1)   (see the sketch after this list)

• The learning rate cannot be one-size-fits-all.

  • Give different parameters different learning rates.
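A small sketch of the 1/t decay schedule in Python; only the formula itself comes from the slide, the initial value and the printed steps are illustrative:

```python
import numpy as np

eta0 = 1.0                     # initial learning rate eta (illustrative value)

def eta(t):
    """1/t decay from the slide: eta^t = eta / sqrt(t + 1)."""
    return eta0 / np.sqrt(t + 1)

# The step size shrinks as training proceeds:
print([round(eta(t), 3) for t in range(5)])   # [1.0, 0.707, 0.577, 0.5, 0.447]
```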

Page 7:

Adagrad

โ€ข Divide the learning rate of each parameter by the root mean square of its previous derivatives

๐œŽ๐‘ก: root mean square of the previous derivatives of parameter w

w is one parameters

๐‘”๐‘ก =๐œ•๐ฟ ๐œƒ๐‘ก

๐œ•๐‘ค

Vanilla Gradient descent

Adagrad

๐‘ค๐‘ก+1 โ† ๐‘ค๐‘ก โˆ’ ๐œ‚๐‘ก๐‘”๐‘ก

๐œ‚๐‘ก =๐œ‚

๐‘ก + 1

๐‘ค๐‘ก+1 โ† ๐‘ค๐‘ก โˆ’๐œ‚๐‘ก

๐œŽ๐‘ก๐‘”๐‘ก

Parameter dependent

Page 8:

Adagrad

w^1 ← w^0 − (η^0 / σ^0) g^0        σ^0 = √( (g^0)² )

w^2 ← w^1 − (η^1 / σ^1) g^1        σ^1 = √( (1/2)[ (g^0)² + (g^1)² ] )

w^3 ← w^2 − (η^2 / σ^2) g^2        σ^2 = √( (1/3)[ (g^0)² + (g^1)² + (g^2)² ] )

……

w^{t+1} ← w^t − (η^t / σ^t) g^t        σ^t = √( (1/(t+1)) Σ_{i=0}^{t} (g^i)² )

σ^t: root mean square of the previous derivatives of parameter w

Page 9:

Adagrad

โ€ข Divide the learning rate of each parameter by the root mean square of its previous derivatives

๐œ‚๐‘ก =๐œ‚

๐‘ก + 1

๐‘ค๐‘ก+1 โ† ๐‘ค๐‘ก โˆ’๐œ‚

ฯƒ๐‘–=0๐‘ก ๐‘”๐‘– 2

๐‘”๐‘ก

1/t decay

๐‘ค๐‘ก+1 โ† ๐‘ค๐‘ก โˆ’๐œ‚

๐œŽ๐‘ก๐‘”๐‘ก

๐œŽ๐‘ก =1

๐‘ก + 1

๐‘–=0

๐‘ก

๐‘”๐‘– 2

๐œ‚๐‘ก

๐œŽ๐‘ก

Page 10:

Contradiction?

๐‘ค๐‘ก+1 โ† ๐‘ค๐‘ก โˆ’๐œ‚

ฯƒ๐‘–=0๐‘ก ๐‘”๐‘– 2

๐‘”๐‘ก

Vanilla Gradient descent

Adagrad

Larger gradient, larger step

Larger gradient, smaller step

Larger gradient, larger step

๐‘ค๐‘ก+1 โ† ๐‘ค๐‘ก โˆ’ ๐œ‚๐‘ก๐‘”๐‘ก

๐‘”๐‘ก =๐œ•๐ฟ ๐œƒ๐‘ก

๐œ•๐‘ค๐œ‚๐‘ก =

๐œ‚

๐‘ก + 1

Page 11:

Intuitive Reason

• How surprising the current gradient is: the denominator creates a contrast with the previous gradients.

w^{t+1} ← w^t − η / √( Σ_{i=0}^{t} (g^i)² ) · g^t        g^t = ∂L(θ^t) / ∂w        η^t = η / √(t + 1)

g^0      g^1      g^2      g^3      g^4                          ……
0.001    0.001    0.003    0.002    0.1 (especially large)       ……

g^0      g^1      g^2      g^3      g^4                          ……
10.8     20.9     31.7     12.1     0.1 (especially small)       ……

Page 12:

Larger gradient, larger steps?

๐‘ฆ = ๐‘Ž๐‘ฅ2 + ๐‘๐‘ฅ + ๐‘

๐œ•๐‘ฆ

๐œ•๐‘ฅ= |2๐‘Ž๐‘ฅ + ๐‘|

๐‘ฅ0

|๐‘ฅ0 +๐‘

2๐‘Ž|

๐‘ฅ0

|2๐‘Ž๐‘ฅ0 + ๐‘|

Best step:

โˆ’๐‘

2๐‘Ž

|2๐‘Ž๐‘ฅ0 + ๐‘|

2๐‘ŽLarger 1st order derivative means far from the minima

Page 13:

Comparison between different parameters

[Figure: loss contours over two parameters w1 and w2, with points a and b along the w1 direction and points c and d along the w2 direction.]

a > b        c > d

A larger first-order derivative means being farther from the minimum, but only when comparing points of the same parameter. Do not cross parameters.

Page 14:

Second Derivative

๐‘ฆ = ๐‘Ž๐‘ฅ2 + ๐‘๐‘ฅ + ๐‘

๐œ•๐‘ฆ

๐œ•๐‘ฅ= |2๐‘Ž๐‘ฅ + ๐‘|

โˆ’๐‘

2๐‘Ž

๐‘ฅ0

|๐‘ฅ0 +๐‘

2๐‘Ž|

๐‘ฅ0

|2๐‘Ž๐‘ฅ0 + ๐‘|

Best step:

๐œ•2๐‘ฆ

๐œ•๐‘ฅ2= 2๐‘Ž The best step is

|First derivative|

Second derivative

|2๐‘Ž๐‘ฅ0 + ๐‘|

2๐‘Ž
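A quick numeric check of this formula in Python; the quadratic (a = 3, b = −6, c = 1) and the point x0 = 5 are arbitrary illustrative values, not from the slide:

```python
a, b, c = 3.0, -6.0, 1.0
x0 = 5.0

minimum = -b / (2 * a)            # x that minimizes y = a x^2 + b x + c
first = abs(2 * a * x0 + b)       # |first derivative| at x0
second = 2 * a                    # second derivative
best_step = first / second        # |first derivative| / second derivative

print(minimum)                    # 1.0
print(best_step)                  # 4.0 ...
print(abs(x0 - minimum))          # ... which equals the distance |x0 - minimum| = 4.0
```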

Page 15:

Comparison between different parameters

[Figure: the same loss contours over w1 and w2, with points a, b along the w1 direction and c, d along the w2 direction. The w1 direction has the smaller second derivative; the w2 direction has the larger second derivative.]

a > b        c > d

A larger first-order derivative means being farther from the minimum, but do not cross parameters.

The best step is |first derivative| / second derivative: dividing by the second derivative makes the comparison valid across parameters.

Page 16:

The best step is |first derivative| / second derivative.

Adagrad uses the first derivatives to estimate the second derivative: in

w^{t+1} ← w^t − η / √( Σ_{i=0}^{t} (g^i)² ) · g^t

the denominator √( Σ_{i=0}^{t} (g^i)² ), built from squared first derivatives, plays the role of the second derivative.

[Figure: two curves over w1 and w2, one with a smaller and one with a larger second derivative; sampling the first derivative at many points, the curve with the larger second derivative also yields larger first derivatives on average.]
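A small numeric illustration of this idea, under the assumption that we sample gradients of two quadratics y = a·x² with different curvatures; the curves, sample range, and sample count are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-3, 3, size=1000)        # sample points along the parameter axis

for a in (0.5, 4.0):                      # smaller vs larger second derivative (2a)
    grads = 2 * a * xs                    # first derivatives of y = a x^2 at the samples
    rms = np.sqrt(np.mean(grads ** 2))    # root mean square of the sampled first derivatives
    print(a, 2 * a, round(rms, 2))        # the rms grows with the second derivative 2a
```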

Page 17:

Gradient Descent Tip 2: Stochastic Gradient Descent

Make the training faster

Page 18:

Stochastic Gradient Descent

Gradient Descent:

L = Σ_n ( ŷ^n − ( b + Σ_i w_i x_i^n ) )²        Loss is the summation over all training examples

θ^i = θ^{i−1} − η∇L(θ^{i−1})

Stochastic Gradient Descent (faster!):

Pick an example x^n

L^n = ( ŷ^n − ( b + Σ_i w_i x_i^n ) )²        Loss for only one example

θ^i = θ^{i−1} − η∇L^n(θ^{i−1})
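A minimal sketch contrasting the two updates for a linear model in Python; the synthetic data, learning rate, epoch counts, and the averaging used to keep the full-batch step stable are assumptions for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))                 # 20 training examples, 2 features
y_hat = X @ np.array([3.0, -2.0]) + 1.0      # targets from a known linear model

eta = 0.1

# Gradient descent: one update per pass over all 20 examples
w, b = np.zeros(2), 0.0
for epoch in range(200):
    err = y_hat - (X @ w + b)                # residuals of every example
    w += eta * (X.T @ err) / len(X)          # gradient step (constant factors folded into eta,
    b += eta * err.mean()                    #  averaged over examples for a stable step size)
print("GD :", w, b)

# Stochastic gradient descent: one update per example
w, b = np.zeros(2), 0.0
for epoch in range(50):
    for n in range(len(X)):
        err_n = y_hat[n] - (X[n] @ w + b)    # residual of a single example x^n
        w += eta * err_n * X[n]
        b += eta * err_n
print("SGD:", w, b)                          # both end up near [3, -2] and 1
```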

Page 19:

โ€ข Demo

Page 20:

Stochastic Gradient Descent

Gradient Descent: see all examples, then update after seeing all examples.

Stochastic Gradient Descent: see only one example and update for each example. If there are 20 examples, that is 20 updates in the time gradient descent takes for one, i.e. 20 times faster.

Page 21:

Gradient Descent Tip 3: Feature Scaling

Page 22:

Feature Scaling

Make different features have the same scaling

Source of figure: http://cs231n.github.io/neural-networks-2/

๐‘ฆ = ๐‘ + ๐‘ค1๐‘ฅ1 +๐‘ค2๐‘ฅ2

๐‘ฅ1 ๐‘ฅ1

๐‘ฅ2 ๐‘ฅ2

Page 23:

Feature Scaling

y = b + w1x1 + w2x2

Left: x1 takes values like 1, 2, …… while x2 takes values like 100, 200, ……. A change in w2 then changes the loss L much more than the same change in w1, so the contours of L in the (w1, w2) plane are elongated.

Right: after scaling, x1 and x2 both take values like 1, 2, ……; the contours of L are much rounder, and gradient descent heads more directly toward the minimum.

Page 24:

Feature Scaling

Examples: x^1, x^2, x^3, …, x^r, …, x^R, each with components x_i^r.

For each dimension i: compute the mean m_i and the standard deviation σ_i, then

x_i^r ← ( x_i^r − m_i ) / σ_i

The means of all dimensions are 0, and the variances are all 1.
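A short sketch of this standardization in Python, assuming the R examples are stacked as rows of a NumPy array X (the data layout and toy values are assumptions, not from the slide):

```python
import numpy as np

X = np.array([[1.0, 100.0],      # x^1
              [2.0, 200.0],      # x^2
              [3.0, 400.0]])     # x^3  (toy examples: rows = examples, columns = dimensions i)

m = X.mean(axis=0)               # mean m_i of each dimension
s = X.std(axis=0)                # standard deviation sigma_i of each dimension
X_scaled = (X - m) / s           # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))     # ~[0, 0]
print(X_scaled.std(axis=0))      # [1, 1]
```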

Page 25:

Gradient Descent Theory

Page 26:

Question

• When solving θ^∗ = argmin_θ L(θ) by gradient descent:

• Each time we update the parameters, we obtain a θ that makes L(θ) smaller:

L(θ^0) > L(θ^1) > L(θ^2) > ⋯

Is this statement correct?

Page 27:

Warning of Math

Page 28:

Formal Derivation

• Suppose that θ has two variables {θ1, θ2}

[Figure: contour plot of L(θ) in the (θ1, θ2) plane with the points θ^0, θ^1, θ^2 marked.]

How? Given a point, we can easily find the point with the smallest value nearby.

Page 29:

Taylor Series

• Taylor series: Let h(x) be any function infinitely differentiable around x = x0.

h(x) = Σ_{k=0}^{∞} ( h^(k)(x0) / k! ) (x − x0)^k
     = h(x0) + h′(x0)(x − x0) + ( h″(x0) / 2! )(x − x0)² + ⋯

When x is close to x0:   h(x) ≈ h(x0) + h′(x0)(x − x0)

Page 30:

E.g. Taylor series for h(x) = sin(x) around x0 = π/4:

sin(x) = sin(π/4) + cos(π/4)(x − π/4) − ( sin(π/4) / 2! )(x − π/4)² − ( cos(π/4) / 3! )(x − π/4)³ + ……

[Figure: sin(x) together with its Taylor approximations of increasing order.] The approximation is good around π/4.
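A quick check of the first-order approximation in Python; the sample points are arbitrary:

```python
import math

x0 = math.pi / 4

def approx(x):
    """First-order Taylor approximation of sin around x0 = pi/4."""
    return math.sin(x0) + math.cos(x0) * (x - x0)

for x in (x0 + 0.01, x0 + 0.1, x0 + 1.0):
    print(round(math.sin(x), 4), round(approx(x), 4))   # close near pi/4, poor far away
```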

Page 31:

Multivariable Taylor Series

h(x, y) = h(x0, y0) + ( ∂h(x0, y0)/∂x )(x − x0) + ( ∂h(x0, y0)/∂y )(y − y0) + something related to (x − x0)² and (y − y0)² + ……

When x and y are close to x0 and y0:

h(x, y) ≈ h(x0, y0) + ( ∂h(x0, y0)/∂x )(x − x0) + ( ∂h(x0, y0)/∂y )(y − y0)

Page 32:

Back to Formal Derivation

Based on Taylor series: if the red circle centered at (a, b) is small enough, then inside the red circle

L(θ) ≈ L(a, b) + ( ∂L(a, b)/∂θ1 )(θ1 − a) + ( ∂L(a, b)/∂θ2 )(θ2 − b)

Let   s = L(a, b),   u = ∂L(a, b)/∂θ1,   v = ∂L(a, b)/∂θ2

so that   L(θ) ≈ s + u(θ1 − a) + v(θ2 − b)

[Figure: contour plot of L(θ) with a red circle centered at (a, b).]

Page 33:

Back to Formal Derivation

Based on Taylor series: if the red circle is small enough, inside the red circle

L(θ) ≈ s + u(θ1 − a) + v(θ2 − b)        s = L(a, b) is a constant,   u = ∂L(a, b)/∂θ1,   v = ∂L(a, b)/∂θ2

Find θ1 and θ2 in the red circle minimizing L(θ):

(θ1 − a)² + (θ2 − b)² ≤ d²

Simple, right?

Page 34:

Gradient descent – two variables

Red circle: (θ1 − a)² + (θ2 − b)² ≤ d²   (if the radius is small)

Inside the circle:   L(θ) ≈ s + u(θ1 − a) + v(θ2 − b)

Find θ1 and θ2 in the red circle minimizing L(θ). Writing (Δθ1, Δθ2) = (θ1 − a, θ2 − b), the term u·Δθ1 + v·Δθ2 is an inner product with (u, v), so it is minimized by choosing (Δθ1, Δθ2) opposite to (u, v) and as long as the radius allows:

[Δθ1, Δθ2] = −η [u, v]        ⇒        [θ1, θ2] = [a, b] − η [u, v]

Page 35:

Back to Formal Derivation

Find ๐œƒ1 and ๐œƒ2 yielding the smallest value of ๐ฟ ๐œƒ in the circle

v

u

b

a

2

1

2

1

,L

,L

ba

ba

b

a This is gradient descent.

Based on Taylor Series:If the red circle is small enough, in the red circle

bvaus 21L

bas ,L

21

,L,

,L

bav

bau

constant

Not satisfied if the red circle (learning rate) is not small enough

You can consider the second order term, e.g. Newtonโ€™s method.
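For reference, a minimal one-dimensional Newton's method sketch in Python; the example function and the starting point are illustrative assumptions, not from the slides:

```python
# Newton's method step: x <- x - L'(x) / L''(x)
def dL(x):
    return 4 * x**3 - 9 * x**2      # first derivative of the example L(x) = x^4 - 3 x^3

def d2L(x):
    return 12 * x**2 - 18 * x       # second derivative of the example L

x = 4.0                             # illustrative starting point
for _ in range(10):
    x = x - dL(x) / d2L(x)          # step size divided by the second derivative
print(x)                            # converges to the local minimum at x = 2.25
```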

Page 36:

End of Warning

Page 37:

More Limitation of Gradient Descent

[Figure: the loss L plotted against the value of the parameter w (and over w1, w2), showing a plateau, a saddle point, and a local minimum.]

Very slow at the plateau:   ∂L/∂w ≈ 0

Stuck at a saddle point:   ∂L/∂w = 0

Stuck at local minima:   ∂L/∂w = 0

Page 38:

Acknowledgement

• Thanks to Victor Chen for spotting the typos on the slides.