Support Vector Machines: Training with Stochastic Gradient ... · Machine Learning Support Vector...

Machine Learning

Support Vector Machines: Training with Stochastic Gradient Descent

Support vector machines

• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels

SVM objective function

Regularization term:
• Maximize the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences

Empirical loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences

C: A hyper-parameter that controls the tradeoff between a large margin and a small hinge loss
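The equation that these annotations point at did not survive extraction. A plausible reconstruction, assuming the standard soft-margin form (the symbols $J$ and $N$ and the factor $\frac{1}{2}$ are assumptions, chosen because they reproduce the update rule given later in these slides), is:

```latex
J(\mathbf{w}) \;=\;
\underbrace{\frac{1}{2}\,\mathbf{w}^\top \mathbf{w}}_{\text{regularization term}}
\;+\; C \underbrace{\sum_{i=1}^{N} \max\bigl(0,\; 1 - y_i\,\mathbf{w}^\top \mathbf{x}_i\bigr)}_{\text{empirical (hinge) loss}}
```

Under this reading, a large $C$ puts more weight on fitting the training examples and a small $C$ puts more weight on a large margin.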

Outline: Training SVM by optimization

1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Solving the SVM optimization problem

The SVM objective function is convex in w

Recall: Convex functions

A function $f$ is convex if for every $u$, $v$ in the domain, and for every $\lambda \in [0,1]$ we have

$$f(\lambda u + (1-\lambda) v) \le \lambda f(u) + (1-\lambda) f(v)$$

From geometric perspective: every tangent plane lies below the function

[Figure: convex function with the points u, v and the values f(u), f(v) marked]

Convex functions

Examples: linear functions are convex; the max of convex functions is convex

Some ways to show that a function is convex:

1. Using the definition of convexity
2. Showing that the second derivative is non-negative (for one-dimensional functions)
3. Showing that the Hessian, the matrix of second derivatives, is positive semi-definite (for functions of a vector)
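As a rough illustration of the first approach, the definition can be spot-checked numerically. The sketch below tests the convexity inequality at random points for the hinge loss written as a function of the margin $z$. The function names are hypothetical, and a check on random samples is only evidence, not a proof.

```python
import random

def hinge(z):
    """Hinge loss as a function of the margin z = y * w.x."""
    return max(0.0, 1.0 - z)

def satisfies_convexity(f, u, v, lam, tol=1e-12):
    """Check f(lam*u + (1-lam)*v) <= lam*f(u) + (1-lam)*f(v) for one triple."""
    left = f(lam * u + (1.0 - lam) * v)
    right = lam * f(u) + (1.0 - lam) * f(v)
    return left <= right + tol

def count_violations(f, trials=10000, seed=0):
    """Count random (u, v, lam) triples that break the convexity inequality."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        u = rng.uniform(-5.0, 5.0)
        v = rng.uniform(-5.0, 5.0)
        lam = rng.random()
        if not satisfies_convexity(f, u, v, lam):
            violations += 1
    return violations

if __name__ == "__main__":
    print("hinge violations:", count_violations(hinge))
    print("-z^2 violations:", count_violations(lambda z: -z * z))
```

The hinge loss should produce no violations, while a concave function such as $-z^2$ should fail the test at almost every sampled triple.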

Not all functions are convex

These are concave:

$$f(\lambda u + (1-\lambda) v) \ge \lambda f(u) + (1-\lambda) f(v)$$

These are neither

[Figures: example plots of concave functions and of functions that are neither convex nor concave]

Convex functions are convenient

A function $f$ is convex if for every $u$, $v$ in the domain, and for every $\lambda \in [0,1]$ we have

$$f(\lambda u + (1-\lambda) v) \le \lambda f(u) + (1-\lambda) f(v)$$

[Figure: convex function with the points u, v and the values f(u), f(v) marked]

In general: Necessary condition for $x$ to be a minimum for the function $f$ is $\nabla f(x) = 0$

For convex functions, this is both necessary and sufficient

Solving the SVM optimization problem

The SVM objective function is convex in w

• This is a quadratic optimization problem because the objective is quadratic
• Older methods: Used techniques from Quadratic Programming
  – Very slow
• No constraints, can use gradient descent
  – Still very slow!

Gradient descent

General strategy for minimizing a function $J(w)$

• Start with an initial guess for $w$, say $w_0$
• Iterate till convergence:
  – Compute the gradient of $J$ at $w_t$
  – Update $w_t$ to get $w_{t+1}$ by taking a step in the opposite direction of the gradient

[Figure: plot of J(w) against w, with the successive iterates w0, w1, w2, w3 marked]

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

We are trying to minimize $J(w)$
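The strategy above can be sketched in a few lines. This is a minimal illustration on a one-dimensional toy objective, not the SVM objective: the function `J`, its gradient `grad_J`, the fixed learning rate and the fixed number of steps are all assumptions made for the example.

```python
def gradient_descent(grad, w0, rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - rate * grad(w)   # step in the opposite direction of the gradient
        history.append(w)
    return w, history

# Hypothetical convex objective for illustration: J(w) = (w - 3)^2, minimized at w = 3.
def J(w):
    return (w - 3.0) ** 2

def grad_J(w):
    return 2.0 * (w - 3.0)

w_final, history = gradient_descent(grad_J, w0=0.0)

if __name__ == "__main__":
    print("final w:", w_final)
    print("first iterates:", history[:4])
```

Here a fixed step count stands in for "iterate till convergence"; a practical implementation would more likely stop when the gradient or the change in $J$ becomes small.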

Gradient descent for SVM

We are trying to minimize the SVM objective $J(w)$

1. Initialize $w_0$
2. For t = 0, 1, 2, …:
   1. Compute gradient of $J(w)$ at $w_t$. Call it $\nabla J(w_t)$
   2. Update $w$ as follows: $w_{t+1} \leftarrow w_t - r \nabla J(w_t)$

$r$: Called the learning rate.

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Gradient descent for SVM

We are trying to minimize the SVM objective $J(w)$

1. Initialize $w_0$
2. For t = 0, 1, 2, …:
   1. Compute gradient of $J(w)$ at $w_t$. Call it $\nabla J(w_t)$
   2. Update $w$ as follows: $w_{t+1} \leftarrow w_t - r \nabla J(w_t)$

$r$: Called the learning rate

Gradient of the SVM objective requires summing over the entire training set

Slow, does not really scale
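To see where the cost comes from, here is a small sketch that evaluates the objective on a whole training set. It assumes the standard soft-margin form $J(w) = \frac{1}{2} w^T w + C \sum_i \max(0, 1 - y_i w^T x_i)$, which is not printed in this transcript, and the helper names are hypothetical. The gradient has the same shape: one term per training example.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def svm_objective(w, X, y, C):
    """Assumed form: 0.5 * w.w + C * sum_i max(0, 1 - y_i * w.x_i)."""
    regularizer = 0.5 * dot(w, w)
    # This sum visits every training example, so one evaluation costs O(N * d).
    hinge_total = sum(max(0.0, 1.0 - yi * dot(w, xi)) for xi, yi in zip(X, y))
    return regularizer + C * hinge_total

if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]
    y = [1, -1]
    print(svm_objective([0.0, 0.0], X, y, C=1.0))
```

Every gradient descent step pays this full pass over the data, which is the scaling problem the slide refers to.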

Stochastic gradient descent for SVM

Given a training set $S = \{(x_i, y_i)\}$, $x \in \mathbb{R}^n$, $y \in \{-1, 1\}$

1. Initialize $w_0 = 0 \in \mathbb{R}^n$
2. For epoch = 1 … T:
   1. Pick a random example $(x_i, y_i)$ from the training set $S$
   2. Treat $(x_i, y_i)$ as a full dataset and take the derivative of the SVM objective at the current $w_{t-1}$ to be $\nabla J_t(w_{t-1})$
   3. Update: $w_t \leftarrow w_{t-1} - \gamma_t \nabla J_t(w_{t-1})$
3. Return final $w$

This algorithm is guaranteed to converge to the minimum of $J$ if $\gamma_t$ is small enough. Why? The objective $J(w)$ is a convex function

What is the gradient of the hinge loss with respect to $w$? (The hinge loss is not a differentiable function!)

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Gradient Descent vs SGD

[Figure: Gradient descent]

[Figure, built up over several slides: Stochastic gradient descent]

Stochastic gradient descent: many more updates than gradient descent, but each individual update is less computationally expensive

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron


Hinge loss is not differentiable!

What is the derivative of the hinge loss with respect to $w$?

Detour: Sub-gradients

Generalization of gradients to non-differentiable functions
(Remember that every tangent is a hyperplane that lies below the function for convex functions)

Informally, a sub-tangent at a point is any hyperplane that passes through the function at that point and lies on or below the function everywhere. A sub-gradient is the slope of that hyperplane

Sub-gradients

[Example from Boyd]

[Figure: plot of f with the points x1 and x2 marked]

• f is differentiable at x1: tangent at this point. g1 is a gradient at x1
• g2 and g3 are both subgradients at x2

Formally, a vector $g$ is a subgradient to $f$ at point $x$ if

$$f(z) \ge f(x) + g^T (z - x) \quad \text{for all } z$$
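The formal condition for a subgradient $g$ of $f$ at $x$ is that $f(z) \ge f(x) + g^T (z - x)$ holds for all $z$. As a small sanity check, the sketch below tests this inequality on a grid of points for the one-dimensional hinge loss $\max(0, 1 - z)$. The helper names and the grid are assumptions for illustration, and a finite grid can only refute a candidate, not prove it.

```python
def hinge(z):
    """Hinge loss as a function of the margin z."""
    return max(0.0, 1.0 - z)

def is_subgradient(f, g, x, test_points, tol=1e-12):
    """True if f(z) >= f(x) + g * (z - x) at every test point z."""
    return all(f(z) >= f(x) + g * (z - x) - tol for z in test_points)

# Grid of test points on [-5, 5] in steps of 0.1.
points = [i / 10.0 for i in range(-50, 51)]

if __name__ == "__main__":
    # At the kink z = 1, every slope between -1 and 0 passes the test.
    for g in (-1.5, -1.0, -0.5, 0.0, 0.5):
        print("x = 1, g =", g, "->", is_subgradient(hinge, g, 1.0, points))
    # At z = 0 the function is differentiable, so only the gradient -1 passes.
    for g in (-1.0, -0.5):
        print("x = 0, g =", g, "->", is_subgradient(hinge, g, 0.0, points))
```

This mirrors the picture on the slide: one gradient where the function is smooth, and a whole range of subgradients at the kink.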

Sub-gradient of the SVM objective

General strategy: First solve the max and compute the gradient for each case
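The formula on this slide was lost in extraction. As a sketch of the case analysis, assume the single-example objective $J_t(\mathbf{w}) = \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C \max(0, 1 - y_i \mathbf{w}^\top \mathbf{x}_i)$. This form is an assumption, chosen because it reproduces the update rule used later in these slides. Resolving the max in each case gives:

```latex
\nabla J_t(\mathbf{w}) =
\begin{cases}
\mathbf{w} - C\, y_i\, \mathbf{x}_i
  & \text{if } y_i\, \mathbf{w}^\top \mathbf{x}_i \le 1
  \quad (\text{the max selects } 1 - y_i\, \mathbf{w}^\top \mathbf{x}_i) \\[4pt]
\mathbf{w}
  & \text{otherwise}
  \quad (\text{the max selects } 0)
\end{cases}
```

Substituting into $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla J_t(\mathbf{w})$ yields $\mathbf{w} \leftarrow (1 - \gamma_t)\mathbf{w} + \gamma_t C y_i \mathbf{x}_i$ in the first case and $\mathbf{w} \leftarrow (1 - \gamma_t)\mathbf{w}$ in the second. At the kink $y_i \mathbf{w}^\top \mathbf{x}_i = 1$ either expression is a valid sub-gradient; the first is used here to match the $\le$ test.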

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Stochastic sub-gradient descent for SVM

Given a training set $S = \{(x_i, y_i)\}$, $x \in \mathbb{R}^n$, $y \in \{-1, 1\}$

1. Initialize $w = 0 \in \mathbb{R}^n$
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example $(x_i, y_i) \in S$:
      If $y_i w^T x_i \le 1$, $w \leftarrow (1 - \gamma_t) w + \gamma_t C y_i x_i$
      else $w \leftarrow (1 - \gamma_t) w$
3. Return $w$

Update: $w \leftarrow w - \gamma_t \nabla J_t$

$\gamma_t$: learning rate, many tweaks possible

Important to shuffle examples at the start of each epoch
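The algorithm translates fairly directly into code. The following is a minimal pure-Python sketch rather than a reference implementation. It assumes the update reads: if $y_i w^T x_i \le 1$ then $w \leftarrow (1 - \gamma_t) w + \gamma_t C y_i x_i$, else $w \leftarrow (1 - \gamma_t) w$. The function name `svm_sgd`, the default values, and the per-update schedule $\gamma_t = \gamma_0 / (1 + t)$ are assumptions; there is no bias term, as in the slides.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def svm_sgd(X, y, C=1.0, gamma0=0.1, epochs=50, seed=0):
    """Stochastic sub-gradient descent for a linear SVM (no bias term)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])                     # 1. initialize w = 0
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):                   # 2. for each epoch
        rng.shuffle(order)                    # 2.1 shuffle the training set
        for i in order:                       # 2.2 visit each example
            gamma = gamma0 / (1.0 + t)        # assumed learning-rate schedule
            t += 1
            margin = y[i] * dot(w, X[i])      # computed with the current w
            w = [(1.0 - gamma) * wj for wj in w]          # shrink in both cases
            if margin <= 1.0:                             # inside the margin
                w = [wj + gamma * C * y[i] * xj for wj, xj in zip(w, X[i])]
    return w                                  # 3. return w

if __name__ == "__main__":
    X = [[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]]
    y = [1, 1, -1, -1]
    w = svm_sgd(X, y)
    print("w =", w)
    print("predictions:", [1 if dot(w, xi) > 0 else -1 for xi in X])
```

Shuffling an index list instead of the data keeps `X` and `y` aligned without copying them.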

Convergence and learning rates

With enough iterations, it will converge in expectation

Provided the step sizes are “square summable, but not summable”

• Step sizes $\gamma_t$ are positive
• Sum of squares of step sizes over $t = 1$ to $\infty$ is not infinite
• Sum of step sizes over $t = 1$ to $\infty$ is infinity
• Some examples: $\gamma_t = \frac{\gamma_0}{1 + \frac{\gamma_0 t}{C}}$ or $\gamma_t = \frac{\gamma_0}{1 + t}$
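The "square summable, but not summable" condition can be illustrated with partial sums. The sketch below assumes the two example schedules read $\gamma_t = \gamma_0 / (1 + \gamma_0 t / C)$ and $\gamma_t = \gamma_0 / (1 + t)$. The function names and default values are hypothetical, and finite partial sums only suggest the limiting behaviour.

```python
def schedule_a(t, gamma0=1.0, C=1.0):
    """gamma_t = gamma0 / (1 + gamma0 * t / C)."""
    return gamma0 / (1.0 + gamma0 * t / C)

def schedule_b(t, gamma0=1.0):
    """gamma_t = gamma0 / (1 + t)."""
    return gamma0 / (1.0 + t)

def partial_sums(schedule, horizon):
    """Return (sum of step sizes, sum of squared step sizes) for t = 1..horizon."""
    total = 0.0
    total_sq = 0.0
    for t in range(1, horizon + 1):
        g = schedule(t)
        total += g
        total_sq += g * g
    return total, total_sq

if __name__ == "__main__":
    for horizon in (1000, 100000):
        print(horizon, partial_sums(schedule_b, horizon))
```

As the horizon grows, the first sum keeps increasing (roughly like $\ln t$) while the sum of squares levels off, which is the behaviour the condition asks for.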

Convergence and learning rates

• Number of iterations to get to accuracy within $\epsilon$
• For strongly convex functions, N examples, d dimensional:
  – Gradient descent: $O(Nd \ln(1/\epsilon))$
  – Stochastic gradient descent: $O(d/\epsilon)$
• More subtleties involved, but SGD is generally preferable when the data size is huge
• Recently, many variants that are based on this general strategy, targeting multilayer neural networks
  – Examples: Adagrad, momentum, Nesterov’s accelerated gradient, Adam, RMSProp, etc…
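To get a feel for the two bounds, here is a back-of-the-envelope comparison that reads $O(Nd \ln(1/\epsilon))$ and $O(d/\epsilon)$ as operation counts with all constants dropped. The problem sizes are made up for illustration, so the numbers indicate orders of magnitude only.

```python
import math

def gd_cost(N, d, eps):
    """Order-of-magnitude cost N * d * ln(1/eps) for gradient descent."""
    return N * d * math.log(1.0 / eps)

def sgd_cost(d, eps):
    """Order-of-magnitude cost d / eps for stochastic gradient descent."""
    return d / eps

# Hypothetical problem size and target accuracy.
N, d, eps = 1_000_000, 100, 1e-3
ratio = gd_cost(N, d, eps) / sgd_cost(d, eps)

if __name__ == "__main__":
    print("gradient descent ~", gd_cost(N, d, eps))
    print("SGD              ~", sgd_cost(d, eps))
    print("ratio            ~", ratio)
```

The SGD estimate does not involve N at all, which is why it tends to win on very large datasets, while the $\ln(1/\epsilon)$ dependence favours gradient descent when very high accuracy is required on a small dataset.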


Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Stochastic sub-gradient descent for SVM

Given a training set $S = \{(x_i, y_i)\}$, $x \in \mathbb{R}^n$, $y \in \{-1, 1\}$

1. Initialize $w = 0 \in \mathbb{R}^n$
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example $(x_i, y_i) \in S$:
      If $y_i w^T x_i \le 1$, $w \leftarrow (1 - \gamma_t) w + \gamma_t C y_i x_i$
      else $w \leftarrow (1 - \gamma_t) w$
3. Return $w$

Compare with the Perceptron update: If $y w^T x \le 0$, update $w \leftarrow w + r y x$

Perceptron vs. SVM

• Perceptron: Stochastic sub-gradient descent for a different loss
  – No regularization though
• SVM optimizes the hinge loss
  – With regularization
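Placing the two single-example updates side by side makes the difference concrete. This is a sketch with hypothetical function names. The perceptron step follows the rule "if $y w^T x \le 0$, update $w \leftarrow w + r y x$", and the SVM step follows the stochastic sub-gradient update from the earlier slides.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def perceptron_step(w, x, y, r):
    """Update only on a mistake (y * w.x <= 0); w is never shrunk."""
    if y * dot(w, x) <= 0:
        return [wj + r * y * xj for wj, xj in zip(w, x)]
    return list(w)

def svm_step(w, x, y, gamma, C):
    """Always shrink w; also add the example if it is inside the margin (y * w.x <= 1)."""
    inside_margin = y * dot(w, x) <= 1
    shrunk = [(1.0 - gamma) * wj for wj in w]   # effect of the regularizer
    if inside_margin:
        return [wj + gamma * C * y * xj for wj, xj in zip(shrunk, x)]
    return shrunk

if __name__ == "__main__":
    w, x, y = [0.5, 0.0], [1.0, 0.0], 1     # correctly classified, but margin 0.5 < 1
    print("perceptron:", perceptron_step(w, x, y, r=1.0))
    print("svm:       ", svm_step(w, x, y, gamma=0.1, C=1.0))
```

On an example that is classified correctly but with a margin below 1, the perceptron leaves $w$ unchanged while the SVM step still moves it. The shrink factor $(1 - \gamma)$ is where the regularization shows up.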

SVM summary from optimization perspective

• Minimize regularized hinge loss
• Solve using stochastic gradient descent
  – Very fast, run time does not depend on number of examples
  – Compare with Perceptron algorithm: Perceptron does not maximize margin width
    • Perceptron variants can force a margin
  – Convergence criterion is an issue; can be too aggressive in the beginning and get to a reasonably good solution fast; but convergence is slow for very accurate weight vector
• Other successful optimization algorithms exist
  – Eg: Dual coordinate descent, implemented in liblinear

Questions?