Machine Learning

Support Vector Machines: Training with Stochastic Gradient Descent
Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels

SVM objective function
The SVM objective is a regularized hinge loss:

$$\min_{\mathbf{w}} \; J(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_{i} \max\left(0,\; 1 - y_i\, \mathbf{w}^T\mathbf{x}_i\right)$$

Regularization term ($\frac{1}{2}\mathbf{w}^T\mathbf{w}$):
• Maximizes the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences

Empirical loss (the hinge loss $\max(0,\, 1 - y_i \mathbf{w}^T\mathbf{x}_i)$):
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences

C is a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss.
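As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this objective; the function name and the bias-free linear score are illustrative assumptions.

```python
import numpy as np

def svm_objective(w, X, y, C):
    """Regularized hinge loss: 0.5 * w.w + C * sum_i max(0, 1 - y_i * w.x_i).

    w: weight vector, shape (n,)
    X: examples, shape (N, n); y: labels in {-1, +1}, shape (N,)
    C: tradeoff between a large margin and a small hinge loss
    """
    margins = y * (X @ w)                   # y_i * w^T x_i for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # per-example hinge loss
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```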
Outline: Training SVM by optimization

1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Solving the SVM optimization problem

This function is convex in w.

Recall: Convex functions

A function $f$ is convex if for every $\mathbf{u}, \mathbf{v}$ in the domain, and for every $\lambda \in [0,1]$, we have
$$f(\lambda \mathbf{u} + (1-\lambda)\mathbf{v}) \le \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v})$$

From the geometric perspective: every tangent plane lies below the function.

[Figure: a convex curve through points u and v; the chord from f(u) to f(v) lies above the curve.]
Convex functions

Examples: linear functions are convex, and the pointwise max of convex functions is convex.

Some ways to show that a function is convex:
1. Using the definition of convexity
2. Showing that the second derivative is non-negative (for one-dimensional functions)
3. Showing that the matrix of second derivatives (the Hessian) is positive semi-definite (for vector functions)
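For instance, the hinge loss that appears in the SVM objective can be shown to be convex this way: it is a pointwise maximum of two functions that are affine (hence convex) in $\mathbf{w}$.

$$\ell_i(\mathbf{w}) = \max\bigl(\underbrace{0}_{\text{affine in } \mathbf{w}},\; \underbrace{1 - y_i\,\mathbf{w}^T\mathbf{x}_i}_{\text{affine in } \mathbf{w}}\bigr) \quad\Rightarrow\quad \ell_i \text{ is convex in } \mathbf{w}.$$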
Not all functions are convex

Some functions are concave: for every $\mathbf{u}, \mathbf{v}$ and $\lambda \in [0,1]$,
$$f(\lambda \mathbf{u} + (1-\lambda)\mathbf{v}) \ge \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v})$$
Other functions are neither convex nor concave.
Convex functions are convenient

A function $f$ is convex if for every $\mathbf{u}, \mathbf{v}$ in the domain, and for every $\lambda \in [0,1]$, we have
$$f(\lambda \mathbf{u} + (1-\lambda)\mathbf{v}) \le \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v})$$

In general, a necessary condition for $\mathbf{x}$ to be a minimum of the function $f$ is $\nabla f(\mathbf{x}) = 0$.
For convex functions, this condition is both necessary and sufficient.
Solving the SVM optimization problem

This function is convex in w.
• This is a quadratic optimization problem because the objective is quadratic
• Older methods: used techniques from Quadratic Programming
  – Very slow
• No constraints, so we can use gradient descent
  – Still very slow!
Gradient descent

General strategy for minimizing a function J(w) (here, the SVM objective we are trying to minimize):
• Start with an initial guess for w, say w0
• Iterate till convergence:
  – Compute the gradient of J at wt
  – Update wt to get wt+1 by taking a step in the opposite direction of the gradient

Intuition: the gradient is the direction of steepest increase of the function. To get to the minimum, go in the opposite direction.

[Figure: J(w) plotted against w, with successive iterates w0, w1, w2, w3 stepping downhill toward the minimum.]
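To make the loop concrete, here is a minimal sketch of this generic procedure on a toy differentiable objective (the quadratic J(w) = ||w - a||², the function names, and the fixed learning rate are illustrative assumptions, not part of the slides):

```python
import numpy as np

def gradient_descent(grad_J, w0, r=0.1, num_steps=100):
    """Generic gradient descent: repeatedly step opposite to the gradient."""
    w = w0.copy()
    for t in range(num_steps):
        w = w - r * grad_J(w)   # w^{t+1} = w^t - r * grad J(w^t)
    return w

# Toy example: minimize J(w) = ||w - a||^2, whose gradient is 2 (w - a).
a = np.array([3.0, -1.0])
w_star = gradient_descent(lambda w: 2.0 * (w - a), w0=np.zeros(2))
print(w_star)  # converges toward a = [3, -1]
```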
Gradient descent for SVM (minimizing J(w))

1. Initialize w0
2. For t = 0, 1, 2, ....
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
   2. Update w as follows: wt+1 ← wt - r ∇J(wt)

r: called the learning rate.
Outline:TrainingSVMbyoptimization
ü Reviewofconvexfunctionsandgradientdescent
2. Stochasticgradientdescent
3. Gradientdescentvsstochasticgradientdescent
4. Sub-derivativesofthehingeloss
5. Stochasticsub-gradientdescentforSVM
6. Comparisontoperceptron
18
Gradient descent for SVM (minimizing J(w))

1. Initialize w0
2. For t = 0, 1, 2, ....
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
   2. Update w as follows: wt+1 ← wt - r ∇J(wt)

r: called the learning rate

The gradient of the SVM objective requires summing over the entire training set. Slow, does not really scale.
Stochastic gradient descent for SVM

Given a training set S = {(xi, yi)}, xi ∈ ℝⁿ, yi ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. Pick a random example (xi, yi) from the training set S
   2. Treat (xi, yi) as a full dataset and take the derivative of the SVM objective at the current wt-1 to be ∇Jt(wt-1)
   3. Update: wt ← wt-1 - γt ∇Jt(wt-1)
3. Return final w
This algorithm is guaranteed to converge to the minimum of J if γt is small enough. Why? Because the objective J(w) is a convex function.

But what is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)
Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Gradient Descent vs SGD

[Figure: a sequence of plots contrasting the path taken by gradient descent with the path taken by stochastic gradient descent on the same objective.]

Stochastic gradient descent makes many more updates than gradient descent, but each individual update is less computationally expensive.
Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Hinge loss is not differentiable!

What is the derivative of the hinge loss with respect to w?
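To see the problem concretely (a worked example, not from the slides), write the hinge loss on one example as a function of the margin $m = y\,\mathbf{w}^T\mathbf{x}$:

$$\ell(m) = \max(0,\, 1 - m) = \begin{cases} 1 - m & m < 1 \\ 0 & m \ge 1 \end{cases} \qquad \ell'(m) = \begin{cases} -1 & m < 1 \\ 0 & m > 1 \\ \text{undefined (kink)} & m = 1 \end{cases}$$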
Detour: Sub-gradients

Generalization of gradients to non-differentiable functions. (Remember that for convex functions, every tangent is a hyperplane that lies below the function.)

Informally, a sub-tangent at a point is any hyperplane that lies below the function at that point. A sub-gradient is the slope of that hyperplane.

Sub-gradients [Example from Boyd]

Formally, a vector $\mathbf{g}$ is a subgradient of $f$ at the point $\mathbf{x}$ if, for all $\mathbf{z}$,
$$f(\mathbf{z}) \ge f(\mathbf{x}) + \mathbf{g}^T(\mathbf{z} - \mathbf{x})$$

[Figure: f is differentiable at x1, so g1, the slope of the tangent at that point, is the gradient at x1; g2 and g3 are both subgradients at x2, where f has a kink.]
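A standard one-dimensional example (not on the slides): for $f(x) = |x|$, which has a kink at $x = 0$, the subgradient condition picks out a whole set of slopes there.

$$\partial f(x) = \begin{cases} \{\operatorname{sign}(x)\} & x \neq 0 \\ [-1,\, 1] & x = 0 \end{cases} \qquad \text{since } |z| \ge 0 + g\,(z - 0) \text{ for all } z \text{ whenever } |g| \le 1.$$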
Sub-gradient of the SVM objective

General strategy: first solve the max and compute the gradient for each case. Treating a single example (xi, yi) as the full dataset, the objective is
$$J_t(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \max\left(0,\; 1 - y_i\, \mathbf{w}^T\mathbf{x}_i\right)$$
and a sub-gradient is
$$\nabla J_t(\mathbf{w}) = \begin{cases} \mathbf{w} - C\, y_i \mathbf{x}_i & \text{if } y_i\, \mathbf{w}^T\mathbf{x}_i \le 1 \\ \mathbf{w} & \text{otherwise} \end{cases}$$
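A minimal Python sketch of this case analysis (function and variable names are illustrative):

```python
import numpy as np

def svm_subgradient(w, x_i, y_i, C):
    """Sub-gradient of J_t(w) = 0.5 * w.w + C * max(0, 1 - y_i * w.x_i)
    on a single example, following the two cases of the max."""
    if y_i * np.dot(w, x_i) <= 1:   # margin violated (hinge term is active)
        return w - C * y_i * x_i
    else:                            # example safely beyond the margin: only the regularizer
        return w
```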
Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Stochastic sub-gradient descent for SVM

Given a training set S = {(xi, yi)}, xi ∈ ℝⁿ, yi ∈ {-1, 1}
1. Initialize w = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. For each training example (xi, yi) ∈ S:
      If yi wᵀxi ≤ 1:  w ← (1 - γt) w + γt C yi xi
      else:            w ← (1 - γt) w
3. Return w

(Both cases are the generic update w ← w - γt ∇Jt, with the sub-gradient worked out above.)
γt is the learning rate, and many tweaks to it are possible. It is also important to shuffle the examples at the start of each epoch, which gives the refined version:

Stochastic sub-gradient descent for SVM

Given a training set S = {(xi, yi)}, xi ∈ ℝⁿ, yi ∈ {-1, 1}
1. Initialize w = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (xi, yi) ∈ S:
      If yi wᵀxi ≤ 1:  w ← (1 - γt) w + γt C yi xi
      else:            w ← (1 - γt) w
3. Return w
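A minimal NumPy sketch of this training loop, assuming a simple 1/(1+t) learning-rate decay (the schedule, seed handling, and function names are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, gamma0=0.1, epochs=10, seed=0):
    """Stochastic sub-gradient descent for the SVM objective
    0.5 * w.w + C * sum_i max(0, 1 - y_i * w.x_i).

    X: (N, n) examples, y: (N,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    w = np.zeros(n)                       # 1. initialize w = 0
    t = 0
    for epoch in range(epochs):           # 2. for each epoch
        order = rng.permutation(N)        #    shuffle the training set
        for i in order:                   #    for each training example
            gamma_t = gamma0 / (1.0 + t)  #    decaying learning rate (assumed schedule)
            if y[i] * np.dot(w, X[i]) <= 1:
                w = (1 - gamma_t) * w + gamma_t * C * y[i] * X[i]
            else:
                w = (1 - gamma_t) * w
            t += 1
    return w                              # 3. return w
```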
Convergence and learning rates

With enough iterations, it will converge in expectation, provided the step sizes are "square summable, but not summable":
• Step sizes γt are positive
• The sum of the squares of the step sizes over t = 1 to ∞ is finite
• The sum of the step sizes over t = 1 to ∞ is infinite
• Some examples: $\gamma_t = \dfrac{\gamma_0}{1 + \gamma_0 t / C}$ or $\gamma_t = \dfrac{\gamma_0}{1 + t}$
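As a quick check of the second example against these conditions:

$$\sum_{t=1}^{\infty} \frac{\gamma_0}{1+t} = \infty \;\;\text{(harmonic series diverges)}, \qquad \sum_{t=1}^{\infty} \frac{\gamma_0^2}{(1+t)^2} \le \gamma_0^2 \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6}\,\gamma_0^2 < \infty.$$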
Convergence and learning rates

• Number of iterations to get to accuracy within ε
• For strongly convex functions, N examples, d dimensional:
  – Gradient descent: O(Nd ln(1/ε))
  – Stochastic gradient descent: O(d/ε)
• More subtleties are involved, but SGD is generally preferable when the data size is huge
• Recently, many variants that are based on this general strategy, targeting multilayer neural networks
  – Examples: Adagrad, momentum, Nesterov's accelerated gradient, Adam, RMSProp, etc.
Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Stochastic sub-gradient descent for SVM

Recall the SVM update for each training example (xi, yi):
   If yi wᵀxi ≤ 1:  w ← (1 - γt) w + γt C yi xi
   else:            w ← (1 - γt) w

Compare with the Perceptron update: if y wᵀx ≤ 0, update w ← w + r y x
Perceptron vs. SVM

• Perceptron: stochastic sub-gradient descent for a different loss
  – No regularization though
• SVM optimizes the hinge loss
  – With regularization
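A side-by-side sketch of the two per-example updates (variable names are illustrative):

```python
import numpy as np

def perceptron_update(w, x, y, r=1.0):
    """Perceptron: update only on a mistake (y * w.x <= 0); no regularization."""
    if y * np.dot(w, x) <= 0:
        return w + r * y * x
    return w

def svm_sgd_update(w, x, y, gamma_t, C):
    """SVM: update whenever the margin is violated (y * w.x <= 1),
    and always shrink w (the effect of the regularizer)."""
    if y * np.dot(w, x) <= 1:
        return (1 - gamma_t) * w + gamma_t * C * y * x
    return (1 - gamma_t) * w
```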
SVM summary from an optimization perspective

• Minimize the regularized hinge loss
• Solve using stochastic gradient descent
  – Very fast; runtime does not depend on the number of examples
  – Compare with the Perceptron algorithm: the Perceptron does not maximize the margin width
    · Perceptron variants can force a margin
  – The convergence criterion is an issue: SGD can be aggressive in the beginning and reach a reasonably good solution fast, but convergence is slow if a very accurate weight vector is needed
• Other successful optimization algorithms exist
  – E.g., dual coordinate descent, implemented in liblinear

Questions?