Machine Learning

Support Vector Machines: Training with Stochastic Gradient Descent
Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels

SVM objective function
The SVM objective is a regularized hinge loss:

$$\min_{\mathbf{w}} \; J(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_{i} \max\left(0,\; 1 - y_i\, \mathbf{w}^T\mathbf{x}_i\right)$$

Regularization term ($\frac{1}{2}\mathbf{w}^T\mathbf{w}$):
• Maximizes the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences

Empirical loss (the hinge loss $\max(0,\, 1 - y_i \mathbf{w}^T\mathbf{x}_i)$):
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences

C is a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss.
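As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this objective; the function name and the bias-free linear score are illustrative assumptions.

```python
import numpy as np

def svm_objective(w, X, y, C):
    """Regularized hinge loss: 0.5 * w.w + C * sum_i max(0, 1 - y_i * w.x_i).

    w: weight vector, shape (n,)
    X: examples, shape (N, n); y: labels in {-1, +1}, shape (N,)
    C: tradeoff between a large margin and a small hinge loss
    """
    margins = y * (X @ w)                   # y_i * w^T x_i for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # per-example hinge loss
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```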
Outline: Training SVM by optimization

1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Solving the SVM optimization problem

This function is convex in w.

Recall: Convex functions

A function $f$ is convex if for every $\mathbf{u}, \mathbf{v}$ in the domain, and for every $\lambda \in [0,1]$, we have
$$f(\lambda \mathbf{u} + (1-\lambda)\mathbf{v}) \le \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v})$$

From the geometric perspective: every tangent plane lies below the function.

[Figure: a convex curve through points u and v; the chord from f(u) to f(v) lies above the curve.]
Convex functions

Examples: linear functions are convex, and the pointwise max of convex functions is convex.

Some ways to show that a function is convex:
1. Using the definition of convexity
2. Showing that the second derivative is non-negative (for one-dimensional functions)
3. Showing that the matrix of second derivatives (the Hessian) is positive semi-definite (for vector functions)
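For instance, the hinge loss that appears in the SVM objective can be shown to be convex this way: it is a pointwise maximum of two functions that are affine (hence convex) in $\mathbf{w}$.

$$\ell_i(\mathbf{w}) = \max\bigl(\underbrace{0}_{\text{affine in } \mathbf{w}},\; \underbrace{1 - y_i\,\mathbf{w}^T\mathbf{x}_i}_{\text{affine in } \mathbf{w}}\bigr) \quad\Rightarrow\quad \ell_i \text{ is convex in } \mathbf{w}.$$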
Not all functions are convex

Some functions are concave: for every $\mathbf{u}, \mathbf{v}$ and $\lambda \in [0,1]$,
$$f(\lambda \mathbf{u} + (1-\lambda)\mathbf{v}) \ge \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v})$$
Other functions are neither convex nor concave.
Convex functions are convenient

A function $f$ is convex if for every $\mathbf{u}, \mathbf{v}$ in the domain, and for every $\lambda \in [0,1]$, we have
$$f(\lambda \mathbf{u} + (1-\lambda)\mathbf{v}) \le \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v})$$

In general, a necessary condition for $\mathbf{x}$ to be a minimum of the function $f$ is $\nabla f(\mathbf{x}) = 0$.
For convex functions, this condition is both necessary and sufficient.
Solving the SVM optimization problem

This function is convex in w.
• This is a quadratic optimization problem because the objective is quadratic
• Older methods: used techniques from Quadratic Programming
  – Very slow
• No constraints, so we can use gradient descent
  – Still very slow!
Gradient descent

General strategy for minimizing a function J(w) (here, the SVM objective we are trying to minimize):
• Start with an initial guess for w, say w0
• Iterate till convergence:
  – Compute the gradient of J at wt
  – Update wt to get wt+1 by taking a step in the opposite direction of the gradient

Intuition: the gradient is the direction of steepest increase of the function. To get to the minimum, go in the opposite direction.

[Figure: J(w) plotted against w, with successive iterates w0, w1, w2, w3 stepping downhill toward the minimum.]
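To make the loop concrete, here is a minimal sketch of this generic procedure on a toy differentiable objective (the quadratic J(w) = ||w - a||², the function names, and the fixed learning rate are illustrative assumptions, not part of the slides):

```python
import numpy as np

def gradient_descent(grad_J, w0, r=0.1, num_steps=100):
    """Generic gradient descent: repeatedly step opposite to the gradient."""
    w = w0.copy()
    for t in range(num_steps):
        w = w - r * grad_J(w)   # w^{t+1} = w^t - r * grad J(w^t)
    return w

# Toy example: minimize J(w) = ||w - a||^2, whose gradient is 2 (w - a).
a = np.array([3.0, -1.0])
w_star = gradient_descent(lambda w: 2.0 * (w - a), w0=np.zeros(2))
print(w_star)  # converges toward a = [3, -1]
```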
Gradient descent for SVM (minimizing J(w))

1. Initialize w0
2. For t = 0, 1, 2, ....
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
   2. Update w as follows: wt+1 ← wt - r ∇J(wt)

r: called the learning rate.
Outline:TrainingSVMbyoptimization
ü Reviewofconvexfunctionsandgradientdescent
2. Stochasticgradientdescent
3. Gradientdescentvsstochasticgradientdescent
4. Sub-derivativesofthehingeloss
5. Stochasticsub-gradientdescentforSVM
6. Comparisontoperceptron
18
Gradient descent for SVM (minimizing J(w))

1. Initialize w0
2. For t = 0, 1, 2, ....
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
   2. Update w as follows: wt+1 ← wt - r ∇J(wt)

r: called the learning rate

The gradient of the SVM objective requires summing over the entire training set. Slow, does not really scale.
Stochastic gradient descent for SVM

Given a training set S = {(xi, yi)}, xi ∈ ℝⁿ, yi ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. Pick a random example (xi, yi) from the training set S
   2. Treat (xi, yi) as a full dataset and take the derivative of the SVM objective at the current wt-1 to be ∇Jt(wt-1)
   3. Update: wt ← wt-1 - γt ∇Jt(wt-1)
3. Return final w
This algorithm is guaranteed to converge to the minimum of J if γt is small enough. Why? Because the objective J(w) is a convex function.

But what is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)
Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Gradient Descent vs SGD

[Figure: a sequence of plots contrasting the path taken by gradient descent with the path taken by stochastic gradient descent on the same objective.]

Stochastic gradient descent makes many more updates than gradient descent, but each individual update is less computationally expensive.
Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Hinge loss is not differentiable!

What is the derivative of the hinge loss with respect to w?
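To see the problem concretely (a worked example, not from the slides), write the hinge loss on one example as a function of the margin $m = y\,\mathbf{w}^T\mathbf{x}$:

$$\ell(m) = \max(0,\, 1 - m) = \begin{cases} 1 - m & m < 1 \\ 0 & m \ge 1 \end{cases} \qquad \ell'(m) = \begin{cases} -1 & m < 1 \\ 0 & m > 1 \\ \text{undefined (kink)} & m = 1 \end{cases}$$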
Detour: Sub-gradients

Generalization of gradients to non-differentiable functions. (Remember that for convex functions, every tangent is a hyperplane that lies below the function.)

Informally, a sub-tangent at a point is any hyperplane that lies below the function at that point. A sub-gradient is the slope of that hyperplane.

Sub-gradients [Example from Boyd]

Formally, a vector $\mathbf{g}$ is a subgradient of $f$ at the point $\mathbf{x}$ if, for all $\mathbf{z}$,
$$f(\mathbf{z}) \ge f(\mathbf{x}) + \mathbf{g}^T(\mathbf{z} - \mathbf{x})$$

[Figure: f is differentiable at x1, so g1, the slope of the tangent at that point, is the gradient at x1; g2 and g3 are both subgradients at x2, where f has a kink.]
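A standard one-dimensional example (not on the slides): for $f(x) = |x|$, which has a kink at $x = 0$, the subgradient condition picks out a whole set of slopes there.

$$\partial f(x) = \begin{cases} \{\operatorname{sign}(x)\} & x \neq 0 \\ [-1,\, 1] & x = 0 \end{cases} \qquad \text{since } |z| \ge 0 + g\,(z - 0) \text{ for all } z \text{ whenever } |g| \le 1.$$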
Sub-gradient of the SVM objective

General strategy: first solve the max and compute the gradient for each case. Treating a single example (xi, yi) as the full dataset, the objective is
$$J_t(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \max\left(0,\; 1 - y_i\, \mathbf{w}^T\mathbf{x}_i\right)$$
and a sub-gradient is
$$\nabla J_t(\mathbf{w}) = \begin{cases} \mathbf{w} - C\, y_i \mathbf{x}_i & \text{if } y_i\, \mathbf{w}^T\mathbf{x}_i \le 1 \\ \mathbf{w} & \text{otherwise} \end{cases}$$
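A minimal Python sketch of this case analysis (function and variable names are illustrative):

```python
import numpy as np

def svm_subgradient(w, x_i, y_i, C):
    """Sub-gradient of J_t(w) = 0.5 * w.w + C * max(0, 1 - y_i * w.x_i)
    on a single example, following the two cases of the max."""
    if y_i * np.dot(w, x_i) <= 1:   # margin violated (hinge term is active)
        return w - C * y_i * x_i
    else:                            # example safely beyond the margin: only the regularizer
        return w
```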
Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Stochastic sub-gradient descent for SVM

Given a training set S = {(xi, yi)}, xi ∈ ℝⁿ, yi ∈ {-1, 1}
1. Initialize w = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. For each training example (xi, yi) ∈ S:
      If yi wᵀxi ≤ 1:  w ← (1 - γt) w + γt C yi xi
      else:            w ← (1 - γt) w
3. Return w

(Both cases are the generic update w ← w - γt ∇Jt, with the sub-gradient worked out above.)
γt is the learning rate, and many tweaks to it are possible. It is also important to shuffle the examples at the start of each epoch, which gives the refined version:

Stochastic sub-gradient descent for SVM

Given a training set S = {(xi, yi)}, xi ∈ ℝⁿ, yi ∈ {-1, 1}
1. Initialize w = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (xi, yi) ∈ S:
      If yi wᵀxi ≤ 1:  w ← (1 - γt) w + γt C yi xi
      else:            w ← (1 - γt) w
3. Return w
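A minimal NumPy sketch of this training loop, assuming a simple 1/(1+t) learning-rate decay (the schedule, seed handling, and function names are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, gamma0=0.1, epochs=10, seed=0):
    """Stochastic sub-gradient descent for the SVM objective
    0.5 * w.w + C * sum_i max(0, 1 - y_i * w.x_i).

    X: (N, n) examples, y: (N,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    w = np.zeros(n)                       # 1. initialize w = 0
    t = 0
    for epoch in range(epochs):           # 2. for each epoch
        order = rng.permutation(N)        #    shuffle the training set
        for i in order:                   #    for each training example
            gamma_t = gamma0 / (1.0 + t)  #    decaying learning rate (assumed schedule)
            if y[i] * np.dot(w, X[i]) <= 1:
                w = (1 - gamma_t) * w + gamma_t * C * y[i] * X[i]
            else:
                w = (1 - gamma_t) * w
            t += 1
    return w                              # 3. return w
```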
Convergence and learning rates

With enough iterations, it will converge in expectation, provided the step sizes are "square summable, but not summable":
• Step sizes γt are positive
• The sum of the squares of the step sizes over t = 1 to ∞ is finite
• The sum of the step sizes over t = 1 to ∞ is infinite
• Some examples: $\gamma_t = \dfrac{\gamma_0}{1 + \gamma_0 t / C}$ or $\gamma_t = \dfrac{\gamma_0}{1 + t}$
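As a quick check of the second example against these conditions:

$$\sum_{t=1}^{\infty} \frac{\gamma_0}{1+t} = \infty \;\;\text{(harmonic series diverges)}, \qquad \sum_{t=1}^{\infty} \frac{\gamma_0^2}{(1+t)^2} \le \gamma_0^2 \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6}\,\gamma_0^2 < \infty.$$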
Convergence and learning rates

• Number of iterations to get to accuracy within ε
• For strongly convex functions, N examples, d dimensional:
  – Gradient descent: O(Nd ln(1/ε))
  – Stochastic gradient descent: O(d/ε)
• More subtleties are involved, but SGD is generally preferable when the data size is huge
• Recently, many variants that are based on this general strategy, targeting multilayer neural networks
  – Examples: Adagrad, momentum, Nesterov's accelerated gradient, Adam, RMSProp, etc.
Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Stochastic sub-gradient descent for SVM

Recall the SVM update for each training example (xi, yi):
   If yi wᵀxi ≤ 1:  w ← (1 - γt) w + γt C yi xi
   else:            w ← (1 - γt) w

Compare with the Perceptron update: if y wᵀx ≤ 0, update w ← w + r y x
Perceptron vs. SVM

• Perceptron: stochastic sub-gradient descent for a different loss
  – No regularization though
• SVM optimizes the hinge loss
  – With regularization
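A side-by-side sketch of the two per-example updates (variable names are illustrative):

```python
import numpy as np

def perceptron_update(w, x, y, r=1.0):
    """Perceptron: update only on a mistake (y * w.x <= 0); no regularization."""
    if y * np.dot(w, x) <= 0:
        return w + r * y * x
    return w

def svm_sgd_update(w, x, y, gamma_t, C):
    """SVM: update whenever the margin is violated (y * w.x <= 1),
    and always shrink w (the effect of the regularizer)."""
    if y * np.dot(w, x) <= 1:
        return (1 - gamma_t) * w + gamma_t * C * y * x
    return (1 - gamma_t) * w
```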
SVM summary from an optimization perspective

• Minimize the regularized hinge loss
• Solve using stochastic gradient descent
  – Very fast; runtime does not depend on the number of examples
  – Compare with the Perceptron algorithm: the Perceptron does not maximize the margin width
    · Perceptron variants can force a margin
  – The convergence criterion is an issue: SGD can be aggressive in the beginning and reach a reasonably good solution fast, but convergence is slow if a very accurate weight vector is needed
• Other successful optimization algorithms exist
  – E.g., dual coordinate descent, implemented in liblinear

Questions?