Lecture 5: optimization and convexity
Sanjeev Arora, Elad Hazan
COS 402 – Machine Learning and Artificial Intelligence, Fall 2016
Admin
• Exercise 2 (implementation): next Thu, in class
• Exercise 3 (written): due next Thu
• Movie – "Ex Machina" + discussion panel w. Prof. Hasson (PNI), Wed Oct. 4th 19:30; tickets still available @ Bella room 204 COS
• Next Tue: special guest – Dr. Yoram Singer @ Google
Recap
• Definition + fundamental theorem of statistical learning
• Powerful classes w. low sample complexity exist (i.e. Python programs), but are computationally hard to learn
• Perceptron
• SVM
Agenda
• convex relaxations
• convex optimization
• gradient descent
Mathematical optimization
Input: a function $f: K \mapsto \mathbb{R}$, for $K \subseteq \mathbb{R}^d$
Output: a point $x \in K$ such that $f(x) \le f(y)$ for all $y \in K$
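As a toy instance (my own example, not from the lecture): minimize $f(x) = (x-3)^2$ over $K = [0, 1]$. The unconstrained minimizer $x = 3$ lies outside $K$, so the constrained optimum is the boundary point $x = 1$, with $f(1) = 4$.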
Mathematical optimization
• Continuous functions (back to calculus: derivatives, differentiability, …)
• Vs. combinatorial optimization as in graph algorithms (strong connection)
• Studied since the early 1900's; lots of work in the Soviet Union (central optimization, resource allocation, military applications, etc.)
• Special cases with efficient (poly-time) algorithms: linear programming, convex optimization, max flow in graphs
Optimization for linear classification
Given a sample $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, find a hyperplane (through the origin w.l.o.g.) such that:
$w = \arg\min_{\|w\| \le 1}$ (# of mistakes)
Optimization for linear classification
$w = \arg\min_{\|w\| \le 1} \left|\{\, i \ \text{s.t.}\ \mathrm{sign}(w^\top x_i) \ne y_i \,\}\right|$
Minimization can be hard
Sum of signs → hard
Convex functions: local → global
Sum of convex functions → also convex
Convex relaxation for 0-1 loss
Convex relaxation for linear classification
$w = \arg\min_{\|w\| \le 1} \sum_i \ell(w^\top x_i, y_i)$, for a convex surrogate loss $\ell$ such as:
1. Ridge / linear regression: $\ell(w^\top x_i, y_i) = (w^\top x_i - y_i)^2$
2. SVM: $\ell(w^\top x_i, y_i) = \max\{0,\ 1 - y_i\, w^\top x_i\}$
3. Logistic regression: $\ell(w^\top x_i, y_i) = \log(1 + e^{-y_i w^\top x_i})$
replacing the original objective $w = \arg\min_{\|w\| \le 1} \left|\{\, i \ \text{s.t.}\ \mathrm{sign}(w^\top x_i) \ne y_i \,\}\right|$.
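A minimal NumPy sketch of these three surrogate losses for a single example (the function names are mine, not from the lecture; labels $y_i \in \{-1, +1\}$ for SVM and logistic regression, real-valued for regression):

```python
import numpy as np

def squared_loss(w, x, y):
    # ridge / linear regression: (w^T x - y)^2
    return (np.dot(w, x) - y) ** 2

def hinge_loss(w, x, y):
    # SVM: max{0, 1 - y * w^T x}
    return max(0.0, 1.0 - y * np.dot(w, x))

def logistic_loss(w, x, y):
    # logistic regression: log(1 + exp(-y * w^T x))
    return np.log1p(np.exp(-y * np.dot(w, x)))
```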
Small recap
• Finding linear classifiers: formulated as mathematical optimization
• Convexity: property that allows local greedy algorithms
• Formulated convex relaxations to linear classification
Next:
• Algorithms for convex optimization
Convexity
A function $f: \mathbb{R}^d \mapsto \mathbb{R}$ is convex if and only if:
$f\!\left(\tfrac{1}{2}x + \tfrac{1}{2}y\right) \le \tfrac{1}{2}f(x) + \tfrac{1}{2}f(y)$
• Informally: smiley ☺
Calculus reminder: gradient
• Gradient = the direction of steepest descent, which is the derivative in each coordinate:
$[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$
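A quick way to sanity-check a hand-derived gradient is a central finite difference in each coordinate; a minimal sketch (the test function and point are arbitrary illustrative choices):

```python
import numpy as np

# Approximate [grad f(x)]_i = d/dx_i f(x) by central differences.
def num_grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

f = lambda v: np.sum(v ** 2)          # f(x) = ||x||^2, true gradient 2x
x = np.array([1.0, -3.0, 2.0])
print(num_grad(f, x))                 # ~ [2, -6, 4]
```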
Convexity
• Alternative definition:
$f(y) \ge f(x) + \nabla f(x)^\top (y - x)$
(assumes differentiability, o/w use the subgradient; another alternative in 1D: the second derivative is non-negative)
[Figure: a convex $f$ lies above its tangent line at $x$, evaluated at a point $y$]
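A small numerical check of this first-order condition on a concrete convex function (the log-sum-exp-style function below and the random test points are my own choices):

```python
import numpy as np

# f(v) = log(1 + exp(sum(v))) is convex (log(1+e^t) composed with a linear map);
# verify f(y) >= f(x) + grad_f(x)^T (y - x) on random point pairs.
def f(v):
    return np.log1p(np.exp(np.sum(v)))

def grad_f(v):
    s = 1.0 / (1.0 + np.exp(-np.sum(v)))   # sigmoid of the coordinate sum
    return s * np.ones_like(v)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-9   # small float tolerance
print("first-order condition held on all sampled pairs")
```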
Greedy optimization: gradient descent
• Move in the direction of steepest descent, which is:
$[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$
$x_{t+1} \leftarrow x_t - \eta\, \nabla f(x_t)$
where $\eta$ is the "step size" or "learning rate"
[Figure: iterates $p_1, p_2, p_3$ descending toward the optimum $p^*$]
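A minimal sketch of this update rule in NumPy (the toy objective, step size, and iteration count are illustrative choices, not from the lecture):

```python
import numpy as np

# Plain gradient descent: x_{t+1} <- x_t - eta * grad_f(x_t).
def gradient_descent(grad_f, x0, eta, T):
    x = np.array(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_f(x)       # step in the direction of steepest descent
    return x

# Example: f(x) = ||x - b||^2 has gradient 2(x - b), minimized at x = b.
b = np.array([1.0, -2.0])
print(gradient_descent(lambda x: 2 * (x - b), np.zeros(2), eta=0.1, T=100))
```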
gradient descent – constrained set
$y_{t+1} \leftarrow x_t - \eta\, \nabla f(x_t)$
$x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$
convex constraints
A set $K$ is convex if and only if:
$x, y \in K \;\Rightarrow\; \left(\tfrac{1}{2}x + \tfrac{1}{2}y\right) \in K$
gradient descent – constrained set
$y_{t+1} \leftarrow x_t - \eta\, \nabla f(x_t)$
$x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$
Let:
• $G$ = upper bound on the norm of the gradients: $\|\nabla f(x_t)\| \le G$
• $D$ = diameter of the constraint set: $\forall x, y \in K.\ \|x - y\| \le D$
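A sketch of the projected version for the special case $K = \{x : \|x\| \le R\}$ (an L2 ball, so $D = 2R$); the closed-form projection onto the ball is standard, but the function names and the toy run are my own:

```python
import numpy as np

def project_ball(y, R):
    # Euclidean projection onto {x : ||x|| <= R}: rescale if outside.
    n = np.linalg.norm(y)
    return y if n <= R else (R / n) * y

def projected_gd(grad_f, x0, eta, T, R):
    x = project_ball(np.array(x0, dtype=float), R)
    total = np.zeros_like(x)
    for _ in range(T):
        y = x - eta * grad_f(x)       # unconstrained gradient step
        x = project_ball(y, R)        # project back onto K
        total += x
    return total / T                  # the theorem below bounds f at the average iterate

# Example: minimize ||x - b||^2 over the unit ball; optimum is b / ||b||.
b = np.array([3.0, 4.0])              # ||b|| = 5
print(projected_gd(lambda x: 2 * (x - b), np.zeros(2), eta=0.05, T=500, R=1.0))
# ~ [0.6, 0.8]
```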
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$,
$f\!\left(\frac{1}{T}\sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$
Proof:
1. Observation 1: $\|x^* - y_{t+1}\|^2 = \|x^* - x_t\|^2 - 2\eta\,\nabla f(x_t)^\top (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2$
2. Observation 2: $\|x^* - x_{t+1}\|^2 \le \|x^* - y_{t+1}\|^2$
Observation 2 is the "Pythagorean theorem" for projections: $x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$ is the projection of $y_{t+1}$ onto the convex set $K$, which can only decrease the distance to any point of $K$, in particular to $x^* \in K$.
Combining the two observations and using $\|\nabla f(x_t)\| \le G$:
$\|x^* - x_{t+1}\|^2 \le \|x^* - x_t\|^2 - 2\eta\,\nabla f(x_t)^\top (x_t - x^*) + \eta^2 G^2$
And hence:
$f\!\left(\frac{1}{T}\sum_t x_t\right) - f(x^*) \;\le\; \frac{1}{T}\sum_t \left[f(x_t) - f(x^*)\right] \;\le\; \frac{1}{T}\sum_t \nabla f(x_t)^\top (x_t - x^*)$
$\le\; \frac{1}{T}\sum_t \frac{1}{2\eta}\left(\|x^* - x_t\|^2 - \|x^* - x_{t+1}\|^2\right) + \frac{\eta}{2} G^2 \;\le\; \frac{1}{2\eta T} D^2 + \frac{\eta}{2} G^2 \;\le\; \frac{DG}{\sqrt{T}}$
The first inequality is Jensen's inequality (convexity), the second is the first-order definition of convexity, and the third rearranges the bound above; the distance terms then telescope to at most $D^2$, and plugging in $\eta = \frac{D}{G\sqrt{T}}$ gives the last step. ∎
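As a sanity check of the bound (my own toy setup, reusing the `projected_gd` and `project_ball` sketches from earlier):

```python
import numpy as np

# Toy check of f(avg iterate) <= f(x*) + D*G/sqrt(T):
# K = unit ball so D = 2; f(x) = ||x - b||^2 with b = [2, 0], so on K
# ||grad f(x)|| = 2||x - b|| <= 6 = G; the optimum is x* = [1, 0], f(x*) = 1.
b = np.array([2.0, 0.0])
D, G, T = 2.0, 6.0, 10_000
eta = D / (G * np.sqrt(T))            # the step size from the theorem
x_bar = projected_gd(lambda x: 2 * (x - b), np.zeros(2), eta, T, R=1.0)
f = lambda v: float(np.sum((v - b) ** 2))
print(f(x_bar), "<=", f(np.array([1.0, 0.0])) + D * G / np.sqrt(T))
```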
gradient descent – constrained set
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$,
$f\!\left(\frac{1}{T}\sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$
Thus, to get an $\epsilon$-approximate solution, apply $\frac{D^2 G^2}{\epsilon^2}$ gradient iterations.
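For concreteness (the numbers are my own): $\frac{DG}{\sqrt{T}} \le \epsilon$ holds iff $T \ge \frac{D^2 G^2}{\epsilon^2}$, so with $D = 2$, $G = 6$, and target accuracy $\epsilon = 0.1$ we need $T \ge \frac{4 \cdot 36}{0.01} = 14{,}400$ iterations.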
GD for linear classification
1. Ridge / linear regression: $\ell(w^\top x_i, y_i) = (w^\top x_i - y_i)^2$
2. SVM: $\ell(w^\top x_i, y_i) = \max\{0,\ 1 - y_i\, w^\top x_i\}$
3. Logistic regression: $\ell(w^\top x_i, y_i) = \log(1 + e^{-y_i w^\top x_i})$
$w = \arg\min_{\|w\| \le 1} \frac{1}{m}\sum_i \ell(w^\top x_i, y_i)$
GD for linear classification
$w = \arg\min_{\|w\| \le 1} \frac{1}{m}\sum_i \ell(w^\top x_i, y_i)$
$w_{t+1} = w_t - \eta \cdot \frac{1}{m}\sum_i \ell'(w_t^\top x_i, y_i)\, x_i$
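A sketch of this update for the logistic loss, where $\ell'(w^\top x_i, y_i) = \frac{-y_i}{1 + e^{y_i w^\top x_i}}$ (vectorized NumPy; the data shapes and hyperparameters are illustrative, and I omit the norm constraint / projection step for brevity; it could be added with the ball projection from earlier):

```python
import numpy as np

def gd_logistic(X, y, eta=0.1, T=1000):
    # X: (m, d) examples; y: (m,) labels in {-1, +1}.
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)                     # y_i * w^T x_i
        coeffs = -y / (1.0 + np.exp(margins))     # l'(w^T x_i, y_i)
        w -= eta * (X.T @ coeffs) / m             # w <- w - eta * average gradient
    return w

# Example on synthetic linearly separable data (my own toy setup):
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
w_hat = gd_logistic(X, y)
print(np.mean(np.sign(X @ w_hat) == y))           # training accuracy, ~1.0
```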
• Complexity? $\frac{1}{\epsilon^2}$ iterations, each taking ~linear time in the dataset
• Overall $O\!\left(\frac{md}{\epsilon^2}\right)$ running time, for $m$ examples in $\mathbb{R}^d$
• Can we speed it up??
Summary
• Mathematical optimization for linear classification
• Convex relaxations
• Gradient descent algorithm
• GD applied to linear classification