Machine Learning Basics Lecture 5: SVM II
Princeton University COS 495
Instructor: Yingyu Liang
Review: SVM objective
SVM: objective
• Let $y_i \in \{+1, -1\}$, $f_{w,b}(x) = w^T x + b$. Margin:
$\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$
• Support Vector Machine:
$\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$
SVM: optimization
• Optimization (Quadratic Programming):
$\min_{w,b} \; \frac{1}{2}\|w\|^2$
subject to $y_i(w^T x_i + b) \ge 1, \; \forall i$
• Solved by the Lagrange multiplier method:
$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i [y_i(w^T x_i + b) - 1]$
where the $\alpha_i$'s are the Lagrange multipliers
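As an illustration not on the original slides, here is a minimal sketch of this primal QP using the cvxpy library; the toy data and the choice of solver library are assumptions for illustration only.

import numpy as np
import cvxpy as cp

# Toy, linearly separable data (hypothetical example)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# Primal SVM QP: min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)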
Lagrange multiplier
Lagrangian
• Consider the optimization problem:
$\min_w f(w)$
subject to $h_i(w) = 0, \; 1 \le i \le l$
• Lagrangian:
$\mathcal{L}(w, \beta) = f(w) + \sum_i \beta_i h_i(w)$
where the $\beta_i$'s are called Lagrange multipliers
Lagrangian
• Consider the optimization problem:
$\min_w f(w)$
subject to $h_i(w) = 0, \; 1 \le i \le l$
• Solved by setting the derivatives of the Lagrangian to 0:
$\frac{\partial \mathcal{L}}{\partial w_i} = 0; \quad \frac{\partial \mathcal{L}}{\partial \beta_i} = 0$
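A quick worked example (not from the original slides): minimize $f(w) = w_1^2 + w_2^2$ subject to $h(w) = w_1 + w_2 - 1 = 0$.
$\mathcal{L}(w, \beta) = w_1^2 + w_2^2 + \beta(w_1 + w_2 - 1)$
$\frac{\partial \mathcal{L}}{\partial w_1} = 2w_1 + \beta = 0, \quad \frac{\partial \mathcal{L}}{\partial w_2} = 2w_2 + \beta = 0, \quad \frac{\partial \mathcal{L}}{\partial \beta} = w_1 + w_2 - 1 = 0$
$\Rightarrow w_1 = w_2 = \frac{1}{2}, \; \beta = -1$: the closest point on the line $w_1 + w_2 = 1$ to the origin.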
Generalized Lagrangian
• Consider the optimization problem:
$\min_w f(w)$
subject to $g_i(w) \le 0, \; 1 \le i \le k$
$h_j(w) = 0, \; 1 \le j \le l$
• Generalized Lagrangian:
$\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w)$
where the $\alpha_i$'s and $\beta_j$'s are called Lagrange multipliers
Generalized Lagrangian
• Consider the quantity:
$\theta_P(w) \equiv \max_{\alpha, \beta : \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)$
• Why?
$\theta_P(w) = f(w)$ if $w$ satisfies all the constraints; $\theta_P(w) = +\infty$ if $w$ does not satisfy the constraints
• So minimizing $f(w)$ is the same as minimizing $\theta_P(w)$:
$\min_w f(w) = \min_w \theta_P(w) = \min_w \max_{\alpha, \beta : \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)$
Lagrange duality
• The primal problem:
$p^* \equiv \min_w f(w) = \min_w \max_{\alpha, \beta : \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)$
• The dual problem:
$d^* \equiv \max_{\alpha, \beta : \alpha_i \ge 0} \min_w \mathcal{L}(w, \alpha, \beta)$
• Always true: $d^* \le p^*$
• Interesting case: when do we have $d^* = p^*$?
Lagrange duality
• Theorem: under proper conditions, there exist $w^*, \alpha^*, \beta^*$ such that
$d^* = \mathcal{L}(w^*, \alpha^*, \beta^*) = p^*$
• Moreover, $w^*, \alpha^*, \beta^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:
$\frac{\partial \mathcal{L}}{\partial w_i} = 0$
$\alpha_i g_i(w) = 0$ (dual complementarity)
$g_i(w) \le 0, \; h_j(w) = 0$ (primal constraints)
$\alpha_i \ge 0$ (dual constraints)
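A quick worked example of strong duality and the KKT conditions (not from the original slides): minimize $f(w) = w^2$ subject to $g(w) = 1 - w \le 0$.
$\mathcal{L}(w, \alpha) = w^2 + \alpha(1 - w)$
$\min_w$: $\frac{\partial \mathcal{L}}{\partial w} = 2w - \alpha = 0 \Rightarrow w = \frac{\alpha}{2}$, giving the dual function $\theta_D(\alpha) = \alpha - \frac{\alpha^2}{4}$
$\max_{\alpha \ge 0} \theta_D(\alpha)$: $\alpha^* = 2$, $d^* = 1$; the primal optimum is $w^* = 1$, $p^* = 1$, so $d^* = p^*$
KKT check: $\frac{\partial \mathcal{L}}{\partial w}(w^*, \alpha^*) = 0$, $\alpha^* g(w^*) = 2 \cdot 0 = 0$, $g(w^*) \le 0$, $\alpha^* \ge 0$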
Lagrange duality
• What are the proper conditions?
• A set of conditions (Slater conditions):
• $f$, $g_i$ convex, $h_j$ affine
• There exists $w$ satisfying all $g_i(w) < 0$
• There exist other sets of conditions
• Search Karush-Kuhn-Tucker conditions on Wikipedia
SVM: optimization
• Optimization (Quadratic Programming):
$\min_{w,b} \; \frac{1}{2}\|w\|^2$
subject to $y_i(w^T x_i + b) \ge 1, \; \forall i$
• Generalized Lagrangian:
$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i [y_i(w^T x_i + b) - 1]$
where the $\alpha_i$'s are the Lagrange multipliers
SVM: optimization
• KKT conditions:
$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$ (1)
$\frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; 0 = \sum_i \alpha_i y_i$ (2)
• Plug into $\mathcal{L}$:
$\mathcal{L}(w, b, \alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$ (3)
combined with $0 = \sum_i \alpha_i y_i$, $\alpha_i \ge 0$
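The substitution step, spelled out (a short derivation added for clarity, following directly from (1) and (2)):
$\frac{1}{2}\|w\|^2 = \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$ using (1)
$\sum_i \alpha_i y_i w^T x_i = \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$ using (1), and $b \sum_i \alpha_i y_i = 0$ using (2)
so $\mathcal{L} = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i [y_i(w^T x_i + b) - 1] = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$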
SVM: optimization
• Reduces to the dual problem:
$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$
subject to $\sum_i \alpha_i y_i = 0$, $\alpha_i \ge 0$
• Since $w = \sum_i \alpha_i y_i x_i$, we have $w^T x + b = \sum_i \alpha_i y_i x_i^T x + b$
• Both only depend on inner products
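A minimal sketch of solving this dual QP numerically (not from the lecture; the toy data and the use of the cvxpy library are illustrative assumptions):

import numpy as np
import cvxpy as cp

# Toy, linearly separable data (hypothetical example)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

alpha = cp.Variable(n)
# Dual objective: sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j,
# written via ||X^T diag(y) alpha||^2 so the problem is recognized as convex
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha)))
constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                     # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(a))              # index of a support vector (alpha_i > 0)
b = y[sv] - X[sv] @ w               # from y_sv (w^T x_sv + b) = 1
print("alpha =", a, "w =", w, "b =", b)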
Kernel methods
Features
[Figure: an input image is mapped to a feature vector, e.g. a color histogram over the red, green, and blue channels: $x \mapsto \phi(x)$]
Features
• A proper feature mapping can turn a non-linear problem into a linear one
• Using SVM on the feature space $\{\phi(x_i)\}$: only need the inner products $\phi(x_i)^T \phi(x_j)$
• Conclusion: no need to design $\phi(\cdot)$; only need to design the kernel
$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
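A sketch of how prediction uses only kernel evaluations (the function and variable names below are hypothetical, for illustration only):

import numpy as np

def decision_function(x, X_sv, y_sv, alpha_sv, b, kernel):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b, computed from kernel values only."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha_sv, y_sv, X_sv)) + b

# Example kernel to pass in: degree-2 polynomial (x^T x' + c)^2
poly2 = lambda u, v, c=1.0: (u @ v + c) ** 2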
Polynomial kernels
• Fix a degree $d$ and a constant $c$: $K(x, x') = (x^T x' + c)^d$
• What is $\phi(x)$?
• Expand the expression to get $\phi(x)$
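For $d = 2$ and $x \in \mathbb{R}^2$, a quick numerical check (an illustrative sketch, not from the slides) that the kernel equals the inner product of the explicit features used later in the lecture:

import numpy as np

def phi(x, c=1.0):
    # Explicit feature map for (x^T x' + c)^2 with x in R^2
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

def poly2(x, xp, c=1.0):
    return (x @ xp + c) ** 2

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
assert np.isclose(poly2(x, xp), phi(x) @ phi(xp))   # kernel value matches phi(x)^T phi(x')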
Polynomial kernels
[Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar]
Gaussian kernels
• Fix a bandwidth $\sigma$:
$K(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$
• Also called radial basis function (RBF) kernels
• What is $\phi(x)$? Consider the un-normalized version
$K'(x, x') = \exp(x^T x' / \sigma^2)$
• Power series expansion:
$K'(x, x') = \sum_{i=0}^{+\infty} \frac{(x^T x')^i}{\sigma^{2i} \, i!}$
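A small numerical check of the expansion (illustrative only; the truncation length is an arbitrary choice):

import numpy as np
from math import factorial

def rbf_unnormalized(x, xp, sigma=1.0):
    return np.exp(x @ xp / sigma**2)

def truncated_series(x, xp, sigma=1.0, terms=20):
    # Partial sum of  sum_i (x^T x')^i / (sigma^(2i) * i!)
    z = x @ xp
    return sum(z**i / (sigma**(2*i) * factorial(i)) for i in range(terms))

x, xp = np.array([0.3, -0.2]), np.array([0.5, 0.1])
print(rbf_unnormalized(x, xp), truncated_series(x, xp))   # nearly equal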
Mercer's condition for kernels
• Theorem: $K(x, x')$ has an expansion
$K(x, x') = \sum_{i=1}^{+\infty} a_i \phi_i(x) \phi_i(x')$
if and only if for any function $c(x)$,
$\int \int c(x)\, c(x')\, K(x, x')\, dx\, dx' \ge 0$
(omitting some technical conditions on $K$ and $c$)
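A finite-sample analogue of this condition (an illustrative sketch, not from the slides): a valid kernel must give a positive semidefinite Gram matrix on any finite set of points, which is easy to check numerically. The helper names below are hypothetical.

import numpy as np

def gram(kernel, X):
    # Gram matrix K[i, j] = kernel(x_i, x_j)
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def is_psd(K, tol=1e-8):
    # Symmetrize for numerical safety, then check all eigenvalues are (nearly) non-negative
    return bool(np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol))

rbf = lambda u, v, s=1.0: np.exp(-np.sum((u - v)**2) / (2 * s**2))
X = np.random.randn(20, 3)
print(is_psd(gram(rbf, X)))   # expected: True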
Constructing new kernels
• Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{i=0}^{+\infty} a_i K^i(x, x')$ with coefficients $a_i \ge 0$
• Example: if $K_1(x, x')$ and $K_2(x, x')$ are kernels, then so is
$K(x, x') = 2K_1(x, x') + 3K_2(x, x')$
• Example: if $K_1(x, x')$ is a kernel, then so is
$K(x, x') = \exp(K_1(x, x'))$
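Both example constructions can be sanity-checked with the same positive-semidefiniteness test (illustration only; the base kernels chosen here are assumptions):

import numpy as np

lin = lambda u, v: u @ v                                              # K1: linear kernel
rbf = lambda u, v, s=1.0: np.exp(-np.sum((u - v)**2) / (2 * s**2))    # K2: RBF kernel

combo = lambda u, v: 2 * lin(u, v) + 3 * rbf(u, v)                    # 2*K1 + 3*K2
exp_k = lambda u, v: np.exp(lin(u, v))                                # exp(K1)

# Gram matrices of the constructed kernels should stay positive semidefinite
X = np.random.randn(20, 3)
for k in (combo, exp_k):
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    print(np.linalg.eigvalsh(K).min() >= -1e-8)   # expected: True for both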
Kernels vs. neural networks
Features
[Figure: extract features $x \mapsto \phi(x)$ (e.g. a color histogram over the red, green, and blue channels), then build the hypothesis $y = w^T \phi(x)$]
• Features are part of the model
• $y = w^T \phi(x)$ is a linear model in the features $\phi(x)$, but a nonlinear model in the input $x$
Polynomial kernels
[Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar]
Polynomial kernel SVM as a two-layer neural network
[Figure: a two-layer network whose first layer maps the inputs $x_1, x_2$ to the fixed features $x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2c}\,x_1,\; \sqrt{2c}\,x_2,\; c$, and whose output is $y = \mathrm{sign}(w^T \phi(x) + b)$]
The first layer is fixed. If the first layer is also learned, it becomes a two-layer neural network.