
Machine Learning Basics Lecture 5: SVM II

Princeton University COS 495

Instructor: Yingyu Liang


Review: SVM objective


SVM: objective

• Let $y_i \in \{+1, -1\}$ and $f_{w,b}(x) = w^T x + b$. Margin:

$$\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$

• Support Vector Machine:

$$\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$


SVM: optimization

• Optimization (Quadratic Programming):

$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \ \forall i$$

• Solved by the Lagrange multiplier method:

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$$

where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers
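As a concrete illustration, here is a minimal sketch (not the lecture's code, on a made-up four-point dataset) that solves this primal QP with scipy's general-purpose constrained solver; a dedicated QP solver would be the usual choice in practice.

```python
# Minimal sketch: hard-margin SVM primal
#   min_{w,b} (1/2) ||w||^2   s.t.  y_i (w^T x_i + b) >= 1 for all i
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # toy inputs
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {+1, -1}

def objective(z):                   # z packs (w_1, w_2, b)
    w = z[:2]
    return 0.5 * w @ w              # (1/2) ||w||^2; b is not penalized

# One inequality constraint per example: y_i (w^T x_i + b) - 1 >= 0
cons = [{'type': 'ineq', 'fun': lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
        for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=cons)
w, b = res.x[:2], res.x[2]
print("w =", w.round(3), " b =", round(b, 3))
print("margins:", (y * (X @ w + b)).round(3))   # all >= 1 at the optimum
```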


Lagrange multiplier


Lagrangian

• Consider the optimization problem:

$$\min_w f(w) \quad \text{s.t.} \quad h_i(w) = 0, \ \forall\, 1 \le i \le l$$

• Lagrangian:

$$\mathcal{L}(w, \boldsymbol{\beta}) = f(w) + \sum_i \beta_i h_i(w)$$

where the $\beta_i$'s are called Lagrange multipliers


Lagrangian

• Consider the optimization problem:

$$\min_w f(w) \quad \text{s.t.} \quad h_i(w) = 0, \ \forall\, 1 \le i \le l$$

• Solved by setting the derivatives of the Lagrangian to 0:

$$\frac{\partial \mathcal{L}}{\partial w_i} = 0; \qquad \frac{\partial \mathcal{L}}{\partial \beta_i} = 0$$


Generalized Lagrangian

• Consider the optimization problem:

$$\min_w f(w) \quad \text{s.t.} \quad g_i(w) \le 0, \ \forall\, 1 \le i \le k; \qquad h_j(w) = 0, \ \forall\, 1 \le j \le l$$

• Generalized Lagrangian:

$$\mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w)$$

where the $\alpha_i$'s and $\beta_j$'s are called Lagrange multipliers


Generalized Lagrangian

• Consider the quantity:

$$\theta_P(w) := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• Why?

$$\theta_P(w) = \begin{cases} f(w), & \text{if } w \text{ satisfies all the constraints} \\ +\infty, & \text{if } w \text{ does not satisfy the constraints} \end{cases}$$

(If some $g_i(w) > 0$ or $h_j(w) \ne 0$, sending the corresponding multiplier to infinity drives $\mathcal{L}$ to $+\infty$; if all constraints hold, the maximum is attained with every $\alpha_i g_i(w) = 0$, leaving just $f(w)$.)

• So minimizing $f(w)$ is the same as minimizing $\theta_P(w)$:

$$\min_w f(w) = \min_w \theta_P(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
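A one-variable illustration (not on the slides): take $f(w) = w^2$ with the single constraint $g(w) = 1 - w \le 0$. Then

$$\theta_P(w) = \max_{\alpha \ge 0} \left[ w^2 + \alpha (1 - w) \right] = \begin{cases} w^2, & w \ge 1 \\ +\infty, & w < 1 \end{cases}$$

so $\min_w \theta_P(w) = 1$, attained at the constrained minimizer $w = 1$.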


Lagrange duality

• The primal problem:

$$p^* := \min_w f(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• The dual problem:

$$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \min_w \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• Always true (weak duality): $d^* \le p^*$, since for any $w$ and any $(\boldsymbol{\alpha}, \boldsymbol{\beta})$ with $\alpha_i \ge 0$, $\min_{w'} \mathcal{L}(w', \boldsymbol{\alpha}, \boldsymbol{\beta}) \le \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta}) \le \max_{\boldsymbol{\alpha}', \boldsymbol{\beta}':\, \alpha_i' \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}', \boldsymbol{\beta}')$


• Interesting case: when do we have $d^* = p^*$?


Lagrange duality

• Theorem: under proper conditions, there exist $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that

$$d^* = \mathcal{L}(w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$$

• Moreover, $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial \mathcal{L}}{\partial w_i} = 0, \qquad \alpha_i g_i(w) = 0 \ \text{ (dual complementarity)}$$

$$g_i(w) \le 0, \quad h_j(w) = 0 \ \text{ (primal constraints)}, \qquad \alpha_i \ge 0 \ \text{ (dual constraints)}$$



Lagrange duality

• What are the proper conditions?

• One set of sufficient conditions (Slater's conditions):
  • $f$ and the $g_i$ are convex, and the $h_j$ are affine
  • There exists a $w$ satisfying $g_i(w) < 0$ for all $i$

• Other sets of conditions exist; search for the Karush–Kuhn–Tucker conditions on Wikipedia


SVM: optimization


SVM: optimization

• Optimization (Quadratic Programming):

$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \ \forall i$$

• Generalized Lagrangian:

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$$

where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers


SVM: optimization

• KKT conditions:

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \ \Rightarrow\ w = \sum_i \alpha_i y_i x_i \quad (1)$$

$$\frac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow\ 0 = \sum_i \alpha_i y_i \quad (2)$$

• Plug (1) and (2) into $\mathcal{L}$:

$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad (3)$$

combined with $0 = \sum_i \alpha_i y_i$ and $\alpha_i \ge 0$


SVM: optimization

• Reduces to the dual problem:

$$\max_{\boldsymbol{\alpha}} \ \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$

$$\text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0$$

• Since $w = \sum_i \alpha_i y_i x_i$, we have $w^T x + b = \sum_i \alpha_i y_i\, x_i^T x + b$

• Both the dual objective and the resulting classifier depend only on inner products
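To make the reduction concrete, here is a minimal sketch (not the lecture's code) that solves this dual with scipy on a made-up four-point dataset and recovers $(w, b)$ from the KKT conditions; note the data enters only through the Gram matrix of inner products.

```python
# Minimal sketch: hard-margin SVM dual
#   max_alpha  sum_i alpha_i - (1/2) sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
#   s.t.       sum_i alpha_i y_i = 0,  alpha_i >= 0
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T                                   # Gram matrix: the data appears only here

def neg_dual(alpha):                          # scipy minimizes, so negate the dual
    v = alpha * y
    return -(alpha.sum() - 0.5 * v @ K @ v)

res = minimize(neg_dual, x0=np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),                         # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum_i alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X                           # recover w = sum_i alpha_i y_i x_i, eq. (1)
sv = alpha > 1e-6                             # support vectors: the alpha_i > 0
b = float(np.mean(y[sv] - X[sv] @ w))         # from y_i (w^T x_i + b) = 1 on the SVs
print("alpha =", alpha.round(3), " w =", w.round(3), " b =", round(b, 3))
```

Complementary slackness is visible in the output: only the points that sit exactly on the margin end up with $\alpha_i > 0$.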


Kernel methods


Features

[Figure: feature extraction. An input image $x$ is mapped to a feature vector $\phi(x)$, here a color histogram over the red, green, and blue channels.]


Features

• A proper feature mapping can turn a non-linear problem into a linear one

• Using SVM on the feature space $\{\phi(x_i)\}$: only need the inner products $\phi(x_i)^T \phi(x_j)$

• Conclusion: no need to design $\phi(\cdot)$ itself; only need to design the kernel

$$k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$


Polynomial kernels

• Fix degree $d$ and constant $c$: $k(x, x') = (x^T x' + c)^d$

• What is $\phi(x)$?

• Expand the expression to get $\phi(x)$; see the check below for $d = 2$
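A numeric check for $d = 2$ in two dimensions (a sketch; the $\sqrt{2}$ factors come from the binomial expansion): the kernel value equals the inner product of an explicit 6-dimensional feature map.

```python
import numpy as np

def phi(x, c):
    # Explicit feature map for k(x, x') = (x^T x' + c)^2 with 2-dimensional x
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2.0) * x1 * x2,
                     np.sqrt(2.0 * c) * x1,
                     np.sqrt(2.0 * c) * x2,
                     c])

rng = np.random.default_rng(0)
x, xp, c = rng.normal(size=2), rng.normal(size=2), 1.0
lhs = (x @ xp + c) ** 2          # kernel evaluation: one inner product, then square
rhs = phi(x, c) @ phi(xp, c)     # explicit feature-space inner product
assert np.isclose(lhs, rhs)
```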


Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar


Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar


Gaussian kernels

• Fix bandwidth $\sigma$:

$$k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$$

• Also called radial basis function (RBF) kernels

• What is $\phi(x)$? Consider the un-normalized version

$$k'(x, x') = \exp(x^T x' / \sigma^2)$$

• Power series expansion:

$$k'(x, x') = \sum_{i=0}^{+\infty} \frac{(x^T x')^i}{\sigma^{2i}\, i!}$$


Mercer's condition for kernels

• Theorem: $k(x, x')$ has the expansion

$$k(x, x') = \sum_{i=1}^{+\infty} a_i \phi_i(x) \phi_i(x')$$

if and only if for any function $c(x)$,

$$\int \int c(x)\, c(x')\, k(x, x')\, dx\, dx' \ge 0$$

(omitting some technical conditions on $k$ and $c$)


Constructing new kernels

• Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{i=0}^{+\infty} a_i\, k(x, x')^i$ (coefficients $a_i \ge 0$)

• Example: if $k_1(x, x')$ and $k_2(x, x')$ are kernels, then so is

$$k(x, x') = 2 k_1(x, x') + 3 k_2(x, x')$$

• Example: if $k_1(x, x')$ is a kernel, then so is

$$k(x, x') = \exp(k_1(x, x'))$$


Kernels vs. neural networks


Features

[Figure: the same feature-extraction pipeline. From the input $x$, extract features $\phi(x)$ (a color histogram over red, green, and blue), then build the hypothesis $y = w^T \phi(x)$ on top of them.]


Features: part of the model

• Build the hypothesis: $y = w^T \phi(x)$

• The weights $w$ form a linear model; composed with the feature map $\phi$, the overall hypothesis is a nonlinear model in the input $x$


Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar


Polynomial kernel SVM as a two-layer neural network

π‘₯1

π‘₯2

π‘₯12

π‘₯22

2π‘₯1π‘₯2

2𝑐π‘₯1

2𝑐π‘₯2

𝑐

𝑦 = sign(π‘€π‘‡πœ™(π‘₯) + 𝑏)

First layer is fixed. If also learn first layer, it becomes two layer neural network
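A minimal sketch of this view (hypothetical names): the first layer is the fixed feature map $\phi$ from the figure, and only the second-layer parameters $(w, b)$ would be learned.

```python
import numpy as np

def first_layer(x, c=1.0):
    # Fixed (non-learned) units realizing phi for k(x, x') = (x^T x' + c)^2 in 2D
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2,
                     np.sqrt(2.0 * c) * x1, np.sqrt(2.0 * c) * x2, c])

def predict(x, w, b):
    # Second layer: a learned linear threshold unit on top of phi(x)
    return np.sign(w @ first_layer(x) + b)
```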