Support Vector Machines
CS 760@UW-Madison
Goals for Part 1
you should understand the following concepts
• the margin
• the linear support vector machine
• the primal and dual formulations of SVM learning
• support vectors
• VC-dimension and maximizing the margin
Motivation
Linear classification
[Figure: linearly separable data. The hyperplane $(w^*)^\top x = 0$ separates Class +1, where $(w^*)^\top x > 0$, from Class −1, where $(w^*)^\top x < 0$; $w^*$ is normal to the hyperplane.]

Assume perfect separation between the two classes.
Attempt
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Hypothesis $y = \text{sign}(f_w(x)) = \text{sign}(w^\top x)$
• $y = +1$ if $w^\top x > 0$
• $y = -1$ if $w^\top x < 0$
• Let's assume that we can optimize to find $w$
Multiple optimal solutions?
[Figure: three separating hyperplanes $w_1, w_2, w_3$ for the same data (Class +1 vs. Class −1).]

Same empirical loss; different test/expected loss.
What about $w_1$?
[Figure: the hyperplane $w_1$ with new test data (Class +1 vs. Class −1); some new points fall on the wrong side.]
What about $w_3$?
[Figure: the hyperplane $w_3$ with new test data (Class +1 vs. Class −1); some new points fall on the wrong side.]
Most confident: $w_2$
[Figure: the hyperplane $w_2$ still classifies the new test data correctly (Class +1 vs. Class −1).]
Intuition: margin
[Figure: the hyperplane $w_2$ separates Class +1 from Class −1 with a large margin.]
Margin
![Page 12: Support Vector Machinespages.cs.wisc.edu/~jerryzhu/cs760/15_SVM.pdf•one SVM can represent only a binary classification task; multi-class problems handled using multiple SVMsand some](https://reader035.fdocuments.in/reader035/viewer/2022070703/5e78b256172e6f49cd592c0c/html5/thumbnails/12.jpg)
Margin
• Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^\top x = 0$

Proof:
• $w$ is orthogonal to the hyperplane
• The unit direction is $\frac{w}{\|w\|}$
• The projection of $x$ onto it is $\left(\frac{w}{\|w\|}\right)^\top x = \frac{f_w(x)}{\|w\|}$

[Figure: the projection of $x$ onto the direction $\frac{w}{\|w\|}$, with the hyperplane passing through $0$.]
Margin: with bias
• Claim 1: $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$

Proof:
• pick any $x_1$ and $x_2$ on the hyperplane
• $w^\top x_1 + b = 0$
• $w^\top x_2 + b = 0$
• So $w^\top (x_1 - x_2) = 0$
Margin: with bias
• Claim 2: the origin $0$ has distance $\frac{|b|}{\|w\|}$ to the hyperplane $w^\top x + b = 0$

Proof:
• pick any $x_1$ on the hyperplane
• project $x_1$ onto the unit direction $\frac{w}{\|w\|}$ to get the distance
• $\left(\frac{w}{\|w\|}\right)^\top x_1 = \frac{-b}{\|w\|}$ since $w^\top x_1 + b = 0$
Margin: with bias
• Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$

Proof:
• Let $x = x_\perp + r\frac{w}{\|w\|}$, where $x_\perp$ is the projection onto the hyperplane; then $|r|$ is the distance
• Multiply both sides by $w^\top$ and add $b$
• Left-hand side: $w^\top x + b = f_{w,b}(x)$
• Right-hand side: $w^\top x_\perp + r\frac{w^\top w}{\|w\|} + b = 0 + r\|w\|$
• So $r = \frac{f_{w,b}(x)}{\|w\|}$
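To make the distance formula concrete, here is a minimal numpy sketch of Lemma 2 on made-up values (not from the slides):

```python
import numpy as np

# A minimal sketch of Lemma 2: the distance from x to the hyperplane
# w^T x + b = 0 is |w^T x + b| / ||w||. Values here are invented for illustration.
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([2.0, 1.0])

distance = abs(w @ x + b) / np.linalg.norm(w)
print(distance)  # |3*2 + 4*1 - 5| / 5 = 5/5 = 1.0
```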
The notation here is: $y(x) = w^\top x + w_0$
Figure from Pattern Recognition and Machine Learning, Bishop
Support Vector Machine (SVM)
SVM: objective
• Margin over all training data points:
$$\gamma = \min_i \frac{|f_{w,b}(x_i)|}{\|w\|}$$
• Since we only want correct $f_{w,b}$, and recall $y_i \in \{+1, -1\}$, we have
$$\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$
• If $f_{w,b}$ is incorrect on some $x_i$, the margin is negative
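As a quick illustration of this definition, the following sketch (not from the slides) computes $\gamma$ for a toy dataset and a hypothetical $(w, b)$:

```python
import numpy as np

# A small sketch of the margin definition: gamma = min_i y_i (w^T x_i + b) / ||w||.
# The data and parameters below are toy values, not an SVM solution.
X = np.array([[2.0, 2.0], [3.0, 0.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

gamma = np.min(y * (X @ w + b)) / np.linalg.norm(w)
print(gamma)  # negative if any point is misclassified
```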
SVM: objective
• Maximize margin over all training data points:
$$\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|} = \max_{w,b} \min_i \frac{y_i (w^\top x_i + b)}{\|w\|}$$
• A bit complicated …
SVM: simplified objective
• Observation: when $(w, b)$ is scaled by a factor $c$, the margin is unchanged:
$$\frac{y_i (c w^\top x_i + c b)}{\|c w\|} = \frac{y_i (w^\top x_i + b)}{\|w\|}$$
• Let's consider a fixed scale such that
$$y_{i^*} (w^\top x_{i^*} + b) = 1$$
where $x_{i^*}$ is the point closest to the hyperplane
SVM: simplified objective
• Let's consider a fixed scale such that
$$y_{i^*} (w^\top x_{i^*} + b) = 1$$
where $x_{i^*}$ is the point closest to the hyperplane
• Now we have for all data
$$y_i (w^\top x_i + b) \ge 1$$
and for at least one $i$ the equality holds
• Then the margin is $\frac{1}{\|w\|}$
SVM: simplified objective
• Optimization simplified to
$$\min_{w,b} \frac{1}{2}\|w\|^2$$
$$y_i (w^\top x_i + b) \ge 1, \quad \forall i$$
• How to find the optimum $w^*$?
• Solved by the Lagrange multiplier method
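For a rough practical check (not part of the slides), scikit-learn's SVC can approximate the hard-margin solution on separable toy data by making the soft-margin penalty very large:

```python
import numpy as np
from sklearn.svm import SVC

# A sketch of the hard-margin SVM on separable toy data. scikit-learn only
# exposes the soft-margin formulation, so a very large C approximates the
# hard margin (the slack penalty dominates).
X = np.array([[2.0, 2.0], [3.0, 0.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e10).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print(w, b, 1.0 / np.linalg.norm(w))  # weight vector, bias, and margin 1/||w||
```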
Lagrange multiplier
Lagrangian
• Consider the optimization problem:
$$\min_w f(w)$$
$$h_i(w) = 0, \quad \forall 1 \le i \le l$$
• Lagrangian:
$$\mathcal{L}(w, \boldsymbol{\beta}) = f(w) + \sum_i \beta_i h_i(w)$$
where the $\beta_i$'s are called Lagrange multipliers
Lagrangian
• Consider the optimization problem:
$$\min_w f(w)$$
$$h_i(w) = 0, \quad \forall 1 \le i \le l$$
• Solved by setting the derivatives of the Lagrangian to 0:
$$\frac{\partial \mathcal{L}}{\partial w_i} = 0; \quad \frac{\partial \mathcal{L}}{\partial \beta_i} = 0$$
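A tiny worked example (not from the slides) of the method on an equality-constrained problem, minimizing $\frac{1}{2}\|w\|^2$ subject to $a^\top w = 1$:

```latex
% A worked example: minimize (1/2)||w||^2 subject to h(w) = a^T w - 1 = 0.
\mathcal{L}(w, \beta) = \tfrac{1}{2}\lVert w \rVert^2 + \beta\,(a^\top w - 1)
% Stationarity in w:
\frac{\partial \mathcal{L}}{\partial w} = w + \beta a = 0
  \;\Rightarrow\; w = -\beta a
% Stationarity in beta recovers the constraint:
\frac{\partial \mathcal{L}}{\partial \beta} = a^\top w - 1 = 0
  \;\Rightarrow\; \beta = -\frac{1}{\lVert a \rVert^2},
  \qquad w^* = \frac{a}{\lVert a \rVert^2}
```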
Generalized Lagrangian
• Consider the optimization problem:
$$\min_w f(w)$$
$$g_i(w) \le 0, \quad \forall 1 \le i \le k$$
$$h_j(w) = 0, \quad \forall 1 \le j \le l$$
• Generalized Lagrangian:
$$\mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w)$$
where the $\alpha_i, \beta_j$'s are called Lagrange multipliers
Generalized Lagrangian
• Consider the quantity:
$$\theta_P(w) := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• Why?
$$\theta_P(w) = \begin{cases} f(w), & \text{if } w \text{ satisfies all the constraints} \\ +\infty, & \text{if } w \text{ does not satisfy the constraints} \end{cases}$$
• So minimizing $f(w)$ is the same as minimizing $\theta_P(w)$:
$$\min_w f(w) = \min_w \theta_P(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
Lagrange duality
• The primal problem:
$$p^* := \min_w f(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• The dual problem:
$$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}: \alpha_i \ge 0} \min_w \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• Always true:
$$d^* \le p^*$$
• Interesting case: when do we have $d^* = p^*$?
Lagrange duality
• Theorem: under proper conditions, there exist $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that
$$d^* = \mathcal{L}(w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$$
• Moreover, $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:
$$\frac{\partial \mathcal{L}}{\partial w_i} = 0$$
$$\alpha_i g_i(w) = 0 \quad \text{(dual complementarity)}$$
$$g_i(w) \le 0, \quad h_j(w) = 0 \quad \text{(primal constraints)}$$
$$\alpha_i \ge 0 \quad \text{(dual constraints)}$$
Lagrange duality
• What are the proper conditions?
• A set of conditions (Slater's conditions):
• $f, g_i$ convex, $h_j$ affine, and there exists $w$ satisfying all $g_i(w) < 0$
• There exist other sets of conditions
• Check textbooks, e.g., Convex Optimization by Boyd and Vandenberghe
SVM: optimization
SVM: optimization
• Optimization (Quadratic Programming):
$$\min_{w,b} \frac{1}{2}\|w\|^2$$
$$y_i (w^\top x_i + b) \ge 1, \quad \forall i$$
• Generalized Lagrangian:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i [y_i (w^\top x_i + b) - 1]$$
where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers
SVM: optimization
• KKT conditions:
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i \quad (1)$$
$$\frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; 0 = \sum_i \alpha_i y_i \quad (2)$$
• Plug into $\mathcal{L}$:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad (3)$$
combined with $0 = \sum_i \alpha_i y_i$, $\alpha_i \ge 0$
SVM: optimization
• Reduces to the dual problem:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j x_i^\top x_j$$
$$\sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0$$
• Since $w = \sum_i \alpha_i y_i x_i$, we have $w^\top x + b = \sum_i \alpha_i y_i x_i^\top x + b$
• The solution depends only on inner products!
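As an illustrative check (not from the slides), scikit-learn exposes $\alpha_i y_i$ for the support vectors as `dual_coef_`, so the expansion $w = \sum_i \alpha_i y_i x_i$ can be verified numerically on toy data:

```python
import numpy as np
from sklearn.svm import SVC

# A sketch checking KKT condition (1), w = sum_i alpha_i y_i x_i, on toy data.
# In scikit-learn, dual_coef_ stores alpha_i * y_i for the support vectors.
X = np.array([[2.0, 2.0], [3.0, 0.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e10).fit(X, y)
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_[0]))  # True: primal w matches the dual expansion
```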
Support Vectors
• those instances with $\alpha_i > 0$ are called support vectors
• they lie on the margin boundary
• the solution is NOT changed if we delete the instances with $\alpha_i = 0$
• the final solution is a sparse linear combination of the training instances
Learning theory justification
• Vapnik showed a connection between the margin and VC-dimension:
$$VC \le \frac{4R^2}{\text{margin}^2(h)}$$
where $R$ is a constant dependent on the training data
• thus, to minimize the VC-dimension (and to improve the error bound), maximize the margin
• the bound relates the error on the true distribution to the training-set error, where $VC$ is the VC-dimension of the hypothesis class:
$$error(h) \le error_{train}(h) + \sqrt{\frac{VC\left(\log\frac{2m}{VC} + 1\right) + \log\frac{4}{\delta}}{m}}$$
Goals for Part 2
you should understand the following concepts
• soft margin SVM
• support vector regression
• the kernel trick
• polynomial kernel
• Gaussian/RBF kernel
• valid kernels and Mercer's theorem
• kernels and neural networks
Variants: soft-margin and SVR
![Page 42: Support Vector Machinespages.cs.wisc.edu/~jerryzhu/cs760/15_SVM.pdf•one SVM can represent only a binary classification task; multi-class problems handled using multiple SVMsand some](https://reader035.fdocuments.in/reader035/viewer/2022070703/5e78b256172e6f49cd592c0c/html5/thumbnails/42.jpg)
Hard-margin SVM
• Optimization (Quadratic Programming):
$$\min_{w,b} \frac{1}{2}\|w\|^2$$
$$y_i (w^\top x_i + b) \ge 1, \quad \forall i$$
Soft-margin SVM [Cortes & Vapnik, Machine Learning 1995]
• if the training instances are not linearly separable, the previous formulation will fail
• we can adjust our approach by using slack variables (denoted by $\zeta_i$) to tolerate errors:
$$\min_{w,b,\zeta_i} \frac{1}{2}\|w\|^2 + C\sum_i \zeta_i$$
$$y_i (w^\top x_i + b) \ge 1 - \zeta_i, \quad \zeta_i \ge 0, \quad \forall i$$
• $C$ determines the relative importance of maximizing the margin vs. minimizing the slack
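A small sketch (not from the slides) of the role of $C$ in scikit-learn, on synthetic overlapping blobs:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# A sketch of the soft-margin SVM: smaller C tolerates more slack (wider
# margin, more margin violations); larger C penalizes slack heavily.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_)  # per-class support-vector counts, typically fewer as C grows
```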
The effect of 𝐶 in soft-margin SVM
Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010
Hinge loss
• when we covered neural nets, we talked about minimizing squared loss and cross-entropy loss
• SVMs minimize hinge loss
[Figure: loss (error) when $y = 1$ as a function of the model output $h(x)$, comparing squared loss, 0/1 loss, and hinge loss.]
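A short numpy sketch (not from the slides) comparing the three losses at a grid of model outputs:

```python
import numpy as np

# A sketch of hinge loss max(0, 1 - y * h(x)) next to squared and 0/1 loss,
# for a label y = 1 over a range of model outputs.
h = np.linspace(-2, 2, 9)   # model outputs h(x)
y = 1.0

hinge = np.maximum(0.0, 1.0 - y * h)
squared = (y - h) ** 2
zero_one = (np.sign(h) != y).astype(float)
print(np.c_[h, hinge, squared, zero_one])
```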
Support Vector Regression
• the SVM idea can also be applied in regression tasks
• an $\epsilon$-insensitive error function specifies that a training instance is well explained if the model's prediction is within $\epsilon$ of $y_i$

[Figure: the $\epsilon$-tube around the regression function, bounded by $(w^\top x + b) - y = \epsilon$ and $y - (w^\top x + b) = \epsilon$.]
Support Vector Regression
• Regression using slack variables (denoted by $\zeta_i, \xi_i$) to tolerate errors:
$$\min_{w,b,\zeta_i,\xi_i} \frac{1}{2}\|w\|^2 + C\sum_i (\zeta_i + \xi_i)$$
$$(w^\top x_i + b) - y_i \le \epsilon + \zeta_i,$$
$$y_i - (w^\top x_i + b) \le \epsilon + \xi_i,$$
$$\zeta_i, \xi_i \ge 0.$$
• slack variables allow predictions for some training instances to be off by more than $\epsilon$
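A minimal sketch (not from the slides) of $\epsilon$-insensitive regression with scikit-learn's SVR on synthetic data:

```python
import numpy as np
from sklearn.svm import SVR

# A sketch of epsilon-insensitive regression: points inside the epsilon tube
# cost nothing, so only points on or outside the tube become support vectors.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.2).fit(X, y)
print(len(reg.support_), "support vectors out of", len(X))
```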
Kernel methods
Features
[Figure: extract features: an image $x$ is mapped to a color histogram $\phi(x)$ over red and green values.]
Features
A proper feature mapping can turn a non-linear problem into a linear one!
Recall: SVM dual form
• Reduces to the dual problem:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j x_i^\top x_j$$
$$\sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0$$
• Since $w = \sum_i \alpha_i y_i x_i$, we have $w^\top x + b = \sum_i \alpha_i y_i x_i^\top x + b$
• The solution depends only on inner products!
Features
• Using SVM on the feature space $\{\phi(x_i)\}$: only need $\phi(x_i)^\top \phi(x_j)$
• Conclusion: no need to design $\phi(\cdot)$; only need to design
$$k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$$
Polynomial kernels
• Fix degree $d$ and constant $c$:
$$k(x, x') = (x^\top x' + c)^d$$
• What is $\phi(x)$? Expand the expression to get $\phi(x)$.
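For $d = 2$ in two dimensions, the expansion gives $\phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1x_2, \sqrt{2c}\,x_1, \sqrt{2c}\,x_2, c)$; the following sketch (not from the slides) checks that the kernel equals the explicit inner product:

```python
import numpy as np

# A sketch verifying k(x, x') = (x^T x' + c)^2 equals phi(x)^T phi(x') for the
# explicit degree-2 feature map in two dimensions.
def phi(x, c):
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x, xp, c = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 1.0
print((x @ xp + c) ** 2, phi(x, c) @ phi(xp, c))  # both 4.0
```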
Polynomial kernels
Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar
SVMs with polynomial kernels
Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010
Gaussian/RBF kernels
• Fix bandwidth $\sigma$:
$$k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$$
• Also called radial basis function (RBF) kernels
• What is $\phi(x)$? Consider the un-normalized version
$$k'(x, x') = \exp(x^\top x' / \sigma^2)$$
• Power series expansion:
$$k'(x, x') = \sum_{i=0}^{+\infty} \frac{(x^\top x')^i}{\sigma^{2i}\, i!}$$
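A minimal numpy sketch (not from the slides) of the RBF kernel, showing similarity decaying with distance:

```python
import numpy as np

# A sketch of the Gaussian/RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)):
# nearby points get similarity near 1, distant points near 0.
def rbf(x, xp, sigma):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

x = np.array([0.0, 0.0])
for xp in (np.array([0.1, 0.0]), np.array([3.0, 3.0])):
    print(rbf(x, xp, sigma=1.0))
```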
The RBF kernel illustrated
Figures from openclassroom.stanford.edu (Andrew Ng)
[Three panels: $\gamma = -10$, $\gamma = -100$, $\gamma = -1000$.]
Mercer's condition for kernels
• Theorem: $k(x, x')$ has an expansion
$$k(x, x') = \sum_{i=1}^{+\infty} a_i \phi_i(x) \phi_i(x')$$
if and only if for any function $c(x)$,
$$\int \int c(x)\, c(x')\, k(x, x')\, dx\, dx' \ge 0$$
(omitting some mathematical conditions on $k$ and $c$)
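A finite-sample analogue (my illustration, not from the slides): for a valid kernel, the Gram matrix is positive semidefinite, which can be checked numerically:

```python
import numpy as np

# A sketch of the finite-sample analogue of Mercer's condition: for a valid
# kernel, the Gram matrix K with K_ij = k(x_i, x_j) is positive semidefinite,
# i.e., all eigenvalues are >= 0 (up to numerical round-off).
rng = np.random.RandomState(0)
X = rng.randn(20, 3)

K = (X @ X.T + 1.0) ** 3  # polynomial kernel, degree 3
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-8)  # True: no significantly negative eigenvalues
```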
Constructing new kernels
• Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{i=0}^{+\infty} a_i k^i(x, x')$
• Example: if $k_1(x, x')$ and $k_2(x, x')$ are kernels, then so is
$$k(x, x') = 2k_1(x, x') + 3k_2(x, x')$$
• Example: if $k_1(x, x')$ is a kernel, then so is
$$k(x, x') = \exp(k_1(x, x'))$$
Kernel algebra
• given a valid kernel, we can make new valid kernels using a variety of operators

| kernel composition | mapping composition |
| --- | --- |
| $k(x, v) = k_a(x, v) + k_b(x, v)$ | $\phi(x) = (\phi_a(x), \phi_b(x))$ |
| $k(x, v) = \gamma k_a(x, v),\ \gamma > 0$ | $\phi(x) = \sqrt{\gamma}\,\phi_a(x)$ |
| $k(x, v) = k_a(x, v)\, k_b(x, v)$ | $\phi_l(x) = \phi_{ai}(x)\, \phi_{bj}(x)$ |
| $k(x, v) = x^\top A v$, $A$ is p.s.d. | $\phi(x) = L^\top x$, where $A = L L^\top$ |
| $k(x, v) = f(x) f(v) k_a(x, v)$ | $\phi(x) = f(x)\, \phi_a(x)$ |
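As an illustration (not from the slides), a kernel built by these rules can be passed to scikit-learn as a callable returning the Gram matrix; the combined kernel below (a hypothetical linear + RBF sum) is one such construction:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# A sketch of kernel algebra in practice: a sum of two valid kernels is a
# valid kernel, and scikit-learn accepts it as a callable computing the
# Gram matrix between two sets of points.
def combined_kernel(A, B):
    linear = A @ B.T
    rbf = np.exp(-0.5 * np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2))
    return linear + rbf  # sum of kernels is a kernel

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)
clf = SVC(kernel=combined_kernel, C=1.0).fit(X, y)
print(clf.score(X, y))
```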
Kernels vs. neural networks
Features
[Figure: extract features: an image $x$ is mapped to a color histogram over red and green values; then build the hypothesis $y = w^\top \phi(x)$.]
Features: part of the model
Build the hypothesis $y = w^\top \phi(x)$:
• a linear model in the feature space $\phi(x)$
• a nonlinear model as a function of the original input $x$
Polynomial kernels
Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar
Polynomial kernel SVM as two layer neural network
[Figure: for $d = 2$, inputs $x_1, x_2$ feed a fixed first layer computing $\phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}x_1x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c)$; the output is $y = \text{sign}(w^\top \phi(x) + b)$.]
The first layer is fixed. If we also learn the first layer, it becomes a two-layer neural network.
Comments on SVMs
• we can find solutions that are globally optimal (maximize the margin)
  • because the learning task is framed as a convex optimization problem
  • no local minima, in contrast to multi-layer neural nets
• there are two formulations of the optimization: primal and dual
  • the dual represents the classifier decision in terms of support vectors
  • the dual enables the use of kernel functions
• we can use a wide range of optimization methods to learn SVMs
  • standard quadratic programming solvers
  • SMO [Platt, 1999]
  • linear programming solvers for some formulations
  • etc.
Comments on SVMs
• kernels provide a powerful way to
  • allow nonlinear decision boundaries
  • represent/compare complex objects such as strings and trees
  • incorporate domain knowledge into the learning task
• using the kernel trick, we can implicitly use high-dimensional mappings without explicitly computing them
• one SVM can represent only a binary classification task; multi-class problems are handled using multiple SVMs and some encoding
• empirically, SVMs have shown (close to) state-of-the-art accuracy for many tasks
• the kernel idea can be extended to other tasks (anomaly detection, regression, etc.)
THANK YOU

Some of the slides in these lectures have been adapted/borrowed from materials developed by Yingyu Liang, Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Elad Hazan, Tom Dietterich, and Pedro Domingos.