Support Vector Machine II
Jia-Bin Huang
Virginia Tech, Spring 2019, ECE-5424G / CS-5824
Administrative
• HW 1 due tonight
• HW 2 released.
• Seminar: Online Scalable Learning Adaptive to Unknown Dynamics and Graphs
• Yanning Shen, University of Minnesota, Twin Cities
• Tuesday, February 19, 9:00am - 10:15am, 1110 Knowledgeworks II
• Seminar: Declarative Learning-based Programming for Learning and Reasoning over Spatial Language
• Parisa Kordjamshidi, Assistant Professor, Tulane University
• Thursday, February 21, 9:00am - 10:15am, 110 McBryde
Logistic regression (logistic loss)
min_θ (1/m) Σ_{i=1}^m [ y^(i) (−log h_θ(x^(i))) + (1 − y^(i)) (−log(1 − h_θ(x^(i)))) ] + (λ/2m) Σ_{j=1}^n θ_j²

Support vector machine (hinge loss)
min_θ (1/m) Σ_{i=1}^m [ y^(i) cost₁(θᵀx^(i)) + (1 − y^(i)) cost₀(θᵀx^(i)) ] + (λ/2m) Σ_{j=1}^n θ_j²

[Figure: cost₁(θᵀx) (for y = 1) and cost₀(θᵀx) (for y = 0) plotted against θᵀx, next to the logistic curves −log(1/(1 + e^{−θᵀx})) and −log(1 − 1/(1 + e^{−θᵀx})); the hinge costs are piecewise-linear approximations of the logistic costs.]
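The two per-example losses above can be compared numerically. A minimal numpy sketch; cost₁ and cost₀ are written here as the standard hinge pieces max(0, 1 − z) and max(0, 1 + z), one common choice for the piecewise-linear costs the slide plots:

```python
import numpy as np

def logistic_losses(z):
    # per-example logistic losses as a function of z = theta^T x
    h = 1.0 / (1.0 + np.exp(-z))          # sigmoid h_theta(x)
    return -np.log(h), -np.log(1.0 - h)   # (loss if y = 1, loss if y = 0)

def hinge_costs(z):
    # SVM surrogate costs: flat-then-linear approximations of the above
    cost1 = np.maximum(0.0, 1.0 - z)      # used when y = 1
    cost0 = np.maximum(0.0, 1.0 + z)      # used when y = 0
    return cost1, cost0

z = np.array([-2.0, 0.0, 2.0])
c1, c0 = hinge_costs(z)
print(c1, c0)              # [3. 1. 0.] [0. 1. 3.]
print(logistic_losses(z))  # smooth analogues; never exactly zero
```

Note that the hinge costs reach exactly zero once θᵀx is past ±1, which is what drives the margin behavior in the next slides.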
Support Vector Machine
• Cost function
• Large margin classification
• Kernels
• Using an SVM
Support vector machine
min_θ C Σ_{i=1}^m [ y^(i) cost₁(θᵀx^(i)) + (1 − y^(i)) cost₀(θᵀx^(i)) ] + (1/2) Σ_{j=1}^n θ_j²

If y = 1, we want θᵀx ≥ 1 (not just ≥ 0)
If y = 0, we want θᵀx ≤ −1 (not just < 0)

[Figure: cost₁(θᵀx) is zero for θᵀx ≥ 1; cost₀(θᵀx) is zero for θᵀx ≤ −1.]
Slide credit: Andrew Ng
SVM decision boundary
min_θ C Σ_{i=1}^m [ y^(i) cost₁(θᵀx^(i)) + (1 − y^(i)) cost₀(θᵀx^(i)) ] + (1/2) Σ_{j=1}^n θ_j²

• Let's say we have a very large C...
• Whenever y^(i) = 1: we need θᵀx^(i) ≥ 1
• Whenever y^(i) = 0: we need θᵀx^(i) ≤ −1

With the cost terms forced to zero, the problem becomes:
min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
     θᵀx^(i) ≤ −1 if y^(i) = 0
SVM decision boundary: Linearly separable case
[Figure: linearly separable data in the (x₁, x₂) plane; of the many separating lines, the SVM picks the one with the largest margin.]
Slide credit: Andrew Ng
Why large margin classifiers?
min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
     θᵀx^(i) ≤ −1 if y^(i) = 0
[Figure: the max-margin boundary in the (x₁, x₂) plane with its margin marked.]
Vector inner product
u = [u₁, u₂]ᵀ, v = [v₁, v₂]ᵀ
‖u‖ = length of vector u = √(u₁² + u₂²) ∈ ℝ
p = (signed) length of the projection of v onto u
uᵀv = p · ‖u‖ = u₁v₁ + u₂v₂
[Figure: vectors u and v in the plane, with p the projection of v onto u.]
Slide credit: Andrew Ng
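The identity uᵀv = p · ‖u‖ is easy to verify numerically (the vectors here are arbitrary illustrations):

```python
import numpy as np

u = np.array([4.0, 3.0])
v = np.array([2.0, 1.0])

norm_u = np.sqrt(np.sum(u ** 2))  # ||u|| = 5
p = (u @ v) / norm_u              # signed length of the projection of v onto u
print(u @ v, p * norm_u)          # both 11.0
```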
SVM decision boundary
min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
     θᵀx^(i) ≤ −1 if y^(i) = 0
Simplification: θ₀ = 0, n = 2

What's θᵀx^(i)?
θᵀx^(i) = p^(i) · ‖θ‖, where p^(i) is the signed projection of x^(i) onto θ
(1/2) Σ_{j=1}^n θ_j² = (1/2)(θ₁² + θ₂²) = (1/2)(√(θ₁² + θ₂²))² = (1/2)‖θ‖²
[Figure: θ = (θ₁, θ₂) and x^(i) = (x₁^(i), x₂^(i)) in the plane; p^(i) is the projection of x^(i) onto θ.]
Slide credit: Andrew Ng
SVM decision boundary
min_θ (1/2)‖θ‖²
s.t. p^(i) · ‖θ‖ ≥ 1 if y^(i) = 1
     p^(i) · ‖θ‖ ≤ −1 if y^(i) = 0
Simplification: θ₀ = 0, n = 2

p^(1), p^(2) small → ‖θ‖ must be large
p^(1), p^(2) large → ‖θ‖ can be small
[Figure: two candidate boundaries; the small-margin one gives small projections p^(i) and forces ‖θ‖ large, the large-margin one allows ‖θ‖ small.]
Slide credit: Andrew Ng
Rewrite the formulation
Let θ_j = θ'_j / γ, with ‖θ'‖ = 1. Start from:
min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
     θᵀx^(i) ≤ −1 if y^(i) = 0

Objective: min_θ (1/2) Σ_{j=1}^n θ_j² = min_{θ',γ} (1/2) Σ_{j=1}^n θ'_j² / γ² = min_{θ',γ} 1/(2γ²) = max_γ γ
Constraints: θᵀx^(i) ≥ 1 if y^(i) = 1 → θ'ᵀx^(i) ≥ γ if y^(i) = 1
             θᵀx^(i) ≤ −1 if y^(i) = 0 → θ'ᵀx^(i) ≤ −γ if y^(i) = 0

Equivalent problem:
max_{θ',γ} γ
s.t. ‖θ'‖ = 1
     θ'ᵀx^(i) ≥ γ if y^(i) = 1
     θ'ᵀx^(i) ≤ −γ if y^(i) = 0
[Figure: the margin γ on either side of the boundary in the (x₁, x₂) plane.]
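The change of variables implies that any θ feasible for the first problem achieves geometric margin γ = 1/‖θ‖ in the second. A small numeric check with an illustrative θ and an active-constraint point:

```python
import numpy as np

theta = np.array([0.5, 0.5])
gamma = 1.0 / np.linalg.norm(theta)  # margin achieved after rescaling: sqrt(2)
theta_prime = theta * gamma          # theta' = theta / ||theta||, unit norm

x_pos = np.array([1.0, 1.0])         # a positive point with theta^T x = 1 (tight constraint)
print(theta @ x_pos)                 # 1.0
print(theta_prime @ x_pos, gamma)    # both sqrt(2): theta'^T x >= gamma holds with equality
```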
Data not linearly separable?
Hard-margin formulation:
min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
     θᵀx^(i) ≤ −1 if y^(i) = 0
This is infeasible when no hyperplane separates the classes. One fix: pay a penalty per misclassified point:
min_θ (1/2) Σ_{j=1}^n θ_j² + C · (#misclassifications)
→ NP-hard
[Figure: overlapping classes in the (x₁, x₂) plane that no line separates.]
Convex relaxation
min_θ (1/2) Σ_{j=1}^n θ_j² + C · (#misclassifications) → NP-hard

Relax the misclassification count into slack variables ξ^(i):
min_θ (1/2) Σ_{j=1}^n θ_j² + C Σ_i ξ^(i)
s.t. θᵀx^(i) ≥ 1 − ξ^(i) if y^(i) = 1
     θᵀx^(i) ≤ −1 + ξ^(i) if y^(i) = 0
     ξ^(i) ≥ 0 ∀i
ξ^(i): slack variables
Hinge loss
Image credit: https://math.stackexchange.com/questions/782586/how-do-you-minimize-hinge-loss
Hard-margin SVM formulation:
min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
     θᵀx^(i) ≤ −1 if y^(i) = 0

Soft-margin SVM formulation:
min_θ (1/2) Σ_{j=1}^n θ_j² + C Σ_i ξ^(i)
s.t. θᵀx^(i) ≥ 1 − ξ^(i) if y^(i) = 1
     θᵀx^(i) ≤ −1 + ξ^(i) if y^(i) = 0
     ξ^(i) ≥ 0 ∀i
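At the optimum each slack equals ξ^(i) = max(0, 1 − s^(i)·θᵀx^(i)) with s^(i) = +1 for y^(i) = 1 and −1 for y^(i) = 0, so the soft-margin problem can be minimized directly by subgradient descent on the unconstrained hinge objective. A minimal sketch on toy data (a simplification for illustration, not the solver a package like libsvm uses; the function name, step size, and data are made up):

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=500):
    # Minimize (1/2)||theta||^2 + C * sum_i max(0, 1 - s_i * theta^T x_i),
    # the unconstrained form of the slack-variable problem above,
    # where s_i = +1 if y_i = 1 and s_i = -1 if y_i = 0.
    s = np.where(y == 1, 1.0, -1.0)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = s * (X @ theta)
        active = margins < 1  # examples inside the margin (xi_i > 0)
        # subgradient: theta from the regularizer, -C * s_i * x_i per active example
        grad = theta - C * (s[active, None] * X[active]).sum(axis=0)
        theta -= lr * grad
    return theta

# Toy separable data (no intercept; theta_0 = 0 as in the slides)
X = np.array([[2.0, 2.0], [1.0, 2.0], [-2.0, -2.0], [-1.0, -2.0]])
y = np.array([1, 1, 0, 0])
theta = train_soft_margin_svm(X, y)
preds = (X @ theta >= 0).astype(int)
print(preds)  # [1 1 0 0]
```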
Support Vector Machine
• Cost function
• Large margin classification
• Kernels
• Using an SVM
Non-linear classification
• How do we separate the two classes using a hyperplane?
h_θ(x) = 1 if θᵀx ≥ 0; 0 if θᵀx < 0
[Figure: one-dimensional data x where the classes interleave, then the same data mapped to (x, x²), where a line separates them.]
Kernel
• K(·,·) is a legal definition of an inner product if:
  ∃ φ: X → ℝ^N s.t. K(x, z) = φ(x)ᵀφ(z)
Why do kernels matter?
• Many algorithms interact with data only via dot products
• Replace xᵀz with K(x, z) = φ(x)ᵀφ(z)
• The algorithm implicitly acts as if the data were in the higher-dimensional φ-space
Example
K(x, z) = (xᵀz)² corresponds to
(x₁, x₂) → φ(x) = (x₁², x₂², √2·x₁x₂)

φ(x)ᵀφ(z) = (x₁², x₂², √2·x₁x₂)(z₁², z₂², √2·z₁z₂)ᵀ
= x₁²z₁² + x₂²z₂² + 2x₁x₂z₁z₂
= (x₁z₁ + x₂z₂)²
= (xᵀz)² = K(x, z)
Slide credit: Maria-Florina Balcan
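The algebra above can be checked numerically: the kernel evaluated in the 2-D input space matches the dot product in the explicit 3-D feature space (the test points are arbitrary):

```python
import numpy as np

def K(x, z):
    return (x @ z) ** 2  # kernel evaluated directly in input space

def phi(x):
    # explicit feature map the kernel corresponds to
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
print(K(x, z), phi(x) @ phi(z))  # both 25.0
```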
Example kernels
• Linear kernel: K(x, z) = xᵀz
• Gaussian (radial basis function) kernel: K(x, z) = exp(−(1/2)(x − z)ᵀΣ⁻¹(x − z))
• Sigmoid kernel: K(x, z) = tanh(a · xᵀz + b)
Constructing new kernels
• Positive scaling: K(x, z) = cK₁(x, z), c > 0
• Exponentiation: K(x, z) = exp(K₁(x, z))
• Addition: K(x, z) = K₁(x, z) + K₂(x, z)
• Multiplication with a function: K(x, z) = f(x) K₁(x, z) f(z)
• Multiplication: K(x, z) = K₁(x, z) K₂(x, z)
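A necessary condition for a valid kernel is that its Gram matrix on any finite point set is positive semidefinite; the closure rules above preserve this. A quick numerical spot-check (random points and the specific kernel combinations are illustrative, and a PSD Gram on one sample is evidence, not proof):

```python
import numpy as np

def is_psd_gram(kernel, X, tol=1e-8):
    # Build the Gram matrix G_ij = K(x_i, x_j) and check its eigenvalues.
    G = np.array([[kernel(x, z) for z in X] for x in X])
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

k1 = lambda x, z: x @ z                       # linear kernel
k_sum = lambda x, z: k1(x, z) + (x @ z) ** 2  # addition of two kernels
k_prod = lambda x, z: k1(x, z) * k1(x, z)     # product of two kernels
k_exp = lambda x, z: np.exp(k1(x, z))         # exponentiation

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
print([is_psd_gram(k, X) for k in (k1, k_sum, k_prod, k_exp)])  # [True, True, True, True]
```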
Non-linear decision boundary
Predict y = 1 if
θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁x₂ + θ₄x₁² + θ₅x₂² + ⋯ ≥ 0
i.e., θ₀ + θ₁f₁ + θ₂f₂ + θ₃f₃ + ⋯ ≥ 0, with
f₁ = x₁, f₂ = x₂, f₃ = x₁x₂, ⋯
Is there a different/better choice of the features f₁, f₂, f₃, ⋯?
[Figure: a curved decision boundary in the (x₁, x₂) plane.]
Slide credit: Andrew Ng
Kernel
Given x, compute new features depending on proximity to landmarks l^(1), l^(2), l^(3):
f₁ = similarity(x, l^(1))
f₂ = similarity(x, l^(2))
f₃ = similarity(x, l^(3))

Gaussian kernel:
similarity(x, l^(i)) = exp(−‖x − l^(i)‖² / (2σ²))
[Figure: three landmarks l^(1), l^(2), l^(3) in the (x₁, x₂) plane.]
Slide credit: Andrew Ng
Predict y = 1 if θ₀ + θ₁f₁ + θ₂f₂ + θ₃f₃ ≥ 0
Ex: θ₀ = −0.5, θ₁ = 1, θ₂ = 1, θ₃ = 0
f₁ = similarity(x, l^(1))
f₂ = similarity(x, l^(2))
f₃ = similarity(x, l^(3))
[Figure: points near l^(1) or l^(2) are predicted y = 1; points far from both are predicted y = 0.]
Slide credit: Andrew Ng
Choosing the landmarks
• Given x:
  f_i = similarity(x, l^(i)) = exp(−‖x − l^(i)‖² / (2σ²))
• Predict y = 1 if θ₀ + θ₁f₁ + θ₂f₂ + θ₃f₃ ≥ 0
• Where to get l^(1), l^(2), l^(3), ⋯?
[Figure: place a landmark l^(1), l^(2), ⋯, l^(m) at every training example in the (x₁, x₂) plane.]
Slide credit: Andrew Ng
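Computing the Gaussian similarity feature for each landmark is one line of numpy (the landmarks and query point here are illustrative):

```python
import numpy as np

def gaussian_features(x, landmarks, sigma=1.0):
    # f_i = exp(-||x - l_i||^2 / (2 sigma^2)), one feature per landmark
    d2 = np.sum((landmarks - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

landmarks = np.array([[0.0, 0.0], [3.0, 3.0]])
f = gaussian_features(np.array([0.0, 0.0]), landmarks)
print(f)  # [1.0, exp(-9)]: 1 at the landmark itself, near 0 far away
```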
SVM with kernels
• Given (x^(1), y^(1)), (x^(2), y^(2)), ⋯, (x^(m), y^(m))
• Choose l^(1) = x^(1), l^(2) = x^(2), l^(3) = x^(3), ⋯, l^(m) = x^(m)
• Given example x:
  • f₁ = similarity(x, l^(1))
  • f₂ = similarity(x, l^(2))
  • ⋯
• For training example (x^(i), y^(i)): x^(i) → f^(i)
f = [f₀, f₁, f₂, ⋯, f_m]ᵀ (with intercept feature f₀ = 1)
Slide credit: Andrew Ng
SVM with kernels
• Hypothesis: Given x, compute features f ∈ ℝ^{m+1}
• Predict y = 1 if θᵀf ≥ 0
• Training (original):
min_θ C Σ_{i=1}^m [ y^(i) cost₁(θᵀx^(i)) + (1 − y^(i)) cost₀(θᵀx^(i)) ] + (1/2) Σ_{j=1}^n θ_j²
• Training (with kernel):
min_θ C Σ_{i=1}^m [ y^(i) cost₁(θᵀf^(i)) + (1 − y^(i)) cost₀(θᵀf^(i)) ] + (1/2) Σ_{j=1}^m θ_j²
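Building the kernelized design matrix (one row f^(i) per example, with the training points as landmarks) can be sketched as follows; the function name, σ, and toy data are illustrative:

```python
import numpy as np

def kernel_features(X_train, X_query, sigma=1.0):
    # Row i: f^(i) = [1, sim(x_i, l_1), ..., sim(x_i, l_m)], with
    # landmarks l_j = x_j taken from the training set, as on the slide.
    d2 = np.sum((X_query[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    F = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(X_query), 1)), F])  # prepend intercept f_0 = 1

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
F = kernel_features(X_train, X_train)
print(F.shape)  # (3, 4): m = 3 examples, m + 1 features each
```

Each training example has similarity 1 to its own landmark, so the landmark block of F has ones on its diagonal.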
Support vector machines (Primal/Dual)
• Primal form:
min_{θ,ξ^(i)} (1/2) Σ_{j=1}^n θ_j² + C Σ_i ξ^(i)
s.t. θᵀx^(i) ≥ 1 − ξ^(i) if y^(i) = 1
     θᵀx^(i) ≤ −1 + ξ^(i) if y^(i) = 0
     ξ^(i) ≥ 0 ∀i

• Lagrangian dual form:
min_α (1/2) Σ_i Σ_j y^(i)y^(j)α^(i)α^(j) x^(i)ᵀx^(j) − Σ_i α^(i)
s.t. 0 ≤ α^(i) ≤ C ∀i
     Σ_i y^(i)α^(i) = 0
SVM (Lagrangian dual)
min_α (1/2) Σ_i Σ_j y^(i)y^(j)α^(i)α^(j) x^(i)ᵀx^(j) − Σ_i α^(i)
s.t. 0 ≤ α^(i) ≤ C ∀i
     Σ_i y^(i)α^(i) = 0
• Classifier: θ = Σ_i α^(i) y^(i) x^(i)
• The points x^(i) for which α^(i) ≠ 0 → support vectors
• Kernel trick: replace x^(i)ᵀx^(j) with K(x^(i), x^(j))
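A tiny worked instance of the dual makes the classifier formula concrete. Note the formula θ = Σ α^(i) y^(i) x^(i) assumes labels recoded to ±1 (the slides elsewhere use y ∈ {0, 1}); the two-point data set is illustrative, and the one-dimensional grid search stands in for a real QP solver:

```python
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])  # labels recoded to +/-1

# The equality constraint sum_i y_i alpha_i = 0 forces alpha_1 = alpha_2 = a,
# so (with x_1^T x_1 = x_2^T x_2 = 2, x_1^T x_2 = -2) the dual objective
# reduces to 4a^2 - 2a, minimized at a = 1/4.
a_grid = np.linspace(0.0, 1.0, 10001)
obj = 4 * a_grid ** 2 - 2 * a_grid
alpha = np.full(2, a_grid[np.argmin(obj)])

theta = (alpha * y) @ X  # theta = sum_i alpha_i y_i x_i = (0.5, 0.5)
print(theta, X @ theta)  # margins theta^T x = [1, -1]: both points are support vectors
```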
SVM parameters
• C (= 1/λ)
  • Large C: lower bias, higher variance
  • Small C: higher bias, lower variance
• σ²
  • Large σ²: features f_i vary more smoothly → higher bias, lower variance
  • Small σ²: features f_i vary less smoothly → lower bias, higher variance
Slide credit: Andrew Ng
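The effect of σ² on feature smoothness is easy to see directly (the points and σ² values are illustrative):

```python
import numpy as np

def similarity(x, l, sigma2):
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma2))

l = np.array([0.0, 0.0])
x_near = np.array([1.0, 0.0])
x_far = np.array([3.0, 0.0])

for sigma2 in (0.5, 10.0):
    print(sigma2, similarity(x_near, l, sigma2), similarity(x_far, l, sigma2))
# Small sigma^2: the feature drops off sharply with distance (less smooth).
# Large sigma^2: it decays slowly, so near and far points get similar features.
```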
SVM Demo
• https://cs.stanford.edu/people/karpathy/svmjs/demo/
SVM song
Video source:
• https://www.youtube.com/watch?v=g15bqtyidZs
Support Vector Machine
•Cost function
• Large margin classification
•Kernels
•Using an SVM
Using SVM
• Use an SVM software package (e.g., liblinear, libsvm) to solve for θ
• Need to specify:
  • Choice of parameter C
  • Choice of kernel (similarity function):
    • Linear kernel: predict y = 1 if θᵀx ≥ 0
    • Gaussian kernel: f_i = exp(−‖x − l^(i)‖² / (2σ²)), where l^(i) = x^(i)
      • Need to choose σ². Need proper feature scaling.
Slide credit: Andrew Ng
Kernel (similarity) functions
• Note: not all similarity functions make valid kernels.
• Many off-the-shelf kernels available:
  • Polynomial kernel
  • String kernel
  • Chi-square kernel
  • Histogram intersection kernel
Slide credit: Andrew Ng
Multi-class classification
• Use the one-vs.-all method: train K SVMs, one to distinguish y = i from the rest, getting θ^(1), θ^(2), ⋯, θ^(K)
• Pick the class i with the largest (θ^(i))ᵀx
[Figure: three classes in the (x₁, x₂) plane, each with its own one-vs.-all boundary.]
Slide credit: Andrew Ng
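One-vs.-all prediction is a single argmax over the K linear scores; the per-class parameters below are hypothetical stand-ins for trained SVMs:

```python
import numpy as np

def predict_one_vs_all(Theta, x):
    # Theta row k holds theta^(k) for class k; return the class
    # whose linear score (theta^(k))^T x is largest.
    return int(np.argmax(Theta @ x))

# Hypothetical parameters for a 3-class problem with 2 features
Theta = np.array([[ 1.0,  0.0],
                  [-1.0,  1.0],
                  [ 0.0, -1.0]])
print(predict_one_vs_all(Theta, np.array([2.0, 0.5])))  # 0
```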
Logistic regression vs. SVMs
• n = number of features (x ∈ ℝ^{n+1}), m = number of training examples
1. If n is large (relative to m), e.g., n = 10,000, m = 10-1,000:
   → Use logistic regression or SVM without a kernel ("linear kernel")
2. If n is small and m is intermediate, e.g., n = 1-1,000, m = 10-10,000:
   → Use SVM with Gaussian kernel
3. If n is small and m is large, e.g., n = 1-1,000, m = 50,000+:
   → Create/add more features, then use logistic regression or linear SVM
A neural network is likely to work well in most of these cases, but is slower to train.
Slide credit: Andrew Ng
Things to remember
• Cost function:
min_θ C Σ_{i=1}^m [ y^(i) cost₁(θᵀf^(i)) + (1 − y^(i)) cost₀(θᵀf^(i)) ] + (1/2) Σ_{j=1}^m θ_j²
• Large margin classification
• Kernels
• Using an SVM
[Figure: the max-margin boundary and its margin in the (x₁, x₂) plane.]