Support Vector Machine II
Jia-Bin Huang
Virginia Tech, Spring 2019, ECE-5424G / CS-5824
Administrative
• HW 1 due tonight
• HW 2 released.
• Seminar: Online Scalable Learning Adaptive to Unknown Dynamics and Graphs
• Yanning Shen, University of Minnesota, Twin Cities
• Tuesday, February 19, 9:00am - 10:15am, 1110 Knowledgeworks II
• Seminar: Declarative Learning-based Programming for Learning and Reasoning over Spatial Language
• Parisa Kordjamshidi, Assistant Professor, Tulane University
• Thursday, February 21, 9:00am - 10:15am, 110 McBryde
Logistic regression (logistic loss)
min_θ (1/m) Σ_{i=1}^m [ y^(i) (−log h_θ(x^(i))) + (1 − y^(i)) (−log(1 − h_θ(x^(i)))) ] + (λ/2m) Σ_{j=1}^n θ_j²

Support vector machine (hinge loss)
min_θ (1/m) Σ_{i=1}^m [ y^(i) cost₁(θᵀx^(i)) + (1 − y^(i)) cost₀(θᵀx^(i)) ] + (λ/2m) Σ_{j=1}^n θ_j²

[Figure: cost₁(θᵀx) (for y = 1) and cost₀(θᵀx) (for y = 0) plotted against θᵀx, next to the logistic curves −log(1/(1 + e^{−θᵀx})) and −log(1 − 1/(1 + e^{−θᵀx})); the hinge costs are piecewise-linear approximations of the logistic costs.]
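The two per-example losses above can be compared numerically. A minimal numpy sketch; cost₁ and cost₀ are written here as the standard hinge pieces max(0, 1 − z) and max(0, 1 + z), one common choice for the piecewise-linear costs the slide plots:

```python
import numpy as np

def logistic_losses(z):
    # per-example logistic losses as a function of z = theta^T x
    h = 1.0 / (1.0 + np.exp(-z))          # sigmoid h_theta(x)
    return -np.log(h), -np.log(1.0 - h)   # (loss if y = 1, loss if y = 0)

def hinge_costs(z):
    # SVM surrogate costs: flat-then-linear approximations of the above
    cost1 = np.maximum(0.0, 1.0 - z)      # used when y = 1
    cost0 = np.maximum(0.0, 1.0 + z)      # used when y = 0
    return cost1, cost0

z = np.array([-2.0, 0.0, 2.0])
c1, c0 = hinge_costs(z)
print(c1, c0)              # [3. 1. 0.] [0. 1. 3.]
print(logistic_losses(z))  # smooth analogues; never exactly zero
```

Note that the hinge costs reach exactly zero once θᵀx is past ±1, which is what drives the margin behavior in the next slides.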
Support Vector Machine
• Cost function
• Large margin classification
• Kernels
• Using an SVM
Support vector machine
min_θ C Σ_{i=1}^m [ y^(i) cost₁(θᵀx^(i)) + (1 − y^(i)) cost₀(θᵀx^(i)) ] + (1/2) Σ_{j=1}^n θ_j²

If y = 1, we want θᵀx ≥ 1 (not just ≥ 0)
If y = 0, we want θᵀx ≤ −1 (not just < 0)

[Figure: cost₁(θᵀx) is zero for θᵀx ≥ 1; cost₀(θᵀx) is zero for θᵀx ≤ −1.]
Slide credit: Andrew Ng
SVM decision boundary
min_θ C Σ_{i=1}^m [ y^(i) cost₁(θᵀx^(i)) + (1 − y^(i)) cost₀(θᵀx^(i)) ] + (1/2) Σ_{j=1}^n θ_j²

• Let's say we have a very large C...
• Whenever y^(i) = 1: we need θᵀx^(i) ≥ 1
• Whenever y^(i) = 0: we need θᵀx^(i) ≤ −1

With the cost terms forced to zero, the problem becomes:
min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
     θᵀx^(i) ≤ −1 if y^(i) = 0
SVM decision boundary: Linearly separable case
[Figure: linearly separable data in the (x₁, x₂) plane; of the many separating lines, the SVM picks the one with the largest margin.]
Slide credit: Andrew Ng
Why large margin classifiers?
min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
     θᵀx^(i) ≤ −1 if y^(i) = 0
[Figure: the max-margin boundary in the (x₁, x₂) plane with its margin marked.]
Vector inner product
u = [u₁, u₂]ᵀ, v = [v₁, v₂]ᵀ
‖u‖ = length of vector u = √(u₁² + u₂²) ∈ ℝ
p = (signed) length of the projection of v onto u
uᵀv = p · ‖u‖ = u₁v₁ + u₂v₂
[Figure: vectors u and v in the plane, with p the projection of v onto u.]
Slide credit: Andrew Ng
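The identity uᵀv = p · ‖u‖ is easy to verify numerically (the vectors here are arbitrary illustrations):

```python
import numpy as np

u = np.array([4.0, 3.0])
v = np.array([2.0, 1.0])

norm_u = np.sqrt(np.sum(u ** 2))  # ||u|| = 5
p = (u @ v) / norm_u              # signed length of the projection of v onto u
print(u @ v, p * norm_u)          # both 11.0
```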
SVM decision boundary
min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
     θᵀx^(i) ≤ −1 if y^(i) = 0
Simplification: θ₀ = 0, n = 2

What's θᵀx^(i)?
θᵀx^(i) = p^(i) · ‖θ‖, where p^(i) is the signed projection of x^(i) onto θ
(1/2) Σ_{j=1}^n θ_j² = (1/2)(θ₁² + θ₂²) = (1/2)(√(θ₁² + θ₂²))² = (1/2)‖θ‖²
[Figure: θ = (θ₁, θ₂) and x^(i) = (x₁^(i), x₂^(i)) in the plane; p^(i) is the projection of x^(i) onto θ.]
Slide credit: Andrew Ng
SVM decision boundary
min_θ (1/2)‖θ‖²
s.t. p^(i) · ‖θ‖ ≥ 1 if y^(i) = 1
     p^(i) · ‖θ‖ ≤ −1 if y^(i) = 0
Simplification: θ₀ = 0, n = 2

p^(1), p^(2) small → ‖θ‖ must be large
p^(1), p^(2) large → ‖θ‖ can be small
[Figure: two candidate boundaries; the small-margin one gives small projections p^(i) and forces ‖θ‖ large, the large-margin one allows ‖θ‖ small.]
Slide credit: Andrew Ng
Rewrite the formulation
Let θ_j = θ'_j / γ, with ‖θ'‖ = 1. Start from:
min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
     θᵀx^(i) ≤ −1 if y^(i) = 0

Objective: min_θ (1/2) Σ_{j=1}^n θ_j² = min_{θ',γ} (1/2) Σ_{j=1}^n θ'_j² / γ² = min_{θ',γ} 1/(2γ²) = max_γ γ
Constraints: θᵀx^(i) ≥ 1 if y^(i) = 1 → θ'ᵀx^(i) ≥ γ if y^(i) = 1
             θᵀx^(i) ≤ −1 if y^(i) = 0 → θ'ᵀx^(i) ≤ −γ if y^(i) = 0

Equivalent problem:
max_{θ',γ} γ
s.t. ‖θ'‖ = 1
     θ'ᵀx^(i) ≥ γ if y^(i) = 1
     θ'ᵀx^(i) ≤ −γ if y^(i) = 0
[Figure: the margin γ on either side of the boundary in the (x₁, x₂) plane.]
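The change of variables implies that any θ feasible for the first problem achieves geometric margin γ = 1/‖θ‖ in the second. A small numeric check with an illustrative θ and an active-constraint point:

```python
import numpy as np

theta = np.array([0.5, 0.5])
gamma = 1.0 / np.linalg.norm(theta)  # margin achieved after rescaling: sqrt(2)
theta_prime = theta * gamma          # theta' = theta / ||theta||, unit norm

x_pos = np.array([1.0, 1.0])         # a positive point with theta^T x = 1 (tight constraint)
print(theta @ x_pos)                 # 1.0
print(theta_prime @ x_pos, gamma)    # both sqrt(2): theta'^T x >= gamma holds with equality
```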
Data not linearly separable?
Hard-margin formulation:
min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
     θᵀx^(i) ≤ −1 if y^(i) = 0
This is infeasible when no hyperplane separates the classes. One fix: pay a penalty per misclassified point:
min_θ (1/2) Σ_{j=1}^n θ_j² + C · (#misclassifications)
→ NP-hard
[Figure: overlapping classes in the (x₁, x₂) plane that no line separates.]
Convex relaxation
min_θ (1/2) Σ_{j=1}^n θ_j² + C · (#misclassifications) → NP-hard

Relax the misclassification count into slack variables ξ^(i):
min_θ (1/2) Σ_{j=1}^n θ_j² + C Σ_i ξ^(i)
s.t. θᵀx^(i) ≥ 1 − ξ^(i) if y^(i) = 1
     θᵀx^(i) ≤ −1 + ξ^(i) if y^(i) = 0
     ξ^(i) ≥ 0 ∀i
ξ^(i): slack variables
Hinge loss
Image credit: https://math.stackexchange.com/questions/782586/how-do-you-minimize-hinge-loss
Hard-margin SVM formulation:
min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θᵀx^(i) ≥ 1 if y^(i) = 1
     θᵀx^(i) ≤ −1 if y^(i) = 0

Soft-margin SVM formulation:
min_θ (1/2) Σ_{j=1}^n θ_j² + C Σ_i ξ^(i)
s.t. θᵀx^(i) ≥ 1 − ξ^(i) if y^(i) = 1
     θᵀx^(i) ≤ −1 + ξ^(i) if y^(i) = 0
     ξ^(i) ≥ 0 ∀i
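At the optimum each slack equals ξ^(i) = max(0, 1 − s^(i)·θᵀx^(i)) with s^(i) = +1 for y^(i) = 1 and −1 for y^(i) = 0, so the soft-margin problem can be minimized directly by subgradient descent on the unconstrained hinge objective. A minimal sketch on toy data (a simplification for illustration, not the solver a package like libsvm uses; the function name, step size, and data are made up):

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=500):
    # Minimize (1/2)||theta||^2 + C * sum_i max(0, 1 - s_i * theta^T x_i),
    # the unconstrained form of the slack-variable problem above,
    # where s_i = +1 if y_i = 1 and s_i = -1 if y_i = 0.
    s = np.where(y == 1, 1.0, -1.0)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = s * (X @ theta)
        active = margins < 1  # examples inside the margin (xi_i > 0)
        # subgradient: theta from the regularizer, -C * s_i * x_i per active example
        grad = theta - C * (s[active, None] * X[active]).sum(axis=0)
        theta -= lr * grad
    return theta

# Toy separable data (no intercept; theta_0 = 0 as in the slides)
X = np.array([[2.0, 2.0], [1.0, 2.0], [-2.0, -2.0], [-1.0, -2.0]])
y = np.array([1, 1, 0, 0])
theta = train_soft_margin_svm(X, y)
preds = (X @ theta >= 0).astype(int)
print(preds)  # [1 1 0 0]
```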
Support Vector Machine
• Cost function
• Large margin classification
• Kernels
• Using an SVM
Non-linear classification
• How do we separate the two classes using a hyperplane?
h_θ(x) = 1 if θᵀx ≥ 0; 0 if θᵀx < 0
[Figure: one-dimensional data x where the classes interleave, then the same data mapped to (x, x²), where a line separates them.]
Kernel
• K(·,·) is a legal definition of an inner product if:
  ∃ φ: X → ℝ^N s.t. K(x, z) = φ(x)ᵀφ(z)
Why do kernels matter?
• Many algorithms interact with data only via dot products
• Replace xᵀz with K(x, z) = φ(x)ᵀφ(z)
• The algorithm implicitly acts as if the data were in the higher-dimensional φ-space
Example
K(x, z) = (xᵀz)² corresponds to
(x₁, x₂) → φ(x) = (x₁², x₂², √2·x₁x₂)

φ(x)ᵀφ(z) = (x₁², x₂², √2·x₁x₂)(z₁², z₂², √2·z₁z₂)ᵀ
= x₁²z₁² + x₂²z₂² + 2x₁x₂z₁z₂
= (x₁z₁ + x₂z₂)²
= (xᵀz)² = K(x, z)
Slide credit: Maria-Florina Balcan
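The algebra above can be checked numerically: the kernel evaluated in the 2-D input space matches the dot product in the explicit 3-D feature space (the test points are arbitrary):

```python
import numpy as np

def K(x, z):
    return (x @ z) ** 2  # kernel evaluated directly in input space

def phi(x):
    # explicit feature map the kernel corresponds to
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
print(K(x, z), phi(x) @ phi(z))  # both 25.0
```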
Example kernels
• Linear kernel: K(x, z) = xᵀz
• Gaussian (radial basis function) kernel: K(x, z) = exp(−(1/2)(x − z)ᵀΣ⁻¹(x − z))
• Sigmoid kernel: K(x, z) = tanh(a · xᵀz + b)
Constructing new kernels
• Positive scaling: K(x, z) = cK₁(x, z), c > 0
• Exponentiation: K(x, z) = exp(K₁(x, z))
• Addition: K(x, z) = K₁(x, z) + K₂(x, z)
• Multiplication with a function: K(x, z) = f(x) K₁(x, z) f(z)
• Multiplication: K(x, z) = K₁(x, z) K₂(x, z)
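A necessary condition for a valid kernel is that its Gram matrix on any finite point set is positive semidefinite; the closure rules above preserve this. A quick numerical spot-check (random points and the specific kernel combinations are illustrative, and a PSD Gram on one sample is evidence, not proof):

```python
import numpy as np

def is_psd_gram(kernel, X, tol=1e-8):
    # Build the Gram matrix G_ij = K(x_i, x_j) and check its eigenvalues.
    G = np.array([[kernel(x, z) for z in X] for x in X])
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

k1 = lambda x, z: x @ z                       # linear kernel
k_sum = lambda x, z: k1(x, z) + (x @ z) ** 2  # addition of two kernels
k_prod = lambda x, z: k1(x, z) * k1(x, z)     # product of two kernels
k_exp = lambda x, z: np.exp(k1(x, z))         # exponentiation

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
print([is_psd_gram(k, X) for k in (k1, k_sum, k_prod, k_exp)])  # [True, True, True, True]
```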
Non-linear decision boundary
Predict y = 1 if
θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁x₂ + θ₄x₁² + θ₅x₂² + ⋯ ≥ 0
i.e., θ₀ + θ₁f₁ + θ₂f₂ + θ₃f₃ + ⋯ ≥ 0, with
f₁ = x₁, f₂ = x₂, f₃ = x₁x₂, ⋯
Is there a different/better choice of the features f₁, f₂, f₃, ⋯?
[Figure: a curved decision boundary in the (x₁, x₂) plane.]
Slide credit: Andrew Ng
Kernel
Given x, compute new features depending on proximity to landmarks l^(1), l^(2), l^(3):
f₁ = similarity(x, l^(1))
f₂ = similarity(x, l^(2))
f₃ = similarity(x, l^(3))

Gaussian kernel:
similarity(x, l^(i)) = exp(−‖x − l^(i)‖² / (2σ²))
[Figure: three landmarks l^(1), l^(2), l^(3) in the (x₁, x₂) plane.]
Slide credit: Andrew Ng
Predict y = 1 if θ₀ + θ₁f₁ + θ₂f₂ + θ₃f₃ ≥ 0
Ex: θ₀ = −0.5, θ₁ = 1, θ₂ = 1, θ₃ = 0
f₁ = similarity(x, l^(1))
f₂ = similarity(x, l^(2))
f₃ = similarity(x, l^(3))
[Figure: points near l^(1) or l^(2) are predicted y = 1; points far from both are predicted y = 0.]
Slide credit: Andrew Ng
Choosing the landmarks
• Given x:
  f_i = similarity(x, l^(i)) = exp(−‖x − l^(i)‖² / (2σ²))
• Predict y = 1 if θ₀ + θ₁f₁ + θ₂f₂ + θ₃f₃ ≥ 0
• Where to get l^(1), l^(2), l^(3), ⋯?
[Figure: place a landmark l^(1), l^(2), ⋯, l^(m) at every training example in the (x₁, x₂) plane.]
Slide credit: Andrew Ng
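Computing the Gaussian similarity feature for each landmark is one line of numpy (the landmarks and query point here are illustrative):

```python
import numpy as np

def gaussian_features(x, landmarks, sigma=1.0):
    # f_i = exp(-||x - l_i||^2 / (2 sigma^2)), one feature per landmark
    d2 = np.sum((landmarks - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

landmarks = np.array([[0.0, 0.0], [3.0, 3.0]])
f = gaussian_features(np.array([0.0, 0.0]), landmarks)
print(f)  # [1.0, exp(-9)]: 1 at the landmark itself, near 0 far away
```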
SVM with kernels
• Given (x^(1), y^(1)), (x^(2), y^(2)), ⋯, (x^(m), y^(m))
• Choose l^(1) = x^(1), l^(2) = x^(2), l^(3) = x^(3), ⋯, l^(m) = x^(m)
• Given example x:
  • f₁ = similarity(x, l^(1))
  • f₂ = similarity(x, l^(2))
  • ⋯
• For training example (x^(i), y^(i)): x^(i) → f^(i)
f = [f₀, f₁, f₂, ⋯, f_m]ᵀ (with intercept feature f₀ = 1)
Slide credit: Andrew Ng
SVM with kernels
• Hypothesis: Given x, compute features f ∈ ℝ^{m+1}
• Predict y = 1 if θᵀf ≥ 0
• Training (original):
min_θ C Σ_{i=1}^m [ y^(i) cost₁(θᵀx^(i)) + (1 − y^(i)) cost₀(θᵀx^(i)) ] + (1/2) Σ_{j=1}^n θ_j²
• Training (with kernel):
min_θ C Σ_{i=1}^m [ y^(i) cost₁(θᵀf^(i)) + (1 − y^(i)) cost₀(θᵀf^(i)) ] + (1/2) Σ_{j=1}^m θ_j²
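Building the kernelized design matrix (one row f^(i) per example, with the training points as landmarks) can be sketched as follows; the function name, σ, and toy data are illustrative:

```python
import numpy as np

def kernel_features(X_train, X_query, sigma=1.0):
    # Row i: f^(i) = [1, sim(x_i, l_1), ..., sim(x_i, l_m)], with
    # landmarks l_j = x_j taken from the training set, as on the slide.
    d2 = np.sum((X_query[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    F = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(X_query), 1)), F])  # prepend intercept f_0 = 1

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
F = kernel_features(X_train, X_train)
print(F.shape)  # (3, 4): m = 3 examples, m + 1 features each
```

Each training example has similarity 1 to its own landmark, so the landmark block of F has ones on its diagonal.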
Support vector machines (Primal/Dual)
• Primal form:
min_{θ,ξ^(i)} (1/2) Σ_{j=1}^n θ_j² + C Σ_i ξ^(i)
s.t. θᵀx^(i) ≥ 1 − ξ^(i) if y^(i) = 1
     θᵀx^(i) ≤ −1 + ξ^(i) if y^(i) = 0
     ξ^(i) ≥ 0 ∀i

• Lagrangian dual form:
min_α (1/2) Σ_i Σ_j y^(i)y^(j)α^(i)α^(j) x^(i)ᵀx^(j) − Σ_i α^(i)
s.t. 0 ≤ α^(i) ≤ C ∀i
     Σ_i y^(i)α^(i) = 0
SVM (Lagrangian dual)
min_α (1/2) Σ_i Σ_j y^(i)y^(j)α^(i)α^(j) x^(i)ᵀx^(j) − Σ_i α^(i)
s.t. 0 ≤ α^(i) ≤ C ∀i
     Σ_i y^(i)α^(i) = 0
• Classifier: θ = Σ_i α^(i) y^(i) x^(i)
• The points x^(i) for which α^(i) ≠ 0 → support vectors
• Kernel trick: replace x^(i)ᵀx^(j) with K(x^(i), x^(j))
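A tiny worked instance of the dual makes the classifier formula concrete. Note the formula θ = Σ α^(i) y^(i) x^(i) assumes labels recoded to ±1 (the slides elsewhere use y ∈ {0, 1}); the two-point data set is illustrative, and the one-dimensional grid search stands in for a real QP solver:

```python
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])  # labels recoded to +/-1

# The equality constraint sum_i y_i alpha_i = 0 forces alpha_1 = alpha_2 = a,
# so (with x_1^T x_1 = x_2^T x_2 = 2, x_1^T x_2 = -2) the dual objective
# reduces to 4a^2 - 2a, minimized at a = 1/4.
a_grid = np.linspace(0.0, 1.0, 10001)
obj = 4 * a_grid ** 2 - 2 * a_grid
alpha = np.full(2, a_grid[np.argmin(obj)])

theta = (alpha * y) @ X  # theta = sum_i alpha_i y_i x_i = (0.5, 0.5)
print(theta, X @ theta)  # margins theta^T x = [1, -1]: both points are support vectors
```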
SVM parameters
• C (= 1/λ)
  • Large C: lower bias, higher variance
  • Small C: higher bias, lower variance
• σ²
  • Large σ²: features f_i vary more smoothly → higher bias, lower variance
  • Small σ²: features f_i vary less smoothly → lower bias, higher variance
Slide credit: Andrew Ng
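The effect of σ² on feature smoothness is easy to see directly (the points and σ² values are illustrative):

```python
import numpy as np

def similarity(x, l, sigma2):
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma2))

l = np.array([0.0, 0.0])
x_near = np.array([1.0, 0.0])
x_far = np.array([3.0, 0.0])

for sigma2 in (0.5, 10.0):
    print(sigma2, similarity(x_near, l, sigma2), similarity(x_far, l, sigma2))
# Small sigma^2: the feature drops off sharply with distance (less smooth).
# Large sigma^2: it decays slowly, so near and far points get similar features.
```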
SVM Demo
• https://cs.stanford.edu/people/karpathy/svmjs/demo/
SVM song
Video source:
• https://www.youtube.com/watch?v=g15bqtyidZs
Support Vector Machine
•Cost function
• Large margin classification
•Kernels
•Using an SVM
Using SVM
• Use an SVM software package (e.g., liblinear, libsvm) to solve for θ
• Need to specify:
  • Choice of parameter C
  • Choice of kernel (similarity function):
    • Linear kernel: predict y = 1 if θᵀx ≥ 0
    • Gaussian kernel: f_i = exp(−‖x − l^(i)‖² / (2σ²)), where l^(i) = x^(i)
      • Need to choose σ². Need proper feature scaling.
Slide credit: Andrew Ng
Kernel (similarity) functions
• Note: not all similarity functions make valid kernels.
• Many off-the-shelf kernels available:
  • Polynomial kernel
  • String kernel
  • Chi-square kernel
  • Histogram intersection kernel
Slide credit: Andrew Ng
Multi-class classification
• Use the one-vs.-all method: train K SVMs, one to distinguish y = i from the rest, getting θ^(1), θ^(2), ⋯, θ^(K)
• Pick the class i with the largest (θ^(i))ᵀx
[Figure: three classes in the (x₁, x₂) plane, each with its own one-vs.-all boundary.]
Slide credit: Andrew Ng
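One-vs.-all prediction is a single argmax over the K linear scores; the per-class parameters below are hypothetical stand-ins for trained SVMs:

```python
import numpy as np

def predict_one_vs_all(Theta, x):
    # Theta row k holds theta^(k) for class k; return the class
    # whose linear score (theta^(k))^T x is largest.
    return int(np.argmax(Theta @ x))

# Hypothetical parameters for a 3-class problem with 2 features
Theta = np.array([[ 1.0,  0.0],
                  [-1.0,  1.0],
                  [ 0.0, -1.0]])
print(predict_one_vs_all(Theta, np.array([2.0, 0.5])))  # 0
```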
Logistic regression vs. SVMs
• n = number of features (x ∈ ℝ^{n+1}), m = number of training examples
1. If n is large (relative to m), e.g., n = 10,000, m = 10-1,000:
   → Use logistic regression or SVM without a kernel ("linear kernel")
2. If n is small and m is intermediate, e.g., n = 1-1,000, m = 10-10,000:
   → Use SVM with Gaussian kernel
3. If n is small and m is large, e.g., n = 1-1,000, m = 50,000+:
   → Create/add more features, then use logistic regression or linear SVM
A neural network is likely to work well in most of these cases, but is slower to train.
Slide credit: Andrew Ng
Things to remember
• Cost function:
min_θ C Σ_{i=1}^m [ y^(i) cost₁(θᵀf^(i)) + (1 − y^(i)) cost₀(θᵀf^(i)) ] + (1/2) Σ_{j=1}^m θ_j²
• Large margin classification
• Kernels
• Using an SVM
[Figure: the max-margin boundary and its margin in the (x₁, x₂) plane.]