A Tutorial on Support Vector Machine
Tan Yee Fan
School of ComputingNational University of Singapore
Contents
Linear Classifier
Theory on Support Vector Machine
Using Support Vector Machine
Comparison with Other Classifiers
Conclusion
Contents
Linear Classifier
  Classifier
  Linear Classifier
  Properties
  Transforming Non-numeric Attributes
Theory on Support Vector Machine
Using Support Vector Machine
Comparison with Other Classifiers
Conclusion
Classifier
What is a classifier?
◮ A function that maps instances to classes.
◮ An instance is usually expressed as a vector of n attributes.
Example: Fit-And-Trim club
◮ Attributes: Gender, Weight, Height
◮ Class (Member): Yes, No
◮ Is Garvin a member of the Fit-And-Trim club?

  Person   Gender   Weight   Height   Member
  Garvin   Male     78       179      ?
Classifier
Training a classifier:
◮ Given: Training data (a set of instances whose classes are known).
◮ Output: A function, selected from a predefined set of functions.
Example: Fit-And-Trim club
◮ Training instances:

  Person    Gender   Weight   Height   Member
  Alex      Male     79       189      Yes
  Betty     Female   76       170      Yes
  Charlie   Male     77       155      Yes
  Daisy     Female   72       163      No
  Eric      Male     73       195      No
  Fiona     Female   70       182      No
◮ Possible classification rule: Weight more than 74.5?
Classifier

Training a classifier:
◮ Aim to minimize both training error (errors on seen instances) and generalization error (errors on unseen instances).
◮ A classifier that has low training error but high generalization error is said to overfit the training data.

Example: Fit-And-Trim club
◮ Suppose we use the person's name as an attribute, and the trained classifier uses the following classification rules:
  ◮ Alex → Yes
  ◮ Betty → Yes
  ◮ Charlie → Yes
  ◮ Daisy → No
  ◮ Eric → No
  ◮ Fiona → No
◮ This classifier severely overfits: it achieves 100% accuracy on the training data, but is unable to classify any unseen test instance.
Linear Classifier
◮ Attributes are real numbers, so each instance x = (x1, x2, ..., xn)^T ∈ R^n is an n-dimensional vector.
◮ Classes are +1 (positive class) and −1 (negative class).
◮ Classification rule:

  y = sign[f(x)] = +1 if f(x) ≥ 0, −1 if f(x) < 0

  where the decision function f(·) is

  f(x) = w1x1 + w2x2 + ... + wnxn + b = w^T x + b

  for some weight vector w = (w1, w2, ..., wn)^T ∈ R^n and bias b ∈ R.
◮ Training a linear classifier means tuning w and b.
Linear Classifier
Example:
◮ A course is graded based on two written tests.
◮ Weightage: Test 1 – 30%, Test 2 – 70%.
◮ Students pass if the total weighted score is at least 50%.
Formulation:
◮ x1 = Test 1 score, x2 = Test 2 score.
◮ To pass, we need:
0.3x1 + 0.7x2 ≥ 50
◮ Decision function of linear classifier:
f(x) = 0.3x1 + 0.7x2 − 50
◮ Positive class = pass, negative class = fail.
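The grading example above can be written out directly; a minimal sketch of this linear classifier, with the weights and bias taken from the example:

```python
def f(x1, x2):
    # Decision function from the example: 30% Test 1, 70% Test 2, pass mark 50.
    return 0.3 * x1 + 0.7 * x2 - 50

def classify(x1, x2):
    # +1 (pass) if f(x) >= 0, -1 (fail) otherwise.
    return +1 if f(x1, x2) >= 0 else -1

print(classify(60, 50))  # 0.3*60 + 0.7*50 - 50 = 3, so +1 (pass)
print(classify(40, 40))  # 0.3*40 + 0.7*40 - 50 = -10, so -1 (fail)
```

A student scoring exactly 50 on both tests lands on the hyperplane f(x) = 0 and is classified as passing.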
Linear Classifier

f(x) = w^T x + b = w1x1 + w2x2 + ... + wnxn + b

◮ f(x) = 0 is a hyperplane in the n-dimensional real space R^n.
◮ Training: Find a hyperplane that separates positive and negative instances.

Exercise 1
Show that w is orthogonal to the hyperplane.
Properties
Given a linear classifier with:
f(x) = w1x1 + w2x2 + . . .+ wnxn + b
The following are equivalent classifiers:
◮ f(x) = (aw1)x1 + (aw2)x2 + ... + (awn)xn + (ab) for any constant a > 0.
◮ f(x) = w1x1 + ... + w′i(axi) + ... + wnxn + b for any constant a > 0, where w′i = wi/a.
◮ f(x) = w1x1 + ... + wi(xi + a) + ... + wnxn + b′ for any constant a, where b′ = b − awi.
In other words, it is possible to scale the whole problem, and to scale and shift individual attributes.
Properties
Using training data, we can normalize each attribute xi so that:
◮ 0 ≤ xi ≤ 1, or
◮ xi has mean 0 and standard deviation 1.
Normalization makes training easier by avoiding numerical difficulties.

For a linear classifier with normalized attributes:

f(x) = w1x1 + w2x2 + ... + wnxn + b

A larger |wi| or wi² means xi is more important or relevant.
◮ Can be used for attribute ranking or selection.
◮ If xi is missing, can be used to decide whether to acquire it.
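Both normalizations can be sketched in a few lines of NumPy; the numbers below reuse the Weight and Height columns from the Fit-And-Trim training data:

```python
import numpy as np

# Weight and Height columns from the Fit-And-Trim training data.
X = np.array([[79., 189.], [76., 170.], [77., 155.],
              [72., 163.], [73., 195.], [70., 182.]])

# Min-max normalization: each attribute rescaled into [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: each attribute shifted and scaled to mean 0, standard deviation 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

Note that the normalization constants must be computed from the training data only, and then reused unchanged on test instances.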
Transforming Non-numeric Attributes
If attribute values are ordered, use the ordering.
◮ Consider attribute values {Small,Medium, Large}.
◮ Map Small to 1.
◮ Map Medium to 2.
◮ Map Large to 3.
If attribute values have no ordering information, create one numeric attribute whose values are {0, 1} for each discrete attribute value.
◮ Consider attribute values {Red,Green,Blue}.
◮ Create three attributes xRed, xGreen, xBlue.
◮ Map Red to xRed = 1, xGreen = 0, xBlue = 0.
◮ Map Green to xRed = 0, xGreen = 1, xBlue = 0.
◮ Map Blue to xRed = 0, xGreen = 0, xBlue = 1.
If attribute values are finite sets, the above method also applies, with one {0, 1} attribute per possible element.
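As a sketch, both mappings can be written as small helper functions; the value lists are the ones from the examples above:

```python
# Ordered attribute: map each value to its rank in the ordering.
SIZE_ORDER = {"Small": 1, "Medium": 2, "Large": 3}

def encode_ordered(value):
    return SIZE_ORDER[value]

# Unordered attribute: one {0, 1} attribute per possible value (one-hot encoding).
COLOURS = ("Red", "Green", "Blue")

def encode_one_hot(value):
    return [1 if v == value else 0 for v in COLOURS]

print(encode_ordered("Medium"))  # 2
print(encode_one_hot("Green"))   # [0, 1, 0]
```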
Contents
Linear Classifier
Theory on Support Vector Machine
  Support Vector Machine
  Lagrangian Theory
  Formulation
  Soft Margin
  Kernel
Using Support Vector Machine
Comparison with Other Classifiers
Conclusion
Support Vector Machine
[Figure: two separating hyperplanes, one with a narrow margin ρ and one with a wide margin ρ]
Wide margin gives better generalization performance.
Support Vector Machine
◮ The support vector machine (SVM) is a maximum margin linear classifier.
◮ Training data: D = {(x1, y1), (x2, y2), ..., (xN, yN)}.
◮ Assume D is linearly separable, i.e., we can separate positive and negative instances exactly using a hyperplane.
◮ SVM requires:
  ◮ Training instance xi with class yi = +1: f(xi) = w^T xi + b ≥ +1.
  ◮ Training instance xi with class yi = −1: f(xi) = w^T xi + b ≤ −1.
◮ Combined, we have yi f(xi) = yi(w^T xi + b) ≥ 1.
◮ Support vector: A training instance xi with f(xi) = w^T xi + b = ±1.
Support Vector Machine
[Figure: separating hyperplane f(x) = 0 with margin boundaries f(x) = +1 and f(x) = −1; the support vectors lie on the margin boundaries, and the margin width is ρ = 2/‖w‖2]

Test instance x: Greater |f(x)| ⇒ Greater classification confidence.

Maximize ρ ⇒ Minimize ‖w‖2 ⇒ Minimize w^T w.

(‖w‖2 = √(w1² + w2² + ... + wn²) is the length or 2-norm of the vector w.)

Exercise 2
Show that ρ = 2/‖w‖2.
Support Vector Machine
Primal problem:

  Minimize (1/2) w^T w
  Subject to yi(w^T xi + b) ≥ 1

The primal problem is a convex optimization problem, i.e., any local minimum is also a global minimum.

But it is more convenient to solve the dual problem instead.
Lagrangian Theory
Let f(u1, u2, ..., uk) be a function of variables u1, u2, ..., uk.

Partial derivatives:
◮ The partial derivative ∂f/∂ui means differentiate w.r.t. ui while holding all other uj's as constants.
◮ Example: f(x, y) = sin x + xy², ∂f/∂x = cos x + y², ∂f/∂y = 2xy.
◮ Stationary point when ∂f/∂ui = 0 for all i.

Partial derivatives w.r.t. vectors:
◮ Let u = [u1, u2, ..., un]^T.
◮ ∇f = ∂f/∂u = [∂f/∂u1, ∂f/∂u2, ..., ∂f/∂un]^T.
◮ Stationary point when ∇f = ∂f/∂u = 0.

Exercise 3
Verify that ∂/∂u (a^T u) = ∂/∂u (u^T a) = a and ∂/∂u (u^T u) = 2u.
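The two identities in Exercise 3 can also be checked numerically with central differences; a small sketch:

```python
import numpy as np

def num_grad(f, u, eps=1e-6):
    # Central-difference estimate of the gradient of f at u.
    g = np.zeros_like(u)
    for i in range(len(u)):
        e = np.zeros_like(u)
        e[i] = eps
        g[i] = (f(u + e) - f(u - e)) / (2 * eps)
    return g

a = np.array([1.0, -2.0, 3.0])
u = np.array([0.5, 1.5, -1.0])

# d/du (a^T u) = a    and    d/du (u^T u) = 2u
assert np.allclose(num_grad(lambda v: a @ v, u), a)
assert np.allclose(num_grad(lambda v: v @ v, u), 2 * u)
```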
Lagrangian Theory

Primal problem:

  Minimize f(u)
  Subject to g1(u) ≤ 0, g2(u) ≤ 0, ..., gm(u) ≤ 0
             h1(u) = 0, h2(u) = 0, ..., hn(u) = 0

Lagrangian function:

  L(u, α, β) = f(u) + Σ_{i=1}^{m} αi gi(u) + Σ_{i=1}^{n} βi hi(u)

The αi's and βi's are Lagrange multipliers.

The optimal solution must satisfy the Karush-Kuhn-Tucker (KKT) conditions:

  ∂L/∂u = 0    gi(u) ≤ 0    hi(u) = 0
  αi gi(u) = 0    αi ≥ 0
Lagrangian Theory

KKT conditions:

  ∂L/∂u = 0    gi(u) ≤ 0    hi(u) = 0
  αi gi(u) = 0    αi ≥ 0

Dual problem:

  Maximize f̃(α, β) = inf_u L(u, α, β)
  Subject to KKT conditions

(inf means infimum or greatest lower bound)

Relation between primal and dual objective functions:

  f(u) ≥ f̃(α, β)

Under certain conditions (satisfied by SVM), equality occurs at the optimal solution.
Formulation

Primal problem:

  Minimize (1/2) w^T w
  Subject to 1 − yi(w^T xi + b) ≤ 0

Lagrangian:

  L(w, b, α) = (1/2) w^T w + Σ_{i=1}^{N} αi (1 − yi(w^T xi + b))
             = (1/2) w^T w + Σ_{i=1}^{N} αi − Σ_{i=1}^{N} αi yi w^T xi − b Σ_{i=1}^{N} αi yi

KKT conditions:

  ∂L/∂w = w − Σ_{i=1}^{N} αi yi xi = 0    ∂L/∂b = −Σ_{i=1}^{N} αi yi = 0
  1 − yi(w^T xi + b) ≤ 0    αi (1 − yi(w^T xi + b)) = 0    αi ≥ 0
Formulation

Since w = Σ_{i=1}^{N} αi yi xi, we have:

  w^T w = Σ_{i=1}^{N} αi yi w^T xi = Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj xi^T xj

Note also Σ_{i=1}^{N} αi yi = 0.

  L(w, b, α) = (1/2) w^T w + Σ_{i=1}^{N} αi − Σ_{i=1}^{N} αi yi w^T xi − b Σ_{i=1}^{N} αi yi
             = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj xi^T xj
Formulation

Dual problem:

  Maximize Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj xi^T xj
  Subject to αi ≥ 0
             Σ_{i=1}^{N} αi yi = 0

The dual problem is also a convex optimization problem.

Many software packages are available and tailored to solve this form of optimization problem efficiently.
Formulation

Solving w and b:
◮ w = Σ_{i=1}^{N} αi yi xi.
◮ Choose a support vector xi, then b = yi − w^T xi.

Decision function:
◮ f(x) = Σ_{i=1}^{N} αi yi xi^T x + b.

Support vectors:
◮ KKT condition: αi (1 − yi(w^T xi + b)) = 0.
◮ αi > 0 ⇔ yi(w^T xi + b) = 1 ⇔ xi is a support vector.

Usually there are relatively few support vectors, and many αi's will vanish.
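To make the formulation concrete, here is a small sketch that solves the hard margin dual on a toy separable dataset using SciPy's general-purpose SLSQP solver (not a dedicated SVM package), then recovers w and b as described above; the data points are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: three positive, three negative instances.
X = np.array([[2., 2.], [3., 3.], [3., 1.],
              [0., 0.], [1., 0.], [0., 1.]])
y = np.array([1., 1., 1., -1., -1., -1.])
N = len(y)

# G_ij = y_i y_j x_i^T x_j
G = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # Negated dual objective (to minimize): (1/2) a^T G a - sum_i a_i
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

# Recover w = sum_i alpha_i y_i x_i, and b from a support vector (largest alpha_i).
w = (alpha * y) @ X
sv = np.argmax(alpha)
b = y[sv] - w @ X[sv]

# Every training instance satisfies y_i (w^T x_i + b) >= 1 (up to solver tolerance).
margins = y * (X @ w + b)
```

For this dataset the optimum is w = (2/3, 2/3) and b = −5/3, with support vectors lying exactly on the margin boundaries.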
Soft Margin
[Figure: a narrow-margin separating hyperplane, and a dataset that is not linearly separable]

Solution: Allow training instances inside the margin or on the other side of the separating hyperplane (misclassified).
Soft Margin
Soft margin SVM

[Figure: a soft margin separating hyperplane; instances inside the margin or misclassified have slack εi]

Allow a training instance (xi, yi) to have yi(w^T xi + b) < 1, i.e., yi(w^T xi + b) = 1 − εi for some εi > 0, but penalize it with a constant factor C > 0.
Soft Margin

Primal problem:

  Minimize (1/2) w^T w + C Σ_{i=1}^{N} εi
  Subject to yi(w^T xi + b) ≥ 1 − εi
             εi ≥ 0

Dual problem:

  Maximize Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj xi^T xj
  Subject to 0 ≤ αi ≤ C
             Σ_{i=1}^{N} αi yi = 0
Exercise 4
Derive the dual problem for the soft margin SVM.
Soft Margin
Recall:
◮ 0 ≤ αi ≤ C.
◮ xi is support vector if αi > 0.
Two types of support vectors:
◮ xi is free support vector if αi < C.
◮ xi is bounded support vector if αi = C.
Exercise 5
What are the characteristics of free support vectors and bounded support vectors? (Hint: Consider the KKT conditions.)
Soft Margin

It is possible to weigh each training instance xi with λi > 0 to indicate its importance.

Primal problem:

  Minimize (1/2) w^T w + C Σ_{i=1}^{N} λi εi
  Subject to yi(w^T xi + b) ≥ 1 − εi
             εi ≥ 0
Cost-sensitive SVM:
◮ Different costs for misclassifying positive and negativeinstances.
◮ Set λi proportional to misclassification cost of xi.
Exercise 6
Derive the dual problem for this SVM.
Kernel
◮ A nonlinear feature map Φ : R^n → R^m takes instances from attribute space to feature space.
◮ Training instances become linearly separable in feature space.

[Figure: one-dimensional instances mapped into two dimensions by Φ(x) = (x, x²)^T, making them linearly separable]

Cover's theorem: Instances are more likely to be linearly separable in a high dimension space.
Kernel

Dual problem:

  Maximize Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj Φ(xi)^T Φ(xj)
  Subject to 0 ≤ αi ≤ C
             Σ_{i=1}^{N} αi yi = 0

Decision function:

  f(x) = Σ_{i=1}^{N} αi yi Φ(xi)^T Φ(x) + b

Observations:
◮ Computationally expensive or difficult to compute Φ(x).
◮ But often Φ(x)^T Φ(x′) has a simple expression.
Kernel

Dual problem:

  Maximize Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj K(xi, xj)
  Subject to 0 ≤ αi ≤ C
             Σ_{i=1}^{N} αi yi = 0

Decision function:

  f(x) = Σ_{i=1}^{N} αi yi K(xi, x) + b

The kernel is the function K(x, x′) = Φ(x)^T Φ(x′).

Exercise 7
Express w^T w and b in terms of the kernel. What is the margin of the kernelized SVM?
Kernel

◮ The kernel K(·, ·) can take any expression that satisfies Mercer's condition, which ensures K(x, x′) = Φ(x)^T Φ(x′) for some feature map Φ(·):
  ◮ Mercer's condition: The kernel matrix K formed by Kij = K(xi, xj) is positive semidefinite, i.e., the eigenvalues of K are all nonnegative, or v^T K v ≥ 0 for all vectors v.
◮ No need to compute or even know the explicit form of Φ(·).
◮ Examples:
  ◮ Linear kernel: K(x, x′) = x^T x′.
  ◮ Polynomial kernel: K(x, x′) = (x^T x′ + k)^d.
  ◮ Radial basis function (RBF) kernel: K(x, x′) = exp(−‖x − x′‖² / (2σ²)).
◮ Suggestion: Try the linear kernel first, then the RBF kernel.
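Mercer's condition can be checked numerically for these three kernels: build the kernel matrix on some instances and confirm that its eigenvalues are (up to floating point rounding) nonnegative. A sketch, assuming NumPy and using random instances for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # 10 random instances with 3 attributes

def linear_kernel(X):
    return X @ X.T

def poly_kernel(X, k=1.0, d=3):
    return (X @ X.T + k) ** d

def rbf_kernel(X, sigma2=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # ||x_i - x_j||^2
    return np.exp(-sq / (2 * sigma2))

for K in (linear_kernel(X), poly_kernel(X), rbf_kernel(X)):
    # Positive semidefinite: smallest eigenvalue nonnegative, up to rounding error.
    assert np.linalg.eigvalsh(K).min() > -1e-8
```

Note the minus sign in the RBF kernel: without it, K(x, x′) grows with distance and the matrix is not positive semidefinite.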
Contents
Linear Classifier
Theory on Support Vector Machine
Using Support Vector Machine
  Parameter Tuning
  Estimating Posterior Probability
  Handling Multiple Classes
  Applications
Comparison with Other Classifiers
Conclusion
Parameter Tuning

Dual problem:

  Maximize Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj K(xi, xj)
  Subject to 0 ≤ αi ≤ C
             Σ_{i=1}^{N} αi yi = 0

RBF kernel:

  K(x, x′) = exp(−‖x − x′‖² / (2σ²))

◮ Hyperparameters include the C of the SVM formulation and kernel parameters, e.g., the σ² of the RBF kernel.
◮ Selecting good hyperparameters is important, as they have a large impact on SVM performance.
Parameter Tuning
Traditional way of parameter tuning:
◮ Use a training set and a validation set.
◮ Do a grid search on the parameters. E.g., when using the RBF kernel, for each C ∈ {2⁻¹⁰, 2⁻⁸, ..., 2⁰, ..., 2⁸, 2¹⁰} and σ² ∈ {2⁻¹⁰, 2⁻⁸, ..., 2⁰, ..., 2⁸, 2¹⁰}:
  ◮ Train on the training set.
  ◮ Test on the validation set.
  Choose the C and σ² that give the best performance on the validation set.
◮ Extensions of this idea: k-fold cross-validation and leave-one-out cross-validation.

Some SVM implementations have integrated (and more optimized) parameter tuning; others require the user to perform their own tuning before training.
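The grid search loop itself is simple to sketch. The `validation_accuracy` function below is a hypothetical stand-in with a made-up peak; a real version would train an SVM with the given (C, σ²) on the training set and return its accuracy on the validation set:

```python
import itertools

# Hypothetical stand-in for "train on the training set, test on the validation set".
# This toy scoring surface peaks at C = 4 and sigma2 = 0.25 purely for illustration.
def validation_accuracy(C, sigma2):
    return 1.0 / (1.0 + abs(C - 4.0) + abs(sigma2 - 0.25))

grid = [2.0 ** e for e in range(-10, 12, 2)]  # {2^-10, 2^-8, ..., 2^8, 2^10}

best_score, best_params = -1.0, None
for C, sigma2 in itertools.product(grid, grid):
    score = validation_accuracy(C, sigma2)
    if score > best_score:
        best_score, best_params = score, (C, sigma2)
```

The exponential spacing of the grid reflects the slide's {2⁻¹⁰, 2⁻⁸, ..., 2¹⁰} suggestion: the useful range of C and σ² spans orders of magnitude.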
Estimating Posterior Probability
The probability of test instance x belonging to a class can be estimated by:

  P(y = 1 | x) = 1 / (1 + exp(A f(x) + B))

A and B are parameters to be determined from the training data.
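A and B can be fitted by minimizing the negative log-likelihood of the training labels under this sigmoid, which is the idea behind Platt scaling; a minimal gradient-descent sketch, using made-up decision values for illustration:

```python
import numpy as np

# Synthetic decision values f(x_i) and labels (1 = positive class, 0 = negative class).
f_vals = np.array([-2.0, -1.2, -0.5, 0.3, 1.1, 2.4])
labels = np.array([0., 0., 0., 1., 1., 1.])

def prob(f, A, B):
    # P(y = 1 | x) = 1 / (1 + exp(A f(x) + B))
    return 1.0 / (1.0 + np.exp(A * f + B))

# Fit A and B by gradient descent on the negative log-likelihood of the labels.
A, B, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    p = prob(f_vals, A, B)
    # Gradient of the negative log-likelihood w.r.t. A and B is (labels - p) * df/dparam.
    A -= lr * np.sum((labels - p) * f_vals)
    B -= lr * np.sum(labels - p)
```

For a well-fitted model A comes out negative, so that larger decision values f(x) map to probabilities closer to 1.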
Handling Multiple Classes
◮ For k > 2 classes, decompose into multiple two-class SVMs.
◮ Pairwise SVMs:
  ◮ Train k(k − 1)/2 SVMs, one for each pair of classes.
  ◮ For a test instance, each SVM casts a vote on a class.
  ◮ Classify the test instance as the class that gets the most votes.
◮ One-against-all SVMs:
  ◮ Train k SVMs, where the ith SVM considers class i as positive and all other classes as negative.
  ◮ Classify the test instance to class i if the ith SVM gives the greatest f(·) value.
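The pairwise voting scheme can be sketched with a stand-in for the pairwise decisions; in practice each decision would come from a trained two-class SVM rather than the hard-coded rule used here:

```python
from itertools import combinations
from collections import Counter

classes = ["A", "B", "C", "D"]  # k = 4 classes -> k(k-1)/2 = 6 pairwise SVMs

# Hypothetical stand-in for a trained pairwise SVM: given the pair (ci, cj) and a
# test instance x, return the class it votes for. Here "C" beats every other class.
def pairwise_vote(ci, cj, x=None):
    return "C" if "C" in (ci, cj) else min(ci, cj)

def classify(x):
    votes = Counter(pairwise_vote(ci, cj, x) for ci, cj in combinations(classes, 2))
    return votes.most_common(1)[0][0]

print(classify(None))  # "C" wins with 3 of the 6 pairwise votes
```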
Applications
Applications:
◮ Text processing:
  ◮ Each text document is an instance.
  ◮ Each word is an attribute.
  ◮ Attribute values are word counts.
◮ Image processing:
  ◮ Each image is an instance.
  ◮ Each pixel is an attribute.
  ◮ Attribute values are pixel values.

Cover's theorem: Instances are more likely to be linearly separable in a high dimension space.

These applications already have many attributes, and each instance is a sparse vector, so a linear SVM often performs well.
Applications
SVM decision function:

  f(x) = w1x1 + w2x2 + w3x3 + w4x4 + b

The 2-norm of the weight vector is ‖w‖2 = √(w1² + w2² + w3² + w4²).

Test instance with missing attribute values: x = (x1, ?, x3, ?).

Substitute with constants attributewise: x = (x1, x′2, x3, x′4).

The decision function becomes:

  f(x) = w1x1 + w2x′2 + w3x3 + w4x′4 + b = w1x1 + w3x3 + b′

The new weight vector is w′(x) = (w1, w3)^T, and its 2-norm decreased to ‖w′(x)‖2 = √(w1² + w3²).

We can use f(x), ‖w‖2 and ‖w′(x)‖2 to estimate the probability of misclassification, and hence decide whether to acquire the missing attribute values.
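The norm computation can be sketched numerically; the weights and bias below are illustrative, not taken from any trained model:

```python
import numpy as np

# Illustrative trained weights w = (w1, w2, w3, w4) and bias b.
w = np.array([0.8, -0.4, 0.3, 0.6])
b = -0.2

# Full 2-norm ||w||_2 over all four attributes.
full_norm = np.linalg.norm(w)

# Attributes 2 and 4 are missing: after substituting constants, only w1 and w3 remain.
present = [0, 2]
reduced_norm = np.linalg.norm(w[present])

# Dropping attributes can only remove nonnegative terms from the sum of squares,
# so the reduced norm never exceeds the full norm.
assert reduced_norm <= full_norm
```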
Contents
Linear Classifier
Theory on Support Vector Machine
Using Support Vector Machine
Comparison with Other Classifiers
  Decision Tree
  K-nearest Neighbours
  Naïve Bayes
  Neural Network
  Combination of Classifiers
Conclusion
Decision Tree
Advantages of SVM:
◮ SVM can handle linear relationships between attributes, but a typical decision tree splits on only one attribute at a time.
◮ Typically better on datasets with many attributes.
◮ Handles continuous values well.
Advantages of decision tree:
◮ Decision tree is better at handling nested if-then-else types of rules, which SVM is not good at.
◮ Typically better on datasets with fewer attributes.
◮ Handles discrete values well.
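The first point above can be made concrete with a toy dataset (invented here) whose class is a linear concept, x1 + x2 > 1: one linear rule classifies everything, while any single-attribute threshold (the kind of test a decision tree node uses) makes mistakes:

```python
points = [(0.2, 0.9), (0.9, 0.3), (0.1, 0.2), (0.6, 0.7), (0.4, 0.3)]
labels = [1 if x1 + x2 > 1.0 else -1 for x1, x2 in points]

# The linear rule w = (1, 1), b = -1 classifies every instance correctly.
linear_correct = all(
    ((x1 + x2 - 1.0) > 0) == (y == 1) for (x1, x2), y in zip(points, labels)
)

def best_split_accuracy(attr):
    """Best accuracy achievable by thresholding a single attribute."""
    best = 0
    for t in [p[attr] for p in points]:
        for sign in (1, -1):   # try both orientations of the split
            acc = sum(
                (sign * (p[attr] - t) > 0) == (y == 1)
                for p, y in zip(points, labels)
            )
            best = max(best, acc)
    return best / len(points)

print(linear_correct, best_split_accuracy(0), best_split_accuracy(1))
```

A decision tree can still fit such data, but only by stacking several axis-aligned splits to approximate one oblique boundary.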
K-nearest Neighbours
Advantages of SVM:
◮ Faster classification compared to K-nearest neighbours.
◮ Smooth separating hyperplane.
◮ Less susceptible to noise.
Advantages of K-nearest neighbours:
◮ Training is instantaneous – nothing to do.
◮ Local variations are considered.
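A minimal K-nearest-neighbours classifier (the data below is invented) makes the trade-off concrete: "training" merely stores the data, but every classification must scan all of it, which is why SVM classifies faster:

```python
import math

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # All the work happens at test time: sort training points by distance.
    by_dist = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = [label for _, label in by_dist[:k]]
    return max(set(votes), key=votes.count)

train = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1),
         ((3, 3), 1), ((3, 4), 1), ((4, 3), 1)]
print(knn_predict(train, (0.5, 0.5)))   # near the negative cluster
print(knn_predict(train, (3.5, 3.5)))   # near the positive cluster
```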
Naïve Bayes
◮ Naïve Bayes often has the dubious honour of being placed among the last in evaluations.
◮ Naïve Bayes is fast and easy to implement, and hence is often treated as a baseline.
◮ One key handicap of naïve Bayes is its independence assumption.
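The independence assumption means the class-conditional probability of an instance factorizes over its attributes, P(x|c) = ∏ P(xi|c). A toy sketch (all probabilities below are invented):

```python
# Priors P(c) and per-attribute conditionals P(x_i | c) for two classes.
prior = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": [{"free": 0.8, "meeting": 0.2}, {"win": 0.7, "agenda": 0.3}],
    "ham":  [{"free": 0.1, "meeting": 0.9}, {"win": 0.2, "agenda": 0.8}],
}

def naive_bayes_score(x, c):
    """P(c) * prod_i P(x_i | c): the independence assumption in one line."""
    score = prior[c]
    for i, value in enumerate(x):
        score *= cond[c][i][value]
    return score

x = ("free", "win")
pred = max(prior, key=lambda c: naive_bayes_score(x, c))
print(pred)   # "spam": 0.4*0.8*0.7 = 0.224 beats 0.6*0.1*0.2 = 0.012
```

The factorization is what makes training so fast, and also the handicap: correlations between attributes are simply ignored.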
Neural Network
◮ A linear SVM can be seen as a neural network with no hidden layers.
◮ But the training algorithm is different.
  ◮ SVM design is driven by sound theory, while neural network design is driven by applications.
  ◮ The backpropagation algorithm for training a multi-layer neural network does not find the maximal margin separating hyperplane.
◮ Neural networks tend to overfit more than SVM.
  ◮ Neural networks have many local minima, while SVM has only one global minimum.
◮ Multi-layer neural networks require specifying the number of hidden layers and the number of nodes at each layer, but SVM does not need these.
◮ What do the learned weights in a trained multi-layer neural network mean?
Combination of Classifiers
Possible to combine classifiers:
◮ SVM with decision tree:
  ◮ Instead of splitting based on only one attribute, split using a linear classifier trained by SVM.
◮ SVM with K-nearest neighbours:
  ◮ Use K-nearest neighbours to classify test instances inside the margin.
◮ SVM with naïve Bayes:
  ◮ Train a naïve Bayes classifier, which produces conditional probabilities P(xi|c).
  ◮ Modify every (training and testing) instance x by multiplying each attribute xi by P(xi|c).
  ◮ Train and test on the modified instances.
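The attribute-scaling step of the SVM-with-naïve-Bayes combination can be sketched as follows; the attribute values and conditionals are invented, and c is taken to be the instance's (known or predicted) class:

```python
def scale_instance(x, cond_probs):
    """Weight each attribute x_i by its naive Bayes conditional P(x_i | c)."""
    return [xi * p for xi, p in zip(x, cond_probs)]

# One instance and the conditionals P(x_i | c) for its class c
# (all numbers are made up for illustration).
x = [2.0, 0.0, 1.0, 4.0]
p = [0.9, 0.5, 0.1, 0.5]
print(scale_instance(x, p))   # [1.8, 0.0, 0.1, 2.0]
```

The effect is to shrink attributes that naïve Bayes considers unlikely for the class, before the SVM is trained on the rescaled vectors.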
Contents
Linear Classifier
Theory on Support Vector Machine
Using Support Vector Machine
Comparison with Other Classifiers
Conclusion
  Summary
Summary
◮ SVM is a maximum margin linear classifier.
◮ Often good for classifying test instances in high-dimensional space.
◮ Lagrangian theory – primal and dual problems.
◮ When training instances are not linearly separable:
  ◮ Use a soft margin.
  ◮ Use a kernel.
◮ Usage tips:
  ◮ Transform non-numeric attributes.
  ◮ Normalize attributes.
  ◮ Tune parameters when training.
◮ Comparison with other classifiers.
Bibliography
◮ Introduction to SVM, with applications: [Burges, 1998], [Osuna et al., 1997].
◮ Practical guide on using SVM effectively: [Hsu et al., 2003].
◮ LIBSVM manual, including weighted instances: [Chang and Lin, 2001].
◮ Tutorial on Lagrangian theory: [Burges, 2003], [Klien, 2004].
◮ SVM with posterior probabilities: [Platt, 2000].
◮ Attribute selection using SVM: [Guyon et al., 2002].
◮ Handling multiple classes: [Hsu and Lin, 2002], [Tibshirani and Hastie, 2007].
◮ SVM combined with. . .
  ◮ Decision tree: [Bennett and Blue, 1998], [Tibshirani and Hastie, 2007].
  ◮ K-nearest neighbours: [Li et al., 2007].
  ◮ Naïve Bayes: [Chiu and Huang, 2007].
References I
Bennett, K. P. and Blue, J. A. (1998). A support vector machine approach to decision trees. In IEEE World Congress on Computational Intelligence, pages 2396–2401.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167.
Burges, C. J. C. (2003). Some notes on applied mathematics for machine learning. In Advanced Lectures on Machine Learning, pages 21–40.
Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
Chiu, C.-Y. and Huang, Y.-T. (2007). Integration of support vector machine with naïve Bayesian classifier for spam classification. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 618–622.
Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422.
Hsu, C.-W., Chang, C.-C., and Lin, C.-J. (2003). A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University.
Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425.
References II
Klien, D. (2004). Lagrange multipliers without permanent scarring. Available at http://www.cs.berkeley.edu/~klein/papers/lagrange-multipliers.pdf.
Li, R., Wang, H.-N., He, H., Cui, Y.-M., and Du, Z.-L. (2007). Support vector machine combined with k-nearest neighbors for solar flare forecasting. Chinese Journal of Astronomy and Astrophysics, 7(3):441–447.
Osuna, E. E., Freund, R., and Girosi, F. (1997). Support vector machines: Training and applications. Technical Report AIM-1602, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Platt, J. (2000). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74.
Tibshirani, R. and Hastie, T. (2007). Margin trees for high-dimensional classification. Journal of Machine Learning Research (JMLR), 8:637–652.