Introduction to Support Vector Machines
Andreas Maletti
Technische Universität Dresden, Fakultät Informatik
June 15, 2006
1 The Problem
2 The Basics
3 The Proposed Solution
Learning by Machines
Learning
• Rote Learning: memorization (Hash tables)
• Reinforcement: feedback at end (Q [Watkins 89])
• Induction: generalizing examples (ID3 [Quinlan 79])
• Clustering: grouping data (CMLIB [Hartigan 75])
• Analogy: representation similarity (JUPA [Yvon 94])
• Discovery: unsupervised, no goal
• Genetic Alg.: simulated evolution (GABIL [DeJong 93])
Supervised Learning
Definition
Supervised Learning: given nontrivial training data (labels known), predict test data (labels unknown)
Implementations
• Clustering: Nearest Neighbor [Cover, Hart 67]
• Rote Learning: Hash tables
• Induction: Neural Networks [McCulloch, Pitts 43], Decision Trees [Hunt 66], SVMs [Vapnik et al 92]
Problem Description—General
Problem
Classify a given input
• binary classification: two classes
• multi-class classification: several, but finitely many classes
• regression: infinitely many classes
Major Applications
• Handwriting recognition
• Cheminformatics (Quantitative Structure-Activity Relationship)
• Pattern recognition
• Spam detection (HP Labs, Palo Alto)
Problem Description—Specific
Electricity Load Prediction Challenge 2001
• Power plant that supports energy demand of a region
• Excess production expensive
• Load varies substantially
• Challenge won by libSVM [Chang, Lin 06]
Problem
• given: load and temperature for 730 days (≈ 70 kB of data)
• predict: load for the next 365 days
Example Data
[Figure: Load 1997. Load (axis range 400 to 850) plotted against day of year (50 to 350), with curves for 12:00 and 24:00]
Problem Description—Formal
Definition (cf. [Lin 01])
Given a training set $S \subseteq \mathbb{R}^n \times \{-1, 1\}$ of correctly classified input data vectors $\vec{x} \in \mathbb{R}^n$, where:
• every input data vector appears at most once in $S$
• there exist input data vectors $\vec{p}$ and $\vec{n}$ such that $(\vec{p}, 1) \in S$ as well as $(\vec{n}, -1) \in S$ (non-trivial)
successfully classify unseen input data vectors.
Linear Classification [Vapnik 63]
• Given: A training set $S \subseteq \mathbb{R}^n \times \{-1, 1\}$
• Goal: Find a hyperplane that separates $\mathbb{R}^n$ into halves that each contain only elements of one class
Representation of Hyperplane
Definition
Hyperplane $\vec{n} \cdot (\vec{x} - \vec{x}_0) = 0$
• $\vec{n} \in \mathbb{R}^n$ weight vector
• $\vec{x} \in \mathbb{R}^n$ input vector
• $\vec{x}_0 \in \mathbb{R}^n$ offset
Alternatively: $\vec{w} \cdot \vec{x} + b = 0$
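The two representations describe the same hyperplane. Expanding the first form shows the correspondence, using only the symbols defined above:

```latex
\vec{n} \cdot (\vec{x} - \vec{x}_0) = 0
\;\Longleftrightarrow\;
\vec{n} \cdot \vec{x} - \vec{n} \cdot \vec{x}_0 = 0
\;\Longleftrightarrow\;
\vec{w} \cdot \vec{x} + b = 0
\quad \text{with } \vec{w} = \vec{n}, \; b = -\vec{n} \cdot \vec{x}_0 .
```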
Decision Function
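• training set $S = \{(\vec{x}_i, y_i) \mid 1 \le i \le k\}$
• separating hyperplane $\vec{w} \cdot \vec{x} + b = 0$ for $S$
Decision:
$$\vec{w} \cdot \vec{x}_i + b \; \begin{cases} > 0 & \text{if } y_i = 1 \\ < 0 & \text{if } y_i = -1 \end{cases}$$
$$\Rightarrow \; f(\vec{x}) = \operatorname{sgn}(\vec{w} \cdot \vec{x} + b)$$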
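A minimal Python sketch of this decision function may help make it concrete. The helper names `dot` and `decide`, the example hyperplane, and the tie-breaking rule for points lying exactly on the hyperplane are assumptions for illustration, not part of the slides.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def decide(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    # Assumption: points exactly on the hyperplane are assigned to class +1.
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical 2-D example: the hyperplane x1 + x2 - 1 = 0.
w, b = [1.0, 1.0], -1.0
labels = [decide(w, b, x) for x in ([2.0, 2.0], [0.0, 0.0])]
```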
Learn Hyperplane
Problem
• Given: training set $S$
• Goal: coefficients $\vec{w}$ and $b$ of a separating hyperplane
• Difficulty: several or no candidates for $\vec{w}$ and $b$
Solution [cf. Vapnik’s statistical learning theory]
Select admissible $\vec{w}$ and $b$ with maximal margin, where the margin is the minimal distance from the hyperplane to any input data vector
Observation
We can scale $\vec{w}$ and $b$ such that
$$\vec{w} \cdot \vec{x}_i + b \; \begin{cases} \ge 1 & \text{if } y_i = 1 \\ \le -1 & \text{if } y_i = -1 \end{cases}$$
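The scaling observation can be sketched in a few lines of Python: dividing a separating pair by the smallest value of y_i (w · x_i + b) makes the closest points satisfy the constraint with equality. The function name `canonical_scale` and the 1-D data are hypothetical, and the sketch assumes the given pair already separates the samples.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def canonical_scale(w, b, samples):
    """Rescale (w, b) so that min_i y_i (w.x_i + b) equals 1."""
    smallest = min(y * (dot(w, x) + b) for x, y in samples)
    # Assumption: smallest > 0, i.e. (w, b) already separates the samples.
    return [wi / smallest for wi in w], b / smallest


# Hypothetical 1-D training set: one negative point at 0, positives at 2 and 3.
samples = [([0.0], -1), ([2.0], 1), ([3.0], 1)]
scaled_w, scaled_b = canonical_scale([3.0], -3.0, samples)
```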
Maximizing the Margin
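• Closest points $\vec{x}_+$ and $\vec{x}_-$ (with $\vec{w} \cdot \vec{x}_\pm + b = \pm 1$)
• Distance between the hyperplanes $\vec{w} \cdot \vec{x} + b = \pm 1$:
$$\frac{(\vec{w} \cdot \vec{x}_+ + b) - (\vec{w} \cdot \vec{x}_- + b)}{\|\vec{w}\|} = \frac{2}{\|\vec{w}\|} = \frac{2}{\sqrt{\vec{w} \cdot \vec{w}}}$$
• $\max_{\vec{w}, b} \frac{2}{\sqrt{\vec{w} \cdot \vec{w}}} \;\equiv\; \min_{\vec{w}, b} \frac{\vec{w} \cdot \vec{w}}{2}$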
Basic (Primal) Support Vector Machine Form
target: $\min_{\vec{w}, b} \frac{1}{2} (\vec{w} \cdot \vec{w})$
subject to: $y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1 \quad (i = 1, \dots, k)$
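To illustrate why a smaller $\vec{w} \cdot \vec{w}$ is preferred, the following sketch compares two feasible pairs that describe the same hyperplane. The helper names and the 1-D data set are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def is_feasible(w, b, samples):
    """Check the constraint y_i (w.x_i + b) >= 1 for every sample."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in samples)


def margin_width(w):
    """Distance 2 / ||w|| between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))


# Hypothetical 1-D training set: one negative point at 0, positives at 2 and 3.
samples = [([0.0], -1), ([2.0], 1), ([3.0], 1)]
wide = ([1.0], -1.0)    # x - 1 = 0, constraints tight at x = 0 and x = 2
narrow = ([2.0], -2.0)  # same hyperplane, but larger ||w||
```

Both pairs satisfy the constraints, but only the first attains the full margin width of 2; the second reports a width of 1 because its norm is twice as large.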
Non-separable Data
Problem
Maybe a linear separating hyperplane does not exist!
Solution
Allow training errors $\xi_i$ penalized by large penalty parameter $C$
Standard (Primal) Support Vector Machine Form
target: $\min_{\vec{w}, b, \vec{\xi}} \frac{1}{2} (\vec{w} \cdot \vec{w}) + C \left( \sum_{i=1}^{k} \xi_i \right)$
subject to: $y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0 \quad (i = 1, \dots, k)$
If $\xi_i > 1$, then misclassification of $\vec{x}_i$
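For a fixed hyperplane, the smallest admissible training error of a point is max(0, 1 − y_i (w · x_i + b)). The sketch below computes these values and the resulting objective; the function names, the data, and the choice C = 10 are assumptions for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, samples):
    """Smallest xi_i >= 0 satisfying y_i (w.x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in samples]


def primal_objective(w, b, samples, C):
    """Value of (w.w)/2 + C * sum(xi_i) for the smallest admissible slacks."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, samples))


# Hypothetical 1-D data with one positive point on the wrong side of x = 1.
samples = [([0.0], -1), ([2.0], 1), ([0.5], 1)]
w, b, C = [1.0], -1.0, 10.0
xi = slacks(w, b, samples)
misclassified = [x for (x, _), s in zip(samples, xi) if s > 1]
```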
Higher Dimensional Feature Spaces
Problem
Data not separable because target function is essentially nonlinear!
Approach
Potentially separable in higher dimensional space
• Map input vectors nonlinearly into high dimensional space (feature space)
• Perform separation there
Higher Dimensional Feature Spaces
Literature
• Classic approach [Cover 65]
• “Kernel trick” [Boser, Guyon, Vapnik 92]
• Extension to soft margin [Cortes, Vapnik 95]
Example (cf. [Lin 01])
Mapping $\phi$ from $\mathbb{R}^3$ into feature space $\mathbb{R}^{10}$
$$\phi(\vec{x}) = (1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_3, x_1^2, x_2^2, x_3^2, \sqrt{2} x_1 x_2, \sqrt{2} x_1 x_3, \sqrt{2} x_2 x_3)$$
Adapted Standard Form
Definition
Standard (Primal) Support Vector Machine Form
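target: $\min_{\vec{w}, b, \vec{\xi}} \frac{1}{2} (\vec{w} \cdot \vec{w}) + C \left( \sum_{i=1}^{k} \xi_i \right)$
subject to: $y_i (\vec{w} \cdot \phi(\vec{x}_i) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0 \quad (i = 1, \dots, k)$
$\vec{w}$ is a vector in a high dimensional space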
How to Solve?
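Problem
Find $\vec{w}$ and $b$ from the standard SVM form
Solution
Solve via Lagrangian dual [Bazaraa et al 93]:
$$\max_{\vec{\alpha} \ge 0, \vec{\pi} \ge 0} \left( \min_{\vec{w}, b, \vec{\xi}} L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\pi}) \right)$$
where
$$L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\pi}) = \frac{\vec{w} \cdot \vec{w}}{2} + C \left( \sum_{i=1}^{k} \xi_i \right) + \sum_{i=1}^{k} \alpha_i \bigl(1 - \xi_i - y_i (\vec{w} \cdot \phi(\vec{x}_i) + b)\bigr) - \sum_{i=1}^{k} \pi_i \xi_i$$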
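The inner minimization can be carried out by setting the partial derivatives of $L$ to zero. The following short derivation is a standard step that the slides leave implicit; it uses only the symbols of the Lagrangian above and explains where the constraints of the simplified dual come from:

```latex
\begin{align*}
\frac{\partial L}{\partial \vec{w}} &= \vec{w} - \sum_{i=1}^{k} \alpha_i y_i \phi(\vec{x}_i) = 0
  &&\Rightarrow\quad \vec{w} = \sum_{i=1}^{k} \alpha_i y_i \phi(\vec{x}_i) \\
\frac{\partial L}{\partial b} &= -\sum_{i=1}^{k} \alpha_i y_i = 0
  &&\Rightarrow\quad \vec{y} \cdot \vec{\alpha} = 0 \\
\frac{\partial L}{\partial \xi_i} &= C - \alpha_i - \pi_i = 0
  &&\Rightarrow\quad 0 \le \alpha_i \le C \quad (\text{since } \alpha_i \ge 0,\ \pi_i \ge 0)
\end{align*}
```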
Simplifying the Dual [Chen et al 03]
Standard (Dual) Support Vector Machine Form
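target: $\min_{\vec{\alpha}} \frac{1}{2} (\vec{\alpha}^T Q \vec{\alpha}) - \sum_{i=1}^{k} \alpha_i$
subject to: $\vec{y} \cdot \vec{\alpha} = 0$ and $0 \le \alpha_i \le C \quad (i = 1, \dots, k)$
where: $Q_{ij} = y_i y_j \bigl(\phi(\vec{x}_i) \cdot \phi(\vec{x}_j)\bigr)$
Solution
We obtain $\vec{w}$ as
$$\vec{w} = \sum_{i=1}^{k} \alpha_i y_i \phi(\vec{x}_i)$$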
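As a small numerical illustration, the sketch below evaluates the dual target and recovers $\vec{w}$ for a two-point problem in which $\phi$ is taken to be the identity (an assumption made so that $\vec{w}$ can be written down explicitly). The function names, the data, and the multipliers are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, xs, ys, kernel):
    """Value of (alpha^T Q alpha)/2 - sum(alpha) with Q_ij = y_i y_j kernel(x_i, x_j)."""
    k = len(alpha)
    quad = sum(alpha[i] * alpha[j] * ys[i] * ys[j] * kernel(xs[i], xs[j])
               for i in range(k) for j in range(k))
    return 0.5 * quad - sum(alpha)


def recover_w(alpha, xs, ys):
    """w = sum_i alpha_i y_i x_i; valid here only because phi is the identity."""
    n = len(xs[0])
    return [sum(a * y * x[d] for a, y, x in zip(alpha, ys, xs)) for d in range(n)]


# Hypothetical data: a negative point at 0 and a positive point at 2.
xs, ys = [[0.0], [2.0]], [-1, 1]
alpha = [0.5, 0.5]  # satisfies y . alpha = 0
w = recover_w(alpha, xs, ys)
value = dual_objective(alpha, xs, ys, dot)
```

For these multipliers the recovered weight vector is [1.0], which is the maximal-margin hyperplane x − 1 = 0 for the two points, and the dual target evaluates to −0.5, the negative of the primal value (w · w)/2 = 0.5.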
Where is the Benefit?
• $\vec{\alpha} \in \mathbb{R}^k$ (its dimension is independent of the feature space)
• Only inner products in feature space
Kernel Trick
• Inner products efficiently calculated on input vectors via kernel $K$
$$K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i) \cdot \phi(\vec{x}_j)$$
• Select appropriate feature space
• Avoid computing the nonlinear transformation into feature space explicitly
• Benefit from better separation properties in feature space
Kernels
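Example
Mapping into feature space $\phi : \mathbb{R}^3 \to \mathbb{R}^{10}$
$$\phi(\vec{x}) = (1, \sqrt{2} x_1, \sqrt{2} x_2, \dots, \sqrt{2} x_2 x_3)$$
Kernel $K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i) \cdot \phi(\vec{x}_j) = (1 + \vec{x}_i \cdot \vec{x}_j)^2$.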
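The identity in this example can be checked numerically. The sketch below computes the inner product once explicitly in the 10-dimensional feature space and once through the kernel on the 3-dimensional inputs; the function names and the two test vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map from R^3 to R^10 as given in the example."""
    x1, x2, x3 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, r2 * x3,
            x1 * x1, x2 * x2, x3 * x3,
            r2 * x1 * x2, r2 * x1 * x3, r2 * x2 * x3]


def poly2_kernel(x, z):
    """Kernel (1 + x.z)^2, evaluated on the input vectors only."""
    return (1.0 + dot(x, z)) ** 2


x, z = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
explicit = dot(phi(x), phi(z))   # ten multiplications in feature space
implicit = poly2_kernel(x, z)    # three multiplications in input space
```

Both values agree up to floating-point rounding, which is the point of the kernel trick: the feature vector never has to be built.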
Popular Kernels
• Gaussian Radial Basis Function (feature space is an infinite dimensional Hilbert space):
$$K(\vec{x}_i, \vec{x}_j) = \exp(-\gamma \|\vec{x}_i - \vec{x}_j\|^2)$$
• Polynomial: $K(\vec{x}_i, \vec{x}_j) = (\vec{x}_i \cdot \vec{x}_j + 1)^d$
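Both popular kernels are short to write down in code. The following is a plain-Python sketch with hypothetical function names; `gamma` and `d` are the parameters γ and d from the formulas above.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian radial basis function kernel exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(x, z, d):
    """Polynomial kernel (x.z + 1)^d."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** d


same_point = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
far_apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
quadratic = polynomial_kernel([1.0, 2.0], [3.0, 4.0], d=2)
```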
The Decision Function
Observation
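• No need for $\vec{w}$ because
$$f(\vec{x}) = \operatorname{sgn}\bigl(\vec{w} \cdot \phi(\vec{x}) + b\bigr) = \operatorname{sgn}\left( \sum_{i=1}^{k} \alpha_i y_i \bigl(\phi(\vec{x}_i) \cdot \phi(\vec{x})\bigr) + b \right)$$
• Uses only $\vec{x}_i$ (support vectors) where $\alpha_i > 0$
Few points, the borderline points, determine the separation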
Support Vectors
Support Vector Machines
Definition
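• Given: Kernel $K$ and training set $S$
• Goal: decision function $f$
target: $\min_{\vec{\alpha}} \left( \frac{\vec{\alpha}^T Q \vec{\alpha}}{2} - \sum_{i=1}^{k} \alpha_i \right)$ where $Q_{ij} = y_i y_j K(\vec{x}_i, \vec{x}_j)$
subject to: $\vec{y} \cdot \vec{\alpha} = 0$ and $0 \le \alpha_i \le C \quad (i = 1, \dots, k)$
decide: $f(\vec{x}) = \operatorname{sgn}\left( \sum_{i=1}^{k} \alpha_i y_i K(\vec{x}_i, \vec{x}) + b \right)$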
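The decision rule can be sketched as follows. Only training points with a positive multiplier are kept, since the others contribute nothing to the sum. The function names, the multipliers, the offset b = −1, and the linear kernel are illustrative assumptions; in practice the multipliers and b come from solving the dual problem.

```python
def linear_kernel(x, z):
    """Linear kernel x.z, standing in for an arbitrary kernel K."""
    return sum(a * b for a, b in zip(x, z))


def decide(x, support, b, kernel):
    """sgn(sum_i alpha_i y_i K(x_i, x) + b) over (alpha_i, y_i, x_i) triples."""
    score = sum(a * y * kernel(xi, x) for a, y, xi in support) + b
    # Assumption: a score of exactly zero is assigned to class +1.
    return 1 if score >= 0 else -1


# Hypothetical solved problem: (alpha_i, y_i, x_i) for three training points.
training = [(0.5, -1, [0.0]), (0.5, 1, [2.0]), (0.0, 1, [5.0])]
support = [t for t in training if t[0] > 0]  # the support vectors
predictions = [decide(x, support, -1.0, linear_kernel) for x in ([3.0], [0.5])]
```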
Quadratic Programming
• Suppose $Q$ ($k$ by $k$) is a fully dense matrix
• 70,000 training points ⇒ 70,000 variables
• $70{,}000^2 \cdot 4\,\text{B} \approx 19.6\,\text{GB}$: huge problem
• Traditional methods (Newton, Quasi-Newton) cannot be directly applied
• Current methods:
  • Decomposition [Osuna et al 97], [Joachims 98], [Platt 98]
  • Nearest point of two convex hulls [Keerthi et al 99]
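The memory estimate for a dense Q is plain arithmetic and can be reproduced as follows; 4 bytes per entry corresponds to single-precision floats, as the estimate above implies.

```python
k = 70_000            # number of training points, hence variables
bytes_per_entry = 4   # one single-precision float per matrix entry

total_bytes = k * k * bytes_per_entry
gigabytes = total_bytes / 10**9   # decimal gigabytes
gibibytes = total_bytes / 2**30   # binary gibibytes
```

This gives 19.6 GB in decimal units, or roughly 18.3 GiB, for the matrix alone.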
Sample Implementation
www.kernel-machines.org
• Main forum on kernel machines
• Lists over 250 active researchers
• 43 competing implementations
libSVM [Chang, Lin 06]
• Supports binary and multi-class classification and regression
• Beginners Guide for SVM classification
• “Out of the box”-system (automatic data scaling, parameter selection)
• Won EUNITE and IJCNN challenge
Application Accuracy
Automatic Training using libSVM
Application     Training Data  Features  Classes  Accuracy
Astroparticle   3,089          4         2        96.9%
Bioinformatics  391            20        3        85.2%
Vehicle         1,243          21        2        87.8%
References
Books
• Statistical Learning Theory (Vapnik). Wiley, 1998
• Advances in Kernel Methods: Support Vector Learning (Schölkopf, Burges, Smola). MIT Press, 1999
• An Introduction to Support Vector Machines (Cristianini, Shawe-Taylor). Cambridge Univ., 2000
• Support Vector Machines: Theory and Applications (Wang). Springer, 2005
Seminal Papers
• A training algorithm for optimal margin classifiers (Boser, Guyon, Vapnik). COLT’92, ACM Press
• Support vector networks (Cortes, Vapnik). Machine Learning 20, 1995
• Fast training of support vector machines using sequential minimal optimization (Platt). In Advances in Kernel Methods, MIT Press, 1999
• Improvements to Platt’s SMO algorithm for SVM classifier design (Keerthi, Shevade, Bhattacharyya, Murthy). Technical Report, 1999
Recent Papers
• A tutorial on ν-Support Vector Machines (Chen, Lin, Schölkopf). 2003
• Support Vector and Kernel Machines (Nello Cristianini). ICML, 2001
• libSVM: A library for Support Vector Machines (Chang, Lin). System Documentation, 2006
Sequential Minimal Optimization [Platt 98]
• Commonly used to solve standard SVM form
• Decomposition method with smallest working set, $|B| = 2$
• Subproblem analytically solved; no need for optimization software
• Contained flaws; modified version [Keerthi et al 99]
• Karush-Kuhn-Tucker (KKT) conditions of the dual ($\vec{E} = (1, \dots, 1)$):
$$Q\vec{\alpha} - \vec{E} + b\vec{y} - \vec{\lambda} + \vec{\mu} = 0$$
$$\mu_i (C - \alpha_i) = 0, \quad \vec{\mu} \ge 0$$
$$\alpha_i \lambda_i = 0, \quad \vec{\lambda} \ge 0$$
Computing b
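• KKT conditions yield
$$(Q\vec{\alpha} - \vec{E} + b\vec{y})_i \; \begin{cases} \ge 0 & \text{if } \alpha_i < C \\ \le 0 & \text{if } \alpha_i > 0 \end{cases}$$
• Let $F_i(\vec{\alpha}) = \sum_{j=1}^{k} \alpha_j y_j K(\vec{x}_i, \vec{x}_j) - y_i$ and
$$I_0 = \{i \mid 0 < \alpha_i < C\}$$
$$I_1 = \{i \mid y_i = 1, \alpha_i = 0\} \qquad I_2 = \{i \mid y_i = -1, \alpha_i = C\}$$
$$I_3 = \{i \mid y_i = 1, \alpha_i = C\} \qquad I_4 = \{i \mid y_i = -1, \alpha_i = 0\}$$
• Case analysis on $y_i$, using $(Q\vec{\alpha} - \vec{E} + b\vec{y})_i = y_i (F_i(\vec{\alpha}) + b)$, yields bounds on $b$
$$\max\{F_i(\vec{\alpha}) \mid i \in I_0 \cup I_3 \cup I_4\} \le -b \le \min\{F_i(\vec{\alpha}) \mid i \in I_0 \cup I_1 \cup I_2\}$$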
Working Set Selection
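Observation (see [Keerthi et al 99])
$\vec{\alpha}$ is not an optimal solution iff
$$\max\{F_i(\vec{\alpha}) \mid i \in I_0 \cup I_3 \cup I_4\} > \min\{F_i(\vec{\alpha}) \mid i \in I_0 \cup I_1 \cup I_2\}$$
Approach
Select working set $B = \{i, j\}$ with
$$i \equiv \arg\max_m \{F_m(\vec{\alpha}) \mid m \in I_0 \cup I_3 \cup I_4\}$$
$$j \equiv \arg\min_m \{F_m(\vec{\alpha}) \mid m \in I_0 \cup I_1 \cup I_2\}$$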
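The selection rule can be sketched in Python as follows. The sketch assumes a precomputed kernel matrix `K` (a list of rows), labels in {−1, +1}, both classes present, and zero-based indices; the function names are hypothetical. The two index-set unions are written in an equivalent compact form, noted in the comments.

```python
def f_values(alpha, ys, K):
    """F_i = sum_j alpha_j y_j K[i][j] - y_i for a precomputed kernel matrix K."""
    k = len(alpha)
    return [sum(alpha[j] * ys[j] * K[i][j] for j in range(k)) - ys[i]
            for i in range(k)]


def select_working_set(alpha, ys, K, C):
    """Return the pair (i, j) described above, or None if alpha is optimal."""
    F = f_values(alpha, ys, K)
    indices = range(len(alpha))
    # I0 ∪ I3 ∪ I4 equals: (y = +1 and alpha > 0) or (y = -1 and alpha < C)
    first = [m for m in indices
             if (ys[m] == 1 and alpha[m] > 0) or (ys[m] == -1 and alpha[m] < C)]
    # I0 ∪ I1 ∪ I2 equals: (y = +1 and alpha < C) or (y = -1 and alpha > 0)
    second = [m for m in indices
              if (ys[m] == 1 and alpha[m] < C) or (ys[m] == -1 and alpha[m] > 0)]
    i = max(first, key=lambda m: F[m])
    j = min(second, key=lambda m: F[m])
    return (i, j) if F[i] > F[j] else None


# Hypothetical two-point problem with a linear kernel: x_0 = 0 (y = -1), x_1 = 2 (y = +1).
K = [[0.0, 0.0], [0.0, 4.0]]
ys = [-1, 1]
start = select_working_set([0.0, 0.0], ys, K, C=10.0)
done = select_working_set([0.5, 0.5], ys, K, C=10.0)
```

Starting from all multipliers at zero, the pair (0, 1) violates the optimality condition; at the multipliers [0.5, 0.5] no violating pair remains.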
The Subproblem
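Definition
Let $B = \{i, j\}$ and $N = \{1, \dots, k\} \setminus B$.
• $\vec{\alpha}_B = \begin{pmatrix} \alpha_i \\ \alpha_j \end{pmatrix}$ and $\vec{\alpha}_N = \vec{\alpha}|_N$ (similar for matrices)
B-Subproblem
target: $\min_{\vec{\alpha}_B} \frac{\vec{\alpha}_B^T Q_{BB} \vec{\alpha}_B}{2} + \left( \sum_{b \in B} \alpha_b Q_{b,N} \vec{\alpha}_N \right) - \left( \sum_{b \in B} \alpha_b \right)$
subject to: $\vec{y} \cdot \vec{\alpha} = 0$ and $0 \le \alpha_i, \alpha_j \le C$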
Final Solution
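• Note that $-y_i \alpha_i = \vec{y}_N \cdot \vec{\alpha}_N + y_j \alpha_j$
• Substitute $\alpha_i = -y_i (\vec{y}_N \cdot \vec{\alpha}_N + y_j \alpha_j)$ into target
• One-variable optimization problem
• Can be solved analytically (cf., e.g., [Lin 01])
• Iterate (yielding new $\vec{\alpha}$) until the optimality condition holds up to a tolerance $\varepsilon > 0$:
$$\max\{F_i(\vec{\alpha}) \mid i \in I_0 \cup I_3 \cup I_4\} \le \min\{F_i(\vec{\alpha}) \mid i \in I_0 \cup I_1 \cup I_2\} + \varepsilon$$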
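One analytic update for a chosen pair can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation of [Platt 98] or libSVM: it moves α_i and α_j along the direction that keeps y · α fixed, takes the unconstrained minimizer (F_i − F_j) / (K_ii + K_jj − 2 K_ij) of the resulting one-variable quadratic, and clips it so both multipliers stay in [0, C]. The function name `smo_step`, the precomputed kernel matrix `K`, and the data are hypothetical; i and j are assumed distinct and chosen as the working set.

```python
def smo_step(alpha, ys, K, C, i, j):
    """Return a new alpha after one analytic update of (alpha_i, alpha_j)."""
    k = len(alpha)
    F = [sum(alpha[m] * ys[m] * K[n][m] for m in range(k)) - ys[n]
         for n in range(k)]
    if F[i] <= F[j]:
        return list(alpha)  # the pair does not violate optimality
    # Curvature of the target along the update direction.
    eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
    # Largest step that keeps both multipliers inside [0, C].
    room_i = alpha[i] if ys[i] == 1 else C - alpha[i]
    room_j = C - alpha[j] if ys[j] == 1 else alpha[j]
    limit = min(room_i, room_j)
    # Unconstrained minimizer, clipped to the box; fall back to the box
    # limit when the curvature is not positive (an assumption of this sketch).
    step = (F[i] - F[j]) / eta if eta > 0 else limit
    step = max(0.0, min(step, limit))
    new_alpha = list(alpha)
    new_alpha[i] -= ys[i] * step  # together with the next line this
    new_alpha[j] += ys[j] * step  # leaves y . alpha unchanged
    return new_alpha


# Hypothetical two-point problem with a linear kernel: x_0 = 0 (y = -1), x_1 = 2 (y = +1).
K = [[0.0, 0.0], [0.0, 4.0]]
ys = [-1, 1]
first = smo_step([0.0, 0.0], ys, K, 10.0, 0, 1)
second = smo_step(first, ys, K, 10.0, 0, 1)
```

On this tiny problem a single step reaches the multipliers [0.5, 0.5], and a second step leaves them unchanged because the pair no longer violates the optimality condition.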