Lecture Slides for
INTRODUCTION TO Machine Learning
ETHEM ALPAYDIN © The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml
CHAPTER 10:
Linear Discrimination
Likelihood- vs. Discriminant-based Classification
- Likelihood-based: assume a model for p(x|Ci) and use Bayes' rule to calculate P(Ci|x). Choose Ci if gi(x) = log P(Ci|x) is maximum.
- Discriminant-based: assume a model for the discriminant gi(x|Φi) directly; no density estimation. Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries.
Linear Discriminant
n Linear discriminant:
n Advantages: ¨ Simple: O(d) space/computation
¨ Knowledge extraction: Weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)
¨ Optimal when p(x|Ci) are Gaussian with shared cov matrix; useful when classes are (almost) linearly separable
( ) 01
00| ij
d
jiji
Tiiii wxwww,g +=+= ∑
=
xwwx
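As a minimal illustration (the weights below are made up, not learned from data), evaluating such a discriminant is a single inner product:

```python
import numpy as np

# Hypothetical (not learned) weights for one class i in a d = 3 problem.
w_i = np.array([0.5, -1.2, 0.3])   # one weight per attribute
w_i0 = 0.1                          # bias term

def g(x, w, w0):
    """Linear discriminant g_i(x) = w_i^T x + w_i0."""
    return w @ x + w0

print(g(np.array([1.0, 0.0, 2.0]), w_i, w_i0))   # 0.5 + 0.6 + 0.1 = 1.2
```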
Generalized Linear Model
- Quadratic discriminant:

$$g_i(\mathbf{x} \mid \mathbf{W}_i, \mathbf{w}_i, w_{i0}) = \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$

- Instead of increasing the model complexity, we can still use a linear classifier if we use higher-order (product) terms as inputs, e.g. for d = 2:

$$z_1 = x_1, \quad z_2 = x_2, \quad z_3 = x_1^2, \quad z_4 = x_2^2, \quad z_5 = x_1 x_2$$

- Map from x to z using nonlinear basis functions and use a linear discriminant in the z-space:

$$g_i(\mathbf{x}) = \sum_{j=1}^{k} w_{ij}\, \phi_j(\mathbf{x})$$

- The linear function defined in the z-space corresponds to a non-linear function in the x-space.
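A small sketch of this idea for d = 2, using exactly the z mapping above; any linear classifier applied to the expanded features is quadratic in the original x:

```python
import numpy as np

def quad_features(x):
    """Map x = (x1, x2) to z = (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# A discriminant that is linear in z, g(x) = w^T quad_features(x) + w0,
# is quadratic in the original x.
```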
Two Classes
Choose $C_1$ if $g_1(\mathbf{x}) > g_2(\mathbf{x})$ and $C_2$ if $g_2(\mathbf{x}) > g_1(\mathbf{x})$. Define:

$$
\begin{aligned}
g(\mathbf{x}) &= g_1(\mathbf{x}) - g_2(\mathbf{x}) \\
&= \left(\mathbf{w}_1^T \mathbf{x} + w_{10}\right) - \left(\mathbf{w}_2^T \mathbf{x} + w_{20}\right) \\
&= \left(\mathbf{w}_1 - \mathbf{w}_2\right)^T \mathbf{x} + \left(w_{10} - w_{20}\right) \\
&= \mathbf{w}^T \mathbf{x} + w_0
\end{aligned}
$$

and choose $C_1$ if $g(\mathbf{x}) > 0$, $C_2$ otherwise.
Learning the Discriminants
As we have seen before, when $p(\mathbf{x} \mid C_i) \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$, the optimal discriminant is a linear one:

$$g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2}\,\boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \log P(C_i)$$
So, estimate µi and Σ from the data and plug them into the gi's to find the linear discriminant functions, as sketched below. Of course, any other way of learning can be used (e.g. perceptron, gradient descent, logistic regression...).
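A minimal numpy sketch of this plug-in estimation, assuming inputs X of shape (N, d) and integer class labels y; the shared Σ is estimated as the pooled covariance:

```python
import numpy as np

def fit_linear_discriminants(X, y):
    """Estimate mu_i and a shared Sigma, then form w_i and w_i0 per class."""
    classes = np.unique(y)
    N, d = X.shape
    Sigma = np.zeros((d, d))
    params = {}
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma += (Xc - mu).T @ (Xc - mu)      # pooled scatter
        params[c] = (mu, len(Xc) / N)         # class mean and prior P(C_i)
    Sigma /= N
    Sigma_inv = np.linalg.inv(Sigma)
    discriminants = {}
    for c, (mu, prior) in params.items():
        w = Sigma_inv @ mu
        w0 = -0.5 * mu @ Sigma_inv @ mu + np.log(prior)
        discriminants[c] = (w, w0)            # g_c(x) = w @ x + w0
    return discriminants
```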
- When K > 2: combine K two-class problems, each one separating one class from all the other classes.
Multiple Classes
$$g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}$$

$$\text{Choose } C_i \text{ if } g_i(\mathbf{x}) = \max_{j=1}^{K} g_j(\mathbf{x})$$

How to train? How to decide on a test? The decision regions induced by the gi's are convex, and the distance of x to boundary i is |gi(x)|/||wi||. Why? Any problem? This assumes that the classes are linearly separable; a reject option may be used otherwise.
Pairwise Separation
$$g_{ij}(\mathbf{x} \mid \mathbf{w}_{ij}, w_{ij0}) = \mathbf{w}_{ij}^T \mathbf{x} + w_{ij0}$$

$$g_{ij}(\mathbf{x}) = \begin{cases} > 0 & \text{if } \mathbf{x} \in C_i \\ \le 0 & \text{if } \mathbf{x} \in C_j \\ \text{don't care} & \text{otherwise} \end{cases}$$

$$\text{Choose } C_i \text{ if } g_{ij}(\mathbf{x}) > 0,\ \forall j \ne i$$

Pairwise separation can be used even if the classes are not linearly separable from all the others; it uses K(K-1)/2 linear discriminants, as in the sketch below.
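A sketch of the pairwise decision rule, assuming a dict g that maps each pair (i, j) with i < j to an already-trained discriminant function (the names here are hypothetical):

```python
def choose_class(x, g, K):
    """g maps each pair (i, j), i < j, to a trained discriminant g_ij(x)."""
    for i in range(K):
        # C_i requires g_ij(x) > 0 for all j != i; for j < i use g_ji(x) <= 0.
        if all((g[(i, j)](x) > 0 if i < j else g[(j, i)](x) <= 0)
               for j in range(K) if j != i):
            return i
    return None   # no class satisfies all pairwise tests: reject
```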
A Bit of Geometry
- Dot product: $\langle \mathbf{w}, \mathbf{p} \rangle = \mathbf{w}^T\mathbf{p} = \lVert\mathbf{w}\rVert\,\lVert\mathbf{p}\rVert \cos\theta$
- Projection of p onto w: $\lVert\mathbf{p}\rVert \cos\theta = \mathbf{w}^T\mathbf{p} / \lVert\mathbf{w}\rVert$
Geometry: the points x on the separating hyperplane have $g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 = 0$; hence, for the points on the boundary, $\mathbf{w}^T\mathbf{x} = -w_0$. Thus these points all have the same projection onto the weight vector w, namely $\mathbf{w}^T\mathbf{x}/\lVert\mathbf{w}\rVert$ (by the definition of projection and dot product), and this is equal to $-w_0/\lVert\mathbf{w}\rVert$. Hence:
The perpendicular distance of the boundary to the origin is $|w_0|/\lVert\mathbf{w}\rVert$, and the distance of any point x to the decision boundary is $|g(\mathbf{x})|/\lVert\mathbf{w}\rVert$.
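The same distances in a short numpy sketch:

```python
import numpy as np

def dist_to_boundary(x, w, w0):
    """|g(x)| / ||w||: distance of x to the hyperplane w^T x + w0 = 0."""
    return abs(w @ x + w0) / np.linalg.norm(w)

# The boundary's distance to the origin, |w0| / ||w||, is the same formula at x = 0.
```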
Support Vector Machines
- Vapnik and Chervonenkis – 1963
- Boser, Guyon and Vapnik – 1992 (kernel trick)
- Cortes and Vapnik – 1995 (soft margin)

- The SVM is a machine learning algorithm which
  - solves classification problems
  - uses a flexible representation of the class boundaries
  - implements automatic complexity control to reduce overfitting
  - has a single global minimum which can be found in polynomial time

- It is popular because
  - it can be easy to use
  - it often has good generalization performance
  - the same algorithm solves a variety of problems with little tuning
SVM Concepts
- Convex programming and duality
- Using maximum margin to control complexity
- Representing non-linear boundaries with feature expansion
- The kernel trick for efficient optimization
Linear Separators
- Which of the linear separators is optimal?
Classification Margin
- The distance from example $\mathbf{x}^i$ to the separator is

$$r = \frac{\mathbf{w}^T \mathbf{x}^i + b}{\lVert\mathbf{w}\rVert}$$

- The examples closest to the hyperplane are the support vectors.
- The margin ρ of the separator is the distance between the support vectors from the two classes.
Maximum Margin Classification
- Maximizing the margin is good according to intuition.
- It implies that only the support vectors matter; the other training examples are ignorable.
SVM as 2-class Linear Classifier
(Cortes and Vapnik, 1995; Vapnik, 1995)

Given a sample of two classes,

$$X = \{\mathbf{x}^t, r^t\}, \qquad r^t = \begin{cases} +1 & \text{if } \mathbf{x}^t \in C_1 \\ -1 & \text{if } \mathbf{x}^t \in C_2 \end{cases}$$

find $\mathbf{w}$ and $w_0$ such that

$$\mathbf{w}^T\mathbf{x}^t + w_0 \ge +1 \quad \text{for } r^t = +1$$
$$\mathbf{w}^T\mathbf{x}^t + w_0 \le -1 \quad \text{for } r^t = -1$$

Optimal separating hyperplane: the separating hyperplane maximizing the margin.

Note the condition ≥ 1 (not just 0). If the classes are linearly separable, we can always achieve this by rescaling $\mathbf{w}$ and $w_0$ without affecting the separating hyperplane $\mathbf{w}^T\mathbf{x} + w_0 = 0$.
Optimal Separating Hyperplane
(Cortes and Vapnik, 1995; Vapnik, 1995)
Must satisfy:

$$\mathbf{w}^T\mathbf{x}^t + w_0 \ge +1 \quad \text{for } r^t = +1$$
$$\mathbf{w}^T\mathbf{x}^t + w_0 \le -1 \quad \text{for } r^t = -1$$

which can be rewritten as

$$r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge +1$$
Maximizing the Margin
The distance from the discriminant to the closest instances on either side is called the margin. In general this relationship holds (geometry):

$$d = \frac{|g(\mathbf{x})|}{\lVert\mathbf{w}\rVert}$$

So, for the support vectors, where $|g(\mathbf{x})| = 1$, we have:

$$d = \frac{1}{\lVert\mathbf{w}\rVert}, \qquad \rho = 2d = \frac{2}{\lVert\mathbf{w}\rVert}$$

To maximize the margin, minimize the Euclidean norm of the weight vector w.
Maximizing the Margin: Alternate Explanation
- The distance from the discriminant to the closest instances on either side is called the margin.
- The distance of x to the hyperplane is

$$\frac{|\mathbf{w}^T\mathbf{x} + w_0|}{\lVert\mathbf{w}\rVert}$$

- We require that this distance is at least some value ρ > 0:

$$\frac{r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right)}{\lVert\mathbf{w}\rVert} \ge \rho, \quad \forall t$$

- We would like to maximize ρ, but we can do so in infinitely many ways by scaling w.
- For a unique solution, we fix ρ||w|| = 1 and minimize ||w||.
$$\min \frac{1}{2}\lVert\mathbf{w}\rVert^2 \quad \text{subject to } r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge +1,\ \forall t$$

Unconstrained problem using Lagrange multipliers (non-negative numbers $\alpha^t$):

$$L_p = \frac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) - 1 \right]$$

The solution, if it exists, is always at a saddle point of the Lagrangian: $L_p$ should be minimized w.r.t. $\mathbf{w}$ and maximized w.r.t. the $\alpha^t$'s.
$$\min \frac{1}{2}\lVert\mathbf{w}\rVert^2 \quad \text{subject to } r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge +1,\ \forall t$$

$$L_p = \frac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{t=1}^{N} \alpha^t r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) + \sum_{t=1}^{N} \alpha^t$$

This convex quadratic optimization problem can be solved using the dual form, where we use the local-minimum constraints

$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t, \qquad \frac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_{t=1}^{N} \alpha^t r^t = 0$$

and maximize w.r.t. the $\alpha^t$'s.
from: http://math.oregonstate.edu/home/programs/undergrad/CalculusQuestStudyGuides/vcalc/lagrang/lagrang.html
Substituting $\mathbf{w} = \sum_t \alpha^t r^t \mathbf{x}^t$ and $\sum_t \alpha^t r^t = 0$ into $L_p$ gives the dual:

$$
\begin{aligned}
L_d &= \frac{1}{2}\mathbf{w}^T\mathbf{w} - \mathbf{w}^T\sum_t \alpha^t r^t \mathbf{x}^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t \\
&= -\frac{1}{2}\mathbf{w}^T\mathbf{w} + \sum_t \alpha^t \\
&= -\frac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s \left(\mathbf{x}^t\right)^T\mathbf{x}^s + \sum_t \alpha^t
\end{aligned}
$$

$$\text{subject to } \sum_t \alpha^t r^t = 0 \text{ and } \alpha^t \ge 0,\ \forall t$$

- Maximize $L_d$ with respect to the $\alpha^t$'s only.
- This is a quadratic programming problem; a sketch of solving it numerically follows below.
- Thanks to the convexity of the problem, the optimal value of $L_p$ equals that of $L_d$.
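A minimal sketch of solving this dual as a quadratic program with the cvxopt package (assuming it is installed); cvxopt minimizes (1/2)αᵀPα + qᵀα, so the maximization is negated:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard_margin(X, r):
    """X: (N, d) inputs; r: (N,) labels in {-1, +1}. Returns the alphas."""
    N = X.shape[0]
    K = X @ X.T                                     # Gram matrix of inner products
    P = matrix((np.outer(r, r) * K).astype(float))  # P[t, s] = r^t r^s (x^t)^T x^s
    q = matrix(-np.ones(N))                         # minimizing -sum(alpha) = maximizing sum(alpha)
    G = matrix(-np.eye(N))                          # -alpha^t <= 0, i.e. alpha^t >= 0
    h = matrix(np.zeros(N))
    A = matrix(r.reshape(1, -1).astype(float))      # equality: sum_t alpha^t r^t = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])
```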
- To every convex program corresponds a dual.
- Solving the original (primal) problem is equivalent to solving the dual.
The support vectors are the instances for which

$$r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) = 1$$
The dual, once more:

$$L_d = -\frac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s \left(\mathbf{x}^t\right)^T\mathbf{x}^s + \sum_t \alpha^t$$

$$\text{subject to } \sum_t \alpha^t r^t = 0 \text{ and } \alpha^t \ge 0,\ \forall t$$

- Maximize $L_d$ with respect to the $\alpha^t$'s only: a quadratic programming problem.
- Thanks to the convexity of the problem, the optimal value of $L_p$ equals that of $L_d$.
- The size of the dual depends on N, not on d.
Calculating the parameters w and w0
$$L_p = \frac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{t=1}^{N}\alpha^t\left[r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) - 1\right]$$

- Once we solve for the $\alpha^t$'s, we see that most of them are 0 and only a small number have $\alpha^t > 0$; the corresponding $\mathbf{x}^t$'s are called the support vectors.
- Note that for each instance, either the constraint is exactly satisfied (= 1), in which case $\alpha^t$ can be non-zero, or the constraint is strictly satisfied (> 1), in which case $\alpha^t$ must be zero.
Calculating the parameters w and w0
Once we have the Lagrange multipliers, we can compute w and w0, where SV is the set of support vectors:

$$\mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t = \sum_{t \in SV} \alpha^t r^t \mathbf{x}^t, \qquad w_0 = r^t - \mathbf{w}^T\mathbf{x}^t \ \text{ for any support vector } \mathbf{x}^t$$
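Continuing the cvxopt sketch above, recovering w and w0 from the multipliers (the 1e-6 threshold for "non-zero" is an arbitrary numerical choice):

```python
# X and r are the training arrays used above.
alpha = svm_dual_hard_margin(X, r)
sv = alpha > 1e-6                                        # support vector mask
w = ((alpha[sv] * r[sv])[:, None] * X[sv]).sum(axis=0)   # w = sum_t alpha^t r^t x^t
w0 = np.mean(r[sv] - X[sv] @ w)                          # average over SVs for stability
```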
- We make decisions by comparing each query x with only the support vectors:

$$y = \operatorname{sign}\left(\mathbf{w}^T\mathbf{x} + w_0\right) = \operatorname{sign}\Big(\sum_{t \in SV} \alpha^t r^t \left(\mathbf{x}^t\right)^T\mathbf{x} + w_0\Big)$$

- Choose class $C_1$ if positive, $C_2$ if negative.
Not Linearly Separable Case
- In the non-separable case, the previous approach cannot find a feasible solution: the objective function ($L_d$) grows arbitrarily large.
- Relax the constraints, but only when necessary, and introduce a further cost for doing so.
Soft Margin Hyperplane
- When the data are not linearly separable, introduce slack variables $\xi^t$:

$$r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge 1 - \xi^t, \qquad \xi^t \ge 0$$

- Three cases (shown in the figure, not reproduced here):
  - Case 1: $\xi^t = 0$, the instance is on or beyond the margin and correctly classified.
  - Case 2: $\xi^t \ge 1$, the instance is misclassified.
  - Case 3: $0 \le \xi^t < 1$, the instance is correctly classified but inside the margin.
Soft Margin Hyperplane
- Define the soft error, $\sum_t \xi^t$, an upper bound on the number of training errors.
- The new primal is

$$L_p = \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_t \xi^t - \sum_t \alpha^t\left[r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) - 1 + \xi^t\right] - \sum_t \mu^t \xi^t$$

  where the $\mu^t$'s are Lagrange multipliers that enforce the positivity of the $\xi^t$'s.
- The parameter C can be viewed as a way to control overfitting: it "trades off" the relative importance of maximizing the margin and fitting the training data.
Soft Margin Hyperplane
- The new dual is the same as the old one,

$$L_d = \sum_t \alpha^t - \frac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s \left(\mathbf{x}^t\right)^T\mathbf{x}^s$$

  subject to

$$\sum_t \alpha^t r^t = 0 \quad \text{and} \quad 0 \le \alpha^t \le C,\ \forall t$$

- As in the separable case, instances that are not support vectors vanish with their $\alpha^t = 0$, and the remaining ones define the boundary.
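In the cvxopt sketch above, the soft margin only changes the inequality constraints to the box 0 ≤ αᵗ ≤ C:

```python
# Continuing the hard-margin sketch: replace G and h so that 0 <= alpha^t <= C
# (everything else in the QP stays the same). N is the number of instances.
C = 1.0                                             # chosen by cross-validation
G = matrix(np.vstack([-np.eye(N), np.eye(N)]))      # -alpha <= 0  and  alpha <= C
h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
```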
Kernel Functions in SVM
- We can handle the overfitting problem: even if we have lots of parameters, large margins yield simple classifiers.
- "All" that is left is efficiency.
- Solution: the kernel trick.
Kernel Functions
- Instead of trying to fit a non-linear model, we can
  - map the problem to a new space through a non-linear transformation, and
  - use a linear model in the new space.
- Say we have the new space calculated by the basis functions $\mathbf{z} = \boldsymbol{\phi}(\mathbf{x})$, where $z_j = \phi_j(\mathbf{x})$, $j = 1, \ldots, k$:

$$\boldsymbol{\phi}(\mathbf{x}) = \left[\phi_1(\mathbf{x})\ \ \phi_2(\mathbf{x})\ \ \ldots\ \ \phi_k(\mathbf{x})\right]$$

  mapping the d-dimensional x-space to the k-dimensional z-space.
Kernel Functions
$$g(\mathbf{x}) = \sum_{j=1}^{k} w_j\, \phi_j(\mathbf{x}) + b$$

$$g(\mathbf{x}) = \sum_{j=0}^{k} w_j\, \phi_j(\mathbf{x}) \quad \text{if we assume } \phi_0(\mathbf{x}) = 1 \text{ for all } \mathbf{x}$$
Kernel Machines

- Preprocess the input x by the basis functions: $\mathbf{z} = \boldsymbol{\phi}(\mathbf{x})$, so $g(\mathbf{z}) = \mathbf{w}^T\mathbf{z}$, i.e. $g(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})$.
- SVM solution: $\mathbf{w} = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)$, so that

$$g(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t \boldsymbol{\phi}\left(\mathbf{x}^t\right)^T \boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t K\left(\mathbf{x}^t, \mathbf{x}\right)$$

- Find kernel functions K(x, y) such that the inner product of basis functions is replaced by a kernel evaluation in the original input space, as in the prediction sketch below.
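A sketch of this kernelized decision rule, assuming alphas, labels, training inputs and w0 from a solved dual; the RBF kernel here is one possible choice, not the only one:

```python
import numpy as np

def k_rbf(a, b, sigma=1.0):
    """One possible kernel: K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2 * sigma ** 2))

def predict(x, X, r, alpha, w0, kernel=k_rbf):
    """g(x) = sum over support vectors of alpha^t r^t K(x^t, x), plus w0."""
    g = sum(a * rt * kernel(xt, x)
            for a, rt, xt in zip(alpha, r, X) if a > 1e-6)  # only SVs contribute
    return np.sign(g + w0)
```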
Kernel Functions
- Consider polynomials of degree q (Cherkassky and Mulier, 1998):

$$K(\mathbf{x}, \mathbf{y}) = \left(\mathbf{x}^T\mathbf{y} + 1\right)^q$$

For example, for q = 2 and d = 2:

$$
\begin{aligned}
K(\mathbf{x}, \mathbf{y}) &= \left(x_1 y_1 + x_2 y_2 + 1\right)^2 \\
&= 1 + 2x_1y_1 + 2x_2y_2 + 2x_1x_2y_1y_2 + x_1^2y_1^2 + x_2^2y_2^2
\end{aligned}
$$

which corresponds to the inner product of the basis functions

$$\boldsymbol{\phi}(\mathbf{x}) = \left[1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \sqrt{2}x_1x_2,\ x_1^2,\ x_2^2\right]^T$$
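A quick numerical check of this identity, with illustrative values:

```python
import numpy as np

def phi(x):
    """The explicit degree-2 feature map above, for d = 2."""
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x1*x2, x1**2, x2**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose((x @ y + 1) ** 2, phi(x) @ phi(y))   # both give 4.0
```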
Examples of Kernel Functions
- Linear: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j$
  - Mapping Φ: x → φ(x), where φ(x) is x itself
- Polynomial of power p: $K(\mathbf{x}_i, \mathbf{x}_j) = \left(1 + \mathbf{x}_i^T\mathbf{x}_j\right)^p$
  - Mapping Φ: x → φ(x), where φ(x) has $\binom{d+p}{p}$ dimensions
- Gaussian (radial-basis function): $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\dfrac{\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2}{2\sigma^2}\right)$
  - Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian)
- The higher-dimensional space still has intrinsic dimensionality d, but linear separators in it correspond to non-linear separators in the original space.
- Typically k is much larger than d, and possibly larger than N.
  - Using the dual, where the complexity depends on N rather than k, is advantageous.
- We use the soft margin hyperplane:
  - If C is too large, there is too high a penalty for non-separable points (too many support vectors).
  - If C is too small, we may have underfitting.
- Decide by cross-validation, as in the sketch below.
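A minimal sketch of that cross-validation, assuming scikit-learn is available; the grid of C values and the X_train, r_train arrays are illustrative, not prescriptive:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# X_train, r_train are hypothetical training arrays.
search = GridSearchCV(SVC(kernel='rbf'),
                      param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                      cv=5)                 # 5-fold cross-validation
search.fit(X_train, r_train)
print(search.best_params_)
```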
Other Kernel Functions
- Polynomials of degree q:

$$K\left(\mathbf{x}^t, \mathbf{x}\right) = \left(\mathbf{x}^T\mathbf{x}^t\right)^q \qquad \text{or} \qquad K\left(\mathbf{x}^t, \mathbf{x}\right) = \left(\mathbf{x}^T\mathbf{x}^t + 1\right)^q$$

- Radial-basis functions:

$$K\left(\mathbf{x}^t, \mathbf{x}\right) = \exp\left[-\frac{\lVert\mathbf{x} - \mathbf{x}^t\rVert^2}{2\sigma^2}\right]$$

- Sigmoidal functions, such as:

$$K\left(\mathbf{x}^t, \mathbf{x}\right) = \tanh\left(2\,\mathbf{x}^T\mathbf{x}^t + 1\right)$$
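The three kernels above as vectorized numpy functions, a sketch operating on (N, d) arrays of row vectors:

```python
import numpy as np

def poly_kernel(X1, X2, q=3):
    """(x^T y + 1)^q for all pairs of rows."""
    return (X1 @ X2.T + 1) ** q

def rbf_kernel(X1, X2, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-sq / (2 * sigma ** 2))

def sigmoid_kernel(X1, X2):
    """tanh(2 x^T y + 1) for all pairs of rows."""
    return np.tanh(2 * (X1 @ X2.T) + 1)
```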
What Functions are Kernels? (Advanced)
- For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)ᵀφ(xj) can be cumbersome.
- Any function that satisfies certain constraints, called the Mercer conditions, can be a kernel function (Cherkassky and Mulier, 1998): every positive semi-definite symmetric function is a kernel.
- Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

$$\mathbf{K} = \begin{bmatrix}
K(\mathbf{x}_1,\mathbf{x}_1) & K(\mathbf{x}_1,\mathbf{x}_2) & \cdots & K(\mathbf{x}_1,\mathbf{x}_n) \\
K(\mathbf{x}_2,\mathbf{x}_1) & K(\mathbf{x}_2,\mathbf{x}_2) & \cdots & K(\mathbf{x}_2,\mathbf{x}_n) \\
\vdots & \vdots & \ddots & \vdots \\
K(\mathbf{x}_n,\mathbf{x}_1) & K(\mathbf{x}_n,\mathbf{x}_2) & \cdots & K(\mathbf{x}_n,\mathbf{x}_n)
\end{bmatrix}$$
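An empirical sanity check of this condition on a random sample, reusing rbf_kernel from the sketch above (a necessary check on the sample, not a proof of Mercer's condition):

```python
import numpy as np

X = np.random.randn(50, 4)
K = rbf_kernel(X, X)                         # from the sketch above
assert np.allclose(K, K.T)                   # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-10  # no significantly negative eigenvalues
```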
- Informally, kernel methods implicitly define the class of possible patterns by introducing a notion of similarity between data.
  - Choice of similarity → choice of relevant features.
- More formally, kernel methods exploit information about the inner products between data items:
  - Many standard algorithms can be rewritten so that they only require inner products between data (inputs).
  - Kernel functions = inner products in some (potentially very complex) feature space.
  - If the kernel is given, there is no need to specify which features of the data are being used.
  - Kernel functions make it possible to use infinite dimensions efficiently in time and space.
String kernels
- For example, given two documents, D1 and D2, the number of words appearing in both may form a kernel.
- Define φ(D1) as the M-dimensional binary vector whose dimension i is 1 if word wi appears in D1, and 0 otherwise.
- Then φ(D1)ᵀφ(D2) counts the number of shared words.
- If we instead define K(D1, D2) directly as the number of shared words:
  - there is no need to preselect the M words,
  - no need to create the bag-of-words model explicitly, and
  - M can be as large as we want.
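A direct implementation of this shared-words kernel; whitespace tokenization is a simplifying assumption:

```python
def shared_words_kernel(d1: str, d2: str) -> int:
    """K(D1, D2) = number of distinct words appearing in both documents."""
    return len(set(d1.split()) & set(d2.split()))

shared_words_kernel("the kernel trick is neat", "the trick works")   # -> 2
```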
Projecting into Higher Dimensions
SVM Applications
- Cortes and Vapnik, 1995:
  - Handwritten digit classification
  - 16×16 bitmaps → 256 dimensions
  - Polynomial kernel with q = 3 → feature space with ~10⁶ dimensions
  - No overfitting on a training set of 7300 instances
  - An average of 148 support vectors over different training sets
  - Expected test error rate: ExpN[P(error)] = ExpN[#support vectors] / N (≈ 0.02 for the above example)
SVM history and applications
- SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
- SVMs represent a general methodology for many PR problems: classification, regression, feature extraction, clustering, novelty detection, etc.
- SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
- SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
- The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the αi's at a time, e.g. SMO [Platt '99] and [Joachims '99].
Advantages of SVMs
- There are no problems with local minima, because the solution is a quadratic programming problem with a global minimum.
- The optimal solution can be found in polynomial time.
- There are few model parameters to select: the penalty term C, the kernel function and its parameters (e.g., the spread σ in the case of RBF kernels).
- The final results are stable and repeatable (e.g., no random initial weights).
- The SVM solution is sparse; it involves only the support vectors.
- SVMs rely on elegant and principled learning methods.
- SVMs provide a method to control complexity independently of dimensionality.
- SVMs have been shown (theoretically and empirically) to have excellent generalization capabilities.
Challenges
- Can the kernel functions be selected in a principled manner?
- SVMs still require the selection of a few parameters, typically through cross-validation.
- How does one incorporate domain knowledge?
  - Currently this is done through the selection of the kernel and the introduction of "artificial" examples.
- How interpretable are the results provided by an SVM?
- What is the optimal data representation for an SVM? What is the effect of feature weighting? How does an SVM handle categorical or missing features?
- Do SVMs always perform best? Can they beat a hand-crafted solution for a particular problem?
- Do SVMs eliminate the model selection problem?
- More explanations or demonstrations can be found at:
  - http://www.support-vector-machines.org/index.html
  - Haykin, Chp. 6, pp. 318-339
  - The Burges tutorial: Burges, C. J. C., "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, Vol. 2, No. 2, 1998.
  - http://www.dtreg.com/svm.htm
- Software:
  - SVMlight, by Joachims, is one of the most widely used SVM classification and regression packages. Distributed as C++ source and binaries for Linux, Windows, Cygwin, and Solaris. Kernels: polynomial, radial basis function, and neural (tanh).
  - LibSVM (Library for Support Vector Machines), http://www.csie.ntu.edu.tw/~cjlin/libsvm/, developed by Chang and Lin, is also widely used. Developed in C++ and Java, it also supports multi-class classification, weighted SVM for unbalanced data, cross-validation, and automatic model selection. It has interfaces for Python, R, Splus, MATLAB, Perl, Ruby, and LabVIEW. Kernels: linear, polynomial, radial basis function, and neural (tanh).
- Applets to play with:
  - http://lcn.epfl.ch/tutorial/english/svm/html/index.html
  - http://cs.stanford.edu/people/karpathy/svmjs/demo/