Support Vector Machines
M.W. Mak

Contents:
1. Introduction to SVMs
2. Linear SVMs
3. Non-linear SVMs
Introduction

SVMs were developed by Vapnik in 1995 and are becoming popular due to their attractive features and promising performance.

Conventional neural networks are based on empirical risk minimization, where the network weights are determined by minimizing the mean squared error between the actual outputs and the desired outputs.

SVMs are based on the structural risk minimization principle, where the parameters are optimized by minimizing a bound on the generalization error.

SVMs have been shown to possess better generalization capability than conventional neural networks.
Introduction (Cont.)

Given N labeled empirical data:

(x_1, y_1), \ldots, (x_N, y_N) \in \mathcal{X} \times \{-1, +1\}    (1)

where \mathcal{X} \subseteq \mathbb{R}^D is the set of input data and the y_i are the class labels.

[Figure: two classes (y_i = +1 and y_i = -1) in domain X, with class means c_1 and c_2; axes x_1, x_2.]
Introduction (Cont.)

We construct a simple classifier by computing the means of the two classes:

c_1 = \frac{1}{N_1} \sum_{i: y_i = +1} x_i \quad \text{and} \quad c_2 = \frac{1}{N_2} \sum_{i: y_i = -1} x_i

where N_1 and N_2 are the numbers of data points in the classes with positive and negative labels, respectively.

We assign a new point x to the class whose mean is closer to it. To achieve this, we compute

c = (c_1 + c_2)/2    (2)
Introduction (Cont.)

Then we determine the class of x by checking whether the vector connecting x and c encloses an angle smaller than \pi/2 with the vector w = c_1 - c_2:

y = \mathrm{sgn}\,\langle (x - c), w \rangle
  = \mathrm{sgn}\,\langle (x - (c_1 + c_2)/2), (c_1 - c_2) \rangle
  = \mathrm{sgn}\,(\langle x, c_1 \rangle - \langle x, c_2 \rangle + b)

where b = \frac{1}{2}\left( \|c_2\|^2 - \|c_1\|^2 \right).

[Figure: a test point x, the midpoint c, and the class means c_1 and c_2 in domain X; axes x_1, x_2.]
Introduction (Cont.)

In the special case where b = 0, we have

y = \mathrm{sgn}\left( \frac{1}{N_1} \sum_{i: y_i = +1} \langle x, x_i \rangle - \frac{1}{N_2} \sum_{i: y_i = -1} \langle x, x_i \rangle \right)    (3)

This means that we use ALL data points x_i, each weighted equally by 1/N_1 or 1/N_2, to define the decision plane.
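As an illustration (a sketch of mine, not from the slides), here is the mean-based classifier of Eqs. (2) and (3) in numpy; the toy data are invented.

```python
import numpy as np

# Toy data: rows are points, labels are +1 or -1 (invented for illustration).
X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [5.0, 6.0]])
y = np.array([-1, -1, +1, +1])

# Class means c1 (positive class) and c2 (negative class), as in the slides.
c1 = X[y == +1].mean(axis=0)
c2 = X[y == -1].mean(axis=0)

# Decision rule of Eq. (3): sign of <x, c1> - <x, c2> + b,
# with b = (||c2||^2 - ||c1||^2) / 2.
b = 0.5 * (c2 @ c2 - c1 @ c1)

def classify(x):
    return np.sign(x @ c1 - x @ c2 + b)

print(classify(np.array([1.5, 1.5])))  # -1.0: closer to the negative mean
print(classify(np.array([5.5, 5.5])))  # +1.0: closer to the positive mean
```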
Introduction (Cont.)

[Figure: the decision plane, with normal vector w, separating the class y_i = +1 from the class y_i = -1 in domain X; the class means c_1 and c_2 and a test point x are marked; axes x_1, x_2.]
Introduction (Cont.)

However, we might want to remove the influence of patterns that are far away from the decision boundary, because their influence is usually small.

We may also select only a few important data points (called support vectors) and weight them differently. Then we have a support vector machine.
Introduction (Cont.)

[Figure: a decision plane separating the class y_i = +1 from the class y_i = -1 in domain X; the support vectors lie on the margin boundaries; axes x_1, x_2.]

We aim to find a decision plane that maximizes the margin.
Linear SVMs

Assume that all training data satisfy the constraints:

\langle w, x_i \rangle + b \ge +1 \quad \text{for } y_i = +1
\langle w, x_i \rangle + b \le -1 \quad \text{for } y_i = -1    (4)

which means

y_i(\langle w, x_i \rangle + b) - 1 \ge 0 \quad \forall i    (5)

Training data points for which the above equality holds lie on hyperplanes parallel to the decision plane.
Linear SVMs (Cont.)

[Figure: the hyperplanes \langle w, x \rangle + b = +1, 0, -1, with a point x_1 on the plus-hyperplane and a point x_2 on the minus-hyperplane; margin d; axes x_1, x_2.]

For x_1 on the plus-hyperplane and x_2 on the minus-hyperplane:

\langle w, x_1 \rangle + b = +1
\langle w, x_2 \rangle + b = -1
\Rightarrow \langle w, (x_1 - x_2) \rangle = 2

Projecting x_1 - x_2 onto the unit normal w/\|w\| gives the margin:

d = \left\langle \frac{w}{\|w\|}, (x_1 - x_2) \right\rangle = \frac{2}{\|w\|}

Therefore, maximizing the margin is equivalent to minimizing \|w\|^2.
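As a worked instance (added for clarity): for the weight vector w = [2 \;\; 2]^T obtained in the three-point example later in these slides, the margin is

d = \frac{2}{\|w\|} = \frac{2}{\sqrt{2^2 + 2^2}} = \frac{2}{2\sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707

which is consistent with the \|w\| = 2.83 reported in that example's figure.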
Linear SVMs (Lagrangian)

We minimize \|w\|^2 subject to the constraints

y_i(\langle w, x_i \rangle + b) - 1 \ge 0 \quad \forall i    (6)

This can be achieved by introducing Lagrange multipliers \{\alpha_i \ge 0\}_{i=1}^N and a Lagrangian

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i \left( y_i(\langle w, x_i \rangle + b) - 1 \right)    (7)

The Lagrangian has to be minimized with respect to w and b and maximized with respect to \alpha_i \ge 0.
Linear SVMs (Lagrangian)

Setting \partial L(w, b, \alpha)/\partial b = 0 and \partial L(w, b, \alpha)/\partial w = 0, we obtain

\sum_{i=1}^N \alpha_i y_i = 0 \quad \text{and} \quad w = \sum_{i=1}^N \alpha_i y_i x_i    (8)

Patterns for which \alpha_k > 0 are called Support Vectors. These vectors lie on the margin and satisfy

y_k(\langle w, x_k \rangle + b) - 1 = 0, \quad k \in S

where S contains the indices of the support vectors. Patterns for which \alpha_k = 0 are considered irrelevant to the classification.
Linear SVMs (Wolfe Dual)

Substituting (8) into (7), we obtain the Wolfe dual:

Maximize:  L(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
subject to:  \alpha_i \ge 0, \; i = 1, \ldots, N, \quad \text{and} \quad \sum_{i=1}^N \alpha_i y_i = 0    (9)

The decision hyperplane is thus

f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^N y_i \alpha_i \langle x, x_i \rangle + b \right)

where b = 1/y_k - \langle w, x_k \rangle and x_k is a support vector.
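As a quick cross-check (my own sketch, not part of the original slides), the dual solution of Eq. (9) can be inspected with scikit-learn, whose SVC exposes y_i \alpha_i for the support vectors via dual_coef_; the toy data below are invented.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Toy linearly separable data (invented for illustration).
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=100.0).fit(X, y)

# dual_coef_[0] holds y_i * alpha_i for the support vectors (Eq. 8).
alpha_y = clf.dual_coef_[0]
sv = clf.support_vectors_

# Reconstruct w = sum_i alpha_i y_i x_i and compare with clf.coef_.
w = alpha_y @ sv
print(w, clf.coef_[0])          # the two should agree
print(clf.intercept_[0])        # the bias term b
print(2.0 / np.linalg.norm(w))  # margin d = 2 / ||w||
```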
Linear SVMs (Example)

Analytical example (3-point problem):

x_1 = [0.0 \;\; 0.0]^T, \quad y_1 = -1
x_2 = [1.0 \;\; 0.0]^T, \quad y_2 = +1
x_3 = [0.0 \;\; 1.0]^T, \quad y_3 = +1

Objective function:

Maximize:  L(\alpha) = \sum_{i=1}^3 \alpha_i - \frac{1}{2} \sum_{i=1}^3 \sum_{j=1}^3 \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
subject to:  \alpha_i \ge 0, \; i = 1, 2, 3, \quad \text{and} \quad \sum_{i=1}^3 \alpha_i y_i = 0
Linear SVMs (Example)

We introduce another Lagrange multiplier \lambda to obtain the Lagrangian

F(\alpha, \lambda) = L(\alpha) + \lambda \sum_{i=1}^3 \alpha_i y_i
                   = (\alpha_1 + \alpha_2 + \alpha_3) - \frac{1}{2}(\alpha_2^2 + \alpha_3^2) + \lambda(-\alpha_1 + \alpha_2 + \alpha_3)

Differentiating F(\alpha, \lambda) with respect to \lambda and \alpha_i and setting the results to zero, we obtain

\alpha_1 = 4, \quad \alpha_2 = 2, \quad \alpha_3 = 2, \quad \lambda = 1
Linear SVMs (Example)

Substituting the Lagrange multipliers into Eq. (8):

w = \sum_{i=1}^3 \alpha_i y_i x_i = [2 \;\; 2]^T
b = \frac{1}{y_1} - \langle w, x_1 \rangle = -1

The decision boundary is

\langle w, x \rangle + b = 0 \;\Rightarrow\; [2 \;\; 2]\,[x_1 \;\; x_2]^T - 1 = 0 \;\Rightarrow\; x_1 + x_2 = 0.5

[Figure: the three points with the decision boundary x_1 + x_2 = 0.5; Linear SVM, C=100, #SV=3, acc=100.00%, \|w\| = 2.83; axes x_1, x_2.]
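A numerical sanity check of this analytical solution (my own sketch): verify Eq. (8) and the margin conditions for \alpha = (4, 2, 2) with numpy.

```python
import numpy as np

# The 3-point problem from the slides.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
alpha = np.array([4.0, 2.0, 2.0])  # the analytical solution

# Eq. (8): w = sum_i alpha_i y_i x_i; the equality constraint sum_i alpha_i y_i = 0.
w = (alpha * y) @ X
print(w)                 # [2. 2.]
print(alpha @ y)         # 0.0, the constraint holds

# Bias from a support vector x_k: b = 1/y_k - <w, x_k>.
b = 1.0 / y[0] - w @ X[0]
print(b)                 # -1.0

# All three points lie exactly on the margin: y_i(<w, x_i> + b) = 1.
print(y * (X @ w + b))   # [1. 1. 1.]
```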
Linear SVMs (Example)

4-point linearly separable problem:

[Figure, left panel: Linear SVM, C=100, #SV=4, accuracy=100.00% (4 support vectors); axes x_1, x_2.]
[Figure, right panel: Linear SVM, C=100, #SV=3, accuracy=100.00% (3 support vectors); axes x_1, x_2.]
Linear SVMs (Non-linearly separable)

Non-linearly separable: patterns that cannot be separated by a linear decision boundary without incurring classification error.

[Figure: a 20-point data set containing data that cause classification errors in linear SVMs; axes x_1, x_2.]
Linear SVMs (Non-linearly separable)

We introduce a set of slack variables \xi = \{\xi_1, \xi_2, \ldots, \xi_N\} with \xi_i \ge 0:

y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i \quad \forall i

The slack variables allow some data to violate the constraints defined for the linearly separable case (Eq. 6):

y_i(\langle w, x_i \rangle + b) \ge 1 \quad \forall i

Therefore, for some \xi_k with \xi_k > 0, we have

y_k(\langle w, x_k \rangle + b) < 1, \quad \text{e.g.,} \; y_k(\langle w, x_k \rangle + b) = 0.5 \;\text{and}\; \xi_k = 0.8
Linear SVMs (Non-linearly separable)

E.g., \xi_{10} = 0.667 and \xi_{19} = 0.667 because x_{10} and x_{19} are inside the margins, i.e., they violate the constraint (Eq. 6).

[Figure: Linear SVM, C=1000.0, #SV=7, acc=95.00%, \|w\| = 0.94; the 20-point data set with x_{10} and x_{19} inside the margins; axes x_1, x_2.]
Linear SVMs (Non-linearly separable)

For non-separable cases:

Minimize:  \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i
subject to:  y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i

where C is a user-defined penalty parameter that penalizes any violation of the margins.

The Lagrangian becomes

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i \left( y_i(\langle w, x_i \rangle + b) - 1 + \xi_i \right) - \sum_{i=1}^N \mu_i \xi_i

where the multipliers \mu_i \ge 0 enforce \xi_i \ge 0.
Linear SVMs (Non-linearly separable)

Wolfe dual optimization:

Maximize:  L(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
subject to:  0 \le \alpha_i \le C, \; i = 1, \ldots, N, \quad \text{and} \quad \sum_{i=1}^N \alpha_i y_i = 0

The output weight vector and bias term are

w = \sum_{i=1}^N \alpha_i y_i x_i
b = \frac{1}{y_k} - \langle w, x_k \rangle, \quad \text{where } x_k \text{ is a support vector}
2. Linear SVMs (Types of SVs)

[Figure: Linear SVM, C=10.0, #SV=7, acc=95.00%, \|w\| = 0.94; the 20-point data set; axes x_1, x_2.]

Three types of support vectors:

1. On the margin: 0 < \alpha_i < C, \; \xi_i = 0, \; y_i(w^T x_i + b) = 1. E.g., \alpha_1 = 2.85 and \alpha_{11} = 0.44, with \xi_1 = \xi_{11} = 0.

2. Inside the margin: \alpha_i = C, \; 0 < \xi_i < 2, \; y_i(w^T x_i + b) = 1 - \xi_i. E.g., \alpha_{10} = 10, \xi_{10} = 0.667.

3. Outside the margin: \alpha_i = C, \; \xi_i > 2, \; y_i(w^T x_i + b) = 1 - \xi_i. E.g., \alpha_{20} = 10, \xi_{20} = 2.667.

(Non-support vectors, such as x_{17}, have \alpha_{17} = 0 and \xi_{17} = 0.)
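The same bookkeeping can be reproduced programmatically; the sketch below (my own, with invented data) classifies each support vector by comparing \alpha_i with C and computing its slack.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is available

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs (invented data).
X = np.vstack([rng.normal(2, 1.5, (20, 2)), rng.normal(5, 1.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

C = 10.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# |dual_coef_| gives alpha_i for the support vectors.
alpha = np.abs(clf.dual_coef_[0])
f = clf.decision_function(X[clf.support_])
xi = np.maximum(0.0, 1.0 - y[clf.support_] * f)

for a, s in zip(alpha, xi):
    if a < C - 1e-6:
        kind = "on the margin"      # 0 < alpha < C, xi = 0
    elif s < 2.0:
        kind = "inside the margin"  # alpha = C, 0 < xi < 2
    else:
        kind = "outside the margin" # alpha = C, xi > 2
    print(f"alpha={a:6.3f}  xi={s:5.3f}  {kind}")
```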
2. Linear SVMs (Types of SVs)

[Figure: Linear SVM, C=10.0, #SV=7, acc=95.00%, \|w\| = 0.94; the decision boundary w^T x + b = 0 with the margin hyperplanes w^T x + b = \pm 1 and the contours w^T x + b = \pm 0.33; axes x_1, x_2.]

For non-support vectors (\alpha_i = 0, \xi_i = 0):

y_i = +1: \; y_i(w^T x_i + b) > 1 \;\Rightarrow\; w^T x_i + b > 1
y_i = -1: \; y_i(w^T x_i + b) > 1 \;\Rightarrow\; w^T x_i + b < -1

For x_{20}: y_{20} = -1, \alpha_{20} = C, \xi_{20} = 2.67:

y_{20}(w^T x_{20} + b) = 1 - \xi_{20} = -1.67 \;\Rightarrow\; w^T x_{20} + b = 1.67
2. Linear SVMs (Types of SVs)

Swapping Class 1 and Class 2 yields the same analysis with all signs reversed: the decision boundary is unchanged, but w and b flip sign, so e.g. w^T x_{20} + b = -1.67.

[Figure: the same 20-point example with the class labels swapped; Linear SVM, C=10.0, #SV=7, acc=95.00%, \|w\| = 0.94; axes x_1, x_2.]
2. Linear SVMs (Types of SVs)

Effect of varying C:

[Figure, left panel: Linear SVM, C=0.1, #SV=10, acc=95.00%, \|w\| = 0.57, \sum_i \xi_i = 5.2; axes x_1, x_2.]
[Figure, right panel: Linear SVM, C=100.0, #SV=7, acc=95.00%, \|w\| = 0.94, \sum_i \xi_i = 4.0; axes x_1, x_2.]

A larger C penalizes margin violations more heavily: the total slack \sum_i \xi_i decreases, while \|w\| increases (i.e., the margin shrinks) and fewer points remain support vectors.
3. Non-linear SVMs

In case the training data X are not linearly separable, we may use a kernel function K(x, x_i) to map the data from the input space to a feature space where the data become linearly separable.

[Figure: a non-linear decision boundary separating y_i = +1 from y_i = -1 in the input space (domain X, axes x_1, x_2) becomes a linear decision boundary in the feature space under the kernel mapping K(x, x_i).]
3. Non-linear SVMs (Cont.)

The decision function becomes

f(x) = \mathrm{sgn}\left( \sum_{i=1}^N y_i \alpha_i K(x, x_i) + b \right)

For RBF kernels:

K(x, x_i) = \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right)

For polynomial kernels:

K(x, x_i) = \left( \langle x, x_i \rangle + 1 \right)^p, \quad p > 0
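Both kernels are straightforward to implement; here is a minimal numpy sketch of mine matching the formulas above (the sample vectors are invented):

```python
import numpy as np

def rbf_kernel(x, xi, sigma2=4.0):
    """RBF kernel K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma2))

def poly_kernel(x, xi, p=2):
    """Polynomial kernel K(x, x_i) = (<x, x_i> + 1)^p."""
    return (x @ xi + 1.0) ** p

x = np.array([1.0, 2.0])
xi = np.array([2.0, 0.0])
print(rbf_kernel(x, xi))   # exp(-5/8), about 0.535 (2*sigma^2 = 8 as in the slides)
print(poly_kernel(x, xi))  # (2 + 1)^2 = 9
```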
3. Non-linear SVMs (Cont.)

The optimization problem becomes:

Maximize:  W(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j)
subject to:  \alpha_i \ge 0, \; i = 1, \ldots, N, \quad \text{and} \quad \sum_{i=1}^N \alpha_i y_i = 0

That is, the inner product \langle x_i, x_j \rangle in the linear dual (Eq. 9) is replaced by the kernel K(x_i, x_j).
3. Non-linear SVMs (Cont.)

The effect of varying C on RBF-SVMs:

[Figure, left panel: RBF SVM, 2\sigma^2 = 8.0, C = 10.0, #SV = 9, acc = 90.00%, \sum_i \xi_i = 3.09; axes x_1, x_2.]
[Figure, right panel: RBF SVM, 2\sigma^2 = 8.0, C = 1000.0, #SV = 7, acc = 100.00%, \sum_i \xi_i = 0.0; axes x_1, x_2.]
3. Non-linear SVMs (Cont.)

The effect of varying C on polynomial SVMs:

[Figure, left panel: Polynomial SVM, degree = 2, C = 10.0, #SV = 7, acc = 90.00%, \sum_i \xi_i = 2.99; axes x_1, x_2.]
[Figure, right panel: Polynomial SVM, degree = 2, C = 1000.0, #SV = 8, acc = 90.00%, \sum_i \xi_i = 2.97; axes x_1, x_2.]