Support Vector Machines
M.W. Mak

Contents:
1. Introduction to SVMs
2. Linear SVMs
3. Non-linear SVMs
Introduction

SVMs were developed by Vapnik in 1995 and are becoming popular due to their attractive features and promising performance.

Conventional neural networks are based on empirical risk minimization, where the network weights are determined by minimizing the mean squared error between the actual outputs and the desired outputs.

SVMs are based on the structural risk minimization principle, where the parameters are optimized by minimizing a bound on the generalization error.

SVMs have been shown to possess better generalization capability than conventional neural networks.
Introduction (Cont.)

Given N labeled empirical data:

(x_1, y_1), \ldots, (x_N, y_N) \in \mathcal{X} \times \{-1, +1\}    (1)

where \mathcal{X} \subseteq \mathbb{R}^D is the set of input data and the y_i are the class labels.

[Figure: two classes (y_i = +1 and y_i = -1) in domain X, with class means c_1 and c_2; axes x_1, x_2.]
Introduction (Cont.)

We construct a simple classifier by computing the means of the two classes:

c_1 = \frac{1}{N_1} \sum_{i: y_i = +1} x_i \quad \text{and} \quad c_2 = \frac{1}{N_2} \sum_{i: y_i = -1} x_i

where N_1 and N_2 are the numbers of data points in the classes with positive and negative labels, respectively.

We assign a new point x to the class whose mean is closer to it. To achieve this, we compute

c = (c_1 + c_2)/2    (2)
Introduction (Cont.)

Then we determine the class of x by checking whether the vector connecting x and c encloses an angle smaller than \pi/2 with the vector w = c_1 - c_2:

y = \mathrm{sgn}\,\langle (x - c), w \rangle
  = \mathrm{sgn}\,\langle (x - (c_1 + c_2)/2), (c_1 - c_2) \rangle
  = \mathrm{sgn}\,(\langle x, c_1 \rangle - \langle x, c_2 \rangle + b)

where b = \frac{1}{2}\left( \|c_2\|^2 - \|c_1\|^2 \right).

[Figure: a test point x, the midpoint c, and the class means c_1 and c_2 in domain X; axes x_1, x_2.]
Introduction (Cont.)

In the special case where b = 0, we have

y = \mathrm{sgn}\left( \frac{1}{N_1} \sum_{i: y_i = +1} \langle x, x_i \rangle - \frac{1}{N_2} \sum_{i: y_i = -1} \langle x, x_i \rangle \right)    (3)

This means that we use ALL data points x_i, each weighted equally by 1/N_1 or 1/N_2, to define the decision plane.
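As an illustration (a sketch of mine, not from the slides), here is the mean-based classifier of Eqs. (2) and (3) in numpy; the toy data are invented.

```python
import numpy as np

# Toy data: rows are points, labels are +1 or -1 (invented for illustration).
X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [5.0, 6.0]])
y = np.array([-1, -1, +1, +1])

# Class means c1 (positive class) and c2 (negative class), as in the slides.
c1 = X[y == +1].mean(axis=0)
c2 = X[y == -1].mean(axis=0)

# Decision rule of Eq. (3): sign of <x, c1> - <x, c2> + b,
# with b = (||c2||^2 - ||c1||^2) / 2.
b = 0.5 * (c2 @ c2 - c1 @ c1)

def classify(x):
    return np.sign(x @ c1 - x @ c2 + b)

print(classify(np.array([1.5, 1.5])))  # -1.0: closer to the negative mean
print(classify(np.array([5.5, 5.5])))  # +1.0: closer to the positive mean
```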
Introduction (Cont.)

[Figure: the decision plane, with normal vector w, separating the class y_i = +1 from the class y_i = -1 in domain X; the class means c_1 and c_2 and a test point x are marked; axes x_1, x_2.]
Introduction (Cont.)

However, we might want to remove the influence of patterns that are far away from the decision boundary, because their influence is usually small.

We may also select only a few important data points (called support vectors) and weight them differently. Then we have a support vector machine.
Introduction (Cont.)

[Figure: a decision plane separating the class y_i = +1 from the class y_i = -1 in domain X; the support vectors lie on the margin boundaries; axes x_1, x_2.]

We aim to find a decision plane that maximizes the margin.
Linear SVMs

Assume that all training data satisfy the constraints:

\langle w, x_i \rangle + b \ge +1 \quad \text{for } y_i = +1
\langle w, x_i \rangle + b \le -1 \quad \text{for } y_i = -1    (4)

which means

y_i(\langle w, x_i \rangle + b) - 1 \ge 0 \quad \forall i    (5)

Training data points for which the above equality holds lie on hyperplanes parallel to the decision plane.
Linear SVMs (Cont.)

[Figure: the hyperplanes \langle w, x \rangle + b = +1, 0, -1, with a point x_1 on the plus-hyperplane and a point x_2 on the minus-hyperplane; margin d; axes x_1, x_2.]

For x_1 on the plus-hyperplane and x_2 on the minus-hyperplane:

\langle w, x_1 \rangle + b = +1
\langle w, x_2 \rangle + b = -1
\Rightarrow \langle w, (x_1 - x_2) \rangle = 2

Projecting x_1 - x_2 onto the unit normal w/\|w\| gives the margin:

d = \left\langle \frac{w}{\|w\|}, (x_1 - x_2) \right\rangle = \frac{2}{\|w\|}

Therefore, maximizing the margin is equivalent to minimizing \|w\|^2.
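As a worked instance (added for clarity): for the weight vector w = [2 \;\; 2]^T obtained in the three-point example later in these slides, the margin is

d = \frac{2}{\|w\|} = \frac{2}{\sqrt{2^2 + 2^2}} = \frac{2}{2\sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707

which is consistent with the \|w\| = 2.83 reported in that example's figure.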
Linear SVMs (Lagrangian)

We minimize \|w\|^2 subject to the constraints

y_i(\langle w, x_i \rangle + b) - 1 \ge 0 \quad \forall i    (6)

This can be achieved by introducing Lagrange multipliers \{\alpha_i \ge 0\}_{i=1}^N and a Lagrangian

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i \left( y_i(\langle w, x_i \rangle + b) - 1 \right)    (7)

The Lagrangian has to be minimized with respect to w and b and maximized with respect to \alpha_i \ge 0.
Linear SVMs (Lagrangian)

Setting \partial L(w, b, \alpha)/\partial b = 0 and \partial L(w, b, \alpha)/\partial w = 0, we obtain

\sum_{i=1}^N \alpha_i y_i = 0 \quad \text{and} \quad w = \sum_{i=1}^N \alpha_i y_i x_i    (8)

Patterns for which \alpha_k > 0 are called Support Vectors. These vectors lie on the margin and satisfy

y_k(\langle w, x_k \rangle + b) - 1 = 0, \quad k \in S

where S contains the indices of the support vectors. Patterns for which \alpha_k = 0 are considered irrelevant to the classification.
Linear SVMs (Wolfe Dual)

Substituting (8) into (7), we obtain the Wolfe dual:

Maximize:  L(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
subject to:  \alpha_i \ge 0, \; i = 1, \ldots, N, \quad \text{and} \quad \sum_{i=1}^N \alpha_i y_i = 0    (9)

The decision hyperplane is thus

f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^N y_i \alpha_i \langle x, x_i \rangle + b \right)

where b = 1/y_k - \langle w, x_k \rangle and x_k is a support vector.
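As a quick cross-check (my own sketch, not part of the original slides), the dual solution of Eq. (9) can be inspected with scikit-learn, whose SVC exposes y_i \alpha_i for the support vectors via dual_coef_; the toy data below are invented.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Toy linearly separable data (invented for illustration).
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=100.0).fit(X, y)

# dual_coef_[0] holds y_i * alpha_i for the support vectors (Eq. 8).
alpha_y = clf.dual_coef_[0]
sv = clf.support_vectors_

# Reconstruct w = sum_i alpha_i y_i x_i and compare with clf.coef_.
w = alpha_y @ sv
print(w, clf.coef_[0])          # the two should agree
print(clf.intercept_[0])        # the bias term b
print(2.0 / np.linalg.norm(w))  # margin d = 2 / ||w||
```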
Linear SVMs (Example)

Analytical example (3-point problem):

x_1 = [0.0 \;\; 0.0]^T, \quad y_1 = -1
x_2 = [1.0 \;\; 0.0]^T, \quad y_2 = +1
x_3 = [0.0 \;\; 1.0]^T, \quad y_3 = +1

Objective function:

Maximize:  L(\alpha) = \sum_{i=1}^3 \alpha_i - \frac{1}{2} \sum_{i=1}^3 \sum_{j=1}^3 \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
subject to:  \alpha_i \ge 0, \; i = 1, 2, 3, \quad \text{and} \quad \sum_{i=1}^3 \alpha_i y_i = 0
Linear SVMs (Example)

We introduce another Lagrange multiplier \lambda to obtain the Lagrangian

F(\alpha, \lambda) = L(\alpha) + \lambda \sum_{i=1}^3 \alpha_i y_i
                   = (\alpha_1 + \alpha_2 + \alpha_3) - \frac{1}{2}(\alpha_2^2 + \alpha_3^2) + \lambda(-\alpha_1 + \alpha_2 + \alpha_3)

Differentiating F(\alpha, \lambda) with respect to \lambda and \alpha_i and setting the results to zero, we obtain

\alpha_1 = 4, \quad \alpha_2 = 2, \quad \alpha_3 = 2, \quad \lambda = 1
Linear SVMs (Example)

Substituting the Lagrange multipliers into Eq. (8):

w = \sum_{i=1}^3 \alpha_i y_i x_i = [2 \;\; 2]^T
b = \frac{1}{y_1} - \langle w, x_1 \rangle = -1

The decision boundary is

\langle w, x \rangle + b = 0 \;\Rightarrow\; [2 \;\; 2]\,[x_1 \;\; x_2]^T - 1 = 0 \;\Rightarrow\; x_1 + x_2 = 0.5

[Figure: the three points with the decision boundary x_1 + x_2 = 0.5; Linear SVM, C=100, #SV=3, acc=100.00%, \|w\| = 2.83; axes x_1, x_2.]
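A numerical sanity check of this analytical solution (my own sketch): verify Eq. (8) and the margin conditions for \alpha = (4, 2, 2) with numpy.

```python
import numpy as np

# The 3-point problem from the slides.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
alpha = np.array([4.0, 2.0, 2.0])  # the analytical solution

# Eq. (8): w = sum_i alpha_i y_i x_i; the equality constraint sum_i alpha_i y_i = 0.
w = (alpha * y) @ X
print(w)                 # [2. 2.]
print(alpha @ y)         # 0.0, the constraint holds

# Bias from a support vector x_k: b = 1/y_k - <w, x_k>.
b = 1.0 / y[0] - w @ X[0]
print(b)                 # -1.0

# All three points lie exactly on the margin: y_i(<w, x_i> + b) = 1.
print(y * (X @ w + b))   # [1. 1. 1.]
```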
Linear SVMs (Example)

4-point linearly separable problem:

[Figure, left panel: Linear SVM, C=100, #SV=4, accuracy=100.00% (4 support vectors); axes x_1, x_2.]
[Figure, right panel: Linear SVM, C=100, #SV=3, accuracy=100.00% (3 support vectors); axes x_1, x_2.]
Linear SVMs (Non-linearly separable)

Non-linearly separable: patterns that cannot be separated by a linear decision boundary without incurring classification error.

[Figure: a 20-point data set containing data that cause classification errors in linear SVMs; axes x_1, x_2.]
Linear SVMs (Non-linearly separable)

We introduce a set of slack variables \xi = \{\xi_1, \xi_2, \ldots, \xi_N\} with \xi_i \ge 0:

y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i \quad \forall i

The slack variables allow some data to violate the constraints defined for the linearly separable case (Eq. 6):

y_i(\langle w, x_i \rangle + b) \ge 1 \quad \forall i

Therefore, for some \xi_k with \xi_k > 0, we have

y_k(\langle w, x_k \rangle + b) < 1, \quad \text{e.g.,} \; y_k(\langle w, x_k \rangle + b) = 0.5 \;\text{and}\; \xi_k = 0.8
Linear SVMs (Non-linearly separable)

E.g., \xi_{10} = 0.667 and \xi_{19} = 0.667 because x_{10} and x_{19} are inside the margins, i.e., they violate the constraint (Eq. 6).

[Figure: Linear SVM, C=1000.0, #SV=7, acc=95.00%, \|w\| = 0.94; the 20-point data set with x_{10} and x_{19} inside the margins; axes x_1, x_2.]
Linear SVMs (Non-linearly separable)

For non-separable cases:

Minimize:  \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i
subject to:  y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i

where C is a user-defined penalty parameter that penalizes any violation of the margins.

The Lagrangian becomes

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i \left( y_i(\langle w, x_i \rangle + b) - 1 + \xi_i \right) - \sum_{i=1}^N \mu_i \xi_i

where the multipliers \mu_i \ge 0 enforce \xi_i \ge 0.
Linear SVMs (Non-linearly separable)

Wolfe dual optimization:

Maximize:  L(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
subject to:  0 \le \alpha_i \le C, \; i = 1, \ldots, N, \quad \text{and} \quad \sum_{i=1}^N \alpha_i y_i = 0

The output weight vector and bias term are

w = \sum_{i=1}^N \alpha_i y_i x_i
b = \frac{1}{y_k} - \langle w, x_k \rangle, \quad \text{where } x_k \text{ is a support vector}
2. Linear SVMs (Types of SVs)

[Figure: Linear SVM, C=10.0, #SV=7, acc=95.00%, \|w\| = 0.94; the 20-point data set; axes x_1, x_2.]

Three types of support vectors:

1. On the margin: 0 < \alpha_i < C, \; \xi_i = 0, \; y_i(w^T x_i + b) = 1. E.g., \alpha_1 = 2.85 and \alpha_{11} = 0.44, with \xi_1 = \xi_{11} = 0.

2. Inside the margin: \alpha_i = C, \; 0 < \xi_i < 2, \; y_i(w^T x_i + b) = 1 - \xi_i. E.g., \alpha_{10} = 10, \xi_{10} = 0.667.

3. Outside the margin: \alpha_i = C, \; \xi_i > 2, \; y_i(w^T x_i + b) = 1 - \xi_i. E.g., \alpha_{20} = 10, \xi_{20} = 2.667.

(Non-support vectors, such as x_{17}, have \alpha_{17} = 0 and \xi_{17} = 0.)
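The same bookkeeping can be reproduced programmatically; the sketch below (my own, with invented data) classifies each support vector by comparing \alpha_i with C and computing its slack.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is available

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs (invented data).
X = np.vstack([rng.normal(2, 1.5, (20, 2)), rng.normal(5, 1.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

C = 10.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# |dual_coef_| gives alpha_i for the support vectors.
alpha = np.abs(clf.dual_coef_[0])
f = clf.decision_function(X[clf.support_])
xi = np.maximum(0.0, 1.0 - y[clf.support_] * f)

for a, s in zip(alpha, xi):
    if a < C - 1e-6:
        kind = "on the margin"      # 0 < alpha < C, xi = 0
    elif s < 2.0:
        kind = "inside the margin"  # alpha = C, 0 < xi < 2
    else:
        kind = "outside the margin" # alpha = C, xi > 2
    print(f"alpha={a:6.3f}  xi={s:5.3f}  {kind}")
```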
2. Linear SVMs (Types of SVs)

[Figure: Linear SVM, C=10.0, #SV=7, acc=95.00%, \|w\| = 0.94; the decision boundary w^T x + b = 0 with the margin hyperplanes w^T x + b = \pm 1 and the contours w^T x + b = \pm 0.33; axes x_1, x_2.]

For non-support vectors (\alpha_i = 0, \xi_i = 0):

y_i = +1: \; y_i(w^T x_i + b) > 1 \;\Rightarrow\; w^T x_i + b > 1
y_i = -1: \; y_i(w^T x_i + b) > 1 \;\Rightarrow\; w^T x_i + b < -1

For x_{20}: y_{20} = -1, \alpha_{20} = C, \xi_{20} = 2.67:

y_{20}(w^T x_{20} + b) = 1 - \xi_{20} = -1.67 \;\Rightarrow\; w^T x_{20} + b = 1.67
2. Linear SVMs (Types of SVs)

Swapping Class 1 and Class 2 yields the same analysis with all signs reversed: the decision boundary is unchanged, but w and b flip sign, so e.g. w^T x_{20} + b = -1.67.

[Figure: the same 20-point example with the class labels swapped; Linear SVM, C=10.0, #SV=7, acc=95.00%, \|w\| = 0.94; axes x_1, x_2.]
2. Linear SVMs (Types of SVs)

Effect of varying C:

[Figure, left panel: Linear SVM, C=0.1, #SV=10, acc=95.00%, \|w\| = 0.57, \sum_i \xi_i = 5.2; axes x_1, x_2.]
[Figure, right panel: Linear SVM, C=100.0, #SV=7, acc=95.00%, \|w\| = 0.94, \sum_i \xi_i = 4.0; axes x_1, x_2.]

A larger C penalizes margin violations more heavily: the total slack \sum_i \xi_i decreases, while \|w\| increases (i.e., the margin shrinks) and fewer points remain support vectors.
3. Non-linear SVMs

In case the training data X are not linearly separable, we may use a kernel function K(x, x_i) to map the data from the input space to a feature space where the data become linearly separable.

[Figure: a non-linear decision boundary separating y_i = +1 from y_i = -1 in the input space (domain X, axes x_1, x_2) becomes a linear decision boundary in the feature space under the kernel mapping K(x, x_i).]
3. Non-linear SVMs (Cont.)

The decision function becomes

f(x) = \mathrm{sgn}\left( \sum_{i=1}^N y_i \alpha_i K(x, x_i) + b \right)

For RBF kernels:

K(x, x_i) = \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right)

For polynomial kernels:

K(x, x_i) = \left( \langle x, x_i \rangle + 1 \right)^p, \quad p > 0
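Both kernels are straightforward to implement; here is a minimal numpy sketch of mine matching the formulas above (the sample vectors are invented):

```python
import numpy as np

def rbf_kernel(x, xi, sigma2=4.0):
    """RBF kernel K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma2))

def poly_kernel(x, xi, p=2):
    """Polynomial kernel K(x, x_i) = (<x, x_i> + 1)^p."""
    return (x @ xi + 1.0) ** p

x = np.array([1.0, 2.0])
xi = np.array([2.0, 0.0])
print(rbf_kernel(x, xi))   # exp(-5/8), about 0.535 (2*sigma^2 = 8 as in the slides)
print(poly_kernel(x, xi))  # (2 + 1)^2 = 9
```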
3. Non-linear SVMs (Cont.)

The optimization problem becomes:

Maximize:  W(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j)
subject to:  \alpha_i \ge 0, \; i = 1, \ldots, N, \quad \text{and} \quad \sum_{i=1}^N \alpha_i y_i = 0

That is, the inner product \langle x_i, x_j \rangle in the linear dual (Eq. 9) is replaced by the kernel K(x_i, x_j).
3. Non-linear SVMs (Cont.)

The effect of varying C on RBF-SVMs:

[Figure, left panel: RBF SVM, 2\sigma^2 = 8.0, C = 10.0, #SV = 9, acc = 90.00%, \sum_i \xi_i = 3.09; axes x_1, x_2.]
[Figure, right panel: RBF SVM, 2\sigma^2 = 8.0, C = 1000.0, #SV = 7, acc = 100.00%, \sum_i \xi_i = 0.0; axes x_1, x_2.]
3. Non-linear SVMs (Cont.)

The effect of varying C on polynomial SVMs:

[Figure, left panel: Polynomial SVM, degree = 2, C = 10.0, #SV = 7, acc = 90.00%, \sum_i \xi_i = 2.99; axes x_1, x_2.]
[Figure, right panel: Polynomial SVM, degree = 2, C = 1000.0, #SV = 8, acc = 90.00%, \sum_i \xi_i = 2.97; axes x_1, x_2.]