The support vector machine - SVCL


The support vector machine

Nuno Vasconcelos, ECE Department, UCSD

Outline
we have talked about classification and linear discriminants
then we did a detour to talk about kernels
• how do we implement a non-linear boundary on the linear discriminant framework?
this led us to RKHS
• the RKHS, the reproducing theorem, regularization
• the idea that all these learning problems boil down to the solution of an optimization problem
we have seen how to solve optimization problems
• with and without constraints, equalities, inequalities

today we will put everything together by studying the SVM

Classification
a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
e.g.
• x ∈ X ⊂ R² = (fever, blood pressure)
• y ∈ Y = {disease, no disease}
X and Y are related by a (unknown) function

  $y = f(x)$

goal: design a classifier h: X → Y such that h(x) = f(x) ∀x

Linear classifier
a linear classifier implements the decision rule

  $h(x) = \text{sgn}[g(x)], \quad \text{with} \quad g(x) = w^T x + b$

i.e. $h(x) = 1$ if $g(x) > 0$ and $h(x) = -1$ if $g(x) < 0$

it has the properties
• it divides X into two "half-spaces"
• the boundary is the plane with:
  • normal w
  • distance to the origin b/||w||
• g(x)/||w|| is the distance from point x to the boundary
• g(x) = 0 for points on the plane
• g(x) > 0 on the side w points to ("positive side")
• g(x) < 0 on the "negative side"
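As a quick illustration, here is a minimal numpy sketch of this decision rule; the values of w, b and the points X are arbitrary toy numbers, not from the slides:

```python
import numpy as np

def linear_classifier(X, w, b):
    """Apply h(x) = sgn(w^T x + b) to each row of X."""
    g = X @ w + b                     # g(x) for every point
    return np.where(g > 0, 1, -1)     # "positive side" -> +1, otherwise -1

# toy example (hypothetical numbers)
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 1.0]])
print(linear_classifier(X, w, b))           # predicted labels
print((X @ w + b) / np.linalg.norm(w))      # g(x)/||w||: distances to the boundary
```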

Linear classifier
we have a classification error if
• y = 1 and g(x) < 0, or y = -1 and g(x) > 0
• i.e. y·g(x) < 0
and a correct classification if
• y = 1 and g(x) > 0, or y = -1 and g(x) < 0
• i.e. y·g(x) > 0
note that, for a linearly separable training set D = {(x₁,y₁), ..., (xₙ,yₙ)}, we can have zero empirical risk
the necessary and sufficient condition is that

  $y_i \left( w^T x_i + b \right) > 0, \quad \forall i$
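This condition is easy to check numerically; a small sketch (X, y, w, b are placeholders for a training set and a candidate plane):

```python
import numpy as np

def separates(X, y, w, b):
    """True iff y_i (w^T x_i + b) > 0 for all i, i.e. zero empirical risk."""
    return bool(np.all(y * (X @ w + b) > 0))
```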

The margin
the margin is the distance from the boundary to the closest point

  $\gamma = \min_i \frac{\left| w^T x_i + b \right|}{\|w\|}$

there will be no error if it is strictly greater than zero

  $y_i \left( w^T x_i + b \right) > 0, \ \forall i \iff \gamma > 0$

note that this is ill-defined in the sense that γ does not change if both w and b are scaled by λ

we need a normalization

Maximizing the margin
this is similar to what we have seen for Fisher discriminants
let's assume we have selected some normalization, e.g. ||w|| = 1
the next question is: what is the cost that we are going to optimize?
there are several planes that separate the classes, which one is best?
recall that, in the case of the Perceptron, we have seen that the margin determines the complexity of the learning problem
• the Perceptron converges in less than $(k/\gamma)^2$ iterations

it sounds like maximizing the margin is a good idea.

Maximizing the margin
intuition 1:
• think of each point in the training set as a sample from a probability density centered on it
• if we draw another sample, we will not get the same points
• each point is really a pdf with a certain variance
• this is a kernel density estimate
• if we leave a margin of γ on the training set, we are safe against this uncertainty
• (as long as the radius of support of the pdf is smaller than γ)
• the larger γ, the more robust the classifier!

Maximizing the margin
intuition 2:
• think of the plane as an uncertain estimate because it is learned from a sample drawn at random
• since the sample changes from draw to draw, the plane parameters are random variables of non-zero variance
• instead of a single plane we have a probability distribution over planes
• the larger the margin, the larger the number of planes that will not originate errors

• the larger γ, the larger the variance allowed on the plane parameter estimates!

Normalization
we will go over a formal proof in a few classes
for now, let's look at the normalization
the natural choice is ||w|| = 1
the margin is maximized by solving

  $\max_{w,b} \min_i \left| w^T x_i + b \right| \quad \text{subject to} \quad \|w\|^2 = 1$

this is somewhat complex
• need to find the points that achieve the min without knowing what w and b are

• optimization problem with quadratic constraints

Normalization
a more convenient normalization is to make |g(x)| = 1 for the closest point, i.e.

  $\min_i \left| w^T x_i + b \right| \equiv 1$

under which

  $\gamma = \frac{1}{\|w\|}$

the SVM is the classifier that maximizes the margin under these constraints

  $\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \left( w^T x_i + b \right) \geq 1, \ \forall i$
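As a concrete (if naive) illustration, this primal can be handed directly to a generic constrained optimizer. This is only a sketch for a small, linearly separable toy set, not how SVMs are solved in practice (dedicated QP/SMO solvers are used; the data below is made up):

```python
import numpy as np
from scipy.optimize import minimize

def svm_primal(X, y):
    """Minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1, for all i."""
    n, d = X.shape
    objective = lambda z: 0.5 * z[:d] @ z[:d]          # z = [w, b]
    constraints = [{'type': 'ineq',
                    'fun': lambda z, i=i: y[i] * (X[i] @ z[:d] + z[d]) - 1.0}
                   for i in range(n)]
    res = minimize(objective, np.zeros(d + 1), constraints=constraints,
                   method='SLSQP')
    return res.x[:d], res.x[d]

# toy, linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_primal(X, y)
print(w, b, 1.0 / np.linalg.norm(w))   # normal, offset, and resulting margin
```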

Support vector machine
the SVM problem is

  $\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \left( w^T x_i + b \right) \geq 1, \ \forall i$

last time we proved the following theorem
Theorem (strong duality): consider the problem

  $x^* = \arg\min_{x \in X} f(x) \quad \text{subject to} \quad e_j^T x - d_j \leq 0$

where X and f are convex, and the optimal value f* is finite. Then there is at least one Lagrange multiplier vector and there is no duality gap.
this means the SVM problem can be solved by solving its dual

The dual problem
for the primal

  $\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \left( w^T x_i + b \right) \geq 1, \ \forall i$

the dual problem is

  $\max_{\alpha \geq 0} q(\alpha) = \max_{\alpha \geq 0} \min_{w,b} L(w,b,\alpha)$

where the Lagrangian is

  $L(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i \left( w^T x_i + b \right) - 1 \right]$

setting derivatives to zero

  $\nabla_w L(w,b,\alpha) = 0 \iff w - \sum_i \alpha_i y_i x_i = 0 \iff w^* = \sum_i \alpha_i y_i x_i$

  $\nabla_b L(w,b,\alpha) = 0 \iff \sum_i \alpha_i y_i = 0$

The dual problem
plugging back

  $w^* = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0$

we get the Lagrangian

  $L(w^*,b,\alpha) = \frac{1}{2}\|w^*\|^2 - \sum_i \alpha_i \left[ y_i \left( w^{*T} x_i + b \right) - 1 \right]$

  $= \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{ij} \alpha_i \alpha_j y_i y_j x_i^T x_j - b \underbrace{\sum_i \alpha_i y_i}_{=0} + \sum_i \alpha_i$

  $= -\frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_i \alpha_i$

The dual problem
and the dual problem is

  $\max_{\alpha \geq 0} \left\{ \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j x_i^T x_j \right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0$

once this is solved, the vector

  $w^* = \sum_i \alpha_i y_i x_i$

is the normal to the maximum margin plane
note:

• the dual solution does not determine the optimal b*, since b drops off when we take derivatives
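A minimal sketch of solving this dual with a generic solver (again only for toy data; real implementations use specialized QP/SMO algorithms; X and y are the same made-up separable set as before):

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y):
    """max_a sum(a) - 0.5 a^T Q a, with Q_ij = y_i y_j x_i^T x_j, a >= 0, sum(a*y) = 0."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()   # minimize the negative dual
    res = minimize(neg_dual, np.zeros(n),
                   bounds=[(0, None)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
                   method='SLSQP')
    alpha = res.x
    w = (alpha * y) @ X                              # w* = sum_i alpha_i y_i x_i
    return alpha, w

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = svm_dual(X, y)
print(np.round(alpha, 3), w)    # only a few points get alpha_i > 0
```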

The dual problem
determining b*
• various possibilities, for example
• pick one point x⁺ on the margin on the y=1 side and one point x⁻ on the y=-1 side
• use the margin constraints

  $\left. \begin{array}{l} w^{*T} x^+ + b = 1 \\ w^{*T} x^- + b = -1 \end{array} \right\} \iff b^* = -\frac{1}{2} w^{*T} \left( x^+ + x^- \right)$

note:
• the maximum margin solution guarantees that there is always at least one point "on the margin" (at distance 1/||w*||) on each side
• if not, we could move the plane and get a larger margin
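Continuing the sketch above, b* can be recovered from one margin point on each side; in the hard-margin case every point with α_i > 0 lies on the margin, so any such point on each side will do (the tolerance for detecting α_i > 0 is an arbitrary choice):

```python
import numpy as np

def svm_offset(X, y, alpha, w, tol=1e-6):
    """b* = -1/2 w*^T (x+ + x-), with x+/x- margin points on each side."""
    sv = alpha > tol
    x_pos = X[sv & (y > 0)][0]      # any support vector with y = +1
    x_neg = X[sv & (y < 0)][0]      # any support vector with y = -1
    return -0.5 * w @ (x_pos + x_neg)
```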

Support vectors
another possibility is to average over all points on the margin
these are called support vectors
from the KKT conditions, an inactive constraint has zero Lagrange multiplier αi. That is, either
• i) $\alpha_i > 0$ and $y_i \left( w^{*T} x_i + b^* \right) = 1$, or
• ii) $\alpha_i = 0$ and $y_i \left( w^{*T} x_i + b^* \right) \geq 1$
hence αi > 0 only for points with

  $\left| w^{*T} x_i + b^* \right| = 1$

which are those that lie at a distance equal to the margin

Support vectors
points with αi > 0 support the optimal plane (w*, b*)
for this reason they are called "support vectors"
note that the decision rule is

  $f(x) = \text{sgn}\left[ w^{*T} x + b^* \right] = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i x_i^T x + b^* \right]$

where SV = {i | αi* > 0} is the set of support vectors

Support vectors
since the decision rule is

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i x_i^T x + b^* \right]$

we only need the support vectors to completely define the classifier
we can literally throw away all other points!
the Lagrange multipliers can also be seen as a measure of the importance of each point
points with αi = 0 have no influence: a small perturbation of them does not change the solution
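This is literally how prediction can be coded: keep only the rows with α_i > 0 and discard the rest. A sketch continuing the earlier toy example (the tolerance is again an arbitrary choice):

```python
import numpy as np

def svm_compress(X, y, alpha, tol=1e-6):
    """Keep only the support vectors; everything else can be thrown away."""
    sv = alpha > tol
    return X[sv], y[sv], alpha[sv]

def svm_predict(x, X_sv, y_sv, a_sv, b):
    """f(x) = sgn( sum_{i in SV} alpha_i y_i x_i^T x + b )."""
    return np.sign((a_sv * y_sv) @ (X_sv @ x) + b)
```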

Perceptron learning
note the similarities with the dual Perceptron:

  set αi = 0, b = 0, R = maxi ||xi||
  do {
    for i = 1:n {
      if  $y_i \left( \sum_{j=1}^n \alpha_j y_j x_j^T x_i + b \right) \leq 0$  then {
        αi = αi + 1
        b = b + yi R²
      }
    }
  } until no errors

in this case:
• αi = 0 means that the point was never misclassified
• this means that we have an "easy" point, far from the boundary
• very unlikely to happen for a support vector
• but the Perceptron does not maximize the margin!
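For comparison, here is a direct transcription of the dual Perceptron pseudocode above into Python (a sketch that assumes the data is linearly separable; otherwise the loop is capped by max_epochs):

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=1000):
    """Dual Perceptron: alpha_i counts the mistakes made on point i."""
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    R2 = np.max(np.sum(X ** 2, axis=1))     # R^2 = max_i ||x_i||^2
    G = X @ X.T                              # Gram matrix of all dot products
    for _ in range(max_epochs):
        errors = False
        for i in range(n):
            if y[i] * ((alpha * y) @ G[:, i] + b) <= 0:
                alpha[i] += 1                # one more mistake on x_i
                b += y[i] * R2
                errors = True
        if not errors:
            break
    return alpha, b
```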

The robustness of SVMs
we have talked a lot about the "curse of dimensionality"
• the number of examples required to achieve a certain precision is exponential in the number of dimensions
it turns out that SVMs are remarkably robust to dimensionality
• not uncommon to see successful applications on 1,000D+ spaces
two main reasons for this:
• 1) all that the SVM does is to learn a plane. Although the number of dimensions may be large, the number of parameters is relatively small and there is not much room for overfitting
  In fact, d+1 points are enough to specify the decision rule in Rᵈ!

SVMs as feature selectors
the second reason is that the space is not really that large
• 2) the SVM is a feature selector
To see this, let's look at the decision function

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i x_i^T x + b^* \right]$

this is a thresholding of the quantity

  $\sum_{i \in SV} \alpha_i^* y_i x_i^T x$

note that each of the terms xiᵀx is the projection of the vector to classify (x) onto the training vector xi

SVMs as feature selectors
defining z as the vector of the projections onto all support vectors

  $z(x) = \left( x_{i_1}^T x, \ldots, x_{i_k}^T x \right)^T$

the decision function is a plane in the z-space

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i x_i^T x + b^* \right] = \text{sgn}\left[ \sum_k w_k^* z_k(x) + b^* \right]$

with

  $w^* = \left( \alpha_{i_1}^* y_{i_1}, \ldots, \alpha_{i_k}^* y_{i_k} \right)^T$

this means that
• the classifier operates on the span of the support vectors!
• the SVM performs feature selection automatically

SVMs as feature selectors
geometrically, we have:
• 1) a projection onto the span of the support vectors

  $z(x) = \left( x_{i_1}^T x, \ldots, x_{i_k}^T x \right)^T$

• 2) a linear classifier (w*, b*) on this space, with

  $w^* = \left( \alpha_{i_1}^* y_{i_1}, \ldots, \alpha_{i_k}^* y_{i_k} \right)^T$

• the effective dimension is |SV| and, typically, |SV| << n
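A small sketch of this view: project x onto the span of the support vectors and classify linearly in that |SV|-dimensional space (the names continue the earlier toy sketches and are illustrative only):

```python
import numpy as np

def to_sv_space(x, X_sv):
    """z(x) = (x_{i1}^T x, ..., x_{ik}^T x): coordinates on the SV 'features'."""
    return X_sv @ x

def predict_in_z_space(x, X_sv, y_sv, a_sv, b):
    """Linear classifier in z-space with weights w_k = alpha_k y_k."""
    w_z = a_sv * y_sv
    return np.sign(w_z @ to_sv_space(x, X_sv) + b)
```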

In summary
SVM training:
• 1) solve

  $\max_{\alpha \geq 0} \left\{ \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j x_i^T x_j \right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0$

• 2) then compute

  $w^* = \sum_{i \in SV} \alpha_i^* y_i x_i, \qquad b^* = -\frac{1}{2} \sum_{i \in SV} \alpha_i^* y_i \left( x_i^T x^+ + x_i^T x^- \right)$

decision function:

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i x_i^T x + b^* \right]$

Practical implementations
in practice we need an algorithm for solving the optimization problem of the training stage
• this is still a complex problem
• there has been a large amount of research in this area
• coming up with "your own" algorithm is not going to be competitive
• luckily, there are various packages available, e.g.:
  • libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • SVM light: http://www.cs.cornell.edu/People/tj/svm_light/
  • SVM fu: http://five-percent-nation.mit.edu/SvmFu/
  • various others (see http://www.support-vector.net/software.html)
• also many papers and books on algorithms (see e.g. B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002)
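For example, scikit-learn wraps libSVM; a minimal usage sketch with made-up data (the C parameter controls the soft margin, which these slides do not cover, so a large C is used here to approximate the hard-margin SVM discussed above):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=1e6)     # large C ~ hard margin
clf.fit(X, y)
print(clf.support_vectors_)           # the support vectors
print(clf.dual_coef_)                 # alpha_i * y_i for the support vectors
print(clf.intercept_)                 # b*
print(clf.predict([[1.0, 1.5]]))
```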

Kernelization
note that all equations depend only on xiᵀxj
the kernel trick is trivial: replace xiᵀxj by K(xi, xj)
• 1) training:

  $\max_{\alpha \geq 0} \left\{ \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0$

• 2) decision function:

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) + b^* \right], \quad \text{with} \quad b^* = -\frac{1}{2} \sum_{i \in SV} \alpha_i^* y_i \left[ K(x_i, x^+) + K(x_i, x^-) \right]$
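The change to the earlier dual sketch is literally one line: build the kernel matrix K(x_i, x_j) instead of the Gram matrix x_iᵀx_j. A sketch with a Gaussian (RBF) kernel; the bandwidth gamma is an arbitrary illustrative choice:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B, gamma=0.5):
    """K(a, b) = exp(-gamma ||a - b||^2), for all pairs of rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_svm_dual(X, y, kernel=rbf_kernel):
    """Same dual as before, with the Gram matrix replaced by the kernel matrix."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * kernel(X, X)
    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(), np.zeros(n),
                   bounds=[(0, None)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
                   method='SLSQP')
    return res.x

def kernel_svm_predict(x, X, y, alpha, b=0.0, kernel=rbf_kernel, tol=1e-6):
    """f(x) = sgn( sum_{i in SV} alpha_i y_i K(x_i, x) + b )."""
    sv = alpha > tol
    k = kernel(X[sv], x[None, :])[:, 0]
    return np.sign((alpha[sv] * y[sv]) @ k + b)
```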

Kernelization
notes:
• as usual, this follows from the fact that nothing of what we did really requires us to be in Rᵈ
• we could have simply used the notation ⟨xi, xj⟩ for the dot product and all the equations would still hold
• the only difference is that we can no longer recover w* explicitly without determining the feature transformation Φ, since

  $w^* = \sum_{i \in SV} \alpha_i^* y_i \Phi(x_i)$

• this could have infinite dimension, e.g. we have seen that it is a sum of Gaussians when we use the Gaussian kernel
• but, luckily, we don't really need w*, only the decision function

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) + b^* \right]$

Input space interpretation
when we introduce a kernel, what is the SVM doing in the input space?
let's look again at the decision function

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) + b^* \right]$

with

  $b^* = -\frac{1}{2} \sum_{i \in SV} \alpha_i^* y_i \left[ K(x_i, x^+) + K(x_i, x^-) \right]$

note that
• x⁺ and x⁻ are support vectors
• assume that the kernel has reduced support when compared to the distance between support vectors

Input space interpretation
note that
• assuming that the kernel has reduced support when compared to the distance between support vectors

  $b^* = -\frac{1}{2} \sum_{i \in SV} \alpha_i^* y_i \left[ K(x_i, x^+) + K(x_i, x^-) \right] \approx -\frac{1}{2} \left[ \alpha_+^* K(x^+, x^+) - \alpha_-^* K(x^-, x^-) \right] \approx 0$

• where we have also assumed that α⁺ ~ α⁻
• these assumptions are not crucial, but simplify what follows
• namely, the decision function becomes

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) \right]$

Input space interpretation
or

  $f(x) = \begin{cases} 1, & \text{if } \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) \geq 0 \\ -1, & \text{if } \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) < 0 \end{cases}$

rewriting

  $\sum_{i \in SV} \alpha_i^* y_i K(x_i, x) = \sum_{i | y_i > 0} \alpha_i^* K(x_i, x) - \sum_{i | y_i < 0} \alpha_i^* K(x_i, x)$

this is

  $f(x) = \begin{cases} 1, & \text{if } \sum_{i | y_i > 0} \alpha_i^* K(x_i, x) \geq \sum_{i | y_i < 0} \alpha_i^* K(x_i, x) \\ -1, & \text{otherwise} \end{cases}$

Input space interpretation
or, defining the normalizing constants

  $\pi = \frac{\sum_{i | y_i > 0} \alpha_i^*}{\sum_i \alpha_i^*}, \qquad \beta = \frac{\sum_{i | y_i < 0} \alpha_i^*}{\sum_i \alpha_i^*}$

which is the same as

  $f(x) = \begin{cases} 1, & \text{if } \pi \dfrac{\sum_{i | y_i > 0} \alpha_i^* K(x_i, x)}{\sum_{i | y_i > 0} \alpha_i^*} \geq \beta \dfrac{\sum_{i | y_i < 0} \alpha_i^* K(x_i, x)}{\sum_{i | y_i < 0} \alpha_i^*} \\ -1, & \text{otherwise} \end{cases}$

Input space interpretation
note that this is the Bayesian decision rule for
• 1) class 1 with likelihood $\sum_{i | y_i > 0} \frac{\alpha_i^*}{\sum_{j | y_j > 0} \alpha_j^*} K(x_i, x)$ and prior $\pi = \sum_{i | y_i > 0} \alpha_i^* / \sum_i \alpha_i^*$
• 2) class 2 with likelihood $\sum_{i | y_i < 0} \frac{\alpha_i^*}{\sum_{j | y_j < 0} \alpha_j^*} K(x_i, x)$ and prior $\beta = \sum_{i | y_i < 0} \alpha_i^* / \sum_i \alpha_i^*$
these likelihood functions
• are a kernel density estimate if K(·, xi) is a valid pdf
• a peculiar kernel estimate that only places kernels around the support vectors
• all other points are ignored

Input space interpretation
this is a discriminant form of density estimation
• it concentrates the modeling power where it matters the most, i.e. near the classification boundary
• smart, since points away from the boundary are always well classified, even if the density estimates in their region are poor
• the SVM can be seen as a highly efficient combination of the BDR with kernel density estimates
• recall that one major problem of kernel estimates is the complexity of the decision function, O(n)
• with the SVM, the complexity is only O(|SV|), but nothing is lost
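This interpretation is easy to verify numerically: with b ≈ 0, comparing the two weighted kernel sums (one per class, with kernels centered only on the support vectors) gives the same label as the SVM decision function. A sketch reusing the kernel helpers from the earlier kernelization example:

```python
import numpy as np

def bdr_view(x, X_sv, y_sv, a_sv, kernel):
    """Compare the two class 'prior x likelihood' scores built from the SVs."""
    k = kernel(X_sv, x[None, :])[:, 0]
    score_pos = np.sum(a_sv[y_sv > 0] * k[y_sv > 0])   # class 1 score
    score_neg = np.sum(a_sv[y_sv < 0] * k[y_sv < 0])   # class 2 score
    return 1 if score_pos >= score_neg else -1          # same as sgn(sum alpha_i y_i K)
```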

Input space interpretation
note on the approximations made:
• this result was derived assuming b ≈ 0
• in practice, b is frequently left as a parameter which is used to trade off false positives for misses
• here, that can be done by controlling the BDR threshold

  $f(x) = \begin{cases} 1, & \text{if } \dfrac{\sum_{i | y_i > 0} \alpha_i^* K(x_i, x)}{\sum_{i | y_i < 0} \alpha_i^* K(x_i, x)} \geq T \\ -1, & \text{otherwise} \end{cases}$

• hence, there is really not much practical difference, even when the assumption of b* = 0 does not hold!