The support vector machine - SVCL


The support vector machine

Nuno Vasconcelos, ECE Department, UCSD

Outline
we have talked about classification and linear discriminants
then we did a detour to talk about kernels
• how do we implement a non-linear boundary on the linear discriminant framework?
this led us to RKHS
• the RKHS, the reproducing theorem, regularization
• the idea that all these learning problems boil down to the solution of an optimization problem
we have seen how to solve optimization problems
• with and without constraints, equalities, inequalities

today we will put everything together by studying the SVM

Classification
a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
e.g.
• x ∈ X ⊂ R² = (fever, blood pressure)
• y ∈ Y = {disease, no disease}
X and Y are related by a (unknown) function

  $y = f(x)$

goal: design a classifier h: X → Y such that h(x) = f(x) ∀x

Linear classifier
a linear classifier implements the decision rule

  $h(x) = \text{sgn}[g(x)], \quad \text{with} \quad g(x) = w^T x + b$

i.e. $h(x) = 1$ if $g(x) > 0$ and $h(x) = -1$ if $g(x) < 0$

it has the properties
• it divides X into two "half-spaces"
• the boundary is the plane with:
  • normal w
  • distance to the origin b/||w||
• g(x)/||w|| is the distance from point x to the boundary
• g(x) = 0 for points on the plane
• g(x) > 0 on the side w points to ("positive side")
• g(x) < 0 on the "negative side"
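As a quick illustration, here is a minimal numpy sketch of this decision rule; the values of w, b and the points X are arbitrary toy numbers, not from the slides:

```python
import numpy as np

def linear_classifier(X, w, b):
    """Apply h(x) = sgn(w^T x + b) to each row of X."""
    g = X @ w + b                     # g(x) for every point
    return np.where(g > 0, 1, -1)     # "positive side" -> +1, otherwise -1

# toy example (hypothetical numbers)
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 1.0]])
print(linear_classifier(X, w, b))           # predicted labels
print((X @ w + b) / np.linalg.norm(w))      # g(x)/||w||: distances to the boundary
```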

Linear classifier
we have a classification error if
• y = 1 and g(x) < 0, or y = -1 and g(x) > 0
• i.e. y·g(x) < 0
and a correct classification if
• y = 1 and g(x) > 0, or y = -1 and g(x) < 0
• i.e. y·g(x) > 0
note that, for a linearly separable training set D = {(x₁,y₁), ..., (xₙ,yₙ)}, we can have zero empirical risk
the necessary and sufficient condition is that

  $y_i \left( w^T x_i + b \right) > 0, \quad \forall i$
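This condition is easy to check numerically; a small sketch (X, y, w, b are placeholders for a training set and a candidate plane):

```python
import numpy as np

def separates(X, y, w, b):
    """True iff y_i (w^T x_i + b) > 0 for all i, i.e. zero empirical risk."""
    return bool(np.all(y * (X @ w + b) > 0))
```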

The margin
the margin is the distance from the boundary to the closest point

  $\gamma = \min_i \frac{\left| w^T x_i + b \right|}{\|w\|}$

there will be no error if it is strictly greater than zero

  $y_i \left( w^T x_i + b \right) > 0, \ \forall i \iff \gamma > 0$

note that this is ill-defined in the sense that γ does not change if both w and b are scaled by λ

we need a normalization

Maximizing the margin
this is similar to what we have seen for Fisher discriminants
let's assume we have selected some normalization, e.g. ||w|| = 1
the next question is: what is the cost that we are going to optimize?
there are several planes that separate the classes, which one is best?
recall that, in the case of the Perceptron, we have seen that the margin determines the complexity of the learning problem
• the Perceptron converges in less than $(k/\gamma)^2$ iterations

it sounds like maximizing the margin is a good idea.

Maximizing the margin
intuition 1:
• think of each point in the training set as a sample from a probability density centered on it
• if we draw another sample, we will not get the same points
• each point is really a pdf with a certain variance
• this is a kernel density estimate
• if we leave a margin of γ on the training set, we are safe against this uncertainty
• (as long as the radius of support of the pdf is smaller than γ)
• the larger γ, the more robust the classifier!

Maximizing the margin
intuition 2:
• think of the plane as an uncertain estimate because it is learned from a sample drawn at random
• since the sample changes from draw to draw, the plane parameters are random variables of non-zero variance
• instead of a single plane we have a probability distribution over planes
• the larger the margin, the larger the number of planes that will not originate errors

• the larger γ, the larger the variance allowed on the plane parameter estimates!

Normalization
we will go over a formal proof in a few classes
for now, let's look at the normalization
the natural choice is ||w|| = 1
the margin is maximized by solving

  $\max_{w,b} \min_i \left| w^T x_i + b \right| \quad \text{subject to} \quad \|w\|^2 = 1$

this is somewhat complex
• need to find the points that achieve the min without knowing what w and b are

• optimization problem with quadratic constraints

Normalization
a more convenient normalization is to make |g(x)| = 1 for the closest point, i.e.

  $\min_i \left| w^T x_i + b \right| \equiv 1$

under which

  $\gamma = \frac{1}{\|w\|}$

the SVM is the classifier that maximizes the margin under these constraints

  $\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \left( w^T x_i + b \right) \geq 1, \ \forall i$
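As a concrete (if naive) illustration, this primal can be handed directly to a generic constrained optimizer. This is only a sketch for a small, linearly separable toy set, not how SVMs are solved in practice (dedicated QP/SMO solvers are used; the data below is made up):

```python
import numpy as np
from scipy.optimize import minimize

def svm_primal(X, y):
    """Minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1, for all i."""
    n, d = X.shape
    objective = lambda z: 0.5 * z[:d] @ z[:d]          # z = [w, b]
    constraints = [{'type': 'ineq',
                    'fun': lambda z, i=i: y[i] * (X[i] @ z[:d] + z[d]) - 1.0}
                   for i in range(n)]
    res = minimize(objective, np.zeros(d + 1), constraints=constraints,
                   method='SLSQP')
    return res.x[:d], res.x[d]

# toy, linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_primal(X, y)
print(w, b, 1.0 / np.linalg.norm(w))   # normal, offset, and resulting margin
```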

Support vector machine
the SVM problem is

  $\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \left( w^T x_i + b \right) \geq 1, \ \forall i$

last time we proved the following theorem
Theorem (strong duality): consider the problem

  $x^* = \arg\min_{x \in X} f(x) \quad \text{subject to} \quad e_j^T x - d_j \leq 0$

where X and f are convex, and the optimal value f* is finite. Then there is at least one Lagrange multiplier vector and there is no duality gap.
this means the SVM problem can be solved by solving its dual

The dual problem
for the primal

  $\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i \left( w^T x_i + b \right) \geq 1, \ \forall i$

the dual problem is

  $\max_{\alpha \geq 0} q(\alpha) = \max_{\alpha \geq 0} \min_{w,b} L(w,b,\alpha)$

where the Lagrangian is

  $L(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i \left( w^T x_i + b \right) - 1 \right]$

setting derivatives to zero

  $\nabla_w L(w,b,\alpha) = 0 \iff w - \sum_i \alpha_i y_i x_i = 0 \iff w^* = \sum_i \alpha_i y_i x_i$

  $\nabla_b L(w,b,\alpha) = 0 \iff \sum_i \alpha_i y_i = 0$

The dual problem
plugging back

  $w^* = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0$

we get the Lagrangian

  $L(w^*,b,\alpha) = \frac{1}{2}\|w^*\|^2 - \sum_i \alpha_i \left[ y_i \left( w^{*T} x_i + b \right) - 1 \right]$

  $= \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{ij} \alpha_i \alpha_j y_i y_j x_i^T x_j - b \underbrace{\sum_i \alpha_i y_i}_{=0} + \sum_i \alpha_i$

  $= -\frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_i \alpha_i$

The dual problem
and the dual problem is

  $\max_{\alpha \geq 0} \left\{ \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j x_i^T x_j \right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0$

once this is solved, the vector

  $w^* = \sum_i \alpha_i y_i x_i$

is the normal to the maximum margin plane
note:

• the dual solution does not determine the optimal b*, since b drops off when we take derivatives
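A minimal sketch of solving this dual with a generic solver (again only for toy data; real implementations use specialized QP/SMO algorithms; X and y are the same made-up separable set as before):

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y):
    """max_a sum(a) - 0.5 a^T Q a, with Q_ij = y_i y_j x_i^T x_j, a >= 0, sum(a*y) = 0."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()   # minimize the negative dual
    res = minimize(neg_dual, np.zeros(n),
                   bounds=[(0, None)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
                   method='SLSQP')
    alpha = res.x
    w = (alpha * y) @ X                              # w* = sum_i alpha_i y_i x_i
    return alpha, w

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = svm_dual(X, y)
print(np.round(alpha, 3), w)    # only a few points get alpha_i > 0
```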

The dual problem
determining b*
• various possibilities, for example
• pick one point x⁺ on the margin on the y=1 side and one point x⁻ on the y=-1 side
• use the margin constraints

  $\left. \begin{array}{l} w^{*T} x^+ + b = 1 \\ w^{*T} x^- + b = -1 \end{array} \right\} \iff b^* = -\frac{1}{2} w^{*T} \left( x^+ + x^- \right)$

note:
• the maximum margin solution guarantees that there is always at least one point "on the margin" (at distance 1/||w*||) on each side
• if not, we could move the plane and get a larger margin
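Continuing the sketch above, b* can be recovered from one margin point on each side; in the hard-margin case every point with α_i > 0 lies on the margin, so any such point on each side will do (the tolerance for detecting α_i > 0 is an arbitrary choice):

```python
import numpy as np

def svm_offset(X, y, alpha, w, tol=1e-6):
    """b* = -1/2 w*^T (x+ + x-), with x+/x- margin points on each side."""
    sv = alpha > tol
    x_pos = X[sv & (y > 0)][0]      # any support vector with y = +1
    x_neg = X[sv & (y < 0)][0]      # any support vector with y = -1
    return -0.5 * w @ (x_pos + x_neg)
```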

Support vectors
another possibility is to average over all points on the margin
these are called support vectors
from the KKT conditions, an inactive constraint has zero Lagrange multiplier αi. That is, either
• i) $\alpha_i > 0$ and $y_i \left( w^{*T} x_i + b^* \right) = 1$, or
• ii) $\alpha_i = 0$ and $y_i \left( w^{*T} x_i + b^* \right) \geq 1$
hence αi > 0 only for points with

  $\left| w^{*T} x_i + b^* \right| = 1$

which are those that lie at a distance equal to the margin

Support vectors
points with αi > 0 support the optimal plane (w*, b*)
for this reason they are called "support vectors"
note that the decision rule is

  $f(x) = \text{sgn}\left[ w^{*T} x + b^* \right] = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i x_i^T x + b^* \right]$

where SV = {i | αi* > 0} is the set of support vectors

Support vectors
since the decision rule is

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i x_i^T x + b^* \right]$

we only need the support vectors to completely define the classifier
we can literally throw away all other points!
the Lagrange multipliers can also be seen as a measure of the importance of each point
points with αi = 0 have no influence: a small perturbation of them does not change the solution
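This is literally how prediction can be coded: keep only the rows with α_i > 0 and discard the rest. A sketch continuing the earlier toy example (the tolerance is again an arbitrary choice):

```python
import numpy as np

def svm_compress(X, y, alpha, tol=1e-6):
    """Keep only the support vectors; everything else can be thrown away."""
    sv = alpha > tol
    return X[sv], y[sv], alpha[sv]

def svm_predict(x, X_sv, y_sv, a_sv, b):
    """f(x) = sgn( sum_{i in SV} alpha_i y_i x_i^T x + b )."""
    return np.sign((a_sv * y_sv) @ (X_sv @ x) + b)
```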

Perceptron learning
note the similarities with the dual Perceptron:

  set αi = 0, b = 0, R = maxi ||xi||
  do {
    for i = 1:n {
      if  $y_i \left( \sum_{j=1}^n \alpha_j y_j x_j^T x_i + b \right) \leq 0$  then {
        αi = αi + 1
        b = b + yi R²
      }
    }
  } until no errors

in this case:
• αi = 0 means that the point was never misclassified
• this means that we have an "easy" point, far from the boundary
• very unlikely to happen for a support vector
• but the Perceptron does not maximize the margin!
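For comparison, here is a direct transcription of the dual Perceptron pseudocode above into Python (a sketch that assumes the data is linearly separable; otherwise the loop is capped by max_epochs):

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=1000):
    """Dual Perceptron: alpha_i counts the mistakes made on point i."""
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    R2 = np.max(np.sum(X ** 2, axis=1))     # R^2 = max_i ||x_i||^2
    G = X @ X.T                              # Gram matrix of all dot products
    for _ in range(max_epochs):
        errors = False
        for i in range(n):
            if y[i] * ((alpha * y) @ G[:, i] + b) <= 0:
                alpha[i] += 1                # one more mistake on x_i
                b += y[i] * R2
                errors = True
        if not errors:
            break
    return alpha, b
```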

The robustness of SVMs
we have talked a lot about the "curse of dimensionality"
• the number of examples required to achieve a certain precision is exponential in the number of dimensions
it turns out that SVMs are remarkably robust to dimensionality
• not uncommon to see successful applications on 1,000D+ spaces
two main reasons for this:
• 1) all that the SVM does is to learn a plane. Although the number of dimensions may be large, the number of parameters is relatively small and there is not much room for overfitting
  In fact, d+1 points are enough to specify the decision rule in Rᵈ!

SVMs as feature selectors
the second reason is that the space is not really that large
• 2) the SVM is a feature selector
To see this, let's look at the decision function

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i x_i^T x + b^* \right]$

this is a thresholding of the quantity

  $\sum_{i \in SV} \alpha_i^* y_i x_i^T x$

note that each of the terms xiᵀx is the projection of the vector to classify (x) onto the training vector xi

SVMs as feature selectors
defining z as the vector of the projections onto all support vectors

  $z(x) = \left( x_{i_1}^T x, \ldots, x_{i_k}^T x \right)^T$

the decision function is a plane in the z-space

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i x_i^T x + b^* \right] = \text{sgn}\left[ \sum_k w_k^* z_k(x) + b^* \right]$

with

  $w^* = \left( \alpha_{i_1}^* y_{i_1}, \ldots, \alpha_{i_k}^* y_{i_k} \right)^T$

this means that
• the classifier operates on the span of the support vectors!
• the SVM performs feature selection automatically

SVMs as feature selectors
geometrically, we have:
• 1) a projection onto the span of the support vectors

  $z(x) = \left( x_{i_1}^T x, \ldots, x_{i_k}^T x \right)^T$

• 2) a linear classifier (w*, b*) on this space, with

  $w^* = \left( \alpha_{i_1}^* y_{i_1}, \ldots, \alpha_{i_k}^* y_{i_k} \right)^T$

• the effective dimension is |SV| and, typically, |SV| << n
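A small sketch of this view: project x onto the span of the support vectors and classify linearly in that |SV|-dimensional space (the names continue the earlier toy sketches and are illustrative only):

```python
import numpy as np

def to_sv_space(x, X_sv):
    """z(x) = (x_{i1}^T x, ..., x_{ik}^T x): coordinates on the SV 'features'."""
    return X_sv @ x

def predict_in_z_space(x, X_sv, y_sv, a_sv, b):
    """Linear classifier in z-space with weights w_k = alpha_k y_k."""
    w_z = a_sv * y_sv
    return np.sign(w_z @ to_sv_space(x, X_sv) + b)
```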

In summary
SVM training:
• 1) solve

  $\max_{\alpha \geq 0} \left\{ \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j x_i^T x_j \right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0$

• 2) then compute

  $w^* = \sum_{i \in SV} \alpha_i^* y_i x_i, \qquad b^* = -\frac{1}{2} \sum_{i \in SV} \alpha_i^* y_i \left( x_i^T x^+ + x_i^T x^- \right)$

decision function:

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i x_i^T x + b^* \right]$

Practical implementations
in practice we need an algorithm for solving the optimization problem of the training stage
• this is still a complex problem
• there has been a large amount of research in this area
• coming up with "your own" algorithm is not going to be competitive
• luckily, there are various packages available, e.g.:
  • libSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • SVM light: http://www.cs.cornell.edu/People/tj/svm_light/
  • SVM fu: http://five-percent-nation.mit.edu/SvmFu/
  • various others (see http://www.support-vector.net/software.html)
• also many papers and books on algorithms (see e.g. B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002)
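For example, scikit-learn wraps libSVM; a minimal usage sketch with made-up data (the C parameter controls the soft margin, which these slides do not cover, so a large C is used here to approximate the hard-margin SVM discussed above):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=1e6)     # large C ~ hard margin
clf.fit(X, y)
print(clf.support_vectors_)           # the support vectors
print(clf.dual_coef_)                 # alpha_i * y_i for the support vectors
print(clf.intercept_)                 # b*
print(clf.predict([[1.0, 1.5]]))
```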

Kernelization
note that all equations depend only on xiᵀxj
the kernel trick is trivial: replace xiᵀxj by K(xi, xj)
• 1) training:

  $\max_{\alpha \geq 0} \left\{ \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \right\} \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0$

• 2) decision function:

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) + b^* \right], \quad \text{with} \quad b^* = -\frac{1}{2} \sum_{i \in SV} \alpha_i^* y_i \left[ K(x_i, x^+) + K(x_i, x^-) \right]$
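The change to the earlier dual sketch is literally one line: build the kernel matrix K(x_i, x_j) instead of the Gram matrix x_iᵀx_j. A sketch with a Gaussian (RBF) kernel; the bandwidth gamma is an arbitrary illustrative choice:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B, gamma=0.5):
    """K(a, b) = exp(-gamma ||a - b||^2), for all pairs of rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_svm_dual(X, y, kernel=rbf_kernel):
    """Same dual as before, with the Gram matrix replaced by the kernel matrix."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * kernel(X, X)
    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(), np.zeros(n),
                   bounds=[(0, None)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
                   method='SLSQP')
    return res.x

def kernel_svm_predict(x, X, y, alpha, b=0.0, kernel=rbf_kernel, tol=1e-6):
    """f(x) = sgn( sum_{i in SV} alpha_i y_i K(x_i, x) + b )."""
    sv = alpha > tol
    k = kernel(X[sv], x[None, :])[:, 0]
    return np.sign((alpha[sv] * y[sv]) @ k + b)
```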

Kernelization
notes:
• as usual, this follows from the fact that nothing of what we did really requires us to be in Rᵈ
• we could have simply used the notation ⟨xi, xj⟩ for the dot product and all the equations would still hold
• the only difference is that we can no longer recover w* explicitly without determining the feature transformation Φ, since

  $w^* = \sum_{i \in SV} \alpha_i^* y_i \Phi(x_i)$

• this could have infinite dimension, e.g. we have seen that it is a sum of Gaussians when we use the Gaussian kernel
• but, luckily, we don't really need w*, only the decision function

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) + b^* \right]$

Input space interpretation
when we introduce a kernel, what is the SVM doing in the input space?
let's look again at the decision function

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) + b^* \right]$

with

  $b^* = -\frac{1}{2} \sum_{i \in SV} \alpha_i^* y_i \left[ K(x_i, x^+) + K(x_i, x^-) \right]$

note that
• x⁺ and x⁻ are support vectors
• assume that the kernel has reduced support when compared to the distance between support vectors

Input space interpretation
note that
• assuming that the kernel has reduced support when compared to the distance between support vectors

  $b^* = -\frac{1}{2} \sum_{i \in SV} \alpha_i^* y_i \left[ K(x_i, x^+) + K(x_i, x^-) \right] \approx -\frac{1}{2} \left[ \alpha_+^* K(x^+, x^+) - \alpha_-^* K(x^-, x^-) \right] \approx 0$

• where we have also assumed that α⁺ ~ α⁻
• these assumptions are not crucial, but simplify what follows
• namely, the decision function becomes

  $f(x) = \text{sgn}\left[ \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) \right]$

Input space interpretation
or

  $f(x) = \begin{cases} 1, & \text{if } \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) \geq 0 \\ -1, & \text{if } \sum_{i \in SV} \alpha_i^* y_i K(x_i, x) < 0 \end{cases}$

rewriting

  $\sum_{i \in SV} \alpha_i^* y_i K(x_i, x) = \sum_{i | y_i > 0} \alpha_i^* K(x_i, x) - \sum_{i | y_i < 0} \alpha_i^* K(x_i, x)$

this is

  $f(x) = \begin{cases} 1, & \text{if } \sum_{i | y_i > 0} \alpha_i^* K(x_i, x) \geq \sum_{i | y_i < 0} \alpha_i^* K(x_i, x) \\ -1, & \text{otherwise} \end{cases}$

Input space interpretation
or, defining the normalizing constants

  $\pi = \frac{\sum_{i | y_i > 0} \alpha_i^*}{\sum_i \alpha_i^*}, \qquad \beta = \frac{\sum_{i | y_i < 0} \alpha_i^*}{\sum_i \alpha_i^*}$

which is the same as

  $f(x) = \begin{cases} 1, & \text{if } \pi \dfrac{\sum_{i | y_i > 0} \alpha_i^* K(x_i, x)}{\sum_{i | y_i > 0} \alpha_i^*} \geq \beta \dfrac{\sum_{i | y_i < 0} \alpha_i^* K(x_i, x)}{\sum_{i | y_i < 0} \alpha_i^*} \\ -1, & \text{otherwise} \end{cases}$

Input space interpretation
note that this is the Bayesian decision rule for
• 1) class 1 with likelihood $\sum_{i | y_i > 0} \frac{\alpha_i^*}{\sum_{j | y_j > 0} \alpha_j^*} K(x_i, x)$ and prior $\pi = \sum_{i | y_i > 0} \alpha_i^* / \sum_i \alpha_i^*$
• 2) class 2 with likelihood $\sum_{i | y_i < 0} \frac{\alpha_i^*}{\sum_{j | y_j < 0} \alpha_j^*} K(x_i, x)$ and prior $\beta = \sum_{i | y_i < 0} \alpha_i^* / \sum_i \alpha_i^*$
these likelihood functions
• are a kernel density estimate if K(·, xi) is a valid pdf
• a peculiar kernel estimate that only places kernels around the support vectors
• all other points are ignored

Input space interpretation
this is a discriminant form of density estimation
• it concentrates the modeling power where it matters the most, i.e. near the classification boundary
• smart, since points away from the boundary are always well classified, even if the density estimates in their region are poor
• the SVM can be seen as a highly efficient combination of the BDR with kernel density estimates
• recall that one major problem of kernel estimates is the complexity of the decision function, O(n)
• with the SVM, the complexity is only O(|SV|), but nothing is lost
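This interpretation is easy to verify numerically: with b ≈ 0, comparing the two weighted kernel sums (one per class, with kernels centered only on the support vectors) gives the same label as the SVM decision function. A sketch reusing the kernel helpers from the earlier kernelization example:

```python
import numpy as np

def bdr_view(x, X_sv, y_sv, a_sv, kernel):
    """Compare the two class 'prior x likelihood' scores built from the SVs."""
    k = kernel(X_sv, x[None, :])[:, 0]
    score_pos = np.sum(a_sv[y_sv > 0] * k[y_sv > 0])   # class 1 score
    score_neg = np.sum(a_sv[y_sv < 0] * k[y_sv < 0])   # class 2 score
    return 1 if score_pos >= score_neg else -1          # same as sgn(sum alpha_i y_i K)
```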

Input space interpretation
note on the approximations made:
• this result was derived assuming b ≈ 0
• in practice, b is frequently left as a parameter which is used to trade off false positives for misses
• here, that can be done by controlling the BDR threshold

  $f(x) = \begin{cases} 1, & \text{if } \dfrac{\sum_{i | y_i > 0} \alpha_i^* K(x_i, x)}{\sum_{i | y_i < 0} \alpha_i^* K(x_i, x)} \geq T \\ -1, & \text{otherwise} \end{cases}$

• hence, there is really not much practical difference, even when the assumption of b* = 0 does not hold!