A Brief Introduction to Support Vector Machine (SVM)


Page 1: Chapter 2. A Brief Introduction to Support Vector Machine (SVM). January 25, 2011.

Page 2: Overview

• A new, powerful method for 2-class classification
  – Original idea: Vapnik, 1965, for linear classifiers
  – SVM: Cortes and Vapnik, 1995
  – Became very popular since 2001
• Better generalization (less overfitting)
• Can do linearly inseparable classification with a globally optimal solution
• Key ideas
  – Use a kernel function to transform low-dimensional training samples to a higher dimension (for the linear separability problem)
  – Use quadratic programming (QP) to find the best classifier boundary hyperplane (for a global optimum)

Page 3: Linear Classifiers

f(x, w, b) = sign(w · x − b)

[Figure: a 2-D dataset with points labeled +1 and −1.]

How would you classify this data?

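To make the decision rule concrete, here is a minimal sketch in plain NumPy (the data, weights, and function name are made up for illustration; the slides themselves give no code):

    # Minimal sketch of the linear decision rule f(x, w, b) = sign(w . x - b).
    import numpy as np

    def linear_classify(X, w, b):
        """Return +1 or -1 for each row of X under f(x, w, b) = sign(w . x - b)."""
        return np.sign(X @ w - b)

    # Tiny made-up dataset: two points per class in 2-D.
    X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -2.0], [-2.0, -0.5]])
    w = np.array([1.0, 1.0])   # hypothetical weight vector
    b = 0.0                    # hypothetical offset
    print(linear_classify(X, w, b))   # -> [ 1.  1. -1. -1.]

Every line the slides draw through the data corresponds to one choice of (w, b); the following slides ask which choice is best.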

Pages 4–6: Linear Classifiers (continued)

These slides repeat the text of Page 3, f(x, w, b) = sign(w · x − b) and "How would you classify this data?", each time drawing a different candidate boundary through the same data (see Page 7).

Page 7: Linear Classifiers

f(x, w, b) = sign(w · x − b)

Any of these would be fine... but which is best?


Page 8: Classifier Margin

f(x, w, b) = sign(w · x − b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.


Page 9: Maximum Margin

f(x, w, b) = sign(w · x − b)

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a Linear SVM, or LSVM).

Page 10: Maximum Margin

f(x, w, b) = sign(w · x − b)

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a Linear SVM, or LSVM).

Support Vectors are those datapoints that the margin pushes up against.

Page 11: Why Maximum Margin?

f(x, w, b) = sign(w · x − b)

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. It also helps generalization.
4. There's some theory that this is a good thing.
5. Empirically it works very, very well.

Page 12: Computing the margin width

M = Margin Width. How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• M = 2 / √(w · w)
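A short derivation of that formula (standard, not spelled out on the slide): let x⁻ be a point on the minus-plane and x⁺ the closest point to it on the plus-plane, so x⁺ = x⁻ + λw for some scalar λ (w is perpendicular to both planes).

\[
w \cdot x^{+} + b = +1, \qquad w \cdot x^{-} + b = -1, \qquad x^{+} = x^{-} + \lambda w .
\]

Subtracting the two plane equations gives \( w \cdot (x^{+} - x^{-}) = \lambda\, w \cdot w = 2 \), so \( \lambda = 2 / (w \cdot w) \) and

\[
M = \lVert x^{+} - x^{-} \rVert = \lambda \sqrt{w \cdot w} = \frac{2}{\sqrt{w \cdot w}} .
\]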

Copyright © 2001, 2003, Andrew W. Moore

Page 13: Learning the Maximum Margin Classifier

M = Margin Width = 2 / √(w · w)

[Figure: points x⁺ and x⁻ on the plus- and minus-planes.]

Given a guess of w and b we can:
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated annealing?

Page 14: Learning via Quadratic Programming

• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.
• Minimize both w ⋅ w (to maximize M) and the misclassification error.
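For the separable case this leads to a standard formulation (sketched here for completeness; it matches the plus-plane/minus-plane constraints above): a quadratic objective with one linear constraint per training point, i.e. exactly the QP form on the next slide.

\[
\min_{w,\,b}\ \tfrac{1}{2}\, w \cdot w
\qquad \text{subject to} \qquad
y_k \,(w \cdot x_k + b) \ge 1, \quad k = 1, \dots, R .
\]

Minimizing w ⋅ w maximizes M = 2 / √(w · w); the noisy version on Page 17 adds the misclassification term.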


Page 15: Quadratic Programming

Find

\[
\arg\max_{u}\ \, c + d^{T} u + \frac{u^{T} R u}{2}
\]

(a quadratic criterion) subject to n additional linear inequality constraints

\[
\begin{aligned}
a_{11} u_1 + a_{12} u_2 + \dots + a_{1m} u_m &\le b_1 \\
a_{21} u_1 + a_{22} u_2 + \dots + a_{2m} u_m &\le b_2 \\
&\ \ \vdots \\
a_{n1} u_1 + a_{n2} u_2 + \dots + a_{nm} u_m &\le b_n
\end{aligned}
\]

and subject to e additional linear equality constraints

\[
\begin{aligned}
a_{(n+1)1} u_1 + a_{(n+1)2} u_2 + \dots + a_{(n+1)m} u_m &= b_{n+1} \\
&\ \ \vdots \\
a_{(n+e)1} u_1 + a_{(n+e)2} u_2 + \dots + a_{(n+e)m} u_m &= b_{n+e}
\end{aligned}
\]
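As an illustration of handing such a problem to an off-the-shelf solver, here is a small sketch using the cvxopt package (an assumption; any QP solver would do). Note that cvxopt minimizes (1/2) uᵀPu + qᵀu subject to Gu ≤ h, so a maximization criterion like the one above is converted by negating the objective.

    # Sketch: a tiny QP solved with cvxopt (assumed installed).
    # Standard form: minimize (1/2) u'Pu + q'u  subject to  G u <= h.
    from cvxopt import matrix, solvers

    P = matrix([[2.0, 0.0], [0.0, 2.0]])              # quadratic term (positive semidefinite)
    q = matrix([-2.0, -6.0])                          # linear term
    G = matrix([[1.0, -1.0, 0.0], [1.0, 0.0, -1.0]])  # columns; rows encode u1+u2<=2, -u1<=0, -u2<=0
    h = matrix([2.0, 0.0, 0.0])

    sol = solvers.qp(P, q, G, h)
    print(sol['x'])                                   # the optimal u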

Page 16: Learning Maximum Margin with Noise

M = 2 / √(w · w)

Given a guess of w, b we can:
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints (xk, yk), each with yk = +/− 1.

What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?


Page 17: Learning Maximum Margin with Noise

M = 2 / √(w · w)

[Figure: ε1, ε2, ε7 mark the distances of error points from their correct zones.]

Given a guess of w, b we can:
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints (xk, yk), each with yk = +/− 1.

What should our quadratic optimization criterion be? Minimize

\[
\frac{1}{2}\, w \cdot w \;+\; C \sum_{k=1}^{R} \varepsilon_k .
\]

How many constraints will we have? 2R. What should they be?

\[
w \cdot x_k + b \ge 1 - \varepsilon_k \ \ \text{if } y_k = +1, \qquad
w \cdot x_k + b \le -1 + \varepsilon_k \ \ \text{if } y_k = -1, \qquad
\varepsilon_k \ge 0 \ \ \text{for all } k .
\]

εk = distance of error points to their correct place.
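In practice the error cost C is exposed directly as a tuning parameter. A brief sketch, assuming scikit-learn's SVC (the deck itself is library-agnostic): small C tolerates more slack (wider margin, more training errors), large C penalizes slack heavily.

    # Sketch: the soft-margin error cost C (toy data; scikit-learn assumed).
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
    y = np.array([-1, -1, 1, 1])

    soft = SVC(kernel='linear', C=0.1).fit(X, y)    # small C: wide margin, tolerates errors
    hard = SVC(kernel='linear', C=100.0).fit(X, y)  # large C: approaches the hard-margin LSVM
    print(soft.support_vectors_)
    print(hard.support_vectors_)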

Page 18: From LSVM to general SVM

Suppose we are in 1 dimension. What would SVMs do with this data?

[Figure: points on a line, with x = 0 marked.]


Page 19: Not a big surprise

[Figure: the 1-D data separated by a point, with a positive "plane" on one side and a negative "plane" on the other; x = 0 marked.]


Page 20: Harder 1-dimensional dataset

The points are not linearly separable. What can we do now?

[Figure: a 1-D dataset that is not linearly separable; x = 0 marked.]


Page 21: Harder 1-dimensional dataset

Transform the data points from 1-dim to 2-dim by some nonlinear basis function (called a kernel function):

  zk = (xk, xk²)

The transformed points are sometimes called feature vectors.
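A small sketch of this transform on made-up 1-D points (the arrangement, with one class near x = 0 and the other further out, is only an assumption for illustration): after mapping each xk to zk = (xk, xk²), a horizontal line in the new plane separates the classes.

    # Sketch: mapping 1-D points x to 2-D feature vectors z = (x, x^2).
    import numpy as np

    x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])   # hypothetical 1-D inputs
    y = np.array([1, 1, -1, -1, -1, 1, 1])                  # outer points +1, inner points -1
    z = np.column_stack([x, x ** 2])                        # feature vectors (x, x^2)

    # In the new space the classes are separated by, e.g., the line x^2 = 2:
    print(np.all((z[:, 1] > 2) == (y == 1)))                # -> True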


Page 22: Harder 1-dimensional dataset

With zk = (xk, xk²), these points are linearly separable now! The boundary can be found by QP.


Page 23: Common SVM basis functions

zk = (polynomial terms of xk of degree 1 to q)

zk = (radial basis functions of xk), i.e.

\[
z_k[j] = \phi_j(x_k) = \text{KernelFn}\!\left( \frac{\lvert x_k - c_j \rvert}{KW} \right)
\]

zk = (sigmoid functions of xk)
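These three families correspond to the kernel choices most SVM packages expose by name. A minimal sketch, assuming scikit-learn (parameter values are arbitrary):

    # Sketch: selecting the three kernel families above by name (scikit-learn assumed).
    from sklearn.svm import SVC

    poly_svm    = SVC(kernel='poly', degree=3)   # polynomial terms up to degree q = 3
    rbf_svm     = SVC(kernel='rbf', gamma=0.5)   # radial basis functions; gamma plays the role of 1/(2*KW^2)
    sigmoid_svm = SVC(kernel='sigmoid')          # sigmoid (tanh-style) kernel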


Page 24: Doing multi-class classification

• SVMs can only handle two-class outputs (i.e., a categorical output variable with only two values).
• What can be done?
• Answer: with N classes, learn N SVMs
  • SVM 1 learns "Output == 1" vs "Output != 1"
  • SVM 2 learns "Output == 2" vs "Output != 2"
  • :
  • SVM N learns "Output == N" vs "Output != N"
• Then, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
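A sketch of that one-vs-rest recipe (data, names, and the use of scikit-learn are all assumptions): train one binary SVM per class and pick the class whose SVM pushes the point furthest into its positive region.

    # Sketch: one-vs-rest multi-class prediction with N binary SVMs.
    import numpy as np
    from sklearn.svm import SVC

    def train_one_vs_rest(X, y, classes):
        # One binary SVM per class: "Output == c" (+1) vs "Output != c" (-1).
        return {c: SVC(kernel='linear').fit(X, np.where(y == c, 1, -1)) for c in classes}

    def predict_one_vs_rest(models, X):
        # decision_function gives a signed distance from each boundary; take the largest.
        scores = np.column_stack([models[c].decision_function(X) for c in models])
        labels = list(models)
        return [labels[i] for i in np.argmax(scores, axis=1)]

    # Hypothetical 3-class toy data.
    X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
    y = np.array([0, 0, 1, 1, 2, 2])
    models = train_one_vs_rest(X, y, classes=[0, 1, 2])
    print(predict_one_vs_rest(models, X))   # should recover [0, 0, 1, 1, 2, 2] on this toy data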


Page 25: Compare SVM with NN

• Similarity
  − SVM + sigmoid kernel ~ two-layer feedforward NN
  − SVM + Gaussian kernel ~ RBF network
  − For most problems, SVM and NN have similar performance
• Advantages
  − Based on a sound mathematical theory
  − The learning result is more robust
  − Over-fitting is not common
  − Not trapped in local minima (because of QP)
  − Fewer parameters to consider (kernel, error cost C)
  − Works well with fewer training samples (the number of support vectors does not matter much)
• Disadvantages
  − The problem needs to be formulated as 2-class classification
  − Learning takes a long time (QP optimization)

Page 26: Advantages and Disadvantages of SVM

• Advantages
  • prediction accuracy is generally high
  • robust, works when training examples contain errors
  • fast evaluation of the learned target function
• Criticism
  • long training time
  • difficult to understand the learned function (weights)
  • not easy to incorporate domain knowledge


Page 27: SVM Related Links

• Representative implementations
  • LIBSVM: an efficient implementation of SVM supporting multi-class classification, nu-SVM, and one-class SVM, and including various interfaces (Java, Python, etc.)
  • SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language
  • SVM-torch: another recent implementation, also written in C.
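For reference, a typical LIBSVM training/prediction round trip through its Python bindings looks roughly like the sketch below (file names are placeholders, and the exact import path depends on how LIBSVM was installed; the data files use LIBSVM's sparse "label index:value" text format).

    # Rough sketch of LIBSVM usage via its Python bindings (paths and options are examples).
    from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

    y_train, x_train = svm_read_problem('train.libsvm')    # sparse "label index:value" file
    model = svm_train(y_train, x_train, '-t 2 -c 1')        # -t 2: RBF kernel, -c: error cost C
    y_test, x_test = svm_read_problem('test.libsvm')
    labels, accuracy, values = svm_predict(y_test, x_test, model)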
