A Brief Introduction to Support Vector Machine (SVM)


Page 1: Chapter 2. A Brief Introduction to Support Vector Machine (SVM). January 25, 2011.

Page 2: Overview

• A new, powerful method for 2-class classification
  – Original idea: Vapnik, 1965, for linear classifiers
  – SVM: Cortes and Vapnik, 1995
  – Became very popular since 2001
• Better generalization (less overfitting)
• Can do linearly inseparable classification with a globally optimal solution
• Key ideas
  – Use a kernel function to transform low-dimensional training samples to a higher dimension (for the linear separability problem)
  – Use quadratic programming (QP) to find the best classifier boundary hyperplane (for a global optimum)

Page 3: Linear Classifiers

f(x, w, b) = sign(w · x − b)

[Figure: a 2-D dataset with points labeled +1 and −1.]

How would you classify this data?

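To make the decision rule concrete, here is a minimal sketch in plain NumPy (the data, weights, and function name are made up for illustration; the slides themselves give no code):

    # Minimal sketch of the linear decision rule f(x, w, b) = sign(w . x - b).
    import numpy as np

    def linear_classify(X, w, b):
        """Return +1 or -1 for each row of X under f(x, w, b) = sign(w . x - b)."""
        return np.sign(X @ w - b)

    # Tiny made-up dataset: two points per class in 2-D.
    X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -2.0], [-2.0, -0.5]])
    w = np.array([1.0, 1.0])   # hypothetical weight vector
    b = 0.0                    # hypothetical offset
    print(linear_classify(X, w, b))   # -> [ 1.  1. -1. -1.]

Every line the slides draw through the data corresponds to one choice of (w, b); the following slides ask which choice is best.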

Pages 4–6: Linear Classifiers (continued)

These slides repeat the text of Page 3, f(x, w, b) = sign(w · x − b) and "How would you classify this data?", each time drawing a different candidate boundary through the same data (see Page 7).

Page 7: Linear Classifiers

f(x, w, b) = sign(w · x − b)

Any of these would be fine... but which is best?


Page 8: Classifier Margin

f(x, w, b) = sign(w · x − b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.


Page 9: Maximum Margin

f(x, w, b) = sign(w · x − b)

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a Linear SVM, or LSVM).

Page 10: Maximum Margin

f(x, w, b) = sign(w · x − b)

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a Linear SVM, or LSVM).

Support Vectors are those datapoints that the margin pushes up against.

Page 11: Why Maximum Margin?

f(x, w, b) = sign(w · x − b)

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. It also helps generalization.
4. There's some theory that this is a good thing.
5. Empirically it works very, very well.

Page 12: Computing the margin width

M = Margin Width. How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• M = 2 / √(w · w)
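A short derivation of that formula (standard, not spelled out on the slide): let x⁻ be a point on the minus-plane and x⁺ the closest point to it on the plus-plane, so x⁺ = x⁻ + λw for some scalar λ (w is perpendicular to both planes).

\[
w \cdot x^{+} + b = +1, \qquad w \cdot x^{-} + b = -1, \qquad x^{+} = x^{-} + \lambda w .
\]

Subtracting the two plane equations gives \( w \cdot (x^{+} - x^{-}) = \lambda\, w \cdot w = 2 \), so \( \lambda = 2 / (w \cdot w) \) and

\[
M = \lVert x^{+} - x^{-} \rVert = \lambda \sqrt{w \cdot w} = \frac{2}{\sqrt{w \cdot w}} .
\]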

Copyright © 2001, 2003, Andrew W. Moore

Page 13: Learning the Maximum Margin Classifier

M = Margin Width = 2 / √(w · w)

[Figure: points x⁺ and x⁻ on the plus- and minus-planes.]

Given a guess of w and b we can:
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated annealing?

Page 14: Learning via Quadratic Programming

• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.
• Minimize both w ⋅ w (to maximize M) and the misclassification error.
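For the separable case this leads to a standard formulation (sketched here for completeness; it matches the plus-plane/minus-plane constraints above): a quadratic objective with one linear constraint per training point, i.e. exactly the QP form on the next slide.

\[
\min_{w,\,b}\ \tfrac{1}{2}\, w \cdot w
\qquad \text{subject to} \qquad
y_k \,(w \cdot x_k + b) \ge 1, \quad k = 1, \dots, R .
\]

Minimizing w ⋅ w maximizes M = 2 / √(w · w); the noisy version on Page 17 adds the misclassification term.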


Page 15: Quadratic Programming

Find

\[
\arg\max_{u}\ \, c + d^{T} u + \frac{u^{T} R u}{2}
\]

(a quadratic criterion) subject to n additional linear inequality constraints

\[
\begin{aligned}
a_{11} u_1 + a_{12} u_2 + \dots + a_{1m} u_m &\le b_1 \\
a_{21} u_1 + a_{22} u_2 + \dots + a_{2m} u_m &\le b_2 \\
&\ \ \vdots \\
a_{n1} u_1 + a_{n2} u_2 + \dots + a_{nm} u_m &\le b_n
\end{aligned}
\]

and subject to e additional linear equality constraints

\[
\begin{aligned}
a_{(n+1)1} u_1 + a_{(n+1)2} u_2 + \dots + a_{(n+1)m} u_m &= b_{n+1} \\
&\ \ \vdots \\
a_{(n+e)1} u_1 + a_{(n+e)2} u_2 + \dots + a_{(n+e)m} u_m &= b_{n+e}
\end{aligned}
\]
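As an illustration of handing such a problem to an off-the-shelf solver, here is a small sketch using the cvxopt package (an assumption; any QP solver would do). Note that cvxopt minimizes (1/2) uᵀPu + qᵀu subject to Gu ≤ h, so a maximization criterion like the one above is converted by negating the objective.

    # Sketch: a tiny QP solved with cvxopt (assumed installed).
    # Standard form: minimize (1/2) u'Pu + q'u  subject to  G u <= h.
    from cvxopt import matrix, solvers

    P = matrix([[2.0, 0.0], [0.0, 2.0]])              # quadratic term (positive semidefinite)
    q = matrix([-2.0, -6.0])                          # linear term
    G = matrix([[1.0, -1.0, 0.0], [1.0, 0.0, -1.0]])  # columns; rows encode u1+u2<=2, -u1<=0, -u2<=0
    h = matrix([2.0, 0.0, 0.0])

    sol = solvers.qp(P, q, G, h)
    print(sol['x'])                                   # the optimal u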

Page 16: Learning Maximum Margin with Noise

M = 2 / √(w · w)

Given a guess of w, b we can:
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints (xk, yk), each with yk = +/− 1.

What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?


Page 17: Learning Maximum Margin with Noise

M = 2 / √(w · w)

[Figure: ε1, ε2, ε7 mark the distances of error points from their correct zones.]

Given a guess of w, b we can:
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints (xk, yk), each with yk = +/− 1.

What should our quadratic optimization criterion be? Minimize

\[
\frac{1}{2}\, w \cdot w \;+\; C \sum_{k=1}^{R} \varepsilon_k .
\]

How many constraints will we have? 2R. What should they be?

\[
w \cdot x_k + b \ge 1 - \varepsilon_k \ \ \text{if } y_k = +1, \qquad
w \cdot x_k + b \le -1 + \varepsilon_k \ \ \text{if } y_k = -1, \qquad
\varepsilon_k \ge 0 \ \ \text{for all } k .
\]

εk = distance of error points to their correct place.
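In practice the error cost C is exposed directly as a tuning parameter. A brief sketch, assuming scikit-learn's SVC (the deck itself is library-agnostic): small C tolerates more slack (wider margin, more training errors), large C penalizes slack heavily.

    # Sketch: the soft-margin error cost C (toy data; scikit-learn assumed).
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
    y = np.array([-1, -1, 1, 1])

    soft = SVC(kernel='linear', C=0.1).fit(X, y)    # small C: wide margin, tolerates errors
    hard = SVC(kernel='linear', C=100.0).fit(X, y)  # large C: approaches the hard-margin LSVM
    print(soft.support_vectors_)
    print(hard.support_vectors_)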

Page 18: From LSVM to general SVM

Suppose we are in 1 dimension. What would SVMs do with this data?

[Figure: points on a line, with x = 0 marked.]


Page 19: Not a big surprise

[Figure: the 1-D data separated by a point, with a positive "plane" on one side and a negative "plane" on the other; x = 0 marked.]


Page 20: Harder 1-dimensional dataset

The points are not linearly separable. What can we do now?

[Figure: a 1-D dataset that is not linearly separable; x = 0 marked.]


Page 21: Harder 1-dimensional dataset

Transform the data points from 1-dim to 2-dim by some nonlinear basis function (called a kernel function):

  zk = (xk, xk²)

The transformed points are sometimes called feature vectors.
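A small sketch of this transform on made-up 1-D points (the arrangement, with one class near x = 0 and the other further out, is only an assumption for illustration): after mapping each xk to zk = (xk, xk²), a horizontal line in the new plane separates the classes.

    # Sketch: mapping 1-D points x to 2-D feature vectors z = (x, x^2).
    import numpy as np

    x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])   # hypothetical 1-D inputs
    y = np.array([1, 1, -1, -1, -1, 1, 1])                  # outer points +1, inner points -1
    z = np.column_stack([x, x ** 2])                        # feature vectors (x, x^2)

    # In the new space the classes are separated by, e.g., the line x^2 = 2:
    print(np.all((z[:, 1] > 2) == (y == 1)))                # -> True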


Page 22: Harder 1-dimensional dataset

With zk = (xk, xk²), these points are linearly separable now! The boundary can be found by QP.


Page 23: Common SVM basis functions

zk = (polynomial terms of xk of degree 1 to q)

zk = (radial basis functions of xk), i.e.

\[
z_k[j] = \phi_j(x_k) = \text{KernelFn}\!\left( \frac{\lvert x_k - c_j \rvert}{KW} \right)
\]

zk = (sigmoid functions of xk)
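These three families correspond to the kernel choices most SVM packages expose by name. A minimal sketch, assuming scikit-learn (parameter values are arbitrary):

    # Sketch: selecting the three kernel families above by name (scikit-learn assumed).
    from sklearn.svm import SVC

    poly_svm    = SVC(kernel='poly', degree=3)   # polynomial terms up to degree q = 3
    rbf_svm     = SVC(kernel='rbf', gamma=0.5)   # radial basis functions; gamma plays the role of 1/(2*KW^2)
    sigmoid_svm = SVC(kernel='sigmoid')          # sigmoid (tanh-style) kernel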


Page 24: Doing multi-class classification

• SVMs can only handle two-class outputs (i.e., a categorical output variable with only two values).
• What can be done?
• Answer: with N classes, learn N SVMs
  • SVM 1 learns "Output == 1" vs "Output != 1"
  • SVM 2 learns "Output == 2" vs "Output != 2"
  • :
  • SVM N learns "Output == N" vs "Output != N"
• Then, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
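A sketch of that one-vs-rest recipe (data, names, and the use of scikit-learn are all assumptions): train one binary SVM per class and pick the class whose SVM pushes the point furthest into its positive region.

    # Sketch: one-vs-rest multi-class prediction with N binary SVMs.
    import numpy as np
    from sklearn.svm import SVC

    def train_one_vs_rest(X, y, classes):
        # One binary SVM per class: "Output == c" (+1) vs "Output != c" (-1).
        return {c: SVC(kernel='linear').fit(X, np.where(y == c, 1, -1)) for c in classes}

    def predict_one_vs_rest(models, X):
        # decision_function gives a signed distance from each boundary; take the largest.
        scores = np.column_stack([models[c].decision_function(X) for c in models])
        labels = list(models)
        return [labels[i] for i in np.argmax(scores, axis=1)]

    # Hypothetical 3-class toy data.
    X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
    y = np.array([0, 0, 1, 1, 2, 2])
    models = train_one_vs_rest(X, y, classes=[0, 1, 2])
    print(predict_one_vs_rest(models, X))   # should recover [0, 0, 1, 1, 2, 2] on this toy data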


Page 25: Compare SVM with NN

• Similarity
  − SVM + sigmoid kernel ~ two-layer feedforward NN
  − SVM + Gaussian kernel ~ RBF network
  − For most problems, SVM and NN have similar performance
• Advantages
  − Based on a sound mathematical theory
  − The learning result is more robust
  − Over-fitting is not common
  − Not trapped in local minima (because of QP)
  − Fewer parameters to consider (kernel, error cost C)
  − Works well with fewer training samples (the number of support vectors does not matter much)
• Disadvantages
  − The problem needs to be formulated as 2-class classification
  − Learning takes a long time (QP optimization)

Page 26: Advantages and Disadvantages of SVM

• Advantages
  • prediction accuracy is generally high
  • robust, works when training examples contain errors
  • fast evaluation of the learned target function
• Criticism
  • long training time
  • difficult to understand the learned function (weights)
  • not easy to incorporate domain knowledge


Page 27: SVM Related Links

• Representative implementations
  • LIBSVM: an efficient implementation of SVM supporting multi-class classification, nu-SVM, and one-class SVM, and including various interfaces (Java, Python, etc.)
  • SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language
  • SVM-torch: another recent implementation, also written in C.
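For reference, a typical LIBSVM training/prediction round trip through its Python bindings looks roughly like the sketch below (file names are placeholders, and the exact import path depends on how LIBSVM was installed; the data files use LIBSVM's sparse "label index:value" text format).

    # Rough sketch of LIBSVM usage via its Python bindings (paths and options are examples).
    from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

    y_train, x_train = svm_read_problem('train.libsvm')    # sparse "label index:value" file
    model = svm_train(y_train, x_train, '-t 2 -c 1')        # -t 2: RBF kernel, -c: error cost C
    y_test, x_test = svm_read_problem('test.libsvm')
    labels, accuracy, values = svm_predict(y_test, x_test, model)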
