Support Vector Machines & Kernel Machines

Ohad Hageby, IDC 2008

Description

Support Vector Machines & Kernel Machines. IP Seminar 2008, IDC Herzliya. Introduction to Support Vector Machines (SVM).

Transcript of Support Vector Machines & Kernel Machines

Page 1: Support Vector Machines  &  Kernel Machines


Support Vector Machines & Kernel Machines

IP Seminar 2008

IDC Herzliya

Page 2: Support Vector Machines  &  Kernel Machines


Introduction to Support Vector Machines (SVM)

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers.

(from Wikipedia)

Page 3: Support Vector Machines  &  Kernel Machines


Introduction Continued

Often we are interested in classifying data as part of a machine-learning process.

Each data point is represented by a p-dimensional vector (a list of p numbers).

Each of these data points belongs to exactly one of two classes.

Page 4: Support Vector Machines  &  Kernel Machines


Training Data

We want to estimate a function $f: \mathbb{R}^N \to \{+1, -1\}$ from input-output training pairs generated independently and identically distributed according to an unknown distribution P(x, y).

If f(x_i) = -1, then x_i is in class 1.

If f(x_i) = +1, then x_i is in class 2.
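To make this setup concrete, here is a small illustrative sketch in Python (not from the original slides) that builds such a training set: two clusters of p-dimensional points with labels in {+1, -1}; the cluster positions and sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian clouds in R^2, shifted apart so that they are (almost surely)
# linearly separable; each row of X is one p-dimensional data point.
n_per_class = 20
X_pos = rng.normal(loc=+2.0, scale=0.5, size=(n_per_class, 2))
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(n_per_class, 2))

X = np.vstack([X_pos, X_neg])
y = np.hstack([+np.ones(n_per_class),   # class 2 -> label +1
               -np.ones(n_per_class)])  # class 1 -> label -1
```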

Page 5: Support Vector Machines  &  Kernel Machines


The machine

The machine's task is to learn the mapping from x_i to y_i.

It is defined by a set of possible mappings $x \mapsto f(x)$.

Page 6: Support Vector Machines  &  Kernel Machines


Expected Error

The test examples are assumed to come from the same probability distribution P(x, y) as the training data.

The best function f we could have is the one minimizing the expected error (risk):

$$R[f] = \int l(f(x), y)\, dP(x, y)$$

Page 7: Support Vector Machines  &  Kernel Machines


$$R[f] = \int l(f(x), y)\, dP(x, y)$$

Here l denotes the "loss" function; the "0/1 loss" is

$$l(f(x), y) = \theta(-y f(x)), \qquad \theta(z) = \begin{cases} 0, & z < 0 \\ 1, & \text{otherwise} \end{cases}$$

A common loss function is the squared loss:

$$l(f(x), y) = (f(x) - y)^2$$
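As a minimal sketch (assuming labels are encoded as ±1 and f(x) is a real-valued score), the two losses above can be written directly:

```python
import numpy as np

def zero_one_loss(f_x, y):
    """0/1 loss: 0 when the sign of f(x) agrees with the label y, 1 otherwise."""
    return np.where(y * f_x > 0, 0.0, 1.0)

def squared_loss(f_x, y):
    """Squared loss: (f(x) - y)^2."""
    return (np.asarray(f_x) - np.asarray(y)) ** 2

print(zero_one_loss(np.array([0.7, -0.2]), np.array([1.0, 1.0])))   # [0. 1.]
print(squared_loss(np.array([0.7, -0.2]), np.array([1.0, 1.0])))    # [0.09 1.44]
```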

Page 8: Support Vector Machines  &  Kernel Machines


Empirical Risk

Unfortunately, the risk cannot be minimized directly because the probability distribution is unknown.

The "empirical risk" is defined as the measured mean error rate on the training set (for a fixed, finite number of observations):

$$R_{\mathrm{emp}}[f] = \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i)$$
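Translated directly into code, the empirical risk is just the mean loss over the n training pairs; the decision rule and data below are arbitrary placeholders for illustration:

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """R_emp[f] = (1/n) * sum_i loss(f(x_i), y_i)."""
    predictions = np.array([f(x) for x in X])
    return float(np.mean(loss(predictions, y)))

# Illustration with an arbitrary linear rule f(x) = sign(w.x + b) and 0/1 loss.
w, b = np.array([1.0, -0.5]), 0.1
f = lambda x: np.sign(w @ x + b)
zero_one = lambda fx, y: np.where(y * fx > 0, 0.0, 1.0)

X = np.array([[1.0, 2.0], [-1.5, 0.5], [2.0, -1.0]])
y = np.array([+1.0, -1.0, +1.0])
print(empirical_risk(f, X, y, zero_one))   # 0.0 for this toy data
```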

Page 9: Support Vector Machines  &  Kernel Machines


The overfitting dilemma

It is possible to give conditions on the learning machine which ensure that as n → ∞, R_emp converges toward the expected risk R.

For small sample sizes, however, overfitting might occur.

Page 10: Support Vector Machines  &  Kernel Machines


The overfitting dilemma cont.

(Figure from "An Introduction to Kernel-Based Learning Algorithms")

Page 11: Support Vector Machines  &  Kernel Machines


VC Dimension

A concept in "VC theory" introduced by Vladimir Vapnik and Alexey Chervonenkis.

It is a measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.

Page 12: Support Vector Machines  &  Kernel Machines


Shattering Example (from Wikipedia)

For example, consider a straight line as the classification model: the model used by a perceptron. The line should separate positive data points from negative data points. When there are 3 points that are not collinear, the line can shatter them.

Page 13: Support Vector Machines  &  Kernel Machines


Shattering

A classification model f with some parameter vector θ is said to shatter a set of data points (x_1, x_2, …, x_n) if, for all assignments of labels to those points, there exists a θ such that the model f makes no errors when evaluating that set of data points.
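The definition can be checked by brute force for a small point set; the sketch below assumes a linear model sign(w·x + b) in the plane and uses a crude randomized search over (w, b), so a False result is only suggestive, not a proof:

```python
import itertools
import numpy as np

def linearly_realizable(points, labels, tries=20000):
    """Crude randomized check: does some (w, b) reproduce exactly this labeling?"""
    rng = np.random.default_rng(0)
    for _ in range(tries):
        w = rng.normal(size=points.shape[1])
        b = rng.normal()
        if np.all(np.sign(points @ w + b) == labels):
            return True
    return False   # may be a false negative; this is only a sketch

def shatters(points):
    """The model shatters the set if every +/-1 labeling is realizable."""
    n = len(points)
    return all(
        linearly_realizable(points, np.array(labels))
        for labels in itertools.product([-1, 1], repeat=n)
    )

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # not collinear
print(shatters(three))   # expected: True
```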

Page 14: Support Vector Machines  &  Kernel Machines


Shattering Continued

VC dimension of a model f is the maximum h such that some data point set of cardinality h can be shattered by f.

The VC dimension has utility in statistical learning theory, because it can predict a probabilistic upper bound on the test error of a classification model.

Page 15: Support Vector Machines  &  Kernel Machines


Upper Bound on Error

In our case an upper bound on the expected error in terms of the training error is given by (Vapnik, 1995):

For all δ > 0, with probability at least 1 − δ, for all f ∈ F:

$$R[f] \le R_{\mathrm{emp}}[f] + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\delta}}{n}}$$
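Read literally, the bound says the expected risk exceeds the empirical risk by at most a confidence term that grows with the VC dimension h and shrinks with the sample size n. A small helper evaluating that term (a sketch of the formula above, not code from the talk):

```python
import numpy as np

def vc_confidence(h, n, delta):
    """sqrt( (h*(ln(2n/h) + 1) + ln(4/delta)) / n ) -- the capacity term in the VC bound."""
    return np.sqrt((h * (np.log(2.0 * n / h) + 1.0) + np.log(4.0 / delta)) / n)

# The term shrinks as n grows (for fixed h and delta):
for n in (100, 1000, 10000):
    print(n, vc_confidence(h=3, n=n, delta=0.05))
```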

Page 16: Support Vector Machines  &  Kernel Machines


Theorem: VC Dimension in R^n

The VC dimension of the set of oriented hyperplanes in R^n is n+1, since we can always choose n+1 points and then choose one of the points as origin such that the position vectors of the remaining n points are linearly independent; but we can never choose n+2 such points. (Anthony and Biggs, 1995)

Page 17: Support Vector Machines  &  Kernel Machines


Structural Risk Minimization

Taking too many training points into account, the model may be "too tight" and predict poorly on new test points; too few may not be enough to learn from.

One way to avoid the overfitting dilemma is to limit the complexity of the function class F from which we choose f.

Intuition: a "simple" (e.g. linear) function that explains most of the data is preferable to a complex one (Occam's razor).

Page 18: Support Vector Machines  &  Kernel Machines


(Figure from "An Introduction to Kernel-Based Learning Algorithms")

Page 19: Support Vector Machines  &  Kernel Machines


The Support Vector Machine: Linear Case

In a linearly separable dataset there is some choice of w and b (which represent a hyperplane) such that:

$$y_i (w \cdot x_i + b) > 0$$

Because the set of training data is finite, there is a whole family of such hyperplanes. We would like to maximize the distance (margin) of each class's points from the separating plane. We can scale w and b such that:

$$y_i (w \cdot x_i + b) \ge 1$$

Page 20: Support Vector Machines  &  Kernel Machines


SVM – Linear Case

Linear separating hyperplanes. The support vectors (circled) are the ones used to find the hyperplane.

Page 21: Support Vector Machines  &  Kernel Machines


Important observations

Only a small part of the training set is used to build the hyperplane (the support vectors).

At least one point on each side of the hyperplane achieves equality:

$$y_i (w \cdot x_i + b) = 1$$

For two such opposite points x_k and x_l with minimal distance to the plane:

$$w \cdot x_k + b = +1, \qquad w \cdot x_l + b = -1$$

$$\frac{w}{\|w\|} \cdot (x_k - x_l) = \frac{2}{\|w\|} = \mathrm{dist}(x_k, \text{hyperplane}) + \mathrm{dist}(x_l, \text{hyperplane})$$

Page 22: Support Vector Machines  &  Kernel Machines


Reformulating as a quadratic optimization problem

This means that maximizing the distance is the same as minimizing ½‖w‖²:

$$\min_{w,\, b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1$$
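Since this is an ordinary convex quadratic program, one way to illustrate it is to hand it to a generic solver. The sketch below assumes the cvxpy package and a linearly separable (X, y); it is only an illustration of the primal formulation, not the dual approach developed on the next slides:

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Minimize (1/2)||w||^2  s.t.  y_i (w.x_i + b) >= 1."""
    n, p = X.shape
    w = cp.Variable(p)
    b = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# Tiny separable example with arbitrary points.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print(w, b)
```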

Page 23: Support Vector Machines  &  Kernel Machines


Solving the SVM

We can solve this by introducing Lagrange multipliers α_i to obtain the Lagrangian, which should be minimized with respect to w and b and maximized with respect to α_i (Karush-Kuhn-Tucker conditions):

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \big( y_i (w \cdot x_i + b) - 1 \big)$$

Page 24: Support Vector Machines  &  Kernel Machines


Solving the SVM Cont.

A little manipulation leads to the requirements:

$$\sum_{i=1}^{N} \alpha_i y_i = 0 \qquad \text{and} \qquad w = \sum_{i=1}^{N} \alpha_i y_i x_i$$

Note: we expect most α_i to be zero; those which are not correspond to the support vectors.

Page 25: Support Vector Machines  &  Kernel Machines


The Dual Problem

$$\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j\, (x_i \cdot x_j)$$

$$\text{s.t.} \quad \alpha_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$
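For a small dataset the dual can be solved with a general-purpose constrained optimizer; the sketch below uses scipy's SLSQP (real SVM implementations use specialized solvers such as SMO). The support vectors are the points whose α_i is noticeably above zero, and w and b are recovered from the conditions on the previous slide:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y):
    n = len(y)
    G = (y[:, None] * X) @ (y[:, None] * X).T          # G_ij = y_i y_j x_i.x_j

    def neg_dual(alpha):                               # minimize the negative of the dual
        return 0.5 * alpha @ G @ alpha - alpha.sum()

    res = minimize(
        neg_dual,
        x0=np.zeros(n),
        method="SLSQP",
        bounds=[(0.0, None)] * n,                            # alpha_i >= 0
        constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
    )
    alpha = res.x
    w = ((alpha * y)[:, None] * X).sum(axis=0)         # w = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                                  # support vectors: nonzero alpha
    b = np.mean(y[sv] - X[sv] @ w)                     # from y_i (w.x_i + b) = 1 on the SVs
    return alpha, w, b

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w, b = svm_dual(X, y)
print(alpha, w, b)
```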

Page 26: Support Vector Machines  &  Kernel Machines


SVM – Non-Linear Case

The dataset is not always linearly separable!

Page 27: Support Vector Machines  &  Kernel Machines


Page 28: Support Vector Machines  &  Kernel Machines


Mapping to a higher-dimensional feature space F

We need a function Φ(x) = x′ that maps x to a higher-dimensional feature space F:

$$\Phi: \mathbb{R}^N \to F, \qquad x \mapsto \Phi(x)$$

Page 29: Support Vector Machines  &  Kernel Machines


Mapping to a higher-dimensional feature space F

Pro: in many problems we can separate linearly once the feature space has higher dimension.

Con: mapping to a higher dimension is computationally complex! "The curse of dimensionality" (in statistics) tells us we would need exponentially more data samples.

Is that really so?

Page 30: Support Vector Machines  &  Kernel Machines


Mapping to a higher-dimensional feature space F

Statistical learning theory tells us that learning in F can be simpler if one uses low-complexity decision rules (such as a linear classifier).

In short, it is not the dimensionality but the complexity of the function class that matters.

Fortunately, for some feature spaces and their mapping Φ we can use a trick!

Page 31: Support Vector Machines  &  Kernel Machines


The "Kernel Trick"

A kernel function corresponds to mapping data vectors to a feature space of higher dimension (like the Φ we are looking for).

Some kernel functions have a unique property: they can be used to directly calculate the scalar product in the feature space.

Page 32: Support Vector Machines  &  Kernel Machines


Kernel Trick Example

Given the following feature map Φ, we take x and y to be vectors in R² and see how the kernel function K(x, y) can be calculated using the dot product Φ(x) · Φ(y):

$$\Phi: \mathbb{R}^2 \to \mathbb{R}^3, \qquad (x_1, x_2) \mapsto \left(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\right)$$

Page 33: Support Vector Machines  &  Kernel Machines


Working it out: Φ(x) · Φ(y) = x_1² y_1² + 2 x_1 x_2 y_1 y_2 + x_2² y_2² = (x_1 y_1 + x_2 y_2)² = (x · y)² = k(x, y).

Conclusion: we do not have to calculate Φ every time to calculate k(x, y)! It is a straightforward dot-product calculation on x and y.
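A quick numerical check of this claim, using the map Φ(x) = (x₁², √2·x₁x₂, x₂²) from the previous slide, for which Φ(x)·Φ(y) = (x·y)²:

```python
import numpy as np

def phi(v):
    """Explicit feature map R^2 -> R^3 from the example."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def k(x, y):
    """Kernel computed directly in input space: (x.y)^2."""
    return (x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(phi(x) @ phi(y))   # 1.0
print(k(x, y))           # 1.0 -- same value, without ever forming phi(x)
```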

Page 34: Support Vector Machines  &  Kernel Machines


Moving back to SVM in the higher dimension

The Lagrangian will be:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \big( y_i (w \cdot \Phi(x_i) + b) - 1 \big)$$

At the optimal point ("saddle point equations"):

$$\frac{\partial L}{\partial b} = 0 \qquad \text{and} \qquad \frac{\partial L}{\partial w} = 0$$

Which translate to:

$$\sum_{i=1}^{N} \alpha_i y_i = 0 \qquad \text{and} \qquad w = \sum_{i=1}^{N} \alpha_i y_i \Phi(x_i)$$

Page 35: Support Vector Machines  &  Kernel Machines


And the optimization problem

$$\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$

$$\text{s.t.} \quad \alpha_i \ge 0,\ i = 1, \ldots, n \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$

Page 36: Support Vector Machines  &  Kernel Machines


The Decision Function

Solving the (dual) optimization problem leads to the non-linear decision function:

$$f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{N} y_i \alpha_i\, \big(\Phi(x) \cdot \Phi(x_i)\big) + b \right) = \operatorname{sign}\!\left( \sum_{i=1}^{N} y_i \alpha_i\, k(x, x_i) + b \right)$$
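In code, evaluating this decision function only requires the training points with nonzero α_i (the support vectors), their labels, the kernel, and b; a minimal sketch, assuming those quantities have already been obtained from the dual (the numeric values below are purely illustrative):

```python
import numpy as np

def decision_function(x, support_X, support_y, alpha, b, kernel):
    """f(x) = sign( sum_i y_i * alpha_i * k(x, x_i) + b )."""
    s = sum(a * y_i * kernel(x, x_i)
            for a, y_i, x_i in zip(alpha, support_y, support_X))
    return np.sign(s + b)

# Hypothetical support vectors, coefficients, and bias, with the degree-2
# polynomial kernel from the earlier example.
poly2 = lambda u, v: (u @ v) ** 2
support_X = np.array([[1.0, 1.0], [-1.0, 0.5]])
support_y = np.array([+1.0, -1.0])
alpha = np.array([0.3, 0.3])
b = 0.0
print(decision_function(np.array([2.0, 1.5]), support_X, support_y, alpha, b, poly2))
```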

Page 37: Support Vector Machines  &  Kernel Machines


The non-separable case

Until now we considered the separable case, which is consistent with zero empirical error.

For noisy data this may not be the minimum of the expected risk (overfitting!).

Solution: use "slack variables" ξ_i to relax the hard margin constraints:

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0,\ i = 1, \ldots, n$$

Page 38: Support Vector Machines  &  Kernel Machines


We now also have to minimize an upper bound on the empirical risk:

$$\min_{w,\, b,\, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i$$
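This soft-margin problem is what off-the-shelf SVM libraries solve; in scikit-learn, for example, SVC exposes C as the trade-off between a large margin and the slack penalty. A minimal sketch with arbitrary toy data and parameter values:

```python
import numpy as np
from sklearn.svm import SVC

# Noisy, not perfectly separable toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
               rng.normal(-1.0, 1.0, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

# Small C -> more slack tolerated (softer margin); large C -> closer to hard margin.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_.shape)   # the support vectors found by the solver
```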

Page 39: Support Vector Machines  &  Kernel Machines


And the dual problem

$$\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$

$$\text{s.t.} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, n \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$

Page 40: Support Vector Machines  &  Kernel Machines


Examples of Kernel Functions

Polynomials

Gaussians

Sigmoids

Radial Basis Functions

… (a few of these are sketched in code below)
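A few of these, written as plain functions of two vectors; the hyperparameters (degree, offset, width, slope) are placeholders to be tuned for the task at hand:

```python
import numpy as np

def polynomial_kernel(x, y, d=3, c=1.0):
    """k(x, y) = (x.y + c)^d"""
    return (x @ y + c) ** d

def gaussian_rbf_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """k(x, y) = tanh(kappa * x.y + theta); a valid kernel only for some parameter choices."""
    return np.tanh(kappa * (x @ y) + theta)
```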

Page 41: Support Vector Machines  &  Kernel Machines


Example of an SV classifier found using the RBF kernel k(x, x′) = exp(−‖x − x′‖²). Here the input space is X = [−1, 1]².

Taken from Bill Freeman’s Notes

Page 42: Support Vector Machines  &  Kernel Machines


Part 2: Gender Classification with SVMs

Page 43: Support Vector Machines  &  Kernel Machines


The Goal

Learning to classify face pictures by gender (Male/Female) when only the facial features appear (almost no hair).

Page 44: Support Vector Machines  &  Kernel Machines


The experiment

Face pictures from the FERET database were processed to be consistent with the requirements of the experiment.

Page 45: Support Vector Machines  &  Kernel Machines


The experiment

SVM performance was compared with:

– Linear classifier
– Quadratic classifier
– Fisher Linear Discriminant
– Nearest Neighbor

Page 46: Support Vector Machines  &  Kernel Machines


The experiment Cont.

The experiment was conducted on two sets of data, high- and low-resolution versions of the same pictures, and a performance comparison was made.

The goal was to learn the minimal data required for a classifier to classify gender.

The performance of 30 humans was used for comparison as well.

The data: 1755 pictures, 711 females and 1044 males.

Page 47: Support Vector Machines  &  Kernel Machines


Training Data

80 by 40 pixel images for the "high resolution" set.

21 by 12 pixel images for the thumbnails.

Each classifier was estimated with 5-fold cross-validation (4/5 training and 1/5 testing).
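The protocol described above corresponds to a standard 5-fold cross-validation loop. The sketch below is hypothetical: load_faces and its random stand-in data are placeholders, not the study's actual pipeline, and the SVC parameters are arbitrary; it only illustrates the 4/5-train, 1/5-test evaluation:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical loader: X holds flattened 21x12 thumbnails (252 pixels per row),
# y holds +1 for male and -1 for female. Random stand-in data, not FERET.
def load_faces():
    rng = np.random.default_rng(0)
    X = rng.random((1755, 21 * 12))
    y = np.where(rng.random(1755) < 1044 / 1755, 1, -1)
    return X, y

X, y = load_faces()
clf = SVC(kernel="rbf", C=1.0, gamma="scale")      # arbitrary kernel/parameter choice
scores = cross_val_score(clf, X, y, cv=5)          # 5-fold: train on 4/5, test on 1/5
print(scores.mean())
```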

Page 48: Support Vector Machines  &  Kernel Machines


Support Faces

Page 49: Support Vector Machines  &  Kernel Machines


Results on Thumbnails

Page 50: Support Vector Machines  &  Kernel Machines


Human Error Rate

Page 51: Support Vector Machines  &  Kernel Machines


Human vs. SVM

Page 52: Support Vector Machines  &  Kernel Machines


Can you tell?

Page 53: Support Vector Machines  &  Kernel Machines


Can you tell?

Answer: F-M-M-F-M