Support Vector Machines & Kernel Machines
IP Seminar 2008, IDC Herzliya
Ohad Hageby
Introduction To Support Vector Machines (SVM)

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers.
(from Wikipedia)
Introduction Continued

Often we are interested in classifying data as part of a machine-learning process.
Each data point will be represented by a p-dimensional vector (a list of p numbers).
Each of these data points belongs to only one of two classes.
Training Data

We want to estimate a function f: R^N → {+1, −1}, using input-output training data pairs generated independently and identically distributed according to an unknown P(x, y).
If f(x_i) = −1, x_i is in class 1.
If f(x_i) = +1, x_i is in class 2.
The machine

The machine's task is to learn the mapping of x_i to y_i. It is defined by a set of possible mappings: x ↦ f(x).
Expected Error

The test examples are assumed to follow the same probability distribution P(x, y) as the training data.
The best function f we could have is one minimizing the expected error (risk):

$$R[f] = \int l(f(x), y)\, dP(x, y)$$
Here l denotes the loss function ("0/1 loss"):

$$l(f(x), y) = \begin{cases} 0, & z \ge 0 \\ 1, & \text{otherwise} \end{cases} \qquad \text{where } z = y\, f(x)$$

A common loss function is the squared loss:

$$l(f(x), y) = (f(x) - y)^2$$
Empirical Risk

Unfortunately the risk cannot be minimized directly, because the probability distribution is unknown.
The "empirical risk" is defined to be just the measured mean error rate on the training set (for a fixed, finite number of observations):

$$R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i)$$
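As an illustration (not from the slides), here is a minimal Python sketch that measures the empirical risk of a fixed linear scorer under the 0/1 loss; the toy points, labels, and the scorer f are hypothetical.

```python
import numpy as np

def zero_one_loss(fx, y):
    # 0/1 loss: 0 when y*f(x) >= 0 (the sign is correct), 1 otherwise.
    return np.where(y * fx >= 0, 0.0, 1.0)

def empirical_risk(f, X, y):
    # R_emp[f] = (1/n) * sum_i l(f(x_i), y_i)
    fx = np.array([f(x) for x in X])
    return zero_one_loss(fx, y).mean()

# Hypothetical linear scorer f(x) = w.x + b on four toy points.
w, b = np.array([1.0, -1.0]), 0.0
f = lambda x: np.dot(w, x) + b
X = np.array([[2.0, 1.0], [0.0, 1.0], [1.0, 3.0], [3.0, 1.0]])
y = np.array([1, -1, 1, 1])
print(empirical_risk(f, X, y))  # 0.25: one of four points is misclassified
```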
The overfitting dilemma

It is possible to give conditions on the learning machine which will ensure that as n → ∞, R_emp converges toward the expected risk R.
For small sample sizes, however, overfitting might occur.
The overfitting dilemma cont.

[Figure from "An Introduction to Kernel-Based Learning Algorithms"]
VC Dimension

A concept in "VC theory", introduced by Vladimir Vapnik and Alexey Chervonenkis.
A measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.
Shattering Example (from Wikipedia)

For example, consider a straight line as the classification model: the model used by a perceptron. The line should separate positive data points from negative data points. When there are 3 points that are not collinear, the line can shatter them.
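To make the claim concrete, the following sketch (my addition, not from the slides) checks by brute force that a line realizes every labeling of three non-collinear points; the chosen points and the use of scikit-learn's SVC are illustrative assumptions.

```python
from itertools import product
import numpy as np
from sklearn.svm import SVC

# Three non-collinear points in R^2 (an arbitrary choice).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

for labels in product([-1, 1], repeat=3):
    y = np.array(labels)
    if len(set(labels)) == 1:
        continue  # a constant labeling is trivially separable by any line
    clf = SVC(kernel="linear", C=1e6).fit(X, y)  # near-hard-margin line
    assert (clf.predict(X) == y).all()           # zero errors on this labeling
print("All 8 labelings are realizable: the 3 points are shattered.")
```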
Shattering

A classification model f with some parameter vector θ is said to shatter a set of data points (x_1, x_2, …, x_n) if, for all assignments of labels to those points, there exists a θ such that the model f makes no errors when evaluating that set of data points.
Shattering Continued

The VC dimension of a model f is the maximum h such that some set of data points of cardinality h can be shattered by f.
The VC dimension has utility in statistical learning theory, because it can predict a probabilistic upper bound on the test error of a classification model.
Upper Bound on Error

In our case the upper bound on the expected error is given by (Vapnik, 1995): with probability at least 1 − δ, for all δ > 0 and f ∊ F,

$$R[f] \le R_{emp}[f] + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\delta}}{n}}$$

where h is the VC dimension of the function class F and n is the number of training points.
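Read as code, the capacity term of the bound looks like this; a minimal sketch assuming the reconstruction above, with h, n, and δ as inputs.

```python
import numpy as np

def vc_confidence(h, n, delta):
    # The square-root capacity term added to R_emp in Vapnik's bound.
    return np.sqrt((h * (np.log(2 * n / h) + 1) + np.log(4 / delta)) / n)

# e.g. oriented lines in R^2 have VC dimension h = 3 (next slide's theorem):
print(vc_confidence(h=3, n=1000, delta=0.05))  # ~0.16
```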
Theorem: VC Dimension in R^n

The VC dimension of the set of oriented hyperplanes in R^n is n+1, since we can always choose n+1 points, and then choose one of the points as origin such that the position vectors of the remaining n points are linearly independent; but we can never choose n+2 such points. (Anthony and Biggs, 1995)
Structural Risk Minimization

A model fitted too tightly to the training points may predict poorly on new test points; too loose a fit may not be enough to learn.
One way to avoid the overfitting dilemma is to limit the complexity of the function class F from which we choose the function f.
Intuition: a "simple" (e.g. linear) function that explains most of the data is preferable to a complex one (Occam's razor).
[Figure from "An Introduction to Kernel-Based Learning Algorithms"]
The Support Vector Machine – Linear Case

In a linearly separable dataset there is some choice of w and b (which represent a hyperplane) such that

$$y_i (w \cdot x_i + b) > 0$$

Because the set of training data is finite, there is a family of such hyperplanes.
We would like to maximize the distance (margin) of each class's points from the separating plane.
We can scale w and b such that

$$y_i (w \cdot x_i + b) \ge 1$$
SVM – Linear Case

[Figure: linear separating hyperplanes. The support vectors are the ones used to find the hyperplane (circled).]
Important observations

Only a small part of the training set is used to build the hyperplane: the support vectors.
At least one point on each side of the hyperplane achieves the equality

$$y_i (w \cdot x_i + b) = 1$$

For two such opposite points x_k and x_l with minimal distance, projecting their difference onto the unit normal w/‖w‖ gives

$$\frac{w}{\|w\|} \cdot (x_k - x_l) = \frac{w \cdot x_k + b}{\|w\|} - \frac{w \cdot x_l + b}{\|w\|} = \frac{2}{\|w\|}$$

so that

$$dist(x_k, \text{hyperplane}) + dist(x_l, \text{hyperplane}) = \frac{2}{\|w\|}$$
Reformulating as a quadratic optimization problem

This means that maximizing the distance is the same as minimizing ½‖w‖²:

$$\text{Minimize } \frac{\|w\|^2}{2} \quad \text{s.t. } y_i (w \cdot x_i + b) \ge 1$$
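As a hedged illustration of this optimization (my addition, not part of the slides), scikit-learn's SVC with a very large C approximates the hard-margin problem; the toy data are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny separable toy set (made up for illustration).
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.0], [-1.5, -0.5]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin problem above.
clf = SVC(kernel="linear", C=1e9).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```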
Solving the SVM

We can solve by introducing Lagrange multipliers α_i to obtain the Lagrangian, which should be minimized with respect to w and b and maximized with respect to α_i (Karush-Kuhn-Tucker conditions):

$$L(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i (w \cdot x_i + b) - 1 \right)$$
Solving the SVM Cont.

A little manipulation leads to the requirement that

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$

Note! We expect most α_i to be zero; those which aren't represent the support vectors.
The dual problem

$$\text{Maximize } \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

$$\text{s.t. } \alpha_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$
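The dual is a standard quadratic program, so a generic QP solver can handle it. Below is a sketch using cvxopt (one possible solver choice; the toy data are hypothetical) that maps the dual directly onto cvxopt's minimize ½αᵀPα + qᵀα form.

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy separable data; the points are made up for illustration.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# cvxopt minimizes (1/2) a^T P a + q^T a, so we negate the objective:
# P_ij = y_i y_j (x_i . x_j), q = -1.
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))            # -alpha_i <= 0  <=>  alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))      # equality constraint: y^T alpha = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

# Support vectors have alpha_i > 0; recover w and b from them.
w = (alpha * y) @ X
sv = alpha > 1e-6
b_val = np.mean(y[sv] - X[sv] @ w)
print("alpha =", alpha.round(4), "w =", w, "b =", b_val)
```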
SVM – Non-linear Case

The dataset is not always linearly separable!
Mapping F to higher dimension

We need a function Φ(x) = x′ to map x to a higher-dimensional feature space F:

$$\Phi : R^N \to F, \quad x \mapsto \Phi(x)$$
Mapping F to higher dimension

Pro: In many problems we can linearly separate when the feature space is of higher dimension.
Con: Mapping to a higher dimension is computationally complex! "The curse of dimensionality" (in statistics) tells us we will need to sample exponentially more data!
Is that really so?
Mapping F to higher dimension

Statistical learning theory tells us that learning in F can be simpler if one uses low-complexity decision rules (like a linear classifier).
In short, it is not the dimensionality but the complexity of the function class that matters.
Fortunately, for some feature spaces and their mapping Φ we can use a trick!
The “Kernel Trick”

Kernel functions map data vectors to a feature space of higher dimension (like the Φ we are looking for).
Some kernel functions have a unique property: they can be used to calculate the scalar product in the feature space directly.
Kernel Trick Example

Given the following mapping Φ, we take vectors x and y in R² and see how we can calculate the kernel function K(x, y) using the dot product Φ(x) · Φ(y):

$$\Phi : R^2 \to R^3, \quad (x_1, x_2) \mapsto (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$$
Indeed,

$$\Phi(x) \cdot \Phi(y) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x \cdot y)^2 = K(x, y)$$

Conclusion: We do not have to calculate Φ every time to calculate K(x, y)! It is a straightforward dot-product calculation of x and y.
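A quick numerical check of this identity (my own sketch, with arbitrary example vectors):

```python
import numpy as np

def phi(v):
    # The explicit map from the slide: R^2 -> R^3.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def k(x, y):
    # The same quantity computed directly in the input space.
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(y)))  # 16.0
print(k(x, y))                 # 16.0 -- identical, no explicit mapping needed
```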
Moving back to SVM in the higher dimension

The Lagrangian will be:

$$L(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i (w \cdot \Phi(x_i) + b) - 1 \right)$$

At the optimal point, the "saddle point equations" hold:

$$\frac{\partial L}{\partial b} = 0 \quad \text{and} \quad \frac{\partial L}{\partial w} = 0$$

which translate to:

$$\sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad w = \sum_{i=1}^{N} \alpha_i y_i \Phi(x_i)$$
And the optimization problem

$$\text{Maximize } \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$

$$\text{s.t. } \alpha_i \ge 0,\ i = 1 \dots n \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$
The Decision Function

Solving the (dual) optimization problem leads to the non-linear decision function

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i\, (\Phi(x) \cdot \Phi(x_i)) + b \right) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i\, k(x, x_i) + b \right)$$
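In code, the decision function is a direct transcription of the formula; a minimal sketch where support_X, support_y, alpha, and b are assumed to come from a solved dual (the names are hypothetical):

```python
import numpy as np

def decision(x, support_X, support_y, alpha, b, kernel):
    # f(x) = sign( sum_i alpha_i * y_i * k(x, x_i) + b )
    s = sum(a * yi * kernel(x, xi)
            for a, yi, xi in zip(alpha, support_y, support_X))
    return np.sign(s + b)

# Usage with a quadratic kernel (cf. the earlier kernel-trick example):
# decision(x_new, support_X, support_y, alpha, b,
#          kernel=lambda u, v: np.dot(u, v) ** 2)
```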
The non-separable case

Until now we considered the separable case, which is consistent with zero empirical error.
For noisy data this may not be the minimum of the expected risk (overfitting!).
Solution: use "slack variables" ξ_i to relax the hard-margin constraints:

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,\ i = 1 \dots n$$
We now also have to minimize an upper bound on the empirical risk:

$$\text{Minimize } \frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi_i \quad \text{s.t. } y_i (w \cdot x_i + b) \ge 1 - \xi_i$$
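To see the role of C, here is an illustrative sketch (my addition; the noisy data are synthetic): small C tolerates more margin violations, while large C approaches the hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Noisy, non-separable synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100) > 0, 1, -1)

# Small C tolerates margin violations (more slack, wider margin);
# large C penalizes them heavily, approaching the hard margin.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print("C =", C, "-> support vectors:", int(clf.n_support_.sum()))
```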
And the dual problem

$$\text{Maximize } \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$

$$\text{s.t. } 0 \le \alpha_i \le C,\ i = 1 \dots n \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$
Example Kernel Functions

– Polynomials
– Gaussians
– Sigmoids
– Radial Basis Functions
– …
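For concreteness, here is one way these kernels are commonly written in Python; the parameter names and defaults are illustrative choices, not from the slides.

```python
import numpy as np

def polynomial(x, y, degree=3, c=1.0):
    return (np.dot(x, y) + c) ** degree

def gaussian_rbf(x, y, sigma=1.0):
    # The Gaussian is the most common radial basis function kernel.
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid(x, y, kappa=1.0, theta=0.0):
    # Only a valid (positive semi-definite) kernel for some parameter values.
    return np.tanh(kappa * np.dot(x, y) + theta)
```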
Example of an SV classifier found using an RBF kernel k(x, x′) = exp(−‖x − x′‖²). Here the input space is X = [−1, 1]².
(Taken from Bill Freeman's notes)
Part 2: Gender Classification with SVMs
The Goal

Learning to classify pictures according to their gender (Male/Female) when only the facial features appear (almost no hair).
The experiment

Faces were processed from FERET database pictures to be consistent with the requirements of the experiment.
The experiment

SVM performance compared with:
– Linear classifier
– Quadratic classifier
– Fisher Linear Discriminant
– Nearest Neighbor
The experiment Cont.

The experiment was conducted on two sets of data, high- and low-resolution versions of the same pictures, and a performance comparison was made.
The goal was to learn the minimal data required for a classifier to classify gender.
The performance of 30 humans was also measured for comparison.
The data: 1755 pictures, 711 females and 1044 males.
Training Data

80 by 40 pixel images for the "high resolution" set.
21 by 12 pixels for the thumbnails.
Each classifier was estimated with 5-fold cross-validation (4/5 training and 1/5 testing).
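The 5-fold protocol can be sketched as follows; random placeholder data stand in for the FERET images, and the RBF kernel choice is my assumption, not the paper's.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Random placeholder data standing in for the flattened thumbnails:
# one row per picture (21*12 = 252 gray values), one 0/1 gender label each.
rng = np.random.default_rng(0)
X = rng.random((200, 21 * 12))
y = rng.integers(0, 2, size=200)

clf = SVC(kernel="rbf", gamma="scale")     # kernel choice is an assumption
scores = cross_val_score(clf, X, y, cv=5)  # 4/5 training, 1/5 testing per fold
print("mean accuracy:", scores.mean())
```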
Support Faces
Results on Thumbnails
Human Error Rate
Human vs SVM
Can you tell?
Can you tell?
Answer: F-M-M-F-M