Support Vector Machines & Kernel Machines
IP Seminar 2008, IDC Herzliya
Ohad Hageby
Introduction To Support Vector Machines (SVM)

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers.
(from Wikipedia)
Introduction Continued

Often we are interested in classifying data as part of a machine-learning process.
Each data point will be represented by a p-dimensional vector (a list of p numbers).
Each of these data points belongs to only one of two classes.
Training Data

We want to estimate a function f: R^N → {+1, −1}, using input-output training data pairs generated independently and identically distributed according to an unknown P(x, y).
If f(x_i) = −1, x_i is in class 1.
If f(x_i) = +1, x_i is in class 2.
The machine

The machine's task is to learn the mapping of x_i to y_i. It is defined by a set of possible mappings: x ↦ f(x).
Expected Error

The test examples are assumed to follow the same probability distribution P(x, y) as the training data.
The best function f we could have is one minimizing the expected error (risk):

$$R[f] = \int l(f(x), y)\, dP(x, y)$$
Here l denotes the loss function ("0/1 loss"):

$$l(f(x), y) = \begin{cases} 0, & z \ge 0 \\ 1, & \text{otherwise} \end{cases} \qquad \text{where } z = y\, f(x)$$

A common loss function is the squared loss:

$$l(f(x), y) = (f(x) - y)^2$$
Empirical Risk

Unfortunately the risk cannot be minimized directly, because the probability distribution is unknown.
The "empirical risk" is defined to be just the measured mean error rate on the training set (for a fixed, finite number of observations):

$$R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i)$$
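As an illustration (not from the slides), here is a minimal Python sketch that measures the empirical risk of a fixed linear scorer under the 0/1 loss; the toy points, labels, and the scorer f are hypothetical.

```python
import numpy as np

def zero_one_loss(fx, y):
    # 0/1 loss: 0 when y*f(x) >= 0 (the sign is correct), 1 otherwise.
    return np.where(y * fx >= 0, 0.0, 1.0)

def empirical_risk(f, X, y):
    # R_emp[f] = (1/n) * sum_i l(f(x_i), y_i)
    fx = np.array([f(x) for x in X])
    return zero_one_loss(fx, y).mean()

# Hypothetical linear scorer f(x) = w.x + b on four toy points.
w, b = np.array([1.0, -1.0]), 0.0
f = lambda x: np.dot(w, x) + b
X = np.array([[2.0, 1.0], [0.0, 1.0], [1.0, 3.0], [3.0, 1.0]])
y = np.array([1, -1, 1, 1])
print(empirical_risk(f, X, y))  # 0.25: one of four points is misclassified
```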
The overfitting dilemma

It is possible to give conditions on the learning machine which will ensure that as n → ∞, R_emp converges toward the expected risk R.
For small sample sizes, however, overfitting might occur.
The overfitting dilemma cont.

[Figure from "An Introduction to Kernel-Based Learning Algorithms"]
VC Dimension

A concept in "VC theory", introduced by Vladimir Vapnik and Alexey Chervonenkis.
A measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.
Shattering Example (from Wikipedia)

For example, consider a straight line as the classification model: the model used by a perceptron. The line should separate positive data points from negative data points. When there are 3 points that are not collinear, the line can shatter them.
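To make the claim concrete, the following sketch (my addition, not from the slides) checks by brute force that a line realizes every labeling of three non-collinear points; the chosen points and the use of scikit-learn's SVC are illustrative assumptions.

```python
from itertools import product
import numpy as np
from sklearn.svm import SVC

# Three non-collinear points in R^2 (an arbitrary choice).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

for labels in product([-1, 1], repeat=3):
    y = np.array(labels)
    if len(set(labels)) == 1:
        continue  # a constant labeling is trivially separable by any line
    clf = SVC(kernel="linear", C=1e6).fit(X, y)  # near-hard-margin line
    assert (clf.predict(X) == y).all()           # zero errors on this labeling
print("All 8 labelings are realizable: the 3 points are shattered.")
```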
Shattering

A classification model f with some parameter vector θ is said to shatter a set of data points (x_1, x_2, …, x_n) if, for all assignments of labels to those points, there exists a θ such that the model f makes no errors when evaluating that set of data points.
Shattering Continued

The VC dimension of a model f is the maximum h such that some set of data points of cardinality h can be shattered by f.
The VC dimension has utility in statistical learning theory, because it can predict a probabilistic upper bound on the test error of a classification model.
Upper Bound on Error

In our case the upper bound on the expected error is given by (Vapnik, 1995): with probability at least 1 − δ, for all δ > 0 and f ∊ F,

$$R[f] \le R_{emp}[f] + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\delta}}{n}}$$

where h is the VC dimension of the function class F and n is the number of training points.
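Read as code, the capacity term of the bound looks like this; a minimal sketch assuming the reconstruction above, with h, n, and δ as inputs.

```python
import numpy as np

def vc_confidence(h, n, delta):
    # The square-root capacity term added to R_emp in Vapnik's bound.
    return np.sqrt((h * (np.log(2 * n / h) + 1) + np.log(4 / delta)) / n)

# e.g. oriented lines in R^2 have VC dimension h = 3 (next slide's theorem):
print(vc_confidence(h=3, n=1000, delta=0.05))  # ~0.16
```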
Theorem: VC Dimension in R^n

The VC dimension of the set of oriented hyperplanes in R^n is n+1, since we can always choose n+1 points, and then choose one of the points as origin such that the position vectors of the remaining n points are linearly independent; but we can never choose n+2 such points. (Anthony and Biggs, 1995)
Structural Risk Minimization

A model fitted too tightly to the training points may predict poorly on new test points; too loose a fit may not be enough to learn.
One way to avoid the overfitting dilemma is to limit the complexity of the function class F from which we choose the function f.
Intuition: a "simple" (e.g. linear) function that explains most of the data is preferable to a complex one (Occam's razor).
[Figure from "An Introduction to Kernel-Based Learning Algorithms"]
The Support Vector Machine – Linear Case

In a linearly separable dataset there is some choice of w and b (which represent a hyperplane) such that

$$y_i (w \cdot x_i + b) > 0$$

Because the set of training data is finite, there is a family of such hyperplanes.
We would like to maximize the distance (margin) of each class's points from the separating plane.
We can scale w and b such that

$$y_i (w \cdot x_i + b) \ge 1$$
SVM – Linear Case

[Figure: linear separating hyperplanes. The support vectors are the ones used to find the hyperplane (circled).]
Important observations

Only a small part of the training set is used to build the hyperplane: the support vectors.
At least one point on each side of the hyperplane achieves the equality

$$y_i (w \cdot x_i + b) = 1$$

For two such opposite points x_k and x_l with minimal distance, projecting their difference onto the unit normal w/‖w‖ gives

$$\frac{w}{\|w\|} \cdot (x_k - x_l) = \frac{w \cdot x_k + b}{\|w\|} - \frac{w \cdot x_l + b}{\|w\|} = \frac{2}{\|w\|}$$

so that

$$dist(x_k, \text{hyperplane}) + dist(x_l, \text{hyperplane}) = \frac{2}{\|w\|}$$
Reformulating as a quadratic optimization problem

This means that maximizing the distance is the same as minimizing ½‖w‖²:

$$\text{Minimize } \frac{\|w\|^2}{2} \quad \text{s.t. } y_i (w \cdot x_i + b) \ge 1$$
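As a hedged illustration of this optimization (my addition, not part of the slides), scikit-learn's SVC with a very large C approximates the hard-margin problem; the toy data are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny separable toy set (made up for illustration).
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.0], [-1.5, -0.5]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin problem above.
clf = SVC(kernel="linear", C=1e9).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```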
Solving the SVM

We can solve by introducing Lagrange multipliers α_i to obtain the Lagrangian, which should be minimized with respect to w and b and maximized with respect to α_i (Karush-Kuhn-Tucker conditions):

$$L(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i (w \cdot x_i + b) - 1 \right)$$
Solving the SVM Cont.

A little manipulation leads to the requirement that

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$

Note! We expect most α_i to be zero; those which aren't represent the support vectors.
The dual problem

$$\text{Maximize } \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

$$\text{s.t. } \alpha_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$
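The dual is a standard quadratic program, so a generic QP solver can handle it. Below is a sketch using cvxopt (one possible solver choice; the toy data are hypothetical) that maps the dual directly onto cvxopt's minimize ½αᵀPα + qᵀα form.

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy separable data; the points are made up for illustration.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# cvxopt minimizes (1/2) a^T P a + q^T a, so we negate the objective:
# P_ij = y_i y_j (x_i . x_j), q = -1.
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))            # -alpha_i <= 0  <=>  alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))      # equality constraint: y^T alpha = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

# Support vectors have alpha_i > 0; recover w and b from them.
w = (alpha * y) @ X
sv = alpha > 1e-6
b_val = np.mean(y[sv] - X[sv] @ w)
print("alpha =", alpha.round(4), "w =", w, "b =", b_val)
```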
SVM – Non-linear Case

The dataset is not always linearly separable!
Mapping F to higher dimension

We need a function Φ(x) = x′ to map x to a higher-dimensional feature space F:

$$\Phi : R^N \to F, \quad x \mapsto \Phi(x)$$
Mapping F to higher dimension

Pro: In many problems we can linearly separate when the feature space is of higher dimension.
Con: Mapping to a higher dimension is computationally complex! "The curse of dimensionality" (in statistics) tells us we will need to sample exponentially more data!
Is that really so?
Mapping F to higher dimension

Statistical learning theory tells us that learning in F can be simpler if one uses low-complexity decision rules (like a linear classifier).
In short, it is not the dimensionality but the complexity of the function class that matters.
Fortunately, for some feature spaces and their mapping Φ we can use a trick!
The “Kernel Trick”

Kernel functions map data vectors to a feature space of higher dimension (like the Φ we are looking for).
Some kernel functions have a unique property: they can be used to calculate the scalar product in the feature space directly.
Kernel Trick Example

Given the following mapping Φ, we take vectors x and y in R² and see how we can calculate the kernel function K(x, y) using the dot product Φ(x) · Φ(y):

$$\Phi : R^2 \to R^3, \quad (x_1, x_2) \mapsto (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$$
Indeed,

$$\Phi(x) \cdot \Phi(y) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x \cdot y)^2 = K(x, y)$$

Conclusion: We do not have to calculate Φ every time to calculate K(x, y)! It is a straightforward dot-product calculation of x and y.
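A quick numerical check of this identity (my own sketch, with arbitrary example vectors):

```python
import numpy as np

def phi(v):
    # The explicit map from the slide: R^2 -> R^3.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def k(x, y):
    # The same quantity computed directly in the input space.
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(y)))  # 16.0
print(k(x, y))                 # 16.0 -- identical, no explicit mapping needed
```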
Moving back to SVM in the higher dimension

The Lagrangian will be:

$$L(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i (w \cdot \Phi(x_i) + b) - 1 \right)$$

At the optimal point, the "saddle point equations" hold:

$$\frac{\partial L}{\partial b} = 0 \quad \text{and} \quad \frac{\partial L}{\partial w} = 0$$

which translate to:

$$\sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad w = \sum_{i=1}^{N} \alpha_i y_i \Phi(x_i)$$
And the optimization problem

$$\text{Maximize } \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$

$$\text{s.t. } \alpha_i \ge 0,\ i = 1 \dots n \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$
The Decision Function

Solving the (dual) optimization problem leads to the non-linear decision function

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i\, (\Phi(x) \cdot \Phi(x_i)) + b \right) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i\, k(x, x_i) + b \right)$$
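In code, the decision function is a direct transcription of the formula; a minimal sketch where support_X, support_y, alpha, and b are assumed to come from a solved dual (the names are hypothetical):

```python
import numpy as np

def decision(x, support_X, support_y, alpha, b, kernel):
    # f(x) = sign( sum_i alpha_i * y_i * k(x, x_i) + b )
    s = sum(a * yi * kernel(x, xi)
            for a, yi, xi in zip(alpha, support_y, support_X))
    return np.sign(s + b)

# Usage with a quadratic kernel (cf. the earlier kernel-trick example):
# decision(x_new, support_X, support_y, alpha, b,
#          kernel=lambda u, v: np.dot(u, v) ** 2)
```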
The non-separable case

Until now we considered the separable case, which is consistent with zero empirical error.
For noisy data this may not be the minimum of the expected risk (overfitting!).
Solution: use "slack variables" ξ_i to relax the hard-margin constraints:

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,\ i = 1 \dots n$$
We now also have to minimize an upper bound on the empirical risk:

$$\text{Minimize } \frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi_i \quad \text{s.t. } y_i (w \cdot x_i + b) \ge 1 - \xi_i$$
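To see the role of C, here is an illustrative sketch (my addition; the noisy data are synthetic): small C tolerates more margin violations, while large C approaches the hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Noisy, non-separable synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100) > 0, 1, -1)

# Small C tolerates margin violations (more slack, wider margin);
# large C penalizes them heavily, approaching the hard margin.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print("C =", C, "-> support vectors:", int(clf.n_support_.sum()))
```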
And the dual problem

$$\text{Maximize } \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$

$$\text{s.t. } 0 \le \alpha_i \le C,\ i = 1 \dots n \quad \text{and} \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$
Example Kernel Functions

– Polynomials
– Gaussians
– Sigmoids
– Radial Basis Functions
– …
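For concreteness, here is one way these kernels are commonly written in Python; the parameter names and defaults are illustrative choices, not from the slides.

```python
import numpy as np

def polynomial(x, y, degree=3, c=1.0):
    return (np.dot(x, y) + c) ** degree

def gaussian_rbf(x, y, sigma=1.0):
    # The Gaussian is the most common radial basis function kernel.
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid(x, y, kappa=1.0, theta=0.0):
    # Only a valid (positive semi-definite) kernel for some parameter values.
    return np.tanh(kappa * np.dot(x, y) + theta)
```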
Example of an SV classifier found using an RBF kernel k(x, x′) = exp(−‖x − x′‖²). Here the input space is X = [−1, 1]².
(Taken from Bill Freeman's notes)
Part 2: Gender Classification with SVMs
The Goal

Learning to classify pictures according to their gender (Male/Female) when only the facial features appear (almost no hair).
The experiment

Faces were processed from FERET database pictures to be consistent with the requirements of the experiment.
The experiment

SVM performance compared with:
– Linear classifier
– Quadratic classifier
– Fisher Linear Discriminant
– Nearest Neighbor
The experiment Cont.

The experiment was conducted on two sets of data, high- and low-resolution versions of the same pictures, and a performance comparison was made.
The goal was to learn the minimal data required for a classifier to classify gender.
The performance of 30 humans was also measured for comparison.
The data: 1755 pictures, 711 females and 1044 males.
Training Data

80 by 40 pixel images for the "high resolution" set.
21 by 12 pixels for the thumbnails.
Each classifier was estimated with 5-fold cross-validation (4/5 training and 1/5 testing).
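The 5-fold protocol can be sketched as follows; random placeholder data stand in for the FERET images, and the RBF kernel choice is my assumption, not the paper's.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Random placeholder data standing in for the flattened thumbnails:
# one row per picture (21*12 = 252 gray values), one 0/1 gender label each.
rng = np.random.default_rng(0)
X = rng.random((200, 21 * 12))
y = rng.integers(0, 2, size=200)

clf = SVC(kernel="rbf", gamma="scale")     # kernel choice is an assumption
scores = cross_val_score(clf, X, y, cv=5)  # 4/5 training, 1/5 testing per fold
print("mean accuracy:", scores.mean())
```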
Support Faces
Results on Thumbnails
Human Error Rate
Human vs SVM
Can you tell?
Can you tell?
Answer: F-M-M-F-M