
Page 1:

SVM: Support Vector Machines

Based on the Statistical Learning Theory of Vapnik, Chervonenkis, Burges, Scholkopf, Smola, Bartlett, Mendelson, Cristianini

Presented By: Tamer Salman

Page 2:

The addressed Problems

• SVM can deal with three kinds of problems:
  – Pattern Recognition / Classification.
  – Regression Estimation.
  – Density Estimation.

Page 3:

Pattern Recognition

• Given:
  – A set of M labeled patterns:
    $\{(x^{(i)}, y^{(i)})\}_{i=1}^{M}, \quad x^{(i)} \in \mathbb{R}^d,\ y^{(i)} \in \{\pm 1\}$
  – The patterns are drawn i.i.d. from an unknown distribution P(X,Y).
  – A set of functions F.
• Choose a function f in F such that an unseen pattern x will be correctly classified with high probability.
• Binary classification: two classes, +1 and -1.

Page 4:

The Actual Risk

• What is the probability of error of a function f?
  $R[f] = \int_{X \times Y} c\big(x, f(x), y\big)\, dP(x, y)$
  where c is some cost function on errors.
• The risk is not computable, because dP(x,y) is unknown.
• A proper estimate must be found.
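• For instance (a standard special case, added here for concreteness), with the 0-1 cost the actual risk is simply the misclassification probability:
  $c\big(x, f(x), y\big) = \mathbb{1}\{f(x) \neq y\} \;\Longrightarrow\; R[f] = P\big(f(X) \neq Y\big)$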

Page 5:

Linear SVM: Linearly Separable Case

• Linear SVM produces the maximal margin hyperplane, which is as far as possible from the closest training points.

[Figure: separating hyperplanes produced by a linear neural network vs. a linear SVM.]

Page 6:

Linearly Separable Case. Cont.

• Given the training set, we seek w and b such that:
  $\forall i \in [m]:\quad y^{(i)}\big(\langle w, x^{(i)}\rangle + b\big) - 1 \ge 0$
• In addition, we seek the maximal margin hyperplane.
  – What is the margin?
  – How do we maximize it?

Page 7:

Margin Maximization

• The margin is the sum of the distances from the hyperplane to the closest training point on each side.
• The distance of the hyperplane (w,b) from the origin is |b|/||w||.
• The margin is 2/||w||.
• Maximizing the margin is equivalent to minimizing ½||w||².
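• A brief derivation of the 2/||w|| expression (added for completeness, using the canonical scaling in which the closest point on each side satisfies $y^{(i)}(\langle w, x^{(i)}\rangle + b) = 1$):
  $d\big(x, \{z : \langle w, z\rangle + b = 0\}\big) = \frac{|\langle w, x\rangle + b|}{\|w\|}
  \;\Longrightarrow\;
  \text{margin} = \frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}$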

Page 8:

Linear SVM. Cont.

• The optimization problem:
  $\min_{w,b}\ \tfrac{1}{2}\|w\|^{2}
  \quad \text{subject to}\quad
  y^{(i)}\big(\langle w, x^{(i)}\rangle + b\big) - 1 \ge 0,\ \ i = 1, \dots, m$
• The Lagrangian is:
  $\max_{\alpha}\ \min_{w,b}\ \tfrac{1}{2}\|w\|^{2}
  - \sum_{i=1}^{m} \alpha_i \Big[ y^{(i)}\big(\langle w, x^{(i)}\rangle + b\big) - 1 \Big]
  \quad \text{subject to}\quad \alpha_i \ge 0$

Page 9:

Linear SVM. Cont.

• Requiring the derivatives with respect to w and b to vanish yields:
  $\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j\, y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)}\rangle
  \quad \text{subject to}\quad
  \sum_{i=1}^{m} \alpha_i y^{(i)} = 0,\ \ \alpha_i \ge 0$
• KKT conditions yield:
  $b = y^{(i)} - \langle w, x^{(i)}\rangle \quad \text{for any } i \text{ with } \alpha_i \ne 0$
• Where:
  $w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$
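• A minimal numerical sketch of this dual problem, assuming the third-party cvxopt QP solver is available; the toy data and the helper name hard_margin_dual are illustrative, not part of the original slides:

```python
import numpy as np
from cvxopt import matrix, solvers  # generic quadratic-programming solver

def hard_margin_dual(X, y):
    """Solve max_a sum(a) - 1/2 a^T Q a  s.t.  y^T a = 0, a >= 0,
    where Q_ij = y_i y_j <x_i, x_j>.  Returns the vector of alphas."""
    m = X.shape[0]
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                                  # Q_ij = y_i y_j <x_i, x_j>
    P = matrix(Q)
    q = matrix(-np.ones(m))                        # minimize 1/2 a^T Q a - 1^T a
    G = matrix(-np.eye(m))                         # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1).astype(float))     # sum_i y_i a_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()

# toy linearly separable data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = hard_margin_dual(X, y)
```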

Page 10:

Linear SVM. Cont.

• The resulting separating function is:
  $f(x) = \operatorname{sgn}\big(\langle w, x\rangle + b\big)
  = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x\rangle + b \right)$
• Notes:
  – The points with α=0 do not affect the solution.
  – The points with α≠0 are called support vectors.
  – The constraint $y^{(i)}\big(\langle w, x^{(i)}\rangle + b\big) \ge 1$ holds with equality only for the SVs.
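• Continuing the illustrative sketch above (helper names are hypothetical, not from the slides): once the α's are known, w, b and the separating function follow directly from the formulas on this and the previous page:

```python
import numpy as np

def primal_from_dual(X, y, alpha, tol=1e-6):
    """Recover w = sum_i alpha_i y_i x_i, and b = y_s - <w, x_s>
    from any support vector s (alpha_s > tol)."""
    w = (alpha * y) @ X
    sv = np.flatnonzero(alpha > tol)[0]    # index of one support vector
    b = y[sv] - X[sv] @ w
    return w, b

def predict(X_new, w, b):
    """f(x) = sgn(<w, x> + b)"""
    return np.sign(X_new @ w + b)

w, b = primal_from_dual(X, y, alpha)       # X, y, alpha from the previous sketch
print(predict(np.array([[1.0, 1.5], [-1.0, -2.0]]), w, b))
```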

Page 11:

Linear SVM. Non-separable case.

• We introduce slack variables ξi and allow mistakes.

• We demand:
  $\forall i \text{ with } y^{(i)} = +1:\quad \langle w, x^{(i)}\rangle + b \ge 1 - \xi_i$
  $\forall i \text{ with } y^{(i)} = -1:\quad \langle w, x^{(i)}\rangle + b \le -1 + \xi_i$
  $\xi_i \ge 0$
• And minimize:
  $\tfrac{1}{2}\|w\|^{2} + C \sum_{i=1}^{m} \xi_i$

Page 12:

Non-separable case. Cont.

• The modifications yield the following problem:

$\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i
- \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j\, y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)}\rangle
\quad \text{subject to}\quad
\sum_{i=1}^{m} \alpha_i y^{(i)} = 0,\ \ 0 \le \alpha_i \le C$
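• In the illustrative QP sketch from the linear case, the only change needed here is the extra upper bound α ≤ C (again a sketch; C is a user-chosen hyperparameter):

```python
import numpy as np
from cvxopt import matrix

def soft_margin_constraints(m, C):
    """Box constraints 0 <= alpha_i <= C, written as G a <= h for the QP solver
    (replaces the hard-margin G, h in the earlier hard_margin_dual sketch)."""
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    return G, h
```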

Page 13:

Non-Linear SVM

• Note that the training data appears in the solution only in inner products.

• If we pre-map the data into a higher-dimensional (and sparser) space, we can get better separability and a richer family of separating functions.

• Computing the pre-mapping explicitly, however, might make the problem infeasible.

• We want to avoid the explicit pre-mapping and still have the same separation ability.

• Suppose we have a simple function that operates on two training points and computes the inner product of their pre-mappings; then we achieve the better separation with no added cost, as illustrated below.
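• A small illustration of this point (my own example, not from the slides): for the degree-2 homogeneous polynomial kernel, $\langle x, y\rangle^2$ equals the inner product of explicit quadratic feature maps, so the kernel value is the pre-mapped inner product without ever constructing Φ:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-D inputs: (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k_poly2(x, y):
    """Degree-2 homogeneous polynomial kernel <x, y>^2."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))   # 1.0
print(k_poly2(x, y))            # 1.0 -- same value, no explicit mapping needed
```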

Page 14:

Mercer Kernels

• A Mercer kernel is a function
  $k : X^d \times X^d \to \mathbb{R}$
  for which there exists a mapping
  $\Phi : X^d \to H$
  such that:
  $\forall x, y \in X^d:\quad k(x, y) = \langle \Phi(x), \Phi(y)\rangle$
• A function k(·,·) is a Mercer kernel if, for any function g(·) such that
  $\int g^{2}(x)\, dx$ is finite,
  the following holds true:
  $\int\!\!\int g(x)\, g(y)\, k(x, y)\, dx\, dy \ge 0$

Page 15:

Some Mercer Kernels

• Homogeneous Polynomial Kernels:
  $k(x, y) = \langle x, y\rangle^{p}$
• Non-homogeneous Polynomial Kernels:
  $k(x, y) = \big(\langle x, y\rangle + 1\big)^{p}$
• Radial Basis Function (RBF) Kernels:
  $k(x, y) = \exp\left( -\frac{\|x - y\|^{2}}{2\sigma^{2}} \right)$
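• These three kernels in code (a minimal NumPy sketch; the parameter names p and sigma follow the formulas above):

```python
import numpy as np

def poly_homogeneous(x, y, p=2):
    """k(x, y) = <x, y>^p"""
    return np.dot(x, y) ** p

def poly_non_homogeneous(x, y, p=2):
    """k(x, y) = (<x, y> + 1)^p"""
    return (np.dot(x, y) + 1.0) ** p

def rbf(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))"""
    d = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))
```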

Page 16:

Solution of non-linear SVM

• The problem:
  $\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j\, y^{(i)} y^{(j)}\, k\big(x^{(i)}, x^{(j)}\big)
  \quad \text{subject to}\quad
  \sum_{i=1}^{m} \alpha_i y^{(i)} = 0,\ \ 0 \le \alpha_i \le C$
• The separating function:
  $f(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y^{(i)} k\big(x^{(i)}, x\big) + b \right)$
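• In practice this dual is solved by specialized software; a minimal usage sketch with scikit-learn's SVC (an assumed external library; RBF kernel and toy data chosen for illustration):

```python
import numpy as np
from sklearn.svm import SVC  # solves the kernelized dual internally

# toy data (illustrative only)
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='rbf', C=1.0, gamma=0.5)   # gamma corresponds to 1 / (2 sigma^2)
clf.fit(X, y)

print(clf.support_)      # indices of the support vectors
print(clf.dual_coef_)    # alpha_i * y_i for each support vector
print(clf.predict([[0.1, 0.0], [0.9, 1.1]]))
```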

Page 17:

Notes

• The solution of non-linear SVM is linear in H (the feature space).

• In non-linear SVM, w exists in H.

• The complexity of computing the kernel values is not higher than the complexity of the solution, and they can be computed a priori into a kernel matrix.

• SVM is suitable for large-scale problems due to its chunking ability.

Page 18:

Error Estimates

• Because the actual risk is not computable, we seek to estimate the error rate of a machine from a finite set of m patterns.

• Empirical Risk.

• Training and Testing.

• k-fold cross validation.

• Leave-one-out (LOO).
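• A minimal sketch of the k-fold and leave-one-out estimates using scikit-learn (an assumed external library; data and parameter values are illustrative):

```python
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC

# toy data (illustrative only)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='rbf', C=1.0, gamma=0.5)

# k-fold cross validation (k = 3): mean accuracy over the held-out folds
kfold_acc = cross_val_score(clf, X, y, cv=3).mean()

# leave-one-out: train on m-1 points, test on the held-out point, repeat m times
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

print(1.0 - kfold_acc, 1.0 - loo_acc)   # estimated error rates
```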

Page 19:

Error Bounds

• We seek error estimates that are faster to compute.
• The bound should be tight and informative.

• Theoretical VC bound:

Risk < Empirical Risk + Complexity (VC-dimension / m)

Loose and not always informative.

• Margin Radius bound:

Risk < R² / margin²

Where R is the radius of the smallest enclosing sphere of the data in feature space.

Tight and informative.
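• For reference (added; this is the commonly cited normalized form of the radius/margin bound for the hard-margin case), with γ the margin, R the radius of the smallest enclosing sphere of the data in feature space, and m the sample size:
  $E\big[\text{LOO error}\big] \;\le\; \frac{1}{m}\, E\!\left[ \frac{R^{2}}{\gamma^{2}} \right]$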

Page 20:

Error Bounds. Cont.

[Figure: the LOO error and the error bound plotted against a model parameter.]

Page 21:

Rademacher Complexity

• One of the tightest sample-based bounds depends on the Rademacher complexity term, defined as follows:
  $R_m(F) = E_{P(X)}\, E_{\sigma} \left[ \sup_{f \in F} \left| \frac{2}{m} \sum_{i=1}^{m} \sigma_i f(x_i) \right| \right],
  \qquad x_1, \dots, x_m \sim P(X)$
  where:
  – F is the class of functions mapping the domain of the input into R.
  – $E_{P(X)}$ is the expectation with respect to the probability distribution of the input data.
  – $E_{\sigma}$ is the expectation with respect to the $\sigma_i$: independent uniform random variables over {±1}.
• Rademacher complexity is a measure of the ability of the class of functions to classify the input samples when they are assigned random labels.
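• An illustrative Monte Carlo estimate of the empirical (fixed-sample) version of this quantity for a finite function class, written by analogy with the definition above (the function class and sample here are made up for the example):

```python
import numpy as np

def empirical_rademacher(F_values, n_draws=2000, seed=0):
    """Estimate E_sigma[ sup_f | (2/m) sum_i sigma_i f(x_i) | ] for a fixed sample.
    F_values: array of shape (n_functions, m) holding f(x_i) for every f in the class."""
    rng = np.random.default_rng(seed)
    n_functions, m = F_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)      # Rademacher variables
        total += np.max(np.abs(F_values @ sigma)) * 2.0 / m
    return total / n_draws

# example: 50 random {-1, +1}-valued "functions" evaluated on a sample of m = 20 points
F_values = np.sign(np.random.default_rng(1).standard_normal((50, 20)))
print(empirical_rademacher(F_values))
```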

Page 22:

Rademacher Risk Bound

• The following bound holds true with probability (1-δ):
  $P\big( y f(x) \le 0 \big) \;\le\; \hat{E}_m\big[ h\big(y f(x)\big) \big] + 2 L\, R_m(F) + \sqrt{\frac{\ln(2/\delta)}{2m}}$
• Where:
  $\hat{E}_m$ is the error on the input data measured through a loss function h(·) with Lipschitz constant L. That is:
  $\hat{E}_m\big[ h\big(y f(x)\big) \big] = \frac{1}{m} \sum_{i=1}^{m} h\big( y^{(i)} f(x^{(i)}) \big)$
• And the loss function can be one of:
  – Vapnik's:
    $h_V\big(y f(x)\big) = \begin{cases} 1, & y f(x) < 1 \\ 0, & y f(x) \ge 1 \end{cases}$
  – Bartlett & Mendelson's:
    $h_{BM}\big(y f(x)\big) = \begin{cases} 1, & y f(x) \le 0 \\ 1 - y f(x), & 0 \le y f(x) \le 1 \\ 0, & y f(x) \ge 1 \end{cases}$
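• The two loss functions and the empirical term as a small NumPy sketch matching the formulas above (margins stands for the values $y^{(i)} f(x^{(i)})$):

```python
import numpy as np

def h_vapnik(margins):
    """h_V(y f(x)): 1 if y f(x) < 1, else 0."""
    return (np.asarray(margins) < 1.0).astype(float)

def h_bartlett_mendelson(margins):
    """h_BM(y f(x)): 1 below 0, the ramp 1 - y f(x) on [0, 1], 0 above 1."""
    return np.clip(1.0 - np.asarray(margins), 0.0, 1.0)

def empirical_term(margins, h):
    """(1/m) * sum_i h(y_i f(x_i))"""
    return float(np.mean(h(margins)))

margins = np.array([-0.5, 0.3, 0.9, 1.5])
print(empirical_term(margins, h_vapnik))               # 0.75
print(empirical_term(margins, h_bartlett_mendelson))   # (1.0 + 0.7 + 0.1 + 0.0) / 4 = 0.45
```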