Support Vector Machines and their Applications · Support Vector Machines and their Applications...

Introduction SVMs Generalization Bounds The Kernel Trick Implementations Applications

Support Vector Machines and their Applications

Purushottam Kar

Department of Computer Science and Engineering,Indian Institute of Technology Kanpur.

Summer School on “Expert Systems And Their Applications”,Indian Institute of Information Technology Allahabad.

June 14, 2009

1 / 24

Support Vector Machines

What is being “supported” ?

How can vectors support anything ?

Wait !! Machines ?? - Is this a Mechanical Engineering Lecture ?

2 / 24

The Learning Methodology

Is it possible to write an algorithm to distinguish between ...

a well-formed and an ill-formed C++ program ?

a palindrome and a non-palindrome ?

a graph with and without cliques of size bigger than 1000 ?

3 / 24

a handwritten 4 and a handwritten 9 ?

a spam and a non-spam e-mail ?

a positive movie review and a negative movie review ?

4 / 24

Statistical Machine Learning

“Synthesize” a program based on training data

Assume training data that is randomly generated from someunknown but fixed distribution and a target function

Give probabilistic error bounds

In other words be probably-approximately-correct

The motto - Let the data decide the algorithm

5 / 24

Expert Systems

... a computing system capable of representing andreasoning about some knowledge rich domain, such asinternal medicine or geology ...

Introduction to Expert Systems, Peter Jackson, Addison Wesley PublishingCompany, 1986.

6 / 24

Linear Machines

Arguably the simplest of classifiers acting onvectoral data

Numerous Learning Algorithms - Perceptron,SVM

7 / 24

Linear Machines

7 / 24

Linear Machines

7 / 24

A “special” hyperplane - with the maximummargin

Margin of a point measures how far is it fromthe hyperplane

8 / 24

Learning the Maximum Margin Classifier

minimize~w ,b ‖~w‖2subject to yi (〈~w · ~xi 〉+ b) ≥ 1,

i = 1, . . . , l .

A Linearly-constrained Quadratic program

Solvable in polynomial time - several algorithms known

Does not give us much insight into the nature of the hyperplane

9 / 24

i = 1, . . . , l .

9 / 24

i = 1, . . . , l .

9 / 24

Non-linearly Separable Data

Use slack variables to allow points to lie onthe “wrong” side of the hyperplane

Can still be solved using a QCQP

10 / 24

Learning the Soft Margin Classifier

minimize~w ,b ‖~w‖2 + Cl∑

subject to yi (〈~w · ~xi 〉+ b) ≥ 1− ξi ,i = 1, . . . , l .

Again a Linearly-constrained Quadratic program

More insight gained by looking at the dual program

11 / 24

Learning the Soft Margin Classifier

minimize~w ,b ‖~w‖2 + Cl∑

subject to yi (〈~w · ~xi 〉+ b) ≥ 1− ξi ,i = 1, . . . , l .

Again a Linearly-constrained Quadratic program

More insight gained by looking at the dual program

11 / 24

The Dual Program for the Hard Margin SVM

maximizel∑

αi −1

l∑i ,j=1

yiyjαiαj〈~xi · ~xj〉,

subject tol∑

yiαi = 0,

α1 ≥ 0, i = 1, . . . , l .

Some properties of the optimum (~w∗, b∗, ~α∗):

α∗i [yi (〈~w∗ · ~xi 〉+ b∗)− 1] = 0, i = 1, . . . , l

~w∗ =∑l

i=1 yiα∗i ~xi

f (~x , ~α∗, b∗) =∑l

i=1 yiα∗i 〈~xi · ~x〉+ b∗

12 / 24

maximizel∑

αi −1

l∑i ,j=1

subject tol∑

yiαi = 0,

α1 ≥ 0, i = 1, . . . , l .

α∗i [yi (〈~w∗ · ~xi 〉+ b∗)− 1] = 0, i = 1, . . . , l

~w∗ =∑l

i=1 yiα∗i ~xi

f (~x , ~α∗, b∗) =∑l

i=1 yiα∗i 〈~xi · ~x〉+ b∗

12 / 24

maximizel∑

αi −1

l∑i ,j=1

subject tol∑

yiαi = 0,

α1 ≥ 0, i = 1, . . . , l .

α∗i [yi (〈~w∗ · ~xi 〉+ b∗)− 1] = 0, i = 1, . . . , l

~w∗ =∑l

i=1 yiα∗i ~xi

f (~x , ~α∗, b∗) =∑l

i=1 yiα∗i 〈~xi · ~x〉+ b∗

12 / 24

maximizel∑

αi −1

l∑i ,j=1

subject tol∑

yiαi = 0,

α1 ≥ 0, i = 1, . . . , l .

α∗i [yi (〈~w∗ · ~xi 〉+ b∗)− 1] = 0, i = 1, . . . , l

~w∗ =∑l

i=1 yiα∗i ~xi

f (~x , ~α∗, b∗) =∑l

i=1 yiα∗i 〈~xi · ~x〉+ b∗

12 / 24

The Support

Consider each vector applying a force of αi on the hyperplane in thedirection yi

~w‖~w‖2 .

The conditions exactly correspond to the force and the torque onthe hyperplane being zero

Hence the vectors lying on the margin, for whom αi 6= 0, “support”the hyperplane.

13 / 24

The Support

~w‖~w‖2 .

13 / 24

The Support

~w‖~w‖2 .

13 / 24

Luckiness and Generalization

Essentially the idea is to show that the probability of the trainingsample misleading us into learning an erroneous classifier is small

To do this model selection has to be done carefully

The classifier family should not be too powerful to preventoverfitting

14 / 24

Bounds for Linear Classifiers

Theorem (Vapnik and Chervonenkis)

For any probability distribution on the input domain Rd , and anytarget function g, with probability no less than 1− δ, any linearhyperplane f classifying a randomly chosen training set of size lperfectly cannot disagree with the target function on more than εfraction of the input domain (with respect to the underlyingdistribution) where

((d + 1) log

d + 1+ log

15 / 24

Bounds for Large Margin Classifiers

Theorem (Vapnik)

For any probability distribution on the input domain - a ball ofradius R, and any target function g, with probability no less than1− δ, any linear hyperplane f classifying a randomly chosentraining set of size l perfectly with margin ≥ γ cannot disagreewith the target function on more than ε fraction of the inputdomain (with respect to the underlying distribution) where

ε = O

γ2+ log

))NOTE : Error bound Independent of the dimension !

16 / 24

The XOR problem

The feature map Φ : (x , y) 7−→ (x2, y2,√

2xy)makes the problem linearly a separable one.

But do we need to perform the actual featuremap ?

No ! Since all that the SVM trainingalgorithm requires are values of 〈Φ(~xi ) ·Φ(~xj)〉