Transcript of Support Vector Machines Kernel Machines

Page 1: Support Vector Machines  Kernel Machines

Support Vector Machines Kernel Machines

Ata Kaban

The University of Birmingham

Page 2: Support Vector Machines  Kernel Machines

Remember the XOR problem?

Today we learn how to solve it.

Page 3: Support Vector Machines  Kernel Machines

Remember the XOR problem?

Page 4: Support Vector Machines  Kernel Machines

Support Vector Machines (SVM)

• Method for supervised learning problems
  – Classification
  – Regression

• Two key ideas
  – Assuming linearly separable classes, learn a separating hyperplane with maximum margin
  – Expand the input into a high-dimensional space to deal with linearly non-separable cases (such as the XOR; a sketch follows below)
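As a concrete illustration of the second idea (not part of the original slides), here is a minimal sketch using scikit-learn's SVC, which wraps the LIBSVM package listed on the Resources slide; the RBF kernel and parameter values are illustrative assumptions.

```python
# Minimal sketch (assumption: scikit-learn and NumPy are installed).
# A kernel SVM separates the XOR pattern, which no linear boundary in 2D can.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])             # XOR labels: not linearly separable

clf = SVC(kernel="rbf", gamma=2.0, C=1e3)  # implicit expansion into a high-dim feature space
clf.fit(X, y)
print(clf.predict(X))                      # expected: [-1  1  1 -1]
```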

Page 5: Support Vector Machines  Kernel Machines

Separating hyperplane

• Training set: (xi, yi), i = 1, 2, …, N; yi ∈ {+1, −1}

• Hyperplane: w·x + b = 0
  – This is fully determined by (w, b)

where x = (x1, x2, …, xd), w = (w1, w2, …, wd), and w·x = w1x1 + w2x2 + … + wdxd is the dot product
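To make the notation concrete, here is a minimal sketch (the numbers are illustrative, not from the lecture) of evaluating w·x + b and reading off which side of the hyperplane a point falls on.

```python
# Minimal sketch (assumption: NumPy available); w, b and x are illustrative values.
import numpy as np

w = np.array([2.0, -1.0])          # weight vector
b = -0.5                           # bias
x = np.array([1.0, 1.0])           # input point

score = np.dot(w, x) + b           # w.x + b = w1*x1 + w2*x2 + b
label = +1 if score > 0 else -1    # side of the hyperplane w.x + b = 0
print(score, label)                # 0.5 1
```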

Page 6: Support Vector Machines  Kernel Machines

Maximum margin

According to a theorem from learning theory, of all possible linear decision functions the one that maximises the margin of the training set will minimise the generalisation error.

[of course, given enough data points and assuming that data is not noisy!]

Page 7: Support Vector Machines  Kernel Machines

Maximum margin

Note 1: with c being a positive constant, the decision functions (w, b) and (cw, cb) are the same.

Note 2: but the margins as measured by the outputs of the function x → w·x + b are not the same if we take (cw, cb).

Definition: the geometric margin is the margin given by the canonical decision function, which is obtained when c = 1/||w||.

Strategy:
1) we need to maximise the geometric margin! (cf. the result from learning theory)
2) subject to the constraint that training examples are classified correctly

[Figure: separating hyperplane w·x + b = 0 with normal vector w; the region w·x + b > 0 lies on one side, w·x + b < 0 on the other]

Page 8: Support Vector Machines  Kernel Machines

Maximum margin

According to Note 1, we can demand the function output for the nearest points to be +1 and −1 on the two sides of the decision function. This removes the scaling freedom.

Denoting a nearest positive example by x⁺ and a nearest negative example by x⁻, this is

$w \cdot x^{+} + b = +1$ and $w \cdot x^{-} + b = -1$

Computing the geometric margin (that has to be maximised):

$\frac{1}{2}\left(\frac{w \cdot x^{+} + b}{\|w\|} - \frac{w \cdot x^{-} + b}{\|w\|}\right) = \frac{1}{2\|w\|}\big((w \cdot x^{+} + b) - (w \cdot x^{-} + b)\big) = \frac{1}{\|w\|}$

And here are the constraints:

$w \cdot x_i + b \geq +1$ for $y_i = +1$
$w \cdot x_i + b \leq -1$ for $y_i = -1$

or, equivalently, $y_i(w \cdot x_i + b) - 1 \geq 0$ for all $i$

Page 9: Support Vector Machines  Kernel Machines

[Figure: the hyperplane w·x + b = 0 with the margin hyperplanes w·x + b = +1 and w·x + b = −1; points with w·x + b > 1 lie on the positive side and points with w·x + b < −1 on the negative side]

Maximum margin – summing up

Given a linearly separable training set (xi, yi), i = 1, 2, …, N; yi ∈ {+1, −1}

Minimise $\|w\|^2$

Subject to $y_i(w \cdot x_i + b) - 1 \geq 0$, $i = 1, \dots, N$

This is a quadratic programming problem with linear inequality constraints. There are well-known procedures for solving it.
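As an aside (not from the slides), the primal problem can be handed to a generic QP solver directly; here is a minimal sketch assuming the cvxpy library, with a tiny made-up data set.

```python
# Minimal sketch of the primal problem above (assumptions: cvxpy and NumPy installed;
# the toy data set is illustrative, not from the lecture).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])  # training inputs
y = np.array([+1, +1, -1, -1])                                  # labels in {+1, -1}

w = cp.Variable(2)
b = cp.Variable()
# Minimise ||w||^2 subject to y_i (w.x_i + b) - 1 >= 0 for all i
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print(w.value, b.value)   # parameters of the maximum-margin hyperplane
```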

Page 10: Support Vector Machines  Kernel Machines

Support vectors

The training points that are nearest to the separating function are called support vectors.

What is the output of our decision function for these points?

Page 11: Support Vector Machines  Kernel Machines

Solving (not req for exam)

• Construct & minimise the Lagrangian

• Take derivatives wrt. w and b, equate them to 0

The Lagrange multipliers αi are called ‘dual variables’

Each training point has an associated dual variable.

$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \big(y_i(w \cdot x_i + b) - 1\big)$, with $\alpha_i \geq 0$ the multiplier of constraint $i$, $i = 1, \dots, N$

$\frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0$

$\frac{\partial L(w, b, \alpha)}{\partial b} = -\sum_{i=1}^{N} \alpha_i y_i = 0$

KKT condition: $\alpha_i \big(y_i(w \cdot x_i + b) - 1\big) = 0$, $i = 1, \dots, N$

parameters are expressed as a linear combination of training points

only SVs will have non-zero αi

Page 12: Support Vector Machines  Kernel Machines

(not req for exam)

Page 13: Support Vector Machines  Kernel Machines

Solving (not req for exam)

• So,

$w = \sum_{i=1}^{N} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i$

• Plug this back into the Lagrangian to obtain the dual formulation (homework)

• The resulting dual, which is solved using a QP solver:

maximise $W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$

subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $\alpha_i \geq 0$, $i = 1, \dots, N$

• The b does not appear in the dual, so it is determined separately from the initial constraints (homework)

Data enters only in the form of dot products!
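Since the data appears only through dot products, the dual can be solved numerically with any QP solver. Below is a minimal sketch for the linear case, again assuming cvxpy and using the same made-up toy points as in the primal sketch.

```python
# Minimal sketch of the dual above (assumptions: cvxpy and NumPy; toy data is illustrative).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
N = len(y)

alpha = cp.Variable(N, nonneg=True)     # dual variables, alpha_i >= 0
# W(alpha) = sum_i alpha_i - 1/2 ||sum_i alpha_i y_i x_i||^2 (data enters via dot products)
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
prob = cp.Problem(objective, [alpha @ y == 0])
prob.solve()

print(np.round(alpha.value, 4))         # non-zero entries correspond to support vectors
print(X.T @ (alpha.value * y))          # recover w = sum_i alpha_i y_i x_i
```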

Page 14: Support Vector Machines  Kernel Machines

Classifying new data points (not req for exam)

• Once the parameters (α*, b*) are found by solving the required quadratic optimisation on the training set of points, the SVM is ready to be used for classifying new points.

• Given new point x, its class membership is

sign[f(x, α*, b*)], where

$f(x, \alpha^{*}, b^{*}) = w^{*} \cdot x + b^{*} = \sum_{i=1}^{N} \alpha_i^{*} y_i \, x_i \cdot x + b^{*} = \sum_{i \in SV} \alpha_i^{*} y_i \, x_i \cdot x + b^{*}$

Data enters only in the form of dot products!
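As a small illustration (the support vectors, their α* values and b* below are made up, but chosen to be mutually consistent), the decision rule can be coded directly from the formula above.

```python
# Minimal sketch (illustrative, hand-picked values): the decision rule
# f(x) = sum_{i in SV} alpha_i* y_i (x_i . x) + b*, class = sign(f(x)).
import numpy as np

sv_x = np.array([[2.0, 2.0], [0.0, 0.0]])   # hypothetical support vectors
sv_y = np.array([+1, -1])                   # their labels
sv_alpha = np.array([0.25, 0.25])           # their dual variables
b = -1.0                                    # hypothetical b*

def classify(x):
    f = np.sum(sv_alpha * sv_y * (sv_x @ x)) + b   # data enters only via dot products
    return int(np.sign(f))

print(classify(np.array([3.0, 3.0])), classify(np.array([0.2, 0.1])))   # 1 -1
```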

Page 15: Support Vector Machines  Kernel Machines

Solution

• The solution of the SVM, i.e. of the quadratic programming problem with linear inequality constraints, has the nice property that the data enters only in the form of dot products!

• Dot product (notation & memory refreshing): given x = (x1, x2, …, xn) and y = (y1, y2, …, yn), the dot product of x and y is x·y = x1y1 + x2y2 + … + xnyn.

• This is nice because it allows us to make SVMs non-linear without complicating the algorithm; see the next slide.

• If you want to use SVMs in practice, many software packages are available; see e.g. the Resources page at the back of this handout.

• If you want to understand what the software does, then you need to master the previous slides marked as ‘not req for exam’.

Page 16: Support Vector Machines  Kernel Machines

Non-linear SVMs

• Transform x → φ(x)

• The linear algorithm depends only on x·xi, hence the transformed algorithm depends only on φ(x)·φ(xi)

• Use a kernel function K(x, y) such that K(x, y) = φ(x)·φ(y)

Page 17: Support Vector Machines  Kernel Machines

Examples of kernels

• Example 1: x, y points in 2D input, 3D feature space:

if $x = (x_1, x_2)$ and $\varphi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2)$, then $\varphi(x) \cdot \varphi(y) = (x \cdot y)^2$, so this implies $K(x, y) = (x \cdot y)^2$ (square of dot product)

• Example 2:

$K(x, y) = \exp\{-\|x - y\|^2 / (2\sigma^2)\}$

the φ that corresponds to this kernel has infinite dimension.

• Note: Not every function is a proper kernel. There is a theorem, called Mercer's Theorem, that characterises proper kernels.

• To test a new input x when working with kernels:

$f(x) = \operatorname{sign}\Big(\sum_{i \in SV} \alpha_i y_i K(x_i, x) + b\Big)$
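The small check below (not from the slides; it assumes NumPy) verifies numerically that the explicit 3D feature map of Example 1 reproduces the kernel K(x, y) = (x·y)², and evaluates a Gaussian kernel as in Example 2.

```python
# Minimal sketch (assumption: NumPy). Feature map vs. kernel for Example 1,
# plus the Gaussian kernel of Example 2 (sigma is an illustrative choice).
import numpy as np

def phi(x):                        # phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k_poly(x, y):                  # K(x, y) = (x.y)^2 -- square of dot product
    return np.dot(x, y) ** 2

def k_gauss(x, y, sigma=1.0):      # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(y)), k_poly(x, y))   # both print 16.0: phi(x).phi(y) = (x.y)^2
print(k_gauss(x, y))
```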

Page 18: Support Vector Machines  Kernel Machines

Page 19: Support Vector Machines  Kernel Machines

Making new kernels from the old

New kernels can be made from valid kernels by allowed operations: addition, multiplication and rescaling by a positive constant all give a proper kernel, as long as the resulting Gram matrix is positive semi-definite:

$K(x_1, x_2) = K_1(x_1, x_2) + K_2(x_1, x_2)$

$K(x_1, x_2) = \lambda K_1(x_1, x_2)$

$K(x_1, x_2) = K_1(x_1, x_2)\, K_2(x_1, x_2)$

Also, given a real-valued function f(x) over inputs x, the following is a valid kernel:

$K(x_1, x_2) = f(x_1)\, f(x_2)$
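To illustrate these closure rules (a sketch under the assumption that NumPy is available; the data and base kernels are made up), one can combine kernels and check on a sample that the Gram matrix has no negative eigenvalues.

```python
# Minimal sketch (assumption: NumPy; random data is illustrative): building a new kernel
# from old ones and checking the Gram matrix is positive semi-definite up to round-off.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                       # five illustrative 2-D inputs

def k1(x, y): return np.dot(x, y)                 # linear kernel
def k2(x, y): return (np.dot(x, y) + 1) ** 2      # polynomial kernel

def k_new(x, y):                                  # sum + rescaling + product of kernels
    return k1(x, y) + 0.5 * k2(x, y) + k1(x, y) * k2(x, y)

G = np.array([[k_new(a, b) for b in X] for a in X])   # Gram matrix on the sample
print(np.linalg.eigvalsh(G) >= -1e-9)                 # all True: no negative eigenvalues
```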

Page 20: Support Vector Machines  Kernel Machines

Using SVM for classification

• Prepare the data matrix

• Select the kernel function to use

• Execute the training algorithm using a QP solver to obtain the αi values

• Unseen data can be classified using the αi values and the support vectors (a sketch of these steps follows below)
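A minimal end-to-end sketch of these four steps, assuming scikit-learn (whose SVC class wraps the LIBSVM package from the Resources slide); the data set and the kernel parameters are illustrative choices.

```python
# Minimal sketch (assumptions: scikit-learn and NumPy; data and parameters are illustrative).
import numpy as np
from sklearn.svm import SVC

# 1) prepare the data matrix
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]], dtype=float)
y = np.array([-1, -1, -1, -1, +1, +1])

# 2) select the kernel function; 3) train (a QP is solved internally for the alpha_i values)
clf = SVC(kernel="rbf", gamma=0.5, C=10.0)
clf.fit(X, y)

# 4) classify unseen data using the alpha_i values and the support vectors
print(clf.support_vectors_)        # the support vectors found during training
print(clf.predict([[2.0, 2.5]]))   # class of an unseen point
```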

Page 21: Support Vector Machines  Kernel Machines

Applications

• Handwritten digit recognition
  – Of interest to the US Postal Service
  – 4% error was obtained
  – only about 4% of the training data were SVs

• Text categorisation
• Face detection
• DNA analysis
• …many others

Page 22: Support Vector Machines  Kernel Machines

Discriminative versus generative classification methods

• SVMs learn the discrimination boundary. They are called discriminative approaches.

• This is in contrast to learning a model for each class, as e.g. Bayesian classification does. This latter approach is called the generative approach.

• SVMs try to avoid overfitting in high-dimensional spaces (cf. regularisation)

Page 23: Support Vector Machines  Kernel Machines

Conclusions

• SVMs learn linear decision boundaries (cf. perceptrons)
  – They pick the hyperplane that maximises the margin
  – The weight vector of the optimal hyperplane turns out to be a linear combination of the support vectors

• Transform non-linear problems to a higher-dimensional space using kernel functions; then there is more chance that in the transformed space the classes will be linearly separable.

Page 24: Support Vector Machines  Kernel Machines

Resources

• Software & practical guide to SVMs for beginners: http://www.csie.ntu.edu.tw/~cjlin/libsvm/

• Kernel machines website: http://www.kernel-machines.org/

• Burges, C. J. C.: A tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp. 121–167, 1998. Available from http://svm.research.bell-labs.com/SVMdoc.html

• Cristianini & Shawe-Taylor: SVM book (in the School library)