Transcript of Support Vector Machines Kernel Machines

Page 1: Support Vector Machines  Kernel Machines

Support Vector Machines Kernel Machines

Ata Kaban

The University of Birmingham

Page 2: Support Vector Machines  Kernel Machines

Remember the XOR problem?

Today we learn how to solve it.

Page 3: Support Vector Machines  Kernel Machines

Remember the XOR problem?

Page 4: Support Vector Machines  Kernel Machines

Support Vector Machines (SVM)

• Method for supervised learning problems
  – Classification
  – Regression

• Two key ideas
  – Assuming linearly separable classes, learn a separating hyperplane with maximum margin
  – Expand the input into a high-dimensional space to deal with linearly non-separable cases (such as the XOR; a sketch follows below)
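As a concrete illustration of the second idea (not part of the original slides), here is a minimal sketch using scikit-learn's SVC, which wraps the LIBSVM package listed on the Resources slide; the RBF kernel and parameter values are illustrative assumptions.

```python
# Minimal sketch (assumption: scikit-learn and NumPy are installed).
# A kernel SVM separates the XOR pattern, which no linear boundary in 2D can.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])             # XOR labels: not linearly separable

clf = SVC(kernel="rbf", gamma=2.0, C=1e3)  # implicit expansion into a high-dim feature space
clf.fit(X, y)
print(clf.predict(X))                      # expected: [-1  1  1 -1]
```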

Page 5: Support Vector Machines  Kernel Machines

Separating hyperplane

• Training set: (xi, yi), i = 1, 2, …, N; yi ∈ {+1, −1}

• Hyperplane: w·x + b = 0
  – This is fully determined by (w, b)

where x = (x1, x2, …, xd), w = (w1, w2, …, wd), and w·x = w1x1 + w2x2 + … + wdxd is the dot product
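To make the notation concrete, here is a minimal sketch (the numbers are illustrative, not from the lecture) of evaluating w·x + b and reading off which side of the hyperplane a point falls on.

```python
# Minimal sketch (assumption: NumPy available); w, b and x are illustrative values.
import numpy as np

w = np.array([2.0, -1.0])          # weight vector
b = -0.5                           # bias
x = np.array([1.0, 1.0])           # input point

score = np.dot(w, x) + b           # w.x + b = w1*x1 + w2*x2 + b
label = +1 if score > 0 else -1    # side of the hyperplane w.x + b = 0
print(score, label)                # 0.5 1
```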

Page 6: Support Vector Machines  Kernel Machines

Maximum margin

According to a theorem from learning theory, of all possible linear decision functions the one that maximises the margin of the training set will minimise the generalisation error.

[of course, given enough data points and assuming that data is not noisy!]

Page 7: Support Vector Machines  Kernel Machines

Maximum margin

Note 1: with c being a positive constant, the decision functions (w, b) and (cw, cb) are the same.

Note 2: but the margins as measured by the outputs of the function x → w·x + b are not the same if we take (cw, cb).

Definition: the geometric margin is the margin given by the canonical decision function, which is obtained when c = 1/||w||.

Strategy:
1) we need to maximise the geometric margin! (cf. the result from learning theory)
2) subject to the constraint that training examples are classified correctly

[Figure: separating hyperplane w·x + b = 0 with normal vector w; the region w·x + b > 0 lies on one side, w·x + b < 0 on the other]

Page 8: Support Vector Machines  Kernel Machines

Maximum margin

According to Note 1, we can demand the function output for the nearest points to be +1 and −1 on the two sides of the decision function. This removes the scaling freedom.

Denoting a nearest positive example by x⁺ and a nearest negative example by x⁻, this is

$w \cdot x^{+} + b = +1$ and $w \cdot x^{-} + b = -1$

Computing the geometric margin (that has to be maximised):

$\frac{1}{2}\left(\frac{w \cdot x^{+} + b}{\|w\|} - \frac{w \cdot x^{-} + b}{\|w\|}\right) = \frac{1}{2\|w\|}\big((w \cdot x^{+} + b) - (w \cdot x^{-} + b)\big) = \frac{1}{\|w\|}$

And here are the constraints:

$w \cdot x_i + b \geq +1$ for $y_i = +1$
$w \cdot x_i + b \leq -1$ for $y_i = -1$

or, equivalently, $y_i(w \cdot x_i + b) - 1 \geq 0$ for all $i$

Page 9: Support Vector Machines  Kernel Machines

[Figure: the hyperplane w·x + b = 0 with the margin hyperplanes w·x + b = +1 and w·x + b = −1; points with w·x + b > 1 lie on the positive side and points with w·x + b < −1 on the negative side]

Maximum margin – summing up

Given a linearly separable training set (xi, yi), i = 1, 2, …, N; yi ∈ {+1, −1}

Minimise $\|w\|^2$

Subject to $y_i(w \cdot x_i + b) - 1 \geq 0$, $i = 1, \dots, N$

This is a quadratic programming problem with linear inequality constraints. There are well-known procedures for solving it.
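As an aside (not from the slides), the primal problem can be handed to a generic QP solver directly; here is a minimal sketch assuming the cvxpy library, with a tiny made-up data set.

```python
# Minimal sketch of the primal problem above (assumptions: cvxpy and NumPy installed;
# the toy data set is illustrative, not from the lecture).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])  # training inputs
y = np.array([+1, +1, -1, -1])                                  # labels in {+1, -1}

w = cp.Variable(2)
b = cp.Variable()
# Minimise ||w||^2 subject to y_i (w.x_i + b) - 1 >= 0 for all i
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print(w.value, b.value)   # parameters of the maximum-margin hyperplane
```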

Page 10: Support Vector Machines  Kernel Machines

Support vectors

The training points that are nearest to the separating function are called support vectors.

What is the output of our decision function for these points?

Page 11: Support Vector Machines  Kernel Machines

Solving (not req for exam)

• Construct & minimise the Lagrangian

• Take derivatives wrt. w and b, equate them to 0

The Lagrange multipliers αi are called ‘dual variables’

Each training point has an associated dual variable.

$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \big(y_i(w \cdot x_i + b) - 1\big)$, with $\alpha_i \geq 0$ the multiplier of constraint $i$, $i = 1, \dots, N$

$\frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0$

$\frac{\partial L(w, b, \alpha)}{\partial b} = -\sum_{i=1}^{N} \alpha_i y_i = 0$

KKT condition: $\alpha_i \big(y_i(w \cdot x_i + b) - 1\big) = 0$, $i = 1, \dots, N$

parameters are expressed as a linear combination of training points

only SVs will have non-zero αi

Page 12: Support Vector Machines  Kernel Machines

(not req for exam)

Page 13: Support Vector Machines  Kernel Machines

Solving (not req for exam)

• So,

$w = \sum_{i=1}^{N} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i$

• Plug this back into the Lagrangian to obtain the dual formulation (homework)

• The resulting dual, which is solved using a QP solver:

maximise $W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$

subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $\alpha_i \geq 0$, $i = 1, \dots, N$

• The b does not appear in the dual, so it is determined separately from the initial constraints (homework)

Data enters only in the form of dot products!
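Since the data appears only through dot products, the dual can be solved numerically with any QP solver. Below is a minimal sketch for the linear case, again assuming cvxpy and using the same made-up toy points as in the primal sketch.

```python
# Minimal sketch of the dual above (assumptions: cvxpy and NumPy; toy data is illustrative).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
N = len(y)

alpha = cp.Variable(N, nonneg=True)     # dual variables, alpha_i >= 0
# W(alpha) = sum_i alpha_i - 1/2 ||sum_i alpha_i y_i x_i||^2 (data enters via dot products)
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
prob = cp.Problem(objective, [alpha @ y == 0])
prob.solve()

print(np.round(alpha.value, 4))         # non-zero entries correspond to support vectors
print(X.T @ (alpha.value * y))          # recover w = sum_i alpha_i y_i x_i
```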

Page 14: Support Vector Machines  Kernel Machines

Classifying new data points (not req for exam)

• Once the parameters (α*, b*) are found by solving the required quadratic optimisation on the training set of points, the SVM is ready to be used for classifying new points.

• Given new point x, its class membership is

sign[f(x, α*, b*)], where

$f(x, \alpha^{*}, b^{*}) = w^{*} \cdot x + b^{*} = \sum_{i=1}^{N} \alpha_i^{*} y_i \, x_i \cdot x + b^{*} = \sum_{i \in SV} \alpha_i^{*} y_i \, x_i \cdot x + b^{*}$

Data enters only in the form of dot products!
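As a small illustration (the support vectors, their α* values and b* below are made up, but chosen to be mutually consistent), the decision rule can be coded directly from the formula above.

```python
# Minimal sketch (illustrative, hand-picked values): the decision rule
# f(x) = sum_{i in SV} alpha_i* y_i (x_i . x) + b*, class = sign(f(x)).
import numpy as np

sv_x = np.array([[2.0, 2.0], [0.0, 0.0]])   # hypothetical support vectors
sv_y = np.array([+1, -1])                   # their labels
sv_alpha = np.array([0.25, 0.25])           # their dual variables
b = -1.0                                    # hypothetical b*

def classify(x):
    f = np.sum(sv_alpha * sv_y * (sv_x @ x)) + b   # data enters only via dot products
    return int(np.sign(f))

print(classify(np.array([3.0, 3.0])), classify(np.array([0.2, 0.1])))   # 1 -1
```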

Page 15: Support Vector Machines  Kernel Machines

Solution

• The solution of the SVM, i.e. of the quadratic programming problem with linear inequality constraints, has the nice property that the data enters only in the form of dot products!

• Dot product (notation & memory refreshing): given x = (x1, x2, …, xn) and y = (y1, y2, …, yn), the dot product of x and y is x·y = x1y1 + x2y2 + … + xnyn.

• This is nice because it allows us to make SVMs non-linear without complicating the algorithm; see the next slide.

• If you want to use SVMs in practice, many software packages are available; see e.g. the Resources page at the back of this handout.

• If you want to understand what the software does, then you need to master the previous slides marked as ‘not req for exam’.

Page 16: Support Vector Machines  Kernel Machines

Non-linear SVMs

• Transform x → φ(x)

• The linear algorithm depends only on x·xi, hence the transformed algorithm depends only on φ(x)·φ(xi)

• Use a kernel function K(x, y) such that K(x, y) = φ(x)·φ(y)

Page 17: Support Vector Machines  Kernel Machines

Examples of kernels

• Example 1: x, y points in 2D input, 3D feature space:

if $x = (x_1, x_2)$ and $\varphi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2)$, then $\varphi(x) \cdot \varphi(y) = (x \cdot y)^2$, so this implies $K(x, y) = (x \cdot y)^2$ (square of dot product)

• Example 2:

$K(x, y) = \exp\{-\|x - y\|^2 / (2\sigma^2)\}$

the φ that corresponds to this kernel has infinite dimension.

• Note: Not every function is a proper kernel. There is a theorem, called Mercer's Theorem, that characterises proper kernels.

• To test a new input x when working with kernels:

$f(x) = \operatorname{sign}\Big(\sum_{i \in SV} \alpha_i y_i K(x_i, x) + b\Big)$
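The small check below (not from the slides; it assumes NumPy) verifies numerically that the explicit 3D feature map of Example 1 reproduces the kernel K(x, y) = (x·y)², and evaluates a Gaussian kernel as in Example 2.

```python
# Minimal sketch (assumption: NumPy). Feature map vs. kernel for Example 1,
# plus the Gaussian kernel of Example 2 (sigma is an illustrative choice).
import numpy as np

def phi(x):                        # phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k_poly(x, y):                  # K(x, y) = (x.y)^2 -- square of dot product
    return np.dot(x, y) ** 2

def k_gauss(x, y, sigma=1.0):      # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(y)), k_poly(x, y))   # both print 16.0: phi(x).phi(y) = (x.y)^2
print(k_gauss(x, y))
```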

Page 18: Support Vector Machines  Kernel Machines

Page 19: Support Vector Machines  Kernel Machines

Making new kernels from the old

New kernels can be made from valid kernels by allowed operations: addition, multiplication and rescaling by a positive constant all give a proper kernel, as long as the resulting Gram matrix is positive semi-definite:

$K(x_1, x_2) = K_1(x_1, x_2) + K_2(x_1, x_2)$

$K(x_1, x_2) = \lambda K_1(x_1, x_2)$

$K(x_1, x_2) = K_1(x_1, x_2)\, K_2(x_1, x_2)$

Also, given a real-valued function f(x) over inputs x, the following is a valid kernel:

$K(x_1, x_2) = f(x_1)\, f(x_2)$
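To illustrate these closure rules (a sketch under the assumption that NumPy is available; the data and base kernels are made up), one can combine kernels and check on a sample that the Gram matrix has no negative eigenvalues.

```python
# Minimal sketch (assumption: NumPy; random data is illustrative): building a new kernel
# from old ones and checking the Gram matrix is positive semi-definite up to round-off.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                       # five illustrative 2-D inputs

def k1(x, y): return np.dot(x, y)                 # linear kernel
def k2(x, y): return (np.dot(x, y) + 1) ** 2      # polynomial kernel

def k_new(x, y):                                  # sum + rescaling + product of kernels
    return k1(x, y) + 0.5 * k2(x, y) + k1(x, y) * k2(x, y)

G = np.array([[k_new(a, b) for b in X] for a in X])   # Gram matrix on the sample
print(np.linalg.eigvalsh(G) >= -1e-9)                 # all True: no negative eigenvalues
```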

Page 20: Support Vector Machines  Kernel Machines

Using SVM for classification

• Prepare the data matrix

• Select the kernel function to use

• Execute the training algorithm using a QP solver to obtain the αi values

• Unseen data can be classified using the αi values and the support vectors (a sketch of these steps follows below)
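A minimal end-to-end sketch of these four steps, assuming scikit-learn (whose SVC class wraps the LIBSVM package from the Resources slide); the data set and the kernel parameters are illustrative choices.

```python
# Minimal sketch (assumptions: scikit-learn and NumPy; data and parameters are illustrative).
import numpy as np
from sklearn.svm import SVC

# 1) prepare the data matrix
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]], dtype=float)
y = np.array([-1, -1, -1, -1, +1, +1])

# 2) select the kernel function; 3) train (a QP is solved internally for the alpha_i values)
clf = SVC(kernel="rbf", gamma=0.5, C=10.0)
clf.fit(X, y)

# 4) classify unseen data using the alpha_i values and the support vectors
print(clf.support_vectors_)        # the support vectors found during training
print(clf.predict([[2.0, 2.5]]))   # class of an unseen point
```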

Page 21: Support Vector Machines  Kernel Machines

Applications

• Handwritten digit recognition
  – Of interest to the US Postal Service
  – 4% error was obtained
  – only about 4% of the training data were SVs

• Text categorisation
• Face detection
• DNA analysis
• …many others

Page 22: Support Vector Machines  Kernel Machines

Discriminative versus generative classification methods

• SVMs learn the discrimination boundary. They are called discriminative approaches.

• This is in contrast to learning a model for each class, as e.g. Bayesian classification does. This latter approach is called the generative approach.

• SVMs try to avoid overfitting in high-dimensional spaces (cf. regularisation)

Page 23: Support Vector Machines  Kernel Machines

Conclusions

• SVMs learn linear decision boundaries (cf. perceptrons)
  – They pick the hyperplane that maximises the margin
  – The weight vector of the optimal hyperplane turns out to be a linear combination of the support vectors

• Transform non-linear problems to a higher-dimensional space using kernel functions; then there is more chance that in the transformed space the classes will be linearly separable.

Page 24: Support Vector Machines  Kernel Machines

Resources

• Software & practical guide to SVMs for beginners: http://www.csie.ntu.edu.tw/~cjlin/libsvm/

• Kernel machines website: http://www.kernel-machines.org/

• Burges, C. J. C.: A tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp. 121–167, 1998. Available from http://svm.research.bell-labs.com/SVMdoc.html

• Cristianini & Shawe-Taylor: SVM book (in the School library)