Transcript of Epsrcws08 Campbell Isvm 01


    Introduction to Support Vector Machines

    Colin Campbell,

    Bristol University


    Outline of talk.

    Part 1. An Introduction to SVMs

    1.1. SVMs for binary classification.

    1.2. Soft margins and multi-class classification.

    1.3. SVMs for regression.


    Part 2. General kernel methods

    2.1 Kernel methods based on linear programming and

    other approaches.

    2.2 Training SVMs and other kernel machines.

    2.3 Model Selection.

    2.4 Different types of kernels.

    2.5 SVMs in practice.


    Advantages of SVMs

    A principled approach to classification, regression or

    novelty detection tasks.

    SVMs exhibit good generalisation.

    The hypothesis has an explicit dependence on the data
    (via the support vectors), hence the model can be readily
    interpreted.


    Learning involves optimisation of a convex function

    (no false minima, unlike a neural network).

    Few parameters are required for tuning the learning
    machine (unlike a neural network, where the architecture
    and various parameters must be found).

    Can implement confidence measures, etc.


    1.1 SVMs for Binary Classification.

    Preliminaries: Consider a binary classification problem: input vectors are $\mathbf{x}_i$ and $y_i = \pm 1$ are the targets or labels.

    The index i labels the pattern pairs (i = 1, . . . , m).

    The xi define a space of labelled points called input space.


    From the perspective of statistical learning theory the

    motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error.

    These generalization bounds have two important features:


    1. the upper bound on the generalization error does not

    depend on the dimensionality of the space.

    2. the bound is minimized by maximizing the margin, $\gamma$,

    i.e. the minimal distance between the hyperplane

    separating the two classes and the closest datapoints of

    each class.


    [Figure: a separating hyperplane between the two classes and its margin.]


    In an arbitrary-dimensional space a separating
    hyperplane can be written:

    $$\mathbf{w} \cdot \mathbf{x} + b = 0$$

    where $b$ is the bias and $\mathbf{w}$ the weights.

    Thus we will consider a decision function of the form:

    $$D(\mathbf{x}) = \mathrm{sign}\left( \mathbf{w} \cdot \mathbf{x} + b \right)$$
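    To make this concrete, here is a minimal NumPy sketch of such a
    linear decision function (the weight vector w and bias b below are
    illustrative values, not the result of any training):

    ```python
    import numpy as np

    def decision(x, w, b):
        """Linear decision function D(x) = sign(w . x + b)."""
        return np.sign(np.dot(w, x) + b)

    # Illustrative parameters only (not trained values).
    w = np.array([2.0, -1.0])
    b = 0.5

    print(decision(np.array([1.0, 0.0]), w, b))   # prints  1.0
    print(decision(np.array([-1.0, 1.0]), w, b))  # prints -1.0
    ```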


    We note that the argument in $D(\mathbf{x})$ is invariant under a
    rescaling: $\mathbf{w} \rightarrow \lambda\mathbf{w}$, $b \rightarrow \lambda b$.

    We will implicitly fix a scale with:


    $$\mathbf{w} \cdot \mathbf{x} + b = +1$$

    $$\mathbf{w} \cdot \mathbf{x} + b = -1$$

    for the support vectors (canonical hyperplanes).


    Thus:

    $$\mathbf{w} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = 2$$

    for two support vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ on either side of the
    separating hyperplane.


    The margin will be given by the projection of the vector
    $(\mathbf{x}_1 - \mathbf{x}_2)$ onto the normal vector to the hyperplane, i.e.
    $\mathbf{w}/\lVert\mathbf{w}\rVert$, from which we deduce that the margin is given
    by $\gamma = 1/\lVert\mathbf{w}\rVert_2$.


    [Figure: the separating hyperplane and margin, with support vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ on either side.]


    Maximisation of the margin is thus equivalent to

    minimisation of the functional:

    $$\Phi(\mathbf{w}) = \frac{1}{2}\left( \mathbf{w} \cdot \mathbf{w} \right)$$

    subject to the constraints:

    $$y_i \left[ (\mathbf{w} \cdot \mathbf{x}_i) + b \right] \geq 1$$


    Thus the task is to find an optimum of the primal

    objective function:

    $$L(\mathbf{w}, b) = \frac{1}{2}\left( \mathbf{w} \cdot \mathbf{w} \right) - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( (\mathbf{w} \cdot \mathbf{x}_i) + b \right) - 1 \right]$$

    Solving the saddle point equation $\partial L / \partial b = 0$ gives:

    $$\sum_{i=1}^{m} \alpha_i y_i = 0$$


    and $\partial L / \partial \mathbf{w} = 0$ gives:

    $$\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i$$

    which, when substituted back into $L(\mathbf{w}, b, \alpha)$, tells us that
    we should maximise the functional (the Wolfe dual):


    $$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left( \mathbf{x}_i \cdot \mathbf{x}_j \right)$$

    subject to the constraints:

    $$\alpha_i \geq 0$$

    (the $\alpha_i$ are Lagrange multipliers)


    and:

    $$\sum_{i=1}^{m} \alpha_i y_i = 0$$

    The decision function is then:

    $$D(\mathbf{z}) = \mathrm{sign}\left( \sum_{j=1}^{m} \alpha_j y_j \left( \mathbf{x}_j \cdot \mathbf{z} \right) + b \right)$$


    So far we don't know how to handle non-separable
    datasets.

    To answer this problem we will exploit the first point
    mentioned above: the theoretical generalisation bounds
    do not depend on the dimensionality of the space.


    For the dual objective function we notice that the

    datapoints, xi, only appear inside an inner product.

    To get a better representation of the data we can therefore
    map the datapoints into an alternative, higher-dimensional
    space through the replacement:

    $$\mathbf{x}_i \cdot \mathbf{x}_j \rightarrow \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$$


    i.e. we have used a mapping $\mathbf{x}_i \rightarrow \Phi(\mathbf{x}_i)$.

    This higher dimensional space is called Feature Space and

    must be a Hilbert Space (the concept of an inner product

    applies).


    The function $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$ will be called a kernel, so:

    $$\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$$

    The kernel is therefore the inner product between

    mapped pairs of points in Feature Space.


    What is the mapping relation?

    In fact we do not need to know the form of this mapping

    since it will be implicitly defined by the functional form

    of the mapped inner product in Feature Space.
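    As a concrete illustration (not from the talk itself): for the
    quadratic kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$ on two-dimensional inputs an
    explicit mapping is $\Phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and the sketch
    below simply verifies numerically that the kernel equals the mapped
    inner product:

    ```python
    import numpy as np

    def phi(x):
        """Explicit feature map for the quadratic kernel (x . z)^2 in 2-D."""
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    def k_quad(x, z):
        """Quadratic kernel evaluated directly in input space."""
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    # Both numbers agree: the kernel is an inner product in feature space.
    print(k_quad(x, z))            # 1.0
    print(np.dot(phi(x), phi(z)))  # 1.0 (up to rounding)
    ```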


    Different choices for the kernel $K(\mathbf{x}, \mathbf{x}')$ define different
    Hilbert spaces to use. A large number of Hilbert spaces
    are possible, with RBF kernels:

    $$K(\mathbf{x}, \mathbf{x}') = e^{-\lVert \mathbf{x} - \mathbf{x}' \rVert^2 / 2\sigma^2}$$

    and polynomial kernels:

    $$K(\mathbf{x}, \mathbf{x}') = \left( \langle \mathbf{x}, \mathbf{x}' \rangle + 1 \right)^d$$

    being popular choices.
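    A minimal NumPy sketch of these two kernels (the bandwidth sigma
    and degree d defaults below are arbitrary illustrative choices):

    ```python
    import numpy as np

    def rbf_kernel(x, z, sigma=1.0):
        """RBF kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    def poly_kernel(x, z, d=2):
        """Polynomial kernel K(x, z) = (<x, z> + 1)^d."""
        return (np.dot(x, z) + 1.0) ** d

    x = np.array([1.0, 0.0])
    z = np.array([0.0, 1.0])
    print(rbf_kernel(x, z))   # exp(-1) ~ 0.368
    print(poly_kernel(x, z))  # (0 + 1)^2 = 1.0
    ```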


    Legitimate kernels are not restricted to closed-form
    functions: they may, for example, be defined by an algorithm.

    (1) String kernels

    Consider the following text strings: 'car', 'cat', 'cart', 'chart'.
    They have certain similarities despite being of unequal
    length. 'dug' is of equal length to 'car' and 'cat' but differs
    in all letters.


    A measure of the degree of alignment of two text strings

    or sequences can be found using algorithmic techniques

    e.g. edit codes (dynamic programming).

    Frequent applications in
    bioinformatics, e.g. the Smith-Waterman and
    Waterman-Eggert algorithms (strings over a 4-letter
    alphabet ... ACCGTATGTAAA ... from genomes),
    text processing (e.g. the WWW), etc.
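    As a sketch of the kind of algorithmic similarity meant here, the
    following computes plain Levenshtein edit distance by dynamic
    programming (a simpler relative of the Smith-Waterman alignment
    used in bioinformatics, shown only as an illustration):

    ```python
    def edit_distance(a, b):
        """Levenshtein edit distance computed by dynamic programming."""
        # dp[i][j] = minimum number of edits turning a[:i] into b[:j].
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            dp[i][0] = i
        for j in range(len(b) + 1):
            dp[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
        return dp[len(a)][len(b)]

    # 'car' and 'cat' are close; 'dug' differs from 'car' in every letter.
    print(edit_distance("car", "cat"))    # 1
    print(edit_distance("car", "chart"))  # 2
    print(edit_distance("car", "dug"))    # 3
    ```

    Note that a similarity score built from such a distance is not
    automatically a legitimate kernel; it still has to satisfy Mercer's
    condition, discussed below.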


    (2) Kernels for graphs/networks

    Diffusion kernel (Kondor and Lafferty)


    Mathematically, valid kernel functions should satisfy Mercer's condition:


    For any $g(\mathbf{x})$ for which:

    $$\int g(\mathbf{x})^2 \, d\mathbf{x} < \infty$$

    it must be the case that:

    $$\int\!\!\int K(\mathbf{x}, \mathbf{x}')\, g(\mathbf{x})\, g(\mathbf{x}')\, d\mathbf{x}\, d\mathbf{x}' \geq 0$$

    A simple criterion is that the kernel should be positive
    semi-definite.


    Theorem: If a kernel is positive semi-definite, i.e.:

    $$\sum_{i,j} K(\mathbf{x}_i, \mathbf{x}_j)\, c_i c_j \geq 0$$

    where $\{c_1, \ldots, c_n\}$ are real numbers, then there exists a
    function $\Phi(\mathbf{x})$ defining an inner product of possibly
    higher dimension, i.e.:

    $$K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$$
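    In practice this condition is often checked numerically on a finite
    sample by confirming that the Gram matrix has no negative
    eigenvalues. A small sketch, here using an RBF kernel:

    ```python
    import numpy as np

    def gram_matrix(kernel, X):
        """Gram matrix G[i, j] = K(x_i, x_j) for the rows of X."""
        m = len(X)
        return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

    def is_positive_semidefinite(G, tol=1e-8):
        """All eigenvalues of the symmetric Gram matrix should be >= 0."""
        return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

    X = np.random.randn(20, 2)
    G = gram_matrix(lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0), X)
    print(is_positive_semidefinite(G))  # True for the RBF kernel
    ```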


    Thus the following steps are used to train an SVM:

    1. Choose a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$.

    2. Maximise:

    $$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$

    subject to $\alpha_i \geq 0$ and $\sum_i \alpha_i y_i = 0$.


    3. The bias $b$ is found as follows:

    $$b = -\frac{1}{2}\left[ \min_{\{i \mid y_i = +1\}} \left( \sum_{j=1}^{m} \alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j) \right) + \max_{\{i \mid y_i = -1\}} \left( \sum_{j=1}^{m} \alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j) \right) \right]$$


    4. The optimal $\alpha_i$ go into the decision function:

    $$D(\mathbf{z}) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i y_i K(\mathbf{x}_i, \mathbf{z}) + b \right)$$
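    A compact end-to-end sketch of these four steps, written for
    clarity rather than speed: it maximises the dual with a
    general-purpose SciPy optimiser, whereas real implementations use
    specialised solvers such as SMO. All function names are my own, and
    for numerical robustness the sketch already uses the soft-margin
    box constraint $0 \leq \alpha_i \leq C$ introduced in Section 1.2 below:

    ```python
    import numpy as np
    from scipy.optimize import minimize

    def rbf(x, z, sigma=1.0):
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    def train_svm(X, y, kernel, C=10.0):
        """Steps 1-3: maximise the dual W(alpha), then recover the bias b."""
        m = len(y)
        K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

        def neg_dual(a):  # minimise -W(alpha)
            return -(np.sum(a) - 0.5 * np.sum(np.outer(a * y, a * y) * K))

        cons = {"type": "eq", "fun": lambda a: np.dot(a, y)}  # sum_i alpha_i y_i = 0
        bounds = [(0.0, C)] * m                               # 0 <= alpha_i <= C
        res = minimize(neg_dual, np.zeros(m), method="SLSQP",
                       bounds=bounds, constraints=cons)
        alpha = res.x

        f = K @ (alpha * y)                                  # sum_j alpha_j y_j K(x_i, x_j)
        b = -0.5 * (np.min(f[y == 1]) + np.max(f[y == -1]))  # step 3
        return alpha, b

    def decide(z, X, y, alpha, b, kernel):
        """Step 4: D(z) = sign( sum_i alpha_i y_i K(x_i, z) + b )."""
        return np.sign(sum(a * yi * kernel(xi, z) for a, yi, xi in zip(alpha, y, X)) + b)

    # Tiny illustrative dataset: two well-separated clusters.
    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([-1, -1, 1, 1])
    alpha, b = train_svm(X, y, rbf)
    print(decide(np.array([0.1, 0.0]), X, y, alpha, b, rbf))  # expected -1.0
    ```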


    1.2. Soft Margins and Multi-class problems

    Multi-class classification: Many problems involve
    multi-class classification and a number of schemes have
    been outlined.

    One of the simplest schemes is to use a directed acyclic
    graph (DAG) with the learning task reduced to binary
    classification at each node:


    [Diagram: a DAG for three classes, with pairwise nodes 1/3, 1/2 and 2/3 leading to the class decisions 1, 2 and 3.]
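    A sketch of how such a DAG is evaluated on a test point: a list of
    candidate classes is kept, and each pairwise binary classifier
    eliminates one candidate until a single class remains. The
    classifiers dictionary of trained pairwise deciders is hypothetical:

    ```python
    def dag_predict(z, classes, classifiers):
        """Evaluate a DAG of pairwise classifiers on a test point z.

        classifiers[(a, b)] returns +1 to keep class a, or -1 to keep
        class b (one trained binary SVM per pair of classes).
        """
        remaining = list(classes)
        while len(remaining) > 1:
            a, b = remaining[0], remaining[-1]
            if classifiers[(a, b)](z) > 0:
                remaining.pop()      # eliminate b, keep a
            else:
                remaining.pop(0)     # eliminate a, keep b
        return remaining[0]

    # Hypothetical pairwise deciders for three classes (illustration only).
    classifiers = {
        (1, 3): lambda z: +1,   # prefers class 1 over class 3
        (1, 2): lambda z: -1,   # prefers class 2 over class 1
        (2, 3): lambda z: +1,   # prefers class 2 over class 3
    }
    print(dag_predict(None, [1, 2, 3], classifiers))  # 2
    ```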


    Soft margins: Most real life datasets contain noise and

    an SVM can fit this noise leading to poor generalization.

    The effect of outliers and noise can be reduced by

    introducing a soft margin. Two schemes are commonly

    used:


    L1 error norm:

    $$0 \leq \alpha_i \leq C$$

    L2 error norm:

    $$K(\mathbf{x}_i, \mathbf{x}_i) \rightarrow K(\mathbf{x}_i, \mathbf{x}_i) + \lambda$$


    The justification for these soft margin techniques comes

    from statistical learning theory.

    They can be readily viewed as a relaxation of the hard
    margin constraint.


    For the L1 error norm (prior to introducing kernels) we
    introduce a positive slack variable $\xi_i$:

    $$y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \geq 1 - \xi_i$$

    so we minimize the sum of errors $\sum_{i=1}^{m} \xi_i$ in addition to
    $\lVert\mathbf{w}\rVert^2$:

    $$\min\left[ \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \sum_{i=1}^{m} \xi_i \right]$$


    This is readily formulated as a primal objective function:

    $$L(\mathbf{w}, b, \alpha, \xi) = \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i$$

    with Lagrange multipliers $\alpha_i \geq 0$ and $r_i \geq 0$.


    The derivatives with respect to $\mathbf{w}$, $b$ and $\xi_i$ give:

    $$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i = 0 \qquad (1)$$

    $$\frac{\partial L}{\partial b} = \sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (2)$$

    $$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - r_i = 0 \qquad (3)$$


    Resubstituting back into the primal objective function we
    obtain the same dual objective function as before. However,
    $r_i \geq 0$ and $C - \alpha_i - r_i = 0$, hence $\alpha_i \leq C$ and
    the constraint $0 \leq \alpha_i$ is replaced by $0 \leq \alpha_i \leq C$.


    Patterns with values $0 < \alpha_i < C$ will be referred to as
    non-bound, and those with $\alpha_i = 0$ or $\alpha_i = C$ will be said
    to be at bound.

    The optimal value of C must be found by

    experimentation using a validation set and it cannot be

    readily related to the characteristics of the dataset or

    model.
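    A common way to do that experimentation is a grid search over C
    with cross-validation. A minimal sketch using scikit-learn
    (assuming it is available; the candidate values of C and the
    synthetic data are arbitrary examples):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # Try several soft-margin parameters and keep the best by cross-validation.
    search = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10, 100]}, cv=5)
    search.fit(X, y)
    print(search.best_params_)
    ```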


    In an alternative approach, $\nu$-SVM, it can be shown that
    solutions for an L1 error norm are the same as those
    obtained from maximizing:

    $$W(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$


    subject to:

    $$\sum_{i=1}^{m} y_i \alpha_i = 0, \qquad \sum_{i=1}^{m} \alpha_i \geq \nu, \qquad 0 \leq \alpha_i \leq \frac{1}{m} \qquad (4)$$

    where $\nu$ lies in the range 0 to 1.


    The fraction of training errors is upper bounded by $\nu$, and $\nu$
    also provides a lower bound on the fraction of points
    which are support vectors.

    Hence in this formulation the conceptual meaning of the

    soft margin parameter is more transparent.
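    scikit-learn exposes this formulation directly as NuSVC; a minimal
    usage sketch (the value nu=0.1 and the synthetic data are only
    examples):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.svm import NuSVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # nu upper-bounds the fraction of training errors and
    # lower-bounds the fraction of support vectors.
    model = NuSVC(nu=0.1, kernel="rbf").fit(X, y)
    print(len(model.support_) / len(X))  # fraction of support vectors (>= nu)
    ```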


    For the L2 error norm the primal objective function is:

    $$L(\mathbf{w}, b, \alpha, \xi) = \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \sum_{i=1}^{m} \xi_i^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i \qquad (5)$$

    with $\alpha_i \geq 0$ and $r_i \geq 0$.


    After obtaining the derivatives with respect to $\mathbf{w}$, $b$ and
    $\xi$, substituting for $\mathbf{w}$ and $\xi$ in the primal objective
    function, and noting that the dual objective function is
    maximal when $r_i = 0$, we obtain the following dual
    objective function after kernel substitution:

    $$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) - \frac{1}{4C} \sum_{i=1}^{m} \alpha_i^2$$


    With $\lambda = 1/2C$ this gives the same dual objective
    function as for hard margin learning, except for the
    substitution:

    $$K(\mathbf{x}_i, \mathbf{x}_i) \rightarrow K(\mathbf{x}_i, \mathbf{x}_i) + \lambda$$
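    In code, the L2 soft margin therefore only changes the Gram
    matrix: $\lambda$ is added to its diagonal before the hard-margin dual is
    solved. A NumPy one-liner, assuming a precomputed Gram matrix K:

    ```python
    import numpy as np

    C = 10.0                                # illustrative value
    lam = 1.0 / (2.0 * C)                   # lambda = 1/2C
    K = np.array([[1.0, 0.3],
                  [0.3, 1.0]])              # toy precomputed Gram matrix

    K_soft = K + lam * np.eye(len(K))       # K(x_i, x_i) -> K(x_i, x_i) + lambda
    print(K_soft)
    ```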


    For many real-life datasets there is an imbalance between

    the amount of data in different classes, or the significance

    of the data in the two classes can be quite different. The
    relative balance between the detection rate for different

    classes can be easily shifted by introducing asymmetric

    soft margin parameters.


    Thus for binary classification with an L1 error norm:

    $$0 \leq \alpha_i \leq C_+ \quad \text{for } y_i = +1,$$

    and

    $$0 \leq \alpha_i \leq C_- \quad \text{for } y_i = -1.$$
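    In scikit-learn this asymmetry is usually expressed through
    per-class weights that scale C; a minimal sketch (the weights and
    the imbalanced synthetic data are arbitrary examples):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Imbalanced toy problem: roughly 90% of the points in class 0.
    X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

    # class_weight rescales C per class, i.e. C_+ and C_- differ.
    model = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 5.0}).fit(X, y)
    print(model.n_support_)  # number of support vectors per class
    ```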


    1.3. Solving Regression Tasks with SVMs

    $\epsilon$-SV regression (Vapnik): not the only approach.

    Statistical learning theory: we use the constraints
    $y_i - \mathbf{w} \cdot \mathbf{x}_i - b \leq \epsilon$ and $(\mathbf{w} \cdot \mathbf{x}_i + b) - y_i \leq \epsilon$ to allow for
    some deviation $\epsilon$ between the eventual targets $y_i$ and the
    function $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$ modelling the data.


    As before we minimize $\lVert\mathbf{w}\rVert^2$ to penalise over-complexity.

    To account for training errors we also introduce slack
    variables $\xi_i$, $\widehat{\xi}_i$ for the two types of training error.
    These slack variables are zero for points inside the tube
    and progressively increase for points outside the tube
    according to the loss function used.


    Figure 1: A linear $\epsilon$-insensitive loss function.


    For a linear $\epsilon$-insensitive loss function the task is
    therefore to minimize:

    $$\min\left[ \lVert\mathbf{w}\rVert^2 + C \sum_{i=1}^{m} \left( \xi_i + \widehat{\xi}_i \right) \right]$$


    subject to:

    $$y_i - \mathbf{w} \cdot \mathbf{x}_i - b \leq \epsilon + \xi_i$$

    $$(\mathbf{w} \cdot \mathbf{x}_i + b) - y_i \leq \epsilon + \widehat{\xi}_i$$

    where the slack variables are both positive.


    Deriving the dual objective function we get:

    $$W(\alpha, \widehat{\alpha}) = \sum_{i=1}^{m} y_i \left( \alpha_i - \widehat{\alpha}_i \right) - \epsilon \sum_{i=1}^{m} \left( \alpha_i + \widehat{\alpha}_i \right) - \frac{1}{2} \sum_{i,j=1}^{m} \left( \alpha_i - \widehat{\alpha}_i \right)\left( \alpha_j - \widehat{\alpha}_j \right) K(\mathbf{x}_i, \mathbf{x}_j)$$


    which is maximized subject to:

    $$\sum_{i=1}^{m} \alpha_i = \sum_{i=1}^{m} \widehat{\alpha}_i$$

    and:

    $$0 \leq \alpha_i \leq C, \qquad 0 \leq \widehat{\alpha}_i \leq C$$


    The function modelling the data is then:

    $$f(\mathbf{z}) = \sum_{i=1}^{m} \left( \alpha_i - \widehat{\alpha}_i \right) K(\mathbf{x}_i, \mathbf{z}) + b$$
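    For completeness, a minimal usage sketch of $\epsilon$-SV regression with
    scikit-learn's SVR, whose fitted model has exactly this form (the
    choices of C, epsilon and the toy data are illustrative only):

    ```python
    import numpy as np
    from sklearn.svm import SVR

    # Toy 1-D regression data: a noisy sine curve.
    rng = np.random.default_rng(0)
    X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)

    # epsilon sets the width of the insensitive tube; C is the soft-margin trade-off.
    model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
    print(model.predict([[np.pi / 2]]))  # should be close to sin(pi/2) = 1
    ```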


    Similarly we can define many other types of loss

    functions.

    We can define a linear programming approach to regression.

    We can use many other approaches to regression (e.g.

    kernelising least squares).