
Neurocomputation Seminar © 2011 IBM Corporation

A tutorial about SVM

Omer Boehm, IBM Haifa Labs (omerb@il.ibm.com)


Outline

- Introduction
- Classification
- Perceptron
- SVM for linearly separable data
- SVM for almost linearly separable data
- SVM for non-linearly separable data


Introduction

Machine learning is a branch of artificial intelligence: a scientific discipline concerned with the design and development of algorithms that allow computers to improve their behavior based on empirical data.

An important task of machine learning is classification.

Classification is also referred to as pattern recognition.


Example

         Income   Debt    Married  Age
Shelley   60,000   1,000  No       30
Elad     200,000       0  Yes      80
Dan            0  20,000  No       25
Alona    100,000  10,000  Yes      40

The learning machine maps objects (the applicants) to one of the classes: approve / deny.


Types of learning problems

Supervised learning (n classes, n > 1)
- Classification
- Regression

Unsupervised learning (0 classes)
- Clustering (building equivalence classes)
- Density estimation


Supervised learning

Regression
- Learn a continuous function from input samples.
- Example: stock prediction. Input – a future date. Output – the stock price. Training – information on the stock price over the last period.

Classification
- Learn a separating function from inputs to a discrete set of classes.
- Example: Optical Character Recognition (OCR). Input – images of digits. Output – a label 0-9. Training – labeled images of digits.

In fact, both are approximation problems.


Regression


Classification


Density estimation


What makes learning difficult

Given the following examples, how should we draw the line?


What makes learning difficult

Which one is most appropriate?


What makes learning difficult

The hidden test points


What is Learning (mathematically)?

We would like to ensure that a small change in an input relative to a learning point does not result in a jump to a different classification: if $x \approx x_i$ then $f(x) \approx y_i$.

Such an approximation is called a stable approximation. As a rule of thumb, small derivatives ensure a stable approximation.


Stable vs. Unstable approximation

Lagrange approximation (unstable): given $n$ points $(x_i, y_i)$, we find the unique polynomial $f(x) = L_{n-1}(x)$ of degree $n-1$ that passes through the given points.

Spline approximation (stable): given $n$ points $(x_i, y_i)$, we find a piecewise approximation $f(x)$ by third-degree polynomials such that they pass through the given points, have common tangents at the division points, and in addition

$\int_{x_1}^{x_n} |f''(x)|^2 \, dx \;\to\; \min$
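As an illustration, a minimal sketch (assuming NumPy and SciPy are available; Runge's function and the point counts are illustrative choices, not from the slides) contrasting the two approximations: a high-degree Lagrange interpolant tends to oscillate, while a cubic spline through the same points stays smooth.

```python
import numpy as np
from scipy.interpolate import lagrange, CubicSpline

x = np.linspace(-1, 1, 11)
y = 1.0 / (1.0 + 25.0 * x**2)           # Runge's function, a classic stress test

poly = lagrange(x, y)                    # degree-10 polynomial through all points
spline = CubicSpline(x, y)               # piecewise cubic with matching tangents

xs = np.linspace(-1, 1, 201)
print("max |Lagrange|:", np.abs(poly(xs)).max())    # noticeably overshoots near the ends
print("max |spline|  :", np.abs(spline(xs)).max())  # stays close to the data range
```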


What would be the best choice?

The “simplest” solution: a solution where the distance from each example is as small as possible and where the derivatives are as small as possible.


Vector geometry (just in case …)


Dot product

The dot product of two vectors $a = [a_1, a_2, a_3, \dots, a_n]$ and $b = [b_1, b_2, b_3, \dots, b_n]$ is defined as:

$a \cdot b = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + a_3 b_3 + \dots + a_n b_n$

An example: $[1, 3, 5] \cdot [4, -2, 1] = (1)(4) + (3)(-2) + (5)(1) = 3$


Dot product

$a \cdot b = \|a\| \, \|b\| \cos\theta$, where $\|a\|$ denotes the length (magnitude) of $a$, and $a \cdot a = \|a\|^2$.

Unit vector: $\hat{a} = \dfrac{a}{\|a\|}$

If $a$ is perpendicular to $b$, then $a \cdot b = \|a\| \, \|b\| \cos(90^\circ) = 0$.
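These identities are easy to check numerically; a minimal sketch assuming NumPy, with illustrative vectors:

```python
import numpy as np

a = np.array([1.0, 3.0, 5.0])
b = np.array([4.0, -2.0, 1.0])

dot = a @ b                               # a . b = sum_i a_i b_i
norm_a = np.linalg.norm(a)                # ||a|| = sqrt(a . a)
a_hat = a / norm_a                        # unit vector in the direction of a

cos_theta = dot / (norm_a * np.linalg.norm(b))    # a . b = ||a|| ||b|| cos(theta)
print(dot, norm_a, cos_theta)

# perpendicular vectors have zero dot product
print(np.isclose(np.array([1.0, 0.0]) @ np.array([0.0, 2.0]), 0.0))   # True
```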


Plane/Hyperplane

A hyperplane can be defined by:
- three points
- two vectors
- a normal vector and a point


Plane/Hyperplane

Let $n$ be a vector perpendicular to the hyperplane $H$.
Let $p_1$ be the position vector of some known point in the plane.
A point $P$ with position vector $p$ is in the plane iff the vector drawn from $p_1$ to $p$ is perpendicular to $n$.
Two vectors are perpendicular iff their dot product is zero, so

$n \cdot (p - p_1) = 0$

Substituting $w = n$, $x = p$, and $b = -n \cdot p_1$, the hyperplane $H$ can be expressed as

$w \cdot x + b = 0$
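A tiny numeric check of this derivation (NumPy assumed; the normal and the in-plane point are made up for illustration):

```python
import numpy as np

n = np.array([1.0, 2.0, -1.0])      # normal vector w
p1 = np.array([0.0, 0.0, 1.0])      # a known point in the plane
b = -n @ p1                          # b = -n . p1

x = p1 + np.cross(n, [1.0, 0.0, 0.0])    # move from p1 along a direction orthogonal to n
print(np.isclose(n @ x + b, 0.0))        # True: x still satisfies w . x + b = 0
```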


Classification


Solving approximation problems

First we define the family of approximating functions $F$.

Next we define the cost function $C(f)$, $f \in F$. This function tells how well $f$ performs the required approximation.

Having done this, the approximation/classification consists of solving the minimization problem

$\min_{f \in F} C(f)$

A first necessary condition (after Fermat) is $\dfrac{\partial C}{\partial f} = 0$.

As we know, it is always possible to apply Newton-Raphson and obtain a sequence of approximations.


Classification

A classifier is a function or an algorithm that maps every possible input (from a legal set of inputs) to a finite set of categories.

$X$ is the input space; $x \in X$ is a data point from the input space. A typical input space is high-dimensional, for example $x = (x_1, x_2, \dots, x_d) \in \mathbb{R}^d$ with $d \gg 1$. $x$ is also called a feature vector.

$\Omega$ is a finite set of categories to which the input data points belong: $\Omega = \{1, 2, \dots, C\}$. Its elements $\omega_i$ are called labels.


Classification

$Y$ is a finite set of decisions – the output set of the classifier.

The classifier is a function $f : X \to Y$; for every $x \in X$, $f$ classifies $x$ as $y = f(x) \in Y$.


The Perceptron


Perceptron - Frank Rosenblatt (1957)

Linear separation of the input space

$f(x) = w \cdot x + b$

$h(x) = \mathrm{sign}(f(x))$


Perceptron algorithm

Start: the weight vector $w_0$ is generated randomly; set $t = 0$.

Test: a vector $x \in P \cup N$ is selected randomly:
- if $x \in P$ and $w_t \cdot x > 0$, go to Test
- if $x \in P$ and $w_t \cdot x \le 0$, go to Add
- if $x \in N$ and $w_t \cdot x < 0$, go to Test
- if $x \in N$ and $w_t \cdot x \ge 0$, go to Subtract

Add: $w_{t+1} = w_t + x$, $t = t + 1$, go to Test.

Subtract: $w_{t+1} = w_t - x$, $t = t + 1$, go to Test.


Perceptron algorithm

Shorter version

Update rule for the $(k+1)$-th iteration (one iteration per data point):

if $y_i (w_k \cdot x_i) \le 0$ then
  $w_{k+1} = w_k + y_i x_i$
  $k = k + 1$
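A minimal sketch of this update rule in Python (NumPy assumed; the toy data, labels, and epoch cap are illustrative choices, not from the slides):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt perceptron on homogeneous coordinates (bias folded into w)."""
    X = np.hstack([X, np.ones((len(X), 1))])    # append 1 so the last weight acts as b
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:              # misclassified (or on the boundary)
                w += yi * xi                    # w_{k+1} = w_k + y_i x_i
                mistakes += 1
        if mistakes == 0:                       # all points correct: converged
            break
    return w

# toy linearly separable data with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))   # matches y once converged
```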


Perceptron – visualization (intuition)


Perceptron - analysis

The solution is a linear combination of training points:

$w = \sum_i \alpha_i y_i x_i, \qquad \alpha_i \ge 0$

Only informative points are used (mistake driven).

The coefficient of a point reflects its 'difficulty'.

The perceptron learning algorithm does not terminate if the learning set is not linearly separable (e.g. XOR).


Support Vector Machines


Advantages of SVM (Vladimir Vapnik, 1979, 1998)

Exhibits good generalization.

Can implement confidence measures, etc.

Hypothesis has an explicit dependence on the data (via the support vectors)

Learning involves optimization of a convex function (no false minima, unlike NN).

Few parameters required for tuning the learning machine (unlike NN where the architecture/various parameters must be found).


Advantages of SVM

From the perspective of statistical learning theory, the motivation for considering binary-classifier SVMs comes from theoretical bounds on the generalization error.

These generalization bounds have two important features:


Advantages of SVM

The upper bound on the generalization error does not depend on the dimensionality of the space.

The bound is minimized by maximizing the margin, i.e. the minimal distance between the hyperplane separating the two classes and the closest data-points of each class.


Basic scenario - Separable data set


Basic scenario – define margin


In an arbitrary-dimensional space, a separating hyperplane can be written:

$w \cdot x + b = 0$

where $w$ is the normal. The decision function would be:

$D(x) = \mathrm{sign}(w \cdot x + b)$


Note that the argument in $D(x)$ is invariant under a rescaling of the form $w \to \lambda w$, $b \to \lambda b$.

Implicitly the scale can be fixed by defining the canonical hyperplanes

$H_1: w \cdot x + b = 1$
$H_2: w \cdot x + b = -1$

The data points lying on them are the support vectors.


The task is to select $w, b$ so that the training data can be described as:

$x_i \cdot w + b \ge +1$ for $y_i = +1$
$x_i \cdot w + b \le -1$ for $y_i = -1$

These can be combined into:

$y_i (x_i \cdot w + b) - 1 \ge 0 \quad \forall i$


The margin will be given by the projection of the vector $(x_1 - x_2)$ onto the normal vector to the hyperplane, i.e. $\dfrac{w}{\|w\|}$.

So the (Euclidean) distance can be formed as

$d = \dfrac{|(x_1 - x_2) \cdot w|}{\|w\|}$


Note that $x_1$ lies on $H_1$, i.e. $w \cdot x_1 + b = 1$.

Similarly for $x_2$: $w \cdot x_2 + b = -1$.

Subtracting the two results in

$w \cdot (x_1 - x_2) = 2$


The margin can be put as

$m = \dfrac{|(x_1 - x_2) \cdot w|}{\|w\|} = \dfrac{2}{\|w\|}$

We can convert the problem to minimizing

$J(w) = \dfrac{1}{2}(w \cdot w)$

subject to the constraints:

$y_i (x_i \cdot w + b) - 1 \ge 0 \quad \forall i$

$J(w)$ is a quadratic function, thus there is a single global minimum.
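A small numeric check of the canonical constraints and the $2/\|w\|$ margin formula (NumPy assumed; the data and the hyperplane are hand-picked for illustration):

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.array([0.5, 0.0])   # x1 = +-2 become the canonical hyperplanes w.x + b = +-1
b = 0.0

print(y * (X @ w + b))                        # all >= 1, equality on the closest points
print("margin:", 2.0 / np.linalg.norm(w))     # 4.0: distance between x1 = -2 and x1 = +2
```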


Lagrange multipliers

Problem definition: maximize $f(x, y)$ subject to $g(x, y) = c$.

A new variable $\lambda$, called a 'Lagrange multiplier', is used to define

$\Lambda(x, y, \lambda) = f(x, y) + \lambda \, (g(x, y) - c)$


Lagrange multipliers


Lagrange multipliers - example

Maximize $f(x, y) = x + y$, subject to $x^2 + y^2 = 1$.

Formally, set

$\Lambda(x, y, \lambda) = x + y + \lambda \, (x^2 + y^2 - 1)$

Set the derivatives to 0:

$\dfrac{\partial \Lambda}{\partial x} = 1 + 2\lambda x = 0$
$\dfrac{\partial \Lambda}{\partial y} = 1 + 2\lambda y = 0$
$\dfrac{\partial \Lambda}{\partial \lambda} = x^2 + y^2 - 1 = 0$

Combining the first two yields $x = y$. Substituting into the last gives $x = y = \pm\dfrac{\sqrt{2}}{2}$. Evaluating the objective function $f$ on these yields

$f\!\left(\dfrac{\sqrt{2}}{2}, \dfrac{\sqrt{2}}{2}\right) = \sqrt{2}, \qquad f\!\left(-\dfrac{\sqrt{2}}{2}, -\dfrac{\sqrt{2}}{2}\right) = -\sqrt{2}$
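The same stationary points can be recovered symbolically; a short sketch assuming SymPy is available:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x + y
L = f + lam * (x**2 + y**2 - 1)          # the Lagrangian from the slide

sols = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)], [x, y, lam], dict=True)
for s in sols:
    print(s[x], s[y], f.subs(s))          # (+-sqrt(2)/2, +-sqrt(2)/2) with f = +-sqrt(2)
```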


Primal problem: minimize $J(w) = \dfrac{1}{2}(w \cdot w)$ s.t. $y_i (x_i \cdot w + b) - 1 \ge 0 \;\; \forall i$

Introduce Lagrange multipliers $\alpha_i \ge 0$ associated with the constraints.

The solution to the primal problem is equivalent to determining the saddle point of the function:

$L_p(w, b, \alpha) = \dfrac{1}{2}(w \cdot w) - \sum_{i=1}^{n} \alpha_i \left( y_i (x_i \cdot w + b) - 1 \right)$


At the saddle point, $L_p$ has a minimum with respect to $w$ and $b$, requiring

$\dfrac{\partial L_p}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\;\Rightarrow\;\; w = \sum_i \alpha_i y_i x_i$

$\dfrac{\partial L_p}{\partial b} = \sum_i \alpha_i y_i = 0$


Primal-Dual

Primal:

Minimize

$L_p = \dfrac{1}{2}(w \cdot w) - \sum_{i=1}^{n} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{n} \alpha_i$

with respect to $w, b$, subject to $\alpha_i \ge 0$.

Substitute $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$.

Dual:

Maximize

$L_d = \sum_{i=1}^{n} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$

with respect to $\alpha_i$, subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$.


Solving QP using dual problem

Maximize

$L_d = \sum_{i=1}^{n} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$

constrained to $\alpha_i \ge 0 \;\; \forall i$ and $\sum_i \alpha_i y_i = 0$.

We have $n$ new variables $\alpha_1, \alpha_2, \dots, \alpha_n$, one for each data point.

This is a convex quadratic optimization problem; we run a QP solver to get $\alpha$, and then $w = \sum_i \alpha_i y_i x_i$.


$b$ can be determined from the optimal $\alpha$ and the Karush-Kuhn-Tucker (KKT) conditions:

$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0 \quad \forall i$

$\alpha_i > 0$ (data points residing on the support hyperplanes) implies $y_i (w \cdot x_i + b) = 1$, i.e.

$b = y_i - w \cdot x_i$

or, averaging over the $N_s$ support vectors,

$b = \dfrac{1}{N_s} \sum_{i \in SV} \left( y_i - w \cdot x_i \right)$
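A hedged sketch of reading these quantities off a trained linear SVM (scikit-learn assumed; a very large C approximates the hard-margin case, and the toy data is illustrative):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5], [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6)         # very large C ~ hard margin
clf.fit(X, y)

w = clf.coef_[0]                           # equals sum_i alpha_i y_i x_i
b = clf.intercept_[0]
sv = clf.support_vectors_                  # the points with alpha_i > 0
alpha_y = clf.dual_coef_[0]                # the products alpha_i * y_i

print("w:", w, "b:", b)
print("margin 2/||w||:", 2.0 / np.linalg.norm(w))
print("y_i (w.x_i + b) on support vectors:", (sv @ w + b) * y[clf.support_])  # ~ 1
```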


For every data point $i$, one of the following must hold:

$\alpha_i = 0$, or
$\alpha_i > 0$ and $y_i (w \cdot x_i + b) - 1 = 0$

Many $\alpha_i = 0$: the solution is sparse.

Data points with $\alpha_i > 0$ are the support vectors.

The optimal hyperplane is completely defined by the support vectors:

$w = \sum_i \alpha_i y_i x_i$


SVM - The classification

Given a new data point $z$, find its label $y$:

$D(z) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot z) + b \right)$


Extended scenario - Non-Separable data set


Data is most likely not separable (inconsistencies, outliers, noise), but a linear classifier may still be appropriate.

SVM can be applied in this non-separable case as long as the data is almost linearly separable.


SVM with slacks

Use non-negative slack variables $\xi_1, \xi_2, \dots, \xi_n$, one per data point.

Change the constraints from $y_i (x_i \cdot w + b) \ge 1 \;\; \forall i$ to $y_i (x_i \cdot w + b) \ge 1 - \xi_i \;\; \forall i$.

$\xi_i$ is a measure of deviation from the ideal position for sample $i$: $0 < \xi_i \le 1$ means the sample lies inside the margin but is still correctly classified, while $\xi_i > 1$ means it is misclassified.


SVM with slacks

We would like to minimize

$J(w, \xi_1, \dots, \xi_n) = \dfrac{1}{2}(w \cdot w) + C \sum_i \xi_i$

constrained to

$y_i (x_i \cdot w + b) \ge 1 - \xi_i$ and $\xi_i \ge 0 \quad \forall i$

The parameter $C$ is a regularization term, which provides a way to control over-fitting:
- if $C$ is small, we allow many samples to be not in the ideal position
- if $C$ is large, we want very few samples to be not in the ideal position
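An illustrative sketch of how $C$ controls this trade-off (scikit-learn and NumPy assumed; the two overlapping blobs are synthetic):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], size=(50, 2)),
               rng.normal(loc=[-2, -2], size=(50, 2))])   # two slightly overlapping blobs
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin 2/||w|| = {2 / np.linalg.norm(w):.3f}, "
          f"#support vectors = {len(clf.support_)}")
```

Small $C$ yields a wide margin with many points inside it; large $C$ shrinks the margin to reduce the slack.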


SVM with slacks - Dual formulation

Maximize

$L_d(\alpha) = \sum_{i=1}^{n} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$

constrained to

$0 \le \alpha_i \le C \;\; \forall i \qquad$ and $\qquad \sum_{i=1}^{n} \alpha_i y_i = 0$


SVM - non linear mapping

Cover’s theorem: “A pattern-classification problem cast in a high-dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space.”

One-dimensional data, not linearly separable.

Lift to a two-dimensional space with $\varphi(x) = (x, x^2)$.
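A quick illustration of this lift (NumPy assumed; the sample points and the linear rule in the lifted space are hand-picked):

```python
import numpy as np

x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.where(np.abs(x) < 1, 1, -1)        # class depends on |x|, not on x: not separable in 1D

phi = np.column_stack([x, x**2])           # lifted points (x, x^2)
w, b = np.array([0.0, -1.0]), 1.0          # linear rule in 2D: sign(1 - x^2)
print(np.sign(phi @ w + b) == y)           # all True: linearly separable after the lift
```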


SVM - non linear mapping

Solve a non-linear classification problem with a linear classifier:
- Project the data $x$ to a high dimension using a function $\varphi(x)$.
- Find a linear discriminant function for the transformed data.
- The final nonlinear discriminant function is $g(x) = w^{t} \varphi(x) + w_0$.

In 2D (the lifted space) the discriminant function is linear; in 1D (the original space) it is NOT linear.


SVM - non linear mapping

We can use any linear classifier after lifting the data into a higher-dimensional space. However, we would then have to deal with the “curse of dimensionality”: poor generalization to test data and computationally expensive training.

SVM handles the “curse of dimensionality” problem:
- Enforcing the largest margin permits good generalization. It can be shown that generalization in SVM is a function of the margin, independent of the dimensionality.
- Computation in the higher-dimensional case is performed only implicitly, through the use of kernel functions.


Non linear SVM - kernels

Recall:

Maximize $L_d(\alpha) = \sum_{i=1}^{n} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$

Classification: $D(z) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot z) + b \right)$

The data points appear only in dot products. If $x_i$ is mapped to a high-dimensional space using $\varphi(x)$, the high-dimensional product $\varphi(x_i)^{t} \varphi(x_j)$ is needed:

Maximize $L_d(\alpha) = \sum_{i=1}^{n} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \varphi(x_i)^{t} \varphi(x_j)$

The dimensionality of the space $F$ is not necessarily important; we may not even know the map $\varphi$.


Kernel

A kernel is a function that returns the value of the dot product between the images of its two arguments:

$k(x, y) = \varphi(x) \cdot \varphi(y)$

Given a function $K$, it is possible to verify that it is a kernel. Now we only need to compute $k(x, y)$ instead of $\varphi(x) \cdot \varphi(y)$:

Maximize $L_d(\alpha) = \sum_{i=1}^{n} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$

The “kernel trick”: we do not need to perform operations in the high-dimensional space explicitly.


Kernel Matrix

The kernel matrix is the central structure in kernel machines:
- it contains all necessary information for the learning algorithm
- it fuses information about the data AND the kernel
- it has many interesting properties (see Mercer's theorem below)


Mercer’s Theorem

The kernel matrix is symmetric positive (semi)definite. Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner-product matrix in some space.

Every (semi)positive definite, symmetric function is a kernel: i.e. there exists a mapping $\varphi$ such that it is possible to write

$k(x, y) = \varphi(x) \cdot \varphi(y)$

Positive definite here means: $\displaystyle \iint k(x, y)\, f(x)\, f(y) \, dx \, dy \ge 0 \quad \forall f \in L_2$
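A small numeric sanity check of this property (NumPy assumed; the RBF kernel and random points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
sigma = 1.0

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))            # RBF kernel (Gram) matrix

print(np.allclose(K, K.T))                        # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)      # eigenvalues >= 0 (up to round-off)
```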


Examples of kernels

Some common choices (both satisfying Mercer's condition):

Polynomial kernel: $k(x_i, x_j) = (x_i \cdot x_j + 1)^p$

Gaussian radial basis function (RBF): $k(x_i, x_j) = \exp\left( -\dfrac{1}{2\sigma^2} \, \| x_i - x_j \|^2 \right)$


Polynomial Kernel - example

$(x \cdot z)^2 = (x_1 z_1 + x_2 z_2)^2$
$\qquad = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2$
$\qquad = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2) \cdot (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2)$
$\qquad = \varphi(x) \cdot \varphi(z)$
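This identity is easy to verify numerically; a short sketch assuming NumPy, with arbitrary example vectors:

```python
import numpy as np

def phi(v):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel in 2D."""
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])
print((x @ z) ** 2, phi(x) @ phi(z))   # both equal 1.0 here: (1*2 + 3*(-1))^2 = 1
```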


Applying - non linear SVM

Start with data $x_1, x_2, \dots, x_n$, which lives in a feature space of dimension $d$.

Choose a kernel $k(x_i, x_j)$ corresponding to some function $\varphi(x)$, which takes a data point $x_i$ to a higher-dimensional space.

Find the largest-margin linear discriminant function in the higher-dimensional space by using a quadratic programming package to solve:

Maximize $L_d(\alpha) = \sum_{i=1}^{n} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$

constrained to $0 \le \alpha_i \le C \;\; \forall i$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$
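An end-to-end sketch of these steps (scikit-learn assumed; make_circles, gamma, and C are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel='linear', C=1.0).fit(X, y)
rbf = SVC(kernel='rbf', C=1.0, gamma=2.0).fit(X, y)    # gamma = 1/(2*sigma^2)

print("linear kernel accuracy:", linear.score(X, y))   # poor: circles are not separable by a line
print("RBF kernel accuracy:   ", rbf.score(X, y))      # ~1.0: separable after the implicit lift
print("#support vectors (RBF):", len(rbf.support_))
```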


Applying - non linear SVM

Weight vector $w$ in the high-dimensional space:

$w = \sum_i \alpha_i y_i \, \varphi(x_i)$

Linear discriminant function of largest margin in the high-dimensional space:

$g(\varphi(x)) = w^{t} \varphi(x) = \sum_{x_i \in SV} \alpha_i y_i \, \varphi(x_i)^{t} \varphi(x)$

Non-linear discriminant function in the original space:

$g(x) = \sum_{x_i \in SV} \alpha_i y_i \, \varphi(x_i)^{t} \varphi(x) = \sum_{x_i \in SV} \alpha_i y_i \, k(x_i, x)$


SVM summary

Advantages:
- Based on nice theory
- Excellent generalization properties
- Objective function has no local minima
- Can be used to find non-linear discriminant functions
- Complexity of the classifier is characterized by the number of support vectors rather than the dimensionality of the transformed space

Disadvantages:
- It is not clear how to select a kernel function in a principled manner
- Tends to be slower than other methods (in the non-linear case)