Neurocomputation Seminar. A tutorial about SVM. Omer Boehm, [email protected]. © 2011 IBM Corporation
IBM Haifa Labs
Outline
Introduction
Classification
Perceptron
SVM for linearly separable data
SVM for almost linearly separable data
SVM for non-linearly separable data
Introduction
Machine learning is a branch of artificial intelligence: a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data.
An important task of machine learning is classification.
Classification is also referred to as pattern recognition.
Example
Name     Income    Debt    Married  Age
Shelley  60,000    1,000   No       30
Elad     200,000   0       Yes      80
Dan      0         20,000  No       25
Alona    100,000   10,000  Yes      40

The learning machine maps objects (rows of features) to classes: approve / deny.
Types of learning problems
Supervised learning (n classes, n > 1): classification, regression.
Unsupervised learning (no classes): clustering (building equivalence classes), density estimation.
Supervised learning
Regression: learn a continuous function from input samples. Example: stock prediction.
Input: a future date. Output: the stock price. Training: information on the stock price over the last period.
Classification: learn a separation function from discrete inputs to classes. Example: Optical Character Recognition (OCR).
Input: images of digits. Output: labels 0-9. Training: labeled images of digits.
In fact, these are approximation problems.
Regression
Classification
Density estimation
What makes learning difficult
Given the following examples, how should we draw the line?
What makes learning difficult
Which one is most appropriate?
What makes learning difficult
The hidden test points
What is Learning (mathematically)?
We would like to ensure that small changes in an input point from a learning point will not result in a jump to a different classification; the learned map x_i ↦ y_i should be stable in this sense.
Such an approximation is called a stable approximation. As a rule of thumb, small derivatives ensure a stable approximation.
Stable vs. Unstable approximation
Lagrange approximation (unstable): given n points (x_i, y_i), we find the unique polynomial f(x) = L_{n-1}(x) that passes through the given points.
Spline approximation (stable): given n points (x_i, y_i), we find a piecewise approximation f(x) by third-degree polynomials such that they pass through the given points, have common tangents at the division points, and in addition minimize ∫ |f''(x)|² dx.
What would be the best choice?
The “simplest” solution: one where the distance from each example is as small as possible and the derivative is as small as possible.
Vector Geometry (just in case…)
The dot product of two vectors
a = [a1, a2, a3, ..., an] and b = [b1, b2, b3, ..., bn] is defined as:
a · b = Σ_{i=1}^{n} a_i b_i = a1 b1 + a2 b2 + a3 b3 + ... + an bn
An example:
[1, 3, 5] · [4, −2, 1] = (1)(4) + (3)(−2) + (5)(1) = 3
Dot product
a · a = ||a||², where ||a|| denotes the length (magnitude) of a.
a · b = ||a|| · ||b|| · cos θ
The unit vector in the direction of a is a / ||a||.
If a is perpendicular to b, then a · b = ||a|| · ||b|| · cos 90° = 0.
Plane/Hyperplane
A hyperplane can be defined by: three points; two vectors; or a normal vector and a point.
Plane/Hyperplane
Let n be a vector perpendicular to the hyperplane H.
Let p1 be the position vector of some known point in the plane.
A point P with position vector p is in the plane iff the vector drawn from p1 to p is perpendicular to n.
Two vectors are perpendicular iff their dot product is zero, so H can be expressed as n · (p − p1) = 0.
Substituting w = n and b = −n · p1, the hyperplane H can be expressed as
w · x + b = 0
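A minimal sketch of the normal-and-point construction, with a made-up normal n and point p1 (both hypothetical, for illustration):

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical plane: normal n and a known point p1 on the plane.
n  = [1.0, 1.0, 1.0]   # normal vector (the w of the slide)
p1 = [1.0, 0.0, 0.0]   # known point in the plane
b  = -dot(n, p1)       # from n.(p - p1) = 0  =>  n.p + b = 0

def on_plane(p):
    # p is in the plane iff (p - p1) is perpendicular to n
    return abs(dot(n, p) + b) < 1e-9

print(on_plane([0.0, 1.0, 0.0]))   # True:  n.p = 1, b = -1
print(on_plane([1.0, 1.0, 1.0]))   # False: n.p = 3, b = -1
```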
Classification
Solving approximation problems
First we define the family of approximating functions F.
Next we define the cost function C(f). This function tells how well f ∈ F performs the required approximation.
Having done this, the approximation/classification consists of solving the minimization problem
min_{f ∈ F} C(f)
A first necessary condition (after Fermat) is ∂C/∂f = 0.
As we know, it is always possible to apply Newton-Raphson and get a sequence of approximations.
Classification
A classifier is a function or an algorithm that maps every possible input (from a legal set of inputs) to a finite set of categories.
X is the input space; x ∈ X is a data point from the input space. A typical input space is high-dimensional, for example X = R^d, x = (x1, x2, ..., xd), d ≥ 1.
x is also called a feature vector.
Ω is a finite set of categories to which the input data points belong: Ω = {1, 2, ..., C}. Its elements ω_i are called labels.
Classification
Y is a finite set of decisions: the output set of the classifier.
The classifier is a function f : X → Y, i.e. for every x ∈ X, classify(x) = y = f(x) ∈ Y.
The Perceptron
Perceptron - Frank Rosenblatt (1957)
Linear separation of the input space:
f(x) = w · x + b
h(x) = sign(f(x))
Perceptron algorithm
Start: the weight vector w_0 is generated randomly; set t = 0.
Test: a vector x ∈ P ∪ N is selected randomly:
if x ∈ P and w_t · x > 0, go to Test;
if x ∈ P and w_t · x ≤ 0, go to Add;
if x ∈ N and w_t · x < 0, go to Test;
if x ∈ N and w_t · x ≥ 0, go to Subtract.
Add: w_{t+1} = w_t + x, t = t + 1; go to Test.
Subtract: w_{t+1} = w_t − x, t = t + 1; go to Test.
Perceptron algorithm
Shorter version
Update rule for iteration k + 1 (one iteration per data point):
if y_i (w_k · x_i) ≤ 0 then
w_{k+1} = w_k + y_i x_i
k = k + 1
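The shorter update rule translates almost line-for-line into Python. The toy data set below is made up for illustration, and the bias b is handled by appending a constant 1 to every point:

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def perceptron(points, labels, max_epochs=100):
    """Update rule: if y_i (w . x_i) <= 0, set w <- w + y_i x_i."""
    data = [x + [1.0] for x in points]      # homogeneous coordinates absorb b
    w = [0.0] * len(data[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(data, labels):
            if y * dot(w, x) <= 0:          # misclassified (or on the line)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                   # converged: every point correct
            return w
    return w

# Toy linearly separable set (hypothetical data)
X = [[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]]
y = [1, 1, -1, -1]
w = perceptron(X, y)
print(all(yi * dot(w, xi + [1.0]) > 0 for xi, yi in zip(X, y)))   # True
```

On separable data like this the loop terminates; as the analysis slide notes, it would never terminate on e.g. XOR.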
Perceptron – visualization (intuition)
Perceptron - analysis
The solution is a linear combination of training points:
w = Σ_i α_i y_i x_i, α_i ≥ 0
It only uses informative points (mistake driven).
The coefficient of a point reflects its ‘difficulty’.
The perceptron learning algorithm does not terminate if the learning set is not linearly separable (e.g. XOR).
Support Vector Machines
Advantages of SVM (Vladimir Vapnik, 1979, 1998)
Exhibit good generalization Can implement confidence measures, etc.
Hypothesis has an explicit dependence on the data (via the support vectors)
Learning involves optimization of a convex function (no false minima, unlike NN).
Few parameters required for tuning the learning machine (unlike NN where the architecture/various parameters must be found).
Advantages of SVM
From the perspective of statistical learning theory the motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error.
These generalization bounds have two important features:
Advantages of SVM
The upper bound on the generalization error does not depend on the dimensionality of the space.
The bound is minimized by maximizing the margin, i.e. the minimal distance between the hyperplane separating the two classes and the closest data-points of each class.
Basic scenario - Separable data set
Basic scenario – define margin
In an arbitrary-dimensional space, a separating hyperplane can be written
w · x + b = 0
where w is the normal. The decision function would be
D(x) = sign(w · x + b)
Note that the argument in D(x) is invariant under a rescaling of the form w → λw, b → λb.
Implicitly the scale can be fixed by defining the hyperplanes through the support vectors (canonical hyperplanes):
H1: w · x + b = +1
H2: w · x + b = −1
The task is to select w, b so that the training data can be described as:
w · x_i + b ≥ +1 for y_i = +1
w · x_i + b ≤ −1 for y_i = −1
These can be combined into:
y_i (w · x_i + b) − 1 ≥ 0, ∀i
The margin will be given by the projection of the vector (x1 − x2) onto the normal vector to the hyperplane, i.e. the unit vector w / ||w||.
So the (Euclidean) distance can be formed:
d = |w · (x1 − x2)| / ||w||
Note that x1 lies on H1, i.e. w · x1 + b = +1.
Similarly for x2: w · x2 + b = −1.
Subtracting the two results in
w · (x1 − x2) = 2
The margin can be put as
m = |w · (x1 − x2)| / ||w|| = 2 / ||w||
We can convert the problem to: minimize
J(w) = (1/2) w · w
subject to the constraints:
y_i (w · x_i + b) − 1 ≥ 0, ∀i
J(w) is a quadratic function, thus there is a single global minimum.
Lagrange multipliers
Problem definition: maximize f(x, y) subject to g(x, y) = c.
A new variable λ, called a ‘Lagrange multiplier’, is used to define
L(x, y, λ) = f(x, y) + λ (g(x, y) − c)
Lagrange multipliers - example
Maximize f(x, y) = x + y, subject to x² + y² = 1.
Formally, set
Λ(x, y, λ) = x + y + λ (x² + y² − 1)
Set the derivatives to 0:
∂Λ/∂x = 1 + 2λx = 0
∂Λ/∂y = 1 + 2λy = 0
∂Λ/∂λ = x² + y² − 1 = 0
Combining the first two yields x = y. Substituting into the last gives x = y = ±√2/2. Evaluating the objective function f on these yields
f(√2/2, √2/2) = √2
f(−√2/2, −√2/2) = −√2
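The worked example can be checked numerically in pure Python; the grid search at the end is just a brute-force confirmation that √2 really is the constrained maximum:

```python
import math

# Maximize f(x, y) = x + y subject to g(x, y) = x^2 + y^2 = 1
f = lambda x, y: x + y
g = lambda x, y: x * x + y * y

x = y = math.sqrt(2) / 2
assert abs(g(x, y) - 1) < 1e-12          # the constraint holds
print(f(x, y), math.sqrt(2))             # both equal sqrt(2)

# Stationarity 1 + 2*lam*x = 0 holds with lam = -1/(2x)
lam = -1 / (2 * x)
print(abs(1 + 2 * lam * x) < 1e-12)      # True

# Brute force over the circle confirms sqrt(2) is the maximum
best = max(f(math.cos(2 * math.pi * k / 10000),
             math.sin(2 * math.pi * k / 10000)) for k in range(10000))
print(abs(best - math.sqrt(2)) < 1e-3)   # True
```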
Primal problem: minimize J(w) = (1/2) w · w s.t. y_i (w · x_i + b) − 1 ≥ 0, ∀i.
Introduce Lagrange multipliers α_i ≥ 0 associated with the constraints.
The solution to the primal problem is equivalent to determining the saddle point of the function:
L_p(w, b, α) = (1/2) w · w − Σ_{i=1}^{n} α_i (y_i (w · x_i + b) − 1)
At the saddle point, L_p has a minimum, requiring
∂L_p/∂w = w − Σ_i α_i y_i x_i = 0  ⇒  w = Σ_i α_i y_i x_i
∂L_p/∂b = Σ_i α_i y_i = 0
Primal-Dual
Primal: minimize
L_p = (1/2) w · w − Σ_{i=1}^{n} α_i (y_i (w · x_i + b) − 1)
with respect to w, b, subject to α_i ≥ 0.
Substitute w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0 into L_p.
Dual: maximize
L_d = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i · x_j
with respect to α, subject to α_i ≥ 0, ∀i and Σ_i α_i y_i = 0.
Solving QP using dual problem
Maximize
L_d = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i · x_j
constrained to α_i ≥ 0, ∀i and Σ_i α_i y_i = 0.
We have n new variables α_1, α_2, ..., α_n, one for each data point.
This is a convex quadratic optimization problem, and we run a QP solver to get α and
w = Σ_i α_i y_i x_i
‘b’ can be determined from the optimal α and the Karush-Kuhn-Tucker (KKT) conditions:
α_i [y_i (w · x_i + b) − 1] = 0, ∀i
α_i > 0 (data points residing on the support hyperplanes) implies y_i (w · x_i + b) = 1, so
b = y_i − w · x_i
or, averaging over the N_s support vectors,
b = (1/N_s) Σ_{i ∈ SV} (y_i − w · x_i)
For every data point i, one of the following must hold:
α_i = 0, or
α_i > 0 and y_i (w · x_i + b) − 1 = 0
Many α_i = 0: a sparse solution.
Data points with α_i > 0 are support vectors.
The optimal hyperplane is completely defined by the support vectors:
w = Σ_{α_i > 0} α_i y_i x_i
SVM - The classification
Given a new data point z, find its label y:
D(z) = sign( Σ_{i=1}^{m} α_i y_i x_i · z + b )
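The decision function is a one-liner given the trained values; the α, b and support vectors below are taken from a hand-solved two-point toy problem, not from a real training run:

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def classify(z, alphas, ys, xs, b):
    # D(z) = sign( sum_i alpha_i * y_i * (x_i . z) + b )
    s = sum(a * yi * dot(xi, z) for a, yi, xi in zip(alphas, ys, xs)) + b
    return 1 if s >= 0 else -1

# Hypothetical trained values (two-point toy problem: alpha = 1/4, b = 0)
xs = [[1.0, 1.0], [-1.0, -1.0]]
ys = [1.0, -1.0]
alphas = [0.25, 0.25]
b = 0.0

print(classify([2.0, 3.0], alphas, ys, xs, b))     # 1
print(classify([-1.0, -0.5], alphas, ys, xs, b))   # -1
```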
Extended scenario - Non-Separable data set
Real data is most likely not separable (inconsistencies, outliers, noise), but a linear classifier may still be appropriate.
We can apply SVM in this non-linearly separable case; the data should be almost linearly separable.
SVM with slacks
Use non-negative slack variables ξ_1, ξ_2, ..., ξ_n, one per data point.
Change the constraints from
y_i (w · x_i + b) ≥ 1, ∀i
to
y_i (w · x_i + b) ≥ 1 − ξ_i, ∀i
ξ_i is a measure of deviation from the ideal position for sample i: ξ_i > 1 means the sample is misclassified; 0 < ξ_i ≤ 1 means it is correctly classified but inside the margin.
We would like to minimize
J(w, ξ_1, ..., ξ_n) = (1/2) w · w + C Σ_i ξ_i
constrained to
y_i (w · x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, ∀i
The parameter C is a regularization term, which provides a way to control over-fitting:
if C is small, we allow a lot of samples not in the ideal position;
if C is large, we want to have very few samples not in the ideal position.
SVM with slacks - Dual formulation
Maximize
L_d = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i · x_j
constrained to
0 ≤ α_i ≤ C, ∀i and Σ_{i=1}^{n} α_i y_i = 0
SVM - non linear mapping
Cover’s theorem: “A pattern-classification problem cast in a high-dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space.”
A one-dimensional space that is not linearly separable can be lifted to a two-dimensional space with φ(x) = (x, x²).
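The one-dimensional lift can be sketched directly; the data set below is made up, with the positive class surrounding the negative class so that no single threshold on x works:

```python
# 1D data: the positive class surrounds the negative class,
# so no single threshold on x separates them.
xs     = [-2.0, -1.0, 0.0, 1.0, 2.0]
labels = [ 1,   -1,   -1,  -1,   1 ]

# Lift each point with phi(x) = (x, x^2); in 2D a threshold on the
# second coordinate (here x^2 > 2, an arbitrary choice) separates linearly.
lifted = [(x, x * x) for x in xs]
predict = lambda p: 1 if p[1] > 2.0 else -1
print([predict(p) for p in lifted] == labels)   # True
```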
SVM - non linear mapping
Solve a non-linear classification problem with a linear classifier:
Project the data x to a high dimension using a function φ(x).
Find a linear discriminant function g(x) = w^t φ(x) + w0 for the transformed data.
The final nonlinear discriminant function f(x) lives in the original space.
In 2D, the discriminant function is linear; in 1D, the discriminant function is NOT linear.
SVM - non linear mapping
Any linear classifier can be used after lifting the data into a higher-dimensional space. However, we would then have to deal with the “curse of dimensionality”: poor generalization to test data and expensive computation.
SVM handles the “curse of dimensionality” problem:
Enforcing the largest margin permits good generalization. It can be shown that generalization in SVM is a function of the margin, independent of the dimensionality.
Computation in the higher-dimensional case is performed only implicitly, through the use of kernel functions.
Non linear SVM - kernels
Recall:
Maximize L_d = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i · x_j
Classification: D(z) = sign( Σ_{i=1}^{m} α_i y_i x_i · z + b )
The data points appear only in dot products. If x_i is mapped to a high-dimensional space using φ(x), the high-dimensional product φ(x_i)^t φ(x_j) is needed:
Maximize L_d = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j φ(x_i)^t φ(x_j)
The dimensionality of the space F is not necessarily important. We may not even know the map φ.
Kernel
A function that returns the value of the dot product between the images of the two arguments:
k(x, y) = φ(x) · φ(y)
Given a function k, it is possible to verify that it is a kernel. Now we only need to compute k(x_i, x_j) instead of φ(x_i) · φ(x_j):
Maximize L_d = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j k(x_i, x_j)
The “kernel trick”: we do not need to perform operations in the high-dimensional space explicitly.
Kernel Matrix
The central structure in kernel machines. It contains all necessary information for the learning algorithm, fusing information about the data AND the kernel, and it has many interesting properties.
Mercer’s Theorem
The kernel matrix is symmetric positive definite. Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space.
Every (semi)positive definite, symmetric function is a kernel: i.e. there exists a mapping φ such that it is possible to write
k(x, y) = φ(x) · φ(y)
Positive definite means:
∫∫ k(x, y) f(x) f(y) dx dy ≥ 0, ∀f ∈ L₂
Examples of kernels
Some common choices (both satisfying Mercer’s condition):
Polynomial kernel: k(x_i, x_j) = (x_i · x_j + 1)^p
Gaussian radial basis function (RBF): k(x_i, x_j) = exp( −||x_i − x_j||² / (2σ²) )
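Both kernels are straightforward in plain Python (the parameter values p = 2 and σ = 1, and the test vectors, are arbitrary choices for illustration):

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(xi, xj, p=2):
    # (x_i . x_j + 1)^p
    return (dot(xi, xj) + 1.0) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    # exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / (2.0 * sigma ** 2))

x, z = [1.0, 2.0], [3.0, 0.0]
print(poly_kernel(x, z))                      # (3 + 1)^2 = 16.0
print(rbf_kernel(x, x))                       # 1.0 (zero distance)
print(rbf_kernel(x, z) == rbf_kernel(z, x))   # True (symmetric)
```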
Polynomial Kernel - example
For p = 2 (without the +1 term) and x, z ∈ R²:
(x · z)² = (x1 z1 + x2 z2)²
= x1² z1² + 2 x1 z1 x2 z2 + x2² z2²
= (x1², x2², √2 x1 x2) · (z1², z2², √2 z1 z2)
= φ(x) · φ(z)
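The identity (x · z)² = φ(x) · φ(z) can be confirmed numerically for any pair of made-up vectors:

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(v):
    # The explicit map for the degree-2 kernel (x . z)^2 in R^2
    x1, x2 = v
    return [x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2]

x, z = [1.0, 2.0], [3.0, -1.0]
lhs = dot(x, z) ** 2          # kernel value, computed in R^2
rhs = dot(phi(x), phi(z))     # explicit dot product, computed in R^3
print(abs(lhs - rhs) < 1e-9)  # True
```

The kernel evaluation stays in R², while the equivalent explicit computation needs the lifted R³ vectors; for higher p the gap in cost grows quickly, which is the point of the kernel trick.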
Applying non linear SVM
Start with data x1, x2, ..., xn, which lives in a feature space of dimension n.
Choose a kernel k(x_i, x_j) corresponding to some function φ(x), which takes a data point x_i to a higher-dimensional space.
Find the largest-margin linear discriminant function in the higher-dimensional space by using a quadratic programming package to solve:
Maximize L_d(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j k(x_i, x_j)
constrained to 0 ≤ α_i ≤ C, ∀i and Σ_{i=1}^{n} α_i y_i = 0
Applying non linear SVM
The weight vector w in the high-dimensional space:
w = Σ_{x_i ∈ SV} α_i y_i φ(x_i)
Linear discriminant function of largest margin in the high-dimensional space:
g(φ(x)) = w^t φ(x) + w0 = Σ_{x_i ∈ SV} α_i y_i φ(x_i)^t φ(x) + w0
Non-linear discriminant function in the original space:
g(x) = Σ_{x_i ∈ SV} α_i y_i k(x_i, x) + w0
SVM summary
Advantages:
Based on a nice theory.
Excellent generalization properties.
The objective function has no local minima.
Can be used to find non-linear discriminant functions.
The complexity of the classifier is characterized by the number of support vectors rather than the dimensionality of the transformed space.
Disadvantages:
It is not clear how to select a kernel function in a principled manner.
Tends to be slower than other methods (in the non-linear case).