Neurocomputation Seminar. A tutorial about SVM. Omer Boehm, [email protected]. © 2011 IBM Corporation
IBM Haifa Labs
Outline
Introduction
Classification
Perceptron
SVM for linearly separable data
SVM for almost linearly separable data
SVM for non-linearly separable data
Introduction
Machine learning is a branch of artificial intelligence: a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data.
An important task of machine learning is classification.
Classification is also referred to as pattern recognition.
Example
Name     Income    Debt    Married  Age
Shelley  60,000    1,000   No       30
Elad     200,000   0       Yes      80
Dan      0         20,000  No       25
Alona    100,000   10,000  Yes      40

The learning machine maps objects (rows of features) to classes: approve / deny.
Types of learning problems
Supervised learning (n classes, n > 1): classification, regression.
Unsupervised learning (no classes): clustering (building equivalence classes), density estimation.
Supervised learning
Regression: learn a continuous function from input samples. Example: stock prediction.
Input: a future date. Output: the stock price. Training: information on the stock price over the last period.
Classification: learn a separation function from discrete inputs to classes. Example: Optical Character Recognition (OCR).
Input: images of digits. Output: labels 0-9. Training: labeled images of digits.
In fact, these are approximation problems.
Regression
Classification
Density estimation
What makes learning difficult
Given the following examples, how should we draw the line?
What makes learning difficult
Which one is most appropriate?
What makes learning difficult
The hidden test points
What is Learning (mathematically)?
We would like to ensure that small changes in an input point from a learning point will not result in a jump to a different classification; the learned map x_i ↦ y_i should be stable in this sense.
Such an approximation is called a stable approximation. As a rule of thumb, small derivatives ensure a stable approximation.
Stable vs. Unstable approximation
Lagrange approximation (unstable): given n points (x_i, y_i), we find the unique polynomial f(x) = L_{n-1}(x) that passes through the given points.
Spline approximation (stable): given n points (x_i, y_i), we find a piecewise approximation f(x) by third-degree polynomials such that they pass through the given points, have common tangents at the division points, and in addition minimize ∫ |f''(x)|² dx.
What would be the best choice?
The “simplest” solution: one where the distance from each example is as small as possible and the derivative is as small as possible.
Vector Geometry (just in case…)
The dot product of two vectors
a = [a1, a2, a3, ..., an] and b = [b1, b2, b3, ..., bn] is defined as:
a · b = Σ_{i=1}^{n} a_i b_i = a1 b1 + a2 b2 + a3 b3 + ... + an bn
An example:
[1, 3, 5] · [4, −2, 1] = (1)(4) + (3)(−2) + (5)(1) = 3
Dot product
a · a = ||a||², where ||a|| denotes the length (magnitude) of a.
a · b = ||a|| · ||b|| · cos θ
The unit vector in the direction of a is a / ||a||.
If a is perpendicular to b, then a · b = ||a|| · ||b|| · cos 90° = 0.
Plane/Hyperplane
A hyperplane can be defined by: three points; two vectors; or a normal vector and a point.
Plane/Hyperplane
Let n be a vector perpendicular to the hyperplane H.
Let p1 be the position vector of some known point in the plane.
A point P with position vector p is in the plane iff the vector drawn from p1 to p is perpendicular to n.
Two vectors are perpendicular iff their dot product is zero, so H can be expressed as n · (p − p1) = 0.
Substituting w = n and b = −n · p1, the hyperplane H can be expressed as
w · x + b = 0
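A minimal sketch of the normal-and-point construction, with a made-up normal n and point p1 (both hypothetical, for illustration):

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical plane: normal n and a known point p1 on the plane.
n  = [1.0, 1.0, 1.0]   # normal vector (the w of the slide)
p1 = [1.0, 0.0, 0.0]   # known point in the plane
b  = -dot(n, p1)       # from n.(p - p1) = 0  =>  n.p + b = 0

def on_plane(p):
    # p is in the plane iff (p - p1) is perpendicular to n
    return abs(dot(n, p) + b) < 1e-9

print(on_plane([0.0, 1.0, 0.0]))   # True:  n.p = 1, b = -1
print(on_plane([1.0, 1.0, 1.0]))   # False: n.p = 3, b = -1
```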
Classification
Solving approximation problems
First we define the family of approximating functions F.
Next we define the cost function C(f). This function tells how well f ∈ F performs the required approximation.
Having done this, the approximation/classification consists of solving the minimization problem
min_{f ∈ F} C(f)
A first necessary condition (after Fermat) is ∂C/∂f = 0.
As we know, it is always possible to apply Newton-Raphson and get a sequence of approximations.
Classification
A classifier is a function or an algorithm that maps every possible input (from a legal set of inputs) to a finite set of categories.
X is the input space; x ∈ X is a data point from the input space. A typical input space is high-dimensional, for example X = R^d, x = (x1, x2, ..., xd), d ≥ 1.
x is also called a feature vector.
Ω is a finite set of categories to which the input data points belong: Ω = {1, 2, ..., C}. Its elements ω_i are called labels.
Classification
Y is a finite set of decisions: the output set of the classifier.
The classifier is a function f : X → Y, i.e. for every x ∈ X, classify(x) = y = f(x) ∈ Y.
The Perceptron
Perceptron - Frank Rosenblatt (1957)
Linear separation of the input space:
f(x) = w · x + b
h(x) = sign(f(x))
Perceptron algorithm
Start: the weight vector w_0 is generated randomly; set t = 0.
Test: a vector x ∈ P ∪ N is selected randomly:
if x ∈ P and w_t · x > 0, go to Test;
if x ∈ P and w_t · x ≤ 0, go to Add;
if x ∈ N and w_t · x < 0, go to Test;
if x ∈ N and w_t · x ≥ 0, go to Subtract.
Add: w_{t+1} = w_t + x, t = t + 1; go to Test.
Subtract: w_{t+1} = w_t − x, t = t + 1; go to Test.
Perceptron algorithm
Shorter version
Update rule for iteration k + 1 (one iteration per data point):
if y_i (w_k · x_i) ≤ 0 then
w_{k+1} = w_k + y_i x_i
k = k + 1
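The shorter update rule translates almost line-for-line into Python. The toy data set below is made up for illustration, and the bias b is handled by appending a constant 1 to every point:

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def perceptron(points, labels, max_epochs=100):
    """Update rule: if y_i (w . x_i) <= 0, set w <- w + y_i x_i."""
    data = [x + [1.0] for x in points]      # homogeneous coordinates absorb b
    w = [0.0] * len(data[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(data, labels):
            if y * dot(w, x) <= 0:          # misclassified (or on the line)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                   # converged: every point correct
            return w
    return w

# Toy linearly separable set (hypothetical data)
X = [[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]]
y = [1, 1, -1, -1]
w = perceptron(X, y)
print(all(yi * dot(w, xi + [1.0]) > 0 for xi, yi in zip(X, y)))   # True
```

On separable data like this the loop terminates; as the analysis slide notes, it would never terminate on e.g. XOR.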
Perceptron – visualization (intuition)
Perceptron - analysis
The solution is a linear combination of training points:
w = Σ_i α_i y_i x_i, α_i ≥ 0
It only uses informative points (mistake driven).
The coefficient of a point reflects its ‘difficulty’.
The perceptron learning algorithm does not terminate if the learning set is not linearly separable (e.g. XOR).
Support Vector Machines
Advantages of SVM (Vladimir Vapnik, 1979, 1998)
Exhibit good generalization Can implement confidence measures, etc.
Hypothesis has an explicit dependence on the data (via the support vectors)
Learning involves optimization of a convex function (no false minima, unlike NN).
Few parameters required for tuning the learning machine (unlike NN where the architecture/various parameters must be found).
Advantages of SVM
From the perspective of statistical learning theory the motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error.
These generalization bounds have two important features:
Advantages of SVM
The upper bound on the generalization error does not depend on the dimensionality of the space.
The bound is minimized by maximizing the margin, i.e. the minimal distance between the hyperplane separating the two classes and the closest data-points of each class.
Basic scenario - Separable data set
Basic scenario – define margin
In an arbitrary-dimensional space, a separating hyperplane can be written
w · x + b = 0
where w is the normal. The decision function would be
D(x) = sign(w · x + b)
Note that the argument in D(x) is invariant under a rescaling of the form w → λw, b → λb.
Implicitly the scale can be fixed by defining the hyperplanes through the support vectors (canonical hyperplanes):
H1: w · x + b = +1
H2: w · x + b = −1
The task is to select w, b so that the training data can be described as:
w · x_i + b ≥ +1 for y_i = +1
w · x_i + b ≤ −1 for y_i = −1
These can be combined into:
y_i (w · x_i + b) − 1 ≥ 0, ∀i
The margin will be given by the projection of the vector (x1 − x2) onto the normal vector to the hyperplane, i.e. the unit vector w / ||w||.
So the (Euclidean) distance can be formed:
d = |w · (x1 − x2)| / ||w||
Note that x1 lies on H1, i.e. w · x1 + b = +1.
Similarly for x2: w · x2 + b = −1.
Subtracting the two results in
w · (x1 − x2) = 2
The margin can be put as
m = |w · (x1 − x2)| / ||w|| = 2 / ||w||
We can convert the problem to: minimize
J(w) = (1/2) w · w
subject to the constraints:
y_i (w · x_i + b) − 1 ≥ 0, ∀i
J(w) is a quadratic function, thus there is a single global minimum.
Lagrange multipliers
Problem definition: maximize f(x, y) subject to g(x, y) = c.
A new variable λ, called a ‘Lagrange multiplier’, is used to define
L(x, y, λ) = f(x, y) + λ (g(x, y) − c)
Lagrange multipliers - example
Maximize f(x, y) = x + y, subject to x² + y² = 1.
Formally, set
Λ(x, y, λ) = x + y + λ (x² + y² − 1)
Set the derivatives to 0:
∂Λ/∂x = 1 + 2λx = 0
∂Λ/∂y = 1 + 2λy = 0
∂Λ/∂λ = x² + y² − 1 = 0
Combining the first two yields x = y. Substituting into the last gives x = y = ±√2/2. Evaluating the objective function f on these yields
f(√2/2, √2/2) = √2
f(−√2/2, −√2/2) = −√2
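The worked example can be checked numerically in pure Python; the grid search at the end is just a brute-force confirmation that √2 really is the constrained maximum:

```python
import math

# Maximize f(x, y) = x + y subject to g(x, y) = x^2 + y^2 = 1
f = lambda x, y: x + y
g = lambda x, y: x * x + y * y

x = y = math.sqrt(2) / 2
assert abs(g(x, y) - 1) < 1e-12          # the constraint holds
print(f(x, y), math.sqrt(2))             # both equal sqrt(2)

# Stationarity 1 + 2*lam*x = 0 holds with lam = -1/(2x)
lam = -1 / (2 * x)
print(abs(1 + 2 * lam * x) < 1e-12)      # True

# Brute force over the circle confirms sqrt(2) is the maximum
best = max(f(math.cos(2 * math.pi * k / 10000),
             math.sin(2 * math.pi * k / 10000)) for k in range(10000))
print(abs(best - math.sqrt(2)) < 1e-3)   # True
```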
Primal problem: minimize J(w) = (1/2) w · w s.t. y_i (w · x_i + b) − 1 ≥ 0, ∀i.
Introduce Lagrange multipliers α_i ≥ 0 associated with the constraints.
The solution to the primal problem is equivalent to determining the saddle point of the function:
L_p(w, b, α) = (1/2) w · w − Σ_{i=1}^{n} α_i (y_i (w · x_i + b) − 1)
At the saddle point, L_p has a minimum, requiring
∂L_p/∂w = w − Σ_i α_i y_i x_i = 0  ⇒  w = Σ_i α_i y_i x_i
∂L_p/∂b = Σ_i α_i y_i = 0
Primal-Dual
Primal: minimize
L_p = (1/2) w · w − Σ_{i=1}^{n} α_i (y_i (w · x_i + b) − 1)
with respect to w, b, subject to α_i ≥ 0.
Substitute w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0 into L_p.
Dual: maximize
L_d = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i · x_j
with respect to α, subject to α_i ≥ 0, ∀i and Σ_i α_i y_i = 0.
Solving QP using dual problem
Maximize
L_d = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i · x_j
constrained to α_i ≥ 0, ∀i and Σ_i α_i y_i = 0.
We have n new variables α_1, α_2, ..., α_n, one for each data point.
This is a convex quadratic optimization problem, and we run a QP solver to get α and
w = Σ_i α_i y_i x_i
‘b’ can be determined from the optimal α and the Karush-Kuhn-Tucker (KKT) conditions:
α_i [y_i (w · x_i + b) − 1] = 0, ∀i
α_i > 0 (data points residing on the support hyperplanes) implies y_i (w · x_i + b) = 1, so
b = y_i − w · x_i
or, averaging over the N_s support vectors,
b = (1/N_s) Σ_{i ∈ SV} (y_i − w · x_i)
For every data point i, one of the following must hold:
α_i = 0, or
α_i > 0 and y_i (w · x_i + b) − 1 = 0
Many α_i = 0: a sparse solution.
Data points with α_i > 0 are support vectors.
The optimal hyperplane is completely defined by the support vectors:
w = Σ_{α_i > 0} α_i y_i x_i
SVM - The classification
Given a new data point z, find its label y:
D(z) = sign( Σ_{i=1}^{m} α_i y_i x_i · z + b )
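The decision function is a one-liner given the trained values; the α, b and support vectors below are taken from a hand-solved two-point toy problem, not from a real training run:

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def classify(z, alphas, ys, xs, b):
    # D(z) = sign( sum_i alpha_i * y_i * (x_i . z) + b )
    s = sum(a * yi * dot(xi, z) for a, yi, xi in zip(alphas, ys, xs)) + b
    return 1 if s >= 0 else -1

# Hypothetical trained values (two-point toy problem: alpha = 1/4, b = 0)
xs = [[1.0, 1.0], [-1.0, -1.0]]
ys = [1.0, -1.0]
alphas = [0.25, 0.25]
b = 0.0

print(classify([2.0, 3.0], alphas, ys, xs, b))     # 1
print(classify([-1.0, -0.5], alphas, ys, xs, b))   # -1
```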
Extended scenario - Non-Separable data set
Real data is most likely not separable (inconsistencies, outliers, noise), but a linear classifier may still be appropriate.
We can apply SVM in this non-linearly separable case; the data should be almost linearly separable.
SVM with slacks
Use non-negative slack variables ξ_1, ξ_2, ..., ξ_n, one per data point.
Change the constraints from
y_i (w · x_i + b) ≥ 1, ∀i
to
y_i (w · x_i + b) ≥ 1 − ξ_i, ∀i
ξ_i is a measure of deviation from the ideal position for sample i: ξ_i > 1 means the sample is misclassified; 0 < ξ_i ≤ 1 means it is correctly classified but inside the margin.
We would like to minimize
J(w, ξ_1, ..., ξ_n) = (1/2) w · w + C Σ_i ξ_i
constrained to
y_i (w · x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, ∀i
The parameter C is a regularization term, which provides a way to control over-fitting:
if C is small, we allow a lot of samples not in the ideal position;
if C is large, we want to have very few samples not in the ideal position.
SVM with slacks - Dual formulation
Maximize
L_d = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i · x_j
constrained to
0 ≤ α_i ≤ C, ∀i and Σ_{i=1}^{n} α_i y_i = 0
SVM - non linear mapping
Cover’s theorem: “A pattern-classification problem cast in a high-dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space.”
A one-dimensional space that is not linearly separable can be lifted to a two-dimensional space with φ(x) = (x, x²).
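The one-dimensional lift can be sketched directly; the data set below is made up, with the positive class surrounding the negative class so that no single threshold on x works:

```python
# 1D data: the positive class surrounds the negative class,
# so no single threshold on x separates them.
xs     = [-2.0, -1.0, 0.0, 1.0, 2.0]
labels = [ 1,   -1,   -1,  -1,   1 ]

# Lift each point with phi(x) = (x, x^2); in 2D a threshold on the
# second coordinate (here x^2 > 2, an arbitrary choice) separates linearly.
lifted = [(x, x * x) for x in xs]
predict = lambda p: 1 if p[1] > 2.0 else -1
print([predict(p) for p in lifted] == labels)   # True
```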
SVM - non linear mapping
Solve a non-linear classification problem with a linear classifier:
Project the data x to a high dimension using a function φ(x).
Find a linear discriminant function g(x) = w^t φ(x) + w0 for the transformed data.
The final nonlinear discriminant function f(x) lives in the original space.
In 2D, the discriminant function is linear; in 1D, the discriminant function is NOT linear.
SVM - non linear mapping
Any linear classifier can be used after lifting the data into a higher-dimensional space. However, we would then have to deal with the “curse of dimensionality”: poor generalization to test data and expensive computation.
SVM handles the “curse of dimensionality” problem:
Enforcing the largest margin permits good generalization. It can be shown that generalization in SVM is a function of the margin, independent of the dimensionality.
Computation in the higher-dimensional case is performed only implicitly, through the use of kernel functions.
Non linear SVM - kernels
Recall:
Maximize L_d = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i · x_j
Classification: D(z) = sign( Σ_{i=1}^{m} α_i y_i x_i · z + b )
The data points appear only in dot products. If x_i is mapped to a high-dimensional space using φ(x), the high-dimensional product φ(x_i)^t φ(x_j) is needed:
Maximize L_d = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j φ(x_i)^t φ(x_j)
The dimensionality of the space F is not necessarily important. We may not even know the map φ.
Kernel
A function that returns the value of the dot product between the images of the two arguments:
k(x, y) = φ(x) · φ(y)
Given a function k, it is possible to verify that it is a kernel. Now we only need to compute k(x_i, x_j) instead of φ(x_i) · φ(x_j):
Maximize L_d = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j k(x_i, x_j)
The “kernel trick”: we do not need to perform operations in the high-dimensional space explicitly.
Kernel Matrix
The central structure in kernel machines. It contains all necessary information for the learning algorithm, fusing information about the data AND the kernel, and it has many interesting properties.
Mercer’s Theorem
The kernel matrix is symmetric positive definite. Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space.
Every (semi)positive definite, symmetric function is a kernel: i.e. there exists a mapping φ such that it is possible to write
k(x, y) = φ(x) · φ(y)
Positive definite means:
∫∫ k(x, y) f(x) f(y) dx dy ≥ 0, ∀f ∈ L₂
Examples of kernels
Some common choices (both satisfying Mercer’s condition):
Polynomial kernel: k(x_i, x_j) = (x_i · x_j + 1)^p
Gaussian radial basis function (RBF): k(x_i, x_j) = exp( −||x_i − x_j||² / (2σ²) )
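Both kernels are straightforward in plain Python (the parameter values p = 2 and σ = 1, and the test vectors, are arbitrary choices for illustration):

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(xi, xj, p=2):
    # (x_i . x_j + 1)^p
    return (dot(xi, xj) + 1.0) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    # exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / (2.0 * sigma ** 2))

x, z = [1.0, 2.0], [3.0, 0.0]
print(poly_kernel(x, z))                      # (3 + 1)^2 = 16.0
print(rbf_kernel(x, x))                       # 1.0 (zero distance)
print(rbf_kernel(x, z) == rbf_kernel(z, x))   # True (symmetric)
```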
Polynomial Kernel - example
For p = 2 (without the +1 term) and x, z ∈ R²:
(x · z)² = (x1 z1 + x2 z2)²
= x1² z1² + 2 x1 z1 x2 z2 + x2² z2²
= (x1², x2², √2 x1 x2) · (z1², z2², √2 z1 z2)
= φ(x) · φ(z)
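The identity (x · z)² = φ(x) · φ(z) can be confirmed numerically for any pair of made-up vectors:

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(v):
    # The explicit map for the degree-2 kernel (x . z)^2 in R^2
    x1, x2 = v
    return [x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2]

x, z = [1.0, 2.0], [3.0, -1.0]
lhs = dot(x, z) ** 2          # kernel value, computed in R^2
rhs = dot(phi(x), phi(z))     # explicit dot product, computed in R^3
print(abs(lhs - rhs) < 1e-9)  # True
```

The kernel evaluation stays in R², while the equivalent explicit computation needs the lifted R³ vectors; for higher p the gap in cost grows quickly, which is the point of the kernel trick.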
Applying non linear SVM
Start with data x1, x2, ..., xn, which lives in a feature space of dimension n.
Choose a kernel k(x_i, x_j) corresponding to some function φ(x), which takes a data point x_i to a higher-dimensional space.
Find the largest-margin linear discriminant function in the higher-dimensional space by using a quadratic programming package to solve:
Maximize L_d(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j k(x_i, x_j)
constrained to 0 ≤ α_i ≤ C, ∀i and Σ_{i=1}^{n} α_i y_i = 0
Applying non linear SVM
The weight vector w in the high-dimensional space:
w = Σ_{x_i ∈ SV} α_i y_i φ(x_i)
Linear discriminant function of largest margin in the high-dimensional space:
g(φ(x)) = w^t φ(x) + w0 = Σ_{x_i ∈ SV} α_i y_i φ(x_i)^t φ(x) + w0
Non-linear discriminant function in the original space:
g(x) = Σ_{x_i ∈ SV} α_i y_i k(x_i, x) + w0
SVM summary
Advantages:
Based on a nice theory.
Excellent generalization properties.
The objective function has no local minima.
Can be used to find non-linear discriminant functions.
The complexity of the classifier is characterized by the number of support vectors rather than the dimensionality of the transformed space.
Disadvantages:
It is not clear how to select a kernel function in a principled manner.
Tends to be slower than other methods (in the non-linear case).