Part 2: Support Vector Machines
Vladimir Cherkassky
Electrical and Computer Engineering, University of Minnesota
Presented at Tech Tune Ups, ECE Dept, June 1, 2011
SVM: Brief History
• 1963 Margin (Vapnik & Lerner)
• 1964 Margin (Vapnik & Chervonenkis)
• 1964 RBF Kernels (Aizerman)
• 1965 Optimization formulation (Mangasarian)
• 1971 Kernels (Kimeldorf and Wahba)
• 1992-1994 SVMs (Vapnik et al.)
• 1996-present Rapid growth, numerous applications
• 1996-present Extensions to other problems
MOTIVATION for SVM
• Problems with 'conventional' methods:
  - model complexity ~ dimensionality (# features)
  - nonlinear methods → multiple minima
  - hard to control complexity
• SVM solution approach:
  - adaptive loss function (to control complexity independent of dimensionality)
  - flexible nonlinear models
  - tractable optimization formulation
SVM APPROACH
• Linear approximation in Z-space using a special adaptive loss function
• Complexity independent of dimensionality
[Diagram: input x is mapped by g(x) to feature vector z; the linear model w·z produces the estimate ŷ]
OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary
Example: binary classification
• Given: linearly separable data
• How to construct a linear decision boundary?
Linear Discriminant Analysis
[Figure: LDA solution and its separation margin]
Perceptron (linear NN)
• Perceptron solutions and separation margin
[Figure: several perceptron solutions with different separation margins]
Largest-margin solution
• All solutions explain the data well (zero error)
• All solutions ~ the same linear parameterization
• Larger margin ~ more confidence (falsifiability)
[Figure: separating hyperplane with margin of width 2M]
Complexity of Δ-margin hyperplanes
• If data samples belong to a sphere of radius R, then the set of Δ-margin hyperplanes has VC dimension bounded by
  h ≤ min(R²/Δ², d) + 1
• For large-margin hyperplanes, VC-dimension is controlled independent of dimensionality d.
Motivation: philosophical
• Classical view: good model = explains the data + low complexity
  Occam's razor (complexity ~ # parameters)
• VC theory: good model = explains the data + low VC-dimension
  ~ VC-falsifiability: good model = explains the data + has large falsifiability
• The idea: falsifiability ~ empirical loss function
Adaptive loss functions
• Both goals (explanation + falsifiability) can be encoded into an empirical loss function where
  - a (large) portion of the data has zero loss
  - the rest of the data has non-zero loss, i.e. it falsifies the model
• The trade-off (between the two goals) is adaptively controlled → adaptive loss function
• Examples of such loss functions for different learning problems are shown next
Margin-based loss for classification
  L(y, f(x, w)) = max(Δ − y f(x, w), 0)
[Figure: margin-based loss as a function of y f(x, w); zero-loss region of width 2Δ]

Margin-based loss for classification: margin is adapted to training data
  L(y, f(x, w)) = max(Δ − y f(x, w), 0)
[Figure: decision boundary with margin borders separating Class +1 and Class -1]
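As a minimal sketch (not from the slides), the margin-based loss above can be written in a few lines of NumPy; the margin width `delta` (Δ) is a free parameter here:

```python
import numpy as np

def margin_loss(y, f, delta=1.0):
    """Margin-based loss L(y, f(x,w)) = max(delta - y*f, 0).

    Samples with y*f >= delta fall in the zero-loss region;
    the remaining samples falsify the model.
    """
    return np.maximum(delta - y * f, 0.0)

y = np.array([+1, +1, -1, -1])
f = np.array([2.0, 0.5, -3.0, 0.2])  # hypothetical decision-function values f(x, w)
print(margin_loss(y, f))             # -> [0.  0.5 0.  1.2]
```

Note how only the two samples inside the margin (or on the wrong side) get non-zero loss, matching the falsifiability idea above.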
Epsilon loss for regression
  L(y, f(x, w)) = max(|y − f(x, w)| − ε, 0)
Parameter epsilon is adapted to training data.
Example: linear regression y = x + noise, where noise = N(0, 0.36), x ~ [0, 1], 4 samples.
Compare: squared, linear and SVM loss (eps = 0.6)
[Figure: four data samples with fitted regression lines for squared, linear and SVM (epsilon-insensitive) loss; x-axis from 0 to 1, y-axis from -1 to 2]
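The three losses compared on the slide can be sketched as follows; the residuals `r` below are made-up values, and `eps = 0.6` matches the slide's example:

```python
import numpy as np

def squared_loss(r):
    return r ** 2

def linear_loss(r):
    return np.abs(r)

def svm_loss(r, eps=0.6):
    # epsilon-insensitive loss: zero inside the eps-tube
    return np.maximum(np.abs(r) - eps, 0.0)

r = np.array([-1.0, -0.5, 0.3, 1.2])  # hypothetical residuals y - f(x, w)
print(squared_loss(r))  # -> [1.   0.25 0.09 1.44]
print(linear_loss(r))   # -> [1.  0.5 0.3 1.2]
print(svm_loss(r))      # -> [0.4 0.  0.  0.6]
```

The SVM loss assigns zero loss to the two samples inside the epsilon-tube, so they do not influence the fitted model.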
OUTLINE
• Margin-based loss
• SVM for classification
  - Linear SVM classifier
  - Inner product kernels
  - Nonlinear SVM classifier
• SVM examples
• Support vector regression
• Summary
SVM Loss for Classification
The continuous quantity y f(x, w) measures how close a sample x is to the decision boundary.
Optimal Separating Hyperplane
• Distance between the hyperplane and a sample x': |f(x')| / ||w||
• Margin: 1 / ||w||
• Shaded points are SVs
Linear SVM Optimization Formulation (for separable data)
• Given training data (x_i, y_i), i = 1, ..., n
• Find parameters (w, b) of the linear hyperplane f(x) = w·x + b
  that minimize
    (1/2) (w·w)
  under constraints
    y_i (w·x_i + b) ≥ 1,  i = 1, ..., n
• Quadratic optimization with linear constraints
  tractable for moderate dimensions d
• For large dimensions use the dual formulation:
  - scales with sample size (n) rather than d
  - uses only dot products (x_i · x_j)
Classification for non-separable data
• Slack variables ξ_i ~ margin-based loss, based on y_i f(x_i, w):
  L(y, f(x, w)) = max(Δ − y f(x, w), 0)
[Figure: non-separable data with slack variables measured from the margin borders]
SVM for non-separable data
Minimize
  (1/2) ||w||² + C Σ_{i=1..n} ξ_i
under constraints
  y_i (w·x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n
where f(x) = w·x + b
[Figure: margin borders f(x) = +1, f(x) = 0, f(x) = -1; samples x1, x2 inside the margin have ξ1 = 1 − f(x1), ξ2 = 1 − f(x2); misclassified sample x3 has ξ3 = 1 + f(x3)]
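As an illustration (not the authors' code), the soft-margin primal above can be minimized by plain subgradient descent on a toy 2-D problem; the data, learning rate and iteration count are all made-up:

```python
import numpy as np

# Toy 2-D problem: two well-separated Gaussian clusters (assumed data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

# Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))
C, lr = 1.0, 0.01
w, b = np.zeros(2), 0.0
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1                              # samples with non-zero loss
    gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    gb = -C * y[viol].sum()
    w, b = w - lr * gw, b - lr * gb

acc = np.mean(np.sign(X @ w + b) == y)
print(acc)
```

Real SVM solvers use quadratic programming rather than this descent sketch, but the objective being minimized is the same.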
SVM Dual Formulation
• Given training data (x_i, y_i), i = 1, ..., n
• Find parameters (α_i*, b*) of an optimal hyperplane as a solution to the maximization problem
  max L(α) = Σ_{i=1..n} α_i − (1/2) Σ_{i,j=1..n} α_i α_j y_i y_j (x_i · x_j)
under constraints
  Σ_{i=1..n} α_i y_i = 0,  0 ≤ α_i ≤ C
• Solution
  f(x) = Σ_{i=1..n} y_i α_i* (x_i · x) + b*
  where samples with nonzero α_i* are SVs
• Needs only inner products (x_i · x)
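On a tiny toy set the dual can be solved directly with a generic optimizer; this is a sketch using SciPy's SLSQP (real SVM solvers use specialized QP methods such as SMO instead), and the four data points are made-up:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny separable toy set (assumed data): two points per class.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
C, n = 10.0, len(y)
K = X @ X.T                                  # matrix of dot products (x_i . x_j)

def neg_dual(a):                             # maximize L(a) <=> minimize -L(a)
    return -(a.sum() - 0.5 * (a * y) @ K @ (a * y))

res = minimize(neg_dual, np.zeros(n),
               bounds=[(0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                            # SVs: samples with nonzero alpha
b = np.mean(y[sv] - X[sv] @ w)
print(np.sign(X @ w + b))                    # recovers the training labels
```

Note that the data enter only through the dot-product matrix `K`, which is exactly what makes the kernel substitution on the next slides possible.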
Nonlinear Decision Boundary
• Fixed (linear) parameterization is too rigid
• Nonlinear curved margin may yield larger margin (falsifiability) and lower error
Nonlinear Mapping via Kernels
Nonlinear f(x, w) + margin-based loss = SVM
• Nonlinear mapping to feature z-space, i.e. z = g(x)
• Linear in z-space ~ nonlinear in x-space
• BUT z·z' = H(x, x') ~ kernel trick:
  compute the dot product via the kernel analytically
• Example: for x = (x1, x2), the mapping z(x) ~ (1, x1, x2, x1 x2, x1², x2²)
SVM Formulation (with kernels)
• Replacing z·z' = H(x, x') leads to:
• Find parameters (α_i*, b*) of an optimal hyperplane
  D(x) = Σ_{i=1..n} y_i α_i* H(x_i, x) + b*
as a solution to the maximization problem
  max L(α) = Σ_{i=1..n} α_i − (1/2) Σ_{i,j=1..n} α_i α_j y_i y_j H(x_i, x_j)
under constraints
  Σ_{i=1..n} α_i y_i = 0,  0 ≤ α_i ≤ C
• Given: the training data (x_i, y_i), i = 1, ..., n
  an inner product kernel H(x, x')
  regularization parameter C
Examples of Kernels
A kernel H(x, x') is a symmetric function satisfying general math conditions (Mercer's conditions).
Examples of kernels for different mappings x → z:
• Polynomials of degree q:  H(x, x') = (x · x' + 1)^q
• RBF kernel:  H(x, x') = exp(−||x − x'||² / γ²)
• Neural networks:  H(x, x') = tanh(v (x · x') + a)  for given parameters (v, a)
Automatic selection of the number of hidden units (SVs)
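The polynomial and RBF kernels above can be sketched in NumPy, together with a numerical check of Mercer's conditions (a symmetric kernel matrix with non-negative eigenvalues); the sample data are random:

```python
import numpy as np

def poly_kernel(X, Y, q=2):
    # Polynomial kernel H(x, x') = (x.x' + 1)^q
    return (X @ Y.T + 1.0) ** q

def rbf_kernel(X, Y, gamma=1.0):
    # RBF kernel H(x, x') = exp(-||x - x'||^2 / gamma^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / gamma ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
for H in (poly_kernel(X, X), rbf_kernel(X, X)):
    assert np.allclose(H, H.T)                  # symmetric
    assert np.linalg.eigvalsh(H).min() > -1e-9  # positive semi-definite
print("both kernel matrices satisfy Mercer's conditions")
```

The tanh kernel is omitted here because it satisfies Mercer's conditions only for some parameter values (v, a).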
More on Kernels
• The kernel matrix has all info (data + kernel):
  H(1,1) H(1,2) ... H(1,n)
  H(2,1) H(2,2) ... H(2,n)
  ...
  H(n,1) H(n,2) ... H(n,n)
• A kernel defines a distance in some feature space (aka kernel-induced feature space)
• Kernels can incorporate a priori knowledge
• Kernels can be defined over complex structures (trees, sequences, sets, etc.)
Support Vectors
• SVs ~ training samples with non-zero loss
• SVs are samples that falsify the model
• The model depends only on SVs → SVs ~ robust characterization of the data
WSJ, Feb 27, 2004: "About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters."
• SVM generalization ~ data compression
New insights provided by SVM
• Why can linear classifiers generalize?
  (1) Margin is large (relative to R), so h ≤ min(R²/Δ², d) + 1
  (2) % of SVs is small
  (3) ratio d/n is small
• SVM offers an effective way to control complexity (via margin + kernel selection), i.e. implementing (1) or (2) or both
• Requires common-sense parameter tuning
OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary
Ripley's data set
• 250 training samples, 1,000 test samples
• SVM using RBF kernel  H(u, v) = exp(−||u − v||² / γ²)
• Model selection via 10-fold cross-validation
[Figure: Ripley's training data, two classes; x-axis from -1.5 to 1, y-axis from -0.2 to 1.2]
Ripley's data set: SVM model
• Decision boundary and margin borders
• SVs are circled
[Figure: SVM decision boundary with margin borders in (x1, x2) space; support vectors circled]
Ripley's data set: model selection
• SVM tuning parameters: C, γ
• Select optimal parameter values via 10-fold cross-validation
• Results of cross-validation are summarized below:

         C=0.1   C=1     C=10    C=100   C=1000  C=10000
  γ=2^-3 98.4%   23.6%   18.8%   20.4%   18.4%   14.4%
  γ=2^-2 51.6%   22%     20%     20%     16%     14%
  γ=2^-1 33.2%   19.6%   18.8%   15.6%   13.6%   14.8%
  γ=2^0  28%     18%     16.4%   14%     12.8%   15.6%
  γ=2^1  20.8%   16.4%   14%     12.8%   16%     17.2%
  γ=2^2  19.2%   14.4%   13.6%   15.6%   15.6%   16%
  γ=2^3  15.6%   14%     15.6%   16.4%   18.4%   18.4%
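The grid search over (C, γ) with 10-fold cross-validation can be sketched with scikit-learn; since Ripley's data set is not bundled with common libraries, the snippet below substitutes synthetic two-class data with a similar flavor:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for Ripley's data: two overlapping Gaussian classes.
rng = np.random.default_rng(0)
n = 250
X = np.vstack([rng.normal([-0.3, 0.4], 0.25, (n // 2, 2)),
               rng.normal([0.4, 0.5], 0.25, (n // 2, 2))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

# Same grid shape as the slide's table: six C values, seven gamma values.
grid = {"C": [0.1, 1, 10, 100, 1000, 10000],
        "gamma": [2.0 ** k for k in range(-3, 4)]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=10).fit(X, y)
print(search.best_params_, round(1 - search.best_score_, 3))  # params, CV error
```

Each cell of the slide's table corresponds to one (C, γ) pair's cross-validation error; the selected model is the pair with the smallest error.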
Noisy Hyperbolas data set
• This example shows application of different kernels
• Note: decision boundaries are quite different
[Figure: two panels showing decision boundaries on the same data, for the RBF kernel (left) and the polynomial kernel (right)]
Many challenging applications
• Mimic human recognition capabilities
  - high-dimensional data
  - content-based
  - context-dependent
• Example: read the sentence
  Sceitnitss osbevred: it is nt inptrant how lteters are msspled isnide the word. It is ipmoratnt that the fisrt and lsat letetrs do not chngae, tehn the txet is itneprted corrcetly
• SVM is suitable for sparse high-dimensional data
Example SVM Applications
• Handwritten digit recognition
• Genomics
• Face detection in unrestricted images
• Text/ document classification
• Image classification and retrieval
• …….
Handwritten Digit Recognition (mid-90's)
• Data set: postal images (zip-code), segmented, cropped; ~7K training samples and 2K test samples
• Data encoding: 16x16 pixel image → 256-dim. vector
• Original motivation: compare SVM with a custom MLP network (LeNet) designed for this application
• Multi-class problem: one-vs-all approach → 10 SVM classifiers (one per digit)
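The one-vs-all scheme can be sketched with scikit-learn; the postal data set is not publicly bundled, so the snippet uses sklearn's small 8x8 digits set as a stand-in for the 16x16 postal images:

```python
from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# One-vs-all: train 10 binary SVM classifiers, one per digit.
X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
ovr = OneVsRestClassifier(LinearSVC(max_iter=5000)).fit(Xtr, ytr)
print(len(ovr.estimators_), ovr.score(Xte, yte))  # 10 binary SVMs, test accuracy
```

At prediction time the class whose binary classifier produces the largest decision-function value wins.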
Digit Recognition Results
• Summary
  - prediction accuracy better than custom NNs
  - accuracy does not depend on the kernel type
  - 100-400 support vectors per class (digit)
• More details:
  Type of kernel    No. of Support Vectors   Error %
  Polynomial        274                      4.0
  RBF               291                      4.1
  Neural Network    254                      4.2
• ~80-90% of SVs coincide (for different kernels)
Document Classification (Joachims, 1998)
• The problem: classification of text documents in large databases, for text indexing and retrieval
• Traditional approach: human categorization (i.e. via feature selection), which relies on a good indexing scheme. This is time-consuming and costly
• Predictive learning approach (SVM): construct a classifier using all possible features (words)
• Document/text representation: individual words = input features (possibly weighted)
• SVM performance:
  - very promising (~90% accuracy vs 80% by other classifiers)
  - most problems are linearly separable → use linear SVM
OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary
Linear SVM regression
Assume linear parameterization f(x, w) = w·x + b
  L(y, f(x, w)) = max(|y − f(x, w)| − ε, 0)
[Figure: epsilon-insensitive tube of width 2ε around the regression line in (x, y) space, with slack variables ξ, ξ* for points outside the tube]
Direct Optimization Formulation
Given training data (x_i, y_i), i = 1, ..., n
Minimize
  (1/2) (w·w) + C Σ_{i=1..n} (ξ_i + ξ_i*)
under constraints
  y_i − (w·x_i + b) ≤ ε + ξ_i
  (w·x_i + b) − y_i ≤ ε + ξ_i*
  ξ_i, ξ_i* ≥ 0,  i = 1, ..., n
[Figure: epsilon-insensitive tube with slack variables ξ, ξ* in (x, y) space]
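Epsilon-SVR in practice can be sketched with scikit-learn's `SVR`; the data below are synthetic (a noisy sine, not the slide's example), and the parameter values are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1-D regression data (assumed): with a wide enough epsilon-tube,
# only a small fraction of the samples become support vectors.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40))[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, 40)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.2, gamma=10.0).fit(X, y)
print(len(svr.support_), "support vectors out of", len(X))
```

Samples strictly inside the epsilon-tube get zero loss and do not appear in the model, which is the sparseness illustrated on the next slides.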
Example: SVM regression using RBF kernel
• SVM estimate is shown as a dashed line
• SVM model uses only 5 SVs (out of the 40 points)
[Figure: 40 data points with the SVM regression estimate (dashed line); x-axis from 0 to 1, y-axis from -0.2 to 1]
RBF regression model
Weighted sum of 5 RBF kernels gives the SVM model
  f(x) = Σ_{j=1..m} w_j exp(−(x − c_j)² / (2σ²)),  m = 5
[Figure: the individual weighted RBF kernels and their sum f(x); x-axis from 0 to 1, y-axis from -4 to 4]
Summary
• Margin-based loss: robust + performs complexity control
• Nonlinear feature selection (~ SVs): performed automatically
• Tractable model selection (easier than most nonlinear methods)
• SVM is not a magic bullet solution
  - similar to other methods when n >> h
  - SVM is better when n << h or n ~ h