Introduction to Predictive Learning


Page 1: Introduction to  Predictive Learning

1

Introduction to Predictive Learning

Electrical and Computer Engineering

LECTURE SET 7

Support Vector Machines

Page 2: Introduction to  Predictive Learning

2

OUTLINE
• Objectives
  - explain motivation for SVM
  - describe basic SVM for classification & regression
  - compare SVM vs. statistical & NN methods
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Page 3: Introduction to  Predictive Learning

3

MOTIVATION for SVM
• Recall ‘conventional’ methods:
  - model complexity ~ dimensionality
  - nonlinear methods → multiple local minima
  - hard to control complexity
• ‘Good’ learning method:
  (a) tractable optimization formulation
  (b) tractable complexity control (1-2 parameters)
  (c) flexible nonlinear parameterization
• Properties (a), (b) hold for linear methods
• SVM solution approach (next slide)

Page 4: Introduction to  Predictive Learning

4

SVM APPROACH
• Linear approximation in Z-space using a special adaptive loss function
• Complexity independent of dimensionality

[Diagram: x → g(x) = z,  ŷ = (w · z)]

Page 5: Introduction to  Predictive Learning

5

Motivation for Nonlinear Methods
1. A nonlinear learning algorithm is proposed using ‘reasonable’ heuristic arguments
   (reasonable ~ statistical or biological)
2. Empirical validation + improvement
3. Statistical explanation (why it really works)
Examples: statistical, neural network methods.
In contrast, SVM methods were originally proposed under the VC theoretic framework.

Page 6: Introduction to  Predictive Learning

6

OUTLINE
• Objectives
• Motivation for margin-based loss
  - Loss functions for regression
  - Loss functions for classification
  - Philosophical interpretation
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Page 7: Introduction to  Predictive Learning

7

Main Idea
• Model complexity is controlled by a special loss function used for fitting the training data
• Such empirical loss functions may be different from the loss functions used in the learning problem setting
• Such loss functions are adaptive, i.e. they can adapt their complexity to a particular data set
• Different loss functions for different learning problems (classification, regression, etc.)
• Model complexity (VC-dimension) is controlled independently of the number of features

Page 8: Introduction to  Predictive Learning

8

Robust Loss Function for Regression
• Squared loss ~ motivated by
  - large sample settings
  - parametric assumptions
  - Gaussian noise
• For practical settings it is better to use linear loss

Page 9: Introduction to  Predictive Learning

9

Epsilon-insensitive Loss for Regression

  L_ε(y, f(x, ω)) = max(|y − f(x, ω)| − ε, 0)

• Can also control model complexity (~ falsifiability)

Page 10: Introduction to  Predictive Learning

10

Empirical Comparison
• Univariate regression: y = x + noise, x ∈ [0, 1], noise ~ N(0, 0.36)
• Squared, linear and SVM loss (with ε = 0.6)

[Figure: training samples and fitted estimates on x ∈ [0, 1].
 Red ~ target function, dotted ~ estimate using squared loss,
 dashed ~ linear loss, dash-dotted ~ SVM loss]

Page 11: Introduction to  Predictive Learning

11

Empirical Comparison (cont’d)
• Univariate regression: y = x + noise, x ∈ [0, 1], noise ~ N(0, 0.36)
• Squared, linear and SVM loss (with ε = 0.6)
• Test error (MSE) estimated for 5 independent realizations of training data (4 training samples each):

  Realization   Squared loss   Least-modulus loss   SVM loss (ε = 0.6)
  1             0.024          0.134                0.067
  2             0.128          0.075                0.063
  3             0.920          0.274                0.041
  4             0.035          0.053                0.032
  5             0.111          0.027                0.005
  Mean          0.244          0.113                0.042
  St. Dev.      0.381          0.099                0.025
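To make the comparison concrete, here is a minimal sketch of my own (not from the lecture, and assuming scikit-learn is available) that refits the toy problem y = x + N(0, 0.36) with 4 training samples using squared loss and the ε-insensitive SVM loss; the least-modulus column would need an L1 regression solver and is omitted.

```python
# Sketch: squared loss vs. epsilon-insensitive SVM loss on y = x + N(0, 0.36).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, size=(4, 1))                 # 4 training samples
y_train = x_train.ravel() + rng.normal(0, 0.6, size=4)   # std 0.6 -> variance 0.36

ols = LinearRegression().fit(x_train, y_train)                         # squared loss
svm = SVR(kernel="linear", epsilon=0.6, C=1e3).fit(x_train, y_train)   # SVM loss

x_test = np.linspace(0, 1, 200).reshape(-1, 1)   # noise-free test grid
target = x_test.ravel()                          # true target function y = x
print("MSE (squared loss):     ", np.mean((ols.predict(x_test) - target) ** 2))
print("MSE (SVM loss, eps=0.6):", np.mean((svm.predict(x_test) - target) ** 2))
```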

Page 12: Introduction to  Predictive Learning

12

Loss Functions for Classification

• Decision rule: D(x) = sign(f(x, ω))
• The quantity y·f(x, ω) is analogous to residuals in regression
• Common loss functions: 0/1 loss and linear loss
• Properties of a good loss function? ~ continuous + convex + robust

Page 13: Introduction to  Predictive Learning

13

Motivation for margin-based loss (1)

• Given: linearly separable data
  How to construct a linear decision boundary?

(a) Many linear decision boundaries (that have no errors)

Page 14: Introduction to  Predictive Learning

14

Motivation for margin-based loss (2)

• Given: linearly separable data
  Which linear decision boundary is better?

The model with larger margin is more robust for future data

Page 15: Introduction to  Predictive Learning

15

Largest-margin solution
• All solutions explain the data well (zero error)
• All solutions ~ the same linear parameterization
• Larger margin ~ more confidence (larger falsifiability)

[Figure: largest-margin separating hyperplane]

Page 16: Introduction to  Predictive Learning

16

Margin-based loss for classification

  L(y, f(x, ω)) = max(Δ − y·f(x, ω), 0)

• Slack variables ξ_i measure the margin violations of y_i·f(x_i, ω)
• SVM loss, or hinge loss ~ minimization of slack variables

[Figure: margin borders for class y = +1 and class y = −1]

Page 17: Introduction to  Predictive Learning

17

Margin-based loss for classification: the margin size is adapted to the training data

  L(y, f(x, ω)) = max(Δ − y·f(x, ω), 0)

[Figure: margin region between class +1 and class −1]

Page 18: Introduction to  Predictive Learning

18

Motivation: philosophical
• Classical view: a good model explains the data + has low complexity
  Occam’s razor (complexity ~ # parameters)
• VC theory: a good model explains the data + has low VC-dimension
  VC-falsifiability (small VC-dim ~ large falsifiability),
  i.e. the goal is to find a model that can explain the training data but cannot explain other data
• The idea: falsifiability ~ empirical loss function

Page 19: Introduction to  Predictive Learning

19

Adaptive Loss Functions
• Both goals (explanation + falsifiability) can be encoded into an empirical loss function where
  - a (large) portion of the data has zero loss
  - the rest of the data has non-zero loss, i.e. it falsifies the model
• The trade-off (between the two goals) is adaptively controlled → adaptive loss function
• For classification, the degree of falsifiability is ~ margin size (see below)

Page 20: Introduction to  Predictive Learning

20

Margin-based loss for classification

  L(y, f(x, ω)) = max(Δ − y·f(x, ω), 0)

[Figure: margin of width 2Δ around the decision boundary]

Page 21: Introduction to  Predictive Learning

21

Classification: non-separable data

Slack variables ξ_i ~ margin violations of y_i·f(x_i, ω)

  L(y, f(x, ω)) = max(Δ − y·f(x, ω), 0)

[Figure: non-separable data with slack variables marked]

Page 22: Introduction to  Predictive Learning

22

Margin-based complexity control
• A large degree of falsifiability is achieved by
  - a large margin (for classification)
  - a small epsilon (for regression)
• For linear classifiers: larger margin → smaller VC-dimension

  Δ₁ > Δ₂ > Δ₃ > …  ⇒  h₁ ≤ h₂ ≤ h₃ ≤ …

Page 23: Introduction to  Predictive Learning

23

Δ-margin hyperplanes
• Solutions provided by minimization of the SVM loss can be indexed by the value of the margin Δ
  → SRM structure ordered by VC-dimension:
  Δ₁ > Δ₂ > Δ₃ > …  ⇒  h₁ ≤ h₂ ≤ h₃ ≤ …
• If data samples belong to a sphere of radius R, then the VC-dimension is bounded by
  h ≤ min(R²/Δ², d) + 1
• For large-margin hyperplanes, the VC-dimension is controlled independently of dimensionality d

Page 24: Introduction to  Predictive Learning

24

SVM Model Complexity
• Two ways to control model complexity:
  - via the model parameterization f(x, ω), using a fixed loss function L(y, f(x, ω))
  - via an adaptive loss function L(y, f(x, ω)), using a fixed (linear) parameterization f(x, ω) = (w · x) + b
• ~ Two types of SRM structures
• Margin-based loss can be motivated by Popper’s falsifiability

Page 25: Introduction to  Predictive Learning

25

Margin-based loss: summary
• Classification: falsifiability controlled by the margin Δ
  L(y, f(x, ω)) = max(Δ − y·f(x, ω), 0)
• Regression: falsifiability controlled by ε
  L_ε(y, f(x, ω)) = max(|y − f(x, ω)| − ε, 0)
• Single-class learning: falsifiability controlled by the radius r
  L_r(f(x)) = max(‖x − a‖ − r, 0)

NOTE: the same interpretation/motivation of margin-based loss applies to different types of learning problems.
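As a concrete reference, the three margin-based losses above fit in a few lines of code. This is an illustrative sketch of my own (function names are not from the lecture):

```python
# Margin-based losses from the summary above (illustrative sketch).
import numpy as np

def hinge_loss(y, f, delta=1.0):
    """Classification: zero once y*f(x) exceeds the margin parameter delta."""
    return np.maximum(delta - y * f, 0.0)

def eps_insensitive_loss(y, f, eps=0.1):
    """Regression: zero inside the eps-tube around the prediction f(x)."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

def single_class_loss(x, a, r=1.0):
    """Single-class learning: zero inside a ball of radius r centered at a."""
    return np.maximum(np.linalg.norm(x - a, axis=-1) - r, 0.0)
```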

Page 26: Introduction to  Predictive Learning

26

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
  - Primal formulation (linearly separable case)
  - Dual optimization formulation
  - Soft-margin SVM formulation
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Page 27: Introduction to  Predictive Learning

27

Optimal Separating Hyperplane

• Distance between the hyperplane and a sample x′: f(x′)/‖w‖
• Margin = 2/‖w‖ (the closest samples lie at distance 1/‖w‖)
• Shaded points are SVs

Page 28: Introduction to  Predictive Learning

28

Optimization Formulation
• Given training data (x_i, y_i), i = 1, …, n
• Find parameters (w, b) of the linear hyperplane f(x) = (w · x) + b
  that minimize (1/2)(w · w)
  under constraints y_i [(w · x_i) + b] ≥ 1,  i = 1, …, n
• Quadratic optimization with linear constraints → tractable for moderate dimensions d
• For large dimensions use the dual formulation:
  - scales better with n (rather than d)
  - uses only dot products
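For readers who want to see this primal problem solved directly, here is a small sketch using the cvxpy modeling library (my own choice of tool; the lecture does not prescribe a solver). It minimizes (1/2)(w · w) subject to y_i[(w·x_i)+b] ≥ 1 on a separable toy set:

```python
# Hard-margin primal SVM as a quadratic program (sketch; assumes cvxpy is installed).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)),    # class -1 cloud
               rng.normal(+2.0, 0.5, (20, 2))])   # class +1 cloud
y = np.r_[-np.ones(20), np.ones(20)]

w = cp.Variable(2)
b = cp.Variable()
margin_constraints = [cp.multiply(y, X @ w + b) >= 1]   # y_i [(w . x_i) + b] >= 1
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), margin_constraints)
problem.solve()
print("w =", w.value, " b =", b.value,
      " margin = 2/||w|| =", 2 / np.linalg.norm(w.value))
```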

Page 29: Introduction to  Predictive Learning

29

From Optimization Theory:
• For a given convex minimization problem with convex inequality constraints there exists an equivalent dual maximization formulation with nonnegative Lagrange multipliers α_i ≥ 0
• Karush-Kuhn-Tucker (KKT) conditions: Lagrange coefficients are nonzero only for samples that satisfy the original constraint y_i [(w · x_i) + b] ≥ 1 with equality
  ~ SVs have positive Lagrange coefficients α_i* > 0

Page 30: Introduction to  Predictive Learning

30

Convex Hull Interpretation of Dual

Find convex hulls for each class. The closest points to an optimal hyperplane are support vectors

Page 31: Introduction to  Predictive Learning

31

Dual Optimization Formulation
• Given training data (x_i, y_i), i = 1, …, n
• Find parameters (α_i*, b*) of an optimal hyperplane
  D(x) = Σ_{i=1}^{n} α_i* y_i (x_i · x) + b*
  as a solution of the maximization problem
  L(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j (x_i · x_j) → max
  under constraints  Σ_{i=1}^{n} α_i y_i = 0,  α_i ≥ 0,  i = 1, …, n
• Note: data samples with nonzero α_i* are SVs
• The formulation requires only inner products (x · x′)

Page 32: Introduction to  Predictive Learning

32

Support Vectors
• SVs ~ training samples with non-zero loss
• SVs are samples that falsify the model
• The model depends only on SVs → SVs ~ robust characterization of the data

WSJ Feb 27, 2004: “About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters.”
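In practice one rarely codes the dual by hand; libraries solve it internally and expose the resulting α_i and support vectors. A brief sketch with scikit-learn (an assumed tool, not mentioned in the slides):

```python
# The dual solution exposed by a library SVM (sketch; assumes scikit-learn).
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

print("number of SVs:", len(clf.support_))          # samples with nonzero alpha
print("dual coefficients alpha_i * y_i:", clf.dual_coef_.ravel())
print("w recovered from the SVs:", clf.dual_coef_ @ clf.support_vectors_)
```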

Page 33: Introduction to  Predictive Learning

33

Support Vectors
• SVM test error bound:
  E[Test error] ≤ E[# support vectors] / n
  → a small number of SVs ~ good generalization
• Can be explained using LOO cross-validation
• SVM generalization can be related to data compression
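A quick way to see this bound in action (a sketch of my own, assuming scikit-learn): compare the fraction of support vectors with the hold-out error; the SV fraction acts as a rough upper bound on expected test error.

```python
# Sketch: fraction of SVs as a rough upper bound on expected test error.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
sv_fraction = len(clf.support_) / len(X_tr)     # ~ E[#SV] / n
test_error = 1.0 - clf.score(X_te, y_te)        # hold-out estimate of test error
print(f"#SV/n = {sv_fraction:.2f}   test error = {test_error:.2f}")
```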

Page 34: Introduction to  Predictive Learning

34

Soft-Margin SVM formulation

Minimize:   (1/2)‖w‖² + C Σ_{i=1}^{n} ξ_i

under constraints:   y_i [(w · x_i) + b] ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, …, n

where f(x) = (w · x) + b

[Figure: hyperplanes f(x) = +1, f(x) = 0, f(x) = −1; slack variables
 ξ₁ = 1 − f(x₁), ξ₂ = 1 − f(x₂), ξ₃ = 1 + f(x₃) for samples violating the margin]

Page 35: Introduction to  Predictive Learning

35

SVM Dual Formulation
• Given training data (x_i, y_i), i = 1, …, n
• Find parameters (α_i*, b*) of an optimal hyperplane
  D(x) = Σ_{i=1}^{n} α_i* y_i (x_i · x) + b*
  as a solution of the maximization problem
  L(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j (x_i · x_j) → max
  under constraints  Σ_{i=1}^{n} α_i y_i = 0,  0 ≤ α_i ≤ C,  i = 1, …, n
• Note: data samples with nonzero α_i* are SVs
• The formulation requires only inner products (x · x′)

Page 36: Introduction to  Predictive Learning

36

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Page 37: Introduction to  Predictive Learning

37

Nonlinear Decision Boundary
• A fixed (linear) parameterization is too rigid
• A nonlinear (curved) decision boundary may yield a larger margin (falsifiability) and lower error

Page 38: Introduction to  Predictive Learning

38

Nonlinear Mapping via Kernels
Nonlinear f(x, ω) + margin-based loss = SVM
• Nonlinear mapping g(x) to feature space:  x → g(x) = z,  ŷ = (w · z)
• Linear in z-space ~ nonlinear in x-space
• But the kernel K(x_i, x_j) is a symmetric function
  → compute the dot product in z-space analytically via the kernel:
  (z_i · z_j) = Σ_{k=1}^{m} g_k(x_i) g_k(x_j) = K(x_i, x_j)

Page 39: Introduction to  Predictive Learning

39

Example of Kernel Function
• 2D input space x = (x₁, x₂)
• Mapping to z-space (2nd-order polynomial):
  z₁ = 1,  z₂ = √2·x₁,  z₃ = √2·x₂,  z₄ = √2·x₁x₂,  z₅ = x₁²,  z₆ = x₂²
• Can show by direct substitution that for two input vectors u = (u₁, u₂), v = (v₁, v₂)
  their dot product in z-space is calculated analytically:
  G(u) · G(v) = ((u · v) + 1)²
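The identity above is easy to check numerically. A short sketch of my own that builds the explicit 2nd-order feature map z(x) and verifies z(u)·z(v) = ((u·v)+1)²:

```python
# Verify that the explicit 2nd-order polynomial map reproduces the kernel value.
import numpy as np

def z(x):
    x1, x2 = x
    # explicit feature map for the kernel ((u.v) + 1)^2 in 2D
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2,
                     x2 ** 2])

u = np.array([0.3, -1.2])
v = np.array([2.0, 0.5])
lhs = z(u) @ z(v)            # dot product in the 6-dimensional z-space
rhs = (u @ v + 1.0) ** 2     # kernel evaluated directly in input space
print(lhs, rhs)              # the two values coincide
```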

Page 40: Introduction to  Predictive Learning

40

SVM Formulation (with kernels)
• Replacing (z_i · z_j) = K(x_i, x_j) leads to:
• Given: the training data (x_i, y_i), i = 1, …, n,
  an inner product kernel K(x, x′),
  regularization parameter C
• Find parameters (α_i*, b*) of an optimal hyperplane
  D(x) = Σ_{i=1}^{n} α_i* y_i K(x_i, x) + b*
  as a solution of the maximization problem
  L(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j K(x_i, x_j) → max
  under constraints  Σ_{i=1}^{n} α_i y_i = 0,  0 ≤ α_i ≤ C/n,  i = 1, …, n

Page 41: Introduction to  Predictive Learning

41

Examples of Kernels
A kernel K(x, x′) is a symmetric function satisfying general (Mercer’s) conditions.
Examples of kernels for different mappings x → z:
• Polynomial of degree m:
  K(x, x′) = ((x · x′) + 1)^m
• RBF kernel (with width parameter σ):
  K(x, x′) = exp(−‖x − x′‖² / σ²)
• Neural-network kernel, for given parameters (v, a):
  K(x, x′) = tanh(v·(x · x′) + a)
→ Automatic selection of the number of hidden units (SVs)

Page 42: Introduction to  Predictive Learning

42

More on Kernels
• The kernel matrix holds all the information (data + kernel):

  K(1,1)  K(1,2)  …  K(1,n)
  K(2,1)  K(2,2)  …  K(2,n)
  …
  K(n,1)  K(n,2)  …  K(n,n)

• The kernel defines a distance in some feature space (aka kernel-induced feature space)
• Kernel parameters control nonlinearity
• Kernels can incorporate a priori knowledge
• Kernels can be defined over complex structures (trees, sequences, sets, etc.)
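Kernel (Gram) matrices like the one above can be computed directly with library pairwise-kernel helpers. A small sketch, assuming scikit-learn as the tool (its "sigmoid" kernel is the neural-network kernel named earlier):

```python
# Kernel matrices for the three example kernels (sketch; assumes scikit-learn).
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel, sigmoid_kernel

X = np.random.default_rng(0).normal(size=(5, 3))    # 5 samples, 3 features

K_poly = polynomial_kernel(X, degree=3, gamma=1.0, coef0=1.0)   # ((x.x') + 1)^3
K_rbf  = rbf_kernel(X, gamma=1.0)                               # exp(-gamma ||x - x'||^2)
K_sig  = sigmoid_kernel(X, gamma=1.0, coef0=0.0)                # tanh(gamma (x.x') + coef0)
print(K_poly.shape, K_rbf.shape, K_sig.shape)                   # each is 5 x 5
```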

Page 43: Introduction to  Predictive Learning

43

New insights provided by SVM
• Why can linear classifiers generalize?
  (1) the margin is large (relative to R)
  (2) the % of SVs is small
  (3) the ratio d/n is small
• SVM offers an effective way to control complexity (via margin + kernel selection), i.e. implementing (1) or (2) or both
• Requires common-sense parameter tuning

Page 44: Introduction to  Predictive Learning

44

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
  - Model Selection
  - Histogram of Projections
  - SVM Extensions and Modifications
• SVM for Regression
• Summary and Discussion

Page 45: Introduction to  Predictive Learning

45

SVM Model Selection
• The quality of SVM classifiers depends on proper tuning of model parameters:
  - kernel type (poly, RBF, etc.)
  - kernel complexity parameter
  - regularization parameter C
• Note: the VC-dimension depends on both C and the kernel parameters
• These parameters are usually selected via cross-validation, by searching over a wide range of parameter values (on the log scale)
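The log-scale cross-validation search described above maps directly onto a grid search. A minimal sketch (assuming scikit-learn; Ripley's data set is not bundled, so a synthetic 2-D stand-in is used):

```python
# Model selection for an RBF SVM by cross-validated grid search on a log scale.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=250, noise=0.3, random_state=0)   # stand-in 2-D data

param_grid = {
    "C": 10.0 ** np.arange(-1, 5),       # 0.1 ... 10000
    "gamma": 2.0 ** np.arange(-3, 4),    # 2^-3 ... 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10).fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validation error:", 1.0 - search.best_score_)
```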

Page 46: Introduction to  Predictive Learning

46

SVM Example 1: Ripley’s data
• Ripley’s data set:
  - 250 training samples
  - SVM using RBF kernel K(u, v) = exp(−γ‖u − v‖²)
  - model selection via 10-fold cross-validation
• Cross-validation error table:

  gamma     C=0.1    C=1      C=10     C=100    C=1000   C=10000
  γ = 2⁻³   98.4%    23.6%    18.8%    20.4%    18.4%    14.4%
  γ = 2⁻²   51.6%    22%      20%      20%      16%      14%
  γ = 2⁻¹   33.2%    19.6%    18.8%    15.6%    13.6%    14.8%
  γ = 2⁰    28%      18%      16.4%    14%      12.8%    15.6%
  γ = 2¹    20.8%    16.4%    14%      12.8%    16%      17.2%
  γ = 2²    19.2%    14.4%    13.6%    15.6%    15.6%    16%
  γ = 2³    15.6%    14%      15.6%    16.4%    18.4%    18.4%

• Optimal parameters: C = 1,000, gamma = 1
  Note: there may be multiple optimal parameter values

Page 47: Introduction to  Predictive Learning

47

Optimal SVM Model
• RBF SVM with optimal parameters C = 1,000, gamma = 1
• Test error is 9.8% (estimated using 1,000 test samples)

[Figure: decision boundary of the optimal RBF SVM in the (x1, x2) plane]

Page 48: Introduction to  Predictive Learning

48

SVM Example 2: Noisy Hyperbolas
• Noisy Hyperbolas data set:
  - 100 training samples (50 per class)
  - 100 validation samples (used for parameter tuning)
• RBF SVM model vs. Poly SVM model (5th degree)
• Which model is ‘better’?
• Model interpretation?

[Figure: decision boundaries of the RBF SVM (left) and the 5th-degree polynomial SVM (right)]

Page 49: Introduction to  Predictive Learning

49

SVM Example 3: handwritten digits
• MNIST handwritten digits (5 vs. 8) ~ high-dimensional data
  - 1,000 training samples (500 per class)
  - 1,000 validation samples (used for parameter tuning)
  - 1,866 test samples
• Each sample is a real-valued vector of size 28*28 = 784 (a 28 × 28 pixel image)
• RBF SVM: optimal parameters C = 1, …

Page 50: Introduction to  Predictive Learning

50

How to visualize a high-dim SVM model?
• Histogram of projections for linear SVM:
  - project the training data onto the normal vector w (of the SVM model)
  - show a univariate histogram of the projected training samples
• On the histogram: ‘0’ ~ decision boundary, −1/+1 ~ margins
• Similar histograms can be obtained for nonlinear SVM
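A histogram of projections is easy to produce once a model is trained, since the quantity f(x) is what libraries expose as the decision function. A sketch of my own, assuming scikit-learn and matplotlib:

```python
# Histogram of projections f(x) for a trained linear SVM (sketch; scikit-learn).
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=2)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

proj = clf.decision_function(X)    # f(x) = (w . x) + b for each training sample
plt.hist(proj[y == 0], bins=30, alpha=0.5, label="class -1")
plt.hist(proj[y == 1], bins=30, alpha=0.5, label="class +1")
for v in (-1, 0, +1):              # margins at -1/+1, decision boundary at 0
    plt.axvline(v, linestyle="--")
plt.legend()
plt.show()
```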

Page 51: Introduction to  Predictive Learning

51

Histogram of Projections for Digits Data
• Projections of training/test data onto the normal direction of the RBF SVM decision boundary:

[Figure: histograms of projections for the training data (left) and the test data (right),
 with the decision boundary at 0 and the margins at −1/+1]

Page 52: Introduction to  Predictive Learning

52

Practical Issues for SVM Classifiers

• Pre-processing: all inputs pre-scaled to the range [0, 1] or [−1, +1]
• Model selection (parameter tuning)
• SVM extensions:
  - multi-class problems
  - unbalanced data sets
  - unequal misclassification costs

Page 53: Introduction to  Predictive Learning

53

SVM for multi-class problems
• Digit recognition ~ ten-class problem:
  estimate 10 binary classifiers (one digit vs. the rest)
• For prediction: a test input is applied to all 10 binary SVM classifiers, and the class with the largest SVM output value is selected
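The one-vs-rest scheme described above can be reproduced explicitly; a minimal sketch (assuming scikit-learn, whose OneVsRestClassifier implements exactly this selection of the largest output):

```python
# One-vs-rest SVM for the 10-class digits problem (sketch; assumes scikit-learn).
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                 # 8x8 digit images, 10 classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale", C=10.0)).fit(X_tr, y_tr)
# predict() applies all 10 binary SVMs and picks the class with the largest output
print("test accuracy:", ovr.score(X_te, y_te))
```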

Page 54: Introduction to  Predictive Learning

54

Unbalanced Settings and Unequal Costs
• Unbalanced settings:
  - different numbers of positive and negative samples
  - different prior probabilities for training/test data
• Different misclassification costs:
  - two types of errors, FP and FN
  - Cost(false_positive) vs. Cost(false_negative)
  - Loss function: C_fp · P(false positive) + C_fn · P(false negative) → min
  - these ‘costs’ need to be specified a priori, based on application requirements

Page 55: Introduction to  Predictive Learning

55

SVM Modifications
• The problem: how to modify the standard SVM formulation?
• Unbalanced data + unequal costs:

  minimize  (1/2)‖w‖² + C⁺ Σ_{i ∈ class +} ξ_i + C⁻ Σ_{i ∈ class −} ξ_i

  where the penalty C⁺ for errors on positive samples ~ Cost(false negative),
  and the penalty C⁻ for errors on negative samples ~ Cost(false positive)
• In practice, one needs to specify C⁺ and C⁻
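In libraries this modification usually surfaces as per-class weights that scale C. A brief sketch (my assumption about the API mapping, using scikit-learn's class_weight, which multiplies C for each class):

```python
# Unequal misclassification costs via per-class C weighting (sketch; scikit-learn).
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=3)

# Penalize errors on class 1 three times more than errors on class 0,
# i.e. effectively C_+ : C_- = 3 : 1 for this binary problem.
clf = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 3.0}).fit(X, y)
print("support vectors per class:", clf.n_support_)
```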

Page 56: Introduction to  Predictive Learning

56

Example: SVM with unequal costs
• Ripley’s data set (as before), where
  - negative samples ~ ‘triangles’
  - given misclassification costs in the ratio 3 : 1
• Note: the boundary is shifted away from the positive samples

[Figure: decision boundary in the (x1, x2) plane, shifted away from the positive class]

Page 57: Introduction to  Predictive Learning

57

SVM Applications
• Handwritten digit recognition
• Face detection in unrestricted images
• Text/document classification
• Image classification and retrieval
• …

Page 58: Introduction to  Predictive Learning

58

Handwritten Digit Recognition (mid-90’s)

• Data set: postal images (zip-code), segmented, cropped;

~ 7K training samples, and 2K test samples

• Data encoding: 28x28 grey scale pixel image

• Original motivation: Compare SVM with custom MLP network (LeNet) designed for this application

• Multi-class problem: one-vs-all approach → 10 SVM classifiers (one per digit)

Page 59: Introduction to  Predictive Learning

59

Digit Recognition Results
• Summary
  - prediction accuracy better than custom NNs
  - accuracy does not depend on the kernel type
  - 100-400 support vectors per class (digit)
• More details:

  Type of kernel    No. of support vectors   Error %
  Polynomial        274                      4.0
  RBF               291                      4.1
  Neural network    254                      4.2

• ~80-90% of SVs coincide (for different kernels)
• Reduced-set SVM (Burges, 1996) ~ 15 SVs per class

Page 60: Introduction to  Predictive Learning

60

Document Classification (Joachims, 1998)

• The Problem: Classification of text documents in large data bases, for text indexing and retrieval

• Traditional approach: human categorization (i.e. via feature selection) – relies on a good indexing scheme. This is time-consuming and costly

• Predictive Learning Approach (SVM): construct a classifier using all possible features (words)

• Document/ Text Representation: individual words = input features (possibly weighted)

• SVM performance:
  – very promising (~90% accuracy vs. 80% by other classifiers)
  – most problems are linearly separable → use linear SVM

Page 61: Introduction to  Predictive Learning

61

Image Data Mining (Chapelle et al., 1999)
• Example image data: [images omitted]
• Classification of images in databases, for image indexing etc.
• DATA SET: Corel photo images, 2670 samples divided into 7 classes (airplanes, birds, fish, vehicles, etc.)
  Training data: 1375 images; test data: 1375 images (50%)
• MAIN ISSUE: invariant representation / data encoding

Page 62: Introduction to  Predictive Learning

62

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
  - SV regression formulation
  - Dual optimization formulation
  - Model selection
  - Example: Boston Housing
• Summary and Discussion

Page 63: Introduction to  Predictive Learning

63

General SVM Modeling Approach
1. For a linear model f(x, ω) = (w · x) + b, minimize the SVM functional
   R_SVM(w, b, Z_n) = (1/2)(w · w) + C · R_emp(w, b, Z_n)
   using the SVM loss suitable for the learning problem at hand
2. Transform (1) to the dual optimization formulation (using only dot products)
3. Use kernels to obtain the nonlinear version of (2)

Note: this approach is used for all learning problems. However, the tunable parameters of the margin-based loss are different for the various types of learning problems.

Page 64: Introduction to  Predictive Learning

64

SVM Regression
• For a linear model f(x, ω) = (w · x) + b, minimize the SVM functional
  R_SVM(w, b, Z_n) = (1/2)(w · w) + C · R_emp(w, b, Z_n)
  where the empirical loss (for regression) is given by
  L_ε(y, f(x, ω)) = max(|y − f(x, ω)| − ε, 0)
• Two distinct ways to control model complexity:
  - by the value of C (with fixed epsilon)
  - by the value of epsilon (with fixed large C)
• SVM regression tunes both epsilon and C for optimal performance

Page 65: Introduction to  Predictive Learning

65

Linear SVM regression

For the linear parameterization f(x, ω) = (w · x) + b, the SVM regression functional is

  R_SVM(w, b, Z_n) = (1/2)‖w‖² + C · R_emp(w, Z_n) → min

where

  R_emp(w, Z_n) = (1/n) Σ_{i=1}^{n} L_ε(y_i, f(x_i, ω))
  L_ε(y, f(x, ω)) = max(|y − f(x, ω)| − ε, 0)

Page 66: Introduction to  Predictive Learning

66

Direct Optimization Formulation
• Given training data (x_i, y_i), i = 1, …, n

Minimize
  (1/2)(w · w) + C Σ_{i=1}^{n} (ξ_i + ξ_i*)

under constraints
  y_i − (w · x_i) − b ≤ ε + ξ_i
  (w · x_i) + b − y_i ≤ ε + ξ_i*
  ξ_i ≥ 0,  ξ_i* ≥ 0,   i = 1, …, n

[Figure: ε-insensitive tube around the regression estimate; samples outside the tube have slack ξ, ξ*]

Page 67: Introduction to  Predictive Learning

67

Dual Formulation for SVM Regression
• Given training data (x_i, y_i), i = 1, …, n, and the values of ε, C
• Find coefficients α_i, α_i*, i = 1, …, n, which maximize

  L(α, α*) = Σ_{i=1}^{n} y_i (α_i − α_i*) − ε Σ_{i=1}^{n} (α_i + α_i*)
             − (1/2) Σ_{i,j=1}^{n} (α_i − α_i*)(α_j − α_j*)(x_i · x_j)

  under constraints
  Σ_{i=1}^{n} α_i = Σ_{i=1}^{n} α_i*,   0 ≤ α_i ≤ C,   0 ≤ α_i* ≤ C

• This yields the solution
  f(x) = Σ_{i ∈ SV} (α_i − α_i*)(x_i · x) + b*

Page 68: Introduction to  Predictive Learning

68

Example: RBF regression

• RBF estimate (dashed line) using ε = 0.16, C = 2000
• The RBF model (kernel width 0.2) uses only 5 SVs (out of the 40 points)

[Figure: training points and the SVM estimate on x ∈ [0, 1]]

Page 69: Introduction to  Predictive Learning

69

Example: decomposition of the RBF model

The weighted sum of 5 RBF kernel functions (one per SV) gives the SVM model

[Figure: the five weighted kernel functions and their sum]

Page 70: Introduction to  Predictive Learning

70

SVM Model Selection: General
• Setting/tuning of SVM hyper-parameters
  - usually performed by experts
  - more recently, by non-expert practitioners
• Issues for SVM model selection:
  (1) parameters controlling the ‘margin’ size
  (2) kernel type and kernel complexity
• Strategies for model selection:
  - exhaustive search in the parameter space (via resampling)
  - efficient search using VC analytic bounds
  - rule-of-thumb analytic strategies (for a particular type of learning problem)

Page 71: Introduction to  Predictive Learning

71

Model Selection (continued)
• Parameters controlling margin size:
  - for classification, parameter C
  - for regression, the value of epsilon
  - for single-class learning, the radius r
• Complexity control ~ the fraction of SVs (ν-SVM), ν ∈ [0, 1]:
  - for classification, replace C with ν
  - for regression, specify the fraction ν of points allowed to lie outside the ε-insensitive zone
• For very sparse data (d/n >> 1) use linear SVM

Page 72: Introduction to  Predictive Learning

72

Parameter Selection for SVM Regression
• Selection of parameter C
  Recall the SVM solution f(x) = Σ_{i ∈ SV} (α_i − α_i*)(x_i · x) + b*,
  where 0 ≤ α_i ≤ C, 0 ≤ α_i* ≤ C, i = 1, …, n, and the kernels (RBF) are bounded
  → C should be of the order of the range of response values, C ~ y_max − y_min
• Selection of ε
  In general, ε ~ σ (the noise level), but this does not reflect the dependency on sample size.
  For linear regression the variance of the estimate scales as σ²/n, suggesting ε ~ σ/√n
• The final prescription:  ε = 3σ √(ln n / n)

Page 73: Introduction to  Predictive Learning

73

Effect of SVM parameters on test error
• Training data: univariate sinc target function t(x) = sin(x)/x, x ∈ [−10, 10],
  with additive Gaussian noise (sigma = 0.2)
  (a) small sample size 50   (b) large sample size 200

[Figure: prediction risk as a function of C/n and epsilon,
 for the small-sample and large-sample settings]

Page 74: Introduction to  Predictive Learning

74

SVM vs Regularization
• System imitation → SVM
• System identification → regularization
  But their risk functionals ‘look similar’:

  R_SVM(w, b) = C Σ_{i=1}^{n} L(y_i, f(x_i, ω)) + (1/2)‖w‖²
  R_reg(w, b) = Σ_{i=1}^{n} (y_i − f(x_i, ω))² + λ‖w‖²

• Recent claims: SVM = special case of regularization
• These claims neglect the role of the margin-based loss

Page 75: Introduction to  Predictive Learning

75

Comparison for Classification
• Linear SVM vs. penalized LDA → the comparison is fair
• Data sets: small (20 samples per class) and large (100 samples per class)

Page 76: Introduction to  Predictive Learning

76

Comparison results: classification
• Small sample size:
  linear SVM yields 0.5% - 1.1% error rate
  penalized LDA yields 2.8% - 3% error
• Large sample size:
  linear SVM yields 0.4% - 1.1% error rate
  penalized LDA yields 1.1% - 2.2% error
• Conclusion: margin-based complexity control is better than regularization

Page 77: Introduction to  Predictive Learning

77

Comparison for regression
• Linear SVM vs. linear ridge regression
  Note: linear SVM has 2 tuning parameters
• Sparse data set: 30 noisy samples, using the target function
  t(x) = 2x₁ + x₂ + 0·x₃ + 0·x₄ + 0·x₅,  x ∈ [0, 1]⁵,
  corrupted with Gaussian noise with σ = 0.2
• Complexity control:
  - for RR, vary the regularization parameter λ
  - for SVM, vary the epsilon and C parameters
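The ridge vs. SVM comparison on this target is easy to rerun. Here is a minimal sketch of my own (assuming scikit-learn), not a reproduction of the exact experiment reported on the following slides:

```python
# Ridge regression vs. linear SVM regression on t(x) = 2*x1 + x2 (sketch).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
def make_data(n):
    X = rng.uniform(0, 1, size=(n, 5))
    y = 2 * X[:, 0] + X[:, 1] + rng.normal(0, 0.2, size=n)   # x3, x4, x5 irrelevant
    return X, y

X_tr, y_tr = make_data(30)     # sparse training set, as on the slide
X_te, y_te = make_data(1000)   # large test set for risk estimation

ridge = RidgeCV(alphas=np.logspace(-4, 4, 30)).fit(X_tr, y_tr)
svr = GridSearchCV(SVR(kernel="linear"),
                   {"C": np.logspace(-1, 3, 10), "epsilon": [0.05, 0.1, 0.2, 0.4]},
                   cv=5).fit(X_tr, y_tr)

print("Ridge test MSE:", np.mean((ridge.predict(X_te) - y_te) ** 2))
print("SVR   test MSE:", np.mean((svr.predict(X_te) - y_te) ** 2))
```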

Page 78: Introduction to  Predictive Learning

78

Complexity Control for Ridge Regression
• Coefficient shrinkage for ridge regression

[Figure: coefficients w1, …, w5 vs. log(lambda)]

Page 79: Introduction to  Predictive Learning

79

Complexity control for SVM
• Coefficient shrinkage for SVM:
  (a) vary C (with epsilon = 0)   (b) vary epsilon (with C large)

[Figure: coefficients w1, …, w5 vs. log(n/C) and vs. epsilon]

Page 80: Introduction to  Predictive Learning

80

Comparison: ridge regression vs SVM
• Sparse setting: n = 10, noise σ = 0.2
• Ridge regression: λ chosen by cross-validation
• SV regression: ε = 0.2, C selected by cross-validation
• Average risk (100 realizations): 0.44 (RR) vs. 0.37 (SVM)

[Figure: box plots of risk (MSE) for ridge regression and SVM, log scale]

Page 81: Introduction to  Predictive Learning

81

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Page 82: Introduction to  Predictive Learning

82

Summary
• Direct approach → different formulations
• Margin-based loss: robust, controls complexity (falsifiability)
• SRM: a new type of structure
• Nonlinear feature selection (~ SVs): incorporated into model estimation
• Appropriate applications:
  - high-dimensional data
  - content-based / content-dependent