Part 2: Support Vector Machines
Vladimir Cherkassky
Electrical and Computer Engineering, University of Minnesota
Presented at Tech Tune Ups, ECE Dept, June 1, 2011
SVM: Brief History
• 1963 Margin (Vapnik & Lerner)
• 1964 Margin (Vapnik & Chervonenkis)
• 1964 RBF Kernels (Aizerman)
• 1965 Optimization formulation (Mangasarian)
• 1971 Kernels (Kimeldorf and Wahba)
• 1992-1994 SVMs (Vapnik et al.)
• 1996-present Rapid growth, numerous applications
• 1996-present Extensions to other problems
MOTIVATION for SVM
• Problems with 'conventional' methods:
  - model complexity ~ dimensionality (# features)
  - nonlinear methods → multiple minima
  - hard to control complexity
• SVM solution approach:
  - adaptive loss function (to control complexity independent of dimensionality)
  - flexible nonlinear models
  - tractable optimization formulation
SVM APPROACH
• Linear approximation in Z-space using a special adaptive loss function
• Complexity independent of dimensionality
[Diagram: input x is mapped by g(x) to feature vector z; the linear model w·z produces the estimate ŷ]
OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary
Example: binary classification
• Given: linearly separable data
• How to construct a linear decision boundary?
Linear Discriminant Analysis
[Figure: LDA solution and its separation margin]
Perceptron (linear NN)
• Perceptron solutions and separation margin
[Figure: several perceptron solutions with different separation margins]
Largest-margin solution
• All solutions explain the data well (zero error)
• All solutions ~ the same linear parameterization
• Larger margin ~ more confidence (falsifiability)
[Figure: separating hyperplane with margin of width 2M]
Complexity of Δ-margin hyperplanes
• If data samples belong to a sphere of radius R, then the set of Δ-margin hyperplanes has VC dimension bounded by
  h ≤ min(R²/Δ², d) + 1
• For large-margin hyperplanes, VC-dimension is controlled independent of dimensionality d.
Motivation: philosophical
• Classical view: good model = explains the data + low complexity
  Occam's razor (complexity ~ # parameters)
• VC theory: good model = explains the data + low VC-dimension
  ~ VC-falsifiability: good model = explains the data + has large falsifiability
• The idea: falsifiability ~ empirical loss function
Adaptive loss functions
• Both goals (explanation + falsifiability) can be encoded into an empirical loss function where
  - a (large) portion of the data has zero loss
  - the rest of the data has non-zero loss, i.e. it falsifies the model
• The trade-off (between the two goals) is adaptively controlled → adaptive loss function
• Examples of such loss functions for different learning problems are shown next
Margin-based loss for classification
  L(y, f(x, w)) = max(Δ − y f(x, w), 0)
[Figure: margin-based loss as a function of y f(x, w); zero-loss region of width 2Δ]

Margin-based loss for classification: margin is adapted to training data
  L(y, f(x, w)) = max(Δ − y f(x, w), 0)
[Figure: decision boundary with margin borders separating Class +1 and Class -1]
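As a minimal sketch (not from the slides), the margin-based loss above can be written in a few lines of NumPy; the margin width `delta` (Δ) is a free parameter here:

```python
import numpy as np

def margin_loss(y, f, delta=1.0):
    """Margin-based loss L(y, f(x,w)) = max(delta - y*f, 0).

    Samples with y*f >= delta fall in the zero-loss region;
    the remaining samples falsify the model.
    """
    return np.maximum(delta - y * f, 0.0)

y = np.array([+1, +1, -1, -1])
f = np.array([2.0, 0.5, -3.0, 0.2])  # hypothetical decision-function values f(x, w)
print(margin_loss(y, f))             # -> [0.  0.5 0.  1.2]
```

Note how only the two samples inside the margin (or on the wrong side) get non-zero loss, matching the falsifiability idea above.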
Epsilon loss for regression
  L(y, f(x, w)) = max(|y − f(x, w)| − ε, 0)
Parameter epsilon is adapted to training data.
Example: linear regression y = x + noise, where noise = N(0, 0.36), x ~ [0, 1], 4 samples.
Compare: squared, linear and SVM loss (eps = 0.6)
[Figure: four data samples with fitted regression lines for squared, linear and SVM (epsilon-insensitive) loss; x-axis from 0 to 1, y-axis from -1 to 2]
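The three losses compared on the slide can be sketched as follows; the residuals `r` below are made-up values, and `eps = 0.6` matches the slide's example:

```python
import numpy as np

def squared_loss(r):
    return r ** 2

def linear_loss(r):
    return np.abs(r)

def svm_loss(r, eps=0.6):
    # epsilon-insensitive loss: zero inside the eps-tube
    return np.maximum(np.abs(r) - eps, 0.0)

r = np.array([-1.0, -0.5, 0.3, 1.2])  # hypothetical residuals y - f(x, w)
print(squared_loss(r))  # -> [1.   0.25 0.09 1.44]
print(linear_loss(r))   # -> [1.  0.5 0.3 1.2]
print(svm_loss(r))      # -> [0.4 0.  0.  0.6]
```

The SVM loss assigns zero loss to the two samples inside the epsilon-tube, so they do not influence the fitted model.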
OUTLINE
• Margin-based loss
• SVM for classification
  - Linear SVM classifier
  - Inner product kernels
  - Nonlinear SVM classifier
• SVM examples
• Support vector regression
• Summary
SVM Loss for Classification
The continuous quantity y f(x, w) measures how close a sample x is to the decision boundary.
Optimal Separating Hyperplane
• Distance between the hyperplane and a sample x': |f(x')| / ||w||
• Margin: 1 / ||w||
• Shaded points are SVs
Linear SVM Optimization Formulation (for separable data)
• Given training data (x_i, y_i), i = 1, ..., n
• Find parameters (w, b) of the linear hyperplane f(x) = w·x + b
  that minimize
    (1/2) (w·w)
  under constraints
    y_i (w·x_i + b) ≥ 1,  i = 1, ..., n
• Quadratic optimization with linear constraints
  tractable for moderate dimensions d
• For large dimensions use the dual formulation:
  - scales with sample size (n) rather than d
  - uses only dot products (x_i · x_j)
Classification for non-separable data
• Slack variables ξ_i ~ margin-based loss, based on y_i f(x_i, w):
  L(y, f(x, w)) = max(Δ − y f(x, w), 0)
[Figure: non-separable data with slack variables measured from the margin borders]
SVM for non-separable data
Minimize
  (1/2) ||w||² + C Σ_{i=1..n} ξ_i
under constraints
  y_i (w·x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n
where f(x) = w·x + b
[Figure: margin borders f(x) = +1, f(x) = 0, f(x) = -1; samples x1, x2 inside the margin have ξ1 = 1 − f(x1), ξ2 = 1 − f(x2); misclassified sample x3 has ξ3 = 1 + f(x3)]
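As an illustration (not the authors' code), the soft-margin primal above can be minimized by plain subgradient descent on a toy 2-D problem; the data, learning rate and iteration count are all made-up:

```python
import numpy as np

# Toy 2-D problem: two well-separated Gaussian clusters (assumed data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

# Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))
C, lr = 1.0, 0.01
w, b = np.zeros(2), 0.0
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1                              # samples with non-zero loss
    gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    gb = -C * y[viol].sum()
    w, b = w - lr * gw, b - lr * gb

acc = np.mean(np.sign(X @ w + b) == y)
print(acc)
```

Real SVM solvers use quadratic programming rather than this descent sketch, but the objective being minimized is the same.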
SVM Dual Formulation
• Given training data (x_i, y_i), i = 1, ..., n
• Find parameters (α_i*, b*) of an optimal hyperplane as a solution to the maximization problem
  max L(α) = Σ_{i=1..n} α_i − (1/2) Σ_{i,j=1..n} α_i α_j y_i y_j (x_i · x_j)
under constraints
  Σ_{i=1..n} α_i y_i = 0,  0 ≤ α_i ≤ C
• Solution
  f(x) = Σ_{i=1..n} y_i α_i* (x_i · x) + b*
  where samples with nonzero α_i* are SVs
• Needs only inner products (x_i · x)
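On a tiny toy set the dual can be solved directly with a generic optimizer; this is a sketch using SciPy's SLSQP (real SVM solvers use specialized QP methods such as SMO instead), and the four data points are made-up:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny separable toy set (assumed data): two points per class.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
C, n = 10.0, len(y)
K = X @ X.T                                  # matrix of dot products (x_i . x_j)

def neg_dual(a):                             # maximize L(a) <=> minimize -L(a)
    return -(a.sum() - 0.5 * (a * y) @ K @ (a * y))

res = minimize(neg_dual, np.zeros(n),
               bounds=[(0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                            # SVs: samples with nonzero alpha
b = np.mean(y[sv] - X[sv] @ w)
print(np.sign(X @ w + b))                    # recovers the training labels
```

Note that the data enter only through the dot-product matrix `K`, which is exactly what makes the kernel substitution on the next slides possible.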
Nonlinear Decision Boundary
• Fixed (linear) parameterization is too rigid
• Nonlinear curved margin may yield larger margin (falsifiability) and lower error
Nonlinear Mapping via Kernels
Nonlinear f(x, w) + margin-based loss = SVM
• Nonlinear mapping to feature z-space, i.e. z = g(x)
• Linear in z-space ~ nonlinear in x-space
• BUT z·z' = H(x, x') ~ kernel trick:
  compute the dot product via the kernel analytically
• Example: for x = (x1, x2), the mapping z(x) ~ (1, x1, x2, x1 x2, x1², x2²)
SVM Formulation (with kernels)
• Replacing z·z' = H(x, x') leads to:
• Find parameters (α_i*, b*) of an optimal hyperplane
  D(x) = Σ_{i=1..n} y_i α_i* H(x_i, x) + b*
as a solution to the maximization problem
  max L(α) = Σ_{i=1..n} α_i − (1/2) Σ_{i,j=1..n} α_i α_j y_i y_j H(x_i, x_j)
under constraints
  Σ_{i=1..n} α_i y_i = 0,  0 ≤ α_i ≤ C
• Given: the training data (x_i, y_i), i = 1, ..., n
  an inner product kernel H(x, x')
  regularization parameter C
Examples of Kernels
A kernel H(x, x') is a symmetric function satisfying general math conditions (Mercer's conditions).
Examples of kernels for different mappings x → z:
• Polynomials of degree q:  H(x, x') = (x · x' + 1)^q
• RBF kernel:  H(x, x') = exp(−||x − x'||² / γ²)
• Neural networks:  H(x, x') = tanh(v (x · x') + a)  for given parameters (v, a)
Automatic selection of the number of hidden units (SVs)
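The polynomial and RBF kernels above can be sketched in NumPy, together with a numerical check of Mercer's conditions (a symmetric kernel matrix with non-negative eigenvalues); the sample data are random:

```python
import numpy as np

def poly_kernel(X, Y, q=2):
    # Polynomial kernel H(x, x') = (x.x' + 1)^q
    return (X @ Y.T + 1.0) ** q

def rbf_kernel(X, Y, gamma=1.0):
    # RBF kernel H(x, x') = exp(-||x - x'||^2 / gamma^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / gamma ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
for H in (poly_kernel(X, X), rbf_kernel(X, X)):
    assert np.allclose(H, H.T)                  # symmetric
    assert np.linalg.eigvalsh(H).min() > -1e-9  # positive semi-definite
print("both kernel matrices satisfy Mercer's conditions")
```

The tanh kernel is omitted here because it satisfies Mercer's conditions only for some parameter values (v, a).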
More on Kernels
• The kernel matrix has all info (data + kernel):
  H(1,1) H(1,2) ... H(1,n)
  H(2,1) H(2,2) ... H(2,n)
  ...
  H(n,1) H(n,2) ... H(n,n)
• A kernel defines a distance in some feature space (aka kernel-induced feature space)
• Kernels can incorporate a priori knowledge
• Kernels can be defined over complex structures (trees, sequences, sets, etc.)
Support Vectors
• SVs ~ training samples with non-zero loss
• SVs are samples that falsify the model
• The model depends only on SVs → SVs ~ robust characterization of the data
WSJ, Feb 27, 2004: "About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters."
• SVM generalization ~ data compression
New insights provided by SVM
• Why can linear classifiers generalize?
  (1) Margin is large (relative to R), so h ≤ min(R²/Δ², d) + 1
  (2) % of SVs is small
  (3) ratio d/n is small
• SVM offers an effective way to control complexity (via margin + kernel selection), i.e. implementing (1) or (2) or both
• Requires common-sense parameter tuning
OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary
Ripley's data set
• 250 training samples, 1,000 test samples
• SVM using RBF kernel  H(u, v) = exp(−||u − v||² / γ²)
• Model selection via 10-fold cross-validation
[Figure: Ripley's training data, two classes; x-axis from -1.5 to 1, y-axis from -0.2 to 1.2]
Ripley's data set: SVM model
• Decision boundary and margin borders
• SVs are circled
[Figure: SVM decision boundary with margin borders in (x1, x2) space; support vectors circled]
Ripley's data set: model selection
• SVM tuning parameters: C, γ
• Select optimal parameter values via 10-fold cross-validation
• Results of cross-validation are summarized below:

         C=0.1   C=1     C=10    C=100   C=1000  C=10000
  γ=2^-3 98.4%   23.6%   18.8%   20.4%   18.4%   14.4%
  γ=2^-2 51.6%   22%     20%     20%     16%     14%
  γ=2^-1 33.2%   19.6%   18.8%   15.6%   13.6%   14.8%
  γ=2^0  28%     18%     16.4%   14%     12.8%   15.6%
  γ=2^1  20.8%   16.4%   14%     12.8%   16%     17.2%
  γ=2^2  19.2%   14.4%   13.6%   15.6%   15.6%   16%
  γ=2^3  15.6%   14%     15.6%   16.4%   18.4%   18.4%
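The grid search over (C, γ) with 10-fold cross-validation can be sketched with scikit-learn; since Ripley's data set is not bundled with common libraries, the snippet below substitutes synthetic two-class data with a similar flavor:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for Ripley's data: two overlapping Gaussian classes.
rng = np.random.default_rng(0)
n = 250
X = np.vstack([rng.normal([-0.3, 0.4], 0.25, (n // 2, 2)),
               rng.normal([0.4, 0.5], 0.25, (n // 2, 2))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

# Same grid shape as the slide's table: six C values, seven gamma values.
grid = {"C": [0.1, 1, 10, 100, 1000, 10000],
        "gamma": [2.0 ** k for k in range(-3, 4)]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=10).fit(X, y)
print(search.best_params_, round(1 - search.best_score_, 3))  # params, CV error
```

Each cell of the slide's table corresponds to one (C, γ) pair's cross-validation error; the selected model is the pair with the smallest error.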
Noisy Hyperbolas data set
• This example shows application of different kernels
• Note: decision boundaries are quite different
[Figure: two panels showing decision boundaries on the same data, for the RBF kernel (left) and the polynomial kernel (right)]
Many challenging applications
• Mimic human recognition capabilities
  - high-dimensional data
  - content-based
  - context-dependent
• Example: read the sentence
  Sceitnitss osbevred: it is nt inptrant how lteters are msspled isnide the word. It is ipmoratnt that the fisrt and lsat letetrs do not chngae, tehn the txet is itneprted corrcetly
• SVM is suitable for sparse high-dimensional data
Example SVM Applications
• Handwritten digit recognition
• Genomics
• Face detection in unrestricted images
• Text/ document classification
• Image classification and retrieval
• …….
Handwritten Digit Recognition (mid-90's)
• Data set: postal images (zip-code), segmented, cropped; ~7K training samples and 2K test samples
• Data encoding: 16x16 pixel image → 256-dim. vector
• Original motivation: compare SVM with a custom MLP network (LeNet) designed for this application
• Multi-class problem: one-vs-all approach → 10 SVM classifiers (one per digit)
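The one-vs-all scheme can be sketched with scikit-learn; the postal data set is not publicly bundled, so the snippet uses sklearn's small 8x8 digits set as a stand-in for the 16x16 postal images:

```python
from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# One-vs-all: train 10 binary SVM classifiers, one per digit.
X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
ovr = OneVsRestClassifier(LinearSVC(max_iter=5000)).fit(Xtr, ytr)
print(len(ovr.estimators_), ovr.score(Xte, yte))  # 10 binary SVMs, test accuracy
```

At prediction time the class whose binary classifier produces the largest decision-function value wins.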
Digit Recognition Results
• Summary
  - prediction accuracy better than custom NNs
  - accuracy does not depend on the kernel type
  - 100-400 support vectors per class (digit)
• More details:
  Type of kernel    No. of Support Vectors   Error %
  Polynomial        274                      4.0
  RBF               291                      4.1
  Neural Network    254                      4.2
• ~80-90% of SVs coincide (for different kernels)
Document Classification (Joachims, 1998)
• The problem: classification of text documents in large databases, for text indexing and retrieval
• Traditional approach: human categorization (i.e. via feature selection), which relies on a good indexing scheme. This is time-consuming and costly
• Predictive learning approach (SVM): construct a classifier using all possible features (words)
• Document/text representation: individual words = input features (possibly weighted)
• SVM performance:
  - very promising (~90% accuracy vs 80% by other classifiers)
  - most problems are linearly separable → use linear SVM
OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary
Linear SVM regression
Assume linear parameterization f(x, w) = w·x + b
  L(y, f(x, w)) = max(|y − f(x, w)| − ε, 0)
[Figure: epsilon-insensitive tube of width 2ε around the regression line in (x, y) space, with slack variables ξ, ξ* for points outside the tube]
Direct Optimization Formulation
Given training data (x_i, y_i), i = 1, ..., n
Minimize
  (1/2) (w·w) + C Σ_{i=1..n} (ξ_i + ξ_i*)
under constraints
  y_i − (w·x_i + b) ≤ ε + ξ_i
  (w·x_i + b) − y_i ≤ ε + ξ_i*
  ξ_i, ξ_i* ≥ 0,  i = 1, ..., n
[Figure: epsilon-insensitive tube with slack variables ξ, ξ* in (x, y) space]
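Epsilon-SVR in practice can be sketched with scikit-learn's `SVR`; the data below are synthetic (a noisy sine, not the slide's example), and the parameter values are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1-D regression data (assumed): with a wide enough epsilon-tube,
# only a small fraction of the samples become support vectors.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40))[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, 40)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.2, gamma=10.0).fit(X, y)
print(len(svr.support_), "support vectors out of", len(X))
```

Samples strictly inside the epsilon-tube get zero loss and do not appear in the model, which is the sparseness illustrated on the next slides.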
Example: SVM regression using RBF kernel
• SVM estimate is shown as a dashed line
• SVM model uses only 5 SVs (out of the 40 points)
[Figure: 40 data points with the SVM regression estimate (dashed line); x-axis from 0 to 1, y-axis from -0.2 to 1]
RBF regression model
Weighted sum of 5 RBF kernels gives the SVM model
  f(x) = Σ_{j=1..m} w_j exp(−(x − c_j)² / (2σ²)),  m = 5
[Figure: the individual weighted RBF kernels and their sum f(x); x-axis from 0 to 1, y-axis from -4 to 4]
Summary
• Margin-based loss: robust + performs complexity control
• Nonlinear feature selection (~ SVs): performed automatically
• Tractable model selection (easier than most nonlinear methods)
• SVM is not a magic bullet solution
  - similar to other methods when n >> h
  - SVM is better when n << h or n ~ h