A Simple Review on SVM


Page 1: A Simple Review on SVM

A Simple Review on SVM

Honglin Yu

Australian National University, NICTA

September 2, 2013

Page 2: A Simple Review on SVM

Outline

1 The Tutorial Routine
  Overview
  Linear SVC in Separable Case: Largest Margin Classifier
  Soft Margin
  Solving SVM
  Kernel Trick and Non-linear SVM

2 Some Topics
  Why the Name: Support Vectors?
  Why SVC Works Well: A Simple Example
  Relation with Logistic Regression etc.

3 Packages

Page 3: A Simple Review on SVM


Overview

SVMs (Support Vector Machines) are supervised learning methods.

They include methods for both classification and regression.

In this talk, we focus on binary classification.

Page 4: A Simple Review on SVM


Symbols

training data: (x_1, y_1), ..., (x_m, y_m) ∈ X × {±1}

patterns: x_i, i = 1, 2, ..., m

pattern space: X

targets: y_i, i = 1, 2, ..., m

features: Φ(x_i), i = 1, 2, ..., m

feature space: H

feature mapping: Φ : X → H
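As a quick illustration (not part of the original slides), the notation can be instantiated on a toy dataset in NumPy; the names X, y and phi below are only for illustration.

import numpy as np

# m = 4 patterns x_i in the pattern space X = R^2, with targets y_i in {+1, -1}
X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([+1, +1, -1, -1])

# An illustrative feature mapping Phi : X -> H (here H = R^3)
def phi(x):
    return np.array([x[0], x[1], x[0] * x[1]])

features = np.array([phi(x) for x in X])  # row i is the feature vector Phi(x_i)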

Page 5: A Simple Review on SVM


Separable Case: Largest Margin Classifier

Figure: Simplest Case

“Separable” means: ∃ a line w · x + b = 0 that correctly separates all the training data.

“Margin”: d_+ + d_−, where d_± = min_{y_i = ±1} dist(x_i, {x : w · x + b = 0}).

In this case, the SVC simply looks for a line maximizing the margin.

Page 6: A Simple Review on SVM


Separable Case: Largest Margin Classifier

Another way of expressing separability: y_i (w · x_i + b) > 0.

Because the training data are finite, ∃ ε > 0 such that y_i (w · x_i + b) ≥ ε.

This is equivalent to y_i ((w/ε) · x_i + b/ε) ≥ 1.

w · x_i + b = 0 and (w/ε) · x_i + b/ε = 0 are the same line.

So we can directly write the constraints as y_i (w · x_i + b) ≥ 1.

This removes the scaling redundancy in w, b.

Page 7: A Simple Review on SVM


We also want the separating plane to lie in the middle (i.e., d_+ = d_−).

So the optimization problem can be formulated as

arg max_{w,b} ( 2 min_i |w · x_i + b| / ||w|| )

s.t. y_i (w · x_i + b) ≥ 1, i = 1, 2, ..., N    (1)

This is equivalent to:

arg min_{w,b} ||w||²

s.t. y_i (w · x_i + b) ≥ 1, i = 1, 2, ..., N    (2)

But, so far, we have only shown that Eq. (2) is a necessary condition for finding the plane we want (correct and in the middle).

Page 8: A Simple Review on SVM


Largest Margin Classifier

It can be proved that, when the data are separable, for the following problem

min_{w,b} (1/2) ||w||²

s.t. y_i (w · x_i + b) ≥ 1, i = 1, ..., m.    (3)

we have,

1 When ||w|| is minimized, the equality holds for some x_i.

2 The equality holds for at least some x_i, x_j with y_i y_j < 0.

3 Based on 1) and 2), we can calculate that the margin is 2/||w||, so the margin is maximized.
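A small numerical check of these three claims (my own sketch, not from the slides): fitting a linear SVC with a very large C approximates the hard-margin problem (3), and we can verify that the constraint is tight for points of both classes and that the margin equals 2/||w||. The toy data are assumed.

import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data in R^2
X = np.array([[-2.0, 0.0], [-1.0, -1.0], [2.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# A very large C makes the soft-margin solver behave like the hard-margin problem (3)
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

print(y * (X @ w + b))        # y_i (w . x_i + b): all >= 1, ~1 for points of both classes
print(2 / np.linalg.norm(w))  # the margin 2 / ||w||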

Page 9: A Simple Review on SVM


Proof of Previous Slide (Warning: My Proof)

1 If ∃ c > 1 such that ∀ x_i, y_i (w · x_i + b) ≥ c, then w/c and b/c also satisfy the constraints and ||w/c|| is smaller.

2 If not, assume that ∃ c > 1 such that

y_i (w · x_i + b) ≥ 1, where y_i = 1
y_i (w · x_i + b) ≥ c, where y_i = −1    (4)

Adding (c − 1)/2 to each side where y_i = 1 and subtracting (c − 1)/2 from each side where y_i = −1, we get

y_i (w · x_i + b + (c − 1)/2) ≥ (c + 1)/2    (5)

Because (c + 1)/2 > 1, similarly to 1), the ||w|| here is not the smallest.

3 Pick x_1, x_2 where the equality holds and y_1 y_2 < 0; the margin is just the distance between x_1 and the line y_2 (w · x + b) = 1, which can easily be calculated as 2/||w||.

Page 10: A Simple Review on SVM


Non Separable Case

Figure: Non-separable case: misclassified points exist

Page 11: A Simple Review on SVM


Non Separable Case

The constraints y_i (w · x_i + b) ≥ 1, i = 1, 2, ..., m cannot all be satisfied.

Solution: add slack variables ξ_i and reformulate the problem as

min_{w,b,ξ} (1/2) ||w||² + C Σ_{i=1}^m ξ_i

s.t. y_i (w · x_i + b) ≥ 1 − ξ_i, i = 1, 2, ..., m
     ξ_i ≥ 0    (6)

This shows the trade-off (controlled by C) between the margin (∝ 1/||w||) and the penalty Σ_i ξ_i.
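To see the trade-off concretely, the following sketch (toy data assumed, not from the slides) fits the soft-margin problem (6) with a small and a large C and compares the resulting margin and total slack.

import numpy as np
from sklearn.svm import SVC

# Toy data with one point on the wrong side, so the classes are not separable
X = np.array([[-2, 0], [-1, -1], [-1, 1], [2, 0], [1, 1], [-1.5, 0.2]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # the xi_i of problem (6)
    print(f"C={C}: margin 2/||w|| = {2 / np.linalg.norm(w):.3f}, total slack = {slack.sum():.3f}")

A small C buys a wider margin at the cost of more slack; a large C does the opposite.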

Page 12: A Simple Review on SVM


Solving SVM: Lagrangian Dual

Constrained optimization → Lagrangian dual

Primal form:

min_{w,b,ξ} (1/2) ||w||² + C Σ_{i=1}^m ξ_i

s.t. y_i (w · x_i + b) ≥ 1 − ξ_i, i = 1, 2, ..., m
     ξ_i ≥ 0    (7)

The Primal Lagrangian:

L(w, b, ξ, α, µ) = (1/2) ||w||² + C Σ_i ξ_i − Σ_i α_i {y_i (w · x_i + b) − 1 + ξ_i} − Σ_i µ_i ξ_i

Because (7) is convex, the Karush-Kuhn-Tucker (KKT) conditions hold at the optimum.

Page 13: A Simple Review on SVM


Applying KKT Conditions

Stationarity:

∂L/∂w = 0 → w = Σ_i α_i y_i x_i
∂L/∂b = 0 → Σ_i α_i y_i = 0
∂L/∂ξ_i = 0 → C − α_i − µ_i = 0, ∀i

Primal feasibility: y_i (w · x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀i

Dual feasibility: α_i ≥ 0, µ_i ≥ 0

Complementary slackness, ∀i:

µ_i ξ_i = 0
α_i {y_i (w · x_i + b) − 1 + ξ_i} = 0

When α_i ≠ 0, the corresponding x_i is called a support vector.

Page 14: A Simple Review on SVM


Dual Form

Using the equations derived from the KKT conditions, eliminate w, b, ξ_i, µ_i from the primal form to get the dual form:

max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j

s.t. Σ_i α_i y_i = 0
     C ≥ α_i ≥ 0    (8)

And the decision function is: y = sign(Σ_i α_i y_i x_i^T x + b)

(where b = y_k − w · x_k for any k with C > α_k > 0)
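A sketch of how the dual solution is exposed by libsvm through scikit-learn (toy data assumed): dual_coef_ stores α_i y_i for the support vectors, so w and the decision function above can be reconstructed directly.

import numpy as np
from sklearn.svm import SVC

X = np.array([[-2, 0], [-1, -1], [2, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]        # alpha_i * y_i, one entry per support vector
sv = clf.support_vectors_
b = clf.intercept_[0]

print(alpha_y @ sv, clf.coef_[0])  # w = sum_i alpha_i y_i x_i matches the primal w

x_new = np.array([0.5, 0.3])
print(np.sign(alpha_y @ (sv @ x_new) + b))         # sign(sum_i alpha_i y_i x_i^T x + b)
print(np.sign(clf.decision_function([x_new])[0]))  # agrees with the library's prediction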

Page 15: A Simple Review on SVM


We Need a Nonlinear Classifier

Figure: A case that a linear classifier cannot handle

Finding an appropriate form of separating curve is hard, but we can transform the data!

Page 16: A Simple Review on SVM


Mapping Training Data to Feature Space

Φ(x) = (x, x²)^T

Figure: Feature Mapping Helps Classification

To solve a nonlinear classification problem, we can define some mapping Φ : X → H and do linear classification in the feature space H.
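A minimal sketch of this idea (the 1-D data below are assumed for illustration): points that no single threshold can separate in X become linearly separable in H under Φ(x) = (x, x²).

import numpy as np
from sklearn.svm import SVC

# 1-D patterns: the +1 class sits inside an interval, so no threshold on x separates the classes
x = np.array([-1.0, -0.8, -0.3, 0.0, 0.4, 0.9, 1.2])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Map to the feature space H = R^2 with Phi(x) = (x, x^2)
features = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear", C=10.0).fit(features, y)
print(clf.score(features, y))   # 1.0: a line in H separates what no threshold in X could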

Page 17: A Simple Review on SVM


Recap the Dual Form: An Important Fact

Dual form:

max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j

s.t. Σ_i α_i y_i = 0
     C ≥ α_i ≥ 0    (9)

Decision function: y = sign(Σ_i α_i y_i x_i^T x + b)

To train an SVC, or to use it for prediction, we only need to know the inner products between the x's!

If we want to apply a linear SVC in H, we do NOT need to know Φ(x); we ONLY need to know k(x, x′) = ⟨Φ(x), Φ(x′)⟩. k(x, x′) is called the “kernel function”.

Page 18: A Simple Review on SVM


Kernel Functions

The input of a kernel function k : X × X → R is two patterns x, x′ in X; the output is the canonical inner product between Φ(x) and Φ(x′) in H.

By using k(·, ·), we can implicitly transform the data by some Φ(·) (which often maps into an infinite-dimensional space). E.g., for k(x, x′) = (x x′ + 1)², Φ(x) = (x², √2 x, 1)^T.

But not for every function X × X → R can we find a corresponding Φ(x). Kernel functions must satisfy Mercer's condition.
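The polynomial example above can be checked numerically (a small sketch, with scalar patterns chosen arbitrarily):

import numpy as np

def k(x, xp):
    return (x * xp + 1) ** 2                        # k(x, x') = (x x' + 1)^2

def phi(x):
    return np.array([x ** 2, np.sqrt(2) * x, 1.0])  # Phi(x) = (x^2, sqrt(2) x, 1)^T

x, xp = 1.7, -0.4
print(k(x, xp))            # 0.1024
print(phi(x) @ phi(xp))    # same value: the kernel is the inner product in H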

Page 19: A Simple Review on SVM


Conditions of Kernel Functions

Necessity: the kernel (Gram) matrix K = [k(x_i, x_j)]_{m×m} must be positive semidefinite:

t^T K t = Σ_{i,j} t_i t_j k(x_i, x_j) = Σ_{i,j} t_i t_j ⟨Φ(x_i), Φ(x_j)⟩ = ⟨Σ_i t_i Φ(x_i), Σ_j t_j Φ(x_j)⟩ = ||Σ_i t_i Φ(x_i)||² ≥ 0

Sufficiency in continuous form (Mercer's condition): for any symmetric function k : X × X → R that is square integrable on X × X, if it satisfies

∫_{X×X} k(x, x′) f(x) f(x′) dx dx′ ≥ 0 for all f ∈ L²(X),

then there exist functions φ_i : X → R and numbers λ_i ≥ 0 such that

k(x, x′) = Σ_i λ_i φ_i(x) φ_i(x′) for all x, x′ in X.
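The necessity part can also be observed numerically; the sketch below (random data, γ chosen arbitrarily) builds an RBF Gram matrix and checks that its eigenvalues are non-negative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))      # 20 random patterns in R^3
gamma = 0.5

# Gram matrix K_ij = k(x_i, x_j) = exp(-gamma ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)

print(np.linalg.eigvalsh(K).min())   # >= 0 up to floating-point error: K is positive semidefinite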

Page 20: A Simple Review on SVM


Commonly Used Kernel Functions

Linear kernel: k(x, x′) = x′^T x

RBF kernel: k(x, x′) = e^{−γ ||x − x′||²}, shown for γ = 1/2 (figure from Wikipedia)

Polynomial kernel: k(x, x′) = (γ x′^T x + r)^d, shown for γ = 1, d = 2 (figure from Wikipedia)

etc.
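These kernels are also available as pairwise functions in scikit-learn; a short sketch with assumed toy data and the parameter values mentioned above:

import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel, polynomial_kernel

X = np.array([[0.0, 1.0], [1.0, -1.0], [2.0, 0.5]])

print(linear_kernel(X))                                       # x'^T x
print(rbf_kernel(X, gamma=0.5))                               # exp(-gamma ||x - x'||^2), gamma = 1/2
print(polynomial_kernel(X, degree=2, gamma=1.0, coef0=1.0))   # (gamma x'^T x + r)^d with d = 2, r = 1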

Page 21: A Simple Review on SVM


Mechanical Analogy

Remember from KKT conditions,

∂L/∂w = 0 → w = Σ_i α_i y_i x_i

∂L/∂b = 0 → Σ_i α_i y_i = 0

Imagine that every support vector x_i exerts a force F_i = α_i y_i w/||w|| on the “separating plane + margin”. Then we have

Σ Forces = Σ_i α_i y_i w/||w|| = (w/||w||) Σ_i α_i y_i = 0

Σ Torques = Σ_i x_i × (α_i y_i w/||w||) = (Σ_i α_i y_i x_i) × w/||w|| = w × w/||w|| = 0

This is why the {x_i} are called “support vectors”.
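The force and torque balance can be checked on a fitted model (my own sketch; a hard-margin-like fit on assumed toy data):

import numpy as np
from sklearn.svm import SVC

X = np.array([[-2, 0], [-1, -1], [2, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
unit_w = w / np.linalg.norm(w)
alpha_y = clf.dual_coef_[0]        # alpha_i * y_i for each support vector
sv = clf.support_vectors_

forces = alpha_y[:, None] * unit_w                           # F_i = alpha_i y_i w / ||w||
print(forces.sum(axis=0))                                    # total force: ~ (0, 0)
torque = sv[:, 0] * forces[:, 1] - sv[:, 1] * forces[:, 0]   # z-component of x_i x F_i in 2-D
print(torque.sum())                                          # total torque: ~ 0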

Page 22: A Simple Review on SVM


Why SVC Works Well

Let's first consider using linear regression for classification; the decision function is y = sign(w · x + b).

Figure: Feature Mapping Helps Classification

In SVM, we only consider the points near the boundary.

Page 23: A Simple Review on SVM


Min-Loss Framework

Primal form:

min_{w,b,ξ} (1/2) ||w||² + C Σ_{i=1}^m ξ_i

s.t. y_i (w · x_i + b) ≥ 1 − ξ_i, i = 1, 2, ..., m
     ξ_i ≥ 0    (10)

Rewriting it in min-loss form:

min_{w,b} (1/2) ||w||² + C Σ_{i=1}^m max{0, 1 − y_i (w · x_i + b)}    (11)

The term max{0, 1 − y_i (w · x_i + b)} is called the hinge loss.
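As a sketch, the objective (11) is easy to write down directly (the data and the trial (w, b) below are assumed):

import numpy as np

def svm_objective(w, b, X, y, C):
    # 0.5 ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b)), i.e. the hinge-loss form (11)
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * hinge.sum()

X = np.array([[-2, 0], [-1, -1], [2, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, 1, 1])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))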

Page 24: A Simple Review on SVM


Seeing C-SVM and LMC from a Unified Viewpoint

Rewriting the LMC (largest margin classifier):

min_{w,b} (1/2) ||w||² + Σ_{i=1}^m ∞ · (sign(1 − y_i (w · x_i + b)) + 1)    (12)

Regularised logistic regression (y ∈ {0, 1}, not {−1, 1}; p_i = 1/(1 + e^{−w·x_i})):

min_w (1/2) ||w||² + Σ_{i=1}^m −(y_i log(p_i) + (1 − y_i) log(1 − p_i))    (13)

Page 25: A Simple Review on SVM


Relation with Logistic Regression etc.

Figure: black: 0-1 loss; red: logistic loss (−log(1/(1 + e^{−y_i w·x_i}))); blue: hinge loss; green: quadratic loss.

The “0-1 loss” and the “hinge loss” are not affected by correctly classified outliers.

BTW, logistic regression can also be “kernelised”.
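The losses in the figure can be tabulated at a few margin values m = y_i (w · x_i + b); this sketch uses (1 − m)² for the quadratic loss, which is one common convention (an assumption on my part).

import numpy as np

m = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # margins y_i (w . x_i + b)

zero_one = (m <= 0).astype(float)           # 0-1 loss
logistic = np.log(1 + np.exp(-m))           # logistic loss: -log(1 / (1 + e^{-m}))
hinge = np.maximum(0.0, 1.0 - m)            # hinge loss
quadratic = (1.0 - m) ** 2                  # quadratic loss (assumed convention)

for row in zip(m, zero_one, logistic, hinge, quadratic):
    print(row)

At m = 2 only the quadratic (and, slightly, the logistic) loss is non-zero, which is exactly the “correctly classified outlier” effect mentioned above.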

Page 26: A Simple Review on SVM


Commonly Used Packages

libsvm (and liblinear), SVMlight, and scikit-learn (whose SVC is a Python wrapper around libsvm)

Code example in sklearn

import numpy as np
from sklearn.svm import SVC

# Four 2-D training points with labels 1 and 2
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])

clf = SVC()                        # default: RBF kernel, C = 1.0
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))   # predict the class of a new point
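For a non-linear problem the same estimator takes the kernel and its parameters as arguments; a small variation on the snippet above (parameter values chosen arbitrarily for illustration):

import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])

# RBF kernel with an explicit gamma; C controls the soft-margin trade-off
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))
print(clf.support_vectors_)        # the support vectors found during training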

Page 27: A Simple Review on SVM


Things Not Covered

Algorithms (SMO, SGD)

Generalisation bound and VC dimension

ν-SVM, one-class SVM etc.

SVR

etc.