Transcript of COMP 551 - Applied Machine Learning, Lecture 10 --- Support vector machines

Page 1:

COMP 551 - Applied Machine Learning
Lecture 10 --- Support vector machines
William L. Hamilton (with slides and content from Joelle Pineau)
* Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor's written permission.

William L. Hamilton, McGill University and Mila 1

Page 2:

Quiz 5

William L. Hamilton, McGill University and Mila

§ Deadlines extended by one week, since the quizzes are temporarily ahead of the lectures.

§ The quizzes will move forward next week as usual, however.

2

Page 3:

MiniProject 2

William L. Hamilton, McGill University and Mila

§ No late penalty until Sunday at 11:59pm, but…

§ If you submit by Friday 11:59pm, your competition grade will be computed relative to the current standings.

§ No bonus points for Sunday submissions.

3

Page 4:

MiniProject 2

William L. Hamilton, McGill University and Mila 4

This shift and scale is necessary and was missing in the original assignment specification! This ensures that the grading works as explained/intended and that the minimum you get is 75% if you beat the TA baseline!

Page 5:

High-level views of classification

William L. Hamilton, McGill University and Mila

§ Probabilistic
§ Goal: Estimate P(y | x), i.e. the conditional probability of the target variable given the feature data.

§ Decision boundaries
§ Goal: Partition the feature space into different regions, and classify points based on the region where they lie.

5

Page 6:

Outline

§ Perceptrons

§ Definition

§ Perceptron learning rule

§ Convergence

§ Margin & max margin classifiers

§ Linear Support Vector Machines

§ Formulation as optimization problem

§ Generalized Lagrangian and dual

§ Non-linear Support Vector Machines (next class)

William L. Hamilton, McGill University and Mila 6

Page 7:

A simple linear classifier
§ Given a binary classification task: {xi, yi}i=1:n, with yi ∈ {-1, +1}.
§ The perceptron (Rosenblatt, 1957) is a classifier of the form:
hw(x) = sign(wTx) = { +1 if wTx ≥ 0; -1 otherwise }
§ The decision boundary is wTx = 0.
§ An example <xi, yi> is classified correctly if and only if: yi(wTxi) > 0.

[Figure: perceptron unit --- inputs 1, x1, ..., xm with weights w0, w1, ..., wm feed into a linear + threshold node that outputs y.]

William L. Hamilton, McGill University and Mila 7
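As a concrete illustration of the decision rule above, here is a minimal sketch in Python/NumPy (the bias is folded into w by prepending a constant 1 feature, as in the diagram; the data below is made up purely for illustration):

```python
import numpy as np

def perceptron_predict(w, X):
    """Return +1/-1 predictions for the rows of X, given weights w.
    X is assumed to already include a leading column of 1s (bias feature)."""
    scores = X @ w                       # w^T x for every example
    return np.where(scores >= 0, 1, -1)  # sign with the >= 0 convention above

# Toy data: two 2-D points with a prepended bias feature.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, -1.0, -2.0]])
w = np.array([0.0, 1.0, 1.0])
print(perceptron_predict(w, X))  # [ 1 -1 ]
```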

Page 8:

Perceptron learning rule (Rosenblatt, 1957)

§ Consider the following procedure:

Initialize wj, j = 0:m randomly.

While any training examples remain incorrectly classified:

Loop through all misclassified examples xi:

Perform the update: w ← w + α yi xi

where α is the learning rate (or step size).

§ Intuition: For misclassified positive examples, increase wTx, and reduce it for negative examples.

William L. Hamilton, McGill University and Mila 8
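A minimal sketch of this learning rule in Python/NumPy (assumptions mine: linearly separable data, bias folded into the features, and a `max_epochs` safeguard that is not part of the slide's procedure):

```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, max_epochs=1000, rng=None):
    """Perceptron learning rule: repeat updates on misclassified examples.
    X: (n, m+1) array with a leading column of 1s; y: (n,) array of +/-1 labels."""
    rng = np.random.default_rng(rng)
    w = rng.normal(size=X.shape[1])       # random initialization of the weights w_j
    for _ in range(max_epochs):
        misclassified = y * (X @ w) <= 0  # y_i w^T x_i <= 0 means wrong (or on the boundary)
        if not misclassified.any():
            break                         # all training examples correctly classified
        for i in np.flatnonzero(misclassified):
            w = w + alpha * y[i] * X[i]   # w <- w + alpha * y_i * x_i
    return w
```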

Page 9:

Gradient-descent learning
§ The perceptron learning rule can be interpreted as a gradient descent procedure, with optimization criterion:

Err(w) = ∑i=1:n { 0 if yiwTxi ≥ 0; -yiwTxi otherwise }

William L. Hamilton, McGill University and Mila 9

Page 10:

Gradient-descent learning
§ The perceptron learning rule can be interpreted as a gradient descent procedure, with optimization criterion:

Err(w) = ∑i=1:n { 0 if yiwTxi ≥ 0; -yiwTxi otherwise }

§ For correctly classified examples, the error is zero.

§ For incorrectly classified examples, the error tells us by how much wTxi is on the wrong side of the decision boundary.

§ The error is zero when all examples are classified correctly.

William L. Hamilton, McGill University and Mila 10
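A short sketch (my own illustration, not from the slides) of this criterion and the (sub)gradient that recovers the update rule:

```python
import numpy as np

def perceptron_criterion(w, X, y):
    """Err(w) = sum over misclassified examples of -y_i w^T x_i (zero for correct ones)."""
    margins = y * (X @ w)
    return float(np.sum(-margins[margins < 0]))

def perceptron_criterion_grad(w, X, y):
    """(Sub)gradient of Err(w): sum of -y_i x_i over misclassified examples, so a
    gradient step w <- w - alpha * grad is exactly w <- w + alpha * sum_i y_i x_i."""
    mis = y * (X @ w) < 0
    return -(y[mis, None] * X[mis]).sum(axis=0)
```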

Page 11:

Linear separability
§ The data is linearly separable if and only if there exists a w such that:

§ For all examples, yiwTxi > 0

§ Or equivalently, the loss is zero for some set of parameters (w).

[Figure: two scatter plots of + and - points in (x1, x2) space: left panel, a linearly separable data set; right panel, a data set that is not linearly separable.]

William L. Hamilton, McGill University and Mila 11

Page 12:

Perceptron convergence theorem
§ The basic theorem:

§ If the perceptron learning rule is applied to a linearly separable dataset, a solution will be found after some finite number of updates.

William L. Hamilton, McGill University and Mila 12

Page 13:

Perceptron convergence theorem
§ The basic theorem:

§ If the perceptron learning rule is applied to a linearly separable dataset, a solution will be found after some finite number of updates.

§ Additional comments:

§ The number of updates depends on the dataset, on the learning rate, and on the initial weights.

§ If the data is not linearly separable, there will be oscillation (which can be detected automatically).

§ Decreasing the learning rate to 0 can cause the oscillation to settle on some particular solution.

William L. Hamilton, McGill University and Mila 13

Page 14:

Perceptron learning example

[Figure: perceptron learning on separable data in (x1, x2) ∈ [0, 1]². Left panel: initial weights w = [0 0], w0 = 0. Right panel: a separating boundary found after perceptron updates.]

William L. Hamilton, McGill University and Mila 14

Page 15:

Perceptron learning example

[Figure: the same perceptron learning example on separable data. Left panel: initial weights w = [0 0], w0 = 0. Right panel: a separating boundary found after perceptron updates.]

William L. Hamilton, McGill University and Mila 15

Page 16:

Weight as a combination of input vectors
§ Recall the perceptron learning rule:

w ← w + α yi xi

§ If the initial weights are zero, then at any step the weights are a linear combination of the feature vectors of the examples:

w = ∑i=1:n αi yi xi
where αi is the sum of the step sizes used for all updates applied to example i.

William L. Hamilton, McGill University and Mila 16

Page 17:

Weight as a combination of input vectors
§ Recall the perceptron learning rule:

w ← w + α yi xi

§ If the initial weights are zero, then at any step the weights are a linear combination of the feature vectors of the examples:

w = ∑i=1:n αi yi xi
where αi is the sum of the step sizes used for all updates applied to example i.

§ By the end of training, some examples may have never participated in an update, and so will have αi = 0.
§ This is called the dual representation of the classifier.

William L. Hamilton, McGill University and Mila 17
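A brief sketch (assumptions mine: zero initial weights, a fixed step size, and one update per iteration) showing that tracking per-example update coefficients reproduces the same weight vector as the primal updates:

```python
import numpy as np

def perceptron_train_dual(X, y, alpha=1.0, max_updates=1000):
    """Dual form of the perceptron: keep one coefficient per example.
    a_i accumulates (step size) x (number of updates made on example i),
    and the primal weights are recovered as w = sum_i a_i * y_i * x_i."""
    a = np.zeros(X.shape[0])                  # dual coefficients, one per example
    for _ in range(max_updates):
        w = (a * y) @ X                       # reconstruct w from the dual coefficients
        mis = np.flatnonzero(y * (X @ w) <= 0)
        if mis.size == 0:
            break                             # every example satisfies y_i w^T x_i > 0
        a[mis[0]] += alpha                    # same effect as w <- w + alpha * y_i * x_i
    return a, (a * y) @ X
```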

Page 18:

Perceptron learning example
§ Examples used (bold) and not used (faint) in updates. What do you notice?

[Figure: the separable-data example again, with the examples that participated in perceptron updates shown in bold and the rest shown faint.]

William L. Hamilton, McGill University and Mila 18

Page 19:

Perceptron learning example

§ Solutions are often non-unique. The solution depends on the set of instances and the order of sampling in updates.

[Figure: a different separating boundary found by the perceptron on the same data, illustrating that solutions are non-unique.]

William L. Hamilton, McGill University and Mila 19

Page 20:

A few comments on the Perceptron

§ Perceptrons can be learned to fit linearly separable data, using a gradient-descent rule.

§ The logistic function offers a “smooth” version of the perceptron.

William L. Hamilton, McGill University and Mila 20

Page 21:

A few comments on the Perceptron

§ Perceptrons can be learned to fit linearly separable data, using a gradient-descent rule.

§ The logistic function offers a “smooth” version of the perceptron.

§ Two issues:

§ Solutions are non-unique.

William L. Hamilton, McGill University and Mila 21

Page 22:

A few comments on the Perceptron

§ Perceptrons can be learned to fit linearly separable data, using a gradient-descent rule.

§ The logistic function offers a “smooth” version of the perceptron.

§ Two issues:

§ Solutions are non-unique.

§ What about non-linearly separable data? (Topic for next class.)

§ Perhaps data can be linearly separated in a different feature space?

§ Perhaps we can relax the criterion of separating all the data?

William L. Hamilton, McGill University and Mila 22

Page 23:

The non-uniqueness issue
§ Consider a linearly separable binary classification dataset.

§ There is an infinite number of hyperplanes that separate the classes:

[Figure: a linearly separable set of + and - points, with several different separating hyperplanes drawn through the gap between the classes.]

§ Which plane is best?

William L. Hamilton, McGill University and Mila 23

Page 24:

The non-uniqueness issue
§ Consider a linearly separable binary classification dataset.

§ There is an infinite number of hyperplanes that separate the classes:

[Figure: the same linearly separable data with several candidate separating hyperplanes.]

§ Which plane is best?

§ Related question: For a given plane, for which points should we be most confident in the classification?

Support Vector Machines

• Support vector machines (SVMs) for binary classification can be viewed as a way of training perceptrons.

• There are three main new ideas:

– An alternative optimization criterion (the “margin”), which eliminates the non-uniqueness of solutions and has theoretical advantages.

– An efficient way of operating in expanded feature spaces, which allows non-linear functions to be represented (the “kernel trick”).

– A way of handling overfitting and non-separable data by allowing mistakes.

• SVMs can also be used for multiclass classification and regression.

William L. Hamilton, McGill University and Mila 24

Page 25:

Linear Support Vector Machine (SVM)
§ A linear SVM is a perceptron for which we choose w such that the margin is maximized.

§ For a given separating hyperplane, the margin is twice the (Euclidean) distance from the hyperplane to the nearest training example.

§ I.e. the width of the “strip” around the decision boundary that contains no training examples.

[Figure: the separable + / - data shown with separating hyperplanes and the margin strip around the decision boundary containing no training examples.]

William L. Hamilton, McGill University and Mila 25

Page 26:

Distance to the decision boundary
§ Suppose we have a decision boundary that separates the data.

[Figure: Class 1 (+) and Class 2 (o) points separated by a decision boundary with normal vector w; an instance xi, its projection xi0 onto the boundary, and the distance γi between them are marked.]

William L. Hamilton, McGill University and Mila 26

Page 27:

Distance to the decision boundary
§ Suppose we have a decision boundary that separates the data.

§ Let γi be the distance from instance xi to the decision boundary.

§ Define the vector w to be the normal to the decision boundary.

[Figure: same picture, with w drawn as the normal to the boundary and γi as the distance from xi to its projection xi0.]

William L. Hamilton, McGill University and Mila 27

Page 28:

Distance to the decision boundary
§ How can we write γi in terms of xi, yi, w?
§ Let xi0 be the point on the decision boundary nearest xi.
§ The vector from xi0 to xi is γi w/||w||:

§ γi is a scalar (the distance from xi to xi0).
§ w/||w|| is the unit normal.

§ So we can define xi0 = xi - γi w/||w||.

[Figure: xi, its projection xi0 on the boundary, the normal w, and the distance γi.]

William L. Hamilton, McGill University and Mila 28

Page 29:

Distance to the decision boundary
§ How can we write γi in terms of xi, yi, w?
§ Let xi0 be the point on the decision boundary nearest xi.
§ The vector from xi0 to xi is γi w/||w||:

§ γi is a scalar (the distance from xi to xi0).
§ w/||w|| is the unit normal.

§ So we can define xi0 = xi - γi w/||w||.
§ As xi0 is on the decision boundary, we have

wT(xi - γi w/||w||) = 0
§ Solving for γi yields, for a positive example: γi = wTxi/||w||,

or for examples of both classes: γi = yiwTxi/||w||.

[Figure: same picture as on the previous pages.]

William L. Hamilton, McGill University and Mila 29
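A quick numerical sketch (toy numbers of my own) of the per-example distance γi = yiwTxi/||w|| and the resulting margin:

```python
import numpy as np

def geometric_margins(w, X, y):
    """gamma_i = y_i * w^T x_i / ||w||: the distance of x_i to the boundary w^T x = 0
    (positive exactly when x_i is classified correctly)."""
    return y * (X @ w) / np.linalg.norm(w)

# Toy numbers: two correctly classified points and a weight vector.
X = np.array([[2.0, 1.0], [-1.0, -3.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 1.0])
gammas = geometric_margins(w, X, y)
print(gammas)            # approx [2.12, 2.83]
print(2 * gammas.min())  # the margin: twice the distance to the nearest training example
```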

Page 30:

Optimization
§ First suggestion:

Maximize M
with respect to w
subject to yiwTxi/||w|| ≥ M, ∀i

William L. Hamilton, McGill University and Mila 30

Page 31:

Optimization
§ First suggestion:

Maximize M
with respect to w
subject to yiwTxi/||w|| ≥ M, ∀i

§ This is not very convenient for optimization:
§ w appears nonlinearly in the constraints.

§ The problem is underconstrained. If (w, M) is optimal, so is (βw, M), for any β > 0. Add a constraint: ||w||M = 1.

William L. Hamilton, McGill University and Mila 31

Page 32:

Optimization
§ First suggestion:

Maximize M
with respect to w
subject to yiwTxi/||w|| ≥ M, ∀i

§ This is not very convenient for optimization:
§ w appears nonlinearly in the constraints.

§ The problem is underconstrained. If (w, M) is optimal, so is (βw, M), for any β > 0. Add a constraint: ||w||M = 1.

§ Instead try:

Minimize ||w||
with respect to w
subject to yiwTxi ≥ 1

William L. Hamilton, McGill University and Mila 32
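A short worked step (my own filling-in of the rescaling argument) showing how the constraint ||w||M = 1 turns the first formulation into the second:

```latex
% With the normalization \|w\| M = 1, i.e. M = 1/\|w\|:
\max_{w} M \;\;\text{s.t.}\;\; \frac{y_i w^\top x_i}{\|w\|} \ge M \;\forall i
\;\Longleftrightarrow\;
\max_{w} \frac{1}{\|w\|} \;\;\text{s.t.}\;\; y_i w^\top x_i \ge \|w\| M = 1 \;\forall i
\;\Longleftrightarrow\;
\min_{w} \|w\| \;\;\text{s.t.}\;\; y_i w^\top x_i \ge 1 \;\forall i.
```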

Page 33:

Final formulation
§ Let's minimize ½||w||² instead of ||w||.

(Taking the square is a monotone transform, as ||w|| is positive, so it doesn't change the optimal solution. The ½ is for mathematical convenience.)

§ This gets us to:
Min ½||w||²
w.r.t. w
s.t. yiwTxi ≥ 1

William L. Hamilton, McGill University and Mila 33

Page 34:

Final formulation
§ Let's minimize ½||w||² instead of ||w||.

(Taking the square is a monotone transform, as ||w|| is positive, so it doesn't change the optimal solution. The ½ is for mathematical convenience.)

§ This gets us to:
Min ½||w||²
w.r.t. w
s.t. yiwTxi ≥ 1

§ This can be solved! How?
§ It is a quadratic programming (QP) problem, a standard type of optimization problem for which many efficient packages are available. Better yet, it's a convex (positive semidefinite) QP.

William L. Hamilton, McGill University and Mila 34
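As a sanity-check sketch (not the course's reference implementation), the primal problem can be handed to a generic constrained solver; below I use SciPy's SLSQP on a tiny made-up dataset. SLSQP is a general-purpose solver rather than a dedicated QP package; the point is only to make the formulation concrete.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny toy dataset (no bias term, matching the slides' formulation).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

objective = lambda w: 0.5 * np.dot(w, w)                     # (1/2) ||w||^2
constraints = [{"type": "ineq", "fun": lambda w, i=i: y[i] * X[i] @ w - 1.0}
               for i in range(len(y))]                        # y_i w^T x_i - 1 >= 0

res = minimize(objective, x0=np.array([1.0, 1.0]), method="SLSQP",
               constraints=constraints)
print(res.x)              # max-margin weight vector
print(y * (X @ res.x))    # every value should be >= 1 (active constraints sit at 1)
```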

Page 35:

Constrained optimization

Picture from: http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/

William L. Hamilton, McGill University and Mila 35

Page 36:

Example

[Figure: plots of the separable data with the max-margin linear boundary found by solving the QP.]

We have a solution, but no support vectors yet...

William L. Hamilton, McGill University and Mila 36

Page 37:

Example

[Figure: the same max-margin solutions as on the previous page.]

We have a unique solution, but no support vectors yet. Recall the dual solution for the perceptron: extend it to the margin case.

William L. Hamilton, McGill University and Mila 37

Page 38:

Lagrange multipliers
§ Consider the following optimization problem, called the primal:

minw f(w)
s.t. gi(w) ≤ 0, i = 1…k

§ We define the generalized Lagrangian:
L(w, α) = f(w) + ∑i=1:k αi gi(w)
where αi, i = 1…k are the Lagrange multipliers.

[Figure: Find x and y to maximize f(x, y) subject to a constraint (shown in red) g(x, y) = c. From: https://en.wikipedia.org/wiki/Lagrange_multiplier]

William L. Hamilton, McGill University and Mila 38

Page 39:

Lagrangian optimization
§ Consider P(w) = maxα:αi≥0 L(w, α) = maxα:αi≥0 f(w) + ∑i=1:k αi gi(w)

§ (P stands for “primal”.)

§ Recall: L(w, α) = f(w) + ∑i=1:k αi gi(w). Observe that the following is true:

P(w) = { f(w), if all constraints are satisfied; +∞, otherwise }

William L. Hamilton, McGill University and Mila 39
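A one-step justification (my own filling-in, using only the definition above) of why P(w) takes these two values:

```latex
% If some constraint is violated, g_j(w) > 0, then letting \alpha_j \to \infty drives
% \alpha_j g_j(w) \to \infty, so \max_{\alpha \ge 0} L(w, \alpha) = +\infty.
% If all g_i(w) \le 0, each term \alpha_i g_i(w) \le 0, so the max is attained at \alpha_i = 0:
P(w) = \max_{\alpha:\,\alpha_i \ge 0} \Big[ f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) \Big]
     = \begin{cases} f(w) & \text{if } g_i(w) \le 0 \ \forall i, \\ +\infty & \text{otherwise.} \end{cases}
```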

Page 40:

Lagrangian optimization
§ Consider P(w) = maxα:αi≥0 L(w, α) = maxα:αi≥0 f(w) + ∑i=1:k αi gi(w)

§ (P stands for “primal”.)
§ Recall: L(w, α) = f(w) + ∑i=1:k αi gi(w). Observe that the following is true:

P(w) = { f(w), if all constraints are satisfied; +∞, otherwise }

§ Hence, instead of computing minw f(w) subject to the original constraints, we can compute:
p* = minw P(w) = minw maxα:αi≥0 L(w, α)   (Primal)

§ Alternately, invert max and min to get:

d* = maxα:αi≥0 minw L(w, α)   (Dual)

William L. Hamilton, McGill University and Mila 40

Page 41:

Maximum Margin Perceptron
§ We wanted to solve:

Min ½||w||²
w.r.t. w
s.t. yiwTxi ≥ 1

§ The Lagrangian is:

L(w, α) = ½||w||² + ∑i αi (1 - yi(wTxi))

§ The primal problem is: minw maxα:αi≥0 L(w, α)
§ The dual problem is: maxα:αi≥0 minw L(w, α)

William L. Hamilton, McGill University and Mila 41

Page 42:

Dual optimization problem
§ Consider both solutions:

p* = minw maxα:αi≥0 L(w, α)   (Primal)
d* = maxα:αi≥0 minw L(w, α)   (Dual)

§ If f and the gi are convex and the gi can all be satisfied simultaneously for some w, then we have equality: d* = p* = L(w*, α*).

§ w* is the optimal weight vector (= primal solution).

§ α* is the optimal set of support vectors (= dual solution).

§ For SVMs, we have a quadratic objective and linear constraints, so both f and the gi are convex.

§ For linearly separable data, all gi can be satisfied simultaneously.

§ Note: w*, α* solve the primal and dual if and only if they satisfy the Karush-Kuhn-Tucker conditions (see suggested readings).

William L. Hamilton, McGill University and Mila 42

Page 43:

Solving the dual
§ Taking the derivative of L(w, α) w.r.t. w, setting it to 0, and solving for w:

L(w, α) = ½||w||² + ∑i αi (1 - yi(wTxi))
∂L/∂w = w - ∑i αi yi xi = 0
w* = ∑i αi yi xi

§ Just like for the perceptron with zero initial weights, the optimal solution w* is a linear combination of the xi.

§ Plugging this back into L we get the dual:
maxα ∑i αi - ½∑i,j yiyjαiαj(xi·xj)
with constraints αi ≥ 0 and ∑i αi yi = 0. (A quadratic programming problem.)

§ Complexity of solving the quadratic program? Polynomial time, O(|v|³) (where |v| = # variables in the optimization; here |v| = n). Fast approximations exist.

William L. Hamilton, McGill University and Mila 43
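One way to solve this dual numerically is with an off-the-shelf QP solver. The sketch below uses the cvxopt package (an assumption on my part, not something the slides prescribe) on the same toy data as in the primal sketch, writing the dual in cvxopt's standard form min ½αᵀPα + qᵀα subject to Gα ≤ h and Aα = b:

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Dual: max sum_i a_i - 1/2 sum_ij y_i y_j a_i a_j (x_i . x_j)
P = matrix(np.outer(y, y) * (X @ X.T))   # P_ij = y_i y_j x_i . x_j (positive semidefinite)
q = matrix(-np.ones(n))                  # maximizing sum(a) = minimizing -sum(a)
G = matrix(-np.eye(n))                   # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))             # sum_i a_i y_i = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
sol = solvers.qp(P, q, G, h, A, b)
a = np.ravel(sol["x"])                   # optimal dual variables alpha_i
w = (a * y) @ X                          # w* = sum_i a_i y_i x_i
support = np.flatnonzero(a > 1e-6)       # indices with a_i > 0: the support vectors
print(w, support)
```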

Page 44:

The support vectors
§ Suppose we find the optimal α's (e.g. using a QP package).

§ Constraint i is active when αi > 0. This corresponds to the points for which (1 - yiwTxi) = 0.

§ These are the points lying on the edge of the margin. We call them support vectors. They define the decision boundary.

§ The output of the classifier for a query point x is computed as:

hw(x) = sign(∑i=1:n αi yi (xi·x))
§ I.e., it is determined by computing the dot products of the query point with the support vectors.

William L. Hamilton, McGill University and Mila 44
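Continuing the sketch above (the arrays `a`, `X`, `y` come from the hypothetical dual solve, not from the slides), prediction only needs dot products between the query point and the support vectors:

```python
import numpy as np

def svm_predict(x_query, a, X, y, tol=1e-6):
    """h(x) = sign( sum_i a_i y_i (x_i . x) ), summing only over support vectors (a_i > 0)."""
    sv = a > tol
    score = np.sum(a[sv] * y[sv] * (X[sv] @ x_query))
    return 1 if score >= 0 else -1

# e.g. svm_predict(np.array([1.0, 1.5]), a, X, y)
```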

Page 45:

Example

§ Support vectors are in bold.

[Figure: the max-margin solution on the separable data, with the support vectors drawn in bold on the edge of the margin.]

Non-linearly separable data

[Figure: a dataset in (x1, x2) whose two classes cannot be separated by a line.]

• A linear boundary might be too simple to capture the class structure.

• One way of getting a nonlinear decision boundary in the input space is to find a linear decision boundary in an expanded space (as in polynomial regression).

• Thus, xi is replaced by φ(xi), where φ is called a feature mapping.

William L. Hamilton, McGill University and Mila 45

Page 46:

What you should know
From today:

§ The perceptron algorithm.

§ The margin definition for linear SVMs.

§ The use of Lagrange multipliers to transform optimization problems.

§ The primal and dual optimization problems for SVMs.

After the next class:

§ Non-linearly separable case.

§ Feature space version of SVMs.

§ The kernel trick and examples of common kernels.

William L. Hamilton, McGill University and Mila 46