Support Vector Machines

Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)


Notation

• Assume a binary classification problem: h(x) ∈ {−1, +1}

• Instances are represented by a vector x ∈ ℝⁿ.

• Training examples: xk = (x1, x2, …, xn)

• Training set:

S = {(x1, y1), (x2, y2), ..., (xm, ym) | (xk, yk) ∈ ℝⁿ × {+1, −1}}

Questions on notation...


Here, assume positive and negative instances are to be separated by the hyperplane w · x + b = 0, where b is the bias.

Equation of line: w1 x1 + w2 x2 + b = 0

[Figure: a separating line in the (x1, x2) plane]


• Intuition: The best hyperplane for future generalization will “maximally” separate the examples (assuming new examples will come from the same distribution)


Definition of Margin

The margin is defined as the distance from the separating hyperplane to the nearest training example.

Note: Sometimes the margin is defined as twice this distance.

Vapnik (1979) showed that the hyperplane maximizing the margin of S will have minimal VC dimension in the set of all consistent hyperplanes, and will thus be optimal.

This is an optimization problem!

Support vectors: Training examples that lie on the margin.

[Figure: support vectors on the margin, from http://nlp.stanford.edu/IR-book/html/htmledition/img1260.png]

Without changing the problem, we can rescale the data to set a = 1; that is, so that w · x + b = +1 on the positive margin and w · x + b = −1 on the negative margin.


w is perpendicular to the decision boundary: for any two points x and x′ on the boundary, w · x + b = w · x′ + b = 0, so w · (x − x′) = 0.


• The length of the margin is 1 / ||w||.

• So to maximize the margin, we need to minimize ||w||.
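A one-line justification (a sketch, using the rescaling a = 1 from above): for a support vector xs on the margin, |w · xs + b| = 1, so its distance to the hyperplane is

```latex
\[
  d(x_s) \;=\; \frac{\lvert w \cdot x_s + b \rvert}{\lVert w \rVert}
         \;=\; \frac{1}{\lVert w \rVert} .
\]
```

Hence maximizing the margin is equivalent to minimizing ||w||.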

Minimizing ||w||

Find w and b by doing the following minimization (the standard hard-margin formulation):

  minimize (1/2) ||w||²
  subject to yk (w · xk + b) ≥ 1, for k = 1, …, m

This is a quadratic optimization problem.


Dual Representation

• It turns out that w can be expressed as a linear combination of the training examples:

  w = Σi αi xi

where αi ≠ 0 only for the support vectors.

• The results of the SVM training algorithm (involving solving a quadratic programming problem) are the αi and the bias b.
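Here is a minimal training sketch, assuming the standard dual formulation (maximize Σi αi − ½ Σij αi αj yi yj xi · xj subject to αi ≥ 0 and Σi αi yi = 0), with scipy's general-purpose solver standing in for a dedicated QP package; the toy data is the worked example that appears later in these slides:

```python
import numpy as np
from scipy.optimize import minimize

# Toy training set (the worked example later in these slides)
X = np.array([[1, 1], [1, 2], [2, 1], [-1, 0], [0, -1], [-1, -1]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

# Dual objective: maximize sum(a) - 1/2 a^T Q a, where Q_ij = y_i y_j (x_i . x_j);
# we minimize its negation.
Q = np.outer(y, y) * (X @ X.T)
def neg_dual(a):
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(y)),
               bounds=[(0, None)] * len(y),                         # alpha_i >= 0
               constraints={'type': 'eq', 'fun': lambda a: a @ y})  # sum_i alpha_i y_i = 0
alpha = res.x

sv = alpha > 1e-6                 # support vectors have nonzero alpha
w = (alpha * y) @ X               # w = sum_i alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)    # from y_k (w . x_k + b) = 1 on the margin
print(w, b, np.round(alpha, 3))   # -> w near [0.67 0.67], b near -0.33
```

Note one convention difference: here the αi are nonnegative and the label yi is kept separate; the optimizer output shown in the example below appears to report the signed products αi yi instead, and a soft-margin solver such as svm_light may report slightly different values.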

SVM Classification

• Classify new example x as follows:

  h(x) = sgn( Σk=1..m αk (xk · x) + b )

where |S| = m. (S is the training set; only the support vectors have αk ≠ 0.)

• Note that the dot product is a kind of “similarity measure”.
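The decision rule in code (a minimal sketch; the function name and the signed-α convention, matching the optimizer output below, are mine):

```python
import numpy as np

def svm_classify(x, support_vectors, alphas, b):
    """h(x) = sgn( sum_k alpha_k (x_k . x) + b ).

    alphas are signed coefficients (each alpha_k absorbs its label y_k),
    matching the optimizer output in the example below.
    """
    score = sum(a * np.dot(xk, x) for xk, a in zip(support_vectors, alphas)) + b
    return 1 if score >= 0 else -1
```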

Example

[Figure: six labeled training points in the (x1, x2) plane, with axes running from −2 to 2]

Input to SVM optimizer:

  x1   x2   class
   1    1    1
   1    2    1
   2    1    1
  -1    0   -1
   0   -1   -1
  -1   -1   -1

Output from SVM optimizer:

  Support vector    α
  (-1, 0)        -.208
  (1, 1)          .416
  (0, -1)        -.208

  b = -.376

Weight vector:

  w = Σk αk xk = -.208 (-1, 0) + .416 (1, 1) - .208 (0, -1) = (.624, .624)

Separation line:

  .624 x1 + .624 x2 - .376 = 0, i.e., x2 = -x1 + .603

Classifying a new point x:

  h(x) = sgn( Σk αk (xk · x) + b ), summing over the three support vectors.
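A quick numeric check of these numbers (a sketch; the test point (2, 2) is a hypothetical one of mine, since the point used in class is only shown in the figure):

```python
import numpy as np

# Optimizer output from the slides (signed-alpha convention)
support_vectors = np.array([[-1, 0], [1, 1], [0, -1]], dtype=float)
alphas = np.array([-0.208, 0.416, -0.208])
b = -0.376

# Weight vector w = sum_k alpha_k x_k
w = alphas @ support_vectors
print(w)                          # -> [0.624 0.624]

# Decision rule h(x) = sgn( sum_k alpha_k (x_k . x) + b )
def h(x):
    return np.sign(alphas @ (support_vectors @ x) + b)

print(h(np.array([2.0, 2.0])))    # far above the line -> 1.0
```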


svm_light Demo


In-Class Exercises


Non-linearly separable training examples

• What if the training examples are not linearly separable?


• Use an old trick: find a function that maps the points to a higher-dimensional space (“feature space”) in which they are linearly separable, and do the classification in that higher-dimensional space.


Need to find a function Φ that will perform such a mapping:

  Φ : ℝⁿ → F

Then we can find a hyperplane in the higher-dimensional feature space F, and do the classification using that hyperplane.

[Figure: points plotted in (x1, x2) are mapped into a 3-dimensional space with axes (x1, x2, x3), where they become linearly separable]


Exercise

Find Φ : ℝ² → F to create a 3-dimensional feature space in which XOR is linearly separable.

I.e., find Φ(x1, x2) = (x1, x2, x3) so that these points become separable (one possible answer is sketched below):

[Figure: the XOR points (0, 0), (0, 1), (1, 0), (1, 1) in the (x1, x2) plane, and an empty (x1, x2, x3) coordinate frame]
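One possible answer (a sketch; this particular mapping is my choice, not necessarily the one intended in class): adding the product feature x1·x2 makes XOR separable, which a few lines of numpy can confirm:

```python
import numpy as np

# XOR: label +1 when exactly one coordinate is 1
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([-1, 1, 1, -1])

# Candidate mapping phi(x1, x2) = (x1, x2, x1*x2)
phi = np.column_stack([points, points[:, 0] * points[:, 1]])

# The plane x1 + x2 - 2*x3 = 0.5 separates the mapped points:
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(phi @ w + b))   # -> [-1.  1.  1. -1.], matching the labels
```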


• Problem:

– Recall that classification of instance x is expressed in terms of dot products of x and support vectors.

– The quadratic programming problem of finding the support vectors and coefficients also depends only on dot products between training examples.


– So if each xi is replaced by Φ(xi) in these procedures, we will have to calculate Φ(xi) for each i, as well as a lot of dot products Φ(xi) · Φ(xj).

– In general, if the feature space F is high-dimensional, Φ(xi) · Φ(xj) will be expensive to compute.


• Second trick:

– Suppose that there were some magic function,

k(xi, xj) = Φ(xi) · Φ(xj)

such that k is cheap to compute even though Φ(xi) · Φ(xj) is expensive to compute.

– Then we wouldn’t need to compute the dot product directly; we’d just need to compute k during both the training and testing phases.

– The good news is: such k functions exist! They are called kernel functions, and come from the theory of integral operators.


Example: Polynomial kernel

Suppose x = (x1, x2) and z = (z1, z2). For the degree-2 polynomial kernel:

  k(x, z) = (x · z)²
          = (x1 z1 + x2 z2)²
          = x1² z1² + 2 x1 z1 x2 z2 + x2² z2²
          = (x1², √2 x1 x2, x2²) · (z1², √2 z1 z2, z2²)
          = Φ(x) · Φ(z), with Φ(x) = (x1², √2 x1 x2, x2²)

So k computes a dot product in a 3-dimensional feature space without ever computing Φ explicitly.
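A quick numeric check of this identity (a sketch):

```python
import numpy as np

def k(x, z):
    """Degree-2 polynomial kernel."""
    return np.dot(x, z) ** 2

def phi(x):
    """Explicit feature map whose dot product equals k."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(k(x, z), np.dot(phi(x), phi(z)))   # both -> 1.0
```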


Example

Does the degree-2 polynomial kernel make XOR linearly separable?

[Figure: the XOR points (0, 0), (0, 1), (1, 0), (1, 1) in the (x1, x2) plane]


Recap of SVM algorithm

Given training set

  S = {(x1, y1), (x2, y2), ..., (xm, ym) | (xi, yi) ∈ ℝⁿ × {+1, −1}}

1. Choose a map Φ : ℝⁿ → F, which maps xi to a higher-dimensional feature space. (Solves the problem that X might not be linearly separable in the original space.)

2. Choose a cheap-to-compute kernel function

  k(x, z) = Φ(x) · Φ(z)

(Solves the problem that in high-dimensional spaces, dot products are very expensive to compute.)


3. Apply the quadratic programming procedure (using the kernel function k) to find a hyperplane (w, b),

  w = Σi αi Φ(xi),

where the Φ(xi)'s are support vectors, the αi's are coefficients, and b is a bias, such that (w, b) is the hyperplane maximizing the margin of the training set in feature space F.


4. Now, given a new instance x, find the classification of x by computing

  h(x) = sgn( Σi αi k(xi, x) + b )
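The whole recap in a few lines (a sketch using scikit-learn's SVC rather than the class's svm_light; the XOR data is from the earlier exercise, and the large C is my way of approximating a hard margin):

```python
import numpy as np
from sklearn.svm import SVC

# XOR data from the earlier exercise
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Steps 1-2: a degree-2 polynomial kernel stands in for choosing phi and k.
# Step 3: fit() solves the quadratic program. Step 4: predict() classifies.
clf = SVC(kernel='poly', degree=2, coef0=1, C=1e6)
clf.fit(X, y)
print(clf.predict(X))     # -> [-1  1  1 -1]
print(clf.dual_coef_)     # signed alpha values for the support vectors
print(clf.intercept_)     # the bias b
```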


More on Kernels

• So far we've seen kernels that map instances in ℝⁿ to instances in ℝᶻ, where z > n.

• One way to create a kernel: figure out an appropriate feature space Φ(x), and find a kernel function k that defines the inner product on that space.

• More practically, we usually don't know the appropriate feature space Φ(x).

• What people do in practice is either:

1. Use one of the “classic” kernels (e.g., polynomial),

or

2. Define their own function that is appropriate for their task, and show that it qualifies as a kernel.


How to define your own kernel

• Given training data (x1, x2, ..., xm)

• The algorithm for SVM learning uses the kernel matrix (also called the Gram matrix):

  Kij = k(xi, xj), for i, j = 1, …, m

• We can choose some function k, and compute the kernel matrix K using the training data.

• We just have to guarantee that our kernel defines an inner product on some feature space.

• Not as hard as it sounds.


What counts as a kernel?

• Mercer's Theorem: If the kernel matrix K is symmetric and positive semidefinite, then it defines a kernel; that is, it defines an inner product in some feature space.

• We don’t even have to know what that feature space is! It can have a huge number of dimensions.


Structural Kernels

• In domains such as natural language processing and bioinformatics, the similarities we want to capture are often structural (e.g., parse trees, formal grammars).

• An important area of kernel methods is defining structural kernels that capture this similarity (e.g., sequence alignment kernels, tree kernels, graph kernels, etc.)


From www.cs.pitt.edu/~tomas/cs3750/kernels.ppt:

• Design criteria - we want kernels to be

– valid – Satisfy Mercer condition of positive semidefiniteness

– good – embody the “true similarity” between objects

– appropriate – generalize well

– efficient – the computation of k(x,x’) is feasible


Example: Watson used tree kernels and SVMs to classify question types for Jeopardy! questions (from Moschitti et al., 2011).