Support Vector Machines (and Kernel Methods in general)

Machine Learning

Last Time

• Multilayer Perceptron/Logistic Regression Networks
– Neural Networks
– Error Backpropagation

Today

• Support Vector Machines

• Note: we’ll rely on some math from optimization theory that we won’t derive.

Maximum Margin

• Perceptron (and other linear classifiers) can lead to many equally valid choices for the decision boundary

Are these really “equally valid”?

Max Margin

• How can we pick which is best?

• Maximize the size of the margin.

Are these really “equally valid”?

Small Margin

Large Margin

Support Vectors

• Support Vectors are those input points (vectors) closest to the decision boundary

• 1. They are vectors
• 2. They “support” the decision hyperplane

Support Vectors

• Define this as a decision problem

• The decision hyperplane: $\vec{w}^T\vec{x} + b = 0$

• No fancy math, just the equation of a hyperplane.

Support Vectors

• Aside: Why do some classifiers use targets $t_i \in \{0, 1\}$ and others $t_i \in \{-1, +1\}$?
– Simplicity of the math and interpretation.
– For probability density function estimation, 0,1 has a clear correlate.
– For classification, a decision boundary of 0 is more easily interpretable than 0.5.

Support Vectors

• Decision Function: $D(\vec{x}_i) = \mathrm{sign}(\vec{w}^T\vec{x}_i + b)$

Support Vectors

• Margin hyperplanes: $\vec{w}^T\vec{x} + b = \gamma$ and $\vec{w}^T\vec{x} + b = -\gamma$

Support Vectors

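• Scale invariance: $\vec{w}^T\vec{x} + b = 0$ and $c(\vec{w}^T\vec{x} + b) = 0$ describe the same hyperplane for any $c \neq 0$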

Support Vectors

This scaling does not change the decision hyperplane or the support vector hyperplanes, but it lets us fix the margin hyperplanes at $\vec{w}^T\vec{x} + b = \pm 1$, which eliminates a variable from the optimization.
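The scale invariance can be checked numerically. The sketch below is illustrative only: the weights, bias, and points are made-up values, and `decision` and `rescale` are hypothetical helper names. It assumes a positive scale factor c; a negative c keeps the same hyperplane but swaps which side is labelled +1.

```python
def decision(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation >= 0 else -1


def rescale(w, b, c):
    """Multiply both the weights and the bias by the same constant c."""
    return [c * wi for wi in w], c * b


# Made-up hyperplane and query points (assumptions for illustration).
w, b = [2.0, -4.0], 6.0
points = [[0.0, 0.0], [1.0, 3.0], [5.0, 1.0], [-2.0, 2.0]]

w_scaled, b_scaled = rescale(w, b, 0.5)

labels = [decision(w, b, p) for p in points]
labels_scaled = [decision(w_scaled, b_scaled, p) for p in points]

print(labels)         # decisions with the original (w, b)
print(labels_scaled)  # decisions with the rescaled (w, b)
```

Both lists come out identical, which is why we are free to pick the scale that makes the margin hyperplanes equal to +1 and -1.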

What are we optimizing?

• We will represent the size of the margin in terms of w.

• This will allow us to simultaneously
– Identify a decision boundary
– Maximize the margin

How do we represent the size of the margin in terms of w?

1. There must be at least one point that lies on each support hyperplane.

Proof outline: If not, we could define a larger margin support hyperplane that does touch the nearest point(s).

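2. Thus, for a point $\vec{x}_1$ on one support hyperplane and a point $\vec{x}_2$ on the other: $\vec{w}^T\vec{x}_1 + b = 1$ and $\vec{w}^T\vec{x}_2 + b = -1$

3. And, subtracting the two: $\vec{w}^T(\vec{x}_1 - \vec{x}_2) = 2$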


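• The vector w is perpendicular to the decision hyperplane
– For any two points $\vec{x}_a$ and $\vec{x}_b$ on the hyperplane, $\vec{w}^T(\vec{x}_a - \vec{x}_b) = 0$.
– If the dot product of two vectors equals zero, the two vectors are perpendicular.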

• The margin is the projection of x1 – x2 onto w, the normal of the hyperplane.

Aside: Vector Projection

• The margin is the projection of x1 – x2 onto w, the normal of the hyperplane.

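Size of the Margin: $\frac{2}{\|\vec{w}\|}$

Projection: $\frac{\vec{w}^T(\vec{x}_1 - \vec{x}_2)}{\|\vec{w}\|} = \frac{2}{\|\vec{w}\|}$, since $\vec{w}^T(\vec{x}_1 - \vec{x}_2) = 2$ for points $\vec{x}_1$, $\vec{x}_2$ on the two support hyperplanes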
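A small numeric check of the projection argument. The values are assumptions chosen for illustration: w = (0.5, 0.5) and b = 0, with x1 = (1, 1) satisfying w.x1 + b = +1 and x2 = (-1, -1) satisfying w.x2 + b = -1. The helper names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def projection_length(v, w):
    """Length of the projection of v onto the direction of w."""
    return dot(v, w) / norm(w)


def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / norm(w)


w = [0.5, 0.5]
x1 = [1.0, 1.0]    # lies on w.x + b = +1 (with b = 0)
x2 = [-1.0, -1.0]  # lies on w.x + b = -1 (with b = 0)

difference = [a - b for a, b in zip(x1, x2)]
projected = projection_length(difference, w)

print(projected)        # projection of x1 - x2 onto w
print(margin_width(w))  # 2 / ||w||
```

The two printed numbers agree, so maximizing the margin amounts to making the norm of w as small as the constraints allow.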

Maximizing the margin

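• Goal: maximize the margin, $\max \frac{2}{\|\vec{w}\|}$, or equivalently $\min \|\vec{w}\|$

Linear separability of the data by the decision boundary: subject to $t_i(\vec{w}^T\vec{x}_i + b) \geq 1$ for every training point $\vec{x}_i$ with label $t_i \in \{-1, +1\}$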
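The constraint set can be made concrete with a feasibility check. This is a sketch on a made-up four-point data set; the function name and the tolerance are assumptions. It tests whether a candidate (w, b) puts every point on or outside its margin hyperplane, i.e. whether t_i (w.x_i + b) >= 1 holds for all i.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_margin_constraints(w, b, X, T, tol=1e-12):
    """True if t_i * (w.x_i + b) >= 1 for every training point."""
    return all(t * (dot(w, x) + b) >= 1.0 - tol for x, t in zip(X, T))


# Made-up, linearly separable data with labels in {-1, +1}.
X = [[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]]
T = [1, 1, -1, -1]

candidates = {
    "small w": ([0.25, 0.25], 0.0),   # too small: points fall inside the margin
    "tight w": ([0.5, 0.5], 0.0),     # feasible, closest points sit on the margin
    "large w": ([1.0, 1.0], 0.0),     # feasible, but a narrower margin
}

for name, (w, b) in candidates.items():
    feasible = satisfies_margin_constraints(w, b, X, T)
    width = 2.0 / math.sqrt(dot(w, w))
    print(name, feasible, round(width, 3))
```

Among the feasible candidates, the one with the smaller norm has the wider margin, which is what the minimization is looking for.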

Max Margin Loss Function

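• For constrained optimization, use Lagrange multipliers

• Optimize the “Primal”: $L(\vec{w}, b) = \frac{1}{2}\vec{w}^T\vec{w} - \sum_i \alpha_i \left[ t_i(\vec{w}^T\vec{x}_i + b) - 1 \right]$, with multipliers $\alpha_i \geq 0$ and labels $t_i \in \{-1, +1\}$

Partial wrt b: $\frac{\partial L}{\partial b} = 0 \Rightarrow \sum_i \alpha_i t_i = 0$

Partial wrt w: $\frac{\partial L}{\partial \vec{w}} = 0 \Rightarrow \vec{w} = \sum_i \alpha_i t_i \vec{x}_i$

Now we have to find the αi. Substitute back into the loss function.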

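• Construct the “dual”

Dual formulation of the error: $W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j t_i t_j \, \vec{x}_i^T\vec{x}_j$, where $\alpha_i \geq 0$ and $\sum_i \alpha_i t_i = 0$

• Optimize this quadratic program to identify the Lagrange multipliers and thus the weights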
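The substitution from primal to dual can be written out step by step. This is a sketch of the standard hard-margin derivation, assuming labels $t_i \in \{-1, +1\}$ and multipliers $\alpha_i \geq 0$; the symbol names are a notational choice and may differ from the original slides.

```latex
\begin{align*}
L(\vec{w}, b, \alpha)
  &= \tfrac{1}{2}\vec{w}^T\vec{w}
     - \sum_i \alpha_i \left[ t_i(\vec{w}^T\vec{x}_i + b) - 1 \right] \\
  &= \tfrac{1}{2}\vec{w}^T\vec{w}
     - \sum_i \alpha_i t_i \vec{w}^T\vec{x}_i
     - b \sum_i \alpha_i t_i
     + \sum_i \alpha_i \\[4pt]
\frac{\partial L}{\partial b} = 0
  &\;\Rightarrow\; \sum_i \alpha_i t_i = 0 \\
\frac{\partial L}{\partial \vec{w}} = 0
  &\;\Rightarrow\; \vec{w} = \sum_i \alpha_i t_i \vec{x}_i \\[4pt]
W(\alpha)
  &= \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j t_i t_j \, \vec{x}_i^T\vec{x}_j
     - \sum_{i,j} \alpha_i \alpha_j t_i t_j \, \vec{x}_i^T\vec{x}_j
     - 0
     + \sum_i \alpha_i \\
  &= \sum_i \alpha_i
     - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j t_i t_j \, \vec{x}_i^T\vec{x}_j
\end{align*}
```

The weights and the bias have dropped out: the only unknowns left are the multipliers, and the data enter only through dot products.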

There exist (rather) fast approaches to quadratic optimization in C, C++, Python, Java and R

Quadratic Programming

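• If Q is positive semidefinite, then the quadratic objective $f(\vec{x}) = \frac{1}{2}\vec{x}^T Q \vec{x} + \vec{c}^T\vec{x}$ is convex.

• If f(x) is convex, then every local minimum is a global minimum. (Maximizing a concave function, as in the SVM dual, is the same problem with the sign flipped.)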
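For the SVM dual, the matrix in question has entries Q_ij = t_i t_j (x_i . x_j), and z^T Q z equals the squared length of sum_i z_i t_i x_i, so it can never be negative. The sketch below spot-checks this numerically on made-up data; it is a sanity check under those assumptions, not a proof, and the helper names are hypothetical.

```python
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_q(X, T):
    """Matrix with entries Q[i][j] = t_i * t_j * (x_i . x_j)."""
    n = len(X)
    return [[T[i] * T[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]


def quadratic_form(Q, z):
    """Evaluate z^T Q z."""
    n = len(z)
    return sum(z[i] * Q[i][j] * z[j] for i in range(n) for j in range(n))


# Made-up data with labels in {-1, +1}.
X = [[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]]
T = [1, 1, -1, -1]
Q = build_q(X, T)

rng = random.Random(0)  # fixed seed so the check is repeatable
values = [
    quadratic_form(Q, [rng.uniform(-1.0, 1.0) for _ in X])
    for _ in range(1000)
]
min_value = min(values)
print(min_value)  # never meaningfully below zero
```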

Support Vector Expansion

• When αi is non-zero then xi is a support vector

• When αi is zero xi is not a support vector

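New decision function: $D(\vec{x}) = \mathrm{sign}\left(\sum_i \alpha_i t_i \, \vec{x}_i^T\vec{x} + b\right)$, where $t_i \in \{-1, +1\}$ is the label of $\vec{x}_i$

Independent of the dimension of x!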
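A minimal sketch of the support vector expansion on a toy problem. Everything here is an assumption for illustration: the three training points, and the multipliers (0.25, 0.25, 0), which were worked out by hand for this particular data set rather than produced by a quadratic programming solver. The function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weights_from_alphas(alphas, T, X):
    """Recover w = sum_i alpha_i * t_i * x_i."""
    dim = len(X[0])
    return [sum(a * t * x[d] for a, t, x in zip(alphas, T, X)) for d in range(dim)]


def decision_dual(alphas, T, X, b, query):
    """Classify using only dot products with the support vectors."""
    score = b
    for a, t, x in zip(alphas, T, X):
        if a != 0.0:  # points with alpha = 0 contribute nothing
            score += a * t * dot(x, query)
    return 1 if score >= 0 else -1


# Toy data: the first two points are the support vectors.
X = [[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]]
T = [1, -1, 1]
alphas = [0.25, 0.25, 0.0]  # hand-derived for this toy set (assumption)
b = 0.0

w = weights_from_alphas(alphas, T, X)
print(w)
print(decision_dual(alphas, T, X, b, [2.0, 0.0]))
print(decision_dual(alphas, T, X, b, [-3.0, 1.0]))
```

The third training point has a zero multiplier, so removing it from the data would change neither the weights nor any decision.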

Kuhn-Tucker Conditions

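• In constrained optimization: at the optimal solution
– Constraint * Lagrange Multiplier = 0, i.e. $\alpha_i \left[ t_i(\vec{w}^T\vec{x}_i + b) - 1 \right] = 0$ for every i

Only points on the margin hyperplanes (the support vectors) contribute to the solution!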
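The condition can be verified on a toy solution. The sketch assumes a three-point data set whose max-margin solution (w = (0.5, 0.5), b = 0, multipliers 0.25, 0.25, 0) was worked out by hand; those values and the function name are illustrative assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def kkt_products(alphas, w, b, X, T):
    """alpha_i * (t_i * (w.x_i + b) - 1) for each training point."""
    return [a * (t * (dot(w, x) + b) - 1.0) for a, x, t in zip(alphas, X, T)]


X = [[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]]
T = [1, -1, 1]
alphas = [0.25, 0.25, 0.0]  # hand-derived for this toy set (assumption)
w, b = [0.5, 0.5], 0.0

products = kkt_products(alphas, w, b, X, T)
constraint_values = [t * (dot(w, x) + b) for x, t in zip(X, T)]
support_indices = [i for i, a in enumerate(alphas) if a > 0.0]

print(constraint_values)  # 1.0 means the point sits on a margin hyperplane
print(products)           # every product is zero at the optimum
print(support_indices)    # indices of the support vectors
```

Each product vanishes for one of two reasons: either the point sits exactly on a margin hyperplane, or its multiplier is zero.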

Visualization of Support Vectors

Interpretability of SVM parameters

• What else can we tell from alphas?
– If alpha is large, then the associated data point is quite important.
– It’s either an outlier, or incredibly important.

• But this only gives us the best solution for linearly separable data sets…

Basis of Kernel Methods

• The decision process doesn’t depend on the dimensionality of the data.
• We can map to a higher dimensionality of the data space.
• Note: data points only appear within a dot product.
• The error is based on the dot product of data points – not the data points themselves.

Basis of Kernel Methods

• Data points only appear within a dot product.
• Thus we can map to another space through a replacement: $\vec{x}_i^T\vec{x}_j \rightarrow \phi(\vec{x}_i)^T\phi(\vec{x}_j)$

• The error is based on the dot product of data points – not the data points themselves.
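To illustrate the replacement, the sketch below uses one specific mapping chosen as an example (it is not taken from the slides): the quadratic feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) for two-dimensional inputs. The dot product in the mapped space equals (x . z)^2, so it can be computed without ever constructing the mapped vectors.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit quadratic feature map for a 2-D input (illustrative choice)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def quadratic_kernel(x, z):
    """Same quantity as dot(phi(x), phi(z)), computed in the original space."""
    return dot(x, z) ** 2


x = [1.0, 2.0]
z = [3.0, -1.0]

explicit = dot(phi(x), phi(z))   # map first, then take the dot product
implicit = quadratic_kernel(x, z)  # dot product first, then square

print(explicit, implicit)
```

The two values agree up to floating-point rounding, which is the property that lets the dot product be swapped out in the dual.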

Learning Theory bases of SVMs

• Theoretical bounds on testing error.
– The upper bound doesn’t depend on the dimensionality of the space
– The upper bound is minimized by maximizing the margin, γ, associated with the decision boundary.

Why we like SVMs

• They work
– Good generalization
• Easily interpreted.
– Decision boundary is based on the data in the form of the support vectors.
– Not so in multilayer perceptron networks

• Principled bounds on testing error from Learning Theory (VC dimension)

SVM vs. MLP

• SVMs have many fewer parameters
– SVM: Maybe just a kernel parameter
– MLP: Number and arrangement of nodes, and the learning rate eta
• SVM: Convex optimization task
– MLP: likelihood is non-convex, so there are local minima

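Soft margin classification

• There can be outliers on the other side of the decision boundary, or outliers that lead to a small margin.
• Solution: Introduce a penalty term to the constraint function. Each point gets a slack $\xi_i \geq 0$, the constraint is relaxed to $t_i(\vec{w}^T\vec{x}_i + b) \geq 1 - \xi_i$, and the total slack is penalized with weight $C$.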

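Soft Margin Dual

$W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j t_i t_j \, \vec{x}_i^T\vec{x}_j$, where $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i t_i = 0$, with $C$ the weight of the penalty term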

Still Quadratic Programming!

• Points are allowed within the margin, but cost is introduced.
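One way to read the cost is through per-point slack: zero for points on or outside the margin, between 0 and 1 for points inside the margin but on the correct side, and above 1 for misclassified points. The sketch below assumes a fixed, made-up boundary and made-up points; `slack`, `describe`, and the penalty weight C are illustrative names and values.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def slack(w, b, x, t):
    """How far the point falls short of t * (w.x + b) >= 1 (zero if it does not)."""
    return max(0.0, 1.0 - t * (dot(w, x) + b))


def describe(xi):
    """Rough reading of a slack value."""
    if xi == 0.0:
        return "on or outside the margin"
    if xi <= 1.0:
        return "inside the margin"
    return "misclassified"


# Made-up boundary and labelled points (assumptions for illustration).
w, b = [0.5, 0.5], 0.0
X = [[3.0, 3.0], [1.0, 0.0], [-1.0, -1.0], [-2.0, 0.0]]
T = [1, 1, 1, -1]

slacks = [slack(w, b, x, t) for x, t in zip(X, T)]
C = 1.0  # penalty weight (assumed value)
penalty = C * sum(slacks)

for x, xi in zip(X, slacks):
    print(x, xi, describe(xi))
print(penalty)
```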

Soft margin example

Hinge Loss
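The hinge loss for a label t in {-1, +1} and a raw score y = w.x + b is max(0, 1 - t y). The sketch below shows the loss and uses it in the soft-margin objective 0.5 ||w||^2 + C * (sum of hinge losses). Note the swap: instead of the quadratic programming route taken in these slides, it minimizes that objective with plain subgradient descent, purely to keep the example short. The data, step size, and epoch count are assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hinge_loss(t, y):
    """Zero once the point is beyond the margin, linear in the shortfall otherwise."""
    return max(0.0, 1.0 - t * y)


def objective(w, b, X, T, C):
    """Soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    return 0.5 * dot(w, w) + C * sum(
        hinge_loss(t, dot(w, x) + b) for x, t in zip(X, T)
    )


def train(X, T, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on the soft-margin objective (not a QP solver)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of the 0.5 * ||w||^2 term
        for x, t in zip(X, T):
            if t * (dot(w, x) + b) < 1.0:  # point has non-zero hinge loss
                for d in range(len(w)):
                    grad_w[d] -= C * t * x[d]
                grad_b -= C * t
        w = [wd - lr * gd for wd, gd in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Made-up, well-separated training data.
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
T = [1, 1, 1, -1, -1, -1]

w, b = train(X, T)
predictions = [1 if dot(w, x) + b >= 0 else -1 for x in X]
print(predictions)
print(objective([0.0, 0.0], 0.0, X, T, 1.0), objective(w, b, X, T, 1.0))
```

With a fixed step size the weights hover around the optimum rather than settling exactly on it, which is one reason a proper QP solver is normally preferred.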

Probabilities from SVMs

• Support Vector Machines are discriminant functions

– Discriminant functions: f(x) = c
– Discriminative models: f(x) = argmax_c p(c|x)
– Generative models: f(x) = argmax_c p(x|c)p(c)/p(x)

• No (principled) probabilities from SVMs
• SVMs are not based on probability distribution functions of class instances.

Efficiency of SVMs

• Not especially fast.
• Training – O(n^3)
– Quadratic Programming efficiency
• Evaluation – O(n)
– Need to evaluate against each support vector (potentially n)

Good Bye

• Next time:
– The Kernel “Trick” -> Kernel Methods
– or
– How can we use SVMs on data that are not linearly separable?
