SVM


Transcript of SVM

  • 1

    Data Mining: Concepts and Techniques

    Chapter 9: Advanced Classification Methods

    Support Vector Machines

    ©2013 Han, Kamber & Pei. All rights reserved.

  • Classification

    - Assign an input vector to one of two or more classes
    - Any decision rule divides the input space into decision regions separated by decision boundaries

  • 3

    Classification as Mathematical Mapping

    - Classification: Predict a categorical class label y ∈ Y for x ∈ X
    - Learning: Derive a function f: X → Y
    - 2-class classification, e.g., job-page classification:
      - y ∈ {+1, -1}
      - x ∈ R^n
      - xi = (xi1, xi2, xi3, ...), where n = number of distinct word features and xij = tf-idf weight of word j in document i

  • 4

    SVM: History and Applications

    - SVMs were introduced by Vapnik and colleagues in 1992.
    - Theoretically well-motivated algorithm: developed from statistical learning theory since the 1960s.
    - Empirically good performance: successful applications in many fields (bioinformatics, text, image recognition, ...)
    - Used for: classification and numeric prediction
    - Features: training can be slow, but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)

  • 5

    SVM: General Philosophy

    [Figure: two separating hyperplanes with their support vectors, one with a small margin and one with a large margin]

  • 6

    SVM: Support Vector Machines

    - It uses a nonlinear mapping to transform the original training data into a higher dimension.
    - In the new dimension, it searches for the linear optimal separating hyperplane (i.e., decision boundary).
    - With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.
    - SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors).

  • 7

    SVM: When Data Is Linearly Separable

    - A separating hyperplane can be written as
      - W · X + b = 0
      - where W = {w1, w2, ..., wn} is a weight vector and b a scalar (bias)
    - For 2-D data it can be written as
      - w0 + w1 x1 + w2 x2 = 0
    - The hyperplanes defining the sides of the margin:
      - H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
      - H2: w0 + w1 x1 + w2 x2 ≤ -1 for yi = -1
    - Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors

  • SVM: Linearly Separable

    8

    [Figure: a separating hyperplane, its margin, and the support vectors]

    Distance between a point xi and the hyperplane: |w · xi + b| / ||w||

    Therefore, the margin is 2/||w|| (a short derivation sketch follows below).

    There are infinitely many hyperplanes separating the two classes, but we want to find the best one: the one that minimizes classification error on unseen data.

    SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).
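The margin value quoted above follows directly from the distance formula and the H1/H2 hyperplanes of the previous slide; a brief derivation sketch:

```latex
d(x_i) = \frac{|w \cdot x_i + b|}{\lVert w \rVert}, \qquad
H_1:\ w \cdot x + b = +1, \qquad H_2:\ w \cdot x + b = -1
\;\Rightarrow\;
\text{margin} = \frac{1}{\lVert w \rVert} + \frac{1}{\lVert w \rVert}
             = \frac{2}{\lVert w \rVert}
```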

  • Finding the maximum margin hyperplane

    9

    - Maximize the margin 2/||w||
    - Correctly classify all training data:
      - w · xi + b ≥ +1 for positive examples (yi = +1)
      - w · xi + b ≤ -1 for negative examples (yi = -1)
    - Quadratic optimization problem:
      - Minimize (1/2) wᵀw
      - subject to yi(w · xi + b) ≥ 1 for all i

    (A small numerical sketch of this optimization follows below.)
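To make the quadratic program concrete, here is a minimal sketch (not part of the original slides) that solves the hard-margin primal problem on a tiny toy dataset with scipy's general-purpose constrained optimizer; the data and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable 2-D data (illustrative only).
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.5, 0.5], [1.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(p):
    w = p[:-1]                      # p = (w1, ..., wn, b)
    return 0.5 * np.dot(w, w)       # (1/2) w^T w

# One inequality constraint per training point: y_i (w . x_i + b) - 1 >= 0
constraints = [
    {"type": "ineq",
     "fun": lambda p, xi=xi, yi=yi: yi * (np.dot(p[:-1], xi) + p[-1]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```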

  • Finding the maximum margin hyperplane

    10

    - Solution: w = Σi αi yi xi, where the xi with αi > 0 are the support vectors and the αi are the learned weights
    - Classification function: f(x) = sign(w · x + b) = sign(Σi αi yi xi · x + b)
    - Notice the inner product between the test point x and the support vectors xi, used as a measure of similarity (a short code sketch follows below).
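A minimal sketch of this classification function (names are illustrative, not from the slides): given the support vectors, their learned weights αi, their labels yi, and the bias b, the prediction is the sign of the weighted sum of inner products.

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b):
    # Inner products between the test point and each support vector act
    # as the similarity scores mentioned above; weight them by alpha_i y_i.
    return np.sign(np.sum(alphas * labels * (support_vectors @ x)) + b)
```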

  • 11

    Why Is SVM Effective on High Dimensional Data?

    - The complexity of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data
    - The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH)
    - If all other training examples were removed and training were repeated, the same separating hyperplane would be found
    - The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
    - Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high

  • Nonlinear SVMs

    12

    - Datasets that are linearly separable work out great.
    - But what if the dataset is just too hard?
    - Can we map it to a higher-dimensional space?

    [Figure: 1-D example datasets plotted along the x axis]

    Slide credit: Andrew Moore

  • Nonlinear SVMs

    13

    - General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable.

    Φ: x → φ(x)

    Slide credit: Andrew Moore

  • The Kernel Trick

    14

    - Instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
      - K(xi, xj) = φ(xi) · φ(xj)
      - K must satisfy Mercer's condition
    - This gives a nonlinear decision boundary in the original feature space (a short code sketch follows below):

      f(x) = sign(Σi αi yi φ(xi) · φ(x) + b) = sign(Σi αi yi K(xi, x) + b)
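The same decision function can be written with an arbitrary kernel K in place of the inner product; a minimal sketch, with an assumed RBF kernel as the default choice of K (all names are illustrative):

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    # Gaussian RBF kernel; gamma is an assumed, user-chosen parameter.
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_svm_decision(x, support_vectors, alphas, labels, b, K=rbf_kernel):
    # Same decision function as before, with K(x_i, x) replacing x_i . x.
    score = sum(a * y * K(sv, x) for a, y, sv in zip(alphas, labels, support_vectors))
    return np.sign(score + b)
```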

  • Nonlinear Kernel Example

    15

    - Consider the mapping φ(x) = (x, x²)
    - Then φ(x) · φ(y) = (x, x²) · (y, y²) = xy + x²y² = K(x, y)

    (A quick numeric check follows below.)

    [Figure: the data plotted against x and x²]
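A quick numeric check of this example (illustrative only): the explicit lift φ(x) = (x, x²) reproduces the kernel value xy + x²y².

```python
# Verify phi(x) . phi(y) == K(x, y) for the example mapping above.
def phi(x):
    return (x, x * x)

def K(x, y):
    return x * y + (x * x) * (y * y)

x, y = 3.0, -2.0
dot = phi(x)[0] * phi(y)[0] + phi(x)[1] * phi(y)[1]   # phi(x) . phi(y)
assert abs(dot - K(x, y)) < 1e-12                     # both sides equal 30.0
```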

  • Kernels for Bags of Features

    16

    - Histogram intersection kernel:

      I(h1, h2) = Σ_{i=1}^{N} min(h1(i), h2(i))

    - Generalized Gaussian kernel:

      K(h1, h2) = exp(-(1/A) · D(h1, h2)²)

    - D can be the L1 distance, Euclidean distance, χ² distance, etc.

    (A short code sketch of both kernels follows below.)
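A minimal sketch of the two kernels above for histogram features; A is the normalization constant from the formula, and D here defaults to the L1 distance (one of the choices listed):

```python
import numpy as np

def histogram_intersection(h1, h2):
    # I(h1, h2) = sum_i min(h1(i), h2(i))
    return np.sum(np.minimum(h1, h2))

def generalized_gaussian(h1, h2, A=1.0, D=None):
    # K(h1, h2) = exp(-(1/A) * D(h1, h2)^2); D defaults to the L1 distance.
    if D is None:
        D = lambda a, b: np.sum(np.abs(a - b))
    return np.exp(-D(h1, h2) ** 2 / A)
```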

  • 17

    More Kernels for Nonlinear Classification

    - Polynomial kernel of degree h
    - Gaussian radial basis function (RBF) kernel
    - Sigmoid kernel

    (The commonly used formulas for these kernels are given below.)
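The slide lists these kernels by name only; the commonly used forms (with kernel parameters h, σ, κ, δ; not reproduced from the original slide graphics) are:

```latex
K(X_i, X_j) = (X_i \cdot X_j + 1)^h                        % polynomial of degree h
K(X_i, X_j) = e^{-\lVert X_i - X_j \rVert^2 / 2\sigma^2}   % Gaussian RBF
K(X_i, X_j) = \tanh(\kappa\, X_i \cdot X_j - \delta)       % sigmoid
```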

  • 18

    Scaling SVM by Hierarchical Micro-Clustering

    - SVM is not scalable in the number of data objects in terms of training time and memory usage
    - H. Yu, J. Yang, and J. Han, "Classifying Large Data Sets Using SVM with Hierarchical Clusters" (KDD'03)
    - CB-SVM (Clustering-Based SVM)
      - Given a limited amount of system resources (e.g., memory), maximize the SVM performance in terms of accuracy and training speed
      - Use micro-clustering to effectively reduce the number of points to be considered
      - When deriving support vectors, de-cluster micro-clusters near the candidate vectors to ensure high classification accuracy

  • 19

    CF-Tree: Hierarchical Micro-cluster

    - Read the data set once; construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory
    - Micro-clustering: hierarchical indexing structure
      - provides finer samples closer to the boundary and coarser samples farther from the boundary

  • 20

    Selective Declustering: Ensure High Accuracy

    - The CF-tree is a suitable base structure for selective declustering
    - De-cluster only a cluster Ei such that
      - Di - Ri < Ds, where Di is the distance from the boundary to the center point of Ei and Ri is the radius of Ei
      - i.e., de-cluster only the clusters whose subclusters could be support clusters of the boundary
        - Support cluster: a cluster whose centroid is a support vector

  • 21

    CB-SVM Algorithm: Outline

    - Construct two CF-trees from the positive and negative data sets independently
      - needs one scan of the data set
    - Train an SVM from the centroids of the root entries
    - De-cluster the entries near the boundary into the next level
      - the child entries de-clustered from the parent entries are accumulated into the training set, together with the non-declustered parent entries
    - Train an SVM again from the centroids of the entries in the training set
    - Repeat until nothing is accumulated

    (A schematic sketch of this loop follows below.)
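A schematic sketch of this loop, under stated assumptions: the helpers build_cf_tree, root_entries, centroid, children, train_svm, and near_boundary are hypothetical names, not the authors' code.

```python
def cb_svm(positive_data, negative_data):
    # One CF-tree per class; each needs a single scan of its data set.
    pos_tree = build_cf_tree(positive_data)
    neg_tree = build_cf_tree(negative_data)
    entries = [(e, +1) for e in pos_tree.root_entries()] + \
              [(e, -1) for e in neg_tree.root_entries()]
    while True:
        # Train on the centroids of the current entries.
        svm = train_svm([(e.centroid, label) for e, label in entries])
        # Entries near the boundary are de-clustered into their children.
        near = [(e, label) for e, label in entries
                if near_boundary(e, svm) and e.children()]
        children = [(c, label) for e, label in near for c in e.children()]
        if not children:          # nothing accumulated: stop
            return svm
        entries = [pair for pair in entries if pair not in near] + children
```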

  • 22

    Accuracy and Scalability on Synthetic Dataset

    - Experiments on large synthetic data sets show better accuracy than random-sampling approaches and far better scalability than the original SVM algorithm

  • 23

    SVM vs. Neural Network

    - SVM
      - Deterministic algorithm
      - Nice generalization properties
      - Hard to learn: learned in batch mode using quadratic programming techniques
      - Using kernels, can learn very complex functions
    - Neural Network
      - Nondeterministic algorithm
      - Generalizes well but doesn't have a strong mathematical foundation
      - Can easily be learned in incremental fashion
      - To learn complex functions, use a multilayer perceptron (nontrivial)

  • 24

    SVM Related Links

    - SVM website: http://www.kernel-machines.org/
    - Representative implementations (a minimal usage example follows below)
      - LIBSVM: an efficient implementation of SVM with multi-class classification, nu-SVM, and one-class SVM, including various interfaces (Java, Python, etc.)
      - SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only C
      - SVM-torch: another recent implementation, also written in C
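As a minimal usage example, scikit-learn's SVC class wraps LIBSVM, so an RBF-kernel SVM can be trained in a few lines (the toy dataset is chosen purely for illustration):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy nonlinear two-class dataset.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# LIBSVM-backed SVM with a Gaussian RBF kernel.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
```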