SVM


Transcript of SVM

  • 1

    Data Mining: Concepts and Techniques

    Chapter 9: Advanced Classification Methods

    Support Vector Machines

    ©2013 Han, Kamber & Pei. All rights reserved.

  • Classification

    - Assign an input vector to one of two or more classes
    - Any decision rule divides the input space into decision regions separated by decision boundaries

  • 3

    Classification as Mathematical Mapping

    - Classification: Predict a categorical class label y ∈ Y for x ∈ X
    - Learning: Derive a function f: X → Y
    - 2-class classification, e.g., job-page classification:
      - y ∈ {+1, -1}
      - x ∈ R^n
      - xi = (xi1, xi2, xi3, ...), where n = number of distinct word features and xij = tf-idf weight of word j in document i

  • 4

    SVM: History and Applications

    - SVMs were introduced by Vapnik and colleagues in 1992.
    - Theoretically well-motivated algorithm: developed from statistical learning theory since the 1960s.
    - Empirically good performance: successful applications in many fields (bioinformatics, text, image recognition, ...)
    - Used for: classification and numeric prediction
    - Features: training can be slow, but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)

  • 5

    SVM: General Philosophy

    [Figure: two separating hyperplanes with their support vectors, one with a small margin and one with a large margin]

  • 6

    SVM: Support Vector Machines

    - It uses a nonlinear mapping to transform the original training data into a higher dimension.
    - In the new dimension, it searches for the linear optimal separating hyperplane (i.e., decision boundary).
    - With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.
    - SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors).

  • 7

    SVM: When Data Is Linearly Separable

    - A separating hyperplane can be written as
      - W · X + b = 0
      - where W = {w1, w2, ..., wn} is a weight vector and b a scalar (bias)
    - For 2-D data it can be written as
      - w0 + w1 x1 + w2 x2 = 0
    - The hyperplanes defining the sides of the margin:
      - H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
      - H2: w0 + w1 x1 + w2 x2 ≤ -1 for yi = -1
    - Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors

  • SVM: Linearly Separable

    8

    [Figure: a separating hyperplane, its margin, and the support vectors]

    Distance between a point xi and the hyperplane: |w · xi + b| / ||w||

    Therefore, the margin is 2/||w|| (a short derivation sketch follows below).

    There are infinitely many hyperplanes separating the two classes, but we want to find the best one: the one that minimizes classification error on unseen data.

    SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).
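The margin value quoted above follows directly from the distance formula and the H1/H2 hyperplanes of the previous slide; a brief derivation sketch:

```latex
d(x_i) = \frac{|w \cdot x_i + b|}{\lVert w \rVert}, \qquad
H_1:\ w \cdot x + b = +1, \qquad H_2:\ w \cdot x + b = -1
\;\Rightarrow\;
\text{margin} = \frac{1}{\lVert w \rVert} + \frac{1}{\lVert w \rVert}
             = \frac{2}{\lVert w \rVert}
```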

  • Finding the maximum margin hyperplane

    9

    - Maximize the margin 2/||w||
    - Correctly classify all training data:
      - w · xi + b ≥ +1 for positive examples (yi = +1)
      - w · xi + b ≤ -1 for negative examples (yi = -1)
    - Quadratic optimization problem:
      - Minimize (1/2) wᵀw
      - subject to yi(w · xi + b) ≥ 1 for all i

    (A small numerical sketch of this optimization follows below.)
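To make the quadratic program concrete, here is a minimal sketch (not part of the original slides) that solves the hard-margin primal problem on a tiny toy dataset with scipy's general-purpose constrained optimizer; the data and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable 2-D data (illustrative only).
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.5, 0.5], [1.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(p):
    w = p[:-1]                      # p = (w1, ..., wn, b)
    return 0.5 * np.dot(w, w)       # (1/2) w^T w

# One inequality constraint per training point: y_i (w . x_i + b) - 1 >= 0
constraints = [
    {"type": "ineq",
     "fun": lambda p, xi=xi, yi=yi: yi * (np.dot(p[:-1], xi) + p[-1]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```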

  • Finding the maximum margin hyperplane

    10

    - Solution: w = Σi αi yi xi, where the xi with αi > 0 are the support vectors and the αi are the learned weights
    - Classification function: f(x) = sign(w · x + b) = sign(Σi αi yi xi · x + b)
    - Notice the inner product between the test point x and the support vectors xi, used as a measure of similarity (a short code sketch follows below).
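A minimal sketch of this classification function (names are illustrative, not from the slides): given the support vectors, their learned weights αi, their labels yi, and the bias b, the prediction is the sign of the weighted sum of inner products.

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b):
    # Inner products between the test point and each support vector act
    # as the similarity scores mentioned above; weight them by alpha_i y_i.
    return np.sign(np.sum(alphas * labels * (support_vectors @ x)) + b)
```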

  • 11

    Why Is SVM Effective on High Dimensional Data?

    - The complexity of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data
    - The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH)
    - If all other training examples were removed and training were repeated, the same separating hyperplane would be found
    - The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
    - Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high

  • Nonlinear SVMs

    12

    - Datasets that are linearly separable work out great.
    - But what if the dataset is just too hard?
    - Can we map it to a higher-dimensional space?

    [Figure: 1-D example datasets plotted along the x axis]

    Slide credit: Andrew Moore

  • Nonlinear SVMs

    13

    - General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable.

    Φ: x → φ(x)

    Slide credit: Andrew Moore

  • The Kernel Trick

    14

    - Instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
      - K(xi, xj) = φ(xi) · φ(xj)
      - K must satisfy Mercer's condition
    - This gives a nonlinear decision boundary in the original feature space (a short code sketch follows below):

      f(x) = sign(Σi αi yi φ(xi) · φ(x) + b) = sign(Σi αi yi K(xi, x) + b)
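The same decision function can be written with an arbitrary kernel K in place of the inner product; a minimal sketch, with an assumed RBF kernel as the default choice of K (all names are illustrative):

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    # Gaussian RBF kernel; gamma is an assumed, user-chosen parameter.
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_svm_decision(x, support_vectors, alphas, labels, b, K=rbf_kernel):
    # Same decision function as before, with K(x_i, x) replacing x_i . x.
    score = sum(a * y * K(sv, x) for a, y, sv in zip(alphas, labels, support_vectors))
    return np.sign(score + b)
```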

  • Nonlinear Kernel Example

    15

    - Consider the mapping φ(x) = (x, x²)
    - Then φ(x) · φ(y) = (x, x²) · (y, y²) = xy + x²y² = K(x, y)

    (A quick numeric check follows below.)

    [Figure: the data plotted against x and x²]
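A quick numeric check of this example (illustrative only): the explicit lift φ(x) = (x, x²) reproduces the kernel value xy + x²y².

```python
# Verify phi(x) . phi(y) == K(x, y) for the example mapping above.
def phi(x):
    return (x, x * x)

def K(x, y):
    return x * y + (x * x) * (y * y)

x, y = 3.0, -2.0
dot = phi(x)[0] * phi(y)[0] + phi(x)[1] * phi(y)[1]   # phi(x) . phi(y)
assert abs(dot - K(x, y)) < 1e-12                     # both sides equal 30.0
```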

  • Kernels for Bags of Features

    16

    - Histogram intersection kernel:

      I(h1, h2) = Σ_{i=1}^{N} min(h1(i), h2(i))

    - Generalized Gaussian kernel:

      K(h1, h2) = exp(-(1/A) · D(h1, h2)²)

    - D can be the L1 distance, Euclidean distance, χ² distance, etc.

    (A short code sketch of both kernels follows below.)
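A minimal sketch of the two kernels above for histogram features; A is the normalization constant from the formula, and D here defaults to the L1 distance (one of the choices listed):

```python
import numpy as np

def histogram_intersection(h1, h2):
    # I(h1, h2) = sum_i min(h1(i), h2(i))
    return np.sum(np.minimum(h1, h2))

def generalized_gaussian(h1, h2, A=1.0, D=None):
    # K(h1, h2) = exp(-(1/A) * D(h1, h2)^2); D defaults to the L1 distance.
    if D is None:
        D = lambda a, b: np.sum(np.abs(a - b))
    return np.exp(-D(h1, h2) ** 2 / A)
```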

  • 17

    More Kernels for Nonlinear Classification

    - Polynomial kernel of degree h
    - Gaussian radial basis function (RBF) kernel
    - Sigmoid kernel

    (The commonly used formulas for these kernels are given below.)
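The slide lists these kernels by name only; the commonly used forms (with kernel parameters h, σ, κ, δ; not reproduced from the original slide graphics) are:

```latex
K(X_i, X_j) = (X_i \cdot X_j + 1)^h                        % polynomial of degree h
K(X_i, X_j) = e^{-\lVert X_i - X_j \rVert^2 / 2\sigma^2}   % Gaussian RBF
K(X_i, X_j) = \tanh(\kappa\, X_i \cdot X_j - \delta)       % sigmoid
```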

  • 18

    Scaling SVM by Hierarchical Micro-Clustering

    - SVM is not scalable in the number of data objects in terms of training time and memory usage
    - H. Yu, J. Yang, and J. Han, "Classifying Large Data Sets Using SVM with Hierarchical Clusters" (KDD'03)
    - CB-SVM (Clustering-Based SVM)
      - Given a limited amount of system resources (e.g., memory), maximize the SVM performance in terms of accuracy and training speed
      - Use micro-clustering to effectively reduce the number of points to be considered
      - When deriving support vectors, de-cluster micro-clusters near the candidate vectors to ensure high classification accuracy

  • 19

    CF-Tree: Hierarchical Micro-cluster

    - Read the data set once; construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory
    - Micro-clustering: hierarchical indexing structure
      - provides finer samples closer to the boundary and coarser samples farther from the boundary

  • 20

    Selective Declustering: Ensure High Accuracy

    - The CF-tree is a suitable base structure for selective declustering
    - De-cluster only a cluster Ei such that
      - Di - Ri < Ds, where Di is the distance from the boundary to the center point of Ei and Ri is the radius of Ei
      - i.e., de-cluster only the clusters whose subclusters could be support clusters of the boundary
        - Support cluster: a cluster whose centroid is a support vector

  • 21

    CB-SVM Algorithm: Outline

    - Construct two CF-trees from the positive and negative data sets independently
      - needs one scan of the data set
    - Train an SVM from the centroids of the root entries
    - De-cluster the entries near the boundary into the next level
      - the child entries de-clustered from the parent entries are accumulated into the training set, together with the non-declustered parent entries
    - Train an SVM again from the centroids of the entries in the training set
    - Repeat until nothing is accumulated

    (A schematic sketch of this loop follows below.)
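A schematic sketch of this loop, under stated assumptions: the helpers build_cf_tree, root_entries, centroid, children, train_svm, and near_boundary are hypothetical names, not the authors' code.

```python
def cb_svm(positive_data, negative_data):
    # One CF-tree per class; each needs a single scan of its data set.
    pos_tree = build_cf_tree(positive_data)
    neg_tree = build_cf_tree(negative_data)
    entries = [(e, +1) for e in pos_tree.root_entries()] + \
              [(e, -1) for e in neg_tree.root_entries()]
    while True:
        # Train on the centroids of the current entries.
        svm = train_svm([(e.centroid, label) for e, label in entries])
        # Entries near the boundary are de-clustered into their children.
        near = [(e, label) for e, label in entries
                if near_boundary(e, svm) and e.children()]
        children = [(c, label) for e, label in near for c in e.children()]
        if not children:          # nothing accumulated: stop
            return svm
        entries = [pair for pair in entries if pair not in near] + children
```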

  • 22

    Accuracy and Scalability on Synthetic Dataset

    - Experiments on large synthetic data sets show better accuracy than random-sampling approaches and far better scalability than the original SVM algorithm

  • 23

    SVM vs. Neural Network

    - SVM
      - Deterministic algorithm
      - Nice generalization properties
      - Hard to learn: learned in batch mode using quadratic programming techniques
      - Using kernels, can learn very complex functions
    - Neural Network
      - Nondeterministic algorithm
      - Generalizes well but doesn't have a strong mathematical foundation
      - Can easily be learned in incremental fashion
      - To learn complex functions, use a multilayer perceptron (nontrivial)

  • 24

    SVM Related Links

    - SVM website: http://www.kernel-machines.org/
    - Representative implementations (a minimal usage example follows below)
      - LIBSVM: an efficient implementation of SVM with multi-class classification, nu-SVM, and one-class SVM, including various interfaces (Java, Python, etc.)
      - SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only C
      - SVM-torch: another recent implementation, also written in C
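As a minimal usage example, scikit-learn's SVC class wraps LIBSVM, so an RBF-kernel SVM can be trained in a few lines (the toy dataset is chosen purely for illustration):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Toy nonlinear two-class dataset.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# LIBSVM-backed SVM with a Gaussian RBF kernel.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
```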