Transcript of Epsrcws08 Campbell Isvm 01


    Introduction to Support Vector Machines

    Colin Campbell,

    Bristol University


    Outline of talk.

    Part 1. An Introduction to SVMs

    1.1. SVMs for binary classification.

    1.2. Soft margins and multi-class classification.

    1.3. SVMs for regression.


    Part 2. General kernel methods

    2.1 Kernel methods based on linear programming and

    other approaches.

    2.2 Training SVMs and other kernel machines.

    2.3 Model Selection.

    2.4 Different types of kernels.

    2.5 SVMs in practice.


    Advantages of SVMs

    A principled approach to classification, regression or

    novelty detection tasks.

    SVMs exhibit good generalisation.

    The hypothesis has an explicit dependence on the data
    (via the support vectors), hence the model can be readily
    interpreted.


    Learning involves optimisation of a convex function

    (no false minima, unlike a neural network).

    Few parameters are required for tuning the learning
    machine (unlike a neural network, where the architecture
    and various parameters must be found).

    Can implement confidence measures, etc.


    1.1 SVMs for Binary Classification.

    Preliminaries: Consider a binary classification problem: input vectors are $\mathbf{x}_i$ and $y_i = \pm 1$ are the targets or labels.

    The index i labels the pattern pairs (i = 1, . . . , m).

    The xi define a space of labelled points called input space.


    From the perspective of statistical learning theory the

    motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error.

    These generalization bounds have two important features:


    1. the upper bound on the generalization error does not

    depend on the dimensionality of the space.

    2. the bound is minimized by maximizing the margin, $\gamma$,

    i.e. the minimal distance between the hyperplane

    separating the two classes and the closest datapoints of

    each class.


    [Figure: a separating hyperplane between the two classes and its margin.]


    In an arbitrary-dimensional space a separating
    hyperplane can be written:

    $$\mathbf{w} \cdot \mathbf{x} + b = 0$$

    where $b$ is the bias and $\mathbf{w}$ the weights.

    Thus we will consider a decision function of the form:

    $$D(\mathbf{x}) = \mathrm{sign}\left( \mathbf{w} \cdot \mathbf{x} + b \right)$$
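    To make this concrete, here is a minimal NumPy sketch of such a
    linear decision function (the weight vector w and bias b below are
    illustrative values, not the result of any training):

    ```python
    import numpy as np

    def decision(x, w, b):
        """Linear decision function D(x) = sign(w . x + b)."""
        return np.sign(np.dot(w, x) + b)

    # Illustrative parameters only (not trained values).
    w = np.array([2.0, -1.0])
    b = 0.5

    print(decision(np.array([1.0, 0.0]), w, b))   # prints  1.0
    print(decision(np.array([-1.0, 1.0]), w, b))  # prints -1.0
    ```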


    We note that the argument in $D(\mathbf{x})$ is invariant under a
    rescaling: $\mathbf{w} \rightarrow \lambda\mathbf{w}$, $b \rightarrow \lambda b$.

    We will implicitly fix a scale with:


    $$\mathbf{w} \cdot \mathbf{x} + b = +1$$

    $$\mathbf{w} \cdot \mathbf{x} + b = -1$$

    for the support vectors (canonical hyperplanes).


    Thus:

    $$\mathbf{w} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = 2$$

    for two support vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ on either side of the
    separating hyperplane.


    The margin will be given by the projection of the vector
    $(\mathbf{x}_1 - \mathbf{x}_2)$ onto the normal vector to the hyperplane, i.e.
    $\mathbf{w}/\lVert\mathbf{w}\rVert$, from which we deduce that the margin is given
    by $\gamma = 1/\lVert\mathbf{w}\rVert_2$.


    [Figure: the separating hyperplane and margin, with support vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ on either side.]


    Maximisation of the margin is thus equivalent to

    minimisation of the functional:

    $$\Phi(\mathbf{w}) = \frac{1}{2}\left( \mathbf{w} \cdot \mathbf{w} \right)$$

    subject to the constraints:

    $$y_i \left[ (\mathbf{w} \cdot \mathbf{x}_i) + b \right] \geq 1$$


    Thus the task is to find an optimum of the primal

    objective function:

    $$L(\mathbf{w}, b) = \frac{1}{2}\left( \mathbf{w} \cdot \mathbf{w} \right) - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( (\mathbf{w} \cdot \mathbf{x}_i) + b \right) - 1 \right]$$

    Solving the saddle point equation $\partial L / \partial b = 0$ gives:

    $$\sum_{i=1}^{m} \alpha_i y_i = 0$$


    and $\partial L / \partial \mathbf{w} = 0$ gives:

    $$\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i$$

    which, when substituted back into $L(\mathbf{w}, b, \alpha)$, tells us that
    we should maximise the functional (the Wolfe dual):


    $$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left( \mathbf{x}_i \cdot \mathbf{x}_j \right)$$

    subject to the constraints:

    $$\alpha_i \geq 0$$

    (the $\alpha_i$ are Lagrange multipliers)


    and:

    $$\sum_{i=1}^{m} \alpha_i y_i = 0$$

    The decision function is then:

    $$D(\mathbf{z}) = \mathrm{sign}\left( \sum_{j=1}^{m} \alpha_j y_j \left( \mathbf{x}_j \cdot \mathbf{z} \right) + b \right)$$


    So far we don't know how to handle non-separable
    datasets.

    To answer this problem we will exploit the first point
    mentioned above: the theoretical generalisation bounds
    do not depend on the dimensionality of the space.


    For the dual objective function we notice that the

    datapoints, xi, only appear inside an inner product.

    To get a better representation of the data we can therefore
    map the datapoints into an alternative, higher-dimensional
    space through the replacement:

    $$\mathbf{x}_i \cdot \mathbf{x}_j \rightarrow \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$$


    i.e. we have used a mapping $\mathbf{x}_i \rightarrow \Phi(\mathbf{x}_i)$.

    This higher dimensional space is called Feature Space and

    must be a Hilbert Space (the concept of an inner product

    applies).


    The function $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$ will be called a kernel, so:

    $$\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$$

    The kernel is therefore the inner product between

    mapped pairs of points in Feature Space.


    What is the mapping relation?

    In fact we do not need to know the form of this mapping

    since it will be implicitly defined by the functional form

    of the mapped inner product in Feature Space.
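    As a concrete illustration (not from the talk itself): for the
    quadratic kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$ on two-dimensional inputs an
    explicit mapping is $\Phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and the sketch
    below simply verifies numerically that the kernel equals the mapped
    inner product:

    ```python
    import numpy as np

    def phi(x):
        """Explicit feature map for the quadratic kernel (x . z)^2 in 2-D."""
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    def k_quad(x, z):
        """Quadratic kernel evaluated directly in input space."""
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    # Both numbers agree: the kernel is an inner product in feature space.
    print(k_quad(x, z))            # 1.0
    print(np.dot(phi(x), phi(z)))  # 1.0 (up to rounding)
    ```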


    Different choices for the kernel $K(\mathbf{x}, \mathbf{x}')$ define different
    Hilbert spaces to use. A large number of Hilbert spaces
    are possible, with RBF kernels:

    $$K(\mathbf{x}, \mathbf{x}') = e^{-\lVert \mathbf{x} - \mathbf{x}' \rVert^2 / 2\sigma^2}$$

    and polynomial kernels:

    $$K(\mathbf{x}, \mathbf{x}') = \left( \langle \mathbf{x}, \mathbf{x}' \rangle + 1 \right)^d$$

    being popular choices.
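    A minimal NumPy sketch of these two kernels (the bandwidth sigma
    and degree d defaults below are arbitrary illustrative choices):

    ```python
    import numpy as np

    def rbf_kernel(x, z, sigma=1.0):
        """RBF kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    def poly_kernel(x, z, d=2):
        """Polynomial kernel K(x, z) = (<x, z> + 1)^d."""
        return (np.dot(x, z) + 1.0) ** d

    x = np.array([1.0, 0.0])
    z = np.array([0.0, 1.0])
    print(rbf_kernel(x, z))   # exp(-1) ~ 0.368
    print(poly_kernel(x, z))  # (0 + 1)^2 = 1.0
    ```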


    Legitimate kernels are not restricted to closed-form
    functions: they may, for example, be defined by an algorithm.

    (1) String kernels

    Consider the following text strings: 'car', 'cat', 'cart', 'chart'.
    They have certain similarities despite being of unequal
    length. 'dug' is of equal length to 'car' and 'cat' but differs
    in all letters.


    A measure of the degree of alignment of two text strings

    or sequences can be found using algorithmic techniques

    e.g. edit codes (dynamic programming).

    Frequent applications in
    bioinformatics, e.g. the Smith-Waterman and
    Waterman-Eggert algorithms (strings over a 4-letter
    alphabet ... ACCGTATGTAAA ... from genomes),
    text processing (e.g. the WWW), etc.
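    As a sketch of the kind of algorithmic similarity meant here, the
    following computes plain Levenshtein edit distance by dynamic
    programming (a simpler relative of the Smith-Waterman alignment
    used in bioinformatics, shown only as an illustration):

    ```python
    def edit_distance(a, b):
        """Levenshtein edit distance computed by dynamic programming."""
        # dp[i][j] = minimum number of edits turning a[:i] into b[:j].
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            dp[i][0] = i
        for j in range(len(b) + 1):
            dp[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
        return dp[len(a)][len(b)]

    # 'car' and 'cat' are close; 'dug' differs from 'car' in every letter.
    print(edit_distance("car", "cat"))    # 1
    print(edit_distance("car", "chart"))  # 2
    print(edit_distance("car", "dug"))    # 3
    ```

    Note that a similarity score built from such a distance is not
    automatically a legitimate kernel; it still has to satisfy Mercer's
    condition, discussed below.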


    (2) Kernels for graphs/networks

    Diffusion kernel (Kondor and Lafferty)


    Mathematically, valid kernel functions should satisfy Mercer's condition:


    For any $g(\mathbf{x})$ for which:

    $$\int g(\mathbf{x})^2 \, d\mathbf{x} < \infty$$

    it must be the case that:

    $$\int\!\!\int K(\mathbf{x}, \mathbf{x}')\, g(\mathbf{x})\, g(\mathbf{x}')\, d\mathbf{x}\, d\mathbf{x}' \geq 0$$

    A simple criterion is that the kernel should be positive
    semi-definite.


    Theorem: If a kernel is positive semi-definite, i.e.:

    $$\sum_{i,j} K(\mathbf{x}_i, \mathbf{x}_j)\, c_i c_j \geq 0$$

    where $\{c_1, \ldots, c_n\}$ are real numbers, then there exists a
    function $\Phi(\mathbf{x})$ defining an inner product of possibly
    higher dimension, i.e.:

    $$K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$$
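    In practice this condition is often checked numerically on a finite
    sample by confirming that the Gram matrix has no negative
    eigenvalues. A small sketch, here using an RBF kernel:

    ```python
    import numpy as np

    def gram_matrix(kernel, X):
        """Gram matrix G[i, j] = K(x_i, x_j) for the rows of X."""
        m = len(X)
        return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

    def is_positive_semidefinite(G, tol=1e-8):
        """All eigenvalues of the symmetric Gram matrix should be >= 0."""
        return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

    X = np.random.randn(20, 2)
    G = gram_matrix(lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0), X)
    print(is_positive_semidefinite(G))  # True for the RBF kernel
    ```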


    Thus the following steps are used to train an SVM:

    1. Choose a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$.

    2. Maximise:

    $$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$

    subject to $\alpha_i \geq 0$ and $\sum_i \alpha_i y_i = 0$.


    3. The bias $b$ is found as follows:

    $$b = -\frac{1}{2}\left[ \min_{\{i \mid y_i = +1\}} \left( \sum_{j=1}^{m} \alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j) \right) + \max_{\{i \mid y_i = -1\}} \left( \sum_{j=1}^{m} \alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j) \right) \right]$$


    4. The optimal $\alpha_i$ go into the decision function:

    $$D(\mathbf{z}) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i y_i K(\mathbf{x}_i, \mathbf{z}) + b \right)$$
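    A compact end-to-end sketch of these four steps, written for
    clarity rather than speed: it maximises the dual with a
    general-purpose SciPy optimiser, whereas real implementations use
    specialised solvers such as SMO. All function names are my own, and
    for numerical robustness the sketch already uses the soft-margin
    box constraint $0 \leq \alpha_i \leq C$ introduced in Section 1.2 below:

    ```python
    import numpy as np
    from scipy.optimize import minimize

    def rbf(x, z, sigma=1.0):
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    def train_svm(X, y, kernel, C=10.0):
        """Steps 1-3: maximise the dual W(alpha), then recover the bias b."""
        m = len(y)
        K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

        def neg_dual(a):  # minimise -W(alpha)
            return -(np.sum(a) - 0.5 * np.sum(np.outer(a * y, a * y) * K))

        cons = {"type": "eq", "fun": lambda a: np.dot(a, y)}  # sum_i alpha_i y_i = 0
        bounds = [(0.0, C)] * m                               # 0 <= alpha_i <= C
        res = minimize(neg_dual, np.zeros(m), method="SLSQP",
                       bounds=bounds, constraints=cons)
        alpha = res.x

        f = K @ (alpha * y)                                  # sum_j alpha_j y_j K(x_i, x_j)
        b = -0.5 * (np.min(f[y == 1]) + np.max(f[y == -1]))  # step 3
        return alpha, b

    def decide(z, X, y, alpha, b, kernel):
        """Step 4: D(z) = sign( sum_i alpha_i y_i K(x_i, z) + b )."""
        return np.sign(sum(a * yi * kernel(xi, z) for a, yi, xi in zip(alpha, y, X)) + b)

    # Tiny illustrative dataset: two well-separated clusters.
    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([-1, -1, 1, 1])
    alpha, b = train_svm(X, y, rbf)
    print(decide(np.array([0.1, 0.0]), X, y, alpha, b, rbf))  # expected -1.0
    ```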


    1.2. Soft Margins and Multi-class problems

    Multi-class classification: Many problems involve
    multi-class classification and a number of schemes have
    been outlined.

    One of the simplest schemes is to use a directed acyclic
    graph (DAG) with the learning task reduced to binary
    classification at each node:


    [Diagram: a DAG for three classes, with pairwise nodes 1/3, 1/2 and 2/3 leading to the class decisions 1, 2 and 3.]
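    A sketch of how such a DAG is evaluated on a test point: a list of
    candidate classes is kept, and each pairwise binary classifier
    eliminates one candidate until a single class remains. The
    classifiers dictionary of trained pairwise deciders is hypothetical:

    ```python
    def dag_predict(z, classes, classifiers):
        """Evaluate a DAG of pairwise classifiers on a test point z.

        classifiers[(a, b)] returns +1 to keep class a, or -1 to keep
        class b (one trained binary SVM per pair of classes).
        """
        remaining = list(classes)
        while len(remaining) > 1:
            a, b = remaining[0], remaining[-1]
            if classifiers[(a, b)](z) > 0:
                remaining.pop()      # eliminate b, keep a
            else:
                remaining.pop(0)     # eliminate a, keep b
        return remaining[0]

    # Hypothetical pairwise deciders for three classes (illustration only).
    classifiers = {
        (1, 3): lambda z: +1,   # prefers class 1 over class 3
        (1, 2): lambda z: -1,   # prefers class 2 over class 1
        (2, 3): lambda z: +1,   # prefers class 2 over class 3
    }
    print(dag_predict(None, [1, 2, 3], classifiers))  # 2
    ```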


    Soft margins: Most real life datasets contain noise and

    an SVM can fit this noise leading to poor generalization.

    The effect of outliers and noise can be reduced by

    introducing a soft margin. Two schemes are commonly

    used:


    L1 error norm:

    $$0 \leq \alpha_i \leq C$$

    L2 error norm:

    $$K(\mathbf{x}_i, \mathbf{x}_i) \rightarrow K(\mathbf{x}_i, \mathbf{x}_i) + \lambda$$


    The justification for these soft margin techniques comes

    from statistical learning theory.

    They can be readily viewed as a relaxation of the hard
    margin constraint.


    For the L1 error norm (prior to introducing kernels) we
    introduce a positive slack variable $\xi_i$:

    $$y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \geq 1 - \xi_i$$

    so we minimize the sum of errors $\sum_{i=1}^{m} \xi_i$ in addition to
    $\lVert\mathbf{w}\rVert^2$:

    $$\min\left[ \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \sum_{i=1}^{m} \xi_i \right]$$


    This is readily formulated as a primal objective function:

    $$L(\mathbf{w}, b, \alpha, \xi) = \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i$$

    with Lagrange multipliers $\alpha_i \geq 0$ and $r_i \geq 0$.


    The derivatives with respect to $\mathbf{w}$, $b$ and $\xi_i$ give:

    $$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i = 0 \qquad (1)$$

    $$\frac{\partial L}{\partial b} = \sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (2)$$

    $$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - r_i = 0 \qquad (3)$$


    Resubstituting back into the primal objective function we
    obtain the same dual objective function as before. However,
    $r_i \geq 0$ and $C - \alpha_i - r_i = 0$, hence $\alpha_i \leq C$ and
    the constraint $0 \leq \alpha_i$ is replaced by $0 \leq \alpha_i \leq C$.


    Patterns with values $0 < \alpha_i < C$ will be referred to as
    non-bound, and those with $\alpha_i = 0$ or $\alpha_i = C$ will be said
    to be at bound.

    The optimal value of C must be found by

    experimentation using a validation set and it cannot be

    readily related to the characteristics of the dataset or

    model.
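    A common way to do that experimentation is a grid search over C
    with cross-validation. A minimal sketch using scikit-learn
    (assuming it is available; the candidate values of C and the
    synthetic data are arbitrary examples):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # Try several soft-margin parameters and keep the best by cross-validation.
    search = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10, 100]}, cv=5)
    search.fit(X, y)
    print(search.best_params_)
    ```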


    In an alternative approach, $\nu$-SVM, it can be shown that
    solutions for an L1 error norm are the same as those
    obtained from maximizing:

    $$W(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$


    subject to:

    $$\sum_{i=1}^{m} y_i \alpha_i = 0, \qquad \sum_{i=1}^{m} \alpha_i \geq \nu, \qquad 0 \leq \alpha_i \leq \frac{1}{m} \qquad (4)$$

    where $\nu$ lies in the range 0 to 1.


    The fraction of training errors is upper bounded by $\nu$, and $\nu$
    also provides a lower bound on the fraction of points
    which are support vectors.

    Hence in this formulation the conceptual meaning of the

    soft margin parameter is more transparent.
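    scikit-learn exposes this formulation directly as NuSVC; a minimal
    usage sketch (the value nu=0.1 and the synthetic data are only
    examples):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.svm import NuSVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # nu upper-bounds the fraction of training errors and
    # lower-bounds the fraction of support vectors.
    model = NuSVC(nu=0.1, kernel="rbf").fit(X, y)
    print(len(model.support_) / len(X))  # fraction of support vectors (>= nu)
    ```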


    For the L2 error norm the primal objective function is:

    $$L(\mathbf{w}, b, \alpha, \xi) = \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \sum_{i=1}^{m} \xi_i^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i \qquad (5)$$

    with $\alpha_i \geq 0$ and $r_i \geq 0$.


    After obtaining the derivatives with respect to $\mathbf{w}$, $b$ and
    $\xi$, substituting for $\mathbf{w}$ and $\xi$ in the primal objective
    function, and noting that the dual objective function is
    maximal when $r_i = 0$, we obtain the following dual
    objective function after kernel substitution:

    $$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) - \frac{1}{4C} \sum_{i=1}^{m} \alpha_i^2$$


    With $\lambda = 1/2C$ this gives the same dual objective
    function as for hard margin learning, except for the
    substitution:

    $$K(\mathbf{x}_i, \mathbf{x}_i) \rightarrow K(\mathbf{x}_i, \mathbf{x}_i) + \lambda$$
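    In code, the L2 soft margin therefore only changes the Gram
    matrix: $\lambda$ is added to its diagonal before the hard-margin dual is
    solved. A NumPy one-liner, assuming a precomputed Gram matrix K:

    ```python
    import numpy as np

    C = 10.0                                # illustrative value
    lam = 1.0 / (2.0 * C)                   # lambda = 1/2C
    K = np.array([[1.0, 0.3],
                  [0.3, 1.0]])              # toy precomputed Gram matrix

    K_soft = K + lam * np.eye(len(K))       # K(x_i, x_i) -> K(x_i, x_i) + lambda
    print(K_soft)
    ```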


    For many real-life datasets there is an imbalance between

    the amount of data in different classes, or the significance

    of the data in the two classes can be quite different. The
    relative balance between the detection rate for different

    classes can be easily shifted by introducing asymmetric

    soft margin parameters.


    Thus for binary classification with an L1 error norm:

    $$0 \leq \alpha_i \leq C_+ \quad \text{for } y_i = +1,$$

    and

    $$0 \leq \alpha_i \leq C_- \quad \text{for } y_i = -1.$$
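    In scikit-learn this asymmetry is usually expressed through
    per-class weights that scale C; a minimal sketch (the weights and
    the imbalanced synthetic data are arbitrary examples):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Imbalanced toy problem: roughly 90% of the points in class 0.
    X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

    # class_weight rescales C per class, i.e. C_+ and C_- differ.
    model = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 5.0}).fit(X, y)
    print(model.n_support_)  # number of support vectors per class
    ```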


    1.3. Solving Regression Tasks with SVMs

    $\epsilon$-SV regression (Vapnik): not the only approach.

    Statistical learning theory: we use the constraints
    $y_i - \mathbf{w} \cdot \mathbf{x}_i - b \leq \epsilon$ and $(\mathbf{w} \cdot \mathbf{x}_i + b) - y_i \leq \epsilon$ to allow for
    some deviation $\epsilon$ between the eventual targets $y_i$ and the
    function $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$ modelling the data.


    As before we minimize $\lVert\mathbf{w}\rVert^2$ to penalise over-complexity.

    To account for training errors we also introduce slack
    variables $\xi_i$, $\widehat{\xi}_i$ for the two types of training error.
    These slack variables are zero for points inside the tube
    and progressively increase for points outside the tube
    according to the loss function used.


    Figure 1: A linear $\epsilon$-insensitive loss function.


    For a linear $\epsilon$-insensitive loss function the task is
    therefore to minimize:

    $$\min\left[ \lVert\mathbf{w}\rVert^2 + C \sum_{i=1}^{m} \left( \xi_i + \widehat{\xi}_i \right) \right]$$


    subject to:

    $$y_i - \mathbf{w} \cdot \mathbf{x}_i - b \leq \epsilon + \xi_i$$

    $$(\mathbf{w} \cdot \mathbf{x}_i + b) - y_i \leq \epsilon + \widehat{\xi}_i$$

    where the slack variables are both positive.


    Deriving the dual objective function we get:

    $$W(\alpha, \widehat{\alpha}) = \sum_{i=1}^{m} y_i \left( \alpha_i - \widehat{\alpha}_i \right) - \epsilon \sum_{i=1}^{m} \left( \alpha_i + \widehat{\alpha}_i \right) - \frac{1}{2} \sum_{i,j=1}^{m} \left( \alpha_i - \widehat{\alpha}_i \right)\left( \alpha_j - \widehat{\alpha}_j \right) K(\mathbf{x}_i, \mathbf{x}_j)$$


    which is maximized subject to:

    $$\sum_{i=1}^{m} \alpha_i = \sum_{i=1}^{m} \widehat{\alpha}_i$$

    and:

    $$0 \leq \alpha_i \leq C, \qquad 0 \leq \widehat{\alpha}_i \leq C$$


    The function modelling the data is then:

    $$f(\mathbf{z}) = \sum_{i=1}^{m} \left( \alpha_i - \widehat{\alpha}_i \right) K(\mathbf{x}_i, \mathbf{z}) + b$$
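    For completeness, a minimal usage sketch of $\epsilon$-SV regression with
    scikit-learn's SVR, whose fitted model has exactly this form (the
    choices of C, epsilon and the toy data are illustrative only):

    ```python
    import numpy as np
    from sklearn.svm import SVR

    # Toy 1-D regression data: a noisy sine curve.
    rng = np.random.default_rng(0)
    X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)

    # epsilon sets the width of the insensitive tube; C is the soft-margin trade-off.
    model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
    print(model.predict([[np.pi / 2]]))  # should be close to sin(pi/2) = 1
    ```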


    Similarly we can define many other types of loss

    functions.

    We can define a linear programming approach to regression.

    We can use many other approaches to regression (e.g.

    kernelising least squares).