November, 2006 CCKM'06 1
Kernel Methods: the Emergence of a Well-founded Machine Learning
John Shawe-Taylor
Centre for Computational Statistics and Machine Learning
University College London
Overview
Celebration of 10 years of kernel methods: what has been achieved and what can we learn from the experience?
Some historical perspectives: Theory or not? Applicable or not?
Some emphases:
  Role of theory – need for plurality of approaches
  Importance of scalability
Surprising fact: there are now 13,444 citations of Support Vectors reported by Google Scholar – the vast majority from papers not about machine learning!
Ratio of Neural Network to Support Vector hits in Google Scholar: 1987–1996: 220; 1997–2006: 11
Caveats
Personal perspective with inevitable bias
One very small slice through what is now a very big field
Focus on theory with emphasis on frequentist analysis
There is no pro-forma for scientific research
  But the role of theory is worth discussing
  Needed to give a firm foundation for proposed approaches?
Motivation behind kernel methods
Linear learning typically has nice properties:
  Unique optimal solutions
  Fast learning algorithms
  Better statistical analysis
But one big problem: insufficient capacity
Historical perspective
Minsky and Papert highlighted the weakness in their book Perceptrons
Neural networks overcame the problem by gluing together many linear units with non-linear activation functions
  Solved the problem of capacity and led to a very impressive extension of the applicability of learning
  But ran into training problems of speed and multiple local minima
Kernel methods approach
The kernel methods approach is to stick with linear functions but work in a high dimensional feature space:
The expectation is that the feature space has a much higher dimension than the input space.
Example
Consider the mapping
If we consider a linear equation in this feature space:
We actually have an ellipse – i.e. a non-linear shape in the input space.
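The mapping and the linear equation were shown as images on the slide and are lost from this transcript. The sketch below uses an assumed quadratic feature map (a standard choice, consistent with the later kernel example) to check the ellipse claim numerically:

```python
import numpy as np

# Assumed quadratic feature map phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

# Linear function <w, phi(x)> = 1 with w = (1/a^2, 1/b^2, 0):
# in the input space this is the ellipse x1^2/a^2 + x2^2/b^2 = 1
a, b = 2.0, 1.0
w = np.array([1.0 / a ** 2, 1.0 / b ** 2, 0.0])

# every point on the ellipse satisfies the linear equation in feature space
theta = np.linspace(0.0, 2.0 * np.pi, 100)
pts = np.stack([a * np.cos(theta), b * np.sin(theta)], axis=1)
vals = np.array([w @ phi(p) for p in pts])
print(np.allclose(vals, 1.0))  # True
```

A non-linear decision boundary in the input space is thus realised by a purely linear function in the feature space.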
Capacity of feature spaces
The capacity is proportional to the dimension – for example:
2-dim:
Form of the functions
So kernel methods use linear functions in a feature space:
For regression this could be the function itself
For classification we require thresholding
Problems of high dimensions
Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means unlikely to generalise well
Computational costs involved in dealing with large vectors
Overview
Two theoretical approaches converged on very similar algorithms:
  Frequentist led to the Support Vector Machine
  Bayesian approach led to Bayesian inference using Gaussian Processes
First we briefly discuss the Bayesian approach before mentioning some of the frequentist results
Bayesian approach
The Bayesian approach relies on a probabilistic analysis by positing:
  a noise model
  a prior distribution over the function class
Inference involves updating the prior distribution with the likelihood of the data
Possible outputs:
  MAP function
  Bayesian posterior average
Bayesian approach
Avoids overfitting by:
  Controlling the prior distribution
  Averaging over the posterior
For a Gaussian noise model (for regression) and a Gaussian process prior we obtain a ‘kernel’ method where:
  Kernel is the covariance of the prior GP
  Noise model translates into the addition of a ridge to the kernel matrix
  MAP and averaging give the same solution
Link with the infinite hidden node limit of single hidden layer Neural Networks – see the seminal paper: Williams, Computation with infinite neural networks (1997)
Surprising fact: the covariance (kernel) function that arises from infinitely many sigmoidal hidden units is not a sigmoidal kernel – indeed the sigmoidal kernel is not positive semi-definite and so cannot arise as a covariance function!
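This non-PSD claim is easy to verify numerically: even two points in one dimension give a tanh Gram matrix with a negative eigenvalue.

```python
import numpy as np

# Gram matrix of the "sigmoidal kernel" k(x, z) = tanh(x * z)
# on just two one-dimensional points
X = np.array([1.0, 2.0])
K = np.tanh(np.outer(X, X))
eigs = np.linalg.eigvalsh(K)
print(eigs.min())  # negative: tanh(<x, z>) is not positive semi-definite
```

Since a covariance function must produce positive semi-definite matrices on every point set, this single counterexample already rules the sigmoidal kernel out.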
Bayesian approach
Subject to assumptions about noise model and prior distribution:
  Can get error bars on the output
  Compute evidence for the model and use it for model selection
Approach developed for different noise models, eg classification
Typically requires approximate inference
Frequentist approach
Source of randomness is assumed to be a distribution that generates the training data i.i.d. – with the same distribution generating the test data
Different/weaker assumptions than the Bayesian approach – so more general but less analysis can typically be derived
Main focus is on generalisation error analysis
Capacity problem
What do we mean by generalisation?
Generalisation of a learner
Example of Generalisation
We consider the Breast Cancer dataset from the UCI repository
Use the simple Parzen window classifier: the weight vector is w = μ+ − μ−, where μ+ (μ−) is the average of the positive (negative) training examples.
The threshold is set so that the hyperplane bisects the line joining these two points.
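A minimal sketch of this classifier, run on synthetic 9-dimensional data rather than the actual UCI set:

```python
import numpy as np

def parzen_window_classifier(Xpos, Xneg):
    # weight vector: difference of the two class means
    mu_pos, mu_neg = Xpos.mean(axis=0), Xneg.mean(axis=0)
    w = mu_pos - mu_neg
    # threshold so that the hyperplane bisects the segment joining the means
    b = (mu_pos @ mu_pos - mu_neg @ mu_neg) / 2.0
    return lambda x: 1 if w @ x - b > 0 else -1

# two well-separated 9-dimensional clusters (9 = input dim of the UCI data)
rng = np.random.default_rng(0)
Xpos = rng.normal(loc=2.0, size=(50, 9))
Xneg = rng.normal(loc=-2.0, size=(50, 9))
clf = parzen_window_classifier(Xpos, Xneg)
acc = np.mean([clf(x) == 1 for x in Xpos] + [clf(x) == -1 for x in Xneg])
print(acc)
```

On such well-separated synthetic clusters the classifier is essentially perfect; the slides study how its error distribution degrades as the training set shrinks.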
Example of Generalisation
By repeatedly drawing random training sets S of size m we estimate the distribution of the generalisation error, by using the test set error as a proxy for the true generalisation
We plot the histogram and the average of the distribution for various sizes of training set:
  648, 342, 273, 205, 137, 68, 34, 27, 20, 14, 7.
Example of Generalisation
Since the expected classifier is in all cases the same, we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be exactly the same.
Error distributions: histograms (not reproduced) for the full dataset and for dataset sizes 342, 273, 205, 137, 68, 34, 27, 20, 14 and 7.
Observations
Things can get bad if number of training examples small compared to dimension (in this case input dimension is 9)
Mean can be bad predictor of true generalisation – i.e. things can look okay in expectation, but still go badly wrong
Key ingredient of learning – keep flexibility high while still ensuring good generalisation
Controlling generalisation
The critical method of controlling generalisation for classification is to force a large margin on the training data:
Intuitive and rigorous explanations
  Makes classification robust to uncertainties in inputs
  Can randomly project into lower dimensional spaces and still have separation – so effectively low dimensional
  Rigorous statistical analysis shows effective dimension
This is not structural risk minimisation over VC classes, since the hierarchy depends on the data: data-dependent structural risk minimisation
  see S-T, Bartlett, Williamson & Anthony (1996 and 1998)
Surprising fact: structural risk minimisation over VC classes does not provide a bound on the generalisation of SVMs, except in the transductive setting – and then only if the margin is measured on training and test data! Indeed SVMs were a wake-up call that classical PAC analysis was not capturing critical factors in real-world applications!
Learning framework
Since there are lower bounds in terms of the VC dimension, the margin is detecting a favourable distribution/task alignment – the luckiness framework captures this idea
Now consider using an SVM on the same data and compare the distribution of generalisations
  SVM distribution shown in red
Error distributions comparing the SVM (red) with the Parzen window classifier: histograms (not reproduced) for dataset sizes 205, 137, 68, 34, 27, 20, 14 and 7.
Handling training errors
So far we have only considered the case where the data can be separated
For non-separable sets we can introduce a penalty proportional to the amount by which a point fails to meet the margin
These amounts are often referred to as slack variables in optimisation theory
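In the usual formulation (assuming real-valued outputs f(x) and targets y = ±1), the slack of each point is ξ = max(0, 1 − y·f(x)):

```python
import numpy as np

# slack variable: how far each point falls short of the (functional) margin 1
def slacks(y, f_vals):
    return np.maximum(0.0, 1.0 - y * f_vals)

y = np.array([1.0, 1.0, -1.0, -1.0])
f_vals = np.array([2.0, 0.5, -3.0, 0.2])  # raw outputs f(x) for each point
xi = slacks(y, f_vals)
print(xi)  # [0.  0.5 0.  1.2]
```

A point comfortably beyond the margin has zero slack; a point inside the margin, or on the wrong side entirely, pays a penalty proportional to its shortfall.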
Support Vector Machines
SVM optimisation
Analysis of this case is given using an augmented space trick in S-T and Cristianini (1999 and 2002), On the generalisation of soft margin algorithms.
Complexity problem
Let’s apply the quadratic example to a 20x30 image of 600 pixels – this gives approximately 180,000 dimensions!
Would be computationally infeasible to work in this space
Dual representation
Suppose the weight vector is a linear combination of the training examples: w = Σi αi xi
Then we can evaluate the inner product with a new example: ⟨w, x⟩ = Σi αi ⟨xi, x⟩
Learning the dual variables
The αi are known as dual variables
Since any component orthogonal to the space spanned by the training data has no effect, there is a general result that weight vectors have a dual representation: the representer theorem
Hence, we can reformulate algorithms to learn the dual variables rather than the weight vector directly
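As an illustration of learning dual variables directly, here is a minimal kernel perceptron (a sketch, not an algorithm from the slides); it separates XOR-type data that no linear function in the input space can:

```python
import numpy as np

def kernel_perceptron(X, y, k, epochs=10):
    # learn the dual variables alpha_i directly; only kernel values are used
    n = len(y)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            if y[i] * ((alpha * y) @ K[:, i]) <= 0:  # mistake-driven update
                alpha[i] += 1.0
    return alpha

# XOR-type labels: not linearly separable in the input space, but separable
# with the inhomogeneous quadratic kernel k(u, v) = (<u, v> + 1)^2
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
k = lambda u, v: (u @ v + 1.0) ** 2
alpha = kernel_perceptron(X, y, k)
preds = np.sign([(alpha * y) @ np.array([k(xi, x) for xi in X]) for x in X])
print(np.array_equal(preds, y))  # True
```

The weight vector is never formed explicitly: both training and prediction touch the data only through kernel evaluations.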
Dual form of SVM
The dual form of the SVM can also be derived by taking the dual optimisation problem! This gives:
Note that threshold must be determined from border examples
Using kernels
Critical observation is that again only inner products are used
Suppose that we now have a shortcut method of computing:
Then we do not need to explicitly compute the feature vectors either in training or testing
Kernel example
As an example consider the mapping
Here we have a shortcut:
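The mapping itself is an image lost from the transcript, but assuming the quadratic map φ(x) = (x1², x2², √2·x1x2), the shortcut ⟨φ(x), φ(z)⟩ = ⟨x, z⟩² can be checked directly:

```python
import numpy as np

phi = lambda x: np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])
x, z = np.array([3.0, -1.0]), np.array([2.0, 5.0])
lhs = phi(x) @ phi(z)   # inner product computed in the feature space
rhs = (x @ z) ** 2      # the kernel shortcut, computed in the input space
print(np.isclose(lhs, rhs))  # True
```

The kernel evaluation costs one input-space inner product and a square, however large the feature space.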
Efficiency
Hence, in the pixel example rather than work with 180000 dimensional vectors, we compute a 600 dimensional inner product and then square the result!
Can even work in infinite dimensional spaces, eg using the Gaussian kernel:
Using the Gaussian kernel for the Breast Cancer data, sizes 273 and 342: error distributions (histograms not reproduced)
Surprising fact: kernel methods are invariant to rotations of the coordinate system – so any special information encoded in the choice of coordinates will be lost! Surprisingly, one of the most successful applications of SVMs has been for text classification, in which one would expect the encoding to be very informative!
Constraints on the kernel
There is a restriction on the function: the kernel matrix it generates must be positive semi-definite
This restriction holding for any training set is enough to guarantee the function is a kernel
Surprising fact: the property that guarantees the existence of a feature space is precisely the property required to ensure convexity of the resulting optimisation problem for SVMs, etc!
What have we achieved?
Replaced the problem of neural network architecture by kernel definition
  Arguably more natural to define, but the restriction is a bit unnatural
  Not a silver bullet as fit with data is key
  Can be applied to non-vectorial (or high dimensional) data
Gained more flexible regularisation/generalisation control
Gained a convex optimisation problem, i.e. NO local minima!
Historical note
First use of kernel methods in machine learning was in combination with the perceptron algorithm: Aizerman et al., Theoretical foundations of the potential function method in pattern recognition learning (1964)
Apparently failed because of generalisation issues – the margin was not optimised, but this can be incorporated with one extra parameter!
Surprising fact: good covering number bounds on the generalisation of SVMs use a bound on the updates of the perceptron algorithm to bound the covering numbers and hence generalisation – but the same bound can be applied directly to the perceptron without margin using a sparsity argument, so the bound is tighter for the classical perceptron!
PAC-Bayes: General perspectives
The goal of different theories is to capture the key elements that enable an understanding, analysis and learning of different phenomena
Several theories of machine learning: notably Bayesian and frequentist
Different assumptions and hence different range of applicability and range of results
Bayesian able to make more detailed probabilistic predictions
Frequentist makes only i.i.d. assumption
Evidence and generalisation
Evidence is a measure used in Bayesian analysis to inform model selection
Evidence is related to the posterior volume of weight space consistent with the sample
Link between evidence and generalisation hypothesised by MacKay
First such result was obtained by S-T & Williamson (1997): PAC Analysis of a Bayes Estimator
  Bound on generalisation in terms of the volume of the sphere that can be inscribed in the version space – included a dependence on the dimension of the space but applies to non-linear function classes
  Used the Luckiness framework, where luckiness is measured by the volume of the ball
PAC-Bayes Theorem
PAC-Bayes theorem has a similar flavour but bounds in terms of prior and posterior distributions
First version proved by McAllester in 1999
Improved proof and bound due to Seeger in 2002, with application to Gaussian processes
Application to SVMs by Langford and S-T, also in 2002
Gives the tightest bounds on generalisation for SVMs
  see Langford's tutorial in JMLR
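A sketch of how such a bound is evaluated in practice. The exact statement varies between versions; the form assumed below is the Seeger/Langford-style bound kl(empirical risk ‖ true risk) ≤ (KL(Q‖P) + ln(2√m/δ))/m, inverted numerically:

```python
import math

def binary_kl(q, p):
    # KL divergence between Bernoulli(q) and Bernoulli(p)
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(emp, rhs):
    # largest p with kl(emp || p) <= rhs, found by bisection
    lo, hi = emp, 1.0 - 1e-12
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binary_kl(emp, mid) > rhs:
            hi = mid
        else:
            lo = mid
    return hi

# illustrative numbers: m examples, confidence delta, KL(Q||P), empirical risk
m, delta, kl_qp, emp = 10000, 0.05, 20.0, 0.05
rhs = (kl_qp + math.log(2.0 * math.sqrt(m) / delta)) / m
bound = kl_inverse(emp, rhs)
print(emp < bound < 1.0)  # True: an upper bound on the true risk
```

The bound tightens as the posterior Q moves towards the prior P (smaller KL term) or as the sample size m grows.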
Examples of bound evaluation

Problem    PAC-Bayes    PAC-Bayes prior   Test Error
Wdbc       0.334±.005   0.306±.018        0.073±.021
Image      0.254±.003   0.184±.010        0.074±.014 (0.056±.01)
Waveform   0.198±.002   0.151±.005        0.089±.005
Ringnorm   0.212±.002   0.109±.024        0.026±.005

Surprising fact: optimising the bound does not typically improve the test error – despite significant improvements in the bound itself! Training an SVM on half the data to learn a prior and then using the rest to learn relative to this prior further improves the bound with almost no effect on the test error!
Principal Components Analysis (PCA)
Classical method is principal component analysis – looks for directions of maximum variance, given by the eigenvectors of the covariance matrix
Dual representation of PCA
Eigenvectors of kernel matrix give dual representation:
Means we can perform PCA projection in a kernel defined feature space: kernel PCA
Kernel PCA
Need to take care of normalisation to obtain:
where λ is the corresponding eigenvalue.
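A numerical sketch of the normalisation: with a linear kernel, projections computed from eigenvectors of the kernel matrix (scaled by 1/√λ) must agree with ordinary PCA:

```python
import numpy as np

# Kernel PCA sketch: eigenvectors alpha of the kernel matrix give the dual
# representation of the principal directions; dividing by sqrt(lambda)
# normalises each direction to unit length in the feature space.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
Xc = X - X.mean(axis=0)                  # centre the data
K = Xc @ Xc.T                            # linear kernel so we can cross-check
lam, A = np.linalg.eigh(K)               # eigenvalues in ascending order
lam, A = lam[::-1], A[:, ::-1]           # sort descending
proj = K @ A[:, :2] / np.sqrt(lam[:2])   # projections onto top 2 directions

# cross-check: for the linear kernel this must agree with ordinary PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(np.abs(proj), np.abs(U[:, :2] * S[:2])))  # True (up to sign)
```

With a non-linear kernel, the same dual computation performs PCA in the implicit feature space, which is the point of kernel PCA.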
Generalisation of k-PCA
How reliable are estimates obtained from a sample when working in such high dimensional spaces?
Using Rademacher complexity bounds one can show that if a low dimensional projection captures most of the variance in the training set, it will do well on test examples as well, even in high dimensional spaces
  see S-T, Williams, Cristianini and Kandola (2004) On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA
The bound also gives a new bound on the difference between the process and empirical eigenvalues
Surprising fact: no frequentist bounds apply for learning in the representation given by k-PCA – the data is used twice, once to find the subspace and then to learn the classifier/regressor! Intuitively, this shouldn’t be a problem since the labels are not used by k-PCA!
Latent Semantic Indexing
Developed in information retrieval to overcome the problem of semantic relations between words in the bag-of-words representation
Use Singular Value Decomposition of Term-doc matrix:
Lower dimensional representation
If we truncate the matrix U at k columns we obtain the best k dimensional representation in the least squares sense:
In this case we can write a ‘semantic matrix’ as
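A small sketch of the truncation on a toy (random) term–document matrix; by the Eckart–Young theorem, the Frobenius error of the rank-k reconstruction equals the energy in the discarded singular values:

```python
import numpy as np

# LSI sketch: truncate the SVD of a term-document matrix at k dimensions
rng = np.random.default_rng(2)
D = rng.random((40, 12))                   # rows = terms, columns = documents
U, S, Vt = np.linalg.svd(D, full_matrices=False)
k = 3
Dk = (U[:, :k] * S[:k]) @ Vt[:k]           # best rank-k approximation
Sem = U[:, :k].T                           # 'semantic matrix' as a projection
docs_k = Sem @ D                           # k-dim document representations
err = np.linalg.norm(D - Dk)
print(np.isclose(err, np.sqrt((S[k:] ** 2).sum())))  # True (Eckart-Young)
```

Retrieval then operates on the k-dimensional representations `docs_k` instead of the raw term vectors.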
Latent Semantic Kernels
Can perform the same transformation in a kernel defined feature space by performing an eigenvalue decomposition of the kernel matrix (this is equivalent to kernel PCA):
Related techniques
Number of related techniques:
  Probabilistic LSI (pLSI)
  Non-negative Matrix Factorisation (NMF)
  Multinomial PCA (mPCA)
  Discrete PCA (DPCA)
All can be viewed as alternative decompositions:
Different criteria
Vary by:
  Different constraints (eg non-negative entries)
  Different prior distributions (eg Dirichlet, Poisson)
  Different optimisation criteria (eg max likelihood, Bayesian)
Unlike LSI, these typically suffer from local minima and so require EM-type iterative algorithms to converge to solutions
Other subspace methods
Kernel partial Gram-Schmidt orthogonalisation is equivalent to incomplete Cholesky decomposition – greedy kernel PCA
Kernel Partial Least Squares implements a multi-dimensional regression algorithm popular in chemometrics – takes account of labels
Kernel Canonical Correlation Analysis uses paired datasets to learn a semantic representation independent of the two views
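The first of these can be sketched directly: a greedy incomplete Cholesky factorisation of the kernel matrix (the pivoting rule assumed here is the usual largest-residual choice):

```python
import numpy as np

def incomplete_cholesky(K, rank):
    # greedy pivoted (incomplete) Cholesky: K ~ R @ R.T
    n = K.shape[0]
    R = np.zeros((n, rank))
    d = np.diag(K).astype(float)          # residual norm of each point
    for j in range(rank):
        i = int(np.argmax(d))             # pick the point with largest residual
        R[:, j] = (K[:, i] - R @ R[i]) / np.sqrt(d[i])
        d -= R[:, j] ** 2
    return R

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))
K = (X @ X.T + 1.0) ** 2                  # quadratic kernel: rank at most 15
R = incomplete_cholesky(K, rank=15)
print(np.allclose(K, R @ R.T))            # True: 15 pivots recover K exactly
```

Stopping after far fewer pivots than the rank gives the cheap low-rank approximation that makes large-scale kernel methods tractable.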
Paired corpora
Can we use paired corpora to extract more information?
Two views of same semantic object – hypothesise that both views contain all of the necessary information, eg document and translation to a second language:
Aligned text: English documents E1, …, Ei, …, EN paired with French translations F1, …, Fi, …, FN (diagram not reproduced).
Canadian parliament corpus
LAND MINES
Ms. Beth Phinney (Hamilton Mountain, Lib.): Mr. Speaker, we are pleased that the Nobel peace prize has been given to those working to ban land mines worldwide. We hope this award will encourage the United States to join the over 100 countries planning to come to …
LES MINES ANTIPERSONNEL
Mme Beth Phinney (Hamilton Mountain, Lib.): Monsieur le Président, nous nous réjouissons du fait que le prix Nobel ait été attribué à ceux qui oeuvrent en faveur de l'interdiction des mines antipersonnel dans le monde entier. Nous espérons que cela incitera les Américains à se joindre aux représentants de plus de 100 pays qui ont l'intention de venir à …
cross-lingual LSI via SVD
Stack the English and French term–document matrices and take a single SVD, so both languages share one latent space:
  [E; F] = U D V',  with U = [U_E; U_F]
M. L. Littman, S. T. Dumais, and T. K. Landauer. Automatic cross-language information retrieval using latent semantic indexing. In G. Grefenstette, editor, Cross-language information retrieval. Kluwer, 1998.
cross-lingual kernel canonical correlation analysis
The input "English" and "French" spaces are mapped by Φ(x) into feature spaces, in which paired projection directions fE1, fE2, … and fF1, fF2, … are sought (diagram not reproduced).
kernel canonical correlation analysis

  max_{fE, fF} corr(fE(E), fF(F))

with the dual representations fE = Σl αlE El and fF = Σl αlF Fl, leading to the generalised eigenproblem B α = ρ D α with

  B = [ O      KE KF ]     D = [ KE²  O   ]
      [ KF KE  O     ]         [ O    KF² ]
regularization
Using kernel functions may result in overfitting: we need to control the flexibility of the projections fE and fF by adding a diagonal to the matrix D:

  D = [ (KE + κI)²  O           ]
      [ O           (KF + κI)²  ]

where κ is the regularization parameter
Theoretical analysis shows that provided the norms of the weight vectors are small, the correlation will still hold for new data
  see S-T and Cristianini (2004) Kernel Methods for Pattern Analysis
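A numerical sketch of regularised kernel CCA on toy paired views. Rather than the full generalised eigenproblem, it uses the equivalent formulation in which the regularised correlations are the singular values of (KE + κI)⁻¹ KE KF (KF + κI)⁻¹; all data here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 25
z = rng.normal(size=(n, 1))                  # shared latent signal
E = z + 0.1 * rng.normal(size=(n, 3))        # "English" view (toy)
F = -z + 0.1 * rng.normal(size=(n, 3))       # "French" view (toy)
KE, KF = E @ E.T, F @ F.T                    # linear kernels
kappa = 0.1                                  # regularisation parameter
I = np.eye(n)
# regularised correlations = singular values of
#   (KE + kappa*I)^{-1} KE KF (KF + kappa*I)^{-1}
M = np.linalg.solve(KE + kappa * I, KE @ KF) @ np.linalg.inv(KF + kappa * I)
rho = np.linalg.svd(M, compute_uv=False)
print(rho.max() <= 1.0, rho.max() > 0.5)     # strong shared signal: rho near 1
```

The κ term caps the correlations at 1 and prevents the spurious perfect correlations that unregularised KCCA finds whenever the kernel matrices are full rank.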
pseudo query test
A query qEi is generated from each English test document Ei and used to retrieve among the French documents F1, …, FN (diagram not reproduced).
Queries were generated from each test document by extracting 5 words with the highest TFIDF weights and using them as a query.
Experimental Results
The goal was to retrieve the paired document.
Experimental procedure:
(1) LSI/KCCA trained on paired documents,
(2) All test documents projected into the LSI/KCCA semantic space,
(3) Each query was projected into the LSI/KCCA semantic space and documents were retrieved using nearest neighbour based on cosine distance to the query.
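Step (3) can be sketched as a cosine-distance nearest-neighbour lookup in the semantic space (toy vectors standing in for the projected documents):

```python
import numpy as np

def cosine_retrieve(query_vec, doc_vecs):
    # nearest neighbour under cosine distance = largest cosine similarity
    qn = query_vec / np.linalg.norm(query_vec)
    dn = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return int(np.argmax(dn @ qn))

rng = np.random.default_rng(5)
docs = rng.normal(size=(50, 10))     # documents in a 10-dim semantic space
target = 7
query = docs[target] + 0.01 * rng.normal(size=10)   # noisy pseudo-query
print(cosine_retrieve(query, docs) == target)  # True
```

Cosine similarity ignores vector length, which matters here because projected documents of different lengths can have very different norms.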
English–French retrieval accuracy, %

          100     200     300     400     500
CL-LSI    29.95   38.20   42.44   45.35   47.11
CL-KCCA   67.88   74.92   77.56   78.87   79.79
Applying to different data types
Data
  Combined image and associated text obtained from the web
  Three categories: sport, aviation and paintball
  400 examples from each category (1200 overall)
  Features extracted: HSV, Texture, Bag of words
Tasks
  1. Classification of web pages into the 3 categories
  2. Text query -> image retrieval
Classification error of baseline method

Colour   Texture   Colour + Texture   Text   All
22.9%    18.3%     13.6%              9.0%   3.0%

• Previous error rates obtained using probabilistic ICA classification done for single feature groups
Classification rates using KCCA

                  Error rate
Plain SVM         2.13% ± 0.23%
KCCA-SVM (150)    1.36% ± 0.15%
KCCA-SVM (200)    1.21% ± 0.27%
Classification with multi-views
If we use KCCA to generate a semantic feature space and then learn with an SVM, we can envisage combining the two steps into a single ‘SVM-2k’:
  learns two SVMs, one on each representation, but constrains their outputs to be similar across the training (and any unlabelled) data
  can give classification for single mode test data – applied to patent classification for Japanese patents
Again theoretical analysis predicts good generalisation if the training SVMs have a good match, while the two representations have a small overlap
  see Farquhar, Hardoon, Meng, S-T, Szedmak (2005) Two view learning: SVM-2K, Theory and Practice
Results

      pSVM       kcca_SVM   SVM        SVM_2k_j   Concat     SVM_2k
1     59.4±3.9   60.3±2.8   66.6±2.8   66.1±2.6   67.5±2.3   67.5±2.1
2     71.1±4.5   68.4±4.4   73.0±4.0   74.8±4.7   73.9±4.0   75.1±4.1
3     16.7±1.2   13.1±1.0   18.8±1.6   20.8±1.9   21.5±1.9   22.5±1.7
7     74.9±1.8   76.0±1.2   76.7±1.3   77.5±1.4   79.0±1.2   80.7±1.5
12    75.0±0.8   73.6±0.8   76.8±1.0   77.6±0.7   76.8±0.6   78.4±0.6
14    76.0±1.6   71.5±1.5   80.9±1.3   82.2±1.3   81.4±1.4   82.7±1.3

Key: pSVM – dual variables; kcca_SVM – KCCA+SVM; SVM – direct SVM; SVM_2k_j – cross-lingual SVM-2k; Concat – concatenation+SVM; SVM_2k – co-lingual SVM-2k
Targeted Density Learning
Learning a probability density function in an L1 sense is hard
Learning for a single classification or regression task is ‘easy’
Consider a set of tasks (Touchstone Class) F that we think may arise, with a distribution emphasising those more likely to be seen
Now learn a density that does well for the tasks in F
Surprisingly, it can be shown that just including a small subset of tasks gives us convergence over all the tasks
Example plot
Consider fitting the cumulative distribution up to sample points, as proposed by Mukherjee and Vapnik.
Plot shows loss as a function of the number of constraints (tasks) included (figure not reproduced):
Idealised view of progress
Study problem to develop a theoretical model ->
Derive analysis that indicates factors that affect solution quality ->
Translate into optimisation maximising those factors – relaxing to ensure convexity ->
Develop efficient solutions using specifics of the task
Role of theory
Theoretical analysis plays a critical auxiliary role:
  Can never be the complete picture
  Needs to capture the factors that most influence quality of result
  Means to an end – only as good as the result
  Important to be open to refinements/relaxations that can be shown to better characterise real-world learning and translate into efficient algorithms
Conclusions and future directions
Extensions to learning data with output structure that can be used to inform the learning
More general subspace methods: algorithms and theoretical foundations
Extensions to more complex tasks, eg system identification, reinforcement learning
Linking learning into tasks with more detailed prior knowledge, eg stochastic differential equations for climate modelling
Extending the theoretical framework for more general learning tasks, eg learning a density function