November, 2006 CCKM'06 1
Kernel Methods: the Emergence of a Well-founded Machine Learning
John Shawe-Taylor
Centre for Computational Statistics and Machine Learning
University College London
Overview
Celebration of 10 years of kernel methods: what has been achieved and what can we learn from the experience?
Some historical perspectives: Theory or not? Applicable or not?
Some emphases:
  Role of theory – need for plurality of approaches
  Importance of scalability
Surprising fact: there are now 13,444 citations of Support Vectors reported by Google Scholar – the vast majority from papers not about machine learning!
Ratio of Neural Network to Support Vector hits in Google Scholar: 1987–1996: 220; 1997–2006: 11
Caveats
Personal perspective with inevitable bias
One very small slice through what is now a very big field
Focus on theory with emphasis on frequentist analysis
There is no pro-forma for scientific research
  But the role of theory is worth discussing
  Needed to give a firm foundation for proposed approaches?
Motivation behind kernel methods
Linear learning typically has nice properties:
  Unique optimal solutions
  Fast learning algorithms
  Better statistical analysis
But one big problem: insufficient capacity
Historical perspective
Minsky and Papert highlighted the weakness in their book Perceptrons
Neural networks overcame the problem by gluing together many linear units with non-linear activation functions
  Solved the problem of capacity and led to a very impressive extension of the applicability of learning
  But ran into training problems of speed and multiple local minima
Kernel methods approach
The kernel methods approach is to stick with linear functions but work in a high dimensional feature space:
The expectation is that the feature space has a much higher dimension than the input space.
Example
Consider the mapping
If we consider a linear equation in this feature space:
We actually have an ellipse – i.e. a non-linear shape in the input space.
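The mapping and the linear equation were shown as images on the slide and are lost from this transcript. The sketch below uses an assumed quadratic feature map (a standard choice, consistent with the later kernel example) to check the ellipse claim numerically:

```python
import numpy as np

# Assumed quadratic feature map phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

# Linear function <w, phi(x)> = 1 with w = (1/a^2, 1/b^2, 0):
# in the input space this is the ellipse x1^2/a^2 + x2^2/b^2 = 1
a, b = 2.0, 1.0
w = np.array([1.0 / a ** 2, 1.0 / b ** 2, 0.0])

# every point on the ellipse satisfies the linear equation in feature space
theta = np.linspace(0.0, 2.0 * np.pi, 100)
pts = np.stack([a * np.cos(theta), b * np.sin(theta)], axis=1)
vals = np.array([w @ phi(p) for p in pts])
print(np.allclose(vals, 1.0))  # True
```

A non-linear decision boundary in the input space is thus realised by a purely linear function in the feature space.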
Capacity of feature spaces
The capacity is proportional to the dimension – for example:
2-dim:
Form of the functions
So kernel methods use linear functions in a feature space:
For regression this could be the function itself
For classification we require thresholding
Problems of high dimensions
Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means unlikely to generalise well
Computational costs involved in dealing with large vectors
Overview
Two theoretical approaches converged on very similar algorithms:
  Frequentist led to the Support Vector Machine
  Bayesian approach led to Bayesian inference using Gaussian Processes
First we briefly discuss the Bayesian approach before mentioning some of the frequentist results
Bayesian approach
The Bayesian approach relies on a probabilistic analysis by positing:
  a noise model
  a prior distribution over the function class
Inference involves updating the prior distribution with the likelihood of the data
Possible outputs:
  MAP function
  Bayesian posterior average
Bayesian approach
Avoids overfitting by:
  Controlling the prior distribution
  Averaging over the posterior
For a Gaussian noise model (for regression) and a Gaussian process prior we obtain a ‘kernel’ method where:
  Kernel is the covariance of the prior GP
  Noise model translates into the addition of a ridge to the kernel matrix
  MAP and averaging give the same solution
Link with the infinite hidden node limit of single hidden layer Neural Networks – see the seminal paper: Williams, Computation with infinite neural networks (1997)
Surprising fact: the covariance (kernel) function that arises from infinitely many sigmoidal hidden units is not a sigmoidal kernel – indeed the sigmoidal kernel is not positive semi-definite and so cannot arise as a covariance function!
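This non-PSD claim is easy to verify numerically: even two points in one dimension give a tanh Gram matrix with a negative eigenvalue.

```python
import numpy as np

# Gram matrix of the "sigmoidal kernel" k(x, z) = tanh(x * z)
# on just two one-dimensional points
X = np.array([1.0, 2.0])
K = np.tanh(np.outer(X, X))
eigs = np.linalg.eigvalsh(K)
print(eigs.min())  # negative: tanh(<x, z>) is not positive semi-definite
```

Since a covariance function must produce positive semi-definite matrices on every point set, this single counterexample already rules the sigmoidal kernel out.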
Bayesian approach
Subject to assumptions about noise model and prior distribution:
  Can get error bars on the output
  Compute evidence for the model and use it for model selection
Approach developed for different noise models, eg classification
Typically requires approximate inference
Frequentist approach
Source of randomness is assumed to be a distribution that generates the training data i.i.d. – with the same distribution generating the test data
Different/weaker assumptions than the Bayesian approach – so more general but less analysis can typically be derived
Main focus is on generalisation error analysis
Capacity problem
What do we mean by generalisation?
Generalisation of a learner
Example of Generalisation
We consider the Breast Cancer dataset from the UCI repository
Use the simple Parzen window classifier: the weight vector is w = μ+ − μ−, where μ+ (μ−) is the average of the positive (negative) training examples.
The threshold is set so that the hyperplane bisects the line joining these two points.
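A minimal sketch of this classifier, run on synthetic 9-dimensional data rather than the actual UCI set:

```python
import numpy as np

def parzen_window_classifier(Xpos, Xneg):
    # weight vector: difference of the two class means
    mu_pos, mu_neg = Xpos.mean(axis=0), Xneg.mean(axis=0)
    w = mu_pos - mu_neg
    # threshold so that the hyperplane bisects the segment joining the means
    b = (mu_pos @ mu_pos - mu_neg @ mu_neg) / 2.0
    return lambda x: 1 if w @ x - b > 0 else -1

# two well-separated 9-dimensional clusters (9 = input dim of the UCI data)
rng = np.random.default_rng(0)
Xpos = rng.normal(loc=2.0, size=(50, 9))
Xneg = rng.normal(loc=-2.0, size=(50, 9))
clf = parzen_window_classifier(Xpos, Xneg)
acc = np.mean([clf(x) == 1 for x in Xpos] + [clf(x) == -1 for x in Xneg])
print(acc)
```

On such well-separated synthetic clusters the classifier is essentially perfect; the slides study how its error distribution degrades as the training set shrinks.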
Example of Generalisation
By repeatedly drawing random training sets S of size m we estimate the distribution of the generalisation error, by using the test set error as a proxy for the true generalisation
We plot the histogram and the average of the distribution for various sizes of training set:
  648, 342, 273, 205, 137, 68, 34, 27, 20, 14, 7.
Example of Generalisation
Since the expected classifier is in all cases the same, we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be exactly the same.
Error distributions: histograms (not reproduced) for the full dataset and for dataset sizes 342, 273, 205, 137, 68, 34, 27, 20, 14 and 7.
Observations
Things can get bad if number of training examples small compared to dimension (in this case input dimension is 9)
Mean can be bad predictor of true generalisation – i.e. things can look okay in expectation, but still go badly wrong
Key ingredient of learning – keep flexibility high while still ensuring good generalisation
Controlling generalisation
The critical method of controlling generalisation for classification is to force a large margin on the training data:
Intuitive and rigorous explanations
  Makes classification robust to uncertainties in inputs
  Can randomly project into lower dimensional spaces and still have separation – so effectively low dimensional
  Rigorous statistical analysis shows effective dimension
This is not structural risk minimisation over VC classes, since the hierarchy depends on the data: data-dependent structural risk minimisation
  see S-T, Bartlett, Williamson & Anthony (1996 and 1998)
Surprising fact: structural risk minimisation over VC classes does not provide a bound on the generalisation of SVMs, except in the transductive setting – and then only if the margin is measured on training and test data! Indeed SVMs were a wake-up call that classical PAC analysis was not capturing critical factors in real-world applications!
Learning framework
Since there are lower bounds in terms of the VC dimension, the margin is detecting a favourable distribution/task alignment – the luckiness framework captures this idea
Now consider using an SVM on the same data and compare the distribution of generalisations
  SVM distribution shown in red
Error distributions comparing the SVM (red) with the Parzen window classifier: histograms (not reproduced) for dataset sizes 205, 137, 68, 34, 27, 20, 14 and 7.
Handling training errors
So far we have only considered the case where the data can be separated
For non-separable sets we can introduce a penalty proportional to the amount by which a point fails to meet the margin
These amounts are often referred to as slack variables in optimisation theory
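In the usual formulation (assuming real-valued outputs f(x) and targets y = ±1), the slack of each point is ξ = max(0, 1 − y·f(x)):

```python
import numpy as np

# slack variable: how far each point falls short of the (functional) margin 1
def slacks(y, f_vals):
    return np.maximum(0.0, 1.0 - y * f_vals)

y = np.array([1.0, 1.0, -1.0, -1.0])
f_vals = np.array([2.0, 0.5, -3.0, 0.2])  # raw outputs f(x) for each point
xi = slacks(y, f_vals)
print(xi)  # [0.  0.5 0.  1.2]
```

A point comfortably beyond the margin has zero slack; a point inside the margin, or on the wrong side entirely, pays a penalty proportional to its shortfall.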
Support Vector Machines
SVM optimisation
Analysis of this case is given using an augmented space trick in S-T and Cristianini (1999 and 2002), On the generalisation of soft margin algorithms.
Complexity problem
Let’s apply the quadratic example to a 20x30 image of 600 pixels – this gives approximately 180,000 dimensions!
Would be computationally infeasible to work in this space
Dual representation
Suppose the weight vector is a linear combination of the training examples: w = Σi αi xi
Then we can evaluate the inner product with a new example: ⟨w, x⟩ = Σi αi ⟨xi, x⟩
Learning the dual variables
The αi are known as dual variables
Since any component orthogonal to the space spanned by the training data has no effect, there is a general result that weight vectors have a dual representation: the representer theorem
Hence, we can reformulate algorithms to learn the dual variables rather than the weight vector directly
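As an illustration of learning dual variables directly, here is a minimal kernel perceptron (a sketch, not an algorithm from the slides); it separates XOR-type data that no linear function in the input space can:

```python
import numpy as np

def kernel_perceptron(X, y, k, epochs=10):
    # learn the dual variables alpha_i directly; only kernel values are used
    n = len(y)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            if y[i] * ((alpha * y) @ K[:, i]) <= 0:  # mistake-driven update
                alpha[i] += 1.0
    return alpha

# XOR-type labels: not linearly separable in the input space, but separable
# with the inhomogeneous quadratic kernel k(u, v) = (<u, v> + 1)^2
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
k = lambda u, v: (u @ v + 1.0) ** 2
alpha = kernel_perceptron(X, y, k)
preds = np.sign([(alpha * y) @ np.array([k(xi, x) for xi in X]) for x in X])
print(np.array_equal(preds, y))  # True
```

The weight vector is never formed explicitly: both training and prediction touch the data only through kernel evaluations.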
Dual form of SVM
The dual form of the SVM can also be derived by taking the dual optimisation problem! This gives:
Note that threshold must be determined from border examples
Using kernels
Critical observation is that again only inner products are used
Suppose that we now have a shortcut method of computing:
Then we do not need to explicitly compute the feature vectors either in training or testing
Kernel example
As an example consider the mapping
Here we have a shortcut:
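The mapping itself is an image lost from the transcript, but assuming the quadratic map φ(x) = (x1², x2², √2·x1x2), the shortcut ⟨φ(x), φ(z)⟩ = ⟨x, z⟩² can be checked directly:

```python
import numpy as np

phi = lambda x: np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])
x, z = np.array([3.0, -1.0]), np.array([2.0, 5.0])
lhs = phi(x) @ phi(z)   # inner product computed in the feature space
rhs = (x @ z) ** 2      # the kernel shortcut, computed in the input space
print(np.isclose(lhs, rhs))  # True
```

The kernel evaluation costs one input-space inner product and a square, however large the feature space.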
Efficiency
Hence, in the pixel example rather than work with 180000 dimensional vectors, we compute a 600 dimensional inner product and then square the result!
Can even work in infinite dimensional spaces, eg using the Gaussian kernel:
Using the Gaussian kernel for the Breast Cancer data, sizes 273 and 342: error distributions (histograms not reproduced)
Surprising fact: kernel methods are invariant to rotations of the coordinate system – so any special information encoded in the choice of coordinates will be lost! Surprisingly, one of the most successful applications of SVMs has been for text classification, in which one would expect the encoding to be very informative!
Constraints on the kernel
There is a restriction on the function: the kernel matrix it generates must be positive semi-definite
This restriction holding for any training set is enough to guarantee the function is a kernel
Surprising fact: the property that guarantees the existence of a feature space is precisely the property required to ensure convexity of the resulting optimisation problem for SVMs, etc!
What have we achieved?
Replaced the problem of neural network architecture by kernel definition
  Arguably more natural to define, but the restriction is a bit unnatural
  Not a silver bullet as fit with data is key
  Can be applied to non-vectorial (or high dimensional) data
Gained more flexible regularisation/generalisation control
Gained a convex optimisation problem, i.e. NO local minima!
Historical note
First use of kernel methods in machine learning was in combination with the perceptron algorithm: Aizerman et al., Theoretical foundations of the potential function method in pattern recognition learning (1964)
Apparently failed because of generalisation issues – the margin was not optimised, but this can be incorporated with one extra parameter!
Surprising fact: good covering number bounds on the generalisation of SVMs use a bound on the updates of the perceptron algorithm to bound the covering numbers and hence generalisation – but the same bound can be applied directly to the perceptron without margin using a sparsity argument, so the bound is tighter for the classical perceptron!
PAC-Bayes: General perspectives
The goal of different theories is to capture the key elements that enable an understanding, analysis and learning of different phenomena
Several theories of machine learning: notably Bayesian and frequentist
Different assumptions and hence different range of applicability and range of results
Bayesian able to make more detailed probabilistic predictions
Frequentist makes only i.i.d. assumption
Evidence and generalisation
Evidence is a measure used in Bayesian analysis to inform model selection
Evidence is related to the posterior volume of weight space consistent with the sample
Link between evidence and generalisation hypothesised by MacKay
First such result was obtained by S-T & Williamson (1997): PAC Analysis of a Bayes Estimator
  Bound on generalisation in terms of the volume of the sphere that can be inscribed in the version space – included a dependence on the dimension of the space but applies to non-linear function classes
  Used the Luckiness framework, where luckiness is measured by the volume of the ball
PAC-Bayes Theorem
PAC-Bayes theorem has a similar flavour but bounds in terms of prior and posterior distributions
First version proved by McAllester in 1999
Improved proof and bound due to Seeger in 2002, with application to Gaussian processes
Application to SVMs by Langford and S-T, also in 2002
Gives the tightest bounds on generalisation for SVMs
  see Langford's tutorial in JMLR
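A sketch of how such a bound is evaluated in practice. The exact statement varies between versions; the form assumed below is the Seeger/Langford-style bound kl(empirical risk ‖ true risk) ≤ (KL(Q‖P) + ln(2√m/δ))/m, inverted numerically:

```python
import math

def binary_kl(q, p):
    # KL divergence between Bernoulli(q) and Bernoulli(p)
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(emp, rhs):
    # largest p with kl(emp || p) <= rhs, found by bisection
    lo, hi = emp, 1.0 - 1e-12
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binary_kl(emp, mid) > rhs:
            hi = mid
        else:
            lo = mid
    return hi

# illustrative numbers: m examples, confidence delta, KL(Q||P), empirical risk
m, delta, kl_qp, emp = 10000, 0.05, 20.0, 0.05
rhs = (kl_qp + math.log(2.0 * math.sqrt(m) / delta)) / m
bound = kl_inverse(emp, rhs)
print(emp < bound < 1.0)  # True: an upper bound on the true risk
```

The bound tightens as the posterior Q moves towards the prior P (smaller KL term) or as the sample size m grows.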
Examples of bound evaluation

Problem    PAC-Bayes    PAC-Bayes prior   Test Error
Wdbc       0.334±.005   0.306±.018        0.073±.021
Image      0.254±.003   0.184±.010        0.074±.014 (0.056±.01)
Waveform   0.198±.002   0.151±.005        0.089±.005
Ringnorm   0.212±.002   0.109±.024        0.026±.005

Surprising fact: optimising the bound does not typically improve the test error – despite significant improvements in the bound itself! Training an SVM on half the data to learn a prior and then using the rest to learn relative to this prior further improves the bound with almost no effect on the test error!
Principal Components Analysis (PCA)
Classical method is principal component analysis – looks for directions of maximum variance, given by the eigenvectors of the covariance matrix
Dual representation of PCA
Eigenvectors of kernel matrix give dual representation:
Means we can perform PCA projection in a kernel defined feature space: kernel PCA
Kernel PCA
Need to take care of normalisation to obtain:
where λ is the corresponding eigenvalue.
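A numerical sketch of the normalisation: with a linear kernel, projections computed from eigenvectors of the kernel matrix (scaled by 1/√λ) must agree with ordinary PCA:

```python
import numpy as np

# Kernel PCA sketch: eigenvectors alpha of the kernel matrix give the dual
# representation of the principal directions; dividing by sqrt(lambda)
# normalises each direction to unit length in the feature space.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
Xc = X - X.mean(axis=0)                  # centre the data
K = Xc @ Xc.T                            # linear kernel so we can cross-check
lam, A = np.linalg.eigh(K)               # eigenvalues in ascending order
lam, A = lam[::-1], A[:, ::-1]           # sort descending
proj = K @ A[:, :2] / np.sqrt(lam[:2])   # projections onto top 2 directions

# cross-check: for the linear kernel this must agree with ordinary PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(np.abs(proj), np.abs(U[:, :2] * S[:2])))  # True (up to sign)
```

With a non-linear kernel, the same dual computation performs PCA in the implicit feature space, which is the point of kernel PCA.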
Generalisation of k-PCA
How reliable are estimates obtained from a sample when working in such high dimensional spaces?
Using Rademacher complexity bounds one can show that if a low dimensional projection captures most of the variance in the training set, it will do well on test examples as well, even in high dimensional spaces
  see S-T, Williams, Cristianini and Kandola (2004) On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA
The bound also gives a new bound on the difference between the process and empirical eigenvalues
Surprising fact: no frequentist bounds apply for learning in the representation given by k-PCA – the data is used twice, once to find the subspace and then to learn the classifier/regressor! Intuitively, this shouldn’t be a problem since the labels are not used by k-PCA!
Latent Semantic Indexing
Developed in information retrieval to overcome the problem of semantic relations between words in the bag-of-words representation
Use Singular Value Decomposition of Term-doc matrix:
Lower dimensional representation
If we truncate the matrix U at k columns we obtain the best k dimensional representation in the least squares sense:
In this case we can write a ‘semantic matrix’ as
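A small sketch of the truncation on a toy (random) term–document matrix; by the Eckart–Young theorem, the Frobenius error of the rank-k reconstruction equals the energy in the discarded singular values:

```python
import numpy as np

# LSI sketch: truncate the SVD of a term-document matrix at k dimensions
rng = np.random.default_rng(2)
D = rng.random((40, 12))                   # rows = terms, columns = documents
U, S, Vt = np.linalg.svd(D, full_matrices=False)
k = 3
Dk = (U[:, :k] * S[:k]) @ Vt[:k]           # best rank-k approximation
Sem = U[:, :k].T                           # 'semantic matrix' as a projection
docs_k = Sem @ D                           # k-dim document representations
err = np.linalg.norm(D - Dk)
print(np.isclose(err, np.sqrt((S[k:] ** 2).sum())))  # True (Eckart-Young)
```

Retrieval then operates on the k-dimensional representations `docs_k` instead of the raw term vectors.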
Latent Semantic Kernels
Can perform the same transformation in a kernel defined feature space by performing an eigenvalue decomposition of the kernel matrix (this is equivalent to kernel PCA):
Related techniques
Number of related techniques:
  Probabilistic LSI (pLSI)
  Non-negative Matrix Factorisation (NMF)
  Multinomial PCA (mPCA)
  Discrete PCA (DPCA)
All can be viewed as alternative decompositions:
Different criteria
Vary by:
  Different constraints (eg non-negative entries)
  Different prior distributions (eg Dirichlet, Poisson)
  Different optimisation criteria (eg max likelihood, Bayesian)
Unlike LSI, these typically suffer from local minima and so require EM-type iterative algorithms to converge to solutions
Other subspace methods
Kernel partial Gram-Schmidt orthogonalisation is equivalent to incomplete Cholesky decomposition – greedy kernel PCA
Kernel Partial Least Squares implements a multi-dimensional regression algorithm popular in chemometrics – takes account of labels
Kernel Canonical Correlation Analysis uses paired datasets to learn a semantic representation independent of the two views
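The first of these can be sketched directly: a greedy incomplete Cholesky factorisation of the kernel matrix (the pivoting rule assumed here is the usual largest-residual choice):

```python
import numpy as np

def incomplete_cholesky(K, rank):
    # greedy pivoted (incomplete) Cholesky: K ~ R @ R.T
    n = K.shape[0]
    R = np.zeros((n, rank))
    d = np.diag(K).astype(float)          # residual norm of each point
    for j in range(rank):
        i = int(np.argmax(d))             # pick the point with largest residual
        R[:, j] = (K[:, i] - R @ R[i]) / np.sqrt(d[i])
        d -= R[:, j] ** 2
    return R

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))
K = (X @ X.T + 1.0) ** 2                  # quadratic kernel: rank at most 15
R = incomplete_cholesky(K, rank=15)
print(np.allclose(K, R @ R.T))            # True: 15 pivots recover K exactly
```

Stopping after far fewer pivots than the rank gives the cheap low-rank approximation that makes large-scale kernel methods tractable.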
Paired corpora
Can we use paired corpora to extract more information?
Two views of same semantic object – hypothesise that both views contain all of the necessary information, eg document and translation to a second language:
Aligned text: English documents E1, …, Ei, …, EN paired with French translations F1, …, Fi, …, FN (diagram not reproduced).
Canadian parliament corpus
LAND MINES
Ms. Beth Phinney (Hamilton Mountain, Lib.): Mr. Speaker, we are pleased that the Nobel peace prize has been given to those working to ban land mines worldwide. We hope this award will encourage the United States to join the over 100 countries planning to come to …
LES MINES ANTIPERSONNEL
Mme Beth Phinney (Hamilton Mountain, Lib.): Monsieur le Président, nous nous réjouissons du fait que le prix Nobel ait été attribué à ceux qui oeuvrent en faveur de l'interdiction des mines antipersonnel dans le monde entier. Nous espérons que cela incitera les Américains à se joindre aux représentants de plus de 100 pays qui ont l'intention de venir à …
cross-lingual LSI via SVD
Stack the English and French term–document matrices and take a single SVD, so both languages share one latent space:
  [E; F] = U D V',  with U = [U_E; U_F]
M. L. Littman, S. T. Dumais, and T. K. Landauer. Automatic cross-language information retrieval using latent semantic indexing. In G. Grefenstette, editor, Cross-language information retrieval. Kluwer, 1998.
cross-lingual kernel canonical correlation analysis
The input "English" and "French" spaces are mapped by Φ(x) into feature spaces, in which paired projection directions fE1, fE2, … and fF1, fF2, … are sought (diagram not reproduced).
kernel canonical correlation analysis

  max_{fE, fF} corr(fE(E), fF(F))

with the dual representations fE = Σl αlE El and fF = Σl αlF Fl, leading to the generalised eigenproblem B α = ρ D α with

  B = [ O      KE KF ]     D = [ KE²  O   ]
      [ KF KE  O     ]         [ O    KF² ]
regularization
Using kernel functions may result in overfitting: we need to control the flexibility of the projections fE and fF by adding a diagonal to the matrix D:

  D = [ (KE + κI)²  O           ]
      [ O           (KF + κI)²  ]

where κ is the regularization parameter
Theoretical analysis shows that provided the norms of the weight vectors are small, the correlation will still hold for new data
  see S-T and Cristianini (2004) Kernel Methods for Pattern Analysis
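A numerical sketch of regularised kernel CCA on toy paired views. Rather than the full generalised eigenproblem, it uses the equivalent formulation in which the regularised correlations are the singular values of (KE + κI)⁻¹ KE KF (KF + κI)⁻¹; all data here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 25
z = rng.normal(size=(n, 1))                  # shared latent signal
E = z + 0.1 * rng.normal(size=(n, 3))        # "English" view (toy)
F = -z + 0.1 * rng.normal(size=(n, 3))       # "French" view (toy)
KE, KF = E @ E.T, F @ F.T                    # linear kernels
kappa = 0.1                                  # regularisation parameter
I = np.eye(n)
# regularised correlations = singular values of
#   (KE + kappa*I)^{-1} KE KF (KF + kappa*I)^{-1}
M = np.linalg.solve(KE + kappa * I, KE @ KF) @ np.linalg.inv(KF + kappa * I)
rho = np.linalg.svd(M, compute_uv=False)
print(rho.max() <= 1.0, rho.max() > 0.5)     # strong shared signal: rho near 1
```

The κ term caps the correlations at 1 and prevents the spurious perfect correlations that unregularised KCCA finds whenever the kernel matrices are full rank.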
pseudo query test
A query qEi is generated from each English test document Ei and used to retrieve among the French documents F1, …, FN (diagram not reproduced).
Queries were generated from each test document by extracting 5 words with the highest TFIDF weights and using them as a query.
Experimental Results
The goal was to retrieve the paired document.
Experimental procedure:
(1) LSI/KCCA trained on paired documents,
(2) All test documents projected into the LSI/KCCA semantic space,
(3) Each query was projected into the LSI/KCCA semantic space and documents were retrieved using nearest neighbour based on cosine distance to the query.
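Step (3) can be sketched as a cosine-distance nearest-neighbour lookup in the semantic space (toy vectors standing in for the projected documents):

```python
import numpy as np

def cosine_retrieve(query_vec, doc_vecs):
    # nearest neighbour under cosine distance = largest cosine similarity
    qn = query_vec / np.linalg.norm(query_vec)
    dn = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return int(np.argmax(dn @ qn))

rng = np.random.default_rng(5)
docs = rng.normal(size=(50, 10))     # documents in a 10-dim semantic space
target = 7
query = docs[target] + 0.01 * rng.normal(size=10)   # noisy pseudo-query
print(cosine_retrieve(query, docs) == target)  # True
```

Cosine similarity ignores vector length, which matters here because projected documents of different lengths can have very different norms.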
English–French retrieval accuracy, %

          100     200     300     400     500
CL-LSI    29.95   38.20   42.44   45.35   47.11
CL-KCCA   67.88   74.92   77.56   78.87   79.79
Applying to different data types
Data
  Combined image and associated text obtained from the web
  Three categories: sport, aviation and paintball
  400 examples from each category (1200 overall)
  Features extracted: HSV, Texture, Bag of words
Tasks
  1. Classification of web pages into the 3 categories
  2. Text query -> image retrieval
Classification error of baseline method

Colour   Texture   Colour + Texture   Text   All
22.9%    18.3%     13.6%              9.0%   3.0%

• Previous error rates obtained using probabilistic ICA classification done for single feature groups
Classification rates using KCCA

                  Error rate
Plain SVM         2.13% ± 0.23%
KCCA-SVM (150)    1.36% ± 0.15%
KCCA-SVM (200)    1.21% ± 0.27%
Classification with multi-views
If we use KCCA to generate a semantic feature space and then learn with an SVM, we can envisage combining the two steps into a single ‘SVM-2k’:
  learns two SVMs, one on each representation, but constrains their outputs to be similar across the training (and any unlabelled) data
  can give classification for single mode test data – applied to patent classification for Japanese patents
Again theoretical analysis predicts good generalisation if the training SVMs have a good match, while the two representations have a small overlap
  see Farquhar, Hardoon, Meng, S-T, Szedmak (2005) Two view learning: SVM-2K, Theory and Practice
Results

      pSVM       kcca_SVM   SVM        SVM_2k_j   Concat     SVM_2k
1     59.4±3.9   60.3±2.8   66.6±2.8   66.1±2.6   67.5±2.3   67.5±2.1
2     71.1±4.5   68.4±4.4   73.0±4.0   74.8±4.7   73.9±4.0   75.1±4.1
3     16.7±1.2   13.1±1.0   18.8±1.6   20.8±1.9   21.5±1.9   22.5±1.7
7     74.9±1.8   76.0±1.2   76.7±1.3   77.5±1.4   79.0±1.2   80.7±1.5
12    75.0±0.8   73.6±0.8   76.8±1.0   77.6±0.7   76.8±0.6   78.4±0.6
14    76.0±1.6   71.5±1.5   80.9±1.3   82.2±1.3   81.4±1.4   82.7±1.3

Key: pSVM – dual variables; kcca_SVM – KCCA+SVM; SVM – direct SVM; SVM_2k_j – cross-lingual SVM-2k; Concat – concatenation+SVM; SVM_2k – co-lingual SVM-2k
Targeted Density Learning
Learning a probability density function in an L1 sense is hard
Learning for a single classification or regression task is ‘easy’
Consider a set of tasks (Touchstone Class) F that we think may arise, with a distribution emphasising those more likely to be seen
Now learn a density that does well for the tasks in F
Surprisingly, it can be shown that just including a small subset of tasks gives us convergence over all the tasks
Example plot
Consider fitting the cumulative distribution up to sample points, as proposed by Mukherjee and Vapnik.
Plot shows loss as a function of the number of constraints (tasks) included (figure not reproduced):
Idealised view of progress
Study problem to develop a theoretical model ->
Derive analysis that indicates factors that affect solution quality ->
Translate into optimisation maximising those factors – relaxing to ensure convexity ->
Develop efficient solutions using specifics of the task
Role of theory
Theoretical analysis plays a critical auxiliary role:
  Can never be the complete picture
  Needs to capture the factors that most influence quality of result
  Means to an end – only as good as the result
  Important to be open to refinements/relaxations that can be shown to better characterise real-world learning and translate into efficient algorithms
Conclusions and future directions
Extensions to learning data with output structure that can be used to inform the learning
More general subspace methods: algorithms and theoretical foundations
Extensions to more complex tasks, eg system identification, reinforcement learning
Linking learning into tasks with more detailed prior knowledge, eg stochastic differential equations for climate modelling
Extending the theoretical framework for more general learning tasks, eg learning a density function