Dr. Orla Doyle · [email protected] · NEWMEDS
Humans are good at pattern recognition:
• Face recognition
• Reading text/handwriting
• Recognising food by smell
We would like to give similar capabilities to machines.
What is machine learning?
• Building models of data for:
  – Predicting categorical variables (classification)
  – Predicting numerical variables (regression)
  – Searching for groupings in the data (unsupervised learning)
  – Learning from delayed feedback (reinforcement learning)
[Figure: data matrix 𝑿 with corresponding labels 𝒚]
Cross Validation

[Figure: leave-one-subject-out cross-validation. Repeated measures across two conditions A and B in 10 subjects; each fold holds out all the data from one subject (TEST) and trains on the remaining subjects (TRAINING).]
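The subject-wise splitting in the figure can be sketched as follows; the subject IDs and two-samples-per-subject layout below are illustrative, not from any real dataset.

```python
# Leave-one-subject-out cross-validation: all samples from one subject
# form the test set; samples from the remaining subjects form the
# training set. This respects the repeated-measures structure.
def leave_one_subject_out(subject_ids):
    subjects = sorted(set(subject_ids))
    for held_out in subjects:
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield train, test

# Two repeated measures (conditions A and B) per subject, 3 subjects:
ids = [1, 1, 2, 2, 3, 3]
folds = list(leave_one_subject_out(ids))
# The first fold holds out both samples of subject 1.
```

Splitting by subject (rather than by sample) prevents data from the same subject leaking between training and test sets.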
Machine learning for neuroimaging
Univariate GLM: are there voxels that reflect the stimulus?
1. GLM
2. T-test
3. Correction for multiple comparisons

[Figure: BOLD signal over time for a single voxel]
Multivariate Pattern Recognition: is the pattern of voxels predictive?
[Figure: a classifier is trained on labelled patterns; the trained classifier then makes predictions for new data]
Type of Pattern Recognition
• Binary (controls vs. patients): support vector machine (SVM), Gaussian process classification (GPC), neural networks
• Multi-class (placebo vs. drug 1 vs. drug 2): Gaussian process classification (GPC), sparse multinomial logistic regression (SMLR)
• Real-valued (age, VAS scores): kernel ridge regression, Gaussian process regression
• Ordinal (symptom severity, drug dose): ordinal regression using Gaussian processes
• Event (time to disease onset): Cox regression
An example
• Problem: sorting incoming fruit according to type.
• Assumption: there are two types of fruit, oranges and lemons.
How do we describe the fruit to the computer?
Features:
• dimensions (e.g. height and width)
SVM
• Presented with a set of fruit how can we train an algorithm to classify the type?
Method:
1. Choose two ‘discriminating’ features.
2. Plot in the input feature space.
3. Here, the examples are linearly separable by a hyperplane: wTx + b = 0.
4. Define a ‘positive’ class and a ‘negative’ class.

[Figure: oranges and lemons plotted in the height–width feature space]
The weight vector
• Positive class region: 𝒘T𝐱 + b > 0, i.e. 𝑤1𝑥1 + 𝑤2𝑥2 + 𝑏 > 0.
• Note that 𝒘 is perpendicular (normal) to the decision hyperplane: for any two points 𝒙ᵢ, 𝒙ⱼ lying on the plane, 𝒘T(𝒙ᵢ − 𝒙ⱼ) = 0.
Quantifies feature importance!
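In two dimensions the decision rule is just two multiplications and an addition; a minimal sketch with made-up weights (w and b below are illustrative, not learned from data):

```python
# Linear decision function for a 2-D classifier: sign(w1*x1 + w2*x2 + b).
# The weight vector w is perpendicular to the decision hyperplane.
w = (0.8, -0.5)   # illustrative weights
b = -0.1          # illustrative bias

def decision(x):
    score = w[0] * x[0] + w[1] * x[1] + b
    return 1 if score > 0 else -1  # positive vs. negative class

# A point deep in the positive half-space:
label = decision((2.0, 0.0))  # score = 0.8*2 - 0.1 = 1.5 > 0, so +1
```

The magnitude of each weight reflects how strongly the corresponding feature moves the score, which is the sense in which w quantifies feature importance.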
A 2-D Example
But which linear hyperplane is the optimal choice? Note: we want the solution to work well on unseen data (good generalisation).

[Figure: several candidate separating hyperplanes between the two classes in the height–width feature space]
SVM
But which linear hyperplane is the optimal choice? The one which maximises the margin (the distance between the closest point and the hyperplane).
The margin is equal to the distance between the two dashed hyperplanes with equations:

wTx + b = -1 and wTx + b = +1

Therefore, the margin is defined as 2 / ‖w‖.
SVM
Optimisation problem becomes:

Minimise ½‖w‖²
Subject to yi(wTxi + b) ≥ 1, ∀i
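Given any weight vector, the margin 2/‖w‖ is a one-line computation; a quick numeric check with an illustrative w:

```python
import math

# Margin of a hard-margin SVM: the distance between the hyperplanes
# wTx + b = +1 and wTx + b = -1 is 2 / ||w||.
def margin(w):
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm

# Illustrative weight vector with ||w|| = 5:
m = margin((3.0, 4.0))  # 2 / 5 = 0.4
```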
Note the dot product xiTxj: in the dual formulation, the size of the problem depends on the number of samples rather than on the dimensionality of the data!
Solved using Lagrange multipliers to give the dual problem:

Maximise ∑iαi - ½∑i∑jαiαjyiyjxiTxj
Subject to ∑iαiyi = 0 and 0 ≤ αi ≤ C, ∀i
• What if the data are not linearly separable? (This is almost always the case.) Then ‘slack’ variables ξi are introduced, and the optimisation problem becomes:

Minimise ½‖w‖² + C∑iξi
Subject to yi(wTxi + b) ≥ 1 - ξi, ∀i
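The soft-margin problem above can equivalently be written as minimising ½‖w‖² + C·∑ᵢ max(0, 1 − yᵢ(wTxᵢ + b)), which a subgradient method can attack directly. A minimal sketch, with toy fruit-like data that is made up for illustration:

```python
# Soft-margin linear SVM trained by subgradient descent on the
# hinge-loss objective (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b)).
# Data and hyperparameters below are illustrative.
def train_svm(X, y, C=10.0, lr=0.01, epochs=200):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = w[0] * xi[0] + w[1] * xi[1] + b
            if yi * score < 1:              # margin violated: hinge active
                w[0] += lr * (C * yi * xi[0] - w[0])
                w[1] += lr * (C * yi * xi[1] - w[1])
                b += lr * C * yi
            else:                            # only the regulariser acts
                w[0] -= lr * w[0]
                w[1] -= lr * w[1]
    return w, b

# Well-separated toy data ('lemons' = -1 vs 'oranges' = +1):
X = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (6.0, 5.0)]
y = [-1, -1, 1, 1]
w, b = train_svm(X, y)
preds = [1 if w[0] * a + w[1] * c + b > 0 else -1 for a, c in X]
```

Because the objective is convex, this simple method converges to (a neighbourhood of) the global optimum; a production implementation would use a dedicated QP or SMO solver.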
Real world data

[Figure: overlapping classes in the height–width feature space; no hyperplane separates them perfectly]
• What does C do? It controls the trade-off between training error and margin:
  – Large C → very low training error, but the model may have overfitted the training data.
  – Small C → very large margin, but the model may be a poor representation of the training data.
• How do we choose it? By nested cross validation: select the C that minimises the validation error on the inner folds.
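Nested cross-validation means an inner CV loop, over candidate C values, inside each outer training fold; an index-level sketch in which `evaluate` is a hypothetical stand-in for "train with this C on these indices and return validation accuracy":

```python
# Nested cross-validation sketch: the inner loop picks C using the
# outer training indices only, so the outer test fold never
# influences the choice of C.
def pick_C(train_idx, candidates, evaluate, k=3):
    best_C, best_score = None, float("-inf")
    for C in candidates:
        scores = []
        for fold in range(k):
            val = train_idx[fold::k]                      # inner validation fold
            inner_train = [i for i in train_idx if i not in val]
            scores.append(evaluate(inner_train, val, C))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_C, best_score = C, mean
    return best_C

# Toy stand-in scorer: pretend C = 1.0 always validates best.
fake_eval = lambda tr, va, C: 1.0 if C == 1.0 else 0.5
chosen = pick_C(list(range(12)), [0.1, 1.0, 10.0], fake_eval)
```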
Kernels
• Kernels are used as similarity measures between samples (e.g. subjects).
• Classification is performed in this ‘similarity’ space.
• Linear kernel: each element of the kernel matrix is the dot product of two samples, computed over all samples in a pairwise fashion.
• Non-linear kernels:
  – Help linear models perform well in non-linear settings.
  – Map input data to higher dimensions where it exhibits linear patterns.
Kernels
Consider the binary classification problem:
• Each example is represented by a single feature 𝒙.
• No adequate linear model exists.

Instead, transform each example: 𝒙 → (𝒙, 𝒙²). Each example now has two features, and the data are linearly separable in this higher-dimensional space.
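The 1-D example above can be checked numerically: a class that occupies an interval in the middle of the line cannot be split by any single threshold on x, but after mapping x → (x, x²) a linear rule works. The data points and the threshold below are illustrative.

```python
# 1-D data where class +1 sits between the two groups of class -1:
# no threshold on x alone separates them, but after x -> (x, x^2)
# the linear rule "x^2 < 4" (i.e. w = (0, 1), b = -4) does.
xs = [-3.0, -2.5, -1.0, 0.0, 1.0, 2.5, 3.0]
ys = [-1, -1, 1, 1, 1, -1, -1]

def phi(x):
    return (x, x * x)

# Separating hyperplane in the transformed space: +1 iff x^2 < 4.
preds = [1 if phi(x)[1] < 4.0 else -1 for x in xs]
```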
Kernels
• Each example is transformed from having 2 features to 3 features:

x → Φ(x): 𝒙 = (𝑥1, 𝑥2) → {𝑥1², √2 𝑥1𝑥2, 𝑥2²}

• A complex boundary in the input feature space is replaced by a linear hyperplane in the kernel feature space.
Kernels
• Directly mapping to the transformed feature space could lead to a large computational cost as the number of features increases.
• The “kernel trick” helps us avoid this by not requiring the mapping to be computed explicitly.
For the mapping x → Φ(x), 𝒙 = (𝑥1, 𝑥2) → {𝑥1², √2 𝑥1𝑥2, 𝑥2²}, the kernel is

𝐾(𝒙, 𝒙′) = 𝜙(𝒙)T𝜙(𝒙′) = {𝑥1², √2 𝑥1𝑥2, 𝑥2²}T{𝑥′1², √2 𝑥′1𝑥′2, 𝑥′2²} = (𝒙T𝒙′)²

Other kernels:
• Polynomial: 𝐾(𝒙, 𝒙′) = (𝒙T𝒙′ + c)ᵈ
• Radial basis function: 𝐾(𝒙, 𝒙′) = exp(−‖𝒙 − 𝒙′‖² / 2σ²)
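The identity above is easy to verify numerically: the polynomial kernel (xTx′)² gives the same value as the dot product of the explicit maps, without ever constructing them. The two test points are arbitrary.

```python
import math

# Explicit feature map for the degree-2 polynomial kernel in 2-D.
def phi(x):
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

# The same similarity computed implicitly (the "kernel trick").
def poly_kernel(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
implicit = poly_kernel(x, z)
# explicit == implicit up to floating-point error (both are 16 here)
```

The implicit version touches only the 2 original features per pair, while the explicit map needs 3; for higher polynomial degrees and dimensions this gap grows combinatorially, which is the point of the trick.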
Kernels
If the data are not linearly separable we could map the data to a higher-dimensional space, x → Φ(x), so that K(xi, xj) = Φ(xi)TΦ(xj), where K(xi, xj) is a kernel function. Therefore, we only need to know the scalar function K(xi, xj).

This makes reverse inference more challenging:
• For linear kernels we can visualise the weight vector directly.
• For non-linear kernels, since we do not know Φ(x) explicitly, we need to approximate the mapping where possible.
Nonlinear SVMs
• To train a non-linear SVM using a radial basis function kernel we need to optimise both C and the kernel width σ. This is achieved by minimising the validation error within nested cross validation.
• Large σ → close to a linear hyperplane (underfitted).
• Small σ → highly complex boundary (difficult to generalise, overfitted).
Overfitting
• The model is more flexible than we need!
• Addressing overfitting:
  – Perform cross validation.
  – Use methods with regularisation, i.e. a penalty for many parameters that restricts the set of solutions to a particular space.
Advantages of SVMs
• Exhibit good generalisation.
• Learning involves optimisation of a convex function, i.e. no local minima!
• They can handle very high-dimensional datasets.
• The formulation of the SVM enables the use of kernels.
Issues
• Multi-class classification requires a workaround (e.g. combining binary classifiers).
• SVM outputs are generally discrete; transformations to probabilities exist but are not very robust.
• How to choose the kernel?
• Training time, e.g. a two-parameter grid search combined with a large number of examples.
Seminal SVM and fMRI paper
• Seminal paper in applying pattern recognition to fMRI.
• 13 subjects.
• Reading a sentence versus looking at a picture with geometrical shapes.
• Leave-one-out cross validation.
• Mean activation in 7 ROIs used as the input to SVM, Gaussian naïve Bayes and nearest-neighbour classifiers.
Structural MRI and SVMs for diagnosis
• Use of T1-weighted MR scans to diagnose Alzheimer’s disease (AD).
• Group I: 20 healthy, 20 AD.
• Group II: 14 healthy, 14 AD.
• Group III: 57 healthy, 33 probable AD.
• Group IV: 18 AD, 19 frontotemporal lobar degeneration (FTLD).
• SVM applied to normalised grey matter images.
• Probabilistic predictions
• Access to the ‘model evidence’:
  – Elegant approach for model selection (no need for nested cross validation)
  – Elegant approach for model comparison
• Incorporation of prior information
• Two ways to ‘map’:
  – A map illustrating the features most important for the discrimination, i.e. at the boundary between the classes
  – A map illustrating the distribution between the classes, i.e. most similar to the classic ‘t-map’
Learning with Gaussian Processes
Gaussian Processes
A Gaussian process is like a Gaussian distribution with infinitely many variables.

Define a latent function as a Gaussian process over our data:

𝑓(𝒙) ~ 𝒢𝒫(𝑚(𝒙), 𝑘(𝒙, 𝒙′))

• The latent function f relates the data to the label.
• 𝑚(𝒙) is the mean function and 𝑘(𝒙, 𝒙′) is the covariance function.
• Compare: in 1-D, f ∼ N(μ, σ²) with a mean and a variance; in n dimensions, f ∼ N(𝝁, K) with a mean vector and a covariance matrix.
Gaussian Processes
Mean function: e.g. 𝑚(𝒙) ≡ 0 or 𝑚(𝒙) ≡ 𝑐.

The covariance function encodes knowledge about the similarities between data points. It assumes that training samples ‘close’ to a test point should be informative. Examples:

𝑘(𝒙, 𝒙′) = 𝒙·𝒙′,  𝑘(𝒙, 𝒙′) = (𝒙·𝒙′ + ℓ)/𝑠²,  𝑘(𝒙, 𝒙′) = exp(−‖𝒙 − 𝒙′‖² / 2σ²)

We refer to the parameters of these functions as hyperparameters and collect them in the vector 𝜽.
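The squared-exponential covariance above can be evaluated for a small set of inputs to see what it encodes; the three 1-D inputs below are arbitrary.

```python
import math

# Squared-exponential (RBF) covariance function from the slide:
# k(x, x') = exp(-|x - x'|^2 / (2 * sigma^2)).
def k(x, xp, sigma=1.0):
    return math.exp(-((x - xp) ** 2) / (2.0 * sigma ** 2))

# Covariance matrix for three 1-D inputs:
xs = [0.0, 1.0, 3.0]
K = [[k(a, b) for b in xs] for a in xs]
# Nearby inputs are highly correlated; distant inputs are nearly
# independent, which is exactly the "close points are informative"
# assumption.
```

The length-scale σ is a hyperparameter in 𝜽: larger σ makes distant points look similar (smoother functions), smaller σ makes the correlation fall off quickly.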
Learning with Gaussian Processes
Provides:
– probabilistic predictions
– an elegant framework for model selection and model comparison

Data: 𝒟 = {𝑿, 𝒚}
Model: 𝑦 = 𝑓𝑤(𝐱) + 𝜖

Here, we learn the function f using Bayesian inference.
1. Place a GP prior on the function 𝑓.
2. Observe the data.
3. Compute the posterior distribution over f.
4. Compute the predictive distribution for a test case.
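For a Gaussian likelihood, the four steps above have a closed form: the predictive mean at a test input is k*T(K + σₙ²I)⁻¹y. A minimal 1-D sketch with toy observations and near-zero noise, so the posterior mean should interpolate the training targets (the tiny linear solver is included to keep the block self-contained):

```python
import math

# RBF covariance, as on the earlier slide.
def k(x, xp, sigma=1.0):
    return math.exp(-((x - xp) ** 2) / (2.0 * sigma ** 2))

def solve(A, b):
    # Gaussian elimination with partial pivoting (fine for tiny systems).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_predict(X, y, x_star, noise=1e-6):
    # Posterior predictive mean: k_*^T (K + noise*I)^{-1} y.
    K = [[k(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    alpha = solve(K, y)
    return sum(k(x_star, xi) * ai for xi, ai in zip(X, alpha))

X, y = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]   # toy observations
m = gp_predict(X, y, 1.0)                  # ~1.0: interpolates at x = 1
```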
Gaussian Process Learning
𝑝(𝒇|𝒟, 𝜽) = 𝑝(𝒚|𝒇, 𝜽) 𝑝(𝒇|𝜽) / 𝑝(𝒟|𝜽)

• Posterior over 𝒇: used to compute the predictive distribution for a test case.
• Likelihood function 𝑝(𝒚|𝒇, 𝜽): determines the type of learning (classification, regression, etc.); selected by the user.
• Gaussian process prior 𝑝(𝒇|𝜽): encodes the functional form of 𝒇, analogous to the role of the kernel in SVMs.
• Marginal likelihood 𝑝(𝒟|𝜽): often used to optimise the hyperparameters, by maximising the marginal likelihood (equivalently, minimising its negative log). It can also be used for model comparison (e.g. between different likelihoods) as long as 𝒟 is held fixed.
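For a Gaussian likelihood the log marginal likelihood is log p(y|X,θ) = −½ yT C⁻¹ y − ½ log|C| − (n/2) log 2π with C = K + σₙ²I, and for n = 2 everything is closed-form. A sketch showing hyperparameter selection: with two equal observations at nearby inputs, a long length-scale explains them jointly and scores higher than a very short one. All numbers below are illustrative.

```python
import math

# Log marginal likelihood of a GP with two 1-D observations:
# log p(y|X,theta) = -1/2 y^T C^-1 y - 1/2 log|C| - (n/2) log(2*pi),
# where C = K + noise*I and K uses the RBF covariance.
def log_marginal_likelihood(x1, x2, y, sigma=1.0, noise=0.1):
    k12 = math.exp(-((x1 - x2) ** 2) / (2.0 * sigma ** 2))
    a, b, d = 1.0 + noise, k12, 1.0 + noise       # C = [[a, b], [b, d]]
    det = a * d - b * b
    # C^{-1} y via the 2x2 inverse:
    inv_y = [(d * y[0] - b * y[1]) / det, (-b * y[0] + a * y[1]) / det]
    quad = y[0] * inv_y[0] + y[1] * inv_y[1]
    return -0.5 * quad - 0.5 * math.log(det) - math.log(2.0 * math.pi)

# A very short length-scale treats the two equal observations as
# independent; a longer one explains them jointly and scores higher.
lml_long = log_marginal_likelihood(0.0, 0.5, [1.0, 1.0], sigma=2.0)
lml_short = log_marginal_likelihood(0.0, 0.5, [1.0, 1.0], sigma=0.1)
```

Comparing such scores across hyperparameter settings (or across likelihoods) is exactly the model selection / comparison use of the evidence described above, with no nested cross validation needed.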
Gaussian Process Classification
• For regression the likelihood function is Gaussian, so the posterior and predictions can be computed analytically.
• However, for classification the likelihood is non-Gaussian (discrete labels).
• In this case 𝒇 is a latent function which is ‘squashed’ to the interval [0, 1] using a link function to provide probabilistic predictions.
• Both the posterior and the predictive distribution are now intractable, so we need to use approximations (e.g. the Laplace approximation or expectation propagation).
Example GP model for fMRI
[Figure: fMRI timeseries with images labelled Task 1 and Task 2; a GP prior over the latent function is updated to a posterior given the training data. For a test image with unknown label, we integrate over the latent function to make a probabilistic prediction p(y|f, X, θ).]
Some examples of GPs at work.
• Clinical application of GPs for discriminating healthy subjects from those at risk of bipolar disorder.
• Pharmacological application of GPs for quantifying the BOLD response to ketamine.
GP application
• Use of fMRI to predict prognosis in a sample genetically at risk for bipolar disorder.
• Subjects: 16 healthy, 16 at risk.
• Gaussian process classification was applied to subjects’ fMRI response to viewing neutral faces.
• 75% classification accuracy was achieved.
• Predictive probabilities were found to be higher for those who subsequently developed symptoms.
Imaging Ketamine
Ketamine
• NMDA antagonist.
• At sub-anaesthetic doses it can induce symptoms resembling schizophrenia in healthy humans.
Deakin et al. 2008, Arch Gen Psychiatry
Aims of the study
1. Can the ketamine BOLD response be modulated?
2. How can we quantify the degree of modulation?
[Figure: scan session timeline with cASL, cognitive tasks, phMRI (5, 10 and 15 min markers) and cASL blocks. Four sessions: 1. placebo + saline, 2. placebo + ketamine, 3. pre-treatment + ketamine, 4. pre-treatment + ketamine.]
Lamotrigine
• Anticonvulsant.
• Inhibits voltage-gated ion channels, with downstream effects resulting in inhibition of glutamate release.
• Behavioural data (Anand et al. 2000) and imaging data (Deakin et al. 2008) showed that it can attenuate the effects of ketamine.
Risperidone
• Antipsychotic.
• High affinities for D2 and 5-HT2A receptors.
• Hypothesised to reduce glutamate in the cortex via 5-HT2A antagonism.
• No human data on its effect on a ketamine challenge.
Repeated measures pharmacological data:
• Session 1: placebo
• Session 2: active compound
• Session 3: inhibitory compound + active compound
[Figure: a binary GPC is trained on PLA vs. KET; the trained GPC is then applied to the intermediate PRE+KET condition to obtain the probability of KET.]
• But we haven’t modelled the intermediate class.
• How can we use these data to inform the model?
• Regression:
  – Encode the labels as [-1 0 1] or [1 2 3], etc.
  – Enforces a metric notion of distance between the classes.
• Multi-class classification:
  – Assumes the labels are nominal (unordered).
• A solution: ordinal regression.
PLA PRE+KET KET
Examples of ranked data
• Drug dose: placebo, 50 mg, 150 mg
• Disease state: controls, ARMS (at-risk mental state), first episode
• Disease progression: controls, MCI-s, MCI-c, Alzheimer’s
• Visual stimulus frequency: low, medium, high
Multivariate Ordinal Regression
• Models the natural ordering in labels.
• Bridges classification and metric regression.

Existing approaches:
1. A binary classifier applied to pairwise (exhaustive) difference images; requires that each class is present in the test data to decode the ranking.
2. Reduce the problem to a set of binary classifiers; create consensus labels by combining the classifiers using an ordinal ranking rule.

Ordinal regression using Gaussian processes instead provides:
• Probabilistic predictions
• Considers all classes simultaneously
• A kernel framework
• An elegant solution for model selection and comparison

𝑝(𝒇|𝒟, 𝜽) = 𝑝(𝒚|𝒇, 𝜽) 𝑝(𝒇|𝜽) / 𝑝(𝒟|𝜽)

Prediction: arg maxᵢ 𝑝(𝑦* = 𝑖 | 𝒙*, 𝑿, 𝒚, 𝜽)
Multivariate Ordinal Regression
• Validated on two exemplar pharmacological datasets:
  – Ketamine study: do we observe a whole-brain ordinal response for 1) placebo - lamotrigine+ketamine - ketamine and 2) placebo - risperidone+ketamine - ketamine?
  – Scopolamine study: perfusion data (pCASL) acquired from 15 healthy volunteers on three visits: placebo, donepezil+scopolamine and scopolamine.
• Compare ordinal regression using Gaussian processes (ORGP) with:
  – Multi-class classification using binary classifiers and error-correcting codes (state-of-the-art).
  – Inherently multi-class GP classification (considers all classes simultaneously).
Performance

Metric          LAM (ORGP)  LAM (PMCGP)  LAM (MCGP)  RIS (ORGP)  RIS (PMCGP)  RIS (MCGP)
Accuracy        72.9%*†‡    60.4%*       56.3%*      60.4%*      56.3%*       56.3%*
Kendall’s tau   0.70        0.61         0.53        0.57        0.61         0.61
AIC             86.9        -            93.5        89.7        -            95.5
Confusion matrices for the lamotrigine study (rows: actual PLA / INT / KET, 16 examples per class; columns: predicted PLA / INT / KET):

ORGP:           MCGP:
12  4  0        15  1  0
 3 10  3         5  1 11
 1  2 13         4  1 11
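The accuracy and Kendall's tau figures in these tables can be computed directly from actual and predicted ordinal labels; a sketch with made-up labels (0 = PLA, 1 = INT, 2 = KET), using the simple tau-a variant:

```python
# Accuracy and Kendall's tau (tau-a) between actual and predicted
# ordinal labels. The label vectors below are illustrative only.
def accuracy(actual, predicted):
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

def kendall_tau(actual, predicted):
    # Count label pairs ranked in the same order (concordant) vs.
    # the opposite order (discordant); ties contribute to neither.
    concordant = discordant = 0
    n = len(actual)
    for i in range(n):
        for j in range(i + 1, n):
            s = (actual[i] - actual[j]) * (predicted[i] - predicted[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

actual = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 2]
acc = accuracy(actual, predicted)
tau = kendall_tau(actual, predicted)
```

Unlike accuracy, Kendall's tau rewards predictions that preserve the ordering even when the exact class is missed, which is why it is a natural companion metric for ordinal regression.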
Performance (scopolamine study)

Region           Metric          DON (ORGP)  DON (PMCGP)  DON (MCGP)
ACC              Accuracy        73.3%*†‡    40.0%*       51.1%*
                 Kendall’s tau   0.70        0.21         0.53
                 AIC             58.6        -            86.6
Occipital lobe   Accuracy        64.4%*      64.4%*       60.0%*
                 Kendall’s tau   0.63        0.59         0.52
                 AIC             64.4        -            80.8
Thalamus         Accuracy        80.0%*†‡    60.0%*       68.9%*
                 Kendall’s tau   0.81        0.60         0.72
                 AIC             50.6        -            68.6
Summary
• For data that lie on a continuum, ordinal regression often outperformed the state-of-the-art.
• It provides probabilistic predictions.
• The marginal likelihood can be used for model selection as well as model comparison, i.e. it does not require nested cross validation.
Practical Info
• For 16 whole-brain images it takes ~7 minutes to optimise 4 hyperparameters in a leave-one-out manner (16 runs), using MATLAB 8 on the CNS server.
• Pattern recognition is a powerful tool for discrimination and prediction.
• Its potential has been well-established for neuroimaging data.