Introduction to Machine Learning

Dr. Orla Doyle
orla.doyle@kcl.ac.uk
NEWMEDS


• Face recognition

• Reading text/handwriting

• Recognising food by smell

We would like to give similar capabilities to machines.


What is machine learning?

• Building models of data for:
  – predicting categorical variables (classification)
  – predicting numerical variables (regression)
  – searching for groupings in the data (unsupervised learning)
  – learning from delayed feedback (reinforcement learning)

[Diagram: data 𝑿 paired with labels 𝒚]


Cross Validation

[Figure: cross-validation scheme for repeated measures across two conditions, A and B, in 10 subjects. In each fold, all data from one subject (Subject 1 … Subject 10) are held out as the TEST set and the remaining subjects form the TRAINING set.]
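The scheme above can be sketched with scikit-learn's LeaveOneGroupOut, grouping trials by subject so that no subject appears in both training and test sets. The synthetic data here are stand-ins for illustration, not the slides' dataset.

```python
# Sketch of leave-one-subject-out cross-validation for repeated-measures
# data (two conditions in 10 subjects), using synthetic Gaussian trials.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_subjects, n_trials, n_features = 10, 6, 5

X, y, groups = [], [], []
for subj in range(n_subjects):
    for trial in range(n_trials):
        label = trial % 2                   # alternate conditions A (0) and B (1)
        X.append(rng.normal(loc=label, size=n_features))
        y.append(label)
        groups.append(subj)                 # all of a subject's trials share a group
X, y, groups = np.array(X), np.array(y), np.array(groups)

logo = LeaveOneGroupOut()
accs = []
for train_idx, test_idx in logo.split(X, y, groups):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))

# One fold per subject: the held-out subject is never seen during training.
print(f"{len(accs)} folds, mean accuracy = {np.mean(accs):.2f}")
```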


Machine learning for neuroimaging

Univariate GLM: are there voxels that reflect the stimulus?

1. GLM
2. t-test
3. Correction for multiple comparisons

[Figure: BOLD signal over time]

Multivariate pattern recognition: is the pattern of voxels predictive?

[Figure: a classifier is trained on labelled patterns; the trained classifier then outputs a prediction]


Type of Pattern Recognition

• Binary (controls vs. patients):
  – support vector machine (SVM)
  – Gaussian process classification (GPC)
  – neural networks

• Multi-class (placebo vs. drug 1 vs. drug 2):
  – Gaussian process classification (GPC)
  – sparse multinomial logistic regression (SMLR)

• Real-valued (age, VAS scores):
  – kernel ridge regression
  – Gaussian process regression

• Ordinal (symptom severity, drug dose):
  – ordinal regression using Gaussian processes

• Event (time to disease onset):
  – Cox regression


An example

• Problem: sorting incoming fruit according to type.
• Assumption: there are two types of fruit, oranges and lemons.

How do we describe the fruit to the computer?

Features:

• dimensions (e.g. width and height)


SVM

• Presented with a set of fruit how can we train an algorithm to classify the type?

Method:
1. Choose two 'discriminating' features.
2. Plot in the input feature space.
3. Here, the examples are linearly separable by a hyperplane: wᵀx + b = 0.
4. Define a 'positive' class and a 'negative' class.

[Figure: oranges and lemons plotted in width–height feature space]
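A minimal sketch of the fruit example, assuming synthetic width/height measurements (the class means and spreads are invented for illustration) and scikit-learn's linear SVM.

```python
# Linear SVM separating oranges from lemons on two features (width, height).
# The data are synthetic: oranges are assumed roughly round, lemons taller
# than wide; these numbers are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
oranges = rng.normal([7.0, 7.0], 0.3, size=(20, 2))   # (width, height) in cm
lemons  = rng.normal([5.0, 8.5], 0.3, size=(20, 2))

X = np.vstack([oranges, lemons])
y = np.array([0] * 20 + [1] * 20)                     # 0 = orange, 1 = lemon

clf = SVC(kernel="linear").fit(X, y)

# The decision function is w^T x + b: positive on one side of the
# hyperplane, negative on the other.
w, b = clf.coef_[0], clf.intercept_[0]
new_fruit = np.array([[6.9, 7.1]])                    # round-ish, near the oranges
print("w =", w, "b =", b, "prediction:", clf.predict(new_fruit))
```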


The weight vector

• Positive class region:

• 𝒘𝑇𝐱 + b > 0

• 𝑤1𝑥1 + 𝑤2𝑥2 + 𝑏 > 0

• Note that 𝒘 is perpendicular to the decision hyperplane, i.e. it is the normal vector of the plane rather than a vector lying within it.

The weight vector quantifies feature importance!
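A quick numerical check of this geometric fact, using a hypothetical 2-D weight vector: the difference between any two points lying on the hyperplane wᵀx + b = 0 is orthogonal to w.

```python
# Check that w is perpendicular to the decision hyperplane w^T x + b = 0.
# The weights and offset here are arbitrary illustrative values.
import numpy as np

w = np.array([2.0, 1.0])   # hypothetical weight vector
b = -3.0

def point_on_plane(x1):
    """Given x1, solve w^T x + b = 0 for x2: x2 = (-b - w1*x1) / w2."""
    return np.array([x1, (-b - w[0] * x1) / w[1]])

p, q = point_on_plane(0.0), point_on_plane(5.0)
direction = q - p                       # a vector lying within the hyperplane
print(np.dot(w, direction))             # prints 0.0: w is perpendicular
```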


A 2-D Example

Presented with a set of fruit, how can we train an algorithm to classify the type?

Method:
1. Choose two 'discriminating' features.
2. Plot in the input feature space.
3. Here, the examples are linearly separable by a hyperplane: wᵀx + b = 0.
4. Define a 'positive' class and a 'negative' class.

But which linear hyperplane is the optimal choice? Note: we want the solution to work well on unseen data (good generalisation).

[Figure: oranges and lemons plotted in width–height feature space]


SVM

But which linear hyperplane is the optimal choice? The one which maximises the margin (the distance between the closest points and the hyperplane).

The margin is equal to the distance between the two dashed hyperplanes with equations

wᵀx + b = −1 and wᵀx + b = +1.

Therefore, the margin is 2/‖w‖.

[Figure: maximum-margin hyperplane between the two dashed hyperplanes in width–height feature space]


SVM

Optimisation problem becomes:

Minimise ½‖w‖²
Subject to yᵢ(wᵀxᵢ + b) ≥ 1, ∀i

[Figure: separating hyperplane in width–height feature space]

Note: via the dot product, the problem is transformed from the dimensionality of the data to the number of samples!


• What if the data are not linearly separable? (This is almost always the case with real-world data.) Then 'slack' variables ξᵢ are introduced.

Optimisation problem becomes:

Minimise ½‖w‖² + C∑ᵢξᵢ
Subject to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ, ∀i

Solved using Lagrange multipliers to give:

Maximise ∑ᵢαᵢ − ½∑ᵢ∑ⱼαᵢαⱼyᵢyⱼxᵢᵀxⱼ
Subject to ∑ᵢαᵢyᵢ = 0 and 0 ≤ αᵢ ≤ C, ∀i

[Figure: overlapping classes in width–height feature space]


Real world data

• C: what does it do?

• It controls the trade-off between training error and margin. Very low training error (C ↑): the model may have overfitted the training data. Very large margin (C ↓): the model may be a poor representation of the training data.

• How do we choose it? By nested cross-validation, minimising the error on the inner validation folds.
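A sketch of this selection procedure using scikit-learn's GridSearchCV inside an outer cross-validation loop; the grid values and the synthetic dataset are illustrative assumptions.

```python
# Nested cross-validation for choosing C: the inner loop selects C on the
# training folds only; the outer loop estimates generalisation of the whole
# select-then-fit procedure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=10, random_state=0)

inner = KFold(n_splits=3, shuffle=True, random_state=0)
outer = KFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop: grid search over C using only the current training folds.
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=inner)

# Outer loop: unbiased performance estimate of the tuned model.
scores = cross_val_score(search, X, y, cv=outer)
print(f"nested CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```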


Kernels

• Kernels are used as similarity measures between samples (e.g. subjects).

• Classification is performed in this ‘similarity’ subspace.

• Linear kernel:

• Each element in the kernel matrix is the dot product of two samples; the matrix is computed over all samples in this pairwise fashion.

• Non-linear kernels:

• Help linear models perform well in nonlinear settings

• Map input data to higher dimensions where it exhibits linear patterns


Kernels


Consider the binary classification problem:

• Each example is represented by a single feature 𝒙.
• No adequate linear model exists.

Instead, transform each example: 𝒙 → (𝒙, 𝒙²)

• Each example now has two features.

The data are linearly separable in this higher-dimensional space.

[Figure: 1-D data, inseparable along 𝒙, become separable when plotted against (𝒙, 𝒙²)]
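The transform above can be demonstrated numerically. This sketch uses synthetic 1-D data (an assumption for illustration): the positive class sits in the middle of the axis, so no linear threshold on 𝒙 works, but after adding the 𝒙² feature a linear SVM separates the classes.

```python
# 1-D data with the positive class in the middle is not linearly separable
# in x, but it becomes separable after the transform x -> (x, x^2).
import numpy as np
from sklearn.svm import SVC

x = np.linspace(-3, 3, 61)
y = (np.abs(x) < 1.5).astype(int)        # class 1 in the middle of the axis

linear_1d = SVC(kernel="linear").fit(x.reshape(-1, 1), y)

X2 = np.column_stack([x, x ** 2])        # transform x -> (x, x^2)
linear_2d = SVC(kernel="linear").fit(X2, y)

print("1-D accuracy:", linear_1d.score(x.reshape(-1, 1), y))
print("2-D accuracy:", linear_2d.score(X2, y))
```

In the transformed space the boundary is simply a horizontal line at a fixed value of 𝒙², which is a linear hyperplane.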


Kernels

• Each example is transformed from having 2 features to 3 features.

• A complex boundary in the input feature space is replaced by a linear hyperplane in the kernel feature space.

x → Φ(x)

𝒙 = (x₁, x₂) → (x₁², √2·x₁x₂, x₂²)


Kernels

• Directly mapping to the transformed feature space could lead to a large computational cost as the number of features increases.

• The "kernel trick" avoids this by not requiring the mapping to be computed explicitly:

x → Φ(x)
𝒙 = (x₁, x₂) → (x₁², √2·x₁x₂, x₂²)

K(𝒙, 𝒙′) = Φ(𝒙)ᵀΦ(𝒙′)
K(𝒙, 𝒙′) = (x₁², √2·x₁x₂, x₂²)ᵀ(x′₁², √2·x′₁x′₂, x′₂²)
K(𝒙, 𝒙′) = (𝒙ᵀ𝒙′)²

Other kernels:
• Radial basis function: K(𝒙, 𝒙′) = exp(−‖𝒙 − 𝒙′‖² / 2σ²)
• Polynomial: K(𝒙, 𝒙′) = (𝒙ᵀ𝒙′ + c)ᵈ
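A numerical check of the identity above, assuming only NumPy: the explicit map Φ(x) = (x₁², √2·x₁x₂, x₂²) yields the same value as applying the kernel (xᵀx′)² directly, without ever computing Φ.

```python
# Verify the polynomial-kernel identity Phi(x) . Phi(x') = (x^T x')^2.
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))       # map first, then take the dot product
kernel   = np.dot(x, z) ** 2            # "kernel trick": dot product only
print(explicit, kernel)                 # both equal (1*3 + 2*(-1))^2 = 1
```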


Kernels

If the data are not linearly separable we could map the data to a higher-dimensional space, x → Φ(x), so that K(xᵢ, xⱼ) = Φ(xᵢ)ᵀΦ(xⱼ), where K(xᵢ, xⱼ) is a kernel function. Therefore, we only need to know the scalar function K(xᵢ, xⱼ).

This makes reverse inference more challenging. For linear kernels we can visualise the weight vector directly. For non-linear kernels, given that we don't know Φ(x), we need to approximate the mapping where possible.


Nonlinear SVMs

• To train a non-linear SVM using a radial basis function kernel we need to optimise both C and the kernel parameter γ = 1/(2σ²). This is achieved by nested cross-validation, minimising the error on the inner validation folds.

Large σ (small γ): the boundary is close to a linear hyperplane (underfitted).
Small σ (large γ): the boundary is highly complex (difficult to generalise; overfitted).
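An illustrative sketch of this trade-off (synthetic data with purely random labels, an assumption for illustration): an RBF SVM with a large γ (small σ) can memorise noise that a small-γ (large σ) model cannot.

```python
# Sigma/gamma trade-off: with random labels there is nothing to learn, so a
# high training accuracy here is a symptom of overfitting.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = rng.integers(0, 2, size=60)          # pure noise labels

flexible = SVC(kernel="rbf", gamma=100.0).fit(X, y)   # small sigma: wiggly boundary
smooth   = SVC(kernel="rbf", gamma=0.01).fit(X, y)    # large sigma: near-linear

# The flexible model fits the noise far better than the smooth one,
# which is honest about random labels.
print("small sigma, train acc:", flexible.score(X, y))
print("large sigma, train acc:", smooth.score(X, y))
```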


Overfitting

• The model is more flexible than we need!

• Addressing overfitting:

• Perform cross-validation.

• Use methods with regularisation, i.e. a penalty on model complexity that restricts the set of solutions to a particular space.


Advantages of SVMs

• Exhibit good generalisation

• Learning involves optimisation of a convex function, i.e. no local optima!

• It can handle very high-dimensional datasets.

• The formulation of the SVM enables the use of kernels.


Issues

• Multi-class classification requires workarounds (e.g. combining several binary classifiers).

• SVM outputs are generally discrete class labels. Transformations to probabilities exist but are not very robust.

• How do we choose the kernel?

• Training time can be large, e.g. a two-parameter grid search combined with a large number of examples.


Seminal SVM and fMRI paper

• Seminal paper in applying pattern recognition to fMRI.
• 13 subjects.
• Task: reading a sentence versus looking at a picture of geometrical shapes.
• Leave-one-out cross-validation.
• Mean activation in 7 ROIs used as the input to SVM, Gaussian naïve Bayes and nearest-neighbour classifiers.


Structural MRI and SVMs for diagnosis

• Use of T1-weighted MR scans to diagnose Alzheimer's disease (AD).
• Group I: 20 healthy, 20 AD.
• Group II: 14 healthy, 14 AD.
• Group III: 57 healthy, 33 probable AD.
• Group IV: 18 AD, 19 frontotemporal lobar degeneration (FTLD).
• SVM applied to normalised grey-matter images.


Learning with Gaussian Processes

• Probabilistic predictions
• Access to the 'model evidence'
  – an elegant approach for model selection (no need for nested cross-validation)
  – an elegant approach for model comparison
• Incorporation of prior information
• Two ways to 'map':
  – a map illustrating the features most important for the discrimination, i.e. at the boundary between the classes
  – a map illustrating the distribution between classes, i.e. most similar to the classic 't-map'


Gaussian Processes

A Gaussian process is a Gaussian distribution with infinitely many variables.

Define a latent function as a Gaussian process over our data:

𝑓(𝒙) ~ 𝒢𝒫(𝑚(𝒙), 𝑘(𝒙, 𝒙′))

where 𝑚(𝒙) is the mean function and 𝑘(𝒙, 𝒙′) is the covariance function. The latent function f relates the data to the label.

Compare:

1-D: f(x) ∼ N(μ, σ²), a mean and a variance.
2-D: f(x) ∼ N(μ, K), a mean vector and a covariance matrix.


Gaussian Processes

Mean function: e.g. 𝑚(𝒙) ≡ 0 or 𝑚(𝒙) ≡ c.

The covariance function encodes knowledge about the similarities between data points. It assumes that training samples 'close' to a test point should be informative. For example:

𝑘(𝒙, 𝒙′) = 𝒙 · 𝒙′
𝑘(𝒙, 𝒙′) = (𝒙 · 𝒙′ + ℓ)/s²
𝑘(𝒙, 𝒙′) = exp(−‖𝒙 − 𝒙′‖² / 2σ²)

We refer to the parameters of these functions as hyperparameters and collect them in the vector 𝜽.

𝑓(𝒙) ~ 𝒢𝒫(𝑚(𝒙), 𝑘(𝒙, 𝒙′))
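What the prior above means in practice can be sketched by sampling from it: with a zero mean function and the RBF covariance, function values at any finite set of inputs are jointly Gaussian, so sample functions can be drawn with NumPy. The lengthscale value is an arbitrary assumption for illustration.

```python
# Draw sample functions from a GP prior with zero mean and an RBF
# covariance k(x, x') = exp(-(x - x')^2 / (2 sigma^2)).
import numpy as np

def rbf(xs, sigma=1.0):
    d = xs[:, None] - xs[None, :]                   # pairwise differences
    return np.exp(-d ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
xs = np.linspace(-5, 5, 50)
K = rbf(xs) + 1e-8 * np.eye(len(xs))                # jitter for numerical stability

# Three sample functions f ~ N(0, K): each row is one draw from the prior.
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
print(samples.shape)                                # (3, 50)
```

Nearby inputs get highly correlated function values, which is why the draws look smooth; the lengthscale σ is one of the hyperparameters collected in 𝜽.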


Learning with Gaussian Processes

Provides:

– probabilistic predictions
– an elegant framework for model selection and model comparison

Data: 𝒟 = {𝑿, 𝒚}

Model: 𝑦 = 𝑓𝑤 𝐱 + 𝜖

Here, we learn the function using Bayesian inference.


1. Place a GP prior on the function 𝑓.

2. Observe the data.

3. Compute the posterior distribution over f.

4. Compute the predictive distribution for a test case.
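A minimal sketch of steps 1–4 above using scikit-learn's GP regressor; the toy data and the RBF kernel choice are assumptions for illustration (scikit-learn optimises the hyperparameters by marginal likelihood during fitting).

```python
# GP regression on toy data: prior -> observe data -> posterior -> predict.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)      # y = f(x) + noise

# Steps 1-3: an RBF prior on f; fitting conditions on the observed data,
# giving the posterior over f.
gp = GaussianProcessRegressor(kernel=RBF(), alpha=0.1 ** 2).fit(X, y)

# Step 4: predictive distribution at a test point (mean and uncertainty).
mean, std = gp.predict(np.array([[0.5]]), return_std=True)
print(f"prediction: {mean[0]:.2f} +/- {std[0]:.2f}")
```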


Gaussian Process Learning

𝑝(𝒇|𝒟, 𝜽) = 𝑝(𝒚|𝒇, 𝜽) 𝑝(𝒇|𝜽) / 𝑝(𝒟|𝜽)

• Posterior over 𝒇: used to compute the predictive distribution for a test case.

• Likelihood function: determines the type of learning (classification, regression, etc.). Selected by the user.

• Gaussian process prior: encodes the functional form of 𝒇, analogous to the role of the kernel in SVMs.

• Marginal likelihood: often used to optimise the set of hyperparameters 𝜽 by maximising it (equivalently, minimising the negative log marginal likelihood). It can also be used for model comparison (e.g. between different likelihoods) as long as 𝒟 is kept consistent.


Gaussian Process Classification

• For regression the likelihood function is Gaussian, and therefore the posterior can be computed analytically.

• However, for classification the likelihood is non-Gaussian (the labels are discrete).

• In this case, f is a latent function which is 'squashed' to the range 0 to 1 using a link function to provide probabilistic predictions.

• Both the posterior and the predictive distribution are now intractable, so we need to use approximations.


Example GP model for fMRI

[Figure: GP model for fMRI. fMRI time series with Task 1/Task 2 labels form the training data; combining the GP prior with the data gives the posterior; the latent function is integrated over to make a probabilistic prediction p(y|f, X, θ) for a test image with an unknown label.]


Some examples of GPs at work.

• Clinical application of GPs: discriminating healthy subjects from those at risk of bipolar disorder.

• Pharmacological application of GPs: quantifying the BOLD response to ketamine.


GP application

• Use of fMRI to predict prognosis in a sample genetically at risk of bipolar disorder.
• Subjects: 16 healthy, 16 at risk.
• Gaussian process classification was applied to the subjects' fMRI responses to viewing neutral faces.
• 75% classification accuracy was achieved.
• Predictive probabilities were found to be higher for those who subsequently developed symptoms.


Imaging Ketamine

Ketamine

• NMDA antagonist.
• At sub-anaesthetic doses it can induce symptoms resembling schizophrenia in healthy humans.

Deakin et al. 2008, Arch Gen Psychiatry


Aims of the study

1. Can the ketamine BOLD response be modulated?

2. How can we quantify the degree of modulation?

[Study design: scan-session timeline of cASL, cognitive tasks, phMRI and cASL blocks (15, 5 and 10 min); four conditions pairing pre-infusion treatment with infusion: 1. placebo + saline, 2. placebo + ketamine, 3. pre-treatment + ketamine, 4. pre-treatment + ketamine.]


Lamotrigine

• Anticonvulsant.
• Inhibits voltage-gated ion channels, with downstream effects resulting in inhibition of glutamate release.
• Behavioural data (Anand et al. 2000) and imaging data (Deakin et al. 2008) showed that it can attenuate the effects of ketamine.

Risperidone

• Antipsychotic.
• High affinities for D2 and 5-HT2A receptors.
• Hypothesised to reduce glutamate in the cortex via 5-HT2A antagonism.
• No human data on its effect on a ketamine challenge.

Repeated-measures pharmacological design: Session 1, placebo; Session 2, active compound; Session 3, inhibitory compound + active compound.


[Figure: a binary GPC is trained on PLA vs. KET scans; the trained classifier is then applied to the PRE+KET scans, yielding the probability of KET for each scan.]


• But we haven’t modeled the intermediate class.

• How we can use this data to inform the model?

• Regression

– Encode the labels as [-1 0 1] or [1 2 3], etc.

– Enforces a metric notion of distance.

• Multi-class classification

– Assumes the labels are nominal.

• A solution: ordinal regression.

PLA PRE+KET KET


Examples of ranked data

• Drug dose: placebo < 50 mg < 150 mg
• Disease state: controls < ARMS < first episode
• Controls < MCI-s < MCI-c < Alzheimer's
• Visual stimulus frequency: low < medium < high


Multivariate Ordinal Regression

• Models the natural ordering in labels.

• Bridges classification and metric regression.

1. Binary classifier applied to pairwise (exhaustive) difference images. Requires that each class is present in the test data to decode the ranking.

2. Reduces the problem to a set of binary classifiers, then creates consensus labels by combining the classifiers using an ordinal ranking rule.
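A sketch of strategy 2 (reduction to binary classifiers combined by an ordinal rule, in the spirit of Frank and Hall's method): for ordered classes 0 < 1 < 2, one classifier per threshold predicts whether the label exceeds that threshold, and the consensus label counts how many thresholds are exceeded. The synthetic data and the logistic-regression base classifier are illustrative assumptions.

```python
# Ordinal regression by reduction: train one binary classifier per
# threshold (y > 0 and y > 1), then combine them with an ordinal rule.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Three ordered classes (e.g. placebo < intermediate < drug) along one axis.
X = np.vstack([rng.normal(c * 2.0, 0.5, size=(30, 2)) for c in range(3)])
y = np.repeat([0, 1, 2], 30)

thresholds = [0, 1]
clfs = [LogisticRegression().fit(X, (y > k).astype(int)) for k in thresholds]

def predict_ordinal(x):
    # Consensus label: the number of "greater-than" classifiers that fire.
    return sum(int(c.predict(x.reshape(1, -1))[0]) for c in clfs)

preds = np.array([predict_ordinal(x) for x in X])
print("training accuracy:", (preds == y).mean())
```

Because the binary tasks share the ordering, the combined prediction respects the ranking without imposing a metric distance between the classes.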


• Probabilistic predictions
• Considers all classes simultaneously
• Kernel framework
• Elegant solution for model selection and comparison

𝑝(𝒇|𝒟, 𝜽) = 𝑝(𝒚|𝒇, 𝜽) 𝑝(𝒇|𝜽) / 𝑝(𝒟|𝜽)

Predicted label: arg maxᵢ 𝑝(𝑦* = i | 𝒙*, 𝑿, 𝒚, 𝜽)


Multivariate Ordinal Regression

• Validated on two exemplar pharmacological datasets:

– Ketamine study: do we observe a whole-brain ordinal response for 1) placebo, lamotrigine+ketamine, ketamine and 2) placebo, risperidone+ketamine, ketamine?

– Scopolamine study: perfusion data (pCASL) acquired from 15 healthy volunteers on three visits: placebo, donepezil+scopolamine and scopolamine.


Compare ordinal regression using Gaussian processes with:

– Multi-class classification using binary classifiers and error-correcting codes (state-of-the-art).

– Inherently multi-class GP classification (considers all classes simultaneously).


Performance

Metric        | LAM (ORGP) | LAM (PMCGP) | LAM (MCGP) | RIS (ORGP) | RIS (PMCGP) | RIS (MCGP)
Accuracy      | 72.9%*†‡   | 60.4%*      | 56.3%*     | 60.4%*     | 56.3%*      | 56.3%*
Kendall's tau | 0.70       | 0.61        | 0.53       | 0.57       | 0.61        | 0.61
AIC           | 86.9       | –           | 93.5       | 89.7       | –           | 95.5

Confusion matrices (rows: actual PLA/INT/KET; columns: predicted PLA/INT/KET; 16 examples per class, so perfect prediction would be the diagonal matrix 16 0 0 / 0 16 0 / 0 0 16):

ORGP           MCGP
12  4  0       15  1  0
 3 10  3        5  1 11
 1  2 13        4  1 11


Performance

Region         | Metric        | DON (ORGP) | DON (PMCGP) | DON (MCGP)
ACC            | Accuracy      | 73.3%*†‡   | 40.0%*      | 51.1%*
               | Kendall's tau | 0.70       | 0.21        | 0.53
               | AIC           | 58.6       | –           | 86.6
Occipital lobe | Accuracy      | 64.4%*     | 64.4%*      | 60.0%*
               | Kendall's tau | 0.63       | 0.59        | 0.52
               | AIC           | 64.4       | –           | 80.8
Thalamus       | Accuracy      | 80.0%*†‡   | 60.0%*      | 68.9%*
               | Kendall's tau | 0.81       | 0.60        | 0.72
               | AIC           | 50.6       | –           | 68.6


Summary

• For data that lie on a continuum, ordinal regression often outperformed the state-of-the-art.
• It provides probabilistic predictions.
• The marginal likelihood can be used for model selection as well as model comparison, i.e. nested cross-validation is not required.

Practical Info

• For the 16 whole-brain images it takes ~7 minutes to optimise 4 parameters in a leave-one-out manner (16 runs), using MATLAB 8 on the CNS server.


• Pattern recognition is a powerful tool for discrimination and prediction.

• Its potential has been well-established for neuroimaging data.