Dr. Orla Doyle · [email protected] · NEWMEDS
Humans are good at pattern recognition:
• Face recognition
• Reading text/handwriting
• Recognising food by smell
We would like to give similar capabilities to machines.
What is machine learning?
• Building models of data for:
  – Predicting categorical variables (classification)
  – Predicting numerical variables (regression)
  – Searching for groupings in the data (unsupervised learning)
  – Learning from delayed feedback (reinforcement learning)
[Figure: data matrix 𝑿 with corresponding labels 𝒚]
Cross Validation

[Figure: leave-one-subject-out cross-validation. Repeated measures across two conditions A and B in 10 subjects; each fold holds out all the data from one subject (TEST) and trains on the remaining subjects (TRAINING).]
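The subject-wise splitting in the figure can be sketched as follows; the subject IDs and two-samples-per-subject layout below are illustrative, not from any real dataset.

```python
# Leave-one-subject-out cross-validation: all samples from one subject
# form the test set; samples from the remaining subjects form the
# training set. This respects the repeated-measures structure.
def leave_one_subject_out(subject_ids):
    subjects = sorted(set(subject_ids))
    for held_out in subjects:
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield train, test

# Two repeated measures (conditions A and B) per subject, 3 subjects:
ids = [1, 1, 2, 2, 3, 3]
folds = list(leave_one_subject_out(ids))
# The first fold holds out both samples of subject 1.
```

Splitting by subject (rather than by sample) prevents data from the same subject leaking between training and test sets.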
Machine learning for neuroimaging
Univariate GLM: are there voxels that reflect the stimulus?
1. GLM
2. T-test
3. Correction for multiple comparisons

[Figure: BOLD signal over time for a single voxel]
Multivariate Pattern Recognition: is the pattern of voxels predictive?
[Figure: a classifier is trained on labelled patterns; the trained classifier then makes predictions for new data]
Type of Pattern Recognition
• Binary (controls vs. patients): support vector machine (SVM), Gaussian process classification (GPC), neural networks
• Multi-class (placebo vs. drug 1 vs. drug 2): Gaussian process classification (GPC), sparse multinomial logistic regression (SMLR)
• Real-valued (age, VAS scores): kernel ridge regression, Gaussian process regression
• Ordinal (symptom severity, drug dose): ordinal regression using Gaussian processes
• Event (time to disease onset): Cox regression
An example
• Problem: sorting incoming fruit according to type.
• Assumption: there are two types of fruit, oranges and lemons.
How do we describe the fruit to the computer?
Features:
• dimensions (e.g. height and width)
SVM
• Presented with a set of fruit how can we train an algorithm to classify the type?
Method:
1. Choose two ‘discriminating’ features.
2. Plot in the input feature space.
3. Here, the examples are linearly separable by a hyperplane: wTx + b = 0.
4. Define a ‘positive’ class and a ‘negative’ class.

[Figure: oranges and lemons plotted in the height–width feature space]
The weight vector
• Positive class region: 𝒘T𝐱 + b > 0, i.e. 𝑤1𝑥1 + 𝑤2𝑥2 + 𝑏 > 0.
• Note that 𝒘 is perpendicular (normal) to the decision hyperplane: for any two points 𝒙ᵢ, 𝒙ⱼ lying on the plane, 𝒘T(𝒙ᵢ − 𝒙ⱼ) = 0.
Quantifies feature importance!
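In two dimensions the decision rule is just two multiplications and an addition; a minimal sketch with made-up weights (w and b below are illustrative, not learned from data):

```python
# Linear decision function for a 2-D classifier: sign(w1*x1 + w2*x2 + b).
# The weight vector w is perpendicular to the decision hyperplane.
w = (0.8, -0.5)   # illustrative weights
b = -0.1          # illustrative bias

def decision(x):
    score = w[0] * x[0] + w[1] * x[1] + b
    return 1 if score > 0 else -1  # positive vs. negative class

# A point deep in the positive half-space:
label = decision((2.0, 0.0))  # score = 0.8*2 - 0.1 = 1.5 > 0, so +1
```

The magnitude of each weight reflects how strongly the corresponding feature moves the score, which is the sense in which w quantifies feature importance.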
A 2-D Example
But which linear hyperplane is the optimal choice? Note: we want the solution to work well on unseen data (good generalisation).

[Figure: several candidate separating hyperplanes between the two classes in the height–width feature space]
SVM
But which linear hyperplane is the optimal choice? The one which maximises the margin (the distance between the closest point and the hyperplane).
The margin is equal to the distance between the two dashed hyperplanes with equations:

wTx + b = -1 and wTx + b = +1

Therefore, the margin is defined as 2 / ‖w‖.
SVM
Optimisation problem becomes:

Minimise ½‖w‖²
Subject to yi(wTxi + b) ≥ 1, ∀i
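Given any weight vector, the margin 2/‖w‖ is a one-line computation; a quick numeric check with an illustrative w:

```python
import math

# Margin of a hard-margin SVM: the distance between the hyperplanes
# wTx + b = +1 and wTx + b = -1 is 2 / ||w||.
def margin(w):
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm

# Illustrative weight vector with ||w|| = 5:
m = margin((3.0, 4.0))  # 2 / 5 = 0.4
```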
Note the dot product xiTxj: in the dual formulation, the size of the problem depends on the number of samples rather than on the dimensionality of the data!
Solved using Lagrange multipliers to give the dual problem:

Maximise ∑iαi - ½∑i∑jαiαjyiyjxiTxj
Subject to ∑iαiyi = 0 and 0 ≤ αi ≤ C, ∀i
• What if the data are not linearly separable? (This is almost always the case.) Then ‘slack’ variables ξi are introduced, and the optimisation problem becomes:

Minimise ½‖w‖² + C∑iξi
Subject to yi(wTxi + b) ≥ 1 - ξi, ∀i
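The soft-margin problem above can equivalently be written as minimising ½‖w‖² + C·∑ᵢ max(0, 1 − yᵢ(wTxᵢ + b)), which a subgradient method can attack directly. A minimal sketch, with toy fruit-like data that is made up for illustration:

```python
# Soft-margin linear SVM trained by subgradient descent on the
# hinge-loss objective (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b)).
# Data and hyperparameters below are illustrative.
def train_svm(X, y, C=10.0, lr=0.01, epochs=200):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = w[0] * xi[0] + w[1] * xi[1] + b
            if yi * score < 1:              # margin violated: hinge active
                w[0] += lr * (C * yi * xi[0] - w[0])
                w[1] += lr * (C * yi * xi[1] - w[1])
                b += lr * C * yi
            else:                            # only the regulariser acts
                w[0] -= lr * w[0]
                w[1] -= lr * w[1]
    return w, b

# Well-separated toy data ('lemons' = -1 vs 'oranges' = +1):
X = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (6.0, 5.0)]
y = [-1, -1, 1, 1]
w, b = train_svm(X, y)
preds = [1 if w[0] * a + w[1] * c + b > 0 else -1 for a, c in X]
```

Because the objective is convex, this simple method converges to (a neighbourhood of) the global optimum; a production implementation would use a dedicated QP or SMO solver.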
Real world data

[Figure: overlapping classes in the height–width feature space; no hyperplane separates them perfectly]
• What does C do? It controls the trade-off between training error and margin:
  – Large C → very low training error, but the model may have overfitted the training data.
  – Small C → very large margin, but the model may be a poor representation of the training data.
• How do we choose it? By nested cross validation: select the C that minimises the validation error on the inner folds.
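Nested cross-validation means an inner CV loop, over candidate C values, inside each outer training fold; an index-level sketch in which `evaluate` is a hypothetical stand-in for "train with this C on these indices and return validation accuracy":

```python
# Nested cross-validation sketch: the inner loop picks C using the
# outer training indices only, so the outer test fold never
# influences the choice of C.
def pick_C(train_idx, candidates, evaluate, k=3):
    best_C, best_score = None, float("-inf")
    for C in candidates:
        scores = []
        for fold in range(k):
            val = train_idx[fold::k]                      # inner validation fold
            inner_train = [i for i in train_idx if i not in val]
            scores.append(evaluate(inner_train, val, C))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_C, best_score = C, mean
    return best_C

# Toy stand-in scorer: pretend C = 1.0 always validates best.
fake_eval = lambda tr, va, C: 1.0 if C == 1.0 else 0.5
chosen = pick_C(list(range(12)), [0.1, 1.0, 10.0], fake_eval)
```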
Kernels
• Kernels are used as similarity measures between samples (e.g. subjects).
• Classification is performed in this ‘similarity’ space.
• Linear kernel: each element of the kernel matrix is the dot product of two samples, computed over all samples in a pairwise fashion.
• Non-linear kernels:
  – Help linear models perform well in non-linear settings.
  – Map input data to higher dimensions where it exhibits linear patterns.
Kernels
Consider the binary classification problem:
• Each example is represented by a single feature 𝒙.
• No adequate linear model exists.

Instead, transform each example: 𝒙 → (𝒙, 𝒙²). Each example now has two features, and the data are linearly separable in this higher-dimensional space.
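The 1-D example above can be checked numerically: a class that occupies an interval in the middle of the line cannot be split by any single threshold on x, but after mapping x → (x, x²) a linear rule works. The data points and the threshold below are illustrative.

```python
# 1-D data where class +1 sits between the two groups of class -1:
# no threshold on x alone separates them, but after x -> (x, x^2)
# the linear rule "x^2 < 4" (i.e. w = (0, 1), b = -4) does.
xs = [-3.0, -2.5, -1.0, 0.0, 1.0, 2.5, 3.0]
ys = [-1, -1, 1, 1, 1, -1, -1]

def phi(x):
    return (x, x * x)

# Separating hyperplane in the transformed space: +1 iff x^2 < 4.
preds = [1 if phi(x)[1] < 4.0 else -1 for x in xs]
```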
Kernels
• Each example is transformed from having 2 features to 3 features:

x → Φ(x): 𝒙 = (𝑥1, 𝑥2) → {𝑥1², √2 𝑥1𝑥2, 𝑥2²}

• A complex boundary in the input feature space is replaced by a linear hyperplane in the kernel feature space.
Kernels
• Directly mapping to the transformed feature space could lead to a large computational cost as the number of features increases.
• The “kernel trick” helps us avoid this by not requiring the mapping to be computed explicitly.
For the mapping x → Φ(x), 𝒙 = (𝑥1, 𝑥2) → {𝑥1², √2 𝑥1𝑥2, 𝑥2²}, the kernel is

𝐾(𝒙, 𝒙′) = 𝜙(𝒙)T𝜙(𝒙′) = {𝑥1², √2 𝑥1𝑥2, 𝑥2²}T{𝑥′1², √2 𝑥′1𝑥′2, 𝑥′2²} = (𝒙T𝒙′)²

Other kernels:
• Polynomial: 𝐾(𝒙, 𝒙′) = (𝒙T𝒙′ + c)ᵈ
• Radial basis function: 𝐾(𝒙, 𝒙′) = exp(−‖𝒙 − 𝒙′‖² / 2σ²)
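The identity above is easy to verify numerically: the polynomial kernel (xTx′)² gives the same value as the dot product of the explicit maps, without ever constructing them. The two test points are arbitrary.

```python
import math

# Explicit feature map for the degree-2 polynomial kernel in 2-D.
def phi(x):
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

# The same similarity computed implicitly (the "kernel trick").
def poly_kernel(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
implicit = poly_kernel(x, z)
# explicit == implicit up to floating-point error (both are 16 here)
```

The implicit version touches only the 2 original features per pair, while the explicit map needs 3; for higher polynomial degrees and dimensions this gap grows combinatorially, which is the point of the trick.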
Kernels
If the data are not linearly separable we could map the data to a higher-dimensional space, x → Φ(x), so that K(xi, xj) = Φ(xi)TΦ(xj), where K(xi, xj) is a kernel function. Therefore, we only need to know the scalar function K(xi, xj).

This makes reverse inference more challenging:
• For linear kernels we can visualise the weight vector directly.
• For non-linear kernels, since we do not know Φ(x) explicitly, we need to approximate the mapping where possible.
Nonlinear SVMs
• To train a non-linear SVM using a radial basis function kernel we need to optimise both C and the kernel width σ. This is achieved by minimising the validation error within nested cross validation.
• Large σ → close to a linear hyperplane (underfitted).
• Small σ → highly complex boundary (difficult to generalise, overfitted).
Overfitting
• The model is more flexible than we need!
• Addressing overfitting:
  – Perform cross validation.
  – Use methods with regularisation, i.e. a penalty for many parameters that restricts the set of solutions to a particular space.
Advantages of SVMs
• Exhibit good generalisation.
• Learning involves optimisation of a convex function, i.e. no local minima!
• They can handle very high-dimensional datasets.
• The formulation of the SVM enables the use of kernels.
Issues
• Multi-class classification requires a workaround (e.g. combining binary classifiers).
• SVM outputs are generally discrete; transformations to probabilities exist but are not very robust.
• How to choose the kernel?
• Training time, e.g. a two-parameter grid search combined with a large number of examples.
Seminal SVM and fMRI paper
• Seminal paper in applying pattern recognition to fMRI.
• 13 subjects.
• Reading a sentence versus looking at a picture with geometrical shapes.
• Leave-one-out cross validation.
• Mean activation in 7 ROIs used as the input to SVM, Gaussian naïve Bayes and nearest-neighbour classifiers.
Structural MRI and SVMs for diagnosis
• Use of T1-weighted MR scans to diagnose Alzheimer’s disease (AD).
• Group I: 20 healthy, 20 AD.
• Group II: 14 healthy, 14 AD.
• Group III: 57 healthy, 33 probable AD.
• Group IV: 18 AD, 19 frontotemporal lobar degeneration (FTLD).
• SVM applied to normalised grey matter images.
• Probabilistic predictions
• Access to the ‘model evidence’:
  – Elegant approach for model selection (no need for nested cross validation)
  – Elegant approach for model comparison
• Incorporation of prior information
• Two ways to ‘map’:
  – A map illustrating the features most important for the discrimination, i.e. at the boundary between the classes
  – A map illustrating the distribution between the classes, i.e. most similar to the classic ‘t-map’
Learning with Gaussian Processes
Gaussian Processes
A Gaussian process is like a Gaussian distribution with infinitely many variables.

Define a latent function as a Gaussian process over our data:

𝑓(𝒙) ~ 𝒢𝒫(𝑚(𝒙), 𝑘(𝒙, 𝒙′))

• The latent function f relates the data to the label.
• 𝑚(𝒙) is the mean function and 𝑘(𝒙, 𝒙′) is the covariance function.
• Compare: in 1-D, f ∼ N(μ, σ²) with a mean and a variance; in n dimensions, f ∼ N(𝝁, K) with a mean vector and a covariance matrix.
Gaussian Processes
Mean function: e.g. 𝑚(𝒙) ≡ 0 or 𝑚(𝒙) ≡ 𝑐.

The covariance function encodes knowledge about the similarities between data points. It assumes that training samples ‘close’ to a test point should be informative. Examples:

𝑘(𝒙, 𝒙′) = 𝒙·𝒙′,  𝑘(𝒙, 𝒙′) = (𝒙·𝒙′ + ℓ)/𝑠²,  𝑘(𝒙, 𝒙′) = exp(−‖𝒙 − 𝒙′‖² / 2σ²)

We refer to the parameters of these functions as hyperparameters and collect them in the vector 𝜽.
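The squared-exponential covariance above can be evaluated for a small set of inputs to see what it encodes; the three 1-D inputs below are arbitrary.

```python
import math

# Squared-exponential (RBF) covariance function from the slide:
# k(x, x') = exp(-|x - x'|^2 / (2 * sigma^2)).
def k(x, xp, sigma=1.0):
    return math.exp(-((x - xp) ** 2) / (2.0 * sigma ** 2))

# Covariance matrix for three 1-D inputs:
xs = [0.0, 1.0, 3.0]
K = [[k(a, b) for b in xs] for a in xs]
# Nearby inputs are highly correlated; distant inputs are nearly
# independent, which is exactly the "close points are informative"
# assumption.
```

The length-scale σ is a hyperparameter in 𝜽: larger σ makes distant points look similar (smoother functions), smaller σ makes the correlation fall off quickly.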
Learning with Gaussian Processes
Provides:
– probabilistic predictions
– an elegant framework for model selection and model comparison

Data: 𝒟 = {𝑿, 𝒚}
Model: 𝑦 = 𝑓𝑤(𝐱) + 𝜖

Here, we learn the function f using Bayesian inference.
1. Place a GP prior on the function 𝑓.
2. Observe the data.
3. Compute the posterior distribution over f.
4. Compute the predictive distribution for a test case.
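For a Gaussian likelihood, the four steps above have a closed form: the predictive mean at a test input is k*T(K + σₙ²I)⁻¹y. A minimal 1-D sketch with toy observations and near-zero noise, so the posterior mean should interpolate the training targets (the tiny linear solver is included to keep the block self-contained):

```python
import math

# RBF covariance, as on the earlier slide.
def k(x, xp, sigma=1.0):
    return math.exp(-((x - xp) ** 2) / (2.0 * sigma ** 2))

def solve(A, b):
    # Gaussian elimination with partial pivoting (fine for tiny systems).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_predict(X, y, x_star, noise=1e-6):
    # Posterior predictive mean: k_*^T (K + noise*I)^{-1} y.
    K = [[k(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    alpha = solve(K, y)
    return sum(k(x_star, xi) * ai for xi, ai in zip(X, alpha))

X, y = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]   # toy observations
m = gp_predict(X, y, 1.0)                  # ~1.0: interpolates at x = 1
```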
Gaussian Process Learning
𝑝(𝒇|𝒟, 𝜽) = 𝑝(𝒚|𝒇, 𝜽) 𝑝(𝒇|𝜽) / 𝑝(𝒟|𝜽)

• Posterior over 𝒇: used to compute the predictive distribution for a test case.
• Likelihood function 𝑝(𝒚|𝒇, 𝜽): determines the type of learning (classification, regression, etc.); selected by the user.
• Gaussian process prior 𝑝(𝒇|𝜽): encodes the functional form of 𝒇, analogous to the role of the kernel in SVMs.
• Marginal likelihood 𝑝(𝒟|𝜽): often used to optimise the hyperparameters, by maximising the marginal likelihood (equivalently, minimising its negative log). It can also be used for model comparison (e.g. between different likelihoods) as long as 𝒟 is held fixed.
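For a Gaussian likelihood the log marginal likelihood is log p(y|X,θ) = −½ yT C⁻¹ y − ½ log|C| − (n/2) log 2π with C = K + σₙ²I, and for n = 2 everything is closed-form. A sketch showing hyperparameter selection: with two equal observations at nearby inputs, a long length-scale explains them jointly and scores higher than a very short one. All numbers below are illustrative.

```python
import math

# Log marginal likelihood of a GP with two 1-D observations:
# log p(y|X,theta) = -1/2 y^T C^-1 y - 1/2 log|C| - (n/2) log(2*pi),
# where C = K + noise*I and K uses the RBF covariance.
def log_marginal_likelihood(x1, x2, y, sigma=1.0, noise=0.1):
    k12 = math.exp(-((x1 - x2) ** 2) / (2.0 * sigma ** 2))
    a, b, d = 1.0 + noise, k12, 1.0 + noise       # C = [[a, b], [b, d]]
    det = a * d - b * b
    # C^{-1} y via the 2x2 inverse:
    inv_y = [(d * y[0] - b * y[1]) / det, (-b * y[0] + a * y[1]) / det]
    quad = y[0] * inv_y[0] + y[1] * inv_y[1]
    return -0.5 * quad - 0.5 * math.log(det) - math.log(2.0 * math.pi)

# A very short length-scale treats the two equal observations as
# independent; a longer one explains them jointly and scores higher.
lml_long = log_marginal_likelihood(0.0, 0.5, [1.0, 1.0], sigma=2.0)
lml_short = log_marginal_likelihood(0.0, 0.5, [1.0, 1.0], sigma=0.1)
```

Comparing such scores across hyperparameter settings (or across likelihoods) is exactly the model selection / comparison use of the evidence described above, with no nested cross validation needed.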
Gaussian Process Classification
• For regression the likelihood function is Gaussian, so the posterior and predictions can be computed analytically.
• However, for classification the likelihood is non-Gaussian (discrete labels).
• In this case 𝒇 is a latent function which is ‘squashed’ to the interval [0, 1] using a link function to provide probabilistic predictions.
• Both the posterior and the predictive distribution are now intractable, so we need to use approximations (e.g. the Laplace approximation or expectation propagation).
Example GP model for fMRI
[Figure: fMRI timeseries with images labelled Task 1 and Task 2; a GP prior over the latent function is updated to a posterior given the training data. For a test image with unknown label, we integrate over the latent function to make a probabilistic prediction p(y|f, X, θ).]
Some examples of GPs at work.
• Clinical application of GPs for discriminating healthy subjects from those at risk of bipolar disorder.
• Pharmacological application of GPs for quantifying the BOLD response to ketamine.
GP application
• Use of fMRI to predict prognosis in a sample genetically at risk for bipolar disorder.
• Subjects: 16 healthy, 16 at risk.
• Gaussian process classification was applied to subjects’ fMRI response to viewing neutral faces.
• 75% classification accuracy was achieved.
• Predictive probabilities were found to be higher for those who subsequently developed symptoms.
Imaging Ketamine
Ketamine
• NMDA antagonist.
• At sub-anaesthetic doses it can induce symptoms resembling schizophrenia in healthy humans.
Deakin et al. 2008, Arch Gen Psychiatry
Aims of the study
1. Can the ketamine BOLD response be modulated?
2. How can we quantify the degree of modulation?
[Figure: scan session timeline with cASL, cognitive tasks, phMRI (5, 10 and 15 min markers) and cASL blocks. Four sessions: 1. placebo + saline, 2. placebo + ketamine, 3. pre-treatment + ketamine, 4. pre-treatment + ketamine.]
Lamotrigine
• Anticonvulsant.
• Inhibits voltage-gated ion channels, with downstream effects resulting in inhibition of glutamate release.
• Behavioural data (Anand et al. 2000) and imaging data (Deakin et al. 2008) showed that it can attenuate the effects of ketamine.
Risperidone
• Antipsychotic.
• High affinities for D2 and 5-HT2A receptors.
• Hypothesised to reduce glutamate in the cortex via 5-HT2A antagonism.
• No human data on its effect on a ketamine challenge.
Repeated measures pharmacological data:
• Session 1: placebo
• Session 2: active compound
• Session 3: inhibitory compound + active compound
[Figure: a binary GPC is trained on PLA vs. KET; the trained GPC is then applied to the intermediate PRE+KET condition to obtain the probability of KET.]
• But we haven’t modelled the intermediate class.
• How can we use these data to inform the model?
• Regression:
  – Encode the labels as [-1 0 1] or [1 2 3], etc.
  – Enforces a metric notion of distance between the classes.
• Multi-class classification:
  – Assumes the labels are nominal (unordered).
• A solution: ordinal regression.
PLA PRE+KET KET
Examples of ranked data
• Drug dose: placebo, 50 mg, 150 mg
• Disease state: controls, ARMS (at-risk mental state), first episode
• Disease progression: controls, MCI-s, MCI-c, Alzheimer’s
• Visual stimulus frequency: low, medium, high
Multivariate Ordinal Regression
• Models the natural ordering in labels.
• Bridges classification and metric regression.

Existing approaches:
1. A binary classifier applied to pairwise (exhaustive) difference images; requires that each class is present in the test data to decode the ranking.
2. Reduce the problem to a set of binary classifiers; create consensus labels by combining the classifiers using an ordinal ranking rule.

Ordinal regression using Gaussian processes instead provides:
• Probabilistic predictions
• Considers all classes simultaneously
• A kernel framework
• An elegant solution for model selection and comparison

𝑝(𝒇|𝒟, 𝜽) = 𝑝(𝒚|𝒇, 𝜽) 𝑝(𝒇|𝜽) / 𝑝(𝒟|𝜽)

Prediction: arg maxᵢ 𝑝(𝑦* = 𝑖 | 𝒙*, 𝑿, 𝒚, 𝜽)
Multivariate Ordinal Regression
• Validated on two exemplar pharmacological datasets:
  – Ketamine study: do we observe a whole-brain ordinal response for 1) placebo - lamotrigine+ketamine - ketamine and 2) placebo - risperidone+ketamine - ketamine?
  – Scopolamine study: perfusion data (pCASL) acquired from 15 healthy volunteers on three visits: placebo, donepezil+scopolamine and scopolamine.
• Compare ordinal regression using Gaussian processes (ORGP) with:
  – Multi-class classification using binary classifiers and error-correcting codes (state-of-the-art).
  – Inherently multi-class GP classification (considers all classes simultaneously).
Performance

Metric          LAM (ORGP)  LAM (PMCGP)  LAM (MCGP)  RIS (ORGP)  RIS (PMCGP)  RIS (MCGP)
Accuracy        72.9%*†‡    60.4%*       56.3%*      60.4%*      56.3%*       56.3%*
Kendall’s tau   0.70        0.61         0.53        0.57        0.61         0.61
AIC             86.9        -            93.5        89.7        -            95.5
Confusion matrices for the lamotrigine study (rows: actual PLA / INT / KET, 16 examples per class; columns: predicted PLA / INT / KET):

ORGP:           MCGP:
12  4  0        15  1  0
 3 10  3         5  1 11
 1  2 13         4  1 11
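The accuracy and Kendall's tau figures in these tables can be computed directly from actual and predicted ordinal labels; a sketch with made-up labels (0 = PLA, 1 = INT, 2 = KET), using the simple tau-a variant:

```python
# Accuracy and Kendall's tau (tau-a) between actual and predicted
# ordinal labels. The label vectors below are illustrative only.
def accuracy(actual, predicted):
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

def kendall_tau(actual, predicted):
    # Count label pairs ranked in the same order (concordant) vs.
    # the opposite order (discordant); ties contribute to neither.
    concordant = discordant = 0
    n = len(actual)
    for i in range(n):
        for j in range(i + 1, n):
            s = (actual[i] - actual[j]) * (predicted[i] - predicted[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

actual = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 2]
acc = accuracy(actual, predicted)
tau = kendall_tau(actual, predicted)
```

Unlike accuracy, Kendall's tau rewards predictions that preserve the ordering even when the exact class is missed, which is why it is a natural companion metric for ordinal regression.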
Performance (scopolamine study)

Region           Metric          DON (ORGP)  DON (PMCGP)  DON (MCGP)
ACC              Accuracy        73.3%*†‡    40.0%*       51.1%*
                 Kendall’s tau   0.70        0.21         0.53
                 AIC             58.6        -            86.6
Occipital lobe   Accuracy        64.4%*      64.4%*       60.0%*
                 Kendall’s tau   0.63        0.59         0.52
                 AIC             64.4        -            80.8
Thalamus         Accuracy        80.0%*†‡    60.0%*       68.9%*
                 Kendall’s tau   0.81        0.60         0.72
                 AIC             50.6        -            68.6
Summary
• For data that lie on a continuum, ordinal regression often outperformed the state-of-the-art.
• It provides probabilistic predictions.
• The marginal likelihood can be used for model selection as well as model comparison, i.e. it does not require nested cross validation.
Practical Info
• For 16 whole-brain images it takes ~7 minutes to optimise 4 hyperparameters in a leave-one-out manner (16 runs), using MATLAB 8 on the CNS server.
• Pattern recognition is a powerful tool for discrimination and prediction.
• Its potential has been well-established for neuroimaging data.