Classification with microarray data · The challenge Given a training set (data of known class),...

36
Classification with microarray data Aron Charles Eklund [email protected] January 9, 2009 1

Transcript of Classification with microarray data · The challenge Given a training set (data of known class),...

Page 1: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Classification with microarray data

Aron Charles [email protected]

January 9, 2009

1

Page 2: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

What to expect from Aron today

What is classification, and why

How to do classification with microarray data

After lunch : case study.

Then, classification exercises.

2

Page 3: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Sample PreparationHybridization

Array designProbe design

Experimental Design

Buy standardChip / Array

Statistical AnalysisFit to Model (time series)

Expression IndexCalculation

Advanced Data AnalysisClustering PCA Gene Annotation Analysis Promoter Analysis

Classification Meta analysis Survival analysis Regulatory Network

ComparableGene Expression Data

Normalization

Image analysis

Question/hypothesis

3

Page 4: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

What is classification?Given an object with a set of features (input data), a classifier assigns

the objects to a class.

classifier

Aronage 34sex maleincome $$

Obama

McCainfeatures

object

classifier algorithm:

if (age > 40) AND (income > $$)then vote = McCainelse vote = Obama

A classifier is simply a set of rules or an algorithm

4

Page 5: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Why is classification useful?1. Classification as an end in itself

– diagnosis of disease– predict response to chemotherapy– identify bacterial strains

2. Learn from the classifier– identify drug targets

classifier algorithm:

if (age > 40) AND (income > $$)then vote = republicanelse vote = democrat

5

Page 6: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

The challenge

Given a training set (data of known class), design a classifier that accurately predicts the class of novel data.

x01 x02 x03 x04 x05 x06 x07age 43 65 34 22 44 29 57sex female male male female female male maleincome $$$$ $$$$ $$$$$ $ $$$$$ $$ $$$vote Obama McCain Obama McCain Obama McCain ???

Training set Test

This is also called statistical learning or machine learning.

6

Page 7: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

G1

G2

Classification as mapping

A classifier based on the expression levels of two genes, G1 and G2:

ObamaMcCain

Training data set:

>>> How do we do this algorithmically?

linear vs. nonlinear

multiple dimensions...?

unknown

7

Page 8: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Recipe for classification

1. Choose a classification method

2. Feature selection / dimension reduction

3. Train the classifier

4. Assess classifier performance

8

Page 9: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

1. Choose a classification method

linear:–Linear Discriminant Analysis (LDA)–Nearest Centroid

nonlinear:–k Nearest Neighbors (kNN)–Artificial Neural Network (ANN)–Support Vector Machine (SVM)

9

Page 10: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Linear Discriminant Analysis (LDA)

Find a line / plane / hyperplane

G1

G2

assumptions: • sampled from a normal distribution• variance/covariance same in each class

R: lda (MASS package)

10

Page 11: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Nearest centroid

Calculate centroids for each class.

G1

G2

Similar to LDA.

Can be extended to multiple (n > 2) classes.

11

Page 12: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

k Nearest Neighbors (kNN)

For a test case, find the k nearest samples in the training set, and let them vote.

G1

G2

• need to choose k (hyperparameter)• k must be odd• need to choose distance metric

No real “learning” involved - the training set defines the classifier.

R: knn (class package)

k=3

12

Page 13: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Artificial Neural Network (ANN)

Class A Class B

Gene1 Gene2 Gene3 ...

Each “neuron” is simply a function (usually nonlinear).

The “network” to the left just represents a series of nested functions.

Intepretation of results is difficult.

R: nnet (nnet package)

13

Page 14: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Support Vector Machine (SVM)

Find a line / plane / hyperplane maximizing margin. Uses the “kernel trick” to map data to higher dimensions.

G1

G2

• very flexible• many parameters

Results can be difficult to interpret.

R: svm (e1071 package)

support vectors

margin

14

Page 15: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Comparison of methods

Linear discriminant analysisNearest centroidKNN

Neural networksSupport vector machines

Simple methodBased on distance calculationGood for simple problemsGood for few training samples

Distribution of data assumed

Advanced methodsInvolve machine learning

Several adjustable parametersMany training samples required

(e.g.. 50-100)Flexible methods

G1

G2

15

Page 16: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Recipe for classification

1. Choose a classification method

2. Feature selection / dimension reduction

3. Train the classifier

4. Assess classifier performance

16

Page 17: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Dimension reduction - why?

Previous slides: 2 genes X 16 samples.

Your data: 22000 genes X 31 samples.

More genes = more parameters in classifier. Potential for over-training!! aka The Curse of Dimensionality.

number of parameters

accuracy

test set (novel data)

training set

17

Page 18: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Dimension reduction

Two approaches:

1. Data transformation–Principal component analysis (PCA); use top components –Average co-expressed genes (e.g. from cluster analysis)

2. Feature selection (gene selection)–Significant genes: t-test / ANOVA–High variance genes–Hypothesis-driven–

18

Page 19: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Recipe for classification

1. Choose a classification method

2. Feature selection / dimension reduction

3. Train the classifier

4. Assess classifier performance

19

Page 20: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Train the classifierIf you know the exact classifier you want to use: just do it.

but...

Usually, we want to try several classifiers or several variations of a classifier. e.g. number of features, k in kNN, etc.

The problem:Testing several classifiers, and then choosing the best one leads to

selection bias (= overtraining). This is bad.

Instead we can use cross-validation to choose a classifier.

20

Page 21: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Cross validation

Data set

Performance

Train Classifier

Temporary training set

Temporary testing set

Apply classifier to temporary testing set

split

Do this several times!

Each time, select different sample(s) for the temporary testing set.

Leave-one-out cross-validation (LOOCV):Do this once for each sample (i.e. the temporary testing set is only the one sample).

Assess average performance.

If you are using cross-validation as an optimization step, choose the classifier (or classifier parameters) that results in best performance.

21

Page 22: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Cross-validation example

Given data with 100 samples

5-fold cross-validation: Training: 4/5 of data (80 samples) Testing: 1/5 of data (20 samples) - 5 different models and performance results

Leave-one-out cross-validation (LOOCV) Training: 99/100 of data (99 samples) Testing: 1/100 of data (1 sample) - 100 different models

22

Page 23: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Recipe for classification

1. Choose a classification method

2. Feature selection / dimension reduction

3. Train the classifier

4. Assess classifier performance

23

Page 24: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Assess performance of the classifier

How well does the classifier perform on novel samples?

Worst: Assess performance on the training set.

Better: Assess performance on the training set using cross-validation.

Alternative: Set aside a subset of your data as a test set.

Best: Also assess performance on an entirely independent data set (e.g. data produced by another lab).

24

Page 25: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Problems with cross-validation

1. You cannot use the same cross-validation to optimize and evaluate your classifier.

2. The classifier obtained at each round of cross-validation is different, and is different from that obtained using all data.

3. Cross-validation will overestimate performance in the presence of experimental bias.

25

Page 26: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Assessing classifier performance

prediction actual outcome

Aron yes no FP

Beth no no TN

Chuck yes yes TP

Dorthe no yes FN

... ... ... ...

predict relapse of cancerTP = True PositiveTN = True NegativeFP = False PositiveFN = False Negative

predict yes predict no

actual yes 131 TP 21 FN

actual no 7 FP 212 TN

confusion matrix

26

Page 27: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Classifier performance: Accuracy

predict yes predict no

actual yes 131 TP 21 FN

actual no 7 FP 212 TN

confusion matrix

Accuracy = 0.92

range: [0 .. 1]

fraction of predictions that arecorrect

27

Page 28: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Unbalanced test data

confusion matrix

Accuracy = 0.98 (?)

a new classifier: does patient have tuberculosis?

>>> Accuracy can be misleading, especially when the test cases are unbalanced

For patients who do not really have tuberculosis: this classifier is usually wrong!

predict yes predict no

actual yes 5345 TP 21 FN

actual no 113 FP 18 TN

28

Page 29: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Classifier performance: sensitivity/specificity

predict yes predict no

actual yes 131 TP 21 FN

actual no 7 FP 212 TN

confusion matrix

Sensitivity = 0.86Specificity = 0.97

range: [0 .. 1]

the ability to detect positives

the ability to reject negatives

29

Page 30: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Classifier performance: Matthews Correlation Coefficient

predict yes predict no

actual yes 131 TP 21 FN

actual no 7 FP 212 TN

confusion matrix

MCC = 0.84

range: [-1 .. 1]

A single performance measurethat is less influenced by unbalanced test sets

30

Page 31: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Caution: unbalanced test data

predict yes predict no

actual yes 5345 TP 21 FN

actual no 113 FP 18 TN

confusion matrix

Accuracy = 0.98 (?)

Sensitivity = 0.996Specificity = 0.14

MCC = 0.24

a new classifier: does patient have tuberculosis?

31

Page 32: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Unbiased classifier assessment

Avoid information leakage!

1. Use separate data sets for training and for testing.

2. Try to avoid “wearing out” the testing data set. Testing multiple classifiers (and choosing the best one) can only increase the estimated performance -- therefore it is biased!

3. Perform any feature selection using only the training set (or temporary training set).

32

Page 33: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Summary: Recipe for classification

1. Choose a classification method– kNN, LDA, SVM, etc.

2. Feature selection / dimension reduction– PCA, possibly t-tests

3. Train the classifier– use cross-validation to optimize parameters

4. Assess classifier performance– use an independent test set if possible– determine sensitivity and specificity when appropriate

33

Page 34: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

The data set in the exercise

Genome-wide expression profiling of human blood reveals biomarkers for Huntington's disease.Borovecki F, Lovrecic L, Zhou J, Jeong H, Then F, Rosas HD, Hersch SM, Hogarth P, Bouzou B, Jensen RV, Krainc D.

Proc Natl Acad Sci U S A. 2005 Aug 2;102(31):11023-8.

31 samples:- 14 normal- 5 presymptomatic- 12 symptomatic

Affymetrix HG-U133A arrays (and Codelink Uniset)

34

Page 35: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

The data set in the exercise

Huntingon’s disease- neurological disorder- genetic- polyglutamine expansion in huntingtin gene

Why search for marker of disease progression (not diagnosis)- assess treatment efficacy- surrogate endpoint in drug trials

35

Page 36: Classification with microarray data · The challenge Given a training set (data of known class), design a classifier that accurately predicts the class of novel data. x01 x02 x03

Questions and/or Lunch

36