Development and Validation of Prognostic Classifiers using High Dimensional Data

Richard Simon, D.Sc.

Chief, Biometric Research Branch

National Cancer Institute

http://brb.nci.nih.gov

Class Prediction

• Predict which tumors will respond to a particular treatment

• Predict which patients will relapse after a particular treatment

Class Prediction

• A set of genes is not a classifier

• Testing whether analysis of independent data results in selection of the same set of genes is not an appropriate test of predictive accuracy of a classifier

Components of Class Prediction

• Feature (gene) selection

– Which genes will be included in the model

• Select model type

– E.g. diagonal linear discriminant analysis, nearest-neighbor, …

• Fitting parameters (regression coefficients) for model

– Selecting value of tuning parameters

• Estimating prediction accuracy

Class Prediction ≠ Class Comparison

• The criteria for gene selection for class prediction and for class comparison are different

– For class comparison false discovery rate is important

– For class prediction, predictive accuracy is important

• Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy.

• Statisticians are used to inference, not prediction

• Most statistical methods were not developed for p>>n prediction problems

Feature (Gene) Selection

• Select genes that are differentially expressed among the classes at a significance level (e.g. 0.01)

– The level is a tuning parameter which can be optimized by “inner” cross-validation

– For class comparison false discovery rate is important

– For class prediction, predictive accuracy is important

– For prediction it is usually more serious to exclude an informative variable than to include some noise variables

• L1 penalized likelihood or predictive likelihood for gene selection with penalty value selected by cross-validation

– Select smallest number of genes consistent with accurate prediction

Optimal significance level cutoffs for gene selection

50 differentially expressed genes out of 22,000 on n arrays

2δ/σ (standardized difference)   n=10     n=30      n=50
1                                0.167    0.003     0.00068
1.25                             0.085    0.0011    0.00035
1.5                              0.045    0.00063   0.00016
1.75                             0.026    0.00036   0.00006
2                                0.015    0.0002    0.00002

Complex Gene Selection

• Small subset of genes which together give most accurate predictions

– Genetic algorithms

• Little evidence that complex feature selection is useful in microarray problems

– Failure to compare to simpler methods

– Improper use of cross-validation

Linear Classifiers for Two Classes

$$l(x) = \sum_{i \in F} w_i x_i$$

$x$ = vector of log ratios or log signals

$F$ = features (genes) included in model

$w_i$ = weight for i'th feature

Decision boundary: $l(x) > d$ or $l(x) < d$

Linear Classifiers for Two Classes

• Fisher linear discriminant analysis

• Diagonal linear discriminant analysis (DLDA) assumes features are uncorrelated

• Compound covariate predictor (Radmacher)

• Golub’s weighted voting method

• Support vector machines with inner product kernel

Fisher LDA

$$x \sim \mathrm{MVN}(\mu_i, \Sigma)$$

$x$ = expression profile

$\mu_i$ = mean vector for i'th class

$\Sigma$ = common covariance matrix

$\Sigma = cI$ for DLDA

The Compound Covariate Predictor (CCP)

• Motivated by J. Tukey, Controlled Clinical Trials, 1993

• A compound covariate is built from the basic covariates (log-ratios)

$$\mathrm{CCP}_i = \sum_j t_j x_{ij}$$

$t_j$ is the two-sample t-statistic for gene j.

$x_{ij}$ is the log-expression measure of sample i for gene j.

Sum is over selected genes.

• Threshold of classification: midpoint of the CCP means for the two classes.
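The compound covariate predictor described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`t_statistic`, `fit_ccp`, `ccp_score`, `predict_ccp`) are hypothetical, the pooled-variance t-statistic is one common choice, and genes are selected here by taking the `n_genes` largest |t_j| rather than by a significance level.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Two-sample t-statistic with pooled variance (assumes non-constant data)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def ccp_score(weights, x):
    """CCP_i = sum over selected genes j of t_j * x_ij."""
    return sum(t_j * x[j] for j, t_j in weights.items())

def fit_ccp(X, y, n_genes):
    """X: list of samples (lists of log-expression values); y: labels 1 or 2."""
    g1 = [x for x, lab in zip(X, y) if lab == 1]
    g2 = [x for x, lab in zip(X, y) if lab == 2]
    t = [t_statistic([x[j] for x in g1], [x[j] for x in g2])
         for j in range(len(X[0]))]
    # Assumed selection rule for this sketch: keep the n_genes largest |t_j|.
    selected = sorted(range(len(t)), key=lambda j: abs(t[j]), reverse=True)[:n_genes]
    weights = {j: t[j] for j in selected}
    # Threshold: midpoint of the mean CCP values of the two classes.
    ccp1 = [ccp_score(weights, x) for x in g1]
    ccp2 = [ccp_score(weights, x) for x in g2]
    return weights, (mean(ccp1) + mean(ccp2)) / 2

def predict_ccp(model, x):
    weights, threshold = model
    # t_j is computed as class 1 minus class 2, so class 1 has the larger mean CCP.
    return 1 if ccp_score(weights, x) > threshold else 2

# Toy data: gene 0 separates the classes, genes 1 and 2 are noise.
X = [[2.0, 0.1, 1.0], [2.2, -0.1, 0.9], [1.9, 0.0, 1.2],
     [0.1, 0.05, 1.1], [-0.1, 0.1, 0.8], [0.0, -0.05, 1.0]]
y = [1, 1, 1, 2, 2, 2]
model = fit_ccp(X, y, n_genes=1)
print(predict_ccp(model, [2.1, 0.0, 1.0]), predict_ccp(model, [0.2, 0.0, 1.0]))
```

The toy data are invented for illustration; in practice the fit would be repeated inside each cross-validation training set.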

Linear Classifiers for Two Classes

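• Compound covariate predictor

$$w_i \propto \frac{\bar{x}_i^{(1)} - \bar{x}_i^{(2)}}{\hat{\sigma}_i}$$

instead of

$$w_i \propto \frac{\bar{x}_i^{(1)} - \bar{x}_i^{(2)}}{\hat{\sigma}_i^2}$$

for DLDA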

Support Vector Machine

$$\text{minimize } \sum_i w_i^2$$

$$\text{subject to } y_j \left( w' x^{(j)} + b \right) \ge 1$$

where $y_j = \pm 1$ for class 1 or 2.

Other Simple Methods

• Nearest neighbor classification

• Nearest k-neighbors

• Nearest centroid classification

• Shrunken centroid classification

Nearest Neighbor Classifier

• To classify a sample in the validation set as being in outcome class 1 or outcome class 2, determine which sample in the training set its gene expression profile is most similar to.

– Similarity measure used is based on genes selected as being univariately differentially expressed between the classes

– Correlation similarity or Euclidean distance generally used

• Classify the sample as being in the same class as its nearest neighbor in the training set
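As a rough sketch of the nearest neighbor rule above, the following Python uses Euclidean distance restricted to a list of selected genes. The names `euclidean` and `nearest_neighbor_predict` and the toy data are assumptions made for illustration; the gene list is taken as given here, whereas in a real analysis it would come from the univariate selection step on the training set.

```python
import math

def euclidean(a, b, genes):
    """Euclidean distance between two profiles, using only the selected genes."""
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in genes))

def nearest_neighbor_predict(train_X, train_y, x, genes):
    """Return the class label of the training sample closest to x."""
    distances = [euclidean(row, x, genes) for row in train_X]
    nearest = min(range(len(train_X)), key=distances.__getitem__)
    return train_y[nearest]

# Toy training set: gene 0 is informative, gene 1 is noise and is not selected.
train_X = [[1.0, 5.0], [1.2, -3.0], [4.0, 0.0], [4.3, 2.0]]
train_y = [1, 1, 2, 2]
selected_genes = [0]
print(nearest_neighbor_predict(train_X, train_y, [1.1, 9.0], selected_genes))
print(nearest_neighbor_predict(train_X, train_y, [3.9, -9.0], selected_genes))
```

Swapping `euclidean` for one minus the Pearson correlation would give the correlation-based variant mentioned on the slide.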

When p>>n

• It is always possible to find a set of features and a weight vector for which the classification error on the training set is zero.

• Why consider more complex models?
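The zero-training-error claim can be illustrated with a small experiment: pure noise features with randomly assigned labels, fitted by a perceptron. The perceptron is used here only as a convenient way to find a separating weight vector (an assumption of this sketch, not a method from the slides); the sizes n = 10 and p = 200 and the seed are arbitrary.

```python
import random

def perceptron_fit(X, y, max_epochs=1000):
    """Classic perceptron updates; stops after an epoch with no mistakes."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if label * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def training_errors(X, y, w, b):
    """Number of training samples on the wrong side of the fitted boundary."""
    return sum(label * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0
               for x, label in zip(X, y))

rng = random.Random(0)
n, p = 10, 200                      # far more features than samples
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [rng.choice([-1, 1]) for _ in range(n)]   # labels unrelated to X

w, b = perceptron_fit(X, y)
errors = training_errors(X, y, w, b)
print("training errors on pure noise:", errors)
```

The labels carry no information about the features, so the perfect fit on the training set says nothing about accuracy on new samples.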

• Some predictive classification problems are easy

– Toy problems

– Almost all methods work well

– It is possible to predict accurately with few variables

• Many real classification problems are difficult

– Difficult to distinguish informative variables from the extreme order statistics of noise variables

– Comparative studies generally indicate that simpler methods work as well or better for microarray problems because they avoid overfitting the data.

Other Methods

• Top-scoring pairs

• CART

• Random forest

Dimension Reduction Methods

• Principal component regression

• Supervised principal component regression

• Partial least squares

• Compound covariate and DLDA

• Supervised clustering

• L1 penalized logistic regression

When There Are More Than 2 Classes

• Nearest neighbor type methods

• Decision tree of binary classifiers

Decision Tree of Binary Classifiers

• Partition the set of classes {1,2,…,K} into two disjoint subsets S1 and S2

– e.g. S1={1}, S2={2,3,4}

– Develop a binary classifier for distinguishing the composite classes S1 and S2

• Compute the cross-validated classification error for distinguishing S1 and S2

• Repeat the above steps for all possible partitions in order to find the partition S1 and S2 for which the cross-validated classification error is minimized

• If S1 and S2 are not singleton sets, then repeat all of the above steps separately for the classes in S1 and S2 to optimally partition each of them
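The partition search above can be sketched as follows. `binary_partitions` enumerates every split of the class set into two non-empty subsets (2^(K-1) - 1 of them), and `best_partition` picks the split with the smallest error. The `cv_error` callback is a hypothetical placeholder for the cross-validated error of a binary classifier on the composite classes; the toy callback used below is not a real error estimate.

```python
from itertools import combinations

def binary_partitions(classes):
    """All splits of `classes` into two non-empty subsets (S1 contains classes[0])."""
    classes = list(classes)
    first, rest = classes[0], classes[1:]
    partitions = []
    # `size` counts how many of the remaining classes join `first` in S1;
    # stopping before len(rest) keeps S2 non-empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            s1 = [first, *extra]
            s2 = [c for c in rest if c not in extra]
            partitions.append((s1, s2))
    return partitions

def best_partition(classes, cv_error):
    """Partition minimizing cv_error(S1, S2); cv_error is supplied by the caller."""
    return min(binary_partitions(classes), key=lambda part: cv_error(*part))

parts = binary_partitions([1, 2, 3, 4])
print(len(parts), parts[0])

# Placeholder "error" that simply favors balanced splits, for demonstration only.
toy_error = lambda s1, s2: abs(len(s1) - len(s2))
print(best_partition([1, 2, 3, 4], toy_error))
```

Applying `best_partition` recursively to any subset with more than one class would build the full tree described in the last bullet.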

Gene-Expression Profiles in Hereditary Breast Cancer

• Breast tumors studied:

– 7 BRCA1+ tumors

– 8 BRCA2+ tumors

– 7 sporadic tumors

• Log-ratio measurements of 3226 genes for each tumor after initial data filtering

cDNA Microarrays: Parallel Gene Expression Analysis

RESEARCH QUESTION

Can we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ from BRCA2– cancers based solely on their gene expression profiles?

BRCA1

Significance level g   # of significant genes   # of misclassified samples (m)   % of random permutations with m or fewer misclassifications
10^-2                  182                      3                                0.4
10^-3                  53                       2                                1.0
10^-4                  9                        1                                0.2

BRCA2

Significance level g   # of significant genes   m = # of misclassified samples (misclassified samples)   % of random permutations with m or fewer misclassifications
10^-2                  212                      4 (s11900, s14486, s14572, s14324)                       0.8
10^-3                  49                       3 (s11900, s14486, s14324)                               2.2
10^-4                  11                       4 (s11900, s14486, s14616, s14324)                       6.6

Classification of BRCA2 Germline Mutations

Classification Method                    LOOCV Prediction Error
Compound Covariate Predictor             14%
Fisher LDA                               36%
Diagonal LDA                             14%
1-Nearest Neighbor                       9%
3-Nearest Neighbor                       23%
Support Vector Machine (linear kernel)   18%
Classification Tree                      45%

Evaluating a Classifier

• “Prediction is difficult, especially the future.” – Niels Bohr

Evaluating a Classifier

• Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data

– Goodness of fit vs prediction accuracy

Simulation   Training    Validation
1            p=7.0e-05   p=0.70
2            p=4.2e-07   p=0.54
3            p=2.4e-13   p=0.60
4            p=1.3e-10   p=0.89
5            p=1.8e-13   p=0.36
6            p=5.5e-11   p=0.81
7            p=3.2e-09   p=0.46
8            p=1.8e-07   p=0.61
9            p=1.1e-07   p=0.49
10           p=4.3e-09   p=0.09

• Hazard ratios and statistical significance levels are not appropriate measures of prediction accuracy

• A hazard ratio is a measure of association

– Large values of HR may correspond to small improvement in prediction accuracy

• Kaplan-Meier curves for predicted risk groups within strata defined by standard prognostic variables provide more information about improvement in prediction accuracy

• Time dependent ROC curves within strata defined by standard prognostic factors can also be useful

Time-Dependent ROC Curve

• Sensitivity vs 1-Specificity

• Sensitivity = prob{M=1|S>T}

• Specificity = prob{M=0|S<T}

$$\Pr\{M=1 \mid S>T\} = \frac{\Pr\{S>T \mid M=1\}\,\Pr\{M=1\}}{\Pr\{S>T\}}$$

$$\Pr\{M=0 \mid S<T\} = \frac{\Pr\{S<T \mid M=0\}\,\Pr\{M=0\}}{\Pr\{S<T\}}$$
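A small numeric sketch of the Bayes' theorem step may help. It assumes an interpretation the slides do not spell out: M is a binary predicted group, S is survival time, T is a fixed landmark time, and the conditional survival probabilities Pr{S>T | M} are available (for example from Kaplan-Meier estimates within each predicted group). The function name `sens_spec_from_group_survival` and the input numbers are hypothetical.

```python
def sens_spec_from_group_survival(surv_m1, surv_m0, prev_m1):
    """
    surv_m1 = Pr{S>T | M=1}, surv_m0 = Pr{S>T | M=0}, prev_m1 = Pr{M=1}.
    Returns (Pr{M=1 | S>T}, Pr{M=0 | S<T}) via Bayes' theorem.
    Assumes 0 < Pr{S>T} < 1 and a continuous S, so Pr{S<T} = 1 - Pr{S>T}.
    """
    prev_m0 = 1.0 - prev_m1
    # Law of total probability for the denominator Pr{S>T}.
    p_surv = surv_m1 * prev_m1 + surv_m0 * prev_m0
    sensitivity = surv_m1 * prev_m1 / p_surv
    specificity = (1.0 - surv_m0) * prev_m0 / (1.0 - p_surv)
    return sensitivity, specificity

# Illustrative inputs only: 90% vs 50% survival beyond T, equal group sizes.
sens, spec = sens_spec_from_group_survival(0.9, 0.5, 0.5)
print(round(sens, 4), round(spec, 4))
```

Repeating the calculation over a range of cut-points defining M would trace out the time-dependent ROC curve at landmark T.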

Validation of a Predictor

• Internal validation

– Re-substitution estimate (very biased)

– Split-sample validation

– Cross-validation

• Independent data validation

Validating a Predictive Classifier

• Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data

– Goodness of fit is not prediction accuracy

• Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy

• Demonstrating stability of selected genes is not demonstrating predictive accuracy of a model for independent data

Split-Sample Evaluation

• Training-set

– Used to select features, select model type, determine parameters and cut-off thresholds

• Test-set

– Withheld until a single model is fully specified using the training-set.

– Fully specified model is applied to the expression profiles in the test-set to predict class labels.

– Number of errors is counted

– Ideally test set data is from different centers than the training data and assayed at a different time

Cross-Validated Prediction (Leave-One-Out Method)

[Figure: data matrix of specimens by log-expression ratios, divided into a training set and a test set]

1. Full data set is divided into training and test sets (test set contains 1 specimen).

2. Prediction rule is built from scratch using the training set.

3. Rule is applied to the specimen in the test set for class prediction.

4. Process is repeated until each specimen has appeared once in the test set.

Leave-one-out Cross Validation

• Omit sample 1

– Develop multivariate classifier from scratch on training set with sample 1 omitted

– Predict class for sample 1 and record whether prediction is correct

Leave-one-out Cross Validation

• Repeat analysis for training sets with each single sample omitted one at a time

• e = number of misclassifications determined by cross-validation

• Subdivide e for estimation of sensitivity and specificity

• Cross validation is only valid if the test set is not used in any way in the development of the model. Using the complete set of samples to select genes violates this assumption and invalidates cross-validation.

• With proper cross-validation, the model must be developed from scratch for each leave-one-out training set. This means that feature selection must be repeated for each leave-one-out training set.

– Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the analysis of DNA microarray data. Journal of the National Cancer Institute 95:14-18, 2003.

• The cross-validated estimate of misclassification error is an estimate of the prediction error for the model obtained by applying the specified algorithm to the full dataset

Permutation Distribution of Cross-validated Misclassification Rate of a Multivariate Classifier

Radmacher, McShane & Simon, J Comp Biol 9:505, 2002

• Randomly permute class labels and repeat the entire cross-validation

• Re-do for all (or 1000) random permutations of class labels

• Permutation p value is the fraction of random permutations that gave e or fewer misclassifications, where e is the number of cross-validated misclassifications in the real data
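The permutation procedure can be sketched as below. To keep the example short, the classifier is a one-dimensional nearest-centroid rule with no gene selection, which is an assumption of this sketch; with a real classifier the whole development process, including gene selection, would be rerun inside `loocv_errors` for every permutation. The function names, toy data, number of permutations, and seed are all illustrative.

```python
import random

def loocv_errors(values, labels):
    """Leave-one-out misclassification count for a 1-D nearest-centroid rule."""
    errors = 0
    for i in range(len(values)):
        train = [(v, lab) for k, (v, lab) in enumerate(zip(values, labels)) if k != i]
        centroids = {}
        for lab in set(lab for _, lab in train):
            members = [v for v, l in train if l == lab]
            centroids[lab] = sum(members) / len(members)
        predicted = min(centroids, key=lambda lab: abs(values[i] - centroids[lab]))
        errors += predicted != labels[i]
    return errors

def permutation_p_value(values, labels, n_perm=200, seed=0):
    """Fraction of label permutations with as few or fewer CV errors than observed."""
    rng = random.Random(seed)
    observed = loocv_errors(values, labels)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        # The entire cross-validation is repeated for each permuted labeling.
        hits += loocv_errors(values, shuffled) <= observed
    return observed, hits / n_perm

values = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
labels = [1, 1, 1, 1, 2, 2, 2, 2]
observed, p_value = permutation_p_value(values, labels)
print(observed, p_value)
```

For small samples all distinct labelings could be enumerated instead of sampled, which corresponds to the "all (or 1000)" option on the slide.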

Prediction on Simulated Null Data

Generation of Gene Expression Profiles

• 14 specimens (Pi is the expression profile for specimen i)

• Log-ratio measurements on 6000 genes

• Pi ~ MVN(0, I6000)

• Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)?

Prediction Method

• Compound covariate prediction (described earlier)

• Compound covariate built from the log-ratios of the 10 most differentially expressed genes.
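A minimal simulation along these lines is sketched below for a single null data set: 14 specimens, 6000 independent standard normal "genes", and a compound covariate predictor built from the 10 genes with the largest |t|. It compares three error counts: resubstitution, leave-one-out cross-validation with genes selected once on the full data, and leave-one-out cross-validation with gene selection repeated inside every training set. Function names and the seed are assumptions, and one data set only hints at the pattern; a real study would repeat this over many simulated data sets.

```python
import math
import random

def t_stat(a, b):
    """Pooled-variance two-sample t-statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    return (ma - mb) / math.sqrt(ss / (na + nb - 2) * (1 / na + 1 / nb))

def gene_t(X, y, j):
    a = [x[j] for x, lab in zip(X, y) if lab == 1]
    b = [x[j] for x, lab in zip(X, y) if lab == 2]
    return t_stat(a, b)

def top_genes(X, y, k):
    """Indices of the k genes with the largest |t| between the two classes."""
    return sorted(range(len(X[0])), key=lambda j: abs(gene_t(X, y, j)), reverse=True)[:k]

def ccp_predict(X, y, genes, x_new):
    """Fit a compound covariate predictor on (X, y) with `genes`; classify x_new."""
    w = {j: gene_t(X, y, j) for j in genes}
    score = lambda x: sum(wj * x[j] for j, wj in w.items())
    s1 = [score(x) for x, lab in zip(X, y) if lab == 1]
    s2 = [score(x) for x, lab in zip(X, y) if lab == 2]
    threshold = (sum(s1) / len(s1) + sum(s2) / len(s2)) / 2
    return 1 if score(x_new) > threshold else 2

def resubstitution_errors(X, y, k):
    genes = top_genes(X, y, k)
    return sum(ccp_predict(X, y, genes, x) != lab for x, lab in zip(X, y))

def loocv_errors(X, y, k, select_inside):
    fixed = None if select_inside else top_genes(X, y, k)  # uses the left-out sample too
    errors = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_genes(X_tr, y_tr, k) if select_inside else fixed
        errors += ccp_predict(X_tr, y_tr, genes, X[i]) != y[i]
    return errors

rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(6000)] for _ in range(14)]
y = [1] * 7 + [2] * 7
results = {
    "resubstitution": resubstitution_errors(X, y, 10),
    "cv_genes_selected_on_full_data": loocv_errors(X, y, 10, select_inside=False),
    "cv_genes_selected_inside_loop": loocv_errors(X, y, 10, select_inside=True),
}
print(results)
```

Since the data contain no class signal, an honest estimate should hover around 7 errors out of 14; estimates far below that reflect selection bias rather than predictive ability.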

[Figure: proportion of simulated data sets (vertical axis) versus number of misclassifications (horizontal axis) for three approaches: no cross-validation (resubstitution method), cross-validation after gene selection, and cross-validation prior to gene selection]

Major Flaws Found in 40 Studies Published in 2004

• Inadequate control of multiple comparisons in gene finding

– 9/23 studies had unclear or inadequate methods to deal with false positives

– 10,000 genes x .05 significance level = 500 false positives

• Misleading report of prediction accuracy

– 12/28 reports based on incomplete cross-validation

• Misleading use of cluster analysis

– 13/28 studies invalidly claimed that expression clusters based on differentially expressed genes could help distinguish clinical outcomes

• 50% of studies contained one or more major flaws

Class Prediction

• Cluster analysis is frequently used in publications for class prediction in a misleading way

Fallacy of Clustering Classes Based on Selected Genes

• Even for arrays randomly distributed between classes, genes will be found that are “significantly” differentially expressed

• With 10,000 genes measured, about 500 false positives will be differentially expressed with p < 0.05

• Arrays in the two classes will necessarily cluster separately when using a distance measure based on genes selected to distinguish the classes

Myth

• Split sample validation is superior to LOOCV or 10-fold CV for estimating prediction error

Comparison of Internal Validation Methods

Molinaro, Pfeiffer & Simon

• For small sample sizes, LOOCV is much less biased than split-sample validation

• For small sample sizes, LOOCV is preferable to 10-fold or 5-fold cross-validation or repeated k-fold versions

• For moderate sample sizes, 10-fold is preferable to LOOCV

Simulated Data: 40 cases, 10 genes selected from 5000

Method             Estimate   Std Deviation
True               .078
Resubstitution     .007       .016
LOOCV              .092       .115
10-fold CV         .118       .120
5-fold CV          .161       .127
Split sample 1-1   .345       .185
Split sample 2-1   .205       .184
.632+ bootstrap    .274       .084

Simulated Data: 40 cases

Method               Estimate   Std Deviation
True                 .078
10-fold              .118       .120
Repeated 10-fold     .116       .109
5-fold               .161       .127
Repeated 5-fold      .159       .114
Split 1-1            .345       .185
Repeated split 1-1   .371       .065

Sample Size Planning References

• K Dobbin, R Simon. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 6:27, 2005

• K Dobbin, R Simon. Sample size planning for developing classifiers using high dimensional DNA microarray data. Biostatistics 8:101, 2007

• K Dobbin, Y Zhao, R Simon. How large a training set is needed to develop a classifier for microarray data? Clinical Cancer Res 14:108, 2008

Development of Empirical Gene Expression Based Classifier

• 20-30 phase II responders are needed to compare to non-responders in order to develop signature for predicting response

Sample Size Planning for Classifier Development

• The expected value (over training sets) of the probability of correct classification PCC(n) should be within γ of the maximum achievable PCC(∞)

Probability Model

• Two classes

• Log expression or log ratio MVN in each class with common covariance matrix

• m differentially expressed genes

• p-m noise genes

• Expression of differentially expressed genes is independent of expression for noise genes

• All differentially expressed genes have same inter-class mean difference 2δ

• Common variance for differentially expressed genes and for noise genes

Classifier

• Feature selection based on univariate t-tests for differential expression at significance level α

• Simple linear classifier with equal weights (except for sign) for all selected genes. Power for selecting each of the informative genes that are differentially expressed by mean difference 2δ is 1-β(n)

• For 2 classes of equal prevalence, let λ1 denote the largest eigenvalue of the covariance matrix of informative genes. Then

$$\mathrm{PCC}(\infty) \ge \Phi\!\left( \frac{\delta}{\sigma} \sqrt{\frac{m}{\lambda_1}} \right)$$

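$$\mathrm{PCC}(n) \ge \Phi\!\left( \frac{\delta}{\sigma} \, \frac{1}{\sqrt{\lambda_1}} \, \frac{\sqrt{m}\,\bigl(1-\beta(n)\bigr)}{\sqrt{\bigl(1-\beta(n)\bigr) + (p-m)\,\alpha/m}} \right)$$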

[Figure: sample size (vertical axis) versus 2δ/σ (horizontal axis), with curves for γ = 0.05 and γ = 0.10]

Sample size as a function of effect size (log-base 2 fold-change between classes divided by standard deviation). Two different tolerances shown, γ = 0.05 and γ = 0.10. Each class is equally represented in the population. 22000 genes on an array.

Developing and Validating Classifiers of Survival Risk

BRB-ArrayTools Survival Risk Group Prediction

• No need to transform data to good vs bad outcome. Censored survival is directly analyzed

• Gene selection based on significance in univariate Cox Proportional Hazards regression

• Uses k principal components of selected genes

• Gene selection re-done for each resampled training set

• Develop k-variable Cox PH model for each leave-one-out training set

BRB-ArrayTools Survival Risk Group Prediction

• Classify left out sample as above or below median risk based on model not involving that sample

• Repeat, leaving out 1 sample at a time to obtain cross-validated risk group predictions for all cases

• Compute Kaplan-Meier survival curves of the two predicted risk groups

• Permutation analysis to evaluate statistical significance of separation of K-M curves

BRB-ArrayTools Survival Risk Group Prediction

• Compare Kaplan-Meier curves for gene expression based classifier to that for standard clinical classifier

• Develop classifier using standard clinical staging plus genes that add to standard staging

Does an Expression Profile Classifier Predict More Accurately Than Standard Prognostic Variables?

• Not an issue of which variables are significant after adjusting for which others or which are independent predictors

– Predictive accuracy and inference are different

• The predictiveness of the expression profile classifier can be evaluated within levels of the classifier based on standard prognostic variables

Survival Risk Group Prediction

• LOOCV loop:

– Create training set by omitting i’th case

• Develop PH model for training set

• Compute predictive index for i’th case using PH model developed for training set

• Compute percentile of predictive index for i’th case among predictive indices for cases in the training set

Survival Risk Group Prediction

• Plot Kaplan-Meier survival curves for cases with cross-validated predictive index percentiles above 50% and for cases with cross-validated predictive index percentiles below 50%

– Or for however many risk groups and thresholds are desired

• Compute log-rank statistic comparing the cross-validated Kaplan Meier curves
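A minimal sketch of these two steps, assuming two risk groups split at the 50th percentile. The function names and the toy percentiles are hypothetical. Only the log-rank statistic is returned, with no chi-square p-value attached.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier curve as a list of (event time, survival probability)."""
    surv, curve = 1.0, []
    for t in sorted({ti for ti, di in zip(times, events) if di}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, di in zip(times, events) if di and ti == t)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

def logrank_statistic(times, events, high_risk):
    """Two-group log-rank statistic; high_risk is a list of booleans."""
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({ti for ti, di in zip(times, events) if di}):
        n = sum(1 for ti in times if ti >= t)                    # all at risk
        n1 = sum(1 for ti, g in zip(times, high_risk) if ti >= t and g)
        d = sum(1 for ti, di in zip(times, events) if di and ti == t)
        d1 = sum(1 for ti, di, g in zip(times, events, high_risk)
                 if di and ti == t and g)
        obs_minus_exp += d1 - d * n1 / n                         # observed - expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / var if var > 0 else 0.0

# Toy data; the cross-validated percentiles are hypothetical values.
times = [30, 22, 25, 14, 16, 9, 11, 4]
events = [0, 1, 1, 1, 0, 1, 1, 1]
cv_percentile = [10, 30, 45, 55, 40, 80, 90, 100]
high_risk = [p > 50 for p in cv_percentile]

km_high = kaplan_meier([t for t, g in zip(times, high_risk) if g],
                       [e for e, g in zip(events, high_risk) if g])
km_low = kaplan_meier([t for t, g in zip(times, high_risk) if not g],
                      [e for e, g in zip(events, high_risk) if not g])
stat = logrank_statistic(times, events, high_risk)
```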


Survival Risk Group Prediction

• Evaluate individual genes by fitting single variable proportional hazards regression models to log expression for each gene

• Select genes based on p-value threshold for single gene PH regressions

• Compute first k principal components of the selected genes

• Fit PH regression model with the k pc’s as predictors. Let b1 , …, bk denote the estimated regression coefficients

• To predict for case with expression profile vector x, compute the k supervised pc’s y1 , …, yk and the predictive index = b1 y1 + … + bk yk
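The last step can be written out as follows, assuming the principal components are computed on centered training data. The loading vector $v_j$ of the j-th component over the selected genes and the training-set mean profile $\bar{x}$ are notation introduced here for illustration, not symbols from the slides; $x$ is restricted to the selected genes.

```latex
y_j = v_j^{\top}\,(x - \bar{x}), \qquad j = 1, \dots, k
\qquad\qquad
\text{predictive index}(x) = \sum_{j=1}^{k} b_j\, y_j
```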


Survival Risk Group Prediction

• Repeat the entire procedure for permutations of survival times and censoring indicators to generate the null distribution of the log-rank statistic

– The usual chi-square null distribution is not valid because the cross-validated risk percentiles are correlated among cases

• Evaluate statistical significance of the association of survival and expression profiles by referring the log-rank statistic for the unpermuted data to the permutation null distribution
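The permutation step can be sketched as below. This is an illustration under assumptions: `statistic` is a hypothetical callable that would rerun the entire procedure (gene selection, cross-validation, log-rank statistic) with the expression profiles held fixed, and the add-one correction in the p-value is a common convention chosen here, not something stated in the slides.

```python
import random

def permutation_p_value(times, events, statistic, n_perm=1000, seed=0):
    """Permutation p-value for a statistic computed from (times, events)."""
    rng = random.Random(seed)
    observed = statistic(times, events)
    pairs = list(zip(times, events))    # keep each time with its censoring indicator
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(pairs)              # reassign outcomes to cases at random
        perm_times = [p[0] for p in pairs]
        perm_events = [p[1] for p in pairs]
        if statistic(perm_times, perm_events) >= observed:
            at_least_as_large += 1
    # Add-one correction (an assumption here) keeps the p-value above zero.
    return (at_least_as_large + 1) / (n_perm + 1)

# Toy stand-in statistic (hypothetical): gap in mean follow-up time between
# cases with low and high values of a fixed covariate x.
x = [0.1, 0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9]

def toy_statistic(times, events):
    high = [t for t, xi in zip(times, x) if xi > 1.5]
    low = [t for t, xi in zip(times, x) if xi <= 1.5]
    return abs(sum(low) / len(low) - sum(high) / len(high))

p = permutation_p_value([30, 22, 25, 14, 16, 9, 11, 4],
                        [0, 1, 1, 1, 0, 1, 1, 1],
                        toy_statistic, n_perm=200)
```

Times and censoring indicators are shuffled together as pairs, so each permuted dataset keeps the observed outcome distribution while breaking its link to the expression profiles.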


Outcome prediction in estrogen-receptor positive, chemotherapy and tamoxifen treated patients with locally advanced breast cancer

R. Simon, G. Bianchini, M. Zambetti, S. Govi, G. Mariani, M. L. Carcangiu, P. Valagussa, L. Gianni

National Cancer Institute, Bethesda, MD; Fondazione IRCCS - Istituto Tumori di Milano, Milan, Italy


PATIENTS AND METHODS - I

• Fifty-seven patients with ER positive tumors enrolled in a neoadjuvant clinical trial for LABC were evaluated. All patients had been treated with doxorubicin and paclitaxel q 3wk x 3, followed by weekly paclitaxel x 12 before surgery, then adjuvant intravenous CMF q 4wk x 4 and thereafter tamoxifen.

• High-throughput qRT-PCR gene expression analysis in paraffin-embedded formalin-fixed core biopsies at diagnosis was performed by Genomic Health to quantify expression of 363 genes (plus 21 for Oncotype DX™ determination), as described previously (Gianni L, JCO 2005). RS genes were excluded from analysis.


PATIENTS AND METHODS - II

• Three models (prognostic indices) were developed to predict Distant Event Free Survival (DEFS):

– GENE MODEL: Using only expression data, genes were selected based on a univariate Cox analysis p value below a specified threshold significance level.

– COVARIATES MODEL: Using RS (as a continuous variable), age and IBC status as covariates, a multivariate proportional hazards model was developed.

– COMBINED MODEL: Using a combination of these covariates and expression data, genes were selected that add to predicting survival over the predictive value provided by the covariates, at a specified threshold significance level.

• Survival risk groups were constructed using the supervised principal component method implemented in BRB-ArrayTools (Bair E, Tibshirani R, PLOS Biology 2004).


PATIENTS AND METHODS - III

• To evaluate the predictive value of each model, complete Leave-One-Out Cross-Validation was used.

– For each i-th cross-validated training set (with one case removed) a prognostic index (PI) function was created. The PI for the omitted patient is ranked relative to the PIs of the i-th training set. Because the PI is a continuous variable, cut-off percentiles have to be pre-specified for defining the risk groups. The omitted patient is placed into a risk group based on her percentile ranking. The entire procedure was repeated using different cut-off percentiles (BRB-ArrayTools User’s Manual v3.7).


PATIENTS AND METHODS - IV

• Statistical significance was determined by repeating the entire cross-validation process for 1000 random permutations of the survival data.

– For the GENE MODEL, the p value tested the null hypothesis that there is no relation between the expression data and survival (the permutations provide a null distribution of the log-rank statistic)

– For the COVARIATES MODEL, the p value was that of the parametric log-rank test between risk groups

– For the COMBINED MODEL, the p value addressed whether the expression data add significantly to risk prediction compared to the covariates


RESULTS: Patient characteristics at diagnosis

• The median follow-up was 76 months (range 18-103) (by inverse Kaplan-Meier method)

• Patient characteristics are summarized in Table 1.


[Figure: Overall Survival (OS) and Distant Event Free Survival (DEFS), all patients. x-axis: Months (0 to 100); y-axis: Probability (0.0 to 1.0).]


Genes selected for the GENE MODEL and COMBINED MODEL

• The significance level for gene selection used for the identified models was p=0.005.

• All genes included in the COMBINED MODEL were also selected in the GENE MODEL.

GENE MODEL

Gene symbol    For increase of gene expression
BECN1          Better Prognosis
ABCC4          Poorer Prognosis
IL10           Poorer Prognosis
DHPS           Better Prognosis
STS            Poorer Prognosis
ErbB3          Better Prognosis
ZSCAN21        Better Prognosis
IRS1           Better Prognosis
FOXA1          Better Prognosis
ERCC1          Better Prognosis
ABCC1          Better Prognosis
FUS            Better Prognosis
HPN            Better Prognosis
ECGF1          Poorer Prognosis

COMBINED MODEL

Gene symbol    For increase of gene expression
BECN1          Better Prognosis
ABCC4          Poorer Prognosis
IL10           Poorer Prognosis
DHPS           Better Prognosis
STS            Poorer Prognosis
ErbB3          Better Prognosis
ABCC1          Better Prognosis

Covariates (included in COVARIATES MODEL)

RS (unit increase)       Poorer Prognosis
Age ≥ 50 (vs < 50)       Better Prognosis
IBC (vs not-IBC)         Poorer Prognosis


Cross-validated Kaplan-Meier curves for risk groups using 50th percentile cut-off

[Figure panels: GENE MODEL, COVARIATES MODEL, COMBINED MODEL. y-axis of each panel: Distant Event Free Survival.]


An Evaluation of Resampling Methods for Assessment of Survival Risk Prediction in High-dimensional Settings

Jyothi Subramanian and Richard Simon*

Biometric Research Branch, National Cancer Institute


ABSTRACT

Resampling techniques are often used to provide an initial assessment of accuracy for prognostic prediction models developed using high-dimensional genomic data with binary outcomes. Risk prediction is most important, however, in medical applications and frequently the outcome measure is a right-censored time-to-event variable such as survival. Although several methods have been developed for survival risk prediction with high-dimensional genomic data, there has been little evaluation of the use of resampling techniques for the assessment of such models. Using real and simulated datasets, we compared several resampling techniques for their ability to estimate the accuracy of risk prediction models. Our study showed that accuracy estimates for popular resampling methods such as sample splitting and leave-one-out cross-validation have a higher mean square error than for other methods. Moreover, the large variability of the split-sample and leave-one-out cross-validation may make the point estimates of accuracy obtained using these methods unreliable and hence should be interpreted carefully. A k-fold cross-validation with k = 5 or 10 was seen to provide a good balance between bias and variability for a wide range of data settings and should be more widely adopted in practice.
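As a small illustration of the k-fold scheme the abstract recommends, the sketch below partitions case indices into folds. The function names are hypothetical. In a real analysis every model-building step, gene selection included, would be rerun on each training set, in line with the cross-validation procedure described earlier.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition case indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[f::k]) for f in range(k)]

def train_test_splits(n, k, seed=0):
    """Yield (train, test) index lists; each case is held out exactly once."""
    folds = k_fold_indices(n, k, seed)
    for f, test in enumerate(folds):
        train = sorted(i for g, fold in enumerate(folds) if g != f for i in fold)
        yield train, test

# Example: 23 cases split for 5-fold cross-validation.
splits = list(train_test_splits(n=23, k=5))
```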


Cross-Validated Time Dependent ROC Curve

Predictive index for i’th patient:

$$M_i = \hat{\beta}_{-i}'\, x_i$$

$\hat{\beta}_{-i}$ = regression coefficients for model with pt i omitted

Sensitivity for cutpoint c:

$$\Pr(M \ge c \mid S \le T) = \frac{\Pr(S \le T \mid M \ge c)\,\Pr(M \ge c)}{\Pr(S \le T)}$$

Specificity for cutpoint c:

$$\Pr(M < c \mid S > T) = \frac{\Pr(S > T \mid M < c)\,\Pr(M < c)}{\Pr(S > T)}$$

ROC: Sensitivity vs 1-Specificity
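One way to evaluate these expressions with censored data is to plug Kaplan-Meier estimates into the survival probabilities. The sketch below does that under assumptions that are not stated on the slide: S is read as the survival time, T as a fixed landmark time (`t_star`), and the Kaplan-Meier plug-in is one possible estimator. All function names are hypothetical.

```python
def km_survival(times, events, t_star):
    """Kaplan-Meier estimate of Pr(S > t_star); 1.0 for an empty group."""
    surv = 1.0
    for t in sorted({ti for ti, di in zip(times, events) if di and ti <= t_star}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, di in zip(times, events) if di and ti == t)
        surv *= 1.0 - deaths / at_risk
    return surv

def sensitivity_specificity(times, events, index, c, t_star):
    """Sensitivity and specificity at cutpoint c via Bayes' rule.

    Assumes 0 < Pr(S > t_star) < 1 in the full sample.
    """
    n = len(times)
    high = [i for i in range(n) if index[i] >= c]
    low = [i for i in range(n) if index[i] < c]
    p_high = len(high) / n                                   # Pr(M >= c)
    s_all = km_survival(times, events, t_star)               # Pr(S > T)
    s_high = km_survival([times[i] for i in high],
                         [events[i] for i in high], t_star)  # Pr(S > T | M >= c)
    s_low = km_survival([times[i] for i in low],
                        [events[i] for i in low], t_star)    # Pr(S > T | M < c)
    sens = (1.0 - s_high) * p_high / (1.0 - s_all)
    spec = s_low * (1.0 - p_high) / s_all
    return sens, spec

def roc_points(times, events, index, t_star):
    """(1 - specificity, sensitivity) pairs, one per observed index value."""
    points = []
    for c in sorted(set(index)):
        sens, spec = sensitivity_specificity(times, events, index, c, t_star)
        points.append((1.0 - spec, sens))
    return points

# Toy check without censoring: the two highest indices die by time 2.
sens, spec = sensitivity_specificity([1, 2, 3, 4], [1, 1, 1, 1],
                                     [4, 3, 2, 1], c=3, t_star=2)
points = roc_points([1, 2, 3, 4], [1, 1, 1, 1], [4, 3, 2, 1], t_star=2)
```

For the cross-validated curve, `index` would hold the predictive indices $M_i$, each computed from a model fitted with patient i omitted.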


$S_n$ = training set of n samples

$A(S_n)$ = accuracy of classifier developed using $S_n$

• AUC of time-dependent ROC curve

• Known for simulations

• Estimated from test set of N-n samples for real data

Set of N samples: N=280 lung cancer, N=240 DLBCL.

$\hat{A}^{\mathrm{resub}}(S_n)$ = resubstitution estimate of $A(S_n)$

$\hat{A}^{\mathrm{RS}}(S_n)$ = resampling based estimate of $A(S_n)$


Figure 1a. Distribution of the estimated resampling and true AUC(t) for SuperPCR. (Top) High-signal data. (Bottom) Null data. Panels in each row: n = 40, n = 80, n = 160.


Figure 1b. Distribution of the estimated resampling and true AUC(t) for SuperPCR. (Top) Lung cancer data. (Bottom) DLBCL data. Panels in each row: n = 40, n = 80, n = 160.


Figure 2. In the case of the null dataset, the AUC(t) is expected to be around 0.5. The number of times ÂRS(Sn) is too pessimistic (below 0.3, black bars) or too optimistic (above 0.7, light grey bars) in a total of 100 replications for the null dataset is shown. In comparison to this, the number of times Â(Sn) is below 0.3 or above 0.7 is nil. (a) n = 40 (b) n = 80 (c) n = 160.

[Figure panels (a) to (c), each with sub-panels UniCox, SuperPCR and Lasso. Rows: Split-sample, Loo, 10-fold, 5-fold, 2-fold. x-axis: Frequency (0 to 30).]


Figure 3: The number of times ÂRS(Sn) is too pessimistic (below 0.3, black bars) or too optimistic (above 0.7, light grey bars) in a total of 100 replications for the lung cancer dataset is compared with the number of times Â(Sn) is below 0.3 or above 0.7. (a) n = 40 (b) n = 80 (c) n = 160

[Figure panels (a) to (c), each with sub-panels UniCox, SuperPCR and Lasso. Rows: Split-sample, Loo, 10-fold, 5-fold, 2-fold, A(Sn). x-axis: Frequency (0 to 30).]


Figure 4: Bias in the resampling AUC(t) estimates.


Figure 5: Mean square error in the resampling AUC(t) estimates.
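The quantities in Figures 4 and 5 can be illustrated with a short sketch. It assumes one plausible reading, not confirmed by the slides: bias is the average over replications of the resampling estimate minus the true accuracy A(Sn), and mean square error is the average of the squared differences. The function name and the numbers are hypothetical.

```python
def bias_and_mse(estimates, truths):
    """Bias and mean square error of accuracy estimates across replications.

    estimates[r] is the resampling AUC(t) estimate in replication r and
    truths[r] is the corresponding true accuracy A(S_n).
    """
    diffs = [e - t for e, t in zip(estimates, truths)]
    bias = sum(diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    return bias, mse

# Hypothetical values for two replications.
bias, mse = bias_and_mse([0.6, 0.7], [0.5, 0.5])
```

An estimator can have small bias and still a large mean square error when it is highly variable, which is the concern the abstract raises for split-sample and leave-one-out estimates.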


BRB-ArrayTools

• Contains analysis tools that I have selected as valid and useful

• Analysis wizard and multiple help screens for biomedical scientists

• Imports data from all platforms and major databases

• Automated import of data from NCBI Gene Expression Omnibus


Predictive Classifiers in BRB-ArrayTools

• Classifiers

– Diagonal linear discriminant

– Compound covariate

– Bayesian compound covariate

– Support vector machine with inner product kernel

– K-nearest neighbor

– L1 penalized logistic regression

– Nearest centroid

– Shrunken centroid (PAM)

– Random forest

– Tree of binary classifiers for k-classes

• Survival risk-group

– Supervised pc’s

• Feature selection options

– Univariate t/F statistic

– Hierarchical variance option

– Restricted by fold effect

– Univariate classification power

– Recursive feature elimination

– Top-scoring pairs

• Validation methods

– Split-sample

– LOOCV

– Repeated k-fold CV

– .632+ bootstrap


Selected Features of BRB-ArrayTools

• Multivariate permutation tests for class comparison to control number and proportion of false discoveries with specified confidence level

– Permits blocking by another variable, pairing of data, averaging of technical replicates

• SAM

– Fortran implementation 7X faster than R versions

• Extensive annotation for identified genes

– Internal annotation of NetAffx, Source, Gene Ontology, Pathway information

– Links to annotations in genomic databases

• Find genes correlated with quantitative factor while controlling number or proportion of false discoveries

• Find genes correlated with censored survival while controlling number or proportion of false discoveries

– Kaplan Meier curves

• Analysis of variance


Thank You

• I’ve learned things from your questions. I hope that you have too.

• “When you’re green you’re growing, when you’re ripe you’re rotting.”

• Keep learning new things and expanding your boundaries.

• There are major opportunities for improving public health. Think big and be brave.