Statistical Aspects of the Development and Validation of Predictive Classifiers for High Dimensional...

Statistical Aspects of the Development and Validation of Predictive Classifiers for High Dimensional Data Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute Linus.nci.nih.gov/brb





BRB Website

http://linus.nci.nih.gov/brb

• Powerpoint presentations and audio files

• Reprints & Technical Reports

• BRB-ArrayTools software

• BRB-ArrayTools Data Archive

• Sample Size Planning for Targeted Clinical Trials


DNA Microarray Gene Expression Assay

• Extract mRNA from cells of interest
  – Each mRNA molecule was transcribed from a single gene and has a linear structure complementary to that gene
  – One mRNA molecule is translated into one protein molecule
• Reverse transcribe mRNA to cDNA, introducing a fluorescently labeled dye to each molecule
• Distribute the cDNA sample to a solid surface containing “probes” of DNA representing all “genes”
  – The probes are arranged in a grid on the surface, with each gene having a known address
• Let the molecules from the sample hybridize with the probes for the corresponding genes
• Wash off excess sample and illuminate the surface with a laser with frequency corresponding to the dye
• Measure intensity of fluorescence over each probe


Resulting Data

• Intensity over a probe is approximately proportional to abundance of mRNA molecules in the sample for the gene corresponding to the probe

• 40,000 variables measured for each specimen


Good Microarray Studies Have Clear Objectives

• Class Comparison (Gene Finding)
  – Find genes whose expression differs among predetermined classes
  – Tumor versus normal
  – After infection of cells by virus versus before infection
• Class Prediction
  – Prediction of predetermined class using gene expression profile
  – Predicting whether a patient will or will not respond to a given treatment
• Class Discovery
  – Discover clusters of specimens having similar expression profiles
  – Find genes that are in the same biochemical pathways


Class Comparison and Class Prediction

• Not clustering problems

• Supervised methods


Components of Class Prediction

• Feature (gene) selection
  – Which genes will be included in the model
• Select model type
  – E.g. diagonal linear discriminant analysis, nearest-neighbor, …
• Fitting parameters (regression coefficients) for model
  – Selecting value of tuning parameters


Simple Gene Selection

• Select genes that are differentially expressed among the classes at a significance level (e.g. 0.01)
  – The level is a tuning parameter
  – Number of false discoveries is not of direct relevance for prediction

• For prediction it is usually more serious to exclude an informative variable than to include some noise variables
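
A minimal sketch of this kind of univariate selection, assuming two classes labeled 1 and 2 and a pooled-variance two-sample t-statistic. The function names and the toy data are illustrative, not part of the talk. To stay within the standard library, the significance level is expressed as a critical value `t_crit` that the caller looks up separately (4.604 is the two-sided 0.01 critical value for 4 degrees of freedom).

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Pooled-variance two-sample t-statistic for two lists of values."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def select_genes(x, labels, t_crit):
    """Return indices of genes whose |t| exceeds t_crit.

    x is a list of samples, each a list of per-gene log-expression values;
    labels holds 1 or 2 for each sample. t_crit is the critical value for
    the chosen significance level (an input, i.e. the tuning parameter).
    """
    selected = []
    for j in range(len(x[0])):
        a = [row[j] for row, y in zip(x, labels) if y == 1]
        b = [row[j] for row, y in zip(x, labels) if y == 2]
        if abs(t_statistic(a, b)) > t_crit:
            selected.append(j)
    return selected


# Toy data: gene 0 differs between the classes, gene 1 does not.
x = [[1.0, 0.1], [1.2, -0.2], [0.9, 0.0],
     [-1.0, 0.2], [-1.1, -0.1], [-0.8, 0.1]]
labels = [1, 1, 1, 2, 2, 2]
selected = select_genes(x, labels, t_crit=4.604)
print(selected)
```

Loosening `t_crit` admits more noise genes but lowers the chance of excluding an informative one, which is the trade-off the slide describes.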


Optimal significance level cutoffs for gene selection: 50 differentially expressed genes out of 22,000 genes on the microarrays

2δ/σ    n=10     n=30      n=50
1       0.167    0.003     0.00068
1.25    0.085    0.0011    0.00035
1.5     0.045    0.00063   0.00016
1.75    0.026    0.00036   0.00006
2       0.015    0.0002    0.00002


Complex Gene Selection

• Select a set of genes which together give most accurate predictions
  – Genetic algorithms

• Little evidence that complex feature selection is useful in microarray problems


Linear Classifiers for Two Classes

\[ l(x) = \sum_{i \in F} w_i x_i \]

where

x = vector of log ratios or log signals

F = features (genes) included in model

w_i = weight for i'th feature

Decision boundary: l(x) > or < d


Linear Classifiers for Two Classes

• Fisher linear discriminant analysis

• Diagonal linear discriminant analysis (DLDA)

• Compound covariate predictor

• Golub’s weighted voting method

• Support vector machines with inner product kernel

• Perceptrons


Fisher LDA

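\[ x \sim MVN(\mu^{(i)}, \Sigma) \quad \text{in class } i = 1, 2 \]

\[ \log\left( f(x; \mu^{(1)}, \Sigma) \,/\, f(x; \mu^{(2)}, \Sigma) \right) = w'x + \text{constant}, \qquad w = \Sigma^{-1}\left(\mu^{(1)} - \mu^{(2)}\right) \]

The weights are estimated by substituting the class sample means \bar{x}^{(1)}, \bar{x}^{(2)} and the pooled sample covariance matrix.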


The Compound Covariate Predictor (CCP)

• Motivated by J. Tukey, Controlled Clinical Trials, 1993

• A compound covariate is built from the basic covariates (log-ratios):

\[ CCP_i = \sum_j t_j x_{ij} \]

  – t_j is the two-sample t-statistic for gene j.
  – x_ij is the log-expression measure of sample i for gene j.
  – The sum is over selected genes.

• Threshold of classification: midpoint of the CCP means for the two classes.
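
The compound covariate predictor can be sketched as follows: each selected gene is weighted by its two-sample t-statistic, and the classification threshold is the midpoint of the two class means of the compound covariate. This is a minimal sketch with illustrative names (`fit_ccp`, `predict_ccp`) and toy data; it assumes classes labeled 1 and 2 and a pooled-variance t-statistic computed as class 1 minus class 2.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Pooled-variance two-sample t-statistic (class 1 minus class 2)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def ccp_score(row, weights):
    """Compound covariate: sum over selected genes of t_j * x_ij."""
    return sum(t * row[j] for j, t in weights.items())


def fit_ccp(x, labels, genes):
    """Return (weights, threshold) from training data and selected genes."""
    weights = {}
    for j in genes:
        a = [row[j] for row, y in zip(x, labels) if y == 1]
        b = [row[j] for row, y in zip(x, labels) if y == 2]
        weights[j] = t_statistic(a, b)
    mean1 = mean([ccp_score(r, weights) for r, y in zip(x, labels) if y == 1])
    mean2 = mean([ccp_score(r, weights) for r, y in zip(x, labels) if y == 2])
    return weights, (mean1 + mean2) / 2


def predict_ccp(row, weights, threshold):
    # Each t_j has the sign of (class 1 mean - class 2 mean) for gene j,
    # so class 1 always has the larger mean compound covariate.
    return 1 if ccp_score(row, weights) > threshold else 2


x = [[1.0, 0.1], [1.2, -0.2], [0.9, 0.0],
     [-1.0, 0.2], [-1.1, -0.1], [-0.8, 0.1]]
labels = [1, 1, 1, 2, 2, 2]
weights, threshold = fit_ccp(x, labels, genes=[0, 1])
print(predict_ccp([1.1, 0.0], weights, threshold),
      predict_ccp([-0.9, 0.1], weights, threshold))
```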


Linear Classifiers for Two Classes

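• Compound covariate predictor

\[ w_i \propto \frac{\bar{x}_i^{(1)} - \bar{x}_i^{(2)}}{\hat{\sigma}_i} \]

instead of, for DLDA,

\[ w_i = \frac{\bar{x}_i^{(1)} - \bar{x}_i^{(2)}}{\hat{\sigma}_i^2} \]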
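
The difference between the two weightings is only the power of the standard deviation in the denominator: the compound covariate predictor divides the difference in class means by the pooled standard deviation (up to a constant factor, this is the t-statistic), while DLDA divides by the pooled variance. A small sketch for a single gene, with illustrative function names and made-up values:

```python
import math


def pooled_variance(a, b):
    """Pooled within-class variance of two lists of values."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    return ss / (len(a) + len(b) - 2)


def gene_weights(a, b):
    """Return (CCP-style weight, DLDA weight) for one gene."""
    diff = sum(a) / len(a) - sum(b) / len(b)
    s2 = pooled_variance(a, b)
    return diff / math.sqrt(s2), diff / s2


ccp_w, dlda_w = gene_weights([1.0, 1.2, 0.9], [-1.0, -1.1, -0.8])
print(ccp_w, dlda_w)
```

Relative to the CCP weighting, DLDA puts more weight on genes with small within-class variance.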


Support Vector Machine

\[ \text{minimize } \sum_i w_i^2 \]

\[ \text{subject to } y_j \left( w' x^{(j)} + b \right) \ge 1 \]

where y_j = ±1 for class 1 or 2.


Perceptrons

• Perceptrons are neural networks with no hidden layer and linear transfer functions between input and output
  – Number of input nodes equals number of genes selected
  – Number of output nodes equals number of classes minus 1
  – Inputs may be major principal components of genes or major principal components of informative genes

• Perceptrons are linear classifiers


When p>>n

• It is always possible to find a set of features and a weight vector for which the classification error on the training set is zero.

• There is generally not sufficient information in p>>n training sets to effectively use more complex methods


Myth

• Complex classification algorithms such as neural networks perform better than simpler methods for class prediction.


• Comparative studies have indicated that simpler methods usually work as well or better for microarray problems because they avoid overfitting the data.


Other Simple Methods

• Nearest neighbor classification

• Nearest k-neighbors

• Nearest centroid classification

• Shrunken centroid classification


Nearest Neighbor Classifier

• To classify a sample in the validation set as being in outcome class 1 or outcome class 2, determine which sample in the training set has the most similar gene expression profile.
  – The similarity measure is based on genes selected as being univariately differentially expressed between the classes
  – Correlation similarity or Euclidean distance is generally used

• Classify the sample as being in the same class as its nearest neighbor in the training set
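
A minimal sketch of the nearest neighbor rule, assuming the genes have already been selected on the training set. The function names and toy data are illustrative. Both distance choices mentioned on the slide are shown; the correlation version uses one minus the Pearson correlation and assumes each profile has non-zero variance over the selected genes.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation_distance(u, v):
    """One minus the Pearson correlation (undefined for constant profiles)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return 1 - num / den


def nearest_neighbor_predict(sample, train_x, train_labels, genes,
                             distance=euclidean):
    """Return the class of the training sample closest to `sample`,
    comparing profiles only on the selected genes."""
    s = [sample[j] for j in genes]
    best = min(range(len(train_x)),
               key=lambda i: distance(s, [train_x[i][j] for j in genes]))
    return train_labels[best]


train_x = [[2.0, 1.0, 0.5], [1.8, 1.2, 0.4], [0.2, 1.9, 0.6], [0.1, 2.2, 0.3]]
train_labels = [1, 1, 2, 2]
genes = [0, 1, 2]
print(nearest_neighbor_predict([1.9, 0.9, 0.1], train_x, train_labels, genes))
print(nearest_neighbor_predict([0.3, 2.0, 0.5], train_x, train_labels, genes,
                               distance=correlation_distance))
```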


Evaluating a Classifier

• Most statistical methods were not developed for p>>n prediction problems

• Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data

• Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy

• Testing whether analysis of independent data results in selection of the same set of genes is not an appropriate test of predictive accuracy of a classifier


Internal Validation of a Classifier

• Re-substitution estimate
  – Develop classifier on dataset, test predictions on same data
  – Very biased for p>>n

• Split-sample validation

• Cross-validation


Split-Sample Evaluation

• Training set
  – Used to select features, select model type, and determine parameters and cut-off thresholds

• Test set
  – Withheld until a single model is fully specified using the training set
  – Fully specified model is applied to the expression profiles in the test set to predict class labels
  – Number of errors is counted
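
The split-sample procedure can be sketched as below. Everything that shapes the model happens inside a `develop` callback that sees only the training set, and the test set is touched once, to count errors. The nearest-centroid `develop_nearest_centroid` function is a deliberately simple stand-in for a real development pipeline, and all names and data here are illustrative.

```python
import random


def develop_nearest_centroid(train_x, train_labels):
    """Stand-in development pipeline: returns a fully specified classifier
    (a predict function) built from the training set only."""
    centroids = {}
    for c in set(train_labels):
        rows = [r for r, y in zip(train_x, train_labels) if y == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(row):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(row, centroids[c])))
    return predict


def split_sample_errors(x, labels, develop, n_train, seed=0):
    """Randomly split, develop on the training part, count test errors."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    train, test = idx[:n_train], idx[n_train:]
    predict = develop([x[i] for i in train], [labels[i] for i in train])
    errors = sum(predict(x[i]) != labels[i] for i in test)
    return errors, len(test)


x = [[2.0, 2.1], [1.9, 2.2], [2.2, 1.8], [2.1, 2.0], [1.8, 1.9], [2.0, 2.3],
     [-2.0, -2.1], [-1.9, -2.2], [-2.2, -1.8], [-2.1, -2.0], [-1.8, -1.9],
     [-2.0, -2.3]]
labels = [1] * 6 + [2] * 6
errors, n_test = split_sample_errors(x, labels, develop_nearest_centroid,
                                     n_train=8)
print(errors, "errors in", n_test, "test samples")
```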


Leave-one-out Cross Validation

• Omit sample 1
  – Develop multivariate classifier from scratch on training set with sample 1 omitted
  – Predict class for sample 1 and record whether prediction is correct


Leave-one-out Cross Validation

• Repeat analysis for training sets with each single sample omitted one at a time

• e = number of misclassifications determined by cross-validation

• Subdivide e for estimation of sensitivity and specificity
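
A minimal sketch of this loop, with the errors subdivided by true class to give sensitivity and specificity. Class 1 is treated as the "positive" class, which is an assumption made for this example. The `develop_top_gene_centroid` callback is an illustrative stand-in for a real pipeline; the important point is that it repeats gene selection on every leave-one-out training set.

```python
def develop_top_gene_centroid(train_x, train_labels, k=1):
    """Stand-in pipeline: select the k genes with the largest absolute
    difference in class means, then classify by nearest centroid."""
    rows1 = [r for r, y in zip(train_x, train_labels) if y == 1]
    rows2 = [r for r, y in zip(train_x, train_labels) if y == 2]
    mean1 = [sum(c) / len(rows1) for c in zip(*rows1)]
    mean2 = [sum(c) / len(rows2) for c in zip(*rows2)]
    genes = sorted(range(len(mean1)),
                   key=lambda j: -abs(mean1[j] - mean2[j]))[:k]

    def predict(row):
        d1 = sum((row[j] - mean1[j]) ** 2 for j in genes)
        d2 = sum((row[j] - mean2[j]) ** 2 for j in genes)
        return 1 if d1 <= d2 else 2
    return predict


def loocv(x, labels, develop):
    """Return (e, sensitivity, specificity); class 1 is 'positive'."""
    tp = fn = tn = fp = 0
    for i in range(len(x)):
        # Develop from scratch without sample i, then predict sample i.
        train_x = x[:i] + x[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        predicted = develop(train_x, train_labels)(x[i])
        if labels[i] == 1:
            tp += predicted == 1
            fn += predicted != 1
        else:
            tn += predicted == 2
            fp += predicted != 2
    return fn + fp, tp / (tp + fn), tn / (tn + fp)


# Toy data: gene 0 is informative; the fourth class-1 sample is an outlier.
x = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.1], [1.8, 0.0, -0.1], [-1.5, 0.1, 0.0],
     [-2.0, 0.0, 0.1], [-2.1, 0.1, 0.0], [-1.9, -0.1, 0.1], [-1.7, 0.0, -0.1]]
labels = [1, 1, 1, 1, 2, 2, 2, 2]
e, sensitivity, specificity = loocv(x, labels, develop_top_gene_centroid)
print(e, sensitivity, specificity)  # 1 0.75 1.0
```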


• With proper cross-validation, the model must be developed from scratch for each leave-one-out training set. This means that feature selection must be repeated for each leave-one-out training set.

– Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the analysis of DNA microarray data. Journal of the National Cancer Institute 95:14-18, 2003.

• The cross-validated estimate of misclassification error is an estimate of the prediction error for the model obtained by applying the specified algorithm to the full dataset


Prediction on Simulated Null Data

Generation of Gene Expression Profiles

• 14 specimens (P_i is the expression profile for specimen i)

• Log-ratio measurements on 6000 genes

• P_i ~ MVN(0, I_6000)

• Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)?

Prediction Method

• Compound covariate prediction

• Compound covariate built from the log-ratios of the 10 most differentially expressed genes.
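
The experiment described above can be sketched in a few dozen lines. This is an illustrative re-creation under stated assumptions (one simulated data set, a fixed random seed, pooled-variance t-statistics, top 10 genes by |t|), not the code used for the talk. It reports misclassification counts for three approaches: resubstitution, LOOCV with gene selection done once on the full data, and LOOCV with gene selection repeated inside every fold.

```python
import math
import random


def t_stats(x, labels):
    """Pooled two-sample t-statistic for every gene (class 1 minus class 2)."""
    g1 = [r for r, y in zip(x, labels) if y == 1]
    g2 = [r for r, y in zip(x, labels) if y == 2]
    n1, n2 = len(g1), len(g2)
    out = []
    for a, b in zip(zip(*g1), zip(*g2)):
        m1, m2 = sum(a) / n1, sum(b) / n2
        ss = sum((v - m1) ** 2 for v in a) + sum((v - m2) ** 2 for v in b)
        out.append((m1 - m2) / math.sqrt(ss / (n1 + n2 - 2) * (1 / n1 + 1 / n2)))
    return out


def top_genes(t, k):
    """Indices of the k genes with the largest |t|."""
    return sorted(range(len(t)), key=lambda j: -abs(t[j]))[:k]


def fit_ccp(x, labels, genes, t):
    """Compound covariate predictor with a midpoint threshold."""
    def score(row):
        return sum(t[j] * row[j] for j in genes)
    s1 = [score(r) for r, y in zip(x, labels) if y == 1]
    s2 = [score(r) for r, y in zip(x, labels) if y == 2]
    cut = (sum(s1) / len(s1) + sum(s2) / len(s2)) / 2
    return lambda row: 1 if score(row) > cut else 2


def loocv_errors(x, labels, k, fixed_genes=None):
    """LOOCV error count. With fixed_genes, selection is NOT repeated."""
    errors = 0
    for i in range(len(x)):
        tx, ty = x[:i] + x[i + 1:], labels[:i] + labels[i + 1:]
        t = t_stats(tx, ty)
        genes = fixed_genes if fixed_genes is not None else top_genes(t, k)
        errors += fit_ccp(tx, ty, genes, t)(x[i]) != labels[i]
    return errors


rng = random.Random(1)
n, p, k = 14, 6000, 10
x = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]  # pure noise
labels = [1] * 7 + [2] * 7

t_all = t_stats(x, labels)
genes_all = top_genes(t_all, k)
predict_all = fit_ccp(x, labels, genes_all, t_all)
resub = sum(predict_all(r) != y for r, y in zip(x, labels))
partial = loocv_errors(x, labels, k, fixed_genes=genes_all)
full = loocv_errors(x, labels, k)
print("resubstitution:", resub)
print("CV after gene selection:", partial)
print("CV including gene selection:", full)
```

Because the data contain no class signal, an honest estimate should hover around half of the 14 specimens misclassified. Repeating the run over many seeds would give the kind of distribution the talk reports.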


[Figure: proportion of simulated data sets (vertical axis, 0.00 to 1.00) by number of misclassifications (horizontal axis, 0 to 20), for three approaches: no cross-validation (resubstitution method), cross-validation after gene selection, and cross-validation prior to gene selection.]


Major Flaws Found in 40 Studies Published in 2004

• Inadequate control of multiple comparisons in gene finding
  – 9/23 studies had unclear or inadequate methods to deal with false positives
  – 10,000 genes x .05 significance level = 500 false positives

• Misleading report of prediction accuracy
  – 12/28 reports based on incomplete cross-validation

• Misleading use of cluster analysis
  – 13/28 studies invalidly claimed that expression clusters based on differentially expressed genes could help distinguish clinical outcomes

• 50% of studies contained one or more major flaws


Myth

• Split sample validation is superior to LOOCV or 10-fold CV for estimating prediction error


Comparison of Internal Validation Methods

Molinaro, Pfeiffer & Simon

• For small sample sizes, LOOCV is much more accurate than split-sample validation
  – Split-sample validation over-estimates prediction error

• For small sample sizes, LOOCV is preferable to 10-fold, 5-fold cross-validation or repeated k-fold versions

• For moderate sample sizes, 10-fold is preferable to LOOCV

• Some claims for bootstrap resampling for estimating prediction error are not valid for p>>n problems


Simulated Data: 40 cases, 10 genes selected from 5000

Method             Estimate   Std Deviation
True               .078
Resubstitution     .007       .016
LOOCV              .092       .115
10-fold CV         .118       .120
5-fold CV          .161       .127
Split sample 1-1   .345       .185
Split sample 2-1   .205       .184
.632+ bootstrap    .274       .084


Simulated Data: 40 cases

Method               Estimate   Std Deviation
True                 .078
10-fold              .118       .120
Repeated 10-fold     .116       .109
5-fold               .161       .127
Repeated 5-fold      .159       .114
Split 1-1            .345       .185
Repeated split 1-1   .371       .065


DLBCL Data

Method            Bias    Std Deviation   MSE
LOOCV             -.019   .072            .008
10-fold CV        -.007   .063            .006
5-fold CV         .004    .07             .007
Split 1-1         .037    .117            .018
Split 2-1         .001    .119            .017
.632+ bootstrap   -.006   .049            .004


Permutation Distribution of Cross-validated Misclassification Rate of a Multivariate Classifier

• Randomly permute class labels and repeat the entire cross-validation

• Re-do for all (or 1000) random permutations of class labels

• Permutation p value is the fraction of random permutations that gave as few misclassifications as e in the real data
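
The permutation procedure can be sketched as follows. `permutation_p_value` is an illustrative name, and `loocv_errors` is a simple nearest-centroid stand-in for "the entire cross-validation"; in a real analysis it would include gene selection inside every fold. The sketch samples random permutations rather than enumerating all of them.

```python
import random


def loocv_errors(x, labels):
    """Stand-in for the full cross-validation: LOOCV error count of a
    nearest-centroid classifier rebuilt for every left-out sample."""
    errors = 0
    for i in range(len(x)):
        train = [(r, y) for j, (r, y) in enumerate(zip(x, labels)) if j != i]
        centroids = {}
        for c in {y for _, y in train}:
            rows = [r for r, y in train if y == c]
            centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
        predicted = min(centroids,
                        key=lambda c: sum((a - b) ** 2
                                          for a, b in zip(x[i], centroids[c])))
        errors += predicted != labels[i]
    return errors


def permutation_p_value(x, labels, cv_errors, n_perm=1000, seed=0):
    """Return (e, p): e is the cross-validated error count on the real
    labels, p the fraction of permutations with e or fewer errors."""
    rng = random.Random(seed)
    e = cv_errors(x, labels)
    count = 0
    for _ in range(n_perm):
        permuted = labels[:]
        rng.shuffle(permuted)
        if cv_errors(x, permuted) <= e:
            count += 1
    return e, count / n_perm


x = [[2.0, 2.1], [1.9, 2.2], [2.2, 1.8], [2.1, 2.0], [1.8, 1.9], [2.0, 2.3],
     [-2.0, -2.1], [-1.9, -2.2], [-2.2, -1.8], [-2.1, -2.0], [-1.8, -1.9],
     [-2.0, -2.3]]
labels = [1] * 6 + [2] * 6
e, p = permutation_p_value(x, labels, loocv_errors, n_perm=200)
print(e, p)
```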


Gene-Expression Profiles in Hereditary Breast Cancer

• Breast tumors studied:
  – 7 BRCA1+ tumors
  – 8 BRCA2+ tumors
  – 7 sporadic tumors

• Log-ratio measurements of 3226 genes for each tumor after initial data filtering

cDNA Microarrays
Parallel Gene Expression Analysis

RESEARCH QUESTION
Can we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ from BRCA2– cancers based solely on their gene expression profiles?


BRCA1

Significance level for gene selection   # of significant genes   # of misclassified samples (m)   % of random permutations with m or fewer misclassifications
10^-2                                   182                      3                                0.4
10^-3                                   53                       2                                1.0
10^-4                                   9                        1                                0.2


BRCA2

Significance level for gene selection   # of significant genes   m = # of misclassified samples (sample IDs)   % of random permutations with m or fewer misclassifications
10^-2                                   212                      4 (s11900, s14486, s14572, s14324)            0.8
10^-3                                   49                       3 (s11900, s14486, s14324)                    2.2
10^-4                                   11                       4 (s11900, s14486, s14616, s14324)            6.6


Classification of BRCA2 Germline Mutations

Classification Method                    LOOCV Prediction Error
Compound Covariate Predictor             14%
Fisher LDA                               36%
Diagonal LDA                             14%
1-Nearest Neighbor                       9%
3-Nearest Neighbor                       23%
Support Vector Machine (linear kernel)   18%
Classification Tree                      45%


Does an Expression Profile Classifier Predict More Accurately Than Standard

Prognostic Variables?

• Some publications fit a logistic model to standard covariates and the cross-validated predictions of expression profile classifiers

\[ \operatorname{logit}(p_i) = \alpha + \beta\, y(x \mid -i) + \gamma z_i \]

• This is valid only with split-sample analysis because the cross-validated predictions are not independent


Does an Expression Profile Classifier Predict More Accurately Than Standard

Prognostic Variables?

• Not an issue of which variables are significant after adjusting for which others or which are independent predictors
  – Predictive accuracy and inference are different

• The predictiveness of the expression profile classifier can be evaluated within levels of the classifier based on standard prognostic variables


Survival Risk Group Prediction

• LOOCV loop:
  – Create training set by omitting i’th case
  – Develop proportional hazards (PH) model for training set
  – Compute predictive index for i’th case using PH model developed for training set
  – Compute percentile of predictive index for i’th case among predictive indices for cases in the training set
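
The bookkeeping of this loop can be sketched without committing to a particular survival library. In the sketch below, `fit_model` is a hypothetical callback that takes the training cases and returns a function mapping an expression profile to a predictive index; in a real analysis it would fit the PH model. The placeholder used in the demo simply returns the first gene's value and is NOT a PH fit. Cases are assumed to be `(profile, time, event)` tuples.

```python
def risk_percentile(index_i, training_indices):
    """Percent of training-set predictive indices that are <= index_i."""
    below = sum(v <= index_i for v in training_indices)
    return 100.0 * below / len(training_indices)


def loocv_risk_groups(cases, fit_model, cutoff=50.0):
    """Assign each case to 'high' or 'low' risk from its cross-validated
    percentile. cases is a list of (profile, time, event) tuples."""
    groups = []
    for i, (profile, _, _) in enumerate(cases):
        train = cases[:i] + cases[i + 1:]      # omit the i'th case
        index = fit_model(train)               # develop model on training set
        train_indices = [index(p) for p, _, _ in train]
        pct = risk_percentile(index(profile), train_indices)
        groups.append("high" if pct > cutoff else "low")
    return groups


def placeholder_model(train_cases):
    """Placeholder only: ignores the survival data entirely."""
    return lambda profile: profile[0]


cases = [([0.1], 5, 1), ([0.9], 1, 1), ([0.2], 6, 0), ([1.0], 2, 1)]
groups = loocv_risk_groups(cases, placeholder_model)
print(groups)  # ['low', 'high', 'low', 'high']
```

The resulting group labels are what the cross-validated Kaplan-Meier curves are drawn from.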


Survival Risk Group Prediction

• Plot Kaplan Meier survival curves for cases with cross-validated risk percentiles above 50% and for cases with cross-validated risk percentiles below 50%
  – Or for however many risk groups and thresholds are desired

• Compute log-rank statistic comparing the cross-validated Kaplan Meier curves


Survival Risk Group Prediction

• Evaluate individual genes by fitting single variable proportional hazards regression models to log expression for each gene

• Select genes based on p-value threshold for single gene PH regressions

• Compute first k principal components of the selected genes

• Fit PH regression model with the k pc’s as predictors. Let b1 , …, bk denote the estimated regression coefficients

• To predict for case with expression profile vector x, compute the k supervised pc’s y1 , …, yk and the predictive index = b1 y1 + … + bk yk


Survival Risk Group Prediction

• Repeat the entire procedure for permutations of survival times and censoring indicators to generate the null distribution of the log-rank statistic
  – The usual chi-square null distribution is not valid because the cross-validated risk percentiles are correlated among cases

• Evaluate statistical significance of the association of survival and expression profiles by referring the log-rank statistic for the unpermuted data to the permutation null distribution


Sample Size Planning References

• K Dobbin, R Simon. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 6:27-38, 2005

• K Dobbin, R Simon. Sample size planning for developing classifiers using high dimensional DNA microarray data. Biostatistics 2007


Sample Size Planning for Classifier Development

• The expected value (over training sets) of the probability of correct classification, PCC(n), should be within a specified tolerance of the maximum achievable PCC(∞)


Probability Model

• Two classes
• Log expression or log ratio MVN in each class with common covariance matrix
• m differentially expressed genes
• p-m noise genes
• Expression of differentially expressed genes is independent of expression for noise genes
• All differentially expressed genes have same inter-class mean difference 2δ
• Common variance for differentially expressed genes and for noise genes


Classifier

• Feature selection based on univariate t-tests for differential expression at significance level α

• Simple linear classifier with equal weights (except for sign) for all selected genes. Power for selecting each of the informative genes that are differentially expressed by mean difference 2δ is 1-β(n)


• For 2 classes of equal prevalence, let λ_1 denote the largest eigenvalue of the covariance matrix of informative genes. Then

\[ PCC(\infty) \ge \Phi\left( \delta \sqrt{m / \lambda_1} \right) \]


[Equation: expression for PCC(n) in terms of m and p-m; not recoverable from this transcript.]


Optimal significance level cutoffs for gene selection: 50 differentially expressed genes out of 22,000 genes on the microarrays

2δ/σ    n=10     n=30      n=50
1       0.167    0.003     0.00068
1.25    0.085    0.0011    0.00035
1.5     0.045    0.00063   0.00016
1.75    0.026    0.00036   0.00006
2       0.015    0.0002    0.00002


BRB-ArrayTools

• Contains analysis tools that I have selected as valid and useful

• Analysis wizard and multiple help screens for biomedical scientists

• Imports data from all platforms and major databases

• Automated import of data from NCBI Gene Expression Omnibus


Predictive Classifiers in BRB-ArrayTools

• Classifiers
  – Diagonal linear discriminant
  – Compound covariate
  – Bayesian compound covariate
  – Support vector machine with inner product kernel
  – K-nearest neighbor
  – Nearest centroid
  – Shrunken centroid (PAM)
  – Random forest
  – Tree of binary classifiers for k classes

• Survival risk-group
  – Supervised pc’s

• Feature selection options
  – Univariate t/F statistic
  – Hierarchical variance option
  – Restricted by fold effect
  – Univariate classification power
  – Recursive feature elimination
  – Top-scoring pairs

• Validation methods
  – Split-sample
  – LOOCV
  – Repeated k-fold CV
  – .632+ bootstrap


Selected Features of BRB-ArrayTools

• Multivariate permutation tests for class comparison to control number and proportion of false discoveries with specified confidence level
  – Permits blocking by another variable, pairing of data, averaging of technical replicates
• SAM
  – Fortran implementation 7X faster than R versions
• Extensive annotation for identified genes
  – Internal annotation of NetAffx, Source, Gene Ontology, Pathway information
  – Links to annotations in genomic databases
• Find genes correlated with quantitative factor while controlling number or proportion of false discoveries
• Find genes correlated with censored survival while controlling number or proportion of false discoveries
• Analysis of variance


Selected Features of BRB-ArrayTools

• Gene set enrichment analysis
  – Gene Ontology groups, signaling pathways, transcription factor targets, micro-RNA putative targets
  – Automatic data download from Broad Institute
  – KS & LS test statistics for null hypothesis that gene set is not enriched
  – Hotelling’s and Goeman’s Global test of null hypothesis that no genes in set are differentially expressed
  – Goeman’s Global test for survival data

• Class prediction
  – Multiple classifiers
  – Complete LOOCV, k-fold CV, repeated k-fold, .632 bootstrap
  – Permutation significance of cross-validated error rate


Selected Features of BRB-ArrayTools

• Survival risk-group prediction
  – Supervised principal components with and without clinical covariates
  – Cross-validated Kaplan Meier curves
  – Permutation test of cross-validated KM curves

• Clustering tools for class discovery with reproducibility statistics on clusters
  – Internal access to Eisen’s Cluster and Treeview

• Visualization tools including rotating 3D principal components plot exportable to PowerPoint with rotation controls

• Extensible via R plug-in feature

• Tutorials and datasets


BRB-ArrayTools

• Extensive built-in gene annotation and linkage to gene annotation websites

• Publicly available for non-commercial use
  – http://linus.nci.nih.gov/brb


BRB-ArrayTools
May 2007

• 7188 Registered users
• 1962 Distinct institutions
• 68 Countries
• 365 Citations

• Registered users
  – 3951 in US
    • 565 at NIH
      – 275 at NCI
    • 2014 US EDU
    • 754 US Govt (non NIH)
  – 3237 Foreign


Countries With Most BRB ArrayTools Registered Users

• France 270
• Canada 269
• UK 244
• Germany 239
• Italy 216
• Taiwan 196
• Netherlands 177
• Korea 168
• Japan 153
• China 150
• Spain 146
• Australia 130
• India 107
• Belgium 83
• New Zealand 61
• Sweden 50
• Singapore 46
• Brazil 48
• Israel 41
• Denmark 40


Acknowledgements

• Kevin Dobbin
• Alain Dupuy
• Wenyu Jiang
• Annette Molinaro
• Ruth Pfeiffer
• Michael Radmacher
• Joanna Shih
• Yingdong Zhao
• BRB-ArrayTools Development Team