RandomForest as a Variable Selection Tool for Biomarker Data Katja Remlinger GlaxoSmithKline, RTP, NA ICSA Applied Statistics Symposium June 6th, 2007


Page 1

RandomForest as a Variable Selection Tool for Biomarker Data

Katja Remlinger

GlaxoSmithKline, RTP, NA

ICSA Applied Statistics Symposium

June 6th, 2007

Page 2

Outline

Introduction

Example: Liver Fibrosis

RandomForest Algorithm

Challenges with Variable Selection

External Cross Validation

Summary and Discussion

Page 3

Introduction

With high dimensional data we want to reduce the number of variables

– Remove “noise” variables

– Ease model interpretation

– Reduce cost by measuring a subset of variables

Biomarker data are typically high dimensional => excellent candidates for variable reduction

Page 4

What is a Biomarker?

Characteristic that is objectively measured and evaluated as an indicator of

– normal biological processes

– pathogenic processes

– pharmacologic responses to a therapeutic intervention

Types of biomarkers

– Genes

– Proteins

– Lipids

– Metabolites, …

Page 5

Example – Liver Fibrosis

8th leading cause of death in the US

Scar formation that occurs as the liver tries to repair damaged tissue

Current approach: Liver biopsy to determine fibrosis stage

Goal:

– Identify small panel of biomarkers that can predict fibrosis stage of patient (mild or severe)

=> Prediction Problem with Variable Selection

Page 6

Example – Liver Fibrosis

384 Hepatitis C infected patients of various fibrosis stages

– 61% Mild

– 39% Severe

Collected 46 serum biomarkers

Select 5-10 biomarkers

Page 7

Prediction & Variable Selection Tools

Stepwise Regression

PLS, PLS-DA

LARS/LASSO

Elastic Net

RandomForest

Page 8

A Single Tree

Node purity is measured by Gini Index

Candidate Node: 10 Mild, 10 Severe, Gini Index = 0.5

Split on Biomarker 4 (>= 14.45 vs. < 14.45)

– Daughter Node: 5 Mild, 0 Severe, Gini Index = 0 (predicted Mild)

– Daughter Node: 5 Mild, 10 Severe, Gini Index = 0.44

Split on Biomarker 32 (= 1 vs. = 0)

– Daughter Node: 1 Mild, 9 Severe, Gini Index = 0.18 (predicted Severe)

– Daughter Node: 4 Mild, 1 Severe, Gini Index = 0.32 (predicted Mild)

New Sample: Biomarker 4 = 28.65, Biomarker 32 = 0
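The Gini Index values on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the usual two-class definition (one minus the sum of squared class proportions); the function name `gini_index` is illustrative and does not come from the talk.

```python
def gini_index(n_mild, n_severe):
    """Gini index of a two-class node: 1 - sum of squared class proportions."""
    total = n_mild + n_severe
    if total == 0:
        raise ValueError("node must contain at least one sample")
    p_mild = n_mild / total
    p_severe = n_severe / total
    return 1.0 - (p_mild ** 2 + p_severe ** 2)


# (Mild, Severe) counts of the five nodes shown on the slide
for counts in [(10, 10), (5, 0), (5, 10), (1, 9), (4, 1)]:
    print(counts, round(gini_index(*counts), 2))
```

A pure node (5 Mild, 0 Severe) scores 0 and an evenly mixed node (10 Mild, 10 Severe) scores 0.5, the maximum for two classes.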

Page 9

RandomForest

Data: S1, S2, S3, S4, S5, S6, S7, S8, S9, S10

Draw Bootstrap Samples

– S1, S2, S2, S3, S5, S6, S7, S8, S9, S9

– S2, S3, S4, S4, S4, S5, S7, S8, S8, S10

– S1, S1, S2, S3, S3, S4, S7, S8, S9, S10

– S1, S6, S6, S6, S8, S8, S9, S9, S9, S9

– …

Grow Trees: Tree 1, Tree 2, Tree 3, …, Tree 5000
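Drawing a bootstrap sample amounts to sampling with replacement until the sample is as large as the data. The sketch below uses only the standard library; the helper names and the fixed seed are assumptions made for illustration, not part of the talk.

```python
import random


def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement."""
    return [rng.choice(samples) for _ in samples]


def out_of_bag(samples, drawn):
    """Samples that never made it into the bootstrap sample."""
    drawn_set = set(drawn)
    return [s for s in samples if s not in drawn_set]


data = [f"S{i}" for i in range(1, 11)]
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
boot = bootstrap_sample(data, rng)
print(sorted(boot, key=data.index), out_of_bag(data, boot))

# First bootstrap sample shown on the slide
slide_boot = ["S1", "S2", "S2", "S3", "S5", "S6", "S7", "S8", "S9", "S9"]
print(out_of_bag(data, slide_boot))
```

For the first sample on the slide, S4 and S10 are never drawn, so they are left over for that tree.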

Page 10

Variable Importance

Data: S1, S2, S3, S4, S5, S6, S7, S8, S9, S10

Draw Bootstrap Samples

– S1, S2, S2, S3, S5, S6, S7, S8, S9, S9

– S2, S3, S4, S4, S4, S5, S7, S8, S8, S10

– S1, S1, S2, S3, S3, S4, S7, S8, S9, S10

– S1, S6, S6, S6, S8, S8, S9, S9, S9, S9

– …

Samples not drawn into each bootstrap sample

– Tree 1: S4, S10

– Tree 2: S1, S6, S9

– Tree 3: S5, S6

– Tree 5000: S2, S3, S4, S5, S7, S10

Drop Down Trees: Prediction Accuracy (P, one per tree)

Drop Down Trees: Permuted Prediction Accuracy
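The comparison of prediction accuracy with permuted prediction accuracy can be sketched as follows. This is a hedged illustration: `toy_tree` is a hypothetical stand-in for one fitted tree, the rows are made-up held-out samples, and the importance is taken as the plain accuracy drop for a single tree, which may differ in detail from the measure used in the talk.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label equals the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(predict, rows, labels, column, rng):
    """Accuracy drop after shuffling one column across the held-out rows."""
    baseline = accuracy(predict, rows, labels)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted_rows = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(rows, shuffled)
    ]
    return baseline - accuracy(predict, permuted_rows, labels)


# Hypothetical stand-in for one fitted tree: it only looks at column 0.
def toy_tree(row):
    return "Severe" if row[0] >= 14.45 else "Mild"


# Made-up held-out samples: [biomarker value, binary biomarker]
rows = [[28.65, 0], [3.1, 1], [20.0, 1], [9.9, 0], [16.2, 0], [1.5, 1]]
labels = ["Severe", "Mild", "Severe", "Mild", "Severe", "Mild"]
rng = random.Random(1)
for col in (0, 1):
    print(col, permutation_importance(toy_tree, rows, labels, col, rng))
```

Shuffling a column the tree never uses leaves accuracy unchanged, so its importance is zero; shuffling a column the tree relies on can only keep or lower accuracy in this toy setup.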

Page 11

Making Prediction with RandomForest

New Sample: M1, M2, …, Mp

Drop down Tree 1, Tree 2, Tree 3, …, Tree 5000

Tree predictions: Mild, Mild, Mild, Severe, …

Results from all Trees: Mild 70%, Severe 30%

Majority Vote: Mild
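The majority vote can be written in a few lines. This is a minimal sketch, with the ten votes chosen to mirror the 70% / 30% split on the slide; `majority_vote` is an illustrative name.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class and the share of trees voting for each class."""
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    shares = {label: n / total for label, n in counts.items()}
    winner = counts.most_common(1)[0][0]
    return winner, shares


votes = ["Mild"] * 7 + ["Severe"] * 3
winner, shares = majority_vote(votes)
print(winner, shares)
```

The vote shares can also be read as a score between 0 and 1, which is what a cut-off is applied to when drawing ROC curves.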

Page 12

Challenges with Variable Selection

How many variables are important?

Which variables are important?

How do we validate the model?

– Correct way of validating model? => External cross validation

– Is prediction accuracy significant? => Permutation test

Page 13

A Common Variable Selection Approach is…

Use all data to select variables

Obtain prediction accuracy on reduced data

Introduces selection bias

Used in many publications

(Y, X) => (Y, X*), where X* is the reduced data

Page 14

A Better Variable Selection Approach is …

Separate training and test set

External cross validation (ECV)

Avoid selection bias

Page 15

External Cross Validation

1. Partition data for 5-fold Cross-Validation.

Data: Y (n x 1), X (n x p)

Each part in turn is the Test Set; the remaining parts form the Training Set

Svetnik et al. 2004
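Step 1 can be sketched with the standard library alone. The function below is an assumption-laden illustration: it shuffles the sample indices once and deals them into five folds, without stratifying by fibrosis stage (the slide does not say whether the partition was stratified).

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 and yield (train, test) index lists for each of k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# n = 384 matches the number of patients in the liver fibrosis example
splits = list(k_fold_indices(n=384, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Every sample lands in exactly one test set, so each prediction recorded later comes from a forest that never saw that sample.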

Page 16

2. Build RandomForest for each training set.

3. Use importance measure to rank variables.

4. Record test set predictions.

Training Set (Y, X) => RF => Variable Importance Ranking and Test Set Prediction

Variable Importance Ranking (illustration): 1. Marker_6, 2. Marker_509, 3. Marker_906, …, Marker_49

Page 17

5. Remove fraction of least important variables and rebuild RandomForest.

6. Record test set predictions.

7. Do not re-rank variables. Repeat 5-7 until small # of variables is left.

Training Set (Y, X with remaining variables) => RF => Test Set Prediction

Variable Importance Ranking (kept from step 3): remove the variables at the bottom of the ranking, then repeat with remaining variables
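Steps 5-7 produce a sequence of nested variable subsets from one fixed ranking. The sketch below shows only that bookkeeping; the removal fraction of 0.25 and the marker names are assumptions, since the slides do not state the fraction that was used.

```python
import math


def elimination_schedule(ranking, fraction=0.25, min_vars=1):
    """Nested variable subsets from one fixed ranking (most important first).

    At each step the least important `fraction` of the remaining variables
    is dropped (at least one), and the ranking is never recomputed.
    """
    subsets = [list(ranking)]
    current = list(ranking)
    while len(current) > min_vars:
        n_drop = max(1, math.floor(len(current) * fraction))
        n_keep = max(min_vars, len(current) - n_drop)
        current = current[:n_keep]
        subsets.append(list(current))
    return subsets


ranking = [f"Marker_{i}" for i in range(1, 47)]  # hypothetical ranking of 46 markers
subsets = elimination_schedule(ranking)
sizes = [len(s) for s in subsets]
print(sizes)
```

In the full procedure a forest would be rebuilt on the training set for each subset and its test set predictions recorded; because the ranking comes from the training set only, the test set never influences which variables are kept.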

Page 18

8. Compute optimization criterion at each step of variable removal.

9. Replicate to “smooth” out variability.

10. Select p’ = number of variables in the model, based on optimization criterion.

[Figure: Optimization Criterion vs. No. of Variables]

[Figure: Area under ROC Curve (y-axis, 0.64 to 0.72) vs. No. of Descriptors (x-axis: 1, 2, 3, 5, 6, 8, 11, 15, 19, 26, 34, 46), mtry = sqrt(p)]

3 variables are very important

No additional gains by including more variables

Page 19

11. Pick p’ most important variables.

12. Repeat 1-11 with permuted Y.
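Step 12 is the permutation test: rerun the whole procedure on shuffled outcomes and ask how often chance alone does as well as the real data. The sketch below assumes a higher score is better (for example AUC) and uses the common (1 + count) / (1 + n) p-value estimate; the null scores are hypothetical placeholders and the p-value formula is an assumption, not something stated in the talk.

```python
import random


def permuted_labels(labels, rng):
    """Return a shuffled copy of the outcome vector Y."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def permutation_p_value(observed, permuted_scores):
    """Share of permuted-label scores at least as good as the observed one.

    Uses the (1 + count) / (1 + n) form so the estimate is never exactly 0.
    """
    n_as_good = sum(score >= observed for score in permuted_scores)
    return (1 + n_as_good) / (1 + len(permuted_scores))


y = ["Mild", "Severe", "Mild", "Mild", "Severe"]
print(permuted_labels(y, random.Random(0)))

# Hypothetical scores: in practice each one would come from rerunning
# steps 1-11 on one permuted copy of Y.
null_scores = [0.48, 0.52, 0.55, 0.50, 0.47, 0.53, 0.51, 0.49, 0.54]
print(permutation_p_value(0.73, null_scores))
```

With only nine permutations the smallest attainable p-value is 0.1, so many more permutations are needed before a small p-value can be reported.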

Page 20

We Discussed How to…

Use RandomForest to do variable selection

Use external cross validation to select variables in proper way and to validate model

Return to example – Identify small set of biomarkers that can predict mild or severe fibrosis stage

Page 21

Liver Fibrosis – Comparison to Commercial Tests

[Figure: ROC Curves; x-axis: False Positive Rate (1 - Specificity), 0.0 to 1.0; y-axis: True Positive Rate (Sensitivity), 0.0 to 1.0]

AUC

– RF w. 11 markers: 0.73

– RF w. 3 markers: 0.74

– FibroTest: 0.70

– ActiTest: 0.65

TPR = Sensitivity = P(Pred. severe | Actual severe)

FPR = 1 - Specificity = P(Pred. severe | Actual mild)
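The two rates and the AUC can be computed directly from scores and true classes. This is a sketch under stated assumptions: a score at or above the cut-off is taken to predict "severe", the AUC is computed as the probability that a severe case outscores a mild one, and the six scores are made up for illustration.

```python
def tpr_fpr(scores, actual_severe, cutoff):
    """Sensitivity and 1 - specificity when score >= cutoff predicts 'severe'."""
    tp = sum(s >= cutoff and a for s, a in zip(scores, actual_severe))
    fp = sum(s >= cutoff and not a for s, a in zip(scores, actual_severe))
    n_pos = sum(actual_severe)
    n_neg = len(actual_severe) - n_pos
    return tp / n_pos, fp / n_neg


def auc(scores, actual_severe):
    """Probability that a severe case scores above a mild one (ties count 1/2)."""
    pos = [s for s, a in zip(scores, actual_severe) if a]
    neg = [s for s, a in zip(scores, actual_severe) if not a]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six patients (True = actually severe)
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
actual = [True, True, True, False, False, False]
print(tpr_fpr(scores, actual, cutoff=0.5), auc(scores, actual))
```

Sweeping the cut-off from 1 down to 0 and plotting each (FPR, TPR) pair traces the ROC curve.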

Page 22

Summary and Discussion

Approach has found sets of biomarkers that GSK can use to

– Predict fibrosis stage

– Monitor progression of patients in non-invasive manner

– Save money

Avoid selection bias

Page 23

Acknowledgements

Kwan Lee, GSK

Mandy Bergquist, GSK

Lei Zhu, GSK

Jack Liu, GSK

Terry Walker, GSK

Peter Leitner, GSK

Andy Liaw, Merck

Christopher Tong, Merck

Vladimir Svetnik, Merck

Duke University

Page 24

References

DeLong, E., DeLong, D., and Clarke-Pearson, D. (1988), “Comparing the Areas Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach,” Biometrics, 44, 837-845.

Svetnik, V., Liaw, A., Tong, C., and Wang, T. (2004), “Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules,” Multiple Classifier Systems, Fifth International Workshop, MCS 2004, Proceedings, 9-11 June 2004, Cagliari, Italy. F. Roli, J. Kittler, and T. Windeatt (eds.). Lecture Notes in Computer Science, vol. 3077. Berlin: Springer, pp. 334-343.

Page 25

Backup

Page 26

ROC Curves: Separation of Metavir F0/F1 from F2-F4

[Figure: ROC curves, Sensitivity (0.0 to 1.0) vs. 1 - Specificity (0.0 to 1.0), for RandomForest (RF), RF with FibroTest Markers, FibroTest, ActiTest]

AUC (as listed on the slide): 0.70, 0.75, 0.70, 0.65

Random Forest with all biomarkers:

• hyaluronic acid

• alpha-2 macroglobulin

• VCAM-1

• GGT

• RBP

• ALT

FibroTest biomarkers:

• alpha-2 macroglobulin

• haptoglobin

• ApoA1

• total bilirubin

• GGT

Page 27

FibroTest and Random Forest Comparison

Test                        Cutoff  Sensitivity  Specificity  PPV   NPV
Actitest                    0.3     0.88         0.27         0.43  0.78
Fibrotest                   0.3     0.85         0.43         0.48  0.81
RandomForest (RF)           0.3     0.77         0.55         0.52  0.79
RF with FibroTest Markers   0.3     0.78         0.53         0.51  0.79
Actitest                    0.7     0.48         0.73         0.53  0.69
Fibrotest                   0.7     0.36         0.85         0.61  0.68
RandomForest (RF)           0.7     0.23         0.96         0.77  0.66
RF with FibroTest Markers   0.7     0.21         0.96         0.78  0.66

PPV = Positive Predictive Value, NPV = Negative Predictive Value

The “cut-off” is the algorithm score threshold for predicting that a subject is Metavir stage F2-F4
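The predictive values in the table depend on how common severe fibrosis is, not only on sensitivity and specificity. The sketch below applies Bayes' rule; the prevalence of 0.39 is an assumption borrowed from the 39% Severe share of the liver fibrosis cohort, and the result only approximately matches the table because the tabulated inputs are rounded.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV from sensitivity, specificity and prevalence (Bayes' rule)."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    return tp / (tp + fp), tn / (tn + fn)


# ActiTest row at cutoff 0.3; prevalence 0.39 is an assumption taken from
# the 39% Severe share reported for the liver fibrosis cohort.
ppv, npv = predictive_values(sensitivity=0.88, specificity=0.27, prevalence=0.39)
print(round(ppv, 2), round(npv, 2))
```

Raising the cut-off trades sensitivity for specificity, which is why PPV rises and NPV falls between the 0.3 and 0.7 rows.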

Page 28

A closer look……

Page 29

Random Input Variables

Candidate Node: 10 Mild, 10 Severe, Gini Index = 0.5

X1: X1=0 6/6, X1=1 4/4, Δ Gini = 0.00

X2: X2=0 7/6, X2=1 3/4, Δ Gini = 0.05

X3: X3=0 9/1, X3=1 1/9, Δ Gini = 0.32

X4: X4=0 6/5, X4=1 4/5, Δ Gini = 0.01

X5: X5=0 4/6, X5=1 6/4, Δ Gini = 0.02

X6: X6=0 9/4, X6=1 1/6, Δ Gini = 0.17

Usual Tree Algorithm chooses the best among all Variables: X3

RandomForest chooses the best among a random subset of variables: X6
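The difference between the two choices on this slide is only the set of candidates searched at the node. The sketch below takes the Δ Gini values exactly as listed on the slide; the subset ["X2", "X5", "X6"] and `mtry=2` are hypothetical choices for illustration.

```python
import random

# Decrease in Gini index for each candidate split, as listed on the slide
delta_gini = {"X1": 0.00, "X2": 0.05, "X3": 0.32, "X4": 0.01, "X5": 0.02, "X6": 0.17}


def best_split(delta, candidates):
    """Variable with the largest Gini decrease among the candidates."""
    return max(candidates, key=lambda name: delta[name])


def random_candidates(names, mtry, rng):
    """Random subset of mtry variables considered at one node."""
    return rng.sample(sorted(names), mtry)


# Usual tree algorithm: search all variables
print(best_split(delta_gini, delta_gini))

# A hypothetical random subset that happens to exclude X3
print(best_split(delta_gini, ["X2", "X5", "X6"]))

# A freshly drawn subset of size mtry
subset = random_candidates(delta_gini, mtry=2, rng=random.Random(0))
print(subset, best_split(delta_gini, subset))
```

Because a new subset is drawn at every node, a dominant variable such as X3 cannot win every split, which lets weaker but still informative variables appear in some trees.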

Page 30

Example – Breast Cancer Study

Study Details

– Breast cancer patients; stages II-IV

– Control subjects; matched by age, race, smoking status

– 42 serum biomarkers

Goal: Identify panel of biomarkers to

– Monitor patient response in a non-invasive and longitudinal manner

– Provide more information on underlying biology and mechanisms of drug action

Page 31

Examples of Biomarkers

Laboratory Tests

– Routine, non-routine and novel tests, novel applications, genes, proteins, metabolites, lipids, …

Electrophysiological Measures

– ECG, EEG, …

Imaging

– fMRI, PET, X-ray, BMD, Ultrasound, CT, …

Histological Analyses

– Immunohistochemistry, electron microscopy, …

Physiological Measures

– Heart rate, blood pressure, pupil size, …

Behavioral Tests

– Cognitive function, motor performance, …

Page 32

Biomarkers in the Research and Drug Discovery Process

Stages: Gene to function to target; Target to Lead; Lead to candidate Selection; Candidate selection to FTIH; FTIH to Proof of Concept; Proof of Concept to Phase III; Phase III File & Launch

Milestones: Disease selection; Target family selection; Target selection; Lead (CEDD entry); Candidate selected; Commit to FTIH; Proof of concept; Commit to phase III; Commit to file and launch; Commit to product type

Targets, Drugs, Products

Biomarker

Page 33

Prognostic & Predictive Biomarkers

Prognostic Biomarkers

– Inform you about clinical outcome independent of therapeutic intervention

– Stable during treatment course

– Patient enrichment strategies

Predictive Biomarkers

– Indicate that effect of new drug relative to control is related to biomarker

– Change over course of treatment

– High importance for successful drug discovery

Page 34

RandomForest on Weight Loss Data: Protein Marker Model Based on Baseline Markers and Baseline Weight

The following markers were selected as having the highest median and mean importance ranking in the protein marker model:

– Weight Week 0, IGFBP-3, CRP, TNF-α, CD40L, and MMP-9

Lipid model not as good as Protein model, but still better than Weight model.

Page 35

RandomForest on Weight Loss Data: Models Based on Early Changes in Markers, Baseline Weight, and Early Change in Weight

Models

A = Weight Week 0 - Week 3

B = Lipid Markers Week 0 - Week 3 & Weight Week 0 & Weight Week 0 - Week 3

C = Protein Markers Week 0 - Week 3 & Weight Week 0 & Weight Week 0 - Week 3

                                    A      B      C
Error Rate                          0.369  0.333  0.375
Predicted "fast" Actual "fast"      6      6      5
Predicted "fast" Actual "normal"    4      3      3
Predicted "fast" Actual "slow"      0      0      0
Number of markers in the model      2      4      4

Error Rate = # subjects misclassified as fast weight losers / # subjects classified as fast weight losers

Note: All results in the table are median numbers based on 50 replicates
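The error rate defined on this slide looks only at subjects the model classifies as fast weight losers. A minimal sketch of that ratio follows; the function name is illustrative and the example counts are those of the lipid marker column.

```python
def fast_error_rate(actual_fast, actual_normal, actual_slow):
    """Share of subjects predicted 'fast' whose actual class is not 'fast'."""
    predicted_fast = actual_fast + actual_normal + actual_slow
    if predicted_fast == 0:
        raise ValueError("no subjects were classified as fast weight losers")
    return (actual_normal + actual_slow) / predicted_fast


# Counts of the lipid marker column: 6 fast, 3 normal, 0 slow
print(round(fast_error_rate(actual_fast=6, actual_normal=3, actual_slow=0), 3))
```

Since the tabulated entries are medians over 50 replicates, a column's error rate need not equal the ratio of its median counts.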

Page 36

RandomForest on Weight Loss Data: Lipid Marker Model Based on Early Change in Markers, Baseline Weight, and Early Change in Weight

The following markers were selected as having the highest median and mean importance ranking in the lipid marker model:

– Weight Week 0 – Week 3, Weight Week 0, and 2 lipid markers

Very similar results are obtained if Weight Week 0 is excluded. The following markers now have highest importance:

– Weight Week 0 – Week 3, and three lipid markers

Page 37

Obesity - Models Based on Baseline Markers and Baseline Weight

Models

A = Weight Week 0

B = Lipid Markers Week 0 & Weight Week 0

C = Protein Markers Week 0 & Weight Week 0

                                    A      B      C
Error Rate                          0.462  0.444  0.375
Predicted "fast" Actual "fast"      6      5      4
Predicted "fast" Actual "normal"    5      4      2
Predicted "fast" Actual "slow"      0      0      0
Number of markers in the model      1      7      6

Error Rate = # subjects misclassified as fast weight losers / # subjects classified as fast weight losers

Permutation p-value for Protein Markers: 0.01

Note: All results in the table are median numbers based on 50 replicates

Page 38

Example – Obesity

50 obese patients

Several hundred protein and lipid biomarkers at different time points

Weight at different time points

Average Weight = 266 lbs

Average BMI = 43

[Timeline: Week 1, Week 3, Week 6, Week 26, Week 52. Events shown: Start 1200 calorie liquid diet; Start 900 calorie liquid diet; Subjects Reside in Clinic; Subjects return home and regain diet control; Sample Collection for Biomarkers]

Page 39

Example – Obesity

50 obese patients

Several hundred protein and lipid biomarkers at different time points

Weight at different time points

[Figure: % Weight Change from Baseline (y-axis, 0 to 35) at Week 3, Week 6, Week 26]

Page 40

Example – Obesity

50 obese patients

– 266 lbs at baseline

– BMI of 43 at baseline

Several hundred protein and lipid biomarkers at baseline

Weight at different time points

[Figure: % Weight Change from Baseline (y-axis, -2.5 to 17.5) vs. Time (Week 3, Week 6)]