RandomForest as a Variable Selection Tool for Biomarker Data Katja Remlinger GlaxoSmithKline, RTP,...

RandomForest as a Variable Selection Tool for Biomarker Data

Katja Remlinger
GlaxoSmithKline, RTP, NA

ICSA Applied Statistics Symposium
June 6th, 2007

Outline

Introduction

Example: Liver Fibrosis

RandomForest Algorithm

Challenges with Variable Selection

External Cross Validation

Summary and Discussion

Introduction

With high dimensional data we want to reduce the number of variables

– Remove “noise” variables

– Ease model interpretation

– Reduce cost by measuring a subset of variables

Biomarker data are typically high dimensional => excellent candidates for variable reduction

What is a Biomarker?

Characteristic that is objectively measured and evaluated as an indicator of
– normal biological processes
– pathogenic processes
– pharmacologic responses to a therapeutic intervention

Types of biomarkers
– Genes
– Proteins
– Lipids
– Metabolites, …

Example – Liver Fibrosis

8th leading cause of death in the US

Scar formation that occurs as the liver tries to repair damaged tissue

Current approach: Liver biopsy to determine fibrosis stage

Goal:

– Identify small panel of biomarkers that can predict fibrosis stage of patient (mild or severe)

=> Prediction Problem with Variable Selection

Example – Liver Fibrosis

384 Hepatitis C infected patients of various fibrosis stages
– 61% Mild
– 39% Severe

Collected 46 serum biomarkers

Select 5-10 biomarkers

Prediction & Variable Selection Tools

Stepwise Regression

PLS, PLS-DA

LARS/LASSO

Elastic Net

RandomForest

A Single Tree

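[Figure: a single classification tree. The candidate node holds 10 Mild and 10 Severe samples (Gini Index = 0.5). It is split on Biomarker 4 (>= 14.45 vs. < 14.45). One daughter node holds 5 Mild and 0 Severe samples (Gini Index = 0) and is labeled Mild. The other daughter node (5 Mild; Severe count not recoverable) is split on Biomarker 32 (= 1 vs. = 0) into daughter nodes with 1 Mild and 4 Mild samples (Severe counts not recoverable); the terminal labels are Mild and Severe.]

Node purity is measured by Gini Index

New Sample: Biomarker 4 = 28.65, Biomarker 32 = 0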
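The Gini Index values on the tree can be checked by hand. The sketch below assumes the usual definition for class counts, 1 minus the sum of squared class proportions; the function name `gini_index` is only illustrative.

```python
def gini_index(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Node counts from the slide, written as (Mild, Severe).
candidate = (10, 10)
pure_daughter = (5, 0)

print(gini_index(candidate))      # 0.5
print(gini_index(pure_daughter))  # 0.0
```

A node with an even class mix has the highest impurity for two classes (0.5); a node holding a single class has impurity 0.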

Data: S1, S2, S3, S4, S5, S6, S7, S8, S9, S10

Draw Bootstrap Samples
– S1, S2, S2, S3, S5, S6, S7, S8, S9, S9
– S2, S3, S4, S4, S4, S5, S7, S8, S8, S10
– S1, S1, S2, S3, S3, S4, S7, S8, S9, S10
– S1, S6, S6, S6, S8, S8, S9, S9, S9, S9
– …….

Grow Trees: Tree 1, Tree 2, Tree 3, …, Tree 5000
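A minimal sketch of the bootstrap step, using only the Python standard library. The helper names and the seed are assumptions for illustration; "out-of-bag" is the common term for the samples a bootstrap draw leaves out.

```python
import random

samples = [f"S{i}" for i in range(1, 11)]

def bootstrap_sample(items, rng):
    """Draw len(items) items with replacement."""
    return [rng.choice(items) for _ in items]

def out_of_bag(items, boot):
    """Items that never appear in the bootstrap sample."""
    drawn = set(boot)
    return [s for s in items if s not in drawn]

# First bootstrap sample shown on the slide.
slide_boot = ["S1", "S2", "S2", "S3", "S5", "S6", "S7", "S8", "S9", "S9"]
print(out_of_bag(samples, slide_boot))  # ['S4', 'S10']

# A fresh draw; the seed is an arbitrary choice.
rng = random.Random(0)
boot = bootstrap_sample(samples, rng)
print(boot, out_of_bag(samples, boot))
```

Each tree is grown on its own bootstrap sample, so each tree has its own set of left-out samples.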

RandomForest

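Variable Importance

Data: S1, S2, S3, S4, S5, S6, S7, S8, S9, S10

Draw Bootstrap Samples (samples left out of each draw follow the arrow)
– S1, S2, S2, S3, S5, S6, S7, S8, S9, S9 → S4, S10
– S2, S3, S4, S4, S4, S5, S7, S8, S8, S10 → S1, S6, S9
– S1, S1, S2, S3, S3, S4, S7, S8, S9, S10 → S5, S6
– S1, S6, S6, S6, S8, S8, S9, S9, S9, S9 → S2, S3, S4, S5, S7, S10
– …….

Drop Down Trees: left-out samples → Prediction Accuracy

Drop Down Trees: left-out samples with a permuted variable (P) → Permuted Prediction Accuracy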
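The comparison of prediction accuracy with permuted prediction accuracy can be sketched as follows. This is a simplified illustration, not the RandomForest implementation: `toy_tree` is a hypothetical stand-in for one fitted tree, and the rows stand in for that tree's left-out samples. In a forest the accuracy drop would be averaged over all trees.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(predict, rows, labels, col, rng):
    """Accuracy drop after shuffling column `col` across the rows."""
    baseline = accuracy(predict, rows, labels)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
    return baseline - accuracy(predict, permuted, labels)

# Hypothetical stand-in for a fitted tree: it only looks at variable 0.
def toy_tree(row):
    return "Severe" if row[0] >= 10.0 else "Mild"

rows = [[20.0, 1], [3.0, 0], [18.5, 0], [7.2, 1], [30.1, 1], [9.9, 0]]
labels = [toy_tree(r) for r in rows]

rng = random.Random(1)  # arbitrary seed
imp0 = permutation_importance(toy_tree, rows, labels, 0, rng)
imp1 = permutation_importance(toy_tree, rows, labels, 1, rng)
print(imp0, imp1)
```

Permuting a variable the tree never uses leaves the accuracy unchanged, so its importance is 0; permuting a variable the tree relies on can only lower the accuracy here.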

Making Prediction with RandomForest

New Sample: M1, M2, …., Mp

[Figure: the new sample is dropped down every tree; the individual trees vote Mild, Mild, Mild, Severe, …….]

Results from all Trees:
– Mild 70%
– Severe 30%

Majority Vote: Mild
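A short sketch of the majority vote. The ten votes are hypothetical and were chosen only to match the 70% / 30% split shown on the slide.

```python
from collections import Counter

def majority_vote(votes):
    """Return the winning class and each class's share of the votes."""
    counts = Counter(votes)
    winner, _ = counts.most_common(1)[0]
    shares = {label: n / len(votes) for label, n in counts.items()}
    return winner, shares

# Ten hypothetical tree votes matching the 70% / 30% split on the slide.
votes = ["Mild"] * 7 + ["Severe"] * 3
winner, shares = majority_vote(votes)
print(winner, shares)
```

The vote shares can also serve as a score for the sample, which is what a cutoff or an ROC curve is later applied to.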

Challenges with Variable Selection

How many variables are important?

Which variables are important?

How do we validate the model?
– Correct way of validating model? => External cross validation
– Is prediction accuracy significant? => Permutation test

A Common Variable Selection Approach is…

Use all data to select variables

Obtain prediction accuracy on reduced data

Introduces selection bias

Used in many publications

[Diagram: (Y, X) reduced to (Y, X*)]

A Better Variable Selection Approach is …

Separate training and test set

External cross validation (ECV)

Avoid selection bias

External Cross Validation

1. Partition data for 5-fold Cross-Validation.

[Diagram: Y (n x 1) and X (n x p) are partitioned into a Training Set and a Test Set; the Test Set changes from fold to fold.]

Svetnik et al. 2004

2. Build RandomForest for each training set.

3. Use importance measure to rank variables.

4. Record test set predictions.

[Diagram: a RandomForest (RF) built on the Training Set (Y, X) produces the Test Set Prediction and a Variable Importance Ranking: 1. Marker_6, 2. Marker_509, 3. Marker_906, …]

5. Remove fraction of least important variables and rebuild RandomForest.

6. Record test set predictions.

7. Do not re-rank variables. Repeat 5-7 until small # of variables is left.

[Diagram: RF built on the Training Set (Y, X) produces the Test Set Prediction; the variables at the bottom of the Variable Importance Ranking are removed, and the steps are repeated with the remaining variables.]

[Figure: AUC under ROC Curve (y-axis, 0.64 to 0.72) against No. of Descriptors (x-axis: 1, 2, 3, 5, 6, 8, 11, 15, 19, 26, 34, 46), with mtry = sqrt(p).]

3 variables are very important

No additional gains by including more variables

8. Compute optimization criterion at each step of variable removal.

9. Replicate to “smooth” out variability.

10. Select p’ = number of variables in the model, based on optimization criterion.

[Figure: Optimization Criterion against No. of Variables.]

11. Pick p’ most important variables.

12. Repeat 1-11 with permuted Y.
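Steps 1-7 and step 12 can be sketched in plain Python. Everything below is an illustrative assumption rather than the procedure used in the talk: the class-mean gap stands in for the RandomForest importance measure, a nearest-centroid rule stands in for the forest, half of the variables are kept at each removal step, the criterion is pooled test accuracy, and the data are synthetic. Replication (step 9) and the choice of p’ (steps 10-11) are left out.

```python
import random

def rank_variables(X, y):
    """Stand-in for RF importance: gap between class means, best first."""
    def gap(j):
        ones = [row[j] for row, lab in zip(X, y) if lab == 1]
        zeros = [row[j] for row, lab in zip(X, y) if lab == 0]
        return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))
    return sorted(range(len(X[0])), key=gap, reverse=True)

def fit_predict(X_tr, y_tr, X_te, cols):
    """Stand-in for an RF: nearest class centroid on the kept columns."""
    def centroid(lab):
        rows = [r for r, l in zip(X_tr, y_tr) if l == lab]
        return [sum(r[j] for r in rows) / len(rows) for j in cols]
    c0, c1 = centroid(0), centroid(1)
    def dist(row, c):
        return sum((row[j] - m) ** 2 for j, m in zip(cols, c))
    return [int(dist(r, c1) < dist(r, c0)) for r in X_te]

def external_cv(X, y, k=5, keep=0.5, min_vars=1, seed=0):
    """Steps 1-7: {number of variables: [(actual, predicted), ...]}."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    results = {}
    for f in range(k):
        test = idx[f::k]                        # step 1: fold f is the test set
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_te, y_te = [X[i] for i in test], [y[i] for i in test]
        cols = rank_variables(X_tr, y_tr)       # steps 2-3: training data only
        while True:
            preds = fit_predict(X_tr, y_tr, X_te, cols)    # steps 4 and 6
            results.setdefault(len(cols), []).extend(zip(y_te, preds))
            if len(cols) <= min_vars:
                break
            # Steps 5 and 7: drop the tail of the ranking, no re-ranking.
            cols = cols[:max(min_vars, int(len(cols) * keep))]
    return results

def best_accuracy(results):
    """Stand-in optimization criterion: best pooled test accuracy."""
    return max(sum(a == p for a, p in pairs) / len(pairs)
               for pairs in results.values())

def permutation_p_value(X, y, n_perm=50, seed=1):
    """Step 12: rerun the whole procedure with permuted Y."""
    rng = random.Random(seed)
    observed = best_accuracy(external_cv(X, y))
    at_least = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        if best_accuracy(external_cv(X, y_perm)) >= observed:
            at_least += 1
    return (at_least + 1) / (n_perm + 1)

# Synthetic data: variable 0 carries signal, variables 1-7 are noise.
rng = random.Random(42)
y = [i % 2 for i in range(60)]
X = [[2.0 * lab + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(7)]
     for lab in y]

results = external_cv(X, y)
p_value = permutation_p_value(X, y)
print(sorted(results), p_value)
```

The point of the structure is that ranking and removal happen inside each training fold, so the test fold never influences which variables are kept. The permutation run repeats the entire selection procedure, not just the final model.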

We Discussed How to…

Use RandomForest to do variable selection

Use external cross validation to select variables in proper way and to validate model

Return to example – Identify small set of biomarkers that can predict mild or severe fibrosis stage

Liver Fibrosis – Comparison to Commercial Tests

[Figure: ROC curves, Sensitivity against 1 - Specificity, for RF 11 var. (AUC = 0.73), RF 3 var. (AUC = 0.74), FibroTest (AUC = 0.70) and ActiTest (AUC = 0.65).]

ROC Curves

[Figure: ROC curves, True Positive Rate against False Positive Rate, for the same four tests.]

TPR = Sensitivity = P(Predicted severe | Actual severe)

FPR = 1 - Specificity = P(Predicted severe | Actual mild)

Test              AUC
RF w. 11 markers  0.73
RF w. 3 markers   0.74
FibroTest         0.70
ActiTest          0.65
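The quantities behind an ROC curve can be computed directly from scores and labels. The sketch below uses a small made-up data set, not the fibrosis data, and computes the AUC as the probability that a severe case scores higher than a mild case (ties count one half).

```python
def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count 1/2."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def roc_point(scores, labels, cutoff):
    """(FPR, TPR) when samples with score >= cutoff are predicted severe."""
    tp = sum(s >= cutoff and l == 1 for s, l in zip(scores, labels))
    fp = sum(s >= cutoff and l == 0 for s, l in zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos

# Made-up scores; 1 = severe, 0 = mild.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

print(auc(scores, labels))             # 0.75
print(roc_point(scores, labels, 0.5))  # (0.0, 0.5)
```

Sweeping the cutoff from high to low traces the ROC curve one (FPR, TPR) point at a time.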

Summary and Discussion

Approach has found sets of biomarkers that GSK can use to
– Predict fibrosis stage
– Monitor progression of patients in non-invasive manner
– Save money

Avoid selection bias

Acknowledgements

Kwan Lee, GSK

Mandy Bergquist, GSK

Lei Zhu, GSK

Jack Liu, GSK

Terry Walker, GSK

Peter Leitner, GSK

Andy Liaw, Merck

Christopher Tong, Merck

Vladimir Svetnik, Merck

Duke University

References

DeLong, E., DeLong, D., and Clarke-Pearson, D. (1988), “Comparing the Areas Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach,” Biometrics, 44, 837-845.

Svetnik, V., Liaw, A., Tong, C., and Wang, T. (2004), “Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules,” Multiple Classifier Systems, Fifth International Workshop, MCS 2004, Proceedings, 9-11 June 2004, Cagliari, Italy. F. Roli, J. Kittler, and T. Windeatt (eds.). Lecture Notes in Computer Science, vol. 3077. Berlin: Springer, pp. 334-343.

Backup

ROC Curves: Separation of Metavir F0/F1 from F2-F4

Random Forest with all biomarkers:
• hyaluronic acid
• alpha-2 macroglobulin
• VCAM-1
• GGT
• RBP
• ALT

FibroTest biomarkers:
• alpha-2 macroglobulin
• haptoglobin
• ApoA1
• total bilirubin
• GGT

[Figure: ROC curves, Sensitivity against 1 - Specificity, for the four tests listed below.]

Test                       AUC
RandomForest (RF)          0.70
RF with FibroTest Markers  0.75
FibroTest                  0.70
ActiTest                   0.65

FibroTest and Random Forest Comparison

Test                       Cutoff  Sensitivity  Specificity  Positive Predictive Value  Negative Predictive Value
Actitest                   0.3     0.88         0.27         0.43                       0.78
Fibrotest                  0.3     0.85         0.43         0.48                       0.81
RandomForest (RF)          0.3     0.77         0.55         0.52                       0.79
RF with FibroTest Markers  0.3     0.78         0.53         0.51                       0.79
Actitest                   0.7     0.48         0.73         0.53                       0.69
Fibrotest                  0.7     0.36         0.85         0.61                       0.68
RandomForest (RF)          0.7     0.23         0.96         0.77                       0.66
RF with FibroTest Markers  0.7     0.21         0.96         0.78                       0.66

The “cut-off” is the algorithm score for predicting a subject is Metavir stage F2-F4
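The four columns of the comparison follow from the confusion matrix at a given cutoff. A minimal sketch on made-up scores (not the study data), assuming a subject is predicted F2-F4 when the score is at or above the cutoff:

```python
def diagnostic_metrics(scores, labels, cutoff):
    """Sensitivity, specificity, PPV and NPV for the rule score >= cutoff."""
    tp = sum(s >= cutoff and l == 1 for s, l in zip(scores, labels))
    fp = sum(s >= cutoff and l == 0 for s, l in zip(scores, labels))
    fn = sum(s < cutoff and l == 1 for s, l in zip(scores, labels))
    tn = sum(s < cutoff and l == 0 for s, l in zip(scores, labels))

    def ratio(a, b):
        # Undefined when the denominator is empty (e.g. nobody predicted positive).
        return a / b if b else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }

# Made-up scores; 1 = F2-F4, 0 = F0/F1.
scores = [0.9, 0.8, 0.6, 0.4, 0.2, 0.75, 0.5, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

low = diagnostic_metrics(scores, labels, 0.3)
high = diagnostic_metrics(scores, labels, 0.7)
print(low)
print(high)
```

As in the table, raising the cutoff trades sensitivity for specificity.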

A closer look……

Random Input Variables

Candidate Node: 10 Mild, 10 Severe

Variable  Value = 0  Value = 1  Δ Gini
X1        6/6        4/4        0.00
X2        7/6        3/4        0.05
X3        9/1        1/9        0.32
X4        6/5        4/5        0.01
X5        4/6        6/4        0.02
X6        9/4        1/6        0.17

Usual Tree Algorithm chooses the best among all Variables: X3

RandomForest chooses the best among a random subset of variables: X6
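The difference between the two rules can be sketched with the Δ Gini values from the slide. The subset `["X1", "X5", "X6"]` is a hypothetical draw (the slide does not say which variables were sampled), and the subset size uses the floor of sqrt(p), in line with mtry = sqrt(p).

```python
import math
import random

# Δ Gini per candidate variable, as listed on the slide.
gini_decrease = {"X1": 0.00, "X2": 0.05, "X3": 0.32,
                 "X4": 0.01, "X5": 0.02, "X6": 0.17}

def best_split(candidates):
    """Variable with the largest Gini decrease among the candidates."""
    return max(candidates, key=gini_decrease.get)

def random_forest_split(mtry, rng):
    """Pick the best variable within a random subset of size mtry."""
    subset = rng.sample(sorted(gini_decrease), mtry)
    return subset, best_split(subset)

print(best_split(gini_decrease))        # X3: best among all variables
print(best_split(["X1", "X5", "X6"]))   # X6: best within a hypothetical subset

mtry = math.isqrt(len(gini_decrease))   # floor(sqrt(6)) = 2
subset, chosen = random_forest_split(mtry, random.Random(0))
print(subset, chosen)
```

Because X3 is not always in the random subset, weaker variables such as X6 get a chance to define splits, which decorrelates the trees.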

Example – Breast Cancer Study

Study Details
– Breast cancer patients; stages II-IV
– Control subjects; matched by age, race, smoking status
– 42 serum biomarkers

Goal: Identify panel of biomarkers to
– Monitor patient response in a non-invasive and longitudinal manner
– Provide more information on underlying biology and mechanisms of drug action

Examples of Biomarkers

Laboratory Tests
– Routine, non-routine and novel tests, novel applications, genes, proteins, metabolites, lipids, …

Electrophysiological Measures
– ECG, EEG, …

Imaging
– fMRI, PET, X-ray, BMD, Ultrasound, CT, …

Histological Analyses
– Immunohistochemistry, electron microscopy, …

Physiological Measures
– Heart rate, blood pressure, pupil size, …

Behavioral Tests
– Cognitive function, motor performance, …

Biomarkers in the Research and Drug Discovery Process

[Diagram: the discovery pipeline from Targets to Drugs to Products, with Biomarker marked along it.]

Stages
– Gene to function to target
– Target to Lead
– Lead to Candidate Selection
– Candidate selection to FTIH
– FTIH to Proof of Concept
– Proof of Concept to Phase III
– Phase III, File & Launch

Milestones
– Disease selection
– Target family selection
– Target selection
– Lead (CEDD entry)
– Candidate selected
– Commit to FTIH
– Proof of concept
– Commit to phase III
– Commit to file and launch
– Commit to product type

Prognostic & Predictive Biomarkers

Prognostic Biomarkers
– Inform you about clinical outcome independent of therapeutic intervention
– Stable during treatment course
– Patient enrichment strategies

Predictive Biomarkers
– Indicate that effect of new drug relative to control is related to biomarker
– Change over course of treatment
– High importance for successful drug discovery

RandomForest on Weight Loss Data: Protein Marker Model Based on Baseline Markers and Baseline Weight

The following markers were selected as having the highest median and mean importance ranking in the protein marker model:
– Weight Week 0, IGFBP-3, CRP, TNF-α, CD40L, and MMP-9

Lipid model not as good as Protein model, but still better than Weight model.

RandomForest on Weight Loss Data: Models Based on Early Changes in Markers, Baseline Weight, and Early Change in Weight

Model A: Weight Week 0 - Week 3
Model B: Lipid Markers Week 0 - Week 3 & Weight Week 0 & Weight Week 0 - Week 3
Model C: Protein Markers Week 0 - Week 3 & Weight Week 0 & Weight Week 0 - Week 3

                                  Model A  Model B  Model C
Error Rate                        0.369    0.333    0.375
Predicted "fast" Actual "fast"    6        6        5
Predicted "fast" Actual "normal"  4        3        3
Predicted "fast" Actual "slow"    0        0        0
Number of markers in the model    2        4        4

Error Rate = # subjects misclassified as fast weight losers / # subjects classified as fast weight losers

Note: All results in the table are median numbers based on 50 replicates
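The error rate defined above is the share of subjects predicted "fast" who are not actually "fast". A minimal sketch with an illustrative function name; the counts are the ones reported for the lipid marker model. Because the table reports medians over 50 replicates, the ratio of the median counts need not equal the median error rate in every column.

```python
def fast_error_rate(pred_fast_by_actual):
    """Share of subjects predicted 'fast' whose actual class is not 'fast'."""
    total = sum(pred_fast_by_actual.values())
    wrong = total - pred_fast_by_actual.get("fast", 0)
    return wrong / total

# Counts of subjects predicted "fast", keyed by their actual class.
counts = {"fast": 6, "normal": 3, "slow": 0}
print(round(fast_error_rate(counts), 3))  # 0.333
```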

RandomForest on Weight Loss Data: Lipid Marker Model Based on Early Change in Markers, Baseline Weight, and Early Change in Weight

The following markers were selected as having the highest median and mean importance ranking in the lipid marker model:
– Weight Week 0 – Week 3, Weight Week 0, and 2 lipid markers

Very similar results are obtained if Weight Week 0 is excluded. The following markers now have highest importance:
– Weight Week 0 – Week 3, and three lipid markers

Obesity - Models Based on Baseline Markers and Baseline Weight

Model A: Weight Week 0
Model B: Lipid Markers Week 0 & Weight Week 0
Model C: Protein Markers Week 0 & Weight Week 0

                                  Model A  Model B  Model C
Error Rate                        0.462    0.444    0.375
Predicted "fast" Actual "fast"    6        5        4
Predicted "fast" Actual "normal"  5        4        2
Predicted "fast" Actual "slow"    0        0        0
Number of markers in the model    1        7        6

Error Rate = # subjects misclassified as fast weight losers / # subjects classified as fast weight losers

Permutation p-value for Protein Markers: 0.01

Note: All results in the table are median numbers based on 50 replicates

Example – Obesity

50 obese patients

Several hundred protein and lipid biomarkers at different time points

Weight at different time points

Average Weight = 266 lbs
Average BMI = 43

[Timeline: Week 1, Week 3, Week 6, Week 26, Week 52. Labels: Start 1200 calorie liquid diet; Start 900 calorie liquid diet; Subjects Reside in Clinic; Subjects return home and regain diet control; Sample Collection for Biomarkers.]

[Figure: % Weight Change from Baseline (y-axis, 0 to 35) at Week 3, Week 6 and Week 26.]

Example – Obesity

50 obese patients

– 266 lbs at baseline

– BMI of 43 at baseline

Several hundred protein and lipid biomarkers at baseline

Weight at different time points

[Figure: % Weight Change from Baseline (y-axis, -2.5 to 17.5) against Time (Week 3, Week 6).]
