RandomForest as a Variable Selection Tool for Biomarker Data
Katja Remlinger
GlaxoSmithKline, RTP, NA
ICSA Applied Statistics Symposium
June 6th, 2007
Outline
Introduction
Example: Liver Fibrosis
RandomForest Algorithm
Challenges with Variable Selection
External Cross Validation
Summary and Discussion
Introduction
With high dimensional data we want to reduce the number of variables
– Remove “noise” variables
– Ease model interpretation
– Reduce cost by measuring a subset of variables
Biomarker data are typically high dimensional => excellent candidates for variable reduction
What is a Biomarker?
Characteristic that is objectively measured and evaluated as an indicator of
– normal biological processes
– pathogenic processes
– pharmacologic responses to a therapeutic intervention
Types of biomarkers
– Genes
– Proteins
– Lipids
– Metabolites
– …
Example – Liver Fibrosis
8th leading cause of death in the US
Scar formation that occurs as the liver tries to repair damaged tissue
Current approach: Liver biopsy to determine fibrosis stage
Goal:
– Identify small panel of biomarkers that can predict fibrosis stage of patient (mild or severe)
=> Prediction Problem with Variable Selection
Example – Liver Fibrosis
384 Hepatitis C infected patients of various fibrosis stages
– 61% Mild
– 39% Severe
Collected 46 serum biomarkers
Select 5-10 biomarkers
Prediction & Variable Selection Tools
Stepwise Regression
PLS, PLS-DA
LARS/LASSO
Elastic Net
RandomForest
A Single Tree
Node purity is measured by Gini Index
Candidate Node: 10 Mild, 10 Severe, Gini Index = 0.5
Split on Biomarker 4 (>= 14.45 vs. < 14.45)
– Daughter Node: 5 Mild, 0 Severe, Gini Index = 0 => Mild
– Daughter Node: 5 Mild, 10 Severe, Gini Index = 0.44
Split of the second daughter node on Biomarker 32 (= 1 vs. = 0)
– Daughter Node: 1 Mild, 9 Severe, Gini Index = 0.18 => Severe
– Daughter Node: 4 Mild, 1 Severe, Gini Index = 0.32 => Mild
New Sample: Biomarker 4 = 28.65, Biomarker 32 = 0
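The Gini Index values on this slide can be reproduced from the class counts alone. The following is a minimal sketch, not code from the talk: the function names are illustrative, and the size-weighted impurity decrease is the usual CART definition, assumed here because the slide does not define it.

```python
from typing import Sequence


def gini_index(counts: Sequence[int]) -> float:
    """Gini impurity of a node from its class counts, e.g. (mild, severe)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_decrease(parent: Sequence[int], children: Sequence[Sequence[int]]) -> float:
    """Parent impurity minus the size-weighted mean impurity of the daughters.

    Assumption: this is the standard CART definition, not one stated in the slides.
    """
    n = sum(parent)
    weighted = sum(sum(child) / n * gini_index(child) for child in children)
    return gini_index(parent) - weighted


# Class counts (mild, severe) for the nodes shown on the slide.
nodes = {
    "candidate": (10, 10),
    "daughter 5/0": (5, 0),
    "daughter 5/10": (5, 10),
    "daughter 1/9": (1, 9),
    "daughter 4/1": (4, 1),
}
for name, counts in nodes.items():
    print(f"{name}: Gini = {gini_index(counts):.2f}")
```

A pure node (5 Mild, 0 Severe) has impurity 0, and a perfectly mixed two-class node has impurity 0.5, which is why the first split is attractive.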
RandomForest
Data: S1, S2, S3, S4, S5, S6, S7, S8, S9, S10
Draw Bootstrap Samples
– Tree 1: S1, S2, S2, S3, S5, S6, S7, S8, S9, S9
– Tree 2: S2, S3, S4, S4, S4, S5, S7, S8, S8, S10
– Tree 3: S1, S1, S2, S3, S3, S4, S7, S8, S9, S10
– …
– Tree 5000: S1, S6, S6, S6, S8, S8, S9, S9, S9, S9
Grow Trees
Variable Importance
Data: S1, S2, S3, S4, S5, S6, S7, S8, S9, S10
Draw Bootstrap Samples
– Tree 1: S1, S2, S2, S3, S5, S6, S7, S8, S9, S9
– Tree 2: S2, S3, S4, S4, S4, S5, S7, S8, S8, S10
– Tree 3: S1, S1, S2, S3, S3, S4, S7, S8, S9, S10
– …
– Tree 5000: S1, S6, S6, S6, S8, S8, S9, S9, S9, S9
Samples not drawn for each tree
– Tree 1: S4, S10
– Tree 2: S1, S6, S9
– Tree 3: S5, S6
– Tree 5000: S2, S3, S4, S5, S7, S10
Drop Down Trees => Prediction Accuracy
Permute (P) a variable, Drop Down Trees => Permuted Prediction Accuracy
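The samples listed under each tree on this slide are exactly the ones that do not appear in that tree's bootstrap sample (the out-of-bag samples). A small sketch that recovers them from the bootstrap samples shown; the helper names are illustrative and not from the talk.

```python
import random
from typing import Dict, List

DATA: List[str] = [f"S{i}" for i in range(1, 11)]

# Bootstrap samples as listed on the slide.
BOOTSTRAP: Dict[str, List[str]] = {
    "Tree 1": ["S1", "S2", "S2", "S3", "S5", "S6", "S7", "S8", "S9", "S9"],
    "Tree 2": ["S2", "S3", "S4", "S4", "S4", "S5", "S7", "S8", "S8", "S10"],
    "Tree 3": ["S1", "S1", "S2", "S3", "S3", "S4", "S7", "S8", "S9", "S10"],
    "Tree 5000": ["S1", "S6", "S6", "S6", "S8", "S8", "S9", "S9", "S9", "S9"],
}


def draw_bootstrap(data: List[str], rng: random.Random) -> List[str]:
    """Draw len(data) samples with replacement."""
    return rng.choices(data, k=len(data))


def out_of_bag(data: List[str], sample: List[str]) -> List[str]:
    """Samples that were never drawn, in their original order."""
    drawn = set(sample)
    return [s for s in data if s not in drawn]


oob = {tree: out_of_bag(DATA, sample) for tree, sample in BOOTSTRAP.items()}
for tree, left_out in oob.items():
    print(tree, left_out)
```

Because each tree never saw its out-of-bag samples, dropping them down the tree gives an honest accuracy estimate. Repeating this after permuting one variable shows how much accuracy depends on that variable.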
Making Prediction with RandomForest
New Sample: M1, M2, …, Mp
Drop the new sample down Tree 1, Tree 2, Tree 3, …, Tree 5000
Tree predictions: Mild, Mild, Mild, Severe, …
Results from all Trees:
– Mild: 70%
– Severe: 30%
Majority Vote: Mild
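The majority vote can be sketched in a few lines. The vote counts below are chosen to match the 70% / 30% split on the slide for 5000 trees; the function name is illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple


def majority_vote(votes: List[str]) -> Tuple[str, Dict[str, float]]:
    """Return the most frequent class and the share of trees voting for each class."""
    counts = Counter(votes)
    label, _ = counts.most_common(1)[0]  # ties resolve to the class seen first
    shares = {cls: n / len(votes) for cls, n in counts.items()}
    return label, shares


# 5000 trees: 70% vote Mild, 30% vote Severe.
votes = ["Mild"] * 3500 + ["Severe"] * 1500
label, shares = majority_vote(votes)
print(label, shares)
```

The vote shares can also be used as a score between 0 and 1, which is what makes a cutoff-based ROC analysis possible.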
Challenges with Variable Selection
How many variables are important?
Which variables are important?
How do we validate the model?
– Correct way of validating model? => External cross validation
– Is prediction accuracy significant? => Permutation test
A Common Variable Selection Approach is…
Use all data to select variables
Obtain prediction accuracy on reduced data
Introduces selection bias
Used in many publications
[Diagram: full data (Y, X) reduced to selected variables (Y, X*)]
A Better Variable Selection Approach is …
Separate training and test set
External cross validation (ECV)
Avoid selection bias
External Cross Validation
1. Partition data for 5-fold Cross-Validation (Svetnik et al. 2004).
[Diagram: response Y (n x 1) and predictors X (n x p), split into a Training Set and a Test Set for each fold]
2. Build RandomForest for each training set.
3. Use importance measure to rank variables.
4. Record test set predictions.
[Diagram: RF built on the Training Set (Y, X) gives a Test Set Prediction and a Variable Importance Ranking: 1. Marker_6, 2. Marker_509, 3. Marker_906, …, 98. Marker_57, 99. Marker_2, 1000. Marker_49]
5. Remove fraction of least important variables and rebuild RandomForest.
6. Record test set predictions.
7. Do not re-rank variables. Repeat 5-7 until small # of variables is left.
[Diagram: the least important markers are removed from the Variable Importance Ranking, RF is rebuilt on the Training Set (Y, X) to give a new Test Set Prediction; repeat with remaining variables]
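Steps 1-7 can be sketched as a loop in which ranking and model fitting only ever see the training folds. This is a minimal sketch under stated assumptions, not the author's implementation: `rank_by_mean_gap` and `nearest_centroid` are hypothetical stand-ins for the RandomForest importance ranking and RandomForest prediction, the 25% removal fraction is an assumed value, and the toy data are made up.

```python
import random
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle 0..n-1 and deal the indices into k disjoint test folds (step 1)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def external_cv(X: Matrix, y: List[int],
                rank_variables: Callable[[Matrix, List[int]], List[int]],
                fit_predict: Callable[[Matrix, List[int], Matrix], List[int]],
                k: int = 5, drop_frac: float = 0.25, min_vars: int = 1,
                seed: int = 0) -> Dict[int, List[Tuple[int, int]]]:
    """Map each number of variables to (actual, predicted) pairs over all test folds."""
    n, p = len(X), len(X[0])
    results: Dict[int, List[Tuple[int, int]]] = {}
    for test in k_fold_indices(n, k, seed):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        y_tr = [y[i] for i in train]
        # Steps 2-3: rank once per fold, on the training set only.
        ranking = rank_variables([X[i] for i in train], y_tr)
        keep = p
        while True:
            cols = ranking[:keep]
            X_tr = [[X[i][j] for j in cols] for i in train]
            X_te = [[X[i][j] for j in cols] for i in test]
            preds = fit_predict(X_tr, y_tr, X_te)  # steps 4 and 6
            results.setdefault(keep, []).extend(zip([y[i] for i in test], preds))
            if keep <= min_vars:
                break
            # Step 5: drop the least important fraction; step 7: no re-ranking.
            keep = max(min_vars, keep - max(1, int(keep * drop_frac)))
    return results


def rank_by_mean_gap(X: Matrix, y: List[int]) -> List[int]:
    """Stand-in ranking (NOT RandomForest importance): gap between class means."""
    def gap(j: int) -> float:
        a = [row[j] for row, lab in zip(X, y) if lab == 1]
        b = [row[j] for row, lab in zip(X, y) if lab == 0]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(X[0])), key=gap, reverse=True)


def nearest_centroid(X_tr: Matrix, y_tr: List[int], X_te: Matrix) -> List[int]:
    """Stand-in classifier (NOT RandomForest): predict the class with the closest mean."""
    cents = {}
    for lab in set(y_tr):
        rows = [r for r, l in zip(X_tr, y_tr) if l == lab]
        cents[lab] = [sum(col) / len(rows) for col in zip(*rows)]
    def dist(r: List[float], c: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(r, c))
    return [min(cents, key=lambda lab: dist(r, cents[lab])) for r in X_te]


def accuracy(pairs: List[Tuple[int, int]]) -> float:
    return sum(actual == pred for actual, pred in pairs) / len(pairs)


# Toy data: column 0 separates the classes, columns 1-3 are uninformative.
y = [i % 2 for i in range(20)]
X = [[10.0 * y[i] + 0.1 * (i % 3), float((i * 7) % 5), float((i * 3) % 4), float(i % 3)]
     for i in range(20)]
results = external_cv(X, y, rank_by_mean_gap, nearest_centroid)
for n_vars in sorted(results):
    print(n_vars, "variables: accuracy", accuracy(results[n_vars]))
```

The design point is that the test fold is set aside before any ranking happens, so the recorded predictions are not contaminated by the variable selection itself.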
[Figure: AUC under ROC Curve (axis 0.64 to 0.72) vs. No. of Descriptors (1, 2, 3, 5, 6, 8, 11, 15, 19, 26, 34, 46), with mtry = sqrt(p)]
3 variables are very important
No additional gains by including more variables
8. Compute optimization criterion at each step of variable removal.
9. Replicate to “smooth” out variability.
10. Select p’ = number of variables in the model, based on optimization criterion.
[Figure: Optimization Criterion vs. No. of Variables]
11. Pick p’ most important variables.
12. Repeat 1-11 with permuted Y.
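Step 12 is the permutation test: the whole procedure is rerun with shuffled class labels to see what accuracy arises by chance. A minimal sketch of the final p-value calculation follows; the add-one convention used here is one common choice and an assumption, since the slides do not state the formula, and the null values are made-up placeholders.

```python
import random
from typing import List, Sequence


def permute_labels(y: Sequence[int], rng: random.Random) -> List[int]:
    """Return a shuffled copy of the response, breaking any link between Y and X."""
    y_perm = list(y)
    rng.shuffle(y_perm)
    return y_perm


def permutation_p_value(observed: float, null_stats: Sequence[float]) -> float:
    """Share of permuted runs that do at least as well as the real run.

    Assumption: the add-one form counts the observed run among the permutations,
    so the p-value can never be exactly zero.
    """
    exceed = sum(s >= observed for s in null_stats)
    return (exceed + 1) / (len(null_stats) + 1)


# Hypothetical example: 99 permuted runs, none reaching the observed criterion.
null_aucs = [0.5] * 99
print(permutation_p_value(0.74, null_aucs))
```

Each permuted run has to repeat steps 1-11 in full, including the variable selection, otherwise the null distribution is too optimistic.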
We Discussed How to…
Use RandomForest to do variable selection
Use external cross validation to select variables in proper way and to validate model
Return to example – Identify small set of biomarkers that can predict mild or severe fibrosis stage
Liver Fibrosis – Comparison to Commercial Tests
[Figure: ROC curves, Sensitivity vs. 1 - Specificity]
– RF 11 var. (AUC = 0.73)
– RF 3 var. (AUC = 0.74)
– FibroTest (AUC = 0.70)
– ActiTest (AUC = 0.65)
ROC Curves
[Figure: True Positive Rate vs. False Positive Rate]
TPR = Sensitivity = P(Pred. severe | Actual severe)
FPR = 1 - Specificity = P(Pred. severe | Actual mild)
AUC
– RF w. 11 markers: 0.73
– RF w. 3 markers: 0.74
– FibroTest: 0.70
– ActiTest: 0.65
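The TPR and FPR definitions on this slide, and the area under the ROC curve, can be computed directly from scores and true classes. The sketch below uses made-up scores and computes AUC as the probability that a severe patient scores higher than a mild one, which equals the area under the empirical ROC curve.

```python
from typing import List, Tuple


def tpr_fpr(scores: List[float], labels: List[int], cutoff: float) -> Tuple[float, float]:
    """labels: 1 = severe, 0 = mild. Predict severe when score >= cutoff."""
    tp = sum(s >= cutoff and l == 1 for s, l in zip(scores, labels))
    fp = sum(s >= cutoff and l == 0 for s, l in zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


def auc(scores: List[float], labels: List[int]) -> float:
    """P(score of a severe case > score of a mild case), ties counted as 1/2."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores (e.g. share of trees voting severe) and true classes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(tpr_fpr(scores, labels, cutoff=0.5))
print(auc(scores, labels))
```

Sweeping the cutoff from 1 to 0 traces the ROC curve one point at a time.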
Summary and Discussion
Approach has found sets of biomarkers that GSK can use to
– Predict fibrosis stage
– Monitor progression of patients in non-invasive manner
– Save money
Avoid selection bias
Acknowledgements
Kwan Lee, GSK
Mandy Bergquist, GSK
Lei Zhu, GSK
Jack Liu, GSK
Terry Walker, GSK
Peter Leitner, GSK
Andy Liaw, Merck
Christopher Tong, Merck
Vladimir Svetnik, Merck
Duke University
References
DeLong, E., DeLong, D., and Clarke-Pearson, D. (1988), “Comparing the Areas Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach,” Biometrics, 44, 837-845.
Svetnik, V., Liaw, A., Tong, C., and Wang, T. (2004), “Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules,” Multiple Classifier Systems, Fifth International Workshop, MCS 2004, Proceedings, 9-11 June 2004, Cagliari, Italy. F. Roli, J. Kittler, and T. Windeatt (eds.). Lecture Notes in Computer Science, vol. 3077. Berlin: Springer, pp. 334-343.
Backup
ROC Curves: Separation of Metavir F0/F1 from F2-F4
Random Forest with all biomarkers:
– hyaluronic acid
– alpha-2 macroglobulin
– VCAM-1
– GGT
– RBP
– ALT
FibroTest biomarkers:
– alpha-2 macroglobulin
– haptoglobin
– ApoA1
– total bilirubin
– GGT
[Figure: ROC curves, Sensitivity vs. 1 - Specificity]
Legend: RandomForest (RF), RF with FibroTest Markers, FibroTest, ActiTest
AUC: 0.70, 0.75, 0.70, 0.65
FibroTest and Random Forest Comparison
PPV = Positive Predictive Value, NPV = Negative Predictive Value

Test                        Cutoff  Sensitivity  Specificity  PPV   NPV
ActiTest                    0.3     0.88         0.27         0.43  0.78
FibroTest                   0.3     0.85         0.43         0.48  0.81
RandomForest (RF)           0.3     0.77         0.55         0.52  0.79
RF with FibroTest Markers   0.3     0.78         0.53         0.51  0.79
ActiTest                    0.7     0.48         0.73         0.53  0.69
FibroTest                   0.7     0.36         0.85         0.61  0.68
RandomForest (RF)           0.7     0.23         0.96         0.77  0.66
RF with FibroTest Markers   0.7     0.21         0.96         0.78  0.66

The “cut-off” is the algorithm score for predicting a subject is Metavir stage F2-F4
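The predictive values in the table are tied to sensitivity, specificity, and the share of severe patients through Bayes' rule. The sketch below assumes that the 39% severe prevalence from the study description applies to this table, which the slides do not state. For the ActiTest row at cutoff 0.3 it gives roughly 0.44 and 0.78, against 0.43 and 0.78 in the table; small differences are expected because the inputs are rounded.

```python
from typing import Tuple


def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Positive and negative predictive value via Bayes' rule."""
    tp = sensitivity * prevalence
    fp = (1.0 - specificity) * (1.0 - prevalence)
    tn = specificity * (1.0 - prevalence)
    fn = (1.0 - sensitivity) * prevalence
    return tp / (tp + fp), tn / (tn + fn)


# ActiTest at cutoff 0.3; prevalence 0.39 is an assumption taken from the
# "39% Severe" figure in the study description.
ppv, npv = predictive_values(sensitivity=0.88, specificity=0.27, prevalence=0.39)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Unlike sensitivity and specificity, the predictive values would change if the same tests were applied to a population with a different share of severe patients.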
A closer look……
Random Input Variables
Candidate Node: 10 Mild, 10 Severe, Gini Index = 0.5
Candidate splits
– X1: X1=0 6/6, X1=1 4/4, Δ Gini = 0.00
– X2: X2=0 7/6, X2=1 3/4, Δ Gini = 0.05
– X3: X3=0 9/1, X3=1 1/9, Δ Gini = 0.32
– X4: X4=0 6/5, X4=1 4/5, Δ Gini = 0.01
– X5: X5=0 4/6, X5=1 6/4, Δ Gini = 0.02
– X6: X6=0 9/4, X6=1 1/6, Δ Gini = 0.17
Usual Tree Algorithm chooses the best among all Variables: X3
RandomForest chooses the best among a random subset of variables: X6
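The difference between the two rules can be sketched with the Δ Gini values from this slide. The candidate subset `["X4", "X6"]` is a hypothetical draw chosen for illustration; the slide does not say which variables were in the random subset, only that X6 was the best among them.

```python
import math
import random
from typing import Dict, Iterable, List

# Gini decrease for each candidate split, as given on the slide.
DELTA_GINI: Dict[str, float] = {
    "X1": 0.00, "X2": 0.05, "X3": 0.32, "X4": 0.01, "X5": 0.02, "X6": 0.17,
}


def best_split(deltas: Dict[str, float], candidates: Iterable[str]) -> str:
    """Pick the candidate variable with the largest Gini decrease."""
    return max(candidates, key=lambda v: deltas[v])


def random_candidates(variables: List[str], mtry: int, rng: random.Random) -> List[str]:
    """Draw mtry distinct variables to consider at this node."""
    return rng.sample(variables, mtry)


variables = list(DELTA_GINI)
mtry = int(math.sqrt(len(variables)))  # sqrt(6) rounded down gives 2

print(best_split(DELTA_GINI, variables))     # usual tree: all variables compete
print(best_split(DELTA_GINI, ["X4", "X6"]))  # hypothetical random subset
print(random_candidates(variables, mtry, random.Random(1)))
```

Restricting each node to a random subset makes the trees less alike, which is what gives the averaged forest its stability.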
Example – Breast Cancer Study
Study Details
– Breast cancer patients; stages II-IV
– Control subjects; matched by age, race, and smoking status
– 42 serum biomarkers
Goal: Identify panel of biomarkers to
– Monitor patient response in a non-invasive and longitudinal manner
– Provide more information on underlying biology and mechanisms of drug action
Examples of Biomarkers
Laboratory Tests
– Routine, non-routine and novel tests, novel applications, genes, proteins, metabolites, lipids, …
Electrophysiological Measures
– ECG, EEG, …
Imaging
– fMRI, PET, X-ray, BMD, Ultrasound, CT, …
Histological Analyses
– Immunohistochemistry, electron microscopy, …
Physiological Measures
– Heart rate, blood pressure, pupil size, …
Behavioral Tests
– Cognitive function, motor performance, …
Biomarkers in the Research and Drug Discovery Process
[Diagram: pipeline from Targets to Drugs to Products, with Biomarker]
Stages
– Gene to function to target
– Target to Lead
– Lead to candidate Selection
– Candidate selection to FTIH
– FTIH to Proof of Concept
– Proof of Concept to Phase III
– Phase III File & Launch
Milestones
– Disease selection
– Target family selection
– Target selection
– Lead (CEDD entry)
– Candidate selected
– Commit to FTIH
– Proof of concept
– Commit to phase III
– Commit to file and launch
– Commit to product type
Prognostic & Predictive Biomarkers
Prognostic Biomarkers
– Inform you about clinical outcome independent of therapeutic intervention
– Stable during treatment course
– Patient enrichment strategies
Predictive Biomarkers
– Indicate that effect of new drug relative to control is related to biomarker
– Change over course of treatment
– High importance for successful drug discovery
RandomForest on Weight Loss Data: Protein Marker Model Based on Baseline Markers and Baseline Weight
The following markers were selected as having the highest median and mean importance ranking in the protein marker model:
– Weight Week 0, IGFBP-3, CRP, TNF-α, CD40L, and MMP-9
Lipid model not as good as Protein model, but still better than Weight model.
RandomForest on Weight Loss Data: Models Based on Early Changes in Markers, Baseline Weight, and Early Change in Weight
Models
(A) Weight Week 0 - Week 3
(B) Lipid Markers Week 0 - Week 3 & Weight Week 0 & Weight Week 0 - Week 3
(C) Protein Markers Week 0 - Week 3 & Weight Week 0 & Weight Week 0 - Week 3

                                    (A)     (B)     (C)
Error Rate                          0.369   0.333   0.375
Predicted "fast", Actual "fast"     6       6       5
Predicted "fast", Actual "normal"   4       3       3
Predicted "fast", Actual "slow"     0       0       0
Number of markers in the model      2       4       4

Error Rate = # subjects misclassified as fast weight losers / # subjects classified as fast weight losers
Note: All results in the table are median numbers based on 50 replicates
RandomForest on Weight Loss Data: Lipid Marker Model Based on Early Change in Markers, Baseline Weight, and Early Change in Weight
The following markers were selected as having the highest median and mean importance ranking in the lipid marker model:
– Weight Week 0 – Week 3, Weight Week 0, and 2 lipid markers
Very similar results are obtained if Weight Week 0 is excluded. The following markers now have highest importance:
– Weight Week 0 – Week 3, and three lipid markers
Obesity - Models Based on Baseline Markers and Baseline Weight
Models
(A) Weight Week 0
(B) Lipid Markers Week 0 & Weight Week 0
(C) Protein Markers Week 0 & Weight Week 0

                                    (A)     (B)     (C)
Error Rate                          0.462   0.444   0.375
Predicted "fast", Actual "fast"     6       5       4
Predicted "fast", Actual "normal"   5       4       2
Predicted "fast", Actual "slow"     0       0       0
Number of markers in the model      1       7       6

Error Rate = # subjects misclassified as fast weight losers / # subjects classified as fast weight losers
Permutation p-value for Protein Markers: 0.01
Note: All results in the table are median numbers based on 50 replicates
Example – Obesity
50 obese patients
– Average Weight = 266 lbs
– Average BMI = 43
Several hundred protein and lipid biomarkers at different time points
Weight at different time points
[Timeline: Start 1200 calorie liquid diet; Start 900 calorie liquid diet; Week 1, Week 3, Week 6, Week 26, Week 52; Subjects Reside in Clinic; Subjects return home and regain diet control; Sample Collection for Biomarkers]
Example – Obesity
[Figure: % Weight Change from Baseline (axis 0 to 35) at Week 3, Week 6, and Week 26]
Example – Obesity
50 obese patients
– 266 lbs at baseline
– BMI of 43 at baseline
Several hundred protein and lipid biomarkers at baseline
Weight at different time points
[Figure: % Weight Change from Baseline (axis -2.5 to 17.5) vs. Time (Week 3, Week 6)]