Tree-Based Methods in Drug Safety Research
Salford Systems Data Mining Conference
March 30, 2006
Page 2 | Copyright © 2006 i3
Outline
In-going question: “Can tree-based methods help here?”
Background
Methods
Focus on splitting rules for CART
Brief description of Stochastic Gradient Boosting and Random Forests
Results
– Question 1 - Associations not discernible by epidemiologic analyses?
– Question 2 - What else can be learned about the outcomes ?
Limitations
Background
Common disorder
Medications
– Potential adverse outcomes
  – Cerebrovascular (n=2 types)
  – Cardiovascular (n=4)
  – Multiple events
  – Death
Concern re: particular drug class
– Focus drug
– Compared with all other drugs used to treat the disorder
Background
Epidemiologic Analyses
– Matched retrospective cohort study
  – Focus drug users, users of other drugs in the class, matched controls
– Four entry points into the study
– Using state-of-the-art propensity score matching
– Conclusion: No difference in the occurrence of cardiovascular or cerebrovascular events in the two treated groups
– Increased risk for those who are treated with any drug in the class compared with the control group
Objectives of the Tree-Based Analyses
Are there patterns of covariates associated with the outcomes that are better identified using tree-based methods?
1. Are there patterns of association between the outcomes and the drugs that are not amenable to standard epidemiologic analyses?
2. What else can be learned about the etiology of the target outcomes from this study?
Methods
Methods - Sample
Medical and pharmacy claims data
– 3 Groups (N ≈ 50,000)
  – Focus drug ≈ 12,500
  – Other drugs ≈ 12,500
  – Controls ≈ 25,000
– Propensity score matched
– 4 Cohorts
  – Entry points into the study
  – 6 months apart
Methods - Outcomes
Outcomes
– Individual (7 binary, one polychotomous – 8 levels)
  – One-at-a-time – individuals with multiple outcomes appear in more than one
  – One polychotomous outcome variable including “Multiple”
– Grouped by type (one polychotomous – 5 levels)
  – Cardiovascular
  – Cerebrovascular
  – Multiple
  – Death
  – None
– Any vs. None (binary)
– Continuous – count of the number of outcomes (0-3)
Methods - Covariates
Covariates
– Prescription drug use (binary indicators)
  – 17 drug classes, total Rx, cardiovascular Rx
– Medical claims 6 months prior to study entry (binary indicators)
  – 19 classes of medical conditions
– Number of hospital visits
  – ER, ICU, psychiatry, psychology, cardiovascular, other, other mental, # stays
– Costs
  – Total, total drug
– Demographics
  – Age (year of birth), gender, region
– No personally identifying information of any kind
Statistical Methods
Statistical Methods
– Recursive partitioning (using CART)
  – More on specific splitting rules …
– Random forests (using Salford’s RF)
  – Methods summary to follow – see Breiman and Cutler refs
– Stochastic gradient boosting (using TreeNet)
  – Methods summary to follow – see Friedman refs
Statistical Methods - Details
CART
– Mix of equal and data priors for CART
– Misclassification costs – case as control = 1.5
Random Forests
– 1,000 trees grown, 3 variables considered at each split
– Balanced prior probabilities
Stochastic Gradient Boosting
– Misclassification costs as in CART
– 500 trees grown
– Balanced prior probabilities
Splitting Rules
Statistical Methods - Splitting Rules
Splitting rules can affect the classification accuracy of a tree
– Some might argue that pruning is equally, if not more, important
In some cases, the purity of particular nodes may be of more interest than overall accuracy
– Given 2 different structures with equal accuracy
Splitting Rules
Conceptually:
Systematically choose a split-point and divide the sample into two groups
Construct measure (impurity or goodness – cup ½ empty…)
Assess split using some weighted combination of the measure and class probabilities
– i.e. a small child node with perfect information contributes only a small amount to the adjudication of the quality of the split
Iterate across all possible splits – choose the “best” using given measure
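As a concrete (and purely illustrative) sketch of that loop in Python: for one continuous covariate, try every candidate threshold and score the split by the weighted impurity of the two children. Gini is used as the measure here, but any of the rules that follow could be dropped in; the data and names are invented for the example.

```python
# Illustrative split search: sort the covariate, iterate over candidate
# thresholds, and score each split by the case-weighted child impurity.

def gini(labels):
    """Node impurity: 1 - sum_j p(j|t)^2."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(x, y):
    """Return (threshold, weighted child impurity) minimizing impurity."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs, ys = [x[i] for i in order], [y[i] for i in order]
    best = (None, float("inf"))
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                      # no split point between tied values
        thr = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        # a small, pure child contributes only its small weight len/total
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (thr, score)
    return best

x = [1, 2, 3, 4, 5, 6]
y = ["control"] * 3 + ["case"] * 3
print(best_split(x, y))   # -> (3.5, 0.0): the split between 3 and 4 is pure
```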
Statistical Methods - Splitting Rules
Gini / Symmetric Gini
Entropy
Twoing / Ordered Twoing
Class probability

[Diagram: a parent node split into left and right children – which rule picks the “best” split?]
Gini – A Little History
Developed by Italian statistician Corrado Gini in 1912
– Designed to measure income inequality
– Can be used to assess inequality – or impurity – in any distribution
– Lorenz curve (1905) – income inequality
Graph: http://en.wikipedia.org/wiki/Gini_coefficient
Gini – A Little Math
GINI “impurity” criterion:

  GINI(t) = 1 − Σ_j [p(j|t)]²

where p(j|t) = relative frequency of class j at node t

Split quality measured by:

  GINI_split = Σ_{j=1}^{k} (n_j / n_p) · GINI(j)

where n_j = number of individuals at child node j
      n_p = number of individuals at parent node p

Split with the minimum GINI index is chosen
Gini – An Example
Simple Gini example: parent node n = 200

                Child 1   Child 2
Cases           50        50
Controls        50        50
Node n          100       100
Prop cases      50%       50%
Prop controls   50%       50%
Gini            Gini(s1) = 0.50   Gini(s2) = 0.50

Weighted split impurity: Gini(1,2) = 0.50
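The slide’s arithmetic can be checked in a few lines (an illustrative sketch, not Salford’s implementation):

```python
# Checking the example's numbers: each child is 50 cases / 50 controls, so
# both child impurities and the weighted split impurity equal 0.50.

def gini_node(cases, controls):
    n = cases + controls
    return 1.0 - (cases / n) ** 2 - (controls / n) ** 2

g1 = gini_node(50, 50)                          # child 1
g2 = gini_node(50, 50)                          # child 2
g_split = (100 / 200) * g1 + (100 / 200) * g2   # n_j / n_p weights
print(g1, g2, g_split)                          # -> 0.5 0.5 0.5
```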
Gini
Categorical variables:
– Count each class
Continuous variables:
– Sort the variable
– Choose one or more splits
Tends to find the largest classes first
Misclassification costs are incorporated by adjusting prior probabilities
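One common way to see the “costs via priors” trick (a sketch under that interpretation; Salford’s exact adjustment may differ): scale each class prior by its misclassification cost and renormalize before computing the impurity.

```python
# Hypothetical sketch: fold misclassification costs into the priors used
# by a Gini-style splitting rule by scaling and renormalizing.

def cost_adjusted_priors(priors, costs):
    """Scale each prior by its misclassification cost, then renormalize."""
    scaled = {c: priors[c] * costs[c] for c in priors}
    total = sum(scaled.values())
    return {c: v / total for c, v in scaled.items()}

# equal priors, but misclassifying a case costs 1.5 (as in this deck)
adj = cost_adjusted_priors({"case": 0.5, "control": 0.5},
                           {"case": 1.5, "control": 1.0})
print(adj)   # -> {'case': 0.6, 'control': 0.4}
```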
Symmetric Gini
Gini criterion using symmetric misclassification costs
– Difference:
  – At the pruning stage, Gini can use nonsymmetric costs
– Symmetric Gini imposes symmetry on the matrix
– Motivation: highly nonsymmetric costs can cause the impurity function to behave badly (not concave)
– See CART monograph for nice example
– Something to consider when costs are important and not balanced
Entropy – A Little History
Claude Shannon introduced the concept in 1948
– “A Mathematical Theory of Communication” – Bell Labs
– Minimum # of bits needed to encode a string of symbols
Many (>30) definitions and interpretations
– Physics: the amount of disorder in a system
– Statistics and Information Theory: information – as in statistical genetics
Entropy is a measure of uncertainty, or conversely, information
Entropy – A Little Math
Entropy calculation:

  i(t) = − Σ_j p(j|t) log[p(j|t)]

(Recall that Gini uses [p(j|t)]²)

Similar evaluation:
– Change in entropy = parent entropy − weighted sum of the child-node entropies
Multi-level outcomes:
– Looks for splits where as many levels as possible are divided perfectly or near perfectly
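The entropy measure is just as easy to compute (illustrative; log base 2 is used here so the units are bits, but any base ranks splits the same way):

```python
import math

# Entropy impurity for a node, from the class probabilities p(j|t).

def entropy(probs):
    """i(t) = -sum_j p(j|t) * log2 p(j|t); empty classes contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # -> 1.0 (maximal uncertainty for two classes)
print(entropy([1.0]))        # -> 0.0 (a pure node carries no uncertainty)
```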
Twoing – History & Math
First proposed in the original 1984 CART monograph
Idea: in a multi-class problem, separate the classes into two “superclasses”
– Choose split with greatest decrease in node impurity
– “Strategic” splits – information re: class similarities
Attempts to find groups of up to 50% of the data each
– Power-modified Twoing forces splits to be close to 50%
Has been suggested for difficult problems – low signal / noise ratio
Twoing criterion:

  (p_L · p_R / 4) · [ Σ_j | p(j|t_L) − p(j|t_R) | ]²
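In code, the criterion reads as follows (an illustrative sketch; p_L and p_R are the shares of the parent node sent left and right):

```python
# Twoing criterion: (p_L * p_R / 4) * (sum_j |p(j|t_L) - p(j|t_R)|)^2

def twoing(left, right, classes):
    n = len(left) + len(right)
    p_l, p_r = len(left) / n, len(right) / n
    diff = sum(abs(left.count(c) / len(left) - right.count(c) / len(right))
               for c in classes)
    return (p_l * p_r / 4) * diff ** 2

# a perfect two-class separation of an evenly split parent node
print(twoing(["case"] * 50, ["control"] * 50, ["case", "control"]))  # -> 0.25
```

A perfect 50/50 separation attains the criterion’s maximum of 0.25, which is why twoing favors splits that carve off large, well-separated superclasses.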
Ordered Twoing
Ordered Twoing
Twoing designed for ordered outcomes
Constraint:
– Only considers grouping adjacent classes
Example:
– Twoing – may consider any combination of classes (e.g., 2,5 vs 1,3,4)
– Ordered twoing – only considers adjacent splits (e.g., 1,2,3 vs 4,5)
– … a severity scale, for example
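The difference in the candidate-split space is easy to count (illustrative):

```python
from itertools import combinations

# Candidate "superclass" groupings for a 5-level ordered outcome: plain
# twoing may pit any subset of levels against the rest, while ordered
# twoing only cuts between adjacent levels (e.g. {1,2,3} vs {4,5}).

levels = [1, 2, 3, 4, 5]

# plain twoing: every nonempty proper subset, counted once per split
# (keeping only subsets containing level 1 removes mirror duplicates)
twoing_splits = [set(c)
                 for r in range(1, len(levels))
                 for c in combinations(levels, r)
                 if 1 in c]

# ordered twoing: only the cuts between adjacent levels
ordered_splits = [set(levels[:i]) for i in range(1, len(levels))]

print(len(twoing_splits), len(ordered_splits))   # -> 15 4
```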
Class Probability
Probability trees instead of classification trees
– Estimate of the probability that a case is in the class
  – Class assignment also
– Measure of association (or affinity) with each class
– Example
  – Useful to estimate the relative probabilities of a disease w/o ruling any out
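A terminal node’s probability estimate is simply its class proportions (an illustrative sketch; the diagnosis labels are invented for the example):

```python
# A probability tree's terminal node returns class proportions rather than
# a single label, so every case carries an affinity to every class.

def node_probabilities(labels):
    """Relative frequency of each class among cases landing in the node."""
    n = len(labels)
    return {c: labels.count(c) / n for c in sorted(set(labels))}

# a node where no outcome can be ruled out yet
probs = node_probabilities(["cardio"] * 6 + ["cerebro"] * 3 + ["none"])
print(probs)   # -> {'cardio': 0.6, 'cerebro': 0.3, 'none': 0.1}
```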
Splitting Rules
Using different rules will have the most effect for multilevel outcomes
Good practice to use several splitting rules and compare the results
Tree-Based Methods
[Diagram: CART – a single tree (parent split into left and right children); Random Forests and Stochastic Gradient Boosting – ensembles of such trees, each casting a vote (Class = 1, Class = 2, …) toward the final classification. “Your vote counts!”]
Model Quality Across Methods
Variable importance
Classification accuracy
Random Forests- Brief Overview
Random Forests
Concept:
– Grow an ensemble of trees using bootstrapped samples
  – Each “votes” on the classification
– Random sample of the data – WITH replacement
  – Usually, about 1/3 of the data are “out of bag” (OOB)
  – Can be used for validation purposes
– New observations are classified by all of the trees
  – Majority vote is the final classification for the observation
Sampling
– Random sample equal to the original sample size (WITH replacement)
– Sub-sample of all available covariates considered at each split
Full trees grown – *no* pruning
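The ~1/3 figure falls out of the sampling scheme: the chance an observation is never drawn in n draws with replacement is (1 − 1/n)^n → 1/e ≈ 0.368. A quick simulation (illustrative only):

```python
import random

# Bootstrap step: draw n observations WITH replacement; indices never
# drawn are "out of bag" (OOB). Expected OOB share is about 1/e = 36.8%.

random.seed(0)                 # fixed seed so the run is reproducible
n = 10_000
in_bag = {random.randrange(n) for _ in range(n)}   # distinct sampled rows
oob_fraction = 1 - len(in_bag) / n
print(round(oob_fraction, 3))  # close to 0.368
```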
Random Forests
The error of a forest depends upon:
– Strength of the individual trees
– Correlation among the trees
Random Forests
Evaluation criteria:
– Classification accuracy on the OOB samples
– Variable importance
  – Relative contribution of each variable to the final classification solution
Random Forests
Proximities for each pair of observations can be estimated
– Useful for clustering / segmentation
Nice feature – not used in these analyses
Available in CART 6.2 and in code from Adele Cutler – flexible
Stochastic Gradient Boosting- Brief Overview
Stochastic Gradient Boosting
Boosting methods:
– Sequential algorithms – weights at a particular step are dependent upon previously fitted functions
Model residuals from previous stage
– Sub-sample of the data
– Small trees, usually with 2 to 8 nodes, at each step.
– Trees are combined by adding individual scores
Stochastic Gradient Boosting
Friedman: a weighting coefficient, or “shrinkage parameter,” is applied at each step
– Slows the learning process
– Results in better classification accuracy (Friedman, 2001)
Version 6.2 allows investigation of 3 shrinkage parameters in one run: 0.001, 0.01, 0.1
– “Battery” option LEARNRATE
Stochastic Gradient Boosting
Distinct advantages
– Much higher accuracy than traditional methods (e.g., logistic or ordinary least squares regression)
– Better accuracy than single-tree methods (e.g., RP) and other parallel “forest” methods (e.g., random forests)
– Model implementation in an independent sample is no more statistically difficult than for single-tree methods
– But models are considerably larger
– No simple graphical representation
– Model characterization:
  – Classification accuracy in test or holdout samples (v-fold cross-validation)
  – Number and relative importance of variables used
Summary
CART
– Single tree-structured model
– Nice graphical representation
– Several methods for splitting criteria
Random Forests
– Ensemble of trees based on bootstrapped samples
– Majority “vote”
– More accurate than CART
– No single graphical representation
Stochastic Gradient Boosting
– Ensemble – small trees based on partial samples
– Models residuals
– Best for many classes of problems
So,
How do these methods help discover the etiology of outcomes in these data?
Results
Question 1: Adverse Drug Events
Adverse Drug Events: Results
Are there patterns of association between the outcomes and specific drugs not amenable to standard epidemiologic analyses?
The short answer is No, in this case.
There were no analyses in which the focus drug, separated from the rest of the class, was associated with the outcome.
Not with any construction of the outcome
Not with any of the tree-based statistical methods
Conclusion: Verification of the epidemiologic analysis
The Good News:
If there is no association, you’re not likely to make one up using Tree-Based methods
Results
Question 2: Etiology of Outcomes in These Data
Etiology: Cardiovascular & Cerebrovascular Outcomes
What else can be learned about the etiology of the target outcomes from this study?
Methods
– Sample of all outcomes and 600 randomly selected non-outcomes constructed (N=883)
– Outcomes examined from individual to any/none (as before)
– RP, SGB, and RF methods employed
  – CART version 6.2
CART Classification Accuracy
“Any / None” outcome, Gini splitting rules
Misclassification cost (1,0) = 1.5, mixed prior probabilities of class membership
Classification Accuracy
– 32% of the sample are cases
– Indices: Estimation: 233, Validation: 215
Prediction Success – Focus Class 1 – Estimation (row %)

Actual Class   Total Cases   Percent Correct   1 (N=366)   0 (N=517)
1              283           75.27             75.27       24.73
0              600           74.50             25.50       74.50

Total: 883   Average: 74.88   Overall % correct: 74.75

Prediction Success – Focus Class 1 – Validation (row %)

Actual Class   Total Cases   Percent Correct   1 (N=403)   0 (N=480)
1              283           72.79             72.79       27.21
0              600           67.17             32.83       67.17

Total: 883   Average: 69.98   Overall % correct: 68.97
CART tree (“Any / None” outcome):

Node 1 (N = 883, Class = Any)
├─ AGE ≤ 39.50 → Node 2 (N = 420, Class = None)
│  ├─ HS_VIZ_OTHER ≤ 5.50 → Terminal Node 1 (N = 369), Class = None (None 326, 88.3%; Any 43, 11.7%)
│  └─ HS_VIZ_OTHER > 5.50 → Terminal Node 2 (N = 51), Class = Any (None 28, 54.9%; Any 23, 45.1%)
└─ AGE > 39.50 → Node 3 (N = 463, Class = Any)
   ├─ EPI_COST ≤ 3698.13 → Node 4 (N = 330, Class = Any)
   │  ├─ BLOCK4$ = (2003_00, …) → Node 5 (N = 265, Class = Any)
   │  │  ├─ AGE ≤ 50.50 → Node 6 (N = 169, Class = Any)
   │  │  │  ├─ EPI_COST ≤ 254.68 → Terminal Node 3 (N = 52), Class = None (None 43, 82.7%; Any 9, 17.3%)
   │  │  │  └─ EPI_COST > 254.68 → Node 7 (N = 117, Class = Any)
   │  │  │     ├─ EPI_COST ≤ 2270.25 → Terminal Node 4 (N = 86), Class = Any (None 45, 52.3%; Any 41, 47.7%)
   │  │  │     └─ EPI_COST > 2270.25 → Terminal Node 5 (N = 31), Class = None (None 25, 80.6%; Any 6, 19.4%)
   │  │  └─ AGE > 50.50 → Terminal Node 6 (N = 96), Class = Any (None 43, 44.8%; Any 53, 55.2%)
   │  └─ BLOCK4$ = (2004_01) → Terminal Node 7 (N = 65), Class = None (None 53, 81.5%; Any 12, 18.5%)
   └─ EPI_COST > 3698.13 → Terminal Node 8 (N = 133), Class = Any (None 37, 27.8%; Any 96, 72.2%)

(Annotations from the original figure: sample mean age = 39.7; EPI_COST mean = $4,068; EPI_COST 75th percentile = $2,911; one split marked “~75th %ile”)
Variable Importance
[Bar chart: standardized variable importance, scale 0–100; variables, top to bottom: Total Cost, Age, Total Rx Cost, Rx History, BL High BP, Lab History, Other Hosp Visits, # Hosp Stays, Cohort, Cardio Rx History, BL Statin, Cardio Hosp Visits, Psych Hosp Visits]
Conclusion
The older you are and the more prior illness you have had, the more susceptible you are to these outcomes.
– Confirms what we already know
Need more information, and probably more accurate classification
– Additional source data – e.g., charts
– Claims data are designed for insurance payment
But, what if…
… I used a different splitting rule?
[Chart: test relative error (y-axis 0.65–0.80) by splitting rule – Symmetric Gini, Gini, Twoing, Ordered Twoing, Entropy, Class Probability]

Across rules: min = 0.6122, median = 0.6273, mean = 0.6496, max = 0.7702
Gini (8 nodes): relative error 0.612
Symmetric Gini (3 nodes): relative error 0.628

Lowest relative error → Gini
Smallest tree → Symmetric Gini
Average Variable Importance
Average variable importance across methods – similar to Gini
… I used a different outcome
4-Level classification (cerebral or cardiac, death, multiple, none)
But what if…
… I used a different tree-based method?
Random Forests – OOB Test Sample Results
459 trees, ROC = 0.60 (4-level outcome, row %):

Actual Class          Total   Percent   Cerebral or       Death     Multiple   None
                      Cases   Correct   cardiac (N=1)     (N=120)   (N=198)    (N=564)
Cerebral or cardiac   233     0.43      0.43              22.32     37.77      39.48
Death                 13      76.92     0                 76.92     0          23.08
Multiple              37      37.84     0                 35.14     37.84      27.03
None                  600     76.50     0                 7.50      16.00      76.50

229 trees, ROC = 0.64 (5-level outcome, row %):

Actual Class   Total   Percent   Cardiac   Cerebral   Death     Multiple   None
               Cases   Correct   (N=0)     (N=134)    (N=108)   (N=144)    (N=497)
Cardiac        197     0         0         21.83      18.27     25.89      34.01
Cerebral       36      19.44     0         19.44      33.33     25.00      22.22
Death          13      69.23     0         7.69       69.23     0          23.08
Multiple       37      27.03     0         18.92      29.73     27.03      24.32
None           600     68.33     0         12.67      6.67      12.33      68.33
Random Forests – OOB Results
325 trees, ROC = 0.76 (Any/None outcome, row %):

Actual Class   Total   Percent   None      Any
               Cases   Correct   (N=612)   (N=271)
None           600     82.83     82.83     17.17
Any            283     59.36     40.64     59.36

SGB?
Stochastic Gradient Boosting (TreeNet)
169 trees, ROC = 0.79 (4-level outcome, row %):

Actual Class          Total   Percent   Cerebral or        Death    Multiple   None
                      Cases   Correct   cardiac (N=555)    (N=36)   (N=52)     (N=240)
Cerebral or cardiac   233     75.97     75.97              6.87     10.30      6.87
Death                 13      61.54     23.08              61.54    0.00       15.38
Multiple              37      8.11      72.97              10.81    8.11       8.11
None                  600     36.50     58.00              1.33     4.17       36.50

114 trees, ROC = 0.84 (Any/None outcome, row %):

Actual Class   Total   Percent   None      Any
               Cases   Correct   (N=517)   (N=366)
None           600     70.33     70.33     29.67
Any            283     66.43     33.57     66.43
Tree-Based Model Comparisons
Classification accuracy – Any/None outcome – validation samples (%)

Method                         Cases   Non-Cases
CART                           72.8    67.2
Random Forests                 59.4    82.8
Stochastic Gradient Boosting   66.4    70.3

Of note: the top variables were the same across methods
Conclusions
There were no associations between the focus drug and the adverse outcomes in the data
– Tree-based methods confirmed the finding
– Methods focus: variable importance
No surprises in the etiology of the outcomes using the available data
– Sometimes simple really is best
– Methods focus: classification accuracy
Methods Conclusions
In this case, CART was the most parsimonious model
– Classification accuracy in the same range as SGB
– Variable importance ranks similar
Advantage: Graphically tractable tree
– Intuitive
– Easy to program
Sometimes, simple is best
Limitations
Bias of claims data
None of these tree-based analyses used “person-time”
– Can be accomplished with Tree-Structured Survival Analytic methods
– Available in S-Plus
– In development at Salford