Tree-Based Methods in Drug Safety Research
Salford Systems Data Mining Conference
March 30, 2006
Page 2 | Copyright © 2006 i3
Outline
In-going question: “Can tree-based methods help here?”
Background
Methods
Focus on splitting rules for CART
Brief description of Stochastic Gradient Boosting and Random Forests
Results
– Question 1 - Associations not discernible by epidemiologic analyses?
– Question 2 - What else can be learned about the outcomes ?
Limitations
Background
Common disorder
Medications
– Potential adverse outcomes
  – Cerebrovascular (n=2 types)
  – Cardiovascular (n=4)
  – Multiple events
  – Death
Concern re: particular drug class
– Focus drug
– Compared with all other drugs used to treat the disorder
Background
Epidemiologic Analyses
– Matched retrospective cohort study
  – Focus drug users, users of other drugs in the class, matched controls
– Four entry points into the study
– Using state-of-the-art propensity score matching
– Conclusion: No difference in the occurrence of cardiovascular or cerebrovascular events in the two treated groups
– Increased risk for those who are treated with any drug in the class compared with the control group
Objectives of the Tree-Based Analyses
Are there patterns of covariates associated with the outcomes that are better identified using tree-based methods?
1. Are there patterns of association between the outcomes and the drugs that are not amenable to standard epidemiologic analyses?
2. What else can be learned about the etiology of the target outcomes from this study?
Methods
Methods - Sample
Medical and pharmacy claims data
– 3 Groups (N ≈ 50,000)
  – Focus drug ≈ 12,500
  – Other drugs ≈ 12,500
  – Controls ≈ 25,000
– Propensity score matched
– 4 Cohorts
  – Entry points into the study
  – 6 months apart
Methods - Outcomes
Outcomes
– Individual (7 binary, one polychotomous – 8 levels)
  – One-at-a-time – individuals with multiple outcomes appear in more than one
  – One polychotomous outcome variable including “Multiple”
– Grouped by type (one polychotomous – 5 levels)
  – Cardiovascular
  – Cerebrovascular
  – Multiple
  – Death
  – None
– Any vs. None (binary)
– Continuous – count of the number of outcomes (0-3)
Methods - Covariates
Covariates
– Prescription drug use (binary indicators)
  – 17 drug classes, total Rx, cardiovascular Rx
– Medical claims 6 months prior to study entry (binary indicators)
  – 19 classes of medical conditions
– Number of hospital visits
  – ER, ICU, psychiatry, psychology, cardiovascular, other, other mental, # stays
– Costs
  – Total, total drug
– Demographics
  – Age (year of birth), gender, region
– No personally identifying information of any kind
Statistical Methods
Statistical Methods
– Recursive partitioning (using CART)
  – More on specific splitting rules …
– Random forests (using Salford’s RF)
  – Methods summary to follow – see Breiman and Cutler refs
– Stochastic gradient boosting (using TreeNet)
  – Methods summary to follow – see Friedman refs
Statistical Methods - Details
CART
– Mix of equal and data priors for CART
– Misclassification costs – case as control = 1.5
Random Forests
– 1,000 trees grown, 3 variables considered at each split
– Balanced prior probabilities
Stochastic Gradient Boosting
– Misclassification costs as in CART
– 500 trees grown
– Balanced prior probabilities
Splitting Rules
Statistical Methods - Splitting Rules
Splitting rules can affect the classification accuracy of a tree
– Some might argue that pruning is equally, if not more, important
In some cases, the purity of particular nodes may be of more interest than overall accuracy
– Given 2 different structures with equal accuracy
Splitting Rules
Conceptually:
Systematically choose a split-point and divide the sample into two groups
Construct measure (impurity or goodness – cup ½ empty…)
Assess split using some weighted combination of the measure and class probabilities
– i.e. a small child node with perfect information contributes only a small amount to the adjudication of the quality of the split
Iterate across all possible splits – choose the “best” using given measure
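As a concrete (and purely illustrative) sketch of that loop in Python: for one continuous covariate, try every candidate threshold and score the split by the weighted impurity of the two children. Gini is used as the measure here, but any of the rules that follow could be dropped in; the data and names are invented for the example.

```python
# Illustrative split search: sort the covariate, iterate over candidate
# thresholds, and score each split by the case-weighted child impurity.

def gini(labels):
    """Node impurity: 1 - sum_j p(j|t)^2."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(x, y):
    """Return (threshold, weighted child impurity) minimizing impurity."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs, ys = [x[i] for i in order], [y[i] for i in order]
    best = (None, float("inf"))
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                      # no split point between tied values
        thr = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        # a small, pure child contributes only its small weight len/total
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (thr, score)
    return best

x = [1, 2, 3, 4, 5, 6]
y = ["control"] * 3 + ["case"] * 3
print(best_split(x, y))   # -> (3.5, 0.0): the split between 3 and 4 is pure
```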
Statistical Methods - Splitting Rules
Gini / Symmetric Gini
Entropy
Twoing / Ordered Twoing
Class probability

[Diagram: a parent node split into left and right children – which rule picks the “best” split?]
Gini – A Little History
Developed by Italian statistician Corrado Gini in 1912
– Designed to measure income inequality
– Can be used to assess inequality – or impurity – in any distribution
– Lorenz curve (1905) – income inequality
Graph: http://en.wikipedia.org/wiki/Gini_coefficient
Gini – A Little Math
GINI “impurity” criterion:

  GINI(t) = 1 − Σ_j [p(j|t)]²

where p(j|t) = relative frequency of class j at node t

Split quality measured by:

  GINI_split = Σ_{j=1}^{k} (n_j / n_p) · GINI(j)

where n_j = number of individuals at child node j
      n_p = number of individuals at parent node p

Split with the minimum GINI index is chosen
Gini – An Example
Simple Gini example: parent node n = 200

                Child 1   Child 2
Cases           50        50
Controls        50        50
Node n          100       100
Prop cases      50%       50%
Prop controls   50%       50%
Gini            Gini(s1) = 0.50   Gini(s2) = 0.50

Weighted split impurity: Gini(1,2) = 0.50
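The slide’s arithmetic can be checked in a few lines (an illustrative sketch, not Salford’s implementation):

```python
# Checking the example's numbers: each child is 50 cases / 50 controls, so
# both child impurities and the weighted split impurity equal 0.50.

def gini_node(cases, controls):
    n = cases + controls
    return 1.0 - (cases / n) ** 2 - (controls / n) ** 2

g1 = gini_node(50, 50)                          # child 1
g2 = gini_node(50, 50)                          # child 2
g_split = (100 / 200) * g1 + (100 / 200) * g2   # n_j / n_p weights
print(g1, g2, g_split)                          # -> 0.5 0.5 0.5
```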
Gini
Categorical variables:
– Count each class
Continuous variables:
– Sort the variable
– Choose one or more splits
Tends to find the largest classes first
Misclassification costs are incorporated by adjusting prior probabilities
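One common way to see the “costs via priors” trick (a sketch under that interpretation; Salford’s exact adjustment may differ): scale each class prior by its misclassification cost and renormalize before computing the impurity.

```python
# Hypothetical sketch: fold misclassification costs into the priors used
# by a Gini-style splitting rule by scaling and renormalizing.

def cost_adjusted_priors(priors, costs):
    """Scale each prior by its misclassification cost, then renormalize."""
    scaled = {c: priors[c] * costs[c] for c in priors}
    total = sum(scaled.values())
    return {c: v / total for c, v in scaled.items()}

# equal priors, but misclassifying a case costs 1.5 (as in this deck)
adj = cost_adjusted_priors({"case": 0.5, "control": 0.5},
                           {"case": 1.5, "control": 1.0})
print(adj)   # -> {'case': 0.6, 'control': 0.4}
```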
Symmetric Gini
Gini criterion using symmetric misclassification costs
– Difference:
  – At the pruning stage, Gini can use nonsymmetric costs
– Symmetric Gini imposes symmetry on the matrix
– Motivation: highly nonsymmetric costs can cause the impurity function to behave badly (not concave)
– See CART monograph for nice example
– Something to consider when costs are important and not balanced
Entropy – A Little History
Claude Shannon introduced the concept in 1948
– “A Mathematical Theory of Communication” – Bell Labs
– Minimum # of bits needed to encode a string of symbols
Many (>30) definitions and interpretations
– Physics: the amount of disorder in a system
– Statistics and Information Theory: information – as in statistical genetics
Entropy is a measure of uncertainty, or conversely, information
Entropy – A Little Math
Entropy calculation:

  i(t) = − Σ_j p(j|t) log[p(j|t)]

(Recall that Gini uses [p(j|t)]²)

Similar evaluation:
– Change in entropy = parent entropy − weighted sum of the child-node entropies
Multi-level outcomes:
– Looks for splits where as many levels as possible are divided perfectly or near perfectly
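The entropy measure is just as easy to compute (illustrative; log base 2 is used here so the units are bits, but any base ranks splits the same way):

```python
import math

# Entropy impurity for a node, from the class probabilities p(j|t).

def entropy(probs):
    """i(t) = -sum_j p(j|t) * log2 p(j|t); empty classes contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # -> 1.0 (maximal uncertainty for two classes)
print(entropy([1.0]))        # -> 0.0 (a pure node carries no uncertainty)
```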
Twoing – History & Math
First proposed in the original 1984 CART monograph
Idea: in a multi-class problem, separate the classes into two “superclasses”
– Choose split with greatest decrease in node impurity
– “Strategic” splits – information re: class similarities
Attempts to find groups of up to 50% of the data each
– Power-modified Twoing forces splits to be close to 50%
Has been suggested for difficult problems – low signal / noise ratio
Twoing criterion:

  (p_L · p_R / 4) · [ Σ_j | p(j|t_L) − p(j|t_R) | ]²
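In code, the criterion reads as follows (an illustrative sketch; p_L and p_R are the shares of the parent node sent left and right):

```python
# Twoing criterion: (p_L * p_R / 4) * (sum_j |p(j|t_L) - p(j|t_R)|)^2

def twoing(left, right, classes):
    n = len(left) + len(right)
    p_l, p_r = len(left) / n, len(right) / n
    diff = sum(abs(left.count(c) / len(left) - right.count(c) / len(right))
               for c in classes)
    return (p_l * p_r / 4) * diff ** 2

# a perfect two-class separation of an evenly split parent node
print(twoing(["case"] * 50, ["control"] * 50, ["case", "control"]))  # -> 0.25
```

A perfect 50/50 separation attains the criterion’s maximum of 0.25, which is why twoing favors splits that carve off large, well-separated superclasses.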
Ordered Twoing
Ordered Twoing
Twoing designed for ordered outcomes
Constraint:
– Only considers grouping adjacent classes
Example:
– Twoing – may consider any combination of classes (e.g., 2,5 vs 1,3,4)
– Ordered twoing – only considers adjacent splits (e.g., 1,2,3 vs 4,5)
– … a severity scale, for example
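The difference in the candidate-split space is easy to count (illustrative):

```python
from itertools import combinations

# Candidate "superclass" groupings for a 5-level ordered outcome: plain
# twoing may pit any subset of levels against the rest, while ordered
# twoing only cuts between adjacent levels (e.g. {1,2,3} vs {4,5}).

levels = [1, 2, 3, 4, 5]

# plain twoing: every nonempty proper subset, counted once per split
# (keeping only subsets containing level 1 removes mirror duplicates)
twoing_splits = [set(c)
                 for r in range(1, len(levels))
                 for c in combinations(levels, r)
                 if 1 in c]

# ordered twoing: only the cuts between adjacent levels
ordered_splits = [set(levels[:i]) for i in range(1, len(levels))]

print(len(twoing_splits), len(ordered_splits))   # -> 15 4
```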
Class Probability
Probability trees instead of classification trees
– Estimate of the probability that a case is in the class
  – Class assignment also
– Measure of association (or affinity) with each class
– Example
  – Useful to estimate the relative probabilities of a disease w/o ruling any out
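A terminal node’s probability estimate is simply its class proportions (an illustrative sketch; the diagnosis labels are invented for the example):

```python
# A probability tree's terminal node returns class proportions rather than
# a single label, so every case carries an affinity to every class.

def node_probabilities(labels):
    """Relative frequency of each class among cases landing in the node."""
    n = len(labels)
    return {c: labels.count(c) / n for c in sorted(set(labels))}

# a node where no outcome can be ruled out yet
probs = node_probabilities(["cardio"] * 6 + ["cerebro"] * 3 + ["none"])
print(probs)   # -> {'cardio': 0.6, 'cerebro': 0.3, 'none': 0.1}
```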
Splitting Rules
Using different rules will have the most effect for multilevel outcomes
Good practice to use several splitting rules and compare the results
Tree-Based Methods
[Diagram: CART – a single tree (parent split into left and right children); Random Forests and Stochastic Gradient Boosting – ensembles of such trees, each casting a vote (Class = 1, Class = 2, …) toward the final classification. “Your vote counts!”]
Model Quality Across Methods
Variable importance
Classification accuracy
Random Forests- Brief Overview
Random Forests
Concept:
– Grow an ensemble of trees using bootstrapped samples
  – Each “votes” on the classification
– Random sample of the data – WITH replacement
  – Usually, about 1/3 of the data are “out of bag” (OOB)
  – Can be used for validation purposes
– New observations are classified by all of the trees
  – Majority vote is the final classification for the observation
Sampling
– Random sample equal to the original sample size (WITH replacement)
– Sub-sample of all available covariates considered at each split
Full trees grown – *no* pruning
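The ~1/3 figure falls out of the sampling scheme: the chance an observation is never drawn in n draws with replacement is (1 − 1/n)^n → 1/e ≈ 0.368. A quick simulation (illustrative only):

```python
import random

# Bootstrap step: draw n observations WITH replacement; indices never
# drawn are "out of bag" (OOB). Expected OOB share is about 1/e = 36.8%.

random.seed(0)                 # fixed seed so the run is reproducible
n = 10_000
in_bag = {random.randrange(n) for _ in range(n)}   # distinct sampled rows
oob_fraction = 1 - len(in_bag) / n
print(round(oob_fraction, 3))  # close to 0.368
```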
Random Forests
The error of a forest depends upon:
– Strength of the individual trees
– Correlation among the trees
Random Forests
Evaluation criteria:
– Classification accuracy on the OOB samples
– Variable importance
  – Relative contribution of each variable to the final classification solution
Random Forests
Proximities for each pair of observations can be estimated
– Useful for clustering / segmentation
Nice feature – not used in these analyses
Available in CART 6.2 and in code from Adele Cutler – flexible
Stochastic Gradient Boosting- Brief Overview
Stochastic Gradient Boosting
Boosting methods:
– Sequential algorithms – weights at a particular step are dependent upon previously fitted functions
Model residuals from previous stage
– Sub-sample of the data
– Small trees, usually with 2 to 8 nodes, at each step.
– Trees are combined by adding individual scores
Stochastic Gradient Boosting
Friedman: a weighting coefficient, or “shrinkage parameter,” is applied at each step
– Slows the learning process
– Results in better classification accuracy (Friedman, 2001)
Version 6.2 allows investigation of 3 shrinkage parameters in one run: 0.001, 0.01, 0.1
– “Battery” option LEARNRATE
Stochastic Gradient Boosting
Distinct advantages
– Much higher accuracy than traditional methods (e.g., logistic or ordinary least squares regression)
– Better accuracy than single-tree methods (e.g., RP) and other parallel “forest” methods (e.g., random forests)
– Model implementation in an independent sample is no more statistically difficult than for single-tree methods
– But models are considerably larger
– No simple graphical representation
– Model characterization:
  – Classification accuracy in test or holdout samples (v-fold cross-validation)
  – Number and relative importance of variables used
Summary
CART
– Single tree-structured model
– Nice graphical representation
– Several methods for splitting criteria
Random Forests
– Ensemble of trees based on bootstrapped samples
– Majority “vote”
– More accurate than CART
– No single graphical representation
Stochastic Gradient Boosting
– Ensemble – small trees based on partial samples
– Models residuals
– Best for many classes of problems
So,
How do these methods help discover the etiology of outcomes in these data?
Results
Question 1: Adverse Drug Events
Adverse Drug Events: Results
Are there patterns of association between the outcomes and specific drugs not amenable to standard epidemiologic analyses?
The short answer is No, in this case.
There were no analyses in which the focus drug, separated from the rest of the class, was associated with the outcome.
Not with any construction of the outcome
Not with any of the tree-based statistical methods
Conclusion: Verification of the epidemiologic analysis
The Good News:
If there is no association, you’re not likely to make one up using Tree-Based methods
Results
Question 2: Etiology of Outcomes in These Data
Etiology: Cardiovascular & Cerebrovascular Outcomes
What else can be learned about the etiology of the target outcomes from this study?
Methods
– Sample of all outcomes and 600 randomly selected non-outcomes constructed (N=883)
– Outcomes examined from individual to any/none (as before)
– RP, SGB, and RF methods employed
  – CART version 6.2
CART Classification Accuracy
“Any / None” outcome, Gini splitting rules
Misclassification cost (1,0) = 1.5, mixed prior probabilities of class membership
Classification Accuracy
– 32% of the sample are cases
– Indices: Estimation: 233, Validation: 215
Prediction Success – Focus Class 1 – Estimation (row %)

Actual Class   Total Cases   Percent Correct   1 (N=366)   0 (N=517)
1              283           75.27             75.27       24.73
0              600           74.50             25.50       74.50

Total: 883   Average: 74.88   Overall % correct: 74.75

Prediction Success – Focus Class 1 – Validation (row %)

Actual Class   Total Cases   Percent Correct   1 (N=403)   0 (N=480)
1              283           72.79             72.79       27.21
0              600           67.17             32.83       67.17

Total: 883   Average: 69.98   Overall % correct: 68.97
CART tree (“Any / None” outcome):

Node 1 (N = 883, Class = Any)
├─ AGE ≤ 39.50 → Node 2 (N = 420, Class = None)
│  ├─ HS_VIZ_OTHER ≤ 5.50 → Terminal Node 1 (N = 369), Class = None (None 326, 88.3%; Any 43, 11.7%)
│  └─ HS_VIZ_OTHER > 5.50 → Terminal Node 2 (N = 51), Class = Any (None 28, 54.9%; Any 23, 45.1%)
└─ AGE > 39.50 → Node 3 (N = 463, Class = Any)
   ├─ EPI_COST ≤ 3698.13 → Node 4 (N = 330, Class = Any)
   │  ├─ BLOCK4$ = (2003_00, …) → Node 5 (N = 265, Class = Any)
   │  │  ├─ AGE ≤ 50.50 → Node 6 (N = 169, Class = Any)
   │  │  │  ├─ EPI_COST ≤ 254.68 → Terminal Node 3 (N = 52), Class = None (None 43, 82.7%; Any 9, 17.3%)
   │  │  │  └─ EPI_COST > 254.68 → Node 7 (N = 117, Class = Any)
   │  │  │     ├─ EPI_COST ≤ 2270.25 → Terminal Node 4 (N = 86), Class = Any (None 45, 52.3%; Any 41, 47.7%)
   │  │  │     └─ EPI_COST > 2270.25 → Terminal Node 5 (N = 31), Class = None (None 25, 80.6%; Any 6, 19.4%)
   │  │  └─ AGE > 50.50 → Terminal Node 6 (N = 96), Class = Any (None 43, 44.8%; Any 53, 55.2%)
   │  └─ BLOCK4$ = (2004_01) → Terminal Node 7 (N = 65), Class = None (None 53, 81.5%; Any 12, 18.5%)
   └─ EPI_COST > 3698.13 → Terminal Node 8 (N = 133), Class = Any (None 37, 27.8%; Any 96, 72.2%)

(Annotations from the original figure: sample mean age = 39.7; EPI_COST mean = $4,068; EPI_COST 75th percentile = $2,911; one split marked “~75th %ile”)
Variable Importance
[Bar chart: standardized variable importance, scale 0–100; variables, top to bottom: Total Cost, Age, Total Rx Cost, Rx History, BL High BP, Lab History, Other Hosp Visits, # Hosp Stays, Cohort, Cardio Rx History, BL Statin, Cardio Hosp Visits, Psych Hosp Visits]
Conclusion
The older you are and the more prior illness you have had, the more susceptible you are to these outcomes.
– Confirms what we already know
Need more information, and probably more accurate classification
– Additional source data – e.g., charts
– Claims data are designed for insurance payment
But, what if…
… I used a different splitting rule?
[Chart: test relative error (y-axis 0.65–0.80) by splitting rule – Symmetric Gini, Gini, Twoing, Ordered Twoing, Entropy, Class Probability]

Across rules: min = 0.6122, median = 0.6273, mean = 0.6496, max = 0.7702
Gini (8 nodes): relative error 0.612
Symmetric Gini (3 nodes): relative error 0.628

Lowest relative error → Gini
Smallest tree → Symmetric Gini
Average Variable Importance
Average variable importance across methods – similar to Gini
… I used a different outcome
4-Level classification (cerebral or cardiac, death, multiple, none)
But what if…
… I used a different tree-based method?
Random Forests – OOB Test Sample Results
459 trees, ROC = 0.60 (4-level outcome, row %):

Actual Class          Total   Percent   Cerebral or       Death     Multiple   None
                      Cases   Correct   cardiac (N=1)     (N=120)   (N=198)    (N=564)
Cerebral or cardiac   233     0.43      0.43              22.32     37.77      39.48
Death                 13      76.92     0                 76.92     0          23.08
Multiple              37      37.84     0                 35.14     37.84      27.03
None                  600     76.50     0                 7.50      16.00      76.50

229 trees, ROC = 0.64 (5-level outcome, row %):

Actual Class   Total   Percent   Cardiac   Cerebral   Death     Multiple   None
               Cases   Correct   (N=0)     (N=134)    (N=108)   (N=144)    (N=497)
Cardiac        197     0         0         21.83      18.27     25.89      34.01
Cerebral       36      19.44     0         19.44      33.33     25.00      22.22
Death          13      69.23     0         7.69       69.23     0          23.08
Multiple       37      27.03     0         18.92      29.73     27.03      24.32
None           600     68.33     0         12.67      6.67      12.33      68.33
Random Forests – OOB Results
325 trees, ROC = 0.76 (Any/None outcome, row %):

Actual Class   Total   Percent   None      Any
               Cases   Correct   (N=612)   (N=271)
None           600     82.83     82.83     17.17
Any            283     59.36     40.64     59.36

SGB?
Stochastic Gradient Boosting (TreeNet)
169 trees, ROC = 0.79 (4-level outcome, row %):

Actual Class          Total   Percent   Cerebral or        Death    Multiple   None
                      Cases   Correct   cardiac (N=555)    (N=36)   (N=52)     (N=240)
Cerebral or cardiac   233     75.97     75.97              6.87     10.30      6.87
Death                 13      61.54     23.08              61.54    0.00       15.38
Multiple              37      8.11      72.97              10.81    8.11       8.11
None                  600     36.50     58.00              1.33     4.17       36.50

114 trees, ROC = 0.84 (Any/None outcome, row %):

Actual Class   Total   Percent   None      Any
               Cases   Correct   (N=517)   (N=366)
None           600     70.33     70.33     29.67
Any            283     66.43     33.57     66.43
Tree-Based Model Comparisons
Classification accuracy – Any/None outcome – validation samples (%)

Method                         Cases   Non-Cases
CART                           72.8    67.2
Random Forests                 59.4    82.8
Stochastic Gradient Boosting   66.4    70.3

Of note: the top variables were the same across methods
Conclusions
There were no associations between the focus drug and the adverse outcomes in the data
– Tree-based methods confirmed the finding
– Methods focus: variable importance
No surprises in the etiology of the outcomes using the available data
– Sometimes simple really is best
– Methods focus: classification accuracy
Methods Conclusions
In this case, CART was the most parsimonious model
– Classification accuracy in the same range as SGB
– Variable importance ranks similar
Advantage: Graphically tractable tree
– Intuitive
– Easy to program
Sometimes, simple is best
Limitations
Bias of claims data
None of these tree-based analyses used “person-time”
– Can be accomplished with Tree-Structured Survival Analytic methods
– Available in S-Plus
– In development at Salford