
STATISTICAL LEARNING FROM DATA

“LIES, DAMNED LIES, AND STATISTICS”, Mark Twain.

Senate Approves Tighter Policing of Drug Makers, May 8, 2007

Mark van der Laan, www.stat.berkeley.edu/~laan

OVERVIEW
How good is the human statistical intuition?
Statistics are a tool of modern life.
The loose statistical world of data analyses and scientific publishing: “Why more than half [data based] published research findings are false?” (John P.A. Ioannidis)
The rigid statistical world of the FDA and Drug Manufacturers: Role of statistics in clinical trials.
Advances in statistics can help but are heavily underused: The challenge.
Can we reduce the cost of prescription drugs through improvements in statistical practice and design?
Can statistics improve safety reviews of drugs in use?
Concluding remarks.

THE QUIZ-MASTER PROBLEM

Behind one of these doors is a car. Behind each of the other two is a goat. Click on the door that you think the car is behind.

YOU SELECT DOOR 1

To keep it exciting, I open one of the other two doors without a car behind it. Obviously the car is not behind door 3. But before I open door 1, the door you selected, I'm going to let you switch to door 2 if you like. Again, click on the door that you think the car is behind.

Congratulations! You're a winner!

Recap: You originally picked door 1 and then switched to door 2.

Here is a summary of how previous contestants have fared.

                 # of Players   Winners   Percent Winners
Switched         131            88        67.2
Didn't Switch    116            42        36.2

The Quiz-Master Problem was raised by a reader in Marilyn vos Savant's Sunday Parade column

Marilyn's answer to the reader was that the contestant should switch doors, and she received nearly 10,000 responses from readers, most of them disagreeing with her. Hundreds of them were from mathematics professors and scientists, whose responses ranged from hostility to disappointment at the nation's lack of mathematical skills.

This problem was given the name The Monty Hall Paradox in honor of the long-time host of the television game show "Let's Make a Deal." Articles about the controversy appeared in the New York Times and other papers around the country.
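The contestant table can be checked with a short simulation. The sketch below assumes the standard rules of the game: the car is placed uniformly at random, and the host always opens a goat door other than the contestant's pick. The function names, number of games and seed are illustrative choices, not part of the original demonstration.

```python
import random


def play(switch: bool, rng: random.Random) -> bool:
    """Play one round and return True if the contestant wins the car."""
    doors = [1, 2, 3]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the single remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch: bool, n_games: int = 100_000, seed: int = 0) -> float:
    """Fraction of games won under a fixed strategy (always switch or always stay)."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(n_games))
    return wins / n_games


if __name__ == "__main__":
    print(f"switch: {win_rate(True):.3f}")
    print(f"stay:   {win_rate(False):.3f}")
```

Under these assumptions the switching strategy wins about two times in three, which is in line with the 67.2% observed among the 131 contestants who switched.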

Statistics are a tool of modern life
Identifying associations, correlations and patterns
Establishing causation based on randomized trials and observational studies.
Making predictions
Shaping strategies and future behavior
  Of people and societies
  Of machines and computing
  Of complex systems

Statistics are often cited to denote a certainty that does not, in fact, exist
“There is increasing concern that in modern research, false findings may be the majority, or even the vast majority, of published research claims.”
“Simulations show that for most study designs and settings it is more likely for a research claim to be false than true.”
“For many current scientific fields, claimed research findings often may be simply accurate measures of the prevailing bias.”
- J.P.A. Ioannidis, “Why Most Published Research Findings are False”, Chance, Vol. 18, No. 4, 2005
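The arithmetic behind these quotes can be sketched with the positive predictive value (PPV) formula from the cited Ioannidis paper, PPV = (1 - β)R / (R - βR + α), where R is the pre-study odds that a tested relationship is true, α is the type I error rate and 1 - β is the power. The version below ignores the bias term of the paper, and the example values of R, α and power are illustrative assumptions only.

```python
def ppv(prior_odds: float, alpha: float = 0.05, power: float = 0.8) -> float:
    """Probability that a claimed (statistically significant) finding is true.

    prior_odds: R, the ratio of true to false relationships among those tested.
    alpha:      type I error rate.
    power:      1 - beta, the probability of detecting a true relationship.
    """
    true_positives = power * prior_odds  # proportional to true findings declared significant
    false_positives = alpha              # proportional to null findings declared significant
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical field: 1 true relationship for every 10 false ones tested.
    for power in (0.8, 0.5, 0.2):
        print(f"R=0.1, power={power}: PPV={ppv(0.1, power=power):.2f}")
```

With low pre-study odds and low power, the PPV drops below one half, which is the sense in which a claim is "more likely to be false than true".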

Causal Inference and Curse of Dimensionality

Causal Model and Data

Text from Taubes NYTimes Article


False conclusions are expensive
In medicine
  False positives lead to expensive additional tests and anxiety
  False negatives lead to delayed treatment with escalated costs and illness
In drug discovery
  False positives lead to failed trials
    The average cost of a phase III clinical trial is $4m-$20m, some cost more than $100m
  False negatives lead to failed trials
    Missed contraindications, negative interactions and imprecise dosages
In genomics, proteomics and chemoinformatics
  False positives are abundant and lead to wasted time, effort and experimentation
  False negatives lead to missed business opportunities
In public policy
  False positives and false negatives lead to action based on false premises and, frequently, public cynicism

Bias is a hazard of statistics
Statistical data samples can be biased
  The sample selected does not represent the population
  Example: There are five redheads in a town of 100 people. Our sample of 20 people happens to include all five.
Statistical methods for learning from data can be biased
  The statistical model selected is not the one that best fits the data… for the question being asked!
Statistical interpretations of findings can be biased.
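For the redhead example, a small calculation shows how unlikely such a sample is under simple random sampling without replacement (an assumption; the slide does not say how the sample was drawn). The function name is illustrative.

```python
from math import comb


def prob_sample_contains_all(population: int, special: int, sample_size: int) -> float:
    """Probability that a simple random sample contains every 'special' member.

    Hypergeometric count: choose all `special` members, then fill the rest of
    the sample from the remaining population.
    """
    if sample_size < special:
        return 0.0
    favourable = comb(population - special, sample_size - special)
    return favourable / comb(population, sample_size)


if __name__ == "__main__":
    p = prob_sample_contains_all(population=100, special=5, sample_size=20)
    print(f"P(all five redheads in the sample) = {p:.6f}")
    print(f"sample share of redheads: {5 / 20:.0%}, population share: {5 / 100:.0%}")
```

A truly random sample would almost never look like this, so a sample in which 25% are redheads, drawn from a town where 5% are, points to a selection mechanism that does not represent the population.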

Variable Importance of HIV resistance mutations
Goal: Rank a set of genetic mutations based on their importance for determining an outcome
  Mutations (A) in the HIV protease enzyme
    Measured by sequencing
  Outcome (Y) = change in viral load 12 weeks after starting new regimen containing saquinavir
How important is each mutation for viral resistance to this specific protease inhibitor drug?
  Inform genotypic scoring systems

Stanford Drug Resistance Database

All Treatment Change Episodes (TCEs) in the Stanford Drug Resistance Database
Patients drawn from 16 clinics in Northern CA
[Timeline figure. Labels: Viral Genotype, Baseline Viral Load, TCE (Change >= 1 Drug), Change in Regimen, Final Viral Load. Intervals: <24 weeks, 12 weeks.]
333 patients on saquinavir regimen

Parameter of Interest
Need to control for a range of other covariates W
  Include: past treatment history, baseline clinical characteristics, non-protease mutations, other drugs in regimen
Parameter of Interest: Variable Importance
  $\psi = E\left[E(Y \mid A_j = 1, W) - E(Y \mid A_j = 0, W)\right]$
  For each protease mutation (indexed by j)
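To make the parameter concrete, here is a minimal sketch of a plug-in estimate of ψ next to the crude (unadjusted) difference in means. Everything in it is an assumption for illustration: the data are simulated rather than taken from the HIV study, W is a single binary covariate, and E(Y|A,W) is estimated by cell means.

```python
import random
from collections import defaultdict


def simulate(n: int, seed: int = 0):
    """Hypothetical data: W confounds the effect of A on Y; the true psi is 1.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = int(rng.random() < 0.5)
        a = int(rng.random() < (0.8 if w else 0.2))  # A is more common when W = 1
        y = 1.0 * a + 2.0 * w + rng.gauss(0.0, 1.0)
        rows.append((y, a, w))
    return rows


def crude_difference(rows) -> float:
    """E(Y|A=1) - E(Y|A=0), ignoring W."""
    y1 = [y for y, a, _ in rows if a == 1]
    y0 = [y for y, a, _ in rows if a == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


def plug_in_importance(rows) -> float:
    """Average over the observed W of E(Y|A=1,W) - E(Y|A=0,W), using cell means."""
    sums, counts = defaultdict(float), defaultdict(int)
    for y, a, w in rows:
        sums[(a, w)] += y
        counts[(a, w)] += 1
    q = {cell: sums[cell] / counts[cell] for cell in sums}
    return sum(q[(1, w)] - q[(0, w)] for _, _, w in rows) / len(rows)


if __name__ == "__main__":
    data = simulate(20_000, seed=1)
    print(f"crude difference: {crude_difference(data):.2f}")
    print(f"plug-in psi:      {plug_in_importance(data):.2f}")
```

In this simulated setting the crude difference absorbs the effect of W, while the adjusted estimate recovers the value built into the simulation.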

Analytic approach
Standard approach:
  Fit a single multivariable regression E(Y|A,W)
  i.e. Regress clinical response on mutations, covariates
Is this the best approach for answering the scientific question of interest?
What is the scientific question?
  Construct best predictor vs. Estimate importance of each mutation

Prediction vs. Importance

Prediction – create a model that the clinician will use to help predict risk of a disease for the patient.

Explanation – trying to investigate the causal association of a treatment or risk factor and a disease outcome.

Targeted Maximum Likelihood
MLE: aims to do a good job of estimating the whole density
Targeted MLE: aims to do a good job at the parameter of interest
  General decrease in bias for parameter of interest
  Protection under the null hypothesis
  Honest p-values, inference, multiple testing

Targeted Maximum Likelihood

In regression case, implementation just involves adding a covariate h(A,W) to the regression model
Requires estimating g(A|W)
  E.g. distribution of each mutation given covariates
Robust: Estimate of ψ is consistent if either
  g(A|W) is estimated consistently, or
  E(Y|A,W) is estimated consistently

More on this later . . .

$h(A, W) = \frac{I(A = 1)}{g(1 \mid W)} - \frac{I(A = 0)}{g(0 \mid W)}, \quad \text{where } g(a \mid W) = \Pr(A = a \mid W)$
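The "add a covariate" step can be sketched as follows, using h(A,W) = I(A=1)/g(1|W) - I(A=0)/g(0|W) with g(a|W) = Pr(A=a|W). This is an illustration under stated assumptions, not the implementation used for the HIV analysis: the data are simulated, W is binary, g is estimated by cell proportions, the initial regression deliberately ignores W (so it is misspecified), and the update is a one-parameter linear fluctuation fitted by least squares.

```python
import random


def simulate(n: int, seed: int = 0):
    """Hypothetical confounded data with a true effect psi = 1.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = int(rng.random() < 0.5)
        a = int(rng.random() < (0.8 if w else 0.2))
        y = 1.0 * a + 2.0 * w + rng.gauss(0.0, 1.0)
        rows.append((y, a, w))
    return rows


def estimate_g(rows):
    """g(1|W=w) estimated as the proportion with A=1 among subjects with W=w."""
    n_w, n_a1 = {0: 0, 1: 0}, {0: 0, 1: 0}
    for _, a, w in rows:
        n_w[w] += 1
        n_a1[w] += a
    return {w: n_a1[w] / n_w[w] for w in n_w}


def clever_covariate(a: int, w: int, g1) -> float:
    """h(A,W) = I(A=1)/g(1|W) - I(A=0)/g(0|W); assumes 0 < g(1|W) < 1."""
    return a / g1[w] - (1 - a) / (1.0 - g1[w])


def targeted_estimate(rows) -> float:
    # Initial fit of E(Y|A,W): means by A only (misspecified on purpose).
    q0 = {}
    for level in (0, 1):
        ys = [y for y, a, _ in rows if a == level]
        q0[level] = sum(ys) / len(ys)
    g1 = estimate_g(rows)
    # Regress the residual on h (no intercept) to get the fluctuation epsilon.
    h = [clever_covariate(a, w, g1) for _, a, w in rows]
    resid = [y - q0[a] for y, a, _ in rows]
    eps = sum(hi * ri for hi, ri in zip(h, resid)) / sum(hi * hi for hi in h)
    # Plug the updated regression Q* = Q0 + eps * h into the parameter mapping.
    total = 0.0
    for _, _, w in rows:
        q1_star = q0[1] + eps * clever_covariate(1, w, g1)
        q0_star = q0[0] + eps * clever_covariate(0, w, g1)
        total += q1_star - q0_star
    return total / len(rows)


if __name__ == "__main__":
    print(f"targeted estimate of psi: {targeted_estimate(simulate(20_000, seed=2)):.2f}")
```

Even though the initial regression is wrong here, a well-estimated g corrects the estimate, which illustrates the robustness property stated on the slide.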

Mutation Rankings Based on Variable Importance

Current Score   Mutation   VIM     VIM p-value   Crude   Crude p-value
35              90M         0.70   0.00           0.76   0.00
40              48VM        0.79   0.00           1.07   0.00
0               30N        -0.78   0.00          -1.06   0.00
10              82AFST      0.46   0.01           0.35   0.03
10              54VA        0.46   0.01           0.31   0.11
10              73CSTA      0.67   0.03           0.80   0.00
2               20IMRTVL    0.32   0.07           0.26   0.18
1               36ILVTA     0.28   0.10           0.27   0.12
2               10FIRVY     0.27   0.13           0.48   0.00
5               88DTG      -0.23   0.24          -0.50   0.33
2               71TVI       0.18   0.29           0.14   0.37
5               32I        -0.18   0.58          -0.20   0.55
2               63P         0.06   0.77           0.11   0.56
5               46ILV       0.13   0.98           0.27   0.10

“Better Evaluation Tools – Biomarkers and Disease”
#1 highly-targeted research project in FDA “Critical Path Initiative”
  Requests “clarity on the conceptual framework and evidentiary standards for qualifying a biomarker for various purposes”
  “Accepted standards for demonstrating comparability of results, … or for biological interpretation of significant gene expression changes or mutations”
Proper identification of biomarkers can . . .
  Identify patient risk or disease susceptibility
  Determine appropriate treatment regime
  Detect disease progression and clinical outcomes
  Assess therapy effectiveness
  Determine level of disease activity
  etc . . .

Evaluation of Biomarker Discovery Methods

> Univariate Linear Regression
  • Importance measure: Coefficient value with associated p-value
  • Measures marginal association
> RandomForest (Breiman 2001)
  • Importance measures (no p-values)
    RF1: variable’s influence on error rate
    RF2: mean improvement in node splits due to variable
> Variable Importance with LARS
  • Importance measure: causal effect
  • Formal inference, p-values provided
  • LARS used to fit initial E[Y|A,W] estimate
  • W={marginally significant covariates}
All p-values are FDR adjusted
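The slide does not say which FDR adjustment was applied. As an illustration, the sketch below implements the Benjamini-Hochberg step-up adjustment, which is one common choice and an assumption here. The p-values in the usage example are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    The i-th smallest p-value is scaled by m / i, then a running minimum taken
    from the largest p-value downward makes the adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"raw={p:.3f}  adjusted={q:.3f}")
```

Variables whose adjusted p-value falls below the chosen FDR level (for example 0.05) would be reported as discoveries.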

Simulation Study: Methods
> Test methods’ ability to determine “true” variables under increasing correlation conditions
  • Ranking by measure and p-value
  • Minimal list necessary to get all “true”?
> Variables
  • Block diagonal correlation structure: 10 independent sets of 10
  • Multivariate normal distribution
  • Constant ρ, variance=1
  • ρ={0,0.1,0.2,0.3,…,0.9}
> Outcome
  • Main effect linear model
  • 10 “true” biomarkers, one variable from each set of 10
  • Equal coefficients
  • Noise term with mean=0, sigma=10 (“realistic noise”)
Importance measure: $E_p(Y \mid A = a, W) - E_p(Y \mid A = 0, W)$
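The simulation design can be sketched in a few lines. The block structure, unit variances, constant within-block correlation ρ and noise sigma of 10 follow the slide. The sample size, the coefficient value `beta`, the choice of the first variable in each block as the “true” one, and the use of absolute marginal correlation as the ranking score (a stand-in for the univariate regression ranking) are assumptions made for illustration.

```python
import math
import random


def simulate_biomarkers(n, rho, beta=1.0, n_blocks=10, block_size=10,
                        noise_sd=10.0, seed=0):
    """Draw n rows of 100 normal variables in 10 equicorrelated blocks, plus an outcome."""
    rng = random.Random(seed)
    true_idx = [b * block_size for b in range(n_blocks)]  # one "true" variable per block
    X, y = [], []
    for _ in range(n):
        row = []
        for _block in range(n_blocks):
            shared = rng.gauss(0.0, 1.0)  # common factor gives correlation rho within a block
            for _var in range(block_size):
                row.append(math.sqrt(rho) * shared
                           + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0))
        X.append(row)
        y.append(beta * sum(row[j] for j in true_idx) + rng.gauss(0.0, noise_sd))
    return X, y, true_idx


def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def minimal_list_length(X, y, true_idx):
    """Length of the shortest top-ranked list that contains every true variable."""
    p = len(X[0])
    scores = [abs(correlation([row[j] for row in X], y)) for j in range(p)]
    ranking = sorted(range(p), key=lambda j: scores[j], reverse=True)
    position = {j: r + 1 for r, j in enumerate(ranking)}
    return max(position[j] for j in true_idx)


if __name__ == "__main__":
    for rho in (0.0, 0.3, 0.6):
        X, y, true_idx = simulate_biomarkers(n=1000, rho=rho, seed=3)
        print(f"rho={rho}: minimal list length = {minimal_list_length(X, y, true_idx)}")
```

A list length of 10 is the best possible value; longer lists mean that variables correlated with a true biomarker are ranked ahead of a true one.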

Simulation Results

No appreciable difference in ranking by importance measure or p-value; the plot is with respect to ranked importance measures

List length for linear regression and randomForest increases with increasing correlation; Variable Importance w/LARS stays near the minimum (10) through ρ=0.6, with only small decreases in power

Linear regression list length is 2X Variable Importance list length at ρ=0.4 and 4X at ρ=0.6

RandomForest (RF2) list length is consistently shorter than linear regression but is still 50% longer than the Variable Importance list length at ρ=0.4, and twice as long at ρ=0.6

Variable importance coupled with LARS estimates true causal effect and outperforms both linear regression and randomForest

[Figure: Minimal list length to obtain all 10 “true” variables. x-axis: Correlation (0 to 0.9); y-axis: List Length (0 to 100); series: Linear Reg, VImp w/LARS, RF1, RF2.]

THE “RIGID” STATISTICAL WORLD OF THE FDA
Clinical Trials
Rigid statistical methodology and designs required for FDA approval.

Clinical trials are expensive
Time to market is critical
  Half of the time-to-approval (currently 15.3 yrs) is spent in clinical trials
  Each day of delay is expensive
    Moderately successful drug: $1m per day in lost sales
    Blockbuster drug: $3m per day
A lot of money is involved
  Spending on US-sponsored clinical trials is $25.6b in 2006
    Biotech + Pharma = $22.6b, NIH = $3b
  9,937 trials this year
  Pharma: 71% of R&D goes to drug development, 45% of this goes to clinical trials
Recruiting patients is expensive
  Direct costs of patient recruitment are high: $440m per year
  Indirect costs due to delays
    #1 contributor to drug application delays
    94% of trials in US miss their enrollment deadlines (Europe: 82%)
    80% are delayed at least one month
Drop outs are a major problem
  1 of 4 volunteers drops out of a study after it begins

MOVING TOWARDS ADAPTIVE CLINICAL TRIALS
“A widely noted survey by Accenture provided some alarming figures a few years ago: Eighty-nine percent of all drug candidates from the initiation of Phase I through FDA approval fail and many of them in the clinic.

Clearly, any techniques that could give an earlier read on these issues would be valuable. In too many cases, the chief result of a trial is to show that the trial itself was set up wrong, in ways that only became clear after the data were un-blinded. Did the numbers show that your dosage was suboptimal partway into a two year trial? Too bad - you probably weren't allowed to know that. Were several arms of your study obviously pointless from the start? Even if you know, what could you do about it without harming the validity of the whole effort? Over the last years, such concerns have stimulated an unprecedented amount of work on new approaches. Ideas have come from industry, academia, and regulatory agencies such as the FDA's critical path initiative. A common theme in these efforts has been to move toward adaptive clinical trials.”

Approval of Drugs, and Post-Market Safety Reviews of Drugs

FDA approvals are based on inefficient and often biased statistical methods (e.g., in how they deal with informative drop out).
FDA does not have expertise to do post market safety reviews, since this requires expertise in the challenging field of causal inference.
FDA needs to be modernized and needs to make strong alliances with academic and industrial centers of excellence.

Senate Approves Tighter Policing of Drug Makers, May 8, 2007

Statistical Innovations are Available
Statistical Inference for Adaptive designs.
Targeted (Maximum Likelihood) Learning in Biomarker Discovery
Targeted (Maximum Likelihood) Learning of Causal Effects of drugs and other interventions
Targeted Learning of Treatment effect modification due to genetic and genomic factors (Multiple Testing).
Learning of Individualized Treatment Rules (e.g., individualized medicine).
Super Learning in Prediction

CONCLUDING REMARKS:
“Statistics” can do a lot of harm in the hands of people.
Any published statistical analysis should be based on publicly available data.
Any statistical analysis should be based on an a priori specified analysis plan (MACHINE LEARNING) and any HUMAN INDUCED deviation from it should be documented (just like the FDA requires!).
Statistical tools used in practice need to be improved: FDA needs to be modernized with well founded statistical innovations.
