
Non-replicating comments on replication

Steven Goodman, MD, PhD

Johns Hopkins University

SAMSI Workshop

July 10-12, 2006

Things identified as cancer risks

(Simon and Altman, JNCI, 1994)

Electric Razors

Broken arms (in women)

Fluorescent lights

Allergies

Breeding reindeer

Being a waiter

Owning a pet bird

Being short

Being tall

Hot dogs

Having a refrigerator!!

Outline

Glaring examples

P-value/replication misconceptions

Ioannidis methods/conclusions

Evidence of selective reporting

Reproducible research

“We have no idea how or why the magnets work.”

“A real breakthrough…”

“…the [study] must be regarded as preliminary….” “But…the early results were clear and... the treatment ought to be put to use immediately.”

FDA Discussion, cont. (Fisher, CCT, 20:16-39,1999)

Dr. Lipicky: What are the p-values needed for the secondary endpoints? …Certainly we’re not talking 0.05 anymore. …You’re out of this 0.05 stuff and I would have liked to have seen what you thought was significant and at what level…

What p-value tells you that it’s there study after study?

Dr. Konstam: …what kind of statistical correction would you have to do to that survival data given the fact that it’s not a specified endpoint? I have no idea how to do that from a mathematical viewpoint.

Replication probability, as a function of the p-value

P-value of initial experiment | Probability of p<0.05 when the first observed difference is the true one | Probability of p<0.05 when the effect has a uniform prior before the first experiment
0.10  | .37 | .41
0.05  | .50 | .50
0.03  | .58 | .56
0.01  | .73 | .66
0.005 | .80 | .71
0.001 | .91 | .78

Goodman SN. “A comment on replication, P-values and evidence.” Stat Med 1992;11:875-879.
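As a check on these numbers, the replication probability can be computed directly. The sketch below is a minimal calculation, not code from the talk or the 1992 paper; it assumes a two-sided z-test at the 0.05 level and an identically designed second experiment, and computes both columns of the table: the probability of p < 0.05 on replication when the first observed difference is treated as the true effect, and the Bayesian predictive probability under a uniform prior on the effect.

```python
# Sketch (not from the talk): replication probability as a function of the
# initial two-sided p-value, assuming a z-test and an identical replicate.
from statistics import NormalDist

N = NormalDist()
ALPHA = 0.05
Z_CRIT = N.inv_cdf(1 - ALPHA / 2)          # ~1.96 for a two-sided 5% test

def p_rep_observed_as_true(p_initial):
    """Pr(p < 0.05 in the replicate) if the first observed difference is
    taken to be the true effect, so the replicate z ~ N(z_obs, 1)."""
    z_obs = N.inv_cdf(1 - p_initial / 2)
    return N.cdf(z_obs - Z_CRIT) + N.cdf(-z_obs - Z_CRIT)

def p_rep_uniform_prior(p_initial):
    """Pr(p < 0.05 in the replicate) when the effect has a uniform prior
    before the first experiment: the predictive distribution of the
    replicate z is then N(z_obs, sqrt(2))."""
    z_obs = N.inv_cdf(1 - p_initial / 2)
    s = 2 ** 0.5
    return N.cdf((z_obs - Z_CRIT) / s) + N.cdf((-z_obs - Z_CRIT) / s)

for p in (0.10, 0.05, 0.03, 0.01, 0.005, 0.001):
    print(f"p = {p:<6}  observed-as-true: {p_rep_observed_as_true(p):.2f}"
          f"  uniform prior: {p_rep_uniform_prior(p):.2f}")
# The observed-as-true column matches the table (.37, .50, .58, .73, .80, .91);
# the uniform-prior column follows the same logic and comes close to the
# table's values, though the slide's exact convention for it isn't spelled out.
```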

What do we mean by replication?

Statistical significance?

Same results/conclusions from same original data?

Same results/conclusions from same analytic data?

Same R/C in ostensibly identical study?

Same R/C in similar but non-identical study?

Surrogate for whether underlying hypothesis is true?

Is combinability/heterogeneity a more profitable concept to explore?

Reasons for non-replication

Hypothesis not true. {Prior / Posterior probability}

Misrepresented evidence. {Improper/selective analysis, improper/selective reporting}

Different balance of unmeasured covariates across studies/designs {Quality of design, reliability of mechanistic knowledge}

Different handling/measurement of measured covariates across studies/designs. {Combinability / heterogeneity}

Fundamentally different question asked, i.e. new study is not a replicate of previous one. {Combinability / mechanistic knowledge}

JAMA, 2005

Ioannidis findings

45 original articles claiming effectiveness w/ > 1000 citations in NEJM, Lancet, JAMA, 1990-2003

7 (16%) subsequently contradicted

7 (16%) exaggerated effects

20 (44%) replicated

11 (24%) unchallenged

5/6 nonrandomized studies were contradicted or had exaggerated effects, vs. 9/39 RCTs.

Unit of analysis?

Study         Condition         Agent
NHS           CAD prevention    Estrogen / Progestin
NHS           CAD prevention    Vit. E (women)
HPFS          CAD prevention    Vit. E (men)
Zutphen       CAD prevention    Flavonoids
Case series   Leukemia          Trans-retinoic acid
Case series   Resp. distress    Nitric oxide

Effect of “bias” on Bayes factor

LR(H1 vs. Ho | p-value, , , bias)

= Pr (pŠ | H1)

Pr (pŠ | Ho )

bias (1 )

(1 ) bias

As -->0, the LR -->

1- (1-bias)

bias

LR (bias, )

1 (1 bias)(1 0.05) bias 0.05

p≤0.05 p-value=0

Bias Power =80%

90% 80% 90%

0.1 5.7 6.3 8.2 9.1

0.3 2.6 2.8 2.9 3.1

0.5 1.7 1.8 1.8 1.9

0.8 1.2 1.2 1.2 1.2

1- (1- bias)

bias
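As a check on the table, the bias-adjusted likelihood ratio can be computed directly from the formula above. The sketch below is a minimal calculation, not code from the talk; "bias" is the fraction of studies that would be reported as positive regardless of what the data show.

```python
# Sketch (not code from the talk): the bias-adjusted likelihood ratio
# LR = Pr(positive | H1) / Pr(positive | H0).

def lr_with_bias(power, bias, alpha=0.05):
    """LR for a positive finding at level alpha, with reporting bias."""
    beta = 1 - power
    return ((1 - beta) + bias * beta) / (alpha + bias * (1 - alpha))

def lr_limit(power, bias):
    """Limiting LR as the p-value (alpha) goes to 0."""
    beta = 1 - power
    return (1 - (1 - bias) * beta) / bias

for bias in (0.1, 0.3, 0.5, 0.8):
    row = [lr_with_bias(p, bias) for p in (0.8, 0.9)]
    row += [lr_limit(p, bias) for p in (0.8, 0.9)]
    print(f"bias={bias}: " + "  ".join(f"{x:.1f}" for x in row))
# Reproduces the table above row by row, e.g. bias=0.1 -> 5.7  6.3  8.2  9.1
```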

Practical Example

For each scenario: power (1 − β), pre-study odds (R), and bias (u), followed by six values.
Without bias factor: PPV for p < 5%;  PPV for 1% ≤ p ≤ 5% (LR = 10.5);  PPV for p ≤ 1% (LR = 80).
With bias factor:    PPV for p ≤ 0.05;  PPV for p ≤ 0.01;  LR for p ≤ 1%.

Adequately powered RCT with little bias and 1:1 pre-study odds (1 − β = 0.8, R = 1.00, u = 0.10)
  Without bias factor: 0.97, 0.91, 0.99   With bias factor: 0.87, 0.89, 7.85

Confirmatory meta-analysis of good-quality RCTs (1 − β = 0.95, R = 2.00, u = 0.30)
  Without bias factor: 0.99, 0.95, 0.99   With bias factor: 0.86, 0.86, 3.18

Meta-analysis of small inconclusive studies (1 − β = 0.8, R = 0.33, u = 0.40)
  Without bias factor: 0.91, 0.78, 0.96   With bias factor: 0.41, 0.42, 2.18

Underpowered, but well-performed phase I/II (1 − β = 0.2, R = 0.20, u = 0.20)
  Without bias factor: 0.62, 0.68, 0.94   With bias factor: 0.25, 0.26, 1.76

Underpowered, poorly performed phase I/II (1 − β = 0.2, R = 0.20, u = 0.80)
  Without bias factor: 0.62, 0.68, 0.94   With bias factor: 0.17, 0.17, 1.05

Adequately powered exploratory epidemiological study (1 − β = 0.8, R = 0.10, u = 0.30)
  Without bias factor: 0.76, 0.51, 0.89   With bias factor: 0.21, 0.22, 2.83

Underpowered exploratory epidemiological study (1 − β = 0.2, R = 0.10, u = 0.30)
  Without bias factor: 0.44, 0.51, 0.89   With bias factor: 0.12, 0.13, 1.45

Discovery-oriented exploratory research with massive testing (1 − β = 0.2, R = 0.001, u = 0.80)
  Without bias factor: 0.01, 0.01, 0.07   With bias factor: 0.00, 0.00, 1.05

As in previous example, but with more limited bias (more standardized) (1 − β = 0.2, R = 0.001, u = 0.20)
  Without bias factor: 0.01, 0.01, 0.07   With bias factor: 0.00, 0.00, 1.76
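The with-bias PPV columns follow the same logic as the likelihood-ratio slide above: the bias-adjusted LR is multiplied by the pre-study odds R and converted to a probability. The slide does not spell out the exact formula behind every column, so the sketch below is an approximate reconstruction rather than the original calculation; it roughly reproduces the with-bias PPV for the "meta-analysis of small inconclusive studies" row.

```python
# Sketch: bias-adjusted positive predictive value, combining pre-study
# odds R with the bias-adjusted likelihood ratio from the previous slide.
# Approximate reconstruction, not the slide's own calculation.

def lr_with_bias(power, bias, alpha=0.05):
    beta = 1 - power
    return ((1 - beta) + bias * beta) / (alpha + bias * (1 - alpha))

def ppv_with_bias(power, R, bias, alpha=0.05):
    """Posterior probability the finding is true: posterior odds = LR * R."""
    odds = lr_with_bias(power, bias, alpha) * R
    return odds / (1 + odds)

# Meta-analysis of small inconclusive studies: power 0.8, odds 1:3, bias 0.4
print(round(ppv_with_bias(power=0.8, R=1/3, bias=0.4), 2))   # ~0.41
```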


Abstract conclusions

Design Cohort study using protocols and published reports of randomized trials approved by the Scientific-Ethical Committees for Copenhagen and Frederiksberg, Denmark, in 1994-1995. The number and characteristics of reported and unreported trial outcomes were recorded from protocols, journal articles, and a survey of trialists….

Results One hundred two trials with 122 published journal articles and 3736 outcomes were identified. Overall, 50% of efficacy and 65% of harm outcomes per trial were incompletely reported. Statistically significant outcomes had a higher odds of being fully reported compared with nonsignificant outcomes for both efficacy (pooled odds ratio, 2.4; 95% confidence interval [CI], 1.4-4.0) and harm (pooled odds ratio, 4.7; 95% CI, 1.8-12.0) data. In comparing published articles with protocols, 62% of trials had at least 1 primary outcome that was changed, introduced, or omitted. Eighty-six percent of survey responders (42/49) denied the existence of unreported outcomes despite clear evidence to the contrary.

Conclusions The reporting of trial outcomes is not only frequently incomplete but also biased and inconsistent with protocols. Published articles, as well as reviews that incorporate them, may therefore be unreliable and overestimate the benefits of an intervention. To ensure transparency, planned trials should be registered and protocols should be made publicly available.
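The "higher odds of being fully reported" in the Results is an odds ratio comparing the reporting of statistically significant versus nonsignificant outcomes. The toy illustration below uses made-up counts (not data from the study) to show how such an odds ratio and its 95% confidence interval are computed for a single trial before being pooled across trials.

```python
# Toy illustration (hypothetical counts, not data from the study):
# odds ratio that a statistically significant outcome is fully reported,
# compared with a nonsignificant outcome, with a Wald 95% CI.
import math

# Hypothetical 2x2 table for one trial:
#                     fully reported   not fully reported
sig_reported, sig_not = 18, 7
nonsig_reported, nonsig_not = 12, 13

odds_ratio = (sig_reported * nonsig_not) / (sig_not * nonsig_reported)
se_log_or = math.sqrt(1/sig_reported + 1/sig_not +
                      1/nonsig_reported + 1/nonsig_not)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print(f"OR = {odds_ratio:.1f}, 95% CI {lo:.1f} to {hi:.1f}")
# Per-trial odds ratios like this are then pooled across trials to give
# summary figures such as the 2.4 (efficacy) and 4.7 (harm) quoted above.
```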

Reproducible Research

Roger Peng, F. Dominici, S. Zeger

AJE, 2006

A Research Pipeline

What is Reproducible Research?

Data: Analytic dataset is available

Methods: Computer code underlying figures, tables, and other principal results is available

Documentation: Adequate documentation of the code, software environment, and data is available

Distribution: Standard methods of distribution are employed for others to access materials
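One concrete way to read the four criteria above is as a checklist of what ships alongside a paper. The sketch below is a hypothetical minimal example, not anything from Peng, Dominici, and Zeger; the file and column names are invented. It shows the "Methods" component: a single documented script that turns the distributed analytic dataset into one of the paper's principal results.

```python
# Hypothetical minimal example of a reproducible-research "Methods" component:
# a documented script that regenerates a principal result from the analytic
# dataset. File names and column names are invented for illustration.
import csv
import statistics

ANALYTIC_DATASET = "analytic_dataset.csv"   # the distributed "Data" component

def load_outcomes(path):
    """Read the analytic dataset; each row has a treatment arm and an outcome."""
    with open(path, newline="") as f:
        return [(row["arm"], float(row["outcome"])) for row in csv.DictReader(f)]

def table_one(rows):
    """Regenerate 'Table 1': mean outcome by treatment arm."""
    by_arm = {}
    for arm, outcome in rows:
        by_arm.setdefault(arm, []).append(outcome)
    return {arm: statistics.mean(vals) for arm, vals in by_arm.items()}

if __name__ == "__main__":
    print(table_one(load_outcomes(ANALYTIC_DATASET)))
```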

A Research Pipeline (reprise)

A Licensing Spectrum for Data

Full access: Data can be used for any purpose

Attribution: Data can be used for any purpose so long as a specific citation is used

Share-alike: Data can be used to produce new findings --- any modifications/linkages must be made available under the same terms

Reproduction: Data can only be used for reproducing results and commenting on those results via a letter to the editor

(No Data Available)

Issues to Consider

Making datasets available

What is code? Does it exist?

Separating content from presentation

Technical sophistication of authors, publishers, readers; requirements?

Protecting authors’ original ideas

Logistics – data storage, accessibility

RR Options considered at medical journal

Assign and “advertise” RR “Score” depending on how much info author makes available.

Do we ask everyone? Do we penalize those who don’t/can’t share data? How do we prioritize between components of the score? Do we treat sophisticated and unsophisticated analysts differently?

Disclose the author’s data-sharing policy, including code sharing, in the same way that roles on the manuscript and conflicts of interest are disclosed.

What do we mean by replication?

Statistical significance?

Same results/conclusions from same original data?

Same results/conclusions from same analytic data?

Same R/C in ostensibly identical study?

Same R/C in similar but non-identical study?

Surrogate for whether underlying hypothesis is true?

Is combinability/heterogeneity a more profitable concept to explore?

Reasons for non-replication

Hypothesis not true. {Prior / Posterior probability}

Misrepresented evidence. {Improper/selective analysis, selective reporting}

Different balance of unmeasured covariates across studies/designs {Quality of design, reliability of mechanistic knowledge}

Different handling/measurement of measured covariates across studies/designs. {Combinability / heterogeneity}

Fundamentally different question asked, i.e. new study is not a replicate of previous one. {Combinability / mechanistic knowledge}