
Title - Critical Evaluation of Clinical Trial Data

Erick Turner, M.D., Oregon Health & Science University

Dept of Psychiatry; Dept of Pharmacology

Portland VA Medical Center

Mood Disorders Center

Disclosure

No trade names, advertising, or product-group messages

Recovering promotional speaker
– Last “slip” was in fall of 2005

Objectives

Things to watch for in evaluating medical information

Heighten your level of skepticism and paranoia

May or may not apply to today’s talks

More about clinical trials in general, esp. industry-sponsored

Studies Presented Today

CATIE, STAR*D, STEP-BD, BOLDER

The A*C*R*O*N*Y*M Study

Effect of Acronym Name

Doubled the citation rate, independent of study size, quality, and outcome

Source

– Poster: What's in a NAME?

– Peer Review Congress 2005 (AMA)

Standard Clinical Trials vs. Large Simple Clinical Trials

Signal-to-noise
– Small & clean N (standard clinical trials)
– Big & dirty N (large simple trials): the “dirt” “comes out in the wash”

Efficacy vs. Effectiveness

Patients: “squeaky clean” vs. “real world”

Comorbidities
– EtOH, other drugs
– Depression + anxiety

“The clinical evidence”

Whose evidence?
– Intellectual COI
• “I was right! I’ve been vindicated!”
• Attracting grant money: “the Midas touch”

Which evidence?
– Available evidence vs. evidence-based medicine

Selective Publication

Nonsignificant studies tend not to get published
– Some studies never see the light of day

Among studies that are published:
– Selective presentation of endpoints within those studies
– “Outcome reporting bias”

Why the Need for Selective Publication?

Unimpressive effect sizes in psychiatry

Many NS antidepressant trials
– 47/92 (51%) of active tx arms NS
• Khan 2003 Neuropsychopharm
• Later-approved drugs and dosages

“The Emperor’s New Drugs”

80% of drug effect duplicated by placebo

2-point difference between drug and placebo
– HAMD 17-item max = 50 points
– 21-item max = 62 points

Kirsch I. Prevention & Treatment, Volume 5, Article 23, posted July 15, 2002

There Must Be 50 Ways . . .

…to put lipstick on a pig

Splice the Y-Axis: Depakote and Lithium

[Figure: Mania Rating Scale scores vs. Time on Protocol (days 0-21) for placebo, divalproex, and lithium, plotted on a spliced y-axis]

(Bowden et al, JAMA, 271:12, March 1994) *p < 0.05

Show Change from Baseline (not Absolute) Scores

(Keck et al, Am J. Psychiatry, 160:4, April 2003)

[Figure: Mania Rating Scale score vs. Study Day (0-21), plotted as change from baseline]

Non-Psychiatric Example

Graph in PDR: change scores

Same numbers: absolute scores

Don’t Show Variability in Data

Noise in data
– Random variability
– Interindividual differences
• Perhaps your patient isn’t “Mr. Mean”

Showing just means can be misleading
– Liquid N2

Prefer error bars (or even raw data points)

But how much/little overlap do you want the error bars to show?

Have it Your Way

Small: standard error
Medium: confidence interval (95%)
Large: standard deviation
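
A minimal sketch (Python, with hypothetical scores) of why the three bars differ in size: SD measures the spread of individual patients, SE = SD/sqrt(n) measures the precision of the mean, and the 95% CI half-width is about 1.96 x SE.

import math

scores = [18, 22, 25, 19, 24, 21, 23, 20, 26, 22]  # hypothetical ratings
n = len(scores)
mean = sum(scores) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))  # sample SD
se = sd / math.sqrt(n)          # standard error of the mean
ci_half = 1.96 * se             # approximate 95% CI half-width
print(f"SD = {sd:.2f}  >  95% CI = +/-{ci_half:.2f}  >  SE = {se:.2f}")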

Overpower Your Study

Unnecessarily large N

A clinically insignificant result becomes statistically significant

Candidate A vs. Candidate B: Effect of the Number of Voters

Disclaimer: Assumes that popular vote matters

The split stays the same; only N grows:

Total No. of Voters   P value        News Headline
1,000                 0.95           tie
10,000                0.84           tie
100,000               0.53           tie
1,000,000             0.046 (<.05)   A wins
10,000,000            <.0001         A wins by a landslide!!
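
A minimal sketch (Python, standard library only) that reproduces these p values with a two-sided one-sample z test, assuming a constant 50.1%/49.9% split (the value consistent with the table):

from math import erf, sqrt

split = 0.501  # Candidate A's share, held fixed while N grows
for n in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    z = (split - 0.5) / sqrt(0.25 / n)          # normal approximation
    p = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided p value
    print(f"N = {n:>10,}   p = {p:.4f}")

The same 0.2-point spread goes from p = .95 to p < .0001 purely because N grows.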

Limitation of P Values

P values are confounded by sample size
– A clinically insignificant difference can be statistically very significant

P values tell about precision
– How likely the observed difference could have occurred by chance

Clinicians and pts are also interested in the magnitude of effect
– Effect size
– Confidence intervals
– Reading: Jacob Cohen, “The Earth Is Round (p < .05)”
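
A minimal sketch (Python; the numbers are hypothetical) of what an effect size adds: Cohen's d expresses the drug-placebo difference in pooled-SD units, independent of N.

def cohens_d(mean_change_drug, mean_change_placebo, pooled_sd):
    """Standardized mean difference: magnitude, not mere 'significance'."""
    return (mean_change_drug - mean_change_placebo) / pooled_sd

# Hypothetical: a 2-point HAMD advantage over placebo, pooled SD of 8 points
print(cohens_d(-12.0, -10.0, 8.0))  # -0.25, a small effect by Cohen's benchmarks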

Underpowered Studies

Could have a clinically significant difference

N too small to reach statistical significance

Michael Jordan free-throw shootout
– MJ vs. ET: 7 free throws each
– MJ makes 7, I make 3
– P = .07 (NS, Fisher exact test)

Conclusions
– There was “no difference” between us.
– I’m as good as Michael Jordan!

Vickers A, Medscape 2006. Michael Jordan Won’t Accept the Null Hypothesis: Notes on Interpreting High P Values
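
A minimal sketch (Python with SciPy, assumed available) reproducing that p value with Fisher's exact test on the 2x2 table of made vs. missed free throws:

from scipy.stats import fisher_exact

#            made  missed
table = [[7, 0],   # MJ: 7 of 7
         [3, 4]]   # ET: 3 of 7
_, p = fisher_exact(table, alternative="two-sided")
print(f"p = {p:.2f}")  # p = 0.07: "not significant", yet hardly equality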

Lack of a significant difference does not mean equality!

If it’s not black, it’s not necessarily white, either… could be gray

Study could be underpowered

Beware claims of equivalence

But what if Ns are adequate?

Claims of Equivalence

Example: two drugs performed “the same”. Were both medications really equally effective? Or were they equally ineffective?

St. John’s Wort vs. Sertraline

[Figure: Mean HAM-D scores over Study Weeks 0-8 for hypericum and sertraline]

Mean decrease = 47% for sertraline (Zoloft) vs. 38% for hypericum, p = .06

JAMA Apr 10, 2002 -- Vol 287, No. 14, 1807-1814

. . . and with Placebo in the Picture

[Figure: Mean HAM-D scores over Study Weeks 0-8 for hypericum, placebo, and sertraline]

Comparison      p
Hyp vs. Pbo     .59
Ser vs. Pbo     .18
Ser vs. Hyp     .06

St. John’s Wort vs. Sertraline: Analysis of the Other Primary Efficacy Endpoint

[Figure: % full responders: hypericum 24%, sertraline 25%]

p = .99, chi-squared test, Yates corrected

. . . with Placebo in the Picture

[Figure: % full responders: hypericum 24%, sertraline 25%, placebo 32%]

Comparative Claims

FDA leery…
– …of equivalence claims
– …of superiority claims

FDA does not allow them in labeling (package insert, advertising)

Efficacy advantage
– Underdose the competing drug

Safety advantage
– Dose the competing drug too high and/or too fast

Transitivity


Am J Psychiatry 163:185-194, February 2006

Consider the Source

RESULTS: Of the 42 reports identified by the authors, 33 were sponsored by a pharmaceutical company. In 90.0% of the studies, the reported overall outcome was in favor of the sponsor’s drug. This pattern resulted in contradictory conclusions across studies when the findings of studies of the same drugs but with different sponsors were compared.

Beware the Comparison to Nothing!

Open-label study: pts know what they are getting
– Voice alteration in VNS trials

Often single-arm w/ no placebo control

Anyone ever seen an open-label study in which pts did not get better compared to baseline?

(How do they get published?)

Single-Blind Studies

A step above open-label in rigor

Investigators know what tx the study pt is getting

Examples:
– Acupuncture studies
– Many device studies (e.g. rTMS)

The Problem with Single-Blind Studies: Clever Hans

Use Lots of Scales: Don’t Put All Your Eggs in One Basket

Observer-based
– MADRS
– CGI
• CGI-I (improvement)
• CGI-S (severity)
– HAMD in all its flavors
• 17-item
• 21-item
• 28-item
• 33-item

Self-report
– BDI (Beck)

– QIDS-SR (STAR*D)

– Quality of life scales

Pros and Cons of Many Scales

The upside of multiple endpoints:
– Internal replication
– Robustness (vs. a fragile finding)

The downside:
– Increased probability of a chance finding
– Multiplicity, aka multiple comparisons (see the sketch below)
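
A minimal sketch (Python) of that downside: with k independent endpoints each tested at alpha = .05, the chance of at least one spurious "finding" grows quickly, and a Bonferroni correction shrinks the per-test threshold accordingly.

alpha = 0.05
for k in (1, 2, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k  # P(at least one chance finding)
    print(f"{k:>2} endpoints: P(>=1 false positive) = {fwer:.2f}, "
          f"Bonferroni per-test alpha = {alpha / k:.4f}")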

Put Enough Monkeys at Enough Typewriters . . .

…and sooner or later you’ll have the complete works of William Shakespeare

Multiple Subscales

With the 33-item HAMD, you also get . . .
– 28-item
– 21-item
– 17-item
– 6-item (“core items”)

Anxiety subscale of the HAMD

Depression subscale of the PANSS

But was it in the original protocol?

What Can You Do With All These Scales?

Continuous measure
– Use each score as-is (absolute score)
– Change from baseline

Transform into categorical measure
– Cutoffs: patients either above or below
• Remitters
• Responders

Responders

Just “responders”
– >= 50% decrease from baseline
• Ex.: baseline score 40 -> endpoint score 20
– < 50% decrease => “nonresponder”
• Ex.: baseline score 40 -> endpoint score 21

Gradations of responders
– Partial responders (25-50% decrease from baseline)
– Full responders (>50% decrease)
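
A minimal sketch (Python) of the binary cutoff above; note how knife-edged the dichotomy is:

def is_responder(baseline: float, endpoint: float) -> bool:
    """Response = at least a 50% decrease from the baseline score."""
    return (baseline - endpoint) / baseline >= 0.50

print(is_responder(40, 20))  # True:  exactly 50% decrease -> responder
print(is_responder(40, 21))  # False: 47.5% decrease -> "nonresponder"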

Remitters

“Remission” usually = absolute score (HAMD < 8)

STAR*D defines remission as 75% decrease from baseline

Advantage: a set threshold deemed clinically significant

But % remitters may still differ between groups to an extent that is only just statistically significant (remember the “election” slide)

Handling Dropouts

LOCF
– Last observation carried forward

OC
– Observed cases
– aka completers

MMRM
– Mixed model repeated measures

HARKing

Hypothesizing After the Results are Known

A priori vs. post hoc

How the FDA Guards Against This

FDA gets the protocol before the study begins

Sponsors can’t “censor” studies that don’t go well

Drugs approved based on all studies

It’s the Protocol, Stupid!

“If the Devil is in the Details, Salvation is in the Protocol”
– Talk by Paul Andreason, FDA

Primary endpoints
– A priori hypothesis
– Where you’re placing your bet

Secondary endpoints
– Exploratory
– If you make it, fine, but don’t make a big deal about it.
– Repeat the study, designate the endpoint as primary, see if it replicates

Off-Label Use

Drug used for something the FDA has not approved it for
– (FDA does not regulate prescribing)

Often appropriate to prescribe off-label
– No approved drugs for the condition (but why not?)
– You’ve exhausted the approved drugs

Ask: why isn’t the drug approved for this condition?
– Could they have submitted an application and gotten it rejected?
– If they haven’t submitted an application, why not?

How do you Know Whether a Drug is FDA-Approved for the Condition You’re Treating?

Beware of sources that talk about “uses”
– AHFS Drug Information (“The Red Book”)
– Fluoxetine uses: obesity, bipolar d/o, myoclonus, cataplexy, EtOH dependence
– Gabapentin has never been approved for any psych indication

Just look in the package insert or PDR
– Indications & Usage section
– More details in the Clinical Trials section

The End
