Randomized controlled trials in psychiatry. Part 1: methodology and critical evaluation*

Richard Porter, Chris Frampton, Peter R. Joyce, Roger T. Mulder

Objective:

To introduce clinicians to the methodology and critical appraisal of randomized controlled trials (RCTs) in psychiatry.

Method:

The methodology of RCTs in psychiatry is discussed. Using a systematic approach to critical appraisal, a published RCT of treatments for severe depression is examined and evaluated.

Results and Conclusions:

The RCT appraised illustrates certain problematic areas in the methodology of trials in psychiatry. A detailed knowledge of methodology and critique of RCTs is essential in determining whether reported results will influence clinicians’ practice.

Key words:

evidence-based medicine, psychopharmacology/methods, randomized controlled trials.

Australian and New Zealand Journal of Psychiatry 2003; 37:257–264

This is the first of two articles that aim to introduce readers to randomized controlled trials in psychiatry. Part 1 covers basic methodology and critical appraisal; Part 2 examines in more detail clinical relevance of trials and methodological issues that may detract from their usefulness.

The evaluation of clinical trials has been the subject of much discussion in recent years and a general scheme, now widely used, has been developed [1]. Proformas for the evaluation of reports of clinical trials are available and readily downloaded from the web (http://minerva.minervation.com/cebm/documents/worksheets.pdf). We will follow a similar approach, modified from Guyatt [1]. In doing so we have paid attention to issues that are complex and contentious (see Table 1 for our modified scheme). In discussing research methods it is easier to refer to a particular example. We have therefore selected a clinical question and will discuss the methodology that might be used to answer this question. We will also discuss a study that has attempted to address this question [2]; we encourage readers to obtain a copy and to assess it in conjunction with this article. Further examples of structured critical appraisal of RCTs can be found in Warner [3] and Lawrie [4].

Why a randomized controlled trial?

As the name suggests, randomized controlled trials (RCTs) have two important features designed to provide objective evidence regarding clinical practice. These key features are randomization and the use of a control treatment.

1. While a treatment may be associated with improvement in a patient’s condition, this may be due to many factors including spontaneous remission. Without comparing the ‘investigational’ treatment with a ‘control’ treatment we cannot assess its real efficacy.

2. Unless a comparison between an investigational treatment and control treatment is made in a group of patients truly ‘randomized’ to one treatment or the other, the biases of clinicians are likely to dictate which treatments are received by which patients. More severely depressed patients might, for instance, be given tricyclic antidepressants (TCAs) and less severely depressed, specific serotonin reuptake inhibitors (SSRIs). This may bias the results of a comparison between these treatments.

Richard Porter, Senior Lecturer (Correspondence); Chris Frampton, Associate Professor in Biostatistics; Peter R. Joyce, Professor; Roger T. Mulder, Associate Professor

Department of Psychological Medicine, Christchurch School of Medicine, PO Box 4345, Christchurch 8001, New Zealand. Email: [email protected]

Received 13 December 2002; revised 17 February 2003; accepted 13 December 2002.

*The third article in an occasional series on ‘Conducting Research in Psychiatry’, coordinated by the Australasian Society for Psychiatric Research and the Research Board of the Royal Australian and New Zealand College of Psychiatrists.

Without such a design it is difficult to draw conclusions from any trial. However, there are many other features of the design of RCTs that may affect their validity and these are addressed below.

Definition of the ‘problem’

The crux of the clinical trial methodology is to define the question. In everyday practice many questions arise for each patient. Many have never been addressed in trials and those that have, have sometimes been addressed in a way that does not help to guide practice. This problem will be discussed further in Part 2. We will take as an example treatment of severe depression in hospital.

Scenario

A 50-year-old man is admitted with an episode of major depression with a Montgomery Asberg Depression Rating Scale [5] (MADRS) score of 35. There are DSM-IV melancholic but no psychotic features. He has had two previous episodes of major depression, which were not as severe and did not require hospitalization. You can elicit no history suggestive of bipolar disorder. His daughter’s wedding is in 3 weeks and he is desperate to be well enough to attend.

A literature search reveals a study of rapidly escalating venlafaxine in the treatment of inpatient depression [2]. You decide to evaluate this report using the headings below.

What are the treatments assessed and are treatment details adequately defined?

Investigational treatment

The choice of investigational treatment is usually driven by commercial interest, by a clinical idea that a particular treatment may have benefits or by a perceived need to determine which option is the best among treatments which, on the basis of existing evidence, appear equivalent. It is important that the treatment is well defined so that clinicians can be confident they are administering the same treatment if the results of the trial persuade them that it is appropriate to do so. This is clearly a major issue in psychotherapy trials (see Part 2). Equally, however, in drug trials, the dosing schedule requires careful consideration and should be adequately specified and reported at the end of the trial.

Comparator treatment

Once an investigational treatment is decided on, the next step is to examine the suitability of the comparator. Its choice is one of the most difficult and criticised aspects of trials.

Table 1. Scheme for evaluation of articles (modified from Guyatt et al. 1993)

What are the treatments to be assessed?
– Are adequate details of the treatments given?
– Is the proposed treatment compared with a suitable comparator?
– Apart from the proposed treatment were the groups treated equally?
Was the assignment of patients to treatments truly randomized?
Were the groups similar at the start of the trial?
Was the study double-blind?
What is the primary outcome measure and is it clinically relevant?
Are all clinically important outcomes considered?
Were all patients who entered the trial properly accounted for and analysed in the groups to which they were randomized?
Were the statistical methods appropriate and what are the results?
– How large was the treatment effect?
– How precise was the estimate of the treatment effect?
Can the results be applied to my patient?
– How were subjects chosen?
– Were eligibility criteria well defined and what were they?
– Is there a differential effect in particular subgroups?
Are the likely treatment benefits worth the potential harms and costs?


New agents are usually compared with placebo (a pharmacologically inactive compound). This is because, particularly in conditions such as major depressive disorder, there is a high placebo response rate that varies considerably depending on chronicity, severity and subtype of depression (see Schatzberg [6] and Quitkin [7] for an in-depth discussion). While it has been suggested that a placebo ‘run in’ period can help to control for this, evidence is inconsistent [8].

In some situations, however, it is argued that placebo treatment is ethically inappropriate (see Miller [9] for an interesting review of ethical issues of placebo controlled trials). In this case, the best comparison would be between a proposed treatment and a currently accepted optimal treatment. In our scenario, what would be the most appropriate comparator? It could be argued that placebo treatment for severe hospitalized depression is unethical and that the information which would be useful in developing practice is whether a proposed treatment is better than current usual treatment. Perhaps the trial should compare venlafaxine with the currently accepted optimal treatment of inpatient depression. A recent meta-analysis in hospitalized depression suggested that amitriptyline had a significant advantage over SSRIs [10]. This might then be a suitable agent to use as a comparator. However, it could be argued that because of a relatively high side-effect burden amitriptyline may not be the most commonly used antidepressant in the treatment of inpatient depression and may not, therefore, be as clinically relevant as a comparison with an SSRI. The study of Benkert et al. [2] uses rapidly escalating imipramine as a comparator. While few psychiatrists would argue against imipramine as a comparator, the exact schedule of treatment does need to be examined more closely.

Detailed evaluation of the treatments delivered

The next issue is the way in which the treatments were delivered. For instance, in drug trials a dosage schedule must be specified. This can be left up to the clinician or predetermined by the investigators. Pre-set schedules give a clearer indication of the exact treatment used. Clinician titration is probably closer to clinical practice and may reduce dropouts since sensitive patients will be maintained at lower doses and dosage increased more slowly. In this situation, average dose received should be reported.

The dose of an active comparator can become the subject of debate. Benkert et al. [2] compared rapid titration to high dose venlafaxine with rapid titration to high dose imipramine. While, as discussed, imipramine is probably a reasonable comparator, rapid titration is rarely used and there is little evidence for its use. Therefore, it could be argued that a more usual protocol would be a more meaningful comparison. This study also illustrates the importance of considering the details of the treatments in appraising trials. Although upward titration of dose was roughly equivalent in both groups, the venlafaxine dose was reduced after two weeks while the imipramine dose remained high. This issue may have influenced the results as we will discuss later.

A further debate on the importance of comparator selection and dosing can be found in Healy [11] and McKenna [12], who debate the issue of an appropriate comparison treatment in trials of clozapine in treatment-resistant schizophrenia.

The exact details of treatment and comparator are also important in trials of psychotherapeutic interventions. Since different types of psychotherapy may give rise to different results, it is important to define the exact form of therapy to be given, to ensure adherence of therapists to that therapy and to maintain its quality. In reporting the trial, it is important to describe the therapy in such a way that it can be replicated if it proves to be clinically useful.

Apart from the proposed treatments (investigational vs comparator) it is important to anticipate the possibility that additional, less specific treatments will be needed during a trial and that these will vary between groups. An obvious example is that in trials for mania, if a particular agent is not effective, certain ‘rescue’ medications (e.g. benzodiazepines) may be used. This may occur to a greater extent in one group compared with another and could obscure a difference in outcome. This may therefore need to be considered as an actual treatment outcome. The use of additional treatments should also be predetermined and included in the protocol so that, for instance, different centres or investigators do not use different ‘rescue medications’, making this a difficult factor to analyse.

Randomization: was assignment of patients to each treatment truly random?

The most important feature of the RCT is random allocation of patients. Practically, randomization is done by allocating each patient a number from a predetermined, computer-generated code. This then determines which treatment the patient will receive. There must be no way in which a clinician can influence which number is assigned, and if the trial is truly double-blind, the clinician should not know which numbers correspond to which treatment. It is otherwise possible that clinicians who have a pre-existing belief about a treatment may attempt to obtain that treatment for particular patients. An RCT report should contain an unequivocal statement regarding randomization, as does the study by Benkert et al. [2].
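
Such a predetermined, computer-generated allocation code can be sketched as follows. The permuted-block scheme, function name and fixed seed are illustrative assumptions, not details from any study discussed here:

```python
import random

def allocation_code(n_patients, treatments=("A", "B"), seed=42):
    """Generate a predetermined allocation list using permuted blocks,
    so that group sizes stay balanced throughout recruitment."""
    rng = random.Random(seed)   # fixed seed: the code is generated once, in advance
    block = list(treatments)
    code = []
    while len(code) < n_patients:
        rng.shuffle(block)      # each block contains one of each treatment, in random order
        code.extend(block)
    return code[:n_patients]

# Patient k simply receives the treatment at position k of the list.
code = allocation_code(8)
print(code)
```

Because the list is fixed before the first patient enters, a clinician cannot influence which treatment a given patient number receives; under blinding, the code itself is held by a third party.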

Were the treatment groups comparable?

It is critical that any differences in outcome between groups can only be ascribed to the treatment effect and not to the confounding effect of prognostic variables which may differ between groups. Confounders may be described as variables that distort the relationship between treatments and outcome. Large, randomized studies in most cases, by chance, result in treatment groups that are comparable in all important respects. In smaller studies, it is possible that an important prognostic variable will differ between groups. There are two ways of overcoming this problem. First, a technique called stratification can be used. In this system, important prognostic groups (e.g. males and females) are randomized separately, ensuring that the same numbers of each are allocated to each treatment. It is also common to stratify by study centre so that each has equivalent numbers of patients in each group. The second method is to use statistical techniques, such as covariate analysis, which ‘correct’ for any differences in prognostic variables between groups before comparing outcomes. Reports should contain a table (Table 2 in Benkert et al. [2]) with summaries of key prognostic variables specified for each treatment group. Not only is it important to examine this for variables which differ between groups (these may be brought to the reader’s attention by significance values in the table) but to consider whether other variables should also have been considered. For instance in Benkert et al. [2], it would be useful to know whether patients suffering from depression with psychotic features were represented more in one group than another.
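
Stratified randomization, as described above, simply runs a separate randomization within each prognostic group. A minimal sketch, in which the patient identifiers and the sex stratum are hypothetical:

```python
import random

def stratified_allocation(patients, key, treatments=("A", "B"), seed=7):
    """Randomize each stratum separately with permuted blocks, so that
    treatments are balanced within every stratum (illustrative sketch)."""
    rng = random.Random(seed)
    strata = {}
    for p in patients:
        strata.setdefault(key(p), []).append(p)   # group patients by stratum
    assignment = {}
    for members in strata.values():
        block = list(treatments)
        alloc = []
        while len(alloc) < len(members):
            rng.shuffle(block)                    # balanced within the stratum
            alloc.extend(block)
        for p, t in zip(members, alloc):
            assignment[p] = t
    return assignment

# Hypothetical patients stratified by sex:
patients = ["p1", "p2", "p3", "p4", "p5", "p6"]
sex = {"p1": "M", "p2": "F", "p3": "M", "p4": "F", "p5": "M", "p6": "F"}
assignment = stratified_allocation(patients, key=lambda p: sex[p])
```

Within each stratum the group sizes can differ by at most one, so an imbalance in, say, sex between treatment arms cannot arise by chance.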

Was the study double-blind?

Double-blind means that neither patients, clinicians nor raters of important variables know which treatment group the patients are in. It is not clear that all studies described as double-blind are truly so. There is an ethical obligation and a clinical necessity to explain to patients likely side-effects and while this does not necessarily mean that patients will know which side-effects apply to which medication, they can look this up. Likewise, by inquiring about side-effects, researchers may guess what treatment a patient is taking. One partial solution is to use ‘blind’ raters to assess outcome. It is probably also useful to assess the degree to which the study was double-blind simply by asking patients and researchers to guess the assignment of each patient.

Table 2. Summary of evaluation of Benkert et al. 1996

What are the treatments to be assessed? Adequate details given?
– Yes, but note that the dose of venlafaxine was reduced after two weeks while the dose of imipramine was left static.
Is the proposed treatment compared with a suitable comparator?
– Yes, but note concerns regarding the use of rapid escalation for both venlafaxine and comparator.
Apart from the proposed treatment were the groups treated equally?
– Yes.
Was the assignment of patients to treatments truly randomized?
– Yes.
Were the groups similar at the start of the trial?
– Yes, but the numbers of patients in each group with psychosis is not given.
Was the study double-blind?
– Yes, but given marked differences in side-effects between experimental treatments, blinding may have been incomplete. Raters were not independent.
What is the primary outcome measure and is it clinically relevant?
– No single primary outcome measure was specified and a single measure that was significantly different between groups was selected for emphasis.
Are all clinically important outcomes considered?
– Yes.
Were all patients who entered the trial properly accounted for and analysed in the groups to which they were randomized?
– Yes, but it is of some concern that the study was stopped early despite not meeting the recruitment target determined from the power calculation.
Were the statistical methods appropriate and what are the results?
– A significantly greater percentage of patients had sustained response to venlafaxine but only as measured by HAM-D. This, however, was not the stated primary outcome measure.
How large was the treatment effect?
– Number needed to treat = 7.
How precise was the estimate of the treatment effect?
– 95% CI 4–80.
Can the results be applied to my patient?
– Yes; given the inclusion and exclusion criteria, the results certainly apply to the case scenario described.
Are the likely treatment benefits worth the potential harms and costs?
– No.

In appraising a RCT it is useful to consider whetherthe trial is likely to have been blind and what, if any, theimplications of researchers and patients knowing theirassignment might have been. One might expect the‘placebo’ effect of a new and therefore promising treat-ment to be greater than that of an older treatment beingused as a comparator or of placebo itself (i.e. an expec-tation effect). It is likely that patients and clinicianscould have distinguished between the side-effects ofvenlafaxine and imipramine and that the Benkert

et al

.[2] trial was not in fact truly double-blind. A possibleeffect is to have made dropout greater among patientswho perceived themselves to be on an ‘old’ treatmentand an increased placebo effect in those knowing them-selves to be on the ‘new’ treatment. In fact, significantlymore patients did drop out of the imipramine group as aresult of ‘patient request’. This could bias results. Whileit is not specified, it appears that clinicians and notindependent ‘blind’ raters performed ratings. It is alsopossible that raters could be biased by their expectationsof the efficacy of either treatment.

The primary outcome measure: is it defined and is it clinically relevant?

For any trial, a predetermined primary outcome measure should be chosen. Many published trials have included several measures and report positive effects on only one of these. However, including multiple measures makes it likely that at least one of them will show a difference by chance.

It is not an easy task to decide on a single outcome measure. In many trials of antidepressant treatment it has been total score or change from baseline on the Hamilton Depression Rating Scale (HAMD) [13] at 6 weeks. In our scenario, where speed of response is important, one option is to define primary outcome as the score, on a predetermined scale (e.g. HAMD), at a specified point in the treatment course (e.g. 3 weeks). Another option is to use clinically relevant, objectively defined outcome measures. In trials of cancer treatment, for example, death or survival is clinically relevant and easily measured. It is more difficult in psychiatry to find such measures. However, researchers are increasingly advocating the use of simple objective outcomes to make trials large enough (and therefore simple enough to be carried out at many sites) to show relatively small differences. This is the philosophy of the important BALANCE trial of prophylaxis in bipolar affective disorder [14] underway in the UK, in which a primary measure is admission to hospital.

Benkert et al. [2] cite four primary outcome variables (times to response and to sustained response on the HAMD and MADRS scales). A sustained response was defined as one that occurred by week 2 and persisted to the end of the study and included at least 39 days of double-blind therapy. This is not a commonly used measure. It is possible that reducing the dose of venlafaxine but not imipramine increased dropouts at this stage in the imipramine group. This may have contributed to the greater ‘sustained response’ with venlafaxine. Although not statistically significant (25% vs 38%) there was a greater number of dropouts in the imipramine group.

Other outcome measures: are all clinically important outcomes assessed and reported?

Since all treatments potentially have adverse effects, it is important to anticipate, measure and report them accurately. In pharmacological trials, side-effects are important and a treatment that increases the speed of recovery from depression but gives rise to enduring side-effects may not be deemed useful. This might apply to ECT, which may lead to a rapid recovery for an inpatient with depression. If we were to conduct a trial of high dose venlafaxine versus ECT, cognitive side-effects should obviously be measured.

Were all patients who entered the trial properly accounted for and analysed in the groups to which they were randomized?

Patients are generally assessed on a range of inclusion and exclusion criteria and if suitable asked to consent to involvement in a trial. Further assessments may take place at which patients may still be excluded. At a predetermined point, patients are then randomized and deemed to have entered the trial. Results should be analysed by ‘intention to treat’ analysis. In this method, the last appropriate assessment is included in the analysis even if the patient drops out, a procedure known as last observation carried forward (LOCF). Treatment may result in a number of discontinuations secondary to side-effects or perceived lack of effect. If only patients completing treatment were analysed, this would give an inflated estimate of clinical efficacy. Analysis using intention to treat includes the observation taken before dropout, which in most cases will show little improvement and reduce apparent efficacy. In evaluating a trial it is important to check that all randomized patients are included in the final analysis.
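
A minimal sketch of last observation carried forward, assuming each patient’s scores are recorded per scheduled visit, with missing entries after dropout:

```python
def locf(visits):
    """Last observation carried forward: the endpoint score for a patient is
    their last non-missing assessment. `visits` is a list of rating-scale
    scores per scheduled visit, with None for visits after dropout."""
    last = None
    for score in visits:
        if score is not None:
            last = score
    return last

# A patient who drops out after week 2 contributes the week-2 score
# to the endpoint analysis rather than being excluded:
print(locf([28, 24, None, None]))   # 24
```

This is why intention-to-treat analyses with LOCF tend to give more conservative efficacy estimates than completer-only analyses: early dropouts contribute scores taken before much improvement has occurred.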

In the Benkert et al. [2] study, 167 patients were randomized and at all points data from all of them are reported and analysed by the LOCF method.

Is the analysis valid and what are the results?

The simplest form of analysis is a comparison of outcome, as measured by the predetermined primary measure, between the treatment groups, carried out after a predetermined number of patients have completed the study. If the measure is continuous (e.g. percentage reduction from baseline in HAMD score after 6 weeks) the statistical test carried out may be an independent t-test. If the data do not fit a normal distribution curve then non-parametric statistics are used. If the primary measure is binary (e.g. response versus non-response according to a predetermined definition such as 50% reduction in HAMD at 6 weeks) a simple χ² test can be used to determine if different numbers of patients in each group meet this criterion.
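
For the binary case, the χ² statistic for a 2×2 response table can be computed directly; the response counts below are hypothetical:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df) for a 2x2 table:

                     response   no response
        treatment 1     a           b
        treatment 2     c           d

    Compare the statistic with 3.84, the 5% critical value for 1 df."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    if den == 0:
        raise ValueError("table has an empty margin")
    return n * (a * d - b * c) ** 2 / den

# Hypothetical example: 30/50 responders on one drug vs 20/50 on the other.
chi2 = chi_square_2x2(30, 20, 20, 30)
print(round(chi2, 2))   # 4.0 -> exceeds 3.84, so p < 0.05
```

In practice a statistics package would report the p-value directly (and might apply a continuity correction for small samples), but the statistic itself is just this closed-form expression.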

Since published analyses can be complicated and misleading, it is useful to discuss several other concepts at this point.

Analysis of variance

Despite the ideal, trials are often imperfectly described by comparison of a single variable at the ‘end’ of treatment. Such an analysis may incompletely summarize the results, missing key features. Several factors may make a more complicated analysis of variance appropriate. First, where measures have been made at several points and speed of response can be examined, a repeated measures analysis of variance may be used. In its simplest form this allows examination of treatment effect, the role of time and the interaction between treatment and time. A significant effect of time means that the outcome variable changes over time. A significant interaction between treatment and time suggests that there is a difference between the speed of response to the two treatments. This can then be explored by examining response at different points. Making multiple comparisons at different points without first finding an interaction between treatment and time is invalid. Second, there may be a differential effect of treatment in certain subgroups; this subgrouping can be entered into an analysis of variance. This might arise for instance in a study of treatment for depression, where an analysis may include both treatment and melancholia as factors. An interaction between treatment and melancholia would suggest that the relative effects of the treatments differ in the subgroups. A significant effect of treatment would suggest an advantage for one treatment without a differential subgroup effect. An example of such an analysis is seen in Swann [15] where a difference in response to lithium and sodium valproate is seen between subgroups of manic patients with and without depressive symptoms.

One advantage of this type of analysis is that it allows more complex data to be analysed in a single analysis. This avoids the problem of multiple statistical comparisons that give an increasing potential for false positives with each analysis. We should be suspicious of studies that use t-tests to compare outcome at multiple points or in several subgroups, especially where there is no prior justification. This may apply, for instance, where baseline severity is deemed important and patients are split into groups of different severity and the effect of treatment assessed in each group. More valid would be analysis of covariance with severity as a covariate.
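
The inflation of false positives with multiple comparisons can be made concrete with a short calculation (assuming, for simplicity, that the comparisons are independent):

```python
# With k independent comparisons, each tested at alpha = 0.05, the chance of
# at least one false positive when no real difference exists is 1 - (1 - alpha)**k.
alpha = 0.05
for k in (1, 3, 6):
    family_wise = 1 - (1 - alpha) ** k
    print(f"{k} comparisons: P(at least one spurious result) = {family_wise:.2f}")
```

So a report running six separate tests at the 5% level has roughly a one-in-four chance of at least one spurious ‘significant’ finding even if the treatments do not differ at all.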

In Benkert et al. [2] ‘times to response and to sustained response on the MADRS and HAMD scales’ are the primary outcome variables. However, these variables only apply to responders. Therefore, this is not an intention to treat analysis. A further analysis is given which is intention-to-treat and examines response at 2 and 6 weeks and ‘sustained response’ on both MADRS and HAMD. Six separate χ² tests are carried out, only one of which shows a significant difference. In our opinion a repeated measures analysis of variance with treatment and time as factors and percentage change in HAMD from baseline as the variable would have been a better way to deal with the issue of differential time to response.

Study size

One of the most important features of planning any research is a power calculation. This is a statistical strategy to determine the numbers needed in each group to have a predetermined chance (usually 80%) of detecting a clinically significant difference in outcome (usually at p < 0.05), if that difference actually exists. The first step is to decide what a clinically significant difference would be and to determine from previous studies the variation (standard deviation) in the outcomes. A calculation then allows determination of the number of subjects needed in each group.
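
For a continuous outcome compared between two groups, a standard approximate formula can be sketched as follows; the 4-point HAMD difference and standard deviation of 8 are hypothetical inputs, not figures from any study discussed here:

```python
import math

def n_per_group(delta, sd, z_alpha=1.96, z_beta=0.84):
    """Approximate sample size per group for a two-sample comparison of means:
    delta   = smallest clinically significant difference in means
    sd      = expected standard deviation of the outcome
    z_alpha = 1.96 for two-sided alpha = 0.05; z_beta = 0.84 for 80% power."""
    n = 2 * ((z_alpha + z_beta) ** 2) * sd ** 2 / delta ** 2
    return math.ceil(n)

# e.g. to detect a 4-point HAMD difference when the SD is about 8 points:
print(n_per_group(delta=4, sd=8))   # 63 patients per group
```

The formula makes the trade-offs explicit: halving the difference one wishes to detect quadruples the required sample size.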

In evaluating trials it is useful to determine whether the study size was predetermined with a power calculation. In the extreme case, researchers could monitor and analyse the results throughout the study and report the results at the point at which the difference became significant. This would be misleading. An interim analysis is permissible so that if treatments unexpectedly differ before all intended patients are recruited, further patients need not be exposed to what is an ineffective treatment. This analysis should be done independently since knowledge of the analysis may give rise to confounding expectation effects.

The Benkert et al. [2] study stopped early ‘because of observed differences in outcomes between different study centres’. However, after data analysis, no ‘centre-by-treatment interactions were noted’. Thus, although the investigators perceived a problematic difference in outcome between centres (but do not state what this was), there was no significant difference between centres in terms of relative response to the treatments. It is questionable whether the study should have been stopped at this point.

How large was the treatment effect?

Number needed to treat

A useful measure of effectiveness is the ‘number needed to treat’ (NNT) (see Chatellier [16] for further discussion), defined as the number of patients who would need to be treated with the investigational drug rather than the comparator in order for one additional patient to derive a significant benefit (usually a predefined response). Individual clinicians may, based on this number, make an informed decision regarding clinical practice. If, for instance, an antidepressant was better than its comparator with a NNT of 25, this would imply that of 25 patients treated with drug A rather than drug B, one would respond who would not have done had they all been treated with drug B. If costs and side-effects were equal most clinicians would prescribe drug A. If, however, drug A was worse in terms of side-effects with a NNT of 10 for significant adverse effects (of 10 patients treated with drug A rather than drug B one would develop significant adverse effects which they would not have done if treated with drug B) most clinicians would use drug B. Number needed to treat and how to interpret it is discussed further in Thompson [17] and Anderson [18].

Where effects are assessed on a rating scale, the outcome can be recalculated as a categorical variable by defining a level of improvement on that scale and calculating the percentage of patients who reach this threshold. A good example of such a calculation is given by Warner [3].
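The dichotomization described above can be sketched as follows. The threshold used here (a reduction of at least 50% from the baseline score, a common convention) and the patient scores are assumptions for illustration only.

```python
def response_rate(baseline_scores, endpoint_scores, reduction=0.5):
    """Fraction of patients whose rating-scale score fell by at least
    `reduction` (as a proportion of their baseline score)."""
    responders = sum(
        1 for b, e in zip(baseline_scores, endpoint_scores)
        if (b - e) / b >= reduction
    )
    return responders / len(baseline_scores)

# Invented baseline and endpoint scores for four patients: two of the
# four improve by at least 50%, giving a response rate of 0.5.
print(response_rate([28, 24, 30, 26], [10, 20, 14, 25]))  # → 0.5
```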

How precise is the estimate of treatment effect?

Confidence intervals give an estimate of the precision with which the NNT has been measured. The figures quoted are usually 95% confidence intervals (95% CI), which refer to the range of values which statistically the authors can be 95% sure will contain the true value. For instance, the NNT may be expressed as NNT = 7, 95% CI = 4–80. These figures are calculated from the figures for 'sustained response' on HAMD for venlafaxine and imipramine in the study of Benkert et al. [2]. The figures suggest that if seven patients were treated with the rapidly escalating schedule of venlafaxine rather than imipramine, we would expect one extra patient to respond according to their definition. The figures also say that the data give a 95% CI whereby to get an extra responder, between four and 80 patients would need to be treated with venlafaxine rather than imipramine. A larger sample would give rise to a narrower confidence interval (i.e. a more precise estimate of the true treatment effect).
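One common way to obtain such an interval (a simple normal-approximation, or Wald, interval on the risk difference, assumed here; trial authors may use other methods) is to compute the 95% CI for the absolute risk reduction and then invert its limits. The responder counts below are invented, chosen only so that the result is broadly similar in shape to the NNT = 7 (95% CI 4–80) example quoted above.

```python
import math

def nnt_with_ci(resp_a, n_a, resp_b, n_b, z=1.96):
    """Wald 95% CI for the absolute risk reduction (ARR), then invert
    the limits to obtain the NNT and its CI."""
    p_a, p_b = resp_a / n_a, resp_b / n_b
    arr = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    lo, hi = arr - z * se, arr + z * se
    # NNT = 1/ARR; the narrower ARR limit gives the larger (worse) NNT.
    return 1 / arr, 1 / hi, 1 / lo

# Invented counts: 60/100 responders on drug A versus 45/100 on drug B.
nnt, best, worst = nnt_with_ci(60, 100, 45, 100)
print(round(nnt, 1), round(best, 1), round(worst, 1))
```

Note how a modest 15-point difference in response rates still yields a wide interval for the NNT with only 100 patients per arm, illustrating why larger samples give more precise estimates.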

Inclusion and exclusion criteria: does the trial apply to my patient?

Every RCT investigates the comparative effects of two or more treatments on a patient group or population, defined by inclusion and exclusion criteria.

There are two extremes in defining groups. At one extreme a tightly defined homogeneous group (few inclusions and many exclusions) may be recruited and significant results reported. However, the results apply to very few patients seen in practice. It is important that those reading the results realize that they may not apply generally because of the narrow definition of the group studied. At the other end of the spectrum are trials with few exclusion criteria but the problem that significant effects of treatment in subgroups may be masked by a lack of effect in other groups. In a large enough trial, this may be overcome by subgroup analyses – in effect splitting the trial into several individual trials in different groups of patients. In reading a reported RCT it is important to examine the inclusion/exclusion criteria to determine whether the results can be applied to one's own clinical practice.

The group in Benkert et al. [2] fits our clinical scenario exactly because the inclusion criteria appear to specify a group (depressed, melancholic, inpatients) similar to our patient. Among the exclusions were patients with 'serious comorbid disease' (although 'serious' is not specified), 'those undergoing formal psychotherapy' and 'patients with a history of drug or alcohol dependence within 2 years'. Since our patient does not meet these criteria we can assume that the results may still apply to him.


Are the likely treatment benefits worth the potential harms and costs?

Having evaluated evidence presented in a RCT, clinicians must decide whether the results should influence practice. In the example presented, although there is little evidence of harm, neither is there convincing evidence of benefit, and the treatment (rapid titration of venlafaxine) was not evaluated against a current standard treatment. We have not changed our clinical practice as a result of this trial.

Websites

The Centre for Evidence Based Medicine in Oxford has a useful website and a vast number of links to other sites dealing with evidence-based medicine and critical appraisal. URL: http://minerva.minervation.com/cebm/

The Centre for Evidence Based Mental Health also has useful critical appraisal forms for a variety of studies. URL: http://www.cebmh.com/

The Journal of the American Medical Association Evidence-Based Practice Users' Guides are available at URL: http://www.cche.net/che/home.asp

References

1. Guyatt GH, Sackett DL, Cook DJ. Users' guides to the medical literature. II. How to use an article about therapy or prevention. A. Are the results of the study valid? Evidence-Based Medicine Working Group. Journal of the American Medical Association 1993; 270:2598–2601.

2. Benkert O, Grunder G, Wetzel H, Hackett D. A randomized, double-blind comparison of a rapidly escalating dose of venlafaxine and imipramine in inpatients with major depression and melancholia. Journal of Psychiatric Research 1996; 30:441–451.

3. Warner JP. Evidence-based psychopharmacology 1. Appraising a single therapeutic trial: what is the evidence for treating early Alzheimer's disease with donepezil? Journal of Psychopharmacology 1999; 13:308–312.

4. Lawrie SM. Randomised controlled trial. In: Brown T, Wilkinson G, eds. Critical reviews in psychiatry. London: Gaskell, 2000, 91–99.

5. Montgomery SA, Asberg M. A new depression scale designed to be sensitive to change. British Journal of Psychiatry 1979; 134:382–389.

6. Schatzberg AF, Kraemer HC. Use of placebo control groups in evaluating efficacy of treatment of unipolar major depression. Biological Psychiatry 2000; 47:736–744.

7. Quitkin FM, Rabkin JG, Gerald J, Davis JM, Klein DF. Validity of clinical trials of antidepressants. American Journal of Psychiatry 2000; 157:327–337.

8. Trivedi MH, Rush H. Does a placebo run-in or a placebo treatment cell affect the efficacy of antidepressant medications? Neuropsychopharmacology 1994; 11:33–43.

9. Miller FG. Placebo-controlled trials in psychiatric research: an ethical perspective. Biological Psychiatry 2000; 47:707–716.

10. Anderson IM. Selective serotonin reuptake inhibitors versus tricyclic antidepressants: a meta-analysis of efficacy and tolerability. Journal of Affective Disorders 2000; 58:19–36.

11. Healy D. Psychopharmacology and the ethics of resource allocation. British Journal of Psychiatry 1993; 162:23–29; discussion 29–37.

12. McKenna PJ, Bailey P. The strange story of clozapine. British Journal of Psychiatry 1993; 162:32–37.

13. Hamilton M. A rating scale for depression. Journal of Neurology, Neurosurgery and Psychiatry 1960; 23:56–62.

14. Balance. The balance trial. [Cited 4 March 2003.] Available from URL: http://cebmh.warne.ox.ac.uk/balance/index.html

15. Swann AC, Bowden CL, Morris D et al. Depression during mania. Treatment response to lithium or divalproex. Archives of General Psychiatry 1997; 54:37–42.

16. Chatellier G, Zapletal E, Lemaitre D, Menard J, Degoulet P. The number needed to treat: a clinically useful nomogram in its proper context. British Medical Journal 1996; 312:426–429.

17. Thompson C. Amitriptyline. Still efficacious, but at what cost? British Journal of Psychiatry 2001; 178:99–100.

18. Anderson I. Lessons to be learnt from meta-analyses of newer versus older antidepressants. Advances in Psychiatric Treatment 1997; 3:58–63.