Diagnostic Test When No Gold Standard

Evaluation of diagnostic tests when there is no gold standard. A review of methods
AWS Rutjes, JB Reitsma, A Coomarasamy, KS Khan and PMM Bossuyt
Health Technology Assessment 2007; Vol. 11: No. 50
NHS R&D HTA Programme, www.hta.ac.uk
December 2007

Description

This article describes how to analyse data from a diagnostic test when a 'perfect' gold standard is not available.


The National Coordinating Centre for Health Technology Assessment, Mailpoint 728, Boldrewood, University of Southampton, Southampton, SO16 7PX, UK.
Fax: +44 (0) 23 8059 5639
Email: [email protected]
http://www.hta.ac.uk
ISSN 1366-5278

Feedback

The HTA Programme and the authors would like to know your views about this report. The Correspondence Page on the HTA website (http://www.hta.ac.uk) is a convenient way to publish your comments. If you prefer, you can send your comments to the address below, telling us whether you would like us to transfer them to the website. We look forward to hearing from you.

December 2007


Evaluation of diagnostic tests when there is no gold standard

Copyright notice
© Queen's Printer and Controller of HMSO 2007. HTA reports may be freely reproduced for the purposes of private research and study and may be included in professional journals provided that suitable acknowledgement is made and the reproduction is not associated with any form of advertising. Violations should be reported to [email protected]. Applications for commercial reproduction should be addressed to HMSO, The Copyright Unit, St Clements House, 2–16 Colegate, Norwich NR3 1BQ.

How to obtain copies of this and other HTA Programme reports. An electronic version of this publication, in Adobe Acrobat format, is available for downloading free of charge for personal use from the HTA website (http://www.hta.ac.uk). A fully searchable CD-ROM is also available (see below).

Printed copies of HTA monographs cost £20 each (post and packing free in the UK) to both public and private sector purchasers from our Despatch Agents.

Non-UK purchasers will have to pay a small fee for post and packing. For European countries the cost is £2 per monograph and for the rest of the world £3 per monograph.

You can order HTA monographs from our Despatch Agents:

– fax (with credit card or official purchase order)
– post (with credit card or official purchase order or cheque)
– phone during office hours (credit card only).

Additionally, the HTA website allows you either to pay securely by credit card or to print out your order and then post or fax it.

Contact details are as follows:
HTA Despatch, c/o Direct Mail Works Ltd, 4 Oakwood Business Centre, Downley, HAVANT PO9 2NP, UK
Email: [email protected]
Tel: 02392 492 000
Fax: 02392 478 555
Fax from outside the UK: +44 2392 478 555

NHS libraries can subscribe free of charge. Public libraries can subscribe at a very reduced cost of £100 for each volume (normally comprising 30–40 titles). The commercial subscription rate is £300 per volume. Please see our website for details. Subscriptions can only be purchased for the current or forthcoming volume.

Payment methods

Paying by cheque

If you pay by cheque, the cheque must be in pounds sterling, made payable to Direct Mail Works Ltd and drawn on a bank with a UK address.

Paying by credit card

The following cards are accepted by phone, fax, post or via the website ordering pages: Delta, Eurocard, Mastercard, Solo, Switch and Visa. We advise against sending credit card details in a plain email.

Paying by official purchase order

You can post or fax these, but they must be from public bodies (i.e. NHS or universities) within the UK. We cannot at present accept purchase orders from commercial companies or from outside the UK.

How do I get a copy of HTA on CD?

Please use the form on the HTA website (www.hta.ac.uk/htacd.htm). Or contact Direct Mail Works (see contact details above) by email, post, fax or phone. HTA on CD is currently free of charge worldwide.

The website also provides information about the HTA Programme and lists the membership of the various committees.


Evaluation of diagnostic tests when there is no gold standard. A review of methods

AWS Rutjes,1 JB Reitsma,1 A Coomarasamy,2 KS Khan2* and PMM Bossuyt1

1 Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center, University of Amsterdam, The Netherlands

2 Division of Reproductive and Child Health, University of Birmingham, UK

* Corresponding author

Declared competing interests of authors: none

Published December 2007

This report should be referenced as follows:

Rutjes AWS, Reitsma JB, Coomarasamy A, Khan KS, Bossuyt PMM. Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technol Assess 2007;11(50).

Health Technology Assessment is indexed and abstracted in Index Medicus/MEDLINE, Excerpta Medica/EMBASE and Science Citation Index Expanded (SciSearch®) and Current Contents®/Clinical Medicine.

NIHR Health Technology Assessment Programme

The Health Technology Assessment (HTA) programme, now part of the National Institute for Health Research (NIHR), was set up in 1993. It produces high-quality research information on the costs, effectiveness and broader impact of health technologies for those who use, manage and provide care in the NHS. 'Health technologies' are broadly defined to include all interventions used to promote health, prevent and treat disease, and improve rehabilitation and long-term care, rather than settings of care.

The research findings from the HTA Programme directly influence decision-making bodies such as the National Institute for Health and Clinical Excellence (NICE) and the National Screening Committee (NSC). HTA findings also help to improve the quality of clinical practice in the NHS indirectly in that they form a key component of the 'National Knowledge Service'.

The HTA Programme is needs-led in that it fills gaps in the evidence needed by the NHS. There are three routes to the start of projects.

First is the commissioned route. Suggestions for research are actively sought from people working in the NHS, the public and consumer groups and professional bodies such as royal colleges and NHS trusts. These suggestions are carefully prioritised by panels of independent experts (including NHS service users). The HTA Programme then commissions the research by competitive tender.

Secondly, the HTA Programme provides grants for clinical trials for researchers who identify research questions. These are assessed for importance to patients and the NHS, and scientific rigour.

Thirdly, through its Technology Assessment Report (TAR) call-off contract, the HTA Programme commissions bespoke reports, principally for NICE, but also for other policy-makers. TARs bring together evidence on the value of specific technologies.

Some HTA research projects, including TARs, may take only months, others need several years. They can cost from as little as £40,000 to over £1 million, and may involve synthesising existing evidence, undertaking a trial, or other research collecting new data to answer a research problem.

The final reports from HTA projects are peer-reviewed by a number of independent expert referees before publication in the widely read monograph series Health Technology Assessment.

Criteria for inclusion in the HTA monograph series

Reports are published in the HTA monograph series if (1) they have resulted from work for the HTA Programme, and (2) they are of a sufficiently high scientific quality as assessed by the referees and editors. Reviews in Health Technology Assessment are termed 'systematic' when the account of the search, appraisal and synthesis methods (to minimise biases and random errors) would, in theory, permit the replication of the review by others.

The research reported in this monograph was commissioned by the National Coordinating Centre for Research Methodology (NCCRM), and was formally transferred to the HTA Programme in April 2007 under the newly established NIHR Methodology Panel. The HTA Programme project number is 06/90/23. The contractual start date was in October 2005. The draft report began editorial review in March 2007 and was accepted for publication in April 2007. The commissioning brief was devised by the NCCRM, who specified the research question and study design. The authors have been wholly responsible for all data collection, analysis and interpretation, and for writing up their work. The HTA editors and publisher have tried to ensure the accuracy of the authors' report and would like to thank the referees for their constructive comments on the draft document. However, they do not accept liability for damages or losses arising from material published in this report.

The views expressed in this publication are those of the authors and not necessarily those of the HTA Programme or the Department of Health.

Editor-in-Chief: Professor Tom Walley
Series Editors: Dr Aileen Clarke, Dr Peter Davidson, Dr Chris Hyde, Dr John Powell, Dr Rob Riemsma and Professor Ken Stein
Programme Managers: Sarah Llewellyn Lloyd, Stephen Lemon, Kate Rodger, Stephanie Russell and Pauline Swinburne

ISSN 1366-5278

© Queen's Printer and Controller of HMSO 2007. This monograph may be freely reproduced for the purposes of private research and study and may be included in professional journals provided that suitable acknowledgement is made and the reproduction is not associated with any form of advertising. Applications for commercial reproduction should be addressed to: NCCHTA, Mailpoint 728, Boldrewood, University of Southampton, Southampton, SO16 7PX, UK.

Published by Gray Publishing, Tunbridge Wells, Kent, on behalf of NCCHTA. Printed on acid-free paper in the UK by St Edmundsbury Press Ltd, Bury St Edmunds, Suffolk.

Objective: To generate a classification of methods to evaluate medical tests when there is no gold standard.

Methods: Multiple search strategies were employed to obtain an overview of the different methods described in the literature, including searches of electronic databases, contacting experts for papers in personal archives, exploring databases from previous methodological projects and cross-checking of reference lists of useful papers already identified.

Results: All methods available were classified into four main groups. The first method group, impute or adjust for missing data on reference standard, needs careful attention to the pattern and fraction of missing values. The second group, correct imperfect reference standard, can be useful if there is reliable information about the degree of imperfection of the reference standard and about the correlation of the errors between the index test and the reference standard. The third group of methods, construct reference standard, have in common that they combine multiple test results to construct a reference standard outcome, including deterministic predefined rules, consensus procedures and statistical modelling (latent class analysis). In the final group, validate index test results, the diagnostic test accuracy paradigm is abandoned and research examines, using a number of different methods, whether the results of an index test are meaningful in practice, for example by relating index test results to other relevant clinical characteristics and future clinical events.

Conclusions: The majority of methods try to impute, adjust or construct a reference standard in an effort to obtain the familiar diagnostic accuracy statistics, such as sensitivity and specificity. In situations that deviate only marginally from the classical diagnostic accuracy paradigm, these are valuable methods. However, in situations where an acceptable reference standard does not exist, applying the concept of clinical test validation can provide a significant methodological advance. All methods summarised in this report need further development. Some methods, such as the construction of a reference standard using panel consensus methods and validation of tests outwith the accuracy paradigm, are particularly promising but are lacking in methodological research. These methods deserve particular attention in future research.


Abstract

Evaluation of diagnostic tests when there is no gold standard. A review of methods

AWS Rutjes,1 JB Reitsma,1 A Coomarasamy,2 KS Khan2* and PMM Bossuyt1

1 Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center, University of Amsterdam, The Netherlands

2 Division of Reproductive and Child Health, University of Birmingham, UK

* Corresponding author


Contents

List of abbreviations
Executive summary
1 Background
   Introduction and scope of the report
   Key concepts in diagnostic accuracy studies
   Problems in verification
2 Aims of the project
   Aims
   Outline of the report
3 Methods
   Literature search and inclusion of papers
   Assessment of individual studies
   Overall classification of methods
   Structured summary of individual methods
   Expert review on individual methods
   Development of research guidance
   Peer review of report
4 Results
   Search results and selection of studies
   Classification of methods
   Impute or adjust for missing data on reference standard
   Correct imperfect reference standard
   Construct reference standard
   Validate index test results
5 Guidance and discussion
   Guidance for researchers
   Limits to accuracy: towards a validation paradigm
   Recommendations for further research
Acknowledgements
References
Appendix 1 Search terms in databases
Appendix 2 Experts in peer review process
Health Technology Assessment reports published to date
Health Technology Assessment Programme

List of abbreviations

All abbreviations that have been used in this report are listed here unless the abbreviation is well known (e.g. NHS), it has been used only once, or it is a non-standard abbreviation used only in figures/tables/appendices, in which case the abbreviation is defined in the figure legend or at the end of the table.

BCG Bacillus Calmette–Guérin
CI confidence interval
COPD chronic obstructive pulmonary disease
CT computed tomography
ED emergency department
EIA enzyme immunoassay
DOR diagnostic odds ratio
FN false negative result
FP false positive result
LR likelihood ratio
LTBI latent tuberculosis infection
MAR missing at random
MCAR missing completely at random
MRI magnetic resonance imaging
NMAR not missing at random
NPV negative predictive value
NT-proBNP N-terminal pro-brain natriuretic peptide
PCR polymerase chain reaction
PE pulmonary embolism
PET positron emission tomography
PPV positive predictive value
RCT randomised controlled trial
ROC receiver operating characteristic
TN true negative result
TP true positive result
TST tuberculin skin test


Executive summary

Background

The classical diagnostic accuracy paradigm is based on studies that compare the results of the test under evaluation (index test) with the results of the reference standard, the best available method to determine the presence or absence of the condition or disease of interest. Accuracy measures express how the results of the test under evaluation agree with the outcome of the reference standard. Determining accuracy is a key step in the health technology assessment of medical tests.

Researchers evaluating the diagnostic accuracy of a test often encounter situations where the reference standard is not available in all patients, where the reference standard is imperfect or where there is no accepted reference standard. We use the term 'no gold standard situations' to refer to all those situations. Several solutions have been proposed in these circumstances. Most articles dealing with imperfect or absent reference standards focus on one type of solution and discuss the strengths and limitations of that approach. Few authors have compared different approaches or provided guidelines on how to proceed when faced with an imperfect reference standard.

Objectives

We systematically searched the literature for methods that have been proposed and/or applied in situations without a 'gold' standard, that is, a reference standard that is without error. Our project had the following aims:

1. To generate an overview and classification of methods that have been proposed to evaluate medical tests when there is no gold standard.
2. To describe the main methods, discussing rationale, assumptions, strengths and weaknesses.
3. To describe and explain examples from the literature that applied one or more of the methods in our overview.
4. To provide general guidance to researchers facing research situations where there is no gold standard.

Methods

We employed multiple search strategies to obtain an overview of the different methods described in the literature, including searches of electronic databases, contacting experts for papers in personal archives, exploring databases from previous methodological projects (STARD and QUADAS) and cross-checking of reference lists of useful papers already identified.

We developed a classification for the methods identified through our review, taking into account the degree to which they represented a departure from the classical diagnostic accuracy paradigm.

For each method in our overview, we prepared a structured summary based on all or the most informative papers, describing its rationale, its strengths and weaknesses, its field of application, available software and illustrative examples of the method.

Based on the findings of our review, discussions about the pros and cons of different methods in various situations within the research team, and input from expert peer reviewers, we constructed a flowchart providing general guidance to researchers faced with evaluation of tests without a gold standard.

Results

From 2200 references initially checked for their usefulness, we ultimately included 189 relevant articles that were subsequently used to classify and summarise all methods into four main groups, as follows.

Impute or adjust for missing data on reference standard

In this group of methods, there is an acceptable reference standard, but for various reasons the outcome of the reference standard is not obtained in all patients. Methods in this group either impute or adjust for this missing information in the subset of patients without a reference standard outcome. Researchers should be careful with these methods if (1) the pattern of missing values is not determined by the study design, but is influenced by the choices of patients and physicians, or (2) the fraction of patients verified with the reference standard is small within results of the index tests.
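One well-known adjustment in this group is the Begg and Greenes correction for partial verification. The sketch below is an illustration of that idea, not code from the report; it assumes verification depends only on the index test result (missing at random given the test result), and all counts are hypothetical:

```python
def begg_greenes(n_pos, n_neg, ver_pos, d_pos, ver_neg, d_neg):
    """Correct sensitivity and specificity for partial verification,
    assuming verification depends only on the index test result
    (the Begg & Greenes MAR assumption).

    n_pos, n_neg   : all patients with a positive/negative index test
    ver_pos, d_pos : verified index-test-positives, and how many of
                     them had the target condition
    ver_neg, d_neg : the same for index-test-negatives
    """
    p_pos = d_pos / ver_pos        # P(condition | T+, verified)
    p_neg = d_neg / ver_neg        # P(condition | T-, verified)
    # Project the verified-subset disease probabilities onto ALL patients
    tp = n_pos * p_pos
    fp = n_pos * (1 - p_pos)
    fn = n_neg * p_neg
    tn = n_neg * (1 - p_neg)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity
```

For example, with 200 test-positives (50 verified, 40 diseased) and 800 test-negatives (30 verified, 3 diseased), the corrected sensitivity is 160/240 ≈ 0.67, well below the naive 40/43 ≈ 0.93 obtained from the verified subset alone.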

Correct imperfect reference standard

In this group, there is a preferred reference standard, but this standard is known to be imperfect. Solutions from this group either adjust estimates of accuracy or perform sensitivity analysis to examine the impact of this imperfect reference standard. The adjustment is based on external data (previous research) about the degree of imperfection. Correction methods can be useful if there is reliable information about the degree of imperfection of the reference standard and about the correlation of the errors between the index test and the reference standard.
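As an illustration of this kind of adjustment (my own sketch, not a formula prescribed by the report), the Rogan–Gladen prevalence correction can be applied within each index-test stratum. It assumes the reference standard's sensitivity and specificity are known from external data and that its errors are independent of the index test given disease status:

```python
def correct_for_imperfect_reference(a, b, c, d, se_ref, sp_ref):
    """Adjust index-test accuracy for a known-imperfect reference standard.

    a, b, c, d     : observed cell counts (index+/ref+, index+/ref-,
                     index-/ref+, index-/ref-)
    se_ref, sp_ref : assumed sensitivity/specificity of the reference
                     standard (from external data); reference errors are
                     assumed conditionally independent of the index test.
    """
    def true_prev(apparent):
        # Rogan-Gladen estimator: true prevalence from apparent prevalence
        return (apparent + sp_ref - 1) / (se_ref + sp_ref - 1)

    n_pos, n_neg = a + b, c + d
    diseased_pos = true_prev(a / n_pos) * n_pos  # truly diseased among index+
    diseased_neg = true_prev(c / n_neg) * n_neg  # truly diseased among index-
    diseased = diseased_pos + diseased_neg
    sensitivity = diseased_pos / diseased
    specificity = (n_neg - diseased_neg) / (n_pos + n_neg - diseased)
    return sensitivity, specificity
```

With expected cell counts generated from a hypothetical population (30% prevalence, index test sensitivity 0.8 and specificity 0.9, reference sensitivity 0.9 and specificity 0.95), the function recovers the index-test values that naive analysis against the imperfect reference would distort.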

Construct reference standard

These methods have in common that they combine multiple test results to construct a reference standard outcome. Groups of patients receive either different tests (differential verification and discrepant analysis) or the same set of tests, after which these results are combined by: (1) a deterministic predefined rule (composite reference standard); (2) a consensus procedure among experts (panel diagnosis); or (3) a statistical model based on actual data (latent class analysis). The prespecified rule for the target condition makes the composite reference standard method transparent and easy to use, but misclassification of patients is likely to remain. Discrepant analysis should not be considered in general, as the method is likely to produce biased results. The drawback of latent class models is that the target condition is not defined in a clinical way, so there can be a lack of clarity about what the results stand for in practice. Panel diagnosis also combines multiple pieces of information, but experts may combine these items in a manner that more closely reflects their own personal concept of the target condition.
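A composite reference standard built from a deterministic predefined rule can be as simple as "target condition present if any component test is positive". The minimal sketch below is illustrative only (the function name and rules are mine, not the report's):

```python
from typing import Sequence

def composite_reference(component_results: Sequence[bool], rule: str = "any") -> bool:
    """Combine component test results into a reference standard outcome
    using a predefined deterministic rule.

    rule = "any": positive if at least one component test is positive
    rule = "all": positive only if every component test is positive
    """
    if rule == "any":
        return any(component_results)
    if rule == "all":
        return all(component_results)
    raise ValueError(f"unknown rule: {rule}")
```

Because the rule is fixed before the data are analysed, the method stays transparent and reproducible, although, as noted above, misclassification of patients is likely to remain.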

Validate index test results

The diagnostic test accuracy paradigm is abandoned in this group and index test results are related to other relevant clinical characteristics. An important category is relating index test results to future clinical events, such as the number of events in those who tested negative on the index test. Test results can also be used in a randomised study to see whether the test can predict who will benefit more from one intervention than the other. Because the classical accuracy paradigm is not employed, measures other than accuracy measures are calculated, including event rates, relative risks and other correlation statistics.
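For instance (an illustrative sketch, not taken from the report), relating index test results to a subsequent clinical event during follow-up yields event rates and a relative risk rather than sensitivity and specificity:

```python
def event_rate_analysis(events_pos, n_pos, events_neg, n_neg):
    """Summarise follow-up events by index test result: event rates in
    test-positives and test-negatives, and their relative risk."""
    rate_pos = events_pos / n_pos
    rate_neg = events_neg / n_neg
    relative_risk = rate_pos / rate_neg
    return rate_pos, rate_neg, relative_risk
```

A very low event rate among test-negatives (say, 2 events in 400 patients, 0.5%) can support withholding treatment in that group even though sensitivity itself is never estimated.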

Conclusions

The majority of methods try to impute, adjust or construct a reference standard in an effort to obtain the familiar diagnostic accuracy statistics, such as pairs of sensitivity and specificity or likelihood ratios. In situations that deviate only marginally from the classical diagnostic accuracy paradigm, for example where there are few missing values on an otherwise acceptable reference standard or where the magnitude and type of imperfection in a reference standard is well documented, these are valuable methods. However, in situations where an acceptable reference standard does not exist, holding on to the accuracy paradigm is less fruitful. In these situations, applying the concept of clinical test validation can provide a significant methodological advance. Validating a test means that scientists and practitioners examine, using a number of different methods, whether the results of an index test are meaningful in practice. Validation will always be a gradual process. It will involve the scientific and clinical community defining a threshold, a point in the validation process, whereby the information gathered would be considered sufficient to allow clinical use of the test with confidence.

Recommendations for further research

All methods summarised in this report need further development. Some methods, such as the construction of a reference standard using panel consensus methods and validation of tests outwith the accuracy paradigm, are particularly promising but are lacking in methodological research. These methods deserve particular attention in future research.


Chapter 1

Background

"Accuracy is telling the truth … Precision is telling the same story over and over again."

Yiding Wang

Introduction and scope of the report

As with all other elements of healthcare, medical tests should be thoroughly evaluated in high-quality studies. Biased results from poorly designed, conducted or analysed studies may trigger premature dissemination and implementation of a medical test and mislead physicians to incorrect decisions regarding the care for an individual patient.1–3 Avoidance of these perils requires a proper evaluation of medical tests.4

Several authors have proposed a staged model for the evaluation of medical tests.5–7 Technical evaluations dominate in the early phases, in which the reproducibility under different conditions of biochemical tests and the intra- and inter-observer variation of tests are evaluated. A key phase in the clinical evaluation of a test is determining its diagnostic accuracy: the ability to discriminate between patients who have the condition of interest (target condition) and those who have not.8 The target condition can refer to a disease, syndrome or any other identifiable condition that may prompt clinical actions such as further diagnostic testing, or the initiation, modification or termination of treatment.

By themselves, accuracy studies cannot always answer the question whether a medical test is useful or not. More informative accuracy studies can be designed by taking into account the likely future role of the test under evaluation. Three possible roles are replacement, addition or triage.9

In comparative accuracy studies, the accuracy of the test under evaluation is compared against that of existing diagnostic pathways, leading to more informative and possibly more efficient diagnostic accuracy studies.9

The clinical value of a test will ultimately depend on whether it is able to improve patient outcome. In most cases this will be by guiding subsequent decision-making. Accuracy studies may not be sufficient to evaluate the clinical value of a test, especially if the new test is more sensitive than the existing test(s).10 The reason is that the results from current intervention studies may not apply to those additional cases detected. Results from randomised studies assessing response to therapy in these additional cases are then required. Later stage evaluation studies may focus on determining the societal costs and benefits of a test.

The focus of this report is on problems related to diagnostic accuracy studies. The key challenge in diagnostic accuracy is to determine in all patients whether the target condition is present or absent. The reference standard should provide this classification. As such, the reference standard plays a crucial role in accuracy studies.

Problems with the reference standard abound in diagnostic accuracy studies (http://www.fda.gov/cdrh/osb/guidance/1428.pdf). The outcome of the reference standard may not be available in all patients, it may be unreliable, it may be inaccurate or there could be no acceptable reference standard at all. As in any other form of epidemiological research, outcome data that are missing or misclassified pose a great threat to the validity of such studies, and diagnostic accuracy studies are no exception.11–15 We will use the term 'no gold standard situations' to refer loosely to all these situations where the outcome of the reference standard is missing, the reference standard is imperfect or there is no acceptable reference standard.

This report gives an overview of solutions that have been proposed to overcome no gold standard situations in diagnostic accuracy research. Because the scope of this report is limited to problems related to diagnostic accuracy studies, we will not discuss the broader issue of determining the most appropriate type of evaluation given a specific diagnostic research question.

Before explaining our methods (Chapter 3) and reporting our results (Chapter 4), in the next section we first explain the key features of diagnostic accuracy studies and in the subsequent section discuss the various mechanisms that can lead to no gold standard situations. In Chapter 5 we provide guidance on selecting the most appropriate method for a given situation by addressing some key questions. In addition, we sketch possible research alternatives in situations where the diagnostic accuracy paradigm is unlikely to be useful.

Key concepts in diagnostic accuracy studies

Diagnostic accuracy studies aim to measure the amount of agreement between index test results and the outcome of the reference standard. The classical outline of an accuracy study is given in Figure 1(a). The starting point is a consecutive series of individuals in whom the target condition is suspected. The index test is performed first in all subjects, and subsequently the presence or absence of the target condition is determined by the outcome of the reference standard. In the case of a dichotomous index test result, the results of an accuracy study can be summarised in a 2-by-2 table, as shown in Figure 1(b). Several measures of accuracy can be calculated from this table, as also shown in Figure 1(b).

The term accuracy has been borrowed from measurement theory, where it is defined as the closeness of agreement between an analytical measurement and its actual (true) value.16 The first publication mentioning accuracy and the associated statistics sensitivity and specificity to express the performance of a medical test was by Yerushalmy,17 followed by the landmark publication of Ledley and Lusted.18 In these publications, test results in patients known to have the disease of interest were compared with test results in subjects not having the disease. Since then, various other measures of accuracy have been introduced, including likelihood ratios, predictive values and the diagnostic odds ratio.4

All of these accuracy measures have in common that they need a classification of patients into those with and those without the condition of interest. In other words, index test results are verified by comparing them with the outcome of the reference standard. Based on this concept, we can formulate the properties of the ideal reference standard and verification procedure.

FIGURE 1 (a) Classical design of a diagnostic accuracy study: patients undergo the index test, then the reference standard, and the two results are cross-classified. (b) Results of an accuracy study in the case of a dichotomous index test result: a 2-by-2 table of index test result (+/–) against target condition (present/absent), with cells TP (true positive result), FP (false positive result), FN (false negative result) and TN (true negative result).

Accuracy measures:
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
LR+ = [TP/(TP + FN)]/[FP/(FP + TN)]
LR– = [FN/(FN + TP)]/[TN/(TN + FP)]
PPV = TP/(TP + FP)
NPV = TN/(TN + FN)
DOR = (TP/FN)/(FP/TN)
(LR, likelihood ratio; PPV, positive predictive value; NPV, negative predictive value; DOR, diagnostic odds ratio.)

The ideal verification protocol would fulfil the following criteria:

1. The reference standard provides error-free classification.
2. All index test results are verified by the same reference standard.
3. The index test and reference standard are performed at the same time, or within an interval that is short enough to eliminate changes in target condition status.
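The accuracy measures listed in Figure 1(b) can be computed directly from the 2-by-2 cell counts; the minimal sketch below uses hypothetical counts (the function name and numbers are illustrative, not from the report):

```python
def accuracy_measures(tp, fp, fn, tn):
    """Compute the standard accuracy measures from the 2-by-2 table of
    Figure 1(b): sensitivity, specificity, likelihood ratios,
    predictive values and the diagnostic odds ratio."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return {
        "sensitivity": se,
        "specificity": sp,
        "LR+": se / (1 - sp),        # = [TP/(TP+FN)] / [FP/(FP+TN)]
        "LR-": (1 - se) / sp,        # = [FN/(FN+TP)] / [TN/(TN+FP)]
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "DOR": (tp / fn) / (fp / tn),
    }
```

For example, with TP = 90, FP = 30, FN = 10 and TN = 270, sensitivity and specificity are both 0.9, the positive likelihood ratio is 9 and the diagnostic odds ratio is 81.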

Empirical studies have shown that estimates of diagnostic accuracy are directly influenced by the quality of the verification procedure.1,3,19

Problems in verification

In practice, researchers may encounter situations where the ideal verification procedure cannot be achieved. Based on the first two criteria for an ideal verification procedure, we will briefly discuss the key mechanisms by which verification procedures can produce errors.

Classification errors by the reference standard

Within the accuracy concept, the reference standard is judged by its performance in producing error-free classification with respect to the presence or absence of the target condition. Even if a reference standard has no analytical error, we consider it a 'failure' of the reference standard if the target condition does not produce the biochemical changes of interest. Other examples of imperfections of the reference standard are tumours that are missed because they are below the level of detection of a radiological reference standard, or an alternative condition that is misclassified as the target condition because it produces changes in a biomarker similar to those of the target condition.

One inherent difficulty in accuracy studies is that we use the dichotomy of target condition present or absent, whereas in reality the target condition varies from very early and minor changes to severe and advanced stages of the disease, thus covering a wide spectrum of disease. For some conditions, defining the lower end of disease is difficult. This can be illustrated with the example of appendicitis. In the majority of accuracy studies evaluating non-invasive imaging techniques for the diagnosis of appendicitis, histological

examination of the removed appendix is used as the reference standard. This requires defining a threshold for the type and amount of inflammatory changes above which we will classify a patient as having appendicitis. Different reference standards may apply a different threshold before classifying patients as having the target condition. In the appendicitis example, clinical follow-up, observing whether the patient's symptoms improve or deteriorate, will probably imply a higher threshold for disease, so that some patients are classified differently by these two reference standards; for example, some patients with inflammatory changes at histological examination would have recovered naturally without intervention. Researchers therefore need to think carefully about the right definition of the target condition and then choose the most appropriate reference standard in the light of that definition.

Additional sources of misclassification are failures in the reference standard protocol and interpretation errors by observers. These misclassification errors could be prevented by stricter adherence to protocol or better training of observers. Examples include failure to detect cancer cells after fine-needle aspiration because the biopsy was performed outside the tumour mass, or overlooking a small pulmonary embolism on spiral computed tomography (CT) images.

Furthermore, for several target conditions there is no reference standard based on histological or biochemical changes. In some of those cases, the condition is defined by a combination of symptoms and signs. Migraine is an example. The presence or absence of this and similar conditions has been based on criteria developed by individual researchers or on criteria established during a consensus meeting. Such classifications can vary over time or across countries, and cannot be error-free.

Given the many potential factors that can lead to errors in the classification of the target condition by a reference standard, a perfect reference standard (i.e. one providing error-free classification) is unlikely to exist in practice. These observations have driven the shift in terminology from 'gold standard', suggesting a standard without error, to the more neutral term 'reference standard', indicating the best available method. It means that researchers should always discuss the quality of the reference standard and the potential consequences of misclassification by the reference standard.

Health Technology Assessment 2007; Vol. 11: No. 50


© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

Whatever the cause of errors in the classification of target condition status by the reference standard, it will directly lead to changes in the 2-by-2 classification table (Figure 1b). Within the classic accuracy concept, any disagreement between the reference standard and the index test will be labelled as a 'false' result for the index test. The net effect of misclassification by the reference standard can be either an upward or a downward bias in estimates of diagnostic accuracy. The direction depends on whether errors by the index test and the imperfect reference standard are correlated. If errors are positively correlated, agreement in the 2-by-2 table will be erroneously increased and estimates of accuracy will be inflated.11–15 The magnitude of the biasing effect depends on the frequency of errors by the imperfect reference standard and the degree of correlation in errors between index test and reference standard.
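The effect described above can be illustrated with a small simulation (all parameter values are hypothetical): an index test with true sensitivity 0.90 and specificity 0.80 is scored against an imperfect reference, once with independent reference errors and once with a shared error source that makes index test and reference err on the same patients.

```python
import random

def apparent_accuracy(n, prev, se, sp, ref_err, shared, seed=1):
    """Apparent sensitivity/specificity of an index test judged against an
    imperfect reference standard. With probability `shared`, index test and
    reference draw on a common error source and are wrong (or right) together."""
    random.seed(seed)
    tp = fp = fn = tn = 0
    for _ in range(n):
        d = random.random() < prev                # true target condition status
        p_err = (1 - se) if d else (1 - sp)       # index test error probability
        if random.random() < shared:
            err = random.random() < p_err         # one error draw for both tests
            t = r = (d != err)
        else:
            t = d != (random.random() < p_err)
            r = d != (random.random() < ref_err)  # independent reference error
        if t and r: tp += 1
        elif t: fp += 1
        elif r: fn += 1
        else: tn += 1
    return tp / (tp + fn), tn / (tn + fp)

se_ind, sp_ind = apparent_accuracy(200_000, 0.3, 0.9, 0.8, 0.05, shared=0.0)
se_cor, sp_cor = apparent_accuracy(200_000, 0.3, 0.9, 0.8, 0.05, shared=0.5)
# independent errors bias accuracy downward (~0.82/0.78 vs the true 0.90/0.80);
# positively correlated errors inflate it (~0.92/0.88)
```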

Partial and differential verification

Even if a near-perfect reference standard exists, it may be impossible, unethical or too costly to apply this standard in all patients. In the case of target conditions that can produce multiple lesions needing histological verification, it is often impossible to verify (the countless) negative index test results. An example is 18-fluorodeoxyglucose positron emission tomography (PET) scanning to detect possible distant metastases before planning major curative surgery in patients with carcinoma of the oesophagus: only PET hot spots can be verified histologically.

Ethical reasons play a role in the choice of reference standard in pulmonary embolism. Angiography is still considered the best available method for the detection of pulmonary embolisms, but because of the frequency of serious

complications it is now considered unethical to perform this reference standard in low-risk patients, for instance in patients with a low clinical probability and a negative D-dimer result. Other reasons for missing data on the outcome of the reference standard are situations where the reference standard is temporarily unavailable or where patients and doctors decide to refrain from verification.

One solution is to leave the unverified patients out of the 2-by-2 table, which is referred to as partial verification (Figure 2b). Omitting unverified patients from the 2-by-2 table may generate bias. Its direction and magnitude will depend on (1) the fraction of patients that are unverified; (2) the ratio between the numbers of patients with positive and negative index test results that remain unverified; and (3) whether the reason for not verifying is related to the presence or absence of the target condition.20
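These factors can be made concrete with a small simulation (parameter values hypothetical): the probability of verification depends only on the index test result, and the complete-case 2-by-2 table is built from verified patients only.

```python
import random

def complete_case_accuracy(n, prev, se, sp, p_verify_pos, p_verify_neg, seed=2):
    """Sensitivity/specificity from a 2-by-2 table restricted to verified
    patients, where the chance of verification depends on the index test."""
    random.seed(seed)
    tp = fp = fn = tn = 0
    for _ in range(n):
        d = random.random() < prev
        t = (random.random() < se) if d else (random.random() >= sp)
        if random.random() >= (p_verify_pos if t else p_verify_neg):
            continue                    # unverified: left out of the table
        if t and d: tp += 1
        elif t: fp += 1
        elif d: fn += 1
        else: tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# True sensitivity 0.80 and specificity 0.90; 95% of test positives but only
# 10% of test negatives are verified:
se_cc, sp_cc = complete_case_accuracy(300_000, 0.2, 0.8, 0.9, 0.95, 0.10)
# the complete-case sensitivity is inflated (~0.97), specificity deflated (~0.49)
```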

Because omitting patients from the 2-by-2 classification table can lead to bias, researchers have strived to obtain complete verification by applying an alternative reference standard in the initially unverified patients. The use of different reference standards between patients is known as differential verification (Figure 2c). A fairly common example of differential verification is the use of follow-up as the alternative reference standard in patients not verified by the preferred reference standard. In these studies, clinical follow-up is used as a proxy for the true status at the moment of the index test; the term delayed-type cross-sectional accuracy study has been introduced for this design.8 In all studies with differential verification where the alternative reference standard provides imperfect classification, it may affect the 2-by-2


FIGURE 2 (a) Diagnostic accuracy study with complete verification by the same reference standard (classic design), (b) study with partial verification and (c) study with differential verification.

table and generate bias along the same lines as discussed in the previous section on errors in classification by the reference standard. Empirical studies have shown that studies with differential verification produce higher estimates of diagnostic accuracy than studies with complete verification by the preferred reference standard.1,3

Given these diverging reasons for imperfections in the verification procedure, it is not surprising that

different solutions have been suggested to remedy the effects of verification procedures that are not based on using a single gold standard in all patients. We have systematically searched the literature to identify and summarise the solutions that have been proposed. We have developed a classification of these methods based on their figurative distance, that is, the extent to which they depart from the classical diagnostic accuracy paradigm described in Figure 1.


Chapter 2

Aims of the project

Several methods have been proposed to deal with situations where the reference standard is partially unavailable or imperfect, or where there is no accepted reference standard. These methods vary from relatively simple correction methods based on the expected degree of imperfection of the reference standard, through more complex statistical models that construct a pseudo-reference standard, to methods that validate index test results by examining their association with other relevant clinical characteristics. Most articles dealing with imperfect or absent reference standards focus on one type of solution and discuss the strengths and limitations of that particular approach.21–23 Few authors have compared different approaches or have provided guidelines on how to proceed when faced with an imperfect reference standard24–26

(http://www.fda.gov/cdrh/osb/guidance/1428.pdf).

Aims

In this project, we systematically searched the literature for methods to be used in situations without a 'gold' standard. Our project had the following aims:

1. To generate an overview and classification of methods that have been proposed to evaluate medical tests when there is no gold standard.

2. To describe the main methods, discussing rationale, assumptions, strengths and weaknesses.

3. To describe and explain examples from the literature that applied one or more of the methods in our overview.

4. To provide general guidance to researchers facing research situations where there is no gold standard.

Outline of the report

Chapter 1 provides background information about the key concepts in diagnostic accuracy studies, in particular the role of the reference standard and the problems that can be encountered in the verification of index test results. The methods chapter (Chapter 3) describes the strategies that we employed to identify and assess methodological papers relevant for this project. The results chapter (Chapter 4) has the following structure: the first section presents the results of the literature search; an overview and classification of the different methods that we encountered is given in the second section; and a description of each of the methods, discussing strengths and limitations, is given in the third section. In Chapter 5 we provide general guidance to researchers faced with research situations where there is no gold standard. In addition, we discuss some alternative options for when repairing or holding on to the accuracy paradigm is not helpful.



Chapter 3

Methods

In our systematic review, we modified the recommendations set out by the Cochrane Collaboration and the NHS Centre for Reviews and Dissemination27 to make them more applicable for a review of methodological papers.

Our approach consisted of the following steps:

1. literature search and inclusion of papers
2. assessment of individual studies
3. overall classification of methods
4. structured summary of individual methods
5. expert review on individual methods
6. development of guidance statements
7. peer review of total report.

Literature search and inclusion of papers

Searching for methodological papers in electronic databases is difficult because of inconsistent indexing and the absence of a specific keyword for the relevant publication types. Several methods for conducting diagnostic research when there is no gold standard have been developed in other areas of epidemiological, biomedical or even non-medical research. Conducting a broad literature search to capture every single potentially relevant paper would be inappropriate, especially since precise estimation is not the objective of a review of methodological papers. Once a comprehensive set of papers about a methodological issue has been obtained, there is no additional value in reviewing further papers explaining the same concept. This is known as theoretical saturation,28 a principle that guided our search and selection.

Based on these considerations, we relied on multiple strategies to obtain an overview of the different methods described in the literature and to maximise the likelihood of identifying those papers that provide the most thorough and complete description:

● Restricted electronic searches in the following databases: MEDLINE (OVID and PubMed), EMBASE (OVID), MEDION, a database of diagnostic test reviews (www.mediondatabase.nl), and the Cochrane Library (DARE, CENTRAL, CMR, NHS). The exact terms of the search strategy are given in Appendix 1.
● Searching databases that have been established in earlier methodological projects, including STARD and QUADAS.
● Searching personal archives.
● Contacting other experts in the field of diagnostic research to establish whether they had additional relevant papers, especially aimed at retrieving articles within the grey literature.
● To locate additional information on specific methods, we checked reference lists, used the citation tracking option of SCISEARCH and applied the 'related articles' function of PubMed.

Inclusion of papers

Publications in English, German, Dutch, Italian and French were included. One researcher (AWSR) reviewed the identified studies for inclusion in the review, and this process was checked by another reviewer (JBR). The only reason for exclusion was that the article did not address a method that could be used in a no gold standard situation.

Assessment of individual studies

We extracted a limited set of standard items from each included paper. These included the strategy that identified the paper, the type of journal, the type of article and the type of method proposed. Other items were the rationale and structure of the article, a description of its usefulness and remarks about the overlap with other papers (all free-text fields). The information extracted in this way was used to organise papers within each method. The main function of this step was to categorise papers in order to facilitate the writing of a structured summary of each method.

Overall classification of methods

We classified all methods into groups that address diagnostic research in no gold standard situations in an analogous way. The purpose of this classification was to give an



overview of methods and to serve as a starting point for formulating guidance. The key element in the classification was the underlying mechanism leading to the absence of perfect reference standard information (see also Chapter 1).

Structured summary of individual methods

For each method in our overview, we made a structured summary based on all, or the most informative, papers describing that method. Each summary starts with a description of the method and its rationale, avoiding technical language where possible. Subsequent sections describe the strengths and weaknesses of the method; its field of application, in terms of the circumstances in which the method should or should not be used; available software; and an illustrative example of the method. One member of the research team (AWSR or AC) wrote the first draft of a summary, which was then reviewed and modified by another team member (JBR, PMMB, AC, KSK). These two reviewers continued exchanging drafts and held face-to-face or teleconference meetings until they agreed that the version was ready for expert review.

Expert review on individual methods

We assembled a list of potential reviewers outside our research team to review each method. These experts were selected based on their knowledge and/or expertise in relation to a specific method. The group included epidemiologists, statisticians and clinicians (see Appendix 2). Each expert was contacted by electronic mail or by telephone and asked to comment on at least one method. The experts received a brief description of the

project, its objectives and the summary of a specific method. They were asked to review this first version, paying particular attention to the following issues:

● Is the method accurately described?
● Are the main characteristics of the method described?
● Are key strengths and weaknesses mentioned?
● Are key references missing?
● Are tables and/or figures understandable?
● Do you have any other suggestions for how the summary can be improved?

Based on the comments from the experts, a final version of each method was prepared. Because of time constraints, the revised version was not sent back to the experts.

Development of research guidance

The research team held face-to-face meetings to discuss the pros and cons of the different methods in different situations. Prior to these meetings, we had contacted and interviewed a few general experts about their preferred solutions when faced with research situations where there is no gold standard. Based on the discussion, a first version was drafted and circulated among the research team members. Further comments were incorporated to produce a final version.

Peer review of report

The full report was reviewed by general experts identified by the research team and by peer reviewers assigned by the NHS Research Methodology programme.


Chapter 4

Results

Search results and selection of studies

The references retrieved by the electronic searches were assembled in a single database. After duplicate references had been deleted, the abstracts of 2265 references were assessed for eligibility. Of those, 134 were labelled potentially relevant. Twenty-three references were excluded because they could not be retrieved or because inspection of the full article revealed that the topic was not test evaluation. Contact with experts in the field resulted in an additional seven references to books and 50 references to articles. Reference checking at various stages of preparing the report yielded a total of 11 additional relevant articles.

As the number of references was limited for imputation methods and the panel consensus method, we conducted additional electronic

searches. Using the 'related articles' approach in PubMed, we identified 10 additional references relevant for imputation. For the panel consensus method, an additional search in PubMed was performed with the search term Delphi[textword], resulting in no relevant references. The final number of relevant articles included was 189. A flowchart of the retrieval and inclusion of studies is given in Figure 3.

Classification of methods

Many methods have been described for no gold standard situations. An exact number is difficult to provide because many methods can be viewed as variations or extensions of a single underlying approach. For this report, we classified the methods into four main groups (Table 1). These four groups differ in their distance, or extent of



[Figure 3: flowchart of the selection process.
Electronic search: potentially eligible publications, n = 2265; articles excluded, n = 2131.
Electronic search: initially included articles, n = 134; articles excluded (not addressing test evaluation, n = 15; not retrievable, n = 8).
Electronic search: final number of included articles, n = 111.
Personal database searches and expert contact: articles included, n = 50; books included, n = 7.
Additional searches, n = 10. Reference checking, n = 11.
Total number included, n = 189.]

FIGURE 3 Flowchart of the selection process of articles

departure from the classical diagnostic accuracy paradigm with a gold standard, as explained in Chapter 1. Our classification is neither exhaustive nor mutually exclusive. Researchers can also use a combination of methods within a single study.

In the first group of methods, Group A: Impute or adjust for missing data on reference standard, there is a reference standard providing adequate classification, but for various reasons the outcome of the reference standard is not obtained in all patients (see Chapter 1 for an overview of reasons). The methods in this group impute or adjust for the missing information in the subset of patients without a reference standard outcome. Methods within this group differ in the way in which they impute or adjust for this missingness.

In the second group, Group B: Correct for imperfections in the reference standard, there is a preferred reference standard, but this standard is known to be imperfect. Solutions in this group either adjust estimates of accuracy or perform sensitivity analyses to examine the impact of using the imperfect reference standard. The adjustment is based on external data (previous research) about the degree of imperfection.

Methods in the third group, Group C: Construct a reference standard, have in common that they combine multiple pieces of information (test results) to construct a reference standard outcome. The methods differ as to whether they use a predefined deterministic rule to classify patients as having the target condition (composite reference standard), retest only discordant results with a second reference standard (discrepant analysis) or combine the different tests through a statistical model (latent class analysis). Alternatively, a panel of experts can be used to determine the presence or absence of the target condition in each patient.
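As a sketch of the latent class idea, the following simulates three tests that are conditionally independent given the (unobserved) target condition and recovers prevalence, sensitivities and specificities with a simple two-class EM algorithm, without any reference standard. All parameter values are hypothetical, and a real analysis would have to scrutinise the conditional independence assumption.

```python
import random

random.seed(3)

# Simulate three tests, conditionally independent given the latent class
prev, true_se, true_sp = 0.3, [0.9, 0.8, 0.85], [0.95, 0.9, 0.8]
data = []
for _ in range(5000):
    d = random.random() < prev
    data.append(tuple(
        (random.random() < true_se[j]) if d else (random.random() >= true_sp[j])
        for j in range(3)
    ))

# Two-class latent class model fitted by EM
p, est_se, est_sp = 0.5, [0.8] * 3, [0.8] * 3
for _ in range(300):
    # E-step: posterior probability that each patient has the target condition
    post = []
    for t in data:
        l1, l0 = p, 1 - p
        for j in range(3):
            l1 *= est_se[j] if t[j] else 1 - est_se[j]
            l0 *= (1 - est_sp[j]) if t[j] else est_sp[j]
        post.append(l1 / (l1 + l0))
    # M-step: re-estimate prevalence, sensitivities and specificities
    s = sum(post)
    p = s / len(data)
    for j in range(3):
        est_se[j] = sum(w for w, t in zip(post, data) if t[j]) / s
        est_sp[j] = sum(1 - w for w, t in zip(post, data) if not t[j]) / (len(data) - s)
# p, est_se and est_sp should now lie close to the simulated values
```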

The diagnostic test accuracy paradigm is abandoned in the fourth group, Group D: Validate index test results. In these studies, index test results are related to other relevant clinical characteristics. An important category is relating index test results to future clinical events, such as the number of events in those who tested negative on the index test, or a randomised comparison between testing and non-testing. Any relevant clinical information could be used to validate index test results, including cross-sectional or historical data. Because this group departs completely from the classical accuracy paradigm,


TABLE 1 Classification of methods for diagnostic research where there is no gold standard

A. Impute or adjust for missing data on reference standard — Impute the outcome of the reference standard in those patients who did not receive verification by the reference standard, or adjust estimates of accuracy based on complete cases. (Section A)

B. Correct imperfect reference standard — Correct estimates of accuracy or perform sensitivity analysis to examine the impact of using an imperfect reference standard, based on external data about the degree of imperfection. (Section B)

C. Construct reference standard (subdivisions: differential verification; discrepant analysis; composite reference standard; panel or consensus diagnosis; latent class analysis) — Information from different tests is combined to construct the reference standard outcome. Groups of patients receive either different tests (differential verification and discrepant analysis) or the same set of tests, after which the results are combined by: (a) a deterministic predefined rule (composite reference standard); (b) a consensus procedure among experts (panel diagnosis); (c) a statistical model based on actual data (latent class analysis). (Sections C–H)

D. Validate index test results (subdivision: examine patient outcomes) — Explore meaningful relations between index test results and other relevant clinical characteristics. An important way to validate is to use dedicated follow-up to capture clinical events of interest in relation to index test results, including randomised diagnostic studies. (Section I)

Sections in Chapter 4: A, 'Impute or adjust for missing data on reference standard' (p. 13); B, 'Correct imperfect reference standard' (p. 16); C, 'Construct reference standard' (p. 18); D, 'Differential verification' (p. 18); E, 'Discrepant analysis' (p. 19); F, 'Composite reference standard' (p. 21); G, 'Panel or consensus diagnosis' (p. 24); H, 'Latent class analysis' (p. 26); I, 'Validate index test results' (p. 29).

measures other than accuracy measures are calculated, including event rates, relative risks and correlation statistics.

Impute or adjust for missing data on reference standard

In some studies, an accepted reference standard providing adequate classification is available, but for a variety of reasons not all patients receive this standard, leading to missing data on whether the target condition is present or absent. The general problem of missing data is well established in epidemiology and biostatistics, and many statistical methods have been developed to deal with missing data.29 Most of the literature focuses on missing values of predictors, but a fair amount of literature is available on missing data for outcome variables, including papers dealing with diagnostic research.22,23,30–41 Missing data on the reference standard have been labelled partial or incomplete verification in the diagnostic literature, and the associated bias is known as partial verification bias or sequential ordering bias.19,20

Imputation methods use a mathematical function to fill in each missing value, whereas correction methods use a mathematical function to correct the indexes of accuracy directly. Three main patterns of partial verification or missingness have been described: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR).29

In MCAR, missing values are unrelated to the status of the target condition, the index test results and other patient characteristics. The occurrence is a truly random process, and therefore the frequency of missing values is similar among patients with positive and negative index test results. This situation can occur when the reference standard is unavailable because of technical failures.

More often, missing values for the reference standard do not occur completely at random, but are related to the results of the index test (more frequent in patients with a negative index test result) and to other patient characteristics. If the mechanisms that have led to the missing values on the reference standard are known and observed, the technical term 'missing at random' (MAR) is used. This term is used because, within strata of the mechanisms leading to missing values, the pattern is random. This also forms the

basis for predicting and subsequently imputing missing values.

If the pattern of missing values is related to the target condition through mechanisms that we have not observed, it is called NMAR. This situation can occur when incomplete verification is not design based, but determined by the choices of patients and physicians. In these situations, the mechanisms leading to missing values are difficult to determine, especially when additional patient characteristics such as symptoms, signs or previous test results have not been recorded.

Imputation methods

Imputation methods comprise two phases: an imputation phase, in which each missing value is replaced, and an analysis phase, in which estimates of sensitivity and specificity are computed based on the now complete dataset. Many variants of imputation are possible, ranging from single imputation of missing values to multiple imputation.42–46 Instead of filling in a single value for each missing value, a multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the correct value to impute. These multiply imputed datasets are then analysed (one by one) by standard procedures for complete datasets. In a next step, the results of these analyses are combined to produce estimates and confidence intervals (CIs) that properly reflect the uncertainty due to missing values.
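A minimal sketch of multiple imputation for a missing reference standard outcome, assuming verification depended only on the index test result (MAR). Each missing outcome is imputed from a disease probability drawn within its index test stratum, so that the imputations reflect estimation uncertainty, and sensitivity is then averaged over the completed datasets. All numbers are illustrative.

```python
import random

random.seed(4)

# Simulate (index_test_positive, reference_outcome-or-None) records where the
# true sensitivity is 0.8 and verification depends only on the index test
records = []
for _ in range(50_000):
    d = random.random() < 0.2
    t = (random.random() < 0.8) if d else (random.random() >= 0.9)
    verified = random.random() < (0.95 if t else 0.10)
    records.append((t, d if verified else None))

def mi_sensitivity(records, m=20):
    """Average sensitivity over m completed datasets. For each index test
    stratum, a disease probability is drawn from its Beta posterior (flat
    prior) before imputing, as a simple form of 'proper' imputation."""
    ests = []
    for _ in range(m):
        completed = []
        for stratum in (True, False):
            ver = [d for t, d in records if t == stratum and d is not None]
            k, n = sum(ver), len(ver)
            prob = random.betavariate(k + 1, n - k + 1)
            completed += [
                (t, d if d is not None else random.random() < prob)
                for t, d in records if t == stratum
            ]
        tp = sum(1 for t, d in completed if t and d)
        fn = sum(1 for t, d in completed if not t and d)
        ests.append(tp / (tp + fn))
    return sum(ests) / m

cc_tp = sum(1 for t, d in records if t and d is True)
cc_fn = sum(1 for t, d in records if not t and d is True)
se_cc = cc_tp / (cc_tp + cc_fn)   # complete-case estimate, inflated
se_mi = mi_sensitivity(records)   # close to the true sensitivity of 0.8
```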

The choice of imputation method is largely based on the pattern of missingness. Essentially, the associations between patient characteristics and the outcome of the reference standard are estimated with a statistical model, such as logistic regression, which is subsequently used to impute the missing values. In the NMAR setting, it is very rare to be able to identify an appropriate model for the missingness mechanism; consequently, the validity of the model is uncertain. For further information on dealing with missingness and imputation, readers are referred to Little and Rubin.29

Correction methods for missing data on reference standard outcome

In correction methods, no effort is made to impute missing values. The notion is that estimates of diagnostic test accuracy are likely to be biased if only patients who received verification with the reference standard (complete cases) are analysed. A mathematical correction of these


estimates and their CIs is warranted, based on the mechanism of missingness.

The first publications about correction for partial verification were based on the MCAR assumption.39 If missing values occur completely at random, the fractions of persons missing in each of the cells of the 2-by-2 table are likely to be comparable, and estimates (but not CIs) of accuracy can easily be recalculated without any statistical model. As this pattern is not likely to occur often, mathematical correction methods have been developed that can deal with MAR. Under MAR, estimates of accuracy can be corrected if verification is random within specific patient profiles, based on patient characteristics or (index) test results, and the number of patients not verified within each stratum is known. Several methods to correct estimates of sensitivity and specificity based on this assumption have been described.13,22,23,32,47,48 The same is true for the correction of receiver operating characteristic (ROC) curves and associated indexes.38,47,49
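The correction proposed by Begg and Greenes, for the case where verification is random within index test strata (MAR given the index test result), can be sketched as follows. It rescales the disease probabilities observed among verified patients by the index test totals in all tested patients; the counts used here are hypothetical.

```python
def begg_greenes(n_pos, n_neg, v_pos_d, v_pos_nd, v_neg_d, v_neg_nd):
    """Corrected sensitivity and specificity when verification depends only
    on the index test result. n_pos/n_neg: all tested patients by index test
    result; v_*: verified patients split by test result and disease status."""
    p_d_pos = v_pos_d / (v_pos_d + v_pos_nd)   # P(D+ | T+), from verified
    p_d_neg = v_neg_d / (v_neg_d + v_neg_nd)   # P(D+ | T-), from verified
    se = n_pos * p_d_pos / (n_pos * p_d_pos + n_neg * p_d_neg)
    sp = n_neg * (1 - p_d_neg) / (n_neg * (1 - p_d_neg) + n_pos * (1 - p_d_pos))
    return se, sp

# 1000 patients: 240 test positive, 760 negative. Verified: 200 of the
# positives (150 diseased) and 100 of the negatives (10 diseased).
se, sp = begg_greenes(240, 760, 150, 50, 10, 90)
# corrected: se ~ 0.70, sp ~ 0.92; a complete-case analysis of the same data
# would give an inflated sensitivity of 150/160 ~ 0.94
```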

Strengths and weaknesses

Both approaches aim to alleviate the bias that is potentially introduced by analysing only cases with complete data on the reference standard. The strength of each method is determined by how accurately the mechanism behind the missing values is known. If the mechanisms that lead to missing values on the reference standard are (partially) unknown, the correction and imputation methods are prone to bias. Moreover, correction and imputation methods are based on statistical modelling of the data, which requires sufficiently large sample sizes.50 The potential for bias can be large when correction methods are used in studies with relatively small sample sizes, or if the fraction of patients receiving the reference standard is small. Imputation methods, especially multiple imputation methods, seem to be more robust in these situations than correction methods.23

Application

These methods can be used when a preferred reference standard is available but has not been applied in all patients enrolled in the study. Ideally, partial or incomplete verification should be planned by design, so that the pattern of missingness is known. Researchers should be very careful with these methods if missingness is uncontrolled and open to influence by patient and practitioner choice, if the sample size is small or if the fraction of verified patients within test categories is small.

Software
Software for dealing with missing data is now embedded in the main statistical software packages.

Clinical example
Harel and Zhou used two real data examples to illustrate the results of multiple imputation and correction methods for missing data on the outcome of the reference standard.23 We present a brief summary of their results.

The first example addresses hepatic scintigraphy, an imaging scan procedure, for the detection of liver cancer.51 Out of 650 patients, 344 were referred to liver pathology, which was considered the reference standard (Table 2). The MCAR assumption does not hold here, as 39% of the index test positives and 63% of the index test negatives are not verified. Either MAR or NMAR is true, and Harel and Zhou used the MAR assumption that non-verification occurred randomly within index test positives and negatives, respectively. Table 3 presents summary estimates of sensitivity, specificity and CIs as computed by eight different methods. The first method concerns a 'complete case' analysis, in which the 306 patients without liver pathology are ignored and estimates are based on the remaining 344 patients only. This method gives a higher estimate of sensitivity and a lower estimate of specificity in comparison with all other methods. The two correction methods, as proposed by Begg and Greenes,22,48 produce comparable estimates, with lower sensitivities and higher specificities compared with the complete case analysis. In comparison with the remaining five multiple imputation methods, however, the sensitivities are lower and the specificities are higher.
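The Begg and Greenes correction can be written directly in code. The sketch below is a minimal illustration (function and variable names are our own) of the correction under the MAR assumption used here: verification is assumed random within the index-test-positive and index-test-negative strata. Applied to the Table 2 counts, it reproduces the corrected estimates reported in Table 3.

```python
# Begg and Greenes correction for partial verification, assuming verification
# is random within index-test-positive and index-test-negative strata (MAR).
# Counts follow Table 2 (hepatic scintigraphy versus liver pathology).

def begg_greenes(tp, fp, fn, tn, unverified_pos, unverified_neg):
    """Return (sensitivity, specificity) corrected for partial verification.

    tp, fp: verified index test positives with/without the target condition
    fn, tn: verified index test negatives with/without the target condition
    unverified_pos, unverified_neg: unverified index test positives/negatives
    """
    n_pos = tp + fp + unverified_pos   # all index test positives
    n_neg = fn + tn + unverified_neg   # all index test negatives
    p_d_pos = tp / (tp + fp)           # P(disease | T+), from verified patients
    p_d_neg = fn / (fn + tn)           # P(disease | T-), from verified patients
    diseased_pos = p_d_pos * n_pos     # expected diseased among all T+
    diseased_neg = p_d_neg * n_neg     # expected diseased among all T-
    sens = diseased_pos / (diseased_pos + diseased_neg)
    spec = ((1 - p_d_neg) * n_neg) / ((1 - p_d_neg) * n_neg
                                      + (1 - p_d_pos) * n_pos)
    return sens, spec

sens, spec = begg_greenes(tp=231, fp=32, fn=27, tn=54,
                          unverified_pos=166, unverified_neg=140)
print(round(sens, 3), round(spec, 3))  # 0.836 0.738, matching Table 3
```

With no unverified patients, the function reduces to the complete case estimates (0.895 and 0.628), which is a useful sanity check.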

In the second example, Harel and Zhou used data from a study evaluating diaphanography as a test for detecting breast cancer.52 Of the 900 patients enrolled, 812 were not verified. The pattern of missingness was either MAR or NMAR, as 55% of the diaphanography positives and only 6% of the diaphanography negatives were verified (Table 4). Again, Harel and Zhou chose to use the MAR assumption that is related to index test results only. Table 5 shows the estimates computed by the eight methods. Here the differences between the complete case analysis, the correction methods and the imputation methods are more prominent, while the estimates of the multiple imputation methods are fairly close to each other. In the light of the sample size and their previous simulation work, Harel and Zhou concluded that the estimates of multiple imputation are more representative of the data.



Health Technology Assessment 2007; Vol. 11: No. 50


© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

TABLE 2 Hepatic scintigraphy data: outcome of reference standard in those verified and fraction of unverified test results (adapted from Harel and Zhou23)

                                Liver pathology   Liver pathology   Liver pathology
                                positive          negative          not performed

Hepatic scintigraphy positive   231               32                166
Hepatic scintigraphy negative    27               54                140
Total                           258               86                306

TABLE 3 Estimates with 95% CIs of sensitivity and specificity for different methods to adjust or impute missing data on reference standard outcome (based on data from Table 2)

Procedure                            Sensitivity                 Specificity
                                     Estimate  95% CI            Estimate  95% CI

Complete cases                       0.895     0.858 to 0.932    0.628     0.526 to 0.730
Begg and Greenes (a)                 0.836     0.788 to 0.884    0.738     0.662 to 0.815
Logit version Begg and Greenes (a)   0.836     0.835 to 0.838    0.738     0.735 to 0.741
A&C (b)                              0.869     0.820 to 0.918    0.672     0.571 to 0.772
Rubin (logit) (b)                    0.872     0.817 to 0.912    0.675     0.567 to 0.797
Wilson (b)                           0.869     0.837 to 0.901    0.672     0.610 to 0.733
Jeffrey (b)                          0.872     0.838 to 0.901    0.675     0.611 to 0.734
Z&L (b)                              0.872     0 to 1            0.675     0 to 1

A&C, Agresti-Coull; Z&L, Zhou and Li method.
(a) Correction method. (b) Multiple imputation method.

TABLE 4 Diaphanography data: outcome of reference standard in those verified and fraction of unverified test results (adapted from Harel and Zhou, 200623)

                          Breast cancer   Breast cancer   No verification
                          positive        negative

Diaphanography positive   26              11               30
Diaphanography negative    7              44              782
Total                     33              55              812

TABLE 5 Estimates with 95% CIs of sensitivity and specificity for different methods to adjust or impute missing data on reference standard outcome (based on data from Table 4)

Procedure                            Sensitivity                 Specificity
                                     Estimate  95% CI            Estimate  95% CI

Complete cases                       0.788     0.649 to 0.927    0.800     0.694 to 0.906
Begg and Greenes (a)                 0.280     0.127 to 0.434    0.974     0.960 to 0.989
Logit version Begg and Greenes (a)   0.280     0.275 to 0.285    0.974     0.973 to 0.975
A&C (b)                              0.706     0.560 to 0.852    0.861     0.753 to 0.970
Rubin (logit) (b)                    0.717     0.548 to 0.841    0.869     0.721 to 0.944
Wilson (b)                           0.706     0.601 to 0.812    0.861     0.839 to 0.884
Jeffrey (b)                          0.718     0.603 to 0.815    0.863     0.839 to 0.885
Z&L (b)                              0.717     0 to 1            0.869     0 to 1

A&C, Agresti-Coull; Z&L, Zhou and Li method.
(a) Correction method. (b) Multiple imputation method.

Complete case analysis seems to overestimate sensitivity and underestimate specificity, whereas Begg and Greenes's correction methods seem to dramatically underestimate sensitivity and overestimate specificity.

Harel and Zhou used the MAR assumption in which missingness is related to index test results only. In practice, factors other than index test results, such as symptoms, signs, co-morbidity or other test results, may have driven the decision to apply or to withhold the reference standard. Imputation models can be extended to incorporate these additional sources of information, if known and collected, in an effort to improve the prediction of the reference standard outcome in unverified patients.
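As a minimal illustration of imputation under MAR (our own sketch, using only the index test strata, as in Harel and Zhou's assumption), missing reference standard outcomes can be drawn repeatedly from the disease rate observed among verified patients within each stratum, and the accuracy estimates pooled across imputations. A real analysis would add covariates to the imputation model and pool CIs with Rubin's rules.

```python
# Minimal multiple imputation sketch for the Table 2 data: missing liver
# pathology outcomes are imputed by Bernoulli draws within each index test
# stratum, using the disease rate observed among verified patients (MAR).
import random

random.seed(1)

strata = {
    "T+": {"d_pos": 231, "d_neg": 32, "missing": 166},
    "T-": {"d_pos": 27,  "d_neg": 54, "missing": 140},
}

M = 100  # number of imputations
sens_estimates, spec_estimates = [], []
for _ in range(M):
    cells = {}
    for label, s in strata.items():
        p = s["d_pos"] / (s["d_pos"] + s["d_neg"])  # P(disease | stratum)
        imputed_pos = sum(random.random() < p for _ in range(s["missing"]))
        cells[label] = (s["d_pos"] + imputed_pos,
                        s["d_neg"] + s["missing"] - imputed_pos)
    tp, fp = cells["T+"]
    fn, tn = cells["T-"]
    sens_estimates.append(tp / (tp + fn))
    spec_estimates.append(tn / (tn + fp))

# Pooled point estimates: close to the Begg and Greenes corrected values.
sens = sum(sens_estimates) / M
spec = sum(spec_estimates) / M
print(round(sens, 2), round(spec, 2))
```

Under MAR within index test strata the pooled estimates converge to the Begg and Greenes corrected values (roughly 0.84 and 0.74), which is consistent with the closeness of the two approaches in Table 3.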

Correct imperfect reference standard
Description of the method
If a true gold standard does not exist, a logical next step is to search for the best available procedure to verify index test results. This is the rationale behind the concept of the reference standard: the best available method to determine the presence or absence of the target condition, not necessarily without error. However, imperfection of the reference standard leads to bias, known as reference standard bias.11,19,53

If knowledge about the amount and type of errors of the imperfect reference standard is available, then this information can be used to correct estimates of diagnostic accuracy. We assume that all patients have received the imperfect reference standard, so there are no missing data, in contrast to the correction methods described in the previous section.

Basic model
The basic model assumes that the error rates of the reference standard are known and that algebraic functions can be used to recalculate estimates of accuracy. Key publications by Hadgu and colleagues54 and Staquet and colleagues55 provide several algebraic functions, all based on the assumption of conditional independence between the index test and reference standard results. This assumption implies that the errors of both tests are independent of the underlying true status of the target condition; hence the index test and reference standard do not tend to err in the same patients.

Sometimes, the exact sensitivity and specificity of the imperfect reference standard are not known, but a range of plausible values is available. This range of values can then be applied in a sensitivity analysis that will produce a range of estimates of accuracy for the index test.
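One standard form of such an algebraic correction can be sketched as follows. This is our own illustration, assuming conditional independence and known reference sensitivity (alpha) and specificity (beta); the exact formulas vary across the cited publications, and the cell counts below are constructed, not taken from any study.

```python
# Correction of index test accuracy for a conditionally independent,
# imperfect reference standard with known sensitivity (alpha) and
# specificity (beta).
# a, b, c, d: observed counts for index test (T) versus reference (R):
#   a = T+/R+, b = T+/R-, c = T-/R+, d = T-/R-.

def corrected_accuracy(a, b, c, d, alpha, beta):
    n = a + b + c + d
    sens = (a * beta - b * (1 - beta)) / (a + c - n * (1 - beta))
    spec = (d * alpha - c * (1 - alpha)) / (b + d - n * (1 - alpha))
    return sens, spec

# Sanity check with constructed expected-value data: 1000 patients,
# prevalence 0.3, true index sensitivity 0.8 and specificity 0.9,
# reference alpha = 0.9 and beta = 0.95, errors conditionally independent.
a, b, c, d = 219.5, 90.5, 85.5, 604.5
sens, spec = corrected_accuracy(a, b, c, d, alpha=0.9, beta=0.95)
print(round(sens, 3), round(spec, 3))  # 0.8 0.9: the true values are recovered
```

With alpha = beta = 1 (a perfect reference) the formulas reduce to the usual estimates a/(a + c) and d/(b + d), and varying alpha and beta over a plausible range yields the sensitivity analysis described above.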

Extensions of the basic model
In many clinical situations, the assumption of conditional independence is unlikely to be true. Often the index test and reference standard have a tendency to make errors in the same (difficult) patients, particularly when the index test and reference standard are methodologically related or measure the same physiological alteration.56 In the detection of cancer, for example, both the index test and reference standard may have difficulty in detecting early stages of cancer, and in clinical chemistry, contamination of body fluid samples may affect both the index test and the reference standard. Algebraic functions have therefore been developed that build in the conditional dependence, such as the correlation in errors.57

Unfortunately, the amount of correlation between errors is rarely known; hence this parameter is often varied over a wide range of plausible values.

Strengths and weaknesses
These correction methods can easily be applied using simple algebraic functions. The main limitation is that in most situations the true sensitivity and specificity of the imperfect reference standard are not known, nor does one know the amount of correlation among errors of the index test and the reference standard. If the chosen values do not match the true values, the resulting estimates of diagnostic test accuracy would still be biased or could become even more biased than the unadjusted ones.58 For these reasons, researchers have tried to generalise the algebraic correction functions to avoid making untenable assumptions.54 Latent class analysis subsequently evolved, which can be viewed as an extension of this method for situations where there is no information available on either the error rates of the reference standard or the true prevalence (see the section 'Latent class analysis', p. 26). In latent class analysis, the result of the (imperfect) reference standard is incorporated as just one source of information about the true disease status.

Applicability
The basic correction method can be applied in any situation if the following conditions are met: reliable information on the magnitude of the error rates of the reference standard is available, and either the conditional independence assumption is likely to be true or there is reliable information about the correlation between errors in the index test and the reference standard. In all other situations, the method is likely to render accuracy estimates that are biased, despite the use of a correction method.

Software
The algebraic functions can be used in widely available software programs such as Excel or any other statistical program.

Clinical example
The development of sensitive tests for bladder cancer is critical for its early detection. A gold standard does not exist for the evaluation of endoscopic tests, as ideally this would necessitate removing the bladder for detailed pathological examination, a procedure that would not be ethically justifiable. Therefore, pathological evaluation is performed only on biopsied or surgically removed tissue. In other words, only positive findings during endoscopy (the index test) are verified. Histological examination alone is therefore not a reference standard providing perfect classification, because of the unknown likelihood of missing cancerous lesions that were not biopsied. This is the downside of not being able to verify negative findings of the index test.

Assumptions are necessary to correct for the biases inherent in this approach. Schneeweiss and colleagues59 provide an example of the derivation of an interval of sensitivity estimates that includes the true value. They compared the ability of 5-aminolevulinic acid-induced fluorescence endoscopy and white light endoscopy to detect bladder cancer. This is an example of a comparative diagnostic accuracy study with two index tests. They hypothesised that 5-aminolevulinic acid-induced fluorescence endoscopy is of superior diagnostic value. A total of 208 patients under surveillance after superficial bladder cancer were included. Multiple evaluations over time within the same patient were included, leading to a total of 328 endoscopic evaluations in which both procedures (index tests) were performed. They used sensitivity as the main accuracy parameter, as the consequences of false negative results would be worse and should be minimised by a good test.

The effect of having no gold standard is that there can be misclassification among the four cells of the 2-by-2 table. The observed true positive cell consists of lesions that are positive with the index test and confirmed on biopsy. It is unlikely that any observation in this cell belongs in another cell, because we assume that cancer cells identified by pathological evaluation always indicate cancer. However, in reality, there could be more observations in the true positive cell: some observations may have been wrongly classified as false positives, which can happen when not enough, or not the correct, biopsies were taken from a lesion. These observations would be classified as false positives when in reality they belong in the true positive cell. The magnitude of this type of misclassification is assumed to be small but unknown.

For the false negative and true negative cells, the mechanism is analogous. If cancerous lesions are missed by both index tests, they are classified as true negatives, but in reality they are false negatives. False negatives can be observed in this study because of its paired design: patients underwent both index tests, so that a true positive finding with one technique can be considered a false negative finding for the other technique if that technique missed the lesion. The magnitude of this misclassification of negative index test results is also assumed to be small, but larger than the misclassification in the positive tests.

The following misclassification model was used to correct the counts in the observed 2-by-2 table and recalculate sensitivity based on these corrected counts. Let p(A) be the probability of misclassifying true positives as false positive results and p(B) be the probability of misclassifying false negatives as true negative results. A more general equation can now be derived for estimating test sensitivity:

Sensitivity = ['true positive' + p(A) x 'false positive'] / ['true positive' + p(A) x 'false positive' + 'false negative' + p(B) x 'true negative']

If it were assumed that there is no misclassification, sensitivity could easily be calculated from the observed numbers in the 2-by-2 table: the so-called naive estimate. The analysis that Schneeweiss and colleagues propose comprises most optimistic, most pessimistic and some realistic assumptions to generate an interval of sensitivity.59,60 The maximum interval of observable sensitivity for 5-aminolevulinic acid-induced fluorescence endoscopy ranged between 78 and 97.5%, and the best estimate for sensitivity based on realistic assumptions was 93.4% (95% CI 90 to 97.3). The best sensitivity estimate for white light endoscopy was 46.7% (95% CI 39.4 to 54.3, maximum range 47.2–53%).

This method to determine the maximum possible range of sensitivity estimates, in studies in which negative findings cannot all be verified, is easily applied. Depending on the assumptions, a range of reasonable scenarios can be constructed and the corresponding sensitivities reported.
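The interval construction can be sketched in a few lines. This is our own illustration with hypothetical cell counts (the actual study counts are not reproduced here): sensitivity is recalculated over a grid of assumed misclassification probabilities p(A) and p(B), and the minimum and maximum define the interval.

```python
# Interval of sensitivity estimates under the misclassification model
#   Sensitivity = (TP + pA*FP) / (TP + pA*FP + FN + pB*TN),
# where pA is the probability that an observed false positive is really a
# true positive, and pB the probability that an observed true negative is
# really a false negative. All counts below are hypothetical.

def corrected_sensitivity(tp, fp, fn, tn, p_a, p_b):
    diseased_pos = tp + p_a * fp
    return diseased_pos / (diseased_pos + fn + p_b * tn)

tp, fp, fn, tn = 120, 25, 10, 173

# Naive estimate: assume no misclassification (pA = pB = 0).
naive = corrected_sensitivity(tp, fp, fn, tn, 0.0, 0.0)

# Scan plausible ranges for pA and pB to obtain an interval.
grid = [i / 100 for i in range(0, 21)]  # 0.00 .. 0.20
estimates = [corrected_sensitivity(tp, fp, fn, tn, pa, pb)
             for pa in grid for pb in grid]
print(round(naive, 3), round(min(estimates), 3), round(max(estimates), 3))
# 0.923 0.729 0.926
```

Narrower, 'realistic' ranges for p(A) and p(B) would give the analogue of the best estimate reported in the example.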

Construct reference standard
Differential verification
Description of the method
Instead of ignoring, imputing or correcting the non-verified results in the analysis, a second reference standard can be used to achieve complete verification. The use of different reference standards in this way is also known as differential verification (see also Figure 2, p. 4).19 The second reference standard is usually a less invasive test that is less costly or less burdensome to patients. In the detection of pulmonary embolism (PE), for example, it is nowadays considered unethical to perform pulmonary angiography in patients with a low suspicion of PE and a negative D-dimer test result.61 In many of these studies, only patients at high risk for PE receive the best available reference standard (angiography), whereas 'low-risk' patients are likely to receive a different reference standard, such as clinical follow-up.

The use of different reference standards between patient groups is frequently encountered in the literature. In a survey of 31 diagnostic reviews, differential verification was present in 99 out of 487 (20%) primary diagnostic accuracy studies.3 In most cases, incomplete verification is neither specified in the design nor completely at random. Triggers to perform the preferred reference standard in some patients and not in others include a positive result on the index test, positive results from other tests or the presence of risk factors for the condition of interest. This means that differential verification is selective and shows a non-random pattern, being based on decisions by the practitioner or patient.

Differential verification has been shown to lead to higher estimates of accuracy than studies using a single reference standard in all patients.1,3 This type of bias is known as differential verification bias, different reference standard bias, work-up bias or selection bias.19,62 The effect of differential verification on estimates of accuracy is difficult to predict, as it depends on the proportion of patients verified differently, the selection process behind patients verified differently, the properties of the reference standards involved and their relation with the index test.20

Strengths and weaknesses
Differential verification appears to escape the bias of incomplete (partial) verification, but can lead to estimates of diagnostic accuracy that differ from those obtained with full verification by the preferred reference standard.1,3 Moreover, estimates of sensitivity and specificity are more difficult to interpret, as they are based on multiple index test–reference standard combinations.20 If complete verification by the preferred reference standard is not possible and different reference standards have to be used, the best approach is to incorporate differential verification in the design. This means prespecifying the group of patients that will receive the first reference standard and the group that will receive the second. An example would be that all patients with a positive index test are verified by one reference standard and all negative patients by the second reference standard. In that case, the appropriate measures of accuracy are the positive and negative predictive values, calculated with the corresponding reference standards. Because these measures are directly influenced by changes in prevalence, the right design has to be chosen, such as a cohort rather than a case–control design.
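Under such a design-based scheme, each predictive value is computed within its own stratum, against its own reference standard. A minimal sketch (all counts and names are our own, hypothetical illustration):

```python
# Design-based differential verification: index test positives are verified
# by reference standard 1, index test negatives by reference standard 2.
# The appropriate accuracy measures are then the predictive values,
# each calculated against the corresponding reference standard.
# All counts below are hypothetical.

def predictive_values(pos_confirmed, pos_refuted, neg_refuted, neg_confirmed):
    """PPV among index test positives (reference standard 1),
    NPV among index test negatives (reference standard 2)."""
    ppv = pos_confirmed / (pos_confirmed + pos_refuted)
    npv = neg_confirmed / (neg_refuted + neg_confirmed)
    return ppv, npv

# 80 of 100 index test positives confirmed by reference standard 1;
# 285 of 300 index test negatives disease free on reference standard 2.
ppv, npv = predictive_values(80, 20, 15, 285)
print(round(ppv, 2), round(npv, 2))  # 0.8 0.95
```

Sensitivity and specificity are deliberately not computed here: combining the two strata into one 2-by-2 table would mix two different reference standards, which is exactly the interpretational problem described above.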

Field of application
Differential verification can be a reasonable option if several acceptable diagnostic tests are available that can serve as reference standards. A prerequisite, however, is that the verification scheme is preplanned (design based). As in any design, authors should report the rationale for each reference standard. Moreover, results should be reported separately for each index test–reference standard combination.

Software
Studies with differential verification produce standard accuracy results. No additional programming is necessary.

Clinical example
An illustrative example of differential verification that was (partly) design based can be found in the publication of Kline and colleagues.63 The objective of the study was to evaluate the diagnostic accuracy of the combination of a D-dimer assay with an alveolar dead-space measurement for rapid exclusion of pulmonary embolism.



In this multicentre study, V/Q scans, helical computed tomography, ultrasonography, clinical follow-up, angiography, death, events of deep venous thrombosis or new events of pulmonary embolism, and combinations of these tests were used as diagnostic criteria to determine the presence or absence of pulmonary embolism. The authors describe a verification scheme to establish the final diagnosis that at first seems to be design based, but later on they state that the decision to order further imaging in patients with non-diagnostic V/Q scans was at the discretion of the attending physician.

The authors provide a table describing the different criteria for the presence of PE and the matching number of patients. Unfortunately, the corresponding results of the combined D-dimer–alveolar dead-space measurement were not stated in this table. In the results section, sensitivity, specificity, negative and positive likelihood ratios and the corresponding CIs were calculated in the usual way.

In the discussion, the authors do address the problem of differential verification. They state that computed tomography, for example, performs differently from angiography, and may have wrongly diagnosed PE in some patients who may in fact have been free of PE.

The results of this study may not be very helpful to clinicians in practice, as it is unclear to which patients the estimates of diagnostic accuracy apply.

Discrepant analysis
Description of the method
Discrepant analysis, also referred to as discrepant resolution or discordant analysis, uses a combination of reference standards in a sequential manner to classify patients as having the target condition or not.22,64,65 Discrepant analysis can provide estimates of prevalence and estimates of the sensitivity and specificity of the index test without statistical modelling.

Initially, all patients are tested with the index test and one imperfect reference standard (Figure 4). Since the imperfect reference standard is known to be imperfect, the discordant or discrepant results (cases where the index test and first reference standard disagree) are retested (resolved) with an additional reference standard, frequently called the resolver test. The resolver test is usually a more invasive, more costly or otherwise more burdensome test, with better discriminatory properties than the first reference standard. The results of the resolver test are then used to update the final 2-by-2 table. Based on this final 2-by-2 table, estimates of accuracy for the index test are calculated.

Strengths and weaknesses
The method is straightforward and easy to carry out without statistical expertise. At first sight, this design seems to provide an efficient alternative to reduce the number of patients who have to be tested with the best available reference standard when this standard is either invasive or costly to apply. Yet a fundamental problem of discrepant analysis is that the verification pattern is dependent on the index test results.64,66 Although discrepant analysis provides the status of the target condition for those who are retested, it does not provide that information for those not retested, which is usually the majority. Discrepant analysis therefore has the potential to lead to serious bias.64,65,67–71

The potential for bias has been demonstrated algebraically and numerically in the estimation of sensitivity and specificity in a situation where the resolver test is a perfect gold standard.66,68,69 In this situation, discrepant analysis will lead to estimates of sensitivity and specificity for the index test that are biased upwards, so that the index test appears to be more accurate than it really is. The magnitude of this bias depends on the absolute number of errors of the index test (false positive and false negative index test results) and the amount of correlation between the errors of the index test and the results of the first reference standard. Correlated errors will erroneously appear as either true positives or true negatives in the final 2-by-2 table, as these patients will not enter the second stage and will therefore not be corrected by the perfect resolver test.
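The upward bias is easy to reproduce with expected counts. The following is a constructed numerical illustration (not data from any study): when the index test and the first reference standard err in the same patients, those concordant errors are never sent to the resolver and are frozen into the final table as 'true' positives or negatives.

```python
# Numerical illustration of upward bias in discrepant analysis when errors
# of the index test and the first reference standard (R1) are correlated.
# Constructed expected counts; the resolver test is assumed perfect.

# 1000 patients, 200 diseased; true index sensitivity 0.8, specificity 0.9.
tp_true, fn_true = 160, 40   # index test among the 200 diseased
tn_true, fp_true = 720, 80   # index test among the 800 healthy

# Correlated errors: R1 misses 30 of the 40 diseased patients that the
# index test also misses (T-/R1-, counted as TN and never resolved), and
# R1 agrees with the index test on 60 of its 80 false positives
# (T+/R1+, counted as TP and never resolved).
missed_together = 30
false_pos_together = 60

tp_final = tp_true + false_pos_together   # 60 false positives frozen in
fn_final = fn_true - missed_together      # only resolved discordants remain
tn_final = tn_true + missed_together      # 30 missed cases frozen in
fp_final = fp_true - false_pos_together   # discordant FPs, resolved correctly

sens_discrepant = tp_final / (tp_final + fn_final)
spec_discrepant = tn_final / (tn_final + fp_final)
print(round(sens_discrepant, 2), round(spec_discrepant, 2))  # 0.96 0.97
# versus the true values 0.80 and 0.90: both estimates are biased upwards.
```

Setting the two correlated-error counts to zero removes the bias entirely, which matches the observation that the bias is driven by the correlation between the errors of the index test and the first reference standard.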

Green and colleagues showed that when the resolver test is not a perfect gold standard, even larger biases are possible, as the second imperfect reference standard can lead to further misclassification of the discordant results.72 If errors of the stage 2 resolver test are correlated with errors of either the index test or the stage 1 imperfect reference standard, the diagnostic accuracy of the stage 2 reference standard will be biased in the assessment of discordant results. In other circumstances, the measured values may actually be closer to the true values.66 In general, situations where the errors of the index test and the resolver test are related are of particular concern.

A method to correct for the potential bias in discrepant analysis has been proposed,66,73 but this method is considered to be of limited applicability. The method requires that a random sample of patients with concordant results on the index test and the initial reference standard is tested by the resolver test. From this sample, the concordance rate of false results is estimated and the observed sensitivity and specificity are corrected accordingly. This procedure is adequate only if the resolver test is (nearly) perfect, a situation which occurs infrequently.

Field of application
Although discrepant analysis has been most frequently applied in the field of microbiology (detection of infections),64,66 it can, in theory, be applied in all other medical fields. The general advice is to refrain from discrepant analysis because of the fundamental problem of the implicit incorporation of index test results in the definition of the true disease status, leading to potential bias64,65 (http://www.fda.gov/cdrh/osb/guidance/1428.pdf). Only in special situations can the choice of discrepant analysis be defended: if there is a perfect resolver test and a random sample of concordant results is also verified by the resolver test, using the correction method proposed by Begg and Greenes.22,48



FIGURE 4 Flow of patients and final 2-by-2 table in a study using discrepant analysis. [Diagram: all patients in the study population receive the index test (T) and reference standard 1 (R1). Concordant results enter the final 2-by-2 table directly (T+/R1+ = TP; T–/R1– = TN). Discordant results (T+/R1– and T–/R1+) are retested with reference standard 2 (R2), the resolver test: T+/R1–/R2+ = TP; T+/R1–/R2– = FP; T–/R1+/R2+ = FN; T–/R1+/R2– = TN.]

Software
Any statistical package can be used. The calculations can also be done with a pocket calculator.

Clinical example
To diagnose Chlamydia trachomatis infection, culture, DNA amplification methods such as the polymerase chain reaction (PCR) and antigen detection methods such as enzyme immunoassay (EIA) are used (the same example is also used for the composite reference standard and latent class analysis). Alonzo and Pepe25 discuss the problems of assessing the accuracy of EIA (the index test), as neither of the two other methods can be considered a 'gold standard'. Culture is believed to have nearly perfect specificity, but it also misses patients with Chlamydia infection (false negatives). PCR is believed to be more sensitive than culture in detecting Chlamydia trachomatis. Alonzo and Pepe use the method of discrepant analysis to assess the accuracy of EIA by using culture as the initial reference test and PCR to resolve the discrepant results in the second stage (resolver test).25

The data are derived from the study of Wu and colleagues.74 In the first stage, all 324 specimens are tested by EIA (index test) and culture (Table 6). In the second stage, only the discrepant results, that is, specimens that are positive by EIA and negative by culture (n = 7) or vice versa (n = 3), are retested with the resolver, PCR. The concordant results directly enter the final 2-by-2 table as true positives (n = 20) and true negatives (n = 294).

In the second stage, the 10 discordant results are tested with PCR and the outcome of the PCR test determines the classification in the final 2-by-2 table (Table 7). The EIA positive and culture negative results become either true positives if the PCR is positive (n = 4) or false positives if the PCR outcome is negative (n = 3). The same rule applies for the EIA negative and culture positive results (n = 3): the outcome of the PCR determines the final classification as either true negative if PCR is negative (n = 1) or false negative if PCR is positive (n = 2).

Combining the results from the initial stage (Table 6) and the second, resolver stage (Table 7) leads to the final 2-by-2 table (Table 8), from which the accuracy can be calculated.

The estimates of accuracy after retesting the 10 discordant results with PCR are as follows:

prevalence = 26/324 = 0.080
sensitivity = (20 + 4)/(20 + 4 + 3 – 1) = 24/26 = 0.923
specificity = (294 + 1)/(294 + 1 + 7 – 4) = 295/298 = 0.990

Had we calculated the estimates of accuracy directly, using culture as a single reference standard, they would have been as follows:

prevalence = 23/324 = 0.071
sensitivity = 20/23 = 0.870
specificity = 294/301 = 0.977

The changes in this case are small because the frequency of discordant results was relatively low (3.1%). The validity of the approach depends on the error rate of the resolver test and the frequency of errors in the concordant results of the initial stage.
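The bookkeeping of the two stages can be checked in a few lines. The sketch below simply reproduces the counts from Tables 6 to 8:

```python
# Discrepant analysis of the Chlamydia trachomatis example (Tables 6-8):
# culture is the first reference standard, PCR resolves discordant results.

# Stage 1: EIA (index test) versus culture, 324 specimens.
concordant_tp = 20     # EIA+ / culture+ -> enter final table directly
concordant_tn = 294    # EIA- / culture- -> enter final table directly
eia_pos_cult_neg = 7   # discordant, retested with PCR
eia_neg_cult_pos = 3   # discordant, retested with PCR

# Stage 2: PCR outcomes of the discordant specimens (Table 7).
tp = concordant_tp + 4  # EIA+/culture-/PCR+ -> true positives
fp = 3                  # EIA+/culture-/PCR- -> false positives
fn = 2                  # EIA-/culture+/PCR+ -> false negatives
tn = concordant_tn + 1  # EIA-/culture+/PCR- -> true negatives

n = tp + fp + fn + tn
prevalence = (tp + fn) / n
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(round(prevalence, 3), round(sensitivity, 3), round(specificity, 3))
# 0.08 0.923 0.99
```

Note that the resolver is only ever applied to specimens where the index test and culture disagree, which is precisely the index-test-dependent verification pattern criticised above.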

TABLE 6 Initial classification using culture as the first (imperfect) reference standard

                Culture+   Culture–

EIA+            20         7
EIA–            3          294

TABLE 7 Results of the second stage: selective testing of discrepant results with PCR, the resolver test

                               Resolver test PCR
                               +        –

EIA+ and culture– (n = 7)      4        3
EIA– and culture+ (n = 3)      2        1

TABLE 8 Final classification based on the discrepant analysis method

                Final classification
                +          –

EIA+            20 + 4     3
EIA–            2          294 + 1

Composite reference standard
Description of the method
In the absence of a single gold standard, the results of several imperfect tests can be combined to create a composite reference standard. The results of the component tests of the composite reference standard determine the presence of the target condition based on a prespecified rule. Such a composite reference standard is believed to have better discriminatory properties than each of the component reference standards in isolation.24,25,64,66,75

Basic model
A basic model is one in which two tests are applied in all patients and a prespecified (deterministic) rule is used to classify patients as having the target condition. Several definitions of the presence of the target condition can be used, depending on whether emphasis should be given to detecting or to excluding the target condition and on the characteristics of the available reference tests. Most frequently, researchers define the target condition to be present if either one of the reference tests is positive. In the evaluation of the diagnostic accuracy of EIA in the detection of Chlamydia trachomatis, for example, a composite standard of two reference tests has been used: culture and PCR.76 EIA and the two reference tests were applied in all patients, and infection with Chlamydia trachomatis was diagnosed if either culture or PCR was positive. If both reference tests were negative, the person was labelled free of Chlamydia trachomatis.

It is important to note that the composite reference standard approach differs from discrepant analysis (see the section 'Discrepant analysis', p. 19), as the results of the index test themselves play no role in the verification procedure of the composite method. The difference between using a composite reference standard and differential verification is that in the composite method each patient receives all (necessary) components of the composite reference standard, whereas in differential verification subgroups of patients are verified by one reference standard and other subgroups by a different reference standard (see also the section 'Differential verification', p. 18).

Extensions of the basic model
The composite reference standard approach can be extended to include more than two reference tests, as for instance in the detection of myocardial infarction, where chest pain, serological markers and electrocardiograms are used to define the presence or absence of the disease.77 Increasing the number of reference tests amplifies the number of ways to define the presence of the target condition. If six reference tests are combined, for example, one possibility is the 'any test positive' definition of disease. Another definition was used in a study evaluating Helicobacter pylori, in which a patient was regarded as infected whenever two or more of the six available tests were positive.78

The efficiency of the composite reference standard can often be improved by avoiding redundant testing. If the classification rule is based on one of the two reference tests being positive, there is no need to perform the second reference test once the outcome of the first reference test is positive. Hence retesting is only necessary in patients with a negative result on the first reference test. Performing the more accurate reference test first lowers the number of patients needing additional testing.
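The 'either reference test positive' rule, including the sequential shortcut of only running the second reference test after a negative first test, can be sketched as follows. The per-specimen results below are our own hypothetical illustration:

```python
# Composite reference standard with an 'any reference test positive' rule.
# The second reference test is only needed when the first is negative,
# which saves testing without changing the classification.
# Specimens are hypothetical: (index, ref1, ref2) results as booleans.

specimens = [
    (True,  True,  True),   # index+, both reference tests positive
    (True,  False, True),   # index+, caught by the second reference test only
    (False, False, False),  # all tests negative
    (True,  False, False),  # index false positive against the composite
    (False, True,  False),  # index false negative against the composite
    (False, False, False),
]

tp = fp = fn = tn = 0
second_tests_run = 0
for index, ref1, ref2 in specimens:
    if ref1:
        diseased = True        # rule already satisfied; ref2 not needed
    else:
        second_tests_run += 1  # sequential testing: only after a negative ref1
        diseased = ref2
    if index and diseased:
        tp += 1
    elif index:
        fp += 1
    elif diseased:
        fn += 1
    else:
        tn += 1

print(tp, fp, fn, tn, second_tests_run)  # 2 1 1 2 4
```

Note that the index test result never influences which reference tests are performed, which is what distinguishes the composite approach from discrepant analysis.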

Another extension is the use of statistical models to combine several reference test results to define disease status, as in latent class analysis. This approach is discussed separately, as it does not use a prespecified deterministic rule for classifying patients as having the target condition or not (see the section ‘Latent class analysis’, p. 26, for more details).

Strengths and weaknesses

The method is straightforward and easy to understand. It allows one to combine several sources of information in order to assess whether the target condition is present. The prespecified rule to define when the target condition is present increases transparency and avoids problems of incorporation and work-up bias. Unlike discrepant analysis, the application of the second reference standard is independent of the index test result.

A major issue is whether the combination of reference test results is a meaningful way of defining the target condition and whether residual misclassification is likely to be present. Do different reference standard tests look at the same target condition, but with different error rates, or do they define the target condition differently? The latter is thought to reduce the clinical usefulness.24 The inclusion of more than two reference tests in the composite reference standard may then obscure the final definition of the disease.

A general problem is the determination of the ‘cut-off value’ to classify patients as having the target condition when several imperfect reference tests are available. Several suggestions have been made in the literature. Alonzo and Pepe referred to latent class analysis as a tool for finding an optimal cut-off to define the target condition, but they concluded against its use, as they see the model as being prone to bias due to its assumptions64 (see also the section ‘Latent class analysis’, p. 26). Some authors have suggested using all possible ‘cut-off values’ to define the target condition rather than choosing a single one. They evaluate the index test against all these definitions and show how the accuracy of the index test varies across these definitions.75,79

Field of application

The composite reference standard method can be considered in all situations where a single gold standard does not exist, but several imperfect reference tests are available. One potential benefit is that it allows the researcher to ‘adjust’ the threshold for disease depending on the clinical problem at hand. In situations where missing the target condition outweighs the risk of wrongfully labelling persons as diseased, using a cut-off of ‘any result positive indicates disease’ can be advantageous. The composite reference standard method has been used in many areas, including infectious diseases64,75,76 and gastroenterology.79

Software

The method is based on simple predefined classification rules, and requires no additional software. Measures of accuracy, including CIs, are calculated in the traditional way.

Clinical example

To diagnose Chlamydia trachomatis infection, culture, DNA amplification methods such as PCR and antigen detection methods such as EIA are used. (The same example is used in discrepant analysis and latent class analysis.) Alonzo and Pepe discuss the problems of assessing the accuracy of EIA (index test), as neither of the two other methods can be considered a ‘gold standard’.25

Culture is believed to have nearly perfect specificity. PCR is believed to be more sensitive than culture in detecting Chlamydia trachomatis.

Alonzo and Pepe provide an example of the assessment of EIA’s accuracy using a composite reference standard based on culture and PCR.25

They use the ‘either positive’ rule to classify patients as having Chlamydia infection, meaning that any specimen that is culture positive or PCR positive is composite reference standard positive, and any specimen that is culture negative and PCR negative is composite reference standard negative. This approach allows one to use several sources of information in order to assess if an infection is present based on an a priori rule.

Because they use the ‘either positive’ rule, there is no need to test all specimens with both reference tests of the composite reference standard: if a specimen is positive on the first reference test, it will be positive on the composite reference standard and no further testing is therefore required. In the Chlamydia setting, in the first stage all specimens are tested by EIA (index test) and culture. In the second stage, only those specimens which are culture negative at the first stage are tested with PCR.

Considering the data from the study by Wu and colleagues,74 a contingency table summarising the two stages of the composite reference standard is given in Table 9. After the first stage in assessing the performance of EIA using the composite reference standard, the 301 culture-negative specimens (7 + 294) were tested with PCR. Six of these specimens were PCR positive and therefore considered to be infected based on the composite reference standard. The final estimates of accuracy are as follows:

prevalence is 29/324 = 0.090
sensitivity is (20 + 4)/(20 + 4 + 3 + 2) = 24/29 = 0.828
specificity is (294 – 2)/(294 – 2 + 7 – 4) = 292/295 = 0.990.
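The arithmetic above can be reproduced directly from the two-stage counts (a minimal sketch; the variable names are mine):

```python
# Counts from the Chlamydia example: stage one cross-classifies EIA
# against culture; at stage two the culture-negative specimens are
# retested with PCR (4 of the 7 EIA+ and 2 of the 294 EIA- were PCR+).
eia_pos_culture_pos = 20
eia_pos_culture_neg = 7
eia_neg_culture_pos = 3
eia_neg_culture_neg = 294

tp = eia_pos_culture_pos + 4   # EIA+, composite positive
fp = eia_pos_culture_neg - 4   # EIA+, composite negative
fn = eia_neg_culture_pos + 2   # EIA-, composite positive
tn = eia_neg_culture_neg - 2   # EIA-, composite negative

n = tp + fp + fn + tn          # 324 specimens in total
prevalence = (tp + fn) / n     # 29/324
sensitivity = tp / (tp + fn)   # 24/29
specificity = tn / (tn + fp)   # 292/295
print(round(prevalence, 3), round(sensitivity, 3), round(specificity, 3))
# -> 0.09 0.828 0.99
```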

TABLE 9 Contingency table summarising the composite reference standard for the Chlamydia data: EIA is the index test and the elements of the composite reference standard are culture and PCR

           Culture            Composite reference standard (culture + PCR)
           +        –         +              –

EIA+       20       7         20 + 4         7 – 4
EIA–       3        294       3 + 2          294 – 2

Because there is an a priori rule involving only the two reference tests, the results of the index test (EIA) play no role in the classification of patients. This is different from the discrepant method because there the index test does play a role, as only discrepant results move on to the second stage. To put it differently, in the composite example the classification would not have changed if all specimens had been tested by both reference tests (culture and PCR). Because of the nature of the a priori rule (either positive), redundant testing can be avoided to gain efficiency.

One drawback of the composite reference standard is that it requires testing the typically large number of specimens negative by culture.

Panel or consensus diagnosis

Description of the method

In a panel or consensus diagnosis, a group of experts determines the presence or absence of the target condition in each patient based on multiple sources of information. These sources can include general patient characteristics, signs and symptoms from history and physical examination, and other test results. Information from clinical follow-up may also be included as an additional source of information to improve the classification of patients by the experts.

Variation exists in how to synthesise the input from different experts. Experts can discuss the information on each patient directly in a meeting and produce a consensus diagnosis using majority voting in case of disagreement. In other studies, experts determined the final diagnosis independently from each other and patients were only discussed in a consensus meeting in case of disagreement among experts. The final diagnoses from individual experts can also be combined using a statistical model.
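As a minimal sketch of combining independent expert votes by majority rule (the labels and the tie-handling convention are illustrative assumptions, not from any cited study):

```python
from collections import Counter

def consensus(votes):
    """Majority diagnosis across experts; 'discuss' flags a tie for a consensus meeting."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "discuss"      # evenly split votes -> resolve by discussion
    return counts[0][0]       # otherwise the majority label wins

print(consensus(["HF", "HF", "no HF"]))  # -> HF
print(consensus(["HF", "no HF"]))        # -> discuss
```

An odd number of experts avoids the tie branch altogether, which is one reason it is sometimes recommended.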

Because a final diagnosis is determined in all patients, researchers can calculate estimates of accuracy in the traditional way without the use of specialised software.

Several practical decisions have to be made in setting up a panel method. These include:

● the choice of experts: number and background
● the number of items to be considered in reaching a diagnosis and the way to present them
● whether or not to include the results of the index test
● combining the input from experts
● training and piloting.

Surprisingly little has been written in the medical literature about these practical decisions and how they can affect the final outcome. The discussion that follows is therefore largely based on theoretical considerations.

In the selection of experts, the qualities of the experts are expected to be of greater importance than the number of experts in a panel. Judgements by specialists may come closer to the true status of patients than a random or convenience sample of physicians. An uneven number of experts is sometimes recommended, as it facilitates decision-making in case of majority voting.80

In general, all information relevant for the classification of patients is presented to the experts in a standardised way. Subtle information from history taking, especially, might be difficult to get across on paper. Video footage of the original history taking or video films of dynamic radiological examinations might be an option.

Special care should be given to the role of the index test result in a panel diagnosis. If the index test result is given to the experts, its importance may be overestimated, leading to inflated measures of accuracy. This is known as incorporation bias.2 On the other hand, the final classification of patients with and without the target condition may become better if the results of the index test are disclosed to the experts. This would then reduce the problem of misclassification of the disease status. Most authors advise withholding the index test result from the experts. An attractive alternative is to use a staged approach, in which experts first make a final diagnosis without the index test results; the index test result is then revealed and the experts are asked whether they would like to change their classification.

As stated before, several methods of synthesising the input from different experts have been applied. Standard consensus meetings have been questioned because of problems arising from powerful personalities and also peer or group pressures. The Delphi procedure is a formal technique to collect and synthesise expert opinions anonymously to overcome these types of problems.81

Offering training and piloting sessions to experts can increase agreement among them. Piloting of the disease classification process may unearth problems that can then be addressed prior to the actual consensus process. Whether or not fixed decision rules need to be established during this piloting process is a matter of debate.

Strengths and weaknesses

Consensus diagnosis is an attractive alternative if a generally accepted reference standard does not exist and multiple sources of information have to be interpreted in a judicious way to reach a diagnosis. Examples include target conditions that can lead to a wide range of symptoms and that cannot be diagnosed histologically. In those cases, the target condition can be defined by the clinicians, who take into account items of history, clinical examination, other test results and clinical follow-up, in order to determine clinical management. In such cases, the panel method closely reflects the clinically relevant concept of the target condition. The accuracy statistics calculated in a study that relies on the panel method to classify disease are likely to have high levels of generalisability to clinical practice.

Other advantages of this approach are the flexibility in determining conditions that have been poorly defined and the option of classifying conditions in a binary way (target condition present or absent) and also in a multi-level way (as in severe, moderate, mild, no disease, indeterminate, etc.).

The downside of this method can be poor inter- and even intra-rater agreement, leading to discordance in disease classification between and within the experts. The levels of inter- and intra-rater agreement for the experts can be measured and should be reported in the study as part of the validation of the consensus method.

Second, the subjectivity of the disease classification process may introduce bias. Incorporation bias or test review bias looms if the result of the index test is disclosed to the experts.

Third, the absence of a strict definition of disease in panel-based methods may be responsible for a divergence in the definition of ‘disease’ in the group of experts, who may have different views of what constitutes the target condition of interest. Piloting and some form of standardisation can help in remedying this problem.

Fourth, the panel method can become a laborious enterprise if many items of information per patient need to be summarised and if many patients have to be discussed among several experts.

Finally, domination by assertive members of the panel can weaken any consensus procedure, although formal techniques such as the Delphi method are available to reduce this problem.

Application

The panel diagnosis method is well suited when there is no generally accepted reference standard procedure and multiple sources of information have to be interpreted in a judicious way to reach a diagnosis. In particular, the panel method is suited for target conditions that cannot be unequivocally defined. Examples in which panel diagnoses have been used include heart failure, the underlying causes in patients with syncope and underlying conditions in patients presenting with dyspnoea.

Software

No specialised software is needed, unless advanced procedures such as latent class analysis are used as a method of combining the results from various experts (for details, see the section ‘Latent class analysis’, p. 26).

Clinical example

Echocardiography is considered an imperfect reference standard test for heart failure. The European Society of Cardiology recommends that the diagnosis of heart failure be “based on the symptoms and clinical findings, supported by appropriate investigations such as electrocardiogram, chest X-ray, biomarkers and Doppler-echocardiography”. Convening a consensus panel to establish the presence or absence of the target condition is then an appropriate approach to address the issue of an imperfect reference standard in this study.

Rutten and colleagues evaluated which clinical variables provide diagnostic information in recognising heart failure in primary care patients with stable chronic obstructive pulmonary disease (COPD), and whether easily available tests provide added diagnostic information, in particular N-terminal pro-brain natriuretic peptide (NT-proBNP).82 They studied patients over the age of 65 years with stable COPD diagnosed by a general practitioner, without a previous diagnosis of heart failure made by a cardiologist. Clinical variables included history of ischaemic heart disease, cardiovascular medications, body mass index, heart displacement and heart rate. The tests included NT-proBNP, C-reactive protein, electrocardiography and chest radiography. An expert panel made up of two cardiologists, a pulmonologist and a GP determined the presence or absence of heart failure by consensus. The panel used all available information, including echocardiography, but did not use NT-proBNP in making the diagnosis. When there was no consensus, the majority decision was used to allocate the diagnostic category. Whenever a situation of evenly split votes arose, the majority decision amongst the two cardiologists and the GP was used to reach a diagnosis, thus leaving out the vote of the pulmonologist. The situation of a split vote occurred only in a minority of cases (5/405, 1%). The authors re-presented a random sample of patients to the expert panel, blinded to the original decision, and found an excellent level of reproducibility (Cohen’s kappa = 0.92).

A limited number of items from history could help GPs identify those with heart failure. NT-proBNP and electrocardiography were the two tests that were found to be useful in improving the accuracy of diagnosis. As fully appreciated by the study authors, the study suffers from the possibility of incorporation bias, except in the case of the index test NT-proBNP, as results of this test were not made available to the consensus panel. The authors postulated that the magnitude of the incorporation bias is likely to have been small, as most diagnostic determinants were not crucial in the panel diagnosis process. This postulate is supported by the use of echocardiography, which is central in the diagnosis of heart failure, only as part of the reference standard, and not as one of the index tests.

No specific algorithms or guidelines were given in the article on how the panel weighted and combined the various items of history, examination and investigation findings, although references were made to previous studies which may have given some guidance on this. Standardisation of the disease categorisation process and the threshold for disease positivity are recognised to be problems with heart failure. It is not clear if domination by a specific person or persons in the consensus panel was an issue, as this is certainly plausible when the panel has generalists and specialists with varying levels of expertise in cardiology. Despite these weaknesses, this study is an excellent example of the use of a consensus panel as a reference standard.

Latent class analysis

This group contains many variations, but all methods have in common the use of a statistical model to combine different pieces of information (test results) from each patient to construct a reference standard. These methods acknowledge that there is no gold standard and that the available tests are all related to the unknown true status: target condition present or absent.64,83–89

The problem that the outcome of interest cannot be measured directly occurs in many research situations. Examples include constructs such as intelligence, personality traits or, as in our case, the true diagnosis. These unobservable outcomes are named latent variables. These latent variables can only be measured indirectly by eliciting responses that are related to the construct of interest. These measurable responses are called indicators or manifest variables. Latent variable models are a group of methods that use the information from the manifest variables to identify subtypes of cases defined by the latent variable.

The problem of evaluating an index test in the absence of a gold standard can be viewed as a latent variable problem. In our case we have a dichotomous latent variable, namely whether or not patients have the target condition. When the latent variable is categorical (dichotomous being a special case within this group), the latent models are referred to as latent class models, whereas when the latent variable is continuous they are named latent trait models. Results of the index test and other imperfect tests are then the observable (manifest) variables that can be used to estimate the parameters that are linked to true disease status, such as sensitivity, specificity and prevalence. Maximum likelihood methods can be used to estimate these parameters of the latent model.

The latent class approach has similarities with the panel or consensus diagnosis method, which also uses multiple pieces of information to construct a reference standard. In the panel method, experts determine whether each patient has the target condition given a set of test results, whereas in the latent class approach a formal statistical model is used to obtain the statistics of interest.

Basic model

The statistical framework underlying latent class models can be illustrated using the following basic example. In this example we have three different tests being applied in all patients, with each test producing a dichotomous test result (i.e. the test is either positive or negative). All three tests relate to the same target condition, but none of them is error free. For a single test, the probability of obtaining a positive test result can be written as the sum of finding a positive test in a patient who has the target condition (true positive result) or a positive test result in a patient without the target condition (false positive result). These probabilities (Prob) can be written as a function of the following unknown measures: prevalence (prev), sensitivity of test 1 (sens1) and specificity of test 1 (spec1). The probability of finding a true positive result for test 1 (TP1) can be written as

Prob(TP1) = prev × sens1


and the probability of obtaining a false positive result (FP1) as

Prob(FP1) = (1 – prev) × (1 – spec1)

Therefore, the probability of a positive test result for test 1 is the sum of these two probabilities:

Prob(+) = Prob(TP1) + Prob(FP1) = prev × sens1 + (1 – prev) × (1 – spec1) (1)

In the same way, we can write the probability of obtaining a true negative (TN1) result as

Prob(TN1) = (1 – prev) × spec1

and for a false negative (FN) test result as

Prob(FN1) = prev × (1 – sens1)

and the probability of a negative test result as

Prob(–) = Prob(TN1) + Prob(FN1) = (1 – prev) × spec1 + prev × (1 – sens1) (2)

Of course, the probabilities of the other tests are defined in the same way, but then using the sensitivity and specificity of that specific test. This means that there are seven unknown parameters in this example: one prevalence parameter and the sensitivity and specificity for each of the three tests (1 + 6 = 7).

With three different dichotomous tests, there are eight possible combinations of test results: all tests being positive, three variations where two tests are positive and one negative, three situations where one test is positive and two negative, and the situation where all tests are negative. By using the probabilities for a positive [equation (1)] or negative test result [equation (2)] for each test, we can write down the likelihood of observing each pattern of test results. Under the assumption of statistical independence, the likelihood of observing a specific pattern can be written as the probability of observing that pattern in patients who have the target condition plus the probability of observing the same pattern in patients without the target condition.

For instance, the probability of observing the pattern ++– is

Prob(++–) = sens1 × sens2 × (1 – sens3) × prev + (1 – spec1) × (1 – spec2) × spec3 × (1 – prev)

When we carry out such a study, we would observe the number of patients for each of the eight patterns of test results. The sum of these numbers has to be equal to the total number of patients in the study, which means that we have 8 – 1 = 7 degrees of freedom. If the degrees of freedom are equal to or greater than the number of parameters to be estimated, standard maximum likelihood methods can be used to obtain a (unique) solution. More mathematical details can be found elsewhere.85,90
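The likelihood building blocks above can be sketched in code. The following computes the probability of any test-result pattern under conditional independence; the parameter values are illustrative, not estimates from any study:

```python
from itertools import product

def pattern_prob(pattern, prev, sens, spec):
    """Probability of an observed pattern (True = positive) for three
    conditionally independent dichotomous tests."""
    p_dis, p_nodis = prev, 1 - prev
    for result, se, sp in zip(pattern, sens, spec):
        p_dis *= se if result else (1 - se)            # term within the diseased class
        p_nodis *= (1 - sp) if result else sp          # term within the non-diseased class
    return p_dis + p_nodis                             # sum over the latent classes

# Illustrative parameter values (assumptions, not estimates)
prev, sens, spec = 0.1, [0.9, 0.8, 0.7], [0.95, 0.9, 0.85]
total = sum(pattern_prob(p, prev, sens, spec)
            for p in product([True, False], repeat=3))
print(round(total, 10))  # the eight pattern probabilities sum to 1
```

For the pattern ++– this reproduces the formula in the text term by term: sens1 × sens2 × (1 – sens3) × prev for the diseased class plus (1 – spec1) × (1 – spec2) × spec3 × (1 – prev) for the non-diseased class.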

Extensions of the basic latent class model

Several extensions to this basic model (dichotomous latent variable, three dichotomous tests assuming uncorrelated errors) have been formulated. Here we discuss the main extensions relevant for diagnostic research.

The number and type of tests

Latent class models are flexible and can incorporate dichotomous results, but also ordinal or continuous test results.90 The model can easily be extended to incorporate the results of more than three tests. Including additional tests is beneficial from a modelling point of view, as it increases the available degrees of freedom (more tests lead to more combinations of test results). More degrees of freedom mean that more parameters can be estimated, for instance a correlation parameter to acknowledge that errors between tests might be correlated (see also the section below on conditional dependence). The extra degrees of freedom can also be used for additional checks of the fit of the model.

Much attention has been given to estimating the accuracy of a test when only one other additional test (e.g. an imperfect reference standard) is available.91,92 In this case, the number of parameters to estimate (1 prevalence + 2 sensitivities + 2 specificities = 5 unknown parameters) is larger than the available degrees of freedom (4 test result combinations: ++, +–, –+, ––, minus 1 = 3 degrees of freedom). This means that no optimal maximum likelihood solution can be identified: different combinations of values of prevalence, sensitivities and specificities fit the data equally well. Only through restrictions can we estimate the parameters of the model, for instance by assuming that the sensitivities and specificities of the two tests are equal. Another option is to incorporate prior information about the parameters into the model by using a Bayesian approach to estimate their values (see the section ‘Bayesian framework’, p. 28).

Conditional dependence

The basic model assumed that the results of the three available tests were independent conditional on the true disease status, also known as local independence. Independence in this context means that the errors of the tests are not correlated, e.g. if a diseased patient is misclassified by one test, it does not increase the likelihood that this patient will be misclassified by another test. In other words, there is no group of ‘difficult’ patients in whom several tests perform less well than expected. This assumption might hold if tests measure different manifestations of the target condition and/or use different clinical methods. In the detection of Chlamydia, for example, when antigen detection with EIA, cell culture and DNA amplification with PCR are used, it is less likely that these tests make the same type of errors than if two of the three tests were DNA amplification tests. An example where the assumption of conditional independence is likely to be violated is in the detection of lumbar herniation if all three available tests are imaging tests, such as magnetic resonance imaging (MRI), CT and radiography, all focusing on the detection of visible abnormalities of the discus.

For many situations, the independence assumption is unlikely to be true. Ignoring the correlation of errors between tests can seriously affect the estimates of accuracy.26,64 To solve this problem, we can incorporate the correlation of errors between tests into the model.26,88 This requires the estimation of additional parameters and therefore more degrees of freedom to estimate them. Additional degrees of freedom can be obtained by including more tests, repeating the study in another population with a different prevalence of the condition (but unchanged accuracy estimates) or incorporating prior information using a Bayesian framework (see the next section).

Bayesian framework

The parameters of a latent class model can also be estimated through a Bayesian approach instead of the more traditional maximum likelihood estimation (frequentist approach). The Bayesian framework estimates the same latent model and parameters, but it explicitly incorporates prior information into the model.54,91,93–95 In the Bayesian approach, the unknown parameters are all treated as random variables having a probability distribution. Information available on each parameter prior to collecting the data is summarised in the prior probability distribution, which is then combined with information from the observed data to obtain a posterior probability distribution for each parameter. The posterior distribution can be used to obtain point estimates for the mean and median sensitivity and specificity with credibility intervals, which can be loosely interpreted as Bayesian CIs.

Prior distributions are typically uninformative or determined from the published literature or in consultation with experts. The Bayesian approach is sensitive to the choice of prior distribution: using different priors can lead to differences in estimates of diagnostic accuracy.

The Bayesian approach can be particularly helpful in ill-defined situations, such as situations where the number of parameters to be estimated is large relative to the available degrees of freedom. The use of prior information in combination with simulations means that the Bayesian approach can obtain estimates where traditional methods fail.96

Bayesian modelling is more complicated, as it requires programming, simulations and validation of the results.

Strengths and weaknesses

The latent class analysis methods are well documented, statistically sound and have been applied in many areas of research.90 The characteristics of this method have been extensively studied, including in simulation studies64,85,88 and in studies that compare latent class estimates of accuracy with the estimates derived from a classic design where a gold standard was available.97 The model provides estimates of sensitivity and specificity for all tests incorporated in the model, which are the indexes most commonly used in test evaluation research. Latent class analysis is a flexible approach that can incorporate different types of test results (dichotomous, ordinal and continuous).

The drawbacks of latent class analysis are well described in the literature.64,85,89 The greatest concern is not related to statistical issues but to a more basic principle. In a latent class analysis, the target condition is not defined in a clinical sense. Because we make no clinical definition of disease in latent class analysis, clinicians can feel uncomfortable about what the results represent.24,64,85

Latent class analysis does not fully comply with a basic principle of test evaluation research. When evaluating the performance of an index test, it needs to be compared with a standard that is independent of the index test itself. In latent class analysis, however, this principle is partly violated because the index test results are often used to construct the reference standard. Considerations similar to those mentioned for the panel diagnosis methods apply (see the section ‘Panel or consensus diagnosis’, p. 24). This problem is greater in the panel method, because experts may overestimate the capabilities of the index test, whereas in the latent class setting a formal statistical model is applied, which, unlike humans, is not sensitive to the reputation of any test.

Like any statistical model, the validity of the results of a latent class model depends on whether the underlying assumptions are met. Whether or not a latent class model assumes conditional independence is a critical issue. Simulation studies have shown that violation of conditional independence biases the estimates of diagnostic accuracy.26,64 If test errors are positively correlated in diseased and/or non-diseased patients, latent class analysis will yield accuracy estimates which are too high, especially for the more accurate test. A problem is that the conditional independence assumption cannot be fully tested in a model with three dichotomous tests. Additional or repeated testing to increase the degrees of freedom might not be feasible for ethical or economic reasons.

These observations have led to caution against the use of latent class analysis in practice when only three tests are available. Some even extrapolate this view to all latent class models, as even when it is possible to incorporate more information in the model, unverifiable assumptions about dependence are still required.64,85

Field of application

Latent class analysis can be applied in those situations where multiple pieces of information are available for each patient.

Software

Several more advanced statistical packages have implemented latent class analysis, such as S-Plus and R. In addition, there are programs specially developed for latent variable models, such as Latent Gold and LEM. Free software to perform Bayesian latent class analysis is available, requiring only a user registration (WinBUGS).

Clinical example

In the detection of Chlamydia trachomatis, culture, DNA amplification methods such as PCR and antigen detection methods such as EIA have been modelled with latent class analysis (the same example is also used in discrepant analysis and the composite reference standard method). Culture is believed to have nearly perfect specificity. PCR and EIA are believed to be more sensitive than culture in detecting Chlamydia trachomatis. In Table 10, we present the observed test results of these tests in a group of 324 persons, as reported by Alonzo and Pepe.64 The frequency of occurrence for each pattern of test results is given. Latent class analysis uses the information displayed in the table to determine associations between the diagnostic tests. More detail on the analytical expressions for estimates can be found in the paper by Pepe and Janes.85 Referring to the detection of Chlamydia trachomatis with cell culture as the imperfect reference standard, specificity is assumed to be close to 100%, whereas a wide range of values has been reported for its sensitivity.93,98,99

Different priors can be incorporated in the Bayesian approach, for instance an optimistic prior (using a uniform distribution in the range 80–90%) and a pessimistic prior (in the range 55–65%) for the sensitivity of culture. If nothing is known about the index test, a prior distribution for sensitivity and specificity allowing equal weighting in the 0–100% range can be used.
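The effect of such range-restricted priors can be sketched with a simple grid approximation. This is a toy illustration, not the report's Bayesian model: the audit counts (50 culture positives out of 60 specimens judged truly positive) are invented for the example.

```python
import numpy as np

grid = np.linspace(0.001, 0.999, 999)  # grid over the sensitivity of culture

# Hypothetical binomial likelihood: 50 of 60 truly positive specimens were
# culture positive (invented counts, for illustration only).
loglik = 50 * np.log(grid) + 10 * np.log(1 - grid)

def posterior_mean(lo, hi):
    """Posterior mean of sensitivity under a uniform prior on [lo, hi]."""
    prior = ((grid >= lo) & (grid <= hi)).astype(float)
    post = prior * np.exp(loglik - loglik.max())
    post /= post.sum()
    return float(np.sum(grid * post))

m_opt = posterior_mean(0.80, 0.90)   # optimistic prior
m_pes = posterior_mean(0.55, 0.65)   # pessimistic prior
print(round(m_opt, 3), round(m_pes, 3))
```

The same data pulled through the two priors give clearly different posterior means, which is exactly why the choice of prior has to be justified and reported.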

Parameter estimates derived from latent class analysis are as follows:

Prevalence: 0.081
Sensitivity and specificity of EIA: 0.909 and 0.990
Sensitivity and specificity of culture: 0.834 and 0.997
Sensitivity and specificity of PCR: 1.000 and 0.995
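For intuition, maximum likelihood estimates for a two-class model under conditional independence can be obtained with a short EM iteration over the Table 10 counts. This is a sketch, not the software used in the report; the starting values are assumptions chosen to steer EM towards the clinically sensible mode. The results should approximate the estimates listed above.

```python
import numpy as np

# Observed patterns (PCR, EIA, Culture) and frequencies from Table 10; n = 324.
patterns = np.array([[1,1,1],[1,1,0],[1,0,1],[0,1,1],
                     [1,0,0],[0,1,0],[0,0,1],[0,0,0]])
counts = np.array([19, 4, 3, 1, 2, 3, 0, 292])

# Assumed starting values: modest prevalence, reasonably accurate tests.
prev = 0.1
se = np.array([0.8, 0.8, 0.8])   # sensitivities (PCR, EIA, Culture)
sp = np.array([0.95, 0.95, 0.95])  # specificities

for _ in range(2000):
    # E-step: posterior probability of disease for each pattern, assuming
    # conditional independence of the three tests given disease status.
    p_pos = prev * np.prod(se**patterns * (1 - se)**(1 - patterns), axis=1)
    p_neg = (1 - prev) * np.prod((1 - sp)**patterns * sp**(1 - patterns), axis=1)
    w = p_pos / (p_pos + p_neg)
    # M-step: update parameters from expected counts.
    n_pos = np.sum(counts * w)
    prev = n_pos / counts.sum()
    se = (counts * w) @ patterns / n_pos
    sp = (counts * (1 - w)) @ (1 - patterns) / (counts.sum() - n_pos)

print(round(prev, 3), se.round(3), sp.round(3))
```

With three tests the model is just identified (seven free parameters, seven degrees of freedom), so the fitted estimates reproduce the observed table essentially exactly.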

Validate index test results
In all previous methods, the focus was on correcting imperfect reference standards, or constructing a new or better reference standard, to calculate the classical indexes of diagnostic test accuracy such as sensitivity and specificity. In this section, we discuss approaches to evaluate index

Health Technology Assessment 2007; Vol. 11: No. 50


© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

TABLE 10 Pattern of test results and their frequency of occurrence

Pattern   PCR   EIA   Culture   Observed frequency
1         +     +     +         19
2         +     +     –         4
3         +     –     +         3
4         –     +     +         1
5         +     –     –         2
6         –     +     –         3
7         –     –     +         0
8         –     –     –         292

test results that go beyond the diagnostic accuracy paradigm. Validation is an alternative process to evaluate a medical test in the absence of an unproblematic and unequivocal reference standard. In this context, validity refers to how well the index test measures what it is supposed to be measuring.100–102

Many instruments in the social sciences and elsewhere in science rely on a validation process to determine whether or not the instrument can serve its purpose. In these cases, the diagnostic accuracy paradigm often cannot be used because the 'truth' cannot be observed. The latter applies to many hypothetical or conceptual constructs. We cannot observe a construct, or any latent variable (see the section 'Latent class analysis', p. 26). What we can do is observe associated attributes, for instance sweating, moaning and asking for pain medication in the evaluation of pain.101,103 In a similar way, we can evaluate associations between the index test and these attributes.

Three different forms of validity have traditionally been distinguished: content validity, criterion-related validity and construct validity.101 In this context the classical diagnostic accuracy design (Figure 1, p. 2) refers to criterion-related validity, where the reference standard provides the criterion against which the index test is validated. Several other definitions have been provided, including intrinsic validity, logical and empirical validity, factorial validity and face validity. This proliferation is in part based on a misinterpretation of the classical paper by Cronbach and Meehl, who pointed to the quintessential nature of construct validation in all areas where criterion-oriented definitions cannot be used.104

Approaches towards the validation of medical tests explore meaningful relations between index test results and other test results or clinical characteristics, none of which can be uplifted to the status of a reference standard, either in isolation or in combination. Relevant items can come from the patient's history, clinical examination, imaging, laboratory or function tests, severity scores and prognostic information.

Construct validity is not determined by a single statistic, but by a body of research that demonstrates the relationship between the test and the target condition that it is intended to identify or characterise. Validating a test can then be understood as a gradual process whereby we determine the degree of confidence we can place on inferences about the target condition in tested patients, based on their index test results. That degree of confidence is based on a network of associations between the test results and other pieces of information in tested patients.

One important way to validate an index test is to use dedicated follow-up to capture clinical events of interest in relation to index test results. If the index test will be used for predicting future events, such a prognostic study can be seen as an evaluation of the predictive validity of the test.101

The question of whether it is safe to withhold further testing in patients with a low probability of pulmonary embolism and a negative D-dimer result has been studied by looking at the 3-month incidence of venous thromboembolism, and reported as such.61

One step further is the evaluation of a test within randomised clinical trials of therapy. One aim of the test is to identify patients who are more likely to benefit from the new, active treatment.105,106 A possible design corresponding to that study question is displayed in Figure 5.

Note that all patients receive the index test, but that none of them receives a reference standard. After testing, patients are randomly allocated to either treatment or clinical follow-up. At the end of follow-up, the data collected can be displayed in four 2-by-2 tables (Figure 5). Different study objectives can be answered with these tables. From Tables A and B in Figure 5, a conclusion can be drawn as to whether treatment is beneficial compared with clinical follow-up in index test positive and test negative patients, respectively. Odds ratios and CIs can be calculated to underpin the conclusions. From Tables C and D, conclusions can be drawn concerning the prognostic value of the test within the context of subsequent clinical decision-making. Table C refers to the ability of the index test to identify patients who are likely to benefit from therapy. The test properties can be expressed either as event rates (of, for example, poor outcomes) or as odds ratios. Table D refers to the ability to discriminate between different risk categories for a specific event. Several modifications of the basic randomised controlled trial (RCT) model to evaluate tests have been described by Lijmer and Bossuyt.105
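The odds ratio and its CI for any of the four tables follow directly from the cell counts. A minimal sketch using the standard Woolf log-odds method; the counts are invented for illustration only.

```python
import math

def odds_ratio_ci(a, b, c, d):
    """Odds ratio with 95% CI (Woolf log method).
    a, b: poor / favourable outcomes in the first group;
    c, d: poor / favourable outcomes in the second group."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of log odds ratio
    lo = math.exp(math.log(or_) - 1.96 * se_log)
    hi = math.exp(math.log(or_) + 1.96 * se_log)
    return or_, lo, hi

# Hypothetical Table A (test-positive patients): 8/50 poor outcomes under
# treatment versus 20/50 poor outcomes under clinical follow-up.
or_, lo, hi = odds_ratio_ci(8, 42, 20, 30)
print(round(or_, 2), round(lo, 2), round(hi, 2))
```

An odds ratio well below 1 with a CI excluding 1 would, in this hypothetical table, support a benefit of treatment in test-positive patients.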

Basic design
No single basic design exists for construct validation studies. There are many possible designs to evaluate the validity of an index test, as there are many different predictions possible based on our theory or construct.

Results


As an example, we look at the evaluation of a new questionnaire to evaluate psychological stress. Many hypotheses may be generated, including that the level of stress increases heart frequency, and also blood values of cortisol. An association study could be conducted to evaluate the correlation between the score, heart frequency, blood levels of cortisol and past exposure to psychological stress, such as surviving a tsunami. An additional approach would be to perform a factor analysis, to evaluate whether the elements of the questionnaire belong to one or more dimensions (factors), and to check whether the identified factors correspond with the theory of psychological stress. Still another approach could be to induce stress artificially by an intervention, such as showing pictures of tortured persons. If so, participants should have higher scores on the questionnaire after such an intervention. Stated alternatively: if patients with high scores are given a tranquilliser, then their scores should drop. Here an intervention study would be conducted. If scores drop after tranquilliser intake in those with initially high scores, whereas those with low scores remain at approximately the same level, then the new questionnaire can be considered to have construct validity. Additional studies would have to be conducted. Gradually, our confidence in the construct validity and the inferences we can draw from measuring that construct will grow.
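An association study of this kind amounts to estimating a correlation matrix. A minimal simulated sketch, in which all data are invented and a latent "stress" variable stands in for the unobservable construct:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Simulated construct-validation data (entirely hypothetical): a latent
# stress level drives the questionnaire score, heart rate and cortisol.
stress = rng.normal(size=n)
score = stress + rng.normal(scale=0.5, size=n)            # index questionnaire
heart_rate = 70 + 5 * stress + rng.normal(scale=3, size=n)
cortisol = 10 + 2 * stress + rng.normal(scale=2, size=n)

# A network of consistent positive associations supports construct validity.
r = np.corrcoef([score, heart_rate, cortisol])
print(r.round(2))
```

If the questionnaire is a valid measure of the construct, all pairwise correlations with the other stress-related attributes should be substantial and in the hypothesised direction.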

Several ways exist to express associations between the measured attribute and other attributes, events or subgroups, including correlation indexes, odds and risk ratios, and absolute and relative differences.

Strengths and weaknesses
In situations where no reference standard is available, association studies may be the only type of study that can be performed. The validation of the index test will depend on the underlying theories about the target condition, as the latter provide us with hypotheses about the kind of associations between index test and attributes that have to be evaluated. If the theory is wrong, totally or in part, any quantitative expression of the strength of the association may be misleading.

Whenever the index test results fail to show the hypothesised network of associations between the target condition and other observations, more than one conclusion is possible: the index test has low validity, the theory about the target condition is not correct, or both the index test and the theory are inadequate.


FIGURE 5 Randomised controlled trial (RCT) using index test results as a prognostic marker (T = test). [Diagram: all patients receive the index test and are then randomised (R) to treatment or clinical follow-up; poor and favourable outcomes are tabulated for T+ and T– patients in each arm, yielding four 2-by-2 tables: A (in patients with T+: treat versus no treat), B (in patients with T–: treat versus no treat), C (treated patients: T+ versus T–) and D (untreated patients: T+ versus T–).]

Specific problems related to clinical follow-up have been identified. The nature of the target condition may change during clinical follow-up. Even when present, the target condition is not guaranteed to produce detectable events or complaints. The length of follow-up has to be chosen judiciously for each specific target condition.

An additional problem arises when an intervention is used. If the intervention is chosen badly, and does not achieve what it was intended to achieve, the construct validity of the index test will be masked. If the outcome is favourable, any change cannot be attributed exclusively to the discriminatory characteristics of the index test. Hence test evaluation and management consequences cannot be disentangled in this type of test evaluation.

A major drawback of construct validation is that there is no single type of study that can be prescribed, as in diagnostic accuracy studies. Several studies have to be performed before enough confidence in the validity of the index test can be achieved.107

Field of application
We hypothesise that this type of test evaluation can be done in all areas of medicine where an accepted reference standard, or any other criterion to determine the diagnostic accuracy of a test, is absent. In the discussion (Chapter 5), we will argue that validation is a more universal way of appraising the value of medical tests.

Clinical example 1: randomised controlled trial
We will use a study by Subtil and colleagues, titled 'Randomised comparison of uterine artery Doppler and aspirin (100 mg) with placebo in nulliparous women', to explain some of the design and efficiency issues that play a role when using an RCT to evaluate a test.108

The objective of this study was to assess the effectiveness of a strategy of pre-eclampsia prevention based on routine uterine artery Doppler examination during the second trimester of pregnancy, followed by a prescription of 100 mg of aspirin in those with an abnormal Doppler result, versus a no-Doppler strategy. The population were nulliparous women (no previous delivery at or after 22 weeks) between 14 and 20 weeks' gestation, with no history of hypertension. Randomisation was used to determine whether women would receive uterine artery Doppler testing between 22 and 24 weeks or not. Women in the no-testing arm of the trial would receive usual care, as would the women in the testing arm with a negative test result (for the design, see Figure 6). The outcomes of the trial were the development of pre-eclampsia, having a baby small for gestational age and perinatal death. The trial did not find a difference


FIGURE 6 Design of the uterine artery Doppler trial. [Diagram: nulliparous women are randomised (R) to either no Doppler (care as usual, follow-up, outcomes) or uterine artery Doppler; a positive test result leads to treatment with aspirin and a negative test result to care as usual, both followed up for outcomes.]

in outcomes between the tested (Doppler) group and the non-tested (no Doppler) group.

This design offers an assessment of the combination of the Doppler test and aspirin therapy, as women are randomised to Doppler or no-Doppler groups, with treatment of the Doppler-positive women in the Doppler arm. It addresses the question: does testing with Doppler (plus aspirin for test-positive cases) prevent pre-eclampsia? In this design, if testing and treatment were shown not to be effective, it could be because the test is inaccurate or because aspirin is ineffective. This design, therefore, addresses two questions in one, but unfortunately it is generally not possible to separate the accuracy of the test from the effectiveness of the treatment.

This design has various weaknesses. First, the design is not efficient and requires large samples to achieve satisfactory power. This is because the only women contributing to the expected difference in outcome between the two trial arms are those in a small subgroup: those with an abnormal Doppler result in the Doppler arm. In this study, 2491 women were randomised in a 2:1 ratio to the Doppler and control arms; of these 2491 women, an abnormal Doppler result was found in 239 (10%), and these received aspirin. Even with a 2:1 randomisation ratio to increase the number of women with an abnormal Doppler result (who therefore received the active treatment, aspirin), the total number of women who contributed to the difference between the Doppler and control arms was just 239, suggesting an under-powered study.

Even if this trial had recruited many more thousands of women, and become adequately powered, and a benefit had subsequently been shown for the Doppler arm, this would not necessarily be a vindication of the combination of Doppler testing and aspirin therapy. This is because more women in the Doppler arm would have received aspirin compared with the no-Doppler arm, and as aspirin has been shown to be generally effective in reducing pre-eclampsia,109 it is possible that the Doppler arm would have shown benefit regardless of whether the Doppler test was a good predictor of pre-eclampsia, or indeed whether the Doppler test was done at all. This is because about 10% of women would have received aspirin in the Doppler group, compared with none or few in the control group.

This design is, therefore, of limited use in assessing the role of uterine Doppler testing for aspirin therapy.

Clinical examples 2 and 3: validation studies other than trials
An example of a validation approach can be found in a series of studies that evaluated the use of troponin to identify acute coronary syndrome in chest pain patients. Initial studies evaluated cardiac troponin in an accuracy framework, using the original WHO definition of acute myocardial infarction. The latter invites the use of a composite reference standard, looking at characteristic ECG changes – either ST segment elevation or the development of new Q waves – confirmed by significant changes in serial cardiac enzymes.110 Other studies have looked at 30-day outcomes in patients admitted without ST segment elevation on the initial ECG. They found that cardiac events increased significantly with increasing cardiac troponin values, suggesting that the degree of troponin elevation, and also clinical variables such as previous myocardial infarction and an ischaemic ECG, should be considered when deciding treatment.111 These and other studies have led to the recommendation that cardiac troponins should be the preferred markers for the diagnosis of myocardial injury.112 More recent investigations have indicated that increases in biomarkers upstream from markers of necrosis, such as inflammatory cytokines, cellular adhesion molecules, acute-phase reactants, plaque destabilisation and rupture biomarkers, biomarkers of ischaemia and biomarkers of myocardial stretch, may provide an even earlier assessment of overall patient risk. The studies to support this claim are all closer to the validation framework than to the accuracy paradigm.113

Another example of association studies for test validation is the evaluation of new tests for latent tuberculosis infection (LTBI), a condition for which no gold standard exists. The tuberculin skin test (TST) has been the standard test to detect LTBI, although it is known to give false results, particularly in people with previous Bacillus Calmette–Guérin (BCG) vaccination. Interferon-gamma assays (e.g. Quantiferon and Elispot) have recently been developed as new tests for LTBI. Tuberculosis infection evokes a strong T-helper 1 (Th1) type cell-mediated immune response with release of interferon-gamma. In vitro assessment of interferon-gamma production in response to mycobacterial antigens can be used to detect latent infection with Mycobacterium tuberculosis. Blood infected with M. tuberculosis contains a specific clone of T-lymphocytes stimulated by exposure to the ESAT-6 or CFP-10 antigen. The Elispot assay is based on the detection of interferon-gamma, which these


ESAT-6 and CFP-10 specific T-lymphocytes secrete. As there is no gold standard or suitable reference standard for LTBI against which the comparative accuracy of the TST and Elispot could be assessed, the validation study described below provides a good alternative to the classical diagnostic accuracy paradigm. The risk of infection is greatest among those contacts who share a room with the index case for the greatest length of time. This means that airborne transmission increases with length of exposure and proximity to an infectious case of tuberculosis. Hence results of tests for LTBI should correlate with level of exposure, and the test with the strongest association with exposure would be likely to be the most accurate. This issue can be evaluated in observational studies that ascertain exposure to tuberculosis in a relevant population and setting (e.g. an outbreak investigation), perform the various index tests of interest in all eligible subjects and compare test results with exposure status.114 TST and Elispot responses to ESAT-6 and CFP-10 were evaluated in this manner in 535 subjects during a school outbreak.114 In this outbreak investigation, four exposure groups based on proximity and shared activities with an infected case were established following detailed interviews undertaken by a school nurse. The association of index test with exposure status was examined using the gradient of dose–response relating test results to degree of tuberculosis exposure. Comparison of these gradients showed that the Elispot test was statistically significantly better than the TST. Elispot correlated significantly more closely with M. tuberculosis exposure than did TST on the basis of measures of proximity (p = 0.03) and duration of exposure (p = 0.007) to the infected case. TST was significantly more likely to be positive in BCG-vaccinated than in non-vaccinated students (p = 0.002), whereas Elispot results were not associated with BCG vaccination (p = 0.44). The authors concluded that Elispot offers a more accurate approach than TST for identification of individuals who have LTBI. Further validation may come from observational studies evaluating whether new test results are predictive of the development of active tuberculosis in the future.
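A dose–response gradient of this kind can be quantified with a trend test across ordered exposure groups. A Cochran–Armitage-style sketch; the counts below are invented for illustration and are not the study's data.

```python
import math

def cochran_armitage_trend(pos, n, scores=None):
    """Cochran-Armitage z statistic for trend in proportions across
    ordered groups: pos[i] test positives out of n[i] in group i."""
    k = len(pos)
    scores = scores if scores is not None else list(range(k))
    N, T = sum(n), sum(pos)
    pbar = T / N
    sbar = sum(s * ni for s, ni in zip(scores, n)) / N
    num = sum(s * p for s, p in zip(scores, pos)) - T * sbar
    var = pbar * (1 - pbar) * sum(ni * (s - sbar) ** 2
                                  for s, ni in zip(scores, n))
    return num / math.sqrt(var)

# Hypothetical positives per exposure group, ordered least to most exposed.
z = cochran_armitage_trend([5, 12, 25, 40], [100, 100, 100, 100])
print(round(z, 2))
```

A larger z statistic indicates a steeper positive gradient of test positivity with exposure; comparing the gradients of two candidate tests is the logic used to rank Elispot above TST.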


In this report, we have summarised a number of methods that can be used to evaluate tests in the absence of a gold standard, an error-free method to establish with certainty the presence or absence of the target condition in tested patients. In this chapter we present our conclusions as guidance to researchers and emphasise the need to enrich or enlarge the accuracy paradigm in the evaluation of medical tests.

Guidance for researchers
The guidance we have distilled from the existing evidence and the recommendations from researchers can be summarised as follows (Figure 7). Readers should be aware that this flowchart can only provide general guidance, because many factors have to be balanced when choosing one method over another.

The first question to be answered is whether the uncertainty about the test can be satisfactorily resolved by knowing its accuracy. The diagnostic accuracy of a test is an expression of how well the test is able to identify tested persons with the target condition. The limits of this report do not allow us to discuss at length the various other ways in which a test can be evaluated, including randomised trials of test–treatment combinations, which can document whether the use of a test will lead to improved patient outcomes.106 We also omitted from consideration many of the lower-level evaluations that usually precede evaluations of diagnostic accuracy, including studies examining the reproducibility and intra- and inter-observer variability of tests.

We found that in situations that deviate only marginally from the classical diagnostic accuracy paradigm, for example where there are few missing values on an otherwise acceptable reference standard or where the magnitude and type of imperfection in a reference standard is well documented, the methods we summarised may be valuable. However, in situations where an acceptable reference standard does not exist, holding on to the accuracy paradigm may be fruitless. In these situations, applying the concept of clinical test validation can provide a significant methodological advance (see also the next section of this chapter).

A different class of test evaluation methods focuses on changes in patient outcome from using tests. This class includes randomised clinical trials of tests: comparisons of testing versus not testing, or of one index test compared with another. As the number of testing strategies usually exceeds the number that can be adequately dealt with in an RCT, many researchers turn to modelling to explore the health consequences of testing. In decision analysis, data on test characteristics are combined with estimates of the effectiveness, risks and side-effects of treatment. A test's accuracy is not carved in stone, but will depend on the target condition that is to be detected. In talking about a test's accuracy, we will assume that the corresponding target condition has been defined.

If knowing a test's accuracy is important, the next question to ask is whether there is a reference standard providing adequate classification. This would be a reference standard about which there is consensus that it is the best available method for establishing the presence or absence of the target condition. In addition, researchers and readers should have confidence in the classification provided by the reference standard, although misclassification should be considered in every accuracy study.

If the outcome of the reference standard cannot be, or has not been, obtained in all study participants, statistical methods can be used to correct for the incomplete verification. The STARD statement encourages researchers to be explicit about missing information.115 With small rates of missing information on the reference standard outcome, these methods can be used effectively. For large volumes of missing data on the reference standard, the soundness of the approach depends on whether the mechanisms that led to the missing data are known. A general problem for these methods is that the level of acceptance for correction techniques and for imputation methods is still fairly low.
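One classical correction of this kind, in the spirit of Begg and Greenes, rescales the verified cells up to the full cohort, assuming verification depended only on the index test result (missing at random). A sketch with hypothetical counts:

```python
def corrected_accuracy(tp, fp, fn, tn, n_pos, n_neg):
    """Verification-bias-corrected sensitivity and specificity.
    tp, fp, fn, tn: counts among the *verified* patients;
    n_pos, n_neg: all index-test positives/negatives in the cohort."""
    v_pos, v_neg = tp + fp, fn + tn          # number verified in each stratum
    # Scale verified cells up to the full cohort within each test stratum.
    TP, FP = tp * n_pos / v_pos, fp * n_pos / v_pos
    FN, TN = fn * n_neg / v_neg, tn * n_neg / v_neg
    return TP / (TP + FN), TN / (TN + FP)

# Hypothetical cohort: 80 test positives and 240 test negatives,
# of whom only 60 in each stratum were verified by the reference standard.
se, sp = corrected_accuracy(tp=45, fp=15, fn=5, tn=55, n_pos=80, n_neg=240)
print(round(se, 3), round(sp, 3))
```

The naive sensitivity computed from verified patients only (45/50 = 0.90) is inflated relative to the corrected value, illustrating why partial verification biases the classical indexes.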

If the pattern of missingness is not random,researchers could consider using a second


Chapter 5

Guidance and discussion


FIGURE 7 Guidance for researchers when faced with no gold standard diagnostic research situations. [Flowchart, reconstructed as a decision list:]

Is accuracy sufficient for answering the question?
- No: consider other types of evaluation.g
- Yes: Is there a single reference standard providing adequate classification?
  - Yes: Will the outcome of the reference standard be available in all study participants?
    - Yes: classic diagnostic accuracy study.a
    - No: Is it plausible that the reference standard outcomes are missing (completely) at random?
      - Yes: consider imputing missing outcomes on the reference standard, or correct accuracy estimates.b
      - No: Are there specific subgroups in which this reference standard cannot be applied? If yes, consider using an alternative reference standard in these subgroups.c
  - No: Is there a preferred reference standard, but providing inadequate classification?
    - Yes: Is there reliable external information about the degree of imperfection of the reference standard? If yes, consider using correction methods for an imperfect reference standard.d
    - No: Can multiple tests provide adequate classification?
      - Yes: Is there a priori consensus on what combination of test results provides adequate classification?
        - Yes: consider using a composite reference standard.e
        - No: consider using panel consensus diagnosis or latent class analysis.f

a See Chapter 1. b See 'Impute or adjust for missing data on reference standard' (p. 13). c See 'Differential verification' (p. 18). d See 'Correct imperfect reference standard' (p. 16). e See 'Composite reference standard' (p. 21). f See 'Panel or consensus diagnosis' (p. 24) and 'Latent class analysis' (p. 26). g See Chapter 5.

reference standard in patients in whom the result of the first, intended reference standard is not available. The use of that second reference standard should preferably be stipulated in the protocol and reported as such. With differential verification, as in the use of one reference standard in test-positive patients and a second one in test-negative patients, the results should be reported for each reference standard to prevent bias.

If there is a reference standard, but one that is known to be prone to error, statistical methods could be used to obtain estimates that are corrected for these reference standard errors. When there is an identified target condition but no available reference standard, the researcher may aim to construct one, based on two or more other tests. If these are used in a rule-based way to obtain information about the target condition, a composite reference standard has been constructed. If such a rule cannot be made explicit, it is possible to use a panel method, by inviting experts. An alternative method is the use of statistical techniques to obtain the most likely classification. By selecting one of these methods, the researchers – or the panel members – use their knowledge of the clinical target condition to identify variables or features that allow the identification of patients with that particular target condition.
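A simple instance of such a correction is the Rogan–Gladen adjustment, which recovers prevalence from the apparent (observed) prevalence when the reference standard's sensitivity and specificity are known from external information. The numbers in the example are assumed, for illustration only.

```python
def rogan_gladen(apparent_prev, se_ref, sp_ref):
    """Rogan-Gladen estimator: true prevalence implied by the apparent
    prevalence observed with an error-prone reference standard of known
    sensitivity (se_ref) and specificity (sp_ref)."""
    return (apparent_prev + sp_ref - 1.0) / (se_ref + sp_ref - 1.0)

# Illustrative: 15% apparently positive on a reference standard with
# sensitivity 0.90 and specificity 0.95.
p = rogan_gladen(0.15, 0.90, 0.95)
print(round(p, 4))
```

The corrected prevalence is lower than the apparent 15%, because part of the apparent positives are false positives of the imperfect reference standard.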

With these techniques, there is an implicit assumption that they can be used to classify patients in previously unknown categories: the latent classes. This means that in latent class analysis the target condition is not defined in a clinical sense, but is a mathematically defined entity. The classification that comes out of the analysis may not fully coincide with pre-existing knowledge of the patients and the conditions that are amenable to treatment.

If none of these methods seems appropriate, the researchers have to turn to alternative methods of evaluation. Several proposals for a staged evaluation of tests have been made in the literature. A number of these focus on the added value or diagnostic gain of tests.116 This includes studies that determine the added discriminatory power of tests, relative to pre-existing data from history, physical examination or previous tests. Multivariable logistic regression modelling is often used for this purpose, with the change in area under the curve as one of the statistics. The subjective side of 'diagnostic gain' invites clinicians to express the reduction in uncertainty they feel after having received the index test result. These reductions are elicited using visual analogue or quantitative scales. An additional step in these subjective approaches is to look at changes in management, either explicit or in intended decisions.

Several authors have claimed that the accuracy paradigm is insufficient for evaluating the clinical usefulness of tests. The methods just mentioned can all be used, although most require an appropriate reference standard (diagnostic gain) or a clear link with management consequences (RCTs, decision analysis). We feel that there is both a need and a place for additional methods for exploring the likely clinical usefulness of tests. In the final section, below, we introduce the concept of test validation, a progressive exploration and testing of the test results against other features, as a more general and more productive way of evaluating tests.

Limits to accuracy: towards a validation paradigm
In clinical medicine, tests are not ordered only for knowing the results for their own sake. A physician at the emergency department (ED) not only wants to know the concentration of cardiac troponin in the patient with chest pain, but also wants to know the likelihood of a diagnosis of acute coronary syndrome. In addition, he or she wants to know where to locate the patient in the risk spectrum of acute coronary syndrome.77,117 The ED physician will probably rely on the troponin results, and also on other findings, in trying to decide what to do with the patient. Can this patient be safely sent home? Does the patient have to be admitted to the coronary care unit? Does the patient qualify for primary angioplasty? An umbrella question for these issues is: are patients with chest pain at the ED better off if they are routinely tested for troponin? In the answers to these questions, the exact value of the troponin concentration is an essential but insufficient element. What matters is not the attribute that is measured (troponin), but the level of confidence with which that attribute can be used to classify the patient in the risk spectrum of acute coronary syndrome.

For decades, the diagnostic accuracy paradigm has been used to obtain answers to this type of question. It is difficult to pinpoint its origins exactly, but they go well back to – and beyond – Ledley and Lusted's seminal 1959 paper.18 In its most classical sense, as can be found in all textbooks, the accuracy paradigm requires a gold standard: a technique used to identify with certainty patients with the disease of interest.


Pathophysiology and histology present the prototypical form of a gold standard: a method to determine unequivocally the presence or absence of disease in the human body. Sensitivity and specificity are the two best known statistics to express the results of the comparison of the test results with those of the gold standard.

The accuracy paradigm has proven to be a very valuable one. Its ubiquity has provided a focal point for the clinical evaluation of tests. The need to know the diagnostic accuracy of a test has prompted many useful evaluations. Reports of the poor sensitivity or specificity of a test are likely to inhibit the premature dissemination of tests in modern medicine.

The limited time frame of this report did not allow us to explore, analyse and speculate why the gold standard and the diagnostic accuracy paradigm have become so dominant in the evaluation of medical tests. We can only hint at the prevailing power of the classical view of clinical medicine, with its classical triad of aetiognosis–diagnosis–prognosis. One can also point to the appealing simplicity of the basic accuracy design, which allows study results to be summarised in a very simple two-by-two table, or a slanted curve in two-dimensional ROC space. However, things should only be made as simple as possible, and no simpler. Beyond any doubt, the prevailing accuracy paradigm is far from sufficient to cover all issues in the clinical evaluation of tests. We will summarise only a few of these issues here.

To start, many tests are not used for making a diagnosis at all. They are used for a variety of other purposes, such as guiding treatment decisions, monitoring treatment in chronic patients, informing patients and documenting changes in their condition, or for clinical or basic research. Many problems in present-day healthcare do not turn on issues of diagnosis, certainly not in chronic conditions, where patients are already known to have diabetes, cardiovascular disease or COPD. The diagnostic accuracy paradigm cannot simply be applied in situations where tests are used for purposes other than diagnosis.

The second issue is the problematic relation between pathophysiology and diagnosis, and that between pathology and subsequent actions. Furthermore, the definition or the concept of a disease can change over time as new insights become available. The classical diagnosis of 'acute myocardial infarction' is now embedded in the spectrum of conditions known as 'acute coronary syndrome'. New management options and advances in testing can change the concept of disease, as in ectopic pregnancy. This diagnosis, once attached to a life-threatening condition, now covers subclinical conditions, such as 'trophoblast in regression'.118 Advances in imaging allow the detection of even smaller, subsegmental pulmonary emboli or micro-metastases in patients with cancer. The current 'diagnoses' do not coincide with the older categories.

To some extent, this problem can be remedied by replacing the term 'disease' with 'target condition', as has been done in this report and elsewhere.5

The target condition is a much broader concept, covering any particular disease, a disease stage or any other identifiable condition that may prompt clinical action, such as further diagnostic testing or the initiation, modification or termination of treatment.

Another – and more widely documented – problem is the absence of a true gold standard to classify patients, with certainty and without error, as having the target condition or not. In more cases than many imagine, such a gold standard simply does not exist. It definitely does not exist for many of the most prevalent chronic conditions, including diabetes, migraine and cardiovascular disease. For these problems, the term 'reference standard' has been introduced. This term acknowledges the absence of a 'gold standard' and refers to the best available method for classifying patients as having the target condition.

The term reference standard may seem like a solution, but in accordance with the Law of Frankenstein – 'the monster you create is yours, forever' – it also introduces ineradicable subjectivity. What should the reference standard be for appendicitis, or for deep venous thrombosis? Is it an image, or is it follow-up? The two methods will not yield identical results, and whereas one may be closer to pathophysiology, the other may be closer to patient outcome, the penultimate criterion for decisions in healthcare. A number of authors have suggested that these problems can be circumvented by jumping to the cornerstone of evaluations of interventions: the RCT. As summarised earlier in the report, RCTs cannot act as a panacea.105,106 To allow meaningful interpretations of their results, such trials presume that the links between test results and subsequent clinical actions have been well established. That may not always be the case.

In our view, a move towards a test validation paradigm is justified. This means that scientists and practitioners examine, using a number of different methods, whether the results of an index test are meaningful in practice. Validation will always be a gradual process. It will involve the scientific and clinical community defining a threshold, a point in the validation process where the information gathered would be considered sufficient to allow clinical use of the test with confidence.

Validating a test is a process through which scientists and practitioners can find out whether the results of a test are meaningful. The process of validation will come after initial evaluations of the basic properties of the test: its reliability, consistency and trueness. Once these hurdles have been overcome, we can try to find out how and to what extent the test results fit in our understanding of a patient's condition, its likely causes and course.

To validate a test, we will have to build a conceptual framework, defining how the test results relate to other features. These features may be derived from history, from physical examination, from other test results, from follow-up or from response to treatment.

The concept of test validation encompasses the traditional notion of diagnostic accuracy. If there is an accepted pathophysiological gold standard, one that can be used to detect the presence or absence of disease with certainty, we can validate a test by comparing its results with the findings of the gold standard (criterion validity). Test validation also allows for the evaluation of tests that are not used, or not used exclusively, in making a diagnosis.

Test validation is probably the way to go in cases where a test is proclaimed to be 'better than the existing reference standard'. In these cases also, we have to show that the differences between that test and the reference standard are 'meaningful', by demonstrating reliable associations with other findings and test results.

The results of a validation process cannot be captured in simple statistics. The associations with other variables can and must be expressed in a quantitative sense, but they will never be reduced to a simple pair of numbers, such as a test's sensitivity and specificity.

Discovering or demonstrating that test results are meaningful does not justify the use of that test in daily practice. In the end, the use of tests has to be justified by demonstrating that it leads to better healthcare, by improving outcome, reducing costs, or both.

Validation offers a way out of the dilemma introduced by the multiple fixes that are needed in order to cling to the diagnostic accuracy paradigm in the absence of an accepted gold standard. Replacing the gold standard by the reference standard, and the pathophysiological detection of disease by the identification of the target condition, is not enough. Like all humans, we value simplicity, and the alluring attraction of two simple statistics makes it hard to give up the notion of fixed test characteristics. In some cases these fixes are feasible and reasonable, as has been summarised in this report. In others, we may well abandon the accuracy paradigm and replace it by a new one: the concept of the validation of medical tests.

Recommendations for further research

Diagnostic research with partial verification is common; this includes research based on routinely collected clinical data. There is a need to increase awareness that naïve estimates of accuracy, obtained by leaving unverified patients out of the calculation, can lead to serious bias.
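The mechanism behind this bias is easy to demonstrate. The following Python sketch (all numbers hypothetical) simulates a study in which test-positive patients are verified by the reference standard far more often than test-negative patients. Restricting the analysis to verified patients then inflates sensitivity and deflates specificity relative to the simulated true values of 0.80 and 0.90:

```python
import random

def simulate_partial_verification(n=200_000, prev=0.3, se=0.8, sp=0.9,
                                  p_verify_pos=0.9, p_verify_neg=0.2, seed=1):
    """Simulate a study in which verification by the reference standard
    depends on the index test result, then compute 'naive' sensitivity and
    specificity from the verified patients only (unverified are dropped)."""
    rng = random.Random(seed)
    tp = fp = fn = tn = 0  # counts among verified patients only
    for _ in range(n):
        diseased = rng.random() < prev
        positive = rng.random() < (se if diseased else 1 - sp)
        verified = rng.random() < (p_verify_pos if positive else p_verify_neg)
        if not verified:
            continue  # the naive analysis simply omits these patients
        if diseased:
            tp, fn = tp + positive, fn + (not positive)
        else:
            fp, tn = fp + positive, tn + (not positive)
    return tp / (tp + fn), tn / (tn + fp)

naive_se, naive_sp = simulate_partial_verification()
# naive sensitivity comes out far too high, naive specificity far too low
print(f"true se/sp = 0.80/0.90; naive se/sp = {naive_se:.2f}/{naive_sp:.2f}")
```

In this setting, weighting the verified patients by the inverse of their verification probability, the approach of Begg and Greenes (reference 22), would recover approximately the true values.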

There is a need to perform empirical and simulation studies that will determine the potential and limitations of multiple imputation methods to reduce the bias caused by missing data on the reference standard.
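As a deliberately simplified illustration of the idea: disease status for unverified patients can be drawn repeatedly from its estimated probability given the index test result, and the accuracy estimates pooled across the completed data sets. Proper multiple imputation would also propagate the uncertainty of the imputation model itself and combine within- and between-imputation variance (Rubin's rules); here only the point estimates are pooled, and all names and numbers are hypothetical:

```python
import random

def mi_sensitivity(records, m=20, seed=3):
    """records: list of (t, d) pairs, where t is the index test result (0/1)
    and d is the reference standard result (0/1), or None when unverified.
    Impute missing d from P(d=1 | t), estimated among verified patients
    (assumes missingness depends on the test result only), then pool the
    sensitivity estimates across m completed data sets."""
    rng = random.Random(seed)
    p_d_given_t = {}
    for t in (0, 1):  # estimate the imputation model from verified patients
        verified = [d for (ti, d) in records if ti == t and d is not None]
        p_d_given_t[t] = sum(verified) / len(verified)
    estimates = []
    for _ in range(m):
        tp = fn = 0
        for t, d in records:
            if d is None:
                d = int(rng.random() < p_d_given_t[t])  # draw an imputed status
            if d:
                tp += t
                fn += 1 - t
        estimates.append(tp / (tp + fn))
    return sum(estimates) / m  # pooled point estimate

# Hypothetical data: true sensitivity 0.80, test-positives verified more often.
rng = random.Random(0)
records = []
for _ in range(50_000):
    d = rng.random() < 0.3
    t = rng.random() < (0.8 if d else 0.1)
    verified = rng.random() < (0.9 if t else 0.2)
    records.append((int(t), int(d) if verified else None))
print(f"pooled sensitivity: {mi_sensitivity(records):.2f}")  # near the true 0.80
```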

In setting up a consensus-based diagnosis, there are many practical choices to be made. These include the number of experts, the way patient information is presented, and how to obtain a final classification. There is a need for more methodological studies to provide guidance in this area.

There is a need to design studies that compare the results of a consensus procedure with the results of latent class models when multiple pieces of information are combined to construct a reference standard. There is also a need to examine approaches that combine both procedures.
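To make the latent class alternative concrete, here is a sketch of a two-class latent class model fitted by expectation–maximisation (EM) to three binary tests, assuming conditional independence given the latent disease status. The data are simulated and all parameter values are illustrative:

```python
import random
from collections import Counter

def simulate_tests(n, prev, se, sp, seed=7):
    """Simulate results of conditionally independent binary tests."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        d = rng.random() < prev
        data.append(tuple(int(rng.random() < (se[j] if d else 1 - sp[j]))
                          for j in range(len(se))))
    return data

def fit_latent_class(data, n_iter=500):
    """Two-class latent class analysis via EM: estimates prevalence and each
    test's sensitivity/specificity without any reference standard, assuming
    the tests are conditionally independent given the latent class."""
    n, k = len(data), len(data[0])
    counts = Counter(data)  # collapse to the 2**k response patterns
    prev, se, sp = 0.5, [0.8] * k, [0.8] * k  # starting values
    for _ in range(n_iter):
        sw, num_se, num_sp = 0.0, [0.0] * k, [0.0] * k
        for pattern, c in counts.items():
            p1, p0 = prev, 1.0 - prev
            for j, t in enumerate(pattern):
                p1 *= se[j] if t else 1.0 - se[j]
                p0 *= (1.0 - sp[j]) if t else sp[j]
            w = p1 / (p1 + p0)  # E-step: P(diseased | response pattern)
            sw += c * w
            for j, t in enumerate(pattern):
                if t:
                    num_se[j] += c * w
                else:
                    num_sp[j] += c * (1.0 - w)
        prev = sw / n  # M-step: update parameters from expected class counts
        se = [num_se[j] / sw for j in range(k)]
        sp = [num_sp[j] / (n - sw) for j in range(k)]
    return prev, se, sp

data = simulate_tests(n=20_000, prev=0.3,
                      se=[0.85, 0.75, 0.90], sp=[0.90, 0.95, 0.80])
prev_hat, se_hat, sp_hat = fit_latent_class(data)
print(f"estimated prevalence: {prev_hat:.2f}")  # close to the simulated 0.30
```

With three conditionally independent tests, as here, the two-class model is just identified; when the conditional independence assumption fails, the estimates can be badly biased, which is one reason to compare such models against consensus procedures.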

There is a need to develop more elaborate schemes, outwith the accuracy paradigm, for the validation of tests targeted at diseases or conditions for which there is no acceptable reference standard.


Acknowledgements

We gratefully acknowledge the many useful comments and interesting discussions we had with the many experts we involved in this project. For a list of experts, see Appendix 2. Additionally, we gained input from discussions with Professor Paul Glasziou, Dr Tracy Roberts, Dr Chris Hyde [CH is a member of the Editorial Board for Health Technology Assessment but was not involved in the editorial process for this report] and Dr Priscilla Harries. We thank them for their time and the efforts that they put into this project. This report has been shaped and improved by the comments of the experts, but the views expressed in this report are those of the authors and do not necessarily reflect the opinions of the experts.

Contribution of authors
Anna Rutjes (Research Fellow) was a co-applicant on the project grant application and worked on the development of the protocol, acted as a reviewer, and was involved in the drafting of methods and results. Johannes Reitsma (Senior Clinical Epidemiologist) was a co-applicant on the project grant application and carried out work on the development of the protocol, project management, supervision of review work, as well as drafting and editing of the final report. Arri Coomarasamy (Honorary Lecturer in Epidemiology) was a co-applicant on the project grant application and worked on the development of the protocol and on drafting the results. Khalid Khan (Professor of Obstetrics-Gynaecology and Clinical Epidemiology) was a main applicant on the project grant application and also carried out development of the protocol, project management, drafting of results and editing of the final report. Patrick Bossuyt (Professor and Chair of Department of Clinical Epidemiology) was a main applicant on the project grant application and worked on the development of the protocol, on drafting the results and contributed to discussion, as well as to editing the final report.

References

1. Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282:1061–6.

2. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.

3. Rutjes AW, Reitsma JB, Di NM, Smidt N, van Rijn JC, Bossuyt PM. Evidence of bias and variation in diagnostic accuracy studies. CMAJ 2006;174:469–76.

4. Knottnerus JA, van Weel C. General introduction: evaluation of diagnostic procedures. In Knottnerus JA, editor. The evidence base of clinical diagnosis. London: BMJ Books; 2002. pp. 1–18.

5. Sackett DL, Haynes RB. The architecture of diagnostic research. In Knottnerus JA, editor. The evidence base of clinical diagnosis. London: BMJ Books; 2002. pp. 19–38.

6. Lijmer JG, Mol BWJ, Bonsel GJ, Prins MH, Bossuyt PM. Strategies for the evaluation of diagnostic technologies. Evaluation of diagnostic tests: from accuracy to outcome (Thesis). Amsterdam; 2001. pp. 15–28.

7. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991;11:88–94.

8. Knottnerus JA, Muris JW. Assessment of the accuracy of diagnostic tests: the cross-sectional study. In Knottnerus JA, editor. The evidence base of clinical diagnosis. London: BMJ Books; 2002. pp. 39–59.

9. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332:1089–92.

10. Lord SJ, Irwig L, Simes RJ. When is measuring sensitivity and specificity sufficient to evaluate a diagnostic test, and when do we need randomized trials? Ann Intern Med 2006;144:850–5.

11. Boyko EJ, Alderman BW, Baron AE. Reference test errors bias the evaluation of diagnostic tests for ischemic heart disease. J Gen Intern Med 1988;3:476–81.

12. Deneef P. Evaluating rapid tests for streptococcal pharyngitis: the apparent accuracy of a diagnostic test when there are errors in the standard of comparison. Med Decis Making 1987;7:92–6.

13. Pepe MS. Incomplete data and imperfect reference tests. In Pepe MS, editor. The statistical evaluation of medical tests for classification and prediction. Oxford: Oxford University Press; 2004. pp. 168–213.

14. Vacek PM. The effect of conditional dependence on the evaluation of diagnostic tests. Biometrics 1985;41:959–68.

15. Thibodeau L. Evaluating diagnostic tests. Biometrics 1981;37:801–4.

16. International Organization for Standardization. International vocabulary of basic and general terms in metrology. Geneva: ISO; 1993.

17. Yerushalmy J. Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques. Public Health Rep 1947;62:1432–49.

18. Ledley RS, Lusted LB. Reasoning foundations of medical diagnosis; symbolic logic, probability, and value theory aid our understanding of how physicians reason. Science 1959;130:9–21.

19. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.

20. Rutjes AW, Reitsma JB, Irwig L, Bossuyt PM. Partial and differential bias in diagnostic accuracy studies. Sources of bias and variation in diagnostic accuracy studies (Thesis). Amsterdam; 2005. pp. 31–44.

21. Gustafson P. The utility of prior information and stratification for parameter estimation with two screening tests but no gold standard. Stat Med 2005;24:1203–17.

22. Begg CB, Greenes RA. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 1983;39:207–15.

23. Harel O, Zhou XH. Multiple imputation for correcting verification bias. Stat Med 2006;25:3769–86.

24. Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford: Oxford University Press; 2003.

25. Alonzo TA, Pepe MS. Using a combination of reference tests to assess the accuracy of a new diagnostic test. Stat Med 1999;18:2987–3003.

26. Hui SL, Zhou XH. Evaluation of diagnostic tests without gold standards. Stat Methods Med Res 1998;7:354–70.


27. Cochrane Handbook for Systematic Reviews of Interventions 4.2.5 [updated May 2005]. Chichester: Wiley; 2006.

28. Lilford RJ, Richardson A, Stevens A, et al. Issues in methodological research: perspectives from researchers and commissioners. Health Technol Assess 2001;5(8).

29. Little R, Rubin D. Statistical analysis with missing data. 2nd ed. New York: Wiley; 2002.

30. Paliwal P, Gelfand AE. Estimating measures of diagnostic accuracy when some covariate information is missing. Stat Med 2006;25:2981–93.

31. Choi BC. Sensitivity and specificity of a single diagnostic test in the presence of work-up bias. J Clin Epidemiol 1992;45:581–6.

32. Diamond GA. Off Bayes: effect of verification bias on posterior probabilities calculated using Bayes' theorem. Med Decis Making 1992;12:22–31.

33. Cecil MP, Kosinski AS, Jones MT, et al. The importance of work-up (verification) bias correction in assessing the accuracy of SPECT thallium-201 testing for the diagnosis of coronary artery disease. J Clin Epidemiol 1996;49:735–42.

34. Danias PG, Parker JA. Novel Internet-based tool for correcting apparent sensitivity and specificity of diagnostic tests to adjust for referral (verification) bias. Radiographics 2002;22(2):e4.

35. Held L, Ranyimbo AO. A Bayesian approach to estimate and validate the false negative fraction in a two-stage multiple screening test. Methods Inf Med 2004;43:461–4.

36. Kosinski AS, Barnhart HX. Accounting for nonignorable verification bias in assessment of diagnostic tests. Biometrics 2003;59:163–71.

37. Kosinski AS, Barnhart HX. A global sensitivity analysis of performance of a medical diagnostic test when verification bias is present. Stat Med 2003;22:2711–21.

38. Toledano AY, Gatsonis C. Generalized estimating equations for ordinal categorical data: arbitrary patterns of missing responses and missingness in a key covariate. Biometrics 1999;55:488–96.

39. Zhou XH. Correcting for verification bias in studies of a diagnostic test's accuracy. Stat Methods Med Res 1998;7:337–53.

40. Zhou XH, Higgs RE. COMPROC and CHECKNORM: computer programs for comparing accuracies of diagnostic tests using ROC curves in the presence of verification bias. Comput Methods Programs Biomed 1998;57:179–86.

41. Zhou XH, Higgs RE. Assessing the relative accuracies of two screening tests in the presence of verification bias. Stat Med 2000;19:1697–705.

42. Molenberghs G, Goetghebeur EJ, Lipsitz SR, Kenward MG, Lesaffre E, Michiels B. Missing data perspectives of the fluvoxamine data set: a review. Stat Med 1999;18:2449–64.

43. Engels JM, Diehr P. Imputation of missing longitudinal data: a comparison of methods. J Clin Epidemiol 2003;56:968–76.

44. Liu M, Taylor JM, Belin TR. Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies. Biometrics 2000;56:1157–63.

45. Rubin DB. Multiple imputation after 18+ years. J Am Stat Assoc 1996;91:473–89.

46. Schafer JL. Multiple imputation: a primer. Stat Methods Med Res 1999;8:3–15.

47. Zhou X-H, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York: Wiley; 2002.

48. Greenes RA, Begg CB. Assessment of diagnostic technologies. Methodology for unbiased estimation from samples of selectively verified patients. Invest Radiol 1985;20:751–6.

49. Pisano ED, Gatsonis C, Hendrick E, et al. Diagnostic performance of digital versus film mammography for breast-cancer screening. N Engl J Med 2005;353:1773–83.

50. Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361–87.

51. Drum D, Christacopoulos J. Hepatic scintigraphy in clinical decision making. J Nucl Med 1969;13:908–15.

52. Marshall V, Williams DC, Smith KD. Diaphanography as a means of detecting breast cancer. Radiology 1981;150:339–43.

53. Zhou X-H, Obuchowski NA, McClish DK. Methods for correcting imperfect standard bias. Statistical methods in diagnostic medicine. New York: Wiley; 2002. pp. 359–95.

54. Hadgu A, Dendukuri N, Hilden J. Evaluation of nucleic acid amplification tests in the absence of a perfect gold-standard test: a review of the statistical and epidemiologic issues. Epidemiology 2005;16:604–12.

55. Staquet M, Rozencweig M, Lee YJ, et al. Methodology for the assessment of new dichotomous diagnostic tests. J Chronic Dis 1981;34:599–610.

56. Valenstein PN. Evaluating diagnostic tests with imperfect standards. Am J Clin Pathol 1990;93:252–8.


57. Brenner H. Correcting for exposure misclassification using an alloyed gold standard. Epidemiology 1996;7:406–10.

58. Wacholder S, Armstrong B, Hartge P. Validation studies using an alloyed gold standard. Am J Epidemiol 1993;137:1251–8.

59. Schneeweiss S, Kriegmair M, Stepp H. Is everything all right if nothing seems wrong? A simple method of assessing the diagnostic value of endoscopic procedures when a gold standard is absent. J Urol 1999;161:1116–19.

60. Schneeweiss S. Sensitivity analysis of the diagnostic value of endoscopies in cross-sectional studies in the absence of a gold standard. Int J Technol Assess Health Care 2000;16:834–41.

61. van Belle A, Buller HR, Huisman MV, et al. Effectiveness of managing suspected pulmonary embolism using an algorithm combining clinical probability, D-dimer testing, and computed tomography. JAMA 2006;295:172–9.

62. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926–30.

63. Kline JA, Israel EG, Michelson EA, O'Neil BJ, Plewa MC, Portelli DC. Diagnostic accuracy of a bedside D-dimer assay and alveolar dead-space measurement for rapid exclusion of pulmonary embolism: a multicenter study. JAMA 2001;285:761–8.

64. Alonzo TA, Pepe MS. Assessing the accuracy of a new diagnostic test when a gold standard does not exist. UW Biostatistics Working Paper Series 1998; paper 156.

65. Hadgu A. The discrepancy in discrepant analysis. Lancet 1996;348:592–3.

66. Miller WC. Can we do better than discrepant analysis for new diagnostic test evaluation? Clin Infect Dis 1998;27:1186–93.

67. Miller WC. Bias in discrepant analysis: when two wrongs don't make a right. J Clin Epidemiol 1998;51:219–31.

68. Hadgu A. Bias in the evaluation of DNA-amplification tests for detecting Chlamydia trachomatis. Stat Med 1997;16:1391–9.

69. Lipman H, Astles J. Quantifying the bias associated with use of discrepant analysis. Clin Chem 1998;44:108–15.

70. Hadgu A. Discrepant analysis: a biased and an unscientific method for estimating test sensitivity and specificity. J Clin Epidemiol 1999;52:1231–7.

71. Hadgu A. Discrepant analysis is an inappropriate and unscientific method. J Clin Microbiol 2000;38:4301–2.

72. Green TA, Black CM, Johnson RE. Evaluation of bias in diagnostic-test sensitivity and specificity estimates computed by discrepant analysis. J Clin Microbiol 1998;36:375–81.

73. Diamond GA. Affirmative actions: can the discriminant accuracy of a test be determined in the face of selection bias? Med Decis Making 1991;11:48–56.

74. Wu CH, Lee MF, Yin SC, Yang DM, Cheng SF. Comparison of polymerase chain reaction, monoclonal antibody based enzyme immunoassay, and cell culture for detection of Chlamydia trachomatis in genital specimens. Sex Transm Dis 1992;19:193–7.

75. Martin DH, Nsuami M, Schachter J, et al. Use of multiple nucleic acid amplification tests to define the infected-patient "gold standard" in clinical trials of new diagnostic tests for Chlamydia trachomatis infections. J Clin Microbiol 2004;42:4749–58.

76. Jang D, Sellors JW, Mahony JB, Pickard L, Chernesky MA. Effects of broadening the gold standard on the performance of a chemiluminometric immunoassay to detect Chlamydia trachomatis antigens in centrifuged first void urine and urethral swab samples from men. Sex Transm Dis 1992;19:315–19.

77. Das R, Kilcullen N, Morrell C, Robinson MB, Barth JH, Hall AS. The British Cardiac Society Working Group definition of myocardial infarction: implications for practice. Heart 2006;92:21–6.

78. Thijs JC, Van Zwet AA, Thijs WJ, et al. Diagnostic tests for Helicobacter pylori: a prospective evaluation of their accuracy, without selecting a single test as the gold standard. Am J Gastroenterol 1996;91:2125–9.

79. Madan K, Ahuja V, Gupta SD, Bal C, Kapoor A, Sharma MP. Impact of 24-h esophageal pH monitoring on the diagnosis of gastroesophageal reflux disease: defining the gold standard. J Gastroenterol Hepatol 2005;20:30–7.

80. Gagnon R, Charlin B, Coletti M, Sauve E, van der Vleuten C. Assessment in the context of uncertainty: how many members are needed on the panel of reference of a script concordance test? Med Educ 2005;39:284–91.

81. Jones J, Hunter D. Consensus methods for medical and health services research. BMJ 1995;311:376–80.

82. Rutten FH, Moons KG, Cramer MJ, et al. Recognising heart failure in elderly patients with stable chronic obstructive pulmonary disease in primary care: cross sectional diagnostic study. BMJ 2005;331:1379.

83. Hui SL, Zhou XH. Evaluation of diagnostic tests without gold standards. Stat Methods Med Res 1998;7:354–70.


84. Walter SD, Irwig LM. Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review. J Clin Epidemiol 1988;41:923–37.

85. Pepe MS, Janes H. Insights into latent class analysis. UW Biostatistics Working Paper Series 2005; paper 236.

86. Formann AK. Measurement errors in caries diagnosis: some further latent class models. Biometrics 1994;50:865–71.

87. Committee on Quality Management, Japan Society of Clinical Chemistry. Proposed standard for matrix reference materials for quantitative laboratory methods. Jpn J Clin Chem 2003;32:180–5.

88. Qu Y, Tan M, Kutner MH. Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics 1996;52:797–810.

89. Bertrand P, Benichou J, Grenier P, Chastang C. Hui and Walter's latent-class reference-free approach may be more useful in assessing agreement than diagnostic performance. J Clin Epidemiol 2005;58:688–700.

90. Vermunt JK, Magidson J. Latent class analysis. In Lewis-Beck M, Bryman A, Liao TF, editors. The Sage Encyclopedia of Social Sciences Research Methods. Thousand Oaks, CA: Sage Publications; 2004. pp. 549–53.

91. Joseph L, Gyorkos TW, Coupal L. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. Am J Epidemiol 1995;141:263–72.

92. Black MA, Craig BA. Estimating disease prevalence in the absence of a gold standard. Stat Med 2002;21:2653–69.

93. Hadgu A, Qu Y. A biomedical application of latent class models with random effects. Appl Stat 1998;47:603–16.

94. Gustafson P. Measurement error and misclassification in statistics and epidemiology: impacts and Bayesian adjustments. Boca Raton, FL: Chapman and Hall/CRC; 2003.

95. Dendukuri N, Joseph L. Bayesian approaches to modelling the conditional dependence between multiple diagnostic tests. Biometrics 2001;57:158–67.

96. Georgiadis MP, Johnson WO, Gardner IA, Singh R. Correlation-adjusted estimation of sensitivity and specificity of two diagnostic tests. Appl Stat 2003;52:63–76.

97. Bertrand P, Benichou J, Grenier P, Chastang C. Hui and Walter's latent-class reference-free approach may be more useful in assessing agreement than diagnostic performance. J Clin Epidemiol 2005;58:688–700.

98. Black CM. Current methods of laboratory diagnosis of Chlamydia trachomatis infections. Clin Microbiol Rev 1997;10:160–84.

99. Barnes RC. Laboratory diagnosis of human chlamydial infections. Clin Microbiol Rev 1989;2:119–36.

100. Winter G. A comparative discussion of the notion of 'validity' in qualitative and quantitative research. The Qualitative Report 2000;4. URL: http://www.nova.edu/ssss/QR/OR4-3/winter.html. Accessed 21 September 2007.

101. Streiner DL, Norman GR. Validity. In Streiner DL, Norman GR, editors. Health measurement scales: a practical guide to their development and use. Oxford: Oxford University Press; 1995. pp. 144–62.

102. Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ 2002;324:606–7.

103. Fletcher RH, Fletcher SW, Wagner EH. Abnormality. In Satterfield TS, editor. Clinical epidemiology: the essentials. Baltimore, MD: Williams and Wilkins; 1996. pp. 22–24.

104. Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychol Bull 1955;52:281–302.

105. Lijmer JG, Bossuyt PM. Diagnostic testing and prognosis: the randomised controlled trial in diagnostic research. In Knottnerus JA, editor. The evidence base of clinical diagnosis. London: BMJ Books; 2002. pp. 61–80.

106. Bossuyt PM, Lijmer JG, Mol BW. Randomised comparisons of medical tests: sometimes invalid, not always efficient. Lancet 2000;356:1844–7.

107. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991;11:88–94.

108. Subtil D, Goeusse P, Houfflin-Debarge V, et al. Randomised comparison of uterine artery Doppler and aspirin (100 mg) with placebo in nulliparous women: the Essai Regional Aspirine Mere-Enfant Study (Part 2). BJOG 2003;110:485–91.

109. Duley L, Henderson-Smart D, Knight M, King J. Antiplatelet drugs for prevention of pre-eclampsia and its consequences: systematic review. BMJ 2001;322:329–33.

110. Collinson PO, Stubbs PJ, Kessler AC. Multicentre evaluation of the diagnostic value of cardiac troponin T, CK-MB mass, and myoglobin for assessing patients with suspected acute coronary syndromes in routine clinical practice. Heart 2003;89:280–6.

111. Kontos MC, Shah R, Fritz LM, et al. Implication of different cardiac troponin I levels for clinical outcomes and prognosis of acute chest pain patients. J Am Coll Cardiol 2004;43:958–65.

112. Jaffe AS, Ravkilde J, Roberts R, et al. It's time for a change to a troponin standard. Circulation 2000;102:1216–20.

113. Apple FS, Wu AH, Mair J, et al. Future biomarkers for detection of ischemia and risk stratification in acute coronary syndrome. Clin Chem 2005;51:810–24.

114. Ewer K, Deeks J, Alvarez L, et al. Comparison of T-cell-based assay with tuberculin skin test for diagnosis of Mycobacterium tuberculosis infection in a school tuberculosis outbreak. Lancet 2003;361:1168–73.

115. Bossuyt PM, Reitsma JB, Bruns DE, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem 2003;49:7–18.

116. Moons KG, Biesheuvel CJ, Grobbee DE. Test research versus diagnostic research. Clin Chem 2004;50:473–6.

117. Lee TH, Goldman L. Evaluation of the patient with acute chest pain. N Engl J Med 2000;342:1187–95.

118. Ankum WM, Van der Veen F, Hamerlynck JV, Lammes FB. Suspected ectopic pregnancy. What to do when human chorionic gonadotropin levels are below the discriminatory zone. J Reprod Med 1995;40:525–8.


Appendix 1

Search terms in databases

MEDLINE
Searched 6 October 2005 via OVID
Date coverage: 1966–September 2005
1. reference.ti.
2. gold.ti.
3. golden.ti.
4. test.ti.
5. standard.ti.
6. 1 or 2 or 3
7. 4 or 5
8. 6 and 7
9. absen$.ti.
10. reference test.ti.
11. 9 and 10
12. 9 and 1
13. 9 and 2
14. 9 and 3
15. 10 and 5
16. 8 or 11 or 12 or 13 or 14 or 15

EMBASE
Searched 6 October 2005 via OVID
Date coverage: 1980–September 2005
1. reference.ti.
2. gold.ti.
3. golden.ti.
4. test.ti.
5. standard.ti.
6. 1 or 2 or 3
7. 4 or 5
8. 6 and 7
9. absen$.ti.
10. reference test.ti.
11. 9 and 10
12. 9 and 1
13. 9 and 2
14. 9 and 3
15. 10 and 5
16. 8 or 11 or 12 or 13 or 14 or 15

Cochrane databases (DARE, CENTRAL, CMR, NHS)
Searched 13 October 2005
1. reference.ti.
2. gold.ti.
3. golden.ti.
4. test.ti.
5. standard.ti.
6. 1 or 2 or 3
7. 4 or 5
8. 6 and 7
9. absen*.ti.
10. reference test.ti.
11. 9 and 10
12. 10 and 1
13. 9 and 2
14. 9 and 3
15. 9 and 5
16. 8 or 11 or 12 or 13 or 14 or 15

PubMed
Searched 6 October 2005 via www.pubmed.gov
Date coverage: 1966–October 2005
((("reference"[ti] OR "gold"[ti] OR "golden"[ti]) AND ("test"[ti] OR "standard"[ti])) OR ((absen*[ti] AND gold*[ti]) OR ("no gold*"[ti]) OR (absen*[ti] AND referen*[ti]) OR (absen*[ti] AND standard[ti]) OR (absen*[ti] AND "reference test"[ti])))

Medion
Searched 24 November 2005 via www.mediondatabase.nl
Filter used: 'Methodological Studies on Systematic Reviews of Diagnostic Tests'
1. MKD (quality assessment)
2. MBD (bias in meta-analysis)
3. MH (heterogeneity)
4. MAD (several/methods)

We assembled a list of topic-specific expertsoutside our research team to review each

method. In addition, several general expertsreviewed the whole report.

Topic-specific experts

‘Impute missing data on reference standard’ and ‘Correct imperfect reference standard’
Professor Aeilko H Zwinderman
Professor in Biostatistics, Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, University of Amsterdam, The Netherlands

Dr Francisca Galindo Garre
Statistician, Psychometrician, Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, University of Amsterdam, The Netherlands

‘Construct reference standard’
Professor Les Irwig
Professor of Epidemiology, Department of Public Health and Community Medicine, University of Sydney, Australia

Professor Karel GM Moons
Professor of Clinical Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center, Utrecht, The Netherlands

Dr Alexandra AH van Abswoude
Methodologist, Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, University of Amsterdam, The Netherlands

‘Validate index test results’
Professor Les Irwig
Professor of Epidemiology, Department of Public Health and Community Medicine, University of Sydney, Australia

General experts reviewing the full report

Professor Les Irwig
Professor of Epidemiology, Department of Public Health and Community Medicine, University of Sydney, Australia

Professor Karel GM Moons
Professor of Clinical Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center, Utrecht, The Netherlands

Professor Aeilko H Zwinderman
Professor of Biostatistics, Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, University of Amsterdam, The Netherlands

Professor Constantine Gatsonis
Professor of Biostatistics, Community Health (Biostatistics) and Applied Mathematics, Brown University, Providence, RI, USA




Volume 1, 1997

No. 1Home parenteral nutrition: a systematicreview.

By Richards DM, Deeks JJ, SheldonTA, Shaffer JL.

No. 2Diagnosis, management and screeningof early localised prostate cancer.

A review by Selley S, Donovan J,Faulkner A, Coast J, Gillatt D.

No. 3The diagnosis, management, treatmentand costs of prostate cancer in Englandand Wales.

A review by Chamberlain J, Melia J,Moss S, Brown J.

No. 4Screening for fragile X syndrome.

A review by Murray J, Cuckle H,Taylor G, Hewison J.

No. 5A review of near patient testing inprimary care.

By Hobbs FDR, Delaney BC,Fitzmaurice DA, Wilson S, Hyde CJ,Thorpe GH, et al.

No. 6Systematic review of outpatient servicesfor chronic pain control.

By McQuay HJ, Moore RA, EcclestonC, Morley S, de C Williams AC.

No. 7Neonatal screening for inborn errors ofmetabolism: cost, yield and outcome.

A review by Pollitt RJ, Green A,McCabe CJ, Booth A, Cooper NJ,Leonard JV, et al.

No. 8Preschool vision screening.

A review by Snowdon SK, Stewart-Brown SL.

No. 9Implications of socio-cultural contextsfor the ethics of clinical trials.

A review by Ashcroft RE, ChadwickDW, Clark SRL, Edwards RHT, Frith L,Hutton JL.

No. 10A critical review of the role of neonatalhearing screening in the detection ofcongenital hearing impairment.

By Davis A, Bamford J, Wilson I,Ramkalawan T, Forshaw M, Wright S.

No. 11Newborn screening for inborn errors ofmetabolism: a systematic review.

By Seymour CA, Thomason MJ,Chalmers RA, Addison GM, Bain MD,Cockburn F, et al.

No. 12Routine preoperative testing: a systematic review of the evidence.

By Munro J, Booth A, Nicholl J.

No. 13Systematic review of the effectiveness of laxatives in the elderly.

By Petticrew M, Watt I, Sheldon T.

No. 14When and how to assess fast-changingtechnologies: a comparative study ofmedical applications of four generictechnologies.

A review by Mowatt G, Bower DJ,Brebner JA, Cairns JA, Grant AM,McKee L.

Volume 2, 1998

No. 1Antenatal screening for Down’ssyndrome.

A review by Wald NJ, Kennard A,Hackshaw A, McGuire A.

No. 2Screening for ovarian cancer: a systematic review.

By Bell R, Petticrew M, Luengo S,Sheldon TA.

No. 3Consensus development methods, andtheir use in clinical guidelinedevelopment.

A review by Murphy MK, Black NA,Lamping DL, McKee CM, SandersonCFB, Askham J, et al.

No. 4A cost–utility analysis of interferon beta for multiple sclerosis.

By Parkin D, McNamee P, Jacoby A,Miller P, Thomas S, Bates D.

No. 5Effectiveness and efficiency of methodsof dialysis therapy for end-stage renaldisease: systematic reviews.

By MacLeod A, Grant A, Donaldson C, Khan I, Campbell M,Daly C, et al.

No. 6Effectiveness of hip prostheses inprimary total hip replacement: a criticalreview of evidence and an economicmodel.

By Faulkner A, Kennedy LG, Baxter K, Donovan J, Wilkinson M,Bevan G.

No. 7Antimicrobial prophylaxis in colorectalsurgery: a systematic review ofrandomised controlled trials.

By Song F, Glenny AM.

No. 8Bone marrow and peripheral bloodstem cell transplantation for malignancy.

A review by Johnson PWM, Simnett SJ,Sweetenham JW, Morgan GJ, Stewart LA.

No. 9Screening for speech and languagedelay: a systematic review of theliterature.

By Law J, Boyle J, Harris F, HarknessA, Nye C.

No. 10Resource allocation for chronic stableangina: a systematic review ofeffectiveness, costs and cost-effectiveness of alternativeinterventions.

By Sculpher MJ, Petticrew M, Kelland JL, Elliott RA, Holdright DR,Buxton MJ.

No. 11Detection, adherence and control ofhypertension for the prevention ofstroke: a systematic review.

By Ebrahim S.

No. 12Postoperative analgesia and vomiting,with special reference to day-casesurgery: a systematic review.

By McQuay HJ, Moore RA.

No. 13Choosing between randomised andnonrandomised studies: a systematicreview.

By Britton A, McKee M, Black N,McPherson K, Sanderson C, Bain C.

No. 14Evaluating patient-based outcomemeasures for use in clinical trials.

A review by Fitzpatrick R, Davey C,Buxton MJ, Jones DR.

Health Technology Assessment reports published to date

No. 15Ethical issues in the design and conductof randomised controlled trials.

A review by Edwards SJL, Lilford RJ,Braunholtz DA, Jackson JC, Hewison J,Thornton J.

No. 16Qualitative research methods in healthtechnology assessment: a review of theliterature.

By Murphy E, Dingwall R, GreatbatchD, Parker S, Watson P.

No. 17The costs and benefits of paramedicskills in pre-hospital trauma care.

By Nicholl J, Hughes S, Dixon S,Turner J, Yates D.

No. 18Systematic review of endoscopicultrasound in gastro-oesophagealcancer.

By Harris KM, Kelly S, Berry E,Hutton J, Roderick P, Cullingworth J, et al.

No. 19Systematic reviews of trials and otherstudies.

By Sutton AJ, Abrams KR, Jones DR,Sheldon TA, Song F.

No. 20Primary total hip replacement surgery: a systematic review of outcomes andmodelling of cost-effectivenessassociated with different prostheses.

A review by Fitzpatrick R, Shortall E,Sculpher M, Murray D, Morris R, LodgeM, et al.

Volume 3, 1999

No. 1Informed decision making: an annotatedbibliography and systematic review.

By Bekker H, Thornton JG, Airey CM, Connelly JB, Hewison J,Robinson MB, et al.

No. 2Handling uncertainty when performingeconomic evaluation of healthcareinterventions.

A review by Briggs AH, Gray AM.

No. 3The role of expectancies in the placeboeffect and their use in the delivery ofhealth care: a systematic review.

By Crow R, Gage H, Hampson S,Hart J, Kimber A, Thomas H.

No. 4A randomised controlled trial ofdifferent approaches to universalantenatal HIV testing: uptake andacceptability. Annex: Antenatal HIVtesting – assessment of a routinevoluntary approach.

By Simpson WM, Johnstone FD, BoydFM, Goldberg DJ, Hart GJ, GormleySM, et al.

No. 5Methods for evaluating area-wide andorganisation-based interventions inhealth and health care: a systematicreview.

By Ukoumunne OC, Gulliford MC,Chinn S, Sterne JAC, Burney PGJ.

No. 6Assessing the costs of healthcaretechnologies in clinical trials.

A review by Johnston K, Buxton MJ,Jones DR, Fitzpatrick R.

No. 7Cooperatives and their primary careemergency centres: organisation andimpact.

By Hallam L, Henthorne K.

No. 8Screening for cystic fibrosis.

A review by Murray J, Cuckle H,Taylor G, Littlewood J, Hewison J.

No. 9A review of the use of health statusmeasures in economic evaluation.

By Brazier J, Deverill M, Green C,Harper R, Booth A.

No. 10Methods for the analysis of quality-of-life and survival data in healthtechnology assessment.

A review by Billingham LJ, AbramsKR, Jones DR.

No. 11Antenatal and neonatalhaemoglobinopathy screening in theUK: review and economic analysis.

By Zeuner D, Ades AE, Karnon J,Brown J, Dezateux C, Anionwu EN.

No. 12Assessing the quality of reports ofrandomised trials: implications for theconduct of meta-analyses.

A review by Moher D, Cook DJ, JadadAR, Tugwell P, Moher M, Jones A, et al.

No. 13‘Early warning systems’ for identifyingnew healthcare technologies.

By Robert G, Stevens A, Gabbay J.

No. 14A systematic review of the role of humanpapillomavirus testing within a cervicalscreening programme.

By Cuzick J, Sasieni P, Davies P,Adams J, Normand C, Frater A, et al.

No. 15Near patient testing in diabetes clinics:appraising the costs and outcomes.

By Grieve R, Beech R, Vincent J,Mazurkiewicz J.

No. 16Positron emission tomography:establishing priorities for healthtechnology assessment.

A review by Robert G, Milne R.

No. 17 (Pt 1)The debridement of chronic wounds: a systematic review.

By Bradley M, Cullum N, Sheldon T.

No. 17 (Pt 2)Systematic reviews of wound caremanagement: (2) Dressings and topicalagents used in the healing of chronicwounds.

By Bradley M, Cullum N, Nelson EA,Petticrew M, Sheldon T, Torgerson D.

No. 18A systematic literature review of spiraland electron beam computedtomography: with particular reference toclinical applications in hepatic lesions,pulmonary embolus and coronary arterydisease.

By Berry E, Kelly S, Hutton J, Harris KM, Roderick P, Boyce JC, et al.

No. 19What role for statins? A review andeconomic model.

By Ebrahim S, Davey Smith G,McCabe C, Payne N, Pickin M, SheldonTA, et al.

No. 20Factors that limit the quality, numberand progress of randomised controlledtrials.

A review by Prescott RJ, Counsell CE,Gillespie WJ, Grant AM, Russell IT,Kiauka S, et al.

No. 21Antimicrobial prophylaxis in total hipreplacement: a systematic review.

By Glenny AM, Song F.

No. 22Health promoting schools and healthpromotion in schools: two systematicreviews.

By Lister-Sharp D, Chapman S,Stewart-Brown S, Sowden A.

No. 23Economic evaluation of a primary care-based education programme for patientswith osteoarthritis of the knee.

A review by Lord J, Victor C,Littlejohns P, Ross FM, Axford JS.

Volume 4, 2000

No. 1The estimation of marginal timepreference in a UK-wide sample(TEMPUS) project.

A review by Cairns JA, van der PolMM.

No. 2Geriatric rehabilitation followingfractures in older people: a systematicreview.

By Cameron I, Crotty M, Currie C,Finnegan T, Gillespie L, Gillespie W, et al.

Health Technology Assessment reports published to date

54

No. 3Screening for sickle cell disease andthalassaemia: a systematic review withsupplementary research.

By Davies SC, Cronin E, Gill M,Greengross P, Hickman M, Normand C.

No. 4Community provision of hearing aidsand related audiology services.

A review by Reeves DJ, Alborz A,Hickson FS, Bamford JM.

No. 5False-negative results in screeningprogrammes: systematic review ofimpact and implications.

By Petticrew MP, Sowden AJ, Lister-Sharp D, Wright K.

No. 6Costs and benefits of communitypostnatal support workers: arandomised controlled trial.

By Morrell CJ, Spiby H, Stewart P,Walters S, Morgan A.

No. 7Implantable contraceptives (subdermalimplants and hormonally impregnatedintrauterine systems) versus other formsof reversible contraceptives: twosystematic reviews to assess relativeeffectiveness, acceptability, tolerabilityand cost-effectiveness.

By French RS, Cowan FM, MansourDJA, Morris S, Procter T, Hughes D, et al.

No. 8An introduction to statistical methodsfor health technology assessment.

A review by White SJ, Ashby D, Brown PJ.

No. 9Disease-modifying drugs for multiplesclerosis: a rapid and systematic review.

By Clegg A, Bryant J, Milne R.

No. 10Publication and related biases.

A review by Song F, Eastwood AJ,Gilbody S, Duley L, Sutton AJ.

No. 11Cost and outcome implications of theorganisation of vascular services.

By Michaels J, Brazier J, PalfreymanS, Shackley P, Slack R.

No. 12Monitoring blood glucose control indiabetes mellitus: a systematic review.

By Coster S, Gulliford MC, Seed PT,Powrie JK, Swaminathan R.

No. 13The effectiveness of domiciliary healthvisiting: a systematic review ofinternational studies and a selective review of the Britishliterature.

By Elkan R, Kendrick D, Hewitt M,Robinson JJA, Tolley K, Blair M, et al.

No. 14The determinants of screening uptakeand interventions for increasing uptake:a systematic review.

By Jepson R, Clegg A, Forbes C,Lewis R, Sowden A, Kleijnen J.

No. 15The effectiveness and cost-effectivenessof prophylactic removal of wisdomteeth.

A rapid review by Song F, O’Meara S,Wilson P, Golder S, Kleijnen J.

No. 16Ultrasound screening in pregnancy: asystematic review of the clinicaleffectiveness, cost-effectiveness andwomen’s views.

By Bricker L, Garcia J, Henderson J,Mugford M, Neilson J, Roberts T, et al.

No. 17A rapid and systematic review of theeffectiveness and cost-effectiveness ofthe taxanes used in the treatment ofadvanced breast and ovarian cancer.

By Lister-Sharp D, McDonagh MS,Khan KS, Kleijnen J.

No. 18Liquid-based cytology in cervicalscreening: a rapid and systematic review.

By Payne N, Chilcott J, McGoogan E.

No. 19Randomised controlled trial of non-directive counselling,cognitive–behaviour therapy and usualgeneral practitioner care in themanagement of depression as well asmixed anxiety and depression inprimary care.

By King M, Sibbald B, Ward E, BowerP, Lloyd M, Gabbay M, et al.

No. 20Routine referral for radiography ofpatients presenting with low back pain:is patients’ outcome influenced by GPs’referral for plain radiography?

By Kerry S, Hilton S, Patel S, DundasD, Rink E, Lord J.

No. 21Systematic reviews of wound caremanagement: (3) antimicrobial agentsfor chronic wounds; (4) diabetic footulceration.

By O’Meara S, Cullum N, Majid M,Sheldon T.

No. 22Using routine data to complement andenhance the results of randomisedcontrolled trials.

By Lewsey JD, Leyland AH, Murray GD, Boddy FA.

No. 23Coronary artery stents in the treatmentof ischaemic heart disease: a rapid andsystematic review.

By Meads C, Cummins C, Jolly K,Stevens A, Burls A, Hyde C.

No. 24Outcome measures for adult criticalcare: a systematic review.

By Hayes JA, Black NA, Jenkinson C, Young JD, Rowan KM,Daly K, et al.

No. 25A systematic review to evaluate theeffectiveness of interventions to promote the initiation of breastfeeding.

By Fairbank L, O’Meara S, RenfrewMJ, Woolridge M, Sowden AJ, Lister-Sharp D.

No. 26Implantable cardioverter defibrillators:arrhythmias. A rapid and systematicreview.

By Parkes J, Bryant J, Milne R.

No. 27Treatments for fatigue in multiplesclerosis: a rapid and systematic review.

By Brañas P, Jordan R, Fry-Smith A,Burls A, Hyde C.

No. 28Early asthma prophylaxis, naturalhistory, skeletal development andeconomy (EASE): a pilot randomisedcontrolled trial.

By Baxter-Jones ADG, Helms PJ,Russell G, Grant A, Ross S, Cairns JA, et al.

No. 29Screening for hypercholesterolaemiaversus case finding for familialhypercholesterolaemia: a systematicreview and cost-effectiveness analysis.

By Marks D, Wonderling D,Thorogood M, Lambert H, HumphriesSE, Neil HAW.

No. 30A rapid and systematic review of theclinical effectiveness and cost-effectiveness of glycoprotein IIb/IIIaantagonists in the medical managementof unstable angina.

By McDonagh MS, Bachmann LM,Golder S, Kleijnen J, ter Riet G.

No. 31A randomised controlled trial ofprehospital intravenous fluidreplacement therapy in serious trauma.

By Turner J, Nicholl J, Webber L,Cox H, Dixon S, Yates D.

No. 32Intrathecal pumps for giving opioids inchronic pain: a systematic review.

By Williams JE, Louw G, Towlerton G.

No. 33Combination therapy (interferon alfaand ribavirin) in the treatment ofchronic hepatitis C: a rapid andsystematic review.

By Shepherd J, Waugh N, Hewitson P.

Health Technology Assessment 2007; Vol. 11: No. 50

55

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

No. 34A systematic review of comparisons ofeffect sizes derived from randomisedand non-randomised studies.

By MacLehose RR, Reeves BC,Harvey IM, Sheldon TA, Russell IT,Black AMS.

No. 35Intravascular ultrasound-guidedinterventions in coronary artery disease:a systematic literature review, withdecision-analytic modelling, of outcomesand cost-effectiveness.

By Berry E, Kelly S, Hutton J,Lindsay HSJ, Blaxill JM, Evans JA, et al.

No. 36A randomised controlled trial toevaluate the effectiveness and cost-effectiveness of counselling patients withchronic depression.

By Simpson S, Corney R, Fitzgerald P,Beecham J.

No. 37Systematic review of treatments foratopic eczema.

By Hoare C, Li Wan Po A, Williams H.

No. 38Bayesian methods in health technologyassessment: a review.

By Spiegelhalter DJ, Myles JP, Jones DR, Abrams KR.

No. 39The management of dyspepsia: asystematic review.

By Delaney B, Moayyedi P, Deeks J,Innes M, Soo S, Barton P, et al.

No. 40A systematic review of treatments forsevere psoriasis.

By Griffiths CEM, Clark CM, ChalmersRJG, Li Wan Po A, Williams HC.

Volume 5, 2001

No. 1Clinical and cost-effectiveness ofdonepezil, rivastigmine andgalantamine for Alzheimer’s disease: arapid and systematic review.

By Clegg A, Bryant J, Nicholson T,McIntyre L, De Broe S, Gerard K, et al.

No. 2The clinical effectiveness and cost-effectiveness of riluzole for motorneurone disease: a rapid and systematicreview.

By Stewart A, Sandercock J, Bryan S,Hyde C, Barton PM, Fry-Smith A, et al.

No. 3Equity and the economic evaluation ofhealthcare.

By Sassi F, Archard L, Le Grand J.

No. 4Quality-of-life measures in chronicdiseases of childhood.

By Eiser C, Morse R.

No. 5Eliciting public preferences forhealthcare: a systematic review oftechniques.

By Ryan M, Scott DA, Reeves C, BateA, van Teijlingen ER, Russell EM, et al.

No. 6General health status measures forpeople with cognitive impairment:learning disability and acquired braininjury.

By Riemsma RP, Forbes CA, Glanville JM, Eastwood AJ, Kleijnen J.

No. 7An assessment of screening strategies forfragile X syndrome in the UK.

By Pembrey ME, Barnicoat AJ,Carmichael B, Bobrow M, Turner G.

No. 8Issues in methodological research:perspectives from researchers andcommissioners.

By Lilford RJ, Richardson A, StevensA, Fitzpatrick R, Edwards S, Rock F, et al.

No. 9Systematic reviews of wound caremanagement: (5) beds; (6) compression;(7) laser therapy, therapeuticultrasound, electrotherapy andelectromagnetic therapy.

By Cullum N, Nelson EA, FlemmingK, Sheldon T.

No. 10Effects of educational and psychosocialinterventions for adolescents withdiabetes mellitus: a systematic review.

By Hampson SE, Skinner TC, Hart J,Storey L, Gage H, Foxcroft D, et al.

No. 11Effectiveness of autologous chondrocytetransplantation for hyaline cartilagedefects in knees: a rapid and systematicreview.

By Jobanputra P, Parry D, Fry-SmithA, Burls A.

No. 12Statistical assessment of the learningcurves of health technologies.

By Ramsay CR, Grant AM, Wallace SA, Garthwaite PH, Monk AF,Russell IT.

No. 13The effectiveness and cost-effectivenessof temozolomide for the treatment ofrecurrent malignant glioma: a rapid andsystematic review.

By Dinnes J, Cave C, Huang S, Major K, Milne R.

No. 14A rapid and systematic review of theclinical effectiveness and cost-effectiveness of debriding agents intreating surgical wounds healing bysecondary intention.

By Lewis R, Whiting P, ter Riet G,O’Meara S, Glanville J.

No. 15Home treatment for mental healthproblems: a systematic review.

By Burns T, Knapp M, Catty J, Healey A, Henderson J, Watt H, et al.

No. 16How to develop cost-consciousguidelines.

By Eccles M, Mason J.

No. 17The role of specialist nurses in multiplesclerosis: a rapid and systematic review.

By De Broe S, Christopher F, Waugh N.

No. 18A rapid and systematic review of theclinical effectiveness and cost-effectiveness of orlistat in themanagement of obesity.

By O’Meara S, Riemsma R, Shirran L, Mather L, ter Riet G.

No. 19The clinical effectiveness and cost-effectiveness of pioglitazone for type 2diabetes mellitus: a rapid and systematicreview.

By Chilcott J, Wight J, Lloyd JonesM, Tappenden P.

No. 20Extended scope of nursing practice: amulticentre randomised controlled trialof appropriately trained nurses andpreregistration house officers in pre-operative assessment in elective generalsurgery.

By Kinley H, Czoski-Murray C,George S, McCabe C, Primrose J, Reilly C, et al.

No. 21Systematic reviews of the effectiveness ofday care for people with severe mentaldisorders: (1) Acute day hospital versusadmission; (2) Vocational rehabilitation;(3) Day hospital versus outpatient care.

By Marshall M, Crowther R, Almaraz-Serrano A, Creed F, Sledge W, Kluiter H, et al.

No. 22The measurement and monitoring ofsurgical adverse events.

By Bruce J, Russell EM, Mollison J,Krukowski ZH.

No. 23Action research: a systematic review andguidance for assessment.

By Waterman H, Tillen D, Dickson R,de Koning K.

No. 24A rapid and systematic review of theclinical effectiveness and cost-effectiveness of gemcitabine for thetreatment of pancreatic cancer.

By Ward S, Morris E, Bansback N,Calvert N, Crellin A, Forman D, et al.

Health Technology Assessment reports published to date

56

No. 25A rapid and systematic review of theevidence for the clinical effectivenessand cost-effectiveness of irinotecan,oxaliplatin and raltitrexed for thetreatment of advanced colorectal cancer.

By Lloyd Jones M, Hummel S,Bansback N, Orr B, Seymour M.

No. 26Comparison of the effectiveness ofinhaler devices in asthma and chronicobstructive airways disease: a systematicreview of the literature.

By Brocklebank D, Ram F, Wright J,Barry P, Cates C, Davies L, et al.

No. 27The cost-effectiveness of magneticresonance imaging for investigation ofthe knee joint.

By Bryan S, Weatherburn G, BungayH, Hatrick C, Salas C, Parry D, et al.

No. 28A rapid and systematic review of theclinical effectiveness and cost-effectiveness of topotecan for ovariancancer.

By Forbes C, Shirran L, Bagnall A-M,Duffy S, ter Riet G.

No. 29Superseded by a report published in alater volume.

No. 30The role of radiography in primary carepatients with low back pain of at least 6weeks duration: a randomised(unblinded) controlled trial.

By Kendrick D, Fielding K, Bentley E,Miller P, Kerslake R, Pringle M.

No. 31Design and use of questionnaires: areview of best practice applicable tosurveys of health service staff andpatients.

By McColl E, Jacoby A, Thomas L,Soutter J, Bamford C, Steen N, et al.

No. 32A rapid and systematic review of theclinical effectiveness and cost-effectiveness of paclitaxel, docetaxel,gemcitabine and vinorelbine in non-small-cell lung cancer.

By Clegg A, Scott DA, Sidhu M,Hewitson P, Waugh N.

No. 33Subgroup analyses in randomisedcontrolled trials: quantifying the risks offalse-positives and false-negatives.

By Brookes ST, Whitley E, Peters TJ, Mulheran PA, Egger M,Davey Smith G.

No. 34Depot antipsychotic medication in thetreatment of patients with schizophrenia:(1) Meta-review; (2) Patient and nurseattitudes.

By David AS, Adams C.

No. 35A systematic review of controlled trialsof the effectiveness and cost-effectiveness of brief psychologicaltreatments for depression.

By Churchill R, Hunot V, Corney R,Knapp M, McGuire H, Tylee A, et al.

No. 36Cost analysis of child healthsurveillance.

By Sanderson D, Wright D, Acton C,Duree D.

Volume 6, 2002

No. 1A study of the methods used to selectreview criteria for clinical audit.

By Hearnshaw H, Harker R, CheaterF, Baker R, Grimshaw G.

No. 2Fludarabine as second-line therapy forB cell chronic lymphocytic leukaemia: a technology assessment.

By Hyde C, Wake B, Bryan S, BartonP, Fry-Smith A, Davenport C, et al.

No. 3Rituximab as third-line treatment forrefractory or recurrent Stage III or IVfollicular non-Hodgkin’s lymphoma: asystematic review and economicevaluation.

By Wake B, Hyde C, Bryan S, BartonP, Song F, Fry-Smith A, et al.

No. 4A systematic review of dischargearrangements for older people.

By Parker SG, Peet SM, McPherson A,Cannaby AM, Baker R, Wilson A, et al.

No. 5The clinical effectiveness and cost-effectiveness of inhaler devices used inthe routine management of chronicasthma in older children: a systematicreview and economic evaluation.

By Peters J, Stevenson M, Beverley C,Lim J, Smith S.

No. 6The clinical effectiveness and cost-effectiveness of sibutramine in themanagement of obesity: a technologyassessment.

By O’Meara S, Riemsma R, ShirranL, Mather L, ter Riet G.

No. 7The cost-effectiveness of magneticresonance angiography for carotidartery stenosis and peripheral vasculardisease: a systematic review.

By Berry E, Kelly S, Westwood ME,Davies LM, Gough MJ, Bamford JM,et al.

No. 8Promoting physical activity in SouthAsian Muslim women through ‘exerciseon prescription’.

By Carroll B, Ali N, Azam N.

No. 9Zanamivir for the treatment of influenzain adults: a systematic review andeconomic evaluation.

By Burls A, Clark W, Stewart T,Preston C, Bryan S, Jefferson T, et al.

No. 10A review of the natural history andepidemiology of multiple sclerosis:implications for resource allocation andhealth economic models.

By Richards RG, Sampson FC, Beard SM, Tappenden P.

No. 11Screening for gestational diabetes: asystematic review and economicevaluation.

By Scott DA, Loveman E, McIntyre L,Waugh N.

No. 12The clinical effectiveness and cost-effectiveness of surgery for people withmorbid obesity: a systematic review andeconomic evaluation.

By Clegg AJ, Colquitt J, Sidhu MK,Royle P, Loveman E, Walker A.

No. 13The clinical effectiveness of trastuzumabfor breast cancer: a systematic review.

By Lewis R, Bagnall A-M, Forbes C,Shirran E, Duffy S, Kleijnen J, et al.

No. 14The clinical effectiveness and cost-effectiveness of vinorelbine for breastcancer: a systematic review andeconomic evaluation.

By Lewis R, Bagnall A-M, King S,Woolacott N, Forbes C, Shirran L, et al.

No. 15A systematic review of the effectivenessand cost-effectiveness of metal-on-metalhip resurfacing arthroplasty fortreatment of hip disease.

By Vale L, Wyness L, McCormack K,McKenzie L, Brazzelli M, Stearns SC.

No. 16The clinical effectiveness and cost-effectiveness of bupropion and nicotinereplacement therapy for smokingcessation: a systematic review andeconomic evaluation.

By Woolacott NF, Jones L, Forbes CA,Mather LC, Sowden AJ, Song FJ, et al.

No. 17A systematic review of effectiveness andeconomic evaluation of new drugtreatments for juvenile idiopathicarthritis: etanercept.

By Cummins C, Connock M, Fry-Smith A, Burls A.

No. 18Clinical effectiveness and cost-effectiveness of growth hormone inchildren: a systematic review andeconomic evaluation.

By Bryant J, Cave C, Mihaylova B,Chase D, McIntyre L, Gerard K, et al.

Health Technology Assessment 2007; Vol. 11: No. 50

57

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

No. 19Clinical effectiveness and cost-effectiveness of growth hormone inadults in relation to impact on quality oflife: a systematic review and economicevaluation.

By Bryant J, Loveman E, Chase D,Mihaylova B, Cave C, Gerard K, et al.

No. 20Clinical medication review by apharmacist of patients on repeatprescriptions in general practice: arandomised controlled trial.

By Zermansky AG, Petty DR, RaynorDK, Lowe CJ, Freementle N, Vail A.

No. 21The effectiveness of infliximab andetanercept for the treatment ofrheumatoid arthritis: a systematic reviewand economic evaluation.

By Jobanputra P, Barton P, Bryan S,Burls A.

No. 22A systematic review and economicevaluation of computerised cognitivebehaviour therapy for depression andanxiety.

By Kaltenthaler E, Shackley P, StevensK, Beverley C, Parry G, Chilcott J.

No. 23A systematic review and economicevaluation of pegylated liposomaldoxorubicin hydrochloride for ovariancancer.

By Forbes C, Wilby J, Richardson G,Sculpher M, Mather L, Reimsma R.

No. 24A systematic review of the effectivenessof interventions based on a stages-of-change approach to promote individualbehaviour change.

By Riemsma RP, Pattenden J, Bridle C,Sowden AJ, Mather L, Watt IS, et al.

No. 25A systematic review update of theclinical effectiveness and cost-effectiveness of glycoprotein IIb/IIIaantagonists.

By Robinson M, Ginnelly L, SculpherM, Jones L, Riemsma R, Palmer S, et al.

No. 26A systematic review of the effectiveness,cost-effectiveness and barriers toimplementation of thrombolytic andneuroprotective therapy for acuteischaemic stroke in the NHS.

By Sandercock P, Berge E, Dennis M,Forbes J, Hand P, Kwan J, et al.

No. 27A randomised controlled crossover trial of nurse practitioner versus doctor-led outpatient care in abronchiectasis clinic.

By Caine N, Sharples LD,Hollingworth W, French J, Keogan M,Exley A, et al.

No. 28Clinical effectiveness and cost –consequences of selective serotoninreuptake inhibitors in the treatment ofsex offenders.

By Adi Y, Ashcroft D, Browne K,Beech A, Fry-Smith A, Hyde C.

No. 29Treatment of established osteoporosis: a systematic review and cost–utilityanalysis.

By Kanis JA, Brazier JE, Stevenson M,Calvert NW, Lloyd Jones M.

No. 30Which anaesthetic agents are cost-effective in day surgery? Literaturereview, national survey of practice andrandomised controlled trial.

By Elliott RA Payne K, Moore JK,Davies LM, Harper NJN, St Leger AS,et al.

No. 31Screening for hepatitis C amonginjecting drug users and ingenitourinary medicine clinics:systematic reviews of effectiveness,modelling study and national survey ofcurrent practice.

By Stein K, Dalziel K, Walker A,McIntyre L, Jenkins B, Horne J, et al.

No. 32The measurement of satisfaction withhealthcare: implications for practicefrom a systematic review of theliterature.

By Crow R, Gage H, Hampson S,Hart J, Kimber A, Storey L, et al.

No. 33The effectiveness and cost-effectivenessof imatinib in chronic myeloidleukaemia: a systematic review.

By Garside R, Round A, Dalziel K,Stein K, Royle R.

No. 34A comparative study of hypertonicsaline, daily and alternate-day rhDNase in children with cystic fibrosis.

By Suri R, Wallis C, Bush A,Thompson S, Normand C, Flather M, et al.

No. 35A systematic review of the costs andeffectiveness of different models ofpaediatric home care.

By Parker G, Bhakta P, Lovett CA,Paisley S, Olsen R, Turner D, et al.

Volume 7, 2003

No. 1How important are comprehensiveliterature searches and the assessment oftrial quality in systematic reviews?Empirical study.

By Egger M, Jüni P, Bartlett C,Holenstein F, Sterne J.

No. 2Systematic review of the effectivenessand cost-effectiveness, and economicevaluation, of home versus hospital orsatellite unit haemodialysis for peoplewith end-stage renal failure.

By Mowatt G, Vale L, Perez J, WynessL, Fraser C, MacLeod A, et al.

No. 3Systematic review and economicevaluation of the effectiveness ofinfliximab for the treatment of Crohn’sdisease.

By Clark W, Raftery J, Barton P, Song F, Fry-Smith A, Burls A.

No. 4A review of the clinical effectiveness andcost-effectiveness of routine anti-Dprophylaxis for pregnant women whoare rhesus negative.

By Chilcott J, Lloyd Jones M, WightJ, Forman K, Wray J, Beverley C, et al.

No. 5Systematic review and evaluation of theuse of tumour markers in paediatriconcology: Ewing’s sarcoma andneuroblastoma.

By Riley RD, Burchill SA, Abrams KR,Heney D, Lambert PC, Jones DR, et al.

No. 6The cost-effectiveness of screening forHelicobacter pylori to reduce mortalityand morbidity from gastric cancer andpeptic ulcer disease: a discrete-eventsimulation model.

By Roderick P, Davies R, Raftery J,Crabbe D, Pearce R, Bhandari P, et al.

No. 7The clinical effectiveness and cost-effectiveness of routine dental checks: asystematic review and economicevaluation.

By Davenport C, Elley K, Salas C,Taylor-Weetman CL, Fry-Smith A, Bryan S, et al.

No. 8A multicentre randomised controlledtrial assessing the costs and benefits ofusing structured information andanalysis of women’s preferences in themanagement of menorrhagia.

By Kennedy ADM, Sculpher MJ,Coulter A, Dwyer N, Rees M, Horsley S, et al.

No. 9Clinical effectiveness and cost–utility ofphotodynamic therapy for wet age-relatedmacular degeneration: a systematic reviewand economic evaluation.

By Meads C, Salas C, Roberts T,Moore D, Fry-Smith A, Hyde C.

No. 10Evaluation of molecular tests forprenatal diagnosis of chromosomeabnormalities.

By Grimshaw GM, Szczepura A,Hultén M, MacDonald F, Nevin NC,Sutton F, et al.

Health Technology Assessment reports published to date

No. 11 First and second trimester antenatal screening for Down’s syndrome: the results of the Serum, Urine and Ultrasound Screening Study (SURUSS).

By Wald NJ, Rodeck C, Hackshaw AK, Walters J, Chitty L, Mackinson AM.

No. 12 The effectiveness and cost-effectiveness of ultrasound locating devices for central venous access: a systematic review and economic evaluation.

By Calvert N, Hind D, McWilliams RG, Thomas SM, Beverley C, Davidson A.

No. 13 A systematic review of atypical antipsychotics in schizophrenia.

By Bagnall A-M, Jones L, Lewis R, Ginnelly L, Glanville J, Torgerson D, et al.

No. 14 Prostate Testing for Cancer and Treatment (ProtecT) feasibility study.

By Donovan J, Hamdy F, Neal D, Peters T, Oliver S, Brindle L, et al.

No. 15 Early thrombolysis for the treatment of acute myocardial infarction: a systematic review and economic evaluation.

By Boland A, Dundar Y, Bagust A, Haycox A, Hill R, Mujica Mota R, et al.

No. 16 Screening for fragile X syndrome: a literature review and modelling.

By Song FJ, Barton P, Sleightholme V, Yao GL, Fry-Smith A.

No. 17 Systematic review of endoscopic sinus surgery for nasal polyps.

By Dalziel K, Stein K, Round A, Garside R, Royle P.

No. 18 Towards efficient guidelines: how to monitor guideline use in primary care.

By Hutchinson A, McIntosh A, Cox S, Gilbert C.

No. 19 Effectiveness and cost-effectiveness of acute hospital-based spinal cord injuries services: systematic review.

By Bagnall A-M, Jones L, Richardson G, Duffy S, Riemsma R.

No. 20 Prioritisation of health technology assessment. The PATHS model: methods and case studies.

By Townsend J, Buxton M, Harper G.

No. 21 Systematic review of the clinical effectiveness and cost-effectiveness of tension-free vaginal tape for treatment of urinary stress incontinence.

By Cody J, Wyness L, Wallace S, Glazener C, Kilonzo M, Stearns S, et al.

No. 22 The clinical and cost-effectiveness of patient education models for diabetes: a systematic review and economic evaluation.

By Loveman E, Cave C, Green C, Royle P, Dunn N, Waugh N.

No. 23 The role of modelling in prioritising and planning clinical trials.

By Chilcott J, Brennan A, Booth A, Karnon J, Tappenden P.

No. 24 Cost–benefit evaluation of routine influenza immunisation in people 65–74 years of age.

By Allsup S, Gosney M, Haycox A, Regan M.

No. 25 The clinical and cost-effectiveness of pulsatile machine perfusion versus cold storage of kidneys for transplantation retrieved from heart-beating and non-heart-beating donors.

By Wight J, Chilcott J, Holmes M, Brewer N.

No. 26 Can randomised trials rely on existing electronic data? A feasibility study to explore the value of routine data in health technology assessment.

By Williams JG, Cheung WY, Cohen DR, Hutchings HA, Longo MF, Russell IT.

No. 27 Evaluating non-randomised intervention studies.

By Deeks JJ, Dinnes J, D’Amico R, Sowden AJ, Sakarovitch C, Song F, et al.

No. 28 A randomised controlled trial to assess the impact of a package comprising a patient-orientated, evidence-based self-help guidebook and patient-centred consultations on disease management and satisfaction in inflammatory bowel disease.

By Kennedy A, Nelson E, Reeves D, Richardson G, Roberts C, Robinson A, et al.

No. 29 The effectiveness of diagnostic tests for the assessment of shoulder pain due to soft tissue disorders: a systematic review.

By Dinnes J, Loveman E, McIntyre L, Waugh N.

No. 30 The value of digital imaging in diabetic retinopathy.

By Sharp PF, Olson J, Strachan F, Hipwell J, Ludbrook A, O’Donnell M, et al.

No. 31 Lowering blood pressure to prevent myocardial infarction and stroke: a new preventive strategy.

By Law M, Wald N, Morris J.

No. 32 Clinical and cost-effectiveness of capecitabine and tegafur with uracil for the treatment of metastatic colorectal cancer: systematic review and economic evaluation.

By Ward S, Kaltenthaler E, Cowan J, Brewer N.

No. 33 Clinical and cost-effectiveness of new and emerging technologies for early localised prostate cancer: a systematic review.

By Hummel S, Paisley S, Morgan A, Currie E, Brewer N.

No. 34 Literature searching for clinical and cost-effectiveness studies used in health technology assessment reports carried out for the National Institute for Clinical Excellence appraisal system.

By Royle P, Waugh N.

No. 35 Systematic review and economic decision modelling for the prevention and treatment of influenza A and B.

By Turner D, Wailoo A, Nicholson K, Cooper N, Sutton A, Abrams K.

No. 36 A randomised controlled trial to evaluate the clinical and cost-effectiveness of Hickman line insertions in adult cancer patients by nurses.

By Boland A, Haycox A, Bagust A, Fitzsimmons L.

No. 37 Redesigning postnatal care: a randomised controlled trial of protocol-based midwifery-led care focused on individual women’s physical and psychological health needs.

By MacArthur C, Winter HR, Bick DE, Lilford RJ, Lancashire RJ, Knowles H, et al.

No. 38 Estimating implied rates of discount in healthcare decision-making.

By West RR, McNabb R, Thompson AGH, Sheldon TA, Grimley Evans J.

Health Technology Assessment 2007; Vol. 11: No. 50

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

No. 39 Systematic review of isolation policies in the hospital management of methicillin-resistant Staphylococcus aureus: a review of the literature with epidemiological and economic modelling.

By Cooper BS, Stone SP, Kibbler CC, Cookson BD, Roberts JA, Medley GF, et al.

No. 40 Treatments for spasticity and pain in multiple sclerosis: a systematic review.

By Beard S, Hunn A, Wight J.

No. 41 The inclusion of reports of randomised trials published in languages other than English in systematic reviews.

By Moher D, Pham B, Lawson ML, Klassen TP.

No. 42 The impact of screening on future health-promoting behaviours and health beliefs: a systematic review.

By Bankhead CR, Brett J, Bukach C, Webster P, Stewart-Brown S, Munafo M, et al.

Volume 8, 2004

No. 1 What is the best imaging strategy for acute stroke?

By Wardlaw JM, Keir SL, Seymour J, Lewis S, Sandercock PAG, Dennis MS, et al.

No. 2 Systematic review and modelling of the investigation of acute and chronic chest pain presenting in primary care.

By Mant J, McManus RJ, Oakes RAL, Delaney BC, Barton PM, Deeks JJ, et al.

No. 3 The effectiveness and cost-effectiveness of microwave and thermal balloon endometrial ablation for heavy menstrual bleeding: a systematic review and economic modelling.

By Garside R, Stein K, Wyatt K, Round A, Price A.

No. 4 A systematic review of the role of bisphosphonates in metastatic disease.

By Ross JR, Saunders Y, Edmonds PM, Patel S, Wonderling D, Normand C, et al.

No. 5 Systematic review of the clinical effectiveness and cost-effectiveness of capecitabine (Xeloda®) for locally advanced and/or metastatic breast cancer.

By Jones L, Hawkins N, Westwood M, Wright K, Richardson G, Riemsma R.

No. 6 Effectiveness and efficiency of guideline dissemination and implementation strategies.

By Grimshaw JM, Thomas RE, MacLennan G, Fraser C, Ramsay CR, Vale L, et al.

No. 7 Clinical effectiveness and costs of the Sugarbaker procedure for the treatment of pseudomyxoma peritonei.

By Bryant J, Clegg AJ, Sidhu MK, Brodin H, Royle P, Davidson P.

No. 8 Psychological treatment for insomnia in the regulation of long-term hypnotic drug use.

By Morgan K, Dixon S, Mathers N, Thompson J, Tomeny M.

No. 9 Improving the evaluation of therapeutic interventions in multiple sclerosis: development of a patient-based measure of outcome.

By Hobart JC, Riazi A, Lamping DL, Fitzpatrick R, Thompson AJ.

No. 10 A systematic review and economic evaluation of magnetic resonance cholangiopancreatography compared with diagnostic endoscopic retrograde cholangiopancreatography.

By Kaltenthaler E, Bravo Vergel Y, Chilcott J, Thomas S, Blakeborough T, Walters SJ, et al.

No. 11 The use of modelling to evaluate new drugs for patients with a chronic condition: the case of antibodies against tumour necrosis factor in rheumatoid arthritis.

By Barton P, Jobanputra P, Wilson J, Bryan S, Burls A.

No. 12 Clinical effectiveness and cost-effectiveness of neonatal screening for inborn errors of metabolism using tandem mass spectrometry: a systematic review.

By Pandor A, Eastham J, Beverley C, Chilcott J, Paisley S.

No. 13 Clinical effectiveness and cost-effectiveness of pioglitazone and rosiglitazone in the treatment of type 2 diabetes: a systematic review and economic evaluation.

By Czoski-Murray C, Warren E, Chilcott J, Beverley C, Psyllaki MA, Cowan J.

No. 14 Routine examination of the newborn: the EMREN study. Evaluation of an extension of the midwife role including a randomised controlled trial of appropriately trained midwives and paediatric senior house officers.

By Townsend J, Wolke D, Hayes J, Davé S, Rogers C, Bloomfield L, et al.

No. 15 Involving consumers in research and development agenda setting for the NHS: developing an evidence-based approach.

By Oliver S, Clarke-Jones L, Rees R, Milne R, Buchanan P, Gabbay J, et al.

No. 16 A multi-centre randomised controlled trial of minimally invasive direct coronary bypass grafting versus percutaneous transluminal coronary angioplasty with stenting for proximal stenosis of the left anterior descending coronary artery.

By Reeves BC, Angelini GD, Bryan AJ, Taylor FC, Cripps T, Spyt TJ, et al.

No. 17 Does early magnetic resonance imaging influence management or improve outcome in patients referred to secondary care with low back pain? A pragmatic randomised controlled trial.

By Gilbert FJ, Grant AM, Gillan MGC, Vale L, Scott NW, Campbell MK, et al.

No. 18 The clinical and cost-effectiveness of anakinra for the treatment of rheumatoid arthritis in adults: a systematic review and economic analysis.

By Clark W, Jobanputra P, Barton P, Burls A.

No. 19 A rapid and systematic review and economic evaluation of the clinical and cost-effectiveness of newer drugs for treatment of mania associated with bipolar affective disorder.

By Bridle C, Palmer S, Bagnall A-M, Darba J, Duffy S, Sculpher M, et al.

No. 20 Liquid-based cytology in cervical screening: an updated rapid and systematic review and economic analysis.

By Karnon J, Peters J, Platt J, Chilcott J, McGoogan E, Brewer N.

No. 21 Systematic review of the long-term effects and economic consequences of treatments for obesity and implications for health improvement.

By Avenell A, Broom J, Brown TJ, Poobalan A, Aucott L, Stearns SC, et al.

No. 22 Autoantibody testing in children with newly diagnosed type 1 diabetes mellitus.

By Dretzke J, Cummins C, Sandercock J, Fry-Smith A, Barrett T, Burls A.

No. 23 Clinical effectiveness and cost-effectiveness of prehospital intravenous fluids in trauma patients.

By Dretzke J, Sandercock J, Bayliss S, Burls A.

No. 24 Newer hypnotic drugs for the short-term management of insomnia: a systematic review and economic evaluation.

By Dündar Y, Boland A, Strobl J, Dodd S, Haycox A, Bagust A, et al.

No. 25 Development and validation of methods for assessing the quality of diagnostic accuracy studies.

By Whiting P, Rutjes AWS, Dinnes J, Reitsma JB, Bossuyt PMM, Kleijnen J.

No. 26 EVALUATE hysterectomy trial: a multicentre randomised trial comparing abdominal, vaginal and laparoscopic methods of hysterectomy.

By Garry R, Fountain J, Brown J, Manca A, Mason S, Sculpher M, et al.

No. 27 Methods for expected value of information analysis in complex health economic models: developments on the health economics of interferon-β and glatiramer acetate for multiple sclerosis.

By Tappenden P, Chilcott JB, Eggington S, Oakley J, McCabe C.

No. 28 Effectiveness and cost-effectiveness of imatinib for first-line treatment of chronic myeloid leukaemia in chronic phase: a systematic review and economic analysis.

By Dalziel K, Round A, Stein K, Garside R, Price A.

No. 29 VenUS I: a randomised controlled trial of two types of bandage for treating venous leg ulcers.

By Iglesias C, Nelson EA, Cullum NA, Torgerson DJ on behalf of the VenUS Team.

No. 30 Systematic review of the effectiveness and cost-effectiveness, and economic evaluation, of myocardial perfusion scintigraphy for the diagnosis and management of angina and myocardial infarction.

By Mowatt G, Vale L, Brazzelli M, Hernandez R, Murray A, Scott N, et al.

No. 31 A pilot study on the use of decision theory and value of information analysis as part of the NHS Health Technology Assessment programme.

By Claxton K, Ginnelly L, Sculpher M, Philips Z, Palmer S.

No. 32 The Social Support and Family Health Study: a randomised controlled trial and economic evaluation of two alternative forms of postnatal support for mothers living in disadvantaged inner-city areas.

By Wiggins M, Oakley A, Roberts I, Turner H, Rajan L, Austerberry H, et al.

No. 33 Psychosocial aspects of genetic screening of pregnant women and newborns: a systematic review.

By Green JM, Hewison J, Bekker HL, Bryant, Cuckle HS.

No. 34 Evaluation of abnormal uterine bleeding: comparison of three outpatient procedures within cohorts defined by age and menopausal status.

By Critchley HOD, Warner P, Lee AJ, Brechin S, Guise J, Graham B.

No. 35 Coronary artery stents: a rapid systematic review and economic evaluation.

By Hill R, Bagust A, Bakhai A, Dickson R, Dündar Y, Haycox A, et al.

No. 36 Review of guidelines for good practice in decision-analytic modelling in health technology assessment.

By Philips Z, Ginnelly L, Sculpher M, Claxton K, Golder S, Riemsma R, et al.

No. 37 Rituximab (MabThera®) for aggressive non-Hodgkin’s lymphoma: systematic review and economic evaluation.

By Knight C, Hind D, Brewer N, Abbott V.

No. 38 Clinical effectiveness and cost-effectiveness of clopidogrel and modified-release dipyridamole in the secondary prevention of occlusive vascular events: a systematic review and economic evaluation.

By Jones L, Griffin S, Palmer S, Main C, Orton V, Sculpher M, et al.

No. 39 Pegylated interferon α-2a and -2b in combination with ribavirin in the treatment of chronic hepatitis C: a systematic review and economic evaluation.

By Shepherd J, Brodin H, Cave C, Waugh N, Price A, Gabbay J.

No. 40 Clopidogrel used in combination with aspirin compared with aspirin alone in the treatment of non-ST-segment-elevation acute coronary syndromes: a systematic review and economic evaluation.

By Main C, Palmer S, Griffin S, Jones L, Orton V, Sculpher M, et al.

No. 41 Provision, uptake and cost of cardiac rehabilitation programmes: improving services to under-represented groups.

By Beswick AD, Rees K, Griebsch I, Taylor FC, Burke M, West RR, et al.

No. 42 Involving South Asian patients in clinical trials.

By Hussain-Gambles M, Leese B, Atkin K, Brown J, Mason S, Tovey P.

No. 43 Clinical and cost-effectiveness of continuous subcutaneous insulin infusion for diabetes.

By Colquitt JL, Green C, Sidhu MK, Hartwell D, Waugh N.

No. 44 Identification and assessment of ongoing trials in health technology assessment reviews.

By Song FJ, Fry-Smith A, Davenport C, Bayliss S, Adi Y, Wilson JS, et al.

No. 45 Systematic review and economic evaluation of a long-acting insulin analogue, insulin glargine.

By Warren E, Weatherley-Jones E, Chilcott J, Beverley C.

No. 46 Supplementation of a home-based exercise programme with a class-based programme for people with osteoarthritis of the knees: a randomised controlled trial and health economic analysis.

By McCarthy CJ, Mills PM, Pullen R, Richardson G, Hawkins N, Roberts CR, et al.

No. 47 Clinical and cost-effectiveness of once-daily versus more frequent use of same potency topical corticosteroids for atopic eczema: a systematic review and economic evaluation.

By Green C, Colquitt JL, Kirby J, Davidson P, Payne E.

No. 48 Acupuncture of chronic headache disorders in primary care: randomised controlled trial and economic analysis.

By Vickers AJ, Rees RW, Zollman CE, McCarney R, Smith CM, Ellis N, et al.

No. 49 Generalisability in economic evaluation studies in healthcare: a review and case studies.

By Sculpher MJ, Pang FS, Manca A, Drummond MF, Golder S, Urdahl H, et al.

No. 50 Virtual outreach: a randomised controlled trial and economic evaluation of joint teleconferenced medical consultations.

By Wallace P, Barber J, Clayton W, Currell R, Fleming K, Garner P, et al.

Volume 9, 2005

No. 1 Randomised controlled multiple treatment comparison to provide a cost-effectiveness rationale for the selection of antimicrobial therapy in acne.

By Ozolins M, Eady EA, Avery A, Cunliffe WJ, O’Neill C, Simpson NB, et al.

No. 2 Do the findings of case series studies vary significantly according to methodological characteristics?

By Dalziel K, Round A, Stein K, Garside R, Castelnuovo E, Payne L.

No. 3 Improving the referral process for familial breast cancer genetic counselling: findings of three randomised controlled trials of two interventions.

By Wilson BJ, Torrance N, Mollison J, Wordsworth S, Gray JR, Haites NE, et al.

No. 4 Randomised evaluation of alternative electrosurgical modalities to treat bladder outflow obstruction in men with benign prostatic hyperplasia.

By Fowler C, McAllister W, Plail R, Karim O, Yang Q.

No. 5 A pragmatic randomised controlled trial of the cost-effectiveness of palliative therapies for patients with inoperable oesophageal cancer.

By Shenfine J, McNamee P, Steen N, Bond J, Griffin SM.

No. 6 Impact of computer-aided detection prompts on the sensitivity and specificity of screening mammography.

By Taylor P, Champness J, Given-Wilson R, Johnston K, Potts H.

No. 7 Issues in data monitoring and interim analysis of trials.

By Grant AM, Altman DG, Babiker AB, Campbell MK, Clemens FJ, Darbyshire JH, et al.

No. 8 Lay public’s understanding of equipoise and randomisation in randomised controlled trials.

By Robinson EJ, Kerr CEP, Stevens AJ, Lilford RJ, Braunholtz DA, Edwards SJ, et al.

No. 9 Clinical and cost-effectiveness of electroconvulsive therapy for depressive illness, schizophrenia, catatonia and mania: systematic reviews and economic modelling studies.

By Greenhalgh J, Knight C, Hind D, Beverley C, Walters S.

No. 10 Measurement of health-related quality of life for people with dementia: development of a new instrument (DEMQOL) and an evaluation of current methodology.

By Smith SC, Lamping DL, Banerjee S, Harwood R, Foley B, Smith P, et al.

No. 11 Clinical effectiveness and cost-effectiveness of drotrecogin alfa (activated) (Xigris®) for the treatment of severe sepsis in adults: a systematic review and economic evaluation.

By Green C, Dinnes J, Takeda A, Shepherd J, Hartwell D, Cave C, et al.

No. 12 A methodological review of how heterogeneity has been examined in systematic reviews of diagnostic test accuracy.

By Dinnes J, Deeks J, Kirby J, Roderick P.

No. 13 Cervical screening programmes: can automation help? Evidence from systematic reviews, an economic analysis and a simulation modelling exercise applied to the UK.

By Willis BH, Barton P, Pearmain P, Bryan S, Hyde C.

No. 14 Laparoscopic surgery for inguinal hernia repair: systematic review of effectiveness and economic evaluation.

By McCormack K, Wake B, Perez J, Fraser C, Cook J, McIntosh E, et al.

No. 15 Clinical effectiveness, tolerability and cost-effectiveness of newer drugs for epilepsy in adults: a systematic review and economic evaluation.

By Wilby J, Kainth A, Hawkins N, Epstein D, McIntosh H, McDaid C, et al.

No. 16 A randomised controlled trial to compare the cost-effectiveness of tricyclic antidepressants, selective serotonin reuptake inhibitors and lofepramine.

By Peveler R, Kendrick T, Buxton M, Longworth L, Baldwin D, Moore M, et al.

No. 17 Clinical effectiveness and cost-effectiveness of immediate angioplasty for acute myocardial infarction: systematic review and economic evaluation.

By Hartwell D, Colquitt J, Loveman E, Clegg AJ, Brodin H, Waugh N, et al.

No. 18 A randomised controlled comparison of alternative strategies in stroke care.

By Kalra L, Evans A, Perez I, Knapp M, Swift C, Donaldson N.

No. 19 The investigation and analysis of critical incidents and adverse events in healthcare.

By Woloshynowych M, Rogers S, Taylor-Adams S, Vincent C.

No. 20 Potential use of routine databases in health technology assessment.

By Raftery J, Roderick P, Stevens A.

No. 21 Clinical and cost-effectiveness of newer immunosuppressive regimens in renal transplantation: a systematic review and modelling study.

By Woodroffe R, Yao GL, Meads C, Bayliss S, Ready A, Raftery J, et al.

No. 22 A systematic review and economic evaluation of alendronate, etidronate, risedronate, raloxifene and teriparatide for the prevention and treatment of postmenopausal osteoporosis.

By Stevenson M, Lloyd Jones M, De Nigris E, Brewer N, Davis S, Oakley J.

No. 23 A systematic review to examine the impact of psycho-educational interventions on health outcomes and costs in adults and children with difficult asthma.

By Smith JR, Mugford M, Holland R, Candy B, Noble MJ, Harrison BDW, et al.

No. 24 An evaluation of the costs, effectiveness and quality of renal replacement therapy provision in renal satellite units in England and Wales.

By Roderick P, Nicholson T, Armitage A, Mehta R, Mullee M, Gerard K, et al.

No. 25 Imatinib for the treatment of patients with unresectable and/or metastatic gastrointestinal stromal tumours: systematic review and economic evaluation.

By Wilson J, Connock M, Song F, Yao G, Fry-Smith A, Raftery J, et al.

No. 26 Indirect comparisons of competing interventions.

By Glenny AM, Altman DG, Song F, Sakarovitch C, Deeks JJ, D’Amico R, et al.

No. 27 Cost-effectiveness of alternative strategies for the initial medical management of non-ST elevation acute coronary syndrome: systematic review and decision-analytical modelling.

By Robinson M, Palmer S, Sculpher M, Philips Z, Ginnelly L, Bowens A, et al.

No. 28 Outcomes of electrically stimulated gracilis neosphincter surgery.

By Tillin T, Chambers M, Feldman R.

No. 29 The effectiveness and cost-effectiveness of pimecrolimus and tacrolimus for atopic eczema: a systematic review and economic evaluation.

By Garside R, Stein K, Castelnuovo E, Pitt M, Ashcroft D, Dimmock P, et al.

No. 30 Systematic review on urine albumin testing for early detection of diabetic complications.

By Newman DJ, Mattock MB, Dawnay ABS, Kerry S, McGuire A, Yaqoob M, et al.

No. 31 Randomised controlled trial of the cost-effectiveness of water-based therapy for lower limb osteoarthritis.

By Cochrane T, Davey RC, Matthes Edwards SM.

No. 32 Longer term clinical and economic benefits of offering acupuncture care to patients with chronic low back pain.

By Thomas KJ, MacPherson H, Ratcliffe J, Thorpe L, Brazier J, Campbell M, et al.

No. 33 Cost-effectiveness and safety of epidural steroids in the management of sciatica.

By Price C, Arden N, Coglan L, Rogers P.

No. 34 The British Rheumatoid Outcome Study Group (BROSG) randomised controlled trial to compare the effectiveness and cost-effectiveness of aggressive versus symptomatic therapy in established rheumatoid arthritis.

By Symmons D, Tricker K, Roberts C, Davies L, Dawes P, Scott DL.

No. 35 Conceptual framework and systematic review of the effects of participants’ and professionals’ preferences in randomised controlled trials.

By King M, Nazareth I, Lampe F, Bower P, Chandler M, Morou M, et al.

No. 36 The clinical and cost-effectiveness of implantable cardioverter defibrillators: a systematic review.

By Bryant J, Brodin H, Loveman E, Payne E, Clegg A.

No. 37 A trial of problem-solving by community mental health nurses for anxiety, depression and life difficulties among general practice patients. The CPN-GP study.

By Kendrick T, Simons L, Mynors-Wallis L, Gray A, Lathlean J, Pickering R, et al.

No. 38 The causes and effects of socio-demographic exclusions from clinical trials.

By Bartlett C, Doyal L, Ebrahim S, Davey P, Bachmann M, Egger M, et al.

No. 39 Is hydrotherapy cost-effective? A randomised controlled trial of combined hydrotherapy programmes compared with physiotherapy land techniques in children with juvenile idiopathic arthritis.

By Epps H, Ginnelly L, Utley M, Southwood T, Gallivan S, Sculpher M, et al.

No. 40 A randomised controlled trial and cost-effectiveness study of systematic screening (targeted and total population screening) versus routine practice for the detection of atrial fibrillation in people aged 65 and over. The SAFE study.

By Hobbs FDR, Fitzmaurice DA, Mant J, Murray E, Jowett S, Bryan S, et al.

No. 41 Displaced intracapsular hip fractures in fit, older people: a randomised comparison of reduction and fixation, bipolar hemiarthroplasty and total hip arthroplasty.

By Keating JF, Grant A, Masson M, Scott NW, Forbes JF.

No. 42 Long-term outcome of cognitive behaviour therapy clinical trials in central Scotland.

By Durham RC, Chambers JA, Power KG, Sharp DM, Macdonald RR, Major KA, et al.

No. 43 The effectiveness and cost-effectiveness of dual-chamber pacemakers compared with single-chamber pacemakers for bradycardia due to atrioventricular block or sick sinus syndrome: systematic review and economic evaluation.

By Castelnuovo E, Stein K, Pitt M, Garside R, Payne E.

No. 44 Newborn screening for congenital heart defects: a systematic review and cost-effectiveness analysis.

By Knowles R, Griebsch I, Dezateux C, Brown J, Bull C, Wren C.

No. 45 The clinical and cost-effectiveness of left ventricular assist devices for end-stage heart failure: a systematic review and economic evaluation.

By Clegg AJ, Scott DA, Loveman E, Colquitt J, Hutchinson J, Royle P, et al.

No. 46 The effectiveness of the Heidelberg Retina Tomograph and laser diagnostic glaucoma scanning system (GDx) in detecting and monitoring glaucoma.

By Kwartz AJ, Henson DB, Harper RA, Spencer AF, McLeod D.

No. 47 Clinical and cost-effectiveness of autologous chondrocyte implantation for cartilage defects in knee joints: systematic review and economic evaluation.

By Clar C, Cummins E, McIntyre L, Thomas S, Lamb J, Bain L, et al.

No. 48 Systematic review of effectiveness of different treatments for childhood retinoblastoma.

By McDaid C, Hartley S, Bagnall A-M, Ritchie G, Light K, Riemsma R.

No. 49 Towards evidence-based guidelines for the prevention of venous thromboembolism: systematic reviews of mechanical methods, oral anticoagulation, dextran and regional anaesthesia as thromboprophylaxis.

By Roderick P, Ferris G, Wilson K, Halls H, Jackson D, Collins R, et al.

No. 50 The effectiveness and cost-effectiveness of parent training/education programmes for the treatment of conduct disorder, including oppositional defiant disorder, in children.

By Dretzke J, Frew E, Davenport C, Barlow J, Stewart-Brown S, Sandercock J, et al.

Volume 10, 2006

No. 1 The clinical and cost-effectiveness of donepezil, rivastigmine, galantamine and memantine for Alzheimer’s disease.

By Loveman E, Green C, Kirby J, Takeda A, Picot J, Payne E, et al.

No. 2 FOOD: a multicentre randomised trial evaluating feeding policies in patients admitted to hospital with a recent stroke.

By Dennis M, Lewis S, Cranswick G, Forbes J.

No. 3 The clinical effectiveness and cost-effectiveness of computed tomography screening for lung cancer: systematic reviews.

By Black C, Bagust A, Boland A, Walker S, McLeod C, De Verteuil R, et al.

No. 4 A systematic review of the effectiveness and cost-effectiveness of neuroimaging assessments used to visualise the seizure focus in people with refractory epilepsy being considered for surgery.

By Whiting P, Gupta R, Burch J, Mujica Mota RE, Wright K, Marson A, et al.

No. 5 Comparison of conference abstracts and presentations with full-text articles in the health technology assessments of rapidly evolving technologies.

By Dundar Y, Dodd S, Dickson R, Walley T, Haycox A, Williamson PR.

No. 6 Systematic review and evaluation of methods of assessing urinary incontinence.

By Martin JL, Williams KS, Abrams KR, Turner DA, Sutton AJ, Chapple C, et al.

No. 7 The clinical effectiveness and cost-effectiveness of newer drugs for children with epilepsy. A systematic review.

By Connock M, Frew E, Evans B-W, Bryan S, Cummins C, Fry-Smith A, et al.

No. 8 Surveillance of Barrett’s oesophagus: exploring the uncertainty through systematic review, expert workshop and economic modelling.

By Garside R, Pitt M, Somerville M, Stein K, Price A, Gilbert N.

No. 9 Topotecan, pegylated liposomal doxorubicin hydrochloride and paclitaxel for second-line or subsequent treatment of advanced ovarian cancer: a systematic review and economic evaluation.

By Main C, Bojke L, Griffin S, Norman G, Barbieri M, Mather L, et al.

No. 10 Evaluation of molecular techniques in prediction and diagnosis of cytomegalovirus disease in immunocompromised patients.

By Szczepura A, Westmoreland D, Vinogradova Y, Fox J, Clark M.

No. 11 Screening for thrombophilia in high-risk situations: systematic review and cost-effectiveness analysis. The Thrombosis: Risk and Economic Assessment of Thrombophilia Screening (TREATS) study.

By Wu O, Robertson L, Twaddle S, Lowe GDO, Clark P, Greaves M, et al.

No. 12 A series of systematic reviews to inform a decision analysis for sampling and treating infected diabetic foot ulcers.

By Nelson EA, O’Meara S, Craig D, Iglesias C, Golder S, Dalton J, et al.

No. 13 Randomised clinical trial, observational study and assessment of cost-effectiveness of the treatment of varicose veins (REACTIV trial).

By Michaels JA, Campbell WB, Brazier JE, MacIntyre JB, Palfreyman SJ, Ratcliffe J, et al.

No. 14 The cost-effectiveness of screening for oral cancer in primary care.

By Speight PM, Palmer S, Moles DR, Downer MC, Smith DH, Henriksson M, et al.

No. 15 Measurement of the clinical and cost-effectiveness of non-invasive diagnostic testing strategies for deep vein thrombosis.

By Goodacre S, Sampson F, Stevenson M, Wailoo A, Sutton A, Thomas S, et al.

No. 16 Systematic review of the effectiveness and cost-effectiveness of HealOzone® for the treatment of occlusal pit/fissure caries and root caries.

By Brazzelli M, McKenzie L, Fielding S, Fraser C, Clarkson J, Kilonzo M, et al.

No. 17 Randomised controlled trials of conventional antipsychotic versus new atypical drugs, and new atypical drugs versus clozapine, in people with schizophrenia responding poorly to, or intolerant of, current drug treatment.

By Lewis SW, Davies L, Jones PB, Barnes TRE, Murray RM, Kerwin R, et al.

No. 18 Diagnostic tests and algorithms used in the investigation of haematuria: systematic reviews and economic evaluation.

By Rodgers M, Nixon J, Hempel S, Aho T, Kelly J, Neal D, et al.

No. 19 Cognitive behavioural therapy in addition to antispasmodic therapy for irritable bowel syndrome in primary care: randomised controlled trial.

By Kennedy TM, Chalder T, McCrone P, Darnley S, Knapp M, Jones RH, et al.

No. 20 A systematic review of the clinical effectiveness and cost-effectiveness of enzyme replacement therapies for Fabry’s disease and mucopolysaccharidosis type 1.

By Connock M, Juarez-Garcia A, Frew E, Mans A, Dretzke J, Fry-Smith A, et al.

No. 21 Health benefits of antiviral therapy for mild chronic hepatitis C: randomised controlled trial and economic evaluation.

By Wright M, Grieve R, Roberts J, Main J, Thomas HC on behalf of the UK Mild Hepatitis C Trial Investigators.

No. 22 Pressure relieving support surfaces: a randomised evaluation.

By Nixon J, Nelson EA, Cranny G, Iglesias CP, Hawkins K, Cullum NA, et al.

No. 23 A systematic review and economic model of the effectiveness and cost-effectiveness of methylphenidate, dexamfetamine and atomoxetine for the treatment of attention deficit hyperactivity disorder in children and adolescents.

By King S, Griffin S, Hodges Z, Weatherly H, Asseburg C, Richardson G, et al.

No. 24 The clinical effectiveness and cost-effectiveness of enzyme replacement therapy for Gaucher’s disease: a systematic review.

By Connock M, Burls A, Frew E, Fry-Smith A, Juarez-Garcia A, McCabe C, et al.

No. 25 Effectiveness and cost-effectiveness of salicylic acid and cryotherapy for cutaneous warts. An economic decision model.

By Thomas KS, Keogh-Brown MR, Chalmers JR, Fordham RJ, Holland RC, Armstrong SJ, et al.

No. 26 A systematic literature review of the effectiveness of non-pharmacological interventions to prevent wandering in dementia and evaluation of the ethical implications and acceptability of their use.

By Robinson L, Hutchings D, Corner L, Beyer F, Dickinson H, Vanoli A, et al.

No. 27 A review of the evidence on the effects and costs of implantable cardioverter defibrillator therapy in different patient groups, and modelling of cost-effectiveness and cost–utility for these groups in a UK context.

By Buxton M, Caine N, Chase D, Connelly D, Grace A, Jackson C, et al.

Health Technology Assessment reports published to date


No. 28 Adefovir dipivoxil and pegylated interferon alfa-2a for the treatment of chronic hepatitis B: a systematic review and economic evaluation.

By Shepherd J, Jones J, Takeda A, Davidson P, Price A.

No. 29 An evaluation of the clinical and cost-effectiveness of pulmonary artery catheters in patient management in intensive care: a systematic review and a randomised controlled trial.

By Harvey S, Stevens K, Harrison D, Young D, Brampton W, McCabe C, et al.

No. 30 Accurate, practical and cost-effective assessment of carotid stenosis in the UK.

By Wardlaw JM, Chappell FM, Stevenson M, De Nigris E, Thomas S, Gillard J, et al.

No. 31 Etanercept and infliximab for the treatment of psoriatic arthritis: a systematic review and economic evaluation.

By Woolacott N, Bravo Vergel Y, Hawkins N, Kainth A, Khadjesari Z, Misso K, et al.

No. 32 The cost-effectiveness of testing for hepatitis C in former injecting drug users.

By Castelnuovo E, Thompson-Coon J, Pitt M, Cramp M, Siebert U, Price A, et al.

No. 33 Computerised cognitive behaviour therapy for depression and anxiety update: a systematic review and economic evaluation.

By Kaltenthaler E, Brazier J, De Nigris E, Tumur I, Ferriter M, Beverley C, et al.

No. 34 Cost-effectiveness of using prognostic information to select women with breast cancer for adjuvant systemic therapy.

By Williams C, Brunskill S, Altman D, Briggs A, Campbell H, Clarke M, et al.

No. 35 Psychological therapies including dialectical behaviour therapy for borderline personality disorder: a systematic review and preliminary economic evaluation.

By Brazier J, Tumur I, Holmes M, Ferriter M, Parry G, Dent-Brown K, et al.

No. 36 Clinical effectiveness and cost-effectiveness of tests for the diagnosis and investigation of urinary tract infection in children: a systematic review and economic model.

By Whiting P, Westwood M, Bojke L, Palmer S, Richardson G, Cooper J, et al.

No. 37 Cognitive behavioural therapy in chronic fatigue syndrome: a randomised controlled trial of an outpatient group programme.

By O’Dowd H, Gladwell P, Rogers CA, Hollinghurst S, Gregory A.

No. 38 A comparison of the cost-effectiveness of five strategies for the prevention of non-steroidal anti-inflammatory drug-induced gastrointestinal toxicity: a systematic review with economic modelling.

By Brown TJ, Hooper L, Elliott RA, Payne K, Webb R, Roberts C, et al.

No. 39 The effectiveness and cost-effectiveness of computed tomography screening for coronary artery disease: systematic review.

By Waugh N, Black C, Walker S, McIntyre L, Cummins E, Hillis G.

No. 40 What are the clinical outcome and cost-effectiveness of endoscopy undertaken by nurses when compared with doctors? A Multi-Institution Nurse Endoscopy Trial (MINuET).

By Williams J, Russell I, Durai D, Cheung W-Y, Farrin A, Bloor K, et al.

No. 41 The clinical and cost-effectiveness of oxaliplatin and capecitabine for the adjuvant treatment of colon cancer: systematic review and economic evaluation.

By Pandor A, Eggington S, Paisley S, Tappenden P, Sutcliffe P.

No. 42 A systematic review of the effectiveness of adalimumab, etanercept and infliximab for the treatment of rheumatoid arthritis in adults and an economic evaluation of their cost-effectiveness.

By Chen Y-F, Jobanputra P, Barton P, Jowett S, Bryan S, Clark W, et al.

No. 43 Telemedicine in dermatology: a randomised controlled trial.

By Bowns IR, Collins K, Walters SJ, McDonagh AJG.

No. 44 Cost-effectiveness of cell salvage and alternative methods of minimising perioperative allogeneic blood transfusion: a systematic review and economic model.

By Davies L, Brown TJ, Haynes S, Payne K, Elliott RA, McCollum C.

No. 45 Clinical effectiveness and cost-effectiveness of laparoscopic surgery for colorectal cancer: systematic reviews and economic evaluation.

By Murray A, Lourenco T, de Verteuil R, Hernandez R, Fraser C, McKinley A, et al.

No. 46 Etanercept and efalizumab for the treatment of psoriasis: a systematic review.

By Woolacott N, Hawkins N, Mason A, Kainth A, Khadjesari Z, Bravo Vergel Y, et al.

No. 47 Systematic reviews of clinical decision tools for acute abdominal pain.

By Liu JLY, Wyatt JC, Deeks JJ, Clamp S, Keen J, Verde P, et al.

No. 48 Evaluation of the ventricular assist device programme in the UK.

By Sharples L, Buxton M, Caine N, Cafferty F, Demiris N, Dyer M, et al.

No. 49 A systematic review and economic model of the clinical and cost-effectiveness of immunosuppressive therapy for renal transplantation in children.

By Yao G, Albon E, Adi Y, Milford D, Bayliss S, Ready A, et al.

No. 50 Amniocentesis results: investigation of anxiety. The ARIA trial.

By Hewison J, Nixon J, Fountain J, Cocks K, Jones C, Mason G, et al.

Volume 11, 2007

No. 1 Pemetrexed disodium for the treatment of malignant pleural mesothelioma: a systematic review and economic evaluation.

By Dundar Y, Bagust A, Dickson R, Dodd S, Green J, Haycox A, et al.

No. 2 A systematic review and economic model of the clinical effectiveness and cost-effectiveness of docetaxel in combination with prednisone or prednisolone for the treatment of hormone-refractory metastatic prostate cancer.

By Collins R, Fenwick E, Trowman R, Perard R, Norman G, Light K, et al.

No. 3 A systematic review of rapid diagnostic tests for the detection of tuberculosis infection.

By Dinnes J, Deeks J, Kunst H, Gibson A, Cummins E, Waugh N, et al.

No. 4 The clinical effectiveness and cost-effectiveness of strontium ranelate for the prevention of osteoporotic fragility fractures in postmenopausal women.

By Stevenson M, Davis S, Lloyd-Jones M, Beverley C.

Health Technology Assessment 2007; Vol. 11: No. 50


© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

No. 5 A systematic review of quantitative and qualitative research on the role and effectiveness of written information available to patients about individual medicines.

By Raynor DK, Blenkinsopp A, Knapp P, Grime J, Nicolson DJ, Pollock K, et al.

No. 6 Oral naltrexone as a treatment for relapse prevention in formerly opioid-dependent drug users: a systematic review and economic evaluation.

By Adi Y, Juarez-Garcia A, Wang D, Jowett S, Frew E, Day E, et al.

No. 7 Glucocorticoid-induced osteoporosis: a systematic review and cost–utility analysis.

By Kanis JA, Stevenson M, McCloskey EV, Davis S, Lloyd-Jones M.

No. 8 Epidemiological, social, diagnostic and economic evaluation of population screening for genital chlamydial infection.

By Low N, McCarthy A, Macleod J, Salisbury C, Campbell R, Roberts TE, et al.

No. 9 Methadone and buprenorphine for the management of opioid dependence: a systematic review and economic evaluation.

By Connock M, Juarez-Garcia A, Jowett S, Frew E, Liu Z, Taylor RJ, et al.

No. 10 Exercise Evaluation Randomised Trial (EXERT): a randomised trial comparing GP referral for leisure centre-based exercise, community-based walking and advice only.

By Isaacs AJ, Critchley JA, See Tai S, Buckingham K, Westley D, Harridge SDR, et al.

No. 11 Interferon alfa (pegylated and non-pegylated) and ribavirin for the treatment of mild chronic hepatitis C: a systematic review and economic evaluation.

By Shepherd J, Jones J, Hartwell D, Davidson P, Price A, Waugh N.

No. 12 Systematic review and economic evaluation of bevacizumab and cetuximab for the treatment of metastatic colorectal cancer.

By Tappenden P, Jones R, Paisley S, Carroll C.

No. 13 A systematic review and economic evaluation of epoetin alfa, epoetin beta and darbepoetin alfa in anaemia associated with cancer, especially that attributable to cancer treatment.

By Wilson J, Yao GL, Raftery J, Bohlius J, Brunskill S, Sandercock J, et al.

No. 14 A systematic review and economic evaluation of statins for the prevention of coronary events.

By Ward S, Lloyd Jones M, Pandor A, Holmes M, Ara R, Ryan A, et al.

No. 15 A systematic review of the effectiveness and cost-effectiveness of different models of community-based respite care for frail older people and their carers.

By Mason A, Weatherly H, Spilsbury K, Arksey H, Golder S, Adamson J, et al.

No. 16 Additional therapy for young children with spastic cerebral palsy: a randomised controlled trial.

By Weindling AM, Cunningham CC, Glenn SM, Edwards RT, Reeves DJ.

No. 17 Screening for type 2 diabetes: literature review and economic modelling.

By Waugh N, Scotland G, McNamee P, Gillett M, Brennan A, Goyder E, et al.

No. 18 The effectiveness and cost-effectiveness of cinacalcet for secondary hyperparathyroidism in end-stage renal disease patients on dialysis: a systematic review and economic evaluation.

By Garside R, Pitt M, Anderson R, Mealing S, Roome C, Snaith A, et al.

No. 19 The clinical effectiveness and cost-effectiveness of gemcitabine for metastatic breast cancer: a systematic review and economic evaluation.

By Takeda AL, Jones J, Loveman E, Tan SC, Clegg AJ.

No. 20 A systematic review of duplex ultrasound, magnetic resonance angiography and computed tomography angiography for the diagnosis and assessment of symptomatic, lower limb peripheral arterial disease.

By Collins R, Cranny G, Burch J, Aguiar-Ibáñez R, Craig D, Wright K, et al.

No. 21 The clinical effectiveness and cost-effectiveness of treatments for children with idiopathic steroid-resistant nephrotic syndrome: a systematic review.

By Colquitt JL, Kirby J, Green C, Cooper K, Trompeter RS.

No. 22 A systematic review of the routine monitoring of growth in children of primary school age to identify growth-related conditions.

By Fayter D, Nixon J, Hartley S, Rithalia A, Butler G, Rudolf M, et al.

No. 23 Systematic review of the effectiveness of preventing and treating Staphylococcus aureus carriage in reducing peritoneal catheter-related infections.

By McCormack K, Rabindranath K, Kilonzo M, Vale L, Fraser C, McIntyre L, et al.

No. 24 The clinical effectiveness and cost of repetitive transcranial magnetic stimulation versus electroconvulsive therapy in severe depression: a multicentre pragmatic randomised controlled trial and economic analysis.

By McLoughlin DM, Mogg A, Eranti S, Pluck G, Purvis R, Edwards D, et al.

No. 25 A randomised controlled trial and economic evaluation of direct versus indirect and individual versus group modes of speech and language therapy for children with primary language impairment.

By Boyle J, McCartney E, Forbes J, O’Hare A.

No. 26 Hormonal therapies for early breast cancer: systematic review and economic evaluation.

By Hind D, Ward S, De Nigris E, Simpson E, Carroll C, Wyld L.

No. 27 Cardioprotection against the toxic effects of anthracyclines given to children with cancer: a systematic review.

By Bryant J, Picot J, Levitt G, Sullivan I, Baxter L, Clegg A.

No. 28 Adalimumab, etanercept and infliximab for the treatment of ankylosing spondylitis: a systematic review and economic evaluation.

By McLeod C, Bagust A, Boland A, Dagenais P, Dickson R, Dundar Y, et al.


No. 29 Prenatal screening and treatment strategies to prevent group B streptococcal and other bacterial infections in early infancy: cost-effectiveness and expected value of information analyses.

By Colbourn T, Asseburg C, Bojke L, Philips Z, Claxton K, Ades AE, et al.

No. 30 Clinical effectiveness and cost-effectiveness of bone morphogenetic proteins in the non-healing of fractures and spinal fusion: a systematic review.

By Garrison KR, Donell S, Ryder J, Shemilt I, Mugford M, Harvey I, et al.

No. 31 A randomised controlled trial of postoperative radiotherapy following breast-conserving surgery in a minimum-risk older population. The PRIME trial.

By Prescott RJ, Kunkler IH, Williams LJ, King CC, Jack W, van der Pol M, et al.

No. 32 Current practice, accuracy, effectiveness and cost-effectiveness of the school entry hearing screen.

By Bamford J, Fortnum H, Bristow K, Smith J, Vamvakas G, Davies L, et al.

No. 33 The clinical effectiveness and cost-effectiveness of inhaled insulin in diabetes mellitus: a systematic review and economic evaluation.

By Black C, Cummins E, Royle P, Philip S, Waugh N.

No. 34 Surveillance of cirrhosis for hepatocellular carcinoma: systematic review and economic analysis.

By Thompson Coon J, Rogers G, Hewson P, Wright D, Anderson R, Cramp M, et al.

No. 35 The Birmingham Rehabilitation Uptake Maximisation Study (BRUM). Home-based compared with hospital-based cardiac rehabilitation in a multi-ethnic population: cost-effectiveness and patient adherence.

By Jolly K, Taylor R, Lip GYH, Greenfield S, Raftery J, Mant J, et al.

No. 36 A systematic review of the clinical, public health and cost-effectiveness of rapid diagnostic tests for the detection and identification of bacterial intestinal pathogens in faeces and food.

By Abubakar I, Irvine L, Aldus CF, Wyatt GM, Fordham R, Schelenz S, et al.

No. 37 A randomised controlled trial examining the longer-term outcomes of standard versus new antiepileptic drugs. The SANAD trial.

By Marson AG, Appleton R, Baker GA, Chadwick DW, Doughty J, Eaton B, et al.

No. 38 Clinical effectiveness and cost-effectiveness of different models of managing long-term oral anticoagulation therapy: a systematic review and economic modelling.

By Connock M, Stevens C, Fry-Smith A, Jowett S, Fitzmaurice D, Moore D, et al.

No. 39 A systematic review and economic model of the clinical effectiveness and cost-effectiveness of interventions for preventing relapse in people with bipolar disorder.

By Soares-Weiser K, Bravo Vergel Y, Beynon S, Dunn G, Barbieri M, Duffy S, et al.

No. 40 Taxanes for the adjuvant treatment of early breast cancer: systematic review and economic evaluation.

By Ward S, Simpson E, Davis S, Hind D, Rees A, Wilkinson A.

No. 41 The clinical effectiveness and cost-effectiveness of screening for open angle glaucoma: a systematic review and economic evaluation.

By Burr JM, Mowatt G, Hernández R, Siddiqui MAR, Cook J, Lourenco T, et al.

No. 42 Acceptability, benefit and costs of early screening for hearing disability: a study of potential screening tests and models.

By Davis A, Smith P, Ferguson M, Stephens D, Gianopoulos I.

No. 43 Contamination in trials of educational interventions.

By Keogh-Brown MR, Bachmann MO, Shepstone L, Hewitt C, Howe A, Ramsay CR, et al.

No. 44 Overview of the clinical effectiveness of positron emission tomography imaging in selected cancers.

By Facey K, Bradbury I, Laking G, Payne E.

No. 45 The effectiveness and cost-effectiveness of carmustine implants and temozolomide for the treatment of newly diagnosed high-grade glioma: a systematic review and economic evaluation.

By Garside R, Pitt M, Anderson R, Rogers G, Dyer M, Mealing S, et al.

No. 46 Drug-eluting stents: a systematic review and economic evaluation.

By Hill RA, Boland A, Dickson R, Dündar Y, Haycox A, McLeod C, et al.

No. 47 The clinical effectiveness and cost-effectiveness of cardiac resynchronisation (biventricular pacing) for heart failure: systematic review and economic model.

By Fox M, Mealing S, Anderson R, Dean J, Stein K, Price A, et al.

No. 48 Recruitment to randomised trials: strategies for trial enrolment and participation study. The STEPS study.

By Campbell MK, Snowdon C, Francis D, Elbourne D, McDonald AM, Knight R, et al.

No. 49 Cost-effectiveness of functional cardiac testing in the diagnosis and management of coronary artery disease: a randomised controlled trial. The CECaT trial.

By Sharples L, Hughes V, Crean A, Dyer M, Buxton M, Goldsmith K, et al.

No. 50 Evaluation of diagnostic tests when there is no gold standard. A review of methods.

By Rutjes AWS, Reitsma JB, Coomarasamy A, Khan KS, Bossuyt PMM.


Health Technology Assessment Programme

Prioritisation Strategy Group Members

Chair, Professor Tom Walley, Director, NHS HTA Programme, Department of Pharmacology & Therapeutics, University of Liverpool

Professor Bruce Campbell, Consultant Vascular & General Surgeon, Royal Devon & Exeter Hospital

Professor Robin E Ferner, Consultant Physician and Director, West Midlands Centre for Adverse Drug Reactions, City Hospital NHS Trust, Birmingham

Dr Edmund Jessop, Medical Adviser, National Specialist Commissioning Advisory Group (NSCAG), Department of Health, London

Professor Jon Nicholl, Director, Medical Care Research Unit, University of Sheffield, School of Health and Related Research

Dr Ron Zimmern, Director, Public Health Genetics Unit, Strangeways Research Laboratories, Cambridge

Director, Professor Tom Walley, Director, NHS HTA Programme, Department of Pharmacology & Therapeutics, University of Liverpool

Deputy Director, Professor Jon Nicholl, Director, Medical Care Research Unit, University of Sheffield, School of Health and Related Research

HTA Commissioning Board Members

Programme Director, Professor Tom Walley, Director, NHS HTA Programme, Department of Pharmacology & Therapeutics, University of Liverpool

Chair, Professor Jon Nicholl, Director, Medical Care Research Unit, University of Sheffield, School of Health and Related Research

Deputy Chair, Dr Andrew Farmer, University Lecturer in General Practice, Department of Primary Health Care, University of Oxford

Dr Jeffrey Aronson, Reader in Clinical Pharmacology, Department of Clinical Pharmacology, Radcliffe Infirmary, Oxford

Professor Deborah Ashby, Professor of Medical Statistics, Department of Environmental and Preventative Medicine, Queen Mary University of London

Professor Ann Bowling, Professor of Health Services Research, Primary Care and Population Studies, University College London

Professor John Cairns, Professor of Health Economics, Public Health Policy, London School of Hygiene and Tropical Medicine, London

Professor Nicky Cullum, Director of Centre for Evidence Based Nursing, Department of Health Sciences, University of York

Professor Jon Deeks, Professor of Health Statistics, University of Birmingham

Professor Jenny Donovan, Professor of Social Medicine, Department of Social Medicine, University of Bristol

Professor Freddie Hamdy, Professor of Urology, University of Sheffield

Professor Allan House, Professor of Liaison Psychiatry, University of Leeds

Professor Sallie Lamb, Director, Warwick Clinical Trials Unit, University of Warwick

Professor Stuart Logan, Director of Health & Social Care Research, The Peninsula Medical School, Universities of Exeter & Plymouth

Professor Miranda Mugford, Professor of Health Economics, University of East Anglia

Dr Linda Patterson, Consultant Physician, Department of Medicine, Burnley General Hospital

Professor Ian Roberts, Professor of Epidemiology & Public Health, Intervention Research Unit, London School of Hygiene and Tropical Medicine

Professor Mark Sculpher, Professor of Health Economics, Centre for Health Economics, Institute for Research in the Social Services, University of York

Professor Kate Thomas, Professor of Complementary and Alternative Medicine, University of Leeds

Professor David John Torgerson, Director of York Trial Unit, Department of Health Sciences, University of York

Professor Hywel Williams, Professor of Dermato-Epidemiology, University of Nottingham

Current and past membership details of all HTA ‘committees’ are available from the HTA website (www.hta.ac.uk)


Diagnostic Technologies & Screening Panel Members

Chair, Dr Ron Zimmern, Director of the Public Health Genetics Unit, Strangeways Research Laboratories, Cambridge

Ms Norma Armston, Freelance Consumer Advocate, Bolton

Professor Max Bachmann, Professor of Health Care Interfaces, Department of Health Policy and Practice, University of East Anglia

Professor Rudy Bilous, Professor of Clinical Medicine & Consultant Physician, The Academic Centre, South Tees Hospitals NHS Trust

Ms Dea Birkett, Service User Representative, London

Dr Paul Cockcroft, Consultant Medical Microbiologist and Clinical Director of Pathology, Department of Clinical Microbiology, St Mary's Hospital, Portsmouth

Professor Adrian K Dixon, Professor of Radiology, University Department of Radiology, University of Cambridge Clinical School

Dr David Elliman, Consultant in Community Child Health, Islington PCT & Great Ormond Street Hospital, London

Professor Glyn Elwyn, Research Chair, Centre for Health Sciences Research, Cardiff University, Department of General Practice, Cardiff

Professor Paul Glasziou, Director, Centre for Evidence-Based Practice, University of Oxford

Dr Jennifer J Kurinczuk, Consultant Clinical Epidemiologist, National Perinatal Epidemiology Unit, Oxford

Dr Susanne M Ludgate, Clinical Director, Medicines & Healthcare Products Regulatory Agency, London

Mr Stephen Pilling, Director, Centre for Outcomes, Research & Effectiveness, Joint Director, National Collaborating Centre for Mental Health, University College London

Mrs Una Rennard, Service User Representative, Oxford

Dr Phil Shackley, Senior Lecturer in Health Economics, Academic Vascular Unit, University of Sheffield

Dr Margaret Somerville, Director of Public Health Learning, Peninsula Medical School, University of Plymouth

Dr Graham Taylor, Scientific Director & Senior Lecturer, Regional DNA Laboratory, The Leeds Teaching Hospitals

Professor Lindsay Wilson Turnbull, Scientific Director, Centre for MR Investigations & YCR Professor of Radiology, University of Hull

Professor Martin J Whittle, Clinical Co-director, National Co-ordinating Centre for Women’s and Childhealth

Dr Dennis Wright, Consultant Biochemist & Clinical Director, The North West London Hospitals NHS Trust, Middlesex

Pharmaceuticals Panel Members

Chair, Professor Robin Ferner, Consultant Physician and Director, West Midlands Centre for Adverse Drug Reactions, City Hospital NHS Trust, Birmingham

Ms Anne Baileff, Consultant Nurse in First Contact Care, Southampton City Primary Care Trust, University of Southampton

Professor Imti Choonara, Professor in Child Health, Academic Division of Child Health, University of Nottingham

Professor John Geddes, Professor of Epidemiological Psychiatry, University of Oxford

Mrs Barbara Greggains, Non-Executive Director, Greggains Management Ltd

Dr Bill Gutteridge, Medical Adviser, National Specialist Commissioning Advisory Group (NSCAG), London

Mrs Sharon Hart, Consultant Pharmaceutical Adviser, Reading

Dr Jonathan Karnon, Senior Research Fellow, Health Economics and Decision Science, University of Sheffield

Dr Yoon Loke, Senior Lecturer in Clinical Pharmacology, University of East Anglia

Ms Barbara Meredith, Lay Member, Epsom

Dr Andrew Prentice, Senior Lecturer and Consultant Obstetrician & Gynaecologist, Department of Obstetrics & Gynaecology, University of Cambridge

Dr Frances Rotblat, CPMP Delegate, Medicines & Healthcare Products Regulatory Agency, London

Dr Martin Shelly, General Practitioner, Leeds

Mrs Katrina Simister, Assistant Director New Medicines, National Prescribing Centre, Liverpool

Dr Richard Tiner, Medical Director, Medical Department, Association of the British Pharmaceutical Industry, London


Therapeutic Procedures Panel Members

Chair, Professor Bruce Campbell, Consultant Vascular and General Surgeon, Department of Surgery, Royal Devon & Exeter Hospital

Dr Mahmood Adil, Deputy Regional Director of Public Health, Department of Health, Manchester

Dr Aileen Clarke, Consultant in Public Health, Public Health Resource Unit, Oxford

Professor Matthew Cooke, Professor of Emergency Medicine, Warwick Emergency Care and Rehabilitation, University of Warwick

Mr Mark Emberton, Senior Lecturer in Oncological Urology, Institute of Urology, University College Hospital

Professor Paul Gregg, Professor of Orthopaedic Surgical Science, Department of General Practice and Primary Care, South Tees Hospital NHS Trust, Middlesbrough

Ms Maryann L Hardy, Lecturer, Division of Radiography, University of Bradford

Dr Simon de Lusignan, Senior Lecturer, Primary Care Informatics, Department of Community Health Sciences, St George’s Hospital Medical School, London

Dr Peter Martin, Consultant Neurologist, Addenbrooke’s Hospital, Cambridge

Professor Neil McIntosh, Edward Clark Professor of Child Life & Health, Department of Child Life & Health, University of Edinburgh

Professor Jim Neilson, Professor of Obstetrics and Gynaecology, Department of Obstetrics and Gynaecology, University of Liverpool

Dr John C Pounsford, Consultant Physician, Directorate of Medical Services, North Bristol NHS Trust

Dr Karen Roberts, Nurse Consultant, Queen Elizabeth Hospital, Gateshead

Dr Vimal Sharma, Consultant Psychiatrist/Hon. Senior Lecturer, Mental Health Resource Centre, Cheshire and Wirral Partnership NHS Trust, Wallasey

Professor Scott Weich, Professor of Psychiatry, Division of Health in the Community, University of Warwick

Disease Prevention Panel Members

Chair, Dr Edmund Jessop, Medical Adviser, National Specialist Commissioning Advisory Group (NSCAG), London

Mrs Sheila Clark, Chief Executive, St James’s Hospital, Portsmouth

Mr Richard Copeland, Lead Pharmacist: Clinical Economy/Interface, Wansbeck General Hospital, Northumberland

Dr Elizabeth Fellow-Smith, Medical Director, West London Mental Health Trust, Middlesex

Mr Ian Flack, Director PPI Forum Support, Council of Ethnic Minority Voluntary Sector Organisations, Stratford

Dr John Jackson, General Practitioner, Newcastle upon Tyne

Mrs Veronica James, Chief Officer, Horsham District Age Concern, Horsham

Professor Mike Kelly, Director, Centre for Public Health Excellence, National Institute for Health and Clinical Excellence, London

Professor Yi Mien Koh, Director of Public Health and Medical Director, London NHS (North West London Strategic Health Authority), London

Ms Jeanett Martin, Director of Clinical Leadership & Quality, Lewisham PCT, London

Dr Chris McCall, General Practitioner, Dorset

Dr David Pencheon, Director, Eastern Region Public Health Observatory, Cambridge

Dr Ken Stein, Senior Clinical Lecturer in Public Health, Director, Peninsula Technology Assessment Group, University of Exeter, Exeter

Dr Carol Tannahill, Director, Glasgow Centre for Population Health, Glasgow

Professor Margaret Thorogood, Professor of Epidemiology, University of Warwick, Coventry

Dr Ewan Wilkinson, Consultant in Public Health, Royal Liverpool University Hospital, Liverpool


Expert Advisory Network Members

Professor Douglas Altman, Professor of Statistics in Medicine, Centre for Statistics in Medicine, University of Oxford

Professor John Bond, Director, Centre for Health Services Research, University of Newcastle upon Tyne, School of Population & Health Sciences, Newcastle upon Tyne

Professor Andrew Bradbury, Professor of Vascular Surgery, Solihull Hospital, Birmingham

Mr Shaun Brogan, Chief Executive, Ridgeway Primary Care Group, Aylesbury

Mrs Stella Burnside OBE, Chief Executive, Regulation and Improvement Authority, Belfast

Ms Tracy Bury, Project Manager, World Confederation for Physical Therapy, London

Professor Iain T Cameron, Professor of Obstetrics and Gynaecology and Head of the School of Medicine, University of Southampton

Dr Christine Clark, Medical Writer & Consultant Pharmacist, Rossendale

Professor Collette Clifford, Professor of Nursing & Head of Research, School of Health Sciences, University of Birmingham, Edgbaston, Birmingham

Professor Barry Cookson, Director, Laboratory of Healthcare Associated Infection, Health Protection Agency, London

Dr Carl Counsell, Clinical Senior Lecturer in Neurology, Department of Medicine & Therapeutics, University of Aberdeen

Professor Howard Cuckle, Professor of Reproductive Epidemiology, Department of Paediatrics, Obstetrics & Gynaecology, University of Leeds

Dr Katherine Darton, Information Unit, MIND – The Mental Health Charity, London

Professor Carol Dezateux, Professor of Paediatric Epidemiology, London

Dr Keith Dodd, Consultant Paediatrician, Derby

Mr John Dunning, Consultant Cardiothoracic Surgeon, Cardiothoracic Surgical Unit, Papworth Hospital NHS Trust, Cambridge

Mr Jonothan Earnshaw, Consultant Vascular Surgeon, Gloucestershire Royal Hospital, Gloucester

Professor Martin Eccles, Professor of Clinical Effectiveness, Centre for Health Services Research, University of Newcastle upon Tyne

Professor Pam Enderby, Professor of Community Rehabilitation, Institute of General Practice and Primary Care, University of Sheffield

Professor Gene Feder, Professor of Primary Care Research & Development, Centre for Health Sciences, Barts & The London Queen Mary’s School of Medicine & Dentistry, London

Mr Leonard R Fenwick, Chief Executive, Newcastle upon Tyne Hospitals NHS Trust

Mrs Gillian Fletcher, Antenatal Teacher & Tutor and President, National Childbirth Trust, Henfield

Professor Jayne Franklyn, Professor of Medicine, Department of Medicine, University of Birmingham, Queen Elizabeth Hospital, Edgbaston, Birmingham

Dr Neville Goodman, Consultant Anaesthetist, Southmead Hospital, Bristol

Professor Robert E Hawkins, CRC Professor and Director of Medical Oncology, Christie CRC Research Centre, Christie Hospital NHS Trust, Manchester

Professor Allen Hutchinson, Director of Public Health & Deputy Dean of ScHARR, Department of Public Health, University of Sheffield

Professor Peter Jones, Professor of Psychiatry, University of Cambridge, Cambridge

Professor Stan Kaye, Cancer Research UK Professor of Medical Oncology, Section of Medicine, Royal Marsden Hospital & Institute of Cancer Research, Surrey

Dr Duncan Keeley, General Practitioner (Dr Burch & Ptnrs), The Health Centre, Thame

Dr Donna Lamping, Research Degrees Programme Director & Reader in Psychology, Health Services Research Unit, London School of Hygiene and Tropical Medicine, London

Mr George Levvy, Chief Executive, Motor Neurone Disease Association, Northampton

Professor James Lindesay, Professor of Psychiatry for the Elderly, University of Leicester, Leicester General Hospital

Professor Julian Little, Professor of Human Genome Epidemiology, Department of Epidemiology & Community Medicine, University of Ottawa

Professor Rajan Madhok, Consultant in Public Health, South Manchester Primary Care Trust, Manchester

Professor Alexander Markham, Director, Molecular Medicine Unit, St James’s University Hospital, Leeds

Professor Alistaire McGuire, Professor of Health Economics, London School of Economics

Dr Peter Moore, Freelance Science Writer, Ashtead

Dr Andrew Mortimore, Public Health Director, Southampton City Primary Care Trust, Southampton

Dr Sue Moss, Associate Director, Cancer Screening Evaluation Unit, Institute of Cancer Research, Sutton

Mrs Julietta Patnick, Director, NHS Cancer Screening Programmes, Sheffield

Professor Robert Peveler, Professor of Liaison Psychiatry, Royal South Hants Hospital, Southampton

Professor Chris Price, Visiting Professor in Clinical Biochemistry, University of Oxford

Professor William Rosenberg, Professor of Hepatology and Consultant Physician, University of Southampton, Southampton

Professor Peter Sandercock, Professor of Medical Neurology, Department of Clinical Neurosciences, University of Edinburgh

Dr Susan Schonfield, Consultant in Public Health, Hillingdon PCT, Middlesex

Dr Eamonn Sheridan, Consultant in Clinical Genetics, Genetics Department, St James’s University Hospital, Leeds

Professor Sarah Stewart-Brown, Professor of Public Health, University of Warwick, Division of Health in the Community, Warwick Medical School, LWMS, Coventry

Professor Ala Szczepura, Professor of Health Service Research, Centre for Health Services Studies, University of Warwick

Dr Ross Taylor, Senior Lecturer, Department of General Practice and Primary Care, University of Aberdeen

Mrs Joan Webster, Consumer member, HTA – Expert Advisory Network

How to obtain copies of this and other HTA Programme reports

An electronic version of this publication, in Adobe Acrobat format, is available for downloading free of charge for personal use from the HTA website (http://www.hta.ac.uk). A fully searchable CD-ROM is also available (see below).

Printed copies of HTA monographs cost £20 each (post and packing free in the UK) to both public and private sector purchasers from our Despatch Agents.

Non-UK purchasers will have to pay a small fee for post and packing. For European countries the cost is £2 per monograph and for the rest of the world £3 per monograph.

You can order HTA monographs from our Despatch Agents:

– fax (with credit card or official purchase order)
– post (with credit card or official purchase order or cheque)
– phone during office hours (credit card only).

Additionally the HTA website allows you either to pay securely by credit card or to print out your order and then post or fax it.

Contact details are as follows:
HTA Despatch, c/o Direct Mail Works Ltd, 4 Oakwood Business Centre, Downley, HAVANT PO9 2NP, UK
Email: [email protected]
Tel: 02392 492 000
Fax: 02392 478 555
Fax from outside the UK: +44 2392 478 555

NHS libraries can subscribe free of charge. Public libraries can subscribe at a very reduced cost of £100 for each volume (normally comprising 30–40 titles). The commercial subscription rate is £300 per volume. Please see our website for details. Subscriptions can only be purchased for the current or forthcoming volume.

Payment methods

Paying by cheque
If you pay by cheque, the cheque must be in pounds sterling, made payable to Direct Mail Works Ltd and drawn on a bank with a UK address.

Paying by credit card
The following cards are accepted by phone, fax, post or via the website ordering pages: Delta, Eurocard, Mastercard, Solo, Switch and Visa. We advise against sending credit card details in a plain email.

Paying by official purchase order
You can post or fax these, but they must be from public bodies (i.e. NHS or universities) within the UK. We cannot at present accept purchase orders from commercial companies or from outside the UK.

How do I get a copy of HTA on CD?

Please use the form on the HTA website (www.hta.ac.uk/htacd.htm). Or contact Direct Mail Works (see contact details above) by email, post, fax or phone. HTA on CD is currently free of charge worldwide.

The website also provides information about the HTA Programme and lists the membership of the various committees.
