COSMIN manual for systematic reviews of PROMs
COSMIN methodology for systematic reviews of Patient-Reported Outcome Measures (PROMs)

user manual

Version 1.0 dated February 2018

Lidwine B Mokkink, Cecilia AC Prinsen, Donald L Patrick, Jordi Alonso, Lex M Bouter, Henrica CW de Vet, Caroline B Terwee

Contact: LB Mokkink, PhD, VU University Medical Center, Department of Epidemiology and Biostatistics, Amsterdam Public Health research institute, 1081 BT Amsterdam, The Netherlands. Website: www.cosmin.nl. E-mail: [email protected]
The development of the COSMIN guideline for systematic reviews of Patient-Reported Outcome Measures (PROMs) and the COSMIN Risk of Bias checklist for systematic reviews of PROMs was financially supported by the Amsterdam Public Health research institute, VU University Medical Center, Amsterdam, the Netherlands.
Table of contents

Foreword
1. Background information
1.1 The COSMIN initiative
1.2 How to cite this manual
1.3 Development of a comprehensive COSMIN guideline for systematic reviews of Patient-Reported Outcome Measures (PROMs)
1.4 COSMIN taxonomy and definitions
1.5 The COSMIN Risk of Bias Checklist
1.6 What has changed in the COSMIN methodology?
1.7 Ten-step procedure & outline of the manual
2. Part A Steps 1-4: Perform the literature search
2.1 Step 1: Formulate the aim of the review
2.2 Step 2: Formulate eligibility criteria
2.3 Step 3: Perform a literature search
2.4 Step 4: Select abstracts and full-text articles
3. Part B Steps 5-7: Evaluating the measurement properties of the included PROMs
3.1 General methodology
3.1.2 Applying criteria for good measurement properties
3.2 Step 5: Evaluating content validity
3.3 Step 6: Evaluation of the internal structure of PROMs: structural validity, internal consistency, and cross-cultural validity\measurement invariance
3.4 Step 7: Evaluation of the remaining measurement properties: reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness
4. Part C Steps 8-10: Selecting a PROM
4.1 Step 8: Describe interpretability and feasibility
4.2 Step 9: Formulate recommendations
4.3 Step 10: Report the systematic review
5. COSMIN Risk of Bias checklist
5.1 Assessing risk of bias in a study on structural validity (box 3)
5.2 Assessing risk of bias in a study on internal consistency (box 4)
5.3 Assessing risk of bias in a study on cross-cultural validity\measurement invariance (box 5)
5.4 Assessing risk of bias in a study on reliability (box 6)
5.5 Assessing risk of bias in a study on measurement error (box 7)
5.6 Assessing risk of bias in a study on criterion validity (box 8)
5.7 Assessing risk of bias in a study on hypotheses testing for construct validity (box 9)
5.8 Assessing risk of bias in a study on responsiveness (box 10)
Appendix 1. Example of a search strategy (81)
Appendix 2. Example of a flow chart
Appendix 4. Example of a table on characteristics of the included study populations
Appendix 5. Information to extract on interpretability of PROMs
Appendix 6. Information to extract on feasibility of PROMs
Appendix 7. Table on results of studies on measurement properties
Appendix 8. Summary of Findings Tables
6. References
Foreword
Research performed with outcome measurement instruments of poor or unknown quality constitutes a waste of resources and is unethical (3). Unfortunately, this practice is widespread (4). Selecting the best outcome measurement instrument for the outcome of interest in a methodologically sound way requires: (1) high quality studies that document the evaluation of the measurement properties (in total nine different aspects of reliability, validity, and responsiveness) of relevant outcome measurement instruments in the target population; and (2) a high quality systematic review of studies on measurement properties in which all information is gathered and evaluated in a systematic and transparent way, accompanied by clear recommendations for the most suitable available outcome measurement instrument. However, conducting such a systematic review is quite complex and time consuming, and it requires expertise within the research team on the construct to be measured, on the patient population, and on the methodology of studies of measurement properties.

High quality systematic reviews can provide a comprehensive overview of the measurement properties of Patient-Reported Outcome Measures (PROMs) and support evidence-based recommendations in the selection of the most suitable PROM for a given purpose, for example for selecting the most suitable PROM for an outcome included in a core outcome set (COS) (5). These systematic reviews can also identify gaps in knowledge on the measurement properties of PROMs, which can be used to design new studies on measurement properties.

The COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) initiative aims to improve the selection of outcome measurement instruments in research and clinical practice by developing tools for selecting the most suitable instrument for the situation at issue. Recently, a comprehensive methodological guideline for systematic reviews of PROMs was developed by the COSMIN initiative (2). In addition, the COSMIN checklist was adapted for assessing the risk of bias in studies on measurement properties in systematic reviews of PROMs (6). Also, a Delphi study was performed to develop standards and criteria for assessing the content validity of PROMs (7).

The present manual is a supplement to the COSMIN guideline for systematic reviews of measurement instruments (2), including the use of the COSMIN Risk of Bias checklist (6). This manual contains additional detailed information on how to perform each of the proposed ten steps to conduct a systematic review of PROMs, on how to use the COSMIN Risk of Bias checklist, and on how to come to recommendations on the most suitable instrument for a given purpose.

In the COSMIN methodology we use the word patient. However, sometimes the target population of the systematic review or the PROM is not patients but, e.g., healthy individuals (e.g. for generic PROMs) or caregivers (when a PROM measures caregiver burden). In these cases, the word patient should be read as, e.g., healthy person or caregiver.
Throughout the manual we provide some examples to explain the COSMIN Risk of Bias checklist. We would like to emphasize that these are arbitrarily chosen and used for illustrative purposes.

The COSMIN methodology focuses on PROMs used as outcome measurement instruments (i.e. in an evaluative application). The methodology can also be used for other types of measurement instruments (like clinician-reported outcome measures or performance-based outcome measures), or for other applications (e.g. diagnostic or predictive applications), but the methodology may need to be adapted for these other purposes. For example, structural validity may not be relevant for some types of instruments, and responsiveness is less relevant when an instrument is used as a diagnostic instrument. Likewise, new standards would need to be developed for assessing the quality of a study on the reliability of, for example, an MRI scan.

This manual aims to facilitate the understanding and practical performance of a systematic review of PROMs. We give detailed instructions on each step to take, and we provide many examples and suggestions on how to perform each step.

We aim to continue updating this manual when deemed necessary, based on experience and suggestions by users of the COSMIN methodology. If you have any suggestions or questions, please contact us ([email protected]).
1. Background information

1.1 The COSMIN initiative

The COSMIN initiative aims to improve the selection of health measurement instruments both in research and clinical practice by developing tools for selecting the most suitable instrument for a given situation. COSMIN is an international initiative consisting of a multidisciplinary team of researchers with expertise in epidemiology, psychometrics, and qualitative research, and in the development and evaluation of outcome measurement instruments in the field of health care, as well as in performing systematic reviews of outcome measurement instruments.
COSMIN steering committee

Lidwine B Mokkink 1, Cecilia AC Prinsen 1, Donald L Patrick 2, Jordi Alonso 3, Lex M Bouter 1, Henrica CW de Vet 1, Caroline B Terwee 1

1 Department of Epidemiology and Biostatistics, Amsterdam Public Health research institute, VU University Medical Center, De Boelelaan 1089a, 1081 HV, Amsterdam, the Netherlands;
2 Department of Health Services, University of Washington, Thur Canal St Research Office, 146 N Canal Suite 310, 98103, Seattle, USA;
3 IMIM (Hospital del Mar Medical Research Institute), CIBER Epidemiology and Public Health (CIBERESP), Dept. Experimental and Health Sciences, Pompeu Fabra University (UPF), Doctor Aiguader 88, 08003, Barcelona, Spain.
1.2 How to cite this manual

This manual was based on three articles, published in peer-reviewed scientific journals. If you use the COSMIN methodology, please refer to these articles:

Prinsen, C.A.C., Mokkink, L.B., Bouter, L.M., Alonso, J., Patrick, D.L., de Vet, H.C., et al. (2018). COSMIN guideline for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res, accepted.

Mokkink, L.B., de Vet, H.C.W., Prinsen, C.A.C., Patrick, D.L., Alonso, J., Bouter, L.M., Terwee, C.B. (2017). COSMIN Risk of Bias checklist for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res. doi:10.1007/s11136-017-1765-4. [Epub ahead of print].

Terwee, C.B., Prinsen, C.A.C., Chiarotto, A., Westerman, M.J., Patrick, D.L., Alonso, J., Bouter, L.M., de Vet, H.C.W., Mokkink, L.B. COSMIN methodology for evaluating the content validity of Patient-Reported Outcome Measures: a Delphi study. Submitted.
1.3 Development of a comprehensive COSMIN guideline for systematic reviews of Patient-Reported Outcome Measures (PROMs)

In the absence of empirical evidence, the COSMIN guideline for systematic reviews of PROMs (2) was based on the experience that we (that is, the COSMIN steering committee) have gained over the past years in conducting systematic reviews of PROMs (8, 9), in supporting other systematic reviewers in their work (10, 11), and in the development of the COSMIN methodology (12, 13). In addition, we have studied the quality of systematic reviews of PROMs in two consecutive reviews (14, 15), and in reviews that have used the COSMIN methodology we have specifically searched for the comments made by review authors relating to the COSMIN methodology. Further, we have had iterative discussions within the COSMIN steering committee, both at face-to-face meetings (CP, WM, HdV and CT) and by email. We gained experience from the results of a recent Delphi study on the content validity of PROMs (7), and from the results of a previous Delphi study on the selection of outcome measurement instruments for outcomes included in core outcome sets (COS) (5).

Further, the guideline was developed in concordance with existing guidelines for reviews, such as the Cochrane Handbook for systematic reviews of interventions (16) and for diagnostic test accuracy reviews (17), the PRISMA Statement (18), the Institute of Medicine (IOM) standards for systematic reviews of comparative effectiveness research (19), and the Grading of Recommendations Assessment, Development and Evaluation (GRADE) principles (20).

This guideline focuses on the methodology of systematic reviews of existing PROMs, for which at least a PROM development study or some information on measurement properties is available, and that are used for evaluative purposes (e.g. to measure the effects of treatment or to monitor patients over time). Instruments used for discriminative or predictive purposes are not considered in this guideline. The methods can also be used for systematic reviews of other types of outcome measurement instruments, such as clinician-reported outcome measurement
instruments (ClinROMs) or performance-based outcome measurement instruments (PerBOMs), for reviews of one single outcome measurement instrument, or for a predefined set of outcome measurement instruments. However, some adaptations may be needed in some of the steps or in the COSMIN Risk of Bias checklist.
1.4 COSMIN taxonomy and definitions

In the literature, different terminology and definitions for measurement properties are continuously being used. The COSMIN initiative developed a taxonomy of measurement properties relevant for evaluating PROMs. In the first COSMIN Delphi study, conducted in 2006-2007, consensus was reached on terminology and definitions of all measurement properties included in the COSMIN checklist (21). This taxonomy (Figure 1) formed the foundation on which the guideline was based.

Figure 1. The COSMIN taxonomy (21)
Taxonomy of measurement properties
The COSMIN taxonomy of measurement properties is presented in Figure 1. It was decided that all measurement properties included in the taxonomy are relevant and should be evaluated for PROMs used in an evaluative application.

In assessing the quality of a PROM we distinguish three domains, i.e. reliability, validity, and responsiveness. Each domain contains one or more measurement properties, i.e. quality aspects of measurement instruments. The domain reliability contains three measurement properties: internal consistency, reliability, and measurement error. The domain validity also contains three measurement properties: content validity (including face validity), construct validity (comprising structural validity, hypotheses testing, and cross-cultural validity), and criterion validity. The domain responsiveness contains only one measurement property, which is also called responsiveness.

Definitions of measurement properties
Consensus-based definitions of all measurement properties included in the COSMIN checklist are presented in Table 1.
Table 1. COSMIN definitions of domains, measurement properties, and aspects of measurement properties (21)

Reliability (domain): The degree to which the measurement is free from measurement error.

Reliability (extended definition): The extent to which scores for patients who have not changed are the same for repeated measurement under several conditions: e.g. using different sets of items from the same PROM (internal consistency); over time (test-retest); by different persons on the same occasion (inter-rater); or by the same persons (i.e. raters or responders) on different occasions (intra-rater).

Internal consistency (measurement property): The degree of the interrelatedness among the items.

Reliability (measurement property): The proportion of the total variance in the measurements which is due to 'true'† differences between patients.

Measurement error (measurement property): The systematic and random error of a patient's score that is not attributed to true changes in the construct to be measured.

Validity (domain): The degree to which a PROM measures the construct(s) it purports to measure.

Content validity (measurement property): The degree to which the content of a PROM is an adequate reflection of the construct to be measured.

Face validity (aspect of content validity): The degree to which (the items of) a PROM indeed looks as though they are an adequate reflection of the construct to be measured.

Construct validity (measurement property): The degree to which the scores of a PROM are consistent with hypotheses (for instance with regard to internal relationships, relationships to scores of other instruments, or differences between relevant groups) based on the assumption that the PROM validly measures the construct to be measured.

Structural validity (aspect of construct validity): The degree to which the scores of a PROM are an adequate reflection of the dimensionality of the construct to be measured.

Hypotheses testing (aspect of construct validity): Idem construct validity.

Cross-cultural validity (aspect of construct validity): The degree to which the performance of the items on a translated or culturally adapted PROM are an adequate reflection of the performance of the items of the original version of the PROM.

Criterion validity (measurement property): The degree to which the scores of a PROM are an adequate reflection of a 'gold standard'.

Responsiveness (domain): The ability of a PROM to detect change over time in the construct to be measured.

Responsiveness (measurement property): Idem responsiveness.

Interpretability*: The degree to which one can assign qualitative meaning, that is, clinical or commonly understood connotations, to a PROM's quantitative scores or change in scores.

† The word 'true' must be seen in the context of Classical Test Theory (CTT), which states that any observation is composed of two components: a true score and error associated with the observation. 'True' is the average score that would be obtained if the scale were given an infinite number of times. It refers only to the consistency of the score, and not to its accuracy (22).
* Interpretability is not considered a measurement property, but an important characteristic of a measurement instrument.
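The reliability definitions in Table 1 rest on the CTT decomposition described in the footnote. As a brief aside in standard CTT notation (textbook CTT, not part of the consensus-based COSMIN definitions): an observed score X is the sum of a true score T and an error term E, and the reliability of Table 1 ("proportion of the total variance due to 'true' differences") is a variance ratio:

```latex
X = T + E, \qquad
\operatorname{Var}(X) = \sigma_T^2 + \sigma_E^2
\quad \text{(assuming } \operatorname{Cov}(T, E) = 0\text{)},
\qquad
\text{Reliability} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}.
```

A perfectly reliable score has no error variance (reliability 1); a score that is pure noise has no true-score variance (reliability 0).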
1.5 The COSMIN Risk of Bias Checklist
It was decided to create three separate versions of the original COSMIN checklist (1) to be used for different purposes when assessing the methodological quality of studies on measurement properties, i.e. the COSMIN Design checklist, the COSMIN Risk of Bias checklist, and the COSMIN Reporting checklist. The Risk of Bias checklist was exclusively developed for assessing the methodological quality of single studies included in systematic reviews of PROMs. The purpose of assessing the methodological quality of a study in a systematic review is to screen for risk of bias in the included studies. The term 'risk of bias' is in compliance with the Cochrane methodology for systematic reviews of trials and diagnostic studies (16). It refers to whether the results of a study are trustworthy, given its methodological quality.

The checklist contains standards referring to design requirements and preferred statistical methods of studies on measurement properties. For each measurement property a COSMIN box was developed containing all standards needed to assess the quality of a study on that specific measurement property. Preferred statistical methods based on both Classical Test Theory (CTT) and Item Response Theory (IRT) or Rasch analyses are included in the standards.

Boxes from the checklist
The COSMIN Risk of Bias checklist contains ten boxes with standards for PROM development (box 1) and for nine measurement properties: content validity (box 2), structural validity (box 3), internal consistency (box 4), cross-cultural validity\measurement invariance (box 5), reliability (box 6), measurement error (box 7), criterion validity (box 8), hypotheses testing for construct validity (box 9), and responsiveness (box 10).
COSMIN standards and criteria

It is important to note that COSMIN makes a distinction between 'standards' and 'criteria'. Standards refer to design requirements and preferred statistical methods for evaluating the methodological quality of studies on measurement properties. Criteria refer to what constitutes good measurement properties (quality of PROMs). In the first COSMIN Delphi study (1), only standards were developed for evaluating the quality of studies on measurement properties; criteria for what constitutes good measurement properties were not developed. However, such criteria are needed in systematic reviews to provide evidence-based recommendations for which PROMs are good enough to be used in research or clinical practice. Therefore, criteria were developed for good measurement properties (Table 4) (2).
Modular tool
One article can describe one or more studies (often on different measurement properties). The methodological quality of each study should be assessed separately by rating all standards included in the corresponding box. Therefore, the COSMIN checklist should be used as a modular tool. This means that it may not be necessary to complete the whole checklist when evaluating the quality of studies described in an article. In accordance with the COSMIN taxonomy (21), the measurement properties evaluated in an article determine which boxes need to be completed. Each assessment of a measurement property is considered to be a separate study. For example, if in an article the internal consistency and construct validity of an instrument were assessed, only two boxes need to be completed, i.e. box 4 Internal consistency and box 9 Hypotheses testing for construct validity. This modular system was developed because not all measurement properties are assessed in all articles.
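To make the modular principle concrete, here is a minimal sketch of the box lookup. The property-to-box mapping is taken from the checklist itself (Section 1.5); the helper function and its name are illustrative, not part of COSMIN:

```python
# Box numbers from the COSMIN Risk of Bias checklist (boxes 1-10).
RISK_OF_BIAS_BOXES = {
    "PROM development": 1,
    "Content validity": 2,
    "Structural validity": 3,
    "Internal consistency": 4,
    "Cross-cultural validity / measurement invariance": 5,
    "Reliability": 6,
    "Measurement error": 7,
    "Criterion validity": 8,
    "Hypotheses testing for construct validity": 9,
    "Responsiveness": 10,
}

def boxes_to_complete(properties_assessed):
    """Illustrative helper: which boxes to complete for one article,
    per the modular principle (one box per assessed property)."""
    return sorted(RISK_OF_BIAS_BOXES[p] for p in properties_assessed)

# The example from the manual: an article assessing internal consistency
# and construct validity requires only boxes 4 and 9.
print(boxes_to_complete(
    ["Internal consistency", "Hypotheses testing for construct validity"]))
# -> [4, 9]
```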
Four-point rating system
For each study, an overall judgement is needed on the quality of that particular study. We therefore developed a four-point rating system in which each standard within a COSMIN box can be rated as 'very good', 'adequate', 'doubtful' or 'inadequate' (6). The overall rating of the quality of each study is determined by taking the lowest rating of any standard in the box (i.e. the 'worst score counts' principle) (12). This overall rating of the quality of the studies can be used in grading the quality of the evidence (see Chapter 3), taking into account that results of 'inadequate' quality studies will decrease the trust one can put in the results and the overall conclusion about a measurement property of a PROM (as used in a specific context). Each box contains a standard asking if there were any other important methodological flaws that are not covered by the checklist, but that may lead to biased results or conclusions. For example, in a reliability study, the administrations should be independent. Independent administrations imply that the first administration has not influenced the second administration, i.e. at the second administration the patient should not have been aware of the scores on the first administration (recall bias).
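The 'worst score counts' principle can be sketched in a few lines. A minimal illustration, assuming ratings are recorded as the four labels above (the function name and data layout are hypothetical, not part of COSMIN):

```python
# The four-point rating system, ordered from worst to best.
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]

def overall_study_rating(standard_ratings):
    """Overall quality of one study on one measurement property:
    the lowest rating given to any standard in the box
    (the 'worst score counts' principle)."""
    if not standard_ratings:
        raise ValueError("at least one standard must be rated")
    return min(standard_ratings, key=RATING_ORDER.index)

# A single 'doubtful' standard caps the whole study at 'doubtful'.
print(overall_study_rating(["very good", "adequate", "doubtful", "very good"]))
# -> doubtful
```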
COSMIN considers each subscale of a (multi-dimensional) PROM separately
The measurement properties of a PROM should be rated separately for each set of items that makes up a score. This can be a single item if this item is used as a standalone score, a set of items making up a subscale score within a multi-dimensional PROM, or a total PROM score if all items of a PROM are summarized into a total score. Each score is assumed to represent a construct and is therefore considered a separate PROM. In the remainder of this manual, when we refer to a PROM, we mean a PROM score or subscore. For example, if a multidimensional PROM consists of three subscales and one single item, each scored separately, the measurement properties of the three subscales and the single item need to be rated separately. If the subscale scores are also summarized into a total score, the measurement properties of the total score should also be rated separately.
1.6 What has changed in the COSMIN methodology?

Inadequate studies are no longer ignored
In our previous protocol for performing systematic reviews of PROMs, we recommended ignoring the results of inadequate (previously called 'poor') quality studies, because the results of these studies might be biased. In the current methodology, we recommend that results from inadequate studies may be included when pooling or summarizing the results from all studies is possible, or if the results of inadequate studies are consistent with the results from studies of better quality. If the results from inadequate studies are included, one should consider downgrading the quality of the evidence because of risk of bias (see Chapter 3). Herewith, the current methodology is more in line with recommendations from the Cochrane Collaboration for systematic reviews of intervention studies (16) and of diagnostic test accuracy studies (17).

Removing standards about a reasonable gold standard for criterion validity and responsiveness
We decided to delete the standards about deciding whether a gold standard used can be considered a reasonable gold standard. We now recommend that the review team determines, before assessing the methodological quality of studies, which outcome measurement instrument can be considered a reasonable gold standard. When an included study uses this particular gold standard instrument to assess validity, the study can be considered a study on criterion validity. The COSMIN panel reached consensus that no gold standard exists for PROMs (13). The only exception is when a shortened instrument is compared to the original long version; in that case, the original long version can be considered the gold standard. Often, authors incorrectly consider their comparator instrument a gold standard, for example when they compare the scores of a new instrument to a widely used and well-known instrument. In this situation, the study is considered a study on construct validity and box 9 Hypotheses testing for construct validity should be completed.

Removing standards on formulating hypotheses for hypotheses testing for construct validity and responsiveness
We decided to delete all standards about formulating hypotheses a priori from the boxes Hypotheses testing for construct validity and Responsiveness. Although we consider it very important to define hypotheses in advance when assessing the construct validity or responsiveness of a PROM, results of studies without these hypotheses can, in many cases, still be used in a systematic review of PROMs, because the presented correlations or mean differences between (sub)groups are not necessarily biased and thus can be evaluated. The conclusions of the authors, though, may be biased when a priori hypotheses are lacking. We recommend that the review team formulates a set of hypotheses about the expected direction and magnitude of correlations between the PROM of interest and other PROMs, and of mean differences in scores between groups (23). This way, all results are compared to the same relevant hypotheses. If construct validity studies do include hypotheses, the review team can adopt these hypotheses if they consider them adequate. This way, the results from many studies can still be used in the systematic review, as studies without hypotheses will no longer receive an 'inadequate' (previously called 'poor') quality rating. A detailed explanation for completing these boxes can be found in Chapter 5.
Removal of standards on sample size
We decided to move the standard about adequate sample size for single studies from those boxes where it is possible to pool the results (i.e. the boxes Internal consistency, Reliability, Measurement error, Criterion validity, Hypotheses testing for construct validity, and Responsiveness) to a later phase of the review, i.e. when drawing conclusions across studies on the measurement properties of the PROM (Chapter 3). This was decided because several small high-quality studies can together provide sufficient evidence for a measurement property. Therefore, we recommend taking the aggregated sample size of the available studies into account when assessing the overall quality of evidence for a measurement property in a systematic review. This is in compliance with the Cochrane Handbooks (16, 17). However, the standard about adequate sample size for single studies was maintained in the boxes Content validity, Structural validity, and Cross-cultural validity\measurement invariance, because the results of these studies cannot be pooled. In these boxes, factor analyses and IRT or Rasch analyses are included as preferred statistical methods, and these methods require sample sizes that are sufficiently large to obtain reliable results. The suggested sample size requirements should be considered as general guidance; in some situations, depending on the type of model and the number of factors or items, more nuanced criteria might be applied.

Removal of standards on missing data and handling missing data
Each box of the original COSMIN checklist, except for the box on content validity, contains standards about whether the percentage of missing items was reported, and how these missing items were handled. Although we consider information on missing items very important to report, we decided to remove these standards from all boxes in the COSMIN Risk of Bias checklist, as it was agreed upon within the COSMIN steering committee that lack of reporting on the number of missing items and on how missing items were handled would not necessarily lead to biased results of the study. Furthermore, at the moment there is little evidence and no consensus yet about the best way to handle missing items in studies on measurement properties.

New order of evaluating the measurement properties
A new order of evaluating the measurement properties is proposed, as shown in Table 2. Content validity is considered to be the most important measurement property, because first of all it should be clear that the items of the PROM are relevant, comprehensive, and comprehensible with respect to the construct of interest and target population (2). Therefore, we recommend first evaluating the development and content validity studies of the PROMs. PROMs with high quality evidence of inadequate content validity can be excluded from further assessment in the systematic review. Next, we recommend evaluating the internal structure of PROMs, covering the measurement properties structural validity, internal consistency, and cross-cultural validity\measurement invariance. Internal structure refers to how the different items in a PROM are related, which is important to know for deciding how items might be combined into a scale or subscale. Evaluating the internal structure of the instrument is relevant for PROMs that are based on a reflective model. In a reflective model, the construct manifests itself in the items, i.e. the items are a reflection of the construct to be measured (24). Finally, the remaining measurement properties are considered, i.e. reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness.
Table 2. Order in which the measurement properties are assessed

Content validity
1. PROM development*
2. Content validity

Internal structure
3. Structural validity
4. Internal consistency
5. Cross-cultural validity\measurement invariance

Remaining measurement properties
6. Reliability
7. Measurement error
8. Criterion validity
9. Hypotheses testing for construct validity
10. Responsiveness
* not a measurement property, but taken into account when evaluating content validity

The COSMIN methodology is specifically developed for use in reviews of outcome measurement instruments. The proposed order of evaluating the measurement properties is therefore based on use as an outcome measure, i.e. for an evaluative purpose. This order is also reflected in the ordering of boxes in the COSMIN Risk of Bias checklist. We think that some of the included measurement properties are less relevant for other purposes. For example, responsiveness is less relevant when an instrument is used as a diagnostic tool. More information on the removal of other standards, on adaptations to individual standards, and on how these changes were decided upon can be found elsewhere (6).

1.7 Ten-step procedure & outline of the manual

In a systematic review of PROMs, an overview is given of the available evidence on each measurement property of each included PROM, to come to overall conclusions per measurement property and to give recommendations for the most suitable PROM for a given purpose. A guideline consisting of a sequential ten-step procedure was developed by the COSMIN steering committee, subdivided into three parts: A, B and C (Figure 2) (2).

Part A consists of steps 1-4. Generally, these steps are standard procedures when performing systematic reviews, and are in agreement with existing guidelines for reviews (16, 17): preparing and performing the literature search, and selecting relevant publications. The methodology of Part A is presented in Chapter 2.
Part B consists of steps 5-7 and concerns the evaluation of the measurement properties of the included PROMs. These steps were particularly developed for systematic reviews of PROMs. Part B is described in Chapter 3. We first describe the general methodology, and next we explain step 5 (evaluation of content validity), step 6 (evaluation of internal structure), and step 7 (evaluation of the remaining measurement properties) in more detail.

Part C consists of steps 8-10 and concerns the evaluation of the interpretability and feasibility of the PROMs (step 8), formulating recommendations (step 9), and the reporting of the systematic review (step 10). Part C is described in Chapter 4.

In Chapter 5 we elaborate on how to rate each standard included in the COSMIN Risk of Bias checklist. We close the manual by providing examples of a search strategy and a flow chart, and all tables that need to be presented in a review (Appendices 1-8).

We aim to continue updating this manual when deemed necessary, based on experience and suggestions by users of the COSMIN methodology. If you have any suggestions or questions, please contact us.
2. Part A Steps 1-4: Perform the literature search

Steps 1-4 are standard procedures when performing a systematic review (2). They concern formulating the research aim (step 1) and eligibility criteria (step 2), performing the literature search (step 3), and selecting articles (step 4). These steps are in agreement with existing guidelines for reviews (16, 17).

2.1 Step 1: Formulate the aim of the review

The aim of a systematic review of PROMs focuses on the quality of the PROMs. It should include the following four key elements: 1) the construct; 2) the population(s); 3) the type of instrument(s); and 4) the measurement properties of interest. An example of a research aim could be: "our aim is to critically appraise, compare and summarize the quality of the measurement properties of all self-report fatigue questionnaires for patients with multiple sclerosis (MS), Parkinson's disease (PD) or stroke" (25). In the aim of this review, the construct of interest is 'fatigue', the population of interest is 'patients with MS, PD or stroke', the type of instrument of interest is 'self-report questionnaire', and 'all' measurement properties are explored in the review. We also recommend that these four key elements are included in the title of the review, for example: "Self-report fatigue questionnaires in multiple sclerosis, Parkinson's disease and stroke: a systematic review of measurement properties." The four key elements will also inform both the eligibility criteria in Step 2 and the search strategy to be conducted in Step 3.

2.2 Step 2: Formulate eligibility criteria
The eligibility criteria should be in agreement with the four key elements of the review aim: 1) the PROM(s) should aim to measure the construct of interest; 2) the study sample (or an arbitrary majority, e.g. ≥50%) should represent the population of interest; 3) the study should concern PROMs; and 4) the aim of the study should be the evaluation of one or more measurement properties, the development of a PROM (to rate the content validity), or the evaluation of the interpretability of the PROMs of interest (e.g. evaluating the distribution of scores in the study population, the percentage of missing items, floor and ceiling effects, the availability of scores and change scores for relevant (sub)groups, and the minimal important change or minimal important difference (26)).

We recommend excluding studies that only use the PROM as an outcome measurement instrument, for instance studies in which the PROM is used to measure the outcome (e.g. in randomized controlled trials), or studies in which the PROM is used in a validation study of another instrument. Including all studies that use a specific PROM would require a much more extended search strategy, because the search filter for studies on measurement properties could not be used (see Step 3). It would also be extremely time-consuming to find all studies using a specific PROM, because the PROMs used in a study may not be reported in the abstract. This basically means that all studies performed in the population of interest would have to be screened full-text. As it might be
COSMINmanualforsystematicreviewsofPROMs
21
unlikelythatthiscanbedoneinastandardizedway,wethereforeconsiderthisapproachnotfeasible.Further,werecommendtoincludeonlyfulltextarticlesbecauseoftenverylimitedinformationonthedesignofastudyisfoundinabstracts.Informationfoundinabstractswillhamperthequalityassessmentofthestudyandtheresultsofthemeasurementpropertiesinsteps5through7.2.3Step3:Performaliteraturesearch InagreementwiththeCochranemethodology(16,17),andbasedonconsensus(5),MEDLINEandEMBASEareconsideredtobetheminimumdatabasestobesearched.Inaddition,itisrecommendedtosearchinother(content‐specific)databases,dependingontheconstructandpopulationofinterest,forexampleWebofScience,Scopus,CINAHL,orPsycINFO.SearchstrategyIntheguidelineandinthismanualwefocusonasystematicreviewincludingallPROMsmeasuringaspecificconstructwhichhave‐atleastsomeextent‐beenvalidated.Anadequatesearchstrategythereforeconsistsofacomprehensivecollectionofsearchterms(i.e.indextermsandfreetextwords)forthefourkeyelementsofthereviewaim:1)construct;2)population(s);3)typeofinstrument(s),and4)measurementproperties.Itisrecommendedtoconsultaclinicallibrarianaswellasexpertsontheconstructandstudypopulationofinterest.AcomprehensivePROMfilterhasbeendevelopedforPubMedbythePatientReportedOutcomesMeasurementGroup,UniversityofOxford,thatcanbeusedasasearchblockfor‘typeofmeasurementinstrument(s)’,andisavailablethroughtheCOSMINwebsite.Regardingsearchtermsformeasurementpropertieswerecommendtouseahighlysensitivevalidatedsearchfilterforfindingstudiesonmeasurementproperties,whichisavailableforPubMedandEMBASEandcanbefoundontheCOSMINwebsite(27,28).AnexampleofaPubMedsearchstrategyisincludedinAppendix1(25).Additionalsourcesforsearchblockscanbefoundonthewebsite 
https://blocks.bmi‐online.nl/.HereagroupofDutchmedicalinformationspecialistshavecompiledanumberof‘SearchBlocks’.Thesebuildingblocksareintendedforusewhenperformingomplexsearchstrategiesinmedicalandhealthbibliographicdatabases,likePubMed,Embase,PsycInfo,WebofScienceandothers.Theseblockscanbeusedbyexpertsinthefieldofsearchingforliterature,e.g.informationspecialists,clinicallibrarians,healthlibrariansandprofessionalsincloselyrelatedareas.As,inprinciple,theaimisfindingallPROMs,inagreementwiththeCochranemethodology,itisrecommendedtosearchdatabasesfromthedateofinceptiontillpresent(16,17).TheuseoflanguagerestrictionsdependsontheinclusioncriteriadefinedinStep2.Ingeneral,itisrecommendnottouselanguagerestrictionsinthesearchstrategy,eveniftherearenoresourcestotranslatethearticlesforthereview.Inthisway,reviewauthorsareatleastabletoreporttheirexistence.Next,itisrecommendedtousesoftwaresuchasEndnoteorReferenceManagertomanagereferences.Covidence(ww.covidence.org)couldbeofusewhenmanagingthe
COSMINmanualforsystematicreviewsofPROMs
22
reviewscreeningperformance.Covidenceisanonlinetoolthatsupportssystematicreviewersinuploadingsearchresults,screeningabstractsandfulltexts,resolvedisagreements,andexportdataintoRevManorExcel.ItisrecommendedbytheCochraneCollaboration.Searchesmustbeasuptodateaspossiblewhenthereviewispublished,soitmaybenecessarytoupdatethesearchbeforesubmitting(orbeforeacceptation)ofthemanuscript.InadditiontosearchingforallPROMs,Figure3showsthesearchstrategyforothertypesofreviews.Forexample,iftheaimofasystematicreviewistoidentifyallPROMsthathavebeenused,searchtermsformeasurementpropertiesshouldnotbeused(firstcolumnofFigure3).IftheaimofasystematicreviewistoreportonthemeasurementpropertiesofoneparticularPROM(oraselectionofpredefinedPROMs),searchtermsfortheconstructofinterestshouldnotbeusedandsearchtermsforthetypeofinstrumentcanbereplacedbysearchtermsforthenamesoftheinstrumentsofinterest(thirdcolumnofFigure3).Asinallsystematicreviewstheliteraturesearchshouldbecarefullydocumentedinthestudyprotocol.Werecommendtodocumentthefollowing:namesofthedatabasesthatweresearched,includingtheinterfaceusedtosearchthedatabases(forexample,PubMedwasusedforsearchingMEDLINE);allsearchtermsused;anyrestrictionsthatwereused(forexample,humanstudiesonly,notanimals);thenumberofrecordsretrievedfromeachdatabase;andthenumberofuniquerecords.InaccordancewiththePRISMAstatement,itisrecommendtoaddthedocumentationoftheselectionprocesstoaflowdiagram.AnexampleofthePRISMAflowdiagramcanbefoundinAppendix2(29).Thisflowdiagramincludesinformationonthetotalnumberofabstractsselected,thetotalnumberoffull‐textarticlesthatwereselected,andthemainreasonsforexcludingfull‐textarticles.Notethattheincludedarticlescandescribedoneormorestudies(ononeormoredifferentmeasurementproperties).Therefore,werecommendtodescribeintheflowdiagramthetotalnumberofarticlesincluded,thetotalnumberofstudiesdescribedinthosearticles,andthetotalnumberofdifferent(versionsof)PROMsfound.
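As an illustration of how the four search blocks combine, the sketch below assembles a PubMed-style query for the fatigue review example from Step 1. The individual terms are abbreviated placeholders, not the validated filters themselves; in a real review, the instrument block would come from the Oxford PROM filter and the measurement-properties block from the COSMIN search filter (see Appendix 1).

```python
# Hedged sketch: abbreviated placeholder blocks, not the validated filters.
construct = '("fatigue"[MeSH Terms] OR fatigue[tiab])'
population = ('("multiple sclerosis"[MeSH Terms] OR "parkinson disease"[MeSH Terms]'
              ' OR stroke[tiab])')
instrument = '(questionnaire*[tiab] OR "self report"[tiab])'  # stand-in for the PROM filter
properties = '(psychometr*[tiab] OR reproducib*[tiab] OR "measurement properties"[tiab])'

# The four key elements of the review aim are combined with AND into one query.
query = " AND ".join([construct, population, instrument, properties])
print(query)
```

For a review that aims to identify all PROMs ever used (first column of Figure 3), the `properties` block would simply be omitted from the join.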
* Search filter described by Terwee et al (27)
Figure 3. Search strategy for different types of systematic reviews of measurement instruments
2.4 Step 4: Select abstracts and full-text articles

It is generally recommended to perform the selection of abstracts and full-text articles by two reviewers independently (16, 17). If a study seems relevant to at least one reviewer based on the abstract, or in case of doubt, the full-text article needs to be retrieved and screened. Differences should be discussed, and if consensus between the two reviewers cannot be reached, it is recommended to consult a third reviewer. It is also recommended to check all references of the included articles to search for additional potentially relevant studies. If many new articles are found, the initial search strategy might have been insufficiently comprehensive and may need to be improved and redone. Note that Cochrane reviews use various additional sources to find relevant studies, such as dissertations, editorials, conference proceedings, and reports. The probability of finding additional relevant articles for systematic reviews of PROMs in these types of additional sources, however, appears to be small.
3. Part B steps 5-7: Evaluating the measurement properties of the included PROMs

Part B consists of steps 5-7 of the guideline on conducting a systematic review of PROMs and concerns the evaluation of the measurement properties of the included PROMs. In Chapter 3.1 we will start with explaining the general methodology of Part B. In Chapter 3.2 we discuss step 5, in which the content validity is assessed. Next, in Chapter 3.3 step 6 is explained, which concerns evaluating the internal structure of a PROM (i.e. structural validity, internal consistency, and cross-cultural validity\measurement invariance). Lastly, in Chapter 3.4 we describe step 7 of the guideline, which concerns the evaluation of the remaining measurement properties (i.e. reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness).

3.1 General methodology

The evaluation of each measurement property includes three sub-steps (see Figure 4):
Figure 4. Practical outline for performing a systematic review
First, the methodological quality of each single study on a measurement property is assessed using the COSMIN Risk of Bias checklist. Detailed instructions for how each standard included in the COSMIN Risk of Bias checklist should be rated are described in Chapter 5.
Second, the result of each single study on a measurement property is rated against the updated criteria for good measurement properties (+/-/?) (Table 4);
Third, the evidence is summarized per measurement property per PROM, the overall result is rated against the criteria for good measurement properties (Table 4), and the quality of the evidence is graded by using the GRADE approach (2).
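The first sub-step rates each study per COSMIN box and aggregates the standards with the "worst score counts" principle described in Chapter 3.1.1. A minimal sketch of that aggregation, assuming the four ordered quality levels used by the checklist:

```python
# Ordered from best to worst, as used in the COSMIN Risk of Bias checklist.
QUALITY_LEVELS = ["very good", "adequate", "doubtful", "inadequate"]

def overall_study_quality(standard_ratings):
    """'Worst score counts': the overall methodological quality of a study
    equals the lowest rating given to any standard in the box."""
    return max(standard_ratings, key=QUALITY_LEVELS.index)
```

For example, a reliability study rated 'very good' on seven standards but 'doubtful' on one is rated 'doubtful' overall.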
3.1.1 Evaluating the methodological quality of studies

To evaluate the methodological quality of the included studies using the COSMIN Risk of Bias checklist, it should first be determined which measurement properties are assessed in each article.

Determine which measurement properties are assessed

Often multiple studies on different measurement properties are described in one article (i.e. one study for each measurement property, e.g. a study on internal consistency, construct validity, and reliability, each with its own specific design requirements). The quality of each study is separately evaluated using the corresponding COSMIN box (Table 3). The COSMIN Risk of Bias checklist should be used as a modular tool; only those boxes should be completed for the measurement properties that are evaluated in the article.

Table 3. Boxes of the COSMIN Risk of Bias checklist
Mark the measurement properties that have been evaluated in the article*.
Content validity
  Box 1. PROM development
  Box 2. Content validity
Internal structure
  Box 3. Structural validity
  Box 4. Internal consistency
  Box 5. Cross-cultural validity\measurement invariance
Remaining measurement properties
  Box 6. Reliability
  Box 7. Measurement error
  Box 8. Criterion validity
  Box 9. Hypotheses testing for construct validity
  Box 10. Responsiveness
* If a box needs to be completed more than once, two or more marks can be placed.

Sometimes the same measurement property is reported for multiple (sub)groups in one article, for example, when an instrument is validated in two different countries and the measurement properties are reported for both countries separately. In that case, the same box may need to be completed multiple times if the design of the study was different among countries. The review team should decide which boxes should be completed (and how many times). In general, we recommend reviewers to consider how the designs and analyses presented in the article relate to the COSMIN taxonomy (see Figure 1) (21), and subsequently complete the corresponding COSMIN box. This facilitates a consistent evaluation of the measurement properties across the included studies, regardless of the terminology used by the authors of the different articles.

Determining which box needs to be completed sometimes requires a subjective judgement, because the terms and definitions for measurement properties used in an article may not be similar to the terms used in the COSMIN taxonomy. For example, authors may use the term reliability when they present limits of agreement, while according to the COSMIN taxonomy this would be considered measurement error. In that case, we recommend to complete box 7 (Measurement error). Also, authors often use the term criterion validity for studies in which a PROM is compared to other PROMs that measure a similar construct. In most cases, this would be considered evidence for construct validity, rather than criterion validity, according to the COSMIN terms and definitions. In that case, we recommend to complete box 9 (Hypotheses testing for construct validity).

Sometimes results presented in studies on measurement properties actually do not (or hardly) provide any evidence on the measurement properties of a PROM, even though they are presented as such. For example, a comparison of different modes of administration (e.g. paper versus computer) does not provide information on the reliability or validity of the PROM. Also, sometimes correlations of a PROM with other variables (e.g. correlations with demographic variables) are reported and considered as evidence for construct validity; however, these correlations provide very limited evidence for construct validity. Reviewers may decide to ignore such results in their review.

Assess the methodological quality of the studies

The quality of each study on a measurement property should be assessed separately, using the corresponding COSMIN box. Each study is rated as very good, adequate, doubtful or inadequate quality. To determine the overall rating of the quality of each single study on a measurement property, the lowest rating of any standard in the box is taken (i.e. "the worst score counts" principle) (12). For example, if the lowest rating of all eight items of the reliability box is 'inadequate', the overall methodological quality of that specific reliability study is rated as 'inadequate'. In Chapter 5 we will provide more details and examples for how each standard in the COSMIN Risk of Bias checklist should be rated. We also explain in more detail how to come to the overall conclusion about the methodological quality of a study, i.e. how to apply the worst score counts principle. The COSMIN Risk of Bias checklist can be found on our website, along with a working document (created in Excel) that can be used to document the COSMIN ratings for each box.

3.1.2 Applying criteria for good measurement properties

Data extraction

We recommend to subsequently extract the data on the characteristics of the PROM(s), on characteristics of the included sample(s), on results on the measurement properties, and on information about interpretability and feasibility of the score(s) of the PROM(s). This information is needed to decide whether the results of different studies are sufficiently similar to be pooled or qualitatively summarized. This information can be presented in overview tables. We recommend to extract the required information from the articles and directly paste it into the overview tables. Examples of these tables are given in Appendices 3-7. In the Cochrane Handbook for systematic reviews of interventions it is recommended that the data extraction is done by two reviewers independently to avoid missing relevant information (16).

Evaluate each result against criteria of good measurement properties

Next, the result of each study on a measurement property should be rated against the updated criteria for good measurement properties (Table 4) (2). Each result is rated as either sufficient (+), insufficient (–), or indeterminate (?). The result of each measurement property and its quality rating can directly be added to the applicable table (Appendix 7).

Table 4. Updated criteria for good measurement properties
Measurement property | Rating (1) | Criteria

Structural validity
  +  CTT: CFA: CFI or TLI or comparable measure > 0.95, OR RMSEA < 0.06, OR SRMR < 0.08 (2)
     IRT/Rasch: No violation of unidimensionality (3): CFI or TLI or comparable measure > 0.95, OR RMSEA < 0.06, OR SRMR < 0.08; AND no violation of local independence: residual correlations among the items after controlling for the dominant factor < 0.20, OR Q3's < 0.37; AND no violation of monotonicity: adequately looking graphs, OR item scalability > 0.30; AND adequate model fit: IRT: χ2 > 0.01; Rasch: infit and outfit mean squares ≥ 0.5 and ≤ 1.5, OR Z-standardized values > -2 and < 2
  ?  CTT: Not all information for '+' reported. IRT/Rasch: Model fit not reported
  –  Criteria for '+' not met

Internal consistency
  +  At least low evidence (4) for sufficient structural validity (5) AND Cronbach's alpha(s) ≥ 0.70 for each unidimensional scale or subscale (6)
  ?  Criteria for "At least low evidence (4) for sufficient structural validity (5)" not met
  –  At least low evidence (4) for sufficient structural validity (5) AND Cronbach's alpha(s) < 0.70 for each unidimensional scale or subscale (6)

Reliability
  +  ICC or weighted Kappa ≥ 0.70
  ?  ICC or weighted Kappa not reported
  –  ICC or weighted Kappa < 0.70

Measurement error
  +  SDC or LoA < MIC (5)
  ?  MIC not defined
  –  SDC or LoA > MIC (5)

Hypotheses testing for construct validity
  +  The result is in accordance with the hypothesis (7)
  ?  No hypothesis defined (by the review team)
  –  The result is not in accordance with the hypothesis (7)

Cross-cultural validity\measurement invariance
  +  No important differences found between group factors (such as age, gender, language) in multiple group factor analysis, OR no important DIF for group factors (McFadden's R2 < 0.02)
  ?  No multiple group factor analysis OR DIF analysis performed
  –  Important differences between group factors OR DIF was found

Criterion validity
  +  Correlation with gold standard ≥ 0.70, OR AUC ≥ 0.70
  ?  Not all information for '+' reported
  –  Correlation with gold standard < 0.70, OR AUC < 0.70

Responsiveness
  +  The result is in accordance with the hypothesis (7), OR AUC ≥ 0.70
  ?  No hypothesis defined (by the review team)
  –  The result is not in accordance with the hypothesis (7), OR AUC < 0.70

The criteria are based on e.g. Terwee et al. (30) and Prinsen et al. (5). AUC = area under the curve; CFA = confirmatory factor analysis; CFI = comparative fit index; CTT = classical test theory; DIF = differential item functioning; ICC = intraclass correlation coefficient; IRT = item response theory; LoA = limits of agreement; MIC = minimal important change; RMSEA = Root Mean Square Error of Approximation; SEM = Standard Error of Measurement; SDC = smallest detectable change; SRMR = Standardized Root Mean Residuals; TLI = Tucker-Lewis index.
(1) "+" = sufficient, "–" = insufficient, "?" = indeterminate
(2) To rate the quality of the summary score, the factor structures should be equal across studies
(3) Unidimensionality refers to a factor analysis per subscale, while structural validity refers to a factor analysis of a (multidimensional) patient-reported outcome measure
(4) As defined by grading the evidence according to the GRADE approach
(5) This evidence may come from different studies
(6) The criterion 'Cronbach alpha < 0.95' was deleted, as this is relevant in the development phase of a PROM and not when evaluating an existing PROM
(7) The results of all studies should be taken together and it should then be decided if 75% of the results are in accordance with the hypotheses

The criteria provided in Table 4 are the preferred criteria for each measurement property. However, for some measurement properties alternative criteria might also be appropriate. For example, additional criteria could be used for assessing the results of studies using exploratory factor analysis (EFA) or testing bi-factor confirmatory factor analysis (CFA) models. If the review team has the expertise to assess the results of these types of studies, additional criteria could be used.

From this table, there is one aspect on which we would like to elaborate in more detail. This concerns the 'indeterminate' rating for the result of a study on internal consistency, i.e. when the criterion 'at least low evidence for sufficient structural validity' is not met. This means that either (1) there is only very low evidence for sufficient structural validity (e.g. because there was only one study on structural validity with a very low sample size), (2) there was (any) evidence for insufficient structural validity, (3) there are inconsistent results for structural validity which cannot be explained, or (4) there is no information on the structural validity available.

3.1.3 Summarize the evidence and grade the quality of the evidence

Whereas in Chapters 3.1.1 and 3.1.2 we focused on the quality of single studies on measurement properties of a PROM, in Chapters 3.1.3 and 3.1.4 we will focus on the quality of the PROM as a whole. To come to an overall conclusion on the quality of the PROM, one should first decide whether the results of all available studies per measurement property are consistent. If the results are consistent, the results of studies can be quantitatively pooled or qualitatively summarized, and compared against the criteria for good measurement properties to determine whether, overall, the measurement property of the PROM is sufficient (+), insufficient (–), inconsistent (±), or indeterminate (?). Finally, the quality of the evidence is graded (high, moderate, low, very low evidence), using a modified GRADE approach, as explained below.

If the results are inconsistent, there are several strategies that can be used: (a) find explanations and summarize per subgroup; (b) do not summarize the results and do not grade the evidence; or (c) base the conclusion on the majority of consistent results, and downgrade for inconsistency. It is up to the review team to decide which strategy is most appropriate to use in their specific situation. Below each strategy is explained in more detail.

Handling inconsistent results

Based on the quality ratings of single results (as discussed in Chapter 3.1.2), one decides whether or not all results on one measurement property of a PROM are consistent. If all results are consistent, the overall rating will be sufficient (+) or insufficient (-). If the results are inconsistent (e.g. both sufficient and insufficient results for reliability, or appropriate model fit found but for different factor structures across studies), explanations for inconsistency should be explored. For example, inconsistency could be due to different populations or methods used. If an explanation is found, the results can be summarized e.g. per subgroup of consistent results. For example, if different results are found in studies performed in acute patients versus chronic patients, results per subgroup should be separately summarized. The overall rating for the specific measurement property (e.g. reliability) may be sufficient (+) in acute patients, but insufficient (-) in chronic patients.
High quality studies could be considered as providing more evidence than low quality studies and can be considered decisive in determining the overall rating when ratings are inconsistent. If inconsistent results can be explained by the quality of the studies, one may decide to summarize the results of very good or adequate studies only, and to ignore the results of doubtful and inadequate quality studies. This should then be explained in the manuscript.
In some cases, more recent evidence can be considered more important than older evidence. This can also be taken into account when determining the overall rating.
If no explanation for inconsistent results is found, there are two possibilities: (1) the overall rating will be inconsistent (±); or (2) one could decide to base the overall rating on the majority of the results (e.g. if the majority of the (consistent) results of studies are sufficient, an overall rating of sufficient could be considered) and downgrade the quality of the evidence for inconsistency (see Chapter 3.5).
Summarize the evidence

The results can be quantitatively pooled or qualitatively summarized. We recommend to report these pooled or summarized results per measurement property per PROM in Summary of Findings (SoF) tables, accompanied by a rating of the pooled or summarized results (+/–/±/?), and the grading of the quality of evidence (high, moderate, low, very low). These SoF tables (i.e. one per measurement property) will ultimately be used in providing recommendations for the selection of the most appropriate PROM for a given purpose or a particular situation. See Appendix 8 for an example.
Quantitatively pooling the results

If possible, the results from different studies on one measurement property should be statistically pooled in a meta-analysis. Pooled estimates of measurement properties can be obtained by calculating weighted means (based on the number of participants included per study) and 95% confidence intervals (e.g. (31)). We recommend to consult a statistician for performing meta-analyses. For test-retest reliability, for example, weighted mean intraclass correlation coefficients (ICCs) and 95% confidence intervals can be calculated using a standard generic inverse variance random effects model (32). ICC values can be combined based on estimates derived from a Fisher transformation, z = 0.5 × ln((1 + ICC)/(1 − ICC)), which has an approximate variance Var(z) = 1/(N − 3), where N is the sample size. This method was applied in a study by Collins et al. (8). For construct validity, for example, it can be considered to pool all correlations of a PROM with other PROMs that measure a similar construct. For example, when evaluating the construct validity of a physical function subscale one could pool all extracted correlations of this subscale with other comparison instruments measuring physical function.
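The Fisher transformation above can be sketched as follows. Note that this is a simplified fixed-effect inverse-variance pooling for illustration only; the manual recommends a random effects model and consulting a statistician for actual meta-analyses.

```python
import math

def pool_iccs(iccs, sample_sizes):
    """Pool ICCs via Fisher's z: z = 0.5*ln((1+ICC)/(1-ICC)), Var(z) = 1/(N-3).
    Returns the pooled ICC and its 95% confidence interval.
    Simplified fixed-effect sketch, not a random effects model."""
    zs = [0.5 * math.log((1 + r) / (1 - r)) for r in iccs]
    weights = [n - 3 for n in sample_sizes]  # inverse-variance weights: 1/Var(z)
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    back = lambda z: (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)  # inverse Fisher
    return back(z_bar), (back(z_bar - 1.96 * se), back(z_bar + 1.96 * se))
```

For two studies with ICCs of 0.80 (n = 50) and 0.90 (n = 100), the pooled estimate lies between the two values, weighted towards the larger study.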
Qualitatively summarizing into a summarized result

If it is not possible to statistically pool the results, the results of studies should be qualitatively summarized to come to a summarized result, for example by providing the range (lowest and highest) of MIC values found for interpretability, the percentage of confirmed hypotheses for construct validity, or the range of each model fit parameter on a consistently found factor structure in structural validity studies.
Applying criteria for good measurement properties to the pooled or summarized result

The pooled or summarized result per measurement property per PROM should again be rated against the same quality criteria for good measurement properties (Table 4). The overall rating for the pooled or summarized result can be sufficient (+), insufficient (–), inconsistent (±), or indeterminate (?). This rating can be added to the pooled or summarized result per PROM for each measurement property in the Summary of Findings tables (Appendix 8).

If the results per study are all sufficient (or all insufficient), the overall rating will also be sufficient (or insufficient). To rate the qualitatively summarized results as sufficient (or insufficient), in principle 75% of the results should meet the criteria. For example, for structural validity the criterion is that 'at least 75% of the CFA studies found the same factor structure'. The criterion for hypotheses testing and responsiveness (construct approach) for summary results is that 'at least 75% of the results should be in accordance with the hypotheses' to rate the overall results as 'sufficient', and 'at least 75% of the results are not in accordance with the hypotheses' to rate the overall results as 'insufficient'.

If the results of single studies which can be pooled are inconsistent and the inconsistency is unexplained, the results could be pooled anyway, and this pooled result could be rated as either sufficient or insufficient, and subsequently be downgraded due to inconsistency (see also Chapter 3.1.3 and the next section). If the results of single studies which cannot be pooled (e.g. results of CFAs) are inconsistent and the inconsistency is unexplained, the overall result will be rated as 'inconsistent'. In this case, the results are actually not summarized, and the evidence will not be graded. If the results per study are all indeterminate (?), the overall rating will also be indeterminate (?).

Grading the quality of evidence

After pooling or summarizing all evidence per measurement property per PROM, and rating the pooled or summarized result against the criteria for good measurement properties, the next step is to grade the quality of this evidence. The quality of the evidence refers to the confidence that the pooled or summarized result is trustworthy. The grading of the quality is based on the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach for systematic reviews of clinical trials (20). Using a modified GRADE approach, the quality of the evidence is graded as high, moderate, low, or very low evidence (for definitions, see Table 5). The GRADE approach uses five factors to determine the quality of the evidence: risk of bias (quality of the studies), inconsistency (of the results of the studies), indirectness (evidence comes from different populations, interventions or outcomes than the ones of interest in the review), imprecision (wide confidence intervals), and publication bias (negative results are less often published). For evaluating measurement properties in systematic reviews of PROMs, the following four factors should be taken into account: (1) risk of bias (i.e. the methodological quality of the studies), (2) inconsistency (i.e. unexplained inconsistency of results across studies), (3) imprecision (i.e. total sample size of the available studies), and (4) indirectness (i.e. evidence from different populations than the population of interest in the review).
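The 75% rule for summarized hypotheses-testing results described above can be sketched as follows. Note that the '±' rating for the in-between case is our reading of the 'inconsistent' overall rating; the manual leaves such borderline cases to the judgment of the review team.

```python
def rate_summarized_hypotheses(n_confirmed, n_total):
    """Rate a qualitatively summarized hypotheses-testing result:
    '+' if at least 75% of the results accord with the hypotheses,
    '-' if at least 75% do not, '?' if there are no rateable results.
    The in-between case is rated inconsistent ('±') -- an assumption."""
    if n_total == 0:
        return "?"
    if n_confirmed / n_total >= 0.75:
        return "+"
    if (n_total - n_confirmed) / n_total >= 0.75:
        return "-"
    return "±"
```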
The fifth factor, i.e. publication bias, is difficult to assess in studies on measurement properties, because of a lack of registries for these types of studies. Therefore, we do not take this factor into account in this methodology.

Table 5. Definitions of quality levels

High: We are very confident that the true measurement property lies close to that of the estimate* of the measurement property
Moderate: We are moderately confident in the measurement property estimate: the true measurement property is likely to be close to the estimate of the measurement property, but there is a possibility that it is substantially different
Low: Our confidence in the measurement property estimate is limited: the true measurement property may be substantially different from the estimate of the measurement property
Very low: We have very little confidence in the measurement property estimate: the true measurement property is likely to be substantially different from the estimate of the measurement property

* Estimate of the measurement property refers to the pooled or summarized result of the measurement property of a PROM. These definitions were adapted from the GRADE approach (20).

The GRADE approach is used to downgrade evidence when there are concerns about the quality of the evidence. The starting point is always the assumption that the pooled or overall result is of high quality. The quality of evidence is subsequently downgraded by one or two levels per factor to moderate, low, or very low evidence when there is risk of bias, (unexplained) inconsistency, imprecision (low sample size), or indirect results. The quality of evidence can even be downgraded by three levels when the evidence is based on only one inadequate study (i.e. extremely serious risk of bias). The grading of the quality of each measurement property can directly be added to the applicable table (Appendix 8). Below we explain in more detail how the four GRADE factors can be interpreted and applied in evaluating the measurement properties of PROMs.

How to apply GRADE

For each pooled result or summarized result for each measurement property per PROM, the quality of the evidence will be determined by using Table 6. If, in summarizing the evidence and determining the overall rating of the pooled or summarized result for a measurement property, the results of some studies are ignored, these studies should also be ignored in determining the quality of the evidence. For example, if only the results of high quality studies are considered in determining the overall rating, then only the high quality studies determine the grading of the evidence (in this case we would not downgrade for risk of bias).
Table 6. Modified GRADE approach for grading the quality of evidence

Quality of evidence: High / Moderate / Low / Very low

Lower if:
  Risk of bias: -1 serious; -2 very serious; -3 extremely serious
  Inconsistency: -1 serious; -2 very serious
  Imprecision: -1 total n = 50-100; -2 total n < 50
  Indirectness: -1 serious; -2 very serious

n = sample size
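Table 6 can be read as a simple subtraction scheme: start at 'high' and subtract the levels for each concern. A minimal sketch (the sample-size thresholds follow the Imprecision row; treating n = 100 as inclusive is an assumption):

```python
LEVELS = ["high", "moderate", "low", "very low"]

def imprecision_concern(total_n):
    """Levels to downgrade for imprecision, per Table 6.
    Boundary at n = 100 assumed inclusive."""
    if total_n < 50:
        return 2
    if total_n <= 100:
        return 1
    return 0

def grade_quality(risk_of_bias=0, inconsistency=0, imprecision=0, indirectness=0):
    """Start at 'high' and downgrade by the summed concerns (floored at 'very low')."""
    total = risk_of_bias + inconsistency + imprecision + indirectness
    return LEVELS[min(total, len(LEVELS) - 1)]
```

For example, serious risk of bias (-1) combined with a total sample of 80 (-1) grades the evidence as 'low'.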
(1) Risk of bias can occur if the quality of the study is doubtful or inadequate, as assessed with the COSMIN Risk of Bias checklist, or if only one study of adequate quality is available. The quality of evidence should be downgraded by one level (e.g. from high to moderate evidence) if there is serious risk of bias, by two levels (e.g. from high to low) if there is very serious risk of bias, or by three levels (i.e. from high to very low) if there is extremely serious risk of bias. In Table 7 we explain what we consider as serious, very serious or extremely serious risk of bias.
Table 7. Instructions on downgrading for Risk of Bias

No: There are multiple studies of at least adequate quality, or there is one study of very good quality available
Serious: There are multiple studies of doubtful quality available, or there is only one study of adequate quality
Very serious: There are multiple studies of inadequate quality, or there is only one study of doubtful quality available
Extremely serious: There is only one study of inadequate quality available
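Table 7 can be sketched as a lookup on the number and best quality of the available studies. Mixed-quality sets are not fully specified by the table, so classifying by the best available quality is an assumption here:

```python
ORDER = ["very good", "adequate", "doubtful", "inadequate"]  # best to worst

def risk_of_bias_downgrade(study_qualities):
    """Levels to downgrade for risk of bias, per Table 7.
    Assumption: with multiple studies of mixed quality, the best
    available quality drives the classification."""
    best = min(study_qualities, key=ORDER.index)
    if len(study_qualities) == 1:
        # Single study: very good -> no downgrade, adequate -> serious, etc.
        return {"very good": 0, "adequate": 1, "doubtful": 2, "inadequate": 3}[best]
    # Multiple studies: at least adequate -> no downgrade, etc.
    return {"very good": 0, "adequate": 0, "doubtful": 1, "inadequate": 2}[best]
```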
(2) Inconsistency may already have been solved by pooling or summarizing the results of subgroups of studies with similar results and providing overall ratings for these subgroups. In this case, one does not need to downgrade. If no explanation for inconsistency is found, the review team can decide not to pool or summarize results, and rate the results as 'inconsistent'. In this case, no quality level for the evidence will be given. However, an alternative solution could be to rate the pooled or summarized result (e.g. based on the majority of results) as sufficient or insufficient and downgrade the quality of the evidence for inconsistency by one or two levels. The reviewers should also decide what will be considered as serious (i.e. -1 level) or very serious (i.e. -2 levels) inconsistency, because this is context dependent. It is up to the review team to decide which seems to be the best solution in each situation.
(3) Imprecision refers to the total sample size included in the studies. We recommend to downgrade by one level when the total sample size of the pooled or summarized studies is below 100, and by two levels when the total sample size is below 50. Note that this principle should not be used for measurement properties for which a sample size requirement is already included in the COSMIN Risk of Bias box, i.e. content validity, structural validity, and cross-cultural validity.
(4) Indirectness can occur if studies are included in the review that were (partly) performed in another population or another context of use than the population or context of use of interest in the systematic review. For example, if only part of the study population consists of patients with the disease of interest, the review team can decide to downgrade by one or two levels for serious or very serious indirectness. One can also consider downgrading for indirectness for construct validity or responsiveness when the evidence is considered weak, for example when the evidence is only based on comparisons with PROMs measuring different constructs, or only based on differences between extremely different groups. How to decide on what should be considered as serious or very serious indirectness is context dependent, and should be decided on by the review team.
Todeterminethegradingforthequalityofevidence,theconcernsaboutthequalityoftheevidenceshouldbeaddedup.Therefore,itishelpfultoconsidertheGRADEfactorsonebyonebyusingtheconsecutiveorderasspecifiedinTable6.First,theriskofbiasisconsidered(seeTable7).Forexample,whenthreestudiesarefoundwithsufficient(i.e.‘+’)results,butallofdoubtfulquality,thelevelofevidencewillbedowngradedforriskofbiasfromhightomoderate(i.e.‐1).Second,furtherdowngradingforotherfactorsshouldbeconsidered.Afterriskofbias,inconsistencyshouldbeconsidered.Iftheresultsofthethreestudiesintheexampleaboveareallratedassufficient,nodowngradingforinconsistencyisrequired.Otherwise,downgradingshouldbeconsidered.Next,thesamplesizeshouldbetakenintoaccount.Forexample,whenthesamplesizeofthethreestudiestogetherismorethan100,therewillbenofurtherdowngrading.Ifthe(total)samplesizeisbetweene.g.50‐100,oneshoulddowngradewith‐1.Lastly,theevidencecouldbedowngradedbecauseofindirectness.Forexample,considerasystematicreviewthatfocussesonpainandcomfortscalesforinfants,andtheinclusioncriteriais‘childrenbetween0‐18years’becausealackofstudiesininfantsonly(i.e.below1year)wasexpected.Studiesincludingchildrenofallages,includinginfants,mayleadtodowngradethequalityofevidencebyonelevel,andstudiesincludingonlychildrenbetween4and12yearsmayevenleadtodowngradethequalitybytwolevels,duetoindirectnessoftheresults.If,inourexample,thethreestudiesfoundallincludechildrenbetween0‐4,butonlyveryfewinfants,onemaydecidetodowngradeonelevel(i.e.frommoderatetolow).Inthisexample,theoverallqualityof
COSMINmanualforsystematicreviewsofPROMs
the evidence is now considered 'low', so the conclusion will be that there is low quality evidence that the measurement property of interest is sufficient. We recommend that grading is done by two reviewers independently and that consensus among the reviewers is reached, if necessary with the help of a third reviewer.
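The consecutive downgrading described above can be sketched as a simple tally. The sketch below is illustrative only and not part of the COSMIN methodology; the function name and arguments are assumptions.

```python
# Illustrative sketch of sequential GRADE downgrading: each argument gives the
# number of levels (0, 1, or 2) to downgrade for that factor, starting at "high".
LEVELS = ["high", "moderate", "low", "very low"]

def grade_quality_of_evidence(risk_of_bias=0, inconsistency=0,
                              imprecision=0, indirectness=0):
    total_downgrades = risk_of_bias + inconsistency + imprecision + indirectness
    return LEVELS[min(total_downgrades, len(LEVELS) - 1)]

# Three studies with sufficient results but all of doubtful quality:
# downgrade one level for risk of bias, from high to moderate.
print(grade_quality_of_evidence(risk_of_bias=1))                  # moderate
# The same studies including only very few infants: downgrade one
# further level for indirectness, from moderate to low.
print(grade_quality_of_evidence(risk_of_bias=1, indirectness=1))  # low
```

Because the grade can never fall below 'very low', the tally is capped at the lowest level.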
3.2. Step 5: Evaluating content validity

Content validity (i.e. the degree to which the content of a PROM is an adequate reflection of the construct to be measured (21)) is considered to be the most important measurement property, because it should be clear that the items of the PROM are relevant, comprehensive, and comprehensible with respect to the construct of interest and study population (7). The evaluation of content validity requires a subjective judgement by the reviewers. In this judgement, the PROM development study, the quality and results of additional content validity studies on the PROMs (if available), and a subjective rating of the content of the PROMs by the reviewers are taken into account. As this step is very important but rather extensive, we developed a separate users' manual for guidance on how to evaluate the content validity of PROMs. This manual can be found on the COSMIN website (33). If there is high quality evidence that the content validity of a PROM is insufficient, the PROM will not be further considered in Steps 6-8 of the systematic review and one can directly draw a recommendation for this PROM in Step 9 (i.e. recommendation 'C': "PROMs that should not be recommended (i.e. PROMs with high quality evidence of insufficient content validity)"). In all other cases, the PROM can be further taken into consideration in the systematic review.
3.3 Step 6. Evaluation of the internal structure of PROMs: structural validity, internal consistency, and cross-cultural validity\measurement invariance

Internal structure refers to how the different items in a PROM are related, which is important to know for deciding how items might be combined into a scale or subscale (6). This step concerns an evaluation of structural validity (including unidimensionality) using factor analyses or IRT or Rasch analyses; internal consistency; and cross-cultural validity and other forms of measurement invariance (using differential item functioning (DIF) analyses or Multi-Group Confirmatory Factor Analyses (MGCFA)). Here we are referring to the testing of existing PROMs, not the further refinement or development of new PROMs. These three measurement properties focus on the quality of the items and the relationships between the items, in contrast to the remaining measurement properties discussed in Step 7, which mainly focus on the quality of scales or subscales. We recommend evaluating internal structure directly after evaluating the content validity of a PROM. As evidence for structural validity (or unidimensionality) of a scale or subscale is a prerequisite for the interpretation of internal consistency analyses (i.e. Cronbach's alphas), we recommend first evaluating structural validity, followed by internal consistency and cross-cultural validity\measurement invariance.

Step 6 is only relevant for PROMs that are based on a reflective model, which assumes that all items in a scale or subscale are manifestations of one underlying construct and are expected to be correlated. An example of a reflective model is the measurement of anxiety; anxiety manifests itself in specific characteristics, such as worrying thoughts, panic, and restlessness. By asking patients about these characteristics, we can assess the degree of anxiety (i.e. all items are a reflection of the construct) (23). If the items in a scale or subscale are not supposed to be correlated (i.e. a formative model), these analyses are not relevant and Step 6 can be omitted. In other words, if factor analysis, or IRT or Rasch analysis, is performed on a PROM based on a formative model, these results can be ignored, as they are not interpretable.

If it is not reported whether a PROM is based on a reflective or formative model, the reviewers need to decide, based on the content of the PROM, whether it is likely based on a reflective or a formative model. Unfortunately, it is not always possible to decide afterwards if the instrument is based on a reflective or formative model, and thus whether structural validity is relevant. If a study included in the review presents a factor analysis, IRT or Rasch analysis and you are in doubt whether the (sub)scale is reflective, or whether it might be mixed (both reflective and formative items within a subscale), we recommend considering it a reflective model. Subsequently, the quality of the study and the quality of the results are rated.

The evaluation of structural validity, internal consistency and cross-cultural validity\measurement invariance consists of three substeps, as described in Chapter 3.1 (see Figure 4). First, the risk of bias in each study on structural validity, internal consistency and cross-cultural validity\measurement invariance is assessed using the COSMIN Risk of Bias checklist (see Chapter 5 for detailed instructions on how to rate the standards in each box). Second, data is extracted on the characteristics of the PROM(s), on characteristics of the included patient sample(s), and on the results of the measurement properties (see Appendices 3-7). The result per study is then evaluated against the criteria for good measurement properties. Third, all results per measurement property of a PROM are quantitatively pooled or qualitatively
summarized, and this pooled or summarized result is again evaluated against the criteria for good measurement properties to get an overall rating (Table 4). Finally, the quality of the evidence is graded using the modified GRADE approach, as described in Chapter 3.1.3.

The risk of bias in a study on internal consistency depends on the available evidence for structural validity, because unidimensionality is a prerequisite for the interpretation of internal consistency analyses (i.e. Cronbach's alphas). Therefore, the quality of evidence for internal consistency cannot be higher than the quality of evidence for structural validity. Note that in a systematic review of PROMs, the conclusion about structural validity may come from different studies, even from studies that were conducted after studies on internal consistency. We recommend taking the quality of evidence for structural validity as a starting point for determining the quality of evidence for internal consistency, and downgrading further for risk of bias, inconsistency (i.e. unexplained inconsistency of results across studies), imprecision (i.e. total sample size of the available studies), and indirectness, if needed.

A Cronbach's alpha based on a scale which is not unidimensional is difficult to interpret. We therefore recommend ignoring results of studies on internal consistency of scales when there is evidence that these scales are not unidimensional. For example, if results on structural validity show that a scale has three factors, the internal consistency of each of those three subscales is relevant. Cronbach's alphas on a total score can be ignored (and clinicians and researchers should be encouraged not to use these total scores) unless there is proof of unidimensionality of the total score, e.g. by a higher order or bi-factor CFA.
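The rule that the quality of evidence for internal consistency starts from the grade for structural validity and can only be downgraded further can be sketched as follows. This is an illustrative sketch; the function name and arguments are assumptions.

```python
# Illustrative sketch: evidence for internal consistency is capped by the
# grade for structural validity and may only be downgraded further.
LEVELS = ["high", "moderate", "low", "very low"]

def internal_consistency_grade(structural_validity_grade, further_downgrades=0):
    """Start from the structural validity grade; downgrade further for risk of
    bias, inconsistency, imprecision, or indirectness if needed."""
    start = LEVELS.index(structural_validity_grade)
    return LEVELS[min(start + further_downgrades, len(LEVELS) - 1)]

# Moderate quality evidence for structural validity caps internal consistency:
print(internal_consistency_grade("moderate"))                        # moderate
print(internal_consistency_grade("moderate", further_downgrades=1))  # low
```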
3.4 Step 7. Evaluation of the remaining measurement properties: reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness

Subsequently, the remaining measurement properties (reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness) should be evaluated. Unlike content validity and internal structure, the evaluation of these measurement properties mainly assesses the quality of the scale or subscale as a whole, instead of the items.

Issues regarding validity to be decided upon a priori by the review team

For assessing the quality of hypotheses testing for construct validity and of the construct approach for responsiveness, the review team should a priori decide on two issues.

First, the review team should decide whether there is a gold standard available for measuring the construct of interest in the target population. If there is no gold standard (in principle, there is no gold standard available for a PROM; only the long version of a shortened PROM can be considered a gold standard), the review team should not use the box for criterion validity and the box for the criterion approach for responsiveness in the systematic review. All evidence for validity presented in the included studies will then be considered as evidence for construct validity or the construct approach for responsiveness. If an outcome measurement instrument is considered a gold standard, studies comparing the PROM to this particular instrument are considered as evidence for criterion validity or the criterion approach for responsiveness.

Second, for interpreting the results of studies on hypotheses testing for construct validity, and of studies using a construct approach for evaluating responsiveness, the review team should formulate a set of hypotheses about expected relationships between the PROM under review and other well-defined and high quality comparator instruments which are used in the field. If possible, some hypotheses about expected differences between subgroups can also be formulated in advance. This way, the study results are compared against the same hypotheses. The expected direction (positive or negative) and magnitude (absolute or relative) of the correlations or differences should be included in the hypotheses (for examples we refer to other publications (34-37)). Without this specification it is difficult to decide afterwards whether the results are in accordance with the hypotheses or not. For example, the review team could state that they expect a correlation of at least 0.50 between the PROM under study and a comparator instrument that intends to measure the same construct, or that they expect chronically ill patients to score on average 10 points higher than acute patients on the PROM under consideration. The hypotheses may also concern the relative magnitude of correlations. For example, the review team could state that they expect that the score on PROM A correlates at least 0.10 points higher with the score on instrument (e.g. a PROM) B than with the score on instrument C. In Table 8 some generic hypotheses are presented that could be considered as a starting point for formulating hypotheses for the review (23).
Table 8. Generic hypotheses to evaluate construct validity and responsiveness

Generic hypotheses*:
1. Correlations with (changes in) instruments measuring similar constructs should be ≥ 0.50.
2. Correlations with (changes in) instruments measuring related, but dissimilar constructs should be lower, i.e. 0.30-0.50.
3. Correlations with (changes in) instruments measuring unrelated constructs should be < 0.30.
4. Correlations defined under 1, 2, and 3 should differ by a minimum of 0.10.
5. Meaningful changes between relevant (sub)groups (e.g. patients with expected high vs low levels of the construct of interest).
6. For responsiveness, AUC should be ≥ 0.70.

* Reproduced with approval from De Vet et al. (23)
AUC = Area Under the Curve, with an external measure of change used as the 'gold standard'
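The correlation thresholds of Table 8 can be applied mechanically to each reported correlation. The sketch below is illustrative; the function name and the comparator labels ('similar', 'related', 'unrelated') are assumptions, and the thresholds are applied to the magnitude of the correlation (the expected direction should be checked separately).

```python
def generic_hypothesis_confirmed(correlation, comparator):
    """Check one correlation against generic hypotheses 1-3 of Table 8."""
    r = abs(correlation)  # magnitude only; direction is checked separately
    if comparator == "similar":    # hypothesis 1: r >= 0.50
        return r >= 0.50
    if comparator == "related":    # hypothesis 2: 0.30 <= r < 0.50
        return 0.30 <= r < 0.50
    if comparator == "unrelated":  # hypothesis 3: r < 0.30
        return r < 0.30
    raise ValueError(f"unknown comparator type: {comparator}")

print(generic_hypothesis_confirmed(0.62, "similar"))    # True
print(generic_hypothesis_confirmed(0.45, "related"))    # True
print(generic_hypothesis_confirmed(0.40, "unrelated"))  # False
```

Each confirmed or rejected hypothesis then counts as one result in the summary of hypotheses testing.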
Hypotheses can be based on literature (including the studies included in the review) and (clinical) experiences of the review team members. In the original COSMIN checklist, one of the standards included in the box on hypotheses testing for construct validity and responsiveness was about whether hypotheses were stated in the study, which often led to an 'inadequate' rating. By recommending the review team to formulate hypotheses in the new methodology, the methodological quality of an included study no longer depends on whether or not hypotheses were formulated in these studies. All results found in the included articles can now be compared against this same set of hypotheses, and more results can be used to build a conclusion about the construct validity of a PROM. Additional hypotheses may need to be formulated during the review process, depending on the observed results in the included studies. We do not consider it a problem if the hypotheses are not all defined a priori, as long as the hypotheses are reported in the review, making the methodology transparent. If it is impossible to formulate a good hypothesis for a specific analysis (e.g. about an expected magnitude of difference between two subgroups), and the authors of the study did not state their expectations, the review team can consider ignoring these results, as it is unknown how the results should be interpreted.
When assessing responsiveness, one of the most difficult tasks is formulating challenging hypotheses. By challenging hypotheses we aim to show that the instrument truly measures changes in the construct(s) it purports to measure. This means that the instrument should measure changes in the purported construct(s), but also that it should measure the right amount of change, i.e. it should not under- or overestimate the real change in the construct that has occurred. This latter aspect is often overlooked in assessing responsiveness. For example, the hypotheses can concern expected mean differences between changes in groups, or expected correlations between changes in the scores on the instrument and changes in other variables, such as scores on other instruments, or demographic or clinical variables. Hypotheses about expected effect size (ES) or similar measures such as the standardized response mean (SRM) can also be used, but only when an explicit hypothesis (and rationale) for the expected magnitude of the
effect size is given. The hypotheses may also concern the relative magnitude of correlations, for example a statement that change in instrument A is expected to correlate higher with change in instrument B than with change in instrument C.

Evaluating the measurement properties and grading the quality of the evidence

The evaluation of the measurement properties consists of the three sub-steps described earlier in Chapter 3.1 (Figure 4). First, the risk of bias in each study on these measurement properties is assessed using the COSMIN Risk of Bias checklist (see Chapter 5 for detailed instructions on how to rate the standards). Second, data is extracted on the characteristics of the PROM(s), on characteristics of the included study population(s), and on the results on the measurement properties (see Appendices 3-6), and the result per study is evaluated against the criteria for good measurement properties (see Table 4). Third, all results per measurement property of a PROM are quantitatively pooled or qualitatively summarized, and this pooled or summarized result is evaluated against the criteria for good measurement properties to get an overall rating for the measurement property (Table 4).
Finally, the quality of the evidence is graded using the modified GRADE approach, as described in Chapter 3.1.

Specific aspects to consider for measurement error

When applying the criteria for good measurement error, information is needed on the Smallest Detectable Change (SDC) or Limits of Agreement (LoA), as well as on the Minimal Important Change (MIC). This information may come from different studies. The MIC should have been determined using an anchor-based longitudinal approach (38-41). The MIC is best calculated from multiple studies and by using multiple anchors. It is up to the review team to decide whether the quality of the evidence should be downgraded (e.g. one level) when information on the MIC value is available from only one study, or when the study on the MIC was not adequately performed (e.g. insufficient validity of the anchor) (42, 43). When a MIC value is determined using a distribution-based method, the MIC value does in fact not reflect what patients consider important, but is rather an indicator of the measurement error of the PROM. Results found in such studies should either be ignored or considered as evidence on measurement error (23). If not enough information is available to judge whether the SDC or LoA is smaller than the MIC, we recommend just reporting the information that is available on the SDC or LoA, without grading the quality of evidence (note that information on the MIC alone provides information on the interpretability of a PROM, see Chapter 4.1).

Specific aspects to consider for hypotheses testing and responsiveness

In studies on construct validity and responsiveness, different study designs may have been used, i.e. comparisons with other outcome measurement instruments, comparisons between groups, or comparisons before and after an intervention. Therefore, the boxes Hypotheses testing for construct validity and Responsiveness consist of two and four parts, respectively (see Chapters 5.7 and 5.8). The methodological quality of each part will be rated separately, using the "worst score counts" method (12). In general, each result (e.g. a correlation between scores of two outcome measurement instruments, or an ES) could be considered as a single study. When, for example, the PROM under study is compared to four different outcome measurement instruments,
four correlations are computed, and this could be considered as four single 'studies'. Basically, the box Hypotheses testing should then be completed four times. However, when each standard for all four 'studies' would be rated the same (e.g. the constructs measured by all four comparison outcome measurement instruments are clear and all comparison instruments are of good quality), the ratings can be combined into one risk of bias rating (i.e. the methodological quality of the studies is 'very good'). Next, each result is compared against the criterion (i.e. whether or not the hypothesis was confirmed), and reported in the results table (see Appendix 7). If the methodological quality of each 'study' is not similar, it can be assessed separately. In Appendix 7, an example is given (called 'ref 6' under hypotheses testing) of an article in which three hypotheses were tested in adequate studies, with results in accordance with the hypotheses, and three other hypotheses were tested in inadequate studies, of which one result was in accordance with the hypotheses and two results were not. Moreover, in one article both hypotheses about comparisons with other instruments and hypotheses about expected differences between subgroups could be tested. In that case, different parts of the box Hypotheses testing for construct validity will be used (parts a and b, respectively). However, when the ratings (for methodological quality) are the same, this can be reported together.

Next, the quality of the evidence should be determined. We recommend determining this similarly as for the other measurement properties. In this step, we consider all results reported in one article together as one study. We do not grade the evidence per hypothesis or per study design, because otherwise high quality evidence for hypotheses testing could too easily be reached. An example is provided in Appendix 7. Here, 11 out of 13 results are rated as sufficient. We conclude that we have consistent findings for sufficient hypotheses testing for construct validity, as 85% of the results are in line with the hypotheses. We have one very good study; therefore, we do not downgrade for risk of bias. Depending on imprecision (i.e. total sample size of the available studies) and indirectness we could downgrade, if needed.

In studies on construct validity and responsiveness, some results may provide stronger evidence than other results. For example, correlations with PROMs measuring similar constructs (i.e. convergent validity) can be considered as providing more evidence than correlations with PROMs measuring dissimilar constructs. This can be taken into account when determining the pooled or summarized result for construct validity, especially when results are inconsistent. For example, when results about comparisons with PROMs measuring similar constructs are consistently not in accordance with the hypotheses, while results about comparisons with dissimilar constructs tend to be in accordance with the hypotheses, the review team could decide to put more emphasis on the results of the hypotheses concerning comparisons with instruments measuring similar constructs.
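Summarizing the results of hypotheses testing, as in the Appendix 7 example (11 of 13 results in line with the hypotheses), amounts to computing the proportion of confirmed hypotheses. The sketch below is illustrative; the 75% threshold is taken from the COSMIN criteria for good measurement properties (Table 4), the '±' band for inconsistent results is an assumption, and the function name is made up.

```python
def summarize_hypotheses_testing(confirmed, total, threshold=0.75):
    """Overall rating from the proportion of results in line with hypotheses."""
    proportion = confirmed / total
    if proportion >= threshold:
        return "+", proportion   # sufficient
    if proportion <= 1 - threshold:
        return "-", proportion   # insufficient
    return "±", proportion       # inconsistent

rating, proportion = summarize_hypotheses_testing(11, 13)
print(rating, f"{proportion:.0%}")  # + 85%
```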
4. Part C, steps 8-10: Selecting a PROM

Part C, steps 8-10, concerns the description of data on interpretability and feasibility of the PROMs (step 8), formulating recommendations based on all evidence (step 9), and the reporting of the systematic review (step 10).

4.1 Step 8: Describe interpretability and feasibility

Interpretability is defined as the degree to which one can assign qualitative meaning (that is, clinical or commonly understood connotations) to a PROM's quantitative scores or change in scores (21). Both the interpretability of single scores and the interpretability of change scores are informative to report in a systematic review. The interpretation of single scores can be outlined by providing information on the distribution of scores in the study population or other relevant subgroups, as it may reveal clustering of scores, and it can indicate floor and ceiling effects. The interpretability of change scores can be enhanced by reporting MIC values and information on response shift (referring to changes in the meaning of one's self-evaluation of a target construct as a result of: (a) a change in the patient's internal standards of measurement (i.e. scale recalibration); (b) a change in the patient's values (i.e. the importance of component subdomains constituting the target construct); or (c) a redefinition of the target construct (i.e. reconceptualization) (44)) (see Appendix 5).

Feasibility is defined as the ease of application of the PROM in its intended context of use, given constraints such as time or money (45), for example completion time, cost of an instrument, length of the instrument, and type and ease of administration (see Appendix 6 for all aspects of feasibility) (26). Feasibility applies to patients completing the PROM (self-administered) and to researchers or clinicians who interview or hand over the PROM to patients. The concept 'feasibility' is related to the concept of 'clinical utility': whereas feasibility focusses on PROMs, clinical utility refers to an intervention (46).

Interpretability and feasibility are not measurement properties, because they do not refer to the quality of a PROM. However, they are considered important aspects for a well-considered selection of a PROM. In case there are two PROMs that are very difficult to differentiate in terms of quality, it is recommended that feasibility aspects (e.g. time aspects and budget restrictions) are taken into consideration in the selection of the most appropriate instrument. Floor and ceiling effects can indicate insufficient content validity, and can result in insufficient reliability.

Interpretability and feasibility are only described, and not evaluated, as these are no formal measurement properties (i.e. they do not tell us something about the quality of an instrument; rather, they are important aspects that should be considered in the selection of the most suitable instrument for a specific purpose).
4.2 Step 9: Formulate recommendations

Recommendations on the most suitable PROM for use in an evaluative application are formulated with respect to the construct of interest and study population. To come to an evidence-based and fully transparent recommendation, we recommend categorizing the included PROMs into three categories:

(A) PROMs with evidence for sufficient content validity (any level) AND at least low quality evidence for sufficient internal consistency;
(B) PROMs categorized not in A or C;
(C) PROMs with high quality evidence for an insufficient measurement property.

PROMs categorized as 'A' can be recommended for use and results obtained with these PROMs can be trusted. PROMs categorized as 'B' have potential to be recommended for use, but require further research to assess their quality. PROMs categorized as 'C' should not be recommended for use. When only PROMs categorized as 'B' are found in a review, the one with the best evidence for content validity could be the one to be provisionally recommended for use, until further evidence is provided. Justifications should be given as to why a PROM is placed in a certain category, and direction should be given on future validation work, if applicable. To work towards standardization in outcome measurement (e.g. core outcome set development) and to facilitate meta-analyses of studies using PROMs, we recommend subsequently advising on one most suitable PROM (5). This recommendation does not only have to be based on the evaluation of the measurement properties, but may also depend on interpretability and feasibility aspects.

4.3 Step 10: Report the systematic review

In accordance with the PRISMA Statement (18), we recommend reporting the following information:

(1) the search strategy (for example on a website or in the (online) supplemental materials to the article at issue), and the results of the literature search and selection of the studies and PROMs, displayed in the PRISMA flow diagram (including the final number of articles and the final number of PROMs included in the review) (Appendix 2);
(2) the characteristics of the included PROMs, such as name of the PROMs, reference to the article in which the development of the PROM is described, constructs being measured, language and study population for which the PROM was developed, intended context(s) of use, available language versions of the PROM, number of scales or subscales, number of items, response options, recall period, interpretability aspects, and feasibility aspects (Appendix 3);
(3) the characteristics of the included study populations, such as geographical location, language, important disease characteristics, target population, sample size, age, gender, and setting (Appendix 4);
(4) the methodological quality ratings of each study per measurement property per PROM (i.e. very good, adequate, doubtful, inadequate), the results of each study, and the
accompanying ratings of the results based on the criteria for good measurement properties (sufficient (+) / insufficient (-) / indeterminate (?)) (Appendix 7). This table could be published, for example, as an appendix or as supplemental material on the website of the journal only;
(5) a Summary of Findings (SoF) table per measurement property, including the pooled or summarized results of the measurement properties, its overall rating (i.e. sufficient (+) / insufficient (-) / inconsistent (±) / indeterminate (?)), and the grading of the quality of evidence (high, moderate, low, very low). An example of a SoF table can be found in Appendix 8. These SoF tables (i.e. one per measurement property) will ultimately be used in providing recommendations for the selection of the most appropriate PROM for a given purpose or a particular context of use. Note that these tables can be used in the data extraction process throughout the entire review.
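The categorization into A, B, and C described in Step 9 can be sketched as a small decision rule. This is an illustrative sketch; the function and argument names are assumptions, with ratings expressed as '+'/'-' and the GRADE levels as in Chapter 3.1.

```python
def categorize_prom(content_validity_sufficient,
                    internal_consistency_rating,
                    internal_consistency_grade,
                    any_high_quality_insufficient_property):
    # C: high quality evidence for an insufficient measurement property
    if any_high_quality_insufficient_property:
        return "C"
    # A: evidence for sufficient content validity (any level) AND at least
    # low quality evidence for sufficient internal consistency
    if (content_validity_sufficient
            and internal_consistency_rating == "+"
            and internal_consistency_grade in ("high", "moderate", "low")):
        return "A"
    # B: everything else
    return "B"

print(categorize_prom(True, "+", "low", False))       # A
print(categorize_prom(True, "+", "very low", False))  # B
print(categorize_prom(True, "+", "high", True))       # C
```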
5. COSMIN Risk of Bias checklist

In this chapter we elaborate on all standards included in boxes 3 to 10 of the COSMIN Risk of Bias checklist. An elaboration of all standards included in box 1 PROM development and box 2 Content validity is provided in the users' manual for guidance on how to evaluate the content validity of PROMs. This manual can be found on the COSMIN website (33). Each box contains an item asking if there were any other important methodological flaws that are not covered by the checklist, but that may lead to biased results or conclusions. For some of the boxes we will provide examples of such flaws below.

5.1 Assessing risk of bias in a study on structural validity (box 3)

Structural validity refers to the degree to which the scores of a PROM are an adequate reflection of the dimensionality of the construct to be measured (21), and is usually assessed by factor analysis.

Box 3. Structural validity
Does the scale consist of effect indicators, i.e. is it based on a reflective model? yes/no
Does the study concern unidimensionality or structural validity? unidimensionality/structural validity

The box on structural validity starts with two questions that can help the reviewer to decide whether this measurement property is relevant for the instrument under study (i.e. the question on the reflective model), or to become aware of the specific purpose of the study (i.e. the question on unidimensionality or structural validity). Both questions are not standards, and they are not used in the worst score counts method.

Structural validity is only relevant for instruments that are based on a reflective model. A reflective model is a model in which all items are a manifestation of the same underlying construct. These kinds of items are called effect indicators. They are expected to be highly correlated and interchangeable. Its counterpart is a formative model, in which the items together form the construct. These items do not need to be correlated. Therefore, structural validity is not relevant for items that are based on a formative model (24, 47, 48). For example, stress could be measured by asking about the occurrence of different situations and events that might lead to stress, such as job loss, death in the family, divorce, etc. (49). These events do not need to be correlated, thus structural validity is not relevant for such an instrument.

Often, authors do not explicitly describe whether their instrument is based on a reflective or formative model. To decide afterwards which model is used, one can do a simple "thought test". With this test one should consider whether all item scores are expected to change when the construct changes. If yes, the instrument can be considered to be based on a reflective model. If not, the PROM is probably based on a formative model (24, 47).
It is not always possible to decide afterwards if the instrument is based on a reflective or formative model, and thus whether structural validity is relevant. In this case, we recommend completing the box to assess the quality of the analyses that were performed in the included studies.

The measurement property Structural validity refers to the model fit of a factor analysis on all items in an outcome measurement instrument, e.g. to confirm a 3-factor model for an instrument with three subscales. Unidimensionality refers to whether the items in a scale or subscale measure a single construct. It is an assumption for internal consistency or IRT/Rasch analyses. For the purpose of checking unidimensionality, each subscale from a multidimensional PROM can be evaluated separately with a factor analysis or IRT/Rasch analyses. An evaluation of unidimensionality is sufficient for the interpretation of an internal consistency analysis and IRT/Rasch analysis, but it does not provide evidence for structural validity as part of construct validity of a multidimensional PROM. That is because it does not provide evidence, for example, that an instrument with three subscales indeed measures three different constructs. A question was added to the box on Structural validity about whether the aim of the study was to assess unidimensionality or to assess structural validity. Although the standards for assessing unidimensionality and structural validity are the same, the purpose is different, the conclusion about the PROM can be different, and reviewers should take this into account when reporting the results of such studies in a systematic review. This question is not a standard and it is not used in the worst score counts method. It is only a help for the reviewers to be aware of which purpose was served in the study.

Statistical methods
1. For CTT: Was exploratory or confirmatory factor analysis performed?
   Very good: confirmatory factor analysis performed
   Adequate: exploratory factor analysis performed
   Inadequate: no exploratory or confirmatory factor analysis performed
   NA: not applicable
Standard 1. To determine the structure of the instrument, a factor analysis is the preferred statistic when using CTT. Although confirmatory factor analysis is preferred over explorative factor analysis, both can be useful for the evaluation of structural validity. Confirmatory factor analysis tests whether the data fit a premeditated factor structure (50). Based on theory or previous analyses, a priori hypotheses are formulated and tested. Explorative factor analysis can be used when no clear hypotheses exist about the underlying dimensions (50).

Statistical methods
2. For IRT/Rasch: does the chosen model fit to the research question?
   Very good: chosen model fits well to the research question
   Adequate: assumable that the chosen model fits well to the research question
   Doubtful: doubtful if the chosen model fits well to the research question
   Inadequate: chosen model does not fit to the research question
   NA: not applicable
Standard 2. An example of a model that does not fit the research question is when follow-up data are combined but not analysed using a multi-level IRT model.

Statistical methods
3. Was the sample size included in the analysis adequate?
   Very good: FA: 7 times the number of items, and ≥100; Rasch/1PL models: ≥200 subjects; 2PL parametric IRT models OR Mokken scale analysis: ≥1000 subjects
   Adequate: FA: at least 5 times the number of items, and ≥100, OR at least 6 times the number of items, but <100; Rasch/1PL models: 100-199 subjects; 2PL parametric IRT models OR Mokken scale analysis: 500-999 subjects
   Doubtful: FA: 5 times the number of items, but <100; Rasch/1PL models: 50-99 subjects; 2PL parametric IRT models OR Mokken scale analysis: 250-499 subjects
   Inadequate: FA: <5 times the number of items; Rasch/1PL models: <50 subjects; 2PL parametric IRT models OR Mokken scale analysis: <250 subjects
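For classical factor analysis, the sample-size rules of standard 3 can be sketched as follows. This is an illustrative sketch; the function name is an assumption, and the IRT/Rasch thresholds are omitted for brevity.

```python
def fa_sample_size_rating(n, n_items):
    """Rate the sample size of a classical factor analysis (standard 3)."""
    if n >= 7 * n_items and n >= 100:
        return "very good"
    if (n >= 5 * n_items and n >= 100) or (n >= 6 * n_items and n < 100):
        return "adequate"
    if n >= 5 * n_items:
        return "doubtful"
    return "inadequate"

print(fa_sample_size_rating(210, 30))  # very good: 7 x 30 = 210 and >= 100
print(fa_sample_size_rating(120, 20))  # adequate: >= 5 x 20 and >= 100
print(fa_sample_size_rating(90, 16))   # doubtful: >= 5 x 16 but < 100
print(fa_sample_size_rating(60, 16))   # inadequate: < 5 x 16
```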
Standard 3. Factor analyses and IRT/Rasch analyses require large sample sizes. The proposed requirements were based on Comrey (51), Brown (chapter 10) (52), Embretson and Reise (53), Edelen and Reeve (54), Reise and Yu (55), Reise et al. (56), and Linacre (57). These sample size requirements are rules of thumb, and are context related. Sample size requirements increase as a result of the complexity of the model (54). Moreover, depending on the research question, different levels of precision may be acceptable, which again relates to the sample size needed (54). Another related consideration is the sampling distribution of the respondents. The sample should reflect the population of interest and contain enough respondents with extreme scores, so that items even at extreme ends of the construct continuum will have reasonable standard errors associated with their estimated parameters (54).

Other
4. Were there any other important flaws in the design or statistical methods of the study?
   Very good: no other important methodological flaws
   Doubtful: other minor methodological flaws (e.g. rotation method not described)
   Inadequate: other important methodological flaws (e.g. inappropriate rotation method)
We have not defined specific requirements for factor analyses, such as the choice of the explorative factor analysis (principal component analysis or common factor analysis),
the choice and justification of the rotation method (e.g. orthogonal or oblique rotation), or the decision about the number of relevant factors. Such specific requirements are described by e.g. Floyd & Widaman (50) and de Vet et al. (58). When there are serious flaws in the quality of the factor analysis, we recommend rating item 4 with "doubtful" or "inadequate".

5.2 Assessing risk of bias in a study on internal consistency (box 4)

Internal consistency refers to the degree of interrelatedness among the items, and is often assessed by Cronbach's alpha. Cronbach's alphas can be pooled across studies when the results are sufficiently similar. For an example, we refer to a study performed by Collins and colleagues (8). For an appropriate interpretation of the internal consistency parameter, the items together should form a unidimensional scale or subscale. Internal consistency and unidimensionality are not the same. Internal consistency is the interrelatedness among the items (59). Unidimensionality refers to whether the items in a scale or subscale measure a single construct. It is a prerequisite for a clear interpretation of the internal consistency statistics (59, 60), and can be investigated for example by factor analysis (61) or IRT methods (see structural validity, box 3).

Box 4. Internal consistency
Does the scale consist of effect indicators, i.e. is it based on a reflective model? yes/no

The first question in the box Internal consistency concerns the relevance of assessing the measurement property internal consistency for the PROM under study. The internal consistency statistic only gets an interpretable meaning when the interrelatedness is determined among a set of items that together form a reflective model (59, 60). See also box 3 for an explanation about reflective models.

Design requirements
1. Was an internal consistency statistic calculated for each unidimensional scale or subscale separately?
   Very good: internal consistency statistic calculated for each unidimensional scale or subscale
   Doubtful: unclear whether scale or subscale is unidimensional
   Inadequate: internal consistency statistic NOT calculated on a unidimensional scale
Standard 1. The internal consistency coefficient should be calculated for each unidimensional subscale separately. Information on unidimensionality (or structural validity) can be found either in the same study or in other studies. Based on the at least low quality evidence for structural validity, we determine whether or not the items from the (sub)scales form a unidimensional scale. When an internal consistency parameter is calculated per subscale as well as for a multidimensional total scale (for example the total score of four subscales), this latter result can be ignored, as it cannot be interpreted. If an internal consistency parameter is only reported for a multidimensional total scale, this standard should be rated 'inadequate'. If no information is found in the literature on the structural validity or unidimensionality of a PROM, this standard can be rated 'doubtful'. In this case, we recommend reporting the results found on internal consistency without drawing a conclusion.

Statistical methods

2. For continuous scores: Was Cronbach's alpha or omega calculated?
   Very good: Cronbach's alpha or omega calculated
   Doubtful: Only item-total correlations calculated
   Inadequate: No Cronbach's alpha and no item-total correlations calculated
   NA: Not applicable

3. For dichotomous scores: Was Cronbach's alpha or KR-20 calculated?
   Very good: Cronbach's alpha or KR-20 calculated
   Doubtful: Only item-total correlations calculated
   Inadequate: No Cronbach's alpha or KR-20 and no item-total correlations calculated
   NA: Not applicable

4. For IRT-based scores: Was the standard error of theta (SE(θ)) or the reliability coefficient of the estimated latent trait value (index of (subject or item) separation) calculated?
   Very good: SE(θ) or reliability coefficient calculated
   Inadequate: SE(θ) or reliability coefficient NOT calculated
   NA: Not applicable
Standards 2 and 3. Studies based on CTT should report Cronbach's alpha or omega values (62).

Standard 4. Several reliability indices are available that are based on IRT/Rasch analyses. These indices are based on one score. Examples are the standard error of theta and the index of person or subject separation.

5.3 Assessing risk of bias in a study on cross-cultural validity \ measurement invariance (box 5)

Cross-cultural validity refers to the degree to which the performance of the items on a translated or culturally adapted instrument is an adequate reflection of the performance of the items of the original version of the instrument. To assess this measurement property, data from at least two different groups are required, e.g. a Dutch and an English sample, or males and females. We recommend completing this box for studies that have evaluated cross-cultural validity for PROMs across culturally different populations. We interpret 'culturally different population' broadly: we do not only consider different ethnicity or language groups as different cultural populations, but also other groups, such as different gender or age groups, or different patient populations. Cross-cultural validity is evaluated by assessing whether the scale is measurement invariant, or whether or not Differential Item Functioning (DIF) occurs. Measurement Invariance (MI) and absence of DIF refer to whether respondents from different groups with the same latent trait level (allowing for group differences) respond similarly to a particular item.
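To make standards 2 and 3 of box 4 concrete, the following minimal Python sketch computes Cronbach's alpha from a respondents-by-items score matrix. The function name and the fictional data are ours; the formula is only interpretable when the items form a single reflective (sub)scale, as standard 1 requires.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an n-respondents-by-k-items score matrix.

    Only interpretable when the k items form one unidimensional,
    reflective (sub)scale (box 4, standard 1).
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Fictional data: 6 respondents answering a 3-item scale (1-5 response options).
scores = np.array([
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [3, 3, 4],
    [5, 4, 5],
    [2, 2, 2],
])
print(round(cronbach_alpha(scores), 2))  # → 0.95
```

McDonald's omega and IRT-based indices such as SE(θ) require an explicit measurement model and are therefore not sketched here.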
Box 5. Cross-cultural validity \ Measurement invariance

Design requirements

1. Were the samples similar for relevant characteristics except for the group variable?
   Very good: Evidence provided that samples were similar for relevant characteristics except the group variable
   Adequate: Stated (but no evidence provided) that samples were similar for relevant characteristics except the group variable
   Doubtful: Unclear whether samples were similar for relevant characteristics except the group variable
   Inadequate: Samples were NOT similar for relevant characteristics except the group variable
Standard 1. When evaluating cross-cultural validity, scores of two (or more) groups are directly compared in one statistical model. These could be language groups (e.g. when the Dutch version of a PROM is being compared to the English version of the same PROM), or groups based on another variable, such as males versus females, or healthy and chronically ill people. Apart from the group variable, the groups should be similar on relevant characteristics, such as disease severity, age, etc. In one study gender could be the group variable, while in another study (e.g. comparing two language groups) gender is considered a relevant characteristic of the groups and is expected to be similarly distributed across the groups. It is up to the review team to judge whether all relevant characteristics are similarly distributed across the groups.

Statistical methods

2. Was an appropriate approach used to analyse the data?
   Very good: A widely recognized or well justified approach was used
   Adequate: Assumable that the approach was appropriate, but not clearly described
   Doubtful: Not clear what approach was used, or doubtful whether the approach was appropriate
   Inadequate: Approach NOT appropriate
   NA: Not applicable

Standard 2. Appropriate approaches for assessing cross-cultural validity using CTT are regression analyses or confirmatory factor analysis (CFA). For an explanation of the use of ordinal regression models, we refer to Crane et al. (63) or Petersen et al. (64). For an explanation of CFA to investigate measurement invariance, we refer to Gregorich (65). An appropriate approach for assessing cross-cultural validity using IRT methods is Differential Item Functioning (DIF) analysis (66).
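As a hedged illustration of the kind of DIF analysis standard 2 refers to, the sketch below flags uniform DIF on one dichotomous item with the Mantel-Haenszel common odds ratio, stratifying on a matching score (e.g. the rest score). This is one recognized DIF technique among several; the function name and the simulated data are ours.

```python
import numpy as np

def mantel_haenszel_or(item, group, matching, n_strata=4):
    """Mantel-Haenszel common odds ratio for uniform DIF on one
    dichotomous item, stratifying on a matching score.
    An odds ratio far from 1 suggests the item functions differently
    across groups at the same trait level."""
    edges = np.quantile(matching, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, matching, side="right") - 1,
                     0, n_strata - 1)
    num = den = 0.0
    for s in range(n_strata):
        m = strata == s
        n_s = m.sum()
        a = np.sum(item[m] & (group[m] == 1))    # group 1, item endorsed
        b = np.sum(~item[m] & (group[m] == 1))   # group 1, not endorsed
        c = np.sum(item[m] & (group[m] == 0))    # group 0, endorsed
        d = np.sum(~item[m] & (group[m] == 0))   # group 0, not endorsed
        num += a * d / n_s
        den += b * c / n_s
    return num / den

rng = np.random.default_rng(1)
n = 2000
group = rng.integers(0, 2, n)   # e.g. 0 = Dutch sample, 1 = English sample
trait = rng.normal(size=n)      # stand-in for the matching (rest) score
# Build in uniform DIF: at the same trait level the item is easier for group 1.
p = 1 / (1 + np.exp(-(trait + 1.0 * group)))
item = rng.random(n) < p
print(round(mantel_haenszel_or(item, group, trait), 2))  # substantially above 1
```

In practice one would run this per item, match on the rest score (total minus the studied item), and follow up flagged items with an ordinal regression or IRT model as cited above.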
3. Was the sample size included in the analysis adequate?
   Very good: Regression analyses or IRT/Rasch-based analyses: ≥200 subjects per group; MGCFA*: ≥7 times the number of items and ≥100
   Adequate: Regression/IRT: ≥150 subjects per group; MGCFA: 5 times the number of items and ≥100, OR 5-7 times the number of items but <100
   Doubtful: Regression/IRT: ≥100 subjects per group; MGCFA: 5 times the number of items but <100
   Inadequate: Regression/IRT: <100 subjects per group; MGCFA: <5 times the number of items

* MGCFA: multiple-group confirmatory factor analysis.

Standard 3. CFA, IRT analyses and regression analyses require large sample sizes to obtain reliable results. As the results of such studies cannot be pooled, a sample size requirement was included as a standard. These recommendations are based on a publication by Scott et al. (67).

5.4 Assessing risk of bias in a study on reliability (box 6)

Reliability refers to the proportion of the total variance in the measurements that is due to 'true' differences between patients. The word 'true' must be seen in the context of CTT, which states that any observation is composed of two components: a true score and error associated with the observation. 'True' is the average score that would be obtained if the scale were administered an infinite number of times to the same person. It refers only to the consistency of the score, and not to its accuracy (22). Reliability can also be explained as the ability of a PROM to distinguish between patients. Within a homogeneous group, it is hard to distinguish between patients (23). An important assumption made in a reliability study (and in a study on measurement error) is that patients are stable on the construct to be measured between the repeated measurements.

Box 6. Reliability
Design requirements

1. Were patients stable in the interim period on the construct to be measured?
   Very good: Evidence provided that patients were stable
   Adequate: Assumable that patients were stable
   Doubtful: Unclear if patients were stable
   Inadequate: Patients were NOT stable

2. Was the time interval appropriate?
   Very good: Time interval appropriate
   Doubtful: Doubtful whether the time interval was appropriate, or time interval not stated
   Inadequate: Time interval NOT appropriate

3. Were the test conditions similar for the measurements (e.g. type of administration, environment, instructions)?
   Very good: Test conditions were similar (evidence provided)
   Adequate: Assumable that test conditions were similar
   Doubtful: Unclear if test conditions were similar
   Inadequate: Test conditions were NOT similar

Standard 1. Patients should be stable with regard to the construct to be measured between the administrations. What "stable patients" are depends on the construct to be measured and the target population. Evidence that patients were stable could be, for example, an assessment of a global rating of change completed by patients or physicians. When an intervention is given in the interim period, one can assume that (many of) the patients have changed on the construct to be measured. In that case, we recommend rating standard 1 as "inadequate".

Standard 2. The time interval between the administrations must be appropriate. The time interval should be long enough to prevent recall bias, and short enough to ensure that patients have not changed on the construct to be measured. What an appropriate time interval is depends on the construct to be measured and the target population. A time interval of about two weeks is often considered appropriate for the evaluation of PROMs (22).

Standard 3. A last requirement is that the test conditions should be similar. Test conditions refer to the type of administration (e.g. a self-administered questionnaire, interview, performance test), the setting in which the instrument was administered (e.g. at the hospital, or at home), and the instructions given. These test conditions may influence the responses of a patient. The reliability may be underestimated if the test conditions are not similar. However, in clinical practice different test conditions might be used interchangeably, and a specific research question could be whether the reliability of a PROM is still appropriate when using the PROM under different test conditions. In that case, the test conditions do not need to be similar, as they are supposed to vary as part of the research aim of the study, and this item can be rated 'very good'. See, for example, Van Leeuwen (68).
Statistical methods

4. For continuous scores: Was an intraclass correlation coefficient (ICC) calculated?
   Very good: ICC calculated, and model or formula of the ICC described
   Adequate: ICC calculated but model or formula of the ICC not described or not optimal; or Pearson or Spearman correlation coefficient calculated with evidence provided that no systematic change has occurred
   Doubtful: Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic change has occurred, or WITH evidence that systematic change has occurred
   Inadequate: No ICC and no Pearson or Spearman correlations calculated
   NA: Not applicable

5. For dichotomous/nominal/ordinal scores: Was kappa calculated?
   Very good: Kappa calculated
   Inadequate: No kappa calculated
   NA: Not applicable

6. For ordinal scores: Was a weighted kappa calculated?
   Very good: Weighted kappa calculated
   Doubtful: Unweighted kappa calculated, or not described
   NA: Not applicable

7. For ordinal scores: Was the weighting scheme described (e.g. linear, quadratic)?
   Very good: Weighting scheme described
   Doubtful: Weighting scheme NOT described
   NA: Not applicable

The preferred reliability statistic depends on the type of response options.

Standard 4. For continuous scores the intraclass correlation coefficient (ICC) is preferred (22, 69). To assess the reliability of a PROM, often a straightforward test-retest reliability study design is chosen. The preferred ICC model in that case is the two-way random effects model (70), in which, in addition to the variance within patients, the variance (i.e. systematic differences) between time points is taken into account (23). The use of the Pearson or Spearman correlation coefficient is considered doubtful when it is not clear whether systematic differences occurred, because these correlations do not take systematic error into account.

Standards 5, 6, and 7. For dichotomous or nominal scores Cohen's kappa is the preferred statistical method (22). For ordinal scales partial chance agreement should be considered, and therefore a weighted kappa (22, 71, 72) is preferred. A description of the weights (e.g. linear or quadratic weights (73)) should be given. Proportion agreement is considered not adequate, as it is a parameter for measurement error (see box 7). An unweighted kappa considers any misclassification equally inappropriate. However, a misclassification between two adjacent categories may be less erroneous than a misclassification between categories that are further apart; a weighted kappa takes this into account. The aim of a study could nevertheless be to assess the reliability of any misclassification, making an unweighted kappa an appropriate parameter. In such a study, standard 7 can be rated 'very good'.
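A minimal sketch of the parameter preferred by standard 4, the two-way random effects ICC for absolute agreement (often written ICC(2,1)), computed from the ANOVA mean squares. The helper name and the fictional test-retest data are ours; the SEM and SDC lines use the common approximation SEM = SD × √(1 − ICC) and anticipate box 7 (measurement error).

```python
import numpy as np

def icc_agreement(scores):
    """Two-way random effects, absolute agreement, single-measurement ICC
    for an n-patients-by-k-measurements matrix."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    ms_rows = k * np.sum((scores.mean(axis=1) - grand) ** 2) / (n - 1)  # patients
    ms_cols = n * np.sum((scores.mean(axis=0) - grand) ** 2) / (k - 1)  # time points
    resid = (scores - scores.mean(axis=1, keepdims=True)
             - scores.mean(axis=0, keepdims=True) + grand)
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Fictional test-retest scores of 5 patients (columns: test, retest).
test_retest = np.array([[10, 11], [20, 19], [30, 32], [40, 39], [50, 51]])
icc = icc_agreement(test_retest)
sem = test_retest.std(ddof=1) * np.sqrt(1 - icc)  # standard error of measurement
sdc = 1.96 * np.sqrt(2) * sem                     # smallest detectable change
print(f"ICC = {icc:.3f}, SEM = {sem:.2f}, SDC = {sdc:.2f}")
```

For the ordinal-score standards 6 and 7, a weighted kappa can be obtained with, for example, `sklearn.metrics.cohen_kappa_score(rater1, rater2, weights="quadratic")`.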
Other

8. Were there any other important flaws in the design or statistical methods of the study?
   Very good: No other important methodological flaws
   Doubtful: Other minor methodological flaws
   Inadequate: Other important methodological flaws

Standard 8. An example of another important flaw is when the administrations were not independent. Independent administrations imply that the first administration has not influenced the second: at the second administration, the patient should not have been aware of the scores on the first administration. Another example concerns PROMs administered in an interview. Suppose one experienced interviewer conducts the first administration for all patients, while the second administration is done either by an experienced or by an inexperienced interviewer, it is not clear which patient was interviewed by which interviewer, and this is not taken into account in the analysis. If a low ICC is found, it is then unknown whether this is due to the different characteristics of the interviewers (which could be improved by standardizing requirements for interviewers), or due to insufficient quality of the PROM.

5.5 Assessing risk of bias in a study on measurement error (box 7)

Measurement error refers to the systematic and random error of an individual patient's score that is not attributed to true changes in the construct to be measured.

Box 7. Measurement error
Design requirements

1. Were patients stable in the interim period on the construct to be measured?
   Very good: Patients were stable (evidence provided)
   Adequate: Assumable that patients were stable
   Doubtful: Unclear if patients were stable
   Inadequate: Patients were NOT stable

2. Was the time interval appropriate?
   Very good: Time interval appropriate
   Doubtful: Doubtful whether the time interval was appropriate, or time interval not stated
   Inadequate: Time interval NOT appropriate

3. Were the test conditions similar for the measurements (e.g. type of administration, environment, instructions)?
   Very good: Test conditions were similar (evidence provided)
   Adequate: Assumable that test conditions were similar
   Doubtful: Unclear if test conditions were similar
   Inadequate: Test conditions were NOT similar

Standards 1-3: see reliability (box 6).
Statistical methods

4. For continuous scores: Was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC) or Limits of Agreement (LoA) calculated?
   Very good: SEM, SDC, or LoA calculated
   Adequate: Possible to calculate LoA from the data presented
   Doubtful: SEM calculated based on Cronbach's alpha, or on the SD from another population
   NA: Not applicable

5. For dichotomous/nominal/ordinal scores: Was the percentage of (positive and negative) agreement calculated?
   Very good: Percentage of positive and negative agreement calculated
   Adequate: Percentage agreement calculated
   Inadequate: Percentage agreement not calculated
   NA: Not applicable

Standard 4. The preferred statistic for measurement error in studies based on CTT is the Standard Error of Measurement (SEM) based on a test-retest design. Note that calculating the SEM from Cronbach's alpha is considered not appropriate, because it does not take the variance between time points into account (74). Other appropriate statistics for assessing measurement error are the Limits of Agreement (LoA) and the Smallest Detectable Change (SDC) (23). Both parameters are directly related to the SEM. Changes within the LoA or smaller than the SDC are likely to be due to measurement error; changes outside the LoA or larger than the SDC should be considered real change in individual patients. Note that this does not mean that such changes are also meaningful to patients. That depends on what change is considered important, which is an issue of interpretability.

Standard 5. Kappa is often considered a measure of agreement; however, kappa is a measure of reliability (72). An appropriate parameter of measurement error (also called agreement) for dichotomous/nominal/ordinal PROM scores is percentage agreement.

5.6 Assessing risk of bias in a study on criterion validity (box 8)

Criterion validity refers to the degree to which the scores of a PROM are an adequate reflection of a 'gold standard'. In a systematic review, the review team should determine what reasonable 'gold standards' are for the construct to be measured. All studies comparing a PROM to this accepted gold standard can be considered studies on criterion validity.

Box 8. Criterion validity
Statistical methods

1. For continuous scores: Were correlations, or the area under the receiver operating curve, calculated?
   Very good: Correlations or AUC calculated
   Inadequate: Correlations or AUC NOT calculated
   NA: Not applicable

2. For dichotomous scores: Were sensitivity and specificity determined?
   Very good: Sensitivity and specificity calculated
   Inadequate: Sensitivity and specificity NOT calculated
   NA: Not applicable
Standards 1 and 2. When both the PROM and the gold standard have continuous scores, correlation is the preferred statistical method. When the instrument scores are continuous and the scores on the gold standard are dichotomous, the area under the receiver operating characteristic (ROC) curve is the preferred method; when both scores are dichotomous, sensitivity and specificity are the preferred methods.

Other

3. Were there any other important flaws in the design or statistical methods of the study?
   Very good: No other important methodological flaws
   Doubtful: Other minor methodological flaws
   Inadequate: Other important methodological flaws

Standard 3. Bias may occur, for example, when a long version of a questionnaire is compared to a short version, while the scores of the short version were computed using the responses obtained with the longer version. In that case, this standard could be rated as inadequate.

5.7 Assessing risk of bias in a study on hypotheses testing for construct validity (box 9)

Hypotheses testing for construct validity refers to the degree to which the scores of a PROM are consistent with hypotheses (for instance with regard to internal relationships, relationships to scores of other instruments, or differences between relevant groups) based on the assumption that the PROM validly measures the construct to be measured. Hypotheses testing is an ongoing, iterative process (34). The more specific the hypotheses are, and the more hypotheses are tested, the more evidence is gathered for construct validity.

Many types of hypotheses can be tested to evaluate the construct validity of a PROM. In general, these hypotheses concern comparisons with other outcome measurement instruments (that are not considered gold standards), or differences in scores between 'known' groups. For example, patients with multiple sclerosis (MS) and spasticity were expected to score statistically significantly higher on an arm and hand functioning scale than patients with MS without spasticity, because spasticity is a cause of activity limitations due to impairments of the arm and hand (68).

The box on hypotheses testing is therefore structured in two sections. To assess the risk of bias of studies comparing the PROM to comparison instruments, part a of the box (items 1-4) should be completed. For assessing the risk of bias of known-groups validity studies, part b of the box (items 5-7) should be completed. The risk of bias of studies in which MI or DIF was investigated can be assessed using box 5 (cross-cultural validity). The overall rating is the lowest rating given on any applicable item per part of the box. If in an article the PROM under study is compared both to (a) comparison instrument(s) and between subgroups, the box should be completed twice, and the results are handled as two substudies. However, when the ratings are the same, they can be combined. For example, in appendix 7 (overview table) reference 5 tested five hypotheses and was rated as 'adequate'; these five hypotheses concerned both comparisons with other instruments and expected differences between groups.
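The counting of confirmed hypotheses described above can be sketched as follows. The prespecified correlation thresholds and the simulated scores are ours; in line with the manual, the judgment is about direction and magnitude of the correlations, not about p-values.

```python
import numpy as np

# Fictional scores on the PROM under review and two comparator instruments.
rng = np.random.default_rng(7)
n = 300
trait = rng.normal(size=n)
prom = trait + rng.normal(scale=0.5, size=n)
similar = trait + rng.normal(scale=0.6, size=n)   # measures a similar construct
unrelated = rng.normal(size=n)                    # measures a different construct

# Hypotheses stated in advance: judge direction and magnitude, not p-values.
r_similar = np.corrcoef(prom, similar)[0, 1]
r_unrelated = np.corrcoef(prom, unrelated)[0, 1]
results = {
    "r with similar construct >= 0.50": r_similar >= 0.50,
    "|r| with unrelated construct <= 0.30": abs(r_unrelated) <= 0.30,
}
for hypothesis, confirmed in results.items():
    print(hypothesis, "->", "confirmed" if confirmed else "rejected")
```

The proportion of confirmed hypotheses (here 2 of 2) is what later feeds the criteria for good measurement properties, as in the fictional appendix 7 table.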
Box 9. Hypotheses testing for construct validity

9a. Comparison with other outcome measurement instruments (convergent validity)

Design requirements

1. Is it clear what the comparator instrument(s) measure(s)?
   Very good: The construct(s) measured by the comparator instrument(s) is clear
   Doubtful: The construct(s) measured by the comparator instrument(s) is not clear

2. Were the measurement properties of the comparator instrument(s) sufficient?
   Very good: Sufficient measurement properties of the comparator instrument(s) in a population similar to the study population
   Adequate: Sufficient measurement properties of the comparator instrument(s), but not sure whether these apply to the study population
   Doubtful: Some information on the measurement properties of the comparator instrument(s) in any study population
   Inadequate: No information on the measurement properties of the comparator instrument(s), OR evidence of insufficient measurement properties of the comparator instrument(s)

Standards 1 and 2. When the PROM under review is compared to another instrument, it should be known what construct the comparator instrument measures, and the instrument itself should be of sufficient quality. If not, it is difficult to interpret the results of (e.g.) correlations. This information can be reported in the article included in the review. When it is not described, the review team could try to find additional information to decide on rating these standards. When multiple comparator instruments are used in a study, and the construct is clearly described for one instrument but not for another, we recommend considering each analysis as a separate study and completing the box multiple times. We recommend doing the same in case some of the comparator instruments are of adequate quality, while others are of doubtful or inadequate quality.

Statistical methods
3. Were the design and statistical methods adequate for the hypotheses to be tested?
   Very good: Statistical methods applied appropriate
   Adequate: Assumable that statistical methods were appropriate
   Doubtful: Statistical methods applied NOT optimal
   Inadequate: Statistical methods applied NOT appropriate

Standard 3. Appropriate statistical methods could be correlations between the PROM and the comparator instrument. When, for example, Pearson correlations were calculated but the distribution of scores or mean scores (SD) were not presented, it could be considered 'assumable' that the statistical methods were appropriate. P-values should not be used in testing hypotheses, because it is not relevant to examine whether correlations statistically differ from zero (75). The validity issue is whether the direction and magnitude of a correlation are similar to what could be expected based on the construct(s) being measured. Another example of an appropriate method is when CFA or Structural Equation Modeling (SEM) is performed over multiple scales or subscales considered to measure similar or different constructs, to examine whether subscales measuring similar constructs are more related to each other than subscales measuring different constructs.

9b. Comparison between subgroups (discriminative or known-groups validity)

Design requirements

5. Was an adequate description provided of important characteristics of the subgroups?
   Very good: Adequate description of the important characteristics of the subgroups
   Adequate: Adequate description of most of the important characteristics of the subgroups
   Doubtful/Inadequate: Poor or no description of the important characteristics of the subgroups
Standard 5. When PROM scores of two subgroups are compared, important characteristics of the groups should be reported, and it should be clear on which characteristics the groups differ, for example age, gender, disease characteristics, language, etc.

Statistical methods

6. Were the design and statistical methods adequate for the hypotheses to be tested?
   Very good: Statistical methods applied appropriate
   Adequate: Assumable that statistical methods were appropriate
   Doubtful: Statistical methods applied NOT optimal
   Inadequate: Statistical methods applied NOT appropriate

Standard 6. Many different hypotheses can be formulated and tested. The users of the COSMIN Risk of Bias checklist have to decide whether or not the statistical methods used in the article are adequate for testing the stated hypotheses. P-values should be avoided in testing hypotheses, because it is not relevant to examine whether correlations statistically differ from zero (75). The validity issue is whether the direction and magnitude of a correlation are similar to what could be expected based on the construct(s) being measured. When assessing differences between groups, it is also less relevant whether these differences are statistically significant (which depends on the sample size) than whether these differences are as large as expected.

5.8 Assessing risk of bias in a study on responsiveness (box 10)

Responsiveness refers to the ability of a PROM to detect change over time in the construct to be measured. Although responsiveness is considered a separate measurement property from validity, the only difference between cross-sectional (construct and criterion) validity and responsiveness is that validity refers to the validity of a single score, whereas responsiveness refers to the validity of a change score (21). Therefore, the standards for responsiveness are similar to the standards for construct and criterion validity. Although the approach is similar, we do not use the terms construct and criterion responsiveness, because these terms are unfamiliar in the literature. As for construct and criterion validity, the design requirements for assessing responsiveness are divided into situations in which a gold standard is available (part a, standards 1-5), situations in which hypotheses are tested between change scores on the PROM under review and change scores on comparator instruments that are not gold standards (part b, standards 6-9), and situations in which hypotheses are tested about expected differences in change between subgroups (part c, standards 10-12). In addition, standards are included for situations in which hypotheses are tested about the expected magnitude of the effect of an intervention (part d, standards 13-15).

Box 10. Responsiveness

10a. Criterion approach (i.e. comparison to a gold standard)

Statistical methods

1. For continuous scores: Were correlations between change scores, or the area under the Receiver Operating Characteristic (ROC) curve, calculated?
   Very good: Correlations or area under the ROC curve (AUC) calculated
   Inadequate: Correlations or AUC NOT calculated
   NA: Not applicable

2. For dichotomous scales: Were sensitivity and specificity (changed versus not changed) determined?
   Very good: Sensitivity and specificity calculated
   Inadequate: Sensitivity and specificity NOT calculated
   NA: Not applicable
Standards 1-2 are similar to the standards for criterion validity.

10b. Construct approach (i.e. hypotheses testing; comparison with other outcome measurement instruments)

Design requirements

4. Is it clear what the comparator instrument(s) measure(s)?
   Very good: The construct(s) measured by the comparator instrument(s) is clear
   Doubtful: The construct(s) measured by the comparator instrument(s) is not clear

5. Were the measurement properties of the comparator instrument(s) sufficient?
   Very good: Sufficient measurement properties of the comparator instrument(s) in a population similar to the study population
   Adequate: Sufficient measurement properties of the comparator instrument(s), but not sure whether these apply to the study population
   Doubtful: Some information on the measurement properties of the comparator instrument(s) in any study population
   Inadequate: NO information on the measurement properties of the comparator instrument(s), OR evidence of insufficient quality of the comparator instrument(s)
Statistical methods

6. Were the design and statistical methods adequate for the hypotheses to be tested?
   Very good: Statistical methods applied appropriate
   Adequate: Assumable that statistical methods were appropriate
   Doubtful: Statistical methods applied NOT optimal
   Inadequate: Statistical methods applied NOT appropriate

Other

7. Were there any other important flaws in the design or statistical methods of the study?
   Very good: No other important methodological flaws
   Doubtful: Other minor methodological flaws
   Inadequate: Other important methodological flaws
Standards 4-7 are similar to the standards for hypotheses testing for construct validity, part a.

10c. Construct approach (i.e. hypotheses testing: comparison between subgroups)

Design requirements

8. Was an adequate description provided of important characteristics of the subgroups?
   Very good: Adequate description of the important characteristics of the subgroups
   Adequate: Adequate description of most of the important characteristics of the subgroups
   Doubtful/Inadequate: Poor or no description of the important characteristics of the subgroups

Statistical methods

9. Were the design and statistical methods adequate for the hypotheses to be tested?
   Very good: Statistical methods applied appropriate
   Adequate: Assumable that statistical methods were appropriate
   Doubtful: Statistical methods applied NOT optimal
   Inadequate: Statistical methods applied NOT appropriate

Standards 8-9 are similar to the standards for hypotheses testing for construct validity, part b.

10d. Construct approach (i.e. hypotheses testing: before and after intervention)

Design requirements

11. Was an adequate description provided of the intervention given?
    Very good: Adequate description of the intervention
    Doubtful: Poor description of the intervention
    Inadequate: NO description of the intervention

Statistical methods

12. Were the design and statistical methods adequate for the hypotheses to be tested?
    Very good: Statistical methods applied appropriate
    Adequate: Assumable that statistical methods were appropriate
    Doubtful: Statistical methods applied NOT optimal
    Inadequate: Statistical methods applied NOT appropriate
An effect size only has meaning as a measure of responsiveness if we know (or assume) beforehand what the magnitude of the effect of the intervention is. If, for example, we expect a large effect of the intervention on the construct measured by the PROM, we can test the hypothesis that the intervention results in an effect size of 0.8 or higher. But if we expect a small effect of the intervention, we would not expect such a high effect size. This example shows that a high effect size does not necessarily indicate good responsiveness. So, if the effectiveness of an intervention or the change in health status on the construct of interest is known, one could use effect sizes (mean change score / SD baseline) (76) and related measures, such as the standardised response mean (mean change score / SD change score) (77), Norman's responsiveness coefficient (σ²change / (σ²change + σ²error)) (78), and the relative efficacy statistic ((t-statistic 1 / t-statistic 2)²) (79) to evaluate responsiveness. When several instruments are compared in the same study, this could give evidence for the relative responsiveness of the instruments; but again, only if a hypothesis is tested that includes the expected magnitude of the treatment effect.

Suppose we have three measurement instruments (A, B, and C), all measuring the same construct, and the intervention given is expected to moderately affect that construct. Results show that instrument A has an effect size of 0.80, instrument B of 0.40, and instrument C of 0.15. Based on our hypothesis of a moderate effect, we should conclude that instrument B appears to best measure the construct of interest. Instrument A seems to over-estimate the treatment effect (e.g. because it shows change in persons who do not really change), and instrument C seems to under-estimate it. This example shows that it may not always be appropriate to conclude that the instrument with the highest effect size is the most responsive.

Inappropriate measures of responsiveness

Guyatt's responsiveness ratio (MIC / SD change score of stable patients) (80) is considered an inappropriate measure of responsiveness, because it takes the minimal important change into account; minimal important change concerns the interpretation of the change score, not the validity of the change score. The paired t-test is also considered an inappropriate measure of responsiveness, because it is a measure of significant change instead of valid change, and it is dependent on the sample size of the study (75).
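The effect size and standardised response mean defined above can be computed as follows; the pre/post scores are fictional and the hypothesis of a moderate effect (effect size around 0.5) is ours.

```python
import numpy as np

def effect_size(baseline, follow_up):
    """Effect size: mean change score / SD of the baseline scores (76)."""
    change = follow_up - baseline
    return change.mean() / baseline.std(ddof=1)

def standardised_response_mean(baseline, follow_up):
    """SRM: mean change score / SD of the change scores (77)."""
    change = follow_up - baseline
    return change.mean() / change.std(ddof=1)

# Fictional pre/post scores of 8 patients receiving an intervention.
baseline = np.array([40., 35., 50., 45., 38., 42., 47., 36.])
follow_up = np.array([43., 38., 52., 49., 40., 44., 52., 38.])
es = effect_size(baseline, follow_up)
srm = standardised_response_mean(baseline, follow_up)
print(f"ES = {es:.2f}, SRM = {srm:.2f}")  # → ES = 0.54, SRM = 2.55
```

Whether ES = 0.54 supports responsiveness depends entirely on the prespecified hypothesis: it does if a moderate effect was expected, and it does not if a large one was.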
Appendix 1. Example of a search strategy (81)

(1) construct: #1 (Depressive disorder[mh] OR depression[mh] OR (depress*[tiab] NOT medline[sb]))
(2) population: #2 (Diabet*[tiab])
(3) sensitive search filter developed by Terwee et al. (27)
Appendix 2. Example of a flow chart

Records identified: PubMed (n = ...), EMBASE (n = ...), ... (n = ...), additional records identified (n = ...)
Records screened after removing duplicates: n = ...
Articles selected based on title and abstract: n = ...
Reasons for exclusion: validation not the aim of the study (n = ...); different construct measured (n = ...); different study population (n = ...); different type of outcome measure (n = ...); other (n = ...)
Total included in the review: n articles describing x PROMs
Appendix 3. Example of a table on characteristics of the included PROMs

Columns: PROM* (reference to first article); construct(s); target population; mode of administration (e.g. self-report, interview-based, parent/proxy report, etc.); recall period; (sub)scale(s) (number of items); response options; range of scores / scoring; original language; available translations

* Each version of a PROM is considered a separate PROM.

Other characteristics which may be extracted are: 'administration time'; 'year of development'; 'conceptual model used'; 'recommended by standardization initiatives for [a specific patient population or for the construct to be measured]'; 'number of studies evaluating the instrument'; 'completion time (minutes)'; 'full copy available'; 'licensing information and costs'.
Appendix 4. Example of a table on characteristics of the included study populations

Columns: PROM; ref; population (N; age: mean (SD, range) in years; gender: % female); disease characteristics (disease; disease duration: mean (SD) in years; disease severity); instrument administration (setting; country; language; response rate)

Rows: PROM A (refs 1, 2, 3); PROM B (ref 1); ...

Other characteristics which may be extracted are 'study design' and 'patient selection'.
Appendix 5. Information to extract on interpretability of PROMs

The content of this table is based on the box Interpretability from the original COSMIN checklist (1).

Columns: PROM (ref); distribution of scores in the study population; percentage of missing items and percentage of missing total scores; floor and ceiling effects; scores and change scores available for relevant (sub)groups; minimal important change (MIC) or minimal important difference (MID); information on response shift

Rows: PROM A (ref 1); PROM A (ref 2); PROM A (ref 3); PROM B (ref 1); ...
Appendix 6. Information to extract on feasibility of PROMs

The content of this table is based on the guideline for selecting PROMs for Core Outcome Sets (5). One row per feasibility aspect, one column per PROM (A, B, C, D):

- Patient's comprehensibility
- Clinician's comprehensibility
- Type and ease of administration
- Length of the instrument
- Completion time
- Patient's required mental and physical ability level
- Ease of standardization
- Ease of score calculation
- Copyright
- Cost of an instrument
- Required equipment
- Availability in different settings
- Regulatory agency's requirement for approval
Appendix 7. Table on results of studies on measurement properties

Fictional example of results of the measurement properties.

| PROM (ref) | Country (language) | Measurement property | n | Methodological quality | Result (rating) |
|------------|--------------------|----------------------|---|------------------------|-----------------|
| PROM A (ref 1) | US | Structural validity | 878 | Very good | Unidimensional scale (CFI 0.97, TLI 0.97) (+) |
| PROM A (ref 1) | US | Internal consistency | 878 | Very good | Cronbach's alpha = 0.92 (+) |
| PROM A (ref 2) | Dutch | Internal consistency | 573 | Very good | PSI = 0.94 (+) |
| PROM A (ref 3) | Spanish | Internal consistency | 43 | Very good | Cronbach's alpha = 0.91 (+) |
| PROM A (ref 3) | Spanish | Reliability | 84 | Very good | ICC (agreement) = 0.85 (+) |
| PROM A (ref 4) | Dutch | Reliability | 113 | Adequate | ICC = 0.93 (+) |
| PROM A (ref 5) | UK | Reliability | 78 | Inadequate | Spearman's rho = 0.94 (+) |

Pooled or summary result (overall rating):

| Measurement property | n | Pooled or summary result (overall rating) |
|----------------------|---|-------------------------------------------|
| Structural validity | 878 | 1 factor (+) |
| Internal consistency | 1494 | 0.91-0.94 (+) |
| Reliability | 275 | 0.85-0.94 (+) |

No results on cross-cultural validity\measurement invariance were reported in this fictional example.
| PROM (ref) | Country (language) | Measurement property | n | Methodological quality | Result (rating) |
|------------|--------------------|----------------------|---|------------------------|-----------------|
| PROM A (ref 2) | Dutch | Hypotheses testing | 573 | Very good | Results in line with 2 hypotheses (2+) |
| PROM A (ref 5) | UK | Hypotheses testing | 78 | Adequate | Results in line with 5 hypotheses (5+) |
| PROM A (ref 6) | US | Hypotheses testing | 154 | Adequate | Results in line with 3 hypotheses (3+) |
| PROM A (ref 6) | US | Hypotheses testing | | Inadequate | Result in line with 1 hypothesis (1+); results not in line with 2 hypotheses (2-) |

Pooled or summary result (overall rating):

| Measurement property | n | Pooled or summary result (overall rating) |
|----------------------|---|-------------------------------------------|
| Hypotheses testing | 805 | 11+ and 2- overall (+) |

No results on measurement error, criterion validity, or responsiveness were reported in this fictional example.
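When quantitative pooling is preferred over reporting a range, one simple option is to weight Fisher-z-transformed coefficients by sample size. The sketch below is an assumption-laden illustration, not the manual's prescribed method: the study values mirror the fictional reliability results above (excluding the inadequate study), and a formal meta-analysis might instead use a random-effects model such as DerSimonian-Laird (32).

```python
import math

# Fictional per-study reliability estimates (coefficient, sample size),
# loosely mirroring Appendix 7; the inadequate study is excluded.
studies = [(0.85, 84), (0.93, 113)]

def pooled_coefficient(studies):
    """Sample-size-weighted pooling via the Fisher z transformation.

    A fixed-effect-style sketch: transform each coefficient to z,
    weight by (n - 3) as an approximate inverse-variance weight,
    and back-transform the weighted mean.
    """
    num = den = 0.0
    for r, n in studies:
        z = math.atanh(r)   # Fisher z transform
        w = n - 3           # approximate inverse-variance weight
        num += w * z
        den += w
    return math.tanh(num / den)  # back to the coefficient scale

print(f"range: {min(r for r, _ in studies):.2f}-{max(r for r, _ in studies):.2f}")
print(f"pooled: {pooled_coefficient(studies):.2f}")  # prints: pooled: 0.90
```

Note that the pooled point estimate (about 0.90) sits inside the reported range 0.85-0.93; whichever summary is chosen should then be rated against the criteria for good measurement properties.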
Appendix 8. Summary of Findings tables

| Structural validity | Summary or pooled result | Overall rating | Quality of evidence |
|---------------------|--------------------------|----------------|---------------------|
| PROM A | Unidimensional scale | sufficient | High (as there is one very good study available) |
| PROM B | | | |
| PROM C | | | |

| Internal consistency | Summary or pooled result | Overall rating | Quality of evidence |
|----------------------|--------------------------|----------------|---------------------|
| PROM A | Summarized Cronbach's alpha / PSI = 0.91-0.94; total sample size: 1494 | sufficient | High: multiple very good studies, consistent results |
| PROM B | | | |
| PROM C | | | |

| Cross-cultural validity\measurement invariance | Summary or pooled result | Overall rating | Quality of evidence |
|------------------------------------------------|--------------------------|----------------|---------------------|
| PROM A | No information available | | No information available |
| PROM B | | | |
| PROM C | | | |

| Reliability | Summary or pooled result | Overall rating | Quality of evidence |
|-------------|--------------------------|----------------|---------------------|
| PROM A | ICC range 0.85-0.93; consistent results; sample size: 275 | sufficient | High: one very good study, and consistent results |
| PROM B | | | |
| PROM C | | | |

| Measurement error | Summary or pooled result | Overall rating | Quality of evidence |
|-------------------|--------------------------|----------------|---------------------|
| PROM A | No information available | | No information available |
| PROM B | | | |
| PROM C | | | |

| Hypotheses testing | Summary or pooled result | Overall rating | Quality of evidence |
|--------------------|--------------------------|----------------|---------------------|
| PROM A | 11 out of 13 hypotheses confirmed | sufficient | High: as the unconfirmed hypotheses come from inadequate studies, we ignore these results |
| PROM B | | | |
| PROM C | | | |

| Responsiveness | Summary or pooled result | Overall rating | Quality of evidence |
|----------------|--------------------------|----------------|---------------------|
| PROM A | | | |
| PROM B | | | |
| PROM C | | | |
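The "Quality of evidence" column reflects a modified GRADE approach (20): evidence starts as high and is downgraded for risk of bias, inconsistency, imprecision, or indirectness. The bookkeeping behind that column can be sketched as below; the function and its interface are invented for illustration, and the number of levels to downgrade per factor remains a reviewer judgement, not something this code decides.

```python
GRADES = ["high", "moderate", "low", "very low"]

def quality_of_evidence(downgrades):
    """Apply GRADE-style downgrading.

    downgrades: dict mapping factor name -> levels to downgrade (0-3),
    as judged by the reviewers. Returns the resulting quality level.
    """
    allowed = {"risk of bias", "inconsistency", "imprecision", "indirectness"}
    unknown = set(downgrades) - allowed
    if unknown:
        raise ValueError(f"unknown GRADE factor(s): {unknown}")
    # Sum the downgrades, but never drop below the lowest level
    level = min(sum(downgrades.values()), len(GRADES) - 1)
    return GRADES[level]

# e.g. serious risk of bias plus serious imprecision:
print(quality_of_evidence({"risk of bias": 1, "imprecision": 1}))  # prints: low
```

With no downgrades the function returns "high", matching the PROM A rows above where all evidence comes from very good, consistent studies.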
6. References

1. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res. 2010;19(4):539-49.
2. Prinsen CA, Mokkink LB, Bouter LM, Alonso J, Patrick DL, de Vet HC, et al. COSMIN guideline for systematic reviews of outcome measurement instruments. Qual Life Res. 2018;[Epub ahead of print].
3. Ioannidis JP, Greenland S, Hlatky MA, Khoury MJ, Macleod MR, Moher D, et al. Increasing value and reducing waste in research design, conduct, and analysis. Lancet. 2014;383(9912):166-75.
4. Walton MK, Powers JA, Hobart J, et al. Clinical outcome assessments: a conceptual foundation - report of the ISPOR Clinical Outcomes Assessment Emerging Good Practices Task Force. Value Health. 2015;18:741-52.
5. Prinsen CA, Vohra S, Rose MR, Boers M, Tugwell P, Clarke M, et al. How to select outcome measurement instruments for outcomes included in a "Core Outcome Set" - a practical guideline. Trials. 2016;17(1):449.
6. Mokkink LB, de Vet HC, Prinsen CA, Patrick D, Alonso J, Bouter LM, et al. COSMIN Risk of Bias checklist for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res. 2018;[Epub ahead of print].
7. Terwee CB, Prinsen CA, Chiarotto A, de Vet HC, Westerman MJ, Patrick DL, et al. COSMIN standards and criteria for evaluating the content validity of health-related Patient-Reported Outcome Measures: a Delphi study. 2018.
8. Collins NJ, Prinsen CA, Christensen R, Bartels EM, Terwee CB, Roos EM. Knee Injury and Osteoarthritis Outcome Score (KOOS): systematic review and meta-analysis of measurement properties. Osteoarthritis Cartilage. 2016;24(8):1317-29.
9. Gerbens LA, Chalmers JR, Rogers NK, Nankervis H, Spuls PI; Harmonising Outcome Measures for Eczema initiative. Reporting of symptoms in randomized controlled trials of atopic eczema treatments: a systematic review. Br J Dermatol. 2016;175(4):678-86.
10. Chinapaw MJ, Mokkink LB, van Poppel MN, van Mechelen W, Terwee CB. Physical activity questionnaires for youth: a systematic review of measurement properties. Sports Med. 2010;40(7):539-63.
11. Speksnijder CM, Koppenaal T, Knottnerus JA, Spigt M, Staal JB, Terwee CB. Measurement properties of the Quebec Back Pain Disability Scale in patients with nonspecific low back pain: systematic review. Phys Ther. 2016;96(11):1816-31.
12. Terwee CB, Mokkink LB, Knol DL, Ostelo RW, Bouter LM, de Vet HC. Rating the methodological quality in systematic reviews of studies on measurement properties: a scoring system for the COSMIN checklist. Qual Life Res. 2012;21(4):651-7.
13. Mokkink LB, Terwee CB, Knol DL, Stratford PW, Alonso J, Patrick DL, et al. The COSMIN checklist for evaluating the methodological quality of studies on measurement properties: a clarification of its content. BMC Med Res Methodol. 2010;10:22.
14. Mokkink LB, Terwee CB, Stratford PW, Alonso J, Patrick DL, Riphagen I, et al. Evaluation of the methodological quality of systematic reviews of health status measurement instruments. Qual Life Res. 2009;18(3):313-33.
15. Terwee CB, Prinsen CA, Ricci Garotti MG, Suman A, de Vet HC, Mokkink LB. The quality of systematic reviews of health-related outcome measurement instruments. Qual Life Res. 2016;25(4):767-79.
16. Higgins JP, Green S. Cochrane Handbook for Systematic Reviews of Interventions. Version 5.1.0 [updated March 2011]. The Cochrane Collaboration; 2011. Available from: www.handbook.cochrane.org.
17. Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Reviews. 2013. Available from: http://methods.cochrane.org/sdt/handbook-dta-reviews.
18. Peterson DA, Berque P, Jabusch HC, Altenmüller E, Frucht SJ. Rating scales for musician's dystonia: the state of the art. Neurology. 2013;81(6):589-98.
19. Herr K, Bjoro K, Decker S. Tools for assessment of pain in nonverbal older adults with dementia: a state-of-the-science review. J Pain Symptom Manage. 2006;31(2):170-92.
20. GRADE Working Group. GRADE Handbook: handbook for grading the quality of evidence and the strength of recommendations using the GRADE approach. 2013.
21. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 2010;63(7):737-45.
22. Streiner DL, Norman G. Health Measurement Scales: a practical guide to their development and use. 4th ed. New York: Oxford University Press; 2008.
23. de Vet HC, Terwee CB, Mokkink L, Knol DL. Measurement in Medicine: a practical guide. Cambridge: Cambridge University Press; 2010.
24. Fayers PM, Hand DJ, Bjordal K, Groenvold M. Causal indicators in quality of life research. Qual Life Res. 1997;6(5):393-406.
25. Elbers RG, Rietberg MB, van Wegen EE, Verhoef J, Kramer SF, Terwee CB, et al. Self-report fatigue questionnaires in multiple sclerosis, Parkinson's disease and stroke: a systematic review of measurement properties. Qual Life Res. 2012;21(6):925-44.
26. Griffiths C, Armstrong-James L, White P, Rumsey N, Pleat J, Harcourt D. A systematic review of patient reported outcome measures (PROMs) used in child and adolescent burn research. Burns. 2015;41(2):212-24.
27. Terwee CB, Jansma EP, Riphagen II, de Vet HC. Development of a methodological PubMed search filter for finding studies on measurement properties of measurement instruments. Qual Life Res. 2009;18(8):1115-23.
28. Clanchy KM, Tweedy SM, Boyd R. Measurement of habitual physical activity performance in adolescents with cerebral palsy: a systematic review. Dev Med Child Neurol. 2011;53(6):499-505.
29. PRISMA Statement [Internet]. 2016 [cited 2016 Oct 24]. Available from: http://prisma-statement.org/.
30. Terwee CB, Bot SD, de Boer MR, van der Windt DA, Knol DL, Dekker J, et al. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol. 2007;60(1):34-42.
31. Ernst M, Yoo AJ, Kriston L, Schönfeld MH, Vettorazzi E, Fiehler J. Is visual evaluation of aneurysm coiling a reliable study end point? Systematic review and meta-analysis. Stroke. 2015;46(6):1574-81.
32. DerSimonian R, Laird N. Meta-analysis in clinical trials revisited. Contemp Clin Trials. 2015;45(Pt A):139-45.
33. Terwee CB, Prinsen CA, de Vet HCW, Bouter LM, Alonso J, Westerman MJ, et al. COSMIN methodology for assessing the content validity of Patient-Reported Outcome Measures (PROMs): user manual. 2017.
34. Strauss ME, Smith GT. Construct validity: advances in theory and methodology. Annu Rev Clin Psychol. 2008.
35. Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychol Bull. 1955;52:281-302.
36. McDowell I, Jenkinson C. Development standards for health measures. J Health Serv Res Policy. 1996;1:238-46.
37. Messick S. The standard problem: meaning and values in measurement and evaluation. Am Psychol. 1975;30:955-66.
38. de Vet HC, Terwee CB. The minimal detectable change should not replace the minimal important difference. J Clin Epidemiol. 2010;63(7):804-5.
39. de Vet HC, Ostelo RW, Terwee CB, van der Roer N, Knol DL, Beckerman H, et al. Minimally important change determined by a visual method integrating an anchor-based and a distribution-based approach. Qual Life Res. 2007;16(1):131-42.
40. de Vet HC, Terwee CB, Ostelo RW, Beckerman H, Knol DL, Bouter LM. Minimal changes in health status questionnaires: distinction between minimally detectable change and minimally important change. Health Qual Life Outcomes. 2006;4:54.
41. Crosby RD, Kolotkin RL, Williams GR. Defining clinically meaningful change in health-related quality of life. J Clin Epidemiol. 2003;56(5):395-407.
42. Yost KJ, Eton DT, Garcia SF, Cella D. Minimally important differences were estimated for six Patient-Reported Outcomes Measurement Information System-Cancer scales in advanced-stage cancer patients. J Clin Epidemiol. 2011;64(5):507-16.
43. van Kampen DA, Willems WJ, van Beers LW, Castelein RM, Scholtes VA, Terwee CB. Determination and comparison of the smallest detectable change (SDC) and the minimal important change (MIC) of four shoulder patient-reported outcome measures (PROMs). J Orthop Surg Res. 2013;8:40.
44. Sprangers MA, Schwartz CE. Integrating response shift into health-related quality of life research: a theoretical model. Soc Sci Med. 1999;48(11):1507-15.
45. Boers M, Kirwan JR, Tugwell P, Beaton D, Bingham CO III, Conaghan PG, et al. The OMERACT handbook. OMERACT; 2015.
46. Smart A. A multi-dimensional model of clinical utility. Int J Qual Health Care. 2006;18(5):377-82.
47. Fayers PM, Hand DJ. Factor analysis, causal indicators and quality of life. Qual Life Res. 1997;6(2):139-50.
48. Streiner DL. Being inconsistent about consistency: when coefficient alpha does and doesn't matter. J Pers Assess. 2007;80:217-22.
49. Wirth RJ, Edwards MC. Item factor analysis: current approaches and future directions. Psychol Methods. 2007;12:58-79.
50. Floyd FJ, Widaman KF. Factor analysis in the development and refinement of clinical assessment instruments. Psychol Assess. 1995;7:286-99.
51. Comrey AL, Lee HB. A first course in factor analysis. 2nd ed. Hillsdale, NJ: Erlbaum; 1992.
52. Brown T. Confirmatory factor analysis for applied research. New York: The Guilford Press; 2015.
53. Embretson SE, Reise SP. Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates; 2000.
54. Edelen MO, Reeve BB. Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Qual Life Res. 2007;16 Suppl 1:5-18.
55. Reise SP, Yu J. Parameter recovery in the graded response model using MULTILOG. J Educ Meas. 1990;27:133-44.
56. Reise SP, Moore TM, Maydeu-Olivares A. Targeted bifactor rotations and assessing the impact of model violations on the parameters of unidimensional and bifactor models. Educ Psychol Meas. 2011;71:684-711.
57. Linacre JM. Sample size and item calibration stability. Rasch Meas Trans. 1994;7(4):328.
58. de Vet HC, Ader HJ, Terwee CB, Pouwer F. Are factor analytical techniques appropriately used in the validation of health status questionnaires? A systematic review on the quality of factor analyses of the SF-36. Qual Life Res. 2005;14:1203-18.
59. Cortina JM. What is coefficient alpha? An examination of theory and applications. J Appl Psychol. 1993;78:98-104.
60. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297-334.
61. Streiner DL. Figuring out factors: the use and misuse of factor analysis. Can J Psychiatry. 1994;39:135-40.
62. Revelle W, Zinbarg RE. Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika. 2009;74(1):145-54.
63. Crane PK, Gibbons LE, Jolley L, van Belle G. Differential item functioning analysis with ordinal logistic regression techniques: DIFdetect and difwithpar. Med Care. 2006;44(11 Suppl 3):S115-23.
64. Petersen MA, Groenvold M, Bjorner JB, Aaronson N, Conroy T, Cull A, et al. Use of differential item functioning analysis to assess the equivalence of translations of a questionnaire. Qual Life Res. 2003;12(4):373-85.
65. Gregorich SE. Do self-report instruments allow meaningful comparisons across diverse population groups? Testing measurement invariance using the confirmatory factor analysis framework. Med Care. 2006;44(11 Suppl 3):S78-S94.
66. Teresi JA, Ocepek-Welikson K, Kleinman M, Eimicke JP, Crane PK, Jones RN, et al. Analysis of differential item functioning in the depression item bank from the Patient Reported Outcome Measurement Information System (PROMIS): an item response theory approach. Psychol Sci Q. 2009;51(2):148-80.
67. Scott NW, Fayers PM, Aaronson NK, Bottomley A, de Graeff A, Groenvold M, et al. A simulation study provided sample size guidance for differential item functioning (DIF) studies using short scales. J Clin Epidemiol. 2009;62:288-95.
68. van Leeuwen LM, Mokkink LB, Kamm CP, de Groot V, van den Berg P, Ostelo RW, et al. Measurement properties of the Arm Function in Multiple Sclerosis Questionnaire (AMSQ): a study based on Classical Test Theory. Disabil Rehabil. 2016:1-8.
69. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420-8.
70. McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1:30-46.
71. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull. 1968;70:213-20.
72. de Vet HC, Mokkink LB, Terwee CB, Hoekstra OS, Knol DL. Clinicians are right not to like Cohen's kappa. BMJ. 2013;346:f2125.
73. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas. 1973;33:613-9.
74. de Vet HC, Terwee CB, Knol DL, Bouter LM. When to use agreement versus reliability measures. J Clin Epidemiol. 2006;59:1033-9.
75. Altman DG. Practical statistics for medical research. London: Chapman and Hall; 1991.
76. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
77. McHorney CA, Tarlov AR. Individual-patient monitoring in clinical practice: are available health status surveys adequate? Qual Life Res. 1995;4:293-307.
78. Norman GR. Issues in the use of change scores in randomized trials. J Clin Epidemiol. 1989;42(11):1097-105.
79. Stockler MR, Osoba D, Goodwin P, Corey P, Tannock IF. Responsiveness to change in health-related quality of life in a randomized clinical trial: a comparison of the Prostate Cancer Specific Quality of Life Instrument (PROSQOLI) with analogous scales from the EORTC QLQ-C30 and a trial specific module. European Organization for Research and Treatment of Cancer. J Clin Epidemiol. 1998;51(2):137-45.
80. Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis. 1987;40(2):171-8.
81. van Dijk SE, Adriaanse MC, van der Zwaan L, Bosmans JE, van Marwijk HWJ, van Tulder MW, Terwee CB. Measurement properties of questionnaires for depressive symptoms in adult patients with type 1 or type 2 diabetes: a systematic review. Qual Life Res. 2018; doi: 10.1007/s11136-018-1782-y. [Epub ahead of print].