COSMIN methodology for systematic reviews of Patient-Reported Outcome Measures (PROMs)

User manual

Version 1.0, dated February 2018

Lidwine B Mokkink, Cecilia AC Prinsen, Donald L Patrick, Jordi Alonso, Lex M Bouter, Henrica CW de Vet, Caroline B Terwee

Contact:
LB Mokkink, PhD
VU University Medical Center
Department of Epidemiology and Biostatistics
Amsterdam Public Health research institute
1081 BT Amsterdam
The Netherlands
Website: www.cosmin.nl
E-mail: [email protected]


The development of the COSMIN guideline for systematic reviews of Patient-Reported Outcome Measures (PROMs) and the COSMIN Risk of Bias checklist for systematic reviews of PROMs was financially supported by the Amsterdam Public Health research institute, VU University Medical Center, Amsterdam, the Netherlands.


Table of contents

Foreword
1. Background information
   1.1 The COSMIN initiative
   1.2 How to cite this manual
   1.3 Development of a comprehensive COSMIN guideline for systematic reviews of Patient-Reported Outcome Measures (PROMs)
   1.4 COSMIN taxonomy and definitions
   1.5 The COSMIN Risk of Bias checklist
   1.6 What has changed in the COSMIN methodology?
   1.7 Ten-step procedure & outline of the manual
2. Part A, Steps 1-4: Perform the literature search
   2.1 Step 1: Formulate the aim of the review
   2.2 Step 2: Formulate eligibility criteria
   2.3 Step 3: Perform a literature search
   2.4 Step 4: Select abstracts and full-text articles
3. Part B, Steps 5-7: Evaluating the measurement properties of the included PROMs
   3.1 General methodology
   3.1.2 Applying criteria for good measurement properties
   3.2 Step 5: Evaluating content validity
   3.3 Step 6: Evaluation of the internal structure of PROMs: structural validity, internal consistency, and cross-cultural validity\measurement invariance
   3.4 Step 7: Evaluation of the remaining measurement properties: reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness
4. Part C, Steps 8-10: Selecting a PROM
   4.1 Step 8: Describe interpretability and feasibility
   4.2 Step 9: Formulate recommendations
   4.3 Step 10: Report the systematic review
5. COSMIN Risk of Bias checklist
   5.1 Assessing risk of bias in a study on structural validity (box 3)
   5.2 Assessing risk of bias in a study on internal consistency (box 4)
   5.3 Assessing risk of bias in a study on cross-cultural validity\measurement invariance (box 5)
   5.4 Assessing risk of bias in a study on reliability (box 6)
   5.5 Assessing risk of bias in a study on measurement error (box 7)
   5.6 Assessing risk of bias in a study on criterion validity (box 8)
   5.7 Assessing risk of bias in a study on hypotheses testing for construct validity (box 9)
   5.8 Assessing risk of bias in a study on responsiveness (box 10)
Appendix 1. Example of a search strategy (81)
Appendix 2. Example of a flow chart
Appendix 4. Example of a table on characteristics of the included study populations
Appendix 5. Information to extract on interpretability of PROMs
Appendix 6. Information to extract on feasibility of PROMs
Appendix 7. Table on results of studies on measurement properties
Appendix 8. Summary of Findings Tables
6. References


Foreword

Research performed with outcome measurement instruments of poor or unknown quality constitutes a waste of resources and is unethical (3). Unfortunately, this practice is widespread (4). Selecting the best outcome measurement instrument for the outcome of interest in a methodologically sound way requires: (1) high-quality studies that document the evaluation of the measurement properties (in total nine different aspects of reliability, validity, and responsiveness) of relevant outcome measurement instruments in the target population; and (2) a high-quality systematic review of studies on measurement properties in which all information is gathered and evaluated in a systematic and transparent way, accompanied by clear recommendations for the most suitable available outcome measurement instrument. However, conducting such a systematic review is quite complex and time-consuming, and it requires expertise within the research team on the construct to be measured, on the patient population, and on the methodology of studies of measurement properties.

High-quality systematic reviews can provide a comprehensive overview of the measurement properties of Patient-Reported Outcome Measures (PROMs) and support evidence-based recommendations for the selection of the most suitable PROM for a given purpose, for example for an outcome included in a core outcome set (COS) (5). These systematic reviews can also identify gaps in knowledge on the measurement properties of PROMs, which can be used to design new studies on measurement properties.

The COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) initiative aims to improve the selection of outcome measurement instruments in research and clinical practice by developing tools for selecting the most suitable instrument for the situation at issue. Recently, a comprehensive methodological guideline for systematic reviews of PROMs was developed by the COSMIN initiative (2). In addition, the COSMIN checklist was adapted for assessing the risk of bias in studies on measurement properties in systematic reviews of PROMs (6). Also, a Delphi study was performed to develop standards and criteria for assessing the content validity of PROMs (7).

The present manual is a supplement to the COSMIN guideline for systematic reviews of measurement instruments (2), including the use of the COSMIN Risk of Bias checklist (6). This manual contains additional detailed information on how to perform each of the proposed ten steps to conduct a systematic review of PROMs, on how to use the COSMIN Risk of Bias checklist, and on how to come to recommendations on the most suitable instrument for a given purpose.

In the COSMIN methodology we use the word patient. However, sometimes the target population of the systematic review or the PROM does not consist of patients but of, for example, healthy individuals (e.g. for generic PROMs) or caregivers (when a PROM measures caregiver burden). In these cases, the word patient should be read as, for example, healthy person or caregiver.


Throughout the manual we provide some examples to explain the COSMIN Risk of Bias checklist. We would like to emphasize that these are arbitrarily chosen and used for illustrative purposes.

The COSMIN methodology focuses on PROMs used as outcome measurement instruments (i.e. evaluative applications). The methodology can also be used for other types of measurement instruments (such as clinician-reported outcome measures or performance-based outcome measures) or other applications (e.g. diagnostic or predictive applications), but it may need to be adapted for these other purposes. For example, structural validity may not be relevant for some types of instruments, and responsiveness is less relevant when an instrument is used as a diagnostic instrument. Similarly, new standards would need to be developed for assessing the quality of a study on, for example, the reliability of an MRI scan.

This manual aims to facilitate the understanding and practical performance of a systematic review of PROMs. We give detailed instructions on each step to take, and we provide many examples and suggestions on how to perform each step.

We aim to continue updating this manual when deemed necessary, based on experience and suggestions by users of the COSMIN methodology. If you have any suggestions or questions, please contact us ([email protected]).


1. Background information

1.1 The COSMIN initiative

The COSMIN initiative aims to improve the selection of health measurement instruments both in research and clinical practice by developing tools for selecting the most suitable instrument for a given situation. COSMIN is an international initiative consisting of a multidisciplinary team of researchers with expertise in epidemiology, psychometrics, and qualitative research, and in the development and evaluation of outcome measurement instruments in the field of health care, as well as in performing systematic reviews of outcome measurement instruments.

COSMIN steering committee

Lidwine B Mokkink¹, Cecilia AC Prinsen¹, Donald L Patrick², Jordi Alonso³, Lex M Bouter¹, Henrica CW de Vet¹, Caroline B Terwee¹

¹ Department of Epidemiology and Biostatistics, Amsterdam Public Health research institute, VU University Medical Center, De Boelelaan 1089a, 1081 HV, Amsterdam, the Netherlands;

² Department of Health Services, University of Washington, Canal St Research Office, 146 N Canal, Suite 310, 98103, Seattle, USA;

³ IMIM (Hospital del Mar Medical Research Institute), CIBER Epidemiology and Public Health (CIBERESP), Dept. Experimental and Health Sciences, Pompeu Fabra University (UPF), Doctor Aiguader 88, 08003, Barcelona, Spain.


1.2 How to cite this manual

This manual was based on three articles, published in peer-reviewed scientific journals. If you use the COSMIN methodology, please refer to these articles:

Prinsen, C.A.C., Mokkink, L.B., Bouter, L.M., Alonso, J., Patrick, D.L., De Vet, H.C., et al. (2018). COSMIN guideline for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res, accepted.

Mokkink, L.B., de Vet, H.C.W., Prinsen, C.A.C., Patrick, D.L., Alonso, J., Bouter, L.M., Terwee, C.B. (2017). COSMIN Risk of Bias checklist for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res. doi:10.1007/s11136-017-1765-4 [Epub ahead of print].

Terwee CB, Prinsen CAC, Chiarotto A, Westerman MJ, Patrick DL, Alonso J, Bouter LM, de Vet HCW, Mokkink LB. COSMIN methodology for evaluating the content validity of Patient-Reported Outcome Measures: a Delphi study, submitted.

1.3 Development of a comprehensive COSMIN guideline for systematic reviews of Patient-Reported Outcome Measures (PROMs)

In the absence of empirical evidence, the COSMIN guideline for systematic reviews of PROMs (2) was based on the experience that we (that is, the COSMIN steering committee) have gained over the past years in conducting systematic reviews of PROMs (8, 9), in supporting other systematic reviewers in their work (10, 11), and in the development of the COSMIN methodology (12, 13). In addition, we have studied the quality of systematic reviews of PROMs in two consecutive reviews (14, 15), and in reviews that have used the COSMIN methodology we have specifically searched for the comments made by review authors relating to the COSMIN methodology. Further, we have had iterative discussions within the COSMIN steering committee, both at face-to-face meetings (CP, WM, HdV and CT) and by email. We gained experience from the results of a recent Delphi study on the content validity of PROMs (7), and from the results of a previous Delphi study on the selection of outcome measurement instruments for outcomes included in core outcome sets (COS) (5).

Further, the guideline was developed in concordance with existing guidelines for reviews, such as the Cochrane Handbook for systematic reviews of interventions (16) and for diagnostic test accuracy reviews (17), the PRISMA Statement (18), the Institute of Medicine (IOM) standards for systematic reviews of comparative effectiveness research (19), and the Grading of Recommendations Assessment, Development and Evaluation (GRADE) principles (20).

This guideline focuses on the methodology of systematic reviews of existing PROMs, for which at least a PROM development study or some information on its measurement properties is available, and that are used for evaluative purposes (e.g. to measure the effects of treatment or to monitor patients over time). Instruments that are used for discriminative or predictive purposes are not considered in the guideline. The methods can also be used for systematic reviews of other types of outcome measurement instruments, such as clinician-reported outcome measurement instruments (ClinROMs) or performance-based outcome measurement instruments (PerBOMs), for reviews of one single outcome measurement instrument, or for a predefined set of outcome measurement instruments. However, some adaptations may be needed in some of the steps or in the COSMIN Risk of Bias checklist.

1.4 COSMIN taxonomy and definitions

In the literature, different terminology and definitions for measurement properties are continuously being used. The COSMIN initiative developed a taxonomy of measurement properties relevant for evaluating PROMs. In the first COSMIN Delphi study, conducted in 2006-2007, consensus was reached on terminology and definitions of all measurement properties included in the COSMIN checklist (21). This taxonomy (Figure 1) formed the foundation on which the guideline was based.

Figure 1. The COSMIN taxonomy (21)


Taxonomy of measurement properties

The COSMIN taxonomy of measurement properties is presented in Figure 1. It was decided that all measurement properties included in the taxonomy are relevant and should be evaluated for PROMs used in an evaluative application.

In assessing the quality of a PROM we distinguish three domains, i.e. reliability, validity, and responsiveness. Each domain contains one or more measurement properties, i.e. quality aspects of measurement instruments. The domain reliability contains three measurement properties: internal consistency, reliability, and measurement error. The domain validity also contains three measurement properties: content validity (including face validity), construct validity (including structural validity, hypotheses testing for construct validity, and cross-cultural validity), and criterion validity. The domain responsiveness contains only one measurement property, which is also called responsiveness.

Definitions of measurement properties

Consensus-based definitions of all measurement properties included in the COSMIN checklist are presented in Table 1.


Table 1. COSMIN definitions of domains, measurement properties, and aspects of measurement properties (21)
(indentation indicates measurement properties within a domain, and aspects within a measurement property)

Reliability (domain): The degree to which the measurement is free from measurement error
Reliability (extended definition): The extent to which scores for patients who have not changed are the same for repeated measurement under several conditions: e.g. using different sets of items from the same PROM (internal consistency); over time (test-retest); by different persons on the same occasion (inter-rater); or by the same persons (i.e. raters or responders) on different occasions (intra-rater)
  Internal consistency: The degree of the interrelatedness among the items
  Reliability: The proportion of the total variance in the measurements which is due to 'true'† differences between patients
  Measurement error: The systematic and random error of a patient's score that is not attributed to true changes in the construct to be measured

Validity (domain): The degree to which a PROM measures the construct(s) it purports to measure
  Content validity: The degree to which the content of a PROM is an adequate reflection of the construct to be measured
    Face validity: The degree to which (the items of) a PROM indeed looks as though they are an adequate reflection of the construct to be measured
  Construct validity: The degree to which the scores of a PROM are consistent with hypotheses (for instance with regard to internal relationships, relationships to scores of other instruments, or differences between relevant groups) based on the assumption that the PROM validly measures the construct to be measured
    Structural validity: The degree to which the scores of a PROM are an adequate reflection of the dimensionality of the construct to be measured
    Hypotheses testing: Idem construct validity
    Cross-cultural validity: The degree to which the performance of the items on a translated or culturally adapted PROM are an adequate reflection of the performance of the items of the original version of the PROM
  Criterion validity: The degree to which the scores of a PROM are an adequate reflection of a 'gold standard'

Responsiveness (domain): The ability of a PROM to detect change over time in the construct to be measured
  Responsiveness: Idem responsiveness

Interpretability*: Interpretability is the degree to which one can assign qualitative meaning - that is, clinical or commonly understood connotations - to a PROM's quantitative scores or change in scores

† The word 'true' must be seen in the context of the CTT, which states that any observation is composed of two components - a true score and error associated with the observation. 'True' is the average score that would be obtained if the scale were given an infinite number of times. It refers only to the consistency of the score, and not to its accuracy (22)
* Interpretability is not considered a measurement property, but an important characteristic of a measurement instrument


1.5 The COSMIN Risk of Bias Checklist

It was decided to create three separate versions of the original COSMIN checklist (1), to be used for different purposes when assessing the methodological quality of studies on measurement properties: the COSMIN Design checklist, the COSMIN Risk of Bias checklist, and the COSMIN Reporting checklist. The Risk of Bias checklist was exclusively developed for assessing the methodological quality of single studies included in systematic reviews of PROMs. The purpose of assessing the methodological quality of a study in a systematic review is to screen for risk of bias in the included studies. The term 'risk of bias' is in compliance with the Cochrane methodology for systematic reviews of trials and diagnostic studies (16). It refers to whether the results of a study are trustworthy, given its methodological quality.

The checklist contains standards referring to design requirements and preferred statistical methods of studies on measurement properties. For each measurement property, a COSMIN box was developed containing all standards needed to assess the quality of a study on that specific measurement property. Preferred statistical methods based on both Classical Test Theory (CTT) and Item Response Theory (IRT) or Rasch analyses are included in the standards.

Boxes from the checklist

The COSMIN Risk of Bias checklist contains ten boxes with standards for PROM development (box 1) and for nine measurement properties: Content validity (box 2), Structural validity (box 3), Internal consistency (box 4), Cross-cultural validity\measurement invariance (box 5), Reliability (box 6), Measurement error (box 7), Criterion validity (box 8), Hypotheses testing for construct validity (box 9), and Responsiveness (box 10).

COSMIN standards and criteria

It is important to note that COSMIN makes a distinction between "standards" and "criteria": standards refer to design requirements and preferred statistical methods for evaluating the methodological quality of studies on measurement properties; criteria refer to what constitutes good measurement properties (quality of PROMs). In the first COSMIN Delphi study (1), only standards were developed for evaluating the quality of studies on measurement properties. Criteria for what constitutes good measurement properties were not developed. However, such criteria are needed in systematic reviews to provide evidence-based recommendations for which PROMs are good enough to be used in research or clinical practice. Therefore, criteria were developed for good measurement properties (Table 4) (2).


Modular tool

In one article, one or more studies (often on different measurement properties) can be described. The methodological quality of each study should be assessed separately by rating all standards included in the corresponding box. Therefore, the COSMIN checklist should be used as a modular tool. This means that it may not be necessary to complete the whole checklist when evaluating the quality of studies described in an article. In accordance with the COSMIN taxonomy (21), the measurement properties evaluated in an article determine which boxes need to be completed. Each assessment of a measurement property is considered to be a separate study. For example, if in an article the internal consistency and construct validity of an instrument were assessed, only two boxes need to be completed, i.e. box 4 Internal consistency and box 9 Hypotheses testing for construct validity. This modular system was developed because not all measurement properties are assessed in all articles.

Four-point rating system

For each study, an overall judgement is needed on the quality of that particular study. We therefore developed a four-point rating system in which each standard within a COSMIN box can be rated as 'very good', 'adequate', 'doubtful' or 'inadequate' (6). The overall rating of the quality of each study is determined by taking the lowest rating of any standard in the box (i.e. "the worst score counts" principle) (12). This overall rating of the quality of the studies can be used in grading the quality of the evidence (see Chapter 3), taking into account that results of 'inadequate' quality studies will decrease the trust one can put in the results and the overall conclusion about a measurement property of a PROM (as used in a specific context).

Each box contains a standard asking whether there were any other important methodological flaws that are not covered by the checklist, but that may lead to biased results or conclusions. For example, in a reliability study, the administrations should be independent. Independent administrations imply that the first administration has not influenced the second administration, i.e. at the second administration the patient should not have been aware of the scores on the first administration (recall bias).
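To illustrate the "worst score counts" principle, the following minimal sketch (in Python; not part of the COSMIN materials, and the function and variable names are our own) derives the overall rating of a study from the ratings of its individual standards:

```python
# Illustrative sketch of the "worst score counts" principle.
# Ratings are the four levels used in the COSMIN Risk of Bias checklist,
# ordered here from worst to best.
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]

def overall_study_rating(standard_ratings):
    """Overall quality of a study: the lowest rating of any standard in the box."""
    return min(standard_ratings, key=RATING_ORDER.index)

# Example: one 'doubtful' standard pulls the whole study rating down to 'doubtful'.
print(overall_study_rating(["very good", "adequate", "doubtful", "very good"]))
```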

COSMIN considers each subscale of a (multi-dimensional) PROM separately

The measurement properties of a PROM should be rated separately for each set of items that makes up a score. This can be a single item if this item is used as a standalone score, a set of items making up a subscale score within a multi-dimensional PROM, or a total PROM score if all items of a PROM are summarized into a total score. Each score is assumed to represent a construct and is therefore considered a separate PROM. In the remainder of this manual, when we refer to a PROM, we mean a PROM score or subscore. For example, if a multidimensional PROM consists of three subscales and one single item, each scored separately, the measurement properties of the three subscales and the single item need to be rated separately. If the subscale scores are also summarized into a total score, the measurement properties of the total score should also be rated separately.


1.6 What has changed in the COSMIN methodology?

Inadequate studies are no longer ignored

In our previous protocol for performing systematic reviews of PROMs, we recommended to ignore the results of inadequate (previously called 'poor') quality studies, because the results of these studies might be biased. In the current methodology, we recommend that results from inadequate studies may be included when pooling or summarizing the results from all studies is possible, or if the results of inadequate studies are consistent with the results from studies of better quality. If the results from inadequate studies are included, it should be considered to downgrade the quality of the evidence because of risk of bias (see Chapter 3). Herewith, the current methodology is more in line with the recommendations from the Cochrane Collaboration for systematic reviews of intervention studies (16) and of diagnostic test accuracy studies (17).

Removing standards about a reasonable gold standard for criterion validity and responsiveness

We decided to delete the standards about deciding whether a gold standard used can be considered a reasonable gold standard. We now recommend the review team to determine, before assessing the methodological quality of studies, which outcome measurement instrument can be considered a reasonable gold standard. When an included study uses this particular gold standard instrument to assess validity, the study can be considered a study on criterion validity. The COSMIN panel reached consensus that no gold standard exists for PROMs (13). The only exception is when a shortened instrument is compared to the original long version. In that case, the original long version can be considered the gold standard. Often, authors incorrectly consider their comparator instrument a gold standard, for example when they compare the scores of a new instrument to a widely used and well-known instrument. In this situation, the study is considered a study on construct validity and box 9 Hypotheses testing for construct validity should be completed.

Removing standards on formulating hypotheses for hypotheses testing for construct validity and responsiveness

We decided to delete all standards about formulating hypotheses a priori from the boxes Hypotheses testing for construct validity and Responsiveness. Although we consider it of major importance to define hypotheses in advance when assessing the construct validity or responsiveness of a PROM, results of studies without these hypotheses can - in many cases - still be used in a systematic review of PROMs, because the presented correlations or mean differences between (sub)groups are not necessarily biased and thus can be evaluated. The conclusions of the authors, though, may be biased when a priori hypotheses are lacking. We recommend that the review team formulates a set of hypotheses about the expected direction and magnitude of correlations between the PROM of interest and other PROMs, and of mean differences in scores between groups (23). This way, all results are compared to the same relevant hypotheses. If construct validity studies do include hypotheses, the review team can adopt these hypotheses if they consider them adequate. This way, the results from many studies can still be used in the systematic review, as studies without hypotheses will no longer receive an 'inadequate' (previously called 'poor') quality rating. A detailed explanation for completing these boxes can be found in Chapter 5.


Removal of standards on sample size

We decided to move the assessment of adequate sample size to a later phase of the review, i.e. when drawing conclusions across studies on the measurement properties of the PROM (Chapter 3), and therefore removed the standard about adequate sample size for single studies from those boxes where it is possible to pool the results (i.e. the boxes Internal consistency, Reliability, Measurement error, Criterion validity, Hypotheses testing for construct validity, and Responsiveness). This was decided because several small high-quality studies can together provide sufficient evidence for a measurement property. Therefore, we recommend to take the aggregated sample size of the available studies into account when assessing the overall quality of evidence for a measurement property in a systematic review. This is in compliance with the Cochrane Handbooks (16, 17). However, the standard about adequate sample size for single studies was maintained in the boxes Content validity, Structural validity, and Cross-cultural validity\measurement invariance, because the results of these studies cannot be pooled. In these boxes, factor analyses, IRT or Rasch analyses are included as preferred statistical methods, and these methods require sample sizes that are sufficiently large to obtain reliable results. The suggested sample size requirements should be considered as general guidance; in some situations, depending on the type of model and the number of factors or items, more nuanced criteria might be applied.

Removal of standards on missing data and handling missing data

Each box of the original COSMIN checklist, except for the box on content validity, contains standards about whether the percentage of missing items was reported and how these missing items were handled. Although we consider information on missing items very important to report, we decided to remove these standards from all boxes in the COSMIN Risk of Bias checklist, as it was agreed upon within the COSMIN steering committee that a lack of reporting on the number of missing items and on how missing items were handled would not necessarily lead to biased results of the study. Furthermore, at the moment there is little evidence and no consensus yet about the best way to handle missing items in studies on measurement properties.

New order of evaluating the measurement properties

A new order of evaluating the measurement properties is proposed, as shown in Table 2. Content validity is considered to be the most important measurement property because, first of all, it should be clear that the items of the PROM are relevant, comprehensive, and comprehensible with respect to the construct of interest and target population (2). Therefore, we recommend to first evaluate the development and content validity studies of the PROMs. PROMs with high quality evidence of inadequate content validity can be excluded from further assessment in the systematic review.

Next, we recommend to evaluate the internal structure of PROMs, including the measurement properties structural validity, internal consistency, and cross-cultural validity\measurement invariance. Internal structure refers to how the different items in a PROM are related, which is important to know for deciding how items might be combined into a scale or subscale. Evaluating the internal structure of the instrument is relevant for PROMs that are based on a reflective model. In a reflective model, the construct manifests itself in the items, i.e. the items are a reflection of the construct to be measured (24). Finally, the remaining measurement properties are considered, i.e. reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness.


Table 2. Order in which the measurement properties are assessed

Content validity
  1. PROM development*
  2. Content validity

Internal structure
  3. Structural validity
  4. Internal consistency
  5. Cross-cultural validity\measurement invariance

Remaining measurement properties
  6. Reliability
  7. Measurement error
  8. Criterion validity
  9. Hypotheses testing for construct validity
  10. Responsiveness

* Not a measurement property, but taken into account when evaluating content validity

The COSMIN methodology is specifically developed for use in reviews of outcome measurement instruments. The proposed order of evaluating the measurement properties is therefore based on use/application as an outcome measure, i.e. for an evaluative purpose. This order is also reflected in the ordering of boxes in the COSMIN Risk of Bias checklist. We think that some of the included measurement properties are less relevant for other purposes. For example, responsiveness is less relevant when an instrument is used as a diagnostic tool. More information on the removal of other standards and adaptations to individual standards, and how these changes were decided upon, can be found elsewhere (6).

1.7 Ten-step procedure & outline of the manual

In a systematic review of PROMs, an overview is given of the available evidence on each measurement property of each included PROM, in order to come to overall conclusions per measurement property and to give recommendations for the most suitable PROM for a given purpose. A guideline, consisting of a sequential ten-step procedure, was developed by the COSMIN steering committee, subdivided into three parts: A, B and C (Figure 2) (2).

Part A consists of steps 1-4. Generally, these steps are standard procedures when performing systematic reviews and are in agreement with existing guidelines for reviews (16, 17): preparing and performing the literature search, and selecting relevant publications. The methodology of Part A is presented in Chapter 2.


Part B consists of steps 5-7 and concerns the evaluation of the measurement properties of the included PROMs. These steps were particularly developed for systematic reviews of PROMs. Part B is described in Chapter 3. We first describe the general methodology, and next we explain step 5 (evaluation of content validity), step 6 (evaluation of internal structure), and step 7 (evaluation of the remaining measurement properties) in more detail.

Part C consists of steps 8-10 and concerns the evaluation of the interpretability and feasibility of the PROMs (step 8), formulating recommendations (step 9), and reporting the systematic review (step 10). Part C is described in Chapter 4.

In Chapter 5 we elaborate on how to rate each standard included in the COSMIN Risk of Bias checklist. We close the manual by providing examples of a search strategy, a flow chart, and all tables that need to be presented in a review (Appendices 1-8). We aim to continue updating this manual when deemed necessary, based on experience and suggestions by users of the COSMIN methodology. If you have any suggestions or questions, please contact us.


Figure 2. Ten steps for conducting a systematic review of PROMs (2)


2. Part A, Steps 1-4: Perform the literature search

Steps 1-4 are standard procedures when performing a systematic review (2). They concern formulating the research aim (step 1) and eligibility criteria (step 2), performing the literature search (step 3), and selecting articles (step 4). These steps are in agreement with existing guidelines for reviews (16, 17).

2.1 Step 1: Formulate the aim of the review

The aim of a systematic review of PROMs focuses on the quality of the PROMs. It should include the following four key elements: 1) the construct; 2) the population(s); 3) the type of instrument(s); and 4) the measurement properties of interest. An example of a research aim could be: "our aim is to critically appraise, compare and summarize the quality of the measurement properties of all self-report fatigue questionnaires for patients with multiple sclerosis (MS), Parkinson's disease (PD) or stroke" (25). In the aim of this review, the construct of interest is 'fatigue', the population of interest is 'patients with MS, PD or stroke', the type of instrument of interest is 'self-report questionnaire', and 'all' measurement properties are explored in the review. We also recommend that these four key elements are included in the title of the review. For example: "Self-report fatigue questionnaires in multiple sclerosis, Parkinson's disease and stroke: a systematic review of measurement properties." The four key elements will also inform both the eligibility criteria in Step 2 and the search strategy to be conducted in Step 3.

2.2 Step 2: Formulate eligibility criteria

The eligibility criteria should be in agreement with the four key elements of the review aim: 1) the PROM(s) should aim to measure the construct of interest; 2) the study sample (or an arbitrary majority, e.g. ≥50%) should represent the population of interest; 3) the study should concern PROMs; and 4) the aim of the study should be the evaluation of one or more measurement properties, the development of a PROM (to rate the content validity), or the evaluation of the interpretability of the PROMs of interest (e.g. evaluating the distribution of scores in the study population, the percentage of missing items, floor and ceiling effects, the availability of scores and change scores for relevant (sub)groups, and the minimal important change or minimal important difference (26)).

We recommend to exclude studies that only use the PROM as an outcome measurement instrument, for instance studies in which the PROM is used to measure the outcome (e.g. in randomized controlled trials), or studies in which the PROM is used in a validation study of another instrument. Including all studies that use a specific PROM would require a much more extended search strategy, because the search filter for studies on measurement properties could not be used (see Step 3). It would also be extremely time-consuming to find all studies using a specific PROM, because the PROMs used in a study may not be reported in the abstract. This basically means that all studies performed in the population of interest would have to be screened full-text. As it is unlikely that this can be done in a standardized way, we consider this approach not feasible. Further, we recommend to include only full-text articles, because often very limited information on the design of a study is found in abstracts. Relying on information found in abstracts only would hamper the quality assessment of the studies and of the results on the measurement properties in steps 5 through 7.

2.3 Step 3: Perform a literature search

In agreement with the Cochrane methodology (16, 17), and based on consensus (5), MEDLINE and EMBASE are considered to be the minimum databases to be searched. In addition, it is recommended to search other (content-specific) databases, depending on the construct and population of interest, for example Web of Science, Scopus, CINAHL, or PsycINFO.

Search strategy

In the guideline and in this manual we focus on a systematic review including all PROMs measuring a specific construct that have - at least to some extent - been validated. An adequate search strategy therefore consists of a comprehensive collection of search terms (i.e. index terms and free text words) for the four key elements of the review aim: 1) construct; 2) population(s); 3) type of instrument(s); and 4) measurement properties (see the illustrative sketch at the end of this step). It is recommended to consult a clinical librarian as well as experts on the construct and study population of interest. A comprehensive PROM filter has been developed for PubMed by the Patient Reported Outcomes Measurement Group, University of Oxford, that can be used as a search block for 'type of measurement instrument(s)', and is available through the COSMIN website. Regarding search terms for measurement properties, we recommend to use a highly sensitive validated search filter for finding studies on measurement properties, which is available for PubMed and EMBASE and can be found on the COSMIN website (27, 28). An example of a PubMed search strategy is included in Appendix 1 (25).

Additional sources for search blocks can be found on the website https://blocks.bmi-online.nl/. Here, a group of Dutch medical information specialists have compiled a number of 'Search Blocks'. These building blocks are intended for use when performing complex search strategies in medical and health bibliographic databases, like PubMed, Embase, PsycInfo, Web of Science and others. These blocks can be used by experts in the field of searching for literature, e.g. information specialists, clinical librarians, health librarians and professionals in closely related areas.

As, in principle, the aim is to find all PROMs, it is recommended, in agreement with the Cochrane methodology, to search the databases from the date of inception till present (16, 17). The use of language restrictions depends on the inclusion criteria defined in Step 2. In general, it is recommended not to use language restrictions in the search strategy, even if there are no resources to translate the articles for the review. In this way, review authors are at least able to report their existence.

Next, it is recommended to use software such as EndNote or Reference Manager to manage references. Covidence (www.covidence.org) could be of use when managing the review screening process. Covidence is an online tool that supports systematic reviewers in uploading search results, screening abstracts and full texts, resolving disagreements, and exporting data into RevMan or Excel. It is recommended by the Cochrane Collaboration. Searches must be as up to date as possible when the review is published, so it may be necessary to update the search before submission (or before acceptance) of the manuscript.

In addition to searching for all PROMs, Figure 3 shows the search strategy for other types of reviews. For example, if the aim of a systematic review is to identify all PROMs that have been used, search terms for measurement properties should not be used (first column of Figure 3). If the aim of a systematic review is to report on the measurement properties of one particular PROM (or a selection of predefined PROMs), search terms for the construct of interest should not be used, and search terms for the type of instrument can be replaced by search terms for the names of the instruments of interest (third column of Figure 3).

As in all systematic reviews, the literature search should be carefully documented in the study protocol. We recommend to document the following: the names of the databases that were searched, including the interface used to search the databases (for example, PubMed was used for searching MEDLINE); all search terms used; any restrictions that were used (for example, human studies only, not animals); the number of records retrieved from each database; and the number of unique records. In accordance with the PRISMA statement, it is recommended to add the documentation of the selection process to a flow diagram. An example of the PRISMA flow diagram can be found in Appendix 2 (29). This flow diagram includes information on the total number of abstracts selected, the total number of full-text articles that were selected, and the main reasons for excluding full-text articles. Note that the included articles can describe one or more studies (on one or more different measurement properties). Therefore, we recommend to describe in the flow diagram the total number of articles included, the total number of studies described in those articles, and the total number of different (versions of) PROMs found.
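To make the four-block structure of such a search concrete, the following minimal sketch (in Python) combines hypothetical search blocks with AND; the terms shown are illustrative assumptions only and are not taken from this manual, and a real review should use the validated filters mentioned above and a full strategy such as the one in Appendix 1:

```python
# Illustrative only: hypothetical search blocks for the four key elements of the
# review aim (construct, population, type of instrument, measurement properties).
construct_block  = '("fatigue"[tiab] OR "tiredness"[tiab])'
population_block = '("multiple sclerosis"[MeSH] OR "stroke"[MeSH])'
instrument_block = '("questionnaire"[tiab] OR "patient reported outcome"[tiab])'
properties_block = '("reproducibility of results"[MeSH] OR "psychometrics"[MeSH])'

# Terms within a block are combined with OR; the four blocks are combined with AND.
query = " AND ".join([construct_block, population_block,
                      instrument_block, properties_block])
print(query)
```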


Figure 3. Search strategy for different types of systematic reviews of measurement instruments
* Search filter described by Terwee et al (27)


2.4 Step 4: Select abstracts and full-text articles

It is generally recommended to perform the selection of abstracts and full-text articles by two reviewers independently (16, 17). If a study seems relevant to at least one reviewer based on the abstract, or in case of doubt, the full-text article needs to be retrieved and screened. Differences should be discussed, and if consensus between the two reviewers cannot be reached, it is recommended to consult a third reviewer. It is also recommended to check all references of the included articles to search for additional potentially relevant studies. If many new articles are found, the initial search strategy might have been insufficiently comprehensive and may need to be improved and rerun. Note that Cochrane reviews use various additional sources for finding relevant studies, such as dissertations, editorials, conference proceedings, and reports. The probability of finding additional relevant articles for systematic reviews of PROMs in these types of additional sources, however, appears to be small.


3. Part B, Steps 5-7: Evaluating the measurement properties of the included PROMs

Part B consists of steps 5-7 of the guideline on conducting a systematic review of PROMs and concerns the evaluation of the measurement properties of the included PROMs. In Chapter 3.1 we will start with explaining the general methodology of Part B. In Chapter 3.2 we discuss step 5, in which the content validity is assessed. Next, in Chapter 3.3, step 6 is explained, which concerns evaluating the internal structure of a PROM (i.e. structural validity, internal consistency, and cross-cultural validity\measurement invariance). Lastly, in Chapter 3.4, we describe step 7 of the guideline, which concerns the evaluation of the remaining measurement properties (i.e. reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness).

3.1 General methodology

The evaluation of each measurement property includes three substeps (see Figure 4):

Figure 4. Practical outline for performing a systematic review

First, the methodological quality of each single study on a measurement property is assessed using the COSMIN Risk of Bias checklist. Detailed instructions for how each standard included in the COSMIN Risk of Bias checklist should be rated are described in Chapter 5.

Second, the result of each single study on a measurement property is rated against the updated criteria for good measurement properties (+/-/?) (Table 4);

Third, the evidence is summarized per measurement property per PROM, the overall result is rated against the criteria for good measurement properties (Table 4), and the quality of the evidence is graded using the GRADE approach (2).

3.1.1 Evaluating the methodological quality of studies

To evaluate the methodological quality of the included studies using the COSMIN Risk of Bias checklist, it should first be determined which measurement properties are assessed in each article.

Determine which measurement properties are assessed

Often, multiple studies on different measurement properties are described in one article (i.e. one study for each measurement property, e.g. a study on internal consistency, construct validity, and reliability, each with its own specific design requirements). The quality of each study is separately evaluated using the corresponding COSMIN box (Table 3). The COSMIN Risk of Bias checklist should be used as a modular tool; only those boxes should be completed for the measurement properties that are evaluated in the article.

Table 3. Boxes of the COSMIN Risk of Bias checklist. Mark the measurement properties that have been evaluated in the article*.

Content validity
  Box 1. PROM development
  Box 2. Content validity

Internal structure
  Box 3. Structural validity
  Box 4. Internal consistency
  Box 5. Cross-cultural validity\measurement invariance

Remaining measurement properties
  Box 6. Reliability
  Box 7. Measurement error
  Box 8. Criterion validity
  Box 9. Hypotheses testing for construct validity
  Box 10. Responsiveness

* If a box needs to be completed more than once, two or more marks can be placed.

Sometimes the same measurement property is reported for multiple (sub)groups in one article, for example when an instrument is validated in two different countries and the measurement properties are reported for both countries separately. In that case, the same box may need to be completed multiple times if the design of the study was different between countries. The review team should decide which boxes should be completed (and how many times). In general, we recommend reviewers to consider how the designs and analyses presented in the article relate to the COSMIN taxonomy (see Figure 1) (21), and to subsequently complete the corresponding COSMIN box. This facilitates a consistent evaluation of the measurement properties across the included studies, regardless of the terminology used by the authors of the different articles.

Determining which box needs to be completed sometimes requires a subjective judgement, because the terms and definitions for measurement properties used in an article may not be similar to the terms used in the COSMIN taxonomy. For example, authors may use the term reliability when they present limits of agreement, while according to the COSMIN taxonomy this would be considered measurement error. In that case, we recommend to complete box 7 (Measurement error). Also, authors often use the term criterion validity for studies in which a PROM is compared to other PROMs that measure a similar construct. In most cases, this would be considered evidence for construct validity rather than criterion validity according to the COSMIN terms and definitions. In that case, we recommend to complete box 9 (Hypotheses testing for construct validity).

Sometimes results presented in studies on measurement properties actually do not (or hardly) provide any evidence on the measurement properties of a PROM, even though they are presented as such. For example, a comparison of different modes of administration (e.g. paper versus computer) does not provide information on the reliability or validity of the PROM. Also, correlations of a PROM with other variables (e.g. demographic variables) are sometimes reported and considered as evidence for construct validity; however, these correlations provide very limited evidence for construct validity. Reviewers may decide to ignore such results in their review.

Assess the methodological quality of the studies

The quality of each study on a measurement property should be assessed separately, using the corresponding COSMIN box. Each study is rated as of very good, adequate, doubtful or inadequate quality. To determine the overall rating of the quality of each single study on a measurement property, the lowest rating of any standard in the box is taken (i.e. "the worst score counts" principle) (12). For example, if the lowest rating of all eight items of the reliability box is 'inadequate', the overall methodological quality of that specific reliability study is rated as 'inadequate'. In Chapter 5 we provide more details and examples for how each standard in the COSMIN Risk of Bias checklist should be rated. We also explain in more detail how to come to the overall conclusion about the methodological quality of a study, i.e. how to apply the worst score counts principle. The COSMIN Risk of Bias checklist can be found on our website, along with a working document (created in Excel) that can be used to document the COSMIN ratings for each box.

3.1.2 Applying criteria for good measurement properties

Data extraction

We recommend to subsequently extract the data on the characteristics of the PROM(s), on the characteristics of the included sample(s), on the results on the measurement properties, and on information about the interpretability and feasibility of the score(s) of the PROM(s). This information is needed to decide whether the results of different studies are sufficiently similar to be pooled or qualitatively summarized. This information can be presented in overview tables. We recommend to extract the required information from the articles and directly paste it into the overview tables. Examples of these tables are given in Appendices 3-7. In the Cochrane Handbook for systematic reviews of interventions it is recommended that the data extraction is done by two reviewers independently, to avoid missing relevant information (16).

Evaluate each result against the criteria for good measurement properties

Next, the result of each study on a measurement property should be rated against the updated criteria for good measurement properties (Table 4) (2). Each result is rated as either sufficient (+), insufficient (–), or indeterminate (?). The result of each measurement property and its quality rating can directly be added to the applicable table (Appendix 7).

Table 4. Updated criteria for good measurement properties

Measurement property / Rating¹ / Criteria

Structural validity
  +  CTT: CFA: CFI or TLI or comparable measure > 0.95, OR RMSEA < 0.06, OR SRMR < 0.08²
     IRT/Rasch: no violation of unidimensionality³ (CFI or TLI or comparable measure > 0.95, OR RMSEA < 0.06, OR SRMR < 0.08), AND no violation of local independence (residual correlations among the items after controlling for the dominant factor < 0.20, OR Q3's < 0.37), AND no violation of monotonicity (adequate looking graphs, OR item scalability > 0.30), AND adequate model fit (IRT: χ2 > 0.01; Rasch: infit and outfit mean squares ≥ 0.5 and ≤ 1.5, OR Z-standardized values > -2 and < 2)
  ?  CTT: not all information for '+' reported; IRT/Rasch: model fit not reported
  –  Criteria for '+' not met

Internal consistency
  +  At least low evidence⁴ for sufficient structural validity⁵ AND Cronbach's alpha(s) ≥ 0.70 for each unidimensional scale or subscale⁶
  ?  Criterion "at least low evidence⁴ for sufficient structural validity⁵" not met
  –  At least low evidence⁴ for sufficient structural validity⁵ AND Cronbach's alpha(s) < 0.70 for each unidimensional scale or subscale⁶

Reliability
  +  ICC or weighted Kappa ≥ 0.70
  ?  ICC or weighted Kappa not reported
  –  ICC or weighted Kappa < 0.70

Measurement error
  +  SDC or LoA < MIC⁵
  ?  MIC not defined
  –  SDC or LoA > MIC⁵

Hypotheses testing for construct validity
  +  The result is in accordance with the hypothesis⁷
  ?  No hypothesis defined (by the review team)
  –  The result is not in accordance with the hypothesis⁷

Cross-cultural validity\measurement invariance
  +  No important differences found between group factors (such as age, gender, language) in multiple group factor analysis, OR no important DIF for group factors (McFadden's R² < 0.02)
  ?  No multiple group factor analysis OR DIF analysis performed
  –  Important differences between group factors OR DIF was found

Criterion validity
  +  Correlation with gold standard ≥ 0.70, OR AUC ≥ 0.70
  ?  Not all information for '+' reported
  –  Correlation with gold standard < 0.70, OR AUC < 0.70

Responsiveness
  +  The result is in accordance with the hypothesis⁷, OR AUC ≥ 0.70
  ?  No hypothesis defined (by the review team)
  –  The result is not in accordance with the hypothesis⁷, OR AUC < 0.70

The criteria are based on e.g. Terwee et al. (30) and Prinsen et al. (5). AUC = area under the curve, CFA = confirmatory factor analysis, CFI = comparative fit index, CTT = classical test theory, DIF = differential item functioning, ICC = intraclass correlation coefficient, IRT = item response theory, LoA = limits of agreement, MIC = minimal important change, RMSEA = Root Mean Square Error of Approximation, SEM = Standard Error of Measurement, SDC = smallest detectable change, SRMR = Standardized Root Mean Residuals, TLI = Tucker-Lewis index
¹ "+" = sufficient, "–" = insufficient, "?" = indeterminate
² To rate the quality of the summary score, the factor structures should be equal across studies
³ Unidimensionality refers to a factor analysis per subscale, while structural validity refers to a factor analysis of a (multidimensional) patient-reported outcome measure
⁴ As defined by grading the evidence according to the GRADE approach
⁵ This evidence may come from different studies
⁶ The criterion 'Cronbach alpha < 0.95' was deleted, as this is relevant in the development phase of a PROM and not when evaluating an existing PROM
⁷ The results of all studies should be taken together and it should then be decided whether 75% of the results are in accordance with the hypotheses
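As a small illustration of how the result of a single study can be rated against Table 4, the following sketch (in Python; illustrative only, not part of the COSMIN methodology, and covering only the reliability criterion) returns the sufficient, insufficient, or indeterminate rating:

```python
# Rate the result of a single reliability study against Table 4:
# ICC or weighted kappa >= 0.70 is sufficient (+), < 0.70 insufficient (-),
# and a value that is not reported gives an indeterminate rating (?).
def rate_reliability(icc_or_weighted_kappa):
    if icc_or_weighted_kappa is None:
        return "?"   # ICC or weighted kappa not reported
    return "+" if icc_or_weighted_kappa >= 0.70 else "-"

print(rate_reliability(0.83))   # -> '+'
print(rate_reliability(0.61))   # -> '-'
print(rate_reliability(None))   # -> '?'
```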


The criteria provided in Table 4 are the preferred criteria for each measurement property. However, for some measurement properties alternative criteria might also be appropriate. For example, additional criteria could be used for assessing the results of studies using exploratory factor analysis (EFA) or testing bi-factor confirmatory factor analysis (CFA) models. If the review team has the expertise to assess the results of these types of studies, additional criteria could be used.

From this table, there is one aspect on which we would like to elaborate in more detail. This concerns the 'indeterminate' rating for the result of a study on internal consistency, i.e. when the criterion 'at least low evidence for sufficient structural validity' is not met. This means that either (1) there is only very low evidence for sufficient structural validity (e.g. because there was only one study on structural validity with a very low sample size), (2) there was (any) evidence for insufficient structural validity, (3) there are inconsistent results for structural validity which cannot be explained, or (4) there is no information on the structural validity available.

3.1.3 Summarize the evidence and grade the quality of the evidence

Whereas Chapters 3.1.1 and 3.1.2 focused on the quality of single studies on measurement properties of a PROM, Chapters 3.1.3 and 3.1.4 focus on the quality of the PROM as a whole. To come to an overall conclusion about the quality of the PROM, one should first decide whether the results of all available studies per measurement property are consistent. If the results are consistent, the results of the studies can be quantitatively pooled or qualitatively summarized, and compared against the criteria for good measurement properties to determine whether, overall, the measurement property of the PROM is sufficient (+), insufficient (–), inconsistent (±), or indeterminate (?). Finally, the quality of the evidence is graded (high, moderate, low, or very low evidence), using a modified GRADE approach, as explained below.

If the results are inconsistent, there are several strategies that can be used: (a) find explanations and summarize per subgroup; (b) do not summarize the results and do not grade the evidence; or (c) base the conclusion on the majority of consistent results, and downgrade for inconsistency. It is up to the review team to decide which strategy is most appropriate to use in their specific situation. Below, each strategy is explained in more detail.

Handling inconsistent results

Based on the quality ratings of single results (as discussed in Chapter 3.1.2), one decides whether or not all results on one measurement property of a PROM are consistent. If all results are consistent, the overall rating will be sufficient (+) or insufficient (–). If the results are inconsistent (e.g. both sufficient and insufficient results for reliability, or appropriate model fit found but for different factor structures across studies), explanations for the inconsistency should be explored. For example, inconsistency could be due to different populations or methods used. If an explanation is found, the results can be summarized, e.g. per subgroup of consistent results. For example, if different results are found in studies performed in acute patients versus chronic patients, the results per subgroup should be summarized separately. The overall rating for the specific measurement property (e.g. reliability) may then be sufficient (+) in acute patients, but insufficient (–) in chronic patients.


High quality studies could be considered as providing more evidence than low quality studies and can be considered decisive in determining the overall rating when ratings are inconsistent. If inconsistent results can be explained by the quality of the studies, one may decide to summarize the results of very good or adequate studies only, and to ignore the results of doubtful and inadequate quality studies. This should then be explained in the manuscript.

In some cases, more recent evidence can be considered more important than older evidence. This can also be taken into account when determining the overall rating.

If no explanation for inconsistent results is found, there are two possibilities: (1) the overall rating will be inconsistent (±); or (2) one could decide to base the overall rating on the majority of the results (e.g. if the majority of the (consistent) results of studies are sufficient, an overall rating of sufficient could be considered) and downgrade the quality of the evidence for inconsistency (see Chapter 3.5).

Summarize the evidence

The results can be quantitatively pooled or qualitatively summarized. We recommend to report these pooled or summarized results per measurement property per PROM in Summary of Findings (SoF) Tables, accompanied by a rating of the pooled or summarized results (+ / – / ± / ?) and the grading of the quality of the evidence (high, moderate, low, very low). These SoF tables (i.e. one per measurement property) will ultimately be used in providing recommendations for the selection of the most appropriate PROM for a given purpose or a particular situation. See Appendix 8 for an example.

Quantitatively pooling the results

If possible, the results from different studies on one measurement property should be statistically pooled in a meta-analysis. Pooled estimates of measurement properties can be obtained by calculating weighted means (based on the number of participants included per study) and 95% confidence intervals (e.g. (31)). We recommend to consult a statistician for performing meta-analyses. For test-retest reliability, for example, weighted mean intraclass correlation coefficients (ICCs) and 95% confidence intervals can be calculated using a standard generic inverse variance random effects model (32). ICC values can be combined based on estimates derived from a Fisher transformation, z = 0.5 x ln((1 + ICC)/(1 - ICC)), which has an approximate variance Var(z) = 1/(N - 3), where N is the sample size. This method was applied in a study by Collins et al. (8). For construct validity, for example, it can be considered to pool all correlations of a PROM with other PROMs that measure a similar construct. For example, when evaluating the construct validity of a physical function subscale, one could pool all extracted correlations of this subscale with other comparison instruments measuring physical function.
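The following minimal sketch (in Python) illustrates the Fisher transformation described above. It uses a simple fixed-effect inverse-variance weighting as an approximation, whereas the manual recommends a generic inverse variance random effects model and consulting a statistician; the function name and the example values are hypothetical.

```python
import math

def pool_iccs(iccs_and_ns):
    """Pool test-retest ICCs on the Fisher z scale with inverse-variance weights.

    iccs_and_ns: list of (ICC, sample size) tuples from the included studies.
    Returns the pooled ICC and its approximate 95% confidence interval.
    """
    zs = [0.5 * math.log((1 + icc) / (1 - icc)) for icc, _ in iccs_and_ns]
    weights = [n - 3 for _, n in iccs_and_ns]      # inverse of Var(z) = 1/(N - 3)
    pooled_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    se = math.sqrt(1 / sum(weights))

    def to_icc(z):  # back-transform from the z scale to the ICC scale
        return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)

    return to_icc(pooled_z), to_icc(pooled_z - 1.96 * se), to_icc(pooled_z + 1.96 * se)

# Hypothetical example with three studies (ICC, sample size):
print(pool_iccs([(0.85, 50), (0.78, 120), (0.91, 40)]))
```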

Qualitatively summarizing into a summarized result

If it is not possible to statistically pool the results, the results of the studies should be qualitatively summarized to come to a summarized result, for example by providing the range (lowest and highest) of MIC values found for interpretability, the percentage of confirmed hypotheses for construct validity, or the range of each model fit parameter on a consistently found factor structure in structural validity studies.


Applying criteria for good measurement properties to the pooled or summarized result

The pooled or summarized result per measurement property per PROM should again be rated against the same quality criteria for good measurement properties (Table 4). The overall rating for the pooled or summarized result can be sufficient (+), insufficient (–), inconsistent (±), or indeterminate (?). This rating can be added to the pooled or summarized result per PROM for each measurement property in the Summary of Findings Tables (Appendix 8).

If the results per study are all sufficient (or all insufficient), the overall rating will also be sufficient (or insufficient). To rate qualitatively summarized results as sufficient (or insufficient), in principle 75% of the results should meet the criteria. For example, for structural validity the criterion is that 'at least 75% of the CFA studies found the same factor structure'. The criterion for hypotheses testing and responsiveness (construct approach) for summary results is that 'at least 75% of the results should be in accordance with the hypotheses' to rate the overall result as 'sufficient', and 'at least 75% of the results are not in accordance with the hypotheses' to rate the overall result as 'insufficient'. If the results of single studies which can be pooled are inconsistent and the inconsistency is unexplained, the results could be pooled anyway, this pooled result could be rated as either sufficient or insufficient, and the evidence could subsequently be downgraded due to inconsistency (see also Chapter 3.1.3 and the next section). If the results of single studies which cannot be pooled (e.g. results of CFAs) are inconsistent and the inconsistency is unexplained, the overall result will be rated as 'inconsistent'. In this case, the results are actually not summarized, and the evidence will not be graded. If the results per study are all indeterminate (?), the overall rating will also be indeterminate (?).

Grading the quality of evidence

After pooling or summarizing all evidence per measurement property per PROM, and rating the pooled or summarized result against the criteria for good measurement properties, the next step is to grade the quality of this evidence. The quality of the evidence refers to the confidence that the pooled or summarized result is trustworthy. The grading of the quality is based on the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach for systematic reviews of clinical trials (20). Using a modified GRADE approach, the quality of the evidence is graded as high, moderate, low, or very low evidence (for definitions, see Table 5).

The GRADE approach uses five factors to determine the quality of the evidence: risk of bias (quality of the studies), inconsistency (of the results of the studies), indirectness (evidence comes from different populations, interventions or outcomes than the ones of interest in the review), imprecision (wide confidence intervals), and publication bias (negative results are less often published). For evaluating measurement properties in systematic reviews of PROMs, the following four factors should be taken into account: (1) risk of bias (i.e. the methodological quality of the studies), (2) inconsistency (i.e. unexplained inconsistency of results across studies), (3) imprecision (i.e. the total sample size of the available studies), and (4) indirectness (i.e. evidence from different populations than the population of interest in the review).


The fifth factor, i.e. publication bias, is difficult to assess in studies on measurement properties because of a lack of registries for these types of studies. Therefore, we do not take this factor into account in this methodology.

Table 5. Definitions of quality levels
Quality level   Definition
High            We are very confident that the true measurement property lies close to that of the estimate* of the measurement property
Moderate        We are moderately confident in the measurement property estimate: the true measurement property is likely to be close to the estimate of the measurement property, but there is a possibility that it is substantially different
Low             Our confidence in the measurement property estimate is limited: the true measurement property may be substantially different from the estimate of the measurement property
Very low        We have very little confidence in the measurement property estimate: the true measurement property is likely to be substantially different from the estimate of the measurement property
* Estimate of the measurement property refers to the pooled or summarized result of the measurement property of a PROM. These definitions were adapted from the GRADE approach (20)

The GRADE approach is used to downgrade evidence when there are concerns about the quality of the evidence. The starting point is always the assumption that the pooled or overall result is of high quality. The quality of evidence is subsequently downgraded by one or two levels per factor to moderate, low, or very low evidence when there is risk of bias, (unexplained) inconsistency, imprecision (low sample size), or indirect results. The quality of evidence can even be downgraded by three levels when the evidence is based on only one inadequate study (i.e. extremely serious risk of bias). The grading of the quality of each measurement property can directly be added to the applicable table (Appendix 8). Below we explain in more detail how the four GRADE factors can be interpreted and applied in evaluating the measurement properties of PROMs.

How to apply GRADE
For each pooled result or for the summarized result for each measurement property per PROM, the quality of the evidence will be determined by using Table 6. If, in summarizing the evidence and determining the overall rating of the pooled or summarized result for a measurement property, the results of some studies are ignored, these studies should also be ignored in determining the quality of the evidence. For example, if only the results of high quality studies are considered in determining the overall rating, then only the high quality studies determine the grading of the evidence (in this case we would not downgrade for risk of bias).


Table 6. Modified GRADE approach for grading the quality of evidence
Quality of evidence   Lower if
High                  Risk of bias: -1 serious, -2 very serious, -3 extremely serious
Moderate              Inconsistency: -1 serious, -2 very serious
Low                   Imprecision: -1 total n = 50-100, -2 total n < 50
Very low              Indirectness: -1 serious, -2 very serious
n = sample size

Below we explain in more detail how the four GRADE factors can be interpreted and applied in evaluating the measurement properties of PROMs.

(1) Risk of bias can occur if the quality of the study is doubtful or inadequate, as assessed with the COSMIN Risk of Bias checklist, or if only one study of adequate quality is available. The quality of evidence should be downgraded by one level (e.g. from high to moderate evidence) if there is serious risk of bias, by two levels (e.g. from high to low) if there is very serious risk of bias, or by three levels (i.e. from high to very low) if there is extremely serious risk of bias. In Table 7 we explain what we consider as serious, very serious or extremely serious risk of bias.

Table 7. Instructions on downgrading for Risk of Bias
Risk of bias        Downgrading for Risk of Bias
No                  There are multiple studies of at least adequate quality, or there is one study of very good quality available
Serious             There are multiple studies of doubtful quality available, or there is only one study of adequate quality
Very serious        There are multiple studies of inadequate quality, or there is only one study of doubtful quality available
Extremely serious   There is only one study of inadequate quality available

(2) Inconsistency: Inconsistency may already have been solved by pooling or summarizing the results of subgroups of studies with similar results and providing overall ratings for these subgroups. In this case, one does not need to downgrade. If no explanation for inconsistency is found, the review team can decide not to pool or summarize results, and rate the results as 'inconsistent'. In this case, no quality level for the evidence will be given. However, an alternative solution could be to rate the pooled or summarized result (e.g. based on the majority of results) as sufficient or insufficient and downgrade the quality of the evidence for inconsistency by one or two levels. The reviewers should also decide what will be considered as serious (i.e. -1 level) or very serious (i.e. -2 levels) inconsistency, because this is context dependent. It is up to the review team to decide which seems to be the best solution in each situation.

(3) Imprecision refers to the total sample size included in the studies. We recommend downgrading by one level when the total sample size of the pooled or summarized studies is below 100, and by two levels when the total sample size is below 50. Note that this principle should not be used for measurement properties for which a sample size requirement is already included in the COSMIN Risk of Bias box, i.e. content validity, structural validity, and cross-cultural validity.

(4) Indirectness can occur if studies are included in the review that were (partly) performed in another population or another context of use than the population or context of use of interest in the systematic review. For example, if only part of the study population consists of patients with the disease of interest, the review team can decide to downgrade by one or two levels for serious or very serious indirectness. One can also consider downgrading for indirectness for construct validity or responsiveness when the evidence is considered weak, for example when the evidence is only based on comparisons with PROMs measuring different constructs, or only based on differences between extremely different groups. How to decide on what should be considered as serious or very serious indirectness is context dependent, and should be decided on by the review team.

To determine the grading for the quality of evidence, the concerns about the quality of the evidence should be added up. Therefore, it is helpful to consider the GRADE factors one by one, in the consecutive order specified in Table 6. First, the risk of bias is considered (see Table 7). For example, when three studies are found with sufficient (i.e. '+') results, but all of doubtful quality, the level of evidence will be downgraded for risk of bias from high to moderate (i.e. -1). Second, further downgrading for other factors should be considered. After risk of bias, inconsistency should be considered. If the results of the three studies in the example above are all rated as sufficient, no downgrading for inconsistency is required. Otherwise, downgrading should be considered. Next, the sample size should be taken into account. For example, when the sample size of the three studies together is more than 100, there will be no further downgrading. If the (total) sample size is between 50 and 100, one should downgrade by one level (-1). Lastly, the evidence could be downgraded because of indirectness. For example, consider a systematic review that focuses on pain and comfort scales for infants, in which the inclusion criterion is 'children between 0-18 years' because a lack of studies in infants only (i.e. below 1 year) was expected. Studies including children of all ages, including infants, may lead to downgrading the quality of evidence by one level, and studies including only children between 4 and 12 years may even lead to downgrading the quality by two levels, due to indirectness of the results. If, in our example, all three studies found include children between 0-4 years, but only very few infants, one may decide to downgrade one level (i.e. from moderate to low). In this example, the overall quality of the evidence is now considered 'low', so the conclusion will be that there is low quality evidence that the measurement property of interest is sufficient. We recommend that grading is done by two reviewers independently and that consensus among the reviewers is reached, if necessary with the help of a third reviewer.
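The bookkeeping of this worked example can be captured in a small helper. Below is a minimal Python sketch; the function name and the 0-3 coding of the downgrading decisions are our own illustration (not part of the COSMIN methodology), and the review team still has to make the actual judgement per factor.

```python
LEVELS = ["high", "moderate", "low", "very low"]

def grade_quality(risk_of_bias=0, inconsistency=0, imprecision=0, indirectness=0):
    """Start at 'high' and downgrade by the number of levels decided for each
    modified GRADE factor (0 = no concern, 1 = serious, 2 = very serious,
    3 = extremely serious; only risk of bias can be downgraded by three levels)."""
    total = risk_of_bias + inconsistency + imprecision + indirectness
    return LEVELS[min(total, len(LEVELS) - 1)]

# Worked example from the text: three sufficient but doubtful-quality studies
# (-1 risk of bias), consistent results, total n > 100, very few infants included
# (-1 indirectness): high -> moderate -> low.
print(grade_quality(risk_of_bias=1, indirectness=1))  # 'low'
```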


3.2. Step 5: Evaluating content validity
Content validity (i.e. the degree to which the content of a PROM is an adequate reflection of the construct to be measured (21)) is considered to be the most important measurement property, because it should be clear that the items of the PROM are relevant, comprehensive, and comprehensible with respect to the construct of interest and study population (7). The evaluation of content validity requires a subjective judgment by the reviewers. In this judgement, the PROM development study, the quality and results of additional content validity studies on the PROMs (if available), and a subjective rating of the content of the PROMs by the reviewers are taken into account. As this step is very important but rather extensive, we developed a separate users' manual for guidance on how to evaluate the content validity of PROMs. This manual can be found on the COSMIN website (33). If there is high quality evidence that the content validity of a PROM is insufficient, the PROM will not be further considered in Steps 6-8 of the systematic review and one can directly draw a recommendation for this PROM in Step 9 (i.e. recommendation 'C': "PROMs that should not be recommended (i.e. PROMs with high quality evidence of insufficient content validity)"). In all other cases, the PROM can be further taken into consideration in the systematic review.


3.3 Step 6. Evaluation of the internal structure of PROMs: structural validity, internal consistency, and cross-cultural validity\measurement invariance
Internal structure refers to how the different items in a PROM are related, which is important to know for deciding how items might be combined into a scale or subscale (6). This step concerns an evaluation of structural validity (including unidimensionality) using factor analyses or IRT or Rasch analyses; internal consistency; and cross-cultural validity and other forms of measurement invariance (using differential item functioning (DIF) analyses or Multi-Group Confirmatory Factor Analyses (MGCFA)). Here we are referring to testing of existing PROMs, not further refinement or development of new PROMs. These three measurement properties focus on the quality of the items and the relationships between the items, in contrast to the remaining measurement properties discussed in Step 7, which mainly focus on the quality of scales or subscales. We recommend evaluating internal structure directly after evaluating the content validity of a PROM. As evidence for structural validity (or unidimensionality) of a scale or subscale is a prerequisite for the interpretation of internal consistency analyses (i.e. Cronbach's alphas), we recommend first evaluating structural validity, followed by internal consistency and cross-cultural validity\measurement invariance.
Step 6 is only relevant for PROMs that are based on a reflective model, which assumes that all items in a scale or subscale are manifestations of one underlying construct and are expected to be correlated. An example of a reflective model is the measurement of anxiety; anxiety manifests itself in specific characteristics, such as worrying thoughts, panic, and restlessness. By asking patients about these characteristics, we can assess the degree of anxiety (i.e. all items are a reflection of the construct) (23). If the items in a scale or subscale are not supposed to be correlated (i.e. a formative model), these analyses are not relevant and Step 6 can be omitted. In other words, if factor analysis, or IRT or Rasch analysis, is performed on a PROM based on a formative model, these results can be ignored, as the results are not interpretable.
If it is not reported whether a PROM is based on a reflective or formative model, the reviewers need to decide, based on the content of the PROM, whether it is likely based on a reflective or a formative model. Unfortunately, it is not always possible to decide afterwards if the instrument is based on a reflective or formative model and thus whether structural validity is relevant. If a study included in the review presents a factor analysis, IRT or Rasch analysis and you are in doubt whether the (sub)scale is reflective, or whether it might be mixed (both reflective and formative items within a subscale), we recommend considering it a reflective model. Subsequently, the quality of the study and the quality of the results are rated.
The evaluation of structural validity, internal consistency and cross-cultural validity\measurement invariance consists of three substeps as described in Chapter 3.1 (see Figure 4). First, the risk of bias in each study on structural validity, internal consistency and cross-cultural validity\measurement invariance is assessed using the COSMIN Risk of Bias checklist (see Chapter 5 for detailed instructions on how to rate the standards in each box). Second, data is extracted on the characteristics of the PROM(s), on characteristics of the included patient sample(s), and on the results of the measurement properties (see Appendices 3-7). The result per study is then evaluated against the criteria for good measurement properties. Third, all results per measurement property of a PROM are quantitatively pooled or qualitatively summarized, and this pooled or summarized result is again evaluated against the criteria for good measurement properties to get an overall rating (Table 4). Finally, the quality of the evidence is graded using the modified GRADE approach, as described in Chapter 3.1.3.
The risk of bias in a study on internal consistency depends on the available evidence for structural validity, because unidimensionality is a prerequisite for the interpretation of internal consistency analyses (i.e. Cronbach's alphas). Therefore, the quality of evidence for internal consistency cannot be higher than the quality of evidence for structural validity. Note that in a systematic review of PROMs, the conclusion about structural validity may come from different studies, even from studies that were conducted after studies on internal consistency. We recommend taking the quality of evidence for structural validity as a starting point for determining the quality of evidence for internal consistency, and downgrading further for risk of bias, inconsistency (i.e. unexplained inconsistency of results across studies), imprecision (i.e. total sample size of the available studies), and indirectness if needed.
A Cronbach's alpha based on a scale which is not unidimensional is difficult to interpret. We therefore recommend ignoring results of studies on internal consistency of scales when there is evidence that these scales are not unidimensional. For example, if results on structural validity show that a scale has three factors, the internal consistency of each of those three subscales is relevant. Cronbach's alphas for a total score can be ignored (and clinicians and researchers should be encouraged not to use these total scores) unless there is proof of unidimensionality of the total score, e.g. by a higher order or bi-factor CFA.


3.4 Step 7. Evaluation of the remaining measurement properties: reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness
Subsequently, the remaining measurement properties (reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness) should be evaluated. Unlike content validity and internal structure, the evaluation of these measurement properties mainly assesses the quality of the scale or subscale as a whole, instead of the items.

Issues regarding validity to be decided upon a priori by the review team
For assessing the quality of hypotheses testing for construct validity and of the construct approach for responsiveness, the review team should a priori decide on two issues.
First, the review team should decide whether there is a gold standard available for measuring the construct of interest in the target population. If there is no gold standard (in principle, there is no gold standard available for a PROM; only the long version of a shortened PROM can be considered a gold standard), the review team should not use the box for criterion validity or the box for the criterion approach for responsiveness in the systematic review. All evidence for validity presented in the included studies will then be considered as evidence for construct validity or the construct approach for responsiveness. If an outcome measurement instrument is considered a gold standard, studies comparing the PROM to this particular instrument are considered as evidence for criterion validity or the criterion approach for responsiveness.
Second, for interpreting the results of studies on hypotheses testing for construct validity, and of studies using a construct approach for evaluating responsiveness, the review team should formulate a set of hypotheses about expected relationships between the PROM under review and other well-defined and high quality comparator instruments which are used in the field. If possible, some hypotheses about expected differences between subgroups can also be formulated in advance. This way, the study results are compared against the same hypotheses. The expected direction (positive or negative) and magnitude (absolute or relative) of the correlations or differences should be included in the hypotheses (for examples we refer to other publications (34-37)). Without this specification it is difficult to decide afterwards whether the results are in accordance with the hypotheses or not. For example, the review team could state that they expect a correlation of at least 0.50 between the PROM under study and a comparator instrument that intends to measure the same construct, or that they expect chronically ill patients to score on average 10 points higher than acute patients on the PROM under consideration. The hypotheses may also concern the relative magnitude of correlations. For example, the review team could state that they expect the score on PROM A to correlate at least 0.10 points higher with the score on instrument (e.g. a PROM) B than with the score on instrument C. In Table 8 some generic hypotheses are presented that could be considered as a starting point for formulating hypotheses for the review (23).


Table 8. Generic hypotheses to evaluate construct validity and responsiveness
Generic hypotheses*:
1. Correlations with (changes in) instruments measuring similar constructs should be ≥ 0.50.
2. Correlations with (changes in) instruments measuring related, but dissimilar constructs should be lower, i.e. 0.30-0.50.
3. Correlations with (changes in) instruments measuring unrelated constructs should be < 0.30.
4. Correlations defined under 1, 2, and 3 should differ by a minimum of 0.10.
5. Meaningful changes between relevant (sub)groups (e.g. patients with expected high vs low levels of the construct of interest)
6. For responsiveness, AUC should be ≥ 0.70
* Reproduced with approval from De Vet et al. (23)
AUC = Area Under the Curve with an external measure of change used as the 'gold standard'
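To illustrate how extracted correlations can be confronted with such generic hypotheses, below is a minimal Python sketch. The function name, the comparator categories and the example values are hypothetical; real reviews should formulate construct-specific hypotheses a priori as described above.

```python
def hypothesis_confirmed(comparator_type, correlation):
    """Check one extracted correlation against generic hypotheses 1-3 of Table 8
    (similar: >= 0.50; related but dissimilar: 0.30-0.50; unrelated: < 0.30)."""
    r = abs(correlation)
    if comparator_type == "similar":
        return r >= 0.50
    if comparator_type == "related":
        return 0.30 <= r < 0.50
    if comparator_type == "unrelated":
        return r < 0.30
    raise ValueError(f"unknown comparator type: {comparator_type}")

# Hypothetical correlations of a physical-function subscale with comparator PROMs
results = [("similar", 0.62), ("similar", 0.47), ("related", 0.41), ("unrelated", 0.12)]
confirmed = [hypothesis_confirmed(kind, r) for kind, r in results]
print(sum(confirmed) / len(confirmed))  # fraction of results in accordance with the hypotheses
```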

Hypotheses can be based on literature (including the studies included in the review) and (clinical) experiences of the review team members. In the original COSMIN checklist, one of the standards included in the box on hypotheses testing for construct validity and responsiveness was about whether hypotheses were stated in the study, which often led to an 'inadequate' rating. By recommending that the review team formulate hypotheses in the new methodology, the methodological quality of an included study is no longer dependent on whether or not hypotheses were formulated in these studies. All results found in included articles can now be compared against this same set of hypotheses, and more results can be used to build a conclusion about the construct validity of a PROM. Additional hypotheses may need to be formulated during the review process, depending on the observed results in the included studies. We do not consider it a problem if the hypotheses are not all defined a priori, as long as the hypotheses are reported in the review, making the methodology transparent. If it is impossible to formulate a good hypothesis for a specific analysis (e.g. about an expected magnitude of difference between two subgroups), and the authors of the study did not state their expectations, the review team can consider ignoring these results, as it is unknown how the results should be interpreted.

When assessing responsiveness, one of the most difficult tasks is formulating challenging hypotheses. With challenging hypotheses we aim to show that the instrument truly measures changes in the construct(s) it purports to measure. This means that the instrument should measure changes in the purported construct(s), but also that it should measure the right amount of change, i.e. it should not under- or overestimate the real change in the construct that has occurred. This latter aspect is often overlooked in assessing responsiveness. For example, the hypotheses can concern expected mean differences between changes in groups, or expected correlations between changes in the scores on the instrument and changes in other variables, such as scores on other instruments, or demographic or clinical variables. Hypotheses about expected effect sizes (ES) or similar measures such as the standardized response mean (SRM) can also be used, but only when an explicit hypothesis (and rationale) for the expected magnitude of the effect size is given. The hypotheses may also concern the relative magnitude of correlations, for example a statement that change in instrument A is expected to correlate higher with change in instrument B than with change in instrument C.

Evaluating the measurement properties and grading the quality of the evidence
The evaluation of the measurement properties consists of the three substeps described earlier in Chapter 3.1 (Figure 4). First, the risk of bias in each study on these measurement properties is assessed using the COSMIN Risk of Bias checklist (see Chapter 5 for detailed instructions on how to rate the standards). Second, data is extracted on the characteristics of the PROM(s), on characteristics of the included study population(s), and on the results on the measurement properties (see Appendices 3-6), and the result per study is evaluated against the criteria for good measurement properties (see Table 4). Third, all results per measurement property of a PROM are quantitatively pooled or qualitatively summarized, and this pooled or summarized result is evaluated against the criteria for good measurement properties to get an overall rating for the measurement property (Table 4). Finally, the quality of the evidence is graded using the modified GRADE approach, as described in Chapter 3.1.

Specific aspects to consider for measurement error
When applying the criteria for good measurement error, information is needed on the Smallest Detectable Change (SDC) or Limits of Agreement (LoA), as well as on the Minimal Important Change (MIC). This information may come from different studies. The MIC should have been determined using an anchor-based longitudinal approach (38-41). The MIC is best calculated from multiple studies and by using multiple anchors. It is up to the review team to decide whether the quality of the evidence should be downgraded (e.g. one level) when information on the MIC value is available from only one study, or when the study on the MIC was not adequately performed (e.g. insufficient validity of the anchor) (42, 43). When an MIC value is determined using a distribution-based method, the MIC value does not, in fact, reflect what patients consider important, but is rather an indicator of the measurement error of the PROM. Results found in such studies should either be ignored or considered as evidence on measurement error (23). If not enough information is available to judge whether the SDC or LoA is smaller than the MIC, we recommend just reporting the information that is available on the SDC or LoA without grading the quality of evidence (note that information on the MIC alone provides information on the interpretability of a PROM, see Chapter 4.1).

Specific aspects to consider for hypotheses testing and responsiveness
In studies on construct validity and responsiveness, different study designs may have been used, i.e. comparisons with other outcome measurement instruments, comparisons between groups, or comparisons before and after an intervention. Therefore, the boxes Hypotheses testing for construct validity and Responsiveness consist of two and four parts, respectively (see Chapters 5.7 and 5.8). The methodological quality of each part will be rated separately, using the "worst score counts" method (12). In general, each result (e.g. a correlation between scores of two outcome measurement instruments, or an ES) can be considered as a single study. When, for example, the PROM under study is compared to four different outcome measurement instruments, four correlations are computed, and this could be considered as four single 'studies'. Basically, the box Hypotheses testing should be completed four times. However, when each standard for all four 'studies' would be rated the same (e.g. the constructs measured by all four comparison outcome measurement instruments are clear and all comparison instruments are of good quality), the ratings can be combined into one risk of bias rating (i.e. the methodological quality of the studies is 'very good'). Next, each result is compared against the criterion (i.e. whether or not the hypothesis was confirmed), and reported in the results table (see Appendix 7). If the methodological quality of each 'study' is not similar, it can be assessed separately. In Appendix 7, an example is given (called 'ref 6' under hypotheses testing) of an article in which three hypotheses were tested in adequate studies and the results are in accordance with the hypotheses, and three other hypotheses were tested in inadequate studies, of which one result was in accordance with the hypotheses and two results were not. Moreover, in one article both hypotheses about comparison with other instruments and hypotheses about expected differences between subgroups could be tested. In that case, different parts of the box Hypotheses testing for construct validity will be used (parts a and b, respectively). However, when the ratings (for methodological quality) are the same, they can be reported together.
Next, the quality of the evidence should be determined. We recommend determining this in the same way as for the other measurement properties. In this step, we consider all results reported in one article together as one study. We do not grade the evidence per hypothesis or per study design, because otherwise high quality evidence for hypotheses testing could too easily be reached. An example is provided in Appendix 7. Here, 11 out of 13 results are rated as sufficient. We conclude that we have consistent findings for sufficient hypotheses testing for construct validity, as 85% of the results are in line with the hypotheses. We have one very good study. Therefore, we do not downgrade for risk of bias. Depending on imprecision (i.e. total sample size of the available studies) and indirectness we could downgrade, if needed.
In studies on construct validity and responsiveness, some results may provide stronger evidence than other results. For example, correlations with PROMs measuring similar constructs (i.e. convergent validity) can be considered as providing more evidence than correlations with PROMs measuring dissimilar constructs. This can be taken into account when determining the pooled or summarized result for construct validity, especially when results are inconsistent. For example, when results about comparisons with PROMs measuring similar constructs are consistently not in accordance with the hypotheses, while results about comparisons with dissimilar constructs tend to be in accordance with the hypotheses, the review team could decide to put more emphasis on the results of the hypotheses concerning comparisons with instruments measuring similar constructs.
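The bookkeeping of the 11-out-of-13 example above can be sketched as follows in Python. The function name and the handling of indeterminate results are our own illustration of the 'at least 75% in accordance' principle from Chapter 3.1; the review team remains responsible for judging unexplained inconsistency.

```python
def overall_rating(result_ratings):
    """Summarize per-result ratings ('+', '-', '?') into one overall rating,
    using the 'at least 75% of the results' principle."""
    if not result_ratings:
        return None
    if all(r == "?" for r in result_ratings):
        return "?"                       # all indeterminate
    rated = [r for r in result_ratings if r in ("+", "-")]
    if rated.count("+") / len(rated) >= 0.75:
        return "+"                       # sufficient
    if rated.count("-") / len(rated) >= 0.75:
        return "-"                       # insufficient
    return "±"                           # inconsistent (consider not grading)

print(overall_rating(["+"] * 11 + ["-"] * 2))  # 11/13 = 85% confirmed -> '+'
```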


4. Part C steps 8-10: Selecting a PROM
Part C, steps 8-10, concerns the description of data on interpretability and feasibility of the PROMs (step 8), formulating recommendations based on all evidence (step 9), and the reporting of the systematic review (step 10).

4.1 Step 8: Describe interpretability and feasibility
Interpretability is defined as the degree to which one can assign qualitative meaning (that is, clinical or commonly understood connotations) to a PROM's quantitative scores or change in scores (21). Both the interpretability of single scores and the interpretability of change scores are informative to report in a systematic review. The interpretation of single scores can be outlined by providing information on the distribution of scores in the study population or other relevant subgroups, as it may reveal clustering of scores, and it can indicate floor and ceiling effects. The interpretability of change scores can be enhanced by reporting MIC values and information on response shift (referring to changes in the meaning of one's self-evaluation of a target construct as a result of: (a) a change in the patient's internal standards of measurement (i.e. scale recalibration); (b) a change in the patient's values (i.e. the importance of component subdomains constituting the target construct); or (c) a redefinition of the target construct (i.e. reconceptualization) (44)) (see Appendix 5).
Feasibility is defined as the ease of application of the PROM in its intended context of use, given constraints such as time or money (45), for example completion time, cost of an instrument, length of the instrument, and type and ease of administration (see Appendix 6 for all aspects of feasibility) (26). Feasibility applies to patients completing the PROM (self-administered) and to researchers or clinicians who interview or hand over the PROM to patients. The concept 'feasibility' is related to the concept 'clinical utility': whereas feasibility focuses on PROMs, clinical utility refers to an intervention (46).
Interpretability and feasibility are not measurement properties because they do not refer to the quality of a PROM. However, they are considered important aspects for a well-considered selection of a PROM. In case there are two PROMs that are very difficult to differentiate in terms of quality, it is recommended that feasibility aspects (e.g. time aspects and budget restrictions) be taken into consideration in the selection of the most appropriate instrument. Floor and ceiling effects can indicate insufficient content validity, and can result in insufficient reliability.
Interpretability and feasibility are only described, and not evaluated, as these are no formal measurement properties (i.e. they do not tell us something about the quality of an instrument; rather, they are important aspects that should be considered in the selection of the most suitable instrument for a specific purpose).


4.2 Step 9: formulate recommendations
Recommendations on the most suitable PROM for use in an evaluative application are formulated with respect to the construct of interest and study population. To come to an evidence-based and fully transparent recommendation, we recommend categorizing the included PROMs into three categories:
(A) PROMs with evidence for sufficient content validity (any level) AND at least low quality evidence for sufficient internal consistency;
(B) PROMs not categorized as A or C;
(C) PROMs with high quality evidence for an insufficient measurement property.
PROMs categorized as 'A' can be recommended for use and results obtained with these PROMs can be trusted. PROMs categorized as 'B' have potential to be recommended for use, but they require further research to assess their quality. PROMs categorized as 'C' should not be recommended for use. When only PROMs categorized as 'B' are found in a review, the one with the best evidence for content validity could be the one to be provisionally recommended for use, until further evidence is provided.
Justification should be given as to why a PROM is placed in a certain category, and direction should be given on future validation work, if applicable. To work towards standardization in outcome measurement (e.g. core outcome set development) and to facilitate meta-analyses of studies using PROMs, we recommend subsequently advising on one most suitable PROM (5). This recommendation does not only have to be based on the evaluation of the measurement properties, but may also depend on interpretability and feasibility aspects.

4.3 Step 10: report the systematic review
In accordance with the PRISMA Statement (18), we recommend reporting the following information:
(1) the search strategy (for example on a website or in the (online) supplemental materials to the article at issue), and the results of the literature search and the selection of the studies and PROMs, displayed in the PRISMA flow diagram (including the final number of articles and the final number of PROMs included in the review) (Appendix 2);
(2) the characteristics of the included PROMs, such as the name of the PROMs, the reference to the article in which the development of the PROM is described, the constructs being measured, the language and study population for which the PROM was developed, the intended context(s) of use, the available language versions of the PROM, the number of scales or subscales, the number of items, the response options, the recall period, interpretability aspects, and feasibility aspects (Appendix 3);
(3) the characteristics of the included study populations, such as geographical location, language, important disease characteristics, target population, sample size, age, gender, and setting (Appendix 4);
(4) the methodological quality ratings of each study per measurement property per PROM (i.e. very good, adequate, doubtful, inadequate), the results of each study, and the accompanying ratings of the results based on the criteria for good measurement properties (sufficient (+) / insufficient (-) / indeterminate (?)) (Appendix 7). This table could be published, for example, as an appendix or as supplemental material on the website of the journal only;
(5) a Summary of Findings (SoF) table per measurement property, including the pooled or summarized results of the measurement properties, the overall rating (i.e. sufficient (+) / insufficient (-) / inconsistent (±) / indeterminate (?)), and the grading of the quality of evidence (high, moderate, low, very low). An example of a SoF table can be found in Appendix 8. These SoF tables (i.e. one per measurement property) will ultimately be used in providing recommendations for the selection of the most appropriate PROM for a given purpose or a particular context of use. Note that these tables can be used in the data extraction process throughout the entire review.


5. COSMIN Risk of Bias checklist
In this chapter we elaborate on all standards included in boxes 3 to 10 of the COSMIN Risk of Bias checklist. An elaboration of all standards included in box 1 PROM development and box 2 Content validity is provided in the users' manual for guidance on how to evaluate the content validity of PROMs. This manual can be found on the COSMIN website (33). Each box contains an item asking if there were any other important methodological flaws that are not covered by the checklist, but that may lead to biased results or conclusions. For some of the boxes we provide examples of such flaws below.

5.1 Assessing risk of bias in a study on structural validity (box 3)
Structural validity refers to the degree to which the scores of a PROM are an adequate reflection of the dimensionality of the construct to be measured (21) and is usually assessed by factor analysis.

Box 3. Structural validity
Does the scale consist of effect indicators, i.e. is it based on a reflective model? yes/no
Does the study concern unidimensionality or structural validity? unidimensionality/structural validity

The box on structural validity starts with two questions that can help the reviewer decide whether this measurement property is relevant for the instrument under study (i.e. the question on the reflective model), or to become aware of the specific purpose of the study (i.e. the question on unidimensionality or structural validity). Both questions are not standards and they are not used in the worst score counts method.
Structural validity is only relevant for instruments that are based on a reflective model. A reflective model is a model in which all items are a manifestation of the same underlying construct. These kinds of items are called effect indicators. These items are expected to be highly correlated and interchangeable. Its counterpart is a formative model, in which the items together form the construct. These items do not need to be correlated. Therefore, structural validity is not relevant for items that are based on a formative model (24, 47, 48). For example, stress could be measured by asking about the occurrence of different situations and events that might lead to stress, such as job loss, death in the family, divorce, etc. (49). These events do not need to be correlated, thus structural validity is not relevant for such an instrument.
Often, authors do not explicitly describe whether their instrument is based on a reflective or formative model. To decide afterwards which model is used, one can do a simple "thought test". With this test one should consider whether all item scores are expected to change when the construct changes. If yes, the model can be considered reflective. If not, the PROM is probably based on a formative model (24, 47).


It is not always possible to decide afterwards if the instrument is based on a reflective or formative model and thus whether structural validity is relevant. In this case, we recommend completing the box to assess the quality of the analyses that were performed in the included studies.
The measurement property structural validity refers to the model fit of a factor analysis on all items in an outcome measurement instrument, e.g. to confirm a 3-factor model for an instrument with three subscales. Unidimensionality refers to whether the items in a scale or subscale measure a single construct. It is an assumption for internal consistency and IRT/Rasch analyses. For the purpose of checking unidimensionality, each subscale from a multidimensional PROM can separately be evaluated with a factor analysis or IRT/Rasch analyses. An evaluation of unidimensionality is sufficient for the interpretation of an internal consistency analysis and IRT/Rasch analysis, but it does not provide evidence for structural validity as part of the construct validity of a multidimensional PROM. That is because it does not provide evidence, for example, that an instrument with three subscales indeed measures three different constructs.
A question was added to the box on structural validity about whether the aim of the study was to assess unidimensionality or to assess structural validity. Although the standards for assessing unidimensionality and structural validity are the same, the purpose is different and the conclusion about the PROM can be different, and reviewers should take this into account when reporting the results of such studies in a systematic review. This question is not a standard and it is not used in the worst score counts method. It only helps the reviewers to be aware of which purpose was served in the study.

Statistical methods (rated very good / adequate / doubtful / inadequate / NA)
1. For CTT: Was exploratory or confirmatory factor analysis performed?
   Response options (in decreasing order of quality): Confirmatory factor analysis performed / Exploratory factor analysis performed / No exploratory or confirmatory factor analysis performed
   NA: Not applicable

Standard 1. To determine the structure of the instrument, a factor analysis is the preferred statistical method when using CTT. Although confirmatory factor analysis is preferred over exploratory factor analysis, both can be useful for the evaluation of structural validity. Confirmatory factor analysis tests whether the data fit a premeditated factor structure (50). Based on theory or previous analyses, a priori hypotheses are formulated and tested. Exploratory factor analysis can be used when no clear hypotheses exist about the underlying dimensions (50).

Statistical methods
2. For IRT/Rasch: Does the chosen model fit the research question?
   Very good: Chosen model fits well to the research question
   Adequate: Assumable that the chosen model fits well to the research question
   Doubtful: Doubtful if the chosen model fits well to the research question
   Inadequate: Chosen model does not fit the research question
   NA: Not applicable


Standard 2. An example of a model that does not fit the research question is when follow-up data are combined but not analysed using a multi-level IRT model.

Statistical methods
3. Was the sample size included in the analysis adequate?
   Very good: FA: 7 times the number of items, and ≥ 100; Rasch/1PL models: ≥ 200 subjects; 2PL parametric IRT models OR Mokken scale analysis: ≥ 1000 subjects
   Adequate: FA: at least 5 times the number of items, and ≥ 100, OR at least 6 times the number of items, but < 100; Rasch/1PL models: 100-199 subjects; 2PL parametric IRT models OR Mokken scale analysis: 500-999 subjects
   Doubtful: FA: 5 times the number of items, but < 100; Rasch/1PL models: 50-99 subjects; 2PL parametric IRT models OR Mokken scale analysis: 250-499 subjects
   Inadequate: FA: < 5 times the number of items; Rasch/1PL models: < 50 subjects; 2PL parametric IRT models OR Mokken scale analysis: < 250 subjects
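The factor-analysis column of this standard can be captured in a small helper; a minimal Python sketch with a hypothetical function name (the IRT/Rasch and Mokken columns would need their own thresholds). As the elaboration of standard 3 below stresses, these are context-dependent rules of thumb, not hard cut-offs.

```python
def fa_sample_size_rating(n_subjects, n_items):
    """Rate the sample size of a classical (CTT) factor analysis against the
    rule-of-thumb thresholds of standard 3 above."""
    ratio = n_subjects / n_items
    if ratio >= 7 and n_subjects >= 100:
        return "very good"
    if (ratio >= 5 and n_subjects >= 100) or (ratio >= 6 and n_subjects < 100):
        return "adequate"
    if ratio >= 5 and n_subjects < 100:
        return "doubtful"
    return "inadequate"

print(fa_sample_size_rating(n_subjects=120, n_items=20))  # -> 'adequate'
```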

Standard 3. Factor analyses and IRT/Rasch analyses require large sample sizes. The proposed requirements were based on Comrey (51), Brown (chapter 10) (52), Embretson and Reise (53), Edelen and Reeve (54), Reise and Yu (55), Reise et al. (56), and Linacre (57). These sample size requirements are rules of thumb, and are context related. Sample size requirements increase as a result of the complexity of the model (Edelen et al. 2007). Moreover, depending on the research question, different levels of precision may be acceptable, which again relates to the sample size needed (Edelen et al. 2007). Another related consideration is the sampling distribution of the respondents. The sample should reflect the population of interest and contain enough respondents with extreme scores so that items even at extreme ends of the construct continuum will have reasonable standard errors associated with their estimated parameters (Edelen et al. 2007).

Other
4. Were there any other important flaws in the design or statistical methods of the study?
   Response options (in decreasing order of quality): No other important methodological flaws / Other minor methodological flaws (e.g. rotation method not described) / Other important methodological flaws (e.g. inappropriate rotation method)

We have not defined specific requirements for factor analyses, such as the choice of the exploratory factor analysis (principal component analysis or common factor analysis), the choice and justification of the rotation method (e.g. orthogonal or oblique rotation), or the decision about the number of relevant factors. Such specific requirements are described by e.g. Floyd & Widaman (50) and de Vet et al. (58). When there are serious flaws in the quality of the factor analysis, we recommend rating item 4 as "doubtful" or "inadequate".

5.2 Assessing risk of bias in a study on internal consistency (box 4)
Internal consistency refers to the degree of interrelatedness among the items and is often assessed by Cronbach's alpha. Cronbach's alphas can be pooled across studies when the results are sufficiently similar. For an example, we refer to a study performed by Collins and colleagues (8). For an appropriate interpretation of the internal consistency parameter, the items together should form a unidimensional scale or subscale. Internal consistency and unidimensionality are not the same. Internal consistency is the interrelatedness among the items (59). Unidimensionality refers to whether the items in a scale or subscale measure a single construct. It is a prerequisite for a clear interpretation of the internal consistency statistics (59, 60), and can be investigated, for example, by factor analysis (61) or IRT methods (see structural validity, box 3).

Box 4. Internal consistency
Does the scale consist of effect indicators, i.e. is it based on a reflective model? yes/no

The first question in the box Internal consistency concerns the relevance of the assessment of the measurement property internal consistency for the PROM under study. The internal consistency statistic only gets an interpretable meaning when the interrelatedness is determined among a set of items that together form a reflective model (59, 60). See also box 3 for an explanation about reflective models.

Design requirements (rated very good / adequate / doubtful / inadequate / NA)
1. Was an internal consistency statistic calculated for each unidimensional scale or subscale separately?
   Very good: Internal consistency statistic calculated for each unidimensional scale or subscale
   Doubtful: Unclear whether scale or subscale is unidimensional
   Inadequate: Internal consistency statistic NOT calculated on a unidimensional scale

Standard 1. The internal consistency coefficient should be calculated for each unidimensional subscale separately. Information on unidimensionality (or structural validity) can be found either in the same study or in other studies. Based on at least low quality evidence for structural validity, we determine whether or not the items from the (sub)scales form a unidimensional scale. When an internal consistency parameter is calculated per subscale as well as for a multidimensional total scale (for example the total score of four subscales), this latter result can be ignored, as it cannot be interpreted. If an internal consistency parameter is only reported for a multidimensional total scale, this standard should be rated 'inadequate'. If no information is found in the literature on the structural validity or unidimensionality of a PROM, this standard can be rated 'doubtful'. In this case, we recommend reporting the results found on internal consistency, without drawing a conclusion.

Statistical methods
2. For continuous scores: Was Cronbach's alpha or omega calculated?
   Response options (in decreasing order of quality): Cronbach's alpha or omega calculated / Only item-total correlations calculated / No Cronbach's alpha and no item-total correlations calculated
   NA: Not applicable
3. For dichotomous scores: Was Cronbach's alpha or KR-20 calculated?
   Response options (in decreasing order of quality): Cronbach's alpha or KR-20 calculated / Only item-total correlations calculated / No Cronbach's alpha or KR-20 and no item-total correlations calculated
   NA: Not applicable
4. For IRT-based scores: Was the standard error of the theta (SE(θ)) or the reliability coefficient of the estimated latent trait value (index of (subject or item) separation) calculated?
   Very good: SE(θ) or reliability coefficient calculated
   Inadequate: SE(θ) or reliability coefficient NOT calculated
   NA: Not applicable
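For illustration, below is a minimal Python sketch (assuming NumPy) of Cronbach's alpha for one unidimensional (sub)scale, computed from a persons-by-items score matrix. The function name and the example data are hypothetical; omega would additionally require a fitted factor model and is not shown.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for one unidimensional (sub)scale.
    `item_scores` is an (n_persons, n_items) array of item scores."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]                              # number of items
    sum_item_var = x.var(axis=0, ddof=1).sum()  # sum of the item variances
    total_var = x.sum(axis=1).var(ddof=1)       # variance of the sum score
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Hypothetical example: 5 patients answering a 4-item subscale
scores = np.array([[3, 4, 3, 4], [2, 2, 1, 2], [4, 5, 4, 4], [1, 2, 2, 1], [3, 3, 4, 3]])
print(round(cronbach_alpha(scores), 2))
```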

Standards 2 and 3. Studies based on CTT should report Cronbach's alpha or omega values (62).
Standard 4. Several reliability indices are available that are based on IRT/Rasch analyses. These indices are based on one score. Examples are the standard error of theta and the index of person or subject separation.

5.3 Assessing risk of bias in a study on cross-cultural validity\measurement invariance (box 5)
Cross-cultural validity refers to the degree to which the performance of the items on a translated or culturally adapted instrument is an adequate reflection of the performance of the items of the original version of the instrument. To assess this measurement property, data from at least two different groups are required, e.g. a Dutch and an English sample, or males and females. We recommend completing this box for studies that have evaluated cross-cultural validity for PROMs across culturally different populations. We interpret 'culturally different population' broadly. We do not only consider different ethnicity or language groups as different cultural populations, but also other groups, such as different gender or age groups, or different patient populations. Cross-cultural validity is evaluated by assessing whether the scale is measurement invariant or whether or not Differential Item Functioning (DIF) occurs. Measurement Invariance (MI) and non-DIF refer to whether respondents from different groups with the same latent trait level (allowing for group differences) respond similarly to a particular item.


Box 5. Cross-cultural validity\Measurement invariance

Design requirements (rated very good / adequate / doubtful / inadequate / NA)
1. Were the samples similar for relevant characteristics except for the group variable?
   Very good: Evidence provided that samples were similar for relevant characteristics except the group variable
   Adequate: Stated (but no evidence provided) that samples were similar for relevant characteristics except the group variable
   Doubtful: Unclear whether samples were similar for relevant characteristics except the group variable
   Inadequate: Samples were NOT similar for relevant characteristics except the group variable

Standard 1. When evaluating cross-cultural validity, scores of two (or more) groups are directly compared in one statistical model. These could be language groups (e.g. when the Dutch version of a PROM is being compared to the English version of the same PROM), or groups based on another variable, such as males versus females, or healthy and chronically ill people. Apart from the group variable, the groups should be similar for relevant characteristics, such as disease severity, age, etc. In one study gender could be the group variable, while in another study (e.g. comparing two language groups) gender is considered a relevant characteristic of the groups and is expected to be similarly distributed across the groups. It is up to the review team to judge whether all relevant characteristics are similarly distributed across the groups.

Statistical methods
2. Was an appropriate approach used to analyse the data?
   Very good: A widely recognized or well justified approach was used
   Adequate: Assumable that the approach was appropriate, but not clearly described
   Doubtful: Not clear what approach was used, or doubtful whether the approach was appropriate
   Inadequate: Approach NOT appropriate
   NA: Not applicable

Standard 2. Adequate approaches for assessing cross-cultural validity using CTT are regression analyses or confirmatory factor analysis (CFA). For an explanation of the use of ordinal regression models, we refer to Crane et al. (63) or Petersen et al. (64). For an explanation of CFA to investigate measurement invariance we refer to Gregorich (65). An adequate approach for assessing cross-cultural validity using IRT methods is Differential Item Functioning (DIF) analysis (66).
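As a toy illustration of a DIF screen, the sketch below regresses one item score on a trait proxy (e.g. the rest score or total score) with and without the group indicator and reports the gain in explained variance. The cited approach of Crane et al. (63) uses ordinal logistic regression and pseudo-R² change; this simpler linear variant, the function name and the simulated data are purely hypothetical illustrations, not the recommended analysis.

```python
import numpy as np

def uniform_dif_check(item, total, group):
    """Crude uniform-DIF screen for one item: compare R^2 of item ~ trait proxy
    with R^2 of item ~ trait proxy + group indicator."""
    item = np.asarray(item, float)
    X0 = np.column_stack([np.ones(len(item)), np.asarray(total, float)])  # intercept + trait proxy
    X1 = np.column_stack([X0, np.asarray(group, float)])                  # + group indicator
    r2 = []
    for X in (X0, X1):
        beta, *_ = np.linalg.lstsq(X, item, rcond=None)
        resid = item - X @ beta
        r2.append(1 - resid.var() / item.var())
    return r2[1] - r2[0]   # change in R^2 attributable to the group term

# Hypothetical data: 0/1 language group, total score as trait proxy, one ordinal item
rng = np.random.default_rng(0)
total = rng.normal(20, 5, 200)
group = rng.integers(0, 2, 200)
item = 0.1 * total + 0.5 * group + rng.normal(0, 1, 200)   # built-in uniform DIF
print(round(uniform_dif_check(item, total, group), 3))
```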


Statistical methods
3. Was the sample size included in the analysis adequate?
   Very good: Regression analyses or IRT/Rasch based analyses: 200 subjects per group; MGCFA*: 7 times the number of items, and ≥ 100
   Adequate: 150 subjects per group; 5 times the number of items and ≥ 100, OR 5-7 times the number of items but < 100
   Doubtful: 100 subjects per group; 5 times the number of items, but < 100
   Inadequate: < 100 subjects per group; < 5 times the number of items

Standard 3. CFA, IRT analysis and regression analysis require large sample sizes to obtain reliable results. As the results of these studies cannot be pooled, a sample size requirement was included as a standard. These recommendations are based on a publication by Scott et al. (67).

5.4 Assessing risk of bias in a study on reliability (box 6)
Reliability refers to the proportion of the total variance in the measurements which is due to 'true' differences between patients. The word 'true' must be seen in the context of CTT, which states that any observation is composed of two components: a true score and error associated with the observation. 'True' is the average score that would be obtained if the scale was administered an infinite number of times to the same person. It refers only to the consistency of the score, and not to its accuracy (22). Reliability can also be explained as the ability of a PROM to distinguish between patients. Within a homogeneous group, it is hard to distinguish between patients (23). An important assumption made in a reliability study (and in a study on measurement error) is that patients are stable on the construct to be measured between the repeated measurements.

Box 6. Reliability

Design requirements (rated very good / adequate / doubtful / inadequate / NA)
1. Were patients stable in the interim period on the construct to be measured?
   Very good: Evidence provided that patients were stable
   Adequate: Assumable that patients were stable
   Doubtful: Unclear if patients were stable
   Inadequate: Patients were NOT stable


2. Was the time interval appropriate?
   Very good: Time interval appropriate
   Doubtful: Doubtful whether time interval was appropriate, or time interval was not stated
   Inadequate: Time interval NOT appropriate
3. Were the test conditions similar for the measurements? (e.g. type of administration, environment, instructions)
   Very good: Test conditions were similar (evidence provided)
   Adequate: Assumable that test conditions were similar
   Doubtful: Unclear if test conditions were similar
   Inadequate: Test conditions were NOT similar

Standard 1. Patients should be stable with regard to the construct to be measured between the administrations. What "stable patients" are depends on the construct to be measured and the target population. Evidence that patients were stable could be, for example, an assessment of a global rating of change, completed by patients or physicians. When an intervention is given in the interim period, one can assume that (many of) the patients have changed on the construct to be measured. In that case, we recommend rating Standard 2 as "inadequate".
Standard 2. The time interval between the administrations must be appropriate. The time interval should be long enough to prevent recall bias, and short enough to ensure that patients have not changed on the construct to be measured. What an appropriate time interval is depends on the construct to be measured and the target population. A time interval of about two weeks is often considered appropriate for the evaluation of PROMs (22).
Standard 3. A final requirement is that the test conditions should be similar. Test conditions refer to the type of administration (e.g. a self-administered questionnaire, interview, performance test), the setting in which the instrument was administered (e.g. at the hospital, or at home), and the instructions given. These test conditions may influence the responses of a patient. The reliability may be underestimated if the test conditions are not similar. However, in clinical practice different test conditions might be used interchangeably, and a specific research question could be whether the reliability of a PROM is still appropriate when using the PROM under different test conditions. In this case, the test conditions do not need to be similar, as they are supposed to vary as part of the research aim of the study. In that case, this item can be rated as 'very good'. See, for example, Van Leeuwen (68).


Statistical methods
4. For continuous scores: Was an intraclass correlation coefficient (ICC) calculated?
   Very good: ICC calculated and model or formula of the ICC is described
   Adequate: ICC calculated but model or formula of the ICC not described or not optimal; OR Pearson or Spearman correlation coefficient calculated with evidence provided that no systematic change has occurred
   Doubtful: Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic change has occurred, or WITH evidence that systematic change has occurred
   Inadequate: No ICC or Pearson or Spearman correlations calculated
   NA: Not applicable
5. For dichotomous/nominal/ordinal scores: Was kappa calculated?
   Response options (in decreasing order of quality): Kappa calculated / No kappa calculated
   NA: Not applicable
6. For ordinal scores: Was a weighted kappa calculated?
   Response options (in decreasing order of quality): Weighted kappa calculated / Unweighted kappa calculated or not described
   NA: Not applicable
7. For ordinal scores: Was the weighting scheme described? (e.g. linear, quadratic)
   Response options (in decreasing order of quality): Weighting scheme described / Weighting scheme NOT described
   NA: Not applicable

The preferred reliability statistic depends on the type of response options.
Standard 4. For continuous scores the intraclass correlation coefficient (ICC) is preferred (22, 69). To assess the reliability of a PROM, often a straightforward test-retest reliability study design is chosen. The preferred ICC model in that case is the two-way random effects model (70), in which, in addition to the variance within patients, the variance (i.e. systematic differences) between time points is taken into account (23). The use of the Pearson and Spearman correlation coefficients is considered doubtful when it is not clear whether systematic differences occurred, because these correlations do not take systematic error into account.
Standards 5, 6, and 7. For dichotomous or nominal scores Cohen's kappa is the preferred statistical method (22). For ordinal scales partial agreement should be taken into account, and therefore a weighted kappa (22, 71, 72) is preferred. A description of the weights (e.g. linear or quadratic weights (73)) should be given. Proportion agreement is considered not adequate, as it is a parameter for measurement error (see box 7). An unweighted kappa considers any misclassification equally inappropriate. However, a misclassification of two adjacent categories may be less erroneous than a misclassification of categories that are further apart from each other. A weighted kappa takes this into account. However, the aim of a study could be to assess the reliability of any misclassification, making an unweighted kappa an appropriate parameter. In such a study, Standard 7 can be rated as 'very good'.
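For illustration, below is a minimal NumPy sketch of an ICC for absolute agreement, single measurement, under a two-way random effects model (test-retest design). The function name and the example data are hypothetical; real analyses should also report the ICC model, formula and 95% confidence interval.

```python
import numpy as np

def icc_two_way_random(scores):
    """ICC (absolute agreement, single measurement, two-way random effects model).
    `scores` is an (n_patients, k_timepoints) array; systematic differences
    between time points are taken into account."""
    x = np.asarray(scores, float)
    n, k = x.shape
    grand = x.mean()
    ms_rows = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # between patients
    ms_cols = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # between time points
    sse = ((x - x.mean(axis=1, keepdims=True)
              - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_err = sse / ((n - 1) * (k - 1))                              # residual
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical test-retest scores of 5 stable patients
print(round(icc_two_way_random(np.array([[8, 7], [5, 6], [9, 9], [3, 4], [6, 6]])), 2))
```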


Other
8. Were there any other important flaws in the design or statistical methods of the study?
   Response options (in decreasing order of quality): No other important methodological flaws / Other minor methodological flaws / Other important methodological flaws

Standard 8. An example of another important flaw is when the administrations were not independent. Independent administrations imply that the first administration has not influenced the second administration. At the second administration the patient should not have been aware of the scores on the first administration. Another example concerns a PROM administered in an interview. Suppose one experienced interviewer conducts the first administration of the interview for all patients, the second administration is done either by an experienced or by an inexperienced interviewer, it is not clear which patient was interviewed by which interviewer, and this was not taken into account in the analysis. If a low ICC is found, it is then unknown whether this is due to the different characteristics of the interviewers (which could be improved by standardizing requirements for interviewers), or due to an insufficient quality of the PROM.

5.5 Assessing risk of bias in a study on measurement error (box 7)
Measurement error refers to the systematic and random error of an individual patient's score that is not attributed to true changes in the construct to be measured.

Box 7. Measurement error

Design requirements (rated very good / adequate / doubtful / inadequate / NA)
1. Were patients stable in the interim period on the construct to be measured?
   Very good: Patients were stable (evidence provided)
   Adequate: Assumable that patients were stable
   Doubtful: Unclear if patients were stable
   Inadequate: Patients were NOT stable
2. Was the time interval appropriate?
   Very good: Time interval appropriate
   Doubtful: Doubtful whether time interval was appropriate, or time interval was not stated
   Inadequate: Time interval NOT appropriate
3. Were the test conditions similar for the measurements? (e.g. type of administration, environment, instructions)
   Very good: Test conditions were similar (evidence provided)
   Adequate: Assumable that test conditions were similar
   Doubtful: Unclear if test conditions were similar
   Inadequate: Test conditions were NOT similar

Standards 1-3: see reliability.


Statistical methods
4. For continuous scores: Was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC) or Limits of Agreement (LoA) calculated?
   Response options (in decreasing order of quality): SEM, SDC, or LoA calculated / Possible to calculate LoA from the data presented / SEM calculated based on Cronbach's alpha, or on SD from another population
   NA: Not applicable
5. For dichotomous/nominal/ordinal scores: Was the percentage (positive and negative) agreement calculated?
   Response options (in decreasing order of quality): % positive and negative agreement calculated / % agreement calculated / % agreement not calculated
   NA: Not applicable
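For illustration, below is a minimal NumPy sketch of the statistics named in standard 4, computed from a test-retest design in stable patients; the function name and example values are hypothetical. Note that, as explained in the elaboration of standard 4 below, the SEM here is based on the difference scores of the repeated measurements, not on Cronbach's alpha.

```python
import numpy as np

def measurement_error(test, retest):
    """SEM, SDC (individual level) and 95% limits of agreement from a
    test-retest design in stable patients."""
    d = np.asarray(retest, float) - np.asarray(test, float)
    sd_diff = d.std(ddof=1)
    sem = sd_diff / np.sqrt(2)            # standard error of measurement
    sdc = 1.96 * np.sqrt(2) * sem         # smallest detectable change
    loa = (d.mean() - 1.96 * sd_diff, d.mean() + 1.96 * sd_diff)
    return sem, sdc, loa

# Hypothetical test-retest scores of 6 stable patients
sem, sdc, loa = measurement_error([42, 35, 50, 47, 39, 44], [44, 33, 51, 45, 40, 45])
print(round(sem, 2), round(sdc, 2), tuple(round(x, 2) for x in loa))
```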

Standard 4. The preferred statistic for measurement error in studies based on CTT is the Standard Error of Measurement (SEM) based on a test-retest design. Note that the calculation of the SEM based on Cronbach's alpha is considered not appropriate, because it does not take the variance between time points into account (74). Other appropriate statistics for assessing measurement error are the Limits of Agreement (LoA) and the Smallest Detectable Change (SDC) (23). Both parameters are directly related to the SEM. Changes within the LoA or smaller than the SDC are likely to be due to measurement error, and changes outside the LoA or larger than the SDC should be considered as real change in individual patients. Note that this does not indicate that these changes are also meaningful to patients. That depends on what change is considered important, which is an issue of interpretability.
Standard 5. Kappa is often considered a measure of agreement; however, kappa is a measure of reliability (72). An appropriate parameter of measurement error (also called agreement) for dichotomous/nominal/ordinal PROM scores is percentage agreement.

5.6 Assessing risk of bias in a study on criterion validity (box 8)
Criterion validity refers to the degree to which the scores of a PROM are an adequate reflection of a 'gold standard'. In a systematic review, the review team should determine what reasonable 'gold standards' are for the construct to be measured. All studies comparing a PROM to this accepted gold standard can be considered a study on criterion validity.

Box 8. Criterion validity

Statistical methods (rated very good / adequate / doubtful / inadequate / NA)
1. For continuous scores: Were correlations, or the area under the receiver operating curve, calculated?
   Response options (in decreasing order of quality): Correlations or AUC calculated / Correlations or AUC NOT calculated
   NA: Not applicable
2. For dichotomous scores: Were sensitivity and specificity determined?
   Response options (in decreasing order of quality): Sensitivity and specificity calculated / Sensitivity and specificity NOT calculated
   NA: Not applicable
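For illustration, below is a minimal NumPy sketch of sensitivity and specificity for a dichotomous PROM score against a dichotomous gold standard (standard 2); the function name and the example arrays are hypothetical. For continuous PROM scores, a correlation or the area under the ROC curve would be reported instead, as explained in the elaboration of standards 1 and 2 below.

```python
import numpy as np

def sensitivity_specificity(prom_positive, gold_positive):
    """Sensitivity and specificity of a dichotomous PROM classification
    against a dichotomous 'gold standard'. Both inputs are boolean arrays."""
    prom = np.asarray(prom_positive, bool)
    gold = np.asarray(gold_positive, bool)
    sensitivity = (prom & gold).sum() / gold.sum()        # true positives / all with the condition
    specificity = (~prom & ~gold).sum() / (~gold).sum()   # true negatives / all without the condition
    return sensitivity, specificity

# Hypothetical example
sens, spec = sensitivity_specificity([True, True, False, False, True, False],
                                     [True, False, False, False, True, True])
print(round(sens, 2), round(spec, 2))
```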


Standards 1 and 2. When both the PROM and the gold standard have continuous scores, correlation is the preferred statistical method. When the instrument scores are continuous and the scores on the gold standard are dichotomous, the area under the receiver operating characteristic (ROC) curve is the preferred method, and when both scores are dichotomous, sensitivity and specificity are the preferred methods to use.

Other
3. Were there any other important flaws in the design or statistical methods of the study?
   Response options (in decreasing order of quality): No other important methodological flaws / Other minor methodological flaws / Other important methodological flaws

Standard 3. Bias may occur, for example, when a long version of a questionnaire is compared to a short version, while the scores of the short version were computed using the responses obtained with the longer version. In that case, this standard could be rated as inadequate.

5.7 Assessing risk of bias in a study on hypotheses testing for construct validity (box 9)
Hypotheses testing for construct validity refers to the degree to which the scores of a PROM are consistent with hypotheses (for instance with regard to internal relationships, relationships to scores of other instruments, or differences between relevant groups) based on the assumption that the PROM validly measures the construct to be measured. Hypotheses testing is an ongoing, iterative process (34). The more specific the hypotheses are and the more hypotheses are being tested, the more evidence is gathered for construct validity. Many types of hypotheses can be tested to evaluate the construct validity of a PROM. In general, these hypotheses concern comparisons with other outcome measurement instruments (that are not considered a gold standard), or differences in scores between 'known' groups (e.g. patients with multiple sclerosis (MS) and spasticity were expected to score statistically significantly higher on an arm and hand functioning scale, as compared with patients with MS without spasticity, because spasticity is a cause of activity limitations due to impairments of the arm and hand (68)). The box on hypotheses testing is therefore structured in two sections. To assess the risk of bias of studies comparing the PROM to comparison instruments, part a of the box (items 1-4) should be completed. For assessing the risk of bias of known-groups validity studies, part b of the box (items 5-7) should be completed. The risk of bias of studies in which MI or DIF was investigated can be assessed using Box 5 Cross-cultural validity. The overall rating is the lowest rating given on any applicable item per part of the box. If in an article the PROM under study is compared to (a) comparison instrument(s) and different subgroups are also compared, the box should be completed twice, and the results are handled as two substudies. However, when the ratings are the same, they can be combined. For example, in Appendix 7 (overview table) reference 5 tested 5 hypotheses and was rated as 'adequate'. These 5 hypotheses concerned both comparisons with other instruments and expected differences between groups.


Box 9. Hypotheses testing for construct validity

9a. Comparison with other outcome measurement instruments (convergent validity)
Design requirements (rating: very good / adequate / doubtful / inadequate / NA)
1. Is it clear what the comparator instrument(s) measure(s)?
   - Constructs measured by the comparator instrument(s) is clear
   - Constructs measured by the comparator instrument(s) is not clear
2. Were the measurement properties of the comparator instrument(s) sufficient?
   - Sufficient measurement properties of the comparator instrument(s) in a population similar to the study population
   - Sufficient measurement properties of the comparator instrument(s) but not sure if these apply to the study population
   - Some information on measurement properties of the comparator instrument(s) in any study population
   - No information on the measurement properties of the comparator instrument(s), OR evidence of insufficient measurement properties of the comparator instrument(s)

Standards 1 and 2. When the PROM under review is compared to another instrument, it should be known what construct the comparator instrument measures, and the comparator instrument itself should be of sufficient quality. If not, it is difficult to interpret the results of, for example, correlations. This information can be reported in the article included in the review; however, when it is not described, the review team could try to find additional information to decide on rating these standards. When multiple comparator instruments are used in a study, and the construct is clearly described for one instrument but not for another, we recommend considering each analysis as a separate study and completing the box multiple times. We recommend doing the same when some of the comparator instruments are of adequate quality, while others are of doubtful or inadequate quality.

Statistical methods
3. Were design and statistical methods adequate for the hypotheses to be tested?
   - Statistical methods applied appropriate
   - Assumable that statistical methods were appropriate
   - Statistical methods applied NOT optimal
   - Statistical methods applied NOT appropriate

Standard 3. Appropriate statistical methods could be correlations between the PROM and the comparator instrument. When, for example, Pearson correlations were calculated, but the distribution of scores or mean scores (SD) were not presented, it could be considered 'assumable' that the statistical methods were appropriate. P-values should not be used in testing hypotheses, because it is not relevant to examine whether correlations statistically differ from zero (75). The validity issue is about whether the direction and magnitude of a correlation is similar to what could be expected based on the construct(s) that are being measured. Another example of an appropriate method is when CFA or Structural Equation Modeling (SEM) is performed over multiple scales or subscales which are considered to measure similar or different constructs, to examine whether subscales measuring similar constructs are more related to each other than subscales measuring different constructs.
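As a sketch of what such hypothesis testing can look like in practice, the example below compares an observed Pearson correlation between a PROM and a comparator instrument with an a priori hypothesis about its expected direction and magnitude (here, hypothetically, r of at least 0.50 with an instrument measuring a related construct), instead of testing whether the correlation differs from zero. The scores and the threshold are invented for illustration and are not part of the COSMIN standards.

import numpy as np

# Made-up scores on the PROM under review and on a comparator instrument
prom = np.array([12, 18, 25, 31, 22, 40, 8, 35, 28, 15], dtype=float)
comparator = np.array([30, 35, 44, 50, 38, 66, 25, 58, 47, 33], dtype=float)

# A priori hypothesis, formulated before the analysis: the correlation with the
# comparator instrument, which measures a related construct, is expected to be >= 0.50
expected_minimum_r = 0.50

observed_r = np.corrcoef(prom, comparator)[0, 1]
hypothesis_confirmed = observed_r >= expected_minimum_r

print(f"observed r = {observed_r:.2f}; hypothesis confirmed: {hypothesis_confirmed}")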

9b. Comparison between subgroups (discriminative or known-groups validity)
Design requirements (rating: very good / adequate / doubtful / inadequate / NA)
5. Was an adequate description provided of important characteristics of the subgroups?
   - Adequate description of the important characteristics of the subgroups
   - Adequate description of most of the important characteristics of the subgroups
   - Poor or no description of the important characteristics of the subgroups

Standard 5. When PROM scores between two subgroups are compared, important characteristics of the groups should be reported, and it should be clear on which characteristics the groups differ, for example age, gender, disease characteristics, language, etc.

Statistical methods
6. Were design and statistical methods adequate for the hypotheses to be tested?
   - Statistical methods applied appropriate
   - Assumable that statistical methods were appropriate
   - Statistical methods applied NOT optimal
   - Statistical methods applied NOT appropriate

Standard 6. Many different hypotheses can be formulated and tested. The users of the COSMIN Risk of Bias checklist have to decide whether or not the statistical methods used in the article are adequate for testing the stated hypotheses. P-values should be avoided in testing hypotheses, because it is not relevant to examine whether correlations statistically differ from zero (75). The validity issue is about whether the direction and magnitude of a correlation is similar to what could be expected based on the construct(s) that are being measured. When assessing differences between groups, it is also less relevant whether these differences are statistically significant (which depends on the sample size) than whether these differences are as large as expected.
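To make the last point concrete, the sketch below evaluates a known-groups hypothesis by expressing the difference between two hypothetical subgroups as a standardized effect size and comparing it with the difference that was expected beforehand, rather than relying on a p-value. The group scores and the expected effect size of 0.8 are invented for illustration.

import numpy as np

# Made-up PROM scores for two known groups (e.g. with and without a relevant symptom)
group_with = np.array([34, 40, 38, 45, 41, 37, 43, 39], dtype=float)
group_without = np.array([28, 31, 25, 33, 30, 27, 29, 32], dtype=float)

# A priori hypothesis: the group with the symptom scores at least 0.8 SD higher (a large effect)
expected_minimum_d = 0.8

mean_diff = group_with.mean() - group_without.mean()
pooled_sd = np.sqrt((group_with.var(ddof=1) + group_without.var(ddof=1)) / 2)  # equal group sizes assumed
cohens_d = mean_diff / pooled_sd          # standardized difference between the groups

print(f"Cohen's d = {cohens_d:.2f}; "
      f"hypothesis confirmed: {cohens_d >= expected_minimum_d}")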

5.8 Assessing risk of bias in a study on responsiveness (box 10)

Responsiveness refers to the ability of a PROM to detect change over time in the construct to be measured. Although responsiveness is considered to be a separate measurement property from validity, the only difference between cross-sectional (construct and criterion) validity and responsiveness is that validity refers to the validity of a single score, whereas responsiveness refers to the validity of a change score (21). Therefore, the standards for responsiveness are similar to the standards for construct and criterion validity. Although the approach is similar, we do not use the terms construct and criterion responsiveness, because these terms are unfamiliar in the literature. As for construct and criterion validity, the design requirements for assessing responsiveness are divided into situations in which a gold standard is available (part a, standards 1-5), situations in which hypotheses are tested between change scores on the PROM under review and change scores on comparator instruments which are not gold standards (part b, standards 6-9), and situations in which hypotheses are tested about expected differences in changes between subgroups (part c, standards 10-12). In addition, standards are included for situations in which hypotheses are tested about the expected magnitude of the effect of an intervention (part d, standards 13-15).

Box 10. Responsiveness

10a. Criterion approach (i.e. comparison to a gold standard)
Statistical methods (rating: very good / adequate / doubtful / inadequate / NA)
1. For continuous scores: Were correlations between change scores, or the area under the Receiver Operating Characteristic (ROC) curve calculated?
   - Correlations or Area under the ROC Curve (AUC) calculated
   - Correlations or AUC NOT calculated
   - NA
2. For dichotomous scales: Were sensitivity and specificity (changed versus not changed) determined?
   - Sensitivity and specificity calculated
   - Sensitivity and specificity NOT calculated
   - NA

Standards 1-2 are similar to the standards for criterion validity.

10b. Construct approach (i.e. hypotheses testing; comparison with other outcome measurement instruments)
Design requirements (rating: very good / adequate / doubtful / inadequate / NA)
4. Is it clear what the comparator instrument(s) measure(s)?
   - Constructs measured by the comparator instrument(s) is clear
   - Constructs measured by the comparator instrument(s) is not clear
5. Were the measurement properties of the comparator instrument(s) sufficient?
   - Sufficient measurement properties of the comparator instrument(s) in a population similar to the study population
   - Sufficient measurement properties of the comparator instrument(s) but not sure if these apply to the study population
   - Some information on measurement properties of the comparator instrument(s) in any study population
   - No information on the measurement properties of the comparator instrument(s), OR evidence of insufficient quality of the comparator instrument(s)


Statistical methods
6. Were design and statistical methods adequate for the hypotheses to be tested?
   - Statistical methods applied appropriate
   - Assumable that statistical methods were appropriate
   - Statistical methods applied NOT optimal
   - Statistical methods applied NOT appropriate

Other
7. Were there any other important flaws in the design or statistical methods of the study?
   - No other important methodological flaws
   - Other minor methodological flaws
   - Other important methodological flaws

Standards 4-7 are similar to the standards for hypotheses testing for construct validity, part a.

10c. Construct approach (i.e. hypotheses testing; comparison between subgroups)
Design requirements (rating: very good / adequate / doubtful / inadequate / NA)
8. Was an adequate description provided of important characteristics of the subgroups?
   - Adequate description of the important characteristics of the subgroups
   - Adequate description of most of the important characteristics of the subgroups
   - Poor or no description of the important characteristics of the subgroups

Statistical methods
9. Were design and statistical methods adequate for the hypotheses to be tested?
   - Statistical methods applied appropriate
   - Assumable that statistical methods were appropriate
   - Statistical methods applied NOT optimal
   - Statistical methods applied NOT appropriate

Standards 8-9 are similar to the standards for hypotheses testing for construct validity, part b.

10d. Construct approach (i.e. hypotheses testing; before and after intervention)
Design requirements (rating: very good / adequate / doubtful / inadequate / NA)
11. Was an adequate description provided of the intervention given?
   - Adequate description of the intervention
   - Poor description of the intervention
   - NO description of the intervention

Statistical methods
12. Were design and statistical methods adequate for the hypotheses to be tested?
   - Statistical methods applied appropriate
   - Assumable that statistical methods were appropriate
   - Statistical methods applied NOT optimal
   - Statistical methods applied NOT appropriate


An effect size only has meaning as a measure of responsiveness if we know (or assume) beforehand what the magnitude of the effect of the intervention is. If, for example, we expect a large effect of the intervention on the construct measured by the PROM, we can test the hypothesis that the intervention results in an effect size of 0.8 or higher. But if we expect a small effect of the intervention, we would not expect such a high effect size. This shows that a high effect size does not necessarily indicate good responsiveness. So, if the effectiveness of an intervention or the change in health status on the construct of interest is known, one could use effect sizes (mean change score / SD baseline) (76) and related measures, such as the standardised response mean (mean change score / SD change score) (77), Norman's responsiveness coefficient (σ²change / (σ²change + σ²error)) (78), and the relative efficacy statistic ((t-statistic 1 / t-statistic 2)²) (79) to evaluate responsiveness. When several instruments are compared in the same study, this could give evidence for the relative responsiveness of the instruments, but again, only if a hypothesis is tested that includes the expected magnitude of the treatment effect.

Suppose that we have three measurement instruments (A, B and C), all measuring the same construct. The intervention given is expected to moderately affect the construct measured by the three instruments. Results show that instrument A has an effect size of 0.80, instrument B of 0.40 and instrument C of 0.15. Based on our hypothesis of a moderate effect, we should conclude that instrument B appears to measure the construct of interest best. Instrument A seems to over-estimate the treatment effect (e.g. because it shows change in persons who do not really change), and instrument C seems to under-estimate it. This example shows that it may not always be appropriate to conclude that the instrument with the highest effect size is the most responsive.

Inappropriate measures for responsiveness

Guyatt's responsiveness ratio (MIC / SD change score of stable patients) (80) is considered to be an inappropriate measure of responsiveness, because it takes the minimal important change into account. Minimal important change concerns the interpretation of the change score, not the validity of the change score. The paired t-test is also considered to be an inappropriate measure of responsiveness, because it is a measure of significant change instead of valid change, and it is dependent on the sample size of the study (75).
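As a small numerical illustration of these responsiveness statistics, the sketch below computes the effect size (mean change score / SD baseline) and the standardised response mean (mean change score / SD change score) for hypothetical before-and-after scores; whether such values support responsiveness still depends on the a priori hypothesis about the expected magnitude of the treatment effect. The scores are made up for this example.

import numpy as np

# Made-up PROM scores before and after an intervention (same patients)
baseline = np.array([40, 35, 50, 45, 38, 55, 42, 48, 37, 44], dtype=float)
follow_up = np.array([46, 38, 55, 49, 45, 58, 47, 55, 41, 49], dtype=float)

change = follow_up - baseline

effect_size = change.mean() / baseline.std(ddof=1)   # effect size: mean change score / SD baseline (76)
srm = change.mean() / change.std(ddof=1)             # standardised response mean: mean change / SD change (77)

# A priori hypothesis (illustrative): the intervention has a moderate effect, so an effect
# size of roughly 0.5 is expected, rather than "the bigger the better"
print(f"ES = {effect_size:.2f}, SRM = {srm:.2f}")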


Appendix 1. Example of a search strategy (81)

(1) construct #1: (Depressive disorder[mh] OR depression[mh] OR (depress*[tiab] NOT medline[sb]))
(2) population #2: (Diabet*[tiab])
(3) sensitive search filter developed by Terwee et al. (27)
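In the full search strategy, these three components are combined so that only records matching the construct, the population and the filter are retrieved. A typical way to do this in PubMed, shown here as an illustrative assumption rather than a quotation from the cited strategy (81), is:

#1 AND #2 AND #3 (construct AND population AND sensitive search filter for measurement properties)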


Appendix 2. Example of a flow chart

[Flow chart template, reporting the number of records at each step:
- Records identified in PubMed (n = ...), EMBASE (n = ...), other databases (n = ...), and additional records identified (n = ...)
- Records screened after removing duplicates (n = ...)
- Articles selected based on title and abstract (n = ...)
- Reasons for exclusion: validation not aim of study (n = ...); different construct measured (n = ...); different study population (n = ...); different type of outcome measure (n = ...); other (n = ...)
- Total included in the review: n articles describing x PROMs]


Appendix 3. Example of a table on characteristics of the included PROMs

Columns to include for each PROM* (with a reference to the first article):
- Construct(s)
- Target population
- Mode of administration (e.g. self-report, interview-based, parent/proxy report, etc.)
- Recall period
- (Sub)scale(s) (number of items)
- Response options
- Range of scores / scoring
- Original language
- Available translations

* Each version of a PROM is considered a separate PROM.

Other characteristics which may be extracted are: 'administration time'; 'year of development'; 'conceptual model used'; 'recommended by standardization initiatives for [a specific patient population or for the construct to be measured]'; 'number of studies evaluating the instrument'; 'completion time (minutes)'; 'full copy available'; 'licensing information and costs'.


Appendix 4. Example of a table on characteristics of the included study populations

Columns to include for each PROM and reference (Ref):
- Population: N; age, mean (SD, range) in years; gender, % female
- Disease characteristics: disease; disease duration, mean (SD) in years; disease severity
- Instrument administration: setting; country; language; response rate

In the example layout, PROM A is evaluated in references 1, 2 and 3 and PROM B in reference 1, each study population filling one row of the table.

Other characteristics which may be extracted are 'study design' and 'patient selection'.


Appendix 5. Information to extract on interpretability of PROMs

The content of this table is based on the box Interpretability from the original COSMIN checklist (1). For each PROM per reference (e.g. PROM A (ref 1), PROM A (ref 2), PROM A (ref 3), PROM B (ref 1), ...), extract:
- Distribution of scores in the study population
- Percentage of missing items and percentage of missing total scores
- Floor and ceiling effects
- Scores and change scores available for relevant (sub)groups
- Minimal important change (MIC) or minimal important difference (MID)
- Information on response shift


Appendix 6. Information to extract on feasibility of PROMs

The content of this table is based on the guideline for selecting PROMs for Core Outcome Sets (5). The following feasibility aspects are extracted for each PROM (PROM A, PROM B, PROM C, PROM D, ...):
- Patient's comprehensibility
- Clinician's comprehensibility
- Type and ease of administration
- Length of the instrument
- Completion time
- Patient's required mental and physical ability level
- Ease of standardization
- Ease of score calculation
- Copyright
- Cost of an instrument
- Required equipment
- Availability in different settings
- Regulatory agency's requirement for approval

 


Appendix 7. Table on results of studies on measurement properties

Fictional example of results of the measurement properties. For each study, the table reports the country (language) in which the questionnaire was evaluated and, per measurement property, the sample size (n), the methodological quality (meth qual) and the result with its rating.

Structural validity, internal consistency, cross-cultural validity\measurement invariance and reliability:
- PROM A (ref 1), US: structural validity: n = 878, very good, unidimensional scale (CFI 0.97, TLI 0.97) (+); internal consistency: n = 878, very good, Cronbach's alpha = 0.92 (+)
- PROM A (ref 2), Dutch: internal consistency: n = 573, very good, PSI = 0.94 (+)
- PROM A (ref 3), Spanish: internal consistency: n = 43, very good, Cronbach's alpha = 0.91 (+); reliability: n = 84, very good, ICC (agreement) = 0.85 (+)
- PROM A (ref 4), Dutch: reliability: n = 113, adequate, ICC = 0.93 (+)
- PROM A (ref 5), UK: reliability: n = 78, inadequate, Spearman rho = 0.94 (+)
- Pooled or summary result (overall rating): structural validity: n = 878, 1 factor (+); internal consistency: n = 1494, 0.91-0.94 (+); reliability: n = 275, 0.85-0.94 (+)


Measurement error, criterion validity, hypotheses testing and responsiveness (no entries for measurement error, criterion validity or responsiveness in this fictional example):
- PROM A (ref 2), Dutch: hypotheses testing: n = 573, very good, results in line with 2 hypotheses (2+)
- PROM A (ref 5), UK: hypotheses testing: n = 78, adequate, results in line with 5 hypotheses (5+)
- PROM A (ref 6), US: hypotheses testing: n = 154; adequate, results in line with 3 hypotheses (3+); inadequate, result in line with 1 hypothesis (1+) and results not in line with 2 hypotheses (2-)
- Pooled or summary result (overall rating): hypotheses testing: n = 805, 11+ and 2- overall (+)


Appendix 8. Summary of Findings Tables

Structural validity (summary or pooled result; overall rating; quality of evidence):
- PROM A: unidimensional score; sufficient; high (as there is one very good study available)
- PROM B: ...
- PROM C: ...

Internal consistency (summary or pooled result; overall rating; quality of evidence):
- PROM A: summarized Cronbach's alpha / PSI = 0.91-0.94, total sample size: 1494; sufficient; high: multiple very good studies, consistent results
- PROM B: ...
- PROM C: ...

Cross-cultural validity\measurement invariance (summary or pooled result; overall rating; quality of evidence):
- PROM A: no info available; no info available
- PROM B: ...
- PROM C: ...

Reliability (summary or pooled result; overall rating; quality of evidence):
- PROM A: ICC range 0.85-0.93, consistent results, sample size: 275; sufficient; high: one very good study, and consistent results
- PROM B: ...
- PROM C: ...

Measurement error (summary or pooled result; overall rating; quality of evidence):
- PROM A: no info available; no info available
- PROM B: ...
- PROM C: ...

Hypotheses testing (summary or pooled result; overall rating; quality of evidence):
- PROM A: 11 out of 13 hypotheses confirmed; sufficient; high: as the unconfirmed hypotheses come from inadequate studies, we ignore these results
- PROM B: ...
- PROM C: ...

Responsiveness (summary or pooled result; overall rating; quality of evidence):
- PROM A: ...
- PROM B: ...
- PROM C: ...


6. References

1. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res. 2010;19(4):539-49.
2. Prinsen CA, Mokkink LB, Bouter LM, Alonso J, Patrick DL, Vet HC, et al. COSMIN guideline for systematic reviews of outcome measurement instruments. Qual Life Res. 2018;[Epub ahead of print].
3. Ioannidis JP, Greenland S, Hlatky MA, Khoury MJ, Macleod MR, Moher D, et al. Increasing value and reducing waste in research design, conduct, and analysis. Lancet. 2014;383(9912):166-75.
4. Walton MK, Powers JA, Hobart J, et al. Clinical outcome assessments: A conceptual foundation – Report of the ISPOR Clinical Outcomes Assessment Emerging Good Practices Task Force. Value Health. 2015;18:741-52.
5. Prinsen CA, Vohra S, Rose MR, Boers M, Tugwell P, Clarke M, et al. How to select outcome measurement instruments for outcomes included in a "Core Outcome Set" - a practical guideline. Trials. 2016;17(1):449.
6. Mokkink LB, Vet HC, Prinsen CA, Patrick D, Alonso J, Bouter LM, et al. COSMIN Risk of Bias checklist for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res. 2018;[Epub ahead of print].
7. Terwee CB, Prinsen CA, Chiarotto A, Vet HC, Westerman MJ, Patrick DL, et al. COSMIN standards and criteria for evaluating the content validity of health-related Patient-Reported Outcome Measures: a Delphi study. 2018.
8. Collins NJ, Prinsen CA, Christensen R, Bartels EM, Terwee CB, Roos EM. Knee Injury and Osteoarthritis Outcome Score (KOOS): systematic review and meta-analysis of measurement properties. Osteoarthritis Cartilage. 2016;24(8):1317-29.
9. Gerbens LA, Chalmers JR, Rogers NK, Nankervis H, Spuls PI, Harmonising Outcome Measures for Eczema i. Reporting of symptoms in randomized controlled trials of atopic eczema treatments: a systematic review. Br J Dermatol. 2016;175(4):678-86.
10. Chinapaw MJ, Mokkink LB, van Poppel MN, van MW, Terwee CB. Physical activity questionnaires for youth: a systematic review of measurement properties. Sports Med. 2010;40(7):539-63.
11. Speksnijder CM, Koppenaal T, Knottnerus JA, Spigt M, Staal JB, Terwee CB. Measurement Properties of the Quebec Back Pain Disability Scale in Patients With Nonspecific Low Back Pain: Systematic Review. Phys Ther. 2016;96(11):1816-31.
12. Terwee CB, Mokkink LB, Knol DL, Ostelo RW, Bouter LM, de Vet HC. Rating the methodological quality in systematic reviews of studies on measurement properties: a scoring system for the COSMIN checklist. Qual Life Res. 2012;21(4):651-7.
13. Mokkink LB, Terwee CB, Knol DL, Stratford PW, Alonso J, Patrick DL, et al. The COSMIN checklist for evaluating the methodological quality of studies on measurement properties: a clarification of its content. BMC Med Res Methodol. 2010;10:22.
14. Mokkink LB, Terwee CB, Stratford PW, Alonso J, Patrick DL, Riphagen I, et al. Evaluation of the methodological quality of systematic reviews of health status measurement instruments. Qual Life Res. 2009;18(3):313-33.
15. Terwee CB, Prinsen CA, Ricci Garotti MG, Suman A, de Vet HC, Mokkink LB. The quality of systematic reviews of health-related outcome measurement instruments. Qual Life Res. 2016;25(4):767-79.


16. Higgins JP, Green S. Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0 [updated March 2011]. The Cochrane Collaboration, 2011. Available from: www.handbook.cochrane.org. 17. Cochrane Hanbook for Systematic reviews of Diagnostic Test Accuracy Reviews 2013 [Available from: http://methods.cochrane.org/sdt/handbook-dta-reviews. 18. Peterson DAB, P.; Jabusch, H. C.; Altenmuller, E.; Frucht, S. J. Rating scales for musician's dystonia: the state of the art. Neurology. 2013;81(6):589-98. 19. Herr KB, K.; Decker, S. Tools for assessment of pain in nonverbal older adults with dementia: A state-of-the-science review. Journal of Pain and Symptom Management. 2006;31(2):170-92. 20. GRADE. GRADE Handbook - Handbook for grading the quality of evidence and the strength of recommendations using the GRADE approach. 2013. 21. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 2010;63(7):737-45. 22. Streiner DL, Norman G. Health Measurement Scales. A practical guide to their development and use. 4th edition ed. New York: Oxford University Press; 2008. 23. de Vet HC, Terwee CB, Mokkink L, Knol DL. Measurement in Medicine: a practical guide: Cambridge University Press; 2010. 24. Fayers PM, Hand DJ, Bjordal K, Groenvold M. Causal indicators in quality of life research. Qual Life Res. 1997;6(5):393-406. 25. Elbers RG, Rietberg MB, van Wegen EE, Verhoef J, Kramer SF, Terwee CB, et al. Self-report fatigue questionnaires in multiple sclerosis, Parkinson's disease and stroke: a systematic review of measurement properties. Qual Life Res. 2012;21(6):925-44. 26. Griffiths C, Armstrong-James L, White P, Rumsey N, Pleat J, Harcourt D. A systematic review of patient reported outcome measures (PROMs) used in child and adolescent burn research. Burns. 2015;41(2):212-24. 27. Terwee CB, Jansma EP, Riphagen, II, de Vet HC. Development of a methodological PubMed search filter for finding studies on measurement properties of measurement instruments. Qual Life Res. 2009;18(8):1115-23. 28. Clanchy KMT, S. M.; Boyd, R. Measurement of habitual physical activity performance in adolescents with cerebral palsy: a systematic review. DevMed Child Neurol. 2011;53(6):499-505. 29. PRISMA Statement. http://prisma-statement org/ [Internet]. 2016 10/24/2016. Available from: http://prisma-statement.org/. 30. Terwee CB, Bot SD, de Boer MR, van der Windt DA, Knol DL, Dekker J, et al. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol. 2007;60(1):34-42. 31. Ernst MY, Albert J.; Kriston, Levente; Schonfeld, Michael H.; Vettorazzi, Eik; Fiehler, Jens. Is visual evaluation of aneurysm coiling a reliable study end point? Systematic review and meta-analysis. Stroke. 2015;46(6):1574-81. 32. DerSimonian R, Laird N. Meta-analysis in clinical trials revisited. Contemp Clin Trials. 2015;45(Pt A):139-45. 33. Terwee CB, Prinsen CA, de Vet HCW, Bouter LM, Alonso J, Westerman MJ, et al. COSMIN methodology for assessing the content validity of Patient-Reported Outcome Measures (PROMs). User manual. 2017. 34. Strauss ME, Smith GT. Construct Validity: Advances in Theory and Methodology. Annu Rev Clin Psychol. 2008.


35. Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychological Bulletin. 1955;52:281-302. 36. McDowell I, Jenkinson C. Development standards for health measures. J Health Serv Res Policy. 1996;1:238-46. 37. Messick S. The standard problem. Meaning and values in measurement and evaluation. . American Psychologist. 1975;oct:955-66. 38. de Vet HC, Terwee CB. The minimal detectable change should not replace the minimal important difference. J Clin Epidemiol. 2010;63(7):804-5. 39. de Vet HC, Ostelo RW, Terwee CB, van der Roer N, Knol DL, Beckerman H, et al. Minimally important change determined by a visual method integrating an anchor-based and a distribution-based approach. Qual Life Res. 2007;16(1):131-42. 40. de Vet HC, Terwee CB, Ostelo RW, Beckerman H, Knol DL, Bouter LM. Minimal changes in health status questionnaires: distinction between minimally detectable change and minimally important change. Health Qual Life Outcomes. 2006;4:54. 41. Crosby RD, Kolotkin RL, Williams GR. Defining clinically meaningful change in health-related quality of life. J Clin Epidemiol. 2003;56(5):395-407. 42. Yost KJ, Eton DT, Garcia SF, Cella D. Minimally important differences were estimated for six Patient-Reported Outcomes Measurement Information System-Cancer scales in advanced-stage cancer patients. J Clin Epidemiol. 2011;64(5):507-16. 43. van Kampen DA, Willems WJ, van Beers LW, Castelein RM, Scholtes VA, Terwee CB. Determination and comparison of the smallest detectable change (SDC) and the minimal important change (MIC) of four-shoulder patient-reported outcome measures (PROMs). Journal of orthopaedic surgery and research. 2013;8:40. 44. Sprangers MA, Schwartz CE. Integrating response shift into health-related quality of life research: a theoretical model. Soc Sci Med. 1999;48(11):1507-15. 45. Boers M, Kirwan JR, Tugwell P, Beaton D, Bingham CO, III, Conaghan PG, et al. The OMERACT handbook: OMERACT; 2015. 46. Smart A. A multi-dimensional model of clinical utility. International journal for quality in health care : journal of the International Society for Quality in Health Care. 2006;18(5):377-82. 47. Fayers PM, Hand DJ. Factor analysis, causal indicators and quality of life. Qual Life Res. 1997;6(2):139-50. 48. Streiner DL. Being inconsistent about consistency: when coefficient alpha does and doesn't matter. J Pers Assess. 2007;80:217-22. 49. Wirth RJ, Edwards MC. Item factor analysis: Current approaches and future directions. Psychological Methods. 2007;12:58-79. 50. Floyd FJ, Widaman FK. Factor analysis in the development and refinement of clinical assessment instruments. . Psychological Assessment. 1995;7:286-99. 51. Comrey AL, Lee HB. A first course in factor analysis. 2nd ed. ed. Hillsdale, NJ: Erlbaum; 1992 1992. 52. Brown T. Confirmatory Factor Analysis for applied research. New York: The Guilford Press; 2015. 53. Embretson SE, Reise SP. Item response theory for psychologists. Mahwah, New Jersey: Lawrence Erlbaum Associates; 2000. 54. Edelen MO, Reeve BB. Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Qual Life Res. 2007;16 Suppl 1:5-18. 55. Reise SPY, J. Parameter recovery in the graded response model using MULTILOG. JEM. 1990;27:133-44.


56. Reise SPM, T.M.; Maydeu-Olivares, A. Targeted bifactor rotations and assessing the impact of model violations on the parameters of unidimensional and bifactor models. Journal of Educational and Psychological Measurement. 2011;71:684-711. 57. Linacre JM. Sample size and item calibration stability. Rasch Measurement Transactions. 1994;7(4):328. 58. de Vet HC, Ader HJ, Terwee CB, Pouwers F. Are factor analytical techniques appropriately used in the validation of health status questionnaires? A systematic review on the quality of factor analyses of the SF-36. . Quality of Life Research. 2005;14:1203-18. 59. Cortina JM. What is coefficient alpha? An examination of theory and applications. Appl Psychology. 1993;78:98-104. 60. Cronbach LJ. Coefficient Alpha and the Internal Structure of Tests. Psychometrika. 1951;16:297-334. 61. Streiner DL. Figuring out factors: The use and misuse of factor analysis. Can J Psychiatry. 1994;39:135-40. 62. Revelle WZRE. Coefficients Alpha, Beta, Omega, and the GLB: Comments on Sijtsma. Psychometrika. 2009;74(1):145-54. 63. Crane PK, Gibbons LE, Jolley L, Van Belle G. Differential item functioning analysis with ordinal logistic regression techniques. DIFdetect and difwithpar. Med Care. 2006;44(11 Suppl 3):S115-23. 64. Petersen MA, Groenvold M, Bjorner JB, Aaronson N, Conroy T, Cull A, et al. Use of differential item functioning analysis to assess the equivalence of translations of a questionnaire. Qual Life Res. 2003;12(4):373-85. 65. Gregorich SE. Do Self-Report Instruments Allow Meaningful Comparisons Across Diverse Population Groups? Testing Measurement Invariance Using the Confirmatory Factor Analysis Framework. Medical Care. 2006;44(11 Suppl 3):S78-S94. 66. Teresi JA, Ocepek-Welikson K, Kleinman M, Eimicke JP, Crane PK, Jones RN, et al. Analysis of differential item functioning in the depression item bank from the Patient Reported Outcome Measurement Information System (PROMIS): An item response theory approach. Psychol Sci Q. 2009;51(2):148-80. 67. Scott NW, Fayers PM, Aaronson NK, Bottomley A, De Graeff A, Groenvold M, et al. A simulation study provided sample size guidance for differential item functioning (DIF) studies using short scales. Journal of Clinical Epidemiology. 2009;62:288-95. 68. Van Leeuwen LM, Mokkink LB, Kamm CP, de Groot V, van den Berg P, Ostelo RW, et al. Measurement properties of the Arm Function in Multiple Sclerosis Questionnaire (AMSQ): a study based on Classical Test Theory. Disabil Rehabil. 2016:1-8. 69. Shrout PE, Fleiss JL. Intraclass Correlations: Uses in assessing rater reliability. Psychological Bulletin. 1979;86:420-8. 70. McGraw KOW, S.P. Forming inferences about some intraclass correlation coefficients. Psychological Methods. 1996;1:30-46. 71. Cohen J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin. 1968;70:213-20. 72. de Vet HC, Mokkink LB, Terwee CB, Hoekstra OS, Knol DL. Clinicians are right not to like Cohen's kappa. BMJ. 2013;346:f2125. 73. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. . Educational and Psychological Measurement. 1973;33:613-9. 74. de Vet HC, Terwee CB, Knol DL, Bouter LM. When to use agreement versus reliability measures. . Journal of Clinical Epidemiology. 2006;59:1033-9. 75. Altman DG. Practical statistics for medical research. London: Chapman and Hall; 1991.


76. Cohen J. Statistical power analysis for the behavioural sciences. . 2nd ed. ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988. 77. McHorney CAT, A.R. Individual-patient monitoring in clinical practice: are available health status surveys adequate? Qual Life Res. 1995;4:293-307. 78. Norman GR. Issues in the use of change scores in randomized trials. J Clin Epidemiol. 1989;42(11):1097-105. 79. Stockler MR, Osoba D, Goodwin P, Corey P, Tannock IF. Responsiveness to change in health-related quality of life in a randomized clinical trial: a comparison of the Prostate Cancer Specific Quality of Life Instrument (PROSQOLI) with analogous scales from the EORTC QLQ-C30 and a trial specific module. European Organization for Research and Treatment of Cancer. J Clin Epidemiol. 1998;51(2):137-45. 80. Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis. 1987;40(2):171-8. 81. van Dijk SEA, M.C.; van der Zwaan, L.; Bosmans, J.E.; van Marwijk, H.W.J.; van Tulder, M.W.; Terwee, C.B. Measurement properties of questionnaires for depressive symptoms in adult patients with type 1 or type 2 diabetes: a systematic review. Qual Life Res. 2018; doi: 10.1007/s11136-018-1782-y. [Epub ahead of print].