The Unreliability of Reliability Statistics: A Primer on Calculating Interrater Reliability in CNS Trials
Popp, D1; Mallinckrodt CH2; Williams, JBW1,3; Detke, MJ3,4
1MedAvante, Inc.; 2Eli Lilly and Company; 3College of Physicians and Surgeons, Columbia University; 4Indiana University School of Medicine
©2013 MedAvante Inc.
Commonly used reliability statistics are reviewed and the appropriateness of their use with various data types and methodologies typical of CNS clinical trials is evaluated. Guidelines for selecting appropriate reliability statistics are presented.
Finally, common misuses of reliability statistics are discussed and the impact of inappropriate analyses on estimates of reliability is demonstrated.
In CNS clinical trial research, IRR is typically measured using one or both of the following methodologies:
• Investigator Meeting (IM) Rating Precision Exercises: typically, a large group of raters independently scores one or more subject videotapes prior to study start
• In-Study Surveillance: an expert clinician reviews and independently scores audio/videotaped in-study assessments
Further, IRR can be measured for both diagnosis and outcome variables (e.g., severity scales).
The decision tree shown below (under "Guidelines for selecting the appropriate IRR statistic") can be used to determine the appropriate IRR measure for various methodologies based on the type of variable, the number of raters and the number of subjects or observations.
Success rates in clinical trials of approved antidepressants are less than 50 percent even when theoretically powered at 80–90 percent (Khin et al., 2011). However, power calculations rarely take into account the variability attributable to the less-than-perfect agreement between raters in the subjective assessments of symptom severity in CNS clinical trials, that is, interrater reliability (IRR). As depicted in the table below, failing to account for interrater reliability can have substantial implications for study power and the ability to distinguish effective drugs from placebo.
Poor IRR, or inaccurate reliability estimates resulting from inappropriate reliability statistics, can have significant consequences, including increased R&D costs, significant delays in getting effective drugs to patients who need them, and terminating development of effective drugs.
Despite the importance of reliable outcome assessments, clinical trial reporting seldom includes estimates of IRR (Mulsant et al., 2002). When reported, selection of reliability statistics is inconsistent and often inappropriate for the level of measurement or methodology employed. A set of guidelines is proposed for the appropriate selection of reliability measures for CNS clinical trials.
Kappa
The most commonly used measure of IRR for psychiatric diagnosis (Cohen, 1960; Fleiss, 1971), Kappa is a measure of agreement between two or more raters across two or more subjects. Kappa can be used with binary, nominal or ordinal data. Kappa is preferred to percent agreement as it is corrected for chance agreement. Cohen's Kappa is used when two raters rate two or more subjects, such as with in-study surveillance methods, whereas Fleiss' Kappa is used for multiple raters, such as data collected at IMs.
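To make the chance correction concrete, here is a minimal sketch computing Cohen's Kappa for two raters' diagnoses from scratch; the rater names and data are hypothetical, and in practice a vetted routine from a statistics package would normally be used.

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Cohen's Kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(rater1, rater2)
    # Observed agreement: proportion of subjects rated identically
    p_o = np.mean(rater1 == rater2)
    # Chance agreement: product of each rater's marginal proportions, summed
    p_e = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical diagnoses (1 = MDD present, 0 = absent) for ten subjects
site_rater    = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
central_rater = [1, 0, 0, 1, 0, 0, 1, 1, 1, 1]
print(round(cohens_kappa(site_rater, central_rater), 2))  # 0.58, despite 80% raw agreement
```

Note how the chance correction pulls Kappa well below the 80 percent raw agreement for these hypothetical data.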
Outcome Measures
Common efficacy outcomes in CNS clinical trials are summed total or subscale scores on psychiatric rating scales (e.g., MADRS, PANSS).
T-Tests/Analysis of Variance (ANOVA)
One method to assess agreement between two or more raters is a means comparison test, such as a paired-samples t-test or one-way repeated measures ANOVA. These tests examine whether multiple raters' scores of the same subjects are statistically significantly different from one another (i.e., whether the disagreement between raters reaches statistical significance). Regardless of statistical significance, results of means comparisons should be accompanied by estimates of effect size, such as Cohen's d, in order to judge the magnitude of difference between raters.
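As an illustration of this recommendation, the sketch below runs a paired-samples t-test on two raters' MADRS totals and reports an effect size; the scores are hypothetical, and the paired-data d convention used here (mean difference divided by the SD of the differences) is one of several in use.

```python
import numpy as np
from scipy import stats

# Hypothetical MADRS totals from two raters scoring the same ten subjects
rater_a = np.array([22, 30, 18, 25, 27, 33, 20, 24, 29, 26], dtype=float)
rater_b = np.array([24, 29, 19, 27, 26, 35, 21, 23, 31, 27], dtype=float)

t_stat, p_value = stats.ttest_rel(rater_a, rater_b)

diffs = rater_a - rater_b
cohens_d = diffs.mean() / diffs.std(ddof=1)  # effect size for paired data

print(f"t = {t_stat:.2f}, p = {p_value:.3f}, d = {cohens_d:.2f}")
```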
Impact of Interrater Reliability (IRR) on Power and Sample Size

Interrater Reliability | Power (1 – β)* | Sample Size Required to Retain 80% Power** | % Increase in Sample Size to Retain 80% Power
1.0 | 80% | 100 | –
0.9 | 76% | 111 | 11%
0.7 | 65% | 143 | 43%
0.5 | 51% | 200 | 100%

*Muller & Szegedi, 2002; **Perkins, Wyatt & Bartko, 2002
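One way to read the tabled sample sizes (our gloss, consistent with the sources cited in the table): imperfect reliability attenuates the observed standardized effect size by roughly the square root of the reliability, and required N scales with the inverse square of the effect size, so N grows by the factor 1/R.

```latex
d_{\mathrm{obs}} \approx d\sqrt{R}
\quad\Longrightarrow\quad
N_{\mathrm{required}} \approx \frac{N_0}{R},
\qquad \text{e.g.}\quad \frac{100}{0.7} \approx 143, \qquad \frac{100}{0.5} = 200 .
```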
Guidelines for selecting the appropriate IRR statistic

Type of Variable | # of Raters | # of Observations (i.e., subjects or videos) | Appropriate Statistic
Categorical (e.g., Diagnosis) | 1 | any | Cannot calculate reliability
Categorical (e.g., Diagnosis) | 2 | 1 | Cannot calculate reliability
Categorical (e.g., Diagnosis) | 2 | 2+ | Cohen's Kappa
Categorical (e.g., Diagnosis) | 3+ | 1 | % Agreement
Categorical (e.g., Diagnosis) | 3+ | 2+ | Fleiss' Kappa
Continuous (e.g., Severity Scale) | 1 | any | Cannot calculate reliability
Continuous (e.g., Severity Scale) | 2 | 1 | CoV, rwg, AD indices
Continuous (e.g., Severity Scale) | 2 | 2+ | Paired t-test, Bland-Altman, ICC
Continuous (e.g., Severity Scale) | 3+ | 1 | CoV, rwg, AD indices
Continuous (e.g., Severity Scale) | 3+ | 2+ | Repeated measures ANOVA, ICC
Bland-Altman Plots
Another measure of the magnitude of (dis)agreement between two raters is the Bland-Altman test (Bland & Altman, 1986). A Bland-Altman plot visually depicts agreement between two raters across multiple observations. The difference of the two ratings is plotted on the Y-axis and the average of the two ratings on the X-axis. Three reference lines delineated on the plot indicate the average difference between the raters and the upper and lower confidence limits. The greater the agreement between the two raters, the more closely the points cluster around zero on the Y-axis. A sample Bland-Altman plot using surveillance data is shown below. This plot shows good agreement, with values clustered around zero on the Y-axis and confidence limits near +/-3 points on the MADRS.
[Figure: Sample Bland-Altman plot of in-study surveillance MADRS ratings. X-axis: average of the two ratings (0–40); Y-axis: difference between the ratings (-6 to +6). Points cluster around zero, with limits near +/-3 points.]
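For readers who want to reproduce this kind of plot, here is a minimal sketch with hypothetical paired MADRS scores; it draws the mean difference and, following the common Bland & Altman convention, mean ± 1.96 SD of the differences as the limit lines.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical paired MADRS totals: site rater vs. central reviewer
site = rng.integers(10, 36, size=60).astype(float)
central = site + rng.normal(0, 1.5, size=60)

avg = (site + central) / 2        # X-axis: average of the two ratings
diff = site - central             # Y-axis: difference between the ratings
mean_diff = diff.mean()
loa = 1.96 * diff.std(ddof=1)     # half-width of the limits of agreement

plt.scatter(avg, diff, s=12)
plt.axhline(mean_diff, linestyle="-")
plt.axhline(mean_diff + loa, linestyle="--")
plt.axhline(mean_diff - loa, linestyle="--")
plt.xlabel("Average of the two ratings")
plt.ylabel("Difference between ratings")
plt.title("Bland-Altman plot (hypothetical MADRS data)")
plt.show()
```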
Intraclass Correlation Coefficient (ICC)
The ICC (Shrout & Fleiss, 1979) is required for appropriate measurement of IRR with continuous outcome measures. ICC is a measure of the interchangeability of raters in a larger cohort. To calculate ICC, two or more raters must rate two or more subjects. An ICC, or any measure of reliability, cannot be calculated on ratings of a single subject. ICC is calculated as:

ICC = \frac{\text{Variance due to subjects}}{\text{Variance due to subjects} + \text{Variance due to raters} + \text{Residual Variance}}

Larger ICCs indicate better agreement between raters, or a higher degree of interchangeability. Confidence intervals should be reported when calculating ICCs.

Shrout and Fleiss (1979) proposed six forms of ICC. Decisions about which form of ICC is estimated should be based on the type and number of raters and on whether the outcome variable of interest is from a single rater or is the average score from multiple raters (e.g., four raters assess all subjects on the MADRS and the outcome variable is the average of the four scores).
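The sketch below illustrates the variance-ratio idea for the two-way random effects case, estimating the components from ANOVA mean squares in the Shrout and Fleiss framework; the data are hypothetical, and a production analysis would typically use an established ICC routine from a statistics package rather than hand-rolled mean squares.

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1), two-way random effects, single rating (Shrout & Fleiss, 1979).

    scores: (n_subjects, k_raters) array; the same raters rate all subjects.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # subjects (rows)
    msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # raters (columns)
    sse = ((x - x.mean(axis=1, keepdims=True)
              - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))                              # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical data: 6 subjects each rated by the same 3 raters
scores = [[20, 22, 21], [31, 30, 33], [15, 14, 16],
          [27, 25, 26], [18, 20, 19], [24, 26, 25]]
print(round(icc_2_1(scores), 3))
```

If the trial's outcome is the average of the k raters' scores rather than a single rating, the corresponding average-score form ICC(2,k) applies; it follows from the single-rating form via the Spearman-Brown relationship, ICC(2,k) = k·ICC(2,1) / (1 + (k − 1)·ICC(2,1)).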
Guidelines for selecting the appropriate ICC

Did the same raters rate all subjects? | Were raters selected from a larger pool? | Variable of interest | Correct ICC Formula / ANOVA Source
No | – | Single rating | ICC (1,1); One-way Random Effects ANOVA
No | – | Average rating | ICC (1,n); One-way Random Effects ANOVA
Yes | Yes | Single rating | ICC (2,1); Two-way Random Effects ANOVA
Yes | Yes | Average rating | ICC (2,n); Two-way Random Effects ANOVA
Yes | No | Single rating | ICC (3,1); Two-way Fixed Effects ANOVA
Yes | No | Average rating | ICC (3,n); Two-way Fixed Effects ANOVA
Common Misuses of Reliability Statistics
Dichotomizing continuous outcome measures
Kappa has often been misused to estimate the IRR of continuous outcome measures. In order to estimate Kappa from continuous outcome measures, the variable must be artificially transformed into a dichotomous or categorical variable. Kappa is highly influenced by the criterion measure selected. At times, a fixed criterion (e.g., +/-20 percent) is used to indicate rater agreement with a “gold standard” score. For example, with a criterion of +/-20 percent of the gold standard, 85 percent of raters may “meet criteria.” However, if the criterion is narrowed to within +/-10 percent of the gold standard, the number of raters meeting criteria may drop to 45 percent. Selecting a broader criterion range can artificially inflate Kappa.
To accurately assess the IRR of an outcome measure, IRR must be estimated using the variable as it will be used in the primary efficacy analysis. That is, dichotomization of variables for IRR should only take place if one plans to dichotomize the outcome measure in the final data analysis. Therefore, Kappa is almost always the incorrect measure of IRR for severity scales in CNS clinical trials.
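A small simulation along these lines (hypothetical data; the +/-20 percent and +/-10 percent criteria mirror the example above) shows how the criterion width alone changes the proportion of raters who “meet criteria,” before any Kappa is even computed:

```python
import numpy as np

rng = np.random.default_rng(42)
gold = 30.0                                   # hypothetical gold-standard MADRS total
raters = gold + rng.normal(0, 4, size=200)    # 200 raters' scores of the same video

for pct in (0.20, 0.10):
    hits = np.abs(raters - gold) <= pct * gold
    print(f"criterion +/-{pct:.0%}: {hits.mean():.0%} of raters meet criteria")
```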
Treating Items as Subjects
It is sometimes impossible to obtain ratings of multiple observations or subjects. In cases where two or more raters rated a single subject (as in a group calibration at an investigator meeting), one common error is to treat individual items on a scale as independent observations to compensate for the lack of multiple observations. However, ICCs calculated this way may be inversely related to the reliability of a construct (James, Demaree, & Wolf, 1984). For example, imagine a situation in which 20 raters scored one videotaped Montgomery-Asberg Depression Rating Scale (MADRS) assessment at an investigator meeting. If one treats the individual items of the MADRS as 10 independent observations, a high ICC is achieved by definition simply because the between-item mean squares are large in relation to the within-item mean square. That is, higher ICCs are actually inversely related to internal scale consistency, which may indicate that raters are not applying the scale correctly, and additional observations may reveal that interrater reliability issues are present.
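To see the trap concretely, this sketch (hypothetical data) treats 10 MADRS items as “subjects” for 20 raters who scatter substantially on every item; because the item means differ far more than the raters do within an item, the one-way ICC(1,1) built from those same between- and within-item mean squares comes out deceptively high.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical mean severity for each of the 10 MADRS items (0-6 scale)
item_means = np.array([4.0, 3.5, 2.0, 1.0, 3.0, 0.5, 2.5, 1.5, 3.5, 2.0])

# 10 "observations" (items) x 20 raters, with substantial rater noise
scores = item_means[:, None] + rng.normal(0, 0.8, size=(10, 20))

n, k = scores.shape
grand = scores.mean()
bms = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)      # between "subjects"
wms = ((scores - scores.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
icc_1_1 = (bms - wms) / (bms + (k - 1) * wms)
print(round(icc_1_1, 2))  # deceptively high, driven purely by between-item spread
```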
When it is not possible to obtain ratings of multiple subjects, interrater agreement (not reliability) should be estimated using the agreement indices presented below.
Diagnosis
Reliability of psychiatric diagnosis can only be calculated when two or more raters rate two or more subjects. Reliability for diagnosis cannot be estimated with only one observation or subject because there is no variability.
Methodology | Recommendation for Diagnostic Reliability
Investigator meeting rating precision exercise | Fleiss' Kappa
In-study surveillance | Cohen's Kappa

Methodology | Recommendation for Outcome Measure Reliability (Two or more observations)
Investigator meeting rating precision exercise | Repeated measures ANOVA with effect size; ICC (2,1) with 95% confidence intervals
In-study surveillance | Paired samples t-test with effect size; Bland-Altman plot; ICC (1,1) with 95% confidence intervals

Methodology | Recommendation for Outcome Measure Reliability (Single observation)*
Investigator meeting rating precision exercise and in-study surveillance | CoV; rwg; ADM or ADMD

*Reliability cannot be estimated from one observation. Therefore, we recommend obtaining ratings of two or more subjects whenever possible.
Estimating Interrater Agreement from a Single Observation
The statistics shown above require that reliability be measured on more than one subject. However, it is not always possible to obtain multiple observations. While it is not possible to estimate reliability with only one observation, agreement can be estimated.
The most straightforward agreement statistic for a single observation is the Coefficient of Variation (CoV), a standardized measure of the variability of rater scores, calculated as the standard deviation divided by the mean. The lower the CoV, the more aligned the raters' scores, with 0 indicating that all of the scores are the same.
Alternatively, one can estimate rwg (James, Demaree & Wolf, 1984) to compare the observed variance in multiple raters' ratings of a single target to the variance expected if all of the ratings were random. rwg typically ranges from zero to one, with higher values indicating greater agreement.
Finally, average deviation (AD) indices such as the average deviation of the mean (ADM) or median (ADMD) can be used to estimate agreement among raters on a single observation (Burke, Finkelstein & Dusig, 1999). Average deviation is calculated as the average absolute deviation across raters from a point of central tendency, namely the mean or median. One benefit of ADM and ADMD is that they maintain the raw metric of the observed variable.
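The sketch below computes all three single-observation agreement indices for a hypothetical set of 20 raters scoring one 0-6 MADRS item; the rwg null (uniform) variance, (A² − 1)/12 for A response options, follows the James, Demaree & Wolf convention, though other null distributions can be justified.

```python
import numpy as np

# Hypothetical: 20 raters each score the same 0-6 MADRS item once
ratings = np.array([3, 4, 3, 3, 2, 4, 3, 3, 5, 3,
                    4, 3, 2, 3, 4, 3, 3, 4, 2, 3], dtype=float)

cov = ratings.std(ddof=1) / ratings.mean()           # Coefficient of Variation

options = 7                                          # a 0-6 scale has 7 options
expected_var = (options ** 2 - 1) / 12               # variance if ratings were uniform random
r_wg = 1 - ratings.var(ddof=1) / expected_var        # 1 = perfect agreement

ad_m = np.abs(ratings - ratings.mean()).mean()       # AD from the mean (ADM)
ad_md = np.abs(ratings - np.median(ratings)).mean()  # AD from the median (ADMD)

print(f"CoV={cov:.2f}, rwg={r_wg:.2f}, ADM={ad_m:.2f}, ADMD={ad_md:.2f}")
```

Note that ADM and ADMD are reported in scale points, which keeps their interpretation tied to the instrument's raw metric.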
Reliability can have a significant impact on clinical trial outcomes. It is important to accurately assess and report IRR prior to study start and throughout the course of a clinical trial. When IRR is assessed prior to study start, it is possible for researchers to employ a methodology for obtaining IRR data that fully exploits the strengths of a particular statistic. However, since these estimates are often obtained without independent interviews (i.e., watching videotaped assessments), in artificial settings (i.e., at investigator meetings) and at a single point in time (i.e., prior to the start of the study), it is important to couple these estimates with IRR calculated from actual trial assessments throughout a study.
When selecting reliability statistics, researchers must take into account the type of variable (e.g., binary, nominal, interval), the number of raters, the composition of the rater pool (i.e., same raters rate all subjects vs. raters selected from a larger pool) and the number of observations, using the guidelines presented for the various methodologies.
Disclosure
One or more authors report potential conflicts, which are described in the program.
References
Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 1986;327(8476):307-310.
Burke MJ, Finkelstein LM, Dusig MS. On average deviation indices for estimating interrater agreement. Organizational Research Methods, 1999;2(1):49-68.
Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960;20(1):37-46.
Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin, 1971;76(5):378-382.
James LR, Demaree RG, Wolf G. Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 1984;69:85-98.
Khin NA, Chen Y, Yang Y, Yang P, Laughren TP. Exploratory analyses of efficacy data from major depressive disorder trials submitted to the US Food and Drug Administration in support of new drug applications. Journal of Clinical Psychiatry, 2011;72(4):464-472.
Muller MJ, Szegedi A. Effects of interrater reliability of psychopathologic assessment on power and sample size calculations in clinical trials. Journal of Clinical Psychopharmacology, 2002;22:318-325.
Mulsant BH, Kastango KB, Rosen J, Stone RA, Mazumdar S, Pollock BG. Interrater reliability in clinical trials of depressive disorders. American Journal of Psychiatry, 2002;159:1598-1600.
Perkins DO, Wyatt RJ, Bartko JJ. Penny-wise and pound-foolish: the impact of measurement error on sample size requirements in clinical trials. Biological Psychiatry, 2002;47:762-766.
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 1979;86(2):420-428.