The Unreliability of Reliability Statistics: A Primer on Calculating Interrater Reliability in CNS Trials
Popp, D1; Mallinckrodt CH2; Williams, JBW1,3; Detke, MJ3,4
1MedAvante, Inc.; 2Eli Lilly and Company; 3College of Physicians and Surgeons, Columbia University; 4Indiana University School of Medicine
©2013 MedAvante Inc.
Commonly used reliability statistics are reviewed and the appropriateness of their use with various data types and methodologies typical of CNS clinical trials is evaluated. Guidelines for selecting appropriate reliability statistics are presented.
Finally, common misuses of reliability statistics are discussed and the impact of inappropriate analyses on estimates of reliability is demonstrated.
In CNS clinical trial research, IRR is typically measured using one or both of the following methodologies:
• Investigator Meeting (IM) Rating Precision Exercises: typically, a large group of raters independently scores one or more subject videotapes prior to study start
• In-Study Surveillance: an expert clinician reviews and independently scores audio/videotaped in-study assessments
Further, IRR can be measured for both diagnosis and outcome variables (e.g., severity scales).
The decision tree shown below (under "Guidelines for selecting the appropriate IRR statistic") can be used to determine the appropriate IRR measure for various methodologies based on the type of variable, the number of raters and the number of subjects or observations.
Success rates in clinical trials of approved antidepressants are less than 50 percent even when theoretically powered at 80–90 percent (Khin et al., 2011). However, power calculations rarely take into account the variability attributable to the less-than-perfect agreement between raters in the subjective assessments of symptom severity in CNS clinical trials, that is, interrater reliability (IRR). As depicted in the table below, failing to account for interrater reliability can have substantial implications for study power and the ability to distinguish effective drugs from placebo.
Poor IRR, or inaccurate reliability estimates resulting from inappropriate reliability statistics, can have significant consequences, including increased R&D costs, significant delays in getting effective drugs to patients who need them, and terminating development of effective drugs.
Despite the importance of reliable outcome assessments, clinical trial reporting seldom includes estimates of IRR (Mulsant et al., 2002). When reported, selection of reliability statistics is inconsistent and often inappropriate for the level of measurement or methodology employed. A set of guidelines is proposed for the appropriate selection of reliability measures for CNS clinical trials.
Kappa
The most commonly used measure of IRR for psychiatric diagnosis (Cohen, 1960; Fleiss, 1971), Kappa is a measure of agreement between two or more raters across two or more subjects. Kappa can be used with binary, nominal or ordinal data. Kappa is preferred to percent agreement as it is corrected for chance agreement. Cohen's Kappa is used when two raters rate two or more subjects, such as with in-study surveillance methods, whereas Fleiss' Kappa is used for multiple raters, such as data collected at IMs.
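To make the chance correction concrete, here is a minimal sketch computing Cohen's Kappa for two raters' diagnoses from scratch; the rater names and data are hypothetical, and in practice a vetted routine from a statistics package would normally be used.

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Cohen's Kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(rater1, rater2)
    # Observed agreement: proportion of subjects rated identically
    p_o = np.mean(rater1 == rater2)
    # Chance agreement: product of each rater's marginal proportions, summed
    p_e = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical diagnoses (1 = MDD present, 0 = absent) for ten subjects
site_rater    = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
central_rater = [1, 0, 0, 1, 0, 0, 1, 1, 1, 1]
print(round(cohens_kappa(site_rater, central_rater), 2))  # 0.58, despite 80% raw agreement
```

Note how the chance correction pulls Kappa well below the 80 percent raw agreement for these hypothetical data.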
Outcome Measures
Common efficacy outcomes in CNS clinical trials are summed total or subscale scores on psychiatric rating scales (e.g., MADRS, PANSS).
T-Tests/Analysis of Variance (ANOVA)
One method to assess agreement between two or more raters is a means comparison test, such as a paired-samples t-test or one-way repeated measures ANOVA. These tests examine whether multiple raters' scores of the same subjects are statistically significantly different from one another (i.e., whether the disagreement between raters reaches statistical significance). Regardless of statistical significance, results of means comparisons should be accompanied by estimates of effect size, such as Cohen's d, in order to judge the magnitude of difference between raters.
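As an illustration of this recommendation, the sketch below runs a paired-samples t-test on two raters' MADRS totals and reports an effect size; the scores are hypothetical, and the paired-data d convention used here (mean difference divided by the SD of the differences) is one of several in use.

```python
import numpy as np
from scipy import stats

# Hypothetical MADRS totals from two raters scoring the same ten subjects
rater_a = np.array([22, 30, 18, 25, 27, 33, 20, 24, 29, 26], dtype=float)
rater_b = np.array([24, 29, 19, 27, 26, 35, 21, 23, 31, 27], dtype=float)

t_stat, p_value = stats.ttest_rel(rater_a, rater_b)

diffs = rater_a - rater_b
cohens_d = diffs.mean() / diffs.std(ddof=1)  # effect size for paired data

print(f"t = {t_stat:.2f}, p = {p_value:.3f}, d = {cohens_d:.2f}")
```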
Impact of Interrater Reliability (IRR) on Power and Sample Size

Interrater Reliability | Power (1 – β)* | Sample Size Required to Retain 80% Power** | % Increase in Sample Size to Retain 80% Power
1.0 | 80% | 100 | –
0.9 | 76% | 111 | 11%
0.7 | 65% | 143 | 43%
0.5 | 51% | 200 | 100%

*Muller & Szegedi, 2002; **Perkins, Wyatt & Bartko, 2002
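One way to read the tabled sample sizes (our gloss, consistent with the sources cited in the table): imperfect reliability attenuates the observed standardized effect size by roughly the square root of the reliability, and required N scales with the inverse square of the effect size, so N grows by the factor 1/R.

```latex
d_{\mathrm{obs}} \approx d\sqrt{R}
\quad\Longrightarrow\quad
N_{\mathrm{required}} \approx \frac{N_0}{R},
\qquad \text{e.g.}\quad \frac{100}{0.7} \approx 143, \qquad \frac{100}{0.5} = 200 .
```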
Guidelines for selecting the appropriate IRR statistic

Type of Variable | # of Raters | # of Observations (i.e., subjects or videos) | Appropriate Statistic
Categorical (e.g., Diagnosis) | 1 | any | Cannot calculate reliability
Categorical (e.g., Diagnosis) | 2 | 1 | Cannot calculate reliability
Categorical (e.g., Diagnosis) | 2 | 2+ | Cohen's Kappa
Categorical (e.g., Diagnosis) | 3+ | 1 | % Agreement
Categorical (e.g., Diagnosis) | 3+ | 2+ | Fleiss' Kappa
Continuous (e.g., Severity Scale) | 1 | any | Cannot calculate reliability
Continuous (e.g., Severity Scale) | 2 | 1 | CoV, rwg, AD indices
Continuous (e.g., Severity Scale) | 2 | 2+ | Paired t-test, Bland-Altman, ICC
Continuous (e.g., Severity Scale) | 3+ | 1 | CoV, rwg, AD indices
Continuous (e.g., Severity Scale) | 3+ | 2+ | Repeated measures ANOVA, ICC
Bland-Altman Plots
Another measure of the magnitude of (dis)agreement between two raters is the Bland-Altman test (Bland & Altman, 1986). A Bland-Altman plot visually depicts agreement between two raters across multiple observations. The difference of the two ratings is plotted on the Y-axis and the average of the two ratings on the X-axis. Three reference lines delineated on the plot indicate the average difference between the raters and the upper and lower confidence limits. The greater the agreement between the two raters, the more closely the points cluster around zero on the Y-axis. A sample Bland-Altman plot using surveillance data is shown below. This plot shows good agreement, with values clustered around zero on the Y-axis and confidence limits near +/-3 points on the MADRS.
[Figure: Sample Bland-Altman plot of in-study surveillance MADRS ratings. X-axis: average of the two ratings (0–40); Y-axis: difference between the ratings (-6 to +6). Points cluster around zero, with limits near +/-3 points.]
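For readers who want to reproduce this kind of plot, here is a minimal sketch with hypothetical paired MADRS scores; it draws the mean difference and, following the common Bland & Altman convention, mean ± 1.96 SD of the differences as the limit lines.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical paired MADRS totals: site rater vs. central reviewer
site = rng.integers(10, 36, size=60).astype(float)
central = site + rng.normal(0, 1.5, size=60)

avg = (site + central) / 2        # X-axis: average of the two ratings
diff = site - central             # Y-axis: difference between the ratings
mean_diff = diff.mean()
loa = 1.96 * diff.std(ddof=1)     # half-width of the limits of agreement

plt.scatter(avg, diff, s=12)
plt.axhline(mean_diff, linestyle="-")
plt.axhline(mean_diff + loa, linestyle="--")
plt.axhline(mean_diff - loa, linestyle="--")
plt.xlabel("Average of the two ratings")
plt.ylabel("Difference between ratings")
plt.title("Bland-Altman plot (hypothetical MADRS data)")
plt.show()
```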
Intraclass Correlation Coefficient (ICC)
The ICC (Shrout & Fleiss, 1979) is required for appropriate measurement of IRR with continuous outcome measures. ICC is a measure of the interchangeability of raters in a larger cohort. To calculate ICC, two or more raters must rate two or more subjects. An ICC, or any measure of reliability, cannot be calculated on ratings of a single subject. ICC is calculated as:

ICC = \frac{\text{Variance due to subjects}}{\text{Variance due to subjects} + \text{Variance due to raters} + \text{Residual Variance}}

Larger ICCs indicate better agreement between raters, or a higher degree of interchangeability. Confidence intervals should be reported when calculating ICCs.

Shrout and Fleiss (1979) proposed six forms of ICC. Decisions about which form of ICC is estimated should be based on the type and number of raters and on whether the outcome variable of interest is from a single rater or is the average score from multiple raters (e.g., four raters assess all subjects on the MADRS and the outcome variable is the average of the four scores).
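The sketch below illustrates the variance-ratio idea for the two-way random effects case, estimating the components from ANOVA mean squares in the Shrout and Fleiss framework; the data are hypothetical, and a production analysis would typically use an established ICC routine from a statistics package rather than hand-rolled mean squares.

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1), two-way random effects, single rating (Shrout & Fleiss, 1979).

    scores: (n_subjects, k_raters) array; the same raters rate all subjects.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # subjects (rows)
    msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # raters (columns)
    sse = ((x - x.mean(axis=1, keepdims=True)
              - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))                              # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical data: 6 subjects each rated by the same 3 raters
scores = [[20, 22, 21], [31, 30, 33], [15, 14, 16],
          [27, 25, 26], [18, 20, 19], [24, 26, 25]]
print(round(icc_2_1(scores), 3))
```

If the trial's outcome is the average of the k raters' scores rather than a single rating, the corresponding average-score form ICC(2,k) applies; it follows from the single-rating form via the Spearman-Brown relationship, ICC(2,k) = k·ICC(2,1) / (1 + (k − 1)·ICC(2,1)).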
Guidelines for selecting the appropriate ICC

Did the same raters rate all subjects? | Were raters selected from a larger pool? | Variable of interest | Correct ICC Formula / ANOVA Source
No | – | Single rating | ICC (1,1); One-way Random Effects ANOVA
No | – | Average rating | ICC (1,n); One-way Random Effects ANOVA
Yes | Yes | Single rating | ICC (2,1); Two-way Random Effects ANOVA
Yes | Yes | Average rating | ICC (2,n); Two-way Random Effects ANOVA
Yes | No | Single rating | ICC (3,1); Two-way Fixed Effects ANOVA
Yes | No | Average rating | ICC (3,n); Two-way Fixed Effects ANOVA
Common Misuses of Reliability Statistics
Dichotomizing continuous outcome measures
Kappa has often been misused to estimate the IRR of continuous outcome measures. In order to estimate Kappa from continuous outcome measures, the variable must be artificially transformed into a dichotomous or categorical variable. Kappa is highly influenced by the criterion measure selected. At times, a fixed criterion (e.g., +/-20 percent) is used to indicate rater agreement with a “gold standard” score. For example, with a criterion of +/-20 percent of the gold standard, 85 percent of raters may “meet criteria.” However, if the criterion is narrowed to within +/-10 percent of the gold standard, the number of raters meeting criteria may drop to 45 percent. Selecting a broader criterion range can artificially inflate Kappa.
To accurately assess the IRR of an outcome measure, IRR must be estimated using the variable as it will be used in the primary efficacy analysis. That is, dichotomization of variables for IRR should only take place if one plans to dichotomize the outcome measure in the final data analysis. Therefore, Kappa is almost always the incorrect measure of IRR for severity scales in CNS clinical trials.
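A small simulation along these lines (hypothetical data; the +/-20 percent and +/-10 percent criteria mirror the example above) shows how the criterion width alone changes the proportion of raters who “meet criteria,” before any Kappa is even computed:

```python
import numpy as np

rng = np.random.default_rng(42)
gold = 30.0                                   # hypothetical gold-standard MADRS total
raters = gold + rng.normal(0, 4, size=200)    # 200 raters' scores of the same video

for pct in (0.20, 0.10):
    hits = np.abs(raters - gold) <= pct * gold
    print(f"criterion +/-{pct:.0%}: {hits.mean():.0%} of raters meet criteria")
```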
Treating Items as Subjects
It is sometimes impossible to obtain ratings of multiple observations or subjects. In cases where two or more raters rated a single subject (as in a group calibration at an investigator meeting), one common error is to treat individual items on a scale as independent observations to compensate for the lack of multiple observations. However, ICCs calculated this way may be inversely related to the reliability of a construct (James, Demaree, & Wolf, 1984). For example, imagine a situation in which 20 raters scored one videotaped Montgomery-Asberg Depression Rating Scale (MADRS) assessment at an investigator meeting. If one treats the individual items of the MADRS as 10 independent observations, a high ICC is achieved by definition simply because the between-item mean squares are large in relation to the within-item mean square. That is, higher ICCs are actually inversely related to internal scale consistency, which may indicate that raters are not applying the scale correctly, and additional observations may reveal that interrater reliability issues are present.
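To see the trap concretely, this sketch (hypothetical data) treats 10 MADRS items as “subjects” for 20 raters who scatter substantially on every item; because the item means differ far more than the raters do within an item, the one-way ICC(1,1) built from those same between- and within-item mean squares comes out deceptively high.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical mean severity for each of the 10 MADRS items (0-6 scale)
item_means = np.array([4.0, 3.5, 2.0, 1.0, 3.0, 0.5, 2.5, 1.5, 3.5, 2.0])

# 10 "observations" (items) x 20 raters, with substantial rater noise
scores = item_means[:, None] + rng.normal(0, 0.8, size=(10, 20))

n, k = scores.shape
grand = scores.mean()
bms = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)      # between "subjects"
wms = ((scores - scores.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
icc_1_1 = (bms - wms) / (bms + (k - 1) * wms)
print(round(icc_1_1, 2))  # deceptively high, driven purely by between-item spread
```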
When it is not possible to obtain ratings of multiple subjects, interrater agreement (not reliability) should be estimated using the agreement indices presented below.
Diagnosis
Reliability of psychiatric diagnosis can only be calculated when two or more raters rate two or more subjects. Reliability for diagnosis cannot be estimated with only one observation or subject because there is no variability.
Methodology | Recommendation for Diagnostic Reliability
Investigator meeting rating precision exercise | Fleiss' Kappa
In-study surveillance | Cohen's Kappa

Methodology | Recommendation for Outcome Measure Reliability (Two or more observations)
Investigator meeting rating precision exercise | Repeated measures ANOVA with effect size; ICC (2,1) with 95% confidence intervals
In-study surveillance | Paired samples t-test with effect size; Bland-Altman plot; ICC (1,1) with 95% confidence intervals

Methodology | Recommendation for Outcome Measure Reliability (Single observation)*
Investigator meeting rating precision exercise and in-study surveillance | CoV; rwg; ADM or ADMD

*Reliability cannot be estimated from one observation. Therefore, we recommend obtaining ratings of two or more subjects whenever possible.
Estimating Interrater Agreement from a Single Observation
The statistics shown above require that reliability be measured on more than one subject. However, it is not always possible to obtain multiple observations. While it is not possible to estimate reliability with only one observation, agreement can be estimated.
The most straightforward agreement statistic for a single observation is the Coefficient of Variation (CoV), a standardized measure of the variability of rater scores, calculated as the standard deviation divided by the mean. The lower the CoV, the more aligned the raters' scores, with 0 indicating that all of the scores are the same.
Alternatively, one can estimate rwg (James, Demaree & Wolf, 1984) to compare the observed variance in multiple raters' ratings of a single target to the variance expected if all of the ratings were random. rwg typically ranges from zero to one, with higher values indicating greater agreement.
Finally, average deviation (AD) indices such as the average deviation of the mean (ADM) or median (ADMD) can be used to estimate agreement among raters on a single observation (Burke, Finkelstein & Dusig, 1999). Average deviation is calculated as the average absolute deviation across raters from a point of central tendency, namely the mean or median. One benefit of ADM and ADMD is that they maintain the raw metric of the observed variable.
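The sketch below computes all three single-observation agreement indices for a hypothetical set of 20 raters scoring one 0-6 MADRS item; the rwg null (uniform) variance, (A² − 1)/12 for A response options, follows the James, Demaree & Wolf convention, though other null distributions can be justified.

```python
import numpy as np

# Hypothetical: 20 raters each score the same 0-6 MADRS item once
ratings = np.array([3, 4, 3, 3, 2, 4, 3, 3, 5, 3,
                    4, 3, 2, 3, 4, 3, 3, 4, 2, 3], dtype=float)

cov = ratings.std(ddof=1) / ratings.mean()           # Coefficient of Variation

options = 7                                          # a 0-6 scale has 7 options
expected_var = (options ** 2 - 1) / 12               # variance if ratings were uniform random
r_wg = 1 - ratings.var(ddof=1) / expected_var        # 1 = perfect agreement

ad_m = np.abs(ratings - ratings.mean()).mean()       # AD from the mean (ADM)
ad_md = np.abs(ratings - np.median(ratings)).mean()  # AD from the median (ADMD)

print(f"CoV={cov:.2f}, rwg={r_wg:.2f}, ADM={ad_m:.2f}, ADMD={ad_md:.2f}")
```

Note that ADM and ADMD are reported in scale points, which keeps their interpretation tied to the instrument's raw metric.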
Reliability can have a significant impact on clinical trial outcomes. It is important to accurately assess and report IRR prior to study start and throughout the course of a clinical trial. When IRR is assessed prior to study start, it is possible for researchers to employ a methodology for obtaining IRR data that fully exploits the strengths of a particular statistic. However, since these estimates are often obtained without independent interviews (i.e., watching videotaped assessments), in artificial settings (i.e., at investigator meetings) and at a single point in time (i.e., prior to the start of the study), it is important to couple these estimates with IRR calculated from actual trial assessments throughout a study.
When selecting reliability statistics, researchers must take into account the type of variable (e.g., binary, nominal, interval), the number of raters, the composition of the rater pool (i.e., same raters rate all subjects vs. raters selected from a larger pool) and the number of observations, using the guidelines presented for the various methodologies.
Disclosure
One or more authors report potential conflicts, which are described in the program.
References
Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 1986;327(8476):307-310.
Burke MJ, Finkelstein LM, Dusig MS. On average deviation indices for estimating interrater agreement. Organizational Research Methods, 1999;2(1):49-68.
Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960;20(1):37-46.
Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin, 1971;76(5):378-382.
James LR, Demaree RG, Wolf G. Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 1984;69:85-98.
Khin NA, Chen Y, Yang Y, Yang P, Laughren TP. Exploratory analyses of efficacy data from major depressive disorder trials submitted to the US Food and Drug Administration in support of new drug applications. Journal of Clinical Psychiatry, 2011;72(4):464-472.
Muller MJ, Szegedi A. Effects of interrater reliability of psychopathologic assessment on power and sample size calculations in clinical trials. Journal of Clinical Psychopharmacology, 2002;22:318-325.
Mulsant BH, Kastango KB, Rosen J, Stone RA, Mazumdar S, Pollock BG. Interrater reliability in clinical trials of depressive disorders. American Journal of Psychiatry, 2002;159:1598-1600.
Perkins DO, Wyatt RJ, Bartko JJ. Penny-wise and pound-foolish: the impact of measurement error on sample size requirements in clinical trials. Biological Psychiatry, 2002;47:762-766.
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 1979;86(2):420-428.