Transcript of: W.F. Sensakovic, PhD, DABR, MRSC (2/9/2017)

Page 1

W.F. Sensakovic, PhD, DABR, MRSC

Attendees/trainees should not construe any of the discussion or content of the session as insider information about the American Board of Radiology or its examinations.

Public Domain

Page 2

• Task is complex

– Outline subtle tumor

• Unquantifiable human element

– Clinical decision or Human visual system

• Human response is goal

– Does widget “A” make it easier for the observer to detect the microcalcification?

A bunch of observers look at a bunch of subject images to create data that is then analyzed.

Page 3

CC 3.0: Zzyzx11

CC 3.0 Aaron Dodson, from The Noun Project

Detection

Delineation

Diagnosis

A bunch of radiologists look at a bunch of CT scans (FBP or Iterative Recon) and record a probability of malignancy for each. ROC analysis determines if iterative reconstruction impacts diagnosis.

Page 4

Widely Used Scale

Definitely or almost definitely malignant

Probably malignant

Possibly malignant

Probably benign

Definitely or almost definitely benign

Based on: Swets JA, et al. Assessment of Diagnostic Technologies. Science 205(4408):753 (1979)

Including clinical relevance

Malignant—diagnosis apparent—warrants appropriate clinical management

Malignant—diagnosis uncertain—warrants further diagnostic study/biopsy

I’m not certain—warrants further diagnostic study

Benign—no follow‐up necessary

Based on: Potchen EJ. Measuring Observer Performance in Chest Radiology: Some Experiences. J Am Coll Radiol 3:423 (2006)

• Typically 5‐7 categories

• Validated scale if available and appropriate

• The continuous vs. categorical difference is biggest for single‐reader studies

– Wagner RF et al. Continuous versus Categorical Data for ROC Analysis Some Quantitative Considerations. Acad Rad 8(4): 328 (2001).

• No practical difference between discrete and continuous scales for ratings

– Rockette HE, et al. The use of continuous and discrete confidence judgments in receiver operating characteristic studies of diagnostic imaging techniques. Invest Radiol 27(2):169 (1992).

[Rating scale: Probability of Malignancy 0% 20% 40% 60% 80% 100%]

Page 5

• Best
– Abnormal: Biopsy or other gold standard
– Normal: Follow‐up (e.g., 1‐year) post imaging

• Combined reads (expert panel)
– In a 3‐system comparison, the “best” system depended on the method used for truth
• Revesz G et al. The effect of verification on the assessment of imaging techniques. Invest Radiol 18:194 (1983).
– Report variability in consensus
• Bankier AA et al. Consensus Interpretation in Imaging Research: Is There a Better Way? Radiology 257:14 (2010).

• Task is binary (e.g., Malignant vs. Benign)

• Multi‐Reader, Multi‐Case (MRMC)

• Multiple treatments (e.g., IR vs FBP)

• Traditional, Fully‐Crossed, Paired‐Case Paired‐Reader, Full Factorial
– Every observer reads every case in every modality
– Data correlations allow us to get the highest power and lowest sample requirements

Page 6

• Software (free or not) does it for you
– ROC software listed later
– Some is unsupported or not functional on modern computers, but may still run in an emulator such as DOSBox (https://www.dosbox.com)

MATH

• True Positive (TP)
– Sensitivity
• False Positive (FP)
– 1‐Specificity
• True Negative (TN)
• False Negative (FN)

CC 3.0: Marco Evangelista
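The four counts above combine into the ROC axes. A minimal sketch in Python, with made-up counts (the specific numbers are illustrative, not from the talk):

```python
# Hypothetical counts at a single decision threshold (illustrative only).
TP, FP, TN, FN = 40, 10, 90, 10

sensitivity = TP / (TP + FN)   # true positive fraction (ROC y-axis)
specificity = TN / (TN + FP)   # true negative fraction
fpf = 1 - specificity          # false positive fraction (ROC x-axis)

print(sensitivity, specificity, fpf)
```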

Page 7

Does iterative reconstruction impact diagnosis of malignancy in lung lesions?

Case #/Truth   Obs. 1   Obs. 2
1/Malignant    10.0     8.9
2/Benign       4.4      6.3
3/Benign       3.4      2.7
5/Malignant    5.6      5.2
6/Malignant    7.7      7.0
7/Malignant    9.2      8.1
…              …        …

[ROC plots: True Positive Fraction (Sensitivity) vs. False Positive Fraction (1‐Specificity), With IR and W/O IR]
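The threshold sweep that turns a ratings table like the one above into an ROC curve can be sketched in Python. The ratings below are Observer 1’s values from the table; `roc_points` is a hypothetical helper, not software referenced in the talk:

```python
# Observer 1's ratings, split by truth (values from the table above).
malignant = [10.0, 5.6, 7.7, 9.2]   # ratings for truly malignant cases
benign = [4.4, 3.4]                 # ratings for truly benign cases

def roc_points(pos, neg):
    """Sweep the decision threshold over every observed rating and record
    (false positive fraction, true positive fraction) at each cut."""
    points = []
    for t in sorted(set(pos + neg), reverse=True):
        tpf = sum(r >= t for r in pos) / len(pos)
        fpf = sum(r >= t for r in neg) / len(neg)
        points.append((fpf, tpf))
    return [(0.0, 0.0)] + points    # include the trivial corner

print(roc_points(malignant, benign))
```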

Page 8

Does iterative reconstruction impact diagnosis of malignancy in lung lesions?

[ROC plot: True Positive Fraction (Sensitivity) vs. False Positive Fraction (1‐Specificity), With IR and W/O IR]

• Yes, it improves diagnosis
• By how much?
– AUC = 0.8

Page 9

Does iterative reconstruction impact diagnosis of malignancy in lung lesions?

[ROC plot: True Positive Fraction (Sensitivity) vs. False Positive Fraction (1‐Specificity), With IR and W/O IR]

• Yes, it improves diagnosis
• By how much?
– AUC = 0.8
– AUC = 0.7

AUC interpretation: the average percent correct if observers are shown a random malignant and a random benign case and asked to choose the malignant one.
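The AUC interpretation above (percent correct in a two-alternative forced choice, ties counting half) can be checked numerically. This sketch uses Observer 1’s ratings from the earlier table; `empirical_auc` is a hypothetical helper:

```python
# Ratings from the earlier table (Observer 1), split by truth.
malignant = [10.0, 5.6, 7.7, 9.2]
benign = [4.4, 3.4]

def empirical_auc(pos, neg):
    """Probability a random positive is rated above a random negative."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count half
    return wins / (len(pos) * len(neg))

print(empirical_auc(malignant, benign))  # every malignant rating exceeds every benign one -> 1.0
```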

Page 10

• ROC Software will (generally):

– Calculate ROC and AUC for each observer

– Calculate combined ROC and AUC with dispersion

– Perform hypothesis test to determine if AUCs from 2 treatments significantly differ

• Non‐parametric ROC gives biased underestimates with a small number of rating categories

• Zweig MH, Campbell G. Receiver operating characteristic plots: a fundamental evaluation tool in clinical medicine. ClinChem 39:561 (1993).

• Parametric (semi‐parametric) may perform poorly if there are too few samples or if ratings are confined to a narrow range

– Metz CE. Practical Aspects of CAD Research Assessment Methodologies for CAD. Presented at the AAPM annual meeting.

• Results are only generalizable to the population of all observers if observer is treated as a random effect instead of a fixed effect
– Similarly for cases

Page 11

• Comparisons should be on the same cases
– Sensitivity 25%‐100% depending on case selection

• Nishikawa RM, et al. Effect of case selection on the performance of computer‐aided detection schemes. Med Phys 21, 265 (1994)

• Normal case subtlety must be considered to ensure a sufficient number of false‐positive responses

– Rockette, et al. Selection of subtle cases for observer‐performance studies: The importance of knowing the true diagnosis (1998).

• Study disease prevalence does not need to match disease population prevalence
– ROC AUC is stable between 2%‐28% study prevalence, but small increases in observer ratings are seen with low prevalence
• Gur D, et al. Prevalence effect in a laboratory environment. Radiology 228:10 (2003).
• Gur D, et al. The Prevalence Effect in a Laboratory Environment: Changing the Confidence Ratings. Acad Radiol 14:49 (2007).


Page 12

• We need to know:
– Minimum effect size of interest
• Smaller effects need more cases for testing
• Appendix C of ICRU 79 relates ΔSe (at a given Sp) to ΔAUC
– How much the difference varies
• More variation needs more cases for testing

CC BY‐SA 2.0 Barry Stock

• Sample size software (see references)
– Run a small pilot
– Program uses pilot data and resampling/Monte Carlo simulation to estimate variance for various model components (reader, case, etc.)

• Typical power 0.8 and α of 0.05
• Typical numbers are 3‐5 observers and 100 case pairs (near‐equal split of normal/abnormal)
– ICRU Report 79
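The pilot-plus-resampling idea can be sketched as a case-level bootstrap of the empirical AUC. Everything below is illustrative: the pilot ratings are made up, and real sample-size software models reader and case variance components jointly rather than this simple case-only resample:

```python
import random

random.seed(0)
pos = [8.1, 6.3, 9.0, 7.2, 5.5, 8.8]   # made-up pilot ratings, abnormal cases
neg = [3.1, 4.0, 5.8, 2.2, 4.9, 3.7]   # made-up pilot ratings, normal cases

def auc(p, n):
    """Empirical AUC: fraction of (abnormal, normal) pairs ranked correctly."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in p for b in n)
    return wins / (len(p) * len(n))

# Bootstrap: resample cases with replacement, recompute AUC each time.
aucs = []
for _ in range(2000):
    bp = [random.choice(pos) for _ in pos]
    bn = [random.choice(neg) for _ in neg]
    aucs.append(auc(bp, bn))

mean = sum(aucs) / len(aucs)
var = sum((a - mean) ** 2 for a in aucs) / (len(aucs) - 1)
print(mean, var)   # variance estimate feeds the sample-size calculation
```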

Page 13

• 50 observers, 530 cases each . . . Probably pass

Pilot Data

• Observer training
– Non‐clinical task, specialized software, new modality
• Data/truth verification
– 45% of truth cases contained errors
– Armato SG, et al. The Lung Image Database Consortium (LIDC): Ensuring the Integrity of Expert‐Defined “Truth”. Acad Radiol 14:1455 (2007)
• Display and acquisition
– Clinical conditions and equipment

Page 14

• Bias from re‐reading
– A few weeks rule‐of‐thumb (unless case is unusual)
– Block study design (see refs. below)
– Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 24:234 (1989).
– Metz CE. Fundamental ROC analysis. In: Beutel J, et al. Handbook of medical imaging. Vol 1. Bellingham, WA: SPIE Press, 2000.
• Observer Experience
– At Sp 0.9:
• Se = 0.76 (high volume mammographers)
• Se = 0.65 (low volume mammographers)
• Esserman L, et al. Improving the accuracy of mammography: volume and outcome relationships. J Natl Cancer Inst 6;94(5):369 (2002)

• According to ICRU Report 79

– Study description mindful of blinding

– Types of relevant abnormalities and their precise study definition

– How to perform task and record data

– Unique conditions observers should or should not consider

Page 15

• ROC is costly (time and/or money)
• Best used when looking for small to moderate, but important, differences
– ~5% (ICRU Report 79)
– A bigger difference could be detected with an easier testing methodology
– Smaller differences might be too costly to detect or clinically insignificant

1. No localization

A bunch of radiologists look at a bunch of chest radiographs (CR and DR) to determine if pneumonia is present. ROC determines if the modalities are equivalent.

Page 16

• Rating scales, sample size, and truth essentially the same as in diagnosis observer study . . .

• . . . but the tasks are very different!

Public Domain CC 2.0: Abhijit Tembhekar

Reduced observer variability: 20% → 7%

Abnormal—diagnosis apparent—warrants appropriate clinical management

Abnormal—diagnosis uncertain—warrants further diagnostic study

I’m not certain—warrants further diagnostic study

Abnormal—but not clinically significant

Normal

Potchen EJ. Measuring Observer Performance in Chest Radiology: Some Experiences. J Am Coll Radiol 3:423 (2006)

Clinical relevance reduces variability and thus sample size requirements

Page 17

2. Localization

A bunch of radiologists look at a bunch of radiographs with and without a CAD system, mark the centroid of nodules if present, and give confidence ratings. FROC determines if CAD helps.

Courtesy William F. Sensakovic

• Mark lesion centroid

• Determine how close a mark must be to count as a “hit”
– 50% ROI overlap
– Radius based on size of largest lesion
• Haygood TM, et al. On the choice of acceptance radius in free‐response observer performance studies. Br J Radiol 86 (2013)

[Rating scale: Confidence 0% 20% 40% 60% 80% 100%]

Page 18

• A bunch of dosimetrists outline the brainstem on CT scans displayed at two different window/level settings. “Distance” between outlines is calculated. ANOVA is used to test if outlines are impacted by window/level settings.

• Phantom
– Know exact size
– Clinically relevant?
• Combined outlines on patient images
– Union/Intersection
– P‐Map
• Meyer CR, et al. Evaluation of Lung MDCT Nodule Annotation Across Radiologists and Methods. Acad Radiol 13(10): 1254 (2006).
– STAPLE
• Warfield SK. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans Med Imaging 23(7):903 (2004).

Page 19

• Jaccard Similarity Coefficient
– Count pixels in the intersection (A ∩ B)
– Count pixels in the union (A ∪ B)
– Divide intersection by union: J = |A ∩ B| / |A ∪ B|
• Dice
– D = 2J/(1+J)
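The pixel-counting recipe above can be sketched directly on two hypothetical binary masks represented as pixel-coordinate sets:

```python
# Hypothetical binary masks as sets of pixel coordinates (illustrative only).
A = {(0, 0), (0, 1), (1, 0), (1, 1)}   # pixels inside outline A
B = {(0, 1), (1, 1), (0, 2), (1, 2)}   # pixels inside outline B

inter = len(A & B)        # pixels in the intersection
union = len(A | B)        # pixels in the union
J = inter / union         # Jaccard similarity coefficient
D = 2 * J / (1 + J)       # Dice, via D = 2J/(1+J)

# Equivalent Dice form: 2|A ∩ B| / (|A| + |B|)
assert abs(D - 2 * inter / (len(A) + len(B))) < 1e-12
print(J, D)   # J = 1/3, D = 0.5 for these masks
```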

Page 20

Dice   Jaccard
0.0    0.0
0.33   0.20
0.50   0.33
0.67   0.50
0.80   0.67

• Average Euclidean Distance

– Easy to understand

– Meaningful units

Page 21

• Average Euclidean Distance
– Find the shortest absolute distance from each boundary point of A to the boundary points of B
– Repeat for B to A
– Summary stats
• These metrics can fail to capture differences
– Dice/Jaccard: ~0.9
– Average distance: <1 mm
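The average-distance recipe above, sketched on two hypothetical boundary point lists (`nearest` is an illustrative helper):

```python
import math

# Hypothetical boundary points for two outlines (illustrative only).
A = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
B = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)]

def nearest(p, pts):
    """Shortest Euclidean distance from point p to the point set pts."""
    return min(math.dist(p, q) for q in pts)

d_ab = [nearest(p, B) for p in A]   # A to B
d_ba = [nearest(q, A) for q in B]   # B to A
avg = sum(d_ab + d_ba) / (len(d_ab) + len(d_ba))
print(avg)   # 0.5 for these parallel boundaries
```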

Page 22

• Hausdorff distance
– Take a point in A and find the shortest distance to B
– Repeat for all points of A
– Take the maximum of these shortest distances: h(A,B)
– Repeat for h(B,A)
– Hausdorff distance is the max of h(A,B) and h(B,A)
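The directed and symmetric Hausdorff distances described above, sketched on hypothetical point sets:

```python
import math

# Hypothetical boundary points (illustrative only); B has one outlier.
A = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
B = [(0.0, 0.5), (1.0, 0.5), (5.0, 0.5)]

def h(P, Q):
    """Directed Hausdorff: max over P of the shortest distance to Q."""
    return max(min(math.dist(p, q) for q in Q) for p in P)

hausdorff = max(h(A, B), h(B, A))   # symmetric Hausdorff distance
print(h(A, B), h(B, A), hausdorff)
```

Note h(A,B) and h(B,A) generally differ; here the outlier in B drives the symmetric value.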

Page 23

• Observers tend to agree with whatever is already drawn
– 48% increase in Jaccard when a previous outline is used
– Sensakovic WF, et al. The influence of initial outlines on manual segmentation. Med Phys 37(5):2153 (2010).
• Different boundary definitions can alter measurements by 20%
– Sensakovic WF, et al. Discrete‐space versus continuous‐space lesion boundary and area definitions. Med Phys 35(9):4070 (2008).

With Permission: Sensakovic WF, et al. Discrete‐space versus continuous‐space lesion boundary and area definitions. Med Phys 35(9):4070 (2008).

• Summary statistics, regression, hypothesis testing
– See “Practical Statistics for Medical Physicists” from the last two AAPM annual meetings

CC 3.0: ReubenGBrewer

Page 24

• Many intricacies to running an observer study

• Observer studies are among the most respected studies in radiology and in medicine in general

• Review (ROC and some FROC)
– ICRU Report 79
– Wagner RF et al. Assessment of Medical Imaging Systems and Computer Aids: A Tutorial Review. Acad Radiol 14:723 (2007)
– Chakraborty DP. New Developments in Observer Performance Methodology in Medical Imaging. Semin Nucl Med 41(6):401 (2011)
• Comparing ROC Methods
– Obuchowski NA, Beiden SV, Berbaum KS, et al. Multi‐reader, multicase ROC analysis: an empirical comparison of five methods. Acad Radiol 11:980–995 (2004).
– Toledano A. Three methods for analyzing correlated ROC curves: A comparison in real data sets from multi‐reader, multi‐case studies with a factorial design. Stat Med 22:2919–2933 (2003).
• Study Design
– Obuchowski NA. Multireader receiver operating characteristic studies: a comparison of study designs. Acad Radiol 2:709–716 (1995).
– Potchen EJ. Measuring Observer Performance in Chest Radiology: Some Experiences. J Am Coll Radiol 3:423 (2006)
• Power and Sample Size
– Hillis et al. Power Estimation for the Dorfman‐Berbaum‐Metz Method. Acad Radiol 11:1260 (2004).
– See also some of the software listed

Page 25

• FROC and JAFROC
– Chakraborty DP, et al. Observer studies involving detection and localization: Modeling, analysis, and validation. Medical Physics 31, 2313 (2004).
– Chakraborty DP. New Developments in Observer Performance Methodology in Medical Imaging. Semin Nucl Med 41(6): 401 (2011).
– Thompson JD, et al. Analysing data from observer studies in medical imaging research: An introductory guide to free‐response techniques. Radiography 20: 295 (2014).
– Thompson JD, et al. The Value of Observer Performance Studies in Dose Optimization: A Focus on Free‐Response Receiver Operating Characteristic Methods. J Nucl Med Technol 41:57 (2013).

• http://www.lerner.ccf.org/qhs/software/

• http://metz‐roc.uchicago.edu/MetzROC/software

• http://perception.radiology.uiowa.edu/Software/ReceiverOperatingCharacteristicROC/MRMCAnalysis/tabid/116/Default.aspx

• http://www.devchakraborty.com/index.php

• http://didsr.github.io/iMRMC/

• Web‐search your favorite software package and ROC

Page 26

• Sensakovic WF. MO‐FG‐206‐02: Implementation and Analysis of Observer Studies in Medical Physics. Med Phys 43, 3714 (2016); http://dx.doi.org/10.1118/1.4957320
• 2015 (Virtual Library)
– Phases, Levels, Controls, and All That: An Informal Session On Clinical Trials
• http://www.aapm.org/education/VL/vl.asp?id=4686
– Use and Abuse of Common Statistics in Radiological Physics
• http://www.aapm.org/education/VL/vl.asp?id=4685
– Uncertainty and Issues in Biological Modeling for the Modern Medical Physicist
• http://www.aapm.org/education/VL/vl.asp?id=4687
• 2016 (Handouts Available)
– Clinical trials and the medical physicist: design, analysis, and our role
– Implementation and Analysis of Observer Studies in Medical Physics
– Analysis of Dependent Variables: Correlation and Simple Regression
– Hypothesis or Hypotheses, That is the Question
• http://www.aapm.org/meetings/2016AM/PRAbs.asp?mid=115&aid=31918
• https://creativecommons.org/licenses/

Page 27

Correlation: Review of Terminology

• Dependent vs. Independent Variables
• Standard plot: X is Independent, Y is Dependent
• Linear vs. Monotonic
• Linear: increase in X leads to proportional increase in Y
• Monotonic: increase in X leads to some increase in Y

Aug 1, 2016 Labby - AAPM 2016 54

[Plot: y vs. x, illustrating linear vs. monotonic increase]

Page 28

Correlation: Review of Terminology

• Variable Type
• Continuous
– Example: Ionization chamber charge collected vs. Dose delivered
• Discrete
– Example: Number of patients seen vs. Calendar year
• Ordinal
– Example: Severity of normal tissue toxicity vs. Prescription Level
• Categorical
– Example: RECIST response classification vs. Radiologist Observer

Correlation: Metrics of Interest

• Four big categories of data
• Continuous
• Discrete
• Ordinal
• Categorical

Page 29

Correlation: Metrics of Interest

• Four big categories of data
• Continuous
• Discrete
• Ordinal
• Categorical
• Three major correlation metrics
• Pearson’s r
• Spearman’s ⍴
• Fleiss’ κ

Page 30

Correlation: Pearson’s r

• “Linear” or “Product-Moment” correlation
• Applies only to continuous data
• Parametric correlation
• Tendency of dependent variable to increase linearly with the independent variable
• Key Point:
• There is an assumed form to the relationship
• Linear, and therefore also monotonic
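A from-scratch sketch of Pearson’s r on made-up data (the function name and data are illustrative):

```python
def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
y = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]   # exactly linear in x, so r = 1
print(pearson_r(x, y))
```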

Correlation: Pearson’s r

[Six scatter plots with Pearson correlations. Top row: r = 1.00, r = 0.97, r = 0.76; bottom row: r = 0.96, r = 0.96, r = 0.75]

Page 31

Correlation: Pearson’s r

[Three scatter plots: r = 1.00, r = 0.97, r = 0.76]

Correlation: Spearman’s ⍴

• “Rank” correlation
• Applies to continuous, discrete, or ordinal data
• Non-parametric correlation
• Tendency of dependent variable to increase with the independent variable
• Key Point:
• There is no assumed relationship, only monotonicity
• Math: Pearson’s r of rank-transformed data
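The “Math” bullet above can be demonstrated directly: compute ranks, then apply Pearson’s r to them. Data and helper names are made up; ties are ignored for simplicity:

```python
def ranks(v):
    """Rank-transform a sequence (1 = smallest); assumes no ties."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson_r(x, y):
    n, mx, my = len(x), sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

x = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
y = [v ** 2 for v in x]              # curved but strictly increasing
rho = pearson_r(ranks(x), ranks(y))  # Spearman's rho
print(rho)                           # perfectly monotonic, so rho = 1.0
```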

Page 32

Correlation: Spearman’s ⍴

[Scatter plot of (X,Y) pairs; the lowest point, raw (0, 0), has rank (1, 1)]

Page 33

Correlation: Spearman’s ⍴

[Scatter plot annotated with the rank transform: raw (0, 0) → rank (1, 1); raw (0.05, 0.0025) → rank (2, 2); …; raw (1, 1) → rank (20, 20). Pearson’s r of the rank-transformed data: 1.00]

Page 34

Correlation: Spearman’s ⍴

[Six scatter plots comparing the two metrics. Top row: r = 1.00, ⍴ = 1.00; r = 0.97, ⍴ = 0.97; r = 0.76, ⍴ = 0.90. Bottom row: r = 0.96, ⍴ = 1.00; r = 0.96, ⍴ = 0.99; r = 0.75, ⍴ = 0.89]

Page 35

Correlation: Which Metric?

Continuous variables; “When one goes up, does the other (reliably) go down?”

[Plot: Relative Change from Baseline vs. Months after Baseline, for Disease Volume and Lung Volume]

Z.E. Labby et al, J Thorac Oncol 8, (2013)

Answer: Spearman’s ⍴

Page 36

Correlation: Fleiss’ κ

• Categorical correlation
• Applies only to categorical data
• Categorical data could be inherently ordinal
• Non-parametric correlation
• How well do independent categories sort dependent categories?
• Math: number of dependent‐independent pairs in agreement over the number expected by chance alone.
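A minimal Fleiss’ κ sketch following the observed-vs-chance-agreement description above. The per-subject rating counts below are hypothetical; the slide’s table reports only per-observer totals, which are not sufficient to recompute its κ = 0.64:

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters placing subject i in category j.
    Assumes the same number of raters for every subject."""
    N = len(counts)                      # subjects
    n = sum(counts[0])                   # raters per subject
    # observed agreement per subject, then averaged
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # chance agreement from the category marginals
    k = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# 4 subjects, 5 raters, 3 categories (hypothetical counts)
counts = [
    [5, 0, 0],
    [0, 5, 0],
    [4, 1, 0],
    [0, 0, 5],
]
print(fleiss_kappa(counts))
```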

Correlation: Fleiss’ κ

• Example:
• 5 radiologists contour tumors in 31 patients
• Response classification from baseline to post-chemo CT scans
• Progressive Disease
• Stable Disease
• Partial Response
• Complete Response

             Obs. 1  Obs. 2  Obs. 3  Obs. 4  Obs. 5
Progression  6       11      7       11      14
Stable       17      10      19      15      9
Partial      7       10      5       4       8
Complete     1       0       0       1       0

κ = 0.64
Landis and Koch, Biometrics 33, 159–174 (1977)

Page 37

Correlation vs. Agreement

• Quick tangent…
• Important question: Do you already know that the two variables will be correlated?
• Example: Tumor volumes as assessed by Physician vs. Algorithm

• Especially with implicit independent variables (i.e., the true value remains unknown), correlation isn’t as meaningful
• Correlation is only the strength of a relationship between two variables
• Agreement is the actual 1:1 accuracy

Bland and Altman, “Statistical methods for assessing agreement between two methods of clinical measurement,” Lancet 327, 307 (1986).

Page 38

Correlation vs. Agreement

[Bland–Altman plot: Difference (Physician – Algorithm) vs. Average of Physician and Algorithm, with lines at the Mean, Mean + 2SD, and Mean − 2SD]

Bland and Altman, “Statistical methods for assessing agreement between two methods of clinical measurement,” Lancet 327, 307 (1986).

• Absolute agreement vs. Relative agreement
• Absolute: plot raw differences
• Relative: plot log differences
• ln(Physician) − ln(Algorithm) = ln(Physician/Algorithm)
• Get mean, SD of log-transformed data, then apply exponential to get relative agreement bounds
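The Bland–Altman recipe can be sketched on hypothetical paired measurements (physician vs. algorithm volumes, made up for illustration):

```python
import math

# Hypothetical paired tumor-volume measurements (illustrative only).
physician = [10.2, 15.1, 8.7, 22.0, 13.5]
algorithm = [9.8, 16.0, 8.1, 21.1, 14.2]

diffs = [p - a for p, a in zip(physician, algorithm)]
avgs = [(p + a) / 2 for p, a in zip(physician, algorithm)]   # plot x-axis

mean = sum(diffs) / len(diffs)
sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (len(diffs) - 1))
upper, lower = mean + 2 * sd, mean - 2 * sd   # limits of agreement
print(mean, lower, upper)
```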

Page 39

Case #/Truth Obs. 1

1/Malignant 10.0

2/Benign 4.4

3/Benign 3.4

5/Malignant 5.6

6/Malignant 7.7

7/Malignant 9.2

… …

[Rating histograms, With IR: # of Ratings vs. Observer Malignancy Rating (0.0: Definitely Benign; 5.0: Indeterminate; 10.0: Definitely Malignant) for Actually Malignant and Actually Benign cases. Sliding a decision threshold along the rating axis splits cases into TP, FP, TN, and FN, tracing the curve of True Positive Fraction (Sensitivity) vs. False Positive Fraction (1‐Specificity)]

Page 40

AUC comparison is not appropriate if ROC curves cross each other

[ROC plot: two crossing curves; one better for screening, one maybe better for diagnostic use]

• Partial AUC
– McClish DK. Analyzing a Portion of the ROC Curve. Medical Decision Making 9(3): 190 (1989)

Sp and Se . . . Why bother with ROC?

• CR: Better Se, Worse Sp
• DR: Better Sp, Worse Se
• Which is better?

Based on example given by CE Metz during lectures at the University of Chicago, 2003

Page 41

• JAFROC
– Uses localization information
– More than one response per subject
• Chakraborty DP, et al. Observer studies involving detection and localization: Modeling, analysis, and validation. Med Phys 31, 2313 (2004)
• Practically, comparisons need whole ROC curves and not specific operating-point (TP,FP) or (PPV,NPV) values
– Reader ability and experience, societal norms, and even disease prevalence in the study can impact specific operating points and (PPV,NPV) values
• Wagner et al. Assessment of Medical Imaging Systems and Computer Aids: A Tutorial Review. Acad Radiol 14: 723 (2007)