CAD Panel Meeting

Statistical Issues in CADe Evaluations

Thomas E. Gwise, Ph.D.
Mathematical Statistician / Acting Team Leader
Division of Biostatistics, Office of Surveillance and Biometrics

November 18, 2009


Outline

• Statistical concepts
• Reader studies for CADe evaluation
  • Prospective and retrospective
  • Retrospective study design examples
  • Complications in retrospective studies
  • Choice of endpoints
  • Choice of controls
• Standalone studies
  • Compared to reader studies
  • Re-use of data


Statistical Evaluation of Diagnostic Tests

• Two dimensions are considered when evaluating diagnostic test performance.
  • How well can the test detect diseased cases?
    • Sensitivity: fraction of diseased patients who are test positive
  • How well can the test correctly identify non-diseased cases?
    • Specificity: fraction of non-diseased patients who are test negative
• Sensitivity and specificity are not comparable if estimated in separate studies.
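The two fractions above can be computed directly from a 2×2 table of test result vs. truth. A minimal sketch; the counts are made-up illustration values, not from any study:

```python
# Sensitivity and specificity from a 2x2 confusion table.
# tp/fn/tn/fp counts below are hypothetical illustration values.

def sensitivity(tp, fn):
    """Fraction of diseased patients who are test positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-diseased patients who are test negative."""
    return tn / (tn + fp)

tp, fn = 45, 5     # diseased cases: detected / missed
tn, fp = 90, 10    # non-diseased cases: correctly negative / false alarms

print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.9
```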


ROC Curves

ROC curves plot sensitivity (Se) against 1 − specificity (Sp), considering all possible cutoffs.
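A minimal sketch of how an empirical ROC curve arises from sweeping every possible cutoff over a set of scores; all score values are invented for illustration:

```python
# Sweep every possible cutoff over the pooled scores and record the
# (false positive fraction, sensitivity) pair at each cutoff; these
# points trace the empirical ROC curve. Scores are toy values.

def roc_points(neg_scores, pos_scores):
    cutoffs = sorted(set(neg_scores + pos_scores), reverse=True)
    points = [(0.0, 0.0)]  # strictest cutoff: call nothing positive
    for c in cutoffs:
        se = sum(s >= c for s in pos_scores) / len(pos_scores)
        fpf = sum(s >= c for s in neg_scores) / len(neg_scores)
        points.append((fpf, se))
    return points

neg = [0.1, 0.2, 0.4, 0.5]   # non-diseased cases
pos = [0.3, 0.6, 0.7, 0.9]   # diseased cases
for fpf, se in roc_points(neg, pos):
    print(fpf, se)
```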


Statistical Evaluation of Diagnostic Tests

Does the test add value?

Example: Is a diagnostic test for bone mineral density better than just using a person’s age in diagnosing osteoporosis?

Example: Does use of a CADe device improve diagnostic performance of readers?

Examples of improvement:
• Sensitivity and specificity both better
• ROC plot (or area) better
• Improved reading time, same performance


Intended Use

• The vast majority of submissions for CADe devices to date have been for those labeled as second readers, aids to physicians.
• The user is directed to completely evaluate images as practice dictates before initiating CADe.
• As such, it is expected that using the device in accordance with the label will improve the performance of the physician.


Prospective Study

• If study conduct matches intended use, it is generally believed that a good way to test for a change in performance is a multi-center, prospective, randomized clinical trial, e.g.:
  • Randomize patients to the respective experimental conditions: unassisted image reading; CADe-assisted image reading.
  • Manage patients according to the evaluations, as in routine clinical practice.
  • Follow up patients to determine true disease state.
  • Analyze results and compare performance under the two experimental conditions.


Prospective Studies: Pros

• Study conduct matches indications for use (routine clinical practice, where reader decisions affect patient management).
• Estimate of performance under intended-use conditions


Drawbacks to Prospective Randomized Trials

• For intended use populations where disease prevalence is low, a prospective study as described would require a long duration and large enrollments to obtain enough disease cases to compare the performance of the two modalities.

• Risk to participants, if patient management will depend on readings in the study (IDE may be required)


Possible Proxies for Dx Performance in Population

• Retrospective Reader Studies

• Standalone Studies (bench testing without reader)


Retrospective Reader Studies

• Reader evaluations are made off-line on a retrospective data set of images on which disease status of patients has been established according to ground-truthing rules.

• Multi-reader Multi-case (MRMC) designs: multiple readers read some or all images

• Sample is enriched with disease cases.


Retrospective Reader Studies: Pros

• Not significant risk, because reader results are not used to manage patients (IDE not required)

• Very efficient. Relatively small sample size can result in precise estimates of sensitivity, specificity, ROC curve, and CADe effect on these endpoints.


Retrospective Reader Studies: Cons

• Reading behavior may not be the same as in routine clinical practice because:
  • Readers know their readings do not matter to the patient.
  • Readers may detect the enrichment, which could affect their reading behavior.
• Enrichment causes spectrum bias.
  • Example: enriching with challenging cases results in
    • downward bias in reader performance
    • upward bias in the CADe effect on the reader
• A small number of readers may not generalize.


Complications In Retrospective Reader Studies

• Reader variability issues

• Enrichment related biases

• Choice of controls

• Assumptions


Reader Variability

• 108 US mammographers
• read a common set of 79 mammograms
• provided a rating of suspicion of disease using the Breast Imaging Reporting and Data System (BI-RADS) rating scale of 1–5, where 5 is the highest level of suspicion of cancer

Data from Beam et al., Variability in the interpretation of screening mammograms by US radiologists, Arch Intern Med 1996;156:209-213, as in Wagner et al., Assessment of Medical Imaging and Computer-assist Systems: Lessons from Recent Experience, Acad Radiol 2002;9:1264–1277

[Figure: TPF vs. FPF (Sensitivity vs. 1 − Specificity) for the 108 US radiologists in the study by Beam et al., showing wide scatter across the ROC plane.]


Number of Readers

• Companies have submitted studies with 5 to 20 readers.

• Reader sample should represent intended use population of readers.

• A small number of readers may not be representative of the reader population.


Enrichment

• Enrichment is the process of supplementing the image sample with disease-positive images.
• Performance estimates obtained with enriched study samples will likely differ from performance in the intended use population.
• Differences in performance between modalities may be inferred to apply qualitatively to the intended use population if the spectrum of disease is properly represented.


Enrichment (Spectrum Effect)

• Different case mixes of lesion types will likely result in different performance estimates (spectrum effect)

• For example: in mammography, a CADe may have more difficulty detecting some masses than microcalcifications. A sample in which the proportion of microcalcifications to masses is large will give higher performance estimates than a sample in which that proportion is smaller.
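The case-mix dependence can be made concrete: overall sensitivity is the mix-weighted average of per-lesion-type sensitivities. The per-type values below (0.95 for microcalcifications, 0.70 for masses) are hypothetical, not measured:

```python
# Spectrum-effect sketch: the same device yields different overall
# sensitivity estimates under different case mixes. All numbers are
# hypothetical illustration values.

def overall_sensitivity(se_by_type, mix):
    """Case-mix-weighted average of per-lesion-type sensitivities."""
    return sum(se_by_type[t] * mix[t] for t in mix)

se_by_type = {"microcalc": 0.95, "mass": 0.70}

mix_heavy_calc = {"microcalc": 0.8, "mass": 0.2}  # microcalcification-heavy
mix_heavy_mass = {"microcalc": 0.2, "mass": 0.8}  # mass-heavy

print(overall_sensitivity(se_by_type, mix_heavy_calc))  # ≈ 0.90
print(overall_sensitivity(se_by_type, mix_heavy_mass))  # ≈ 0.75
```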

[Figures (two slides): score distributions for Disease (−) and Disease (+) cases.]


Enrichment (Easy Cases)

• Consider a sample of images enriched with a large proportion of disease-positive cases that are easily detected by readers and CADes.
  • Performance estimates for both modalities will likely be high.
  • It may be difficult to detect a difference in performance between the two modalities.

[Figure: simulated data comparing Reader Alone (red) with Reader + CADe.]


Enrichment (Challenging Cases)

• Stress test: a study in which a sample of images is enriched with a large proportion of positive cases considered to be difficult for readers and CADes to detect.

• Goal: to show that the device can add value in cases that are difficult for readers.

• Performance results obtained from studies on enriched samples cannot be easily generalized across studies.

[Figure: simulated data comparing Reader with Reader + CADe.]


Enrichment (Context Bias)

• Readers in a study environment will become aware of the enrichment and could change their reading behavior in response.

• Investigators attempt to mitigate this context bias by estimating relative performance.

Egglin et al., Context Bias: A Problem in Diagnostic Radiology, JAMA 1996;276:1752-1755


Background for Questions on Endpoints

• Contrast endpoints at specific thresholds (Se/Sp) with aggregating endpoints (ROC)


ROC Curves and Decision Variable Models

• ROC curves show how well a test separates diseased test scores from non-diseased test scores.
• Assume that a decision variable can model a reader’s decision process.
  • Example: Probability of Malignancy (POM)
  • Readers are instructed to rate an image with respect to the probability that it is malignant.
• Ratings simulated for 25 healthy and 25 diseased images
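Under this decision-variable model, the ROC AUC equals the probability that a diseased score outranks a non-diseased score. A toy simulation in the spirit of the slide (the Gaussian means and SDs are arbitrary choices, not fitted values):

```python
# Simulate POM-style ratings for 25 healthy and 25 diseased images from
# Gaussian decision variables, then estimate AUC as the probability that
# a diseased score exceeds a non-diseased score (Mann-Whitney statistic).
import random

random.seed(0)
healthy = [random.gauss(0.0, 1.0) for _ in range(25)]   # Disease (-)
diseased = [random.gauss(1.5, 1.0) for _ in range(25)]  # Disease (+)

def auc(neg, pos):
    """Fraction of (pos, neg) pairs correctly ordered; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc(healthy, diseased))  # well above 0.5 for this separation
```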

[Figure: Gaussian score distributions for Disease (−) and Disease (+) cases.]


ROC Curves Depend on Relative Ranking

• ROC curves are invariant to monotone transformations

• Relative ranking is the key
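A quick check of the invariance claim: applying a strictly increasing transform (here `exp`) to every score leaves the empirical AUC unchanged, because only the relative ranking matters. The scores are toy values:

```python
# ROC/AUC depend only on relative ranking: a strictly increasing
# transform of every score preserves the ordering of all pairs, so the
# empirical AUC is identical before and after the transform.
import math

def auc(neg, pos):
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

neg = [0.1, 0.4, 0.5, 1.2]  # non-diseased scores (toy values)
pos = [0.3, 0.9, 1.5, 2.0]  # diseased scores (toy values)

a1 = auc(neg, pos)
a2 = auc([math.exp(s) for s in neg], [math.exp(s) for s in pos])
assert a1 == a2  # exp preserves the ordering of every pair
print(a1)        # 0.75
```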

[Figure: Gaussian model illustration.]


Complication

• A very large fraction of responses for certain detection tasks fall in the extreme ranges of the scale (Gur et al.).
• A similar pattern is not uncommon in reader study results submitted to FDA.

Gur et al., “Binary” and “Non-Binary” Detection Tasks: Are Current Performance Measures Optimal? Acad Radiol 2007;14:871-876

[Figure: score distributions for Disease (−) and Disease (+) cases concentrated at the extremes of the rating scale.]


• Certain tasks that are binary in nature are better represented by a binary endpoint, both conceptually and statistically.
• In simulations, Gur et al. showed that a binary task is evaluated with less bias and variability if a binary rather than a continuous scale is used.
• For a task that is essentially binary, such as detecting microcalcifications, how rigorous can we expect relative rankings to be?

Gur et al., “Binary” and “Non-Binary” Detection Tasks: Are Current Performance Measures Optimal? Acad Radiol 2007;14:871-876


ROC Based Endpoints

• Good for comparing tests over all possible cutoffs

• Use information efficiently

• Following slides discuss details associated with ROC analyses

[Figure: ROC curves for the Control Modality and the CADe Modality.]

The difference between the AUCs is the average difference in Se over all Sp.


Comparable AUCs?

Depends on clinical context


Is all of the difference in AUC clinically relevant?

[Figure: ROC curves for the Control Modality and the CADe Modality, with a context-dependent bound marking the clinically relevant region.]

• Possible to weight regions according to clinical relevance?
• Partial AUC?
• Use other device-specific criteria?


Thresholds (Se/Sp)

• Intuitive
• Binary, similar to practice => work up or not?
• Obviate adapting readers to unfamiliar rating scales
• Mimic reality
• Same framework as post-market information (spectrum bias still an issue)


Example: “Keep All Positives from Unaided Read” Rule

• Several second-reader CADe device labels require or imply that positive findings on the initial unaided read should not be negated by the CADe-aided read.


Endpoints Specific to Intended Use (“Keep All Positives from Unaided Read” rule)

• “Therefore, the radiologist’s work-up decision should not be altered if the system fails to mark an area that the radiologist has detected on the initial film review and has already decided requires further work-up. Nor should the decision be affected if the system marks an area that the radiologist decides is not suspicious enough to warrant further work-up, whether the area is detected by the radiologist on initial film review or only after being marked by the system.” (From SecondLook label)

• “The radiologist should base interpretation only upon the original images and not depend on the CAD markers for interpretation.”
• “The device is a detection aid, not an interpretative aid. The CAD markers should be activated only after the first reading.”
• “The device does not identify all areas that are suspicious for cancer. Some lesions are not marked by the device and a user should not be dissuaded from working up a finding if the device fails to mark that site.” (From R2 label)


Applying the “Keep All Positives from Unaided Read” Rule

• Se, unaided to CADe-aided => change is non-negative
• Sp, unaided to CADe-aided => change is non-positive
• Bound the increase of FPF

[Figure: ROC plane showing the unaided reader’s operating point (Se, 1 − Sp) and the success* region.]

*Biggerstaff, “Comparing diagnostic tests: a simple graphic using likelihood ratios,” Stat Med, 2000.
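A hypothetical check of this success criterion: CADe-aided Se must not fall below unaided Se, and the FPF (1 − Sp) increase must stay within some context-dependent bound. The 0.05 bound and all operating points below are illustrative, not regulatory values:

```python
# Sketch of the "keep all positives" success criterion. The FPF bound
# and the example operating points are hypothetical.

def meets_rule(se_unaided, sp_unaided, se_aided, sp_aided, max_fpf_increase=0.05):
    se_ok = se_aided >= se_unaided                    # Se change non-negative
    fpf_increase = (1 - sp_aided) - (1 - sp_unaided)  # growth in false positives
    return se_ok and fpf_increase <= max_fpf_increase

print(meets_rule(0.80, 0.90, 0.86, 0.88))  # True: Se up, FPF up by only 0.02
print(meets_rule(0.80, 0.90, 0.78, 0.95))  # False: unaided positives were lost
```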


Image Sample Required for Comparing the Same Two ROC Curves Using Different Accuracy Measures

• Compare sample size needs for various measures
• Context: two specified ROC curves
  • Detectable change in AUC
  • Corresponding detectable change at given false positive rates (FPRs) or over given FPR intervals

Zhou, Obuchowski and McClish 2002, Statistical Methods in Diagnostic Medicine, Wiley & Sons, Inc. NY


Detectable Changes

[Figure: ROC plot marking Se at FPR = 0.2 and the partial AUC over the FPR interval 0.1 < FPR < 0.2, normalized by the interval width (0.2 − 0.1).]


Sample Size Efficiency

Measure of Accuracy                   | Detectable Change | Ntotal (n+ = n−)
--------------------------------------|-------------------|-----------------
ROC AUC                               | 0.100             | 278
Se (FPR = 0.01)                       | 0.108             | 930
Se (FPR = 0.10)                       | 0.201             | 482
Se (FPR = 0.20)                       | 0.276             | 382
PAUC(FPR < 0.1) / (FPR2 − FPR1)       | 0.167             | 722
PAUC(FPR < 0.2) / (FPR2 − FPR1)       | 0.182             | 522
PAUC(0.1 < FPR < 0.2) / (FPR2 − FPR1) | 0.198             | 384

Adapted from Table 6.8, Zhou, Obuchowski and McClish 2002, Statistical Methods in Diagnostic Medicine, Wiley & Sons, Inc. NY


Not Uncommon Problem

• AUC difficult to interpret
• Post hoc PAUC as rescue?
• Choosing the bound has Type I error implications
• N for AUC may be too small to get useful PAUC or Se/Sp estimates
• Inadequate information => failed study


Endpoint Summary

• Sensitivity and specificity are more relevant than ROC AUC to the dichotomous decisions made in image reading.
• Drawbacks to using ROC analysis:
  • AUCs are not always easy to interpret
    • Crossing curves
    • Comparable FPF regions
  • Is reader scoring representative of practice?


Endpoint Summary

• “So my comment is about CADe. I want to point out that the ROC, which is, of course, a wonderful device for assessing the process is not perfectly relevant from the clinical setting. The clinical setting, there is a particular algorithm cut-point and decision are dichotomous. And so one had ought to focus on specific points on the ROC curve. And it seems to me that it is essential that your– that companies show that they have improved sensitivity, which to me means statistical significance or Bayesian probability that the sensitivity is improved. This is a very low hurdle.”—D Berry, March 2008 Radiological Devices Advisory Panel meeting.

• Pepe, M.S., Urban, N., Rutter, C. and Longton, G. (1997). Design of a study to improve accuracy in reading mammograms. J Clin Epidemiol 50:1327-1338.
• Van Belle, G. (2002). Statistical Rules of Thumb. Wiley & Sons, Inc., NY (p. 100).


Control Arm Discussion

• It is assumed that effectiveness or clinical utility can be shown by comparing unaided image reading to CADe-aided image reading.
• We formulated several questions for the panel concerning control arms for 510(k) (substantial equivalence). The next slides provide some background.


Example Non-Inferiority Test

[Figure: confidence interval for the difference (reader performance with CADe_new) − (reader performance with CADe_predicate), shown against 0 and the non-inferiority limit.]

Success: the CI of the difference in improvements is greater than some preset limit.


Study Design #1

• Readers read a common set of images under three modalities:
  • Unaided reading
  • CADe-aided reading with the study device
  • CADe-aided reading with the predicate
  • Note: CADe-aided reading performed according to label
• Randomize image order
• Washout periods between modalities
• Compare performance results
  • Unaided reading comparisons ensure clinical utility
  • A non-inferiority delta can be defined


Study Design #2

• Unaided reading vs. CADe-aided reading:
  • Unaided reading
  • CADe-aided reading with the study device
• Randomize image order
• Washout periods between modalities
• Compare performance results to recorded predicate performance (label, prior study)


CADe SE Study Example

• Assume study design #2 from the previous slide.
• Case mix
  • Predicate study: difficult to detect
  • New device study: easy to detect
• Readers
  • Predicate study: experienced specialists
  • New device study: minimally experienced
• Changes in performance (with CADe − without CADe) are similar in the two studies.


Changes In Performance Are Not Comparable Across Studies

• In design #2, the comparison across studies is confounded by spectrum bias and reader differences.
• With such a design, comparing changes across enriched studies effectively reduces the question to whether or not the CADe device offers any increase in performance over unaided reading.
• With respect to performance, comparing across enriched studies invites imprecise or erroneous SE and NSE conclusions due to confounding (case mix, reader differences, others).


Example Non-Inferiority Test

[Figure: confidence interval for the difference (reader performance with CADe_new) − (reader performance with CADe_predicate), shown against 0 and the non-inferiority limit.]

• Given there is an improvement with CADe-aided reading over reader alone
• Compared in the same study
• Success: the CI of the difference in improvements is greater than some preset limit.
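A sketch of such a non-inferiority comparison under a normal approximation. Here `delta_new` and `delta_pred` are each device's improvement (CADe-aided minus unaided) measured in the same reader study; success requires the lower confidence bound of their difference to exceed minus the margin. All numbers are illustrative, not from any submission:

```python
# Non-inferiority sketch: compare the improvement of the new device with
# that of the predicate. The standard error of the difference and all
# numeric values are hypothetical illustration choices.

def noninferior(delta_new, delta_pred, se_diff, margin, z=1.96):
    """True if the lower 95% bound of (delta_new - delta_pred) exceeds -margin."""
    lower = (delta_new - delta_pred) - z * se_diff
    return lower > -margin

# improvements (e.g., in AUC) over unaided reading:
print(noninferior(delta_new=0.06, delta_pred=0.05, se_diff=0.02, margin=0.05))  # True
print(noninferior(delta_new=0.02, delta_pred=0.05, se_diff=0.02, margin=0.05))  # False
```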


Standalone Studies

• Cannot show clinical utility because no reader is involved.
• Standalone studies may be useful for comparing a CADe device to a previous version or for investigating the performance of the device without the reader.
• Example: studying a sample large enough to characterize all important strata (diseased and non-diseased cases) can provide useful label information.


Enriched Standalone Studies

• Suffer the same complications as reader studies with respect to sample enrichment.
  • Results are not generalizable across studies.
• Performance estimators apply only to the sample.
  • Not simple random samples of the population
  • Do not represent standalone performance in the population


Reuse of Test Data (Standalone Studies)

• Some companies have proposed re-using test data in evaluating updated versions of CADes.


Multiplicity

• Multiple tests on the same data set will inflate Type I error.
• Sponsors must account for multiplicity.
  • Example: Bonferroni correction
• Practical problem: choosing α for a “reuse” test if α = 0.05 was already spent on the first of several tests.
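A minimal Bonferroni sketch: if k looks at the same test data are planned in advance, each test can run at α/k so the family-wise Type I error stays at α. The p-values below are hypothetical:

```python
# Bonferroni correction sketch for planned reuse of one test data set.
# The p-values are hypothetical illustration values.

def bonferroni_alpha(alpha_family, k):
    """Per-test significance level for k tests at family-wise level alpha."""
    return alpha_family / k

p_values = [0.012, 0.030, 0.047]           # three planned reuses of the data
alpha_each = bonferroni_alpha(0.05, len(p_values))
print(round(alpha_each, 4))                # 0.0167
print([p <= alpha_each for p in p_values]) # only the first test passes
```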


“Teaching to the Test”

• Each upgrade iteration on the same data could be considered training.

• Testing on training data => unreliable results (“teaching to the test,” “fitting to the noise”)

• This is in addition to multiplicity problems

• Difficult to quantify this bias.


Example of Overfitting

• Randomly generate a data set of 20 profiles having 6000 features each.
• Arbitrarily assign each member to one of two classes.
• Develop and evaluate classifiers using 3 processes:
  • Resubstitution method (“teaching to the test”):
    1) Build predictor on the full data set
    2) Reapply predictor to each specimen
  • Partial cross-validation:
    1) Leave one out
    2) Build classifier on the remaining data
    3) Classify the left-out point
    4) Repeat (total 20)
  • Nearly unbiased cross-validation

This example from Simon et al. illustrates the problems of overfitting in the context of developing algorithms for class prediction with gene expression data. The large number of features within relatively small samples makes this a good parallel to the situation faced by CADe developers.

Simon, R., Radmacher, M.D., Dobbin, K., McShane, L.M. Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification. Journal of the National Cancer Institute, Vol. 95, No. 1, January 1, 2003
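A shrunken re-creation of this experiment (20 samples but 1,000 random features instead of 6,000; a simple nearest-centroid classifier with feature selection; all parameters illustrative) shows the same pattern: resubstitution looks excellent on pure noise, while leave-one-out cross-validation with selection redone inside each fold stays near chance:

```python
# Pure-noise overfitting demo in the spirit of Simon et al. (shrunken
# and simplified: the data, classifier, and parameters are illustrative).
import random

random.seed(1)
n, d, k = 20, 1000, 10
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [i % 2 for i in range(n)]  # arbitrary class assignment, no real signal

def mean(v):
    return sum(v) / len(v)

def train(X, y):
    """Select the k features whose class means differ most; return centroids."""
    g0 = [i for i in range(len(y)) if y[i] == 0]
    g1 = [i for i in range(len(y)) if y[i] == 1]
    diffs = [(abs(mean([X[i][j] for i in g0]) - mean([X[i][j] for i in g1])), j)
             for j in range(len(X[0]))]
    feats = [j for _, j in sorted(diffs, reverse=True)[:k]]
    c0 = [mean([X[i][j] for i in g0]) for j in feats]
    c1 = [mean([X[i][j] for i in g1]) for j in feats]
    return feats, c0, c1

def predict(model, x):
    feats, c0, c1 = model
    d0 = sum((x[j] - a) ** 2 for j, a in zip(feats, c0))
    d1 = sum((x[j] - b) ** 2 for j, b in zip(feats, c1))
    return 0 if d0 <= d1 else 1

# Resubstitution ("teaching to the test"): select, fit, and score on all data.
model = train(X, y)
resub = mean([predict(model, X[i]) == y[i] for i in range(n)])

# Leave-one-out cross-validation: feature selection redone inside each fold,
# so the held-out case never influences the model that classifies it.
hits = []
for i in range(n):
    Xt, yt = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
    hits.append(predict(train(Xt, yt), X[i]) == y[i])
loocv = mean(hits)

print(resub, loocv)  # resubstitution looks excellent; LOOCV hovers around chance
```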

• Random data: expect ~½ to be misclassified.
• Resubstitution method (“teaching to the test”): 98.2% of data sets had zero misclassifications. (Simon et al., JNCI 2003)


Added Review Questions

• Any variation of reusing data would raise many difficult review issues:
  • Data integrity / access controls
    • Who has access to test data? When?
  • Theoretical basis for procedures
    • Published method?
    • Assumptions verifiable?
  • Selection bias
    • How were the images chosen?
  • Type I error control


Using Only Standalone Data

• A change in marker style can affect reader behavior (Krupinski et al. 1992; Gilbert et al. 2008).
• Changes in prevalence affect reader behavior (Egglin et al. 1996).
• Deduce that changes in CADe mark placement or frequency could affect reader behavior.
• A change to the algorithm is a change to the device, and the device acts on the reader’s Dx. It is difficult to know a priori which changes to an algorithm will produce a change in Dx performance.


Reader Studies Compared to Standalone Studies

• Reader studies investigate reader-device interaction.

• Standalone studies investigate only device performance.


Summary

• Endpoints for reader studies
  • Binary endpoint more relevant to study question
  • Sample size for appropriate endpoint
• Control arms for 510(k) reader studies
  • Is any improvement over unaided reading adequate?
• Reuse of data
  • Teaching to the test
• Evaluating CADes without readers
  • Does not show clinical utility
  • Does not investigate the device under its intended use


Thank You