…Borrelia diagnostics – statistical aspects
Jørgen Hilden, [email protected]
February 2009
Notes have been added in this file
Plan of my talk
Clinicometric framework
Descriptors of diagnostic power
Displays of diagnostic power
including the ROC diagram
Simultaneous use of 2 measurements
Randomized testing of diagn. procedures
Special topics in supplementary slides
Biostatistical motto: Formalism with a human face
Topics not mentioned
Systematic reviews &
meta-analyses
”Clinicometrics”
…always considers a stream of cases
( statisticians say: a population of cases ):
They are the units of clinical experience
and also of clinical decision making.
They are instances of a (well-defined?)
clinical problem,
”the who-how-where-why of
a patient-doctor encounter.”
Therefore…
In clinical studies the choice of sample, and of the variables on which one bases one's prediction,
must match the clinical problem as it presents itself at the time of decision making.
In particular, one mustn't discard subgroups (as ‘atypical’ or ‘impurities’) that did not become identifiable until later: ensure prospective recognizability !
Data collection*
*as opposed to the ’engineering’ phases
Purity vs. representativeness:
A meticulously filtered case stream ( 'proven single-agent infections', or 'meeting CDC criteria' ) may be needed for patho- and pharmaco-physiological research,
but is inappropriate as a basis for clinical decision policies
[incl. cost studies].
Data collection
Your job is to create decision rules that help the clinician decide, e.g.
- whether to proceed with antibiotics
- when to plan clin. & serol. follow-up checks
- when to apply other tests for, e.g., HSV
►ideally drawing a complete management flowchart, i.e. a bushy tree of action diagnoses, not etiological diagnoses
Don’t forget…
Consecutivity as a safeguard against selection bias.
Standardization: How? Who? Where? When? Gold standard … the big problem!! (with blinding, etc.)
Safeguards against change of data after the fact.
w3.consort-statement.org/Initiatives/stardClinical_Chemistry_statement.pdf
Data collection
Quantitative markers
A quantitative variable holds the result of a diagnostic procedure. Histograms describe its distribution in the two subpopulations.
We can interpret ordinates and areas under the two humps in terms of true and false decisions …
and get a feel for the trade-off involved, provided that the pre-test probability of disease (percentage diseased) is known.
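A note added here: a minimal numerical sketch of that trade-off, using two hypothetical Gaussian marker distributions, an arbitrary cutoff and an assumed 30% pre-test probability; none of these numbers come from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical marker values: the 'Diseased' hump lies above the 'Healthy' one.
healthy = rng.normal(loc=10.0, scale=2.0, size=10_000)
diseased = rng.normal(loc=14.0, scale=2.0, size=10_000)

cutoff = 12.0  # arbitrary illustrative cutoff; values above it are called "positive"

sensitivity = np.mean(diseased > cutoff)   # true positive fraction
specificity = np.mean(healthy <= cutoff)   # true negative fraction

# The trade-off only turns into case counts once a pre-test probability is supplied.
pretest_p = 0.30
false_neg_rate = pretest_p * (1 - sensitivity)        # diseased called negative
false_pos_rate = (1 - pretest_p) * (1 - specificity)  # non-diseased called positive

print(f"sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}")
print(f"per 100 patients: {100 * false_neg_rate:.1f} false negatives, "
      f"{100 * false_pos_rate:.1f} false positives")
```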
Focussing on the principle …
[Figure (shown twice): the 'Diseased' and 'Healthy' (non-disease) measurement distributions with a cutoff point separating a negative range from a positive range. Each area = 1.00 = 100% of its subpopulation; the areas beyond the cutoff on the wrong side are the false negatives and false positives. Sensitivity (true positive fraction) and specificity (true negative fraction) are the complementary areas. Note: BLACK&WHITE paradigm!]
The ’probability square’: pre-test ’case mix’ of 30% diseased vs. 70% non-diseased. The diseased column is split by the sensitivity (true positive fraction), the non-diseased column by 1 – specificity (false positive fraction); together these strips are all the positives. With specificity = 0.92, say, the true-negative area = 0.70 × 0.92 = 0.644, i.e. 64.4% of cases are true negatives; the other three areas are analogous.
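As a note, the arithmetic of the square can be spelled out as below; the sensitivity value is an assumption added for completeness, since the slide only fixes the specificity.

```python
# Probability square: pre-test case mix 30% diseased, 70% non-diseased.
pretest_p = 0.30
specificity = 0.92
sensitivity = 0.80   # assumed for illustration; not given on the slide

true_negatives = (1 - pretest_p) * specificity        # 0.70 * 0.92 = 0.644
false_positives = (1 - pretest_p) * (1 - specificity)
true_positives = pretest_p * sensitivity
false_negatives = pretest_p * (1 - sensitivity)

# The four areas partition the square, so they sum to 1.
assert abs(true_negatives + false_positives + true_positives + false_negatives - 1) < 1e-12
print(f"true-negative area = {true_negatives:.3f}")   # 0.644, i.e. 64.4% of cases
```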
Classical terminology
”Positive” = suggestive of (target) disease
”Negative” = suggestive of its absence
”False / True Positive / Negative …”
Sensitivity = TP/(those diseased)
Specificity = TN/(those without it)
What is meant by PV (”predictive value”)? What is meant by LR (”likelihood ratio”)?
Classical terminology (continued)
PVpos = the ”predictive value” of a positive outcome = TP/(all positives) = Pr{ disease | pos } … the chance that the test is right when it says ”positive”.
PVneg = the ”predictive value” of a negative outcome = TN/(all negatives) = Pr{ non-disease | neg } … the chance that the test is right when its verdict is ”negative”.
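A note with a small sketch computing these quantities from a 2×2 table of counts; the counts are invented, and the predictive values are only meaningful because the (hypothetical) table comes from a single, representative case stream.

```python
# Hypothetical 2x2 table from a single, representative case stream.
TP, FN = 24, 6     # diseased cases:     test positive / test negative
FP, TN = 56, 644   # non-diseased cases: test positive / test negative

sensitivity = TP / (TP + FN)   # TP / (those diseased)
specificity = TN / (TN + FP)   # TN / (those without it)

PV_pos = TP / (TP + FP)        # Pr{ disease | pos }
PV_neg = TN / (TN + FN)        # Pr{ non-disease | neg }

print(f"sens = {sensitivity:.2f}, spec = {specificity:.2f}")
print(f"PVpos = {PV_pos:.2f}, PVneg = {PV_neg:.2f}")
```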
”Likelihood ratio” principle …
Pre-test odds = 3 : 7 (30% diseased, 70% non-diseased). ”LR” = 5 : 1 (the ratio of the red arrows, i.e. sensitivity to 1 – specificity); ergo post-test odds = 15 : 7.
Pre-test odds are low in Lyme problems (few diseased, many non-diseased). ”LRpos” = 5 : 1 is fair, but the post-test odds and PVpos are still low: specificity is not bad, yet most positives are false positives.
Not quite so classical terminology:
Sensitivity = TP/(those diseased)
Specificity = TN/(those without it)
LRpos = the ”likelihood ratio” occasioned by a positive outcome = (sensitivity) / (1 – specificity) = Pr{ pos | disease } / Pr{ pos | non-disease }
LRneg = the ”likelihood ratio” occasioned by a negative outcome = (1 – sensitivity) / (specificity) = Pr{ neg | disease } / Pr{ neg | non-disease }
LRneg = 0.1 = 1 : 10, for instance. If the pre-test risk of Lyme disease is low, say p = 2%, a negative outcome almost eliminates it:
(post-test odds) = (pre-test odds)(LR) = (1 : 49)(1 : 10) = 1 : 490.
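A note spelling out the odds bookkeeping behind both worked examples (LRpos = 5 with pre-test odds 3 : 7, and LRneg = 1 : 10 with a 2% pre-test risk); the helper functions are mine.

```python
def post_test_odds(pre_test_odds: float, likelihood_ratio: float) -> float:
    """Odds form of Bayes' theorem: the LR is the factor that updates the odds."""
    return pre_test_odds * likelihood_ratio

def odds_to_prob(odds: float) -> float:
    return odds / (1 + odds)

# Example 1: pre-test odds 3 : 7, positive result with LRpos = 5.
odds1 = post_test_odds(3 / 7, 5)        # = 15 : 7
print(f"post-test odds = {odds1:.3f} (= 15/7), "
      f"post-test probability = {odds_to_prob(odds1):.2f}")

# Example 2: pre-test risk p = 2% (odds 1 : 49), negative result with LRneg = 1 : 10.
odds2 = post_test_odds(1 / 49, 1 / 10)  # = 1 : 490
print(f"post-test odds = 1 : {1 / odds2:.0f}, "
      f"post-test probability = {odds_to_prob(odds2):.4f}")
```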
”LR” principle: it’s the factor by which the observed data will change the odds
[Figure: the 'Diseased' and 'Healthy' measurement distributions with the cutoff point again. LRpos = the ratio of the two areas in the positive range; LRneg = the corresponding ratio in the negative range. Note that LRneg < 1 (!)]
”LR” principle: it’s still the factor by which the observed data will change the odds
[Figure: the same distributions, but now LR = the ratio of the two density ordinates at the observed value; the cutpoint is now irrelevant: LR(data) when data = a measurement value.]
Warning… A 2-gate study: 50 diseased and 75 non-diseased sampled separately. ”LRpos” = 5 : 1, but the ”predictive values” and the post-test odds are unavailable.
A 2-dim. task: IgM and IgG measured jointly.
[Figures: confirmed infections vs. confirmed non-infected cases in the (IgM, IgG) plane, and iso-likelihood-ratio lines (uphill arrow).]
A 2-dim. task: nearest-neighbours classification of a new patient (marked ?) among the diagnosed cases in the (IgM, IgG) plane.
A 2-dim. task: kernel methods form a weighted average of neighbouring prototypes (diagnosed cases) in the (IgM, IgG) plane, with decreasing influence the farther away they lie.
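A note sketching the kernel idea: each diagnosed prototype votes with a weight that decays with distance from the new patient. The Gaussian kernel, its bandwidth and the toy (IgM, IgG) values are illustrative choices of mine, not the procedure used for the slides.

```python
import numpy as np

def kernel_posterior(new_point, prototypes, labels, bandwidth=1.0):
    """Weighted vote of neighbouring prototypes (diagnosed cases):
    weight = Gaussian kernel of the distance, so influence decreases
    the farther away a prototype lies."""
    d2 = np.sum((prototypes - new_point) ** 2, axis=1)
    weights = np.exp(-d2 / (2 * bandwidth ** 2))
    # Weighted proportion of 'infected' (label 1) prototypes near the new patient.
    return np.sum(weights * labels) / np.sum(weights)

# Toy prototypes: columns are (IgM, IgG); labels 1 = confirmed infection, 0 = non-infected.
prototypes = np.array([[2.1, 3.0], [2.5, 2.8], [1.9, 3.3],   # infected cluster
                       [0.6, 0.9], [0.8, 1.2], [0.5, 0.7]])  # non-infected cluster
labels = np.array([1, 1, 1, 0, 0, 0])

new_patient = np.array([1.6, 2.2])
print(f"kernel-weighted P(infection) ≈ {kernel_posterior(new_patient, prototypes, labels):.2f}")
```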
[Figures: iso-density lines and iso-likelihood-ratio lines (uphill arrows) in the (IgM, IgG) plane for the Infection and Non-infected groups; simulated data (100+100).]
A ROC diagram shows the true positive fraction against the false positive fraction as a function of the choice of cutoff point
Hypothetical smooth trajectory, and two raw empirical ones [ sample sizes: 17+17 ( ), and 40+40 ( ) ]
[Figure labels along the ROC: 'Everyone negative' at one end, 'Everyone treated as positive' at the other; liberal cutoff vs. strict cutoff in between.]
The ROC diagram describes the nosographic properties: sens, spec; LRpos, LRneg = slopes of segments.
Y = Youden’s Index = sens + spec – 1 is equivalent to AUC [Area Under Curve] = ½(sens + spec) in this case.
[Figure: the two-segment ROC of a single binary (BLACK&WHITE) test, with reference lines Y = 1 and Y = 0 and the cells FN, TP, FP, TN for the 'neg' and 'pos' outcomes. We are within the Black & White paradigm.]
The ROC diagram describes the nosographics* (*i.e., the information obtainable from a 2-gate study).
The slope of each outcome line is its LR; e.g. LRpos = (TP fraction of Diseased) / (FP fraction of non-diseased).
[Figure: the same two-segment ROC with Y = 1, Y = 0 and the cells FN, TP, FP, TN for the 'neg' and 'pos' outcomes.]
Three test outcomes.
[Figure: ROC path whose segments are labelled 'Ominous', 'Almost no evidence either way', and 'Reassuring', with the ideal test for comparison.]
Three test outcomes.
[Figure: ROC for a three-outcome test ('Positive', '+/–', 'Negative', with 'Pos.?' and 'Neg.?' annotations), next to the ideal test.]
Ordered (ordinal) test outcomes. Ordered how? By increasing slope, i.e. LR [concavity!].
Three test outcomes.
[Figure: ROC with outcome labels 'Definitely positive', 'Possibly positive', '+/–', 'Negative', 'Neg.?'.]
Ordered (ordinal) test outcomes. Ordered how? By increasing slope, i.e. LR [concavity!].
The slope reflects the medical trade-off between % sensitivity and % specificity; those with a ”+/–” test result are best treated as negative in this situation.
Trade-off? Constant benefit? … Please take a look at the supplementary figures.
[Figure annotation: a ’constant-benefit’ line.]
Interpretation of the area under the ROC as a rank statistic (cf. Wilcoxon-Mann-Whitney).
E.g., 5 cases of disease D and 10 non-D cases: the ROC square holds 50 small rectangles, 40 of which happen to lie below the ROC trajectory, because 40 times (out of 50) it so happens that a non-D finding > a D-group finding [the desired ordering]. For an example, see patient * vs. patient **.
Area Under ROC Curve = freq{ (non-D value) > (D value) } = 0.80.
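A note turning that rank-statistic reading into a few lines of code, using the slide's ordering convention (a non-D value exceeding a D value is the desired direction, ties counting as ½); the toy values are invented but chosen so that 40 of the 50 pairs go the desired way, reproducing the AUC of 0.80.

```python
import numpy as np

def auc_rank_statistic(non_d_values, d_values):
    """Wilcoxon-Mann-Whitney reading of the AUC:
    the fraction of (non-D, D) pairs in which the non-D finding
    exceeds the D finding (ties counted as 1/2)."""
    non_d = np.asarray(non_d_values)[:, None]
    d = np.asarray(d_values)[None, :]
    wins = (non_d > d).sum() + 0.5 * (non_d == d).sum()
    return wins / (non_d.size * d.size)

# Toy example mimicking the slide: 5 D cases, 10 non-D cases -> 50 pairs.
d_values = [1.2, 2.0, 2.4, 3.1, 4.5]
non_d_values = [2.2, 2.9, 3.3, 3.6, 3.8, 4.1, 4.4, 4.9, 5.2, 5.8]

print(f"AUC = {auc_rank_statistic(non_d_values, d_values):.2f}")   # 40/50 = 0.80
```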
But where does that lead us? The AUC has no definable interpretation in terms of blood, sweat and tears (loss, benefit, utility).
It only has a soft association with decision-analytic measures of diagnostic power (separation, discrimination).
Its frequent use is purely a matter of being the popular girl in the class.
The primary virtues of the ROC: it allows you
(1) to compare tests regardless of scale, units, & transformations
(2) to see oddities [ which may point to a technical problem, or
call for a revised test interpretation rule ]
What!?
Lesions in floating locations. Suspect area? Red = as the imagist saw it; green = surgical truth.
How do we score diagnostic performance in such situations???
Randomized trials of diagn. tests
… theory under development
Purpose & design: many variants. Sub(-set-)randomization, depending on the patient’s data so far collected.
”Non-disclosure”: some data are kept under seal until analysis. No parallel in therapeutic trials!
Main purposes …
Digression…
… Randomized trials of diagn. tests
1) when the diagnostic intervention is itself potentially therapeutic;
2) when the new test is likely to redefine the disease(s) ( cutting the cake in a completely new way );
3) when there is no obvious rule of translation from the outcomes of the new test to existing treatment guidelines;
4) when clinician behaviour is part of the research question…
…end of digression
Statistical analysis … in the narrow sense … is very much standard once you know what aspects to count and compare.
To know that, work backwards from (likely) consequences:
what would have happened to these patients? And
what would have happened in the alternative scenario?
Never argue ”It’s customary to calculate … (this or that)” !
Thank you !
Let me add a personal maxim: Never ask
”What can the journal impact factors do for me?” Ask instead
”What can I do for the journal impact factors?”
Supplementary pictures follow here
… Vassily Vlassov pixit
The rôle of noise
Pure noise, independent of the patient’s true condition, flattens the distributions and hence flattens the ROC; less information.
Remedies: technical & procedural standardization, duplicate measurements,
(averaging over assessors, dominance-free consensus formation) …
… may be ineffective if the noise is ”inter-patient”
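A note with a quick simulation of the point above: adding pure, patient-independent noise to a hypothetical marker shrinks the separation between the two humps and lowers the AUC. All parameters are arbitrary.

```python
import numpy as np

def auc(non_diseased, diseased):
    """AUC as the fraction of (diseased, non-diseased) pairs in which the
    diseased value is the higher one (here, higher values suggest disease)."""
    return np.mean(diseased[:, None] > non_diseased[None, :])

rng = np.random.default_rng(1)
n = 2_000
true_d = rng.normal(14.0, 2.0, n)    # 'true' marker levels, diseased
true_nd = rng.normal(10.0, 2.0, n)   # 'true' marker levels, non-diseased

for noise_sd in (0.0, 2.0, 4.0):
    # Pure measurement noise, independent of the patient's true condition.
    obs_d = true_d + rng.normal(0.0, noise_sd, n)
    obs_nd = true_nd + rng.normal(0.0, noise_sd, n)
    print(f"noise SD = {noise_sd}: AUC ≈ {auc(obs_nd, obs_d):.3f}")   # decreases with noise
```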
Three test outcomes.
[Figure: ROC with outcome labels 'Definitely positive', 'Presumably positive', '+/–', 'Negative', 'Neg.?'.]
Ordered (ordinal) test outcomes. Ordered how? By increasing slope, i.e. LR [concavity!].
Its slope reflects the medical trade-off between % sensitivity and % specificity.
Slope? Constant benefit? … Let’s first look at a continuous test & selection of the cutoff that maximizes benefit.
An ”iso-benefit” line: its slope is chosen so as to imply constant benefit.
A continuous test: the cutoff at measurement x = c maximizes benefit.
How do we find that critical slope? It depends on the pre-test ’disease mix’ – and on the (human) loss associated with wrong or suboptimal treatment – when only two courses of action are available (otherwise there will be more lines, reflecting several trade-offs).
[Figure (shown twice): the ROC with the constant-benefit line of that slope; for a continuous test the cutoff at measurement x = c maximizes benefit. The corners of the diagram correspond to 'Treat no-one' and 'Treat everybody'.]
Without the test, it is (slightly) better to treat everybody than to treat no-one. With the test available, about 60% of the ’misdiagnostic burden’ is eliminated; cf. the purple bar.
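A note on the standard decision-analytic bookkeeping behind this figure: the critical slope equals the pre-test odds against disease times the ratio of the losses attached to false positives and false negatives, and the benefit-maximizing cutoff can be found by scanning the expected loss over candidate cutoffs. Every number below (case mix, losses, distributions) is invented for illustration.

```python
import numpy as np

# Hypothetical clinical problem.
pretest_p = 0.30          # pre-test 'disease mix'
loss_FN = 5.0             # (human) loss from withholding treatment in a diseased patient
loss_FP = 1.0             # loss from treating a non-diseased patient

# Critical slope of the iso-benefit line in the ROC square.
critical_slope = ((1 - pretest_p) / pretest_p) * (loss_FP / loss_FN)
print(f"critical ROC slope = {critical_slope:.2f}")

# Hypothetical continuous marker: find the cutoff x = c that minimizes expected loss
# (equivalently, maximizes benefit); values above the cutoff are called positive.
rng = np.random.default_rng(2)
diseased = rng.normal(14.0, 2.0, 5_000)
healthy = rng.normal(10.0, 2.0, 5_000)

cutoffs = np.linspace(6.0, 18.0, 241)
expected_loss = [
    pretest_p * loss_FN * np.mean(diseased <= c)          # false negatives
    + (1 - pretest_p) * loss_FP * np.mean(healthy > c)    # false positives
    for c in cutoffs
]
best = cutoffs[int(np.argmin(expected_loss))]
print(f"benefit-maximizing cutoff ≈ {best:.1f}")

# 'Treat everybody' and 'treat no-one' correspond to c = -inf and c = +inf:
treat_all = (1 - pretest_p) * loss_FP
treat_none = pretest_p * loss_FN
print(f"loss(treat everybody) = {treat_all:.2f}, loss(treat no-one) = {treat_none:.2f}")
```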
Two binary tests and their 6 most important joint rules of interpretation (positivity criterion BLACK).
[Figure: ROC 'parallelogram' with the points 'A AND B', 'A', 'B', 'A or B', 'Not A', 'A, not B', the corner 'No misdiagnoses', and the trivial rules 'Always!' and 'Never!'.]
Three slides that illustrate the pitfalls of combining two or more tests …
Test interpretation rules in Boolean terms: test result T = A OR B vs. T = A AND B.
[Figure for T = A OR B: the regions A, B (and their overlap) are positive; the region 'not A, not B' is negative.]
… but do not talk about tests 'in parallel' vs. tests 'in series' (next slide).
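A note with a small sketch of how the two Boolean interpretation rules trade sensitivity against specificity, under the strong (and often unrealistic) assumption that Tests A and B are conditionally independent within each disease class; the single-test figures are invented.

```python
# Hypothetical single-test characteristics.
sens_A, spec_A = 0.80, 0.90
sens_B, spec_B = 0.70, 0.95

# T = A OR B: positive if either test is positive
# (assuming conditional independence of A and B within each disease class).
sens_or = 1 - (1 - sens_A) * (1 - sens_B)
spec_or = spec_A * spec_B

# T = A AND B: positive only if both tests are positive.
sens_and = sens_A * sens_B
spec_and = 1 - (1 - spec_A) * (1 - spec_B)

print(f"OR rule : sens = {sens_or:.3f}, spec = {spec_or:.3f}   (more sensitive, less specific)")
print(f"AND rule: sens = {sens_and:.3f}, spec = {spec_and:.3f}   (less sensitive, more specific)")
```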
Suppose Tests A and B both figure in the rule of interpretation adopted.
Next, choose a rule of execution,
sequential or simultaneous
– depending on cost, inconvenience and delays
A first; if needed, B
B first; if needed, A
A & B simultaneously
Beware: careless writers may describe either rule by words like ”parallel” & ”serial” without realizing the ambiguity. Distinguish!
Rule of interpretation vs. rule of execution …
Three existing tests with poorly researched ROC trajectories & their conventional decision points (cut-off values).
The point is that – unbeknownst to us –
the 3 tests are really equally good (the ROCs nearly coincide):
only historical coincidences, FDA, and other irrelevant factors
make performance look very different.
Remedy...
Estimate the entire ROC
- with focus on the local slope
- which becomes the LR estimate belonging to the test outcome concerned
- to be weighed into the pre-test odds: (post-test odds) = (pre-test odds)(LR).
Slope +/– statistical uncertainty
Actual data (108 infected; 815 non-infected subjects)
An ordinary (mathematically formulated) statistical model, and a more closely fitting (over-fitting?) kernel procedure
Lesions in floating locations: red = as the imagist sees it; green = surgical truth.
How do we score diagnostic performance in such situations??? – No neat answer!
Lesions in floating locations: red = as the imagist saw it; green = surgical truth.
1 true positive.
1 dubious lesion confirmed apart from extension (two ½-errors).
1 false positive & 1 false negative, but due to proximity surgeons count: 1 true positive finding.
An ’infinite’ number of true negative locations.
Unfortunately,
– the region-by-region truth may remain unknown (death from focus A leaves a suspect location B unresolved); and
– region-to-region interdependence is common:
– pathogenetic (a focus at A makes a focus at B more likely)
– diagnostic (a focus at A makes location B less visible, or prompts surgery making B directly observable, or sharpens the imagist’s attention to B, C, …)
– prognostic (one verified metastatic focus means incurability)
– therapeutic (a positive finding, even a false positive one, may prompt a drug regimen that cures an overlooked focus [cancelling a false negative blunder])