Introduction to Detection Theory
Reading:
Ch. 3 in Kay-II.
Notes by Prof. Don Johnson on detection theory, see http://www.ece.rice.edu/~dhj/courses/elec531/notes5.pdf.
Ch. 10 in Wasserman.
EE 527, Detection and Estimation Theory, # 5 1
Introduction to Detection Theory (cont.)
We wish to make a decision on a signal of interest using noisy measurements. Statistical tools enable systematic solutions and optimal design.
Application areas include:
Communications,
Radar and sonar,
Nondestructive evaluation (NDE) of materials,
Biomedicine, etc.
Example: Radar Detection. We wish to decide on the presence or absence of a target.
Introduction to Detection Theory
We assume a parametric measurement model p(x | θ) [or p(x ; θ), which is the notation that we sometimes use in the classical setting].

In point estimation theory, we estimated the parameter θ given the data x.

Suppose now that we choose Θ0 and Θ1 that form a partition of the parameter space Θ:

Θ0 ∪ Θ1 = Θ,  Θ0 ∩ Θ1 = ∅.

In detection theory, we wish to identify which hypothesis is true (i.e. make the appropriate decision):

H0 : θ ∈ Θ0, null hypothesis
H1 : θ ∈ Θ1, alternative hypothesis.

Terminology: If θ can take only two values,

Θ = {θ0, θ1},  Θ0 = {θ0},  Θ1 = {θ1}

we say that the hypotheses are simple. Otherwise, we say that they are composite.

Composite Hypothesis Example: H0 : θ = 0 versus H1 : θ ∈ (0, ∞).
The Decision Rule
We wish to design a decision rule (function) φ(x) : X → {0, 1}:

φ(x) = 1, decide H1,
       0, decide H0

which partitions the data space X [i.e. the support of p(x | θ)] into two regions:

Rule φ(x):  X0 = {x : φ(x) = 0},  X1 = {x : φ(x) = 1}.
Let us define the probabilities of false alarm and miss:

PFA = E_{x|θ}[φ(X) | θ] = ∫_{X1} p(x | θ) dx,  for θ ∈ Θ0

PM = E_{x|θ}[1 − φ(X) | θ] = 1 − ∫_{X1} p(x | θ) dx = ∫_{X0} p(x | θ) dx,  for θ ∈ Θ1.

Then, the probability of detection (correctly deciding H1) is

PD = 1 − PM = E_{x|θ}[φ(X) | θ] = ∫_{X1} p(x | θ) dx,  for θ ∈ Θ1.
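These definitions are easy to check numerically. A minimal sketch, assuming a scalar measurement x ~ N(θ, 1), the decision region X1 = {x : x > γ}, and illustrative values θ0 = 0, θ1 = 2, γ = 1 (none of these numbers come from the notes):

```python
from statistics import NormalDist

# Q(z) = right-tail probability of a standard normal
Q = lambda z: 1.0 - NormalDist().cdf(z)

theta0, theta1, gamma = 0.0, 2.0, 1.0   # illustrative values

P_FA = Q(gamma - theta0)   # P[X in X1 | theta = theta0], false alarm
P_D  = Q(gamma - theta1)   # P[X in X1 | theta = theta1], detection
P_M  = 1.0 - P_D           # P[X in X0 | theta = theta1], miss
```

Here P_FA ≈ 0.159 and P_D ≈ 0.841: shifting the mean from θ0 to θ1 moves probability mass past the threshold, so the same region X1 has a much larger probability under H1.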
Note: PFA and PD/PM are generally functions of the parameter θ (where θ ∈ Θ0 when computing PFA and θ ∈ Θ1 when computing PD/PM).
More Terminology. Statisticians use the following terminology:

False alarm ≡ Type I error
Miss ≡ Type II error
Probability of detection ≡ Power
Probability of false alarm ≡ Significance level.
Bayesian Decision-theoretic Detection Theory
Recall (a slightly generalized version of) the posterior expected loss:

ρ(action | x) = ∫ L(θ, action) p(θ | x) dθ

that we introduced in handout # 4 when we discussed Bayesian decision theory. Let us now apply this theory to our easy example discussed here: hypothesis testing, where our action space consists of only two choices. We first assign a loss table:

decision rule     true state θ ∈ Θ1    true state θ ∈ Θ0
x ∈ X1            L(1 | 1) = 0         L(1 | 0)
x ∈ X0            L(0 | 1)             L(0 | 0) = 0

with the loss function described by the quantities L(declared | true):

L(1 | 0) quantifies the loss due to a false alarm,

L(0 | 1) quantifies the loss due to a miss,

L(1 | 1) and L(0 | 0) (losses due to correct decisions) are typically set to zero in real life. Here, we adopt zero losses for correct decisions.
Now, our posterior expected loss takes two values:

ρ0(x) = ∫_{Θ1} L(0 | 1) p(θ | x) dθ + ∫_{Θ0} L(0 | 0) p(θ | x) dθ   [second term is zero]
      = ∫_{Θ1} L(0 | 1) p(θ | x) dθ
      [L(0 | 1) is constant]
      = L(0 | 1) ∫_{Θ1} p(θ | x) dθ = L(0 | 1) P[θ ∈ Θ1 | x]

and, similarly,

ρ1(x) = ∫_{Θ0} L(1 | 0) p(θ | x) dθ
      [L(1 | 0) is constant]
      = L(1 | 0) ∫_{Θ0} p(θ | x) dθ = L(1 | 0) P[θ ∈ Θ0 | x].

We define the Bayes decision rule as the rule that minimizes the posterior expected loss; this rule corresponds to choosing the data-space partitioning as follows:

X1 = {x : ρ1(x) ≤ ρ0(x)}
or

X1 = {x : P[θ ∈ Θ1 | x] / P[θ ∈ Θ0 | x] = ∫_{Θ1} p(θ | x) dθ / ∫_{Θ0} p(θ | x) dθ ≥ L(1 | 0)/L(0 | 1)}   (1)

or, equivalently, upon applying the Bayes rule:

X1 = {x : ∫_{Θ1} p(x | θ) π(θ) dθ / ∫_{Θ0} p(x | θ) π(θ) dθ ≥ L(1 | 0)/L(0 | 1)}.   (2)
0-1 loss: For L(1 | 0) = L(0 | 1) = 1, we have the loss table

decision rule     true state θ ∈ Θ1    true state θ ∈ Θ0
x ∈ X1            L(1 | 1) = 0         L(1 | 0) = 1
x ∈ X0            L(0 | 1) = 1         L(0 | 0) = 0

yielding the following Bayes decision rule, called the maximum a posteriori (MAP) rule:

X1 = {x : P[θ ∈ Θ1 | x] / P[θ ∈ Θ0 | x] ≥ 1}   (3)

or, equivalently, upon applying the Bayes rule:

X1 = {x : ∫_{Θ1} p(x | θ) π(θ) dθ / ∫_{Θ0} p(x | θ) π(θ) dθ ≥ 1}.   (4)
Simple hypotheses. Let us specialize (1) to the case of simple hypotheses (Θ0 = {θ0}, Θ1 = {θ1}):

X1 = {x : p(θ1 | x)/p(θ0 | x) [posterior-odds ratio] ≥ L(1 | 0)/L(0 | 1)}.   (5)

We can rewrite (5) using the Bayes rule:

X1 = {x : p(x | θ1)/p(x | θ0) [likelihood ratio] ≥ π0 L(1 | 0) / (π1 L(0 | 1))}   (6)

where

π0 = π(θ0),  π1 = π(θ1) = 1 − π0

describe the prior probability mass function (pmf) of the binary random variable θ (recall that θ ∈ {θ0, θ1}). Hence, for binary simple hypotheses, the prior pmf of θ is the Bernoulli pmf.
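As a concrete sketch of the Bayes test (6), assume p(x | θ0) = N(0, 1), p(x | θ1) = N(1, 1), prior π0 = 0.7, and losses L(1 | 0) = 2, L(0 | 1) = 1 (all illustrative choices, not from the notes):

```python
from math import exp, pi, sqrt

def gauss_pdf(x, mean):
    # unit-variance Gaussian density
    return exp(-0.5 * (x - mean) ** 2) / sqrt(2.0 * pi)

def bayes_decide(x, pi0=0.7, L10=2.0, L01=1.0):
    # likelihood ratio p(x | theta1) / p(x | theta0)
    ratio = gauss_pdf(x, 1.0) / gauss_pdf(x, 0.0)
    threshold = pi0 * L10 / ((1.0 - pi0) * L01)   # pi0 L(1|0) / (pi1 L(0|1))
    return 1 if ratio >= threshold else 0          # 1 = decide H1
```

For these numbers the rule reduces to deciding H1 when x ≥ 1/2 + log(14/3) ≈ 2.04, so bayes_decide(3.0) returns 1 while bayes_decide(1.0) returns 0.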
Preposterior (Bayes) Risk
The preposterior (Bayes) risk for rule φ(x) is

E_{x,θ}[loss] = ∫_{X1} ∫_{Θ0} L(1 | 0) p(x | θ) π(θ) dθ dx + ∫_{X0} ∫_{Θ1} L(0 | 1) p(x | θ) π(θ) dθ dx.
How do we choose the rule φ(x) that minimizes the preposterior risk? Write

∫_{X1} ∫_{Θ0} L(1 | 0) p(x | θ) π(θ) dθ dx + ∫_{X0} ∫_{Θ1} L(0 | 1) p(x | θ) π(θ) dθ dx

= ∫_{X1} ∫_{Θ0} L(1 | 0) p(x | θ) π(θ) dθ dx − ∫_{X1} ∫_{Θ1} L(0 | 1) p(x | θ) π(θ) dθ dx
  + ∫_{X0} ∫_{Θ1} L(0 | 1) p(x | θ) π(θ) dθ dx + ∫_{X1} ∫_{Θ1} L(0 | 1) p(x | θ) π(θ) dθ dx

where the last two terms add up to ∫_X ∫_{Θ1} L(0 | 1) p(x | θ) π(θ) dθ dx = const, not dependent on φ(x); hence the risk equals

const + ∫_{X1} { L(1 | 0) ∫_{Θ0} p(x | θ) π(θ) dθ − L(0 | 1) ∫_{Θ1} p(x | θ) π(θ) dθ } dx

implying that X1 should be chosen as

X1 = {x : L(1 | 0) ∫_{Θ0} p(x | θ) π(θ) dθ − L(0 | 1) ∫_{Θ1} p(x | θ) π(θ) dθ < 0}
which, as expected, is the same as (2), since

minimizing the posterior expected loss ⟺ minimizing the preposterior risk for every x

as shown earlier in handout # 4.
0-1 loss: For the 0-1 loss, i.e. L(1 | 0) = L(0 | 1) = 1, the preposterior (Bayes) risk for rule φ(x) is

E_{x,θ}[loss] = ∫_{X1} ∫_{Θ0} p(x | θ) π(θ) dθ dx + ∫_{X0} ∫_{Θ1} p(x | θ) π(θ) dθ dx   (7)

which is simply the average error probability, with averaging performed over the joint probability density or mass function (pdf/pmf) of the data x and parameters θ.
Bayesian Decision-theoretic Detection for Simple Hypotheses

The Bayes decision rule for simple hypotheses is (6):

Λ(x) [likelihood ratio] = p(x | θ1)/p(x | θ0)  ≷  γ = π0 L(1 | 0) / (π1 L(0 | 1))   (8)

(decide H1 if Λ(x) ≥ γ, otherwise decide H0); see also Ch. 3.7 in Kay-II. [Recall that Λ(x) is the sufficient statistic for the detection problem, see p. 37 in handout # 1.] Equivalently,

log Λ(x) = log[p(x | θ1)] − log[p(x | θ0)]  ≷  log γ.
Minimum Average Error Probability Detection: In the familiar 0-1 loss case where L(1 | 0) = L(0 | 1) = 1, we know that the preposterior (Bayes) risk is equal to the average error probability, see (7). This average error probability greatly simplifies in the simple-hypothesis testing case:

av. error probability = ∫_{X1} p(x | θ0) π0 dx + ∫_{X0} p(x | θ1) π1 dx
                      = π0 ∫_{X1} p(x | θ0) dx [= PFA] + π1 ∫_{X0} p(x | θ1) dx [= PM]

where, as before, the averaging is performed over the joint pdf/pmf of the data x and parameters θ, and

π0 = π(θ0),  π1 = π(θ1) = 1 − π0  (the Bernoulli pmf).
In this case, our Bayes decision rule simplifies to the MAP rule (as expected, see (5) and Ch. 3.6 in Kay-II):

p(θ1 | x)/p(θ0 | x) [posterior-odds ratio]  ≷  1   (9)

or, equivalently, upon applying the Bayes rule:

p(x | θ1)/p(x | θ0) [likelihood ratio]  ≷  π0/π1   (10)

which is the same as

(4), upon substituting the Bernoulli pmf as the prior pmf for θ, and

(8), upon substituting L(1 | 0) = L(0 | 1) = 1.
Bayesian Decision-theoretic Detection Theory: Handling Nuisance Parameters

We apply the same approach as before: integrate the nuisance parameters (ψ, say) out!

Therefore, (1) still holds for testing

H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1

but p_{θ|x}(θ | x) is computed as follows:

p_{θ|x}(θ | x) = ∫ p_{θ,ψ|x}(θ, ψ | x) dψ

and, therefore,

∫_{Θ1} [∫ p_{θ,ψ|x}(θ, ψ | x) dψ] dθ / ∫_{Θ0} [∫ p_{θ,ψ|x}(θ, ψ | x) dψ] dθ  ≷  L(1 | 0)/L(0 | 1)   (11)
or, equivalently, upon applying the Bayes rule:

∫_{Θ1} ∫ p_{x|θ,ψ}(x | θ, ψ) π_{θ,ψ}(θ, ψ) dψ dθ / ∫_{Θ0} ∫ p_{x|θ,ψ}(x | θ, ψ) π_{θ,ψ}(θ, ψ) dψ dθ  ≷  L(1 | 0)/L(0 | 1).   (12)

Now, if θ and ψ are independent a priori, i.e.

π_{θ,ψ}(θ, ψ) = π_θ(θ) π_ψ(ψ)   (13)

then (12) can be rewritten as

∫_{Θ1} π(θ) [∫ p_{x|θ,ψ}(x | θ, ψ) π(ψ) dψ] dθ / ∫_{Θ0} π(θ) [∫ p_{x|θ,ψ}(x | θ, ψ) π(ψ) dψ] dθ  ≷  L(1 | 0)/L(0 | 1)   (14)

where the inner integral ∫ p_{x|θ,ψ}(x | θ, ψ) π(ψ) dψ = p(x | θ).
Simple hypotheses and independent priors for θ and ψ: Let us specialize (11) to the simple hypotheses (Θ0 = {θ0}, Θ1 = {θ1}):

∫ p_{θ,ψ|x}(θ1, ψ | x) dψ / ∫ p_{θ,ψ|x}(θ0, ψ | x) dψ  ≷  L(1 | 0)/L(0 | 1).   (15)
Now, if θ and ψ are independent a priori, i.e. (13) holds, then we can rewrite (14) [or (15) using the Bayes rule]:

∫ p_{x|θ,ψ}(x | θ1, ψ) π(ψ) dψ / ∫ p_{x|θ,ψ}(x | θ0, ψ) π(ψ) dψ  [integrated likelihood ratio]
= p(x | θ1)/p(x | θ0)  [same as (6)]  ≷  π0 L(1 | 0) / (π1 L(0 | 1))   (16)

where

π0 = π(θ0),  π1 = π(θ1) = 1 − π0.
Chernoff Bound on Average Error Probability for Simple Hypotheses

Recall that minimizing the average error probability

av. error probability = ∫_{X1} ∫_{Θ0} p(x | θ) π(θ) dθ dx + ∫_{X0} ∫_{Θ1} p(x | θ) π(θ) dθ dx

leads to the MAP decision rule:

X1* = {x : ∫_{Θ0} p(x | θ) π(θ) dθ − ∫_{Θ1} p(x | θ) π(θ) dθ < 0}.
In many applications, we may not be able to obtain a simple closed-form expression for the minimum average error probability, but we can bound it as follows:

min av. error probability = ∫_{X1*} ∫_{Θ0} p(x | θ) π(θ) dθ dx + ∫_{X0*} ∫_{Θ1} p(x | θ) π(θ) dθ dx

[using the definition of X1*]

= ∫_X min{ q0(x), q1(x) } dx   [X = data space]

≤ ∫_X [q0(x)]^λ [q1(x)]^{1−λ} dx

where q0(x) ≜ ∫_{Θ0} p(x | θ) π(θ) dθ and q1(x) ≜ ∫_{Θ1} p(x | θ) π(θ) dθ, which is the Chernoff bound on the minimum average error probability. Here, we have used the fact that

min{a, b} ≤ a^λ b^{1−λ},  for 0 ≤ λ ≤ 1, a, b ≥ 0.
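The bound is easy to verify numerically. A sketch with the "integrated" densities taken as two unit-variance Gaussians q0 = N(0, 1), q1 = N(A, 1), A = 2, and λ = 1/2 (illustrative choices; for Gaussians the single-letter integral has the known closed form exp(−λ(1−λ)A²/2)):

```python
from math import exp, pi, sqrt

def gauss(x, mean):
    # unit-variance Gaussian density
    return exp(-0.5 * (x - mean) ** 2) / sqrt(2.0 * pi)

A, lam, dx = 2.0, 0.5, 0.001
xs = [-10.0 + k * dx for k in range(int(20.0 / dx))]

# right-hand side of the Chernoff bound (Riemann sum over [-10, 10])
chernoff = sum(gauss(x, 0.0) ** lam * gauss(x, A) ** (1 - lam) for x in xs) * dx
closed_form = exp(-lam * (1 - lam) * A ** 2 / 2.0)

# left-hand side: integral of the pointwise minimum (the min error probability)
min_integral = sum(min(gauss(x, 0.0), gauss(x, A)) for x in xs) * dx
```

Since min{a, b} ≤ a^λ b^{1−λ} holds at every x, min_integral ≤ chernoff; numerically chernoff matches the closed form exp(−0.5) ≈ 0.607 while min_integral ≈ 0.317.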
When

x = [x1, x2, ..., xN]^T
with

x1, x2, ..., xN conditionally independent, identically distributed (i.i.d.) given θ and

simple hypotheses (i.e. Θ0 = {θ0}, Θ1 = {θ1}),

then

q0(x) = p(x | θ0) π0 = π0 ∏_{n=1}^N p(xn | θ0)

q1(x) = p(x | θ1) π1 = π1 ∏_{n=1}^N p(xn | θ1)
yielding

Chernoff bound for N conditionally i.i.d. measurements (given θ) and simple hypotheses

= ∫ [π0 ∏_{n=1}^N p(xn | θ0)]^λ [π1 ∏_{n=1}^N p(xn | θ1)]^{1−λ} dx

= π0^λ π1^{1−λ} ∏_{n=1}^N { ∫ [p(xn | θ0)]^λ [p(xn | θ1)]^{1−λ} dxn }

= π0^λ π1^{1−λ} { ∫ [p(x1 | θ0)]^λ [p(x1 | θ1)]^{1−λ} dx1 }^N

or, in other words,

(1/N) log(min av. error probability) ≤ (1/N) log(π0^λ π1^{1−λ}) + log ∫ [p(x1 | θ0)]^λ [p(x1 | θ1)]^{1−λ} dx1,  ∀λ ∈ [0, 1].
If π0 = π1 = 1/2 (which is almost always the case of interest when evaluating average error probabilities), we can say that, as N → ∞,

min av. error probability for N cond. i.i.d. measurements (given θ) and simple hypotheses

≈ f(N) exp( −N { −min_{λ∈[0,1]} log ∫ [p(x1 | θ0)]^λ [p(x1 | θ1)]^{1−λ} dx1 } )

where the braced quantity is the Chernoff information for a single observation and f(N) is a slowly-varying function compared with the exponential term:

lim_{N→∞} log f(N) / N = 0.

Note that the Chernoff information in the exponent term of the above expression quantifies the asymptotic behavior of the minimum average error probability.
We now give a useful result, taken from
K. Fukunaga, Introduction to Statistical Pattern Recognition,2nd ed., San Diego, CA: Academic Press, 1990
for evaluating a class of Chernoff bounds.
Lemma 1. Consider p1(x) = N(μ1, Σ1) and p2(x) = N(μ2, Σ2). Then

∫ [p1(x)]^λ [p2(x)]^{1−λ} dx = exp[−g(λ)]

where

g(λ) = (λ(1 − λ)/2) (μ2 − μ1)^T [λΣ1 + (1 − λ)Σ2]^{−1} (μ2 − μ1) + (1/2) log[ |λΣ1 + (1 − λ)Σ2| / (|Σ1|^λ |Σ2|^{1−λ}) ].
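In the scalar case (Σi reduced to variances σi²), Lemma 1 can be coded directly; the parameter values below are illustrative:

```python
from math import exp, log

def g(lam, mu1, var1, mu2, var2):
    # scalar (1-D) specialization of g(lambda) from Lemma 1
    s = lam * var1 + (1.0 - lam) * var2        # lam*Sigma1 + (1-lam)*Sigma2
    quad = 0.5 * lam * (1.0 - lam) * (mu2 - mu1) ** 2 / s
    return quad + 0.5 * log(s / (var1 ** lam * var2 ** (1.0 - lam)))

# equal unit variances, means 0 and A = 2, lam = 1/2:
bound = exp(-g(0.5, 0.0, 1.0, 2.0, 1.0))       # equals exp(-A^2/8) here
```

With equal variances the log term vanishes and exp(−g(1/2)) = exp(−A²/(8σ²)), the Chernoff-information factor that reappears in the DC-level example later in the notes.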
Probabilities of False Alarm (PFA) and Detection (PD) for Simple Hypotheses
[Figure: probability density functions of the test statistic under H0 and H1; the threshold γ splits the axis so that PFA is the right tail of the H0 pdf and PD is the right tail of the H1 pdf.]
PFA = P[Λ(x) [test statistic] > γ ⟺ x ∈ X1 | θ = θ0]

PD = P[test statistic > γ ⟺ x ∉ X0 | θ = θ1].
Comments:

(i) As the region X1 shrinks (i.e. γ → ∞), both of the above probabilities shrink towards zero.

(ii) As the region X1 grows (i.e. γ → 0), both of these probabilities grow towards unity.

(iii) Observations (i) and (ii) do not imply equality between PFA and PD; in most cases, as X1 grows, PD grows more rapidly than PFA (i.e. we had better be right more often than we are wrong).

(iv) However, the perfect case where our rule is always right and never wrong (PD = 1 and PFA = 0) cannot occur when the conditional pdfs/pmfs p(x | θ0) and p(x | θ1) overlap.

(v) Thus, to increase the detection probability PD, we must also allow for the false-alarm probability PFA to increase. This behavior

represents the fundamental tradeoff in hypothesis testing and detection theory and

motivates us to introduce a (classical) approach to testing simple hypotheses, pioneered by Neyman and Pearson (to be discussed next).
Neyman-Pearson Test for Simple Hypotheses

Setup:

Parametric data models p(x ; θ0), p(x ; θ1),

Simple hypothesis testing:

H0 : θ = θ0 versus H1 : θ = θ1.

No prior pdf/pmf on θ is available.

Goal: Design a test that maximizes the probability of detection

PD = P[X ∈ X1 ; θ = θ1]

(equivalently, minimizes the miss probability PM) under the constraint

PFA = P[X ∈ X1 ; θ = θ0] = α.

Here, we consider simple hypotheses; the classical version of testing composite hypotheses is much more complicated. The Bayesian version of testing composite hypotheses is trivial (as is everything else Bayesian, at least conceptually) and we have already seen it.
Solution. We apply the Lagrange-multiplier approach: maximize

L = PD + λ(PFA − α)
  = ∫_{X1} p(x ; θ1) dx + λ [∫_{X1} p(x ; θ0) dx − α]
  = ∫_{X1} [p(x ; θ1) + λ p(x ; θ0)] dx − λα.

To maximize L, set

X1 = {x : p(x ; θ1) + λ p(x ; θ0) > 0} = {x : p(x ; θ1)/p(x ; θ0) > −λ ≜ γ}.

Again, we find the likelihood ratio:

Λ(x) = p(x ; θ1)/p(x ; θ0).

Recall our constraint:

∫_{X1} p(x ; θ0) dx = PFA = α.
If we increase γ, PFA and PD go down. Similarly, if we decrease γ, PFA and PD go up. Hence, to maximize PD, choose γ so that PFA is as big as possible under the constraint.
Two useful ways for determining the threshold γ that achieves a specified false-alarm rate:

Find γ that satisfies

∫_{x : Λ(x) > γ} p(x ; θ0) dx = PFA = α

or, expressing γ in terms of the pdf/pmf of Λ(x) under H0:

∫_γ^∞ p_{Λ ; θ0}(l ; θ0) dl = α

or, perhaps, in terms of a monotonic function of Λ(x), say T(x) = monotonic function(Λ(x)).
Warning: We have been implicitly assuming that PFA is a continuous function of γ. Some (not insightful) technical adjustments are needed if this is not the case.

A way of handling nuisance parameters: We can utilize the integrated (marginal) likelihood ratio (16) under the Neyman-Pearson setup as well.
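A minimal sketch of setting the threshold: if the test statistic is N(0, 1) under H0 (an illustrative assumption), the γ achieving PFA = α is simply the inverse Q function:

```python
from statistics import NormalDist

nd = NormalDist()                      # standard normal, N(0, 1)
alpha = 0.05                           # specified false-alarm probability

gamma = nd.inv_cdf(1.0 - alpha)        # gamma = Q^{-1}(alpha)
achieved_P_FA = 1.0 - nd.cdf(gamma)    # P[T > gamma ; H0], equals alpha
```

For statistics without a closed-form tail inverse, the same idea works by bisection on γ, since PFA is monotonically decreasing in γ.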
Chernoff-Stein Lemma for Bounding the Miss Probability in Neyman-Pearson Tests of Simple Hypotheses

Recall the definition of the Kullback-Leibler (K-L) distance D(p ‖ q) from one pmf (p) to another (q):

D(p ‖ q) = Σ_k p_k log(p_k / q_k).
The complete proof of this lemma for the discrete (pmf) case is given in
Additional Reading: T.M. Cover and J.A. Thomas, Elementsof Information Theory. Second ed., New York: Wiley, 2006.
Setup for the Chernoff-Stein Lemma

Assume that x1, x2, ..., xN are conditionally i.i.d. given θ.

We adopt the Neyman-Pearson framework, i.e. obtain a decision threshold to achieve a fixed PFA. Let us study the asymptotic PM = 1 − PD as the number of observations N gets large.

To keep PFA constant as N increases, we need to make our decision threshold (γ, say) vary with N, i.e.

γ = γ_N(PFA).

Now, the miss probability is

PM = PM(γ) = PM(γ_N(PFA)).
Chernoff-Stein Lemma

The Chernoff-Stein lemma says:

lim_{PFA→0} lim_{N→∞} (1/N) log PM = −D( p(Xn | θ0) ‖ p(Xn | θ1) )

[the K-L distance for a single observation] where the K-L distance between p(xn | θ0) and p(xn | θ1)

D( p(Xn | θ0) ‖ p(Xn | θ1) ) = E_{p(xn | θ0)}[ log( p(Xn | θ0)/p(Xn | θ1) ) ]

[discrete (pmf) case] = Σ_{xn} p(xn | θ0) log[ p(xn | θ0)/p(xn | θ1) ]

does not depend on the observation index n, since the xn are conditionally i.i.d. given θ.

Equivalently, we can state that

PM ≈ f(N) exp[ −N D( p(Xn | θ0) ‖ p(Xn | θ1) ) ]

as PFA → 0 and N → ∞, where f(N) is a slowly-varying function compared with the exponential term (when PFA → 0 and N → ∞).
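For discrete pmfs the K-L distance above is a one-liner; the two Bernoulli pmfs below are illustrative:

```python
from math import log

def kl(p, q):
    # D(p || q) = sum_k p_k log(p_k / q_k)  (natural log, in nats)
    return sum(pk * log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p0 = [0.9, 0.1]    # p(x_n | theta0)
p1 = [0.5, 0.5]    # p(x_n | theta1)

D = kl(p0, p1)     # per-observation exponential decay rate of P_M
```

Here D ≈ 0.368 nats, so the lemma predicts PM shrinking roughly like exp(−0.368 N) for this pair of models.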
Detection for Simple Hypotheses: Example

Known positive DC level in additive white Gaussian noise (AWGN), Example 3.2 in Kay-II.

Consider

H0 : x[n] = w[n],      n = 1, 2, ..., N  versus
H1 : x[n] = A + w[n],  n = 1, 2, ..., N

where

A > 0 is a known constant,

w[n] is zero-mean white Gaussian noise with known variance σ², i.e. w[n] ~ N(0, σ²).
The above hypothesis-testing formulation is EE-like: noise versus signal plus noise. It is similar to the on-off keying scheme in communications, which gives us an idea to rephrase it so that it fits our formulation on p. 4 (for which we have developed all the theory so far). Here is such an alternative formulation: consider a family of pdfs

p(x ; a) = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) Σ_{n=1}^N (x[n] − a)² ]   (17)

and the following (equivalent) hypotheses:

H0 : a = 0 (off) versus H1 : a = A (on).
Then, the likelihood ratio is

Λ(x) = p(x ; a = A)/p(x ; a = 0) = { (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) Σ_{n=1}^N (x[n] − A)² ] } / { (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) Σ_{n=1}^N x[n]² ] }.
Now, take the logarithm and, after simple manipulations, reduce our likelihood-ratio test to comparing

T(x) = x̄ ≜ (1/N) Σ_{n=1}^N x[n]

with a threshold γ. [Here, T(x) is a monotonic function of Λ(x).] If T(x) > γ, accept H1 (i.e. reject H0); otherwise accept H0 (well, not exactly; we will talk more about this decision on p. 59).

The choice of γ depends on the approach that we take. For the Bayes decision rule, γ is a function of π0 and π1. For the Neyman-Pearson test, γ is chosen to achieve (control) a desired PFA.
Bayesian decision-theoretic detection for 0-1 loss (corresponding to minimizing the average error probability):

log Λ(x) = −(1/(2σ²)) Σ_{n=1}^N (x[n] − A)² + (1/(2σ²)) Σ_{n=1}^N x[n]²  ≷  log(π0/π1)

(1/(2σ²)) Σ_{n=1}^N (x[n] − x[n] + A)(x[n] + x[n] − A)  ≷  log(π0/π1)   [difference of squares; x[n] − x[n] = 0]

2A ( Σ_{n=1}^N x[n] ) − A² N  ≷  2σ² log(π0/π1)

( Σ_{n=1}^N x[n] ) − AN/2  ≷  (σ²/A) log(π0/π1)   [since A > 0]

and, finally,

x̄  ≷  γ = (σ²/(NA)) log(π0/π1) + A/2

which, for the practically most interesting case of equiprobable hypotheses

π0 = π1 = 1/2   (18)

simplifies to

x̄  ≷  A/2

known as the maximum-likelihood test (i.e. the Bayes decision rule for 0-1 loss and a priori equiprobable hypotheses is defined as the maximum-likelihood test). This maximum-likelihood detector does not require knowledge of the noise variance σ² to declare its decision. However, knowledge of σ² is key to assessing the detection performance. Interestingly, these observations will carry over to a few maximum-likelihood tests that we will derive in the future.
Assuming (18), we now derive the minimum average error probability. First, note that X̄ | a = 0 ~ N(0, σ²/N) and X̄ | a = A ~ N(A, σ²/N). Then

min av. error prob. = (1/2) P[X̄ > A/2 | a = 0]   [= PFA]
                    + (1/2) P[X̄ ≤ A/2 | a = A]   [= PM]
 = (1/2) P[ X̄/(σ/√N) > (A/2)/(σ/√N) ; a = 0 ]
 + (1/2) P[ (X̄ − A)/(σ/√N) ≤ (A/2 − A)/(σ/√N) ; a = A ]   [(X̄ − A)/(σ/√N) is a standard normal random variable]
 = Q( √(N A²/(4σ²)) ).

Neyman-Pearson test: If T(x) > γ, decide H1 (i.e. reject H0); otherwise decide H0 (see also the discussion on p. 59).
Performance evaluation: Assuming (17), we have

T(x) | a ~ N(a, σ²/N).

Therefore, T(X) | a = 0 ~ N(0, σ²/N), implying

PFA = P[T(X) > γ ; a = 0] = Q( γ / √(σ²/N) )

and we obtain the decision threshold as follows:

γ = √(σ²/N) Q⁻¹(PFA).
Now, T(X) | a = A ~ N(A, σ²/N), implying

PD = P[T(X) > γ | a = A] = Q( (γ − A)/√(σ²/N) )
   = Q( Q⁻¹(PFA) − A/√(σ²/N) )
   = Q( Q⁻¹(PFA) − √(N A²/σ²) ),   N A²/σ² ≜ SNR = d².

Given the false-alarm probability PFA, the detection probability PD depends only on the deflection coefficient:

d² = N A²/σ² = { E[T(X) | a = A] − E[T(X) | a = 0] }² / var[T(X) | a = 0]

which is also (a reasonable definition for) the signal-to-noise ratio (SNR).
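A quick numeric sketch of PD = Q(Q⁻¹(PFA) − d); the values N = 25, A = 0.5, σ² = 4 are illustrative:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()
Q = lambda z: 1.0 - nd.cdf(z)          # Gaussian right tail
Qinv = lambda p: nd.inv_cdf(1.0 - p)   # inverse Q function

N, A, sigma2 = 25, 0.5, 4.0
d = sqrt(N * A ** 2 / sigma2)          # deflection coefficient, d^2 = SNR

P_FA = 1e-2
P_D = Q(Qinv(P_FA) - d)                # detection probability at this P_FA
```

Here d = 1.25 and P_D ≈ 0.14; increasing N (hence d) raises P_D at the same P_FA, illustrating that the performance depends on the data only through d².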
Receiver Operating Characteristics (ROC)

PD = Q( Q⁻¹(PFA) − d ).

Comments:

As we raise the threshold γ, PFA goes down but so does PD.

The ROC should be above the 45° line; otherwise we can do better by flipping a coin.

Performance improves with d².
Typical Ways of Depicting the Detection Performance Under the Neyman-Pearson Setup

To analyze the performance of a Neyman-Pearson detector, we examine two relationships:

Between PD and PFA, for a given SNR, called the receiver operating characteristics (ROC).

Between PD and SNR, for a given PFA.
Here are examples of the two:
see Figs. 3.8 and 3.5 in Kay-II, respectively.
Asymptotic (as N → ∞ and PFA → 0) PD and PM for a Known DC Level in AWGN

We apply the Chernoff-Stein lemma, for which we need to compute the following K-L distance:

D( p(Xn | a = 0) ‖ p(Xn | a = A) ) = E_{p(xn | a = 0)}{ log[ p(Xn | a = 0)/p(Xn | a = A) ] }

where

p(xn | a = 0) = N(0, σ²)

log[ p(xn | a = 0)/p(xn | a = A) ] = −xn²/(2σ²) + (xn − A)²/(2σ²) = (1/(2σ²)) (A² − 2A xn).

Therefore, taking the expectation (E[xn | a = 0] = 0),

D( p(Xn | a = 0) ‖ p(Xn | a = A) ) = A²/(2σ²)

and the Chernoff-Stein lemma predicts the following behavior of the detection probability as N → ∞ and PFA → 0:

PD ≈ 1 − f(N) exp( −N A²/(2σ²) )   [i.e. PM ≈ f(N) exp( −N A²/(2σ²) )]
where f(N) is a slowly-varying function of N compared with the exponential term. In this case, the exact expression for PM (PD) is available and consistent with the Chernoff-Stein lemma:

PM = 1 − Q( Q⁻¹(PFA) − √(N A²/σ²) ) = Q( √(N A²/σ²) − Q⁻¹(PFA) )

≈ (as N → ∞) { 1 / ( [√(N A²/σ²) − Q⁻¹(PFA)] √(2π) ) } exp{ −[√(N A²/σ²) − Q⁻¹(PFA)]²/2 }

= { 1 / ( [√(N A²/σ²) − Q⁻¹(PFA)] √(2π) ) } exp{ −N A²/(2σ²) − [Q⁻¹(PFA)]²/2 + Q⁻¹(PFA) √(N A²/σ²) }

= { 1 / ( [√(N A²/σ²) − Q⁻¹(PFA)] √(2π) ) } exp{ −[Q⁻¹(PFA)]²/2 + Q⁻¹(PFA) √(N A²/σ²) } · exp[ −N A²/(2σ²) ]   (21)

where the last factor is as predicted by the Chernoff-Stein lemma and we have used the asymptotic formula (20). When PFA is small and N is large, the first two terms in the above expression make a slowly-varying function f(N) of N. Note that the exponential term in (21) does not depend on the false-alarm probability PFA (or, equivalently, on the choice of the decision threshold). The Chernoff-Stein lemma asserts that this is not a coincidence.
Comment. For detecting a known DC level in AWGN:

The exponential decay rate of PM is A²/(2σ²), which is different from (larger than, in this case) the exponential decay rate of the minimum average error probability: A²/(8σ²).
Decentralized Neyman-Pearson Detection for Simple Hypotheses

Consider a sensor-network scenario. Assumption: Observations made at N spatially distributed sensors (nodes) follow the same marginal probabilistic model:

Hi : xn ~ p(xn | θi)

where n = 1, 2, ..., N and i ∈ {0, 1} for binary hypotheses. Each node n makes a hard local decision dn based on its local observation xn and sends it to the headquarters (fusion center), which collects all the local decisions and makes the final global decision about H0 or H1. This structure is clearly suboptimal: it is easy to construct a better decision strategy in which each node sends its (quantized, in practice) likelihood ratio to the fusion center, rather than the decision only. However, such a strategy would have a higher communication (energy) cost.
We now go back to the decentralized detection problem. Suppose that each node n makes a local decision dn ∈ {0, 1}, n = 1, 2, ..., N, and transmits it to the fusion center. Then, the fusion center makes the global decision based on the likelihood ratio formed from the dn's. The simplest fusion scheme is based on the assumption that the dn's are conditionally independent given θ (which may not always be reasonable, but leads to an easy solution). We can now write

p(dn | θ1) = PD,n^{dn} (1 − PD,n)^{1−dn}   (Bernoulli pmf)

where PD,n is the nth sensor's local detection probability. Similarly,

p(dn | θ0) = PFA,n^{dn} (1 − PFA,n)^{1−dn}   (Bernoulli pmf)

where PFA,n is the nth sensor's local false-alarm probability. Now,

log Λ(d) = Σ_{n=1}^N log[ p(dn | θ1)/p(dn | θ0) ] = Σ_{n=1}^N log[ PD,n^{dn} (1 − PD,n)^{1−dn} / ( PFA,n^{dn} (1 − PFA,n)^{1−dn} ) ]  ≷  log γ.
To further simplify the exposition, we assume that all sensors have identical performance:

PD,n = PD,  PFA,n = PFA.

Define the number of sensors having dn = 1:

u1 = Σ_{n=1}^N dn.

Then, the log-likelihood ratio becomes

log Λ(d) = u1 log(PD/PFA) + (N − u1) log( (1 − PD)/(1 − PFA) )  ≷  log γ

or

u1 log[ PD (1 − PFA) / ( PFA (1 − PD) ) ]  ≷  log γ + N log( (1 − PFA)/(1 − PD) ).   (22)
Clearly, each node's local decision dn is meaningful only if PD > PFA, which implies

PD (1 − PFA) / ( PFA (1 − PD) ) > 1

the logarithm of which is therefore positive, and the decision rule (22) further simplifies to

u1  ≷  γ′.
The Neyman-Pearson performance analysis of this detector is easy: the random variable U1 is binomial given θ (i.e. conditional on the hypothesis) and, therefore,

P[U1 = u1] = (N choose u1) p^{u1} (1 − p)^{N − u1}

where p = PFA under H0 and p = PD under H1. Hence, the global false-alarm probability is

PFA,global = P[U1 > γ′ | θ0] = Σ_{u1 = ⌈γ′⌉}^{N} (N choose u1) PFA^{u1} (1 − PFA)^{N − u1}.
Clearly, PFA,global will be a discontinuous function of γ′ and, therefore, we should choose our PFA,global specification from the available discrete choices. But, if none of the candidate choices is acceptable, this means that our current system does not satisfy the requirements and a remedial action is needed, e.g. increasing the quantity (N) or improving the quality of the sensors (through changing PD and PFA), or both.
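The binomial-tail expression for PFA,global is easy to tabulate. A sketch, with N = 10 sensors, local PFA = 0.1, and a non-integer threshold γ′ = 4.5 as illustrative values:

```python
from math import comb, floor

def pfa_global(N, pfa_local, thresh):
    # P[U1 > thresh | H0], with U1 ~ binomial(N, pfa_local)
    lo = floor(thresh) + 1          # smallest integer count strictly above thresh
    return sum(comb(N, u) * pfa_local ** u * (1.0 - pfa_local) ** (N - u)
               for u in range(lo, N + 1))

p = pfa_global(10, 0.1, 4.5)        # chance that 5 or more of 10 nodes falsely alarm
```

Even though each node falsely alarms with probability 0.1, requiring 5 of 10 votes drives the global false-alarm probability down to about 1.6e-3; lowering the threshold to 0.5 (any single vote suffices) raises it to 1 − 0.9^10 ≈ 0.65.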
Testing Multiple Hypotheses

Suppose now that we choose Θ0, Θ1, ..., Θ_{M−1} that form a partition of the parameter space Θ:

Θ0 ∪ Θ1 ∪ ⋯ ∪ Θ_{M−1} = Θ,  Θi ∩ Θj = ∅ for i ≠ j.

We wish to distinguish among M > 2 hypotheses, i.e. identify which hypothesis is true:

H0 : θ ∈ Θ0 versus
H1 : θ ∈ Θ1 versus
⋮ versus
H_{M−1} : θ ∈ Θ_{M−1}

and, consequently, our action space consists of M choices. We design a decision rule φ(x) : X → {0, 1, ..., M − 1}:

φ(x) = 0, decide H0,
       1, decide H1,
       ⋮
       M − 1, decide H_{M−1}

where φ partitions the data space X [i.e. the support of p(x | θ)] into M regions:

Rule φ:  X0 = {x : φ(x) = 0}, ..., X_{M−1} = {x : φ(x) = M − 1}.
We specify the loss function using L(i | m), where, typically, the losses due to correct decisions are set to zero:

L(i | i) = 0,  i = 0, 1, ..., M − 1.

Here, we adopt zero losses for correct decisions. Now, our posterior expected loss takes M values:

ρm(x) = Σ_{i=0}^{M−1} ∫_{Θi} L(m | i) p(θ | x) dθ = Σ_{i=0}^{M−1} L(m | i) ∫_{Θi} p(θ | x) dθ,  m = 0, 1, ..., M − 1.
Then, the Bayes decision rule φ* is defined via the following data-space partitioning:

Xm* = {x : ρm(x) = min_{0≤l≤M−1} ρl(x)},  m = 0, 1, ..., M − 1

or, equivalently, upon applying the Bayes rule,

Xm* = {x : m = arg min_{0≤l≤M−1} Σ_{i=0}^{M−1} ∫_{Θi} L(l | i) p(x | θ) π(θ) dθ}
    = {x : m = arg min_{0≤l≤M−1} Σ_{i=0}^{M−1} L(l | i) ∫_{Θi} p(x | θ) π(θ) dθ}.
The preposterior (Bayes) risk for rule φ(x) is

E_{x,θ}[loss] = Σ_{m=0}^{M−1} Σ_{i=0}^{M−1} ∫_{Xm} ∫_{Θi} L(m | i) p(x | θ) π(θ) dθ dx
             = Σ_{m=0}^{M−1} ∫_{Xm} [ Σ_{i=0}^{M−1} L(m | i) ∫_{Θi} p(x | θ) π(θ) dθ ] [≜ hm(x)] dx.

Then, for an arbitrary partition {Xm} and hm(x) as defined above,

[ Σ_{m=0}^{M−1} ∫_{Xm} hm(x) dx ] − [ Σ_{m=0}^{M−1} ∫_{Xm*} hm(x) dx ] ≥ 0

which verifies that the Bayes decision rule φ* minimizes the preposterior (Bayes) risk.
Special Case: L(i | i) = 0 and L(m | i) = 1 for i ≠ m (0-1 loss), implying that ρm(x) can be written as

ρm(x) = Σ_{i=0, i≠m}^{M−1} ∫_{Θi} p(θ | x) dθ = [const, not a function of m] − ∫_{Θm} p(θ | x) dθ

and

Xm* = {x : m = arg max_{0≤l≤M−1} ∫_{Θl} p(θ | x) dθ} = {x : m = arg max_{0≤l≤M−1} P[θ ∈ Θl | x]}   (23)

which is the MAP rule, as expected.
Simple hypotheses: Let us specialize (23) to simple hypotheses (Θm = {θm}, m = 0, 1, ..., M − 1):

Xm* = {x : m = arg max_{0≤l≤M−1} p(θl | x)}

or, equivalently,

Xm* = {x : m = arg max_{0≤l≤M−1} [πl p(x | θl)]},  m = 0, 1, ..., M − 1

where

π0 = π(θ0), ..., π_{M−1} = π(θ_{M−1})

define the prior pmf of the M-ary discrete random variable θ (recall that θ ∈ {θ0, θ1, ..., θ_{M−1}}). If the πi, i = 0, 1, ..., M − 1, are all equal:

π0 = π1 = ⋯ = π_{M−1} = 1/M

the resulting test

Xm* = {x : m = arg max_{0≤l≤M−1} [(1/M) p(x | θl)]} = {x : m = arg max_{0≤l≤M−1} p(x | θl) [likelihood]}   (24)

is the maximum-likelihood test; this name is easy to justify after inspecting (24) and noting that the computation of the optimal decision region Xm* requires the maximization of the likelihood p(x | θ) with respect to the parameter θ ∈ {θ0, θ1, ..., θ_{M−1}}.
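A sketch of the M-ary MAP rule: decide the m maximizing πl p(x | θl). Three unit-variance Gaussian hypotheses with means −1, 0, 2 and a flat prior (illustrative choices) reduce it to the ML test (24):

```python
from math import exp, pi, sqrt

means = [-1.0, 0.0, 2.0]            # theta_0, theta_1, theta_2 (illustrative)
priors = [1 / 3, 1 / 3, 1 / 3]      # flat prior -> MAP reduces to ML

def map_decide(x):
    # pi_l * p(x | theta_l) for each hypothesis l
    scores = [pl * exp(-0.5 * (x - ml) ** 2) / sqrt(2.0 * pi)
              for pl, ml in zip(priors, means)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal variances the decision boundaries fall midway between adjacent means (here at −0.5 and 1.0).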
Summary: Bayesian Decision Approach versus Neyman-Pearson Approach

The Neyman-Pearson approach appears particularly suitable for applications where the null hypothesis can be formulated as absence of signal or, perhaps, absence of statistical difference between two data sets (treatment versus placebo, say).

In the Neyman-Pearson approach, the null hypothesis is treated very differently from the alternative. (If the null hypothesis is true, we wish to control the false-alarm rate, which is different from our desire to maximize the probability of detection when the alternative is true.) Consequently, our decisions should also be treated differently. If the likelihood ratio is large enough, we decide to accept H1 (or reject H0). However, if the likelihood ratio is not large enough, we decide not to reject H0 because, in this case, it may be that either

(i) H0 is true or
(ii) H0 is false but the test has low detection probability (power) (e.g. because the signal level is small compared with the noise or we collected too small a number of observations).

The Bayesian decision framework is suitable for communications applications as it can easily handle multiple hypotheses (unlike the Neyman-Pearson framework).

0-1 loss: In communications applications, we typically select a 0-1 loss, implying that all hypotheses are treated equally (i.e. we could change the roles of the null and alternative hypotheses without any problems). Therefore, interpretations of our decisions are also straightforward. Furthermore, in this case, the Bayes decision rule is also optimal in terms of minimizing the average error probability, which is one of the most popular performance criteria in communications.
P Values
Reporting "accept H0" or "accept H1" is not very informative. Instead, we could vary PFA and examine how our report would change.

Generally, if H1 is accepted for a certain specified PFA, it will be accepted for any P′FA > PFA. Therefore, there exists a smallest PFA at which H1 is accepted. This motivates the introduction of the p value.
To be more precise (and be able to handle composite hypotheses), here is a definition of the size of a hypothesis test.

Definition 1. The size α of a hypothesis test described by

Rule φ:  X0 = {x : φ(x) = 0},  X1 = {x : φ(x) = 1}

is defined as follows:

α = max_{θ∈Θ0} P[x ∈ X1 | θ] = max possible PFA.

A hypothesis test is said to have level α if its size is less than or equal to α. Therefore, a level-α test is guaranteed to have a false-alarm probability less than or equal to α.
Definition 2. Consider a Neyman-Pearson-type setup where our test

Rule φα:  X0,α = {x : φα(x) = 0},  X1,α = {x : φα(x) = 1}   (25)

achieves a specified size α, meaning that

α = max possible PFA = max_{θ∈Θ0} P[x ∈ X1,α | θ]   (composite hypotheses)

or, in the simple-hypothesis case (Θ0 = {θ0}, Θ1 = {θ1}):

α = PFA = P[x ∈ X1,α | θ = θ0]   (simple hypotheses).

We suppose that, for every α ∈ (0, 1), we have a size-α test with decision regions (25). Then, the p value for this test is the smallest level at which we can declare H1:

p value = inf{α : x ∈ X1,α}.

Informally, the p value is a measure of evidence for supporting H1. For example, p values less than 0.01 are considered very strong evidence supporting H1.

There are a lot of warnings (and misconceptions) regarding p values. Here are the most important ones.
Warning: A large p value is not strong evidence in favor of H0; a large p value can occur for two reasons:

(i) H0 is true or

(ii) H0 is false but the test has low detection probability (power).

Warning: Do not confuse the p value with

P[H0 | data] = P[θ ∈ Θ0 | x]

which is used in Bayesian inference. The p value is not the probability that H0 is true.

Theorem 1. Suppose that we have a size-α test of the form

declare H1 if and only if T(x) ≥ c.

Then, the p value for this test is

p value = max_{θ∈Θ0} P[T(X) ≥ T(x) | θ]

where x is the observed value of X. For Θ0 = {θ0}:

p value = P[T(X) ≥ T(x) | θ = θ0].
In words, Theorem 1 states that:

The p value is the probability that, under H0, a random data realization X is observed yielding a value of the test statistic T(X) that is greater than or equal to what has actually been observed (i.e. T(x)).

Note: This interpretation requires that we allow the experiment to be repeated many times. This is what Bayesians criticize by saying that data that have never been observed are used for inference.

Theorem 2. If the test statistic has a continuous distribution, then, under H0 : θ = θ0, the p value has a uniform(0, 1) distribution. Therefore, if we declare H1 (reject H0) when the p value is less than or equal to α, the probability of false alarm is α.

In other words, if H0 is true, the p value is like a random draw from a uniform(0, 1) distribution. If H1 is true and if we repeat the experiment many times, the random p values will concentrate closer to zero.
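Theorem 1 in code, assuming a test statistic that is N(0, 1) under H0 (an illustrative choice): the p value is just the right-tail probability at the observed statistic.

```python
from statistics import NormalDist

def p_value(t_observed):
    # P[T(X) >= T(x) | theta0] for T ~ N(0, 1) under H0
    return 1.0 - NormalDist().cdf(t_observed)

# Theorem 2 in action: "p value <= alpha" is exactly "T(x) >= Q^{-1}(alpha)",
# so thresholding the p value at alpha gives P_FA = alpha.
```

For example, an observed statistic of 2.0 gives a p value of about 0.023, so H1 would be declared at α = 0.05 but not at α = 0.01.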
Multiple Testing
We may conduct many hypothesis tests in some applications, e.g.

bioinformatics and

sensor networks.

Here, we perform many (typically binary) tests (one test per node in a sensor network, say). This is different from testing multiple hypotheses that we considered on pp. 54-58, where we performed a single test of multiple hypotheses. For a sensor-network related discussion on multiple testing, see

E.B. Ermis, M. Alanyali, and V. Saligrama, "Search and discovery in an uncertain networked world," IEEE Signal Processing Magazine, vol. 23, pp. 107-118, Jul. 2006.

Suppose that each test is conducted with false-alarm probability PFA = α. For example, in a sensor-network setup, each node conducts a test based on its local data.

Although the chance of false alarm at each node is only α, the chance of at least one falsely alarmed node is much higher, since there are many nodes. Here, we discuss two ways to deal with this problem.
Consider M hypothesis tests:

H0i versus H1i,  i = 1, 2, ..., M

and denote by p1, p2, ..., pM the p values for these tests. Then, the Bonferroni method does the following:

Given the p values p1, p2, ..., pM, accept H1i if pi < α/M.

If all the H0i are true (and the p values are independent), then

P{min{p1, p2, ..., pM} > x} = (1 − x)^M

yielding the proper p value to be attached to min{p1, p2, ..., pM} as

1 − (1 − min{p1, p2, ..., pM})^M   (26)

and, if min{p1, p2, ..., pM} is small and M is not too large, (26) will be close to M min{p1, p2, ..., pM}.

False Discovery Rate: Sometimes it is reasonable to control the false discovery rate (FDR), which we introduce below.
Suppose that we accept H1i for all i for which

pi < threshold

and let m0 + m1 = m be

number of true H0i hypotheses + number of true H1i hypotheses = total number of hypotheses (nodes, say).

# of different outcomes:

               H0 not rejected    H1 declared    total
H0 true        U                  V              m0
H1 true        T                  S              m1
total          m − R              R              m

Define the false discovery proportion (FDP) as

FDP = V/R, R > 0,
      0,   R = 0

which is simply the proportion of incorrect H1 decisions. Now, define

FDR = E[FDP].
The Benjamini-Hochberg (BH) Method

(i) Denote the ordered p values by p(1) ≤ p(2) ≤ ⋯ ≤ p(m).

(ii) Define

li = i α / (Cm m)  and  R = max{i : p(i) < li}

where Cm is defined to be 1 if the p values are independent and Cm = Σ_{i=1}^m (1/i) otherwise.

(iii) Define the BH rejection threshold T = p(R).

(iv) Accept all H1i for which p(i) ≤ T.

Theorem 4. (formulated and proved by Benjamini and Hochberg) If the above BH method is applied, then, regardless of how many null hypotheses are true and regardless of the distribution of the p values when the null hypothesis is false,

FDR = E[FDP] ≤ (m0/m) α ≤ α.