20090914183 - DTIC · 2011-05-14 · This report examines histogram estimation methods for...
Transcript of 20090914183 - DTIC · 2011-05-14 · This report examines histogram estimation methods for...
NUWC-NPT Technical Report 11,807 1 June 2009
A Tutorial on EM-Based Density Estimation with Histogram Intensity Data Phillip L. Ainsleigh Sensors and Sonar Systems Department
20090914183 MAVSEA WARFARE CENTERS
NEWPORT
Naval Undersea Warfare Center Division Newport, Rhode Island
Approved for public release; distribution is unlimited.
PREFACE
This report was partially funded by the Office of Naval Research (ONR-321 US).
The technical reviewer for this report was Tod E. Luginbuhl (Code 1522).
Reviewed and Approved: 1 June 2009
David W. Grande Head, Sensors and Sonar Systems Department
REPORT DOCUMENTATION PAGE Form Approved
OMB No. 0704-0188 The public reporting burden for this collection of Information Is estimated to average 1 hour per response, Including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of Information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Department of Defense, Washington Headquarters Services, Directorate for Information Operations and Reports (0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to any penalty for falling to comply with a collection of Information if it does not display a currently valid OPM control number. PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.
1. REPORT DATE (DD-MM-YYYY) 01-06-2009
2. REPORT TYPE 3. DATES COVERED (From - To)
4. TITLE AND SUBTITLE
A Tutorial on EM-Based Density Estimation with Histogram Intensity Data
5a. CONTRACT NUMBER
5b. GRANT NUMBER
5c. PROGRAM ELEMENT NUMBER
6. AUTHOR(S)
Phillip L. Ainsleigh
5d PROJECT NUMBER
5e. TASK NUMBER
5f. WORK UNIT NUMBER
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)
Naval Undersea Warfare Center Division 1176 HowelI Street Newport, Rl 02841-1708
8. PERFORMING ORGANIZATION REPORT NUMBER
TR 11,807
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)
Office of Naval Research 875 North Randolph Street Arlington, VA 22203
10. SPONSORING/MONITOR'S ACRONYM
ONR
11. SPONSORING/MONITORING REPORT NUMBER
12. DISTRIBUTION/AVAILABILITY STATEMENT
Approved for public release; distribution is unlimited
13. SUPPLEMENTARY NOTES
14. ABSTRACT
This report examines histogram estimation techniques in which intensity data are represented using a parameterized probability density function (PDF) model. It gives a high-level overview of histogram modeling, introducing the dominant issues and motivations, and then provides detailed mathematical developments of the histogram-based algorithms.
15. SUBJECT TERMS
Signal Processing
Static-Mixture Histograms
Histogram Estimation Methods Histogram Modeling
Expectation-Maximization (EM) Algorithm
16. SECURITY CLASSIFICATION OF:
a. REPORT
(U)
b. ABSTRACT
(U)
c. THIS PAGE
(U)
17. LIMITATION OF ABSTRACT
(SAR)
18. NUMBER OF
PAGES 69
19a. NAME OF RESPONSIBLE PERSON
Phillip L. Ainsleigh
19b. TELEPHONE NUMBER (Include area code) (401 (-832-9201
Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std. Z39-18
TABLE OF CONTENTS
Section Page
LIST OF ILLUSTRATIONS ii
1 INTRODUCTION 1
2 OVERVIEW OF HISTOGRAM MODELING 3 2.1 Histogram Approximation 4 2.2 Histogram Approximation and Acoustical Data 6 2.3 Histogram Methods and Mixture Models 8 2.4 Truncated Histogram Data 11
3 ESTIMATION FROM HISTOGRAM DATA 15 3.1 Estimation from Complete Histogram Data 15 3.2 Estimation from Truncated Histogram Data 20 3.3 Relationship Between Complete and Truncated Histograms 25
4 HISTOGRAM MIXTURE MODELS 29 4.1 Model Definition 29 4.2 Estimation from Point Measurements 30 4.3 Estimation from Histogram Data 33
5 SUMMARY 41
APPENDIX A—EXPECTATION-MAXIMIZATION ALGORITHMS A-l
APPENDIX B—COMBINATORIAL PROBABILITY DISTRIBUTIONS B-l
APPENDIX C—STATISTICS OF MISSING HIT COUNTS C-l
APPENDIX D—ESTIMATING GAUSSIAN PARAMETERS D-l
APPENDIX E—GAUSSIAN MOMENTS IN AN INTERVAL E-l
REFERENCES R-l
LIST OF ILLUSTRATIONS
Figure Page
1 PDF, Measurements, and Histogram Intensities 5
2 Histogram Data as a Function of Overall Intensity 9
3 Histogram Data as a Function of SNR 10
4 Truncated Histogram Data 12
5 Gaussian-Uniform Mixture Estimation, High-Intensity Data 39
6 Gaussian-Uniform Mixture Estimation, Low-Intensity Data 40
A-l Iterative Minorization (IM) A-4
D-l A Test for Radial Concavity D-5
n
A TUTORIAL ON EM-BASED DENSITY ESTIMATION
WITH HISTOGRAM INTENSITY DATA
1. INTRODUCTION
Observed measurements from real-world physical systems are inherently stochastic
because of noise in the propagation medium and measurement system, and because
of random variabilities in the source-generating mechanism. Therefore, the essential
problem in estimation is not to identify the "true value" of a variable of interest (such
as spatial location or frequency), but to accurately characterize the probability density
function (PDF) associated with that variable of interest. In most real-world situations,
there are no such things as numbers; there are only distributions. This report is about
the characterization of those distributions.
The notion that data collection is an exercise in distributions is consistent with
the way observations of a physical variable are actually obtained. That is, observations
from digital processing systems are usually obtained by partitioning the range of the
physical variable into bins, and then measuring the energy that falls within each bin
(or, equivalently, the energy intensity associated with each bin). Characteristics of the
variable of interest are then inferred from the energy in these bins. The estimation error
in this inference process is related to the amount of probability mass (i.e., area under
the PDF) that is not encapsulated in the estimator. Thus, if the PDF governing the
spread of energy in the variable of interest is concentrated enough that "nearly all" of
the probability mass falls within a single bin, then viewing observed data as point mea-
surements (i.e., numbers) is a reasonable approximation. In situations where the PDF
has probability mass extending across several bins, however, such point approximations
can lead to significant information loss and often to bias.
This report examines histogram estimation methods for representing intensity data
using parameterized PDF models, which provide a mechanism to significantly reduce
the information loss and bias that result from point approximations. While paramet-
ric density estimation has a long history, modern treatments of the problem usually
trace back to the seminal paper by Dempster, Laird, and Rubin [1], who, among other
things, applied the expectation-maximization (EM) algorithm for parameter estima-
tion with histogram data. McLachlan and Jones [2] extended the algorithm in [1] by
applying EM-based histogram estimation to static mixture densities. Luginbuhl [3],
Luginbuhl and Willett [4], and Streit [5] took this a step further by applying histogram-
estimation methods for dynamic mixtures within the probabilistic multi-hypothesis
tracking (PMHT) algorithm. The focus here is on the static-mixture histograms dis-
cussed by McLachlan and Jones [2], with the objective of providing enough detail to
allow these models and algorithms to be applied as they stand (e.g., in signal classifi-
cation applications) or extended for dynamic tracking contexts other than PMHT.
The organization of this report is intended to develop the concepts in increasing
detail, ultimately linking all aspects of histogram-based algorithms back to fundamental
statistical principles. The next section gives a high-level overview of histogram model-
ing, introducing the dominant issues and motivations. Sections 3 and 4 then provide
detailed mathematical developments of the algorithms. In particular, histogram meth-
ods for non-mixture distributions are developed in section 3, and these are extended
to mixture distributions in section 4. After a brief summary in section 5, a set of ap-
pendixes discuss supporting developments from optimization and distribution theory. In
addition to reviewing material that is important to understanding the algorithms, these
appendixes introduce much of the notation used in the body of the report. For exam-
ple, the discussion of the EM algorithm in appendix A provides a notational and logical
template that is used repeatedly when discussing histogram estimation algorithms in
sections 3 and 4.
2 OVERVIEW OF HISTOGRAM MODELING
The objective is to define statistical models that characterize the behavior of vari-
ables of interest (such as bearing or frequency) for some class of sources, a problem
that arises in a number of applications. For example, in maximum-likelihood classi-
fication (e.g., see [6]), PDF models are use to represent the behavior of the features
under various class hypotheses. The classifier decision boundaries are then found at
certain intersections of these class-conditional densities. As another example, proba-
bilistic tracking algorithms (e.g., see [7]) estimate parameters in a dynamic PDF model
at each time step. Regardless of the application, the goal when developing PDF models
is usually to provide a "best fit" for recorded data. However, most sensors do not pro-
vide direct measurements of the variables of interest (e.g., recorded data are not tagged
with location or frequency information). Information about the desired physical vari-
able is usually derived by transforming the received data (e.g., beamforming of spatial
array data or Fourier transform of time-domain data). An inherent characteristic of
this transformation process is a partitioning of the variable domain into bins and the
generation of output values that correspond to these bins. These transform outputs are
usually further transformed to magnitude-squared, or energy intensity, data because
the original transform outputs are complex quantities whose phase is very sensitive to
noise. Therefore, PDF models are derived from data that take the form of energy as a
function of transform bins (e.g., energy as a function of beam or frequency).
The traditional approach for processing this type of intensity data extracts "point
observations" of the physical variable using a peak estimator, which selects one or more
values of the variable for which the intensity data are locally maximum. While simple
interpolation methods can significantly improve accuracy when signal energy extends
over just a few bins, this approach is inappropriate when the spread in the energy
extends over several transform bins. Histogram methods accommodate large intensity
spreads by employing distribution models with nonzero second (and possibly higher)
moment characteristics.
This section introduces the basic ideas and representation abilities of the histogram
methods, as well as some of the issues that are addressed in the later sections. The first
subsection introduces the histogram approximation. The second discusses issues that
arise in the application of histogram methods to acoustic intensity data. The third
motivates the use of mixture densities in a histogram context, and the last subsection
discusses issues that arise when the range of available measurements does not fully cover
the range of the PDF.
2.1 HISTOGRAM APPROXIMATION
The histogram algorithms discussed here fall in the general class of parametric
statistical modeling techniques, where a PDF model p(z; 0) is used to represent the
energy distribution in the physical variable z of interest. The characteristics of the
model are governed by the parametric structure of the model and by the values in the
parameter set 0, which must be estimated from observed data. The difficulty with this
situation is that there are no direct observations of z. The parameter vector 0 must
be estimated from the energy intensity data, denoted S = {se : i = 1,..., L}, where
Si represents the energy in the £th bin and the collection of bins covers some range of
interest in the values of z.
A natural approach for overcoming this difficulty is to transform the PDF model
p(z;0) into another valid PDF model p(S;0), allowing 0 to be estimated from the
intensity data. When a convenient mathematical form for p(S; 0) is available, then
various optimization methods can be applied directly (e.g., Newton or other derivative-
based ascent method). Typically, however, the task of expressing S directly in terms
of 0 is highly nontrivial, and the resulting expression, if it can be obtained, is highly
nonlinear. For this reason, histogram methods employ an approximate model in which
p(z; 0) is used in conjunction with a multinomial approximation to represent the within-
bin and across-bin characteristics of the intensity data. Estimation algorithms are then
developed using an iterative EM approach. A brief overview of the EM approach is
provided in appendix A, which establishes a template for the algorithm descriptions in
sections 3 and 4.
The use of the EM algorithm and multinomial approximation invokes a quantum
image of intensity data, analogous to photons of light energy. The variable z of interest
is considered to be a feature of the quantum particles, and the particles are assumed
to be sorted into bins according to the value of z associated with each particle. The
number of particles in each bin, denoted me for the ^th bin, is referred to synonymously
as a histogram intensity, bin intensity, histogram count, particle count, or bin count.
The symbol me is used to emphasize the discrete nature of the histogram intensities,
in contrast to real-valued energy intensities. The relationship of PDF model, point
measurements, and histogram intensities is illustrated in figure 1.
For sensing modalities in which particle counts are actually observed (e.g., ionizing
radiation), or when there exists a one-to-one physical correspondence between energy
intensity s^ and particle count m^ (e.g., electromagnetic energy), the histogram approach
is generally valid. The only question in these cases concerns the appropriateness of the
Figure 1: PDF, Measurements, and Histogram Intensities The PDF (top) corresponds to a hypothetical scalar distribution. The points measurements (middle) are synthesized by sampling from the PDF, and these measurements are assigned to the various histogram bins with boundaries given by the vertical grid lines. The numbers of point measurements in each bin correspond to the histogram bin inten- sities (bottom). In practice, the point measurements are not observed (indeed, they may not even exist), such that histogram methods must estimate the PDF parameters from the intensity data. EM-based his- togram methods postulate the existence of the point measurements in order to obtain an easier optimization problem. These point measure- ments are treated as missing data, however, and are marginalized from the objective function during each EM iteration. Note that the point measurements shown in this figure are one-dimensional in the horizon- tal axis; the vertical spread in the measurements is included merely to illustrate the clustering of measurements.
parametric form for p(z;@). When using the histogram model for acoustic intensity
data, however, there is no physical mapping from s^ to me- The histogram model is
therefore inherently mismatched to the data in this case.
2.2 HISTOGRAM APPROXIMATION AND ACOUSTIC DATA
When applied to acoustic intensity data, the histogram model appears to exhibit
some insurmountable flaws, namely, the presumption of particles and point measure-
ments that don't really exist, as well as the real-valued nature of the energy intensities
in a theory requiring integer count data. From the perspective of practical implementa-
tion, there are two factors that ameliorate these problems. First, the actual estimators
for the model parameters in 0 are obtained by integrating over the space of the point
measurements, whereby the point measurements are marginalized out of the likelihood
function. This marginalization takes place during the derivation of the estimators, such
that the point measurement z never appears in any computational algorithm. Sec-
ond, the parameter estimators end up being functions only of the relative intensities,
which are the individual bin intensities divided by the overall intensity. The numeri-
cal algorithms do not care if these ratios are formed from integer-valued count data or
real-valued energy data, since the ratio is, in general, real in both cases.
Now, just because the algorithm can be applied does not make it the right tool.
It is therefore of interest to take a closer look at a common special case, specifically
magnitude-squared discrete Fourier transform (DFT) data. The DFT forms the inner
product of recorded time-domain data with a set of complex sinusoidal basis functions.
Since, in practice, it is impossible to observe an infinite duration of the time-domain
signal, it is impossible to localize the energy to a single frequency point. The DFT
sample therefore represents the energy "in the vicinity of" a given frequency, such that
the partitioning of the frequency domain into bins is a natural result of the transform
process and the DFT samples correspond to the energy that is projected into each bin.
This projection of energy into DFT bins has the flavor of a histogram by definition,
although the histogram model takes this a step farther by assuming a quantization of
the DFT bin energies to take integer values, as if there were some basic unit of acoustic
energy (i.e., the acoustic equivalent of Planck's constant) and the quantized integers
represent the number of these units that fall within each DFT bin. This quantization
implies a set of "synthetic" particles for each bin, and the number of these synthetic
particles is indeed a histogram count, even if it does not correspond to any known physi-
cal entity. The multinomial model at the heart of histogram methods then characterizes
these synthetic count data.
The multinomial model starts out by treating histogram count data as statistically
independent Poisson processes. The histogram count for each bin is then modeled using
a conditional distribution, where the conditioning is on the total synthetic intensity in all
bins (the quantized version of the total signal energy). Now, the total synthetic intensity
is merely the sum of the individual intensities, and a sum of Poisson processes is itself
a Poisson process whose expected value is the sum of the individual expected values
[8]. Furthermore, since the conditioning process is equivalent to dividing probability
mass functions, the multinomial model is effectively a series of ratios of Poisson mass
functions. Note that, while the synthetic intensities for the various DFT bins are initially
represented as independent processes, the bin intensities are not independent in the
multinomial model because of the conditioning on the total intensity. That is, all bins
affect all other bins through their contribution to the total intensity. Note also that
modeling the quantized intensities with a multinomial distribution provides a statistical
description that matches the first moment of the observed data but does not match any
higher moments.
Given the subtleties of the multinomial approximation, its appropriateness must
be evaluated on a case-by-case basis for different applications. That said, the histogram
approach provides a way forward in cases where the alternatives are intractable. For
example, an attempt to directly model the energy intensities would require a joint
multivariate exponential distribution over all of the spectral or spatial bins (i.e., a
joint distribution over easily thousands of variables, and more as resolution increases).
Estimation in such high-dimensional spaces is impossible in most real-world scenarios.
In many applications, the quantization of the bin intensities is implicit because the
algorithms depend only on the relative intensities. However, the quantization manifests
itself in a very explicit way in Bayesian contexts where a prior distribution is imposed.
The problem involves the relative weighting of the prior distribution and the measure-
ment likelihoods when estimating the posterior parameter estimates. Specifically, the
measurement likelihoods are weighted by the overall intensity of the bins (i.e., the total
number of particles in all bins). The more "measurements" there are, the more the
prior distribution is discounted. But when dealing with artificially quantized intensity
data, the overall intensity depends on the unit energy associated with each particle,
which is itself a function of the quantization. The net result is that the quantization
unit becomes an explicit variable in the algorithm, making the relative weighting of the
prior completely dependent on an algorithm design parameter. If a coarse quantization
(i.e., a large particle energy unit) is assumed, then the synthetic particle counts will be
low and the prior distribution will dominate. If a very fine quantization (i.e., a small
particle energy unit) is used, then the apparent number of particles is very high and
the data overwhelm the prior. Indeed, in the limit as the quantization unit approaches
zero, the synthetic particle counts become infinite and the estimator effectively ignores
the prior distribution altogether, a problem that was observed by Streit in his work of
histogram PMHT [5]. This same issue will arise with any dynamic mixture scenario
in which the overall intensity varies with time, which includes just about all practical
tracking applications.
As a final note, the dependence of the parameter estimators only on the relative
intensities also means that the estimators are invariant to the true overall intensity.
The degree of freedom introduced by this invariance allows the same model to represent
large variabilities in different realizations of the distribution. To illustrate this, figures
2 and 3 show examples of synthetic histogram data with different values of signal-to-
noise-ratio (SNR) and overall intensity. In figure 2, plots are shown with the same
value of SNR, but with different values of the overall intensities. In contrast, figure
3 contains plots with the same overall intensity but with varying SNR. While there is
significant variability in both cases, there is a subtle distinction. Low SNR generally
means that there is too much of the wrong kind of data (i.e., noise), whereas low overall
intensity indicates that there is too little of every kind of data. The histogram model
accommodates both cases equally well.
2.3 HISTOGRAM METHODS AND MIXTURE MODELS
The Gaussian-signal-in-uniform-noise mixture model used to generate the above
examples was chosen because of the basic role that such models play when using his-
togram techniques. Specifically, mixture models provide a convenient way to explicitly
model noise, which is necessary because histogram models are very data inclusive. This
is most easily seen by contrasting the histogram approach to an estimator that selects
the single maximal peak. When such a peak estimator is applied to data with highly
concentrated signal energy (e.g., narrowband signal spectra), a beneficial side effect is
provided in the form of signal cleaning. That is, in cases where the signal intensity is
largely confined to a single bin and that bin is correctly identified, most of the wideband
noise is eliminated with the omitted bins. The histogram estimator, on the other hand,
throws away nothing. It must therefore explicitly account for the noise to minimize
noise-induced errors.
LARGE OVERALL INTENSITY
TIME
MODERATE OVERALL INTENSITY
TIME
SMALL OVERALL INTENSITY
TIME
Figure 2: Histogram Data as a Function of Overall Intensity Image plots of histogram intensity data are synthesized independently at each of 150 time points, with intensities at each time point obtained by sampling from a two-component mixture model containing a Gaussian signal in uniform noise. Samples are sorted into unit-width bins cover- ing the range [—10,10]. The Gaussian component has constant variance a2 = 4 and a sinusoidally time-varying mean. The mixture components have constant probabilities ns = 0.3 (for signal) and nn — 0.7 (for noise). The ratio 7rs/nn is the (wide-band) SNR. Intensity gram plots are shown for overall intensities corresponding to 2500 point measurements per sample time (top), 250 measurements per sample time (middle), and 25 measurements per time (bottom). g
HIGH SIGNAL-TO-NOISE RATIO
TIME
MEDIUM SIGNAL-TO-NOISE RATIO
l i •
u o
1 o I- w • I 1 11- !!• flf
• ,p
'-'•I ""J . .'
1 "TfiySf i'i1 VN3-v, TIME
LOW SIGNAL-TO-NOISE RATIO
TIME
Figure 3: Histogram Data as a Function of SNR Histogram data are synthesized using the same model as in figure 2. Here, the total number of measurements in all plots is K — 250 (as in the middle plot figure 2). The signal-mode assignment probability varies, with values irs = 0.5 (top), 7rs = 0.3 (middle), to TTS = 0.1 (bottom), giving high, medium, and low signal-to-noise ratios, respectively.
10
Finite mixture distributions, the topic of section 4, are ideally suited for noise
modeling because they provide a natural mechanism for dividing the energy between
a number of component distributions, one or more of which can be tailored to noise.
The "basic" PDF model for histogram methods is therefore the Gaussian-signal-in-
uniform-noise model. The use of a uniform noise distribution assumes that the data
has been pre-whitened, for example, using a spectral or spatial normalizer. If it is more
desirable to work with un-normalized data, however, a more sophisticated noise model
is required and can be achieved using a mixture of uniform or Gaussian components.
Multiple mixture modes can also be used to describe the signal energy itself, say for
data containing energy from multiple sources of interest or signals whose energy in the
variable of interest is inherently multi-modal.
2.4 TRUNCATED HISTOGRAM DATA
For a variety of reasons, intensity measurements may not be available for some
histogram bins. This can happen unintentionally, for example, when processing spectral
data from sensors with a limited frequency range. It can also occur intentionally, as
when data are truncated or subsampled to reduce communication and/or computational
requirements, or to isolate a phenomenon of interest. Figure 4 shows a hypothetical
example of truncated histogram data. While this illustration focuses on histogram "edge
effects" where the missing bin intensities are on the outer edges of the distribution,
missing intensity values can also occur in the "interior" of the histogram, say, due
to unintentional drop-outs in the sensor response or intentional subsampling of the
histogram bins. When the histogram bins do not fully cover the range of the PDF, the
histogram and its data are called truncated. This is in contrast to a complete histogram,
whose bins do cover the entire range.
If complete-histogram algorithms are applied to truncated-histogram data, then
a mismatch exists between the physical world (where a nonzero intensity is impossible
in some bins) and the mathematical world (where nonzero intensities are possible, but
none just happened to be observed in the data at hand). The size of the mismatch
depends on the probability mass under the density function in the truncated regions.
This probability mass may be very small, allowing the issue to be ignored in some
applications. However, when modeling physical phenomena whose energy can approach
the edges of the observable measurement space, or when sensor malfunction causes lost
sensitivity in some interior region, then the probability mass in the truncated region
can be significant. The EM algorithm is used to circumvent this problem by treating
the unobserved intensities as missing data.
11
i—I—I—I—i i—i—i—i—i—i—:—i—i—i—i—i—r
a*
Figure 4' Truncated Histogram Data. This figure shows data from the same scalar distribution as in figure 1, with the continuous PDF shown on top, the synthetic point measure- ments shown in the middle, and the histogram bin intensities shown on bottom. In this case, however, the histogram bins do not cover the entire region over which point measurements can occur. The energy associated with measurements falling outside of the observable region (i.e., outside of the bold vertical lines) is not included in any histogram bin and is therefore lost in the histogram model. The lost data are accounted for in the EM estimation algorithm by utilizing an augmented set of missing data. That is, the EM algorithm treats as missing data the intensity values in the truncated region, in addition to the point measurements that form the missing data in the case of a complete histogram.
12
The computational algorithm for the truncated histogram ends up being a simple
modification of the algorithm for the corresponding complete histogram. The theoretical
work needed to develop the algorithm, however, has a subtlety requiring careful analysis.
In particular, it is impossible to know the total intensity (i.e., the sum of the bin
intensities) when some of the bin intensities were not recorded, and the multinomial
distribution is well defined only when the overall intensity is given. The uncertainty
regarding the overall intensity requires the use of a negative binomial distribution and
its multivariate extension, the negative multinomial distribution. This extension is
discussed in section 3 and appendix C. Fortunately, the end result of that analysis is a
simple extrapolation formula for obtaining the expected values of the missing intensities.
The EM auxiliary function for the truncated histogram is then a linear function of
the missing intensities, such that the maximization step in each iteration of the EM
algorithm can be formulated in terms of the complete histogram. The algorithm then
operates on an extended data set in which the observed intensities are augmented with
expected intensities in the truncated regions.
13(14 blank)
3. ESTIMATION FROM HISTOGRAM DATA
The remainder of this report focuses on the mathematical developments related
to histogram modeling and estimation. In all that follows, integer count data are as-
sumed available for each bin. That is, the issues cited in the previous section related to
converting from real-valued energy data to integer-valued histogram data are assumed
to have been dealt with appropriately.
As mentioned in the previous section, one key distinction among histogram meth-
ods involves whether or not the histogram bins completely cover the range of possible
measurement values and intensity data are available for all bins. When estimating the
parameter set 0, the values of z for which p(z; 0) is well defined constitutes the mea-
surement space, denoted Z. If data are available for histogram bins that completely
cover Z, then the histogram and its data are complete. If there are regions of Z for
which histogram data are lacking, then the histogram is truncated. This section first
examines complete histograms and then extends those algorithms for truncated data.
A general relationship between the two types of histograms is then discussed.
The notation used in this section and the logical steps in deriving algorithms
closely parallel the discussion of the expectation-maximization (EM) method in ap-
pendix A. Note that EM theory defines complete data generically to indicate the con-
catenation of the observed and missing data, which is independent of whether the his-
togram bins fully cover the space of the physical variable. To distinguish these different
types of complete data, data from a complete histogram will always be referred to as
complete-histogram data. Any reference to "complete data" without the "histogram"
modifier indicates the more generic notion from EM theory.
3.1 ESTIMATION FROM COMPLETE HISTOGRAM DATA
A complete histogram partitions the measurement space into a set of mutually
exclusive and exhaustive regions, which are the histogram bins (or cells), denoted Ze
for £ = 1,..., L. The measurement space is thus decomposed as
/.
Z = \JZe, (1) e=i
where the non-overlapping nature of the Ze is implicit. Consider now a collection of
hypothetical point measurements Z = {zi,..., z#}. The number of these measurements
whose value falls in each bin comprise the histogram intensity data, denoted
Mc = {m,:^=l,...,l}. (2)
L5
where the subscript C indicates complete histogram data. The overall intensity is then
defined as
L
K = Y, m(- (3) e=i
Under a given set of parameter values, the unconditional probability that a measurement
falls in the ith bin is
which, for a complete histogram, satisfies
L
0) = / dzp(z;0), (4) J zf
£*<(e) = i. (5)
Assuming that the measurements constitute K independent draws from L categories,
the bin intensities are governed by the multinomial distribution discussed in appendix B;
that is,
L m
p(Mc ; 0) = c(Mc) [I {Me*)}•1, (6) e=i
where c(Mc) is the multinomial coefficient defined in equation (127). Prom the stand-
point of the EM-based algorithms developed in this section, equation (6) is the observed
data likelihood function (ODLF) since it characterizes the observed histogram counts.
3.1.1 EM Algorithm with Missing Measurements
Definition of Missing Data and Auxiliary Function. The observed histogram
intensities provide a relative measure of the number of unobserved point measurements
that "fall into" each bin. The missing data for the EM algorithm are these unobserved
measurements. Within a given bin (say, the £th), the measurements are denoted
Ze = jztt : k= l,...,m<|, (7)
and the full set of missing data contains measurement sets for all bins as
Z = [Ze :£=1,...,L}. (8)
The complete data are the concatenated set containing the observed bin intensities
and missing measurements, and its joint distribution p(Z, Mc ; 0) is the complete data
16
likelihood function (CDLF). Given the CDLF and the posterior density p(Z|Mc ; 0*)
for the missing data (with parameter estimates 0* from the previous EM iteration),
the auxiliary function is formed by taking the conditional expectation
Qz(0;0*,Mc) = y"dZp(Z|Mc;0*)logp(Z,Mc;0), (9)
where fz dZ is the marginalization operator for the missing measurements. This marginal-
ization is represented by the sequence of (possibly multivariate) integrations given by
dZ = I dzn • • • / dzlmi >••{ / dzii • • • / dzLmL \ , (10) Jz I Jzx Jzx ) I JzL JzL )
which can be expressed using the shorthand notation
dzik \ • (11) JZ e=i fc=i k J£*
Care must be exercised when using this shorthand notation because a set of nested
integrals is clearly not equal to a product of single-variable integrals. However, due to
the non-overlapping nature of the integration regions (i.e., the Zk), the cross terms in
the nested integrals are zero, such that equation (10) is equivalent to equation (11) in
this case.
CDLF and Posterior Distribution. With missing measurements, it is perhaps
easiest to start with the posterior distribution of the missing data and then derive
the CDLF from it. With no other knowledge, each measurement is governed by the
distribution p(zek ; 0)- Because the measurement zek is known to reside in the £th bin,
however, its distribution in this restricted region is
I I -7 C>.\ P(Z^;0) /TON p(ztk\Ze;e) = MQ) . (12)
The intensities provide the numbers of independent measurements in each bin. Given
these, the likelihood of the group of measurements is just the product of the likelihoods of
each measurement. The posterior of the measurements given the intensities is therefore
L mi L mt , , r\\\
p(ziMc;e) = nn»(-i*i®> - n n {%#}• <i3> e=i fe=i e=i fc=i *• n ; >
The CDLF is derived from equations (6) and (13) by applying Bayes' rule and canceling
terms, which gives
L mi
p(Z,Mc;0) = p(Mc;@)p(Z\Mc:@) = c(Mr) ]J J] p(z(k ; 0) • (14) t=\ k=\
17
The log-CDLF is then given by
L mt
logp(Z, Mc ; 0) = log c(Mc) + ^ ^ logp(z,fc ; 0). (15) <=i fc=i
Dropping the term logc(Mc), which does not depend on 0, the auxiliary function
becomes
Qz(0;0*,Mc) = / dZp(Z|Mc;0*)X^Elog^fc'0)- (16) ^z e=i fc=i
Conditional Expectation. Due to the independence of the measurements and the
structure of the posterior distribution in equation (13), the conditional expectation
operation can be expressed as
JzdZp(Z\Mc;S*) = f[ fj {^^)Jz *»!**»;©•)}. (17)
where, noting the definition of $^(0*) in equation (4), each component in this expression
satisfies
9^1*** **»'>*) -1' (18)
When the conditional expectation operation in equation (17) is applied to the log of the
CDLF, it applies individually to each of the log terms that appear after the summa-
tions in equation (16). Furthermore, the components in the conditional expectation all
integrate to one, except for those in which the indices of the integration variable match
the indices of the log term being operated on. The auxiliary function therefore becomes
Qz(0;0*,Mc) = £ ¥— Y, / dzekp(ztk;e*) log p(z£k ;&). e=i e\ ) k=i JZt
At this point, the indices t and k on the measurements have no significance because
the integration is carried out independently for each summand (i.e., the integration
is "inside" the summation operations), and because there is no actual set of observed
measurements into which one might have to index. The indices £ and k are therefore
dropped from the measurements to obtain the expression
L , Qz(0;0*,Mc) = Y, $^) jz dzp(z;S*) log p(z; 0 )• (19)
18
At the level of generality considered thus far, further developments are impossible be-
cause maximization of the auxiliary function (the M-step) requires a particular form for
p(z; 0). The M-step is carried out below, however, for the special case of a multivariate
normal density.
M-Step for a Gaussian Model. As an example, consider estimating the mean vector
H and covariance matrix P in the Gaussian distribution
p(z;0) = Af(z;/i,P) = |27rP|-1/2expj-^(z-/z)Tp-1(z-/z)j . (20)
In this case, the auxiliary function given in equation (19) becomes
Qz(0;0*,Mc) = Y. ^jh\ / d*tf(z;n\P*) log Af(z;n,P) . (21)
where /Lt* and P* are the existing estimates from the previous EM iteration. The
estimators for the mean and covariance are given in terms of a set of sufficient statistics
for the posterior distribution. In addition to the bin probability
$,(©*) = / dzAA(z;//*.P*) , Jzt
(22)
these sufficient statistics include the centroid and spread variables for each histogram
cell, which are given respectively as
w<(e*) = f dzJV(z;/i*,P*) z, (23)
ne(®*) = f dzAA(z;/i*,P*) zzT. (24) Jzt
Computational algorithms for equations (22)-(24) are given for the case of scalar mea-
surements in appendix E. The estimators for the mean vector and covariance matrix
are derived in appendix D and are given by
e=i x '
'4^ta (26)
where tit is the center-shifted second local moment, which is computed from the centroid
and spread variables as
0«(©*,£) = n,(0*)-2/i w,(0*) + /x2 $,(0*). (27)
L9
As shown in appendix D, estimators ft and P are located at a critical point of the
auxiliary function, and, due to the concave structure of the auxiliary function, they also
locate the global maximum.
3.2 ESTIMATION FROM TRUNCATED HISTOGRAM DATA
When histogram intensity data for various bins or regions are unavailable for
some reason, then the estimator needs to acknowledge that there are unknown values
in those bins or regions, instead of assuming that they are zero. The EM algorithm
handily addresses this problem by letting the unobservable intensities be missing data.
To establish notation, let ZQ denote the observable region of the measurement
space, which is partitioned into LQ histogram cells as
Lo
Zo = {jZe. (28) e=i
Further, let ZT denote the truncated region, where particles are not counted (e.g., at
the edge of an image or at either end of a frequency band). This region is partitioned
into Li cells as
L
ZT = U Ze, (29) e=L0+i
where L = Lo + LT is the total number of cells that cover the concatenated space
Z = Z0UZT. (30)
The unconditional probability of a particle in the observable region is
*o(e) = / dzp(z;&) = X>,(0), (31)
and the particle probability for the truncated region is
$T(e) = / dzp(z;&) = J2 $<(0)- (32) ^
ZT e=L0+i
Since ZQ and Zi satisfy equation (30), $o(©) and $T(©) satisfy
$O(0) + <I>T(0) = 1. (33)
The collection of observed intensities is denoted
M0 = {m€:£=l,...,L0}, (34)
20
and the total number of observed particles is
Lo Ko = Yme- (35)
Since the observed intensities represent K0 draws on Lo categories, Mo is multinomially
distributed. But since the unconditional probabilities of the observed histogram bins
do not sum to one, the ODLF is given by
P(Mo;0) = C(Mo)n{^}
{n&rntl} {$o(e)}_ ' n{*'(Q)}m'- (36)
The EM algorithm for optimizing this likelihood function is developed in two stages.
First, the unobserved intensities are treated as the sole piece of missing data. This is
then extended to the case of missing measurements and intensities.
3.2.1 EM Algorithm with Missing Hit Counts
Consider, for a moment, estimating the distribution parameters using an EM
algorithm that treats the truncated intensities as missing data but does not include
the measurements in the missing data. The resulting algorithm is of little practical
interest since the M-step likely involves a difficult nonlinear problem. This situation is
considered, however, to isolate the issues involved with truncated bin intensities. After
getting an understanding of these issues in the present subsection, the next subsection
takes on the case where both the truncated intensities and measurements are included
in the missing data.
Definition of Missing Data and Auxiliary Function. For the present discussion,
the missing data are the intensities in the truncated region, denoted
MT = {TO*:'« = .LO + 1,...,L}, (37)
and the auxiliary function is defined as
QMT(0;0*.MO) = Y, P(MT|MO;0*) logp(MT,Mo;0). (38)
The marginalization operator for the missing intensities is expressed using the shorthand
notation
E - n E h E E - E • (39) MT £=L0 + l {. m(=0 ) mz,o + i=0 mLo+2=0 mL=0
21
Similar to the notation introduced in equation (11), this shorthand notation is only
accurate when the cross-terms in the nested summations are zero, in which case the
nested sum does indeed reduce to a product of individual summations.
CDLF and Posterior Distribution. The posterior distribution of the missing in-
tensities is derived in appendix C, and is given by
K L
p(MT | Mo ; 0) = c-(MT,Ao) {$<>(©)} ' II {*<(©) P , (40) e=LQ+i
where C"(MT,^O) is the negative multinomial coefficient defined in equation (140).
The CDLF is given in terms of the ODLF and posterior distribution as
p(MT, Mo ; 0) = p(MQ ; 0) p{MT | MQ ; 0) (41)
L
= c(Mo) c-(MT, KO) II { **(©) }m , (42)
whose logarithm is
L
logp(MT, M0 ; 0) = log c(M0) + logc_(MT, KQ) + ^ rne log$£(0). (43) €=1
Since the first two terms in equation (43) do not depend 0, they can be dropped from
the auxiliary function to obtain
L
QMT(0;0*,MO) = J2 P(MT|Mo;0*) ]T me log^(0). (44) MT e=i
Conditional Expectation. The expression in equation (44) is a linear function of the
missing bin intensities, with respect to which the expectation is being taken. Because
of this linearity, the expected value of the log-CDLF is obtained merely by substituting
the expected values of the missing data, leading to the expression
L
<2MT(0;0*,MO) = J2 •>t log*<(®), (45) e=i
where the expected intensities are defined by
if I = 1,..., L0
, (46) £|mf|Mo;0*j if I = L0+ 1,.. .,L.
22
The expected intensities for the truncated cells axe derived in appendix C and are given
for I = Lo + 1,... ,L as
£{ra'|Mo;8"} = {^j}*',e')- (47)
This expression extrapolates the intensities from the observed region into the truncated
region by using the ratio { Ko/$o(®*)} as a "probability-to-intensity" conversion fac-
tor and then applying this conversion factor to the unconditional probability in each
truncated bin.
3.2.2 EM Algorithm with Missing Hit Counts and Measurements
The results of the previous subsection are now extended such that the missing
data include both the truncated intensities and measurements.
Definition of Missing Data and Auxiliary Function. The missing data are
defined by the sets
MT = { me: £ = L0 + 1,..., L } ,
Z = {ztk :^=1,...,L, fc = l,...,m*}. (48)
The auxiliary function is defined by
<2z,MT(e;0*,Mo) = E / dZ/>(Z,MT|Mo;0*) logp(Z,MT,Mo ;0), (49)
where the marginalization operators for the missing data are again expressed using the
shorthand notation
L ( oo
£• n E MT e=LQ + l I me=0
„ L mi
•/ 2 B 1 I 1
L mi
j I dzek e=i fc=i
CDLF and Posterior Distribution. The distribution p(Z,Mx,Mo;@) can be
factored using Bayes' rule as
p(Z,MT,Mo;0) = p(Mo;0) p(MT | M0 ; 0) p(Z|MT,Mo;0)
= p(M0 ; 0) p(MT I Mo ; 0) p(Z| Mc; 0), (50)
23
where p(Z| MT, Mo; 0) = p(Z| Mc; 0) because, once the missing intensities are given,
the concatenated set {Mo,MT} corresponds to the set of intensities for the complete
histogram. Equation (50) serves as the starting point both for obtaining the posterior
distribution of the missing data and for evaluating the CDLF to be substituted into
equation (49). A suitable expression for this latter purpose is obtained by substituting
equations (36), (40), and (13), and canceling terms, which gives
L m.(
p(Z,MT,Mo;0) = c(M0) c"(MT, K0) J] ]\ p(z<fc;0). (51) e=i k=i
The log of the CDLF is therefore given by
L 7Tl£
logp(Z,MT,Mo;0) = logc(Mo) + logc-(MT,^o) + J]J]logp(z^;0), (52) e=i fc=i
such that, after ignoring constant terms logc(Mo) and logc~(MT, A^o), the auxiliary
function becomes
~ L me
<2Z,MT(0;0*,MO) = 52 I dZp(Z,MT|Mo;0*) J2 52 ^gp(zek-,e). (53) MT
Jz e=i fc=i
The missing-data posterior distribution is obtained by dividing equation (50) by p(Mo ; 0),
which gives
p(Z,MT|Mo;0*) = p(MT|Mo;0*)p(Z|Mc;0*). (54)
Conditional Expectation. The expectation operation in equation (53) can be de-
composed as
52 f dZp(Z,MT|Mo;0*) = ^p(MT|Mo;0*) f dZ p(Z|Mc;0*), (55) MT ^Z MT
Z
allowing the auxiliary function to be expressed as
Qz,MT(0;0*,Mo) = ^p(MT|Mo;0*) MT
P L me
x dZp(Z\Mc;e*)5252loZp(Z(k'@^ (56) Jz e=i fc=i
Noting equation (16), the second line of this expression is just the auxiliary function for
the complete histogram, such that
<2z,MT(0;0*,Mo) = 52p(Mr\M0;Q*) QZ(@-&\MC). (57) MT
24
Substituting the final form for Qz(0;0*,Mc) given in equation (19) again leaves an
expectation of a function that is linear in the missing intensities. Paralleling the case
in which only the intensities are missing, the auxiliary function simplifies to
L _ „
Qz,MT(0;0*,Mo) = £ -r^v / dz p(z;0*) log p(z;0), (58) j^[ *HW ) Jzi
where fh( is defined in equation (46). Once again it is impossible to go any further
without imposing a form on p(x;0). For the special case of Gaussian measurements,
maximization of equation (58) over the mean vector // and covariance matrix P yields
estimates that are equal to those in equation (25) and equation (26), but with me re-
placed with fhe and K replaced with K = Yle=\ •e- The Gaussian parameter estimates
after each iteration of the EM algorithm with truncated histogram data are thus
where u>^(0*) and $7^(0*, jx) are defined in equations (23) and (27), respectively.
3.3 RELATIONSHIP BETWEEN COMPLETE AND TRUNCATED HIS-
TOGRAMS
In the analysis of truncated histograms, the auxiliary function with missing mea-
surements and histogram intensities reduced to an expected version of the auxiliary
function for the complete histogram. This occured because of the similarity of the
CDLF for the two situations. It is useful for later developments with mixtures to flesh
out this equivalence more generally. Consider the generic problem in which the missing
data contain some set of variables S. For a complete histogram, the CDLF in most
cases of practical interest can be expressed as
p(S, Mc ; 0) = p(Mc ; 0) p(21 Mc ; 0) L
= c(Mc) J] {$/(e)}"" p(S|Mc;0). (61)
The CDLF for the truncated histogram, on the other hand, is given by
p(S,MT,Mo;0) = p(MT,Mo;0) p(3|Mc;0)
L
= c(M0) c-(MT,K0) n{^(0)}m' P(S|MC;0). (62) e=i
25
With the exception of the coefficients, these two expressions are identical. Substituting
the definitions of the multinomial and negative multinomial coefficients in the CDLF
for the truncated histogram gives
KQ\ (K-l)\ c(M0) c-(MT,K0) =
{ilfcW} (Ko-l)\ {Yle=Lo+lrne\}
K
= (jf) c(Mc), (63)
such that the two CDLFs are related by the expression
p(S, MT, MO ; 0) = (jf) P& Mo; O). (64)
This expression gives an indication of the likelihood "loss" that results from the unob-
servability of some of the intensities. This relationship between the CDLFs results in a
similar relationship between the auxiliary functions for the two problems. The auxiliary
function for the truncated histogram can be written as
<2S,MT(0;0*,MO) = J>(MT|Mo;0*) J dZ p(S|Mc;0*) logp(H,MT,M0 ;0)
= ^p(MT|Mo;0*) / dSp(S|Mc;0*)
x {log(^) +logp(S,Mc;0)}.
The random variable K is a, function of 0*, but not of 0. Since Ko is observed, the
term log(A'o/^) is independent of 0, and the auxiliary function effectively becomes
Qs,MT(0;0*,Mo) = ^p(MT|Mo;0*)Qs(0;0*,Mc).
Furthermore, the parameter estimators for the complete histogram usually depend lin-
early on me, as was the case in the previous subsection. The approach used above
therefore generalizes to most cases of interest; that is, the truncated-histogram esti-
mators are obtained by substituting the expected intensities into the corresponding
complete-histogram estimators. One thus never need explicitly derive the estimator for
the truncated histogram, so long as the complete-histogram estimator exhibits this lin-
ear dependence on the cell intensities. Since this is also the case with the mixture models
26
discussed in the next section, the results are formulated for complete histograms only.
The qualifiers (i.e., subscripts I, C, and T) on histogram-related variables are therefore
dropped in the next section, with the understanding that variable me in a given ex-
pression is an observed intensity for cells in ZQ and an expected intensity for cells in
27(28 blank)
4. HISTOGRAM MIXTURE MODELS
This section discusses parameter estimation for finite mixture models, which in-
troduce another form of missing data, namely the mode assignments. Subsection 1
outlines the signal-in-noise mixture model; subsection 2 reviews estimation from point
measurements, which illustrates in isolation the issues surrounding mode assignment
uncertainty. Subsection 3 then generalizes this to estimation from histogram intensity
data, wherein the point measurements and modes assignments are missing data.
4.1 MODEL DEFINITION
Let the distribution of interest (i.e., the "signal distribution") be denoted ps(z; 0S).
At issue is the fact that, in noisy data, some of the measurements do not belong to
ps{z:0s), but instead belong to some extraneous distribution po(z;0o) (i.e., the "noise
distribution"). If one attempts to estimate 0 s in the signal distribution without account-
ing for the noise measurements, then the resulting parameter estimates are distorted.
Noise measurements are accommodated using the two-mode "signal-or-noise" mixture
distribution given by
p(z ; 7T0, 0O, 0S) = 7T0 p0(z; 0o) + (1 ~ TTO) Rs(z ; 0S), (65)
where 7TQ is the unconditional probability that a measurement belongs to the noise. In
general, the model distributions ps(z;0s) and po(z;0o) can have different parametric
structures (e.g., Gaussian signal distribution and uniform noise distribution).
From this simple statement of the model, more sophisticated models are obtained
by letting the individual signal or noise components themselves be mixtures, resulting
in a mixture of mixtures. This is most valuable when ps(z;0s) and/or po(z;0o) are
either unknown or so complicated that they lead to intractable estimation problems.
In the discussion below, a separate mixture is included for the signal distribution only;
a mixture model for the noise component is constructed similarly. Thus, the signal
distribution is expressed in terms of modal distributions as
J
ps{z ; 0S) = ^2 $i Pi(z> °^ ' (66) .7 = 1
where Pj(z; 0j) is the jth modal distribution and ipj is the unconditional probability that
the jth mode is valid (i.e., that z is actually governed by the jth modal distribution).
The collection of these probabilities satisfies the constraint X)»=i ^j = 1- Substituting
equation (66) into equation (65), and defining 7Tj = (1 — 7r0) tpj for j = 1,..., J, results
29
in an overall mixture distribution, including signal and noise models, given by j
p(z;e) = £>*»(•;**)• (67) j-"
This model is parameterized by the set 0 = {n, 0Q, 0 s}, where n = {no, TTI, ..., 7Tj}
is the collection of unconditional mode assignment probabilities, 0O contains the noise
distribution parameters, and 0s = {0\,... ,0j} is the collection of parameters for all
modes of the signal distribution. The right-hand side of equation (67) is a convex
combination since 5Z,-=0 TTj = 1.
4.2 ESTIMATION FROM POINT MEASUREMENTS
The assignment uncertainty issue is the defining characteristic of finite mixture
densities. This issue is now considered in isolation by considering the problem in which
a set Z of independent point measurements are observed. While the resulting algo-
rithm is of value in certain cases in which feature values of observed data samples are
directly observed (e.g., see [9]), this is a rather artificial problem in the context of
histogram modeling and is presented merely as a stepping stone to the more general
problem description in the next subsection. In any case, when observations of the point
measurements are assumed to be available, an optimal parameter estimation algorithm
maximizes the ODLF given by
p(Z; 0) = ft | E *"ii &(**; eh) \ • (68)
Direct optimization of this likelihood function with respect to 0 is generally difficult
because of the interaction between mixture modes.
4-2.1 EM Algorithm with Missing Assignments
The EM algorithm for mixtures circumvents the mode interaction issue by treating
as missing data the mode assignments (i.e., the fa). The idea behind this choice is
that, if the mode assignment were known for each measurement (i.e., if the component
distribution actually governing each measurement were known), then the measurements
could be grouped according to mode and the parameter estimation problem would
decompose into J independent problems that are easier to solve. The EM algorithm
does induce such a decomposition, albeit in each iteration of an iterative procedure.
Definition of Missing Data and Auxiliary Function. The set of missing data
corresponding to the observed measurements in Z is denoted as
J = { ji,h, ••JKJ, (69)
30
such that the auxiliary function is given by
Qj(0;0*,Z) = Y, MJ|Z;0*) logp(Z,J;0). (70) J
Given that the measurements are independent, the joint density is a product of terms
that allows the missing-data marginalization operation to be expressed using the short-
hand notation
3 fc=l I jk=l )
CDLF and Posterior Distribution. The CDLF is obtained by noting the in-
dependence of the measurements and by applying Bayes' rule on a measurement-by-
measurement basis, which gives
K K
p(J,Z;0) = H P(jk;&) P(zk\jk;&) = J] nJkPjk(zk;0jk), (72) jt=i fc=i
where p( jk; 0) = 7rJfc and p(zk\jk; 0) = Pjk(zk;0jk). The log-CDLF is therefore
K
logp(J,Z;0) = Y, {^^jk+logP]k(zk;dJk)]. (73) fc=i
Bayes' rule then gives the posterior distribution as to obtain
where 7fcjfc(©*) = p(jk\zk]&*) is the single-measurement posterior assignment proba-
bility, which is defined (and computed) as
W«n= . W'^' .. (75) {Ei,T?p.(^;9,*)}
For all values of k, these posterior probabilities satisfy
.;
£7^.(0*) = !. (76)
Conditional Expectation. Given the structure of equation (74), the conditional
expectation in equation (70) is a sequence of single-variable expectation operations,
which is expressed as a product of operators as
J>(JIZ;0*) = n|E^(0*)}- (77)
31
If evaluated as it stands, then all of the single-variable expectations in this expression
sum to one (hence the overall expression is one). Similarly, when the conditional-
expectation operator is applied to the log-CDLF in equation (73), all of the single-
variable expectations sum to one, except those for which the marginalization index
matches the index jk appearing in the log-CDLF. The auxiliary function therefore re-
duces to K j
Qj(0;0*,Z) = £ Y, 7^(0*){log^ + log^(zfc;^)}. (78) fc=i jfc=i
At this point, the index k on the assignment variable adds no meaning since the sum-
mation covers all possible values of the assignment. Furthermore, since jk is missing
data, there is no actual "observed value" for jk that is tied to a particular measurement
Zfc. This index is therefore dropped to obtain K j
Qj(0;0*,Z) = ]T E^(0*){loS^ + 1°g^(z^^)}- (79) fc=i j=i
Without the ^-dependence, the summation over j can be moved in front of the summa-
tion over k, and the auxiliary function can be decomposed into a collection of component
auxiliary functions, each of which depends on a separate subset of the model parameters.
This decomposition is defined as j J
Qj(0;0*,Z) = ^QJ(7r,;0*,Z) + ^gJ(^;0*,Z), (80)
where K ,.
Qj(7r,;0*,Z) = J2 7**(©*) logTrj, (81) fc=i
K
Qj(Oj;<S>\Z) = Y 7fcj(0*) logpjfaOj). (82) k=\
Given the mutually exclusive nature of the parameters in these various components, the
M-step of the EM algorithm decomposes into a set of independent optimization prob-
lems, one for each component. The assignment probabilities are obtained by maximizing
equation (81), and the parameters in each mode density are obtained by maximizing
equation (82).
Estimation of Assignment Probabilities. Equation (81) is maximized using La-
grange multiplier techniques to obtain the estimator
Kj = K > (83)
32
where KJ(@*) is the effective number of measurements corresponding to mode j given
by
K
«*(©•) = £7^(0*). (84) fc=i
The effective numbers of measurements for all modes satisfy
j
J2*A®*) = K. (85) j=i
Gaussian Mode Estimation. Each component in equation (82) is used to individ-
ually estimate the parameters in one of the modal distributions. For Gaussian mode
densities, pj(z;0j) = A/"(z; //;,Pj), the optimization problem decomposes into J inde-
pendent (single-mode) Gaussian estimation problems, which are discussed in appendix
D. Drawing on the results of that appendix, equation(82) is maximized by
1 K
h = ^?en £ fwCe*) z*> (86) f(e*)
K
P> = ^7&) £ 7«(0*) (zfc - A,) (* - Ai)T • (87) k=\
These Gaussian parameter estimators take the above form regardless of the parametric
structure of any other modes. That is, not all modes need be Gaussian.
4.3 ESTIMATION FROM HISTOGRAM DATA
To set the stage for histogram mixture estimation, the unconditional probability
of any given histogram cell under the mixture model in equation (67) is given by
$,(©) = / p(a;e) = Y^rtj f dzPj(z;Oj). (88) J Z( j_Q J Z(
This can be alternatively expressed by defining the "mode-bin probability"
<t>ti[ej) = j dzP]{z;03), (89) Jzt
such that the overall bin probability is
.)
**(e) = X><M^)- (90)
33
The "observed" data are the intensity vector
M={mi:€=l,...,L}, (91)
and the total number of measurements is
L
K = ]T mt. (92) £=1
As discussed in section 3.3, the intensity vector may contain place-holders for expected
intensities in the truncated region of a truncated histogram.
4-3.1 EM Algorithm with Missing Assignments and Measurements
Definition of Missing Data and Auxiliary Function. For histogram estimation
of mixture-distribution parameters, the missing data include the measurements and
mode assignments, which are denoted
Z = I ztk : I = 1,. • •, L, k = 1,..., me J ,
J = I jtk • I = 1, • • •, L, k = 1,..., mi \ .
The auxiliary function is then defined as
Qj,z(0;0*,M) = f dZT p(J,Z|M;0*) logp(J, Z,M; 0), (93) Jz j
where the marginalization operators for J and Z were defined in equations (71) and
(11), respectively, and are given for the current case as
L mi
dzek z,
L me ( J
E-nn E j e=i fc=i Kjik=Q
CDLF and Posterior Distribution. The CDLF is obtained using Bayes' rule as
p(J,Z,M;0) = p(M;0)p(J,Z|M;0)
L m(
= C(M) n n ^« Puk (z^; «i«). (94) e=i fc=i
34
whose logarithm is
logp(J,Z,M;0) = logC(M) + ^^{log7r^ + logpm(z,fc;^j}. (95)
The posterior distribution of the missing data is then obtained by dividing equation
(94) by the definition of p(M ; 0), given in equation (6), which yields
P(J,ziM:e-) = nn{^pp}. w
Conditional Expectation. Given equation (96), the conditional expectation opera-
tor is a product of operators given by
/ dZ £ p(J,Z|M;0*) = J] II IT• £ *L / d2ekP3ek(zek;0*ek)
(97)
Except when operating on some function containing z^, each term in the large brackets
is unity since
4>,(0*) = £ "J* / ^ Pn^k;9)tk). (98)
Therefore, as in previous cases, when the conditional expectation operator is applied
to the CDLF, all components of the expectation marginalize to one except those for
which the £ and k in the expectation match the £ and k in the CDLF. In fact, it would
be more correct notationally (but much uglier) to denote the indices in the expectation
operation as £' and k' to distinguish them from the £ and k that appear in the CDLF, and
then introduce a Kronecker delta function for bookkeeping. Whether said in symbols
or words, the net result is that all expectation terms marginalize to one except those
for which £' = £ and k' = k. Noting all of these "unit marginalizations" and dropping
the term logc(M), the auxiliary function reduces to
L me J
QJiZ(0;0*,M) = £ £ -j-• E <* / **%*(«*;*D e=i fc=i n ' jek=o Jz<
j lognjlk + \ogPjek (zek ; euJ } . x
At this point, the indices £ and k on the missing variables j and z impart no infor-
mation since the summation over the assignment variable and the integration over the
35
measurement covers all possible values, independent of I or k. They are thus dropped
to obtain
L mt -, J r-
gJiZ(0; 0*, M) = £ Y. ^7©^ E *J / dz P^Z s *;){ 1(w + 1(w(z; *;)} •
f=l jfc=i ^ > j=0 ^
Without the dependence on i and A;, the summation over j can be moved forward, and
the sum over k becomes a constant multiplier me, allowing the auxiliary function to be
expressed as
j J
Qj,z(0;0*,M) = ^Qj,z(7r,;0*,M) + ^ QJ)Z(^;0*,M), (99) 3=0 3=0
where the components in this expression are defined as
QJtZ(n3; 0*, M) = *J £ —±-r / dz Pj(z ; &*) log*,-, (100) p -j * V / «/ 2^
L .
gJ)Z(^;e*,M) = TT; £ —^ / dzP3(z-e*) logp^e,). (101)
With the exception of the factor 7r*, the form of this last component is identical to
the auxiliary function QZ(0J;0*,M) for the non-mixture case. During the M-step,
when equation (99) is differentiated with respect to 0j to find a critical point, the
derivative is n* times the derivative of Qz{0j\ 0*, M). When equated to zero, the
iTj term drops out and optimization of Qjtz(^j', 0*,M) is achieved by independently
optimizing Qz(0j',®*,M.) for each 6j, making all of the earlier (non-mixture) results
applicable. That said, the discussion below for Gaussian densities includes the factor
TTj in order for the effective number of measurements to fall out as an intermediate
variable. In effect, this just means that the expressions for /i- and Pj include a term
*?/*; -1. Estimation of Assignment Probabilities. Optimization of the assignment proba-
bilities is performed using equation (100) directly. A convenient form for that expression
is obtained by noting the definition of <Ev(0), and defining
•K* L dzpj(z;d*) 7^(0*) = ( / JZ\ ' -, • (102)
{EH)*? /**»(•;*;)} With these so defined, the expression for Qj,z(^j', 0*, M) then becomes
QJ,z(7rj;0*,M) = }f^me 7^(0*) I log^ . (103)
36
Aside from the weight me, this is identical in form to the case with observed measure-
ments. The EM update for the assignment probability is thus given by
1 L
*i = K Em^(0*)" (104) e=\
As was the case for observed measurements, this expression is independent of the form
of the modal distributions.
Gaussian Mode Estimation. When the jth mode is Gaussian, the corresponding
component of equation (101) becomes
L
Qj,z(0j;0*,M) = TT* £ -^ / dz^(z;/i;,P;) logA^(z;/x,,P,) . (105)
Substituting the Gaussian density in the log-CDLF term of the auxiliary function gives
Qz(ej;e*,M) = | loglp-11 J2 i^k) Jzt rfz^(z;M;,p;)
- y E ^y jZi d* ^ M;. P?) (• - ^)T P71 (• - Mi) • doe)
To parallel the non-mixture case, it is convenient to define the weighted "mode-specific"
local moments
utjty) = [ dztf(z;n*P*) z. (107) Jze
nej(0*) = f dzAf(z;/z*,P*) zzT. (108)
Also, the effective number of measurements corresponding to the jth mode is
«*(©•) = *;E $wr **«)• (109)
While the meaning of this variable is the same as in section 4, the definition is altered
relative to equation (84) to accommodate the missing measurements.
Again drawing on the results of appendix D, the estimators for the mean and
covariance of the jth (Gaussian) component are
1 L
^ = ^)^wi *i««W. (n°)
Pj -^£«^"JM!.W. d")
:$7
where f2^(0*, /x-) = Jz dz A/"(z; /x*, P*) (z — iXj)(z — fij)T is the mode-specific center-
shifted second local moment, computed as
ft»(«;,/y = n^p-2^ ««,(«;)+# &;(*;)• (112)
Estimating a Scalar Gaussian Signal in Uniform Noise. If the measurement
space is scalar, the signal distribution contains a single Gaussian component, and the
noise is considered to be uniform over the observable region of measurement space, then
the model distribution is the two-component mixture given by
p(z; TTS, //, a) = nsX(z; ix, a2) + (1 - irs) f - j , (113)
where w = max(2o) — min(2b) is the width of the observable region and of the uniform
noise distribution. Note that the above expression has been parameterized in terms
of the mixing weight irs for the signal component, as opposed to parameterizing in
terms of 7To = 1 — ns as was done earlier in the section to accommodate mixture signal
distributions.
One nice property of the model in equation (113) is that the noise component
contains no unknown parameters. The unconditional bin probabilities for the noise
component can therefore be computed once at the outset of the estimation algorithm;
they need not be updated in each EM iteration since nothing changes. Figures 5 and
6 show EM iteration results for the Gaussian-signal-in-uniform-noise model under two
scenarios. In generating both figures, the model was used to synthesize a number of
measurements, these measurements were binned into a set of histogram cells, and the
histogram intensities were used to estimate the model parameters. For figure 5, a
very large number of synthetic measurements was generated and binned such that the
histogram has a large overall intensity and the individual intensities are quite accurate.
For figure 6, a much smaller number of synthetic measurements was generated such that
the histogram has a small overall intensity and the individual intensities are noisy. As
one might expect, the final EM estimates computed from the noisy histogram data are
degraded from those computed from the accurate histogram intensities. They are still
quite close to the true values, however. Furthermore, as figure 6 suggests, the estimate
of the mean (i.e., the location parameter) is far superior to what would be obtained if
the location of the largest histogram count were chosen (i.e., peak picking).
38
0.05
0.025
0
0.1
0.05
0
0.2
0.1
0
0.2
0.1
Weighted Mixture Components Observed Hit Counts and Approximate Distribution
Iteration 0
-10 0 10
Iterj ition 10
/\ - / ^ -10 0 10
Iters ition 20
A -/ V -10 0 10
Iterc ition 58
A ' v -10 10
0.2
0.1
0.2
0.1
0.2
0.1
0.25
0.125
* •
• •
-10 0 10
• •
I -10 0 10
_/V r——* -10 0 10
l\ -10 10
Figure 5: Gaussian-Uniform Mixture Estimation, High-Intensity Data Histogram-based parameter estimation for a two-component mixture dis- tribution with Gaussian and uniform components. Histogram data were obtain by binning K = 106 measurements that were generated from the ideal mixture distribution with parameters u = A, a = 1, and irs — 0.4. The large number of measurements gives "clean" histogram data, and the es- timated parameters are extremely accurate at u = 3.99937, a = 1.00260, and 7rs = 0.39925. The first and last rows correspond to the initial and final parameter estimates, respectively. The second and third rows correspond to intermediate iterations of the EM algorithm.
39
0.05
0.025
Weighted Mixture Components
Iteration 0 ^->.
Observed Hit Counts and Approximate Distribution
Figure 6: Gaussian- Uniform Mixture Estimation, Low-Intensity Data Histogram estimation of parameters in the two-component mixture dis- tribution described in the caption of Figure 5. In this case, however, pa- rameters are estimated from "noisy" histogram data with K = 200. The final parameter estimates are fi = 4.03696, a = 1.18893, and 7rs = 0.49568.
40
5. SUMMARY
This report has examined histogram estimation techniques for inferring the statis-
tical properties of physical variables from observed intensity data. The motivation for
using these estimation algorithms is to reduce bias and other artifacts that can occur
when traditional peak-picking or simple interpolation algorithms are applied to intensity
data, particularly when the data exhibit significant spreading in the variable of interest.
The report opened with a high-level consideration of the theory, highlighting some of
the primary characteristics, issues, capabilities, and limitations of the theory and algo-
rithms. Great emphasis was placed on the fact that histogram techniques are based on
discrete stochastic theory (for integer-valued intensity data) and the issues that arise
when the theory and techniques are applied to real-valued acoustic energy data. As
it turns out, this does not present a significant difficulty except in Bayesian contexts
where the maximum likelihood (ML) estimate must be weighted with a prior distribu-
tion. The relative weighting of the prior and ML distributions remains a significant
open issue when using histogram methods with real-valued energy data.
After the introduction and consideration of issues involved with application of the
methods, the histogram estimation algorithms were developed from first statistical prin-
ciples by drawing on significant background material provided in a set of appendixes.
The algorithms were "built up" by considering the simplest cases first and then adding
complexity. The in-depth tutorial examination given here provides the detailed theoret-
ical developments that were omitted from earlier works such as McLachlan and Jones
[2]. This report also limited the discussion to static distributions, which are subtle
and interesting enough to warrant consideration in and of themselves. Limiting the
discussion to the static case also avoids the significant unresolved issues and notational
baggage that is incurred when histogram techniques are considered in the context of
dynamic tracking algorithms (e.g., see [3] and [5]). The overall goal of this report was
to arm the reader with adequate background and insights to apply and extend these
powerful and versatile methods for a variety of applications.
41(42 blank)
APPENDIX A
EXPECTATION-MAXIMIZATION ALGORITHMS
The goal of this appendix is to introduce the basic ideas, terminology, and no-
tation for the expectation-maximization (EM) approach to algorithm design, which is
the common thread through all of the algorithms discussed in this report. The EM
approach provides a general template for generating iterative maximum-likelihood al-
gorithms that estimate the parameter vector 0 in a probability density function (PDF)
p(x;0), given observations of the variable random variable x. The EM approach is
most useful when it is difficult to directly maximize p(x; 0) with respect to 0, but
there exists an auxiliary variable £, such that maximizing the joint PDF p(x,£;0) is
straightforward. The approach replaces a difficult nonlinear optimization problem with
an iterative sequence of easier problems. The EM method is extensively used in modern
statistical analysis because it greatly simplifies algorithm development for many prob-
lems, it has guaranteed convergence under some fairly general conditions, and many
statistical models have obvious choices for missing data. The general properties of
the EM algorithm are discussed in [1], and numerous applications and extensions are
discussed in the text by McLachlan and Krishnan [10].
Observed, Missing, and Complete Data. EM algorithms all share the charac-
teristic of having observed, missing, and complete data. The observed data consist of
samples of data that are at hand, representing either physical measurements from a
sensor or features that are computed from physical measurements. The collection of
such samples is denoted X = {xn : n — 1,..., N}. The density function p(X; 0) is
referred to as is the observed-data likelihood function (ODLF).
The missing data contain the information that, were it known in addition to the
observed data, causes the difficult estimation problem to reduce to a simpler one. For
a given set of observed data X, the corresponding collection of missing data is denoted
S = {£„, : m = 1,... , M}. The concatenated set containing both the observed and
missing data is referred to as the complete data, and the joint distribution p(X, S; 0) is
referred to as the complete-data likelihood function (CDLF). As mentioned above, the
fundamental premise of EM is that p(X, 2; 0) is much easier to optimize with respect
to 0 than isp(X;0).
Iterative Structure and Auxiliary Function. The EM algorithm is iterative. It
requires an initial estimate for 0, which is obtained using an application-dependent sub-
optimal algorithm (the preferred method), or by random selection within some reason-
able range of values. Denoting the initial estimate by 0°, the EM algorithm generates
A-l
the sequence of estimates denned by
0i+1 = argmaxQa(©;©',X), (114)
where <5 = (0;0\X) is the so-called auxiliary function, which will be formally defined
in a moment. While the literature typically denotes the auxiliary function simply as
Q(0;©*), this report discusses a number of different auxiliary functions with differ-
ent sets of observed and missing data. To more easily distinguish these various cases,
the notation used here includes a subscript to denote the missing data and an addi-
tional argument to denote the observed data; thus the subscript S and argument X in
<5S(©;0*,X).
When iterating equation (114) to optimize the ODLF, the iterations are termi-
nated either at a pre-selected number of iterations or when some set of convergence
criteria is satisfied. When convergence tests are used, they are typically based on the
relative change in the observed-data likelihood and/or values of the estimates, similar to
convergence tests for standard nonlinear optimization techniques like the Newton and
gradient ascent algorithms.
Equation (114) includes the iteration number as an index. When using the EM
approach, there are often a large number of other indices that track a number of other
characteristics, making index variables a rare notational commodity. It is therefore
convenient to define the parameter estimates as 0 = 0t+1 and 0* = 0\ such that the
"typical" EM iteration is
0 = argmax Qs(0;0*,X). (115)
The auxiliary function <5H(©;@*,X) is defined as the expectation of the log of
the CDLF, conditioned on the posterior distribution of the missing data, which is stated
mathematically as
QH(0;0*,X) = £S|X;e.{logp(X,S;0)}
= y<iHp(E|X,0*)logp(X,S;0). (116)
Here the integral notation f dE is used to indicate marginalization over the missing
data 3; in general, this is a sequence of continuous integrations, discrete summations,
or both. Since missing data with both continuous and discrete variables are common,
sequences that mix integrations and summations are also common.
The basic idea behind EM is well illustrated from the viewpoint of iterative mi-
norization (IM), which is an even more general class of algorithms to which the EM
A-2
method belongs. Figure A-l shows two iterations of a generic IM algorithm. As dis-
cussed in the figure caption, each iteration maximizes an auxiliary function that "mi-
norizes" the function being maximized, which for the EM algorithm is the ODLF. That
the EM auxiliary function minorizes the ODLF follows from the nature of missing data,
in the sense that including a stochastic nuisance parameter in a model always drives
down the likelihood relative to the nuisance-free model because of the added uncertainty
associated with the nuisance parameter.
Expectation and Maximization Steps. Each EM iteration consists of an ex-
pectation step (E-step) and a maximization step (M-step). The E-step involves eval-
uating equation (116), and the M-step involves computing estimates that maximize
equation (116) with respect to 0. The process for deriving the auxiliary function in
the E-step is fairly universal across applications and is largely an exercise in conditional
probability. The steps are to (1) define the observed data and the ODLF, (2) define
the missing data and the CDLF, (3) determine the posterior distribution of the missing
data, and (4) carry out the conditional expectation to obtain the auxiliary function. In
many applications, the missing data are chosen such that the conditional distribution
p(X| S; 0) can be written in closed form, and there exists a known unconditional (prior)
distribution p(S; 0). In such cases, the CDLF is obtained as
p(X, S; 0) = p(X| S; 0) p(S; 0). (117)
Given the ODLF and CDLF, the posterior for the missing data is given by
*«*•>-ISP- <118> In other applications, it may be more convenient to first derive the posterior distribution
p(S|X; 0) and then obtain the CDLF as
p(X,S;0) = p(S|X;e)p(X;0). (119)
The preferred order for evaluating these distributions depends on the structure of the
missing data and the distributions involved, but some form of these steps will be en-
countered whenever EM is applied in a new situation. The final stage in deriving the
auxiliary function involves carrying out the conditional expectation, which is usually
simplified by noting independence or conditional independence among the variables.
More will be said about this process in the context of particular problems.
While the details of the M-step cannot be specified without formulating a partic-
ular parameterization of the CDLF, these details typically follow from standard con-
A-3
e *-\ e
Figure A-l: Iterative Minorization (IM)
Two IM iterations are shown; the first depicted in red and the second in
green. The function L{6) to be maximized is shown in black. The first
iteration maximizes an auxiliary function Q{0e\0e'~x), which "minorizes" e\ni-l^ the likelihood function at 0 ; that is, Q(6 \6 ) is strictly less than L(6
at every point 6 except 6 ~ , where the two functions share the same
function value and first derivative (i.e., the two functions are tangent at
Oe~l). Under these minorization constraints, maximization of Q((r\fr~ )
is guaranteed to increase L(0) unless Oe~l is already a stationary point of
L{6), in which case no changes take place. The second iteration repeats
the process with the tangent constraint imposed at the point 0e produced
by the first iteration.
A-4
strained or unconstrained optimization methods. For example, a stationary point is ob-
tained by equating to zero the derivative of QH(0; 0*, X) with respect to each element
of 0. The stationary point is a maximum if the second derivative (Hessian) matrix is
negative definite (i.e., all of its eigenvalues are strictly negative) at the stationary point.
Fortunately, the auxiliary function is concave for many problem formulations involving
Gaussian or Gaussian-mixture model densities, which means that a stationary point for
such an objective function is the unique maximum. Appendix D demonstrates that the
auxiliary function for Gaussian histogram estimators is, indeed, concave.
A-5(A-6 blank)
APPENDIX B
COMBINATORIAL PROBABILITY DISTRIBUTIONS
Histogram-based methods are inherently combinatorial, since they consist of look-
ing at all combinations of ways that events can be grouped into a collection of histogram
cells. This appendix reviews the binomial, negative binomial, and multinomial distri-
butions to establish notation and to give an easy single reference for these basic distri-
butions.
Binomial Distribution. Perhaps the simplest experimental situation involves a
series of Bernoulli trials, whose outcomes are limited to one of two categories, "success"
or "failure". A common question with regard to Bernoulli trials concerns the number of
successes that might be observed over the course of n independent trials. This situation
is described by the binomial distribution. Let the probability of success in any single
trial be denoted by 0, such that the probability of failure is (1 — 0). To have exactly m
successes in n independent trials, there must also be (n — m) failures. The probability
of the m successes is 0m; the probability of the (n — m) failures is (1 — 0)(n_m); and
there are b(m, n) ways in which this situation can occur in n trials, where b(m, n) is the
binomial coefficient:
b(m,n)= "' (120) ml (n — m)\
The binomial distribution measures the probability of the event "m successes out of n
trials." The distribution is therefore defined as
p(m; 0. n) = b(m, n) 0m (1 - 0)(n-m>, (121)
which satisfies the marginalization constraint
^p(m;0,n) = 1. (122) m=0
Negative Binomial Distribution. In the binomial distribution, the number of trials
is a parameter, and number of successes is a random variable. Suppose, however, that
the question is turned around to ask "How many trials are needed to achieve a certain
number of successes?" Here, the number of successes is the parameter, and the number
of trials is the random variable. This situation is described by the negative binomial
distribution
p(n ; 0, m) = b~(n, m) 0m (1 - 0)(""m) , (123)
B-l
where b (n, m) is the negative binomial coefficient, defined as
b~(m, n) = b{n - m - l,n - 1) = -. V~/ —rr • (124) (m — l)!(n — m)!
The negative binomial distribution satisfies the marginalization constraint
oo
£p(n;0,m) = 1. (125) n=m
Multinomial Distribution. Now consider an independent series of trials in which
each trial has L categories of possible outcomes, rather than simply success or fail-
ure. With no other information, the probability of an outcome occuring in the ith
category is denoted <pe, and the collection of probabilities for all categories is 0 =
{4>e : I — 1,. •., L}. When actually performing a series of trials, an observed outcome in
one of the categories is referred to as a "hit" in that category. The number of hits ob-
served in each category over the series of trials is denoted me for £ = 1,..., L. These "hit
counts" are collected in the vector M = {me : (. = 1,..., L}, whose statistical properties
are governed by the multinomial distribution
L
p(M;0) = c(M)J] ft', (126) e=i
where c(M) is the multinomial coefficient, defined as
(Zti me) ! c(M) = f- 4-. (127)
Unlike the binomial distribution, the number of trials is not an explicit parameter
because the multinomial distribution implicitly assumes that the sum of the hit counts
for all categories equals the number of trials (i.e., that all outcomes are counted). If
this is not the case in a given application, then the multinomial distribution is not the
correct model.
Equation (126) assumes that any outcome not fitting into one of the defined
categories has zero probability of occurrence. Alternatively stated, it assumes that,
with probability one, all possible outcomes fall within one of the categories, such that
L
£>< = ! • (128) If, on the other hand, events occur with nonzero probability that do not fit any of the
categories, then the statistical model must be modified. In particular, consider the case
B-2
of "pre-sereened" data, where any trial whose outcome does not fit a given category
is ignored entirely; not only is the hit not recorded, but the counter that records the
total number of trials is not incremented. The sum of hit counts equals the number of
recorded trials, so the general form of the multinomial distribution is still valid. The
probabilities for the categories, however, must be normalized to account for the fact that
the universe of outcomes has been projected down to the union of the known categories.
Defining the probability that a hit occurs in any of the categories as
L
allows the multinomial distribution for this case to be defined as
L / , \ me L
p(M;0) = c(M) II (T) = C(M) *"""' II #* . (13°) e=i ^ ^ ' t=\
where m is the total number of trials that pass the screening process; that is,
L
m = y. me • (131)
Negative Multinomial Distribution. When applying the multinomial distribution
to pre-screened data, the distribution is useful for making inferences only about things
that occur within the observable categories. It is sometimes desirable to extrapolate
inference into categories that cannot be observed. This situation can be thought of in
terms of Bernoulli trials by grouping the observable categories into one single super-
category. The universe of possible outcomes can then be partitioned into two classes,
with a measurement either falling within the observed super-category (a "success")
or falling outside of this super-category (a "failure"). Treating the sum of observed
hit counts as the number of successes in a sequence of Bernoulli trials, the unknown
total number of trials required to have generated these successes is a random variable
governed by the negative binomial distribution in equation (123), with 0 defined by
equation (129) and m given by equation (131). This analysis can be taken a step further
by subdividing the space of unobservable outcomes and making inferences about these
unobservable categories. This last situation is described by the negative multinomial
distribution, which is discussed further in appendix C.
B-3(B-4 blank)
APPENDIX C
STATISTICS OF MISSING HIT COUNTS
This appendix derives the posterior distribution p(Mj | M0; 0) of the missing in-
tensities given the observed histogram counts, and then uses this posterior to obtain
the expected intensities in the unobserved histogram cells.
Posterior Distribution. Development of the posterior density involves making a
number of structural observations concerning the component distributions involved.
This discussion is facilitated by defining the "number of truncated measurements" as
the discrete random variable
L
e=L0+i
The total number of (observed and unobserved) measurements is then the discrete
random variable defined by
Kc = Ko + KT. (133)
Observing that the regions Zo and Zi are disjoint in Z, the posterior distribution
satisfies
p(MT|Mo;0) = p(MT|A'o;0)- (134)
That is, from the standpoint of Zi, it does not matter how the measurements in Zo
are distributed within Zo; only the total number of hits in Zo is important since
that provides evidence for inferring the total number of truncated measurements. The
second structural observation concerns the appearance of these hit-count totals in the
distribution functions. For example, note that in the presence of MT, the variable Ky
carries no additional statistical information, such that
p(MT | A'0 ; 0) = p(MT, Kr \KO;0) 6 I KT - ]T m A , V e=LQ+\ )
where S(-) is the Kronecker delta function, whose value is one for an argument of zero
and zero otherwise. The delta function is introduced to ensure that the distribution has
nonzero probability only when equation (132) is satisfied. Having said this, this delta
function will be dropped to ease the notational burden, with the understanding that
equation (132) is indeed satisfied. A similar observation can be made concerning KQ
C-l
and KT- That is, given the observed hit total Ko, the variables KQ and Ky carry the
same information, such that
p{KT\Ko;0) = p(Kc\Ko-0)S(Kc-Ko-KT) (135)
As before, the Kronecker delta is dropped, with the understanding that equation (133)
is satisfied. Given these observations, the posterior distribution for the missing data is
given by
p(MT\Ko;0) =P(MT,KT\KO;0)
= p(MT\KT,Ko;0) p(KT\K0-6)
= p(MT\KT;0)p(Kc\Ko;O), (136)
where the first term in the last expression follows because once KT is given, Ko does
not bring any additional information concerning MT- The variable MT merely splits
the KT measurements among the Lx bins in ZT, independent of what happens in ZQ.
The conditional distribution |)(Mx | KT ', 0) thus corresponds to KT independent draws
from LT categories, and is given by the multinomial distribution
P(MT\KT;0) = C(MT){M0)Y II {<M0)}m (137) i=L0 + \
The distribution p(Kc \ Ko ', 0) pertains to the total number of Bernoulli trials Kc that
must be executed in order to achieve Ko successes, where a "success" is a hit anywhere in
Zo- The total number of hits is therefore governed by the negative binomial distribution
in equation (123). Noting the relationships between Ko, KT, and Kc, and those between
4>o(0) and <pr{0), the negative binomial distribution for this situation is expressed as
p(Kc\Ko;0) = b-(Kc,K0) { <M0) }*" {<M0)}*T, (138)
where b~(Kc, Ko) is the negative binomial coefficient defined in equation (124). Sub-
stituting equation (137) and equation (138) into equation (136) then gives the posterior
distribution of the missing intensities as a negative multinomial distribution, which is
defined for the present situation as
(139)
C-2
where c (Mj, A'o) is the negative multinomial coefficient, defined as
c-(MT,K0) = c(MT) b-(KcKo) = - (^c-1)! (14Q)
(^o-l)!{ntL0+1^'}'
Expected Values. The expected intensities axe obtained by computing the first mo-
ment of the posterior distribution. Dropping the explicit dependence on the parameter
vector 0 and expressing the negative multinomial coefficient directly in terms of the
missing intensities instead of total hit counts allows the posterior to be written as
(K0 - 1 + Yli-L +i me) ! £ p{Mr\K0) = ±- ;\;Lo+1 > 4>go J] 0-. (141)
(^o-i)!{ntL0+i^'} ^0+i The expected value of the (typical) missing hit count, m^ is defined for Lo < C < L as
£{mc|M0} = ^m<p{Mr\K0), (142)
where the marginalization operator was defined in equation (39), which is repeated here
for convenience as
L f oo ^
E- n E • (i«) MT e=L0 + l K me=0 )
The key piece of information for taking the expectation is the normalization property
of the negative multinomial distribution, that is,
Y,P(MT\K0) = 1. (144)
Forming and simplifying the product m^p(Mx| K0) is greatly facilitated by defin-
ing the variable
me = < (145) me if £ ^ C
me-l if £ = <" -
With this variable so defined, the product m^^Mxl Ko) can be written as
(K0 - 1 + ELL +I rne)\ L
mQp{MT\K0) = ±- f " -AT 0g° <t>7- (146) (Ko-l)\{UlLo+lrh(\\ e=Lo+i
The desire is to obtain an expression that is equal to within a scale factor of the nega-
tive multinomial distribution with the variable me instead of m^. If such an expression
C-3
can be obtained, then the negative multinomial portion marginalizes to unity and the
desired expectation is the scale factor. To get the numerator of the negative multino-
mial coefficient in terms of fhe while retaining the desired structure, it is required to compensate by introducing the variable
KQ = KQ + 1 (147)
such that the numerator is given by
(148)
Now, the marginalization property of the negative multinomial distribution is valid
regardless of the actual value of Ko, so long as the structure is retained and the same
value appears everywhere. The new observed count variable is thus substituted and compensated everywhere it appears in equation (146) to obtain
/ , N (*o-i + £iLo+i^)! 4° A mcP(MT Ko) = f\ , f -T^T- U * (149)
Substituting and compensating the variable me in the final product term and rearranging
then yields
- 1 + Etio+i •<) ! (K0
mcp(MT|^0) ---- -T-h { "V" UKO-1)\{YILLO+I•'
]} "0+1
4° n * 7Tl£ (150)
Since the term in brackets is exactly the negative multinomial distribution in the vari-
ables fhe and Ko, and the scale factor is independent of any of the truncated me, the bracketed term marginalizes under the expectation, leaving the desired expression
(151)
C-4
APPENDIX D
ESTIMATING GAUSSIAN PARAMETERS
This appendix derives estimates and proves the maximality of those estimates for
parameters // and T in the objective function
Q{0) = Ype[ dzAf(z; C, S) log M(z; /x, T) , (152) = Y P< j dzM^ C,S) logjVfz; /z, T
where pe is a non-negative weighting function, A/"(z; £, S) is the observed measurement
density
AT(z;C,S) = |27rS|-1/2exp|-^(z-C)TS-l(z-C)} , (153)
and logA/"(z; //, T) is the model likelihood function (i.e., the log of the model density
function). Omitting the constant log(27r) term, which does not affect the parameter
estimation problem, the model likelihood is
iogAf(z; /x, r) = l- log | r I - l- (z - /i)T r (z - M), (154)
which is parameterized here in terms of the inverse covariance matrix (or information
matrix) T = P_1 to ease the math. The parameter estimates that maximize the ob-
jective function are identified below. The integrations are carried out first to obtain an
expression that is cast in terms of a set of effective measurements. The resulting expres-
sion is then maximized by finding the stationary point of the function and by proving
that the function is concave, such that the stationary point is the unique maximum.
Effective Measurements. Given the form of the model likelihood function, it is con-
venient to similarly decompose the objective function into determinant and quadratic-
form components, giving
Q(0) = m(0) - m(d), (155)
where
Vtf) = \il P' f dzAf(z;CS) log (156)
= \J2 P( I dzX(z;C,S) (z-/x)Tr(z-/i (= 1 * Zi
»»(*) = oLPW dzA/-(z;C,S) (z-M)Tr(z-/x). (157)
D-l
The determinant component is rewritten by defining the bin probability value
(158)
and the effective number of measurements
(159)
to obtain
K m(0) = -log (160)
Equation (157) is rewritten by expressing the quadratic form as the trace of an outer
product, which gives
'/2 1 L f
(*) = 2 Z) P< / <fcjV(z;C,S) tr{r(zzT-2z/xT + /x/xT)} e=i ^Z(
= \Y, Petrlr J dzAf(z;CS) (zzT - 2z/xT + M/xT) }
1 L
= 7, Yl Pf<Mr{r (n«-2w^T + ^/iT)} (161)
i.
= nEpt &tr {r \Pe+(w' ~ >*) (w< ~ A*)T } i (162)
where
fte = 1 / dzAA(z;C,S) ZZ
SZf = J7^ — u?^u>^ .
(163)
(164)
(165)
Vector u)e is the normalized first moment (centroid) and matrix £le is the normalized
second moment of Af(z; C> S) when restricted to the region Ze. Vector u>e is also referred
D-2
to as an "effective measurement" for reasons that will be made clear in a moment-
Matrix Cl( is the second central moment (covariance matrix) of A/*(z; £, S) in Ze.
An interesting form of the objective function is obtained by defining the weighted
sum of covariance matrices
1 J^ ^ = - /, Pe <Pe ?tt (166)
e=i
such that, when 772(6?) is re-combined with the determinant term r)\{0), the overall
objective function becomes
Q(6) = tr {r ft} + ^ Pe <f>t log Af{ut] ^ T) . e=i
(167)
Aside from the trace term, which reflects the cumulative measurement uncertainty
within all bins, the objective function is equivalent to one in which the effective measure-
ments u)( are observed point measurements. Indeed, from the standpoint of estimating
the mean, this problem is identical to one in which the u>i are observed point measure-
ments since the trace term is not a function of //.
Location of the Stationary Point. The stationary point of the objective function is
found by equating to zero its partial derivatives with respect to the various parameters.
Noting equation (161) and the identity
d_ dx
tr{AxxT} = 2Ax, (168)
the derivative of the objective function with respect to the mean is given by
dQ(0) dm(0) dp dfx
= T^2 pe 4>e {u)( - n) (169) e=i
Equating to zero gives the stationary value for the mean as
(170)
The information matrix is estimated by differentiating Q{0) with respect to T,
equating the result to zero, and substituting in place of \x the stationary value ft.
Expressing the objective function as the sum of equations (160) and (162), and drawing
D-3
upon the derivative identities
dXlog A
and
= A"1
d__ dA
tr{AB} = B
(171)
(172)
(173)
gives the desired derivative as
dQ(0) « _ 1 dT
(174)
Equating to zero and substituting p, then gives the stationary value
A-1 1 P = r = - 2_^ Pi 4>i |o< + (<*>< - /i) (w< - A) |
€=1
(175)
Maximality of the Stationary Point. A stationary point is a local maximum of
the objective function if the Hessian matrix of the objective function is negative definite
at the stationary point. Construction of "the Hessian matrix, however, requires taking
second derivatives of the function with respect to all possible pairwise combinations of
the elements of the mean vector and information matrix, which can be a formidable task.
If the objective function is radially concave, however, then it has the nice characteristic
that there is only one stationary point and it is the unique global maximum. The
remainder of this section parallels a proof by Liporace [11] of a test for radial concavity,
which does not require forming all of the second partial derivatives.1 Now, if the function
fails the concavity test, a critical point may still be a local maximum, so the test is not
conclusive in that regard. If, on the other hand, the function passes the test, then no
further analysis is needed.
The basic idea behind the test for radial concavity, which is illustrated in fig-
ure D-l, is to formulate a family of one-dimensional trajectories along the function
that cover every possible direction from the stationary point, and then show that these
one-dimensional trajectories are all individually concave. If the objective function is
continuous at the stationary point, and the "hill curves downward" in every direction 1 Indeed, the test outlined below would be a test of absolute concavity were it not for the pathological family of
functions in which non-concave behavior can occur along contour lines that are perpendicular to radial lines emanating
from the critical point, which is the motivation for the term radially concave. Even in this pathological case, however,
the stationary point is still the unique maximum.
D-4
Figure D-l: A Test for Radial Concavity. The surface plot and equal-value contour lines depict a hypothetical con- cave function /(x) of the two-dimensional variable x. The "stem, lines" indicate the locations of the stationary point x (black), an arbitrarily chosen reference point x, (blue), and the points x (green) that satisfy the constraint x = Axr + (1 — A) x for A = 0.1. 0.2 0.9. Over the con- tinuum 0 < A < 1, the locus of x forms a one-dimensional trajectory (a subsegment of which is indicated by lower green line) that emanates outward from x along the line passing through x and x,., but in the di- rection opposite to x,. The upper ends of the stem lines correspond to the function values /(x), /(xr), and /(x). Given particular values of x and xr, x is completely determined by X, such that /(x) can be expressed as /(A) (upper green line), which follows the curvature of /(x) along the one-dimensional trajectory. By varying x,., this trajectory can be made to extend from x in every possible direction. Thus, if /(A) is necessarily concave with respect to X, regardless of the value of x,, then the overall function is radially concave and the stationary point is necessarily the unique maximum.
D-5
away from the stationary point, then the stationary point must be the unique maximum.
This approach reduces the maximality test to evaluating a single second derivative with
respect to a scalar variable.
To set up the concavity test, each set of parameter values in 0 = {/z, T} is viewed
as a "point" in the domain of the objective function (i.e., parameter space). From any
such point, a family of radial trajectories is obtained by defining a "reference point"
0r = {nr, Tr} and a set of "locus points" 0 = < /2, F > that satisfy
fi = A//r + (l-A)/2, (176)
r = Arr + (i-A)f, (177)
where A is a scalar variable satisfying 0 < A < 1. Substitution of these expressions into
the objective function generates a modified objective function over a restricted parameter
space. For a given 0r, the values of 0 fall along a radial line emanating from (but not
including) 0, in the direction opposite 0T. The direction of 0 from 0 is thus a function
of the reference point location. The distance of 0 from 0 varies as a function of A, with
a scale factor equal to the distance from 0 to 0r. The reference point 0r is treated as
an implicit parameter that is arbitrarily chosen but is a fixed constant once it is chosen.
The modified function is then defined as
Q(A, 0) = Q (A 0r + (1 - A) §) , (178)
where the short-hand notation 0 = X0r + (1 — A)0 is used to indicate that equa-
tions (176) and (177) are satisfied. To show that the objective function is concave,
the modified objective function is twice differentiated with respect to A. The resulting
second derivative is evaluated at the stationary point 0 = 0, where it is shown to be
manifestly negative. The information term 771 (0) and error term 772(0) are again treated
individually. Both terms contribute negative values to the overall second derivative.
The first component is expressed in terms of A by substituting equation (177) into
equation (160), which gives
77i(A,0) = |log|Arr + (l-A)f|. (179)
When evaluating this determinant, it is helpful to observe that the constraint on the
information matrices in equation (177) is satisfied for all A, which implies that T, Tr,
and F are all diagonalized by the same set of eigenvectors. Therefore, there exists an
orthogonal matrix U for which
UTUT = D = U {Arr + (1-A)f} UT = ADr + (l-A)D, (180)
D-6
where D = diag(di,... ,C?M), Dr = diag(dr>i,.. . ,dr,M), and D = diag(di,... ,d\i) contain
the eigenvalues of T, Tr, and T, respectively. The first component of the modified
objective function is thus given by
M
fh(X,9) = ^log|ADr + (l-A)D| = ^]Tlog{A^ + (l-AK} , (181) i=\
which is differentiated twice with respect to A to obtain
u2 M (d • — d 1 M 1 o
Sg*^—S{A^(l-A)<}'--"^g^-*)- (l82> l'=l
The numerator in the summand of equation (182) is strictly positive since dr; and d*
cannot be equal. In general, the denominator is only non-negative (i.e., T will have
zero-valued eigenvalues somewhere), but d? is strictly positive at the stationary point,
as long as the number of independent "measurements" uj£ is larger than the dimension of
the measurement space. Finally, since K is also a positive number, the overall expression
for the second derivative is therefore negative, regardless of the value of rr.
Turning now to the quadratic-form component, the manipulations involved with
evaluating 772 (0) are eased by defining the intermediate variables
d = fi-nr, (183)
A = f-Tr, (184)
ye = ue- p,, (185)
such that
T = AI\. + (l-A)f = f-AA. (186)
U3i — fj, = oj( — A/xr — (1 — A)/2 = y^ + A<5. (187)
The modified function is evaluated by substituting equations (176) and (177) into equa-
tion (162) and collecting terms in A to obtain
*(*•*) = 1 Y^pe&txl f J7n + fyriy^ + A(2fyn(5-Af2„-Aynyj)
+ X2(T8ST - 2A yn<5T) - A3 (A S6T) I. (188)
D-7
Differentiating twice with respect to A and performing some algebra then gives
}2 L
j^rf2(A,0) = KSrTS-2X6T A \ £ Pt fa («„-/*) \ . (189)
At the stationary point, the mean vector must satisfy ^2e=1 pe<j>e (<*>e — A) = 0; sucn
that the term in braces in equation (189) is zero. Combining these results with equa-
tion (182) gives
rfi M 1 2 — $(\,0) = -K^-(dri-i) -K6
TT6<0, (190)
i=l "•i
where 6TT 6 is necessarily positive because T is positive definite. This expression is
independent of A and its sign is independent of the value of 0r. The original objective
function therefore must be concave, and the stationary point is the unique maximum.
D-8
APPENDIX E
GAUSSIAN MOMENTS IN AN INTERVAL
When implementing histogram-based estimation methods, it is necessary to evalu-
ate the bin probabilities and the first two local moments in each bin, which are developed
in this appendix for the scalar Gaussian density
p(x\ fi, a2) = (2TTCT2)
2 exp I - — (x - [if \ (191)
The integral expressions required for these probabilities are not very exciting or inter-
esting, but they are quite useful. In what follows, the bin-probability and local-moment
computations are outlined for a "typical" bin interval / = [a, b).
Bin Probability. The bin probability is given by the integral
4,(1 '"•<72) = 7Sp/exp{"^(l"")2}d1' which is evaluated by performing the change of variables
1 y =
v7^ (X- fi) X = V2a2 y + n => dx = V2a2 dy
to obtain
(f)(I, fi, a2) IK Ja V^
dy,
where the modified integration boundaries are
Q = 1
(a- -**)•
0 = 1
y/2a* (b- -/,).
Given a computational routine for the error function
2 fu
erf(u) = -= /
the bin probability is computed as
e x dx,
4>(I^,a2) = i{erf(/?)-erf(a)}
(192)
(193)
(194)
(195)
(196)
(197)
(198)
E-l
dx
First Moment. The first moment in the interval is
w(/'",ff2) = 7ib/iexp{~"^(1"'')2} Adding and subtracting // fa e~^x~^ ^2<T ^dx yields
u{I,Hi02) = u(I,n,a2) + n0(1,n, a2),
where £j(I, fi, a2) is the first central local moment
u(I,n,a2) = —_ / (x-fi) exp|-^(x-/x)H dx.
This central moment is evaluated by performing the change of variables
(199)
(200)
(201)
y=^-2(x-»)2^(x-fi) = ±V2^ y1'2 =* dx = ±-j= y^dy (202)
to obtain
^^ = w,£e-Vd^wA^-e-62)- (203)
The first moment is therefore
u(I,fi,a2) = -^= (e-^-e-^+zi^/,//^2). V27T ^ '
(204)
Second Moment. The second moment in the interval is
Completing the square in the moment variable yields
Q{I,^a2) = Q{I,n,a2) + 2nuj(I,LL,a2)-n2<f)(I,n,P2),
where Cl(I, fi, a2) is the second central local moment
£1(1, //,a2) = —_ (x- n)2 exp | -— (x - n)2 \ dx.
(205)
(206)
(207)
This expression is simplified using the change of variables defined in equation (193) to
obtain
Cl(I^,a2) = -J= A2<7V) e-S (>/Wdy) = % f y2 e^ dy, (208) \l2-KOl Ja V ' V71" Ja
E-2
where a and j3 are denned in equations (195) and (196), respectively. Using the
integration-by-parts formula fudv = uv — fvdu with dv = e~y dy and u = y2
(such that v = —e~~y /(2y) and du = 2ydy), the second central moment is
n(/,//,a2) = ^= {-\ye~y2 + />"*} (209)
V7r V /
The overall second moment is therefore given by
2
Q(/,/i,a2) = ^=(ae~a2 -pe-(32^+2a24>{I1^a2)+2fiio(I,^a2)-n2<P(I,fi,a2).
Substituting the definition of u;(/,/x, a2) given in equation (204), collecting like terms,
and defining the function
/? fi" c(x) = v* cr/i + \ / 7T
a2 x (210)
gives the second moment as
Q(I, /*, a2) = C(Q) e"Q2 - c(/?) e"^ + (2a2 + //2) 0(7, //, a2) (211)
Underflow Issues. The expressions given above for the bin probabilities and mo-
ments become unstable when interval I is far removed from the mean yu. In these
regions, underflow issues become a dominating factor. This difficulty is easily avoided,
however, by restricting the range of integration to those bins that lie within, say, ±8
standard deviations of the mean, effectively treating the remaining bins as a "set of
measure zero."
E-3(E-4 blank)
REFERENCES
1. A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood Estimation
from Incomplete Data," J. Royal Stat. Soc. (B), vol. 39, no. 1, 1977, pp. 1 38.
2. G. J. McLachlan and P. N. Jones, "Fitting Mixture Models to Grouped and Trun-
cated Data via the EM Algorithm," Biometrics, vol. 44, 1988, pp. 571-578.
3. T. E. Luginbuhl, Estimation OF General Discrete-Time FM Processes, Ph.D.
thesis, University of Connecticut, 1999.
4. T. E. Luginbuhl and P. Willet, "Estimating the Parameters of General Frequency
Modulated Signals," IEEE Trans. Signal Process., vol. 52, no. 1, January 2004, pp.
117-131.
5. R. L. Streit, "Tracking on Intensity-Modulated Data Streams," NUWC-NPT
Technical Report 11,221, Naval Undersea Warfare Center, Newport, RI, 2000.
6. K. Fukunaga, Statistical Pattern Recognition, Academic Press, New York, 1990.
7. Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Association, Academic
Press, New York, 1988.
8. J. F. C. Kingman, Poisson Processes, Oxford University Press, New York, 1993.
9. R. A. Redner and H. F. Walker, "Mixture Densities, Maximum Likelihood, and
the EM Algorithm," SIAM Review, vol. 26, no. 2, 1984, pp. 195-239.
10. G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, John Wiley
& Sons, Inc., New York, 1997.
11. L. A. Liporace, "Maximum Likelihood Estimation for Multivariate Observations of
Markov Sources," IEEE Trans. Informal Theory, vol. 28, no. 5, 1982, pp. 729-734.
R-KR-2 blank)
INITIAL DISTRIBUTION LIST
Addressee No. of Copies
Office of Naval Research (Attn: John Tague) 1
NSWC Panama City (Attn: James Cobb) I
Defense Technical Information Center 2
Center for Naval Analyses 1
University of Florida (Attn: K.C. Slatton) I
University of Connecticut (Attn: Peter Willett) 1
Metron Inc. (Attn: Roy Streit) 1