20090914183 - DTIC · 2011-05-14 · This report examines histogram estimation methods for...

NUWC-NPT Technical Report 11,807 1 June 2009

A Tutorial on EM-Based Density Estimation with Histogram Intensity Data Phillip L. Ainsleigh Sensors and Sonar Systems Department

20090914183 MAVSEA WARFARE CENTERS

NEWPORT

Naval Undersea Warfare Center Division Newport, Rhode Island

Approved for public release; distribution is unlimited.

PREFACE

This report was partially funded by the Office of Naval Research (ONR-321 US).

The technical reviewer for this report was Tod E. Luginbuhl (Code 1522).

Reviewed and Approved: 1 June 2009

David W. Grande Head, Sensors and Sonar Systems Department

REPORT DOCUMENTATION PAGE Form Approved

OMB No. 0704-0188 The public reporting burden for this collection of Information Is estimated to average 1 hour per response, Including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of Information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Department of Defense, Washington Headquarters Services, Directorate for Information Operations and Reports (0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to any penalty for falling to comply with a collection of Information if it does not display a currently valid OPM control number. PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.

1. REPORT DATE (DD-MM-YYYY) 01-06-2009

2. REPORT TYPE 3. DATES COVERED (From - To)

4. TITLE AND SUBTITLE

A Tutorial on EM-Based Density Estimation with Histogram Intensity Data

5a. CONTRACT NUMBER

5b. GRANT NUMBER

5c. PROGRAM ELEMENT NUMBER

6. AUTHOR(S)

Phillip L. Ainsleigh

5d PROJECT NUMBER

5e. TASK NUMBER

5f. WORK UNIT NUMBER

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

Naval Undersea Warfare Center Division 1176 HowelI Street Newport, Rl 02841-1708

8. PERFORMING ORGANIZATION REPORT NUMBER

TR 11,807

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

Office of Naval Research 875 North Randolph Street Arlington, VA 22203

10. SPONSORING/MONITOR'S ACRONYM

ONR

11. SPONSORING/MONITORING REPORT NUMBER

12. DISTRIBUTION/AVAILABILITY STATEMENT

Approved for public release; distribution is unlimited

13. SUPPLEMENTARY NOTES

14. ABSTRACT

This report examines histogram estimation techniques in which intensity data are represented using a parameterized probability density function (PDF) model. It gives a high-level overview of histogram modeling, introducing the dominant issues and motivations, and then provides detailed mathematical developments of the histogram-based algorithms.

15. SUBJECT TERMS

Signal Processing

Static-Mixture Histograms

Histogram Estimation Methods Histogram Modeling

Expectation-Maximization (EM) Algorithm

16. SECURITY CLASSIFICATION OF:

a. REPORT

(U)

b. ABSTRACT

(U)

c. THIS PAGE

(U)

17. LIMITATION OF ABSTRACT

(SAR)

18. NUMBER OF

PAGES 69

19a. NAME OF RESPONSIBLE PERSON

Phillip L. Ainsleigh

19b. TELEPHONE NUMBER (Include area code) (401 (-832-9201

Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std. Z39-18

TABLE OF CONTENTS

Section Page

LIST OF ILLUSTRATIONS ii

1 INTRODUCTION 1

2 OVERVIEW OF HISTOGRAM MODELING 3 2.1 Histogram Approximation 4 2.2 Histogram Approximation and Acoustical Data 6 2.3 Histogram Methods and Mixture Models 8 2.4 Truncated Histogram Data 11

3 ESTIMATION FROM HISTOGRAM DATA 15 3.1 Estimation from Complete Histogram Data 15 3.2 Estimation from Truncated Histogram Data 20 3.3 Relationship Between Complete and Truncated Histograms 25

4 HISTOGRAM MIXTURE MODELS 29 4.1 Model Definition 29 4.2 Estimation from Point Measurements 30 4.3 Estimation from Histogram Data 33

5 SUMMARY 41

APPENDIX A—EXPECTATION-MAXIMIZATION ALGORITHMS A-l

APPENDIX B—COMBINATORIAL PROBABILITY DISTRIBUTIONS B-l

APPENDIX C—STATISTICS OF MISSING HIT COUNTS C-l

APPENDIX D—ESTIMATING GAUSSIAN PARAMETERS D-l

APPENDIX E—GAUSSIAN MOMENTS IN AN INTERVAL E-l

REFERENCES R-l

LIST OF ILLUSTRATIONS

Figure Page

1 PDF, Measurements, and Histogram Intensities 5

2 Histogram Data as a Function of Overall Intensity 9

3 Histogram Data as a Function of SNR 10

4 Truncated Histogram Data 12

5 Gaussian-Uniform Mixture Estimation, High-Intensity Data 39

6 Gaussian-Uniform Mixture Estimation, Low-Intensity Data 40

A-l Iterative Minorization (IM) A-4

D-l A Test for Radial Concavity D-5

n

A TUTORIAL ON EM-BASED DENSITY ESTIMATION

WITH HISTOGRAM INTENSITY DATA

1. INTRODUCTION

Observed measurements from real-world physical systems are inherently stochastic

because of noise in the propagation medium and measurement system, and because

of random variabilities in the source-generating mechanism. Therefore, the essential

problem in estimation is not to identify the "true value" of a variable of interest (such

as spatial location or frequency), but to accurately characterize the probability density

function (PDF) associated with that variable of interest. In most real-world situations,

there are no such things as numbers; there are only distributions. This report is about

the characterization of those distributions.

The notion that data collection is an exercise in distributions is consistent with

the way observations of a physical variable are actually obtained. That is, observations

from digital processing systems are usually obtained by partitioning the range of the

physical variable into bins, and then measuring the energy that falls within each bin

(or, equivalently, the energy intensity associated with each bin). Characteristics of the

variable of interest are then inferred from the energy in these bins. The estimation error

in this inference process is related to the amount of probability mass (i.e., area under

the PDF) that is not encapsulated in the estimator. Thus, if the PDF governing the

spread of energy in the variable of interest is concentrated enough that "nearly all" of

the probability mass falls within a single bin, then viewing observed data as point mea-

surements (i.e., numbers) is a reasonable approximation. In situations where the PDF

has probability mass extending across several bins, however, such point approximations

can lead to significant information loss and often to bias.

This report examines histogram estimation methods for representing intensity data

using parameterized PDF models, which provide a mechanism to significantly reduce

the information loss and bias that result from point approximations. While paramet-

ric density estimation has a long history, modern treatments of the problem usually

trace back to the seminal paper by Dempster, Laird, and Rubin [1], who, among other

things, applied the expectation-maximization (EM) algorithm for parameter estima-

tion with histogram data. McLachlan and Jones [2] extended the algorithm in [1] by

applying EM-based histogram estimation to static mixture densities. Luginbuhl [3],

Luginbuhl and Willett [4], and Streit [5] took this a step further by applying histogram-

estimation methods for dynamic mixtures within the probabilistic multi-hypothesis

tracking (PMHT) algorithm. The focus here is on the static-mixture histograms dis-

cussed by McLachlan and Jones [2], with the objective of providing enough detail to

allow these models and algorithms to be applied as they stand (e.g., in signal classifi-

cation applications) or extended for dynamic tracking contexts other than PMHT.

The organization of this report is intended to develop the concepts in increasing

detail, ultimately linking all aspects of histogram-based algorithms back to fundamental

statistical principles. The next section gives a high-level overview of histogram model-

ing, introducing the dominant issues and motivations. Sections 3 and 4 then provide

detailed mathematical developments of the algorithms. In particular, histogram meth-

ods for non-mixture distributions are developed in section 3, and these are extended

to mixture distributions in section 4. After a brief summary in section 5, a set of ap-

pendixes discuss supporting developments from optimization and distribution theory. In

addition to reviewing material that is important to understanding the algorithms, these

appendixes introduce much of the notation used in the body of the report. For exam-

ple, the discussion of the EM algorithm in appendix A provides a notational and logical

template that is used repeatedly when discussing histogram estimation algorithms in

sections 3 and 4.

2 OVERVIEW OF HISTOGRAM MODELING

The objective is to define statistical models that characterize the behavior of vari-

ables of interest (such as bearing or frequency) for some class of sources, a problem

that arises in a number of applications. For example, in maximum-likelihood classi-

fication (e.g., see [6]), PDF models are use to represent the behavior of the features

under various class hypotheses. The classifier decision boundaries are then found at

certain intersections of these class-conditional densities. As another example, proba-

bilistic tracking algorithms (e.g., see [7]) estimate parameters in a dynamic PDF model

at each time step. Regardless of the application, the goal when developing PDF models

is usually to provide a "best fit" for recorded data. However, most sensors do not pro-

vide direct measurements of the variables of interest (e.g., recorded data are not tagged

with location or frequency information). Information about the desired physical vari-

able is usually derived by transforming the received data (e.g., beamforming of spatial

array data or Fourier transform of time-domain data). An inherent characteristic of

this transformation process is a partitioning of the variable domain into bins and the

generation of output values that correspond to these bins. These transform outputs are

usually further transformed to magnitude-squared, or energy intensity, data because

the original transform outputs are complex quantities whose phase is very sensitive to

noise. Therefore, PDF models are derived from data that take the form of energy as a

function of transform bins (e.g., energy as a function of beam or frequency).

The traditional approach for processing this type of intensity data extracts "point

observations" of the physical variable using a peak estimator, which selects one or more

values of the variable for which the intensity data are locally maximum. While simple

interpolation methods can significantly improve accuracy when signal energy extends

over just a few bins, this approach is inappropriate when the spread in the energy

extends over several transform bins. Histogram methods accommodate large intensity

spreads by employing distribution models with nonzero second (and possibly higher)

moment characteristics.

This section introduces the basic ideas and representation abilities of the histogram

methods, as well as some of the issues that are addressed in the later sections. The first

subsection introduces the histogram approximation. The second discusses issues that

arise in the application of histogram methods to acoustic intensity data. The third

motivates the use of mixture densities in a histogram context, and the last subsection

discusses issues that arise when the range of available measurements does not fully cover

the range of the PDF.

2.1 HISTOGRAM APPROXIMATION

The histogram algorithms discussed here fall in the general class of parametric

statistical modeling techniques, where a PDF model p(z; 0) is used to represent the

energy distribution in the physical variable z of interest. The characteristics of the

model are governed by the parametric structure of the model and by the values in the

parameter set 0, which must be estimated from observed data. The difficulty with this

situation is that there are no direct observations of z. The parameter vector 0 must

be estimated from the energy intensity data, denoted S = {se : i = 1,..., L}, where

Si represents the energy in the £th bin and the collection of bins covers some range of

interest in the values of z.

A natural approach for overcoming this difficulty is to transform the PDF model

p(z;0) into another valid PDF model p(S;0), allowing 0 to be estimated from the

intensity data. When a convenient mathematical form for p(S; 0) is available, then

various optimization methods can be applied directly (e.g., Newton or other derivative-

based ascent method). Typically, however, the task of expressing S directly in terms

of 0 is highly nontrivial, and the resulting expression, if it can be obtained, is highly

nonlinear. For this reason, histogram methods employ an approximate model in which

p(z; 0) is used in conjunction with a multinomial approximation to represent the within-

bin and across-bin characteristics of the intensity data. Estimation algorithms are then

developed using an iterative EM approach. A brief overview of the EM approach is

provided in appendix A, which establishes a template for the algorithm descriptions in

sections 3 and 4.

The use of the EM algorithm and multinomial approximation invokes a quantum

image of intensity data, analogous to photons of light energy. The variable z of interest

is considered to be a feature of the quantum particles, and the particles are assumed

to be sorted into bins according to the value of z associated with each particle. The

number of particles in each bin, denoted me for the ^th bin, is referred to synonymously

as a histogram intensity, bin intensity, histogram count, particle count, or bin count.

The symbol me is used to emphasize the discrete nature of the histogram intensities,

in contrast to real-valued energy intensities. The relationship of PDF model, point

measurements, and histogram intensities is illustrated in figure 1.

For sensing modalities in which particle counts are actually observed (e.g., ionizing

radiation), or when there exists a one-to-one physical correspondence between energy

intensity s^ and particle count m^ (e.g., electromagnetic energy), the histogram approach

is generally valid. The only question in these cases concerns the appropriateness of the

Figure 1: PDF, Measurements, and Histogram Intensities The PDF (top) corresponds to a hypothetical scalar distribution. The points measurements (middle) are synthesized by sampling from the PDF, and these measurements are assigned to the various histogram bins with boundaries given by the vertical grid lines. The numbers of point measurements in each bin correspond to the histogram bin intensities (bottom). In practice, the point measurements are not observed (indeed, they may not even exist), such that histogram methods must estimate the PDF parameters from the intensity data. EM-based histogram methods postulate the existence of the point measurements in order to obtain an easier optimization problem. These point measurements are treated as missing data, however, and are marginalized from the objective function during each EM iteration. Note that the point measurements shown in this figure are one-dimensional in the horizon- tal axis; the vertical spread in the measurements is included merely to illustrate the clustering of measurements.

parametric form for p(z;@). When using the histogram model for acoustic intensity

data, however, there is no physical mapping from s^ to me- The histogram model is

therefore inherently mismatched to the data in this case.

2.2 HISTOGRAM APPROXIMATION AND ACOUSTIC DATA

When applied to acoustic intensity data, the histogram model appears to exhibit

some insurmountable flaws, namely, the presumption of particles and point measure-

ments that don't really exist, as well as the real-valued nature of the energy intensities

in a theory requiring integer count data. From the perspective of practical implementa-

tion, there are two factors that ameliorate these problems. First, the actual estimators

for the model parameters in 0 are obtained by integrating over the space of the point

measurements, whereby the point measurements are marginalized out of the likelihood

function. This marginalization takes place during the derivation of the estimators, such

that the point measurement z never appears in any computational algorithm. Sec-

ond, the parameter estimators end up being functions only of the relative intensities,

which are the individual bin intensities divided by the overall intensity. The numeri-

cal algorithms do not care if these ratios are formed from integer-valued count data or

real-valued energy data, since the ratio is, in general, real in both cases.

Now, just because the algorithm can be applied does not make it the right tool.

It is therefore of interest to take a closer look at a common special case, specifically

magnitude-squared discrete Fourier transform (DFT) data. The DFT forms the inner

product of recorded time-domain data with a set of complex sinusoidal basis functions.

Since, in practice, it is impossible to observe an infinite duration of the time-domain

signal, it is impossible to localize the energy to a single frequency point. The DFT

sample therefore represents the energy "in the vicinity of" a given frequency, such that

the partitioning of the frequency domain into bins is a natural result of the transform

process and the DFT samples correspond to the energy that is projected into each bin.

This projection of energy into DFT bins has the flavor of a histogram by definition,

although the histogram model takes this a step farther by assuming a quantization of

the DFT bin energies to take integer values, as if there were some basic unit of acoustic

energy (i.e., the acoustic equivalent of Planck's constant) and the quantized integers

represent the number of these units that fall within each DFT bin. This quantization

implies a set of "synthetic" particles for each bin, and the number of these synthetic

particles is indeed a histogram count, even if it does not correspond to any known physi-

cal entity. The multinomial model at the heart of histogram methods then characterizes

these synthetic count data.

The multinomial model starts out by treating histogram count data as statistically

independent Poisson processes. The histogram count for each bin is then modeled using

a conditional distribution, where the conditioning is on the total synthetic intensity in all

bins (the quantized version of the total signal energy). Now, the total synthetic intensity

is merely the sum of the individual intensities, and a sum of Poisson processes is itself

a Poisson process whose expected value is the sum of the individual expected values

[8]. Furthermore, since the conditioning process is equivalent to dividing probability

mass functions, the multinomial model is effectively a series of ratios of Poisson mass

functions. Note that, while the synthetic intensities for the various DFT bins are initially

represented as independent processes, the bin intensities are not independent in the

multinomial model because of the conditioning on the total intensity. That is, all bins

affect all other bins through their contribution to the total intensity. Note also that

modeling the quantized intensities with a multinomial distribution provides a statistical

description that matches the first moment of the observed data but does not match any

higher moments.

Given the subtleties of the multinomial approximation, its appropriateness must

be evaluated on a case-by-case basis for different applications. That said, the histogram

approach provides a way forward in cases where the alternatives are intractable. For

example, an attempt to directly model the energy intensities would require a joint

multivariate exponential distribution over all of the spectral or spatial bins (i.e., a

joint distribution over easily thousands of variables, and more as resolution increases).

Estimation in such high-dimensional spaces is impossible in most real-world scenarios.

In many applications, the quantization of the bin intensities is implicit because the

algorithms depend only on the relative intensities. However, the quantization manifests

itself in a very explicit way in Bayesian contexts where a prior distribution is imposed.

The problem involves the relative weighting of the prior distribution and the measure-

ment likelihoods when estimating the posterior parameter estimates. Specifically, the

measurement likelihoods are weighted by the overall intensity of the bins (i.e., the total

number of particles in all bins). The more "measurements" there are, the more the

prior distribution is discounted. But when dealing with artificially quantized intensity

data, the overall intensity depends on the unit energy associated with each particle,

which is itself a function of the quantization. The net result is that the quantization

unit becomes an explicit variable in the algorithm, making the relative weighting of the

prior completely dependent on an algorithm design parameter. If a coarse quantization

(i.e., a large particle energy unit) is assumed, then the synthetic particle counts will be

low and the prior distribution will dominate. If a very fine quantization (i.e., a small

particle energy unit) is used, then the apparent number of particles is very high and

the data overwhelm the prior. Indeed, in the limit as the quantization unit approaches

zero, the synthetic particle counts become infinite and the estimator effectively ignores

the prior distribution altogether, a problem that was observed by Streit in his work of

histogram PMHT [5]. This same issue will arise with any dynamic mixture scenario

in which the overall intensity varies with time, which includes just about all practical

tracking applications.

As a final note, the dependence of the parameter estimators only on the relative

intensities also means that the estimators are invariant to the true overall intensity.

The degree of freedom introduced by this invariance allows the same model to represent

large variabilities in different realizations of the distribution. To illustrate this, figures

2 and 3 show examples of synthetic histogram data with different values of signal-to-

noise-ratio (SNR) and overall intensity. In figure 2, plots are shown with the same

value of SNR, but with different values of the overall intensities. In contrast, figure

3 contains plots with the same overall intensity but with varying SNR. While there is

significant variability in both cases, there is a subtle distinction. Low SNR generally

means that there is too much of the wrong kind of data (i.e., noise), whereas low overall

intensity indicates that there is too little of every kind of data. The histogram model

accommodates both cases equally well.

2.3 HISTOGRAM METHODS AND MIXTURE MODELS

The Gaussian-signal-in-uniform-noise mixture model used to generate the above

examples was chosen because of the basic role that such models play when using his-

togram techniques. Specifically, mixture models provide a convenient way to explicitly

model noise, which is necessary because histogram models are very data inclusive. This

is most easily seen by contrasting the histogram approach to an estimator that selects

the single maximal peak. When such a peak estimator is applied to data with highly

concentrated signal energy (e.g., narrowband signal spectra), a beneficial side effect is

provided in the form of signal cleaning. That is, in cases where the signal intensity is

largely confined to a single bin and that bin is correctly identified, most of the wideband

noise is eliminated with the omitted bins. The histogram estimator, on the other hand,

throws away nothing. It must therefore explicitly account for the noise to minimize

noise-induced errors.

LARGE OVERALL INTENSITY

TIME

MODERATE OVERALL INTENSITY

TIME

SMALL OVERALL INTENSITY

TIME

Figure 2: Histogram Data as a Function of Overall Intensity Image plots of histogram intensity data are synthesized independently at each of 150 time points, with intensities at each time point obtained by sampling from a two-component mixture model containing a Gaussian signal in uniform noise. Samples are sorted into unit-width bins cover- ing the range [—10,10]. The Gaussian component has constant variance a2 = 4 and a sinusoidally time-varying mean. The mixture components have constant probabilities ns = 0.3 (for signal) and nn — 0.7 (for noise). The ratio 7rs/nn is the (wide-band) SNR. Intensity gram plots are shown for overall intensities corresponding to 2500 point measurements per sample time (top), 250 measurements per sample time (middle), and 25 measurements per time (bottom). g

HIGH SIGNAL-TO-NOISE RATIO

TIME

MEDIUM SIGNAL-TO-NOISE RATIO

l i •

u o

1 o I- w • I 1 11- !!• flf

• ,p

'-'•I ""J . .'

1 "TfiySf i'i1 VN3-v, TIME

LOW SIGNAL-TO-NOISE RATIO

TIME

Figure 3: Histogram Data as a Function of SNR Histogram data are synthesized using the same model as in figure 2. Here, the total number of measurements in all plots is K — 250 (as in the middle plot figure 2). The signal-mode assignment probability varies, with values irs = 0.5 (top), 7rs = 0.3 (middle), to TTS = 0.1 (bottom), giving high, medium, and low signal-to-noise ratios, respectively.

10

Finite mixture distributions, the topic of section 4, are ideally suited for noise

modeling because they provide a natural mechanism for dividing the energy between

a number of component distributions, one or more of which can be tailored to noise.

The "basic" PDF model for histogram methods is therefore the Gaussian-signal-in-

uniform-noise model. The use of a uniform noise distribution assumes that the data

has been pre-whitened, for example, using a spectral or spatial normalizer. If it is more

desirable to work with un-normalized data, however, a more sophisticated noise model

is required and can be achieved using a mixture of uniform or Gaussian components.

Multiple mixture modes can also be used to describe the signal energy itself, say for

data containing energy from multiple sources of interest or signals whose energy in the

variable of interest is inherently multi-modal.

2.4 TRUNCATED HISTOGRAM DATA

For a variety of reasons, intensity measurements may not be available for some

histogram bins. This can happen unintentionally, for example, when processing spectral

data from sensors with a limited frequency range. It can also occur intentionally, as

when data are truncated or subsampled to reduce communication and/or computational

requirements, or to isolate a phenomenon of interest. Figure 4 shows a hypothetical

example of truncated histogram data. While this illustration focuses on histogram "edge

effects" where the missing bin intensities are on the outer edges of the distribution,

missing intensity values can also occur in the "interior" of the histogram, say, due

to unintentional drop-outs in the sensor response or intentional subsampling of the

histogram bins. When the histogram bins do not fully cover the range of the PDF, the

histogram and its data are called truncated. This is in contrast to a complete histogram,

whose bins do cover the entire range.

If complete-histogram algorithms are applied to truncated-histogram data, then

a mismatch exists between the physical world (where a nonzero intensity is impossible

in some bins) and the mathematical world (where nonzero intensities are possible, but

none just happened to be observed in the data at hand). The size of the mismatch

depends on the probability mass under the density function in the truncated regions.

This probability mass may be very small, allowing the issue to be ignored in some

applications. However, when modeling physical phenomena whose energy can approach

the edges of the observable measurement space, or when sensor malfunction causes lost

sensitivity in some interior region, then the probability mass in the truncated region

can be significant. The EM algorithm is used to circumvent this problem by treating

the unobserved intensities as missing data.

11

i—I—I—I—i i—i—i—i—i—i—:—i—i—i—i—i—r

a*

Figure 4' Truncated Histogram Data. This figure shows data from the same scalar distribution as in figure 1, with the continuous PDF shown on top, the synthetic point measurements shown in the middle, and the histogram bin intensities shown on bottom. In this case, however, the histogram bins do not cover the entire region over which point measurements can occur. The energy associated with measurements falling outside of the observable region (i.e., outside of the bold vertical lines) is not included in any histogram bin and is therefore lost in the histogram model. The lost data are accounted for in the EM estimation algorithm by utilizing an augmented set of missing data. That is, the EM algorithm treats as missing data the intensity values in the truncated region, in addition to the point measurements that form the missing data in the case of a complete histogram.

12

The computational algorithm for the truncated histogram ends up being a simple

modification of the algorithm for the corresponding complete histogram. The theoretical

work needed to develop the algorithm, however, has a subtlety requiring careful analysis.

In particular, it is impossible to know the total intensity (i.e., the sum of the bin

intensities) when some of the bin intensities were not recorded, and the multinomial

distribution is well defined only when the overall intensity is given. The uncertainty

regarding the overall intensity requires the use of a negative binomial distribution and

its multivariate extension, the negative multinomial distribution. This extension is

discussed in section 3 and appendix C. Fortunately, the end result of that analysis is a

simple extrapolation formula for obtaining the expected values of the missing intensities.

The EM auxiliary function for the truncated histogram is then a linear function of

the missing intensities, such that the maximization step in each iteration of the EM

algorithm can be formulated in terms of the complete histogram. The algorithm then

operates on an extended data set in which the observed intensities are augmented with

expected intensities in the truncated regions.

13(14 blank)

3. ESTIMATION FROM HISTOGRAM DATA

The remainder of this report focuses on the mathematical developments related

to histogram modeling and estimation. In all that follows, integer count data are as-

sumed available for each bin. That is, the issues cited in the previous section related to

converting from real-valued energy data to integer-valued histogram data are assumed

to have been dealt with appropriately.

As mentioned in the previous section, one key distinction among histogram meth-

ods involves whether or not the histogram bins completely cover the range of possible

measurement values and intensity data are available for all bins. When estimating the

parameter set 0, the values of z for which p(z; 0) is well defined constitutes the mea-

surement space, denoted Z. If data are available for histogram bins that completely

cover Z, then the histogram and its data are complete. If there are regions of Z for

which histogram data are lacking, then the histogram is truncated. This section first

examines complete histograms and then extends those algorithms for truncated data.

A general relationship between the two types of histograms is then discussed.

The notation used in this section and the logical steps in deriving algorithms

closely parallel the discussion of the expectation-maximization (EM) method in ap-

pendix A. Note that EM theory defines complete data generically to indicate the con-

catenation of the observed and missing data, which is independent of whether the his-

togram bins fully cover the space of the physical variable. To distinguish these different

types of complete data, data from a complete histogram will always be referred to as

complete-histogram data. Any reference to "complete data" without the "histogram"

modifier indicates the more generic notion from EM theory.

3.1 ESTIMATION FROM COMPLETE HISTOGRAM DATA

A complete histogram partitions the measurement space into a set of mutually

exclusive and exhaustive regions, which are the histogram bins (or cells), denoted Ze

for £ = 1,..., L. The measurement space is thus decomposed as

/.

Z = \JZe, (1) e=i

where the non-overlapping nature of the Ze is implicit. Consider now a collection of

hypothetical point measurements Z = {zi,..., z#}. The number of these measurements

whose value falls in each bin comprise the histogram intensity data, denoted

Mc = {m,:^=l,...,l}. (2)

L5

where the subscript C indicates complete histogram data. The overall intensity is then

defined as

L

K = Y, m(- (3) e=i

Under a given set of parameter values, the unconditional probability that a measurement

falls in the ith bin is

which, for a complete histogram, satisfies

L

0) = / dzp(z;0), (4) J zf

£*<(e) = i. (5)

Assuming that the measurements constitute K independent draws from L categories,

the bin intensities are governed by the multinomial distribution discussed in appendix B;

that is,

L m

p(Mc ; 0) = c(Mc) [I {Me*)}•1, (6) e=i

where c(Mc) is the multinomial coefficient defined in equation (127). Prom the stand-

point of the EM-based algorithms developed in this section, equation (6) is the observed

data likelihood function (ODLF) since it characterizes the observed histogram counts.

3.1.1 EM Algorithm with Missing Measurements

Definition of Missing Data and Auxiliary Function. The observed histogram

intensities provide a relative measure of the number of unobserved point measurements

that "fall into" each bin. The missing data for the EM algorithm are these unobserved

measurements. Within a given bin (say, the £th), the measurements are denoted

Ze = jztt : k= l,...,m<|, (7)

and the full set of missing data contains measurement sets for all bins as

Z = [Ze :£=1,...,L}. (8)

The complete data are the concatenated set containing the observed bin intensities

and missing measurements, and its joint distribution p(Z, Mc ; 0) is the complete data

16

likelihood function (CDLF). Given the CDLF and the posterior density p(Z|Mc ; 0*)

for the missing data (with parameter estimates 0* from the previous EM iteration),

the auxiliary function is formed by taking the conditional expectation

Qz(0;0*,Mc) = y"dZp(Z|Mc;0*)logp(Z,Mc;0), (9)

where fz dZ is the marginalization operator for the missing measurements. This marginal-

ization is represented by the sequence of (possibly multivariate) integrations given by

dZ = I dzn • • • / dzlmi >••{ / dzii • • • / dzLmL \ , (10) Jz I Jzx Jzx ) I JzL JzL )

which can be expressed using the shorthand notation

dzik \ • (11) JZ e=i fc=i k J£*

Care must be exercised when using this shorthand notation because a set of nested

integrals is clearly not equal to a product of single-variable integrals. However, due to

the non-overlapping nature of the integration regions (i.e., the Zk), the cross terms in

the nested integrals are zero, such that equation (10) is equivalent to equation (11) in

this case.

CDLF and Posterior Distribution. With missing measurements, it is perhaps

easiest to start with the posterior distribution of the missing data and then derive

the CDLF from it. With no other knowledge, each measurement is governed by the

distribution p(zek ; 0)- Because the measurement zek is known to reside in the £th bin,

however, its distribution in this restricted region is

I I -7 C>.\ P(Z^;0) /TON p(ztk\Ze;e) = MQ) . (12)

The intensities provide the numbers of independent measurements in each bin. Given

these, the likelihood of the group of measurements is just the product of the likelihoods of

each measurement. The posterior of the measurements given the intensities is therefore

L mi L mt , , r\\\

p(ziMc;e) = nn»(-i*i®> - n n {%#}• <i3> e=i fe=i e=i fc=i *• n ; >

The CDLF is derived from equations (6) and (13) by applying Bayes' rule and canceling

terms, which gives

L mi

p(Z,Mc;0) = p(Mc;@)p(Z\Mc:@) = c(Mr) ]J J] p(z(k ; 0) • (14) t=\ k=\

17

The log-CDLF is then given by

L mt

logp(Z, Mc ; 0) = log c(Mc) + ^ ^ logp(z,fc ; 0). (15) <=i fc=i

Dropping the term logc(Mc), which does not depend on 0, the auxiliary function

becomes

Qz(0;0*,Mc) = / dZp(Z|Mc;0*)X^Elog^fc'0)- (16) ^z e=i fc=i

Conditional Expectation. Due to the independence of the measurements and the

structure of the posterior distribution in equation (13), the conditional expectation

operation can be expressed as

JzdZp(Z\Mc;S*) = f[ fj {^^)Jz *»!**»;©•)}. (17)

where, noting the definition of $^(0*) in equation (4), each component in this expression

satisfies

9^1*** **»'>*) -1' (18)

When the conditional expectation operation in equation (17) is applied to the log of the

CDLF, it applies individually to each of the log terms that appear after the summa-

tions in equation (16). Furthermore, the components in the conditional expectation all

integrate to one, except for those in which the indices of the integration variable match

the indices of the log term being operated on. The auxiliary function therefore becomes

Qz(0;0*,Mc) = £ ¥— Y, / dzekp(ztk;e*) log p(z£k ;&). e=i e\ ) k=i JZt

At this point, the indices t and k on the measurements have no significance because

the integration is carried out independently for each summand (i.e., the integration

is "inside" the summation operations), and because there is no actual set of observed

measurements into which one might have to index. The indices £ and k are therefore

dropped from the measurements to obtain the expression

L , Qz(0;0*,Mc) = Y, $^) jz dzp(z;S*) log p(z; 0 )• (19)

18

At the level of generality considered thus far, further developments are impossible be-

cause maximization of the auxiliary function (the M-step) requires a particular form for

p(z; 0). The M-step is carried out below, however, for the special case of a multivariate

normal density.

M-Step for a Gaussian Model. As an example, consider estimating the mean vector

H and covariance matrix P in the Gaussian distribution

p(z;0) = Af(z;/i,P) = |27rP|-1/2expj-^(z-/z)Tp-1(z-/z)j . (20)

In this case, the auxiliary function given in equation (19) becomes

Qz(0;0*,Mc) = Y. ^jh\ / d*tf(z;n\P*) log Af(z;n,P) . (21)

where /Lt* and P* are the existing estimates from the previous EM iteration. The

estimators for the mean and covariance are given in terms of a set of sufficient statistics

for the posterior distribution. In addition to the bin probability

$,(©*) = / dzAA(z;//*.P*) , Jzt

(22)

these sufficient statistics include the centroid and spread variables for each histogram

cell, which are given respectively as

w<(e*) = f dzJV(z;/i*,P*) z, (23)

ne(®*) = f dzAA(z;/i*,P*) zzT. (24) Jzt

Computational algorithms for equations (22)-(24) are given for the case of scalar mea-

surements in appendix E. The estimators for the mean vector and covariance matrix

are derived in appendix D and are given by

e=i x '

'4^ta (26)

where tit is the center-shifted second local moment, which is computed from the centroid

and spread variables as

0«(©*,£) = n,(0*)-2/i w,(0*) + /x2 $,(0*). (27)

L9

As shown in appendix D, estimators ft and P are located at a critical point of the

auxiliary function, and, due to the concave structure of the auxiliary function, they also

locate the global maximum.

3.2 ESTIMATION FROM TRUNCATED HISTOGRAM DATA

When histogram intensity data for various bins or regions are unavailable for

some reason, then the estimator needs to acknowledge that there are unknown values

in those bins or regions, instead of assuming that they are zero. The EM algorithm

handily addresses this problem by letting the unobservable intensities be missing data.

To establish notation, let ZQ denote the observable region of the measurement

space, which is partitioned into LQ histogram cells as

Lo

Zo = {jZe. (28) e=i

Further, let ZT denote the truncated region, where particles are not counted (e.g., at

the edge of an image or at either end of a frequency band). This region is partitioned

into Li cells as

L

ZT = U Ze, (29) e=L0+i

where L = Lo + LT is the total number of cells that cover the concatenated space

Z = Z0UZT. (30)

The unconditional probability of a particle in the observable region is

*o(e) = / dzp(z;&) = X>,(0), (31)

and the particle probability for the truncated region is

$T(e) = / dzp(z;&) = J2 $<(0)- (32) ^

ZT e=L0+i

Since ZQ and Zi satisfy equation (30), $o(©) and $T(©) satisfy

$O(0) + <I>T(0) = 1. (33)

The collection of observed intensities is denoted

M0 = {m€:£=l,...,L0}, (34)

20

and the total number of observed particles is

Lo Ko = Yme- (35)

Since the observed intensities represent K0 draws on Lo categories, Mo is multinomially

distributed. But since the unconditional probabilities of the observed histogram bins

do not sum to one, the ODLF is given by

P(Mo;0) = C(Mo)n{^}

{n&rntl} {$o(e)}_ ' n{*'(Q)}m'- (36)

The EM algorithm for optimizing this likelihood function is developed in two stages.

First, the unobserved intensities are treated as the sole piece of missing data. This is

then extended to the case of missing measurements and intensities.

3.2.1 EM Algorithm with Missing Hit Counts

Consider, for a moment, estimating the distribution parameters using an EM

algorithm that treats the truncated intensities as missing data but does not include

the measurements in the missing data. The resulting algorithm is of little practical

interest since the M-step likely involves a difficult nonlinear problem. This situation is

considered, however, to isolate the issues involved with truncated bin intensities. After

getting an understanding of these issues in the present subsection, the next subsection

takes on the case where both the truncated intensities and measurements are included

in the missing data.

Definition of Missing Data and Auxiliary Function. For the present discussion,

the missing data are the intensities in the truncated region, denoted

MT = {TO*:'« = .LO + 1,...,L}, (37)

and the auxiliary function is defined as

QMT(0;0*.MO) = Y, P(MT|MO;0*) logp(MT,Mo;0). (38)

The marginalization operator for the missing intensities is expressed using the shorthand

notation

E - n E h E E - E • (39) MT £=L0 + l {. m(=0 ) mz,o + i=0 mLo+2=0 mL=0

21

Similar to the notation introduced in equation (11), this shorthand notation is only

accurate when the cross-terms in the nested summations are zero, in which case the

nested sum does indeed reduce to a product of individual summations.

CDLF and Posterior Distribution. The posterior distribution of the missing in-

tensities is derived in appendix C, and is given by

K L

p(MT | Mo ; 0) = c-(MT,Ao) {$<>(©)} ' II {*<(©) P , (40) e=LQ+i

where C"(MT,^O) is the negative multinomial coefficient defined in equation (140).

The CDLF is given in terms of the ODLF and posterior distribution as

p(MT, Mo ; 0) = p(MQ ; 0) p{MT | MQ ; 0) (41)

L

= c(Mo) c-(MT, KO) II { **(©) }m , (42)

whose logarithm is

L

logp(MT, M0 ; 0) = log c(M0) + logc_(MT, KQ) + ^ rne log$£(0). (43) €=1

Since the first two terms in equation (43) do not depend 0, they can be dropped from

the auxiliary function to obtain

L

QMT(0;0*,MO) = J2 P(MT|Mo;0*) ]T me log^(0). (44) MT e=i

Conditional Expectation. The expression in equation (44) is a linear function of the

missing bin intensities, with respect to which the expectation is being taken. Because

of this linearity, the expected value of the log-CDLF is obtained merely by substituting

the expected values of the missing data, leading to the expression

L

<2MT(0;0*,MO) = J2 •>t log*<(®), (45) e=i

where the expected intensities are defined by

if I = 1,..., L0

, (46) £|mf|Mo;0*j if I = L0+ 1,.. .,L.

22

The expected intensities for the truncated cells axe derived in appendix C and are given

for I = Lo + 1,... ,L as

£{ra'|Mo;8"} = {^j}*',e')- (47)

This expression extrapolates the intensities from the observed region into the truncated

region by using the ratio { Ko/$o(®*)} as a "probability-to-intensity" conversion fac-

tor and then applying this conversion factor to the unconditional probability in each

truncated bin.

3.2.2 EM Algorithm with Missing Hit Counts and Measurements

The results of the previous subsection are now extended such that the missing

data include both the truncated intensities and measurements.

Definition of Missing Data and Auxiliary Function. The missing data are

defined by the sets

MT = { me: £ = L0 + 1,..., L } ,

Z = {ztk :^=1,...,L, fc = l,...,m*}. (48)

The auxiliary function is defined by

<2z,MT(e;0*,Mo) = E / dZ/>(Z,MT|Mo;0*) logp(Z,MT,Mo ;0), (49)

where the marginalization operators for the missing data are again expressed using the

shorthand notation

L ( oo

£• n E MT e=LQ + l I me=0

„ L mi

•/ 2 B 1 I 1

L mi

j I dzek e=i fc=i

CDLF and Posterior Distribution. The distribution p(Z,Mx,Mo;@) can be

factored using Bayes' rule as

p(Z,MT,Mo;0) = p(Mo;0) p(MT | M0 ; 0) p(Z|MT,Mo;0)

= p(M0 ; 0) p(MT I Mo ; 0) p(Z| Mc; 0), (50)

23

where p(Z| MT, Mo; 0) = p(Z| Mc; 0) because, once the missing intensities are given,

the concatenated set {Mo,MT} corresponds to the set of intensities for the complete

histogram. Equation (50) serves as the starting point both for obtaining the posterior

distribution of the missing data and for evaluating the CDLF to be substituted into

equation (49). A suitable expression for this latter purpose is obtained by substituting

equations (36), (40), and (13), and canceling terms, which gives

L m.(

p(Z,MT,Mo;0) = c(M0) c"(MT, K0) J] ]\ p(z<fc;0). (51) e=i k=i

The log of the CDLF is therefore given by

L 7Tl£

logp(Z,MT,Mo;0) = logc(Mo) + logc-(MT,^o) + J]J]logp(z^;0), (52) e=i fc=i

such that, after ignoring constant terms logc(Mo) and logc~(MT, A^o), the auxiliary

function becomes

~ L me

<2Z,MT(0;0*,MO) = 52 I dZp(Z,MT|Mo;0*) J2 52 ^gp(zek-,e). (53) MT

Jz e=i fc=i

The missing-data posterior distribution is obtained by dividing equation (50) by p(Mo ; 0),

which gives

p(Z,MT|Mo;0*) = p(MT|Mo;0*)p(Z|Mc;0*). (54)

Conditional Expectation. The expectation operation in equation (53) can be de-

composed as

52 f dZp(Z,MT|Mo;0*) = ^p(MT|Mo;0*) f dZ p(Z|Mc;0*), (55) MT ^Z MT

Z

allowing the auxiliary function to be expressed as

Qz,MT(0;0*,Mo) = ^p(MT|Mo;0*) MT

P L me

x dZp(Z\Mc;e*)5252loZp(Z(k'@^ (56) Jz e=i fc=i

Noting equation (16), the second line of this expression is just the auxiliary function for

the complete histogram, such that

<2z,MT(0;0*,Mo) = 52p(Mr\M0;Q*) QZ(@-&\MC). (57) MT

24

Substituting the final form for Qz(0;0*,Mc) given in equation (19) again leaves an

expectation of a function that is linear in the missing intensities. Paralleling the case

in which only the intensities are missing, the auxiliary function simplifies to

L _ „

Qz,MT(0;0*,Mo) = £ -r^v / dz p(z;0*) log p(z;0), (58) j^[ *HW ) Jzi

where fh( is defined in equation (46). Once again it is impossible to go any further

without imposing a form on p(x;0). For the special case of Gaussian measurements,

maximization of equation (58) over the mean vector // and covariance matrix P yields

estimates that are equal to those in equation (25) and equation (26), but with me re-

placed with fhe and K replaced with K = Yle=\ •e- The Gaussian parameter estimates

after each iteration of the EM algorithm with truncated histogram data are thus

where u>^(0*) and $7^(0*, jx) are defined in equations (23) and (27), respectively.

3.3 RELATIONSHIP BETWEEN COMPLETE AND TRUNCATED HIS-

TOGRAMS

In the analysis of truncated histograms, the auxiliary function with missing mea-

surements and histogram intensities reduced to an expected version of the auxiliary

function for the complete histogram. This occured because of the similarity of the

CDLF for the two situations. It is useful for later developments with mixtures to flesh

out this equivalence more generally. Consider the generic problem in which the missing

data contain some set of variables S. For a complete histogram, the CDLF in most

cases of practical interest can be expressed as

p(S, Mc ; 0) = p(Mc ; 0) p(21 Mc ; 0) L

= c(Mc) J] {$/(e)}"" p(S|Mc;0). (61)

The CDLF for the truncated histogram, on the other hand, is given by

p(S,MT,Mo;0) = p(MT,Mo;0) p(3|Mc;0)

L

= c(M0) c-(MT,K0) n{^(0)}m' P(S|MC;0). (62) e=i

25

With the exception of the coefficients, these two expressions are identical. Substituting

the definitions of the multinomial and negative multinomial coefficients in the CDLF

for the truncated histogram gives

KQ\ (K-l)\ c(M0) c-(MT,K0) =

{ilfcW} (Ko-l)\ {Yle=Lo+lrne\}

K

= (jf) c(Mc), (63)

such that the two CDLFs are related by the expression

p(S, MT, MO ; 0) = (jf) P& Mo; O). (64)

This expression gives an indication of the likelihood "loss" that results from the unob-

servability of some of the intensities. This relationship between the CDLFs results in a

similar relationship between the auxiliary functions for the two problems. The auxiliary

function for the truncated histogram can be written as

<2S,MT(0;0*,MO) = J>(MT|Mo;0*) J dZ p(S|Mc;0*) logp(H,MT,M0 ;0)

= ^p(MT|Mo;0*) / dSp(S|Mc;0*)

x {log(^) +logp(S,Mc;0)}.

The random variable K is a, function of 0*, but not of 0. Since Ko is observed, the

term log(A'o/^) is independent of 0, and the auxiliary function effectively becomes

Qs,MT(0;0*,Mo) = ^p(MT|Mo;0*)Qs(0;0*,Mc).

Furthermore, the parameter estimators for the complete histogram usually depend lin-

early on me, as was the case in the previous subsection. The approach used above

therefore generalizes to most cases of interest; that is, the truncated-histogram esti-

mators are obtained by substituting the expected intensities into the corresponding

complete-histogram estimators. One thus never need explicitly derive the estimator for

the truncated histogram, so long as the complete-histogram estimator exhibits this lin-

ear dependence on the cell intensities. Since this is also the case with the mixture models

26

discussed in the next section, the results are formulated for complete histograms only.

The qualifiers (i.e., subscripts I, C, and T) on histogram-related variables are therefore

dropped in the next section, with the understanding that variable me in a given ex-

pression is an observed intensity for cells in ZQ and an expected intensity for cells in

27(28 blank)

4. HISTOGRAM MIXTURE MODELS

This section discusses parameter estimation for finite mixture models, which in-

troduce another form of missing data, namely the mode assignments. Subsection 1

outlines the signal-in-noise mixture model; subsection 2 reviews estimation from point

measurements, which illustrates in isolation the issues surrounding mode assignment

uncertainty. Subsection 3 then generalizes this to estimation from histogram intensity

data, wherein the point measurements and modes assignments are missing data.

4.1 MODEL DEFINITION

Let the distribution of interest (i.e., the "signal distribution") be denoted ps(z; 0S).

At issue is the fact that, in noisy data, some of the measurements do not belong to

ps{z:0s), but instead belong to some extraneous distribution po(z;0o) (i.e., the "noise

distribution"). If one attempts to estimate 0 s in the signal distribution without account-

ing for the noise measurements, then the resulting parameter estimates are distorted.

Noise measurements are accommodated using the two-mode "signal-or-noise" mixture

distribution given by

p(z ; 7T0, 0O, 0S) = 7T0 p0(z; 0o) + (1 ~ TTO) Rs(z ; 0S), (65)

where 7TQ is the unconditional probability that a measurement belongs to the noise. In

general, the model distributions ps(z;0s) and po(z;0o) can have different parametric

structures (e.g., Gaussian signal distribution and uniform noise distribution).

From this simple statement of the model, more sophisticated models are obtained

by letting the individual signal or noise components themselves be mixtures, resulting

in a mixture of mixtures. This is most valuable when ps(z;0s) and/or po(z;0o) are

either unknown or so complicated that they lead to intractable estimation problems.

In the discussion below, a separate mixture is included for the signal distribution only;

a mixture model for the noise component is constructed similarly. Thus, the signal

distribution is expressed in terms of modal distributions as

J

ps{z ; 0S) = ^2 $i Pi(z> °^ ' (66) .7 = 1

where Pj(z; 0j) is the jth modal distribution and ipj is the unconditional probability that

the jth mode is valid (i.e., that z is actually governed by the jth modal distribution).

The collection of these probabilities satisfies the constraint X)»=i ^j = 1- Substituting

equation (66) into equation (65), and defining 7Tj = (1 — 7r0) tpj for j = 1,..., J, results

29

in an overall mixture distribution, including signal and noise models, given by j

p(z;e) = £>*»(•;**)• (67) j-"

This model is parameterized by the set 0 = {n, 0Q, 0 s}, where n = {no, TTI, ..., 7Tj}

is the collection of unconditional mode assignment probabilities, 0O contains the noise

distribution parameters, and 0s = {0\,... ,0j} is the collection of parameters for all

modes of the signal distribution. The right-hand side of equation (67) is a convex

combination since 5Z,-=0 TTj = 1.

4.2 ESTIMATION FROM POINT MEASUREMENTS

The assignment uncertainty issue is the defining characteristic of finite mixture

densities. This issue is now considered in isolation by considering the problem in which

a set Z of independent point measurements are observed. While the resulting algo-

rithm is of value in certain cases in which feature values of observed data samples are

directly observed (e.g., see [9]), this is a rather artificial problem in the context of

histogram modeling and is presented merely as a stepping stone to the more general

problem description in the next subsection. In any case, when observations of the point

measurements are assumed to be available, an optimal parameter estimation algorithm

maximizes the ODLF given by

p(Z; 0) = ft | E *"ii &(**; eh) \ • (68)

Direct optimization of this likelihood function with respect to 0 is generally difficult

because of the interaction between mixture modes.

4-2.1 EM Algorithm with Missing Assignments

The EM algorithm for mixtures circumvents the mode interaction issue by treating

as missing data the mode assignments (i.e., the fa). The idea behind this choice is

that, if the mode assignment were known for each measurement (i.e., if the component

distribution actually governing each measurement were known), then the measurements

could be grouped according to mode and the parameter estimation problem would

decompose into J independent problems that are easier to solve. The EM algorithm

does induce such a decomposition, albeit in each iteration of an iterative procedure.

Definition of Missing Data and Auxiliary Function. The set of missing data

corresponding to the observed measurements in Z is denoted as

J = { ji,h, ••JKJ, (69)

30

such that the auxiliary function is given by

Qj(0;0*,Z) = Y, MJ|Z;0*) logp(Z,J;0). (70) J

Given that the measurements are independent, the joint density is a product of terms

that allows the missing-data marginalization operation to be expressed using the short-

hand notation

3 fc=l I jk=l )

CDLF and Posterior Distribution. The CDLF is obtained by noting the in-

dependence of the measurements and by applying Bayes' rule on a measurement-by-

measurement basis, which gives

K K

p(J,Z;0) = H P(jk;&) P(zk\jk;&) = J] nJkPjk(zk;0jk), (72) jt=i fc=i

where p( jk; 0) = 7rJfc and p(zk\jk; 0) = Pjk(zk;0jk). The log-CDLF is therefore

K

logp(J,Z;0) = Y, {^^jk+logP]k(zk;dJk)]. (73) fc=i

Bayes' rule then gives the posterior distribution as to obtain

where 7fcjfc(©*) = p(jk\zk]&*) is the single-measurement posterior assignment proba-

bility, which is defined (and computed) as

W«n= . W'^' .. (75) {Ei,T?p.(^;9,*)}

For all values of k, these posterior probabilities satisfy

.;

£7^.(0*) = !. (76)

Conditional Expectation. Given the structure of equation (74), the conditional

expectation in equation (70) is a sequence of single-variable expectation operations,

which is expressed as a product of operators as

J>(JIZ;0*) = n|E^(0*)}- (77)

31

If evaluated as it stands, then all of the single-variable expectations in this expression

sum to one (hence the overall expression is one). Similarly, when the conditional-

expectation operator is applied to the log-CDLF in equation (73), all of the single-

variable expectations sum to one, except those for which the marginalization index

matches the index jk appearing in the log-CDLF. The auxiliary function therefore re-

duces to K j

Qj(0;0*,Z) = £ Y, 7^(0*){log^ + log^(zfc;^)}. (78) fc=i jfc=i

At this point, the index k on the assignment variable adds no meaning since the sum-

mation covers all possible values of the assignment. Furthermore, since jk is missing

data, there is no actual "observed value" for jk that is tied to a particular measurement

Zfc. This index is therefore dropped to obtain K j

Qj(0;0*,Z) = ]T E^(0*){loS^ + 1°g^(z^^)}- (79) fc=i j=i

Without the ^-dependence, the summation over j can be moved in front of the summa-

tion over k, and the auxiliary function can be decomposed into a collection of component

auxiliary functions, each of which depends on a separate subset of the model parameters.

This decomposition is defined as j J

Qj(0;0*,Z) = ^QJ(7r,;0*,Z) + ^gJ(^;0*,Z), (80)

where K ,.

Qj(7r,;0*,Z) = J2 7**(©*) logTrj, (81) fc=i

K

Qj(Oj;<S>\Z) = Y 7fcj(0*) logpjfaOj). (82) k=\

Given the mutually exclusive nature of the parameters in these various components, the

M-step of the EM algorithm decomposes into a set of independent optimization prob-

lems, one for each component. The assignment probabilities are obtained by maximizing

equation (81), and the parameters in each mode density are obtained by maximizing

equation (82).

Estimation of Assignment Probabilities. Equation (81) is maximized using La-

grange multiplier techniques to obtain the estimator

Kj = K > (83)

32

where KJ(@*) is the effective number of measurements corresponding to mode j given

by

K

«*(©•) = £7^(0*). (84) fc=i

The effective numbers of measurements for all modes satisfy

j

J2*A®*) = K. (85) j=i

Gaussian Mode Estimation. Each component in equation (82) is used to individ-

ually estimate the parameters in one of the modal distributions. For Gaussian mode

densities, pj(z;0j) = A/"(z; //;,Pj), the optimization problem decomposes into J inde-

pendent (single-mode) Gaussian estimation problems, which are discussed in appendix

D. Drawing on the results of that appendix, equation(82) is maximized by

1 K

h = ^?en £ fwCe*) z*> (86) f(e*)

K

P> = ^7&) £ 7«(0*) (zfc - A,) (* - Ai)T • (87) k=\

These Gaussian parameter estimators take the above form regardless of the parametric

structure of any other modes. That is, not all modes need be Gaussian.

4.3 ESTIMATION FROM HISTOGRAM DATA

To set the stage for histogram mixture estimation, the unconditional probability

of any given histogram cell under the mixture model in equation (67) is given by

$,(©) = / p(a;e) = Y^rtj f dzPj(z;Oj). (88) J Z( j_Q J Z(

This can be alternatively expressed by defining the "mode-bin probability"

<t>ti[ej) = j dzP]{z;03), (89) Jzt

such that the overall bin probability is

.)

**(e) = X><M^)- (90)

33

The "observed" data are the intensity vector

M={mi:€=l,...,L}, (91)

and the total number of measurements is

L

K = ]T mt. (92) £=1

As discussed in section 3.3, the intensity vector may contain place-holders for expected

intensities in the truncated region of a truncated histogram.

4-3.1 EM Algorithm with Missing Assignments and Measurements

Definition of Missing Data and Auxiliary Function. For histogram estimation

of mixture-distribution parameters, the missing data include the measurements and

mode assignments, which are denoted

Z = I ztk : I = 1,. • •, L, k = 1,..., me J ,

J = I jtk • I = 1, • • •, L, k = 1,..., mi \ .

The auxiliary function is then defined as

Qj,z(0;0*,M) = f dZT p(J,Z|M;0*) logp(J, Z,M; 0), (93) Jz j

where the marginalization operators for J and Z were defined in equations (71) and

(11), respectively, and are given for the current case as

L mi

dzek z,

L me ( J

E-nn E j e=i fc=i Kjik=Q

CDLF and Posterior Distribution. The CDLF is obtained using Bayes' rule as

p(J,Z,M;0) = p(M;0)p(J,Z|M;0)

L m(

= C(M) n n ^« Puk (z^; «i«). (94) e=i fc=i

34

whose logarithm is

logp(J,Z,M;0) = logC(M) + ^^{log7r^ + logpm(z,fc;^j}. (95)

The posterior distribution of the missing data is then obtained by dividing equation

(94) by the definition of p(M ; 0), given in equation (6), which yields

P(J,ziM:e-) = nn{^pp}. w

Conditional Expectation. Given equation (96), the conditional expectation opera-

tor is a product of operators given by

/ dZ £ p(J,Z|M;0*) = J] II IT• £ *L / d2ekP3ek(zek;0*ek)

(97)

Except when operating on some function containing z^, each term in the large brackets

is unity since

4>,(0*) = £ "J* / ^ Pn^k;9)tk). (98)

Therefore, as in previous cases, when the conditional expectation operator is applied

to the CDLF, all components of the expectation marginalize to one except those for

which the £ and k in the expectation match the £ and k in the CDLF. In fact, it would

be more correct notationally (but much uglier) to denote the indices in the expectation

operation as £' and k' to distinguish them from the £ and k that appear in the CDLF, and

then introduce a Kronecker delta function for bookkeeping. Whether said in symbols

or words, the net result is that all expectation terms marginalize to one except those

for which £' = £ and k' = k. Noting all of these "unit marginalizations" and dropping

the term logc(M), the auxiliary function reduces to

L me J

QJiZ(0;0*,M) = £ £ -j-• E <* / **%*(«*;*D e=i fc=i n ' jek=o Jz<

j lognjlk + \ogPjek (zek ; euJ } . x

At this point, the indices £ and k on the missing variables j and z impart no infor-

mation since the summation over the assignment variable and the integration over the

35

measurement covers all possible values, independent of I or k. They are thus dropped

to obtain

L mt -, J r-

gJiZ(0; 0*, M) = £ Y. ^7©^ E *J / dz P^Z s *;){ 1(w + 1(w(z; *;)} •

f=l jfc=i ^ > j=0 ^

Without the dependence on i and A;, the summation over j can be moved forward, and

the sum over k becomes a constant multiplier me, allowing the auxiliary function to be

expressed as

j J

Qj,z(0;0*,M) = ^Qj,z(7r,;0*,M) + ^ QJ)Z(^;0*,M), (99) 3=0 3=0

where the components in this expression are defined as

QJtZ(n3; 0*, M) = *J £ —±-r / dz Pj(z ; &*) log*,-, (100) p -j * V / «/ 2^

L .

gJ)Z(^;e*,M) = TT; £ —^ / dzP3(z-e*) logp^e,). (101)

With the exception of the factor 7r*, the form of this last component is identical to

the auxiliary function QZ(0J;0*,M) for the non-mixture case. During the M-step,

when equation (99) is differentiated with respect to 0j to find a critical point, the

derivative is n* times the derivative of Qz{0j\ 0*, M). When equated to zero, the

iTj term drops out and optimization of Qjtz(^j', 0*,M) is achieved by independently

optimizing Qz(0j',®*,M.) for each 6j, making all of the earlier (non-mixture) results

applicable. That said, the discussion below for Gaussian densities includes the factor

TTj in order for the effective number of measurements to fall out as an intermediate

variable. In effect, this just means that the expressions for /i- and Pj include a term

*?/*; -1. Estimation of Assignment Probabilities. Optimization of the assignment proba-

bilities is performed using equation (100) directly. A convenient form for that expression

is obtained by noting the definition of <Ev(0), and defining

•K* L dzpj(z;d*) 7^(0*) = ( / JZ\ ' -, • (102)

{EH)*? /**»(•;*;)} With these so defined, the expression for Qj,z(^j', 0*, M) then becomes

QJ,z(7rj;0*,M) = }f^me 7^(0*) I log^ . (103)

36

Aside from the weight me, this is identical in form to the case with observed measure-

ments. The EM update for the assignment probability is thus given by

1 L

*i = K Em^(0*)" (104) e=\

As was the case for observed measurements, this expression is independent of the form

of the modal distributions.

Gaussian Mode Estimation. When the jth mode is Gaussian, the corresponding

component of equation (101) becomes

L

Qj,z(0j;0*,M) = TT* £ -^ / dz^(z;/i;,P;) logA^(z;/x,,P,) . (105)

Substituting the Gaussian density in the log-CDLF term of the auxiliary function gives

Qz(ej;e*,M) = | loglp-11 J2 i^k) Jzt rfz^(z;M;,p;)

- y E ^y jZi d* ^ M;. P?) (• - ^)T P71 (• - Mi) • doe)

To parallel the non-mixture case, it is convenient to define the weighted "mode-specific"

local moments

utjty) = [ dztf(z;n*P*) z. (107) Jze

nej(0*) = f dzAf(z;/z*,P*) zzT. (108)

Also, the effective number of measurements corresponding to the jth mode is

«*(©•) = *;E $wr **«)• (109)

While the meaning of this variable is the same as in section 4, the definition is altered

relative to equation (84) to accommodate the missing measurements.

Again drawing on the results of appendix D, the estimators for the mean and

covariance of the jth (Gaussian) component are

1 L

^ = ^)^wi *i««W. (n°)

Pj -^£«^"JM!.W. d")

:$7

where f2^(0*, /x-) = Jz dz A/"(z; /x*, P*) (z — iXj)(z — fij)T is the mode-specific center-

shifted second local moment, computed as

ft»(«;,/y = n^p-2^ ««,(«;)+# &;(*;)• (112)

Estimating a Scalar Gaussian Signal in Uniform Noise. If the measurement

space is scalar, the signal distribution contains a single Gaussian component, and the

noise is considered to be uniform over the observable region of measurement space, then

the model distribution is the two-component mixture given by

p(z; TTS, //, a) = nsX(z; ix, a2) + (1 - irs) f - j , (113)

where w = max(2o) — min(2b) is the width of the observable region and of the uniform

noise distribution. Note that the above expression has been parameterized in terms

of the mixing weight irs for the signal component, as opposed to parameterizing in

terms of 7To = 1 — ns as was done earlier in the section to accommodate mixture signal

distributions.

One nice property of the model in equation (113) is that the noise component

contains no unknown parameters. The unconditional bin probabilities for the noise

component can therefore be computed once at the outset of the estimation algorithm;

they need not be updated in each EM iteration since nothing changes. Figures 5 and

6 show EM iteration results for the Gaussian-signal-in-uniform-noise model under two

scenarios. In generating both figures, the model was used to synthesize a number of

measurements, these measurements were binned into a set of histogram cells, and the

histogram intensities were used to estimate the model parameters. For figure 5, a

very large number of synthetic measurements was generated and binned such that the

histogram has a large overall intensity and the individual intensities are quite accurate.

For figure 6, a much smaller number of synthetic measurements was generated such that

the histogram has a small overall intensity and the individual intensities are noisy. As

one might expect, the final EM estimates computed from the noisy histogram data are

degraded from those computed from the accurate histogram intensities. They are still

quite close to the true values, however. Furthermore, as figure 6 suggests, the estimate

of the mean (i.e., the location parameter) is far superior to what would be obtained if

the location of the largest histogram count were chosen (i.e., peak picking).

38

0.05

0.025

0

0.1

0.05

0

0.2

0.1

0

0.2

0.1

Weighted Mixture Components Observed Hit Counts and Approximate Distribution

Iteration 0

-10 0 10

Iterj ition 10

/\ - / ^ -10 0 10

Iters ition 20

A -/ V -10 0 10

Iterc ition 58

A ' v -10 10

0.2

0.1

0.2

0.1

0.2

0.1

0.25

0.125

* •

• •

-10 0 10

• •

I -10 0 10

_/V r——* -10 0 10

l\ -10 10

Figure 5: Gaussian-Uniform Mixture Estimation, High-Intensity Data Histogram-based parameter estimation for a two-component mixture distribution with Gaussian and uniform components. Histogram data were obtain by binning K = 106 measurements that were generated from the ideal mixture distribution with parameters u = A, a = 1, and irs — 0.4. The large number of measurements gives "clean" histogram data, and the estimated parameters are extremely accurate at u = 3.99937, a = 1.00260, and 7rs = 0.39925. The first and last rows correspond to the initial and final parameter estimates, respectively. The second and third rows correspond to intermediate iterations of the EM algorithm.

39

0.05

0.025

Weighted Mixture Components

Iteration 0 ^->.

Observed Hit Counts and Approximate Distribution

Figure 6: Gaussian- Uniform Mixture Estimation, Low-Intensity Data Histogram estimation of parameters in the two-component mixture distribution described in the caption of Figure 5. In this case, however, parameters are estimated from "noisy" histogram data with K = 200. The final parameter estimates are fi = 4.03696, a = 1.18893, and 7rs = 0.49568.

40

5. SUMMARY

This report has examined histogram estimation techniques for inferring the statis-

tical properties of physical variables from observed intensity data. The motivation for

using these estimation algorithms is to reduce bias and other artifacts that can occur

when traditional peak-picking or simple interpolation algorithms are applied to intensity

data, particularly when the data exhibit significant spreading in the variable of interest.

The report opened with a high-level consideration of the theory, highlighting some of

the primary characteristics, issues, capabilities, and limitations of the theory and algo-

rithms. Great emphasis was placed on the fact that histogram techniques are based on

discrete stochastic theory (for integer-valued intensity data) and the issues that arise

when the theory and techniques are applied to real-valued acoustic energy data. As

it turns out, this does not present a significant difficulty except in Bayesian contexts

where the maximum likelihood (ML) estimate must be weighted with a prior distribu-

tion. The relative weighting of the prior and ML distributions remains a significant

open issue when using histogram methods with real-valued energy data.

After the introduction and consideration of issues involved with application of the

methods, the histogram estimation algorithms were developed from first statistical prin-

ciples by drawing on significant background material provided in a set of appendixes.

The algorithms were "built up" by considering the simplest cases first and then adding

complexity. The in-depth tutorial examination given here provides the detailed theoret-

ical developments that were omitted from earlier works such as McLachlan and Jones

[2]. This report also limited the discussion to static distributions, which are subtle

and interesting enough to warrant consideration in and of themselves. Limiting the

discussion to the static case also avoids the significant unresolved issues and notational

baggage that is incurred when histogram techniques are considered in the context of

dynamic tracking algorithms (e.g., see [3] and [5]). The overall goal of this report was

to arm the reader with adequate background and insights to apply and extend these

powerful and versatile methods for a variety of applications.

41(42 blank)

APPENDIX A

EXPECTATION-MAXIMIZATION ALGORITHMS

The goal of this appendix is to introduce the basic ideas, terminology, and no-

tation for the expectation-maximization (EM) approach to algorithm design, which is

the common thread through all of the algorithms discussed in this report. The EM

approach provides a general template for generating iterative maximum-likelihood al-

gorithms that estimate the parameter vector 0 in a probability density function (PDF)

p(x;0), given observations of the variable random variable x. The EM approach is

most useful when it is difficult to directly maximize p(x; 0) with respect to 0, but

there exists an auxiliary variable £, such that maximizing the joint PDF p(x,£;0) is

straightforward. The approach replaces a difficult nonlinear optimization problem with

an iterative sequence of easier problems. The EM method is extensively used in modern

statistical analysis because it greatly simplifies algorithm development for many prob-

lems, it has guaranteed convergence under some fairly general conditions, and many

statistical models have obvious choices for missing data. The general properties of

the EM algorithm are discussed in [1], and numerous applications and extensions are

discussed in the text by McLachlan and Krishnan [10].

Observed, Missing, and Complete Data. EM algorithms all share the charac-

teristic of having observed, missing, and complete data. The observed data consist of

samples of data that are at hand, representing either physical measurements from a

sensor or features that are computed from physical measurements. The collection of

such samples is denoted X = {xn : n — 1,..., N}. The density function p(X; 0) is

referred to as is the observed-data likelihood function (ODLF).

The missing data contain the information that, were it known in addition to the

observed data, causes the difficult estimation problem to reduce to a simpler one. For

a given set of observed data X, the corresponding collection of missing data is denoted

S = {£„, : m = 1,... , M}. The concatenated set containing both the observed and

missing data is referred to as the complete data, and the joint distribution p(X, S; 0) is

referred to as the complete-data likelihood function (CDLF). As mentioned above, the

fundamental premise of EM is that p(X, 2; 0) is much easier to optimize with respect

to 0 than isp(X;0).

Iterative Structure and Auxiliary Function. The EM algorithm is iterative. It

requires an initial estimate for 0, which is obtained using an application-dependent sub-

optimal algorithm (the preferred method), or by random selection within some reason-

able range of values. Denoting the initial estimate by 0°, the EM algorithm generates

A-l

the sequence of estimates denned by

0i+1 = argmaxQa(©;©',X), (114)

where <5 = (0;0\X) is the so-called auxiliary function, which will be formally defined

in a moment. While the literature typically denotes the auxiliary function simply as

Q(0;©*), this report discusses a number of different auxiliary functions with differ-

ent sets of observed and missing data. To more easily distinguish these various cases,

the notation used here includes a subscript to denote the missing data and an addi-

tional argument to denote the observed data; thus the subscript S and argument X in

<5S(©;0*,X).

When iterating equation (114) to optimize the ODLF, the iterations are termi-

nated either at a pre-selected number of iterations or when some set of convergence

criteria is satisfied. When convergence tests are used, they are typically based on the

relative change in the observed-data likelihood and/or values of the estimates, similar to

convergence tests for standard nonlinear optimization techniques like the Newton and

gradient ascent algorithms.

Equation (114) includes the iteration number as an index. When using the EM

approach, there are often a large number of other indices that track a number of other

characteristics, making index variables a rare notational commodity. It is therefore

convenient to define the parameter estimates as 0 = 0t+1 and 0* = 0\ such that the

"typical" EM iteration is

0 = argmax Qs(0;0*,X). (115)

The auxiliary function <5H(©;@*,X) is defined as the expectation of the log of

the CDLF, conditioned on the posterior distribution of the missing data, which is stated

mathematically as

QH(0;0*,X) = £S|X;e.{logp(X,S;0)}

= y<iHp(E|X,0*)logp(X,S;0). (116)

Here the integral notation f dE is used to indicate marginalization over the missing

data 3; in general, this is a sequence of continuous integrations, discrete summations,

or both. Since missing data with both continuous and discrete variables are common,

sequences that mix integrations and summations are also common.

The basic idea behind EM is well illustrated from the viewpoint of iterative mi-

norization (IM), which is an even more general class of algorithms to which the EM

A-2

method belongs. Figure A-l shows two iterations of a generic IM algorithm. As dis-

cussed in the figure caption, each iteration maximizes an auxiliary function that "mi-

norizes" the function being maximized, which for the EM algorithm is the ODLF. That

the EM auxiliary function minorizes the ODLF follows from the nature of missing data,

in the sense that including a stochastic nuisance parameter in a model always drives

down the likelihood relative to the nuisance-free model because of the added uncertainty

associated with the nuisance parameter.

Expectation and Maximization Steps. Each EM iteration consists of an ex-

pectation step (E-step) and a maximization step (M-step). The E-step involves eval-

uating equation (116), and the M-step involves computing estimates that maximize

equation (116) with respect to 0. The process for deriving the auxiliary function in

the E-step is fairly universal across applications and is largely an exercise in conditional

probability. The steps are to (1) define the observed data and the ODLF, (2) define

the missing data and the CDLF, (3) determine the posterior distribution of the missing

data, and (4) carry out the conditional expectation to obtain the auxiliary function. In

many applications, the missing data are chosen such that the conditional distribution

p(X| S; 0) can be written in closed form, and there exists a known unconditional (prior)

distribution p(S; 0). In such cases, the CDLF is obtained as

p(X, S; 0) = p(X| S; 0) p(S; 0). (117)

Given the ODLF and CDLF, the posterior for the missing data is given by

*«*•>-ISP- <118> In other applications, it may be more convenient to first derive the posterior distribution

p(S|X; 0) and then obtain the CDLF as

p(X,S;0) = p(S|X;e)p(X;0). (119)

The preferred order for evaluating these distributions depends on the structure of the

missing data and the distributions involved, but some form of these steps will be en-

countered whenever EM is applied in a new situation. The final stage in deriving the

auxiliary function involves carrying out the conditional expectation, which is usually

simplified by noting independence or conditional independence among the variables.

More will be said about this process in the context of particular problems.

While the details of the M-step cannot be specified without formulating a partic-

ular parameterization of the CDLF, these details typically follow from standard con-

A-3

e *-\ e

Figure A-l: Iterative Minorization (IM)

Two IM iterations are shown; the first depicted in red and the second in

green. The function L{6) to be maximized is shown in black. The first

iteration maximizes an auxiliary function Q{0e\0e'~x), which "minorizes" e\ni-l^ the likelihood function at 0 ; that is, Q(6 \6 ) is strictly less than L(6

at every point 6 except 6 ~ , where the two functions share the same

function value and first derivative (i.e., the two functions are tangent at

Oe~l). Under these minorization constraints, maximization of Q((r\fr~ )

is guaranteed to increase L(0) unless Oe~l is already a stationary point of

L{6), in which case no changes take place. The second iteration repeats

the process with the tangent constraint imposed at the point 0e produced

by the first iteration.

A-4

strained or unconstrained optimization methods. For example, a stationary point is ob-

tained by equating to zero the derivative of QH(0; 0*, X) with respect to each element

of 0. The stationary point is a maximum if the second derivative (Hessian) matrix is

negative definite (i.e., all of its eigenvalues are strictly negative) at the stationary point.

Fortunately, the auxiliary function is concave for many problem formulations involving

Gaussian or Gaussian-mixture model densities, which means that a stationary point for

such an objective function is the unique maximum. Appendix D demonstrates that the

auxiliary function for Gaussian histogram estimators is, indeed, concave.

A-5(A-6 blank)

APPENDIX B

COMBINATORIAL PROBABILITY DISTRIBUTIONS

Histogram-based methods are inherently combinatorial, since they consist of look-

ing at all combinations of ways that events can be grouped into a collection of histogram

cells. This appendix reviews the binomial, negative binomial, and multinomial distri-

butions to establish notation and to give an easy single reference for these basic distri-

butions.

Binomial Distribution. Perhaps the simplest experimental situation involves a

series of Bernoulli trials, whose outcomes are limited to one of two categories, "success"

or "failure". A common question with regard to Bernoulli trials concerns the number of

successes that might be observed over the course of n independent trials. This situation

is described by the binomial distribution. Let the probability of success in any single

trial be denoted by 0, such that the probability of failure is (1 — 0). To have exactly m

successes in n independent trials, there must also be (n — m) failures. The probability

of the m successes is 0m; the probability of the (n — m) failures is (1 — 0)(n_m); and

there are b(m, n) ways in which this situation can occur in n trials, where b(m, n) is the

binomial coefficient:

b(m,n)= "' (120) ml (n — m)\

The binomial distribution measures the probability of the event "m successes out of n

trials." The distribution is therefore defined as

p(m; 0. n) = b(m, n) 0m (1 - 0)(n-m>, (121)

which satisfies the marginalization constraint

^p(m;0,n) = 1. (122) m=0

Negative Binomial Distribution. In the binomial distribution, the number of trials

is a parameter, and number of successes is a random variable. Suppose, however, that

the question is turned around to ask "How many trials are needed to achieve a certain

number of successes?" Here, the number of successes is the parameter, and the number

of trials is the random variable. This situation is described by the negative binomial

distribution

p(n ; 0, m) = b~(n, m) 0m (1 - 0)(""m) , (123)

B-l

where b (n, m) is the negative binomial coefficient, defined as

b~(m, n) = b{n - m - l,n - 1) = -. V~/ —rr • (124) (m — l)!(n — m)!

The negative binomial distribution satisfies the marginalization constraint

oo

£p(n;0,m) = 1. (125) n=m

Multinomial Distribution. Now consider an independent series of trials in which

each trial has L categories of possible outcomes, rather than simply success or fail-

ure. With no other information, the probability of an outcome occuring in the ith

category is denoted <pe, and the collection of probabilities for all categories is 0 =

{4>e : I — 1,. •., L}. When actually performing a series of trials, an observed outcome in

one of the categories is referred to as a "hit" in that category. The number of hits ob-

served in each category over the series of trials is denoted me for £ = 1,..., L. These "hit

counts" are collected in the vector M = {me : (. = 1,..., L}, whose statistical properties

are governed by the multinomial distribution

L

p(M;0) = c(M)J] ft', (126) e=i

where c(M) is the multinomial coefficient, defined as

(Zti me) ! c(M) = f- 4-. (127)

Unlike the binomial distribution, the number of trials is not an explicit parameter

because the multinomial distribution implicitly assumes that the sum of the hit counts

for all categories equals the number of trials (i.e., that all outcomes are counted). If

this is not the case in a given application, then the multinomial distribution is not the

correct model.

Equation (126) assumes that any outcome not fitting into one of the defined

categories has zero probability of occurrence. Alternatively stated, it assumes that,

with probability one, all possible outcomes fall within one of the categories, such that

L

£>< = ! • (128) If, on the other hand, events occur with nonzero probability that do not fit any of the

categories, then the statistical model must be modified. In particular, consider the case

B-2

of "pre-sereened" data, where any trial whose outcome does not fit a given category

is ignored entirely; not only is the hit not recorded, but the counter that records the

total number of trials is not incremented. The sum of hit counts equals the number of

recorded trials, so the general form of the multinomial distribution is still valid. The

probabilities for the categories, however, must be normalized to account for the fact that

the universe of outcomes has been projected down to the union of the known categories.

Defining the probability that a hit occurs in any of the categories as

L

allows the multinomial distribution for this case to be defined as

L / , \ me L

p(M;0) = c(M) II (T) = C(M) *"""' II #* . (13°) e=i ^ ^ ' t=\

where m is the total number of trials that pass the screening process; that is,

L

m = y. me • (131)

Negative Multinomial Distribution. When applying the multinomial distribution

to pre-screened data, the distribution is useful for making inferences only about things

that occur within the observable categories. It is sometimes desirable to extrapolate

inference into categories that cannot be observed. This situation can be thought of in

terms of Bernoulli trials by grouping the observable categories into one single super-

category. The universe of possible outcomes can then be partitioned into two classes,

with a measurement either falling within the observed super-category (a "success")

or falling outside of this super-category (a "failure"). Treating the sum of observed

hit counts as the number of successes in a sequence of Bernoulli trials, the unknown

total number of trials required to have generated these successes is a random variable

governed by the negative binomial distribution in equation (123), with 0 defined by

equation (129) and m given by equation (131). This analysis can be taken a step further

by subdividing the space of unobservable outcomes and making inferences about these

unobservable categories. This last situation is described by the negative multinomial

distribution, which is discussed further in appendix C.

B-3(B-4 blank)

APPENDIX C

STATISTICS OF MISSING HIT COUNTS

This appendix derives the posterior distribution p(Mj | M0; 0) of the missing in-

tensities given the observed histogram counts, and then uses this posterior to obtain

the expected intensities in the unobserved histogram cells.

Posterior Distribution. Development of the posterior density involves making a

number of structural observations concerning the component distributions involved.

This discussion is facilitated by defining the "number of truncated measurements" as

the discrete random variable

L

e=L0+i

The total number of (observed and unobserved) measurements is then the discrete

random variable defined by

Kc = Ko + KT. (133)

Observing that the regions Zo and Zi are disjoint in Z, the posterior distribution

satisfies

p(MT|Mo;0) = p(MT|A'o;0)- (134)

That is, from the standpoint of Zi, it does not matter how the measurements in Zo

are distributed within Zo; only the total number of hits in Zo is important since

that provides evidence for inferring the total number of truncated measurements. The

second structural observation concerns the appearance of these hit-count totals in the

distribution functions. For example, note that in the presence of MT, the variable Ky

carries no additional statistical information, such that

p(MT | A'0 ; 0) = p(MT, Kr \KO;0) 6 I KT - ]T m A , V e=LQ+\ )

where S(-) is the Kronecker delta function, whose value is one for an argument of zero

and zero otherwise. The delta function is introduced to ensure that the distribution has

nonzero probability only when equation (132) is satisfied. Having said this, this delta

function will be dropped to ease the notational burden, with the understanding that

equation (132) is indeed satisfied. A similar observation can be made concerning KQ

C-l

and KT- That is, given the observed hit total Ko, the variables KQ and Ky carry the

same information, such that

p{KT\Ko;0) = p(Kc\Ko-0)S(Kc-Ko-KT) (135)

As before, the Kronecker delta is dropped, with the understanding that equation (133)

is satisfied. Given these observations, the posterior distribution for the missing data is

given by

p(MT\Ko;0) =P(MT,KT\KO;0)

= p(MT\KT,Ko;0) p(KT\K0-6)

= p(MT\KT;0)p(Kc\Ko;O), (136)

where the first term in the last expression follows because once KT is given, Ko does

not bring any additional information concerning MT- The variable MT merely splits

the KT measurements among the Lx bins in ZT, independent of what happens in ZQ.

The conditional distribution |)(Mx | KT ', 0) thus corresponds to KT independent draws

from LT categories, and is given by the multinomial distribution

P(MT\KT;0) = C(MT){M0)Y II {<M0)}m (137) i=L0 + \

The distribution p(Kc \ Ko ', 0) pertains to the total number of Bernoulli trials Kc that

must be executed in order to achieve Ko successes, where a "success" is a hit anywhere in

Zo- The total number of hits is therefore governed by the negative binomial distribution

in equation (123). Noting the relationships between Ko, KT, and Kc, and those between

4>o(0) and <pr{0), the negative binomial distribution for this situation is expressed as

p(Kc\Ko;0) = b-(Kc,K0) { <M0) }*" {<M0)}*T, (138)

where b~(Kc, Ko) is the negative binomial coefficient defined in equation (124). Sub-

stituting equation (137) and equation (138) into equation (136) then gives the posterior

distribution of the missing intensities as a negative multinomial distribution, which is

defined for the present situation as

(139)

C-2

where c (Mj, A'o) is the negative multinomial coefficient, defined as

c-(MT,K0) = c(MT) b-(KcKo) = - (^c-1)! (14Q)

(^o-l)!{ntL0+1^'}'

Expected Values. The expected intensities axe obtained by computing the first mo-

ment of the posterior distribution. Dropping the explicit dependence on the parameter

vector 0 and expressing the negative multinomial coefficient directly in terms of the

missing intensities instead of total hit counts allows the posterior to be written as

(K0 - 1 + Yli-L +i me) ! £ p{Mr\K0) = ±- ;\;Lo+1 > 4>go J] 0-. (141)

(^o-i)!{ntL0+i^'} ^0+i The expected value of the (typical) missing hit count, m^ is defined for Lo < C < L as

£{mc|M0} = ^m<p{Mr\K0), (142)

where the marginalization operator was defined in equation (39), which is repeated here

for convenience as

L f oo ^

E- n E • (i«) MT e=L0 + l K me=0 )

The key piece of information for taking the expectation is the normalization property

of the negative multinomial distribution, that is,

Y,P(MT\K0) = 1. (144)

Forming and simplifying the product m^p(Mx| K0) is greatly facilitated by defin-

ing the variable

me = < (145) me if £ ^ C

me-l if £ = <" -

With this variable so defined, the product m^^Mxl Ko) can be written as

(K0 - 1 + ELL +I rne)\ L

mQp{MT\K0) = ±- f " -AT 0g° <t>7- (146) (Ko-l)\{UlLo+lrh(\\ e=Lo+i

The desire is to obtain an expression that is equal to within a scale factor of the nega-

tive multinomial distribution with the variable me instead of m^. If such an expression

C-3

can be obtained, then the negative multinomial portion marginalizes to unity and the

desired expectation is the scale factor. To get the numerator of the negative multino-

mial coefficient in terms of fhe while retaining the desired structure, it is required to compensate by introducing the variable

KQ = KQ + 1 (147)

such that the numerator is given by

(148)

Now, the marginalization property of the negative multinomial distribution is valid

regardless of the actual value of Ko, so long as the structure is retained and the same

value appears everywhere. The new observed count variable is thus substituted and compensated everywhere it appears in equation (146) to obtain

/ , N (*o-i + £iLo+i^)! 4° A mcP(MT Ko) = f\ , f -T^T- U * (149)

Substituting and compensating the variable me in the final product term and rearranging

then yields

- 1 + Etio+i •<) ! (K0

mcp(MT|^0) ---- -T-h { "V" UKO-1)\{YILLO+I•'

]} "0+1

4° n * 7Tl£ (150)

Since the term in brackets is exactly the negative multinomial distribution in the vari-

ables fhe and Ko, and the scale factor is independent of any of the truncated me, the bracketed term marginalizes under the expectation, leaving the desired expression

(151)

C-4

APPENDIX D

ESTIMATING GAUSSIAN PARAMETERS

This appendix derives estimates and proves the maximality of those estimates for

parameters // and T in the objective function

Q{0) = Ype[ dzAf(z; C, S) log M(z; /x, T) , (152) = Y P< j dzM^ C,S) logjVfz; /z, T

where pe is a non-negative weighting function, A/"(z; £, S) is the observed measurement

density

AT(z;C,S) = |27rS|-1/2exp|-^(z-C)TS-l(z-C)} , (153)

and logA/"(z; //, T) is the model likelihood function (i.e., the log of the model density

function). Omitting the constant log(27r) term, which does not affect the parameter

estimation problem, the model likelihood is

iogAf(z; /x, r) = l- log | r I - l- (z - /i)T r (z - M), (154)

which is parameterized here in terms of the inverse covariance matrix (or information

matrix) T = P_1 to ease the math. The parameter estimates that maximize the ob-

jective function are identified below. The integrations are carried out first to obtain an

expression that is cast in terms of a set of effective measurements. The resulting expres-

sion is then maximized by finding the stationary point of the function and by proving

that the function is concave, such that the stationary point is the unique maximum.

Effective Measurements. Given the form of the model likelihood function, it is con-

venient to similarly decompose the objective function into determinant and quadratic-

form components, giving

Q(0) = m(0) - m(d), (155)

where

Vtf) = \il P' f dzAf(z;CS) log (156)

= \J2 P( I dzX(z;C,S) (z-/x)Tr(z-/i (= 1 * Zi

»»(*) = oLPW dzA/-(z;C,S) (z-M)Tr(z-/x). (157)

D-l

The determinant component is rewritten by defining the bin probability value

(158)

and the effective number of measurements

(159)

to obtain

K m(0) = -log (160)

Equation (157) is rewritten by expressing the quadratic form as the trace of an outer

product, which gives

'/2 1 L f

(*) = 2 Z) P< / <fcjV(z;C,S) tr{r(zzT-2z/xT + /x/xT)} e=i ^Z(

= \Y, Petrlr J dzAf(z;CS) (zzT - 2z/xT + M/xT) }

1 L

= 7, Yl Pf<Mr{r (n«-2w^T + ^/iT)} (161)

i.

= nEpt &tr {r \Pe+(w' ~ >*) (w< ~ A*)T } i (162)

where

fte = 1 / dzAA(z;C,S) ZZ

SZf = J7^ — u?^u>^ .

(163)

(164)

(165)

Vector u)e is the normalized first moment (centroid) and matrix £le is the normalized

second moment of Af(z; C> S) when restricted to the region Ze. Vector u>e is also referred

D-2

to as an "effective measurement" for reasons that will be made clear in a moment-

Matrix Cl( is the second central moment (covariance matrix) of A/*(z; £, S) in Ze.

An interesting form of the objective function is obtained by defining the weighted

sum of covariance matrices

1 J^ ^ = - /, Pe <Pe ?tt (166)

e=i

such that, when 772(6?) is re-combined with the determinant term r)\{0), the overall

objective function becomes

Q(6) = tr {r ft} + ^ Pe <f>t log Af{ut] ^ T) . e=i

(167)

Aside from the trace term, which reflects the cumulative measurement uncertainty

within all bins, the objective function is equivalent to one in which the effective measure-

ments u)( are observed point measurements. Indeed, from the standpoint of estimating

the mean, this problem is identical to one in which the u>i are observed point measure-

ments since the trace term is not a function of //.

Location of the Stationary Point. The stationary point of the objective function is

found by equating to zero its partial derivatives with respect to the various parameters.

Noting equation (161) and the identity

d_ dx

tr{AxxT} = 2Ax, (168)

the derivative of the objective function with respect to the mean is given by

dQ(0) dm(0) dp dfx

= T^2 pe 4>e {u)( - n) (169) e=i

Equating to zero gives the stationary value for the mean as

(170)

The information matrix is estimated by differentiating Q{0) with respect to T,

equating the result to zero, and substituting in place of \x the stationary value ft.

Expressing the objective function as the sum of equations (160) and (162), and drawing

D-3

upon the derivative identities

dXlog A

and

= A"1

d__ dA

tr{AB} = B

(171)

(172)

(173)

gives the desired derivative as

dQ(0) « _ 1 dT

(174)

Equating to zero and substituting p, then gives the stationary value

A-1 1 P = r = - 2_^ Pi 4>i |o< + (<*>< - /i) (w< - A) |

€=1

(175)

Maximality of the Stationary Point. A stationary point is a local maximum of

the objective function if the Hessian matrix of the objective function is negative definite

at the stationary point. Construction of "the Hessian matrix, however, requires taking

second derivatives of the function with respect to all possible pairwise combinations of

the elements of the mean vector and information matrix, which can be a formidable task.

If the objective function is radially concave, however, then it has the nice characteristic

that there is only one stationary point and it is the unique global maximum. The

remainder of this section parallels a proof by Liporace [11] of a test for radial concavity,

which does not require forming all of the second partial derivatives.1 Now, if the function

fails the concavity test, a critical point may still be a local maximum, so the test is not

conclusive in that regard. If, on the other hand, the function passes the test, then no

further analysis is needed.

The basic idea behind the test for radial concavity, which is illustrated in fig-

ure D-l, is to formulate a family of one-dimensional trajectories along the function

that cover every possible direction from the stationary point, and then show that these

one-dimensional trajectories are all individually concave. If the objective function is

continuous at the stationary point, and the "hill curves downward" in every direction 1 Indeed, the test outlined below would be a test of absolute concavity were it not for the pathological family of

functions in which non-concave behavior can occur along contour lines that are perpendicular to radial lines emanating

from the critical point, which is the motivation for the term radially concave. Even in this pathological case, however,

the stationary point is still the unique maximum.

D-4

Figure D-l: A Test for Radial Concavity. The surface plot and equal-value contour lines depict a hypothetical concave function /(x) of the two-dimensional variable x. The "stem, lines" indicate the locations of the stationary point x (black), an arbitrarily chosen reference point x, (blue), and the points x (green) that satisfy the constraint x = Axr + (1 — A) x for A = 0.1. 0.2 0.9. Over the con- tinuum 0 < A < 1, the locus of x forms a one-dimensional trajectory (a subsegment of which is indicated by lower green line) that emanates outward from x along the line passing through x and x,., but in the direction opposite to x,. The upper ends of the stem lines correspond to the function values /(x), /(xr), and /(x). Given particular values of x and xr, x is completely determined by X, such that /(x) can be expressed as /(A) (upper green line), which follows the curvature of /(x) along the one-dimensional trajectory. By varying x,., this trajectory can be made to extend from x in every possible direction. Thus, if /(A) is necessarily concave with respect to X, regardless of the value of x,, then the overall function is radially concave and the stationary point is necessarily the unique maximum.

D-5

away from the stationary point, then the stationary point must be the unique maximum.

This approach reduces the maximality test to evaluating a single second derivative with

respect to a scalar variable.

To set up the concavity test, each set of parameter values in 0 = {/z, T} is viewed

as a "point" in the domain of the objective function (i.e., parameter space). From any

such point, a family of radial trajectories is obtained by defining a "reference point"

0r = {nr, Tr} and a set of "locus points" 0 = < /2, F > that satisfy

fi = A//r + (l-A)/2, (176)

r = Arr + (i-A)f, (177)

where A is a scalar variable satisfying 0 < A < 1. Substitution of these expressions into

the objective function generates a modified objective function over a restricted parameter

space. For a given 0r, the values of 0 fall along a radial line emanating from (but not

including) 0, in the direction opposite 0T. The direction of 0 from 0 is thus a function

of the reference point location. The distance of 0 from 0 varies as a function of A, with

a scale factor equal to the distance from 0 to 0r. The reference point 0r is treated as

an implicit parameter that is arbitrarily chosen but is a fixed constant once it is chosen.

The modified function is then defined as

Q(A, 0) = Q (A 0r + (1 - A) §) , (178)

where the short-hand notation 0 = X0r + (1 — A)0 is used to indicate that equa-

tions (176) and (177) are satisfied. To show that the objective function is concave,

the modified objective function is twice differentiated with respect to A. The resulting

second derivative is evaluated at the stationary point 0 = 0, where it is shown to be

manifestly negative. The information term 771 (0) and error term 772(0) are again treated

individually. Both terms contribute negative values to the overall second derivative.

The first component is expressed in terms of A by substituting equation (177) into

equation (160), which gives

77i(A,0) = |log|Arr + (l-A)f|. (179)

When evaluating this determinant, it is helpful to observe that the constraint on the

information matrices in equation (177) is satisfied for all A, which implies that T, Tr,

and F are all diagonalized by the same set of eigenvectors. Therefore, there exists an

orthogonal matrix U for which

UTUT = D = U {Arr + (1-A)f} UT = ADr + (l-A)D, (180)

D-6

where D = diag(di,... ,C?M), Dr = diag(dr>i,.. . ,dr,M), and D = diag(di,... ,d\i) contain

the eigenvalues of T, Tr, and T, respectively. The first component of the modified

objective function is thus given by

M

fh(X,9) = ^log|ADr + (l-A)D| = ^]Tlog{A^ + (l-AK} , (181) i=\

which is differentiated twice with respect to A to obtain

u2 M (d • — d 1 M 1 o

Sg*^—S{A^(l-A)<}'--"^g^-*)- (l82> l'=l

The numerator in the summand of equation (182) is strictly positive since dr; and d*

cannot be equal. In general, the denominator is only non-negative (i.e., T will have

zero-valued eigenvalues somewhere), but d? is strictly positive at the stationary point,

as long as the number of independent "measurements" uj£ is larger than the dimension of

the measurement space. Finally, since K is also a positive number, the overall expression

for the second derivative is therefore negative, regardless of the value of rr.

Turning now to the quadratic-form component, the manipulations involved with

evaluating 772 (0) are eased by defining the intermediate variables

d = fi-nr, (183)

A = f-Tr, (184)

ye = ue- p,, (185)

such that

T = AI\. + (l-A)f = f-AA. (186)

U3i — fj, = oj( — A/xr — (1 — A)/2 = y^ + A<5. (187)

The modified function is evaluated by substituting equations (176) and (177) into equa-

tion (162) and collecting terms in A to obtain

*(*•*) = 1 Y^pe&txl f J7n + fyriy^ + A(2fyn(5-Af2„-Aynyj)

+ X2(T8ST - 2A yn<5T) - A3 (A S6T) I. (188)

D-7

Differentiating twice with respect to A and performing some algebra then gives

}2 L

j^rf2(A,0) = KSrTS-2X6T A \ £ Pt fa («„-/*) \ . (189)

At the stationary point, the mean vector must satisfy ^2e=1 pe<j>e (<*>e — A) = 0; sucn

that the term in braces in equation (189) is zero. Combining these results with equa-

tion (182) gives

rfi M 1 2 — $(\,0) = -K^-(dri-i) -K6

TT6<0, (190)

i=l "•i

where 6TT 6 is necessarily positive because T is positive definite. This expression is

independent of A and its sign is independent of the value of 0r. The original objective

function therefore must be concave, and the stationary point is the unique maximum.

D-8

APPENDIX E

GAUSSIAN MOMENTS IN AN INTERVAL

When implementing histogram-based estimation methods, it is necessary to evalu-

ate the bin probabilities and the first two local moments in each bin, which are developed

in this appendix for the scalar Gaussian density

p(x\ fi, a2) = (2TTCT2)

2 exp I - — (x - [if \ (191)

The integral expressions required for these probabilities are not very exciting or inter-

esting, but they are quite useful. In what follows, the bin-probability and local-moment

computations are outlined for a "typical" bin interval / = [a, b).

Bin Probability. The bin probability is given by the integral

4,(1 '"•<72) = 7Sp/exp{"^(l"")2}d1' which is evaluated by performing the change of variables

1 y =

v7^ (X- fi) X = V2a2 y + n => dx = V2a2 dy

to obtain

(f)(I, fi, a2) IK Ja V^

dy,

where the modified integration boundaries are

Q = 1

(a- -**)•

0 = 1

y/2a* (b- -/,).

Given a computational routine for the error function

2 fu

erf(u) = -= /

the bin probability is computed as

e x dx,

4>(I^,a2) = i{erf(/?)-erf(a)}

(192)

(193)

(194)

(195)

(196)

(197)

(198)

E-l

dx

First Moment. The first moment in the interval is

w(/'",ff2) = 7ib/iexp{~"^(1"'')2} Adding and subtracting // fa e~^x~^ ^2<T ^dx yields

u{I,Hi02) = u(I,n,a2) + n0(1,n, a2),

where £j(I, fi, a2) is the first central local moment

u(I,n,a2) = —_ / (x-fi) exp|-^(x-/x)H dx.

This central moment is evaluated by performing the change of variables

(199)

(200)

(201)

y=^-2(x-»)2^(x-fi) = ±V2^ y1'2 =* dx = ±-j= y^dy (202)

to obtain

^^ = w,£e-Vd^wA^-e-62)- (203)

The first moment is therefore

u(I,fi,a2) = -^= (e-^-e-^+zi^/,//^2). V27T ^ '

(204)

Second Moment. The second moment in the interval is

Completing the square in the moment variable yields

Q{I,^a2) = Q{I,n,a2) + 2nuj(I,LL,a2)-n2<f)(I,n,P2),

where Cl(I, fi, a2) is the second central local moment

£1(1, //,a2) = —_ (x- n)2 exp | -— (x - n)2 \ dx.

(205)

(206)

(207)

This expression is simplified using the change of variables defined in equation (193) to

obtain

Cl(I^,a2) = -J= A2<7V) e-S (>/Wdy) = % f y2 e^ dy, (208) \l2-KOl Ja V ' V71" Ja

E-2

where a and j3 are denned in equations (195) and (196), respectively. Using the

integration-by-parts formula fudv = uv — fvdu with dv = e~y dy and u = y2

(such that v = —e~~y /(2y) and du = 2ydy), the second central moment is

n(/,//,a2) = ^= {-\ye~y2 + />"*} (209)

V7r V /

The overall second moment is therefore given by

2

Q(/,/i,a2) = ^=(ae~a2 -pe-(32^+2a24>{I1^a2)+2fiio(I,^a2)-n2<P(I,fi,a2).

Substituting the definition of u;(/,/x, a2) given in equation (204), collecting like terms,

and defining the function

/? fi" c(x) = v* cr/i + \ / 7T

a2 x (210)

gives the second moment as

Q(I, /*, a2) = C(Q) e"Q2 - c(/?) e"^ + (2a2 + //2) 0(7, //, a2) (211)

Underflow Issues. The expressions given above for the bin probabilities and mo-

ments become unstable when interval I is far removed from the mean yu. In these

regions, underflow issues become a dominating factor. This difficulty is easily avoided,

however, by restricting the range of integration to those bins that lie within, say, ±8

standard deviations of the mean, effectively treating the remaining bins as a "set of

measure zero."

E-3(E-4 blank)

REFERENCES

1. A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood Estimation

from Incomplete Data," J. Royal Stat. Soc. (B), vol. 39, no. 1, 1977, pp. 1 38.

2. G. J. McLachlan and P. N. Jones, "Fitting Mixture Models to Grouped and Trun-

cated Data via the EM Algorithm," Biometrics, vol. 44, 1988, pp. 571-578.

3. T. E. Luginbuhl, Estimation OF General Discrete-Time FM Processes, Ph.D.

thesis, University of Connecticut, 1999.

4. T. E. Luginbuhl and P. Willet, "Estimating the Parameters of General Frequency

Modulated Signals," IEEE Trans. Signal Process., vol. 52, no. 1, January 2004, pp.

117-131.

5. R. L. Streit, "Tracking on Intensity-Modulated Data Streams," NUWC-NPT

Technical Report 11,221, Naval Undersea Warfare Center, Newport, RI, 2000.

6. K. Fukunaga, Statistical Pattern Recognition, Academic Press, New York, 1990.

7. Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Association, Academic

Press, New York, 1988.

8. J. F. C. Kingman, Poisson Processes, Oxford University Press, New York, 1993.

9. R. A. Redner and H. F. Walker, "Mixture Densities, Maximum Likelihood, and

the EM Algorithm," SIAM Review, vol. 26, no. 2, 1984, pp. 195-239.

10. G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, John Wiley

& Sons, Inc., New York, 1997.

11. L. A. Liporace, "Maximum Likelihood Estimation for Multivariate Observations of

Markov Sources," IEEE Trans. Informal Theory, vol. 28, no. 5, 1982, pp. 729-734.

R-KR-2 blank)

INITIAL DISTRIBUTION LIST

Addressee No. of Copies

Office of Naval Research (Attn: John Tague) 1

NSWC Panama City (Attn: James Cobb) I

Defense Technical Information Center 2

Center for Naval Analyses 1

University of Florida (Attn: K.C. Slatton) I

University of Connecticut (Attn: Peter Willett) 1

Metron Inc. (Attn: Roy Streit) 1

20090914183 - DTIC · 2011-05-14 · This report examines histogram estimation methods for...

Documents

Transcript of 20090914183 - DTIC · 2011-05-14 · This report examines histogram estimation methods for...