Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data. Igor V. Cadez, Padhraic Smyth, Geoff J. McLachlan, and Christine E. McLaren, Machine Learning 2001 (to appear). O, Jangmin, 2001/06/01.


Transcript of Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Page 1: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Igor V. Cadez, Padhraic Smyth, Geoff J. McLachlan, and Christine E. McLaren,

Machine Learning 2001 (to appear)

O, Jangmin

2001/06/01

Page 2: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Introduction (1)

Fitting mixture models to binned and truncated data by maximum likelihood (ML) via the EM algorithm.

Binning: measurements have finite resolution, so real-valued variables are quantized.

Truncation: values outside the measurable range are not recorded.

Motivation: diagnostic evaluation of anemia. The volume of red blood cells (RBC) and their amount of hemoglobin are measured by a cytometric blood cell counter (Bayer Corp.).

Page 3: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Figure 1

Page 4: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Introduction (2)

Data in the form of histograms: computer vision, massive data sets, …

Binning: measurement precision.

Truncation: limitation of the range of measurement, intentional truncation, …

EM framework: the missing data are the original data points.

Page 5: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

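Binned and Truncated Data

Sample space: partitioned into $v$ mutually exclusive regions $H_r$ ($r = 1, \dots, v$).

Observation: only the number $n_r$ of the $Y_j$ that fall in $H_r$ ($r = 1, \dots, v_0$) is recorded ($v_0 \le v$).

Observed data vector:

$$a = (n_1, \dots, n_{v_0})^T, \qquad n = \sum_{r=1}^{v_0} n_r$$

$a$ has a multinomial distribution with $n$ draws and category probabilities $P_r(\Theta)/P(\Theta)$, where

$$P_r(\Theta) = \int_{H_r} f(y; \Theta)\,dy, \qquad P(\Theta) = \sum_{r=1}^{v_0} P_r(\Theta)$$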

Page 6: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

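Observed log likelihood

$$L(\Theta) = p(a; \Theta) = \frac{n!}{\prod_{r=1}^{v_0} n_r!} \prod_{r=1}^{v_0} \left\{ \frac{P_r(\Theta)}{P(\Theta)} \right\}^{n_r}$$

$$\log L(\Theta) = \sum_{r=1}^{v_0} n_r \log\{P_r(\Theta)/P(\Theta)\} + C_1, \qquad C_1 = \log\Big\{ n! \Big/ \prod_{r=1}^{v_0} n_r! \Big\}$$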
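As a rough illustration of the observed log likelihood, sum over observed bins of n_r log{P_r/P} plus the constant C_1, the sketch below evaluates it for a univariate normal mixture, where each bin probability is a difference of normal CDFs. The univariate setting, the function names, and the example counts are assumptions made for this illustration; the slides treat the multivariate case.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a univariate normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_probs(edges, weights, mus, sigmas):
    """P_r: integral of the mixture density over bin r = [edges[r], edges[r+1]]."""
    probs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        p = sum(w * (normal_cdf(hi, m, s) - normal_cdf(lo, m, s))
                for w, m, s in zip(weights, mus, sigmas))
        probs.append(p)
    return probs

def observed_log_likelihood(counts, edges, weights, mus, sigmas):
    """sum_r n_r log(P_r / P) + C_1, with P the total mass of the observed bins."""
    probs = bin_probs(edges, weights, mus, sigmas)
    total = sum(probs)
    n = sum(counts)
    # C_1 = log(n! / prod_r n_r!), computed with log-gamma for stability
    c1 = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return c1 + sum(c * math.log(p / total)
                    for c, p in zip(counts, probs) if c > 0)

# Hypothetical example: four observed bins; data outside [-2, 2] are truncated.
edges = [-2.0, -1.0, 0.0, 1.0, 2.0]
counts = [14, 34, 36, 16]
ll = observed_log_likelihood(counts, edges, [1.0], [0.0], [1.0])
```

Dividing by P renormalizes the bin probabilities over the observed region, which is how truncation enters the likelihood.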

Page 7: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

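Application of EM Algorithm: Missing Data

Unobservable frequencies in the case of truncation:

$$u = (n_{v_0+1}, \dots, n_v)^T$$

Unobservable individual observations in the $r$th region (there are $n_r$ of them):

$$y_r^+ = (y_{r,1}^T, \dots, y_{r,n_r}^T)^T, \qquad r = 1, \dots, v$$

Complete data vector:

$$x = (a^T, u^T, y_1^{+T}, \dots, y_v^{+T})^T$$

$$p(x; \Theta) = p(a; \Theta)\, p(u \mid a; \Theta)\, p(y_1^+, \dots, y_v^+ \mid u, a; \Theta)$$

Complete data log-likelihood (up to an additive constant that does not depend on $\Theta$):

$$\log L_c(\Theta) = \sum_{r=1}^{v} \sum_{s=1}^{n_r} \log f(y_{rs}; \Theta)$$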

Page 8: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

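$p(a; \Theta)$ is specified (multinomial):

$$L(\Theta) = p(a; \Theta) = \frac{n!}{\prod_{r=1}^{v_0} n_r!} \prod_{r=1}^{v_0} \left\{ \frac{P_r(\Theta)}{P(\Theta)} \right\}^{n_r}$$

$p(u \mid a; \Theta)$ can be specified… (negative binomial?):

$$p(u \mid a; \Theta) = C_2\, \{P(\Theta)\}^{n} \prod_{r=v_0+1}^{v} \{P_r(\Theta)\}^{n_r}, \qquad C_2 = \frac{\big(n + \sum_{r=v_0+1}^{v} n_r - 1\big)!}{(n-1)!\, \prod_{r=v_0+1}^{v} n_r!}$$

$p(y_1^+, \dots, y_v^+ \mid u, a; \Theta)$ is specified: conditional on $u$ and $a$, $y_j^+$ consists of $n_j$ independent draws from the density $f(y; \Theta)/P_j(\Theta)$ on $H_j$, so

$$p(y_1^+, \dots, y_v^+ \mid u, a; \Theta) = \prod_{r=1}^{v} \prod_{s=1}^{n_r} \frac{f(y_{rs}; \Theta)}{P_r(\Theta)}$$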

Page 9: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

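Application of EM Algorithm: Missing Data

Then, the complete data log-likelihood is

$$\log L_c(\Theta) = \log p(a; \Theta) + \log p(u \mid a; \Theta) + \log p(y_1^+, \dots, y_v^+ \mid u, a; \Theta)$$

$$= \sum_{r=1}^{v_0} n_r \log\{P_r(\Theta)/P(\Theta)\} + C_1 + \log p(u \mid a; \Theta) + \sum_{r=1}^{v} \sum_{s=1}^{n_r} \log\{f(y_{rs}; \Theta)/P_r(\Theta)\}$$

$$= \sum_{r=1}^{v} \sum_{s=1}^{n_r} \log f(y_{rs}; \Theta) + \log p(u \mid a; \Theta) - \log\Big\{ \{P(\Theta)\}^{n} \prod_{r=v_0+1}^{v} \{P_r(\Theta)\}^{n_r} \Big\} + C_1$$

$$= \sum_{r=1}^{v} \sum_{s=1}^{n_r} \log f(y_{rs}; \Theta) + C$$

where $C_1$ and $C$ collect the terms that do not depend on $\Theta$.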

Page 10: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

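Application of EM Algorithm: Mixture Model

Extension to a mixture model ($g$ components):

$$f(y; \Theta) = \sum_{i=1}^{g} \pi_i f_i(y; \theta_i)$$

Zero-one indicator variables:

$$z_{rs} = (z_{1rs}, \dots, z_{grs})^T \qquad (s = 1, \dots, n_r;\ r = 1, \dots, v)$$

Conditional probability that $Y_{rs}$ belongs to the $i$-th component given $y_{rs}$:

$$\tau_i(y_{rs}; \Theta) = \mathrm{pr}\{Z_{irs} = 1 \mid y_{rs}\} = \pi_i f_i(y_{rs}; \theta_i) / f(y_{rs}; \Theta)$$

Final complete data log-likelihood:

$$\log L_c(\Theta) = \sum_{i=1}^{g} \sum_{r=1}^{v} \sum_{s=1}^{n_r} z_{irs} \log\{\pi_i f_i(y_{rs}; \theta_i)\}$$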
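The conditional component probability on this slide (each component's weighted density divided by the mixture density) is easy to compute for a single point. The sketch below assumes univariate normal components; the function names and the numbers are illustrative only.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of a univariate normal distribution."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(y, weights, mus, sigmas):
    """tau_i(y) = pi_i f_i(y) / sum_h pi_h f_h(y), one value per component."""
    joint = [w * normal_pdf(y, m, s) for w, m, s in zip(weights, mus, sigmas)]
    total = sum(joint)  # mixture density f(y)
    return [j / total for j in joint]

# Hypothetical two-component mixture evaluated at y = 0.5
tau = responsibilities(0.5, [0.3, 0.7], [0.0, 3.0], [1.0, 1.0])
```

In the binned setting the individual points are not observed, so these probabilities appear only inside expectations over each bin.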

Page 11: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

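E-Step

Calculation of $Q(\Theta; \Theta^{(k)})$: expectation over $y_1^+, \dots, y_v^+$, then expectation over $u$.

$$Q(\Theta; \Theta^{(k)}) = \sum_{i=1}^{g} \sum_{r=1}^{v} n_r^{(k)}\, E_{\Theta^{(k)}}\big[ \tau_i(Y_j; \Theta^{(k)}) \{\log f_i(Y_j; \theta_i) + \log \pi_i\} \mid Y_j \in H_r \big]$$

$$n_r^{(k)} = n_r \quad (r = 1, \dots, v_0), \qquad n_r^{(k)} = E_{\Theta^{(k)}}(n_r \mid a) \quad (r = v_0+1, \dots, v)$$

Expectation of $u$ given $a$:

$$E_{\Theta^{(k)}}(n_r \mid a) = n\, P_r(\Theta^{(k)}) / P(\Theta^{(k)}) \qquad (r = v_0+1, \dots, v)$$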
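The expected count in a truncated bin, n P_r / P, only needs the current bin probabilities. A minimal sketch, with hypothetical function and argument names and made-up probabilities:

```python
def expected_truncated_counts(n_observed, p_observed_bins, p_truncated_bins):
    """E(n_r | a) = n * P_r / P for each truncated bin r.

    n_observed: total count n over the observed bins
    p_observed_bins: current P_r for the observed bins (their sum is P)
    p_truncated_bins: current P_r for the truncated bins
    """
    p_total = sum(p_observed_bins)
    return [n_observed * p / p_total for p in p_truncated_bins]

# Hypothetical numbers: 950 observed points, observed bins hold mass 0.95
filled = expected_truncated_counts(950, [0.40, 0.35, 0.20], [0.03, 0.02])
```

With these numbers the two truncated bins receive about 30 and 20 expected points, so the completed sample size is about 1000.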

Page 12: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

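M-Step

$\pi_i^{(k+1)}$ update:

$$\pi_i^{(k+1)} = \frac{\sum_{r=1}^{v} n_r^{(k)}\, E_{\Theta^{(k)}}\{\tau_i(Y_j; \Theta^{(k)}) \mid Y_j \in H_r\}}{\sum_{r=1}^{v} n_r^{(k)}}$$

$\theta = (\theta_1, \dots, \theta_g)$: the other parameters are adjusted to satisfy $\partial Q(\Theta; \Theta^{(k)}) / \partial \theta = 0$, that is,

$$\sum_{i=1}^{g} \sum_{r=1}^{v} n_r^{(k)}\, E_{\Theta^{(k)}}\big[ \tau_i(Y_j; \Theta^{(k)})\, \partial \log f_i(Y_j; \theta_i) / \partial \theta \mid Y_j \in H_r \big] = 0$$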

Page 13: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

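M-Step for Normal Components

Parameter update equations:

$$\mu_i^{(k+1)} = \sum_{r=1}^{v} n_r^{(k)}\, E_{\Theta^{(k)}}\{\tau_i(Y_j; \Theta^{(k)})\, Y_j \mid Y_j \in H_r\} \Big/ C_i^{(k)}$$

$$\Sigma_i^{(k+1)} = \sum_{r=1}^{v} n_r^{(k)}\, E_{\Theta^{(k)}}\{\tau_i(Y_j; \Theta^{(k)})\, (Y_j - \mu_i^{(k+1)})(Y_j - \mu_i^{(k+1)})^T \mid Y_j \in H_r\} \Big/ C_i^{(k)}$$

$$C_i^{(k)} = \sum_{r=1}^{v} n_r^{(k)}\, E_{\Theta^{(k)}}\{\tau_i(Y_j; \Theta^{(k)}) \mid Y_j \in H_r\}$$

Practical implementation is more complex because of the multidimensional integrals.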
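To make the updates concrete, here is a minimal sketch of one binned-and-truncated EM iteration in the univariate case, where the bin integrals have closed forms and no numerical integration is needed. It relies on the identity n_r E[tau_i(Y) Y^q | Y in H_r] = n_r pi_i (integral of y^q f_i over H_r) / P_r. The function names, the use of `None` to mark a truncated bin, and the noise-free demo counts are assumptions of this example, not the authors' implementation, which handles the multivariate case.

```python
import math

def bin_moments(lo, hi, mu, sigma):
    """Integrals of 1, y and y^2 times N(y; mu, sigma^2) over [lo, hi]."""
    def cdf_pdf(x):
        if math.isinf(x):                      # open-ended tail bin
            return (1.0 if x > 0 else 0.0), 0.0
        z = (x - mu) / sigma
        cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        pdf = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
        return cdf, pdf
    c_lo, f_lo = cdf_pdf(lo)
    c_hi, f_hi = cdf_pdf(hi)
    t_lo = 0.0 if math.isinf(lo) else (lo + mu) * f_lo
    t_hi = 0.0 if math.isinf(hi) else (hi + mu) * f_hi
    m0 = c_hi - c_lo
    m1 = mu * m0 - sigma ** 2 * (f_hi - f_lo)
    m2 = (mu ** 2 + sigma ** 2) * m0 - sigma ** 2 * (t_hi - t_lo)
    return m0, m1, m2

def em_step(edges, counts, weights, mus, sigmas):
    """One EM iteration; counts[r] is None for a truncated (unobserved) bin."""
    g, v = len(weights), len(counts)
    mom = [[bin_moments(edges[r], edges[r + 1], mus[i], sigmas[i])
            for r in range(v)] for i in range(g)]
    p_bin = [sum(weights[i] * mom[i][r][0] for i in range(g)) for r in range(v)]
    n = sum(c for c in counts if c is not None)
    p_obs = sum(p for p, c in zip(p_bin, counts) if c is not None)
    # E-step for the counts: observed n_r, or n * P_r / P for truncated bins
    n_k = [c if c is not None else n * p / p_obs for p, c in zip(p_bin, counts)]
    total = sum(n_k)
    new_w, new_mu, new_sigma = [], [], []
    for i in range(g):
        s0 = s1 = s2 = 0.0
        for r in range(v):
            if n_k[r] == 0:
                continue
            # n_r * E[tau_i * Y^q | Y in H_r] = n_r * pi_i * moment_q / P_r
            scale = n_k[r] * weights[i] / p_bin[r]
            s0 += scale * mom[i][r][0]
            s1 += scale * mom[i][r][1]
            s2 += scale * mom[i][r][2]
        new_w.append(s0 / total)                      # mixing weight update
        new_mu.append(s1 / s0)                        # mean update
        new_sigma.append(math.sqrt(s2 / s0 - (s1 / s0) ** 2))  # std update
    return new_w, new_mu, new_sigma

# Noise-free demo: expected counts under N(1, 2^2); the bin above 5 is truncated.
inf = float("inf")
edges = [-inf] + [float(x) for x in range(-3, 6)] + [inf]
counts = [1000.0 * bin_moments(lo, hi, 1.0, 2.0)[0]
          for lo, hi in zip(edges[:-1], edges[1:])]
counts[-1] = None
w, mu, sigma = [1.0], [0.0], [1.0]
for _ in range(500):
    w, mu, sigma = em_step(edges, counts, w, mus=mu, sigmas=sigma)
```

Because the demo counts are exact expected counts, the iterations should settle near the generating mean and standard deviation. In more than one dimension the three bin moments would have to come from numerical integration instead.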

Page 14: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Computational and Numerical Issues

The integrals cannot be evaluated analytically.

$m$ bins in the univariate case become $O(m^d)$ bins in $d$ dimensions.

$O(i)$ function evaluations for a univariate integral become $O(i^d)$ in $d$ dimensions.

Complex geometry.

For a fixed sample size, the multivariate histogram becomes sparser.

Integration methods: numerical, Monte Carlo, Romberg (idea: repeated 1-dimensional integration).
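The idea of repeated 1-dimensional integration can be sketched as follows: a textbook Romberg rule integrates over one coordinate, and the outer integral calls it on the inner integral. This is an illustrative sketch with an assumed fixed number of refinement levels, not the integration routine used by the authors.

```python
import math

def romberg(f, a, b, levels=6):
    """1-D Romberg integration: trapezoid rule plus Richardson extrapolation."""
    table = [[0.0] * levels for _ in range(levels)]
    h = b - a
    table[0][0] = 0.5 * h * (f(a) + f(b))
    for k in range(1, levels):
        h *= 0.5
        # refine the trapezoid estimate by adding the new midpoints
        mid = sum(f(a + (2 * j - 1) * h) for j in range(1, 2 ** (k - 1) + 1))
        table[k][0] = 0.5 * table[k - 1][0] + h * mid
        for m in range(1, k + 1):
            table[k][m] = table[k][m - 1] + \
                (table[k][m - 1] - table[k - 1][m - 1]) / (4 ** m - 1)
    return table[levels - 1][levels - 1]

def integrate_box(f2, x_lo, x_hi, y_lo, y_hi):
    """Integrate f2(x, y) over a rectangular bin by nesting 1-D Romberg rules."""
    def inner(x):
        return romberg(lambda y: f2(x, y), y_lo, y_hi)
    return romberg(inner, x_lo, x_hi)

def std_bivariate_normal(x, y):
    """Density of two independent standard normal variables."""
    return math.exp(-0.5 * (x * x + y * y)) / (2.0 * math.pi)

# Probability mass of one rectangular bin
mass = integrate_box(std_bivariate_normal, 0.0, 1.0, -1.0, 0.5)
```

Each 1-D rule here uses 33 function evaluations, so one bivariate bin already costs 33 x 33 evaluations, which illustrates the $O(i^d)$ growth noted on the slide.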

Page 15: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

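Handling Truncated Regions

The truncated region is treated as a single bin ($v = v_0 + 1$, with $H$ the whole sample space):

$$H_v = H \setminus \bigcup_{r=1}^{v_0} H_r$$

No extra integration is needed:

$$\int_{H_v} f(y_j; \Theta)\, dy_j = 1 - \sum_{r=1}^{v_0} P_r(\Theta)$$

$$\int_{H_v} y_j\, f_i(y_j; \theta_i)\, dy_j = \mu_i - \sum_{r=1}^{v_0} \int_{H_r} y_j\, f_i(y_j; \theta_i)\, dy_j$$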
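The complement trick can be checked numerically in one dimension: the mass and first moment of the truncated region follow from the observed-bin integrals, and they agree with integrating the two tails directly. The univariate normal component, the bin edges, and the helper names below are assumptions made for this check.

```python
import math

def cdf(x, mu, s):
    return 0.5 * (1.0 + math.erf((x - mu) / (s * math.sqrt(2.0))))

def pdf(x, mu, s):
    z = (x - mu) / s
    return math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))

def mass(lo, hi, mu, s):
    """Integral of N(y; mu, s^2) over [lo, hi] (finite limits)."""
    return cdf(hi, mu, s) - cdf(lo, mu, s)

def first_moment(lo, hi, mu, s):
    """Integral of y * N(y; mu, s^2) over [lo, hi] (finite limits)."""
    return mu * mass(lo, hi, mu, s) - s * s * (pdf(hi, mu, s) - pdf(lo, mu, s))

mu, s = 1.0, 2.0
edges = [-3.0, -1.0, 1.0, 3.0]   # observed bins; everything outside is truncated
bins = list(zip(edges[:-1], edges[1:]))

# Truncated-region integrals obtained by complement, with no extra integration
p_trunc = 1.0 - sum(mass(lo, hi, mu, s) for lo, hi in bins)
m_trunc = mu - sum(first_moment(lo, hi, mu, s) for lo, hi in bins)

# Direct evaluation over the two tails, for comparison
p_direct = cdf(-3.0, mu, s) + (1.0 - cdf(3.0, mu, s))
m_direct = mu * p_direct - s * s * (pdf(-3.0, mu, s) - pdf(3.0, mu, s))
```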

Page 16: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

3.3 The Complete EM Algorithm

1. Treat the histogram as a PDF and draw a small number of data points from it.

2. Fit the mixture model to these points using the standard EM algorithm (nonbinned, nontruncated).

3. Using the parameter estimates from step 2, refine the estimate with the full EM algorithm applied to the binned and truncated data.
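The initialization step, drawing a few points from the histogram treated as a PDF, might look like the sketch below for a 1-dimensional histogram. Placing each point uniformly inside its bin is an assumption of this example, as are the function name and the toy counts.

```python
import random

def sample_from_histogram(edges, counts, k, rng):
    """Draw k points: pick a bin with probability proportional to its count,
    then place the point uniformly inside that bin."""
    bins = rng.choices(range(len(counts)), weights=counts, k=k)
    return [rng.uniform(edges[b], edges[b + 1]) for b in bins]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
edges = [0.0, 1.0, 2.0, 3.0]
counts = [10, 0, 30]              # the middle bin is empty
points = sample_from_histogram(edges, counts, 50, rng)
```

The resulting points can be passed to any standard (unbinned) EM routine to obtain starting values for the binned and truncated EM iterations.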

Page 17: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

4. Experimental Results with Simulated Data

3 experiments

Generate data from a known PDF and then bin them (bivariate).

Number of bins per dimension: 5 ~ 100 (step 5).

10 different samples to smooth the results.

Standard EM on the unbinned samples vs. full EM on the binned samples.

Evaluation measure: KL distance between the true density and each of the two EM estimates.
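A KL distance between a true density and a fitted one can be approximated on a grid. The sketch below does this in one dimension with a midpoint rule; the direction KL(true || estimate), the integration range, and the grid size are assumptions of this example, since the slides do not state them.

```python
import math

def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def kl_distance(p, q, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of p(y) log(p(y)/q(y))."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        y = lo + (j + 0.5) * h
        py = p(y)
        if py > 0.0:              # skip points where p underflows to zero
            total += py * math.log(py / q(y)) * h
    return total

def true_density(y):
    return normal_pdf(y, 0.0, 1.0)

def estimate(y):
    return normal_pdf(y, 1.0, 1.0)

kl = kl_distance(true_density, estimate, -12.0, 12.0)
```

For two unit-variance normals whose means differ by 1, the exact value is 0.5, which gives a quick sanity check of the approximation.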

Page 18: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Experiment Setup

To test the quality of the solution for different numbers of data points from Figure 4. Data points N : 100 ~ 1000 (step 10) (20 bin, 100 data, 10 samples)

To test performance of the algorithm when the component densities are not so well separated. 3 apart components (20 bin, 20 separation, 10 samples)

To test the performance of the algorithm when significant truncation occurs (20 bin, 100 positions, 10 samples)

Page 19: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data
Page 20: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data
Page 21: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

4.2 Estimation from Random Samples Generated from the Binned Data

Baseline approach: estimate the PDF from a random sample drawn from the binned data (uniform sampling estimation method).

Figure 6: comparison. The baseline overestimates the variance (variance inflation).
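The variance inflation of the uniform-sampling baseline can be seen without any simulation: the density that is uniform within each bin has a larger variance than the density that generated the bin masses. The sketch below computes that variance for a binned standard normal; the bin layout and function names are assumptions of this illustration.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def resampled_variance(edges, probs):
    """Variance of the density that is uniform inside each bin,
    with bin masses proportional to probs."""
    total = sum(probs)
    mean = 0.0
    second = 0.0
    for lo, hi, p in zip(edges[:-1], edges[1:], probs):
        center, width = 0.5 * (lo + hi), hi - lo
        mean += (p / total) * center
        # second moment of a uniform on the bin: center^2 + width^2 / 12
        second += (p / total) * (center * center + width * width / 12.0)
    return second - mean * mean

# Standard normal (variance 1) binned on [-6, 6] with unit-width bins
edges = [float(x) for x in range(-6, 7)]
probs = [normal_cdf(hi, 0.0, 1.0) - normal_cdf(lo, 0.0, 1.0)
         for lo, hi in zip(edges[:-1], edges[1:])]
inflated = resampled_variance(edges, probs)
```

In this particular setup (bin width w = 1, true variance 1) the value comes out close to 1 + w²/6, noticeably above the true variance, which is consistent with the overestimation noted on the slide.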

Page 22: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Figure 6 : Estimated PDFs obtained from original data and PDFs fitted by binned and the uniform random-sample algorithm for (a) 5 bins per dimension and (b) 10 per dimension. 3-covariance ellipse

Page 23: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

4.3 Experiments with Different Sample Size

Figure 7: KL distance as a function of the number of bins and the number of data points. Bins > 20, data > 500: small KL distance.

Figure 8: as a function of the number of bins. Bins 5 ~ 20: rapid decay; bins > 20: flat.

Figure 9: as a function of the number of data points. Exponential decay.

Page 24: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Figure 7: (a) average KL distance between the estimated density and the true density, (b) standard deviation of the KL distance from 10 repeated samples.

Page 25: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data
Page 26: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data
Page 27: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

4.4 Experiments with Different Separations of Mixture Components

Figure 10: as a function of the number of bins and the separation of the means. Insensitive to the separation of the components.

Figure 11: as a function of the separation of the means. Ratio of the KL distances of the standard and binned algorithms. Small number of bins: standard EM is better. Small separation: binned EM is better.

Figure 12

Page 28: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data
Page 29: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data
Page 30: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data
Page 31: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

4.5 Experiments with Truncation

Figure 13: as a function of the ratio of truncated points. Standard EM ignores the information about truncation. Binned EM is relatively insensitive to truncation.

Figure 14

Page 32: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data
Page 33: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data
Page 34: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Real Example: Red Blood Cell Data

Medical diagnosis based on two-dimensional histograms characterizing RBC volume and hemoglobin measurements.

Mixture densities were fitted to histograms from 90 control subjects and 82 subjects with iron deficiency anemia.

B = 100^2 bins, N = 40,000.

The fitted densities are used as features for a discriminant rule:

Baseline features: 4-dim feature vector (mean and variance along the RBC and hemoglobin axes).

11-dim features: two-component lognormal mixture model (means, covariances, mixing weight).

9-dim features: (means, log-odds of the eigenvalues of the covariances, mixing weight).

Page 35: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Figure 15. Contour plots of the estimated densities for three control patients and three iron deficiency anemia patients.

Page 36: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data
Page 37: Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data

Conclusion

Fitting mixture densities to multivariate binned and truncated data.

Computational and numerical implementation issues.

In the 2-dimensional simulations, if the number of bins exceeds 10, the loss of information from quantization is minimal.