Connections between MCMC and Likelihood Methods

1

Donald A. Pierce, with Ruggero Bellio

Winter 2010, OSU

Slides are at www.science.oregonstate.edu/~piercedo/osu-mcmc-mpl.ppt

http://www.science.oregonstate.edu/~piercedo/osu-mcmc_mpl.ppt


2

It is popular these days to be “Bayesian”, in large part due to the utility of MCMC and in particular (Win)BUGS.

However, substantive prior information is seldom used, aiming for “objective Bayes”, and connections to likelihood inference are interesting.

Largely, the gain in MCMC is in utilizing rather intractable likelihood functions: integrating over latent variates, e.g. latent cluster effects or covariates observed with error.

However, if everything except observed data is a random variable, issues of inference become highly (too?) automatic.

3

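A key issue in this is the contrast of profile and integrated likelihoods (interest parameter $\psi$, nuisance parameter $\lambda$), namely

$$L_P(\psi; y) = \max_\lambda L(\psi, \lambda; y), \qquad L_I^w(\psi; y) = \int L(\psi, \lambda; y)\, w(\lambda)\, d\lambda$$

Modern higher-order likelihood theory suggests, surprisingly, that integrated likelihoods can overcome shortcomings of profile likelihood.

A posterior for $\psi$ is an instance of integrated likelihood. That is,

$$\pi(\psi, \lambda \mid y) \propto L(\psi, \lambda; y)\, \pi(\psi, \lambda)$$

so

$$\pi(\psi \mid y) \propto \pi(\psi) \int L(\psi, \lambda; y)\, \pi(\lambda \mid \psi)\, d\lambda$$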

4

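An integrated likelihood is approximated very well by a Laplace approximation (up to a constant factor)

$$L_I^w(\psi; y) \approx L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, w(\hat\lambda_\psi)$$

where $\hat\lambda_\psi$ is the constrained MLE of $\lambda$ for fixed $\psi$ and $j_{\lambda\lambda}$ is the observed information for $\lambda$.

Hence, the MCMC posterior for “flat” priors is essentially

$$\pi(\psi \mid y) \approx L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}$$

We will see that this depends substantially on the representation of the nuisance parameter, something to be avoided in frequentist or likelihood inference.

The approximation above is, within reason, valid for any such representation (not that this is so comforting).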
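A minimal numerical sketch of the Laplace approximation to an integrated likelihood. The model is an illustrative assumption, not taken from the slides: a normal sample with interest parameter mu (playing the role of psi), nuisance parameter sigma (playing the role of lambda), and flat weight w = 1. The data and all function names are made up for the illustration.

```python
import math

def log_lik(mu, sigma, y):
    """Normal log-likelihood (constants dropped); mu is the interest
    parameter, sigma the nuisance parameter."""
    s = sum((v - mu) ** 2 for v in y)
    return -len(y) * math.log(sigma) - s / (2.0 * sigma ** 2)

def log_profile(mu, y):
    """log L_P(mu): the likelihood maximized over sigma at fixed mu."""
    s = sum((v - mu) ** 2 for v in y)
    sigma_hat = math.sqrt(s / len(y))       # constrained MLE of sigma
    return log_lik(mu, sigma_hat, y), sigma_hat

def log_laplace(mu, y):
    """log of L_P(mu) * (2*pi)^(1/2) * |j(sigma_hat)|^(-1/2), with w = 1."""
    log_lp, sigma_hat = log_profile(mu, y)
    j = 2.0 * len(y) / sigma_hat ** 2       # observed information for sigma
    return log_lp + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(j)

def log_integrated(mu, y, upper=50.0, steps=20000):
    """log of the integral of L(mu, sigma) over sigma in (0, upper],
    by the trapezoid rule (the integrand vanishes as sigma -> 0)."""
    h = upper / steps
    total = 0.0
    for k in range(1, steps + 1):
        f = math.exp(log_lik(mu, k * h, y))
        total += 0.5 * f if k == steps else f
    return math.log(total * h)

y = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4]   # made-up data
for mu in (4.0, 5.0, 6.0):
    gap = log_integrated(mu, y) - log_laplace(mu, y)
    print(f"mu={mu:.1f}  log IL - log Laplace = {gap:.4f}")
```

In this particular model the gap is the same at every mu, so the Laplace form reproduces the shape of the integrated likelihood and differs only by a constant factor. Other models would generally show a small mu-dependent discrepancy as well.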

5

Regarding “flat” priors: in practice those used in WinBUGS manual examples seem advisable, i.e. proper but very diffuse for parameters on $(-\infty, \infty)$, e.g. dnorm(0,1E-6), and implicitly for the logs of inherently positive parameters, e.g. dgamma(1E-6,1E-6).

The latter is to obtain approximate invariance to scale for scale parameters, a natural requirement.

If, to facilitate convergence, $\pi(\psi)$ is chosen otherwise, then for likelihood analysis one should divide the posterior of $\psi$ by the prior.

Geyer & Thompson (1992 JRSS-B) gave a method for computing the likelihood using MCMC, but the proposal here is far simpler.
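A rough sketch of dividing a posterior by its prior to recover a likelihood shape. Everything specific here is an assumption for illustration: the draws come from a conjugate normal model (standing in for MCMC output), the posterior density is estimated by a plain histogram, and the helper names are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def likelihood_from_draws(draws, prior_pdf, lo, hi, n_bins):
    """Histogram estimate of the posterior density of psi on [lo, hi),
    divided by the prior density at each bin midpoint and rescaled to
    have maximum 1. Returns (midpoints, relative likelihood)."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for d in draws:
        if lo <= d < hi:
            counts[min(int((d - lo) / width), n_bins - 1)] += 1
    mids = [lo + (b + 0.5) * width for b in range(n_bins)]
    ratio = [c / (len(draws) * width) / prior_pdf(m) for c, m in zip(counts, mids)]
    top = max(ratio)
    return mids, [r / top for r in ratio]

# Stand-in for MCMC output: the mean of 20 unit-variance normal observations
# is 2.0, and a deliberately informative N(0, 0.5^2) prior is used for psi.
rng = random.Random(2010)
ybar, n, prior_sd = 2.0, 20, 0.5
post_var = 1.0 / (n + 1.0 / prior_sd ** 2)
post_mean = post_var * n * ybar
draws = [rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(200000)]

mids, rel_lik = likelihood_from_draws(
    draws, lambda p: normal_pdf(p, 0.0, prior_sd), 1.2, 2.4, 24)
peak = mids[rel_lik.index(max(rel_lik))]
print("posterior mean of draws:", sum(draws) / len(draws))
print("peak of posterior / prior:", peak)
```

The posterior is pulled toward the prior mean of 0, while the ratio peaks near the sample mean, as a likelihood should. In the tails the division amplifies Monte Carlo noise, so a smoother density estimate would be preferable in practice.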

6

An attempt to generally improve on profile likelihood was the Cox-Reid approximate conditional likelihood (ACL)

$$L_{AC}(\psi; y) = L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}$$

requiring that the nuisance parameter be represented as ‘orthogonal’ to $\psi$, i.e. that $\hat\lambda_\psi$ varies slowly with $\psi$.

However, orthogonal parameters are not at all uniquely defined, resulting in arbitrariness of the ACL that must be resolved.

A partial indication of our interests is that the ACL is formally the same as the above approximation to the posterior for $\psi$ using flat priors.

7

Barndorff-Nielsen developed the modified profile likelihood (MPL)

$$L_{MP}(\psi; y) = L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, |\partial\hat\lambda_\psi / \partial\hat\lambda|^{-1}$$

that is invariant to representation of the nuisance parameter, a really key issue.

This was a remarkable stroke of intuition, and B-N only showed that the MPL approximates what is desired for the primary special settings: exponential families, regression-scale models, etc.

We have been developing the idea that what the MPL in general approximates is a suitable integrated likelihood, hence with close connections to MCMC.

8

Example (Pierce & Peters 1992): case-control (CC) study, 40 sets with 2:1 matching, 30/80 of controls “exposed”.

[Figure: solid line PL, dashed lines conditional likelihood and MPL.]

9

Concept of ‘orthogonal’ parameter, for ACL and for MCMC, needs clarification.

In principle there is an ‘ideal’ choice of orthogonal parameter such that the integrated likelihood, i.e. the Bayes posterior (with uniform priors), approximates the MPL.

Some goals are: (a) to actually compute this, either from the likelihood or the posterior samples, (b) to recover the PL from the posterior distribution, and (c) to approximate the MPL in this way, even if not as in (a).

These are not completed, but some progress has been made.

10

Example: Binary data on 50 subjects, repeated observations at up to five times, total of 220 observations.

Suitable for logistic mixed model with latent random intercepts for subjects.

Interest parameter: the standard deviation $\sigma$ of the random intercepts. Seven nuisance parameters: constant term, 2 treatment parameters, 4 for time effects.

Usual parametrization is not orthogonal: the vector $\lambda$ of canonical regression parameters is ‘attenuated’ as

$$\hat\lambda_0 \approx \hat\lambda_\sigma / \sqrt{1 + 0.304\,\sigma^2}$$

suggesting an approximately orthogonal parameter

$$\lambda / \sqrt{1 + 0.304\,\sigma^2}$$

11

WinBUGS posterior densities of $\sigma$ using flat priors: heavy line original parametrization, light line using the approximately orthogonal nuisance parameters.

[Figure: posterior density (vertical axis, 0 to 1) against Sigma (horizontal axis, 0 to 5).]

12

Posterior samples: Sigma vs constant term, for original and orthogonal parametrizations.

This provides a clue that we can use posterior samples to assess and correct for lack of orthogonality.

[Figure: two scatterplots of Sigma against the constant term; panels constant-orig and constant-orthog.]

13

Important but confusing issue: clearly, if we transform the posterior samples as

$$\{\psi, \lambda\} \to \{\psi, \phi(\psi, \lambda)\}$$

the marginal distribution of $\psi$ is unchanged.

Part of the reason reparametrization of $\lambda$ matters is that this is done in the model specification, where in contrast to the above there is no (implicit) Jacobian involved in the density.

Having samples from the joint distribution of $\{\psi, \lambda\}$, it would be possible but impractical to divide the density by the Jacobian, to avoid re-doing MCMC.

We can achieve this aim otherwise by resampling from the posterior samples with weights inversely proportional to the reciprocal Jacobian.
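A minimal sketch of such a resampling step, under stated assumptions. The draws are taken to come from a posterior that is flat in the nuisance vector lam. The target is the posterior that a prior flat in a new representation phi(psi, lam) would give, so each draw is weighted by |d phi / d lam|. The particular phi used, lam / sqrt(1 + 0.304 psi^2) applied to each of 7 components, is one reading of the approximately orthogonal parameter in the logistic example. The synthetic draws and function names are hypothetical.

```python
import math
import random

def jacobian_weight(psi, n_nuisance, c=0.304):
    """|d phi / d lam| for phi = lam / sqrt(1 + c * psi**2), applied
    componentwise to a nuisance vector of length n_nuisance."""
    return (1.0 + c * psi * psi) ** (-0.5 * n_nuisance)

def resample(draws, n_nuisance, rng):
    """draws: list of (psi, lam) pairs from a posterior that is flat in lam.
    Returns a weighted resample approximating the posterior that a prior
    flat in phi would have produced."""
    weights = [jacobian_weight(psi, n_nuisance) for psi, _ in draws]
    return rng.choices(draws, weights=weights, k=len(draws))

def effective_sample_size(weights):
    """Rough diagnostic: how many equally weighted draws the weights are worth."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

# Synthetic stand-in for MCMC output (not real posterior samples).
rng = random.Random(13)
draws = [(rng.uniform(0.5, 4.0), tuple(rng.gauss(0.0, 1.0) for _ in range(7)))
         for _ in range(5000)]
new_draws = resample(draws, 7, rng)

mean_before = sum(psi for psi, _ in draws) / len(draws)
mean_after = sum(psi for psi, _ in new_draws) / len(new_draws)
ess = effective_sample_size([jacobian_weight(psi, 7) for psi, _ in draws])
print(mean_before, mean_after, ess)
```

With seven nuisance components the weights fall off quickly in psi, so the resample shifts toward smaller psi and the effective sample size can be far below the number of draws. That is worth checking before trusting the reweighted density.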

14

Recall that to very good approximation the MCMC posterior, for flat priors, is essentially

$$\pi(\psi \mid y) \approx L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}$$

which can be expressed approximately as

$$\pi(\psi \mid y) \approx L_P(\psi; y)\, |\mathrm{asyvar}(\lambda \mid \psi)|^{1/2}$$

We can approximate the final factor from the MCMC samples at hand, and thus approximate the PL by dividing the posterior density of $\psi$ by our estimate of $|\mathrm{asyvar}(\lambda \mid \psi)|^{1/2}$.

There are, however, issues involving the distinction between posterior $\mathrm{var}(\lambda \mid \psi)$ and sampling theory $\mathrm{var}(\hat\lambda_\psi; \psi)$.

15

A transparent way to do this, although there may be more accurate ways:

Choose bins for $\psi$ (e.g. 20 using quantiles), for each of these compute $|\mathrm{var}(\lambda \mid \psi)|$, and then smooth (the logs of) these by quadratic regression on the bin classmarks.

[Figure: log var(lam|psi) (vertical axis, -10 to -4) against the bin classmarks (horizontal axis, 1.0 to 3.5).]
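The binning and smoothing recipe can be sketched as follows. This is a simplified illustration: it assumes a single (scalar) nuisance parameter, so the determinant is just a variance. It uses equal-count bins with the bin mean of psi as classmark, and it runs on synthetic draws whose true log variance is linear in psi. All names are hypothetical.

```python
import math
import random

def binned_log_var(psi, lam, n_bins=20):
    """Sort draws by psi, cut into n_bins equal-count (quantile) bins, and
    return the bin classmarks (mean psi) and log sample variance of lam."""
    pairs = sorted(zip(psi, lam))
    size = len(pairs) // n_bins
    marks, log_vars = [], []
    for b in range(n_bins):
        stop = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[b * size:stop]
        m = len(chunk)
        lam_mean = sum(l for _, l in chunk) / m
        var = sum((l - lam_mean) ** 2 for _, l in chunk) / (m - 1)
        marks.append(sum(p for p, _ in chunk) / m)
        log_vars.append(math.log(var))
    return marks, log_vars

def fit_quadratic(x, y):
    """Least-squares coefficients (c0, c1, c2) of y ~ c0 + c1*x + c2*x**2,
    from the 3x3 normal equations solved by Gauss-Jordan elimination."""
    s = [sum(v ** k for v in x) for k in range(5)]
    t = [sum(v ** k * w for v, w in zip(x, y)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

def smoothed_log_var(coef, psi):
    return coef[0] + coef[1] * psi + coef[2] * psi ** 2

# Synthetic draws with known log var(lam | psi) = -9 + 1.2 * psi.
rng = random.Random(15)
psi = [rng.uniform(1.0, 3.5) for _ in range(20000)]
lam = [rng.gauss(0.0, math.exp(0.5 * (-9.0 + 1.2 * p))) for p in psi]

marks, log_vars = binned_log_var(psi, lam)
coef = fit_quadratic(marks, log_vars)
print("smoothed log var at psi = 2:", smoothed_log_var(coef, 2.0))
```

Given an estimate of the log posterior density of psi, subtracting half of `smoothed_log_var(coef, psi)` would then give an approximate log PL, up to an additive constant. With several nuisance parameters the variance would be replaced by the determinant of the within-bin covariance matrix.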

16

Red right: MCMC posterior, original parametrization.

Red left (dashed): after above adjustment.

Black: PL computed by quadrature.

[Figure: the three curves on a horizontal axis from 0 to 5 and a vertical axis from 0 to 1.]

17

What should be the meaning of ‘orthogonal’ parameter for use in the APL?

Said earlier that $\hat\lambda_\psi$ should vary slowly with $\psi$, which is related to the more usual definition that the (expected) cross-information terms are zero.

But if $\lambda$ satisfies this definition then so does any 1-1 transformation of it, which is very unsatisfactory.

Further, this could not be a requirement for validity of APL, since linear transformations $\lambda \to \lambda + c\psi$ leave the APL unchanged even though not conforming at all to such requirements.

This suggests more difficulties than first thought in utilizing plots such as on slide 12 for such purposes.

18

There is in principle a reparametrization such that MPL and IL agree (related to Severini, 2007 Biometrika).

The constrained MLE $\hat\lambda_\psi$ can be thought of as a function of $(\hat\psi, \hat\lambda)$ if sufficient, otherwise of $(\hat\psi, \hat\lambda, a)$.

If there $\hat\lambda$ is taken as a variable $\lambda^*$, this defines a nuisance parameter representation $\lambda = \hat\lambda_\psi(\hat\psi, \lambda^*, a)$.

This representation of the NP depends on $\hat\psi$ or on $(\hat\psi, a)$, which is no real problem for Bayesian methods.

Define $\lambda^*(\psi, \lambda)$ as the inverse function solving the equation $\hat\lambda_\psi(\hat\psi, \lambda^*, a) = \lambda$.

Then the MPL is the Laplace approximation to the integrated likelihood based on representation $\lambda^*(\psi, \lambda)$ of the nuisance parameter.

19

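Theory for this: Laplace approximations in parametrizations $(\psi, \lambda)$ and $(\psi, \lambda^*)$ differ only by the Jacobian factor $|\partial\lambda^*/\partial\lambda| = 1/|\partial\lambda/\partial\lambda^*|$

$$L_P(\psi)\, |j_{\lambda^*\lambda^*}(\hat\lambda^*_\psi)|^{-1/2} = L_P(\psi)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, |\partial\lambda^*/\partial\lambda|_{\text{cnstr MLE}}$$

and we are matching that Jacobian with the final factor of

$$L_{MP}(\psi; y) = L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, |\partial\hat\lambda_\psi / \partial\hat\lambda|^{-1}$$

Actually need only derivatives.

Difficulty in all this is in utilizing, for likelihood, variations in $\hat\lambda$ while holding fixed a suitable ancillary “a”.

Roughly speaking, a suitable ancillary is the ratio of observed to expected information for $(\psi, \lambda)$.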

20

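Ex: Two exponential samples (each of size $n$) with means $\lambda$ and $\psi\lambda$. The “obvious” orthogonal reparametrization has means $\lambda/\sqrt{\psi},\ \lambda\sqrt{\psi}$.

In the original parametrization, $\hat\lambda_\psi = \hat\lambda\,(1 + \hat\psi/\psi)/2$ provides the corresponding parametric function $\lambda^*\,(1 + \hat\psi/\psi)/2$. Set this equal to $\lambda$ and solve for the inverse

$$\lambda^*(\psi, \lambda) = 2\lambda / (1 + \hat\psi/\psi)$$

Then up to Laplace approximation the MPL is the IL for nuisance parameter representation $\lambda^*$.

log PL: $-2n \log(1 + \hat\psi/\psi) - n \log(\psi)$

log ACL and MCMC posterior with “obvious” orthogonal parametrization: $-(2n - 1) \log(1 + \hat\psi/\psi) - (n - 1/2) \log(\psi)$

but for this example MPL = PL.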
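A numerical check of this example, as a sketch under assumptions that are one reading of the slide: two exponential samples of equal size n with means lam and psi*lam, summarized by their sample means xbar and ybar (so psi_hat = ybar / xbar). Integrating with a flat weight on a new nuisance representation is done here by integrating over lam and multiplying by the Jacobian of the new representation with respect to lam, which does not depend on lam in either case. The data values are made up.

```python
import math

def log_lik(psi, lam, n, xbar, ybar):
    """Two exponential samples of size n with means lam and psi*lam."""
    return (-2 * n * math.log(lam) - n * math.log(psi)
            - n * (xbar + ybar / psi) / lam)

def log_int_over_lam(psi, n, xbar, ybar, upper=40.0, steps=40000):
    """log of the integral of L(psi, lam) over lam in (0, upper], trapezoid rule."""
    h = upper / steps
    total = 0.0
    for k in range(1, steps + 1):
        f = math.exp(log_lik(psi, k * h, n, xbar, ybar))
        total += 0.5 * f if k == steps else f
    return math.log(total * h)

def log_pl(psi, n, psi_hat):
    return -2 * n * math.log(1 + psi_hat / psi) - n * math.log(psi)

def log_acl(psi, n, psi_hat):
    return -(2 * n - 1) * math.log(1 + psi_hat / psi) - (n - 0.5) * math.log(psi)

def log_il_star(psi, n, xbar, ybar):
    """Flat weight on lam* = 2*lam / (1 + psi_hat/psi)."""
    psi_hat = ybar / xbar
    return math.log(2.0 / (1 + psi_hat / psi)) + log_int_over_lam(psi, n, xbar, ybar)

def log_il_orthog(psi, n, xbar, ybar):
    """Flat weight on omega = lam * sqrt(psi), the geometric mean of the two means."""
    return 0.5 * math.log(psi) + log_int_over_lam(psi, n, xbar, ybar)

n, xbar, ybar = 5, 1.0, 2.0          # made-up summary statistics
psi_hat = ybar / xbar
for psi in (1.0, 2.0, 4.0):
    d_star = log_il_star(psi, n, xbar, ybar) - log_pl(psi, n, psi_hat)
    d_orth = log_il_orthog(psi, n, xbar, ybar) - log_acl(psi, n, psi_hat)
    print(f"psi={psi:.0f}  IL(lam*) - PL = {d_star:.5f}   IL(orthog) - ACL = {d_orth:.5f}")
```

Both differences come out constant in psi. So in this example the integrated likelihood under the lam* representation has the shape of the PL (and hence the MPL), while the one under the orthogonal representation has the shape of the ACL.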

21

Our MCMC example is not very suitable for investigating all this: MPL is (again) very near the PL.

When likelihood is intractable, or when the MLE is not sufficient, can we use the MCMC to approximate the MPL?

Is it better to approximate the reparametrization for which IL = MPL, or better to compute the required Jacobian more directly?

An issue is whether there can, in principle, be enough information in the likelihood, or posterior samples, to approximate the MPL.

Can we tell from the posterior samples how the joint distribution would change for slightly different data?

22

There is yet another parametrization such that locally the nuisance parameter becomes a translation parameter.

In this parametrization the answer to that question is “yes”.

An aim is to capitalize on this without solving for that new parametrization, perhaps taking advantage of the fact that the product of the final two terms in the MPL is invariant to reparametrization

$$L_{MP}(\psi; y) = L_P(\psi; y)\, |j_{\lambda\lambda}(\hat\lambda_\psi)|^{-1/2}\, |\partial\hat\lambda_\psi / \partial\hat\lambda|^{-1}$$

Have had some success for a single nuisance parameter, but there remains much to do.
