
Journal of Statistical Planning and Inference 102 (2002) 99–107

www.elsevier.com/locate/jspi

Robust Bayesian analysis with partially exchangeable priors☆

Malay Ghosh a,∗, Dal Ho Kim b

a Department of Statistics, University of Florida, 103 Griffin-Floyd Hall, P.O. Box 118545, Gainesville, FL 32611-8545, USA

b Department of Statistics, Kyungpook National University, Taegu 702-701, South Korea

Received 1 February 1998; received in revised form 1 September 1998

☆ Research partially supported by NSF Grant Numbers SBR-9423996 and SBR-9810968.
∗ Corresponding author. E-mail address: [email protected] (M. Ghosh).

Abstract

In this paper, we consider a general Bayesian model which allows multiple grouping of parameters, where the components within a subgroup are exchangeable. The general idea is then illustrated for the normal means estimation problem under priors which are scale mixtures of normals. We also discuss implementation of the Bayes procedure via Markov chain Monte Carlo integration techniques. We illustrate the proposed methods with a numerical example. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Partial exchangeability; Multiple shrinkage; Robust Bayesian; Scale mixtures of normals; Model averaging

1. Introduction

Often, in statistical practice, it is useful to combine results from several experiments or observational studies. One objective for this is the ability to make reliable inference for each component experiment (or study) by "borrowing strength" from other related experiments (or studies). This is a standard practice for small area estimation (cf. Ghosh and Rao, 1994). A second objective is to combine inferential summaries from several studies into a single analysis. The latter, popularly known as "metaanalysis", has been discussed in Hedges and Olkin (1985) and DuMouchel (1990) from the frequentist and Bayesian perspectives, respectively.

One simple Bayesian approach for the general problem, in the absence of covariates, is to use a prior that builds exchangeability among the component problems, thereby allowing the experimenter to "borrow strength" in a sensible way. An interesting discussion with examples appears in DuMouchel (1990) (see also Morris and Normand, 1992).

Malec and Sedransk (1992) have pointed out the weakness of Bayesian models based solely on the exchangeability assumption. As an example, often the population means are clustered into two or more subgroups, as opposed to being clustered in a single group. Clearly, shrinking all the means towards a common weighted average is inappropriate in such cases, and a modified analysis which allows different shrinkage points is called for.

A useful substitute for exchangeability in the above situation is partial exchangeability, where the components within a subgroup are exchangeable, but the different subgroups are not. Partial exchangeability is often dictated by the problem at hand. To cite an example (cf. Efron and Morris, 1973), for estimating the batting averages of baseball players, it is natural to cluster the players into two groups: (i) right-handed, and (ii) left-handed.

However, in reality, a natural clustering of parameters from prior considerations is not immediate. Instead, an adaptive clustering dictated by the data seems more appropriate. Malec and Sedransk (1992) initiated such a study for the normal means estimation problem. Their method allowed partition of the parameters into different plausible clusters, and prior probabilities were assigned to the different partitions. For a given partition, within each cluster, a normal–normal hierarchical model built exchangeability among the component parameters, while parameters in different clusters were assigned independent priors. This idea was pursued further by Consonni and Veronese (1995), who considered combining results from several binomial experiments.

The objective of this paper is to outline, in Section 2, a general Bayesian model which allows multiple grouping of parameters. The basic idea is then illustrated with the aid of scale mixture of normal priors for the normal means estimation problem. Priors which are scale mixtures of normals have, by construction, tails that are flatter than those of the normal, and are very suitable for robust Bayesian analysis. Also, this class of priors is sufficiently rich, since it includes the Student t family, double exponential, logistic and the exponential power family of Box and Tiao (1973), among others. These priors are also very useful for detection of outliers, as pointed out in Dawid (1973) and O'Hagan (1979, 1988).

Section 2 also discusses implementation of the Bayes procedure via Markov chain Monte Carlo (MC²) integration techniques (cf. Geman and Geman, 1984; Gelfand and Smith, 1990). This is in contrast to the approximation of posterior pdf's as done, for example, in Consonni and Veronese (1995). Also, unlike Malec and Sedransk (1992), we do not have closed-form formulas for the posterior moments, so that numerical integration becomes a necessity. Finally, in this section, we obtain posterior means and variances by "model averaging" as in Malec and Sedransk (1992), Draper (1995) or Consonni and Veronese (1995). Section 3 contains a numerical example with multivariate t priors to illustrate the methods of Section 2.

As a consequence of multiple grouping, estimates of parameters within different groups are shrunk towards different points. In this way, the present procedure can be aptly described as a "multiple shrinkage" procedure. That is, however, distinct from the multiple shrinkage of George (1986), who considered as the prior a weighted average of several normal distributions with different means. This results in shrinking each parameter towards a common weighted average of prior means, where the weights are governed by the prior weights as well as the data. The point of distinction is that, unlike ours, George (1986) is not shrinking his estimates towards different group means, but instead towards a common weighted average of several means.

2. The Bayesian analysis

Let Y_i, given θ_i and λ, be independently distributed with pdf's f(y_i | θ_i, λ) (i = 1, ..., L). We denote by G the number of partitions of {1, ..., L} that we want to consider in a given context. Clearly 1 ≤ G ≤ B_L, where B_L denotes the total number of partitions of {1, ..., L} (the Lth Bell number). A typical partition is denoted by g, comprising d(g) subsets S_k(g) of sizes p_k(g) (k = 1, ..., d(g)). For example, if L = 4 and one considers only the partitions g1 = {{1, 2}, {3, 4}} and g2 = {{1, 2}, {3}, {4}}, then G = 2, d(g1) = 2 with S_1(g1) = {1, 2}, S_2(g1) = {3, 4}, p_1(g1) = 2 = p_2(g1). Similarly, d(g2) = 3 with S_1(g2) = {1, 2}, S_2(g2) = {3}, S_3(g2) = {4} and p_1(g2) = 2, p_2(g2) = 1, p_3(g2) = 1.
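To fix ideas, the partition bookkeeping above amounts to the following minimal Python sketch (ours, not the authors'); the representation of a partition as a list of index sets and the helper names d and sizes are illustrative choices.

```python
# A minimal sketch (not from the paper) of the partition bookkeeping used above.
# A partition g of {1, ..., L} is represented as a list of disjoint index sets S_k(g).

from typing import List, Set

Partition = List[Set[int]]

def d(g: Partition) -> int:
    """Number of subsets d(g) in the partition g."""
    return len(g)

def sizes(g: Partition) -> List[int]:
    """Subset sizes p_k(g), k = 1, ..., d(g)."""
    return [len(S) for S in g]

# The L = 4 example from the text:
g1: Partition = [{1, 2}, {3, 4}]
g2: Partition = [{1, 2}, {3}, {4}]

assert d(g1) == 2 and sizes(g1) == [2, 2]
assert d(g2) == 3 and sizes(g2) == [2, 1, 1]
```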

Write θ = (θ_1, ..., θ_L). We consider the following partially exchangeable prior for (θ, λ):

For a given partition g, we denote by π(θ_{(S_k(g))}) the joint prior of the θ_i (i ∈ S_k(g)). Here θ_{(S_k(g))} denotes the vector of the θ_i belonging to S_k(g), with suffixes arranged in ascending order. It is assumed that the θ_i belonging to different groups under a given partition are independently distributed. Thus, corresponding to the partition g, θ has the joint prior

$$\pi_g(\theta)=\prod_{k=1}^{d(g)}\pi\big(\theta_{(S_k(g))}\big).$$

Also, using the prior π(λ) for λ, independent of θ, one arrives at the joint prior π_g(θ)π(λ) for (θ, λ) under the partition g. Finally, one assigns the prior probability p(g) to the partition g, where

$$\sum_{g=1}^{G}p(g)=1.$$

Now for a given partition g, one gets the posterior

$$\pi(\theta,\lambda\mid g,y)\;\propto\;\Bigg[\prod_{k=1}^{d(g)}\pi\big(\theta_{(S_k(g))}\big)\prod_{i\in S_k(g)}f(y_i\mid\theta_i,\lambda)\Bigg]\pi(\lambda). \qquad (2.1)$$

If now p(g | y) denotes the posterior probability of g given y, the joint posterior of θ and λ given y is

$$\pi(\theta,\lambda\mid y)=\sum_{g=1}^{G}p(g\mid y)\,\pi(\theta,\lambda\mid g,y). \qquad (2.2)$$

The posterior moments for the θ_i are now obtained as

$$E(\theta_i\mid y)=E\big[E\{\theta_i\mid g,y\}\mid y\big]=\sum_{g=1}^{G}p(g\mid y)\,E(\theta_i\mid g,y), \qquad (2.3)$$

$$V(\theta_i\mid y)=E\big[V\{\theta_i\mid g,y\}\mid y\big]+V\big[E\{\theta_i\mid g,y\}\mid y\big]
=\sum_{g=1}^{G}p(g\mid y)\,V(\theta_i\mid g,y)+\sum_{g=1}^{G}p(g\mid y)\{E(\theta_i\mid g,y)\}^{2}-\{E(\theta_i\mid y)\}^{2}, \qquad (2.4)$$

$$\mathrm{Cov}(\theta_i,\theta_j\mid y)=\sum_{g=1}^{G}p(g\mid y)\,\mathrm{Cov}(\theta_i,\theta_j\mid g,y)
+\sum_{g=1}^{G}p(g\mid y)\,E(\theta_i\mid g,y)\,E(\theta_j\mid g,y)-E(\theta_i\mid y)\,E(\theta_j\mid y). \qquad (2.5)$$

Note that if i and j belong to different clusters under the partition g, then Cov(θ_i, θ_j | g, y) = 0.
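As a hedged illustration, the model-averaging formulas (2.3)–(2.5) amount to the short Python sketch below; the array names and the assumption that the per-partition moments and p(g | y) are already available (e.g., from the MCMC output discussed later) are ours.

```python
# A minimal sketch of the model-averaging formulas (2.3)-(2.5).  Assumes the
# per-partition posterior moments and p(g | y) have already been computed.

import numpy as np

def averaged_moments(p_g, mean_g, var_g, cov_g):
    """p_g: (G,) posterior partition probabilities p(g | y).
    mean_g: (G, L) with E(theta_i | g, y);  var_g: (G, L) with V(theta_i | g, y);
    cov_g: (G, L, L) with Cov(theta_i, theta_j | g, y) (zero across clusters)."""
    p_g, mean_g, var_g, cov_g = map(np.asarray, (p_g, mean_g, var_g, cov_g))
    post_mean = p_g @ mean_g                                        # (2.3)
    post_var = p_g @ var_g + p_g @ (mean_g ** 2) - post_mean ** 2   # (2.4)
    post_cov = (np.einsum("g,gij->ij", p_g, cov_g)
                + np.einsum("g,gi,gj->ij", p_g, mean_g, mean_g)
                - np.outer(post_mean, post_mean))                   # (2.5)
    return post_mean, post_var, post_cov
```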

The above methodology is now illustrated with an example. Suppose Y_i | θ_i, r ~ind N(θ_i, r^{-1}), i = 1, ..., L. For a given partition g, θ_{(S_k(g))} has a multivariate t pdf with location parameter μ_k(g), scale parameter σ_k(g) and degrees of freedom ν_k(g).

Symbolically (see, e.g., Zellner, 1971, or Press, 1972),

$$\pi\big(\theta_{(S_k(g))}\big)\;\propto\;\Bigg[\nu_k(g)+\sum_{i\in S_k(g)}\big(\theta_i-\mu_k(g)\big)^{2}\big/\sigma_k^{2}(g)\Bigg]^{-(\nu_k(g)+p_k(g))/2}, \qquad (2.6)$$

where one may recall that p_k(g) is the size of S_k(g) under the partition g. For r we assign the gamma pdf π(r) ∝ exp(−(a/2)r) r^{b/2−1}, with a > 0 and b > 0. This is referred to as a Gamma(a/2, b/2) prior.

Direct evaluation of the joint posterior of θ given y involves high-dimensional numerical integration, and is analytically intractable. Instead, we adopt the MC² numerical integration methods.

For a given partition g, the MC² method requires finding the full conditionals for θ_i given θ_j (j ≠ i), r, g and y (i = 1, ..., L), and for r given θ, g and y. The calculations are greatly facilitated by parameter augmentation, as described below.

We write the pdf given in (2.6) in two stages. Conditional on some parameter u_k(g), the θ_i (i ∈ S_k(g)) are iid N(μ_k(g), u_k^{-1}(g)σ²_k(g)); u_k(g) has the marginal Gamma(ν_k(g)/2, ν_k(g)/2) pdf.

The full conditionals are given as follows:

(i) θ_i (i ∈ S_k(g)) | y, u_1(g), ..., u_{d(g)}(g), θ_j (j ∉ S_k(g)), r, g are independently distributed as
$$N\!\left(\frac{r y_i+\mu_k(g)\,u_k(g)/\sigma_k^{2}(g)}{r+u_k(g)/\sigma_k^{2}(g)},\;\frac{1}{r+u_k(g)/\sigma_k^{2}(g)}\right);$$

(ii) r | y, θ, u_1(g), ..., u_{d(g)}(g), g ~ Gamma$\left(\tfrac{1}{2}\Big\{a+\sum_{i=1}^{L}(y_i-\theta_i)^{2}\Big\},\;\tfrac{1}{2}(L+b)\right)$;

(iii) u_k(g) | y, θ, r, g ~ Gamma$\left(\tfrac{1}{2}\Big\{\nu_k(g)+\sum_{i\in S_k(g)}(\theta_i-\mu_k(g))^{2}/\sigma_k^{2}(g)\Big\},\;\tfrac{1}{2}\big(p_k(g)+\nu_k(g)\big)\right)$.

Also, the full conditional for the partition variable g is given by

$$p\big(g\mid y,\theta,r,u_1(g),\ldots,u_{d(g)}(g)\big)
=\frac{\displaystyle p(g)\prod_{k=1}^{d(g)}\sigma_k(g)^{-p_k(g)}\,u_k(g)^{\frac{1}{2}\{p_k(g)+\nu_k(g)\}-1}
\exp\!\Bigg(-\frac{u_k(g)}{2}\Bigg\{\nu_k(g)+\frac{\sum_{i\in S_k(g)}(\theta_i-\mu_k(g))^{2}}{\sigma_k^{2}(g)}\Bigg\}\Bigg)}
{\displaystyle \sum_{g'=1}^{G}p(g')\prod_{k=1}^{d(g')}\sigma_k(g')^{-p_k(g')}\,u_k(g')^{\frac{1}{2}\{p_k(g')+\nu_k(g')\}-1}
\exp\!\Bigg(-\frac{u_k(g')}{2}\Bigg\{\nu_k(g')+\frac{\sum_{i\in S_k(g')}(\theta_i-\mu_k(g'))^{2}}{\sigma_k^{2}(g')}\Bigg\}\Bigg)}. \qquad (2.7)$$
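Putting (i)–(iii) and (2.7) together, one Gibbs sweep might look like the following Python sketch. This is our illustration rather than the authors' code: the data structures, the decision to refresh u_k(g) for every candidate partition at each sweep (suggested by the conditioning in (2.7)), and the reading of Gamma(α, β) as rate α and shape β (implied by the prior π(r) ∝ exp(−ar/2) r^{b/2−1}) are assumptions of the sketch.

```python
# A minimal sketch (not the authors' code) of one Gibbs sweep over (theta, r, u, g)
# using full conditionals (i)-(iii) and (2.7).

import numpy as np

rng = np.random.default_rng(0)

def gamma_draw(rate, shape):
    """Draw from the text's Gamma(rate, shape) parameterization."""
    return rng.gamma(shape, 1.0 / rate)

def gibbs_sweep(y, theta, r, u, g_idx, partitions, p_prior, mu, sigma2, nu, a, b):
    """One sweep.  partitions[g] is a list of 0-based index sets S_k(g); u[g] is the
    list of u_k(g); mu[g], sigma2[g], nu[g] hold mu_k(g), sigma_k^2(g), nu_k(g)."""
    L = len(y)
    G = len(partitions)

    # (iii): refresh u_k(g) for every candidate partition g (assumption of the sketch).
    for g in range(G):
        for k, S in enumerate(partitions[g]):
            ss = sum((theta[i] - mu[g][k]) ** 2 for i in S) / sigma2[g][k]
            u[g][k] = gamma_draw(0.5 * (nu[g][k] + ss), 0.5 * (len(S) + nu[g][k]))

    # (i): theta_i given the current partition g_idx.
    for k, S in enumerate(partitions[g_idx]):
        prec = r + u[g_idx][k] / sigma2[g_idx][k]
        for i in S:
            mean = (r * y[i] + mu[g_idx][k] * u[g_idx][k] / sigma2[g_idx][k]) / prec
            theta[i] = rng.normal(mean, np.sqrt(1.0 / prec))

    # (ii): r given theta.
    r = gamma_draw(0.5 * (a + np.sum((y - theta) ** 2)), 0.5 * (L + b))

    # (2.7): draw the partition from its full conditional (computed on the log scale).
    logw = np.empty(G)
    for g in range(G):
        lw = np.log(p_prior[g])
        for k, S in enumerate(partitions[g]):
            ss = sum((theta[i] - mu[g][k]) ** 2 for i in S) / sigma2[g][k]
            lw += (-len(S) * 0.5 * np.log(sigma2[g][k])
                   + (0.5 * (len(S) + nu[g][k]) - 1.0) * np.log(u[g][k])
                   - 0.5 * u[g][k] * (nu[g][k] + ss))
        logw[g] = lw
    w = np.exp(logw - logw.max())
    g_idx = rng.choice(G, p=w / w.sum())

    return theta, r, u, g_idx
```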

Inferences about θ will be based on the posterior moments (2.3)–(2.5). To compute the posterior pdf p(θ | g, y) of θ given g and y, one typically first derives p(θ | g, y, u_k(g), r) and then p(u_k(g), r | g, y). It is immediate to verify that, for i ∈ S_k(g),

$$\theta_i\mid g,y_i,u_k(g),r\;\stackrel{\mathrm{ind}}{\sim}\;N\!\left[\frac{u_k(g)}{u_k(g)+r\sigma_k^{2}(g)}\,\mu_k(g)+\frac{r\sigma_k^{2}(g)}{u_k(g)+r\sigma_k^{2}(g)}\,y_i,\;\frac{r\sigma_k^{2}(g)}{u_k(g)+r\sigma_k^{2}(g)}\right]. \qquad (2.8)$$

On the other hand, the distribution of uk(g) and r given g and y is

$$p\big(u_k(g),r\mid g,y\big)\;\propto\;u_k(g)^{(\nu_k(g)/2)-1}\,r^{(b/2)-1}\Bigg\{\prod_{i\in S_k(g)}\big(r^{-1}+u_k^{-1}(g)\sigma_k^{2}(g)\big)^{-1/2}\Bigg\}
\exp\!\Bigg[-\frac{1}{2}\Bigg\{a r+\nu_k(g)\,u_k(g)+\frac{\sum_{i\in S_k(g)}\big(y_i-\mu_k(g)\big)^{2}}{r^{-1}+u_k^{-1}(g)\sigma_k^{2}(g)}\Bigg\}\Bigg]. \qquad (2.9)$$

Hence, conditional on the partition g, it can be shown that for i∈ Sk(g),

$$E(\theta_i\mid g,y)=E\big[B(u_k(g),r)\mid y,g\big]\,\mu_k(g)+\big(1-E\big[B(u_k(g),r)\mid y,g\big]\big)\,y_i
= y_i-\big(y_i-\mu_k(g)\big)\,E\big[B(u_k(g),r)\mid y,g\big], \qquad (2.10)$$

where B(u_k(g), r) = u_k(g)/(u_k(g) + rσ²_k(g)). Also, we obtain, for i ∈ S_k(g),

$$V(\theta_i\mid g,y)=1-E\big[B(u_k(g),r)\mid y,g\big]+\big(y_i-\mu_k(g)\big)^{2}\,V\big\{B(u_k(g),r)\mid y,g\big\}. \qquad (2.11)$$

Finally, for i, j ∈ S_k(g), i ≠ j,

$$\mathrm{Cov}(\theta_i,\theta_j\mid g,y)=\big(y_i-\mu_k(g)\big)\big(y_j-\mu_k(g)\big)\,V\big\{B(u_k(g),r)\mid y,g\big\}. \qquad (2.12)$$

For implementing the Gibbs sampler, Gelman and Rubin (1992) recommended running m (≥ 2) parallel chains, each for 2d iterations, with starting points drawn from an overdispersed distribution. To diminish the effects of the starting distribution, the first d iterations of each chain are discarded. Hence, after d iterations, we retain all of the subsequent iterates for finding the posterior moments given in (2.3)–(2.5), as well as for monitoring the convergence of the Gibbs sampler.
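For a single scalar summary collected from the m chains, the standard Gelman–Rubin potential scale reduction factor can be computed as in the short sketch below (a generic diagnostic, not something specific to this paper).

```python
# A small sketch of the Gelman-Rubin diagnostic for monitoring convergence:
# draws is an (m, n) array holding the retained iterates of one scalar quantity
# from each of the m parallel chains.

import numpy as np

def gelman_rubin(draws):
    m, n = draws.shape
    chain_means = draws.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = draws.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled posterior variance estimate
    return np.sqrt(var_hat / W)              # potential scale reduction factor
```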


To estimate the posterior moments using Gibbs sampling, we use the Rao–Blackwellized estimates as in Gelfand and Smith (1991). Note that, using (2.10), E(θ_i | y) is approximated by

$$\sum_{g=1}^{G}\Bigg[y_i-\big(y_i-\mu_k(g)\big)\,\frac{1}{md}\sum_{l=1}^{m}\sum_{t=d+1}^{2d}B\big(u_{klt}(g),r_{lt}\big)\Bigg]\hat p(g\mid y), \qquad (2.13)$$

where l indexes the m parallel chains, t indexes the retained iterations d + 1, ..., 2d within each chain, u_{klt}(g) and r_{lt} denote the corresponding Gibbs iterates, and p̂(g | y) denotes the estimate of p(g | y) given in (2.16) below.

Similarly, one uses (2.10) and (2.11) to approximate V(θ_i | y) by

$$\sum_{g=1}^{G}\Bigg[1-\frac{1}{md}\sum_{l=1}^{m}\sum_{t=d+1}^{2d}B\big(u_{klt}(g),r_{lt}\big)+\big(y_i-\mu_k(g)\big)^{2}\Bigg\{\frac{1}{md}\sum_{l=1}^{m}\sum_{t=d+1}^{2d}B^{2}\big(u_{klt}(g),r_{lt}\big)-\Bigg(\frac{1}{md}\sum_{l=1}^{m}\sum_{t=d+1}^{2d}B\big(u_{klt}(g),r_{lt}\big)\Bigg)^{2}\Bigg\}\Bigg]\hat p(g\mid y)$$
$$+\;\sum_{g=1}^{G}\Bigg[y_i-\big(y_i-\mu_k(g)\big)\,\frac{1}{md}\sum_{l=1}^{m}\sum_{t=d+1}^{2d}B\big(u_{klt}(g),r_{lt}\big)\Bigg]^{2}\hat p(g\mid y)\;-\;\big\{\hat E(\theta_i\mid y)\big\}^{2}, \qquad (2.14)$$

where Ê(θ_i | y) is the estimate given in (2.13).

Also, using (2.10) and (2.12), Cov(θ_i, θ_j | y) is approximated by

$$\sum_{g=1}^{G}\big(y_i-\mu_k(g)\big)\big(y_j-\mu_k(g)\big)\Bigg\{\frac{1}{md}\sum_{l=1}^{m}\sum_{t=d+1}^{2d}B^{2}\big(u_{klt}(g),r_{lt}\big)-\Bigg(\frac{1}{md}\sum_{l=1}^{m}\sum_{t=d+1}^{2d}B\big(u_{klt}(g),r_{lt}\big)\Bigg)^{2}\Bigg\}\hat p(g\mid y)$$
$$+\;\sum_{g=1}^{G}\Bigg\{y_i-\big(y_i-\mu_k(g)\big)\,\frac{1}{md}\sum_{l=1}^{m}\sum_{t=d+1}^{2d}B\big(u_{klt}(g),r_{lt}\big)\Bigg\}\Bigg\{y_j-\big(y_j-\mu_k(g)\big)\,\frac{1}{md}\sum_{l=1}^{m}\sum_{t=d+1}^{2d}B\big(u_{klt}(g),r_{lt}\big)\Bigg\}\hat p(g\mid y)\;-\;\hat E(\theta_i\mid y)\,\hat E(\theta_j\mid y). \qquad (2.15)$$


Finally, using the Gibbs sampler output and (2.7), p(g | y) is approximated by

$$\hat p(g\mid y)=\frac{1}{md}\sum_{l=1}^{m}\sum_{t=d+1}^{2d}p\big(g\mid y,\theta=\theta_{lt},\,r=r_{lt},\,u_1(g)=u_{1lt}(g),\ldots,u_{d(g)}(g)=u_{d(g)lt}(g)\big). \qquad (2.16)$$
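A compact way to assemble (2.13), (2.14) and (2.16) from stored Gibbs output is sketched below; the container names (B, pg_draws) and the 0-based indexing are hypothetical choices of ours.

```python
# A minimal sketch of the Rao-Blackwellized estimates (2.13), (2.14) and (2.16).
# B[g][k] is assumed to hold the md retained draws of B(u_k(g), r) for cluster k of
# partition g, and pg_draws[g] the md retained draws of the conditional in (2.7).
# Index sets in partitions are 0-based.

import numpy as np

def rao_blackwell_estimates(y, partitions, mu, B, pg_draws):
    G, L = len(partitions), len(y)
    # (2.16): estimate of p(g | y).
    p_hat = np.array([np.mean(pg_draws[g]) for g in range(G)])

    post_mean = np.zeros(L)
    post_var = np.zeros(L)
    for g in range(G):
        for k, S in enumerate(partitions[g]):
            bk = np.asarray(B[g][k])
            Bbar = bk.mean()
            Bvar = (bk ** 2).mean() - Bbar ** 2
            for i in S:
                Eg = y[i] - (y[i] - mu[g][k]) * Bbar              # (2.10) averaged
                Vg = 1.0 - Bbar + (y[i] - mu[g][k]) ** 2 * Bvar   # (2.11) averaged
                post_mean[i] += p_hat[g] * Eg                     # (2.13)
                post_var[i] += p_hat[g] * (Vg + Eg ** 2)          # first part of (2.14)
    post_var -= post_mean ** 2                                    # complete (2.14)
    return p_hat, post_mean, post_var
```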

3. A numerical example

We illustrate the methods of the previous section with a numerical example. Suppose the given data are y1 = 1.1, y2 = 1.2 and y3 = 10.0. Here L = 3. Suppose the four partitions that we are interested in are g1 = {{1}, {2}, {3}}, g2 = {{1, 2}, {3}}, g3 = {{1}, {2, 3}} and g4 = {1, 2, 3}. To maintain neutrality, we assign equal prior probabilities 1/4 to these partitions. The choice of these partitions is consistent with the recommendation of Consonni and Veronese (1995), who advocate always retaining the two extreme partitions g1 and g4, namely those corresponding to independence and exchangeability. The given data suggest g2 as the best possible partition, but g3 is also considered in order to examine the effect of a seemingly wrong partition. As we shall see in this section, the posterior probability of g2 is not overwhelmingly larger than the posterior probabilities of the other possible partitions, so that model averaging still seems reasonable.

In deriving the posterior moments of θ_i (i = 1, 2, 3) given y, as well as the posterior probabilities p(g | y), we have used μ_k(g) = 0 throughout, and σ²_k(g) = σ² = 0.5, 1.0, 5.0 and 10.0. Also, to avoid fully exchangeable priors, we have chosen different degrees of freedom for the t-priors for the different subgroups. In particular, we have taken ν1(g1) = 2, ν2(g1) = 1, ν3(g1) = 3, ν1(g2) = 2, ν2(g2) = 1, ν1(g3) = 1, ν2(g3) = 2, ν1(g4) = 3. Clearly, the choice of degrees of freedom is ad hoc, but it will illustrate the main points that we are going to make.

To implement and monitor the convergence of the Gibbs sampler, we follow the basic approach of Gelman and Rubin (1992). We consider 10 independent sequences, each with a sample of size 2000 after a burn-in of another 2000. We have used a = b = 0.005 as the parameters of the Gamma distribution.
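For concreteness, the settings of this example could drive the gibbs_sweep sketch of Section 2 roughly as follows (again our illustration, not the authors' code).

```python
# A sketch of how this example's settings could drive the gibbs_sweep function
# sketched in Section 2.  Indices are 0-based here, so {{1,2},{3}} becomes [{0,1},{2}].

import numpy as np

y = np.array([1.1, 1.2, 10.0])
partitions = [[{0}, {1}, {2}], [{0, 1}, {2}], [{0}, {1, 2}], [{0, 1, 2}]]
p_prior = [0.25, 0.25, 0.25, 0.25]
sigma2_value = 1.0                                      # one of 0.5, 1.0, 5.0, 10.0
mu = [[0.0] * len(g) for g in partitions]
sigma2 = [[sigma2_value] * len(g) for g in partitions]
nu = [[2.0, 1.0, 3.0], [2.0, 1.0], [1.0, 2.0], [3.0]]   # degrees of freedom as in the text
a = b = 0.005

m, d = 10, 2000
# run one of the m parallel chains; in practice this loop is repeated m times
# from overdispersed starting points, and the first d sweeps are discarded
theta, r = y.copy(), 1.0                                # crude starting values
u = [[1.0] * len(g) for g in partitions]
g_idx = 0
for sweep in range(2 * d):
    theta, r, u, g_idx = gibbs_sweep(y, theta, r, u, g_idx, partitions,
                                     p_prior, mu, sigma2, nu, a, b)
    # after the first d sweeps, store theta, r, u, g_idx (and the draws of B and of
    # the conditional in (2.7)) for the Rao-Blackwellized estimates (2.13)-(2.16)
```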

Table 1 illustrates that, although g2 is the clear winner under all circumstances, as is consistent with the data, the other partitions also have nonnegligible posterior probabilities. Second, we notice that the larger the value of σ², the better we are able to identify the correct partition, namely g2. This phenomenon becomes apparent from (2.8) once we observe that, for large σ², the full conditional for the θ_i becomes approximately N(y_i, 1) (i = 1, 2, 3). Now the wide discrepancy between (y1, y2) and y3 will be reflected in the generation of the θ_i (i = 1, 2, 3) from (2.8), and will then have its effect in the calculation of the partition probabilities as given in (2.7).

Table 1
Posterior probabilities for a selected collection of partitions g

                                     p(g | y)
Partition g              σ² = 0.5   σ² = 1.0   σ² = 5.0   σ² = 10.0
g1 = {{1}, {2}, {3}}       0.068      0.057      0.038      0.043
g2 = {{1, 2}, {3}}         0.347      0.409      0.582      0.596
g3 = {{1}, {2, 3}}         0.285      0.246      0.186      0.195
g4 = {1, 2, 3}             0.300      0.288      0.194      0.166

Table 2
Data, estimates of the θ_i, and the standard errors (in parentheses)

                                       E(θ_i | y)
i    y_i      σ² = 0.5        σ² = 1.0        σ² = 5.0        σ² = 10.0
1    1.1      0.290 (0.648)   0.430 (0.763)   0.762 (0.918)   0.852 (0.941)
2    1.2      0.335 (0.695)   0.490 (0.806)   0.846 (0.939)   0.940 (0.956)
3    10.0     3.772 (4.500)   5.017 (4.579)   7.758 (3.562)   8.371 (3.020)

Table 2 provides the posterior means and standard errors of the θ_i via the model-averaging formulas given in (2.3) and (2.4). Once again, the discrepancy between (y1, y2) and y3 is reflected in the posterior means of the θ_i; that is, E(θ3 | y) is much larger than E(θ1 | y) and E(θ2 | y), as suggested by the data. Second, the larger the σ², the closer the posterior mean E(θ_i | y) is to y_i (i = 1, 2, 3), as one may anticipate from (2.8).

References

Box, G.E.P., Tiao, G.C., 1973. Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, MA.
Consonni, G., Veronese, P., 1995. A Bayesian method for combining results from several binomial experiments. J. Amer. Statist. Assoc. 90, 935–944.
Dawid, A.P., 1973. Posterior expectations for large observations. Biometrika 60, 664–667.
Draper, D., 1995. Assessment and propagation of model uncertainty (with discussion). J. Roy. Statist. Soc. Ser. B 57, 45–97.
DuMouchel, W., 1990. Bayesian metaanalysis. In: Berry, D.A. (Ed.), Statistical Methodology in the Pharmaceutical Sciences. Dekker, New York, pp. 509–529.
Efron, B., Morris, C., 1973. Combining possibly related estimation problems. J. Roy. Statist. Soc. Ser. B 35, 379–421.
Gelfand, A.E., Smith, A.F.M., 1990. Sampling based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85, 398–409.
Gelfand, A.E., Smith, A.F.M., 1991. Gibbs sampling for marginal posterior expectations. Comm. Statist. Theory Methods 20 (5–6), 1747–1766.
Gelman, A., Rubin, D.B., 1992. Inference from iterative simulation (with discussion). Statist. Sci. 7, 457–511.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741.
George, E.I., 1986. Combining minimax shrinkage estimators. J. Amer. Statist. Assoc. 81, 437–445.
Ghosh, M., Rao, J.N.K., 1994. Small area estimation: an appraisal (with discussion). Statist. Sci. 9, 55–93.
Hedges, L.V., Olkin, I., 1985. Statistical Methods for Meta-Analysis. Academic Press, Orlando.
Malec, D., Sedransk, J., 1992. Bayesian methodology for combining the results from different experiments when the specifications for pooling are uncertain. Biometrika 79, 593–601.
Morris, C.N., Normand, S.L., 1992. Hierarchical methods for combining information and for meta-analysis. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics, Vol. 4. Oxford Science Publications, Oxford, pp. 321–335.
O'Hagan, A., 1979. On outlier rejection phenomena in Bayes inference. J. Roy. Statist. Soc. Ser. B 41, 358–367.
O'Hagan, A., 1988. Modeling with heavy tails. In: Bernardo, J.M., DeGroot, M.H., Lindley, D.V., Smith, A.F.M. (Eds.), Bayesian Statistics, Vol. 3. Oxford University Press, Oxford, pp. 349–359.
Press, S.J., 1972. Applied Multivariate Analysis. Holt, Rinehart and Winston, New York.
Zellner, A., 1971. An Introduction to Bayesian Inference in Econometrics. Wiley, New York.