Bayesian Meta-analysis with Hierarchical Modeling

Brian P. Hobbs¹

Division of Biostatistics, School of Public Health, University of Minnesota, Mayo Mail Code 303, Minneapolis, Minnesota 55455–0392, U.S.A.

¹Brian P. Hobbs is Graduate Assistant and Bradley P. Carlin is Professor of Biostatistics and Mayo Professor in Public Health at the Division of Biostatistics, School of Public Health, 420 Delaware St. S.E., University of Minnesota, Minneapolis, MN, 55455.

1 Introduction

The Bayesian approach to inference enables relevant existing information to be formally incorporated into

a statistical analysis. This is done through the specification of prior distributions, which summarize our

preexisting understanding or beliefs regarding any unknown model parameters θ = (θ1, . . . , θK)′. Inference is

conducted on the posterior distribution of θ given the observed data y = (y1, . . . , yN )′, given by Bayes’ Rule

as

p(θ|y) = p(θ, y)/p(y) = p(y|θ)p(θ) / ∫ p(y|θ)p(θ) dθ .

This simple formulation assumes the prior p(θ) is fully specified. However, when we are less certain about

p(θ), or when model variability must be allocated to multiple sources (say, centers and patients within centers),

a hierarchical model may be more appropriate. This approach places prior distributions on the unknown

parameters of previously specified priors in stages. Posterior distributions are again derived by Bayes’ theorem,

where the denominator integral is now more difficult, but remains feasible using modern Markov chain Monte

Carlo (MCMC) methods. The WinBUGS package (http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml)

and its open source cousin OpenBUGS (http://mathstat.helsinki.fi/openbugs) are able to handle a wide

variety of hierarchical models, permitting posterior inference, prediction, model choice, and model checking

all within a user-friendly MCMC framework. Hierarchical models permit “borrowing of strength” from the

prior distributions and across subgroups. When combined with any information incorporated in the priors,

this translates into a larger effective sample size, thus offering potentially important savings (both ethical and

financial) in the practice of drug and device clinical trials.

Suppose the prior distribution for θ depends on a vector of second-stage parameters γ. These parameters

are called hyperparameters, and we then write p(θ|γ). In a simple two-stage model, γ is assumed to be known,

and is often set to produce a noninformative prior (i.e., one that does not favor one value of θ over any other).

However, if γ is unknown, a third-stage prior, or hyperprior, p(γ) may be chosen. In clinical trials, p(γ) is often

determined at least in part using data from existing historical controls. This additional informative content

is part of what gives Bayesian methods their advantage over classical methods, although this advantage is

typically small if noninformative priors are used.

1

2 Application of Bayes methods to Meta-analysis

Results vary across studies due to random variation or differences in implementation. The studies may be

carried out at different times and locations or include different types of subjects. Furthermore, the applications

of eligibility criteria may vary. These differences may lead to disparate conclusions about the intervention of

interest across studies. Consider, for example, two studies which test the ability of a particular cardiac device

to improve heart efficiency by increasing the amount of blood pumped out of the left ventricle relative to the

amount of blood contained in the ventricle. Suppose that both studies define eligible patients as those who have

a left ventricular ejection fraction (LVEF) as low as 25%. One investigator may admit every such eligible

candidate patient. A second investigator might alter the LVEF boundary to 40% for a subset of individuals

with another condition, restricting eligibility. Consequently, the first study may tend to incorporate a frailer

population, so that the first study may suggest that the device is less effective than the second (Berry, 1997).

In such cases, a single comprehensive analysis of all relevant data from several independent studies, or a

meta-analysis, is often used to assess the clinical effectiveness of healthcare interventions. Results from meta-

analyses provide a thorough assessment of the intervention of interest. The Bayesian hierarchical approach to

meta-analysis treats “study” as one level of experimental unit, and “patient within study” as a second level

(Lindley and Smith, 1972; Berger, 1985). Inter-study differences may be accounted for by measured covariates

as in the above illustration; however, unaccounted for differences will still remain.

Since the Bayesian paradigm treats all unknowns as random, a Bayesian meta-analysis can be structured

as a random effects model. Specifically, each study in a Bayesian meta-analysis has a distribution of patient

responses specific to the particular study. Thus selecting a study corresponds to selecting one of these distri-

butions. Furthermore, one is limited to only a sample from each study’s distribution, revealing only indirect

information about the distribution of study-specific effects.

2.1 Meta-analysis for a Single Success Proportion

Berry (1997, Sec. 3.1) describes a simple yet commonly occurring setting where Bayesian meta-analysis

pays significant dividends, illustrating with the data in Table 1. These data are from nine antidepressant drug


Study (i)    xi     ni    πi = xi/ni
1            20     20    1.00
2             4     10    0.40
3            11     16    0.69
4            10     19    0.53
5             5     14    0.36
6            36     46    0.78
7             9     10    0.90
8             7      9    0.78
9             4      6    0.67
Total       106    150    0.71

Table 1: Successes xi and total numbers of patients ni in 9 antidepressant drug studies.

studies (Janicak et al., 1988), where a “success” is considered a positive response to the treatment regimen.

For our purpose of illustrating Bayesian hierarchical modeling in meta-analysis, suppose a “success” concerns

effectiveness of a medical device, and that within study i the experimental units receiving the intervention are

exchangeable (all have the same probability of success πi).

Define the random variable xi to be the number of successes among the ni patients in study i, so that

xi ∼ Binomial(ni, πi) for i = 1, . . . , 9 .

The likelihood function for π = (π1, . . . , π9)′ is then

p(x|π) ∝ ∏_{i=1}^{9} πi^xi (1 − πi)^(ni−xi) .   (1)

A pooled analysis assumes that all 150 patients are independent and identically distributed (iid), so that all nine πi's are equal to a common π. Given that we have 106 total successes in 150 trials, the likelihood function is then p(x|π) ∝ π^106 (1 − π)^44, which suggests that π is very likely to be between 0.6 and 0.8. However, the

observed success proportions (Table 1) in five of the nine studies are outside this range. This is more than

what would be expected from sampling variability alone, and suggests the πi’s may be unequal.
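The pooled-analysis claim above is easy to verify numerically. The following Python sketch (Python is not part of the original analysis, which works in WinBUGS) recomputes the pooled proportion from Table 1 and counts the studies whose observed rates fall outside (0.6, 0.8):

```python
# Successes and sample sizes from Table 1
x = [20, 4, 11, 10, 5, 36, 9, 7, 4]
n = [20, 10, 16, 19, 14, 46, 10, 9, 6]

# Pooled estimate: treat all 150 patients as iid Bernoulli(pi)
pooled = sum(x) / sum(n)            # 106/150, about 0.707

# Under the pooled model the likelihood concentrates between 0.6 and 0.8;
# count how many study-level proportions fall outside that range
props = [xi / ni for xi, ni in zip(x, n)]
outside = sum(1 for p in props if p < 0.6 or p > 0.8)

print(round(pooled, 3), outside)    # -> 0.707 5
```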

Sadly, separate analyses of the nine studies provide even less satisfying results. The effect of an experimental device is not well addressed by giving nine different likelihood functions, or by giving nine different



Figure 1: Beta(α, β) densities for α, β = 1, 2, 4, 8.

confidence intervals. Consider the probability of success if the device were used in a tenth study with another

patient population. Separate analyses provide no way to utilize the results from the nine previous studies.

A Bayesian hierarchical perspective provides a beneficial middle ground. Here we view each study’s success

probability πi as having been selected from a population. A computationally convenient assumption here is to

suppose the πi are a random sample from a beta distribution, i.e., πi ~ iid Beta(α, β). Writing B(α, β) = Γ(α + β)/[Γ(α)Γ(β)], each πi has density p(πi|α, β) = B(α, β) πi^(α−1) (1 − πi)^(β−1), a beta distribution with mean E(πi|α, β) = α/(α + β) and variance Var(πi|α, β) = αβ/[(α + β)²(α + β + 1)], where α, β > 0. Since Var(πi) → 0 as α + β → ∞, we can think of α + β as measuring homogeneity among studies. If α + β is large, then the distribution of the πi is highly concentrated near its mean. Smaller α and β permit more variability, hence a noticeable study effect

(unequal πi). Assuming that only two parameters index the entire distribution may seem restrictive, but Figure 1 shows the beta family to be surprisingly flexible, able to capture various shapes (flat, bell-shaped, U-shaped, one-tailed, etc.).
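A quick numerical illustration of the homogeneity interpretation (a Python sketch, not from the original): holding the beta mean fixed while growing α + β drives the variance toward zero.

```python
# Beta(alpha, beta) moments, per the formulas above
def beta_mean(a, b):
    return a / (a + b)

def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Hold the mean fixed at 2/3 while increasing alpha + beta:
# the variance shrinks, i.e., larger alpha + beta means the
# study-specific pi's are more homogeneous.
for scale in (3, 30, 300):
    a, b = 2 * scale / 3, scale / 3
    print(scale, beta_mean(a, b), beta_var(a, b))
```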

Since the Beta prior is conjugate with the binomial likelihood, the posterior of πi given xi emerges in

closed form using Bayes' theorem as p(πi|xi) ∝ πi^(α−1+xi) (1 − πi)^(β−1+ni−xi). That is, the πi are independent Beta(α + xi, β + ni − xi) random variables with mean

E[πi|xi] = (α + xi)/(α + β + ni) .   (2)
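The conjugate update in (2) amounts to one line of arithmetic. The Python sketch below applies it to study 2 from Table 1, using hypothetical hyperparameter values α = β = 2 purely for illustration (these are not the values estimated later in the paper):

```python
# Conjugate beta-binomial posterior mean from equation (2):
# given x_i, the posterior of pi_i is Beta(alpha + x_i, beta + n_i - x_i)
def posterior_mean(x_i, n_i, alpha, beta):
    return (alpha + x_i) / (alpha + beta + n_i)

# Study 2 from Table 1: 4 successes out of 10 patients,
# with an illustrative prior of mean 0.5
alpha, beta = 2.0, 2.0
print(posterior_mean(4, 10, alpha, beta))   # (2+4)/(2+2+10) = 6/14
```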

In order to proceed further with a Bayesian hierarchical approach, the impact of the hyperprior p(α, β) for the second-stage parameters needs to be assessed. Recall that concentrating the hyperprior's probability on large values of α and β suggests homogeneity among the πi, while small α + β suggests heterogeneity.

If each πi were observable, then the posterior distribution of (α, β) would be

p(α, β|π) ∝ [ ∏_{i=1}^{9} B(α, β) πi^(α−1) (1 − πi)^(β−1) ] p(α, β) .

In reality, πi cannot be observed directly, but indirect information about π1, ..., π9 is available through obser-

vations x = (x1, . . . , x9)′. Therefore, the posterior distribution of α and β, p(α, β|x), is proportional to

p(α, β) ∏_{i=1}^{9} B(α, β) ∫_0^1 πi^(xi+α−1) (1 − πi)^(ni−xi+β−1) dπi ∝ [ ∏_{i=1}^{9} B(α, β)/B(α + xi, β + ni − xi) ] p(α, β) .   (3)
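The product in (3) is numerically delicate for large counts, so in practice it is evaluated on the log scale via the log-gamma function. This Python sketch (an implementation detail not spelled out in the original) computes the log of the likelihood part of (3) for the Table 1 data:

```python
from math import lgamma

x = [20, 4, 11, 10, 5, 36, 9, 7, 4]
n = [20, 10, 16, 19, 14, 46, 10, 9, 6]

def log_norm_const(a, b):
    # log B(a, b) with B(a, b) = Gamma(a+b) / (Gamma(a) Gamma(b))
    return lgamma(a + b) - lgamma(a) - lgamma(b)

def log_marginal(a, b):
    # log of prod_i B(a, b) / B(a + x_i, b + n_i - x_i),
    # the likelihood part of equation (3)
    return sum(log_norm_const(a, b) - log_norm_const(a + xi, b + ni - xi)
               for xi, ni in zip(x, n))

# Compare two hyperparameter settings on the log scale
print(log_marginal(10, 4), log_marginal(1, 1))
```

Values of (α, β) near the data-supported region, such as (10, 4), receive a higher log marginal likelihood than, say, (1, 1).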

Given the data in Table 1, the posterior expected mean success rate for the next patient treated in study i,

for i = 1, ..., 9 is

E(πi|x) = E_{(α,β)}{ E[πi|α, β, x] } = E_{(α,β)}[ (α + xi)/(α + β + ni) | x ],   i = 1, . . . , 9.   (4)

Next, predictive distributions are obtained by averaging the likelihood over the full posterior distribution. If

a new, tenth study similar to the first nine is implemented, inference for π10 requires the predictive distribution,

p(π10|x1, ..., x9) = ∫∫ p(π10|α, β, x1, ..., x9) p(α, β|x1, ..., x9) dα dβ. It follows that the expected probability of a

successful treatment for a particular patient enrolled in the new study is the expected posterior mean of the

predictive distribution of π10,

E(π10|x) = E_{(α,β)}{ E[π10|α, β, x] } = E_{(α,β)}[ α/(α + β) | x ] .   (5)


Hyperprior probability distributions are often hard to conceptualize, so reference priors are commonly used when assigning distributions beyond the second stage. In the current model, the shape of the first-stage prior distribution varies considerably for relatively small changes in (α, β), as seen in Figure 1. Therefore, a prior that assigns some probability to both large and small values of α + β, while placing a moderate amount of probability on roughly equal values of α and β, will be quite effective in covering a wide range of shapes for p(πi|α, β).

Bowing to computational limitations of the time, Berry (1997) adopted independent discrete uniform priors

on {1, 2, . . . , 10} for α and β, essentially discretizing the (α, β) space onto a square 10×10 grid. Here we switch

to independent continuous U(0, 20) priors, a true “joint flat prior” over a broad range of sensible values. Thus

the posterior probability density function p(α, β|x) is proportional to the likelihood restricted to [0, 20]×[0, 20].

Note also that values larger than 20 carry some likelihood, so the truncation of α and β at 20 introduces a slight approximation into this model.

2.2 Sampling Based Inference using MCMC

In order to analyze the data in Table 1 using our three-stage model, we use Markov chain Monte Carlo

(MCMC) computational methods implemented in WinBUGS. These methods operate by sampling from a Markov

chain whose stationary distribution is the joint posterior distribution. This permits easy evaluation of posterior

distributions for the πi, which lack closed forms due to the nonconjugate hyperprior for (α, β). Specifically,

we may estimate E(πi|x) in (4) by MCMC sampling {(α(g), β(g)), g = 1, . . . , G} from their joint posterior, and then using the Monte Carlo approximation

E(πi|x) ≈ (1/G) ∑_{g=1}^{G} (α(g) + xi)/(α(g) + β(g) + ni) .
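MCMC is one way to evaluate these posterior expectations; since the hyperposterior here is only two-dimensional, a deterministic grid approximation over (0, 20)² provides an independent cross-check. The Python sketch below (not part of the original WinBUGS analysis) approximates the posterior mean of τ = α/(α + β) under the flat U(0, 20) hyperpriors:

```python
from math import lgamma, exp

x = [20, 4, 11, 10, 5, 36, 9, 7, 4]
n = [20, 10, 16, 19, 14, 46, 10, 9, 6]

def log_marginal(a, b):
    # log of the likelihood part of equation (3)
    lb = lambda p, q: lgamma(p + q) - lgamma(p) - lgamma(q)
    return sum(lb(a, b) - lb(a + xi, b + ni - xi) for xi, ni in zip(x, n))

# Midpoint grid over (alpha, beta) in (0, 20)^2; the flat prior
# makes the posterior proportional to the marginal likelihood
G = 80
grid = [(20 * (i + 0.5) / G, 20 * (j + 0.5) / G)
        for i in range(G) for j in range(G)]
logw = [log_marginal(a, b) for a, b in grid]
m = max(logw)
w = [exp(l - m) for l in logw]    # unnormalized posterior weights
total = sum(w)

# Posterior expectation of tau = alpha/(alpha+beta), the predictive
# mean success rate for a patient in a new (tenth) study
tau = sum(wi * a / (a + b) for wi, (a, b) in zip(w, grid)) / total
print(round(tau, 2))
```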

The Gibbs sampler begins the Markov chain with initial values (“inits”) (π(0), α(0), β(0)), and then succes-

sively samples from the conditional distributions for α, β, and the πi. We usually discard draws from the first

K iterations, the initial transient or “burn-in” period, though choosing reasonable initial values can reduce

the need for this. Typically, multiple Markov chains are started from disparate initial values and checked

to see if they all appear to have the same equilibrium distribution. Modern software packages make MCMC

Figure 2: Posterior summary statistics generated in WinBUGS given the data in Table 1.

Figure 3: Bivariate posterior scatterplot of (α, β) given the data in Table 1. Correlation(α, β) = 0.865.

sampling quick and relatively easy; the popular WinBUGS package also offers several convergence checks and

output summary tools.

Using the WinBUGS code, data, and inits shown in Appendix A, we ran the Gibbs sampler for 30,000

iterations, discarding the first 10,000 as burn-in. Notice that we added a tenth study, with n10 = 10 patients but no observed outcomes. Summary statistics for the posterior distributions of the model parameters are shown in Figure 2. The posterior mean of (α, β) is (9.50, 4.30). Figure 3 shows that α and β, given the data in Table 1, are highly correlated. The expected posterior predictive mean success rate for a patient in a new study, i = 10, is the posterior mean of τ = α/(α + β), or equivalently the posterior mean of π10, which from Figure 2 is approximately 0.69.

The posterior mean success rates (4) for each of the nine studies are also given in Figure 2. These posterior


means represent the probability that the intervention is successful for the next patient in the respective study.

The 0.025 and 0.975 posterior quantiles are also given in Figure 2, permitting ready evaluation of equal-tail

95% Bayesian credible intervals for the πi. Regression (or shrinkage) to the overall mean is observed in the

predictive probabilities for each of the nine studies. More shrinkage occurs for smaller studies, since their

likelihoods are less informative; see e.g. the high shrinkage and relatively wide credible interval for study 2.
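The differing degrees of shrinkage can be illustrated with plug-in arithmetic: evaluating (2) at fixed hyperparameter values near the reported posterior mean (α, β) = (9.50, 4.30). This is a simplification for illustration; the exact calculation averages over the full hyperposterior as in (4).

```python
# Plug-in shrinkage illustration at illustrative hyperparameter values
# near the reported posterior mean (not a full posterior average)
alpha, beta = 9.5, 4.3
overall = alpha / (alpha + beta)      # overall mean, about 0.69

def shrunk(x_i, n_i):
    # posterior mean from equation (2) at fixed (alpha, beta)
    return (alpha + x_i) / (alpha + beta + n_i)

# Study 2 (small, n=10) vs study 6 (large, n=46)
raw2, raw6 = 4 / 10, 36 / 46
s2, s6 = shrunk(4, 10), shrunk(36, 46)

# The small study moves much farther toward the overall mean
print(round(raw2, 2), round(s2, 2))   # 0.4 -> about 0.57
print(round(raw6, 2), round(s6, 2))   # 0.78 -> about 0.76
```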

Figure 4: Posterior density of π10 (the next trial) given the data in Table 1.

Finally, Figure 4 plots the posterior distribution of expected success proportion in the next study, i = 10,

given the results of the previous nine studies, (5), as well as the likelihood function that assumes all nine

studies have a common success probability π. The posterior distribution of π10 clearly has more variability,

and appears more consistent with the success proportions observed in Table 1.


References

Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. New York: Springer-Verlag.

Berry, D.A. (1997). Using a Bayesian approach in medical device development. White paper, Center for

Devices and Radiological Health, U.S. Food and Drug Administration, Rockville, MD.

Berry, D.A. (2006). Bayesian clinical trials. Nature Reviews Drug Discovery, 5, 27–36.

Janicak, P.G., Lipinski, J., Davis, J.M., Comaty, J.E., Waternaux, C., Cohen, B., Altman, E., and Sharma,

R.P. (1988). S-adenosyl-methionine (SAMe) in depression: a literature review and preliminary data

report. Alabama Journal of Medical Sciences, 25, 306–312.

Lindley, D.V., and Smith, A.F.M. (1972). Bayes estimates for the linear model (with discussion). J. Roy.

Statist. Soc., Ser. B, 34, 1–41.

Spiegelhalter, D.J., Abrams, K.R., and Myles, J.P. (2004). Bayesian Approaches to Clinical Trials and

Health-Care Evaluation. Chichester: Wiley.

Spiegelhalter, D.J., Best, N., Carlin, B.P., and van der Linde, A. (2002). Bayesian measures of model

complexity and fit (with discussion). J. Roy. Statist. Soc., Ser. B, 64, 583–639.

Spiegelhalter, D.J., Freedman, L.S., and Parmar, M.K.B. (1994). Bayesian approaches to randomized trials

(with discussion). J. Roy. Statist. Soc., Ser. A, 157, 357–416.


Appendix A: WinBUGS code for the Meta-analysis example

model {
  for (i in 1:10) {
    x[i] ~ dbin(pi[i], n[i])       # binomial likelihood for each study
    pi[i] ~ dbeta(alpha, beta)     # first-stage beta prior on each success probability
  }
  alpha ~ dunif(0, 20)             # flat hyperpriors on the second-stage parameters
  beta ~ dunif(0, 20)
  tau <- alpha / (alpha + beta)    # predictive mean success rate for a new study
}

# Data (study 10 contributes n[10] = 10 patients with x[10] unobserved)
list(x = c(20, 4, 11, 10, 5, 36, 9, 7, 4, NA),
     n = c(20, 10, 16, 19, 14, 46, 10, 9, 6, 10))

# Inits (illustrative starting values)
list(pi = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5), alpha = 1, beta = 1)

10