Introduction to Bayesian Methods

Introduction to Bayesian Methods – p.1/??

Introduction

We develop the Bayesian paradigm for parametric inference. To this end, suppose we conduct (or wish to design) a study in which the parameter θ is of inferential interest. Here θ may be vector valued. For example,

1. θ = difference in treatment means

2. θ = hazard ratio

3. θ = vector of regression coefficients

4. θ = probability a treatment is effective


Introduction

In parametric inference, we specify a parametric model for the data, indexed by the parameter θ. Letting x denote the data, we denote this model (density) by p(x|θ). The likelihood function of θ is any function proportional to p(x|θ), i.e.,

L(θ) ∝ p(x|θ).

Example

Suppose x|θ ∼ Binomial(N, θ). Then

p(x|θ) = (N choose x) θ^x (1 − θ)^(N−x),  x = 0, 1, ..., N.


Introduction

We can take

L(θ) = θ^x (1 − θ)^(N−x).

The parameter θ is unknown. In the Bayesian mind-set, we express our uncertainty about quantities by specifying distributions for them. Thus, we express our uncertainty about θ by specifying a prior distribution for it. We denote the prior density of θ by π(θ). The word "prior" indicates that this is the density of θ before the data x are observed. By Bayes theorem, we can construct the distribution of θ|x, which is called the posterior distribution of θ. We denote the posterior density of θ by p(θ|x).


Introduction

By Bayes theorem,

p(θ|x) = p(x|θ)π(θ) / ∫_Θ p(x|θ)π(θ) dθ

where Θ denotes the parameter space of θ. The quantity

p(x) = ∫_Θ p(x|θ)π(θ) dθ

is the normalizing constant of the posterior distribution. For most inference problems, p(x) does not have a closed form. Bayesian inference about θ is primarily based on the posterior distribution of θ, p(θ|x).


Introduction

For example, one can compute various posterior summaries, such as the mean, median, mode, variance, and quantiles. For example, the posterior mean of θ is given by

E(θ|x) = ∫_Θ θ p(θ|x) dθ.

Example 1 Given θ, suppose x_1, x_2, ..., x_n are i.i.d. Binomial(1, θ), and θ ∼ Beta(α, λ). The parameters of the prior distribution are often called the hyperparameters.

Let us derive the posterior distribution of θ. Let x = (x_1, x_2, ..., x_n), and thus,


Introduction

p(x|θ) = ∏_{i=1}^n p(x_i|θ) = ∏_{i=1}^n θ^(x_i) (1 − θ)^(1−x_i) = θ^(∑x_i) (1 − θ)^(n−∑x_i),

where ∑x_i = ∑_{i=1}^n x_i. Also,

π(θ) = ( Γ(α + λ)/(Γ(α)Γ(λ)) ) θ^(α−1) (1 − θ)^(λ−1).

Now, we can write the kernel of the posterior density as


Introduction

p(θ|x) ∝ θ^(∑x_i) (1 − θ)^(n−∑x_i) θ^(α−1) (1 − θ)^(λ−1) = θ^(∑x_i + α − 1) (1 − θ)^(n − ∑x_i + λ − 1).

Thus p(θ|x) ∝ θ^(∑x_i + α − 1) (1 − θ)^(n − ∑x_i + λ − 1). We can recognize this kernel as a beta kernel with parameters (∑x_i + α, n − ∑x_i + λ). Thus,

θ|x ∼ Beta(∑x_i + α, n − ∑x_i + λ),

and therefore

p(θ|x) = ( Γ(α + n + λ) / (Γ(∑x_i + α) Γ(n − ∑x_i + λ)) ) × θ^(∑x_i + α − 1) (1 − θ)^(n − ∑x_i + λ − 1).
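As a quick numerical sanity check of the conjugate update above, the following sketch (the data vector and hyperparameter values are made-up illustrations, not from the slides) recovers the Beta posterior parameters and compares the closed-form normalizing constant with a brute-force grid integral:

```python
from math import gamma

# Illustrative data and hyperparameters (assumptions for this sketch)
x = [1, 0, 1, 1, 0, 1, 1, 1]      # i.i.d. Binomial(1, θ) draws
alpha, lam = 2.0, 2.0             # θ ∼ Beta(α, λ) prior
n, s = len(x), sum(x)

# Kernel recognition gives θ|x ∼ Beta(Σx_i + α, n − Σx_i + λ)
a_post, b_post = s + alpha, n - s + lam

# Closed-form normalizing constant p(x) from the Gamma-function identity
p_x = (gamma(alpha + lam) / (gamma(alpha) * gamma(lam))
       * gamma(s + alpha) * gamma(n - s + lam) / gamma(alpha + n + lam))

# Brute-force check: integrate L(θ)π(θ) on a fine grid
h = 1e-4
grid = [i * h for i in range(1, 10000)]
Z = sum(t**s * (1 - t)**(n - s)
        * gamma(alpha + lam) / (gamma(alpha) * gamma(lam))
        * t**(alpha - 1) * (1 - t)**(lam - 1) for t in grid) * h

print(a_post, b_post)             # posterior Beta parameters
print(abs(Z - p_x) / p_x < 1e-3)  # numeric p(x) agrees with closed form
```

Recognizing the kernel avoids the grid integral entirely; the numeric check is only there to confirm the algebra.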


Introduction

Remark In deriving posterior densities, an often used technique is to try to recognize the kernel of the posterior density of θ. This avoids direct computation of p(x) and saves a lot of time in derivations. If the kernel cannot be recognized, then p(x) must be computed directly.

In this example we have

p(x) = p(x_1, ..., x_n) ∝ ∫_0^1 θ^(∑x_i + α − 1) (1 − θ)^(n − ∑x_i + λ − 1) dθ = Γ(∑x_i + α) Γ(n − ∑x_i + λ) / Γ(α + n + λ)


Introduction

Thus

p(x_1, ..., x_n) = ( Γ(α + λ)/(Γ(α)Γ(λ)) ) × ( Γ(∑x_i + α) Γ(n − ∑x_i + λ) / Γ(α + n + λ) )

for x_i = 0, 1 and i = 1, ..., n.

Suppose A_1, A_2, ... are events such that A_i ∩ A_j = ∅ for i ≠ j and ⋃_{i=1}^∞ A_i = Ω, where Ω denotes the sample space. Let B denote an event in Ω. Then Bayes theorem for events can be written as

P(A_i|B) = P(B|A_i) P(A_i) / ∑_{j=1}^∞ P(B|A_j) P(A_j)


Introduction

P(A_i) is the prior probability of A_i, and P(A_i|B) is the posterior probability of A_i given that B has occurred.

Example 2 Bayes theorem is often used in diagnostic tests for cancer. A young person was diagnosed as having a type of cancer that occurs extremely rarely in young people. Naturally, he was very upset. A friend told him that it was probably a mistake. His friend reasoned as follows. No medical test is perfect: there are always incidences of false positives and false negatives.


Introduction

Let C stand for the event that he has cancer and let + stand for the event that an individual responds positively to the test. Assume P(C) = 1/1,000,000 = 10^(−6), P(+|C^c) = .01, and P(+|C) = .99. (So only one per million people his age have the disease, and the test is extremely good relative to most medical tests, giving only 1% false positives and 1% false negatives.) Find the probability that he has cancer given that he has a positive response. (After you make this calculation you will not be surprised to learn that he did not have cancer.)

P(C|+) = P(+|C) P(C) / ( P(+|C) P(C) + P(+|C^c) P(C^c) )

= (.99)(10^(−6)) / ( (.99)(10^(−6)) + (.01)(.999999) )


Introduction

P(C|+) = .00000099/.01000098 = .00009899
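The arithmetic of Example 2 can be sketched in a few lines, using exactly the probabilities assumed on the slide:

```python
# Bayes theorem for events, with the slide's assumed test characteristics
p_c = 1e-6          # prior P(C): one per million
p_pos_c = 0.99      # P(+|C), i.e. 1% false negatives
p_pos_nc = 0.01     # P(+|C^c), i.e. 1% false positives

p_c_pos = (p_pos_c * p_c) / (p_pos_c * p_c + p_pos_nc * (1 - p_c))
print(p_c_pos)      # about 9.9e-5: a positive result is almost surely a false positive
```

The tiny prior dominates: even a 99%-accurate test leaves the posterior probability of cancer below one in ten thousand.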

Example 3 Suppose x_1, ..., x_n is a random sample from N(µ, σ^2).

i) Suppose σ^2 is known and µ ∼ N(µ_0, σ_0^2). The posterior density of µ is given by:

p(µ|x) ∝ ( ∏_{i=1}^n p(x_i|µ, σ^2) ) π(µ)

∝ exp{ −(1/(2σ^2)) ∑(x_i − µ)^2 } × exp{ −(1/(2σ_0^2)) (µ − µ_0)^2 }


Introduction

∝ exp{ −(1/2) ((nσ_0^2 + σ^2)/(σ_0^2 σ^2)) [ µ^2 − 2µ (σ_0^2 ∑x_i + µ_0 σ^2)/(nσ_0^2 + σ^2) ] }

∝ exp{ −(1/2) ((nσ_0^2 + σ^2)/(σ_0^2 σ^2)) [ µ − (σ_0^2 ∑x_i + µ_0 σ^2)/(nσ_0^2 + σ^2) ]^2 }

We can recognize this as a normal kernel with mean

µ_post = (σ_0^2 ∑x_i + µ_0 σ^2)/(nσ_0^2 + σ^2)

and variance

σ_post^2 = ( (nσ_0^2 + σ^2)/(σ_0^2 σ^2) )^(−1) = σ_0^2 σ^2/(nσ_0^2 + σ^2).

Thus

µ|x ∼ N( (σ_0^2 ∑x_i + µ_0 σ^2)/(nσ_0^2 + σ^2), σ_0^2 σ^2/(nσ_0^2 + σ^2) ).
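A minimal sketch of this known-variance update; the sample and prior values below are illustrative assumptions, not from the slides:

```python
# Example 3 i): normal mean update with σ² known (illustrative values)
x = [4.8, 5.1, 5.6, 4.9, 5.3]    # sample from N(µ, σ²)
sigma2 = 1.0                      # known σ²
mu0, sigma02 = 0.0, 10.0          # prior µ ∼ N(µ₀, σ₀²)
n, sx = len(x), sum(x)

# Posterior mean and variance from completing the square
mu_post = (sigma02 * sx + mu0 * sigma2) / (n * sigma02 + sigma2)
var_post = sigma02 * sigma2 / (n * sigma02 + sigma2)
print(mu_post, var_post)
```

Note that µ_post is a precision-weighted average of the sample information ∑x_i and the prior mean µ_0: with a diffuse prior (large σ_0^2) it approaches the sample mean, and the posterior variance shrinks like σ^2/n as n grows.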


Introduction

ii) Suppose µ is known and σ^2 is unknown. Let τ = 1/σ^2. τ is often called the precision parameter. Suppose τ ∼ gamma(δ_0/2, γ_0/2). Thus

π(τ) ∝ τ^(δ_0/2 − 1) exp(−τγ_0/2)

Let us derive the posterior distribution of τ.

p(τ|x) ∝ τ^(n/2) exp{ −(τ/2) ∑(x_i − µ)^2 } τ^(δ_0/2 − 1) exp(−τγ_0/2)

p(τ|x) ∝ τ^((n+δ_0)/2 − 1) exp{ −(τ/2)( γ_0 + ∑(x_i − µ)^2 ) }


Introduction

Thus

τ|x ∼ gamma( (n + δ_0)/2, (γ_0 + ∑(x_i − µ)^2)/2 ).
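The precision update can be sketched numerically as well; the sample and hyperparameters below are illustrative assumptions:

```python
# Example 3 ii): precision update with µ known (illustrative values)
x = [4.8, 5.1, 5.6, 4.9, 5.3]
mu = 5.0                          # known mean
delta0, gamma0 = 2.0, 2.0         # τ ∼ gamma(δ₀/2, γ₀/2)
n = len(x)
ss = sum((xi - mu) ** 2 for xi in x)

# τ|x ∼ gamma((n + δ₀)/2, (γ₀ + Σ(x_i − µ)²)/2)
shape_post = (n + delta0) / 2
rate_post = (gamma0 + ss) / 2
print(shape_post, rate_post)      # posterior mean of τ is shape/rate
```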

iii) Now suppose µ and σ^2 are both unknown. Suppose we specify the joint prior

π(µ, τ) = π(µ|τ) π(τ),

where

µ|τ ∼ N(µ_0, τ^(−1) σ_0^2),  τ ∼ gamma(δ_0/2, γ_0/2).


Introduction

The joint posterior density of (µ, τ) is given by

p(µ, τ|x) ∝ ( τ^(n/2) exp{ −(τ/2) ∑(x_i − µ)^2 } ) × ( τ^(1/2) exp{ −(τ/(2σ_0^2)) (µ − µ_0)^2 } ) × ( τ^(δ_0/2 − 1) exp{ −τγ_0/2 } )

= τ^((n+δ_0+1)/2 − 1) exp{ −(τ/2) ( γ_0 + (µ − µ_0)^2/σ_0^2 + ∑(x_i − µ)^2 ) }

The joint posterior does not have a clearly recognizable form. Thus, we need to compute p(x) by brute force.


Introduction

p(x) ∝ ∫_0^∞ ∫_{−∞}^∞ τ^((n+δ_0+1)/2 − 1) exp{ −(τ/2) [ γ_0 + (µ − µ_0)^2/σ_0^2 + ∑(x_i − µ)^2 ] } dµ dτ

= ∫_0^∞ ∫_{−∞}^∞ τ^((n+δ_0+1)/2 − 1) × exp{ −(τ/2) [ γ_0 + µ^2(n + 1/σ_0^2) − 2µ(∑x_i + µ_0/σ_0^2) + (µ_0^2/σ_0^2 + ∑x_i^2) ] } dµ dτ

= ∫_0^∞ τ^((n+δ_0+1)/2 − 1) exp{ −(τ/2)( γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 ) } × ( ∫_{−∞}^∞ exp{ −(τ/2) [ µ^2(n + 1/σ_0^2) − 2µ(∑x_i + µ_0/σ_0^2) ] } dµ ) dτ


Introduction

The integral with respect to µ can be evaluated by completing the square.

∫_{−∞}^∞ exp{ −(τ(n + σ_0^(−2))/2) [ µ − (∑x_i + µ_0 σ_0^(−2))/(n + σ_0^(−2)) ]^2 } dµ × exp{ τ(∑x_i + µ_0 σ_0^(−2))^2 / (2(n + σ_0^(−2))) }

= exp{ τ(∑x_i + µ_0 σ_0^(−2))^2 / (2(n + σ_0^(−2))) } × (2π)^(1/2) τ^(−1/2) (n + σ_0^(−2))^(−1/2)


Introduction

Now we need to evaluate

∫_0^∞ (2π)^(1/2) (n + σ_0^(−2))^(−1/2) τ^(−1/2) τ^((n+δ_0+1)/2 − 1) × exp{ −(τ/2) [ γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 ] } × exp{ (τ/2) (∑x_i + µ_0/σ_0^2)^2/(n + 1/σ_0^2) } dτ

= (2π)^(1/2) (n + σ_0^(−2))^(−1/2) ∫_0^∞ τ^((n+δ_0)/2 − 1) × exp{ −(τ/2) [ γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 − (∑x_i + µ_0/σ_0^2)^2/(n + 1/σ_0^2) ] } dτ


Introduction

= (2π)^(1/2) Γ((n + δ_0)/2) (n + 1/σ_0^2)^(−1/2) / [ (1/2)( γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 − (∑x_i + µ_0/σ_0^2)^2/(n + 1/σ_0^2) ) ]^((n+δ_0)/2)

= (2π)^(1/2) Γ((n + δ_0)/2) 2^((n+δ_0)/2) (n + 1/σ_0^2)^(−1/2) / [ γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 − (∑x_i + µ_0/σ_0^2)^2/(n + 1/σ_0^2) ]^((n+δ_0)/2)

≡ p*(x)

Thus,

p(x) = ( (2π)^(−(n+1)/2) σ_0^(−1) (γ_0/2)^(δ_0/2) / Γ(δ_0/2) ) p*(x)


Introduction

The joint posterior density of (µ, τ) can also be obtained in this case by deriving p(µ, τ|x) = p(µ|τ, x) p(τ|x).

Exercise: Find p(µ|τ, x) and p(τ|x).

It is of great interest to find the marginal posterior distributions of µ and τ.

p(µ|x) = ∫_0^∞ p(µ, τ|x) dτ

∝ ∫_0^∞ τ^((n+δ_0+1)/2 − 1) exp{ −(τ/2) [ γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 ] } × exp{ −(τ/2) [ µ^2(n + 1/σ_0^2) − 2µ(∑x_i + µ_0/σ_0^2) ] } dτ


Introduction

= ∫_0^∞ τ^((n+δ_0+1)/2 − 1) exp{ −(τ/2) [ γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 ] } × exp{ −(τ(n + 1/σ_0^2)/2) ( µ − (∑x_i + µ_0/σ_0^2)/(n + 1/σ_0^2) )^2 } × exp{ (τ/2) (∑x_i + µ_0/σ_0^2)^2/(n + 1/σ_0^2) } dτ

Let a = (∑x_i + µ_0/σ_0^2)/(n + 1/σ_0^2). Then, we can write the integral as


Introduction

= ∫_0^∞ τ^((n+δ_0+1)/2 − 1) × exp{ −(τ/2) [ γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 + (n + 1/σ_0^2)(µ − a)^2 − (n + 1/σ_0^2)a^2 ] } dτ

= Γ((n + δ_0 + 1)/2) 2^((n+δ_0+1)/2) / [ γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 + (n + 1/σ_0^2)(µ − a)^2 − (n + 1/σ_0^2)a^2 ]^((n+δ_0+1)/2)

∝ [ 1 + c(µ − a)^2/(b − ca^2) ]^(−(n+δ_0+1)/2)

where c = n + 1/σ_0^2 and b = γ_0 + µ_0^2/σ_0^2 + ∑x_i^2. We recognize this kernel as that of a t-distribution with location parameter a, dispersion parameter ( (n + δ_0)c/(b − ca^2) )^(−1), and n + δ_0 degrees of freedom.


Introduction

Definition Let y = (y_1, ..., y_p)′ be a p × 1 random vector. Then y is said to have a p-dimensional multivariate t distribution with d degrees of freedom, location parameter m, and dispersion matrix Σ_{p×p} if y has density

p(y) = ( Γ((d + p)/2) (πd)^(−p/2) |Σ|^(−1/2) / Γ(d/2) ) × [ 1 + (1/d)(y − m)′ Σ^(−1) (y − m) ]^(−(d+p)/2)

We write this as y ∼ S_p(d, m, Σ). In our problem, p = 1, d = n + δ_0, m = a, Σ^(−1) = (n + δ_0)c/(b − ca^2), and Σ = ( (n + δ_0)c/(b − ca^2) )^(−1).


Introduction

The marginal posterior distribution of τ is given by

p(τ|x) ∝ ∫_{−∞}^∞ τ^((n+δ_0+1)/2 − 1) × exp{ −(τ/2) [ γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 ] } × exp{ (τ/2)(n + 1/σ_0^2)a^2 } × exp{ −(τ(n + 1/σ_0^2)/2)(µ − a)^2 } dµ

∝ τ^((n+δ_0+1)/2 − 1) τ^(−1/2) exp{ −(τ/2) [ γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 − (n + 1/σ_0^2)a^2 ] }

= τ^((n+δ_0)/2 − 1) exp{ −(τ/2) [ γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 − (n + 1/σ_0^2)a^2 ] }

Thus,

τ|x ∼ gamma( (n + δ_0)/2, (1/2)( γ_0 + µ_0^2/σ_0^2 + ∑x_i^2 − (n + 1/σ_0^2)a^2 ) ).


Introduction

Remark A t distribution can be obtained as a scale mixture of normals. That is, if x|τ ∼ N_p(m, τ^(−1)Σ) and τ ∼ gamma(δ_0/2, γ_0/2), then

p(x) = ∫_0^∞ p(x|τ) π(τ) dτ

is the S_p(δ_0, m, (γ_0/δ_0)Σ) density. That is, x ∼ S_p(δ_0, m, (γ_0/δ_0)Σ).

Note:

p(x|τ) = (2π)^(−p/2) τ^(p/2) |Σ|^(−1/2) × exp{ −(τ/2)(x − m)′ Σ^(−1) (x − m) }.
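The scale-mixture remark can be checked by Monte Carlo for p = 1 and Σ = 1; the hyperparameter values and sample size below are illustrative assumptions. A t with δ_0 degrees of freedom and dispersion γ_0/δ_0 has variance (γ_0/δ_0)·δ_0/(δ_0 − 2) = γ_0/(δ_0 − 2):

```python
import random
from math import sqrt

random.seed(1)
delta0, gamma0, m = 10.0, 10.0, 0.0   # illustrative values

draws = []
for _ in range(200_000):
    # τ ∼ gamma(δ₀/2, γ₀/2): shape δ₀/2, scale 2/γ₀ (gammavariate uses scale)
    tau = random.gammavariate(delta0 / 2, 2 / gamma0)
    # x|τ ∼ N(m, τ⁻¹)
    draws.append(random.gauss(m, 1 / sqrt(tau)))

# Marginally x ∼ S₁(δ₀, m, γ₀/δ₀); its variance is γ₀/(δ₀ − 2) = 1.25 here
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(round(var, 2))
```

The sample variance of the mixture draws should land near 1.25, the variance of the claimed marginal t density.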


Introduction

Remark Note that in Examples 1 and 3 i), ii), the posterior distribution is of the same family as the prior distribution. When the posterior distribution of a parameter is of the same family as the prior distribution, such prior distributions are called conjugate prior distributions.

In Example 1, a Beta prior on θ led to a Beta posterior for θ. In Example 3 i), a normal prior for µ yielded a normal posterior for µ. In Example 3 ii), a gamma prior for τ yielded a gamma posterior for τ. More on conjugate priors later.


Advantages of Bayesian Methods

1. Interpretation Having a distribution for your unknown parameter θ is easier to understand than a point estimate and a standard error. In addition, we consider the following example of a confidence interval. A 95% confidence interval for a population mean θ can be written as

x̄ ± (1.96) s/√n.

After the data are observed, θ either falls in the realized interval (a, b) or it does not; thus P(a < θ < b) ≠ 0.95.


Advantages of Bayesian Methods

1. Interpretation We have to rely on a repeated-sampling interpretation to make a probability statement as above. Thus, after observing the data, we cannot say that the true θ has a 95% chance of falling in

x̄ ± (1.96) s/√n,

although we are tempted to say this.


Advantages of Bayesian Methods

2. Bayes Inference Obeys the Likelihood Principle

The likelihood principle: if two distinct sampling plans (designs) yield proportional likelihood functions for θ, then inference about θ should be identical under these two designs. Frequentist inference does not obey the likelihood principle, in general.

Example Suppose in 12 independent tosses of a coin, 9 heads and 3 tails are observed. I wish to test the null hypothesis H_0 : θ = 1/2 vs. H_1 : θ > 1/2, where θ is the true probability of heads.


Advantages of Bayesian Methods

Consider the following two choices for the likelihood function:

a) Binomial: n = 12 (fixed), x = number of heads. x ∼ Binomial(12, θ) and the likelihood is

L_1(θ) = (n choose x) θ^x (1 − θ)^(n−x) = (12 choose 9) θ^9 (1 − θ)^3

b) Negative Binomial: n is not fixed; flip until the third tail appears. Here x is the number of heads observed before the third tail, and x ∼ NegBinomial(r = 3, θ).


Advantages of Bayesian Methods

L_2(θ) = (r + x − 1 choose x) θ^x (1 − θ)^r = (11 choose 9) θ^9 (1 − θ)^3

Note that L_1(θ) ∝ L_2(θ). From a Bayesian perspective, the posterior distribution of θ is the same under either design. That is,

p(θ|x) = L_1(θ)π(θ) / ∫ L_1(θ)π(θ) dθ ≡ L_2(θ)π(θ) / ∫ L_2(θ)π(θ) dθ


Advantages of Bayesian Methods

However, under the frequentist paradigm, inferences about θ are quite different under the two designs. The p-value based on the binomial likelihood is

P(x ≥ 9 | θ = 1/2) = ∑_{j=9}^{12} (12 choose j) θ^j (1 − θ)^(12−j) = 299/4096 ≈ 0.073,

while for the negative binomial likelihood, the p-value is

P(x ≥ 9 | θ = 1/2) = ∑_{j=9}^∞ (2 + j choose j) θ^j (1 − θ)^3 ≈ 0.0327.

The two designs lead to different decisions at the 0.05 level: we reject H_0 under the negative binomial design but not under the binomial design.
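The two tail probabilities in this example are easy to verify numerically; the infinite negative binomial tail is computed via its finite complement:

```python
from math import comb

theta = 0.5

# a) Binomial(12, θ): P(x ≥ 9)
p_binom = sum(comb(12, j) * theta**j * (1 - theta)**(12 - j)
              for j in range(9, 13))

# b) NegBinomial(r = 3, θ): P(x ≥ 9) = 1 − P(x ≤ 8),
#    where x counts heads observed before the third tail
p_negbin = 1 - sum(comb(2 + j, j) * theta**j * (1 - theta)**3
                   for j in range(0, 9))

print(round(p_binom, 4), round(p_negbin, 4))  # ≈ 0.073 and ≈ 0.0327
```

Same data, proportional likelihoods, yet the two p-values straddle 0.05; this is the conflict with the likelihood principle.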


Advantages of Bayesian Methods

3. Bayesian Inference Does Not Lead to Absurd Results

Absurd results can be obtained when doing UMVUE estimation. Suppose x ∼ Poisson(λ), and we want to estimate θ = e^(−2λ), 0 < θ < 1. It can be shown that the UMVUE of θ is (−1)^x. Thus, if x is even the UMVUE of θ is 1, and if x is odd the UMVUE of θ is −1!!
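The estimator really is unbiased, which is what makes the example striking: summing the Poisson series gives E[(−1)^x] = e^(−λ) ∑ (−λ)^x/x! = e^(−2λ). A short check with an illustrative rate:

```python
from math import exp, factorial

lam = 1.7   # illustrative Poisson rate (an assumption for this sketch)

# E[(−1)^x] = Σ_x (−1)^x e^(−λ) λ^x / x!, truncated far past convergence
expectation = sum((-1)**x * exp(-lam) * lam**x / factorial(x)
                  for x in range(60))
print(abs(expectation - exp(-2 * lam)) < 1e-12)
```

Unbiasedness holds exactly, yet every realized estimate is ±1 for a parameter known to lie in (0, 1).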


Advantages of Bayesian Methods

4. Bayes Theorem Is a Formula for Learning

Suppose you conduct an experiment and collect observations x_1, ..., x_n. Then

p(θ|x) = p(x|θ)π(θ) / ∫_Θ p(x|θ)π(θ) dθ

where x = (x_1, ..., x_n). Suppose you collect an additional observation x_{n+1} in a new study. Then

p(θ|x, x_{n+1}) = p(x_{n+1}|θ) p(θ|x) / ∫_Θ p(x_{n+1}|θ) p(θ|x) dθ

So your prior in the new study is the posterior from the previous study.
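This sequential-learning property is concrete in the Beta–Bernoulli model of Example 1: updating one observation at a time, with each posterior serving as the next prior, gives exactly the same answer as one batch update. The prior and data values below are illustrative assumptions:

```python
alpha, lam = 1.0, 1.0             # illustrative Beta(α, λ) prior
data = [1, 1, 0, 1, 0, 1, 1]      # illustrative Bernoulli observations

# Sequential route: yesterday's posterior is today's prior
a, b = alpha, lam
for xi in data:
    a, b = a + xi, b + (1 - xi)

# Batch route: update on the whole sample at once
a_batch = alpha + sum(data)
b_batch = lam + len(data) - sum(data)
print((a, b) == (a_batch, b_batch))   # the two routes agree exactly
```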


Advantages of Bayesian Methods

5. Bayes Inference Does Not Require Large-Sample Theory

With modern computing advances, "exact" calculations can be carried out using Markov chain Monte Carlo (MCMC) methods. Bayes methods do not require asymptotics for valid inference. Thus, small-sample Bayesian inference proceeds in the same way as if one had a large sample.


Advantages of Bayesian Methods

6. Bayes Inference Often Has Frequentist Inference as a Special Case

Often one can obtain frequentist answers by choosing a uniform prior for the parameters, i.e., π(θ) ∝ 1, so that

p(θ|x) ∝ L(θ).

In such cases, frequentist answers can be obtained from such a posterior distribution.
