
Stat 460, Chapter 16: Bayesian estimation and inference

Spring 2020

A Bayesian is one who, vaguely expecting to see a horse and catching a glimpse of a donkey, strongly concludes he has seen a mule. (Senn, 1997)

Motivation

The goal of an email filter is to detect spam and route it directly to the trash or to a spam folder. Many email filters use Bayesian procedures. These filters work by parsing emails and finding words that are typically associated with spam, such as “Viagra.” Such words are very rarely found in non-spam email (“ham”). Most filters need some initial training from the owner of the email account.

1. The account holder clicks “Spam” on all emails they perceive as such.
2. The email filter parses all the owner-specified emails, looking for words these emails have in common.
3. The filter comes up with a probability that an email is spam, given the presence of a given word.
4. Future emails are classified as “spam” if the probability that the email is spam given the presence of a word is above some threshold.

However, the filter is usually programmed with the following prior information: any given email is more likely to be spam than to be ham. Any given email account is flooded with spam each day, such that about 80% of emails sent to an account are spam. How can this prior information be taken into consideration?


Bayes’ Theorem and reversal of conditioning

Email filters want to estimate the probability that an email is spam given the user-specified information. This probability depends on the proportion of user-specified “spam” emails that contain the word; the proportion of “ham” emails that contain the word; and the prior information about the prevalence of spam.

These ingredients come together by way of Bayes’ Theorem:

P(S | W) = P(W | S) P(S) / [P(W | S) P(S) + P(W | H) P(H)] = P(W | S) P(S) / P(W)

Note that the denominator, by the law of total probability, is just P(W), i.e., the proportion of all emails that contain the keyword.
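As a quick numerical illustration (the proportions below are hypothetical, chosen only for illustration, not taken from any real filter), suppose 40% of spam emails contain a given keyword, only 1% of ham emails do, and 80% of all email is spam:

p_w_given_s <- 0.40   # P(W | S): keyword appears in spam (hypothetical)
p_w_given_h <- 0.01   # P(W | H): keyword appears in ham (hypothetical)
p_s <- 0.80           # P(S): prior probability an email is spam
p_h <- 1 - p_s        # P(H)

# Bayes' Theorem: P(S | W) = P(W | S) P(S) / [P(W | S) P(S) + P(W | H) P(H)]
p_s_given_w <- (p_w_given_s * p_s) / (p_w_given_s * p_s + p_w_given_h * p_h)
p_s_given_w           # about 0.994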

Simplified setting

Consider a simplified setting, where a filter wants to estimate θ, the probability that any email is spam, based solely on the email flow into an account and which emails the user marks as “spam.” However, the filter still wants to take into consideration the prior information that the probability any given email is spam is quite high, around 80%.

Bayesian estimation of θ involves the following:

1. Specification of the likelihood: f_Y(y | θ), or P(Y = y | θ) for discrete data;
2. A prior distribution for θ, π(θ), that incorporates any prior information about θ.

Given these ingredients, we can form the posterior distribution of θ, or:

p(θ | y) = f_Y(y | θ) π(θ) / ∫_θ f_Y(y | θ) π(θ) dθ


If θ is assumed to be discrete, the integral in the denominator is replaced by a summation.

Example 1: Bayesian Estimate of the Probability of Success (θ)

1. Let Y_1, Y_2, ..., Y_n represent the data, where Y_i = 1 if the user marks email i as spam, and Y_i = 0 otherwise. Recall that θ is the probability that any email is spam. What is the likelihood, P(Y = y | θ)?

2. Recall that we want to incorporate prior information about θ. What would be a reasonable prior distribution for θ?

3. Find the posterior distribution, p(θ | y).


4. Find E(θ | y). This is called the Bayes’ estimate of θ.

5. Recall that we believe a priori that θ is around 0.8. What would be good values of (α, β) to use for the prior? Find the Bayes’ estimate of θ using this prior, if 10 emails are received and 5 are marked as spam. How does this compare to the MLE of θ?


6. Given 10 emails are received, and half are marked as spam, find the 2.5th and 97.5th percentiles of the posterior distribution. This is referred to as a 95% credible interval.

7. Suppose we want to test H_0: θ ≥ 0.75 vs H_a: θ < 0.75. A Bayesian test of these hypotheses finds P(θ ≥ 0.75 | y) and P(θ < 0.75 | y) using the posterior distribution. Given that 10 emails are received, and half are marked as spam, find the two probabilities. What is our decision? (An R sketch covering items 5–7 follows below.)
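As a numerical check on items 5–7, here is a minimal R sketch. It assumes the BETA(8,2) prior used in the R code below, so treat it as one possible answer rather than the only reasonable one:

# Beta(8,2) prior, n = 10 emails, y = 5 marked as spam  =>  posterior is Beta(13, 7)
alpha <- 8; beta <- 2
n <- 10; y <- 5
post.alpha <- alpha + y        # 13
post.beta  <- beta + n - y     # 7

(alpha + y) / (alpha + beta + n)                 # Bayes' estimate E(theta | y) = 13/20 = 0.65
y / n                                            # MLE of theta = 0.5
qbeta(c(0.025, 0.975), post.alpha, post.beta)    # 95% credible interval
1 - pbeta(0.75, post.alpha, post.beta)           # P(theta >= 0.75 | y)
pbeta(0.75, post.alpha, post.beta)               # P(theta <  0.75 | y)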


Using R

Posterior distribution of θ, with 50% of emails marked as spam, for n=10, and using a BETA(8,2) prior:

calc.lik <- function(theta,n,sum.y) {
  likelihood <- (theta^sum.y)*(1-theta)^(n-sum.y)
  return(likelihood)
}
calc.prior <- function(theta) {
  dbeta(theta, shape1=8, shape2=2)
}
calc.posterior <- function(theta,n,sum.y) {
  dbeta(theta, shape1 = sum.y+8, shape2=(n-sum.y)+2)
}

theta.seq <- seq(0,1,l=100)
lik.n10 <- calc.lik(theta.seq,10,5)
prior <- calc.prior(theta.seq)
post.n10 <- calc.posterior(theta.seq,10,5)

plot(theta.seq,lik.n10,ylab='',main=expression(L(theta)),xlab=expression(theta), type='l')
plot(theta.seq,prior,col='red',lwd=2,ylab='',main='Prior + Posterior', xlab=expression(theta),type='l',ylim=c(0,10))
lines(theta.seq,post.n10,col='blue',lwd=2,lty=2)
legend('topleft',c('Prior','Posterior'), col=c('red','blue'),lty=1:2,cex=0.8)


Now for n=100:

lik.n100 <- calc.lik(theta.seq,100,50)
prior <- calc.prior(theta.seq)
post.n100 <- calc.posterior(theta.seq,100,50)

plot(theta.seq,lik.n100,ylab='',main=expression(L(theta)),xlab=expression(theta),type='l')
plot(theta.seq,prior,col='red',lwd=2,ylab='',main='Prior + Posterior', xlab=expression(theta),type='l',ylim=c(0,10))
lines(theta.seq,post.n100,col='blue',lwd=2,lty=2)
legend('topleft',c('Prior','Posterior'), col=c('red','blue'),lty=1:2,cex=0.8)


Posterior, prior, and likelihood

Bayes’ theorem establishes the following relationship between these three quantities:

Posterior ∝ Likelihood × Prior,

where the constant of proportionality is the reciprocal of the marginal distribution of the data, m(y) = ∫_θ f_Y(y | θ) π(θ) dθ. Dividing by m(y) is what is required to make the posterior integrate to 1.
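A minimal numerical sketch of this normalization, using the Beta–Bernoulli setup from Example 1 with the BETA(8,2) prior and n = 10, y = 5 (so the exact posterior is BETA(13, 7)):

theta <- seq(0.001, 0.999, length.out = 999)
unnorm <- dbinom(5, 10, theta) * dbeta(theta, 8, 2)   # likelihood x prior (not a density)
m.y <- sum(unnorm) * diff(theta)[1]                   # grid approximation to m(y)
post.grid <- unnorm / m.y                             # normalized so it integrates to (about) 1
max(abs(post.grid - dbeta(theta, 13, 7)))             # should be very small (grid error only)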


Bayesian versus Frequentist

The Bayesian framework for estimation is very different from the frequentist framework, which is where we have been all semester.

In the Frequentist setting:

1. The parameter θ is fixed.
2. Estimators depend only on the data.
3. P-values measure the probability of observing a test statistic as extreme as or more extreme than the observed sample's test statistic, under the value of θ specified by H_0.
4. We compute 95% confidence intervals, which refer to the belief that such intervals calculated repeatedly will contain the true θ approximately 95% of the time.

In the Bayesian setting:

1. The parameter θ is random.
2. Estimators depend both on the data and on the prior information about θ.
3. We can measure the probability that the null and the alternative are true; i.e., we measure P(θ ∈ Ω_0) and P(θ ∈ Ω_a), and choose H_0 or H_a depending on which probability is larger.
4. We compute 95% credible intervals, which refer to the probability that the random variable θ is inside the interval.


Conjugate distributions

The example we considered earlier used a BETA(α, β) distribution as a prior for θ in a BERN(θ) likelihood. The resulting posterior was again Beta. When the posterior and the prior belong to the same family, the two distributions are called conjugate distributions, and the prior is called a conjugate prior.
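Concretely, recapping the Beta–Bernoulli case from Example 1 (with Σy_i successes in n trials), the likelihood × prior product is

p(θ | y) ∝ θ^(Σy_i) (1 − θ)^(n − Σy_i) × θ^(α − 1) (1 − θ)^(β − 1) = θ^(α + Σy_i − 1) (1 − θ)^(β + n − Σy_i − 1),

which is the kernel of a BETA(α + Σy_i, β + n − Σy_i) distribution, i.e., the same family as the prior.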

Example 2: Mean Rate of Arrivals

Consider the problem of estimating the rate of arrivals at a drive-up window. We have data on the number of arrivals over several 1-hour time spans, and we know the Poisson distribution is appropriate for modeling counts of arrivals. Accordingly, let Y_1, ..., Y_n ∼ POI(θ) represent the data, and suppose we want to use a Bayesian analysis to estimate θ. Based on other drive-up windows in the company, we suspect θ to be around 10.

1. Show that the GAM(α, β) distribution is a conjugate prior for this problem.

2. Find θ̂_BAYES.


3. Describe how you would find a 95% Bayesian credible interval.

4. What parameters α and β would you use to incorporate the prior information that θ is around 10, if you are very confident in this information? Not very confident in this information? (See the sketch following this list.)
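One way to explore item 4 numerically, assuming the rate parameterization of the gamma (prior mean α/β, prior variance α/β²); the two (α, β) pairs below are illustrative choices with the same mean of 10, not the only reasonable answers. Under this parameterization, the Poisson likelihood combines with the GAM(α, β) prior to give a GAM(α + Σy_i, β + n) posterior, which you can use to check items 1–3.

# Two illustrative priors with the same mean, alpha/beta = 10, but different spread
theta <- seq(0, 25, length.out = 500)
confident <- dgamma(theta, shape = 100, rate = 10)   # mean 10, sd = sqrt(100)/10 = 1
vague     <- dgamma(theta, shape = 2,   rate = 0.2)  # mean 10, sd = sqrt(2)/0.2, about 7.1
plot(theta, confident, type = 'l', lwd = 2, xlab = expression(theta), ylab = 'prior density')
lines(theta, vague, lwd = 2, lty = 2)
legend('topright', c('GAM(100, 10)', 'GAM(2, 0.2)'), lty = 1:2)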


Example 3: Estimating the Mean of a Normal Population

Bone fractures are a common medical problem as we grow older, so there is much interest in determining risk factors. One assessment tool is bone mineral density (BMD), a measure of the amount of certain minerals such as calcium in the bone (in g/cm^2). The lower the BMD, the higher the risk for bone fractures.

Suppose BMD measurements Y_1, ..., Y_n in female vegetarians are approximately normal with unknown mean θ but known σ. Suppose a researcher wants to estimate this θ. After doing a literature review, he believes θ might be around 0.72 g/cm^2, but he’s not sure, so he assigns some uncertainty to this prior belief by assuming θ ∼ N(0.72, 0.08^2).

1. Show that the N(μ, τ^2) distribution is a conjugate prior for estimating θ, the mean of a normal likelihood.


2. Find θ̂_Bayes.


3. Suppose Ȳ = 0.85 g/cm^2 and n = 20. Using this information and the prior information assumed by the researcher, find a 95% credible interval for θ. (A numerical sketch follows below.)
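A sketch of the calculation for item 3 using the standard normal–normal conjugate updating formulas. Note the problem statement does not give the known population σ, so the value below is only a placeholder to be replaced with whatever σ accompanies the data:

mu0 <- 0.72; tau <- 0.08       # prior: theta ~ N(0.72, 0.08^2)
ybar <- 0.85; n <- 20
sigma <- 0.10                  # PLACEHOLDER for the known population SD (not given in the text)

# Posterior precision is the sum of the prior precision and the data precision
post.var  <- 1 / (1 / tau^2 + n / sigma^2)
post.mean <- post.var * (mu0 / tau^2 + n * ybar / sigma^2)
post.mean + c(-1, 1) * qnorm(0.975) * sqrt(post.var)   # 95% credible interval for theta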


Bayesian computation methods

Conjugate priors that yield analytically well-behaved posteriors are great. However, it quickly becomes intractable to derive the posterior distribution analytically, especially for multivariate densities when we are interested in simultaneously estimating multiple parameters. In these situations, computational methods are required to characterize the posterior distribution (e.g., its posterior mean and quantiles).

Example

We will consider the following example. Suppose it is the summer of 2016, and pollsters are interested in predicting the Trump vote in Hennepin County (which contains Minneapolis). A sample of 50 Hennepin County voters was polled about who they were likely to vote for. Suppose that of these 50, 14 (28%) said they supported Trump. Hennepin County leans Democratic: in the 2012 election, only 35.4% voted for Mitt Romney, the GOP candidate for president.

Here, it follows that with Y ≡ the number of people out of the n sampled Hennepin County voters who support Trump, Y ∼ BIN(n, θ), with θ ≡ the true proportion of Hennepin County voters who support Trump. We have already seen that the Beta distribution is a conjugate prior for θ, and that, specifically, using a θ ∼ BETA(α, β) prior the posterior is again Beta:

(θ | Y = y) ∼ BETA(y + α, n − y + β).

We might think of specifying a prior for θ that puts weight at 0.354, the proportion of voters who went for Romney in the 2012 election. One possible prior distribution is θ ∼ BETA(3.54, 6.46), which has mean equal to 0.354. It would then follow that the posterior is:

(θ | Y = 14) ∼ BETA(14 + 3.54, 50 − 14 + 6.46) ≡ BETA(17.54, 42.46).

This would yield a posterior mean of E(θ | Y = 14) = 17.54/(17.54 + 42.46) = 0.292, and we would find the 95% credible interval by finding the 2.5th and 97.5th percentiles of the posterior, which are (0.185, 0.412).
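These posterior quantities are easy to verify directly in R:

17.54 / (17.54 + 42.46)                  # posterior mean, 0.292
qbeta(c(0.025, 0.975), 17.54, 42.46)     # 95% credible interval, approximately (0.185, 0.412)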

However, what if we could not find the analytic form of the posterior? For example, what if we wanted to specify a normal prior for θ? In this case, the posterior has no closed form. There are many ways to sample from the posterior but we will consider two common approaches here:

1. Rejection sampling
2. MCMC sampling via the Metropolis algorithm

These methods are based on the idea that we can get away without knowing the true form of p(θ | y); all we need to do is sample from q(θ | y), where typically:

q(θ | y) = f(y | θ) π(θ) ∝ f(y | θ) π(θ) / ∫ f(y | θ) π(θ) dθ = p(θ | y)


The resulting q(θ | y) will not be mathematically equivalent to the posterior, but samples drawn in proportion to it have the same mean and quantiles. The reason we might want to get away with sampling from q and not p is this: it is easy to multiply the likelihood and the prior. It is hard (often impossible) to find the normalizing constant.

It turns out I can get away with sampling from q to approximate posterior properties of θ. For example, does it matter if we sample exactly from the BETA posterior, with density

p(θ | y) = [Γ(α + β + n) / (Γ(α + y) Γ(β + n − y))] θ^(α + y − 1) (1 − θ)^(β + n − y − 1),

or can we get away with sampling in proportion to

q(θ | y) = f(y | θ) π(θ) = (n choose y) θ^y (1 − θ)^(n − y) × [Γ(α + β) / (Γ(α) Γ(β))] θ^(α − 1) (1 − θ)^(β − 1)?

The “normalizing constant” ∫ f(y | θ) π(θ) dθ that would be required to make q a density is irrelevant for simulating means and quantiles.

Visually, does it matter from which of these I simulate (note the only difference is the y-axis):

library(ggplot2)

theta <- seq(0,1,l=100)
p.theta <- dbeta(theta,14+3.54, 50-14+6.46) # normalized posterior
q.theta <- dbinom(14,50,theta)*dbeta(theta,3.54,6.46) # likelihood times prior (not normalized)
df <- data.frame(theta,p.theta,q.theta)
p <- ggplot(data = df, aes(x = theta)) + xlab(expression(theta))
p + geom_line(aes(y = p.theta),col='tan',size = 2) + ylab(expression(p(theta)))
p + geom_line(aes(y = q.theta),col='lightblue',size = 2) + ylab(expression(q(theta)))


Rejection sampling

The rejection sampling algorithm samples from a “proposal” distribution g(θ) defined over the entire support of θ. The restriction on g(θ) is that there must be some M such that Mg(θ) ≥ q(θ | y) for all θ. We then accept the proposed value with probability q(θ | y)/(Mg(θ)). The algorithm is specifically defined as follows.

1. Simulate a “proposal” of θ from some distribution g(θ) (e.g., uniform). Let θ* represent the proposed value.
2. Accept the proposed value with probability q(θ* | y)/(Mg(θ*)).
3. Continue until you have the desired number of accepted simulated θ’s.

The distribution of θ’s simulated in this way will mimic i.i.d. realizations from the posterior. Diagrams illustrating the idea are shown below.

It is important to be smart about the proposal distribution g(θ). If you pick a distribution that generates too many θ* in the low-density regions of the posterior’s support, you will wait around forever for your rejection sampling algorithm to generate a sufficiently large sample.


Why the restriction that there must be some M such that Mg(θ) ≥ q(θ | y) for all θ? If this is violated, we could get acceptance probabilities greater than 1. For example, consider the hypothetical q and Mg(θ) below:

Implementing the rejection sampling method: uniform proposal distribution

We will begin by using a uniform proposal distribution. Since we know θ is a proportion, we will simulate proposed θ* from the UNIF(0,1) distribution. Note that g(θ) = 1 in this case; thus, if we let M ≥ 0.4, we will satisfy the requirement that Mg(θ) ≥ q(θ | y) for all θ ∈ [0,1] (see the picture of q(θ) above). We want to pick as small an M as possible; otherwise we will have to wait a long time to accept a large number of proposed θ*.

Thus, we will accept with probability equal to q(θ* | y)/(0.4 × 1). We will proceed by tracking all θ* as well as the ones that will comprise our sample from the posterior. In our R code, we will first write a function that evaluates q(⋅) given a value of θ:

q.theta <- function(theta) {
  q <- dbinom(14, 50, theta) * dbeta(theta, 3.54, 6.46)
  return(q)
}


library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.1

M <- 0.4
posterior.sampsize <- 10000
all.thetastars <- c() # Empty vector
all.decisions <- c()
count <- 1

set.seed(12345)
system.time({
while(count <= posterior.sampsize) {
  thetastar <- runif(1)
  accept.prob <- q.theta(thetastar)/(M*1)
  newdecision <- rbinom(1, 1, accept.prob)
  all.thetastars <- c(all.thetastars,thetastar)
  all.decisions <- c(all.decisions,newdecision)
  count <- ifelse(newdecision==1, count+1, count)
}
})

##    user  system elapsed
##   13.62    0.00   13.83

df <- data.frame(all.thetastars,all.decisions)
posterior.sample <- df %>% filter(all.decisions==1)
nrow(df) # Total number of considered theta.stars:

## [1] 85971

nrow(posterior.sample) #Number of 'accepted' theta.stars (this is the posterior sample size):

## [1] 10000

mean(posterior.sample$all.thetastars) #Mean of posterior sample:

## [1] 0.2914111

quantile(posterior.sample$all.thetastars,c(.025,.975)) #95% Bayesian credible interval:

##      2.5%     97.5%
## 0.1837566 0.4113888


Plotting the results; the grey histogram is for all proposed θ* and the red is only for the accepted θ* (i.e., the simulated sample from the posterior):

ggplot() +
  geom_histogram(data = df, aes(x = all.thetastars),
                 binwidth=0.025,alpha=.5,fill='grey',color='black') +
  geom_histogram(data = posterior.sample, aes(x = all.thetastars),
                 binwidth=0.025,alpha=.5,fill='red') +
  xlab(expression(theta^"*"))


Implementing the rejection sampling method: prior proposal distribution

Another (smarter?) approach is to use the prior distribution as the proposal distribution, so that g(θ) = π(θ). This is because, given a proposed value θ*:

P(Accept) = q(θ* | y) / (M g(θ*)) = f(y | θ*) π(θ*) / (M π(θ*)) = f(y | θ*) / M

From here, we just need a value of M that bounds f(y | θ*). But f(y | θ) ≡ L(θ), and we know what value of θ maximizes L(θ): the MLE!!

Thus, if we carry out rejection sampling using the prior as the proposal distribution, we will accept with probability L(θ*)/M, with M = L(θ̂_MLE).

In our example, the MLE of θ is obviously just the sample proportion: θ̂_MLE = 14/50 = 0.28. Thus:

M = L(0.28) = (50 choose 14) (0.28)^14 (1 − 0.28)^36 ≈ 0.1248

Carrying on; notice how much faster this approach is than using a uniform proposal distribution!

library(dplyr)
M <- dbinom(14,50, 0.28) # Find the maximum height of the likelihood
M

## [1] 0.1248285

posterior.sampsize <- 10000
all.thetastars <- c() # Empty vector
all.decisions <- c()
count <- 1

set.seed(12345)
system.time({
while(count <= posterior.sampsize) {
  thetastar <- rbeta(1, 3.54, 6.46) # proposal distribution is the prior
  accept.prob <- dbinom(14, 50, thetastar)/M
  newdecision <- rbinom(1, 1, accept.prob)
  all.thetastars <- c(all.thetastars,thetastar)
  all.decisions <- c(all.decisions,newdecision)
  count <- ifelse(newdecision==1, count+1, count)
}
})

##    user  system elapsed
##    1.33    0.00    1.35

df <- data.frame(all.thetastars,all.decisions)
posterior.sample <- df %>% filter(all.decisions==1)
mean(posterior.sample$all.thetastars) # Mean of posterior sample:

## [1] 0.2926516

quantile(posterior.sample$all.thetastars,c(.025,.975)) #95% Bayesian credible interval:


##      2.5%     97.5%
## 0.1846594 0.4141539

Plotting the results; the grey histogram is for all proposed θ* and the red is only for the accepted θ*.

ggplot() +
  geom_histogram(data = df, aes(x = all.thetastars),
                 binwidth=0.025,alpha=.5,fill='grey',color='black') +
  geom_histogram(data = posterior.sample, aes(x = all.thetastars),
                 binwidth=0.025,alpha=.5,fill='red') +
  xlab(expression(theta^"*"))

Plotting these on the density scale, and superimposing the true posterior and prior (which we know):

ggplot() +
  geom_histogram(data = df, aes(x = all.thetastars,y=..density..),
                 binwidth=0.025,fill='darkgrey') +
  geom_histogram(data = posterior.sample, aes(x = all.thetastars,y=..density..),
                 binwidth=0.025,alpha=.5,fill='red') +
  stat_function(aes(x = all.thetastars, color='black'), data = df,
                fun = dbeta, args=list(shape1 = 3.54, shape2 = 6.46), size=2) +
  stat_function(aes(x = all.thetastars, color='red'), data = posterior.sample,
                fun = dbeta, args=list(shape1 = 17.54, shape2 = 42.46), size=2) +
  xlab(expression(theta)) +
  scale_color_manual(name='',values=c('black','red'),labels=c('prior','posterior')) +
  ggtitle('Histogram of simulated proposed and accepted \n values with true densities')


Markov-Chain Monte Carlo (MCMC) approach: The Metropolis algorithm

Rejection sampling is appealing in its simplicity, but it can take a long time for multi-parameter problems. Another approach frequently used in Bayesian applications is the Metropolis-Hastings algorithm. The Metropolis algorithm is a special case of Markov-Chain Monte Carlo sampling. Let’s break down what MCMC means first:

• Monte Carlo methods: a broad term to describe methods that involve computational random sampling to obtain numerical results. We’ve been using Monte Carlo methods all semester (e.g., simulating the MSE of an estimator when the analytic MSE has no closed form.)

• Markov-Chain: “a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.” (https://en.wikipedia.org/wiki/Markov_chain)

Here’s an MCMC sampling scheme, in diagram form (image source link here), using a uniform proposal distribution:


MCMC are sometimes referred to as “random walks” since they involve “walking” around the support of the target (posterior) distribution, visiting higher-density parts more often than lower-density parts.


Compare the diagram of MCMC sampling from Hartig et al. with their diagram of rejection sampling:

Note that unlike rejection sampling, samples from MCMC are not quite independent: two samples close to each other in the chain are likely to be more similar than samples far apart in the chain. For this reason, the chain is often “thinned” by taking every 100th observation (for example). Additionally, the algorithm can take a while to “walk” into the parts of the support that have high probability. Accordingly, a “burn-in” portion of the chain (the first 1000 samples or so) is often discarded.

All that said, here is the formal representation of the Metropolis algorithm:

Initial state (J = 0):

0. Draw θ_0 from a proposal distribution g(⋅).

For state J ≥ 1:

1. Draw a proposal value θ*_J from the proposal distribution g(⋅).
2. Evaluate r = q(θ*_J)/q(θ_{J−1}).
3. Simulate a realization of U ∼ UNIF(0,1).
4. Set θ_J = θ*_J if U < r, and θ_J = θ_{J−1} otherwise.

In the Metropolis algorithm, the proposal distribution g(⋅) is symmetric (e.g., uniform or normal). Non-symmetric proposals can be used, in which case r must be modified slightly


and the algorithm is termed the Metropolis-Hastings algorithm. We will not consider non-symmetric proposals.

Coding up the Metropolis algorithm for our example, we will use g(⋅) = 1 (i.e., a UNIF(0,1) proposal distribution) since this covers the support of θ and is symmetric. We will also take a very large number of samples, as often the chain needs to be thinned or an initial burn-in subset discarded:

q.theta <- function(theta) {
  q <- dbinom(14, 50, theta)*dbeta(theta, 3.54, 6.46)
  return(q)
}

posterior.sampsize <- 100000
thetas <- c()

# Set initial value; theta_0:
thetas[1] <- runif(1)

set.seed(12345)
for(i in 2:posterior.sampsize) {
  thetastar <- runif(1)
  r <- q.theta(thetastar)/q.theta(thetas[i-1])
  U <- runif(1)
  thetas[i] <- ifelse(U < r, thetastar, thetas[i-1])
}

df <- data.frame(Step = 1:posterior.sampsize, theta = thetas)

Plotting a histogram of the samples, superimposing what we know to be the true BETA(17.54, 42.46) posterior. We can also investigate the behavior of the chain by plotting θ as a function of the chain step, J. The horizontal line denotes 0.292, which we know is the true posterior mean:

ggplot(data = df) +
  geom_histogram(aes(x = theta, y=..density..),fill='red',col='black',alpha=0.3) +
  stat_function(fun=dbeta, args = list(shape1 = 17.54, shape2 = 42.46),size=2, col='red') +
  xlab(expression(theta))

ggplot(data = df[1:100,]) +
  geom_point(aes(x = Step, y = theta)) +
  ylab(expression(theta)) + xlab('Step number') +
  geom_hline(aes(yintercept = 0.292),col='red')


The chain quickly converges to where it should be, implying we may not need to discard a “burn-in” subset. The MCMC visits “denser” areas of the support more often, occasionally jumping to less-dense regions. Note also how this graph illustrates the correlation of θ values close together in the chain; sometimes the chain gets “stuck” and stays in the same place for several iterations. This means nearby observations are correlated, as indicated in the visualization of the autocorrelation below:

par(mar=c(4,4,0,0))
with(df, plot(acf(theta)))


We can thin the chain to make the samples better mimic iid draws from the posterior, by taking every 20th observation as our simulated posterior sample:

thinned <- df[seq(1, nrow(df),by=20),]
ggplot(data = thinned[1:100,]) +
  geom_point(aes(x = Step, y = theta)) +
  ylab(expression(theta)) + xlab('Original step number') +
  geom_hline(aes(yintercept = 0.292),col='red')

ggplot(data = thinned) +
  geom_histogram(aes(x = theta, y=..density..),fill='red',col='black',alpha=0.3) +
  stat_function(fun=dbeta, args = list(shape1 = 17.54, shape2 = 42.46),size=2, col='red') +
  xlab(expression(theta))

par(mar=c(4,4,0,0))
with(thinned, plot(acf(theta)))
