Day 2 – Intro to Bayesian Concepts: Nonhierarchical Regression Models

Transcript of "Day 2 - Bayes Intro" (web2.uconn.edu/cyberinfra/module3/Downloads/Day 2 - Bayes Intro.pdf)

Page 1:

Day 2 – Intro to Bayesian Concepts

Nonhierarchical Regression Models

Page 2:

Outline
- Some cheerleading for Bayes
- Probability fundamentals
- Bayes' Theorem
- Gibbs Sampling
- Supporting info for exercises

Page 3:

Why Bayes in ecology?
- Simplifies understanding of processes by factoring complex relationships into simple pieces
- Can incorporate many sources of uncertainty and stochasticity
- Can easily combine ugly data from very different sources
- Can incorporate prior information (expert range maps, previous studies)
- Ecological units are clustered: similar to one another, but not identical (hierarchical models)
- Deals well with large numbers of random effects (e.g. spatial or hierarchical models)

Page 4:

What makes something Bayesian?
- Bayesian methods contrast with Frequentist/Classical methods
- Because of their computational demands, Bayesian methods have become popular only in the last 20 or so years
- The key difference between classical and Bayesian reasoning is that the Bayesian believes that knowledge is subjective
- The Bayesian holds only beliefs about the parameter value, and these beliefs are updated based on data

Page 5:

Outline
- Some cheerleading for Bayes
- Probability fundamentals
- Bayes' Theorem
- Gibbs Sampling
- Supporting info for exercises

Page 6:

Fundamental Probability Terms
- Probability density function – aka pdf or density
- Likelihood function – aka likelihood
- Conditional probability distribution
- Marginal probability distribution
- Bayes' Theorem

Page 7:

Probability density functions

You can think of a pdf as a normalized histogram.

Page 8:

Probability density functions

A pdf, f(x), assigns a probability to x taking a value between x and x+dx based on the area under the curve:

  ∫ f(x) dx = 1 (integrating over −∞ < x < ∞), and f(x) ≥ 0
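A quick check of these two properties in R (the standard normal pdf is just an illustrative choice):

integrate(dnorm, lower=-Inf, upper=Inf)  # area under the curve: returns 1 (up to numerical error)
all(dnorm(seq(-5, 5, by=.1)) >= 0)       # densities are nonnegative: TRUE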

Page 9:

Probability density functions
- "Densities" can occur in Bayesian models because you specify a distribution, like a Normal, Beta, Poisson, etc.
- Or because you're sampling from an unknown distribution that doesn't necessarily have a pretty form

Page 10:

Probability density functions
- Joint pdfs describe the distribution of 2 or more variables, which may or may not be independent of one another

Page 11:

Likelihood and pdfs
- Likelihood can be thought of as the opposite of probability
- Probability (pdfs) allows us to predict an unknown outcome from known parameters
- Likelihood allows us to predict unknown parameters from a known outcome
- Likelihoods do not integrate to 1, while pdfs do
- Confusingly, the functions have the same form, but different independent variables
- e.g. the exponential distribution:

  f(x) = ρ e^(−ρx)  (a function of x, with ρ fixed)
  L(ρ) = ρ e^(−ρx)  (a function of ρ, with x fixed)
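A short R sketch of the distinction, plotting the same formula as a pdf (in x) and as a likelihood (in ρ); the rate 0.5 and observation x = 2 are arbitrary illustrative values:

rho <- 0.5
curve(dexp(x, rate=rho), from=0, to=10, ylab="f(x)")       # pdf: rho fixed, x varies; area = 1
x.obs <- 2
curve(dexp(x.obs, rate=x), from=.01, to=3, ylab="L(rho)")  # likelihood: x fixed, rho varies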

Page 12:

Likelihood

  Likelihood allows us to estimate unknown parameters with known data

Page 13:

Conditional Probability

Conditional probabilities allow us to understand how the probability of an event A changes after we learn that some other event B has occurred.

Key concept: the occurrence of B reshapes the sample space for subsequent events, such as A.

[Venn diagram: sample space S containing overlapping events A and B]

The conditional probability of A given B is denoted Pr(A | B).
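A tiny numeric illustration in R (the probabilities are made up):

p.B  <- 0.4   # Pr(B)
p.AB <- 0.1   # Pr(A and B)
p.AB / p.B    # Pr(A | B) = 0.25: the probability of A within the reshaped sample space B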

Page 14:

Conditional Probability

Page 15:

Conditional Probability

Page 16:

Conditional Probability
- In Bayesian models we need each parameter's distribution conditional on the values of all the other parameters.
- If a regression model has 3 parameters b0, b1, b2, then we'll need:
  - P(b0 | b1, b2, data)
  - P(b1 | b0, b2, data)
  - P(b2 | b0, b1, data)

Page 17:

Marginal Probability

P(A) is the marginal distribution of A, obtained by summing the conditional distributions of A given each value of B, weighted by the probability of that value: P(A) = Σ_B P(A | B) P(B).
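In R, for a B that takes two values (toy numbers):

p.B         <- c(0.3, 0.7)   # P(B = b) for the two values of B
p.A.given.B <- c(0.9, 0.2)   # P(A | B = b)
sum(p.A.given.B * p.B)       # marginal P(A) = 0.41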

Page 18:

Outline
- Some cheerleading for Bayes
- Probability fundamentals
- Bayes' Theorem
- Gibbs Sampling
- Supporting info for exercises

Page 19:

Bayes' Theorem
- Bayes' Theorem can be derived from the expression for the joint probability of two events D and θ
- Let p(D) denote the probability that event D will occur and let p(θ) denote the probability that event θ will occur. Then p(D, θ) is the probability that both events occur:

  p(D, θ) = p(D) p(θ | D) = p(θ) p(D | θ)

- Dividing through by p(D), Bayes' Theorem states:

  p(θ | D) = p(θ) p(D | θ) / p(D)

Page 20:

Interpretation of Bayes' Theorem

  p(Param_i | Data) = p(Param_i) p(Data | Param_i) / Σ_k p(Param_k) p(Data | Param_k)

- p(Param_i) = the prior distribution for Param_i: summarizes beliefs about the probability of Param_i before Param_i or Data are observed. [Marginal]
- p(Data | Param_i) = the conditional probability of Data given Param_i: summarizes the likelihood of the Data given Param_i. [Conditional]
- Σ_k p(Param_k) p(Data | Param_k) = the uninteresting normalizing constant: the sum of the quantities in the numerator over all Param_k. [Marginal]
- p(Param_i | Data) = the posterior distribution for Param_i: the probability of Param_i after Data has been observed. [Conditional]
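For a discrete parameter, each term can be computed directly in R; here is a toy example with three candidate parameter values and binomial data (all numbers illustrative):

param <- c(0.2, 0.5, 0.8)                # candidate Param_k values
prior <- rep(1/3, 3)                     # p(Param_k): a uniform prior
lik   <- dbinom(7, size=10, prob=param)  # p(Data | Param_k): 7 successes in 10 trials
prior*lik / sum(prior*lik)               # posterior p(Param_k | Data), normalized to sum to 1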

Page 21:

Review of Fundamental Probability Terms
- Probability density function – aka pdf or density
- Likelihood function – aka likelihood
- Conditional probability distribution
- Marginal probability distribution
- Bayes' Theorem

Page 22:

The components of Bayes' Theorem

Before we worry about computation in a model, we'll focus on the concepts of what goes in and what comes out:
- Priors
- Likelihoods
- Posterior

Page 23:

Prior distribution
- This describes our previous belief/knowledge about a parameter value (θ)
- It can incorporate information or be vague/noninformative

  p(θ | D) = p(θ) p(D | θ) / p(D)

Page 24:

Prior distribution

Consider a prior for the probability parameter of a binomial distribution, as sketched below.
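Beta distributions are a standard choice of prior for a binomial probability; a quick R sketch, with illustrative shape parameters:

curve(dbeta(x, 1, 1), from=0, to=1, ylim=c(0, 4), ylab="prior density")  # flat, vague prior
curve(dbeta(x, 10, 4), add=TRUE, lty=2)  # informative prior concentrated near 0.7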

Page 25:

Likelihood

The exponential distribution (with mean 1/ρ) has likelihood:

  L(ρ) = ρ e^(−ρx)

1. Write the likelihood function for 5 data points
2. Write the log likelihood function from 1
3. Generate some random exponential data using rexp
4. Evaluate 3 for different values of ρ
5. Plot 3 as a function of ρ

  p(θ | D) = p(θ) p(D | θ) / p(D)

Page 26:

Likelihood

The exponential distribution (with mean 1/ρ) has likelihood L(ρ) = ρ e^(−ρx).

1. Write the likelihood function for 5 data points (n = 5):

  L(ρ) = Π_{i=1..n} ρ e^(−ρ x_i) = ρ^n [e^(−ρx_1) e^(−ρx_2) ⋯ e^(−ρx_n)] = ρ^n e^(−ρ Σ x_i)

2. Write the log likelihood function from 1
3. Generate some random exponential data using rexp
4. Evaluate 3 for different values of ρ
5. Plot 3 as a function of ρ

Page 27:

Likelihood

1. Write the likelihood function for, say, 5 data points (n = 5):

  L(ρ) = Π_{i=1..n} ρ e^(−ρ x_i) = ρ^n e^(−ρ Σ x_i)

2. Write the log likelihood function from 1:

  ln(L) = ln(ρ^n e^(−ρ Σ x_i)) = ln(ρ^n) + ln(e^(−ρ Σ x_i)) = n ln ρ − ρ Σ x_i

3. Generate some random exponential data using rexp
4. Evaluate 3 for different values of ρ
5. Plot 3 as a function of ρ

Page 28:

Likelihood

data=rexp(10,.1)                 # simulate data

# explore the likelihood function numerically
rho.seq=seq(.01,.4,length=300)   # define a sequence of rho values

lexp=function(rho,data){         # write the log likelihood function
  n=length(data)
  n*log(rho) - rho*sum(data)
}

plot(rho.seq,lexp(rho.seq,data),type='l')  # plot log L vs. rho

Directions:
1. Paste this code into R
2. Try some different values for the rate of the exponential distribution (the second argument of rexp; the mean is 1/rate) to see how the likelihood changes
3. Try sampling different numbers of points by changing the first argument of rexp to see how sharply peaked the likelihood function becomes with more data
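As an optional follow-up (not in the original directions), the ρ that maximizes the log likelihood can be found numerically and compared with the analytical answer, which for exponential data is 1/mean(data):

optimize(lexp, interval=c(.001,1), data=data, maximum=TRUE)  # numerical MLE of rho
1/mean(data)                                                 # analytical MLE, for comparison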

Page 29:

Likelihood

Page 30:

Posterior Distribution
- To get posterior distributions for the parameters, we use a search algorithm to find the best values for each parameter
- The search algorithm proceeds by obtaining samples from the posterior
- We can summarize the results of all these steps using histograms

  p(θ | D) = p(θ) p(D | θ) / p(D)
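A sketch of that summary step in R, using stand-in draws rather than output from a real sampler:

draws <- rnorm(10000, mean=1.5, sd=0.3)  # stand-in for samples from a posterior
hist(draws, breaks=50)                   # the histogram approximates p(theta | D)
quantile(draws, c(.025, .5, .975))       # the median and a 95% interval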

Page 31:

Posterior Distribution
- The posterior distribution summarizes all that we know after analyzing the data
- We get a posterior distribution for each parameter in the model
- The posterior distribution gives the probability that the parameter takes a certain value

Page 32:

Posterior Quantile Summaries

Page 33:

Example
- Sections 2.1 – 2.4 from Albert (2007), Bayesian Computation with R

Page 34:

Outline
- Some cheerleading for Bayes
- Probability fundamentals
- Bayes' Theorem
- Gibbs Sampling
- Supporting info for exercises

Page 35:

Gibbs sampling
- Getting the posterior from the previous exercise was easy because we only had to worry about 1 parameter: the binomial proportion
- Gibbs sampling is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables. The purpose of such a sequence is to approximate the marginal distribution of each of the variables
- The point of Gibbs sampling is that, given a joint distribution, it is simpler to sample from a conditional distribution than to marginalize by integrating over the joint distribution

Page 36:

Markov Chain Monte Carlo (MCMC)

Gibbs sampling is a type of MCMC.

"Markov chain" means that the sample at t+1 depends only on the sample at t,
e.g. μ_{t+1} ~ N(μ_t, some variance you set)

"Monte Carlo" means 'wandering around' sort of randomly.

Page 37:

Gibbs Sampling - Algorithm

If y_i ~ N(β_1 + β_2 X_2 + … + β_k X_k, τ),

then f(β, τ | y) ∝ Π_{i=1..n} f(y_i | β, τ) × Π_{j=1..k} f(β_j) × f(τ)
     posterior  ∝ likelihood × β_j priors × τ prior

The Gibbs Sampler works by repeatedly sampling each of the conditional distributions:

1st iteration:
  β_1^{t+1} ~ p(β_1 | β_2^t, …, β_k^t, τ^t, Y)
  β_2^{t+1} ~ p(β_2 | β_1^{t+1}, β_3^t, …, β_k^t, τ^t, Y)
  …
  β_k^{t+1} ~ p(β_k | β_1^{t+1}, β_2^{t+1}, …, β_{k−1}^{t+1}, τ^t, Y)
  τ^{t+1} ~ p(τ | β_1^{t+1}, …, β_k^{t+1}, Y)

2nd iteration:
  β_1^{t+2} ~ p(β_1 | β_2^{t+1}, …, β_k^{t+1}, τ^{t+1}, Y)
  β_2^{t+2} ~ p(β_2 | β_1^{t+2}, β_3^{t+1}, …, β_k^{t+1}, τ^{t+1}, Y)
  …
  β_k^{t+2} ~ p(β_k | β_1^{t+2}, …, β_{k−1}^{t+2}, τ^{t+1}, Y)
  τ^{t+2} ~ p(τ | β_1^{t+2}, …, β_k^{t+2}, Y)

3rd iteration, etc.
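A minimal R sketch of this loop for a simpler model than the regression above: a normal mean μ and precision τ, with a flat prior on μ and a Gamma(a, b) prior on τ, so that both full conditionals have known forms. The data, prior parameters, and chain length are all illustrative assumptions.

# Gibbs sampler sketch: y_i ~ N(mu, 1/tau), flat prior on mu,
# Gamma(a, b) prior on the precision tau (illustrative choices)
set.seed(1)
y <- rnorm(50, mean=5, sd=2)   # simulated data
n <- length(y); a <- .01; b <- .01
n.iter <- 5000
mu <- tau <- numeric(n.iter)
mu[1] <- 0; tau[1] <- 1        # starting values
for (t in 2:n.iter) {
  # draw mu from its full conditional given the current tau
  mu[t]  <- rnorm(1, mean=mean(y), sd=1/sqrt(n*tau[t-1]))
  # draw tau from its full conditional given the new mu
  tau[t] <- rgamma(1, shape=a + n/2, rate=b + .5*sum((y - mu[t])^2))
}
hist(mu[-(1:500)])             # posterior for mu after discarding a burn-in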

Page 38:

Gibbs Sampling
- The target distribution is the posterior
- The proposal (jump, tuning) distribution decides where to go next
- In Gibbs sampling, the proposal distribution for each parameter is its full conditional distribution
- In Gibbs sampling, the proposed value is therefore always accepted

Page 39:

Gibbs sampling
- Example: MCMC Robot – a program by Paul Lewis
- Follow the link on the course website
- Input: a joint posterior distribution for all parameters
- Output: the trajectory of a random MCMC walk
- Most commands you need are under the 'Robot' menu
- Clicking on the black area and dragging your mouse creates normal distributions with means at your starting point

Page 40:

MCMC Robot
- To start a chain, choose 'start walking' from the 'robot' menu
- Explore the options on the robot menu
  - Note that you can set the number of steps per run, which allows you to see more of each chain
- Try making a more complex sample space
  - 'Clear' erases the walk but not the landscape; you have to close and reopen the program to get new landscapes
- Use the 'chains' menu to enable multiple chains
- Use the 'show' menu to show these different chains
- If these settings are unclear, you can read the very short help file under the help menu
- See questions on the next slide

Page 41:

MCMC robot

QUESTIONS:
- What similarities and differences do you notice among different chains?
- What happens to the chain if you make 2 small islands in opposite corners of the window?
- What happens to the chain if you make 1 large and 1 small island?

Page 42:

Convergence and MCMC
- The general questions to ask:
  - At what point have we converged to the stationary distribution? (i.e. how long should our "burn-in" period be?)
  - After we have reached the stationary distribution, how many iterations will it take to summarize the posterior distribution?
- The answers to both of these questions remain a bit ad hoc, because the desirable results that we depend on are only true asymptotically.
- When there are islands of high-probability states with no paths between them, the chain can get stuck. Given infinite time it will emerge, but you probably don't have that long.
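For the second question, one common rule of thumb is the effective sample size from the coda R package, shown here on stand-in draws rather than real sampler output:

library(coda)                   # install.packages('coda') if needed
chain <- as.mcmc(rnorm(10000))  # stand-in for real sampler output
effectiveSize(chain)            # roughly how many independent draws the chain is worth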

Page 43:

Posterior Trace Plots
- Plot the parameter value at time t against the iteration number.
- If the model has converged, the trace plot will move around the mode of the distribution.
- A clear sign of non-convergence in a trace plot is trending in the sample space.
- The problem with trace plots is that it may appear that we have converged when the chain is actually trapped in a local region rather than exploring the full posterior.

Page 44:

Apparent Convergence and Non-Convergence

[Two example trace plots: one labeled "Convergence", one labeled "No convergence"]

Page 45:

MCMC

Convergence diagnostics:
- Geweke time series diagnostic
- Gelman-Rubin diagnostic

To check for convergence, multiple starting points can be used.
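Both diagnostics are implemented in the coda R package; a sketch on stand-in chains (real usage would pass actual sampler output):

library(coda)
chain1 <- as.mcmc(rnorm(5000))          # stand-ins for two chains
chain2 <- as.mcmc(rnorm(5000))          # started from different points
gelman.diag(mcmc.list(chain1, chain2))  # Gelman-Rubin: values near 1 suggest convergence
geweke.diag(chain1)                     # Geweke: compares early vs. late segments of a chain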

Page 46:

Posterior Parameter Density Plots

Sometimes non-convergence shows up as lumpy posterior density plots; if you see this, search longer.

[Example of a lumpy density plot, labeled "Bad!"]

Page 47:

Example: Trace plots and posterior densities from a regression

install.packages('MCMCpack')
library(MCMCpack)
data(swiss)
posterior1=MCMCregress(Fertility ~ Agriculture + <whatever predictors you want>,
                       data=swiss, mcmc=200, burnin=0)
summary(posterior1)
plot(posterior1)

- Try different values for mcmc, which sets the number of posterior samples, and for burnin, which sets the number of samples to throw out before collecting samples. What is a sufficient burnin length and number of samples to make sure that there's not a long-term trend in the trace plot and that the density plot is pretty smooth?
- Use different combinations of predictors to find the best model
- Look at the posterior summary: how can you tell if a predictor should be dropped?
- Have a look at the help file for MCMCregress if you get bored and want to alter more settings, like the starting values for each chain.

Page 48:

Example: Regression

Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                Mean      SD  Naive SE Time-series SE
(Intercept)  84.0109  5.8959  0.058959      0.0636739
Agriculture  -0.0662  0.0818  0.000818      0.0008651
Education    -0.9602  0.1930  0.001930      0.0021080
sigma2       94.2139 21.3430  0.213430      0.2253052

2. Quantiles for each variable:

               2.5%      25%       50%       75%     97.5%
(Intercept) 72.5952  80.0868  83.98372  87.94325  95.62394
Agriculture -0.2276  -0.1210  -0.06507  -0.01111   0.09113
Education   -1.3458  -1.0873  -0.95727  -0.83240  -0.58836
sigma2      61.6284  78.8988  91.02573 105.73569 143.53792

Page 49:

Example: Regression
