Day 2 - Bayes Introweb2.uconn.edu/cyberinfra/module3/Downloads/Day 2 - Bayes Intro.pdf · Bayes'...

Day 2 – Intro to Bayesian Concepts

Nonhierarchical Regression Models

1

Outline   Some cheerleading for Bayes   Probability fundamentals   Bayes’ Theorem   Gibbs Sampling   Supporting info for exercises

2

Why Bayes in ecology?   Simplifies understanding of the processes by factoring complex

relationships into simple pieces

  Can incorporate many sources of uncertainty and stochasticity

  Can easily combine ugly data from very different sources

  Can incorporate prior information (expert range maps, previous studies)

  Ecological units are clustered and similar but different (hierarchical models)

  Deals well with large numbers of random effects (e.g. spatial or hierarchical models)

3

What makes something Bayesian?   Bayesian methods contrast with Frequentist/Classical

methods   Bayesian methods have only become popular in the

last 20ish years due to their computational demands   The key difference between classical and Bayesian

reasoning is that the Bayesian believes that knowledge is subjective.

  Bayesian only has beliefs about the parameter value which are updated, based on data

4


5

Fundamental Probability Terms   Probability density function – aka pdf or density   Likelihood function – aka likelihood   Conditional probability distribution   Marginal probability distribution   Bayes’ Theorem

6

Probability density functions You can think of a PDF as a histogram (normalized)

7

8

Probability density functions A pdf, f(x), assigns a probability to x taking a value between

x and x+dx based on the area under the curve

€

f (x)dx =1−∞

∞

∫€

f (x) ≥ 0

Probability density functions   “Densities” can occur in Bayesian models because you

specify a distribution, like a Normal, Beta, Poisson, etc.

  Or because you’re sampling from an unknown distribution that doesn’t necessarily have a pretty form

9

Probability density functions   Joint pdfs describe the distribution of 2 or more

variables which may or may not be independent of one another

10

11

Likelihood and pdfs

  Likelihood can be thought of as the opposite of probability   Probability (pdfs) allow us to predict an unknown

outcome from known parameters   Likelihood allows us to predict unknown parameters from

a known outcome   Likelihoods do not integrate to 1, while pdfs do   Confusingly, the functions have the same form, but

different independent variables   e.g. the exponential distribution:

€

f (x) = ρe−ρx

€

L(ρ) = ρe−ρx

12

Likelihood

  Likelihood allows us to estimate unknown parameters with known data

13

Conditional Probability

Conditional probabilities allow us to understand how the probability of an event A changes after it has been learned that some other event B has occurred.

Key concept: the occurrence of B reshapes the sample space for subsequent events, such as A.

S

A

B

Pr(A | B) The conditional probability of A given B is denoted Pr(A | B).

14


15


Conditional Probability   In Bayesian models we need each parameter’s distribution

conditional on the values of all the other parameters.   If a regression model has 3 parameters: b0, b1, b2 then

we’ll need   P(b0|b1, b2,data)   P(b1|b0, b2,data)   P(b2|b0, b1,data)

16

17

Marginal Probability

P(A) is the marginal distribution of A, composed of the sum of conditional distributions


18

Bayes’ Theorem   Bayes' Theorem can be derived from the expression for

the joint probability of two events D and θ

  Let p(D) denote the probability that event D will occur and let p(θ) denote the probability that event θ will occur. Then p(D, θ) is the probability that both events occur

p(D, θ)=p(D) p(θ|D) = p(θ) p(D| θ)   Then Bayes’ Theorem states:

€

p( θ | D ) = p( θ ) p( D | θ )p(D)

19

20

Interpretation of Bayes’ Theorem

€

p( Parami | Data ) = p( Parami ) p( Data | Parami )p(Paramk)p(Data|Paramkk∑ )

p(Parami) = Prior distribution for Parami - summarizes beliefs about the probability of Parami before Parami or Data are observed.

Marginal

p( Data | Parami ) = The conditional probability of Data given Parami. It summarizes the likelihood of the Data given Parami. Conditional

∑k p( Paramk ) p( Data | Paramk ) = The uninteresting normalizing constant - the sum of the quantities in the numerator for all Paramk. Marginal

p( Parami | Data ) = The posterior distribution for Parami - the probability of Parami after Data has been observed. Conditional

Review of Fundamental Probability Terms

  Probability density function – aka pdf or density   Likelihood function – aka likelihood   Conditional probabability distribution   Marginal probability distribution   Bayes’ Theorem

21

22

The components of Bayes’ Theorem

Before we worry about computation in a model, we’ll focus on the concepts of what goes in and what comes out

  Priors   Likelihoods   Posterior

23

Prior distribution   This describes our previous belief/knowledge about a

parameter value (θ)   It can incorporate information or be vague/

noninformative

€

p(θ |D) =p(θ)p(D |θ)

p(D)

Prior distribution Consider a prior for the probability parameter from a binomial distribution

24

25

Likelihood

The exponential distribution likelihood:

€

L(ρ) = ρe−ρx

1.  Write the likelihood function for 5 data points 2.  Write the log likelihood function from 1 3.  Generate some random exponential data using rexp 4.  Evaluate 3 for different values of ρ 5.  Plot 3 as a function of ρ

with mean 1/ ρ

€


p(D)

26

The exponential distribution likelihood:

1.  Write the likelihood function for 5 data points (n=5)

2.  Write the log likelihood function from 1 3.  Generate some random exponential data using rexp 4.  Evaluate 3 for different values of ρ 5.  Plot 3 as a function of ρ

€

L(ρ) = ρe−ρxii=1

n

∏ = ρn e−ρx1e−ρx2e−ρx3 …e−ρxn[ ] = ρne−ρ xi∑

with mean 1/ ρ

Likelihood

€

L(ρ) = ρe−ρx

27

1.  Write the likelihood function for, say 5 data points (n=5)

2.  Write the log likelihood function from 1

3.  Generate some random exponential data using rexp 4.  Evaluate 3 for different values of ρ 5.  Plot 3 as a function of ρ

€

L(ρ) = ρe−ρxii=1

n

∏ = ρn e−ρx1e−ρx2e−ρx3 …e−ρxn[ ] = ρne−ρ xi∑

€

ln(L) = ln ρne−ρ xi∑⎛ ⎝ ⎜ ⎞

⎠ ⎟ = lnρn + ln e−ρ xi∑⎛

⎝ ⎜ ⎞

⎠ ⎟

€

= n lnρ −ρ xi∑( ) = n lnρ − ρ xi∑

Likelihood

28

data=rexp(10,.1) #simulate data

#explore likelihood function numericallyrho.seq=seq(.01,.4,length=300) #define a sequence of rho

lexp=function(rho,data){ #write likelihood functionn=length(data)n*log(rho) - rho*sum(data)}

plot(rho.seq,lexp(rho.seq,data),type='l') #plot L vs. rho

Likelihood

Directions: 1.  Paste this code in to R 2.  Try some different values for the mean of the exponential distribution (the second

argument of rexp) to see how the likelihood changes 3.  Try sampling different numbers of points by changing the first argument of rexp to see

how sharply peaked the likelihood function becomes with more data

29

Likelihood

Posterior Distribution   To get posterior distributions for the parameters, we use

a search algorithm to find the best values for each parameter

  The search algorithm proceeds by obtaining samples from the posterior

  We can summarize the results of all these steps using histograms

€


p(D)

30

Posterior Distribution   The posterior distribution

summarizes all that we know after analyzing the data

  We get a posterior distribution for each parameter in the model

  The posterior distribution gives the probability that the parameter takes a certain value

31

32

Posterior Quantile Summaries

Example   Sections 2.1 – 2.4 from Albert (2007) - Bayesian

Computation with R

33


34

Gibbs sampling   Getting the posterior from the previous exercise was

easy because we only had to worry about 1 parameter: the binomial proportion

  Gibbs sampling is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables. The purpose of such a sequence is to to approximate the marginal distribution of each of the variables

  The point of Gibbs sampling is that given a joint distribution it is simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution

35

Markov Chain Monte Carlo (MCMC)

Gibbs sampling is a type of MCMC

Markov chain means that a the sample at t+1 depends only on the sample at t

e.g. µt+1 ~ N( µt , some variance you set)

Monte Carlo means ‘wandering around’ sort of randomly

36

Gibbs Sampling - Algorithm

If yi ~ N(β1 + β2 X2 + … + βk Xk , τ) and

Then f(β, τ | y) ∝ ∏i=1 to nf(yi| β, τ) * ∏j=1 to kf(βj) * f(τ) posterior ∝ likelihood * βj priors * τ prior

The Gibbs Sampler works by repeatedly sampling each of the conditional distributions

1st iteration β1

t+1 ~ p(β1t | β2

t,…, βkt, τ t, Y)

β2t+1 ~ p(β2 | β1

t+1, β3 t ,…, βk

t, τ t, Y) … βk

t+1 ~ p(βk t | β1

t+1, β2 t+1 …, βk-1 t, τ t, Y)

τ t+1 ~ p(τ t | β1 t+1,…, βk

t+1, Y) 2nd iteration β1

t+2 ~ p(β1 t+1

| β2 t+1,…, βk

t+1, τ t+1, Y) β2

t+2 ~ p(β2 t+1 | β1

t+2, β3 t+1

,…, βk t+1, τ t+1, Y)

… βk

t+2 ~ p(βk t+1 | β1 t+2,…, βk-1

t+1, τ t+1, Y) τ t+2 ~ p(τ t+1 | β1

t+1,…, βk t+2, Y)

3rd iteration etc.

Gibbs Sampling

  Target distribution is the posterior   Proposal (Jump, Tuning) distribution is decides where

to go next   In Gibbs sampling, the proposal distribution is a

random walk   In Gibbs sampling, the proposed value is always

accepted

38

Gibbs sampling   Example: MCMC Robot – a program by Paul Lewis

  Follow link on course website

  Input: a joint posterior distribution for all parameters   Output: the trajectory of a random MCMC walk

  Most commands you need are under the ‘Robot’ menu   Clicking on the black area and dragging your mouse creates normal distributions with means at your starting point

39

MCMC Robot   To start a chain, choose ‘start walking’ from the ‘robot’ menu   Explore the options on the robot menu

  Note that you can set the number of steps per run, which allows you to see more of each chain

  Try making a more complex sample space   ‘Clear’ erases the walk but not the landscape; you have to close and reopen to get new landscapes   Use the ‘chains’ menu to enable multiple chains   Use the ‘show’ menu to show these different chains   You can read the very short help file if these settings are

unclear under the help menu   See questions on next slide

40

MCMC robot

41

  QUESTIONS:   What similarities and differences do you notice among

different chains?   What happens to the chain if you make 2 small islands in

opposite corners of the window?

  What happens to the chain if you make 1 large and 1 small island?

Convergence and MCMC

  The general questions to ask :

  At what point have we converged to the stationary distribution? (i.e. how long should our “burn-in” period be?)

  After we have reached the stationary distribution, how many iterations will it take to summarize the posterior distribution?

  The answers to both of these questions remain a bit ad hoc because the desirable results that we depend on are only true asymptotically.

  When there are island of high-probability states with no paths between them, the chain can get stuck.

  Given infinite time it will emerge, but you probably don’t have that long.

42

43

Posterior Trace Plots

  Plots the parameter value at time t against the iteration number.

  If the model has converged, the trace plot will move around the mode of the distribution.

  A clear sign of non-convergence with a trace plot occurs when we observe some trending in the sample space.

  The problem with trace plots is that it may appear that we have converged, however, the chain is trapped in a local region rather exploring the full posterior.

44

Apparent Convergence and Non-Convergence

Convergence No convergence

MCMC

Convergence diagnostics • Geweke time series diagnostic • Gelman-Rubin diagnostic

To check for convergence, multiple starting points can be used

45

46

Posterior Parameter Density Plots

Sometimes non-convergence is reflected in lumpy distributions so search longer

Bad!

47

Example: Trace plots and posterior densities from a regression

install.packages('MCMCpack')library(MCMCpack) data(swiss) posterior1=MCMCregress(Fertility ~ Agriculture +

whatever you want, data=swiss, mcmc=200, burnin=0)summary(posterior1)plot(posterior1)   Try different values for mcmc, which sets the number of posterior samples, and

for burnin, which sets the number of samples to throw out before collecting samples. What is a sufficient burnin length and number of samples to make sure that thereʼs not a long term trend in the trace plot and that the density plot is pretty smooth?#

  Use different combinations of predictors to find the best model#  Look at the posterior summary: How can you tell if a predictor should be

dropped?#  Have a look at the help file for MCMCregress if you get bored and alter more

settings, like the starting values for each chain.#

Example: Regression Iterations = 1001:11000 Thinning interval = 1 Number of chains = 1 Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable, plus standard error of the mean:

Mean SD Naive SE Time-series SE (Intercept) 84.0109 5.8959 0.058959 0.0636739 Agriculture -0.0662 0.0818 0.000818 0.0008651 Education -0.9602 0.1930 0.001930 0.0021080 sigma2 94.2139 21.3430 0.213430 0.2253052

2. Quantiles for each variable:

2.5% 25% 50% 75% 97.5% (Intercept) 72.5952 80.0868 83.98372 87.94325 95.62394 Agriculture -0.2276 -0.1210 -0.06507 -0.01111 0.09113 Education -1.3458 -1.0873 -0.95727 -0.83240 -0.58836 sigma2 61.6284 78.8988 91.02573 105.73569 143.53792

48

Example: Regression

49

Day 2 - Bayes Introweb2.uconn.edu/cyberinfra/module3/Downloads/Day 2 - Bayes Intro.pdf · Bayes'...

Documents

Transcript of Day 2 - Bayes Introweb2.uconn.edu/cyberinfra/module3/Downloads/Day 2 - Bayes Intro.pdf · Bayes'...