Day 2 - Bayes Introweb2.uconn.edu/cyberinfra/module3/Downloads/Day 2 - Bayes Intro.pdf · Bayes'...
-
Upload
nguyenhanh -
Category
Documents
-
view
218 -
download
0
Transcript of Day 2 - Bayes Introweb2.uconn.edu/cyberinfra/module3/Downloads/Day 2 - Bayes Intro.pdf · Bayes'...
Day 2 – Intro to Bayesian Concepts
Nonhierarchical Regression Models
1
Outline Some cheerleading for Bayes Probability fundamentals Bayes’ Theorem Gibbs Sampling Supporting info for exercises
2
Why Bayes in ecology? Simplifies understanding of the processes by factoring complex
relationships into simple pieces
Can incorporate many sources of uncertainty and stochasticity
Can easily combine ugly data from very different sources
Can incorporate prior information (expert range maps, previous studies)
Ecological units are clustered and similar but different (hierarchical models)
Deals well with large numbers of random effects (e.g. spatial or hierarchical models)
3
What makes something Bayesian? Bayesian methods contrast with Frequentist/Classical
methods Bayesian methods have only become popular in the
last 20ish years due to their computational demands The key difference between classical and Bayesian
reasoning is that the Bayesian believes that knowledge is subjective.
Bayesian only has beliefs about the parameter value which are updated, based on data
4
Outline Some cheerleading for Bayes Probability fundamentals Bayes’ Theorem Gibbs Sampling Supporting info for exercises
5
Fundamental Probability Terms Probability density function – aka pdf or density Likelihood function – aka likelihood Conditional probability distribution Marginal probability distribution Bayes’ Theorem
6
Probability density functions You can think of a PDF as a histogram (normalized)
7
8
Probability density functions A pdf, f(x), assigns a probability to x taking a value between
x and x+dx based on the area under the curve
€
f (x)dx =1−∞
∞
∫€
f (x) ≥ 0
Probability density functions “Densities” can occur in Bayesian models because you
specify a distribution, like a Normal, Beta, Poisson, etc.
Or because you’re sampling from an unknown distribution that doesn’t necessarily have a pretty form
9
Probability density functions Joint pdfs describe the distribution of 2 or more
variables which may or may not be independent of one another
10
11
Likelihood and pdfs
Likelihood can be thought of as the opposite of probability Probability (pdfs) allow us to predict an unknown
outcome from known parameters Likelihood allows us to predict unknown parameters from
a known outcome Likelihoods do not integrate to 1, while pdfs do Confusingly, the functions have the same form, but
different independent variables e.g. the exponential distribution:
€
f (x) = ρe−ρx
€
L(ρ) = ρe−ρx
12
Likelihood
Likelihood allows us to estimate unknown parameters with known data
13
Conditional Probability
Conditional probabilities allow us to understand how the probability of an event A changes after it has been learned that some other event B has occurred.
Key concept: the occurrence of B reshapes the sample space for subsequent events, such as A.
S
A
B
Pr(A | B) The conditional probability of A given B is denoted Pr(A | B).
14
Conditional Probability
15
Conditional Probability
Conditional Probability In Bayesian models we need each parameter’s distribution
conditional on the values of all the other parameters. If a regression model has 3 parameters: b0, b1, b2 then
we’ll need P(b0|b1, b2,data) P(b1|b0, b2,data) P(b2|b0, b1,data)
16
17
Marginal Probability
P(A) is the marginal distribution of A, composed of the sum of conditional distributions
Outline Some cheerleading for Bayes Probability fundamentals Bayes’ Theorem Gibbs Sampling Supporting info for exercises
18
Bayes’ Theorem Bayes' Theorem can be derived from the expression for
the joint probability of two events D and θ
Let p(D) denote the probability that event D will occur and let p(θ) denote the probability that event θ will occur. Then p(D, θ) is the probability that both events occur
p(D, θ)=p(D) p(θ|D) = p(θ) p(D| θ) Then Bayes’ Theorem states:
€
p( θ | D ) = p( θ ) p( D | θ )p(D)
19
20
Interpretation of Bayes’ Theorem
€
p( Parami | Data ) = p( Parami ) p( Data | Parami )p(Paramk)p(Data|Paramkk∑ )
p(Parami) = Prior distribution for Parami - summarizes beliefs about the probability of Parami before Parami or Data are observed.
Marginal
p( Data | Parami ) = The conditional probability of Data given Parami. It summarizes the likelihood of the Data given Parami. Conditional
∑k p( Paramk ) p( Data | Paramk ) = The uninteresting normalizing constant - the sum of the quantities in the numerator for all Paramk. Marginal
p( Parami | Data ) = The posterior distribution for Parami - the probability of Parami after Data has been observed. Conditional
Review of Fundamental Probability Terms
Probability density function – aka pdf or density Likelihood function – aka likelihood Conditional probabability distribution Marginal probability distribution Bayes’ Theorem
21
22
The components of Bayes’ Theorem
Before we worry about computation in a model, we’ll focus on the concepts of what goes in and what comes out
Priors Likelihoods Posterior
23
Prior distribution This describes our previous belief/knowledge about a
parameter value (θ) It can incorporate information or be vague/
noninformative
€
p(θ |D) =p(θ)p(D |θ)
p(D)
Prior distribution Consider a prior for the probability parameter from a binomial distribution
24
25
Likelihood
The exponential distribution likelihood:
€
L(ρ) = ρe−ρx
1. Write the likelihood function for 5 data points 2. Write the log likelihood function from 1 3. Generate some random exponential data using rexp 4. Evaluate 3 for different values of ρ 5. Plot 3 as a function of ρ
with mean 1/ ρ
€
p(θ |D) =p(θ)p(D |θ)
p(D)
26
The exponential distribution likelihood:
1. Write the likelihood function for 5 data points (n=5)
2. Write the log likelihood function from 1 3. Generate some random exponential data using rexp 4. Evaluate 3 for different values of ρ 5. Plot 3 as a function of ρ
€
L(ρ) = ρe−ρxii=1
n
∏ = ρn e−ρx1e−ρx2e−ρx3 …e−ρxn[ ] = ρne−ρ xi∑
with mean 1/ ρ
Likelihood
€
L(ρ) = ρe−ρx
27
1. Write the likelihood function for, say 5 data points (n=5)
2. Write the log likelihood function from 1
3. Generate some random exponential data using rexp 4. Evaluate 3 for different values of ρ 5. Plot 3 as a function of ρ
€
L(ρ) = ρe−ρxii=1
n
∏ = ρn e−ρx1e−ρx2e−ρx3 …e−ρxn[ ] = ρne−ρ xi∑
€
ln(L) = ln ρne−ρ xi∑⎛ ⎝ ⎜ ⎞
⎠ ⎟ = lnρn + ln e−ρ xi∑⎛
⎝ ⎜ ⎞
⎠ ⎟
€
= n lnρ −ρ xi∑( ) = n lnρ − ρ xi∑
Likelihood
28
data=rexp(10,.1) #simulate data
#explore likelihood function numericallyrho.seq=seq(.01,.4,length=300) #define a sequence of rho
lexp=function(rho,data){ #write likelihood functionn=length(data)n*log(rho) - rho*sum(data)}
plot(rho.seq,lexp(rho.seq,data),type='l') #plot L vs. rho
Likelihood
Directions: 1. Paste this code in to R 2. Try some different values for the mean of the exponential distribution (the second
argument of rexp) to see how the likelihood changes 3. Try sampling different numbers of points by changing the first argument of rexp to see
how sharply peaked the likelihood function becomes with more data
29
Likelihood
Posterior Distribution To get posterior distributions for the parameters, we use
a search algorithm to find the best values for each parameter
The search algorithm proceeds by obtaining samples from the posterior
We can summarize the results of all these steps using histograms
€
p(θ |D) =p(θ)p(D |θ)
p(D)
30
Posterior Distribution The posterior distribution
summarizes all that we know after analyzing the data
We get a posterior distribution for each parameter in the model
The posterior distribution gives the probability that the parameter takes a certain value
31
32
Posterior Quantile Summaries
Example Sections 2.1 – 2.4 from Albert (2007) - Bayesian
Computation with R
33
Outline Some cheerleading for Bayes Probability fundamentals Bayes’ Theorem Gibbs Sampling Supporting info for exercises
34
Gibbs sampling Getting the posterior from the previous exercise was
easy because we only had to worry about 1 parameter: the binomial proportion
Gibbs sampling is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables. The purpose of such a sequence is to to approximate the marginal distribution of each of the variables
The point of Gibbs sampling is that given a joint distribution it is simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution
35
Markov Chain Monte Carlo (MCMC)
Gibbs sampling is a type of MCMC
Markov chain means that a the sample at t+1 depends only on the sample at t
e.g. µt+1 ~ N( µt , some variance you set)
Monte Carlo means ‘wandering around’ sort of randomly
36
Gibbs Sampling - Algorithm
If yi ~ N(β1 + β2 X2 + … + βk Xk , τ) and
Then f(β, τ | y) ∝ ∏i=1 to nf(yi| β, τ) * ∏j=1 to kf(βj) * f(τ) posterior ∝ likelihood * βj priors * τ prior
The Gibbs Sampler works by repeatedly sampling each of the conditional distributions
1st iteration β1
t+1 ~ p(β1t | β2
t,…, βkt, τ t, Y)
β2t+1 ~ p(β2 | β1
t+1, β3 t ,…, βk
t, τ t, Y) … βk
t+1 ~ p(βk t | β1
t+1, β2 t+1 …, βk-1 t, τ t, Y)
τ t+1 ~ p(τ t | β1 t+1,…, βk
t+1, Y) 2nd iteration β1
t+2 ~ p(β1 t+1
| β2 t+1,…, βk
t+1, τ t+1, Y) β2
t+2 ~ p(β2 t+1 | β1
t+2, β3 t+1
,…, βk t+1, τ t+1, Y)
… βk
t+2 ~ p(βk t+1 | β1 t+2,…, βk-1
t+1, τ t+1, Y) τ t+2 ~ p(τ t+1 | β1
t+1,…, βk t+2, Y)
3rd iteration etc.
Gibbs Sampling
Target distribution is the posterior Proposal (Jump, Tuning) distribution is decides where
to go next In Gibbs sampling, the proposal distribution is a
random walk In Gibbs sampling, the proposed value is always
accepted
38
Gibbs sampling Example: MCMC Robot – a program by Paul Lewis
Follow link on course website
Input: a joint posterior distribution for all parameters Output: the trajectory of a random MCMC walk
Most commands you need are under the ‘Robot’ menu Clicking on the black area and dragging your mouse creates normal distributions with means at your starting point
39
MCMC Robot To start a chain, choose ‘start walking’ from the ‘robot’ menu Explore the options on the robot menu
Note that you can set the number of steps per run, which allows you to see more of each chain
Try making a more complex sample space ‘Clear’ erases the walk but not the landscape; you have to close and reopen to get new landscapes Use the ‘chains’ menu to enable multiple chains Use the ‘show’ menu to show these different chains You can read the very short help file if these settings are
unclear under the help menu See questions on next slide
40
MCMC robot
41
QUESTIONS: What similarities and differences do you notice among
different chains? What happens to the chain if you make 2 small islands in
opposite corners of the window?
What happens to the chain if you make 1 large and 1 small island?
Convergence and MCMC
The general questions to ask :
At what point have we converged to the stationary distribution? (i.e. how long should our “burn-in” period be?)
After we have reached the stationary distribution, how many iterations will it take to summarize the posterior distribution?
The answers to both of these questions remain a bit ad hoc because the desirable results that we depend on are only true asymptotically.
When there are island of high-probability states with no paths between them, the chain can get stuck.
Given infinite time it will emerge, but you probably don’t have that long.
42
43
Posterior Trace Plots
Plots the parameter value at time t against the iteration number.
If the model has converged, the trace plot will move around the mode of the distribution.
A clear sign of non-convergence with a trace plot occurs when we observe some trending in the sample space.
The problem with trace plots is that it may appear that we have converged, however, the chain is trapped in a local region rather exploring the full posterior.
44
Apparent Convergence and Non-Convergence
Convergence No convergence
MCMC
Convergence diagnostics • Geweke time series diagnostic • Gelman-Rubin diagnostic
To check for convergence, multiple starting points can be used
45
46
Posterior Parameter Density Plots
Sometimes non-convergence is reflected in lumpy distributions so search longer
Bad!
47
Example: Trace plots and posterior densities from a regression
install.packages('MCMCpack')library(MCMCpack) data(swiss) posterior1=MCMCregress(Fertility ~ Agriculture +
whatever you want, data=swiss, mcmc=200, burnin=0)summary(posterior1)plot(posterior1) Try different values for mcmc, which sets the number of posterior samples, and
for burnin, which sets the number of samples to throw out before collecting samples. What is a sufficient burnin length and number of samples to make sure that thereʼs not a long term trend in the trace plot and that the density plot is pretty smooth?#
Use different combinations of predictors to find the best model# Look at the posterior summary: How can you tell if a predictor should be
dropped?# Have a look at the help file for MCMCregress if you get bored and alter more
settings, like the starting values for each chain.#
Example: Regression Iterations = 1001:11000 Thinning interval = 1 Number of chains = 1 Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable, plus standard error of the mean:
Mean SD Naive SE Time-series SE (Intercept) 84.0109 5.8959 0.058959 0.0636739 Agriculture -0.0662 0.0818 0.000818 0.0008651 Education -0.9602 0.1930 0.001930 0.0021080 sigma2 94.2139 21.3430 0.213430 0.2253052
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5% (Intercept) 72.5952 80.0868 83.98372 87.94325 95.62394 Agriculture -0.2276 -0.1210 -0.06507 -0.01111 0.09113 Education -1.3458 -1.0873 -0.95727 -0.83240 -0.58836 sigma2 61.6284 78.8988 91.02573 105.73569 143.53792
48
Example: Regression
49