Lecture 2


Basics of Bayesian modeling

Transcript of Lecture 2

Page 1: Lecture 2

Lecture 2: Basics of Bayesian modeling

Shravan Vasishth

Department of Linguistics

University of Potsdam, Germany

October 9, 2014

1 / 83

Page 2: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some important concepts from lecture 1

The bivariate distribution: Independent (normal) random variables

If we have two random variables U0, U1, and we examine their joint distribution, we can plot a 3-d plot which shows u0, u1, and f(u0, u1). E.g., with two independent random variables U0, U1 ∼ N(0,1):

[Figure: 3-d plot of the bivariate density f(u0, u1); axes u0 and u1 from −3 to 3, height z up to about 0.15]

2 / 83

Page 3: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some important concepts from lecture 1

The bivariate distribution: Correlated random variables (r = 0.6)

The random variables U0, U1 could be correlated positively...

[Figure: 3-d plot of the bivariate density f(u0, u1) with positive correlation; axes u0 and u1 from −3 to 3, height z]

3 / 83

Page 4: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some important concepts from lecture 1

The bivariate distribution: Correlated random variables (r = −0.6)

. . . or negatively:

[Figure: 3-d plot of the bivariate density f(u0, u1) with negative correlation; axes u0 and u1 from −3 to 3, height z]

4 / 83

Page 5: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some important concepts from lecture 1

Bivariate distributions

This is why, when talking about two normal random variables U0 and U1, we have to talk about

1 U0’s mean and variance

2 U1’s mean and variance

3 the correlation r between them

A mathematically convenient way to talk about it is in terms of the variance-covariance matrix we saw in lecture 1:

\Sigma_u = \begin{pmatrix} \sigma_{u0}^2 & \rho_u \sigma_{u0}\sigma_{u1} \\ \rho_u \sigma_{u0}\sigma_{u1} & \sigma_{u1}^2 \end{pmatrix} = \begin{pmatrix} 0.61^2 & -0.51 \times 0.61 \times 0.23 \\ -0.51 \times 0.61 \times 0.23 & 0.23^2 \end{pmatrix} \quad (1)
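As a small illustration (my own sketch, not from the slides), the following R code builds this variance-covariance matrix and draws correlated samples from the corresponding bivariate normal; it assumes the MASS package is installed.

library(MASS)  # for mvrnorm()

## standard deviations and correlation from equation (1):
sd_u0 <- 0.61
sd_u1 <- 0.23
rho_u <- -0.51

## assemble the variance-covariance matrix Sigma_u:
Sigma_u <- matrix(c(sd_u0^2,           rho_u*sd_u0*sd_u1,
                    rho_u*sd_u0*sd_u1, sd_u1^2),
                  nrow=2, byrow=TRUE)

## draw 1000 correlated (u0, u1) pairs, each with mean 0:
u <- mvrnorm(n=1000, mu=c(0,0), Sigma=Sigma_u)

## the sample correlation should be close to -0.51:
cor(u[,1], u[,2])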

5 / 83

Page 6: Lecture 2

Lecture 2: Basics of Bayesian modeling

Introduction

Today’s goals

In this second lecture, my goals are to

1 Give you a feeling for how Bayesian analysis works using two simple examples that need only basic mathematical operations (addition, subtraction, division, multiplication).

2 Start thinking about priors for parameters in preparation for fitting linear mixed models.

3 Start fitting linear regression models in JAGS.

I will assign some homework which is designed to help you understand these concepts. Solutions will be discussed next week.

6 / 83

Page 7: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Random variables

A random variable X is a function X : S → R that associates to each outcome ω ∈ S exactly one number X(ω) = x.

S_X is all the x's (all the possible values of X, the support of X). I.e., x ∈ S_X.

Good example: number of coin tosses till H

X : ω → x

ω: H, TH, TTH, ... (infinite)

x = 0, 1, 2, ...; x ∈ S_X

7 / 83

Page 8: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Random variables

Every discrete (continuous) random variable X has associated with it a probability mass (density) function (pmf, pdf). PMF is used for discrete distributions and PDF for continuous.

p_X : S_X \rightarrow [0,1] \quad (2)

defined by

p_X(x) = P(X(\omega) = x), \quad x \in S_X \quad (3)

[Note: Books sometimes abuse notation by overloading the meaning of X. They usually have: p_X(x) = P(X = x), x ∈ S_X]

8 / 83

Page 9: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Random variables

Probability density functions (continuous case) or probability mass functions (discrete case) are functions that assign probabilities or relative frequencies to all events in a sample space. The expression

X \sim f(\cdot) \quad (4)

means that the random variable X has pdf/pmf f(·). For example, if we say that X ∼ Normal(µ, σ²), we are assuming that the pdf is

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \quad (5)
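As a quick sanity check (my addition, not in the slides), we can code up equation (5) by hand and confirm that it matches R's built-in dnorm:

## hand-coded normal density, following equation (5):
mydnorm <- function(x, mu, sigma2){
  (1/sqrt(2*pi*sigma2)) * exp(-(x - mu)^2/(2*sigma2))
}

## compare with dnorm (which expects the standard deviation, not the variance):
mydnorm(x=605, mu=600, sigma2=50)
dnorm(605, mean=600, sd=sqrt(50))
## the two values should be identical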

9 / 83

Page 10: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Random variables

We also need a cumulative distribution function or cdf because, in the continuous case, P(X = some point value) is zero and we therefore need a way to talk about P(X in a specific range). cdfs serve that purpose. In the continuous case, the cdf or distribution function is defined as:

P(x < k) = F(x < k) = \text{the area under the curve to the left of } k \quad (6)

For example, suppose X ∼ Normal(600, 50).

10 / 83

Page 11: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Random variables

We can ask for Prob(X < 600):

> pnorm(600,mean=600,sd=sqrt(50))

[1] 0.5

We can ask for the quantile that has 50% of the probability to the left of it:

> qnorm(0.5,mean=600,sd=sqrt(50))

[1] 600

. . . or to the right of it:

> qnorm(0.5,mean=600,sd=sqrt(50),lower.tail=FALSE)

[1] 600

11 / 83

Page 12: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Random variables

We can also calculate the probability that X lies between 590 and 610: Prob(590 < X < 610):

> pnorm(610,mean=600,sd=sqrt(50))-pnorm(590,mean=600,sd=sqrt(50))

[1] 0.8427008

12 / 83

Page 13: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Random variables

Another way to compute area under the curve is by simulation:

> x<-rnorm(10000,mean=600,sd=sqrt(50))

> ## proportion of cases where

> ## x is less than 590:

> mean(x<590)

[1] 0.0776

> ## theoretical value:

> pnorm(590,mean=600,sd=sqrt(50))

[1] 0.0786496

We will be doing this a lot.

13 / 83

Page 14: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Random variables

E.g., in linguistics we take as continuous random variables:

1 reading time: Here the random variable (RV) X has possible values w ranging from 0 ms to some upper bound b ms (or maybe unbounded?), and the RV X maps each possible value w to the corresponding number (0 to 0 ms, 1 to 1 ms, etc.).

2 acceptability ratings (technically not correct; but people generally treat ratings as continuous, at least in psycholinguistics)

3 EEG signals

In this course, due to time constraints, we will focus almost exclusively on reading time data (eye-tracking and self-paced reading).

14 / 83

Page 15: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Normal random variables

We will also focus mostly on the normal distribution.

f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty. \quad (7)

It is conventional to write X ∼ N(µ, σ²).

Important note: The normal distribution is represented differently in different programming languages:

1 R: dnorm(mean,sigma)

2 JAGS: dnorm(mean,precision) whereprecision = 1/variance

3 Stan: normal(mean,sigma)

I apologize in advance for all the confusion this will cause.
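To make the difference concrete, here is a small sketch (my own, with made-up numbers) of how the same distribution is handled in each convention:

## a normal distribution with mean 600 and standard deviation sqrt(50):
mu    <- 600
sigma <- sqrt(50)

## R and Stan use the standard deviation directly:
dnorm(605, mean=mu, sd=sigma)

## JAGS uses the precision tau = 1/variance;
## inside a JAGS model one would write: y ~ dnorm(mu, tau)
tau <- 1/sigma^2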

15 / 83

Page 16: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Normal random variables

[Figure: Normal density N(0,1); x ranges from −6 to 6, density from 0.0 to about 0.4]

16 / 83

Page 17: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Normal random variables

Standard or unit normal random variable:

If X is normally distributed with parameters µ and σ², then Z = (X − µ)/σ is normally distributed with parameters 0, 1. We conventionally write Φ(x) for the CDF:

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{y^2}{2}}\, dy \quad \text{where } y = (x-\mu)/\sigma \quad (8)

In R, we can type pnorm(x) to find out Φ(x). Suppose x = −2:

> pnorm(-2)

[1] 0.02275013

17 / 83

Page 18: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Normal random variables

If Z is a standard normal random variable (SNRV) then

P(Z \leq -x) = P(Z > x), \quad -\infty < x < \infty \quad (9)

We can check this with R:

> ##P(Z < -x):

> pnorm(-2)

[1] 0.02275013

> ##P(Z > x):

> pnorm(2,lower.tail=FALSE)

[1] 0.02275013

18 / 83

Page 19: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Normal random variables

Although the following expression looks scary:

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{y^2}{2}}\, dy \quad \text{where } y = (x-\mu)/\sigma \quad (10)

all it is saying is "find the area under the normal curve, ranging from −∞ to x". Always visualize! Φ(−2):

[Figure: the standard normal density dnorm(x, 0, 1), x from −3 to 3, with the area Φ(−2) shaded]

19 / 83

Page 20: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Normal random variables

Since Z = (X − µ)/σ is an SNRV whenever X is normally distributed with parameters µ and σ², the CDF of X can be expressed as:

F_X(a) = P(X \leq a) = P\left(\frac{X-\mu}{\sigma} \leq \frac{a-\mu}{\sigma}\right) = \Phi\left(\frac{a-\mu}{\sigma}\right) \quad (11)

Practical application: Suppose you know that X ∼ N(µ, σ²), and you know µ but not σ. If you know that a 95% confidence interval is [−q, +q], then you can work out σ by

a. computing the Z ∼ N(0,1) that has 2.5% of the area to its right:

> round(qnorm(0.025,lower.tail=FALSE),digits=2)

[1] 1.96

b. solving for σ in Z = (q − µ)/σ.

We will use this fact in Homework 0.

20 / 83
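A minimal sketch of this recipe in R (my own example with made-up numbers; the homework on slide 50 involves different values):

## suppose X ~ N(mu, sigma^2) with known mu = 100, unknown sigma,
## and a known 95% interval [80, 120]:
mu <- 100
q  <- 120   # upper end of the 95% interval

## (a) standard normal quantile with 2.5% of the area to its right:
z <- qnorm(0.025, lower.tail=FALSE)   # about 1.96

## (b) solve z = (q - mu)/sigma for sigma:
sigma <- (q - mu)/z
sigma   # roughly 10.2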

Page 21: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Normal random variables

Summary of useful commands:

## pdf of normal:

dnorm(x, mean = 0, sd = 1)

## compute area under the curve:

pnorm(q, mean = 0, sd = 1)

## find out the quantile that has

## area (probability) p under the curve:

qnorm(p, mean = 0, sd = 1)

## generate normally distributed data of size n:

rnorm(n, mean = 0, sd = 1)

21 / 83

Page 22: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Likelihood function (Normal distribution)

Let's assume that we have generated a data point from a particular normal distribution: x ∼ N(µ = 600, σ² = 50).

> x<-rnorm(1,mean=600,sd=sqrt(50))

Given x, and different values of µ, we can determine which µ is most likely to have generated x. You can eyeball the result:

[Figure: dnorm(x, mean = mu, sd = sqrt(50)) plotted as a function of mu, for mu from 400 to 800]
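The plot above can be produced with code along the following lines (my reconstruction, not the original plotting code):

## x is the single data point generated on this slide:
x <- rnorm(1, mean=600, sd=sqrt(50))
mu <- seq(400, 800, by=0.1)
## likelihood of each candidate mu, given the observed x:
plot(mu, dnorm(x, mean=mu, sd=sqrt(50)), type="l")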

22 / 83

Page 23: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Likelihood function

Suppose that we had generated 10 independent values of x:

> x<-rnorm(10,mean=600,sd=sqrt(50))

We can compute, for each of the x's, the likelihood that the mean of the Normal distribution that generated the data is µ, for different values of µ:

> ## mu = 500

> dnorm(x,mean=500,sd=sqrt(50))

[1] 2.000887e-48 6.542576e-43 3.383346e-31 1.423345e-46 7.571519e-43

[6] 1.686776e-41 3.214674e-39 2.604492e-40 6.892547e-50 1.186747e-52

Since each of the x's is independently generated, the total likelihood of the 10 x's is:

f(x_1) \times f(x_2) \times \dots \times f(x_{10}) \quad (12)

for some µ in f(·) = Normal(µ, 50).

23 / 83

Page 24: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Likelihood function

It's computationally easier to just take logs and sum them up (log likelihood):

\log f(x_1) + \log f(x_2) + \dots + \log f(x_{10}) \quad (13)

> ## mu = 500

> sum(dnorm(x,mean=500,sd=sqrt(50),log=TRUE))

[1] -986.1017
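To see why the log scale is used, compare the raw product of likelihoods with the summed log likelihoods (a small sketch; the exact numbers depend on the randomly generated x):

x <- rnorm(10, mean=600, sd=sqrt(50))

## raw product of the likelihoods: an extremely small number,
## which can underflow to 0 for badly fitting values of mu:
prod(dnorm(x, mean=500, sd=sqrt(50)))

## sum of the log likelihoods: numerically stable:
sum(dnorm(x, mean=500, sd=sqrt(50), log=TRUE))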

24 / 83

Page 25: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

(Log) likelihood function

We can now plot, for different values of µ, the likelihood that each of the µ generated the 10 data points:

> mu<-seq(400,800,by=0.1)

> liks<-rep(NA,length(mu))

> for(i in 1:length(mu)){

+ liks[i]<-sum(dnorm(x,mean=mu[i],sd=sqrt(50),log=TRUE))

+ }

> plot(mu,liks,type="l")

[Figure: the log likelihood liks plotted against mu, for mu from 400 to 800]

25 / 83

Page 26: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

(Log) likelihood function

1 It's intuitively clear that we'd probably want to declare the value of µ that brings us to the "highest" point in this figure.

2 This is the maximum likelihood estimate, MLE (see the sketch below).

3 Practical implication: In frequentist statistics, our data vector x is assumed to be X ∼ N(µ, σ²), and we attempt to figure out the MLE, i.e., the estimates of µ and σ² that would maximize the likelihood.

4 In Bayesian models, when we assume a uniform prior, we will get an estimate of the parameters which coincides with the MLE (examples coming soon).
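A small sketch of how the MLE can be read off the grid computed on the previous slide (assuming the vectors mu, liks, and x are still in memory); for a normal likelihood it coincides, up to grid resolution, with the sample mean:

## value of mu at which the log likelihood is highest:
mu[which.max(liks)]

## compare with the sample mean:
mean(x)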

26 / 83

Page 27: Lecture 2

Lecture 2: Basics of Bayesian modeling

Some preliminaries: Probability distributions

Bayesian modeling examples

Next, I will work through five relatively simple examples that use Bayes' Theorem.

1 Example 1: Proportions

2 Example 2: Normal distribution

3 Example 3: Linear regression with one predictor

4 Example 4: Linear regression with multiple predictors

5 Example 5: Generalized linear models example (binomial link).

27 / 83

Page 28: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

1 Recall the binomial distribution: Let X: no. successes in n trials. We generally assume that X ∼ Binomial(n, θ), θ unknown.

2 Suppose we have 46 successes out of 100. We generally use the empirically observed proportion 46/100 as our estimate of θ. I.e., we assume that the generating distribution is X ∼ Binomial(n = 100, θ = .46).

3 This is because, for all possible values of θ, going from 0 to 1, 0.46 has the highest likelihood.

28 / 83

Page 29: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

> dbinom(x=46,size=100,0.4)

[1] 0.03811036

> dbinom(x=46,size=100,0.46)

[1] 0.07984344

> dbinom(x=46,size=100,0.5)

[1] 0.0579584

> dbinom(x=46,size=100,0.6)

[1] 0.001487007

Actually, we could take a whole range of values of θ from 0 to 1 and plot the resulting probabilities.

29 / 83

Page 30: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

> theta<-seq(0,1,by=0.01)

> plot(theta,dbinom(x=46,size=100,theta),

+ xlab=expression(theta),type="l")

[Figure: dbinom(x = 46, size = 100, theta) plotted against θ, for θ from 0 to 1]

This is the likelihood function for the binomial distribution, and we will write it as f(data | θ). It is a function of θ.

30 / 83

Page 31: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

Since Binomial(x, n, θ) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x}, we can see that:

f(\text{data} \mid \theta) \propto \theta^{46}(1-\theta)^{54} \quad (14)

We are now going to use Bayes' theorem to work out the posterior distribution of θ given the data:

\underbrace{f(\theta \mid \text{data})}_{\text{posterior}} \propto \underbrace{f(\text{data} \mid \theta)}_{\text{likelihood}} \; \underbrace{f(\theta)}_{\text{prior}} \quad (15)

All that's missing here is the prior distribution f(θ). So let's try to define a prior for θ.

31 / 83

Page 32: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

To define a prior for θ, we will use a distribution called the Beta distribution, which takes two parameters, a and b. We can plot some Beta distributions to get a feel for what these parameters do (a sketch of the plotting code appears after the list). We are going to plot

1 Beta(a=2,b=2)

2 Beta(a=3,b=3)

3 Beta(a=6,b=6)

4 Beta(a=10,b=10)
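A minimal sketch (my own code, not the original) of how these four densities could be drawn in R:

x <- seq(0, 1, by=0.01)
plot(x, dbeta(x, shape1=2, shape2=2), type="l",
     ylim=c(0, 4), ylab="density", main="Beta density")
lines(x, dbeta(x, shape1=3, shape2=3), lty=2)
lines(x, dbeta(x, shape1=6, shape2=6), lty=3)
lines(x, dbeta(x, shape1=10, shape2=10), lty=4)
legend("topright", lty=1:4,
       legend=c("a=2,b=2","a=3,b=3","a=6,b=6","a=10,b=10"))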

32 / 83

Page 33: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

[Figure: Beta density; the four densities overlaid on X from 0 to 1; legend: a=2,b=2; a=3,b=3; a=6,b=6; a=60,b=60]

33 / 83

Page 34: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

Each successive density expresses increasing certainty about θ being centered around 0.5; notice that the spread about 0.5 is decreasing as a and b increase.

1 Beta(a=2,b=2)

2 Beta(a=3,b=3)

3 Beta(a=6,b=6)

4 Beta(a=10,b=10)

34 / 83

Page 35: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

1 If we don't have much prior information, we could use a small value such as a=b=2; this gives us a nearly flat prior (a=b=1 gives the exactly uniform prior); this is called an uninformative prior or non-informative prior.

2 If we have a lot of prior knowledge and/or a strong belief that θ has a particular value, we can use larger a, b to reflect our greater certainty about the parameter.

3 You can think of the parameter a as referring to the number of successes, and the parameter b to the number of failures.

So the beta distribution can be used to define the prior distribution of θ.

35 / 83

Page 36: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

Just for the sake of illustration, let's take four different beta priors, each reflecting increasing certainty.

1 Beta(a=2,b=2)

2 Beta(a=3,b=3)

3 Beta(a=6,b=6)

4 Beta(a=21,b=21)

Each (except perhaps the first) reflects a belief that θ = 0.5, with varying degrees of uncertainty.

Note an important fact: Beta(θ | a, b) ∝ θ^{a−1}(1−θ)^{b−1}.

This is because the Beta distribution is:

f(\theta \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1}(1-\theta)^{b-1}

36 / 83

Page 37: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

Now we just need to plug in the likelihood and the prior to get the posterior:

f(\theta \mid \text{data}) \propto f(\text{data} \mid \theta)\, f(\theta) \quad (16)

The four corresponding posterior distributions would be as follows (I hope I got the sums right!).

37 / 83

Page 38: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

f(\theta \mid \text{data}) \propto [\theta^{46}(1-\theta)^{54}][\theta^{2-1}(1-\theta)^{2-1}] = \theta^{47}(1-\theta)^{55} \quad (17)

f(\theta \mid \text{data}) \propto [\theta^{46}(1-\theta)^{54}][\theta^{3-1}(1-\theta)^{3-1}] = \theta^{48}(1-\theta)^{56} \quad (18)

f(\theta \mid \text{data}) \propto [\theta^{46}(1-\theta)^{54}][\theta^{6-1}(1-\theta)^{6-1}] = \theta^{51}(1-\theta)^{59} \quad (19)

f(\theta \mid \text{data}) \propto [\theta^{46}(1-\theta)^{54}][\theta^{21-1}(1-\theta)^{21-1}] = \theta^{66}(1-\theta)^{74} \quad (20)

38 / 83

Page 39: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

1 We can now visualize each of these triplets of priors, likelihoods and posteriors.

2 Note that I use the beta to model the likelihood because this allows me to visualize all three (prior, lik., posterior) in the same plot.

3 I first show the plot just for the prior θ ∼ Beta(a=2, b=2); a sketch of how such a plot can be drawn follows.
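Here is a rough sketch (my own code) of such a plot for the Beta(2,2) prior, rescaling the likelihood as a Beta(46+1, 54+1) density so that all three curves are on the same scale:

theta <- seq(0, 1, by=0.001)
post  <- dbeta(theta, 46+2, 54+2)   # posterior Beta(48,56)
lik   <- dbeta(theta, 46+1, 54+1)   # likelihood rescaled as Beta(47,55)
prior <- dbeta(theta, 2, 2)         # Beta(2,2) prior
plot(theta, post, type="l", xlab="X", ylab="density")
lines(theta, lik, lty=2)
lines(theta, prior, lty=3)
legend("topright", lty=1:3, legend=c("post","lik","prior"))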

39 / 83

Page 40: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

Beta(2,2) prior: posterior is shifted just a bit to the right compared to the likelihood

[Figure: prior, likelihood, and posterior for the Beta(2,2) prior, X from 0 to 1]

40 / 83

Page 41: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

Beta(6,6) prior: Posterior shifts even more towards the prior

[Figure: prior, likelihood, and posterior for the Beta(6,6) prior, X from 0 to 1]

41 / 83

Page 42: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

Beta(21,21) prior: Posterior shifts even more towards the prior

[Figure: prior, likelihood, and posterior for the Beta(21,21) prior, X from 0 to 1]

42 / 83

Page 43: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

Proportions

Basically, the posterior is a compromise between the prior and the likelihood.

1 When the prior has high uncertainty or we have a lot of data, the likelihood will dominate.

2 When the prior has high certainty (like the Beta(21,21) case), and we have very little data, then the prior will dominate.

So, Bayesian methods are particularly important when you have little data but a whole lot of expert knowledge. But they are also useful for standard psycholinguistic research, as I hope to demonstrate.

43 / 83

Page 44: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

An example application

(This exercise is from a Copenhagen course I took in 2012. I quote it almost verbatim.)

"The French mathematician Pierre-Simon Laplace (1749-1827) was the first person to show definitively that the proportion of female births in the French population was less than 0.5, in the late 18th century, using a Bayesian analysis based on a uniform prior distribution. Suppose you were doing a similar analysis but you had more definite prior beliefs about the ratio of male to female births. In particular, if θ represents the proportion of female births in a given population, you are willing to place a Beta(100,100) prior distribution on θ.

1 Show that this means you are more than 95% sure that θ is between 0.4 and 0.6, although you are ambivalent as to whether it is greater or less than 0.5.

2 Now you observe that out of a random sample of 1,000 births, 511 are boys. What is your posterior probability that θ > 0.5?"

44 / 83

Page 45: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

An example application

Show that this means you are more than 95% sure that θ is between 0.4 and 0.6, although you are ambivalent as to whether it is greater or less than 0.5.

Prior: Beta(a=100, b=100)

> round(qbeta(0.025,shape1=100,shape2=100),digits=1)

[1] 0.4

> round(qbeta(0.975,shape1=100,shape2=100),digits=1)

[1] 0.6

> ## ambivalent as to whether theta <0.5 or not:

> round(pbeta(0.5,shape1=100,shape2=100),digits=1)

[1] 0.5

[Exercise: Draw the prior distribution]

45 / 83

Page 46: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 1: Proportions

An example application

Now you observe that out of a random sample of 1,000 births, 511 are boys. What is your posterior probability that θ > 0.5?

Prior: Beta(a=100, b=100)
Data: 489 girls out of 1000.
Posterior:

f(\theta \mid \text{data}) \propto [\theta^{489}(1-\theta)^{511}][\theta^{100-1}(1-\theta)^{100-1}] = \theta^{588}(1-\theta)^{610} \quad (21)

Since Beta(θ | a, b) ∝ θ^{a−1}(1−θ)^{b−1}, the posterior is Beta(a=589, b=611). Therefore the posterior probability that θ > 0.5 is the area to the right of 0.5 under this Beta density:

> pbeta(0.5,shape1=589,shape2=611,lower.tail=FALSE)

which comes to roughly 0.26. (Note that qbeta(0.5, shape1=589, shape2=611) would instead give the posterior median of θ, about 0.49.)

46 / 83

Page 47: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 2: Normal distribution

Normal distribution

The normal distribution is the most frequently used probability model in psycholinguistics.

If x̄ is the sample mean, the sample size is n, and the sample variance is known to be σ², and if the prior on the mean µ is Normal(m, v), then it is pretty easy to derive (see the Lynch textbook) the posterior mean m* and variance v* analytically:

v^* = \frac{1}{\frac{1}{v} + \frac{n}{\sigma^2}} \qquad m^* = v^*\left(\frac{m}{v} + \frac{n\bar{x}}{\sigma^2}\right) \quad (22)

E[\theta \mid x] = m \times \frac{w_1}{w_1+w_2} + \bar{x} \times \frac{w_2}{w_1+w_2}, \qquad w_1 = v^{-1}, \; w_2 = (\sigma^2/n)^{-1} \quad (23)

So: the posterior mean is a weighted mean of the prior and likelihood.
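A small sketch of equation (22) as an R function (my own illustration with made-up numbers; the homework on slide 50 uses different values):

## posterior mean and variance for a normal likelihood with known
## variance sigma2 and a Normal(m, v) prior on the mean (equation 22):
post_normal <- function(m, v, xbar, sigma2, n){
  v_star <- 1/(1/v + n/sigma2)
  m_star <- v_star * (m/v + n*xbar/sigma2)
  c(mean=m_star, var=v_star)
}

## made-up example: prior N(0, 1), 10 observations with mean 0.5,
## known variance sigma2 = 4:
post_normal(m=0, v=1, xbar=0.5, sigma2=4, n=10)
## the posterior mean lies between the prior mean 0 and the sample mean 0.5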

47 / 83

Page 48: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 2: Normal distribution

Normal distribution

1 The weight w1 is determined by the inverse of the prior variance.

2 The weight w2 is determined by the inverse of the squared standard error (σ²/n).

3 It is common in Bayesian statistics to talk about precision = 1/variance, so that

1 w_1 = v^{-1} = precision_prior

2 w_2 = (σ²/n)^{-1} = precision_data

48 / 83

Page 49: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 2: Normal distribution

Normal distribution

If w1 is very large compared to w2, then the posterior mean will be determined mainly by the prior mean m:

E[\theta \mid x] = m \times \frac{w_1}{w_1+w_2} + \bar{x} \times \frac{w_2}{w_1+w_2}, \qquad w_1 = v^{-1}, \; w_2 = (\sigma^2/n)^{-1} \quad (24)

If w2 is very large compared to w1, then the posterior mean will be determined mainly by the sample mean x̄:

E[\theta \mid x] = m \times \frac{w_1}{w_1+w_2} + \bar{x} \times \frac{w_2}{w_1+w_2}, \qquad w_1 = v^{-1}, \; w_2 = (\sigma^2/n)^{-1} \quad (25)

49 / 83

Page 50: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 2: Normal distribution

An example application

[This is an old exam problem from Sheffield University]

Let's say there is a hormone measurement test that yields a numerical value that can be positive or negative. We know the following:

The doctor's prior: 75% interval ("patient healthy") is [-0.3, 0.3].

Data from patient: x = 0.2, known σ = 0.15.

Compute the posterior N(m*, v*).

I'll leave this as Homework 0 (solution next week). Hint: see slide 20.

50 / 83

Page 51: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Simple linear regression

We begin with a simple example. Let the response variable be y_i, i = 1, ..., n, and let there be p predictors, x_{1i}, ..., x_{pi}. Also, let

y_i \sim N(\mu_i, \sigma^2), \qquad \mu_i = \beta_0 + \sum \beta_k x_{ki} \quad (26)

(the summation is over the p predictors, i.e., k = 1, ..., p). We need to specify a prior distribution for the parameters:

\beta_k \sim \text{Normal}(0, 100^2) \qquad \log\sigma \sim \text{Unif}(-100, 100) \quad (27)

51 / 83

Page 52: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Better looking professors get better teaching evaluations

Source: Gelman and Hill 2007

> beautydata<-read.table("data/beauty.txt",header=T)

> ## Note: beauty level is centered.

> head(beautydata)

beauty evaluation

1 0.2015666 4.3

2 -0.8260813 4.5

3 -0.6603327 3.7

4 -0.7663125 4.3

5 1.4214450 4.4

6 0.5002196 4.2

> ## restate the data as a list for JAGS:

> data<-list(x=beautydata$beauty,

+ y=beautydata$evaluation)

52 / 83

Page 53: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Better looking professors get better teaching evaluations

We literally follow the specification of the linear model given above. We specify the model for the data frame row by row, using a for loop, so that for each dependent variable value y_i (the evaluation score) we specify how we believe it was generated.

y_i \sim \text{Normal}(\mu_i, \sigma^2) \qquad i = 1, \dots, 463 \quad (28)

\mu_i = \beta_0 + \beta_1 x_i \qquad \text{Note: predictor is centered} \quad (29)

Define priors on the β and on σ:

\beta_0 \sim \text{Uniform}(-10, 10) \qquad \beta_1 \sim \text{Uniform}(-10, 10) \quad (30)

\sigma \sim \text{Uniform}(0, 100) \quad (31)

53 / 83

Page 54: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Better looking professors get better teaching evaluations

> library(rjags)

> cat("

+ model

+ {

+ ## specify model for data:

+ for(i in 1:463){

+ y[i] ~ dnorm(mu[i],tau)

+ mu[i] <- beta0 + beta1 * (x[i])

+ }

+ # priors:

+ beta0 ~ dunif(-10,10)

+ beta1 ~ dunif(-10,10)

+ sigma ~ dunif(0,100)

+ sigma2 <- pow(sigma,2)

+ tau <- 1/sigma2

+ }",

+ file="JAGSmodels/beautyexample1.jag" )

54 / 83

Page 55: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Better looking professors get better teaching evaluations

Data from Gelman and Hill, 2007

Some things to note in JAGS syntax:

1 The normal distribution is defined in terms of precision, not variance.

2 ~ means "is generated by"

3 <- is a deterministic assignment, like = in mathematics.

4 The model specification is declarative, order does not matter. For example, we could have written the following in any order:

sigma ~ dunif(0,100)

sigma2 <- pow(sigma,2)

tau <- 1/sigma2

55 / 83

Page 56: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Better looking professors get better teaching evaluations

> ## specify which variables you want to examine

> ## the posterior distribution of:

> track.variables<-c("beta0","beta1","sigma")

> ## define model:

> beauty.mod <- jags.model(

+ file = "JAGSmodels/beautyexample1.jag",

+ data=data,

+ n.chains = 1,

+ n.adapt =2000,

+ quiet=T)

> ## sample from posterior:

> beauty.res <- coda.samples(beauty.mod,

+ var = track.variables,

+ n.iter = 2000,

+ thin = 1 )

56 / 83

Page 57: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Better looking professors get better teaching evaluations

> round(summary(beauty.res)$statistics[,1:2],digits=2)

Mean SD

beta0 4.01 0.03

beta1 0.13 0.03

sigma 0.55 0.02

> round(summary(beauty.res)$quantiles[,c(1,3,5)],digits=2)

2.5% 50% 97.5%

beta0 3.96 4.01 4.06

beta1 0.07 0.13 0.19

sigma 0.51 0.55 0.58

57 / 83

Page 58: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Better looking professors get better teaching evaluations

Compare with standard lm fit:

> lm_summary<-summary(lm(evaluation~beauty,

+ beautydata))

> round(lm_summary$coef,digits=2)

Estimate Std. Error t value Pr(>|t|)

(Intercept) 4.01 0.03 157.21 0

beauty 0.13 0.03 4.13 0

> round(lm_summary$sigma,digits=2)

[1] 0.55

Note that: (a) with uniform priors, we get a Bayesian estimate equal to the MLE, (b) we get uncertainty estimates for σ in the Bayesian model.

58 / 83

Page 59: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Better looking professors get better teaching evaluations

Posterior distributions of parameters

> densityplot(beauty.res)

[Figure: posterior density plots for beta0 (roughly 3.95-4.10), beta1 (roughly 0.00-0.25), and sigma (roughly 0.50-0.60)]

59 / 83

Page 60: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Better looking professors get better teaching evaluations

Posterior distributions of parameters

> op<-par(mfrow=c(1,3),pty="s")

> traceplot(beauty.res)

[Figure: trace plots of beta0, beta1, and sigma across iterations 2000-4000]

60 / 83

Page 61: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Better looking professors get better teaching evaluations

Posterior distributions of parameters

> MCMCsamp<-as.matrix(beauty.res)

> op<-par(mfrow=c(1,3),pty="s")

> hist(MCMCsamp[,1],main=expression(beta[0]),

+ xlab="",freq=FALSE)

> hist(MCMCsamp[,2],main=expression(beta[1]),

+ xlab="",freq=FALSE)

> hist(MCMCsamp[,3],main=expression(sigma),

+ xlab="",freq=FALSE)

[Figure: histograms of the posterior samples for β0, β1, and σ]

61 / 83

Page 62: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Homework 1: Rats' weights

Source: Lunn et al 2012

Five measurements of a rat's weight, in grams, as a function of some x (say some nutrition-based variable). Note that here we will center the predictor in the model code.

First we load/enter the data:

> data<-list(x=c(8,15,22,29,36),

+ y=c(177,236,285,350,376))

Then we fit the linear model using lm, for comparison with the Bayesian model:

> lm_summary_rats<-summary(fm<-lm(y~x,data))

> round(lm_summary_rats$coef,digits=3)

Estimate Std. Error t value Pr(>|t|)

(Intercept) 123.886 11.879 10.429 0.002

x 7.314 0.492 14.855 0.001

62 / 83

Page 63: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 3: Linear regression

Homework 2: Rats' weights (solution in next class)

Fit the following model in JAGS:

> cat("

+ model

+ {

+ ## specify model for data:

+ for(i in 1:5){

+ y[i] ~ dnorm(mu[i],tau)

+ mu[i] <- beta0 + beta1 * (x[i]-mean(x[]))

+ }

+ # priors:

+ beta0 ~ dunif(-500,500)

+ beta1 ~ dunif(-500,500)

+ tau <- 1/sigma2

+ sigma2 <-pow(sigma,2)

+ sigma ~ dunif(0,200)

+ }",

+ file="JAGSmodels/ratsexample2.jag" )

63 / 83

Page 64: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 4: Multiple regression

Multiple predictors

Source: Baayen's book

We fit log reading time to Trial id (centered), Native Language, and Sex. The categorical variables are centered as well.

> lexdec<-read.table("data/lexdec.txt",header=TRUE)

> data<-lexdec[,c(1,2,3,4,5)]

> contrasts(data$NativeLanguage)<-contr.sum(2)

> contrasts(data$Sex)<-contr.sum(2)

> lm_summary_lexdec<-summary(fm<-lm(RT~scale(Trial,scale=F)+

+ NativeLanguage+Sex,data))

> round(lm_summary_lexdec$coef,digits=2)

Estimate Std. Error t value Pr(>|t|)

(Intercept) 6.41 0.01 1061.53 0.00

scale(Trial, scale = F) 0.00 0.00 -2.39 0.02

NativeLanguage1 -0.08 0.01 -14.56 0.00

Sex1 -0.03 0.01 -5.09 0.00

64 / 83

Page 65: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 4: Multiple regression

Multiple predictors

Preparing the data for JAGS:

> dat<-list(y=data$RT,

+ Trial=(data$Trial-mean(data$Trial)),

+ Lang=data$NativeLanguage,

+ Sex=data$Sex)

65 / 83

Page 66: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 4: Multiple regression

Multiple predictors

The JAGS model:

> cat("

+ model

+ {

+ ## specify model for data:

+ for(i in 1:1659){

+ y[i] ~ dnorm(mu[i],tau)

+ mu[i] <- beta0 +

+ beta1 * Trial[i]+

+ beta2 * Lang[i] + beta3 * Sex[i]

+ }

+ # priors:

+ beta0 ~ dunif(-10,10)

+ beta1 ~ dunif(-5,5)

+ beta2 ~ dunif(-5,5)

+ beta3 ~ dunif(-5,5)

+ tau <- 1/sigma2

+ sigma2 <-pow(sigma,2)

+ sigma ~ dunif(0,200)

+ }",

+ file="JAGSmodels/multregexample1.jag" )

66 / 83

Page 67: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 4: Multiple regression

Multiple predictors

> track.variables<-c("beta0","beta1",

+ "beta2","beta3","sigma")

> library(rjags)

> lexdec.mod <- jags.model(

+ file = "JAGSmodels/multregexample1.jag",

+ data=dat,

+ n.chains = 1,

+ n.adapt =2000,

+ quiet=T)

> lexdec.res <- coda.samples( lexdec.mod,

+ var = track.variables,

+ n.iter = 3000)

67 / 83

Page 68: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 4: Multiple regression

Multiple predictors

> round(summary(lexdec.res)$statistics[,1:2],

+ digits=2)

Mean SD

beta0 6.06 0.03

beta1 0.00 0.00

beta2 0.17 0.01

beta3 0.06 0.01

sigma 0.23 0.00

> round(summary(lexdec.res)$quantiles[,c(1,3,5)],

+ digits=2)

2.5% 50% 97.5%

beta0 6.01 6.06 6.12

beta1 0.00 0.00 0.00

beta2 0.14 0.17 0.19

beta3 0.04 0.06 0.09

sigma 0.22 0.23 0.23

68 / 83

Page 69: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 4: Multiple regression

Multiple predictors

Here is the lm fit for comparison (I center all predictors):

> lm_summary_lexdec<-summary(

+ fm<-lm(RT~scale(Trial,scale=F)

+ +NativeLanguage+Sex,

+ lexdec))

> round(lm_summary_lexdec$coef,digits=2)[,1:2]

Estimate Std. Error

(Intercept) 6.29 0.01

scale(Trial, scale = F) 0.00 0.00

NativeLanguageOther 0.17 0.01

SexM 0.06 0.01

Note: We should have fit a linear mixed model here; I will return to this later.

69 / 83

Page 70: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 5: Generalized linear models

GLMs

We consider the model

y_i \sim \text{Binomial}(p_i, n_i) \qquad \text{logit}(p_i) = \beta_0 + \beta_1 (x_i - \bar{x}) \quad (32)

A simple example is beetle data from Dobson et al 2010:

> beetledata<-read.table("data/beetle.txt",header=T)

> head(beetledata)

dose number killed

1 1.6907 59 6

2 1.7242 60 13

3 1.7552 62 18

4 1.7842 56 28

5 1.8113 63 52

6 1.8369 59 53

70 / 83

Page 71: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 5: Generalized linear models

GLMs

Prepare data for JAGS:

> dat<-list(x=beetledata$dose-mean(beetledata$dose),

+ n=beetledata$number,

+ y=beetledata$killed)

71 / 83

Page 72: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 5: Generalized linear models

GLMs

> cat("

+ model

+ {

+ for(i in 1:8){

+ y[i] ~ dbin(p[i],n[i])

+ logit(p[i]) <- beta0 + beta1 * x[i]

+ }

+ # priors:

+ beta0 ~ dunif(0,pow(100,2))

+ beta1 ~ dunif(0,pow(100,2))

+ }",

+ file="JAGSmodels/glmexample1.jag" )

72 / 83

Page 73: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 5: Generalized linear models

GLMs

Notice use of initial values:

> track.variables<-c("beta0","beta1")

> ## new:

> inits <- list (list(beta0=0,

+ beta1=0))

> glm.mod <- jags.model(

+ file = "JAGSmodels/glmexample1.jag",

+ data=dat,

+ ## new:

+ inits=inits,

+ n.chains = 1,

+ n.adapt =2000, quiet=T)

> glm.res <- coda.samples( glm.mod,

+ var = track.variables,

+ n.iter = 2000)

73 / 83

Page 74: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 5: Generalized linear models

GLMs

> round(summary(glm.res)$statistics[,1:2],

+ digits=2)

Mean SD

beta0 0.75 0.13

beta1 34.53 2.84

> round(summary(glm.res)$quantiles[,c(1,3,5)],

+ digits=2)

2.5% 50% 97.5%

beta0 0.49 0.75 1.03

beta1 29.12 34.51 40.12

74 / 83

Page 75: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 5: Generalized linear models

GLMs

The values match up with glm output:

> round(coef(glm(killed/number~scale(dose,scale=F),

+ weights=number,

+ family=binomial(),beetledata)),

+ digits=2)

(Intercept) scale(dose, scale = F)

0.74 34.27
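Since the coefficients are on the logit scale, one quick way to interpret them (my own sketch, using the point estimates printed above) is to map fitted values back to probabilities with plogis:

## estimated probability of a beetle being killed at the mean dose
## (centered predictor = 0), using the point estimates shown above:
beta0 <- 0.74
beta1 <- 34.27
plogis(beta0 + beta1 * 0)      # roughly 0.68

## at a dose 0.05 units above the mean dose:
plogis(beta0 + beta1 * 0.05)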

75 / 83

Page 76: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 5: Generalized linear models

GLMs

> plot(glm.res)

[Figure: trace plots and posterior densities of beta0 (N = 2000, bandwidth 0.03123) and beta1 (N = 2000, bandwidth 0.6572)]

76 / 83

Page 77: Lecture 2

Lecture 2: Basics of Bayesian modeling

Example 5: Generalized linear models

Homework 3: GLMs

We fit uniform priors to the coefficients β:

# priors:

beta0 ~ dunif(0,pow(100,2))

beta1 ~ dunif(0,pow(100,2))

Fit the beetle data again, using suitable normal distribution priors for the coefficients beta0 and beta1. Does the posterior distribution depend on the prior?

77 / 83

Page 78: Lecture 2

Lecture 2: Basics of Bayesian modeling

Prediction

GLMs: Predicting future/missing data

One important thing we can do is to predict the posterior distribution of future or missing data. One easy way to do this is to define how we expect the predicted data to be generated. This example revisits the earlier toy example from Lunn et al. on rat data (slide 62).

> data<-list(x=c(8,15,22,29,36),

+ y=c(177,236,285,350,376))

78 / 83

Page 79: Lecture 2

Lecture 2: Basics of Bayesian modeling

Prediction

GLMs: Predicting future/missing data

> cat("

+ model

+ {

+ ## specify model for data:

+ for(i in 1:5){

+ y[i] ~ dnorm(mu[i],tau)

+ mu[i] <- beta0 + beta1 * (x[i]-mean(x[]))

+ }

+ ## prediction

+ mu45 <- beta0+beta1 * (45-mean(x[]))

+ y45 ~ dnorm(mu45,tau)

+ # priors:

+ beta0 ~ dunif(-500,500)

+ beta1 ~ dunif(-500,500)

+ tau <- 1/sigma2

+ sigma2 <-pow(sigma,2)

+ sigma ~ dunif(0,200)

+ }",

+ file="JAGSmodels/ratsexample2pred.jag" )

79 / 83

Page 80: Lecture 2

Lecture 2: Basics of Bayesian modeling

Prediction

GLMs: Predicting future/missing data

> track.variables<-c("beta0","beta1","sigma","y45")

> rats.mod <- jags.model(

+ file = "JAGSmodels/ratsexample2pred.jag",

+ data=data,

+ n.chains = 1,

+ n.adapt =2000, quiet=T)

> rats.res <- coda.samples( rats.mod,

+ var = track.variables,

+ n.iter = 2000,

+ thin = 1)

80 / 83

Page 81: Lecture 2

Lecture 2: Basics of Bayesian modeling

Prediction

GLMs: Predicting future/missing data

> round(summary(rats.res)$statistics[,1:2],

+ digits=2)

Mean SD

beta0 284.99 11.72

beta1 7.34 1.26

sigma 20.92 16.46

y45 453.87 37.37

81 / 83

Page 82: Lecture 2

Lecture 2: Basics of Bayesian modeling

Prediction

GLMs: Predicting future/missing data

[Figure: posterior density plots for beta0, beta1, sigma, and the predicted value y45]

82 / 83

Page 83: Lecture 2

Lecture 2: Basics of Bayesian modeling

Concluding remarks

Summing up

1 In some cases Bayes' Theorem can be used analytically (Examples 1, 2).

2 It is relatively easy to define different kinds of Bayesian models using programming languages like JAGS.

3 Today we saw some examples from linear models (Examples 3-5).

4 Next week: Linear mixed models.

83 / 83