Bayesian Essentials and Bayesian Regression


Page 1: Bayesian Essentials and Bayesian Regression

1

Bayesian Essentials and Bayesian Regression

Page 2: Bayesian Essentials and Bayesian Regression

2

Distribution Theory 101

Marginal and Conditional Distributions:

p_Y(y) = ∫ p_{X,Y}(x, y) dx = ∫ p_{Y|X}(y | x) p_X(x) dx

p_X(x) = ∫ p_{X,Y}(x, y) dy

Example: p_{X,Y}(x, y) = 2 on the triangle 0 < y < x < 1.

[Figure: the support of (X, Y) is the triangle below the 45-degree line in the unit square, with X and Y each running from 0 to 1.]

p_X(x) = ∫₀ˣ 2 dy = 2y |₀ˣ = 2x

p_{Y|X}(y | x) = p_{X,Y}(x, y) / p_X(x) = 2 / (2x) = 1/x,   y ∈ (0, x) -- uniform

Page 3: Bayesian Essentials and Bayesian Regression

3

Simulating from Joint

p_{X,Y}(x, y) = p_{Y|X}(y | x) p_X(x)

To draw from the joint:
i. draw from the marginal on X
ii. condition on this draw, and draw from the conditional of Y|X
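A minimal sketch of this two-step scheme in R, using the example from the previous page (p(x,y) = 2 on 0 < y < x < 1, so the marginal of X has density 2x and Y|X=x is uniform on (0,x)); the function name and number of draws are just illustrative:

sim_joint = function(R) {
  x = sqrt(runif(R))               # inverse-CDF draw from the marginal p(x) = 2x on (0,1)
  y = runif(R, min = 0, max = x)   # draw from the conditional Y|X=x ~ uniform(0, x)
  cbind(x = x, y = y)
}
draws = sim_joint(10000)
colMeans(draws)                    # compare to E[X] = 2/3, E[Y] = 1/3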

Page 4: Bayesian Essentials and Bayesian Regression

4

The Goal of Inference

Make inferences about unknown quantities using available information.

Inference -- make probability statements

unknowns --

parameters, functions of parameters, states or latent variables, “future” outcomes, outcomes conditional on an action

Information –

data-based

non data-based

theories of behavior; "subjective views": e.g., there is an underlying structure, parameters are finite or lie in some range

Page 5: Bayesian Essentials and Bayesian Regression

5

Data Aspects of Marketing Problems

Ex: Conjoint Survey
500 respondents rank, rate, or choose among product configurations
small amount of information per respondent
response variable is discrete

Ex: Retail Scanning Data
very large number of products
large number of geographical units (markets, zones, stores)
limited variation in some marketing-mix variables

Must make plausible predictions for decision making!

Page 6: Bayesian Essentials and Bayesian Regression

6

The likelihood principle

LP: the likelihood contains all information relevant for inference. That is, as long as I have the same likelihood function, I should make the same inferences about the unknowns.

Implies analysis is done conditional on the data.

ℓ(θ) ∝ p(D | θ)

Note: any function proportional to data density can be called the likelihood.

Page 7: Bayesian Essentials and Bayesian Regression

7

p(θ | D) ∝ p(D | θ) p(θ)

Posterior ∝ "Likelihood" × Prior

Modern Bayesian computing -- simulation methods for generating draws from the posterior distribution p(θ|D).

Bayes theorem:

p(θ | D) = p(θ, D) / p(D) = p(D | θ) p(θ) / p(D)

Page 8: Bayesian Essentials and Bayesian Regression

8

Summarizing the posterior

Output from Bayesian inference: a high-dimensional distribution, p(θ | D).

Summarize this object via simulation: marginal distributions of θ_i, of h(θ); don't just compute E[θ | D], Var[θ | D].
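For example, a small sketch of summarizing simulator output in R (the matrix of draws here is only a placeholder standing in for real posterior draws; rows are draws, columns are elements of θ):

thetadraws = matrix(rnorm(2000 * 3), ncol = 3)             # placeholder R x k matrix of draws
apply(thetadraws, 2, quantile, probs = c(.025, .5, .975))  # marginal intervals for each theta_i
hdraws = thetadraws[, 1] / thetadraws[, 2]                 # draws of h(theta) = theta_1/theta_2
mean(hdraws); quantile(hdraws, c(.025, .975))              # summarize any function of theta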

Page 9: Bayesian Essentials and Bayesian Regression

9

Prediction

p(D̃ | D) = ∫ p(D̃ | θ) p(θ | D) dθ    (not p(D̃ | θ̂) !!!)

See D, compute p(D̃ | D): the "Predictive Distribution"

assumes p(D̃, D | θ) = p(D̃ | θ) p(D | θ)

Page 10: Bayesian Essentials and Bayesian Regression

10

Decision theory

Loss: L(a, θ), where a = action; θ = state of nature

Bayesian decision theory: choose the action that minimizes expected (posterior) loss,

min_a L̄(a) = E_{θ|D}[ L(a, θ) ] = ∫ L(a, θ) p(θ | D) dθ

Estimation problem is a special case:

min_θ̂ L̄(θ̂);   typically, L(θ̂, θ) = (θ̂ - θ)' A (θ̂ - θ)

Page 11: Bayesian Essentials and Bayesian Regression

11

Sampling properties of Bayes estimators

r_θ̂(θ) = E_{D|θ}[ L(θ̂(D), θ) ] = ∫ L(θ̂(D), θ) p(D | θ) dD

An estimator is admissible if there exists no other estimator with lower risk for all values of θ. Bayes estimators minimize expected (average) risk, which implies they are admissible:

E_θ[ r_θ̂(θ) ] = E_θ E_{D|θ}[ L(θ̂, θ) ] = E_D E_{θ|D}[ L(θ̂, θ) ]

The Bayes estimator does the best for every D. Therefore, it must do at least as well as any other estimator.

Page 12: Bayesian Essentials and Bayesian Regression

12

Bayes Inference: Summary

Bayesian Inference delivers an integrated approach to:
Inference -- including "estimation" and "testing"
Prediction -- with a full accounting for uncertainty
Decision -- with likelihood and loss (these are distinct!)

Bayesian Inference is conditional on available info.

The right answer to the right question.

Bayes estimators are admissible. All admissible estimators are Bayes (Complete Class Thm). Which Bayes estimator?

Page 13: Bayesian Essentials and Bayesian Regression

13

Bayes/Classical Estimators

As n → ∞,

p(θ | D) ≈ N( θ̂_MLE, H^{-1} )

where H is the information matrix (the negative Hessian of the log-likelihood at the MLE).

Prior washes out -- locally uniform!!! Bayes is consistent unless you have a dogmatic prior.

Page 14: Bayesian Essentials and Bayesian Regression

14

Benefits/Costs of Bayes Inf

Benefits:
finite-sample answer to the right question
full accounting for uncertainty
integrated approach to inference/decision

"Costs":
computational (true any more? < classical!!)
prior (cost or benefit?), esp. with many parameters (hierarchical/non-parametric problems)

Page 15: Bayesian Essentials and Bayesian Regression

15

Bayesian Computations

Before simulation methods, Bayesians used posterior expectations of various functions as the summary of the posterior:

E[ h(θ) | D ] = ∫ h(θ) p(θ | D) dθ = ∫ h(θ) [ p(D | θ) p(θ) / p(D) ] dθ

note: p(D) = ∫ p(D | θ) p(θ) dθ

If p(θ|D) is in a convenient form (e.g., normal), then I might be able to compute this analytically for some h. Via iid simulation, it can be computed for any h.

Page 16: Bayesian Essentials and Bayesian Regression

16

Conjugate Families

Models with convenient analytic properties almost invariably come from conjugate families.

Why do I care now?
conjugate models are used as building blocks
build intuition about the workings of Bayesian inference

Definition: A prior is conjugate to a likelihood if the posterior is in the same class of distributions as the prior.

Basically, conjugate priors are like the posterior from some imaginary dataset analyzed with a diffuse prior.

Page 17: Bayesian Essentials and Bayesian Regression

17

Beta-Binomial model

y_i ~ Bernoulli(θ),   i = 1, …, n

p(y_1, …, y_n | θ) = Π_{i=1}^n θ^{y_i} (1 - θ)^{1-y_i} = θ^y (1 - θ)^{n-y}

where y = Σ_{i=1}^n y_i

p(θ | y) = ?

Need a prior!

Page 18: Bayesian Essentials and Bayesian Regression

18

Beta distribution

θ ~ Beta(α, β):   p(θ) ∝ θ^{α-1} (1 - θ)^{β-1}

E[θ] = α / (α + β)

[Figure: Beta(α, β) densities on (0, 1) for (α, β) = (2, 4), (3, 3), (4, 2).]

Page 19: Bayesian Essentials and Bayesian Regression

19

Posterior

p(θ | D) ∝ p(D | θ) p(θ)

∝ θ^y (1 - θ)^{n-y} θ^{α-1} (1 - θ)^{β-1}

= θ^{α+y-1} (1 - θ)^{β+n-y-1}

θ | D ~ Beta(α + y, β + n - y)
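A small numerical sketch of this update in R (prior hyperparameters and data are illustrative):

a = 3; b = 3                          # prior: theta ~ Beta(a, b)
n = 20; y = 14                        # hypothetical data: y successes in n trials
thetadraws = rbeta(10000, a + y, b + n - y)   # iid draws from the Beta(a+y, b+n-y) posterior
mean(thetadraws)                      # close to the posterior mean (a+y)/(a+b+n)
quantile(thetadraws, c(.025, .975))   # posterior interval for theta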

Page 20: Bayesian Essentials and Bayesian Regression

20

Prediction

Pr(ỹ = 1 | y) = ∫ Pr(ỹ = 1 | θ, y) p(θ | y) dθ

= ∫₀¹ θ p(θ | y) dθ

= E[θ | y]

Page 21: Bayesian Essentials and Bayesian Regression

21

Regression model

y_i = x_i'β + ε_i,   ε_i ~ Normal(0, σ²)

y | X, β, σ² ~ N(Xβ, σ² I)

p(y_i | β, σ²) = (2πσ²)^{-1/2} exp( -(1/(2σ²)) (y_i - x_i'β)² )

Is this model complete? For non-experimental data, don’t we need a model for the joint distribution of y and x?

Page 22: Bayesian Essentials and Bayesian Regression

22

Regression model

p(x, y) = p(x | Ψ) p(y | x, β, σ²)

If Ψ is a priori independent of (β, σ²), then

p(Ψ, β, σ² | y, X) ∝ [ p(Ψ) Π_i p(x_i | Ψ) ] × [ p(β, σ²) Π_i p(y_i | x_i, β, σ²) ]

rules out x=f(β)!!!

two separate analyses

simultaneous systems are not written this way!

Page 23: Bayesian Essentials and Bayesian Regression

23

Conjugate Prior

ε ~ N(0, σ² I_n);   Jacobian from ε to y is 1, so

p(y | X, β, σ²) ∝ (σ²)^{-n/2} exp( -(1/(2σ²)) (y - Xβ)'(y - Xβ) )

What is the conjugate prior? It comes from the form of the likelihood function. Here we condition on X.

The quadratic form in β suggests a normal prior.

Let's complete the square on β, or rewrite by projecting y on X (the column space of X).

Page 24: Bayesian Essentials and Bayesian Regression

24

Geometry of regression

[Figure: the vector y and the regressors x₁, x₂; the fitted value ŷ = β̂₁x₁ + β̂₂x₂ lies in the plane spanned by x₁ and x₂, and the residual e is orthogonal to that plane.]

ŷ = β̂₁ x₁ + β̂₂ x₂

e = y - ŷ

"Least Squares": min e'e

e'ŷ = 0

Page 25: Bayesian Essentials and Bayesian Regression

25

Traditional regression

β̂ = (X'X)^{-1} X'y

e'ŷ = ŷ'e = 0,   i.e.,   (Xβ̂)'(y - Xβ̂) = 0

No one ever computes a matrix inverse directly.

Two numerically stable methods:

QR decomposition of X

Cholesky root of X’X and compute inverse using root

Non-Bayesians have to worry about singularity or near singularity of X’X. We don’t! more later

Page 26: Bayesian Essentials and Bayesian Regression

26

Cholesky Roots

In Bayesian computations, the fundamental matrix operation is the Cholesky root: chol() in R.

The Cholesky root is the generalization of the square root, applied to positive definite matrices:

Σ = U'U,   Σ p.d.s.,   with u_ii > 0

U is upper triangular with positive diagonal elements. U^{-1} is easy to compute by recursively solving TU = I for T: backsolve() in R.

As Bayesians with proper priors, we don't ever have to worry about singular matrices!
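A short sketch of these operations in R (the matrix below is just an illustrative p.d.s. example):

Sigma = matrix(c(2, .5, .3, .5, 1, .2, .3, .2, 1.5), ncol = 3)
U = chol(Sigma)                    # upper-triangular Cholesky root: Sigma = U'U
Uinv = backsolve(U, diag(3))       # U^-1 by back-substitution
Sigmainv = Uinv %*% t(Uinv)        # (U'U)^-1 = U^-1 (U^-1)'
range(Sigmainv - solve(Sigma))     # check: differences are essentially zero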

Page 27: Bayesian Essentials and Bayesian Regression

27

Cholesky Roots

Cholesky roots can be used to simulate from the multivariate normal distribution:

Σ = U'U;   z ~ N(0, I)

y = U'z ~ N(0, Σ),   since E[U'z z'U] = U'IU = Σ

To simulate a matrix of draws from MVN (each row is a separate draw) in R,

Y=matrix(rnorm(n*k),ncol=k)%*%chol(Sigma)

Y=t(t(Y)+mu)
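A self-contained version of the two lines above, with illustrative values for the assumed inputs mu, Sigma, n and k:

k = 2; n = 5000
mu = c(1, -1)
Sigma = matrix(c(1, .8, .8, 2), ncol = 2)
Y = matrix(rnorm(n * k), ncol = k) %*% chol(Sigma)   # each row ~ N(0, Sigma)
Y = t(t(Y) + mu)                                     # add the mean to every row
colMeans(Y); var(Y)                                  # compare to mu and Sigma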

Page 28: Bayesian Essentials and Bayesian Regression

28

Regression with R

data.txt:

UNIT Y X1 X2

A 1 0.23815 0.43730

A 2 0.55508 0.47938

A 3 3.03399 -2.17571

A 4 -1.49488 1.66929

B 10 -1.74019 0.35368

B 9 1.40533 -1.26120

B 8 0.15628 -0.27751

B 7 -0.93869 -0.04410

B 6 -3.06566 0.14486

df=read.table("data.txt",header=TRUE)

myreg = function(y, X) {
  #
  # purpose: compute lsq regression
  #
  # arguments:
  #   y -- vector of dep var
  #   X -- array of indep vars
  #
  # output:
  #   list containing lsq coef and std errors
  #
  XpXinv = chol2inv(chol(crossprod(X)))
  bhat = XpXinv %*% crossprod(X, y)
  res = as.vector(y - X %*% bhat)
  ssq = as.numeric(res %*% res / (nrow(X) - ncol(X)))
  se = sqrt(diag(ssq * XpXinv))
  list(b = bhat, std_errors = se)
}
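A brief usage sketch (assuming data.txt above has been read into df as shown; the intercept column is added by hand):

X = cbind(1, df$X1, df$X2)
out = myreg(df$Y, X)
out$b            # least-squares coefficients
out$std_errors   # standard errors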

Page 29: Bayesian Essentials and Bayesian Regression

29

Regression likelihood

(y - Xβ)'(y - Xβ) = (y - Xβ̂)'(y - Xβ̂) + (β - β̂)'X'X(β - β̂) + 2(y - Xβ̂)'X(β̂ - β)

The cross-product term is zero, since X'(y - Xβ̂) = X'e = 0, so

(y - Xβ)'(y - Xβ) = ν s² + (β - β̂)'X'X(β - β̂)

where

ν = n - k   and   ν s² = SSE = (y - Xβ̂)'(y - Xβ̂)

Therefore,

p(y | X, β, σ²) ∝ (σ²)^{-k/2} exp( -(1/(2σ²)) (β - β̂)'X'X(β - β̂) ) × (σ²)^{-ν/2} exp( -ν s² / (2σ²) )

Page 30: Bayesian Essentials and Bayesian Regression

30

Regression likelihood

p(y | X, β, σ²) ∝ [ normal density in β ] × ?

? is a density of the form (σ²)^{-p} exp( -a / σ² )

This is called an inverted gamma distribution. It can also be related to the inverse of a chi-squared distribution.

Note: the conjugate prior suggested by the form of the likelihood has a prior on β which depends on σ².

Page 31: Bayesian Essentials and Bayesian Regression

31

Bayesian Regression

Prior: p(β, σ²) = p(β | σ²) p(σ²)

p(β | σ²) ∝ (σ²)^{-k/2} exp( -(1/(2σ²)) (β - β̄)' A (β - β̄) )

p(σ²) ∝ (σ²)^{-(ν₀/2 + 1)} exp( -ν₀ s₀² / (2σ²) )

Inverted Chi-Square:

σ² ~ ν₀ s₀² / χ²_{ν₀}

Interpretation as from another dataset.

Draw from prior?
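One way to answer "draw from the prior?" is by simulation; a hedged sketch with illustrative hyperparameter values (the heavy tails it produces foreshadow the warning on prior assessment a few pages below):

k = 2
betabar = rep(0, k); A = .01 * diag(k)    # prior mean and precision for beta
nu0 = 3; ssq0 = 1                         # prior for sigma^2
R = 5000
sigmasq = nu0 * ssq0 / rchisq(R, nu0)     # sigma^2 ~ nu0*ssq0 / chisq_nu0
Ainvroot = backsolve(chol(A), diag(k))    # A^-1 = Ainvroot %*% t(Ainvroot)
beta = t(sapply(sigmasq, function(s2) betabar + sqrt(s2) * Ainvroot %*% rnorm(k)))
apply(beta, 2, quantile, c(.025, .975))   # very spread out: this "diffuse" prior has heavy tails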

Page 32: Bayesian Essentials and Bayesian Regression

32

Posterior

p(β, σ² | D) ∝ ℓ(β, σ²) p(β | σ²) p(σ²)

∝ (σ²)^{-n/2} exp( -(1/(2σ²)) (y - Xβ)'(y - Xβ) )

× (σ²)^{-k/2} exp( -(1/(2σ²)) (β - β̄)' A (β - β̄) )

× (σ²)^{-(ν₀/2 + 1)} exp( -ν₀ s₀² / (2σ²) )

Page 33: Bayesian Essentials and Bayesian Regression

33

Combining quadratic forms

(y - Xβ)'(y - Xβ) + (β - β̄)' A (β - β̄)

= (y - Xβ)'(y - Xβ) + (β - β̄)' U'U (β - β̄)      [A = U'U]

= (v - Wβ)'(v - Wβ),   where   v = [ y ; Uβ̄ ]   and   W = [ X ; U ]   (stacked)

(v - Wβ)'(v - Wβ) = n s̃² + (β - β̃)' W'W (β - β̃)

β̃ = (W'W)^{-1} W'v = (X'X + A)^{-1} (X'X β̂ + A β̄)

n s̃² = (v - Wβ̃)'(v - Wβ̃) = (y - Xβ̃)'(y - Xβ̃) + (β̃ - β̄)' A (β̃ - β̄)
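A quick numerical check of this identity in R on simulated data (all values illustrative):

set.seed(1)
n = 50; k = 2
X = cbind(1, rnorm(n)); y = X %*% c(1, 2) + rnorm(n)
betabar = rep(0, k); A = diag(k)
U = chol(A); W = rbind(X, U); v = c(y, U %*% betabar)
btilde = solve(crossprod(W), crossprod(W, v))        # (W'W)^-1 W'v
beta = c(.5, .5)                                     # an arbitrary evaluation point
lhs = sum((y - X %*% beta)^2) + t(beta - betabar) %*% A %*% (beta - betabar)
rhs = sum((v - W %*% btilde)^2) + t(beta - btilde) %*% crossprod(W) %*% (beta - btilde)
c(lhs, rhs)                                          # identical up to rounding error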

Page 34: Bayesian Essentials and Bayesian Regression

34

Posterior

p(β, σ² | D) ∝ (σ²)^{-k/2} exp( -(1/(2σ²)) (β - β̃)'(X'X + A)(β - β̃) )

× (σ²)^{-((n + ν₀)/2 + 1)} exp( -(ν₀ s₀² + n s̃²) / (2σ²) )

[β | σ², D] ~ N( β̃, σ² (X'X + A)^{-1} )

[σ² | D] ~ ν₁ s₁² / χ²_{ν₁},   with   ν₁ = ν₀ + n   and   s₁² = (ν₀ s₀² + n s̃²) / (ν₀ + n)

Page 35: Bayesian Essentials and Bayesian Regression

35

IID Simulations

Scheme: [y | X, β, σ²] [β | σ²] [σ²]

1) Draw [σ² | y, X]
2) Draw [β | σ², y, X]
3) Repeat

Page 36: Bayesian Essentials and Bayesian Regression

36

IID Simulator, cont.

1) Draw [σ² | y, X]:

σ² | y, X ~ ν₁ s₁² / χ²_{ν₁}

2) Draw [β | y, X, σ²]:

β | y, X, σ² ~ N( β̃, σ² (X'X + A)^{-1} ),   β̃ = (X'X + A)^{-1} (X'X β̂ + A β̄)

note: if z ~ N(0, I) and U'U = X'X + A, then β̃ + σ U^{-1} z ~ N( β̃, σ² (U'U)^{-1} ) = N( β̃, σ² (X'X + A)^{-1} )

Page 37: Bayesian Essentials and Bayesian Regression

37

Bayes Estimator

E[β | D] = E_{σ²|D}[ E[β | σ², D] ] = β̃

The Bayes Estimator is the posterior mean of β.

Marginal on β is a multivariate student t.

Who cares?

Page 38: Bayesian Essentials and Bayesian Regression

38

Shrinkage and Conjugate Priors

β̃ = (X'X + A)^{-1} (X'X β̂ + A β̄)   "shrinks" β̂ toward β̄

The Bayes Estimator is the posterior mean of β.

This is a "shrinkage" estimator.

As n → ∞, β̃ → β̂ (Why? X'X is of order n).

Var(β | σ², D) = σ² (X'X + A)^{-1}, smaller than either σ² A^{-1} or σ² (X'X)^{-1}

Is this reasonable?

Page 39: Bayesian Essentials and Bayesian Regression

39

Assessing Prior Hyperparameters

β̄, A, ν₀, s₀²

These determine prior location and spread for both coefs and error variance.

It has become customary to assess a "diffuse" prior:

"small" value of A, e.g., A = .01 I_k ⇒ Var(β | σ² = 1) = 100 I_k

ν₀ "small", e.g., ν₀ = 3

s₀² = 1. This can be problematic; s₀² = Var(y) might be a better choice.

Page 40: Bayesian Essentials and Bayesian Regression

40

Improper or “non-informative” priors

Classic "non-informative" prior (improper):

p(β, σ²) = p(β) p(σ²) ∝ 1/σ²

Is this “non-informative”?

Of course not, it says that β is large with high prior "probability"

Is this wise computationally?

No, I have to worry about singularity in X’X

Is this a good procedure?

No, it is not admissible. Shrinkage is good!

Page 41: Bayesian Essentials and Bayesian Regression

41

runireg

runireg = function(Data, Prior, Mcmc) {
  #
  # purpose:
  #   draw from posterior for a univariate regression model with natural conjugate prior
  #
  # arguments:
  #   Data  -- list of data: y, X
  #   Prior -- list of prior hyperparameters:
  #            betabar, A  -- prior mean, prior precision
  #            nu, ssq     -- prior on sigmasq
  #   Mcmc  -- list of MCMC parms:
  #            R    -- number of draws
  #            keep -- thinning parameter
  #
  # output:
  #   list of beta, sigmasq draws
  #   beta is k x 1 vector of coefficients
  #
  # model:
  #   Y = X beta + e,  var(e_i) = sigmasq
  # priors:
  #   beta | sigmasq ~ N(betabar, sigmasq*A^-1)
  #   sigmasq ~ (nu*ssq)/chisq_nu

Page 42: Bayesian Essentials and Bayesian Regression

42

runireg, continued

  RA = chol(A)
  W = rbind(X, RA)
  z = c(y, as.vector(RA %*% betabar))
  IR = backsolve(chol(crossprod(W)), diag(k))
  # W'W = R'R ; (W'W)^-1 = IR IR' -- this is the UL decomp
  btilde = crossprod(t(IR)) %*% crossprod(W, z)
  res = z - W %*% btilde
  s = t(res) %*% res
  #
  # first draw sigmasq
  #
  sigmasq = (nu*ssq + s) / rchisq(1, nu + n)
  #
  # now draw beta given sigmasq
  #
  beta = btilde + as.vector(sqrt(sigmasq)) * IR %*% rnorm(k)

  list(beta = beta, sigmasq = sigmasq)
}
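The listing above returns a single posterior draw and assumes y, X, n, k, betabar, A, nu and ssq have already been unpacked from Data, Prior and Mcmc (as the header comments describe); the bayesm package version of runireg loops over R draws internally. A usage sketch with simulated data and illustrative prior settings, producing the out$betadraw and out$sigmasqdraw series plotted on the next two pages:

library(bayesm)
set.seed(66)
n = 200
X = cbind(rep(1, n), runif(n)); beta = c(1, 2); sigsq = .25
y = X %*% beta + rnorm(n, sd = sqrt(sigsq))
out = runireg(Data = list(y = y, X = X),
              Prior = list(betabar = c(0, 0), A = .01 * diag(2), nu = 3, ssq = 1),
              Mcmc = list(R = 2000))
plot(out$betadraw[, 2], type = "l")   # trace of the slope draws
plot(out$sigmasqdraw, type = "l")     # trace of the error-variance draws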

Page 43: Bayesian Essentials and Bayesian Regression

43

[Figure: trace plot of out$betadraw over 2,000 draws (values roughly between 1.0 and 2.5).]

Page 44: Bayesian Essentials and Bayesian Regression

44

[Figure: trace plot of out$sigmasqdraw over 2,000 draws (values roughly between 0.20 and 0.35).]

Page 45: Bayesian Essentials and Bayesian Regression

45

Multivariate Regression

y_c = X β_c + ε_c,   c = 1, …, m   (each of the m equations shares the same X)

In matrix form:

Y = XB + E

Y = [ y_1, …, y_c, …, y_m ],   B = [ β_1, …, β_c, …, β_m ],   E = [ ε_1, …, ε_c, …, ε_m ]

ε_row ~ iid N(0, Σ)   (ε_row denotes a row of E)

Page 46: Bayesian Essentials and Bayesian Regression

46

Multivariate regression likelihood

p(Y | X, B, Σ) ∝ |Σ|^{-n/2} exp( -(1/2) sum_{r=1}^n (y_r - B'x_r)' Σ^{-1} (y_r - B'x_r) )

= |Σ|^{-n/2} etr( -(1/2) Σ^{-1} (Y - XB)'(Y - XB) )

= |Σ|^{-(n-k)/2} etr( -(1/2) Σ^{-1} S ) × |Σ|^{-k/2} etr( -(1/2) Σ^{-1} (B - B̂)'X'X(B - B̂) )

where S = (Y - XB̂)'(Y - XB̂)

Page 47: Bayesian Essentials and Bayesian Regression

47

Multivariate regression likelihood

But tr(A'B) = vec(A)'vec(B), so

tr( Σ^{-1} (B - B̂)'X'X(B - B̂) ) = vec(B - B̂)' vec( X'X (B - B̂) Σ^{-1} )

and, using vec(ABC) = (C' ⊗ A) vec(B),

= vec(B - B̂)' ( Σ^{-1} ⊗ X'X ) vec(B - B̂)

therefore,

p(Y | X, B, Σ) ∝ |Σ|^{-(n-k)/2} etr( -(1/2) Σ^{-1} S ) × |Σ|^{-k/2} exp( -(1/2) (β - β̂)' ( Σ^{-1} ⊗ X'X ) (β - β̂) )

where β = vec(B) and β̂ = vec(B̂)

Page 48: Bayesian Essentials and Bayesian Regression

48

Inverted Wishart distribution

Form of the likelihood suggests that the natural conjugate (convenient prior) for Σ would be of the Inverted Wishart form:

p(Σ | ν₀, V₀) ∝ |Σ|^{-(ν₀ + m + 1)/2} etr( -(1/2) Σ^{-1} V₀ )

denoted Σ ~ IW(ν₀, V₀)

if ν₀ > m + 1,   E[Σ] = (ν₀ - m - 1)^{-1} V₀

ν₀ -- tightness
V₀ -- location

however, as V₀ increases, the spread also increases

limitations: i. small ν₀ -- thick tails; ii. only one tightness parameter

Page 49: Bayesian Essentials and Bayesian Regression

49

Wishart distribution (rwishart)

If Σ ~ IW(ν₀, V₀), then Σ^{-1} ~ W(ν₀, V₀^{-1})

if ν₀ > m + 1,   E[Σ^{-1}] = ν₀ V₀^{-1}

Generalization of χ²: let ε_i ~ N_m(0, Σ), i = 1, …, ν. Then

W = Σ_{i=1}^ν ε_i ε_i' ~ W(ν, Σ)

The diagonals are χ².
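A small sketch verifying this construction by simulation (dimension, degrees of freedom and Sigma are illustrative); the bayesm function rwishart(nu, V) named in the slide title draws from the Wishart directly and also returns the inverse and root matrices used by rmultireg below:

m = 2; nu = 10
Sigma = matrix(c(1, .5, .5, 2), ncol = 2)
U = chol(Sigma)
rwish1 = function() {
  eps = matrix(rnorm(nu * m), ncol = m) %*% U   # nu draws of epsilon_i ~ N_m(0, Sigma)
  crossprod(eps)                                # W = sum_i epsilon_i epsilon_i'
}
Wbar = Reduce(`+`, replicate(2000, rwish1(), simplify = FALSE)) / 2000
Wbar; nu * Sigma                                # the average draw is close to E[W] = nu*Sigma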

Page 50: Bayesian Essentials and Bayesian Regression

50

Multivariate regression prior and posterior

Prior:

p(Σ, B) = p(Σ) p(B | Σ)

Σ ~ IW(ν₀, V₀)

β = vec(B),   β | Σ ~ N( β̄, Σ ⊗ A^{-1} )

Posterior:

Σ | Y, X ~ IW( ν₀ + n, V₀ + S )

β | Y, X, Σ ~ N( β̃, Σ ⊗ (X'X + A)^{-1} )

β̃ = vec(B̃),   B̃ = (X'X + A)^{-1} (X'X B̂ + A B̄)

S = (Y - XB̃)'(Y - XB̃) + (B̃ - B̄)' A (B̃ - B̄)

Page 51: Bayesian Essentials and Bayesian Regression

51

Drawing from Posterior: rmultireg

rmultireg = function(Y, X, Bbar, A, nu, V) {
  # note: Y, X, A, Bbar must be matrices!
  n = nrow(Y); m = ncol(Y); k = ncol(X)   # dimensions (implied by the inputs; not shown on the original slide)
  RA = chol(A)
  W = rbind(X, RA)
  Z = rbind(Y, RA %*% Bbar)
  IR = backsolve(chol(crossprod(W)), diag(k))
  # W'W = R'R & (W'W)^-1 = IR IR' -- this is the UL decomp!
  Btilde = crossprod(t(IR)) %*% crossprod(W, Z)   # IR IR'(W'Z) = (X'X+A)^-1 (X'Y + A Bbar)
  S = crossprod(Z - W %*% Btilde)
  #
  rwout = rwishart(nu + n, chol2inv(chol(V + S)))
  #
  # now draw B given Sigma.  note: beta ~ N(vec(Btilde), Sigma (x) Cov)
  #   Cov = (X'X + A)^-1 = IR t(IR)
  #   Sigma = CI CI'
  #   therefore, cov(beta) = Omega = CI CI' (x) IR IR' = (CI (x) IR)(CI (x) IR)'
  #   so to draw beta we do beta = vec(Btilde) + (CI (x) IR) vec(Z_mk),
  #     where Z_mk is an m x k matrix of N(0,1) draws;
  #   since vec(ABC) = (C' (x) A) vec(B), we have B = Btilde + IR Z_mk CI'
  #
  B = Btilde + IR %*% matrix(rnorm(m * k), ncol = m) %*% t(rwout$CI)
  list(B = B, Sigma = rwout$IW)   # return one draw of (B, Sigma); added for completeness
}
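A usage sketch for the function above (data, prior settings and dimensions are illustrative; rwishart() is taken from the bayesm package, which also ships its own version of rmultireg):

library(bayesm)   # provides rwishart()
set.seed(66)
n = 200; m = 2; k = 2
X = cbind(1, runif(n))
B = matrix(c(1, 2, -1, .5), nrow = k)                 # "true" coefficients
Sigma = matrix(c(1, .5, .5, 1), ncol = m)             # "true" error covariance
Y = X %*% B + matrix(rnorm(n * m), ncol = m) %*% chol(Sigma)
out = rmultireg(Y, X, Bbar = matrix(0, k, m), A = .01 * diag(k),
                nu = m + 3, V = (m + 3) * diag(m))
out$B; out$Sigma                                      # one draw from the posterior of (B, Sigma)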

Page 52: Bayesian Essentials and Bayesian Regression

52

Conjugacy is Fragile!

SUR: a set of regressions "related" via correlated errors:

y_i = X_i β_i + ε_i,   i = 1, …, m

given Σ, β ~ N would be conjugate

given β, Σ ~ IW would be conjugate

BUT, no joint conjugate prior!!