Bayesian Estimation & Model Evaluation
Frank Schorfheide, University of Pennsylvania
MFM Summer Camp
June 12, 2016
Frank Schorfheide Bayesian Estimation & Model Evaluation
Why Bayesian Inference?
• Why not?
p(θ|Y) = p(Y|θ)p(θ) / ∫ p(Y|θ)p(θ) dθ
• Treat uncertainty with respect to shocks, latent states, parameters, and model specifications symmetrically.
• Condition inference on what you know (the data Y) instead of what you don't know (the parameter θ).
• Make optimal decision conditional on observed data.
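The normalization in Bayes' theorem above can be carried out on a grid whenever θ is low-dimensional. A minimal sketch; the Bernoulli likelihood, Beta(2, 2) prior, and data are illustrative choices, not from the slides:

```python
import numpy as np

# Grid approximation of p(theta|Y) ∝ p(Y|theta) p(theta).
# Illustrative setup: Y = 7 successes in 10 Bernoulli trials,
# Beta(2, 2) prior on the success probability theta.
theta = np.linspace(0.001, 0.999, 999)
d = theta[1] - theta[0]
prior = theta * (1 - theta)                  # Beta(2, 2) kernel
likelihood = theta**7 * (1 - theta)**3       # Bernoulli likelihood p(Y|theta)
kernel = likelihood * prior
posterior = kernel / (kernel.sum() * d)      # divide by ∫ p(Y|theta) p(theta) dtheta

post_mean = (theta * posterior).sum() * d
```

By conjugacy the exact posterior is Beta(9, 5) with mean 9/14, which the grid answer reproduces.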
Excuses and Overview
• Too little time to provide a detailed survey of state-of-the-art Bayesian methods.
• Instead: an eclectic collection of ideas and insights related to:
1 Model Development
2 Identification
3 Priors
4 Computations
5 Working with Multiple Models
1. Model Development
• Bayesian estimation can take a lot of time... so don't waste it on bad models!
• Suppose you have an elaborate macro-finance DSGE model...
• Applied theorists get credit for plugging parameter values into the model and solving/simulating it.
• You can easily get extra credit by:
• specifying a prior distribution p(θ);
• generating draws θi, i = 1, …, N, from the prior;
• simulating trajectories Yi (conditional on θi), i = 1, …, N;
• computing sample statistics S(Yi);
• comparing the distribution of simulated sample statistics to the observed sample statistic S(Y);
• calling it a prior predictive check.
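The recipe above can be sketched in a few lines. The AR(1) "model," the prior ranges, the autocorrelation statistic, and the observed value s_obs are all placeholders for your own model, prior, and data:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, T=200):
    """Simulate a trajectory Y^i from a toy AR(1) model given a parameter draw."""
    rho, sigma = theta
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = rho * y[t - 1] + sigma * rng.standard_normal()
    return y

def sample_stat(y):
    """S(Y): here the first-order autocorrelation."""
    return np.corrcoef(y[:-1], y[1:])[0, 1]

# Draws theta^i from the prior: rho ~ U(0, 0.95), sigma ~ U(0.5, 2)
N = 500
stats = []
for _ in range(N):
    theta_i = (rng.uniform(0, 0.95), rng.uniform(0.5, 2.0))
    stats.append(sample_stat(simulate(theta_i)))
stats = np.array(stats)

# Compare the prior predictive distribution of S(Y^i) with the observed S(Y):
s_obs = 0.8  # placeholder for the statistic computed from actual data
in_tails = (stats < s_obs).mean()  # fraction of simulated statistics below it
```

If the observed statistic sits far in the tails of the simulated distribution, the prior/model combination is suspect before any estimation is run.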
1. Predictive Checks – An Example
Reference: Chang, Doh, and Schorfheide (2007, JMCB)
2. Identification
• We are trying to learn the parameters θ from the data.
• Formal definitions... e.g., the model is identified at θ0 if p(Y|θ) = p(Y|θ0) implies that θ = θ0.
• Without identification or with weak identification:
• use more/different data to achieve identification;
• use identification-robust inference procedures.
• Lack of identification does not raise conceptual issues for Bayesian inference (as long as priors are proper), but it can pose computational challenges.
Reference: Fernandez-Villaverde, Rubio-Ramirez, Schorfheide (2016, HB of Macro Chapter)
2. (Lack of) Identification – An Analytical Example
• Let φ be an identifiable reduced-form parameter.
• Let θ be a structural parameter of interest:
φ ≤ θ and θ ≤ φ + 1.
• Parameter θ is set-identified.
• The interval Θ(φ) = [φ, φ + 1] is called the identified set.
• This problem shows up prominently in VARs identified with sign restrictions.
References: Moon and Schorfheide (2012, Econometrica); Schorfheide (2016, Discussion of World Congress Lectures by Muller and Uhlig)
2. (Lack of) Identification – An Analytical Example
• Joint posterior of θ and φ:
p(θ, φ|Y ) = p(φ|Y )p(θ|φ,Y ) ∝ p(Y |φ)p(θ|φ)p(φ).
• Because θ does not enter the likelihood function, we deduce that
p(φ|Y) = p(Y|φ)p(φ) / ∫ p(Y|φ)p(φ) dφ,
p(θ|φ,Y ) = p(θ|φ).
No updating of beliefs about θ conditional on φ!
• Marginal posterior distribution of θ:
p(θ|Y) = ∫_{θ−1}^{θ} p(φ|Y) p(θ|φ) dφ
Updating of marginal posterior of θ!
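For the distributional assumptions used in the figure that follows (φ|Y ∼ N(−0.5, V), θ|φ ∼ U[φ, φ + 1]), the marginal posterior of θ is available in closed form, since p(θ|φ) = 1 on [φ, φ + 1]; a small sketch:

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def posterior_theta(theta, V, mu=-0.5):
    """p(theta|Y) = integral of p(phi|Y) over [theta - 1, theta],
    with phi|Y ~ N(mu, V) and p(theta|phi) = 1 on [phi, phi + 1]."""
    s = sqrt(V)
    return Phi((theta - mu) / s) - Phi((theta - 1 - mu) / s)

grid = np.linspace(-2.0, 1.5, 701)
d = grid[1] - grid[0]
dens = np.array([posterior_theta(t, 1 / 100) for t in grid])
# As V shrinks, p(theta|Y) flattens toward U[-0.5, 0.5]: the data pin
# down phi but never discriminate between values of theta inside the
# identified set.
```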
2. An Analytical Example: Posterior p(θ|Y )
Assume φ|Y ∼ N(−0.5, V); θ|φ ∼ U[φ, φ + 1].
V is equal to 1/4 (solid red), 1/20 (dashed blue), and 1/100 (dotted green).
[Figure: posterior density p(θ|Y) plotted against θ ∈ [−2, 1.5] for the three values of V.]
3. Prior Distributions
• Ideally: probabilistic representation of our knowledge/beliefs before observing the sample Y.
• More realistically: the choice of prior as well as the model are influenced by some observations. Try to keep this influence small or adjust measures of uncertainty.
• Views about the role of priors:
1 keep them “uninformative” (???) so that the posterior inherits the shape of the likelihood function;
2 use them to regularize the likelihood function;
3 incorporate information from sources other than Y.
3. Role of Priors – Example 1
• “Uninformative” priors?
• Consider structural VAR
yt = Φ yt−1 + Σtr Ω εt,  ut = Σtr Ω εt,  E[ut ut′] = Σ
• A uniform distribution on the orthonormal matrix Ω does not induce a uniform prior over the identified set for the IRF:
IRF(i, h) = Φh Σtr [Ω]·i = Φh Σtr q, where ‖q‖ = 1
[Figure: mapping of the uniform prior on q = (q1, q2)′ into a non-uniform prior on θ; the line Σtr,21 q1 + Σtr,22 q2 = 0 is shown.]
Reference: Schorfheide (2016, World Congress Discussion)
3. Role of Priors – Example 2a
• Consider model
yt = θ1x1,t + θ1θ2x2,t + ut .
• No identification of θ2 if θ1 = 0.
• Models with multiplicative parameters generate likelihood functions that look like this...
[Figure: likelihood contours in the (θ1, θ2) plane.]
3. Role of Priors – Example 2a
• Identification problem also distorts
p(θ1 = 0|Y) ∝ ∫ p(Y|θ1 = 0, θ2) p(θ2) p(θ1 = 0) dθ2.
• Reparameterize: α1 = θ1, α2 = θ1θ2.
• Prior p(α1, α2) ∝ c can regularize the problem.
• Jacobian:
|∂α/∂θ′| = | 1  0 ; θ2  θ1 | = |θ1|.
• The prior density p(θ1, θ2) ∝ |θ1| vanishes as θ1 approaches the point of non-identification.
• More generally: try to add information when the data are not particularly informative.
References (cointegration model): Kleibergen and van Dijk (1994); Kleibergen and Paap (2002)
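The Jacobian calculation can be double-checked numerically; this sketch (function names are mine) approximates ∂α/∂θ′ by central finite differences:

```python
import numpy as np

def alpha(theta):
    """Reparameterization: alpha1 = theta1, alpha2 = theta1 * theta2."""
    t1, t2 = theta
    return np.array([t1, t1 * t2])

def jacobian_det(theta, eps=1e-6):
    """|det d(alpha)/d(theta')| via central finite differences."""
    J = np.zeros((2, 2))
    for j in range(2):
        e = np.zeros(2)
        e[j] = eps
        J[:, j] = (alpha(theta + e) - alpha(theta - e)) / (2 * eps)
    return abs(np.linalg.det(J))

# The determinant equals |theta1| and vanishes at the point of
# non-identification theta1 = 0:
vals = [jacobian_det(np.array([t1, 0.7])) for t1 in (0.5, 0.1, 0.0)]
```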
3. Role of Priors – Example 2b
• For instance, high-dimensional VARs:
Y = XΦ + U,  ut ∼ N(0, Σ),
with a low observation-to-parameter ratio.
• A hierarchical (conjugate) MNIW prior p(Φ, Σ|λ) adds information. Frequentist perspective: add some bias and reduce variance to improve MSE.
• How much? Data-driven choice of λ (empirical Bayes):
λ̂ = argmax_λ ∫ p(Y|Φ, Σ) p(Φ, Σ|λ) d(Φ, Σ)
• Or specify prior p(λ) and integrate out hyperparameters.
• Alternative priors: LASSO, spike-and-slab,...
Reference: Giannone, Lenza, and Primiceri (2014, REStat)
3. Role of Priors – Example 3
• Prior elicitation based on: pre-sample information; information from excluded data series; or micro (macro) level information when estimating a model on macro (micro) data.
• A cute example...
• Production function:
Yt = (At Ht)^α Kt^{1−α} (1 − ϕ (Ht/Ht−1 − 1)²).
• Prior for adjustment costs ϕ?
• Firms can either search for workers, incurring adjustment costs ϕ (ΔH/H)² Y, or pay head hunters for finding workers.
• The head hunters' service fee is ζWΔH.
• Head hunters tend to charge about ζ = 1/3 to 2/3 of the quarterly earnings of a worker.
• Recruiting costs should be approximately the same: ϕ (ΔH/H)² Y = ζWΔH.
• With a labor share WH/Y of 2/3 and a one-percent increase in employment, ΔH/H = 1%, we obtain a range of 22 to 44 for ϕ.
Reference: Chang, Doh, and Schorfheide (2007, JMCB)
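The back-of-the-envelope range can be replicated directly: solving ϕ(ΔH/H)²Y = ζWΔH for ϕ gives ϕ = ζ · (WH/Y)/(ΔH/H), assuming a labor share of 2/3:

```python
# Solve phi * (dH/H)^2 * Y = zeta * W * dH for phi:
#   phi = zeta * (W*H/Y) / (dH/H)
labor_share = 2 / 3                            # W*H/Y (assumed calibration)
dH_over_H = 0.01                               # one percent employment increase
phi_lo = (1 / 3) * labor_share / dH_over_H     # zeta = 1/3 -> phi ~ 22
phi_hi = (2 / 3) * labor_share / dH_over_H     # zeta = 2/3 -> phi ~ 44
```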
4. Computations
• Practical work utilizes algorithms to generate draws θi, i = 1, …, N, from the posterior p(θ|Y).
• Post-process the draws by converting them into the object of interest hi = h(θi) to characterize p(h(θ)|Y) =⇒ inference and decision making under uncertainty.
• Important algorithms:
• importance sampling
• Markov chain Monte Carlo (MCMC) algorithms, e.g., Metropolis-Hastings samplers or Gibbs samplers
• More recently: widespread access to parallel computation environments.
• Sequential Monte Carlo (SMC) techniques provide an interesting alternative.
Reference: Herbst and Schorfheide (2015, Princeton University Press)
4. Importance Sampling
• Target posterior π(θ) ∝ f(θ).
• Use the identity
∫ h(θ) f(θ) dθ = ∫ h(θ) [f(θ)/g(θ)] g(θ) dθ.
• The θi's are draws from g(·).
• Approximation:
Eπ[h] ≈ [(1/N) Σ_{i=1}^N h(θi) w(θi)] / [(1/N) Σ_{i=1}^N w(θi)],  w(θ) = f(θ)/g(θ).
[Figure: target density f and two proposal densities g1, g2 (left); importance weights f/g1 and f/g2 (right).]
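A self-normalized importance sampler in a few lines; the target and proposal densities are toy choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target kernel f (unnormalized N(1, 0.5^2)) and proposal g = N(0, 2^2).
def log_f(th):
    return -0.5 * ((th - 1.0) / 0.5) ** 2

N = 100_000
theta = rng.normal(0.0, 2.0, size=N)      # draws from g
log_g = -0.5 * (theta / 2.0) ** 2         # log g up to a constant (cancels below)
log_w = log_f(theta) - log_g              # log importance weights log(f/g)
w = np.exp(log_w - log_w.max())           # subtract max for numerical stability

# Self-normalized estimate of E_pi[h] for h(theta) = theta:
h_hat = (theta * w).sum() / w.sum()
```

Multiplicative constants in f and g cancel in the ratio of sums, so unnormalized kernels suffice; here the estimate should be close to the target mean of 1.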
4. A Challenging Posterior
• Consider the state-space model:
yt = [1 1] st,
st = [ θ1²  0 ; (1 − θ1²) − θ1θ2  1 − θ1² ] st−1 + [1 0]′ εt.
• Shocks: εt ∼ iid N(0, 1); uniform prior.
• Simulate T = 200 observations given θ = [0.45, 0.45]′, which is observationally equivalent to θ = [0.89, 0.22]′.
[Figure: posterior contours in the (θ1, θ2) unit square, with mass concentrating around the two observationally equivalent parameter values.]
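The claimed observational equivalence can be verified numerically: both parameter vectors imply identical autocovariances of yt. A sketch (0.89 and 0.22 are rounded on the slide, so the exact counterparts are recomputed from θ = [0.45, 0.45]′):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def system(theta):
    """State-space matrices implied by theta = (theta1, theta2)."""
    t1, t2 = theta
    Phi = np.array([[t1**2,                 0.0      ],
                    [(1 - t1**2) - t1 * t2, 1 - t1**2]])
    R = np.array([[1.0], [0.0]])
    c = np.array([[1.0, 1.0]])
    return Phi, R, c

def autocovariances(theta, K=5):
    """Gamma_y(0), ..., Gamma_y(K-1) from the stationary state covariance."""
    Phi, R, c = system(theta)
    P = solve_discrete_lyapunov(Phi, R @ R.T)   # P = Phi P Phi' + R R'
    return np.array([(c @ np.linalg.matrix_power(Phi, k) @ P @ c.T).item()
                     for k in range(K)])

theta_a = np.array([0.45, 0.45])
t1b = np.sqrt(1 - 0.45**2)                      # ~ 0.89 (the slide rounds)
theta_b = np.array([t1b, 0.45 * 0.45 / t1b])    # second element ~ 0.22
# autocovariances(theta_a) and autocovariances(theta_b) coincide, so the
# Gaussian likelihood cannot distinguish the two parameter vectors.
```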
4. From Importance to Sequential Importance Sampling
[Figure: sequence of tempered posteriors πn(θ1) as n increases from 1 to Nφ.]
πn(θ) = [p(Y|θ)]^{φn} p(θ) / ∫ [p(Y|θ)]^{φn} p(θ) dθ = fn(θ)/Zn,  φn = (n/Nφ)^λ
4. SMC Algorithm: A Graphical Illustration
[Figure: particle swarm evolving through Correction (C), Selection (S), and Mutation (M) steps across tempering stages φ0, φ1, φ2, φ3.]
• πn(θ) is represented by a swarm of particles {θin, Win}Ni=1:
h̄n,N = (1/N) Σ_{i=1}^N Win h(θin) →a.s. Eπn[h(θn)].
• C is Correction; S is Selection; and M is Mutation.
4. SMC Algorithm
1 Initialization. (φ0 = 0). Draw the initial particles from the prior: θi1 ∼ iid p(θ) and Wi1 = 1, i = 1, …, N.
2 Recursion. For n = 1, …, Nφ,
1 Correction. Reweight the particles from stage n − 1 by defining the incremental weights
w̃in = [p(Y|θin−1)]^{φn−φn−1}   (1)
and the normalized weights
W̃in = w̃in Win−1 / [(1/N) Σ_{i=1}^N w̃in Win−1],  i = 1, …, N.   (2)
An approximation of Eπn[h(θ)] is given by
h̃n,N = (1/N) Σ_{i=1}^N W̃in h(θin−1).   (3)
2 Selection.
4. SMC Algorithm
1 Initialization.
2 Recursion. For n = 1, …, Nφ,
1 Correction.
2 Selection. (Optional Resampling) Let {θ̂in}Ni=1 denote N iid draws from a multinomial distribution characterized by support points and weights {θin−1, W̃in}Ni=1, and set Win = 1. An approximation of Eπn[h(θ)] is given by
ĥn,N = (1/N) Σ_{i=1}^N Win h(θ̂in).   (4)
3 Mutation. Propagate the particles {θ̂in, Win} via NMH steps of an MH algorithm with transition density θin ∼ Kn(θn|θ̂in; ζn) and stationary distribution πn(θ). An approximation of Eπn[h(θ)] is given by
h̄n,N = (1/N) Σ_{i=1}^N h(θin) Win.   (5)
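The Correction–Selection–Mutation recursion above can be assembled into a compact likelihood-tempering SMC sampler for a toy scalar model, yt ∼ N(θ, 1) with prior θ ∼ N(0, 1), whose posterior is known in closed form. All tuning constants are illustrative; this sketch resamples at every stage and uses a single random-walk MH step for the mutation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy model: y_t ~ N(theta, 1), prior theta ~ N(0, 1).
T = 50
y = rng.normal(1.0, 1.0, size=T)

def loglik(theta):
    """Log likelihood for each particle in the array theta."""
    return -0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1)

N, Nphi, lam = 2000, 40, 2.0
phi = (np.arange(Nphi + 1) / Nphi) ** lam   # tempering schedule (n/Nphi)^lam

theta = rng.normal(0.0, 1.0, size=N)        # initialization: draws from prior
W = np.ones(N)
ll = loglik(theta)

for n in range(1, Nphi + 1):
    # Correction: incremental weights [p(Y|theta)]^(phi_n - phi_{n-1})
    logW = np.log(W) + (phi[n] - phi[n - 1]) * ll
    W = np.exp(logW - logW.max())
    W = W / W.mean()                        # normalize to average one

    # Selection: multinomial resampling, then reset weights to 1
    idx = rng.choice(N, size=N, p=W / W.sum())
    theta, ll, W = theta[idx], ll[idx], np.ones(N)

    # Mutation: one random-walk MH step targeting pi_n
    prop = theta + 0.3 * rng.standard_normal(N)
    ll_prop = loglik(prop)
    logpost = phi[n] * ll - 0.5 * theta**2       # tempered likelihood x prior
    logpost_prop = phi[n] * ll_prop - 0.5 * prop**2
    accept = np.log(rng.uniform(size=N)) < (logpost_prop - logpost)
    theta = np.where(accept, prop, theta)
    ll = np.where(accept, ll_prop, ll)

# Exact posterior: N(sum(y)/(T+1), 1/(T+1)); compare with the particle mean.
post_mean = theta.mean()
```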
4. Remarks
• Correction Step:
• reweight the particles from iteration n − 1 to create an importance sampling approximation of Eπn[h(θ)].
• Selection Step: the resampling of the particles
• (good) equalizes the particle weights and thereby increases the accuracy of subsequent importance sampling approximations;
• (not good) adds a bit of noise to the MC approximation.
• Mutation Step:
• adapts the particles to the posterior πn(θ);
• imagine we don't do it: then we would be using draws from the prior p(θ) to approximate the posterior π(θ), which can't be good!
5. Working with Multiple Models
• Assign prior probabilities γj,0 to models Mj, j = 1, …, J.
• Posterior model probabilities are given by
γj,T = γj,0 p(Y|Mj) / Σ_{j=1}^J γj,0 p(Y|Mj),
where
p(Y|Mj) = ∫ p(Y|θ(j), Mj) p(θ(j)|Mj) dθ(j).
• Log marginal data densities are sums of one-step-ahead predictive scores:
ln p(Y|Mj) = Σ_{t=1}^T ln ∫ p(yt|θ(j), Y1:t−1, Mj) p(θ(j)|Y1:t−1, Mj) dθ(j).
• Bayesian model averaging:
p(h|Y) = Σ_{j=1}^J γj,T p(hj(θ(j))|Y, Mj).
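The predictive-score identity can be checked in a conjugate example where both sides have closed forms (yt ∼ N(θ, 1), θ ∼ N(0, 1); this model is an illustration, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(3)

T = 20
y = rng.normal(0.5, 1.0, size=T)

# Direct marginal data density: under the prior, Y ~ N(0, I + 11').
Sigma = np.eye(T) + np.ones((T, T))
log_mdd = multivariate_normal(mean=np.zeros(T), cov=Sigma).logpdf(y)

# Sum of one-step-ahead log predictive scores, updating theta|Y_{1:t} ~ N(m, v):
log_scores, m, v = 0.0, 0.0, 1.0
for t in range(T):
    log_scores += norm(loc=m, scale=np.sqrt(1.0 + v)).logpdf(y[t])
    v_new = 1.0 / (1.0 / v + 1.0)     # posterior variance after observing y_t
    m = v_new * (m / v + y[t])        # posterior mean
    v = v_new
# log_mdd and log_scores agree up to floating-point error.
```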
5. Working with Multiple Models
• Application: DSGE model with and without financial frictions.
• Food for thought:
• Bayesian model averaging essentially assumes that the model space is complete. Is it?
• Time-varying model weights can be a stand-in for nonlinear macroeconomic dynamics.
Reference: Del Negro, Hasegawa, and Schorfheide (2016, JoE)
5. A Stylized Framework
• Consider a principal-agent setting to separate the task of estimating models from the task of combining them.
• Agents Mm = econometric modelers:
• provide the principal with predictive densities p(yt+1|Imt, Mm);
• are rewarded based on the realized value of ln p(yt+1|Imt, Mm) (induces truth-telling).
• Imt is the model-specific information set.
• Principal P = policy maker who aggregates the information obtained from the modelers:
p(yt+1|λ, IPt, P) = λ p(yt+1|I1t, M1) + (1 − λ) p(yt+1|I2t, M2),
where IPt = {y1:t, {p(yτ|Imτ−1, Mm)}tτ=1, m = 1, 2}.
5. Bayesian Model Averaging (BMA): λ ∈ {1, 0}
• At any time T the policy maker can use the predictive densities to form marginal likelihoods:
p(Y1:T|Mi) = Π_{t=1}^T p(yt|Y1:t−1, Mi)
• ... and use them to update model probabilities:
λBMAT = P[λ = 1|Y1:T] = P[M1 is correct]
= λBMA0 p(Y1:T|M1) / [λBMA0 p(Y1:T|M1) + (1 − λBMA0) p(Y1:T|M2)]
• Predictive density:
pBMA(yt+1|IPt, P) = λBMAt p(yt+1|Y1:t, M1) + (1 − λBMAt) p(yt+1|Y1:t, M2)
5. BMA and Model Misspecification
• BMA is based on the assumption that the model space contains the ‘true’ model (“complete model space”):
p(y1:T|λ, P) = { p(y1:T|M1) = Π_{t=1}^T p(yt|Y1:t−1, M1)  if λ = 1
              { p(y1:T|M2) = Π_{t=1}^T p(yt|Y1:t−1, M2)  if λ = 0
[Figure: the DGP p(Y1:T) and its KL discrepancy to p(Y1:T|M1) and p(Y1:T|M2).]
• λBMAT →a.s. 1 or 0 as T → ∞ (Dawid 1984, others): Asymptotically, no model averaging! All the weight is on the model closest in KL discrepancy.
5. Optimal (Static) Pools: λ ∈ [0, 1]
• A policy maker concerned about misspecification of Mi could create convex combinations of predictive densities:
[Figure: the DGP p(Y1:T) lying between p(Y1:T|M1) and p(Y1:T|M2).]
p(Y1:T|λ, P) = Π_{t=1}^T [λ p(yt|Y1:t−1, M1) + (1 − λ) p(yt|Y1:t−1, M2)]
• λSPT = argmax_{λ∈[0,1]} p(y1:T|λ, P) generally does not converge to 1 or 0 (unless one of the models is correct): Exploits gains from diversification.
References: Hall and Mitchell (2007), Geweke and Amisano (2011)
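Given sequences of one-step-ahead predictive densities from two models, the static-pool weight solves a one-dimensional optimization. A sketch with simulated predictive densities standing in for actual model output:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(7)

# Illustrative DGP and two misspecified "models" (predictive densities):
T = 500
y = rng.normal(0.0, 1.5, size=T)       # DGP: N(0, 1.5^2)
p1 = norm(0.0, 1.0).pdf(y)             # M1 predictive density: variance too small
p2 = norm(0.0, 2.0).pdf(y)             # M2 predictive density: variance too large

def neg_log_score(lam):
    """Minus the pooled log predictive score (convex in lam)."""
    return -np.log(lam * p1 + (1.0 - lam) * p2).sum()

res = minimize_scalar(neg_log_score, bounds=(0.0, 1.0), method="bounded")
lam_sp = res.x   # interior weight: the pool diversifies rather than picking one model
```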
5. Dynamic Pools - Prior for Weights λ1:T
• Dynamic pool: replace λ by a sequence λt.
• Likelihood function:
p(y1:T|λ1:T, P) = Π_{t=1}^T [λt p(yt|y1:t−1, M1) + (1 − λt) p(yt|y1:t−1, M2)].
• Prior p(λ1:T|ρ) for the sequence λ1:T:
xt = ρ xt−1 + √(1 − ρ²) εt,  εt ∼ iid N(0, 1),  x0 ∼ N(0, 1),
λt = Φ(xt),
where Φ(·) is the Gaussian CDF.
• Unconditionally, λt ∼ U[0, 1] for all t.
• Hyperparameter ρ controls the amount of “smoothing.”
• As ρ → 1: dynamic pool → static pool.
• Specify a prior distribution for ρ (and other hyperparameters) and base results on the (real-time) posterior distribution.
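The uniformity claim follows from the probability integral transform: the AR(1) with innovation variance 1 − ρ² keeps xt ∼ N(0, 1) at every t, and Φ(xt) of a standard normal is U[0, 1]. A quick simulation check:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(11)

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

rho, T, M = 0.9, 200, 5000             # M independent sample paths
x = rng.standard_normal(M)             # x_0 ~ N(0, 1)
for _ in range(T):
    # innovation variance 1 - rho^2 keeps Var(x_t) = 1 at every t
    x = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(M)

lam = np.vectorize(Phi)(x)             # lambda_T = Phi(x_T) across paths
# lam should look U[0, 1]: mean 1/2, variance 1/12
```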
5. Dynamic Pools - Nonlinear State Space System
• Measurement equation:
p(yt|λt, P) = λt p(yt|y1:t−1, M1) + (1 − λt) p(yt|y1:t−1, M2)
• Transition equation:
λt = Φ(xt),  xt = ρ xt−1 + √(1 − ρ²) εt,  εt ∼ iid N(0, 1)
• Use a particle filter to construct the sequence p(λt|ρ, IPt, P).
5. Application
• Two models: Smets-Wouters and Smets-Wouters with financial frictions
• Track relative performance over time and construct real-time weights
5. Log Scores Comparison: SWFF vs SWπ
[Figure: log predictive scores p(yt+h,h|Imt, Mm) over time for the SWFF and SWπ models.]
5. Dynamic Pools – Posterior p(h)DP(λt|IPt, P)
[Figure: real-time posterior of λt; ρ ∼ U[0, 1], µ = 0, σ = 1.]
To Recap...
• Too little time to provide a detailed survey of state-of-the-art Bayesian methods.
• Instead: an eclectic collection of ideas and insights related to:
1 Model Development
2 Identification
3 Priors
4 Computations
5 Working with Multiple Models