Transcript of lecture slides: Econometrics, EPFL, 2011
Master in Financial Engineering
Econometrics
Prof. Loriano Mancini
Swiss Finance Institute at EPFL
First semester
Slides version: September 2011
Information about the course
• Material: slides, exercises, data, etc., at
http://sfi.epfl.ch/mfe → “Courses online” → “Mancini E”
Username: StudMFE
Password: Fall2011
Also, register at http://is-academia.epfl.ch
• Book: “Econometric Analysis”, sixth edition, W. Greene, Prentice Hall, 2008
• Assignments: each week, due the following Monday, groups of at most 3 persons
• Exams: written, closed book, closed notes; one hand-written A4 page of notes allowed
• Grade: 30% homework, 30% midterm exam, 40% final exam
• Assistants: Benjamin Junge (E-mail: [email protected])
Emmanuel Leclercq (E-mail: [email protected])
1
Information about the course
• Exercise sessions start on October 3rd
i.e. no exercise session on September 26th
• Prerequisites: W. Greene “Econometric Analysis” book
Appendix A on matrix algebra
Appendix B on probability and distributions
Appendix D on Laws of Large Numbers and Central Limit Theorems
2
Agenda of the course
• Linear regression model
• Generalized regression model
• Panel data model
• Instrumental variables
• Generalized method of moments
• Maximum likelihood estimation
• Hypothesis testing
3
Chapter 2: Econometric model
Econometrics: intersection of Economics and Statistics
Econometric model = association between yi and xi
E.g.: stock return yi (IBM) and market return xi (S&P 500 index)
Econometric model provides “approximate” description of the association
The relation will be stochastic and not deterministic
Econometric model provides probabilistic description of the association
Model: yi = f(xi) + εi
4
Linear regression model
yi = f(xi1, . . . , xiK) + εi = xi1β1 + · · · + xiKβK + εi
yi: dependent or explained variable
xi: regressors or covariates or explanatory variables
εi: error term or random disturbance
Each observation in a sample yi, xi1, . . . , xiK, i = 1, . . . , n, comes from
yi = xi1β1 + · · · + xiKβK (the “deterministic” part) + εi (the random part)
Goal: estimate β1, . . . , βK
5
Assumptions of the linear regression model
Assumptions on the data generating process
1. Linearity: linear relationship between yi and xi1, . . . , xiK
2. Full rank: X = [x1, . . . , xK] is an n×K matrix with rank K
3. Exogeneity of the independent variables: E[εi|xj1, . . . , xjK] = 0, ∀i, j
4. Homoscedasticity and nonautocorrelation: Var[εi|X] = σ2, i = 1, . . . , n, and
Cov[εi, εj|X] = 0, ∀ i ≠ j
5. Data generation: X can include constants and random variables
6. Normal distribution: ε|X ∼ N(0, σ2I)
Assumptions 4 and 6 simplify life but are too restrictive and will be relaxed
6
Linearity of the regression model
The same linear model holds for all n observations (yi, xi1, . . . , xiK), i = 1, . . . , n
y = x1β1 + · · ·+ xKβK + ε = Xβ + ε
Notation: y is an n× 1 vector; X = [x1, . . . , xK] is an n×K matrix;
ε is an n× 1 vector; β is a K × 1 vector
In the design matrix X: columns are variables, rows are observations
E.g. for the i-th observation: yi = x′i β + εi
Remark: we are modeling E[y|X] = Xβ, as E[ε|X] = 0 by assumption
Linearity refers to β and ε, not X
E.g. g(yi) = β h(xi) + εi is a linear model for any function g and h
7
Error term ε
By assumption E[ε|X] = 0 =⇒ E[ε] = 0
Note: εi does not depend on any xj, neither past nor future xs
Let X̄ = E[X]. By the “tower property” or “law of iterated expectations”
Cov[ε, X] = E[ε(X − X̄)] = EX[E[ε(X − X̄)|X]] = EX[E[ε|X] (X − X̄)] = 0, since E[ε|X] = 0
E[ε|X] = 0 implies E[y|X] = Xβ, i.e. Xβ is the conditional mean of y|X
Our analysis is conditional on design matrix X which can be stochastic
8
Spherical error term ε
Assumptions:
Homoscedasticity Var[εi|X] = σ2, i = 1, . . . , n
Nonautocorrelation Cov[εi, εj|X] = 0, ∀ i ≠ j
In short: E[ε ε′|X] = σ2I
9
Data generating process for the regressors
X may include constants and random variables
“Golden rule”: include a column of 1s in X
Crucial assumption: ε ⊥ X
10
Chapter 3: Least squares
Regression model: yi = x′i β + εi
Goal: statistical inference on β, e.g. estimate β
Population quantities, not observed: E[yi|xi] = x′i β, β, εi
Sample quantities, estimated from sample data: ŷi = x′i b, b, ei
11
Least squares estimator
The least squares estimator b minimizes the sum of squared residuals:
b = arg min over b0 of S(b0)
S(b0) := ∑_{i=1}^n (yi − x′i b0)² = ∑_{i=1}^n e²i0 = e′0 e0 = (y − Xb0)′(y − Xb0)
= y′y − 2y′Xb0 + b′0X′Xb0
12
Least squares estimator: normal equations
Necessary condition for a minimum:
∂S(b0)/∂b0 = ∂(y′y − 2y′Xb0 + b′0X′Xb0)/∂b0 = −2X′y + 2X′Xb0 = 0
Let b be the solution; the normal equations are
X′Xb = X′y
By assumption X has full column rank, so
b = (X′X)⁻¹X′y
Since X has full column rank, the second-derivative matrix ∂²S(b0)/∂b0∂b′0 = 2X′X is positive definite, so b is indeed a minimum
13
Example: regression with simulated data
DGP: yi = x′i β + εi, with x′i = [1 x2i], x2i ∼ U[0, 1], εi ∼ N(0, 1),
i = 1, . . . , 100, β = [1 2]′; in this sample b = [1.01 2.07]′
[Figure: scatter of (x2i, yi) with the true and the estimated regression lines; x-axis: x2i = Uniform[0,1], y-axis: yi = x′i β + εi]
14
Algebraic aspects of the least squares solution
“Golden rule”: include a column of 1s in X
Normal equations: 0 = X′y − X′Xb = X′(y − Xb) = X′e
If the first column of X is a column of 1s, x1 = i, then the first normal equation is
0 = x′1 e = [1 · · · 1] e = ∑_{i=1}^n ei = ∑_{i=1}^n (yi − x′i b)
Implications:
1. Least squares residuals have zero mean: ē = 0
2. The estimated regression line passes through the means of the data: ȳ = x̄′b
3. The mean of the fitted values ŷi equals the mean of the actual data ȳ
None of these implications holds if X does not include a column of 1s
15
Projection
Estimated residuals e = y −Xb, LS estimator b = (X ′X)−1X ′y
e = y − Xb = y − X(X′X)⁻¹X′y = (I − X(X′X)⁻¹X′)y = My
M is called the residual maker matrix, as My = e
As MX = 0, e′ŷ = y′M Xb = 0: LS partitions y into two orthogonal parts
y = Xb + e = ŷ + e
P is called the projection matrix:
ŷ = y − e = (I − M)y = X(X′X)⁻¹X′y = Py
16
Properties of M and P matrices
M and P are symmetric, idempotent and orthogonal (PM = MP = 0)
Orthogonal decomposition of y
y = Xb + e = Py + My = projection + residual
Pythagorean theorem:
y′y = y′P′Py + y′M′My = ŷ′ŷ + e′e
17
Partitioned regression
E.g. regression model: income = β0 + β1 age + β2 education + error
Goal: study income–education association; age is a control variable
Model: y = Xβ + ε = X1β1 + X2β2 + ε, solve normal equations for b2
b2 = (X′2 M1 X2)⁻¹ X′2 M1 y
= (X′2 M′1 M1 X2)⁻¹ X′2 M′1 M1 y
= (X*′2 X*2)⁻¹ X*′2 y*
M1 is the residual maker matrix based on the columns of X1
b2 is obtained by regressing y* on X*2
y* (resp. X*2) are the residuals from a regression of y (resp. X2) on X1
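A small numerical check of this partitioned-regression (Frisch–Waugh–Lovell) result, assuming numpy; the income/age/education DGP below is purely illustrative:

import numpy as np

rng = np.random.default_rng(1)
n = 200
age = rng.uniform(20, 60, n)
educ = rng.uniform(8, 20, n) + 0.1 * age          # education correlated with age
X1 = np.column_stack([np.ones(n), age])           # constant and control (age)
X2 = educ.reshape(-1, 1)
y = 5 + 0.3 * age + 1.2 * educ + rng.standard_normal(n)

X = np.column_stack([X1, X2])
b_full = np.linalg.lstsq(X, y, rcond=None)[0]     # full regression; b2 is the last element

M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T   # residual maker matrix of X1
y_star, X2_star = M1 @ y, M1 @ X2
b2 = np.linalg.lstsq(X2_star, y_star, rcond=None)[0]    # regress residuals on residuals
print(b_full[-1], b2[0])                                 # identical up to rounding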
18
Partial correlation
When education ↑ ⇒ income ↑, but education and age both ↑ in time
What is the net effect of education on income?
Partial correlation, r∗yz:
y∗ = residuals in a regression of income on a constant and age
z∗ = residuals in a regression of education on a constant and age
r∗yz = simple correlation between y∗ and z∗
19
Goodness of fit
Goal: measure how much variation in y is explained by variation in x
Suppose ȳ = x̄ = 0. Recall ŷi ⊥ ei, for each observation i
yi = ŷi + ei
∑_{i=1}^n y²i = ∑_{i=1}^n ŷ²i + ∑_{i=1}^n e²i
SST = SSR + SSE
Good regression model: SST ≈ SSR, hence SSE ≈ 0
20
Goodness of fit (when means ≠ 0)
When ȳ, x̄ ≠ 0, consider deviations from the means
yi − ȳ = ŷi − ȳ + ei = (x′i − x̄′)b + ei
∑_{i=1}^n (yi − ȳ)² = ∑_{i=1}^n ((x′i − x̄′)b)² + ∑_{i=1}^n e²i
Define M⁰ = [I − ii′/n], n × n, symmetric, idempotent; i′ = [1 · · · 1]
M⁰ transforms observations into deviations from sample means, M⁰y = y − iȳ
y′M⁰′M⁰y = b′X′M⁰′M⁰Xb + e′e
y′M⁰y = b′X′M⁰Xb + e′e
SST = SSR + SSE
21
Coefficient of determination, R2
R² = SSR/SST = (b′X′M⁰Xb)/(y′M⁰y) = 1 − (e′e)/(y′M⁰y)
Properties:
R2 measures the linear association between X and y
0 ≤ R2 ≤ 1, as 0 ≤ SSR ≤ SST
R2 ↑ when a regressor is added, from X = [x1 · · ·xK] to X = [x1 · · ·xK+1]
Adjusted R² = 1 − [e′e/(n−K)] / [y′M⁰y/(n−1)]
Remark: X should include a column of 1s ⇒ M0e = e and e ⊥ X
22
Chapter 4: Statistical properties of LS estimators
LS estimator enjoys various good statistical properties:
1. Easy to compute
2. Explicit use of model assumptions
3. Optimal linear predictor
4. Most efficient, under certain conditions
23
Orthogonality conditions
Assumptions: X stochastic or not, linear model, E[εi|X] = 0 =⇒
E[xi εi] = EX[E[xi εi|X]] = EX[xi E[εi|X]] = 0 = EX[xi E[(yi − x′iβ)|X]]
which implies the population orthogonality conditions:
EX[E[xi yi|X]] = EX[E[xi x′iβ|X]]
E[xi yi] = E[xi x′i]β
The LS normal equations are the sample counterpart of the orthogonality conditions:
X′y = X′X b
(1/n) ∑_{i=1}^n xi yi = (1/n) ∑_{i=1}^n xi x′i b
24
Optimal linear predictor
Goal: find linear function of xi, x′iγ, that minimizes MSE
MSE = E[(yi − x′iγ)²]
= E[(yi − E[yi|X] + E[yi|X] − x′iγ)²]
= E[(yi − E[yi|X])²] + E[(E[yi|X] − x′iγ)²]
min over γ of MSE = min over γ of E[(E[yi|X] − x′iγ)²]
First order condition: 0 = −2E[xi(E[yi|X] − x′iγ)], i.e.
E[xi yi] = E[xi x′i]γ
which are the LS normal equations
Implicit assumption: all these expectations exist, i.e. E[·] <∞
25
Unbiased estimation
LS estimator is unbiased in every sample:
b = (X ′X)−1X ′y = (X ′X)−1X ′(Xβ + ε) = β + (X ′X)−1X ′ε
Using law of iterated expectations, and assumption E[ε|X] = 0
E[b] = EX[E[β + (X ′X)−1X ′ε|X]]
= β + EX[(X ′X)−1X ′E[ε|X]]
= β
26
Monte Carlo simulation: b2 slope estimates
DGP: yi = x′i β + εi, with x′i = [1 x2i], x2i ∼ U[0, 1], εi ∼ N(0, 1),
i = 1, . . . , 100, β = [1 2]′; repeat simulation and estimation 1,000 times
[Figure: histogram (frequency) of the 1,000 slope estimates b2, centered around 2]
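A sketch of this Monte Carlo exercise in numpy (illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 1000
beta = np.array([1.0, 2.0])
b2 = np.empty(reps)
for r in range(reps):
    X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
    y = X @ beta + rng.standard_normal(n)
    b2[r] = np.linalg.solve(X.T @ X, X.T @ y)[1]
print(b2.mean(), b2.std())   # mean close to 2: b is unbiased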
27
Variance of LS estimator
LS estimator is linear in ε: b = β + (X ′X)−1X ′ε
Easy to derive variance of linear estimator:
Var[b|X] = E[(b− β)(b− β)′|X]
= E[(X ′X)−1X ′ε ε′X(X ′X)−1|X]
= (X ′X)−1X ′E[ε ε′|X]X(X ′X)−1
= (X ′X)−1X ′(σ2I)X(X ′X)−1
= σ2(X ′X)−1
Note: assumption of spherical errors, Var[ε|X] = σ2I, is crucial
28
Gauss–Markov theorem
Any linear unbiased estimator b0 = Cy, where C is a K × n matrix
Unbiasedness: E[Cy|X] = E[CXβ + Cε|X] = CXβ = β for all β ⇒ CX = I
Define D = C − (X′X)⁻¹X′, i.e. C = D + (X′X)⁻¹X′ ⇒ CX = I = DX + (X′X)⁻¹X′X = DX + I ⇒ DX = 0
Var[b0|X] = C Var[y|X] C′ = C Var[ε|X] C′ = σ²CC′
= σ²(D + (X′X)⁻¹X′)(D + (X′X)⁻¹X′)′
= σ²(D + (X′X)⁻¹X′)(D′ + X(X′X)⁻¹)
= σ²DD′ + σ²(X′X)⁻¹ = σ²DD′ + Var[b|X]
= Var[b|X] + a nonnegative definite matrix
LS estimator is BLUE (whether X is nonstochastic or stochastic)
29
Estimating the variance of LS estimator
Estimate σ2 in Var[b|X] = σ2(X ′X)−1 ⇒ use ei sample analog of εi
But ei = yi − x′ib = εi − x′i(b − β) is an “imperfect estimate” of εi
Sample residual: e = My = M(Xβ + ε) = Mε, as MX = 0
E[e′e|X] = E[ε′Mε|X] = E[tr(ε′Mε)|X] = E[tr(Mεε′)|X]
= tr(M E[εε′|X]) = tr(M)σ²
= tr(I − X(X′X)⁻¹X′)σ² = (tr(In) − tr(X′X(X′X)⁻¹))σ²
= (tr(In) − tr(IK))σ² = (n − K)σ²
Unbiased estimator of σ² (conditionally on X and unconditionally):
s² = e′e/(n − K) = ∑_{i=1}^n e²i /(n − K)
30
Normality of LS estimator
Assumption ε|X ∼ N(0, σ2I), and linearity b = β + (X ′X)−1X ′ε⇒
joint normality of b (multivariate normal distribution)
b|X ∼ N(β, σ2(X ′X)−1)
and each slope, bk, is normally distributed
bk|X ∼ N(βk, σ²[(X′X)⁻¹]kk)
Note: exact distribution in finite samples
31
Distribution of b2 slope estimates: simulation
DGP: yi = [1 x2i] [1 2]′ + εi, with εi ∼ N(0, 1), 1,000 estimations
Comparison between simulated and true normal density of b2
0 0.5 1 1.5 2 2.5 3 3.50
0.2
0.4
0.6
0.8
1
1.2
Den
sity
b2
32
Hypothesis testing on a coefficient
As b|X ∼ N(β, σ2(X ′X)−1)
(bk − βk)/√(σ²[(X′X)⁻¹]kk) ∼ N(0, 1)
Unfortunately σ² is not known but estimated via s². Useful statistic:
[(bk − βk)/√(σ²[(X′X)⁻¹]kk)] / √([e′e/σ²]/(n−K)) ∼ N(0, 1)/√(χ²(n−K)/(n−K)) ∼ t-Student(n−K)
Note: σ2 is unknown but cancels in the ratio above
Need to show: e′e/σ² ∼ χ²(n−K) and e′e is independent of bk
33
χ2 distribution of e′e
Recall: M is residual maker matrix, e = My = Mε as MX = 0
As ε|X ∼ N(0, σ²I) ⇒ ε/σ|X ∼ N(0, I)
e′e/σ² = (ε/σ)′ M (ε/σ)
which is an idempotent quadratic form in ε/σ; by Appendix B.11.4
(ε/σ)′ M (ε/σ) ∼ χ² with rank(M) degrees of freedom
where rank(M) = tr(M) = n − K
34
Independence of b and e′e
To show independence between
(b − β)/σ = (X′X)⁻¹X′(ε/σ) = L(ε/σ) ∼ N(0, LL′)
and
e′e/σ² = (ε/σ)′M′M(ε/σ),
it suffices to show that LM = 0, because this implies, conditional on X,
Cov(L ε/σ, M ε/σ) = E[L (ε/σ)(ε/σ)′ M′] = L (σ²I/σ²) M′ = LM
= (X′X)⁻¹X′(I − X(X′X)⁻¹X′) = 0
which implies independence as ε|X ∼ N
35
Significance of a coefficient: t-statistic
Common test H0 : βk = 0
tk = t-statistic = [(bk − 0)/√([(X′X)⁻¹]kk)] / √([e′e]/(n−K)) = bk/√(s²[(X′X)⁻¹]kk) ∼ t-Student(n−K)
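A sketch of this t-statistic in Python (assuming numpy and scipy; the simulated data mimic the earlier example and are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, K = 100, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - K)                          # unbiased estimate of sigma^2
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t = b / se                                    # t-statistics for H0: beta_k = 0
pval = 2 * stats.t.sf(np.abs(t), df=n - K)    # two-sided p-values
print(t, pval)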
36
Example: Significance of a coefficient
True β = [1 2]′, estimate b = [1.01 2.07]′, n = 100, K = 2
Is b2 statistically different from zero?
[Figure: left panel, scatter of (x2i, yi) with true and estimated regression lines; right panel, density of b/(s²(X′X)⁻¹)^(1/2), compared with the t-Student and Normal densities]
37
Confidence intervals for parameters
Point estimates are useless without confidence intervals or standard errors
Use the t-Student distribution
(bk − βk)/√(s²[(X′X)⁻¹]kk) ∼ t-Student(n−K)
to set confidence intervals:
Pr(bk − tα/2 sbk ≤ βk ≤ bk + tα/2 sbk) = 1 − α
where tα/2 is the t-Student quantile (e.g. α = 0.05) and sbk = √(s²[(X′X)⁻¹]kk)
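A short self-contained sketch of the confidence interval computation (assuming numpy and scipy; data simulated for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, K, alpha = 100, 2, 0.05
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
se = np.sqrt(e @ e / (n - K) * np.diag(np.linalg.inv(X.T @ X)))
t_crit = stats.t.ppf(1 - alpha / 2, df=n - K)               # t-Student quantile
print(np.column_stack([b - t_crit * se, b + t_crit * se]))  # 95% confidence intervals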
38
Significance of the regression
Common test H0 : β2 = · · · = βK = 0 (except intercept)
or equivalently H0 : R2 = 0
F -test statistic:
F[K−1, n−K] = [R²/(K−1)] / [(1−R²)/(n−K)] ∼ [χ²(K−1)/(K−1)] / [χ²(n−K)/(n−K)]
R2 ≈ 1 ⇒ large F ⇒ reject H0
39
Marginal distribution of test statistics
Under H0 : βk = β⁰k, and conditionally on X, bk|X ∼ N(β⁰k, σ²[(X′X)⁻¹]kk)
Unconditionally, bk ∼ ?: hard to find, it depends on the distribution of X
Key property: the t-statistic
tk = t|X = (bk − β⁰k)/√(s²[(X′X)⁻¹]kk) ∼ t-Student(n−K)
but t-Student(n−K) does not depend on X ⇒ unconditionally
tk ∼ t-Student(n−K)
40
Multicollinearity
Multicollinearity = variables in X are linearly dependent ⇒ X ′X is singular
In practice, variables in X are often close to being linearly dependent
“Symptoms” of multicollinearity:
• Small changes in data produce large changes in b
• Var[b|X] very large (⇒ t-statistic close to zero) but R2 is high
• Coefficient estimates “wrong” sign or implausible
41
Multicollinearity: analysis
Demeaned variables, X = [X(k) xk], where xk (n × 1) is the k-th variable
Use Appendix A.5.3 on the inverse of a partitioned matrix:
Var[bk|X] = σ²[(X′X)⁻¹]kk
= σ² (x′k xk − x′k X(k)(X′(k)X(k))⁻¹X′(k) xk)⁻¹
= σ² (x′k xk [1 − x′k X(k)(X′(k)X(k))⁻¹X′(k) xk / (x′k xk)])⁻¹
= σ² (x′k xk [1 − x′k P(k) xk / (x′k xk)])⁻¹
= σ² (x′k xk [1 − x̂′k x̂k / (x′k xk)])⁻¹    (x̂k = P(k) xk, the fitted values of xk on X(k))
= σ² (x′k xk [1 − R²k.])⁻¹
= σ² / (x′k xk [1 − R²k.])
42
Multicollinearity: interpretation
Hence, as the column variables in X are demeaned,
Var[bk|X] = σ² / (x′k xk [1 − R²k.]) = σ² / (∑_{i=1}^n (xik − x̄k)² [1 − R²k.])
where R²k. is the R² from the regression of xk on X(k) (i.e. X \ xk)
Var[bk|X] ↑ when
• R²k. → 1, i.e. multicollinearity
• ∑_{i=1}^n (xik − x̄k)² → 0
• σ² ↑, i.e. ↑ dispersion of yi around the regression line
43
Large sample properties of LS estimator
ε|X ∼ N is a strong assumption and can be relaxed, but now
Assumption 5a (DGP of X):
• (xi, εi) i = 1, . . . , n, sequence of independent observations
• plimX ′X/n = Q positive definite matrix
Notation: plim = probability limit, i.e. convergence in probability
plim Zn = Z stands for
lim as n → ∞ of Pr(|Zn − Z| > ε) = 0, ∀ ε > 0
where Z can be either random or constant
44
Consistency of LS estimator
Consistency means plim b = β
Highly desirable property of any estimator
Recall: distribution of ε|X is unknown
b = β + (X′X/n)⁻¹ (X′ε/n)
plim b = β + plim (X′X/n)⁻¹ plim (X′ε/n)
= β + Q⁻¹ plim (X′ε/n)
If plim X′ε/n = 0, then b is consistent
45
Random term X ′ε/n
E[X′ε/n] = EX[E[X′ε/n | X]] = (1/n) ∑_{i=1}^n EX[xi E[εi|X]] = 0
Var[X′ε/n] = E[Var[X′ε/n | X]] + Var[E[X′ε/n | X]]
= E[(1/n²) X′E[εε′|X]X] = (σ²/n) E[X′X/n] = (σ²/n) Q    (the second term is zero since E[X′ε/n | X] = 0)
As E[X′ε/n] = 0 and lim Var[X′ε/n] = 0 as n → ∞,
X′ε/n → 0 in mean square ⇒ plim X′ε/n = 0
Remark: Var[X ′ε/n] decays as 1/n
46
Example: convergence of X ′ε/n
xi ∼ U[−0.5, 0.5], σ² = 2, hence Var[X′ε/n] = (σ²/n) E[∑_{i=1}^n x²i /n]
[Figure: simulated densities of X′ε/n for n = 10, 50, 100; the density concentrates around 0 as n grows]
47
Asymptotic distribution of LS estimator
Key idea: stabilize the distribution of X ′ε/n
Recall: Var[X ′ε/n] decays as 1/n
Var[√n X′ε/n] = n Var[X′ε/n] ∈ O(1)
√n(b − β) = (X′X/n)⁻¹ √n X′ε/n
⟶ Q⁻¹ × asymptotic distribution of √n X′ε/n
48
Random term √n X′ε/n
Recall: E[√n X′ε/n] = 0; (xi, εi) independent; regressors well behaved
Var[√n X′ε/n] = (1/n) Var[X′ε] = (1/n) Var[∑_{i=1}^n xi εi]
= (1/n) ∑_{i=1}^n Var[xi εi] = (1/n) ∑_{i=1}^n σ² E[xi x′i] = σ²Q
By the Central Limit Theorem: √n X′ε/n →d N(0, σ²Q)
√n(b − β) →d Q⁻¹ × N(0, σ²Q) =d N(0, σ²Q⁻¹)
b ∼a N(β, (σ²/n) Q⁻¹)
49
Asymptotic normality of LS estimator
If regressors well behaved and observations independent, then
asymptotic normality of LS estimator follows from CLT, not ε|X ∼ N
In practice, in b ∼a N(β, (σ²/n) Q⁻¹):
Q is estimated by X′X/n
σ² is estimated by s² = e′e/(n−K) (as plim s² = σ²)
If ε|X ∼ N(0, σ2I), then b ∼ N(β, σ2(X ′X)−1) for every sample size n
50
Asymptotic dist. of nonlinear function: Delta method
f(b): J possibly nonlinear C¹ functions
∂f(b)/∂b′ =: C(b)   (J × K)
Goal: find the asymptotic distribution of f(b)
Slutsky theorem: plim f(b) = f(plim b) = f(β), and plim C(b) = C(β)
First order Taylor expansion (remainder negligible if plim b = β):
f(b) = f(β) + C(β)(b − β) + remainder
f(b) ∼a N(f(β), C(β) (σ²/n) Q⁻¹ C(β)′)
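A minimal sketch of the delta method for a scalar nonlinear function, e.g. f(b) = b2/b1 (the function and data are illustrative, not from the slides), assuming numpy:

import numpy as np

rng = np.random.default_rng(5)
n, K = 100, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
V = e @ e / (n - K) * np.linalg.inv(X.T @ X)    # estimated Var[b|X]

f = b[1] / b[0]                                  # nonlinear function of b
C = np.array([-b[1] / b[0]**2, 1.0 / b[0]])      # gradient df/db evaluated at b
var_f = C @ V @ C                                # delta-method variance
print(f, np.sqrt(var_f))                         # estimate and its standard error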
51
t-Statistic: remark
To test H0 : βk = 0, use the t-statistic tk = bk/√(s²[(X′X)⁻¹]kk)
If in finite samples ε|X ∼ N, then tk ∼ t-Student(n−K)
If ε|X is not normal in finite samples, then tk ∼ N(0, 1) only asymptotically
[Figure: density of the t-statistic b/(s²(X′X)⁻¹)^(1/2) for n = 10, compared with the Normal and t-Student densities]
52
Missing observations
Common issue in applied work
• Missing at random: least serious case, just discard those observations, sample
size reduced
• Not missing at random: most difficult case, selection bias, mechanism should
be studied
Read Chapter 4.8.2
53
Chapter 5: Inference
Goal: test implications of economic theory
Example: unrestricted model of investment, It,
ln It = β1 + β2 it + β3∆pt + β4 lnYt + β5 t + εt
where it nominal interest rate, ∆pt inflation rate, Yt real output
H0 : “investors care only about real interest rate, (it −∆pt)”
⇒ restricted (or nested) model of investment:
ln It = β1 + β2(it −∆pt) + β4 lnYt + β5 t + εt
⇒ β3 = −β2 ⇒ β2 + β3 = 0, in the unrestricted model
54
Linear restrictions
In the linear regression model, y = Xβ + ε, consider J linear restrictions
Rβ = q
R is J ×K and usually J ≪ K
Example: β = (β1 β2 β3 β4)′
1. H0 : β2 = 0 tested with R = (0 1 0 0) and q = 0
2. H0 : β2 = β3 = β4 = 0 tested with
R = [0 1 0 0; 0 0 1 0; 0 0 0 1]   and q = (0 0 0)′
55
Two approaches to testing hypothesis
1. Fit unrestricted model and check whether estimates satisfy restrictions
2. Fit restricted model and check loss of fit (in terms of R2)
The two approaches are equivalent in the linear regression model
Working assumption: ε|X ∼ N(0, σ2I) (to be relaxed)
56
Approach 1: discrepancy vector
Null hypothesis: J linear restrictions, R is J ×K
H0 : Rβ − q = 0
Alternative hypothesis:
H1 : Rβ − q ≠ 0
Discrepancy vector, m = Rb− q, will not be exactly zero (most likely)
Decide whether m is not exactly zero because of
(a) sampling variability (do not reject H0)
(b) or restrictions are not satisfied by the data (reject H0)
57
Wald criterion
Under H0 : Rβ − q = 0, discrepancy vector m = Rb− q
E[m|X] = RE[b|X]− q = Rβ − q = 0
Var[m|X] = Var[Rb− q|X] = RVar[b|X]R′ = σ2R(X ′X)−1R′
Recall, as ε|X ∼ N(0, σ2I) by assumption, b|X ∼ N(β, σ2(X ′X)−1)
=⇒ m|X ∼ N(0, σ2R(X ′X)−1R′)
Wald statistic:
W = m′ (Var[m|X])⁻¹ m = (Rb − q)′ (σ²R(X′X)⁻¹R′)⁻¹ (Rb − q) ∼ χ²(J)
χ2 distribution ⇐ Full Rank Gaussian Quadratic form, Appendix B.11.6
58
Wald statistic feasible and F -statistic
In the Wald statistic, need to get rid of unknown σ2
F = [(Rb − q)′ (σ²R(X′X)⁻¹R′)⁻¹ (Rb − q)/J] / ([e′e/σ²]/(n−K))
∼ [χ²(J)/J] / [χ²(n−K)/(n−K)] ∼ F(J, n−K)
• Numerator: under H0, (Rb − q)/σ = R(b − β)/σ = R(X′X)⁻¹X′ε/σ,
i.e. a standardized Gaussian quadratic form in R(X′X)⁻¹X′ε/σ ⇒ χ²(J)
• Denominator: standardized Gaussian quadratic form in Mε/σ ⇒ χ²(n−K)
As MX = 0, Cov(R(X′X)⁻¹X′ε/σ, Mε/σ) = 0 ⇒ numerator and denominator are independent
59
Hypothesis testing on a single coefficient
H0 : βk = β0 can be tested with “t-statistic”
t := [(bk − β⁰)/√(σ²[(X′X)⁻¹]kk)] / √([e′e/σ²]/(n−K)) ∼ N(0, 1)/√(χ²(n−K)/(n−K)) ∼ t-Student(n−K)
or with the linear restriction R = (0 · · · 0 1 0 · · · 0) and q = β⁰
F = [(bk − β⁰)(σ²[(X′X)⁻¹]kk)⁻¹(bk − β⁰)] / ([e′e/σ²]/(n−K)) ∼ χ²(1)/[χ²(n−K)/(n−K)] ∼ F(1, n−K)
As t2 = F the two tests are equivalent
60
Approach 2: restricted least squares
Fit of restricted model cannot be better than unrestricted model
Restricted LS:
b* = arg min over b0 of (y − Xb0)′(y − Xb0) subject to Rb0 = q
= b − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹(Rb − q)
(b* − b) = −(X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹(Rb − q)
e* are the residuals from the restricted model. Loss of fit due to the constraints:
e* = y − Xb* = y − Xb − X(b* − b) = e − X(b* − b)
e*′e* = e′e + (b* − b)′X′X(b* − b) ≥ e′e
e*′e* − e′e = (b* − b)′X′X(b* − b) = (Rb − q)′[R(X′X)⁻¹R′]⁻¹(Rb − q)
61
Loss of fit and F -statistic
F -statistic for H0 : Rβ = q
F = [(Rb − q)′ [R(X′X)⁻¹R′]⁻¹ (Rb − q)/J] / [e′e/(n−K)]
= [(e*′e* − e′e)/J] / [e′e/(n−K)]
= [(R² − R²*)/J] / [(1−R²)/(n−K)] ∼ F(J, n−K)
Recall R² = 1 − e′e/(y′M⁰y) and y′M⁰y does not depend on b. Similarly for R²*.
Special case: overall significance of the regression
β2 = . . . = βK = 0 (except intercept) ⇒ R²* = 0 with J = K − 1
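A sketch of the loss-of-fit F-test, comparing restricted and unrestricted residual sums of squares (assuming numpy and scipy; the DGP and the restriction that the last two slope coefficients are zero are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, K, J = 200, 4, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.standard_normal(n)

def ssr(Xm):
    b = np.linalg.lstsq(Xm, y, rcond=None)[0]
    e = y - Xm @ b
    return e @ e

ee_u = ssr(X)                        # unrestricted model
ee_r = ssr(X[:, :2])                 # restricted model drops the last two regressors
F = ((ee_r - ee_u) / J) / (ee_u / (n - K))
print(F, stats.f.sf(F, J, n - K))    # F-statistic and p-value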
62
Nonnormal disturbances and large sample tests
Drop assumption ε|X ∼ N which implies b|X ∼ N(β, σ2(X ′X)−1)
All previous tests hold asymptotically, when n→∞
Key ingredient: asymptotic distribution of b
b ∼a N(β, (σ²/n) Q⁻¹), where Q = plim (X′X/n)
Recall √n(b − β) →d N(0, σ²Q⁻¹), from the CLT
plim s² = σ², where s² = e′e/(n−K)
63
Example: limiting distribution of Wald statistic
If √n(b − β) →d N(0, σ²Q⁻¹) and H0 : Rβ − q = 0, then
√n(Rb − q) = √n R(b − β) →d N(0, σ²RQ⁻¹R′)
which implies
√n(Rb − q)′ (σ²RQ⁻¹R′)⁻¹ √n(Rb − q) →d χ²(J)
which has the same limiting distribution as W,
W = (Rb − q)′ (s²R(X′X)⁻¹R′)⁻¹ (Rb − q) →d χ²(J)
when plim s²(X′X/n)⁻¹ = σ²Q⁻¹. Note: in W all n’s cancel
Remark: W is only approximately distributed as χ²(J) in finite samples;
in practice n does not go to ∞
64
Testing nonlinear restrictions
Test H0 : c(β) = q, where c is J × 1 nonlinear functions
Apply delta method: first order Taylor expansion of c
c(β̂) ≈ c(β) + [∂c(β)/∂β′](β̂ − β)
Var[c(β̂)] ≈ [∂c(β)/∂β′] Var[β̂] [∂c(β)/∂β′]′
In ∂c(β)/∂β′ replace β by β̂
Wald statistic:
W = (c(β̂) − q)′ (estimated Var[c(β̂)])⁻¹ (c(β̂) − q) →d χ²(J)
65
Prediction
Prominent use of regression model
y0, x0 not in our sample, not observed. Predict y0 using
ŷ⁰ = E[y⁰|x⁰, X] = x⁰′b
as y⁰ = x⁰′β + ε⁰, and assuming that x⁰ is known
Forecast error: e⁰ = y⁰ − ŷ⁰ = (β − b)′x⁰ + ε⁰
Prediction variance: Var[e⁰|x⁰, X] = σ² + x⁰′[σ²(X′X)⁻¹]x⁰ > 0
Prediction interval at the (1 − λ) confidence level:
ŷ⁰ ± zλ/2 √(Var[e⁰|x⁰, X])
where zλ/2 is the λ/2-quantile of N(0, 1), e.g. λ = 0.05
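A sketch of this prediction interval (assuming numpy and scipy; the new observation x⁰ is illustrative, and σ² is replaced by s²):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, K, lam = 100, 2, 0.05
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - K)
XtX_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 0.7])                      # new observation, x0 known
y0_hat = x0 @ b
var_e0 = s2 * (1.0 + x0 @ XtX_inv @ x0)        # prediction variance
z = stats.norm.ppf(1 - lam / 2)
print(y0_hat - z * np.sqrt(var_e0), y0_hat + z * np.sqrt(var_e0))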
66
Prediction of y0 and x0
If x0 is known, prediction of y0
E[y0|x0, X] = x0′b
with Var[e0|x0, X]
If x0 is not known and needs to be predicted too, prediction of y0
Ex0E[y0|x0, X] = Ex0[x0′b|X]
depends on distribution of x0, usually unknown and computed by simulation,
with Var[e0|X] > Var[e0|x0, X]
67
Measure of predictive accuracy
Notation: yi realized values, ŷi predicted values, n⁰ number of predictions
• Not scale invariant:
– Root mean squared error: RMSE = √(∑i (yi − ŷi)²/n⁰)
– Mean absolute error: MAE = ∑i |yi − ŷi|/n⁰
• Scale invariant:
– Theil U statistic: U = √(∑i (yi − ŷi)²/n⁰) / √(∑i y²i /n⁰)
68
Chapter 6: Functional form
Very general functional form of regression model:
L independent variables: zi = [z1i · · · zLi]
K linearly independent functions of zi: f1i(zi) · · · fKi(zi)
g(yi) observable function of yi
usual assumptions on εi
The following model is still linear and can be estimated by LS:
g(yi) = β1 f1i(zi) + . . . + βK fKi(zi) + εi
= β1 x1i + . . . + βK xKi + εi
yi = x′i β + εi
69
Nonlinearity in variables
A linear model, e.g., yi = β1 + xi β2 + wi β3 + εi is typically enriched with
• dummy variables
• nonlinear functions of regressors (e.g. quadratic function)
• interaction terms (i.e. cross products)
yi = β1 + xi β2 + wi β3 + β4 di + β5 x²i + β6 xi wi + εi = x′i β + εi
where x′i = [1 xi wi di x²i xi wi] and the dummy variable
di = 1 if i ∈ D, 0 otherwise
70
Dummy variable
Easy to use: one dummy variable is one more column in X
To study various effects (treatment, grouping, seasonality, thresholds, etc.)
yi = β1 + x2i β2 + di β3 + εi
= (β1 + di β3) + x2i β2 + εi
= x′i β + εi
where di = 1 if i ∈ D, 0 otherwise
In this model the dummy variable “shifts” the intercept: β1 ←→ (β1 + β3)
71
Example: regression with dummy variable
yi = x′i β + εi, x′i = [1 x2i di], x2i ∼ U[0, 1], di = 1{x2i > 0.5}, εi ∼ N(0, 1),
i = 1, . . . , 100, β = [1 2 2]′; in this sample b = [0.99 2.13 1.96]′
[Figure: scatter of (x2i, yi) with true and estimated regression lines; the intercept shifts by β3 at x2i = 0.5; x-axis: x2i = Uniform[0,1], y-axis: yi = β1 + x2i β2 + β3 di + εi]
72
Structural break
Previous graph shows a structural break in the model
yi = β1 + x2i β2 + εi,           if x2i ≤ 0.5
yi = (β1 + β3) + x2i β2 + εi,    if x2i > 0.5
Structural change can be tested with F -test
Note: the break point is supposed to be known a priori
73
Testing for a structural break
Split the sample in two parts, according to potential structural break
nb observations on yb and Xb (nb × k) before potential structural break
na observations on ya and Xa (na × k) after potential structural break
• Unrestricted model, allowing for a potential structural break, βb ≠ βa:
[yb; ya] = [Xb 0; 0 Xa] [βb; βa] + [εb; εa]
• Restricted model, no structural break, β′ = [β′b β′a]:
βb = βa  ⇔  βb − βa = 0  ⇔  [Ik ⋮ −Ik] β = R β = 0
74
F -test for a structural break
H0 : R β = q, with q = 0, R = [Ik ⋮ −Ik], dim(R) = k × 2k, dim(β) = 2k × 1
F = [(Rb − q)′ [R(X′X)⁻¹R′]⁻¹ (Rb − q)/J] / [e′e/(n−K)] ∼ F(J, n−K)
where
J = k = number of restrictions = number of rows of R
n − K = (nb + na) − 2k = total number of observations minus dim(β)
Alternative ways exist to test for structural break (e.g., Wald statistic)
Typical issue: limited sample sizes before, nb, and/or after, na, the break
75
Chapter 7: Specification analysis
Implicit assumption: the model y = Xβ + ε is correct
Common model misspecification:
• Omission of relevant variables
• Inclusion of superfluous variables
76
Omitted relevant variables
True regression model: y = X1 β1 + X2 β2 + ε
Use wrong regression model: “y = X1 β1 + ε”
Regress y on X1 only:
b1 = (X′1X1)⁻¹X′1 y = (X′1X1)⁻¹X′1(X1 β1 + X2 β2 + ε)
= β1 + (X′1X1)⁻¹X′1X2 β2 + (X′1X1)⁻¹X′1 ε
E[b1|X] = β1 + (X′1X1)⁻¹X′1X2 β2
Unless X′1X2 = 0 or β2 = 0,
E[b1|X] ≠ β1, i.e. b1 is biased
plim(b1) ≠ β1, i.e. b1 is inconsistent
Inference procedures (t-test, F -test, etc.) are invalid
77
Inclusion of superfluous variables
True regression model: y = X1 β1 + ε
Use “wrong” regression model: y = X1 β1 + X2 β2 + ε
Rewrite y = X1 β1 + X2 β2 + ε = Xβ + ε
where X = [X1 X2] and β′ = [β′1 β′2] = [β′1 0′]
The model used is not wrong per se; simply β2 = 0
Regress y on X: the LS estimator is an unbiased estimator of β
E[b|X] = β = [β1; β2] = [β1; 0]
Price to pay for not using the information β2 = 0: reduced precision of the estimates,
“Var[b|X] ≥ Var[b1|X]”
78
Model building
• Simple-to-general:
not a good strategy, omitted variables induce biased and inconsistent
estimates
• General-to-simple:
better strategy, computing power is cheap, but variable selection is a difficult
task
79
Choosing between nonnested models
F -test of H0 : Rβ = q is only for nested models
R represents (linear) restrictions on the model y = Xβ + ε
Various nonnested hypotheses can be of interest
e.g., choosing between linear or loglinear functional forms:
yi = β1 + xi β2 + εi or log(yi) = β1 + log(xi)β2 + εi
Typically, these tests are based on likelihood function
80
Likelihood function: digression
Probability theory : given population model, what is the probability of
observing that sample?
Inference procedure : given that sample, what is the population model?
Likelihood function = probability of observing that sample as a function of
model parameters
81
Likelihood function: simple example
Fair coin, {H, T}, Pr(toss = T) = p0 = 0.5
Goal: estimate p0 (unknown to us)
Observed sample: n = 60 tosses, total number of T’s k = 28
L(p) = (n choose k) p^k (1 − p)^(n−k) = (60 choose 28) p^28 (1 − p)^32
[Figure: likelihood L(p) as a function of p ∈ [0, 1], maximized near p = 28/60]
82
Choosing between nonnested models: Vuong’s test
Goal: choose between two nonnested models
No model is favored, as in classical hypothesis testing
Models can be both wrong: choose the least misspecified
Assumption: observations are independent (conditionally on regressors)
True model: yi ∼ h(yi), density with parameter α
Model 0: yi ∼ f(yi), density with parameter θ
Model 1: yi ∼ g(yi), density with parameter γ
KLIC0 = E [(lnh(yi)− ln f(yi))| h is true ] ≥ 0
KLIC0 = distance between model h (true) and f in terms of log-likelihood
83
Vuong’s statistic
Decision criteria: model 1 is better than model 0 if KLIC1 < KLIC0
KLIC1 − KLIC0 = E[(ln f(yi) − ln g(yi)) | h is true]
≈ (1/n) ∑_{i=1}^n (ln f(yi) − ln g(yi)) = ∑_{i=1}^n mi/n
Vuong’s statistic:
V = √n (∑_{i=1}^n mi/n) / √(∑_{i=1}^n (mi − m̄)²/n)
• V →d N(0, 1) when models 0 and 1 are “equivalent”
• V →a.s. +∞ when model 0, f(yi), is “better”
• V →a.s. −∞ when model 1, g(yi), is “better”
84
Vuong’s test: application to linear models
Assume ε ∼ N(0, σ²)
Model 0: yi ∼ f(yi), with yi = x′iθ + ε0i
Model 1: yi ∼ g(yi), with yi = x′iγ + ε1i
f(yi) = (1/√(2πσ²)) exp(−0.5 (yi − x′iθ)²/σ²)
ln f(yi) = −(1/2) ln(2πσ²) − (1/2)(yi − x′iθ)²/σ² = −(1/2)[ln 2π + ln σ² + ε²0i/σ²]
ln f(yi) − ln g(yi) = [−(1/2)(ln(e′0e0/n) + e²0i/(e′0e0/n))] − [−(1/2)(ln(e′1e1/n) + e²1i/(e′1e1/n))]
85
Model selection criteria
Various criteria have been proposed
Adjusted R²:
R̄² = 1 − [e′e/(n−K)] / [∑_{i=1}^n (yi − ȳ)²/(n−1)]
Akaike Information Criterion:
ln AIC(K) = ln(e′e/n) + 2K/n
Bayesian Information Criterion:
ln BIC(K) = ln(e′e/n) + K ln n / n
86
Chapter 8: Generalized regression model
Spherical errors, E[ε ε′|X] = σ²I, is a restrictive assumption
Allow for heteroscedasticity, σ²i ≠ σ²j, and autocorrelation, σij ≠ 0, ∀ i, j:
E[ε ε′|X] = σ²Ω = Σ, an n × n matrix with diagonal elements σ²1, . . . , σ²n and off-diagonal elements σij
Total number of parameters in Σ = n + (n² − n)/2 = n(n + 1)/2 ≫ n
E.g., n = 100 ⇒ n(n + 1)/2 = 5,050: too many!
Need to impose structure on Σ
87
Heteroscedasticity: asset returns and stochastic volatility
S&P 500 daily returns, 1999–2003, and asymmetric GARCH volatility
[Figure: top panel, S&P 500 daily returns (%), 1999–2003; bottom panel, asymmetric GARCH volatility]
88
Least square estimator
When Var[ε|X] = σ²Ω,
the LS estimator, b = β + (X′X)⁻¹X′ε, still has good properties:
unbiased, consistent, and asymptotically normal
E[b|X] = β
Var[b|X] = (X′X)⁻¹X′ Var[ε|X] X(X′X)⁻¹ = (σ²/n)(X′X/n)⁻¹(X′ΩX/n)(X′X/n)⁻¹
If plim (X′X/n) and plim (X′ΩX/n) are positive definite, plim b = β
√n(b − β) = (X′X/n)⁻¹ √n X′ε/n →d Q⁻¹ × N(0, σ² plim X′ΩX/n)
89
Generalized least square estimator
Var[ε|X] = σ²Ω, assume Ω is known; decompose Ω = CΛC′
Ω⁻¹ = CΛ^(−1/2) Λ^(−1/2) C′ = P′P, where Λ = diag(λ1, . . . , λn), C′C = I
Transformed model: Py = PXβ + Pε ⇒ Var[Pε|X] = σ²PΩP′ = σ²I
β̂ = (X′P′PX)⁻¹X′P′Py = (X′Ω⁻¹X)⁻¹X′Ω⁻¹y = arg min over β0 of (y − Xβ0)′ Ω⁻¹ (y − Xβ0)
Heteroscedasticity case: Ω = diag(w1, . . . , wn)
β̂ = arg min over β0 of ∑_{i=1}^n (yi − x′i β0)²/wi
Recall: OLS case Ω = I
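A sketch of GLS for the heteroscedasticity case with known weights (assuming numpy; the DGP and weights are illustrative):

import numpy as np

rng = np.random.default_rng(8)
n = 200
x = rng.uniform(1, 5, n)
X = np.column_stack([np.ones(n), x])
w = x**2                                         # known weights: Var[eps_i|X] = sigma^2 * w_i
y = X @ np.array([1.0, 2.0]) + np.sqrt(w) * rng.standard_normal(n)

# GLS = OLS on the transformed model P y = P X beta + P eps, with P = diag(1/sqrt(w_i))
Xw = X / np.sqrt(w)[:, None]
yw = y / np.sqrt(w)
beta_gls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_gls, beta_ols)                        # both consistent; GLS is more efficient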
90
GLS efficient estimator
In the classical model, y = Xβ + ε, where Var[ε|X] = σ2I:
OLS is minimum variance, BLUE, estimator
In the transformed model, Py = PXβ + Pε, where Var[Pε|X] = σ2I:
GLS estimator = OLS in the transformed model
⇒ GLS estimator is efficient (not OLS)
91
Feasible generalized least square estimator (FGLS)
Var[ε|X] contains n(n + 1)/2 parameters: impossible to estimate all
Var[ε|X] = σ2Ω parameterized with few unknown parameters θ
E.g. time series: Ωij = θ^|i−j|, where |θ| < 1
E.g. heteroscedasticity: Ωii = exp(z′i θ)
The FGLS estimator relies on Ω̂ = Ω(θ̂):
β̂(Ω̂) = (X′Ω̂⁻¹X)⁻¹X′Ω̂⁻¹y
Key result: when n → ∞, β̂(Ω̂) behaves like β̂(Ω),
using any consistent (not necessarily efficient) estimator θ̂ of θ in Ω(θ)
92
Heteroscedasticity
Var[ε|X] = σ2Ω = σ2diag(w1, . . . , wn)
Scaling: tr(σ²Ω) = ∑_{i=1}^n σ²i = σ² ∑_{i=1}^n wi = σ²n ⇒ σ² = ∑_{i=1}^n σ²i /n
Interpretation: wi positive weight
When form of heteroscedasticity is
• known: parameterize and estimate Ω, then FGLS
• unknown: OLS can still be applied, but Var[b|X]?
93
Estimating Var[b|X] under unknown heteroscedasticity
White’s heteroscedasticity consistent estimator:
Var[b|X] = (σ²/n)(X′X/n)⁻¹(X′ΩX/n)(X′X/n)⁻¹
= (1/n)(X′X/n)⁻¹ [(1/n) ∑_{i=1}^n σ²i xi x′i] (X′X/n)⁻¹
≈ (1/n)(X′X/n)⁻¹ [(1/n) ∑_{i=1}^n e²i xi x′i] (X′X/n)⁻¹
Proof sketch: as σ²i xi x′i = E[ε²i xi x′i | xi],
plim (1/n) ∑_{i=1}^n σ²i xi x′i = plim (1/n) ∑_{i=1}^n ε²i xi x′i = plim (1/n) ∑_{i=1}^n e²i xi x′i
Remark: the equalities above hold in plim; X′ΩX/n itself is never estimated
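A sketch of White’s robust covariance estimator (assuming numpy; the heteroscedastic DGP is illustrative):

import numpy as np

rng = np.random.default_rng(9)
n = 500
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + x * rng.standard_normal(n)   # heteroscedastic errors

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
XtX_inv = np.linalg.inv(X.T @ X)
S = (X * e[:, None]**2).T @ X                    # sum_i e_i^2 x_i x_i'
V_white = XtX_inv @ S @ XtX_inv                  # White robust covariance of b
V_naive = (e @ e / (n - 2)) * XtX_inv            # usual (non-robust) covariance
print(np.sqrt(np.diag(V_white)), np.sqrt(np.diag(V_naive)))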
94
Test for heteroscedasticity: Breusch–Pagan test
Form of heteroscedasticity: σ²i = σ² f(α0 + α′zi)
Note: the functional form f does not need to be specified
H0 : α = 0, i.e. homoscedasticity
Under H0, E[ε²i/(σ²f(α0)) − 1] = 0 and does not depend on zi
Regress gi := e²i/(e′e/n) − 1 on Z′i := [1 z′i] (1 × k), i = 1, . . . , n;
calculate b = (Z′Z)⁻¹Z′g and ĝ = Zb
Under H0, the test statistic
(1/2) ĝ′ĝ = (1/2) g′Z(Z′Z)⁻¹Z′g →d χ²(k−1)
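A sketch of the Breusch–Pagan statistic (assuming numpy and scipy; the DGP, where the variance depends on z, is illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n = 500
z = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), z])
y = X @ np.array([1.0, 2.0]) + np.exp(z) * rng.standard_normal(n)   # variance depends on z

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
g = e**2 / (e @ e / n) - 1                        # g_i = e_i^2 / (e'e/n) - 1
Z = np.column_stack([np.ones(n), z])              # Z_i = [1 z_i'], k = 2
g_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ g)
LM = 0.5 * g_hat @ g_hat                          # Breusch-Pagan statistic
print(LM, stats.chi2.sf(LM, df=Z.shape[1] - 1))   # compare with chi2(k-1)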
95
Multiplicative heteroscedasticity: example
Goal: explain firms profit, yi, i = 1, . . . , n
Model: yi = x′i β + εi, where
Var[εi|X] = σ² exp(z′i α) (Harvey’s model)
Step 1: regress yi on xi using OLS and compute ei
Step 2: regress log(e²i) on [1 z′i] using OLS to estimate σ² (biased) and α
Step 3: regress yi on xi using FGLS with Ω̂ii = exp(z′i α̂) to estimate β
LS applied twice to the model yi = x′i β + εi: two-stage least squares
Remark: the LS estimate of σ² is biased (but this does not matter for FGLS) because
E log ε²i < log E ε²i = log σ²i = log σ² + z′i α
E log ε²i = −c + log σ²i, where c > 0
log e²i = −c + log σ² + z′i α + νi, where νi is an error term
96
Chapter 9: Panel data models
Time series: yit, t = 1, . . . , T
Cross sectional: yit, i = 1, . . . , n
Panel or longitudinal: yit, i = 1, . . . , n, t = 1, . . . , T , with n≫ T
[Data array: columns i = 1, . . . , n (individuals), rows t = 1, . . . , T (time), entries yit]
97
Why panel data model
Rich panel databases are available, e.g. labor market, industrial sectors
Certain phenomena can be studied only in panel data models
E.g. Analysis of production function:
technological change (over time) and
economies of scale (across firms of different sizes)
98
General framework for panel data model
Typically n≫ T
yit = x′it β + z′i α + εit
= x′it β + ci + εit
xit: K × 1, without constant term
zi: individual specific variables, observed or unobserved, with constant term
ci: individual effect, often unobserved and stochastic, e.g. “health”, “ability”
Goal: estimate partial effects β = ∂E[yit|xit]/∂xit and E[ci|xi1, xi2, . . .]
Note: if zi observed ∀i ⇒ linear model estimated by LS
99
Modeling frameworks
Panel data model: yit = x′it β + ci + εit
1. Pooled model: ci = α constant term. Use OLS to estimate α, β
2. Fixed effects: ci unobserved and correlated with xit: E[ci|Xi] = αi
yit = x′it β + αi + εit + (ci − αi)
Regress yit on xit omits variables: LS biased, inconsistent estimate of β
3. Random effects: ci unobserved and uncorrelated with xit: E[ci|Xi] = α
yit = x′it β + α + εit + (ci − α)
Regress yit on xit and constant: OLS consistent, inefficient estimate of α, β
100
Pooled model
Assumption: ci = α constant term
yit = x′it β + ci + εit
= x′it β + α + εit
E[εit|Xi] = 0
Var[εit|Xi] = σ²ε
Cov[εit, εjs|Xi, Xj] = 0, if i ≠ j or t ≠ s
If assumptions of linear regression model are met: OLS unbiased and efficient
But this is hardly the case
101
LS estimation of pooled model
Pooled model: yit = x′it β + ci + εit = x′it β + α + εit
If FE is the true model, Cov[ci, xit] ≠ 0: LS is inconsistent (omitted variables)
If RE is the true model, Cov[ci, xit] = 0: LS is consistent but inefficient
In the RE model:
yit = x′it β + ci + εit
= x′it β + E[ci|Xi] + (ci − E[ci|Xi]) + εit
= x′it β + α + ui + εit
= x′it β + α + wit
Autocorrelation (within group i): Cov[wit, wis] = σ²u ≠ 0, t ≠ s
102
Pooled regression with random effects
RE model: yit = x′it β + α + ui + εit. Stack the Ti observations for individual i:
yi = [ii xi] (α, β)′ + (εi + ii ui) = Xi β + wi
Shocks, wi, are heteroscedastic (across individuals) and autocorrelated:
Var[wi] = Var[εi + ii ui] = σ²ε I_Ti + σ²u ii i′i = σ²ε I_Ti + Σi = Ωi
Recall: i = 1, . . . , n, and the goal is to estimate β
103
LS pooled regression with random effects
Stack all observations for all individuals (T1 + . . . + Tn in total):
b = (X′X)⁻¹X′y = β + [(1/n) ∑_{i=1}^n X′iXi]⁻¹ (1/n) ∑_{i=1}^n X′iwi →p β
Asy.Var[b] = (1/n) plim[(1/n) ∑ X′iXi]⁻¹ plim[(1/n) ∑ X′iwi w′iXi] plim[(1/n) ∑ X′iXi]⁻¹
LS is consistent; Asy.Var[b] is called the robust covariance matrix
If the data are well behaved,
plim[(1/n) ∑ X′iXi] and plim[(1/n) ∑ X′iwi w′iXi] are positive definite,
but the second matrix needs to be “estimated”
104
“Estimating” center matrix in Asy.Var[b]
Use White’s approach (not White’s heteroscedasticity estimator):
plim[(1/n) ∑_{i=1}^n X′iwi w′iXi] = plim[(1/n) ∑_{i=1}^n X′iΩiXi]
= plim (1/n) ∑_{i=1}^n (∑_{t=1}^{Ti} xit wit)(∑_{t=1}^{Ti} xit wit)′
≠ plim (1/n) ∑_{i=1}^n ∑_{t=1}^{Ti} w²it xit x′it
Correlations across observations (not heteroscedasticity) contribute most to Asy.Var[b]
105
Pooled regression: group means estimator
To estimate β use the n group means, e.g. for yit, t = 1, . . . , Ti:
(1/Ti) ∑_{t=1}^{Ti} yit = (1/Ti) i′i yi = ȳi.
Averaging eliminates the time series dimension of the panel (≈ cross section)
yi = Xi β + wi
(1/Ti) i′i yi = (1/Ti) i′i Xi β + (1/Ti) i′i wi
ȳi. = x̄′i. β + w̄i.
In the pooled model w̄i. = ε̄i.; in the RE model w̄i. = ε̄i. + ui, heteroscedastic
Sample data: (ȳi., x̄i.), i = 1, . . . , n
Estimation: LS for β and White’s heteroscedasticity estimator for Asy.Var[b]
106
Pooled regression: first difference estimator
General panel data model: yi,t = x′i,t β + ci + εi,t, where
ci is correlated (fixed effects) or uncorrelated (random effects) with xi,t
yi,t − yi,t−1 = (x′i,t − x′i,t−1)β + εi,t − εi,t−1
∆yi,t = (∆x′i,t)β + ui,t
Advantage: first differencing removes all individual specific heterogeneity ci
Disadvantage: first differencing removes all time-invariant variables too
ui,t: moving average (MA), tridiagonal covariance matrix, two-stage GLS
107
Fixed effects model
Assumption: unobservable individual effect, ci, correlated with xit
yit = x′it β + ci + εit
= x′it β + E[ci|Xi] + (ci − E[ci|Xi]) + εit
= x′it β + h(Xi) + νi + εit
= x′it β + αi + εit
Further assumption: Var[ci|Xi] = Var[νi|Xi] is constant
In general: Cov[εit, εis|Xi] = E[(νi + εit)(νi + εis)|Xi] = E[ν²i|Xi] ≠ 0
Assumption: Var[εi|Xi] = σ²ε I_Ti ⇒ classical regression model
Parameters to estimate (K + n): [β1 · · · βK]′ and αi, i = 1, . . . , n
108
Fixed effects model: drawback
Time invariant variables in xit are absorbed in αi
x′it = [1x′it 2x′i]: time varying and time invariant variables
yit = x′it β + αi + εit
= 1x′it β1 + 2x′i β2 + αi + εit
= 1x′it β1 + αi* + εit, where αi* = αi + 2x′i β2 absorbs the time invariant variables
β2 cannot be estimated (not identified)
109
Fixed effects model: Least Squares Dummy Variable
Recall i is the (T × 1) column of ones. Stack the T observations for individual i:
yi = Xi β + i αi + εi
Stack all regression models for the n individuals, the LSDV model:
[y1; . . . ; yn] = [X1; . . . ; Xn] β + diag(i, . . . , i) [α1; . . . ; αn] + [ε1; . . . ; εn]
y = [X d1 · · · dn] [β; α] + ε
= X β + D α + ε
where dj is the dummy variable for individual j
110
Fixed effects model: least squares estimation
Model for nT observations: y = X β + D α + ε, interest in β
Partitioned regression, regressing M_D y on M_D X, reduces the size of the computation:
b = [X′M_D X]⁻¹ X′M_D y
Asy.Var[b] = s² [X′M_D X]⁻¹
Individual effect, αi, estimated using only the T observations on individual i:
ai = ȳi. − x̄′i. b = (1/T) ∑_{t=1}^T (αi + x′it β + εit) − x̄′i. b
ai − αi = (1/T) ∑_{t=1}^T εit + (1/T) ∑_{t=1}^T x′it(β − b) = ε̄i. + x̄′i.(β − b)
Asy.Var[ai] = σ²ε/T + x̄′i. Asy.Var[b] x̄i., which does not go to 0 when n → ∞
111
Testing differences across groups
Null hypothesis H0 : α1 = · · · = αn, i.e.
α1 − α2 = 0, α2 − α3 = 0, . . . , αn−1 − αn = 0  ⇔  R α = 0
where R is the (n−1) × n matrix with rows (1, −1, 0, . . .), (0, 1, −1, 0, . . .), . . . , (. . . , 0, 1, −1);
that is, J = n − 1 restrictions on α.
F-statistic: compare the unrestricted R² vs. the restricted R²
F[n−1, nT−K−n] = [(R²_LSDV − R²_Pooled)/(n−1)] / [(1 − R²_LSDV)/(nT−K−n)]
112
Random effects model
Assumption: unobservable individual effect, ci, uncorrelated with xit
yit = x′it β + ci + εit
= x′it β + E[ci] + (ci − E[ci]) + εit
= x′it β + α + ui + εit
= x′it β + α + ηit
For the T observations on individual i:
Var[ηi] = Var[εi + iT ui] = σ²ε IT + σ²u iT i′T = Σ
113
Random effects model: Generalized least squares
Observations i ⊥ j ⇒ the nT × nT covariance matrix is block diagonal, Ω = In ⊗ Σ
Remark: Σ does not depend on i
GLS:
β̂ = (X′Ω⁻¹X)⁻¹X′Ω⁻¹y = (∑_{i=1}^n X′i Σ⁻¹ Xi)⁻¹ (∑_{i=1}^n X′i Σ⁻¹ yi)
σ²ε and σ²u in Σ are usually unknown: estimate them and then use FGLS
114
FGLS of random effects model: estimating σ2ε
Taking deviations from group means removes the heterogeneity ui:
yit = x′it β + α + ui + εit
ȳi. = x̄′i. β + α + ui + ε̄i.
yit − ȳi. = (xit − x̄i.)′β + εit − ε̄i. = (xit − x̄i.)′b + eit − ēi.
σ̂²ε = ∑_{i=1}^n ∑_{t=1}^T (eit − ēi.)² / (nT − n − K) →p σ²ε
Degrees of freedom: nT observations − n group means ȳi. − K slopes
Note ēi. = 0; note σ̂²ε = s²_LSDV as
eit = yit − ȳi. − (xit − x̄i.)′b = yit − x′it b − (ȳi. − x̄′i. b)
= yit − x′it b − ai = residual in the FE LSDV model
115
FGLS of random effects model: estimating σ2u
OLS consistent, unbiased, not efficient estimator of α and β in
yit = x′it β + α + ui + εit = x′
it β + α + ηit
Hence
plim s2Pooled = plim
e′e
nT −K − 1= Var[ηit] = σ2
u + σ2ε
Consistent estimator of σ2u:
σ2u = s2
Pooled − s2LSDV
If negative, change degrees of freedom
116
Random Effects or Fixed Effects model?
FE: flexible, Cov[ci, xit] 6= 0, but many parameters to estimate: α1, . . . , αn
RE: parsimonious but assumption Cov[ci, xit] = 0 might be violated
Hausman’s specification test, H0 : RE model
• H0 : Cov[ci, xit] = 0⇒ OLS in LSDV and GLS in RE model both consistent,
but OLS inefficient
• H1 : Cov[ci, xit] ≠ 0 ⇒ only OLS in LSDV consistent,
but GLS in RE model inconsistent
Under H0 : OLS in LSDV model ≈ GLS in RE model
117
Hausman’s specification test
b OLS in the LSDV model; β̂ GLS in the RE model. Under H0 : b − β̂ ≈ 0
Var[b − β̂] = Var[b] + Var[β̂] − Cov[b, β̂] − Cov[β̂, b]
Hausman’s key result:
0 = Cov[efficient estimator, (efficient estimator − inefficient estimator)]
0 = Cov[β̂, (β̂ − b)] = Var[β̂] − Cov[β̂, b]
This implies, under H0,
Var[b − β̂] = Var[b] − Var[β̂]
Wald criterion, based on the K estimated slopes, excluding the intercept:
W = [b − β̂]′ (Var[b] − Var[β̂])⁻¹ [b − β̂] ∼ χ²(K)
118
Mundlak’s approach
Fixed effects model: E[ci|Xi] = αi, one parameter for each individual i
Random effects model: E[ci|Xi] = α, one parameter for all individuals
Mundlak’s approach: E[ci|Xi] = x̄′i. γ, the same parameters γ for all individuals
Model:
yit = x′it β + ci + εit
= x′it β + E[ci|Xi] + (ci − E[ci|Xi]) + εit
= x′it β + x̄′i. γ + ui + εit
Drawback: x̄′i. γ can only include time varying variables
119
Dynamic panel data model
Model yit = x′it β + ci + εit describes static relation
Dynamic model yit = γ yi,t−1 + x′it β + ci + εit fits data much better
OLS and GLS inconsistent: ci correlated with yi,t−1
FE model, deviations from means, first difference: inconsistent estimates
Instrumental variable estimator: consistent estimates
Read about SUR and CAPM in Chapter 10
120
Chapter 12: Instrumental variables
Linear regression model: y = Xβ + ε
b = β + (X ′X)−1X ′ε
b unbiased when E[ε|X] = 0
b consistent when plimX ′ε/n = 0
In many situations (e.g., dynamic panel models, measurement error on X),
X and ε are correlated ⇒ OLS (and GLS) biased and inconsistent
Solution: Instrumental variables (IV), consistent estimates
121
Assumptions of the model
1. Linearity: E[y|X] linear in β
2. Full rank: X is an n×K matrix with rank K
3. Endogeneity of the independent variables: E[εi|xi] ≠ 0
4. Homoscedasticity and nonautocorrelation of εi
5. Stochastic or nonstochastic X
6. Normal distribution: ε|X ∼ N(0, σ2I)
122
Instrumental variable: Definition
Instrumental variables Z = [z1 · · · zL] (n× L), L ≥ K, have two properties:
1. Exogeneity: Z uncorrelated with ε
2. Relevance: Z correlated with X
Further assumptions of the model:
• [xi, zi, εi], i = 1, . . . , n, i.i.d.
• E[εi|zi] = 0
• plimZ ′Z/n = Qzz, finite, positive definite matrix
• plimZ ′ε/n = 0 (Exogeneity)
• plimZ ′X/n = Qzx, finite, L×K matrix, rank K (Relevance)
123
Insight on IV estimation
When plim X′ε/n = 0:
y = X β + ε
X′y/n = X′X β/n + X′ε/n
X′y/n ≈ X′X β/n
β ≈ (X′X)⁻¹X′y
When plim X′ε/n ≠ 0, but plim Z′ε/n = 0 (and L = K):
Z′y/n = Z′X β/n + Z′ε/n
Z′y/n ≈ Z′X β/n
β ≈ (Z′X)⁻¹Z′y
Remark: the ≈ signs become equalities in plim
124
Instrumental variable estimator (L = K)
L instruments, observed variables; Z is an n × L matrix; when L = K:
bIV = (Z′X)⁻¹Z′y = β + (Z′X)⁻¹Z′ε
plim bIV = β + (plim Z′X/n)⁻¹ plim Z′ε/n = β
√n(bIV − β) = (Z′X/n)⁻¹ √n Z′ε/n →d Qzx⁻¹ × N(0, σ²Qzz) =d N(0, σ²Qzx⁻¹ Qzz Qxz⁻¹)
bIV ∼a N(β, σ²Qzx⁻¹ Qzz Qxz⁻¹ /n)
Exogeneity ⇒ consistency; Relevance ⇒ low variance
125
Instrumental variable estimator (L > K)
When L > K, Z′X is L × K, not an invertible matrix
X is correlated with ε ⇒ inconsistency
Z is uncorrelated with ε (Exogeneity)
Idea: project X on Z to get X̂, then regress y on X̂ to estimate β
X̂ = Z × slope of X on Z = Z(Z′Z)⁻¹Z′X
Regressing y on X̂:
bIV = [X̂′X̂]⁻¹X̂′y
= [X′Z(Z′Z)⁻¹Z′Z(Z′Z)⁻¹Z′X]⁻¹X′Z(Z′Z)⁻¹Z′y
= [X′Z(Z′Z)⁻¹Z′X]⁻¹X′Z(Z′Z)⁻¹Z′y
Two-stage least squares (2SLS) estimator (two stages only logically; it is computed in one step)
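A sketch of 2SLS with L > K (assuming numpy; the DGP with an endogenous regressor and two valid instruments is illustrative):

import numpy as np

rng = np.random.default_rng(12)
n = 1000
z = rng.standard_normal((n, 2))                      # two instruments
u = rng.standard_normal(n)
x = z @ np.array([1.0, 0.5]) + u                     # endogenous regressor, relevant instruments
eps = 0.8 * u + rng.standard_normal(n)               # error correlated with x but not with z
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])                 # L = 3 >= K = 2
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]     # first stage: project X on Z
b_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]    # second stage: regress y on X_hat
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(b_2sls, b_ols)                                 # 2SLS near [1, 2]; OLS slope biased upward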
126
Which instruments?
Instrumental variables are generally difficult to find
Z can include variables in X uncorrelated with ε
In time series settings, lagged values of x and y are typical instruments
Relevance ⇒ high correlation between X and Z (otherwise Q−1xz large)
But then Z might be correlated with ε (as ε is correlated with X)
127
Example: Dynamic panel data model
Model: yit = γ yi,t−1 + x′it β + ci + εit
ci correlated or uncorrelated with xit
ci certainly correlated with yi,t−1 ⇒ LS inconsistent
Taking first difference, ∆yit = yit − yi,t−1
∆yit = γ∆yi,t−1 + ∆x′it β + ∆εit
Cov[∆yi,t−1, ∆εit] ≠ 0 ⇒ LS still inconsistent
To estimate γ and β, valid instruments, e.g., yi,t−2 and ∆yi,t−2
128
Measurement error
Measurement errors are very common in practice
E.g., variables of interest are not available but only approximated by others
E.g., GDP, consumption, capital, . . . , cannot be measured exactly
129
Regression model with measurement error
True, latent (unobserved), univariate model
y*i = x*i β + εi
Observed data: yi = y*i + vi and xi = x*i + ui
where vi ∼ (0, σ²v), vi ⊥ y*i, x*i, ui, and ui ∼ (0, σ²u), ui ⊥ y*i, x*i, vi
Working model, derived from the true model:
yi − vi = (xi − ui)β + εi
yi = xi β + (−ui β + εi + vi)
Measurement error on yi, i.e. vi, is absorbed in the error term
Measurement error on xi, i.e. ui, makes LS inconsistent
130
LS estimation with measurement error
Set vi = 0 for simplicity. Working model:
yi = xi β + (−ui β + εi) = xi β + wi
LS estimation of β is inconsistent because
Cov[xi, wi] = Cov[x*i + ui, −ui β + εi] = −β σ²u ≠ 0
b = (∑_{i=1}^n x²i /n)⁻¹ ∑_{i=1}^n xi yi /n
plim b = (plim ∑_{i=1}^n (x*i + ui)²/n)⁻¹ plim ∑_{i=1}^n (x*i + ui)(x*i β + εi)/n
= (Q* + σ²u)⁻¹ β Q* = β/(1 + σ²u/Q*)
→ 0 when σ²u → ∞
131
IV estimation with measurement error
Instrument zi has the two properties:
1. Exogeneity: Cov[zi, ui] = 0
2. Relevance: Cov[zi, x*i] = Q*zx ≠ 0
Recall the true model yi = x*i β + εi, observed regressor xi = x*i + ui
bIV = (∑_{i=1}^n xi zi /n)⁻¹ ∑_{i=1}^n zi yi /n
plim bIV = (plim ∑_{i=1}^n (x*i + ui) zi /n)⁻¹ plim ∑_{i=1}^n zi(x*i β + εi)/n
= (Q*zx)⁻¹ β Q*zx = β
132
IV estimation of generalized regression model
In the generalized regression model E[ε ε′|X] = σ²Ω:
bIV = [X′Z(Z′Z)⁻¹Z′X]⁻¹X′Z(Z′Z)⁻¹Z′y = β + [X′Z(Z′Z)⁻¹Z′X]⁻¹X′Z(Z′Z)⁻¹Z′ε
plim bIV = β + Qxx.z × plim Z′ε/n = β
√n(bIV − β) →d Qxx.z × N(0, σ² plim (Z′ΩZ/n)) =d N(0, σ² Qxx.z plim (Z′ΩZ/n) Q′xx.z)
bIV ∼a N(β, σ² Qxx.z plim (Z′ΩZ/n) Q′xx.z /n)
Same derivation as when E[ε ε′|X] = σ²I
133
Chapter 15: Generalized method of moments (GMM)
General framework for estimation and hypothesis testing
LS, NLS, GLS, IV, etc. special cases of GMM
GMM relies on “weak” assumptions about first moments
(existence and convergence of first moments)
Strength (and limitation) of GMM:
No assumptions about distribution ⇒ Robust to misspecification of DGP
Widely used in Econometrics, Finance, . . .
134
Logic behind method of moments
Sample moments →p population moments = function(parameters)
E.g., random sample yi, i = 1, . . . , n, with E[yi] = µ and Var[yi] = σ²
(1/n) ∑_{i=1}^n yi →p E[yi] = µ
(1/n) ∑_{i=1}^n y²i →p E[y²i] = σ² + µ²
Assumptions of Law of Large Numbers need to hold
135
Orthogonality conditions: Example
Parameters are implicitly defined by two orthogonality conditions:
E[yi − µ] = 0
E[y2i − σ2 − µ2] = 0
To estimate µ and σ2, replace E[·] by empirical distribution
and solve two moment equations:
(1/n) ∑_{i=1}^n (yi − µ̂) = 0
(1/n) ∑_{i=1}^n (y²i − σ̂² − µ̂²) = 0
Moment estimators: µ̂ = ∑_{i=1}^n yi /n and σ̂² = ∑_{i=1}^n (yi − µ̂)²/n
136
Example: Gamma distribution
Gamma distribution used to model positive r.v. yi, e.g. waiting time
f(y) = (λ^p/Γ[p]) e^(−λy) y^(p−1), y ≥ 0, p > 0, λ > 0
(Some) orthogonality conditions:
E[yi − p/λ] = 0
E[y²i − p(p + 1)/λ²] = 0
E[ln yi − d ln Γ[p]/dp + ln λ] = 0
E[1/yi − λ/(p − 1)] = 0
Orthogonality conditions are (general) nonlinear functions of sample data
More orthogonality conditions (four) than parameters (two)
Any two orthogonality conditions give (p, λ): need to reconcile all of them
137
Orthogonality conditions
K parameters to estimate, θ = (θ1, . . . , θK)′
L moment conditions (L ≥ K):
E[mi(yi, xi, θ)] = E[(mi1(yi, xi, θ), . . . , mil(yi, xi, θ), . . . , miL(yi, xi, θ))′] = 0
θ is implicitly defined by the equation above
and estimated via the empirical counterpart of E[·]
138
Exactly identified case
When L = K, i.e. # moment conditions = # parameters,
sample moment equations have a unique solution and are all exactly satisfied
• E.g., the previous method of moments estimator of µ and σ²
• E.g., the LS estimator: E[mi(yi, xi, θ)] = E[xi (yi − x′i θ)] = 0
Solving the sample moment equations (i.e. the normal equations):
(1/n) ∑_{i=1}^n xi (yi − x′i θ̂) = 0
(1/n) ∑_{i=1}^n xi yi − (1/n) ∑_{i=1}^n xi x′i θ̂ = 0
θ̂ = (∑_{i=1}^n xi x′i)⁻¹ (∑_{i=1}^n xi yi)
139
Overidentified case
When L > K, i.e. # moment conditions > # parameters,
system of L equations in K unknown parameters
1
n
n∑
i=1
mil(yi, xi, θ) = 0, l = 1, . . . , L
has no solution (equations functionally independent) in finite samples
although
plim1
n
n∑
i=1
mil(yi, xi, θ) = E[mil(yi, xi, θ)] = 0, l = 1, . . . , L
E.g., previous estimation of parameters of Gamma distribution
E.g., IV estimation when # instruments L > # parameters K
140
Criterion function
When L > K, to reconcile different estimates, minimize criterion function
q = m̄(θ)′ Wn m̄(θ)
where m̄(θ) = ∑_{i=1}^n mi(yi, xi, θ)/n, the L × 1 vector of moment conditions
Wn: positive definite weighting matrix, with plim Wn = W
• When Wn = I:
q = m̄(θ)′ m̄(θ) = ∑_{l=1}^L m̄l(θ)²
where m̄l(θ) = ∑_{i=1}^n mil(yi, xi, θ)/n, l = 1, . . . , L
• When Wn is inversely proportional to the variance of m̄(θ) ⇒ efficiency gains,
the same logic that makes GLS more efficient than OLS
141
Optimal weighting matrix
L orthogonality conditions, possibly correlated
the optimal weighting matrix is
W = Asy.Var[√n m̄(θ)]⁻¹ = Φ⁻¹
Recall, Var[m̄(θ)] = Var[∑_{i=1}^n mi(yi, xi, θ)/n] ∈ O(1/n)
The efficient GMM estimator is based on Φ⁻¹
• When L > K, W = I (or any W ≠ Φ⁻¹) produces inefficient estimates of θ
• When L = K ⇒ moment equations satisfied exactly, i.e. m̄(θ̂) = 0,
⇒ q = 0 and W is irrelevant
142
Assumptions of GMM estimation
θ0 true parameter vector, K × 1
L population orthogonality conditions: E[mi(θ0)] = 0, L ≥ K
L sample moments: m̄n(θ0) = ∑_{i=1}^n mi(θ0)/n
E.g., IV estimation: m̄n(θ0) = ∑_{i=1}^n zi(yi − x′i θ0)/n,
L ≥ K instruments, one orthogonality condition for each instrument
Assumption 1: Convergence of empirical moments
The data generating process satisfies the assumptions of a Law of Large Numbers:
m̄n(θ0) = (1/n) ∑_{i=1}^n mi(θ0) →p E[mi(θ0)] = 0
143
Assumptions of GMM estimation
Empirical moment equations are continuous and continuously differentiable
⇒ the L × K matrix of partial derivatives
Ḡn(θ0) = ∂m̄n(θ0)/∂θ′0 = (1/n) ∑_{i=1}^n ∂mi(θ0)/∂θ′0 →p G(θ0)
A Law of Large Numbers applies to the moments and to the derivatives of the moments
Assumption 2: Identification
For any n > K, if θ1 ≠ θ2, then m̄n(θ1) ≠ m̄n(θ2)
plim qn(θ) = plim (m̄n(θ)′ Wn m̄n(θ)) has a unique minimum (= zero) at θ0
Identification ⇒ L ≥ K and rank(Ḡn(θ0)) = K
144
Assumptions of GMM estimation
Assumptions 1 and 2 =⇒ θ can be estimated
Assumption 3: Asymptotic distribution of empirical moments
Empirical moments obey a Central Limit Theorem
√n m̄n(θ0) →d N(0, Φ)
145
Asymptotic properties of GMM
Under previous assumptions
θ̂GMM →p θ0
θ̂GMM ∼a N(θ0, [G(θ0)′ Φ⁻¹ G(θ0)]⁻¹/n)
146
Consistency of GMM estimator
Recall the criterion function qn(θ) = m̄n(θ)′ Wn m̄n(θ)
Assumption 1 and continuity of the moments ⇒ qn(θ) →p q0(θ)
Wn positive definite, for any finite n:
0 ≤ qn(θ̂GMM) ≤ qn(θ0)
When n → ∞, qn(θ0) →p 0 ⇒ qn(θ̂GMM) →p 0
W positive definite and the identification assumption ⇒ θ̂GMM →p θ0
147
Asymptotic normality of GMM estimator
First order condition for the GMM estimator:
∂qn(θ̂GMM)/∂θ̂GMM = 2 Ḡn(θ̂GMM)′ Wn m̄n(θ̂GMM) = 0
Assumption: moment equations are continuous and continuously differentiable
Mean Value Theorem and Taylor expansion of the moment equations at θ0:
m̄n(θ̂GMM) = m̄n(θ0) + Ḡn(θ̄)(θ̂GMM − θ0)
where θ0 < θ̄ < θ̂GMM componentwise. The first order condition becomes:
2 Ḡn(θ̂GMM)′ Wn [m̄n(θ0) + Ḡn(θ̄)(θ̂GMM − θ0)] = 0
Solving for (θ̂GMM − θ0) and multiplying by √n gives:
148
Asymptotic normality of GMM estimator
√n(θ̂GMM − θ0) = −[Ḡn(θ̂GMM)′ Wn Ḡn(θ̄)]⁻¹ Ḡn(θ̂GMM)′ Wn √n m̄n(θ0)
When n → ∞:
• θ̂GMM →p θ0 and θ̄ →p θ0, as θ0 < θ̄ < θ̂GMM componentwise
• Ḡn(θ̂GMM) →p G(θ0) and Ḡn(θ̄) →p G(θ0)
• Wn →p W by construction of the weighting matrix
• √n m̄n(θ0) →d N(0, Φ) by Assumption 3
√n(θ̂GMM − θ0) →d −[G(θ0)′ W G(θ0)]⁻¹ G(θ0)′ W × N(0, Φ)
=d N(0, [G(θ0)′ W G(θ0)]⁻¹ G(θ0)′ W Φ W′ G(θ0) [G(θ0)′ W G(θ0)]⁻¹)
=d N(0, [G(θ0)′ Φ⁻¹ G(θ0)]⁻¹) using W = Φ⁻¹
θ̂GMM ∼a N(θ0, [G(θ0)′ Φ⁻¹ G(θ0)]⁻¹/n)
149
Weighting matrix
Any positive definite matrix W produces consistent GMM estimates
W determines the efficiency of the GMM estimator:
the optimal W = Asy.Var[√n m̄(θ)]⁻¹ depends on the unknown θ
Feasible two-step procedure:
Step 1. Use W = I to obtain a consistent estimator, θ̂(1), then estimate
Φ̂ = (1/n) ∑_{i=1}^n mi(yi, xi, θ̂(1)) mi(yi, xi, θ̂(1))′
(when mi(yi, xi, θ0), i = 1, . . . , n, is an uncorrelated sequence)
Step 2. Use W = Φ̂⁻¹ to compute the GMM estimator
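A sketch of the two-step procedure for overidentified IV moments mi(θ) = zi(yi − x′iθ) (assuming numpy; the DGP with one parameter and three instruments is illustrative):

import numpy as np

rng = np.random.default_rng(13)
n = 1000
z = rng.standard_normal((n, 3))                        # L = 3 instruments
u = rng.standard_normal(n)
x = z @ np.array([1.0, 0.5, 0.3]) + u                  # endogenous regressor
y = 2.0 * x + 0.8 * u + rng.standard_normal(n)         # K = 1 parameter (theta)

def gmm_step(W):
    # minimize mbar(theta)' W mbar(theta); linear moments give a closed form
    Zx = z.T @ x / n
    Zy = z.T @ y / n
    return (Zx @ W @ Zy) / (Zx @ W @ Zx)

theta1 = gmm_step(np.eye(3))                           # step 1: W = I
m_i = z * (y - x * theta1)[:, None]                    # moment contributions at theta1
Phi = m_i.T @ m_i / n
theta2 = gmm_step(np.linalg.inv(Phi))                  # step 2: W = Phi^{-1}
print(theta1, theta2)                                  # both near 2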
150
Testing hypothesis in GMM framework
Two sets of tests:
1. Testing restrictions induced by moment equations
2. GMM counterparts to Wald, LM, and LR tests
151
Specification test
In exactly identified case, L moment equations = K parameters:
θ exists such that m(θ) = 0
In overidentified case, L moment equations > K parameters:
L−K moment equations imply moment restrictions on θ
Intuition: • K moment equations are “set to zero to compute” the K parameters
• L − K “free” moment equations
Test of overidentifying restrictions, using Ŵ = Asy.Var[√n m̄(θ̂)]⁻¹:
J-stat = nq = √n m̄(θ̂)′ Ŵ √n m̄(θ̂) →d χ²(L−K)
Note: no parametric restrictions on θ in the specification test
152
Testing parametric restrictions
To test J (linear or nonlinear) parametric restrictions on θ
Given L moment equations, now only K − J free parameters
nqR = √n m̄(θ̂R)′ Ŵ √n m̄(θ̂R) →d χ²(L−(K−J))
nqR − nq →d χ²(J)
as for the degrees of freedom: (L − (K − J)) − (L − K) = J
Note: the same optimal weighting matrix Ŵ in qR and q ⇒ qR ≥ q
153
Application of GMM: Asset pricing model estimation
Asset pricing model:
E[re_j,t] = δAER βAER,j + δHML βHML,j + δCLS βCLS,j = δ′β
Stochastic Discount Factor (SDF) representation, with demeaned factors f̃t:
mt = 1 − bAER ÃERt − bHML H̃MLt − bCLS C̃LSt = 1 − b′f̃t
Euler pricing equation:
E[mt re_j,t] = 0
⇒ N moment conditions, j = 1, . . . , N
Market prices of risk, δ, and SDF loadings, b:
δ = E[f̃t f̃′t] b
154
GMM estimation results

Parameter    Model (1)        Model (2)        Model (3)
δAER         11.43 (7.26)                      4.34 (11.36)
δHML         18.70 (13.08)                     7.48 (35.93)
δCLS         13.16 (3.35)                      27.13 (17.77)
bAER                          0.13 (0.05)      −0.03 (0.06)
bHML                          0.05 (0.04)      −0.16 (0.07)
bCLS                          0.26 (0.07)      0.75 (0.24)
J-stat       0.0467           0.0444           0.0349
p-value      6.03%            7.71%            14.14%

Table 1: Parameter estimates (Newey–West standard errors in parentheses).
155
Chapter 16: Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE): very important inference method
Maximum likelihood principle:
Given sample data generated from parametric model,
find parameters that maximize probability of observing that sample
Basic, strong assumption:
DGP has parametric, known (up to θ) distribution
Fundamental result:
MLE makes “best use” of this information
156
Likelihood function
Likelihood function = probability of observing that sample
Formally, joint density of n i.i.d. observations, y1, . . . , yn
f(y1, . . . , yn; θ) = ∏_{i=1}^n f(yi; θ) = L(θ; y)
L(θ; y) is the likelihood function, with θ unknown
The log-likelihood is usually easier to deal with:
ln L(θ; y) = ∑_{i=1}^n ln f(yi; θ)
157
Identification
Identification means parameters are estimable. It depends on the model
Check identification before estimating or testing the model
Definition: θ is identified (or estimable) if
L(θ; y) ≠ L(θ∗; y), ∀ θ∗ ≠ θ and some data y
E.g. Linear regression model not identified when rank [x1, . . . , xK] < K
E.g. Threshold model for yi > 0 or yi ≤ 0
Pr(yi > 0) = Pr(β1 + β2 xi + εi > 0) = Pr(εi/σ > −(β1 + β2 xi)/σ)
not identified, σ, β1, β2 not estimable (normalization required, e.g. σ = 1)
158
Maximum likelihood estimator
Maximum likelihood estimator, θ̂, solves

θ̂ = arg maxθ L(θ; y) = arg maxθ lnL(θ; y)

or equivalently the likelihood equation

∂ lnL(θ; y)/∂θ = 0
159
Maximum likelihood estimator: Example
i.i.d. normal random variables, yi ∼ N(µ, σ2), i = 1, . . . , n
lnL(µ, σ2; y) = −(n/2) ln(2π) − (n/2) lnσ2 − (1/2) Σi=1..n (yi − µ)2/σ2

∂ lnL/∂µ = Σi=1..n (yi − µ)/σ2 = 0
∂ lnL/∂σ2 = −n/(2σ2) + (1/(2σ4)) Σi=1..n (yi − µ)2 = 0

Solve likelihood equations:

µML = (1/n) Σi=1..n yi
σ2ML = (1/n) Σi=1..n (yi − µML)2
160
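A short Python check (not from the slides) that numerical maximization of this log-likelihood recovers the closed-form solutions; the sample is simulated, and the log-variance parameterization is only a convenience to keep σ² positive during the search:

# Sketch: maximize the normal log-likelihood numerically and compare with the
# closed-form solutions mu_ML = mean(y), sigma2_ML = mean((y - mu_ML)^2).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.normal(loc=1.5, scale=2.0, size=200)          # hypothetical sample

def neg_loglik(params):
    mu, log_s2 = params                               # parameterize sigma^2 = exp(log_s2) > 0
    s2 = np.exp(log_s2)
    return 0.5 * len(y) * (np.log(2 * np.pi) + np.log(s2)) + 0.5 * np.sum((y - mu) ** 2) / s2

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, s2_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, s2_hat)                                 # ≈ y.mean(), ((y - y.mean())**2).mean()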
Asymptotic efficiency
An estimator is asymptotically efficient if it is
• consistent,
• asymptotically normally distributed (CAN), and has
• asy. covariance matrix not larger than that of any other CAN estimator
Under some regularity conditions, MLE is asymptotically efficient
Finite sample properties usually not optimal
E.g., σ2ML = Σi=1..n (yi − ȳ)2/n biased (no correction for degrees of freedom)
161
Properties of MLE
Under regularity conditions, MLE θ̂ has the following properties:

M1 Consistency: plim θ̂ = θ0
M2 Asymptotic normality: θ̂ ∼a N(θ0, {−E0[∂2 lnL/(∂θ0 ∂θ′0)]}−1)
M3 Asymptotic efficiency: θ̂ reaches the Cramer–Rao lower bound in M2
M4 Invariance: the MLE of γ0 = c(θ0) is c(θ̂) if c ∈ C1
162
Regularity conditions on f(yi; θ)
R1 First three derivatives of ln f(yi; θ) w.r.t. θ are continuous and finite ∀θ
R2 Conditions for E[∂ ln f(yi; θ)/∂θ] <∞, E[∂2 ln f(yi; θ)/∂θ ∂θ′] <∞ hold
R3 |∂3 ln f(yi; θ)/∂θj ∂θk ∂θl| < h, where E[h] <∞, ∀θ
Definition: Regular densities satisfy R1–R3
Goals: use Taylor approximation; interchange differentiation and expectation
Notation: gradient gi = ∂ ln f(yi; θ)/∂θ, Hessian Hi = ∂2 ln f(yi; θ)/∂θ ∂θ′
163
Properties of regular densities
Moments of derivatives of log-likelihood:
D1 ln f(yi; θ), gi, Hi, i = 1, . . . , n are random samples
D2 E0[gi(θ0)] = 0
D3 Var0[gi(θ0)] = −E0[Hi(θ0)]
D1 implied by assumption: yi, i = 1, . . . , n is random sample
To prove D2: by definition 1 = ∫ f(yi; θ0) dyi

∂1/∂θ0 = ∂/∂θ0 ∫ f(yi; θ0) dyi

0 = ∫ [∂f(yi; θ0)/∂θ0] dyi = ∫ [∂ ln f(yi; θ0)/∂θ0] f(yi; θ0) dyi = E0[gi(θ0)]
164
Information matrix equality
To prove D3: differentiate previous integral once more w.r.t. θ0

∂0/∂θ′0 = ∂/∂θ′0 ∫ [∂ ln f(yi; θ0)/∂θ0] f(yi; θ0) dyi

0 = ∫ [ ∂2 ln f(yi; θ0)/(∂θ0 ∂θ′0) f(yi; θ0) + (∂ ln f(yi; θ0)/∂θ0)(∂f(yi; θ0)/∂θ′0) ] dyi
  = ∫ [ ∂2 ln f(yi; θ0)/(∂θ0 ∂θ′0) f(yi; θ0) + (∂ ln f(yi; θ0)/∂θ0)(∂ ln f(yi; θ0)/∂θ′0) f(yi; θ0) ] dyi
  = E0[Hi(θ0)] + Var0[gi(θ0)]  =⇒  D3

D1 (random sample) ⇒ Var0[Σi=1..n gi(θ0)] = Σi=1..n Var0[gi(θ0)]

Information matrix equality:

Var0[Σi=1..n gi(θ0)] = Var0[∂ lnL(θ0; y)/∂θ0] = −E0[∂2 lnL(θ0; y)/(∂θ0 ∂θ′0)] = −E0[Σi=1..n Hi(θ0)]
165
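A quick Monte Carlo illustration of D2 and D3 (not from the slides), under the assumption of a normal model with known σ² and scalar parameter µ, where the per-observation score and Hessian are available in closed form:

# Monte Carlo check of D2 and D3 for one observation of y ~ N(mu0, sigma2), sigma2 known
import numpy as np

rng = np.random.default_rng(2)
mu0, sigma2 = 1.0, 4.0
y = rng.normal(loc=mu0, scale=np.sqrt(sigma2), size=100_000)

g = (y - mu0) / sigma2            # per-observation score g_i(mu0)
H = -np.ones_like(y) / sigma2     # per-observation Hessian H_i(mu0)

print(g.mean())                   # ≈ 0                (D2: E0[g_i] = 0)
print(g.var(), -H.mean())         # both ≈ 1/sigma2    (D3: Var0[g_i] = -E0[H_i])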
Likelihood equation
Score vector at θ:
g = ∂ lnL(θ; y)/∂θ = Σi=1..n ∂ ln f(yi; θ)/∂θ = Σi=1..n gi

D1 (random sample) and D2 (E0[gi(θ0)] = 0) ⇒ Likelihood equation at θ0:

E0[∂ lnL(θ0; y)/∂θ0] = 0
166
Consistency of MLE
In any finite sample, lnL(θ̂) ≥ lnL(θ0) (and in general ∀θ ≠ θ̂, not only θ0)

From Jensen's inequality, if θ̂ ≠ θ0 (and in general for any θ ≠ θ0, not only θ̂):

E0[ln (L(θ)/L(θ0))] < ln E0[L(θ)/L(θ0)] = ln ∫ (L(θ)/L(θ0)) L(θ0) dy = ln 1 = 0

⇒ E0[lnL(θ)/n] < E0[lnL(θ0)/n]   (♣)

Under previous assumptions, using the inequality in the very first row:
plim lnL(θ̂)/n ≥ plim lnL(θ0)/n, i.e. E0[lnL(θ̂)/n] ≥ E0[lnL(θ0)/n]

and combining with (♣) at θ = θ̂: E0[lnL(θ0)/n] > E0[lnL(θ̂)/n] ≥ E0[lnL(θ0)/n]

⇒ plim lnL(θ̂)/n = E0[lnL(θ0)/n] and plim θ̂ = θ0
167
Asymptotic normality of MLE
MLE solves sample likelihood equation: g(θ̂) = Σi=1..n gi(θ̂) = 0

First order Taylor expansion: g(θ̂) = g(θ0) + H(θ̄)(θ̂ − θ0) = 0

As θ̄ = w θ0 + (1 − w) θ̂, 0 < w < 1, plim θ̂ = θ0 ⇒ plim θ̄ = θ0

Hessian is continuous in θ. Rearranging, scaling by √n, taking limit n→∞:

√n(θ̂ − θ0) = −H(θ̄)−1 √n g(θ0) = [−(1/n) Σi=1..n Hi(θ̄)]−1 √n (1/n) Σi=1..n gi(θ0)
  →d {−E0[(1/n) Σi=1..n Hi(θ0)]}−1 × N(0, −E0[(1/n) Σi=1..n Hi(θ0)])
  =d N(0, {−E0[(1/n) Σi=1..n Hi(θ0)]}−1)

θ̂ ∼a N(θ0, {−E0[H(θ0)/n]}−1/n) = N(θ0, I(θ0)−1)
168
Asymptotic efficiency
Cramer–Rao lower bound:
If f(yi; θ0) satisfies regularity conditions R1–R3, then the asymptotic variance of a
consistent and asymptotically normally distributed estimator of θ0 is at least as large as

I(θ0)−1 = {−E0[∂2 lnL(θ0)/(∂θ0 ∂θ′0)]}−1
Asymptotic variance of MLE reaches the Cramer–Rao lower bound
169
Invariance
The MLE of γ0 = c(θ0) is c(θ̂) if c ∈ C1
MLE invariant to one-to-one transformation
Useful application: lnL(θ0) can be “complicated” function of θ0
re-parameterize the model to simplify calculations using lnL(γ0)
E.g. Normal log-likelihood, precision parameter γ2 = 1/σ2
lnL(µ, γ2; y) = −(n/2) ln(2π) + (n/2) ln γ2 − (γ2/2) Σi=1..n (yi − µ)2

∂ lnL/∂γ2 = (n/2)(1/γ2) − (1/2) Σi=1..n (yi − µ)2 = 0

γ2ML = n / Σi=1..n (yi − µML)2 = 1/σ2ML
170
Estimating asymptotic covariance matrix of MLE
Asy.Var[θ̂] depends on θ0. Three estimators, asymptotically equivalent:

1. Calculate E0[H(θ0)] (very difficult) and evaluate it at θ̂ to estimate
   I(θ0)−1 = {−E0[∂2 lnL(θ0)/(∂θ0 ∂θ′0)]}−1

2. Calculate H(θ0) (still quite difficult) and evaluate it at θ̂ to get
   I(θ̂)−1 = [−∂2 lnL(θ̂)/(∂θ̂ ∂θ̂′)]−1 = [−Σi=1..n Hi(θ̂)]−1

3. BHHH or OPG estimator (very easy): use D3, E0[−Hi(θ0)] = Var0[gi(θ0)]
   Î(θ̂)−1 = [Σi=1..n gi(θ̂) gi(θ̂)′]−1   (by D3, estimates the same matrix as [−Σi=1..n Hi(θ̂)]−1)
171
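A Python sketch (not from the slides) comparing estimators 2 and 3 for the normal model of slide 160, using analytic per-observation gradients and Hessians evaluated at the MLE; the sample is simulated for illustration:

# Estimators 2 (observed Hessian) and 3 (BHHH/OPG) of Asy.Var[theta_ML] for the
# normal model, theta = (mu, sigma^2), evaluated at the MLE.
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=1.0, scale=2.0, size=500)
n = len(y)
mu, s2 = y.mean(), y.var()                      # MLE of (mu, sigma^2)
e = y - mu

# per-observation gradients g_i = d ln f(y_i; theta)/d theta
G = np.column_stack([e / s2, -0.5 / s2 + 0.5 * e**2 / s2**2])       # n x 2

# sum of per-observation Hessians, evaluated at the MLE
H = np.array([[-n / s2,            -e.sum() / s2**2],
              [-e.sum() / s2**2,    n / (2 * s2**2) - (e**2).sum() / s2**3]])

var_hessian = np.linalg.inv(-H)                 # estimator 2: [-sum H_i]^{-1}
var_bhhh    = np.linalg.inv(G.T @ G)            # estimator 3: [sum g_i g_i']^{-1}
print(var_hessian)
print(var_bhhh)                                 # both ≈ diag(s2/n, 2 s2^2/n)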
Conditional likelihood
Econometric models involve exogenous variables xi ⇒ yi not i.i.d.
E.g. Model: yi = x′i β + εi, xi can be stochastic, correlated across i’s, etc.
Usually f(y; θ) alone is not of interest; the joint density f(y, x) that generated the data is not known
Way out: DGP of xi exogenous and well-behaved (LLN applies),
xi ∼ f(xi; δ), θ and δ no common elements, no restrictions between θ and δ

f(yi, xi; θ, δ) = f(yi|xi; θ) f(xi; δ)

lnL(θ, δ; y, x) = Σi=1..n ln f(yi|xi; θ) + Σi=1..n ln f(xi; δ)

θML = arg maxθ Σi=1..n ln f(yi|xi; θ)
172
Maximizing log-likelihood
Log-likelihoods are typically highly nonlinear functions of parameters
E.g., GARCH-in-mean model for asset return, yt = pt/pt−1 − 1 = Et−1[yt] + εt
with Et−1[yt] = γ0 + γ1 σ2t and Vart−1[yt] = σ2t = β0 + β1 ε2t−1 + β2 σ2t−1

lnL = −0.5 Σt=1..T [ln(2π) + lnσ2t + (yt − γ0 − γ1 σ2t)2/σ2t]
Maximizing log-likelihood is a numerical problem, various methods:
• “Brute force” (but using good routines, e.g. FMINSEARCH in Matlab)
• Newton’s method: θ(i+1) = θ(i) −H−1(i) g(i), use actual Hessian
• Score method: θ(i+1) = θ(i) −H−1 g(i), use expected Hessian
173
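An illustrative Newton iteration (not from the slides) for a scalar problem, a Poisson(λ) log-likelihood whose exact MLE is the sample mean, so convergence is easy to verify:

# Newton's method theta_(i+1) = theta_(i) - H^{-1} g for a scalar MLE:
# Poisson(lambda) log-likelihood, whose exact MLE is the sample mean.
import numpy as np

rng = np.random.default_rng(4)
y = rng.poisson(lam=3.0, size=1000)

lam = 1.0                                   # starting value
for _ in range(20):
    g = np.sum(y / lam - 1.0)               # gradient of ln L
    H = -np.sum(y) / lam**2                 # actual Hessian of ln L
    step = -g / H
    lam = lam + step
    if abs(step) < 1e-10:
        break
print(lam, y.mean())                        # Newton iterate ≈ MLE = sample mean

Replacing the actual Hessian with its expected value in the same loop would give the score method.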
Hypothesis testing
Test of hypothesis H0 : c(θ) = 0
Three tests, asymptotically equivalent (not in finite sample)
• Likelihood ratio: If c(θ) = 0, then lnLU − lnLR ≈ 0
Both unrestricted (ML) and restricted estimators are required
• Wald test: If c(θ) = 0, then c(θML) ≈ 0
Only unrestricted (ML) estimator is required
• Lagrange multiplier test: If c(θ) = 0, then ∂ lnL/∂θR ≈ 0
Only restricted estimator is required
174
Likelihood ratio test
LU = L(θU), where θU is MLE, unrestricted
LR = L(θR), where θR is restricted estimator
Likelihood ratio: LR/LU
0 ≤ LR/LU ≤ 1
Limiting distribution of likelihood ratio: 2 (lnLU − lnLR) ∼ χ2df
with df = # of restrictions
Remarks:
• LR test cannot be used to test two restricted models, θU must be MLE
• Likelihood function L must be the same in LU and LR
175
Wald test
Wald test based on full rank quadratic forms
Recall: If x ∼ N(µ, Σ), quadratic form (x − µ)′ Σ−1 (x − µ) ∼ χ2(J)
If E[x] ≠ µ, (x − µ)′ Σ−1 (x − µ) ∼ noncentral χ2(J) (> χ2(J) on average)
If H0 : c(θ) = q is true, c(θML)− q ≈ 0 (not “= 0” for sampling variability)
If H0 : c(θ) = q is false, c(θML)− q ≪ 0 or ≫ 0
Wald test statistic:
W = [c(θML)− q]′Asy.Var[c(θML)− q]−1[c(θML)− q] ∼ χ2df
with df = # of restrictions
Drawbacks: no H1 ⇒ limited power; not invariant to restriction formulation
176
Lagrange multiplier test
Lagrange multiplier (or score) test based on restricted model
Restrictions H0 : c(θ) = q, Lagrangean: lnL(θ) + (c(θ)− q)′λ
First order conditions for the restricted estimator θR:

∂ lnL(θ)/∂θ + [∂c(θ)′/∂θ] λ = 0

If restrictions are not binding ⇒ λ = 0 (the first term, the score, vanishes at the MLE), and this can be tested.

Simpler, equivalent approach: at the restricted maximum

∂ lnL(θR)/∂θR + [∂c(θR)′/∂θR] λ = 0  =⇒  −[∂c(θR)′/∂θR] λ = ∂ lnL(θR)/∂θR = gR

Under H0 : λ = 0, gR = Σi=1..n gi(θR) = 0

Recall, Var0[Σi=1..n gi(θ0)] = −E0[∂2 lnL/(∂θ0 ∂θ′0)] = I(θ0)
177
Lagrange multiplier test statistic
As in Wald test, LM statistic is a full rank quadratic form:

LM = [∂ lnL(θR)/∂θR]′ I(θR)−1 [∂ lnL(θR)/∂θR] ∼ χ2(df)

with df = # of restrictions

Alternative calculation of LM test: define G′R = [g1(θR), . . . , gn(θR)] (K × n),
regress a column of 1s, i, on GR ⇒ slope bi = (G′R GR)−1 G′R i

uncentered R2 = î′î / i′i = (b′i G′R)(GR bi) / i′i
  = [i′GR (G′R GR)−1 G′R][GR (G′R GR)−1 G′R i] / n
  = i′GR (G′R GR)−1 G′R i / n = LM/n
178
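A small Python sketch of this auxiliary regression (not from the slides), for a hypothetical toy case: yi ∼ N(µ, σ²) with restriction H0: µ = 0, so GR stacks the per-observation gradients evaluated at the restricted estimator (µ = 0, σ²R = mean of yi²):

# LM statistic via the regression of a column of ones on the per-observation
# gradients G_R evaluated at the restricted estimator.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
y = rng.normal(loc=0.3, scale=1.0, size=400)
n = len(y)

# restricted MLE: mu fixed at 0, sigma^2 maximizes ln L given mu = 0
mu_R, s2_R = 0.0, np.mean(y**2)

# n x K matrix of gradients g_i(theta_R), theta = (mu, sigma^2)
G = np.column_stack([(y - mu_R) / s2_R,
                     -0.5 / s2_R + 0.5 * (y - mu_R) ** 2 / s2_R**2])

ones = np.ones(n)
b = np.linalg.lstsq(G, ones, rcond=None)[0]     # slope of regression of 1s on G_R
R2_uncentered = (G @ b) @ (G @ b) / (ones @ ones)
LM = n * R2_uncentered
print(LM, chi2.sf(LM, df=1))                    # df = 1 restriction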
Application of MLE: Linear regression model
Model: yi = x′i β + εi, and yi|xi ∼ N(x′i β, σ2)

Log-likelihood based on n conditionally independent observations:

lnL = −(n/2) ln(2π) − (n/2) lnσ2 − (1/2) Σi=1..n (yi − x′i β)2/σ2
    = −(n/2) ln(2π) − (n/2) lnσ2 − (y − Xβ)′(y − Xβ)/(2σ2)

Likelihood equations:

∂ lnL/∂β = X′(y − Xβ)/σ2 = 0
∂ lnL/∂σ2 = −n/(2σ2) + (y − Xβ)′(y − Xβ)/(2σ4) = 0
179
MLE of linear regression model
Solving likelihood equations:

βML = (X′X)−1 X′y   and   σ2ML = e′e/n

βML = b =⇒ OLS has all desirable asymptotic properties of MLE

σ2ML ≠ s2 = e′e/(n−K) =⇒ σ2ML biased in finite samples, but

E[σ2ML] = E[(n−K) s2/n] = (n−K) σ2/n −→ σ2, n→∞

Cramer–Rao lower bound for θ′ML = (β′ML, σ2ML) can be computed explicitly:

I(θ)−1 = {−E[∂2 lnL(θ)/(∂θ ∂θ′)]}−1 = [ σ2(X′X)−1    0
                                         0′           2σ4/n ]
180
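A short numerical confirmation (simulated data, not from the slides) that the MLE of β coincides with OLS and that σ²ML and s² differ only by the degrees-of-freedom correction:

# MLE of the normal linear regression model: beta_ML = OLS, sigma2_ML = e'e/n
import numpy as np

rng = np.random.default_rng(6)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 0.5, -0.2])
y = X @ beta + rng.normal(scale=1.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)           # OLS = beta_ML
e = y - X @ b
sigma2_ML = e @ e / n                           # MLE (biased in finite samples)
s2 = e @ e / (n - K)                            # unbiased OLS estimator
print(b, sigma2_ML, s2)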
MLE and Wald test
Testing J (possibly nonlinear) restrictions, H0 : c(β) = 0 vs. H1 : c(β) ≠ 0
Idea: check whether unrestricted estimator (i.e. MLE) “satisfies” restrictions
Under H0, Wald statistic

W = c(b)′ { [∂c(b)/∂b′] [σ2(X′X)−1] [∂c(b)′/∂b] }−1 c(b) →d χ2(J)

where σ2(X′X)−1 = Asy.Var[b], using Delta method:

c(b) ≈ c(β) + [∂c(β)/∂β′] (b − β)
Asy.Var[c(b)] = [∂c(β)/∂β′] Asy.Var[b] [∂c(β)′/∂β]

and plim b = β, plim c(b) = c(β), plim ∂c(b)/∂b′ = ∂c(β)/∂β′
181
MLE and Likelihood ratio test
Testing J (possibly nonlinear) restrictions, H0 : c(β) = 0 vs. H1 : c(β) ≠ 0
Idea: check whether unrestricted L “significantly” larger than restricted L∗
Likelihood ratio test: b unrestricted, b∗ restricted slopes
FOC of σ2 implies: Est.[σ2] = (y −Xβ)′(y −Xβ)/n, with β = b or b∗
LR = 2[lnL − lnL∗]
   = [−n ln σ2 − (y − Xb)′(y − Xb)/σ2] − [−n ln σ2∗ − (y − Xb∗)′(y − Xb∗)/σ2∗]
   = n ln σ2∗ − n ln σ2  →d χ2(J)

plugging σ2 in lnL, and σ2∗ in lnL∗ (i.e. concentrating the log-likelihood)
⇒ second terms in square brackets both equal n and cancel out
182
MLE and Lagrange multiplier test
Testing J (possibly nonlinear) restrictions, H0 : c(β) = 0 vs. H1 : c(β) ≠ 0
Idea: gradient of lnL at restricted maximum, gR, should be “close” to zero
From Lagrangean: gR(β) = ∂ lnL(β)/∂β = −∂c(β)′/∂β λ
Under H0 : λ = 0, E0[gR(β)] = E0[X′ε/σ2] = 0⇒ X ′e∗ ≈ 0
Lagrange multiplier: apply Wald-type test to restricted gradient of lnL
LM = e′∗ X (Est.Var[X′ε])−1 X′ e∗
   = e′∗ X (σ2∗ X′X)−1 X′ e∗
   = e′∗ X (X′X)−1 X′ e∗ / (e′∗e∗/n) = n R2∗  →d χ2(J)
R2∗ in the regression of restricted residuals e∗ = (y −Xb∗) on X
Intuition: if restrictions not binding, b∗ = b, e∗ ⊥ X, LM = 0
183
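A Python sketch (simulated data, not from the slides) computing the three statistics of slides 181-183 for a single linear restriction, H0: the last slope equals zero (J = 1); in this linear-regression setting the statistics typically satisfy W ≥ LR ≥ LM:

# Wald, LR, and LM statistics for H0: beta_3 = 0 in the normal linear regression model,
# using the MLE variance sigma2 = e'e/n.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.3]) + rng.normal(size=n)

# unrestricted MLE (= OLS)
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / n

# restricted MLE: drop the third regressor (imposes beta_3 = 0)
Xr = X[:, :2]
br = np.linalg.solve(Xr.T @ Xr, Xr.T @ y)
er = y - Xr @ br
s2r = er @ er / n

# Wald: c(b) = b_3, Asy.Var[c(b)] = s2 * [(X'X)^{-1}]_{33}
XtX_inv = np.linalg.inv(X.T @ X)
W = b[2] ** 2 / (s2 * XtX_inv[2, 2])

# Likelihood ratio: n ln s2_restricted - n ln s2_unrestricted
LR = n * (np.log(s2r) - np.log(s2))

# Lagrange multiplier: n * R^2 of the regression of restricted residuals on the full X
fit = X @ np.linalg.solve(X.T @ X, X.T @ er)
LM = n * (fit @ fit) / (er @ er)

df = 1
print(W, LR, LM, [chi2.sf(t, df) for t in (W, LR, LM)])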
Pseudo Maximum Likelihood estimation
ML requires complete specification of f(yi|xi; θ)
What if the density is misspecified?
Under certain conditions, the estimator retains some good properties
even if the wrong likelihood is maximized
E.g., In the model yi = x′i β + εi, OLS is MLE when εi ∼ N(0, σ2)
but under certain conditions LS is still consistent, even when εi is not N(0, σ2)
When εi is not N(0, σ2), OLS is maximizing the wrong likelihood
Key point: OLS solves normal equations E[xi(yi − x′i β)] = 0
These equations might hold even when εi is not N(0, σ2)
184
Pseudo Maximum Likelihood estimator
θML = arg maxθ Σi=1..n ln f(yi|xi; θ), where f(yi|xi; θ) is the true p.d.f.

θPML = arg maxθ Σi=1..n lnh(yi|xi; θ), where h(yi|xi; θ) ∈ Exponential family

Key point: h(yi|xi; θ) ≠ f(yi|xi; θ), possibly
If h(yi|xi; θ) = f(yi|xi; θ), then θPML = θML

E.g., Estimate θ when f(yi|xi; θ) = N(x′i θ, σ2i) and h(yi|xi; θ) = N(x′i θ, σ2)

PML estimator solves first order conditions:

(1/n) Σi=1..n ∂ lnh(yi|xi; θPML)/∂θPML = 0
185
Asymptotic distribution of PML estimator
Usual method: first order Taylor expansion of FOC, mean value theorem,
rearrange to have (θPML − θ0) on the LHS, scale by √n, take limit n→∞:

√n(θPML − θ0) = [−(1/n) Σi=1..n ∂2 lnh(yi|xi; θ̄)/(∂θ ∂θ′)]−1 √n (1/n) Σi=1..n ∂ lnh(yi|xi; θ0)/∂θ0
  →d −H(θ0)−1 × N(0, Φ)
  =d N(0, H(θ0)−1 Φ H(θ0)−1)

θPML ∼a N(θ0, H(θ0)−1 Φ H(θ0)−1/n)

If h(yi|xi; θ0) is the true p.d.f., then the information matrix equality holds,
Φ = −H(θ0), and θPML = θML

θML ∼a N(θ0, −H(θ0)−1/n)
186
Estimator of Asy.Var[θPML]
Sandwich (or robust) estimator of

Asy.Var[θPML] = H(θ0)−1 Φ H(θ0)−1/n

based on

• Empirical counterpart (no expectation) of the Hessian H(θ0):
  Est.[H(θ0)] = (1/n) Σi=1..n ∂2 lnh(yi|xi; θPML)/(∂θPML ∂θ′PML)

• Sample variance of gradients:
  Est.[Φ] = (1/n) Σi=1..n [∂ lnh(yi|xi; θPML)/∂θPML] [∂ lnh(yi|xi; θPML)/∂θ′PML]
187
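A Python sketch (simulated heteroscedastic data, not from the slides) of the sandwich estimator when the Gaussian likelihood is used as the pseudo-likelihood; for the β-block the general formula reduces asymptotically to White's heteroscedasticity-robust covariance matrix, shown next to the naive estimate implied by the (wrong) i.i.d. normal likelihood:

# Sandwich (robust) estimator of Asy.Var[beta_PML] under a misspecified Gaussian likelihood.
import numpy as np

rng = np.random.default_rng(8)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma_i = 0.5 + np.abs(X[:, 1])                       # heteroscedastic error scale
y = X @ np.array([1.0, 2.0]) + sigma_i * rng.standard_normal(n)

b = np.linalg.solve(X.T @ X, X.T @ y)                 # PML (= OLS) estimate of beta
e = y - X @ b

XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * e[:, None] ** 2).T @ X                    # sum_i x_i x_i' e_i^2
sandwich = XtX_inv @ meat @ XtX_inv                   # robust variance of b
naive = (e @ e / n) * XtX_inv                         # variance from the i.i.d. normal likelihood
print(np.sqrt(np.diag(sandwich)), np.sqrt(np.diag(naive)))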
Remarks on PML estimation
In general, maximizing wrong likelihoods gives inconsistent estimates
(in those cases, sandwich estimator of Asy.Var[θPML] useless)
Under certain conditions, θPML robust to some model misspecification
Major advantage of PML: if h(yi|xi; θ0) is true p.d.f., then θPML = θML
(in those cases, sandwich estimator should not be used)
Typical application of PML in Finance: daily asset returns are not normal,
but GARCH volatility models typically estimated using Gaussian likelihoods
188
Summary of the course
• Linear regression model: OLS estimator, specification and hypothesis testing
• Generalized regression model: heteroscedastic data, GLS estimator
• Panel data model: Fixed and Random effects, Hausman’s specification test
• Instrumental variables: regressors correlated with disturbances
• Generalized method of moments: general framework for inference, weak
assumptions
• Maximum likelihood estimation: assume parametric DGP, best use of this
information
• Hypothesis testing: Likelihood ratio, Wald, Lagrange multiplier tests
189