Maximum Likelihood (ML)
We have previously:
- introduced (conditional) likelihood and log-likelihood functions
- introduced concentrated (log-) likelihood functions
- seen that the (conditional) Maximum Likelihood estimator of the parameter vector β in the classical linear regression model with normally distributed errors coincides with the OLS estimator of β
To conclude this section of the course, we now introduce some further
concepts and results for ML estimators
Conditioning
Denote the joint density of the K + 1 random variables (y,X) by
f (y,X|θ)
which is specified up to a finite-dimensional parameter vector θ
Given the data on (y,X) for a sample of N observations, we have the
likelihood function
L(θ|y,X) = f (y,X|θ)
To emphasize that the likelihood function is based on a sample of size N ,
this is sometimes written as
LN(θ|y,X)
We also have the log-likelihood function
L(θ|y,X) = ln L(θ|y,X)
or
LN(θ|y,X) = ln LN(θ|y,X)
We can write the joint density of (y,X) as the product of the conditional
density of (y|X) and the marginal density of X
f (y,X|θ) = f (y|X, θ)× f (X|θ)
Joint = Conditional×Marginal
We then have the conditional likelihood function
L(θ) = f (y|X, θ)
and the conditional log-likelihood function
L(θ) = ln L(θ) = ln f (y|X, θ)
In general, focusing on the conditional log-likelihood function may imply
some loss of information about the parameters of interest, if the conditional
density f (y|X) and the marginal density f (X) have some parameters in
common
Nevertheless, the focus in econometric models is typically on parameters
of the conditional density
- e.g. θ = (β, σ2) in the CLRN model
Typically we choose not to specify the form of the marginal density f (X), and focus on conditional maximum likelihood estimators, obtained by maximizing the conditional (log-) likelihood function
For the case of data on (yi, xi) that is independent over i = 1, ..., N with the same conditional density function f (yi|xi, θ), i.e.

yi|xi ∼ IID f (yi|xi, θ)

we have

f (y|X, θ) = ∏_{i=1}^{N} f (yi|xi, θ)

and hence

L(θ) = ∑_{i=1}^{N} ln f (yi|xi, θ)
(Conditional) Maximum Likelihood Estimator (MLE)
Now let θ denote a K × 1 parameter vector
The (conditional) MLE is the value of θ which maximizes L(θ), or (in most
cases) solves the system of first order conditions
∂L(θ)/∂θ = 0

or

(∂L(θ)/∂θ1, ... , ∂L(θ)/∂θK)′ = (0, ... , 0)′

The K × 1 vector of first derivatives of the log-likelihood function, ∂L(θ)/∂θ, is called the score vector
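For illustration, a minimal sketch (assuming Python/numpy, purely illustrative): in the CLRN model the score vector has a closed form, and both of its blocks are zero at the ML estimates (OLS for β and the mean squared residual for σ2).

```python
# Illustrative sketch: the score vector of the normal linear regression model,
#   dL/dbeta   = X'(y - X beta) / sigma^2
#   dL/dsigma2 = -N/(2 sigma^2) + (y - X beta)'(y - X beta) / (2 sigma^4),
# evaluated at the ML estimates, where it equals zero (the first order conditions).
import numpy as np

rng = np.random.default_rng(1)
N = 400
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([0.5, -1.0]) + rng.normal(size=N)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]      # ML (= OLS) estimate of beta
u = y - X @ beta_hat
sigma2_hat = u @ u / N                               # ML estimate of sigma^2

score_beta = X.T @ u / sigma2_hat
score_sigma2 = -N / (2 * sigma2_hat) + (u @ u) / (2 * sigma2_hat**2)
print(score_beta, score_sigma2)                      # both are (numerically) zero
```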
Under standard regularity conditions (R1 below), the expected value of the
score vector is zero, and the K ×K variance matrix of the first derivatives
of the log-likelihood function has the form
V(∂L(θ)/∂θ) = E[ (∂L(θ)/∂θ) · (∂L(θ)/∂θ′) ] = I

where the K × K matrix I is called the (Fisher) information matrix
Regularity Conditions
Now denote the true conditional density by f (y|x, θ0), with θ0 being the
true parameter vector
Note that we drop the i subscripts here, as the true conditional density is
assumed to be the same for all observations (so that f (y|x, θ0) here denotes
the true conditional density for a typical observation)
This is also known as the data generation process or DGP for y|x
R1. E[ ∂ ln f (y|x, θ0)/∂θ ] = 0

where the expectation is taken with respect to the true conditional density f (y|x, θ0)

R1. implies that the expected value of the score vector is zero

R2. −E[ ∂²ln f (y|x, θ0)/∂θ∂θ′ ] = E[ (∂ ln f (y|x, θ0)/∂θ) · (∂ ln f (y|x, θ0)/∂θ′) ]

R2. implies the Information Matrix Equality

−E[ ∂²L(θ0)/∂θ∂θ′ ] = E[ (∂L(θ0)/∂θ) · (∂L(θ0)/∂θ′) ]
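Both conditions can be checked by simulation. A minimal sketch (assuming a Poisson regression DGP with ln f (y|x, θ) = y x′θ − exp(x′θ) − ln y!, and Python/numpy; purely illustrative):

```python
# Illustrative sketch: at theta_0 the per-observation score has mean ~ 0 (R1),
# and minus the average Hessian ~ the average outer product of the score (R2).
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
theta0 = np.array([0.2, 0.5])
x = np.column_stack([np.ones(N), rng.normal(size=N)])
mu = np.exp(x @ theta0)
y = rng.poisson(mu)

score = (y - mu)[:, None] * x                                 # per-observation scores at theta0
hess = -(mu[:, None, None] * x[:, :, None] * x[:, None, :])   # per-observation Hessians

print(score.mean(axis=0))                                     # R1: approximately (0, 0)
print(-hess.mean(axis=0))                                     # R2: these two matrices
print((score[:, :, None] * score[:, None, :]).mean(axis=0))   #     agree approximately
```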
Asymptotic Properties
Assuming:
1) The true conditional density f (y|x, θ) is used to define the conditional
likelihood function (correct specification)
2) f (y|x, θA) = f (y|x, θB) iff θA = θB (s.t. L(θ) has a unique maximum)
3) (1/N) ∂²L(θ0)/∂θ∂θ′ →p A0, a K × K matrix which exists and is non-singular
4) Regularity Conditions R1. and R2. hold
Then we have:
θML →p θ0 (weak consistency)

√N(θML − θ0) →d N(0, −A0⁻¹) (asymptotic normality)
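A minimal Monte Carlo sketch of the asymptotic normality result (assuming the CLRN model, where the ML estimator of β is OLS, with the regressors held fixed across replications; Python/numpy, purely illustrative):

```python
# Illustrative sketch: sqrt(N)(beta_ML - beta_0) has variance close to
# sigma^2 [X'X/N]^{-1}, the beta block of -A_0^{-1}.
import numpy as np

rng = np.random.default_rng(9)
N, R = 200, 5000
beta0, sigma = np.array([1.0, 2.0]), 1.5
X = np.column_stack([np.ones(N), rng.normal(size=N)])    # fixed across replications

draws = np.empty((R, 2))
for r in range(R):
    y = X @ beta0 + rng.normal(scale=sigma, size=N)
    draws[r] = np.sqrt(N) * (np.linalg.lstsq(X, y, rcond=None)[0] - beta0)

print(np.cov(draws.T))                                   # simulated variance matrix
print(sigma**2 * np.linalg.inv(X.T @ X / N))             # theoretical counterpart
```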
The limit distribution for √N(θML − θ0) gives the approximation

θML ∼a N(θ0, −A0⁻¹/N)

or

θML ∼a N(θ0, −(NA0)⁻¹)

Since L(θ) = ∑_{i=1}^{N} ln f (yi|xi, θ), we also have NA0 = E[ ∂²L(θ0)/∂θ∂θ′ ], so that

θML ∼a N(θ0, −E[ ∂²L(θ0)/∂θ∂θ′ ]⁻¹)
Asymptotic Efficiency
The asymptotic variance of the ML estimator in correctly specified models

avar(θML) = −E[ ∂²L(θ0)/∂θ∂θ′ ]⁻¹

can be shown to be the lower bound for the asymptotic variance of any √N-consistent, asymptotically normal estimator of θ0 in a wide class
This is known as the Cramer-Rao bound
Given that the information matrix equality holds in correctly specified models, this asymptotic variance of the ML estimator can be estimated either using the Hessian matrix of second derivatives

avar(θML) = −E[ ∂²L(θ0)/∂θ∂θ′ ]⁻¹

or using the outer product of the gradient vector

avar(θML) = E[ (∂L(θ0)/∂θ) · (∂L(θ0)/∂θ′) ]⁻¹
This equality can also be tested (information matrix tests)
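A minimal sketch of the two estimates (assuming a correctly specified Poisson regression, fitted by Newton-Raphson; Python/numpy, purely illustrative):

```python
# Illustrative sketch: standard errors for theta_ML from the Hessian and from
# the outer product of the per-observation scores, both evaluated at theta_ML.
import numpy as np

rng = np.random.default_rng(3)
N = 5_000
x = np.column_stack([np.ones(N), rng.normal(size=N)])
theta0 = np.array([0.2, 0.5])
y = rng.poisson(np.exp(x @ theta0))

theta = np.zeros(2)
for _ in range(25):                              # Newton-Raphson on the log-likelihood
    mu = np.exp(x @ theta)
    score = x.T @ (y - mu)                       # dL/dtheta
    hess = -(x * mu[:, None]).T @ x              # d2L/dtheta dtheta'
    theta = theta - np.linalg.solve(hess, score)

mu = np.exp(x @ theta)
avar_hessian = np.linalg.inv((x * mu[:, None]).T @ x)               # -[Hessian]^{-1}
s_i = (y - mu)[:, None] * x                                         # per-observation scores
avar_opg = np.linalg.inv(s_i.T @ s_i)                               # outer product of the gradient
print(np.sqrt(np.diag(avar_hessian)), np.sqrt(np.diag(avar_opg)))   # similar standard errors
```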
Quasi-Maximum Likelihood (QML)
The ML estimator is consistent and achieves the Cramer-Rao efficiency bound in correctly specified models
Conveniently, estimators which maximize a likelihood function based on
an incorrect assumption about the form of the conditional density function
nevertheless remain consistent and asymptotically normal in an important
class of problems
Estimators which maximize a mis-specified likelihood function are called
quasi-maximum likelihood (QML) estimators
We have already seen an example of this in the case of the linear regression
model with non-normal and/or heteroskedastic errors
Suppose the true conditional density is
y|X ∼ N(Xβ, Ω) with Ω ≠ σ2I
But we construct a likelihood function based on the assumed conditional
density
y|X ∼ N(Xβ, σ2I)
Then the QML estimator of β coincides with the OLS estimator of β, which remains consistent and asymptotically normal in the linear regression model with conditional heteroskedasticity
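A minimal simulation sketch of this robustness (assuming Python/numpy, purely illustrative): the errors below are heteroskedastic and non-normal, yet the Gaussian QML (i.e. OLS) estimate of β is close to the true value in a large sample.

```python
# Illustrative sketch: Gaussian QML (= OLS) under heteroskedastic, skewed errors.
import numpy as np

rng = np.random.default_rng(4)
N = 100_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([1.0, 2.0])
scale = np.exp(0.5 * X[:, 1])                        # error variance depends on x
u = scale * (rng.exponential(1.0, size=N) - 1.0)     # skewed, conditionally mean-zero errors
y = X @ beta_true + u

beta_qml = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS = Gaussian QML
print(beta_qml)                                      # close to (1.0, 2.0)
```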
More generally, QML estimators based on assumed conditional densities within the linear exponential family (LEF) remain consistent under mis-specification of the true conditional density (‘distributional mis-specification’), provided the conditional expectation E(y|X) is correctly specified
The normal distribution is in the LEF, which is why the QML estimator of β in the CLRN model remains consistent if we mis-specify the conditional variance, or even the functional form of the conditional distribution
Other important examples of densities in the LEF include the Bernoulli, Exponential and Poisson densities
Models based on assumed densities in the LEF are also known as generalized linear models (GLMs)
QML estimators are asymptotically normal more generally, with a
sandwich form for their asymptotic variance
- under correct specification, this simplifies using the Information Matrix
Equality to give the simpler expression for avar(θML) stated earlier
Hence in contexts where QML estimators are consistent, we can still conduct asymptotically valid inference based on mis-specified ML estimators
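A minimal sketch of a QML fit with a sandwich variance (assumptions: over-dispersed count data whose conditional mean exp(x′θ) is correctly specified, a Poisson quasi-likelihood, Python/numpy; purely illustrative):

```python
# Illustrative sketch: Poisson QML for an over-dispersed DGP. The estimator of
# theta remains consistent, and its variance is estimated with the sandwich
# A^{-1} B A^{-1}, where A is the Hessian and B the outer product of the scores.
import numpy as np

rng = np.random.default_rng(5)
N = 20_000
x = np.column_stack([np.ones(N), rng.normal(size=N)])
theta0 = np.array([0.2, 0.5])
mu0 = np.exp(x @ theta0)
y = rng.poisson(mu0 * rng.gamma(shape=0.5, scale=2.0, size=N))   # over-dispersed, E(y|x) = mu0

theta = np.zeros(2)
for _ in range(25):                                  # maximize the Poisson quasi log-likelihood
    mu = np.exp(x @ theta)
    theta = theta - np.linalg.solve(-(x * mu[:, None]).T @ x, x.T @ (y - mu))

mu = np.exp(x @ theta)
A = -(x * mu[:, None]).T @ x                         # Hessian of the quasi log-likelihood
s_i = (y - mu)[:, None] * x
B = s_i.T @ s_i                                      # outer product of the scores
avar_sandwich = np.linalg.inv(A) @ B @ np.linalg.inv(A)
print(theta, np.sqrt(np.diag(avar_sandwich)))        # theta ~ theta0, with robust standard errors
```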
This robustness is convenient, but notice that the requirement for the conditional expectation E(y|X) to be correctly specified is stronger than the assumptions needed for GMM estimators to be consistent
And there are important cases where mis-specification of the conditional variance and/or the shape of the conditional distribution will imply mis-specification of the conditional mean
- for example, some of the non-linear models you will cover in the micro-econometrics module
More on Testing
Three classical tests of restrictions can be considered, based on ML estimators
To relate this to our earlier discussion of Wald tests based on the OLS or
2SLS estimators, we now denote the unrestricted K × 1 parameter vector
by β, and consider testing p linear restrictions represented as Hβ = θ for a
p×K matrix H
Null hypothesis H0 : Hβ = θ ↔ Hβ − θ = 0
Alternative H1 : Hβ ≠ θ ↔ Hβ − θ ≠ 0
Given

βML ∼a N(β, avar(βML))

then under H0

θML − θ ∼a N(0, avar(θML))

where θML = HβML and avar(θML) = H avar(βML) H′

Then the Wald test has the form

w = (θML − θ)′ [avar(θML)]⁻¹ (θML − θ) ∼a χ2(p)

under H0, where the avar(θML) appearing in w is replaced by a consistent estimate
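A minimal numerical sketch of the Wald test (assuming the CLRN model, where βML is OLS and avar(βML) = σ2 (X′X)⁻¹, and testing p = 2 zero restrictions; Python/numpy-scipy, purely illustrative):

```python
# Illustrative sketch: Wald test of H beta = theta with p = 2 restrictions.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
N = 1_000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([1.0, 0.0, 0.3])                # the first slope is truly zero
y = X @ beta_true + rng.normal(size=N)

beta_ml = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_ml = np.sum((y - X @ beta_ml) ** 2) / N
avar = sigma2_ml * np.linalg.inv(X.T @ X)            # avar(beta_ML) in the CLRN model

H = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])                      # restrictions: both slopes equal zero
theta = np.zeros(2)
diff = H @ beta_ml - theta
w = diff @ np.linalg.inv(H @ avar @ H.T) @ diff
print(w, chi2.sf(w, df=2))                           # Wald statistic and chi2(2) p-value
```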
The likelihood ratio test has the form

LR = 2[L(βML) − L(βML^R)] ∼a χ2(p)

under H0, where L(βML) is the maximized value of the log-likelihood function for the unrestricted model, and L(βML^R) is the maximized value of the log-likelihood function for the restricted model (i.e. imposing the p linear restrictions that Hβ = θ)
Intuition - how much worse do we do in maximizing the log-likelihood if
we impose these restrictions on the parameter vector?
Advantage - unlike the Wald test, the LR test is invariant to non-linear
re-parameterizations
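A minimal sketch of the LR test in the same setting (assuming the CLRN model, where the maximized log-likelihood has the concentrated form −(N/2)(ln 2π + ln(SSR/N) + 1), and the restricted model drops both slope regressors; Python/numpy-scipy, purely illustrative):

```python
# Illustrative sketch: LR = 2[L(unrestricted) - L(restricted)] compared with chi2(p).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
N = 1_000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 0.0, 0.3]) + rng.normal(size=N)

def max_loglik(X, y):
    # concentrated log-likelihood at the ML estimates (beta by OLS, sigma^2 = SSR/N)
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.mean((y - X @ beta) ** 2)
    return -n / 2 * (np.log(2 * np.pi) + np.log(sigma2) + 1)

LR = 2 * (max_loglik(X, y) - max_loglik(X[:, :1], y))   # restricted model: intercept only
print(LR, chi2.sf(LR, df=2))                            # LR statistic and chi2(2) p-value
```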
The Lagrange Multiplier (LM) or score test evaluates the score vector ∂L(β)/∂β for the unrestricted model based on estimates of the restricted model (βML^R)
Intuition - all elements of the score vector for the unrestricted model should
be close to zero if the restrictions imposed are valid
LM = −(1/N) (∂L(β)/∂β′)|βML^R AR⁻¹ (∂L(β)/∂β)|βML^R ∼a χ2(p)

under H0, where under this null hypothesis AR is a consistent estimator of

A0 = plim (1/N) ∂²L(β0)/∂β∂β′

evaluated at βML^R (not at the unrestricted βML)
Advantage - only requires estimation of the (simpler) restricted model
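A minimal sketch of the LM test for the same restrictions (assuming the CLRN model; the score of the unrestricted model is evaluated at the restricted, intercept-only estimates, and the expected information is used in place of AR, a common simplification; the σ2 block of the score is zero at the restricted ML estimates, so only the β block enters; Python/numpy-scipy, purely illustrative):

```python
# Illustrative sketch: LM (score) test of both slopes equal to zero, using only
# the restricted (intercept-only) fit.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
N = 1_000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 0.0, 0.3]) + rng.normal(size=N)

u_r = y - y.mean()                                   # residuals from the restricted model
sigma2_r = np.mean(u_r ** 2)                         # restricted ML estimate of sigma^2

score = X.T @ u_r / sigma2_r                         # score of the unrestricted model at the restricted fit
info = X.T @ X / sigma2_r                            # expected information for beta
LM = score @ np.linalg.inv(info) @ score
print(LM, chi2.sf(LM, df=2))                         # LM statistic and chi2(2) p-value
```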