
Maximum Likelihood (ML)

We have previously:

- introduced (conditional) likelihood and log-likelihood functions
- introduced concentrated (log-)likelihood functions
- seen that the (conditional) Maximum Likelihood estimator of the parameter vector β in the classical linear regression model with normally distributed errors coincides with the OLS estimator of β

To conclude this section of the course, we now introduce some further concepts and results for ML estimators


Conditioning

Denote the joint density of the K + 1 random variables (y, X) by

$$f(y, X \mid \theta)$$

which is specified up to a finite-dimensional parameter vector θ

Given the data on (y, X) for a sample of N observations, we have the likelihood function

$$L(\theta \mid y, X) = f(y, X \mid \theta)$$


To emphasize that the likelihood function is based on a sample of size N, this is sometimes written as

$$L_N(\theta \mid y, X)$$

We also have the log-likelihood function

$$\mathcal{L}(\theta \mid y, X) = \ln L(\theta \mid y, X)$$

or

$$\mathcal{L}_N(\theta \mid y, X) = \ln L_N(\theta \mid y, X)$$


We can write the joint density of (y, X) as the product of the conditional density of (y | X) and the marginal density of X

$$f(y, X \mid \theta) = f(y \mid X, \theta) \times f(X \mid \theta)$$

Joint = Conditional × Marginal

We then have the conditional likelihood function

$$L(\theta) = f(y \mid X, \theta)$$

and the conditional log-likelihood function

$$\mathcal{L}(\theta) = \ln L(\theta) = \ln f(y \mid X, \theta)$$


In general, focusing on the conditional log-likelihood function may imply some loss of information about the parameters of interest, if the conditional density f(y | X) and the marginal density f(X) have some parameters in common

Nevertheless, the focus in econometric models is typically on parameters of the conditional density

- e.g. θ = (β, σ²) in the CLRN model

Typically we choose not to specify the form of the marginal density f(X), and focus on conditional maximum likelihood estimators, obtained by maximizing the conditional (log-)likelihood function


For the case of data on (y_i, x_i) that is independent over i = 1, ..., N with the same conditional density function f(y_i | x_i, θ), i.e.

$$y_i \mid x_i \sim \text{IID } f(y_i \mid x_i, \theta)$$

we have

$$f(y \mid X, \theta) = \prod_{i=1}^{N} f(y_i \mid x_i, \theta)$$

and hence

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \ln f(y_i \mid x_i, \theta)$$


(Conditional) Maximum Likelihood Estimator (MLE)

Now let θ denote a K × 1 parameter vector

The (conditional) MLE is the value of θ which maximizes $\mathcal{L}(\theta)$, or (in most cases) solves the system of first order conditions

$$\frac{\partial \mathcal{L}(\theta)}{\partial \theta} = 0$$

or

$$\begin{pmatrix} \partial \mathcal{L}(\theta)/\partial \theta_1 \\ \vdots \\ \partial \mathcal{L}(\theta)/\partial \theta_K \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}$$

The K × 1 vector of first derivatives of the log-likelihood function, $\partial \mathcal{L}(\theta)/\partial \theta$, is called the score vector
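As a quick numerical check (continuing the illustrative sketch above), the score vector evaluated at the maximizer should be close to zero:

```python
from scipy.optimize import approx_fprime

# The score (gradient of the log-likelihood) should be ~0 at the MLE
score_at_mle = -approx_fprime(res.x, neg_loglik, 1e-6, y, X)
print(np.round(score_at_mle, 4))   # approximately the zero vector
```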


Under standard regularity conditions (R1 below), the expected value of the score vector is zero, and the K × K variance matrix of the first derivatives of the log-likelihood function has the form

$$V\left(\frac{\partial \mathcal{L}(\theta)}{\partial \theta}\right) = E\left[\frac{\partial \mathcal{L}(\theta)}{\partial \theta} \cdot \frac{\partial \mathcal{L}(\theta)}{\partial \theta'}\right] = \mathcal{I}$$

where the K × K matrix $\mathcal{I}$ is called the (Fisher) information matrix


Regularity Conditions

Now denote the true conditional density by f (y|x, θ0), with θ0 being the

true parameter vector

Note that we drop the i subscripts here, as the true conditional density is

assumed to be the same for all observations (so that f (y|x, θ0) here denotes

the true conditional density for a typical observation)

This is also known as the data generation process or DGP for y|x


R1. $E\left[\dfrac{\partial \ln f(y \mid x, \theta_0)}{\partial \theta}\right] = 0$

where the expectation is taken with respect to the true conditional density f(y | x, θ₀)

R1. implies that the expected value of the score vector is zero

R2. $-E\left[\dfrac{\partial^2 \ln f(y \mid x, \theta_0)}{\partial \theta\, \partial \theta'}\right] = E\left[\dfrac{\partial \ln f(y \mid x, \theta_0)}{\partial \theta} \cdot \dfrac{\partial \ln f(y \mid x, \theta_0)}{\partial \theta'}\right]$

R2. implies the Information Matrix Equality

$$-E\left[\frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta\, \partial \theta'}\right] = E\left[\frac{\partial \mathcal{L}(\theta_0)}{\partial \theta} \cdot \frac{\partial \mathcal{L}(\theta_0)}{\partial \theta'}\right]$$
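To see where R1. comes from (a standard argument, added here for completeness): since $\int f(y \mid x, \theta)\,dy = 1$ for all θ, differentiating under the integral sign (which the regularity conditions permit) gives

$$0 = \frac{\partial}{\partial \theta} \int f(y \mid x, \theta)\,dy = \int \frac{\partial \ln f(y \mid x, \theta)}{\partial \theta}\, f(y \mid x, \theta)\,dy$$

and evaluating at θ = θ₀ makes the right-hand side exactly the expectation in R1., taken with respect to the true density f(y | x, θ₀).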


Asymptotic Properties

Assuming:

1) The true conditional density f(y | x, θ) is used to define the conditional likelihood function (correct specification)

2) $f(y \mid x, \theta_A) = f(y \mid x, \theta_B)$ iff $\theta_A = \theta_B$ (s.t. $\mathcal{L}(\theta)$ has a unique maximum)

3) $\frac{1}{N} \frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta\, \partial \theta'} \xrightarrow{P} A_0$, a K × K matrix which exists and is non-singular

4) Regularity conditions R1. and R2. hold

Then we have:

$$\hat{\theta}_{ML} \xrightarrow{P} \theta_0 \quad \text{(weak consistency)}$$

$$\sqrt{N}\left(\hat{\theta}_{ML} - \theta_0\right) \xrightarrow{D} N\left(0, -A_0^{-1}\right) \quad \text{(asymptotic normality)}$$


The limit distribution for $\sqrt{N}(\hat{\theta}_{ML} - \theta_0)$ gives the approximation

$$\hat{\theta}_{ML} \overset{a}{\sim} N\left(\theta_0, -A_0^{-1}/N\right)$$

or

$$\hat{\theta}_{ML} \overset{a}{\sim} N\left(\theta_0, -(N A_0)^{-1}\right)$$

Since $\mathcal{L}(\theta) = \sum_{i=1}^{N} \ln f(y_i \mid x_i, \theta)$, we also have $N A_0 = E\left[\frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta\, \partial \theta'}\right]$, so that

$$\hat{\theta}_{ML} \overset{a}{\sim} N\left(\theta_0, -E\left[\frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta\, \partial \theta'}\right]^{-1}\right)$$
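Continuing the running sketch (illustrative only, with a hypothetical num_hessian helper), this variance can be estimated from a finite-difference Hessian of the log-likelihood at the MLE; note the parameters here are (β, ln σ), so the standard errors refer to that parameterization:

```python
def num_hessian(f, x, eps=1e-5):
    """Finite-difference Hessian of scalar function f at x (illustrative helper)."""
    k = x.size
    H = np.zeros((k, k))
    for i in range(k):
        step = np.zeros(k)
        step[i] = eps
        # central difference of the numerical gradient in direction i
        H[:, i] = (approx_fprime(x + step, f, eps) -
                   approx_fprime(x - step, f, eps)) / (2 * eps)
    return H

H = -num_hessian(lambda t: neg_loglik(t, y, X), res.x)  # Hessian of the log-likelihood
avar_hessian = -np.linalg.inv(H)                        # -[d^2 L / dtheta dtheta']^{-1}
se = np.sqrt(np.diag(avar_hessian))                     # asymptotic standard errors
```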


Asymptotic Efficiency

The asymptotic variance of the ML estimator in correctly specified models,

$$\text{avar}(\hat{\theta}_{ML}) = -E\left[\frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta\, \partial \theta'}\right]^{-1}$$

can be shown to be the lower bound for the asymptotic variance of any $\sqrt{N}$-consistent, asymptotically normal estimator of θ₀ in a wide class

This is known as the Cramér-Rao bound


Given that the information matrix equality holds in correctly specified models, this asymptotic variance of the ML estimator can be estimated either using the Hessian matrix of second derivatives

$$\widehat{\text{avar}}(\hat{\theta}_{ML}) = -\left[\frac{\partial^2 \mathcal{L}(\hat{\theta}_{ML})}{\partial \theta\, \partial \theta'}\right]^{-1}$$

or using the outer product of the gradient vector

$$\widehat{\text{avar}}(\hat{\theta}_{ML}) = \left[\sum_{i=1}^{N} \frac{\partial \ln f(y_i \mid x_i, \hat{\theta}_{ML})}{\partial \theta} \cdot \frac{\partial \ln f(y_i \mid x_i, \hat{\theta}_{ML})}{\partial \theta'}\right]^{-1}$$

This equality can also be tested (information matrix tests)
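For comparison, a sketch of the outer-product-of-the-gradient (OPG) estimate for the same illustrative example, computing each observation's score numerically (again with respect to (β, ln σ)):

```python
def loglik_i(theta, yi, xi):
    """Log-density of a single observation (parameters: beta, ln sigma)."""
    beta, sigma = theta[:-1], np.exp(theta[-1])
    return norm.logpdf(yi - xi @ beta, scale=sigma)

# One numerical score per observation, stacked into an N x (K+1) matrix
scores = np.array([approx_fprime(res.x, loglik_i, 1e-6, y[i], X[i])
                   for i in range(N)])
avar_opg = np.linalg.inv(scores.T @ scores)   # [sum_i s_i s_i']^{-1}
```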


Quasi-Maximum Likelihood (QML)

The ML estimator is consistent and achieves the Cramér-Rao efficiency bound in correctly specified models

Conveniently, estimators which maximize a likelihood function based on an incorrect assumption about the form of the conditional density function nevertheless remain consistent and asymptotically normal in an important class of problems

Estimators which maximize a mis-specified likelihood function are called quasi-maximum likelihood (QML) estimators


We have already seen an example of this in the case of the linear regression model with non-normal and/or heteroskedastic errors

Suppose the true conditional density is

$$y \mid X \sim N(X\beta, \Omega) \quad \text{with } \Omega \neq \sigma^2 I$$

but we construct a likelihood function based on the assumed conditional density

$$y \mid X \sim N(X\beta, \sigma^2 I)$$

Then the QML estimator of β coincides with the OLS estimator of β, which remains consistent and asymptotically normal in the linear regression model with conditional heteroskedasticity


More generally, QML estimators based on assumed conditional densities within the linear exponential family (LEF) remain consistent under mis-specification of the true conditional density ('distributional mis-specification'), provided the conditional expectation E(y|X) is correctly specified

The normal distribution is in the LEF, which is why the QML estimator of β in the CLRN model remains consistent if we mis-specify the conditional variance, or even the functional form of the conditional distribution


Other important examples of densities in the LEF include the Bernoulli, Exponential and Poisson densities

Models based on assumed densities in the LEF are also known as generalized linear models (GLMs)

QML estimators are asymptotically normal more generally, with a sandwich form for their asymptotic variance

- under correct specification, this simplifies using the Information Matrix Equality to give the simpler expression for $\text{avar}(\hat{\theta}_{ML})$ stated earlier

Hence in contexts where QML estimators are consistent, we can still conduct asymptotically valid inference based on mis-specified ML estimators
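A sketch of the sandwich variance, reusing names from the earlier illustrative snippets (H from the Hessian sketch, scores from the OPG sketch):

```python
# Sandwich variance: A^{-1} B A^{-1}, with A the Hessian of the log-likelihood
# and B the sum of outer products of the individual scores
A = H                            # from the Hessian sketch above
B = scores.T @ scores            # from the OPG sketch above
A_inv = np.linalg.inv(A)
avar_sandwich = A_inv @ B @ A_inv
# Under correct specification, B ~ -A (information matrix equality),
# and avar_sandwich collapses to -A^{-1}, the expression stated earlier
```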


This robustness is convenient, but notice that the requirement for the conditional expectation E(y|X) to be correctly specified is stronger than the assumptions needed for GMM estimators to be consistent

And there are important cases where mis-specification of the conditional variance and/or the shape of the conditional distribution will imply mis-specification of the conditional mean

- for example, some of the non-linear models you will cover in the micro-econometrics module


More on Testing

Three classical tests of restrictions can be considered, based on ML estimators

To relate this to our earlier discussion of Wald tests based on the OLS or 2SLS estimators, we now denote the unrestricted K × 1 parameter vector by β, and consider testing p linear restrictions represented as Hβ = θ for a p × K matrix H

Null hypothesis $H_0 : H\beta = \theta \leftrightarrow H\beta - \theta = 0$

Alternative $H_1 : H\beta \neq \theta \leftrightarrow H\beta - \theta \neq 0$


Given

$$\hat{\beta}_{ML} \overset{a}{\sim} N\left(\beta, \text{avar}(\hat{\beta}_{ML})\right)$$

then under $H_0$

$$\hat{\theta}_{ML} - \theta \overset{a}{\sim} N\left(0, \text{avar}(\hat{\theta}_{ML})\right)$$

where $\hat{\theta}_{ML} = H\hat{\beta}_{ML}$ and $\text{avar}(\hat{\theta}_{ML}) = H\,\text{avar}(\hat{\beta}_{ML})\,H'$

Then the Wald test has the form

$$w = \left(\hat{\theta}_{ML} - \theta\right)' \left[\widehat{\text{avar}}(\hat{\theta}_{ML})\right]^{-1} \left(\hat{\theta}_{ML} - \theta\right) \overset{a}{\sim} \chi^2(p)$$

under $H_0$, where $\widehat{\text{avar}}(\hat{\theta}_{ML})$ is a consistent estimator of $\text{avar}(\hat{\theta}_{ML})$
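A sketch of the Wald test in the running example (the restriction matrix Hmat and hypothesized values are illustrative; avar_hessian is the Hessian-based estimate from the earlier sketch):

```python
from scipy.stats import chi2

# Illustrative: test the two (true) restrictions beta_1 = 0.5, beta_2 = -0.25
Hmat = np.array([[0., 1., 0., 0.],     # p x (K+1); last column is ln(sigma)
                 [0., 0., 1., 0.]])
theta_null = np.array([0.5, -0.25])
diff = Hmat @ res.x - theta_null
V = Hmat @ avar_hessian @ Hmat.T       # avar(theta_hat) = H avar(beta_hat) H'
w = diff @ np.linalg.solve(V, diff)    # Wald statistic
p_value = chi2.sf(w, df=len(theta_null))   # compare with chi^2(p)
```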


The likelihood ratio test has the form

$$LR = 2\left[\mathcal{L}(\hat{\beta}_{ML}) - \mathcal{L}(\hat{\beta}^{R}_{ML})\right] \overset{a}{\sim} \chi^2(p)$$

under $H_0$, where $\mathcal{L}(\hat{\beta}_{ML})$ is the maximized value of the log-likelihood function for the unrestricted model, and $\mathcal{L}(\hat{\beta}^{R}_{ML})$ is the maximized value of the log-likelihood function for the restricted model (i.e. imposing the p linear restrictions that Hβ = θ)

Intuition - how much worse do we do in maximizing the log-likelihood if we impose these restrictions on the parameter vector?

Advantage - unlike the Wald test, the LR test is invariant to non-linear re-parameterizations
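A sketch of the LR test for the same two illustrative restrictions, refitting the model with the restricted parameters held fixed (names carried over from the earlier snippets):

```python
# Refit with the restrictions imposed: free parameters are beta_0 and ln(sigma)
def neg_loglik_restricted(free, y, X):
    theta = np.array([free[0], 0.5, -0.25, free[1]])
    return neg_loglik(theta, y, X)

res_r = minimize(neg_loglik_restricted, x0=np.zeros(2), args=(y, X), method="BFGS")
LR = 2 * (res_r.fun - res.fun)     # = 2 [L(unrestricted) - L(restricted)]
p_value = chi2.sf(LR, df=2)        # chi^2(p) with p = 2 restrictions
```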


The Lagrange Multiplier (LM) or score test evaluates the score vector $\partial \mathcal{L}(\beta)/\partial \beta$ for the unrestricted model at the estimates of the restricted model ($\hat{\beta}^{R}_{ML}$)

Intuition - all elements of the score vector for the unrestricted model should be close to zero if the restrictions imposed are valid

$$LM = -\left(\frac{1}{N}\right) \left(\left.\frac{\partial \mathcal{L}(\beta)}{\partial \beta'}\right|_{\hat{\beta}^{R}_{ML}}\right) \hat{A}_R^{-1} \left(\left.\frac{\partial \mathcal{L}(\beta)}{\partial \beta}\right|_{\hat{\beta}^{R}_{ML}}\right) \overset{a}{\sim} \chi^2(p)$$

under $H_0$, where under this null hypothesis $\hat{A}_R$ is a consistent estimator of

$$A_0 = \text{plim}\left(\frac{1}{N}\right) \frac{\partial^2 \mathcal{L}(\beta_0)}{\partial \beta\, \partial \beta'}$$

evaluated at $\hat{\beta}^{R}_{ML}$ (not at the unrestricted $\hat{\beta}_{ML}$)

Advantage - only requires estimation of the (simpler) restricted model
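Finally, a sketch of the LM test in the same illustrative example, which needs only the restricted fit: evaluate the unrestricted score and Hessian at the restricted estimate (names carried over from the earlier snippets; estimating $\hat{A}_R = \frac{1}{N} H_R$ makes the 1/N factors cancel):

```python
# Rebuild the full restricted parameter vector, then evaluate the unrestricted
# score and Hessian at the restricted estimate
theta_restr = np.array([res_r.x[0], 0.5, -0.25, res_r.x[1]])
score_r = -approx_fprime(theta_restr, neg_loglik, 1e-6, y, X)     # score of L
H_r = -num_hessian(lambda t: neg_loglik(t, y, X), theta_restr)    # Hessian of L
LM = score_r @ np.linalg.solve(-H_r, score_r)   # = -s' H_r^{-1} s
p_value = chi2.sf(LM, df=2)                     # compare with chi^2(p)
```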