Page 1: Maximum Likelihood Estimation

Instructor: Daisuke Nagakura

Page 2: Maximum Likelihood Estimation

◼ Situation

Let [yiᵀ, xiᵀ]ᵀ be an i.i.d. random vector with yi ∊ Y ⊂ ℝᴳ and xi ∊ X ⊂ ℝᴷ. (The spaces Y and X are called the supports of yi and xi, respectively; they are the collections of values that yi and xi can possibly take.)

We consider the situation where the “true” conditional density (or probability) function of yi conditioned on xi is given as f(yi | xi; θo), where θ is an unknown parameter contained in the density function of yi. The subscript “o” emphasizes that θo is the true parameter vector. Suppose that we know the functional form of f(·|·) but do not know the value of θo.

We are interested in estimating the unknown parameter from the given observations {yi, xi}, i = 1, …, N.

Page 3: Maximum Likelihood Estimation

◼ Likelihood function and Maximum likelihood estimator

For yi, the conditional likelihood function conditioned on xi is defined as

$$L_N(\theta) = \prod_{i=1}^{N} f(y_i \,|\, x_i; \theta).$$

By taking the log of L_N, the conditional log likelihood function for yi conditioned on xi is defined as

$$\log L_N(\theta) = \sum_{i=1}^{N} \log f(y_i \,|\, x_i; \theta).$$

Then the (conditional) maximum likelihood (ML) estimator for θ is defined as

$$\hat{\theta} = \operatorname*{argmax}_{\theta \in \Theta} \frac{\log L_N(\theta)}{N} = \operatorname*{argmax}_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \log f(y_i \,|\, x_i; \theta),$$

where Θ ⊂ ℝᴾ is the parameter space of θ, which contains the true parameter vector θo.
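As a concrete illustration (not from the slides), the following minimal Python sketch computes θ̂ by numerically maximizing the average log likelihood for a hypothetical Poisson regression model yi | xi ~ Poisson(exp(xiᵀθ)); the model, data, and variable names are illustrative assumptions.

```python
# A minimal sketch (illustrative, not from the slides): conditional ML for a
# hypothetical Poisson regression model y_i | x_i ~ Poisson(exp(x_i' theta)).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, P = 500, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # observations x_i
theta_o = np.array([0.5, -0.3])                        # "true" parameter theta_o
y = rng.poisson(np.exp(X @ theta_o))                   # draws from f(y_i | x_i; theta_o)

def avg_loglik(theta):
    # (1/N) * sum_i log f(y_i | x_i; theta); the theta-free log(y_i!) term is dropped
    mu = X @ theta
    return np.mean(y * mu - np.exp(mu))

# theta_hat = argmax over Theta of the average log likelihood
res = minimize(lambda t: -avg_loglik(t), x0=np.zeros(P))
theta_hat = res.x
print(theta_hat)  # should be close to theta_o for moderately large N
```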

Page 4: Maximum Likelihood Estimation

The maximum likelihood estimator is clearly an M estimator. Thus, we can apply the consistency result for the M estimator to the ML estimator.

Let ℓi(θ) = ℓ(yi, xi, θ) = log f(yi | xi; θ), which is called the (conditional) log likelihood for observation i.

Theorem 7 (Consistency of conditional ML estimator) (THEOREM 13.1)

Assume that

(a) for each θ ∊ Θ, ℓ(·, ·, θ) is a Borel measurable function on Y × X;
(b) θo is the unique solution of the problem max_{θ∊Θ} E[ℓi(θ)];
(c) Θ is a compact space;
(d) for each (yi, xi) ∊ Y × X, ℓ(yi, xi, ·) is a continuous function on Θ;
(e) |ℓ(wi, θ)| ≤ b(wi) for all θ ∊ Θ, and E[b(wi)] < ∞, where wi = [yiᵀ, xiᵀ]ᵀ.

Then, the (conditional) ML estimator θ̂ exists and plim θ̂ = θo.

Page 5: Maximum Likelihood Estimation

We can also apply the asymptotic normality result for M estimators to the ML estimator.

Definition 16 (score of the log likelihood)

The score of the log likelihood for observation i is defined as the P × 1 vector of partial derivatives:

$$s_i(\theta) = s(y_i, x_i, \theta) = \frac{\partial \ell_i(\theta)}{\partial \theta} = \left[ \frac{\partial \ell_i(\theta)}{\partial \theta_1}, \frac{\partial \ell_i(\theta)}{\partial \theta_2}, \ldots, \frac{\partial \ell_i(\theta)}{\partial \theta_P} \right]^T.$$

One important property of si(θ) is the zero conditional mean property:

Eo[si(θo) | xi] = 0,

where Eo[·] indicates that the expectation is taken with respect to f(yi | xi; θo). This means that when we evaluate the P × 1 score at θo and take its expectation with respect to f(yi | xi; θo), the expectation is 0. This condition immediately implies that Eo[si(θo)] = 0 by the law of iterated expectations.
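To make the zero conditional mean property concrete, the sketch below reuses the hypothetical Poisson model from page 3, whose score works out to si(θ) = (yi − exp(xiᵀθ))xi, and checks Eo[si(θo) | xi] = 0 by Monte Carlo at one fixed value of xi.

```python
# A minimal sketch: Monte Carlo check of E_o[s_i(theta_o) | x_i] = 0 for the
# hypothetical Poisson model, whose score is s_i(theta) = (y_i - exp(x_i' theta)) x_i.
import numpy as np

rng = np.random.default_rng(1)
theta_o = np.array([0.5, -0.3])
x = np.array([1.0, 0.8])                   # one fixed conditioning value x_i
lam = np.exp(x @ theta_o)                  # E[y_i | x_i] under f(. | x_i; theta_o)

y_draws = rng.poisson(lam, size=200_000)   # draws from f(y | x; theta_o)
scores = (y_draws - lam)[:, None] * x      # s(y, x, theta_o) for each draw
print(scores.mean(axis=0))                 # approximately [0, 0]
```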

Page 6: Maximum Likelihood Estimation

The zero conditional mean property of si(θ) is easy to prove. Let Eθ[· | xi] denote the conditional expectation with respect to the density f(· | xi; θ) for any θ ∊ Θ (not necessarily equal to θo). Then, by definition, we have

$$E_\theta[s_i(\theta) \,|\, x_i] = \int_Y s(y_i, x_i, \theta) f(y_i \,|\, x_i; \theta) \, dy_i.$$

Next, note that because f(yi | xi; θ) is a density function, its integral with respect to yi over Y is 1, namely,

$$1 = \int_Y f(y_i \,|\, x_i; \theta) \, dy_i.$$

Differentiating both sides of this equation with respect to θ over the interior of Θ (hereafter denoted by int(Θ)), we have

$$0 = \int_Y \frac{\partial f(y_i \,|\, x_i; \theta)}{\partial \theta} \, dy_i = \int_Y \frac{\partial \log f(y_i \,|\, x_i; \theta)}{\partial \theta} f(y_i \,|\, x_i; \theta) \, dy_i = \int_Y s(y_i, x_i, \theta) f(y_i \,|\, x_i; \theta) \, dy_i = E_\theta[s_i(\theta) \,|\, x_i].$$

Here, we assumed the interchangeability of differentiation and integration.

Page 7: Maximum Likelihood Estimation

Next, we define the Hessian matrix for observation i.

Definition 17 (Hessian matrix for observation i)

The P × P matrix of second partial derivatives of ℓi(θ), namely,

$$H_i(\theta) = \frac{\partial s_i(\theta)}{\partial \theta^T} = \frac{\partial^2 \ell_i(\theta)}{\partial \theta \, \partial \theta^T},$$

is called the Hessian matrix for observation i.

Let Ao be the negative of the expected value of Hi(θ) at θo, namely, Ao = –Eo[Hi(θo)]. Ao is generally a positive definite matrix when θo is identified. Ao is called the information matrix.
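For the same hypothetical Poisson model, the Hessian is Hi(θ) = −exp(xiᵀθ) xi xiᵀ, so Ao = −Eo[Hi(θo)] can be approximated by a sample average and checked for positive definiteness; a sketch under those assumptions:

```python
# A minimal sketch: approximate A_o = -E_o[H_i(theta_o)] for the hypothetical
# Poisson model, where H_i(theta) = -exp(x_i' theta) x_i x_i', and check that
# A_o is positive definite.
import numpy as np

rng = np.random.default_rng(2)
N = 100_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
theta_o = np.array([0.5, -0.3])

lam = np.exp(X @ theta_o)
outer = X[:, :, None] * X[:, None, :]            # x_i x_i' for each i
A_o = (lam[:, None, None] * outer).mean(axis=0)  # sample analogue of -E_o[H_i(theta_o)]
print(np.linalg.eigvalsh(A_o))                   # all eigenvalues positive
```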

Page 8: Maximum Likelihood Estimation

◼ Asymptotic Normality of ML Estimator

Applying the asymptotic normality result for M estimators to the ML estimator, under regularity conditions that are stated in detail later, we have

$$\sqrt{N}(\hat{\theta} - \theta_o) \overset{A}{\sim} N(0, A_o^{-1} B_o A_o^{-1}),$$

where Bo = Var[si(θo)] = Eo[si(θo) si(θo)ᵀ].

It turns out that the expression for the variance can be simplified. Specifically, we show that Bo = Ao in the context of ML estimation.

Page 9: Maximum Likelihood Estimation

To show this, first differentiate both sides of the identity 0 = Eθ[si(θ) | xi] established on the previous page:

$$0 = \frac{\partial}{\partial \theta^T} E_\theta[s_i(\theta) \,|\, x_i] = \frac{\partial}{\partial \theta^T} \int_Y s(y_i, x_i, \theta) f(y_i \,|\, x_i; \theta) \, dy_i$$
$$= \int_Y \frac{\partial}{\partial \theta^T} \left[ s(y_i, x_i, \theta) f(y_i \,|\, x_i; \theta) \right] dy_i$$
$$= \int_Y \left[ \frac{\partial s(y_i, x_i, \theta)}{\partial \theta^T} f(y_i \,|\, x_i; \theta) + s(y_i, x_i, \theta) \frac{\partial f(y_i \,|\, x_i; \theta)}{\partial \theta^T} \right] dy_i$$
$$= \int_Y H(y_i, x_i, \theta) f(y_i \,|\, x_i; \theta) \, dy_i + \int_Y s(y_i, x_i, \theta) s(y_i, x_i, \theta)^T f(y_i \,|\, x_i; \theta) \, dy_i$$
$$= E_\theta[H_i(\theta) \,|\, x_i] + E_\theta[s_i(\theta) s_i(\theta)^T \,|\, x_i],$$

where we used ∂f(yi | xi; θ)/∂θᵀ = s(yi, xi, θ)ᵀ f(yi | xi; θ) and the interchangeability of differentiation and integration. This implies –Eo[Hi(θo) | xi] = Eo[si(θo)si(θo)ᵀ | xi]. This relationship is called the (conditional) information matrix equality.
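A quick numerical sanity check of the information matrix equality, under the same hypothetical Poisson model: the sample averages of −Hi(θo) and si(θo)si(θo)ᵀ should agree up to simulation noise.

```python
# A minimal sketch: Monte Carlo check that -E_o[H_i(theta_o)] = E_o[s_i(theta_o) s_i(theta_o)'].
import numpy as np

rng = np.random.default_rng(3)
N = 200_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
theta_o = np.array([0.5, -0.3])
lam = np.exp(X @ theta_o)
y = rng.poisson(lam)

S = (y - lam)[:, None] * X                                  # scores s_i(theta_o)
B_o = (S[:, :, None] * S[:, None, :]).mean(axis=0)          # sample E_o[s s']
A_o = (lam[:, None, None] * X[:, :, None] * X[:, None, :]).mean(axis=0)  # sample -E_o[H]
print(np.allclose(A_o, B_o, rtol=0.05))                     # True up to simulation noise
```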

Page 10: Maximum Likelihood Estimation

By the law of iterated expectations, the conditional information matrix equality immediately implies that its unconditional version holds, namely,

–Eo[Hi(θo)] = Eo[si(θo)si(θo)ᵀ], or Ao = Bo.

Because of the information matrix equality, the asymptotic covariance matrix of √N(θ̂ – θo) reduces to Avar[√N(θ̂ – θo)] = Ao⁻¹ (= Bo⁻¹). Thus, we have the following theorem:

Theorem 8 (Asymptotic normality of ML estimator) (THEOREM 13.2)

Suppose that the conditions in Theorem 7 hold. In addition, assume that

(a) θo ∊ int(Θ);
(b) Ao is positive definite;
(c) ℓ(yi, xi, ·) is twice continuously differentiable on int(Θ) for each (y, x) ∊ Y × X;
(d) the interchanges of derivative and integral on the previous pages hold for all θ ∊ int(Θ);
(e) the elements of Hi(θ) are bounded in absolute value by a function b(y, x) with Eo[b(y, x)] < ∞.

Then, we have

$$\sqrt{N}(\hat{\theta} - \theta_o) \overset{A}{\sim} N(0, A_o^{-1}).$$

Page 11: Maximum Likelihood Estimation

◼ Estimating the Asymptotic Variance

For estimating Ao = Bo, the following three estimators are frequently used:

$$\hat{A}_o^{(EH)} = \frac{1}{N} \sum_{i=1}^{N} \left( -H_i(\hat{\theta}) \right) \quad \text{(Empirical Hessian (EH) Estimator)}$$

$$\hat{A}_o^{(OPG)} = \frac{1}{N} \sum_{i=1}^{N} s_i(\hat{\theta}) s_i(\hat{\theta})^T \quad \text{(Outer Product of Gradient (OPG) Estimator)}$$

$$\hat{A}_o^{(IM)} = \frac{1}{N} \sum_{i=1}^{N} IM(x_i, \hat{\theta}) \quad \text{(Information Matrix Estimator)}$$

where IM(xi, θ) = –Eθ[H(yi, xi, θ) | xi].
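A sketch of how these look in practice for the hypothetical Poisson model used earlier. In that model −Hi(θ) = exp(xiᵀθ) xi xiᵀ happens not to depend on yi, so IM(xi, θ) = Eθ[−H | xi] coincides exactly with the empirical Hessian term; this is an artifact of the example, not a general fact.

```python
# A minimal sketch: the EH, OPG, and IM estimators of A_o evaluated at theta_hat,
# for the hypothetical Poisson model. Here -H_i(theta) = exp(x_i' theta) x_i x_i'
# does not involve y_i, so IM(x_i, theta) equals -H_i(theta) exactly.
import numpy as np

def ahat_estimators(y, X, theta_hat):
    lam = np.exp(X @ theta_hat)
    outer = X[:, :, None] * X[:, None, :]                  # x_i x_i'
    S = (y - lam)[:, None] * X                             # s_i(theta_hat)
    A_EH = (lam[:, None, None] * outer).mean(axis=0)       # empirical Hessian estimator
    A_OPG = (S[:, :, None] * S[:, None, :]).mean(axis=0)   # outer product of gradient
    A_IM = A_EH                                            # information matrix estimator
    return A_EH, A_OPG, A_IM
```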

Page 12: Maximum Likelihood Estimation

Each of these three estimators has its own advantages and disadvantages.

● Among the three, the IM estimator is generally the best estimator; however, it is often difficult to calculate IM(xi, θ) analytically in closed form.

● The OPG estimator is the easiest to compute because it only requires calculating the first partial derivatives of ℓi(θ), that is, si(θ), but its performance is generally worse than that of the other two estimators.

● The EH estimator also has a computational disadvantage because it requires calculating the Hessian matrix of ℓi(θ). Note that to calculate IM(xi, θ), we do not necessarily need to calculate the Hessian matrix, because by the information matrix equality, IM(xi, θ) is also equal to Eθ[si(θ)si(θ)ᵀ | xi], which requires the calculation of si(θ) only (plus the conditional expectation).

Page 13: Maximum Likelihood Estimation

Once we obtain a consistent estimator of Ao, the asymptotic variance of √N(θ̂ – θo), namely Ao⁻¹, can be consistently estimated by the estimator's inverse. (The elements of an inverse matrix are continuous functions of the elements of the original matrix, so Theorem 1, or Slutsky's theorem, implies that the inverse of a consistent estimator of a matrix converges in probability to the inverse of the true matrix.)

Then, the consistent estimator of Avar(θ̂) is obtained from the relationship:

$$\mathrm{Avar}(\hat{\theta}) = \frac{\mathrm{Avar}[\sqrt{N}(\hat{\theta} - \theta_o)]}{N}.$$
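In code the final step is short: invert the estimated Ao, divide by N, and take square roots of the diagonal to get standard errors. A sketch under the same assumptions as the earlier examples:

```python
# A minimal sketch: standard errors from a consistent estimate A_hat of A_o,
# using Avar(theta_hat) = A_hat^{-1} / N.
import numpy as np

def ml_standard_errors(A_hat, N):
    avar = np.linalg.inv(A_hat) / N    # estimated Avar(theta_hat)
    return np.sqrt(np.diag(avar))      # standard errors of the elements of theta_hat
```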

Page 14: Maximum Likelihood Estimation

◼ Quasi Maximum Likelihood Estimation

So far, we have assumed that the density f(yi | xi; θo) corresponds to the true unknown density. However, this assumption may not be true in practice.

In that case, what is the meaning of the ML estimator θ̂, or of the estimated density (model) f(yi | xi; θ̂)?

In this case, we can still regard the model as an approximation to the unknown true model in a certain sense, and the ML estimator θ̂ gives the best approximating model within the framework of f(yi | xi; θ). In such a case, the ML estimator is called the quasi-maximum likelihood (QML) estimator (some authors prefer the name pseudo-maximum likelihood (PML) estimator).

Page 15: Maximum Likelihood Estimation

More specifically, let g(yi | xi) be the true density of yi given xi.

The QML estimator is a consistent estimator for the value of θ that maximizes the “closeness” between g(yi | xi) and f(yi | xi; θ), as measured by the (conditional) Kullback–Leibler information criterion (KLIC):

$$K(f; x) = \int_Y \log \left( \frac{g(y \,|\, x)}{f(y \,|\, x; \theta)} \right) g(y \,|\, x) \, dy.$$

It is easy to see that when f(yi | xi; θ) is actually the true model at θo, that is, f(yi | xi; θo) = g(yi | xi), then K(f; x) is zero at θ = θo. We can also show the following conditional KL information inequality: K(f; x) ≥ 0 for all x ∊ X. Thus, the KLIC is minimized when f(yi | xi; θo) is actually the true model.
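A numerical illustration (illustrative choices, not from the slides): for a discrete y the integral becomes a sum. Below, the true density g is Poisson(2) and the candidate model f is Poisson(λ), so the KLIC is approximately zero when the model matches the truth and strictly positive otherwise.

```python
# A minimal sketch: KLIC between a "true" Poisson(2) density g and a candidate
# Poisson(lam) model f. For discrete y, the integral over Y becomes a sum.
import numpy as np
from scipy.stats import poisson

y = np.arange(41)            # truncated support; the omitted tail mass is negligible
g = poisson.pmf(y, 2.0)

def klic(lam):
    f = poisson.pmf(y, lam)
    return np.sum(g * np.log(g / f))

print(klic(2.0))   # ~0.0: f matches g
print(klic(3.0))   # strictly positive, consistent with K(f; x) >= 0
```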

Page 16: Maximum Likelihood Estimation

In practice, f(yi | xi; θ) would never correspond exactly to g(yi | xi), the true conditional density of yi. However, we can still consider f(yi | xi; θ) as an approximate model and try to minimize the KLIC with respect to θ. Let θ* be the value of θ that minimizes the KLIC. It is known that the QML estimator is a consistent estimator for θ*.

Here, we just state a known result about θ* and the QML estimator. The QML estimator satisfies

$$\sqrt{N}(\hat{\theta} - \theta^*) \overset{A}{\sim} N(0, A^{*-1} B^* A^{*-1}),$$

where A* = –E[Hi(θ*)] and B* = E[si(θ*)si(θ*)ᵀ]. Note that here the expectations are taken with respect to the true density g(yi | xi). The conditions for this result to hold can be deduced from the theory of M estimators.

Page 17: Maximum Likelihood Estimation

The asymptotic variance Avar[√N(θ̂ – θ*)] can be consistently estimated by

$$\widehat{\mathrm{Avar}}[\sqrt{N}(\hat{\theta} - \theta^*)] = \hat{A}^{-1} \hat{B} \hat{A}^{-1}, \quad \hat{A} = -\frac{1}{N} \sum_{i=1}^{N} H_i(\hat{\theta}), \quad \hat{B} = \frac{1}{N} \sum_{i=1}^{N} s_i(\hat{\theta}) s_i(\hat{\theta})^T.$$
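A sketch of this sandwich estimator, again under the hypothetical Poisson example used earlier. Under misspecification Â and B̂ no longer cancel, which is exactly why Â⁻¹B̂Â⁻¹ is used instead of Â⁻¹.

```python
# A minimal sketch: the sandwich estimator A_hat^{-1} B_hat A_hat^{-1} of
# Avar[sqrt(N)(theta_hat - theta*)], for the hypothetical Poisson example.
import numpy as np

def qml_sandwich_avar(y, X, theta_hat):
    lam = np.exp(X @ theta_hat)
    S = (y - lam)[:, None] * X                            # s_i(theta_hat)
    outer = X[:, :, None] * X[:, None, :]                 # x_i x_i'
    A_hat = (lam[:, None, None] * outer).mean(axis=0)     # -(1/N) sum_i H_i(theta_hat)
    B_hat = (S[:, :, None] * S[:, None, :]).mean(axis=0)  # (1/N) sum_i s_i s_i'
    A_inv = np.linalg.inv(A_hat)
    return A_inv @ B_hat @ A_inv                          # robust to misspecification of f
```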

Page 18: Maximum Likelihood Estimation

Exercise Questions

Exercise 1: Prove that the KLIC satisfies K(f; x) ≥ 0 for all x ∊ X.