  • Theory of Statistics II (Fall 2016)

    J.P.Kim

    Dept. of Statistics

    First Draft Published at: January 1, 2017

    Final Draft Published at: November 11, 2018

  • Preface & Disclaimer

This note is a summary of the lecture Theory of Statistics II (326.522) held at Seoul National University, Fall 2016. The lecturer was B.U.Park, and the note was written by J.P.Kim, a Ph.D. student. There are a few textbooks and references for this course. The contents and corresponding references are as follows.

    • Asymptotic Approximations. Reference: Mathematical Statistics: Basic ideas and selected

    topics, Vol. I., 2nd edition, P.Bickel & K.Doksum, 2007.

    • Weak Convergence. Reference: Convergence of Probability Measures, P.Billingsley, 1999.

    • Nonparametric Density Estimation. Reference: Kernel Smoothing, M.P.Wand & M.C.Jones,

    1994.

Lecture notes are available at stat.snu.ac.kr/theostat. I also referred to the following books while writing this note. The list will be updated continuously.

    • Probability: Theory and Examples, R.Durrett

    • Mathematical Statistics (in Korean), W.C.Kim

    • Introduction to Nonparametric Regression, K.Takezawa

    • Semiparametric Regression, D.Ruppert, M.P.Wand & R.J.Carroll

    • Nonparametric and Semiparametric Models, W.Härdle, M.Müller, S.Sperlich & A.Werwatz

If you want to report typos or mistakes, please contact: [email protected]

    Acknowledgement

I'm very grateful to my colleagues for their comments and corrections. Their suggestions were very helpful in improving this note.

    Special Thanks to: Hyemin Yeon, Seonghun Cho


  • Chapter 1

    Asymptotic Approximations

    1.1 Consistency

    1.1.1 Preliminary for the chapter

Definition 1.1.1 (Notations). Let Θ be a parameter space. We consider a 'random variable' X on the probability space (Ω, F, P_θ), which is a function

X : (Ω, F, P_θ) → (R^d, B(R^d), P^X_θ),

where P^X_θ := P_θ ∘ X^{-1}. Note that P_θ is a probability measure from the model P := {P_θ : θ ∈ Θ}. For convenience, we omit this fundamental setting from now on.

Definition 1.1.2 (Convergence). Let {Xn} be a sequence of random variables.

1. Xn →_{a.s.} X (as n → ∞) if P(lim_{n→∞} Xn = X) = 1
   ⇔ P(|Xn − X| > ε i.o.) = 0 for all ε > 0
   ⇔ lim_{N→∞} P(⋃_{n=N}^∞ {|Xn − X| > ε}) = 0 for all ε > 0.

2. Xn →_P X (as n → ∞) if for all ε > 0, P(|Xn − X| > ε) → 0 as n → ∞.

Proposition 1.1.3. Xn →_P X if and only if for any subsequence {n_k} ⊆ {n} there is a further subsequence {n_{k_j}} ⊆ {n_k} such that X_{n_{k_j}} →_{a.s.} X as j → ∞.

Proof. Durrett, p.65.

Definition 1.1.4 (Consistency). q̂n = qn(X1, · · · , Xn) is a consistent estimator of q(θ) if

q̂n →_{P_θ} q(θ) as n → ∞

for any θ ∈ Θ. (We do not know what the true parameter is.)

Remark 1.1.5. There are some tools to obtain consistency.

1. Var(Zn) → 0 and EZn → µ as n → ∞ ⇒ Zn →_P µ.

   ∵ P(|Zn − µ| > ε) ≤ P(|Zn − EZn| + |EZn − µ| > ε)
     ≤ P(|Zn − EZn| > ε/2) + P(|EZn − µ| > ε/2)   (the second term is 0 for sufficiently large n)
     ≤ (4/ε²) Var(Zn) → 0.

2. WLLN: X1, · · · , Xn i.i.d. with E|X1| < ∞ ⇒ X̄n →_P EX1.

3. If Yn →_P Y and Zn →_P Z, then (a) Yn + Zn →_P Y + Z, (b) YnZn →_P YZ, and (c) Yn/Zn →_P Y/Z provided P(Z = 0) = 0.

   (b) For any η > 0 there exists M > 0 such that P(|Y| > M) ≤ η and P(|Z| > M) ≤ η. Now,

   P(|YnZn − YZ| > ε) ≤ P(|Yn||Zn − Z| > ε/2) + P(|Z||Yn − Y| > ε/2)
     ≤ P(|Yn − Y||Zn − Z| > ε/4) + P(|Y||Zn − Z| > ε/4) + P(|Z||Yn − Y| > ε/2),

   and note that P(|Y||Zn − Z| > ε/4) = P(|Y||Zn − Z| > ε/4, |Y| > M) + P(|Y||Zn − Z| > ε/4, |Y| ≤ M) ≤ η + P(|Zn − Z| ≥ ε/(4M)). Thus

   limsup_{n→∞} P(|Y||Zn − Z| > ε/4) ≤ η

   and similarly

   limsup_{n→∞} P(|Z||Yn − Y| > ε/2) ≤ η.

   Now, since

   P(|Yn − Y||Zn − Z| > ε/4) = P(|Yn − Y||Zn − Z| > ε/4, |Yn − Y| > √(ε/4)) + P(|Yn − Y||Zn − Z| > ε/4, |Yn − Y| ≤ √(ε/4))
     ≤ P(|Yn − Y| > √(ε/4)) + P(|Zn − Z| ≥ √(ε/4)) → 0

   as n → ∞, we get

   limsup_{n→∞} P(|YnZn − YZ| > ε) ≤ 2η.

   Finally, since η > 0 was arbitrary, we get the result.

   (c) By (b), it is sufficient to show that Zn^{-1} →_P Z^{-1}. Since P(Z = 0) = 0, for any η > 0 there exists δ > 0 such that P(|Z| ≤ δ) ≤ η. (If not, ∃η > 0 such that ∀δ > 0, P(|Z| ≤ δ) > η. Then by continuity of measure, P(⋂_{δ>0}{|Z| ≤ δ}) = P(Z = 0) ≥ η > 0. Contradiction.) Thus

   P(|1/Zn − 1/Z| > ε) = P(|Zn − Z|/|ZnZ| > ε)
     ≤ P(|Zn − Z|/(|Z|(|Z| − |Zn − Z|)) > ε) + P(|Z| < |Zn − Z|)
     ≤ P(|Zn − Z|/(|Z|(|Z| − |Zn − Z|)) > ε, |Z| > δ, |Zn − Z| ≤ δ/2)   [≤ P(|Zn − Z| > δ²ε/2) → 0]
       + P(|Z| ≤ δ)   [≤ η]
       + P(|Zn − Z| > δ/2)   [→ 0]
       + P(|Z| < |Zn − Z|, |Zn − Z| > δ)   [≤ P(|Zn − Z| > δ) → 0]
       + P(|Z| < |Zn − Z|, |Zn − Z| ≤ δ)   [≤ P(|Z| ≤ δ) ≤ η],

   and hence

   limsup_{n→∞} P(|1/Zn − 1/Z| > ε) ≤ 2η

   holds. Note that η > 0 was arbitrary.

Definition 1.1.6 (Probabilistic O-notation). Let Xn be a sequence of r.v.'s.

1. Xn = Op(1) if lim_{c→∞} sup_n P(|Xn| > c) = 0 ⇔ lim_{c→∞} limsup_{n→∞} P(|Xn| > c) = 0. ("Bounded in probability")

2. Xn = op(1) if Xn →_P 0 as n → ∞.

3. Xn = Op(an) if Xn/an = Op(1), and Xn = op(an) if Xn/an = op(1).

Proposition 1.1.7. If Xn →_d X for some X, then Xn = Op(1).

Proof. For given ε > 0, there exists c such that P(|X| > c) < ε/2. For such c, P(|Xn| > c) → P(|X| > c), so ∃N s.t.

sup_{n>N} |P(|Xn| > c) − P(|X| > c)| < ε/2.

Thus sup_{n>N} P(|Xn| > c) < ε. For n = 1, 2, · · · , N, there exists cn such that P(|Xn| > cn) < ε, and letting c* = max(c1, · · · , cN, c), we get sup_n P(|Xn| > c*) < ε.

Example 1.1.8 (Simple Linear Regression). Consider a simple linear regression model yi = β0 + β1 xi + εi, where the εi are i.i.d. (0, σ²). Also assume that x1, · · · , xn are known and not all equal. Note that

β̂_{1,n} = Σ_{i=1}^n (xi − x̄)Yi / Σ_{i=1}^n (xi − x̄)².

Since E(β̂_{1,n}) = β1 and Var(β̂_{1,n}) = σ²/Sxx, we obtain consistency

β̂_{1,n} →_{P_{β,σ²}} β1

provided that Sxx = Σ_{i=1}^n (xi − x̄)² → ∞ as n → ∞.
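As a quick numerical illustration (a minimal simulation sketch; the model parameters and the fixed design below are arbitrary choices, not from the example), the slope estimator concentrates around β1 as Sxx grows:

```python
import numpy as np

# Minimal sketch: beta1_hat = sum((x_i - xbar) Y_i) / sum((x_i - xbar)^2)
# concentrates around beta1 as n grows, because Var(beta1_hat) = sigma^2 / S_xx -> 0.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 2.0, 1.0          # illustrative values

for n in [10, 100, 1000, 10000]:
    x = np.linspace(0.0, 1.0, n)             # fixed design, not all equal
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    sxx = np.sum((x - x.mean()) ** 2)        # S_xx grows roughly like n/12 here
    b1 = np.sum((x - x.mean()) * y) / sxx    # least-squares slope
    print(f"n={n:6d}  S_xx={sxx:9.2f}  beta1_hat={b1:.4f}  Var(beta1_hat)={sigma**2/sxx:.5f}")
```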

Example 1.1.9 (Sample correlation coefficient). Let (X1, Y1), · · · , (Xn, Yn) be a random sample from a population with

EX1 = µ1, EY1 = µ2, Var(X1) = σ1² > 0, Var(Y1) = σ2² > 0, and Corr(X1, Y1) = ρ.

Then by the WLLN we get

( X̄, Ȳ, (1/n)ΣXi², (1/n)ΣYi², (1/n)ΣXiYi ) →_P ( EX1, EY1, EX1², EY1², EX1Y1 ).

Since the function

g(u1, u2, u3, u4, u5) = (u5 − u1u2) / ( √(u3 − u1²) √(u4 − u2²) )

is continuous at (EX1, EY1, EX1², EY1², EX1Y1), we get

ρ̂n = ( (1/n)ΣXiYi − X̄Ȳ ) / ( √((1/n)ΣXi² − X̄²) √((1/n)ΣYi² − Ȳ²) ) →_P ρ.

Remark 1.1.10. Note that, if Xn →_P c where c is a constant, then continuity of g(x) at x = c is sufficient for consistency g(Xn) →_P g(c). It is trivial from the definition of continuity.

Example 1.1.11. Let X1, · · · , Xn be a random sample from a population with cdf F. Then we use the empirical distribution function

F̂n(x) = (1/n) Σ_{i=1}^n I_{(−∞,x]}(Xi) = (1/n) Σ_{i=1}^n I(Xi ≤ x)

for estimation of F. Then by the WLLN, for each x, F̂n(x) is a consistent estimator of F(x),

F̂n(x) →_P F(x).

Remark 1.1.12. Note that in this case we can say something stronger, which is known as the Glivenko-Cantelli theorem:

sup_x |F̂n(x) − F(x)| →_P 0 as n → ∞.

A sketch of the proof: since F̂n and F are nondecreasing and bounded, we can partition [0, 1], which is the range of both functions, into a finite number of intervals, and then each interval has a well-defined inverse image which is an interval. For the whole proof, see Durrett, p.76.
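As a small illustration of the sup-distance in the Glivenko-Cantelli theorem (a sketch only; the standard normal population is an arbitrary choice), one can compute the Kolmogorov distance of F̂n from F exactly at the jump points:

```python
import numpy as np
from scipy.stats import norm

# Sketch: Kolmogorov distance sup_x |F_n(x) - F(x)| for a standard normal sample.
# For the piecewise-constant F_n it is attained at jump points, checking both one-sided gaps.
rng = np.random.default_rng(1)

for n in [50, 500, 5000, 50000]:
    xs = np.sort(rng.normal(size=n))
    F = norm.cdf(xs)
    i = np.arange(1, n + 1)
    d_n = max(np.max(i / n - F), np.max(F - (i - 1) / n))
    print(f"n={n:6d}  sup_x |F_n - F| = {d_n:.4f}")
```

The printed distance shrinks toward 0 as n grows, as the theorem asserts.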

    1.1.2 FSE and MLE in Exponential Families

    FSE

Recall that the FSE of ν(F) is defined as ν(F̂n). One example of an FSE is the MME: to estimate EX1^j =: νj(F) := ∫ x^j dF(x), we use

m̂j = νj(F̂n) = ∫ x^j dF̂n(x) = (1/n) Σ_{i=1}^n Xi^j.

By the WLLN we have (m̂1, m̂2, · · · , m̂k)^⊤ →_P (m1, m2, · · · , mk)^⊤, where mj = EX1^j, so we can obtain consistency of the MME easily.

Proposition 1.1.13. Let q = h(m1, m2, · · · , mk) be a parameter of interest, where the mj's are population moments. Then for the MME

q̂n = h(m̂1, · · · , m̂k)

based on a random sample X1, · · · , Xn,

q̂n →_P q

holds, provided that h is continuous at (m1, · · · , mk)^⊤.

We can do similar work for the FSE ν(F̂n). Note that here ν is a functional, so we must define continuity of a functional. We may use the sup norm as a metric on the space of distribution functions.

Definition 1.1.14. Let F be a space of distribution functions. On this space, we use the sup norm

‖F‖ = sup_x |F(x)|.

Then the metric is given as

‖F − G‖ = sup_x |F(x) − G(x)|.

Also, we say that a functional ν : F → R is continuous at F if for any ε > 0 there exists δ > 0 such that

‖G − F‖ < δ ⇒ |ν(G) − ν(F)| < ε.

Remark 1.1.15. Note that since ‖F̂n − F‖ → 0 as n → ∞ by the Glivenko-Cantelli theorem, we get consistency of the FSE,

ν(F̂n) →_P ν(F),

provided that ν is continuous at F. In many cases, showing continuity may be a difficult problem.

Example 1.1.16 (Best Linear Predictor). Let X1, · · · , Xn be k-dimensional i.i.d. r.v.'s, and Y1, · · · , Yn be i.i.d. 1-dimensional random variables. Then we know that

BLP(x) = µ2 + Σ21 Σ11^{-1} (x − µ1),

where, for (X, Y) distributed as (X1, Y1),

E(X) = µ1, E(Y) = µ2 and Var((X^⊤, Y)^⊤) = ( Σ11  Σ12 ; Σ21  Σ22 ).

Thus for the sample variances

S11 = (1/n) Σ_{i=1}^n (Xi − X̄)(Xi − X̄)^⊤,
S12 = (1/n) Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) = S21^⊤,
S22 = (1/n) Σ_{i=1}^n (Yi − Ȳ)²,

we obtain the FSE for the BLP,

BLP̂_FSE(x) = Ȳ + S21 S11^{-1} (x − X̄).

Note that it is the same as the simple linear regression line. Details are given in the next proposition.

Proposition 1.1.17.

(a) The solution of the minimization problem

(β0*, β1*^⊤)^⊤ = argmin_{β0,β1} E(Y − β0 − β1^⊤X)²

is

BLP(x) := β0* + β1*^⊤x = µ2 + Σ21 Σ11^{-1}(x − µ1).

(b) For Y = (Y1, · · · , Yn)^⊤ and design matrix X = (1, X1) where X1 = (X1, · · · , Xn)^⊤, the LSE is

β̂1 = S11^{-1} S12 and β̂0 = Ȳ − X̄^⊤ β̂1.

Proof. (a) Two approaches are given. The first is a direct proof: it is clear because

E(Y − β0 − β1^⊤X)² = E[(Y − µ2) − β1^⊤(X − µ1)]² + [µ2 − β0 − β1^⊤µ1]²
 = Σ22 − 2β1^⊤Σ12 + β1^⊤Σ11β1 + [β0 − (µ2 − β1^⊤µ1)]².

The second approach uses projection in L² space. For convenience, suppose EX = 0 and EY = 0. Then (β0*, β1*^⊤)^⊤ should satisfy

⟨β0 + β1^⊤X, Y − β0* − β1*^⊤X⟩ = 0 ∀β0, β1.

It yields that

β0* = 0, β1* = (E(XX^⊤))^{-1} E(XY).

(b) Xβ̂ = 1β̂0 + X1β̂1 should satisfy 1β̂0 + X1β̂1 = Π(Y | C(X)). For X̃1 = X1 − Π(X1 | C(1)) = X1 − 1X̄^⊤, from

1β̂0 + X1β̂1 = 1( β̂0 + (1^⊤X1/n) β̂1 ) + X̃1β̂1 = Π(Y | C(1)) + Π(Y | C(X̃1)),

we get

β̂0 = Ȳ − X̄^⊤β̂1 and β̂1 = (X̃1^⊤X̃1)^{-1} X̃1^⊤ Y.

Now X̃1^⊤X̃1/n = S11 and X̃1^⊤Y/n = S12 end the proof.

    Figure 1.1: Regression with intercept. Image from Lecture Note.

Example 1.1.18 (Multiple Correlation Coefficient). We define the multiple correlation coefficient (MCC) as

ρ = max_{β0,β1} Corr(Y, β0 + β1^⊤X)

and the sample MCC as

ρ̂n = max_{β0,β1} Ĉorr(Y, β0 + β1^⊤X).

Note that

Corr(Y, β0 + β1^⊤X) = Corr(Y − µ2, β1^⊤(X − µ1))
 = Σ21β1 / ( √Σ22 √(β1^⊤Σ11β1) )
 = (Σ11^{-1/2}Σ12)^⊤(Σ11^{1/2}β1) / ( √Σ22 √(β1^⊤Σ11β1) )
 ≤ √( Σ21Σ11^{-1}Σ12 / Σ22 )

holds by the Cauchy-Schwarz inequality, and equality holds when β1 = Σ11^{-1}Σ12. Thus the population MCC is obtained as

ρ = √( Σ21Σ11^{-1}Σ12 / Σ22 ).

Meanwhile, the sample correlation is obtained as

Ĉorr(Y, β0 + β1^⊤X) = ⟨Y − Ȳ1, (X1 − 1X̄^⊤)β1⟩ / ( ‖Y − Ȳ1‖ ‖(X1 − 1X̄^⊤)β1‖ ),

so it is the cosine of the angle between the two vectors Y − Ȳ1 and X̃1β1. Its maximal value is attained by X̃1β̂1 = Π(Y − Ȳ1 | C(X̃1)). (Or, one may use Cauchy-Schwarz,

Ĉorr(Y, β0 + β1^⊤X) = y^⊤X̃1(X̃1^⊤X̃1)^{-1}X̃1^⊤X̃1β1 / ( ‖Y − Ȳ1‖ ‖X̃1β1‖ ) ≤ √(y^⊤Π_{X̃1}y) / √(y^⊤(I − Π_1)y),

where equality holds iff X̃1(X̃1^⊤X̃1)^{-1}X̃1^⊤y = X̃1β1, i.e., β1 = β̂1.) Thus,

ρ̂² = SSR/SST = β̂1^⊤X̃1^⊤X̃1β̂1 / ‖Y − Ȳ1‖² = S21S11^{-1}S12 / S22.

Example 1.1.19 (Sample Proportions). Let (X1, · · · , Xk)^⊤ ∼ Multi(n, p), where p ∈ Θ := {(p1, · · · , pk)^⊤ : Σ_{i=1}^k pi = 1, pi ≥ 0 (i = 1, 2, · · · , k)}. We estimate p with the sample proportion

p̂n = (X1/n, · · · , Xk/n)^⊤.

Then,

(a) p̂n is a consistent estimator of p, i.e.,

∀ε > 0, sup_{p∈Θ} P_p(|p̂n − p| ≥ ε) → 0 as n → ∞.

(b) q(p̂n) is a consistent estimator of q(p) provided that q is (uniformly) continuous on Θ.

Proof. (a) Note that there exists a constant C > 0 such that

sup_{p∈Θ} P_p(|p̂n − p| ≥ ε) ≤ sup_{p∈Θ} E|p̂n − p|²/ε² = sup_{p∈Θ} Σ_{i=1}^k pi(1 − pi)/(nε²) ≤ C/(nε²) → 0 as n → ∞,

so we get the desired result. Note that the first inequality is from Chebyshev's inequality.

(b) Note that q is uniformly continuous on Θ, since Θ is closed and bounded. Thus the assertion holds. More precisely, for any ε > 0 there exists δ > 0 such that

|p′ − p| < δ, p, p′ ∈ Θ ⇒ |q(p′) − q(p)| < ε.

Therefore, we get

sup_{p∈Θ} P_p(|q(p̂n) − q(p)| ≥ ε) ≤ sup_{p∈Θ} P_p(|p̂n − p| ≥ δ) → 0 as n → ∞.

    MLE in exponential families

Consider a random variable X with pdf in a canonical exponential family

qη(x) = h(x) exp(η^⊤T(x) − A(η)) I_X(x), η ∈ E,

where E is the natural parameter space in R^k. Our goal is to show consistency of the MLE in the canonical exponential family.

Theorem 1.1.20. Let

qη(x) = h(x) exp(η^⊤T(x) − A(η)) I_X(x), η ∈ E

be a canonical exponential family with natural parameter space E ⊆ R^k. Further assume

(i) E is open.

(ii) The family is of rank k.

(iii) t0 := T(x) ∈ C⁰, where C denotes the smallest convex set containing the support of T(X), and C⁰ is its interior.

Then the unique ML estimate η̂(x) exists and is the solution of the likelihood equation

l̇x(η) = T(x) − Ȧ(η) = 0.

Remark 1.1.21. Note that in (iii), x is the observation of X, so t0 is the observation of T(X). It is reasonable to consider t0 because the ML estimate depends only on t0. Also, recall that (ii) means

∄a ≠ 0 s.t. [ Pη(a^⊤(T(X) − µ) = 0) = 1 for some η ∈ E ]
⇔ ∄a ≠ 0 s.t. [ Varη(a^⊤T(X)) = 0 for some η ∈ E ]
⇔ Ä(η) is positive definite ∀η ∈ E.

    To prove this, we need some preparation.

Lemma 1.1.22.

(a) ("Supporting Hyperplane Theorem") Let C ⊆ R^k be a convex set, and C⁰ be its interior. Then for t0 ∉ C or t0 ∈ ∂C,

∃a ≠ 0 s.t. [ a^⊤t ≥ a^⊤t0 ∀t ∈ C ].

Conversely, for t0 ∈ C⁰,

∄a ≠ 0 s.t. [ a^⊤t ≥ a^⊤t0 ∀t ∈ C ].

(b) Let P(T ∈ T) = 1 and E(max_i |Ti|) < ∞. Then, under the rank condition (ii), µ := ET ∈ C⁰.

Proof. (a) Take δ > 0 such that B(t0, δ) ⊆ C⁰, since C⁰ is open. Note that for any u s.t. ‖u‖ = 1, we get

t0 − (δ/2)u, t0 + (δ/2)u ∈ B(t0, δ) ⊆ C.

If ∃a ≠ 0 such that a^⊤t ≥ a^⊤t0 ∀t ∈ C, then

a^⊤(t0 − (δ/2)u) ≥ a^⊤t0, a^⊤(t0 + (δ/2)u) ≥ a^⊤t0

holds for u = a/|a|, which yields a contradiction. (Note that the convexity condition is not used.)

Figure 1.2: Proof of (a)

(b) Note that since C is a convex set, µ := ET ∈ C holds. (A convex set contains the average of itself.) Assume µ ∉ C⁰. Then µ ∈ ∂C. Then by (a), ∃a ≠ 0 such that a^⊤t ≥ a^⊤µ for any t ∈ C. It implies that ∃a ≠ 0 such that P(a^⊤(T − µ) ≥ 0) = 1, since T ⊆ C. It implies that

P(a^⊤(T − µ) = 0) = 1,

by the fact that

f ≥ 0, ∫f dµ = 0 ⇒ f = 0 µ-a.e..

This contradicts (ii), the full rank condition of the exponential family.

(c) Done in TheoStat I.

Proof of theorem. By the lemma, it is sufficient to show that:

(1) lx(η) diverges to −∞ at the boundary. (Existence)

(2) Uniqueness.

Uniqueness is clear since lx(η) is a C² function and strictly concave from Ä(η) > 0. Thus, our claim is:

Claim. lx(η) approaches −∞ on the boundary ∂E.

Let η0 ∈ ∂E. Then there is a sequence ηn → η0 with ηn ∈ E. Now our claim is that, for any such sequence ηn, we get lx(ηn) → −∞ as n → ∞. Note that, passing to a subsequence if necessary, either |ηn| → ∞ or sup_n |ηn| < ∞. For both cases, from lx(η) = log h(x) + η^⊤T(x) − A(η) and e^{A(η)} = ∫_X h(x) e^{η^⊤T(x)} dµ(x), we get

−lx(ηn) + log h(x) = A(ηn) − ηn^⊤t0 = log ∫_X exp( ηn^⊤(T(y) − t0) ) h(y) dµ(y).

Case 1. |ηn| → ∞.

Then since

∫_X e^{ηn^⊤(T(y)−t0)} h(y) dµ(y) ≥ ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 1/k }} e^{|ηn| · (ηn/|ηn|)^⊤(T(y)−t0)} h(y) dµ(y)
 ≥ ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 1/k }} e^{|ηn|/k} h(y) dµ(y)
 = exp(|ηn|/k) ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 1/k }} h(y) dµ(y),

if we can conclude

inf_n ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 1/k }} h(y) dµ(y) > 0,

then by the assumption |ηn| → ∞, we get lx(ηn) → −∞. Note that if

inf_{u:‖u‖=1} ∫_{{ u^⊤(T(y)−t0) > 0 }} h(y) dµ(y) > 0,

then

inf_n ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 0 }} h(y) dµ(y) > 0,

and from

inf_n ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 1/k }} h(y) dµ(y) → inf_n ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 0 }} h(y) dµ(y) as k → ∞,

we get that ∃ε > 0 and k s.t.

inf_n ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 1/k }} h(y) dµ(y) > ε,

and the assertion holds. So our claim is:

Claim. inf_{u:‖u‖=1} ∫_{{ u^⊤(T(y)−t0) > 0 }} h(y) dµ(y) > 0.


Assume not. If

inf_{u:‖u‖=1} ∫_{{ u^⊤(T(y)−t0) > 0 }} h(y) dµ(y) = 0,

then since {u : ‖u‖ = 1} is compact, there exists u0 ∈ {u : ‖u‖ = 1} such that

∫_{{ u0^⊤(T(y)−t0) > 0 }} h(y) dµ(y) = 0.

It implies h(y) = 0 on the set {y : u0^⊤(T(y) − t0) > 0} µ-a.e., and hence

∫_{{ u0^⊤(T(y)−t0) > 0 }} h(y) e^{η^⊤T(y)−A(η)} dµ(y) = 0,

which implies that

Pη(u0^⊤(T(X) − t0) > 0) = 0.

Thus, we get

Pη(u0^⊤(T(X) − t0) ≤ 0) = 1,

which is equivalent to

u0^⊤(t − t0) ≤ 0 ∀t ∈ T.

Since C is the convex hull of T, it implies

u0^⊤(t − t0) ≤ 0 ∀t ∈ C;

however, this contradicts

∄a ≠ 0 s.t. a^⊤(t − t0) ≤ 0 ∀t ∈ C,

which follows from t0 ∈ C⁰.

Case 2. sup_n |ηn| < ∞.

Then, since ηn → η0,

liminf_{n→∞} ∫_X e^{ηn^⊤(T(y)−t0)} h(y) dµ(y) ≥ ∫_X e^{η0^⊤(T(y)−t0)} h(y) dµ(y) (∗)= ∞

by Fatou's lemma. (∗) holds because E is the natural parameter space, and η0 ∈ ∂E implies η0 ∉ E, since E is open. Thus −lx(ηn) → ∞ as n → ∞.

Figure 1.3: Convex hull of T

Now we are ready to prove consistency.

Theorem 1.1.23. Let X1, · · · , Xn be a random sample from a population with pdf

pη(x) = h(x) exp{η^⊤T(x) − A(η)} I_X(x), η ∈ E,

where E is the natural parameter space in R^k. Further, assume that

(i) E is open.

(ii) The family is of rank k.

Then the following hold:

(a) Pη(η̂n^{MLE} exists) → 1 as n → ∞.

(b) η̂n^{MLE} is consistent.

Proof. (a) Let T̄n = (1/n) Σ_{i=1}^n T(Xi). Then by the WLLN, we get

lim_{n→∞} Pη(|T̄n − EηT(X1)| < ε) = 1 ∀ε > 0.

Also note that EηT(X1) ∈ C⁰, where C⁰ is the interior of the convex hull of the support of T(X1). Then since C⁰ is open, the ball B(EηT(X1), ε) is contained in C⁰ for sufficiently small ε > 0, which implies

lim_{n→∞} Pη(T̄n ∈ C⁰) = 1.

Now consider T̄n instead of T(X1) in the previous theorems, and note that (convex hull of support of T̄n) = (convex hull of support of T(X1)). Then we can find that

(T̄n ∈ C⁰) ⊆ (η̂n^{MLE} exists)

and therefore

lim_{n→∞} Pη(η̂n^{MLE} exists) = 1.

(b) From Ä > 0, we get that Ȧ(η) is one-to-one and continuous for any η. Then we get

(T̄n ∈ C⁰) ⊆ (η̂n^{MLE} exists) = (Ȧ(η̂n^{MLE}) = T̄n)

and hence

lim_{n→∞} Pη( η̂n^{MLE} = (Ȧ)^{-1}(T̄n) ) = 1 ∀η ∈ E. (1.1)

Further, by the inverse function theorem and the C² property of A, we have that (Ȧ)^{-1} is continuous. Thus by the WLLN and the continuous mapping theorem,

(Ȧ)^{-1}(T̄n) →_{Pη} (Ȧ)^{-1}(EηT(X1)) = (Ȧ)^{-1}(Ȧ(η)) = η,

and since (Ȧ)^{-1}(T̄n) ≈ η̂n^{MLE} in the sense of (1.1), we get

lim_{n→∞} Pη(|η̂n^{MLE} − η| < ε) = 1 ∀ε > 0,

i.e., η̂n^{MLE} →_{Pη} η.
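As a concrete sketch of η̂ = (Ȧ)^{-1}(T̄n) (not part of the theorem; the Poisson family and the Newton solver below are illustrative choices), take the Poisson distribution in canonical form, where T(x) = x and A(η) = e^η, so the likelihood equation Ȧ(η) = T̄n has the closed form η̂ = log T̄n:

```python
import numpy as np

# Sketch: solve A'(eta) = T_bar_n for the canonical Poisson family (A(eta) = exp(eta))
# by Newton's method, and compare with the closed form eta_hat = log(T_bar_n).
rng = np.random.default_rng(3)
eta_true = np.log(2.5)                      # true natural parameter (rate lambda = 2.5)

for n in [20, 200, 2000, 20000]:
    x = rng.poisson(np.exp(eta_true), size=n)
    t_bar = x.mean()
    eta = 0.0                               # Newton iterations for exp(eta) = t_bar
    for _ in range(50):
        eta -= (np.exp(eta) - t_bar) / np.exp(eta)
    print(f"n={n:6d}  eta_hat={eta:+.4f}  closed form={np.log(t_bar):+.4f}  true={eta_true:+.4f}")
```

The estimates approach the true η as n grows, illustrating part (b).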

Now let's see some general results. Suppose we have lim_{n→∞} Ψn(θ) = Ψ0(θ) and

θn : solution of Ψn(θ) = 0, θ ∈ C (n = 1, 2, · · · ),
θ0 : solution of Ψ0(θ) = 0, θ ∈ C.

Under what conditions does lim_{n→∞} θn = θ0 hold? We need the following four conditions: uniform convergence of Ψn, continuity of Ψ0, uniqueness of θ0, and compactness of C. Note that these four together are sufficient. Our goal is to obtain a similar result for optimization.

Theorem 1.1.24. Suppose that we have lim_{n→∞} Dn(θ) = D0(θ) and

θn = argmin_{θ∈C} Dn(θ) (n = 1, 2, · · · ),
θ0 = argmin_{θ∈C} D0(θ),

where Dn and D0 are deterministic functions. Also assume that

(i) Dn converges to D0 uniformly.

(ii) D0 is continuous on C.

(iii) The minimizer θ0 is unique.

(iv) C is compact.

Then lim_{n→∞} θn = θ0.

Proof. Assume not, i.e., θn ↛ θ0. Then ∃ε > 0 such that |θn − θ0| > ε i.o. It means that there is a subsequence {n′} ⊆ {n} s.t. |θn′ − θ0| > ε for all n′. Now define

∆n = sup_{θ∈C} |Dn(θ) − D0(θ)|.

Then by uniform convergence of Dn, we get ∆n → 0 as n → ∞. Now note that

inf_{|θ−θ0|>ε} D0(θ) = inf_{|θ−θ0|>ε} { D0(θ) − Dn′(θ) + Dn′(θ) }
 ≤ inf_{|θ−θ0|>ε} { |D0(θ) − Dn′(θ)| + Dn′(θ) }
 ≤ ∆n′ + inf_{|θ−θ0|>ε} Dn′(θ)

holds. Because the minimization of Dn′ is achieved at θn′ ∈ {θ : |θ − θ0| > ε}, we get

∆n′ + inf_{|θ−θ0|>ε} Dn′(θ) ≤ ∆n′ + inf_{|θ−θ0|≤ε} Dn′(θ)
 ≤ ∆n′ + inf_{|θ−θ0|≤ε} { |Dn′(θ) − D0(θ)| + D0(θ) }
 ≤ 2∆n′ + inf_{|θ−θ0|≤ε} D0(θ)
 = 2∆n′ + D0(θ0).

The last equality holds from θ0 = argmin D0(θ) and θ0 ∈ {θ : |θ − θ0| ≤ ε}. Thus

inf_{|θ−θ0|>ε} D0(θ) ≤ 2∆n′ + D0(θ0)

holds, which implies

(1/2)( inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) ) ≤ ∆n′.

Letting n′ → ∞, we get

inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) ≤ 0.

This contradicts the following claim:

Claim. inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) > 0.

Intuitively, since θ0 is the unique minimizer, our claim seems trivial, but we also need the continuity and compactness conditions to guarantee it. (For this, see the next remark.)

Note that, by definition of infimum, there is a sequence {θk} ⊆ {θ : |θ − θ0| > ε} ∩ C such that

lim_{k→∞} D0(θk) = inf_{|θ−θ0|>ε} D0(θ).

Now, by compactness of C, there is a subsequence {k′} ⊆ {k} along which θk′ converges to some θ0* ("Bolzano-Weierstrass"), so with abuse of notation let θk → θ0* as k → ∞. Then note that θ0* must belong to {θ : |θ − θ0| ≥ ε} ∩ C, so θ0* ≠ θ0. Now, continuity of D0 gives

lim_{k→∞} D0(θk) = D0(θ0*),

which implies

inf_{|θ−θ0|>ε} D0(θ) = D0(θ0*).

Therefore, by uniqueness of the minimizer, D0(θ0*) > D0(θ0), and combining with the above result we obtain

inf_{|θ−θ0|>ε} D0(θ) > D0(θ0).

Remark 1.1.25. See the next figure. Each example shows that we need continuity and compactness, respectively.

Figure 1.4: Continuity and Compactness are needed.

Remark 1.1.26. For the deterministic case, one can give an alternative proof. Suppose θn ↛ θ0. Then since C is compact, we can find a subsequence {θ_{n_k}} such that θ_{n_k} → θ0*, θ0* ≠ θ0. (If every convergent subsequence converged to θ0, then the original sequence would converge to θ0.) Now for sufficiently large n_k,

sup_{θ∈C} |D_{n_k}(θ) − D0(θ)| < ε/3

holds, so

D0(θ0) ≥ D_{n_k}(θ0) − ε/3   (∵ uniform convergence)
 ≥ D_{n_k}(θ_{n_k}) − ε/3   (∵ minimizer)
 ≥ D0(θ_{n_k}) − 2ε/3   (∵ uniform convergence)
 ≥ D0(θ0*) − ε   (∵ D0(θ_{n_k}) → D0(θ0*) from continuity of D0),

and hence taking ε ↘ 0 gives D0(θ0) ≥ D0(θ0*), which contradicts the uniqueness of θ0.

In fact, our real goal was to get a similar result for random Dn.

Theorem 1.1.27. Let Dn be a sequence of random functions, and D0 be deterministic. Similarly, define

θ̂n = argmin_{θ∈C} Dn(θ) (n = 1, 2, · · · ),
θ0 = argmin_{θ∈C} D0(θ).

Now suppose that

(i) Dn converges in probability to D0 uniformly, i.e.,

sup_{θ∈C} |Dn(θ) − D0(θ)| →_P 0 as n → ∞.

(ii) D0 is continuous on C.

(iii) The minimizer θ0 is unique.

(iv) C is compact.

Then θ̂n →_P θ0.

Proof. Note that in the proof of Theorem 1.1.24 we did not use convergence in deriving

inf_{|θ−θ0|>ε} D0(θ) ≤ 2∆n′ + D0(θ0).

Rather, we only used |θn′ − θ0| > ε. (Convergence was used only in deriving inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) ≤ 0.) Thus,

|θ̂n − θ0| > ε ⇒ inf_{|θ−θ0|>ε} D0(θ) ≤ 2∆n + D0(θ0) ⇒ ∆n ≥ (1/2)( inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) )

holds. Define

(1/2)( inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) ) =: δ(ε).

Then we get

( |θ̂n − θ0| > ε ) ⊆ ( ∆n ≥ δ(ε) ),

and therefore, by uniform convergence in probability, ∆n →_P 0 and hence

P(|θ̂n − θ0| > ε) ≤ P(∆n ≥ δ(ε)) → 0 as n → ∞.

Example 1.1.28 (Consistency of MLE when Θ is finite). Let X1, · · · , Xn be a random sample from a population with pdf fθ(·), θ ∈ Θ. Assume that the parametrization is identifiable and Θ = {θ1, · · · , θk}. Then

θ̂n^{MLE} →_{Pθ0} θ0,

provided that

(0) (Identifiability) Pθ1 = Pθ2 ⇒ θ1 = θ2;

(1) (Kullback-Leibler divergence) Eθ0 |log(fθ(X1)/fθ0(X1))| < ∞ for all θ ∈ Θ.

Indeed, write

Dn(θ) = −(1/n) Σ_{i=1}^n log(fθ(Xi)/fθ0(Xi)), D0(θ) = −Eθ0 log(fθ(X1)/fθ0(X1)),

so that θ̂n^{MLE} = argmin_{θ∈Θ} Dn(θ) and, by Remark 1.1.29 below, θ0 is the unique minimizer of D0. Since Θ is finite,

Pθ0{ max_{1≤j≤k} |Dn(θj) − D0(θj)| > ε } = Pθ0{ ⋃_{1≤j≤k} ( |Dn(θj) − D0(θj)| > ε ) }
 ≤ Σ_{j=1}^k Pθ0( |Dn(θj) − D0(θj)| > ε ) = o(1) by the WLLN,

so we can derive the result similarly to Theorem 1.1.27. More precisely, it is sufficient to show

inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) > 0

for ε s.t. |θn − θ0| > ε i.o. Uniqueness of θ0 implies this clearly, because Θ is finite here. Note that continuity of D0 is not needed.

Remark 1.1.29 (Kullback-Leibler divergence). Since 1 + log z ≤ z, we get

−Eθ0 log( fθ(X1)/fθ0(X1) ) = −∫ log( fθ(x)/fθ0(x) ) dPθ0
 ≥ 1 − ∫_{S(θ0)} ( fθ(x)/fθ0(x) ) fθ0(x) dµ(x)
 ≥ 0,

and hence D0(θ) ≥ 0. Here S(θ0) = {x : fθ0(x) > 0} and S(θ) = {x : fθ(x) > 0}. Note that 1 + log z = z ⇔ z = 1. Thus equality D0(θ) = 0 holds if and only if

fθ(x)/fθ0(x) = 1 µ-a.e. on S(θ0) and ∫_{S(θ0)} fθ(x) dµ(x) = 1.

Since

1 = ∫_{S(θ)} fθ(x) dµ(x) = ∫_{S(θ0)∪S(θ)} fθ(x) dµ(x) = ∫_{S(θ0)} fθ(x) dµ(x) + ∫_{S(θ)\S(θ0)} fθ(x) dµ(x),

we get

∫_{S(θ0)} fθ(x) dµ(x) = 1 ⇔ ∫_{S(θ)\S(θ0)} fθ(x) dµ(x) = 0.

However, by definition of the support, fθ(x) > 0 on S(θ)\S(θ0), and hence

∫_{S(θ)\S(θ0)} fθ(x) dµ(x) = 0 ⇔ µ(S(θ)\S(θ0)) = 0.

Thus D0(θ) = 0 holds if and only if

fθ(x) = fθ0(x) µ-a.e. on S(θ0) and µ(S(θ)\S(θ0)) = 0.

However, note that

fθ(x) = fθ0(x) µ-a.e. on S(θ0) implies µ(S(θ)\S(θ0)) = 0,

because

1 = ∫_{S(θ)} fθ(x) dµ(x) = ∫_{S(θ0)} fθ(x) dµ(x) + ∫_{S(θ)\S(θ0)} fθ(x) dµ(x)
 = ∫_{S(θ0)} fθ0(x) dµ(x) + ∫_{S(θ)\S(θ0)} fθ(x) dµ(x)
 = 1 + ∫_{S(θ)\S(θ0)} fθ(x) dµ(x).

Therefore we get

D0(θ) = 0 ⇔ fθ(x) = fθ0(x) µ-a.e. on S(θ0).

Now µ(S(θ)\S(θ0)) = 0 implies fθ(x) = fθ0(x) µ-a.e. on S(θ)\S(θ0), and therefore fθ(x) = fθ0(x) µ-a.e. whenever fθ(x) = fθ0(x) µ-a.e. on S(θ0). Therefore we get

D0(θ) = 0 ⇔ fθ(x) = fθ0(x) µ-a.e. ⇔ θ = θ0 (∵ identifiability).

It means that θ0 is the unique minimizer of D0(θ).

Example 1.1.30 (Consistency of MCE). Let X1, · · · , Xn be a random sample from Pθ, θ ∈ Θ ⊆ R^k, and

θ̂n^{MCE} = argmin_{θ∈Θ} (1/n) Σ_{i=1}^n ρ(Xi, θ) =: argmin_{θ∈Θ} ρn(θ).

Fix θ0 ∈ Θ and assume, along with Eθ0|ρ(X1, θ)| < ∞ for all θ, that there is a compact K ⊆ Θ containing θ0 for which the conditions (i)-(iv) of Theorem 1.1.27 hold (with C = K, Dn = ρn and D0(θ) = Eθ0ρ(X1, θ)), and that Pθ0(θ̂n^{MCE} ∈ K) → 1. Then, arguing as in Theorem 1.1.27,

Pθ0[ |θ̂n^{MCE} − θ0| > ε, θ̂n^{MCE} ∈ K ] → 0 as n → ∞.

Thus, we get

Pθ0[ |θ̂n^{MCE} − θ0| > ε ] ≤ Pθ0[ |θ̂n^{MCE} − θ0| > ε, θ̂n^{MCE} ∈ K ] + Pθ0[ θ̂n^{MCE} ∉ K ] → 0 as n → ∞.

Remark 1.1.31. Indeed, we have not yet established consistency of the MCE; we only verified it for a fixed θ0 ∈ Θ. For the consistency of the MCE, we need that for any θ0 ∈ Θ there exists K ⊆ Θ containing θ0 such that the conditions (i)-(iv) are fulfilled. Suppose that

(a) for all compact K ⊆ Θ and for all θ0 ∈ Θ,

sup_{θ∈K} |ρn(θ) − Eθ0ρ(X1, θ)| →_{Pθ0} 0 as n → ∞;

(b) for any θ0 ∈ Θ there exists a compact subset K of Θ containing θ0 such that

Pθ0( inf_{θ∈K^c} (ρn(θ) − ρn(θ0)) > 0 ) → 1 as n → ∞;

(c) θ ↦ Eθ0ρ(X1, θ) is continuous on Θ.

Then for any θ0 ∈ Θ there exists a compact subset K of Θ containing θ0 such that (ii)-(iv) hold. Note that (b) implies (iii) together with (i) and (c).

Also note that the MLE is a special case of the MCE, with ρ(x, θ) = − log f(x, θ).

Remark 1.1.32. In many cases, it is difficult to verify the uniform convergence condition. For this, the following convexity lemma is useful: if K is convex,

ρn(θ) →_{Pθ0} Eθ0ρ(X1, θ) for all θ ∈ K ("pointwise convergence"),

and ρn is a convex function on K with probability 1 under Pθ0, then we get the "uniform convergence"

sup_{θ∈K} |ρn(θ) − Eθ0ρ(X1, θ)| →_{Pθ0} 0.

See D. Pollard (1991), Econometric Theory, 7, 186-199.

Remark 1.1.33. The condition (b) in Remark 1.1.31 is satisfied if the empirical contrast

ρn(θ) = (1/n) Σ_{i=1}^n ρ(Xi, θ)

is convex on a convex open parameter space Θ ⊆ R^k and approaches +∞ on the boundary with probability tending to 1.


    1.2 The Delta Method

The basic intuition of the Delta Method is Taylor expansion.

Theorem 1.2.1. Suppose √n(Xn − a) →_d X as n → ∞. Then

√n(g(Xn) − g(a)) = ġ(a)√n(Xn − a) + oP(1)

and hence

√n(g(Xn) − g(a)) →_d ġ(a)X,

provided g is differentiable at a.

Proof. By Taylor's theorem, ∃R(x, a) s.t.

g(x) = g(a) + (ġ(a) + R(x, a))(x − a),

where R(x, a) → 0 as x → a. Note that if Xn →_P a and R(x, a) → 0 as x → a, then R(Xn, a) →_P 0 (∵ ∀ε > 0 ∃δ > 0 s.t. |x − a| < δ ⇒ |R(x, a)| < ε, which implies

P(|R(Xn, a)| > ε) ≤ P(|Xn − a| ≥ δ) → 0 as n → ∞,

and then R(Xn, a) = oP(1)). Thus

g(Xn) = g(a) + (ġ(a) + R(Xn, a))(Xn − a)

and hence

√n(g(Xn) − g(a)) = ġ(a)√n(Xn − a) + R(Xn, a)·√n(Xn − a) = ġ(a)√n(Xn − a) + oP(1),

since R(Xn, a) = oP(1) and √n(Xn − a) = OP(1). In the multivariate case, the statement becomes ġ(a)^⊤(Xn − a).

    Remark 1.2.2. When g is a function of several variables, the differentiability means the total

    differentiability, which is implied by the existence of “continuous partial derivatives.”
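As a quick numerical check of the theorem (a sketch only; Exp(1) observations and g(x) = x² are arbitrary choices), the scaled difference √n(g(X̄n) − g(µ)) has approximately the variance ġ(µ)²σ² predicted by the delta method:

```python
import numpy as np

# Sketch: X_i ~ Exp(1) (mu = 1, sigma^2 = 1), g(x) = x^2, g'(mu) = 2,
# so sqrt(n)(g(Xbar_n) - g(mu)) should be approximately N(0, 4).
rng = np.random.default_rng(5)
n, reps = 2000, 20000
xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
t = np.sqrt(n) * (xbar**2 - 1.0)
print("sample variance of sqrt(n)(g(Xbar)-g(mu)):", t.var(), " (delta method predicts 4)")
```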


Example 1.2.3. Let (X1, Y1), · · · , (Xn, Yn) be i.i.d. from a population with parameters (µ1, µ2, σ1², σ2², ρ). Let

ρ̂n = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / ( √(Σ_{i=1}^n (Xi − X̄)²) √(Σ_{i=1}^n (Yi − Ȳ)²) ).

(i) As far as the distribution of ρ̂n is concerned, we may assume µ1 = µ2 = 0, σ1² = σ2² = 1. Let

Wi = (Xi, Yi, Xi², Yi², XiYi)^⊤, which are i.i.d. with mean (0, 0, 1, 1, ρ)^⊤,

and Zn = √n(W̄n − (0, 0, 1, 1, ρ)^⊤), i.e.,

Zn1 = √n X̄, Zn2 = √n Ȳ, Zn3 = √n((1/n)ΣXi² − 1), Zn4 = √n((1/n)ΣYi² − 1), Zn5 = √n((1/n)ΣXiYi − ρ).

Note that Zn = OP(1). Then

ρ̂n = [ (1/√n)Zn5 + ρ − ((1/√n)Zn1)((1/√n)Zn2) ]
     / [ √(1 + (1/√n)Zn3 − ((1/√n)Zn1)²) · √(1 + (1/√n)Zn4 − ((1/√n)Zn2)²) ]
 = ( (1/√n)Zn5 + ρ − ((1/√n)Zn1)((1/√n)Zn2) )
   · (1 + (1/√n)Zn3 − ((1/√n)Zn1)²)^{-1/2} · (1 + (1/√n)Zn4 − ((1/√n)Zn2)²)^{-1/2}
 = ( (1/√n)Zn5 + ρ + oP(1/√n) )
   · ( 1 − (1/2)((1/√n)Zn3 − ((1/√n)Zn1)²) + oP(1/√n) ) · ( 1 − (1/2)((1/√n)Zn4 − ((1/√n)Zn2)²) + oP(1/√n) )
 = ( (1/√n)Zn5 + ρ + oP(1/√n) ) · ( 1 − (1/(2√n))Zn3 + oP(1/√n) ) · ( 1 − (1/(2√n))Zn4 + oP(1/√n) )
 = ( (1/√n)Zn5 + ρ + oP(1/√n) ) · ( 1 − (1/(2√n))Zn3 − (1/(2√n))Zn4 + oP(1/√n) )
 = ρ + (1/√n)Zn5 − (ρ/2)( (1/√n)Zn3 + (1/√n)Zn4 ) + oP(1/√n)

holds, so we get

√n(ρ̂n − ρ) = Zn5 − (ρ/2)Zn3 − (ρ/2)Zn4 + oP(1)
 = √n( ((1/n)ΣXiYi − ρ) − (ρ/2)((1/n)ΣXi² − 1) − (ρ/2)((1/n)ΣYi² − 1) ) + oP(1)
 = (1/√n) Σ_{i=1}^n ( XiYi − (ρ/2)Xi² − (ρ/2)Yi² ) + oP(1)
 →_d N( 0, Var(X1Y1 − (ρ/2)X1² − (ρ/2)Y1²) ).

(ii) Now additionally suppose that the (Xi, Yi)'s are from a bivariate normal distribution. Then Y1 − ρX1 is independent of X1. Letting Z1 = Y1 − ρX1, we get Var(Z1) = 1 − ρ², Var(Z1²) = 2(1 − ρ²)², and hence

Var( X1Y1 − (ρ/2)X1² − (ρ/2)Y1² )
 = Var( (1 − ρ²)X1Z1 − (ρ/2)Z1² + (ρ/2)(1 − ρ²)X1² )
 = Var( (ρ/2)(1 − ρ²)X1² − (ρ/2)Z1² ) + Var( (1 − ρ²)X1Z1 ) + 2Cov( (ρ/2)(1 − ρ²)X1² − (ρ/2)Z1², (1 − ρ²)X1Z1 )   [the covariance is 0]
 = (ρ²/4)( (1 − ρ²)²Var(X1²) − 2(1 − ρ²)Cov(X1², Z1²) + Var(Z1²) ) + (1 − ρ²)²Var(X1Z1)
 = (ρ²/4)( 2(1 − ρ²)² + 2(1 − ρ²)² ) + (1 − ρ²)²(1 − ρ²)
 = ρ²(1 − ρ²)² + (1 − ρ²)²(1 − ρ²)
 = (1 − ρ²)²

holds. It implies that

√n(ρ̂n − ρ) →_d N(0, (1 − ρ²)²).

Therefore, if we define h(ρ) by h′(ρ) = (1 − ρ²)^{-1}, i.e.,

h(ρ) = (1/2) log( (1 + ρ)/(1 − ρ) ) ("Fisher's z-transform"),

then we get

√n( h(ρ̂n) − h(ρ) ) →_d N(0, 1),

and with this, we can find a confidence region "with stabilized variance."
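A minimal sketch of such a variance-stabilized interval (not from the lecture; the particular sample value 0.6 and sample size below are arbitrary, and the √n scaling follows the asymptotic result above):

```python
import numpy as np

# Sketch: 95% CI for rho via Fisher's z-transform h(r) = 0.5*log((1+r)/(1-r)) = arctanh(r):
# build the interval on the z-scale (where the variance is 1/n) and map back with tanh.
def fisher_ci(rho_hat, n, z=1.96):
    zhat = np.arctanh(rho_hat)                       # h(rho_hat)
    lo, hi = zhat - z / np.sqrt(n), zhat + z / np.sqrt(n)
    return np.tanh(lo), np.tanh(hi)

print(fisher_ci(0.6, n=100))   # roughly (0.46, 0.71)
```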

    We can also expand with higher order terms.

Theorem 1.2.4 (Higher order stochastic expansion). Let X1, · · · , Xn be a random sample with EX1 = µ and finite Var(X1) = Σ.

(a) (1-dim case) For g such that g̈ exists,

g(X̄n) = g(µ) + (σ/√n)ġ(µ)Zn + (σ²/(2n))g̈(µ)Zn² + oP(1/n),

where Zn = √n(X̄n − µ)/σ.

(b) (general case) For g such that g̈ exists,

g(X̄n) = g(µ) + ġ(µ)^⊤(X̄n − µ) + (1/2)(X̄n − µ)^⊤g̈(µ)(X̄n − µ) + oP(1/n).

Proof. Again, use Taylor's theorem. We only prove (a). Note that

g(x) = g(a) + ġ(a)(x − a) + (1/2)(g̈(a) + R(x, a))(x − a)²

with R(x, a) → 0 as x → a, so letting Zn = √n(X̄n − µ)/σ, we get

g(X̄n) = g(µ) + ġ(µ)(X̄n − µ) + (1/2)g̈(µ)(X̄n − µ)² + (1/2)R(X̄n, µ)(X̄n − µ)²
 = g(µ) + (σ/√n)ġ(µ)Zn + (σ²/(2n))g̈(µ)Zn² + oP(1/n),

since R(X̄n, µ) = oP(1) and (X̄n − µ)² = OP(1/n), which implies the conclusion.

Remark 1.2.5. For the general case, the following notation is also frequently used. For (Zn^i)_{i=1}^d = √n(X̄n − µ),

g(X̄n) = g(µ) + (1/√n) g_{/i}(µ) Zn^i + (1/(2n)) g_{/ij}(µ) Zn^i Zn^j + oP(1/n).

Here we omit the "Σ" (summation convention), i.e.,

g_{/i}(µ)Zn^i := Σ_{i=1}^d g_{/i}(µ)Zn^i,  g_{/ij}(µ)Zn^iZn^j := Σ_{i=1}^d Σ_{j=1}^d g_{/ij}(µ)Zn^iZn^j.

Example 1.2.6 (Estimation of Reliability). Let X1, · · · , Xn be a random sample from Exp(λ), where λ > 0 is the rate. Consider the estimation problem of the "reliability"

η = P(X1 > a) = e^{−aλ}.

(i) Note that

η̂n^{MLE} = e^{−a/X̄}.

Let Zn = √n(λX̄ − 1). Then Zn →_d N(0, 1) and

X̄^{-1} = λ(Zn/√n + 1)^{-1}.

Thus, we get

η̂n^{MLE} = exp( −aλ(Zn/√n + 1)^{-1} )
 = exp( −aλ( 1 − Zn/√n + Zn²/n + oP(1/n) ) )
 = e^{−aλ}( 1 + ( aλZn/√n − aλZn²/n ) + (1/2)( aλZn/√n − aλZn²/n )² + oP(1/n) )
 = e^{−aλ}( 1 + aλZn/√n + (−aλ + (aλ)²/2)Zn²/n + oP(1/n) )
 = η + (aλe^{−aλ}/√n)Zn + ((−aλ + (aλ)²/2)e^{−aλ}/n)Zn² + oP(1/n)   (1.2)

from (1 + x)^{-1} = 1 − x + x² + o(x²) and e^x = 1 + x + x²/2 + o(x²) as x → 0.

(ii) Now consider

η̂^{UMVUE} = ( 1 − a/(nX̄) )^{n−1} I( a/(nX̄) < 1 ).

Note that from P(a/(nX̄) < 1) → 1 as n → ∞, we may work with η̂^{UMVUE} = (1 − a/(nX̄))^{n−1} (see the next remark). Then

log η̂^{UMVUE} = (n − 1) log( 1 − a/(nX̄) )
 = (n − 1) log( 1 − (aλ/n)(Zn/√n + 1)^{-1} )
 = (n − 1) log( 1 − (aλ/n)( 1 − Zn/√n + Zn²/n + oP(1/n) ) )
 = (n − 1) log( 1 − aλ/n + (aλ/(n√n))Zn − (aλ/n²)Zn² + oP(1/n²) )
 = (n − 1){ ( −aλ/n + (aλ/(n√n))Zn − (aλ/n²)Zn² ) − (1/2)( −aλ/n + (aλ/(n√n))Zn − (aλ/n²)Zn² )² } + oP(1/n)
 = −aλ + (aλ/√n)Zn − (aλ/n)Zn² − (aλ)²/(2n) + aλ/n + oP(1/n)
 = −aλ + (aλ/√n)Zn + ( −aλZn² + aλ − (aλ)²/2 )/n + oP(1/n)

implies

η̂^{UMVUE} = exp( −aλ + (aλ/√n)Zn + ( −aλZn² + aλ − (aλ)²/2 )/n + oP(1/n) )
 = e^{−aλ}( 1 + ( (aλ/√n)Zn + (−aλZn² + aλ − (aλ)²/2)/n ) + (1/2)( (aλ/√n)Zn + (−aλZn² + aλ − (aλ)²/2)/n )² ) + oP(1/n)
 = e^{−aλ}( 1 + (aλ/√n)Zn + ( −aλZn² + aλ − (aλ)²/2 + (1/2)(aλ)²Zn² )(1/n) ) + oP(1/n)
 = η + (aλe^{−aλ}/√n)Zn + ((−aλ + (aλ)²/2)e^{−aλ}/n)(Zn² − 1) + oP(1/n)   (1.3)

from log(1 + x) = x − x²/2 + o(x²) as x → 0. Comparing (1.3) to (1.2), we can say that the UMVUE is "closer" than the MLE to η, since the MLE's leading term has bias

(−aλ + (aλ)²/2)e^{−aλ}/n,

while the UMVUE's leading term has no bias. As in this case, when one suggests a new estimator, one often compares the second-order term to judge its asymptotic behavior.
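A simulation sketch of this bias comparison (illustrative; the values of λ, a and n below are arbitrary) makes the O(1/n) bias of the MLE and the unbiasedness of the UMVUE visible:

```python
import numpy as np

# Sketch: compare the bias of the MLE exp(-a/Xbar) with the UMVUE (1 - a/(n*Xbar))^{n-1}
# for eta = P(X1 > a) = exp(-a*lambda); theory predicts MLE bias ~ (-a*lam + (a*lam)^2/2)*e^{-a*lam}/n.
rng = np.random.default_rng(6)
lam, a, n, reps = 1.0, 1.0, 30, 200000
eta = np.exp(-a * lam)

xbar = rng.exponential(1.0 / lam, size=(reps, n)).mean(axis=1)
mle = np.exp(-a / xbar)
umvue = np.where(a / (n * xbar) < 1.0, (1.0 - a / (n * xbar)) ** (n - 1), 0.0)

theory = (-a * lam + (a * lam) ** 2 / 2) * np.exp(-a * lam) / n
print("MLE   bias:", mle.mean() - eta, "  theory:", theory)
print("UMVUE bias:", umvue.mean() - eta)
```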

Remark 1.2.7. If there is an event whose probability converges to 1, then in an asymptotic sense we may ignore its complement, in the following sense:

(i) If P(En) → 1 and P(Xn ≤ x, En) → F(x), then Xn →_d F.

(ii) If P(En) → 1 and Xn = X + OP(n^{−α}) on En, then Xn = X + OP(n^{−α}) in general. The convergence rate of P(En) does not matter!

(∵ P(n^α|Xn − X| ≥ C) ≤ P(n^α|Xn − X| ≥ C, En) + P(En^c), where P(En^c) → 0; take limsup_n and then lim_{C→∞} on both sides.)

Example 1.2.8. Consider the sample correlation coefficient again. Assume EX1⁴ < ∞ and EY1⁴ < ∞. With the notation of Example 1.2.3, using (1 + x)^{-1/2} = 1 − x/2 + 3x²/8 + o(x²), we get

ρ̂n = ( ρ + (1/√n)Zn5 − (1/n)Zn1Zn2 + oP(1/n) )
   · ( 1 − (1/2)((1/√n)Zn3 − Zn1²/n) + (3/8)((1/√n)Zn3 − Zn1²/n)² + oP(1/n) )
   · ( 1 − (1/2)((1/√n)Zn4 − Zn2²/n) + (3/8)((1/√n)Zn4 − Zn2²/n)² + oP(1/n) )
 = ( ρ + (1/√n)Zn5 − (1/n)Zn1Zn2 + oP(1/n) )
   · ( 1 − (1/(2√n))Zn3 + (1/n)((1/2)Zn1² + (3/8)Zn3²) + oP(1/n) )
   · ( 1 − (1/(2√n))Zn4 + (1/n)((1/2)Zn2² + (3/8)Zn4²) + oP(1/n) )
 = ρ + (1/√n)( Zn5 − (ρ/2)Zn3 − (ρ/2)Zn4 )
   + (1/n)( −Zn1Zn2 − (1/2)Zn3Zn5 − (1/2)Zn4Zn5 + ρ( (1/4)Zn3Zn4 + (1/2)Zn1² + (1/2)Zn2² + (3/8)Zn3² + (3/8)Zn4² ) ) + oP(1/n)

holds. The leading term has bias

(1/n){ (ρ/4)EX1²Y1² + (3ρ/8)(EX1⁴ + EY1⁴) − (1/2)(EX1³Y1 + EX1Y1³) }

from

E(Zn3Zn4) = EX1²Y1² − 1,
E(Zn3² + Zn4²) = EX1⁴ + EY1⁴ − 2,
E(Zn1² + Zn2²) = 2,
E(Zn3Zn5 + Zn4Zn5) = EX1³Y1 + EX1Y1³ − 2ρ,
E(Zn1Zn2) = ρ.

For the bivariate normal case, it becomes

−ρ(1 − ρ²)/(2n).

Remark 1.2.9. Note that when using a stochastic expansion we can get the mean, variance, skewness, ... of the leading term and might expect that they approximate the moments of g(X̄n), but this is not true in general! For example, if Xn ∼ Ber(1/n), then

P(nXn > ε) = P(Xn = 1) = 1/n → 0

since Xn = 0 or 1, so nXn = oP(1), but we get E(nXn) = 1 for all n. Thus, we need to check the behavior of the remainder.

As we can see in this example, we cannot say that E(oP(1)) = o(1), i.e., "convergence in probability does not imply convergence in L1." (Similarly, "bounded in probability does not imply uniform integrability.") By Vitali's theorem, it is known that if {Xn} is uniformly integrable, then

Xn →_P X implies EXn → EX as n → ∞.

From now on, we compare the "moments of leading terms" and the "approximation of moments."
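The Bernoulli counterexample in this remark is easy to reproduce numerically (a quick sketch; the Monte Carlo sample size is arbitrary):

```python
import numpy as np

# Sketch of the counterexample: X_n ~ Ber(1/n), so n*X_n = o_P(1)
# (it is 0 with probability 1 - 1/n), yet E(n*X_n) = 1 for every n.
rng = np.random.default_rng(7)
for n in [10, 100, 1000, 10000]:
    xn = rng.binomial(1, 1.0 / n, size=100000)     # Monte Carlo copies of X_n
    print(f"n={n:6d}  P(n*X_n > 0) approx {np.mean(xn > 0):.4f}  E(n*X_n) approx {np.mean(n * xn):.3f}")
```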

Example 1.2.10. Recall the mgf

mgf_X(t) = E e^{tX}, |t| < ε,

and the cgf

cgf_X(t) = log mgf_X(t) = log E e^{tX}, |t| < ε.

For m_r = EX^r, the mgf has the Taylor expansion

mgf_X(t) = 1 + m1 t + (m2/2!) t² + (m3/3!) t³ + · · · ,

and if X and Y are independent,

mgf_{X+Y}(t) = mgf_X(t) · mgf_Y(t)

and

cgf_{aX+b}(t) = cgf_X(at) + bt

hold for constants a and b. From this we get

c_r(aX + b) = a^r c_r(X) for r ≥ 2,

where c_r denotes the r-th cumulant. Also recall that, for A ≈ 0,

log(1 + A) = A − A²/2 + A³/3 − · · · ,

and with this, we can obtain

cgf_X(t) = log( 1 + (mgf_X(t) − 1) ) = c1 t + (c2/2!) t² + (c3/3!) t³ + · · · ,

where

c1 = m1, c2 = m2 − m1², c3 = m3 − 3m1m2 + 2m1³, c4 = m4 − 4m3m1 − 3m2² + 12m2m1² − 6m1⁴, · · · .

If observations are normalized, i.e., m1 = 0 and m2 = 1, then

c1 = 0, c2 = 1, c3 = m3, c4 = m4 − 3.
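A tiny sketch of these moment-to-cumulant relations (the Exp(1) check is an arbitrary choice, used because its cumulants are c_r = (r − 1)!):

```python
# Sketch: first four cumulants from raw moments m_r = E X^r, using the relations above.
def cumulants(m1, m2, m3, m4):
    c1 = m1
    c2 = m2 - m1**2
    c3 = m3 - 3*m1*m2 + 2*m1**3
    c4 = m4 - 4*m3*m1 - 3*m2**2 + 12*m2*m1**2 - 6*m1**4
    return c1, c2, c3, c4

# Check against Exp(1): m_r = r!, so the cumulants should be (1, 1, 2, 6).
print(cumulants(1.0, 2.0, 6.0, 24.0))
```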

Example 1.2.11. Let

Zn = √n(X̄n − µ)/σ = (1/√n) Σ_{i=1}^n (Xi − µ)/σ, where Z̃i := (Xi − µ)/σ =_d Z1.

Then from

cgf_{Zn}(t) = cgf_{n^{-1/2}ΣZ̃i}(t) = n · cgf_{Z1}(t/√n),

we obtain

c_r(Zn) = n · (1/√n)^r c_r(Z1) = n^{−r/2+1} c_r(Z1).

From this, we obtain

EZn³ = c3(Zn) = (1/√n) c3(Z1), (1.4)
EZn⁴ = c4(Zn) + 3 = (1/n) c4(Z1) + 3. (1.5)

Example 1.2.12. Now consider the multivariate case. Let X = (X^1, · · · , X^d)^⊤ and t = (t1, · · · , td)^⊤. Then

mgf_X(t) = E e^{t^⊤X} = E e^{t1X^1+···+tdX^d}

and

m1 = [ (∂/∂ti) mgf_X(t) |_{t=0} ]_i,
m2 = [ (∂²/∂ti∂tj) mgf_X(t) |_{t=0} ]_{i,j},
m3 = [ (∂³/∂ti∂tj∂tk) mgf_X(t) |_{t=0} ]_{i,j,k},
...

and we get

mgf_X(t) = 1 + Σ_i m1(i) ti + (1/2!) Σ_{i,j} m2(i,j) titj + (1/3!) Σ_{i,j,k} m3(i,j,k) titjtk + · · · .

If the data are centered, i.e., EX = 0, then m1 = 0 and so

cgf_X(t) = (1/2!) Σ_{i,j} m2(i,j) titj + (1/3!) Σ_{i,j,k} m3(i,j,k) titjtk + (1/4!) Σ_{i,j,k,l} ( m4(i,j,k,l) − 3m2(i,j)m2(k,l) ) titjtktl + · · · .

Now let Zn = (Zn^1, · · · , Zn^d)^⊤ = √n(X̄n − µ). Then

cgf_{Zn}(t) = n · cgf_{Z1}(t/√n)

implies

E Zn^i Zn^j =: σ^{ij}, (σ^{ij})_{i,j} = Var(X1),
E Zn^i Zn^j Zn^k = (1/√n) c3(i,j,k),
E Zn^i Zn^j Zn^k Zn^l = (1/n) c4(i,j,k,l) + 3σ^{ij}σ^{kl},

where

c3(i,j,k) = c3(Z1)(i,j,k) = E Z1^i Z1^j Z1^k,
c4(i,j,k,l) = c4(Z1)(i,j,k,l) = E Z1^i Z1^j Z1^k Z1^l − 3(E Z1^i Z1^j)(E Z1^k Z1^l).

Proposition 1.2.13 (Moments of the leading terms). Let X1, · · · , Xn be i.i.d. with E|X1|⁴ < ∞.¹

(a) (Univariate case) For g such that g̈ exists and Zn = √n(X̄n − µ)/σ,

g(X̄n) = Wn + oP(n^{-1}), where Wn := g(µ) + (σ/√n)ġ(µ)Zn + (σ²/(2n))g̈(µ)Zn²,

and for the leading term Wn,

E(Wn) = g(µ) + (σ²/(2n))g̈(µ),
Var(Wn) = (σ²/n)(ġ(µ))² + (1/n²)( σ³c3(Z1)ġ(µ)g̈(µ) + (1/2)σ⁴(g̈(µ))² ) + O(n^{-3}),
E(Wn − EWn)³ = (1/n²)( σ³c3(Z1)(ġ(µ))³ + 3σ⁴(ġ(µ))²g̈(µ) ) + O(n^{-3}).

(b) (Multivariate case) For g such that g̈ exists and (Zn^i) = √n(X̄n − µ),

g(X̄n) = Wn + oP(n^{-1}), where Wn := g(µ) + (1/√n)g_{/i}(µ)Zn^i + (1/(2n))g_{/ij}(µ)Zn^iZn^j,

and for the leading term Wn,

E(Wn) = g(µ) + (1/(2n))g_{/ij}(µ)σ^{ij},
Var(Wn) = (1/n)g_{/i}(µ)g_{/j}(µ)σ^{ij} + O(n^{-2}),

where (σ^{ij}) = Var(X1).

¹ In fact, a stronger condition is needed: the mgf of X1 exists.

Proof. (a) Nothing but tedious calculation. First,

E(Wn) = g(µ) + (σ²/(2n))g̈(µ)

is easily obtained. Next, note that

Var(Wn) = Var( (σ/√n)ġ(µ)Zn + (σ²/(2n))g̈(µ)Zn² )
 = (σ²/n)(ġ(µ))² Var(Zn) + (σ³/(n√n))ġ(µ)g̈(µ) Cov(Zn, Zn²) + (σ⁴/(4n²))(g̈(µ))² Var(Zn²).

First, Var(Zn) = 1. Also, Var(Zn²) = E(Zn⁴) − [E(Zn²)]² = 2 + n^{-1}c4(Z1) from (1.5). Finally, Cov(Zn, Zn²) = EZn³ − EZn·EZn² = n^{-1/2}c3(Z1) from (1.4). Now we get

Var(Wn) = (σ²/n)(ġ(µ))² + (σ³/n²)ġ(µ)g̈(µ)c3(Z1) + (σ⁴/(2n²))(g̈(µ))² + (σ⁴/(4n³))(g̈(µ))²c4(Z1)
 = (σ²/n)(ġ(µ))² + (1/n²)( σ³c3(Z1)ġ(µ)g̈(µ) + (1/2)σ⁴(g̈(µ))² ) + O(n^{-3}).

For E(Wn − EWn)³, note that EZn⁵ = O(n^{-1/2}) and EZn⁶ = O(1). (Check!) Then

E(Wn − EWn)³ = E[ (σ/√n)ġ(µ)Zn + (σ²/(2n))g̈(µ)(Zn² − 1) ]³
 = (σ³/(n√n))ġ(µ)³ EZn³   [= n^{-1/2}c3(Z1)]
   + (3σ⁴/(2n²))ġ(µ)²g̈(µ) E[Zn²(Zn² − 1)]   [= n^{-1}c4(Z1) + 2]
   + (3σ⁵/(4n²√n))ġ(µ)g̈(µ)² E[Zn(Zn² − 1)²] + (σ⁶/(8n³))g̈(µ)³ E[(Zn² − 1)³]   [both O(n^{-3})]
 = (1/n²)( σ³c3(Z1)(ġ(µ))³ + 3σ⁴(ġ(µ))²g̈(µ) ) + O(n^{-3})

by (1.4) and (1.5).

(b) Note that

Wn = g(µ) + (1/√n)g_{/i}(µ)Zn^i + (1/(2n))g_{/ij}(µ)Zn^iZn^j = g(µ) + (1/√n)Σ_{i=1}^d g_{/i}(µ)Zn^i + (1/(2n))Σ_{i,j=1}^d g_{/ij}(µ)Zn^iZn^j,

so

Var(Wn) = (1/n)Σ_{i,j} g_{/i}(µ)g_{/j}(µ)Cov(Zn^i, Zn^j) + (1/(n√n))Σ_i Σ_{k,l} g_{/i}(µ)g_{/kl}(µ)Cov(Zn^i, Zn^kZn^l)
   + (1/(4n²))Σ_{i,j}Σ_{k,l} g_{/ij}(µ)g_{/kl}(µ)Cov(Zn^iZn^j, Zn^kZn^l)
 = (1/n)g_{/i}(µ)g_{/j}(µ)σ^{ij} + (1/(n√n))g_{/i}(µ)g_{/kl}(µ)·E Zn^iZn^kZn^l + (1/(4n²))g_{/ij}(µ)g_{/kl}(µ)( E Zn^iZn^jZn^kZn^l − (E Zn^iZn^j)(E Zn^kZn^l) )
 = (1/n)g_{/i}(µ)g_{/j}(µ)σ^{ij} + (1/n²)g_{/i}(µ)g_{/kl}(µ)c3(i,k,l) + (1/(4n²))g_{/ij}(µ)g_{/kl}(µ)( (1/n)c4(i,j,k,l) + 2σ^{ij}σ^{kl} )
 = (1/n)g_{/i}(µ)g_{/j}(µ)σ^{ij} + O(n^{-2}).

Proposition 1.2.14 (Approximation of moments). Let X1, · · · , Xn be i.i.d. with E|X1|⁴ < ∞.

(a) (Univariate case) For g with bounded and continuous g^{(j)} (j = 0, 1, · · · , 4),

E(g(X̄n)) = g(µ) + (σ²/(2n))g^{(2)}(µ) + O(n^{-2}),
Var(g(X̄n)) = (σ²/n)g^{(1)}(µ)² + O(n^{-2}),

where µ = EX1, σ² = Var(X1).

(b) (Multivariate case) For g with bounded and continuous g_{/I} (|I| = 0, 1, · · · , 4),

E(g(X̄n)) = g(µ) + (1/(2n))g_{/ij}(µ)σ^{ij} + O(n^{-2}),
Var(g(X̄n)) = (1/n)g_{/i}(µ)g_{/j}(µ)σ^{ij} + O(n^{-2}),

where µ = EX1, (σ^{ij}) = Var(X1).

Proof. (a) Note that

g(X̄n) = g(µ) + (σ/√n)g^{(1)}(µ)Zn + (σ²/(2n))g^{(2)}(µ)Zn² + (1/3!)(σ³/(n√n))g^{(3)}(µ)Zn³ + Rn,

where

Rn = (1/4!)(σ⁴/n²)g^{(4)}(ξn)Zn⁴, ξn : a number between µ and X̄n.

Note that

E|Rn| ≤ (1/4!)(σ⁴/n²) sup_x |g^{(4)}(x)| · EZn⁴ = O(n^{-2})

from the boundedness of g^{(4)} and EZn⁴ = 3 + n^{-1}c4(Z1). Also note that

(σ³/(n√n)) EZn³ = (σ³/(n√n)) (1/√n)c3(Z1) = O(n^{-2}).

From these, we obtain

Eg(X̄n) = g(µ) + (σ²/(2n))g^{(2)}(µ) + O(n^{-2}).

Next, for the variance, we get

Var(g(X̄n)) = (σ²/n)g^{(1)}(µ)² Var(Zn) + (σ⁴/(4n²))g^{(2)}(µ)² Var(Zn²)
 + O(n^{-3})Var(Zn³) + Var(Rn)
 + O(n^{-3/2})Cov(Zn, Zn²) + O(n^{-2})Cov(Zn, Zn³) + O(n^{-5/2})Cov(Zn², Zn³)
 + O(n^{-1/2})Cov(Zn, Rn) + O(n^{-1})Cov(Zn², Rn) + O(n^{-3/2})Cov(Zn³, Rn).

Note that

Var(Zn) = 1, Var(Zn²) = 2 + (1/n)c4(Z1) = O(1), Var(Zn³) = EZn⁶ − (EZn³)² ≤ O(1),
Var(Rn) ≤ ERn² = O(n^{-4})·EZn⁸ ≤ O(n^{-4}),
|Cov(Zn, Rn)| ≤ √Var(Zn) √Var(Rn) ≤ O(n^{-2}),
|Cov(Zn², Rn)| ≤ √Var(Zn²) √Var(Rn) ≤ O(n^{-2}),
|Cov(Zn³, Rn)| ≤ √Var(Zn³) √Var(Rn) ≤ O(n^{-2}),
Cov(Zn, Zn²) = EZn³ − (EZn)(EZn²) = O(n^{-1/2}),
|Cov(Zn, Zn³)| ≤ √Var(Zn) √Var(Zn³) ≤ O(1),
|Cov(Zn², Zn³)| ≤ √Var(Zn²) √Var(Zn³) ≤ O(1).

From these, we get

Var(g(X̄n)) = (σ²/n)g^{(1)}(µ)²·1 + (σ⁴/(4n²))g^{(2)}(µ)²·O(1) + O(n^{-3})·O(1) + O(n^{-4})
 + O(n^{-3/2})·O(n^{-1/2}) + O(n^{-2})·O(1) + O(n^{-5/2})·O(1)
 + O(n^{-1/2})·O(n^{-2}) + O(n^{-1})·O(n^{-2}) + O(n^{-3/2})·O(n^{-2})
 = (σ²/n)g^{(1)}(µ)² + O(n^{-2}).

(b) Recall Taylor's theorem for a multivariate function: for a (k+1)-times continuously differentiable function f : R^n → R,

f(x) = Σ_{|α|≤k} ( D^α f(a)/α! )(x − a)^α + Σ_{|β|=k+1} R_β(x)(x − a)^β,

where for α = (α1, · · · , αn) ∈ N^n we define |α| = α1 + · · · + αn, α! = α1!···αn!, x^α = x1^{α1}···xn^{αn}, and

R_β(x) = (|β|/β!) ∫_0^1 (1 − t)^{|β|−1} D^β f(a + t(x − a)) dt.

Here,

D^α f = ∂^{|α|} f / (∂x1^{α1} ··· ∂xn^{αn}).

Using this, we obtain

g(X̄n) = g(µ) + (1/√n)g_{/i}(µ)Zn^i + (1/2!)(1/n)g_{/ij}(µ)Zn^iZn^j + (1/3!)(1/(n√n))g_{/ijk}(µ)Zn^iZn^jZn^k + Rn,

where

|Rn| ≤ (1/(6n²)) | ∫_0^1 (1 − u)³ g_{/ijkl}( µ + (1/√n)uZn ) Zn^iZn^jZn^kZn^l du |.

By boundedness of the g_{/I} and

E|Zn^iZn^jZn^kZn^l| ≤ (1/4)E( (Zn^i)⁴ + (Zn^j)⁴ + (Zn^k)⁴ + (Zn^l)⁴ ) = O(1) ("AM-GM"),

we obtain the desired result by a procedure similar to the univariate case.

Example 1.2.15 (Estimation of reliability). Let X1, · · · , Xn be a random sample from Exp(λ), λ > 0. Define

η = Pλ(X1 > a) = e^{−aλ}.

Then

η̂n^{MLE} = exp(−a/X̄).

Here, g(x) = exp(−a/x) is infinitely differentiable with bounded derivatives. Thus

E(η̂n^{MLE}) = η + (−aλ + (aλ)²/2)e^{−aλ}/n + O(n^{-2}),
Var(η̂n^{MLE}) = (aλ)²e^{−2aλ}/n + O(n^{-2}),

the leading terms of which agree with the mean and variance of the leading term in its stochastic expansion. On the other hand, for

η̂n^{UMVUE} = ( 1 − a/(nX̄) )^{n−1} I( a/(nX̄) < 1 ),

we cannot apply the result for the approximation of moments. But it can be shown that its variance is also approximated by the same approximation formula.

Remark 1.2.16. The boundedness of the derivatives for the approximation of moments is rather stronger than needed. Whenever the approximation can be proved, the formulae agree with the moments of the leading term of the stochastic expansion, so only the validity of the order of the remainder needs to be proved. For example, in the bivariate normal case, the mean and variance of the sample correlation coefficient can be approximated as follows:

E(ρ̂n) = ρ − ρ(1 − ρ²)/(2n) + O(n^{-2}),
Var(ρ̂n) = (1 − ρ²)²/n + O(n^{-2}).
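These two approximations are easy to check by simulation (a sketch only; ρ, n and the Monte Carlo size below are arbitrary choices):

```python
import numpy as np

# Sketch: check E(rho_hat) - rho ~ -rho(1-rho^2)/(2n) and Var(rho_hat) ~ (1-rho^2)^2/n
# for bivariate normal data.
rng = np.random.default_rng(8)
rho, n, reps = 0.5, 30, 100000
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal([0.0, 0.0], cov, size=(reps, n))   # shape (reps, n, 2)
xc = z[..., 0] - z[..., 0].mean(axis=1, keepdims=True)
yc = z[..., 1] - z[..., 1].mean(axis=1, keepdims=True)
rho_hat = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))
print("bias    :", rho_hat.mean() - rho, "  theory:", -rho * (1 - rho**2) / (2 * n))
print("variance:", rho_hat.var(), "  theory:", (1 - rho**2) ** 2 / n)
```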

    1.3 Asymptotic Theory

    1.3.1 MLE in Exponential Family

Proposition 1.3.1. Let X1, · · · , Xn be a random sample from a population with pdf

pη(x) = h(x) exp(η^⊤T(x) − A(η)) I_X(x), η ∈ E,

where E is the natural parameter space in R^k. Further, assume that

(i) E is open.

(ii) The family is of rank k.

Then

(a) η̂n^{MLE} = η + (Ä(η))^{-1}(T̄n − Ȧ(η)) + o_{Pη}(n^{-1/2}) = η + (−l̈n(η))^{-1}l̇n(η) + o_{Pη}(n^{-1/2}).

(b) √n(η̂n^{MLE} − η) →_d N(0, (Ä(η))^{-1}) = N(0, I1^{-1}(η)).

Proof. Recall that, under the same assumptions,

(T̄n ∈ C⁰) ⊆ (Ȧ(η̂n^{MLE}) = T̄n) ⊆ ( η̂n^{MLE} = (Ȧ)^{-1}(T̄n) ),

with the probabilities of these events tending to 1 (see Theorem 1.1.20). Also note that, by the full rank condition, Ȧ is one-to-one and differentiable on E with Ä > 0.

Now let t = Ȧ(η). Then by the Inverse Function Theorem (see the next remark), Ȧ^{-1} is also differentiable and

D(Ȧ^{-1})(t) = (∂/∂t)Ȧ^{-1}(t) = (Ä(η))^{-1}.

The CLT implies

√n(T̄n − EηT(X1)) →_d N(0, VarηT(X1)),

and recall that EηT(X1) = Ȧ(η), VarηT(X1) = Ä(η). Therefore, the ∆-method implies

η̂n^{MLE} = Ȧ^{-1}(T̄n) = Ȧ^{-1}(t) + (Ä(η))^{-1}(T̄n − t) + oP(|T̄n − t|) = η + (Ä(η))^{-1}(T̄n − t) + o_{Pη}(n^{-1/2}),

and hence

√n(η̂n^{MLE} − η) = (Ä(η))^{-1}√n(T̄n − t) + o_{Pη}(1) →_d N(0, (Ä(η))^{-1}VarηT(X1)(Ä(η))^{-1}) = N(0, VarηT(X1)^{-1}).

The rest is obtained from T̄n − Ȧ(η) = l̇n(η)/n and −l̈n(η) = nÄ(η).

Remark 1.3.2 (Inverse Function Theorem). Let F : U → R^d, where U ⊆ R^d is an open set. If

(i) F is one-to-one,

(ii) F is Fréchet differentiable near x0 ∈ U,

(iii) DF_{x0} := [ ∂F_i/∂x_j |_{x=x0} ]_{i,j} is invertible,

then F^{-1} : F(U) → U is also Fréchet differentiable at y0 = F(x0), and it satisfies D(F^{-1})(y0) = (DF_{x0})^{-1}.

Example 1.3.3 (Multinomial case). Let X1, · · · , Xn be a random sample from Multi(1, p(θ)), where p(θ) = (p1(θ), · · · , pk(θ))^⊤ and θ ∈ Θ ⊆ R. For example, consider the Hardy-Weinberg proportions p(θ) = (θ², 2θ(1 − θ), (1 − θ)²)^⊤, 0 < θ < 1. Assume that

(i) Θ is open, 0 < pi(θ) < 1, and Σ_{i=1}^k pi(θ) = 1.

(ii) p(θ) = (p1(θ), · · · , pk(θ))^⊤ is twice (totally) differentiable.

Then we can derive the asymptotic distribution of an estimator of θ. Let θ = h(p(θ)) for any θ ∈ Θ, for some differentiable function h. Let

p̂(θ) = (1/n) Σ_{i=1}^n Xi = (N1/n, · · · , Nk/n)^⊤,

where Nj = Σ_{i=1}^n I(Xij = 1). Then we get

Eθ p̂(θ) = p(θ)

and hence

h(X̄n) = h(p(θ)) + ḣ(p(θ))^⊤(X̄n − p(θ)) + o(|X̄n − p(θ)|).

Note that Zn := √n(X̄n − p(θ)) = OP(1), and it has the asymptotic distribution

Zn →_d N(0, Σ(θ)), Σ(θ) := diag(p(θ)) − p(θ)p(θ)^⊤,

and therefore we get

√n( h(X̄n) − h(p(θ)) ) = ḣ(p(θ))^⊤Zn + oP(1),

which implies

√n( h(X̄n) − h(p(θ)) ) →_d N(0, σh²(θ)),

where

σh²(θ) = ḣ(p(θ))^⊤ Σ(θ) ḣ(p(θ)).

Furthermore, we can obtain that

σh²(θ) ≥ I1^{-1}(θ),

where equality holds iff

ḣ(p(θ))^⊤(X1 − p(θ)) = I1^{-1}(θ) l̇1(θ).

This can be shown as follows. First, note that

ḣ(p(θ))^⊤ ṗ(θ) = 1.

It implies that

1 = ḣ(p(θ))^⊤ (∂/∂θ)EθX1
 = ḣ(p(θ))^⊤ (∂/∂θ) ∫ x f(x; θ) dµ(x)
 = ḣ(p(θ))^⊤ ∫ x (∂/∂θ)f(x; θ) dµ(x)
 = ḣ(p(θ))^⊤ Covθ( X1, (∂/∂θ)l1(θ) )   (∵ Eθ (∂/∂θ)f(X1; θ)/f(X1; θ) = Eθ (∂/∂θ)l1(θ) = 0)
 = Covθ( ḣ(p(θ))^⊤X1, (∂/∂θ)l1(θ) )
 ≤ √( Varθ( ḣ(p(θ))^⊤X1 ) Varθ( (∂/∂θ)l1(θ) ) )
 = √( ḣ(p(θ))^⊤Σ(θ)ḣ(p(θ)) · I1(θ) ).

Here "=" holds when ḣ(p(θ))^⊤X1 and ∂l1(θ)/∂θ have a linear relationship, i.e.,

ḣ(p(θ))^⊤(X1 − p(θ)) = I1^{-1}(θ) l̇1(θ).
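A small sketch for the Hardy-Weinberg case (illustrative; the particular h(p) = p1 + p2/2, i.e., the allele frequency, and the chosen θ and n are assumptions of this sketch, not part of the example): for this h one can check that σh²(θ) = θ(1 − θ)/2 = I1^{-1}(θ), so the bound is attained.

```python
import numpy as np

# Sketch: with h(p) = p1 + p2/2, theta_hat = h(p_hat) has asymptotic variance
# theta(1-theta)/2, which equals I_1(theta)^{-1} for the Hardy-Weinberg model.
rng = np.random.default_rng(9)
theta, n, reps = 0.3, 500, 100000
p = np.array([theta**2, 2*theta*(1-theta), (1-theta)**2])

counts = rng.multinomial(n, p, size=reps)
theta_hat = counts[:, 0] / n + counts[:, 1] / (2 * n)       # h(p_hat) = p1_hat + p2_hat/2
print("empirical var of sqrt(n)(theta_hat - theta):", n * theta_hat.var())
print("theory  I_1(theta)^{-1} = theta(1-theta)/2 :", theta * (1 - theta) / 2)
```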

Remark 1.3.4. Actually, the previous example shows how to deal with the asymptotic distribution of an FSE more generally.

    1.3.2 Asymptotic Normality of MCE

    Our real goal of this section is right here.

    Theorem 1.3.5 (Asymptotic Normality of MCE). Let X1, · · · , Xn be a random sample from

    Pθ, where θ ∈ Θ and parameter space Θ is open in Rk. Let

    θ̂MCEn = arg minθ∈Θ

    1

    n

    n∑i=1

    ρ(Xi, θ)

    θ0 = arg minθ∈Θ

    Eθ0ρ(X1, θ).

    Under the assumption of their existence, let

    Ψ1(θ) = Ψ(X1, θ) =∂

    ∂θρ(X1, θ), Ψn(θ) =

    1

    n

    n∑i=1

    Ψ(Xi, θ)

    Ψ̇1(θ) =∂Ψ1(θ)

    ∂θ, Ψ̇n(θ) =

    ∂Ψn(θ)

    ∂θ

    Ψ̈1(θ) =∂2Ψ1(θ)

    ∂θ2, Ψ̈n(θ) =

    ∂2Ψn(θ)

    ∂θ2

    Assume that

    (A0) Pθ0

    (Ψn(θ̂

    MCEn ) = 0

    )−−−→n→∞

    1.

    (A1) Eθ0Ψ1(θ0) = 0.

    (A2) V arθ0(Ψ1(θ0)) exists.

    (A3) Eθ0(Ψ̇1(θ0)) exists and is nonsingular.


    (A4) ∃δ > 0 and ∃M(X1) = Mθ0,δ(X1) s.t. max_{|θ−θ0|≤δ, θ∈Θ} |Ψ̈1(θ)| ≤ M(X1), where Eθ0M(X1) < ∞.

    (A5) θ̂MCEn is consistent, i.e., θ̂MCEn →Pθ0 θ0.

    Then

    √n(θ̂MCEn − θ0) →d N(0, ΣΨ(θ0)) as n → ∞,   where ΣΨ(θ0) := (−Eθ0Ψ̇1(θ0))−1 Varθ0Ψ1(θ0) (−Eθ0Ψ̇1(θ0))−1.

    Remark 1.3.6 (Derivatives of vector-valued maps). For a map F : Rk → Rd and a point x0, we use the following notation.

    (1) (1st-order gradient)

    ∂F/∂x(x0) := DF(x0) = (∂F/∂x1(x0), ∂F/∂x2(x0), · · · , ∂F/∂xk(x0)) ∈ Rd×k,

    where ∂F/∂xj(x0) = (∂F1/∂xj(x0), · · · , ∂Fd/∂xj(x0))> is a column vector. It can be interpreted as "a linear map."

    (2) (2nd-order gradient)

    ∂²F/∂x²(x0) := D²F(x0) = ((∂/∂x1)(∂F/∂x)(x0), · · · , (∂/∂xk)(∂F/∂x)(x0)) ∈ Rd×k×k,

    where (∂/∂xi)(∂F/∂x)(x0) is a d × k matrix with ∂²F/∂xi∂xj(x0) as its jth column vector. Note that it can be interpreted as "a bi-linear map."

    (3) (Taylor expansion of a vector-valued map)

    F(x) ≈ F(x0) + ∑_{j=1}^k (∂F/∂xj)(x0)(xj − x0j) + (1/2) ∑_{i,j} (∂²F/∂xi∂xj)(x0)(xi − x0i)(xj − x0j)
         = F(x0) + [∂F/∂x(x0)](x − x0) + (1/2)(x − x0)>[∂²F/∂x²(x0)](x − x0),

    where ∂F/∂x(x0) is a matrix, x − x0 is a vector, and ∂²F/∂x²(x0) is a 3-array. In here, "matrix DF(x0) × vector" becomes a vector, and the "quadratic form with the 3-array D²F(x0)" becomes vector-valued. In this view, DF(x0) and D²F(x0) can be interpreted as a linear and a bi-linear map, respectively.

    Proof. By Taylor's theorem, ∃θ∗n ∈ line(θ0, θ̂n) such that ‖θ∗n − θ0‖ ≤ ‖θ̂n − θ0‖ and

    Ψn(θ̂n) = Ψn(θ0) + Ψ̇n(θ0)(θ̂n − θ0) + (1/2)(θ̂n − θ0)>Ψ̈n(θ∗n)(θ̂n − θ0).

    Also note that, from (A4) and (A5),

    lim_{K→∞} sup_n Pθ0(|Ψ̈n(θ∗n)| ≥ K) = 0

    holds, and hence Ψ̈n(θ∗n) = OPθ0(1), i.e.,

    Ψ̈n(θ∗n)(θ̂n − θ0) = oPθ0(1).

    (∵ Motivation: since Ψ̈n is dominated by an integrable majorant, it is "uniformly integrable," hence it is "tight." On the event {|θ̂n − θ0| ≤ δ} we have |Ψ̈n(θ∗n)| ≤ (1/n)∑_{i=1}^n M(Xi), so from

    Pθ0(|Ψ̈n(θ∗n)| > K) = Pθ0(|Ψ̈n(θ∗n)| > K, |θ̂n − θ0| ≤ δ) + Pθ0(|Ψ̈n(θ∗n)| > K, |θ̂n − θ0| > δ)
                        ≤ Pθ0((1/n)∑_{i=1}^n M(Xi) > K) + Pθ0(|θ̂n − θ0| > δ)        (by (A4))
                        ≤ (1/K)Eθ0M(X1) + Pθ0(|θ̂n − θ0| > δ),

    where the last term tends to 0 by (A5), for any ε > 0 we get, for large N,

    sup_{n>N} Pθ0(|Ψ̈n(θ∗n)| > K) ≤ (1/K)Eθ0M(X1) + ε,

    which implies

    lim_{K→∞} sup_{n>N} Pθ0(|Ψ̈n(θ∗n)| > K) = 0.

    (Let N be larger and take ε ↘ 0.)) Thus, we get

    Ψn(θ̂n) = Ψn(θ0) + Ψ̇n(θ0)(θ̂n − θ0) + (1/2)(θ̂n − θ0)>Ψ̈n(θ∗n)(θ̂n − θ0)
            = Ψn(θ0) + (Ψ̇n(θ0) + (1/2)(θ̂n − θ0)>Ψ̈n(θ∗n))(θ̂n − θ0)
            = Ψn(θ0) + (Ψ̇n(θ0) + oPθ0(1))(θ̂n − θ0)
            = Ψn(θ0) + (Eθ0Ψ̇1(θ0) + oPθ0(1))(θ̂n − θ0).

    Now, note that

    (1) Pθ0(Ψn(θ̂n) = 0) → 1 as n → ∞.

    (2) Eθ0Ψ̇1(θ0) + oPθ0(1) is nonsingular with probability tending to 1 (see Remark 1.3.7).

    (3) If Xn = Yn + OP(an) on En with P(En) → 1 as n → ∞, then Xn = Yn + OP(an) (on the whole space),
    and the same holds for oP (see Remark 1.2.7).

    Thus, on the set with probability tending to 1,

    θ̂n − θ0 = (−Eθ0Ψ̇1(θ0) + oPθ0(1))−1 Ψn(θ0) = [(−Eθ0Ψ̇1(θ0))−1 + oPθ0(1)] Ψn(θ0)

    holds, which yields

    √n(θ̂n − θ0) = (−Eθ0Ψ̇1(θ0))−1 √nΨn(θ0) + oPθ0(1),   where √nΨn(θ0) = OPθ0(1),

    and therefore

    √n(θ̂n − θ0) →d N(0, (−Eθ0Ψ̇1(θ0))−1 Varθ0Ψ1(θ0) (−Eθ0Ψ̇1(θ0))−1) as n → ∞.
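    A small Monte Carlo sketch of the sandwich formula for an illustrative (not from the notes) contrast ρ(x, θ) = (x − θ)⁴ with N(θ0, σ²) data: here Ψ1(θ) = −4(X1 − θ)³, Eθ0Ψ̇1(θ0) = 12σ², Varθ0Ψ1(θ0) = 240σ⁶, so ΣΨ(θ0) = 240σ⁶/(12σ²)² = (5/3)σ², larger than the MLE's σ². The parameter values are arbitrary choices.

import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
theta0, sigma, n, reps = 1.0, 2.0, 300, 5000

est = np.empty(reps)
for r in range(reps):
    x = rng.normal(theta0, sigma, n)
    # θ̂_MCE solves Ψ̄_n(θ) ∝ (1/n)Σ (Xi-θ)^3 = 0, which is strictly decreasing in θ
    est[r] = brentq(lambda t: np.mean((x - t)**3), x.min(), x.max())

print("n·Var(θ̂_MCE):", n * est.var())       # ≈ ΣΨ(θ0) = 5σ²/3
print("sandwich     :", 5 * sigma**2 / 3)
print("MLE variance :", sigma**2)             # I1(θ0)^{-1} = σ² for the normal mean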

    Remark 1.3.7. If A ∈ Rd×d is a symmetric positive definite matrix, then for any small perturbation
    ∆ s.t. ‖∆‖2 < σmin(A), we have rank(A + ∆) = d. Note that rank(A) = d. In here, σmin(A) denotes the
    smallest eigenvalue of A, and ‖ · ‖p is the matrix norm induced by the corresponding Lp vector norm, i.e.,

    ‖∆‖p = sup_{x:‖x‖p=1} ‖∆x‖p.


    Proof. (Motivation: for c ≈ 0 and x ≠ 0,

    1/(x + c) = 1/x − c/x² + c²/x³ − · · ·

    exists, or, for small vectors u, v, A + uv> is nonsingular if A is invertible, from

    (A + uv>)−1 = A−1 − (A−1uv>A−1)/(1 + v>A−1u).)

    Suppose rank(A + ∆) < d. Then ∃x0 ≠ 0 s.t. (A + ∆)x0 = 0 and ‖x0‖2 = 1. Then by the definition of the matrix norm,

    ‖∆‖2 ≥ ‖∆x0‖2 = ‖Ax0‖2 ≥ σmin(A)

    holds. The last inequality follows from the spectral theorem. This contradicts our assumption that ∆ is small.
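    A quick numerical check of the remark; the random A and ∆ below are arbitrary choices made only for the illustration.

import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)                      # symmetric positive definite
sigma_min = np.linalg.eigvalsh(A).min()          # smallest eigenvalue σ_min(A)

D = rng.standard_normal((5, 5))
D *= 0.9 * sigma_min / np.linalg.norm(D, 2)      # scale so that ‖∆‖₂ < σ_min(A)

print("σ_min(A)    :", sigma_min)
print("‖∆‖₂        :", np.linalg.norm(D, 2))
print("rank(A + ∆) :", np.linalg.matrix_rank(A + D))   # stays equal to 5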

    1.3.3 Asymptotic Normality and Efficiency of MLE

    Note that MLE is just a special case of MCE.

    Theorem 1.3.8. Let X1, · · · , Xn be a random sample from Pθ, where θ ∈ Θ and parameter

    space Θ is open in Rk. Recall that MLE is an MCE with

    ρ(x, θ) = − log pθ(x), pθ : pdf of Pθ.

    Under the assumption of their existence, denote

    l̇n(θ) = ∑_{i=1}^n (∂/∂θ) log pθ(Xi),    l̈n(θ) = ∑_{i=1}^n (∂²/∂θ²) log pθ(Xi),    l⃛n(θ) = ∑_{i=1}^n (∂³/∂θ³) log pθ(Xi),

    with averages l̇n(θ)/n, l̈n(θ)/n, l⃛n(θ)/n, and single-observation versions l̇1(θ), l̈1(θ), l⃛1(θ).

    Also assume that

    (M0) Pθ0(l̇n(θ̂MLEn) = 0) → 1 as n → ∞.

    (M1) Eθ0 l̇1(θ0) = 0.

    (M2) I(θ0) = Varθ0(l̇1(θ0)) exists.


    (M3) Eθ0(l̈1(θ0)) exists and is nonsingular.

    (M4) ∃δ0 > 0 and ∃M(X1) = Mθ0,δ0(X1) s.t. max_{|θ−θ0|≤δ0, θ∈Θ} |l⃛1(θ)| ≤ M(X1), where Eθ0M(X1) < ∞.

    (M5) θ̂MLEn is consistent, i.e., θ̂MLEn →Pθ0 θ0.

    Then

    √n(θ̂MLEn − θ0) →d N(0, I(θ0)−1)    and    √n(θ̂MCEn − θ0) →d N(0, ΣΨ(θ0))    as n → ∞,

    with ΣΨ(θ0) − I(θ0)−1 being nonnegative definite (i.e., "ΣΨ(θ0) ≥ I(θ0)−1"), and "=" holds if and only if

    Ψ1(θ0) = (−Eθ0Ψ̇1(θ0)) I1(θ0)−1 l̇1(θ0).    ("Determining contrast function")

    Proof. By the Cauchy–Schwarz inequality,

    Corr(λ>l̇1(θ0), γ>Ψ1(θ0)) = λ>(−Eθ0Ψ̇1(θ0))γ / [√(λ>I1(θ0)λ) √(γ>Varθ0Ψ1(θ0)γ)]
        = λ>I1(θ0)^{1/2} I1(θ0)^{−1/2}(−Eθ0Ψ̇1(θ0))γ / [√(λ>I1(θ0)λ) √(γ>Varθ0Ψ1(θ0)γ)]
        ≤ √(λ>I1(θ0)λ) √(γ>(−Eθ0Ψ̇1(θ0))I1(θ0)−1(−Eθ0Ψ̇1(θ0))γ) / [√(λ>I1(θ0)λ) √(γ>Varθ0Ψ1(θ0)γ)]
        = √(γ>(−Eθ0Ψ̇1(θ0))I1(θ0)−1(−Eθ0Ψ̇1(θ0))γ) / √(γ>Varθ0Ψ1(θ0)γ)

    holds, and hence

    max_{λ≠0} Corr(λ>l̇1(θ0), γ>Ψ1(θ0)) = √(γ>(−Eθ0Ψ̇1(θ0))I1(θ0)−1(−Eθ0Ψ̇1(θ0))γ) / √(γ>Varθ0Ψ1(θ0)γ)

    is obtained (we can find a λ that achieves the maximum). Since a correlation coefficient is at most 1, we get

    γ>(−Eθ0Ψ̇1(θ0))I1(θ0)−1(−Eθ0Ψ̇1(θ0))γ ≤ γ>Varθ0Ψ1(θ0)γ

    for any γ. Using (−Eθ0Ψ̇1(θ0))−1γ instead of γ, we obtain that

    γ>I1(θ0)−1γ ≤ γ>(−Eθ0Ψ̇1(θ0))−1Varθ0Ψ1(θ0)(−Eθ0Ψ̇1(θ0))−1γ,

    i.e.,

    I1(θ0)−1 ≤ (−Eθ0Ψ̇1(θ0))−1Varθ0Ψ1(θ0)(−Eθ0Ψ̇1(θ0))−1 = ΣΨ(θ0).

    Equality holds iff γ>Ψ1(θ0) is (a.s.) a linear function of l̇1(θ0) for every γ, i.e., γ>(−Eθ0Ψ̇1(θ0))−1Ψ1(θ0) = γ>I1(θ0)−1l̇1(θ0) for all γ, which amounts to

    Ψ1(θ0) = (−Eθ0Ψ̇1(θ0)) I1(θ0)−1 l̇1(θ0).

    Remark 1.3.10. The claim that the MLE has smaller asymptotic variance than any other asymptotically
    normal estimator was known as Fisher's conjecture. It is true for a certain class of estimators in a
    regular parametric model. The essential property for such a comparison is "uniform" convergence to
    the normal distribution, as can be seen in the following example.

    Example 1.3.11 (Hodges' example: "superefficient estimator"). Let X̄ be the sample mean of a random
    sample of size n from N(θ, 1), and let

    θ̂sn = X̄ I(|X̄| > n−1/4).

    (Actually, not only 1/4, but any positive exponent less than 1/2 works.)


    If θ ≠ 0, then

    Pθ(θ̂sn = X̄) = 1 − Pθ(|X̄| ≤ n−1/4)
                 = 1 − Pθ(−n−1/4 ≤ X̄ ≤ n−1/4)
                 = 1 − Pθ(−√nθ − n1/4 ≤ √n(X̄ − θ) ≤ −√nθ + n1/4)
                 = 1 − [Φ(−√nθ + n1/4) − Φ(−√nθ − n1/4)],

    and both Φ(−√nθ + n1/4) and Φ(−√nθ − n1/4) tend to 0 if θ > 0 and to 1 if θ < 0, so Pθ(θ̂sn = X̄) = Pθ(|X̄| > n−1/4) −→ 1 as n → ∞.

    Thus,

    Pθ(√n(θ̂sn − θ) ≤ x) = Pθ(√n(θ̂sn − θ) ≤ x, θ̂sn = X̄) + Pθ(√n(θ̂sn − θ) ≤ x, θ̂sn ≠ X̄)
                          = Pθ(√n(X̄ − θ) ≤ x, θ̂sn = X̄) + Pθ(√n(θ̂sn − θ) ≤ x, θ̂sn ≠ X̄)
                          −→ Φ(x) as n → ∞

    holds, i.e., √n(θ̂sn − θ) →d N(0, 1) under Pθ.

    Now, assume that θ = 0. Then √nX̄ ∼ N(0, 1), so

    Pθ(θ̂sn = 0) = Pθ(|X̄| ≤ n−1/4) = Φ(n1/4) − Φ(−n1/4) −→ 1 as n → ∞.

    Then we obtain lim_{n→∞} Pθ(θ̂sn = 0) = 1, and therefore,

    lim_{n→∞} Pθ(√n(θ̂sn − θ) ≤ x) = lim_{n→∞} Pθ(0 ≤ x) = 1 if x ≥ 0 and 0 if x < 0, i.e., the limit is I[0,∞)(x),

    which yields √n(θ̂sn − θ) →d N(0, 0) (the point mass at 0) under Pθ.


    Thus, we get

    √n(θ̂sn − θ) →d N(0, σ²s(θ)) under Pθ as n → ∞,

    where σ²s(θ) = 1 if θ ≠ 0 and σ²s(θ) = 0 if θ = 0. That is, θ̂sn is a "superefficient" estimator: at θ = 0 its
    asymptotic variance beats the "optimal" value I(θ)−1 = 1, i.e., Fisher's conjecture is wrong as stated!
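    A short simulation of Hodges' estimator illustrates σ²s(θ) above, and also the poor behaviour for θ near n−1/4, which is the non-uniformity discussed in the next remark. Since X̄ ∼ N(θ, 1/n) exactly, X̄ is drawn directly; the values of n, the replication count and the seed are arbitrary.

import numpy as np

rng = np.random.default_rng(4)
n, reps = 2000, 200000

for theta in (1.0, 0.0, n**-0.25):            # the last value exposes the non-uniformity
    xbar = rng.normal(theta, 1/np.sqrt(n), size=reps)
    hodges = np.where(np.abs(xbar) > n**-0.25, xbar, 0.0)
    nmse = n * np.mean((hodges - theta)**2)
    print(f"theta={theta:.4f}:  n·E(θ̂ˢ−θ)² ≈ {nmse:.3f}   (for X̄ it is 1.000)")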

    Remark 1.3.12. Note that Fisher's conjecture is wrong in general, but it is "partially correct" in
    "regular" models. The key requirement is "uniform convergence" in the CLT. First note that

    √n(θ̂n − θ) →d N(0, ν(θ)) under Pθ as n → ∞

    only means "pointwise" convergence for each θ. It is equivalent to, for some metric d and the law
    (distribution function) Lθ under Pθ,

    d(Lθ(√n(θ̂n − θ)), Φν(θ)) −→ 0 as n → ∞,

    where Φν(θ) is the cdf of N(0, ν(θ)). A "regular estimator," on the other hand, requires this convergence
    to hold uniformly over shrinking neighborhoods of θ0, i.e.,

    sup_{θ:|θ−θ0|≤C/√n} d(Lθ(√n(θ̂n − θ)), Φν(θ)) −→ 0 for each C > 0;

    the Hodges estimator fails this uniformity near θ = 0, which is how it can be superefficient at a single point.


    Example 1.3.13 (Linear model with stochastic covariates). Consider the model

    Y |Z = z ∼ N(α + z>β, σ2), α ∈ R, β ∈ Rk, σ2 > 0,

    where E(Z) = 0 and Var(Z) is non-singular. Assume that there are n independent copies of

    Y and Z, and denote Z(n) = (Z1, · · · , Zn)>. Then as long as the distribution of Z does not

    depend on θ = (α, β, σ2), the MLE of β is equivalent to the least squares estimator, since

    ln(θ) = −(1/2σ²) ∑_{i=1}^n (Yi − α − zi>β)² − (n/2) log 2πσ² + ∑_{i=1}^n log pdfZ(zi).

    In other words,

    β̂MLE = (Z̃(n)>Z̃(n))−1Z̃(n)>Y,    Z̃(n) = (Z1 − Z̄, · · · , Zn − Z̄)>,
    α̂MLE = Ȳ − Z̄>β̂MLE,
    σ̂²MLE = (1/n)‖Y − (1α̂MLE + Z(n)β̂MLE)‖².

    It also says that

    l1(θ) = −(1/2σ²)(Y1 − α − z1>β)² − (1/2) log 2πσ² + log pdfZ(z1),

    and hence

    l̇1(θ) = (ε1/σ², z1>ε1/σ², ε1²/(2σ⁴) − 1/(2σ²))>,    where ε1 = Y1 − α − z1>β.

    Thus

    I(θ) = diag(1/σ², E(Z1Z1>)/σ², 1/(2σ⁴))    (block diagonal).

    (It can be easily obtained from ε1|z1 ∼ N(0, σ²). Note that

    Var(ε1) = E Var(ε1|z1) + Var E(ε1|z1) = σ²,
    Var(z1ε1) = E Var(z1ε1|z1) + Var E(z1ε1|z1) = E(z1σ²z1>) + Var(z1 · 0) = σ²E(Z1Z1>),

    and similarly we can obtain Var(ε1²).) It implies that

    √n(θ̂MLEn − θ) →d N(0, Σ) as n → ∞,

    where

    Σ = I(θ)−1 = diag(σ², σ²(E(Z1Z1>))−1, 2σ⁴)    (block diagonal),

    provided that E(Z1Z1>), or equivalently Var(Z1), is non-singular.
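    A quick Monte Carlo sketch of this example with k = 2 covariates; the particular Var(Z), β, sample sizes and seed below are arbitrary choices. The β-block of n·Cov(θ̂MLE) should match σ²(E Z1Z1>)−1.

import numpy as np

rng = np.random.default_rng(5)
alpha, beta, sigma = 0.5, np.array([1.0, -2.0]), 1.5
n, reps = 400, 5000
SigmaZ = np.array([[1.0, 0.3], [0.3, 2.0]])          # Var(Z), with E(Z) = 0

beta_hat = np.empty((reps, 2))
for r in range(reps):
    Z = rng.multivariate_normal(np.zeros(2), SigmaZ, size=n)
    Y = alpha + Z @ beta + rng.normal(0, sigma, n)
    X = np.column_stack([np.ones(n), Z])             # design with intercept
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)     # least squares = MLE of (α, β)
    beta_hat[r] = coef[1:]

print("n·Cov(β̂):\n", n * np.cov(beta_hat, rowvar=False))
print("σ²·Var(Z)⁻¹:\n", sigma**2 * np.linalg.inv(SigmaZ))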

    1.3.4 Asymptotic Null distribution of LRT

    Consider a random sample X1, · · · , Xn from Pθ, where θ ∈ Θ ⊆ Rk. Denote θ = (ξ>, η>)>,

    where η ∈ Rk0 . We wish to test

    H0 : ξ = ξ0 vs H1 : ξ ≠ ξ0.

    (Note that it is composite null!) Let Θ be a k-dimensional open set, and

    Θ0 = {(ξ>0 , η>)> : (ξ>0 , η>)> ∈ Θ}

    be a k0-dimensional open set. Now denote

    l̇(θ) = (l̇1(θ)>, l̇2(θ)>)>,    I(θ) = [I11(θ)  I12(θ); I21(θ)  I22(θ)],

    where

    l̇1(θ) = (∂/∂ξ) l(θ),    l̇2(θ) = (∂/∂η) l(θ).

    Theorem 1.3.14 (Asymptotic null distribution of LRT). Assume (M0) ∼ (M6). Then under H0 : ξ = ξ0,

    2(l(θ̂n) − l(θ̂0n)) →d χ²(k − k0) as n → ∞,

    where

    θ̂n = arg max_{θ∈Θ} l(θ),    θ̂0n = arg max_{θ∈Θ0} l(θ).

    Proof. Recall that

    l(θ0) = l(θ̂n) + (1/2)√n(θ̂n − θ0)>[(1/n)l̈(θ0) + oPθ0(1)]√n(θ̂n − θ0)

    holds (the linear term vanishes since l̇(θ̂n) = 0 with probability tending to 1; throughout this proof l̇(θ0) denotes the average score (1/n)∑_{i=1}^n (∂/∂θ) log pθ0(Xi)). Since √n(θ̂n − θ0) = OPθ0(1) and OPθ0(1) · oPθ0(1) = oPθ0(1), we get

    2(l(θ̂n) − l(θ0)) = √n(θ̂n − θ0)>I(θ0)√n(θ̂n − θ0) + oPθ0(1),

    from −(1/n)l̈(θ0) →Pθ0 I(θ0). Also recall that

    √n(θ̂n − θ0) = I(θ0)−1√n l̇(θ0) + oPθ0(1).

    It implies that

    2(l(θ̂n) − l(θ0)) = √n l̇(θ0)>I(θ0)−1√n l̇(θ0) + oPθ0(1).

    Similarly, we get

    2(l(θ̂0n) − l(θ0)) = √n(η̂0n − η0)>I22(θ0)√n(η̂0n − η0) + oPθ0(1)

    and

    √n(η̂0n − η0) = I22(θ0)−1√n l̇2(θ0) + oPθ0(1)

    under H0, so we get

    2(l(θ̂0n) − l(θ0)) = √n l̇2(θ0)>I22(θ0)−1√n l̇2(θ0) + oPθ0(1)

    under H0. Thus, we get

    2(l(θ̂n) − l(θ̂0n)) = √n l̇(θ0)>I(θ0)−1√n l̇(θ0) − √n l̇2(θ0)>I22(θ0)−1√n l̇2(θ0) + oPθ0(1)

    under H0. Now note that

    I(θ0)−1 = [I11(θ0)  I12(θ0); I21(θ0)  I22(θ0)]−1
            = [J1  0; −I22(θ0)−1I21(θ0)  J2] [I11·2(θ0)−1  0; 0  I22(θ0)−1] [J1  −I12(θ0)I22(θ0)−1; 0  J2]

    holds, for identity matrices J1 and J2 of suitable sizes, where I11·2(θ0) := I11(θ0) − I12(θ0)I22(θ0)−1I21(θ0). Then

    l̇(θ0)>I(θ0)−1l̇(θ0) = u> [I11·2(θ0)−1  0; 0  I22(θ0)−1] u,    u := (l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0); l̇2(θ0)),

    holds, so we get

    2(l(θ̂n) − l(θ̂0n)) = √n(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0))>I11·2(θ0)−1√n(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0)) + oPθ0(1). (1.6)

    Now by the CLT,

    √n(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0)) →d (J1, −I12(θ0)I22(θ0)−1) N(0, I(θ0))
        = N(0, I11(θ0) − I12(θ0)I22(θ0)−1I21(θ0)) = N(0, I11·2(θ0)),

    and hence,

    2(l(θ̂n) − l(θ̂0n)) →d χ²(k − k0) as n → ∞.
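    A Monte Carlo sketch of the theorem for N(µ, σ²) data with H0 : µ = µ0 and nuisance parameter σ² (so k = 2, k0 = 1), where the LRT statistic has the closed form n log(σ̂0²/σ̂²). Parameter values, sample sizes and the seed are arbitrary choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
mu0, sigma, n, reps = 0.0, 2.0, 200, 20000

x = rng.normal(mu0, sigma, size=(reps, n))
s2_hat = x.var(axis=1)                          # (1/n)Σ(Xi − X̄)²   : MLE of σ² on Θ
s2_null = ((x - mu0)**2).mean(axis=1)           # (1/n)Σ(Xi − μ0)²  : MLE of σ² on Θ0
lrt = n * np.log(s2_null / s2_hat)              # 2(l(θ̂n) − l(θ̂0n))

# Compare the simulated null distribution with χ²(k − k0) = χ²(1)
qs = [0.5, 0.9, 0.95, 0.99]
print("simulated quantiles:", np.quantile(lrt, qs).round(3))
print("chi2(1) quantiles  :", stats.chi2.ppf(qs, df=1).round(3))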

    Remark 1.3.15. (i) The extension to a null hypothesis

    H0 : g1(θ) = 0, · · · , gk1(θ) = 0

    is rather trivial by considering a smooth reparametrization

    ξ1 = g1(θ), · · · , ξk1 = gk1(θ),    η1 = gk1+1(θ), · · · , ηk0 = gk(θ)

    whenever such gj(θ) (j = k1 + 1, · · · , k) can be found.

    (ii) Large-sample confidence sets based on the maximum likelihood are obtained by duality (inverting the tests).

    Remark 1.3.16. Note that to perform the LRT we should find the MLE on both Θ and Θ0, which may not be easy.
    Thus our interest is in finding asymptotically equivalent tests.


    1.3.5 Asymptotic Equivalents of LRT

    Let X1, · · · , Xn be a random sample from Pθ, where θ ∈ Θ ⊆ Rk. For θ = (ξ>, η>)> ∈ Θ, we

    again wish to test

    H0 : ξ = ξ0 vs H1 : ξ ≠ ξ0.

    We assume the regularity conditions, and in addition, that I(θ) is continuous. Also, we denote the
    true parameter under H0 as θ0 = (ξ0>, η0>)>.

    Lemma 1.3.17. (a) From

    √n(θ̂MLEn − θ0) = I(θ0)−1√n l̇(θ0) + oPθ0(1),

    we get

    I11(θ0)(ξ̂n − ξ0) + I12(θ0)(η̂n − η0) = l̇1(θ0) + oPθ0(n−1/2),
    I21(θ0)(ξ̂n − ξ0) + I22(θ0)(η̂n − η0) = l̇2(θ0) + oPθ0(n−1/2).

    (b) √n(η̂0n − η0) = I22(θ0)−1√n l̇2(θ0) + oPθ0(1), i.e., η̂0n − η0 = I22(θ0)−1l̇2(θ0) + oPθ0(n−1/2).

    (c) 2(l(θ̂n) − l(θ̂0n)) = √n(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0))>I11·2(θ0)−1√n(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0)) + oPθ0(1).

    Theorem 1.3.18 (Wald's test). Let

    Wn = √n(θ̂n − θ̂0n)>I(θ̂0n)√n(θ̂n − θ̂0n).

    Then,

    Wn = √n(ξ̂n − ξ0)>I11·2(θ0)√n(ξ̂n − ξ0) + oPθ0(1) = 2(l(θ̂n) − l(θ̂0n)) + oPθ0(1).

    Proof. First note that, by continuity of I and consistency of θ̂0n,

    Wn = n(θ̂n − θ̂0n)>(I(θ0) + oPθ0(1))(θ̂n − θ̂0n) = n(θ̂n − θ̂0n)>I(θ0)(θ̂n − θ̂0n) + oPθ0(1).


    Then, we get

    Wn = n(θ̂n − θ̂0n)>I(θ0)(θ̂n − θ̂0n) + oPθ0(1)
       = n[(ξ̂n − ξ0)>I11(θ0)(ξ̂n − ξ0) + 2(ξ̂n − ξ0)>I12(θ0)(η̂n − η̂0n) + (η̂n − η̂0n)>I22(θ0)(η̂n − η̂0n)] + oPθ0(1)
       = n[(ξ̂n − ξ0)>(I11(θ0)(ξ̂n − ξ0) + I12(θ0)(η̂n − η̂0n)) + ((ξ̂n − ξ0)>I12(θ0) + (η̂n − η̂0n)>I22(θ0))(η̂n − η̂0n)] + oPθ0(1),

    and from

    I11(θ0)(ξ̂n − ξ0) + I12(θ0)(η̂n − η̂0n) = I11(θ0)(ξ̂n − ξ0) + I12(θ0)(η̂n − η0) − I12(θ0)(η̂0n − η0)
        = l̇1(θ0) − I12(θ0)(η̂0n − η0) + oPθ0(n−1/2)
        = l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0) + oPθ0(n−1/2)

    and

    (ξ̂n − ξ0)>I12(θ0) + (η̂n − η̂0n)>I22(θ0) = (ξ̂n − ξ0)>I12(θ0) + (η̂n − η0)>I22(θ0) − (η̂0n − η0)>I22(θ0)
        = l̇2(θ0)> − (I22(θ0)I22(θ0)−1l̇2(θ0))> + oPθ0(n−1/2) = oPθ0(n−1/2),

    we get

    Wn = n(ξ̂n − ξ0)>(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0)) + oPθ0(1),

    since ξ̂n − ξ0 = OPθ0(n−1/2) and η̂n − η̂0n = OPθ0(n−1/2). Now by part (a) of the lemma, we get

    ξ̂n − ξ0 = I11·2(θ0)−1(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0)) + oPθ0(n−1/2),

    and hence,

    Wn = n(ξ̂n − ξ0)>I11·2(θ0)(ξ̂n − ξ0) + oPθ0(1).


    Using part (a) again, we obtain

    Wn = 2(l(θ̂n) − l(θ̂0n)) + oPθ0(1)

    by (1.6).

    Remark 1.3.19. The leading term in the previous theorem depends on the unknown η0 through I11·2(θ0),
    so in practice it is estimated by

    n(ξ̂n − ξ0)>I11·2(θ̂0n)(ξ̂n − ξ0)    (the "estimated Wald statistic"),

    which is also the leading term of the LRT.

    Theorem 1.3.20 (Rao's test; score test statistic). Let

    Rn = n l̇(θ̂0n)>Î(θ0)−1l̇(θ̂0n),

    where

    Î(θ0) = −l̈(θ̂0n)/n or I(θ̂0n),

    i.e., the "observed information." Define

    Î11·2(θ0) = Î11(θ0) − Î12(θ0)Î22(θ0)−1Î21(θ0),

    just as

    I11·2(θ0) = I11(θ0) − I12(θ0)I22(θ0)−1I21(θ0).

    Then,

    Rn = n l̇1(θ̂0n)>Î11·2(θ0)−1l̇1(θ̂0n) = 2(l(θ̂n) − l(θ̂0n)) + oPθ0(1).

    Remark 1.3.21. Note that, unlike the Wald or LRT statistics, to obtain Rn we only need the MLE under
    the null, θ̂0n.

    Proof. Note that, under H0, l̇2(θ̂0n) = 0 holds, so

    l̇(θ̂0n) = (l̇1(θ̂0n)>, 0>)>,

    and we get

    Rn = n l̇1(θ̂0n)>Î11·2(θ0)−1l̇1(θ̂0n)

    directly (the (1,1) block of Î(θ0)−1 equals Î11·2(θ0)−1 by the block-inversion formula above). Now, with a Taylor expansion, ∃θ̂0,∗n ∈ line((ξ0>, η̂0n>)>, (ξ0>, η0>)>) s.t.

    l̇1(θ̂0n) − l̇1(θ0) = [(∂/∂η)l̇1(θ)|θ=θ̂0,∗n](η̂0n − η0)

    (by the chain rule, and ξ0 − ξ0 = 0). Thus by continuity of I,

    l̇1(θ̂0n) − l̇1(θ0) = (−I12(θ0) + oPθ0(1))(η̂0n − η0)

    holds. In the same way, we get

    l̇2(θ̂0n) − l̇2(θ0) = (−I22(θ0) + oPθ0(1))(η̂0n − η0),

    but from l̇2(θ̂0n) = 0,

    η̂0n − η0 = (I22(θ0) + oPθ0(1))−1l̇2(θ0) = [I22(θ0)−1 + oPθ0(1)]l̇2(θ0)

    is obtained. Therefore,

    l̇1(θ̂0n) = l̇1(θ0) − [I12(θ0) + oPθ0(1)](η̂0n − η0)
             = l̇1(θ0) − [I12(θ0) + oPθ0(1)][I22(θ0)−1 + oPθ0(1)]l̇2(θ0)
             = l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0) + oPθ0(n−1/2)

    holds, and the proof is completed by using

    Î11·2(θ0) = I11·2(θ0) + oPθ0(1)

    and (1.6).

    Remark 1.3.22. Note that, for a simple null H0 : θ = θ0,

    Wn = n(θ̂n − θ0)>I(θ0)(θ̂n − θ0),    Rn = n l̇(θ0)>I(θ0)−1l̇(θ0).

    In this case, k0 = 0, so Wn and Rn converge (in distribution) to χ²(k) under H0. For the composite
    null, θ0 is estimated by θ̂0n.
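    As a small numerical sketch of the three statistics for a simple null, take Bernoulli(p) data with H0 : p = p0, where I(p0) = 1/(p0(1 − p0)) and the average score is l̇(p0) = (p̂ − p0)/(p0(1 − p0)). In this particular model Wn and Rn coincide exactly, and all three statistics are close for moderate n; the values of p0, n and the seed are arbitrary.

import numpy as np
from scipy.special import xlogy

rng = np.random.default_rng(7)
p0, n = 0.4, 500
x = rng.binomial(1, p0, n)
phat = x.mean()

info = 1 / (p0 * (1 - p0))                  # I(p0)
score = (phat - p0) * info                  # average score l̇(p0)

W = n * (phat - p0)**2 * info               # Wald statistic
R = n * score**2 / info                     # Rao (score) statistic; equals W here
LR = 2 * n * (xlogy(phat, phat/p0) + xlogy(1-phat, (1-phat)/(1-p0)))   # LRT statistic

print("Wald:", round(W, 4), " Rao:", round(R, 4), " LRT:", round(LR, 4))
# All three are asymptotically equivalent and χ²(1) under H0.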

    Example 1.3.23 (GoF test in a multinomial model). Let X1, · · · , Xn be a random sample

    from Multi(1, (p1, · · · ,