Theory of Statistics II (Fall 2016)
J.P.Kim
Dept. of Statistics
First Draft Published at: January 1, 2017
Final Draft Published at: November 11, 2018
Preface & Disclaimer
This note is a summary of the lecture Theory of Statistics II (326.522) held at Seoul National University, Fall 2016. The lecturer was B.U.Park, and the note was written by J.P.Kim, a Ph.D. student. There are a few textbooks and references for this course. The contents and corresponding references are as follows.
• Asymptotic Approximations. Reference: Mathematical Statistics: Basic ideas and selected
topics, Vol. I., 2nd edition, P.Bickel & K.Doksum, 2007.
• Weak Convergence. Reference: Convergence of Probability Measures, P.Billingsley, 1999.
• Nonparametric Density Estimation. Reference: Kernel Smoothing, M.P.Wand & M.C.Jones,
1994.
Lecture notes are available at stat.snu.ac.kr/theostat. I also referred to the following books while writing this note. The list will be updated continuously.
• Probability: Theory and Examples, R.Durrett
• Mathematical Statistics (in Korean), W.C.Kim
• Introduction to Nonparametric Regression, K.Takezawa
• Semiparametric Regression, D.Ruppert, M.P.Wand & R.J.Carroll
• Nonparametric and Semiparametric Models, W.Härdle, M.Müller, S.Sperlich & A.Werwatz
If you want to report typos or mistakes, please contact: [email protected]
Acknowledgement
I'm very grateful to my colleagues for their comments and corrections of errors. Their suggestions were very helpful in improving this note.
Special Thanks to: Hyemin Yeon, Seonghun Cho
Chapter 1
Asymptotic Approximations
1.1 Consistency
1.1.1 Preliminary for the chapter
Definition 1.1.1 (Notations). Let Θ be a parameter space. Then we consider a 'random variable' X on the probability space (Ω, F, P_θ), which is a function
X : (Ω, F, P_θ) → (R^d, B(R^d), P_θ^X),
where P_θ^X := P_θ ∘ X^{-1}. Note that P_θ is a probability measure from the model P := {P_θ : θ ∈ Θ}. For convenience, we omit this fundamental setting from now on.
Definition 1.1.2 (Convergence). Let {X_n} be a sequence of random variables.
1. X_n →a.s. X as n → ∞ if
P(lim_{n→∞} X_n = X) = 1
⇔ P(|X_n − X| > ε i.o.) = 0 ∀ε > 0
⇔ lim_{N→∞} P(⋃_{n=N}^∞ {|X_n − X| > ε}) = 0 ∀ε > 0.
2. X_n →P X as n → ∞ if ∀ε > 0, P(|X_n − X| > ε) → 0 as n → ∞.

Proposition 1.1.3. X_n →P X if and only if for any subsequence {n_k} ⊆ {n} there is a further subsequence {n_{k_j}} ⊆ {n_k} such that X_{n_{k_j}} →a.s. X as j → ∞.

Proof. Durrett, p.65.
Definition 1.1.4 (Consistency). q̂_n = q_n(X_1, ..., X_n) is a consistent estimator of q(θ) if
q̂_n →_{P_θ} q(θ) as n → ∞
for any θ ∈ Θ. (We do not know which parameter is the true one.)
Remark 1.1.5. There are some tools for obtaining consistency.
1. Var(Z_n) → 0 and EZ_n → μ as n → ∞ ⇒ Z_n →P μ.
∵ P(|Z_n − μ| > ε) ≤ P(|Z_n − EZ_n| + |EZ_n − μ| > ε)
≤ P(|Z_n − EZ_n| > ε/2) + P(|EZ_n − μ| > ε/2)  (the second term is 0 for sufficiently large n)
≤ (4/ε²) Var(Z_n) → 0.
2. WLLN: X_1, ..., X_n i.i.d. with E|X_1| < ∞ ⇒ X̄_n →P EX_1.
3. If Y_n →P Y and Z_n →P Z, then (a) Y_n + Z_n →P Y + Z; (b) Y_nZ_n →P YZ; (c) Y_n/Z_n →P Y/Z, provided P(Z = 0) = 0.
Proof of (b). Since Y and Z are tight, for any η > 0 there exists M > 0 such that P(|Y| > M) ≤ η and P(|Z| > M) ≤ η. Now,
P(|Y_nZ_n − YZ| > ε) ≤ P(|Y_n||Z_n − Z| > ε/2) + P(|Z||Y_n − Y| > ε/2)
≤ P(|Y_n − Y||Z_n − Z| > ε/4) + P(|Y||Z_n − Z| > ε/4) + P(|Z||Y_n − Y| > ε/2),
and note that P(|Y||Z_n − Z| > ε/4) = P(|Y||Z_n − Z| > ε/4, |Y| > M) + P(|Y||Z_n − Z| > ε/4, |Y| ≤ M) ≤ η + P(|Z_n − Z| ≥ ε/(4M)). Thus
limsup_{n→∞} P(|Y||Z_n − Z| > ε/4) ≤ η
and similarly
limsup_{n→∞} P(|Z||Y_n − Y| > ε/2) ≤ η.
Now, since
P(|Y_n − Y||Z_n − Z| > ε/4) = P(|Y_n − Y||Z_n − Z| > ε/4, |Y_n − Y| > √(ε/4)) + P(|Y_n − Y||Z_n − Z| > ε/4, |Y_n − Y| ≤ √(ε/4))
≤ P(|Y_n − Y| > √(ε/4)) + P(|Z_n − Z| ≥ √(ε/4)) → 0
as n → ∞, we get
limsup_{n→∞} P(|Y_nZ_n − YZ| > ε) ≤ 2η.
Finally, since η > 0 was arbitrary, we get the result.
(c) By (b), it is sufficient to show that Z_n^{-1} →P Z^{-1}. Since P(Z = 0) = 0, for any η > 0 there exists δ > 0 such that P(|Z| ≤ δ) ≤ η. (If not, ∃η > 0 such that ∀δ > 0, P(|Z| ≤ δ) > η. Then by continuity of measure, P(⋂_{δ>0} {|Z| ≤ δ}) = P(Z = 0) ≥ η > 0, a contradiction.) Thus
P(|1/Z_n − 1/Z| > ε) = P(|Z_n − Z| / |Z_nZ| > ε)
≤ P(|Z_n − Z| / (|Z|(|Z| − |Z_n − Z|)) > ε) + P(|Z| < |Z_n − Z|)
≤ P(|Z_n − Z| / (|Z|(|Z| − |Z_n − Z|)) > ε, |Z| > δ, |Z_n − Z| ≤ δ/2)  [≤ P(|Z_n − Z| > δ²ε/2) → 0 as n → ∞]
+ P(|Z| ≤ δ)  [≤ η]
+ P(|Z_n − Z| > δ/2)  [→ 0 as n → ∞]
+ P(|Z| < |Z_n − Z|, |Z_n − Z| > δ)  [≤ P(|Z_n − Z| > δ) → 0 as n → ∞]
+ P(|Z| < |Z_n − Z|, |Z_n − Z| ≤ δ)  [≤ P(|Z| ≤ δ) ≤ η]
and hence
limsup_{n→∞} P(|1/Z_n − 1/Z| > ε) ≤ 2η
holds. Note that η > 0 was arbitrary.
Definition 1.1.6 (Probabilistic O-notation). Let X_n be a sequence of r.v.'s.
1. X_n = O_P(1) if lim_{c→∞} sup_n P(|X_n| > c) = 0 ⇔ lim_{c→∞} limsup_{n→∞} P(|X_n| > c) = 0. ("Bounded in probability")
2. X_n = o_P(1) if X_n →P 0 as n → ∞.
3. X_n = O_P(a_n) if X_n/a_n = O_P(1), and X_n = o_P(a_n) if X_n/a_n = o_P(1).

Proposition 1.1.7. If X_n →d X for some X, then X_n = O_P(1).

Proof. For given ε > 0, there exists c such that P(|X| > c) < ε/2. For such c, P(|X_n| > c) → P(|X| > c), so ∃N s.t.
sup_{n>N} |P(|X_n| > c) − P(|X| > c)| < ε/2.
Thus sup_{n>N} P(|X_n| > c) < ε. For n = 1, 2, ..., N, there exists c_n such that P(|X_n| > c_n) < ε, and letting c* = max(c_1, ..., c_N, c), we get sup_n P(|X_n| > c*) < ε.
Example 1.1.8 (Simple Linear Regression). Consider a simple linear regression model y_i = β_0 + β_1x_i + ε_i, where the ε_i are i.i.d. with mean 0 and variance σ². Also assume that x_1, ..., x_n are known and not all equal. Note that
β̂_{1,n} = Σ_{i=1}^n (x_i − x̄)Y_i / Σ_{i=1}^n (x_i − x̄)².
Since E(β̂_{1,n}) = β_1 and Var(β̂_{1,n}) = σ²/S_xx, we obtain consistency
β̂_{1,n} →_{P_{β,σ²}} β_1 as n → ∞,
provided that S_xx = Σ_{i=1}^n (x_i − x̄)² → ∞ as n → ∞.
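As a quick numerical illustration (not from the lecture; the function name `slope_estimate` and the fixed design x_i = i/n are my own choices), the following sketch simulates the model and checks that the least-squares slope settles near β_1 as S_xx grows:

```python
import random

def slope_estimate(n, beta0=1.0, beta1=2.0, sigma=1.0, seed=0):
    """Draw y_i = beta0 + beta1*x_i + eps_i on the fixed design x_i = i/n
    (not all equal, and Sxx grows like n/12 -> infinity), then return the
    least-squares slope  sum_i (x_i - xbar) y_i / Sxx  from the example."""
    rng = random.Random(seed)
    xs = [i / n for i in range(1, n + 1)]
    ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sum((x - xbar) * y for x, y in zip(xs, ys)) / sxx
```

Since Var(β̂_{1,n}) = σ²/S_xx ≈ 12σ²/n here, the estimate concentrates around β_1 = 2 for large n.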
Example 1.1.9 (Sample correlation coefficient). Let (X_1, Y_1), ..., (X_n, Y_n) be a random sample from a population with
EX_1 = μ_1, EY_1 = μ_2, Var(X_1) = σ_1² > 0, Var(Y_1) = σ_2² > 0, and Corr(X_1, Y_1) = ρ.
Then by the WLLN we get
(X̄, Ȳ, X̄², Ȳ², X̄Y) →P (EX_1, EY_1, EX_1², EY_1², EX_1Y_1) as n → ∞
(here X̄² := n^{-1}ΣX_i², Ȳ² := n^{-1}ΣY_i², and X̄Y := n^{-1}ΣX_iY_i denote sample means, while (X̄)² denotes the square of X̄). Since the function
g(u_1, u_2, u_3, u_4, u_5) = (u_5 − u_1u_2) / (√(u_3 − u_1²) √(u_4 − u_2²))
is continuous at (EX_1, EY_1, EX_1², EY_1², EX_1Y_1), we get
ρ̂_n = (X̄Y − X̄·Ȳ) / (√(X̄² − (X̄)²) √(Ȳ² − (Ȳ)²)) →P ρ as n → ∞.

Remark 1.1.10. Note that if X_n →P c, where c is a constant, then continuity of g(x) at x = c is sufficient for the consistency g(X_n) →P g(c). It is trivial from the definition of continuity.
Example 1.1.11. Let X_1, ..., X_n be a random sample from a population with cdf F. Then we use the empirical distribution function
F̂_n(x) = (1/n) Σ_{i=1}^n I_{(−∞,x]}(X_i) = (1/n) Σ_{i=1}^n I(X_i ≤ x)
to estimate F. By the WLLN, for each x, F̂_n(x) is a consistent estimator of F(x):
F̂_n(x) →P F(x) as n → ∞.

Remark 1.1.12. Note that in this case we can state a stronger result, known as the Glivenko-Cantelli theorem:
sup_x |F̂_n(x) − F(x)| →P 0 as n → ∞.
A sketch of the proof is as follows. Since F̂_n and F are nondecreasing and bounded, we can partition [0, 1], the common range of both functions, into a finite number of intervals; each interval then has a well-defined inverse image which is an interval, and pointwise convergence at the finitely many endpoints controls the supremum. For the full proof, see Durrett, p.76.
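The uniform convergence can be watched numerically. A minimal sketch (my own helper name; F is taken to be the Uniform(0,1) cdf, so F(x) = x): for a step function the supremum of |F̂_n − F| is attained at the order statistics, so it suffices to check i/n and (i+1)/n against each sorted sample point.

```python
import random

def ecdf_sup_distance(sample):
    """sup_x |F_n(x) - F(x)| against F(x) = x on [0,1] (Uniform(0,1) cdf).
    The ECDF jumps only at data points, so the supremum is attained at an
    order statistic; check both one-sided gaps there."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        d = max(d, abs((i + 1) / n - x), abs(i / n - x))
    return d
```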
1.1.2 FSE and MLE in Exponential Families

FSE

Recall that the FSE of ν(F) is defined as ν(F̂_n). One example of an FSE is the MME: to estimate EX_1^j =: ν_j(F) = ∫ x^j dF(x), we use
m̂_j = ν_j(F̂_n) = ∫ x^j dF̂_n(x) = (1/n) Σ_{i=1}^n X_i^j.
By the WLLN we have (m̂_1, m̂_2, ..., m̂_k)^T →P (m_1, m_2, ..., m_k)^T, where m_j = EX_1^j, so we can obtain consistency of the MME easily.
Proposition 1.1.13. Let q = h(m_1, m_2, ..., m_k) be a parameter of interest, where the m_j's are population moments. Then for the MME
q̂_n = h(m̂_1, ..., m̂_k)
based on a random sample X_1, ..., X_n,
q̂_n →P q as n → ∞
holds, provided that h is continuous at (m_1, ..., m_k)^T.
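For instance, the variance is h(m_1, m_2) = m_2 − m_1² with h continuous, so the MME of the variance is consistent. A minimal sketch (my own function name, tested on an Exp(1) population whose variance is 1):

```python
import random

def mme_variance(sample):
    """Method-of-moments plug-in h(m1, m2) = m2 - m1^2 with
    m_j = (1/n) sum X_i^j; consistency follows from Proposition 1.1.13
    since h is continuous."""
    n = len(sample)
    m1 = sum(sample) / n
    m2 = sum(x * x for x in sample) / n
    return m2 - m1 * m1
```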
We can do similar work for the FSE ν(F̂_n). Note that here ν is a functional, so we need to define continuity of a functional. We may use the sup norm as a metric on the space of distribution functions.
Definition 1.1.14. Let F be a space of distribution functions. In this space, we give the norm
‖ · ‖ as a sup norm
‖F‖ = supx|F (x)|.
Then metric is given as
‖F −G‖ = supx|F (x)−G(x)|.
Also, we say that a functional ν : F → R is continuous at F if for any � > 0 there exists δ > 0
such that
‖G− F‖ < δ ⇒ |ν(G)− ν(F )| < �.
Remark 1.1.15. Note that since ‖F̂_n − F‖ → 0 as n → ∞ by the Glivenko-Cantelli theorem, we get consistency of the FSE,
ν(F̂_n) →P ν(F),
provided that ν is continuous at F. In many cases, showing this continuity may be a difficult problem.
Example 1.1.16 (Best Linear Predictor). Let X_1, ..., X_n be k-dimensional i.i.d. r.v.'s, and Y_1, ..., Y_n be i.i.d. 1-dimensional random variables. Then we know that
BLP(x) = μ_2 + Σ_21 Σ_11^{-1} (x − μ_1),
where
E(X^T, Y)^T = (μ_1^T, μ_2)^T and Var(X^T, Y)^T = [ Σ_11  Σ_12 ; Σ_21  Σ_22 ].
Thus, with the sample variances
S_11 = (1/n) Σ_{i=1}^n (X_i − X̄)(X_i − X̄)^T,
S_12 = (1/n) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ)^T = S_21^T,
S_22 = (1/n) Σ_{i=1}^n (Y_i − Ȳ)²,
we obtain the FSE for the BLP,
BLP̂_FSE(x) = Ȳ + S_21 S_11^{-1} (x − X̄).
Note that it is the same as the simple linear regression line. Details are given in the next proposition.
Proposition 1.1.17.
(a) The solution of the minimization problem
(β_0*, β_1*^T)^T = argmin_{β_0, β_1} E(Y − β_0 − β_1^T X)²
is
BLP(x) := β_0* + β_1*^T x = μ_2 + Σ_21 Σ_11^{-1} (x − μ_1).
(b) For Y = (Y_1, ..., Y_n)^T and design matrix X = (1, X_1), where X_1 = (X_1, ..., X_n)^T, the LSE is
β̂_1 = S_11^{-1} S_12 and β̂_0 = Ȳ − X̄^T β̂_1.

Proof. (a) Two approaches are given. The first is a direct proof: the claim is clear because
E(Y − β_0 − β_1^T X)² = E[(Y − μ_2) − β_1^T(X − μ_1)]² + [μ_2 − β_0 − β_1^T μ_1]²
= Σ_22 − 2β_1^T Σ_12 + β_1^T Σ_11 β_1 + [β_0 − (μ_2 − β_1^T μ_1)]².
The second approach uses projection in L² space. For convenience, suppose EX = 0 and EY = 0. Then (β_0*, β_1*^T)^T should satisfy
⟨β_0 + β_1^T X, Y − β_0* − β_1*^T X⟩ = 0 ∀β_0, β_1.
It yields
β_0* = 0, β_1* = (E(XX^T))^{-1} E(XY).
(b) Xβ̂ = 1β̂_0 + X_1β̂_1 should satisfy 1β̂_0 + X_1β̂_1 = Π(Y | C(X)). Writing the centered design as X̃_1 = X_1 − Π(X_1 | C(1)) = X_1 − 1X̄^T, from
1β̂_0 + X_1β̂_1 = 1(β̂_0 + (1^T X_1 / n) β̂_1) + X̃_1β̂_1 = Π(Y | C(1)) + Π(Y | C(X̃_1))
we get
β̂_0 = Ȳ − X̄^T β̂_1 and β̂_1 = (X̃_1^T X̃_1)^{-1} X̃_1^T Y.
Now X̃_1^T X̃_1 / n = S_11 and X̃_1^T Y / n = S_12, which ends the proof.
Figure 1.1: Regression with intercept. Image from Lecture Note.
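The proposition says the plug-in BLP and the least-squares line coincide. A one-dimensional sketch checking the identity numerically (my own helper names; the two functions compute the same line by different formulas):

```python
import random

def blp_coeffs(pairs):
    """Plug-in BLP in one dimension: slope = S12/S11,
    intercept = Ybar - slope*Xbar, as in Example 1.1.16."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    s11 = sum((x - xbar) ** 2 for x, _ in pairs) / n
    s12 = sum((x - xbar) * (y - ybar) for x, y in pairs) / n
    slope = s12 / s11
    return ybar - slope * xbar, slope

def ols_coeffs(pairs):
    """Least squares for y = b0 + b1*x via the normal equations."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b1 * sx) / n, b1
```

The two outputs agree up to floating-point error, which is exactly Proposition 1.1.17(b) in dimension k = 1.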
Example 1.1.18 (Multiple Correlation Coefficient). We define the multiple correlation coefficient (MCC) as
ρ = max_{β_0,β_1} Corr(Y, β_0 + β_1^T X)
and the sample MCC as
ρ̂_n = max_{β_0,β_1} Ĉorr(Y, β_0 + β_1^T X).
Note that
Corr(Y, β_0 + β_1^T X) = Corr(Y − μ_2, β_1^T(X − μ_1))
= Σ_21β_1 / (√Σ_22 √(β_1^T Σ_11 β_1))
= (Σ_11^{-1/2} Σ_12)^T (Σ_11^{1/2} β_1) / (√Σ_22 √(β_1^T Σ_11 β_1))
≤ √(Σ_21 Σ_11^{-1} Σ_12 / Σ_22)
holds by the Cauchy-Schwarz inequality, and equality holds when β_1 = Σ_11^{-1} Σ_12. Thus the population MCC is obtained as
ρ = √(Σ_21 Σ_11^{-1} Σ_12 / Σ_22).
Meanwhile, the sample correlation is obtained as
Ĉorr(Y, β_0 + β_1^T X) = ⟨Y − Ȳ1, (X_1 − 1X̄^T)β_1⟩ / (‖Y − Ȳ1‖ ‖(X_1 − 1X̄^T)β_1‖),
so it is the cosine of the angle between the two rays, Y − Ȳ1 and X̃_1β_1 (with X̃_1 = X_1 − 1X̄^T). Its maximal value is attained at X̃_1β̂_1 = Π(Y − Ȳ1 | C(X̃_1)). (Or one may use Cauchy-Schwarz:
Ĉorr(Y, β_0 + β_1^T X) = y^T X̃_1(X̃_1^T X̃_1)^{-1} X̃_1^T X̃_1β_1 / (‖Y − Ȳ1‖ ‖X̃_1β_1‖) ≤ √(y^T Π_{X̃_1} y) / √(y^T(I − Π_1)y),
where equality holds iff X̃_1(X̃_1^T X̃_1)^{-1} X̃_1^T y = X̃_1β_1, i.e., β_1 = β̂_1.) Thus,
ρ̂² = SSR/SST = β̂_1^T X̃_1^T X̃_1 β̂_1 / ‖Y − Ȳ1‖² = S_21 S_11^{-1} S_12 / S_22.
Example 1.1.19 (Sample Proportions). Let (X_1, ..., X_k)^T ~ Multi(n, p), where p ∈ Θ := {(p_1, ..., p_k)^T : Σ_{i=1}^k p_i = 1, p_i ≥ 0 (i = 1, 2, ..., k)}. We estimate p with the sample proportion
p̂_n = (X_1/n, ..., X_k/n)^T.
Then,
(a) p̂_n is a uniformly consistent estimator of p, i.e.,
∀ε > 0, sup_{p∈Θ} P_p(|p̂_n − p| ≥ ε) → 0 as n → ∞.
(b) q(p̂_n) is a consistent estimator of q(p), provided that q is (uniformly) continuous on Θ.

Proof. (a) Note that there exists a constant C > 0 such that
sup_{p∈Θ} P_p(|p̂_n − p| ≥ ε) ≤ sup_{p∈Θ} E|p̂_n − p|²/ε² = sup_{p∈Θ} Σ_{i=1}^k p_i(1 − p_i)/(nε²) ≤ C/(nε²) → 0 as n → ∞,
so we get the desired result. Note that the first inequality is Chebyshev's inequality.
(b) Note that q is uniformly continuous on Θ, since Θ is closed and bounded. Thus the assertion holds. More precisely, for any ε > 0, there exists δ > 0 such that
|p′ − p| < δ, p, p′ ∈ Θ ⇒ |q(p′) − q(p)| < ε.
Therefore, we get
sup_{p∈Θ} P_p(|q(p̂_n) − q(p)| ≥ ε) ≤ sup_{p∈Θ} P_p(|p̂_n − p| ≥ δ) → 0 as n → ∞.
MLE in exponential families

Consider a random variable X with pdf in a canonical exponential family
q_η(x) = h(x) exp(η^T T(x) − A(η)) I_X(x), η ∈ E,
where E is the natural parameter space in R^k. Our goal is to show consistency of the MLE in the canonical exponential family.

Theorem 1.1.20. Let
q_η(x) = h(x) exp(η^T T(x) − A(η)) I_X(x), η ∈ E
be a canonical exponential family with natural parameter space E ⊆ R^k. Further assume:
(i) E is open.
(ii) The family is of rank k.
(iii) t_0 := T(x) ∈ C0, where C denotes the smallest convex set containing the support of T(X), and C0 denotes its interior.
Then the unique ML estimate η̂(x) exists and is the solution of the likelihood equation
l̇_x(η) = T(x) − Ȧ(η) = 0.
Remark 1.1.21. Note that in (iii), x is the observation of X, so t_0 is the observation of T(X). It is reasonable to state the condition in terms of t_0 because the ML estimate depends on the data only through t_0. Also, recall that (ii) means
∄a ≠ 0 s.t. [P_η(a^T(T(X) − μ) = 0) = 1 for some η ∈ E]
⇔ ∄a ≠ 0 s.t. [Var_η(a^T T(X)) = 0 for some η ∈ E]
⇔ Ä(η) is positive definite ∀η ∈ E.
To prove this, we need some preparation.

Lemma 1.1.22.
(a) ("Supporting Hyperplane Theorem") Let C ⊆ R^k be a convex set, and C0 its interior. Then for t_0 ∉ C or t_0 ∈ ∂C,
∃a ≠ 0 s.t. [a^T t ≥ a^T t_0 ∀t ∈ C].
Conversely, for t_0 ∈ C0,
∄a ≠ 0 s.t. [a^T t ≥ a^T t_0 ∀t ∈ C].
(b) Let P(T ∈ T) = 1 and E(max_i |T_i|) < ∞, where T denotes the support of T and C the smallest convex set containing T. Under the rank condition (ii), μ := ET ∈ C0.
(c) Ȧ(η) = E_η T(X) and Ä(η) = Var_η(T(X)).

Proof. (a) We prove only the second statement. Take δ > 0 such that B(t_0, δ) ⊆ C0, which is possible since C0 is open. Note that for any u with ‖u‖ = 1, we get
t_0 − (δ/2)u, t_0 + (δ/2)u ∈ B(t_0, δ) ⊆ C.
If ∃a ≠ 0 such that a^T t ≥ a^T t_0 ∀t ∈ C, then
a^T(t_0 − (δ/2)u) ≥ a^T t_0 and a^T(t_0 + (δ/2)u) ≥ a^T t_0
hold for u = a/|a|, which yields a contradiction. (Note that the convexity condition is not used here.)
Figure 1.2: Proof of (a)
(b) Note that since C is a convex set containing the support of T, μ := ET ∈ C holds. (A convex set contains the mean of any distribution supported on it.) Assume μ ∉ C0. Then μ ∈ ∂C, and by (a), ∃a ≠ 0 such that a^T t ≥ a^T μ for any t ∈ C. It implies that ∃a ≠ 0 such that P(a^T(T − μ) ≥ 0) = 1, since T ⊆ C. It implies that
P(a^T(T − μ) = 0) = 1,
by the fact that
f ≥ 0, ∫ f dμ = 0 ⇒ f = 0 μ-a.e.
(applied to f = a^T(T − μ), which is nonnegative a.s. with mean 0). This contradicts (ii), the full-rank condition of the exponential family.
(c) Done in TheoStat I.
Proof of theorem. By the lemma, it is sufficient to show:
(1) l_x(η) diverges to −∞ at the boundary. (Existence)
(2) Uniqueness.
Uniqueness is clear since l_x(η) is a C² function and strictly concave, from Ä(η) > 0. Thus, our claim is:
Claim. l_x(η) approaches −∞ on the boundary ∂E.
Let η_0 ∈ ∂E (possibly at infinity). Then there is a sequence η_n ∈ E such that η_n → η_0. Our claim is that for any such sequence, l_x(η_n) → −∞. Note that either |η_n| → ∞ or sup_n |η_n| < ∞. For both cases, from l_x(η) = log h(x) + η^T T(x) − A(η) and e^{A(η)} = ∫_X h(x) e^{η^T T(x)} dμ(x), we get
−l_x(η_n) + log h(x) = A(η_n) − η_n^T t_0
= log ∫_X exp(η_n^T (T(y) − t_0)) h(y) dμ(y).
Case 1. |η_n| → ∞. Write u_n = η_n / |η_n|. Then since
∫_X e^{η_n^T(T(y) − t_0)} h(y) dμ(y) ≥ ∫_{u_n^T(T(y) − t_0) > 1/k} e^{|η_n| · u_n^T(T(y) − t_0)} h(y) dμ(y)
≥ ∫_{u_n^T(T(y) − t_0) > 1/k} e^{|η_n|/k} h(y) dμ(y)
= exp(|η_n|/k) ∫_{u_n^T(T(y) − t_0) > 1/k} h(y) dμ(y),
if we can conclude that for some k,
inf_n ∫_{u_n^T(T(y) − t_0) > 1/k} h(y) dμ(y) > 0,
then by the assumption |η_n| → ∞, we get l_x(η_n) → −∞. Note that if
inf_{u:‖u‖=1} ∫_{u^T(T(y) − t_0) > 0} h(y) dμ(y) > 0,
then
inf_n ∫_{u_n^T(T(y) − t_0) > 0} h(y) dμ(y) > 0,
and from
inf_n ∫_{u_n^T(T(y) − t_0) > 1/k} h(y) dμ(y) → inf_n ∫_{u_n^T(T(y) − t_0) > 0} h(y) dμ(y) as k → ∞,
we get that there exist ε > 0 and k s.t.
inf_n ∫_{u_n^T(T(y) − t_0) > 1/k} h(y) dμ(y) > ε,
and the assertion holds. So our claim is:
Claim. inf_{u:‖u‖=1} ∫_{u^T(T(y) − t_0) > 0} h(y) dμ(y) > 0.
Assume not. If
inf_{u:‖u‖=1} ∫_{u^T(T(y) − t_0) > 0} h(y) dμ(y) = 0,
then since {u : ‖u‖ = 1} is compact, there exists u_0 ∈ {u : ‖u‖ = 1} such that
∫_{u_0^T(T(y) − t_0) > 0} h(y) dμ(y) = 0.
It implies h(y) = 0 on the set {y : u_0^T(T(y) − t_0) > 0} μ-a.e., and hence
∫_{u_0^T(T(y) − t_0) > 0} h(y) e^{η^T T(y) − A(η)} dμ(y) = 0,
which implies that
P_η(u_0^T(T(X) − t_0) > 0) = 0.
Thus, we get
P_η(u_0^T(T(X) − t_0) ≤ 0) = 1,
which is equivalent to
u_0^T(t − t_0) ≤ 0 ∀t ∈ T.
Since C is the convex hull of T, it implies
u_0^T(t − t_0) ≤ 0 ∀t ∈ C;
however, this yields a contradiction to
∄a ≠ 0 s.t. a^T(t − t_0) ≤ 0 ∀t ∈ C,
which holds because t_0 ∈ C0 (Lemma 1.1.22(a)).
Case 2. sup_n |η_n| < ∞. Then η_n → η_0 with η_0 finite, and
liminf_{n→∞} ∫_X e^{η_n^T(T(y) − t_0)} h(y) dμ(y) ≥ ∫_X e^{η_0^T(T(y) − t_0)} h(y) dμ(y) (∗)= ∞
Figure 1.3: Convex hull of T
by Fatou's lemma. (∗) holds because E is the natural parameter space, and η_0 ∈ ∂E implies η_0 ∉ E, since E is open; hence the integral defining e^{A(η_0)} diverges. Thus −l_x(η_n) → ∞ as n → ∞.
Now we are ready to prove consistency.

Theorem 1.1.23. Let X_1, ..., X_n be a random sample from a population with pdf
p_η(x) = h(x) exp{η^T T(x) − A(η)} I_X(x), η ∈ E,
where E is the natural parameter space in R^k. Further, assume that
(i) E is open.
(ii) The family is of rank k.
Then the following hold:
(a) P_η(η̂_n^MLE exists) → 1 as n → ∞.
(b) η̂_n^MLE is consistent.
Proof. (a) Let T̄_n = (1/n) Σ_{i=1}^n T(X_i). Then by the WLLN, we get
lim_{n→∞} P_η(|T̄_n − E_ηT(X_1)| < ε) = 1 ∀ε > 0.
Also note that E_ηT(X_1) ∈ C0, where C0 is the interior of the convex hull of the support of T(X_1). Then since C0 is open, the ball {|T̄_n − E_ηT(X_1)| < ε} is contained in C0 for sufficiently small ε > 0, which implies
lim_{n→∞} P_η(T̄_n ∈ C0) = 1.
Now consider T̄_n instead of T(X_1) in the previous theorems, and note that the convex hull of the support of T̄_n equals the convex hull of the support of T(X_1). Then we can find that
{T̄_n ∈ C0} ⊆ {η̂_n^MLE exists}
and therefore
lim_{n→∞} P_η(η̂_n^MLE exists) = 1.
(b) From Ä > 0, we get that Ȧ(η) is one-to-one and continuous. Then we get
{T̄_n ∈ C0} ⊆ {η̂_n^MLE exists} = {Ȧ(η̂_n^MLE) = T̄_n}
and hence
lim_{n→∞} P_η(η̂_n^MLE = (Ȧ)^{-1}(T̄_n)) = 1 ∀η ∈ E. (1.1)
Further, by the inverse function theorem and the C² property of A, (Ȧ)^{-1} is continuous. Thus by the WLLN and the continuous mapping theorem,
(Ȧ)^{-1}(T̄_n) →_{P_η} (Ȧ)^{-1}(E_ηT(X_1)) = (Ȧ)^{-1}(Ȧ(η)) = η,
and since (Ȧ)^{-1}(T̄_n) ≈ η̂_n^MLE in the sense of (1.1), we get
lim_{n→∞} P_η(|η̂_n^MLE − η| < ε) = 1 ∀ε > 0,
i.e., η̂_n^MLE →_{P_η} η as n → ∞.
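A concrete sketch of the mechanism η̂ = (Ȧ)^{-1}(T̄_n) for the Poisson family (my own helper names; Poisson(λ) in canonical form has η = log λ, T(x) = x, A(η) = e^η, so Ȧ(η) = T̄_n gives η̂ = log X̄, which exists iff X̄ ∈ C0 = (0, ∞)):

```python
import math
import random

def draw_poisson(lam, rng):
    """Knuth's product-of-uniforms Poisson sampler (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def poisson_eta_mle(sample):
    """Canonical Poisson: T(x) = x, A(eta) = exp(eta), so the likelihood
    equation Adot(eta) = Tbar_n yields eta_hat = log(Xbar).
    The MLE exists iff Tbar_n lies in the interior C0 = (0, inf)."""
    tbar = sum(sample) / len(sample)
    return math.log(tbar) if tbar > 0 else None
```

Note how the all-zero sample (T̄_n on the boundary of C) makes the MLE fail to exist, while P(X̄ > 0) → 1, matching part (a) of the theorem.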
Now let's see some general results. Suppose we have lim_{n→∞} Ψ_n(θ) = Ψ_0(θ) and
θ_n : solution of Ψ_n(θ) = 0, θ ∈ C (n = 1, 2, ...),
θ_0 : solution of Ψ_0(θ) = 0, θ ∈ C.
Under what conditions does lim_{n→∞} θ_n = θ_0 hold? We need the following four conditions:
uniform convergence of Ψ_n, continuity of Ψ_0, uniqueness of θ_0, and compactness of C.
Note that these four are jointly sufficient. Our goal is to obtain a similar result for optimization.
Theorem 1.1.24. Suppose that we have lim_{n→∞} D_n(θ) = D_0(θ) and
θ_n = argmin_{θ∈C} D_n(θ) (n = 1, 2, ...),
θ_0 = argmin_{θ∈C} D_0(θ),
where D_n and D_0 are deterministic functions. Also assume that
(i) D_n converges to D_0 uniformly on C.
(ii) D_0 is continuous on C.
(iii) The minimizer θ_0 is unique.
(iv) C is compact.
Then lim_{n→∞} θ_n = θ_0.
Proof. Assume not; in other words, θ_n ↛ θ_0. Then ∃ε > 0 such that |θ_n − θ_0| > ε i.o. It means that there is a subsequence {n′} ⊆ {n} s.t. |θ_{n′} − θ_0| > ε ∀n′. Now define
Δ_n = sup_{θ∈C} |D_n(θ) − D_0(θ)|.
Then by uniform convergence of D_n, we get Δ_n → 0 as n → ∞. Now note that
inf_{|θ−θ_0|>ε} D_0(θ) = inf_{|θ−θ_0|>ε} {D_0(θ) − D_{n′}(θ) + D_{n′}(θ)}
≤ inf_{|θ−θ_0|>ε} {|D_0(θ) − D_{n′}(θ)| + D_{n′}(θ)}
≤ Δ_{n′} + inf_{|θ−θ_0|>ε} D_{n′}(θ)
holds. Because the minimum of D_{n′} is achieved at θ_{n′} ∈ {θ : |θ − θ_0| > ε}, we get
Δ_{n′} + inf_{|θ−θ_0|>ε} D_{n′}(θ) ≤ Δ_{n′} + inf_{|θ−θ_0|≤ε} D_{n′}(θ)
≤ Δ_{n′} + inf_{|θ−θ_0|≤ε} {|D_{n′}(θ) − D_0(θ)| + D_0(θ)}
≤ 2Δ_{n′} + inf_{|θ−θ_0|≤ε} D_0(θ)
= 2Δ_{n′} + D_0(θ_0).
The last equality holds from θ_0 = argmin D_0(θ) and θ_0 ∈ {θ : |θ − θ_0| ≤ ε}. Thus
inf_{|θ−θ_0|>ε} D_0(θ) ≤ 2Δ_{n′} + D_0(θ_0)
holds, which implies
(1/2)(inf_{|θ−θ_0|>ε} D_0(θ) − D_0(θ_0)) ≤ Δ_{n′}.
Letting n′ → ∞, we get
inf_{|θ−θ_0|>ε} D_0(θ) − D_0(θ_0) = 0.
This contradicts the following claim, which we now show:
Claim. inf_{|θ−θ_0|>ε} D_0(θ) − D_0(θ_0) > 0.
Intuitively, since θ_0 is the unique minimizer, the claim seems trivial, but we also need the continuity and compactness conditions to guarantee it. (For this, see the next remark.)
By definition of the infimum, there is a sequence {θ_k} ⊆ {θ : |θ − θ_0| > ε} ∩ C such that
lim_{k→∞} D_0(θ_k) = inf_{|θ−θ_0|>ε} D_0(θ).
Now, by compactness of C, there is a subsequence {k′} ⊆ {k} along which θ_{k′} converges to some θ_0* ("Bolzano-Weierstrass"); with abuse of notation, let θ_k → θ_0* as k → ∞. Then note that θ_0* belongs to {θ : |θ − θ_0| ≥ ε} ∩ C, so θ_0* ≠ θ_0. Now, continuity of D_0 gives
lim_{k→∞} D_0(θ_k) = D_0(θ_0*),
which implies
inf_{|θ−θ_0|>ε} D_0(θ) = D_0(θ_0*).
Therefore, by uniqueness of the minimizer, D_0(θ_0*) > D_0(θ_0), and combining with the above result we obtain
inf_{|θ−θ_0|>ε} D_0(θ) > D_0(θ_0).
Remark 1.1.25. See the next figures. The two examples show that we need continuity and compactness, respectively.
Figure 1.4: Continuity and Compactness are needed.

Remark 1.1.26. For the deterministic case, one can give an alternative proof. Suppose θ_n ↛ θ_0. Then since C is compact, we can find a subsequence {θ_{n_k}} such that θ_{n_k} → θ_0* with θ_0* ≠ θ_0. (If every convergent subsequence converged to θ_0, then the original sequence would converge to θ_0.) Now for sufficiently large n_k,
sup_{θ∈C} |D_{n_k}(θ) − D_0(θ)| < ε/3
holds, so
D_0(θ_0) ≥ D_{n_k}(θ_0) − ε/3 (∵ uniform convergence)
≥ D_{n_k}(θ_{n_k}) − ε/3 (∵ minimizer)
≥ D_0(θ_{n_k}) − (2/3)ε (∵ uniform convergence)
≥ D_0(θ_0*) − ε (∵ D_0(θ_{n_k}) → D_0(θ_0*) from continuity of D_0)
and hence taking ε ↘ 0 gives D_0(θ_0) ≥ D_0(θ_0*), which contradicts the uniqueness of θ_0.
In fact, our real goal is to get a similar result for random D_n.
Theorem 1.1.27. Let D_n be a sequence of random functions, and let D_0 be deterministic. Similarly, define
θ̂_n = argmin_{θ∈C} D_n(θ) (n = 1, 2, ...),
θ_0 = argmin_{θ∈C} D_0(θ).
Now suppose that
(i) D_n converges in probability to D_0 uniformly, i.e.,
sup_{θ∈C} |D_n(θ) − D_0(θ)| →P 0 as n → ∞.
(ii) D_0 is continuous on C.
(iii) The minimizer θ_0 is unique.
(iv) C is compact.
Then θ̂_n →P θ_0 as n → ∞.
Proof. Note that in the proof of Theorem 1.1.24, we did not use the convergence Δ_{n′} → 0 in deriving
inf_{|θ−θ_0|>ε} D_0(θ) ≤ 2Δ_{n′} + D_0(θ_0);
rather, we only used |θ_{n′} − θ_0| > ε. (The convergence Δ_{n′} → 0 was used only in the final step against the claim inf_{|θ−θ_0|>ε} D_0(θ) − D_0(θ_0) > 0.) Thus
|θ̂_n − θ_0| > ε ⇒ inf_{|θ−θ_0|>ε} D_0(θ) ≤ 2Δ_n + D_0(θ_0) ⇒ Δ_n ≥ (1/2)(inf_{|θ−θ_0|>ε} D_0(θ) − D_0(θ_0))
holds. Define
δ(ε) := (1/2)(inf_{|θ−θ_0|>ε} D_0(θ) − D_0(θ_0)) > 0.
Then we get
{|θ̂_n − θ_0| > ε} ⊆ {Δ_n ≥ δ(ε)},
and therefore, by uniform convergence in probability, Δ_n →P 0 and hence
P(|θ̂_n − θ_0| > ε) ≤ P(Δ_n ≥ δ(ε)) → 0 as n → ∞.
Example 1.1.28 (Consistency of MLE when Θ is finite). Let X_1, ..., X_n be a random sample from a population with pdf f_θ(·), θ ∈ Θ. Assume that the parametrization is identifiable and Θ = {θ_1, ..., θ_k}. Then
θ̂_n^MLE →_{P_{θ_0}} θ_0 as n → ∞,
provided that
(0) (Identifiability) P_{θ_1} = P_{θ_2} ⇒ θ_1 = θ_2;
(1) (Finite Kullback-Leibler terms) E_{θ_0} |log(f_θ(X_1)/f_{θ_0}(X_1))| < ∞ for all θ ∈ Θ.
Indeed, with
D_n(θ) = −(1/n) Σ_{i=1}^n log(f_θ(X_i)/f_{θ_0}(X_i)) and D_0(θ) = −E_{θ_0} log(f_θ(X_1)/f_{θ_0}(X_1)),
the MLE is θ̂_n^MLE = argmin_{θ∈Θ} D_n(θ), and
P_{θ_0}{ sup_{θ∈Θ} |D_n(θ) − D_0(θ)| > ε } = P_{θ_0}{ ⋃_{1≤j≤k} (|D_n(θ_j) − D_0(θ_j)| > ε) }
≤ Σ_{j=1}^k P_{θ_0}(|D_n(θ_j) − D_0(θ_j)| > ε)
= o(1) by the WLLN,
so we can derive the result similarly. More precisely, it suffices to show
inf_{|θ−θ_0|>ε} D_0(θ) − D_0(θ_0) > 0
for ε s.t. |θ_n − θ_0| > ε i.o. Uniqueness of the minimizer θ_0 implies this clearly, because Θ is finite here. Note that continuity of D_0 need not be considered.
Remark 1.1.29 (Kullback-Leibler divergence). Since 1 + log z ≤ z, we get
−E_{θ_0} log(f_θ(X_1)/f_{θ_0}(X_1)) = −∫ log(f_θ(X_1)/f_{θ_0}(X_1)) dP_{θ_0}
≥ 1 − ∫_{S(θ_0)} (f_θ(x)/f_{θ_0}(x)) f_{θ_0}(x) dμ(x)
≥ 0,
and hence D_0(θ) ≥ 0. Here S(θ_0) = {x : f_{θ_0}(x) > 0} and S(θ) = {x : f_θ(x) > 0}. Note that equality in 1 + log z ≤ z holds iff z = 1. Thus D_0(θ) = 0 holds if and only if
f_θ(x)/f_{θ_0}(x) = 1 μ-a.e. on S(θ_0) and ∫_{S(θ_0)} f_θ(x) dμ(x) = 1.
Since
1 = ∫_{S(θ)} f_θ(x) dμ(x) = ∫_{S(θ_0)∪S(θ)} f_θ(x) dμ(x) = ∫_{S(θ_0)} f_θ(x) dμ(x) + ∫_{S(θ)\S(θ_0)} f_θ(x) dμ(x),
we get
∫_{S(θ_0)} f_θ(x) dμ(x) = 1 ⇔ ∫_{S(θ)\S(θ_0)} f_θ(x) dμ(x) = 0.
However, by definition of the support, f_θ(x) > 0 on S(θ)\S(θ_0), and hence
∫_{S(θ)\S(θ_0)} f_θ(x) dμ(x) = 0 ⇔ μ(S(θ)\S(θ_0)) = 0.
Thus D_0(θ) = 0 holds if and only if
f_θ(x) = f_{θ_0}(x) μ-a.e. on S(θ_0) and μ(S(θ)\S(θ_0)) = 0.
However, note that
f_θ(x) = f_{θ_0}(x) μ-a.e. on S(θ_0) already implies μ(S(θ)\S(θ_0)) = 0,
because
1 = ∫_{S(θ)} f_θ(x) dμ(x) = ∫_{S(θ_0)} f_θ(x) dμ(x) + ∫_{S(θ)\S(θ_0)} f_θ(x) dμ(x)
= ∫_{S(θ_0)} f_{θ_0}(x) dμ(x) + ∫_{S(θ)\S(θ_0)} f_θ(x) dμ(x)
= 1 + ∫_{S(θ)\S(θ_0)} f_θ(x) dμ(x).
Therefore we get
D_0(θ) = 0 ⇔ f_θ(x) = f_{θ_0}(x) μ-a.e. on S(θ_0).
Now μ(S(θ)\S(θ_0)) = 0 implies f_θ(x) = f_{θ_0}(x) μ-a.e. on S(θ)\S(θ_0), and therefore f_θ(x) = f_{θ_0}(x) μ-a.e., if f_θ(x) = f_{θ_0}(x) μ-a.e. on S(θ_0). Therefore we get
D_0(θ) = 0 ⇔ f_θ(x) = f_{θ_0}(x) μ-a.e. ⇔ θ = θ_0 (∵ identifiability).
It means that θ_0 is the unique minimizer of D_0(θ).
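A numeric sanity check of D_0(θ) ≥ 0 with equality only at θ = θ_0, for the normal location family where the KL divergence has the closed form (μ_0 − μ_1)²/2σ² (my own helper names; the Monte-Carlo estimator uses the log-ratio ((x − μ_1)² − (x − μ_0)²)/2 for unit variance):

```python
import random

def kl_normal_shift(mu0, mu1, sigma=1.0):
    """Closed form: KL( N(mu0, s^2) || N(mu1, s^2) ) = (mu0 - mu1)^2 / (2 s^2).
    This is the population contrast D_0 for the location family, minimized
    (with value 0) exactly at mu1 = mu0."""
    return (mu0 - mu1) ** 2 / (2.0 * sigma ** 2)

def mc_contrast(mu0, mu1, n, seed=0):
    """Monte-Carlo estimate of D_0 = -E_{mu0} log( f_{mu1}(X) / f_{mu0}(X) )
    for N(., 1); the log-ratio simplifies to ((x-mu1)^2 - (x-mu0)^2) / 2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu0, 1.0)
        total += ((x - mu1) ** 2 - (x - mu0) ** 2) / 2.0
    return total / n
```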
Example 1.1.30 (Consistency of MCE). Let X_1, ..., X_n be a random sample from P_{θ_0}, θ_0 ∈ Θ ⊆ R^k, and
θ̂_n^MCE = argmin_{θ∈Θ} (1/n) Σ_{i=1}^n ρ(X_i, θ) =: argmin_{θ∈Θ} ρ_n(θ).
Assume E_{θ_0}|ρ(X_1, θ)| < ∞ for all θ, and suppose that for a compact subset K ⊆ Θ containing θ_0, the conditions (i)-(iv) of Theorem 1.1.27 hold with D_n = ρ_n, D_0(θ) = E_{θ_0}ρ(X_1, θ), and C = K, and moreover P_{θ_0}(θ̂_n^MCE ∈ K) → 1. Then, by the same argument as in Theorem 1.1.27,
P_{θ_0}[ |θ̂_n^MCE − θ_0| > ε, θ̂_n^MCE ∈ K ] → 0 as n → ∞.
Thus, we get
P_{θ_0}[ |θ̂_n^MCE − θ_0| > ε ] ≤ P_{θ_0}[ |θ̂_n^MCE − θ_0| > ε, θ̂_n^MCE ∈ K ] + P_{θ_0}[ θ̂_n^MCE ∉ K ] → 0 as n → ∞.
Remark 1.1.31. Indeed, we have not yet shown consistency of the MCE; we only verified it for a fixed θ_0 ∈ Θ. For consistency of the MCE, we need that for any θ_0 ∈ Θ there exists K ⊆ Θ containing θ_0 such that the conditions (i)-(iv) are fulfilled. Suppose that
(a) for all compact K ⊆ Θ and for all θ_0 ∈ Θ,
sup_{θ∈K} |ρ_n(θ) − E_{θ_0}ρ(X_1, θ)| →_{P_{θ_0}} 0 as n → ∞;
(b) for any θ_0 ∈ Θ there exists a compact subset K of Θ containing θ_0 such that
P_{θ_0}( inf_{θ∈K^c} (ρ_n(θ) − ρ_n(θ_0)) > 0 ) → 1 as n → ∞;
(c) θ ↦ E_{θ_0}ρ(X_1, θ) is continuous on Θ.
Then for any θ_0 ∈ Θ there exists a compact subset K of Θ containing θ_0 such that (ii)-(iv) hold. Note that (b), together with (i) and (c), implies (iii).
Also note that the MLE is a special case of the MCE, with ρ(x, θ) = −log f(x, θ).
Remark 1.1.32. In many cases, it is difficult to verify the uniform convergence condition. For this, the following convexity lemma is useful: if K is convex,
ρ_n(θ) →_{P_{θ_0}} E_{θ_0}ρ(X_1, θ) ∀θ ∈ K ("pointwise convergence"),
and ρ_n is a convex function on K with probability 1 under P_{θ_0}, then we get "uniform convergence":
sup_{θ∈K} |ρ_n(θ) − E_{θ_0}ρ(X_1, θ)| →_{P_{θ_0}} 0 as n → ∞.
See D. Pollard (1991), Econometric Theory, 7, 186-199.
Remark 1.1.33. The condition (b) in Remark 1.1.31 is satisfied if the empirical contrast
ρ_n(θ) = (1/n) Σ_{i=1}^n ρ(X_i, θ)
is convex on a convex open parameter space Θ ⊆ R^k and approaches +∞ on the boundary with probability tending to 1.
1.2 The Delta Method
The basic intuition behind the Delta Method is Taylor expansion.
Theorem 1.2.1. Suppose √n(X_n − a) →d X as n → ∞. Then
√n(g(X_n) − g(a)) = ġ(a)√n(X_n − a) + o_P(1)
and hence
√n(g(X_n) − g(a)) →d ġ(a)X as n → ∞,
provided g is differentiable at a.
Proof. By Taylor's theorem, ∃R(x, a) s.t.
g(x) = g(a) + (ġ(a) + R(x, a))(x − a),
where R(x, a) → 0 as x → a. Note that if X_n →P a and R(x, a) → 0 as x → a, then R(X_n, a) →P 0 (∵ ∀ε > 0 ∃δ > 0 s.t. |x − a| < δ ⇒ |R(x, a)| < ε, which implies
P(|R(X_n, a)| > ε) ≤ P(|X_n − a| ≥ δ) → 0 as n → ∞,
and then R(X_n, a) = o_P(1)). Thus
g(X_n) = g(a) + (ġ(a) + R(X_n, a))(X_n − a)
and hence
√n(g(X_n) − g(a)) = ġ(a)√n(X_n − a) + R(X_n, a) · √n(X_n − a)  [first factor o_P(1), second O_P(1)]
= ġ(a)√n(X_n − a) + o_P(1).
In the multivariate case, the statement becomes ġ(a)^T(X_n − a).

Remark 1.2.2. When g is a function of several variables, the differentiability means total differentiability, which is implied by the existence of continuous partial derivatives.
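A simulation sketch of the theorem's content (my own function name; the population is taken to be N(μ, 1), an assumption of this demo): the sd of √n(g(X̄_n) − g(μ)) should approach |ġ(μ)|σ.

```python
import math
import random

def root_n_error_sd(g, g_of_mu, mu, n, reps, seed=0):
    """Sample sd of sqrt(n) * (g(Xbar_n) - g(mu)) over `reps` replications,
    drawing X_i ~ N(mu, 1). The delta method predicts this is about
    |g'(mu)| * sigma for large n."""
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, 1.0) for _ in range(n)) / n
        vals.append(math.sqrt(n) * (g(xbar) - g_of_mu))
    m = sum(vals) / reps
    return math.sqrt(sum((v - m) ** 2 for v in vals) / (reps - 1))
```

For g(t) = t² at μ = 2 with σ = 1, the predicted limiting sd is |ġ(2)| = 4.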
Example 1.2.3. Let (X_1, Y_1), ..., (X_n, Y_n) be i.i.d. from a population with parameters (μ_1, μ_2, σ_1², σ_2², ρ). Let
ρ̂_n = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / (√(Σ_{i=1}^n (X_i − X̄)²) √(Σ_{i=1}^n (Y_i − Ȳ)²)).
(i) As far as the distribution of ρ̂_n is concerned, we may assume μ_1 = μ_2 = 0, σ_1² = σ_2² = 1. Let
W_i = (X_i, Y_i, X_i², Y_i², X_iY_i)^T, i.i.d. with mean (0, 0, 1, 1, ρ)^T,
and Z_n = √n(W̄_n − (0, 0, 1, 1, ρ)^T), i.e.,
Z_{n1} = √n X̄, Z_{n2} = √n Ȳ, Z_{n3} = √n(X̄² − 1), Z_{n4} = √n(Ȳ² − 1), Z_{n5} = √n(X̄Y − ρ)
(bar notation as in Example 1.1.9). Note that Z_n = O_P(1). Then
ρ̂_n = (ρ + Z_{n5}/√n − (Z_{n1}/√n)(Z_{n2}/√n)) / (√(1 + Z_{n3}/√n − (Z_{n1}/√n)²) √(1 + Z_{n4}/√n − (Z_{n2}/√n)²))
= (ρ + Z_{n5}/√n − o_P(1/√n)) · (1 + Z_{n3}/√n − (Z_{n1}/√n)²)^{−1/2} · (1 + Z_{n4}/√n − (Z_{n2}/√n)²)^{−1/2}
= (ρ + Z_{n5}/√n − o_P(1/√n)) · (1 − (1/2)(Z_{n3}/√n − (Z_{n1}/√n)²) + o_P(1/√n)) · (1 − (1/2)(Z_{n4}/√n − (Z_{n2}/√n)²) + o_P(1/√n))
= (ρ + Z_{n5}/√n − o_P(1/√n)) · (1 − Z_{n3}/(2√n) + o_P(1/√n)) · (1 − Z_{n4}/(2√n) + o_P(1/√n))
= ρ + Z_{n5}/√n − (ρ/2)(Z_{n3}/√n + Z_{n4}/√n) + o_P(1/√n)
holds, so we get
√n(ρ̂_n − ρ) = Z_{n5} − (ρ/2)Z_{n3} − (ρ/2)Z_{n4} + o_P(1)
= √n( (X̄Y − ρ) − (ρ/2)(X̄² − 1) − (ρ/2)(Ȳ² − 1) ) + o_P(1)
= √n( X̄Y − (ρ/2)X̄² − (ρ/2)Ȳ² ) + o_P(1)
= (1/√n) Σ_{i=1}^n (X_iY_i − (ρ/2)X_i² − (ρ/2)Y_i²) + o_P(1)
→d N(0, Var(X_1Y_1 − (ρ/2)X_1² − (ρ/2)Y_1²)).
(ii) Now additionally suppose that the (X_i, Y_i)'s are from a bivariate normal distribution. Then Y_1 − ρX_1 is independent of X_1. Letting Z_1 = Y_1 − ρX_1, we get Var(Z_1) = 1 − ρ², Var(Z_1²) = 2(1 − ρ²)², and, writing X_1Y_1 − (ρ/2)X_1² − (ρ/2)Y_1² = (1 − ρ²)X_1Z_1 + (ρ/2)(1 − ρ²)X_1² − (ρ/2)Z_1², hence
Var(X_1Y_1 − (ρ/2)X_1² − (ρ/2)Y_1²)
= Var((ρ/2)(1 − ρ²)X_1² − (ρ/2)Z_1²) + Var((1 − ρ²)X_1Z_1) + 2Cov((ρ/2)(1 − ρ²)X_1² − (ρ/2)Z_1², (1 − ρ²)X_1Z_1)  [the covariance is 0]
= (ρ²/4)((1 − ρ²)² Var(X_1²) − 2(1 − ρ²) Cov(X_1², Z_1²) + Var(Z_1²)) + (1 − ρ²)² Var(X_1Z_1)
= (ρ²/4)(2(1 − ρ²)² + 2(1 − ρ²)²) + (1 − ρ²)²(1 − ρ²)
= ρ²(1 − ρ²)² + (1 − ρ²)³
= (1 − ρ²)²
holds. It implies that
√n(ρ̂_n − ρ) →d N(0, (1 − ρ²)²).
Therefore, if we define h(ρ) by h′(ρ) = (1 − ρ²)^{-1}, i.e.,
h(ρ) = (1/2) log((1 + ρ)/(1 − ρ)), ("Fisher's z-transform")
then we get
√n(h(ρ̂_n) − h(ρ)) →d N(0, 1),
and with this, we can find a confidence region "with stabilized variance."
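A minimal sketch of the variance-stabilized interval (my own function names; the inverse of h is tanh, since h is the inverse hyperbolic tangent):

```python
import math

def fisher_z(r):
    """h(rho) = 0.5 * log((1 + rho)/(1 - rho)), with h'(rho) = 1/(1 - rho^2)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def fisher_ci(r, n, z=1.96):
    """Variance-stabilized interval: h(rho_hat) +- z/sqrt(n), mapped back
    to the correlation scale through the inverse transform tanh."""
    half = z / math.sqrt(n)
    return math.tanh(fisher_z(r) - half), math.tanh(fisher_z(r) + half)
```

Because √n(h(ρ̂_n) − h(ρ)) is asymptotically N(0, 1), the interval's width on the z-scale does not depend on the unknown ρ.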
We can also expand with higher order terms.
Theorem 1.2.4 (Higher order stochastic expansion). Let X_1, ..., X_n be a random sample with EX_1 = μ and finite Var(X_1) = Σ.
(a) (1-dim case) For g such that g̈ exists,
g(X̄_n) = g(μ) + (σ/√n)ġ(μ)Z_n + (σ²/2n)g̈(μ)Z_n² + o_P(1/n),
where Z_n = √n(X̄_n − μ)/σ.
(b) (general case) For g such that g̈ exists,
g(X̄_n) = g(μ) + ġ(μ)^T(X̄_n − μ) + (1/2)(X̄_n − μ)^T g̈(μ)(X̄_n − μ) + o_P(1/n).
Proof. Again, use Taylor's theorem. We only prove (a). Note that
g(x) = g(a) + ġ(a)(x − a) + (1/2)(g̈(a) + R(x, a))(x − a)²
for R(x, a) → 0 as x → a, so letting Z_n = √n(X̄_n − μ)/σ, we get
g(X̄_n) = g(μ) + ġ(μ)(X̄_n − μ) + (1/2)g̈(μ)(X̄_n − μ)² + (1/2)R(X̄_n, μ)(X̄_n − μ)²  [R(X̄_n, μ) = o_P(1), (X̄_n − μ)² = O_P(1/n)]
= g(μ) + (σ/√n)ġ(μ)Z_n + (σ²/2n)g̈(μ)Z_n² + o_P(1/n),
which implies the conclusion.
Remark 1.2.5. For the general case, the following notation is also frequently used. For (Z_n^i)_{i=1}^d = √n(X̄_n − μ),
g(X̄_n) = g(μ) + (1/√n) g_{/i}(μ)Z_n^i + (1/2n) g_{/ij}(μ)Z_n^iZ_n^j + o_P(1/n).
Here we omit the "Σ" (summation convention), i.e.,
g_{/i}(μ)Z_n^i := Σ_{i=1}^d g_{/i}(μ)Z_n^i, g_{/ij}(μ)Z_n^iZ_n^j := Σ_{i=1}^d Σ_{j=1}^d g_{/ij}(μ)Z_n^iZ_n^j.
Example 1.2.6 (Estimation of Reliability). Let X_1, ..., X_n be a random sample from Exp(λ), where λ > 0 is the rate. Consider the estimation problem of the "reliability"
η = P(X_1 > a) = e^{−aλ}.
(i) Note that
η̂_n^MLE = e^{−a/X̄}.
Let Z_n = √n(λX̄ − 1). Then Z_n →d N(0, 1) and
X̄^{-1} = λ(Z_n/√n + 1)^{-1}.
Thus, we get
η̂_n^MLE = exp(−aλ(Z_n/√n + 1)^{-1})
= exp(−aλ(1 − Z_n/√n + Z_n²/n + o_P(1/n)))
= e^{−aλ}(1 + (aλZ_n/√n − aλZ_n²/n) + (1/2)(aλZ_n/√n − aλZ_n²/n)² + o_P(1/n))
= e^{−aλ}(1 + aλZ_n/√n + ((−aλ + (aλ)²/2)/n)Z_n² + o_P(1/n))
= η + (aλe^{−aλ}/√n)Z_n + ((−aλ + (aλ)²/2)e^{−aλ}/n)Z_n² + o_P(1/n) (1.2)
from (1 + x)^{-1} = 1 − x + x² + o(x²) and e^x = 1 + x + x²/2 + o(x²) as x → 0.
(ii) Now consider
η̂UMV UE =
(1− a
nX
)n−1I
(a
nX< 1
).
Note that from P(anX
< 1)−−−→n→∞
1, we can let η̂UMV UE =(1− a
nX
)n−1(See next remark).
Then
\begin{align*}
\log\hat{\eta}^{UMVUE} &= (n-1)\log\Big(1 - \frac{a}{n\bar{X}}\Big) = (n-1)\log\Big(1 - \frac{a\lambda}{n}\Big(\frac{Z_n}{\sqrt{n}} + 1\Big)^{-1}\Big) \\
&= (n-1)\log\Big(1 - \frac{a\lambda}{n}\Big(1 - \frac{Z_n}{\sqrt{n}} + \frac{Z_n^2}{n} + o_P\Big(\frac{1}{n}\Big)\Big)\Big) \\
&= (n-1)\log\Big(1 - \frac{a\lambda}{n} + \frac{a\lambda}{n\sqrt{n}}Z_n - \frac{a\lambda}{n^2}Z_n^2 + o_P\Big(\frac{1}{n^2}\Big)\Big) \\
&= (n-1)\Big\{\Big(-\frac{a\lambda}{n} + \frac{a\lambda}{n\sqrt{n}}Z_n - \frac{a\lambda}{n^2}Z_n^2\Big) - \frac{1}{2}\Big(-\frac{a\lambda}{n} + \frac{a\lambda}{n\sqrt{n}}Z_n - \frac{a\lambda}{n^2}Z_n^2\Big)^2\Big\} + o_P\Big(\frac{1}{n}\Big) \\
&= -a\lambda + \frac{a\lambda}{\sqrt{n}}Z_n - \frac{a\lambda}{n}Z_n^2 - \frac{(a\lambda)^2}{2n} + \frac{a\lambda}{n} + o_P\Big(\frac{1}{n}\Big) \\
&= -a\lambda + \frac{a\lambda}{\sqrt{n}}Z_n + \frac{-a\lambda Z_n^2 + a\lambda - (a\lambda)^2/2}{n} + o_P\Big(\frac{1}{n}\Big)
\end{align*}
implies
\begin{align*}
\hat{\eta}^{UMVUE} &= \exp\Big(-a\lambda + \frac{a\lambda}{\sqrt{n}}Z_n + \frac{-a\lambda Z_n^2 + a\lambda - (a\lambda)^2/2}{n} + o_P\Big(\frac{1}{n}\Big)\Big) \\
&= e^{-a\lambda}\Big(1 + \Big(\frac{a\lambda}{\sqrt{n}}Z_n + \frac{-a\lambda Z_n^2 + a\lambda - (a\lambda)^2/2}{n}\Big) + \frac{1}{2}\Big(\frac{a\lambda}{\sqrt{n}}Z_n + \frac{-a\lambda Z_n^2 + a\lambda - (a\lambda)^2/2}{n}\Big)^2\Big) + o_P\Big(\frac{1}{n}\Big) \\
&= e^{-a\lambda}\Big(1 + \frac{a\lambda}{\sqrt{n}}Z_n + \Big(-a\lambda Z_n^2 + a\lambda - \frac{(a\lambda)^2}{2} + \frac{1}{2}(a\lambda)^2 Z_n^2\Big)\frac{1}{n}\Big) + o_P\Big(\frac{1}{n}\Big) \\
&= \eta + \frac{a\lambda e^{-a\lambda}}{\sqrt{n}}Z_n + \frac{(-a\lambda + (a\lambda)^2/2)e^{-a\lambda}}{n}(Z_n^2 - 1) + o_P\Big(\frac{1}{n}\Big) \tag{1.3}
\end{align*}
from $\log(1+x) = x - x^2/2 + o(x^2)$ as $x \to 0$. Comparing (1.3) with (1.2), we can say that the UMVUE is "closer" than the MLE to $\eta$: the leading term of the MLE carries the bias
\[
\frac{(-a\lambda + (a\lambda)^2/2)e^{-a\lambda}}{n},
\]
while the leading term of the UMVUE has no bias. As in this case, when one proposes a new estimator, one often compares the second-order terms to judge its asymptotic behavior.
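A small simulation (mine, not from the lecture) makes the comparison concrete: the MLE's bias matches the second-order prediction in (1.2), while the UMVUE is exactly unbiased.

```python
import numpy as np

# Compare the reliability estimators at a = lambda = 1, n = 30.
# Predicted second-order bias of the MLE: (-a*lam + (a*lam)^2/2) e^{-a*lam} / n.
rng = np.random.default_rng(1)
lam, a, n, reps = 1.0, 1.0, 30, 200_000
eta = np.exp(-a * lam)

s = rng.exponential(1.0 / lam, size=(reps, n)).sum(axis=1)   # n * Xbar
mle = np.exp(-a * n / s)                                      # exp(-a / Xbar)
umvue = np.where(a / s < 1.0, (1.0 - a / s) ** (n - 1), 0.0)

bias_mle = mle.mean() - eta
bias_umvue = umvue.mean() - eta
pred = (-a * lam + (a * lam) ** 2 / 2) * eta / n
print(bias_mle, pred)      # both about -0.006
print(bias_umvue)          # about 0
```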
Remark 1.2.7. If there is an event whose probability converges to 1, then in an asymptotic sense we may ignore its complement, in the following sense:
(i) If $P(E_n) \to 1$ and $P(X_n \le x, E_n) \to F(x)$, then $X_n \xrightarrow{d} F$.
(ii) If $P(E_n) \to 1$ and $X_n = X + O_P(n^{-\alpha})$ on $E_n$, then $X_n = X + O_P(n^{-\alpha})$ in general. The convergence rate of $P(E_n)$ does not matter!
($\because$ $P(n^\alpha|X_n - X| \ge C) \le P(n^\alpha|X_n - X| \ge C, E_n) + P(E_n^c)$, where $P(E_n^c) \to 0$; take $\limsup_n$ and then $\lim_{C\to\infty}$ on both sides.)
Example 1.2.8. Consider the sample correlation coefficient again. Assume $EX_1^4 < \infty$ and $EY_1^4 < \infty$, and (without loss of generality) that $X_1$ and $Y_1$ are standardized. With
\[
Z_{n1} = \frac{1}{\sqrt{n}}\sum_i X_i, \quad Z_{n2} = \frac{1}{\sqrt{n}}\sum_i Y_i, \quad Z_{n3} = \frac{1}{\sqrt{n}}\sum_i (X_i^2 - 1), \quad Z_{n4} = \frac{1}{\sqrt{n}}\sum_i (Y_i^2 - 1), \quad Z_{n5} = \frac{1}{\sqrt{n}}\sum_i (X_i Y_i - \rho),
\]
we obtain
\begin{align*}
\hat{\rho}_n &= \Big(\rho + \frac{1}{\sqrt{n}}Z_{n5} - \frac{1}{n}Z_{n1}Z_{n2} + o_P\Big(\frac{1}{n}\Big)\Big) \\
&\quad\cdot\Bigg(1 - \frac{1}{2}\Big(\frac{1}{\sqrt{n}}Z_{n3} - \Big(\frac{1}{\sqrt{n}}Z_{n1}\Big)^2\Big) + \frac{3}{8}\Big(\frac{1}{\sqrt{n}}Z_{n3} - \Big(\frac{1}{\sqrt{n}}Z_{n1}\Big)^2\Big)^2 + o_P\Big(\frac{1}{n}\Big)\Bigg) \\
&\quad\cdot\Bigg(1 - \frac{1}{2}\Big(\frac{1}{\sqrt{n}}Z_{n4} - \Big(\frac{1}{\sqrt{n}}Z_{n2}\Big)^2\Big) + \frac{3}{8}\Big(\frac{1}{\sqrt{n}}Z_{n4} - \Big(\frac{1}{\sqrt{n}}Z_{n2}\Big)^2\Big)^2 + o_P\Big(\frac{1}{n}\Big)\Bigg) \\
&= \Big(\rho + \frac{1}{\sqrt{n}}Z_{n5} - \frac{1}{n}Z_{n1}Z_{n2} + o_P\Big(\frac{1}{n}\Big)\Big) \cdot \Big(1 - \frac{1}{2\sqrt{n}}Z_{n3} + \frac{1}{n}\Big(\frac{1}{2}Z_{n1}^2 + \frac{3}{8}Z_{n3}^2\Big) + o_P\Big(\frac{1}{n}\Big)\Big) \\
&\qquad\cdot\Big(1 - \frac{1}{2\sqrt{n}}Z_{n4} + \frac{1}{n}\Big(\frac{1}{2}Z_{n2}^2 + \frac{3}{8}Z_{n4}^2\Big) + o_P\Big(\frac{1}{n}\Big)\Big) \\
&= \rho + \frac{1}{\sqrt{n}}\Big(Z_{n5} - \frac{\rho}{2}Z_{n3} - \frac{\rho}{2}Z_{n4}\Big) \\
&\quad + \frac{1}{n}\Big(-Z_{n1}Z_{n2} - \frac{1}{2}Z_{n3}Z_{n5} - \frac{1}{2}Z_{n4}Z_{n5} + \rho\Big(\frac{1}{4}Z_{n3}Z_{n4} + \frac{1}{2}Z_{n1}^2 + \frac{1}{2}Z_{n2}^2 + \frac{3}{8}Z_{n3}^2 + \frac{3}{8}Z_{n4}^2\Big)\Big) + o_P\Big(\frac{1}{n}\Big).
\end{align*}
The leading term has bias
\[
\frac{1}{n}\Big\{\frac{\rho}{4}EX_1^2Y_1^2 + \frac{3}{8}\rho\big(EX_1^4 + EY_1^4\big) - \frac{1}{2}\big(EX_1^3Y_1 + EX_1Y_1^3\big)\Big\}
\]
from
\begin{align*}
E(Z_{n3}Z_{n4}) &= EX_1^2Y_1^2 - 1, & E(Z_{n3}^2 + Z_{n4}^2) &= EX_1^4 + EY_1^4 - 2, \\
E(Z_{n1}^2 + Z_{n2}^2) &= 2, & E(Z_{n3}Z_{n5} + Z_{n4}Z_{n5}) &= EX_1^3Y_1 + EX_1Y_1^3 - 2\rho, \\
E(Z_{n1}Z_{n2}) &= \rho. &&
\end{align*}
For the bivariate normal case, the bias becomes
\[
-\frac{1}{2n}\rho(1-\rho^2).
\]
Remark 1.2.9. Note that, using the stochastic expansion, we can get the mean, variance, skewness, ... of the leading term and might expect them to approximate the corresponding moments of $g(\bar{X}_n)$, but this is not true in general! For example, if $X_n \sim \mathrm{Ber}(1/n)$, then
\[
P(nX_n > \epsilon) = P(X_n = 1) = \frac{1}{n} \to 0
\]
since $X_n \in \{0,1\}$, so $nX_n = o_P(1)$, but $E(nX_n) = 1$ for all $n$. Thus, we need to check the behavior of the remainder.
As this example shows, we cannot say that $E(o_P(1)) = o(1)$; that is, "convergence in probability does not imply convergence in $L^1$." (Similarly, "bounded in probability does not imply uniform integrability.") By Vitali's theorem, it is known that if $\{X_n\}$ is uniformly integrable, then
\[
X_n \xrightarrow{P} X \text{ implies } EX_n \to EX.
\]
From now on, we compare "moments of leading terms" and "approximations of moments."
From now on, we compare “moments of leading terms” and “approximation of moments.”
Example 1.2.10. Recall mgf
mgfX(t) = EetX |t| < �
and cgf
cgfX(t) = logmgfX(t) = logEetX . |t| < �
For mr = EXr, mgf has a Taylor expansion
mgfX(t) = 1 +m1t+m22!t2 +
m33!t3 + · · · ,
and if X and Y are independent,
mgfX+Y (t) = mgfX(t) ·mgfY (t)
and
cgfaX+b(t) = cgfX(at) + bt
holds for constants a and b. From this we get
cr(aX + b) = arcr(X),
where cr denotes rth cumulant. Also recall that, for A ≈ 0,
log(1 + A) = A− 12A2 +
1
3A3 − · · · ,
34
-
CHAPTER 1. ASYMPTOTIC APPROXIMATIONS J.P.Kim
and with this, we can obtain
cgfX(t) = log
1 + (mgfX(t)− 1)︸ ︷︷ ︸=A
= c1t+ c22!t2 +
c33!t3 + · · ·
where
c1 = m1, c2 = m2−m21, c3 = m3−3m1m2 +2m31, c4 = m4−4m3m1−3m22 +12m2m21−m41, · · · .
If observations are normalized, i.e., m1 = 0 and m2 = 1, then
c1 = 0, c2 = 1, c3 = m3, c4 = m4 − 3.
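A quick sanity check of the moment-to-cumulant formulas (code mine, not from the lecture): for $\mathrm{Exp}(1)$, the raw moments are $m_r = r!$ and the cumulants are $c_r = (r-1)!$.

```python
import math

# Moment-to-cumulant conversion, checked against Exp(1): m_r = r!, c_r = (r-1)!.
m1, m2, m3, m4 = (math.factorial(r) for r in (1, 2, 3, 4))
c1 = m1
c2 = m2 - m1**2
c3 = m3 - 3*m1*m2 + 2*m1**3
c4 = m4 - 4*m3*m1 - 3*m2**2 + 12*m2*m1**2 - 6*m1**4
print(c1, c2, c3, c4)   # 1 1 2 6
```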
Example 1.2.11. Let
\[
Z_n = \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \underbrace{\frac{X_i - \mu}{\sigma}}_{=:\tilde{Z}_i \overset{d}{\equiv} Z_1}.
\]
Then from
\[
\mathrm{cgf}_{Z_n}(t) = \mathrm{cgf}_{n^{-1/2}\sum\tilde{Z}_i}(t) = n\cdot\mathrm{cgf}_{Z_1}\Big(\frac{t}{\sqrt{n}}\Big),
\]
we obtain
\[
c_r(Z_n) = n\cdot\Big(\frac{1}{\sqrt{n}}\Big)^r c_r(Z_1) = n^{-\frac{r}{2}+1}c_r(Z_1).
\]
From this, we obtain
\begin{align}
EZ_n^3 &= c_3(Z_n) = \frac{1}{\sqrt{n}}c_3(Z_1) \tag{1.4} \\
EZ_n^4 &= c_4(Z_n) + 3 = \frac{1}{n}c_4(Z_1) + 3. \tag{1.5}
\end{align}
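For instance (simulation mine, not from the lecture): for $X_i \sim \mathrm{Exp}(1)$ the standardized summand has $c_3(Z_1) = 2$ and $c_4(Z_1) = 6$, so (1.4)-(1.5) give $EZ_n^3 = 2/\sqrt{n}$ and $EZ_n^4 = 3 + 6/n$.

```python
import numpy as np

# Check (1.4)-(1.5) by simulation for Exp(1): c3(Z_1) = 2, c4(Z_1) = 6.
rng = np.random.default_rng(2)
n, reps = 40, 200_000
z = np.sqrt(n) * (rng.exponential(1.0, size=(reps, n)).mean(axis=1) - 1.0)
m3, m4 = (z**3).mean(), (z**4).mean()
print(m3, 2 / np.sqrt(n))   # skewness term, both near 0.32
print(m4, 3 + 6 / n)        # kurtosis term, both near 3.15
```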
Example 1.2.12. Now consider the multivariate case. Let $X = (X^1, \cdots, X^d)^\top$ and $t = (t_1, \cdots, t_d)^\top$. Then
\[
\mathrm{mgf}_X(t) = Ee^{t^\top X} = Ee^{t_1X^1 + \cdots + t_dX^d}
\]
and
\begin{align*}
m_1 &= \Big[\frac{\partial}{\partial t_i}\mathrm{mgf}_X(t)\Big|_{t=0}\Big]_i \\
m_2 &= \Big[\frac{\partial^2}{\partial t_i\partial t_j}\mathrm{mgf}_X(t)\Big|_{t=0}\Big]_{i,j} \\
m_3 &= \Big[\frac{\partial^3}{\partial t_i\partial t_j\partial t_k}\mathrm{mgf}_X(t)\Big|_{t=0}\Big]_{i,j,k} \\
&\ \ \vdots
\end{align*}
and we get
\[
\mathrm{mgf}_X(t) = 1 + \sum_i m_1(i)t_i + \frac{1}{2!}\sum_{i,j} m_2(i,j)t_it_j + \frac{1}{3!}\sum_{i,j,k} m_3(i,j,k)t_it_jt_k + \cdots.
\]
If the data are centered, i.e., $EX = 0$, then $m_1 = 0$ and so
\[
\mathrm{cgf}_X(t) = \frac{1}{2!}\sum_{i,j} m_2(i,j)t_it_j + \frac{1}{3!}\sum_{i,j,k} m_3(i,j,k)t_it_jt_k + \frac{1}{4!}\sum_{i,j,k,l}\big(m_4(i,j,k,l) - 3m_2(i,j)m_2(k,l)\big)t_it_jt_kt_l + \cdots.
\]
Now let $Z_n = (Z_n^1, \cdots, Z_n^d)^\top = \sqrt{n}(\bar{X}_n - \mu)$. Then
\[
\mathrm{cgf}_{Z_n}(t) = n\cdot\mathrm{cgf}_{Z_1}(t/\sqrt{n})
\]
implies
\begin{align*}
EZ_n^iZ_n^j &=: \sigma^{i,j}, \quad (\sigma^{i,j})_{i,j} = \mathrm{Var}(X_1) \\
EZ_n^iZ_n^jZ_n^k &= \frac{1}{\sqrt{n}}c_3(i,j,k) \\
EZ_n^iZ_n^jZ_n^kZ_n^l &= \frac{1}{n}c_4(i,j,k,l) + 3\sigma^{ij}\sigma^{kl}
\end{align*}
where
\begin{align*}
c_3(i,j,k) &= c_3(Z_1)(i,j,k) = EZ_1^iZ_1^jZ_1^k \\
c_4(i,j,k,l) &= c_4(Z_1)(i,j,k,l) = EZ_1^iZ_1^jZ_1^kZ_1^l - 3(EZ_1^iZ_1^j)(EZ_1^kZ_1^l).
\end{align*}
Proposition 1.2.13 (Moments of the leading terms). Let $X_1, \cdots, X_n$ be i.i.d. with $E|X_1|^4 < \infty$ (in fact, a stronger condition is needed: the mgf of $X_1$ exists).
(a) (Univariate case) For $g$ such that $\ddot{g}$ exists, and $Z_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$,
\[
g(\bar{X}_n) = \underbrace{g(\mu) + \frac{\sigma}{\sqrt{n}}\dot{g}(\mu)Z_n + \frac{\sigma^2}{2n}\ddot{g}(\mu)Z_n^2}_{=:W_n} + o_P(n^{-1}),
\]
and for the leading term $W_n$,
\begin{align*}
E(W_n) &= g(\mu) + \frac{\sigma^2}{2n}\ddot{g}(\mu) \\
\mathrm{Var}(W_n) &= \frac{\sigma^2}{n}(\dot{g}(\mu))^2 + \frac{1}{n^2}\Big(\sigma^3 c_3(Z_1)\dot{g}(\mu)\ddot{g}(\mu) + \frac{1}{2}\sigma^4(\ddot{g}(\mu))^2\Big) + O(n^{-3}) \\
E(W_n - EW_n)^3 &= \frac{1}{n^2}\Big(\sigma^3 c_3(Z_1)(\dot{g}(\mu))^3 + 3\sigma^4(\dot{g}(\mu))^2\ddot{g}(\mu)\Big) + O(n^{-3}).
\end{align*}
(b) (Multivariate case) For $g$ such that $\ddot{g}$ exists, and $(Z_n^i) = \sqrt{n}(\bar{X}_n - \mu)$,
\[
g(\bar{X}_n) = \underbrace{g(\mu) + \frac{1}{\sqrt{n}}g_{/i}(\mu)Z_n^i + \frac{1}{2n}g_{/ij}(\mu)Z_n^iZ_n^j}_{=:W_n} + o_P(n^{-1}),
\]
and for the leading term $W_n$,
\begin{align*}
E(W_n) &= g(\mu) + \frac{1}{2n}g_{/ij}(\mu)\sigma^{ij} \\
\mathrm{Var}(W_n) &= \frac{1}{n}g_{/i}(\mu)g_{/j}(\mu)\sigma^{ij} + O(n^{-2})
\end{align*}
where $(\sigma^{ij}) = \mathrm{Var}(X_1)$.
Proof. (a) Nothing but tedious calculation. First,
\[
E(W_n) = g(\mu) + \frac{\sigma^2}{2n}\ddot{g}(\mu)
\]
is easily obtained. Next, note that
\begin{align*}
\mathrm{Var}(W_n) &= \mathrm{Var}\Big(\frac{\sigma}{\sqrt{n}}\dot{g}(\mu)Z_n + \frac{\sigma^2}{2n}\ddot{g}(\mu)Z_n^2\Big) \\
&= \frac{\sigma^2}{n}(\dot{g}(\mu))^2\mathrm{Var}(Z_n) + \frac{\sigma^3}{n\sqrt{n}}\dot{g}(\mu)\ddot{g}(\mu)\mathrm{Cov}(Z_n, Z_n^2) + \frac{\sigma^4}{4n^2}(\ddot{g}(\mu))^2\mathrm{Var}(Z_n^2).
\end{align*}
First, $\mathrm{Var}(Z_n) = 1$. Also, $\mathrm{Var}(Z_n^2) = E(Z_n^4) - [E(Z_n^2)]^2 = 2 + n^{-1}c_4(Z_1)$ from (1.5). Finally, $\mathrm{Cov}(Z_n, Z_n^2) = EZ_n^3 - EZ_n\cdot EZ_n^2 = n^{-1/2}c_3(Z_1)$ from (1.4). Now we get
\begin{align*}
\mathrm{Var}(W_n) &= \frac{\sigma^2}{n}(\dot{g}(\mu))^2 + \frac{\sigma^3}{n^2}\dot{g}(\mu)\ddot{g}(\mu)c_3(Z_1) + \frac{\sigma^4}{2n^2}(\ddot{g}(\mu))^2 + \frac{\sigma^4}{4n^3}(\ddot{g}(\mu))^2c_4(Z_1) \\
&= \frac{\sigma^2}{n}(\dot{g}(\mu))^2 + \frac{1}{n^2}\Big(\sigma^3 c_3(Z_1)\dot{g}(\mu)\ddot{g}(\mu) + \frac{1}{2}\sigma^4(\ddot{g}(\mu))^2\Big) + O(n^{-3}).
\end{align*}
For $E(W_n - EW_n)^3$, note that $EZ_n^5 = O(n^{-1/2})$ and $EZ_n^6 = O(1)$. (Check!) Then
\begin{align*}
E(W_n - EW_n)^3 &= E\Big[\frac{\sigma}{\sqrt{n}}\dot{g}(\mu)Z_n + \frac{\sigma^2}{2n}\ddot{g}(\mu)(Z_n^2 - 1)\Big]^3 \\
&= \frac{\sigma^3}{n\sqrt{n}}\dot{g}(\mu)^3\underbrace{EZ_n^3}_{=n^{-1/2}c_3(Z_1)} + \frac{3\sigma^4}{2n^2}\dot{g}(\mu)^2\ddot{g}(\mu)\underbrace{E[Z_n^2(Z_n^2 - 1)]}_{=n^{-1}c_4(Z_1)+2} \\
&\quad + \underbrace{\frac{3\sigma^5}{4n^2\sqrt{n}}\dot{g}(\mu)\ddot{g}(\mu)^2E[Z_n(Z_n^2 - 1)^2] + \frac{\sigma^6}{8n^3}\ddot{g}(\mu)^3E[(Z_n^2 - 1)^3]}_{=O(n^{-3})} \\
&= \frac{1}{n^2}\Big(\sigma^3 c_3(Z_1)(\dot{g}(\mu))^3 + 3\sigma^4(\dot{g}(\mu))^2\ddot{g}(\mu)\Big) + O(n^{-3})
\end{align*}
by (1.4) and (1.5).
(b) Note that
\[
W_n = g(\mu) + \frac{1}{\sqrt{n}}g_{/i}(\mu)Z_n^i + \frac{1}{2n}g_{/ij}(\mu)Z_n^iZ_n^j = g(\mu) + \frac{1}{\sqrt{n}}\sum_{i=1}^d g_{/i}(\mu)Z_n^i + \frac{1}{2n}\sum_{i,j=1}^d g_{/ij}(\mu)Z_n^iZ_n^j,
\]
so
\begin{align*}
\mathrm{Var}(W_n) &= \frac{1}{n}\sum_{i=1}^d\sum_{j=1}^d g_{/i}(\mu)g_{/j}(\mu)\mathrm{Cov}(Z_n^i, Z_n^j) + \frac{1}{n\sqrt{n}}\sum_{i=1}^d\sum_{k,l=1}^d g_{/i}(\mu)g_{/kl}(\mu)\mathrm{Cov}(Z_n^i, Z_n^kZ_n^l) \\
&\quad + \frac{1}{4n^2}\sum_{i,j=1}^d\sum_{k,l=1}^d g_{/ij}(\mu)g_{/kl}(\mu)\mathrm{Cov}(Z_n^iZ_n^j, Z_n^kZ_n^l) \\
&= \frac{1}{n}g_{/i}(\mu)g_{/j}(\mu)\underbrace{\mathrm{Cov}(Z_n^i, Z_n^j)}_{=\sigma^{ij}} + \frac{1}{n\sqrt{n}}g_{/i}(\mu)g_{/kl}(\mu)\underbrace{\mathrm{Cov}(Z_n^i, Z_n^kZ_n^l)}_{=EZ_n^iZ_n^kZ_n^l} \\
&\quad + \frac{1}{4n^2}g_{/ij}(\mu)g_{/kl}(\mu)\underbrace{\mathrm{Cov}(Z_n^iZ_n^j, Z_n^kZ_n^l)}_{=EZ_n^iZ_n^jZ_n^kZ_n^l - (EZ_n^iZ_n^j)(EZ_n^kZ_n^l)} \\
&= \frac{1}{n}g_{/i}(\mu)g_{/j}(\mu)\sigma^{ij} + \frac{1}{n^2}g_{/i}(\mu)g_{/kl}(\mu)c_3(i,k,l) + \frac{1}{4n^2}g_{/ij}(\mu)g_{/kl}(\mu)\Big(\frac{1}{n}c_4(i,j,k,l) + 2\sigma^{ij}\sigma^{kl}\Big) \\
&= \frac{1}{n}g_{/i}(\mu)g_{/j}(\mu)\sigma^{ij} + O(n^{-2}).
\end{align*}
Proposition 1.2.14 (Approximation of moments). Let $X_1, \cdots, X_n$ be i.i.d. with $E|X_1|^4 < \infty$.
(a) (Univariate case) For $g$ with bounded and continuous $g^{(j)}$ ($j = 0, 1, \cdots, 4$),
\begin{align*}
E(g(\bar{X}_n)) &= g(\mu) + \frac{\sigma^2}{2n}g^{(2)}(\mu) + O(n^{-2}) \\
\mathrm{Var}(g(\bar{X}_n)) &= \frac{\sigma^2}{n}g^{(1)}(\mu)^2 + O(n^{-2})
\end{align*}
where $\mu = EX_1$, $\sigma^2 = \mathrm{Var}(X_1)$.
(b) (Multivariate case) For $g$ with bounded and continuous $g_{/I}$ ($|I| = 0, 1, \cdots, 4$),
\begin{align*}
E(g(\bar{X}_n)) &= g(\mu) + \frac{1}{2n}g_{/ij}(\mu)\sigma^{ij} + O(n^{-2}) \\
\mathrm{Var}(g(\bar{X}_n)) &= \frac{1}{n}g_{/i}(\mu)g_{/j}(\mu)\sigma^{ij} + O(n^{-2})
\end{align*}
where $\mu = EX_1$, $(\sigma^{ij}) = \mathrm{Var}(X_1)$.
Proof. (a) Note that
\[
g(\bar{X}_n) = g(\mu) + \frac{\sigma}{\sqrt{n}}g^{(1)}(\mu)Z_n + \frac{\sigma^2}{2n}g^{(2)}(\mu)Z_n^2 + \frac{1}{3!}\frac{\sigma^3}{n\sqrt{n}}g^{(3)}(\mu)Z_n^3 + R_n,
\]
where
\[
R_n = \frac{1}{4!}\frac{\sigma^4}{n^2}g^{(4)}(\xi_n)Z_n^4, \quad \xi_n: \text{a number between } \mu \text{ and } \bar{X}_n.
\]
Note that
\[
E|R_n| \le \frac{1}{4!}\frac{\sigma^4}{n^2}\sup_x|g^{(4)}(x)|\cdot EZ_n^4 = O(n^{-2})
\]
from the boundedness of $g^{(4)}$ and $EZ_n^4 = 3 + n^{-1}c_4(Z_1)$. Also note that
\[
\frac{\sigma^3}{n\sqrt{n}}EZ_n^3 = \frac{\sigma^3}{n\sqrt{n}}\cdot\frac{1}{\sqrt{n}}c_3(Z_1) = O(n^{-2}).
\]
From these, we obtain
\[
Eg(\bar{X}_n) = g(\mu) + \frac{\sigma^2}{2n}g^{(2)}(\mu) + O(n^{-2}).
\]
Next, for the variance, we get
\begin{align*}
\mathrm{Var}\,g(\bar{X}_n) &= \frac{\sigma^2}{n}g^{(1)}(\mu)^2\mathrm{Var}(Z_n) + \frac{\sigma^4}{4n^2}g^{(2)}(\mu)^2\mathrm{Var}(Z_n^2) \\
&\quad + O(n^{-3})\mathrm{Var}(Z_n^3) + \mathrm{Var}(R_n) \\
&\quad + O(n^{-3/2})\mathrm{Cov}(Z_n, Z_n^2) + O(n^{-2})\mathrm{Cov}(Z_n, Z_n^3) + O(n^{-5/2})\mathrm{Cov}(Z_n^2, Z_n^3) \\
&\quad + O(n^{-1/2})\mathrm{Cov}(Z_n, R_n) + O(n^{-1})\mathrm{Cov}(Z_n^2, R_n) + O(n^{-3/2})\mathrm{Cov}(Z_n^3, R_n).
\end{align*}
Note that
\begin{align*}
&\mathrm{Var}(Z_n) = 1, \quad \mathrm{Var}(Z_n^2) = 2 + \frac{1}{n}c_4(Z_1) = O(1), \quad \mathrm{Var}(Z_n^3) = EZ_n^6 - (EZ_n^3)^2 \le O(1), \\
&\mathrm{Var}(R_n) \le ER_n^2 = O(n^{-4})\cdot EZ_n^8 \le O(n^{-4}), \\
&|\mathrm{Cov}(Z_n, R_n)| \le \sqrt{\mathrm{Var}\,Z_n}\sqrt{\mathrm{Var}\,R_n} \le O(n^{-2}), \\
&|\mathrm{Cov}(Z_n^2, R_n)| \le \sqrt{\mathrm{Var}\,Z_n^2}\sqrt{\mathrm{Var}\,R_n} \le O(n^{-2}), \\
&|\mathrm{Cov}(Z_n^3, R_n)| \le \sqrt{\mathrm{Var}\,Z_n^3}\sqrt{\mathrm{Var}\,R_n} \le O(n^{-2}), \\
&\mathrm{Cov}(Z_n, Z_n^2) = EZ_n^3 - (EZ_n)(EZ_n^2) = O(n^{-1/2}), \\
&|\mathrm{Cov}(Z_n, Z_n^3)| \le \sqrt{\mathrm{Var}\,Z_n}\sqrt{\mathrm{Var}\,Z_n^3} \le O(1), \\
&|\mathrm{Cov}(Z_n^2, Z_n^3)| \le \sqrt{\mathrm{Var}\,Z_n^2}\sqrt{\mathrm{Var}\,Z_n^3} \le O(1).
\end{align*}
From these, we get
\[
\mathrm{Var}\,g(\bar{X}_n) = \frac{\sigma^2}{n}g^{(1)}(\mu)^2 + O(n^{-2}).
\]
(b) Recall Taylor's theorem for a multivariate function: for a $(k+1)$-times continuously differentiable function $f : \mathbb{R}^n \to \mathbb{R}$,
\[
f(x) = \sum_{|\alpha| \le k}\frac{D^\alpha f(a)}{\alpha!}(x-a)^\alpha + \sum_{|\beta| = k+1}R_\beta(x)(x-a)^\beta,
\]
where for $\alpha = (\alpha_1, \cdots, \alpha_n) \in \mathbb{N}^n$ we define $|\alpha| = \alpha_1 + \cdots + \alpha_n$, $\alpha! = \alpha_1!\cdots\alpha_n!$, $x^\alpha = x_1^{\alpha_1}\cdots x_n^{\alpha_n}$, and
\[
R_\beta(x) = \frac{|\beta|}{\beta!}\int_0^1 (1-t)^{|\beta|-1}D^\beta f(a + t(x-a))\,dt.
\]
Here,
\[
D^\alpha f = \frac{\partial^{|\alpha|}f}{\partial x_1^{\alpha_1}\cdots\partial x_n^{\alpha_n}}.
\]
Using this, we obtain
\[
g(\bar{X}_n) = g(\mu) + \frac{1}{\sqrt{n}}g_{/i}(\mu)Z_n^i + \frac{1}{2!}\frac{1}{n}g_{/ij}Z_n^iZ_n^j + \frac{1}{3!}\frac{1}{n\sqrt{n}}g_{/ijk}Z_n^iZ_n^jZ_n^k + R_n,
\]
where
\[
|R_n| \le \frac{1}{6n^2}\Big|\int_0^1 (1-u)^3 g_{/ijkl}\Big(\mu + \frac{1}{\sqrt{n}}uZ_n\Big)Z_n^iZ_n^jZ_n^kZ_n^l\,du\Big|.
\]
By the boundedness of $g_{/I}$ and
\[
E|Z_n^iZ_n^jZ_n^kZ_n^l| \le \frac{1}{4}E\big((Z_n^i)^4 + (Z_n^j)^4 + (Z_n^k)^4 + (Z_n^l)^4\big) = O(1) \quad \text{(AM-GM)},
\]
we obtain the desired result by a procedure similar to the univariate case.
Example 1.2.15 (Estimation of reliability). Let $X_1, \cdots, X_n$ be a random sample from $\mathrm{Exp}(\lambda)$, $\lambda > 0$. Define
\[
\eta = P_\lambda(X_1 > a) = e^{-a\lambda}.
\]
Then
\[
\hat{\eta}_n^{MLE} = \exp(-a/\bar{X}).
\]
Here, $g(x) = \exp(-a/x)$ is infinitely differentiable with bounded derivatives. Thus
\begin{align*}
E(\hat{\eta}_n^{MLE}) &= \eta + \frac{(-a\lambda + (a\lambda)^2/2)e^{-a\lambda}}{n} + O(n^{-2}) \\
\mathrm{Var}(\hat{\eta}_n^{MLE}) &= \frac{(a\lambda)^2e^{-2a\lambda}}{n} + O(n^{-2}),
\end{align*}
the leading terms of which agree with the mean and variance of the leading term in its stochastic expansion. On the other hand, for
\[
\hat{\eta}_n^{UMVUE} = \Big(1 - \frac{a}{n\bar{X}}\Big)^{n-1}I\Big(\frac{a}{n\bar{X}} < 1\Big),
\]
we cannot apply the result for the approximation of moments. But it can be shown that its variance is also approximated by the same formula.
Remark 1.2.16. The boundedness of the derivatives required for the approximation of moments is rather stronger than needed. Whenever the approximation can be proved, the formulae agree with the moments of the leading term of the stochastic expansion, so only the validity of the order of the remainder needs to be proved. For example, in the bivariate normal case, the mean and variance of the sample correlation coefficient can be approximated as follows:
\begin{align*}
E(\hat{\rho}_n) &= \rho - \frac{\rho(1-\rho^2)}{2n} + O(n^{-2}) \\
\mathrm{Var}(\hat{\rho}_n) &= \frac{(1-\rho^2)^2}{n} + O(n^{-2}).
\end{align*}
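A simulation check of the mean approximation (code mine, not from the lecture): for bivariate normal data the empirical bias of $\hat{\rho}_n$ should be close to $-\rho(1-\rho^2)/(2n)$.

```python
import numpy as np

# Bias of the sample correlation under a bivariate normal, rho = 0.6, n = 20.
rng = np.random.default_rng(3)
rho, n, reps = 0.6, 20, 100_000
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=(reps, n))
x, y = xy[..., 0], xy[..., 1]
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
rho_hat = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))
bias = rho_hat.mean() - rho
pred = -rho * (1 - rho**2) / (2 * n)
print(bias, pred)   # both close to -0.01
```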
1.3 Asymptotic Theory
1.3.1 MLE in Exponential Family
Proposition 1.3.1. Let $X_1, \cdots, X_n$ be a random sample from a population with pdf
\[
p_\eta(x) = h(x)\exp\big(\eta^\top T(x) - A(\eta)\big)I_{\mathcal{X}}(x), \quad \eta \in \mathcal{E},
\]
where $\mathcal{E}$ is the natural parameter space in $\mathbb{R}^k$. Further, assume that
(i) $\mathcal{E}$ is open.
(ii) The family is of rank $k$.
Then
(a) $\hat{\eta}_n^{MLE} = \eta + (\ddot{A}(\eta))^{-1}(\bar{T}_n - \dot{A}(\eta)) + o_{P_\eta}(n^{-1/2}) = \eta + (-\ddot{l}(\eta))^{-1}\dot{l}(\eta) + o_{P_\eta}(n^{-1/2})$.
(b) $\sqrt{n}(\hat{\eta}_n^{MLE} - \eta) \xrightarrow{d} N\big(0, (\ddot{A}(\eta))^{-1}\big) = N(0, I_1^{-1}(\eta))$.
Proof. Recall that, under the same assumptions,
\[
(\bar{T}_n \in C^0) \subseteq (\dot{A}(\hat{\eta}_n^{MLE}) = \bar{T}_n) \subseteq \big(\hat{\eta}_n^{MLE} = (\dot{A})^{-1}(\bar{T}_n)\big),
\]
with the probabilities of these events tending to 1 (see Theorem 1.1.20). Also note that, by the full-rank condition, $\dot{A}$ is one-to-one and differentiable on $\mathcal{E}$ with $\ddot{A} > 0$.
Now let $t = \dot{A}(\eta)$. Then by the Inverse Function Theorem (see the next remark), $\dot{A}^{-1}$ is also differentiable and
\[
D(\dot{A}^{-1})(t) = \frac{\partial}{\partial t}\dot{A}^{-1}(t) = \big(\ddot{A}(\eta)\big)^{-1}.
\]
The CLT implies
\[
\sqrt{n}(\bar{T}_n - E_\eta T(X_1)) \xrightarrow{d} N(0, \mathrm{Var}_\eta T(X_1)),
\]
and recall that $E_\eta T(X_1) = \dot{A}(\eta)$, $\mathrm{Var}_\eta T(X_1) = \ddot{A}(\eta)$. Therefore, the $\Delta$-method implies
\begin{align*}
\hat{\eta}_n^{MLE} &= \dot{A}^{-1}(\bar{T}_n) = \dot{A}^{-1}(t) + \big(\ddot{A}(\eta)\big)^{-1}(\bar{T}_n - t) + o(|\bar{T}_n - t|) \\
&= \eta + \big(\ddot{A}(\eta)\big)^{-1}(\bar{T}_n - t) + o_{P_\eta}(n^{-1/2}),
\end{align*}
and hence
\[
\sqrt{n}(\hat{\eta}_n^{MLE} - \eta) = \big(\ddot{A}(\eta)\big)^{-1}\sqrt{n}(\bar{T}_n - t) + o_{P_\eta}(1) \xrightarrow{d} N\big(0, \mathrm{Var}_\eta T(X_1)^{-1}\big).
\]
The remaining part is obtained from $\bar{T}_n - \dot{A}(\eta) = \dot{l}_n(\eta)/n$ and $-\ddot{l}_n(\eta) = n\ddot{A}(\eta)$.
Remark 1.3.2 (Inverse Function Theorem). Let $F : U \to \mathbb{R}^d$, where $U \subseteq \mathbb{R}^d$ is an open set. Assume that
(i) $F$ is one-to-one,
(ii) $F$ is Fréchet differentiable near $x_0 \in U$,
(iii) $DF_{x_0} := \Big[\frac{\partial F_i}{\partial x_j}\Big|_{x=x_0}\Big]_{i,j}$ is invertible.
Then $F^{-1} : F(U) \to U$ is also Fréchet differentiable at $y_0 = F(x_0)$, and it satisfies $D(F^{-1})(y_0) = (DF_{x_0})^{-1}$.
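A concrete check (example mine, not from the lecture): for the invertible map $F(x_1, x_2) = (x_1 + x_2^3,\ x_2)$, the Jacobian of $F^{-1}$ at $y_0 = F(x_0)$ is the inverse of the Jacobian of $F$ at $x_0$.

```python
import numpy as np

# F and its explicit inverse.
def F(x):
    return np.array([x[0] + x[1]**3, x[1]])

def Finv(y):
    return np.array([y[0] - y[1]**3, y[1]])

x0 = np.array([0.5, 2.0])
y0 = F(x0)
DF = np.array([[1.0, 3 * x0[1]**2], [0.0, 1.0]])      # Jacobian of F at x0
DFinv = np.array([[1.0, -3 * y0[1]**2], [0.0, 1.0]])  # Jacobian of F^{-1} at y0
print(np.allclose(DFinv, np.linalg.inv(DF)))           # True
```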
Example 1.3.3 (Multinomial case). Let $X_1, \cdots, X_n$ be a random sample from $\mathrm{Multi}(1, p(\theta))$, where $p(\theta) = (p_1(\theta), \cdots, p_k(\theta))^\top$ and $\theta \in \Theta \subseteq \mathbb{R}$. For example, consider the Hardy-Weinberg proportions $p(\theta) = (\theta^2, 2\theta(1-\theta), (1-\theta)^2)^\top$, $0 < \theta < 1$. Assume that
(i) $\Theta$ is open and $0 < p_i(\theta) < 1$, $\sum_{i=1}^k p_i(\theta) = 1$.
(ii) $p(\theta) = (p_1(\theta), \cdots, p_k(\theta))^\top$ is twice (totally) differentiable.
Then we can derive the asymptotic distribution of an estimator of $\theta$. Write $\theta = h(p(\theta))$ for all $\theta \in \Theta$ for some differentiable function $h$, and let
\[
\hat{p} = \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i = \Big(\frac{N_1}{n}, \cdots, \frac{N_k}{n}\Big)^\top,
\]
where $N_j = \sum_{i=1}^n I(X_{ij} = 1)$. Then we get
\[
E_\theta\hat{p} = p(\theta)
\]
and hence
\[
h(\bar{X}_n) = h(p(\theta)) + \dot{h}(p(\theta))^\top(\bar{X}_n - p(\theta)) + o(|\bar{X}_n - p(\theta)|).
\]
Note that $Z_n := \sqrt{n}(\bar{X}_n - p(\theta)) = O_P(1)$, and it has the asymptotic distribution
\[
Z_n \xrightarrow{d} N(0, \Sigma(\theta)), \quad \Sigma(\theta) := \mathrm{diag}(p(\theta)) - p(\theta)p(\theta)^\top,
\]
and therefore we get
\[
\sqrt{n}(h(\bar{X}_n) - h(p(\theta))) = \dot{h}(p(\theta))^\top Z_n + o_P(1),
\]
which implies
\[
\sqrt{n}(h(\bar{X}_n) - h(p(\theta))) \xrightarrow{d} N(0, \sigma_h^2(\theta)), \quad \text{where } \sigma_h^2(\theta) = \dot{h}(p(\theta))^\top\Sigma(\theta)\dot{h}(p(\theta)).
\]
Furthermore, we can obtain that
\[
\sigma_h^2(\theta) \ge I_1^{-1}(\theta),
\]
where equality holds iff
\[
\dot{h}(p(\theta))^\top(X_1 - p(\theta)) = I_1^{-1}(\theta)\dot{l}_1(\theta).
\]
It can be shown as follows. First, note that
\[
\dot{h}(p(\theta))^\top\dot{p}(\theta) = 1.
\]
It implies that
\begin{align*}
1 &= \dot{h}(p(\theta))^\top\frac{\partial}{\partial\theta}E_\theta X_1 = \dot{h}(p(\theta))^\top\int\frac{\partial}{\partial\theta}xf(x;\theta)\,d\mu(x) = \dot{h}(p(\theta))^\top\int x\frac{\partial}{\partial\theta}f(x;\theta)\,d\mu(x) \\
&= \dot{h}(p(\theta))^\top\mathrm{Cov}_\theta\Big(X_1, \frac{\partial}{\partial\theta}l_1(\theta)\Big) \qquad \Big(\because E_\theta\frac{\partial}{\partial\theta}l_1(\theta) = 0\Big) \\
&= \mathrm{Cov}_\theta\Big(\dot{h}(p(\theta))^\top X_1, \frac{\partial}{\partial\theta}l_1(\theta)\Big) \\
&\le \sqrt{\mathrm{Var}_\theta\big(\dot{h}(p(\theta))^\top X_1\big)\mathrm{Var}_\theta\Big(\frac{\partial}{\partial\theta}l_1(\theta)\Big)} = \sqrt{\dot{h}(p(\theta))^\top\Sigma(\theta)\dot{h}(p(\theta))\cdot I_1(\theta)}
\end{align*}
holds. Here "=" holds when $\dot{h}(p(\theta))^\top X_1$ and $\partial l_1(\theta)/\partial\theta$ have a linear relationship, i.e.,
\[
\dot{h}(p(\theta))^\top(X_1 - p(\theta)) = I_1^{-1}(\theta)\dot{l}_1(\theta).
\]
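For the Hardy-Weinberg proportions, $h(p) = p_1 + p_2/2$ satisfies $h(p(\theta)) = \theta$, and one can check that $\sigma_h^2(\theta) = \theta(1-\theta)/2 = I_1^{-1}(\theta)$, so this $h$ attains the bound. A simulation sketch (code mine, not from the lecture):

```python
import numpy as np

# Hardy-Weinberg: theta_hat = h(p_hat) with h(p) = p1 + p2/2, i.e. the
# allele-frequency estimator (2*N1 + N2) / (2n). Its scaled variance should
# match the delta-method value theta(1-theta)/2, the information bound.
rng = np.random.default_rng(4)
theta, n, reps = 0.3, 200, 100_000
p = np.array([theta**2, 2*theta*(1-theta), (1-theta)**2])

counts = rng.multinomial(n, p, size=reps)           # (N1, N2, N3) per sample
theta_hat = (counts[:, 0] + counts[:, 1] / 2) / n   # h(p_hat)

var_emp = n * theta_hat.var()
var_delta = theta * (1 - theta) / 2                 # = I_1(theta)^{-1}
print(var_emp, var_delta)
```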
Remark 1.3.4. Actually, the previous example shows, more generally, how to deal with the asymptotic distribution of the FSE.
1.3.2 Asymptotic Normality of MCE
Our real goal of this section is right here.
Theorem 1.3.5 (Asymptotic Normality of MCE). Let $X_1, \cdots, X_n$ be a random sample from $P_\theta$, where $\theta \in \Theta$ and the parameter space $\Theta$ is open in $\mathbb{R}^k$. Let
\[
\hat{\theta}_n^{MCE} = \arg\min_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n\rho(X_i, \theta), \qquad \theta_0 = \arg\min_{\theta\in\Theta}E_{\theta_0}\rho(X_1, \theta).
\]
Under the assumption of their existence, let
\begin{align*}
\Psi_1(\theta) = \Psi(X_1, \theta) &= \frac{\partial}{\partial\theta}\rho(X_1, \theta), & \bar{\Psi}_n(\theta) &= \frac{1}{n}\sum_{i=1}^n\Psi(X_i, \theta), \\
\dot{\Psi}_1(\theta) &= \frac{\partial\Psi_1(\theta)}{\partial\theta}, & \dot{\bar{\Psi}}_n(\theta) &= \frac{\partial\bar{\Psi}_n(\theta)}{\partial\theta}, \\
\ddot{\Psi}_1(\theta) &= \frac{\partial^2\Psi_1(\theta)}{\partial\theta^2}, & \ddot{\bar{\Psi}}_n(\theta) &= \frac{\partial^2\bar{\Psi}_n(\theta)}{\partial\theta^2}.
\end{align*}
Assume that
(A0) $P_{\theta_0}\big(\bar{\Psi}_n(\hat{\theta}_n^{MCE}) = 0\big) \to 1$.
(A1) $E_{\theta_0}\Psi_1(\theta_0) = 0$.
(A2) $\mathrm{Var}_{\theta_0}(\Psi_1(\theta_0))$ exists.
(A3) $E_{\theta_0}(\dot{\Psi}_1(\theta_0))$ exists and is nonsingular.
(A4) $\exists\delta > 0$ and $\exists M(X_1) = M_{\theta_0,\delta}(X_1)$ s.t. $\max_{|\theta-\theta_0|\le\delta,\,\theta\in\Theta}|\ddot{\Psi}_1(\theta)| \le M(X_1)$, where $E_{\theta_0}M(X_1) < \infty$.
(A5) $\hat{\theta}_n^{MCE} \xrightarrow{P_{\theta_0}} \theta_0$.
Then
\[
\sqrt{n}(\hat{\theta}_n^{MCE} - \theta_0) \xrightarrow{d} N\Big(0, \big(-E_{\theta_0}\dot{\Psi}_1(\theta_0)\big)^{-1}\mathrm{Var}_{\theta_0}\Psi_1(\theta_0)\big(-E_{\theta_0}\dot{\Psi}_1(\theta_0)\big)^{-1}\Big).
\]
Remark 1.3.6 (Derivatives of a vector-valued map). For $F : \mathbb{R}^k \to \mathbb{R}^d$:
(1) (1st-order gradient)
\[
\frac{\partial F}{\partial x}(x_0) := DF(x_0) = \Big(\frac{\partial F}{\partial x_1}(x_0), \frac{\partial F}{\partial x_2}(x_0), \cdots, \frac{\partial F}{\partial x_k}(x_0)\Big) \in \mathbb{R}^{d\times k},
\]
where $\frac{\partial F}{\partial x_j}(x_0) = \big(\frac{\partial F_1}{\partial x_j}(x_0), \cdots, \frac{\partial F_d}{\partial x_j}(x_0)\big)^\top$ is a column vector. It can be interpreted as "a linear map."
(2) (2nd-order gradient)
\[
\frac{\partial^2 F}{\partial x^2}(x_0) := D^2F(x_0) = \Big(\frac{\partial}{\partial x_1}\frac{\partial F}{\partial x}(x_0), \cdots, \frac{\partial}{\partial x_k}\frac{\partial F}{\partial x}(x_0)\Big) \in \mathbb{R}^{d\times k\times k},
\]
where $\frac{\partial}{\partial x_i}\frac{\partial F}{\partial x}(x_0)$ is a $d\times k$ matrix with $\frac{\partial^2 F}{\partial x_i\partial x_j}(x_0)$ as its $j$th column. Note that it can be interpreted as "a bi-linear map."
(3) (Taylor expansion of a vector-valued map)
\[
F(x) \approx F(x_0) + \sum_{j=1}^k\frac{\partial F}{\partial x_j}(x_0)(x_j - x_{0j}) + \frac{1}{2}\sum_{i,j}\frac{\partial^2 F}{\partial x_i\partial x_j}(x_0)(x_i - x_{0i})(x_j - x_{0j})
= F(x_0) + \underbrace{DF(x_0)}_{\text{matrix}}\underbrace{(x - x_0)}_{\text{vector}} + \frac{1}{2}(x - x_0)^\top\underbrace{D^2F(x_0)}_{\text{3-array}}(x - x_0).
\]
Here "matrix $DF(x_0)$ $\times$ vector" becomes a vector, and the "quadratic form with the 3-array $D^2F(x_0)$" becomes vector-valued. In this view, $DF(x_0)$ and $D^2F(x_0)$ can be interpreted as a linear and a bi-linear map, respectively.
Proof. By Taylor’s theorem, ∃θ∗n in line(θ0, θ̂n) such that ‖θ∗n − θ0‖ ≤ ‖θ̂n − θ0‖ and
Ψn(θ̂n) = Ψn(θ0) + Ψ̇n(θ0)(θ̂n − θ0) +1
2(θ̂n − θ0)>Ψ̈n(θ∗n)(θ̂n − θ0).
Also note that, from (A4) and (A5),
limK→∞
supnPθ0
(Ψ̈n(θ
∗n) ≥ K
)= 0
holds, and hence
Ψ̈n(θ∗n) = OPθ0 (1),
i.e.,
Ψ̈n(θ∗n)(θ̂n − θ0) = oPθ0 (1).
(∵ (Motivation: since Ψ̈n is dominated by integrable majorant, it is “uniformly integrable,”
hence it is “tight.”) From
Pθ0
(∣∣∣Ψ̈n(θ∗n)∣∣∣ > K) = Pθ0 (∣∣∣Ψ̈n(θ∗n)∣∣∣ > K, |θ̂n − θ0| ≤ δ)+ Pθ0 (∣∣∣Ψ̈n(θ∗n)∣∣∣ > K, |θ̂n − θ0| > δ)≤ Pθ0
(∣∣∣Ψ̈n(θ∗n)∣∣∣ > K, |θ̂n − θ0| ≤ δ)+ Pθ0 (|θ̂n − θ0| > δ)≤
(A4)Pθ0 (M(X1) > K) + Pθ0
(|θ̂n − θ0| > δ
)︸ ︷︷ ︸
(A5)−−−→n→∞
0
,
for any � > 0 we get for large N
supn>N
Pθ0
(|Ψ̈n(θ∗n)| > K
)≤ 1KEθ0M(X1) + �,
which implies
limK→∞
supn>N
Pθ0
(|Ψ̈n(θ∗n)| > K
)= 0.
47
-
Theory of Statistics II J.P.Kim
(Let N be larger and take �↘ 0)) Thus, we get
\begin{align*}
\bar{\Psi}_n(\hat{\theta}_n) &= \bar{\Psi}_n(\theta_0) + \dot{\bar{\Psi}}_n(\theta_0)(\hat{\theta}_n - \theta_0) + \frac{1}{2}(\hat{\theta}_n - \theta_0)^\top\ddot{\bar{\Psi}}_n(\theta_n^*)(\hat{\theta}_n - \theta_0) \\
&= \bar{\Psi}_n(\theta_0) + \Big(\dot{\bar{\Psi}}_n(\theta_0) + \frac{1}{2}(\hat{\theta}_n - \theta_0)^\top\ddot{\bar{\Psi}}_n(\theta_n^*)\Big)(\hat{\theta}_n - \theta_0) \\
&= \bar{\Psi}_n(\theta_0) + \big(\dot{\bar{\Psi}}_n(\theta_0) + o_{P_{\theta_0}}(1)\big)(\hat{\theta}_n - \theta_0) \\
&= \bar{\Psi}_n(\theta_0) + \big(E_{\theta_0}\dot{\Psi}_1(\theta_0) + o_{P_{\theta_0}}(1)\big)(\hat{\theta}_n - \theta_0).
\end{align*}
Now, note that
(1) $P_{\theta_0}(\bar{\Psi}_n(\hat{\theta}_n) = 0) \to 1$;
(2) $E_{\theta_0}\dot{\Psi}_1(\theta_0) + o_{P_{\theta_0}}(1)$ is nonsingular with probability tending to 1 (see Remark 1.3.7);
(3) if $X_n = Y_n + O_P(a_n)$ on $E_n$ with $P(E_n) \to 1$, then $X_n = Y_n + O_P(a_n)$ on the whole space, and the same holds for $o_P$ (see Remark 1.2.7).
Thus, on a set with probability tending to 1,
\[
\hat{\theta}_n - \theta_0 = \big(-E_{\theta_0}\dot{\Psi}_1(\theta_0) + o_{P_{\theta_0}}(1)\big)^{-1}\bar{\Psi}_n(\theta_0) = \Big[\big(-E_{\theta_0}\dot{\Psi}_1(\theta_0)\big)^{-1} + o_{P_{\theta_0}}(1)\Big]\bar{\Psi}_n(\theta_0)
\]
holds, which yields
\[
\sqrt{n}(\hat{\theta}_n - \theta_0) = \big(-E_{\theta_0}\dot{\Psi}_1(\theta_0)\big)^{-1}\underbrace{\sqrt{n}\,\bar{\Psi}_n(\theta_0)}_{O_{P_{\theta_0}}(1)} + o_{P_{\theta_0}}(1)
\]
and therefore
\[
\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\Big(0, \big(-E_{\theta_0}\dot{\Psi}_1(\theta_0)\big)^{-1}\mathrm{Var}_{\theta_0}\Psi_1(\theta_0)\big(-E_{\theta_0}\dot{\Psi}_1(\theta_0)\big)^{-1}\Big).
\]
Remark 1.3.7. If $A \in \mathbb{R}^{d\times d}$ is a symmetric positive definite matrix, then for a small perturbation $\Delta$ with $\|\Delta\|_2 < \sigma_{\min}(A)$, $\mathrm{rank}(A + \Delta) = d$. Note that $\mathrm{rank}(A) = d$. Here $\sigma_{\min}(A)$ denotes the smallest eigenvalue of $A$, and $\|\cdot\|_p$ is the matrix norm induced by the corresponding $L^p$ vector norm, i.e.,
\[
\|\Delta\|_p = \sup_{x:\|x\|_p=1}\|\Delta x\|_p.
\]
Proof. (Motivation: for $c \approx 0$ and $x \neq 0$,
\[
\frac{1}{x+c} = \frac{1}{x} - \frac{c}{x^2} + \frac{c^2}{x^3} - \cdots
\]
exists; or, for small vectors $u, v$, $A + uv^\top$ is nonsingular if $A$ is invertible, from
\[
(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1}uv^\top A^{-1}}{1 + v^\top A^{-1}u}.)
\]
Suppose $\mathrm{rank}(A + \Delta) < d$. Then $\exists x_0 \neq 0$ s.t. $(A + \Delta)x_0 = 0$ and $\|x_0\|_2 = 1$. Then, by the definition of the matrix norm,
\[
\|\Delta\|_2 \ge \|\Delta x_0\|_2 = \|Ax_0\|_2 \ge \sigma_{\min}(A)
\]
holds, the last inequality following from the spectral theorem. This contradicts the assumption that $\Delta$ is small.
1.3.3 Asymptotic Normality and Efficiency of MLE

Note that the MLE is just a special case of the MCE.

Theorem 1.3.8. Let $X_1, \cdots, X_n$ be a random sample from $P_\theta$, where $\theta \in \Theta$ and the parameter space $\Theta$ is open in $\mathbb{R}^k$. Recall that the MLE is an MCE with
\[
\rho(x, \theta) = -\log p_\theta(x), \quad p_\theta: \text{pdf of } P_\theta.
\]
Under the assumption of their existence, denote
\begin{align*}
\dot{l}_n(\theta) &= \sum_{i=1}^n\frac{\partial}{\partial\theta}\log p_\theta(X_i), & \dot{\bar{l}}_n(\theta) &= \frac{1}{n}\dot{l}_n(\theta), \\
\ddot{l}_n(\theta) &= \sum_{i=1}^n\frac{\partial^2}{\partial\theta^2}\log p_\theta(X_i), & \ddot{\bar{l}}_n(\theta) &= \frac{1}{n}\ddot{l}_n(\theta), \\
\dddot{l}_n(\theta) &= \sum_{i=1}^n\frac{\partial^3}{\partial\theta^3}\log p_\theta(X_i), & \dddot{\bar{l}}_n(\theta) &= \frac{1}{n}\dddot{l}_n(\theta).
\end{align*}
Also assume that
(M0) $P_{\theta_0}\big(\dot{\bar{l}}_n(\hat{\theta}_n^{MLE}) = 0\big) \to 1$.
(M1) $E_{\theta_0}\dot{l}_1(\theta_0) = 0$.
(M2) $I(\theta_0) = \mathrm{Var}_{\theta_0}(\dot{l}_1(\theta_0))$ exists.
(M3) $E_{\theta_0}(\ddot{l}_1(\theta_0))$ exists and is nonsingular.
(M4) $\exists\delta_0 > 0$ and $\exists M(X_1) = M_{\theta_0,\delta_0}(X_1)$ s.t. $\max_{|\theta-\theta_0|\le\delta_0,\,\theta\in\Theta}|\dddot{l}_1(\theta)| \le M(X_1)$, where $E_{\theta_0}M(X_1) < \infty$.
(M5) $\hat{\theta}_n^{MLE} \xrightarrow{P_{\theta_0}} \theta_0$, and $-E_{\theta_0}\ddot{l}_1(\theta_0) = I_1(\theta_0)$.
Then
\[
\sqrt{n}(\hat{\theta}_n^{MLE} - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1}), \qquad \sqrt{n}(\hat{\theta}_n^{MCE} - \theta_0) \xrightarrow{d} N(0, \Sigma_\Psi(\theta_0)),
\]
with $\Sigma_\Psi(\theta_0) - I(\theta_0)^{-1}$ being nonnegative definite (i.e., "$\Sigma_\Psi(\theta_0) \ge I(\theta_0)^{-1}$"), and "=" holds if and only if
\[
\Psi_1(\theta_0) = (-E_{\theta_0}\dot{\Psi}_1(\theta_0))I_1(\theta_0)^{-1}\dot{l}_1(\theta_0). \quad \text{("Determining contrast function")}
\]
Proof. By the Cauchy-Schwarz inequality,
\begin{align*}
\mathrm{Corr}(\lambda^\top\dot{l}_1(\theta_0), \gamma^\top\Psi_1(\theta_0)) &= \frac{\lambda^\top(-E_{\theta_0}\dot{\Psi}_1(\theta_0))\gamma}{\sqrt{\lambda^\top I_1(\theta_0)\lambda}\sqrt{\gamma^\top\mathrm{Var}_{\theta_0}\Psi_1(\theta_0)\gamma}} \\
&= \frac{\lambda^\top I_1^{1/2}(\theta_0)\cdot I_1^{-1/2}(\theta_0)(-E_{\theta_0}\dot{\Psi}_1(\theta_0))\gamma}{\sqrt{\lambda^\top I_1(\theta_0)\lambda}\sqrt{\gamma^\top\mathrm{Var}_{\theta_0}\Psi_1(\theta_0)\gamma}} \\
&\le \frac{\sqrt{\lambda^\top I_1(\theta_0)\lambda}\sqrt{\gamma^\top(-E_{\theta_0}\dot{\Psi}_1(\theta_0))I_1^{-1}(\theta_0)(-E_{\theta_0}\dot{\Psi}_1(\theta_0))\gamma}}{\sqrt{\lambda^\top I_1(\theta_0)\lambda}\sqrt{\gamma^\top\mathrm{Var}_{\theta_0}\Psi_1(\theta_0)\gamma}} \\
&= \frac{\sqrt{\gamma^\top(-E_{\theta_0}\dot{\Psi}_1(\theta_0))I_1^{-1}(\theta_0)(-E_{\theta_0}\dot{\Psi}_1(\theta_0))\gamma}}{\sqrt{\gamma^\top\mathrm{Var}_{\theta_0}\Psi_1(\theta_0)\gamma}}
\end{align*}
holds, and hence
\[
\max_{\lambda\neq 0}\mathrm{Corr}(\lambda^\top\dot{l}_1(\theta_0), \gamma^\top\Psi_1(\theta_0)) = \frac{\sqrt{\gamma^\top(-E_{\theta_0}\dot{\Psi}_1(\theta_0))I_1^{-1}(\theta_0)(-E_{\theta_0}\dot{\Psi}_1(\theta_0))\gamma}}{\sqrt{\gamma^\top\mathrm{Var}_{\theta_0}\Psi_1(\theta_0)\gamma}}
\]
is obtained (we can find $\lambda$ that achieves the maximum). Since a correlation coefficient is at most 1, we get
\[
\gamma^\top(-E_{\theta_0}\dot{\Psi}_1(\theta_0))I_1^{-1}(\theta_0)(-E_{\theta_0}\dot{\Psi}_1(\theta_0))\gamma \le \gamma^\top\mathrm{Var}_{\theta_0}\Psi_1(\theta_0)\gamma
\]
for any $\gamma$. Using $(-E_{\theta_0}\dot{\Psi}_1(\theta_0))^{-1}\gamma$ in place of $\gamma$, we obtain
\[
\gamma^\top I_1^{-1}(\theta_0)\gamma \le \gamma^\top(-E_{\theta_0}\dot{\Psi}_1(\theta_0))^{-1}\mathrm{Var}_{\theta_0}\Psi_1(\theta_0)(-E_{\theta_0}\dot{\Psi}_1(\theta_0))^{-1}\gamma,
\]
i.e.,
\[
I_1^{-1}(\theta_0) \le (-E_{\theta_0}\dot{\Psi}_1(\theta_0))^{-1}\mathrm{Var}_{\theta_0}\Psi_1(\theta_0)(-E_{\theta_0}\dot{\Psi}_1(\theta_0))^{-1} = \Sigma_\Psi(\theta_0).
\]
Equality holds iff $\gamma^\top\Psi_1(\theta_0) = \gamma^\top(-E_{\theta_0}\dot{\Psi}_1(\theta_0))I_1^{-1}(\theta_0)\dot{l}_1(\theta_0)$ for all $\gamma$, i.e.,
\[
\Psi_1(\theta_0) = (-E_{\theta_0}\dot{\Psi}_1(\theta_0))I_1^{-1}(\theta_0)\dot{l}_1(\theta_0).
\]
Remark 1.3.10. The claim that the MLE has smaller asymptotic variance than other asymptotically normal estimators was known as Fisher's conjecture. It is true for a certain class of estimators in a regular parametric model. The essential property for such a comparison is "uniform" convergence to the normal distribution, as can be seen in the following example.
Example 1.3.11 (Hodges' example: "superefficient estimator"). Let $\bar{X}$ be the sample mean in $N(\theta, 1)$. Let
\[
\hat{\theta}_n^s = \bar{X}\,I(|\bar{X}| > n^{-1/4}).
\]
(Actually, not only $1/4$: any positive number less than $1/2$ works.)
If $\theta \neq 0$, then
\begin{align*}
P_\theta\big(\hat{\theta}_n^s = \bar{X}\big) &= 1 - P_\theta(|\bar{X}| \le n^{-1/4}) = 1 - P_\theta(-n^{-1/4} \le \bar{X} \le n^{-1/4}) \\
&= 1 - P_\theta\big(-\sqrt{n}\theta - n^{1/4} \le \sqrt{n}(\bar{X} - \theta) \le -\sqrt{n}\theta + n^{1/4}\big) \\
&= 1 - \Phi\big(-\sqrt{n}\theta + n^{1/4}\big) + \Phi\big(-\sqrt{n}\theta - n^{1/4}\big) \longrightarrow 1,
\end{align*}
since for $\theta > 0$ both $\Phi$ terms tend to 0, and for $\theta < 0$ both tend to 1. Thus,
\begin{align*}
P_\theta\big(\sqrt{n}(\hat{\theta}_n^s - \theta) \le x\big) &= P_\theta\big(\sqrt{n}(\hat{\theta}_n^s - \theta) \le x, \hat{\theta}_n^s = \bar{X}\big) + P_\theta\big(\sqrt{n}(\hat{\theta}_n^s - \theta) \le x, \hat{\theta}_n^s \neq \bar{X}\big) \\
&= P_\theta\big(\sqrt{n}(\bar{X} - \theta) \le x, \hat{\theta}_n^s = \bar{X}\big) + P_\theta\big(\sqrt{n}(\hat{\theta}_n^s - \theta) \le x, \hat{\theta}_n^s \neq \bar{X}\big) \longrightarrow \Phi(x)
\end{align*}
holds, i.e., $\sqrt{n}(\hat{\theta}_n^s - \theta) \xrightarrow{d} N(0,1)$ under $P_\theta$.
Now assume $\theta = 0$. Then $\sqrt{n}\bar{X} \sim N(0,1)$, so
\[
P_\theta(\hat{\theta}_n^s = 0) = P_\theta(|\bar{X}| \le n^{-1/4}) = \Phi(n^{1/4}) - \Phi(-n^{1/4}) \longrightarrow 1.
\]
Then we obtain $\lim_{n\to\infty}P_\theta\big(\hat{\theta}_n^s = 0\big) = 1$, and therefore
\[
\lim_{n\to\infty}P_\theta\big(\sqrt{n}(\hat{\theta}_n^s - \theta) \le x\big) = \lim_{n\to\infty}P_\theta(0 \le x) = \begin{cases}1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0\end{cases} = I_{[0,\infty)}(x),
\]
which yields $\sqrt{n}(\hat{\theta}_n^s - \theta) \xrightarrow{d} N(0,0)$ under $P_\theta$.
Thus, we get
\[
\sqrt{n}(\hat{\theta}_n^s - \theta) \xrightarrow{d} N(0, \sigma_s^2(\theta)) \text{ under } P_\theta, \quad \text{where } \sigma_s^2(\theta) = \begin{cases}1 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0.\end{cases}
\]
It implies that there is an estimator "superefficient" relative to the "optimal" one with variance $I(\theta)^{-1}$, i.e., Fisher's conjecture is wrong!
Remark 1.3.12. Note that Fisher's conjecture is wrong in general, but it is "partially correct" in a "regular" model; regularity here means "uniform convergence" in the CLT. First note that
\[
\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d\,(P_\theta)} N(0, \nu(\theta))
\]
only means "pointwise" convergence for each $\theta$. It is equivalent to: for some metric $d$, and the law (distribution function) $\mathcal{L}_\theta$ under $P_\theta$,
\[
d\Big(\mathcal{L}_\theta\big(\sqrt{n}(\hat{\theta}_n - \theta)\big), N(0, \nu(\theta))\Big) \longrightarrow 0.
\]
A "regular estimator," on the other hand, requires this convergence to be uniform over a neighborhood of $\theta_0$:
\[
\sup_{\theta:|\theta - \theta_0| \le \delta} d\Big(\mathcal{L}_\theta\big(\sqrt{n}(\hat{\theta}_n - \theta)\big), N(0, \nu(\theta))\Big) \longrightarrow 0 \quad \text{for some } \delta > 0.
\]
Hodges' estimator fails this uniformity near $\theta_0 = 0$.
Example 1.3.13 (Linear model with stochastic covariates). Consider the model
\[
Y \mid Z = z \sim N(\alpha + z^\top\beta, \sigma^2), \quad \alpha \in \mathbb{R}, \beta \in \mathbb{R}^k, \sigma^2 > 0,
\]
where $E(Z) = 0$ and $\mathrm{Var}(Z)$ is non-singular. Assume that there are $n$ independent copies of $(Y, Z)$, and denote $Z_{(n)} = (Z_1, \cdots, Z_n)^\top$. Then, as long as the distribution of $Z$ does not depend on $\theta = (\alpha, \beta, \sigma^2)$, the MLE of $\beta$ is equivalent to the least squares estimator, from
\[
l_n(\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^n(Y_i - \alpha - z_i^\top\beta)^2 - \frac{n}{2}\log 2\pi\sigma^2 + \sum_{i=1}^n\log\mathrm{pdf}_Z(z_i).
\]
In other words,
\begin{align*}
\hat{\beta}^{MLE} &= (\tilde{Z}_{(n)}^\top\tilde{Z}_{(n)})^{-1}\tilde{Z}_{(n)}^\top Y, \quad \tilde{Z}_{(n)} = (Z_1 - \bar{Z}, \cdots, Z_n - \bar{Z})^\top \\
\hat{\alpha}^{MLE} &= \bar{Y} - \bar{Z}^\top\hat{\beta}^{MLE} \\
\hat{\sigma}^2_{MLE} &= \frac{1}{n}\|Y - (\mathbf{1}\hat{\alpha}^{MLE} + Z_{(n)}\hat{\beta}^{MLE})\|^2.
\end{align*}
It also says that
\[
l_1(\theta) = -\frac{1}{2\sigma^2}(Y_1 - \alpha - z_1^\top\beta)^2 - \frac{1}{2}\log 2\pi\sigma^2 + \log\mathrm{pdf}_Z(z_1)
\]
and hence
\[
\dot{l}_1(\theta) = \Big(\frac{\epsilon_1}{\sigma^2},\; z_1^\top\frac{\epsilon_1}{\sigma^2},\; \frac{\epsilon_1^2}{2\sigma^4} - \frac{1}{2\sigma^2}\Big)^\top, \quad \text{where } \epsilon_1 = Y_1 - \alpha - z_1^\top\beta.
\]
Thus
\[
I(\theta) = \begin{pmatrix}\frac{1}{\sigma^2} & & \\ & \frac{1}{\sigma^2}E(Z_1Z_1^\top) & \\ & & \frac{1}{2\sigma^4}\end{pmatrix}.
\]
(This is easily obtained from $\epsilon_1 \mid z_1 \sim N(0, \sigma^2)$. Note that
\begin{align*}
\mathrm{Var}(\epsilon_1) &= E\,\mathrm{Var}(\epsilon_1 \mid z_1) + \mathrm{Var}\,E(\epsilon_1 \mid z_1) = \sigma^2, \\
\mathrm{Var}(z_1\epsilon_1) &= E\,\mathrm{Var}(z_1\epsilon_1 \mid z_1) + \mathrm{Var}\,E(z_1\epsilon_1 \mid z_1) = E(z_1\sigma^2 z_1^\top) + \mathrm{Var}(z_1\cdot 0) = \sigma^2E(z_1z_1^\top),
\end{align*}
and similarly we can obtain $\mathrm{Var}(\epsilon_1^2)$.) It implies that
\[
\sqrt{n}(\hat{\theta}_n^{MLE} - \theta) \xrightarrow{d} N(0, \Sigma), \quad \text{where } \Sigma = I(\theta)^{-1} = \begin{pmatrix}\sigma^2 & & \\ & \sigma^2(E(Z_1Z_1^\top))^{-1} & \\ & & 2\sigma^4\end{pmatrix},
\]
provided that $E(Z_1Z_1^\top)$, or equivalently $\mathrm{Var}(Z_1)$, is non-singular.
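A numeric sketch (code mine, not from the lecture) that the Gaussian MLE of $(\alpha, \beta)$ coincides with least squares on centered covariates:

```python
import numpy as np

# MLE via the closed forms in the example vs. a generic least-squares fit.
rng = np.random.default_rng(6)
n, k = 500, 2
alpha, beta, sigma = 1.0, np.array([2.0, -1.0]), 0.5
Z = rng.standard_normal((n, k))
Y = alpha + Z @ beta + sigma * rng.standard_normal(n)

Zt = Z - Z.mean(axis=0)                              # centered covariates
beta_hat = np.linalg.solve(Zt.T @ Zt, Zt.T @ Y)      # (Z~'Z~)^{-1} Z~'Y
alpha_hat = Y.mean() - Z.mean(axis=0) @ beta_hat

# Same answer from least squares with an intercept column:
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), Z]), Y, rcond=None)
print(np.allclose(coef, np.r_[alpha_hat, beta_hat]))  # True
```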
1.3.4 Asymptotic Null distribution of LRT

Consider a random sample $X_1, \cdots, X_n$ from $P_\theta$, where $\theta \in \Theta \subseteq \mathbb{R}^k$. Denote $\theta = (\xi^\top, \eta^\top)^\top$, where $\eta \in \mathbb{R}^{k_0}$. We wish to test
\[
H_0 : \xi = \xi_0 \quad \text{vs} \quad H_1 : \xi \neq \xi_0.
\]
(Note that it is a composite null!) Let $\Theta$ be a $k$-dimensional open set, and let
\[
\Theta_0 = \{(\xi_0^\top, \eta^\top)^\top : (\xi_0^\top, \eta^\top)^\top \in \Theta\}
\]
be a $k_0$-dimensional open set. Now denote
\[
\dot{l}(\theta) = \begin{pmatrix}\dot{l}_1(\theta) \\ \dot{l}_2(\theta)\end{pmatrix}, \quad I(\theta) = \begin{pmatrix}I_{11}(\theta) & I_{12}(\theta) \\ I_{21}(\theta) & I_{22}(\theta)\end{pmatrix},
\]
where
\[
\dot{l}_1(\theta) = \frac{\partial}{\partial\xi}l(\theta), \quad \dot{l}_2(\theta) = \frac{\partial}{\partial\eta}l(\theta).
\]

Theorem 1.3.14 (Asymptotic null distribution of LRT). Assume (M0) $\sim$ (M6). Then, under $H_0 : \xi = \xi_0$,
\[
2\big(l(\hat{\theta}_n) - l(\hat{\theta}_n^0)\big) \xrightarrow{d} \chi^2(k - k_0),
\]
where
\[
\hat{\theta}_n = \arg\max_{\theta\in\Theta}l(\theta), \quad \hat{\theta}_n^0 = \arg\max_{\theta\in\Theta_0}l(\theta).
\]
Proof. Recall that
\[
l(\theta_0) = l(\hat{\theta}_n) + \frac{1}{2}\sqrt{n}(\hat{\theta}_n - \theta_0)^\top\Big[\frac{1}{n}\ddot{l}(\theta_0) + o_{P_{\theta_0}}(1)\Big]\sqrt{n}(\hat{\theta}_n - \theta_0)
\]
holds. Since $\sqrt{n}(\hat{\theta}_n - \theta_0) = O_{P_{\theta_0}}(1)$, and $O_{P_{\theta_0}}(1)\cdot o_{P_{\theta_0}}(1) = o_{P_{\theta_0}}(1)$, we get
\[
2\big(l(\hat{\theta}_n) - l(\theta_0)\big) = \sqrt{n}(\hat{\theta}_n - \theta_0)^\top I(\theta_0)\sqrt{n}(\hat{\theta}_n - \theta_0) + o_{P_{\theta_0}}(1)
\]
from
\[
-\frac{1}{n}\ddot{l}(\theta_0) \xrightarrow{P_{\theta_0}} I(\theta_0).
\]
Also recall that
\[
\sqrt{n}(\hat{\theta}_n - \theta_0) = I(\theta_0)^{-1}\sqrt{n}\,\dot{l}(\theta_0) + o_{P_{\theta_0}}(1).
\]
It implies that
\[
2\big(l(\hat{\theta}_n) - l(\theta_0)\big) = \sqrt{n}\,\dot{l}(\theta_0)^\top I(\theta_0)^{-1}\sqrt{n}\,\dot{l}(\theta_0) + o_{P_{\theta_0}}(1).
\]
Similarly, we get
\[
2\big(l(\hat{\theta}_n^0) - l(\theta_0)\big) = \sqrt{n}(\hat{\eta}_n^0 - \eta_0)^\top I_{22}(\theta_0)\sqrt{n}(\hat{\eta}_n^0 - \eta_0) + o_{P_{\theta_0}}(1)
\]
and
\[
\sqrt{n}(\hat{\eta}_n^0 - \eta_0) = I_{22}(\theta_0)^{-1}\sqrt{n}\,\dot{l}_2(\theta_0) + o_{P_{\theta_0}}(1)
\]
under $H_0$, so we get
\[
2\big(l(\hat{\theta}_n^0) - l(\theta_0)\big) = \sqrt{n}\,\dot{l}_2(\theta_0)^\top I_{22}(\theta_0)^{-1}\sqrt{n}\,\dot{l}_2(\theta_0) + o_{P_{\theta_0}}(1)
\]
under $H_0$. Thus, we get
\[
2\big(l(\hat{\theta}_n) - l(\hat{\theta}_n^0)\big) = \sqrt{n}\,\dot{l}(\theta_0)^\top I(\theta_0)^{-1}\sqrt{n}\,\dot{l}(\theta_0) - \sqrt{n}\,\dot{l}_2(\theta_0)^\top I_{22}(\theta_0)^{-1}\sqrt{n}\,\dot{l}_2(\theta_0) + o_{P_{\theta_0}}(1)
\]
under $H_0$. Now note that
\[
I(\theta_0)^{-1} = \begin{pmatrix}J_1 & 0 \\ -I_{22}^{-1}(\theta_0)I_{21}(\theta_0) & J_2\end{pmatrix}\begin{pmatrix}I_{11\cdot2}^{-1}(\theta_0) & 0 \\ 0 & I_{22}^{-1}(\theta_0)\end{pmatrix}\begin{pmatrix}J_1 & -I_{12}(\theta_0)I_{22}^{-1}(\theta_0) \\ 0 & J_2\end{pmatrix}
\]
holds, for identity matrices $J_1$ and $J_2$ of suitable sizes. Then
\[
\dot{l}(\theta_0)^\top I(\theta_0)^{-1}\dot{l}(\theta_0) = \begin{pmatrix}\dot{l}_1(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0) \\ \dot{l}_2(\theta_0)\end{pmatrix}^\top\begin{pmatrix}I_{11\cdot2}^{-1}(\theta_0) & 0 \\ 0 & I_{22}^{-1}(\theta_0)\end{pmatrix}\begin{pmatrix}\dot{l}_1(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0) \\ \dot{l}_2(\theta_0)\end{pmatrix}
\]
holds, so we get
\[
2\big(l(\hat{\theta}_n) - l(\hat{\theta}_n^0)\big) = \sqrt{n}\big(\dot{l}_1(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0)\big)^\top I_{11\cdot2}^{-1}(\theta_0)\sqrt{n}\big(\dot{l}_1(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0)\big) + o_{P_{\theta_0}}(1). \tag{1.6}
\]
Now by the CLT,
\begin{align*}
\sqrt{n}\big(\dot{l}_1(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0)\big) &\xrightarrow{d} \begin{pmatrix}J_1 & -I_{12}(\theta_0)I_{22}^{-1}(\theta_0)\end{pmatrix}N(0, I(\theta_0)) \\
&= N\big(0, I_{11}(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}(\theta_0)I_{21}(\theta_0)\big) = N(0, I_{11\cdot2}(\theta_0)),
\end{align*}
and hence
\[
2\big(l(\hat{\theta}_n) - l(\hat{\theta}_n^0)\big) \xrightarrow{d} \chi^2(k - k_0).
\]
Remark 1.3.15. (i) The extension to a null hypothesis
\[
H_0 : g_1(\theta) = 0, \cdots, g_{k_1}(\theta) = 0
\]
is rather trivial, by considering a smooth reparametrization
\[
\xi_1 = g_1(\theta), \cdots, \xi_{k_1} = g_{k_1}(\theta), \quad \eta_1 = g_{k_1+1}(\theta), \cdots, \eta_{k_0} = g_k(\theta),
\]
whenever such $g_j(\theta)$ ($j = k_1+1, \cdots, k$) can be found.
(ii) A large-sample confidence set based on the maximum likelihood is obtained by duality.

Remark 1.3.16. Note that to perform the LRT we should find the MLE on both $\Theta$ and $\Theta_0$; however, this may not be easy. Thus our interest is to find asymptotically equivalent tests.
1.3.5 Asymptotic Equivalents of LRT

Let $X_1, \cdots, X_n$ be a random sample from $P_\theta$, where $\theta \in \Theta \subseteq \mathbb{R}^k$. For $\theta = (\xi^\top, \eta^\top)^\top \in \Theta$, we again wish to test
\[
H_0 : \xi = \xi_0 \quad \text{vs} \quad H_1 : \xi \neq \xi_0.
\]
We assume the regularity conditions and, in addition, that $I(\theta)$ is continuous. Also, we denote the true parameter under $H_0$ by $\theta_0 = (\xi_0^\top, \eta_0^\top)^\top$.

Lemma 1.3.17. (a) From
\[
\sqrt{n}(\hat{\theta}_n^{MLE} - \theta_0) = I(\theta_0)^{-1}\sqrt{n}\,\dot{l}(\theta_0) + o_{P_{\theta_0}}(1),
\]
we get
\begin{align*}
I_{11}(\theta_0)(\hat{\xi}_n - \xi_0) + I_{12}(\theta_0)(\hat{\eta}_n - \eta_0) &= \dot{l}_1(\theta_0) + o_{P_{\theta_0}}(n^{-1/2}), \\
I_{21}(\theta_0)(\hat{\xi}_n - \xi_0) + I_{22}(\theta_0)(\hat{\eta}_n - \eta_0) &= \dot{l}_2(\theta_0) + o_{P_{\theta_0}}(n^{-1/2}).
\end{align*}
(b) $\sqrt{n}(\hat{\eta}_n^0 - \eta_0) = I_{22}^{-1}(\theta_0)\sqrt{n}\,\dot{l}_2(\theta_0) + o_{P_{\theta_0}}(1)$, i.e., $\hat{\eta}_n^0 - \eta_0 = I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0) + o_{P_{\theta_0}}(n^{-1/2})$.
(c) $2(l(\hat{\theta}_n) - l(\hat{\theta}_n^0)) = \sqrt{n}\big(\dot{l}_1(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0)\big)^\top I_{11\cdot2}^{-1}(\theta_0)\sqrt{n}\big(\dot{l}_1(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0)\big) + o_{P_{\theta_0}}(1)$.
Theorem 1.3.18 (Wald's test). Let
\[
W_n = \sqrt{n}(\hat{\theta}_n - \hat{\theta}_n^0)^\top I(\hat{\theta}_n^0)\sqrt{n}(\hat{\theta}_n - \hat{\theta}_n^0).
\]
Then
\begin{align*}
W_n &= \sqrt{n}(\hat{\xi}_n - \xi_0)^\top I_{11\cdot2}(\theta_0)\sqrt{n}(\hat{\xi}_n - \xi_0) + o_{P_{\theta_0}}(1) \\
&= 2\big(l(\hat{\theta}_n) - l(\hat{\theta}_n^0)\big) + o_{P_{\theta_0}}(1).
\end{align*}
Proof. First note that, by the continuity of $I$,
\begin{align*}
W_n &= n(\hat{\theta}_n - \hat{\theta}_n^0)^\top\big(I(\theta_0) + o_{P_{\theta_0}}(1)\big)(\hat{\theta}_n - \hat{\theta}_n^0) \\
&= n(\hat{\theta}_n - \hat{\theta}_n^0)^\top I(\theta_0)(\hat{\theta}_n - \hat{\theta}_n^0) + o_{P_{\theta_0}}(1).
\end{align*}
Then, we get
\begin{align*}
W_n &= n(\hat{\theta}_n - \hat{\theta}_n^0)^\top I(\theta_0)(\hat{\theta}_n - \hat{\theta}_n^0) + o_{P_{\theta_0}}(1) \\
&= n\Big[(\hat{\xi}_n - \xi_0)^\top I_{11}(\theta_0)(\hat{\xi}_n - \xi_0) + 2(\hat{\xi}_n - \xi_0)^\top I_{12}(\theta_0)(\hat{\eta}_n - \hat{\eta}_n^0) + (\hat{\eta}_n - \hat{\eta}_n^0)^\top I_{22}(\theta_0)(\hat{\eta}_n - \hat{\eta}_n^0)\Big] + o_{P_{\theta_0}}(1) \\
&= n\Big[(\hat{\xi}_n - \xi_0)^\top\big(I_{11}(\theta_0)(\hat{\xi}_n - \xi_0) + I_{12}(\theta_0)(\hat{\eta}_n - \hat{\eta}_n^0)\big) + \big((\hat{\xi}_n - \xi_0)^\top I_{12}(\theta_0) + (\hat{\eta}_n - \hat{\eta}_n^0)^\top I_{22}(\theta_0)\big)(\hat{\eta}_n - \hat{\eta}_n^0)\Big] + o_{P_{\theta_0}}(1),
\end{align*}
and from
\begin{align*}
I_{11}(\theta_0)(\hat{\xi}_n - \xi_0) + I_{12}(\theta_0)(\hat{\eta}_n - \hat{\eta}_n^0) &= I_{11}(\theta_0)(\hat{\xi}_n - \xi_0) + I_{12}(\theta_0)(\hat{\eta}_n - \eta_0) - I_{12}(\theta_0)(\hat{\eta}_n^0 - \eta_0) \\
&= \dot{l}_1(\theta_0) - I_{12}(\theta_0)(\hat{\eta}_n^0 - \eta_0) + o_{P_{\theta_0}}(n^{-1/2}) \\
&= \dot{l}_1(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0) + o_{P_{\theta_0}}(n^{-1/2})
\end{align*}
and
\begin{align*}
(\hat{\xi}_n - \xi_0)^\top I_{12}(\theta_0) + (\hat{\eta}_n - \hat{\eta}_n^0)^\top I_{22}(\theta_0) &= (\hat{\xi}_n - \xi_0)^\top I_{12}(\theta_0) + (\hat{\eta}_n - \eta_0)^\top I_{22}(\theta_0) - (\hat{\eta}_n^0 - \eta_0)^\top I_{22}(\theta_0) \\
&= \big(\underbrace{\dot{l}_2(\theta_0) - I_{22}(\theta_0)I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0)}_{=0}\big)^\top + o_{P_{\theta_0}}(n^{-1/2}),
\end{align*}
together with $\hat{\xi}_n - \xi_0 = O_{P_{\theta_0}}(n^{-1/2})$ and $\hat{\eta}_n - \hat{\eta}_n^0 = O_{P_{\theta_0}}(n^{-1/2})$, we get
\[
W_n = n(\hat{\xi}_n - \xi_0)^\top\big(\dot{l}_1(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0)\big) + o_{P_{\theta_0}}(1).
\]
Now by part (a) of the lemma, we get
\[
\hat{\xi}_n - \xi_0 = I_{11\cdot2}^{-1}(\theta_0)\big(\dot{l}_1(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0)\big) + o_{P_{\theta_0}}(n^{-1/2}),
\]
and hence
\[
W_n = n(\hat{\xi}_n - \xi_0)^\top I_{11\cdot2}(\theta_0)(\hat{\xi}_n - \xi_0) + o_{P_{\theta_0}}(1).
\]
Using part (a) again, we can obtain
\[
W_n = 2\big(l(\hat{\theta}_n) - l(\hat{\theta}_n^0)\big) + o_{P_{\theta_0}}(1)
\]
by (1.6).
Remark 1.3.19. The leading term in the previous theorem depends on the unknown $\eta_0$ through $I_{11\cdot2}(\theta_0)$, so in practice it is estimated as
\[
n(\hat{\xi}_n - \xi_0)^\top I_{11\cdot2}(\hat{\theta}_n^0)(\hat{\xi}_n - \xi_0) \quad \text{("estimated Wald statistic"),}
\]
which is also the leading term of the LRT.
Theorem 1.3.20 (Rao's test; score test statistic). Let
\[
R_n = n\,\dot{l}(\hat{\theta}_n^0)^\top\widehat{I(\theta_0)}^{-1}\dot{l}(\hat{\theta}_n^0),
\]
where
\[
\widehat{I(\theta_0)} = -\ddot{l}(\hat{\theta}_n^0)/n \quad \text{or} \quad I(\hat{\theta}_n^0),
\]
i.e., the "observed information." Define
\[
\widehat{I_{11\cdot2}(\theta_0)} = \widehat{I_{11}(\theta_0)} - \widehat{I_{12}(\theta_0)}\,\widehat{I_{22}(\theta_0)}^{-1}\widehat{I_{21}(\theta_0)},
\]
just as
\[
I_{11\cdot2}(\theta_0) = I_{11}(\theta_0) - I_{12}(\theta_0)I_{22}(\theta_0)^{-1}I_{21}(\theta_0).
\]
Then
\begin{align*}
R_n &= n\,\dot{l}_1(\hat{\theta}_n^0)^\top\widehat{I_{11\cdot2}(\theta_0)}^{-1}\dot{l}_1(\hat{\theta}_n^0) \\
&= 2\big(l(\hat{\theta}_n) - l(\hat{\theta}_n^0)\big) + o_{P_{\theta_0}}(1).
\end{align*}
Remark 1.3.21. Note that, unlike the Wald or LRT statistics, to obtain $R_n$ we only need the MLE under the null, $\hat{\theta}_n^0$.
Proof. Note that, under $H_0$, $\dot{l}_2(\hat{\theta}_n^0) = 0$ holds, so
\[
\dot{l}(\hat{\theta}_n^0) = \begin{pmatrix}\dot{l}_1(\hat{\theta}_n^0) \\ 0\end{pmatrix}
\]
and we get
\[
R_n = n\,\dot{l}_1(\hat{\theta}_n^0)^\top\widehat{I_{11\cdot2}(\theta_0)}^{-1}\dot{l}_1(\hat{\theta}_n^0)
\]
directly. Now, with a Taylor expansion, $\exists\hat{\theta}_n^{0,*}$ on the line segment between $(\xi_0^\top, (\hat{\eta}_n^0)^\top)^\top$ and $(\xi_0^\top, \eta_0^\top)^\top$ s.t.
\[
\dot{l}_1(\hat{\theta}_n^0) - \dot{l}_1(\theta_0) = \frac{\partial}{\partial\eta}\dot{l}_1(\theta)\Big|_{\theta=\hat{\theta}_n^{0,*}}(\hat{\eta}_n^0 - \eta_0)
\]
(by the chain rule, and $\xi_0 - \xi_0 = 0$). Thus, by the continuity of $I$,
\[
\dot{l}_1(\hat{\theta}_n^0) - \dot{l}_1(\theta_0) = \big(-I_{12}(\theta_0) + o_{P_{\theta_0}}(1)\big)(\hat{\eta}_n^0 - \eta_0)
\]
holds. In the same way, we get
\[
\dot{l}_2(\hat{\theta}_n^0) - \dot{l}_2(\theta_0) = \big(-I_{22}(\theta_0) + o_{P_{\theta_0}}(1)\big)(\hat{\eta}_n^0 - \eta_0),
\]
but from $\dot{l}_2(\hat{\theta}_n^0) = 0$,
\[
\hat{\eta}_n^0 - \eta_0 = \big(I_{22}(\theta_0) + o_{P_{\theta_0}}(1)\big)^{-1}\dot{l}_2(\theta_0) = \big[I_{22}^{-1}(\theta_0) + o_{P_{\theta_0}}(1)\big]\dot{l}_2(\theta_0)
\]
is obtained. Therefore,
\begin{align*}
\dot{l}_1(\hat{\theta}_n^0) &= \dot{l}_1(\theta_0) - \big[I_{12}(\theta_0) + o_{P_{\theta_0}}(1)\big](\hat{\eta}_n^0 - \eta_0) \\
&= \dot{l}_1(\theta_0) - \big[I_{12}(\theta_0) + o_{P_{\theta_0}}(1)\big]\big[I_{22}^{-1}(\theta_0) + o_{P_{\theta_0}}(1)\big]\dot{l}_2(\theta_0) \\
&= \dot{l}_1(\theta_0) - I_{12}(\theta_0)I_{22}^{-1}(\theta_0)\dot{l}_2(\theta_0) + o_{P_{\theta_0}}(n^{-1/2})
\end{align*}
holds, and the proof is complete if one uses
\[
\widehat{I_{11\cdot2}(\theta_0)} = I_{11\cdot2}(\theta_0) + o_{P_{\theta_0}}(1)
\]
and (1.6).
Remark 1.3.22. Note that, for a simple null $H_0 : \theta = \theta_0$,
\begin{align*}
W_n &= n(\hat{\theta}_n - \theta_0)^\top I(\theta_0)(\hat{\theta}_n - \theta_0) \\
R_n &= n\,\dot{l}(\theta_0)^\top I(\theta_0)^{-1}\dot{l}(\theta_0).
\end{align*}
In this case $k_0 = 0$, so $W_n$ and $R_n$ converge in distribution to $\chi^2(k)$ under $H_0$. For a composite null, $\theta_0$ is estimated by $\hat{\theta}_n^0$.
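For example (simulation mine, not from the lecture), in $N(\theta, 1)$ with $H_0 : \theta = 0$ we have $\dot{l}(\theta_0) = \bar{X} - \theta_0$ (average score) and $I = 1$, so $W_n = R_n = n\bar{X}^2$, which is $\chi^2(1)$ under $H_0$:

```python
import numpy as np

# Under H0, W_n = R_n = n * Xbar^2 should behave like chi^2(1):
# mean about 1 and rejection rate about 5% at the 3.8415 critical value.
rng = np.random.default_rng(7)
n, reps = 100, 50_000
xbar = rng.standard_normal((reps, n)).mean(axis=1)
wn = n * xbar**2
rate = (wn > 3.8415).mean()
print(wn.mean(), rate)
```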
Example 1.3.23 (GoF test in a multinomial model). Let X1, · · · , Xn be a random sample
from Multi(1, (p1, · · · ,