  • Theory of Statistics II (Fall 2016)

    J.P.Kim

    Dept. of Statistics

    First Draft Published at: January 1, 2017

    Final Draft Published at: November 11, 2018

  • Preface & Disclaimer

This note is a summary of the lecture Theory of Statistics II (326.522) held at Seoul National University, Fall 2016. The lecturer was B.U.Park, and the note was written by J.P.Kim, a Ph.D. student. There are a few textbooks and references for this course. The contents and corresponding references are as follows.

    • Asymptotic Approximations. Reference: Mathematical Statistics: Basic ideas and selected

    topics, Vol. I., 2nd edition, P.Bickel & K.Doksum, 2007.

    • Weak Convergence. Reference: Convergence of Probability Measures, P.Billingsley, 1999.

    • Nonparametric Density Estimation. Reference: Kernel Smoothing, M.P.Wand & M.C.Jones,

    1994.

Lecture notes are available at stat.snu.ac.kr/theostat. I also referred to the following books while writing this note. The list will be updated continuously.

    • Probability: Theory and Examples, R.Durrett

    • Mathematical Statistics (in Korean), W.C.Kim

    • Introduction to Nonparametric Regression, K.Takezawa

    • Semiparametric Regression, D.Ruppert, M.P.Wand & R.J.Carroll

    • Nonparametric and Semiparametric Models, W.Härdle, M.Müller, S.Sperlich & A.Werwatz

If you want to report typos or mistakes, please contact: [email protected]

    Acknowledgement

I'm very grateful to my colleagues for their comments and corrections. Their suggestions were very helpful in improving this note.

    Special Thanks to: Hyemin Yeon, Seonghun Cho


  • Chapter 1

    Asymptotic Approximations

    1.1 Consistency

    1.1.1 Preliminary for the chapter

Definition 1.1.1 (Notations). Let Θ be a parameter space. We consider a 'random variable' X on the probability space (Ω, F, P_θ), which is a function

X : (Ω, F, P_θ) → (R^d, B(R^d), P^X_θ),

where P^X_θ := P_θ ∘ X^{-1}. Note that P_θ is a probability measure from the model P := {P_θ : θ ∈ Θ}. For convenience, we omit this fundamental setting from now on.

Definition 1.1.2 (Convergence). Let {Xn} be a sequence of random variables.

1. Xn →_{a.s.} X (as n → ∞) if P(lim_{n→∞} Xn = X) = 1
   ⇔ P(|Xn − X| > ε i.o.) = 0 for all ε > 0
   ⇔ lim_{N→∞} P(⋃_{n=N}^∞ {|Xn − X| > ε}) = 0 for all ε > 0.

2. Xn →_P X (as n → ∞) if for all ε > 0, P(|Xn − X| > ε) → 0 as n → ∞.

Proposition 1.1.3. Xn →_P X if and only if for any subsequence {n_k} ⊆ {n} there is a further subsequence {n_{k_j}} ⊆ {n_k} such that X_{n_{k_j}} →_{a.s.} X as j → ∞.

Proof. Durrett, p.65.

Definition 1.1.4 (Consistency). q̂n = qn(X1, · · · , Xn) is a consistent estimator of q(θ) if

q̂n →_{P_θ} q(θ) as n → ∞

for any θ ∈ Θ. (We do not know what the true parameter is.)

Remark 1.1.5. There are some tools to obtain consistency.

1. Var(Zn) → 0 and EZn → µ as n → ∞ ⇒ Zn →_P µ.

   ∵ P(|Zn − µ| > ε) ≤ P(|Zn − EZn| + |EZn − µ| > ε)
     ≤ P(|Zn − EZn| > ε/2) + P(|EZn − µ| > ε/2)   (the second term is 0 for sufficiently large n)
     ≤ (4/ε²) Var(Zn) → 0.

2. WLLN: X1, · · · , Xn i.i.d. with E|X1| < ∞ ⇒ X̄n →_P EX1.

3. If Yn →_P Y and Zn →_P Z, then (a) Yn + Zn →_P Y + Z, (b) YnZn →_P YZ, and (c) Yn/Zn →_P Y/Z provided P(Z = 0) = 0.

   (b) For any η > 0 there exists M > 0 such that P(|Y| > M) ≤ η and P(|Z| > M) ≤ η. Now,

   P(|YnZn − YZ| > ε) ≤ P(|Yn||Zn − Z| > ε/2) + P(|Z||Yn − Y| > ε/2)
     ≤ P(|Yn − Y||Zn − Z| > ε/4) + P(|Y||Zn − Z| > ε/4) + P(|Z||Yn − Y| > ε/2),

   and note that P(|Y||Zn − Z| > ε/4) = P(|Y||Zn − Z| > ε/4, |Y| > M) + P(|Y||Zn − Z| > ε/4, |Y| ≤ M) ≤ η + P(|Zn − Z| ≥ ε/(4M)). Thus

   limsup_{n→∞} P(|Y||Zn − Z| > ε/4) ≤ η

   and similarly

   limsup_{n→∞} P(|Z||Yn − Y| > ε/2) ≤ η.

   Now, since

   P(|Yn − Y||Zn − Z| > ε/4) = P(|Yn − Y||Zn − Z| > ε/4, |Yn − Y| > √(ε/4)) + P(|Yn − Y||Zn − Z| > ε/4, |Yn − Y| ≤ √(ε/4))
     ≤ P(|Yn − Y| > √(ε/4)) + P(|Zn − Z| ≥ √(ε/4)) → 0

   as n → ∞, we get

   limsup_{n→∞} P(|YnZn − YZ| > ε) ≤ 2η.

   Finally, since η > 0 was arbitrary, we get the result.

   (c) By (b), it is sufficient to show that Zn^{-1} →_P Z^{-1}. Since P(Z = 0) = 0, for any η > 0 there exists δ > 0 such that P(|Z| ≤ δ) ≤ η. (If not, ∃η > 0 such that ∀δ > 0, P(|Z| ≤ δ) > η. Then by continuity of measure, P(⋂_{δ>0}{|Z| ≤ δ}) = P(Z = 0) ≥ η > 0. Contradiction.) Thus

   P(|1/Zn − 1/Z| > ε) = P(|Zn − Z|/|ZnZ| > ε)
     ≤ P(|Zn − Z|/(|Z|(|Z| − |Zn − Z|)) > ε) + P(|Z| < |Zn − Z|)
     ≤ P(|Zn − Z|/(|Z|(|Z| − |Zn − Z|)) > ε, |Z| > δ, |Zn − Z| ≤ δ/2)   [≤ P(|Zn − Z| > δ²ε/2) → 0]
       + P(|Z| ≤ δ)   [≤ η]
       + P(|Zn − Z| > δ/2)   [→ 0]
       + P(|Z| < |Zn − Z|, |Zn − Z| > δ)   [≤ P(|Zn − Z| > δ) → 0]
       + P(|Z| < |Zn − Z|, |Zn − Z| ≤ δ)   [≤ P(|Z| ≤ δ) ≤ η],

   and hence

   limsup_{n→∞} P(|1/Zn − 1/Z| > ε) ≤ 2η

   holds. Note that η > 0 was arbitrary.

Definition 1.1.6 (Probabilistic O-notation). Let Xn be a sequence of r.v.'s.

1. Xn = Op(1) if lim_{c→∞} sup_n P(|Xn| > c) = 0 ⇔ lim_{c→∞} limsup_{n→∞} P(|Xn| > c) = 0. ("Bounded in probability")

2. Xn = op(1) if Xn →_P 0 as n → ∞.

3. Xn = Op(an) if Xn/an = Op(1), and Xn = op(an) if Xn/an = op(1).

Proposition 1.1.7. If Xn →_d X for some X, then Xn = Op(1).

Proof. For given ε > 0, there exists c such that P(|X| > c) < ε/2. For such c, P(|Xn| > c) → P(|X| > c), so ∃N s.t.

sup_{n>N} |P(|Xn| > c) − P(|X| > c)| < ε/2.

Thus sup_{n>N} P(|Xn| > c) < ε. For n = 1, 2, · · · , N, there exists cn such that P(|Xn| > cn) < ε, and letting c* = max(c1, · · · , cN, c), we get sup_n P(|Xn| > c*) < ε.

Example 1.1.8 (Simple Linear Regression). Consider a simple linear regression model yi = β0 + β1 xi + εi, where the εi are i.i.d. (0, σ²). Also assume that x1, · · · , xn are known and not all equal. Note that

β̂_{1,n} = Σ_{i=1}^n (xi − x̄)Yi / Σ_{i=1}^n (xi − x̄)².

Since E(β̂_{1,n}) = β1 and Var(β̂_{1,n}) = σ²/Sxx, we obtain consistency

β̂_{1,n} →_{P_{β,σ²}} β1

provided that Sxx = Σ_{i=1}^n (xi − x̄)² → ∞ as n → ∞.
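As a quick numerical illustration (a minimal simulation sketch; the model parameters and the fixed design below are arbitrary choices, not from the example), the slope estimator concentrates around β1 as Sxx grows:

```python
import numpy as np

# Minimal sketch: beta1_hat = sum((x_i - xbar) Y_i) / sum((x_i - xbar)^2)
# concentrates around beta1 as n grows, because Var(beta1_hat) = sigma^2 / S_xx -> 0.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 2.0, 1.0          # illustrative values

for n in [10, 100, 1000, 10000]:
    x = np.linspace(0.0, 1.0, n)             # fixed design, not all equal
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    sxx = np.sum((x - x.mean()) ** 2)        # S_xx grows roughly like n/12 here
    b1 = np.sum((x - x.mean()) * y) / sxx    # least-squares slope
    print(f"n={n:6d}  S_xx={sxx:9.2f}  beta1_hat={b1:.4f}  Var(beta1_hat)={sigma**2/sxx:.5f}")
```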

Example 1.1.9 (Sample correlation coefficient). Let (X1, Y1), · · · , (Xn, Yn) be a random sample from a population with

EX1 = µ1, EY1 = µ2, Var(X1) = σ1² > 0, Var(Y1) = σ2² > 0, and Corr(X1, Y1) = ρ.

Then by the WLLN we get

( X̄, Ȳ, (1/n)ΣXi², (1/n)ΣYi², (1/n)ΣXiYi ) →_P ( EX1, EY1, EX1², EY1², EX1Y1 ).

Since the function

g(u1, u2, u3, u4, u5) = (u5 − u1u2) / ( √(u3 − u1²) √(u4 − u2²) )

is continuous at (EX1, EY1, EX1², EY1², EX1Y1), we get

ρ̂n = ( (1/n)ΣXiYi − X̄Ȳ ) / ( √((1/n)ΣXi² − X̄²) √((1/n)ΣYi² − Ȳ²) ) →_P ρ.

Remark 1.1.10. Note that, if Xn →_P c where c is a constant, then continuity of g(x) at x = c is sufficient for consistency g(Xn) →_P g(c). It is trivial from the definition of continuity.

Example 1.1.11. Let X1, · · · , Xn be a random sample from a population with cdf F. Then we use the empirical distribution function

F̂n(x) = (1/n) Σ_{i=1}^n I_{(−∞,x]}(Xi) = (1/n) Σ_{i=1}^n I(Xi ≤ x)

for estimation of F. Then by the WLLN, for each x, F̂n(x) is a consistent estimator of F(x),

F̂n(x) →_P F(x).

Remark 1.1.12. Note that in this case we can say something stronger, which is known as the Glivenko-Cantelli theorem:

sup_x |F̂n(x) − F(x)| →_P 0 as n → ∞.

A sketch of the proof: since F̂n and F are nondecreasing and bounded, we can partition [0, 1], which is the range of both functions, into a finite number of intervals, and then each interval has a well-defined inverse image which is an interval. For the whole proof, see Durrett, p.76.
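As a small illustration of the sup-distance in the Glivenko-Cantelli theorem (a sketch only; the standard normal population is an arbitrary choice), one can compute the Kolmogorov distance of F̂n from F exactly at the jump points:

```python
import numpy as np
from scipy.stats import norm

# Sketch: Kolmogorov distance sup_x |F_n(x) - F(x)| for a standard normal sample.
# For the piecewise-constant F_n it is attained at jump points, checking both one-sided gaps.
rng = np.random.default_rng(1)

for n in [50, 500, 5000, 50000]:
    xs = np.sort(rng.normal(size=n))
    F = norm.cdf(xs)
    i = np.arange(1, n + 1)
    d_n = max(np.max(i / n - F), np.max(F - (i - 1) / n))
    print(f"n={n:6d}  sup_x |F_n - F| = {d_n:.4f}")
```

The printed distance shrinks toward 0 as n grows, as the theorem asserts.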

    1.1.2 FSE and MLE in Exponential Families

    FSE

Recall that the FSE of ν(F) is defined as ν(F̂n). One example of an FSE is the MME: to estimate EX1^j =: νj(F) := ∫ x^j dF(x), we use

m̂j = νj(F̂n) = ∫ x^j dF̂n(x) = (1/n) Σ_{i=1}^n Xi^j.

By the WLLN we have (m̂1, m̂2, · · · , m̂k)^⊤ →_P (m1, m2, · · · , mk)^⊤, where mj = EX1^j, so we can obtain consistency of the MME easily.

Proposition 1.1.13. Let q = h(m1, m2, · · · , mk) be a parameter of interest, where the mj's are population moments. Then for the MME

q̂n = h(m̂1, · · · , m̂k)

based on a random sample X1, · · · , Xn,

q̂n →_P q

holds, provided that h is continuous at (m1, · · · , mk)^⊤.

We can do similar work for the FSE ν(F̂n). Note that here ν is a functional, so we must define continuity of a functional. We may use the sup norm as a metric on the space of distribution functions.

Definition 1.1.14. Let F be a space of distribution functions. On this space, we use the sup norm

‖F‖ = sup_x |F(x)|.

Then the metric is given as

‖F − G‖ = sup_x |F(x) − G(x)|.

Also, we say that a functional ν : F → R is continuous at F if for any ε > 0 there exists δ > 0 such that

‖G − F‖ < δ ⇒ |ν(G) − ν(F)| < ε.

Remark 1.1.15. Note that since ‖F̂n − F‖ → 0 as n → ∞ by the Glivenko-Cantelli theorem, we get consistency of the FSE,

ν(F̂n) →_P ν(F),

provided that ν is continuous at F. In many cases, showing continuity may be a difficult problem.

Example 1.1.16 (Best Linear Predictor). Let X1, · · · , Xn be k-dimensional i.i.d. r.v.'s, and Y1, · · · , Yn be i.i.d. 1-dimensional random variables. Then we know that

BLP(x) = µ2 + Σ21 Σ11^{-1} (x − µ1),

where, for (X, Y) distributed as (X1, Y1),

E(X) = µ1, E(Y) = µ2 and Var((X^⊤, Y)^⊤) = ( Σ11  Σ12 ; Σ21  Σ22 ).

Thus for the sample variances

S11 = (1/n) Σ_{i=1}^n (Xi − X̄)(Xi − X̄)^⊤,
S12 = (1/n) Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) = S21^⊤,
S22 = (1/n) Σ_{i=1}^n (Yi − Ȳ)²,

we obtain the FSE for the BLP,

BLP̂_FSE(x) = Ȳ + S21 S11^{-1} (x − X̄).

Note that it is the same as the simple linear regression line. Details are given in the next proposition.

Proposition 1.1.17.

(a) The solution of the minimization problem

(β0*, β1*^⊤)^⊤ = argmin_{β0,β1} E(Y − β0 − β1^⊤X)²

is

BLP(x) := β0* + β1*^⊤x = µ2 + Σ21 Σ11^{-1}(x − µ1).

(b) For Y = (Y1, · · · , Yn)^⊤ and design matrix X = (1, X1) where X1 = (X1, · · · , Xn)^⊤, the LSE is

β̂1 = S11^{-1} S12 and β̂0 = Ȳ − X̄^⊤ β̂1.

Proof. (a) Two approaches are given. The first is a direct proof: it is clear because

E(Y − β0 − β1^⊤X)² = E[(Y − µ2) − β1^⊤(X − µ1)]² + [µ2 − β0 − β1^⊤µ1]²
 = Σ22 − 2β1^⊤Σ12 + β1^⊤Σ11β1 + [β0 − (µ2 − β1^⊤µ1)]².

The second approach uses projection in L² space. For convenience, suppose EX = 0 and EY = 0. Then (β0*, β1*^⊤)^⊤ should satisfy

⟨β0 + β1^⊤X, Y − β0* − β1*^⊤X⟩ = 0 ∀β0, β1.

It yields that

β0* = 0, β1* = (E(XX^⊤))^{-1} E(XY).

(b) Xβ̂ = 1β̂0 + X1β̂1 should satisfy 1β̂0 + X1β̂1 = Π(Y | C(X)). For X̃1 = X1 − Π(X1 | C(1)) = X1 − 1X̄^⊤, from

1β̂0 + X1β̂1 = 1( β̂0 + (1^⊤X1/n) β̂1 ) + X̃1β̂1 = Π(Y | C(1)) + Π(Y | C(X̃1)),

we get

β̂0 = Ȳ − X̄^⊤β̂1 and β̂1 = (X̃1^⊤X̃1)^{-1} X̃1^⊤ Y.

Now X̃1^⊤X̃1/n = S11 and X̃1^⊤Y/n = S12 end the proof.

    Figure 1.1: Regression with intercept. Image from Lecture Note.

Example 1.1.18 (Multiple Correlation Coefficient). We define the multiple correlation coefficient (MCC) as

ρ = max_{β0,β1} Corr(Y, β0 + β1^⊤X)

and the sample MCC as

ρ̂n = max_{β0,β1} Ĉorr(Y, β0 + β1^⊤X).

Note that

Corr(Y, β0 + β1^⊤X) = Corr(Y − µ2, β1^⊤(X − µ1))
 = Σ21β1 / ( √Σ22 √(β1^⊤Σ11β1) )
 = (Σ11^{-1/2}Σ12)^⊤(Σ11^{1/2}β1) / ( √Σ22 √(β1^⊤Σ11β1) )
 ≤ √( Σ21Σ11^{-1}Σ12 / Σ22 )

holds by the Cauchy-Schwarz inequality, and equality holds when β1 = Σ11^{-1}Σ12. Thus the population MCC is obtained as

ρ = √( Σ21Σ11^{-1}Σ12 / Σ22 ).

Meanwhile, the sample correlation is obtained as

Ĉorr(Y, β0 + β1^⊤X) = ⟨Y − Ȳ1, (X1 − 1X̄^⊤)β1⟩ / ( ‖Y − Ȳ1‖ ‖(X1 − 1X̄^⊤)β1‖ ),

so it is the cosine of the angle between the two vectors Y − Ȳ1 and X̃1β1. Its maximal value is attained by X̃1β̂1 = Π(Y − Ȳ1 | C(X̃1)). (Or, one may use Cauchy-Schwarz,

Ĉorr(Y, β0 + β1^⊤X) = y^⊤X̃1(X̃1^⊤X̃1)^{-1}X̃1^⊤X̃1β1 / ( ‖Y − Ȳ1‖ ‖X̃1β1‖ ) ≤ √(y^⊤Π_{X̃1}y) / √(y^⊤(I − Π_1)y),

where equality holds iff X̃1(X̃1^⊤X̃1)^{-1}X̃1^⊤y = X̃1β1, i.e., β1 = β̂1.) Thus,

ρ̂² = SSR/SST = β̂1^⊤X̃1^⊤X̃1β̂1 / ‖Y − Ȳ1‖² = S21S11^{-1}S12 / S22.

Example 1.1.19 (Sample Proportions). Let (X1, · · · , Xk)^⊤ ∼ Multi(n, p), where p ∈ Θ := {(p1, · · · , pk)^⊤ : Σ_{i=1}^k pi = 1, pi ≥ 0 (i = 1, 2, · · · , k)}. We estimate p with the sample proportion

p̂n = (X1/n, · · · , Xk/n)^⊤.

Then,

(a) p̂n is a consistent estimator of p, i.e.,

∀ε > 0, sup_{p∈Θ} P_p(|p̂n − p| ≥ ε) → 0 as n → ∞.

(b) q(p̂n) is a consistent estimator of q(p) provided that q is (uniformly) continuous on Θ.

Proof. (a) Note that there exists a constant C > 0 such that

sup_{p∈Θ} P_p(|p̂n − p| ≥ ε) ≤ sup_{p∈Θ} E|p̂n − p|²/ε² = sup_{p∈Θ} Σ_{i=1}^k pi(1 − pi)/(nε²) ≤ C/(nε²) → 0 as n → ∞,

so we get the desired result. Note that the first inequality is from Chebyshev's inequality.

(b) Note that q is uniformly continuous on Θ, since Θ is closed and bounded. Thus the assertion holds. More precisely, for any ε > 0 there exists δ > 0 such that

|p′ − p| < δ, p, p′ ∈ Θ ⇒ |q(p′) − q(p)| < ε.

Therefore, we get

sup_{p∈Θ} P_p(|q(p̂n) − q(p)| ≥ ε) ≤ sup_{p∈Θ} P_p(|p̂n − p| ≥ δ) → 0 as n → ∞.

    MLE in exponential families

Consider a random variable X with pdf in a canonical exponential family

qη(x) = h(x) exp(η^⊤T(x) − A(η)) I_X(x), η ∈ E,

where E is the natural parameter space in R^k. Our goal is to show consistency of the MLE in the canonical exponential family.

Theorem 1.1.20. Let

qη(x) = h(x) exp(η^⊤T(x) − A(η)) I_X(x), η ∈ E

be a canonical exponential family with natural parameter space E ⊆ R^k. Further assume

(i) E is open.

(ii) The family is of rank k.

(iii) t0 := T(x) ∈ C⁰, where C denotes the smallest convex set containing the support of T(X), and C⁰ is its interior.

Then the unique ML estimate η̂(x) exists and is the solution of the likelihood equation

l̇x(η) = T(x) − Ȧ(η) = 0.

Remark 1.1.21. Note that in (iii), x is the observation of X, so t0 is the observation of T(X). It is reasonable to consider t0 because the ML estimate depends only on t0. Also, recall that (ii) means

∄a ≠ 0 s.t. [ Pη(a^⊤(T(X) − µ) = 0) = 1 for some η ∈ E ]
⇔ ∄a ≠ 0 s.t. [ Varη(a^⊤T(X)) = 0 for some η ∈ E ]
⇔ Ä(η) is positive definite ∀η ∈ E.

    To prove this, we need some preparation.

Lemma 1.1.22.

(a) ("Supporting Hyperplane Theorem") Let C ⊆ R^k be a convex set, and C⁰ be its interior. Then for t0 ∉ C or t0 ∈ ∂C,

∃a ≠ 0 s.t. [ a^⊤t ≥ a^⊤t0 ∀t ∈ C ].

Conversely, for t0 ∈ C⁰,

∄a ≠ 0 s.t. [ a^⊤t ≥ a^⊤t0 ∀t ∈ C ].

(b) Let P(T ∈ T) = 1 and E(max_i |Ti|) < ∞. Then, under the rank condition (ii), µ := ET ∈ C⁰.

Proof. (a) Take δ > 0 such that B(t0, δ) ⊆ C⁰, since C⁰ is open. Note that for any u s.t. ‖u‖ = 1, we get

t0 − (δ/2)u, t0 + (δ/2)u ∈ B(t0, δ) ⊆ C.

If ∃a ≠ 0 such that a^⊤t ≥ a^⊤t0 ∀t ∈ C, then

a^⊤(t0 − (δ/2)u) ≥ a^⊤t0, a^⊤(t0 + (δ/2)u) ≥ a^⊤t0

holds for u = a/|a|, which yields a contradiction. (Note that the convexity condition is not used.)

Figure 1.2: Proof of (a)

(b) Note that since C is a convex set, µ := ET ∈ C holds. (A convex set contains the average of itself.) Assume µ ∉ C⁰. Then µ ∈ ∂C. Then by (a), ∃a ≠ 0 such that a^⊤t ≥ a^⊤µ for any t ∈ C. It implies that ∃a ≠ 0 such that P(a^⊤(T − µ) ≥ 0) = 1, since T ⊆ C. It implies that

P(a^⊤(T − µ) = 0) = 1,

by the fact that

f ≥ 0, ∫f dµ = 0 ⇒ f = 0 µ-a.e..

This contradicts (ii), the full rank condition of the exponential family.

(c) Done in TheoStat I.

Proof of theorem. By the lemma, it is sufficient to show that:

(1) lx(η) diverges to −∞ at the boundary. (Existence)

(2) Uniqueness.

Uniqueness is clear since lx(η) is a C² function and strictly concave from Ä(η) > 0. Thus, our claim is:

Claim. lx(η) approaches −∞ on the boundary ∂E.

Let η0 ∈ ∂E. Then there is a sequence ηn → η0 with ηn ∈ E. Now our claim is that, for any such sequence ηn, we get lx(ηn) → −∞ as n → ∞. Note that, passing to a subsequence if necessary, either |ηn| → ∞ or sup_n |ηn| < ∞. For both cases, from lx(η) = log h(x) + η^⊤T(x) − A(η) and e^{A(η)} = ∫_X h(x) e^{η^⊤T(x)} dµ(x), we get

−lx(ηn) + log h(x) = A(ηn) − ηn^⊤t0 = log ∫_X exp( ηn^⊤(T(y) − t0) ) h(y) dµ(y).

Case 1. |ηn| → ∞.

Then since

∫_X e^{ηn^⊤(T(y)−t0)} h(y) dµ(y) ≥ ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 1/k }} e^{|ηn| · (ηn/|ηn|)^⊤(T(y)−t0)} h(y) dµ(y)
 ≥ ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 1/k }} e^{|ηn|/k} h(y) dµ(y)
 = exp(|ηn|/k) ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 1/k }} h(y) dµ(y),

if we can conclude

inf_n ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 1/k }} h(y) dµ(y) > 0,

then by the assumption |ηn| → ∞, we get lx(ηn) → −∞. Note that if

inf_{u:‖u‖=1} ∫_{{ u^⊤(T(y)−t0) > 0 }} h(y) dµ(y) > 0,

then

inf_n ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 0 }} h(y) dµ(y) > 0,

and from

inf_n ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 1/k }} h(y) dµ(y) → inf_n ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 0 }} h(y) dµ(y) as k → ∞,

we get that ∃ε > 0 and k s.t.

inf_n ∫_{{ (ηn/|ηn|)^⊤(T(y)−t0) > 1/k }} h(y) dµ(y) > ε,

and the assertion holds. So our claim is:

Claim. inf_{u:‖u‖=1} ∫_{{ u^⊤(T(y)−t0) > 0 }} h(y) dµ(y) > 0.


Assume not. If

inf_{u:‖u‖=1} ∫_{{ u^⊤(T(y)−t0) > 0 }} h(y) dµ(y) = 0,

then since {u : ‖u‖ = 1} is compact, there exists u0 ∈ {u : ‖u‖ = 1} such that

∫_{{ u0^⊤(T(y)−t0) > 0 }} h(y) dµ(y) = 0.

It implies h(y) = 0 on the set {y : u0^⊤(T(y) − t0) > 0} µ-a.e., and hence

∫_{{ u0^⊤(T(y)−t0) > 0 }} h(y) e^{η^⊤T(y)−A(η)} dµ(y) = 0,

which implies that

Pη(u0^⊤(T(X) − t0) > 0) = 0.

Thus, we get

Pη(u0^⊤(T(X) − t0) ≤ 0) = 1,

which is equivalent to

u0^⊤(t − t0) ≤ 0 ∀t ∈ T.

Since C is the convex hull of T, it implies

u0^⊤(t − t0) ≤ 0 ∀t ∈ C;

however, this contradicts

∄a ≠ 0 s.t. a^⊤(t − t0) ≤ 0 ∀t ∈ C,

which follows from t0 ∈ C⁰.

Case 2. sup_n |ηn| < ∞.

Then, since ηn → η0,

liminf_{n→∞} ∫_X e^{ηn^⊤(T(y)−t0)} h(y) dµ(y) ≥ ∫_X e^{η0^⊤(T(y)−t0)} h(y) dµ(y) (∗)= ∞

by Fatou's lemma. (∗) holds because E is the natural parameter space, and η0 ∈ ∂E implies η0 ∉ E, since E is open. Thus −lx(ηn) → ∞ as n → ∞.

Figure 1.3: Convex hull of T

Now we are ready to prove consistency.

Theorem 1.1.23. Let X1, · · · , Xn be a random sample from a population with pdf

pη(x) = h(x) exp{η^⊤T(x) − A(η)} I_X(x), η ∈ E,

where E is the natural parameter space in R^k. Further, assume that

(i) E is open.

(ii) The family is of rank k.

Then the following hold:

(a) Pη(η̂n^{MLE} exists) → 1 as n → ∞.

(b) η̂n^{MLE} is consistent.

Proof. (a) Let T̄n = (1/n) Σ_{i=1}^n T(Xi). Then by the WLLN, we get

lim_{n→∞} Pη(|T̄n − EηT(X1)| < ε) = 1 ∀ε > 0.

Also note that EηT(X1) ∈ C⁰, where C⁰ is the interior of the convex hull of the support of T(X1). Then since C⁰ is open, the ball B(EηT(X1), ε) is contained in C⁰ for sufficiently small ε > 0, which implies

lim_{n→∞} Pη(T̄n ∈ C⁰) = 1.

Now consider T̄n instead of T(X1) in the previous theorems, and note that (convex hull of support of T̄n) = (convex hull of support of T(X1)). Then we can find that

(T̄n ∈ C⁰) ⊆ (η̂n^{MLE} exists)

and therefore

lim_{n→∞} Pη(η̂n^{MLE} exists) = 1.

(b) From Ä > 0, we get that Ȧ(η) is one-to-one and continuous for any η. Then we get

(T̄n ∈ C⁰) ⊆ (η̂n^{MLE} exists) = (Ȧ(η̂n^{MLE}) = T̄n)

and hence

lim_{n→∞} Pη( η̂n^{MLE} = (Ȧ)^{-1}(T̄n) ) = 1 ∀η ∈ E. (1.1)

Further, by the inverse function theorem and the C² property of A, we have that (Ȧ)^{-1} is continuous. Thus by the WLLN and the continuous mapping theorem,

(Ȧ)^{-1}(T̄n) →_{Pη} (Ȧ)^{-1}(EηT(X1)) = (Ȧ)^{-1}(Ȧ(η)) = η,

and since (Ȧ)^{-1}(T̄n) ≈ η̂n^{MLE} in the sense of (1.1), we get

lim_{n→∞} Pη(|η̂n^{MLE} − η| < ε) = 1 ∀ε > 0,

i.e., η̂n^{MLE} →_{Pη} η.
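As a concrete sketch of η̂ = (Ȧ)^{-1}(T̄n) (not part of the theorem; the Poisson family and the Newton solver below are illustrative choices), take the Poisson distribution in canonical form, where T(x) = x and A(η) = e^η, so the likelihood equation Ȧ(η) = T̄n has the closed form η̂ = log T̄n:

```python
import numpy as np

# Sketch: solve A'(eta) = T_bar_n for the canonical Poisson family (A(eta) = exp(eta))
# by Newton's method, and compare with the closed form eta_hat = log(T_bar_n).
rng = np.random.default_rng(3)
eta_true = np.log(2.5)                      # true natural parameter (rate lambda = 2.5)

for n in [20, 200, 2000, 20000]:
    x = rng.poisson(np.exp(eta_true), size=n)
    t_bar = x.mean()
    eta = 0.0                               # Newton iterations for exp(eta) = t_bar
    for _ in range(50):
        eta -= (np.exp(eta) - t_bar) / np.exp(eta)
    print(f"n={n:6d}  eta_hat={eta:+.4f}  closed form={np.log(t_bar):+.4f}  true={eta_true:+.4f}")
```

The estimates approach the true η as n grows, illustrating part (b).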

Now let's see some general results. Suppose we have lim_{n→∞} Ψn(θ) = Ψ0(θ) and

θn : solution of Ψn(θ) = 0, θ ∈ C (n = 1, 2, · · · ),
θ0 : solution of Ψ0(θ) = 0, θ ∈ C.

Under what conditions does lim_{n→∞} θn = θ0 hold? We need the following four conditions: uniform convergence of Ψn, continuity of Ψ0, uniqueness of θ0, and compactness of C. Note that these four together are sufficient. Our goal is to obtain a similar result for optimization.

Theorem 1.1.24. Suppose that we have lim_{n→∞} Dn(θ) = D0(θ) and

θn = argmin_{θ∈C} Dn(θ) (n = 1, 2, · · · ),
θ0 = argmin_{θ∈C} D0(θ),

where Dn and D0 are deterministic functions. Also assume that

(i) Dn converges to D0 uniformly.

(ii) D0 is continuous on C.

(iii) The minimizer θ0 is unique.

(iv) C is compact.

Then lim_{n→∞} θn = θ0.

Proof. Assume not, i.e., θn ↛ θ0. Then ∃ε > 0 such that |θn − θ0| > ε i.o. It means that there is a subsequence {n′} ⊆ {n} s.t. |θn′ − θ0| > ε for all n′. Now define

∆n = sup_{θ∈C} |Dn(θ) − D0(θ)|.

Then by uniform convergence of Dn, we get ∆n → 0 as n → ∞. Now note that

inf_{|θ−θ0|>ε} D0(θ) = inf_{|θ−θ0|>ε} { D0(θ) − Dn′(θ) + Dn′(θ) }
 ≤ inf_{|θ−θ0|>ε} { |D0(θ) − Dn′(θ)| + Dn′(θ) }
 ≤ ∆n′ + inf_{|θ−θ0|>ε} Dn′(θ)

holds. Because the minimization of Dn′ is achieved at θn′ ∈ {θ : |θ − θ0| > ε}, we get

∆n′ + inf_{|θ−θ0|>ε} Dn′(θ) ≤ ∆n′ + inf_{|θ−θ0|≤ε} Dn′(θ)
 ≤ ∆n′ + inf_{|θ−θ0|≤ε} { |Dn′(θ) − D0(θ)| + D0(θ) }
 ≤ 2∆n′ + inf_{|θ−θ0|≤ε} D0(θ)
 = 2∆n′ + D0(θ0).

The last equality holds from θ0 = argmin D0(θ) and θ0 ∈ {θ : |θ − θ0| ≤ ε}. Thus

inf_{|θ−θ0|>ε} D0(θ) ≤ 2∆n′ + D0(θ0)

holds, which implies

(1/2)( inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) ) ≤ ∆n′.

Letting n′ → ∞, we get

inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) ≤ 0.

This contradicts the following claim:

Claim. inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) > 0.

Intuitively, since θ0 is the unique minimizer, our claim seems trivial, but we also need the continuity and compactness conditions to guarantee it. (For this, see the next remark.)

Note that, by definition of infimum, there is a sequence {θk} ⊆ {θ : |θ − θ0| > ε} ∩ C such that

lim_{k→∞} D0(θk) = inf_{|θ−θ0|>ε} D0(θ).

Now, by compactness of C, there is a subsequence {k′} ⊆ {k} along which θk′ converges to some θ0* ("Bolzano-Weierstrass"), so with abuse of notation let θk → θ0* as k → ∞. Then note that θ0* must belong to {θ : |θ − θ0| ≥ ε} ∩ C, so θ0* ≠ θ0. Now, continuity of D0 gives

lim_{k→∞} D0(θk) = D0(θ0*),

which implies

inf_{|θ−θ0|>ε} D0(θ) = D0(θ0*).

Therefore, by uniqueness of the minimizer, D0(θ0*) > D0(θ0), and combining with the above result we obtain

inf_{|θ−θ0|>ε} D0(θ) > D0(θ0).

Remark 1.1.25. See the next figure. Each example shows that we need continuity and compactness, respectively.

Figure 1.4: Continuity and Compactness are needed.

Remark 1.1.26. For the deterministic case, one can give an alternative proof. Suppose θn ↛ θ0. Then since C is compact, we can find a subsequence {θ_{n_k}} such that θ_{n_k} → θ0*, θ0* ≠ θ0. (If every convergent subsequence converged to θ0, then the original sequence would converge to θ0.) Now for sufficiently large n_k,

sup_{θ∈C} |D_{n_k}(θ) − D0(θ)| < ε/3

holds, so

D0(θ0) ≥ D_{n_k}(θ0) − ε/3   (∵ uniform convergence)
 ≥ D_{n_k}(θ_{n_k}) − ε/3   (∵ minimizer)
 ≥ D0(θ_{n_k}) − 2ε/3   (∵ uniform convergence)
 ≥ D0(θ0*) − ε   (∵ D0(θ_{n_k}) → D0(θ0*) from continuity of D0),

and hence taking ε ↘ 0 gives D0(θ0) ≥ D0(θ0*), which contradicts the uniqueness of θ0.

In fact, our real goal was to get a similar result for random Dn.

Theorem 1.1.27. Let Dn be a sequence of random functions, and D0 be deterministic. Similarly, define

θ̂n = argmin_{θ∈C} Dn(θ) (n = 1, 2, · · · ),
θ0 = argmin_{θ∈C} D0(θ).

Now suppose that

(i) Dn converges in probability to D0 uniformly, i.e.,

sup_{θ∈C} |Dn(θ) − D0(θ)| →_P 0 as n → ∞.

(ii) D0 is continuous on C.

(iii) The minimizer θ0 is unique.

(iv) C is compact.

Then θ̂n →_P θ0.

Proof. Note that in the proof of Theorem 1.1.24 we did not use convergence in deriving

inf_{|θ−θ0|>ε} D0(θ) ≤ 2∆n′ + D0(θ0).

Rather, we only used |θn′ − θ0| > ε. (Convergence was used only in deriving inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) ≤ 0.) Thus,

|θ̂n − θ0| > ε ⇒ inf_{|θ−θ0|>ε} D0(θ) ≤ 2∆n + D0(θ0) ⇒ ∆n ≥ (1/2)( inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) )

holds. Define

(1/2)( inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) ) =: δ(ε).

Then we get

( |θ̂n − θ0| > ε ) ⊆ ( ∆n ≥ δ(ε) ),

and therefore, by uniform convergence in probability, ∆n →_P 0 and hence

P(|θ̂n − θ0| > ε) ≤ P(∆n ≥ δ(ε)) → 0 as n → ∞.

Example 1.1.28 (Consistency of MLE when Θ is finite). Let X1, · · · , Xn be a random sample from a population with pdf fθ(·), θ ∈ Θ. Assume that the parametrization is identifiable and Θ = {θ1, · · · , θk}. Then

θ̂n^{MLE} →_{Pθ0} θ0,

provided that

(0) (Identifiability) Pθ1 = Pθ2 ⇒ θ1 = θ2;

(1) (Kullback-Leibler divergence) Eθ0 |log(fθ(X1)/fθ0(X1))| < ∞ for all θ ∈ Θ.

Indeed, write

Dn(θ) = −(1/n) Σ_{i=1}^n log(fθ(Xi)/fθ0(Xi)), D0(θ) = −Eθ0 log(fθ(X1)/fθ0(X1)),

so that θ̂n^{MLE} = argmin_{θ∈Θ} Dn(θ) and, by Remark 1.1.29 below, θ0 is the unique minimizer of D0. Since Θ is finite,

Pθ0{ max_{1≤j≤k} |Dn(θj) − D0(θj)| > ε } = Pθ0{ ⋃_{1≤j≤k} ( |Dn(θj) − D0(θj)| > ε ) }
 ≤ Σ_{j=1}^k Pθ0( |Dn(θj) − D0(θj)| > ε ) = o(1) by the WLLN,

so we can derive the result similarly to Theorem 1.1.27. More precisely, it is sufficient to show

inf_{|θ−θ0|>ε} D0(θ) − D0(θ0) > 0

for ε s.t. |θn − θ0| > ε i.o. Uniqueness of θ0 implies this clearly, because Θ is finite here. Note that continuity of D0 is not needed.

Remark 1.1.29 (Kullback-Leibler divergence). Since 1 + log z ≤ z, we get

−Eθ0 log( fθ(X1)/fθ0(X1) ) = −∫ log( fθ(x)/fθ0(x) ) dPθ0
 ≥ 1 − ∫_{S(θ0)} ( fθ(x)/fθ0(x) ) fθ0(x) dµ(x)
 ≥ 0,

and hence D0(θ) ≥ 0. Here S(θ0) = {x : fθ0(x) > 0} and S(θ) = {x : fθ(x) > 0}. Note that 1 + log z = z ⇔ z = 1. Thus equality D0(θ) = 0 holds if and only if

fθ(x)/fθ0(x) = 1 µ-a.e. on S(θ0) and ∫_{S(θ0)} fθ(x) dµ(x) = 1.

Since

1 = ∫_{S(θ)} fθ(x) dµ(x) = ∫_{S(θ0)∪S(θ)} fθ(x) dµ(x) = ∫_{S(θ0)} fθ(x) dµ(x) + ∫_{S(θ)\S(θ0)} fθ(x) dµ(x),

we get

∫_{S(θ0)} fθ(x) dµ(x) = 1 ⇔ ∫_{S(θ)\S(θ0)} fθ(x) dµ(x) = 0.

However, by definition of the support, fθ(x) > 0 on S(θ)\S(θ0), and hence

∫_{S(θ)\S(θ0)} fθ(x) dµ(x) = 0 ⇔ µ(S(θ)\S(θ0)) = 0.

Thus D0(θ) = 0 holds if and only if

fθ(x) = fθ0(x) µ-a.e. on S(θ0) and µ(S(θ)\S(θ0)) = 0.

However, note that

fθ(x) = fθ0(x) µ-a.e. on S(θ0) implies µ(S(θ)\S(θ0)) = 0,

because

1 = ∫_{S(θ)} fθ(x) dµ(x) = ∫_{S(θ0)} fθ(x) dµ(x) + ∫_{S(θ)\S(θ0)} fθ(x) dµ(x)
 = ∫_{S(θ0)} fθ0(x) dµ(x) + ∫_{S(θ)\S(θ0)} fθ(x) dµ(x)
 = 1 + ∫_{S(θ)\S(θ0)} fθ(x) dµ(x).

Therefore we get

D0(θ) = 0 ⇔ fθ(x) = fθ0(x) µ-a.e. on S(θ0).

Now µ(S(θ)\S(θ0)) = 0 implies fθ(x) = fθ0(x) µ-a.e. on S(θ)\S(θ0), and therefore fθ(x) = fθ0(x) µ-a.e. whenever fθ(x) = fθ0(x) µ-a.e. on S(θ0). Therefore we get

D0(θ) = 0 ⇔ fθ(x) = fθ0(x) µ-a.e. ⇔ θ = θ0 (∵ identifiability).

It means that θ0 is the unique minimizer of D0(θ).

Example 1.1.30 (Consistency of MCE). Let X1, · · · , Xn be a random sample from Pθ, θ ∈ Θ ⊆ R^k, and

θ̂n^{MCE} = argmin_{θ∈Θ} (1/n) Σ_{i=1}^n ρ(Xi, θ) =: argmin_{θ∈Θ} ρn(θ).

Fix θ0 ∈ Θ and assume, along with Eθ0|ρ(X1, θ)| < ∞ for all θ, that there is a compact K ⊆ Θ containing θ0 for which the conditions (i)-(iv) of Theorem 1.1.27 hold (with C = K, Dn = ρn and D0(θ) = Eθ0ρ(X1, θ)), and that Pθ0(θ̂n^{MCE} ∈ K) → 1. Then, arguing as in Theorem 1.1.27,

Pθ0[ |θ̂n^{MCE} − θ0| > ε, θ̂n^{MCE} ∈ K ] → 0 as n → ∞.

Thus, we get

Pθ0[ |θ̂n^{MCE} − θ0| > ε ] ≤ Pθ0[ |θ̂n^{MCE} − θ0| > ε, θ̂n^{MCE} ∈ K ] + Pθ0[ θ̂n^{MCE} ∉ K ] → 0 as n → ∞.

Remark 1.1.31. Indeed, we have not yet established consistency of the MCE; we only verified it for a fixed θ0 ∈ Θ. For the consistency of the MCE, we need that for any θ0 ∈ Θ there exists K ⊆ Θ containing θ0 such that the conditions (i)-(iv) are fulfilled. Suppose that

(a) for all compact K ⊆ Θ and for all θ0 ∈ Θ,

sup_{θ∈K} |ρn(θ) − Eθ0ρ(X1, θ)| →_{Pθ0} 0 as n → ∞;

(b) for any θ0 ∈ Θ there exists a compact subset K of Θ containing θ0 such that

Pθ0( inf_{θ∈K^c} (ρn(θ) − ρn(θ0)) > 0 ) → 1 as n → ∞;

(c) θ ↦ Eθ0ρ(X1, θ) is continuous on Θ.

Then for any θ0 ∈ Θ there exists a compact subset K of Θ containing θ0 such that (ii)-(iv) hold. Note that (b) implies (iii) together with (i) and (c).

Also note that the MLE is a special case of the MCE, with ρ(x, θ) = − log f(x, θ).

Remark 1.1.32. In many cases, it is difficult to verify the uniform convergence condition. For this, the following convexity lemma is useful: if K is convex,

ρn(θ) →_{Pθ0} Eθ0ρ(X1, θ) for all θ ∈ K ("pointwise convergence"),

and ρn is a convex function on K with probability 1 under Pθ0, then we get the "uniform convergence"

sup_{θ∈K} |ρn(θ) − Eθ0ρ(X1, θ)| →_{Pθ0} 0.

See D. Pollard (1991), Econometric Theory, 7, 186-199.

Remark 1.1.33. The condition (b) in Remark 1.1.31 is satisfied if the empirical contrast

ρn(θ) = (1/n) Σ_{i=1}^n ρ(Xi, θ)

is convex on a convex open parameter space Θ ⊆ R^k and approaches +∞ on the boundary with probability tending to 1.


    1.2 The Delta Method

The basic intuition of the Delta Method is Taylor expansion.

Theorem 1.2.1. Suppose √n(Xn − a) →_d X as n → ∞. Then

√n(g(Xn) − g(a)) = ġ(a)√n(Xn − a) + oP(1)

and hence

√n(g(Xn) − g(a)) →_d ġ(a)X,

provided g is differentiable at a.

Proof. By Taylor's theorem, ∃R(x, a) s.t.

g(x) = g(a) + (ġ(a) + R(x, a))(x − a),

where R(x, a) → 0 as x → a. Note that if Xn →_P a and R(x, a) → 0 as x → a, then R(Xn, a) →_P 0 (∵ ∀ε > 0 ∃δ > 0 s.t. |x − a| < δ ⇒ |R(x, a)| < ε, which implies

P(|R(Xn, a)| > ε) ≤ P(|Xn − a| ≥ δ) → 0 as n → ∞,

and then R(Xn, a) = oP(1)). Thus

g(Xn) = g(a) + (ġ(a) + R(Xn, a))(Xn − a)

and hence

√n(g(Xn) − g(a)) = ġ(a)√n(Xn − a) + R(Xn, a)·√n(Xn − a) = ġ(a)√n(Xn − a) + oP(1),

since R(Xn, a) = oP(1) and √n(Xn − a) = OP(1). In the multivariate case, the statement becomes ġ(a)^⊤(Xn − a).

    Remark 1.2.2. When g is a function of several variables, the differentiability means the total

    differentiability, which is implied by the existence of “continuous partial derivatives.”
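As a quick numerical check of the theorem (a sketch only; Exp(1) observations and g(x) = x² are arbitrary choices), the scaled difference √n(g(X̄n) − g(µ)) has approximately the variance ġ(µ)²σ² predicted by the delta method:

```python
import numpy as np

# Sketch: X_i ~ Exp(1) (mu = 1, sigma^2 = 1), g(x) = x^2, g'(mu) = 2,
# so sqrt(n)(g(Xbar_n) - g(mu)) should be approximately N(0, 4).
rng = np.random.default_rng(5)
n, reps = 2000, 20000
xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
t = np.sqrt(n) * (xbar**2 - 1.0)
print("sample variance of sqrt(n)(g(Xbar)-g(mu)):", t.var(), " (delta method predicts 4)")
```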


Example 1.2.3. Let (X1, Y1), · · · , (Xn, Yn) be i.i.d. from a population with parameters (µ1, µ2, σ1², σ2², ρ). Let

ρ̂n = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / ( √(Σ_{i=1}^n (Xi − X̄)²) √(Σ_{i=1}^n (Yi − Ȳ)²) ).

(i) As far as the distribution of ρ̂n is concerned, we may assume µ1 = µ2 = 0, σ1² = σ2² = 1. Let

Wi = (Xi, Yi, Xi², Yi², XiYi)^⊤, which are i.i.d. with mean (0, 0, 1, 1, ρ)^⊤,

and Zn = √n(W̄n − (0, 0, 1, 1, ρ)^⊤), i.e.,

Zn1 = √n X̄, Zn2 = √n Ȳ, Zn3 = √n((1/n)ΣXi² − 1), Zn4 = √n((1/n)ΣYi² − 1), Zn5 = √n((1/n)ΣXiYi − ρ).

Note that Zn = OP(1). Then

ρ̂n = [ (1/√n)Zn5 + ρ − ((1/√n)Zn1)((1/√n)Zn2) ]
     / [ √(1 + (1/√n)Zn3 − ((1/√n)Zn1)²) · √(1 + (1/√n)Zn4 − ((1/√n)Zn2)²) ]
 = ( (1/√n)Zn5 + ρ − ((1/√n)Zn1)((1/√n)Zn2) )
   · (1 + (1/√n)Zn3 − ((1/√n)Zn1)²)^{-1/2} · (1 + (1/√n)Zn4 − ((1/√n)Zn2)²)^{-1/2}
 = ( (1/√n)Zn5 + ρ + oP(1/√n) )
   · ( 1 − (1/2)((1/√n)Zn3 − ((1/√n)Zn1)²) + oP(1/√n) ) · ( 1 − (1/2)((1/√n)Zn4 − ((1/√n)Zn2)²) + oP(1/√n) )
 = ( (1/√n)Zn5 + ρ + oP(1/√n) ) · ( 1 − (1/(2√n))Zn3 + oP(1/√n) ) · ( 1 − (1/(2√n))Zn4 + oP(1/√n) )
 = ( (1/√n)Zn5 + ρ + oP(1/√n) ) · ( 1 − (1/(2√n))Zn3 − (1/(2√n))Zn4 + oP(1/√n) )
 = ρ + (1/√n)Zn5 − (ρ/2)( (1/√n)Zn3 + (1/√n)Zn4 ) + oP(1/√n)

holds, so we get

√n(ρ̂n − ρ) = Zn5 − (ρ/2)Zn3 − (ρ/2)Zn4 + oP(1)
 = √n( ((1/n)ΣXiYi − ρ) − (ρ/2)((1/n)ΣXi² − 1) − (ρ/2)((1/n)ΣYi² − 1) ) + oP(1)
 = (1/√n) Σ_{i=1}^n ( XiYi − (ρ/2)Xi² − (ρ/2)Yi² ) + oP(1)
 →_d N( 0, Var(X1Y1 − (ρ/2)X1² − (ρ/2)Y1²) ).

(ii) Now additionally suppose that the (Xi, Yi)'s are from a bivariate normal distribution. Then Y1 − ρX1 is independent of X1. Letting Z1 = Y1 − ρX1, we get Var(Z1) = 1 − ρ², Var(Z1²) = 2(1 − ρ²)², and hence

Var( X1Y1 − (ρ/2)X1² − (ρ/2)Y1² )
 = Var( (1 − ρ²)X1Z1 − (ρ/2)Z1² + (ρ/2)(1 − ρ²)X1² )
 = Var( (ρ/2)(1 − ρ²)X1² − (ρ/2)Z1² ) + Var( (1 − ρ²)X1Z1 ) + 2Cov( (ρ/2)(1 − ρ²)X1² − (ρ/2)Z1², (1 − ρ²)X1Z1 )   [the covariance is 0]
 = (ρ²/4)( (1 − ρ²)²Var(X1²) − 2(1 − ρ²)Cov(X1², Z1²) + Var(Z1²) ) + (1 − ρ²)²Var(X1Z1)
 = (ρ²/4)( 2(1 − ρ²)² + 2(1 − ρ²)² ) + (1 − ρ²)²(1 − ρ²)
 = ρ²(1 − ρ²)² + (1 − ρ²)²(1 − ρ²)
 = (1 − ρ²)²

holds. It implies that

√n(ρ̂n − ρ) →_d N(0, (1 − ρ²)²).

Therefore, if we define h(ρ) by h′(ρ) = (1 − ρ²)^{-1}, i.e.,

h(ρ) = (1/2) log( (1 + ρ)/(1 − ρ) ) ("Fisher's z-transform"),

then we get

√n( h(ρ̂n) − h(ρ) ) →_d N(0, 1),

and with this, we can find a confidence region "with stabilized variance."
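A minimal sketch of such a variance-stabilized interval (not from the lecture; the particular sample value 0.6 and sample size below are arbitrary, and the √n scaling follows the asymptotic result above):

```python
import numpy as np

# Sketch: 95% CI for rho via Fisher's z-transform h(r) = 0.5*log((1+r)/(1-r)) = arctanh(r):
# build the interval on the z-scale (where the variance is 1/n) and map back with tanh.
def fisher_ci(rho_hat, n, z=1.96):
    zhat = np.arctanh(rho_hat)                       # h(rho_hat)
    lo, hi = zhat - z / np.sqrt(n), zhat + z / np.sqrt(n)
    return np.tanh(lo), np.tanh(hi)

print(fisher_ci(0.6, n=100))   # roughly (0.46, 0.71)
```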

    We can also expand with higher order terms.

Theorem 1.2.4 (Higher order stochastic expansion). Let X1, · · · , Xn be a random sample with EX1 = µ and finite Var(X1) = Σ.

(a) (1-dim case) For g such that g̈ exists,

g(X̄n) = g(µ) + (σ/√n)ġ(µ)Zn + (σ²/(2n))g̈(µ)Zn² + oP(1/n),

where Zn = √n(X̄n − µ)/σ.

(b) (general case) For g such that g̈ exists,

g(X̄n) = g(µ) + ġ(µ)^⊤(X̄n − µ) + (1/2)(X̄n − µ)^⊤g̈(µ)(X̄n − µ) + oP(1/n).

Proof. Again, use Taylor's theorem. We only prove (a). Note that

g(x) = g(a) + ġ(a)(x − a) + (1/2)(g̈(a) + R(x, a))(x − a)²

with R(x, a) → 0 as x → a, so letting Zn = √n(X̄n − µ)/σ, we get

g(X̄n) = g(µ) + ġ(µ)(X̄n − µ) + (1/2)g̈(µ)(X̄n − µ)² + (1/2)R(X̄n, µ)(X̄n − µ)²
 = g(µ) + (σ/√n)ġ(µ)Zn + (σ²/(2n))g̈(µ)Zn² + oP(1/n),

since R(X̄n, µ) = oP(1) and (X̄n − µ)² = OP(1/n), which implies the conclusion.

Remark 1.2.5. For the general case, the following notation is also frequently used. For (Zn^i)_{i=1}^d = √n(X̄n − µ),

g(X̄n) = g(µ) + (1/√n) g_{/i}(µ) Zn^i + (1/(2n)) g_{/ij}(µ) Zn^i Zn^j + oP(1/n).

Here we omit the "Σ" (summation convention), i.e.,

g_{/i}(µ)Zn^i := Σ_{i=1}^d g_{/i}(µ)Zn^i,  g_{/ij}(µ)Zn^iZn^j := Σ_{i=1}^d Σ_{j=1}^d g_{/ij}(µ)Zn^iZn^j.

Example 1.2.6 (Estimation of Reliability). Let X1, · · · , Xn be a random sample from Exp(λ), where λ > 0 is the rate. Consider the estimation problem of the "reliability"

η = P(X1 > a) = e^{−aλ}.

(i) Note that

η̂n^{MLE} = e^{−a/X̄}.

Let Zn = √n(λX̄ − 1). Then Zn →_d N(0, 1) and

X̄^{-1} = λ(Zn/√n + 1)^{-1}.

Thus, we get

η̂n^{MLE} = exp( −aλ(Zn/√n + 1)^{-1} )
 = exp( −aλ( 1 − Zn/√n + Zn²/n + oP(1/n) ) )
 = e^{−aλ}( 1 + ( aλZn/√n − aλZn²/n ) + (1/2)( aλZn/√n − aλZn²/n )² + oP(1/n) )
 = e^{−aλ}( 1 + aλZn/√n + (−aλ + (aλ)²/2)Zn²/n + oP(1/n) )
 = η + (aλe^{−aλ}/√n)Zn + ((−aλ + (aλ)²/2)e^{−aλ}/n)Zn² + oP(1/n)   (1.2)

from (1 + x)^{-1} = 1 − x + x² + o(x²) and e^x = 1 + x + x²/2 + o(x²) as x → 0.

(ii) Now consider

η̂^{UMVUE} = ( 1 − a/(nX̄) )^{n−1} I( a/(nX̄) < 1 ).

Note that from P(a/(nX̄) < 1) → 1 as n → ∞, we may work with η̂^{UMVUE} = (1 − a/(nX̄))^{n−1} (see the next remark). Then

log η̂^{UMVUE} = (n − 1) log( 1 − a/(nX̄) )
 = (n − 1) log( 1 − (aλ/n)(Zn/√n + 1)^{-1} )
 = (n − 1) log( 1 − (aλ/n)( 1 − Zn/√n + Zn²/n + oP(1/n) ) )
 = (n − 1) log( 1 − aλ/n + (aλ/(n√n))Zn − (aλ/n²)Zn² + oP(1/n²) )
 = (n − 1){ ( −aλ/n + (aλ/(n√n))Zn − (aλ/n²)Zn² ) − (1/2)( −aλ/n + (aλ/(n√n))Zn − (aλ/n²)Zn² )² } + oP(1/n)
 = −aλ + (aλ/√n)Zn − (aλ/n)Zn² − (aλ)²/(2n) + aλ/n + oP(1/n)
 = −aλ + (aλ/√n)Zn + ( −aλZn² + aλ − (aλ)²/2 )/n + oP(1/n)

implies

η̂^{UMVUE} = exp( −aλ + (aλ/√n)Zn + ( −aλZn² + aλ − (aλ)²/2 )/n + oP(1/n) )
 = e^{−aλ}( 1 + ( (aλ/√n)Zn + (−aλZn² + aλ − (aλ)²/2)/n ) + (1/2)( (aλ/√n)Zn + (−aλZn² + aλ − (aλ)²/2)/n )² ) + oP(1/n)
 = e^{−aλ}( 1 + (aλ/√n)Zn + ( −aλZn² + aλ − (aλ)²/2 + (1/2)(aλ)²Zn² )(1/n) ) + oP(1/n)
 = η + (aλe^{−aλ}/√n)Zn + ((−aλ + (aλ)²/2)e^{−aλ}/n)(Zn² − 1) + oP(1/n)   (1.3)

from log(1 + x) = x − x²/2 + o(x²) as x → 0. Comparing (1.3) to (1.2), we can say that the UMVUE is "closer" than the MLE to η, since the MLE's leading term has bias

(−aλ + (aλ)²/2)e^{−aλ}/n,

while the UMVUE's leading term has no bias. As in this case, when one suggests a new estimator, one often compares the second-order term to judge its asymptotic behavior.
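A simulation sketch of this bias comparison (illustrative; the values of λ, a and n below are arbitrary) makes the O(1/n) bias of the MLE and the unbiasedness of the UMVUE visible:

```python
import numpy as np

# Sketch: compare the bias of the MLE exp(-a/Xbar) with the UMVUE (1 - a/(n*Xbar))^{n-1}
# for eta = P(X1 > a) = exp(-a*lambda); theory predicts MLE bias ~ (-a*lam + (a*lam)^2/2)*e^{-a*lam}/n.
rng = np.random.default_rng(6)
lam, a, n, reps = 1.0, 1.0, 30, 200000
eta = np.exp(-a * lam)

xbar = rng.exponential(1.0 / lam, size=(reps, n)).mean(axis=1)
mle = np.exp(-a / xbar)
umvue = np.where(a / (n * xbar) < 1.0, (1.0 - a / (n * xbar)) ** (n - 1), 0.0)

theory = (-a * lam + (a * lam) ** 2 / 2) * np.exp(-a * lam) / n
print("MLE   bias:", mle.mean() - eta, "  theory:", theory)
print("UMVUE bias:", umvue.mean() - eta)
```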

Remark 1.2.7. If there is an event whose probability converges to 1, then in an asymptotic sense we may ignore its complement, in the following sense:

(i) If P(En) → 1 and P(Xn ≤ x, En) → F(x), then Xn →_d F.

(ii) If P(En) → 1 and Xn = X + OP(n^{−α}) on En, then Xn = X + OP(n^{−α}) in general. The convergence rate of P(En) does not matter!

(∵ P(n^α|Xn − X| ≥ C) ≤ P(n^α|Xn − X| ≥ C, En) + P(En^c), where P(En^c) → 0; take limsup_n and then lim_{C→∞} on both sides.)

Example 1.2.8. Consider the sample correlation coefficient again. Assume EX1⁴ < ∞ and EY1⁴ < ∞. With the notation of Example 1.2.3, using (1 + x)^{-1/2} = 1 − x/2 + 3x²/8 + o(x²), we get

ρ̂n = ( ρ + (1/√n)Zn5 − (1/n)Zn1Zn2 + oP(1/n) )
   · ( 1 − (1/2)((1/√n)Zn3 − Zn1²/n) + (3/8)((1/√n)Zn3 − Zn1²/n)² + oP(1/n) )
   · ( 1 − (1/2)((1/√n)Zn4 − Zn2²/n) + (3/8)((1/√n)Zn4 − Zn2²/n)² + oP(1/n) )
 = ( ρ + (1/√n)Zn5 − (1/n)Zn1Zn2 + oP(1/n) )
   · ( 1 − (1/(2√n))Zn3 + (1/n)((1/2)Zn1² + (3/8)Zn3²) + oP(1/n) )
   · ( 1 − (1/(2√n))Zn4 + (1/n)((1/2)Zn2² + (3/8)Zn4²) + oP(1/n) )
 = ρ + (1/√n)( Zn5 − (ρ/2)Zn3 − (ρ/2)Zn4 )
   + (1/n)( −Zn1Zn2 − (1/2)Zn3Zn5 − (1/2)Zn4Zn5 + ρ( (1/4)Zn3Zn4 + (1/2)Zn1² + (1/2)Zn2² + (3/8)Zn3² + (3/8)Zn4² ) ) + oP(1/n)

holds. The leading term has bias

(1/n){ (ρ/4)EX1²Y1² + (3ρ/8)(EX1⁴ + EY1⁴) − (1/2)(EX1³Y1 + EX1Y1³) }

from

E(Zn3Zn4) = EX1²Y1² − 1,
E(Zn3² + Zn4²) = EX1⁴ + EY1⁴ − 2,
E(Zn1² + Zn2²) = 2,
E(Zn3Zn5 + Zn4Zn5) = EX1³Y1 + EX1Y1³ − 2ρ,
E(Zn1Zn2) = ρ.

For the bivariate normal case, it becomes

−ρ(1 − ρ²)/(2n).

Remark 1.2.9. Note that when using a stochastic expansion we can get the mean, variance, skewness, ... of the leading term and might expect that they approximate the moments of g(X̄n), but this is not true in general! For example, if Xn ∼ Ber(1/n), then

P(nXn > ε) = P(Xn = 1) = 1/n → 0

since Xn = 0 or 1, so nXn = oP(1), but we get E(nXn) = 1 for all n. Thus, we need to check the behavior of the remainder.

As we can see in this example, we cannot say that E(oP(1)) = o(1), i.e., "convergence in probability does not imply convergence in L1." (Similarly, "bounded in probability does not imply uniform integrability.") By Vitali's theorem, it is known that if {Xn} is uniformly integrable, then

Xn →_P X implies EXn → EX as n → ∞.

From now on, we compare the "moments of leading terms" and the "approximation of moments."
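The Bernoulli counterexample in this remark is easy to reproduce numerically (a quick sketch; the Monte Carlo sample size is arbitrary):

```python
import numpy as np

# Sketch of the counterexample: X_n ~ Ber(1/n), so n*X_n = o_P(1)
# (it is 0 with probability 1 - 1/n), yet E(n*X_n) = 1 for every n.
rng = np.random.default_rng(7)
for n in [10, 100, 1000, 10000]:
    xn = rng.binomial(1, 1.0 / n, size=100000)     # Monte Carlo copies of X_n
    print(f"n={n:6d}  P(n*X_n > 0) approx {np.mean(xn > 0):.4f}  E(n*X_n) approx {np.mean(n * xn):.3f}")
```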

Example 1.2.10. Recall the mgf

mgf_X(t) = E e^{tX}, |t| < ε,

and the cgf

cgf_X(t) = log mgf_X(t) = log E e^{tX}, |t| < ε.

For m_r = EX^r, the mgf has the Taylor expansion

mgf_X(t) = 1 + m1 t + (m2/2!) t² + (m3/3!) t³ + · · · ,

and if X and Y are independent,

mgf_{X+Y}(t) = mgf_X(t) · mgf_Y(t)

and

cgf_{aX+b}(t) = cgf_X(at) + bt

hold for constants a and b. From this we get

c_r(aX + b) = a^r c_r(X) for r ≥ 2,

where c_r denotes the r-th cumulant. Also recall that, for A ≈ 0,

log(1 + A) = A − A²/2 + A³/3 − · · · ,

and with this, we can obtain

cgf_X(t) = log( 1 + (mgf_X(t) − 1) ) = c1 t + (c2/2!) t² + (c3/3!) t³ + · · · ,

where

c1 = m1, c2 = m2 − m1², c3 = m3 − 3m1m2 + 2m1³, c4 = m4 − 4m3m1 − 3m2² + 12m2m1² − 6m1⁴, · · · .

If observations are normalized, i.e., m1 = 0 and m2 = 1, then

c1 = 0, c2 = 1, c3 = m3, c4 = m4 − 3.
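A tiny sketch of these moment-to-cumulant relations (the Exp(1) check is an arbitrary choice, used because its cumulants are c_r = (r − 1)!):

```python
# Sketch: first four cumulants from raw moments m_r = E X^r, using the relations above.
def cumulants(m1, m2, m3, m4):
    c1 = m1
    c2 = m2 - m1**2
    c3 = m3 - 3*m1*m2 + 2*m1**3
    c4 = m4 - 4*m3*m1 - 3*m2**2 + 12*m2*m1**2 - 6*m1**4
    return c1, c2, c3, c4

# Check against Exp(1): m_r = r!, so the cumulants should be (1, 1, 2, 6).
print(cumulants(1.0, 2.0, 6.0, 24.0))
```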

Example 1.2.11. Let

Zn = √n(X̄n − µ)/σ = (1/√n) Σ_{i=1}^n (Xi − µ)/σ, where Z̃i := (Xi − µ)/σ =_d Z1.

Then from

cgf_{Zn}(t) = cgf_{n^{-1/2}ΣZ̃i}(t) = n · cgf_{Z1}(t/√n),

we obtain

c_r(Zn) = n · (1/√n)^r c_r(Z1) = n^{−r/2+1} c_r(Z1).

From this, we obtain

EZn³ = c3(Zn) = (1/√n) c3(Z1), (1.4)
EZn⁴ = c4(Zn) + 3 = (1/n) c4(Z1) + 3. (1.5)

Example 1.2.12. Now consider the multivariate case. Let X = (X^1, · · · , X^d)^⊤ and t = (t1, · · · , td)^⊤. Then

mgf_X(t) = E e^{t^⊤X} = E e^{t1X^1+···+tdX^d}

and

m1 = [ (∂/∂ti) mgf_X(t) |_{t=0} ]_i,
m2 = [ (∂²/∂ti∂tj) mgf_X(t) |_{t=0} ]_{i,j},
m3 = [ (∂³/∂ti∂tj∂tk) mgf_X(t) |_{t=0} ]_{i,j,k},
...

and we get

mgf_X(t) = 1 + Σ_i m1(i) ti + (1/2!) Σ_{i,j} m2(i,j) titj + (1/3!) Σ_{i,j,k} m3(i,j,k) titjtk + · · · .

If the data are centered, i.e., EX = 0, then m1 = 0 and so

cgf_X(t) = (1/2!) Σ_{i,j} m2(i,j) titj + (1/3!) Σ_{i,j,k} m3(i,j,k) titjtk + (1/4!) Σ_{i,j,k,l} ( m4(i,j,k,l) − 3m2(i,j)m2(k,l) ) titjtktl + · · · .

Now let Zn = (Zn^1, · · · , Zn^d)^⊤ = √n(X̄n − µ). Then

cgf_{Zn}(t) = n · cgf_{Z1}(t/√n)

implies

E Zn^i Zn^j =: σ^{ij}, (σ^{ij})_{i,j} = Var(X1),
E Zn^i Zn^j Zn^k = (1/√n) c3(i,j,k),
E Zn^i Zn^j Zn^k Zn^l = (1/n) c4(i,j,k,l) + 3σ^{ij}σ^{kl},

where

c3(i,j,k) = c3(Z1)(i,j,k) = E Z1^i Z1^j Z1^k,
c4(i,j,k,l) = c4(Z1)(i,j,k,l) = E Z1^i Z1^j Z1^k Z1^l − 3(E Z1^i Z1^j)(E Z1^k Z1^l).

Proposition 1.2.13 (Moments of the leading terms). Let X1, · · · , Xn be i.i.d. with E|X1|⁴ < ∞.¹

(a) (Univariate case) For g such that g̈ exists and Zn = √n(X̄n − µ)/σ,

g(X̄n) = Wn + oP(n^{-1}), where Wn := g(µ) + (σ/√n)ġ(µ)Zn + (σ²/(2n))g̈(µ)Zn²,

and for the leading term Wn,

E(Wn) = g(µ) + (σ²/(2n))g̈(µ),
Var(Wn) = (σ²/n)(ġ(µ))² + (1/n²)( σ³c3(Z1)ġ(µ)g̈(µ) + (1/2)σ⁴(g̈(µ))² ) + O(n^{-3}),
E(Wn − EWn)³ = (1/n²)( σ³c3(Z1)(ġ(µ))³ + 3σ⁴(ġ(µ))²g̈(µ) ) + O(n^{-3}).

(b) (Multivariate case) For g such that g̈ exists and (Zn^i) = √n(X̄n − µ),

g(X̄n) = Wn + oP(n^{-1}), where Wn := g(µ) + (1/√n)g_{/i}(µ)Zn^i + (1/(2n))g_{/ij}(µ)Zn^iZn^j,

and for the leading term Wn,

E(Wn) = g(µ) + (1/(2n))g_{/ij}(µ)σ^{ij},
Var(Wn) = (1/n)g_{/i}(µ)g_{/j}(µ)σ^{ij} + O(n^{-2}),

where (σ^{ij}) = Var(X1).

¹ In fact, a stronger condition is needed: the mgf of X1 exists.

Proof. (a) Nothing but tedious calculation. First,

E(Wn) = g(µ) + (σ²/(2n))g̈(µ)

is easily obtained. Next, note that

Var(Wn) = Var( (σ/√n)ġ(µ)Zn + (σ²/(2n))g̈(µ)Zn² )
 = (σ²/n)(ġ(µ))² Var(Zn) + (σ³/(n√n))ġ(µ)g̈(µ) Cov(Zn, Zn²) + (σ⁴/(4n²))(g̈(µ))² Var(Zn²).

First, Var(Zn) = 1. Also, Var(Zn²) = E(Zn⁴) − [E(Zn²)]² = 2 + n^{-1}c4(Z1) from (1.5). Finally, Cov(Zn, Zn²) = EZn³ − EZn·EZn² = n^{-1/2}c3(Z1) from (1.4). Now we get

Var(Wn) = (σ²/n)(ġ(µ))² + (σ³/n²)ġ(µ)g̈(µ)c3(Z1) + (σ⁴/(2n²))(g̈(µ))² + (σ⁴/(4n³))(g̈(µ))²c4(Z1)
 = (σ²/n)(ġ(µ))² + (1/n²)( σ³c3(Z1)ġ(µ)g̈(µ) + (1/2)σ⁴(g̈(µ))² ) + O(n^{-3}).

For E(Wn − EWn)³, note that EZn⁵ = O(n^{-1/2}) and EZn⁶ = O(1). (Check!) Then

E(Wn − EWn)³ = E[ (σ/√n)ġ(µ)Zn + (σ²/(2n))g̈(µ)(Zn² − 1) ]³
 = (σ³/(n√n))ġ(µ)³ EZn³   [= n^{-1/2}c3(Z1)]
   + (3σ⁴/(2n²))ġ(µ)²g̈(µ) E[Zn²(Zn² − 1)]   [= n^{-1}c4(Z1) + 2]
   + (3σ⁵/(4n²√n))ġ(µ)g̈(µ)² E[Zn(Zn² − 1)²] + (σ⁶/(8n³))g̈(µ)³ E[(Zn² − 1)³]   [both O(n^{-3})]
 = (1/n²)( σ³c3(Z1)(ġ(µ))³ + 3σ⁴(ġ(µ))²g̈(µ) ) + O(n^{-3})

by (1.4) and (1.5).

(b) Note that

Wn = g(µ) + (1/√n)g_{/i}(µ)Zn^i + (1/(2n))g_{/ij}(µ)Zn^iZn^j = g(µ) + (1/√n)Σ_{i=1}^d g_{/i}(µ)Zn^i + (1/(2n))Σ_{i,j=1}^d g_{/ij}(µ)Zn^iZn^j,

so

Var(Wn) = (1/n)Σ_{i,j} g_{/i}(µ)g_{/j}(µ)Cov(Zn^i, Zn^j) + (1/(n√n))Σ_i Σ_{k,l} g_{/i}(µ)g_{/kl}(µ)Cov(Zn^i, Zn^kZn^l)
   + (1/(4n²))Σ_{i,j}Σ_{k,l} g_{/ij}(µ)g_{/kl}(µ)Cov(Zn^iZn^j, Zn^kZn^l)
 = (1/n)g_{/i}(µ)g_{/j}(µ)σ^{ij} + (1/(n√n))g_{/i}(µ)g_{/kl}(µ)·E Zn^iZn^kZn^l + (1/(4n²))g_{/ij}(µ)g_{/kl}(µ)( E Zn^iZn^jZn^kZn^l − (E Zn^iZn^j)(E Zn^kZn^l) )
 = (1/n)g_{/i}(µ)g_{/j}(µ)σ^{ij} + (1/n²)g_{/i}(µ)g_{/kl}(µ)c3(i,k,l) + (1/(4n²))g_{/ij}(µ)g_{/kl}(µ)( (1/n)c4(i,j,k,l) + 2σ^{ij}σ^{kl} )
 = (1/n)g_{/i}(µ)g_{/j}(µ)σ^{ij} + O(n^{-2}).

Proposition 1.2.14 (Approximation of moments). Let X1, · · · , Xn be i.i.d. with E|X1|⁴ < ∞.

(a) (Univariate case) For g with bounded and continuous g^{(j)} (j = 0, 1, · · · , 4),

E(g(X̄n)) = g(µ) + (σ²/(2n))g^{(2)}(µ) + O(n^{-2}),
Var(g(X̄n)) = (σ²/n)g^{(1)}(µ)² + O(n^{-2}),

where µ = EX1, σ² = Var(X1).

(b) (Multivariate case) For g with bounded and continuous g_{/I} (|I| = 0, 1, · · · , 4),

E(g(X̄n)) = g(µ) + (1/(2n))g_{/ij}(µ)σ^{ij} + O(n^{-2}),
Var(g(X̄n)) = (1/n)g_{/i}(µ)g_{/j}(µ)σ^{ij} + O(n^{-2}),

where µ = EX1, (σ^{ij}) = Var(X1).

Proof. (a) Note that

g(X̄n) = g(µ) + (σ/√n)g^{(1)}(µ)Zn + (σ²/(2n))g^{(2)}(µ)Zn² + (1/3!)(σ³/(n√n))g^{(3)}(µ)Zn³ + Rn,

where

Rn = (1/4!)(σ⁴/n²)g^{(4)}(ξn)Zn⁴, ξn : a number between µ and X̄n.

Note that

E|Rn| ≤ (1/4!)(σ⁴/n²) sup_x |g^{(4)}(x)| · EZn⁴ = O(n^{-2})

from the boundedness of g^{(4)} and EZn⁴ = 3 + n^{-1}c4(Z1). Also note that

(σ³/(n√n)) EZn³ = (σ³/(n√n)) (1/√n)c3(Z1) = O(n^{-2}).

From these, we obtain

Eg(X̄n) = g(µ) + (σ²/(2n))g^{(2)}(µ) + O(n^{-2}).

Next, for the variance, we get

Var(g(X̄n)) = (σ²/n)g^{(1)}(µ)² Var(Zn) + (σ⁴/(4n²))g^{(2)}(µ)² Var(Zn²)
 + O(n^{-3})Var(Zn³) + Var(Rn)
 + O(n^{-3/2})Cov(Zn, Zn²) + O(n^{-2})Cov(Zn, Zn³) + O(n^{-5/2})Cov(Zn², Zn³)
 + O(n^{-1/2})Cov(Zn, Rn) + O(n^{-1})Cov(Zn², Rn) + O(n^{-3/2})Cov(Zn³, Rn).

Note that

Var(Zn) = 1, Var(Zn²) = 2 + (1/n)c4(Z1) = O(1), Var(Zn³) = EZn⁶ − (EZn³)² ≤ O(1),
Var(Rn) ≤ ERn² = O(n^{-4})·EZn⁸ ≤ O(n^{-4}),
|Cov(Zn, Rn)| ≤ √Var(Zn) √Var(Rn) ≤ O(n^{-2}),
|Cov(Zn², Rn)| ≤ √Var(Zn²) √Var(Rn) ≤ O(n^{-2}),
|Cov(Zn³, Rn)| ≤ √Var(Zn³) √Var(Rn) ≤ O(n^{-2}),
Cov(Zn, Zn²) = EZn³ − (EZn)(EZn²) = O(n^{-1/2}),
|Cov(Zn, Zn³)| ≤ √Var(Zn) √Var(Zn³) ≤ O(1),
|Cov(Zn², Zn³)| ≤ √Var(Zn²) √Var(Zn³) ≤ O(1).

From these, we get

Var(g(X̄n)) = (σ²/n)g^{(1)}(µ)²·1 + (σ⁴/(4n²))g^{(2)}(µ)²·O(1) + O(n^{-3})·O(1) + O(n^{-4})
 + O(n^{-3/2})·O(n^{-1/2}) + O(n^{-2})·O(1) + O(n^{-5/2})·O(1)
 + O(n^{-1/2})·O(n^{-2}) + O(n^{-1})·O(n^{-2}) + O(n^{-3/2})·O(n^{-2})
 = (σ²/n)g^{(1)}(µ)² + O(n^{-2}).

(b) Recall Taylor's theorem for a multivariate function: for a (k+1)-times continuously differentiable function f : R^n → R,

f(x) = Σ_{|α|≤k} ( D^α f(a)/α! )(x − a)^α + Σ_{|β|=k+1} R_β(x)(x − a)^β,

where for α = (α1, · · · , αn) ∈ N^n we define |α| = α1 + · · · + αn, α! = α1!···αn!, x^α = x1^{α1}···xn^{αn}, and

R_β(x) = (|β|/β!) ∫_0^1 (1 − t)^{|β|−1} D^β f(a + t(x − a)) dt.

Here,

D^α f = ∂^{|α|} f / (∂x1^{α1} ··· ∂xn^{αn}).

Using this, we obtain

g(X̄n) = g(µ) + (1/√n)g_{/i}(µ)Zn^i + (1/2!)(1/n)g_{/ij}(µ)Zn^iZn^j + (1/3!)(1/(n√n))g_{/ijk}(µ)Zn^iZn^jZn^k + Rn,

where

|Rn| ≤ (1/(6n²)) | ∫_0^1 (1 − u)³ g_{/ijkl}( µ + (1/√n)uZn ) Zn^iZn^jZn^kZn^l du |.

By boundedness of the g_{/I} and

E|Zn^iZn^jZn^kZn^l| ≤ (1/4)E( (Zn^i)⁴ + (Zn^j)⁴ + (Zn^k)⁴ + (Zn^l)⁴ ) = O(1) ("AM-GM"),

we obtain the desired result by a procedure similar to the univariate case.

Example 1.2.15 (Estimation of reliability). Let X1, · · · , Xn be a random sample from Exp(λ), λ > 0. Define

η = Pλ(X1 > a) = e^{−aλ}.

Then

η̂n^{MLE} = exp(−a/X̄).

Here, g(x) = exp(−a/x) is infinitely differentiable with bounded derivatives. Thus

E(η̂n^{MLE}) = η + (−aλ + (aλ)²/2)e^{−aλ}/n + O(n^{-2}),
Var(η̂n^{MLE}) = (aλ)²e^{−2aλ}/n + O(n^{-2}),

the leading terms of which agree with the mean and variance of the leading term in its stochastic expansion. On the other hand, for

η̂n^{UMVUE} = ( 1 − a/(nX̄) )^{n−1} I( a/(nX̄) < 1 ),

we cannot apply the result for the approximation of moments. But it can be shown that its variance is also approximated by the same approximation formula.

Remark 1.2.16. The boundedness of the derivatives for the approximation of moments is rather stronger than needed. Whenever the approximation can be proved, the formulae agree with the moments of the leading term of the stochastic expansion, so only the validity of the order of the remainder needs to be proved. For example, in the bivariate normal case, the mean and variance of the sample correlation coefficient can be approximated as follows:

E(ρ̂n) = ρ − ρ(1 − ρ²)/(2n) + O(n^{-2}),
Var(ρ̂n) = (1 − ρ²)²/n + O(n^{-2}).
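These two approximations are easy to check by simulation (a sketch only; ρ, n and the Monte Carlo size below are arbitrary choices):

```python
import numpy as np

# Sketch: check E(rho_hat) - rho ~ -rho(1-rho^2)/(2n) and Var(rho_hat) ~ (1-rho^2)^2/n
# for bivariate normal data.
rng = np.random.default_rng(8)
rho, n, reps = 0.5, 30, 100000
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal([0.0, 0.0], cov, size=(reps, n))   # shape (reps, n, 2)
xc = z[..., 0] - z[..., 0].mean(axis=1, keepdims=True)
yc = z[..., 1] - z[..., 1].mean(axis=1, keepdims=True)
rho_hat = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))
print("bias    :", rho_hat.mean() - rho, "  theory:", -rho * (1 - rho**2) / (2 * n))
print("variance:", rho_hat.var(), "  theory:", (1 - rho**2) ** 2 / n)
```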

    1.3 Asymptotic Theory

    1.3.1 MLE in Exponential Family

Proposition 1.3.1. Let X1, · · · , Xn be a random sample from a population with pdf

pη(x) = h(x) exp(η^⊤T(x) − A(η)) I_X(x), η ∈ E,

where E is the natural parameter space in R^k. Further, assume that

(i) E is open.

(ii) The family is of rank k.

Then

(a) η̂n^{MLE} = η + (Ä(η))^{-1}(T̄n − Ȧ(η)) + o_{Pη}(n^{-1/2}) = η + (−l̈n(η))^{-1}l̇n(η) + o_{Pη}(n^{-1/2}).

(b) √n(η̂n^{MLE} − η) →_d N(0, (Ä(η))^{-1}) = N(0, I1^{-1}(η)).

Proof. Recall that, under the same assumptions,

(T̄n ∈ C⁰) ⊆ (Ȧ(η̂n^{MLE}) = T̄n) ⊆ ( η̂n^{MLE} = (Ȧ)^{-1}(T̄n) ),

with the probabilities of these events tending to 1 (see Theorem 1.1.20). Also note that, by the full rank condition, Ȧ is one-to-one and differentiable on E with Ä > 0.

Now let t = Ȧ(η). Then by the Inverse Function Theorem (see the next remark), Ȧ^{-1} is also differentiable and

D(Ȧ^{-1})(t) = (∂/∂t)Ȧ^{-1}(t) = (Ä(η))^{-1}.

The CLT implies

√n(T̄n − EηT(X1)) →_d N(0, VarηT(X1)),

and recall that EηT(X1) = Ȧ(η), VarηT(X1) = Ä(η). Therefore, the ∆-method implies

η̂n^{MLE} = Ȧ^{-1}(T̄n) = Ȧ^{-1}(t) + (Ä(η))^{-1}(T̄n − t) + oP(|T̄n − t|) = η + (Ä(η))^{-1}(T̄n − t) + o_{Pη}(n^{-1/2}),

and hence

√n(η̂n^{MLE} − η) = (Ä(η))^{-1}√n(T̄n − t) + o_{Pη}(1) →_d N(0, (Ä(η))^{-1}VarηT(X1)(Ä(η))^{-1}) = N(0, VarηT(X1)^{-1}).

The rest is obtained from T̄n − Ȧ(η) = l̇n(η)/n and −l̈n(η) = nÄ(η).

Remark 1.3.2 (Inverse Function Theorem). Let F : U → R^d, where U ⊆ R^d is an open set. If

(i) F is one-to-one,

(ii) F is Fréchet differentiable near x0 ∈ U,

(iii) DF_{x0} := [ ∂F_i/∂x_j |_{x=x0} ]_{i,j} is invertible,

then F^{-1} : F(U) → U is also Fréchet differentiable at y0 = F(x0), and it satisfies D(F^{-1})(y0) = (DF_{x0})^{-1}.

Example 1.3.3 (Multinomial case). Let X1, · · · , Xn be a random sample from Multi(1, p(θ)), where p(θ) = (p1(θ), · · · , pk(θ))^⊤ and θ ∈ Θ ⊆ R. For example, consider the Hardy-Weinberg proportions p(θ) = (θ², 2θ(1 − θ), (1 − θ)²)^⊤, 0 < θ < 1. Assume that

(i) Θ is open, 0 < pi(θ) < 1, and Σ_{i=1}^k pi(θ) = 1.

(ii) p(θ) = (p1(θ), · · · , pk(θ))^⊤ is twice (totally) differentiable.

Then we can derive the asymptotic distribution of an estimator of θ. Let θ = h(p(θ)) for any θ ∈ Θ, for some differentiable function h. Let

p̂(θ) = (1/n) Σ_{i=1}^n Xi = (N1/n, · · · , Nk/n)^⊤,

where Nj = Σ_{i=1}^n I(Xij = 1). Then we get

Eθ p̂(θ) = p(θ)

and hence

h(X̄n) = h(p(θ)) + ḣ(p(θ))^⊤(X̄n − p(θ)) + o(|X̄n − p(θ)|).

Note that Zn := √n(X̄n − p(θ)) = OP(1), and it has the asymptotic distribution

Zn →_d N(0, Σ(θ)), Σ(θ) := diag(p(θ)) − p(θ)p(θ)^⊤,

and therefore we get

√n( h(X̄n) − h(p(θ)) ) = ḣ(p(θ))^⊤Zn + oP(1),

which implies

√n( h(X̄n) − h(p(θ)) ) →_d N(0, σh²(θ)),

where

σh²(θ) = ḣ(p(θ))^⊤ Σ(θ) ḣ(p(θ)).

Furthermore, we can obtain that

σh²(θ) ≥ I1^{-1}(θ),

where equality holds iff

ḣ(p(θ))^⊤(X1 − p(θ)) = I1^{-1}(θ) l̇1(θ).

This can be shown as follows. First, note that

ḣ(p(θ))^⊤ ṗ(θ) = 1.

It implies that

1 = ḣ(p(θ))^⊤ (∂/∂θ)EθX1
 = ḣ(p(θ))^⊤ (∂/∂θ) ∫ x f(x; θ) dµ(x)
 = ḣ(p(θ))^⊤ ∫ x (∂/∂θ)f(x; θ) dµ(x)
 = ḣ(p(θ))^⊤ Covθ( X1, (∂/∂θ)l1(θ) )   (∵ Eθ (∂/∂θ)f(X1; θ)/f(X1; θ) = Eθ (∂/∂θ)l1(θ) = 0)
 = Covθ( ḣ(p(θ))^⊤X1, (∂/∂θ)l1(θ) )
 ≤ √( Varθ( ḣ(p(θ))^⊤X1 ) Varθ( (∂/∂θ)l1(θ) ) )
 = √( ḣ(p(θ))^⊤Σ(θ)ḣ(p(θ)) · I1(θ) ).

Here "=" holds when ḣ(p(θ))^⊤X1 and ∂l1(θ)/∂θ have a linear relationship, i.e.,

ḣ(p(θ))^⊤(X1 − p(θ)) = I1^{-1}(θ) l̇1(θ).
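A small sketch for the Hardy-Weinberg case (illustrative; the particular h(p) = p1 + p2/2, i.e., the allele frequency, and the chosen θ and n are assumptions of this sketch, not part of the example): for this h one can check that σh²(θ) = θ(1 − θ)/2 = I1^{-1}(θ), so the bound is attained.

```python
import numpy as np

# Sketch: with h(p) = p1 + p2/2, theta_hat = h(p_hat) has asymptotic variance
# theta(1-theta)/2, which equals I_1(theta)^{-1} for the Hardy-Weinberg model.
rng = np.random.default_rng(9)
theta, n, reps = 0.3, 500, 100000
p = np.array([theta**2, 2*theta*(1-theta), (1-theta)**2])

counts = rng.multinomial(n, p, size=reps)
theta_hat = counts[:, 0] / n + counts[:, 1] / (2 * n)       # h(p_hat) = p1_hat + p2_hat/2
print("empirical var of sqrt(n)(theta_hat - theta):", n * theta_hat.var())
print("theory  I_1(theta)^{-1} = theta(1-theta)/2 :", theta * (1 - theta) / 2)
```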

Remark 1.3.4. Actually, the previous example shows how to deal with the asymptotic distribution of an FSE more generally.

    1.3.2 Asymptotic Normality of MCE

    Our real goal of this section is right here.

    Theorem 1.3.5 (Asymptotic Normality of MCE). Let X1, · · · , Xn be a random sample from

    Pθ, where θ ∈ Θ and parameter space Θ is open in Rk. Let

    θ̂MCEn = arg minθ∈Θ

    1

    n

    n∑i=1

    ρ(Xi, θ)

    θ0 = arg minθ∈Θ

    Eθ0ρ(X1, θ).

    Under the assumption of their existence, let

    Ψ1(θ) = Ψ(X1, θ) =∂

    ∂θρ(X1, θ), Ψn(θ) =

    1

    n

    n∑i=1

    Ψ(Xi, θ)

    Ψ̇1(θ) =∂Ψ1(θ)

    ∂θ, Ψ̇n(θ) =

    ∂Ψn(θ)

    ∂θ

    Ψ̈1(θ) =∂2Ψ1(θ)

    ∂θ2, Ψ̈n(θ) =

    ∂2Ψn(θ)

    ∂θ2

    Assume that

    (A0) Pθ0

    (Ψn(θ̂

    MCEn ) = 0

    )−−−→n→∞

    1.

    (A1) Eθ0Ψ1(θ0) = 0.

    (A2) V arθ0(Ψ1(θ0)) exists.

    (A3) Eθ0(Ψ̇1(θ0)) exists and is nonsingular.


    (A4) ∃δ > 0 and ∃M(X1) = Mθ0,δ(X1) s.t. max_{|θ−θ0|≤δ, θ∈Θ} |Ψ̈1(θ)| ≤ M(X1), where Eθ0M(X1) < ∞.

    (A5) θ̂MCEn is consistent, i.e., θ̂MCEn →Pθ0 θ0.

    Then

    √n(θ̂MCEn − θ0) →d N(0, ΣΨ(θ0)) as n → ∞,   where ΣΨ(θ0) := (−Eθ0Ψ̇1(θ0))−1 Varθ0Ψ1(θ0) (−Eθ0Ψ̇1(θ0))−1.

    Remark 1.3.6 (Derivatives of vector-valued maps). For a map F : Rk → Rd and a point x0, we use the following notation.

    (1) (1st-order gradient)

    ∂F/∂x(x0) := DF(x0) = (∂F/∂x1(x0), ∂F/∂x2(x0), · · · , ∂F/∂xk(x0)) ∈ Rd×k,

    where ∂F/∂xj(x0) = (∂F1/∂xj(x0), · · · , ∂Fd/∂xj(x0))> is a column vector. It can be interpreted as "a linear map."

    (2) (2nd-order gradient)

    ∂²F/∂x²(x0) := D²F(x0) = ((∂/∂x1)(∂F/∂x)(x0), · · · , (∂/∂xk)(∂F/∂x)(x0)) ∈ Rd×k×k,

    where (∂/∂xi)(∂F/∂x)(x0) is a d × k matrix with ∂²F/∂xi∂xj(x0) as its jth column vector. Note that it can be interpreted as "a bi-linear map."

    (3) (Taylor expansion of a vector-valued map)

    F(x) ≈ F(x0) + ∑_{j=1}^k (∂F/∂xj)(x0)(xj − x0j) + (1/2) ∑_{i,j} (∂²F/∂xi∂xj)(x0)(xi − x0i)(xj − x0j)
         = F(x0) + [∂F/∂x(x0)](x − x0) + (1/2)(x − x0)>[∂²F/∂x²(x0)](x − x0),

    where ∂F/∂x(x0) is a matrix, x − x0 is a vector, and ∂²F/∂x²(x0) is a 3-array. In here, "matrix DF(x0) × vector" becomes a vector, and the "quadratic form with the 3-array D²F(x0)" becomes vector-valued. In this view, DF(x0) and D²F(x0) can be interpreted as a linear and a bi-linear map, respectively.

    Proof. By Taylor's theorem, ∃θ∗n ∈ line(θ0, θ̂n) such that ‖θ∗n − θ0‖ ≤ ‖θ̂n − θ0‖ and

    Ψn(θ̂n) = Ψn(θ0) + Ψ̇n(θ0)(θ̂n − θ0) + (1/2)(θ̂n − θ0)>Ψ̈n(θ∗n)(θ̂n − θ0).

    Also note that, from (A4) and (A5),

    lim_{K→∞} sup_n Pθ0(|Ψ̈n(θ∗n)| ≥ K) = 0

    holds, and hence Ψ̈n(θ∗n) = OPθ0(1), i.e.,

    Ψ̈n(θ∗n)(θ̂n − θ0) = oPθ0(1).

    (∵ Motivation: since Ψ̈n is dominated by an integrable majorant, it is "uniformly integrable," hence it is "tight." On the event {|θ̂n − θ0| ≤ δ} we have |Ψ̈n(θ∗n)| ≤ (1/n)∑_{i=1}^n M(Xi), so from

    Pθ0(|Ψ̈n(θ∗n)| > K) = Pθ0(|Ψ̈n(θ∗n)| > K, |θ̂n − θ0| ≤ δ) + Pθ0(|Ψ̈n(θ∗n)| > K, |θ̂n − θ0| > δ)
                        ≤ Pθ0((1/n)∑_{i=1}^n M(Xi) > K) + Pθ0(|θ̂n − θ0| > δ)        (by (A4))
                        ≤ (1/K)Eθ0M(X1) + Pθ0(|θ̂n − θ0| > δ),

    where the last term tends to 0 by (A5), for any ε > 0 we get, for large N,

    sup_{n>N} Pθ0(|Ψ̈n(θ∗n)| > K) ≤ (1/K)Eθ0M(X1) + ε,

    which implies

    lim_{K→∞} sup_{n>N} Pθ0(|Ψ̈n(θ∗n)| > K) = 0.

    (Let N be larger and take ε ↘ 0.)) Thus, we get

    Ψn(θ̂n) = Ψn(θ0) + Ψ̇n(θ0)(θ̂n − θ0) + (1/2)(θ̂n − θ0)>Ψ̈n(θ∗n)(θ̂n − θ0)
            = Ψn(θ0) + (Ψ̇n(θ0) + (1/2)(θ̂n − θ0)>Ψ̈n(θ∗n))(θ̂n − θ0)
            = Ψn(θ0) + (Ψ̇n(θ0) + oPθ0(1))(θ̂n − θ0)
            = Ψn(θ0) + (Eθ0Ψ̇1(θ0) + oPθ0(1))(θ̂n − θ0).

    Now, note that

    (1) Pθ0(Ψn(θ̂n) = 0) → 1 as n → ∞.

    (2) Eθ0Ψ̇1(θ0) + oPθ0(1) is nonsingular with probability tending to 1 (see Remark 1.3.7).

    (3) If Xn = Yn + OP(an) on En with P(En) → 1 as n → ∞, then Xn = Yn + OP(an) (on the whole space),
    and the same holds for oP (see Remark 1.2.7).

    Thus, on the set with probability tending to 1,

    θ̂n − θ0 = (−Eθ0Ψ̇1(θ0) + oPθ0(1))−1 Ψn(θ0) = [(−Eθ0Ψ̇1(θ0))−1 + oPθ0(1)] Ψn(θ0)

    holds, which yields

    √n(θ̂n − θ0) = (−Eθ0Ψ̇1(θ0))−1 √nΨn(θ0) + oPθ0(1),   where √nΨn(θ0) = OPθ0(1),

    and therefore

    √n(θ̂n − θ0) →d N(0, (−Eθ0Ψ̇1(θ0))−1 Varθ0Ψ1(θ0) (−Eθ0Ψ̇1(θ0))−1) as n → ∞.
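    A small Monte Carlo sketch of the sandwich formula for an illustrative (not from the notes) contrast ρ(x, θ) = (x − θ)⁴ with N(θ0, σ²) data: here Ψ1(θ) = −4(X1 − θ)³, Eθ0Ψ̇1(θ0) = 12σ², Varθ0Ψ1(θ0) = 240σ⁶, so ΣΨ(θ0) = 240σ⁶/(12σ²)² = (5/3)σ², larger than the MLE's σ². The parameter values are arbitrary choices.

import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
theta0, sigma, n, reps = 1.0, 2.0, 300, 5000

est = np.empty(reps)
for r in range(reps):
    x = rng.normal(theta0, sigma, n)
    # θ̂_MCE solves Ψ̄_n(θ) ∝ (1/n)Σ (Xi-θ)^3 = 0, which is strictly decreasing in θ
    est[r] = brentq(lambda t: np.mean((x - t)**3), x.min(), x.max())

print("n·Var(θ̂_MCE):", n * est.var())       # ≈ ΣΨ(θ0) = 5σ²/3
print("sandwich     :", 5 * sigma**2 / 3)
print("MLE variance :", sigma**2)             # I1(θ0)^{-1} = σ² for the normal mean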

    Remark 1.3.7. If A ∈ Rd×d is a symmetric positive definite matrix, then for any small perturbation
    ∆ s.t. ‖∆‖2 < σmin(A), we have rank(A + ∆) = d. Note that rank(A) = d. In here, σmin(A) denotes the
    smallest eigenvalue of A, and ‖ · ‖p is the matrix norm induced by the corresponding Lp vector norm, i.e.,

    ‖∆‖p = sup_{x:‖x‖p=1} ‖∆x‖p.


    Proof. (Motivation: for c ≈ 0 and x ≠ 0,

    1/(x + c) = 1/x − c/x² + c²/x³ − · · ·

    exists, or, for small vectors u, v, A + uv> is nonsingular if A is invertible, from

    (A + uv>)−1 = A−1 − (A−1uv>A−1)/(1 + v>A−1u).)

    Suppose rank(A + ∆) < d. Then ∃x0 ≠ 0 s.t. (A + ∆)x0 = 0 and ‖x0‖2 = 1. Then by the definition of the matrix norm,

    ‖∆‖2 ≥ ‖∆x0‖2 = ‖Ax0‖2 ≥ σmin(A)

    holds. The last inequality follows from the spectral theorem. This contradicts our assumption that ∆ is small.
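    A quick numerical check of the remark; the random A and ∆ below are arbitrary choices made only for the illustration.

import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)                      # symmetric positive definite
sigma_min = np.linalg.eigvalsh(A).min()          # smallest eigenvalue σ_min(A)

D = rng.standard_normal((5, 5))
D *= 0.9 * sigma_min / np.linalg.norm(D, 2)      # scale so that ‖∆‖₂ < σ_min(A)

print("σ_min(A)    :", sigma_min)
print("‖∆‖₂        :", np.linalg.norm(D, 2))
print("rank(A + ∆) :", np.linalg.matrix_rank(A + D))   # stays equal to 5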

    1.3.3 Asymptotic Normality and Efficiency of MLE

    Note that MLE is just a special case of MCE.

    Theorem 1.3.8. Let X1, · · · , Xn be a random sample from Pθ, where θ ∈ Θ and parameter

    space Θ is open in Rk. Recall that MLE is an MCE with

    ρ(x, θ) = − log pθ(x), pθ : pdf of Pθ.

    Under the assumption of their existence, denote

    l̇n(θ) = ∑_{i=1}^n (∂/∂θ) log pθ(Xi),    l̈n(θ) = ∑_{i=1}^n (∂²/∂θ²) log pθ(Xi),    l⃛n(θ) = ∑_{i=1}^n (∂³/∂θ³) log pθ(Xi),

    with averages l̇n(θ)/n, l̈n(θ)/n, l⃛n(θ)/n, and single-observation versions l̇1(θ), l̈1(θ), l⃛1(θ).

    Also assume that

    (M0) Pθ0(l̇n(θ̂MLEn) = 0) → 1 as n → ∞.

    (M1) Eθ0 l̇1(θ0) = 0.

    (M2) I(θ0) = Varθ0(l̇1(θ0)) exists.


    (M3) Eθ0(l̈1(θ0)) exists and is nonsingular.

    (M4) ∃δ0 > 0 and ∃M(X1) = Mθ0,δ0(X1) s.t. max_{|θ−θ0|≤δ0, θ∈Θ} |l⃛1(θ)| ≤ M(X1), where Eθ0M(X1) < ∞.

    (M5) θ̂MLEn is consistent, i.e., θ̂MLEn →Pθ0 θ0.

    Then

    √n(θ̂MLEn − θ0) →d N(0, I(θ0)−1)    and    √n(θ̂MCEn − θ0) →d N(0, ΣΨ(θ0))    as n → ∞,

    with ΣΨ(θ0) − I(θ0)−1 being nonnegative definite (i.e., "ΣΨ(θ0) ≥ I(θ0)−1"), and "=" holds if and only if

    Ψ1(θ0) = (−Eθ0Ψ̇1(θ0)) I1(θ0)−1 l̇1(θ0).    ("Determining contrast function")

    Proof. By the Cauchy–Schwarz inequality,

    Corr(λ>l̇1(θ0), γ>Ψ1(θ0)) = λ>(−Eθ0Ψ̇1(θ0))γ / [√(λ>I1(θ0)λ) √(γ>Varθ0Ψ1(θ0)γ)]
        = λ>I1(θ0)^{1/2} I1(θ0)^{−1/2}(−Eθ0Ψ̇1(θ0))γ / [√(λ>I1(θ0)λ) √(γ>Varθ0Ψ1(θ0)γ)]
        ≤ √(λ>I1(θ0)λ) √(γ>(−Eθ0Ψ̇1(θ0))I1(θ0)−1(−Eθ0Ψ̇1(θ0))γ) / [√(λ>I1(θ0)λ) √(γ>Varθ0Ψ1(θ0)γ)]
        = √(γ>(−Eθ0Ψ̇1(θ0))I1(θ0)−1(−Eθ0Ψ̇1(θ0))γ) / √(γ>Varθ0Ψ1(θ0)γ)

    holds, and hence

    max_{λ≠0} Corr(λ>l̇1(θ0), γ>Ψ1(θ0)) = √(γ>(−Eθ0Ψ̇1(θ0))I1(θ0)−1(−Eθ0Ψ̇1(θ0))γ) / √(γ>Varθ0Ψ1(θ0)γ)

    is obtained (we can find a λ that achieves the maximum). Since a correlation coefficient is at most 1, we get

    γ>(−Eθ0Ψ̇1(θ0))I1(θ0)−1(−Eθ0Ψ̇1(θ0))γ ≤ γ>Varθ0Ψ1(θ0)γ

    for any γ. Using (−Eθ0Ψ̇1(θ0))−1γ instead of γ, we obtain that

    γ>I1(θ0)−1γ ≤ γ>(−Eθ0Ψ̇1(θ0))−1Varθ0Ψ1(θ0)(−Eθ0Ψ̇1(θ0))−1γ,

    i.e.,

    I1(θ0)−1 ≤ (−Eθ0Ψ̇1(θ0))−1Varθ0Ψ1(θ0)(−Eθ0Ψ̇1(θ0))−1 = ΣΨ(θ0).

    Equality holds iff γ>Ψ1(θ0) is (a.s.) a linear function of l̇1(θ0) for every γ, i.e., γ>(−Eθ0Ψ̇1(θ0))−1Ψ1(θ0) = γ>I1(θ0)−1l̇1(θ0) for all γ, which amounts to

    Ψ1(θ0) = (−Eθ0Ψ̇1(θ0)) I1(θ0)−1 l̇1(θ0).

    Remark 1.3.10. The claim that the MLE has smaller asymptotic variance than any other asymptotically
    normal estimator was known as Fisher's conjecture. It is true for a certain class of estimators in a
    regular parametric model. The essential property for such a comparison is "uniform" convergence to
    the normal distribution, as can be seen in the following example.

    Example 1.3.11 (Hodges' example: "superefficient estimator"). Let X̄ be the sample mean of a random
    sample of size n from N(θ, 1), and let

    θ̂sn = X̄ I(|X̄| > n−1/4).

    (Actually, not only 1/4, but any positive exponent less than 1/2 works.)


    If θ ≠ 0, then

    Pθ(θ̂sn = X̄) = 1 − Pθ(|X̄| ≤ n−1/4)
                 = 1 − Pθ(−n−1/4 ≤ X̄ ≤ n−1/4)
                 = 1 − Pθ(−√nθ − n1/4 ≤ √n(X̄ − θ) ≤ −√nθ + n1/4)
                 = 1 − [Φ(−√nθ + n1/4) − Φ(−√nθ − n1/4)],

    and both Φ(−√nθ + n1/4) and Φ(−√nθ − n1/4) tend to 0 if θ > 0 and to 1 if θ < 0, so Pθ(θ̂sn = X̄) = Pθ(|X̄| > n−1/4) −→ 1 as n → ∞.

    Thus,

    Pθ(√n(θ̂sn − θ) ≤ x) = Pθ(√n(θ̂sn − θ) ≤ x, θ̂sn = X̄) + Pθ(√n(θ̂sn − θ) ≤ x, θ̂sn ≠ X̄)
                          = Pθ(√n(X̄ − θ) ≤ x, θ̂sn = X̄) + Pθ(√n(θ̂sn − θ) ≤ x, θ̂sn ≠ X̄)
                          −→ Φ(x) as n → ∞

    holds, i.e., √n(θ̂sn − θ) →d N(0, 1) under Pθ.

    Now, assume that θ = 0. Then √nX̄ ∼ N(0, 1), so

    Pθ(θ̂sn = 0) = Pθ(|X̄| ≤ n−1/4) = Φ(n1/4) − Φ(−n1/4) −→ 1 as n → ∞.

    Then we obtain lim_{n→∞} Pθ(θ̂sn = 0) = 1, and therefore,

    lim_{n→∞} Pθ(√n(θ̂sn − θ) ≤ x) = lim_{n→∞} Pθ(0 ≤ x) = 1 if x ≥ 0 and 0 if x < 0, i.e., the limit is I[0,∞)(x),

    which yields √n(θ̂sn − θ) →d N(0, 0) (the point mass at 0) under Pθ.


    Thus, we get

    √n(θ̂sn − θ) →d N(0, σ²s(θ)) under Pθ as n → ∞,

    where σ²s(θ) = 1 if θ ≠ 0 and σ²s(θ) = 0 if θ = 0. That is, θ̂sn is a "superefficient" estimator: at θ = 0 its
    asymptotic variance beats the "optimal" value I(θ)−1 = 1, i.e., Fisher's conjecture is wrong as stated!
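    A short simulation of Hodges' estimator illustrates σ²s(θ) above, and also the poor behaviour for θ near n−1/4, which is the non-uniformity discussed in the next remark. Since X̄ ∼ N(θ, 1/n) exactly, X̄ is drawn directly; the values of n, the replication count and the seed are arbitrary.

import numpy as np

rng = np.random.default_rng(4)
n, reps = 2000, 200000

for theta in (1.0, 0.0, n**-0.25):            # the last value exposes the non-uniformity
    xbar = rng.normal(theta, 1/np.sqrt(n), size=reps)
    hodges = np.where(np.abs(xbar) > n**-0.25, xbar, 0.0)
    nmse = n * np.mean((hodges - theta)**2)
    print(f"theta={theta:.4f}:  n·E(θ̂ˢ−θ)² ≈ {nmse:.3f}   (for X̄ it is 1.000)")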

    Remark 1.3.12. Note that Fisher's conjecture is wrong in general, but it is "partially correct" in
    "regular" models. The key requirement is "uniform convergence" in the CLT. First note that

    √n(θ̂n − θ) →d N(0, ν(θ)) under Pθ as n → ∞

    only means "pointwise" convergence for each θ. It is equivalent to, for some metric d and the law
    (distribution function) Lθ under Pθ,

    d(Lθ(√n(θ̂n − θ)), Φν(θ)) −→ 0 as n → ∞,

    where Φν(θ) is the cdf of N(0, ν(θ)). A "regular estimator," on the other hand, requires this convergence
    to hold uniformly over shrinking neighborhoods of θ0, i.e.,

    sup_{θ:|θ−θ0|≤C/√n} d(Lθ(√n(θ̂n − θ)), Φν(θ)) −→ 0 for each C > 0;

    the Hodges estimator fails this uniformity near θ = 0, which is how it can be superefficient at a single point.


    Example 1.3.13 (Linear model with stochastic covariates). Consider the model

    Y |Z = z ∼ N(α + z>β, σ2), α ∈ R, β ∈ Rk, σ2 > 0,

    where E(Z) = 0 and Var(Z) is non-singular. Assume that there are n independent copies of

    Y and Z, and denote Z(n) = (Z1, · · · , Zn)>. Then as long as the distribution of Z does not

    depend on θ = (α, β, σ2), the MLE of β is equivalent to the least squares estimator, since

    ln(θ) = −(1/2σ²) ∑_{i=1}^n (Yi − α − zi>β)² − (n/2) log 2πσ² + ∑_{i=1}^n log pdfZ(zi).

    In other words,

    β̂MLE = (Z̃(n)>Z̃(n))−1Z̃(n)>Y,    Z̃(n) = (Z1 − Z̄, · · · , Zn − Z̄)>,
    α̂MLE = Ȳ − Z̄>β̂MLE,
    σ̂²MLE = (1/n)‖Y − (1α̂MLE + Z(n)β̂MLE)‖².

    It also says that

    l1(θ) = −(1/2σ²)(Y1 − α − z1>β)² − (1/2) log 2πσ² + log pdfZ(z1),

    and hence

    l̇1(θ) = (ε1/σ², z1>ε1/σ², ε1²/(2σ⁴) − 1/(2σ²))>,    where ε1 = Y1 − α − z1>β.

    Thus

    I(θ) = diag(1/σ², E(Z1Z1>)/σ², 1/(2σ⁴))    (block diagonal).

    (It can be easily obtained from ε1|z1 ∼ N(0, σ²). Note that

    Var(ε1) = E Var(ε1|z1) + Var E(ε1|z1) = σ²,
    Var(z1ε1) = E Var(z1ε1|z1) + Var E(z1ε1|z1) = E(z1σ²z1>) + Var(z1 · 0) = σ²E(Z1Z1>),

    and similarly we can obtain Var(ε1²).) It implies that

    √n(θ̂MLEn − θ) →d N(0, Σ) as n → ∞,

    where

    Σ = I(θ)−1 = diag(σ², σ²(E(Z1Z1>))−1, 2σ⁴)    (block diagonal),

    provided that E(Z1Z1>), or equivalently Var(Z1), is non-singular.
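    A quick Monte Carlo sketch of this example with k = 2 covariates; the particular Var(Z), β, sample sizes and seed below are arbitrary choices. The β-block of n·Cov(θ̂MLE) should match σ²(E Z1Z1>)−1.

import numpy as np

rng = np.random.default_rng(5)
alpha, beta, sigma = 0.5, np.array([1.0, -2.0]), 1.5
n, reps = 400, 5000
SigmaZ = np.array([[1.0, 0.3], [0.3, 2.0]])          # Var(Z), with E(Z) = 0

beta_hat = np.empty((reps, 2))
for r in range(reps):
    Z = rng.multivariate_normal(np.zeros(2), SigmaZ, size=n)
    Y = alpha + Z @ beta + rng.normal(0, sigma, n)
    X = np.column_stack([np.ones(n), Z])             # design with intercept
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)     # least squares = MLE of (α, β)
    beta_hat[r] = coef[1:]

print("n·Cov(β̂):\n", n * np.cov(beta_hat, rowvar=False))
print("σ²·Var(Z)⁻¹:\n", sigma**2 * np.linalg.inv(SigmaZ))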

    1.3.4 Asymptotic Null distribution of LRT

    Consider a random sample X1, · · · , Xn from Pθ, where θ ∈ Θ ⊆ Rk. Denote θ = (ξ>, η>)>,

    where η ∈ Rk0 . We wish to test

    H0 : ξ = ξ0 vs H1 : ξ ≠ ξ0.

    (Note that it is composite null!) Let Θ be a k-dimensional open set, and

    Θ0 = {(ξ>0 , η>)> : (ξ>0 , η>)> ∈ Θ}

    be a k0-dimensional open set. Now denote

    l̇(θ) = (l̇1(θ)>, l̇2(θ)>)>,    I(θ) = [I11(θ)  I12(θ); I21(θ)  I22(θ)],

    where

    l̇1(θ) = (∂/∂ξ) l(θ),    l̇2(θ) = (∂/∂η) l(θ).

    Theorem 1.3.14 (Asymptotic null distribution of LRT). Assume (M0) ∼ (M6). Then under H0 : ξ = ξ0,

    2(l(θ̂n) − l(θ̂0n)) →d χ²(k − k0) as n → ∞,

    where

    θ̂n = arg max_{θ∈Θ} l(θ),    θ̂0n = arg max_{θ∈Θ0} l(θ).

    Proof. Recall that

    l(θ0) = l(θ̂n) + (1/2)√n(θ̂n − θ0)>[(1/n)l̈(θ0) + oPθ0(1)]√n(θ̂n − θ0)

    holds (the linear term vanishes since l̇(θ̂n) = 0 with probability tending to 1; throughout this proof l̇(θ0) denotes the average score (1/n)∑_{i=1}^n (∂/∂θ) log pθ0(Xi)). Since √n(θ̂n − θ0) = OPθ0(1) and OPθ0(1) · oPθ0(1) = oPθ0(1), we get

    2(l(θ̂n) − l(θ0)) = √n(θ̂n − θ0)>I(θ0)√n(θ̂n − θ0) + oPθ0(1),

    from −(1/n)l̈(θ0) →Pθ0 I(θ0). Also recall that

    √n(θ̂n − θ0) = I(θ0)−1√n l̇(θ0) + oPθ0(1).

    It implies that

    2(l(θ̂n) − l(θ0)) = √n l̇(θ0)>I(θ0)−1√n l̇(θ0) + oPθ0(1).

    Similarly, we get

    2(l(θ̂0n) − l(θ0)) = √n(η̂0n − η0)>I22(θ0)√n(η̂0n − η0) + oPθ0(1)

    and

    √n(η̂0n − η0) = I22(θ0)−1√n l̇2(θ0) + oPθ0(1)

    under H0, so we get

    2(l(θ̂0n) − l(θ0)) = √n l̇2(θ0)>I22(θ0)−1√n l̇2(θ0) + oPθ0(1)

    under H0. Thus, we get

    2(l(θ̂n) − l(θ̂0n)) = √n l̇(θ0)>I(θ0)−1√n l̇(θ0) − √n l̇2(θ0)>I22(θ0)−1√n l̇2(θ0) + oPθ0(1)

    under H0. Now note that

    I(θ0)−1 = [I11(θ0)  I12(θ0); I21(θ0)  I22(θ0)]−1
            = [J1  0; −I22(θ0)−1I21(θ0)  J2] [I11·2(θ0)−1  0; 0  I22(θ0)−1] [J1  −I12(θ0)I22(θ0)−1; 0  J2]

    holds, for identity matrices J1 and J2 of suitable sizes, where I11·2(θ0) := I11(θ0) − I12(θ0)I22(θ0)−1I21(θ0). Then

    l̇(θ0)>I(θ0)−1l̇(θ0) = u> [I11·2(θ0)−1  0; 0  I22(θ0)−1] u,    u := (l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0); l̇2(θ0)),

    holds, so we get

    2(l(θ̂n) − l(θ̂0n)) = √n(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0))>I11·2(θ0)−1√n(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0)) + oPθ0(1). (1.6)

    Now by the CLT,

    √n(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0)) →d (J1, −I12(θ0)I22(θ0)−1) N(0, I(θ0))
        = N(0, I11(θ0) − I12(θ0)I22(θ0)−1I21(θ0)) = N(0, I11·2(θ0)),

    and hence,

    2(l(θ̂n) − l(θ̂0n)) →d χ²(k − k0) as n → ∞.
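    A Monte Carlo sketch of the theorem for N(µ, σ²) data with H0 : µ = µ0 and nuisance parameter σ² (so k = 2, k0 = 1), where the LRT statistic has the closed form n log(σ̂0²/σ̂²). Parameter values, sample sizes and the seed are arbitrary choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
mu0, sigma, n, reps = 0.0, 2.0, 200, 20000

x = rng.normal(mu0, sigma, size=(reps, n))
s2_hat = x.var(axis=1)                          # (1/n)Σ(Xi − X̄)²   : MLE of σ² on Θ
s2_null = ((x - mu0)**2).mean(axis=1)           # (1/n)Σ(Xi − μ0)²  : MLE of σ² on Θ0
lrt = n * np.log(s2_null / s2_hat)              # 2(l(θ̂n) − l(θ̂0n))

# Compare the simulated null distribution with χ²(k − k0) = χ²(1)
qs = [0.5, 0.9, 0.95, 0.99]
print("simulated quantiles:", np.quantile(lrt, qs).round(3))
print("chi2(1) quantiles  :", stats.chi2.ppf(qs, df=1).round(3))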

    Remark 1.3.15. (i) The extension to a null hypothesis

    H0 : g1(θ) = 0, · · · , gk1(θ) = 0

    is rather trivial by considering a smooth reparametrization

    ξ1 = g1(θ), · · · , ξk1 = gk1(θ),    η1 = gk1+1(θ), · · · , ηk0 = gk(θ)

    whenever such gj(θ) (j = k1 + 1, · · · , k) can be found.

    (ii) Large-sample confidence sets based on the maximum likelihood are obtained by duality (inverting the tests).

    Remark 1.3.16. Note that to perform the LRT we should find the MLE on both Θ and Θ0, which may not be easy.
    Thus our interest is in finding asymptotically equivalent tests.


    1.3.5 Asymptotic Equivalents of LRT

    Let X1, · · · , Xn be a random sample from Pθ, where θ ∈ Θ ⊆ Rk. For θ = (ξ>, η>)> ∈ Θ, we

    again wish to test

    H0 : ξ = ξ0 vs H1 : ξ ≠ ξ0.

    We assume the regularity conditions, and in addition, that I(θ) is continuous. Also, we denote the
    true parameter under H0 as θ0 = (ξ0>, η0>)>.

    Lemma 1.3.17. (a) From

    √n(θ̂MLEn − θ0) = I(θ0)−1√n l̇(θ0) + oPθ0(1),

    we get

    I11(θ0)(ξ̂n − ξ0) + I12(θ0)(η̂n − η0) = l̇1(θ0) + oPθ0(n−1/2),
    I21(θ0)(ξ̂n − ξ0) + I22(θ0)(η̂n − η0) = l̇2(θ0) + oPθ0(n−1/2).

    (b) √n(η̂0n − η0) = I22(θ0)−1√n l̇2(θ0) + oPθ0(1), i.e., η̂0n − η0 = I22(θ0)−1l̇2(θ0) + oPθ0(n−1/2).

    (c) 2(l(θ̂n) − l(θ̂0n)) = √n(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0))>I11·2(θ0)−1√n(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0)) + oPθ0(1).

    Theorem 1.3.18 (Wald's test). Let

    Wn = √n(θ̂n − θ̂0n)>I(θ̂0n)√n(θ̂n − θ̂0n).

    Then,

    Wn = √n(ξ̂n − ξ0)>I11·2(θ0)√n(ξ̂n − ξ0) + oPθ0(1) = 2(l(θ̂n) − l(θ̂0n)) + oPθ0(1).

    Proof. First note that, by continuity of I and consistency of θ̂0n,

    Wn = n(θ̂n − θ̂0n)>(I(θ0) + oPθ0(1))(θ̂n − θ̂0n) = n(θ̂n − θ̂0n)>I(θ0)(θ̂n − θ̂0n) + oPθ0(1).


    Then, we get

    Wn = n(θ̂n − θ̂0n)>I(θ0)(θ̂n − θ̂0n) + oPθ0(1)
       = n[(ξ̂n − ξ0)>I11(θ0)(ξ̂n − ξ0) + 2(ξ̂n − ξ0)>I12(θ0)(η̂n − η̂0n) + (η̂n − η̂0n)>I22(θ0)(η̂n − η̂0n)] + oPθ0(1)
       = n[(ξ̂n − ξ0)>(I11(θ0)(ξ̂n − ξ0) + I12(θ0)(η̂n − η̂0n)) + ((ξ̂n − ξ0)>I12(θ0) + (η̂n − η̂0n)>I22(θ0))(η̂n − η̂0n)] + oPθ0(1),

    and from

    I11(θ0)(ξ̂n − ξ0) + I12(θ0)(η̂n − η̂0n) = I11(θ0)(ξ̂n − ξ0) + I12(θ0)(η̂n − η0) − I12(θ0)(η̂0n − η0)
        = l̇1(θ0) − I12(θ0)(η̂0n − η0) + oPθ0(n−1/2)
        = l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0) + oPθ0(n−1/2)

    and

    (ξ̂n − ξ0)>I12(θ0) + (η̂n − η̂0n)>I22(θ0) = (ξ̂n − ξ0)>I12(θ0) + (η̂n − η0)>I22(θ0) − (η̂0n − η0)>I22(θ0)
        = l̇2(θ0)> − (I22(θ0)I22(θ0)−1l̇2(θ0))> + oPθ0(n−1/2) = oPθ0(n−1/2),

    we get

    Wn = n(ξ̂n − ξ0)>(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0)) + oPθ0(1),

    since ξ̂n − ξ0 = OPθ0(n−1/2) and η̂n − η̂0n = OPθ0(n−1/2). Now by part (a) of the lemma, we get

    ξ̂n − ξ0 = I11·2(θ0)−1(l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0)) + oPθ0(n−1/2),

    and hence,

    Wn = n(ξ̂n − ξ0)>I11·2(θ0)(ξ̂n − ξ0) + oPθ0(1).


    Using part (a) again, we obtain

    Wn = 2(l(θ̂n) − l(θ̂0n)) + oPθ0(1)

    by (1.6).

    Remark 1.3.19. The leading term in the previous theorem depends on the unknown η0 through I11·2(θ0),
    so in practice it is estimated by

    n(ξ̂n − ξ0)>I11·2(θ̂0n)(ξ̂n − ξ0)    (the "estimated Wald statistic"),

    which is also the leading term of the LRT.

    Theorem 1.3.20 (Rao's test; score test statistic). Let

    Rn = n l̇(θ̂0n)>Î(θ0)−1l̇(θ̂0n),

    where

    Î(θ0) = −l̈(θ̂0n)/n or I(θ̂0n),

    i.e., the "observed information." Define

    Î11·2(θ0) = Î11(θ0) − Î12(θ0)Î22(θ0)−1Î21(θ0),

    just as

    I11·2(θ0) = I11(θ0) − I12(θ0)I22(θ0)−1I21(θ0).

    Then,

    Rn = n l̇1(θ̂0n)>Î11·2(θ0)−1l̇1(θ̂0n) = 2(l(θ̂n) − l(θ̂0n)) + oPθ0(1).

    Remark 1.3.21. Note that, unlike the Wald or LRT statistics, to obtain Rn we only need the MLE under
    the null, θ̂0n.

    Proof. Note that, under H0, l̇2(θ̂0n) = 0 holds, so

    l̇(θ̂0n) = (l̇1(θ̂0n)>, 0>)>,

    and we get

    Rn = n l̇1(θ̂0n)>Î11·2(θ0)−1l̇1(θ̂0n)

    directly (the (1,1) block of Î(θ0)−1 equals Î11·2(θ0)−1 by the block-inversion formula above). Now, with a Taylor expansion, ∃θ̂0,∗n ∈ line((ξ0>, η̂0n>)>, (ξ0>, η0>)>) s.t.

    l̇1(θ̂0n) − l̇1(θ0) = [(∂/∂η)l̇1(θ)|θ=θ̂0,∗n](η̂0n − η0)

    (by the chain rule, and ξ0 − ξ0 = 0). Thus by continuity of I,

    l̇1(θ̂0n) − l̇1(θ0) = (−I12(θ0) + oPθ0(1))(η̂0n − η0)

    holds. In the same way, we get

    l̇2(θ̂0n) − l̇2(θ0) = (−I22(θ0) + oPθ0(1))(η̂0n − η0),

    but from l̇2(θ̂0n) = 0,

    η̂0n − η0 = (I22(θ0) + oPθ0(1))−1l̇2(θ0) = [I22(θ0)−1 + oPθ0(1)]l̇2(θ0)

    is obtained. Therefore,

    l̇1(θ̂0n) = l̇1(θ0) − [I12(θ0) + oPθ0(1)](η̂0n − η0)
             = l̇1(θ0) − [I12(θ0) + oPθ0(1)][I22(θ0)−1 + oPθ0(1)]l̇2(θ0)
             = l̇1(θ0) − I12(θ0)I22(θ0)−1l̇2(θ0) + oPθ0(n−1/2)

    holds, and the proof is completed by using

    Î11·2(θ0) = I11·2(θ0) + oPθ0(1)

    and (1.6).

    Remark 1.3.22. Note that, for a simple null H0 : θ = θ0,

    Wn = n(θ̂n − θ0)>I(θ0)(θ̂n − θ0),    Rn = n l̇(θ0)>I(θ0)−1l̇(θ0).

    In this case, k0 = 0, so Wn and Rn converge (in distribution) to χ²(k) under H0. For the composite
    null, θ0 is estimated by θ̂0n.
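    As a small numerical sketch of the three statistics for a simple null, take Bernoulli(p) data with H0 : p = p0, where I(p0) = 1/(p0(1 − p0)) and the average score is l̇(p0) = (p̂ − p0)/(p0(1 − p0)). In this particular model Wn and Rn coincide exactly, and all three statistics are close for moderate n; the values of p0, n and the seed are arbitrary.

import numpy as np
from scipy.special import xlogy

rng = np.random.default_rng(7)
p0, n = 0.4, 500
x = rng.binomial(1, p0, n)
phat = x.mean()

info = 1 / (p0 * (1 - p0))                  # I(p0)
score = (phat - p0) * info                  # average score l̇(p0)

W = n * (phat - p0)**2 * info               # Wald statistic
R = n * score**2 / info                     # Rao (score) statistic; equals W here
LR = 2 * n * (xlogy(phat, phat/p0) + xlogy(1-phat, (1-phat)/(1-p0)))   # LRT statistic

print("Wald:", round(W, 4), " Rao:", round(R, 4), " LRT:", round(LR, 4))
# All three are asymptotically equivalent and χ²(1) under H0.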

    Example 1.3.23 (GoF test in a multinomial model). Let X1, · · · , Xn be a random sample

    from Multi(1, (p1, · · · ,