Lecture Note: Theory of Statistics
Weak convergence
Byeong Uk Park1
1 Department of Statistics, Seoul National University
1 Weak Convergence in Rk
Let X and Xn, n ≥ 1, be random vectors taking values in Rk. These random
vectors are allowed to be defined on different probability spaces. Below, for the
simplicity of notation, we denote all probability measures associated with
random vectors simply by P although they are defined on different probability
spaces. Define Fn(x) = P(Xn ≤ x) and F(x) = P(X ≤ x). For a function
f : Rk → R, let Cf = {x ∈ Rk : f is continuous at x}.
Definition. We say a sequence of random vectors {Xn} converges weakly to
a random vector X, and we write Xn →d X, if P(Xn ∈ A) converges to
P(X ∈ A) for all Borel sets A ⊂ Rk with P(X ∈ ∂A) = 0. In the case of Rk,
Xn →d X if and only if Fn(x) → F(x) for all x ∈ CF.
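The CDF characterization can be checked numerically. The sketch below is a toy Monte Carlo check (the helper names are mine, not from the notes): it takes Xn = (ξ1 + · · · + ξn)/√n for iid ±1 signs, so Xn →d N(0, 1) by the classical CLT, and compares Fn(x) with the limit F(x) at the continuity point x = 0.5.

```python
import math
import random

def standard_normal_cdf(x):
    # F(x) for the N(0, 1) limit, computed via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def empirical_cdf_of_standardized_sum(n, x, reps=5000, seed=0):
    # Monte Carlo estimate of Fn(x) = P(Xn <= x), where
    # Xn = (sum of n iid +/-1 signs) / sqrt(n)
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        s = sum(rng.choice((-1, 1)) for _ in range(n))
        if s / math.sqrt(n) <= x:
            count += 1
    return count / reps

# Fn(0.5) should be close to F(0.5) for moderately large n
fn = empirical_cdf_of_standardized_sum(200, 0.5)
f = standard_normal_cdf(0.5)
print(abs(fn - f))
```

Note that only continuity points of F matter in the definition; at an atom of the limit distribution such a comparison could fail even when Xn →d X.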
Theorem 1.1. (Skorokhod representation theorem). Suppose that
Xn →d X. Then, there exist X∗ and X∗n defined on the same probability
space such that
X∗n =d Xn, X∗ =d X and X∗n →a.s. X∗.
Theorem 1.2. (Continuous mapping theorem). If Xn →a.s. X and
P(X ∈ Cf) = 1 for a real-valued function f, then f(Xn) →a.s. f(X).
Proof. From the conditions of the theorem,
1 = P(X ∈ Cf, lim_{n→∞} Xn = X)
  ≤ P(X ∈ Cf, lim_{n→∞} f(Xn) = f(X))
  ≤ P(lim_{n→∞} f(Xn) = f(X)).
Theorem 1.3. If Xn →a.s. X, then Xn →d X.
Proof. Fix x ∈ CF and consider f = I(−∞,x](·). The function f is bounded
and Cf = Rk − {x}. Since
P(X ∈ Cf) = P(X ≠ x) = 1 − [F(x) − F(x−)] = 1, we get from
Theorem 1.2 that f(Xn) →a.s. f(X). Since f is bounded and Xn →a.s. X,
the Dominated Convergence Theorem implies
Fn(x) = Ef(Xn) → Ef(X) = F(x).
The converse of Theorem 1.3 is not true. The following theorem presents a set
of equivalent definitions of weak convergence.
Theorem 1.4. The following are equivalent.
(a) Fn(x)→ F (x) for all x ∈ CF .
(b) Ef(Xn)→ Ef(X) for all bounded f : Rk → R with P (X ∈ Cf ) = 1.
(c) Ef(Xn)→ Ef(X) for all bounded and continuous f : Rk → R.
(d) Ef(Xn)→ Ef(X) for all bounded and uniformly continuous f : Rk → R.
Proof. The implications (b) ⇒ (c) ⇒ (d) are trivial. We prove the
implications (a) ⇒ (b) and (d) ⇒ (a). To prove (a) ⇒ (b), we use
Theorems 1.1 and 1.2. Let X∗ and X∗n be defined on the same probability
space such that X∗n =d Xn, X∗ =d X and X∗n →a.s. X∗. Let f : Rk → R be
bounded and satisfy P(X ∈ Cf) = 1. Since X∗ =d X, we also have
P(X∗ ∈ Cf) = 1. Then, by Theorem 1.2 we obtain f(X∗n) →a.s. f(X∗).
Applying the DCT to f(X∗n), we get
Ef(Xn) = Ef(X∗n) → Ef(X∗) = Ef(X).
To prove (d) ⇒ (a), let x ∈ CF be fixed, and for such an x let f+m : Rk → R
be defined by
f+m(u) =
  1                       if u ≤ x;
  (m/k) 1⊤(x + 1/m − u)   if x ≤ u ≤ x + 1/m;
  0                       if u ≥ x + 1/m.
The function f+m is uniformly continuous and 0 ≤ I(−∞,x](·) ≤ f+m ≤ 1. Also,
let f−m : Rk → R be defined by
f−m(u) =
  1                       if u ≤ x − 1/m;
  (m/k) 1⊤(x − u)         if x − 1/m ≤ u ≤ x;
  0                       if u ≥ x.
The function f−m is also uniformly continuous and 0 ≤ f−m ≤ I(−∞,x](·) ≤ 1. It
follows that
lim_{m→∞} f+m(u) = I(−∞,x](u) and lim_{m→∞} f−m(u) = I(−∞,x)(u) for all u ∈ Rk.
Thus, (d) implies that, for all m ≥ 1,
lim sup_{n→∞} Fn(x) ≤ lim sup_{n→∞} Ef+m(Xn) = Ef+m(X),
lim inf_{n→∞} Fn(x) ≥ lim inf_{n→∞} Ef−m(Xn) = Ef−m(X).
By the DCT, we also have
lim_{m→∞} Ef−m(X) = F(x−) and lim_{m→∞} Ef+m(X) = F(x).
These results give
F(x−) ≤ lim inf_{n→∞} Fn(x) ≤ lim sup_{n→∞} Fn(x) ≤ F(x).
Since x ∈ CF, we have F(x−) = F(x), and hence lim_{n→∞} Fn(x) = F(x).
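In one dimension (k = 1) the sandwich functions used in the proof are easy to implement. The sketch below (function names are mine) codes f+m and f−m for k = 1 and verifies the pointwise ordering 0 ≤ f−m ≤ I(−∞,x] ≤ f+m ≤ 1 on a grid.

```python
def f_plus(u, x, m):
    # Uniformly continuous upper envelope of the indicator I(u <= x) for k = 1:
    # equals 1 for u <= x, decreases linearly to 0 on [x, x + 1/m]
    if u <= x:
        return 1.0
    if u >= x + 1.0 / m:
        return 0.0
    return m * (x + 1.0 / m - u)

def f_minus(u, x, m):
    # Uniformly continuous lower envelope: 1 for u <= x - 1/m, 0 for u >= x
    if u <= x - 1.0 / m:
        return 1.0
    if u >= x:
        return 0.0
    return m * (x - u)

def indicator(u, x):
    return 1.0 if u <= x else 0.0

# Check the sandwich 0 <= f_minus <= I(-inf, x] <= f_plus <= 1 on a grid
x = 0.3
grid = [i / 100.0 for i in range(-100, 201)]
ok = all(0.0 <= f_minus(u, x, 10) <= indicator(u, x) <= f_plus(u, x, 10) <= 1.0
         for u in grid)
print(ok)
```

Letting m grow makes both envelopes converge to the indicator, which is exactly what drives the F(x−) and F(x) limits in the proof.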
Remark. For other equivalent definitions of weak convergence, see the
portmanteau theorem on page 24 of
Billingsley, P. (1968). Convergence of Probability Measures. John Wiley &
Sons, New York.
2 Weak Convergence in Metric Spaces
We have studied the notion of weak convergence for random variables taking
values in Rk. Here, we extend the notion to random elements taking values in
a function space. We start with weak convergence in a metric space, and then
specialize the notion to the space of continuous functions defined on [0, 1] and
also to the space of cadlag functions.
Let (S,S) be a metric space, where S denotes the Borel σ-field of S. Let Xn
and X be random elements taking values in S. This means that Xn and X are
measurable mappings from a probability space (Ω,F , P ) to the metric space
(S,S).
Definition. We say a sequence of random elements {Xn} converges weakly
to a random element X, and we write Xn →d X, if P(Xn ∈ A) converges to
P(X ∈ A) for all Borel sets A ∈ S with P(X ∈ ∂A) = 0. Equivalently, we say
Xn →d X if Ef(Xn) → Ef(X) for any bounded and uniformly continuous
real-valued function f.
In the case where S = R∞ equipped with the sup-metric
d(x, y) = sup_{k≥1} sup_{1≤i≤k} |xi − yi|,
a sequence of random elements Xn = (Xn,1, Xn,2, . . .) converges weakly to
X = (X1, X2, . . .) if every finite-dimensional distribution of Xn converges
weakly to the corresponding finite-dimensional distribution of X, i.e., if
(Xn,1, . . . , Xn,k) converges weakly to (X1, . . . , Xk) for all k ≥ 1. But this is
not true when S is a function space.
Example. Consider the space of continuous functions defined on the interval
[0, 1] with d(x, y) = sup_{0≤t≤1} |x(t) − y(t)|. Define
Xn(t) =
  nt       if 0 ≤ t ≤ n−1;
  2 − nt   if n−1 ≤ t ≤ 2n−1;
  0        if 2n−1 ≤ t ≤ 1,
and X(t) ≡ 0. Then, Xn does not converge weakly to X. To see this, take
A = B(0, 1/2) = {y : d(y, 0) ≤ 1/2}. For this Borel set A, we have
P(X ∈ ∂A) = P(d(X, 0) = 1/2) = 0 and P(X ∈ A) = P(d(X, 0) ≤ 1/2) = 1,
but P(Xn ∈ A) = P(d(Xn, 0) ≤ 1/2) = 0. Nevertheless, every finite-dimensional
distribution of Xn converges weakly to the corresponding finite-dimensional
distribution of X, since for any 0 ≤ t1 < t2 < · · · < tk ≤ 1,
(Xn(t1), . . . , Xn(tk)) =d (X(t1), . . . , X(tk)),
provided t1 > 2/n.
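The tent functions in this example are easy to compute. The sketch below (names mine) checks both halves of the argument on a grid: d(Xn, 0) = 1 for every n, while Xn(t) vanishes at any fixed t > 0 once 2/n < t.

```python
def X_n(t, n):
    # The tent function from the example: rises to height 1 at t = 1/n,
    # falls back to 0 at t = 2/n, and is 0 afterwards
    if t <= 1.0 / n:
        return n * t
    if t <= 2.0 / n:
        return 2.0 - n * t
    return 0.0

n = 50
grid = [i / 10000.0 for i in range(10001)]

# Sup distance to the zero function stays 1 for every n ...
sup_dist = max(abs(X_n(t, n)) for t in grid)

# ... while at any fixed t > 0, X_n(t) = 0 as soon as 2/n < t
pointwise = X_n(0.25, n)
print(sup_dist, pointwise)
```

So the finite-dimensional distributions converge (here they are eventually exactly those of X), yet weak convergence in the sup metric fails.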
The main task is to find a plausible set of sufficient conditions that ensure
Xn →d X. A useful criterion for weak convergence in a metric space is given
below.
Theorem 2.1. Xn →d X if and only if each subsequence {Xn′} contains a
further subsequence {Xn′′} such that Xn′′ →d X.
Proof. We only need to prove the 'if' part. Suppose Xn does not converge
weakly to X. Then, there exists a bounded and uniformly continuous function
f : S → R such that Ef(Xn) does not converge to Ef(X). This means that
there exist ε > 0 and a subsequence {n′} ⊂ {n} such that
|Ef(Xn′) − Ef(X)| > ε for all n′. This contradicts the existence of
{n′′} ⊂ {n′} such that |Ef(Xn′′) − Ef(X)| → 0 along {n′′}.
Definition. We say that a sequence of random elements {Xn} is relatively
compact if each subsequence {Xn′} contains a further subsequence {Xn′′}
that converges weakly to a random element (which may depend on the choice
of {Xn′}).
Theorem 2.2 (Continuous mapping theorem). Let h be a measurable
function that maps (S, S) to another metric space (S′, S′). If Xn →d X and
P(X ∈ Dh) = 0 for the set Dh of discontinuities of h, then h(Xn) →d h(X).
Theorem 2.3. Let H be a collection of measurable and continuous functions
that map (S, S) to another metric space (S′, S′). Suppose that
H^{-1}(S′) ≡ ∪_{h∈H} h^{-1}(S′) is a field generating S. If {Xn} is relatively
compact and h(Xn) converges weakly to h(X) for all h ∈ H for some random
element X, then Xn →d X.
Remark. In fact, we do not need to assume at the outset that there exists a
random element X such that h(Xn) converges weakly to h(X) for all h ∈ H.
What we need is that h(Xn) converges weakly to some random element, say Xh,
taking values in S′, for each h ∈ H. Then, we can define a random element
X such that
P[X ∈ h^{-1}(A′)] = P(Xh ∈ A′) for all A′ ∈ S′ and for all h ∈ H.
Note that the distribution of X is uniquely determined by the probabilities
P(X ∈ A) for A ∈ H^{-1}(S′), since H^{-1}(S′) is a field generating S.
Proof of Theorem 2.3. We only need to check that the weak limit of any
convergent subsequence {Xn′′} of {Xn′} does not depend on {n′}. Let Y be
the weak limit of {Xn′′}. By Theorem 2.2 it follows that h(Xn′′) →d h(Y) for
all h ∈ H. On the other hand, we also have h(Xn′′) →d h(X) for all h ∈ H by
the condition of the theorem. Thus, h(X) =d h(Y) for all h ∈ H. This means
P(X ∈ A) = P(Y ∈ A) for all A ∈ H^{-1}(S′), which implies
P(X ∈ A) = P(Y ∈ A) for all A ∈ S since H^{-1}(S′) is a field generating S.
It is rather difficult to prove relative compactness directly for a given sequence
of random elements {Xn}. A more convenient notion that implies relative
compactness is tightness.
Definition. We say a random element X is tight if for any ε > 0 there exists
a compact set K such that P (X ∈ K) > 1− ε.
Definition. A set A in a metric space (S,S) is called compact if every open
cover of A has a finite subcover. Alternatively, a set A is compact if it is totally
bounded and complete.
Definition. A set A in a metric space (S,S) is called totally bounded if for
any ε > 0 the set A is covered by finitely many open balls of radius ε in (S,S).
Definition. A metric space S is called complete if every Cauchy sequence of
points in S has a limit that is also in S, in other words, if every Cauchy
sequence in S converges in S. A set in S is called complete if every Cauchy
sequence in the set converges in the set.
Definition. A topological space is called separable if it contains a countable
dense subset; that is, there exists a sequence of elements of the space such that
every nonempty open subset of the space contains at least one element of the
sequence.
Definition. A Polish space is a topological space that is separable and
metrizable in such a way that it becomes complete.
The following theorem gives a sufficient condition for a single random element
to be tight.
Theorem 2.4 (Theorem 1.4, Billingsley, 1968). If S is separable and
complete, then each random element in (S,S) is tight.
Proof of Theorem 2.4. Fix ε > 0. By the separability of S, for each k ≥ 1
there exist countably many balls Bk,1, Bk,2, . . . with radius 1/k that
cover S, so that P(X ∈ ∪_{j=1}^∞ Bk,j) = 1. We may find Jk such that
P(X ∈ ∪_{j=1}^{Jk} Bk,j) > 1 − ε/2^k.
Let B = ∩_{k=1}^∞ ∪_{j=1}^{Jk} Bk,j. Then, B is totally bounded. Since S is
complete, the closure B̄ of B is also complete, so that B̄ is compact. We obtain
P(X ∈ B̄^c) ≤ P(X ∈ B^c)
  ≤ Σ_{k=1}^∞ P(X ∈ ∩_{j=1}^{Jk} B^c_{k,j})
  ≤ Σ_{k=1}^∞ ε/2^k = ε.
Definition. We say a sequence of random elements {Xn} is tight if for any
ε > 0 there exists a compact set K such that inf_n P(Xn ∈ K) > 1 − ε.
Remark. In the case of Rk, tightness of a sequence of random vectors {Xn}
means Xn = Op(1).
Theorem 2.5 (Theorem 6.1, Billingsley, 1968). If {Xn} is tight, then it is
relatively compact. The converse is also true if S is separable and complete.
3 Weak Convergence in C[0, 1]
Here we consider weak convergence in C ≡ C[0, 1], the space of real-valued
continuous functions defined on the interval [0, 1]. Let C be the Borel σ-field of
C. We endow C with the uniform metric
d(x, y) = sup_{t∈[0,1]} |x(t) − y(t)|.
With this metric, the space C is separable by the Stone-Weierstrass theorem
(every continuous function on [0, 1] can be uniformly approximated by
polynomials, and the polynomials with rational coefficients form a countable
dense subset), and it is also complete. Separability and completeness facilitate
the derivation of a plausible set of sufficient conditions for weak convergence.
3.1 Projection from C to Rk
For a set of points t1, . . . , tk in [0, 1], let πt1,...,tk be the map that carries a
point x of C to the point (x(t1), . . . , x(tk)) of Rk. It is a map from (C, C) to
(Rk, Rk), where Rk denotes the Borel σ-field of Rk. The following theorem
demonstrates that the collection of all projections πt1,...,tk for t1, . . . , tk ∈ [0, 1]
and k ≥ 1 satisfies the conditions on H in Theorem 2.3.
Theorem 3.1. The projection πt1,...,tk is measurable and continuous for all
t1, . . . , tk ∈ [0, 1] and k ≥ 1. Also, the sets of the form π^{-1}_{t1,...,tk}(A′) for some
A′ ∈ Rk, t1, . . . , tk ∈ [0, 1] and k ≥ 1 form a field that generates C.
Proof. The first part is obvious. For the second part, let
C0 = {π^{-1}_{t1,...,tk}(B) : B ∈ Rk, t1, . . . , tk ∈ [0, 1], k ≥ 1}.
The fact that C0 is a field follows from
[π^{-1}_{t1,...,tk}(B)]^c = π^{-1}_{t1,...,tk}(B^c),
π^{-1}_{t1,...,tk}(B1) ∪ π^{-1}_{s1,...,sl}(B2) = π^{-1}_{t1,...,tk,s1,...,sl}((B1 × Rl) ∪ (Rk × B2)).
Now, recall that each open set in a separable space is a countable union of
closed balls (or open balls). Thus, it suffices to prove that each closed ball can
be obtained by the operations of countable union, countable intersection and
complementation of the sets in C0. Let
B̄(x, ε) = {y : sup_{0≤t≤1} |y(t) − x(t)| ≤ ε} be a closed ball in C. Clearly,
B̄(x, ε) ⊂ ∩_{n=1}^∞ ∩_{i=1}^n {y : |y(i/n) − x(i/n)| ≤ ε}.
Now, let y belong to the set on the right-hand side of the above inclusion. For
such y, we may find t0 ∈ [0, 1] where sup_{0≤t≤1} |y(t) − x(t)| = |y(t0) − x(t0)|.
Given δ > 0, we may also find tδ ∈ {i/n : 1 ≤ i ≤ n, n ≥ 1} such that
|x(tδ) − x(t0)| ≤ δ/2 and |y(tδ) − y(t0)| ≤ δ/2, due to the continuity of x and
y. Thus, we have
sup_{0≤t≤1} |y(t) − x(t)| = |y(t0) − x(t0)|
  ≤ |y(t0) − y(tδ)| + |y(tδ) − x(tδ)| + |x(tδ) − x(t0)|
  ≤ ε + δ.
Letting δ ↓ 0 gives sup_{0≤t≤1} |y(t) − x(t)| ≤ ε. Thus,
B̄(x, ε) ⊃ ∩_{n=1}^∞ ∩_{i=1}^n {y : |y(i/n) − x(i/n)| ≤ ε}.
This completes the proof.
Definition. The distribution of πt1,...,tk Xn = (Xn(t1), . . . , Xn(tk)) is called
a finite-dimensional distribution of Xn.
Theorem 3.2. Let Xn and X be random elements in C. If all
finite-dimensional distributions of Xn converge weakly to those of X and if
{Xn} is tight, then Xn →d X.
3.2 Conditions for tightness in C
Here, we study necessary and sufficient conditions for a sequence of continuous
random functions to be tight. We start with the following theorem.
Theorem 3.3 (Theorem 8.2, Billingsley, 1968). A sequence {Xn} in C is
tight if and only if (i) {Xn(0)} is tight in R, and (ii) for any ε > 0,
lim_{δ→0} lim sup_{n→∞} P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε ] = 0. (3.1)
The theorem follows from the Arzela-Ascoli characterization of compact sets
below. The property (3.1) is sometimes called asymptotic equicontinuity.
Lemma 3.4 (Arzela-Ascoli characterization of compact sets). A set
A ⊂ C has compact closure if and only if (i) sup_{x∈A} |x(0)| < ∞ and (ii)
lim_{δ→0} sup_{x∈A} wx(δ) = 0, where wx(δ) is the modulus of continuity of
x ∈ C, defined by
wx(δ) = sup_{|s−t|<δ} |x(s) − x(t)|.
Note. The conditions (i) and (ii) are in fact necessary and sufficient for A to
be totally bounded. Since C is complete, the closure Ā is complete for any
A ⊂ C. Thus, Ā is compact if and only if A is totally bounded.
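The modulus wx(δ) can be approximated on a grid. The following sketch (a grid approximation; the helper name is mine) computes wx(δ) for one fixed smooth function and shows that the modulus is small for small δ, consistent with the Arzela-Ascoli conditions applied to the singleton {x}.

```python
import math

def modulus_of_continuity(xs, delta):
    # Grid approximation of wx(delta) = sup_{|s-t|<delta} |x(s)-x(t)|,
    # for x sampled as xs[i] = x(i/n) on a uniform grid over [0, 1]
    n = len(xs) - 1            # grid spacing is 1/n
    window = int(delta * n)    # compare indices with |i-j| <= delta*n
    w = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, min(i + window + 1, len(xs))):
            w = max(w, abs(xs[i] - xs[j]))
    return w

n = 200
xs = [math.sin(2 * math.pi * i / n) for i in range(n + 1)]
w_small = modulus_of_continuity(xs, 0.01)  # small delta: small oscillation
w_large = modulus_of_continuity(xs, 0.5)   # large delta: full swing of sin
print(w_small, w_large)
```

For a compact family A one would require the same smallness uniformly over all x ∈ A, which is exactly condition (ii) of the lemma.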
Proof of Theorem 3.3. To prove the 'only if' part, assume that {Xn} is
tight. Then, for any given ε > 0 there exists a compact set K ≡ K(ε) such
that inf_n P(Xn ∈ K) > 1 − ε. By Lemma 3.4, (i) sup_{x∈K} |x(0)| < C0 for
some 0 < C0 < ∞, and (ii) there exists δ0 > 0 such that
sup_{x∈K} sup_{|s−t|<δ0} |x(s) − x(t)| < ε. Thus, from (i),
inf_n P(|Xn(0)| < C0) ≥ inf_n P(Xn ∈ K) > 1 − ε,
so that {Xn(0)} is tight in R. Furthermore, from (ii) it also holds that
inf_n P[ sup_{|s−t|<δ0} |Xn(s) − Xn(t)| < ε ] ≥ inf_n P(Xn ∈ K) > 1 − ε,
so that sup_n P[ sup_{|s−t|<δ0} |Xn(s) − Xn(t)| ≥ ε ] < ε. This implies that, for
any ε > 0,
lim_{δ→0} lim sup_{n→∞} P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε ] < ε.
Taking ε ↓ 0 gives that, for any ε0 > 0,
lim_{δ→0} lim sup_{n→∞} P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε0 ]
  ≤ lim_{ε→0} lim_{δ→0} lim sup_{n→∞} P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε ] ≤ 0.
This completes the proof of the 'only if' part.
To prove the 'if' part, let ε > 0 be fixed. We construct a totally bounded set K
such that sup_n P(Xn ∈ K^c) < ε. Suppose that (i) and (ii) of the theorem
hold. Then, we may find C0 such that
sup_n P(|Xn(0)| > C0) < ε/2. (3.2)
Also, we may choose δj > 0, j ≥ 1, such that
lim sup_{n→∞} P[ sup_{|s−t|<δj} |Xn(s) − Xn(t)| ≥ 1/j ] < ε/2^j.
Note that we can actually choose δj > 0, j ≥ 1, such that
sup_n P[ sup_{|s−t|<δj} |Xn(s) − Xn(t)| ≥ 1/j ] < ε/2^j. (3.3)
This follows since every single random element of {Xn} is tight by
Theorem 2.4, so the necessity part of the theorem that we have just proved
entails
lim_{δ→0} P[ sup_{|s−t|<δ} |Xk(s) − Xk(t)| ≥ 1/j ] = 0
for each fixed k and j. We take
K = {x : |x(0)| ≤ C0} ∩ ∩_{j=1}^∞ {x : sup_{|s−t|<δj} |x(s) − x(t)| < 1/j}.
This set is totally bounded by Lemma 3.4, so that its closure K̄ is compact.
From (3.2) and (3.3), we get
sup_n P(Xn ∈ K^c) ≤ sup_n P(|Xn(0)| > C0)
  + Σ_{j=1}^∞ sup_n P[ sup_{|s−t|<δj} |Xn(s) − Xn(t)| ≥ 1/j ] ≤ ε.
Corollary 3.5. Let Xn and X be random elements in C. If all
finite-dimensional distributions of Xn converge weakly to those of X, and if for
any ε > 0 there exist n0 and δ > 0 such that
sup_{n≥n0} P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε ] ≤ ε, (3.4)
then Xn →d X.
Note. (3.1) holds for any ε > 0 if and only if for any ε > 0 there exist n0
and δ > 0 such that (3.4) holds.
Theorem 3.6. The inequality (3.4) follows if
sup_{n≥n0} sup_{0≤t≤1} P[ sup_{s∈[t,t+δ]} |Xn(s) − Xn(t)| ≥ ε/3 ] ≤ δε. (3.5)
Proof of Theorem 3.6. Let ti = δi for i = 0, 1, . . . , δ^{-1}. Without loss of
generality, we may assume In ≡ δ^{-1} is an integer. Note that, if |s − t| < δ and
t < s (WLOG), then (i) there exists a grid point ti (1 ≤ i ≤ In − 1) such that
ti−1 ≤ t ≤ ti ≤ s ≤ ti+1, or (ii) there exists a grid point ti (0 ≤ i ≤ In − 1)
such that ti ≤ t < s ≤ ti+1. This means
sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≤ 3 max_{0≤i≤In−1} sup_{ti≤t≤ti+1} |Xn(t) − Xn(ti)|,
which with the inequality (3.5) gives
P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε ]
  ≤ P[ max_{0≤i≤In−1} sup_{ti≤t≤ti+1} |Xn(t) − Xn(ti)| ≥ ε/3 ]
  ≤ Σ_{i=0}^{In−1} P[ sup_{ti≤t≤ti+1} |Xn(t) − Xn(ti)| ≥ ε/3 ]
  ≤ In δ ε = ε.
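The key chaining inequality used here, sup_{|s−t|<δ} |x(s) − x(t)| ≤ 3 max_i sup_{ti≤t≤ti+1} |x(t) − x(ti)|, can be sanity-checked numerically. The sketch below (grid-based, variable names mine) does so for one random-walk path with δ = 0.1.

```python
import random

rng = random.Random(1)
N = 1000                 # evaluation grid over [0, 1]
xs = [0.0]
for _ in range(N):
    xs.append(xs[-1] + rng.gauss(0.0, 1.0) / N ** 0.5)  # random-walk path

delta = 0.1
m = int(delta * N)       # delta corresponds to m grid steps

# Left side: sup over grid pairs with |s - t| < delta
lhs = max(abs(xs[i] - xs[j])
          for i in range(N + 1)
          for j in range(i + 1, min(i + m, N + 1)))

# Right side: 3 * max over blocks [t_i, t_{i+1}] with t_i = i * delta of the
# oscillation relative to the block's left endpoint
blocks = [xs[k * m:(k + 1) * m + 1] for k in range(N // m)]
rhs = 3 * max(max(abs(v - b[0]) for v in b) for b in blocks)
print(lhs <= rhs)
```

The factor 3 comes from the three-term triangle inequality in the proof: a pair (t, s) within δ touches at most two adjacent blocks.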
3.3 Donsker’s Theorem
Let the random variables ξj be iid with mean 0 and variance 1. Define a
sequence of random elements {Xn} in C by
Xn(t) = (1/√n) Σ_{i=1}^{⌊nt⌋} ξi + (nt − ⌊nt⌋) ξ_{⌊nt⌋+1}/√n, (3.6)
where ⌊nt⌋ denotes the largest integer less than or equal to nt. The process Xn
is simply the linear interpolation of the points Xn(j/n) = Σ_{i=1}^{j} ξi/√n. The
following theorem, due to Donsker, is a generalization of the classical Central
Limit Theorem. It is a functional CLT for the entire partial sum process, not
just for the nth partial sum as in the classical CLT.
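The process (3.6) is straightforward to implement. The following sketch (the helper name partial_sum_process is mine) builds Xn(t) by linear interpolation of the scaled partial sums and checks that it agrees with Sj/√n at a grid point j/n.

```python
import math
import random

def partial_sum_process(xi, t):
    # X_n(t) from (3.6): linear interpolation of the scaled partial sums
    # S_j / sqrt(n) at the points j/n, where n = len(xi)
    n = len(xi)
    k = int(n * t)                       # floor(nt)
    s = sum(xi[:k]) / math.sqrt(n)       # S_k / sqrt(n)
    if k < n:
        s += (n * t - k) * xi[k] / math.sqrt(n)  # linear piece
    return s

rng = random.Random(0)
n = 100
xi = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # iid, mean 0, variance 1

# At a grid point j/n, the process equals the scaled partial sum
j = 37
grid_value = partial_sum_process(xi, j / n)
direct = sum(xi[:j]) / math.sqrt(n)
print(abs(grid_value - direct))
```

Between grid points the path is the straight line joining consecutive partial sums, which is what makes Xn an element of C.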
Definition. The standard Wiener process, or Brownian motion, W is a
Gaussian process taking values in C such that EW(t) = 0 and
cov(W(s), W(t)) = s ∧ t. Alternatively, it is defined to be a stochastic process
with the following properties: (i) for each 0 ≤ t ≤ 1, W(t) ∼ N(0, t); (ii) W
has independent increments, i.e., W(t2) − W(t1), . . . , W(tk) − W(tk−1) are
independent for all 0 ≤ t1 ≤ · · · ≤ tk ≤ 1.
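A discrete-grid approximation of W can be simulated from independent Gaussian increments. The sketch below (names mine) estimates cov(W(0.3), W(0.7)) by Monte Carlo; by the definition above it should be close to 0.3 = s ∧ t.

```python
import math
import random

def brownian_path(n_steps, rng):
    # Grid approximation of W on {i/n}: independent N(0, 1/n) increments
    dt = 1.0 / n_steps
    w = [0.0]
    for _ in range(n_steps):
        w.append(w[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return w

rng = random.Random(0)
n_steps, reps = 100, 10000
s_idx, t_idx = 30, 70        # s = 0.3, t = 0.7, so cov(W(s), W(t)) = 0.3
pairs = []
for _ in range(reps):
    w = brownian_path(n_steps, rng)
    pairs.append((w[s_idx], w[t_idx]))
mean_s = sum(a for a, _ in pairs) / reps
mean_t = sum(b for _, b in pairs) / reps
cov = sum((a - mean_s) * (b - mean_t) for a, b in pairs) / reps
print(cov)
```

The identity cov(W(s), W(t)) = s ∧ t follows from the independent increments: W(t) = W(s) + (W(t) − W(s)), and the increment is independent of W(s).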
Theorem 3.7 (Donsker). The partial sum process defined in (3.6)
converges weakly to the standard Wiener process W.
Proof. Convergence of the finite-dimensional distributions of Xn to those of
W follows immediately from the classical CLT. We prove that for any ε > 0
there exist n0 and δ > 0 such that (3.5) holds. Our approach needs finite 4th
moments of the ξj. For a proof assuming only second moments, see Theorem 8.4
and the arguments running through pp. 89-91 of Billingsley (1968). We
introduce a technique based on the assumption that the ξj have finite 4th
moments since it is more instructive and can be generalized to other settings.
The basic idea is to use the following maximal inequality, which is fairly
general, so that it can be applied to partial sums of arbitrary random variables
that need not be independent or identically distributed.
Lemma 3.8 (Theorem 12.2, Billingsley, 1968). Let ξ1, . . . , ξm be random
variables. Let Sk = ξ1 + · · · + ξk for k ≥ 1 and put S0 = 0. If
E|Sj − Si|^γ ≤ (u_{i+1} + · · · + uj)^α for some γ ≥ 0, α > 1 and u1, . . . , um ≥ 0,
then
P( max_{0≤k≤m} |Sk| ≥ λ ) ≤ (Cγ,α/λ^γ) (u1 + · · · + um)^α,
where Cγ,α is a constant that depends only on γ and α.
To prove (3.5), consider a fixed point t ∈ [0, 1]. Let j be an integer such that
j/n ≤ t < (j + 1)/n. Suppose that k/n < t + δ ≤ (k + 1)/n for some
j ≤ k ≤ n − 1. In this case, (k − j − 1)/n ≤ δ, and s ∈ (t, t + δ] may lie in an
interval (i/n, (i + 1)/n] for some i with j ≤ i ≤ k. For such an i,
|Xn(s) − Xn(t)| ≤ |Xn(s) − Xn(i/n)| + |Xn(i/n) − Xn(j/n)| + |Xn(t) − Xn(j/n)|
  ≤ |Xn((i + 1)/n) − Xn(i/n)| + |Xn(i/n) − Xn(j/n)| + |Xn((j + 1)/n) − Xn(j/n)|,
where the second inequality follows from the polygonal character of Xn.
Writing I(δ, j) = (j + 1 + nδ) ∧ (n − 1), we establish
sup_{s∈[t,t+δ]} |Xn(s) − Xn(t)| ≤ max_{j<i≤I(δ,j)} |Xn(i/n) − Xn(j/n)|
  + 2 max_{j≤i≤I(δ,j)} |Xn((i + 1)/n) − Xn(i/n)| (3.7)
  = max_{j<i≤I(δ,j)} |(ξ_{j+1} + · · · + ξi)/√n| + 2 max_{j+1≤i≤I(δ,j)+1} |ξi/√n|.
We apply Lemma 3.8 to get a probability bound for the large deviation of the
first term on the RHS of (3.7). Since the ξk are independent with Eξk = 0, we
obtain, for any i, i′ with i′ > i > j,
E(S_{i′} − Si)^4 = E(ξ_{i+1} + · · · + ξ_{i′})^4
  ≤ C1 (Σ_{k=i+1}^{i′} Eξk^2)^2 + Σ_{k=i+1}^{i′} Eξk^4 ≤ (C1 + 1)(i′ − i)^2
for some absolute constant C1 > 0. This means we can apply Lemma 3.8 with
γ = 4, α = 2 and uk ≡ (C1 + 1)^{1/2}.
Thus, there exists an absolute constant C2 > 0 such that
P[ max_{j<i≤I(δ,j)} |(ξ_{j+1} + · · · + ξi)/√n| ≥ ε/6 ] ≤ C2 (nδ + 1)^2/(n^2 ε^4) ≤ 4C2 δ^2/ε^4
for sufficiently large n such that n ≥ 1/δ. Taking δ ≤ ε^5/(8C2) gives
P[ max_{j<i≤I(δ,j)} |(ξ_{j+1} + · · · + ξi)/√n| ≥ ε/6 ] ≤ δε/2. (3.8)
For the second term on the RHS of (3.7), there exist an absolute constant
C3 > 0 and an integer n0(δ, ε) such that for all n ≥ n0(δ, ε),
P[ max_{j≤i≤I(δ,j)} |ξi/√n| ≥ ε/12 ] ≤ Σ_{i=j}^{I(δ,j)} P[ |ξi/√n| ≥ ε/12 ]
  ≤ C3 δ/(nε^4) ≤ δε/2. (3.9)
The inequalities (3.7), (3.8) and (3.9) give (3.5).
4 Weak Convergence in D[0, 1]
We consider weak convergence in D ≡ D[0, 1], the space of cadlag functions
defined on the interval [0, 1].
Definition. A function defined on A ⊂ R is called a cadlag function if it is
right-continuous and has left limits everywhere in A.
Note that all continuous functions are cadlag functions. All distribution
functions are also cadlag functions. The main difficulty with this space is that
it is not separable under the uniform metric dU.
4.1 Non-separability of (D, dU)
Define xα ∈ D, for 0 ≤ α ≤ 1, by xα(t) = I(t ≥ α). If α ≠ α′, then
dU(xα, xα′) = 1. Let ε ≤ 1/2. Then, for any α ≠ α′,
{x ∈ D : dU(xα, x) < ε} ∩ {x ∈ D : dU(xα′, x) < ε} = ∅.
If (D, dU) were separable, there would exist a countable subset, say D0, of D
such that every open ball {x ∈ D : dU(xα, x) < ε} for α ∈ [0, 1] contains a
member of D0. But this is impossible since the number of these open balls is
uncountable and all the balls are disjoint.
Non-separability of (D, dU) causes a fundamental difficulty. If a metric space
(S, S, d) is not separable, then functions that map a probability space (Ω, F, P)
to (S, S, d) often fail to be measurable. Important examples are empirical
processes.
Define X : ([0, 1], B, µ) → (D, D, dU) by X(t, w) = I(w ≤ t), where B is the
Borel σ-field of [0, 1], µ is the Lebesgue measure and D is the Borel σ-field of
D. For any subset H of [0, 1],
∪_{α∈H} B(xα, 1/2) = ∪_{α∈H} {y ∈ D : dU(y, xα) < 1/2} ∈ D.
However, since X(·, w) ∈ B(xα, 1/2) if and only if X(·, w) = xα, which is also
equivalent to w = α, we obtain
X^{-1}( ∪_{α∈H} B(xα, 1/2) ) = {w : X(·, w) ∈ ∪_{α∈H} B(xα, 1/2)}
  = {w : X(·, w) = xα for some α ∈ H} = H.
By taking H ∉ B, we see that X^{-1}(D) ⊂ B fails; that is, X is not measurable.
4.2 Skorohod metric
The Skorohod metric dS defined below makes D separable. The basic idea is to
allow a deformation of the time scale when measuring the distance between two
elements of D.
Definition. Let Λ be the class of all strictly increasing and continuous
mappings λ of [0, 1] onto itself such that λ(0) = 0 and λ(1) = 1. The
Skorohod metric, denoted by dS, is defined by
dS(x, y) = inf_{λ∈Λ} max{ sup_{t∈[0,1]} |λ(t) − t|, sup_{t∈[0,1]} |x(t) − y(λ(t))| }.
A proof of the fact that dS is indeed a metric can be found on page 111 of
Billingsley (1968).
Example. We compute dS(xα, xα′) for α ≠ α′. If we take λ ∈ Λ such that
λ(α) ≠ α′, then sup_{t∈[0,1]} |xα(t) − xα′(λ(t))| = 1. For λ ∈ Λ such that
λ(α) = α′, it holds that sup_{t∈[0,1]} |xα(t) − xα′(λ(t))| = 0. Also,
inf_{λ∈Λ: λ(α)=α′} sup_{t∈[0,1]} |λ(t) − t| = |α′ − α| ≤ 1.
Thus, dS(xα, xα′) = |α′ − α|.
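This computation can be mirrored numerically. The sketch below (grid-based; names mine) evaluates the uniform distance between x0.4 and x0.5, which is 1, and the Skorohod upper bound obtained from one particular λ ∈ Λ that is piecewise linear with λ(0.4) = 0.5, giving max(sup |λ(t) − t|, 0) = 0.1 = |α′ − α|.

```python
def x_step(alpha, t):
    # x_alpha(t) = I(t >= alpha)
    return 1.0 if t >= alpha else 0.0

def lam(t, a, b):
    # Piecewise-linear time change in Lambda with lam(a) = b, lam(0) = 0,
    # lam(1) = 1
    if t <= a:
        return b * t / a
    return b + (1.0 - b) * (t - a) / (1.0 - a)

alpha, alpha2 = 0.4, 0.5
grid = [i / 1000.0 for i in range(1001)]

# Uniform distance: the two steps disagree on [0.4, 0.5), so dU = 1
d_uniform = max(abs(x_step(alpha, t) - x_step(alpha2, t)) for t in grid)

# Skorohod bound from this lambda: the value part vanishes, the time part
# is |lambda(t) - t|, maximized at t = alpha
time_dist = max(abs(lam(t, alpha, alpha2) - t) for t in grid)
value_dist = max(abs(x_step(alpha, t) - x_step(alpha2, lam(t, alpha, alpha2)))
                 for t in grid)
d_skorohod_bound = max(time_dist, value_dist)
print(d_uniform, d_skorohod_bound)
```

The bound 0.1 is in fact the exact value of dS here, by the infimum computation in the example above.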
Theorem 4.1. The metric space (D, dS) is separable.
A proof of the above theorem can be found on page 112 of Billingsley (1968).
There is another difficulty: the space (D, dS) is not complete, as the following
example illustrates. Completeness facilitates the characterization of compact
sets, and thus that of tight sequences of random elements.
Example. Define xn ∈ D by xn(t) = I((1/2) ≤ t < (1/2) + (1/n)). Then,
dS(xm, xn) = |1/m − 1/n| → 0
as m, n → ∞. Thus, {xn} is a Cauchy sequence. However, there exists no
x ∈ D such that xn → x in dS. To see this, suppose that there exists such a
function x. Then, there exists a strictly increasing and continuous function λn
with λn(0) = 0 and λn(1) = 1 such that
sup_{t∈[0,1]} |λn(t) − t| → 0 and sup_{t∈[0,1]} |xn(λn(t)) − x(t)| → 0. (4.10)
Note that
xn(λn(t)) = I(λn^{-1}(1/2) ≤ t < λn^{-1}((1/2) + (1/n))).
Due to the second convergence in (4.10) and the fact that xn(λn(·)) is an
indicator, the limit x ∈ D must take the form x(t) = I(α ≤ t < β) for some
0 ≤ α < β ≤ 1. The case α = β is excluded here since then x ≡ 0, and thus
the second convergence in (4.10) would not hold. Now, due to the first
convergence in (4.10), we have
|λn^{-1}(1/2) − 1/2| = |λn^{-1}(1/2) − λn(λn^{-1}(1/2))| → 0.
Similarly, λn^{-1}((1/2) + (1/n)) → 1/2. This means that α = β, which is a
contradiction.
There is a metric dS′ which is equivalent to dS and such that the metric space
(D, dS′) is complete. See Theorems 14.1 and 14.2 of Billingsley (1968). Thus,
we can proceed as if the Skorohod space (D, dS) were separable and complete.
4.3 Finite-dimensional distributions
We try to use Theorem 2.3 to obtain a set of sufficient conditions for weak
convergence in (D, D, dS). As in the case of C, we take for H the class of all
projections πt1,...,tk.
Theorem 4.2. The projection πt1,...,tk as a map from (D, D) to (Rk, Rk) is
measurable for all t1, . . . , tk ∈ [0, 1] and k ≥ 1.
For a proof of this theorem, see page 121 of Billingsley (1968). However, the
projections πt1,...,tk are not continuous everywhere in D for each (t1, . . . , tk).
This complicates matters somewhat.
Recall that, when we derived Theorem 3.2 for a sequence of random elements
Xn in C from Theorem 2.3, continuity of the projections πt1,...,tk was used
only to establish the weak convergence of πt1,...,tk Xn to πt1,...,tk X for Xn
converging weakly to X. In general, a measurable function h : (S, S) → (S′, S′)
need not be continuous everywhere in S for the sequence h(Xn) to converge
weakly to h(X). According to the continuous mapping theorem
(Theorem 2.2), if P(X ∈ Dh) = 0 for the set Dh where h is discontinuous,
then h(Xn) converges weakly to h(X). The following theorem is a slight
generalization of Theorem 2.3 that embodies this idea.
Theorem 4.3. For a random element Y taking values in a metric space
(S, S), let HY be a collection of measurable functions h that map (S, S) to
another metric space (S′, S′) such that P(Y ∈ Dh) = 0 for the set Dh where h
is discontinuous. Suppose that {Xn} is relatively compact and h(Xn)
converges weakly to h(X) for all h ∈ HX for some random element X. If
H^{-1}_{X,Y}(S′) ≡ ∪_{h∈HX∩HY} h^{-1}(S′) is a field generating S for all random
elements Y, then Xn converges weakly to X.
In fact, the requirement in Theorem 3.2 and Corollary 3.5 that all
finite-dimensional distributions of Xn converge weakly to those of X is too
strong for the space (D, D, dS), as the following example illustrates.
Example. Let Xn ≡ I[0,(1/2)+(1/n)) and X ≡ I[0,1/2). Then, Xn →d X since
dS(Xn, X) = 1/n → 0. However, Xn(1/2) ≡ 1 does not converge to 0 ≡ X(1/2).
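A quick numerical restatement of this example (names mine): the functions agree everywhere except near t = 1/2, where pointwise convergence fails even though dS(Xn, X) = 1/n → 0.

```python
def x_n(t, n):
    # X_n = indicator of [0, 1/2 + 1/n)
    return 1.0 if 0.0 <= t < 0.5 + 1.0 / n else 0.0

def x_lim(t):
    # X = indicator of [0, 1/2)
    return 1.0 if 0.0 <= t < 0.5 else 0.0

# At the limit's discontinuity point t = 1/2, X_n(1/2) = 1 for every n ...
vals = [x_n(0.5, n) for n in (10, 100, 1000)]
# ... while X(1/2) = 0, so the projection at t = 1/2 does not converge
print(vals, x_lim(0.5))
```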
Theorem 4.3 enables us to relax the condition that all finite-dimensional
distributions of Xn converge weakly to those of X. Our relaxation is founded
on the following three theorems. The first one characterizes the discontinuity
sets of πt1,...,tk in (D, dS); it tells us that πt1,...,tk for 0 < t1 < · · · < tk < 1
is discontinuous at x if and only if x is discontinuous at some tj, 1 ≤ j ≤ k.
Theorem 4.4. The projections π0 and π1 are everywhere continuous, but πt
for 0 < t < 1 is continuous at x if and only if x is continuous at t.
Proof. The first result is immediate. For the second, suppose that x is
continuous at t. Let {xn} be a sequence in D such that dS(xn, x) → 0. Then,
there exists a sequence {λn} in Λ such that sup_{s∈[0,1]} |xn(λn(s)) − x(s)| → 0
and sup_{s∈[0,1]} |λn(s) − s| → 0. Since x is continuous at t,
|xn(t) − x(t)| ≤ |xn(t) − x(λn^{-1}(t))| + |x(λn^{-1}(t)) − x(t)|
  ≤ sup_{s∈[0,1]} |xn(λn(s)) − x(s)| + |x(λn^{-1}(t)) − x(t)| → 0.
Suppose, on the other hand, that x is discontinuous at t. We need to show
that there exist an ε > 0 and a sequence {xn} such that dS(xn, x) → 0 but
|xn(t) − x(t)| ≥ ε for infinitely many n. Take xn such that xn(s) = x(λn(s))
for a sequence {λn} which is linear on [0, t] and on [t, 1] and satisfies
λn(t) = t − 1/n. Then,
dS(xn, x) = inf_{λ∈Λ} max{ sup_{s∈[0,1]} |λ(s) − s|, sup_{s∈[0,1]} |xn(λ(s)) − x(s)| }
  ≤ sup_{s∈[0,1]} |λn^{-1}(s) − s| ≤ 1/n → 0,
but |xn(t) − x(t)| → |x(t−) − x(t)| > 0.
For a point t ∈ [0, 1], define
D(πt) = {x ∈ D : x(t) ≠ x(t−)},
which is the set of all elements of D that are discontinuous at t. The main
lesson of Theorem 4.4 is that πt for 0 < t < 1 is discontinuous on D(πt), and
is continuous on D(πt)^c ⊂ D.
Theorem 4.5. The complement of the set {t ∈ (0, 1) : P(X ∈ D(πt)) = 0}
for a random element X taking values in (D, D) is at most countable.
For a proof of this theorem, see page 124 of Billingsley (1968). The theorem
tells us that the set
TX ≡ {0, 1} ∪ {t ∈ (0, 1) : P(X ∈ D(πt)) = 0}
is dense in [0, 1]. Furthermore, it implies that ∩_{i=1}^m T_{Xi} for finitely many
Xi is also dense in [0, 1].
Theorem 4.6 (Theorem 14.5, Billingsley, 1968). For a subset T of [0, 1],
let FT be the class of all sets of the form π^{-1}_{t1,...,tk}(A′) for some A′ ∈ Rk,
t1, . . . , tk ∈ T and k ≥ 1. If T contains 1 and is dense in [0, 1], then FT is a
field that generates D.
The following theorem demonstrates that the class of functions
ΠX = {πt1,...,tk : tj ∈ TX for all 1 ≤ j ≤ k, k ≥ 1}
plays the role of HX in Theorem 4.3. Recall that TX ⊂ [0, 1] is the set
consisting of t = 0, 1 together with those t ∈ (0, 1) at which the probability
that X is discontinuous equals 0, and that P(X ∈ Dπ) = 0 for any π ∈ ΠX
since
D_{πt1,...,tk} = {x ∈ (D, dS) : πt1,...,tk is discontinuous at x} = ∪_{1≤j≤k: 0<tj<1} D(πtj)
due to Theorem 4.4.
Theorem 4.7. Let Xn and X be random elements in D. If the distribution of
(Xn(t1), . . . , Xn(tk)) converges weakly to the distribution of (X(t1), . . . , X(tk))
for all t1, . . . , tk ∈ TX and for all k ≥ 1, and if {Xn} is tight, then Xn →d X.
Proof of Theorem 4.7. We follow the lines of the proof of Theorem 2.3.
We prove that the weak limit of a convergent subsequence {Xn′′} of {Xn′}
does not depend on {Xn′}. Let Y be the weak limit of {Xn′′}. Then, by the
continuous mapping theorem, πXn′′ →d πY for all π ∈ ΠY. On the other hand,
we also have πXn′′ →d πX for all π ∈ ΠX by the condition of the theorem.
This implies πY =d πX for all π ∈ ΠX ∩ ΠY, so that
P(πX ∈ A) = P(πY ∈ A) for all A ∈ Rk and all π ∈ ΠX ∩ ΠY
  ⇔ PX ≡ PY on {π^{-1}(A) : A ∈ Rk, π ∈ ΠX ∩ ΠY}
  ⇔ PX ≡ PY on F_{TX∩TY} (notation as in Theorem 4.6).
By Theorem 4.5, TX ∩ TY contains 0 and 1, and is dense in [0, 1]. Application
of Theorem 4.6 concludes that Y =d X.
4.4 Tightness in (D, dS)
The following theorem is an analogue of the Arzela-Ascoli theorem; it
characterizes compact sets in (D, dS). Let w′x(δ) be defined by
w′x(δ) = inf_{ti} max_{1≤i≤r} sup{|x(s) − x(t)| : s, t ∈ [ti−1, ti)},
where the infimum extends over all finite sets {ti} of points such that
0 = t0 < t1 < · · · < tr = 1 and ti − ti−1 > δ for all 1 ≤ i ≤ r, r ≥ 1. This
modulus plays in D the role that wx(δ) plays in C. In fact, w′x(δ) → 0 as
δ → 0 for all x ∈ D (Lemma 1, page 110, Billingsley, 1968).
Note. w′x(δ) → 0 as δ → 0 if and only if for any ε > 0 there exists a
partition of [0, 1] into finitely many Ti such that
max_i sup_{s,t∈Ti} |x(s) − x(t)| ≤ ε.
Theorem 4.8 (Theorem 14.3, Billingsley, 1968). A set A ⊂ D has
compact closure if and only if
(i) sup_{x∈A} sup_{t∈[0,1]} |x(t)| < ∞;
(ii) lim_{δ→0} sup_{x∈A} w′x(δ) = 0.
It is sometimes difficult to work with w′x(δ). The following modulus is often
more convenient. Define
w′′x(δ) = sup{|x(t) − x(t1)| ∧ |x(t2) − x(t)| : 0 ≤ t1 ≤ t ≤ t2 ≤ 1, t2 − t1 ≤ δ}.
Example. Recall the definition of xα: xα(t) = I(t ≥ α). For this function,
w_{xα}(δ) = 1 for all δ > 0. For w′_{xα}(δ), note that
w′_{xα}(δ) ≤ sup_{s,t∈[α,α+ε)} |xα(s) − xα(t)| = 0
for any δ and ε such that 0 < δ < ε < 1 − α. Thus, w′_{xα}(δ) = 0 for all
sufficiently small δ > 0 if α < 1. Also, we have w′′_{xα}(δ) = 0 for all
sufficiently small δ > 0 if 0 < α < 1, since
w′′_{xα}(δ) = sup_{t∈[α−δ/2,α+δ/2]} |xα(t) − xα(α − δ/2)| ∧ |xα(α + δ/2) − xα(t)| = 0.
Fact. For all x ∈ D, it follows that
w′′x(δ) ≤ w′x(δ) ≤ wx(2δ). (4.11)
For a proof of the first inequality, see pages 118-119 of Billingsley (1968), and
for a proof of the second, see page 110 of Billingsley (1968).
The following theorem gives another characterization of compact sets in (D, dS) based on w′′x(δ). It is sometimes more convenient to work with than the characterization in Theorem 4.8. We write wx(T ) = sup_{s,t∈T} |x(s) − x(t)|.
Theorem 4.9 (Theorem 14.4, Billingsley, 1968). A set A ⊂ D has compact closure if and only if
(i) sup_{x∈A} sup_{t∈[0,1]} |x(t)| < ∞;
(ii) lim_{δ→0} sup_{x∈A} w′′x(δ) = 0;
(iii) lim_{δ→0} sup_{x∈A} wx[0, δ) = 0;
(iv) lim_{δ→0} sup_{x∈A} wx[1 − δ, 1) = 0.
From Theorems 4.8 and 4.9, we get the following characterizations of a tight sequence in D.
Theorem 4.10 (Theorem 15.2, Billingsley, 1968). A sequence Xn in D is tight if and only if
(i) the sequence of random variables sup{|Xn(t)| : t ∈ [0, 1]} is tight in R;
(ii) for any ε > 0, lim_{δ→0} lim sup_{n→∞} P[w′Xn(δ) ≥ ε] = 0.
Note. The condition (ii) of Theorem 4.10 is equivalent to asking that for any ε, η > 0 there exists a partition of [0, 1] into finitely many Ti such that

lim sup_{n→∞} P[ max_i sup_{s,t∈Ti} |Xn(s) − Xn(t)| ≥ ε ] ≤ η.
Theorem 4.11 (Theorem 15.3, Billingsley, 1968). A sequence Xn in D is tight if and only if condition (i) holds and conditions (ii), (iii) and (iv) hold for any ε > 0:
(i) the sequence of random variables sup{|Xn(t)| : t ∈ [0, 1]} is tight in R;
(ii) lim_{δ→0} lim sup_{n→∞} P[w′′Xn(δ) ≥ ε] = 0;
(iii) lim_{δ→0} lim sup_{n→∞} P[ sup_{s,t∈[0,δ)} |Xn(s) − Xn(t)| ≥ ε ] = 0;
(iv) lim_{δ→0} lim sup_{n→∞} P[ sup_{s,t∈[1−δ,1)} |Xn(s) − Xn(t)| ≥ ε ] = 0.
Homework: Prove Theorems 4.10 and 4.11, and also prove the result in “Note”
between them.
4.5 Weak convergence in (D, D, dS)
The following theorem gives a set of sufficient conditions for weak convergence in D. The theorem follows from Theorems 4.7 and 4.11.
Theorem 4.12 (Theorem 15.4, Billingsley, 1968). Let Xn and X be random elements in D. Suppose that P (X(1−) ≠ X(1)) = 0. If the distribution of (Xn(t1), . . . , Xn(tk)) converges weakly to the distribution of (X(t1), . . . , X(tk)) for all t1, . . . , tk ∈ TX and for all k ≥ 1, and if

lim_{δ→0} lim sup_{n→∞} P[w′′Xn(δ) ≥ ε] = 0 (4.12)

for any ε > 0, then Xn →d X.
Proof. We prove (i), (iii) and (iv) in Theorem 4.11. To prove (i), let ε > 0 be given. Then by the condition (4.12), we can take δ0 such that

lim sup_{n→∞} P(w′′Xn(δ0) ≥ ε) ≤ ε/2.

Choose 0 = t1 < · · · < tk = 1 from TX such that ti − ti−1 ≤ δ0. This is possible since TX is dense in [0, 1] by Theorem 4.5. Then,

sup_{t∈[0,1]} |Xn(t)| ≤ w′′Xn(δ0) + max_{1≤i≤k} |Xn(ti)|.

Since each sequence Xn(ti) is tight, max_{1≤i≤k} |Xn(ti)| is also tight. Thus, there exists C > 0 such that

lim sup_{n→∞} P( max_{1≤i≤k} |Xn(ti)| > C ) ≤ ε/2.
This implies

lim sup_{n→∞} P( sup_{t∈[0,1]} |Xn(t)| > C + ε )
≤ lim sup_{n→∞} P(w′′Xn(δ0) > ε) + lim sup_{n→∞} P( max_{1≤i≤k} |Xn(ti)| > C ) ≤ ε,

which concludes the proof of (i).
For (iii) it suffices to prove that for any ε > 0 there exist δ0 and n0 such that for all n ≥ n0

P[ sup_{s,t∈[0,δ0)} |Xn(s) − Xn(t)| ≥ ε ] ≤ ε. (4.13)

We note that

sup_{s,t∈[0,δ0)} |Xn(s) − Xn(t)| ≤ 2 sup_{s∈[0,δ0)} |Xn(s) − Xn(0)|
≤ 2[ w′′Xn(δ0) + |Xn(δ0) − Xn(0)| ].
The second inequality above holds since, for each s ∈ [0, δ0),

|Xn(s) − Xn(0)| ≤ w′′Xn(δ0)

in case |Xn(s) − Xn(0)| ≤ |Xn(s) − Xn(δ0)|, and

|Xn(s) − Xn(0)| ≤ |Xn(s) − Xn(δ0)| + |Xn(δ0) − Xn(0)|
≤ w′′Xn(δ0) + |Xn(δ0) − Xn(0)|

in case |Xn(s) − Xn(0)| ≥ |Xn(s) − Xn(δ0)|.
Let ε > 0 be fixed. By the second condition of the theorem, one can take δ1 ∈ TX and n1 ≡ n1(δ1) such that for all n ≥ n1

P[w′′Xn(δ1) ≥ ε/4] ≤ ε/4. (4.14)

We claim that there exists also a δ0 ≤ δ1 such that δ0 ∈ TX and

P[|X(δ0) − X(0)| ≥ ε/12] ≤ ε/4. (4.15)
Since Xn(δ0) and Xn(0) converge weakly to X(δ0) and X(0), respectively, there exists n′1 ≡ n′1(δ0, ε) such that for all n ≥ n′1

P[|Xn(δ0) − X(δ0)| ≥ ε/12] ≤ ε/4, (4.16)
P[|Xn(0) − X(0)| ≥ ε/12] ≤ ε/4. (4.17)

In (4.15)–(4.17) and below wherever relevant, Xn(0), X(0) and Xn(δ0), X(δ0) are those versions in the Skorohod Representation Theorem (Theorem 1.1) that are defined on the same probability space and converge almost surely.
The inequalities (4.14)–(4.17) give

P[ sup_{s,t∈[0,δ0)} |Xn(s) − Xn(t)| ≥ ε ]
≤ P[w′′Xn(δ0) ≥ ε/4] + P[|Xn(δ0) − Xn(0)| ≥ ε/4] ≤ ε

for all n ≥ n0 ≡ n1 ∨ n′1.
It remains to prove (4.15) to complete the proof of (iii). Since all x in D are right continuous, it follows that

1 = P( lim_{δ→0} sup_{s∈[0,δ]} |X(s) − X(0)| = 0 )
= P[ ⋂_k ⋃_l ( sup_{s∈[0,1/l]} |X(s) − X(0)| ≤ 1/k ) ].

This means that for any η1 > 0

0 = P[ ⋂_l ( sup_{s∈[0,1/l]} |X(s) − X(0)| ≥ η1 ) ] = lim_{l→∞} P( sup_{s∈[0,1/l]} |X(s) − X(0)| ≥ η1 ).

Thus, for any η1, η2 > 0, there exists an L > 0 such that

P( sup_{s∈[0,1/L]} |X(s) − X(0)| ≥ η1 ) ≤ η2.
Since TX is dense in [0, 1], we can always take a δ0 ∈ TX such that δ0 ≤ 1/L and

P(|X(δ0) − X(0)| ≥ η1) ≤ P( sup_{s∈[0,1/L]} |X(s) − X(0)| ≥ η1 ) ≤ η2.

This completes the proof of (iii).
The condition (iv) also holds by symmetry. One point requiring care is the existence of a δ0 ∈ TX that ensures the analogue of (4.15), which is now

P[|X(1) − X(1 − δ0)| ≥ ε/12] ≤ ε/4.

This can be proved as (4.15) by using the condition P (X ∈ D(π1)) = 0 and by working on the set D(π1)c, whose elements are all left continuous at 1.
4.6 Limit process with continuous sample paths
When the limit process X in Theorem 4.12 has continuous sample paths a.e., the condition P (X(1−) ≠ X(1)) = 0 is automatically satisfied and also TX = [0, 1]. Thus, in this case Xn converges weakly to X if all finite-dimensional distributions of Xn converge weakly to those of X and if (4.12) holds for any ε > 0. What if all finite-dimensional distributions of Xn converge weakly and if, instead of (4.12),

lim_{δ→0} lim sup_{n→∞} P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε ] = 0 (4.18)

for any ε > 0? Recall that this was a criterion for weak convergence in C. The following theorem shows that the limit process in this case has continuous sample paths a.e.
Theorem 4.13 (Theorem 15.5, Billingsley, 1968). Let Xn be a sequence of random elements in (D, D, dS). Suppose that Xn(0) is tight in R and (4.18) holds for any ε > 0. Then, Xn is tight, and, if X is the weak limit of a subsequence Xn′, then X has continuous sample paths a.e., i.e.,

P (X ∈ D(πt)c for all t ∈ [0, 1]) = 1.

Donsker’s Theorem.
Let the random variables ξj be iid with mean 0 and variance 1. Define a sequence of random elements Xn in D by

Xn(t) = (1/√n) Σ_{i=1}^{⌊nt⌋} ξi, (4.19)

where ⌊nt⌋ denotes the largest integer which is less than or equal to nt.
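The partial sum process (4.19) is easy to simulate. The sketch below is illustrative only (the seed, the ±1 steps and the function name are my own choices); any iid steps with mean 0 and variance 1 would do:

```python
import math
import random

def partial_sum_process(xi, t):
    """X_n(t) = n^{-1/2} * sum_{i <= floor(n t)} xi_i, as in (4.19)."""
    n = len(xi)
    k = math.floor(n * t)            # only the first floor(n t) steps enter
    return sum(xi[:k]) / math.sqrt(n)

random.seed(0)
n = 10_000
# iid steps with mean 0 and variance 1 (here: +/-1 coin flips)
xi = [random.choice((-1.0, 1.0)) for _ in range(n)]

# One path evaluated on a coarse grid; by Donsker's theorem the path
# behaves, for large n, like a standard Wiener process on [0, 1].
path = [partial_sum_process(xi, t / 10) for t in range(11)]
print(path[0])    # 0.0, since floor(n * 0) = 0 terms are summed
```

The path is a step function with jumps of size n^{-1/2} at the points i/n, so it lives in D rather than C, which is exactly why the Skorohod framework of this section is needed.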
Theorem 4.14 (Donsker: Theorem 16.1, Billingsley, 1968). The partial sum process defined at (4.19) converges weakly to the standard Wiener process W .
Proof. Prove (4.18) along the lines of the proof of Theorem 3.7. Then, use the inequality at (4.11) to prove (4.12) and apply Theorem 4.12. Since the limit X is the standard Wiener process, X has continuous sample paths a.e., so that TX = [0, 1]. Thus, it remains to show that all finite-dimensional distributions of Xn converge to those of X, which follows from the classical multivariate CLT.
5 Weak Convergence of Empirical Processes
We discuss weak convergence of uniform empirical processes first, and then
extend the discussion to more general cases.
5.1 Uniform empirical processes
Let ξj be iid Uniform[0, 1], and Fn be their empirical distribution function defined by

Fn(t) = n^{-1} Σ_{i=1}^n I(ξi ≤ t).

A centered and scaled version of Fn defines a uniform empirical process Xn indexed by t ∈ [0, 1]:

Xn(t) = √n (Fn(t) − t). (5.20)
It will be shown that this process converges weakly to the Brownian bridge
defined below.
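The process (5.20) can be computed exactly from a sample; below is a hedged stdlib-only sketch (helper names and the seed are mine), evaluating Fn by binary search on the sorted sample:

```python
import math
import random
from bisect import bisect_right

def empirical_process(sample, t):
    """X_n(t) = sqrt(n) * (F_n(t) - t), the uniform empirical process (5.20).

    `sample` must be sorted so F_n(t) = #{xi_i <= t} / n is a binary search."""
    n = len(sample)
    fn_t = bisect_right(sample, t) / n
    return math.sqrt(n) * (fn_t - t)

random.seed(1)
n = 2_000
sample = sorted(random.random() for _ in range(n))   # iid Uniform[0, 1]

# X_n is tied down at both endpoints, matching the Brownian-bridge limit:
print(empirical_process(sample, 0.0), empirical_process(sample, 1.0))   # 0.0 0.0
```

Like the partial sum process, each path is a step function with n jumps of size n^{-1/2}, so it is an element of D, not of C.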
Definition. A Gaussian process B taking values in C such that EB(t) = 0 and cov(B(s), B(t)) = s ∧ t − st is called the standard Brownian bridge. Alternatively, it is defined by

B(t) = W (t) − tW (1),

where W is the standard Brownian motion.
Note. The standard Brownian bridge B is tied down at 0 and 1 with probability 1, i.e., P [B(0) = B(1) = 0] = 1.
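The representation B(t) = W (t) − tW (1) gives an immediate simulation recipe. A minimal sketch, assuming a Gaussian random-walk approximation of W on a uniform grid (the step count, seed and function name are my own choices):

```python
import math
import random

def brownian_bridge_path(n_steps, rng):
    """One path of B(t) = W(t) - t * W(1) on the grid t = i / n_steps."""
    dt = 1.0 / n_steps
    w = [0.0]
    for _ in range(n_steps):                        # W via iid N(0, dt) increments
        w.append(w[-1] + rng.gauss(0.0, math.sqrt(dt)))
    w1 = w[-1]
    # Subtracting t * W(1) pins the path down at both endpoints.
    return [w[i] - (i / n_steps) * w1 for i in range(n_steps + 1)]

rng = random.Random(0)
b = brownian_bridge_path(1000, rng)

# Tied down at 0 and 1, as the Note states:
print(b[0], b[-1])   # 0.0 0.0
```

The covariance s ∧ t − st can be checked by averaging B(s)B(t) over many such paths, since E[(W(s) − sW(1))(W(t) − tW(1))] = s ∧ t − st.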
Theorem 5.1. The empirical process defined at (5.20) for iid uniformly
distributed random variables ξj on [0, 1] converges weakly to the standard
Brownian bridge B.
Proof. Note that B takes values in C a.e., thus it has continuous sample paths a.e. Convergence of all finite-dimensional distributions of Xn to those of B follows from the classical CLT. We use Theorem 3.6 and prove that for any ε, η > 0 there exist n0 and δ > 0 such that

sup_{n≥n0} sup_{0≤t≤1} P[ sup_{s∈[t,t+δ]} |Xn(s) − Xn(t)| ≥ η ] ≤ δε. (5.21)

For each fixed t ∈ [0, 1], we divide the interval [t, t + δ] into m subintervals of length p = δ/m, i.e., [t + (i − 1)p, t + ip] for 1 ≤ i ≤ m. Note that

sup_{s∈[t,t+δ]} |Xn(s) − Xn(t)|
≤ max_{0≤i≤m} |Xn(t + ip) − Xn(t)| (5.22)
+ max_{0≤i≤m−1} sup_{s∈[t+ip, t+(i+1)p]} |Xn(s) − Xn(t + ip)|.
The second term on the RHS of (5.22) can be made small enough by choosing m large enough, and the first term, which is

max_{0≤i≤m} | Σ_{l=1}^i (Xn(t + lp) − Xn(t + (l − 1)p)) | =: max_{0≤i≤m} |Si|,

involves only finitely many (with p being determined) partial sums, so that it can be handled by a maximal inequality for partial sums such as Lemma 3.8. Treatments of the two terms on the RHS of (5.22) need the following identities:

E[I(ξi ≤ s) − I(ξi ≤ t)]^2 = E[I(ξi ≤ s) − I(ξi ≤ t)]^4 = s + t − 2(s ∧ t) = |s − t|. (5.23)
This gives for the first term

E(Sj − Si)^4 = E[Xn(t + jp) − Xn(t + ip)]^4
= E( (1/√n) Σ_{k=1}^n [I(ξk ≤ t + jp) − I(ξk ≤ t + ip) − (j − i)p] )^4
≤ C(j − i)^2 p^2

for some constant C > 0, as long as n > p^{-1} = m/δ and δ < 1. By applying Lemma 3.8 with γ = 4, α = 2 and uj ≡ √C p, we get

P( max_{0≤i≤m} |Xn(t + ip) − Xn(t)| ≥ λ ) ≤ C1 λ^{-4} m^2 p^2 (5.24)

for an absolute constant C1 > 0. Plugging in λ = η/2 gives

P( max_{0≤i≤m} |Xn(t + ip) − Xn(t)| ≥ η/2 ) ≤ 16 C1 η^{-4} m^2 p^2. (5.25)
To treat the second term on the RHS of (5.22), we observe that for all s ∈ [t + ip, t + (i + 1)p]

Xn(s) − Xn(t + ip) ≤ √n [Fn(t + (i + 1)p) − Fn(t + ip)]
≤ |Xn(t + (i + 1)p) − Xn(t + ip)| + √n p,

since Fn is a non-decreasing function. Also, we have

Xn(s) − Xn(t + ip) ≥ −√n p ≥ −|Xn(t + (i + 1)p) − Xn(t + ip)| − √n p.

Thus

sup_{s∈[t+ip, t+(i+1)p]} |Xn(s) − Xn(t + ip)| ≤ |Xn(t + (i + 1)p) − Xn(t + ip)| + √n p. (5.26)
Take m large so that √n δ/m = √n p < η/4 and δ/m > n^{-1}. Then, by (5.24) and (5.26),

P( max_{0≤i≤m−1} sup_{s∈[t+ip, t+(i+1)p]} |Xn(s) − Xn(t + ip)| ≥ η/2 ) (5.27)
≤ P( max_{0≤i≤m−1} |Xn(t + (i + 1)p) − Xn(t + ip)| ≥ η/4 )
≤ 2 P( max_{0≤i≤m} |Xn(t + ip) − Xn(t)| ≥ η/8 )
≤ C2 η^{-4} m^2 p^2

for some 0 < C2 < ∞. Note that there always exists an m such that √n δ/m ≤ η/4 and δ/m ≥ n^{-1} if n > (4/η)^2. From (5.22), (5.25) and (5.27) we get

sup_{0≤t≤1} P[ sup_{s∈[t,t+δ]} |Xn(s) − Xn(t)| ≥ η ] ≤ C3 η^{-4} δ^2

for some constant C3 > 0. Taking δ ≤ η^4 ε/C3 gives (5.21).
5.2 Empirical processes in D(−∞,∞)
Theorem 5.1 can be extended to a more general case where ξj are iid with distribution function F . Define

Xn(t) = √n (Fn(t) − F (t)). (5.28)

The function space involved in this case is D(−∞,∞), the space of all cadlag functions defined on the whole real line (−∞,∞). Also, we need to extend the definition of the Skorohod metric to D(−∞,∞). The weak limit of the empirical process Xn in this case is the F -Brownian bridge B(F (·)).
Definition. Let Λ be the class of all strictly increasing and continuous mappings λ of R onto itself. The Skorohod metric for D(−∞,∞) is defined by

dS(x, y) = inf_{λ∈Λ} max{ sup_{−∞<t<∞} |λ(t) − t|, sup_{−∞<t<∞} |x(t) − y(λ(t))| }.

With a slight abuse of notation, we continue to denote by dS the Skorohod metric for D(−∞,∞).
Theorem 5.2. The empirical process defined at (5.28) for iid random
variables ξj with a common distribution function F converges weakly to
B(F (·)).
Proof of Theorem 5.2. Define the quantile function of F by F^{-1}(t) = inf{s : t ≤ F (s)}. Then, we know

F^{-1}(t) ≤ s if and only if t ≤ F (s), and thus F^{-1}(υ) ∼ F

for a uniformly distributed υ on [0, 1]. Thus, we may represent ξi = F^{-1}(υi) with iid uniformly distributed υi on [0, 1]. By Theorem 5.1, the empirical process Yn defined by

Yn(t) = (1/√n) Σ_{i=1}^n [I(υi ≤ t) − t]

converges weakly to B. For the empirical process Xn defined at (5.28),

Xn(t) = (1/√n) Σ_{i=1}^n [I(ξi ≤ t) − F (t)]
=d (1/√n) Σ_{i=1}^n [I(F^{-1}(υi) ≤ t) − F (t)]
= (1/√n) Σ_{i=1}^n [I(υi ≤ F (t)) − F (t)] = Yn(F (t)).
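The two facts used above, F^{-1}(t) ≤ s ⇔ t ≤ F (s) and F^{-1}(υ) ∼ F, can be checked numerically. A sketch with F taken to be the Exponential(1) distribution function (my own choice of example; the helper names are hypothetical):

```python
import math
import random

def F(t):
    """Exponential(1) distribution function, used here as an example of F."""
    return 1.0 - math.exp(-t) if t > 0 else 0.0

def F_inv(u):
    """Quantile function F^{-1}(u) = inf{s : u <= F(s)} for Exponential(1)."""
    return -math.log(1.0 - u)

rng = random.Random(42)
us = [rng.random() for _ in range(5000)]

# Pointwise identity behind the proof: I(F^{-1}(u) <= t) = I(u <= F(t)).
t = 1.3
assert all((F_inv(u) <= t) == (u <= F(t)) for u in us)

# Hence xi_i = F^{-1}(u_i) has distribution F: the empirical frequency of
# {xi_i <= t} matches F(t) up to sampling error.
freq = sum(F_inv(u) <= t for u in us) / len(us)
print(abs(freq - F(t)) < 0.05)   # True
```

The same two lines of reasoning are exactly what turns Xn(t) into Yn(F (t)) in the display above.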
Define a map ψ : (D[0, 1], dS) → (D(−∞,∞), dS) by (ψx)(t) = x(F (t)). If we prove ψ is continuous on C[0, 1], then by the continuous mapping theorem (Theorem 2.2) and by the fact that P (B ∈ C[0, 1]c) = 0 we have

Xn = ψYn →d ψB = B(F (·)),

concluding the proof of the theorem. To prove ψ is continuous at all x ∈ C[0, 1], let xn be a sequence of elements in (D[0, 1], dS) such that dS(xn, x) → 0 as n → ∞ for some x ∈ C[0, 1]. Then, dU (xn, x) → 0. One can prove this by using the inequalities in the first half of the proof of Theorem 4.4 and the fact that all x ∈ C[0, 1] are uniformly continuous on [0, 1]. This implies

dU (ψxn, ψx) = sup_{−∞<t<∞} |xn(F (t)) − x(F (t))| ≤ sup_{t∈[0,1]} |xn(t) − x(t)| → 0.

Thus, dS(ψxn, ψx) → 0.
5.3 Empirical processes indexed by G
Let ξj be iid random variables taking values in R with distribution function F . Let G be a collection of measurable functions g in L2(F ) = {g : ∫ g^2 dF < ∞}. Define

Xn(g) = √n ∫ g d(Fn − F ) = (1/√n) Σ_{i=1}^n [g(ξi) − Eg(ξ1)]. (5.29)

The process Xn = {Xn(g) : g ∈ G} is called the empirical process indexed by G. Note that taking G = {I(−∞,t] : t ∈ R} gives the empirical process at (5.28).
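A minimal sketch of (5.29) (the function names and the choice of class are mine), confirming that an indicator g = I(−∞,t] reproduces the classical empirical process:

```python
import math
import random

def indexed_empirical_process(sample, g, mean_g):
    """X_n(g) = n^{-1/2} * sum_i [g(xi_i) - E g(xi_1)], as in (5.29).

    `mean_g` must be the exact mean E g(xi_1) under F (known analytically here)."""
    n = len(sample)
    return sum(g(x) - mean_g for x in sample) / math.sqrt(n)

rng = random.Random(7)
n = 1_000
sample = [rng.random() for _ in range(n)]            # iid Uniform[0, 1], so F(t) = t

# Taking g = I(-inf, t] recovers the classical process (5.28):
t = 0.4
g_t = lambda x: 1.0 if x <= t else 0.0
classical = math.sqrt(n) * (sum(x <= t for x in sample) / n - t)
indexed = indexed_empirical_process(sample, g_t, mean_g=t)   # E g(xi_1) = F(t) = t
print(abs(classical - indexed) < 1e-9)   # True
```

Other choices of g in L2(F ), such as polynomials, give the other coordinates of the same indexed process.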
We assume that Xn is a map from a probability space (Ω,F , P ) to the space ℓ∞(G) of all bounded real-valued functions on G, equipped with the uniform metric

dU (x, y) = sup_{g∈G} |x(g) − y(g)|.

This is a restriction imposed on G. For example, if the class G is enveloped by a square-integrable function G, i.e., if |g(x)| ≤ G(x) for all x and for all g ∈ G, then Xn takes values in ℓ∞(G).
As we have seen before, Xn is not Borel-measurable when G = {I(−∞,t] : t ∈ R}. This can happen with other G. If Xn is not measurable, then the statement “Ef(Xn) → Ef(X) for any bounded and uniformly continuous real-valued function f” does not make sense. To accommodate this situation in general, we extend the definition of weak convergence to a sequence of arbitrary maps Xn : Ω → ℓ∞(G).
Definition. For an arbitrary map Y : Ω → ℓ∞(G), define the outer expectation E∗ by

E∗f(Y ) = inf{E(U) : U is measurable, U ≥ f(Y ) and E(U) exists}.
Definition. For a sequence of arbitrary maps Xn and for a Borel-measurable X, we say Xn converges weakly to X, and we write Xn →d X, if E∗f(Xn) → Ef(X) for any bounded and continuous real-valued function f : ℓ∞(G) → R.
Definition. Let BF be a Gaussian process indexed by G such that EBF (g) = 0 and

cov(BF (g1), BF (g2)) = ∫ g1 g2 dF − ( ∫ g1 dF )( ∫ g2 dF ).

The process BF is called the F -Brownian bridge.
Definition. The class G is called F -Donsker if the process Xn defined at (5.29) converges weakly to BF .
The following theorem gives a set of sufficient conditions for weak convergence of the empirical process Xn indexed by G. Define P ∗ by P ∗(A) = E∗IA.
Theorem 5.3. If G equipped with the L2(F )-metric ‖ · ‖ is totally bounded and if for any ε > 0

lim_{δ→0} lim sup_{n→∞} P ∗( sup_{‖g1−g2‖≤δ} |Xn(g1) − Xn(g2)| ≥ ε ) = 0,

then G is F -Donsker.
For a proof of this theorem, see
Dudley, R. M. (1984). A Course on Empirical Processes. Springer-Verlag, New
York.
Definition. Let G be a class of functions in L2(F ). The δ-entropy of G, denoted by H(δ,G, F ), equals the logarithm of the smallest number of L2(F )-balls of radius δ whose union covers G.
Definition. Let G be a class of functions in L2(F ). A bracket [gL, gU ] is the
set of all functions g in G such that gL ≤ g ≤ gU . A δ-bracket is a bracket
[gL, gU ] such that ‖gU − gL‖ ≤ δ. The δ-entropy with bracketing of G,
denoted by HB(δ,G, F ), equals the logarithm of the smallest number of
δ-brackets that cover G.
Note. If H(δ,G, F ) <∞ for any δ > 0, then G is totally bounded. Also, it
holds that H(δ,G, F ) ≤ HB(δ,G, F ).
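For a concrete feel for these entropies, consider G = {I(−∞,t] : t ∈ [0, 1]} under F = Uniform[0, 1]: cutting [0, 1] at ti = iδ^2 yields brackets [I(−∞,ti−1], I(−∞,ti]] with ‖gU − gL‖ = √(ti − ti−1) = δ, so about 1/δ^2 brackets suffice and HB(δ,G, F ) grows like 2 log(1/δ). A stdlib-only sketch of this bound (the helper names are my own):

```python
import math

def n_brackets_indicators(delta):
    """Number of delta-brackets (L2(Uniform[0,1]) norm) covering
    G = {I(-inf, t] : t in [0, 1]}: cutting [0, 1] at t_i = i * delta**2
    gives brackets [I(-inf, t_{i-1}], I(-inf, t_i]] with
    ||g_U - g_L|| = sqrt(t_i - t_{i-1}) <= delta."""
    return math.ceil(1.0 / delta ** 2)

def entropy_integral(deltas):
    """Riemann sum for int_0^1 H_B(u)^{1/2} du with H_B(u) = log N_B(u)."""
    total, prev = 0.0, 0.0
    for d in sorted(deltas):
        total += math.sqrt(math.log(n_brackets_indicators(d))) * (d - prev)
        prev = d
    return total

# H_B(delta) is only logarithmic in 1/delta, so sqrt(H_B) is integrable near 0;
# by Theorem 5.4 below, the indicator class is therefore F-Donsker, consistent
# with the direct argument of Theorem 5.2.
grid = [i / 1000 for i in range(1, 1001)]
print(entropy_integral(grid))
```

The finite value of this Riemann sum mirrors the convergence of the entropy integral in Theorem 5.4.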
Theorem 5.4 (Theorem 6.3, van de Geer, 2000). If

∫_0^1 H_B^{1/2}(u,G, F ) du < ∞,

then G is F -Donsker.
Final Remark. So far we have considered the case where ξj are random variables taking values in R. All discussions remain valid for iid measurable maps ξj : Ω → X , where (X ,A) is a measurable space.
Suggested References.
Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.
van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge University Press.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.