Lecture Note: Theory of Statistics
Weak convergence
Byeong Uk Park1
1 Department of Statistics, Seoul National University
1 Weak Convergence in Rk
Let X and Xn, n ≥ 1, be random vectors taking values in Rk. These random
vectors are allowed to be defined on different probability spaces. Below, for the
simplicity of notation, we denote all probability measures associated with
random vectors simply by P although they are defined on different probability
spaces. Define Fn(x) = P(Xn ≤ x) and F(x) = P(X ≤ x). For a function
f : Rk → R, let Cf = {x ∈ Rk : f is continuous at x}.
Definition. We say a sequence of random vectors {Xn} converges weakly to
a random vector X, and we write Xn →d X, if P(Xn ∈ A) converges to
P(X ∈ A) for all Borel sets A ⊂ Rk with P(X ∈ ∂A) = 0. In the case of Rk,
Xn →d X if and only if Fn(x) → F(x) for all x ∈ CF.
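The CDF characterization can be checked numerically. The sketch below is a toy Monte Carlo check (the helper names are mine, not from the notes): it takes Xn = (ξ1 + · · · + ξn)/√n for iid ±1 signs, so Xn →d N(0, 1) by the classical CLT, and compares Fn(x) with the limit F(x) at the continuity point x = 0.5.

```python
import math
import random

def standard_normal_cdf(x):
    # F(x) for the N(0, 1) limit, computed via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def empirical_cdf_of_standardized_sum(n, x, reps=5000, seed=0):
    # Monte Carlo estimate of Fn(x) = P(Xn <= x), where
    # Xn = (sum of n iid +/-1 signs) / sqrt(n)
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        s = sum(rng.choice((-1, 1)) for _ in range(n))
        if s / math.sqrt(n) <= x:
            count += 1
    return count / reps

# Fn(0.5) should be close to F(0.5) for moderately large n
fn = empirical_cdf_of_standardized_sum(200, 0.5)
f = standard_normal_cdf(0.5)
print(abs(fn - f))
```

Note that only continuity points of F matter in the definition; at an atom of the limit distribution such a comparison could fail even when Xn →d X.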
Theorem 1.1. (Skorokhod representation theorem). Suppose that
Xn →d X. Then, there exist X∗ and X∗n defined on the same probability
space such that
X∗n =d Xn, X∗ =d X and X∗n →a.s. X∗.
Theorem 1.2. (Continuous mapping theorem). If Xn →a.s. X and
P(X ∈ Cf) = 1 for a real-valued function f, then f(Xn) →a.s. f(X).
Proof. From the conditions of the theorem,
1 = P(X ∈ Cf, lim_{n→∞} Xn = X)
  ≤ P(X ∈ Cf, lim_{n→∞} f(Xn) = f(X))
  ≤ P(lim_{n→∞} f(Xn) = f(X)).
Theorem 1.3. If Xn →a.s. X, then Xn →d X.
Proof. Fix x ∈ CF and consider f = I(−∞,x](·). The function f is bounded
and Cf = Rk − {x}. Since
P(X ∈ Cf) = P(X ≠ x) = 1 − [F(x) − F(x−)] = 1, we get from
Theorem 1.2 that f(Xn) →a.s. f(X). Since f is bounded and Xn →a.s. X,
the Dominated Convergence Theorem implies
Fn(x) = Ef(Xn) → Ef(X) = F(x).
The converse of Theorem 1.3 is not true. The following theorem presents a set
of equivalent definitions of weak convergence.
Theorem 1.4. The following are equivalent.
(a) Fn(x)→ F (x) for all x ∈ CF .
(b) Ef(Xn)→ Ef(X) for all bounded f : Rk → R with P (X ∈ Cf ) = 1.
(c) Ef(Xn)→ Ef(X) for all bounded and continuous f : Rk → R.
(d) Ef(Xn)→ Ef(X) for all bounded and uniformly continuous f : Rk → R.
Proof. The implications (b) ⇒ (c) ⇒ (d) are trivial. We prove the
implications (a) ⇒ (b) and (d) ⇒ (a). To prove (a) ⇒ (b), we use
Theorems 1.1 and 1.2. Let X∗ and X∗n be defined on the same probability
space such that X∗n =d Xn, X∗ =d X and X∗n →a.s. X∗. Let f : Rk → R be
bounded and satisfy P(X ∈ Cf) = 1. Since X∗ =d X, we also have
P(X∗ ∈ Cf) = 1. Then, by Theorem 1.2 we obtain f(X∗n) →a.s. f(X∗).
Applying the DCT to f(X∗n), we get
Ef(Xn) = Ef(X∗n) → Ef(X∗) = Ef(X).
To prove (d) ⇒ (a), let x ∈ CF be fixed, and for such an x let f+m : Rk → R
be defined by
f+m(u) =
  1                       if u ≤ x;
  (m/k) 1⊤(x + 1/m − u)   if x ≤ u ≤ x + 1/m;
  0                       if u ≥ x + 1/m.
The function f+m is uniformly continuous and 0 ≤ I(−∞,x](·) ≤ f+m ≤ 1. Also,
let f−m : Rk → R be defined by
f−m(u) =
  1                       if u ≤ x − 1/m;
  (m/k) 1⊤(x − u)         if x − 1/m ≤ u ≤ x;
  0                       if u ≥ x.
The function f−m is also uniformly continuous and 0 ≤ f−m ≤ I(−∞,x](·) ≤ 1. It
follows that
lim_{m→∞} f+m(u) = I(−∞,x](u) and lim_{m→∞} f−m(u) = I(−∞,x)(u) for all u ∈ Rk.
Thus, (d) implies that, for all m ≥ 1,
lim sup_{n→∞} Fn(x) ≤ lim sup_{n→∞} Ef+m(Xn) = Ef+m(X),
lim inf_{n→∞} Fn(x) ≥ lim inf_{n→∞} Ef−m(Xn) = Ef−m(X).
By the DCT, we also have
lim_{m→∞} Ef−m(X) = F(x−) and lim_{m→∞} Ef+m(X) = F(x).
These results give
F(x−) ≤ lim inf_{n→∞} Fn(x) ≤ lim sup_{n→∞} Fn(x) ≤ F(x).
Since x ∈ CF, we have F(x−) = F(x), and hence lim_{n→∞} Fn(x) = F(x).
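In one dimension (k = 1) the sandwich functions used in the proof are easy to implement. The sketch below (function names are mine) codes f+m and f−m for k = 1 and verifies the pointwise ordering 0 ≤ f−m ≤ I(−∞,x] ≤ f+m ≤ 1 on a grid.

```python
def f_plus(u, x, m):
    # Uniformly continuous upper envelope of the indicator I(u <= x) for k = 1:
    # equals 1 for u <= x, decreases linearly to 0 on [x, x + 1/m]
    if u <= x:
        return 1.0
    if u >= x + 1.0 / m:
        return 0.0
    return m * (x + 1.0 / m - u)

def f_minus(u, x, m):
    # Uniformly continuous lower envelope: 1 for u <= x - 1/m, 0 for u >= x
    if u <= x - 1.0 / m:
        return 1.0
    if u >= x:
        return 0.0
    return m * (x - u)

def indicator(u, x):
    return 1.0 if u <= x else 0.0

# Check the sandwich 0 <= f_minus <= I(-inf, x] <= f_plus <= 1 on a grid
x = 0.3
grid = [i / 100.0 for i in range(-100, 201)]
ok = all(0.0 <= f_minus(u, x, 10) <= indicator(u, x) <= f_plus(u, x, 10) <= 1.0
         for u in grid)
print(ok)
```

Letting m grow makes both envelopes converge to the indicator, which is exactly what drives the F(x−) and F(x) limits in the proof.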
Remark. For other equivalent definitions of weak convergence, see the
portmanteau theorem on page 24 of
Billingsley, P. (1968). Convergence of Probability Measures. John Wiley &
Sons, New York.
2 Weak Convergence in Metric Spaces
We have studied the notion of weak convergence for random variables taking
values in Rk. Here, we extend the notion to random elements taking values in
a function space. We start with weak convergence in a metric space, and then
specialize the notion to the space of continuous functions defined on [0, 1] and
also to the space of cadlag functions.
Let (S,S) be a metric space, where S denotes the Borel σ-field of S. Let Xn
and X be random elements taking values in S. This means that Xn and X are
measurable mappings from a probability space (Ω,F , P ) to the metric space
(S,S).
Definition. We say a sequence of random elements {Xn} converges weakly
to a random element X, and we write Xn →d X, if P(Xn ∈ A) converges to
P(X ∈ A) for all Borel sets A ∈ S with P(X ∈ ∂A) = 0. Equivalently, we say
Xn →d X if Ef(Xn) → Ef(X) for any bounded and uniformly continuous
real-valued function f.
In the case where S = R∞ equipped with the sup-metric
d(x, y) = sup_{k≥1} sup_{1≤i≤k} |xi − yi|,
a sequence of random elements Xn = (Xn,1, Xn,2, . . .) converges weakly to
X = (X1, X2, . . .) if every finite-dimensional distribution of Xn converges
weakly to the corresponding finite-dimensional distribution of X, i.e., if
(Xn,1, . . . , Xn,k) converges weakly to (X1, . . . , Xk) for all k ≥ 1. But this is
not true when S is a function space.
Example. Consider the space of continuous functions defined on the interval
[0, 1] with d(x, y) = sup_{0≤t≤1} |x(t) − y(t)|. Define
Xn(t) =
  nt       if 0 ≤ t ≤ n−1;
  2 − nt   if n−1 ≤ t ≤ 2n−1;
  0        if 2n−1 ≤ t ≤ 1,
and X(t) ≡ 0. Then, Xn does not converge weakly to X. To see this, take
A = B(0, 1/2) = {y : d(y, 0) ≤ 1/2}. For this Borel set A, we have
P(X ∈ ∂A) = P(d(X, 0) = 1/2) = 0 and P(X ∈ A) = P(d(X, 0) ≤ 1/2) = 1,
but P(Xn ∈ A) = P(d(Xn, 0) ≤ 1/2) = 0. Nevertheless, every finite-dimensional
distribution of Xn converges weakly to the corresponding finite-dimensional
distribution of X, since for any 0 ≤ t1 < t2 < · · · < tk ≤ 1,
(Xn(t1), . . . , Xn(tk)) =d (X(t1), . . . , X(tk)),
provided t1 > 2/n.
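The tent functions in this example are easy to compute. The sketch below (names mine) checks both halves of the argument on a grid: d(Xn, 0) = 1 for every n, while Xn(t) vanishes at any fixed t > 0 once 2/n < t.

```python
def X_n(t, n):
    # The tent function from the example: rises to height 1 at t = 1/n,
    # falls back to 0 at t = 2/n, and is 0 afterwards
    if t <= 1.0 / n:
        return n * t
    if t <= 2.0 / n:
        return 2.0 - n * t
    return 0.0

n = 50
grid = [i / 10000.0 for i in range(10001)]

# Sup distance to the zero function stays 1 for every n ...
sup_dist = max(abs(X_n(t, n)) for t in grid)

# ... while at any fixed t > 0, X_n(t) = 0 as soon as 2/n < t
pointwise = X_n(0.25, n)
print(sup_dist, pointwise)
```

So the finite-dimensional distributions converge (here they are eventually exactly those of X), yet weak convergence in the sup metric fails.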
The main task is to find a plausible set of sufficient conditions that ensure
Xn →d X. A useful criterion for weak convergence in a metric space is given
below.
Theorem 2.1. Xn →d X if and only if each subsequence {Xn′} contains a
further subsequence {Xn′′} such that Xn′′ →d X.
Proof. We only need to prove the 'if' part. Suppose Xn does not converge
weakly to X. Then, there exists a bounded and uniformly continuous function
f : S → R such that Ef(Xn) does not converge to Ef(X). This means that
there exist ε > 0 and a subsequence {n′} ⊂ {n} such that
|Ef(Xn′) − Ef(X)| > ε for all n′. This contradicts the existence of
{n′′} ⊂ {n′} such that |Ef(Xn′′) − Ef(X)| → 0 along {n′′}.
Definition. We say that a sequence of random elements {Xn} is relatively
compact if each subsequence {Xn′} contains a further subsequence {Xn′′}
that converges weakly to a random element (which may depend on the choice
of {Xn′}).
Theorem 2.2 (Continuous mapping theorem). Let h be a measurable
function that maps (S, S) to another metric space (S′, S′). If Xn →d X and
P(X ∈ Dh) = 0 for the set Dh of discontinuities of h, then h(Xn) →d h(X).
Theorem 2.3. Let H be a collection of measurable and continuous functions
that map (S, S) to another metric space (S′, S′). Suppose that
H^{-1}(S′) ≡ ∪_{h∈H} h^{-1}(S′) is a field generating S. If {Xn} is relatively
compact and h(Xn) converges weakly to h(X) for all h ∈ H for some random
element X, then Xn →d X.
Remark. In fact, we do not need to assume at the outset that there exists a
random element X such that h(Xn) converges weakly to h(X) for all h ∈ H.
What we need is that h(Xn) converges weakly to some random element, say Xh,
taking values in S′, for each h ∈ H. Then, we can define a random element
X such that
P[X ∈ h^{-1}(A′)] = P(Xh ∈ A′) for all A′ ∈ S′ and for all h ∈ H.
Note that the distribution of X is uniquely determined by the probabilities
P(X ∈ A) for A ∈ H^{-1}(S′), since H^{-1}(S′) is a field generating S.
Proof of Theorem 2.3. We only need to check that the weak limit of any
convergent subsequence {Xn′′} of {Xn′} does not depend on {n′}. Let Y be
the weak limit of {Xn′′}. By Theorem 2.2 it follows that h(Xn′′) →d h(Y) for
all h ∈ H. On the other hand, we also have h(Xn′′) →d h(X) for all h ∈ H by
the condition of the theorem. Thus, h(X) =d h(Y) for all h ∈ H. This means
P(X ∈ A) = P(Y ∈ A) for all A ∈ H^{-1}(S′), which implies
P(X ∈ A) = P(Y ∈ A) for all A ∈ S since H^{-1}(S′) is a field generating S.
It is rather difficult to prove relative compactness directly for a given sequence
of random elements {Xn}. A more convenient notion that implies relative
compactness is tightness.
Definition. We say a random element X is tight if for any ε > 0 there exists
a compact set K such that P (X ∈ K) > 1− ε.
Definition. A set A in a metric space (S,S) is called compact if every open
cover of A has a finite subcover. Alternatively, a set A is compact if it is totally
bounded and complete.
Definition. A set A in a metric space (S,S) is called totally bounded if for
any ε > 0 the set A is covered by finitely many open balls of radius ε in (S,S).
Definition. A metric space S is called complete if every Cauchy sequence of
points in S has a limit that is also in S, in other words, if every Cauchy
sequence in S converges in S. A set in S is called complete if every Cauchy
sequence in the set converges in the set.
Definition. A topological space is called separable if it contains a countable
dense subset; that is, there exists a sequence of elements of the space such that
every nonempty open subset of the space contains at least one element of the
sequence.
Definition. A Polish space is a topological space that is separable and
metrizable in such a way that it becomes complete.
The following theorem gives a sufficient condition for a single random element
to be tight.
Theorem 2.4 (Theorem 1.4, Billingsley, 1968). If S is separable and
complete, then each random element in (S,S) is tight.
Proof of Theorem 2.4. Fix ε > 0. By the separability of S, for each k ≥ 1
there exist countably many balls Bk,1, Bk,2, . . . with radius 1/k that
cover S, so that P(X ∈ ∪_{j=1}^∞ Bk,j) = 1. We may find Jk such that
P(X ∈ ∪_{j=1}^{Jk} Bk,j) > 1 − ε/2^k.
Let B = ∩_{k=1}^∞ ∪_{j=1}^{Jk} Bk,j. Then, B is totally bounded. Since S is
complete, the closure B̄ of B is also complete, so that B̄ is compact. We obtain
P(X ∈ B̄^c) ≤ P(X ∈ B^c)
  ≤ Σ_{k=1}^∞ P(X ∈ ∩_{j=1}^{Jk} B^c_{k,j})
  ≤ Σ_{k=1}^∞ ε/2^k = ε.
Definition. We say a sequence of random elements {Xn} is tight if for any
ε > 0 there exists a compact set K such that inf_n P(Xn ∈ K) > 1 − ε.
Remark. In the case of Rk, tightness of a sequence of random vectors {Xn}
means Xn = Op(1).
Theorem 2.5 (Theorem 6.1, Billingsley, 1968). If {Xn} is tight, then it is
relatively compact. The converse is also true if S is separable and complete.
3 Weak Convergence in C[0, 1]
Here we consider weak convergence in C ≡ C[0, 1], the space of real-valued
continuous functions defined on the interval [0, 1]. Let C be the Borel σ-field of
C. We endow C with the uniform metric
d(x, y) = sup_{t∈[0,1]} |x(t) − y(t)|.
With this metric, the space C is separable by the Stone-Weierstrass theorem
(every continuous function on [0, 1] can be uniformly approximated by
polynomials, and the polynomials with rational coefficients form a countable
dense subset), and it is also complete. Separability and completeness facilitate
the derivation of a plausible set of sufficient conditions for weak convergence.
3.1 Projection from C to Rk
For a set of points t1, . . . , tk in [0, 1], let πt1,...,tk be the map that carries a
point x of C to the point (x(t1), . . . , x(tk)) of Rk. It is a map from (C, C) to
(Rk, Rk), where Rk denotes the Borel σ-field of Rk. The following theorem
demonstrates that the collection of all projections πt1,...,tk for t1, . . . , tk ∈ [0, 1]
and k ≥ 1 satisfies the conditions on H in Theorem 2.3.
Theorem 3.1. The projection πt1,...,tk is measurable and continuous for all
t1, . . . , tk ∈ [0, 1] and k ≥ 1. Also, the sets of the form π^{-1}_{t1,...,tk}(A′) for some
A′ ∈ Rk, t1, . . . , tk ∈ [0, 1] and k ≥ 1 form a field that generates C.
Proof. The first part is obvious. For the second part, let
C0 = {π^{-1}_{t1,...,tk}(B) : B ∈ Rk, t1, . . . , tk ∈ [0, 1], k ≥ 1}.
The fact that C0 is a field follows from
[π^{-1}_{t1,...,tk}(B)]^c = π^{-1}_{t1,...,tk}(B^c),
π^{-1}_{t1,...,tk}(B1) ∪ π^{-1}_{s1,...,sl}(B2) = π^{-1}_{t1,...,tk,s1,...,sl}((B1 × Rl) ∪ (Rk × B2)).
Now, recall that each open set in a separable space is a countable union of
closed balls (or open balls). Thus, it suffices to prove that each closed ball can
be obtained by the operations of countable union, countable intersection and
complementation of the sets in C0. Let
B̄(x, ε) = {y : sup_{0≤t≤1} |y(t) − x(t)| ≤ ε} be a closed ball in C. Clearly,
B̄(x, ε) ⊂ ∩_{n=1}^∞ ∩_{i=1}^n {y : |y(i/n) − x(i/n)| ≤ ε}.
Now, let y belong to the set on the right-hand side of the above inclusion. For
such y, we may find t0 ∈ [0, 1] where sup_{0≤t≤1} |y(t) − x(t)| = |y(t0) − x(t0)|.
Given δ > 0, we may also find tδ ∈ {i/n : 1 ≤ i ≤ n, n ≥ 1} such that
|x(tδ) − x(t0)| ≤ δ/2 and |y(tδ) − y(t0)| ≤ δ/2, due to the continuity of x and
y. Thus, we have
sup_{0≤t≤1} |y(t) − x(t)| = |y(t0) − x(t0)|
  ≤ |y(t0) − y(tδ)| + |y(tδ) − x(tδ)| + |x(tδ) − x(t0)|
  ≤ ε + δ.
Letting δ ↓ 0 gives sup_{0≤t≤1} |y(t) − x(t)| ≤ ε. Thus,
B̄(x, ε) ⊃ ∩_{n=1}^∞ ∩_{i=1}^n {y : |y(i/n) − x(i/n)| ≤ ε}.
This completes the proof.
Definition. The distribution of πt1,...,tk Xn = (Xn(t1), . . . , Xn(tk)) is called
a finite-dimensional distribution of Xn.
Theorem 3.2. Let Xn and X be random elements in C. If all
finite-dimensional distributions of Xn converge weakly to those of X and if
{Xn} is tight, then Xn →d X.
3.2 Conditions for tightness in C
Here, we study necessary and sufficient conditions for a sequence of continuous
random functions to be tight. We start with the following theorem.
Theorem 3.3 (Theorem 8.2, Billingsley, 1968). A sequence {Xn} in C is
tight if and only if (i) {Xn(0)} is tight in R, and (ii) for any ε > 0,
lim_{δ→0} lim sup_{n→∞} P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε ] = 0. (3.1)
The theorem follows from the Arzela-Ascoli characterization of compact sets
below. The property (3.1) is sometimes called asymptotic equicontinuity.
Lemma 3.4 (Arzela-Ascoli characterization of compact sets). A set
A ⊂ C has compact closure if and only if (i) sup_{x∈A} |x(0)| < ∞ and (ii)
lim_{δ→0} sup_{x∈A} wx(δ) = 0, where wx(δ) is the modulus of continuity of
x ∈ C, defined by
wx(δ) = sup_{|s−t|<δ} |x(s) − x(t)|.
Note. The conditions (i) and (ii) are in fact necessary and sufficient for A to
be totally bounded. Since C is complete, the closure Ā is complete for any
A ⊂ C. Thus, Ā is compact if and only if A is totally bounded.
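The modulus wx(δ) can be approximated on a grid. The following sketch (a grid approximation; the helper name is mine) computes wx(δ) for one fixed smooth function and shows that the modulus is small for small δ, consistent with the Arzela-Ascoli conditions applied to the singleton {x}.

```python
import math

def modulus_of_continuity(xs, delta):
    # Grid approximation of wx(delta) = sup_{|s-t|<delta} |x(s)-x(t)|,
    # for x sampled as xs[i] = x(i/n) on a uniform grid over [0, 1]
    n = len(xs) - 1            # grid spacing is 1/n
    window = int(delta * n)    # compare indices with |i-j| <= delta*n
    w = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, min(i + window + 1, len(xs))):
            w = max(w, abs(xs[i] - xs[j]))
    return w

n = 200
xs = [math.sin(2 * math.pi * i / n) for i in range(n + 1)]
w_small = modulus_of_continuity(xs, 0.01)  # small delta: small oscillation
w_large = modulus_of_continuity(xs, 0.5)   # large delta: full swing of sin
print(w_small, w_large)
```

For a compact family A one would require the same smallness uniformly over all x ∈ A, which is exactly condition (ii) of the lemma.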
Proof of Theorem 3.3. To prove the 'only if' part, assume that {Xn} is
tight. Then, for any given ε > 0 there exists a compact set K ≡ K(ε) such
that inf_n P(Xn ∈ K) > 1 − ε. By Lemma 3.4, (i) sup_{x∈K} |x(0)| < C0 for
some 0 < C0 < ∞, and (ii) there exists δ0 > 0 such that
sup_{x∈K} sup_{|s−t|<δ0} |x(s) − x(t)| < ε. Thus, from (i),
inf_n P(|Xn(0)| < C0) ≥ inf_n P(Xn ∈ K) > 1 − ε,
so that {Xn(0)} is tight in R. Furthermore, from (ii) it also holds that
inf_n P[ sup_{|s−t|<δ0} |Xn(s) − Xn(t)| < ε ] ≥ inf_n P(Xn ∈ K) > 1 − ε,
so that sup_n P[ sup_{|s−t|<δ0} |Xn(s) − Xn(t)| ≥ ε ] < ε. This implies that, for
any ε > 0,
lim_{δ→0} lim sup_{n→∞} P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε ] < ε.
Taking ε ↓ 0 gives that, for any ε0 > 0,
lim_{δ→0} lim sup_{n→∞} P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε0 ]
  ≤ lim_{ε→0} lim_{δ→0} lim sup_{n→∞} P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε ] ≤ 0.
This completes the proof of the 'only if' part.
To prove the 'if' part, let ε > 0 be fixed. We construct a totally bounded set K
such that sup_n P(Xn ∈ K^c) < ε. Suppose that (i) and (ii) of the theorem
hold. Then, we may find C0 such that
sup_n P(|Xn(0)| > C0) < ε/2. (3.2)
Also, we may choose δj > 0, j ≥ 1, such that
lim sup_{n→∞} P[ sup_{|s−t|<δj} |Xn(s) − Xn(t)| ≥ 1/j ] < ε/2^j.
Note that we can actually choose δj > 0, j ≥ 1, such that
sup_n P[ sup_{|s−t|<δj} |Xn(s) − Xn(t)| ≥ 1/j ] < ε/2^j. (3.3)
This follows since every single random element of {Xn} is tight by
Theorem 2.4, so the necessity part of the theorem that we have just proved
entails
lim_{δ→0} P[ sup_{|s−t|<δ} |Xk(s) − Xk(t)| ≥ 1/j ] = 0
for each fixed k and j. We take
K = {x : |x(0)| ≤ C0} ∩ ∩_{j=1}^∞ {x : sup_{|s−t|<δj} |x(s) − x(t)| < 1/j}.
This set is totally bounded by Lemma 3.4, so that its closure K̄ is compact.
From (3.2) and (3.3), we get
sup_n P(Xn ∈ K^c) ≤ sup_n P(|Xn(0)| > C0)
  + Σ_{j=1}^∞ sup_n P[ sup_{|s−t|<δj} |Xn(s) − Xn(t)| ≥ 1/j ] ≤ ε.
Corollary 3.5. Let Xn and X be random elements in C. If all
finite-dimensional distributions of Xn converge weakly to those of X, and if for
any ε > 0 there exist n0 and δ > 0 such that
sup_{n≥n0} P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε ] ≤ ε, (3.4)
then Xn →d X.
Note. (3.1) holds for any ε > 0 if and only if for any ε > 0 there exist n0
and δ > 0 such that (3.4) holds.
Theorem 3.6. The inequality (3.4) follows if
sup_{n≥n0} sup_{0≤t≤1} P[ sup_{s∈[t,t+δ]} |Xn(s) − Xn(t)| ≥ ε/3 ] ≤ δε. (3.5)
Proof of Theorem 3.6. Let ti = δi for i = 0, 1, . . . , δ^{-1}. Without loss of
generality, we may assume In ≡ δ^{-1} is an integer. Note that, if |s − t| < δ and
t < s (WLOG), then (i) there exists a grid point ti (1 ≤ i ≤ In − 1) such that
ti−1 ≤ t ≤ ti ≤ s ≤ ti+1, or (ii) there exists a grid point ti (0 ≤ i ≤ In − 1)
such that ti ≤ t < s ≤ ti+1. This means
sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≤ 3 max_{0≤i≤In−1} sup_{ti≤t≤ti+1} |Xn(t) − Xn(ti)|,
which with the inequality (3.5) gives
P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε ]
  ≤ P[ max_{0≤i≤In−1} sup_{ti≤t≤ti+1} |Xn(t) − Xn(ti)| ≥ ε/3 ]
  ≤ Σ_{i=0}^{In−1} P[ sup_{ti≤t≤ti+1} |Xn(t) − Xn(ti)| ≥ ε/3 ]
  ≤ In δ ε = ε.
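The key chaining inequality used here, sup_{|s−t|<δ} |x(s) − x(t)| ≤ 3 max_i sup_{ti≤t≤ti+1} |x(t) − x(ti)|, can be sanity-checked numerically. The sketch below (grid-based, variable names mine) does so for one random-walk path with δ = 0.1.

```python
import random

rng = random.Random(1)
N = 1000                 # evaluation grid over [0, 1]
xs = [0.0]
for _ in range(N):
    xs.append(xs[-1] + rng.gauss(0.0, 1.0) / N ** 0.5)  # random-walk path

delta = 0.1
m = int(delta * N)       # delta corresponds to m grid steps

# Left side: sup over grid pairs with |s - t| < delta
lhs = max(abs(xs[i] - xs[j])
          for i in range(N + 1)
          for j in range(i + 1, min(i + m, N + 1)))

# Right side: 3 * max over blocks [t_i, t_{i+1}] with t_i = i * delta of the
# oscillation relative to the block's left endpoint
blocks = [xs[k * m:(k + 1) * m + 1] for k in range(N // m)]
rhs = 3 * max(max(abs(v - b[0]) for v in b) for b in blocks)
print(lhs <= rhs)
```

The factor 3 comes from the three-term triangle inequality in the proof: a pair (t, s) within δ touches at most two adjacent blocks.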
3.3 Donsker’s Theorem
Let the random variables ξj be iid with mean 0 and variance 1. Define a
sequence of random elements {Xn} in C by
Xn(t) = (1/√n) Σ_{i=1}^{⌊nt⌋} ξi + (nt − ⌊nt⌋) ξ_{⌊nt⌋+1}/√n, (3.6)
where ⌊nt⌋ denotes the largest integer less than or equal to nt. The process Xn
is simply the linear interpolation of the points Xn(j/n) = Σ_{i=1}^{j} ξi/√n. The
following theorem, due to Donsker, is a generalization of the classical Central
Limit Theorem. It is a functional CLT for the entire partial sum process, not
just for the nth partial sum as in the classical CLT.
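The process (3.6) is straightforward to implement. The following sketch (the helper name partial_sum_process is mine) builds Xn(t) by linear interpolation of the scaled partial sums and checks that it agrees with Sj/√n at a grid point j/n.

```python
import math
import random

def partial_sum_process(xi, t):
    # X_n(t) from (3.6): linear interpolation of the scaled partial sums
    # S_j / sqrt(n) at the points j/n, where n = len(xi)
    n = len(xi)
    k = int(n * t)                       # floor(nt)
    s = sum(xi[:k]) / math.sqrt(n)       # S_k / sqrt(n)
    if k < n:
        s += (n * t - k) * xi[k] / math.sqrt(n)  # linear piece
    return s

rng = random.Random(0)
n = 100
xi = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # iid, mean 0, variance 1

# At a grid point j/n, the process equals the scaled partial sum
j = 37
grid_value = partial_sum_process(xi, j / n)
direct = sum(xi[:j]) / math.sqrt(n)
print(abs(grid_value - direct))
```

Between grid points the path is the straight line joining consecutive partial sums, which is what makes Xn an element of C.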
Definition. The standard Wiener process, or Brownian motion, W is a
Gaussian process taking values in C such that EW(t) = 0 and
cov(W(s), W(t)) = s ∧ t. Alternatively, it is defined to be a stochastic process
with the following properties: (i) for each 0 ≤ t ≤ 1, W(t) ∼ N(0, t); (ii) W
has independent increments, i.e., W(t2) − W(t1), . . . , W(tk) − W(tk−1) are
independent for all 0 ≤ t1 ≤ · · · ≤ tk ≤ 1.
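A discrete-grid approximation of W can be simulated from independent Gaussian increments. The sketch below (names mine) estimates cov(W(0.3), W(0.7)) by Monte Carlo; by the definition above it should be close to 0.3 = s ∧ t.

```python
import math
import random

def brownian_path(n_steps, rng):
    # Grid approximation of W on {i/n}: independent N(0, 1/n) increments
    dt = 1.0 / n_steps
    w = [0.0]
    for _ in range(n_steps):
        w.append(w[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return w

rng = random.Random(0)
n_steps, reps = 100, 10000
s_idx, t_idx = 30, 70        # s = 0.3, t = 0.7, so cov(W(s), W(t)) = 0.3
pairs = []
for _ in range(reps):
    w = brownian_path(n_steps, rng)
    pairs.append((w[s_idx], w[t_idx]))
mean_s = sum(a for a, _ in pairs) / reps
mean_t = sum(b for _, b in pairs) / reps
cov = sum((a - mean_s) * (b - mean_t) for a, b in pairs) / reps
print(cov)
```

The identity cov(W(s), W(t)) = s ∧ t follows from the independent increments: W(t) = W(s) + (W(t) − W(s)), and the increment is independent of W(s).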
Theorem 3.7 (Donsker). The partial sum process defined in (3.6)
converges weakly to the standard Wiener process W.
Proof. Convergence of the finite-dimensional distributions of Xn to those of
W follows immediately from the classical CLT. We prove that for any ε > 0
there exist n0 and δ > 0 such that (3.5) holds. Our approach needs finite 4th
moments of the ξj. For a proof assuming only second moments, see Theorem 8.4
and the arguments running through pp. 89-91 of Billingsley (1968). We
introduce a technique based on the assumption that the ξj have finite 4th
moments since it is more instructive and can be generalized to other settings.
The basic idea is to use the following maximal inequality, which is fairly
general, so that it can be applied to partial sums of arbitrary random variables
that need not be independent or identically distributed.
Lemma 3.8 (Theorem 12.2, Billingsley, 1968). Let ξ1, . . . , ξm be random
variables. Let Sk = ξ1 + · · · + ξk for k ≥ 1 and put S0 = 0. If
E|Sj − Si|^γ ≤ (u_{i+1} + · · · + uj)^α for some γ ≥ 0, α > 1 and u1, . . . , um ≥ 0,
then
P( max_{0≤k≤m} |Sk| ≥ λ ) ≤ (Cγ,α/λ^γ) (u1 + · · · + um)^α,
where Cγ,α is a constant that depends only on γ and α.
To prove (3.5), consider a fixed point t ∈ [0, 1]. Let j be an integer such that
j/n ≤ t < (j + 1)/n. Suppose that k/n < t + δ ≤ (k + 1)/n for some
j ≤ k ≤ n − 1. In this case, (k − j − 1)/n ≤ δ, and s ∈ (t, t + δ] may lie in an
interval (i/n, (i + 1)/n] for some i with j ≤ i ≤ k. For such an i,
|Xn(s) − Xn(t)| ≤ |Xn(s) − Xn(i/n)| + |Xn(i/n) − Xn(j/n)| + |Xn(t) − Xn(j/n)|
  ≤ |Xn((i + 1)/n) − Xn(i/n)| + |Xn(i/n) − Xn(j/n)| + |Xn((j + 1)/n) − Xn(j/n)|,
where the second inequality follows from the polygonal character of Xn.
Writing I(δ, j) = (j + 1 + nδ) ∧ (n − 1), we establish
sup_{s∈[t,t+δ]} |Xn(s) − Xn(t)| ≤ max_{j<i≤I(δ,j)} |Xn(i/n) − Xn(j/n)|
  + 2 max_{j≤i≤I(δ,j)} |Xn((i + 1)/n) − Xn(i/n)| (3.7)
  = max_{j<i≤I(δ,j)} |(ξ_{j+1} + · · · + ξi)/√n| + 2 max_{j+1≤i≤I(δ,j)+1} |ξi/√n|.
We apply Lemma 3.8 to get a probability bound for the large deviation of the
first term on the RHS of (3.7). Since the ξk are independent with Eξk = 0, we
obtain, for any i, i′ with i′ > i > j,
E(S_{i′} − Si)^4 = E(ξ_{i+1} + · · · + ξ_{i′})^4
  ≤ C1 (Σ_{k=i+1}^{i′} Eξk^2)^2 + Σ_{k=i+1}^{i′} Eξk^4 ≤ (C1 + 1)(i′ − i)^2
for some absolute constant C1 > 0. This means we can apply Lemma 3.8 with
γ = 4, α = 2 and uk ≡ (C1 + 1)^{1/2}.
Thus, there exists an absolute constant C2 > 0 such that
P[ max_{j<i≤I(δ,j)} |(ξ_{j+1} + · · · + ξi)/√n| ≥ ε/6 ] ≤ C2 (nδ + 1)^2/(n^2 ε^4) ≤ 4C2 δ^2/ε^4
for sufficiently large n such that n ≥ 1/δ. Taking δ ≤ ε^5/(8C2) gives
P[ max_{j<i≤I(δ,j)} |(ξ_{j+1} + · · · + ξi)/√n| ≥ ε/6 ] ≤ δε/2. (3.8)
For the second term on the RHS of (3.7), there exist an absolute constant
C3 > 0 and an integer n0(δ, ε) such that for all n ≥ n0(δ, ε),
P[ max_{j≤i≤I(δ,j)} |ξi/√n| ≥ ε/12 ] ≤ Σ_{i=j}^{I(δ,j)} P[ |ξi/√n| ≥ ε/12 ]
  ≤ C3 δ/(nε^4) ≤ δε/2. (3.9)
The inequalities (3.7), (3.8) and (3.9) give (3.5).
4 Weak Convergence in D[0, 1]
We consider weak convergence in D ≡ D[0, 1], the space of cadlag functions
defined on the interval [0, 1].
Definition. A function defined on A ⊂ R is called a cadlag function if it is
right-continuous and has left limits everywhere in A.
Note that all continuous functions are cadlag functions. All distribution
functions are also cadlag functions. The main difficulty with this space is that
it is not separable under the uniform metric dU.
4.1 Non-separability of (D, dU)
Define xα ∈ D, for 0 ≤ α ≤ 1, by xα(t) = I(t ≥ α). If α ≠ α′, then
dU(xα, xα′) = 1. Let ε ≤ 1/2. Then, for any α ≠ α′,
{x ∈ D : dU(xα, x) < ε} ∩ {x ∈ D : dU(xα′, x) < ε} = ∅.
If (D, dU) were separable, there would exist a countable subset, say D0, of D
such that every open ball {x ∈ D : dU(xα, x) < ε} for α ∈ [0, 1] contains a
member of D0. But this is impossible since the number of these open balls is
uncountable and all the balls are disjoint.
Non-separability of (D, dU) causes a fundamental difficulty. If a metric space
(S, S, d) is not separable, then functions that map a probability space (Ω, F, P)
to (S, S, d) often fail to be measurable. Important examples are empirical
processes.
Define X : ([0, 1], B, µ) → (D, D, dU) by X(t, w) = I(w ≤ t), where B is the
Borel σ-field of [0, 1], µ is the Lebesgue measure and D is the Borel σ-field of
D. For any subset H of [0, 1],
∪_{α∈H} B(xα, 1/2) = ∪_{α∈H} {y ∈ D : dU(y, xα) < 1/2} ∈ D.
However, since X(·, w) ∈ B(xα, 1/2) if and only if X(·, w) = xα, which is also
equivalent to w = α, we obtain
X^{-1}( ∪_{α∈H} B(xα, 1/2) ) = {w : X(·, w) ∈ ∪_{α∈H} B(xα, 1/2)}
  = {w : X(·, w) = xα for some α ∈ H} = H.
By taking H ∉ B, we see that X^{-1}(D) ⊂ B fails; that is, X is not measurable.
4.2 Skorohod metric
The Skorohod metric dS defined below makes D separable. The basic idea is to
allow a deformation of the time scale when measuring the distance between two
elements of D.
Definition. Let Λ be the class of all strictly increasing and continuous
mappings λ of [0, 1] onto itself such that λ(0) = 0 and λ(1) = 1. The
Skorohod metric, denoted by dS, is defined by
dS(x, y) = inf_{λ∈Λ} max{ sup_{t∈[0,1]} |λ(t) − t|, sup_{t∈[0,1]} |x(t) − y(λ(t))| }.
A proof of the fact that dS is indeed a metric can be found on page 111 of
Billingsley (1968).
Example. We compute dS(xα, xα′) for α ≠ α′. If we take λ ∈ Λ such that
λ(α) ≠ α′, then sup_{t∈[0,1]} |xα(t) − xα′(λ(t))| = 1. For λ ∈ Λ such that
λ(α) = α′, it holds that sup_{t∈[0,1]} |xα(t) − xα′(λ(t))| = 0. Also,
inf_{λ∈Λ: λ(α)=α′} sup_{t∈[0,1]} |λ(t) − t| = |α′ − α| ≤ 1.
Thus, dS(xα, xα′) = |α′ − α|.
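This computation can be mirrored numerically. The sketch below (grid-based; names mine) evaluates the uniform distance between x0.4 and x0.5, which is 1, and the Skorohod upper bound obtained from one particular λ ∈ Λ that is piecewise linear with λ(0.4) = 0.5, giving max(sup |λ(t) − t|, 0) = 0.1 = |α′ − α|.

```python
def x_step(alpha, t):
    # x_alpha(t) = I(t >= alpha)
    return 1.0 if t >= alpha else 0.0

def lam(t, a, b):
    # Piecewise-linear time change in Lambda with lam(a) = b, lam(0) = 0,
    # lam(1) = 1
    if t <= a:
        return b * t / a
    return b + (1.0 - b) * (t - a) / (1.0 - a)

alpha, alpha2 = 0.4, 0.5
grid = [i / 1000.0 for i in range(1001)]

# Uniform distance: the two steps disagree on [0.4, 0.5), so dU = 1
d_uniform = max(abs(x_step(alpha, t) - x_step(alpha2, t)) for t in grid)

# Skorohod bound from this lambda: the value part vanishes, the time part
# is |lambda(t) - t|, maximized at t = alpha
time_dist = max(abs(lam(t, alpha, alpha2) - t) for t in grid)
value_dist = max(abs(x_step(alpha, t) - x_step(alpha2, lam(t, alpha, alpha2)))
                 for t in grid)
d_skorohod_bound = max(time_dist, value_dist)
print(d_uniform, d_skorohod_bound)
```

The bound 0.1 is in fact the exact value of dS here, by the infimum computation in the example above.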
Theorem 4.1. The metric space (D, dS) is separable.
A proof of the above theorem can be found on page 112 of Billingsley (1968).
There is another difficulty: the space (D, dS) is not complete, as the following
example illustrates. Completeness facilitates the characterization of compact
sets, and thus that of tight sequences of random elements.
Example. Define xn ∈ D by xn(t) = I((1/2) ≤ t < (1/2) + (1/n)). Then,
dS(xm, xn) = |1/m − 1/n| → 0
as m, n → ∞. Thus, {xn} is a Cauchy sequence. However, there exists no
x ∈ D such that xn → x in dS. To see this, suppose that there exists such a
function x. Then, there exists a strictly increasing and continuous function λn
with λn(0) = 0 and λn(1) = 1 such that
sup_{t∈[0,1]} |λn(t) − t| → 0 and sup_{t∈[0,1]} |xn(λn(t)) − x(t)| → 0. (4.10)
Note that
xn(λn(t)) = I(λn^{-1}(1/2) ≤ t < λn^{-1}((1/2) + (1/n))).
Due to the second convergence in (4.10) and the fact that xn(λn(·)) is an
indicator, the limit x ∈ D must take the form x(t) = I(α ≤ t < β) for some
0 ≤ α < β ≤ 1. The case α = β is excluded here since then x ≡ 0, and thus
the second convergence in (4.10) would not hold. Now, due to the first
convergence in (4.10), we have
|λn^{-1}(1/2) − 1/2| = |λn^{-1}(1/2) − λn(λn^{-1}(1/2))| → 0.
Similarly, λn^{-1}((1/2) + (1/n)) → 1/2. This means that α = β, which is a
contradiction.
There is a metric dS′ which is equivalent to dS and such that the metric space
(D, dS′) is complete. See Theorems 14.1 and 14.2 of Billingsley (1968). Thus,
we can proceed as if the Skorohod space (D, dS) were separable and complete.
4.3 Finite-dimensional distributions
We try to use Theorem 2.3 to obtain a set of sufficient conditions for weak
convergence in (D, D, dS). As in the case of C, we take for H the class of all
projections πt1,...,tk.
Theorem 4.2. The projection πt1,...,tk as a map from (D, D) to (Rk, Rk) is
measurable for all t1, . . . , tk ∈ [0, 1] and k ≥ 1.
For a proof of this theorem, see page 121 of Billingsley (1968). However, the
projections πt1,...,tk are not continuous everywhere in D for each (t1, . . . , tk).
This complicates matters somewhat.
Recall that, when we derived Theorem 3.2 for a sequence of random elements
Xn in C from Theorem 2.3, continuity of the projections πt1,...,tk was used
only to establish the weak convergence of πt1,...,tk Xn to πt1,...,tk X for Xn
converging weakly to X. In general, a measurable function h : (S, S) → (S′, S′)
need not be continuous everywhere in S for the sequence h(Xn) to converge
weakly to h(X). According to the continuous mapping theorem
(Theorem 2.2), if P(X ∈ Dh) = 0 for the set Dh where h is discontinuous,
then h(Xn) converges weakly to h(X). The following theorem is a slight
generalization of Theorem 2.3 that embodies this idea.
Theorem 4.3. For a random element Y taking values in a metric space
(S, S), let HY be a collection of measurable functions h that map (S, S) to
another metric space (S′, S′) such that P(Y ∈ Dh) = 0 for the set Dh where h
is discontinuous. Suppose that {Xn} is relatively compact and h(Xn)
converges weakly to h(X) for all h ∈ HX for some random element X. If
H^{-1}_{X,Y}(S′) ≡ ∪_{h∈HX∩HY} h^{-1}(S′) is a field generating S for all random
elements Y, then Xn converges weakly to X.
In fact, the requirement in Theorem 3.2 and Corollary 3.5 that all
finite-dimensional distributions of Xn converge weakly to those of X is too
strong for the space (D, D, dS), as the following example illustrates.
Example. Let Xn ≡ I[0,(1/2)+(1/n)) and X ≡ I[0,1/2). Then, Xn →d X since
dS(Xn, X) = 1/n → 0. However, Xn(1/2) ≡ 1 does not converge to 0 ≡ X(1/2).
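A quick numerical restatement of this example (names mine): the functions agree everywhere except near t = 1/2, where pointwise convergence fails even though dS(Xn, X) = 1/n → 0.

```python
def x_n(t, n):
    # X_n = indicator of [0, 1/2 + 1/n)
    return 1.0 if 0.0 <= t < 0.5 + 1.0 / n else 0.0

def x_lim(t):
    # X = indicator of [0, 1/2)
    return 1.0 if 0.0 <= t < 0.5 else 0.0

# At the limit's discontinuity point t = 1/2, X_n(1/2) = 1 for every n ...
vals = [x_n(0.5, n) for n in (10, 100, 1000)]
# ... while X(1/2) = 0, so the projection at t = 1/2 does not converge
print(vals, x_lim(0.5))
```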
Theorem 4.3 enables us to relax the condition that all finite-dimensional
distributions of Xn converge weakly to those of X. Our relaxation is founded
on the following three theorems. The first one characterizes the discontinuity
sets of πt1,...,tk in (D, dS); it tells us that πt1,...,tk for 0 < t1 < · · · < tk < 1
is discontinuous at x if and only if x is discontinuous at some tj, 1 ≤ j ≤ k.
Theorem 4.4. The projections π0 and π1 are everywhere continuous, but πt
for 0 < t < 1 is continuous at x if and only if x is continuous at t.
Proof. The first result is immediate. For the second, suppose that x is
continuous at t. Let {xn} be a sequence in D such that dS(xn, x) → 0. Then,
there exists a sequence {λn} in Λ such that sup_{s∈[0,1]} |xn(λn(s)) − x(s)| → 0
and sup_{s∈[0,1]} |λn(s) − s| → 0. Since x is continuous at t,
|xn(t) − x(t)| ≤ |xn(t) − x(λn^{-1}(t))| + |x(λn^{-1}(t)) − x(t)|
  ≤ sup_{s∈[0,1]} |xn(λn(s)) − x(s)| + |x(λn^{-1}(t)) − x(t)| → 0.
Suppose, on the other hand, that x is discontinuous at t. We need to show
that there exist an ε > 0 and a sequence {xn} such that dS(xn, x) → 0 but
|xn(t) − x(t)| ≥ ε for infinitely many n. Take xn such that xn(s) = x(λn(s))
for a sequence {λn} which is linear on [0, t] and on [t, 1] and satisfies
λn(t) = t − 1/n. Then,
dS(xn, x) = inf_{λ∈Λ} max{ sup_{s∈[0,1]} |λ(s) − s|, sup_{s∈[0,1]} |xn(λ(s)) − x(s)| }
  ≤ sup_{s∈[0,1]} |λn^{-1}(s) − s| ≤ 1/n → 0,
but |xn(t) − x(t)| → |x(t−) − x(t)| > 0.
For a point t ∈ [0, 1], define
D(πt) = {x ∈ D : x(t) ≠ x(t−)},
which is the set of all elements of D that are discontinuous at t. The main
lesson of Theorem 4.4 is that πt for 0 < t < 1 is discontinuous on D(πt), and
is continuous on D(πt)^c ⊂ D.
Theorem 4.5. The complement of the set {t ∈ (0, 1) : P(X ∈ D(πt)) = 0}
for a random element X taking values in (D, D) is at most countable.
For a proof of this theorem, see page 124 of Billingsley (1968). The theorem
tells us that the set
TX ≡ {0, 1} ∪ {t ∈ (0, 1) : P(X ∈ D(πt)) = 0}
is dense in [0, 1]. Furthermore, it implies that ∩_{i=1}^m T_{Xi} for finitely many
Xi is also dense in [0, 1].
Theorem 4.6 (Theorem 14.5, Billingsley, 1968). For a subset T of [0, 1],
let FT be the class of all sets of the form π^{-1}_{t1,...,tk}(A′) for some A′ ∈ Rk,
t1, . . . , tk ∈ T and k ≥ 1. If T contains 1 and is dense in [0, 1], then FT is a
field that generates D.
The following theorem demonstrates that the class of functions
ΠX = {πt1,...,tk : tj ∈ TX for all 1 ≤ j ≤ k, k ≥ 1}
plays the role of HX in Theorem 4.3. Recall that TX ⊂ [0, 1] is the set
consisting of t = 0, 1 together with those t ∈ (0, 1) at which the probability
that X is discontinuous equals 0, and that P(X ∈ Dπ) = 0 for any π ∈ ΠX
since
D_{πt1,...,tk} = {x ∈ (D, dS) : πt1,...,tk is discontinuous at x} = ∪_{1≤j≤k: 0<tj<1} D(πtj)
due to Theorem 4.4.
Theorem 4.7. Let Xn and X be random elements in D. If the distribution of
(Xn(t1), . . . , Xn(tk)) converges weakly to the distribution of (X(t1), . . . , X(tk))
for all t1, . . . , tk ∈ TX and for all k ≥ 1, and if {Xn} is tight, then Xn →d X.
Proof of Theorem 4.7. We follow the lines of the proof of Theorem 2.3.
We prove that the weak limit of a convergent subsequence {Xn′′} of {Xn′}
does not depend on {Xn′}. Let Y be the weak limit of {Xn′′}. Then, by the
continuous mapping theorem, πXn′′ →d πY for all π ∈ ΠY. On the other hand,
we also have πXn′′ →d πX for all π ∈ ΠX by the condition of the theorem.
This implies πY =d πX for all π ∈ ΠX ∩ ΠY, so that
P(πX ∈ A) = P(πY ∈ A) for all A ∈ Rk and all π ∈ ΠX ∩ ΠY
  ⇔ PX ≡ PY on {π^{-1}(A) : A ∈ Rk, π ∈ ΠX ∩ ΠY}
  ⇔ PX ≡ PY on F_{TX∩TY} (notation as in Theorem 4.6).
By Theorem 4.5, TX ∩ TY contains 0 and 1, and is dense in [0, 1]. Application
of Theorem 4.6 concludes that Y =d X.
4.4 Tightness in (D, dS)
The following theorem is an analogue of the Arzela-Ascoli theorem; it
characterizes compact sets in (D, dS). Let w′x(δ) be defined by
w′x(δ) = inf_{ti} max_{1≤i≤r} sup{|x(s) − x(t)| : s, t ∈ [ti−1, ti)},
where the infimum extends over all finite sets {ti} of points such that
0 = t0 < t1 < · · · < tr = 1 and ti − ti−1 > δ for all 1 ≤ i ≤ r, r ≥ 1. This
modulus plays in D the role that wx(δ) plays in C. In fact, w′x(δ) → 0 as
δ → 0 for all x ∈ D (Lemma 1, page 110, Billingsley, 1968).
Note. w′x(δ) → 0 as δ → 0 if and only if for any ε > 0 there exists a
partition of [0, 1] into finitely many Ti such that
max_i sup_{s,t∈Ti} |x(s) − x(t)| ≤ ε.
Theorem 4.8 (Theorem 14.3, Billingsley, 1968). A set A ⊂ D has
compact closure if and only if
(i) sup_{x∈A} sup_{t∈[0,1]} |x(t)| < ∞;
(ii) lim_{δ→0} sup_{x∈A} w′x(δ) = 0.
It is sometimes difficult to work with w′x(δ). The following modulus is often
more convenient. Define
w′′x(δ) = sup{|x(t) − x(t1)| ∧ |x(t2) − x(t)| : 0 ≤ t1 ≤ t ≤ t2 ≤ 1, t2 − t1 ≤ δ}.
Example. Recall the definition of xα: xα(t) = I(t ≥ α). For this function,
w_{xα}(δ) = 1 for all δ > 0. For w′_{xα}(δ), note that
w′_{xα}(δ) ≤ sup_{s,t∈[α,α+ε)} |xα(s) − xα(t)| = 0
for any δ and ε such that 0 < δ < ε < 1 − α. Thus, w′_{xα}(δ) = 0 for all
sufficiently small δ > 0 if α < 1. Also, we have w′′_{xα}(δ) = 0 for all
sufficiently small δ > 0 if 0 < α < 1, since
w′′_{xα}(δ) = sup_{t∈[α−δ/2,α+δ/2]} |xα(t) − xα(α − δ/2)| ∧ |xα(α + δ/2) − xα(t)| = 0.
Fact. For all x ∈ D, it follows that
w′′x(δ) ≤ w′x(δ) ≤ wx(2δ). (4.11)
For a proof of the first inequality, see pages 118-119 of Billingsley (1968), and
for a proof of the second, see page 110 of Billingsley (1968).
The following theorem gives another characterization of compact sets in (D, dS) based on w′′x(δ). It is sometimes more convenient to work with than the characterization in Theorem 4.8. We write wx(T ) = sup_{s,t∈T} |x(s) − x(t)|.
Theorem 4.9 (Theorem 14.4, Billingsley, 1968). A set A ⊂ D has compact closure if and only if
(i) sup_{x∈A} sup_{t∈[0,1]} |x(t)| < ∞;
(ii) lim_{δ→0} sup_{x∈A} w′′x(δ) = 0;
(iii) lim_{δ→0} sup_{x∈A} wx[0, δ) = 0;
(iv) lim_{δ→0} sup_{x∈A} wx[1 − δ, 1) = 0.
From Theorems 4.8 and 4.9, we get the following characterizations of a tight sequence in D.
Theorem 4.10 (Theorem 15.2, Billingsley, 1968). A sequence Xn in D is tight if and only if
(i) the sequence of random variables sup{|Xn(t)| : t ∈ [0, 1]} is tight in R;
(ii) for any ε > 0, lim_{δ→0} lim sup_{n→∞} P[w′Xn(δ) ≥ ε] = 0.
Note. The condition (ii) of Theorem 4.10 is equivalent to asking that for any ε, η > 0 there exists a partition of [0, 1] into finitely many Ti such that

lim sup_{n→∞} P[ max_i sup_{s,t∈Ti} |Xn(s) − Xn(t)| ≥ ε ] ≤ η.
Theorem 4.11 (Theorem 15.3, Billingsley, 1968). A sequence Xn in D is tight if and only if condition (i) holds and conditions (ii), (iii) and (iv) hold for any ε > 0:
(i) the sequence of random variables sup{|Xn(t)| : t ∈ [0, 1]} is tight in R;
(ii) lim_{δ→0} lim sup_{n→∞} P[w′′Xn(δ) ≥ ε] = 0;
(iii) lim_{δ→0} lim sup_{n→∞} P[ sup_{s,t∈[0,δ)} |Xn(s) − Xn(t)| ≥ ε ] = 0;
(iv) lim_{δ→0} lim sup_{n→∞} P[ sup_{s,t∈[1−δ,1)} |Xn(s) − Xn(t)| ≥ ε ] = 0.
Homework: Prove Theorems 4.10 and 4.11, and also prove the result in “Note”
between them.
4.5 Weak convergence in (D, D, dS)
The following theorem gives a set of sufficient conditions for weak convergence in D. The theorem follows from Theorems 4.7 and 4.11.
Theorem 4.12 (Theorem 15.4, Billingsley, 1968). Let Xn and X be random elements in D. Suppose that P (X(1−) ≠ X(1)) = 0. If the distribution of (Xn(t1), . . . , Xn(tk)) converges weakly to the distribution of (X(t1), . . . , X(tk)) for all t1, . . . , tk ∈ TX and for all k ≥ 1, and if

lim_{δ→0} lim sup_{n→∞} P[w′′Xn(δ) ≥ ε] = 0 (4.12)

for any ε > 0, then Xn →d X.
Proof. We prove (i), (iii) and (iv) in Theorem 4.11. To prove (i), let ε > 0 be given. Then by the condition (4.12), we can take δ0 such that

lim sup_{n→∞} P(w′′Xn(δ0) ≥ ε) ≤ ε/2.

Choose 0 = t1 < · · · < tk = 1 from TX such that ti − ti−1 ≤ δ0. This is possible since TX is dense in [0, 1] by Theorem 4.5. Then,

sup_{t∈[0,1]} |Xn(t)| ≤ w′′Xn(δ0) + max_{1≤i≤k} |Xn(ti)|.

Since each sequence Xn(ti) is tight, max_{1≤i≤k} |Xn(ti)| is also tight. Thus, there exists C > 0 such that

lim sup_{n→∞} P( max_{1≤i≤k} |Xn(ti)| > C ) ≤ ε/2.
This implies

lim sup_{n→∞} P( sup_{t∈[0,1]} |Xn(t)| > C + ε )
≤ lim sup_{n→∞} P(w′′Xn(δ0) > ε) + lim sup_{n→∞} P( max_{1≤i≤k} |Xn(ti)| > C ) ≤ ε,

which concludes the proof of (i).
For (iii) it suffices to prove that for any ε > 0 there exist δ0 and n0 such that for all n ≥ n0

P[ sup_{s,t∈[0,δ0)} |Xn(s) − Xn(t)| ≥ ε ] ≤ ε. (4.13)

We note that

sup_{s,t∈[0,δ0)} |Xn(s) − Xn(t)| ≤ 2 sup_{s∈[0,δ0)} |Xn(s) − Xn(0)|
≤ 2[ w′′Xn(δ0) + |Xn(δ0) − Xn(0)| ].
The second inequality above holds since, for each s ∈ [0, δ0),

|Xn(s) − Xn(0)| ≤ w′′Xn(δ0)

in case |Xn(s) − Xn(0)| ≤ |Xn(s) − Xn(δ0)|, and

|Xn(s) − Xn(0)| ≤ |Xn(s) − Xn(δ0)| + |Xn(δ0) − Xn(0)|
≤ w′′Xn(δ0) + |Xn(δ0) − Xn(0)|

in case |Xn(s) − Xn(0)| ≥ |Xn(s) − Xn(δ0)|.
Let ε > 0 be fixed. By the second condition of the theorem, one can take δ1 ∈ TX and n1 ≡ n1(δ1) such that for all n ≥ n1

P[w′′Xn(δ1) ≥ ε/4] ≤ ε/4. (4.14)

We claim that there exists also a δ0 ≤ δ1 such that δ0 ∈ TX and

P[|X(δ0) − X(0)| ≥ ε/12] ≤ ε/4. (4.15)
Since Xn(δ0) and Xn(0) converge weakly to X(δ0) and X(0), respectively, there exists n′1 ≡ n′1(δ0, ε) such that for all n ≥ n′1

P[|Xn(δ0) − X(δ0)| ≥ ε/12] ≤ ε/4, (4.16)
P[|Xn(0) − X(0)| ≥ ε/12] ≤ ε/4. (4.17)

In (4.15)–(4.17) and below wherever relevant, Xn(0), X(0) and Xn(δ0), X(δ0) are those versions in the Skorohod Representation Theorem (Theorem 1.1) that are defined on the same probability space and converge almost surely.
The inequalities (4.14)–(4.17) give

P[ sup_{s,t∈[0,δ0)} |Xn(s) − Xn(t)| ≥ ε ]
≤ P[w′′Xn(δ0) ≥ ε/4] + P[|Xn(δ0) − Xn(0)| ≥ ε/4] ≤ ε

for all n ≥ n0 ≡ n1 ∨ n′1.
It remains to prove (4.15) to complete the proof of (iii). Since all x in D are right continuous, it follows that

1 = P( lim_{δ→0} sup_{s∈[0,δ]} |X(s) − X(0)| = 0 )
= P[ ⋂_k ⋃_l ( sup_{s∈[0,1/l]} |X(s) − X(0)| ≤ 1/k ) ].

This means that for any η1 > 0

0 = P[ ⋂_l ( sup_{s∈[0,1/l]} |X(s) − X(0)| ≥ η1 ) ] = lim_{l→∞} P( sup_{s∈[0,1/l]} |X(s) − X(0)| ≥ η1 ).

Thus, for any η1, η2 > 0, there exists an L > 0 such that

P( sup_{s∈[0,1/L]} |X(s) − X(0)| ≥ η1 ) ≤ η2.
Since TX is dense in [0, 1], we can always take a δ0 ∈ TX such that δ0 ≤ 1/L and

P(|X(δ0) − X(0)| ≥ η1) ≤ P( sup_{s∈[0,1/L]} |X(s) − X(0)| ≥ η1 ) ≤ η2.

This completes the proof of (iii).
The condition (iv) also holds by symmetry. One point requiring care is the existence of a δ0 ∈ TX that ensures the analogue of (4.15), which is now

P[|X(1) − X(1 − δ0)| ≥ ε/12] ≤ ε/4.

This can be proved as (4.15) by using the condition P (X ∈ D(π1)) = 0 and by working on the set D(π1)c, whose elements are all left continuous at 1.
4.6 Limit process with continuous sample paths
When the limit process X in Theorem 4.12 has continuous sample paths a.e., the condition P (X(1−) ≠ X(1)) = 0 is automatically satisfied and also TX = [0, 1]. Thus, in this case Xn converges weakly to X if all finite-dimensional distributions of Xn converge weakly to those of X and if (4.12) holds for any ε > 0. What if all finite-dimensional distributions of Xn converge weakly and if, instead of (4.12),

lim_{δ→0} lim sup_{n→∞} P[ sup_{|s−t|<δ} |Xn(s) − Xn(t)| ≥ ε ] = 0 (4.18)

for any ε > 0? Recall that this was a criterion for weak convergence in C. The following theorem shows that the limit process in this case has continuous sample paths a.e.
Theorem 4.13 (Theorem 15.5, Billingsley, 1968). Let Xn be a sequence of random elements in (D, D, dS). Suppose that Xn(0) is tight in R and (4.18) holds for any ε > 0. Then, Xn is tight, and, if X is the weak limit of a subsequence Xn′, then X has continuous sample paths a.e., i.e.,

P (X ∈ D(πt)c for all t ∈ [0, 1]) = 1.

Donsker’s Theorem.
Let the random variables ξj be iid with mean 0 and variance 1. Define a sequence of random elements Xn in D by

Xn(t) = (1/√n) Σ_{i=1}^{⌊nt⌋} ξi, (4.19)

where ⌊nt⌋ denotes the largest integer which is less than or equal to nt.
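The partial sum process (4.19) is easy to simulate. The sketch below is illustrative only (the seed, the ±1 steps and the function name are my own choices); any iid steps with mean 0 and variance 1 would do:

```python
import math
import random

def partial_sum_process(xi, t):
    """X_n(t) = n^{-1/2} * sum_{i <= floor(n t)} xi_i, as in (4.19)."""
    n = len(xi)
    k = math.floor(n * t)            # only the first floor(n t) steps enter
    return sum(xi[:k]) / math.sqrt(n)

random.seed(0)
n = 10_000
# iid steps with mean 0 and variance 1 (here: +/-1 coin flips)
xi = [random.choice((-1.0, 1.0)) for _ in range(n)]

# One path evaluated on a coarse grid; by Donsker's theorem the path
# behaves, for large n, like a standard Wiener process on [0, 1].
path = [partial_sum_process(xi, t / 10) for t in range(11)]
print(path[0])    # 0.0, since floor(n * 0) = 0 terms are summed
```

The path is a step function with jumps of size n^{-1/2} at the points i/n, so it lives in D rather than C, which is exactly why the Skorohod framework of this section is needed.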
Theorem 4.14 (Donsker: Theorem 16.1, Billingsley, 1968). The partial sum process defined at (4.19) converges weakly to the standard Wiener process W .
Proof. Prove (4.18) along the lines of the proof of Theorem 3.7. Then, use the inequality at (4.11) to prove (4.12) and apply Theorem 4.12. Since the limit X is the standard Wiener process, X has continuous sample paths a.e., so that TX = [0, 1]. Thus, it remains to show that all finite-dimensional distributions of Xn converge to those of X, which follows from the classical multivariate CLT.
5 Weak Convergence of Empirical Processes
We discuss weak convergence of uniform empirical processes first, and then
extend the discussion to more general cases.
5.1 Uniform empirical processes
Let ξj be iid Uniform[0, 1], and Fn be their empirical distribution function defined by

Fn(t) = n^{-1} Σ_{i=1}^n I(ξi ≤ t).

A centered and scaled version of Fn defines a uniform empirical process Xn indexed by t ∈ [0, 1]:

Xn(t) = √n (Fn(t) − t). (5.20)
It will be shown that this process converges weakly to the Brownian bridge
defined below.
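The process (5.20) can be computed exactly from a sample; below is a hedged stdlib-only sketch (helper names and the seed are mine), evaluating Fn by binary search on the sorted sample:

```python
import math
import random
from bisect import bisect_right

def empirical_process(sample, t):
    """X_n(t) = sqrt(n) * (F_n(t) - t), the uniform empirical process (5.20).

    `sample` must be sorted so F_n(t) = #{xi_i <= t} / n is a binary search."""
    n = len(sample)
    fn_t = bisect_right(sample, t) / n
    return math.sqrt(n) * (fn_t - t)

random.seed(1)
n = 2_000
sample = sorted(random.random() for _ in range(n))   # iid Uniform[0, 1]

# X_n is tied down at both endpoints, matching the Brownian-bridge limit:
print(empirical_process(sample, 0.0), empirical_process(sample, 1.0))   # 0.0 0.0
```

Like the partial sum process, each path is a step function with n jumps of size n^{-1/2}, so it is an element of D, not of C.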
Definition. A Gaussian process B taking values in C such that EB(t) = 0 and cov(B(s), B(t)) = s ∧ t − st is called the standard Brownian bridge. Alternatively, it is defined by

B(t) = W (t) − tW (1),

where W is the standard Brownian motion.
Note. The standard Brownian bridge B is tied down at 0 and 1 with probability 1, i.e., P [B(0) = B(1) = 0] = 1.
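The representation B(t) = W (t) − tW (1) gives an immediate simulation recipe. A minimal sketch, assuming a Gaussian random-walk approximation of W on a uniform grid (the step count, seed and function name are my own choices):

```python
import math
import random

def brownian_bridge_path(n_steps, rng):
    """One path of B(t) = W(t) - t * W(1) on the grid t = i / n_steps."""
    dt = 1.0 / n_steps
    w = [0.0]
    for _ in range(n_steps):                        # W via iid N(0, dt) increments
        w.append(w[-1] + rng.gauss(0.0, math.sqrt(dt)))
    w1 = w[-1]
    # Subtracting t * W(1) pins the path down at both endpoints.
    return [w[i] - (i / n_steps) * w1 for i in range(n_steps + 1)]

rng = random.Random(0)
b = brownian_bridge_path(1000, rng)

# Tied down at 0 and 1, as the Note states:
print(b[0], b[-1])   # 0.0 0.0
```

The covariance s ∧ t − st can be checked by averaging B(s)B(t) over many such paths, since E[(W(s) − sW(1))(W(t) − tW(1))] = s ∧ t − st.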
Theorem 5.1. The empirical process defined at (5.20) for iid uniformly
distributed random variables ξj on [0, 1] converges weakly to the standard
Brownian bridge B.
Proof. Note that B takes values in C a.e., thus it has continuous sample paths a.e. Convergence of all finite-dimensional distributions of Xn to those of B follows from the classical CLT. We use Theorem 3.6 and prove that for any ε, η > 0 there exist n0 and δ > 0 such that

sup_{n≥n0} sup_{0≤t≤1} P[ sup_{s∈[t,t+δ]} |Xn(s) − Xn(t)| ≥ η ] ≤ δε. (5.21)

For each fixed t ∈ [0, 1], we divide the interval [t, t + δ] into m subintervals of length p = δ/m, i.e., [t + (i − 1)p, t + ip] for 1 ≤ i ≤ m. Note that

sup_{s∈[t,t+δ]} |Xn(s) − Xn(t)|
≤ max_{0≤i≤m} |Xn(t + ip) − Xn(t)| (5.22)
+ max_{0≤i≤m−1} sup_{s∈[t+ip, t+(i+1)p]} |Xn(s) − Xn(t + ip)|.
The second term on the RHS of (5.22) can be made small enough by choosing m large enough, and the first term, which is

max_{0≤i≤m} | Σ_{l=1}^i (Xn(t + lp) − Xn(t + (l − 1)p)) | =: max_{0≤i≤m} |Si|,

involves only finitely many (with p being determined) partial sums, so that it can be handled by a maximal inequality for partial sums such as Lemma 3.8. Treatments of the two terms on the RHS of (5.22) need the following identities:

E[I(ξi ≤ s) − I(ξi ≤ t)]^2 = E[I(ξi ≤ s) − I(ξi ≤ t)]^4 = s + t − 2(s ∧ t) = |s − t|. (5.23)
This gives for the first term

E(Sj − Si)^4 = E[Xn(t + jp) − Xn(t + ip)]^4
= E( (1/√n) Σ_{k=1}^n [I(ξk ≤ t + jp) − I(ξk ≤ t + ip) − (j − i)p] )^4
≤ C(j − i)^2 p^2

for some constant C > 0, as long as n > p^{-1} = m/δ and δ < 1. By applying Lemma 3.8 with γ = 4, α = 2 and uj ≡ √C p, we get

P( max_{0≤i≤m} |Xn(t + ip) − Xn(t)| ≥ λ ) ≤ C1 λ^{-4} m^2 p^2 (5.24)

for an absolute constant C1 > 0. Plugging in λ = η/2 gives

P( max_{0≤i≤m} |Xn(t + ip) − Xn(t)| ≥ η/2 ) ≤ 16 C1 η^{-4} m^2 p^2. (5.25)
To treat the second term on the RHS of (5.22), we observe that for all s ∈ [t + ip, t + (i + 1)p]

Xn(s) − Xn(t + ip) ≤ √n [Fn(t + (i + 1)p) − Fn(t + ip)]
≤ |Xn(t + (i + 1)p) − Xn(t + ip)| + √n p,

since Fn is a non-decreasing function. Also, we have

Xn(s) − Xn(t + ip) ≥ −√n p ≥ −|Xn(t + (i + 1)p) − Xn(t + ip)| − √n p.

Thus

sup_{s∈[t+ip, t+(i+1)p]} |Xn(s) − Xn(t + ip)| ≤ |Xn(t + (i + 1)p) − Xn(t + ip)| + √n p. (5.26)
Take m large so that √n δ/m = √n p < η/4 and δ/m > n^{-1}. Then, by (5.24) and (5.26),

P( max_{0≤i≤m−1} sup_{s∈[t+ip, t+(i+1)p]} |Xn(s) − Xn(t + ip)| ≥ η/2 ) (5.27)
≤ P( max_{0≤i≤m−1} |Xn(t + (i + 1)p) − Xn(t + ip)| ≥ η/4 )
≤ 2 P( max_{0≤i≤m} |Xn(t + ip) − Xn(t)| ≥ η/8 )
≤ C2 η^{-4} m^2 p^2

for some 0 < C2 < ∞. Note that there always exists an m such that √n δ/m ≤ η/4 and δ/m ≥ n^{-1} if n > (4/η)^2. From (5.22), (5.25) and (5.27) we get

sup_{0≤t≤1} P[ sup_{s∈[t,t+δ]} |Xn(s) − Xn(t)| ≥ η ] ≤ C3 η^{-4} δ^2

for some constant C3 > 0. Taking δ ≤ η^4 ε/C3 gives (5.21).
5.2 Empirical processes in D(−∞,∞)
Theorem 5.1 can be extended to a more general case where ξj are iid with distribution function F . Define

Xn(t) = √n (Fn(t) − F (t)). (5.28)

The function space involved in this case is D(−∞,∞), the space of all cadlag functions defined on the whole real line (−∞,∞). Also, we need to extend the definition of the Skorohod metric to D(−∞,∞). The weak limit of the empirical process Xn in this case is the F -Brownian bridge B(F (·)).
Definition. Let Λ be the class of all strictly increasing and continuous mappings λ of R onto itself. The Skorohod metric for D(−∞,∞) is defined by

dS(x, y) = inf_{λ∈Λ} max{ sup_{−∞<t<∞} |λ(t) − t|, sup_{−∞<t<∞} |x(t) − y(λ(t))| }.

With a slight abuse of notation, we continue to denote by dS the Skorohod metric for D(−∞,∞).
Theorem 5.2. The empirical process defined at (5.28) for iid random
variables ξj with a common distribution function F converges weakly to
B(F (·)).
Proof of Theorem 5.2. Define the quantile function of F by F^{-1}(t) = inf{s : t ≤ F (s)}. Then, we know

F^{-1}(t) ≤ s if and only if t ≤ F (s), and thus F^{-1}(υ) ∼ F

for a uniformly distributed υ on [0, 1]. Thus, we may represent ξi = F^{-1}(υi) with iid uniformly distributed υi on [0, 1]. By Theorem 5.1, the empirical process Yn defined by

Yn(t) = (1/√n) Σ_{i=1}^n [I(υi ≤ t) − t]

converges weakly to B. For the empirical process Xn defined at (5.28),

Xn(t) = (1/√n) Σ_{i=1}^n [I(ξi ≤ t) − F (t)]
=d (1/√n) Σ_{i=1}^n [I(F^{-1}(υi) ≤ t) − F (t)]
= (1/√n) Σ_{i=1}^n [I(υi ≤ F (t)) − F (t)] = Yn(F (t)).
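The two facts used above, F^{-1}(t) ≤ s ⇔ t ≤ F (s) and F^{-1}(υ) ∼ F, can be checked numerically. A sketch with F taken to be the Exponential(1) distribution function (my own choice of example; the helper names are hypothetical):

```python
import math
import random

def F(t):
    """Exponential(1) distribution function, used here as an example of F."""
    return 1.0 - math.exp(-t) if t > 0 else 0.0

def F_inv(u):
    """Quantile function F^{-1}(u) = inf{s : u <= F(s)} for Exponential(1)."""
    return -math.log(1.0 - u)

rng = random.Random(42)
us = [rng.random() for _ in range(5000)]

# Pointwise identity behind the proof: I(F^{-1}(u) <= t) = I(u <= F(t)).
t = 1.3
assert all((F_inv(u) <= t) == (u <= F(t)) for u in us)

# Hence xi_i = F^{-1}(u_i) has distribution F: the empirical frequency of
# {xi_i <= t} matches F(t) up to sampling error.
freq = sum(F_inv(u) <= t for u in us) / len(us)
print(abs(freq - F(t)) < 0.05)   # True
```

The same two lines of reasoning are exactly what turns Xn(t) into Yn(F (t)) in the display above.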
Define a map ψ : (D[0, 1], dS) → (D(−∞,∞), dS) by (ψx)(t) = x(F (t)). If we prove ψ is continuous on C[0, 1], then by the continuous mapping theorem (Theorem 2.2) and by the fact that P (B ∈ C[0, 1]c) = 0 we have

Xn = ψYn →d ψB = B(F (·)),

concluding the proof of the theorem. To prove ψ is continuous at all x ∈ C[0, 1], let xn be a sequence of elements in (D[0, 1], dS) such that dS(xn, x) → 0 as n → ∞ for some x ∈ C[0, 1]. Then, dU (xn, x) → 0. One can prove this by using the inequalities in the first half of the proof of Theorem 4.4 and the fact that all x ∈ C[0, 1] are uniformly continuous on [0, 1]. This implies

dU (ψxn, ψx) = sup_{−∞<t<∞} |xn(F (t)) − x(F (t))| ≤ sup_{t∈[0,1]} |xn(t) − x(t)| → 0.

Thus, dS(ψxn, ψx) → 0.
5.3 Empirical processes indexed by G
Let ξj be iid random variables taking values in R with distribution function F . Let G be a collection of measurable functions g in L2(F ) = {g : ∫ g^2 dF < ∞}. Define

Xn(g) = √n ∫ g d(Fn − F ) = (1/√n) Σ_{i=1}^n [g(ξi) − Eg(ξ1)]. (5.29)

The process Xn = {Xn(g) : g ∈ G} is called the empirical process indexed by G. Note that taking G = {I(−∞,t] : t ∈ R} gives the empirical process at (5.28).
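A minimal sketch of (5.29) (the function names and the choice of class are mine), confirming that an indicator g = I(−∞,t] reproduces the classical empirical process:

```python
import math
import random

def indexed_empirical_process(sample, g, mean_g):
    """X_n(g) = n^{-1/2} * sum_i [g(xi_i) - E g(xi_1)], as in (5.29).

    `mean_g` must be the exact mean E g(xi_1) under F (known analytically here)."""
    n = len(sample)
    return sum(g(x) - mean_g for x in sample) / math.sqrt(n)

rng = random.Random(7)
n = 1_000
sample = [rng.random() for _ in range(n)]            # iid Uniform[0, 1], so F(t) = t

# Taking g = I(-inf, t] recovers the classical process (5.28):
t = 0.4
g_t = lambda x: 1.0 if x <= t else 0.0
classical = math.sqrt(n) * (sum(x <= t for x in sample) / n - t)
indexed = indexed_empirical_process(sample, g_t, mean_g=t)   # E g(xi_1) = F(t) = t
print(abs(classical - indexed) < 1e-9)   # True
```

Other choices of g in L2(F ), such as polynomials, give the other coordinates of the same indexed process.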
We assume that Xn is a map from a probability space (Ω,F , P ) to the space ℓ∞(G) of all bounded real-valued functions on G, equipped with the uniform metric

dU (x, y) = sup_{g∈G} |x(g) − y(g)|.

This is a restriction imposed on G. For example, if the class G is enveloped by a square-integrable function G, i.e., if |g(x)| ≤ G(x) for all x and for all g ∈ G, then Xn takes values in ℓ∞(G).
As we have seen before, Xn is not Borel-measurable when G = {I(−∞,t] : t ∈ R}. This can happen with other G. If Xn is not measurable, then the statement “Ef(Xn) → Ef(X) for any bounded and uniformly continuous real-valued function f” does not make sense. To accommodate this situation in general, we extend the definition of weak convergence to a sequence of arbitrary maps Xn : Ω → ℓ∞(G).
Definition. For an arbitrary map Y : Ω → ℓ∞(G), define the outer expectation E∗ by

E∗f(Y ) = inf{E(U) : U is measurable, U ≥ f(Y ) and E(U) exists}.
Definition. For a sequence of arbitrary maps Xn and for a Borel-measurable X, we say Xn converges weakly to X, and we write Xn →d X, if E∗f(Xn) → Ef(X) for any bounded and continuous real-valued function f : ℓ∞(G) → R.
Definition. Let BF be a Gaussian process indexed by G such that EBF (g) = 0 and

cov(BF (g1), BF (g2)) = ∫ g1 g2 dF − ( ∫ g1 dF )( ∫ g2 dF ).

The process BF is called the F -Brownian bridge.
Definition. The class G is called F -Donsker if the process Xn defined at (5.29) converges weakly to BF .
The following theorem gives a set of sufficient conditions for weak convergence of the empirical process Xn indexed by G. Define P ∗ by P ∗(A) = E∗IA.
Theorem 5.3. If G equipped with the L2(F )-metric ‖ · ‖ is totally bounded and if for any ε > 0

lim_{δ→0} lim sup_{n→∞} P ∗( sup_{‖g1−g2‖≤δ} |Xn(g1) − Xn(g2)| ≥ ε ) = 0,

then G is F -Donsker.
For a proof of this theorem, see
Dudley, R. M. (1984). A Course on Empirical Processes. Springer-Verlag, New
York.
Definition. Let G be a class of functions in L2(F ). The δ-entropy of G, denoted by H(δ,G, F ), equals the logarithm of the smallest number of L2(F )-balls of radius δ whose union covers G.
Definition. Let G be a class of functions in L2(F ). A bracket [gL, gU ] is the
set of all functions g in G such that gL ≤ g ≤ gU . A δ-bracket is a bracket
[gL, gU ] such that ‖gU − gL‖ ≤ δ. The δ-entropy with bracketing of G,
denoted by HB(δ,G, F ), equals the logarithm of the smallest number of
δ-brackets that cover G.
Note. If H(δ,G, F ) <∞ for any δ > 0, then G is totally bounded. Also, it
holds that H(δ,G, F ) ≤ HB(δ,G, F ).
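For a concrete feel for these entropies, consider G = {I(−∞,t] : t ∈ [0, 1]} under F = Uniform[0, 1]: cutting [0, 1] at ti = iδ^2 yields brackets [I(−∞,ti−1], I(−∞,ti]] with ‖gU − gL‖ = √(ti − ti−1) = δ, so about 1/δ^2 brackets suffice and HB(δ,G, F ) grows like 2 log(1/δ). A stdlib-only sketch of this bound (the helper names are my own):

```python
import math

def n_brackets_indicators(delta):
    """Number of delta-brackets (L2(Uniform[0,1]) norm) covering
    G = {I(-inf, t] : t in [0, 1]}: cutting [0, 1] at t_i = i * delta**2
    gives brackets [I(-inf, t_{i-1}], I(-inf, t_i]] with
    ||g_U - g_L|| = sqrt(t_i - t_{i-1}) <= delta."""
    return math.ceil(1.0 / delta ** 2)

def entropy_integral(deltas):
    """Riemann sum for int_0^1 H_B(u)^{1/2} du with H_B(u) = log N_B(u)."""
    total, prev = 0.0, 0.0
    for d in sorted(deltas):
        total += math.sqrt(math.log(n_brackets_indicators(d))) * (d - prev)
        prev = d
    return total

# H_B(delta) is only logarithmic in 1/delta, so sqrt(H_B) is integrable near 0;
# by Theorem 5.4 below, the indicator class is therefore F-Donsker, consistent
# with the direct argument of Theorem 5.2.
grid = [i / 1000 for i in range(1, 1001)]
print(entropy_integral(grid))
```

The finite value of this Riemann sum mirrors the convergence of the entropy integral in Theorem 5.4.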
Theorem 5.4 (Theorem 6.3, van de Geer, 2000). If

∫_0^1 H_B^{1/2}(u,G, F ) du < ∞,

then G is F -Donsker.
Final Remark. So far we have considered the case where ξj are random variables taking values in R. All discussions remain valid for iid measurable maps ξj : Ω → X , where (X ,A) is a measurable space.
Suggested References.
Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.
van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge University Press.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.