1 Sample spaces and sigma-algebraspeople.math.gatech.edu/~mdamron6/teaching/Fall... · Proposition...

$: 1 Sample spaces and sigma-algebraspeople.math.gatech.edu/~mdamron6/teaching/Fall... · Proposition 1.6. If C is a collection of sigma-algebras of ⌦ then \ ⌃2C⌃ is a sigma-algebra$
Lecture 1

We should start with administrative stu↵ from the syllabus. Instructor, grader, solicita-tion for o�ce hour times, text, homework due on Wednesday in class. Grade breakdown.

1 Sample spaces and sigma-algebras

Throughout the course we want to keep the following simple example in mind: suppose weflip a coin three times. The possible outcomes are

HHH, HHT, HTH, HTT, THH, THT, TTH, TTT .

These form the points of our sample space ⌦.

Definition 1.1. A sample space ⌦ is any set. The elements are called outcomes.

We will be interested in analyzing probabilities of events. What is an event? Someexamples are

the event that at least two heads come up ,

the event that an odd number of heads come up .

We can write these events simply as subsets of ⌦. In our example, they are respectively{HHT, HTH, THH, HHH} and {HTT, THT, TTH, HHH}. So we could call an eventany subset of ⌦. But this is in fact not the most general (or useful) thing we can do. Theevents will be subsets whose probabilities are defined. We would like to be able to operatein situations in which we cannot find the probabilities of all events. For example, in ourexperiment, we may not be privy to all of the information. It may be that the second cointoss is not shown to us, and in this case we would not like events to involve the second toss.

Regardless we can agree on certain conditions that events should satisfy.

1. Events should be subsets of ⌦.

2. The set ⌦ should be an event. (This is the event that something happened.)

3. If A is an event, Ac is an event. In other words, if we define the probability that A

occurs, we should define the probability that A does not occur.

4. If A and B are events, then A [B is an event.

So given this motivation, we start with a definition.

Definition 1.2. A field (or algebra) of subsets of ⌦ is a collection ⌃ of subsets of ⌦ satisfyingthe following conditions.

1. ⌦ 2 ⌃.

2. If A 2 ⌃ then Ac 2 ⌃.

3. If A,B 2 ⌃ then A [ B 2 ⌃.

We will actually want a bit more, because we would like to have as many events aspossible. So we will allow countable unions.

Definition 1.3. A sigma-field (or sigma-algebra) ⌃ of subsets of ⌦ is a field (or algebra)that is closed under countable unions. That is, if A1, A2, . . . 2 ⌃ then [1

n=1An 2 ⌃.

Why don’t we just go further and allow uncountable unions? Well, our unions here willbe closely tied to sums of probabilities. And we know that uncountable sums are not verynice. They are not defined unless they only have countably many nonzero terms.

Note some consequences of the definition.

1. ; 2 ⌃. This follows from 1 and 2.

2. If A,B 2 ⌃ then A \ B 2 ⌃.

Proof. If A,B 2 ⌃ then Ac, B

c 2 ⌃. Thus

A \B = (Ac [Bc)c 2 ⌃

by 2 and 3.

3. If ⌃ is a sigma-field then A1, A2, . . . 2 ⌃ implies that \nAn 2 ⌃.

There are in fact algebras that are not sigma-algebras. Let ⌦ be an infinite set and define

A = {A ⇢ ⌦ : A is finite or Ac is finite} .

You can check this is an algebra. However it cannot be an sigma-algebra. To see this, let{!1,!2, . . .} be a countable subset of ⌦ and define Ai = {!i}. Each Ai is a finite set so it isin A. However [iAi is infinite, so it is not in A.

Examples.

1. If ⌦ is any set, then the power set P(⌦) is a sigma algebra.

2. Let ⌦ be the half-open interval (0, 1] and define ⌃0 as finite pairwise disjoint unions ofhalf-open intervals (a, b], a b.

⌃0 = {(a1, b1] [ (a2, b2] [ · · · [ (an, bn] : n 2 N and (ai, bi] \ (aj, bj] = ; for i 6= j} .

Note that any set in ⌃0 can be written as (a1, b1] [ · · · [ (an, bn] with the additionalconstraint that a1 b1 · · · an bn.

Proposition 1.4. ⌃0 is an algebra, but not a sigma-algebra.

2

Proof. First of all, ⌦ = (0, 1] is such a union – of just one half-open interval. Nextsuppose that A 2 ⌃0 and write it as above as

A = (a1, b1] [ · · · [ (an, bn], a1 b1 · · · an bn .

Then Ac is simply

Ac = (0, a1] [ (b1, a2] [ · · · [ (bn, 1] ,

another set of the form in ⌃0, so Ac 2 ⌃0. Last we must check that ⌃0 is closed

under finite unions. For this, first consider two sets A,A0 2 ⌃0 and write them as

A = (a1, b1] [ · · · [ (an, bn] and A0 = (a01, b

01] [ · · · (a0m, b0m]. Then by the distributive

laws for sets,

A \ A0 = ([n

i=1(ai, bi]) \�[m

j=1(a0j, b

0j]�

= [ni=1

⇥(ai, bi] \

⇥[m

j=1(a0j, b

0j]⇤⇤

= [ni=1 [m

j=1

�(ai, bi] \ (a0j, b

0j]�.

Each set in the double union is either empty or a half-open interval. Furthermore theseintervals are disjoint: if i, j and i

0, j

0 are two pairs that are not equal, assume withoutloss in generality that i 6= i

0. Then✓(ai, bi] \ (a0j, b

0j]

◆\✓(ai0 , bi0 ] \ (aj0 , bj0 ]

◆⇢ (ai, bi] \ (ai0 , bi0 ] = ; .

So this shows A \ A0 2 ⌃0. Using this, for A,A0 2 ⌃0,

A [ A0 = (Ac \ (A0)c)c 2 ⌃0 .

By induction, if A1, . . . , An 2 ⌃0, we also have A1[ · · ·[An 2 ⌃0, proving ⌃0 is a field.

To show it is not a sigma-field, we argue by contradiction. Each set (1/2� 1/n, 1/2] isin ⌃0 so if it were a sigma-field, we would have

{1/2} = \1n=1(1/2� 1/n, 1/2] 2 ⌃0 .

However this singleton is not a finite union of half-open intervals, a contradiction.

In practice we will not actually give a complete description of all elements of a sigma-algebra. We will simply specify some sets which we would like to be events and allow allof the other events come in due to the definitions. That is, we will define most of oursigma-algebras via some generating sets.

Definition 1.5. If A is a collection of subsets of ⌦ we define the sigma-algebra generatedby A, written �(A), to be the intersection of all sigma-algebras of ⌦ containing all elementsof A.

3

Why does this procedure define a sigma-algebra? Given A, there is at least one sigma-algebra containing its elements: the power set of ⌦ (this is the collection of all subsets of⌦), written P(⌦). That means that the collection

CA = {⌃ : ⌃ is a sigma-algebra containing A}

is non-empty. Furthermore:

Proposition 1.6. If C is a collection of sigma-algebras of ⌦ then \⌃2C⌃ is a sigma-algebraof ⌦.

Proof. Write I for the intersection. Then ⌦ is an element of each sigma-algebra in C, so itis in the intersection. If A 2 I then A 2 ⌃ for all ⌃ 2 C, so A

c 2 ⌃ for all such ⌃, implyingA

c 2 I. Last if A1, A2, . . . are in I they are is all ⌃’s in C, so since the ⌃’s are closed undercountable union, [nAn is in all such ⌃’s, meaning it is in I.

Examples.

1. The sigma-algebra generated by {;} is {;,⌦}.

2. The sigma-algebra generated by open sets of Rd is called the Borel sigma-algebra. Anelements of it is called a Borel set.

3. If A ⇢ B then �(A) ⇢ �(B).

4. If ⌃ is a sigma-algebra then �(⌃) = ⌃.

5. �(A) is the smallest sigma-algebra containing A; that is, if ⌃ is another sigma-algebracontaining A then �(A) ⇢ ⌃.

6. Let ⌦1,⌦2 be sample spaces with ⌃1,⌃2 sigma-algebras respectively. For any A 2 ⌃1,define the cylinder

A⇥ ⌦2 = {(!1,!2) : !1 2 A} .

Then the product sigma-algebra is defined as the one generated by cylinders of bothsets ⌦1 and ⌦2:

⌃1 ⇥ ⌃2 := �({A⇥ ⌦2 : A 2 ⌃1} [ {⌦1 ⇥ B : B 2 ⌃2}) .

This can be extended to any finite number of spaces. We can even take an infinitenumber of them and this will be important later.

2 Probability measures

Given a sample space and a sigma-algebra, we want to assign probabilities to the events inthe sigma-algebra. We would like the probabilities to satisfy some simple rules.

Definition 2.1. A function P : ⌃ ! R is called a probability measure if the following hold.

4

1. P(A) � 0 for all A 2 ⌃.

2. P(⌦) = 1.

3. (Countable additivity) If A1, A2, . . . 2 ⌃ is a sequence of pairwise disjoint events then

P ([1n=1An) =

1X

n=1

P(An) .

When thinking about the axioms, I usually visualize probability as some measure ofvolume or area of a region in the sample space. All areas should be non-negative and thearea of a disjoint union should just be the sum of the areas of the constituent pieces. Thiscovers 1 and 3. Of course we want number 2 since the probability of something occurring is1 (100 percent).

We can now deduce the following properties.

1. If A 2 ⌃, P(Ac) = 1� P(A).

Proof. A and Ac must be in ⌃ and they are disjoint. Therefore items 2 and 3 give

1 = P(⌦) = P(A [ Ac) = P(A) + P(Ac) .

2. (Inclusion-Exclusion) If A,B 2 ⌃,

P(A [ B) = P(A) + P(B)� P(A \ B) .

Proof. Since A,B 2 ⌃, the sets (a) A \B = A\Bc, (b) B \A = B \A

c and (c) A\B

are in ⌃. They are pairwise disjoint and their union is A [B, so

P(A [ B) = P(A \B) + P(B \ A) + P(A \ B)

=

✓P(A \B) + P(A \ B)

◆+

✓P(B \ A) + P(A \ B)

◆� P(A \B) .

On the other hand, a similar argument gives

P(A) = P(A \B) + P(A \ B)

andP(B) = P(B \ A) + P(A \B) .

Putting these together, we find the statement.

5

Lecture 3

We have now shown that the collection of finite disjoint unions of half-open intervals isan algebra. The next step is to show that � has the properties of a probability measure, buton ⌃0.

Proposition 0.1. � is a probability measure when restricted to ⌃0:

1. �((0, 1]) = 1.

2. �(A) � 0 for all A 2 ⌃0.

3. If A1, A2, . . . are disjoint elements of ⌃0 with A = [1n=1An also in ⌃0,

� ([1n=1An) =

1X

n=1

�(An) .

Proof. 1 and 2 are clear. For 3, suppose that

A = [ji=1Ii and An = [jn

k=1I(n)k

are representations in terms of half-open intervals. Then because

Ii = [1n=1

h[jn

k=1

hIi \ I(n)k

ii

is a disjoint union of half-open intervals, we can use part 3 of the theorem to get

�(A) =jX

i=1

|Ii| =jX

i=1

" 1X

n=1

jnX

k=1

|Ii \ I(n)k |#=

1X

n=1

jnX

k=1

"jX

i=1

|Ii \ I(n)k |#

=1X

n=1

jnX

k=1

|I(n)k | =1X

n=1

�(An) .

1 Caratheodory extension

We are now in the following situation. We have a space ⌦ with a field ⌃0 and a probabilitymeasure P on ⌃0. We would be very happy if ⌃0 were a sigma algebra, but it is not, so ourplan is to generate one and extend P to it. We will prove:

Theorem 1.1. If P is a probability measure on an algebra ⌃0 then there exists a probabilitymeasure on �(⌃0) which, when restricted to ⌃0, is P.

1.1 Outer measure

Because elements of �(⌃0) are in a sense limits of those in ⌃0, we will define P using ap-proximations. In other words, given A ⇢ ⌦, we will define the probability of A as limitsof probabilities of events in ⌃0. Due to monotonicity properties of P, it makes sense touse monotone limits, and the standard way to do this is to approximate A in a decreas-ing fashion from the outside. One possibility would be to define the probability of A asinf{P(B) : A ⇢ B and B 2 ⌃0}. However, if we do this, you can see that in the Lebesguemeasure case, the probability of the rationals will be 1. This is bad, so we allow countableunions of elements in ⌃0.

Definition 1.2. If P is a probability measure on an algebra ⌃0 then for each A ⇢ ⌦, theouter measure of A is

P⇤(A) = inf

( 1X

k=1

P(Bk) : A ⇢ [1k=1Bk and Bk 2 ⌃0 for all k

).

Note that outer measure is defined for all sets. It almost satisfies the properties of aprobability measure; it fails countable additivity, while being countably sub additive.

Proposition 1.3. The outer measure P⇤ satisfies the following.

1. P⇤(;) = 0.

2. If A ⇢ B then P⇤(A) P⇤(B).

3. If A1, A2, . . . ⇢ ⌦ then P⇤([1n=1An)

P1n=1 P⇤(An).

4. If A 2 ⌃0 then P⇤(A) = P(A).

Proof. Since ; 2 ⌃0, we can use it to cover itself. This gives P⇤(;) P(;) = 0. Furthermore,the infimum of the definition of P⇤ is over non-negative numbers, so P⇤(;) � 0. This shows1. For 2, any covering of B covers A, so the infimum in the definition of P⇤(B) is over asmaller set than that in the definition of P⇤(A).

In 3, we use countably many coverings and our beloved epsilon approximation. Bydefinition of infimum, for each n we may find a collection B(n)

1 , B(n)2 , . . . of sets in ⌃0 that

cover An such that

P⇤(An) �1X

i=1

P(B(n)i )� ✏/2n .

The set [1n=1An is thus contained in [1

n=1

h[1

i=1B(n)i

i, so since these sets are in ⌃0,

P⇤([1n=1An)

1X

n=1

1X

i=1

P(B(n)i )

1X

n=1

[P⇤(An) + ✏/2n] =1X

n=1

P⇤(An) + ✏ .

Taking ✏ ! 0 finishes the proof.

2

Item 4 follows from the definition. First A is a cover of A, so P⇤(A) P(A). On the otherhand, since P is a probability measure on the algebra ⌃0 the arguments we gave in last lectureshow that it is countably sub additive. Therefore if A ⇢ [kAk we have P(A)

Pk P(Ak).

Taking infimum over such covers gives also P(A) P⇤(A).

The fact that we can only prove countable subadditivity shows that we are not wise tohope that P⇤ be a probability measure on the whole power set. We will have to restrict ourclass of sets to be successful. For any sets A,B in class we should at least have

P⇤(A [B) = P⇤(A) + P⇤(B) when A \ B = ; .

This is not really a condition on one set, but two. So to make it a condition on one set, weallow B to be arbitrary and write A [B as a general set E.

Definition 1.4. A set A ⇢ ⌦ is said to be P⇤-measurable if

P⇤(E) = P⇤(A \ E) + P⇤(Ac \ E) for all E ⇢ ⌦ .

The class of such sets will be denoted by M.

Note that by subadditivity, this condition is equivalent to P⇤(E) � P⇤(A\E) + P⇤(Ac \E).It will turn out that M solves all of our problems: it is a sigma-algebra, contains ⌃0,

and P⇤ is a probability measure on M. Let’s prove these one by one. Because M will thencontain �(⌃0) we will have proved the theorem.

2 Proof of extension theorem

Claim 2.1. M contains ⌃0.

Proof. Let A 2 ⌃0 and E ⇢ ⌦. Given ✏ > 0 choose sets A1, . . . 2 ⌃0 that cover E and suchthat

P⇤(E) �1X

k=1

P(Ak)� ✏ . (1)

Then Ak \ A 2 ⌃0 and E \ A ⇢ [k(Ak \ A), so

1X

k=1

P(Ak \ A) � P⇤(E \ A) .

Similarly,1X

k=1

P(Ak \ Ac) � P⇤(E \ Ac) .

Adding these together and using that P is a probability measure on ⌃0 we get1X

k=1

P(Ak) � P⇤(E \ A) + P⇤(E \ Ac) .

Combine this with (1) and take ✏ ! 0.

3

We next need to show an extension of the definition of M. This will imply countableadditivity.

Lemma 2.2. If A1, A2, . . . 2 M are pairwise disjoint then for all E ⇢ ⌦,

P⇤ (E \ [1n=1An) =

1X

n=1

P⇤(E \ An) .

With E = ⌦, P⇤ is countably additive on M.

Proof. When there is one Ai, both sides are the same. For two sets, in the definition of M,use E \ (A1 [ A1) in place of E and A2 in place of A to get

P⇤(E \ (A1 [ A2)) = P⇤(E \ (A1 [ A2) \ A2) + P⇤(E \ (A1 [ A2) \ Ac2)

= P⇤(E \ A2) + P⇤(E \ A1) ,

where for the last term we used that A1 and A2 are disjoint.If we have N � 3 sets, use induction, along with putting E \ (A1 [ · · · [ AN) for E and

AN for A in the definition of M:

P⇤(E \ [Nn=1An) = P⇤(E \ [N�1

n=1 An) + P⇤(E \ AN) =NX

n=1

P⇤(E \ An) .

As we have done before, in the infinite case, use monotonicity with the finite case:

P⇤(E \ [1n=1An) � P⇤(E \ [N

n=1An) =NX

n=1

P⇤(E \ An) .

Take N ! 1 to get the lower bound. The upper bound follows from countable subadditivity.

Now we can prove our claims.

Claim 2.3. M is a sigma algebra, so P⇤ is a probability measure on M.

Proof. First ⌦ 2 M since for any E

P⇤(E) = P⇤(E \ ⌦) + P⇤(;) ,

and the last term is P⇤(E \ ⌦c). Next if A 2 M then since the definition is symmetric,Ac 2 M. Next let us just show that M is a field and take A,B 2 M. Then for all E,

P⇤(E) = P⇤(E \ A) + P⇤(E \ Ac) = P⇤(E \ A \B) + P⇤(E \ A \ Bc)

+ P⇤(E \ Ac \B) + P⇤(E \ Ac \Bc) .

By subadditivity, the right side is bounded below by P⇤(E \ (A \ B)) + P⇤(E \ (A \ B)c),showing A \ B 2 M and that it is a field.

4

To prove it is a sigma field, we start by taking A1, A2, . . . 2 M pairwise disjoint. Thenfor each N , as [N

n=1An 2 M,

P⇤(E) = P⇤(E \ ([Nn=1An)) + P⇤(E \ ([N

n=1An)c)) .

For the first term, apply the lemma, and for the second apply monotonicity:

P⇤(E) �NX

n=1

P⇤(E \ An) + P⇤(E \ ([1n=1An)

c) .

Now take N ! 1 and use the lemma again to get

P⇤(E) � P⇤(E \ ([1n=1An)) + P⇤(E \ ([1

n=1An)c) ,

or the union is in M.Now that M is a field closed under countable disjoint unions, we can argue it is closed

under unions. If A1, A2, . . . are in M define B1 = A1 and Bn = An \ An�1 for n � 2. TheBn’s are disjoint and in M so their union, which equals the union of the An’s, is in M.

5

Lecture 4

1 Uniqueness of the extension

We begin with the uniqueness part of measure construction.

Theorem 1.1. If P1 and P2 are probability measures on �(⌃0) that agree on ⌃0 then theyare equal.

Let C be the collection of sets on which P1 and P2 agree:

C = {A 2 �(⌃0) : P1(A) = P2(A)} .

Two facts about C are immediate:

1. C contains ⌃0, by assumption.

2. C is closed under monotone limits. That is, if A1, A2, . . . 2 C are nested decreasing (ornested increasing), their intersection (or their union) is in C.

Proof. By the consequences of countable additivity (see notes from the second lecture),

P1 (\nAn) = limn!1

P1(An) = limn!1

P2(An) = P2 (\nAn) .

The second point above says C is a monotone class.

Definition 1.2. Let ⌦ be a sample space.

1. A collection of subsets of ⌦ is a monotone class if it is closed under monotone limits.

2. If A is a collection of subsets of ⌦ we denote by m(A) the monotone class generatedby A. This is the intersection of all monotone classes containing the elements of A.

It is straightforward to prove that the monotone class generated by a collection is, indeed,a monotone class. To prove the uniqueness theorem, we show:

Theorem 1.3 (Monotone class theorem). If ⌃0 is a field then �(⌃0) = m(⌃0).

This will prove the uniqueness theorem because �(⌃0) = m(⌃0) ⇢ C.

Proof. First, any sigma algebra is a monotone class, since it is closed under countable unionsand intersections (and therefore monotone ones). So to form m(⌃0) we are intersecting morecollections that we are when we form �(⌃0) and thus m(⌃0) ⇢ �(⌃0).

Now we show that because ⌃0 is a field, m(⌃0) is a sigma algebra. This will implythat �(⌃0) ⇢ m(⌃0) and complete the proof. Because m(⌃0) contains ⌃0 and ⌦ 2 ⌃0, we

have ⌦ 2 m(⌃0). Next the collection A1 = {Ac : A 2 m(⌃0)} is also a monotone class:for example if A1 ⇢ A2 ⇢ · · · satisfy Ac

1, Ac2, . . . 2 m(⌃0) then by virtue of m(⌃0) being

a monotone class, \nAcn 2 m(⌃0). This means that [nAn = (\nAc

n)c 2 A1. A similar

argument shows that A1 is closed under nested decreasing limits. Since A1 also contains⌃0, m(⌃0) ⇢ A1 and so each A 2 m(⌃0) is the complement of something in m(⌃0); that is,Ac 2 m(⌃0).

Next we must show m(⌃0) is closed under unions. Let

A2 = {A ⇢ ⌦ : A \B 2 m(⌃0) for all B 2 ⌃0} .

This is also a monotone class: if A1 ⇢ A2 ⇢ · · · are elements of A2 then for any B 2 m(⌃0),

([1n=1An) \B = [1

n=1 (An \ B)

is a union of a nested increasing sequence. Further since An 2 A2 for all n, we haveAn \ B 2 m(⌃0), so since this is a monotone class, ([1

n=1An) \ B 2 m(⌃0). A similarargument shows that A2 is closed under nested decreasing limits and proves that it is amonotone class. Again, because A2 contains ⌃0, this means m(⌃0) ⇢ A2. Therefore ifA 2 m(⌃0) then A \ B 2 m(⌃0) for all B 2 ⌃0.

Last define A3 = {A ⇢ ⌦ : A[B 2 m(⌃0) for all B 2 m(⌃0)}. This is again a monotoneclass and by the last paragraph, it contains ⌃0, so m(⌃0) ⇢ A3. In other words, if A 2 m(⌃0)then A \B 2 m(⌃0) for all B 2 m(⌃0) and so m(⌃0) is at least a field.

To show countable unions, let A1, A2, . . . 2 m(⌃0). Since it is a field, for each N ,[N

n=1An 2 m(⌃0). As N increases, the unions are nested increasing so by m(⌃0) being amonotone class, it contains the union.

2 Basic probability

We will now turn to the study of events and their probabilities. We will take a probabilityspace (⌦,⌃,P).

2.1 Limit sets and the first Borel-Cantelli lemma

Much of the study of probability is involved with limiting behavior. The law of large numbersand the central limit theorem, both of which we will learn, are examples of this. So it is ofinterest to understand when an infinite number of events A1, A2, . . . occur. There are variouslimiting objects associated to these events.

Definition 2.1. If A1, A2, . . . 2 ⌃ then the limit supremum and limit infimum are definedby

lim supn!1

An = \1n=1 [1

i=n Ai

lim infn!1

An = [1n=1 \1

i=n Ai .

2

There are a couple of ways to see these sets. From an analytic point of view, given anarbitrary sequence of sets, it is di�cult to define a limiting set. It is not di�cult, of course,when these sets are nested (we can either take intersection or union). The operations abovegive a way to artificially nest the Ai’s. For instance, given A1, A2, . . ., the sets Bn = [1

i=nAi

form a nested decreasing sequence and Cn = \1i=nAi form a nested increasing sequence. In

this way we can product limits.The probabilistic way to interpret the sets is related to infinite occurrence. For example,

lim supn!1

An = {! 2 ⌦ : ! 2 An for infinitely many n}

= {infinitely many An occur}

lim infn!1

An = {! 2: ! 2 An for all large n}

= {An occurs for all large n} .

Note thatlim supn!1

An � lim infn!1

An .

There is a simple relation between their probabilities.

Theorem 2.2. If A1, A2, . . . 2 ⌃ then

P⇣lim infn!1

An

⌘ lim inf

n!1P(An) lim sup

n!1P(An) P

✓lim supn!1

An

◆.

Proof. The middle relation is obvious. For the first, set Cn = \1i=nAi. Then

P(Cn) P(Ak) for all k � n .

Thus we can take liminf on the right for

P(Cn) lim infn!1

P(An) .

Now (Cn) is a nested increasing sequence with union lim infn!1 An, so taking the limit onthe left and using a consequence of countable additivity gives the first statement.

We can retrieve the last inequality from the first by taking complements:

P✓lim supn!1

An

◆= 1� P

✓✓lim supn!1

An

◆c◆.

By DeMorgan, the limsup on the right equals (\1n=1 [1

i=n Ai)c = [1

n=1 \1i=1 A

ci , so

P(✓lim supn!1

An

◆= 1� P

⇣lim infn!1

Acn

⌘� 1� lim inf

n!1P(An) = lim sup

n!1P(An) .

3

Lecture 5

We can actually give an upper bound for the right side of the last theorem.

Theorem 0.1 (First Borel-Cantelli lemma). IfP

P(An) < 1 then P(lim supn!1 An) = 0.

Proof. For each N ,

P✓lim supn!1

An

◆ P([1

n=NAn) 1X

n=N

P(An) .

Now take the limit in N and since the sum converges, the right side goes to 0.

Example. Consider Lebesgue measure P = � on (0, 1] with the Borel sigma-algebra. Wecan split the interval into two equal-sized ones : (0, 1/2] and (1/2, 1]. Then we split theseinto two and so on. Next give a point ! in (0, 1] an “address” d1(!), d2(!), . . ., where thedi’s depend on ! and are all either 0 or 1. Informally, we assign d1(!) = 0 or 1 dependingon whether ! falls in the left or right “level-1” subinterval: (0, 1/2] or (1/2, 1]. We then setd2(!) = 0 or 1 depending on which level-2 subinterval of the level-1 interval it falls. Formally,we can define level-n intervals

I(n)1 , . . . , I

(n)2n by I

(n)k =

✓k � 1

2n,k

2n

�

and define dn(!) = 0 if ! 2 I(n)j with j odd and 1 otherwise. You will show in the homework

that the dn’s correspond to fair coin flips in the sense that for a given ! 2 (0, 1], the sequence

(d1(!), d2(!), . . .)

is just an infinite sequence of 0’s and 1’s, where 0’s correspond to tails and 1’s correspondto heads (or vice-versa). Given this space, it makes sense to talk about the event

An = {! : dn(!) = 1} = {n-th flip is heads} .

Note that P(An) = 1/2 and so the first theorem above says

P(the n-th flip is heads for all large n) lim infn!1

P(An) = 1/2 .

(In fact it is zero, as we shall see soon.) Furthermore,

P(there are infinitely many heads) � lim supn!1

P(An) = 1/2 .

Again this actually has probability 1, as we will see.Note that being in the very bottom level-n interval (that is, I(n)1 ) corresponds to having

all tails up to n. The length of this interval is 2�n, so we have

P(Bn) = 2�n where Bn = (\ni=1A

ci) .

By the first Borel-Cantelli lemma,

P(all flips are tails) = P(infinitely many Bn’s occur) = 0 ,

sinceP

n P(Bn) =P

n 2�n

< 1.

0.1 Independence

Definition 0.2. A,B 2 ⌃ are independent if the probability factors: P(A\B) = P(A)P(B).A finite collection A1, . . . , An 2 ⌃ is independent if the probability factors for any sub collec-tion:

P(Ak1 \ · · · \ Akm) = P(Ak1) · · ·P(Akm) for all distinct k1, . . . , km 2 {1, . . . , n} .

A collection A of events is independent if each finite sub collection is independent.

In the case of two sets, we can restate this in terms of conditional probability:

Definition 0.3. If A,B 2 ⌃ with P(B) > 0 then the conditional probability

P(A | B) =P(A \B)

P(B).

For a fixed B, the function A 7! P(A | B) is a probability measure.

ThereforeA,B independent , P(A | B) = P(A) .

Intuitively this means that information that B occurred does not change the probability thatA occurred. You can see this pictorially. Interpreting probabilities as areas, this says thatthe fraction of B that A takes up is the same as the fraction that A takes up in the wholesample space.

It is not true that pairwise independence implies independence. For instance, taking oursample space to be sequences of coin flips of length 3 (that is (x1, x2, x3) = HHH, HHT,

and so on) with each sequence of probability 1/8, define the three events

A1 = {x1 = x2}, A2 = {x2 = x3} and A3 = {x1 = x3} .

Then A1 and A2 are independent as are A1 and A3, and as are A2 and A3. For example,

P(A1 \ A2) = P(x1 = x2 = x3) = 1/4

andP(A1) = 1/2, P(A2) = 1/2 .

However the events A1, A2, A3 are not independent:

P(A1 \ A2 \ A3) = P(x1 = x2 = x3) = 1/4 ,

whereas P(A1)P(A2)P(A3) = 1/8.

Definition 0.4. Collections A1,A2, . . . ,An of elements of ⌃ are independent if each choiceof A1 2 A1, . . . , An 2 An is independent.

It is useful to know when we can deduce independence of events from independence ofothers.

2

Theorem 0.5. If A1, . . . ,An are collections which are independent and are each closed underfinite intersections (that is, they are ⇡-systems). Then �(A1), . . . , �(An) are independent.

Proof. Given A1, . . . , Ak 2 ⌃, the collection of events A such that A1, . . . , Ak, A is indepen-dent form a �-system. In other words, they are a collection A such that

• ⌦ 2 A,

• if A 2 A then Ac 2 A and

• if B1, B2, . . . 2 A are pairwise disjoint then [nBn 2 A.

To prove this, the first is obvious. For the second,

P(A1 \ · · · \ Ak \ Ac) = P(A1 \ · · · \ Ak)� P(A1 \ · · · \ Ak \ A)

= P(A1) · · ·P(Ak)� P(A1) · · ·P(Ak)P(Ak)

= P(A1) · · ·P(Ak)P(Ac) .

Last, if B1, B2, . . . 2 A are pairwise disjoint then

P(A1 \ · · · \ Ak \ [nBn) = P ([n (A1 \ · · · \ Ak \Bn))

is a disjoint union, so equals

X

n

P(A1 \ · · · \ Ak \Bn) =X

n

P(A1) · · ·P(Ak)P(Bn)

= P(A1) · · ·P(Ak)P([nBn) .

Note that intersections of �-systems are �-systems, so the set of events A for whichA2, . . . ,An, A is independent is a �-system. However it contains the ⇡-system A1 and theDynkin ⇡-� theorem implies then it contains �(A1). Now we repeat this argument with�(A1),A3, . . . ,An in place ofA2, . . . ,An to see that �(A1), �(A2),A3, . . . ,An is independent.Continuing, we find the result of the theorem.

One consequence is the famous:

Theorem 0.6 (Kolmogorov zero-one law). Let A1, A2, . . . be a sequence of independentevents and define the tail sigma-field by

T = \1n=1�({An, An+1, . . .}) .

If A 2 T then P(A) = 0 or 1.

Proof. Because theAi’s are independent, for each n, the set of finite intersections of {A1, . . . , An}is independent of the set of finite intersections of {An+1, . . .}. By the previous theorem,

�({A1, . . . , An}) and �({An+1, . . .}) are independent .

3

Since T is contained in �({An+1, . . .}) for all n, we find

�({A1, . . . , An}) and T are independent for all n .

However [n�({A1, . . . , An}) is a ⇡-system so this means

�({A1, A2, . . . }) and T are independent .

As T is a sub-field of the field on the left, it is independent of itself. Therefore if A 2 T ,

P(A) = P(A \ A) = P(A)2 ,

giving P(A) = 0 or 1.

What are some examples of tail events? First of all, lim supn An and lim infn An are since,for example,

\n [1i=n Ai = \n=N [1

i=n Ai 2 �({AN , AN+1, . . .}) for all N .

We will see many more when we talk about convergence of random variables.

4

Lecture 6

Let’s begin with a simple example given by a student in class.Example. If A1,A2, . . . are independent then, setting ⇡i to be the ⇡-system generated byAi, the sequence ⇡1, ⇡2, . . . need not be independent. From class, we gave the example ofthe space ⌦ = {(x1, x2, x3) : xi = H or T for all i} with the sigma algebra ⌃ = P(⌦) and Puniformly distributed (that is, each sample point has probability 1/8). Then set

A = {x1 = x2}, B = {x2 = x3}, C = {x1 = x3} .

We saw that these are pairwise independent but not independent. So take

A1 = {A,B}, A2 = {C} ,

so that these are independent collections. The ⇡-systems ⇡(A1), ⇡(A2), however, are notindependent. For example, the first one contains A \ B and the second C. If they wereindependent we would get

1/8 = P(A \B)P(C) = P(A \ B \ C) = 1/4 ,

a contradiction.

1 Borel-Cantelli II

We can now give

Theorem 1.1 (Second Borel-Cantelli lemma). If (An) is a sequence of independent eventswith

Pn P(An) = 1 then P(An occurs infinitely often) = 1.

Proof. For each N1 N2,

P(AcN1

\ · · · \ AcN2) =

N2Y

i=N1

P(Aci) =

N2Y

i=N1

(1� P(Ai)) = exp

N2X

i=N1

log(1� P(Ai))

!.

Now use the inequality log(1�x) �x, which is a consequence of concavity of x 7! log(1�x),to get the upper bound

P(AcN1

\ · · · \ AcN2) exp

�

N2X

i=N1

P(Ai)

!.

Taking N2 to infinity, the sum above diverges, so

P(\1i=N1

Aci) exp

�

1X

i=N1

P(Ai)

!= 0 for all N1 � 1 .

By now we can use countable subadditivity for

P(lim infi!1

Aci) = P(Ac

i occurs for all large i) 1X

n=1

P(\1i=nA

ci) 0 .

In other words P(lim supi!1 Ai) = 1 and we are done.

We will use both parts of Borel-Cantelli over and over. But for now let’s have a couple ofsimple examples.

Examples.

1. In a sequence of fair coin flips, we saw before that P(there are infinitely many heads) �1/2. By the second Borel-Cantelli lemma, we can strengthen this to probability 1. LetAn be the event that the n-th flip is heads. Then P(An) = 1/2 and so

Pn P(An) = 1.

Since the An’s are independent, we find P(An occurs for infinitely many n) = 1.

2. Using Lebesgue measure on the unit interval, let An be the event (0, 1/n). Theseare not independent but

Pn P(An) = 1. On the other hand, lim supn An = ;, so

P(An occurs infinitely often) = 0. This shows that independence is essential. Theproblem here of course is the dependence between the An’s.

2 Random variables

When we perform our experiment, we obtain some sample point ! in our sample space ⌦.Many times, there is a quantity (or quantities) associated to our sample point. For instance,! may be a position of a dart thrown at a dartboard, and the quantity may be the scoreassociated to that position. Or ! may be the state of a physical system and this state mayhave an energy. Regardless, we are many times interested in functions f : ⌦ ! R. To speakabout probability with any sense regarding these functions, it should be that certain setsassociated to f are measurable; that is, they are in our sigma-algebra ⌃. For example, wewould like to be able to calculate P(! : f(!) 2 [1, 2]). For this to make sense, we had betterhave {f(!) 2 [1, 2]} an element of ⌃. This motivates the following definition.

Definition 2.1. Let (⌦1,⌃1) and (⌦2,⌃2) be measure spaces (sample spaces with sigma-algebras). A function f : ⌦1 ! ⌦2 is measurable if the inverse image f

�1(B) 2 ⌃1 for allB 2 ⌃2.

We usually focus on the case that the second space is Rn with the Borel sigma-algebra(with the standard metric).

Definition 2.2. Let (⌦,⌃) be a measure space. X : ⌦ ! R is a random variable if X�1(B) 2⌃ for all Borel sets B ⇢ R.

• It su�ces to show that X�1((a, b)) 2 ⌃ for all a < b to deduce that X is a random

variable. (Prove this!)

2

• We say that X : ⌦ ! R is a simple function if it has a finite range; that is, it assumesfinitely many values. In this case,

if X is simple then X�1(B) 2 ⌃ for all Borel B ⇢ R

,X�1(A) 2 ⌃ for all A ⇢ R

,X�1({x}) 2 ⌃ for all x 2 R .

Proof. The second statement implies the first, which implies the third, so we need onlyshow that the third implies the second. So assume X�1({x}) 2 ⌃ for all x 2 R and letA ⇢ R. Since X is simple, write x1, . . . , xn for its range. Then

X�1(A) =

[

n:xn2A

X�1({xn}) ,

which is a finite union of elements of ⌃, so is in ⌃.

• Simple random variables can be written using indicator functions. For A ⇢ ⌦, write

1A(!) =

(1 ! 2 A

0 ! /2 A.

Then if X is a simple function with values x1, . . . , xn, we can set Ai = {! : X(!) = xi}to find X =

Pni=1 xi1Ai . Using the last characterization, every simple random variable

can be written asnX

i=1

xi1Ai for A1, . . . , An 2 ⌃ .

At times we want to consider only the information given to us by a random variable.Here we are using the term “information” loosely, as there seems to be only an intuitivelink between sigma-algebras and information. See the discussion in the subfields section ofBillingsley.

Definition 2.3. The sigma-algebra generated by a random variable X on ⌦ is defined as

�(X) = ��{X�1(B) : B ⇢ R is a Borel set}

�.

The sigma-algebra generated by the collection of random variables X1, X2, . . . is

�(X1, X2, . . .) = ��{X�1

i (B) : B ⇢ R is a Borel set and i � 1}�.

• Again, we can show

�(X) = ��{X�1((a, b)) : a < b 2 R}

�.

3

• If X1, . . . , Xn are random variables, then define the vector Y = (X1, . . . , Xn). We canshow that

if Xi is simple for all i then �(X1, . . . , Xn) = {Y �1(A) : A ⇢ Rn} .

(If these are not simple, we would need A to be a Borel set.)

Proof. First we show the collection on the right (call it C) is a sigma-algebra. TakingA = Rn, we see that ⌦ = Y

�1(A) 2 C. Next if B 2 C then B = Y�1(A) for some

A ⇢ Rn, so Bc = Y

�1(Ac) 2 C. If B1, B2, . . . 2 C then writing Ai for a subset of Rn

such that Bi = Y�1(Ai),

[nBn = [nY�1(An) = {! : ! 2 Y

�1(An) for some n} = Y�1([nAn) 2 C .

This shows C is a sigma algebra. Furthermore, if i 2 {1, . . . , n} and Ai ⇢ R is a Borelset, set A0

i = {(x1, . . . , xn) 2 Rn : xi 2 Ai}. Then

{Xi 2 Ai} = {Y 2 A0i} 2 C ,

so C contains the generating sets of �(X1, . . . , Xn) and therefore contains �(X1, . . . , Xn).

Conversely, if A ⇢ Rn, then write R1, . . . , Rn for the ranges of X1, . . . , Xn.

Y�1(A) =

[

(y1,...,yn)2(R1⇥···⇥Rn)\A

Y�1({(y1, . . . , yn)}) .

This is a finite union of sets Y �1({(y1, . . . , yn)}), and this equals \ni=1X

�1i ({yi}), a finite

intersection of generating sets for �(X1, . . . , Xn). Therefore Y�1(A) 2 �(X1, . . . , Xn)

and we have shown the other containment.

4

Lecture 7

1 Independence

There is not so much to say about independence, except that it is similar for events. We will

use it a lot, but the definitions are pretty simple.

Definition 1.1. Random variables X1, . . . , Xn on (⌦,⌃) are independent if for all choicesof 1 k1 < · · · < km n and Borel sets B1, . . . , Bm ⇢ R,

P(Xk1 2 B1, . . . , Xkm 2 Bm) = P(X1 2 B1) · · ·P(Xkm 2 Bm) .

A sequence X1, X2, . . . is independent if each finite sub collection is.

Again, in the case of simple random variables, we can reduce that Borel sets to singletons:

if X1, . . . , Xn are simple random variables, they are independent if and only if for all choices

of x1, . . . , xm 2 R and 1 k1 < · · · < km n,

P(Xk1 = x1, . . . , Xkm = xm) = P(Xk1 = x1) · · ·P(Xkm = xm) .

There are similar theorems for random variables. The proofs are roughly the same:

Theorem 1.2. If X1, X2, . . . is a sequence of independent random variables, then the collec-tions �(X1), �(X2), . . . are independent. Defining the tail-field

T = \1n=1�(Xn, Xn+1, . . .) ,

then A 2 T implies P(A) = 0 or 1 when X1, X2, . . . is an independent sequence.

How do we know that there exists a sequence of independent random variables? A

theorem in the book constructs any sequence of independent simple random variables. You

will modify this to get an infinite sequence of independent Bernoulli random variables; that

is, given p 2 [0, 1], an independent sequence X1, X2, . . . such that

P(Xi = 1) = p and P(Xi = 0) = 1� p for all i .

2 Expectation

Expectation will be defined for random variables to be a kind of average relative to a proba-

bility measure. Here we will only develop the theory, as Billingsley does, for simple random

variables. However as we do so, we will need some results from convergence of random

variables, which we will give for general random variables.

If X is a simple random variable which is just the indicator of an event:

X(!) = 1A(!) for some A 2 ⌃ ,

then we define EX = P(A). We would like expectation to extend linearly to all simple

random variables, so we define

Definition 2.1. If X =Pn

i=1 xi1Ai for Ai 2 ⌃ then we define the expectation of X as

EX =

nX

i=1

xiP(Ai) .

Note that this corresponds with our intuitive notion of average. For example, if X takes

values x1, . . . , xn with equal probability (probability 1/n), then each Ai has probability 1/nand therefore the expectation is just (x1 + · · ·+ xn)/n.

Because X is simple, and we have made the above definition, we can rewrite it, noting

that Ai = {Xi = xi}, as

EX =

nX

i=1

xiP(X = xi) .

This is the standard definition in discrete (finite) probability.

Expectation has various nice properties.

Theorem 2.2 (Properties of expected value). Let X, Y be simple random variables on(⌦,⌃,P) and a, b 2 R. Then

1. (Linearity) E(aX + bY ) = aEX + bEY .

2. If X(!) Y (!) for all ! thenEX EY .

3. (Jensen’s inequality) If f : [a, b] ! R is convex with range(X) ⇢ [a, b] then

f(EX) Ef(X) .

In particular, |EX| E|X| and

|EX � EY | E|X � Y | .

Proof. For linearity, first consider X +Y . Write x1, . . . , xn for the values attained by X and

y1, . . . , ym for those attained by Y . Then X + Y attains the values in the set {xi + yj : 1 i n, 1 j m}. Therefore

E(X + Y ) =

X

z

zP(X + Y = z) =X

z

z

2

4X

i,j:xi+yj=z

P(X = xi, Y = yj)

3

5 .

Here we have used that {X + Y = z} is a disjoint union over i and j such that xi + yj = zof the events {X = xi, Y = yj}. By reordering the sum, this is

X

i,j

2

4X

z:xi+yj=z

(xi + yj)P(X = xi, Y = yj)

3

5 =

X

i,j

(xi + yj)P(X = xi, Y = yj) ,

2

and by splitting the sum we get

X

i

xi

"X

j

P(X = xi, Y = yj)

#+

X

j

yj

"X

i

P(X = xi, Y = yj)

#,

which is EX + EY .

Now if a 2 R, the range of aX is simply {ax1, . . . , axn} and we have

X

z

zP(aX = z) =X

i

axiP(X = xi) = aEX .

This together with additivity proves linearity.

If X Y then set Z = Y �X. Now Z � 0 and so

E[Y �X] = EZ =

X

z

zP(Z = z) � 0 ,

giving EX EY .

If f : [a, b] ! R is convex with range(X) ⇢ [a, b], then note that

EX =

X

i

xipi X

i

bpi = b

and similarly EX � a. Therefore f(EX) is defined and by convexity we can find an a�ne

function A(x) = L(x) + c, where L is linear such that A(EX) = f(E(X)) but f(y) � A(y)for all y 2 [a, b]. Now A(X) is a simple function so the last part gives

Ef(X) � EA(X) =

X

i

(L(xi) + c)pi = L(X

i

xipi) + c = A(EX) = f(EX) .

3

Lecture 8

Last we want to formalize the idea that if we know all information about the variablesX1, . . . , Xn then we know all information about another variable, say Y .

Definition 0.1. We say that Y : ⌦ ! R is measurable relative to a sigma-field F ⇢ ⌃ if

Y �1(B) 2 F for all Borel B ⇢ R. Y is measurable relative to X1, . . . , Xn if it is measurable

relative to �(X1, . . . , Xn).

In the simple case, Y being measurable relative to X1, . . . , Xn means that it is a functionof these variables.

Proposition 0.2. If X1, . . . , Xn are simple, then Y : ⌦ ! R is measurable relative to

X1, . . . , Xn if and only if there is a function f : Rn ! R such that

Y (!) = f(X1(!), . . . , Xn(!)) .

Proof. If such an f exists, then Y is simple. IfB ⇢ R is a Borel set, then set A = f�1(B) ⇢ R.Now by the results of Lecture 6,

Y �1(B) = {! : (X1(!), . . . , Xn(!)) 2 A} 2 �(X1, . . . , Xn) .

Conversely suppose that Y is measurable relative to X1, . . . , Xn. Then each y 2 R hasthe property that Y �1({y}) 2 �(X1, . . . , Xn). By Lecture 6 again, there exists Ay ⇢ Rn

such that Y �1({y}) = {! : (X1(!), . . . , Xn(!)) 2 Ay}. Now the Ay’s may not be disjoint fordistinct y, but if

(X1(!), . . . , Xn(!)) 2 Ay \ Ay0

then Y (!) = y and Y (!) = y0, implying y = y0. Therefore if we define

f : Rn ! R by f =X

y2Range(Y )

y1Ay ,

then f(X1, . . . , Xn) = Y .

A consequence of this is that if A1, . . . , An are events, then Y is measurable relative to�(A1, . . . , An) if and only if Y = f(1A1 , . . . ,1An) for some f : Rn ! R.

Here are some more properties.

• If X, Y are independent then EXY = EXEY .

Proof. Write x1, . . . , xn and y1, . . . , ym for the ranges of X, Y . Arguing as in the proofof additivity (in the first equality below),

EXY =X

i,j

xiyjP(X = xi, Y = yj) =X

i,j

xiP(X = xi)yjP(Y = yj) .

Splitting the sum we getX

i

xiP(X = xi)X

j

yjP(Y = yj) = EXEY .

• We define Var X = EX2 � (EX)2. Then

– 0 Var X EX2. The first follows from Jensen applied to x2.

– Var X = E(X � EX)2.

Proof. The right side is

E(X2 � 2XEX + (EX)2) = EX2 � 2(EX)2 + (EX)2 .

– If X, Y are independent then

Var (X + Y ) = Var X +Var Y .

Proof. By independence,

E(X + Y )2 = EX2 + 2EXEY + EY 2 .

Also(E(X + Y ))2 = (EX)2 + 2EXEY + (EY )2 .

Now subtract these two.

– Var aX = a2Var X.

We will also want a number of inequalities, some related to Jensen.

Proposition 0.3 (Integral inequalities). Let X, Y be simple random variables.

1. (Markov inequality) For a > 0,

P(|X| � a) 1

aEX .

2. (Chebyshev inequality) For a > 0,

P(|X � EX| � a) 1

a2Var X .

3. (Holder inequality) For p, q � 1 with p�1 + q�1 = 1,

E|XY | kXkpkY kq ,

where, for example, kXkp = (E|X|p)1/p.

2

Proof. Markov’s inequality follows as:

E|X| =X

i

|xi|pi �X

i:|xi|�a

|xi|pi � aX

i:|xi|�a

pi = aP(|X| � a) .

For Chebyshev, we simply use X � EX and Markov:

P(|X � EX| � a) = P(|X � EX|2 � a2) 1

a2Var X .

Last, Holder can be proved using Young’s inequality: for a, b > 0 and p, q > 1 with p�1+q�1 =1,

ab ap

p+

bq

q.

This is a consequence of convexity of x 7! � log x: since p�1 + q�1 = 1,

1

p(� log)ap +

1

q(� log)bq � (� log)

✓ap

p+

bq

q

◆.

The left side, however, is � log ab and this gives Young’s inequality.Now if kXkp or kY kq is zero, then the variable X or Y must be zero with probability

one, giving EXY = 0 and Holder must be true. Otherwise they are both nonzero and weset a = |X|/kXkp and b = |Y |/kY kq in Young’s inequality:

|X||Y |kXkpkY kq

|X|p

pkXkpp+

|Y |q

qkY kqq.

If we take expected value of both sides, the right is just 1, so we get Holder:

E|XY | = E|X||Y | kXkpkY kq .

3

Lecture 9

1 Convergence

Here we should review some notions of convergence from analysis. We will state the def-

initions for general random variables, but we will only now apply them to simple random

variables.

Definition 1.1. A sequence (Xn) of random variables on (⌦,⌃,P) converge almost surely ifthe event

{! : Xn(!) ! X(!)}

has P-probability one. The variables converge in probability to X if for each ✏ > 0,

P(|Xn �X| � ✏) ! 0 as n ! 1 .

Proposition 1.2. Xn ! X almost surely if and only if for each ✏ > 0,

P(|Xn �X| � ✏ infinitely often) = 0 . (1)

Further, if Xn ! X almost surely, then Xn ! X in probability.

Proof. Suppose that Xn ! X almost surely. Then for each ! 2 {! : Xn(!) ! X(!)}, given✏ > 0, there must be N(!) such that if n � N(!) then |Xn(!) � X(!)| < ✏. This means

that for such !, |Xn �X| � ✏ only finitely often and therefore

{Xn ! X} ⇢ {|Xn �X| � ✏ i.o.}c .

Since the event on the left has probability one, so does the one on right, implying (1).

Conversely, suppose that for each ✏ > 0, (1) holds. Choose ✏k = 1/k and consider

A = [k{|Xn �X| � ✏k i.o.} .

By countable subadditivity, P(A) = 0. But if Xn(!) does not converge to X(!), we can find

✏ > 0 such that |Xn(!)�X(!)| � ✏ infinitely often. Choosing ✏k < ✏, then ! 2 {|Xn�X| �✏k i.o.}. This shows that

{Xn ! X}c ⇢ A ,

and therefore has probability zero.

Last, if Xn ! X almost surely and ✏ > 0, let An = {|Xn � X| � ✏}. The first part

implies that P(lim supn!1 An) = 0, so by our previous result,

lim supn!1

P(An) P(lim supn!1

An) = 0 .

This means P(An) ! 0 and Xn ! X in probability.

It is not true that Xn ! X in probability implies Xn ! X almost surely (do you know

an example from real analysis?) However a subsequence will converge almost surely.

Theorem 1.3. If Xn ! X in probability, there is a subsequence (Xnk) such that Xnk

! Xalmost surely.

Proof. Since Xn ! X in probability, for each k � 1,

P(|Xn �X| � 1/k) ! 0 as n ! 1 .

So choose a subsequence Xn1 , Xn2 , . . . such that for k � 1,

P(|Xnk�X| � 1/k) 2

�k .

Then by Borel Cantelli, P(|Xnk� X| � 1/k for infinitely many k) = 0. Therefore, given

✏ > 0, P(|Xnk�X| � ✏ for infinitely many k) = 0, and Xnk

! X almost surely.

2 Convergence and expectation

Just as in Real Analysis, we can exchange limits and integrals (expectation) under certain

circumstances.

Theorem 2.1 (Bounded convergence theorem). If (Xn) is a sequence of simple randomvariables. If there exists C > 0 such that

P(|Xn| C) = 1 for all n ,

then if Xn ! X almost surely for some simple X,

EXn ! EX .

Proof. Let ✏ > 0. Since Xn ! X almost surely, the convergence also occurs in probability.

So P(|Xn �X| � ✏/2) ! 0. Therefore we can pick N such that if n � N then

P(|Xn �X| � ✏/2) < ✏/(4C) .

Now write

E|Xn �X| = E|Xn �X|1|Xn�X|�✏/2 + E|Xn �X|1|Xn�X|<✏/2 .

Note that these variables are still simple random variables because Xn and X are, and the

events |Xn � X| < ✏/2 and |Xn � X| > ✏/2 are in ⌃ (since |Xn � X| is a simple random

variable). Because Xn ! X almost surely, we must also have |X| C almost surely (that

is, with probability one). Therefore the first term can be bounded by

2CP(|Xn �X| � ✏/2) ✏/2 ,

and the second can be bounded by ✏/2.

2

3 Laws of large numbers

We know from experience that if we flip a fair coin infinitely many times and record at time

n a value Xn of 1 if we have a heads and 0 otherwise, the fraction of heads, or

X1 + · · ·+Xn

n

should approach 1/2. In which senses do we actually have this limit? We start with the

weakest.

Theorem 3.1 (Weak law of large numbers). Let X1, X2, . . . be simple i.i.d. (independent,identically distributed) random variables. Identically distributed means that for each BorelB ⇢ R,

P(Xi 2 B) = P(X1 2 B) for all i .

Then, setting µ = EX1 and Sn =Pn

i=1 Xi,

Sn/n ! µ in probability .

Proof. First assume that µ = 0, so that ESn = 0 for all n. Then by Chebyshev, for any

✏ > 0,

P(|Sn| > ✏n) 1

✏2n2Var Sn .

As the Xi’s are independent, the variance sums and our bound is1

✏2n2nVar (X1), or

P(|Sn/n| > ✏) Var X1

✏2n! 0 as n ! 1 .

Therefore Sn/n ! 0 in probability and we are done.

If µ 6= 0, define Yi = Xi � µ and note that the Yi’s are also i.i.d. (check this!) So we can

apply the previous case to get (Y1 + · · ·+ Yn)/n ! 0 in probability. Thus

X1 + · · ·+Xn

n� µ ! 0 in probability ,

and we are done.

Note that we took advantage of the fact that Var Sn is of lower order than n2. This

generally indicates that the Xi’s do a good job canceling each other, due to independence. If

we drop independence, the theorem will generally be false, and we can see that manifested

sometimes in a lack of cancellation. For instance, take X1 to be any random variable and

set Xi = X1 for all i � 2. Then (X1 + · · ·+Xn)/n ! X1, which need not be constant (like

µ). Here the Xi’s are extremely dependent, and therefore there is no cancelation; that is, if

some Xi’s are above the mean µ others will not be below the mean to counter them.

We can ask for a stronger result, namely almost sure convergence. This is the content of:

3

Theorem 3.2 (Strong law of large numbers). Let X1, X2, . . . be simple i.i.d. random vari-ables. Then, setting µ = EX1 and Sn =

Pni=1 Xi,

Sn/n ! µ almost surely .

Proof. Just as before, we need only consider the case µ = 0. We will estimate as in the last

proof, but then apply Borel-Cantelli. For this we need a stronger bound, not just C/n. So

we resort to looking at higher moments than just 2: by Markov, given ✏ > 0,

P(|Sn| � n✏) = P(S4n � (n✏)4) 1

(n✏)4ES4

n .

Now we must unwind the expectation:

ES4n = E(X1 + · · ·+Xn)

4= E

nX

i1,...,i4=1

Xi1Xi2Xi3Xi4

=

nX

i1,...,i4=1

EXi1Xi2Xi3Xi4 .

If one of the indices i1, . . . , i4 is di↵erent from the others, then that term is zero. To see this,

say for example that i1 is distinct from i2, i3, i4. Then by independence,

EXi1Xi2Xi3Xi4 = EXi1EXi2Xi3Xi4 = 0 .

Therefore the only terms that appear in the sum are those where all ik’s are equal or two

pairs are equal (like i1 = i3, i2 = i4). Each term of the first type equals EX4i1 = EX4

1 and

each term of the second is of the form EX2i1X

2i2 or EX2

i1X2i3 . Both of these, by the i.i.d.

assumption, equal (EX21 )

2. Thus by counting, our bound is

ES4n = nEX4

1 + 3n(n� 1)(EX21 )

2

and

P(|Sn/n| � ✏) 1

(n✏)4⇥nEX4

1 + 3n(n� 1)(EX21 )

2⇤ C/n2 .

Because this is summable, we can apply Borel-Cantelli to get

P(|Sn/n| � ✏ i.o.) = 0 ,

or Sn/n ! 0 almost surely.

4

Lecture 10

1 Gambling systems (aka random walk)

Suppose we are gambling and at each unit time n, we either gain 1 dollar or lose 1 dollar.Represent this with random variables: for each n, our gain is either Xn = 1 or Xn = �1.We will assume that from time to time, the gain is independent, but identically distributed:

Xn =

(1 with probability p

�1 with probability q = 1� pwith (X1, X2, . . .) i.i.d. .

Assuming that we begin with a dollars, our cumulative fortune at time n is

a+ Sn where S0 = 0 and Sn =nX

i=1

Xi .

For some given c with a c, we are declared winner if our winnings reach c before theyreach 0 and loser if they reach 0 first.

Lemma 1.1. For random variables Xi defined on some space (⌦,⌃,P) as above and a � 0,the event

{! : Sn + a = c before Sn + a = 0} 2 ⌃ .

Proof. Note that Sn + a is a simple random variable, since it is a sum of simple randomvariables. Therefore, {Sn + a = c} 2 ⌃ and so

[1n=0 [{0 < S0 + a < c} \ · · · \ {0 < Sn�1 + a < c} \ {Sn + a = c}] 2 ⌃ .

This is the event that for some n, our fortune has not reached 0 by time n� 1, but reachesc at time n. In other words, we eventually win.

The probability of eventually winning can be calculated exactly. Here, define

Aa,n =⇥\n�1k=0{0 < Sk + a < c}

⇤\ {Sn + a = c} if 0 < a < c

and Aa,0 = ;, Ac,0 = ⌦ and A0,n = Ac,n = ; when n � 1 as the event that we win at time n.Then define

f(a) = P ([1n=0Aa,n) .

Theorem 1.2. The function f : {0, . . . , c} ! R satisfies f(0) = 0, f(c) = 1 and

f(a) = qf(a� 1) + pf(a+ 1) for a = 1, . . . , c� 1 .

Proof. We defineS 0n = X2 + · · ·+Xn+1 for n � 1 and S 0

0 = 0

and corresponding events

A0a,n =

⇥\n�1

k=0{0 < S 0k + a < c}

⇤\ {S 0

n + a = c} .

Now let a have 0 < a < c and note that both f(a) =P1

n=0 P(Aa,n) and

P(Aa,n) = P(Aa,n, X1 = 1) + P(Aa,n, X1 = �1) ,

Notice that Aa,n \ {X1 = 1} is exactly the event {X1 = 1} \ A0a+1,n�1. Therefore for n � 1,

P(Aa,n) = P(A0a+1,n�1, X1 = 1) + P(A0

a�1,n�1, X1 = �1) .

Now we use independence of the sigma algebras �(X1) and �(X2, . . .) (this is proven usingthe theorem on ⇡-systems we had before), we get

P(Aa,n) = pP(A0a+1,n�1) + qP(A0

a�1,n�1) .

Notice that for any x1, . . . , xn,

P(X1 = x1, . . . , Xn = xn) = P(X2 = x1, . . . , Xn+1 = xn) .

Summing this over all x1, . . . , xn such that

0 < x1 + · · ·+ xk < c for 1 k n� 1 and x1 + · · ·+ xn = c ,

we find P(Aa,n) = P(A0a,n). Therefore

P(Aa,n) = pP(Aa+1,n�1) + qP(Aa�1,n�1) .

Now sum over all n to get

f(a) = pf(a+ 1) + qf(a� 1) .

In the special case p = q = 1/2, we get

f(a) = (1/2)(f(a+ 1) + f(a� 1))

with f(0) = 0 and f(1) = 1. This means that f is linear; that is, f(a) = a/c. In the generalcase, you can also solve:

f(a) =

((q/p)a�1(q/p)c�1 if q 6= p

a/c if q = p.

2

Theorem 1.3. For any p,

P(the game eventually ends) = 1 .

Proof. Exactly the same proof as above works. Take f(a) = P(Sn + a = c or 0 eventually).Then f(c) = f(0) = 1 and f(a) = pf(a+ 1) + qf(a� 1) for all a 2 {1, . . . , c� 1}. The onlysolution for f is f(a) = 1 for all a.

From these two facts, one sees that, in the fair case (p = q = 1/2) if c is extremely large,the probability that we win is very small (a/c) for fixed a. Since the game cannot go onforever, the probability that we lose is large (1 � a/c). In the case that we play the gamewith winning ending point (that is, we do not stop when we hit c dollars, but try to win asmuch as possible), we eventually lose with probability 1. This is called the gambler’s ruin.You will give a rigorous argument for a version of it in the homework.

2 Betting strategies

Many times gamblers try to “game the system.” One way is to decide whether or not to betdepending on what outcomes have been observed so far. For instance, for each time n wecan define a function gn : Rn�1 ! {0, 1} and set

Bn = gn(X1, . . . , Xn�1) .

For instance, we could take gn = 1 if X1 = · · · = Xn�1 = �1 and 0 otherwise. In this case,we interpret the 1 as “we will bet” and 0 as “we will not bet.” Therefore for this gn, we onlybet if we have seen a long run of �1’s, with the hope that the next will be +1. Note thatfor any function gn as above, we have

Bn is measurable relative to X1, . . . , Xn�1 .

If we were allowed to, we would just set Bn = 1 whenever Xn = 1 and 0 otherwise. Thiswould correspond to betting only when a 1 comes up, but this is clearly not a function ofonly “the past” X1, . . . , Xn�1.

So we will call this a “selection system” and generally define

F0 = {;,⌦} and Fn = �(X1, . . . , Xn) ,

taking B1, . . . to be any sequence of random variables with values in {0, 1} such that

Bn is measurable relative to Fn�1 for n � 1 .

(Note this implies B1 is constant.) We will further assume that infinitely many bets canbe placed; that is, there is no time at which we decide to never bet again. This means weassume

P(Bn = 1 i.o.) = 1 .

3

Now to analyze this, we define Yn to be the outcome at the time at which we place the n-thbet. In other words, set Nn to be the time at which the n-th bet is made:

Nn = k if B1 + · · ·+Bk = n and B1 + · · ·+Bk�1 = n� 1

andYn = XNn .

To see this definition another way, for each ! 2 ⌦, we have assumed that there is definedB1(!), B2(!), . . . and X1(!), X2(!), . . .. We simply look at the n-th Bk(!) which is equalto 1 and set Yn(!) equal to the Xk(!) corresponding to this. Strictly speaking this is onlydefined when there is such a Bk(!), so we must define Yn(!) = �1 if ! 2 {Bn = 1 i.o.}c.

The main theorem is that the Yn’s are distributed just as the Xn’s were. That is, theselection system gives us no advantage whatsoever.

Theorem 2.1. The Y 0ns are simple i.i.d. random variables with P(Yn = 1) = p and with

P(Yn = �1) = q.

4

Lecture 11

Recall the setup from last time: we defined (Xn) as a sequence of i.i.d. random variableswith distribution

P(X1 = 1) = p and P(X1 = �1) = q = 1� p .

DefiningF0 = {;,⌦} and Fn = �(X1, . . . , Xn) ,

we let (Bn) to be any sequence of random variables with values in {0, 1} such that

Bn is measurable relative to Fn�1 for n � 1 .

(Note this implies B1 is constant.) Each n for which Bn = 1 represents a time at which webet; a zero is represents a time at which we abstain. We will further assume that infinitelymany bets can be placed; that is, there is no time at which we decide to never bet again.This means we assume

P(Bn = 1 i.o.) = 1 .

We then put Nn as the time at which the n-th bet is made:

Nn = k if B1 + · · ·+Bk = n and B1 + · · ·+Bk�1 = n� 1

andYn = XNn .

The main theorem was that the Yn’s are distributed just as the Xn’s were. That is, theselection system gives us no advantage whatsoever.

Theorem 0.1. The variables Yn are simple random variables which are i.i.d. with P(Yn =1) = p and P(Yn = �1) = q.

Proof. Note that{Nn k} = {B1 + · · ·+Bk � n} 2 Fk�1 ,

so {Nn = k} 2 Fk�1. However Nn is not a simple random variable. However, for x = ±1,

{Yn = x} = {XNn = x} = [1k=n [{Nn = k} \ {Xk = x}] 2 ⌃ .

Because Yn can only take values ±1, it is simple, so the above shows it is a simple randomvariable.

For x1, . . . , xn 2 {0, 1} write pi = p if xi = 1 and q otherwise. Then we would like toshow that

P(Y1 = x1, . . . , Yn = xn) = p1 · · · pn . (1)

Because x1, . . . , xn are arbitrary, this su�ces to show that the Yi’s are independent (checkthis!) Furthermore, it will also show that the Yi’s are identically distributed, since we cansum this equation over all choices of x1, . . . , xn�1 to get P(Yn = 1) = p and P(Yn = �1) = q.

We prove (1) by induction. For n = 1,

P(Y1 = x1) =1X

k=1

P(N1 = k, Xk = x1) .

However {N1 = k} 2 Fk�1, which is independent of �(Xk). So we can split into a product:

1X

k=1

P(N1 = k)P(Xk = x1) = p1

1X

k=1

P(N1 = k) = p1P([1n=1{Bn = 1}) = p1 .

Now for n > 1, write

P(Y1 = x1, . . . , Yn = xn) =X

k1<···<kn

P(Xk1 = x1, . . . , Xkn = xn, N1 = k1, . . . , Nn = kn) .

Again use independence of Fkn�1 and �(Xkn) for

pknX

k1<···<kn

P(Xk1 = x1, . . . , Xkn�1 = xn�1, N1 = k1, . . . , Nn = kn)

= pknP(Y1 = x1, . . . , Yn�1 = xn�1) .

By induction we get pk1 · · · pkn and we are done.

1 Gambling policies

In a gambling policy, we take a more general strategy. Not only do we decide when to bet,but we decide how much. Our wager at time n is Wn, which we assume again is measurablerelative to Fn�1. That is there is a function fn : Rn�1 ! R such that

Wn = fn(X1, . . . , Xn�1) � 0 .

When Wn = 0 we are not betting. So our total fortune at time n is

Fn = W1X1 + · · ·+WnXn .

Starting from an initial fortune F0 we get

EFn = EWnXn + EFn�1 .

Now Wn is measurable relative to X1, . . . , Xn�1, so is independent of Xn, so we get

EFn = EWnEXn + EFn�1

8><

>:

� EFn�1 if p � q

EFn�1 if p q

= EFn�1 if p = q = 1/2

.

2

Therefore in the fair case,EFn = · · · = EF1 = F0 .

This means that our expected winnings never change; in a sense, the game remains fair nomatter what our policy is.

A gambling policy comes with a stopping time. We will assume that we have a rule todetermine when we would like to stop betting. This rule will produce a time ⌧ after whichwe will not bet. If ⌧ = n this represents our decision to stop playing at time n (directly aftertime n), so this should only depend on X1, . . . , Xn.

Definition 1.1. A stopping time is a function ⌧ : ⌦ ! {0, 1, . . .} [ {1} such that

{⌧ = n} 2 Fn .

⌧ is an almost surely finite stopping time if P(⌧ < 1) = 1.

With the stopping time, along with our betting variables (Wn), we can define our fortuneat time n as

F ⇤n =

(Fn if n ⌧

F⌧ if n � ⌧. (2)

One way to view this gambling policy is that the stopping time forces our wagers to be 0.In other words, we can define a new wager W ⇤

n by

W ⇤n = Wn1{n⌧} .

Then this is still measurable relative to Fn�1, since {n ⌧}c = [n�1k=1{⌧ = k} 2 Fn�1.

Therefore we can recast this in the previous language:

F ⇤n = F ⇤

n�1 +W ⇤nXn .

So again, even with a stopping time in our betting strategy, we still cannot make a fair gamebe advantageous to us.

Theorem 1.2. With any uniformly bounded gambling policy (that is, |F ⇤n | M with prob-

ability 1 for some constant M) and an almost surely finite stopping time, our final fortuneF⌧ satisfies

EF⌧

8><

>:

= F0 if p = q = 1/2

F0 if p q

� F0 if p � q

.

Proof. Note that F ⇤n ! F⌧ almost surely. We will have to assume here that F⌧ is simple,

although this is not necessary, as we will see later in the theory of integration. Howeverunder this assumption, we can use the bounded convergence theorem to get

EF ⇤n ! EF⌧ as n ! 1 .

Now the result follows from (2).

3

If these assumptions are not met, we can actually break the system; that is, we canmake F⌧ > F0 when p = q for instance. To do this, bet Wn = 2n and stop when we winone round. You will show in the homework that for su�ciently large initial fortune F0,EF⌧ > F0. The problem here is that the bounded convergence theorem does not apply: F ⇤

n

will be unbounded.

2 Markov Chains

We will just give a definition of a Markov Chain (with countable state space).

Definition 2.1. A sequence (Xn) of random variables with values in a countable set S is aMarkov Chain if for all x1, . . . , xn 2 S,

P(Xn = xn | X1 = x1, . . . , Xn�1 = xn�1) = P(Xn = xn | Xn�1 = xn�1)

whenever P(X1 = x1, . . . , Xn�1 = xn�1) > 0.

Here when we say that Xn is a random variable with a value in S we mean (since S iscountable) that for each x 2 S, the set X�1

n ({x}) is in ⌃.The intuition behind the definition follows if we think of the value Xn as a position of a

particle at time n. Then the Markov property says that to predict where the particle willbe at time n, we only need to know its position at time n� 1. The particle’s history beforethat does not give any more information.

4

Lecture 12

1 Markov Chains

Recall the definition of a Markov Chain (with countable state space). Here I will change thedefinition slightly to include an initial variable X0.

Definition 1.1. A sequence (Xn)1n=0 of random variables with values in a countable set S(the state space) is a Markov Chain if for all n � 1 and x0, . . . , xn 2 S,

P(Xn = xn | X0 = x0, . . . , Xn�1 = xn�1) = P(Xn = xn | Xn�1 = xn�1)

whenever P(X0 = x0, . . . , Xn�1 = xn�1) > 0.

From the above definition, we see that fundamental quantities in the theory of Markovchains are the transition probabilities:

Definition 1.2. The Markov chain is called time-homogeneous if

P(Xn+1 = j | Xn = i) = P(X1 = j | X0 = i) for all i, j 2 S and n � 1 .

In this case we define the transition probability

pi,j = P(X1 = j | X0 = i) .

We will from now on assume that (Xn) is time-homogeneous.

We will need a couple of basic lemmas. They concern conditional probabilities. Recallwe defined P(A | B) = P(A \B)/P(B) when P(B) > 0. We can extend this definition to allcases by writing P(A | B) = 0 when P(A\B) = 0 (this includes the case P(B) = 0). We willhave to modify this definition later when we deal with general random variables, though.

Lemma 1.3. Let A,B,C be events.

1. If A1, A2, . . . are disjoint events,

P([nAn | B) =X

n

P(An | B) .

2.P(C \ A | B) = P(C | A \ B)P(A | B) .

3. If P(B) > 0 thenP(Ac | B) = 1� P(A | B) .

Proof. In the first item, if P(B) = 0 then both sides are equal. Otherwise, if the An’s arepairwise disjoint, so are B \ An. Therefore

P([nAn \ B) =X

n

P(An \B) .

This implies 1.For the second, if P(A \B \ C) = 0 then both sides are equal. Otherwise, just multiply

it out: the right side isP(A \B \ C)

P(A \ B)· P(A \B)

P(B).

Last, in 3, P(A \ B) = P(B)� P(B \ Ac) and this implies the result.

The next lemma is an extension of the Markov property. It says that if we know anyinformation about X1, . . . , Xn�1 (this is the event A below) and Xn = i, then only the eventXn = i matters.

Lemma 1.4. If n � 1 and A 2 �(X1, . . . , Xn�1) has P(A,Xn = i) > 0,

P(Xn+1 = j | A,Xn = i) = P(Xn+1 = j | Xn = i) for all i, j 2 S .

Proof. You will prove this on the homework.

From the assumption of time-homogeneity, we also have a time-homogeneity of n-th steptransition probabilities.

• For n � 1 and i, j 2 S, set

pn(i, j) = P(Xn = j | X0 = i) .

Then P(Xm+n = j | Xm = i) = pn(i, j).

Proof. The proof is by induction. This holds for all m when n = 1. Assume it holdsfor all m and some n. Then for n+ 1,

P(Xm+n+1 = j | Xm = i) =X

k2S

P(Xm+n+1 = j | Xm = i, Xm+n = k)P(Xm+n = k | Xm = i) .

(Here the sum is only over k such that P(Xm+n = k,Xm = i) > 0.) By induction, thelast term is pn(i, k). For the first term, we use the Markov property in the previouslemma and induction:

P(Xm+n+1 = j | Xm = i, Xm+n = k) = pk,j .

So we getP(Xm+n+1 = j | Xm = i) =

X

k2S

pn(i, k)pk,j .

2

The above equation only depends on m on the left, so we can set m = 0 to also get

P(Xn+1 = j | X0 = i) =X

k2S

pn(i, k)pk,j .

• (Chapman-Kolmogorov) For m,n � 0 and i, j 2 S,

pm+n(i, j) =X

k2S

pm(i, k)pn(k, j) .

Proof. Just as before, we split the space according to Xm and use the Markov property:

pm+n(i, j) =X

k2S

P(Xm+n = j | X0 = i, Xm = k)P(Xm = k | X0 = i)

=X

k2S

P(Xm+n = j | Xm = k)P(Xm = k | X0 = i) .

• Define the transition matrix P with entries

Pi,j = pi,j .

Then the 2-step transition probabilities can be obtained by Chapman-Kolmogorov:

p2(i, j) =X

k

pi,kpk,j = (P 2)i,j ,

so the 2-step matrix is just the square of P . Generally speaking, Chapman-Kolmogorovimplies that

pn(i, j) = (P n)i,j for n � 1 .

• Generally a Markov chain is specified by its transition matrix (Pi,j) and initial prob-abilities. That is, if (Xn) is a Markov chain, it has a transition matrix, and we set(↵i)i2S to be the vector

↵i = P(X0 = i) .

In this notation, we have

P(Xn = j) =X

i2S

P(X0 = i)P(Xn = j | X0 = i) =X

i

↵iPni,j = (~↵ · P )1,j ,

so ~↵ · P n is the vector of probabilities at time n.

We can go the other way; that is, start with a transition matrix and initial vector. Thatis the content of Theorem 8.1 in Billingsley. We will not prove it here, but on the homeworkyou will read it and modify it.

3

Theorem 1.5. Suppose that P is a stochastic matrix (in other words, Pi,j � 0 andP

j Pi,j =1 for all i, j 2 S) and ↵ is a nonnegative vector satisfying

Pi ↵i = 1. Then there exists on

some probability space a sequence of random variables X0, X1, . . . which is a Markov chainwith transition matrix P and initial vector ↵.

Because of the existence theorem we often think of Markov chains by thinking abouttheir transition probabilities. Often times a diagram is useful.Examples.

1. If our state space S is just two points a and b, with transition probabilities pa,a =0, pa,b = 1, pb,b = 1/2, pb,a = 1/2 then our transition matrix is

P =

✓0 11/2 1/2

◆.

In this case, our particle jumps from a to b always, and then from b to either a or bwith probability 1/2. In a more extreme example, take

P =

✓1 01/2 1/2

◆.

In this case, the particle stays at a if it is already there, and jumps from b to either aor b with probability 1/2. Here a is an absorbing state.

2. Simple symmetric random walk on Zd. In this example our state space is

S = Zd = {(x1, . . . , xd) : xi 2 Z for all i} .

Two states ~x and ~y in S are called nearest neighbors if k~x� ~yk1 = 1, where

k(a1, . . . , ad)k1 =dX

i=1

|ai| .

The transition probabilities are assigned as

Pi,j =

(1/(2d) if i, j are nearest neighbors

0 otherwise.

3. Random walk with absorbing barriers. Also known as our gambling system. For a givenc 2 N with c > 0, the state space is

S = {0, . . . , c}

and transition probabilities are

pi,j =

8>>><

>>>:

p if j = i+ 1 and i = 1, . . . , c� 1

q = 1� p if j = i� 1 and i = 1, . . . , c� 1

1 if i = j = 0 or i = j = c

0 otherwise

.

Here again 0 and c are absorbing states. That is, they transition only to themselves.

4

2 Transience and recurrence

For now, given a state i 2 S we abbreviate

Pi(Xn = j) = P(Xn = j | X0 = i) .

We are interested in how often the chain visits particular states.

Definition 2.1. A state i 2 S is called

1. recurrent if Pi(Xn = i for some n � 1) = 1 and

2. transient otherwise.

There is a a criterion that often can be simply applied to determine transience or recur-rence of states.

Theorem 2.2. Let i 2 S be a state. Then

i is recurrent ,1X

n=1

Pi(Xn = i) = 1 , Pi(Xn = i i.o.) = 1 .

Proof. IfP

n Pi(Xn = i) < 1 then Pi(Xn = i i.o.) = 0 by Borel-Cantelli, proving that 3implies 2.

Define f (n)i,j as the probability that the first visit to state j from state i occurs at time n:

f (n)i,j = Pi(X1 6= j, . . . , Xn�1 6= j,Xn = j) .

Further, define fi,j as the probability that we ever visit state j starting from i:

fi,j =1X

n=1

f (n)i,j .

To be continued next time.

5

Lecture 13

Recall that we are in the midst of proving:

Theorem 0.1. Let i 2 S be a state. Then

i is recurrent ,1X

n=1

Pi(Xn = i) = 1 , Pi(Xn = i i.o.) = 1 .

Proof. We showed 3 implies 2. Further we defined f (n)i,j as the probability that the first visit

to state j from state i occurs at time n:

f (n)i,j = Pi(X1 6= j, . . . , Xn�1 6= j,Xn = j) .

Further, define fi,j as the probability that we ever visit state j starting from i:

fi,j =1X

n=1

f (n)i,j .

For times n1 < n2 we can now calculate the probability that the first two visits to state

j from state i occur at times n1 and n2:

Pi(X1 6= j, . . . , Xn1�1 6= j,Xn1 = j,Xn1+1 6= j, . . . , Xn2�1 6= j,Xn2 = j)

= Pi(X1 6= j, . . . , Xn1�1 6= j,Xn1 = j)

⇥ P(Xn1+1 6= j, . . . , Xn2�1 6= j,Xn2 = j | X0 = i, X1 6= j, . . . , Xn1�1 6= j,Xn1 = j)

= f (n1)i,j P(Xn1+1 6= j, . . . , Xn2�1 6= j,Xn2 = i | Xn1 = j)

= f (n1)i,j f (n2�n1)

j,j .

Therefore

Pi(at least two visits to j) =X

n1<n2

Pi(first two visits at times n1, n2) =

1X

n1=1

1X

n2=n1

f (n1)i,j f (n2�n1)

j,j

and this sums to fi,jfj,j.Generally, a similar argument gives

Pi(at least k visits to j) = fi,jfk�1j,j . (1)

Taking k ! 1 for i = j,

Pi(Xn = i i.o.) =

(1 if fi,i = 1

0 if fi,i < 1. (2)

This proves that 1 and 3 are equivalent.

Last we must show that 2 implies 3, so assume thatP1

n=1 Pi(Xn = i) = 1. Then

NX

n=1

pn(i, i) =NX

n=1

nX

m=1

Pi(X1 6= i, . . . , Xm�1 6= j,Xm = i, Xn = i) =NX

n=1

nX

m=1

f (m)i,i pn�m(i, i)

=

NX

m=1

f (m)i,i

NX

n=m

pn�m(i, i)

fi,i

NX

n=0

pn(i, i) .

Therefore

(1� fi,i)NX

n=1

pn(i, i) p0(i, i) = 1 ,

andNX

n=1

pn(i, i) (1� fi,i)�1 .

The left side converges to 1, so fi,i = 1 and by (2), Pi(Xn = i i.o.) = 1.

There is a simpler argument thatP1

n=1 pn(i, i) = 1 implies that i is recurrent, but it

uses expectation of non-simple random variables (but it is intuitively more clear). Here it

is: we can write

1X

n=1

pn(i, i) =1X

n=1

Pi(Xn = i) = Ei

1X

n=1

1{Xn=i} = Ei#{n � 1 : Xn = i} .

Call the number on the right (the number of n � 1 such that Xn = 1) by the name Ni. Then

we have already computed that

Pi(Ni � k) = fki,i ,

so we can compute the expectation using the same formula we have been using (since Ni

takes only countably many values it is ok)

ENi =

1X

k=0

kPi(Ni = k) =1X

k=0

Pi(Ni = k)1X

m=1

1{mk} =

1X

m=1

1X

k=m

Pi(Ni = k)

=

1X

m=1

Pi(Ni � m)

=

1X

m=1

fmi,i .

This is finite if and only if fi,i < 1, which we showed is equivalent to i being transient.

ThereforeP1

n=1 pn(i, i) < 1 if and only if i is transient.

In the simplest case, recurrence or transience of one state determines that of all others.

2

Definition 0.2. A Markov chain (Xn) is irreducible if for each i, j 2 S, there exists n suchthat Pi(Xn = j) > 0.

Theorem 0.3. For an irreducible chain, exactly one of the following two must hold.

1. All states are transient,P

n pn(i, j) < 1 for all i, j 2 S and

Pi(Xn = j i.o. for some j) = 0 .

2. All states are recurrent,P

n pn(i, j) = 1 for all i, j 2 S and

Pi(Xn = j i.o. for all j) = 1 .

Proof. First suppose that there is some recurrent state, say i 2 S. ThenP

n pn(i, i) = 1.

If j is another state, then by irreducibility, choose n1, n2 such that pn1(j, i) and pn2(i, j) arepositive. Then for n � n1 + n2, by Chapman-Kolmogorov,

pn(j, j) =X

k

pn1(j, k)pn�n1(k, j) =X

k,l

pn1(j, k)pn�n1�n2(k, l)pn2(l, j)

� pn1(j, i)pn2(i, j)pn�n1�n2(i, i) .

Summing over all n � n1 + n2, the right side is infinity, so the left is as well. This givesPn pn(j, j) = 1 for all j 2 S. Last, if i, j 2 S,

pn(i, j) = Pi(Xn = j, Xm = i i.o.) X

m�n

Pi(Xn = j,Xn+1 6= i, . . . , Xm�1 6= i, Xm = i)

=

X

m�n

pn(i, j)f(m�n)j,i = pn(i, j)fj,i .

Choosing n such that pn(i, j) > 0, we find fj,i = 1. By our previous result (1),

Pj(Xn = i i.o.) = limk!1

fj,ifk�1i,i = 1 .

We can then intersect over all i to get Pj(Xn = i i.o. for all i) = 1.

Last, for any i, j, choose n1 such that pn1(i, j) > 0. Now for all n � 1,

1X

n=1

pn1+n(i, j) � pn1(i, j)1X

n=1

pn(j, j) = 1 .

We will finish this proof next time. For now: one example.

Random walk again. For simple symmetric random walk in Zd, you saw in the homework

that for d = 1,P

n pn(1, 0) = 1. Since this random walk is irreducible, this means that each

state is recurrent.

• For d = 2, one can show that p2n(0, 0) behaves like c/n for large n. ThereforePn pn(0, 0) �

Pn p2n(0, 0) = 1 and all states are recurrent.

• For general d, one can show that p2n(0, 0) behaves like c/nd/2. Therefore for d � 3,

simple symmetric random walk is transient.

3

Lecture 14

We are finishing the proof from last time of the result:

Theorem 0.1. For an irreducible chain, exactly one of the following two must hold.

1. All states are transient,P

n pn(i, j) < 1 for all i, j 2 S and

Pi(Xn = j i.o. for some j) = 0 .

2. All states are recurrent,P

n pn(i, j) = 1 for all i, j 2 S and

Pi(Xn = j i.o. for all j) = 1 .

Proof. We showed that if there is a recurrent state, then all parts of 2 hold.

Suppose instead there is a transient state, say i. ThenP

n pn(i, i) < 1 and the above

calculation (from the recurrent case) gives alsoP

n pn(j, j) < 1 for any j 2 S. Furthermore

if i, j 2 S,

1X

n=1

pn(i, j) =1X

n=1

nX

m=1

f (m)i,j pn�m(j, j) =

1X

m=1

1X

n=0

f (m)i,j pn(j, j) < 1 ,

sinceP

n pn(j, j) < 1. So by Borel-Cantelli,

Pi(Xn = j i.o.) = 0 for all j .

Unioning over countably many j’s finishes the proof.

Example.

• If the state space is finite and the chain is irreducible, then every state is recurrent.

To see this, let i 2 S and write M for the number of states in S. ThenX

j2S

X

n

pn(i, j) =X

n

X

j

pn(i, j) =X

n

1 = 1 .

Since there are finitely many terms in the sum, one must be infinity. But then the

previous theorem shows they are all infinity.

1 Stationary distributions

Suppose that there is an initial distribution ↵ for our Markov chain such that

↵ · P = ↵ .

What this means is that if P(X0 = i) = ↵i and then P(X1 = i) = ↵i. Repeating this

argument we will have P(Xn = i) = ↵i for all i. That is, the distribution will not change

over time.

Definition 1.1. An initial distribution ↵ is stationary if ↵ · P = ↵.

By the above discussion, if P(X0 = i) = ↵i and ↵ is stationary, then trivially, limn!1 P(Xn =

i) = ↵i. In some cases, even more is true: we can start the chain with any distribution and

the limiting behavior will still be described by a stationary distribution.

This will never be true if the chain is periodic. Take for instance random walk on Zd.

Then the state of the system, started at 0, can only be zero at time n if n is even. In this

case, the period is 2. In such systems, the probabilities Pi(Xn = i) can oscillate in n, so we

want to rule this out.

Definition 1.2. The period of state i is the greatest common divisor of the set

{n � 1 : pn(i, i) > 0} .

• If the chain is irreducible, then given i, j 2 S we can find n0, n1 such that pn1(i, j) andpn2(j, i) are positive. Then for all n � 0,

pn1+n+n2(i, i) � pn1(i, j)pn(j, j)pn2(j, i) .

If we take n = 0 then we see that the period ti of i divides n1 + n2. Further if

pn(j, j) > 0 then pn1+n+n2(i, i) > 0. This implies that ti divides n1 + n + n2 and thus

also n. Therefore ti divides every element of {m � 1 : pm(j, j) > 0} and thus ti tj.Reversing this argument shows in fact ti = tj. So in an irreducible chain, all states

have the same period.

• If an irreducible chain has period 1 then we say it is aperiodic.

• If the chain is irreducible and aperiodic, then given i 2 S, the greatest common divisor

of

U = {n � 1 : pn(i, i) > 0}

is 1 and if a, b 2 U then a + b 2 U (since pa+b(i, i) � pa(i, i)pb(i, i). If 1 2 Uthen U = N. Otherwise we can pick a, b 2 U that are co-prime. Now the set

{a mod b, 2a mod , . . . , (b� 1)a mod b} is equal to {0, . . . , b� 1}. So if c � ab there isa non-negative linear combination of a and b that is equal to c. This means that N \Uis finite.

Now given i, j 2 S, pick n0 such that if n � n0 then pn(i, i) > 0. We can also choose

m0 such that pi,j(m0) > 0. For all n � n0, then, we have pm+n(i, j) > 0 and we have

shown:

8i, j 2 S, 9n0 such that pn(i, j) > 0 for all n � n0 .

The main theorem on stationary distributions says that we can find them uniquely as

limits (in one case).

Theorem 1.3. Let X0, X1, . . . be an aperiodic irreducible chain. If ⇡ is an invariant distri-bution then the following hold.

2

1. The Markov chain is recurrent.

2. ⇡ is the only invariant distribution.

3. For each i, limn!1 pn(i, j) = ⇡j.

Proof. Suppose that the chain were transient. Then by irreducibility, pn(i, j) ! 0 for all i, jand so

⇡i =

X

j

⇡jpn(j, i) ! 0 as n ! 1 .

This means ⇡i = 0, a contradiction sinceP

i ⇡i = 1. So we conclude that the chain is

recurrent.

To prove the other statements we use a coupling argument. Consider a state space S⇥Sand transition probabilities

p(i,j),(k,l) = pi,kpj,l .

By the existence theorem, given i0 2 S, there exists a sequence Y0, Y1, . . . on some space

which is a Markov chain with the above transition probabilities and initial probabilities

↵(i,j) =

(0 if i 6= i0⇡j if i = i0

.

We note some facts:

• (Yn) is an irreducible Markov chain.

Proof. If (i, j) and (k, l) are states, by aperiodicity and irreducibility of X, we can

choose n0 and n1 such that if n � n0 then pn(i, k) > 0 and if n � n1 then pn(j, l) > 0.

(We proved this directly before this theorem.) Now for n � n0 + n1, pn((i, j), (k, l)) >0.

• (Yn) is aperiodic.

Proof. As shown above given (i, j), we can find n0 such that if n � n0 then pn((i, j), (i, j)) >0. This implies that the period of (Yn) is 1.

• The distribution ⇡(i,j) = ⇡i⇡j is invariant for (Yn).

Proof. For k, l 2 S,X

i,j2S

⇡(i,j)p(i,j),(k,l) =X

i

⇡ipi,kX

j

⇡jpj,l = pkpl = p(k,l) .

Since (Yn) has a stationary distribution, it is recurrent (we just proved this). So define the

random time

⌧ = min{n � 0 : Yn = (i0, i0)} ;

we see that ⌧ is almost surely finite. Continue next time.

3

Lecture 15

We were in the midst of proving the theorem on stationary distributions for aperiodic,

irreducible chains.

Proof. Since (Yn) has a stationary distribution, it is recurrent (we just proved this). So

define the random time

⌧ = min{n � 0 : Yn = (i0, i0)} ;

we see that ⌧ is almost surely finite.

The idea of the proof now is that the first coordinate of Yn behaves like Xn started from

the state i0 and the second behaves like Xn started from the distribution ⇡. After the random(finite) time ⌧ , both coordinates obey the same rules, so probabilities associated to them are

equal. Therefore as n ! 1, probabilities associated to both coordinates converge to each

other.

Suppose ⌧ = m but we are looking at a time n � m: for k, l 2 S,

P(Yn = (k, l), ⌧ = m) = P(Yn = (k, l) | Ym = (i0, i0))P(⌧ = m)

= pn�m(i0, k)pn�m(i0, l)P(⌧ = m) .

Write Yn = (Y (1)n , Y (2)

n ). Summing over each of l and k (separately) gives

P(Y (1)n = k, ⌧ = m) = pn�m(i0, k)P(⌧ = m) and P(Y (2)

n = l, ⌧ = m) = pn�m(i0, l) .

If we choose k = l and sum over m = 1, . . . , n, we get

P(Y (1)n = k, ⌧ n) = P(Y (2)

n = k, ⌧ n) .

This means that if ⌧ has already occurred, then both coordinates of Yn behave the same.

Therefore

P(Y (1)n = k) P(Y (1)

n = k, ⌧ n) + P(⌧ > n)

= P(Y (2)n = k, ⌧ n) + P(⌧ > n)

P(Y (2)n = k) + P(⌧ > n) .

If we apply the same inequalities, but to Y (2)n , then we find

��P(Y (1)n = k)� P(Y (2)

n = k)�� P(⌧ > n) .

But now P(Y (1)n = k) =

Pj,l P(Yn = (k, l) | Y0 = (i0, j))P(Y0 = (i0, j)) = pn(i0, k) and

P(Y (2)n = k) =

X

j,l

P(Yn = (k, l) | Y0 = (i0, j))P(Y0 = (i0, j)) =X

j

⇡jpn(j, k) = ⇡k .

Putting these back into the above,

|⇡k � pn(i0, k)| ! 0 as n ! 1 .

The other case is that there is no stationary distribution.

Theorem 0.1. Let (Xn) be irreducible and aperiodic. If there is no stationary distribution,

then for all i, j 2 S,pn(i, j) ! 0 as n ! 1 .

Proof. Construct the coupled chain (Yn) as in the last proof. We know it is irreducible. If

(Yn) is transient then we know that its transition probabilities converge to 0 as n ! 1;

that is, pn(i, j))2 ! 0, giving pn(i, j) ! 0. In the other case (that (Yn) is recurrent), see

Billingsley (Theorem 8.7). The idea is that if pn(i, j) does not converge to zero, we can find

a subsequence pnk(i, j) that converges to a non-zero number. By a diagonal argument, we

can take a further subsequence such that pnkr(a, b) also converges for all choices of a, b 2 S.

Now we define

⌫j = limr!1

pnkr(i, j) and ⇡j =

⌫jPj ⌫j

and show that (⇡j : j 2 S) is actually a stationary distribution. The proof of this uses the

fact that (Yn) is recurrent, so that we can use the finiteness of ⌧ from the last proof.

The last two results give us a classification: exactly one of the following occur for an

aperiodic, irreducible chain (Xn):

1. (Xn) is transient so that pn(i, j) ! 0 for all i, j 2 S. (Example: simple symmetric

random walk in d � 3.)

2. (Xn) is recurrent but there is no stationary distribution, so pn(i, j) ! 0 for all i, j 2 S.(Example: simple symmetric random walk in d = 1, 2.) This is called null-recurrent;

the transition probabilities converge to 0, but they are not summable.

3. (Xn) is recurrent and pn(i, j) ! ⇡j for the unique stationary distribution ⇡. (Example:

a finite state space. Also one can construct examples with infinite state space – can

you?) This is called positive-recurrent.

In the positive-recurrent case, we can actually compute the limit.

Theorem 0.2. If j is a recurrent state for (Xn) and limn!1 pn(j, j) = u. Then

u�1=

1X

n=1

nf (n)j,j .

(Here we allow that u = 0 when the right side is infinity.)

The quantityP1

n=1 nf(n)j,j is called the mean return time. It is actually the expected value

of the first return time from j to j.

2

Proof. The right side above equals

1X

n=1

f (n)j,j

1X

m=1

1{in} =

1X

m=1

1X

n=m

f (n)j,j =

1X

m=1

am ,

where am is the probabilityP1

r=m Pj(X1 6= j, . . . , Xr�1 6= j,Xr = j) that the first return to

j occurs at time � m.

Now we look at a di↵erent quantity: 1 � pn(j, j) = Pj(Xn 6= j). We can split this

according to whether we have visited j before time n:

1� pn(j, j) =n�1X

l=0

Pj(Xl = j,Xl+1 6= j, . . . , Xn 6= j) .

The summand on the right is the same as the probability that we return to j at time l, butthen the next return is longer than n� l units of time later. So we use the Markov property

on the right to get

n�1X

l=0

pl(j, j)an�l = a2pn�1(j, j) + · · ·+ an+1p0(j, j) .

As n ! 1, if u > 0, this converges to uP1

n=2 an. Therefore in this case,

1� u = u1X

n=2

an ) 1 = u1X

n=1

an = u1X

n=1

nf (n)j,j .

(Here we have used that a1 = 1; that is, with probability one, we return to j.)If instead u = 0 then we have

1 = limn!1

a2pn�1(j, j) + · · ·+ an+1p0(j, j) .

IfP1

n=1 an converged, then this limit would be zero (since pn(j, j) ! 0) and this would be

a contradiction. So in this case,P1

n=1 an = 1 = 1/u.

1 Moment generating functions

Let’s move back to the theory of simple random variables and introduce two final concepts

before moving to the general theory.

Definition 1.1. Let X be a simple random variable. Then the function

M(t) = EetX for t 2 R

is called the moment generating function for X. � : R ! R given by

�(t) = logM(t)

is called the cumulant generating function for X.

3

Here are some standard facts.

• If X is simple then M is C1. In fact it is analytic:

M(t) =nX

i=1

pietxi ,

where {xi} is the range of X and pi = P(X = xi).

• In a sense, M contains all information about X. For example,

M(0) = 1, M 0(0) = EX, . . . ,M (k)

(0) = EXk, k 2 N .

We call the number EXkthe k-th moment of X. The derivatives at zero of the

cumulant generating function are a bit di↵erent:

�(0) = 0, �0(0) = EX, �00

(0) = EX2 � (EX)2= Var X, . . .

These numbers are called the cumulants of X.

• M(t) completely determines the distribution of X (that is, the probabilities P(X = x)for all x 2 R).

Proof. We must show that no other simple Y can have its moment generating function,

say N(t), equal to M(t) for all t unless X and Y have the same distribution. So write

N(t) =P

qjetyj and reorder the yj’s and xi’s so that y1 is the max of the yj’s and x1

is the max of the xi’s. Then

0 = M(t)�N(t) = p1etx1 � q1e

ty1 + · · ·

Taking t ! 1, the first two terms dominate and we cannot have M(t)�N(t) = 0 for

all t unless p1 = q1 and x1 = y1. We then apply the same argument to M(t) � p1etx1

and N(t) � q1ety1 to conclude that the next maxima, say x2 and y2 have x2 = y2 and

p2 = q2. Continuing, we find that X and Y have the same distribution.

• Both M and � are convex, since M 00(t) = E[X2etX ] and we can compute �00

(t) by

�00(t) =

d

dt�0(t) =

d

dt

EXetX

EetX =EX2etXEetX � (EXetX)2

(EetX)2

and you can check the top is non-negative. Actually �00(t) it is the variance of a random

variable X which has distribution

P(X = x) = P(X = x)etx

EetX .

We will be using the following convex dual to the cumulant generating function.

Definition 1.2. For a 2 R, define

�⇤(a) = sup

t�0[at� �(t)] .

4

Lecture 16

1 Cramer’s theorem

Cramer’s theorem is about large deviations. As we saw in the law of large numbers, ifX1, X2, . . . are i.i.d. simple variables then

X1 + · · ·+Xn

n! µ := EX1 almost surely .

We can ask for the probability that X1 + · · · +Xn is far away from nµ. For instance, whatis the probability

P(X1 + · · ·+Xn � n(µ+ ✏))

for ✏ > 0. If this event occurs, we say that a large deviation of order ✏ has occurred. Theselarge deviations are very unlikely. In fact, generally the probability of a large deviationdecreases exponentially in n, with a rate depending on ✏. Stated more formally:

Definition 1.1. The rate function I associated to a simple random variable X is given by

I(a) = limn!1

� 1

nlogP(X1 + · · ·+Xn � an) .

Of course this definition requires that the limit exists, but it does for each a, although itmay be 1. Note that by the above definition, for large n,

� 1

nlogP(X1 + · · ·+Xn � an) ⇠ I(a)

P(X1 + · · ·+Xn � an) ⇠ e�nI(a) ,

where we have left “⇠” undefined. This means I(a) is the exponential rate of decay of theprobability in n.

Theorem 1.2 (Cramer for simple variables). Let (Xn) be a simple i.i.d. sequence.

1. The limit I(a) exists for all a > µ.

2. Furthermore it is the Legendre transform of the cumulant generating function:

I(a) = supt�0

(at� �(t)) ,

so it is a convex function.

3. I(a) > 0 for a > 0.

Proof. We will first prove what is called the Chernov bound. It says that

P(X1 + · · ·+Xn � an) e�n�⇤(a) for all n � 1 . (1)

This immediately implies that

lim infn!1

� 1

nlogP(X1 + · · ·+Xn � an) � �⇤(a) .

By subtracting o↵ the mean of the variables, we will assume that

EX1 = 0 and a > 0 .

Let t � 0 be arbitrary and compute using the Markov inequality and the i.i.d. assumption:

P(X1 + · · ·+Xn � an) = P(etX1 · · · etXn � etan) e�tanE(etX1 · · · etXn)

= e�tanE(etX1) · · ·E(etXn)

=�e�taM(t)

�n

= e�n(at��(t)) .

The left side does not depend on t, so we can minimize the right side:

P(X1 + · · ·+Xn � an) inft�0

e�n(ta��(t)) = exp

✓�n sup

t�0(ta� �(t))

◆.

You will show in the homework that

lim supn!1

� 1

nlogP(X1 + · · ·+Xn � an) �⇤(a) .

Combined with the above, we get the statement of the theorem. For convexity, if a, b > 0and � 2 [0, 1] then

�⇤(�a+ (1� �)b) = supt�0

[(�a+ (1� �)b)t� �(t)] .

Note

�a+ (1� �)b� �(t) = �a+ (1� �)b� ��(t)� (1� �)�(t)

= �(a� �(t)) + (1� �)(b� �(t)) .

So�⇤(�a+ (1� �)b) sup

t�0[�(at� �(t)) + (1� �)(bt� �(t))]

This is bounded by ��⇤(a) + (1� �)�⇤(b).For the last part, using EX = 0 and a > 0,

�(0) = 0, �0(0) = 0, �00(0) = Var X � 0 .

Therefore as t ! 0+,

at� �(t) = at� t2

2Var X + o(t2) ,

and this is positive for small t.

2

2 Law of the iterated logarithm

The last topic before the general theory attempts to quantify in more precise terms “howlarge” Sn is as n grows, where

Sn = X1 + · · ·+Xn ,

with (Xn) simple. We will shift by EX1 and normalize them bypVar X1, so that we may

assume thatEX1 = 0 and Var X1 = 1 .

We can already give some quite good estimates below. The first is actually not so good,as we will see, but it illustrates some of the ideas we have developed. The second bound,however, is pretty close to optimal. We will prove this in the next lecture.

Theorem 2.1. Define Sn as above.

1.P(S2

n � n/2 i.o.) = 1 .

2. For any c > 2,P(|Sn| >

pcn log n i.o.) = 0 .

The above theorem states that

n/2 S2n i.o. and S2

n (2 + ✏)n log n for all large n

for any ✏ > 0. We can do better than this, and it is the subject of the law of the iteratedlogarithm.

Theorem 2.2. Given Sn = X1 + · · ·+Xn with EX1 = 0, Var X1 = 1 and Xi simple, i.i.d.,

P✓lim supn!1

Snp2n log log n

= 1

◆= 1 .

The proof is actually quite similar to what we have done above. However there are a fewmore tools involved.

3

Lecture 17

1 Law of the iterated logarithm

Recall the law:

Theorem 1.1. Let X1, X2, . . . be i.i.d. and simple with EX1 = 0 and Var X1 = 1. DefineSn = X1 + · · ·+Xn for n � 1. Then

P✓lim supn!1

Snp2n log log n

= 1

◆= 1 .

Here are some consequences of this theorem.

• By applying the same theorem to the variables �X1,�X2, . . ., which have mean zero

and variance one also, we find

lim infn!1

Snp2n log log n

= �1 almost surely .

• Because the ratio is close to 1 infinitely often and close to -1 infinitely often, one can

argue that in fact

P✓⇢

Snp2n log log n

: n � 1

�is dense in [�1, 1]

◆= 1 .

You will do this on homework.

Recall also that we are proving a much weaker version.

Theorem 1.2. Let X1, X2, . . . be i.i.d. and simple with EX1 = 0 and Var X1 = 1. DefineSn = X1 + · · ·+Xn for n � 1.

1.P(S2

n � n/2 i.o.) = 1 .

2. For any c > 2,P(|Sn| >

pcn log n i.o.) = 0 .

Proof. We begin by recalling the Paley-Zygmund inequality. If X is a non-negative random

variable then

P(X � ✓EX) � (1� ✓)2EX2

(EX)2.

We apply this to X = S2n, recalling that ES2

n = n and ES4n Cn2

for some C (remember in

the proof of the strong law of large numbers), to find

P(S2n � 3n/4) � 1

16Cfor all n .

If these events were independent for di↵erent n we would be done, but alas they are not, so

we need to do some acrobatics. The above implies that

P(S2n � 3n/4 i.o.) � 1/(16C) .

For k � 1 define the event

Ak = {(Xk+1 + · · ·+Xk+n)2 � 3n/4 for infinitely many n} .

Then P(Ak) � 1/(16C) for all k and {Ak i.o.} is a tail event for the variables X1, X2, . . .(although each Ak itself is not). By the Kolmogorov zero-one law,

P(Ak for infinitely many k) = 1 .

Now assume for a contradiction that P(S2n � n/2 finitely often) > 0. Then this has positive

probability intersection with {Ak i.o.} and so we can pick

! 2 {S2n � n/2 finitely often} \ {Ak i.o.} .

For such an !, let K be such that if

k � K ) S2k < k/2 . (1)

We can also pick k � K such that Ak occurs. Then we write for n � 1

S2n+k

n+ k=

n

n+ k

✓Skpn+

Xk+1 + · · ·+Xk+npn

◆2

.

Letting n ! 1, the term n/(n+ k) converges to 1, whereas Sk/pn ! 0. However

(Xk+1 + · · ·+Xk+n)2

n� 3/4 for infinitely many n ,

so

lim supn!1

S2n+k

n+ k� 3/4 ,

contradicting (1).

For the second part, the idea is to use the Chernov bound and try to optimize to get the

best inequality. If x � 0 then, with �(t) as the cumulant generating function for X1:

P(Sn � x) e�tx+n�(t)= e�n(xt/n��(t)) .

Given c > 2, we can choose ✏ > 0 such that

c(1� ✏)

2> 1 . (2)

2

Because �(0) = �0(0) = 0 and �00

(0) = 1, for any nonnegative sequence (tn) with tn ! 0, we

have

�(tn) t2n2(1 + ✏) for all large n .

Take x =pcn log n and tn = x/n. Then tn ! 0 and tnx/n = t2n, so

P(Sn �p

cn log n) e�n(t2n�t2n2 (1+✏))

= e�nt2n2 +n✏

t2n2 .

The exponent is

�nx2

2n2+ n✏

x2

2n2= �n(1� ✏)

cn log n

2n2= �(1� ✏)

c

2.

Therefore

P(Sn �p

cn log n) e�(1�✏) c2 logn= n�(1�✏) c2 .

By (2), then, X

n

P(Sn �p

cn log n) < 1

and the second statement follows by Borel-Cantelli.

3

Lecture 18

1 Back to measure theory

After our beginning in more basic probability, we will now start over with the general theory.For this we will develop some more measure theory. First we want to build measures thatare not just probability measures.

Many of the results we give for a while will not be given with proofs, as their proofs aresimilar those we did before.

One thing we will want to do is define Lebesgue measure on Rd for d � 1 (not just on(0, 1]). To do this, we recall

Definition 1.1. If ⌦ is a topological space, the Borel sigma-algebra is the one generated bythe collection of open sets of ⌦.

• In Rd, the Borel sigma-algebra (given from the usual topology) is also generated byhalf-open rectangles:

Bd = �

{

dY

i=1

(ai, bi] : ai < bi}!

.

• We can even choose the above rectangles to have rational endpoints, so that the Borelsigma-algebra is countably generated.

• The Borel sets in (0, 1] are just the Borel subsets of R that are contained in (0, 1].(Similarly for (0, 1]d.)

Now we define a general (positive) measure.

Definition 1.2. If (⌦,⌃) is a set with a sigma-algebra, then a function µ : ⌃ ! [0,1] is ameasure if

1. for all A 2 ⌃, µ(A) 2 [0,1],

2. µ(;) = 0 and

3. whenever A1, A2, . . . 2 ⌃ are disjoint then µ([nAn) =P

n µ(An).

µ is a finite measure if µ(⌦) < 1 and is an infinite measure otherwise. Last, µ is calledsigma-finite if there exists a nested increasing sequence (An) in ⌃ such that ⌦ = [nAn andµ(An) < 1 for all n.

As before, a measure on an algebra is the same except in condition 3, [nAn is requiredto lie in ⌃ also. µ above is also called a set function.

Here are some properties of measures that are similar to those from before.

• If A,B 2 ⌃ with A ⇢ B then µ(A) ⇢ µ(B).

Proof.µ(B) = µ(A) + µ(B \ A) � µ(A) .

• Countable additivity consequences. Their proofs are exactly as before. Note that thesecond one is slightly di↵erent than in the probability measure case: we must requirethat µ(A1) < 1.

1. If An 2 ⌃ for all n and (An) is nested increasing, then µ([nAn) = limn!1 µ(An).

2. If An 2 ⌃ for all n and µ(A1) < 1 with (An) nested decreasing, then µ(\nAn) =limn!1 µ(An).

Proof. Here we need µ(A1) < 1 to derive this from the first property. Specifi-cally, define Bn = A1 \ An. Then the Bn’s are nested increasing so, µ([nBn) =limn!1 µ(Bn). Therefore

µ(A1 \ \nAn) = limn!1

µ(A1 \ An) .

3. Countable subadditivity: if A1, A2, . . . 2 ⌃ then µ([nAn) P

n µ(An).

The more involved theorems we proved about probability measures involved extendingthem from algebras to sigma-algebras. There is are analogous statements here. The extensionhas an identical proof.

Theorem 1.3. A measure on a field has an extension to the generated sigma-field.

The uniqueness, though, requires sigma-finiteness.

Theorem 1.4. If two measures µ and ⌫ on (⌦,⌃) agree on a subfield ⌃0 such that �(⌃0) = ⌃,then if µ and ⌫ are sigma-finite on (⌦,⌃0), they must agree on ⌃.

Proof. The proof is similar to before, except we need to use sigma-finiteness to apply themonotone class theorem. Specifically, take (Bn) in ⌃0 such that µ(Bn) < 1 for all n and⌦ = [nBn. Then define

Mn = {A 2 ⌃ : µ(A \ Bn) = ⌫(A \Bn)} .

As before, Mn is a monotone class that contains the field ⌃0. (If we did not intersect withBn and simply defined M = {A 2 ⌃ : µ(A) = ⌫(A)} then we would not be able to showthat M is closed under decreasing intersections if µ(⌦) = 1.) Therefore by the monotoneclass theorem, Mn contains ⌃.

Now for any A 2 ⌃,

µ(A) = limn!1

µ(A \Bn) = limn!1

⌫(A \ Bn) = ⌫(A)

and we are done.

2

2 Example: Lebesgue measure

As we did at the beginning of the course, we can build Lebesgue measure by first setting

⌃0 = {finite unions of disjoint half-open intervals} .

In higher dimensions we define a half-open rectangle to be of the form

(a1, b1]⇥ · · ·⇥ (an, bn], ai bi for all i

and define⌃0 = {finite unions of disjoint half-open rectangles} .

For any n-dimensional rectangle we define the Lebesgue measure

�n(nY

i=1

(ai, bi]) =nY

i=1

(bi � ai)

(just like a volume). Furthermore, for a finite disjoint union of half-open rectangles [Ni=1Ri,

we set

�n([Ni=1Ri) =

NX

i=1

�n(Ri) .

To show this is well-defined requires using a variant of Billingsley’s main theorem fromChapter 1 (which we proved in the third lecture or so).

• Just as before, �n is a measure on the field ⌃0 (in particular it is countably additive,and the proof of this also requires a variant of Billingsley’s main theorem).

• (Rn,⌃0,�n) is a sigma-finite measure space. The reason is that we can choose for our

sequence (Bk)Bk = (�k, k]n .

Then �n(Bk) = (2k)n < 1 and Rn = [kBk.

• So we can apply the measure extension theorem and define �n as the n-dimensionalLebesgue measure.

Lebesgue measure satisfies some natural properties.

1. It is translation invariant. That is, if A is a Borel set and x 2 Rn then �n(A) =�n(A+ x).

Proof. First if A is a Borel set then x+A is a Borel set. We can see this because if Ais a finite union of disjoint half-open rectangles then x+A is as well, so the collection{A Borel : x+ A is Borel} is a monotone class containing ⌃0, so it contains all Borelsets.

Next we can define a new measure by �xn(A) = �n(x + A). You can easily check this

is a measure on Borel sets of Rn. It agrees with �n on ⌃0, so it agrees on all Borelsets.

3

2. If T : Rn ! Rn is linear and invertible then for all Borel A ⇢ Rn

�n(TA) = |det T |�n(A) .

For the proof, see Billingsley, Theorem 12.2.

3. Lebesgue measure is regular. That is, given ✏ > 0 and A ⇢ Rn Borel, we can find aclosed F ⇢ A and open O � A such that �n(F \O) < ✏. In fact, any Borel measure (ameasure on (Rn

,⌃), where ⌃ is the Borel sigma-algebra) that assigns finite measure tobounded Borel sets is regular. This is shown in Billingsley, Theorem 12.3.

4

Lecture 19

1 Distribution functions

We will use many di↵erent measures in probability, not just finite ones. But generally, ifthey lie on Rn, they will assign finite measure to bounded sets. For instance, we can createsuch a measure by pushing forward any random variable: if X is a random variable on aprobability space (⌦,⌃,P) then we can define µ on Borel subsets of R by

µ(A) = P(X 2 A) .

This expression defines a probability measure on R.There is a nice way to visualize measures that assign finite mass to bounded sets. Given

such a µ, we can define its distribution function F : R ! R by

F (x) =

(�µ((x, 0]) if x 0

µ((0, x]) if x � 0.

For instance, if µ is Lebesgue measure, we get F (x) = x. If �y is the delta mass at y (thatis, the measure that assigns �y(A) = 1 if y 2 A and 0 otherwise) then F (x) will have a jumpdiscontinuity at x = y.

When µ is a finite measure, it is standard to define the distribution function instead as

Definition 1.1. If µ is a finite Borel measure on R, the distribution function F for µ isdefined by

F (x) = µ((�1, x]) .

This is what we do typically with a probability measure. Let us note some properties ofF for finite µ:

1. µ((a, b]) = F (b)� F (a).

2. F is non-decreasing:

Proof. If x y,F (x) = µ((�1, x]) µ((�1, y]) = F (y) .

3. F is right continuous.

Proof. If x 2 R and xn ! x+, \n(�1, xn] = (�1, x], so we can use the consequence

of countable additivity:

F (x) = µ((�1, x]) = limn!1

µ((�1, xn]) = limn!1

F (xn) .

In fact there is a existence and uniqueness theorem: such functions correspond exactlyto the Borel measures.

Theorem 1.2. For each F : R ! R that is non-decreasing and right continuous, there existsexactly one Borel measure µ on R such that

µ((a, b]) = F (b)� F (a) for all a b .

Proof. For existence, given such an F , we can build a Borel measure on R exactly parallel tothe construction of Lebesgue measure. That is, we define µ by µ((a, b]) = F (b)� F (a) andfor a finite union of disjoint half-open intervals [N

n=1Rn we define µ([Nn=1Rn) =

PNn=1 µ(Rn).

We can then show (using right continuity and monotonicity of F ) that this is a countablyadditive measure on the field of such unions. By Caratheodory, we can extend to the Borelsigma-algebra.

In the finite case, we will give a more enlightening proof again, after talking a bit aboutmeasurable functions.

For uniqueness, if µ1 and µ2 satisfy

µ1((a, b]) = F (b)� F (a) = µ2((a, b]) for all a < b ,

then they must agree on ⌃0, the field consisting of finite unions of disjoint half-open intervals.Again by the monotone class theorem, we deduce µ1 = µ2.

In higher dimensions, we can still define a distribution function. Given a finite Borelmeasure µ on Rn, the replacement is

F (x1, . . . , xn) = µ

nY

i=1

(�1, xi]

!.

See “Specifying measures in Rk” in Chapter 12 of Billingsley for details.

2 Measurable functions

2.1 Measurability

We now return to our previous study of random variables and recall a definition.

Definition 2.1. Let (⌦1,⌃1) and (⌦2,⌃2) be measurable spaces. Then f : ⌦1 ! ⌦2 ismeasurable (or measurable relative to ⌃1/⌃2) if

f�1(B) 2 ⌃1 for all B 2 ⌃2 .

A function f : ⌦ ! Rn is said to be measurable if it is Borel measurable.

We saw before that if f is a simple real function then f is Borel-measurable if and only iff�1({x}) 2 ⌃1 for all x 2 R. In general this is too strong: it will not be true for measurable

functions. But we can replace this condition with another. (This is the first item below).

2

Proposition 2.2. Let f : ⌦1 ! ⌦2 and g : ⌦2 ! ⌦3, where (⌦3,⌃3) is another measurablespace. Then

1. Let A2 ⇢ P(⌦2) is a generating set; that is, �(A2) = ⌃2. Then f is measurable if andonly if f�1(A) 2 ⌃1 for all A 2 A2.

2. If f and g are measurable, so is g � f .

Proof. Suppose that f : ⌦1 ! ⌦2 satisfies f�1(A) 2 ⌃1 for all A 2 A2. Note that

A = {B 2 ⌃2 : f�1(A) 2 ⌃1}

is a sigma-algebra. Indeed, first f�1(;) = ; 2 ⌃1 so ; 2 A. Next if A 2 A then f�1(Ac) =

(f�1(A))c 2 ⌃1, since f�1(A) 2 ⌃1, so A

c 2 A. Last, if A1, A2, . . . 2 A2 then f�1([nAn) =

[nf�1(An) 2 ⌃1, so [nAn 2 A. Since A contains A2 it must contain �(A2) = ⌃2. Therefore

for all A 2 ⌃2, f�1(A) 2 ⌃1 and f is measurable.Conversely, if f is measurable then f

�1(A) 2 ⌃1 for all A 2 A2, since A2 ⇢ ⌃2.If f and g are measurable then for A 2 ⌃3, g�1(A) 2 ⌃2, and so f

�1(g�1(A)) 2 ⌃1. Thismeans (g � f)�1(A) 2 ⌃1, giving g � f is measurable.

Corollary 2.3. If X, Y are topological spaces and f : X ! Y is continuous then f is Borelmeasurable.

Proof. By the last proposition, it su�ces to show that for all open O ⇢ Y , f�1(O) is a Borelset in X. But by continuity, f�1(O) is open and hence Borel.

Note that from the above proposition,

• if f : ⌦ ! R satisfies f�1((�1, x]) 2 ⌃ for all x 2 R then it is measurable. This is

because we can generate any open set using these: first build a half-open interval

(a, b] = (�1, b] \ (�1, a]

and you showed in homework already that half-open intervals generate the Borel sigma-algebra.

• Similarly, f : ⌦ ! Rn then f is measurable if and only if for all (x1, . . . , xn),

f�1

nY

i=1

(�1, xi]

!2 ⌃ .

2.2 Measurable functions from others

As usual, we can build measurable functions from others:

Proposition 2.4. If f1, . . . , fn : ⌦ ! R are measurable, then the the functions f1+ · · ·+ fn,f1 · · · fn and max{f1, . . . , fn} are measurable.

3

Proof. Define F : ⌦ ! Rn by F (!) = (f1(!), . . . , fn(!)). Then F is measurable: for each(x1, . . . , xn),

F�1

nY

i=1

(�1, xi]

!= \n

i=1f�1i ((�1, xi]) 2 ⌃ .

Now since the functions

(x1, . . . , xn) 7! x1 + · · ·+ xn

(x1, . . . , xn) 7! x1 · · · xn

(x1, . . . , xn) 7! max{x1, . . . , xn}

are continuous from Rn to R, they are measurable, and the proposition follows from the lastproposition.

We can even build more complicated combinations of functions. For these results we saythat a function f : ⌦ ! [�1,1] is measurable if for each Borel A ⇢ R, f�1(A) 2 ⌃ butalso the sets {! : f(!) = 1} and {! : f(!) = �1} are in ⌃.

Theorem 2.5. Let f1, f2, . . . : ⌦ ! R be measurable.

1. The functions supn fn, infn fn, lim supn fn and lim infn fn are measurable.

2. If limn fn exists everywhere then the limit is measurable.

3. {! : fn(!) converges} 2 ⌃.

4. If f is measurable then {! : fn(!) ! f(!)} 2 ⌃.

Proof. For each x, (even x = ±1)

{! : supn

fn(!) x} = \nf�1n ([�1, x]) 2 ⌃ .

Therefore supn fn is measurable. Also infn fn = � supn �fn is measurable. Further, lim supn fn =infn supk�n fk and lim infn fn = � lim supn �fn are measurable. This shows 1. In 2, if thelimit exists, it equals the lim sup, which is measurable. For 3, we write

{! : fn(!) converges} = {! : lim infn

fn(!) = lim supn

fn(!)} .

Now (lim infn fn, lim supn fn) defines a function F from ⌦ to [�1,1]2 and the set above issimply

F�1

✓{(x, y) 2 R2 : x = y} [ {(�1,�1)} [ {1,1}

◆2 ⌃ ,

proving 3. In 4, we do a similar argument for the triple (lim infn fn, lim supn fn, f), which isa function from ⌦ to [�1,1]3. We just use the fact that

{! : fn(!) ! f(!)} = {! : lim infn

fn(!) = lim supn

fn(!) = f(!)} .

4

Lecture 20

0.1 Approximation by simple functions

We spent the first part of the semester talking about simple functions – actually simple

random variables. This will be of some use to us in the general case not just for intuition.

We can approximate measurable functions by simple ones. This can be done in many ways;

here is the most basic. It says that we can approximate a measurable function in a monotonic

way – increasing when the function is nonnegative and decreasing otherwise.

Theorem 0.1. Given f : ⌦ ! R that is measurable, there exists a sequence (fn) of simplefunctions such that

fn(!) " f(!) if f(!) � 0

fn(!) # f(!) if f(!) 0 .

Proof. We will give the precise formula and it will be apparent that fn has the required

properties. The idea is to break the range of f into intervals of length 2�n

. Then we set

fn to be the left endpoint of the interval if it is nonnegative and the right endpoint if it is

nonpositive. Since the range of f can be unbounded and thus would give infinitely many

intervals, we need to truncate the range too. Set

I0 =

✓0,

1

2n

�, I1 =

✓1

2n,2

2n

�, . . . , In2n�1 =

✓n2n � 1

2n,n2n

2n

�,

J0 =

✓� 1

2n, 0

�, . . . , Jn2n�1 =

✓�n2n

2n,�n2n � 1

2n

�,

I+ = (n,1] and J� = [�1,�n] .

Now define

fn(!) =

8>>><

>>>:

�n if f(!) 2 J�� k

2n if f(!) 2 Jkk2n if f(!) 2 Ikn if f(!) 2 I+

.

The approximation given by the last theorem is particularly useful because it will play a

role in our definition of the integral.

0.2 Pushforward

Measurable functions can be used to push a measure on one space forward to another space.

Definition 0.2. Let (⌦1,⌃1), (⌦2,⌃2) be measurable spaces and f : ⌦1 ! ⌦2 measurable. Ifµ is a measure on (⌦1,⌃1) we define the push forward by f , written µf�1 (or f ⇤µ), as

�µf�1

�(A) = µf�1

(A) .

• If µ is a measure on (⌦1,⌃1) then µf�1is a measure on (⌦2,⌃2).

Proof. First µf�1(;) = µ (f�1

(;)) = µ(;) = 0. Also if A 2 ⌃2 then µf�1(A) =

µ (f�1(A)) 2 [0,1]. If A1, A2, . . . are pairwise disjoint in ⌃2 then f�1

(A1), f�1(A2), . . .

are also disjoint in ⌃1, so

µf�1([nAn) = µ

�f�1

([nAn)�= µ

�[nf

�1(An)

�=

X

n

µ(f�1(An)) =

X

n

µf�1(An) .

• If µ = P is a probability measure then so is Pf�1. In the case that f = X : ⌦ ! R is a

random variable (it is (Borel) measurable), then PX�1is a Borel probability measure

on R. It is usually called the distribution or the law of X.

Now we can give another proof of the existence of a probability measure with a given

distribution function.

Definition 0.3. If X is a random variable on (⌦,⌃,P) we define

F (x) = P(X x)

as the distribution function of X. (It is the distribution function of the law of X.)

1. As in the case of a measure, F is right-continuous and non-decreasing.

2. Because F is non-decreasing

F (x�) := lim

y!x�F (y) exists for all x 2 R .

Furthermore,

F (x)� F (x�) = P(X x)� lim

y!x�P(X y) .

By monotonicity, we can restrict the limit to a sequence yn ! x�and use the conse-

quence of countable additivity:

limy!x�

P(X y) = P(X < x) .

So

F (x)� F (x�) = P(X = x)

is the value of the jump of F at x.

2

3. Last,

limx!�1

F (x) = limx!�1

P(X x) ,

which, again, can be computed by appealing to monotonicity and a sequence xn ! �1to find

limx!�1

P(X x) = P(\{X xn}) = P(X = �1) .

Similarly, limx!1 F (x) = 1 � P(X = 1). In the case that X does not take values

±1, these are 0 and 1 respectively.

The existence portion of our old result is stated here:

Theorem 0.4. Given a right-continuous non-decreasing F : R ! R with

limx!�1

F (x) = 0 and limx!1

F (x) = 1

there exists a random variable X on some probability space such that

P(X x) = F (x) for all x 2 R .

A way to phrase this is from the point of view of the push-foward PX�1: given an F as

above, there is a probability measure µ such that µ((�1, x]) = F (x) for all x.

Proof. The idea here is to use Lebesgue measure to build our random variable. This is a

standard technique in probability and shows that once we have Lebesgue measure, we can

generate any random variable. So define the probability space (⌦,⌃,P) to be ((0, 1),⌃,�),where ⌃ is the Borel subsets of (0, 1) and P = � is Lebesgue measure.

We want to define X : (0, 1) ! R such that

P(X(!) x) = F (x) for x 2 (0, 1) .

If F were strictly increasing and continuous, this would be equivalent to P(X(!) F�1(x)) =

x, orP(F (X(!)) x) = x = P(! x) .

This suggests that we define X so that it is the inverse of F . Generally F is not invertible,

so we use the right-continuous inverse.

Define

X(!) = inf{x : ! F (x)} for ! 2 (0, 1) .

Because F (x) ! 1 as x ! 1, the above set is nonempty and because F (x) ! 0 as x ! �1,

it is bounded below; therefore, the definition is sensible. Furthermore, because F is right-

continuous, the infimum is attained in the set and so monotonicity guarantees

{x : ! F (x)} = [X(!),1) .

Hence X(!) x if and only if ! F (x) and

P(X x) = P(! F (x)) = F (x) .

3

Example. Billingsley gives an example of the exponential distribution. It is motivated as

follows: suppose we have a random variable X which is supposed to represent a waiting time

for an event (say a telephone call). We assume that, given we have waited a certain amount

of time y, the conditional distribution of the waiting time is the same as the unconditioned

one. In other words,

P(X > x+ y | X > y) = P(X > x) for x, y � 0 .

This is sometimes called the memoryless property of the exponential distribution. We also

assume that since X is a waiting time, P(X � 0) = 1.

We can write this equation in terms of distribution functions: the left side is

P(X > x+ y,X > y)

P(X > y)=

P(X > x+ y)

P(X > y)=

1� F (x+ y)

1� F (y),

whereas the left side is 1� F (x). Setting G(x) = 1� F (x) this functional relationship is

G(x+ y) = G(x)G(y) .

Assuming G(x) > 0 for all x, we can take logarithms to see that logG is additive. If G is

also measurable, it is a theorem that logG must be linear. The requirement that F bounded

implies that

F (x) =

(1� e�↵x, ↵ > 0 if x � 0

0 if x < 0.

4

Lecture 21

1 Weak convergence

We have already studied almost sure convergence and convergence in probability. We now

meet weak convergence, and this notion measures less how close the values Xn(!) and X(!)are and more how close their laws are.

For this section, let F and F1, F2, . . . be distribution functions on R of random variables

X and X1, X2, . . . taking values in R.

Definition 1.1. We say Fn ! F weakly (or Xn ! X weakly) if Fn(x) ! F (x) for all x 2 Rat which F is continuous.

Why do we want F to be continuous at x? Exactly to handle the following type of

situation. Suppose that Fn = 1{[1/n,1)} and F = 1{[0,1)}. Here Fn is the distribution

function of a random variable that takes the value 1/n with probability 1 and F is the

distribution function of a random variable that takes the value 0 with probability 1. In such

a situation, it seems reasonable that we would define weak convergence so that in this very

strong situation, Fn ! F weakly. But if we allow the definition to address points x at which

F is not continuous, we could use x = 0 and we would see

Fn(0) = 1{[1/n,1)}(0) = 0 for all n

but F (0) = 1.

Example. Given X1, X2, . . . that are i.i.d. random variables (that is, they are independent

and have the same law), we define

Mn = max{X1, . . . , Xn}

and consider this new sequence M1,M2, . . .. Whereas the Xi’s all has the same distribution,

the Mi’s do not. We can directly compute with independence:

P(Mn x) = P(Xi x for all i = 1, . . . , n) =nY

i=1

P(X1 x)

and this last expression is (F (x))n, where F is the distribution function of X1. Therefore

P(Mn x) = F (x)n .

Note that then

P(Mn x) !(0 if F (x) < 1

1 if F (x) = 1.

This means that if X1 has some probability to be above x, then the probability that Mn is

above x gets larger and larger as n grows (that is, Mn, in a sense, gets larger).

1. If the Xi’s have exponential distribution with some parameter ↵ > 0, then

P(Mn x) = (1� e�↵x)n, x � 0 .

Choose x = xn =1↵(log n+ y). Then

P(↵�1Mn � log n y) =�1� e� logn�y

�n=

✓1� e�y

n

◆n

! e�e�y.

In other words, the sequence of random variables given by ↵�1Mn � log n converges

weakly to a variable with the distribution F (y) = e�e�y.

2. In the last example, we showed that F n(any + bn) ! G for

an = ↵�1, bn = ↵�1log n, G(y) = e�e�y

.

Here we can regard bn as a shift of Mn and an as a scaling of Mn. Generally, Billingsley

shows in the section “extremal distributions” that the only possible limits of sequences

of the type F n(anx+ bn) are: (for ↵ > 0)

F1(x) = e�e�x

F2,↵(x) =

(0 if x < 0

e�x�↵if x � 0

F3,↵ =

(e�(�x)↵

if x 0

1 if x > 0

2 Integration

Now we will spend some time developing integration for general measures. We start at the

bottom, with simple measurable functions. For the remainder, let

(⌦,⌃, µ) be a measure space .

As before, we allow random variables to take the values ±1, but we use the conventions

1 · 0 = 0, 1± x = 1 if x 2 R .

Definition 2.1. If f : ⌦ ! R is simple and measurable then we write

f =

nX

i=1

xi1Ai ,

where xi 2 R and Ai 2 ⌃. Then defineZ

f dµ =

nX

i=1

xiµ(Ai) .

2

Note that if f takes a value ±1 on a set of measure zero, its contribution to the integral

is 0 · ±1 = 0.

Just as before, when we defined expectation of simple random variables, we can prove

various properties of this definition: additivity, monotonicity, etc. Although our setting is

slightly di↵erent, as our simple functions can take the values ±1, the proofs are similar.

Instead of restating these therefore, we will assume then, define integrals for general non-

negative functions and prove the properties for these.

Definition 2.2. Let f : ⌦ ! R be non-negative and measurable. Define

Zf dµ = sup

⇢Zs dµ : s simple and s(!) f(!) for all !

�.

Note that in the definition, we may assume s � 0, since the integral is monotone for

simple functions.

If f : ⌦ ! R is only measurable, we write

f+ = max{f, 0} and f� = (�f)+

for the positive and negative parts of f (note here that they are both non-negative), so that

f(!) = f+(!)� f�(!) for all ! .

Definition 2.3. If f : ⌦ ! R is measurable, defineZ

f dµ =

Zf+ dµ�

Zf� dµ .

If bothRf+ dµ =

Rf� dµ = 1 we leave

Rf dµ undefined and say f is not integrable.

2.1 Properties of the integral: non-negative functions

Proposition 2.4. Let f, g : ⌦ ! R be non-negative and measurable.

1. (Monotonicity) If f(!) g(!) for all ! thenRf dµ

Rg dµ.

2. (Linearity) If a, b � 0 then

Z(af + bg) dµ = a

Zf dµ+ b

Zg dµ .

3. (Fatou’s lemma) If (fn) is a sequence of nonnegative measurable functions,

Zlim inf

nfn dµ lim inf

n

Zfn dµ .

3

4. (Monotone convergence theorem) If (fn) is a sequence of nonnegative measurable func-tions with fn(!) " f(!) for all ! then

Zfn dµ "

Zf dµ .

Proof. If f(!) g(!) then for all simple s with s(!) f(!), we also have s(!) g(!).Therefore Z

f dµ = supsf

Zs dµ sup

sg

Zs dµ =

Zg dµ .

To prove Fatou, the definition of the integral implies that it su�ces to show that if

0 s lim infn fn is simple and measurable, then

Zs dµ lim inf

n

Zfn dµ . (1)

We will do this in the case thatRs dµ < 1. The extension to the other case is an exercise.

Note that if s is identically zero, (1) holds trivially, so we may assume min(Range(s)\{0}) >0. Write S = {! : s(!) > 0} and note that since

Rs dµ < 1 we also have

µ(S) (min(Range(s) \ {0}))�1

Zs dµ < 1 .

Continued next time.

4

Lecture 22

We are in the middle of proving Fatou.

Proof. We had definedS = {! : s(!) > 0} ,

where s is a simple function with the properties 0 s(!) lim infn fn(!) for all ! andRs dµ < 1.Let ✏ > 0 be any number such that s(!) � ✏ � 0 for all ! 2 S (this is possible because

the range of s is finite). Now define s✏(!) = s� ✏ on S and 0 otherwise.Because s(!) � ✏ < lim infn fn(!) for each ! 2 S, we can split S into pieces: for n � 1

letSn = {! : fm(!) � s✏(!) for all m � n} .

Then[nSn = S is an increasing union . (1)

Then fn � s✏ on Sn and monotonicity impliesZ

fn dµ �Z

fn1Sn dµ �Z

s✏1Sn dµ . (2)

However because s✏1Sn is simple, we can write s =Pk

i=1 xi1Ai for Ai 2 ⌃ and

Zs✏1Sn dµ =

kX

i=1

(xi � ✏)µ(Ai \ Sn) .

If any xi’s are 1, then µ(Ai) must be 0 to makeRs dµ < 1. So remove these from the

sum. For the other xi’s their µ(Ai) µ(S) < 1. Because this is a finite sum, we may taken ! 1 and get

Zs✏1Sn dµ !

kX

i=1

(xi � ✏)µ(Ai) .

Thus

lim infn

Zfn dµ �

kX

i=1

(xi � ✏)µ(Ai) =

Zs dµ� ✏µ(S) .

Let ✏ ! 0 to obtain Zs dµ lim inf

n

Zfn dµ .

This shows Fatou.Monotone convergence follows from Fatou. If fn " f and fn(!) � 0 for all ! then by

monotonicity, fn(!) f(!) for all ! and soZ

fn(!) dµ Z

f(!) dµ for all n .

This means in particular that lim supnRfn dµ

Rf dµ. Now use Fatou for the other

direction: Zf dµ =

Zlim inf

nfn dµ lim inf

n

Zfn dµ .

These two imply that limn

Rfn dµ =

Rf dµ (and the convergence in monotone).

For linearity, let f, g be nonnegative and measurable and let a, b � 0. From our previoustheorem (two lectures ago) we can find sequences (sn) and (tn) of simple measurable functionssuch that sn " f and tn " g point wise. Then asn + btn " af + bg. Now use linearity of theintegral for simple functions:Z

(af+bg) dµ = limn

Z(asn+btn) dµ = a lim

n

Zsn dµ+b lim

n

Ztn dµ = a

Zf dµ+b

Zg dµ .

The previous properties involved statements that hold for all !. Now we consider oneswhich old almost everywhere (a.e.).

Proposition 0.1. Let f, g be nonnegative and measurable.

1.Rf dµ = 0 if and only if f = 0 a.e.

2. IfRf dµ < 1 then f < 1 a.e.

3. If f g a.e. thenRf dµ

Rg dµ.

4. If f = g a.e. thenRf dµ =

Rg dµ.

Proof. First assume that f = 0 a.e. Then for any simple function s =Pn

i=1 xi1Ai with s fthen the following holds. If Ai intersects {! : f(!) = 0} then xi = 0, and so xiµ(Ai) = 0.Otherwise, if Ai does not intersect this set, it must be in the complement and so has measurezero. This means for such Ai, we also have xiµ(Ai) = 0 and therefore

Rs dµ = 0. Taking

supremum over such s, we getRf dµ = 0.

IfRf dµ = 0 then assume that A := {! : f(!) > 0} has positive measure. Defining

An = {! : f(!) � 1/n}, we see A = [nAn, so

0 < µ(A) X

n

µ(An) ,

so some An has positive measure. For this n, consider the function f1An . This function isbounded below by the simple function s = (1/n)1An and

Zf dµ �

Zf1An dµ �

Zs dµ = (1/n)µ(An) > 0 .

This is a contradiction and gives µ(A) = 0, or f = 0 a.e.If A := {! : f = 1} has µ(A) > 0 then define the simple function 11A. This has

infinite integral, and since it is dominated by f point wise,Rf dµ = 1. This shows 2.

2

For 3, choose any simple function s such that s f and note that if we set A = {! :f(!) g(!)} then

Rs1Ac dµ =

Rs dµ. Furthermore s1A g and so

Zs dµ =

Zs1A dµ

Zg dµ .

Taking supremum over s gives 3. 4 follows by interchanging f and g.

1 Integration of measurable functions

1.1 Basic properties

Recall that if f is measurable we split it into

f+ = max{0, f} and f� = (�f)+

and define Zf dµ =

Zf+ dµ�

Zf� dµ .

Note that

f integrable , max

⇢Zf+ dµ,

Zf� dµ

�< 1

, |f | integrable .

Also:

• If g is integrable and |f | |g| a.e. then f is integrable.

• If f is bounded a.e. then whenever µ(⌦) < 1, f is integrable.

Properties from the nonnegative case carry over:

Proposition 1.1. Let f, g be integrable and a, b 2 R.

1. (Monotonicity) If f g a.e. thenRf dµ

Rg dµ.

2. (Linearity) af + bg is integrable and

Z(af + bg) dµ = a

Zf dµ+ b

Zg dµ .

3. |Rf dµ|

R|f | dµ.

3

Proof. If f g a.e. then f+ g+ and f� � g� a.e. This implies 1. For 2,Z

|af + bg| dµ |a|Z

|f | dµ+ |b|Z

|g| dµ

so |af + bg| is integrable. To show the equality of 2, first take a = b = 1. Then

(f + g)+ � (f + g)� = f + g = f+ � f� + g+ � g� ,

so(f + g)+ + f� + g� = (f + g)� + f+ + g+ .

All of these functions are nonnegative so we can integrate to getZ

(f + g)+ dµ+

Zf� dµ+

Zg� dµ =

Z(f + g)� dµ+

Zf+ dµ+

Zg+ dµ

or Z(f + g)+ dµ�

Z(f + g)� dµ =

Zf+ dµ�

Zf� dµ+

Zg+ dµ�

Zg� dµ

and this implies additivity. To deduce linearity we need only show thatRaf dµ = a

Rf dµ

and this follows directly from the definition (considering a � 0 and a < 0 separately).Last, both f |f | and �f |f |, so

Zf dµ

Z|f | dµ and �

Zf dµ =

Z�f dµ

Z|f | dµ

and this implies 3.

1.2 Integration and limits

Both Fatou and the monotone convergence theorem hold, since they were stated for nonneg-ative functions. However,

• they can be strengthened to sequences (fn) such that fn � 0 a.e. and

• we can strengthen monotone convergence to the assumption fn " f a.e.

The main theorem for general measurable sequences (fn) is:

Theorem 1.2 (Dominated convergence). If |fn| g a.e. and g is integrable with fn ! fa.e. then

1. fn and f are integrable and

2.Rfn dµ !

Rf dµ.

4

Proof. Since |fn| |g| and |g| is integrable, fn is integrable. Also by taking limits, |f | |g|,so f is integrable. Now apply Fatou to the nonnegative functions g � fn and g + fn to get

Z(f + g) dµ lim inf

n

Z(fn + g) dµ = lim inf

n

Zfn dµ+

Zg dµ

and Z(g � f) dµ lim inf

n

Z(g � fn) dµ =

Zg dµ� lim sup

n

Zfn dµ .

Therefore Zf dµ lim inf

n

Zfn dµ lim inf

n

Zfn dµ

Zf dµ .

5

Lecture 23

Remarks about convergence theorems:

• There are various versions of these theorems involving sums. For instance, if fn � 0a.e. then by monotone convergence,

X

n

Zfn dµ =

Z X

n

fn dµ .

• If (An) is a sequence of disjoint sets in ⌃ and f is integrable, then by dominatedconvergence, Z

[nAn

f dµ =X

n

Z

An

f dµ .

Here we defineRA f dµ =

Rf1A dµ.

1 Uniform integrability

To finish discussing convergence theorems, we discuss uniform integrability. The idea is thatwe assume that on some measure space (⌦,⌃, µ), fn ! f a.e. and try to deduce as usual thatRfn dµ !

Rf dµ. Can we find necessary and su�cient conditions on (fn) for the integrals

to converge? We have seen su�cient ones, in the dominated and monotone convergencetheorems.

Definition 1.1. The sequence (fn) is called uniformly integrable if given ✏ > 0 there existsM such that Z

|fn|1{|fn|�M} dµ ✏ for all n .

In words, (fn) is uniformly integrable if we can make all integralsR|fn| dµ small (uni-

formly in n) by integrating on the sets {|fn| � M} only. One way to think of this is in termsof truncation. If we decide to truncate the all functions |fn| at M (replacing their value by0 or ±M for instance) then this has a small e↵ect on the total integrals. Considering thedominated convergence theorem, we would expect uniformly integrability to be useful whenthe constant function M is integrable; that is, when µ(⌦) < 1.

One way to rewrite uniform integrability is

limM!1

supn

Z

{|fn|�M}|fn| dµ = 0 .

This condition gets us very close to necessary and su�cient conditions:

Theorem 1.2. Suppose µ(⌦) < 1 and that (fn), f are integrable functions such that fn ! fa.e. The following are equivalent.

1. (fn) is uniformly integrable.

2.R|f � fn| dµ ! 0.

3.R|fn| dµ !

R|f | dµ.

Proof. Assume first that (fn) is uniformly integrable. For any M , define the truncation

f (M)n = min{|fn|,M}sign(fn) .

(This is just fn, replaced by M when it is � M and by �M when it is �M .) Defining

f (M) analogously for f , we have f (M)n ! f (M) a.e. Therefore by the bounded convergence

theorem, Zf (M)n dµ !

Zf (M) dµ .

This is not yet enough to conclude convergence of the integrals, but it does imply by thetriangle inequality

lim supn

Z|fn � f | dµ lim sup

n

Z|fn � f (M)

n | dµ+

Z|f � f (M)| dµ .

Now |fn � f (M)n | |fn|1{|fn|>M} and similarly for f . Therefore

lim supn

Z|fn � f | dµ sup

n

Z

{|fn|>M}|fn| dµ+

Z

{|f |>M}|f | dµ .

By uniform integrability, the first term goes to 0 as M ! 1. For the second, use dominatedconvergence. Because

R|f | dµ < 1 and both

|f |1{|f |>M} |f | and |f |1{|f |>M} ! 0

as M ! 1, the second term converges to 0, implying 2.To see 2 implies 3, use the triangle inequality:

��Z

|fn| dµ�Z

|f | dµ��

Z||fn|� |f || dµ

Z|fn � f | dµ ! 0 .

Last if 3 holds, define f (M)n and f (M) a bit di↵erently: set f (M)

n = fn1{|fn|<M} and f (M) =

f1{|f |<M}. Then as long as µ({|f | = M}) = 0, we still have f (M)n ! f (M) a.e. (check this!)

So write as beforeZ(|fn|� |f |) dµ =

Z(|fn|� |f (M)

n |) dµ+

Z(|f |� |f (M)|) dµ+

Z(|f (M)

n |� |f (M)|) dµ .

Taking n ! 1 and using assumption 3 and bounded convergence, we find

limn

Z(|fn|� |f (M)

n |) dµ =

Z(|f (M)|� |f |) dµ .

2

As M ! 1, though, the right side converges to 0, so given ✏ > 0 we can choose M suchthat

lim supn

��Z

(|fn|� |f (M)n |) dµ

�� < ✏ .

Therefore |R(|fn|� |f (M)

n |) dµ| < ✏ for all but finitely many n, and using integrability of thefn’s, we may further increase M to make these terms less than ✏ for all n:

supn

Z(|fn|� |f (M)

n |) dµ = 0 ,

If µ({|f | = M}) = 0 then this equals

supn

Z

{|fn|�M}|fn| dµ = 0 . (1)

Now µ({|f | = M}) can only be nonzero at countably many M (or else µ could not be finite;this is even true more generally – see Theorem 10.2(iv) in Billingsley). So taking Mk asequence to infinity such that µ({|f | = Mk}) = 0 for all n, and using monotonicity of theleft side of (1) in M , we get

limM!1

supn

Z

{|fn|�M}|fn| dµ = 0 ,

proving 1.

Below is a consequence that is used very often in probability.

Corollary 1.3. If fn ! f a.e. with µ(⌦) < 1 and

supn

Z|fn|1+✏ dµ < 1

for some ✏ > 0 thenRfn dµ !

Rf dµ.

Proof. Each fn is integrable so we must show that f is and that (fn) is uniformly integrable.For f , use Fatou:

Zf dµ lim inf

n

Z|fn| dµ sup

n

Zmax{|fn|1+✏, 1} dµ

µ(⌦) + supn

Z|fn|1+✏ dµ < 1 .

To show uniform integrability, let M > 0 and estimate:Z

{|fn|�M}|fn| dµ 1

M ✏

Z

{|fn|�M}|fn|1+✏ dµ C

M ✏.

Taking M ! 1 finishes the proof.

3

2 Densities

A standard way to build a measure from another one is to introduce a density function. Letf be nonnegative and measurable and define a function ⌫ : ⌃ ! R by

⌫(A) =

Z

A

f dµ :=

Zf1A dµ .

Then ⌫ is also a measure and f is called the density of ⌫ relative to µ.

Properties

1. ⌫ as defined above is a measure: ⌫(A) � 0 for all A with ⌫(;) = 0. Furthermorecountable additivity follows from the last point in the previous section. (Actuallyf does not even need to be integrable; if it is not then we will have sets of infinitemeasure.)

2. If µ(A) = 0 then ⌫(A) = 0. This is phrased as “⌫ is absolutely continuous relative toµ.”

4

Lecture 24

Back to densities:

• Two measures with the same density on a sigma-finite space are equal. To see this,suppose that f, g are nonnegative measurable functions such that

Z

A

f dµ =

Z

A

g dµ for all A 2 ⌃ .

If f = g a.e. does not hold, then either f > g or g > f on a set of positive measure.Let us assume the first and define A = {! : f(!) > g(!)}. Setting

AN = {! : 1/N f(!)� g(!), g(!) N} ,

we have µ(A) P

N µ(AN), so for some N , µ(AN) > 0.

By sigma-finiteness let (Bn) be a sequence of sets with µ(Bn) < 1 and ⌦ = [nBn.Define Cn = AN \ Bn, so that AN = [nCn. Now

0 < (1/N)µ(AN) Z

AN

(f � g) dµ X

n

Z

Cn

(f � g) dµ .

HoweverRCn

g dµ µ(Bn)N , so the above equals

X

n

Z

Cn

f dµ�Z

Cn

g dµ

�= 0 ,

a contradiction.

• However (as pointed out by Wai Kit from class), we cannot drop sigma-finiteness. Take⌦ as any nonempty set and ⌃ = {;,⌦} with µ(⌦) = 1 and µ(;) = 0. Then for f ⌘ 1and g ⌘ 2, we have Z

A

f dµ =

Z

A

g dµ

for all A 2 ⌃, but f 6= g a.e.

Integrating relative to the measure ⌫ is done naturally:

Theorem 0.1. If g is nonnegative and measurable,

Zg d⌫ =

Zfg dµ . (1)

For measurable g, g is integrable relative to ⌫ if and only if fg is integrable relative to µ, inwhich case the above formula holds.

Proof. The strategy of the proof is used in many proofs involving integration. We firstprove it for simple functions, then nonnegative measurable ones (by taking limits), and thengeneral measurable ones. So first let g = 1A for some A 2 ⌃. Then (1) becomes

Z1A d⌫ =

Zf1A dµ ,

or ⌫(A) =Rf1A dµ, which holds by the definition of ⌫. The above formula is linear in g,

so because it holds for indicators, it holds for simple functions. Now if g is nonnegative andmeasurable, take sn " g to be simple and use the monotone convergence theorem:

Zfg dµ = lim

n

Zfsn dµ = lim

n

Zsn d⌫ =

Zg d⌫ .

Last, for measurable g, the formula applies to |g| and this shows that g is integrablerelative to ⌫ if and only if fg is integrable relative to µ. In this case, we can split intopositive and negative parts and use linearity to conclude (1).

As an example, we can choose µ to be Lebesgue measure and f any nonnegative mea-surable function with

Rf dµ = 1. Then

⌫(A) =

Z

A

f dµ

gives a probability measure on R. Some standard examples are

1. The exponential distribution: for ↵ > 0,

f(x) =

(0 if x < 0

↵e�↵x if x � 0.

One can check that this definition coincides with the previous one: that the distributionfunction F (x) for the probability measure with density f above satisfies

F (x) =

(0 if x < 0

1� e�↵x if x � 0.

2. Uniform distribution:f(x) = 1[0,1](x) .

3. Gaussian distribution: for � > 0,

f(x) =1

�p2⇡

exp

✓� x2

2�2

◆.

We will see later that this actually has total integral 1.

2

0.1 Change of variable

A closely related concept is that of change of variable. One perspective is simply the “u-substitution” from calculus. We will take a di↵erent one and consider the push forward.That is, take (⌦1,⌃1, µ) and (⌦2,⌃2), with T : ⌦1 ! ⌦2 measurable. We have seen (a weekago) that the formula

µT�1(A) := µ ({! : T (!) 2 A})defines a measure on (⌦2,⌃2). We can ask how we integrate functions relative to it. (Wecan think here that T changes the variable ! to T (!).)

Theorem 0.2. If f is nonnegative,

Zf � T dµ =

Zfd(µT�1) .

If f is measurable, it is integrable relative to µT�1if and only if f � T is integrable relative

to µ, in which case the above formula holds.

Proof. The proof is nearly identical to the last one: for f = 1A with A 2 ⌃2, it reduces to

µ({! : T (!) 2 A}) = µT�1(A) ,

which is the definition. The general case is done by monotone convergence and then splittinginto positive and negative parts.

One simple consequence of this is the case T = X, a random variable. Then for eachBorel measurable f : R ! R,

Zf(X(!)) dP =

Zf(x) dµ .

Here we have taken X : ⌦ ! R as the random variable, with the probability space (⌦,⌃,P)and pushfoward PX�1 written as µ.

1 Product measure and Fubini

Apart from push forward (change of variable) and densities, there is another natural wayto build measures from other ones. Given two measures µ1 and µ2 we seek a measure onthe product space µ such that µ(A ⇥ B) = µ1(A)µ2(B) for measurable sets A,B. Such afactorization property is similar to independence that we saw before and can be used to buildindependent random variables. It also gives us another natural construction of k-dimensionalLebesgue measure.

Definition 1.1. If (⌦1,⌃1) and (⌦2,⌃2) are measurable spaces then we define the product

(⌦1 ⇥ ⌦2,⌃1 ⇥ ⌃2) as the space ⌦1 ⇥ ⌦2 with product sigma algebra:

⌃1 ⇥ ⌃2 := � ({A⇥ B : A 2 ⌃1, B 2 ⌃2}) .

3

Remarks:

• Note that ⌃1⇥⌃2 is NOT simply a cartesian product. It is the sigma algebra generatedby all products (these products A⇥ B are called measurable rectangles).

• The set [0, 2]2 \ [0, 1]2 in R2 is not of the form A ⇥ B for A,B ⇢ R, but it is in theproduct sigma-algebra for R⇥R. Therefore ⌃1 ⇥⌃2 is generally much larger than thecollection of sets of the form A⇥ B.

4

Lecture 25

The precursor to Fubini concerns slices of sets in ⌃1 ⇥ ⌃2.

Proposition 0.1. The following statements hold.

1. Let E 2 ⌃1 ⇥ ⌃2 and for x 2 ⌦1 define the slice

Sx(E) = {y 2 ⌃2 : (x, y) 2 E} .

Then Sx(E) 2 ⌃2.

2. Let f : ⌦1 ⇥ ⌦2 ! R be (Borel) measurable. For x 2 ⌦1, the function fx : ⌦2 ! Rgiven by fx(y) = f(x, y) is measurable.

Proof. Beginning with 1, given x 2 ⌦1, define T : ⌦2 ! ⌦1 ⇥ ⌦2 by T (y) = (x, y). IfE = A⇥ B for A 2 ⌃1 and B 2 ⌃2,

T�1(E) =

(B if x 2 A

; if x /2 A.

This means that T�1(E) 2 ⌃2. Since such E generate the product sigma-algebra, this impliesthat T is measurable. Therefore if E 2 ⌃1 ⇥ ⌃2, then T�1(E) 2 ⌃2; in other words,

{y 2 ⌃2 : T (y) 2 E} 2 ⌃2 .

This is Sx(E), so we have proved 1.For 2, simply write fx = f � Tx. This is a composition of measurable functions and so is

measurable.

Now we can define product measure.

Definition 0.2. If (⌦1,⌃1, µ1) and (⌦2,⌃2, µ2) are measure spaces we define for E 2 ⌃1⇥⌃2

(µ1 ⇥ µ2) (E) =

Zµ2(Sx(E)) dµ1(x) .

• Note that for the above definition to make sense, we need to know that x 7! µ2(Sx(E))is a ⌃1-measurable function. All we have shown so far is that Sx(E) 2 ⌃2. To see whyit is true, we first consider E = A⇥ B, so that

µ2(Sx(E)) =

(µ2(B) if x 2 A

0 if x /2 A.

This is just µ2(B)1A(x), a measurable function of x. The next step is to show that theset of E such that µ2(Sx(E)) is ⌃1-measurable is a �-system. Then we show that itcontains the ⇡-system generated by measurable rectangles. This holds simply becausea finite intersection of measurable rectangles is also a measurable rectangles. By the⇡-� theorem, x 7! µ2(Sx(E)) is measurable.

• µ1 ⇥ µ2 is in fact a measure. It assigns 0 to ;, takes values in [0,1] and if E1, . . . arepairwise disjoint, Sx(E1), . . . are pairwise disjoint, so by countable additivity of µ2 andlinearity of the integral relative to µ1, (µ1 ⇥ µ2)([nEn) =

Pn(µ1 ⇥ µ2)(En).

• For E = A⇥ B, we saw above that µ2(Sx(E)) = µ2(B)1A(x), so

(µ1 ⇥ µ2)(A⇥ B) = µ1(A)µ2(B) .

• We define similarly

(µ2 ⇥ µ1) (E) =

Zµ1(Sy(E)) dµ2(y) .

As above, this measure equals µ1(A)µ2(B) for E of the form A⇥B. So it equals µ1⇥µ2

on measurable rectangles and so on the ⇡-system generated by measurable rectangles.Since the set of elements on which two measures are equal is a �-system, the ⇡-�theorem says these measures are equal. Recall, however, that for this argument to gothrough, we need to know that the measures µ1⇥µ2 and µ2⇥µ1 are sigma-finite. (Thisis referring to the uniqueness part of the Caratheodory extension theorem – we actuallyused the monotone class theorem in place of the ⇡-� theorem, but sigma-finiteness isstill needed.)

The last item proves:

Theorem 0.3. If E 2 ⌃1 ⇥ ⌃2 and µ1 and µ2 are sigma-finite measures, then

Z Z1E(x, y) dµ1(x)

�dµ2(y) =

Z Z1E(x, y) dµ2(y)

�dµ1(x) ,

and the common value, denoted (µ1 ⇥ µ2)(E), gives a sigma-finite measure on the productspace.

Proof. The last item says (µ1 ⇥ µ2)(E) = (µ2 ⇥ µ1)(E) if the measures µ1 ⇥ µ2 and µ2 ⇥ µ1

are sigma-finite. This implies the theorem because, for example,

µ1(Ey) =

Z1E(x, y) dµ1(x) .

So we need to only check sigma-finiteness. For that, take (An) and (Bn) sequences in ⌃1 and⌃2 with [nAn = ⌦1 and [nBn = ⌦2. Then [m,n(Am ⇥Bn) = ⌦1 ⇥⌦2 and each of these setshas finite measure.

Remark. This construction can continue to higher orders (the product of any finite numberof spaces). In the case of Lebesgue measure on Rn, we find that it is equal to � ⇥ · · · ⇥ �,the product of n 1-dimensional Lebesgue measures.

2

0.1 Fubini

Fubini is various extensions of Theorem 0.3 to functions other than indicators.

Theorem 0.4 (Tonelli). If (⌦1,⌃1, µ1) and (⌦2,⌃2, µ2) are sigma-finite measure spaces andf : ⌦1 ⇥ ⌦2 ! R is nonnegative and measurable, then

Zf d(µ1 ⇥ µ2) =

Z Zf(x, y) dµ1(x)

�dµ2(y) (1)

=

Z Zf(x, y) dµ2(y)

�dµ1(x) . (2)

Proof. Theorem 0.3 says that this holds when f = 1E for E 2 ⌃1 ⇥ ⌃2. For general f � 0,we first need to show that

x 7!Z

f(x, y) dµ2(y) and y 7!Z

f(x, y) dµ1(x)

are ⌃1 and ⌃2-measurable. It holds for indicator functions and by additivity, for simplefunctions. Then passing to monotone limits (using the monotone convergence theorem), weget it for nonnegative f .

The same works to justify the theorem. We use linearity to deduce it for simple functionsand take monotone limits to obtain it for general nonnegative functions.

The general case is called Fubini’s theorem.

Theorem 0.5 (Fubini). With measure spaces as in the last theorem, let f : ⌦1 ⇥⌦2 ! R bemeasurable. If f is integrable then (1) and (2) hold.

3

Lecture 26

1 Integration by parts

Suppose that F,G are right-continuous, nondecreasing functions from R to R. Then we canbuild Borel measures µF and µG associated to them. For this section, if f is a measurablefunction then write

Rf dF for the integral

Rf dµF (and similarly for G).

We can now build the product measure as in the last section: for E ⇢ R2 Borel,

(µF ⇥ µG)(E) =

ZµG(Sx(E)) dµF (x) .

In particular, for A,B Borel subsets of R,

(µF ⇥ µG)(A⇥ B) = µF (A)µG(B) ,

so for a < b,

(µF ⇥ µG)((a, b]2) = µF ((a, b])µG((a, b]) = (F (b)� F (a))(G(b)�G(a)) . (1)

On the other hand, we can compute this value a di↵erent way, using iterated integrals, as inthe proof of the following result.

Theorem 1.1 (Integration by parts). If a < b thenZ

(a,b]

G(x) dF (x) = F (b)G(b)� F (a)G(a)�Z

(a,b]

F (y) dG(y) + (µF ⇥ µG)(Rd) ,

where Rd = {(x, x) : a < x b} is the diagonal.

Proof. Split the rectangle R = (a, b]2 into three pieces:

R+ = {(x, y) : x < y} \R

R� = {(x, y) : x > y} \R

Rd = {(x, y) : x = y} \R .

The iterated integral over R+ can be computed

(µF ⇥ µG)(R+) =

Zµ2(Sx(R+)) dF (x) =

Z

(a,b]

(G(b)�G(x)) dF (x)

= G(b)(F (b)� F (a))�Z

(a,b]

G(x) dF (x)

and over R� as

(µF ⇥ µG)(R�) =

Z

(a,b]

(F (b)� F (y)) dG(y) = F (b)(G(b)�G(a))�Z

(a,b]

F (y) dG(y) .

Adding these together and using (1),

(F (b)� F (a))(G(b)�G(a)) = G(b)(F (b)� F (a)) + F (b)(G(b)�G(a))

�Z

(a,b]

G(x) dF (x)�Z

(a,b]

F (y) dG(y) + (µF ⇥ µG)(Rd) .

Simplifying,Z

(a,b]

G(x) dF (x) = F (b)G(b)� F (a)G(a)�Z

(a,b]

F (y) dG(y) + (µF ⇥ µG)(Rd) .

• If F and G have no common points of discontinuity, then

(µF ⇥ µG) ({(x, x) : x 2 R}) = 0

(you can check this), so we get the standard integration by parts formula:Z b

a

G(x) dF (x) = F (b)G(b)� F (a)G(a)�Z b

a

F (y) dG(y) .

Here the integrals in the theorem over (a, b] can be replaced by [a, b], and this is thenotation above.

• If µF and µG have density functions f, g relative to Lebesgue measure, then by ourprevious results on densities,

Z b

a

G(x) dF (x) =

Z b

a

G(x)f(x) dx and

Z b

a

F (y) dG(y) =

Z b

a

F (y)g(y) dy .

If we set G(x) =R x

a g(y) dy + C then G is a distribution function for a measure withdensity g and similarly for F (y) =

R y

a f(x) dx + C 0. Thus we can view F and G asantiderivatives of f ,g and we recover the traditional integration by parts formula.

2 Random variables

Let (⌦,⌃,P) be a probability space. We can now finally return to our study of probability.

Definition 2.1. A random variable is a function X : ⌦ ! R that is Borel measurable. Arandom vector is a function X : ⌦ ! Rn with n � 2 that is Borel measurable.

If X is a random vector, we can write it as

X(!) = (X1(!), . . . , Xn(!)) .

You can check that X is measurable to Rn if and only if each Xi is measurable to R.If X1, . . . , Xn are random variables, we define as before �(X1, . . . , Xn) as the smallest

sigma-algebra relative to which each Xi is measurable. The counterpart to our earlier resulton this sigma-algebra is:

2

Proposition 2.2. If X = (X1, . . . , Xn) is a random vector then

1. �(X) = {X�1(A) : A ⇢ Rn Borel} and

2. Y : ⌦ ! R is measurable relative to �(X) if and only if there is a Borel measurablef : Rn ! R such that Y = f �X.

Proof. As before, {X�1(A) : A ⇢ Rn Borel} is a sigma-algebra. Also the preimage of eachBorel set in Rn under X is in this collection, so X is measurable relative to it, and thus �(X)is contained in it. On the other hand, if ⌃0 is some other sigma-algebra relative to whichX is measurable, it must contain the pre image of each Borel subset of Rn under X, so itcontains {X�1(A) : A ⇢ Rn Borel}. This implies the first item.

For the second, if f is Borel measurable, then because X is measurable relative to �(X),the variable f � X is also measurable. Conversely, we must show that if Y is measurablerelative to �(X) then there is a Borel measurable f : Rn ! R such that Y = f �X. Firstassume that Y is simple. Then we can write

Y =NX

i=1

xi1Bi for Bi 2 �(X) pairwise disjoint .

By the first part, there are Borel sets Ai ⇢ Rn such that Bi = {! : X(!) 2 Ai} for all i.Now (as in the earlier proof), define f =

PNi=1 xi1Ai . Then f is Borel measurable and for

! 2 ⌦, either ! 2 Bi for some unique i or ! is in no Bi. In the first case, we have X(!) 2 Ai

and f(X(!)) = xi, while Y (!) = xi. Otherwise if ! is in no Bi, then X(!) = 0 and 0 cannotbe in any Ai, so f �X(!) = 0, while Y (!) = 0. This shows that Y = f �X.

If Y is not simple, but is only measurable relative to �(X) then we can find a simplesequence (sk) such that sk ! Y point wise and sk is measurable relative to �(X) for eachk. For each k we can then find a Borel measurable fk : Rn ! R such that sk = fk �X. IfM = {x 2 Rn : (fk(x)) converges} then M is a Borel set (from previous results) and we candefine f as

f(x) = 1M(x) limk

fk(x) ,

which is a Borel measurable function. For all !, we have

Y (!) = limk

sk(!) = limk

fk �X(!) ,

so X(!) 2 M and we get a limit of f �X(!).

3 Independence

As before, if X1, . . . , Xk are random variables on the same probability space, they are calledindependent if the sigma-algebras �(X1), . . . , �(Xk) are independent.

Proposition 3.1. X1, . . . , Xk are independent if and only if the law of the random vectorX = (X1, . . . , Xk) is the product measure µ = µ1 ⇥ · · ·⇥ µk, where µi is the law of Xi.

3

Proof. The law of X is defined as the push forward PX�1 on Rk. That is, for each BorelA ⇢ Rk, we have PX�1(A) = P(X 2 A). For a measurable rectangle of the form A1⇥· · ·⇥Ak

in Rk, independence of the Xi’s gives

P(X 2 A1 ⇥ · · ·⇥ Ak) = P(X1 2 A1) · · ·P(Xk 2 Ak) = µ1(A1) · · ·µk(Ak) .

Therefore the law of X agrees with the product measure on rectangles. As usual, since therectangles form a ⇡-system and we are on a sigma-finite space (for the product measure),they are equal.

Conversely, if the law of X is product measure, the factorization formula

P(X1 2 A1, . . . , Xk 2 Ak) = P(X1 2 A1) · · ·P(Xk 2 Ak)

holds for all A1, . . . , Ak Borel sets in R. Because each {Xi 2 Ai} is an arbitrary set in �(Xi),this shows independence.

4

Lecture 27

We ended last time with discussing independence: a random vector X = (X1, . . . , Xn) hasindependent entries if and only if its law is a product measure µ1 ⇥ · · ·⇥ µn, where µi is thelaw of Xi.

• There are other equivalent conditions you can check (or read in the book in Section 20).For instance, X1, . . . , Xk are independent if and only if

P(X1 x1, . . . , Xk xk) = P(X1 x1) · · ·P(Xk xk)

for all choices of x1, . . . , xk. This is equivalent to saying that the multidimensionaldistribution function factors.

• Another equivalent condition to independence is the following. If Xi has density firelative to Lebesgue measure, in the sense that

PX�1i (A) = P(Xi 2 A) =

Z

A

fi(x) dx

for all Borel A ⇢ R, then the density for X relative to k-dimensional Lebesgue measureis f(x1, . . . , xk) =

Qki=1 fi(xi). That is, for all B ⇢ Rk Borel, we have

P(X 2 B) =

Z

B

f1(x1) · · · fk(xk) dx1 · · · dxk .

Example. Gaussian. Recall that the (centered) Gaussian distribution on R has densityfunction

f(x) =1p2⇡�

e�x2

2�2 .

The standard Normal, or standard Gaussian is given by � = 1.By Fubini and what we know from the Riemann integral (which, it is a theorem, equals

the Lebesgue integral for functions which are both Riemann and Lebesgue integrable – thisis true for instance for continuous functions), we can now show that

Rf(x) dx = 1. Define

I to be the numberRe�x2/2 dx and compute using Fubini:

I2 =

Ze�x2/2 dx

Ze�y2/2 dy =

Z Ze�(x2+y2)/2 dx dy .

We can then change variables to polar, using dxdy = rdrd✓ to get

Z 2⇡

0

Z 1

0

re�r2/2 dr d✓ = 2⇡

Z 1

0

re�r2/2 dr = 2⇡ .

ThereforeI =

pI2 =

p2⇡ ,

or (2⇡)�1/2Re�x2/2 dx = 1. Using a change of variables, we get

Rf(x) dx = 1, where f(x)

is given above.We say that an n-dimensional vector X = (X1, . . . , Xn) is an n-dimensional (standard)

Gaussian if the variables Xi are i.i.d. standard Normal. One way to say this is that themarginal distributions of the Xi’s are standard Normal. Generally if X = (X1, . . . , Xk) thenany distribution of an Xi is called a marginal distribution. In this case we can compute thedensity function: for x 2 Rn,

f(x) = f(x1, . . . , xn) =1p2⇡

e�x21/2 · · · 1p

2⇡e�x2

n/2 =1

(2⇡)n/2e�(x2

1+···+x2n)/2

=1

(2⇡)n/2e�kxk2/2 .

The amazing thing is that an n-dimensional standard Normal distribution is rotationallysymmetric: its density only depends on kxk.

1 Convolution

If X and Y are independent random vectors with distributions µX and µY in Rk and Rj

respectively, we can use Fubini to find for B Borel in Rj+k,

P((X, Y ) 2 B) = (µX ⇥ µY )(B) =

ZµX(Sy(B)) dµY (y) ,

where Sy(B) is the k-dimensional slice {x 2 Rk : (x, y) 2 B}. The integrand is

P(X 2 Sy(B)) = P((X, y) 2 B) .

So we get

P((X, Y ) 2 B) =

ZP((X, y) 2 B) dµY (y) .

We can think of this as a conditional probability formula (we are conditioning on di↵erentvalues of Y and integrating over the possibilities) and this will be made precise later.

Proposition 1.1. If X and Y are independent random variables, then the distribution ofX + Y is the convolution

P(X + Y 2 B) =

ZP(Y 2 B � x) dµX(x) ,

where µX is the distribution of X.

Proof. We can write down

P(X + Y 2 B) = P((X, Y ) 2 B) ,

2

where B = {(x, y) : x+ y 2 B}. By the remarks before the proposition,

P((X, Y ) 2 B) =

ZP((x, Y ) 2 B) dµX(x) .

HoweverP((x, Y ) 2 B) = P(Y 2 B � x) .

• If X and Y have density functions f and g (that is, the distributions µX and µY havedensities f and g relative to Lebesgue measure) then

P(X + Y z) = P(X + Y 2 (�1, z]) =

ZP(Y 2 (�1, z � x]) dµX(x)

=

Z Z z�x

�1g(y) dy f(x) dx =

Z Z z

�1g(y � x)f(x) dy dx

=

Z z

�1

Zg(y � x)f(x) dx

�dy .

Therefore X+Y has density h(y) =Rg(y�x)f(x) dx =: (f ⇤g)(y) relative to Lebesgue

measure.

• Because of the last example, we can define, for two probability measures µ1 and µ2 onR, their convolution µ1 ⇤ µ2 by

(µ1 ⇤ µ2)(B) =

Zµ2(B � x) dµ1(x) .

To see this is a probability measure, let X be a random variable on some probabilityspace with distribution µ1 and Y one (on another space possibly) with distribution µ2.Now on R2 we may build the product measure µ1⇥µ2 and consider the distribution ofthe random variable (x, y) 7! x + y. (Note this is a Borel measurable function so is arandom variable on (R2,⌃, µ1 ⇥ µ2), where ⌃ is the product Borel sigma-algebra.) Itfollows that the distribution of x+ y is µ1 ⇤ µ2 and so this is a probability measure.

IfX1 andX2 are independent and have exponential distribution with the same parameter,then their density functions are f(x) = ↵e�↵x1[0,1)(x). We can find the density of their sumby

h(x) =

Zf(y)f(x� y) dy = ↵2

Ze�↵y1{y�0}e

�↵(x�y)1{x�y�0} dy =

Z x

0

e�↵x dy

and this equals ↵2xe�↵x for x � 0 and 0 for x < 0.For three terms we can similarly integrate

Z↵2ye�↵y1{y�0}↵e

�↵(x�y)1{x�y�0} dy = ↵3e�↵x

Z x

0

y dy = ↵e�↵x (↵x)2

2.

3

One can argue by induction to show that the density function for X1 + · · · +Xk, where Xi

are i.i.d. with exponential(↵) distribution is

↵e�↵x (↵x)k�1

(k � 1)!1{x�0} .

4

Lecture 28

1 Existence of independent sequences

In the beginning of the course, we described an argument that proves the existence of i.i.d.

sequences (X1, X2, . . .) such that P(X1 = 1) = 1/2 = P(X1 = 0). (That is, (Xn) is a sequence

of i.i.d. Bernoulli(1/2) variables.) The idea was to consider the probability space (0, 1] withLebesgue measure and define Xn(!) inductively for ! 2 (0, 1] in such a way that

Xi(!) = 1 if ! 2✓k

2i,k + 1

2i

�for k odd

and 0 otherwise. Using these variables we can prove existence of any sequence of independent

variables.

Theorem 1.1. Let µ1, µ2, . . . be Borel probability measures on R. There exists a probability

space and an independent sequence (X1, X2, . . .) such that Xi has distribution µi for all i � 1.

Proof. The first step is to prove the theorem in the case µi equal to Lebesgue measure on

(0, 1] for all i. To this end, we first let (Xi) be a sequence of i.i.d. Bernoulli(1/2) variablesand arrange them in an array (Xij). If ⌃i is the sigma-algebra generated by the variables in

row i:⌃i = �(Xi1, Xi2, . . .) ,

then the ⌃i’s are independent. The reason is that the ⇡-systems⇧i generated by �(Xi1), �(Xi2), . . .are independent (as i varies) and a theorem from earlier in the course implies that the gen-

erated �-systems are independent, but these are the ⌃i’s.

Now define for a given realization of the array

Yi =

1X

j=1

Xij

2j.

Then Yi is the number whose binary expansion is Xi1Xi2 · · · . Because Yi is ⌃i-measurable,

the Yi’s are i.i.d. We claim that their distributions are Lebesgue. To do this, let x 2 [0, 1]

and let Y (n)i be the partial sum

Pnj=1

Xij

2j . Then Y (n)i is a number whose binary expansion is

just n digits and each expansion has probability 2�n

. The number of these that are x is

bx2nc. Therefore

P(Y (n)i x) =

bx2nc2n

! x as n ! 1 .

However Y (n)i " Yi a.s. so P(Y (n)

i x) # P(Y (n)i x) and

P(Yi x) = limn

P(Y (n)i x) = x .

In other words, the distribution of Yi is Lebesgue.

Now we transform the variables (Yi) as we did earlier in the course. Given the measure

µi, let Fi be its distribution function and define

Zi(!) = inf{u 2 R : F (u) � Yi} .

Exactly as before, since Yi has Lebesgue distribution, Zi has distribution µi. As the Zi’s are

independent, we are done.

2 Expected values

If (⌦,⌃,P) is a probability space and X is an integrable random variable, then we define

EX =

ZX dP .

If f : R ! R is measurable then as long as f(X) is integrable,

Ef(X) =

Zf(x) dµX ,

where µX is the distribution of X.

2.1 Moments

For k � 0 the k-th moment of X is EXkand the k-th absolute moment of X is E|X|k.

Definition 2.1. We say that X has k moments if E|X|k < 1.

Unlike in the case that X is simple, these integrals may not exist. Note that if j � kthen

E|X|k E�max{|X|j, 1}

� 1 + E|X|j .

So if X has k moments, it has j moments. Further if X has k moments, we can define the

k-th central moment

E|X � EX|k

and this is finite. The case k = 2 is again defined as the variance of X.

To give a better bound between moments, let us recall the integral inequalities.

Lemma 2.2. Let X be a random variable.

1. (Markov inequality) For a > 0,

P(|X| � a) 1

aE|X| .

2

2. (Chebyshev inequality) Assume that EX2 < 1. Then

P(|X � EX| � a) 1

a2Var X .

3. (Jensen inequality) If � : I ! R is convex and I ⇢ R is an interval containing

Range(X), then

E�(X) �(EX)

whenever both sides exist.

4. (Holder inequality) If1p +

1q = 1 and p, q > 1 then

E|XY | kXkpkY kq ,

where, for example, kXkp = (E|X|p)1/p.Proof. The proofs are almost the same as in the simple case. For Markov,

E|X| � E|X|1{|X|�a} � aP(|X| � a) .

Chebychev is proved by

Var X = E(X � EX)2 � a2P(|X � EX| � a) .

To prove Jensen, note that EX 2 I so we can find an a�ne L(x) + b such that L(EX) + b =�(EX) and L(x) + b �(x) for all x. Then

E�(X) � E [L(X) + b] = L(EX) + b = �(EX) .

Last, to prove Holder we can just take sequences sn and tn of simple functions that increase

point wise to |X| and |Y |. Then by the monotone convergence theorem,

E|XY | = limn

Esntn limn

ksnkpktnkq = kXkpkY kq .

• So assume that X has j moments and k j. Then

kXkk =�E((|X|j)k/j)

�1/k �E|X|j

�1/j= kXkj ,

by applying Jensen to the concave function x 7! xk/j. (Here the inequality of Jensen

reverses because the function is concave.)

• For nonnegative X we can compute the moments using Fubini:

EXk=

ZXk

dP =

Z Z X

0

kxk�1dx dP =

Z 1

0

kxk�1

Z1{0xX} dP dx

=

Z 1

0

kxk�1P(X � x) dx .

Taking k = 1 gives

EX =

Z 1

0

P(X � x) dx .

3

Lecture 29

0.1 Moment generating functions

As in the simple case we define for X a random variable

M(t) = EetX

as the moment generating function. Because X need not be bounded, there is no reason whyM(t) < 1.

• We always have M(0) = 1. Furthermore the values of t such that M(t) < 1 is aninterval containing 0 (may just be {0}).

Proof. Suppose M(t) < 1 for some t > 0. Then if 0 s < t,

EesX1{X�0} EetX1{X�0} < 1 .

Furthermore EesX1{X<0} < 1. A similar proof holds if t < 0.

• If X is nonnegative then M(t) < 1 for all t 0. For instance, if X has exponentialdistribution with parameter ↵ > 0 then

EetX =

Z 1

0

↵e�↵xetx dx = ↵

Z 1

0

e�x(↵�t) dx ,

so

M(t) =

(↵

↵�t if t < ↵

1 if t � ↵.

So here M(t) < 1 for t 2 (�1,↵).

For a Gaussian, we can compute

EetX =1p2⇡�

Z 1

�1etxe�

x2

2�2 dx =1p2⇡�

Z 1

�1exp

✓�x2 � 2�2tx

2�2

◆dx

= exp

✓�2t2

2

◆1p2⇡�

Z 1

�1exp

✓�(x� �2t)2

2�2

◆dx .

Therefore

M(t) = exp

✓�2t2

2

◆

and is finite for t 2 (�1,1).

Last, even if X is nonnegative, its moment generating function may not be definedfor any positive t. Take a distribution with density f(x) = C/(1 + x2)1[0,1), with Cchosen so that the total integral is 1. Then if X has this distribution and t > 0,

EetX = C

Z 1

0

etx

1 + x2dx = 1 .

So M(t) < 1 when t 2 (�1, 0].

• For a nonpositive random variable, M(0) = 1 with M(t) < 1 when t 0 but may beinfinity for t > 0.

• We can even have M(t) < 1 only when t = 0. Take the Cauchy distribution, withdensity f(x) = 1

⇡(1+x2) , which does not even have an expected value.

We can also expand the moment generating function as a power series. For this reason,it is called the moment generating function.

Theorem 0.1. Let X be a random variable such that M(t) = EetX < 1 for t 2 (�t0, t0)for some t0 > 0. Then

M(t) =1X

n=0

EXn tn

n!.

Therefore EXn = M (n)(0) for all n.

Proof. Because et|x| etx + e�tx, we have Eet|X| EetX + Ee�tX < 1 for t 2 (�t0, t0).Fixing any t 2 (0, t0), we have for any k � 0 |x|k et|x| for large x and so

E|X|k < 1 for all k � 0 .

This means X has moments of all orders.Furthermore, we can expand the exponential and use dominated convergence if t 2

(�t0, t0):

EetX = E1X

n=0

Xntn

n!=

1X

n=0

EXn tn

n!,

where we have dominated the sum by the integrable function et|X|.

• A consequence (from the theory of power series) is that if two variables have equalmoment generating functions in some neighborhood of 0 then the variables have allequal moments. We will see later that in certain cases (most cases), this implies thevariables have the same distributions.

• Billingsley derives a result which says (Theorem 22.2) that if X and Y are nonnegativeand have the same moment generating function for all t T0 for some T0 then X andY have the same distribution.

• When M(t) < 1 for t 2 (�t0, t0) an argument similar to what was done in the lasthomework (using dominated convergence) shows that the k-th derivative M (k)(t) =EXketX for k � 0. Taking k = 2, we see that M(t) is convex.

Examples.

2

1. If X is a standard Normal variable then M(t) = et2/2 (here � = 1). We can write this

as a power series as

M(t) = et2/2 =

1X

n=0

(t2/2)n

n!=

1X

n=0

t2n

(2n)!

(2n)!

2nn!=

1X

k=0

aktk

k!,

where

EXk = ak =

(0 if k is odd(2l)!2ll! if k = 2l is even

.

2. For an exponential variable with parameter ↵ > 0 we have seen that M(t) = ↵↵�t . This

can be written as

↵

↵� t=

1

1� t/↵=

1X

n=0

↵�ntn =1X

n=0

n!

↵n

tn

n!.

So we see that

EXk =k!

↵kfor k � 0 .

0.2 Independence

We showed earlier in the course that ifX, Y are independent and simple then EXY = EXEY .For the general case we argue as follows.

Proposition 0.2. If X, Y are independent then

E|XY | = E|X|E|Y | .

Further if X and Y are integrable, then EXY = EXEY .

Proof. We approximate X and Y by simple functions. Explicitly, we define for n � 1

Xn =

8>>><

>>>:

�n if X n�i2n if X 2

��i�12n , �i

2n

⇤and i = 0, . . . , n2n � 1

i2n if X 2

�i2n ,

i+12n

⇤and i = 0, . . . , n2n � 1

n if X > n

and Yn similarly. As we argued several weeks ago Xn(!) # X(!) if X < 0 and Xn(!) " X(!)if X � 0 (and similarly for Y ). Further, Xn is measurable relative to �(X) and Yn ismeasurable relative to �(Y ). Therefore Xn and Yn are independent.

Applying the result for simple variables and using monotone convergence (as |Xn| " |X|and |Yn| " |Y |),

E|XY | = limn

E|Xn||Yn| = limn

E|Xn|E|Yn| = E|X|E|Y | .

3

This holds for all X, Y (without any integrability assumption).IfX and Y are integrable, then E|X|E|Y | < 1 and the above results implies E|XY | < 1.

Now using dominated convergence (as |XnYn| |XY |, we obtain

EXY = limn

EXnYn = limn

EXnEYn = EXEY .

4

Lecture 30

As a consequence of the result on independent variables (factorization), note that if Xand Y are independent and have finite moment generating functions M(t) and N(t) then

Eet(X+Y ) = M(t)N(t) .

Furthermore,Var (X + Y ) = Var X +Var Y ,

just as in the simple case.Another consequence is the weak law which requires only two moments (one can do with

even less, and we will see that soon).

Theorem 0.1 (Weak law of large numbers). If X1, X2, . . . are i.i.d. with EX1 = 0 andEX2

1 < 1 thenX1 + · · ·+Xn

n! 0 in probability .

Proof. It is the same proof as before: for ✏ > 0, writing Sn = X1 + · · ·+Xn,

P(|Sn/n| > ✏) = P(|Sn| > n✏) 1

n2✏2Var Sn =

Var X1

✏21

n! 0 .

1 Sums of independent variables

We want to prove the strong law of large numbers for general variables. It in fact holdswhenever EX1 exists, but it will take a bit of work to prove this. Recall that for simplevariables, we used a fourth moment calculation to prove it; fourth moments may not existand this severely complicates the matter. To study the problem we need some maximalinequalities.

1.1 Maximal inequalities

Let X1, X2, . . . be independent random variables and as usual set Sn = X1 + · · · + Xn forn � 1. Last, define

Mn = max{|S1|, . . . , |Sn|} .

If we think of Sn as a random walk, then Mn is the furthest from 0 that the random walkgets by time n.

Billingsley states that a main point illustrated in the two maximal inequalities below isthat, since the Xi’s are independent, whenever Mn is large, it is likely that Sn is also large.(This is certainly not true for dependent variables, as we can engineer them to cancel eachother out: set X1 to be any variable and define X2 = �X1. Then M2 = X1 = S1 but S2 = 0.)

The first result states that we can apply Chebyshev to the maximum and get exactlythe same bound as if we applied it to Sn. This is nice because it appears we are gettingsomething for free.

Theorem 1.1 (Kolmogorov maximal inequality). Suppose X1, . . . , Xn are independent withEXi = 0 and EX2

i < 1 for all i. Then for a > 0,

P(Mn � a) 1

a2Var Sn .

Proof. If Mn � a then we can decompose this event according to the first time k such that|Sk| � a: define

Ak = {|S1| < a, . . . , |Sk�1| < a, |Sk| � a} .

Then {Mn � a} = [nk=1Ak and

P(Mn � a) =nX

k=1

P(Ak) 1

a2

nX

k=1

ES2k1Ak

.

We want to replace the S2k by S2

n and then we could use disjointness of the Ak’s to finish. Todo this we must harness independence: note that

ESk(Sn � Sk)1Ak= ESk1Ak

E(Sn � Sk) ,

since Sk1Akis a function of X1, . . . , Xk and Sn � Sk is a function of Xk+1, . . . , Xn. However

E(Sn � Sk) = 0 so we can add this to the previous chain of inequalities: P(Mn � a) is atmost

1

a2

nX

k=1

E(S2k + 2Sk(Sn � Sk))1Ak

1

a2

nX

k=1

E(S2k + 2Sk(Sn � Sk) + (Sn � Sk)

2)1Ak

=1

a2

nX

k=1

E(Sk + (Sn � Sk))21Ak

.

We conclude that P(Mn � a) 1a2

Pnk=1 ES2

n1Ak 1

a2ES2n and the last term is the variance.

Corollary 1.2 (One series theorem). If X1, X2, . . . are independent with EXi = 0 thenX

n

Var Xn < 1 )X

n

Xn converges a.s. .

Proof. Let ✏ > 0. Writing Tn = sup{|Sm � Sn| : m � n}, then as n ! 1,

P(|Tn| > ✏) = limm

P( maxnkm

|Sk � Sn| > ✏) limm

1

✏2Var (Sm � Sn) =

1

✏2

1X

k=n

Var Xk ! 0 .

This means that Tn ! 0 in probability.An equivalent way to say that (Sn) is Cauchy is to say that Tn ! 0 as n ! 1. However

note that for n1 < n2,

Tn2 = sup{|Sm � Sn2 | : m � n2} 2Tn1 pointwise .

2

Therefore given ✏ > 0,

P(|Tm| > ✏ for all m � n) P(|Tn| > ✏/2) ! 0

as n ! 1 and this implies Tn ! 0 almost surely. This means (Sn) is Cauchy and converges.

Example. Let X1, X2, . . . be i.i.d. with mean zero and variance 1. Then define the sums

SN =NX

n=1

Xn

np, for p > 0 .

If Yn is the n-th summand, we find

Var Yn = n�2p ,

so by independence, Var SN =PN

n=1 n�2p, which converges if p > 1/2. This is in contrast

to the case that Xn = 1 for all n; this converges if and only if p > 1.

There are times when we do not want to give an upper bound for P(M � a) in termsof Var Sn, but instead in terms of P(Sn � a). For instance, we may be able to get a betterinequality by applying Cherno↵ to P(Sn � a). Also, it may be that the variance is not finite,and so Kolmogorov’s inequality is not the best. For these cases, we have a result which canbe much stronger. (Coming next time.)

3

Lecture 31

We begin with the second maximal theorem.

Theorem 0.1 (Etemadi). If X1, . . . , Xn are independent. For a � 0,

P(Mn � 3a) 3 max1kn

P(|Sk| � a) .

Note that there is no integrability assumption.

Proof. We again decompose based on the first time k such that |Sk| � 3a. As before, let

Ak = {|Sk| � 3a, |S1| < 3a, . . . , |Sk�1| < 3a} .

Then

P(Mn � 3a) P(|Sn| � a) +n�1X

k=1

P(Ak \ {|Sn| < a}) .

Now if |Sn| < a but Ak occurs then |Sk| � 3a. Therefore |Sn�Sk| � 2a. This gives an upper

bound of

P(|Sn| � a) +n�1X

k=1

P(Ak \ {|Sn � Sk| � 2a}) .

The two terms in the probability are independent so we can factor and bound above for

P(|Sn| � a) + max1kn

P(|Sn � Sk| � 2a)n�1X

k=1

P(Ak) .

Replacing the sum by 1 (since the events are disjoint) and noticing that |Sn � Sk| � 2aimplies that |Sn| � a or |Sk| � a, we get the bound

P(|Mn � 3a) P(|Sn| � a) + max1kn

[P(|Sn| � a) + P(|Sk| � a)]

and this is bounded by 3max1kn P(|Sn| � a).

As a corollary we can show that for sums of independent variables, convergence in prob-

ability is equivalent to almost sure convergence. This is certainly not true in general, but is

a manifestation of the power of independence.

Corollary 0.2. If X1, X2, . . . are independent then with Sn = X1 + · · ·+Xn for n � 1,

(Sn) converges in probability , (Sn) converges almost surely .

Proof. If (Sn) converges almost surely it converges in probability, so we must only show the

converse. To do this, assume that (Sn) converges in probability to some random variable X.

To show that (Sn) converges almost surely, we will again show it is almost surely Cauchy.

From last time we defined the sequence (Tn) by

Tn = sup{|Sm � Sn| : m � n} .

We also remarked that (Sn) is almost surely Cauchy if Tn ! 0 in probability. To show this,

let ✏ > 0 and use the assumption that (Sn) converges in probability to find N such that

P(|Sn �X| > ✏/6) < ✏/6 for n � N .

Therefore if m � n � N ,

P(|Sm � Sn| > ✏/3) P(|Sm �X| > ✏/6) + P(|Sn �X| > ✏/6 < ✏/3 ,

or said di↵erently,

P(|Xn+1 + · · ·+Xm| > ✏/3) ✏/3 for m � n � N .

Applying the Etemadi maximal inequality for m � n � N ,

P( maxn+1jm

|Xn+1 + · · ·+Xj| > ✏) 3 maxn+1jm

P(|Xn+1 + · · ·+Xj| > ✏/3)

= 3 maxn+1jm

P(|Sj � Sn| > ✏/3) < ✏ .

Take m ! 1 to obtain

P(Tn � ✏) ✏ for n � N .

In other words, Tn ! 0 in probability and we are done.

1 Strong law of large numbers

Theorem 1.1 (Strong law of large numbers). Let X1, X2, . . . be i.i.d. with EX1 = 0. Then

Sn/n ! 0 almost surely .

Proof. We want to use the two-series theorem, but our sequence (Sn/n) is not easily written

as a sum of independent random variables (it is an average of them instead). So we need a

lemma.

Lemma 1.2. Let (xn) be a sequence of real numbers such thatP

n(xn/n) converges. If wedefine sn = x1 + · · ·+ xn for n � 1 then sn/n ! 0.

2

Proof. Use summation by parts: set yn =Pn

i=1(xi/i) with y0 = 0 and compute

sn =

nX

i=1

xi =

nX

i=1

xi

i· i =

nX

i=1

(yi � yi�1)i =nX

i=1

iyi �n�1X

i=1

(i+ 1)yi .

This implies

sn/n = yn �1

n

n�1X

i=1

yi .

Since (yn) converges, the right side converges to 0 as n ! 1.

We must truncate before applying the lemma, setting

Yn =

(0 if |Xn| > n

Xn if |Xn| n.

Because of the finite first moment assumption, the Yn’s are close to the Xn’s:

X

n

P(Yn 6= Xn) X

n

P(|Xn| > n) =X

n

P(|X1| � n) .

You will show in the homework that because E|X1| < 1, this sum is finite. Therefore by

Borel-Cantelli,

P(Yn 6= Xn i.o.) = 0 .

Therefore

X1 + · · ·+Xn

n! 0 almost surely , Y1 + · · ·+ Yn

n! 0 almost surely .

We next have to center the Yi’s, since they need not be mean zero. Note that

EYn = EXn1{|Xn|n} = EX11{|X1|n} ! EX1 = 0 .

Therefore if we set Yn = Yn � EYn, we have

Y1 + · · ·+ Yn

n! 0 almost surely , Y1 + · · ·+ Yn

n! 0 almost surely .

For this last sequence we apply the lemma and we see that it su�ces to show thatP

n(Yn/n)converges almost surely. The terms have mean zero and are independent, so by the two

series theorem we need only show thatP

n Var (Yn/n) < 1. For this, estimate

Var (Yn/n) = E(Yn/n)2=

1

n2EX2

11{|X1|n} ,

implying

X

n

Var (Yn/n) = E"X2

1

X

n

1{n�|X1|}

n2

#.

3

HoweverP

n�x1n2 C/x for x > 0 so we get the upper bound

CE⇥X2

1/|X1|⇤= CE|X1| < 1 .

(Here note that when |X1| = 0 we still get the required upper bound sinceP

n(1/n2) < 1.)

This meansP

n Var (Yn/n) < 1 and we are done.

4

Lecture 32

We begin with the two series theorem, which is almost the same as the one series theorem,

but we do not assume mean zero.

Theorem 0.1 (Two series theorem). Let X1, X2, . . . be independent. Assume that

1.P

n EXn converges and

2.P

n Var Sn < 1.

Then (Sn) converges almost surely.

Proof. Define Xi = Xi � EXi. Then Var Xi = Var Xi and we can apply the one series

theorem to get X

n

Xn converges almost surely .

Therefore X

n

Xn =

X

n

(Xn + EXn) =

X

n

Xn +

X

n

EXn

converges almost surely.

There is even a three series theorem which gives both necessary and su�cient conditions

for almost sure convergence. See Blllingsley, Theorem 22.8.

1 Poisson process

We now consider a well-studied example that appears all through probability theory. Let

X1, X2, . . . be a sequence of i.i.d. random variables. We will imagine that the Xi’s represent

waiting times, perhaps in a queue or for telephone calls. Regardless we will make the

memoryless assumption (from earlier in the semester), so that the Xi’s must be exponential

with some parameter ↵ > 0. So we will be thinking of an “exponential clock” which rings

at independent exponential times.

For any t � 0 we can define the number of clock rings that occur until time t. That is,

setting

Sn = X1 + · · ·+Xn for n � 1 and S0 = 0 ,

we define Nt = max{n � 0 : Sn t}. Note that

{Nt � n} = {Sn t} ,

so that Nt is a random variable. This collection {Nt} is a stochastic process, called the

rate-↵ Poisson process. To say it is a stochastic process is to say that it is a collection of

random variables, defined on the same space, indexed by some other set (typically thought

of as time). In our process, there are uncountably many variables, and this can at times

cause measurability issues. For instance, our sigma algebra is not closed necessarily under

uncountable unions. We will not address issues related to this and we will concentrate on

results that do not need these considerations.

Typically, one defines a Poisson process somewhat di↵erently to the above, instead using

an intrinsic definition; that is, defining it to be a collection of variables (Nt) satisfying various

properties (not necessarily with reference to any underlyingXn’s. We will instead prove some

of these properties below.

1. Nt � 0 a.s. and N0 = 0 a.s..

2. For each t, Nt has the Poisson distribution with parameter ↵t.

Proof. The event {Nt k} is equal to {X1 + · · · + Xk+1 > t}. Therefore from our

discovery of the density function of the sum of independent exponentials (from a few

lectures ago)

P(Nt k) = P(X1 + · · ·+Xk+1 > t) =

Z 1

t

↵e�↵x (↵x)

k

k!dx =

↵k+1

k!

Z 1

t

e�↵x

xkdx .

Setting u = ↵x so that du = ↵ dx, this becomes

P(Nt � k) =1

k!

Z 1

↵t

e�u

ukdu .

We can evaluate the integral by parts. For s � 0,

Z 1

s

e�u

undu = �u

ne�u

��1

s

+ n

Z 1

s

e�u

un�1

du

= sne�s

+ n

Z 1

s

e�u

un�1

du .

Continuing, we get

Z 1

s

e�u

undu = e

�s⇥sn+ ns

n�1+ · · ·+ n!

⇤.

Now plug back into the expression for P(Nt � k) to find

e�↵t

k!

⇥(↵t)

k+ k(↵t)

k�1+ · · ·+ k!

⇤= e

�↵tkX

j=0

(↵t)j

j!.

3. For almost every !, Nt(!) is non-decreasing in t, right continuous with left limits.

4. limt!1 Nt = 1 a.s..

2

Proof. The sum Sn is finite almost surely for all n and NSn = n. Therefore Nt is

unbounded almost surely and since it is non-decreasing it must converge to infinity.

The next property is really central to the Poisson process. Further, Billingsley shows (in

the second half of Theorem 23.1) that in the presence of a couple of other smaller assumptions

(what he called Condition 00), this property characterizes the Poisson process.

Theorem 1.1. For 0 < t1 < · · · < tk, the increments Nt1 , Nt2 � Nt1 , . . . , Ntk � Ntk�1are

independent and each increment Nt�Ns (for t > s) has a Poisson distribution with parameter↵(t� s):

P(Nt �Ns = n) = e�↵(t�s) (↵(t� s))

n

n!for n � 0 and 0 s < t .

Proof. Pick any t � 0. The proof will proceed by examining the process Ns for s � t. So

first suppose that Nt = n. This is the same as supposing that Sn t < Sn+1. When Nt = n,

we can use the memoryless property of the exponential to understand the distribution of the

next clock ring. That is, if we begin by observing the process at time t, then the next clock

ring will occur after an additional time Sn+1 � t. We will see here that the distribution of

the duration until this next ring is also exponential. To do this, compute

P(Sn t < Sn+1, Sn+1 � t > y) = P(Sn t,Xn+1 > t+ y � Sn) .

Now Sn and Xn+1 are independent, so we can use Fubini, calling Gn(x) the distribution

function of Sn, to get Z

{xt}P(Xn+1 > t+ y � x) dGn(x)

and by the memoryless property of exponential, (as Xn+1 is exponential) we can reduce this

to

e�↵y

Z

{xt}P(Xn+1 > t� x) dGn(x) = e

�↵y P(Sn t, Xn > t� Sn) .

This represents a sort of memoryless property of the next clock ring distribution whenNt = n.

Moving to more and more subsequent clock rings, for y1, . . . , yj � 0, we can use indepen-

dence of the Xi’s to compute

P(Sn t < Sn+1, Sn+1 � t > y1, Xn+2 > y2, . . . , Xn+j > yj)

= e�↵y2 · · · e�↵yjP(Sn t < Sn+1, Sn+1 � t > y1)

= P(Nt = n)e�↵y1 · · · e�↵yj .

One way to write this is to define X(t)1 , . . . , X

(t)j as the waiting times until subsequent clock

rings following t, with H equal to the Borel set (y1,1)⇥ · · ·⇥ (yj,1) and

P(Nt = n, (X(t)1 , . . . , X

(t)j 2 H) = P(Nt = n) P((X1, . . . , Xj) 2 H) .

Because such H’s form a ⇡-system that generates the Borel sigma-algebra, we can extend

this equation to all Borel sets in Rj.

(Continued next time.)

3

Lecture 33

Last time we were proving the following theorem, where (Nt) is a rate�↵ Poisson process.

Theorem 0.1. For 0 < t1 < · · · < tk, the increments Nt1 , Nt2 � Nt1 , . . . , Ntk � Ntk�1are

independent and each increment Nt�Ns (for t > s) has a Poisson distribution with parameter↵(t� s):

P(Nt �Ns = n) = e�↵(t�s) (↵(t� s))

n

n!for n � 0 and 0 s < t .

Proof. We had shown that for any t � 0, if we define the subsequent waiting times as

X(t)1 = Sn+1 � t, X

(t)2 = Xn+2, . . . , where n = Nt ,

then for all n � 0 and Borel sets H ⇢ Rj,

P(Nt = n, (X(t)1 , . . . , X

(t)j 2 H) = P(Nt = n) P((X1, . . . , Xj) 2 H) .

This shows a sort of independence of the future waiting times (after t) from an event from

the past (the event that Nt = n). If t1 < t2 < . . . < tk and n1, . . . , nk are given, then

the increments Nt2 � Nt1 , . . . , Ntk � Ntk�1are completely determined by the waiting times

X(t1)1 , X

(t1)2 , . . .. In particular, we can write the event {Nt2 �Nt1 = n2, . . . , Ntk �Ntk�1

= nk}in the above form. That is, setting p = n2 + · · · + nk + 1, there is some Borel set H ⇢ Rp

such that this event equals {(X(t1)1 , . . . , X

(t1)p ) 2 H}. (Since the next p clock rings determine

the event.) For example, when k = 2, we can write

{Nt1 = n1, Nt2 �Nt1 = n2} = {Nt1 = n1, X(t1)1 + · · ·+X

(t1)n2

t2 X(t1)1 + · · ·+X

(t1)n2+1} .

And the second event is of the required form.

Therefore

P(Nt1 = n1, Nti �Nti�1 = ni for i = 2, . . . , k)

= P(Nt1 = n1) P(Nti�t1 = ni for i = 2, . . . , k) .

We can then iterate this, using induction, to find

P(Nt1 = n1, Nti �Nti�1 = ni for i = 2, . . . , k) =

kY

i=1

P(Nti�ti�1 = ni) .

However as we showed before this theorem, Nt is a Poisson random variable with parameter

↵t, so we are done.

From this result it easily follows that Nt cannot have jumps of size bigger than 1. Al-

though this can be derived directly from the definitions (in one line!) the following proof

is of interest because it shows how one can go from the existence of a process (Nt) with

independent Poisson increments to the result that it has jumps of size 1.

Corollary 0.2. With probability 1, Nt does not have jumps of size bigger than 1.

Proof. If Nt has a jump of size bigger than 1, it must occur in some interval [0, N ]. Therefore

{Nt has a jump of size bigger than 1} =

[

N

{Nt has a jump of size bigger than 1 in [0, N ]} .

So if su�ces to show that almost surely, Nt has no jumps of size bigger than 1 in [0, N ]. We

do this by creating nested partitions.

For any n � 1, cover [0, N ] by 2Nn open intervals (I(n)k ) of length 1/n. If there is a jump

in [0, N ] it must lie in at least one I(n)k . But if there is a jump of size bigger than 1 in this

interval (a, b), we must have Nb � Na � 2. Using the preceding result, this probability is

equal to

P(Nb �Na � 2) = 1� e�↵(b�a)

[1 + ↵(b� a)] C(b� a)2

for some C independent of b� a. Therefore by a union bound, the probability that there is

a jump of size at least 2 in [0, N ] is bounded above by

2NnC

n2=

2NC

n! 0 as n ! 1 .

However the original probability does not depend on n or N and must then be 0.

Another consequence of the main theorem is that Nt obeys a law of large numbers.

Corollary 0.3. As t ! 1,Nt/t ! ↵ almost surely .

Proof. We first prove for integer times. For any n write

Nn/n =1

n

nX

k=1

(Nk �Nk�1) .

Each increment Nk � Nk�1 is Poisson distributed with parameter ↵ · 1 = ↵. Furthermore

these variables are independent. So the strong law of large numbers implies that

Nn/n ! ↵ almost surely .

Here we are using that the mean of the Poisson distribution with parameter ↵ is

e�↵

1X

k=0

k↵k

k!= e

�↵↵

1X

k=1

↵(k�1)

(k � 1)!= ↵ .

To prove the result for Nt/t use monotonicity. We have

btct

·Nbtc

btc Nt/t dtet

Ndte

dte

and both the left and right sides converge to ↵ almost surely.

2

Before we move on, here are some examples of the many uses of Poisson processes.

1. Because the clock rings of a Poisson process are fairly regular (the waiting times are

exponential and therefore tend to be not so large) it is common to use them to index

another discrete-time process. In other words, if (Nt) is a Poisson process, we may

substitute Nt for n in a i.i.d. sequence X1, X2, . . . to obtain a continuous-time process

St := SNt = X1 + · · ·+XNt .

As t ! 1 Nt is about order ↵t, so we are typically summing about ↵t number of

elements. Here St is called a compound Poisson process. We can think of it as a Poisson

process which, instead of incrementing by 1, increments by some random number Xi.

Why would we want to reindex time by Nt? Well in many cases it is easier to treat

a continuous variable. For instance, if X0, X1, . . . is a Markov chain, we can define a

continuous time version by XNt , where Nt is a Poisson process. If, for example, Nt has

rate-one, then Nn is close to n, so XNn is a reasonable approximation for Xn. However,

there are many di↵erential equations and inequalities for XNt that are not available in

the discrete setting.

2. Given a Poisson process Nt, we can write a1 < a2 < a3 < · · · for the locations of its

(random) points of discontinuity and build a (random) measure µ = µ(!) =P

i �ai .

This is the start of the theory of point processes, this particular one being called a

Poisson point process (PPP). Given the appropriate sigma-algebra and probability

space, a PPP is characterized by the following properties:

(a) for any bounded Borel A ⇢ R, the variable µ(A) is Poisson distributed with

parameter �(A) (where � is Lebesgue measure) and

(b) for bounded, disjoint Borel A,B, the variables µ(A) and µ(B) are independent.

These concepts can be extended to random measures on Rdwith d � 2.

3

Lecture 34

1 Weak convergence

Recall that a sequence (Fn) of distribution functions is said to converge weakly to a distri-bution function F (written Fn ) F ) if

Fn(x) ! F (x) for all x 2 R at which F is continuous .

We will also say that a sequence of Borel probability measures (µn) converges weakly to µ(written µn ) µ) if the associated distribution functions converge weakly to that of µ.

We saw some examples earlier in the course; here is a somewhat more interesting one.A sequence of discrete measures (those whose distribution functions are constant exceptat discontinuity points) can converge weakly to a continuous one (that with a continuousdistribution function). Take

µn =1

n+ 1

nX

k=0

�k/n .

Then if Fn is the distribution function of µn, for 0 x 1,

Fn(x) =1

n+ 1#{k : 0 k n and k/n x} =

bnxc+ 1

n+ 1! x .

Here we do not even need to look only at points of continuity of F (convergence holds forall x), but many times this is necessary.

We now extend the idea to random variables.

Definition 1.1. We say that a sequence of random variables (Xn) converges in distributionto a random variable X (written Xn ) X) if the associated distributions µn converge weaklyto the distribution of X.

One advantage of convergence in distribution is that the variables do not need to bedefined on the same space. This was mentioned earlier in the course. In this way it is weakerthan almost sure convergence or convergence in probability.

• We have remarked earlier that if Xn ! X almost surely, then Xn ! X in probability.

• If Xn ! X in probability then Xn ) X. To see this, note that for x 2 R and ✏ > 0,

P(X x� ✏) P(Xn x) + P(|Xn �X| � ✏) .

Taking n ! 1, the right term approaches zero, so

P(X x� ✏) lim infn

P(Xn x) ,

and taking ✏ ! 0,P(X < x) lim inf

nP(Xn x) .

Similarly, noting the inequality

P(Xn x) P(X x+ ✏) + P(|Xn �X| � ✏)

and taking n ! 1 and ✏ ! 0,

lim supn

P(Xn x) P(X x) .

Combining these,

P(X < x) lim infn

P(Xn x) lim supn

P(Xn x) P(X x) .

If the distribution function of X is continuous at x, these are all equal, giving Xn ) X.

• The converse is generally not true. First if Xn ) X, the Xn’s need not be even definedon the same space, so Xn ! X in probability does not make sense. Even still, if (Xn)is an i.i.d. sequence of non-degenerate random variables (say Bernoulli(1/2)) then allXn’s have the same distribution so trivially Xn ) X1. However Xn cannot convergeto X1 in probability, since for n � 2, P(Xn = 0, X1 = 1) = 1/4.

• However the converse holds when X is constant. That is if Xn ) a for some a 2 Rthen Xn ! a in probability. You can check this.

1.1 Coupling

Although convergence in distribution does not imply almost sure convergence (far from it!)it is possible to prove a version of this implication. If Xn ) X then we can change all thevariables, but leave the distributions the same, so that point wise convergence holds. Themain result is Skorohod’s theorem. It is extremely useful because it allows us to use factsabout point wise convergence to prove facts about convergence in distribution.

Theorem 1.2 (Skorohod). Suppose µn ) µ for Borel probability measures µn, µ on R.There exists a sequence (Yn) and a random variable Y , all defined on some probability space(⌦,⌃,P) such that both

1. Yn(!) ! Y (!) for all ! and

2. Yn has distribution µn and Y has distribution µ.

Proof. We use the same construction introduced earlier in the semester. Consider the prob-ability space ((0, 1),⌃,P), where ⌃ is the Borel subsets of (0, 1) and P is Lebesgue measure.If Fn is the distribution function of µn and F is the distribution function of µ we define

Yn(!) = inf{x : Fn(x) � !}

and similarly for Y (!). Recall that

! Fn(x) if and only if Yn(!) x and similarly for F, Y (1)

2

and this implied that Yn has distribution µn and Y has distribution µ.We must now show that Yn(!) ! Y (!) for all ! 2 (0, 1). We will argue this in the

next paragraph from convergence of Fn to F when ! is a continuity point for the function! 7! Y (!) (note here we are not taking ! as a continuity point of F , but of Y ). If Y is notcontinuous at ! then redefine Y (!) and Yn(!) to be zero for all n. Because Y is a monotonefunction, it has only countably many discontinuities and therefore we have only redefinedthe Yn’s and Y on a set of Lebesgue measure zero; this leaves the distributions unchangedand assures that Yn(!) ! Y (!) for such !. Note that (1) still holds for ! at which Y iscontinuous.

If ! is a continuity point of Y and ✏ > 0 we may choose x in the interval (Y (!)� ✏, Y (!))which is a continuity point of F . Since x < Y (!), (1) implies that F (x) < !. As Fn(x) !F (x), we also have Fn(x) < ! for large n and again by (1), x < Yn(!). So

Y (!)� ✏ < x < Yn(!) for large n .

This is one half of the inequality we need; it implies that lim infn Yn(!) � Y (!).For the second, pick !0 > ! and choose a continuity point y for F such that Y (!0) < y <

Y (!0) + ✏. Then by (1), !0 F (y) , so

! < !0 F (y) ,

Therefore as Fn(y) ! F (y), we have ! Fn(y), giving by (1)

Yn(!) y < Y (!0) + ✏ .

This means lim supn Yn(!) Y (!0) when !0 > !.Since Y is continuous at !, as we take !0 down to !, we obtain

lim supn

Yn(!) Y (!)

and this completes the proof.

3

Lecture 35

0.1 Equivalent forms of weak convergence

It is often useful to tell conditions under which h(Xn) converge weakly in distribution toh(X), where (Xn) is a sequence that converges weakly to X and h : R ! R is a function.The proof below is made much more simple by the Skorohod theorem.

Theorem 0.1 (Mapping theorem). Let h : R ! R be Borel measurable and set Dh tobe the set of discontinuities of h. If Xn ! X in distribution and P(X 2 Dh) = 0 thenh(Xn) ! h(X) in distribution.

Proof. You can check that Dh is a Borel set. By Skorohod, pick (Yn) and Y on the samespace such that Yn has the same distribution as Xn and Y has the same distribution as X,and Yn(!) ! Y (!) for all !. If Y (!) is not a discontinuity of h, then h(Yn(!)) ! h(Y (!)).Therefore

P(h(Yn) ! h(Y )) � P(h(Yn) ! h(Y ), Y /2 Dh) = P(Y /2 Dh) = 1 .

This means h(Yn) ! h(Y ) almost surely and by the previous results, this implies h(Xn) !h(X) in distribution.

One version of the previous theorem is the continuous mapping theorem. It says that ifXn ) X then for h continuous, h(Xn) ) h(X).

Another application of Skorohod is part of the Portmanteau theorem:

Theorem 0.2. The following are equivalent.

1. µn ) µ.

2.Rf dµn !

Rf dµ whenever f : R ! R is continuous and bounded.

3. µn(A) ! µ(A) for all Borel A ⇢ R with µ(@A) = 0.

Proof. Assume µn ) µ and by Skorohod let (Yn), Y be random variables on some space suchthat Yn has distribution µn, Y has distribution µ and Yn(!) ! Y (!) for all !. Because f iscontinuous, f(Yn) ! f(Y ) point wise. By bounded convergence, 1 implies 2:

Zf dµn = Ef(Yn) ! Ef(Y ) =

Zf dµ .

To show 1 implies 3, let h = 1A, so that the discontinuity set of h equals @A. If µ(@A) = 0then the mapping theorem implies that h(Yn) ) h(Y ) almost surely and then by boundedconvergence again, µn(A) = Eh(Yn) ! Eh(Y ) = µ(A).

Now assume 3. To show µn ! µ, let x 2 R such that x is a continuity point of F , thedistribution function of µ. This is equivalent to saying that µ({x}) = 0, or µ(@(�1, x]) = 0.Using this set as A,

Fn(x) = µn(A) ! µ(A) = F (x) ,

so µn ) µ.Last we must show 2 implies 1. So given x 2 R as a continuity point of F we must show

that Fn(x) ! F (x). This can be rewritten as µn(A) ! µ(A), where A = (�1, x]. Theproblem here is that 1A is not a continuous function, so we cannot directly apply item 2.So we must approximate. Given ✏ > 0 let f✏ be the function that equals 1 on (�1, x], 0 on[x + ✏,1), and interpolates linearly between them, on [x, x + ✏]. Because f✏ is continuous,item 2 implies that Z

f✏ dµn !Z

f✏ dµ .

Note that for all y, 1(�1,x](y) f✏(y) 1(�1,x+✏](y), so by monotonicity of integrals,

Fn(x) Z

f✏ dµn .

Taking a limit,

lim supn

Fn(x) Z

f✏ dµ F (x+ ✏) .

Take ✏ # 0 to get lim supn Fn(x) F (x).By a similar argument, we define g✏ to be 1 on (�1, x � ✏], 0 on [x,1), and linear in

between. We find

F (x� ✏) Z

g✏ dµ = limn

Zg✏ dµn lim inf

nFn(x) .

Taking ✏ " 0,F (x�) lim inf

nFn(x) lim sup

nFn(x) F (x) ,

so when F is continuous at x, Fn(x) ! F (x), which is 1.

0.2 Tightness and subsequences

It is often useful to extract a weakly convergent subsequence from a sequence of probabilitymeasures. We have seen this all over parts of analysis. We will see that every sequence ofdistribution functions has a point wise convergent subsequence. However the limit does notneed to be a distribution function. For example, if we take

Fn(x) = 1(�1,�n](x)

then Fn converges point wise to 0, so any sub sequential limit will. Unfortunately 0 is notthe distribution function of a probability measure. The problem in this example is that themass of the µn’s is “escaping” to ±1. As long as this doesn’t happen, then we are ok. Sowe introduce a condition that guarantees that mass does not escape to infinity.

Definition 0.3. A sequence (µn) of probability measures is tight if for each ✏ > 0 there existsan interval [a, b] such that

µn([a, b]) > 1� ✏ for all n .

2

Tightness ensures a convergent subsequence, but moreover if tightness fails, there is asubsequence which does not have a convergent subsequence. This can be seen as a sort ofcompactness statement in the space of probability measures. We will prove these statementsnext time.

3

Lecture 36

Theorem 0.1 (Helly’s selection theorem). Let (Fn) be a sequence of distribution functionson R. There exists a subsequence (Fnk

) that converges to a nonnegative, nondecreasing, rightcontinuous function F at all continuity points of F .

Proof. Because (Fn(x)) is a bounded sequence of real numbers (for a fixed x), it has a

convergent subsequence. By diagonalization we may find a subsequence (Fnk) such that

for each rational number q 2 Q, Fnk(q) converges to some number, which we will denote by

G(q). We can use these numbers to construct a right-continuous, nondecreasing, nonnegative

function F (you can check these properties) on all of R by

F (x) = inf{G(q) : q > x and q 2 Q} .

If F is continuous at x then we need to show that Fnk(x) ! F (x). To do so, let q < x < r

be rational numbers and note

Fnk(q) Fnk

(x) Fnk(r) ,

and taking k ! 1,

G(q) lim infk

Fnk(x) lim sup

kFnk

(x) G(r) .

So we need to prove that

limq"x

G(q) = F (x) = limr#x

G(r) .

For this, first let ✏ > 0 and choose y < x such that F (y) > F (x)� ✏ (by continuity of F ).

Then if q 2 (y, x), by the definition of F ,

G(q) � inf{G(s) : s � y} = F (y) > F (x)� ✏ .

Now pick z > x such that F (z) < F (x) + ✏ and use monotonicity of G: if r 2 (x, z),

G(r) inf{G(s) : s � z} = F (z) < F (x) + ✏ .

So we find

y < q < x < r < z ) F (x)� ✏ G(q) G(r) F (x) + ✏

and this completes the proof.

Theorem 0.2. (µn) is tight if and only if each subsequence (µnk) has a further subsequence

that converges weakly to a probability measure.

Proof. Assume (µn) is tight and let (⌫n) be a subsequence. Apply Helly to the distribution

functions of ⌫n to get a function F such that (Fnk), a subsequence of the distribution functions

of the ⌫n converge to F at continuity points of F . We need only show that limx!1 F (x) = 1

and limx!�1 F (x) = 0.

Given ✏ > 0 choose [a, b] such that µn([a, b]) > 1� ✏ for all n. We may slightly separate

a and b further to ensure that they are continuity points of F (since F has only countably

many, being monotone). Therefore

F (a) = limk

Fk(a) = limk

[1� µk((a,1])] limk

[1� (1� ✏)] = ✏ ,

so limx!�1 F (x) = 0. Similarly,

F (b) = limk

Fk(b) � limk

µk([a, b]) � 1� ✏ ,

so limx!1 F (x) = 1.

Conversely, suppose that (µn) is not tight, there must be an ✏ > 0 such that for all [a, b],there is an n such that µn([a, b]) 1� ✏. So for each k, choose nk such that µnk

([�k, k]) 1� ✏. If this subsequence had a further subsequence (µnkj

) that converged to a probability

measure µ, then we could choose (a, b] with zero µ-measure boundary such that µ((a, b]) >1� ✏ but then

1� ✏ � lim infj

µnkj([�kj, kj]) � lim

jµnkj

((a, b]) = µ((a, b]) > 1� ✏ ,

a contradiction.

2

Lecture 37

We have covered the basic results we will need about weak convergence (and convergence

in distribution). Here are a couple of extensions of previous integral limit theorems. They

are all proved by using Skorohod.

• If Xn ) X then E|X| lim infn E|Xn|.

Proof. Use Skorohod and then Fatou’s lemma.

• If Xn ) X and (Xn) is uniformly integrable then EXn ! EX, where E|X| < 1.

Proof. Use Skorohod and then the theorem in uniform integrability.

1 Characteristic functions

Now that we have studied a bit about convergence in distribution, we meet a very useful

tool, the characteristic function, which encodes information about the probability measure.

It is a “big kid’s” version of the moment generating function.

Definition 1.1. If µ is a Borel probability measure, its characteristic function is

�(t) = �µ(t) =

Zeitx dµ(x) .

The definition of this complex integral is in terms of its real and imaginary parts. If

f : R ! C is Borel function then we define

Zf(x) dµ(x) =

ZRe f(x) dµ(x) + i

ZIm f(x) dµ(x) .

So the characteristic function is

�(t) = EetX =

Zcos tx dµ(x) + i

Zsin tx dµ(x) ,

where X is a random variable with distribution µ. For its study, we need some extremely

basic facts about complex numbers, like

z = x+ iy ) |z|2 = x2+ y2 and max{|x|, |y|} |z| .

Properties.

1. �(0) = 1 and for all t,|EeitX | 1 .

To see this, estimate

|EeitX | =⇥(E cos(tX))

2+ (E sin(tX))

2⇤1/2 ⇥E�cos

2(tX) + sin

2(tX)

�⇤1/2= 1 .

Alternatively, one can show that for any Borel measurable f : R ! C,��Z

f(x) dµ(x)

�� Z

|f(x)| dµ(x) .

This is in Billingsley, (16.30).

The bound |�(t)| 1 for all t reflects the fact that the characteristic function will be

much more versatile than the moment generating function, which is only defined for

some t and does not have any simple bound.

2. � is uniformly continuous:

Proof. If t, h 2 R,

|�(t+ h)� �(t)| E|ei(t+h)X � eitX | E|eihX � 1| ! 0 as h ! 0 .

The convergence above holds by the bounded convergence theorem and is independent

of t (since the final term does not even have a t).

3. (Riemann-Lebesgue) If µ has a density then �(t) ! 0 as |t| ! 1.

Proof. The standard way to prove this is to approximate by smooth functions. Let fbe the density function for µ and first assume it is smooth with compact support (that

is, the set where it is nonzero is bounded). Then f 0is bounded and by integration by

parts, Z 1

�1f(x) cos tx dx = �

Z 1

�1f 0(x)

sin(tx)

tdx ! 0 as t ! 1 .

Now one must approximate by such functions. It is a standard fact in analysis that

any integrable function can be approximated by smooth functions with compact sup-

port. In other words, we can find g that is smooth and has compact support such thatR|f(x) � g(x)| dx < ✏. (If you have not seen this, you can look in Billingsley The-

orem 17.1(i) for the same statement for approximation by step functions and instead

computeRg(x) cos(tx) dx directly for a step function and see that it goes to zero. This

is the approach in Billingsley Theorem 26.1.)

Now

lim sup

|t|!1

��Z

f(x) cos(tx) dx

�� Z

|f(x)� g(x)| dx+ lim|t|!1

��Z

g(x) cos(tx) dx

�� ✏ .

The same proof shows thatRf(x) sin(tx) dx ! 0 and so �(t) ! 0 as |t| ! 1.

2

4. We would like to expand �(t) as a Taylor series. This is possible in most cases (although

it was not possible in most cases for the moment generating function). To do this, we

need to estimate the remainder in the n-th Taylor approximation for eix:��e

ix �nX

k=0

(ix)k

k!

�� min

⇢|x|n+1

(n+ 1)!,2|x|n

n!

�for n � 0 and x 2 R . (1)

This inequality amounts to an exercise in analysis: it follows from Taylor’s theorem

with remainder, notably the following equation that pops out of multiple integrations

by parts:

eix =

nX

k=0

(ix)k

k!+

in

(n� 1)!

Z x

0

(x� s)n�1(eis � 1) ds .

(See in Billingsley, (26.1)-(26.4).) The point is that we can use this estimate if X has

n moments: ��(t)�nX

k=0

ik

k!EXk

�� Emin

⇢|tX|n+1

(n+ 1)!,2|tX|n

n!

�.

If X has moments of all orders and this remainder goes to 0 for some t; that is,

limn

E|X|n t

n

n!

�= 0 , (2)

then we must have

�(t) =1X

k=0

(it)k

k!EXk .

Equation (2) is a condition on the growth of the moments of X (as long as t 6= 0). If

they do not grow too quickly then the moments determine the characteristic function

(which we will see determines the law of X).

From the theory of power series, if condition (2) holds for some t 6= 0 then we can

di↵erentiate the series and obtain

�(k)(0) = ikEXk

for all k .

In fact the di↵erentiation result above holds under much more minimal assumptions.

Theorem 1.2. If E|X|k < 1 then �(k)(0) = ikEXk.

Proof. For k = 1, let h 6= 0 and t 2 R:��(t+ h)� �(t)

h� EiXeitX

�� E��e

itX eihX � 1� ihX

h

�� E��eihX � 1� ihX

h

�� . (3)

Now use the estimate on the remainder in Taylor’s theorem (1) to find the upper bound��eihX � 1� ihX

h

�� 2|X| ,

so by dominated convergence, (3) goes to 0 and we get �0(0) = iEX.

For k � 2 we simply repeat the above argument, using (1) with higher values of n.

3

Lecture 38

To conclude the discussion from last time on properties of characteristic functions, let usnote that if X and Y are independent then Eeit(X+Y ) = EeitXeitY equals

= E [cos(tX) cos(tY )� sin(tX) sin(tY ) + i(cos(tX) sin(tY ) + sin(tX) cos(tY ))]

= E cos(tX)E cos(tY )� E sin(tX)E sin(tY ) + i(E cos(tX)E sin(tY ) + E sin(tX)E cos(tY ))

= EeitXEeitY .

This means that �X+Y (t) = �X(t)�Y (t).

Examples.

1. For a sum of delta masses µ =Pn

j=1 aj�xj withPn

j=1 aj = 1,

�(t) =nX

j=1

ajeixjt .

Note that this does not converge to 0 as |t| ! 1. (µ has no density).

2. For uniform on an interval, (say [0, 1]),

�(t) =

Z 1

0

eixt dx =

Z 1

0

cos(xt) dx+ i

Z 1

0

sin(xt) dx =sin t

t� i

cos t� 1

t=

eit � 1

it.

Generally, for uniform on an interval [a, b], one has

�(t) =1

b� a· e

ita � eitb

it.

Here, of course, �(t) ! 0 as |t| ! 1.

3. For a standard Gaussian X, we have already computed (partially in the homework)that

E|X|k =(2ll!

p2/⇡ if k = 2l + 1 is odd

(2l))!2ll! if k = 2l is even

.

Therefore for any t 2 R,

tnE|X|n

n!=

8<

:

t2l+12ll!p

2/⇡

(2l+1)! if n = 2l + 1 is oddt2l+1(2l)!2ll!(2l)! if n = 2l is even

.

This converges to 0 as n ! 1, so the characteristic function of the Gaussian can be retrievedfrom its moments:

EeitX =1X

n=0

(it)n

n!EXn .

The odd moments are zero, so only the even moments survive:1X

n=0

(�1)n(t)2n

(2n)!

(2n)!

2nn!=

1X

n=0

(�t2/2)n

n!= e�t2/2 .

1 The uniqueness theorem

One main use of characteristic functions is that they determine the distribution. In otherwords, a characteristic function contains all information about the distribution. The state-ment is the analogue of the Fourier inversion formula.

Theorem 1.1 (Inversion theorem). Let � be a characteristic function for a probability mea-sure µ on R. If µ({a}) = µ({b}) = 0 (that is, a b are continuity points for the distributionfunction) then

µ((a, b]) = limT!1

1

2⇡

Z T

�T

e�ita � e�itb

it�(t) dt .

Proof. Define IT by

IT =1

2⇡

Z T

�T

e�ita � e�itb

it�(t) dt =

1

2⇡

Z T

�T

e�ita � e�itb

it

Z 1

�1eitx dµ(x)

�dt .

We would like to apply Fubini, so consider the product integral

1

2⇡

Z 1

�1

Z T

�T

e�it(a�x) � e�it(b�x)

itdt dµ(x) .

We can split this into real and imaginary parts and apply Fubini to each, allowing us tointegrate in either order. To do this, we must show that each of the real and imaginary partsare (absolutely) integrable relative to the product measure dt dµ. To do this, it su�ces toshow that they are bounded, because this product measure is a finite measure. Verifyingthis is easy for t 6= 0: by the Taylor estimate from last class with n = 0,

��e�it(a�x) � e�it(b�x)

it

�� 1

t

��e�ita � e�itb�� = 1

t

��e�it(a�b) � 1�� |b� a| .

Since the set {(t, x) : t = 0} has measure zero under dt dµ, we can switch the order of theintegrals and get

IT =

Z 1

�1

1

2⇡

Z T

�T


itdt

�dµ(x) . (1)

We want to use the dominated convergence theorem and take T ! 1, so let’s investigatewhat happens as T ! 1 for the inner integral:

limT!1

1

2⇡

Z T

�T


itdt . (2)

Now expand e�i✓ = cos ✓ + i sin ✓ in the integrand for

1

it[cos(�t(a� x)) + i sin(�t(a� x))� cos(�t(b� x))� i sin(�t(b� x))] ,

2

which equals

sin(�t(a� x))� sin(�t(b� x))

t+�i

cos(�t(a� x))� cos(�t(b� x))

t.

By symmetry of cosine,

1

2⇡

Z T

�T

cos(�t(c� x))� cos(�t(b� x))

tdt = 0 for all T > 0 .

Therefore (2) becomes

limT!1

1

2⇡

Z T

�T

sin(t(x� a))

tdt� lim

T!1

1

2⇡

Z T

�T

sin(t(x� b))

tdt . (3)

Recall that

limT!1

Z T

�T

sin t

tdt = ⇡

(this is a fact from complex analysis or many other areas – for a proof, see Billingsley,Example 18.4). Therefore for r 6= 0, using a change of variable u = rt, so that du = r dt,

limT!1

Z T

�T

sin(rt)

tdt = lim

T!1

Z rT

�rT

sin u

udu =

(⇡ if r > 0

�⇡ if r < 0

(and is 0 if r = 0). So taking T ! 1 in (3),

(2) =

8>>>>>><

>>>>>>:

0 if x < a

1/2 if x = a

1 if a < x < b

1/2 if x = b

0 if b < x

.

Calling this function a(x), if we can apply dominated convergence in the iterated integral(1), we will get

limT!1

IT =

Z 1

�1a(x) dµ(x) = µ([a, b])� 1

2[µ({a}) + µ({b})] = µ([a, b]) = µ((a, b])

and this will prove the theorem.So we are left to show that as T grows to infinity, the family of functions of x:

x 7! 1

2⇡

Z T

�T


itdt

is dominated by a µ-integrable function. You will show this in homework (hint: the familyis actually bounded).

As a consequence of this theorem, if two probability measures have the same characteristicfunction, they agree on all intervals (a, b] and, since the collection of such integrals form a⇡-system, the measures must be equal.

3

Lecture 39

1 Continuity theorem

One main reason that characteristic functions are so useful comes from the continuity theo-rem. We will show that:

Theorem 1.1 (Continuity theorem). Let (µn) and µ be probability measures on R withcharacteristic functions (�n) and �. The following are equivalent.

1. µn ) µ.

2. �n ! � pointwise.

Before we prove the theorem, let’s give an example. Let X(n)1 , . . . , X(n)

n be i.i.d. Bernoullivariables with parameter p/n, where p 2 [0, 1] is given. This means

P(X(n)i = 0) = 1� p/n and P(X(n)

i = 1) = p/n .

Then the characteristic function for X(n)i is

EeitX(n)i = (1� p/n) + p/neit =

✓1 +

p(eit � 1)

n

◆.

Further, by the i.i.d. assumption,

Eeit(X(n)1 +···+X

(n)n ) =

✓1 +

peit

n

◆n

! ep(eit�1) pointwise .

This is exactly the characteristic function for a Poisson distribution with parameter p:

�(t) = e�p1X

n=0

eitnpn

n!= ep(e

it�1) .

So we find the Poisson limit theorem:

X(n)1 + · · ·+X(n)

n ) Poisson(p) .

In particular, if n people each buy a lottery ticket with chance 1/n of winning, the probabilitythere are exactly k winners equals

P(X(n)1 + · · ·+X(n)

n = k) ! P(Poisson(1) = k) = e�1 1

k!.

Proof of continuity theorem. If µn ) µ, then apply the Portmanteau theorem to the real andimaginary parts of �n: for fixed t 2 R, x 7! cos tx and x 7! sin tx are bounded continuousfunctions, so �n ! � point wise.

Suppose conversely that �n ! � point wise; we must show that µn ) µ. The argumentrelies on showing that (µn) is tight. Assuming this for the moment, let us show first how tocomplete the proof. If (µn) does not converge weakly to µ then we can find some boundedcontinuous function f : R ! R and a subsequence (µnk

) such that��Z

f dµnk�

Zf dµ

�� is bounded away from 0 in k .

However tightness implies that there is a further subsequence (µnkj) that converges to a

probability measure ⌫. It follows that µ 6= ⌫ sinceZ

f dµnkj!

Zf d⌫

and this sequence does not converge toRf dµ. But now as j ! 1, the characteristic

functions of µnkjmust converge to that of ⌫, by the first part of the theorem. This is a

contradiction, since they also must converge to that of µ.To show tightness, the main idea is that if mass of the µn’s “escaped to infinity” then it

would have to show up in the characteristic function as a sharp oscillation near zero. Forinstance, if the µn’s had a delta mass an�xn escaping to infinity, this would show up as aneitxn .To prove this we follow the trick in Billingsley. Let ✏ > 0 and choose u > 0 such that

1

u

Z u

�u

(1� �(t)) dt < ✏/2 .

This is possible by continuity of � at 0 (and this is our bound for the possible “oscillation”of the �n’s near zero). Then estimate using Fubini:

1

u

Z u

�u

(1� �n(t)) dt =

Z 1

�1

1

u

Z u

�u

(1� eitx) dt

�dµn(x) = 2

Z 1

�1

✓1� sin ux

ux

◆dµn(x)

� 2

Z

|x|�2/u

✓1� sin ux

ux

◆dµn(x)

� 2

Z

|x|�2/u

✓1� 1

|ux|

◆dµn(x)

� µn(|x| � 2/u) .

The integral on the left converges by bounded convergence to 1u

R u

�u(1 � �(t)) dt. So for nlarge,

µn([�2/u, 2/u]c) 1

u

Z u

�u

(1� �n(t)) dt < ✏ .

Further decreasing u to ensure µn([�2/u, 2/u]c) < ✏ for small n completes the proof oftightness.

2

• Actually the above proof can be generalized is a couple of di↵erent ways. Here is one.Suppose (�n) is a sequence of characteristic functions such that �n ! g point wise, forsome g : R ! R. If g is continuous at 0 then it must be a characteristic function �and thus µn ) µ for µn with characteristic function �n and µ characteristic function� = g.

• To prove the above statement, you can use the condition that g is continuous at 0to re-run the tightness argument above, showing that (µn) is tight. Then there is aprobability measure µ and a subsequence (µnk

) such that µnk) µ. Then it follows that

the characteristic function of µ is g and further that µn ) µ, since the characteristicfunctions converge.

• An even simpler result is: if (µn) is tight and �n converge point wise to some g then(µn) converges weakly to some µ with characteristic function g.

3

Lecture 40

1 (Gaussian) central limit theorems

As an application of some of the ideas involving characteristic functions, we can prove variousversions of the (Gaussian) central limit theorem.

1.1 IID case

The simplest version follows more or less directly from our theorems.

Theorem 1.1 (Central limit theorem). Let X1, X2, . . . be an i.i.d. sequence with EX1 = 0and EXi. Then

X1 + · · ·+Xnpn

) N(0, 1) .

Proof. Write � for the characteristic function of X1. Then the characteristic function ofX1/

pn is EetX1/

pn = �(t/

pn) and by independence, the characteristic function of the left

side of the theorem isEet(X1+···+Xn)/

pn = �(t/

pn)n .

The fact that EX21 = 1 < 1 means that we can expand the characteristic function of X1 in

a neighborhood of zero as

�(s) = �(0) + s�0(0) +s2

2�00(0) + o(s2) = 1 + isEX1 �

s2

2EX2

1 + o(s2)

= 1� s2

2+ o(s2) as s ! 0 .

Therefore for fixed t,

�(t/pn)n =

✓1 +

(�t2/2)

n+ o(1/n)

◆n

as n ! 1 .

Because ✓1 +

(�t2/2)

n

◆n

! e�t2/2 as n ! 1 ,

and this is the characteristic function of a Gaussian, it su�ces to show that the o(1/n) termdoes not a↵ect this. One can do this using complex logarithms, or the following lemma.

Lemma 1.2. Let u and w be complex numbers of absolute value at most 1. Then for alln � 1,

|un � wn| n|u� w| .

Proof. For n = 1 it is clear. For n � 2, use induction:

|un � wn| |un � uvn�1|+ |uvn�1 � vn| = |u||un�1 � vn�1|+ |vn�1||u� v| |un�1 � vn�1|+ |u� v| n|u� v| .

To use the lemma, set u = 1 + (�t2/2)/n and w = 1 + (�t2/2)/n + o(1/n). Then forlarge n, u and w have absolute value at most 1. Therefore we can estimate

|un � wn| n|u� w| = no(1/n) ! 0 ,

and thereforewn = �(t/

pn)n ! e�t2/2 as n ! 1 .

1.2 Weakening the assumptions

The assumptions for the last section were

1. Xi’s are independent

2. Xi’s are identically distributed

We will first weaken the second assumption. So we will assume X1, X2, . . . is a sequence ofindependent variables that are not necessarily identically distributed, but they have

EXi = 0 and Var Xi = 1 for all i .

The question again is: what conditions on the variables ensure that

X1 + · · ·+Xnpn

) N(0, 1) ? (1)

If we look at the proof of the CLT, we see that in the i.i.d. case, the characteristicfunction of the left side of (1) is

✓1� t2/2

n+ o(1/n)

◆n

.

In the case of non-identically distributed summands, the characteristic functions of the sum-mands will not be the same generally and so we will get a product

✓1� t2/2

n+ o1(1/n)

◆✓1� t2/2

n+ o2(1/n)

◆· · ·

✓1� t2/2

n+ on(1/n)

◆,

2

where ok(n) is the “small” term from the expansion of the k-th characteristic function. Wewill try to give a condition that ensures that this small term is small enough, so that we canstill neglect it in the limit.

To estimate the error ��EeitX �

✓1� �2t2

2

◆��

for a general random variable X with EX = 0 and finite variance Var X = �2, we use theTaylor estimate from before:

��eix �

nX

k=0

(ix)k

k!

�� min

⇢|x|n+1

(n+ 1)!,2|x|n

n!

�.

Plug in tX here with n = 2 and take expectation, using the values of the mean and variance:

��(t)�✓1� t2�2

2

◆�� =

��EeitX �

2X

k=0

tkEXk

k!

�� E��e

itX �2X

k=0

(itx)k

k!

��

Emin�|tX|2, |tX|3

.

Because we only will assume X has two moments, this would force us to use the first termin the minimum. To get around this, we will use the second only when X is small. That is,given some ✏ > 0, let us give the central estimate

��(t)�✓1� t2�2

2

◆�� E|tX|21{|X|�✏} + E|tX|31{|X|<✏}

t2EX21{|X|�✏} + ✏|t|3EX21{|X|<✏} . (2)

The second term above is generally small (it has ✏ before it), so it is necessary to control thefirst. This was done by the Lindeberg condition.

Theorem 1.3 (Lindeberg central limit theorem – simplified version). Let X1, X2, . . . be anindependent sequence with EXi = 0, Var Xi = 1 for all i. Assume

1

n

nX

k=1

EX2k1{|Xk|�✏

pn} ! 0 as n ! 1 for each ✏ > 0 .

ThenX1 + · · ·+Xnp

n) N(0, 1) .

The Lindeberg condition above is somewhat complicated but you can think of it assaying that each variable |Xk| is not too large. How large is large? We do not want any Xi

to be larger than orderpn. In other words, the central limit theorem is valid in situations

where many small variables combine together and their individual e↵ects wash away. To

3

see (heuristically) why this large behavior is undesirable, suppose that the variables Xi arelikely to be of order

pn and further suppose that a central limit theorem holds. Then

X1 + · · ·+Xn ⇠pnN(0, 1)

but Xn+1 is also of orderpn. Therefore

X1 + · · ·+Xn+1 ⇠pnN(0, 1) +Xn+1 ,

where Xn+1 has a non-negligible e↵ect. This sum need not be approximately normal.

Proof. Write �i for the characteristic function of Xi and apply (2) with t/pn and ✏

pn to

get ��i(t/pn)�

✓1� t2/2

n

◆�� t21

nEX2

i 1{|Xi|�✏pn} + ✏

pn|t|3

n3/2EX2

i 1{|Xi|<✏pn} .

Therefore

nX

i=1

��i(t/pn)�

✓1� t2/2

n

◆�� t21

n

nX

i=1

EX2i 1{|Xi|�✏

pn} + ✏

|t|3

n

nX

i=1

EX2i

= t21

n

nX

i=1

EX2i 1{|Xi|�✏

pn} + ✏|t|3 .

If we take n ! 1 and use the Lindeberg condition, the right side is no bigger than ✏|t|3.Since ✏ is arbitrary, this shows

nX

i=1

��i(t/pn)�

✓1� t2/2

n

◆�� ! 0 as n ! 1 . (3)

We will finish next time using a variant of Lemma 1.2.

4

Lecture 41

Last time we almost finished proving:

Theorem 0.1 (Lindeberg central limit theorem – simplified version). Let X1, X2, . . . be anindependent sequence with EXi = 0, Var Xi = 1 for all i. Assume

1

n

nX

k=1

EX2k1{|Xk|�✏

pn} ! 0 as n ! 1 for each ✏ > 0 .

ThenX1 + · · ·+Xnp

n) N(0, 1) .

Proof. We already showed

nX

i=1

��i(t/pn)�

✓1� t2/2

n

◆�� ! 0 as n ! 1 . (1)

We can finish using the following lemma. Its proof is almost identical to the one from before(using induction).

Lemma 0.2. Let u1, . . . , un and w1, . . . , wn be complex numbers of absolute value at most 1.Then ��

nY

i=1

ui �nY

i=1

wi

�� nX

i=1

|ui � wi| .

We want to apply this lemma to the numbers ui = �i(t/pn) and wi = 1 � t2/2

n . For nlarge, each wi has absolute value at most 1. This is also true for all n for ui. So

��

nY

i=1

�i(t/pn)�

✓1� t2/2

n

◆n�� ! 0

and thereforeQn

i=1 �i(t/pn) ! e�t2/2.

0.1 Triangular arrays

The real Lindeberg CLT is more general. The variables Xn can have arbitrary variances, notnecessarily 1. Also it is valid for arrays of independent variables.

So let {Xn,k : n � 1, 1 k kn} be an array of independent variables. AssumeEXn,k = 0 for all k, n and define

�2n,k = Var Xn,k, s2n = Var Sn ,

where Sn = Xn,1 + · · ·+Xn,kn .

Theorem 0.3 (Lindeberg CLT). Assume that for each ✏ > 0,

1

s2n

knX

k=1

EX2n,k1{|Xn,k|�✏sn} ! 0 as n ! 1 .

Then Sn/sn ) N(0, 1).

From the point of view of Lp spaces, the normalization sn can be seen as thus. Thevariable Sn has two moments and so is in L2. It also has mean zero. Therefore if we wantto normalize it to have L2-norm equal to 1, we should normalize by its L2-norm:

kSnk2 =pES2

n =pVar Sn = sn .

This means that Sn/sn has norm 1. Therefore the sequence (Sn) is simply a sequence of unitvectors in L2. Since a N(0, 1) variable has L2-norm 1, it would make sense that to get this(non-degenerate) distribution in the limit, we should normalize the Sn’s to have norm 1.

Proof. Exactly the same proof as last time gets us almost all the way. Precisely, the samemethod shows that if we set �n,k as the characteristic function of Xn,k, then

knX

k=1

��n,k(t/sn)�✓1�

t2�2n,k

2s2n

◆�� ! 0 (2)

and therefore ��

knY

k=1

�n,k(t/sn)�knY

k=1

✓1�

t2�2n,k

2s2n

◆�� ! 0 . (3)

For this last equation we need to know that each term 1� t2�2n,k

2snis of absolute value at most

1 for n large. This follows by the Lindeberg condition. Indeed, note that

�2n,k ✏2s2n + EX2

n,k1{|Xn,k|�✏sn} ,

so�2n,k

s2n ✏2 +

1

s2n

knX

k=1

EX2n,k1{|Xn,k|�✏sn}

and by the Lindeberg condition, we find

max1kkn

�2n,k

s2n! 0 .

We are then left to show that��

knY

k=1

✓1�

t2�2n,k

2s2n

◆� e�t2/2

�� ! 0 .

2

Using the fact thatPkn

k=1 �2n,k = s2n, we can show

��

knY

k=1

✓1�

t2�2n,k

2s2n

◆�

knY

k=1

e�

t2�2n,k

2s2n

�� ! 0 .

To show this, we can use the lemma again and just show

knX

k=1

��1�t2�2

n,k

2s2n� e

�t2�2

n,k

2s2n

�� ! 0 . (4)

For this we use the inequality valid for complex z

|ez � 1� z| =

��

1X

n=2

zn

n!

�� |z|21X

n=2

|z|n�2

n! |z|2e|z| .

Therefore (4) is bounded by

knX

k=1

✓t4�4

nk

4s4nexp

✓t2�2

nk

2s2n

◆◆ t4 max

1kkn

�2nk

s2nexp

✓t2�2

nk

2s2n

◆� knX

k=1

�2n,k

s2n

= t4 max1kkn

�2nk

s2nexp

✓t2�2

nk

2s2n

◆�.

By max1kkn�2n,k

s2n! 0, this goes to 0 and we are done.

3

Lecture 42

Because the Lindeberg CLT is somewhat complicated and the condition appears odd, let

us give some remarks and examples.

• This theorem generalizes the standard CLT. To see how, take kn = n and Xn,k to be

i.i.d. with mean zero and variance 1. Then the sum in the Lindeberg condition is

1

s2n

knX

k=1

EX2n,k1{|Xn,k|�✏sn} =

1

n

nX

k=1

EX21{|X|�✏pn} ,

where X is a variable with the same distribution as the Xn,k’s. We can rewrite this as

1

n

nX

k=1

EX21{X2�✏2n} = EX21{X2�✏2n} .

As n ! 1, the indicator approaches 1{X2=1} so by the dominated convergence the-

orem, we get 0. Here we can see that the Lindeberg condition is precisely that the

variable X is not “large” – it is not larger than ✏pn.

• There is a slightly simpler condition that implies Lindeberg. It is called the Lyapunovcondition: for some q > 2,

1

sqn

knX

k=1

E|Xn,k|q ! 0 .

To see why, suppose it holds and estimate

1

s2n

knX

k=1

E|Xn,k|21{|Xn,k|�✏sn} 1

s2n

knX

k=1

E|Xn,k|q

(✏sn)q�2

=1

✏q�2

1

sqn

knX

k=1

E|Xn,k|q .

So if Lyapunov holds, the right side goes to 0, and therefore Lindeberg holds, giving

Sn/sn ) N(0, 1).

We can phrase the Lyapunov condition in terms of Lp-norms. Since sn = kSnk2 =

kXn,1 + · · ·+Xn,knk2, we can rewrite the condition as

knX

k=1

✓kXn,kkq

kXn,1 + · · ·+Xn,knk2

◆q

! 0 .

Recall that by Jensen, kXn,kk2 kXn,kkq. So this is a stronger condition than

max1kkn

kXn,kk2kXn,1 + · · ·+Xn,knk2

! 0 ,

which means, as in Lindeberg, that each Xn,k contributes a small amount to the sum

Sn.

• Using the Lyapunov condition, we can show: if |Xn,k| Mn almost surely and

Mn/sn ! 1 then Sn/sn ) N(0, 1). (Note here that, yet again, the condition as-

sures that none of the variables are “too big.”)

Proof. Estimate using Lyapunov with q = 3:

1

s3n

knX

k=1

E|Xn,k|3 Mn

sn

1

s2n

knX

k=1

EX2n,k =

Mn

sn! 0 .

Example. Let � be a permutation on n letters, selected uniformly at random among all

such permutations. Each permutation has a cycle representation and it can be standardized

by first completing the cycle of 1, then selecting the smallest number not in this cycle,

completing it, and so on. The cycle representation has n elements; let X(n)1 , . . . , X

(n)n be

either 0 or 1, with the k-th variable 1 if and only if the k-th position in the representation

is at the end of a cycle.

Then each X(n)k has the distribution

P(X(n)k = 1) =

1

n� k + 1and P(X(n)

k = 0) = 1� 1

n� k + 1.

The reason is that we can imagine generating a uniform permutation by selecting an image

for 1, an image for the image of 1, an image for the image of the image of 1, and so on. At

the k-th stage, we are selecting the number that corresponds to X(n)k and there is exactly

one choice of n� k + 1 that will make it close a cycle.

Furthermore an argument similar to the above one actually shows that the X(n)k ’s are

independent. For the full argument, see Billingsley, Example 5.6.

So we would like to apply the Lindeberg CLT. We simply compute the variance of the

sum Sn = X(n)1 + · · ·+X

(n)n . By independence it is

Var Sn =

nX

k=1

Var X(n)k

and to compute the variance of X(n)k we just set p =

1n�k+1 and compute

EX(n)k = p, E(X(n)

k )2= p ) Var X

(n)k = p� p

2= p(1� p)

and this equalsn�k

(n�k+1)2 . Therefore

Var Sn =

nX

k=1

n� k

(n� k + 1)2=

nX

k=1

k � 1

k2⇠ log n+O(1) ! 1 .

2

Because the variables are bounded by a constant, the last example above shows the CLT

(verifying Lyapunov). To state it, find the expectation:

ESn =

nX

k=1

1

n� k + 1= log n+O(1) ,

soSn � log np

log n) N(0, 1) .

3

1 Sample spaces and sigma-algebraspeople.math.gatech.edu/~mdamron6/teaching/Fall... · Proposition...

Documents

Transcript of 1 Sample spaces and sigma-algebraspeople.math.gatech.edu/~mdamron6/teaching/Fall... · Proposition...