probability-theory_Bass.pdf


  • 8/9/2019 probability-theory_Bass.pdf


    Probability Theory

    Richard F. Bass

These notes are © 1998 by Richard F. Bass. They may be used for personal or classroom purposes, but not for commercial purposes.

    Revised 2001.

    1. Basic notions.

    A  probability  or   probability measure   is a measure whose total mass is one. Because the origins of 

    probability are in statistics rather than analysis, some of the terminology is different. For example, instead

of denoting a measure space by (X, A, µ), probabilists use (Ω, F, P). So here Ω is a set, F is called a σ-field (which is the same thing as a σ-algebra), and P is a measure with P(Ω) = 1. Elements of F are called events. Elements of Ω are denoted ω.

Instead of saying a property occurs almost everywhere, we talk about properties occurring almost surely, written a.s. Real-valued measurable functions from Ω to R are called random variables and are usually denoted by X or Y or other capital letters. We often abbreviate "random variable" by r.v.

We let Ac = {ω ∈ Ω : ω ∉ A} (called the complement of A) and B − A = B ∩ Ac.

Integration (in the sense of Lebesgue) is called expectation or expected value, and we write EX for ∫ X dP. The notation E[X; A] is often used for ∫_A X dP.

The random variable 1A is the function that is one if ω ∈ A and zero otherwise. It is called the indicator of A (the name characteristic function in probability refers to the Fourier transform). Events such as (ω : X(ω) > a) are almost always abbreviated by (X > a).

    Given a random variable  X , we can define a probability on  R  by

PX(A) = P(X ∈ A), A ⊂ R.   (1.1)

    The probability  PX  is called the   law  of  X  or the  distribution  of  X . We define F X   : R → [0, 1] by

    F X(x) =  PX((−∞, x]) =  P(X  ≤ x).   (1.2)

    The function  F X  is called the  distribution function  of  X .

As an example, let Ω = {H, T}, F all subsets of Ω (there are 4 of them), P(H) = P(T) = 1/2. Let X(H) = 1 and X(T) = 0. Then PX = (1/2)δ0 + (1/2)δ1, where δx is point mass at x, that is, δx(A) = 1 if x ∈ A and 0 otherwise. FX(a) = 0 if a < 0, FX(a) = 1/2 if 0 ≤ a < 1, and FX(a) = 1 if a ≥ 1.


    Proposition 1.2.   Suppose   F   is a distribution function. There exists a random variable   X   such that

    F   = F X .

Proof. Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Define X(ω) = sup{x : F(x) < ω}. It is routine to check that FX = F.

In the above proof, essentially X = F⁻¹. However F may have jumps or be constant over some intervals, so some care is needed in defining X.
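The recipe in the proof can be tried numerically. Below is a minimal sketch in Python (not from the notes): F is the distribution function of the coin-flip example above, X(ω) = sup{x : F(x) < ω} is approximated over a finite grid of x values, and ω is drawn from Lebesgue measure on [0, 1].

```python
import random

def F(x):
    # Distribution function of the coin example: mass 1/2 at 0 and 1/2 at 1.
    if x < 0:
        return 0.0
    if x < 1:
        return 0.5
    return 1.0

def X(omega, grid):
    # X(omega) = sup{x : F(x) < omega}, approximated over a finite grid.
    candidates = [x for x in grid if F(x) < omega]
    return max(candidates) if candidates else min(grid)

grid = [i / 100 for i in range(-100, 201)]   # x ranging over [-1, 2]
random.seed(0)
samples = [X(random.random(), grid) for _ in range(2000)]
# Since F_X = F, about half the samples should land at (grid points next to) 1.
frac_upper = sum(1 for s in samples if s > 0.5) / len(samples)
```

The grid is only an approximation of the supremum over all of R; refining it moves the sampled values closer to the true atoms 0 and 1.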

    Certain distributions or laws are very common. We list some of them.

(a) Bernoulli. A random variable is Bernoulli if P(X = 1) = p, P(X = 0) = 1 − p for some p ∈ [0, 1].
(b) Binomial. This is defined by P(X = k) = C(n, k) p^k (1 − p)^(n−k), where C(n, k) is the binomial coefficient, n is a positive integer, 0 ≤ k ≤ n, and p ∈ [0, 1].
(c) Geometric. For p ∈ (0, 1) we set P(X = k) = (1 − p)p^k. Here k is a nonnegative integer.
(d) Poisson. For λ > 0 we set P(X = k) = e^(−λ) λ^k / k!. Again k is a nonnegative integer.
(e) Uniform. For some positive integer n, set P(X = k) = 1/n for 1 ≤ k ≤ n.

If F is absolutely continuous, we call f = F′ the density of F. Some examples of distributions characterized by densities are the following.

(f) Uniform on [a, b]. Define f(x) = (b − a)^(−1) 1_[a,b](x). This means that if X has a uniform distribution, then

P(X ∈ A) = ∫_A (1/(b − a)) 1_[a,b](x) dx.

(g) Exponential. For x > 0 let f(x) = λ e^(−λx).
(h) Standard normal. Define f(x) = (1/√(2π)) e^(−x²/2). So

P(X ∈ A) = (1/√(2π)) ∫_A e^(−x²/2) dx.

(i) N(µ, σ²). We shall see later that a standard normal has mean zero and variance one. If Z is a standard normal, then a N(µ, σ²) random variable has the same distribution as µ + σZ. It is an exercise in calculus to check that such a random variable has density

(1/(√(2π) σ)) e^(−(x−µ)²/2σ²).   (1.3)

(j) Cauchy. Here

f(x) = (1/π) · 1/(1 + x²).

    We can use the law of a random variable to calculate expectations.

    Proposition 1.3.   If  g  is bounded or nonnegative, then

E g(X) = ∫ g(x) PX(dx).

    Proof.   If  g  is the indicator of an event  A, this is just the definition of  PX . By linearity, the result holds for

    simple functions. By the monotone convergence theorem, the result holds for nonnegative functions, and by

    linearity again, it holds for bounded  g .
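For a law with finitely many atoms the integral in Proposition 1.3 is a finite sum, so the identity can be checked directly. A small sketch (the coin example above, with an arbitrary test function g chosen for illustration):

```python
# Sample space of the coin example and the law P_X = (1/2)delta_0 + (1/2)delta_1.
omega_prob = {"H": 0.5, "T": 0.5}
X = {"H": 1, "T": 0}
law = {0: 0.5, 1: 0.5}

def g(x):
    return x * x + 1          # any bounded test function

# E g(X), computed on Omega ...
lhs = sum(omega_prob[w] * g(X[w]) for w in omega_prob)
# ... and as the integral of g against the law P_X.
rhs = sum(p * g(x) for x, p in law.items())
```

Both sums equal 3/2 here, as Proposition 1.3 predicts.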



If FX has a density f, then PX(dx) = f(x) dx. So, for example, EX = ∫ x f(x) dx and EX² = ∫ x² f(x) dx. (We need E|X| finite to justify this if X is not necessarily nonnegative.)

We define the mean of a random variable to be its expectation, and the variance of a random variable is defined by

Var X = E(X − EX)².

For example, it is routine to see that the mean of a standard normal is zero and its variance is one. Note

Var X = E(X² − 2X EX + (EX)²) = EX² − (EX)².
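The identity Var X = EX² − (EX)² is easy to verify exactly for a discrete distribution. A sketch using exact rational arithmetic (values and weights are arbitrary choices for illustration):

```python
from fractions import Fraction as F

# A discrete random variable: P(X = x) = p for each (x, p) pair.
dist = [(F(1), F(1, 2)), (F(3), F(1, 3)), (F(6), F(1, 6))]

mean = sum(p * x for x, p in dist)                     # EX
var_def = sum(p * (x - mean) ** 2 for x, p in dist)    # E(X - EX)^2
second_moment = sum(p * x * x for x, p in dist)        # EX^2
var_short = second_moment - mean ** 2                  # EX^2 - (EX)^2
```

Both computations give 13/4 for this distribution.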

    Another equality that is useful is the following.

Proposition 1.4. If X ≥ 0 a.s. and p > 0, then

EX^p = ∫₀^∞ p λ^(p−1) P(X > λ) dλ.

    The proof will show that this equality is also valid if we replace  P(X > λ) by P(X  ≥ λ).

Proof. Use Fubini's theorem and write

∫₀^∞ p λ^(p−1) P(X > λ) dλ = E ∫₀^∞ p λ^(p−1) 1_(λ,∞)(X) dλ = E ∫₀^X p λ^(p−1) dλ = EX^p.
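When X takes finitely many integer values, P(X > λ) is constant on each interval [k, k + 1), so the dλ-integral in Proposition 1.4 is a finite sum that can be checked exactly. A sketch with p = 2 and X uniform on {1, 2, 3} (an illustrative choice):

```python
from fractions import Fraction as F

values = [F(1), F(2), F(3)]          # X uniform on {1, 2, 3}
prob = F(1, 3)
p = 2

lhs = sum(prob * v ** p for v in values)      # EX^p = (1 + 4 + 9)/3

def tail(lam):
    # P(X > lam)
    return sum(prob for v in values if v > lam)

# P(X > lam) is constant on [k, k+1); the antiderivative of p*lam^(p-1)
# is lam^p, so the integral reduces to a finite sum over k = 0, 1, 2.
rhs = sum(tail(F(k)) * (F(k + 1) ** p - F(k) ** p) for k in range(3))
```

Both sides come out to 14/3.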

    We need two elementary inequalities.

Proposition 1.5. (Chebyshev's inequality) If X ≥ 0, then

P(X ≥ a) ≤ EX/a.

Proof. We write

P(X ≥ a) = E 1_[a,∞)(X) ≤ E[(X/a) 1_[a,∞)(X)] ≤ EX/a,

since X/a is at least 1 when X ∈ [a, ∞).

    If we apply this to  X  = (Y  − EY )2, we obtain

    P(|Y  − EY | ≥ a) = P((Y  − EY )2 ≥ a2) ≤ Var Y /a2.   (1.4)

    This special case of Chebyshev’s inequality is sometimes itself referred to as Chebyshev’s inequality, while

    Proposition 1.5 is sometimes called the Markov inequality.
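Both the Markov form (Proposition 1.5) and the variance form (1.4) can be checked exactly on a small example. A sketch, with Y uniform on {0, 1, 2, 3} (an arbitrary illustrative choice):

```python
from fractions import Fraction as F

values = [F(0), F(1), F(2), F(3)]    # Y uniform on {0, 1, 2, 3}
prob = F(1, 4)

mean = sum(prob * v for v in values)                 # EY = 3/2
var = sum(prob * (v - mean) ** 2 for v in values)    # Var Y = 5/4

a = F(3, 2)
# Left side of (1.4): P(|Y - EY| >= a).
lhs = sum(prob for v in values if abs(v - mean) >= a)
# Right side: Var Y / a^2.
rhs = var / a ** 2
```

Here lhs = 1/2 and rhs = 5/9, so the inequality holds (and is fairly tight).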

The second inequality we need is Jensen's inequality, not to be confused with Jensen's formula of complex analysis.



Proposition 1.6. Suppose g is convex and X and g(X) are both integrable. Then

g(EX) ≤ E g(X).

Proof. One property of convex functions is that they lie above their tangent lines, and more generally their support lines. So if x0 ∈ R, we have

g(x) ≥ g(x0) + c(x − x0)

    for some constant  c. Take  x  =  X (ω) and take expectations to obtain

    E g(X ) ≥ g(x0) + c(EX  − x0).

    Now set  x0  equal to  EX .
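Jensen's inequality is easy to check for a particular convex g. A sketch with g(x) = x² and a three-point distribution (both illustrative choices, computed in exact arithmetic):

```python
from fractions import Fraction as F

dist = [(F(-1), F(1, 4)), (F(0), F(1, 2)), (F(2), F(1, 4))]

def g(x):
    return x * x          # a convex function

mean = sum(p * x for x, p in dist)       # EX = 1/4
lhs = g(mean)                            # g(EX) = 1/16
rhs = sum(p * g(x) for x, p in dist)     # E g(X) = 5/4
```

As the proposition guarantees, g(EX) = 1/16 ≤ 5/4 = E g(X); the gap is the variance in this quadratic case.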

If An is a sequence of sets, define (An i.o.), read "An infinitely often," by

    (An   i.o.) = ∩∞n=1 ∪∞i=n Ai.

    This set consists of those  ω  that are in infinitely many of the  An.

    A simple but very important proposition is the Borel-Cantelli lemma. It has two parts, and we prove

    the first part here, leaving the second part to the next section.

Proposition 1.7. (Borel-Cantelli lemma) If Σ_n P(An) < ∞, then P(An i.o.) = 0.

Proof. For each n, (An i.o.) ⊂ ∪_{i=n}^∞ Ai, so P(An i.o.) ≤ Σ_{i=n}^∞ P(Ai). Since the series converges, the right hand side tends to 0 as n → ∞.


{(X ∈ A) : A a Borel subset of R}.) We define the independence of n σ-fields or n random variables in the obvious way.

Proposition 2.1 tells us that A and B are independent if the random variables 1A and 1B are independent, so the definitions above are consistent.

    If  f   and  g  are Borel functions and  X   and  Y   are independent, then  f (X ) and  g (Y ) are independent.

    This follows because the σ -field generated by  f (X ) is a sub-σ-field of the one generated by  X , and similarlyfor g(Y ).

Let FX,Y(x, y) = P(X ≤ x, Y ≤ y) denote the joint distribution function of X and Y. (The comma inside the set means "and.")

    Proposition 2.2.   F X,Y  (x, y) =  F X(x)F Y  (y)   if and only if  X   and  Y   are independent.

Proof. If X and Y are independent, then 1_(−∞,x](X) and 1_(−∞,y](Y) are independent by the above comments. Using the definition of independence, this shows FX,Y(x, y) = FX(x)FY(y).

Conversely, suppose the equality holds. Fix y and let My denote the collection of sets A for which P(X ∈ A, Y ≤ y) = P(X ∈ A)P(Y ≤ y). My contains all sets of the form (−∞, x]. It follows by additivity that My contains all sets of the form (x, z], and then all sets that are finite unions of such half-open, half-closed intervals. Note that the collection of finite unions of such intervals, A, is an algebra generating the Borel σ-field. It is clear that My is a monotone class, so by the monotone class lemma, My contains the Borel σ-field.

For a fixed set A, let MA denote the collection of sets B for which P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B). Again, MA is a monotone class and by the preceding paragraph contains the σ-field generated by the collection of finite unions of intervals of the form (x, z], hence contains the Borel sets. Therefore X

    and  Y   are independent.

    The following is known as the multiplication theorem.

    Proposition 2.3.   If  X ,  Y , and  X Y   are integrable and  X   and  Y   are independent, then  EXY   = E X EY .

    Proof.   Consider the random variables in   σ(X ) (the   σ-field generated by   X ) and   σ(Y ) for which the

    multiplication theorem is true. It holds for indicators by the definition of   X   and   Y   being independent.

    It holds for simple random variables, that is, linear combinations of indicators, by linearity of both sides.

    It holds for nonnegative random variables by monotone convergence. And it holds for integrable random

    variables by linearity again.

Let us give an example of independent random variables. Let Ω = Ω1 × Ω2 and let P = P1 × P2, where (Ωi, Fi, Pi) are probability spaces for i = 1, 2. We use the product σ-field. Then it is clear that F1 and F2 are independent by the definition of P. If X1 is a random variable such that X1(ω1, ω2) depends only on ω1 and X2 depends only on ω2, then X1 and X2 are independent.

    This example can be extended to  n  independent random variables, and in fact, if one has independent

    random variables, one can always view them as coming from a product space. We will not use this fact.

    Later on, we will talk about countable sequences of independent r.v.s and the reader may wonder whether

    such things can exist. That it can is a consequence of the Kolmogorov extension theorem; see PTA, for

    example.

If X1, . . . , Xn are independent, then so are X1 − EX1, . . . , Xn − EXn. Assuming everything is integrable,

E[(X1 − EX1) + · · · + (Xn − EXn)]² = E(X1 − EX1)² + · · · + E(Xn − EXn)²,



    using the multiplication theorem to show that the expectations of the cross product terms are zero. We have

    thus shown

Var(X1 + · · · + Xn) = Var X1 + · · · + Var Xn.   (2.1)

We finish up this section by proving the second half of the Borel-Cantelli lemma.

Proposition 2.4. Suppose An is a sequence of independent events. If Σ_n P(An) = ∞, then P(An i.o.) = 1.

Note that here the An are independent, while in the first half of the Borel-Cantelli lemma no such assumption was necessary.

    Proof.   Note

P(∪_{i=n}^N Ai) = 1 − P(∩_{i=n}^N Ai^c) = 1 − Π_{i=n}^N P(Ai^c) = 1 − Π_{i=n}^N (1 − P(Ai)).

By the mean value theorem, 1 − x ≤ e^(−x), so the right hand side is greater than or equal to 1 − exp(−Σ_{i=n}^N P(Ai)). As N → ∞, this tends to 1, so P(∪_{i=n}^∞ Ai) = 1. This holds for all n, which proves the result.

    3. Convergence.

    In this section we consider three ways a sequence of r.v.s  X n  can converge.

We say Xn converges to X almost surely if the event that Xn does not converge to X has probability zero. Xn converges to X in probability if for each ε > 0, P(|Xn − X| > ε) → 0 as n → ∞. Xn converges to X in L^p if E|Xn − X|^p → 0 as n → ∞.

    The following proposition shows some relationships among the types of convergence.

Proposition 3.1. (a) If Xn → X a.s., then Xn → X in probability.
(b) If Xn → X in L^p, then Xn → X in probability.
(c) If Xn → X in probability, there exists a subsequence nj such that X_{nj} converges to X almost surely.

Proof. To prove (a), note Xn − X tends to 0 almost surely, so 1_{(−ε,ε)^c}(Xn − X) also converges to 0 almost surely. Now apply the dominated convergence theorem.

    (b) comes from Chebyshev’s inequality:

P(|Xn − X| > ε) = P(|Xn − X|^p > ε^p) ≤ E|Xn − X|^p/ε^p → 0

as n → ∞.

To prove (c), choose nj larger than n_{j−1} such that P(|Xn − X| > 2^(−j)) < 2^(−j) whenever n ≥ nj. So if we let Ai = (|X_{nj} − X| > 2^(−i) for some j ≥ i), then P(Ai) ≤ 2^(−i+1). By the Borel-Cantelli lemma, P(Ai i.o.) = 0. This implies X_{nj} → X on the complement of (Ai i.o.).

    Let us give some examples to show there need not be any other implications among the three types

    of convergence.

Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Let Xn = e^n 1_(0,1/n). Then clearly Xn converges to 0 almost surely and in probability, but EXn^p = e^(np)/n → ∞ for any p.

Let Ω be the unit circle, and let P be Lebesgue measure on the circle normalized to have total mass 1. Let tn = Σ_{i=1}^n i^(−1), and let An = {θ : t_{n−1} ≤ θ < tn (mod 2π)}. Let Xn = 1_{An}. Since Σ i^(−1) diverges, any point on the unit circle will be in infinitely many An, so Xn does not converge almost surely to 0. But P(An) = 1/(2πn) → 0, so Xn → 0 in probability and in L^p.



    4. Weak law of large numbers.

    Suppose  X n   is a sequence of independent random variables. Suppose also that they all have the same

    distribution, that is,   F Xn   =   F X1   for all  n. This situation comes up so often it has a name,  independent,

    identically distributed , which is abbreviated  i.i.d.

Define Sn = Σ_{i=1}^n Xi. Sn is called a partial sum process. Sn/n is the average value of the first n of

    the X i’s.

Theorem 4.1. (Weak law of large numbers) Suppose the Xi are i.i.d. and EX1² < ∞. Then Sn/n → EX1 in probability.

Proof. If ε > 0, by Chebyshev's inequality,

P(|Sn/n − EX1| > ε) ≤ Var(Sn/n)/ε² = (Σ_{i=1}^n Var Xi)/(n²ε²) = n Var X1/(n²ε²).   (4.1)

Since EX1² < ∞, Var X1 < ∞, and the right hand side of (4.1) tends to 0 as n → ∞. This proves the theorem.

As an application, let f be a continuous function on [0, 1], fix x ∈ [0, 1], and let the Xi be i.i.d. Bernoulli random variables with parameter x, so that E f(Sn/n) is a polynomial in x (a Bernstein polynomial of f). Given ε > 0, choose δ so that |f(y) − f(x)| < ε/2 whenever |y − x| ≤ δ, and let M = 2 sup |f|. Then

|E f(Sn/n) − f(x)| ≤ E|f(Sn/n) − f(x)| ≤ M P(|Sn/n − x| > δ) + ε/2.

By (4.1) the first term will be less than

M Var X1/(nδ²) ≤ M x(1 − x)/(nδ²) ≤ M/(nδ²),

which will be less than ε/2 if n is large enough, uniformly in x.
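This is the classical Bernstein polynomial proof of the Weierstrass approximation theorem, and the polynomials are easy to compute. A sketch, with f(x) = |x − 1/2| as a hypothetical test function:

```python
from math import comb

def bernstein(f, n, x):
    # B_n f(x) = E f(S_n / n), where S_n is Binomial(n, x): the expectation of
    # f of the average of n independent Bernoulli(x) variables.
    return sum(f(k / n) * comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(n + 1))

def sup_error(f, n, xs):
    # Approximate sup-norm error of B_n f over a grid of x values.
    return max(abs(bernstein(f, n, x) - f(x)) for x in xs)

f = lambda x: abs(x - 0.5)           # a continuous test function on [0, 1]
xs = [i / 50 for i in range(51)]
err_20 = sup_error(f, 20, xs)
err_200 = sup_error(f, 200, xs)      # smaller: the convergence is uniform in x
```

Increasing n shrinks the error uniformly over [0, 1], exactly as the estimate above predicts.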

    5. Techniques related to almost sure convergence.

Our aim is the strong law of large numbers (SLLN), which says that Sn/n converges to EX1 almost surely if E|X1| < ∞.


Proposition 5.1. Suppose Xi is an i.i.d. sequence with EXi⁴ < ∞ and EXi = 0. Then Sn/n → 0 a.s.

Proof. By Chebyshev's inequality,

P(|Sn/n| > ε) ≤ E(Sn/n)⁴/ε⁴ = ESn⁴/(n⁴ε⁴).

If we expand Sn⁴, we will have terms involving Xi⁴, terms involving Xi²Xj², terms involving Xi³Xj, terms involving Xi²XjXk, and terms involving XiXjXkXl, with i, j, k, l all being different. By the multiplication theorem and the fact that the Xi have mean 0, the expectations of all the terms will be 0 except for those of the first two types. So

ESn⁴ = Σ_{i=1}^n EXi⁴ + 3 Σ_{i≠j} EXi² EXj².

By the finiteness assumption, the first term on the right is bounded by c1 n. By Cauchy-Schwarz, EXi² ≤ (EXi⁴)^(1/2) < ∞, so the second term is bounded by c2 n². Therefore

P(|Sn/n| > ε) ≤ c3/(n²ε⁴).

    Consequently P(|S n/n| > ε  i.o.) = 0 by Borel-Cantelli. Since ε   is arbitrary, this implies  S n/n → 0 a.s.
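The conclusion is easy to watch numerically. A hedged sketch, with the uniform distribution on [−1, 1] playing the role of the Xi (mean zero, all moments finite; an illustrative choice, not from the notes):

```python
import random

random.seed(1)

def running_average(n):
    # S_n / n for X_i i.i.d. uniform on [-1, 1]: mean zero, EX_i^4 < infinity.
    return sum(random.uniform(-1, 1) for _ in range(n)) / n

avg_large = running_average(100_000)
# |S_n/n| should be small: its standard deviation is sqrt(1/(3n)).
```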

    Before we can prove the SLLN assuming only the finiteness of first moments, we need some prelimi-

    naries.

Proposition 5.2. If Y ≥ 0, then

Σ_{n=1}^∞ P(Y > n) ≤ EY ≤ Σ_{n=0}^∞ P(Y > n).

Proof. By Proposition 1.4 with p = 1, EY = ∫₀^∞ P(Y > x) dx. P(Y > x) is nonincreasing in x, so the integral is bounded above by Σ_{n=0}^∞ P(Y > n) and bounded below by Σ_{n=1}^∞ P(Y > n).

If Xi is a sequence of r.v.s, the tail σ-field is defined by ∩_{n=1}^∞ σ(Xn, X_{n+1}, . . .). An example of an event in the tail σ-field is (lim sup_{n→∞} Xn > a). Another example is (lim sup_{n→∞} Sn/n > a). The reason for this is that if k < n is fixed,

Sn/n = Sk/n + (Σ_{i=k+1}^n Xi)/n.

The first term on the right tends to 0 as n → ∞. So lim sup Sn/n = lim sup (Σ_{i=k+1}^n Xi)/n, which is in σ(X_{k+1}, X_{k+2}, . . .). This holds for each k. The set (lim sup Sn > a) is easily seen not to be in the tail σ-field.

    Theorem 5.3.   (Kolmogorov 0-1 law)   If the   X i   are independent, then the events in the tail   σ-field have 

     probability 0 or 1.

    This implies that in the case of i.i.d. random variables, if  S n/n has a limit with positive probability,

    then it has a limit with probability one, and the limit must be a constant.

Proof. Let M be the collection of sets in σ(X_{n+1}, . . .) that are independent of every set in σ(X1, . . . , Xn). M is easily seen to be a monotone class and it contains σ(X_{n+1}, . . . , XN) for every N > n. Therefore M must be equal to σ(X_{n+1}, . . .).

If A is in the tail σ-field, then for each n, A is independent of σ(X1, . . . , Xn). The class MA of sets independent of A is a monotone class, hence is a σ-field containing σ(X1, . . . , Xn) for each n. Therefore MA contains σ(X1, . . .).



    We thus have that the event  A  is independent of itself, or

    P(A) =  P(A ∩ A) = P(A)P(A) =  P(A)2.

    This implies  P(A) is zero or one.

The next proposition shows that in considering a law of large numbers we can consider truncated random variables.

Proposition 5.4. Suppose Xi is an i.i.d. sequence of r.v.s with E|X1| < ∞, and let Yi = Xi 1(|Xi| ≤ i). Then Sn/n converges a.s. if and only if (Σ_{i=1}^n Yi)/n converges a.s., and in that case the limits are equal.

Proof. Let An = (|Xn| > n). Then P(An) = P(|Xn| > n) = P(|X1| > n). Since E|X1| < ∞, Σ_n P(|X1| > n) < ∞ by Proposition 5.2, so by the Borel-Cantelli lemma P(An i.o.) = 0. Hence, almost surely, Xn = Yn for all but finitely many n, and the two averages differ by a term that tends to 0.


    6. Strong law of large numbers.

This section is devoted to a proof of Kolmogorov's strong law of large numbers. We showed earlier that if EXi² < ∞, then Sn/n converges to EX1 in probability, and that under a fourth moment assumption the convergence is almost sure. Here we show the strong law (SLLN): if one has a finite first moment, then there is almost sure convergence.

    First we need a lemma.

Lemma 6.1. Suppose Vi is a sequence of independent r.v.s, each with mean 0. Let Wn = Σ_{i=1}^n Vi. If Σ_{i=1}^∞ Var Vi < ∞, then Wn converges almost surely.

Proof. Choose nj > n_{j−1} such that Σ_{i=nj}^∞ Var Vi < 2^(−3j). If n > nj, then applying Kolmogorov's inequality shows that

P(max_{nj≤i≤n} |Wi − W_{nj}| > 2^(−j)) ≤ 2^(−3j)/2^(−2j) = 2^(−j).

Letting n → ∞, we have P(Aj) ≤ 2^(−j), where

Aj = (max_{nj≤i} |Wi − W_{nj}| > 2^(−j)).

    By the Borel-Cantelli lemma,  P(Aj  i.o.) = 0.

Suppose ω ∉ (Aj i.o.). Let ε > 0. Choose j large enough so that 2^(−j+1) < ε and ω ∉ Aj. If n, m > nj, then

|Wn − Wm| ≤ |Wn − W_{nj}| + |Wm − W_{nj}| ≤ 2^(−j+1) < ε.

    Since ε  is arbitrary,  W n(ω) is a Cauchy sequence, and hence converges.
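Lemma 6.1 applies, for instance, to Vi = εi/i with εi = ±1 fair and independent: Var Vi = 1/i² is summable, so the random series Σ εi/i converges a.s. even though Σ 1/i diverges. A simulation sketch (the choice of Vi is illustrative, not from the notes):

```python
import random

random.seed(2)

# Partial sums W_n of V_i = eps_i / i, eps_i = +-1 with probability 1/2 each.
partial, w = [], 0.0
for i in range(1, 5001):
    w += random.choice((-1, 1)) / i
    partial.append(w)

# The tail of the path barely moves: the variance of the tail sum past
# n = 1000 is about 1/1000, so the path's oscillation after that is small.
tail_osc = max(partial[999:]) - min(partial[999:])
```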

Theorem 6.2. (SLLN) Let Xi be a sequence of i.i.d. random variables. Then Sn/n converges almost surely if and only if E|X1| < ∞, and in that case the limit is EX1.

Proof. Suppose E|X1| < ∞ and let Yi = Xi 1(|Xi| ≤ i) as in Proposition 5.4. Note EYi = E[X1; |X1| ≤ i] → EX1.


The convergence follows by the dominated convergence theorem, since the integrands are bounded by |X1|. To estimate the second moment of the Yi, we write

EYi² = ∫₀^∞ 2y P(|Yi| ≥ y) dy = ∫₀^i 2y P(|Yi| ≥ y) dy ≤ ∫₀^i 2y P(|X1| ≥ y) dy,

and so

Σ_{i=1}^∞ E(Yi²/i²) ≤ Σ_{i=1}^∞ (1/i²) ∫₀^i 2y P(|X1| ≥ y) dy
= 2 Σ_{i=1}^∞ (1/i²) ∫₀^∞ 1(y ≤ i) y P(|X1| ≥ y) dy
= 2 ∫₀^∞ Σ_{i=1}^∞ (1/i²) 1(y ≤ i) y P(|X1| ≥ y) dy
≤ 4 ∫₀^∞ (1/y) y P(|X1| ≥ y) dy
= 4 ∫₀^∞ P(|X1| ≥ y) dy = 4E|X1| < ∞.

7. Uniform integrability.

A sequence of random variables Xi is uniformly integrable if sup_i E[|Xi|; |Xi| > M] → 0 as M → ∞. A sufficient condition is that sup_i Eϕ(|Xi|) < ∞ for some nonnegative ϕ with ϕ(x)/x → ∞ as x → ∞: given ε > 0, choose M so that x/ϕ(x) ≤ ε whenever x > M. Then on the event (|Xi| > M),

|Xi| = (|Xi|/ϕ(|Xi|)) ϕ(|Xi|) 1(|Xi| > M) ≤ ε ϕ(|Xi|),

so E[|Xi|; |Xi| > M] ≤ ε sup_i Eϕ(|Xi|).



    Proposition 7.2.   If  X n  and  Y n   are two uniformly integrable sequences, then  X n +  Y n   is also a uniformly 

    integrable sequence.

Proof. Since there exists M0 such that sup_n E[|Xn|; |Xn| > M0] < 1 and sup_n E[|Yn|; |Yn| > M0] < 1, then sup_n E|Xn| ≤ M0 + 1, and similarly for the Yn. Let ε > 0 and choose M1 > 4(M0 + 1)/ε such that sup_n E[|Xn|; |Xn| > M1] < ε/4 and sup_n E[|Yn|; |Yn| > M1] < ε/4. Let M2 = 4M1².

Note P(|Xn| + |Yn| > M2) ≤ (E|Xn| + E|Yn|)/M2 ≤ ε/(4M1) by Chebyshev's inequality. Then

E[|Xn + Yn|; |Xn + Yn| > M2] ≤ E[|Xn|; |Xn| > M1] + E[|Xn|; |Xn| ≤ M1, |Xn + Yn| > M2] + E[|Yn|; |Yn| > M1] + E[|Yn|; |Yn| ≤ M1, |Xn + Yn| > M2].

    The first and third terms on the right are each less than  ε/4 by our choice of  M 1. The second and fourth

    terms are each less than  M 1P(|X n + Y n| > M 2) ≤ ε/2.

    The main result we need in this section is Vitali’s convergence theorem.

    Theorem 7.3.   If  X n → X  almost surely and the  X n  are uniformly integrable, then  E |X n −X | → 0.

Proof. By the above proposition, Xn − X is uniformly integrable and tends to 0 a.s., so without loss of generality we may assume X = 0. Let ε > 0 and choose M such that sup_n E[|Xn|; |Xn| > M] < ε. Then

E|Xn| ≤ E[|Xn|; |Xn| > M] + E[|Xn|; |Xn| ≤ M] ≤ ε + E[|Xn| 1(|Xn| ≤ M)].

The second term on the right goes to 0 by dominated convergence.

    8. Complements to the SLLN.

Proposition 8.1. Suppose Xi is an i.i.d. sequence and E|X1| < ∞. Then

E|Sn/n − EX1| → 0.

Proof. Without loss of generality we may assume EX1 = 0. By the SLLN, Sn/n → 0 a.s. So we need to show that the sequence Sn/n is uniformly integrable.

Pick M1 such that E[|X1|; |X1| > M1] < ε/2. Pick M2 = M1 E|X1|/ε. So

P(|Sn/n| > M2) ≤ E|Sn|/(n M2) ≤ E|X1|/M2 = ε/M1.

We used here E|Sn| ≤ Σ_{i=1}^n E|Xi| = n E|X1|. We then have

E[|Xi|; |Sn/n| > M2] ≤ E[|Xi|; |Xi| > M1] + E[|Xi|; |Xi| ≤ M1, |Sn/n| > M2] ≤ ε + M1 P(|Sn/n| > M2) ≤ 2ε.

Finally,

E[|Sn/n|; |Sn/n| > M2] ≤ (1/n) Σ_{i=1}^n E[|Xi|; |Sn/n| > M2] ≤ 2ε.

    We now consider the “three series criterion.” We prove the “if” portion here and defer the “only if”

    to Section 20.



Theorem 8.2. Let Xi be a sequence of independent random variables, A > 0, and Yi = Xi 1(|Xi| ≤ A). Then Σ Xi converges a.s. if and only if all of the following three series converge: (a) Σ_n P(|Xn| > A); (b) Σ_i EYi; (c) Σ_i Var Yi.

Proof of "if" part. Since (c) holds, Σ (Yi − EYi) converges a.s. by Lemma 6.1. Since (b) holds, taking the difference shows Σ Yi converges a.s. Since (a) holds, Σ_i P(Xi ≠ Yi) = Σ_i P(|Xi| > A) < ∞, so by the Borel-Cantelli lemma, a.s. Xi = Yi for all but finitely many i. Hence Σ Xi converges a.s.

9. Conditional expectation.

If F ⊆ G are σ-fields and X is an integrable random variable, the conditional expectation of X given F, written E[X | F], is an F measurable random variable such that E[E[X | F]; A] = E[X; A] for every A ∈ F. We write P(A | F) for E[1A | F]. If the Ai are disjoint events with P(Ai) > 0 for all i, and F is the σ-field generated by the Ais, then

This follows since the right-hand side is F measurable and its expectation over any set Ai is P(A ∩ Ai).

As an example, suppose we toss a fair coin independently 5 times and let Xi be 1 or 0 depending on whether the ith toss was a heads or tails. Let A be the event that there were 5 heads and let Fi = σ(X1, . . . , Xi). Then P(A) = 1/32 while P(A | F1) is equal to 1/16 on the event (X1 = 1) and 0 on the event (X1 = 0). P(A | F2) is equal to 1/8 on the event (X1 = 1, X2 = 1) and 0 otherwise.
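The displayed formula for P(A | F) is just an average over the atoms of F, which can be checked by brute force for the coin example. A sketch (not from the notes) enumerating Ω = {0, 1}⁵:

```python
from itertools import product

# All 2^5 equally likely outcomes of 5 fair coin tosses (1 = heads).
omega = list(product((0, 1), repeat=5))
A = set(w for w in omega if all(x == 1 for x in w))   # the event "5 heads"

def cond_prob(i, w):
    # P(A | F_i) evaluated at w, with F_i = sigma(X_1, ..., X_i): the formula
    # averages 1_A over the atom of outcomes agreeing with w on the first i tosses.
    atom = [v for v in omega if v[:i] == w[:i]]
    return sum(1 for v in atom if v in A) / len(atom)

p1_heads = cond_prob(1, (1, 0, 0, 0, 0))   # first toss heads
p1_tails = cond_prob(1, (0, 0, 0, 0, 0))   # first toss tails
p2_heads = cond_prob(2, (1, 1, 0, 0, 0))   # first two tosses heads
```

The three values reproduce 1/16, 0, and 1/8 from the example.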

    We have

    E [E [X  | F ]] = E X    (9.1)

because E[E[X | F]] = E[E[X | F]; Ω] = E[X; Ω] = EX.

The following is easy to establish.

Proposition 9.1. (a) If X ≥ Y are both integrable, then E[X | F] ≥ E[Y | F] a.s.
(b) If X and Y are integrable and a ∈ R, then E[aX + Y | F] = aE[X | F] + E[Y | F].

    It is easy to check that limit theorems such as monotone convergence and dominated convergence

    have conditional expectation versions, as do inequalities like Jensen’s and Chebyshev’s inequalities. Thus,

    for example, we have the following.

    Proposition 9.2.  (Jensen’s inequality for conditional expectations)   If   g   is convex and   X   and   g(X )   are 

    integrable,

    E [g(X ) | F ] ≥ g(E [X  | F ]),   a.s.

    A key fact is the following.



Proposition 9.3. If X and XY are integrable and Y is measurable with respect to F, then

E[XY | F] = Y E[X | F].   (9.2)

Proof. If A ∈ F, then for any B ∈ F,

E[1A E[X | F]; B] = E[E[X | F]; A ∩ B] = E[X; A ∩ B] = E[1A X; B].

Since 1A E[X | F] is F measurable, this shows that (9.2) holds when Y = 1A and A ∈ F. Using linearity and taking limits shows that (9.2) holds whenever Y is F measurable and X and XY are integrable.

    Two other equalities follow.

Proposition 9.4. If E ⊆ F ⊆ G, then

E[E[X | F] | E] = E[X | E] = E[E[X | E] | F].

Proof. The right equality holds because E[X | E] is E measurable, hence F measurable. To show the left equality, let A ∈ E. Then since A is also in F,

E[E[E[X | F] | E]; A] = E[E[X | F]; A] = E[X; A] = E[E[X | E]; A].

Since both sides are E measurable, the equality follows.

To show the existence of E[X | F], we proceed as follows.

    Proposition 9.5.   If  X   is integrable, then  E [X  | F ]  exists.

Proof. Using linearity, we need only consider X ≥ 0. Define a measure Q on F by Q(A) = E[X; A] for A ∈ F. This is trivially absolutely continuous with respect to P|F, the restriction of P to F. Let E[X | F] be the Radon-Nikodym derivative of Q with respect to P|F. The Radon-Nikodym derivative is F measurable by construction and so provides the desired random variable.

When F = σ(Y), one usually writes E[X | Y] for E[X | F]. Notation that is commonly used (however, we will use it only very occasionally and only for heuristic purposes) is E[X | Y = y]. The definition is as follows. If A ∈ σ(Y), then A = (Y ∈ B) for some Borel set B by the definition of σ(Y), or 1A = 1B(Y). By linearity and taking limits, if Z is σ(Y) measurable, Z = f(Y) for some Borel measurable function f. Set Z = E[X | Y] and choose f Borel measurable so that Z = f(Y). Then E[X | Y = y] is defined to be f(y).

If X ∈ L² and M = {Y ∈ L² : Y is F-measurable}, one can show that E[X | F] is equal to the

    projection of  X  onto the subspace M. We will not use this in these notes.

    10. Stopping times.

We next want to talk about stopping times. Suppose we have a sequence of σ-fields Fi such that Fi ⊂ F_{i+1} for each i. An example would be if Fi = σ(X1, . . . , Xi). A random mapping N from Ω to {0, 1, 2, . . .} is called a stopping time if for each n, (N ≤ n) ∈ Fn. A stopping time is also called an optional time in the Markov theory literature.

The intuition is that the sequence knows whether N has happened by time n by looking at Fn. Suppose some motorists are told to drive north on Highway 99 in Seattle and stop at the first motorcycle

    shop past the second realtor after the city limits. So they drive north, pass the city limits, pass two realtors,

    and come to the next motorcycle shop, and stop. That is a stopping time. If they are instead told to stop

    at the third stop light before the city limits (and they had not been there before), they would need to drive

    to the city limits, then turn around and return past three stop lights. That is not a stopping time, because

    they have to go ahead of where they wanted to stop to know to stop there.

We use the notation a ∧ b = min(a, b) and a ∨ b = max(a, b). The proof of the following is immediate from the definitions.



    Proposition 10.1.

    (a) Fixed times  n  are stopping times.

(b) If N1 and N2 are stopping times, then so are N1 ∧ N2 and N1 ∨ N2.
(c) If Nn is a nondecreasing sequence of stopping times, then so is N = sup_n Nn.

    (d) If  N n  is a nonincreasing sequence of stopping times, then so is  N  = inf n N n.

    (e) If  N   is a stopping time, then so is  N  + n.

    We define F N   = {A :  A ∩ (N  ≤ n) ∈ F n  for all  n}.

    11. Martingales.

In this section we consider martingales. Let Fn be an increasing sequence of σ-fields. A sequence of random variables Mn is adapted to Fn if for each n, Mn is Fn measurable.

Mn is a martingale if Mn is adapted to Fn, Mn is integrable for all n, and

E[Mn | F_{n−1}] = M_{n−1},   a.s., n = 2, 3, . . . .   (11.1)

If we have E[Mn | F_{n−1}] ≥ M_{n−1} a.s. for every n, then Mn is a submartingale. If we have E[Mn | F_{n−1}] ≤ M_{n−1}, we have a supermartingale. Submartingales have a tendency to increase.

Let us take a moment to look at some examples. If Xi is a sequence of mean zero i.i.d. random variables and Sn is the partial sum process, then Mn = Sn is a martingale, since E[Mn | F_{n−1}] = M_{n−1} + E[Mn − M_{n−1} | F_{n−1}] = M_{n−1} + E[Mn − M_{n−1}] = M_{n−1}, using independence. If the Xi's have variance one and Mn = Sn² − n, then

E[Sn² | F_{n−1}] = E[(Sn − S_{n−1})² | F_{n−1}] + 2S_{n−1}E[Sn | F_{n−1}] − S_{n−1}² = 1 + S_{n−1}²,

using independence. It follows that Mn is a martingale.

Another example is the following: if X ∈ L¹ and Mn = E[X | Fn], then Mn is a martingale.

If Mn is a martingale and Hn is bounded and F_{n−1} measurable for each n, it is easy to check that Nn = Σ_{i=1}^n Hi(Mi − M_{i−1}) is also a martingale.

If Mn is a martingale, g is convex, and g(Mn) is integrable for each n, then by Jensen's inequality

E[g(M_{n+1}) | Fn] ≥ g(E[M_{n+1} | Fn]) = g(Mn),

so g(Mn) is a submartingale. Similarly, if g is convex and nondecreasing on [0, ∞) and Mn is a positive submartingale, then g(Mn) is a submartingale because

E[g(M_{n+1}) | Fn] ≥ g(E[M_{n+1} | Fn]) ≥ g(Mn).

    12. Optional stopping.

Note that if one takes expectations in (11.1), one has EMn = EM_{n−1}, and by induction EMn = EM0. The theorem about martingales that lies at the basis of all other results is Doob's optional stopping theorem,

    which says that the same is true if we replace  n  by a stopping time N . There are various versions, depending

    on what conditions one puts on the stopping times.

Theorem 12.1. If N is a bounded stopping time with respect to Fn and Mn a martingale, then EMN = EM0.

Proof. Since N is bounded, let K be the largest value N takes. We write

EMN = Σ_{k=0}^K E[MN; N = k] = Σ_{k=0}^K E[Mk; N = k].



Note (N = k) is Fj measurable if j ≥ k, so

E[Mk; N = k] = E[M_{k+1}; N = k] = E[M_{k+2}; N = k] = . . . = E[MK; N = k].

Hence

EMN = Σ_{k=0}^K E[MK; N = k] = EMK = EM0.

    This completes the proof.
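Theorem 12.1 can be verified by brute force for a short random walk. A sketch (assumptions, not from the notes: Mn is a simple ±1 walk started at 0, and N is the first hit of 1 capped at K, which makes N a bounded stopping time):

```python
from itertools import product

K = 8
expectation = 0.0
for steps in product((-1, 1), repeat=K):
    # Build one path of the walk; each path has probability 2^(-K).
    path = [0]
    for s in steps:
        path.append(path[-1] + s)
    # N = min{n : M_n = 1}, capped at K so that N is bounded.
    N = next((n for n, m in enumerate(path) if m == 1), K)
    expectation += path[N] * 0.5 ** K
# Optional stopping: E M_N = E M_0 = 0, although many paths stop at M_N = 1.
```

The rare paths that never reach 1 end far below 0, and they exactly balance the many paths stopped at 1.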

The assumption that N be bounded cannot be entirely dispensed with. For example, let M_n be the partial sums of a sequence of i.i.d. random variables that take the values ±1, each with probability 1/2. If N = min{i : M_i = 1}, we will see later on that N < ∞ a.s.; but M_N = 1, so EM_N = 1 ≠ 0 = EM_0.
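A short simulation (an illustrative sketch; the cap and trial counts are arbitrary) contrasts the two situations: for the bounded stopping time N ∧ K the theorem gives EM_{N∧K} = EM_0 = 0, while on trials where the walk reaches 1 before the cap the stopped value is exactly 1, so removing the bound would give EM_N = 1.

```python
import random

def stopped_means(trials=100_000, cap=100, seed=1):
    """Run a +-1 walk M up to time N ∧ cap, with N = min{i : M_i = 1}.
    Returns the average of M_{N∧cap} (should be near EM_0 = 0) and the
    fraction of trials on which the walk hit 1 before the cap."""
    rng = random.Random(seed)
    total, hits = 0, 0
    for _ in range(trials):
        m, steps = 0, 0
        while m != 1 and steps < cap:
            m += 1 if rng.random() < 0.5 else -1
            steps += 1
        total += m
        hits += (m == 1)
    return total / trials, hits / trials

mean_capped, frac_hit = stopped_means()
```

The capped mean hovers near 0 even though most trials end at the value 1, which is exactly the tension the example describes.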


Corollary 12.6. Suppose X_k is a submartingale, and N_1 ≤ N_2 are bounded stopping times. Then

    E[X_{N_2} | F_{N_1}] ≥ X_{N_1}.

    13. Doob’s inequalities.

The first interesting consequences of the optional stopping theorems are Doob's inequalities. If M_n is a martingale, denote M*_n = max_{i≤n} |M_i|.

Theorem 13.1. If M_n is a martingale or a positive submartingale,

    P(M*_n ≥ a) ≤ E[|M_n|; M*_n ≥ a]/a ≤ E|M_n|/a.

Proof. Set M_{n+1} = M_n. Let N = min{j : |M_j| ≥ a} ∧ (n + 1). Since |·| is convex, |M_n| is a submartingale. If A = (M*_n ≥ a), then A ∈ F_N and by Corollary 12.3

    aP(M*_n ≥ a) ≤ E[|M_N|; A] ≤ E[|M_n|; A] ≤ E|M_n|.
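Since a short symmetric random walk has only 2^n equally likely paths, Theorem 13.1 can be checked exactly by enumeration rather than by simulation (an illustrative check; n = 12 and a = 3 are arbitrary choices):

```python
from itertools import product

def doob_terms(n=12, a=3):
    """Compute a P(M*_n >= a), E[|M_n|; M*_n >= a], and E|M_n| exactly
    for the symmetric +-1 walk by enumerating all 2^n paths."""
    w = 1.0 / 2 ** n                    # each path has probability 2^{-n}
    lhs = mid = rhs = 0.0
    for steps in product((-1, 1), repeat=n):
        m, mstar = 0, 0
        for s in steps:
            m += s
            mstar = max(mstar, abs(m))  # running max of |M_i|
        rhs += w * abs(m)               # contributes to E|M_n|
        if mstar >= a:
            lhs += w * a                # contributes to a P(M*_n >= a)
            mid += w * abs(m)           # contributes to E[|M_n|; M*_n >= a]
    return lhs, mid, rhs

lhs, mid, rhs = doob_terms()
```

The three exact quantities come out ordered just as the inequality predicts.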

    For  p > 1, we have the following inequality.

Theorem 13.2. If p > 1 and E|M_i|^p < ∞ for all i, then

    E(M*_n)^p ≤ (p/(p − 1))^p E|M_n|^p.

Proof. Using Theorem 13.1,

    E(M*_n)^p = ∫_0^∞ pa^{p-1} P(M*_n > a) da ≤ ∫_0^∞ pa^{p-1} E[|M_n| 1_{(M*_n ≥ a)}/a] da
              = E ∫_0^{M*_n} pa^{p-2} |M_n| da = (p/(p − 1)) E[(M*_n)^{p-1} |M_n|]
              ≤ (p/(p − 1)) (E(M*_n)^p)^{(p-1)/p} (E|M_n|^p)^{1/p}.

The last inequality follows by Hölder's inequality. Now divide both sides by the quantity (E(M*_n)^p)^{(p-1)/p}.

    14. Martingale convergence theorems.

The martingale convergence theorems are another set of important consequences of optional stopping. The main step is the upcrossing lemma. The number of upcrossings of an interval [a, b] is the number of times a process crosses from below a to above b.

To be more exact, let

    S_1 = min{k : X_k ≤ a},  T_1 = min{k > S_1 : X_k ≥ b},

and

    S_{i+1} = min{k > T_i : X_k ≤ a},  T_{i+1} = min{k > S_{i+1} : X_k ≥ b}.

The number of upcrossings U_n before time n is U_n = max{j : T_j ≤ n}.


Theorem 14.1. (Upcrossing lemma) If X_k is a submartingale,

    EU_n ≤ (b − a)^{-1} E[(X_n − a)^+].

Proof. The number of upcrossings of [a, b] by X_k is the same as the number of upcrossings of [0, b − a] by Y_k = (X_k − a)^+. Moreover Y_k is still a submartingale. If we obtain the inequality for the number of upcrossings of the interval [0, b − a] by the process Y_k, we will have the desired inequality for upcrossings of [a, b] by X_k.

So we may assume a = 0. Fix n and define Y_{n+1} = Y_n. This will still be a submartingale. Define the S_i, T_i as above, and let S'_i = S_i ∧ (n + 1), T'_i = T_i ∧ (n + 1). Since T_{i+1} > S_{i+1} > T_i, then T'_{n+1} = n + 1. We write

    EY_{n+1} = EY_{S'_1} + Σ_{i=1}^{n+1} E[Y_{T'_i} − Y_{S'_i}] + Σ_{i=1}^{n} E[Y_{S'_{i+1}} − Y_{T'_i}].

All the summands in the third term on the right are nonnegative since Y_k is a submartingale, and EY_{S'_1} ≥ 0 since Y_k ≥ 0. For the jth upcrossing, Y_{T'_j} − Y_{S'_j} ≥ b − a, while Y_{T'_j} − Y_{S'_j} is always greater than or equal to 0. So

    Σ_{i=1}^∞ (Y_{T'_i} − Y_{S'_i}) ≥ (b − a) U_n.

So

    EU_n ≤ EY_{n+1}/(b − a).
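The upcrossing bound can likewise be verified exactly for a short symmetric random walk, which is a martingale and hence a submartingale; the interval [0, 2] and length 12 below are arbitrary illustrative choices.

```python
from itertools import product

def count_upcrossings(path, a, b):
    """Number of completed upcrossings of [a, b]: times the path moves from
    a value <= a to a later value >= b, per the S_i, T_i definitions."""
    count, waiting_for_b = 0, False
    for x in path:
        if not waiting_for_b:
            if x <= a:
                waiting_for_b = True
        elif x >= b:
            count += 1
            waiting_for_b = False
    return count

def upcrossing_bound_check(n=12, a=0, b=2):
    """Exact E U_n and E(X_n - a)^+/(b - a) over all 2^n walk paths."""
    w = 1.0 / 2 ** n
    eu = bound = 0.0
    for steps in product((-1, 1), repeat=n):
        path = [0]
        for s in steps:
            path.append(path[-1] + s)
        eu += w * count_upcrossings(path, a, b)
        bound += w * max(path[-1] - a, 0.0) / (b - a)
    return eu, bound

eu, bound = upcrossing_bound_check()
```

The exact expected upcrossing count sits below the bound, as the lemma guarantees.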

    This leads to the martingale convergence theorem.

Theorem 14.2. If X_n is a submartingale such that sup_n EX_n^+ < ∞, then X_n converges a.s. as n → ∞.


If X_n is a martingale bounded above, by considering −X_n, we may assume X_n is bounded below. Looking at X_n + M for fixed M will not affect the convergence, so we may assume X_n is bounded below by 0. Now apply the first assertion of the corollary.

Proposition 14.4. If X_n is a martingale with sup_n E|X_n|^p < ∞ for some p > 1, then the convergence is in L^p as well as a.s. This is also true when X_n is a submartingale. If X_n is a uniformly integrable martingale, then the convergence is in L^1. If X_n → X_∞ in L^1, then X_n = E[X_∞ | F_n].

    X n   is a uniformly integrable martingale if the collection of random variables   X n   is uniformly inte-

    grable.

    Proof.   The  L p convergence assertion follows by using Doob’s inequality (Theorem 13.2) and dominated

    convergence. The L1 convergence assertion follows since a.s. convergence together with uniform integrability

implies L^1 convergence. Finally, if j < n, we have X_j = E[X_n | F_j]. If A ∈ F_j,

    E[X_j; A] = E[X_n; A] → E[X_∞; A]

by the L^1 convergence of X_n to X_∞. Since this is true for all A ∈ F_j, X_j = E[X_∞ | F_j].

    15. Applications of martingales.

Let S_n be your fortune at time n. In a fair casino, E[S_{n+1} | F_n] = S_n. If N is a stopping time, the optional stopping theorem says that ES_N = ES_0; in other words, no matter what stopping time you use and what method of betting, you will do no better on average than ending up with what you started with.

An elegant application of martingales is a proof of the SLLN. Fix N large. Let Y_i be i.i.d. and let S_n = Y_1 + ··· + Y_n. Let Z_n = E[Y_1 | S_n, S_{n+1}, . . . , S_N]. We claim Z_n = S_n/n. Certainly S_n/n is σ(S_n, . . . , S_N) measurable. If A ∈ σ(S_n, . . . , S_N), then A = ((S_n, . . . , S_N) ∈ B) for some Borel subset B of R^{N−n+1}. Since the Y_i are i.i.d., for each k ≤ n,

    E[Y_1; (S_n, . . . , S_N) ∈ B] = E[Y_k; (S_n, . . . , S_N) ∈ B].

Summing over k and dividing by n,

    E[Y_1; (S_n, . . . , S_N) ∈ B] = E[S_n/n; (S_n, . . . , S_N) ∈ B].

Therefore E[Y_1; A] = E[S_n/n; A] for every A ∈ σ(S_n, . . . , S_N). Thus Z_n = S_n/n.

Let X_k = Z_{N−k}, and let F_k = σ(S_{N−k}, S_{N−k+1}, . . . , S_N). Note F_k gets larger as k gets larger, and by the above X_k = E[Y_1 | F_k]. This shows that X_k is a martingale (cf. the next to last example in Section 11). By Doob's upcrossing inequality, if U^X_N is the number of upcrossings of [a, b] by X, then

    EU^X_N ≤ EX_N^+/(b − a) ≤ E|Z_0|/(b − a) = E|Y_1|/(b − a).

This differs by at most one from the number of upcrossings of [a, b] by Z_1, . . . , Z_N. So the expected number of upcrossings of [a, b] by Z_k for k ≤ N is bounded by 1 + E|Y_1|/(b − a). Now let N → ∞. By Fatou's lemma, the expected number of upcrossings of [a, b] by Z_1, Z_2, . . . is finite. Arguing as in the proof of the martingale convergence theorem, this says that Z_n = S_n/n does not oscillate.

It is conceivable that |S_n/n| → ∞. But by Fatou's lemma,

    E[lim |S_n/n|] ≤ liminf E|S_n/n| ≤ liminf nE|Y_1|/n = E|Y_1| < ∞,

so S_n/n must converge a.s. to a finite limit.
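The conclusion is easy to watch numerically; as an illustration (the exponential distribution below is an arbitrary choice with EY_1 = 1), the running averages of a single sample path settle down to the mean.

```python
import random

def running_average(n, seed=2):
    """S_n/n along one sample path of i.i.d. Exponential(1) draws (mean 1)."""
    rng = random.Random(seed)
    s = 0.0
    for _ in range(n):
        s += rng.expovariate(1.0)
    return s / n

avg_small = running_average(1_000)
avg_big = running_average(1_000_000)
```

The longer the path, the closer S_n/n hugs EY_1 = 1, as the SLLN asserts for almost every path.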


Proposition 15.1. Suppose the Y_i are i.i.d. with E|Y_1| < ∞. Then S_n/n converges a.s. to EY_1.


Proposition 15.3. Suppose A_n ∈ F_n. Then (A_n i.o.) and (Σ_{n=1}^∞ P(A_n | F_{n-1}) = ∞) differ by a null set.

Proof. Let X_n = Σ_{m=1}^n [1_{A_m} − P(A_m | F_{m-1})]. Note |X_n − X_{n-1}| ≤ 1. Also, it is easy to see that E[X_n − X_{n-1} | F_{n-1}] = 0, so X_n is a martingale.

We claim that for almost every ω either lim X_n exists and is finite, or else limsup X_n = ∞ and liminf X_n = −∞. In fact, if N = min{n : X_n ≤ −k}, then X_{n∧N} ≥ −k − 1, so X_{n∧N} converges by the martingale convergence theorem. Therefore lim X_n exists and is finite on (N = ∞). So if lim X_n does not exist or is not finite, then N < ∞ for every k, which forces liminf X_n = −∞; the same argument applied to the martingale −X_n forces limsup X_n = ∞, proving the claim. The proposition follows: where lim X_n exists and is finite, Σ 1_{A_m} and Σ P(A_m | F_{m-1}) converge or diverge together, while where X_n oscillates between +∞ and −∞, both sums must be infinite.

16. Weak convergence.

Choose M such that F_X(M) > 1 − ε and F_X(−M) < ε and so that M and −M are continuity points of F_X and of the F_{X_n}. By the above argument, E(1_{[−M,M]}g)(X_n) → E(1_{[−M,M]}g)(X),


where g is a bounded continuous function. The difference between E(1_{[−M,M]}g)(X) and Eg(X) is bounded by ‖g‖_∞ P(X ∉ [−M, M]) ≤ 2ε‖g‖_∞. Similarly, when X is replaced by X_n, the difference is bounded by ‖g‖_∞ P(X_n ∉ [−M, M]) → ‖g‖_∞ P(X ∉ [−M, M]). So for n large, it is less than 3ε‖g‖_∞. Since ε is arbitrary, Eg(X_n) → Eg(X) whenever g is bounded and continuous.

Let us examine the relationship between weak convergence and convergence in probability. The example of S_n/√n shows that one can have weak convergence without convergence in probability.

    Proposition 16.2.   (a) If  X n  converges to  X  in probability, then it converges weakly.

    (b) If  X n  converges weakly to a constant, it converges in probability.

    (c)   (Slutsky’s theorem)   If   X n  converges weakly to   X   and   Y n   converges weakly to a constant   c, then

    X n + Y n  converges weakly to  X  + c  and  X nY n  converges weakly to  cX .

Proof. To prove (a), let g be a bounded and continuous function. If n_j is any subsequence, then there exists a further subsequence such that X(n_{j_k}) converges almost surely to X. Then by dominated convergence, Eg(X(n_{j_k})) → Eg(X). That suffices to show Eg(X_n) converges to Eg(X).

For (b), if X_n converges weakly to c,

    P(X_n − c > ε) = P(X_n > c + ε) = 1 − P(X_n ≤ c + ε) → 1 − P(c ≤ c + ε) = 0.

We use the fact that if Y ≡ c, then c + ε is a point of continuity for F_Y. A similar equation shows P(X_n − c ≤ −ε) → 0, so P(|X_n − c| > ε) → 0.

We now prove the first part of (c), leaving the second part for the reader. Let x be a point such that x − c is a continuity point of F_X. Choose ε so that x − c + ε is again a continuity point. Then

    P(X_n + Y_n ≤ x) ≤ P(X_n + c ≤ x + ε) + P(|Y_n − c| > ε) → P(X ≤ x − c + ε).

So limsup P(X_n + Y_n ≤ x) ≤ P(X + c ≤ x + ε). Since ε can be as small as we like and x − c is a continuity point of F_X, then limsup P(X_n + Y_n ≤ x) ≤ P(X + c ≤ x). The liminf is done similarly.

We say a sequence of distribution functions {F_n} is tight if for each ε > 0 there exists M such that F_n(M) ≥ 1 − ε and F_n(−M) ≤ ε for all n. A sequence of r.v.s is tight if the corresponding distribution functions are tight; this is equivalent to P(|X_n| ≥ M) ≤ ε.

Theorem 16.3. (Helly's theorem) Let F_n be a sequence of distribution functions that is tight. There exists a subsequence n_j and a distribution function F such that F_{n_j} converges weakly to F.

What could happen is that X_n = n, so that F_{X_n} → 0; the tightness precludes this.

Proof. Let q_k be an enumeration of the rationals. Since F_n(q_k) ∈ [0, 1], any subsequence has a further subsequence that converges. Use diagonalization so that F_{n_j}(q_k) converges for each q_k and call the limit F(q_k). F is nondecreasing, and define F(x) = inf_{q_k ≥ x} F(q_k). So F is right continuous and nondecreasing.

If x is a point of continuity of F and ε > 0, then there exist r and s rational such that r < x < s and

    F(s) − ε < F(x) < F(r) + ε.

Then

    F_{n_j}(x) ≥ F_{n_j}(r) → F(r) > F(x) − ε

and

    F_{n_j}(x) ≤ F_{n_j}(s) → F(s) < F(x) + ε.

Since ε is arbitrary, F_{n_j}(x) → F(x).

Since the F_n are tight, there exists M such that F_n(−M) < ε. Then F(−M) ≤ ε, which implies lim_{x→−∞} F(x) = 0. Showing lim_{x→∞} F(x) = 1 is similar. Therefore F is in fact a distribution function.

    We conclude by giving an easily checked criterion for tightness.


Proposition 16.4. Suppose there exists ϕ : [0, ∞) → [0, ∞) that is increasing and ϕ(x) → ∞ as x → ∞. If c = sup_n Eϕ(|X_n|) < ∞, then the X_n are tight.

Proof. Let ε > 0. Choose M such that ϕ(x) ≥ c/ε if x > M. Then

    P(|X_n| > M) ≤ ∫ (ϕ(|X_n|)/(c/ε)) 1_{(|X_n|>M)} dP ≤ (ε/c) Eϕ(|X_n|) ≤ ε.

    17. Characteristic functions.

We define the characteristic function of a random variable X by ϕ_X(t) = E e^{itX} for t ∈ R.

Note that ϕ_X(t) = ∫ e^{itx} P_X(dx). So if X and Y have the same law, they have the same characteristic function. Also, if the law of X has a density, that is, P_X(dx) = f_X(x) dx, then ϕ_X(t) = ∫ e^{itx} f_X(x) dx, so in this case the characteristic function is the same as (one definition of) the Fourier transform of f_X.

Proposition 17.1. ϕ(0) = 1, |ϕ(t)| ≤ 1, ϕ(−t) = \overline{ϕ(t)} (the complex conjugate), and ϕ is uniformly continuous.

Proof. Since |e^{itx}| ≤ 1, everything follows immediately from the definitions except the uniform continuity. For that we write

    |ϕ(t + h) − ϕ(t)| = |E e^{i(t+h)X} − E e^{itX}| ≤ E|e^{itX}(e^{ihX} − 1)| = E|e^{ihX} − 1|.

|e^{ihX} − 1| tends to 0 almost surely as h → 0, so the right hand side tends to 0 by dominated convergence. Note that the right hand side is independent of t.

Proposition 17.2. ϕ_{aX}(t) = ϕ_X(at) and ϕ_{X+b}(t) = e^{itb} ϕ_X(t).

Proof. The first follows from E e^{it(aX)} = E e^{i(at)X}, and the second is similar.

    Proposition 17.3.   If  X   and  Y   are independent, then  ϕX+Y  (t) =  ϕX(t)ϕY  (t).

    Proof.  From the multiplication theorem,

    E e^{it(X+Y)} = E e^{itX} e^{itY} = E e^{itX} E e^{itY}.

Note that if X_1 and X_2 are independent and identically distributed, then

    ϕ_{X_1−X_2}(t) = ϕ_{X_1}(t) ϕ_{−X_2}(t) = ϕ_{X_1}(t) ϕ_{X_2}(−t) = ϕ_{X_1}(t) \overline{ϕ_{X_1}(t)} = |ϕ_{X_1}(t)|^2.

Let us look at some examples of characteristic functions.

(a) Bernoulli: By direct computation, this is pe^{it} + (1 − p) = 1 − p(1 − e^{it}).

(b) Coin flip: (i.e., P(X = +1) = P(X = −1) = 1/2) We have (1/2)e^{it} + (1/2)e^{−it} = cos t.

(c) Poisson:

    E e^{itX} = Σ_{k=0}^∞ e^{itk} e^{−λ} λ^k/k! = e^{−λ} Σ_k (λe^{it})^k/k! = e^{−λ} e^{λe^{it}} = e^{λ(e^{it}−1)}.

(d) Point mass at a: E e^{itX} = e^{ita}. Note that when a = 0, then ϕ ≡ 1.
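These closed forms are easy to confirm numerically (an illustrative check; the values t = 1.7 and λ = 3 are arbitrary): below, the Poisson series is summed directly and compared with e^{λ(e^{it}−1)}, and the coin-flip formula is compared with cos t.

```python
import cmath
import math

def poisson_cf_by_series(t, lam, terms=80):
    """Sum e^{itk} e^{-lam} lam^k / k! over k, truncated far into the tail."""
    return sum(cmath.exp(1j * t * k) * math.exp(-lam) * lam ** k / math.factorial(k)
               for k in range(terms))

t, lam = 1.7, 3.0
series_val = poisson_cf_by_series(t, lam)
closed_val = cmath.exp(lam * (cmath.exp(1j * t) - 1))      # e^{lam(e^{it}-1)}
coin_val = 0.5 * cmath.exp(1j * t) + 0.5 * cmath.exp(-1j * t)  # should be cos t
```

The truncated series agrees with the closed form to machine precision, since the Poisson tail beyond k = 80 is negligible for λ = 3.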


An alternate proof of (a) is the following. e^{−xy} sin x is integrable on {(x, y) : 0 < x < a, 0 < y < ∞}.


Letting b → a shows that µ has no point masses. We now write

    µ(x, x + h) = (1/2π) ∫ [(e^{−itx} − e^{−it(x+h)})/(it)] ϕ(t) dt
                = (1/2π) ∫ ∫_x^{x+h} e^{−ity} dy ϕ(t) dt
                = ∫_x^{x+h} [ (1/2π) ∫ e^{−ity} ϕ(t) dt ] dy.

So µ has density f(y) = (1/2π) ∫ e^{−ity} ϕ(t) dt. As in the proof of Proposition 17.1, we see f is continuous.

    A corollary to the inversion formula is the uniqueness theorem.

Theorem 18.3. If ϕ_X = ϕ_Y, then P_X = P_Y.

    The following proposition can be proved directly, but the proof using characteristic functions is much

    easier.

Proposition 18.4. (a) If X and Y are independent, X is a normal with mean a and variance b^2, and Y is a normal with mean c and variance d^2, then X + Y is normal with mean a + c and variance b^2 + d^2.

(b) If X and Y are independent, X is Poisson with parameter λ_1, and Y is Poisson with parameter λ_2, then X + Y is Poisson with parameter λ_1 + λ_2.

(c) If X_i are i.i.d. Cauchy, then S_n/n is Cauchy.

Proof. For (a),

    ϕ_{X+Y}(t) = ϕ_X(t) ϕ_Y(t) = e^{iat−b^2t^2/2} e^{ict−d^2t^2/2} = e^{i(a+c)t−(b^2+d^2)t^2/2}.

    Now use the uniqueness theorem.

    Parts (b) and (c) are proved similarly.

    19. Continuity theorem.

Lemma 19.1. Suppose ϕ is the characteristic function of a probability µ. Then

    µ([−2A, 2A]) ≥ A ∫_{−1/A}^{1/A} ϕ(t) dt − 1.

Proof. Note

    (1/2T) ∫_{−T}^{T} ϕ(t) dt = (1/2T) ∫_{−T}^{T} ∫ e^{itx} µ(dx) dt
                              = ∫ (1/2T) ∫ 1_{[−T,T]}(t) e^{itx} dt µ(dx)
                              = ∫ (sin Tx)/(Tx) µ(dx).

Since |(sin Tx)/Tx| ≤ 1 and |sin Tx| ≤ 1, then |(sin Tx)/Tx| ≤ 1/(2TA) if |x| ≥ 2A. So

    ∫ (sin Tx)/(Tx) µ(dx) ≤ µ([−2A, 2A]) + ∫_{[−2A,2A]^c} (1/2TA) µ(dx)
                           = µ([−2A, 2A]) + (1/2TA)(1 − µ([−2A, 2A]))
                           = 1/(2TA) + (1 − 1/(2TA)) µ([−2A, 2A]).


Setting T = 1/A,

    (A/2) ∫_{−1/A}^{1/A} ϕ(t) dt ≤ 1/2 + (1/2) µ([−2A, 2A]).

Now multiply both sides by 2.

    Proposition 19.2.   If  µn  converges weakly to  µ, then  ϕn  converges to  ϕ  uniformly on every finite interval.

Proof. Let ε > 0 and choose M large so that µ([−M, M]^c) < ε. Define f to be 1 on [−M, M], 0 on [−M − 1, M + 1]^c, and linear in between. Since ∫ f dµ_n → ∫ f dµ, then if n is large enough, ∫ (1 − f) dµ_n ≤ 2ε. We have

    |ϕ_n(t + h) − ϕ_n(t)| ≤ ∫ |e^{ihx} − 1| µ_n(dx)
                          ≤ 2 ∫ (1 − f) dµ_n + h ∫ |x| f(x) µ_n(dx)
                          ≤ 2ε + h(M + 1).

So for n large enough and |h| ≤ ε/(M + 1), we have

    |ϕ_n(t + h) − ϕ_n(t)| ≤ 3ε,

which says that the ϕ_n are equicontinuous. Therefore the convergence is uniform on finite intervals.

    The interesting result of this section is the converse, Lévy’s continuity theorem.

    Theorem 19.3.   Suppose   µn  are probabilities,   ϕn(t)   converges to a function   ϕ(t)   for each   t, and   ϕ   is 

    continuous at 0. Then  ϕ  is the characteristic function of a probability  µ and  µn  converges weakly to  µ.

Proof. Let ε > 0. Since ϕ is continuous at 0, choose δ small so that

    |(1/2δ) ∫_{−δ}^{δ} ϕ(t) dt − 1| < ε.

Using the dominated convergence theorem, choose N such that

    (1/2δ) ∫_{−δ}^{δ} |ϕ_n(t) − ϕ(t)| dt < ε

if n ≥ N. So if n ≥ N,

    (1/2δ) ∫_{−δ}^{δ} ϕ_n(t) dt ≥ (1/2δ) ∫_{−δ}^{δ} ϕ(t) dt − (1/2δ) ∫_{−δ}^{δ} |ϕ_n(t) − ϕ(t)| dt ≥ 1 − 2ε.

By Lemma 19.1 with A = 1/δ, for such n,

    µ_n([−2/δ, 2/δ]) ≥ 2(1 − 2ε) − 1 = 1 − 4ε.

This shows the µ_n are tight.

Let n_j be a subsequence such that µ_{n_j} converges weakly, say to µ. Then ϕ_{n_j}(t) → ϕ_µ(t), hence ϕ(t) = ϕ_µ(t), or ϕ is the characteristic function of a probability µ. If µ′ is any subsequential weak limit point of µ_n, then ϕ_{µ′}(t) = ϕ(t) = ϕ_µ(t); so µ′ must equal µ. Hence µ_n converges weakly to µ.

    We need the following estimate on moments.


Proposition 19.4. If E|X|^k < ∞, then ϕ_X has k continuous derivatives, and ϕ_X^{(k)}(t) = E[(iX)^k e^{itX}].


If ω ∉ N, by dominated convergence, ∫_0^a e^{itS_n(ω)} dt converges as n → ∞, provided a < δ. Call the limit A_a. Also

    ∫_0^a e^{itS_n(ω)} dt = (e^{iaS_n(ω)} − 1)/(iS_n(ω))

if S_n(ω) ≠ 0, and equals a otherwise.

Since S_n converges weakly, it is not possible for |S_n| → ∞ with positive probability. If we let N′ = {ω : |S_n(ω)| → ∞} and choose ω ∉ N ∪ N′, there exists a subsequence S_{n_j}(ω) which converges to a finite limit, say R. We can choose a < δ such that e^{iaS_n(ω)} converges and e^{iaR} ≠ 1. Therefore A_a = (e^{iaR} − 1)/(iR), a nonzero quantity. But then

    S_n(ω) = (e^{iaS_n(ω)} − 1) / ( i ∫_0^a e^{itS_n(ω)} dt ) → (lim_{n→∞} e^{iaS_n(ω)} − 1)/(iA_a).

Therefore, except for ω ∈ N ∪ N′, we have that S_n(ω) converges.

    20. Central limit theorem.

The simplest case of the central limit theorem (CLT) is the case when the X_i are i.i.d., with mean zero and variance one, and then the CLT says that S_n/√n converges weakly to a standard normal. We first prove this case.

We need the fact that if c_n are complex numbers converging to c, then (1 + c_n/n)^n → e^c. We leave the proof of this to the reader, with the warning that any proof using logarithms needs to be done with some care, since log z is a multi-valued function when z is complex.

Theorem 20.1. Suppose the X_i are i.i.d., mean zero, and variance one. Then S_n/√n converges weakly to a standard normal.

Proof. Since X_1 has finite second moment, ϕ_{X_1} has a continuous second derivative. By Taylor's theorem,

    ϕ_{X_1}(t) = ϕ_{X_1}(0) + ϕ′_{X_1}(0) t + ϕ″_{X_1}(0) t^2/2 + R(t),

where |R(t)|/t^2 → 0 as |t| → 0. Since EX_1 = 0 and EX_1^2 = 1, this says

    ϕ_{X_1}(t) = 1 − t^2/2 + R(t).

Then

    ϕ_{S_n/√n}(t) = ϕ_{S_n}(t/√n) = (ϕ_{X_1}(t/√n))^n = (1 − t^2/2n + R(t/√n))^n.

Since t/√n converges to zero as n → ∞, we have nR(t/√n) → 0, and so

    ϕ_{S_n/√n}(t) → e^{−t^2/2}.

Now apply the continuity theorem.
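For ±1 coin flips the computation in this proof is fully explicit: ϕ_{X_1}(t) = cos t, so ϕ_{S_n/√n}(t) = (cos(t/√n))^n, and the convergence to e^{−t²/2} can be watched directly (t = 1.3 below is an arbitrary test point).

```python
import math

def coinflip_cf(t, n):
    """cf of S_n/sqrt(n) when the X_i are fair +-1 coin flips."""
    return math.cos(t / math.sqrt(n)) ** n

t = 1.3
# distance from the limiting cf e^{-t^2/2} for increasing n
errors = [abs(coinflip_cf(t, n) - math.exp(-t * t / 2)) for n in (10, 100, 10_000)]
```

The error shrinks roughly like 1/n, which matches the size of the neglected remainder term nR(t/√n).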

Let us give another proof of this simple CLT that does not use characteristic functions. For simplicity let X_i be i.i.d. mean zero variance one random variables with E|X_i|^3 < ∞.


Proposition 20.2. With X_i as above, S_n/√n converges weakly to a standard normal.

Proof. Let Y_1, . . . , Y_n be i.i.d. standard normal r.v.s that are independent of the X_i. Let Z_1 = Y_2 + ··· + Y_n, Z_2 = X_1 + Y_3 + ··· + Y_n, Z_3 = X_1 + X_2 + Y_4 + ··· + Y_n, etc.

Let us suppose g ∈ C^3 with compact support and let W be a standard normal. Our first goal is to show

    |Eg(S_n/√n) − Eg(W)| → 0.   (20.1)

We have

    Eg(S_n/√n) − Eg(W) = Eg(S_n/√n) − Eg(Σ_{i=1}^n Y_i/√n)
                       = Σ_{i=1}^n [ Eg((X_i + Z_i)/√n) − Eg((Y_i + Z_i)/√n) ].

By Taylor's theorem,

    g((X_i + Z_i)/√n) = g(Z_i/√n) + g′(Z_i/√n) X_i/√n + (1/2) g″(Z_i/√n) X_i^2/n + R_n,

where |R_n| ≤ ‖g‴‖_∞ |X_i|^3/n^{3/2}. Taking expectations and using the independence of X_i and Z_i,

    Eg((X_i + Z_i)/√n) = Eg(Z_i/√n) + 0 + (1/2n) Eg″(Z_i/√n) + ER_n.

We have a very similar expression for Eg((Y_i + Z_i)/√n). Taking the difference,

    |Eg((X_i + Z_i)/√n) − Eg((Y_i + Z_i)/√n)| ≤ ‖g‴‖_∞ (E|X_i|^3 + E|Y_i|^3)/n^{3/2}.

Summing over i from 1 to n, the right-hand side is of order n^{-1/2}, and we have (20.1).

By approximating continuous functions with compact support by C^3 functions with compact support, we have (20.1) for such g. Since E(S_n/√n)^2 = 1, the sequence S_n/√n is tight. So given ε there exists M such that P(|S_n/√n| > M) < ε for all n. By taking M larger if necessary, we also have P(|W| > M) < ε. Suppose g is bounded and continuous. Let ψ be a continuous function with compact support that is bounded by one, is nonnegative, and that equals 1 on [−M, M]. By (20.1) applied to gψ,

    |E(gψ)(S_n/√n) − E(gψ)(W)| → 0.

However,

    |Eg(S_n/√n) − E(gψ)(S_n/√n)| ≤ ‖g‖_∞ P(|S_n/√n| > M) < ε‖g‖_∞,

and similarly

    |Eg(W) − E(gψ)(W)| < ε‖g‖_∞.

Since ε is arbitrary, this proves (20.1) for bounded continuous g. By Proposition 16.1, this proves our proposition.

    We give another example of the use of characteristic functions.


Proposition 20.3. Suppose for each n the r.v.s X_{ni}, i = 1, . . . , n are i.i.d. Bernoullis with parameter p_n. If np_n → λ and S_n = Σ_{i=1}^n X_{ni}, then S_n converges weakly to a Poisson r.v. with parameter λ.

Proof. We write

    ϕ_{S_n}(t) = [ϕ_{X_{n1}}(t)]^n = (1 + p_n(e^{it} − 1))^n = (1 + (np_n/n)(e^{it} − 1))^n → e^{λ(e^{it}−1)}.

Now apply the continuity theorem.

A much more general theorem than Theorem 20.1 is the Lindeberg-Feller theorem.

Theorem 20.4. Suppose for each n, X_{ni}, i = 1, . . . , n are mean zero independent random variables. Suppose

(a) Σ_{i=1}^n EX_{ni}^2 → σ^2 > 0 and

(b) for each ε, Σ_{i=1}^n E[|X_{ni}|^2; |X_{ni}| > ε] → 0.

Let S_n = Σ_{i=1}^n X_{ni}. Then S_n converges weakly to a normal r.v. with mean zero and variance σ^2.

Note nothing is said about independence of the X_{ni} for different n.

Let us look at Theorem 20.1 in light of this theorem. Suppose the Y_i are i.i.d. and let X_{ni} = Y_i/√n. Then

    Σ_{i=1}^n E(Y_i/√n)^2 = EY_1^2

and

    Σ_{i=1}^n E[|X_{ni}|^2; |X_{ni}| > ε] = nE[|Y_1|^2/n; |Y_1| > √n ε] = E[|Y_1|^2; |Y_1| > √n ε],

which tends to 0 by the dominated convergence theorem.

If the Y_i are independent with mean 0, and

    Σ_{i=1}^n E|Y_i|^3 / (Var S_n)^{3/2} → 0,

then S_n/(Var S_n)^{1/2} converges weakly to a standard normal. This is known as Lyapounov's theorem; we leave the derivation of this from the Lindeberg-Feller theorem as an exercise for the reader.

Proof. Let ϕ_{ni} be the characteristic function of X_{ni} and let σ_{ni}^2 be the variance of X_{ni}. We need to show

    Π_{i=1}^n ϕ_{ni}(t) → e^{−t^2σ^2/2}.   (20.2)

Using Taylor series, |e^{ib} − 1 − ib + b^2/2| ≤ c|b|^3 for a constant c. Also,

    |e^{ib} − 1 − ib + b^2/2| ≤ |e^{ib} − 1 − ib| + |b^2|/2 ≤ c|b|^2.

If we apply this to a random variable tY and take expectations,

    |ϕ_Y(t) − (1 + itEY − t^2 EY^2/2)| ≤ cE[t^3|Y|^3 ∧ t^2|Y|^2].

Applying this to Y = X_{ni},

    |ϕ_{ni}(t) − (1 − t^2σ_{ni}^2/2)| ≤ cE[t^3|X_{ni}|^3 ∧ t^2|X_{ni}|^2].


The right hand side is less than or equal to

    cE[t^3|X_{ni}|^3; |X_{ni}| ≤ ε] + cE[t^2|X_{ni}|^2; |X_{ni}| > ε]
        ≤ cεt^3 E[|X_{ni}|^2] + ct^2 E[|X_{ni}|^2; |X_{ni}| ≥ ε].

Summing over i we obtain

    Σ_{i=1}^n |ϕ_{ni}(t) − (1 − t^2σ_{ni}^2/2)| ≤ cεt^3 Σ_i E[|X_{ni}|^2] + ct^2 Σ_i E[|X_{ni}|^2; |X_{ni}| ≥ ε].

We need the following inequality: if |a_i|, |b_i| ≤ 1, then

    |Π_{i=1}^n a_i − Π_{i=1}^n b_i| ≤ Σ_{i=1}^n |a_i − b_i|.

To prove this, note

    Π_{i=1}^n a_i − Π_{i=1}^n b_i = (a_n − b_n) Π_{i=1}^{n-1} a_i + b_n ( Π_{i=1}^{n-1} a_i − Π_{i=1}^{n-1} b_i ),

and use induction.

Suppose the Y_i are independent with |Y_i| ≤ A and Σ Y_i convergent a.s.; let c_n = Σ_{m=1}^n Var Y_m and Z_{nm} = (Y_m − EY_m)/√c_n. If c_n → ∞, then for n large we have 2A/√c_n < ε. Since |Y_m| ≤ A and hence |EY_m| ≤ A, then |Z_{nm}| ≤ 2A/√c_n < ε. It follows that Σ_{m=1}^n E(|Z_{nm}|^2; |Z_{nm}| > ε) = 0 for large n.

     nm=1E (|Z nm|2; |Z nm|   > ε) = 0 for large   n.

    By Theorem 20.4, n

    m=1(Y m − E Y m)/√ 

    cn   converges weakly to a standard normal. However, n

    m=1 Y m

    converges, and  cn → ∞, so 

    Y m/√ 

    cn  must converge to 0. The quantities 

    EY m/√ 

    cn  are nonrandom, so

    there is no way the difference can converge to a standard normal, a contradiction. We conclude  cn  does not

    converge to infinity.

    Let  V i  =  Y i −EY i. Since |V i|  <  2A,  EV i  = 0, and Var V i  = Var Y i, which is summable, by the “if”part of the three series criterion,

     V i   converges. Since

     Y i  converges, taking the difference shows

     EY i

    converges.


    21. Framework for Markov chains.

Suppose S is a set with some topological structure that we will use as our state space. Think of S as being R^d or the positive integers, for example. A sequence of random variables X_0, X_1, . . . is a Markov chain if

    P(X_{n+1} ∈ A | X_0, . . . , X_n) = P(X_{n+1} ∈ A | X_n)   (21.1)

for all n and all measurable sets A. The definition of Markov chain has this intuition: to predict the probability that X_{n+1} is in any set, we only need to know where we currently are; how we got there gives no additional information.

Let's make some additional comments. First of all, we previously considered random variables as mappings from Ω to R. Now we want to extend our definition by allowing a random variable to be a map X from Ω to S, where (X ∈ A) is F measurable for all open sets A. This agrees with the definition of r.v. in the case S = R.

Although there is quite a theory developed for Markov chains with arbitrary state spaces, we will confine our attention to the case where either S is finite, in which case we will usually suppose S = {1, 2, . . . , n}, or countable and discrete, in which case we will usually suppose S is the set of positive integers.

    We are going to further restrict our attention to Markov chains where

    P(X n+1 ∈ A | X n  =  x) =  P(X 1 ∈ A | X 0 =  x),

    that is, where the probabilities do not depend on  n. Such Markov chains are said to have stationary transition

    probabilities.

    Define the initial distribution of a Markov chain with stationary transition probabilities by   µ(i) =

    P(X 0   =   i). Define the transition probabilities by   p(i, j) =   P(X n+1   =   j |   X n   =   i). Since the transitionprobabilities are stationary,  p(i, j) does not depend on  n.

    In this case we can use the definition of conditional probability given in undergraduate classes. If 

    P(X n  =  i) = 0 for all  n, that means we never visit  i  and we could drop the point  i  from the state space.

Proposition 21.1. Let X be a Markov chain with initial distribution µ and transition probabilities p(i, j). Then

    P(X_n = i_n, X_{n-1} = i_{n-1}, . . . , X_1 = i_1, X_0 = i_0) = µ(i_0) p(i_0, i_1) ··· p(i_{n-1}, i_n).   (21.2)

Proof. We use induction on n. It is clearly true for n = 0 by the definition of µ(i). Suppose it holds for n; we need to show it holds for n + 1. For simplicity, we will do the case n = 2. Then

    P(X_3 = i_3, X_2 = i_2, X_1 = i_1, X_0 = i_0)
        = E[P(X_3 = i_3 | X_0 = i_0, X_1 = i_1, X_2 = i_2); X_2 = i_2, X_1 = i_1, X_0 = i_0]
        = E[P(X_3 = i_3 | X_2 = i_2); X_2 = i_2, X_1 = i_1, X_0 = i_0]
        = p(i_2, i_3) P(X_2 = i_2, X_1 = i_1, X_0 = i_0).

Now by the induction hypothesis,

    P(X_2 = i_2, X_1 = i_1, X_0 = i_0) = p(i_1, i_2) p(i_0, i_1) µ(i_0).

Substituting establishes the claim for n = 3.

    The above proposition says that the law of the Markov chain is determined by the  µ(i) and  p(i, j).

    The formula (21.2) also gives a prescription for constructing a Markov chain given the  µ(i) and p(i, j).
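Formula (21.2) translates directly into code; the two-state chain below, with made-up µ and p, is purely illustrative. A basic consistency check is that the probabilities of all paths of a fixed length sum to 1.

```python
from itertools import product

def path_prob(mu, p, path):
    """P(X_0 = i_0, ..., X_n = i_n) = mu(i_0) p(i_0,i_1) ... p(i_{n-1},i_n),
    i.e. formula (21.2)."""
    prob = mu[path[0]]
    for i, j in zip(path, path[1:]):
        prob *= p[i][j]
    return prob

mu = {0: 0.5, 1: 0.5}                            # hypothetical initial law
p = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # hypothetical transitions

example = path_prob(mu, p, [0, 0, 1, 0])          # 0.5 * 0.9 * 0.1 * 0.4
total = sum(path_prob(mu, p, w) for w in product((0, 1), repeat=4))
```

Summing over all 2^4 paths of length 3 recovers total probability 1, confirming that (21.2) defines a genuine probability on path space.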


Proposition 21.2. Suppose µ(i) is a sequence of nonnegative numbers with Σ_i µ(i) = 1 and for each i the sequence p(i, j) is nonnegative and sums to 1. Then there exists a Markov chain with µ(i) as its initial distribution and p(i, j) as the transition probabilities.

Proof. Define Ω = S^∞. Let F be the σ-field generated by the collection of cylinder sets {(i_0, i_1, . . . , i_n) : n > 0, i_j ∈ S}. An element ω of Ω is a sequence (i_0, i_1, . . .). Define X_j(ω) = i_j if ω = (i_0, i_1, . . .). Define P(X_0 = i_0, . . . , X_n = i_n) by (21.2). Using the Kolmogorov extension theorem, one can show that P can be extended to a probability on Ω.

The above framework is rather abstract, but it is clear that under P the sequence X_n has initial distribution µ(i); what we need to show is that X_n is a Markov chain and that

    P(X_{n+1} = i_{n+1} | X_0 = i_0, . . . , X_n = i_n) = P(X_{n+1} = i_{n+1} | X_n = i_n) = p(i_n, i_{n+1}).   (21.3)

By the definition of conditional probability, the left hand side of (21.3) is

    P(X_{n+1} = i_{n+1} | X_0 = i_0, . . . , X_n = i_n)
        = P(X_{n+1} = i_{n+1}, X_n = i_n, . . . , X_0 = i_0) / P(X_n = i_n, . . . , X_0 = i_0)   (21.4)
        = µ(i_0) ··· p(i_{n-1}, i_n) p(i_n, i_{n+1}) / ( µ(i_0) ··· p(i_{n-1}, i_n) )
        = p(i_n, i_{n+1}),

as desired.

To complete the proof we need to show

    P(X_{n+1} = i_{n+1}, X_n = i_n) / P(X_n = i_n) = p(i_n, i_{n+1}),

or

    P(X_{n+1} = i_{n+1}, X_n = i_n) = p(i_n, i_{n+1}) P(X_n = i_n).   (21.5)

Now

    P(X_n = i_n) = Σ_{i_0,...,i_{n-1}} P(X_n = i_n, X_{n-1} = i_{n-1}, . . . , X_0 = i_0)
                 = Σ_{i_0,...,i_{n-1}} µ(i_0) ··· p(i_{n-1}, i_n)

and similarly

    P(X_{n+1} = i_{n+1}, X_n = i_n) = Σ_{i_0,...,i_{n-1}} P(X_{n+1} = i_{n+1}, X_n = i_n, . . . , X_0 = i_0)
                                    = p(i_n, i_{n+1}) Σ_{i_0,...,i_{n-1}} µ(i_0) ··· p(i_{n-1}, i_n).

Equation (21.5) now follows.

    Note in this construction that the X n  sequence is fixed and does not depend on  µ  or  p. Let p(i, j) be

    fixed. The probability we constructed above is often denoted  Pµ. If  µ  is point mass at a point  i  or  x, it is

    denoted  Pi or  Px. So we have one probability space, one sequence X n, but a whole family of probabilities

    Pµ.


Later on we will see that this framework allows one to express the Markov property and strong Markov property in a convenient way. As part of the preparation for doing this, we define the shift operators θ_k : Ω → Ω by

    θ_k(i_0, i_1, . . .) = (i_k, i_{k+1}, . . .).

Then X_j ∘ θ_k = X_{j+k}. To see this, if ω = (i_0, i_1, . . .), then

    X_j ∘ θ_k(ω) = X_j(i_k, i_{k+1}, . . .) = i_{j+k} = X_{j+k}(ω).

    22. Examples.

    Random walk on the integers 

We let Y_i be an i.i.d. sequence of r.v.'s with p = P(Y_i = 1) and 1 − p = P(Y_i = −1). Let X_n = X_0 + Σ_{i=1}^n Y_i. Then the X_n can be viewed as a Markov chain with p(i, i+1) = p, p(i, i−1) = 1 − p, and p(i, j) = 0 if |j − i| ≠ 1. More general random walks on the integers also fit into this framework. To check that the random walk is Markov,

    P(X_{n+1} = i_{n+1} | X_0 = i_0, . . . , X_n = i_n) = P(X_{n+1} − X_n = i_{n+1} − i_n | X_0 = i_0, . . . , X_n = i_n)
                                                       = P(X_{n+1} − X_n = i_{n+1} − i_n),

using the independence, while

    P(X_{n+1} = i_{n+1} | X_n = i_n) = P(X_{n+1} − X_n = i_{n+1} − i_n | X_n = i_n) = P(X_{n+1} − X_n = i_{n+1} − i_n).
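The transition probabilities of the walk can also be checked empirically. A minimal sketch in plain Python (the choice p = 0.6 is arbitrary, not from the text): pooling over all times n and states i, the fraction of up-steps estimates p(i, i+1) = p, regardless of the past.

```python
import random

def simulate_walk(p, steps, x0=0, seed=0):
    """Simple random walk: step +1 with probability p, else -1."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(steps):
        x += 1 if rng.random() < p else -1
        path.append(x)
    return path

p = 0.6  # arbitrary illustrative value
path = simulate_walk(p, 100_000)
# Fraction of transitions that go up; should be close to p.
ups = sum(1 for a, b in zip(path, path[1:]) if b == a + 1)
print(ups / (len(path) - 1))
```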

    Random walks on graphs 

Suppose we have n points, and from each point there is some probability of going to another point. For example, suppose there are 5 points and we have p(1, 2) = 1/2, p(1, 3) = 1/2, p(2, 1) = 1/4, p(2, 3) = 1/2, p(2, 5) = 1/4, p(3, 1) = 1/4, p(3, 2) = 1/4, p(3, 3) = 1/2, p(4, 1) = 1, p(5, 1) = 1/2, p(5, 5) = 1/2. The p(i, j) are often arranged into a matrix:

    P = [  0    1/2  1/2   0    0
          1/4    0   1/2   0   1/4
          1/4   1/4  1/2   0    0
           1     0    0    0    0
          1/2    0    0    0   1/2 ].

Note the rows must sum to 1, since

    Σ_{j=1}^5 p(i, j) = Σ_{j=1}^5 P(X_1 = j | X_0 = i) = P(X_1 ∈ S | X_0 = i) = 1.
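The 5-point example above can be written down directly; exact rational arithmetic makes the row-sum check exact. A sketch:

```python
from fractions import Fraction as F

# Transition matrix of the 5-point chain (states 1..5 -> indices 0..4).
P = [
    [F(0), F(1, 2), F(1, 2), F(0), F(0)],
    [F(1, 4), F(0), F(1, 2), F(0), F(1, 4)],
    [F(1, 4), F(1, 4), F(1, 2), F(0), F(0)],
    [F(1), F(0), F(0), F(0), F(0)],
    [F(1, 2), F(0), F(0), F(0), F(1, 2)],
]

# Each row must sum to 1: sum_j p(i, j) = P(X_1 in S | X_0 = i) = 1.
for row in P:
    assert sum(row) == 1
print("all rows sum to 1")
```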

    Renewal processes 

Let Y_i be i.i.d. with P(Y_i = k) = a_k, where the a_k are nonnegative and sum to 1. Let T_0 = i_0 and T_n = T_0 + Σ_{i=1}^n Y_i. We think of Y_i as the lifetime of the i-th light bulb and T_n as the time when the n-th light bulb burns out. (We replace a light bulb as soon as it burns out.) Let

    X_n = min{m − n : T_i = m for some i}.


So X_n is the amount of time after time n until the current light bulb burns out.

If X_n = j and j > 0, then T_i = n + j for some i but T_i does not equal n, n+1, . . . , n+j−1 for any i. So T_i = (n+1) + (j−1) for some i and T_i does not equal (n+1), (n+1)+1, . . . , (n+1)+(j−2) for any i. Therefore X_{n+1} = j − 1. So p(i, i−1) = 1 if i ≥ 1.

If X_n = 0, then a light bulb burned out at time n; the next bulb has lifetime j with probability a_j, and then X_{n+1} = j − 1 (which is 0 exactly when the new bulb burns out immediately, i.e. j = 1). So p(0, j) = a_{j+1}. All the other p(i, j)'s are 0.
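The renewal-chain kernel can be sketched directly. The lifetime distribution below (a_1, a_2, a_3 = 0.2, 0.5, 0.3) is an arbitrary illustrative choice:

```python
# Renewal-chain transition probabilities for a hypothetical lifetime
# distribution a_k = P(Y_i = k).
a = {1: 0.2, 2: 0.5, 3: 0.3}

def p(i, j):
    if i >= 1:
        # Residual lifetime decreases deterministically by one.
        return 1.0 if j == i - 1 else 0.0
    # From 0, a new bulb with lifetime j+1 leaves residual lifetime j.
    return a.get(j + 1, 0.0)

# Row sums are 1 on the reachable states {0, ..., K-1}, K = max lifetime.
K = max(a)
for i in range(K):
    assert abs(sum(p(i, j) for j in range(K)) - 1.0) < 1e-12
print([p(0, j) for j in range(K)])  # the row p(0, .) is just a shifted a_k
```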

    Branching processes 

Consider k particles. At the next time interval, some of them die and some of them split into several particles. The probability that a given particle will split into j particles is given by a_j, j = 0, 1, . . ., where the a_j are nonnegative and sum to 1. The behavior of each particle is independent of the behavior of all the other particles. If X_n is the number of particles at time n, then X_n is a Markov chain. Let Y_i be i.i.d. random variables with P(Y_i = j) = a_j. The p(i, j) for X_n are somewhat complicated, and can be defined by

    p(i, j) = P(Σ_{m=1}^i Y_m = j).
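Since p(i, ·) is the distribution of a sum of i independent copies of Y, it can be computed by convolving the offspring distribution with itself i times. A sketch with a hypothetical offspring distribution a_j = (0.25, 0.5, 0.25):

```python
# Branching-process transition row p(i, j) = P(Y_1 + ... + Y_i = j),
# computed by repeated convolution of the offspring distribution.
a = [0.25, 0.5, 0.25]  # hypothetical a_j = P(one particle leaves j offspring)

def transition_row(i):
    """Distribution of the particle count at time n+1 given i particles now."""
    dist = [1.0]  # i = 0: the population stays at 0
    for _ in range(i):
        new = [0.0] * (len(dist) + len(a) - 1)
        for s, ps in enumerate(dist):
            for j, aj in enumerate(a):
                new[s + j] += ps * aj
        dist = new
    return dist

row = transition_row(2)  # p(2, j) for j = 0..4
print(row)
```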

    Queues 

We will discuss briefly the M/G/1 queue. The M refers to the fact that the customers arrive according to a Poisson process: the probability that the number of customers arriving in a time interval of length t is k is given by e^{−λt}(λt)^k/k!. The G refers to the fact that the length of time it takes to serve a customer is given by a distribution that is not necessarily exponential. The 1 refers to the fact that there is 1 server.

Suppose the length of time to serve one customer has distribution function F with density f. The probability that k customers arrive during the time it takes to serve one customer is

    a_k = ∫_0^∞ e^{−λt} (λt)^k / k! · f(t) dt.
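For a concrete (hypothetical) choice of service density this integral can be evaluated numerically. With exponential service f(t) = µe^{−µt} (the M/M/1 case) the integral has the closed form a_k = µλ^k/(λ + µ)^{k+1}, which gives a check on the quadrature; the rates λ = 1, µ = 2 below are arbitrary.

```python
import math

lam, mu = 1.0, 2.0  # arbitrary arrival and service rates

def a(k, n=100_000, T=50.0):
    """Midpoint-rule approximation of a_k = ∫ e^{-λt}(λt)^k/k! f(t) dt."""
    h = T / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        total += (math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)
                  * mu * math.exp(-mu * t) * h)
    return total

for k in range(3):
    exact = mu * lam ** k / (lam + mu) ** (k + 1)  # closed form for M/M/1
    print(k, round(a(k), 6), round(exact, 6))
```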

Let the Y_i be i.i.d. with P(Y_i = k − 1) = a_k, so Y_i is one less than the number of customers arriving during the time it takes to serve one customer (the served customer departs). Let X_{n+1} = (X_n + Y_{n+1})^+ be the number of customers waiting. Then X_n is a Markov chain with p(0, 0) = a_0 + a_1 and p(i, i − 1 + k) = a_k if i ≥ 1, k ≥ 0.

Ehrenfest urns

Suppose we have two urns with a total of r balls, k in one and r − k in the other. Pick one of the r balls at random and move it to the other urn. Let X_n be the number of balls in the first urn. X_n is a Markov chain with p(k, k+1) = (r − k)/r, p(k, k−1) = k/r, and p(i, j) = 0 otherwise.

One model for this is to consider two containers of air with a thin tube connecting them. Suppose a few molecules of a foreign substance are introduced. Then the number of molecules in the first container is like an Ehrenfest urn. We shall see that all states in this model are recurrent, so infinitely often all the molecules of the foreign substance will be in the first urn. Yet there is a tendency towards equilibrium, so on average there will be about the same number of molecules in each container for all large times.

    Birth and death processes 

Suppose there are i particles, and the probability of a birth is a_i and the probability of a death is b_i, where a_i, b_i ≥ 0 and a_i + b_i ≤ 1. Setting X_n equal to the number of particles, X_n is a Markov chain with p(i, i+1) = a_i, p(i, i−1) = b_i, and p(i, i) = 1 − a_i − b_i.
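The Ehrenfest urn is itself a birth and death chain, with a_k = (r − k)/r and b_k = k/r. A small simulation (r = 10 and the seed are arbitrary) illustrates the tendency toward equilibrium: the long-run fraction of time spent in each state hovers near Binomial(r, 1/2).

```python
import random

r = 10  # total number of balls (illustrative)

def step(k, rng):
    """One Ehrenfest move: a ball enters urn 1 w.p. (r-k)/r, else leaves."""
    return k + 1 if rng.random() < (r - k) / r else k - 1

rng = random.Random(1)
k, counts = r // 2, [0] * (r + 1)
for _ in range(200_000):
    k = step(k, rng)
    counts[k] += 1

print([round(c / 200_000, 3) for c in counts])
```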

    23. Markov properties.

    A special case of the Markov property says that

E^x[f(X_{n+1}) | F_n] = E^{X_n} f(X_1).   (23.1)


The right hand side is to be interpreted as φ(X_n), where φ(y) = E^y f(X_1). The randomness on the right hand side all comes from the X_n. If we write f(X_{n+1}) = f(X_1) ∘ θ_n and we write Y for f(X_1), then the above can be rewritten

    E^x[Y ∘ θ_n | F_n] = E^{X_n} Y.

Let F_∞ be the σ-field generated by ∪_{n=1}^∞ F_n.

Theorem 23.1. (Markov property) If Y is bounded and measurable with respect to F_∞, then

    E^x[Y ∘ θ_n | F_n] = E^{X_n}[Y],   P^x-a.s.,

for each n and x.

Proof. If we can prove this for Y = f_1(X_1) · · · f_m(X_m), then taking f_j(x) = 1_{i_j}(x), we will have it for Y's of the form 1_{(X_1 = i_1, ..., X_m = i_m)}. By linearity (and the fact that S is countable), we will then have it for Y's of the form 1_{((X_1, ..., X_m) ∈ B)}. A monotone class argument shows that such Y's generate F_∞.

We use induction on m, and first we prove it for m = 1. We need to show

    E^x[f_1(X_1) ∘ θ_n | F_n] = E^{X_n} f_1(X_1).

Using linearity and the fact that S is countable, it suffices to show this for f_1(y) = 1_{{j}}(y). Using the definition of θ_n, we need to show

    P^x(X_{n+1} = j | F_n) = P^{X_n}(X_1 = j),

or equivalently,

    P^x(X_{n+1} = j; A) = E^x[P^{X_n}(X_1 = j); A]   (23.2)

when A ∈ F_n. By linearity it suffices to consider A of the form A = (X_1 = i_1, . . . , X_n = i_n). The left hand side of (23.2) is then

    P^x(X_{n+1} = j, X_1 = i_1, . . . , X_n = i_n),

and by (21.4) this is equal to

    p(i_n, j) P^x(X_1 = i_1, . . . , X_n = i_n) = p(i_n, j) P^x(A).

Let g(y) = P^y(X_1 = j). We have

    P^x(X_1 = j, X_0 = k) = P^k(X_1 = j) if x = k, and 0 if x ≠ k,

while

    E^x[g(X_0); X_0 = k] = E^x[g(k); X_0 = k] = P^k(X_1 = j) P^x(X_0 = k) = P^k(X_1 = j) if x = k, and 0 if x ≠ k.

It follows that

    p(i, j) = P^x(X_1 = j | X_0 = i) = P^i(X_1 = j).

So the right hand side of (23.2) is

    E^x[p(X_n, j); X_1 = i_1, . . . , X_n = i_n] = p(i_n, j) P^x(A),


    as required.

Suppose the result holds for m and we want to show it holds for m + 1. We have

    E^x[f_1(X_{n+1}) · · · f_{m+1}(X_{n+m+1}) | F_n]
        = E^x[ E^x[f_{m+1}(X_{n+m+1}) | F_{n+m}] f_1(X_{n+1}) · · · f_m(X_{n+m}) | F_n ]
        = E^x[ E^{X_{n+m}}[f_{m+1}(X_1)] f_1(X_{n+1}) · · · f_m(X_{n+m}) | F_n ]
        = E^x[ f_1(X_{n+1}) · · · f_{m−1}(X_{n+m−1}) h(X_{n+m}) | F_n ].

Here we used the result for m = 1 and we defined h(y) = f_m(y) g(y), where g(y) = E^y[f_{m+1}(X_1)]. Using the induction hypothesis, this is equal to

    E^{X_n}[f_1(X_1) · · · f_{m−1}(X_{m−1}) h(X_m)] = E^{X_n}[ f_1(X_1) · · · f_m(X_m) E^{X_m} f_{m+1}(X_1) ]
        = E^{X_n}[ f_1(X_1) · · · f_m(X_m) E[f_{m+1}(X_{m+1}) | F_m] ]
        = E^{X_n}[ f_1(X_1) · · · f_{m+1}(X_{m+1}) ],

which is what we needed.

Define θ_N(ω) = (θ_{N(ω)})(ω). The strong Markov property is the same as the Markov property, but with the fixed time n replaced by a stopping time N.

Theorem 23.2. If Y is bounded and measurable and N is a finite stopping time, then

    E^x[Y ∘ θ_N | F_N] = E^{X_N}[Y].

Proof. We will show

    P^x(X_{N+1} = j | F_N) = P^{X_N}(X_1 = j).

Once we have this, we can proceed as in the proof of Theorem 23.1 to obtain our result. To show the above equality, we need to show that if B ∈ F_N, then

    P^x(X_{N+1} = j, B) = E^x[P^{X_N}(X_1 = j); B].   (23.3)

Recall that since B ∈ F_N, we have B ∩ (N = k) ∈ F_k. Then

    P^x(X_{N+1} = j, B, N = k) = P^x(X_{k+1} = j, B, N = k)
        = E^x[P^x(X_{k+1} = j | F_k); B, N = k]
        = E^x[P^{X_k}(X_1 = j); B, N = k]
        = E^x[P^{X_N}(X_1 = j); B, N = k].

Now sum over k; since N is finite, we obtain our desired result.

Another way of expressing the Markov property is through the Chapman-Kolmogorov equations. Let p_n(i, j) = P(X_n = j | X_0 = i).

Proposition 23.3. For all i, j, m, n we have

    p_{n+m}(i, j) = Σ_{k∈S} p_n(i, k) p_m(k, j).


Proof. We write

    P(X_{n+m} = j, X_0 = i) = Σ_k P(X_{n+m} = j, X_n = k, X_0 = i)
        = Σ_k P(X_{n+m} = j | X_n = k, X_0 = i) P(X_n = k | X_0 = i) P(X_0 = i)
        = Σ_k P(X_{n+m} = j | X_n = k) p_n(i, k) P(X_0 = i)
        = Σ_k p_m(k, j) p_n(i, k) P(X_0 = i).

If we divide both sides by P(X_0 = i), we have our result.

Note the resemblance to matrix multiplication. It is clear that if P is the matrix made up of the p(i, j), then P^n will be the matrix whose (i, j) entry is p_n(i, j).
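Chapman-Kolmogorov is thus exactly the identity P^{n+m} = P^n P^m. A sketch checking this on the 5-state chain of Section 22, using plain-Python matrix products:

```python
# Transition matrix of the 5-state chain from Section 22.
P = [
    [0, 0.5, 0.5, 0, 0],
    [0.25, 0, 0.5, 0, 0.25],
    [0.25, 0.25, 0.5, 0, 0],
    [1, 0, 0, 0, 0],
    [0.5, 0, 0, 0, 0.5],
]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(A, n):
    R = A
    for _ in range(n - 1):
        R = matmul(R, A)
    return R

# p_{5}(i, j) computed two ways: directly, and via p_2 and p_3.
P5 = matpow(P, 5)
P2P3 = matmul(matpow(P, 2), matpow(P, 3))
assert all(abs(P5[i][j] - P2P3[i][j]) < 1e-12 for i in range(5) for j in range(5))
print("P^5 == P^2 P^3")
```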

    24. Recurrence and transience.

Let

    T_y = min{i > 0 : X_i = y}.

This is the first time that X_i hits the point y. Even if X_0 = y we would have T_y > 0. We let T_y^k be the k-th time that the Markov chain hits y, and we set

    r(x, y) = P^x(T_y < ∞).


We used the fact that N(y) is the number of visits to y and the number of visits being larger than k is the same as the time of the k-th visit being finite. Since r(y, y) ≤ 1, the left hand side will be finite if and only if r(y, y) < 1.

Observe that

    E^y N(y) = Σ_n P^y(X_n = y) = Σ_n p_n(y, y).

If we consider simple symmetric random walk on the integers, then p_n(0, 0) is 0 if n is odd and equal to (n choose n/2) 2^{−n} if n is even. This is because in order to be at 0 after n steps, the walk must have had n/2 positive steps and n/2 negative steps; the probability of this is given by the binomial distribution. Using Stirling's approximation, we see that p_n(0, 0) ∼ c/√n for n even, so Σ_n p_n(0, 0) diverges, and hence simple random walk in one dimension is recurrent.

    Similar arguments show that simple symmetric random walk is also recurrent in 2 dimensions but

    transient in 3 or more dimensions.
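The Stirling estimate is easy to check numerically; the constant works out to c = √(2/π), so for even n the exact value (n choose n/2) 2^{−n} should track √(2/(πn)):

```python
from math import comb, pi, sqrt

# Compare p_n(0,0) = (n choose n/2) 2^{-n} with its Stirling approximation
# sqrt(2/(pi n)) for a few even n.
for n in [10, 100, 1000]:
    exact = comb(n, n // 2) / 2 ** n
    approx = sqrt(2 / (pi * n))
    print(n, exact, approx)
```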

    Proposition 24.3.   If  x   is recurrent and  r(x, y) >  0, then  y   is recurrent and  r(y, x) = 1.

Proof. First we show r(y, x) = 1. Suppose not. Since r(x, y) > 0, there is a smallest n and y_1, . . . , y_{n−1} such that p(x, y_1) p(y_1, y_2) · · · p(y_{n−1}, y) > 0. Since this is the smallest n, none of the y_i can equal x. Then

    P^x(T_x = ∞) ≥ p(x, y_1) · · · p(y_{n−1}, y)(1 − r(y, x)) > 0,

a contradiction to x being recurrent.

Next we show that y is recurrent. Since r(y, x) > 0, there exists L such that p_L(y, x) > 0, and since r(x, y) > 0, there exists K such that p_K(x, y) > 0. Then

    p_{L+n+K}(y, y) ≥ p_L(y, x) p_n(x, x) p_K(x, y).

Summing over n,

    Σ_n p_{L+n+K}(y, y) ≥ p_L(y, x) p_K(x, y) Σ_n p_n(x, x) = ∞.

We say a subset C of S is closed if x ∈ C and r(x, y) > 0 implies y ∈ C. A subset D is irreducible if x, y ∈ D implies r(x, y) > 0.

Proposition 24.4. Let C be finite and closed. Then C contains a recurrent state.

Combining this with Proposition 24.3, if C is moreover irreducible, then all states in C will be recurrent.

Proof. If not, then for all y we have r(y, y) < 1 and

    E^x N(y) = Σ_{k=1}^∞ r(x, y) r(y, y)^{k−1} = r(x, y) / (1 − r(y, y)) < ∞.

Since C is finite, E^x Σ_{y∈C} N(y) < ∞. But C is closed, so the chain started in C stays in C forever, and hence Σ_{y∈C} N(y) = ∞, a contradiction.


Theorem 24.5. Let R = {x : r(x, x) = 1}, the set of recurrent states. Then R = ∪_{i=1}^∞ R_i, where each R_i is closed and irreducible.

Proof. Say x ∼ y if r(x, y) > 0. Since every state in R is recurrent, x ∼ x, and if x ∼ y, then y ∼ x by Proposition 24.3. If x ∼ y and y ∼ z, then p_n(x, y) > 0 and p_m(y, z) > 0 for some n and m, so p_{n+m}(x, z) ≥ p_n(x, y) p_m(y, z) > 0 and x ∼ z. Therefore we have an equivalence relation, and we let the R_i be the equivalence classes.

Looking at our examples, it is easy to see that in the Ehrenfest urn model all states are recurrent. For the branching process model, suppose p(x, 0) > 0 for all x. Then 0 is recurrent and all the other states are transient. In the renewal chain there are two cases: if {k : a_k > 0} is unbounded, all states are recurrent; if K = max{k : a_k > 0} < ∞, then {0, 1, . . . , K − 1} are recurrent states and the rest are transient.

For the queueing model, let µ = Σ_k k a_k, the expected number of people arriving during one customer's service time. We may view this as a branching process by letting all the customers arriving during one person's service time be considered the progeny of that customer. It turns out that if µ ≤ 1, then 0 is recurrent and so are all the other states. If µ > 1, all states are transient.

    25. Stationary measures.

A probability µ is a stationary distribution if

    Σ_x µ(x) p(x, y) = µ(y).   (25.1)

In matrix notation this is µP = µ; that is, µ is a left eigenvector of P corresponding to the eigenvalue 1. In the case of a stationary distribution, P^µ(X_1 = y) = µ(y), which implies that X_1, X_2, . . . all have the same distribution. We can use (25.1) when µ is a measure rather than a probability, in which case it is called a stationary measure.

If we have a random walk on the integers, µ(x) = 1 for all x serves as a stationary measure. In the case of an asymmetric random walk, with p(i, i+1) = p, p(i, i−1) = q = 1 − p and p ≠ q, setting µ(x) = (p/q)^x also works.

In the Ehrenfest urn model, µ(x) = 2^{−r} (r choose x) works. One way to see this is that µ is the distribution one gets if one flips r coins and puts a ball in the first urn for each coin that lands heads.
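The identity (25.1) for the Ehrenfest chain can be verified by direct computation. A sketch (r = 6 is an arbitrary choice), using exact rational arithmetic so µP = µ holds with equality:

```python
from fractions import Fraction as F
from math import comb

# Check that mu(x) = 2^{-r} (r choose x) satisfies sum_x mu(x) p(x, y) = mu(y)
# for the Ehrenfest chain with r balls.
r = 6
mu = [F(comb(r, x), 2 ** r) for x in range(r + 1)]

def p(x, y):
    if y == x + 1:
        return F(r - x, r)  # a ball moves into urn 1
    if y == x - 1:
        return F(x, r)      # a ball moves out of urn 1
    return F(0)

for y in range(r + 1):
    assert sum(mu[x] * p(x, y) for x in range(r + 1)) == mu[y]
print("mu P = mu")
```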