Bruce K. Driver
Math 285 Stochastic Processes Spring 2015
June 5, 2015 File:285notes.tex
Contents

Part Homework Problems
0.1 Homework 1. Due Wednesday, April 8, 2015
0.2 Homework 2. Due Wednesday, April 15, 2015
0.3 Homework 3. Due Wednesday, April 22, 2015
0.4 Homework 4. Due Wednesday, April 29, 2015
0.5 Homework 5. Due Wednesday, May 6, 2015
0.6 Homework 6. Due Wednesday, May 13, 2015
0.7 Homework 7. Due Wednesday, May 20, 2015
0.8 Homework 8. Due Friday, May 29, 2015
0.9 Homework 9. Due Friday, June 5, 2015

Part I Background Material
1 Basic Probability Facts / Conditional Expectations
  1.1 Course Notation
  1.2 Some Discrete Distributions
2 Independence
  2.1 Borel Cantelli Lemmas
  2.2 Independent Random Variables
3 Conditional Expectation
  3.1 σ – algebras (partial information)
  3.2 Theory of Conditional Expectation
  3.3 Conditional Expectation for Continuous Random Variables
  3.4 Summary on Conditional Expectation Properties
4 Filtrations and stopping times
  4.1 Filtrations
  4.2 Stopping Times

Part II Discrete Time & Space Markov Processes
5 Markov Chain Basics
  5.1 Markov Chain Descriptions
  5.2 Joint Distributions of an MC
  5.3 More Markov Conditioning
  5.4 *Hitting Times Estimates
6 First Step Analysis
  6.1 Finite state space examples
  6.2 More first step analysis examples
  6.3 Random Walk Exercises
  6.4 Wald's Equation and Gambler's Ruin
  6.5 Some more worked examples
    6.5.1 Life Time Processes
    6.5.2 Sampling Plans
    6.5.3 Extra Homework Problems
  6.6 *Computations avoiding the first step analysis
    6.6.1 General facts about sub-probability kernels
7 Long Run Behavior of Discrete Markov Chains
  7.1 The Main Results
  7.2 Aperiodic chains
    7.2.1 More finite state space examples
  7.3 Periodic Chain Considerations
    7.3.1 A number theoretic lemma
8 *Proofs of Long Run Results
  8.1 Strong Markov Property Consequences
  8.2 Irreducible Recurrent Chains
9 More on Transience and Recurrence
  9.1 *Transience and Recurrence for R.W.s by Fourier Series Methods
10 Detail Balance and MMC
  10.1 Detail Balance
  10.2 Some uniform measure MMC examples
  10.3 The Metropolis-Hastings Algorithm
  10.4 The Linear Algebra associated to detail balance
  10.5 Convergence rates of reversible chains
  10.6 *Reversible Markov Chains
11 Hidden Markov Models
  11.1 The dynamic programming algorithm for most likely trajectory
    11.1.1 Viterbi's algorithm in the Hidden Markov chain context
  11.2 The computation of the conditional probabilities

Part III Martingales
12 (Sub and Super) Martingales
  12.1 (Sub and Super) Martingale Examples
  12.2 Jensen's and Hölder's Inequalities
  12.3 Stochastic Integrals and Optional Stopping
  12.4 Uniform Integrability and Optional Stopping
  12.5 Martingale Convergence Theorems
  12.6 Submartingale Maximal Inequalities
    12.6.1 *Lp – inequalities
  12.7 Martingale Exercises
    12.7.1 More Random Walk Exercises
    12.7.2 More advanced martingale exercises
13 Some Martingale Examples and Applications
  13.1 Aside on Large Deviations
  13.2 A Polya Urn Model
  13.3 Galton Watson Branching Process
    13.3.1 Appendix: justifying assumptions

Part IV Continuous Time Theory
14 Discrete State Space/Continuous Time
  14.1 Warm up exercises
  14.2 Continuous Time Chains
  14.3 Jump Hold Description
  14.4 Jump Hold Construction of the Poisson Process
    14.4.1 More related results*
  14.5 Long time behavior
15 Brownian Motion
  15.1 Stationary and Independent Increment Processes
  15.2 Normal/Gaussian Random Vectors
  15.3 Brownian Motion Defined
  15.4 Some "Brownian" Martingales
  15.5 Optional Sampling Results
  15.6 Scaling Properties of B. M.
  15.7 Random Walks to Brownian Motion
  15.8 Path Regularity Properties of BM
  15.9 The Strong Markov Property of Brownian Motion
  15.10 Dirichlet Problem and Brownian Motion
16 A short introduction to Ito's calculus
  16.1 A short introduction to Ito's calculus
  16.2 Ito's formula, heat equations, and harmonic functions
  16.3 A Simple Option Pricing Model
  16.4 The Black-Scholes Formula

Part V Appendix
17 Analytic Facts
  17.1 A Stirling's Formula Like Approximation
18 Multivariate Gaussians
  18.1 Review of Gaussian Random Variables
  18.2 Gaussian Random Vectors

References
Part
Homework Problems
0.1 Homework 1. Due Wednesday, April 8, 2015
• Look at from lecture note exercises: 3.1
• Hand in lecture note exercises: 3.2, 18.5, 4.1, 4.2, 4.3
• Hand in from Lawler: §5.1 on page 125.
0.2 Homework 2. Due Wednesday, April 15, 2015
• Look at from lecture note exercises: 6.3, 6.4, 6.6
• Hand in lecture note exercises: 6.1, 6.5, 6.7
• Hand in from Lawler: §1.1, 1.4, 1.19
0.3 Homework 3. Due Wednesday, April 22, 2015
• Look at from Lawler: §1.10
• Hand in from Lawler: §1.5, 1.8, 1.9, 1.14, 1.18, 2.3
0.4 Homework 4. Due Wednesday, April 29, 2015
• Hand in lecture note exercises: 9.1
• Hand in from Lawler any 2 of the following 3 problems: §7.6, 7.7, 7.8.
0.5 Homework 5. Due Wednesday, May 6, 2015
• Look at from lecture note exercises: 12.2
• Hand in lecture note exercises: 12.1, 12.3, 12.4
• Hand in from Lawler: §5.4, 5.7a, 5.8a, 5.12
0.6 Homework 6. Due Wednesday, May 13, 2015
• Look at from lecture note exercises: 12.7, 12.8, 12.9, 12.14
• Look at from Lawler: §5.13
• Hand in lecture note exercises: 12.11, 12.13
• Hand in from Lawler: §5.7b, 5.9, 5.14
*Correction to 5.9: The condition, $Pf(x) = g(x)$ for $x \in S \setminus A$, should read $Pf(x) - f(x) = g(x)$ for $x \in S \setminus A$.
0.7 Homework 7. Due Wednesday, May 20, 2015
• Look at from lecture note exercises: 14.4
• Hand in lecture note exercises: 14.1, 14.2, 14.3
0.8 Homework 8. Due Friday, May 29, 2015
• Look at from lecture note exercises: 14.5
• Hand in lecture note exercises: 14.6, 14.7, 14.8, 14.9
0.9 Homework 9. Due Friday, June 5, 2015
• Look at from lecture note exercises: 15.4, 15.5
• Hand in lecture note exercises: 15.1, 15.2, 15.3, 15.6
Part I
Background Material
1 Basic Probability Facts / Conditional Expectations
1.1 Course Notation
Our first goal in this course is to describe modern probability with "sufficient" precision to allow us to do the required computations for this course. We will thus be neglecting some technical details involving measures and σ – algebras. The knowledgeable reader should be able to fill in the missing hypotheses, while the less knowledgeable readers should not be too harmed by the omissions to follow.
1. (Ω, P) will denote a probability space and S will denote a set which is called the state space. Informally put, Ω is a set (often the sample space) and P is a function on all¹ subsets of Ω with the following properties:
   a) P(A) ∈ [0, 1] for all A ⊂ Ω,
   b) P(Ω) = 1 and P(∅) = 0,
   c) P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅. More generally, if A_n ⊂ Ω for all n with A_n ∩ A_m = ∅ for m ≠ n, we have
      $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n).$
2. EZ will denote the expectation of a random variable Z : Ω → R, which is defined as follows. If Z takes on only a finite number of real values z_1, ..., z_m, we define
      $EZ = \sum_{i=1}^{m} z_i\, P(Z = z_i).$
   For general Z ≥ 0 we set EZ = lim_{n→∞} EZ_n, where $\{Z_n\}_{n=1}^{\infty}$ is any sequence of discrete random variables such that 0 ≤ Z_n ↑ Z as n ↑ ∞. Finally, if Z is real valued with E|Z| < ∞ (in which case we say Z is integrable), we set EZ = EZ_+ − EZ_−, where Z_± = max(±Z, 0). With these definitions one eventually shows, via the dominated convergence theorem below, that if f : R → R is a bounded continuous function, then
      $E[f(Z)] = \lim_{\Delta \to 0} \sum_{n \in \mathbb{Z}} f(n\Delta)\, P(n\Delta < Z \leq (n+1)\Delta).$
   We summarize this informally by writing
      $E[f(Z)] = \int_{\mathbb{R}} f(z)\, P(z < Z \leq z + dz).$

¹ This is often a lie! Nevertheless, for our purposes it will be reasonably safe to ignore this lie.
3. The expectation has the following basic properties:
   a) Linearity: E[X + cY] = EX + cEY, where X and Y are any integrable random variables and c ∈ R.
   b) MCT: the monotone convergence theorem holds; if 0 ≤ Z_n ↑ Z, then
      $\uparrow \lim_{n \to \infty} E[Z_n] = E[Z]$ (with ∞ allowed as a possible value).
   c) DCT: the dominated convergence theorem holds; if $E\left[\sup_n |Z_n|\right] < \infty$ and $\lim_{n \to \infty} Z_n = Z$, then
      $E\left[\lim_{n \to \infty} Z_n\right] = EZ = \lim_{n \to \infty} EZ_n.$
   d) Fatou's Lemma: Fatou's lemma holds; if 0 ≤ Z_n ≤ ∞, then
      $E\left[\liminf_{n \to \infty} Z_n\right] \leq \liminf_{n \to \infty} E[Z_n].$
4. If S is a discrete set, i.e. finite or countable, and X : Ω → S, we let
      $\rho_X(s) := P(X = s).$
   More generally, if X_i : Ω → S_i for 1 ≤ i ≤ n, we let
      $\rho_{X_1, \dots, X_n}(s) := P(X_1 = s_1, \dots, X_n = s_n)$
   for all s = (s_1, ..., s_n) ∈ S_1 × ··· × S_n.
5. If S is R or R^n and X : Ω → S is a continuous random variable, we let ρ_X(x) be the probability density function of X, namely,
      $E[f(X)] = \int_S f(x)\, \rho_X(x)\, dx.$
6. Given random variables X and Y we let:
   a) μ_X := EX be the mean of X,
   b) Var(X) := E[(X − μ_X)²] = EX² − μ_X² be the variance of X,
   c) σ_X = σ(X) := √(Var(X)) be the standard deviation of X,
   d) Cov(X, Y) := E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X μ_Y be the covariance of X and Y,
   e) Corr(X, Y) := Cov(X, Y)/(σ_X σ_Y) be the correlation of X and Y.
7. Tonelli's theorem: if $f : \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}_+$, then
      $\int_{\mathbb{R}^k} dx \int_{\mathbb{R}^l} dy\, f(x, y) = \int_{\mathbb{R}^l} dy \int_{\mathbb{R}^k} dx\, f(x, y)$ (with ∞ being allowed).
8. Fubini's theorem: if $f : \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}$ is a function such that
      $\int_{\mathbb{R}^k} dx \int_{\mathbb{R}^l} dy\, |f(x, y)| = \int_{\mathbb{R}^l} dy \int_{\mathbb{R}^k} dx\, |f(x, y)| < \infty,$
   then
      $\int_{\mathbb{R}^k} dx \int_{\mathbb{R}^l} dy\, f(x, y) = \int_{\mathbb{R}^l} dy \int_{\mathbb{R}^k} dx\, f(x, y).$
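The absolute-integrability hypothesis in Fubini's theorem cannot be dropped. A standard discrete counterexample (sums in place of integrals; my illustration, not from the notes) takes f(m, n) = 1 if m = n, −1 if m = n + 1, and 0 otherwise: every column sums to 0, yet the row sums total 1, so the two iterated sums disagree. A short sketch:

```python
# Counterexample to "Fubini without absolute integrability":
# f(m, n) = 1 if m == n, -1 if m == n + 1, else 0, for m, n >= 0.
def f(m, n):
    return 1 if m == n else (-1 if m == n + 1 else 0)

# Each inner sum has finite support, so it is evaluated exactly below.
def row_sum(m):  # sum over all n; nonzero terms only at n = m and n = m - 1
    return f(m, m) + (f(m, m - 1) if m >= 1 else 0)

def col_sum(n):  # sum over all m; nonzero terms only at m = n and m = n + 1
    return f(n, n) + f(n + 1, n)

sum_rows_first = sum(row_sum(m) for m in range(1000))
sum_cols_first = sum(col_sum(n) for n in range(1000))
print(sum_rows_first, sum_cols_first)  # 1 0 -- the iterated sums disagree
```

Tonelli does not apply here since f takes negative values, and Fubini's hypothesis fails because Σ |f(m, n)| = ∞.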
Proposition 1.1. Suppose that X is an R^k-valued random variable, Y is an R^l-valued random variable independent of X, and f : R^k × R^l → R_+. Then (assuming X and Y have continuous distributions),

E[f(X, Y)] = ∫_{R^k} E[f(x, Y)] ρ_X(x) dx,

and similarly,

E[f(X, Y)] = ∫_{R^l} E[f(X, y)] ρ_Y(y) dy.
Proof. Independence implies that
ρ(X,Y ) (x, y) = ρX (x) ρY (y) .
Therefore,
E[f(X, Y)] = ∫_{R^k × R^l} f(x, y) ρ_X(x) ρ_Y(y) dx dy
= ∫_{R^k} [∫_{R^l} f(x, y) ρ_Y(y) dy] ρ_X(x) dx
= ∫_{R^k} E[f(x, Y)] ρ_X(x) dx.
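The identity in Proposition 1.1 is easy to check by simulation. The sketch below uses a hypothetical example not in the text: X, Y independent standard normals and f(x, y) = x² + xy, so that E[f(X, Y)] = E[X²] + E[X]E[Y] = 1, and here E[f(x, Y)] = x² exactly.

```python
import random

random.seed(0)

# Hypothetical choice for illustration: X, Y independent standard normals
# and f(x, y) = x**2 + x*y, so that E[f(X, Y)] = 1.
def f(x, y):
    return x**2 + x * y

N = 200_000
xs = [random.gauss(0, 1) for _ in range(N)]
ys = [random.gauss(0, 1) for _ in range(N)]

# Left side: E[f(X, Y)] by direct Monte Carlo.
lhs = sum(f(x, y) for x, y in zip(xs, ys)) / N

# Right side: integrate E[f(x, Y)] against the law of X; for this f,
# E[f(x, Y)] = x**2, so averaging x**2 over the X-samples estimates it.
rhs = sum(x**2 for x in xs) / N

print(lhs, rhs)  # both estimates should be close to 1.0
```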
1.2 Some Discrete Distributions
Definition 1.2 (Generating Function). Suppose that N : Ω → N_0 is an integer valued random variable on a probability space (Ω, B, P). The generating function associated to N is defined by

G_N(z) := E[z^N] = ∑_{n=0}^∞ P(N = n) z^n for |z| ≤ 1. (1.1)

By standard power series considerations, it follows that P(N = n) = (1/n!) G_N^{(n)}(0), so that G_N can be used to completely recover the distribution of N.
Proposition 1.3 (Generating Functions). The generating function satisfies

G_N^{(k)}(z) = E[N(N − 1)···(N − k + 1) z^{N−k}] for |z| < 1

and

G^{(k)}(1) := lim_{z↑1} G^{(k)}(z) = E[N(N − 1)···(N − k + 1)],

where it is possible that one and hence both sides of this equation are infinite. In particular, G′(1) := lim_{z↑1} G′(z) = EN and, if EN² < ∞,

Var(N) = G″(1) + G′(1) − [G′(1)]². (1.2)
Proof. By standard power series considerations, for |z| < 1,

G_N^{(k)}(z) = ∑_{n=0}^∞ P(N = n) · n(n − 1)···(n − k + 1) z^{n−k} = E[N(N − 1)···(N − k + 1) z^{N−k}]. (1.3)

Since, for z ∈ (0, 1),

0 ≤ N(N − 1)···(N − k + 1) z^{N−k} ↑ N(N − 1)···(N − k + 1) as z ↑ 1,

we may apply the MCT to pass to the limit as z ↑ 1 in Eq. (1.3) to find

G^{(k)}(1) = lim_{z↑1} G^{(k)}(z) = E[N(N − 1)···(N − k + 1)].
Exercise 1.1 (Some Discrete Distributions). Let p ∈ (0, 1] and λ > 0. In the four parts below, the distribution of N will be described. You should work out the generating function, G_N(z), in each case and use it to verify the given formulas for EN and Var(N).
1. Bernoulli(p): P(N = 1) = p and P(N = 0) = 1 − p. You should find EN = p and Var(N) = p − p².
2. Bin(n, p): P(N = k) = (n choose k) p^k (1 − p)^{n−k} for k = 0, 1, ..., n. (P(N = k) is the probability of k successes in a sequence of n independent yes/no experiments with probability of success p.) You should find EN = np and Var(N) = n(p − p²).
3. Geometric(p): P(N = k) = p(1 − p)^{k−1} for k ∈ N. (P(N = k) is the probability that the k-th trial is the first success out of a sequence of independent trials with probability of success p.) You should find EN = 1/p and Var(N) = (1 − p)/p².
4. Poisson(λ): P(N = k) = (λ^k/k!) e^{−λ} for all k ∈ N_0. You should find EN = λ = Var(N).
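A Monte Carlo check of part 3 (the other parts can be tested the same way); the sketch below builds Geometric(p) samples directly from the Bernoulli-trials description and compares the empirical mean and variance with 1/p and (1 − p)/p², for the hypothetical choice p = 0.3.

```python
import random

random.seed(1)
p, n_trials = 0.3, 100_000

# Geometric(p) as the index of the first success in Bernoulli(p) trials.
def geometric(p):
    k = 1
    while random.random() >= p:
        k += 1
    return k

samples = [geometric(p) for _ in range(n_trials)]
mean = sum(samples) / n_trials
var = sum((x - mean)**2 for x in samples) / n_trials

print(mean, var)  # should approximate 1/p and (1 - p)/p**2
```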
Remark 1.4 (Memoryless property of the geometric distribution). Suppose that {X_i} are i.i.d. Bernoulli random variables with P(X_i = 1) = p and P(X_i = 0) = 1 − p, and let N = inf{i ≥ 1 : X_i = 1}. Then

P(N = k) = P(X_1 = 0, ..., X_{k−1} = 0, X_k = 1) = (1 − p)^{k−1} p,

so that N is geometric with parameter p. Using this representation we easily and intuitively see that

P(N = n + k|N > n) = P(X_1 = 0, ..., X_{n+k−1} = 0, X_{n+k} = 1)/P(X_1 = 0, ..., X_n = 0)
= P(X_{n+1} = 0, ..., X_{n+k−1} = 0, X_{n+k} = 1)
= P(X_1 = 0, ..., X_{k−1} = 0, X_k = 1) = P(N = k).
This can be verified by first principles as well:

P(N = n + k|N > n) = P(N = n + k)/P(N > n) = p(1 − p)^{n+k−1} / ∑_{j>n} p(1 − p)^{j−1}
= p(1 − p)^{n+k−1} / ∑_{j=0}^∞ p(1 − p)^{n+j}
= (1 − p)^{n+k−1} / [(1 − p)^n ∑_{j=0}^∞ (1 − p)^j]
= (1 − p)^{k−1} / (1/(1 − (1 − p)))
= p(1 − p)^{k−1} = P(N = k).
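The memoryless property is also easy to see empirically; the sketch below compares the conditional frequency of {N = n + k} given {N > n} with the unconditional frequency of {N = k}, for the hypothetical choices p = 0.4, n = 3, k = 2.

```python
import random

random.seed(2)
p, n, k, trials = 0.4, 3, 2, 200_000

# Geometric(p) as the time of the first success in Bernoulli(p) trials.
def geometric(p):
    j = 1
    while random.random() >= p:
        j += 1
    return j

samples = [geometric(p) for _ in range(trials)]

# Conditional frequency P(N = n + k | N > n) versus unconditional P(N = k).
tail = [x for x in samples if x > n]
cond = sum(1 for x in tail if x == n + k) / len(tail)
uncond = sum(1 for x in samples if x == k) / trials

print(cond, uncond)  # both should be near p*(1 - p)**(k - 1) = 0.24
```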
Exercise 1.2. Let S_{n,p} =ᵈ Bin(n, p), k ∈ N, and p_n = λ_n/n where λ_n → λ > 0 as n → ∞. Show that

lim_{n→∞} P(S_{n,p_n} = k) = (λ^k/k!) e^{−λ} = P(Poi(λ) = k).

Thus we see that, for p = O(1/n) and k not too large relative to n, for large n

P(Bin(n, p) = k) ≅ P(Poi(np) = k) = ((np)^k/k!) e^{−np}.

(We will come back to the Poisson distribution and the related Poisson process later on.)
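A quick numerical illustration of the approximation, for the hypothetical values n = 1000, λ = 2, k = 3:

```python
import math

# Compare P(Bin(n, p) = k) with P(Poi(n p) = k) for p = lam / n.
n, lam, k = 1000, 2.0, 3
p = lam / n

binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
poiss = lam**k / math.factorial(k) * math.exp(-lam)

print(binom, poiss)  # the two probabilities agree to about 3 decimal places
```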
2
Independence
Definition 2.1. We say that an event A is independent of an event B iff P(A|B) = P(A), or equivalently that

P(A ∩ B) = P(A) P(B).

We further say a collection of events {A_j}_{j∈J} is independent iff

P(∩_{j∈J_0} A_j) = ∏_{j∈J_0} P(A_j)

for any finite subset J_0 of J.

Lemma 2.2. If {A_j}_{j∈J} is an independent collection of events then so is {A_j, A_j^c}_{j∈J}.
Proof. First consider the case of two independent events, A and B. By assumption, P(A ∩ B) = P(A) P(B). Since

A ∩ B^c = A \ B = A \ (B ∩ A),

it follows that

P(A ∩ B^c) = P(A) − P(B ∩ A) = P(A) − P(A) P(B) = P(A)(1 − P(B)) = P(A) P(B^c).

Thus if {A, B} are independent then so are {A, B^c}. Similarly we may show {A^c, B} are independent and then that {A^c, B^c} are independent. That is,

P(A^ε ∩ B^δ) = P(A^ε) P(B^δ),

where ε, δ is either "nothing" or "c." The general case now follows similarly. Indeed, if {A_1, ..., A_n} ⊂ {A_j}_{j∈J}, we must show that

P(A_1^{ε_1} ∩ ··· ∩ A_n^{ε_n}) = P(A_1^{ε_1}) ··· P(A_n^{ε_n}),

where ε_j = c or ε_j = " ". But this follows from the two-event case above. For example, {A_1 ∩ ··· ∩ A_{n−1}, A_n} independent implies that {A_1 ∩ ··· ∩ A_{n−1}, A_n^c} are independent and hence

P(A_1 ∩ ··· ∩ A_{n−1} ∩ A_n^c) = P(A_1 ∩ ··· ∩ A_{n−1}) P(A_n^c) = P(A_1) ··· P(A_{n−1}) P(A_n^c).

Thus we have shown it is permissible to add A_j^c to the list for any j ∈ J.
Lemma 2.3. If {A_n}_{n=1}^∞ is a sequence of independent events, then

P(∩_{n=1}^∞ A_n) = ∏_{n=1}^∞ P(A_n) := lim_{N→∞} ∏_{n=1}^N P(A_n).

Proof. Since ∩_{n=1}^N A_n ↓ ∩_{n=1}^∞ A_n, it follows that

P(∩_{n=1}^∞ A_n) = lim_{N→∞} P(∩_{n=1}^N A_n) = lim_{N→∞} ∏_{n=1}^N P(A_n),

where we have used the independence assumption for the last equality.
2.1 Borel Cantelli Lemmas
Definition 2.4. Suppose that {A_n}_{n=1}^∞ is a sequence of events. Let

{A_n i.o.} := {∑_{n=1}^∞ 1_{A_n} = ∞}

denote the event where infinitely many of the events A_n occur. The abbreviation "i.o." stands for infinitely often.

For example, if X_n is H or T depending on whether a heads or tails is flipped at the n-th step, then {X_n = H i.o.} is the event where an infinite number of heads is flipped.
Lemma 2.5 (The First Borel–Cantelli Lemma). If {A_n} is a sequence of events such that ∑_{n=1}^∞ P(A_n) < ∞, then

P(A_n i.o.) = 0.

Proof. Since

∞ > ∑_{n=1}^∞ P(A_n) = ∑_{n=1}^∞ E[1_{A_n}] = E[∑_{n=1}^∞ 1_{A_n}],

it follows that ∑_{n=1}^∞ 1_{A_n} < ∞ almost surely (a.s.), i.e. with probability 1 only finitely many of the A_n can occur.

Under the additional assumption of independence we have the following strong converse of the first Borel–Cantelli lemma.
Lemma 2.6 (Second Borel–Cantelli Lemma). If {A_n}_{n=1}^∞ are independent events, then

∑_{n=1}^∞ P(A_n) = ∞ =⇒ P(A_n i.o.) = 1. (2.1)

Proof. We are going to show P({A_n i.o.}^c) = 0. Since

{A_n i.o.}^c = {∑_{n=1}^∞ 1_{A_n} = ∞}^c = {∑_{n=1}^∞ 1_{A_n} < ∞},

we see that ω ∈ {A_n i.o.}^c iff there exists n ∈ N such that ω ∉ A_m for all m ≥ n. Thus we have shown: if ω ∈ {A_n i.o.}^c then ω ∈ B_n := ∩_{m≥n} A_m^c for some n, and therefore {A_n i.o.}^c = ∪_{n=1}^∞ B_n. As B_n ↑ {A_n i.o.}^c we have

P({A_n i.o.}^c) = lim_{n→∞} P(B_n).

But making use of the independence (see Lemmas 2.2 and 2.3) and the estimate 1 − x ≤ e^{−x} (see Figure 2.1 below), we find

P(B_n) = P(∩_{m≥n} A_m^c) = ∏_{m≥n} P(A_m^c) = ∏_{m≥n} [1 − P(A_m)]
≤ ∏_{m≥n} e^{−P(A_m)} = exp(−∑_{m≥n} P(A_m)) = e^{−∞} = 0.
Combining the two Borel–Cantelli lemmas gives the following zero-one law.

Corollary 2.7 (Borel's Zero-One law). If {A_n}_{n=1}^∞ are independent events, then

P(A_n i.o.) = 0 if ∑_{n=1}^∞ P(A_n) < ∞, and
P(A_n i.o.) = 1 if ∑_{n=1}^∞ P(A_n) = ∞.
Example 2.8. If {X_n}_{n=1}^∞ denotes the outcomes of the tosses of a coin such that P(X_n = H) = p > 0, then P(X_n = H i.o.) = 1.
Fig. 2.1. Comparing e−x and 1− x.
Example 2.9. Suppose a monkey types on a keyboard, with the strokes being independent and identically distributed and each key hit with positive probability. Then, if she lives long enough, the monkey will eventually type the text of the Bible. Indeed, let S be the set of possible keystrokes and let (s_1, ..., s_N) be the strokes necessary to type the Bible. Further let {X_n}_{n=1}^∞ be the monkey's strokes, and group them into blocks Y_k := (X_{kN+1}, ..., X_{(k+1)N}). We then have

P(Y_k = (s_1, ..., s_N)) = ∏_{j=1}^N P(X_j = s_j) =: p > 0.

Therefore,

∑_{k=1}^∞ P(Y_k = (s_1, ..., s_N)) = ∞,

and so by the second Borel–Cantelli lemma,

P(Y_k = (s_1, ..., s_N) i.o. k) = 1.
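A toy version of the argument can be simulated; the sketch below replaces the Bible by the short string "abba" on a four-key keyboard (hypothetical stand-ins) and counts how many blocks Y_k it takes until the target appears. Each block succeeds with probability p = 4⁻⁴, so the wait is finite with probability one.

```python
import random

random.seed(3)

keys, target = "abcd", "abba"
N = len(target)

# Group keystrokes into independent blocks Y_k of length N, as in the text,
# and count blocks until one equals the target string.
blocks = 0
while True:
    blocks += 1
    if "".join(random.choice(keys) for _ in range(N)) == target:
        break

print(blocks)  # a finite number; on average 4**4 = 256 blocks are needed
```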
2.2 Independent Random Variables
Definition 2.10. We say a collection of discrete random variables {X_j}_{j∈J} is independent if

P(X_{j_1} = x_1, ..., X_{j_n} = x_n) = P(X_{j_1} = x_1) ··· P(X_{j_n} = x_n) (2.2)

for all possible choices of {j_1, ..., j_n} ⊂ J and all possible values x_k of X_{j_k}.
Proposition 2.11. A collection of discrete random variables {X_j}_{j∈J} is independent iff

E[f_1(X_{j_1}) ··· f_n(X_{j_n})] = E[f_1(X_{j_1})] ··· E[f_n(X_{j_n})] (2.3)

for all choices of {j_1, ..., j_n} ⊂ J and all choices of bounded (or non-negative) functions f_1, ..., f_n. Here n is arbitrary.
Proof. (=⇒) If {X_j}_{j∈J} are independent, then

E[f(X_{j_1}, ..., X_{j_n})] = ∑_{x_1,...,x_n} f(x_1, ..., x_n) P(X_{j_1} = x_1, ..., X_{j_n} = x_n)
= ∑_{x_1,...,x_n} f(x_1, ..., x_n) P(X_{j_1} = x_1) ··· P(X_{j_n} = x_n).

Therefore,

E[f_1(X_{j_1}) ··· f_n(X_{j_n})] = ∑_{x_1,...,x_n} f_1(x_1) ··· f_n(x_n) P(X_{j_1} = x_1) ··· P(X_{j_n} = x_n)
= (∑_{x_1} f_1(x_1) P(X_{j_1} = x_1)) ··· (∑_{x_n} f_n(x_n) P(X_{j_n} = x_n))
= E[f_1(X_{j_1})] ··· E[f_n(X_{j_n})].

(⇐=) Now suppose that Eq. (2.3) holds. If f_j := δ_{x_j} for all j, then

E[f_1(X_{j_1}) ··· f_n(X_{j_n})] = E[δ_{x_1}(X_{j_1}) ··· δ_{x_n}(X_{j_n})] = P(X_{j_1} = x_1, ..., X_{j_n} = x_n),

while

E[f_k(X_{j_k})] = E[δ_{x_k}(X_{j_k})] = P(X_{j_k} = x_k).

Therefore it follows from Eq. (2.3) that Eq. (2.2) holds, i.e. {X_j}_{j∈J} is an independent collection of random variables.
Using this as motivation we make the following definition.
Definition 2.12. A collection of arbitrary random variables {X_j}_{j∈J} is independent iff

E[f_1(X_{j_1}) ··· f_n(X_{j_n})] = E[f_1(X_{j_1})] ··· E[f_n(X_{j_n})]

for all choices of {j_1, ..., j_n} ⊂ J and all choices of bounded (or non-negative) functions f_1, ..., f_n.

Fact 2.13. To check independence of a collection of real valued random variables {X_j}_{j∈J}, it suffices to show

P(X_{j_1} ≤ t_1, ..., X_{j_n} ≤ t_n) = P(X_{j_1} ≤ t_1) ··· P(X_{j_n} ≤ t_n)

for all possible choices of {j_1, ..., j_n} ⊂ J and all possible t_k ∈ R. Moreover, one can replace ≤ by < or reverse these inequalities in the above expression.
One of the key theorems involving independent random variables is thestrong law of large numbers. The other is the central limit theorem.
Theorem 2.14 (Kolmogorov's Strong Law of Large Numbers). Suppose that {X_n}_{n=1}^∞ are i.i.d. random variables and let S_n := X_1 + ··· + X_n. Then there exists μ ∈ R such that (1/n)S_n → μ a.s. iff X_n is integrable, in which case EX_n = μ.
Remark 2.15. If E|X_1| = ∞ but EX_1^− < ∞, then (1/n)S_n → ∞ a.s. To prove this, for M > 0 let

X_n^M := min(X_n, M) = X_n if X_n ≤ M, and M if X_n ≥ M,

and S_n^M := ∑_{i=1}^n X_i^M. It follows from Theorem 2.14 that (1/n)S_n^M → μ_M := EX_1^M a.s. Since S_n ≥ S_n^M, we may conclude that

liminf_{n→∞} S_n/n ≥ liminf_{n→∞} (1/n)S_n^M = μ_M a.s.

Since μ_M → ∞ as M → ∞, it follows that liminf_{n→∞} S_n/n = ∞ a.s. and hence that lim_{n→∞} S_n/n = ∞ a.s.
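The strong law is easy to watch numerically; the sketch below is a hypothetical instance with Exp(1) samples, so μ = 1, and tracks the running averages (1/n)S_n at a few values of n.

```python
import random

random.seed(4)

# Running averages of i.i.d. Exp(1) samples; the strong law says they
# converge to the mean mu = 1 almost surely.
n = 100_000
total, averages = 0.0, []
for i in range(1, n + 1):
    total += random.expovariate(1.0)
    if i in (100, 10_000, 100_000):
        averages.append(total / i)

print(averages)  # the later entries should be close to 1.0
```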
3
Conditional Expectation
3.1 σ – algebras (partial information)
Definition 3.1 (σ-algebra of X). Let Ω be a sample space, W be a set, and X : Ω → W be a function. Let

σ(X) := {A := {X ∈ B} = X^{−1}(B) : B ⊂ W}.

Put another way, A ∈ σ(X) iff A = {X ∈ B} for some B ⊂ W, iff 1_A = f(X) for some function f : W → {0, 1}.

Remark 3.2. Notice that if A_i ∈ σ(X) for i ∈ N, then A_1^c, ∪_{i=1}^∞ A_i, ∩_{i=1}^∞ A_i, and A_2 \ A_1 = A_2 ∩ A_1^c are in σ(X). In other words, σ(X) is stable under all of the usual set operations. Also notice that ∅, Ω ∈ σ(X).
The reader should interpret σ(X) as the events in Ω which can be distinguished (measured) by knowing X. Now let S be another set and Y : Ω → S be another function.

Definition 3.3 (X-measurable). We say Y is X-measurable iff σ(Y) ⊂ σ(X), i.e. the events that can be measured by Y can also be measured by X.
Proposition 3.4. Let Y : Ω → S and X : Ω → W be functions. Then Y is X-measurable iff there exists f : W → S such that Y = f(X), i.e. Y(ω) is completely determined by knowing X(ω).

Proof. If Y = f(X) for some function f : W → S, then for A ⊂ S we have

{Y ∈ A} = {f(X) ∈ A} = {X ∈ f^{−1}(A)} ∈ σ(X),

from which it follows that σ(Y) ⊂ σ(X). You are asked to prove the converse direction in Exercise 3.1.
Exercise 3.1 (Optional). Let W_0 := X(Ω) ⊂ W. Finish the proof of Proposition 3.4 using the following outline:

1. Use the fact that σ(Y) ⊂ σ(X) to show that for each s ∈ S there exists B_s ⊂ W_0 ⊂ W such that {Y = s} = {X ∈ B_s}.
2. Show B_s ∩ B_{s′} = ∅ for all s, s′ ∈ S with s ≠ s′.
3. Show W_0 = ∪_{s∈S} B_s.
   Now fix a point s_* ∈ S and define f : W → S by setting f(w) = s_* when w ∈ W \ W_0 and f(w) = s when w ∈ B_s ⊂ W_0.
4. Verify that Y = f(X).
3.2 Theory of Conditional Expectation
Let us now fix a probability, P, on Ω for the rest of this subsection.
Notation 3.5 (Conditional Expectation 1) Given Y ∈ L¹(P) and A ⊂ Ω, let

E[Y : A] := E[1_A Y]

and

E[Y|A] := E[Y : A]/P(A) if P(A) > 0, and 0 if P(A) = 0. (3.1)

(In point of fact, when P(A) = 0 we could set E[Y|A] to be any real number. We choose 0 for definiteness and so that Y → E[Y|A] is always linear.)
Example 3.6 (Conditioning for the uniform distribution). Suppose that Ω is a finite set and P is the uniform distribution on Ω, so that P({ω}) = 1/#(Ω) for all ω ∈ Ω. Then for any non-empty subset A ⊂ Ω and Y : Ω → R, E[Y|A] is the expectation of Y restricted to A under the uniform distribution on A. Indeed,

E[Y|A] = (1/P(A)) E[Y : A] = (1/P(A)) ∑_{ω∈A} Y(ω) P({ω})
= (1/(#(A)/#(Ω))) ∑_{ω∈A} Y(ω) (1/#(Ω)) = (1/#(A)) ∑_{ω∈A} Y(ω).
Theorem 3.7 (Conditional Expectation). Suppose X : Ω → W as above and Y ∈ L²(P), i.e. Y : Ω → C such that E|Y|² < ∞. Then there exists an "essentially unique" function h : W → C with h(X) ∈ L²(P) which satisfies

E[|Y − h(X)|²] ≤ E|Y − f(X)|² (3.2)

for all functions f : W → C such that f(X) ∈ L²(P). The function h may alternatively be determined by requiring

E[(Y − h(X)) f(X)] = 0 for all f : W → C such that E|f(X)|² < ∞. (3.3)
Proof. The existence of such an h satisfying Eq. (3.2) is a consequence of the orthogonal projection theorem in Hilbert spaces. We will simply take this result for granted. However, let us show the conditions in Eq. (3.2) and Eq. (3.3) are equivalent.

Eq. (3.2) =⇒ Eq. (3.3): If Eq. (3.2) holds, then for any f : W → R such that f(X) ∈ L²(P) and t ∈ R,

ϕ(t) := E[|Y − (h(X) + t f(X))|²] = E|Y − h(X)|² + 2t E[(Y − h(X)) f(X)] + t² E|f(X)|²

has a minimum at t = 0. So by the first derivative test it follows that

0 = ϕ′(0) = 2 E[(Y − h(X)) f(X)],

which shows Eq. (3.3) holds.

Eq. (3.3) =⇒ Eq. (3.2): Assuming Eq. (3.3), it follows that

E|Y − f(X)|² = E|Y − h(X) + (h − f)(X)|²
= E|Y − h(X)|² + 2 E[(Y − h(X))(h − f)(X)] + E|(h − f)(X)|²
= E|Y − h(X)|² + E|(h − f)(X)|² ≥ E|Y − h(X)|².
Definition 3.8 (Conditional Expectation). We refer to the function h(X) in Theorem 3.7 as the conditional expectation of Y given X (or σ(X)) and denote the result by E[Y|σ(X)] or by E[Y|X].
Proposition 3.9 (Discrete formula). Suppose that X : Ω → W has finite or countable range, i.e. X(Ω) = {x_i}_{i=1}^N where N ∈ N ∪ {∞}. In this case,

E[Y|X] = h(X) where h(x) = E[Y|X = x], (3.4)

where E[Y|X = x] is as in Notation 3.5.

Proof. We are looking for h : X(Ω) ⊂ W → R satisfying

0 = E[(Y − h(X)) f(X)] = ∑_{x∈X(Ω)} E[(Y − h(X)) f(X) : X = x]
= ∑_{x∈X(Ω)} E[(Y − h(x)) f(x) : X = x]
= ∑_{x∈X(Ω)} (E[Y : X = x] − h(x) P(X = x)) f(x)

for all f on X(Ω) ⊂ W. This implies that we require

0 = E[Y : X = x] − h(x) P(X = x) for all x ∈ X(Ω). (3.5)

In other words, if P(X = x) > 0 then

h(x) = E[Y · 1_{X=x}]/P(X = x) = E[Y|X = x].

If P(X = x) = 0, Eq. (3.5) holds no matter the value of h(x); for definiteness we choose h(x) = 0 in this case, as in Notation 3.5.
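The discrete formula and the orthogonality relation Eq. (3.3) can be checked exactly on a small example; the sketch below uses the hypothetical setup of two fair coin tosses, with X the first toss and Y the total number of heads.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product([0, 1], repeat=2))   # (first, second) tosses
P = Fraction(1, 4)                           # uniform on the 4 outcomes
X = lambda w: w[0]
Y = lambda w: w[0] + w[1]

# Eq. (3.4): h(x) = E[Y : X = x] / P(X = x).
def h(x):
    num = sum(Y(w) * P for w in outcomes if X(w) == x)
    den = sum(P for w in outcomes if X(w) == x)
    return num / den

# Eq. (3.3): E[(Y - h(X)) f(X)] = 0 for every function f of X.
for f in (lambda x: 1, lambda x: x, lambda x: 3 * x - 2):
    resid = sum((Y(w) - h(X(w))) * f(X(w)) * P for w in outcomes)
    assert resid == 0

print(h(0), h(1))  # prints 1/2 3/2
```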
Let us pause for a moment to record a few basic general properties of con-ditional expectations.
Proposition 3.10 (Contraction Property). For all Y ∈ L²(P) we have E|E[Y|X]| ≤ E|Y|. Moreover, if Y ≥ 0 then E[Y|X] ≥ 0 (a.s.).

Proof. Let E[Y|X] = h(X) (with h : W → R) and define

f(x) := 1 if h(x) ≥ 0, and −1 if h(x) < 0.

Since h(x) f(x) = |h(x)|, it follows from Eq. (3.3) that

E[|h(X)|] = E[Y f(X)] = |E[Y f(X)]| ≤ E[|Y f(X)|] = E|Y|.

For the second assertion, take f(x) = 1_{h(x)<0} in Eq. (3.3) in order to learn

E[h(X) 1_{h(X)<0}] = E[Y 1_{h(X)<0}] ≥ 0.

As h(X) 1_{h(X)<0} ≤ 0, we may conclude that h(X) 1_{h(X)<0} = 0 a.s.

Because of this proposition we may extend the notion of conditional expectation to Y ∈ L¹(P), as stated in the following theorem which we do not bother to prove here.
Theorem 3.11. Given X : Ω → W and Y ∈ L1 (P) , there exists an “essen-tially unique” function h : W → R such that Eq. (3.3) holds for all boundedfunctions, f : W → R. (As above we write E [Y |X] for h (X) .) Moreover thecontraction property, E |E [Y |X]| ≤ E |Y | , still holds.
Definition 3.12 (Conditional Expectation). If Y ∈ L1 (P) , we letE [Y |X] = h (X) be the “unique” random variable which is σ (X) measurableand satisfies,
E (Y f (X)) = E (h (X) f (X)) = E (E [Y |X] f (X))
for all bounded f.
Theorem 3.13 (Basic properties). Let Y, Y_1, and Y_2 be integrable random variables and X : Ω → W be given. Then:

1. E(Y_1 + Y_2|X) = E(Y_1|X) + E(Y_2|X).
2. E(aY|X) = aE(Y|X) for all constants a.
3. E(g(X)Y|X) = g(X)E(Y|X) for all bounded functions g.
4. E(E(Y|X)) = EY. (Law of total expectation.)
5. If Y and X are independent then E(Y|X) = EY.
Proof. 1. Let h_i(X) = E[Y_i|X]; then for all bounded f,

E[Y_1 f(X)] = E[h_1(X) f(X)] and E[Y_2 f(X)] = E[h_2(X) f(X)],

and adding these two equations together implies

E[(Y_1 + Y_2) f(X)] = E[(h_1(X) + h_2(X)) f(X)] = E[(h_1 + h_2)(X) f(X)]

for all bounded f. Therefore we may conclude that

E(Y_1 + Y_2|X) = (h_1 + h_2)(X) = h_1(X) + h_2(X) = E(Y_1|X) + E(Y_2|X).

2. The proof is similar to 1 but easier and so is omitted.

3. Let h(X) = E[Y|X]; then E[Y f(X)] = E[h(X) f(X)] for all bounded functions f. Replacing f by g · f implies

E[Y g(X) f(X)] = E[h(X) g(X) f(X)] = E[(h · g)(X) f(X)]

for all bounded functions f. Therefore we may conclude that

E[Y g(X)|X] = (h · g)(X) = h(X) g(X) = g(X) E(Y|X).

4. Take f ≡ 1 in Eq. (3.3).

5. If X and Y are independent and μ := E[Y], then

E[Y f(X)] = E[Y] E[f(X)] = μ E[f(X)] = E[μ f(X)],

from which it follows that E[Y|X] = μ as desired.
Exercise 3.2. Suppose that X and Y are two integrable random variables such that

E[X|Y] = 18 − (3/5)Y and E[Y|X] = 10 − (1/3)X.

Find EX and EY.
The next theorem says that conditional expectations essentially only depend on the distribution of (X, Y) and nothing else.

Theorem 3.14 (Dependence only on distributions). Suppose that (X, Y) and (X̃, Ỹ) are random vectors such that (X, Y) =ᵈ (X̃, Ỹ), i.e. E[G(X, Y)] = E[G(X̃, Ỹ)] for all bounded (or non-negative) functions G. If h(X) = E[u(X, Y)|X], then E[u(X̃, Ỹ)|X̃] = h(X̃).

Proof. By assumption we know that

E[u(X, Y) f(X)] = E[h(X) f(X)] for all bounded f.

Since (X, Y) =ᵈ (X̃, Ỹ), this is equivalent to

E[u(X̃, Ỹ) f(X̃)] = E[h(X̃) f(X̃)] for all bounded f,

which is equivalent to E[u(X̃, Ỹ)|X̃] = h(X̃).
Exercise 3.3. Let {X_i}_{i=1}^∞ be i.i.d. random variables with E|X_i| < ∞ for all i, and let S_m := X_1 + ··· + X_m for m = 1, 2, .... Show

E[S_m|S_n] = (m/n) S_n for all m ≤ n.

Hint: observe by symmetry¹ that there is a function h : R → R such that

E(X_i|S_n) = h(S_n) independent of i.
Proposition 3.15. Suppose that X : Ω → W and Y : Ω → W′ are independent random functions and U : W × W′ → C is a function such that E|U(X, Y)| < ∞. Under these assumptions,

E[U(X, Y)|X] = h(X), where h(x) := E[U(x, Y)] for all x ∈ W.

Proof. The result is true in general but requires measure theory in order to prove it in full generality. Here we give the proof in the case that X(Ω) is a countable set; see Proposition 3.21 for a proof for certain continuous random functions.

1 Apply Theorem 3.14 using (X_1, S_n) =ᵈ (X_i, S_n) for 1 ≤ i ≤ n.
Let

h(x) := E[U(X, Y)|X = x] = E[U(x, Y)|X = x] = E[U(x, Y)],

where in the last equality we have used the independence of X from Y, which by definition means

E[f(X) g(Y)] = E[f(X)] · E[g(Y)]

for all bounded functions f : W → C and g : W′ → C. The result now follows from Proposition 3.9.
Remark 3.16. If m > n in Exercise 3.3, then S_m = S_n + X_{n+1} + ··· + X_m. Since X_i is independent of S_n for i > n, it follows that

E(S_m|S_n) = E(S_n + X_{n+1} + ··· + X_m|S_n)
= E(S_n|S_n) + E(X_{n+1}|S_n) + ··· + E(X_m|S_n)
= S_n + (m − n)μ for m ≥ n,

where μ = EX_i.
Theorem 3.17 (Tower Property). Let Y be an integrable random variable and X_i : Ω → W_i be given functions for i = 1, 2. Then

E[E[Y|(X_1, X_2)]|X_1] = E[Y|X_1] (3.6)

and

E[E[Y|X_1]|(X_1, X_2)] = E[Y|X_1]. (3.7)

Proof. Let h(X_1, X_2) = E[Y|(X_1, X_2)] and f(X_1) = E[h(X_1, X_2)|X_1]. Then for any bounded function g : W_1 → R, we have

E[Y g(X_1)] = E[h(X_1, X_2) g(X_1)] = E[f(X_1) g(X_1)],

from which it follows that f(X_1) = E[Y|X_1], i.e. Eq. (3.6) holds. For Eq. (3.7) we simply notice that

E[E[Y|X_1]|(X_1, X_2)] = E[E[Y|X_1] · 1|(X_1, X_2)] = E[Y|X_1] · E[1|(X_1, X_2)] = E[Y|X_1].

Alternatively, the best approximation to E[Y|X_1] by a function of the form H(X_1, X_2) is clearly E[Y|X_1] itself.
3.3 Conditional Expectation for Continuous RandomVariables
(We will cover this section later in the course as needed.)

Suppose that Y and X are continuous random variables which have a joint density ρ_{(Y,X)}(y, x). Then by definition of ρ_{(Y,X)} we have, for all bounded or non-negative U, that

E[U(Y, X)] = ∫∫ U(y, x) ρ_{(Y,X)}(y, x) dy dx. (3.8)

The marginal density associated to X is then given by

ρ_X(x) := ∫ ρ_{(Y,X)}(y, x) dy, (3.9)

and recall that the conditional density ρ_{Y|X}(y, x) is defined by

ρ_{Y|X}(y, x) := ρ_{(Y,X)}(y, x)/ρ_X(x) if ρ_X(x) > 0, and 0 if ρ_X(x) = 0. (3.10)

Observe that if ρ_{(Y,X)}(y, x) is continuous, then

ρ_{(Y,X)}(y, x) = ρ_{Y|X}(y, x) ρ_X(x) for all (x, y). (3.11)

Indeed, if ρ_X(x) = 0, then

0 = ρ_X(x) = ∫ ρ_{(Y,X)}(y, x) dy,

from which it follows that ρ_{(Y,X)}(y, x) = 0 for all y. If ρ_{(Y,X)} is not continuous, Eq. (3.11) still holds for "a.e." (x, y), which is good enough.
Lemma 3.18. In the notation above,

ρ(x, y) = ρ_{Y|X}(y, x) ρ_X(x) for a.e. (x, y). (3.12)

Proof. By definition, Eq. (3.12) holds when ρ_X(x) > 0, and ρ(x, y) ≥ ρ_{Y|X}(y, x) ρ_X(x) for all (x, y). Moreover,

∫∫ ρ_{Y|X}(y, x) ρ_X(x) dx dy = ∫∫ ρ_{Y|X}(y, x) ρ_X(x) 1_{ρ_X(x)>0} dx dy
= ∫∫ ρ(x, y) 1_{ρ_X(x)>0} dx dy
= ∫ ρ_X(x) 1_{ρ_X(x)>0} dx = ∫ ρ_X(x) dx
= 1 = ∫∫ ρ(x, y) dx dy,
or equivalently,

∫∫ [ρ(x, y) − ρ_{Y|X}(y, x) ρ_X(x)] dx dy = 0,

which implies the result.
Theorem 3.19. Keeping the notation above, for all bounded or non-negative U we have E[U(Y, X)|X] = h(X), where

h(x) = ∫ U(y, x) ρ_{Y|X}(y, x) dy (3.13)
= (∫ U(y, x) ρ_{(Y,X)}(y, x) dy)/(∫ ρ_{(Y,X)}(y, x) dy) if ∫ ρ_{(Y,X)}(y, x) dy > 0, and 0 otherwise. (3.14)

In the future we will usually denote h(x) informally by E[U(Y, x)|X = x],² so that

E[U(Y, x)|X = x] := ∫ U(y, x) ρ_{Y|X}(y, x) dy. (3.15)
Proof. We are looking for h such that

E[U(Y, X) f(X)] = E[h(X) f(X)] for all bounded f.

Using Lemma 3.18, we find

E[U(Y, X) f(X)] = ∫∫ U(y, x) f(x) ρ_{(Y,X)}(y, x) dy dx
= ∫∫ U(y, x) f(x) ρ_{Y|X}(y, x) ρ_X(x) dy dx
= ∫ [∫ U(y, x) ρ_{Y|X}(y, x) dy] f(x) ρ_X(x) dx
= ∫ h(x) f(x) ρ_X(x) dx
= E[h(X) f(X)],

where h is given as in Eq. (3.13).
Example 3.20 (Durrett 8.15, p. 145). Suppose that X and Y have joint density ρ(x, y) = 8xy · 1_{0<y<x<1}. We wish to compute E[u(X, Y)|Y]. To this end we compute

ρ_Y(y) = ∫_R 8xy · 1_{0<y<x<1} dx = 8y ∫_y^1 x dx = 8y · (x²/2)|_{x=y}^{x=1} = 4y(1 − y²).

Therefore,

ρ_{X|Y}(x, y) = ρ(x, y)/ρ_Y(y) = 8xy · 1_{0<y<x<1}/(4y(1 − y²)) = 2x · 1_{0<y<x<1}/(1 − y²),

and so

E[u(X, Y)|Y = y] = ∫_R (2x · 1_{0<y<x<1}/(1 − y²)) u(x, y) dx = (2 · 1_{0<y<1}/(1 − y²)) ∫_y^1 u(x, y) x dx,

and so

E[u(X, Y)|Y] = (2/(1 − Y²)) ∫_Y^1 u(x, Y) x dx

is the best approximation to u(X, Y) by a function of Y alone.

2 Warning: this is not consistent with Eq. (3.1), as P(X = x) = 0 for continuous distributions.
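The computations in this example are easy to double-check numerically. The sketch below verifies ρ_Y(y) = 4y(1 − y²) at the hypothetical point y = 0.5 and, for the hypothetical choice u(x, y) = x, compares the conditional mean from direct numerical integration with the closed form 2(1 − y³)/(3(1 − y²)) implied by the text's formula.

```python
# Simple midpoint-rule integration; accurate enough for this check.
def integrate(f, a, b, n=10_000):
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

rho = lambda x, y: 8 * x * y if 0 < y < x < 1 else 0.0

y = 0.5
rho_Y = integrate(lambda x: rho(x, y), 0, 1)           # should be 4y(1 - y^2)
cond_mean = integrate(lambda x: x * rho(x, y), 0, 1) / rho_Y

closed_form = 2 * (1 - y**3) / (3 * (1 - y**2))        # E[X|Y = y] from the text
print(rho_Y, cond_mean, closed_form)
```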
Proposition 3.21. Suppose that X, Y are independent random functions; then

E[U(Y, X)|X] = h(X), where h(x) := E[U(Y, x)].

Proof. I will prove this in the continuous distribution case and leave the discrete case to the reader. (The result is true in general but requires measure theory in order to prove it in full generality.) The independence assumption is equivalent to ρ_{(Y,X)}(y, x) = ρ_Y(y) ρ_X(x). Therefore,

ρ_{Y|X}(y, x) = ρ_Y(y) if ρ_X(x) > 0, and 0 if ρ_X(x) = 0,

and therefore E[U(Y, X)|X] = h_0(X), where

h_0(x) = ∫ U(y, x) ρ_{Y|X}(y, x) dy
= 1_{ρ_X(x)>0} ∫ U(y, x) ρ_Y(y) dy = 1_{ρ_X(x)>0} E[U(Y, x)]
= 1_{ρ_X(x)>0} h(x).

If f is a bounded function of x, then

E[h_0(X) f(X)] = ∫ h_0(x) f(x) ρ_X(x) dx = ∫_{x:ρ_X(x)>0} h_0(x) f(x) ρ_X(x) dx
= ∫_{x:ρ_X(x)>0} h(x) f(x) ρ_X(x) dx = ∫ h(x) f(x) ρ_X(x) dx
= E[h(X) f(X)].
So for all practical purposes, h(X) = h_0(X), i.e. h(X) = h_0(X) a.s. (Indeed, take f(x) = sgn(h(x) − h_0(x)) in the above equation to learn that E|h(X) − h_0(X)| = 0.)
Theorem 3.22 (Iterated conditioning 1). Let X, Y, and Z be random vectors and suppose that (X, Y) is distributed according to ρ_{(X,Y)}(x, y) dx dy. Then

E[Z|Y = y] = ∫ E[Z|Y = y, X = x] ρ_{X|Y}(x, y) dx for ρ_Y(y) dy – a.e. y.

Proof. Let h(x, y) := E[Z|Y = y, X = x], so that

E[Z v(X, Y)] = E[h(X, Y) v(X, Y)] for all v.

Taking v(x, y) = g(y) to be a function of y alone shows

E[Z g(Y)] = E[h(X, Y) g(Y)] for all g.

Thus it follows that

E[Z|Y = y] = E[h(X, Y)|Y = y]
= ∫ h(x, y) ρ_{X|Y}(x, y) dx
= ∫ E[Z|Y = y, X = x] ρ_{X|Y}(x, y) dx.
Remark 3.23. Often, E[Z|Y = y_0] may be computed as

lim_{ε↓0} E[Z | |Y − y_0| < ε].

To understand this formula, suppose that h(y) := E[Z|Y = y] and ρ_Y(y) are continuous near y_0 and ρ_Y(y_0) > 0. Then

E[Z | |Y − y_0| < ε] = E[Z : |Y − y_0| < ε]/P(|Y − y_0| < ε)
= E[h(Y) 1_{|Y−y_0|<ε}]/P(|Y − y_0| < ε)
= (∫ h(y) 1_{|y−y_0|<ε} ρ_Y(y) dy)/(∫ 1_{|y−y_0|<ε} ρ_Y(y) dy) → h(y_0) as ε ↓ 0,

wherein we have used h(y) ≅ h(y_0) for y near y_0 and therefore

(∫ h(y) 1_{|y−y_0|<ε} ρ_Y(y) dy)/(∫ 1_{|y−y_0|<ε} ρ_Y(y) dy) ≅ (∫ h(y_0) 1_{|y−y_0|<ε} ρ_Y(y) dy)/(∫ 1_{|y−y_0|<ε} ρ_Y(y) dy) = h(y_0).
Here is a consequence of this result.
Theorem 3.24 (Iterated conditioning 2). Suppose that Ω is partitioned into disjoint sets {A_i}_{i=1}^n and y_0 is given such that the following limits exist:

P(A_i|Y = y_0) := lim_{ε↓0} P(A_i | |Y − y_0| < ε) and
E[Z|Y = y_0, A_i] := lim_{ε↓0} E[Z | |Y − y_0| < ε, A_i].

(In particular we are assuming that P(|Y − y_0| < ε, A_i) > 0 for all ε > 0.) Then

E[Z|Y = y_0] = ∑_{i=1}^n E[Z|Y = y_0, A_i] P(A_i|Y = y_0).
Proof. Since

E[Z | |Y − y_0| < ε, A_i] = E[Z 1_{A_i} 1_{|Y−y_0|<ε}]/P(A_i ∩ {|Y − y_0| < ε})
= E[Z 1_{A_i} | |Y − y_0| < ε] P(|Y − y_0| < ε)/P(A_i ∩ {|Y − y_0| < ε})
= E[Z 1_{A_i} | |Y − y_0| < ε]/P(A_i | |Y − y_0| < ε),

it follows that

lim_{ε↓0} E[Z 1_{A_i} | |Y − y_0| < ε] = lim_{ε↓0} (E[Z | |Y − y_0| < ε, A_i] · P(A_i | |Y − y_0| < ε))
= E[Z|Y = y_0, A_i] P(A_i|Y = y_0).

Moreover,

∑_{i=1}^n lim_{ε↓0} E[Z 1_{A_i} | |Y − y_0| < ε] = lim_{ε↓0} ∑_{i=1}^n E[Z 1_{A_i} | |Y − y_0| < ε]
= lim_{ε↓0} E[Z ∑_{i=1}^n 1_{A_i} | |Y − y_0| < ε]
= lim_{ε↓0} E[Z | |Y − y_0| < ε] = E[Z|Y = y_0],

and therefore

E[Z|Y = y_0] = ∑_{i=1}^n lim_{ε↓0} E[Z 1_{A_i} | |Y − y_0| < ε] = ∑_{i=1}^n E[Z|Y = y_0, A_i] P(A_i|Y = y_0)

as claimed.
Remark 3.25 (An advanced remark). The last result is in fact a special case of Theorem 3.22 wherein we take X = ∑_i i 1_{A_i}. In this case, any function u(X, Y) may be written as

u(X, Y) = ∑_i 1_{X=i} u(i, Y).

Thus if we let h(x, y) := E[Z|(X = x, Y = y)], i.e. h(X, Y) = E[Z|(X, Y)] a.s., then on one hand,

E[h(X, Y) u(X, Y)] = ∑_i E[h(X, Y) 1_{X=i} u(i, Y)] = ∑_i E[h(i, Y) u(i, Y) 1_{X=i}],

while on the other,

E[h(X, Y) u(X, Y)] = E[Z u(X, Y)] = ∑_i E[Z 1_{X=i} u(i, Y)].

Taking u(i, Y) = δ_{i,j} v(Y) and comparing the resulting expressions shows

E[h(j, Y) v(Y) 1_{X=j}] = E[Z 1_{X=j} v(Y)] for all v,

and therefore that

E[Z 1_{X=j}|Y] = E[h(j, Y) 1_{X=j}|Y] = h(j, Y) · E[1_{X=j}|Y].

Summing this equation on j then shows

E[Z|Y] = ∑_j E[Z 1_{X=j}|Y] = ∑_j h(j, Y) · E[1_{X=j}|Y],

which reads

E[Z|Y = y] = ∑_j E[Z|X = j, Y = y] · E[1_{X=j}|Y = y]
= ∑_j E[Z|X = j, Y = y] · P(X = j|Y = y)
= ∑_j E[Z|A_j, Y = y] · P(A_j|Y = y) (μ_Y – a.s.),

where μ_Y is the law of Y, i.e. μ_Y(A) := P(Y ∈ A).
Example 3.26. Suppose that {T_k}_{k=1}^n are independent random times such that P(T_k > t) = e^{−λ_k t} for all t ≥ 0, for some λ_k > 0. Let {T̃_k}_{k=1}^n be the order statistics of the sequence, i.e. {T̃_k}_{k=1}^n is the sequence {T_k}_{k=1}^n in increasing order, so that T̃_1 < T̃_2 < ··· < T̃_n. Further let K = i on {T̃_1 = T_i}. Then

E[f(T̃_2 − T̃_1)|T̃_1 = t] = ∑_i E[f(T̃_2 − T̃_1)|T̃_1 = t, K = i] P(K = i|T̃_1 = t),

where

{T̃_1 = t, K = i} = {t = T_i < T_j for j ≠ i},

and therefore

E[f(T̃_2 − T̃_1)|T̃_1 = t, K = i] = E[f(T̃_2 − t)|T̃_1 = t, K = i]
= E[f(min_{j≠i} T_j − t) | t = T_i < min_{j≠i} T_j]
= E[f(min_{j≠i} T_j − t) | t < min_{j≠i} T_j]
= E[f(min_{j≠i} T_j)],

wherein we have used that T_i is independent of S := min_{j≠i} T_j =ᵈ E(λ − λ_i), where λ = λ_1 + ··· + λ_n. Since S is an exponential random variable, P(S > t + s|S > t) = P(S > s), i.e. S − t under P(·|S > t) has the same distribution as S under P. Thus we have shown

E[f(T̃_2 − T̃_1)|T̃_1 = t] = ∑_i E[f(min_{j≠i} T_j)] P(K = i|T̃_1 = t).

We now compute informally:

P(K = i|T̃_1 = t) = P(t = T_i < T_j for j ≠ i)/P(t = T̃_1)
= e^{−(λ−λ_i)t} · P(T_i ∈ dt)/P(T̃_1 ∈ dt)
= e^{−(λ−λ_i)t} · λ_i e^{−λ_i t} dt/(λ e^{−λt} dt) = λ_i/λ.

Here is the above computation done more rigorously:

P(K = i | t < T̃_1 ≤ t + ε) = P(T_i < T_j for j ≠ i, t < T_i ≤ t + ε)/P(t < T̃_1 ≤ t + ε)
= ∫_t^{t+ε} P(τ < T_j for j ≠ i) λ_i e^{−λ_i τ} dτ / ∫_t^{t+ε} λ e^{−λτ} dτ
→ P(t < T_j for j ≠ i) λ_i e^{−λ_i t}/(λ e^{−λt}) as ε ↓ 0
= e^{−(λ−λ_i)t} λ_i e^{−λ_i t}/(λ e^{−λt}) = λ_i/λ.

In summary we have shown

E[f(T̃_2 − T̃_1)|T̃_1 = t] = ∑_i E[f(min_{j≠i} T_j)] · λ_i/λ.
3.4 Summary on Conditional Expectation Properties
Let Y and X be random variables such that EY² < ∞ and let h be a function from the range of X to R. Then the following are equivalent:

1. h(X) = E(Y|X), i.e. h(X) is the conditional expectation of Y given X.
2. E(Y − h(X))² ≤ E(Y − g(X))² for all functions g, i.e. h(X) is the best approximation to Y among functions of X.
3. E(Y · g(X)) = E(h(X) · g(X)) for all functions g, i.e. Y − h(X) is orthogonal to all functions of X. Moreover, this condition uniquely determines h(X).
The methods for computing E(Y |X) are given in the next two propositions.
Proposition 3.27 (Discrete Case). Suppose that Y and X are discrete random variables and p(y, x) := P(Y = y, X = x). Then E(Y|X) = h(X), where

h(x) = E(Y|X = x) = E(Y : X = x)/P(X = x) = (1/p_X(x)) ∑_y y p(y, x) (3.16)

and p_X(x) = P(X = x) is the marginal distribution of X, which may be computed as p_X(x) = ∑_y p(y, x).
Proposition 3.28 (Continuous Case). Suppose that Y and X are random variables which have a joint probability density ρ(y, x) (i.e. P(Y ∈ dy, X ∈ dx) = ρ(y, x) dy dx). Then E(Y|X) = h(X), where

h(x) = E(Y|X = x) := (1/ρ_X(x)) ∫_{−∞}^∞ y ρ(y, x) dy (3.17)

and ρ_X(x) is the marginal density of X, which may be computed as

ρ_X(x) = ∫_{−∞}^∞ ρ(y, x) dy.
Intuitively, in all cases, E(Y|X) on the set {X = x} is E(Y|X = x). This intuition should help motivate some of the basic properties of E(Y|X) summarized in the next theorem.
Theorem 3.29. Let Y, Y1, Y2 and X be random variables. Then:
1. E(Y_1 + Y_2|X) = E(Y_1|X) + E(Y_2|X).
2. E(aY|X) = aE(Y|X) for all constants a.
3. E(f(X)Y|X) = f(X)E(Y|X) for all functions f.
4. E(E(Y|X)) = EY.
5. If Y and X are independent then E(Y|X) = EY.
6. If Y ≥ 0 then E(Y|X) ≥ 0.
Remark 3.30. Property 4 in Theorem 3.29 turns out to be a very powerfulmethod for computing expectations. I will finish this summary by writing outProperty 4 in the discrete and continuous cases:
    EY = ∑_x E(Y|X = x) p_X(x)    (Discrete Case)

where

    E(Y|X = x) = { E(Y 1_{X=x})/P(X = x)   if P(X = x) > 0
                 { 0                        otherwise,

and

    E[U(Y, X)] = ∫ E(U(Y, X)|X = x) ρ_X(x) dx    (Continuous Case)

where

    E[U(Y, x)|X = x] := ∫ U(y, x) ρ_{Y|X}(y, x) dy

and

    ρ_{Y|X}(y, x) = { ρ_{Y,X}(y, x)/ρ_X(x)   if ρ_X(x) > 0
                    { 0                       if ρ_X(x) = 0.
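To make the discrete formulas concrete, here is a small Python sketch of Eq. (3.16) and of Property 4; the joint pmf below is an invented example, not one from the notes.

```python
# Sketch: discrete conditional expectation h(x) = E(Y | X = x) and the
# tower property EY = sum_x E(Y | X = x) p_X(x), for a made-up joint pmf.

# Joint pmf of (Y, X): Y takes values in {0, 1, 2}, X in {0, 1}.
p = {(0, 0): 0.10, (1, 0): 0.20, (2, 0): 0.10,
     (0, 1): 0.15, (1, 1): 0.25, (2, 1): 0.20}

def p_X(x):
    """Marginal pmf of X: p_X(x) = sum_y p(y, x)."""
    return sum(pr for (y, xx), pr in p.items() if xx == x)

def h(x):
    """h(x) = E(Y | X = x) = (1/p_X(x)) * sum_y y * p(y, x)  (Eq. 3.16)."""
    return sum(y * pr for (y, xx), pr in p.items() if xx == x) / p_X(x)

# Property 4 (tower property): both computations of EY must agree.
EY_direct = sum(y * pr for (y, x), pr in p.items())
EY_tower = sum(h(x) * p_X(x) for x in (0, 1))
```

The equality of `EY_direct` and `EY_tower` is exactly Property 4 written out in the discrete case.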
4
Filtrations and stopping times
Notation 4.1 Let N := {1, 2, 3, . . .}, N_0 := N ∪ {0}, and N̄ := N_0 ∪ {∞} = N ∪ {0, ∞}.
In this chapter, let Ω be a sample space and let S be a set called the state space. Further let X := {X_n}_{n=0}^∞ be a sequence of random variables taking values in S, which we will refer to generically as a stochastic process.
4.1 Filtrations
Definition 4.2 (Filtration). The filtration associated to X is {F_n^X := σ(X_0, . . . , X_n)}_{n=0}^∞. We further let F_∞^X = σ(X_0, . . . , X_n, . . . ).
Notice that g : Ω → S is F_n^X – measurable iff g = G(X_0, . . . , X_n) for some G : S^{n+1} → S. Also notice that F_m^X ⊂ F_n^X for all 0 ≤ m ≤ n ≤ ∞.
Definition 4.3 (Martingales). Let {M_n}_{n=0}^∞ be a sequence of complex or real valued integrable random variables on a probability space (Ω, P). We say {M_n}_{n=0}^∞ is a martingale if

    E[M_{n+1} | (M_0, . . . , M_n)] = M_n for n = 0, 1, 2, . . . .

In other words,

    E[M_{n+1} − M_n | (M_0, . . . , M_n)] = 0,

which states that the expected future increment given the past is zero.
Example 4.4. If {X_n}_{n=0}^∞ are independent random variables with E X_n = 0, then M_n = X_0 + · · · + X_n is a martingale. Indeed,

    E[M_{n+1} − M_n | (M_0, . . . , M_n)] = E[M_{n+1} − M_n | (X_0, . . . , X_n)]
                                         = E[X_{n+1} | (X_0, . . . , X_n)] = E X_{n+1} = 0.
We will study martingales in more detail later.
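A quick simulation sketch of Example 4.4 (the chain of i.i.d. ±1 steps is an illustrative choice): the increment M_{n+1} − M_n should be uncorrelated with any function of the past, and we test this empirically with g(M_0, . . . , M_n) = M_n.

```python
# Sketch: empirically checking the martingale property of partial sums
# M_n = X_0 + ... + X_n of i.i.d. mean-zero steps, by estimating
# E[(M_{n+1} - M_n) * M_n], which should be ~ 0.
import random

random.seed(0)

def sample_path(n_steps):
    """Return (M_0, ..., M_{n_steps}) for i.i.d. +/-1 increments."""
    m, path = 0, []
    for _ in range(n_steps + 1):
        m += random.choice([-1, 1])
        path.append(m)
    return path

n = 5
trials = 100_000
acc = 0.0
for _ in range(trials):
    path = sample_path(n + 1)
    acc += (path[n + 1] - path[n]) * path[n]  # increment times past value
estimate = acc / trials  # Monte Carlo estimate of the correlation
```

The estimate hovers near zero (up to Monte Carlo noise), which is the orthogonality statement behind the martingale property.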
Definition 4.5 (Markov Property). Let (Ω, P) be a probability space and X := {X_n : Ω → S}_{n=0}^∞ be a stochastic process. We say that (Ω, X, P) has the Markov property if

    E[f(X_{n+1}) | (X_0, . . . , X_n)] = E[f(X_{n+1}) | X_n]

for all n ≥ 0 and all bounded or non-negative functions f : S → R.
4.2 Stopping Times
Definition 4.6 (Stopping time). A function τ : Ω → N̄ := N_0 ∪ {∞} is an {F_n} – stopping time iff

    {τ = n} ∈ F_n for all n ∈ N_0.

[In words, we should be able to determine whether we stop at time n from the information about {X_k}_{k=0}^∞ that we have up to time n.]
Example 4.7. If τ : Ω → N̄ is constant, τ(ω) = k for some k ∈ N_0, then τ is a stopping time.
Example 4.8 (First Hitting times). Let A ⊂ S be a set and let

    H_A := inf{n ≥ 0 : X_n ∈ A}

where inf ∅ := ∞. We call H_A the first hitting time of A. Since

    {H_A = n} = {X_0 ∉ A, . . . , X_{n−1} ∉ A, X_n ∈ A}
              = {X_0 ∈ A^c, . . . , X_{n−1} ∈ A^c, X_n ∈ A}
              = (X_0, X_1, . . . , X_n)^{−1}(A^c × · · · × A^c × A) ∈ F_n^X,

we see that H_A is a stopping time.
Example 4.9 (First Hitting time after 0). Let A ⊂ S be a set and let

    T_A := inf{n > 0 : X_n ∈ A}

where inf ∅ := ∞. Since {T_A = 0} = ∅ ∈ F_0 and for n ≥ 1 we have

    {T_A = n} = {X_1 ∈ A^c, . . . , X_{n−1} ∈ A^c, X_n ∈ A}
              = (X_0, X_1, . . . , X_n)^{−1}(S × A^c × · · · × A^c × A) ∈ F_n^X,

it follows that T_A is a stopping time.
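The definition of the first hitting time translates directly into a one-pass computation on an observed path; the sample path below is made up for illustration.

```python
# Sketch: first hitting time H_A = inf{n >= 0 : X_n in A} along a realized
# path, with inf of the empty set read as infinity.
import math

def first_hitting_time(path, A):
    """Return H_A for the observed path (a list of states)."""
    for n, x in enumerate(path):
        if x in A:
            return n
    return math.inf  # the path never enters A

path = [5, 3, 7, 2, 7]  # a hypothetical observed trajectory
```

Note that deciding whether H_A = n only requires inspecting `path[:n+1]`, which is exactly the stopping-time property.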
Exercise 4.1. Let τ : Ω → N̄ be a function. Verify that the following are equivalent:
1. τ is a stopping time.
2. {τ ≤ n} ∈ F_n^X for all n ∈ N_0.
3. {τ > n} ∈ F_n^X for all n ∈ N_0.

Also show that if τ is a stopping time then {τ = ∞} ∈ F_∞^X.
Exercise 4.2. If τ and σ are two stopping times, show that σ ∧ τ = min{σ, τ}, σ ∨ τ = max{σ, τ}, and σ + τ are all stopping times.
Exercise 4.3 (Hitting time after a stopping time). Let σ be any stopping time. Show that

    τ_1 = inf{n ≥ σ : X_n ∈ B} and τ_2 = inf{n > σ : X_n ∈ B}

are both stopping times.
Definition 4.10. Let τ be a stopping time. A function F : Ω → W is said to be F_τ^X – measurable if

    F = “∑_{n∈N̄} 1_{τ=n} F_n”, i.e. F = F_n on {τ = n} for all n ∈ N̄,

where F_n : Ω → W is F_n^X – measurable for all n ∈ N̄. In more detail, we are assuming that for each n ∈ N̄ there exists f_n : S^{n+1} → W such that F_n = f_n(X_0, . . . , X_n) and F = F_n on {τ = n}. We also say that A ⊂ Ω is in F_τ^X iff A ∩ {τ = n} ∈ F_n^X for all n ∈ N̄.
Remark 4.11. A set A ⊂ Ω is in F_τ^X iff 1_A is F_τ^X – measurable. Indeed, if A ∈ F_τ^X then there exists f_n : S^{n+1} → {0, 1} such that 1_{A∩{τ=n}} = f_n(X_0, . . . , X_n). Conversely, if 1_A is F_τ^X – measurable, then there exists f_n : S^{n+1} → R such that

    1_A = ∑_{n∈N̄} 1_{τ=n} f_n(X_0, . . . , X_n)

and so 1_{A∩{τ=n}} = f_n(X_0, . . . , X_n). Thus it follows that

    A ∩ {τ = n} = {f_n(X_0, . . . , X_n) = 1} = (X_0, . . . , X_n)^{−1}(f_n^{−1}({1})) ∈ F_n^X,

which implies A ∩ {τ = n} ∈ F_n^X for all n ∈ N_0 and hence A ∈ F_τ^X.
Example 4.12. If τ is a stopping time and f : S → R is a function, then F := 1_{τ<∞} f(X_τ) is F_τ – measurable. Indeed, for n ∈ N_0 we have

    F 1_{τ=n} = f(X_n) 1_{τ=n}.
Example 4.13. If f ≥ 0, then

    F := ∑_{k≤τ} f(X_k)

is F_τ – measurable. Indeed, for n ∈ N̄ we have

    F 1_{τ=n} = 1_{τ=n} ∑_{k≤n} f(X_k),

which is of the desired form.
Theorem 4.14. Suppose now that P is a probability on Ω. If Z ∈ L^1(P) and τ is a stopping time, then

    E[Z | F_τ] = ∑_{n∈N̄} 1_{τ=n} E[Z | F_n].    (4.1)

Proof. Let F be a bounded F_τ – measurable function, i.e. F = ∑_{n∈N̄} 1_{τ=n} F_n where F_n = f_n(X_0, . . . , X_n) are F_n – measurable functions as in Definition 4.10. Since F 1_{τ=n} = F_n 1_{τ=n} is F_n – measurable for all n, we find

    E[ZF] = ∑_{n∈N̄} E[Z F_n 1_{τ=n}] = ∑_{n∈N̄} E[E[Z|F_n] F_n 1_{τ=n}]
          = ∑_{n∈N̄} E[E[Z|F_n] F 1_{τ=n}] = E[(∑_{n∈N̄} E[Z|F_n] 1_{τ=n}) F].

As this is true for all bounded F_τ – measurable functions F, we conclude that Eq. (4.1) holds.
Part II
Discrete Time & Space Markov Processes
5
Markov Chain Basics
In deterministic modeling one often has a dynamical system on a state space S. The dynamical system typically takes one of two forms:

1. There exists f : S → S and a state x_n which evolves according to the rule x_{n+1} = f(x_n). [More generally one might allow x_{n+1} = f_n(x_0, . . . , x_n) where f_n : S^{n+1} → S is a given function for each n.]
2. There exists a vector field f on S (where now S = R^d or a manifold) such that ẋ(t) = f(x(t)). [More generally, we might allow for ẋ(t) = f(t; x|_{[0,t]}), a functional differential equation.]
Much of our time in this course will be spent exploring the above two situations where some extra randomness is added at each stage of the game. Namely:
1. We may now have that X_{n+1} ∈ S is random and evolves according to

       X_{n+1} = f(X_n, ξ_n)

   where {ξ_n}_{n=0}^∞ is a sequence of i.i.d. random variables. Alternatively put, we might simply let f_n := f(·, ξ_n) so that f_n : S → S is a sequence of i.i.d. random functions from S to S. Then {X_n}_{n=0}^∞ is defined recursively by

       X_{n+1} = f_n(X_n) for n = 0, 1, 2, . . . .    (5.1)

   This is the typical example of a time-homogeneous Markov chain. We assume that X_0 ∈ S is given as an initial condition which is either deterministic or independent of {f_n}_{n=0}^∞.
2. Later in the course we will study the continuous time analogue,

       “Ẋ_t = f_t(X_t)”

   where {f_t}_{t≥0} are again i.i.d. random vector fields. The continuous time case will require substantially more technical care.
5.1 Markov Chain Descriptions
Notation 5.1 Given a random function f : S → S we may describe its “statistics” or distribution in two ways. The first is to assign a matrix to f while the second is to assign a weighted graph to f.

1. (Matrix assignment.) The first way is to let

       p(x, y) := P(f(x) = y) = P({ω ∈ Ω : f_ω(x) = y}) for all x, y ∈ S.

   Here we must have that p(x, y) ∈ [0, 1] and ∑_{y∈S} p(x, y) = 1 for all x ∈ S.
2. (Jump diagram.) Given p(x, y) as above, let ⟨x, y⟩ be an edge of a graph over S iff p(x, y) > 0 and weight this edge by p(x, y).
The function p : S × S → [0, 1] is called the one step transition probability associated to the Markov chain {X_n}_{n=0}^∞. Notice that:

1. ∑_y p(x, y) = ∑_{y∈S} P(y = f_n(x)) = 1, and
2. as we will see in Theorem 5.8 below, the law of {X_n}_{n=0}^∞ is completely determined by the one step transition probability p and the initial distribution π(x) := P(X_0 = x).
Example 5.2. The transition matrix

              1    2    3
        1 [ 1/4  1/2  1/4 ]
    P = 2 [ 1/2   0   1/2 ]
        3 [ 1/3  1/3  1/3 ]

is represented by the jump diagram in Figure 5.1.
Example 5.3. The jump diagram for

              1    2    3
        1 [ 1/4  1/2  1/4 ]
    P = 2 [ 1/2   0   1/2 ]
        3 [ 1/3  1/3  1/3 ]

is shown in Figure 5.2.
Example 5.4. Suppose that S = {1, 2, 3}; then

              1    2    3
        1 [  0    1    0  ]
    P = 2 [ 1/2   0   1/2 ]
        3 [  1    0    0  ]

has the jump graph given in Figure 5.3.
Fig. 5.1. A simple 3 state jump diagram. We typically abbreviate the jump diagram on the left by the one on the right. That is, we infer by conservation of probability that there has to be probability 1/4 of staying at 1, 1/3 of staying at 3, and 0 probability of staying at 2.
Fig. 5.2. In the above diagram there are jumps from 1 to 1 with probability 1/4 and jumps from 3 to 3 with probability 1/3 which are not explicitly shown but must be inferred by conservation of probability.
Fig. 5.3. A simple 3 state jump diagram.
Example 5.5 (Random Walks on Graphs). Let S be a set and G be a graph on S. We then take {f_n}_{n=0}^∞ i.i.d. such that

    P(f_n(x) = y) = { 0       if {x, y} ∉ G
                    { 1/d(x)  if {x, y} ∈ G

where d(x) := #{y ∈ S : {x, y} ∈ G}. We can give a similar definition for directed graphs, namely

    P(f_n(x) = y) = { 0       if ⟨x → y⟩ ∉ G
                    { 1/d(x)  if ⟨x → y⟩ ∈ G

where now d(x) := #{y ∈ S : ⟨x → y⟩ ∈ G}. A directed graph on S is a subset G ⊂ S^2 \ Δ where Δ := {(s, s) : s ∈ S}. We say that G is undirected if (s, t) ∈ G implies (t, s) ∈ G. As we have seen, every Markov chain is really determined by a weighted random walk on a graph.
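The random-walk-on-a-graph recipe can be carried out mechanically; the small triangle-with-a-tail graph below is an illustrative choice, not one from the notes.

```python
# Sketch: transition probabilities p(x, y) = 1/d(x) if {x, y} is an edge
# of an undirected graph, and 0 otherwise, for a hypothetical small graph.
edges = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (3, 4)]}
nodes = [1, 2, 3, 4]

def degree(x):
    """d(x) = number of edges containing x."""
    return sum(1 for e in edges if x in e)

p = {(x, y): (1.0 / degree(x) if frozenset((x, y)) in edges else 0.0)
     for x in nodes for y in nodes if x != y}

# Each row of p must sum to 1 (conservation of probability).
row_sums = {x: sum(p[(x, y)] for y in nodes if y != x) for x in nodes}
```

For instance node 3 has degree 3, so p(3, 4) = 1/3, while the leaf node 4 jumps to 3 with probability 1.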
Example 5.6. Suppose we flip a fair coin repeatedly and would like to find the first time the pattern HHT appears. To do this we will later examine the Markov chain Y_n = (X_n, X_{n+1}, X_{n+2}), where {X_n}_{n=0}^∞ is the sequence of unbiased independent coin flips with values in {H, T}. The state space for Y_n is

    S = {TTT, THT, TTH, THH, HHH, HTT, HTH, HHT}.
The transition matrix for recording three flips in a row of a fair coin is
    P = (1/2) ·
              TTT THT TTH THH HHH HTT HTH HHT
        TTT [  1   0   1   0   0   0   0   0 ]
        THT [  0   0   0   0   0   1   1   0 ]
        TTH [  0   1   0   1   0   0   0   0 ]
        THH [  0   0   0   0   1   0   0   1 ]
        HHH [  0   0   0   0   1   0   0   1 ]
        HTT [  1   0   1   0   0   0   0   0 ]
        HTH [  0   1   0   1   0   0   0   0 ]
        HHT [  0   0   0   0   0   1   1   0 ].
Example 5.7 (Ehrenfest Urn Model). Let a beaker filled with a particle-fluid mixture be divided into two parts A and B by a semipermeable membrane. Let X_n = (# of particles in A), which we assume evolves by choosing a particle at random from A ∪ B and then replacing this particle in the opposite bin from which it was found. Modeling X_n as a Markov process, with N the total number of particles, we find

    P(X_{n+1} = j | X_n = i) = { 0           if j ∉ {i − 1, i + 1}
                               { i/N         if j = i − 1
                               { (N − i)/N   if j = i + 1
                             =: q(i, j).

As these probabilities do not depend on n, X_n is a time homogeneous Markov chain.
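A sketch of the Ehrenfest kernel q(i, j), with N = 4 particles chosen only for illustration; the row sums confirm that q is a genuine transition matrix on {0, 1, . . . , N}.

```python
# Sketch: the Ehrenfest urn transition probabilities q(i, j) for N particles.
N = 4  # illustrative total number of particles

def q(i, j):
    """q(i, j) = P(X_{n+1} = j | X_n = i) for the Ehrenfest urn."""
    if j == i - 1:
        return i / N
    if j == i + 1:
        return (N - i) / N
    return 0.0

# Every row should sum to 1: i/N + (N - i)/N = 1.
rows_ok = all(abs(sum(q(i, j) for j in range(N + 1)) - 1.0) < 1e-12
              for i in range(N + 1))
```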
Exercise 5.1. Consider a rat in a maze consisting of 7 rooms which is laid out as in the following figure:

    1 2 3
    4 5 6
    7

In this figure rooms are connected by either vertical or horizontal adjacent passages only, so that 1 is connected to 2 and 4 but not to 5, and 7 is only connected to 4. At each time t ∈ N_0 the rat moves from her current room to one of the adjacent rooms with equal probability (the rat always changes rooms at each time step). Find the one step 7 × 7 transition matrix q with entries given by q(i, j) := P(X_{n+1} = j | X_n = i), where X_n denotes the room the rat is in at time n.
5.2 Joint Distributions of an MC
Theorem 5.8. Suppose that {f_n}_{n=0}^∞ are i.i.d. random functions¹ from S to S, X_0 ∈ S is a random variable independent of the {f_n}_{n=0}^∞, and {X_n}_{n=1}^∞ are generated as in Eq. (5.1). Further let

¹ Actually, the functions f_n do not need to be identically distributed; we really need only require that P(y = f_n(x)) is independent of n for all n. The correlations between events like {y = f_n(x)} and {y′ = f_n(x′)} are irrelevant.
    p(x, y) := P(y = f_n(x)) = P(y = f_1(x))

and

    π(x) := P(X_0 = x).

Then

    P_π(X_0 = x_0, X_1 = x_1, . . . , X_n = x_n) = π(x_0) p(x_0, x_1) p(x_1, x_2) · · · p(x_{n−1}, x_n)    (5.2)

for all x_0, . . . , x_n ∈ S. In particular if G : S^{n+1} → R is a function, then

    E_π[G(X_0, . . . , X_n)] = ∑_{x_0,...,x_n∈S} G(x_0, . . . , x_n) π(x_0) p(x_0, x_1) p(x_1, x_2) · · · p(x_{n−1}, x_n).    (5.3)
Proof. This is straightforward to verify using

    {X_0 = x_0, X_1 = x_1, . . . , X_n = x_n} = {X_0 = x_0} ∩ (∩_{k=0}^{n−1} {x_{k+1} = f_k(x_k)})

and hence

    P_π(X_0 = x_0, X_1 = x_1, . . . , X_n = x_n) = P(X_0 = x_0) · ∏_{k=0}^{n−1} P(x_{k+1} = f_k(x_k))
                                                = π(x_0) · ∏_{k=0}^{n−1} p(x_k, x_{k+1}).
Notation 5.9 We denote the law of this stochastic process {X_n}_{n=0}^∞ with π(x) := P(X_0 = x) by P_π.
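Eq. (5.2) can be checked by Monte Carlo on a toy chain; the two-state kernel, initial distribution, and target path below are invented for illustration.

```python
# Sketch: Monte Carlo check of Eq. (5.2),
# P(X_0 = x_0, ..., X_n = x_n) = pi(x_0) p(x_0, x_1) ... p(x_{n-1}, x_n),
# on a hypothetical two-state chain.
import random

random.seed(1)
p = {0: [0.7, 0.3], 1: [0.4, 0.6]}  # p[x][y] = p(x, y)
pi = [0.5, 0.5]                      # initial distribution

def step(x):
    return 0 if random.random() < p[x][0] else 1

target = (0, 1, 0)
exact = pi[0] * p[0][1] * p[1][0]    # pi(0) p(0,1) p(1,0) = 0.5*0.3*0.4

trials = 200_000
hits = 0
for _ in range(trials):
    x0 = 0 if random.random() < pi[0] else 1
    x1 = step(x0)
    x2 = step(x1)
    if (x0, x1, x2) == target:
        hits += 1
mc = hits / trials                   # empirical path probability
```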
Corollary 5.10. Keeping the same notation as above,

    P(X_{n+1} = x_{n+1} | X_0 = x_0, . . . , X_n = x_n) = p(x_n, x_{n+1}) = P(X_{n+1} = x_{n+1} | X_n = x_n).    (5.4)

Proof. We have

    P(X_{n+1} = x_{n+1} : X_0 = x_0, . . . , X_n = x_n)
    = π(x_0) p(x_0, x_1) p(x_1, x_2) · · · p(x_{n−1}, x_n) p(x_n, x_{n+1})
    = P(X_0 = x_0, . . . , X_n = x_n) p(x_n, x_{n+1}),    (5.5)

from which the left equality in Eq. (5.4) immediately follows. Moreover, summing Eq. (5.5) on x_0, . . . , x_{n−1} shows

    P(X_{n+1} = x_{n+1} : X_n = x_n) = P(X_n = x_n) p(x_n, x_{n+1}),

from which the right equality in Eq. (5.4) immediately follows.
Corollary 5.11. For all n ∈ N, P_π(X_n = y) = ∑_{x∈S} π(x) p_n(x, y), where p_0(x, y) = δ_x(y), p_1 = p, and {p_n}_{n≥1} is defined inductively by

    p_n(x, y) := ∑_{z∈S} p(x, z) p_{n−1}(z, y).

Proof. From Eq. (5.3) with G(x_0, . . . , x_n) = δ_y(x_n) we learn that

    P_π(X_n = y) = ∑_{x_0,...,x_n∈S} δ_y(x_n) π(x_0) p(x_0, x_1) p(x_1, x_2) · · · p(x_{n−1}, x_n)
                 = ∑_{x∈S} π(x) p_n(x, y).
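In matrix form, Corollary 5.11 says the n-step transition probabilities are the entries of the matrix power P^n, so the law of X_n is the row vector π P^n. A numpy sketch using the chain of Example 5.2:

```python
# Sketch: n-step transition probabilities as matrix powers,
# p_n(x, y) = [P^n]_{x,y}, so that P_pi(X_n = y) = [pi P^n]_y.
import numpy as np

P = np.array([[1/4, 1/2, 1/4],
              [1/2, 0.0, 1/2],
              [1/3, 1/3, 1/3]])   # the 3-state chain of Example 5.2
pi0 = np.array([1.0, 0.0, 0.0])   # start deterministically at state 1

# The inductive definition p_n = p * p_{n-1} is just repeated matrix product.
P3_inductive = P @ (P @ P)
P3_power = np.linalg.matrix_power(P, 3)

dist3 = pi0 @ P3_power            # the law of X_3
```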
Definition 5.12. We say π : S → [0, 1] is an invariant distribution for the Markov chain determined by p : S × S → [0, 1] provided ∑_{y∈S} π(y) = 1 and

    π(y) = ∑_{x∈S} π(x) p(x, y) for all y ∈ S.

Alternatively put, π should be a distribution such that P_π(X_n = x) = π(x) for all n ∈ N_0 and x ∈ S. [The invariant distribution is found by solving the matrix equation πP = π under the restrictions that π(x) ≥ 0 and ∑_{x∈S} π(x) = 1.]
Example 5.13. If

    P = [ 1/4  1/2  1/4
          1/2   0   1/2
          1/3  1/3  1/3 ],

then P^tr has eigenvectors

    (−2/3, −1/3, 1) ↔ 0,  (1, 5/6, 1) ↔ 1,  (1, −2, 1) ↔ −5/12.

The invariant distribution is given by

    π = (1/(1 + 5/6 + 1)) [1  5/6  1] = [6/17  5/17  6/17] = [0.35294  0.29412  0.35294].
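The invariant distribution of this example can be recovered numerically as the eigenvector of P^tr for the eigenvalue 1 (a sketch assuming numpy is available):

```python
# Sketch: invariant distribution of Example 5.13 via the eigenvector of
# P^tr belonging to the eigenvalue 1, normalized to sum to 1.
import numpy as np

P = np.array([[1/4, 1/2, 1/4],
              [1/2, 0.0, 1/2],
              [1/3, 1/3, 1/3]])

vals, vecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(vals - 1.0))   # locate the eigenvalue 1
pi = np.real(vecs[:, k])
pi = pi / pi.sum()                  # normalize to a probability vector
```

The result agrees with the exact answer π = (6/17, 5/17, 6/17), and π P = π confirms invariance.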
Notice that

    P^100 = [ 1/4  1/2  1/4 ]^100   [ 0.35294  0.29412  0.35294 ]   [ π ]
            [ 1/2   0   1/2 ]     = [ 0.35294  0.29412  0.35294 ] ≅ [ π ]
            [ 1/3  1/3  1/3 ]       [ 0.35294  0.29412  0.35294 ]   [ π ].
Exercise 5.2 (2-step MC). Consider the following simple (i.e. no-brainer) two state “game” consisting of moving between two sites labeled 1 and 2. At each site you find a coin with sides labeled 1 and 2. The probability of flipping a 2 at site 1 is a ∈ (0, 1) and of flipping a 1 at site 2 is b ∈ (0, 1). If you are at site i at time n, then you flip the coin at this site and move or stay at the current site as indicated by the coin toss. We summarize this scheme by the “jump diagram” of Figure 5.4.

Fig. 5.4. The generic jump diagram for a two state Markov chain.

It is reasonable to suppose that your location, X_n, at time n is modeled by a Markov process with state space S = {1, 2}. Explain (briefly) why this is a time homogeneous chain and find the one step transition probabilities,
p (i, j) = P (Xn+1 = j|Xn = i) for i, j ∈ S.
Use your result and basic linear (matrix) algebra to compute lim_{n→∞} P(X_n = 1). Your answer should be independent of the possible starting distributions ν = (ν_1, ν_2) for X_0, where ν_i := P(X_0 = i).
Solution to Exercise 5.2. Writing P as a matrix with the entry in the i-th row and j-th column being p(i, j), we have

    P = [ 1 − a    a
            b    1 − b ].

If P(X_0 = i) = ν_i for i = 1, 2, then

    P(X_n = 1) = ∑_{k=1}^2 ν_k P^n_{k,1} = [νP^n]_1,

where we now write ν = (ν_1, ν_2) as a row vector. A simple computation shows that
    det(P^tr − λI) = det(P − λI) = λ^2 + (a + b − 2)λ + (1 − b − a) = (λ − 1)(λ − (1 − a − b)).
Note that

    P [1; 1] = [1; 1]

since ∑_j p(i, j) = 1 – this is a general fact. Thus we always know that λ_1 = 1 is an eigenvalue of P. The second eigenvalue is λ_2 = 1 − a − b. We now find the eigenvectors of P^tr:
    Nul(P^tr − λ_1 I) = Nul([−a, b; a, −b]) = R · [b; a],

while

    Nul(P^tr − λ_2 I) = Nul([b, b; a, a]) = R · [1; −1].
In fact by a direct check we have

    [b  a] P = [b  a] and [1  −1] P = (1 − b − a) [1  −1].

Thus we may write

    ν = α(b, a) + β(1, −1)

where

    1 = ν · (1, 1) = α(b, a) · (1, 1) = α(a + b).

Thus α = 1/(a + b) and β = ν_1 − b/(a + b) = −(ν_2 − a/(a + b)), so we have

    ν = (1/(a + b))(b, a) + β(1, −1)

and therefore

    νP^n = (1/(a + b))(b, a) P^n + β(1, −1) P^n = (1/(a + b))(b, a) + β(1, −1) λ_2^n.
By our assumptions on a, b ∈ (0, 1) it follows that |λ_2| < 1 and therefore

    lim_{n→∞} νP^n = (1/(a + b))(b, a),

and we have shown

    lim_{n→∞} P(X_n = 1) = b/(a + b) and lim_{n→∞} P(X_n = 2) = a/(a + b),

independent of the starting distribution ν. Also observe that the convergence is exponentially fast. Notice that

    π := (1/(a + b))(b, a)

is the invariant distribution of this chain.
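A numerical sketch of the limit just derived, with a = 0.3 and b = 0.2 chosen arbitrarily and an arbitrary starting distribution ν:

```python
# Sketch: checking lim_n P(X_n = 1) = b/(a+b) for the two-state chain,
# with illustrative parameter values a = 0.3, b = 0.2.
import numpy as np

a, b = 0.3, 0.2
P = np.array([[1 - a, a],
              [b, 1 - b]])
nu = np.array([0.9, 0.1])            # an arbitrary starting distribution

dist = nu @ np.linalg.matrix_power(P, 200)  # law of X_200
limit = np.array([b, a]) / (a + b)          # predicted pi = (b, a)/(a+b)
```

Since |λ_2| = |1 − a − b| = 0.5 here, the error after 200 steps is of size 0.5^200, i.e. numerically zero.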
5.3 More Markov Conditioning
We assume that {X_n}_{n=0}^∞ is a Markov chain with values in S and transition kernel P and that π : S → [0, 1] is a probability on S. As usual we write P_π for the unique probability satisfying Eq. (5.2) and we will often write p(x, y) for P_{xy}.
Theorem 5.14 (Markov conditioning). Let X = {X_n}_{n=0}^∞ be a Markov chain with transition kernel p and for each x ∈ S let {Y_n}_{n=0}^∞ be another Markov chain with transition kernel p starting at Y_0 = x which is independent of X. Then relative to P(· | X_m = x) we have

    X = (X_0, X_1, . . . ) =_d (X_0, . . . , X_{m−1}, x, Y_1, Y_2, . . . ).

In more detail, we are asserting that

    E[F(X) | X_m = x] = E[F(X_0, . . . , X_{m−1}, x, Y_1, Y_2, . . . ) | X_m = x]

for all F(X) = F(X_0, X_1, . . . ), where F is either bounded or non-negative.
Proof. For this proof let us use the construction of X_n in Theorem 5.8, i.e. X_{n+1} = f_n(X_n) for n = 0, 1, 2, . . . . On the event {X_m = x},

    (X_0, . . . , X_{m−1}, X_m, X_{m+1}, . . . ) = (X_0, . . . , X_{m−1}, x, X_{m+1}, . . . )
                                                = (X_0, . . . , X_{m−1}, x, Y_1, Y_2, . . . ),

where Y_0 = x and Y_1 = f_m(x), Y_2 = f_{m+1}(Y_1), and in general Y_{n+1} = f_{m+n}(Y_n). Thus we see that {Y_n}_{n=0}^∞ is a copy of the Markov chain starting at Y_0 = x which is independent of X_0, . . . , X_{m−1}. Therefore

    E[F(X) : X_m = x] = E[F(X_0, . . . , X_{m−1}, x, Y_1, Y_2, . . . ) : X_m = x],

and so dividing by P(X_m = x) shows

    E[F(X) | X_m = x] = E[F(X_0, . . . , X_{m−1}, x, Y_1, Y_2, . . . ) | X_m = x],

which is to say,

    (X_0, X_1, . . . ) =_d (X_0, . . . , X_{m−1}, x, Y_1, Y_2, . . . ) given X_m = x.
Corollary 5.15. If m ∈ N we have

    E[F(X) | (X_0, . . . , X_m)] = E^{(Y)}_{X_m} F(X_0, . . . , X_{m−1}, Y_0, Y_1, Y_2, . . . ).

Put another way,

    E[F(X) | (X_0, . . . , X_m)] = h(X_0, . . . , X_m)

where

    h(x_0, . . . , x_m) := E_{x_m}[F(x_0, . . . , x_{m−1}, X_0, X_1, . . . )].

Proof. Let (x_0, . . . , x_m) ∈ S^{m+1} be such that

    q := P((X_0, . . . , X_m) = (x_0, . . . , x_m)) > 0.

Then letting Y_0 = x_m and Y_1 = f_m(Y_0), Y_2 = f_{m+1}(Y_1), and in general Y_{n+1} = f_{m+n}(Y_n) as in the proof above, we find

    q · h(x_0, . . . , x_m) = q · E[F(X) | (X_0, . . . , X_m) = (x_0, . . . , x_m)]
    = E[F(X) : (X_0, . . . , X_m) = (x_0, . . . , x_m)]
    = E[F(x_0, . . . , x_m, Y_1, Y_2, . . . ) : (X_0, . . . , X_m) = (x_0, . . . , x_m)]
    = E[F(x_0, . . . , x_m, Y_1, Y_2, . . . )] · P((X_0, . . . , X_m) = (x_0, . . . , x_m))
    = E_{x_m}[F(x_0, . . . , x_{m−1}, X_0, X_1, . . . )] · q,

from which the result follows.

The previous two results have a far reaching extension to the case where m is replaced by a stopping time τ.
Theorem 5.16 (Strong Markov Property). Let x ∈ S and {Y_n}_{n=0}^∞ be a Markov chain independent of X such that Y_0 = x and P(Y_n = y | Y_{n−1} = x) = p(x, y). Then given {τ < ∞, X_τ = x} we have

    (X_0, X_1, . . . ) =_d (X_0, X_1, . . . , X_{τ−1}, X_τ, Y_1, Y_2, . . . ),    (5.6)

i.e. relative to P(· | τ < ∞, X_τ = x). We also have

    E_ν[F(X_0, X_1, . . . ) : τ < ∞]
    = E_ν[E_{X_τ}[F(X_0, X_1, . . . , X_{τ−1}, X̄_0, X̄_1, X̄_2, . . . )] : τ < ∞]
    = E_ν[E_{X_τ}[F(X_0, X_1, . . . , X_{τ−1}, X_τ, X̄_1, X̄_2, . . . )] : τ < ∞].    (5.7)

In this last formula, E_x denotes the expectation of an independent Markov chain {X̄_n}_{n=0}^∞ starting at x ∈ S with transition probabilities p(x, y).
Proof. Using

    1_{τ=n} = f(X_0, X_1, . . . , X_n),

it follows from Theorem 5.14 that

    E_ν[F(X_0, X_1, . . . ) : τ = n, X_n = x]
    = E_ν[F(X_0, X_1, . . . , X_{n−1}, x, Y_1, Y_2, . . . ) : τ = n, X_n = x].

Summing this equation on n and using {τ = n, X_n = x} = {τ = n, X_τ = x}, we learn that

    E_ν[F(X_0, X_1, . . . ) : τ < ∞, X_τ = x]
    = ∑_{n=0}^∞ E_ν[F(X_0, X_1, . . . ) : τ = n, X_n = x]
    = ∑_{n=0}^∞ E_ν[F(X_0, X_1, . . . , X_{n−1}, x, Y_1, Y_2, . . . ) : τ = n, X_n = x]
    = ∑_{n=0}^∞ E_ν[F(X_0, X_1, . . . , X_{τ−1}, X_τ, Y_1, Y_2, . . . ) : τ = n, X_τ = x]
    = E_ν[F(X_0, X_1, . . . , X_{τ−1}, X_τ, Y_1, Y_2, . . . ) : τ < ∞, X_τ = x],    (5.8)

which suffices to prove Eq. (5.6). We may rewrite Eq. (5.8) as

    E_ν[F(X_0, X_1, . . . ) : τ < ∞, X_τ = x]
    = E_ν[E_x[F(X_0, X_1, . . . , X_{τ−1}, X̄_0, X̄_1, X̄_2, . . . )] : τ < ∞, X_τ = x]
    = E_ν[E_{X_τ}[F(X_0, X_1, . . . , X_{τ−1}, X̄_0, X̄_1, X̄_2, . . . )] : τ < ∞, X_τ = x].

Summing this last equation on x ∈ S then gives Eq. (5.7).
5.4 *Hitting Times Estimates
We assume that {X_n}_{n=0}^∞ is a Markov chain with values in S and transition kernel P. I will often write p(x, y) for P_{xy}. We are going to further assume that B ⊂ S is a non-empty proper subset of S and A = S \ B.
Definition 5.17 (Hitting times). Given a subset B ⊂ S we let H_B be the first time X_n hits B, i.e.

    H_B = min{n : X_n ∈ B}

with the convention that H_B = ∞ if {n : X_n ∈ B} = ∅. We call H_B the first hitting time of B by X = {X_n}_n.
Observe that

    {H_B = n} = {X_0 ∉ B, . . . , X_{n−1} ∉ B, X_n ∈ B} = {X_0 ∈ A, . . . , X_{n−1} ∈ A, X_n ∈ B}

and

    {H_B > n} = {X_0 ∈ A, . . . , X_{n−1} ∈ A, X_n ∈ A},

so that {H_B = n} and {H_B > n} depend only on (X_0, . . . , X_n). A random time T : Ω → N ∪ {0, ∞} with either of these properties is called a stopping time.
Lemma 5.18. For any random time T : Ω → N ∪ {0, ∞} we have

    P(T = ∞) = lim_{n→∞} P(T > n) and ET = ∑_{k=0}^∞ P(T > k).

Proof. The first equality is a consequence of the continuity of P and the fact that

    {T > n} ↓ {T = ∞}.

The second equality is proved as follows:

    ET = ∑_{m>0} m P(T = m) = ∑_{0<k≤m<∞} P(T = m) = ∑_{k=1}^∞ P(T ≥ k) = ∑_{k=0}^∞ P(T > k).
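The tail-sum formula of Lemma 5.18 is easy to sanity-check for a geometric random time T on {1, 2, . . .} with success probability q, for which P(T > k) = (1 − q)^k and ET = 1/q (an illustrative choice, not a case treated in the notes):

```python
# Sketch: checking E[T] = sum_{k>=0} P(T > k) for a geometric time T.
q = 0.25
# Partial sum of sum_{k>=0} (1-q)^k, which converges to 1/q.
tail_sum = sum((1 - q) ** k for k in range(10_000))
```

The truncated sum agrees with ET = 1/q = 4 to machine precision, since the neglected tail is (1 − q)^{10000}.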
Let us now use Theorem 5.16 to give variants of the proofs of our hitting time results above. In what follows π will denote a probability on S.
Corollary 5.19. Let B ⊂ S and H_B be as above; then for n, m ∈ N we have

    P_π(H_B > m + n) = E_π[1_{H_B>m} P_{X_m}[H_B > n]].    (5.9)

Proof. Using Theorem 5.16,

    P_π(H_B > m + n) = E_π[1_{H_B(X)>m+n}]
    = E_π[E^{(Y)}_{X_m}[1_{H_B(X_0,...,X_{m−1},Y_0,Y_1,...)>m+n}]]
    = E_π[E^{(Y)}_{X_m}[1_{H_B(X)>m} · 1_{H_B(Y)>n}]]
    = E_π[1_{H_B(X)>m} E^{(Y)}_{X_m}[1_{H_B(Y)>n}]]
    = E_π[1_{H_B>m} P_{X_m}[H_B > n]].
Corollary 5.20. Suppose that B ⊂ S is a non-empty proper subset of S and A = S \ B. Further suppose there is some α < 1 such that P_x(H_B = ∞) ≤ α for all x ∈ A; then P_π(H_B = ∞) = 0. [In words: if there is a “uniform” chance that X hits B starting from any site, then X will surely hit B from any point in A.]

Proof. Since H_B = 0 on {X_0 ∈ B}, we in fact have P_x(H_B = ∞) ≤ α for all x ∈ S. Letting n → ∞ in Eq. (5.9) shows

    P_π(H_B = ∞) = E_π[1_{H_B>m} P_{X_m}[H_B = ∞]] ≤ E_π[1_{H_B>m} α] = α P_π(H_B > m).

Now letting m → ∞ in this equation shows P_π(H_B = ∞) ≤ α P_π(H_B = ∞), from which it follows that P_π(H_B = ∞) = 0.
Corollary 5.21. Suppose that B ⊂ S is a non-empty proper subset of S and A = S \ B. Further suppose there are some α < 1 and n < ∞ such that P_x(H_B > n) ≤ α for all x ∈ A; then

    E_π(H_B) ≤ n/(1 − α) < ∞.

[In words: if there is a “uniform” chance that X hits B within a fixed number of steps starting from any site, then the expected hitting time of B is finite and bounded independent of the starting distribution.]

Proof. Again using H_B = 0 on {X_0 ∈ B} we may conclude that P_x(H_B > n) ≤ α for all x ∈ S. Letting m = kn in Eq. (5.9) shows

    P_π(H_B > kn + n) = E_π[1_{H_B>kn} P_{X_{kn}}[H_B > n]] ≤ E_π[1_{H_B>kn} · α] = α P_π(H_B > kn).

Iterating this equation using the fact that P_π(H_B > 0) ≤ 1 shows P_π(H_B > kn) ≤ α^k for all k ∈ N_0. Therefore with the aid of Lemma 5.18 and the observation

    P(H_B > kn + m) ≤ P(H_B > kn) for m = 0, . . . , n − 1,

we find

    E_π H_B = ∑_{k=0}^∞ P(H_B > k) ≤ ∑_{k=0}^∞ n P(H_B > kn) ≤ ∑_{k=0}^∞ n α^k = n/(1 − α) < ∞.
Corollary 5.22. If A = S \ B is a finite set and P_x(H_B = ∞) < 1 for all x ∈ A, then E_π H_B < ∞.
Proof. Since

    P_x(H_B > m) ↓ P_x(H_B = ∞) < 1 for all x ∈ A,

we can find M_x < ∞ such that P_x(H_B > M_x) < 1. Using the fact that A is a finite set we let n := max_{x∈A} M_x < ∞ and then take α := max_{x∈A} P_x(H_B > n) < 1. Corollary 5.21 now applies to complete the proof.
6
First Step Analysis
The next theorem (which is a special case of Theorem 5.14) is the basis of the first step analysis developed in this section.
Theorem 6.1 (First step analysis). Let F(X) = F(X_0, X_1, . . . ) be some function of the paths (X_0, X_1, . . . ) of our Markov chain. Then for all x, y ∈ S with p(x, y) > 0 we have

    E_x[F(X_0, X_1, . . . ) | X_1 = y] = E_y[F(x, X_0, X_1, . . . )]    (6.1)

and

    E_x[F(X_0, X_1, . . . )] = E_{p(x,·)}[F(x, X_0, X_1, . . . )] = ∑_{y∈S} p(x, y) E_y[F(x, X_0, X_1, . . . )].    (6.2)

Proof. Equation (6.1) follows directly from Theorem 5.14:

    E_x[F(X_0, X_1, . . . ) | X_1 = y] = E_x[F(X_0, X_1, . . . ) | X_0 = x, X_1 = y]
                                      = E_y[F(x, X_0, X_1, . . . )].

Equation (6.2) now follows from Eq. (6.1), the law of total expectation, and the fact that P_x(X_1 = y) = p(x, y).
Let us now suppose, until further notice, that B is a non-empty proper subset of S, A = S \ B, and T_B = T_B(X) is the first hitting time of B by X.
Notation 6.2 Given a transition matrix P = (p(x, y))_{x,y∈S} we let Q := (p(x, y))_{x,y∈A} and R := (p(x, y))_{x∈A,y∈B} so that, schematically,

                A  B
    P =   A [ Q  R ]
          B [ ∗  ∗ ].
Remark 6.3. To construct the matrices Q and R from P, let P′ be P with the rows corresponding to B omitted. To form Q from P′, remove the columns of P′ corresponding to B, and to form R from P′, remove the columns of P′ corresponding to A.
Example 6.4. If S = {1, 2, 3, 4, 5, 6, 7}, A = {1, 2, 4, 5, 6}, B = {3, 7}, and

              1    2    3    4    5    6    7
        1 [  0   1/2   0   1/2   0    0    0  ]
        2 [ 1/3   0   1/3   0   1/3   0    0  ]
        3 [  0   1/2   0    0    0   1/2   0  ]
    P = 4 [ 1/3   0    0    0   1/3   0   1/3 ]
        5 [  0   1/3   0   1/3   0   1/3   0  ]
        6 [  0    0   1/2   0   1/2   0    0  ]
        7 [  0    0    0    0    0    0    1  ],
then

               1    2    3    4    5    6    7
         1 [  0   1/2   0   1/2   0    0    0  ]
         2 [ 1/3   0   1/3   0   1/3   0    0  ]
    P′ = 4 [ 1/3   0    0    0   1/3   0   1/3 ]
         5 [  0   1/3   0   1/3   0   1/3   0  ]
         6 [  0    0   1/2   0   1/2   0    0  ].
Deleting the 3 and 7 columns of P′ gives

                      1    2    4    5    6
                1 [  0   1/2  1/2   0    0  ]
                2 [ 1/3   0    0   1/3   0  ]
    Q = P_A,A = 4 [ 1/3   0    0   1/3   0  ]
                5 [  0   1/3  1/3   0   1/3 ]
                6 [  0    0    0   1/2   0  ]
and deleting the 1, 2, 4, 5, and 6 columns of P′ gives

                      3    7
                1 [  0    0  ]
                2 [ 1/3   0  ]
    R = P_A,B = 4 [  0   1/3 ]
                5 [  0    0  ]
                6 [ 1/2   0  ].
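Remark 6.3's recipe for Q and R is just index slicing; here is a numpy sketch for the matrix P of Example 6.4 (states renumbered from 0 for indexing):

```python
# Sketch: extracting Q = P_{A,A} and R = P_{A,B} from the transition
# matrix of Example 6.4 by deleting/keeping the appropriate rows and columns.
import numpy as np

P = np.array([
    [0,   1/2, 0,   1/2, 0,   0,   0  ],
    [1/3, 0,   1/3, 0,   1/3, 0,   0  ],
    [0,   1/2, 0,   0,   0,   1/2, 0  ],
    [1/3, 0,   0,   0,   1/3, 0,   1/3],
    [0,   1/3, 0,   1/3, 0,   1/3, 0  ],
    [0,   0,   1/2, 0,   1/2, 0,   0  ],
    [0,   0,   0,   0,   0,   0,   1  ]])

A = [0, 1, 3, 4, 5]   # states 1, 2, 4, 5, 6 in 0-indexed form
B = [2, 6]            # states 3, 7

Q = P[np.ix_(A, A)]   # keep A-rows and A-columns
R = P[np.ix_(A, B)]   # keep A-rows and B-columns
```

Each A-row of P splits exactly into its Q-part plus its R-part, so the row sums of Q and R together are 1.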
Theorem 6.5 (Hitting distributions). Let h : B → R be a bounded or non-negative function and let u : S → R be defined by

    u(x) := E_x[h(X_{T_B}) : T_B < ∞] for x ∈ S.

Then u = h on B and

    u(x) = ∑_{y∈A} p(x, y) u(y) + ∑_{y∈B} p(x, y) h(y) for all x ∈ A.    (6.3)

In matrix notation this becomes

    u = Qu + Rh  =⇒  u = (I − Q)^{−1} R h,

i.e. for all x ∈ A we have

    E_x[h(X_{T_B}) : T_B < ∞] = [(I − Q)^{−1} R h]_x = [(I − P_{A×A})^{−1} P_{A×B} h]_x.    (6.4)

As a special case, if y ∈ B and h(s) = δ_y(s), then Eq. (6.4) becomes

    P_x(X_{T_B} = y : T_B < ∞) = [(I − Q)^{−1} R]_{x,y} = [(I − P_{A×A})^{−1} P_{A×B} δ_y]_x.    (6.5)

More generally, if B_0 ⊂ B and h(s) := 1_{B_0}(s), we learn that

    P_x(X_{T_B} ∈ B_0 : T_B < ∞) = [(I − P_{A×A})^{−1} P_{A×B} 1_{B_0}]_x.    (6.6)
Proof. To shorten the notation we will use the convention that h(X_{T_B}) = 0 if T_B = ∞, so that we may simply write u(x) := E_x[h(X_{T_B})]. Let

    F(X_0, X_1, . . . ) = h(X_{T_B(X)}) = h(X_{T_B(X)}) 1_{T_B(X)<∞};

then for x ∈ A we have F(x, X_0, X_1, . . . ) = F(X_0, X_1, . . . ). Therefore by the first step analysis (Theorem 6.1) we learn

    u(x) = E_x h(X_{T_B(X)}) = E_x F(x, X_1, . . . ) = ∑_{y∈S} p(x, y) E_y F(x, X_0, X_1, . . . )
    = ∑_{y∈S} p(x, y) E_y F(X_0, X_1, . . . ) = ∑_{y∈S} p(x, y) E_y[h(X_{T_B(X)})]
    = ∑_{y∈A} p(x, y) E_y[h(X_{T_B(X)})] + ∑_{y∈B} p(x, y) h(y)
    = ∑_{y∈A} p(x, y) u(y) + ∑_{y∈B} p(x, y) h(y).
Theorem 6.6 (Travel averages). Given g : A → [0, ∞], let w(x) := E_x[∑_{n<T_B} g(X_n)]. Then w(x) satisfies

    w(x) = ∑_{y∈A} p(x, y) w(y) + g(x) for all x ∈ A.    (6.7)

In matrix notation this becomes

    w = Qw + g  =⇒  w = (I − Q)^{−1} g,

so that

    E_x[∑_{n<T_B} g(X_n)] = [(I − Q)^{−1} g]_x = [(I − P_{A×A})^{−1} g]_x.

The following two special cases are of most interest:

1. Suppose g(x) = δ_y(x) for some y ∈ A; then ∑_{n<T_B} g(X_n) = ∑_{n<T_B} δ_y(X_n) is the number of visits of the chain to y and

       E_x(# visits to y before hitting B) = E_x[∑_{n<T_B} δ_y(X_n)] = (I − Q)^{−1}_{x,y}.    (6.8)

2. Suppose that g(x) = 1; then ∑_{n<T_B} g(X_n) = T_B and we may conclude that

       E_x[T_B] = [(I − Q)^{−1} 1]_x,    (6.9)

   where 1 is the column vector consisting of all ones.
Proof. Let F(X_0, X_1, . . . ) = ∑_{n<T_B(X_0,X_1,...)} g(X_n) be the sum of the values of g along the chain before its first exit from A, i.e. entrance into B. With this interpretation in mind, if x ∈ A, it is easy to see that

    F(x, X_0, X_1, . . . ) = { g(x)                        if X_0 ∈ B
                             { g(x) + F(X_0, X_1, . . . )   if X_0 ∈ A
                           = g(x) + 1_{X_0∈A} · F(X_0, X_1, . . . ).
Therefore by the first step analysis (Theorem 6.1) it follows that
    w(x) = E_x F(X_0, X_1, . . . ) = ∑_{y∈S} p(x, y) E_y F(x, X_0, X_1, . . . )
    = ∑_{y∈S} p(x, y) E_y[g(x) + 1_{X_0∈A} · F(X_0, X_1, . . . )]
    = g(x) + ∑_{y∈A} p(x, y) E_y[F(X_0, X_1, . . . )]
    = g(x) + ∑_{y∈A} p(x, y) w(y).
Remark 6.7. We may combine Theorems 6.5 and 6.6 into one theorem as follows. Suppose that h : S → R is a given function and let

    w(x) := E_x[∑_{n<T_B} h(X_n) + h(X_{T_B})];

then w(x) = h(x) for x ∈ B and

    w(x) = ∑_{y∈A} p(x, y) w(y) + h(x) + ∑_{y∈B} p(x, y) h(y) for x ∈ A.

In matrix format we have

    w = P_{A×A} w + h_A + P_{A×B} h_B

where h_A = {h(x)}_{x∈A} and h_B = {h(x)}_{x∈B}.
6.1 Finite state space examples
Example 6.8. Consider the Markov chain determined by

              1    2    3    4
        1 [  0   1/3  1/3  1/3 ]
    P = 2 [ 3/4  1/8  1/8   0  ]
        3 [  0    0    1    0  ]
        4 [  0    0    0    1  ],

whose hitting diagram is given in Figure 6.1. Notice that 3 and 4 are absorbing states. Taking A = {1, 2} and B = {3, 4}, we find
Fig. 6.1. For this chain the states 3 and 4 are absorbing.
1 2 3 4
P′ =
[0 1/3 1/3 1/3
3/4 1/8 1/8 0
]12,
1 2
Q =
[0 1/3
3/4 1/8
]12, and
3 4
R =
[1/3 1/31/8 0
]12.
Matrix manipulations now show (with rows indexed by i ∈ {1, 2} and columns by j)

    E_i(# visits to j before hitting {3, 4}) = (I − Q)^{−1} = [ 7/5  8/15 ] = [ 1.4  0.53333 ]
                                                              [ 6/5  8/5  ]   [ 1.2  1.6     ],

    E_i T_{{3,4}} = (I − Q)^{−1} [1; 1] = [ 29/15 ] = [ 1.9333 ]
                                          [ 14/5  ]   [ 2.8    ],

and

    P_i(X_{T_{{3,4}}} = j) = (I − Q)^{−1} R = [ 8/15  7/15 ] = [ 0.533  0.467 ]
                                              [ 3/5   2/5  ]   [ 0.6    0.4   ],

where the columns of the last matrix are indexed by j ∈ {3, 4}.
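The computations of Example 6.8 are easy to reproduce with numpy:

```python
# Sketch: the (I - Q)^{-1} computations of Example 6.8 done numerically.
import numpy as np

Q = np.array([[0.0, 1/3],
              [3/4, 1/8]])
R = np.array([[1/3, 1/3],
              [1/8, 0.0]])

N = np.linalg.inv(np.eye(2) - Q)   # expected visit counts (Eq. 6.8)
ET = N @ np.ones(2)                # expected absorption times (Eq. 6.9)
hit = N @ R                        # hitting distribution on {3, 4} (Eq. 6.5)
```

The outputs match the exact fractions computed above: N = [[7/5, 8/15], [6/5, 8/5]], ET = [29/15, 14/5], and hit = [[8/15, 7/15], [3/5, 2/5]].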
The output of one simulation from www.zweigmedia.com/RealWorld/markov/markov.html is in Figure 6.2 below.
Remark 6.9. As an aside, in the above simulation we really used the matrix

              1    2    3    4
        1 [  0   1/3  1/3  1/3 ]
    P = 2 [ 3/4  1/8  1/8   0  ]
        3 [  1    0    0    0  ]
        4 [  1    0    0    0  ],

which has invariant distribution
Fig. 6.2. In this run, rather than making sites 3 and 4 absorbing, we have made them transition back to 1. I claim that to get an approximate value for P_1(X_n hits 3) we should compute (State 3 Hits)/(State 3 Hits + State 4 Hits). In this example we get 171/(171 + 154) = 0.52615, which is a little lower than the predicted value of 0.533. You can try your own runs of this simulator.
    π = (1/(21 + 8 + 8 + 7)) [21  8  8  7] = [21/44  2/11  2/11  7/44] = [0.477  0.182  0.182  0.159].

Notice that

    (2/11) / (2/11 + 7/44) = 8/15!
Lemma 6.10. Suppose that B is a subset of S and x ∈ A := S \ B. Then

    P_x(T_B < ∞) = [(I − P_{A×A})^{−1} P_{A×B} 1]_x

where

    (I − P_{A×A})^{−1} := ∑_{n=0}^∞ P^n_{A×A}.

[See the optional Section 6.6 below for more analysis of this type.]
Proof. We work this out from first principles:

    P_x(T_B < ∞) = ∑_{n=1}^∞ P_x(T_B = n) = ∑_{n=1}^∞ P_x(X_1 ∈ A, . . . , X_{n−1} ∈ A, X_n ∈ B)
    = ∑_{n=1}^∞ ∑_{x_1,...,x_{n−1}∈A} ∑_{y∈B} p(x, x_1) p(x_1, x_2) · · · p(x_{n−2}, x_{n−1}) p(x_{n−1}, y)
    = [∑_{n=1}^∞ P^{n−1}_{A×A} P_{A×B} 1]_x = [∑_{n=0}^∞ P^n_{A×A} P_{A×B} 1]_x.
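The Neumann series appearing in Lemma 6.10 can be checked numerically on the substochastic block Q of Example 6.8 (a sketch; the 200-term partial sum is an arbitrary truncation, justified because the spectral radius of Q is below 1):

```python
# Sketch: partial sums of sum_{n>=0} Q^n converging to (I - Q)^{-1}
# for a substochastic matrix Q.
import numpy as np

Q = np.array([[0.0, 1/3],
              [3/4, 1/8]])

series = np.zeros((2, 2))
term = np.eye(2)
for _ in range(200):     # partial sum of the Neumann series
    series += term
    term = term @ Q

inverse = np.linalg.inv(np.eye(2) - Q)
```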
Example 6.11. Let us continue the rat in the maze Exercise 5.1 and now suppose that room 3 contains food while room 7 contains a mouse trap:

    1 2 3 (food)
    4 5 6
    7 (trap)

Recall that the transition matrix for this chain with sites 3 and 7 absorbing is given by

              1    2    3    4    5    6    7
        1 [  0   1/2   0   1/2   0    0    0  ]
        2 [ 1/3   0   1/3   0   1/3   0    0  ]
        3 [  0    0    1    0    0    0    0  ]
    P = 4 [ 1/3   0    0    0   1/3   0   1/3 ]
        5 [  0   1/3   0   1/3   0   1/3   0  ]
        6 [  0    0   1/2   0   1/2   0    0  ]
        7 [  0    0    0    0    0    0    1  ];
see Figure 6.3 for the corresponding jump diagram for this chain. We would like to compute the probability that the rat reaches the food before he is trapped. To answer this question we let A = {1, 2, 4, 5, 6}, B = {3, 7}, and T := T_B be the first hitting time of B. Then deleting the 3 and 7 rows of P leaves the matrix

               1    2    3    4    5    6    7
         1 [  0   1/2   0   1/2   0    0    0  ]
         2 [ 1/3   0   1/3   0   1/3   0    0  ]
    P′ = 4 [ 1/3   0    0    0   1/3   0   1/3 ]
         5 [  0   1/3   0   1/3   0   1/3   0  ]
         6 [  0    0   1/2   0   1/2   0    0  ].
Deleting the 3 and 7 columns of P′ gives
Fig. 6.3. The jump diagram for our proverbial rat in the maze. Here we assume the rat is “absorbed” at sites 3 and 7.
Q = P_{A,A} =
     1    2    4    5    6
1 [  0   1/2  1/2   0    0
2   1/3   0    0   1/3   0
4   1/3   0    0   1/3   0
5    0   1/3  1/3   0   1/3
6    0    0    0   1/2   0 ]

and deleting the 1, 2, 4, 5, and 6 columns of P′ gives

R = P_{A,B} =
     3    7
1 [  0    0
2   1/3   0
4    0   1/3
5    0    0
6   1/2   0 ].
Therefore,

I − Q =
[   1   −1/2  −1/2    0     0
  −1/3    1     0   −1/3    0
  −1/3    0     1   −1/3    0
    0   −1/3  −1/3    1   −1/3
    0     0     0   −1/2    1  ],

and using a computer algebra package we find

Ei [# visits to j before hitting {3, 7}] = (I − Q)^{−1} =
     1     2     4    5     6
1 [ 11/6  5/4   5/4   1   1/3
2    5/6  7/4   3/4   1   1/3
4    5/6  3/4   7/4   1   1/3
5    2/3   1     1    2   2/3
6    1/3  1/2   1/2   1   4/3 ]

(rows indexed by i, columns by j).
In particular we may conclude,

[E1T  E2T  E4T  E5T  E6T]^tr = (I − Q)^{−1} 1 = [17/3  14/3  14/3  16/3  11/3]^tr,

and
[ P1 (XT = 3)  P1 (XT = 7)
  P2 (XT = 3)  P2 (XT = 7)
  P4 (XT = 3)  P4 (XT = 7)
  P5 (XT = 3)  P5 (XT = 7)
  P6 (XT = 3)  P6 (XT = 7) ]
= (I − Q)^{−1} R =
     3      7
1 [ 7/12   5/12
2   3/4    1/4
4   5/12   7/12
5   2/3    1/3
6   5/6    1/6 ].
Since the event of hitting 3 before 7 is the same as the event {XT = 3}, the desired hitting probabilities are

[P1 (XT = 3)  P2 (XT = 3)  P4 (XT = 3)  P5 (XT = 3)  P6 (XT = 3)]^tr = [7/12  3/4  5/12  2/3  5/6]^tr.
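The matrix computations in this example are easy to reproduce numerically. The following sketch (not part of the original notes) uses NumPy to recover the fundamental matrix (I − Q)^{−1}, the expected absorption times, and the hitting probabilities for the rat-in-the-maze chain:

```python
import numpy as np

# States ordered A = [1, 2, 4, 5, 6] (transient) and B = [3, 7] (food, trap).
Q = np.array([[0,   1/2, 1/2, 0,   0],
              [1/3, 0,   0,   1/3, 0],
              [1/3, 0,   0,   1/3, 0],
              [0,   1/3, 1/3, 0,   1/3],
              [0,   0,   0,   1/2, 0]])
R = np.array([[0,   0],
              [1/3, 0],
              [0,   1/3],
              [0,   0],
              [1/2, 0]])

M = np.linalg.inv(np.eye(5) - Q)   # fundamental matrix (I - Q)^{-1}
ET = M @ np.ones(5)                # expected times to absorption, E_i T
H = M @ R                          # hitting probabilities P_i(X_T = 3), P_i(X_T = 7)

print(ET)       # ≈ [17/3, 14/3, 14/3, 16/3, 11/3]
print(H[:, 0])  # ≈ [7/12, 3/4, 5/12, 2/3, 5/6]
```

Each row of H sums to 1, reflecting that the rat is eventually absorbed with probability one.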
Example 6.12 (A modified rat maze). Here is the modified maze:

1 2 3 (food)
4 5
6 (trap)

We now let T = T_{3,6} be the first time to absorption – we assume that 3 and 6 are made absorbing states.¹ The transition matrix is given by
1 It is not necessary to make states 3 and 6 absorbing. In fact it does not matter at all what the transition probabilities for leaving either of the states 3 or 6 are, since we are going to stop the chain when we hit these states. This is reflected in the fact that the first thing we will do in the first step analysis is to delete rows 3 and 6 from P. Making 3 and 6 absorbing simply saves a little ink.
P =
     1    2    3    4    5    6
1 [  0   1/2   0   1/2   0    0
2   1/3   0   1/3   0   1/3   0
3    0    0    1    0    0    0
4   1/3   0    0    0   1/3  1/3
5    0   1/2   0   1/2   0    0
6    0    0    0    0    0    1 ].
The corresponding Q and R matrices in this case are

Q =
     1    2    4    5
1 [  0   1/2  1/2   0
2   1/3   0    0   1/3
4   1/3   0    0   1/3
5    0   1/2  1/2   0 ],
and R =
     3    6
1 [  0    0
2   1/3   0
4    0   1/3
5    0    0 ].
After some matrix manipulation we then learn

Ei [# visits to j] = (I4 − Q)^{−1} =
     1    2    4    5
1 [  2   3/2  3/2   1
2    1    2    1    1
4    1    1    2    1
5    1   3/2  3/2   2 ],

Pi [XT = j] = (I4 − Q)^{−1} R =
     3    6
1 [ 1/2  1/2
2   2/3  1/3
4   1/3  2/3
5   1/2  1/2 ],

Ei [T] = (I4 − Q)^{−1} [1  1  1  1]^tr = [6  5  5  6]^tr  (i = 1, 2, 4, 5).
So, for example, P4(XT = 3 (food)) = 1/3, E4(number of visits to 1) = 1, E5(number of visits to 2) = 3/2, and E1T = E5T = 6 while E2T = E4T = 5.
Example 6.13 (Two heads in a row). On average, how many times do we need to toss a coin to get two consecutive heads? To answer this question, we need to construct an appropriate Markov chain. There are different ways to approach this; probably the simplest is to let Xn denote the number of consecutive heads seen immediately after the nth toss; we stop the chain once we see two consecutive heads. This makes Xn a Markov chain with state space {0, 1, 2} and transition matrix
P =
     0    1    2
0 [ 1/2  1/2   0
1   1/2   0   1/2
2    0    0    1 ].
We then take B = {2} and we wish to compute

E0TB = [(I − Q)^{−1} 1]_0

where

Q =
     0    1
0 [ 1/2  1/2
1   1/2   0 ].
The linear algebra gives

(I − Q)^{−1} = [ 4  2 ; 2  2 ]  (rows and columns indexed by 0, 1)

and

(I − Q)^{−1} 1 = [ 4  2 ; 2  2 ] [ 1 ; 1 ] = [ 6 ; 4 ],

and hence E0TB = 6.
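The answer E0TB = 6 is easy to confirm with a short sketch (not part of the original notes): the linear-algebra formula on one hand, and a direct Monte Carlo simulation of coin tossing on the other.

```python
import numpy as np

# Q restricted to the transient states {0, 1} (current run of heads).
Q = np.array([[0.5, 0.5],
              [0.5, 0.0]])
ET = np.linalg.inv(np.eye(2) - Q) @ np.ones(2)
print(ET)   # [6., 4.]

# Monte Carlo sanity check of E_0 T_B: flip a fair coin until two heads in a row.
rng = np.random.default_rng(0)

def tosses_until_two_heads(rng):
    run = total = 0
    while run < 2:
        run = run + 1 if rng.random() < 0.5 else 0
        total += 1
    return total

mean = np.mean([tosses_until_two_heads(rng) for _ in range(50_000)])
print(mean)  # close to 6
```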
Remark 6.14. Suppose we change P in Example 6.13 above to

P =
     0    1    2
0 [ 1/2  1/2   0
1   1/2   0   1/2
2    1    0    0 ].
This represents the chain where we restart the game once we have reached two heads in a row. The invariant distribution for P is

π = (1/7)[4  2  1] = [4/7  2/7  1/7].
Note that on average we expect 4 visits to 0 and 2 visits to 1 for every time the game is played, and hence the expected number of tosses to get two heads in a row is 4 + 2 = 6, in agreement with Example 6.13.
Exercise 6.1 (Invariant distributions and expected return times). Suppose that {Xn}_{n=0}^∞ is a Markov chain on a finite state space S determined by one step transition probabilities, p (x, y) = P (Xn+1 = y | Xn = x). For x ∈ S, let Rx := inf {n > 0 : Xn = x} be the first passage time² to x. [We will assume here that EyRx < ∞ for all x, y ∈ S.] Use the first step analysis to show,
2 Rx is the first return time to x when the chain starts at x.
ExRy = ∑_{z : z ≠ y} p (x, z) EzRy + 1.   (6.10)

Now further assume that π : S → [0, 1] is an invariant distribution for p, that is ∑_{x∈S} π (x) = 1 and ∑_{x∈S} π (x) p (x, y) = π (y) for all y ∈ S, i.e. πP = π. By multiplying Eq. (6.10) by π (x) and summing on x ∈ S, show

π (y) EyRy = 1 for all y ∈ S  ⟹  π (y) = 1/(EyRy) > 0.   (6.11)
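The identity π(y) = 1/EyRy asked for in this exercise is easy to check numerically. A minimal sketch (not part of the notes), using the 3-state chain of Example 6.29 below; any finite irreducible chain would do. The return times EyRy are computed from Eq. (6.10) by zeroing out column y of P and solving a linear system:

```python
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])

# Invariant distribution: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

# E_x R_y solves Eq. (6.10): zero out column y of P, then v = (I - P0)^{-1} 1.
n = len(P)
for y in range(n):
    P0 = P.copy()
    P0[:, y] = 0.0
    v = np.linalg.solve(np.eye(n) - P0, np.ones(n))
    assert abs(pi[y] - 1.0 / v[y]) < 1e-9   # pi(y) = 1 / E_y R_y

print(pi)   # ≈ [0.4, 0.4, 0.2]
```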
6.2 More first step analysis examples
Example 6.15. Let TB = inf {n ≥ 0 : Xn ∈ B}. Here we compute w (x) = ExT_B² as follows:

w (x) = ∑_{y∈S} p (x, y) Ex [T_B² | X1 = y] = ∑_{y∈S} p (x, y) Ey [(TB + 1)²]

= ∑_{y∈S} p (x, y) Ey [T_B² + 2TB + 1] = 1 + ∑_{y∈A} p (x, y) Ey [T_B² + 2TB]

= 1 + ∑_{y∈A} p (x, y) w (y) + 2 ∑_{y∈A} p (x, y) EyTB.
Thus in matrix notation with Q = P_{A×A} this may be written as

E·T_B² = w = (I − Q)^{−1} [1 + 2Q (I − Q)^{−1} 1],

wherein we have used E·TB = (I − Q)^{−1} 1. In other words,

E·T_B² = w = (I − Q)^{−1} [I + 2Q (I − Q)^{−1}] 1 = (I − Q)^{−2} (I + Q) 1.
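As a concrete illustration (a sketch, not part of the notes), applying the formula E·T_B² = (I − Q)^{−2}(I + Q) 1 to the two-heads chain of Example 6.13 gives the second moment of the absorption time alongside the mean:

```python
import numpy as np

# Q for the two-heads chain of Example 6.13 (transient states {0, 1}).
Q = np.array([[0.5, 0.5],
              [0.5, 0.0]])
I = np.eye(2)
M = np.linalg.inv(I - Q)

ET = M @ np.ones(2)                                        # (I - Q)^{-1} 1
ET2 = np.linalg.matrix_power(M, 2) @ (I + Q) @ np.ones(2)  # (I - Q)^{-2} (I + Q) 1
print(ET)    # [6., 4.]
print(ET2)   # [58., 36.], so Var_0(T_B) = 58 - 6**2 = 22
```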
Example 6.16. Let us now suppose that |z| < 1 and let us set w (x) := Ex [z^{TB}] and notice that w (y) = 1 for y ∈ B. Then by the first step analysis we have

w (x) = ∑_{y∈S} p (x, y) Ex [z^{TB} | X1 = y] = ∑_{y∈S} p (x, y) Ey [z^{TB+1}]

= z ∑_{y∈S} p (x, y) Ey [z^{TB}] = z ∑_{y∈S} p (x, y) w (y).

So letting w := w|_A, we find

w = z [Qw + p (·, B)] = z [Qw + 1 − p (·, A)] = z [Qw + 1 − Q1].

Thus solving this equation for w gives

E· [z^{TB}] = w = z (I − zQ)^{−1} (I − Q) 1.
Differentiating this equation with respect to z at z = 1 shows

E· [TB] = 1 + (d/dz)|_{z=1} (I − zQ)^{−1} (I − Q) 1

= 1 + (I − Q)^{−1} Q (I − Q)^{−1} (I − Q) 1

= [I + (I − Q)^{−1} Q] 1 = (I − Q)^{−1} 1,

which reproduces our old formula for E· [TB].
Exercise 6.2 (III.4.P11 on p.132). An urn contains two red and two greenballs. The balls are chosen at random, one by one, and removed from the urn.The selection process continues until all of the green balls have been removedfrom the urn. What is the probability that a single red ball is in the urn at thetime that the last green ball is chosen?
Theorem 6.17. Let h : B → [0, ∞] and g : A → [0, ∞] be given. For x ∈ S, if we let³

w (x) := Ex [h (X_{TB}) · ∑_{n<TB} g (Xn) : TB < ∞]

and

gh (x) = g (x) Ex [h (X_{TB}) : TB < ∞],

then

w (x) = Ex [∑_{n<TB} gh (Xn) : TB < ∞].   (6.12)
Remark 6.18. Recall that we can find uh (x) := Ex [h (X_{TB}) : TB < ∞] using Theorem 6.5 and then we can solve for w (x) using Theorem 6.6 with g replaced by gh (x) = g (x) uh (x). So in the matrix language we solve for w (x) as follows;
3 Recall from Theorem 6.5 that uh = (I − Q)^{−1} Rh, i.e. u = h on B and u satisfies

u (x) = ∑_{y∈A} p (x, y) u (y) + ∑_{y∈B} p (x, y) h (y) for all x ∈ A.
uh := (I − Q)^{−1} Rh,
gh := g ∗ uh, and
w = (I − Q)^{−1} gh,

where [a ∗ b]_x := a_x · b_x – the entry by entry product of column vectors.
Proof. First proof. Let H (X) := h (X_{TB}) 1_{TB<∞}. Then using 1_{n<TB} (X) = 1_{X0∈A,...,Xn∈A} and

H (X0, . . . , X_{n−1}, Xn, . . . ) = H (Xn, X_{n+1}, . . . ) when X0, . . . , Xn ∈ A

along with the Markov property in Theorem 5.16 shows

w (x) = ∑_{n=0}^∞ Ex [H (X) · 1_{n<TB} g (Xn)]

= ∑_{n=0}^∞ Ex [H (X) · 1_{X0∈A,...,Xn∈A} g (Xn)]

= ∑_{n=0}^∞ Ex [E^{(Y)}_{Xn} [H (X0, . . . , X_{n−1}, Y)] · 1_{X0∈A,...,Xn∈A} g (Xn)]

= ∑_{n=0}^∞ Ex [E^{(Y)}_{Xn} [H (Y)] · 1_{X0∈A,...,Xn∈A} g (Xn)]

= ∑_{n=0}^∞ Ex [uh (Xn) · 1_{X0∈A,...,Xn∈A} g (Xn)]

= ∑_{n=0}^∞ Ex [uh (Xn) · g (Xn) 1_{n<TB}]

= Ex [∑_{n<TB} uh (Xn) · g (Xn)] = Ex [∑_{n<TB} gh (Xn)].
Second proof. Let G (X) := ∑_{n<TB} g (Xn) and observe that

H (x, Y) = { H (Y) if x ∈ A;  h (x) if x ∈ B }  and  G (x, Y) = g (x) + G (Y),

and so by the first step analysis we find

w (x) = Ex [H (X) G (X)] = E_{p(x,·)} [H (x, Y) G (x, Y)]
= E_{p(x,·)} [H (x, Y) (g (x) + G (Y))]
= g (x) E_{p(x,·)} [H (x, Y)] + E_{p(x,·)} [H (x, Y) G (Y)].

The first step analysis also shows (see the proof of Theorem 6.5)

uh (x) := Ex [h (X_{TB}) 1_{TB<∞}] = Ex [H (X)] = E_{p(x,·)} [H (x, Y)],

and therefore

w (x) = g (x) uh (x) + E_{p(x,·)} [H (x, Y) G (Y)].

Since G (Y) = 0 if Y0 ∈ B and H (x, Y) = H (Y) if Y0 ∈ A, we find

E_{p(x,·)} [H (x, Y) G (Y)] = ∑_{y∈S} p (x, y) Ey [H (x, Y) G (Y)]
= ∑_{y∈A} p (x, y) Ey [H (x, Y) G (Y)]
= ∑_{y∈A} p (x, y) Ey [H (Y) G (Y)]
= ∑_{y∈A} p (x, y) w (y),

and hence

w (x) = g (x) uh (x) + ∑_{y∈A} p (x, y) w (y) = gh (x) + ∑_{y∈A} p (x, y) w (y).

But Theorem 6.6 with g replaced by gh then shows w is given by Eq. (6.12).
Example 6.19 (A possible carnival game). Suppose that B is the disjoint union of L and W and suppose that you win ∑_{n<TB} g (Xn) if you end in W and win nothing when you end in L. What is the least we can expect to have to pay to play this game, and where in A := S \ B should we choose to start the game? To answer these questions we should compute our expected winnings w (x) for each starting point x ∈ A;

w (x) = Ex [1_W (X_{TB}) ∑_{n<TB} g (Xn)].

Once we find w we should expect to pay at least C := max_{x∈A} w (x), and we should start at a location x0 ∈ A where w (x0) = max_{x∈A} w (x) = C. As an application of Theorem 6.17 we know that

w (x) = [(I − Q)^{−1} gh]_x

where⁴
4 Intuitively, the effective payoff for a visit to site x is g (x) · Px (we win) + 0 · Px (we lose).
gh (x) = g (x)Ex [1W (XTB )] = g (x)Px (XTB ∈W ) .
Let us now specialize these results to the chain in Example 6.8 where
P =
     1    2    3    4
1 [  0   1/3  1/3  1/3
2   3/4  1/8  1/8   0
3    0    0    1    0
4    0    0    0    1 ].
Let us make 4 the winning state and 3 the losing state (i.e. h (3) = 0 and h (4) = 1) and let g = (g (1), g (2)) be the payoff function. We have already seen that

[ uh (1) ; uh (2) ] = [ P1 (X_{TB} = 4) ; P2 (X_{TB} = 4) ] = [ 7/15 ; 2/5 ]

so that

g ∗ uh = [ (7/15) g1 ; (2/5) g2 ]

and therefore

[ w (1) ; w (2) ] = (I − Q)^{−1} [ (7/15) g1 ; (2/5) g2 ]
= [ 7/5  8/15 ; 6/5  8/5 ] [ (7/15) g1 ; (2/5) g2 ]
= [ (49/75) g1 + (16/75) g2 ; (14/25) g1 + (16/25) g2 ].
Let us examine a few different choices for g.
1. When g (1) = 32 and g (2) = 7, we have

[ w (1) ; w (2) ] = [ (49/75)·32 + (16/75)·7 ; (14/25)·32 + (16/25)·7 ] = [ 112/5 ; 112/5 ] = [ 22.4 ; 22.4 ],

and so it does not matter where we start and we are going to have to pay at least $22.40 to play.
2. When g (1) = 10 = g (2), then

[ w (1) ; w (2) ] = [ 26/3 ; 12 ] = [ 8.6667 ; 12.0 ],

and we should enter the game at site 2. We are going to have to pay at least $12 to play.
3. If g (1) = 20 and g (2) = 7, then

[ w (1) ; w (2) ] = [ 364/25 ; 392/25 ] = [ 14.56 ; 15.68 ],

and again we should enter the game at site 2. We are going to have to pay at least $15.68 to play.
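The three payoff scenarios above can be reproduced in a few lines. The following sketch (not part of the original notes) implements the w = (I − Q)^{−1}(g ∗ uh) recipe from Remark 6.18 for this chain:

```python
import numpy as np

Q = np.array([[0.0, 1/3],
              [3/4, 1/8]])
M = np.linalg.inv(np.eye(2) - Q)   # = [[7/5, 8/15], [6/5, 8/5]]
uh = np.array([7/15, 2/5])         # P_1(X_T = 4), P_2(X_T = 4)

for g in ([32, 7], [10, 10], [20, 7]):
    w = M @ (np.array(g) * uh)     # w = (I - Q)^{-1} (g * u_h)
    print(g, w)
# [32, 7]  -> [22.4, 22.4]
# [10, 10] -> [8.667..., 12.0]
# [20, 7]  -> [14.56, 15.68]
```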
6.3 Random Walk Exercises
Exercise 6.3 (Uniqueness of solutions to 2nd order recurrence relations). Let a, b, c be real numbers with a ≠ 0 ≠ c, α, β ∈ Z ∪ {±∞} with α < β, and f : Z ∩ (α, β) → R be a given function. Show that there is exactly one function u : [α, β] ∩ Z → R with prescribed values on two consecutive points in [α, β] ∩ Z which satisfies the second order recurrence relation

au (x + 1) + bu (x) + cu (x − 1) = f (x) for all x ∈ Z ∩ (α, β).   (6.13)

That is, show: if u and w both satisfy Eq. (6.13) and u = w on two consecutive points in (α, β) ∩ Z, then u (x) = w (x) for all x ∈ [α, β] ∩ Z.
Exercise 6.4 (General homogeneous solutions). Let a, b, c be real numbers with a ≠ 0 ≠ c, α, β ∈ Z ∪ {±∞} with α < β, and suppose {u (x) : x ∈ [α, β] ∩ Z} solves the second order homogeneous recurrence relation

au (x + 1) + bu (x) + cu (x − 1) = 0 for all x ∈ Z ∩ (α, β),   (6.14)

i.e. Eq. (6.13) with f (x) ≡ 0. Show:

1. For any λ ∈ C,

aλ^{x+1} + bλ^x + cλ^{x−1} = λ^{x−1} p (λ),   (6.15)

where p (λ) = aλ² + bλ + c is the characteristic polynomial associated to Eq. (6.13). Let λ± = (−b ± √(b² − 4ac))/(2a) be the roots of p (λ) and suppose for the moment that b² − 4ac ≠ 0. From Eq. (6.15) it follows that for any choice of A± ∈ R, the function

w (x) := A+ λ+^x + A− λ−^x,   (6.16)

solves Eq. (6.14) for all x ∈ Z.

2. Show there is a unique choice of constants A± ∈ R such that the function u (x) is given by

u (x) = A+ λ+^x + A− λ−^x for all α ≤ x ≤ β.

3. Now suppose that b² = 4ac and λ0 := −b/(2a) is the double root of p (λ). Show for any choice of A0 and A1 in R that

w (x) := (A0 + A1 x) λ0^x   (6.17)

solves Eq. (6.14) for all x ∈ Z. Hint: Differentiate Eq. (6.15) with respect to λ and then set λ = λ0.

4. Again show that any function u solving Eq. (6.14) is of the form u (x) = (A0 + A1 x) λ0^x for α ≤ x ≤ β for some unique choice of constants A0, A1 ∈ R.
In the next group of exercises you are going to use first step analysis to show that a simple unbiased random walk on Z is null recurrent. We let {Xn}_{n=0}^∞ be the Markov chain with values in Z with transition probabilities given by

P (Xn+1 = x ± 1 | Xn = x) = 1/2 for all n ∈ N0 and x ∈ Z.

Further let a, b ∈ Z with a < 0 < b and

T_{a,b} := min {n : Xn ∈ {a, b}} and Tb := inf {n : Xn = b}.

We know by Corollary 5.22 that E0 [T_{a,b}] < ∞ from which it follows that P (T_{a,b} < ∞) = 1 for all a < 0 < b. For these reasons we will ignore the event {T_{a,b} = ∞} in what follows below.
Exercise 6.5. Let w (x) := Px (X_{T_{a,b}} = b) := P (X_{T_{a,b}} = b | X0 = x).

1. Use first step analysis to show for a < x < b that

w (x) = (1/2)(w (x + 1) + w (x − 1))   (6.18)

provided we define w (a) = 0 and w (b) = 1.

2. Use the results of Exercises 6.3 and 6.4 to show

Px (X_{T_{a,b}} = b) = w (x) = (x − a)/(b − a).   (6.19)

3. Let

Tb := { min {n : Xn = b} if Xn hits b;  ∞ otherwise }

be the first time Xn hits b. Explain why {X_{T_{a,b}} = b} ⊂ {Tb < ∞} and use this along with Eq. (6.19) to conclude that Px (Tb < ∞) = 1 for all x < b. (By symmetry this result holds true for all x ∈ Z.)
Exercise 6.6. The goal of this exercise is to give a second proof of the fact thatPx (Tb <∞) = 1. Here is the outline:
1. Let w (x) := Px (Tb < ∞). Again use first step analysis to show that w (x) satisfies Eq. (6.18) for all x ≠ b, with w (b) = 1.
2. Use Exercises 6.3 and 6.4 to show that there is a constant, c, such that
w (x) = c · (x− b) + 1 for all x ∈ Z.
3. Explain why c must be zero to again show that Px (Tb <∞) = 1 for allx ∈ Z.
Exercise 6.7. Let T = T_{a,b} and u (x) := ExT := E [T | X0 = x].

1. Use first step analysis to show for a < x < b that

u (x) = (1/2)(u (x + 1) + u (x − 1)) + 1   (6.20)

with the convention that u (a) = 0 = u (b).

2. Show that

u (x) = A0 + A1 x − x²   (6.21)

solves Eq. (6.20) for any choice of constants A0 and A1.

3. Choose A0 and A1 so that u (x) satisfies the boundary conditions u (a) = 0 = u (b). Use this to conclude that

ExT_{a,b} = −ab + (b + a) x − x² = −a (b − x) + bx − x².   (6.22)
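Both Eq. (6.19) and Eq. (6.22) can be spot-checked by simulating the walk directly. A minimal sketch (not part of the notes), with the assumed parameters a = −3, b = 5, starting point x0 = 1:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, x0, trials = -3, 5, 1, 20_000

hits_b = 0
total_time = 0
for _ in range(trials):
    pos, t = x0, 0
    while a < pos < b:                          # run until the walk exits (a, b)
        pos += 1 if rng.random() < 0.5 else -1
        t += 1
    hits_b += (pos == b)
    total_time += t

print(hits_b / trials)       # close to (x0 - a)/(b - a) = 0.5
print(total_time / trials)   # close to -a(b - x0) + b*x0 - x0**2 = 16
```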
Remark 6.20. Notice that T_{a,b} ↑ Tb = inf {n : Xn = b} as a ↓ −∞, and so passing to the limit as a ↓ −∞ in Eq. (6.22) shows

ExTb = ∞ for all x < b.

Combining the last couple of exercises together shows that Xn is “null-recurrent.”
Exercise 6.8. Let T = Tb. The goal of this exercise is to give a second proof of the fact that u (x) := ExT = ∞ for all x ≠ b. Here is the outline. Let u (x) := ExT ∈ [0, ∞] = [0, ∞) ∪ {∞}.

1. Note that u (b) = 0 and, by a first step analysis, that u (x) satisfies Eq. (6.20) for all x ≠ b, allowing for the possibility that some of the u (x) may be infinite.

2. Argue, using Eq. (6.20), that if u (x) < ∞ for some x < b then u (y) < ∞ for all y < b. [If u (x) < ∞, then Eq. (6.20) can only hold if u (x + 1) < ∞ and u (x − 1) < ∞; continuing this line of reasoning shows u (x − 2) < ∞, then u (x − 3) < ∞, etc.] Similarly, if u (x) < ∞ for some x > b then u (y) < ∞ for all y > b.

3. If u (x) < ∞ for all x > b then u (x) must be of the form in Eq. (6.21) for some A0 and A1 in R such that u (b) = 0. However, this would imply

ExT = u (x) = A0 + A1 x − x² → −∞ as x → ∞,

which is impossible since ExT ≥ 0 for all x. Thus we must conclude that ExT = u (x) = ∞ for all x > b. (A similar argument works if we assume that u (x) < ∞ for all x < b.)
For the remaining exercises in this section we will assume that p ∈ (1/2, 1)and q = 1− p so that p/q > 1.
Exercise 6.9 (Biased random walks I). Let p ∈ (1/2, 1) and consider the biased random walk {Xn}_{n≥0} on S = Z where Xn = ξ0 + ξ1 + · · · + ξn, {ξi}_{i=1}^∞ are i.i.d. with P (ξi = 1) = p and P (ξi = −1) = q := 1 − p, and ξ0 = x for some x ∈ Z. Let T = T0 be the first hitting time of 0 and u (x) := Px (T < ∞).

a) Use the first step analysis to show

u (x) = p u (x + 1) + q u (x − 1) for x ≠ 0 and u (0) = 1.   (6.23)

b) Use Eq. (6.23) along with Exercises 6.3 and 6.4 to show for some a± ∈ R that

u (x) = (1 − a+) + a+ (q/p)^x for x ≥ 0, and   (6.24)
u (x) = (1 − a−) + a− (q/p)^x for x ≤ 0.   (6.25)

c) By considering the limit as x → −∞, conclude that a− = 0 and u (x) = 1 for all x ≤ 0, i.e. Px (T0 < ∞) = 1 for all x ≤ 0.
Exercise 6.10 (Biased random walks II). The goal of this exercise is to evaluate Px (T0 < ∞) for x ≥ 0. To do this let Bn := {0, n} and Tn := T_{0,n}. Let h (x) := Px (X_{Tn} = 0), where {X_{Tn} = 0} is the event of hitting 0 before n.

a) Use the first step analysis to show

h (x) = p h (x + 1) + q h (x − 1) with h (0) = 1 and h (n) = 0.

b) Show the unique solution to this equation is given by

Px (X_{Tn} = 0) = h (x) = ((q/p)^x − (q/p)^n) / (1 − (q/p)^n).

c) Argue that

Px (T < ∞) = lim_{n→∞} Px (X_{Tn} = 0) = (q/p)^x for all x ≥ 0, which is strictly less than 1 for x > 0.
The following formula summarizes Exercises 6.9 and 6.10; for 1/2 < p < 1,

Px (T < ∞) = { (q/p)^x if x ≥ 0;  1 if x < 0 }.   (6.26)
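Formula (6.26) is simulable. A sketch (not part of the notes), with the assumed values p = 0.6 and x = 2: since the walk drifts upward, a finite step cap introduces only a negligible truncation error in the estimate of Px (T0 < ∞).

```python
import numpy as np

p, x = 0.6, 2
q = 1 - p
rng = np.random.default_rng(2)

# 10,000 walks of 400 steps each.  The walk drifts upward, so returns to 0
# occurring after step 400 contribute a negligible amount of probability.
steps = rng.choice([1, -1], size=(10_000, 400), p=[p, q])
paths = x + np.cumsum(steps, axis=1)
hit0 = paths.min(axis=1) <= 0

print(hit0.mean())   # close to (q/p)**x = (2/3)**2 = 0.444...
```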
Example 6.21 (Biased random walks III). Continue the notation in Exercise 6.9. Let us start to compute ExT. Since Px (T = ∞) > 0 for x > 0 we already know that ExT = ∞ for all x > 0. Nevertheless we will deduce this fact again here. Letting u (x) = ExT, it follows by the first step analysis that, for x ≠ 0,

u (x) = p [1 + u (x + 1)] + q [1 + u (x − 1)] = p u (x + 1) + q u (x − 1) + 1   (6.27)

with u (0) = 0. Notice u (x) = ∞ is a solution to this equation, while if u (n) < ∞ for some n ≠ 0 then Eq. (6.27) implies that u (x) < ∞ for all x ≠ 0 with the same sign as n. A particular solution to this equation may be found by trying u (x) = αx to learn

αx = pα (x + 1) + qα (x − 1) + 1 = αx + α (p − q) + 1,

which is valid for all x provided α = (q − p)^{−1}. The general finite solution to Eq. (6.27) is therefore

u (x) = (q − p)^{−1} x + a + b (q/p)^x.   (6.28)

Using the boundary condition u (0) = 0 allows us to conclude that a + b = 0 and therefore

u (x) = (q − p)^{−1} x + a [1 − (q/p)^x].   (6.29)

Notice that u (x) → −∞ as x → +∞ no matter how a is chosen, and therefore we must conclude that the desired solution to Eq. (6.27) is u (x) = ∞ for x > 0, as we already mentioned. In the next exercise you will compute ExT for x < 0.
Exercise 6.11 (Biased random walks IV). Continue the notation in Example 6.21. Using the outline below, show

ExT = |x|/(p − q) for x ≤ 0.   (6.30)

In the following outline n is a negative integer, Tn is the first hitting time of n, so that T_{n,0} = Tn ∧ T = min {T, Tn} is the first hitting time of {n, 0}. By Corollary 5.22 we know that u (x) := Ex [T_{n,0}] < ∞ for all n ≤ x ≤ 0, and by a first step analysis one sees that u (x) still satisfies Eq. (6.27) for n < x < 0 and has boundary conditions u (n) = 0 = u (0).

a) From Eq. (6.29) we know that, for some a ∈ R,

Ex [T_{n,0}] = u (x) = (q − p)^{−1} x + a [1 − (q/p)^x].

Use u (n) = 0 in order to show

a = a_n = n / ((1 − (q/p)^n)(p − q))

and therefore

Ex [T_{n,0}] = (1/(p − q)) [ |x| + n (1 − (q/p)^x)/(1 − (q/p)^n) ] for n ≤ x ≤ 0.

b) Argue that ExT = lim_{n→−∞} Ex [Tn ∧ T] and use this and part a) to prove Eq. (6.30).
Remark 6.22 (More on the boundary conditions). If we were to use Theorem 6.6 directly to derive Eq. (6.27) in the case that u (x) := Ex [T_{n,0}] < ∞ for all n ≤ x ≤ 0, we would find, for x ≠ 0, that

u (x) = ∑_{y ∉ {n,0}} q (x, y) u (y) + 1,

which implies that u (x) satisfies Eq. (6.27) for n < x < 0 provided u (n) and u (0) are taken to be equal to zero. Let us again choose a and b so that

w (x) := (q − p)^{−1} x + a + b (q/p)^x

satisfies w (0) = 0 and w (−1) = u (−1). Then both w and u satisfy Eq. (6.27) for n < x < 0 and agree at 0 and −1 and therefore are equal for n ≤ x ≤ 0, and in particular 0 = u (n) = w (n). Thus the correct boundary conditions on w in order for w = u are w (0) = w (n) = 0, as we have used above.
6.4 Wald’s Equation and Gambler’s Ruin
Example 6.23. Here are some examples of random times which are, and which are not, stopping times. In these examples we will always use the convention that the minimum of the empty set is +∞.

1. The random time τ = min {k : |Xk| ≥ 5} (the first time, k, such that |Xk| ≥ 5) is a stopping time since

{τ = k} = {|X1| < 5, . . . , |X_{k−1}| < 5, |Xk| ≥ 5}.

2. Let Wk := X1 + · · · + Xk; then the random time

τ = min {k : Wk ≥ π}

is a stopping time since

{τ = k} = {Wj = X1 + · · · + Xj < π for j = 1, 2, . . . , k − 1, and X1 + · · · + X_{k−1} + Xk ≥ π}.

3. For t ≥ 0, let N(t) = #{k : Wk ≤ t}. Then

{N(t) = k} = {X1 + · · · + Xk ≤ t, X1 + · · · + X_{k+1} > t},

which shows that N (t) is not a stopping time. On the other hand, since

{N(t) + 1 = k} = {N(t) = k − 1} = {X1 + · · · + X_{k−1} ≤ t, X1 + · · · + Xk > t},

we see that N(t) + 1 is a stopping time!

4. If τ is a stopping time then so is τ + 1 because

1_{τ+1=k} = 1_{τ=k−1} = σ_{k−1} (X0, . . . , X_{k−1}),

which is also a function of (X0, . . . , Xk) which happens not to depend on Xk.

5. On the other hand, if τ is a stopping time it is not necessarily true that τ − 1 is still a stopping time, as seen in item 3 above.

6. One can also see that the last time, k, such that |Xk| ≥ π is typically not a stopping time. (Think about this.)
The following presentation of Wald’s equation is taken from Ross [16, p.59-60].
Theorem 6.24 (Wald’s Equation). Suppose that Xn∞n=1 is a sequence ofi.i.d. random variables, f (x) is a non-negative function of x ∈ R, and τ is astopping time. Then
E
[τ∑n=1
f (Xn)
]= Ef (X1) · Eτ. (6.31)
This identity also holds if f (Xn) are real valued but integrable and τ is a stop-ping time such that Eτ <∞. (See Resnick for more identities along these lines.)
Proof. If f (Xn) ≥ 0 for all n, then the following computations need no justification. Since τ is a stopping time, 1_{n≤τ} = 1 − 1_{τ≤n−1} is a function un (X1, . . . , X_{n−1}) of the first n − 1 variables only, and hence is independent of Xn. Therefore

E [∑_{n=1}^τ f (Xn)] = E [∑_{n=1}^∞ f (Xn) 1_{n≤τ}] = ∑_{n=1}^∞ E [f (Xn) 1_{n≤τ}]

= ∑_{n=1}^∞ E [f (Xn) un (X1, . . . , X_{n−1})]

= ∑_{n=1}^∞ E [f (Xn)] · E [un (X1, . . . , X_{n−1})]

= ∑_{n=1}^∞ E [f (Xn)] · E [1_{n≤τ}] = Ef (X1) ∑_{n=1}^∞ E [1_{n≤τ}]

= Ef (X1) · E [∑_{n=1}^∞ 1_{n≤τ}] = Ef (X1) · Eτ.
If E |f (Xn)| < ∞ and Eτ < ∞, the above computation with f replaced by |f| shows all sums appearing above are equal to E |f (X1)| · Eτ < ∞. Hence we may remove the absolute values to again arrive at Eq. (6.31).
Example 6.25. Let {Xn}_{n=1}^∞ be i.i.d. such that P (Xn = 0) = P (Xn = 1) = 1/2 and let

τ := min {n : X1 + · · · + Xn = 10}.

For example, τ is the first time we have flipped 10 heads of a fair coin. By Wald's equation (valid because Xn ≥ 0 for all n) we find

10 = E [∑_{n=1}^τ Xn] = EX1 · Eτ = (1/2) Eτ

and therefore Eτ = 20 < ∞.
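A quick simulation (a sketch, not part of the notes) confirms Eτ = 20 for this example:

```python
import numpy as np

rng = np.random.default_rng(3)
taus = []
for _ in range(20_000):
    total = n = 0
    while total < 10:                 # stop at the 10th head
        total += rng.integers(0, 2)   # fair 0/1 coin flip
        n += 1
    taus.append(n)

print(np.mean(taus))   # close to E[tau] = 10 / E[X_1] = 20
```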
Example 6.26 (Gambler’s ruin). Let Xn∞n=1 be i.i.d. such that P (Xn = −1) =P (Xn = 1) = 1/2 and let
τ := min n : X1 + · · ·+Xn = 1 .
So τ may represent the first time that a gambler is ahead by 1. Notice thatEX1 = 0. If Eτ < ∞, then we would have τ < ∞ a.s. and by Wald’s equationwould give,
1 = E
[τ∑n=1
Xn
]= EX1 · Eτ = 0 · Eτ
which can not hold. Hence it must be that
Eτ = E [first time that a gambler is ahead by 1] =∞.
6.5 Some more worked examples
Example 6.27. Let S = {1, 2} and P = [ 0  1 ; 1  0 ] with jump diagram in Figure 6.4. In this case P^{2n} = I while P^{2n+1} = P and therefore lim_{n→∞} P^n does not exist.

Fig. 6.4. A non-random chain.

On the other hand it is easy to see that the invariant distribution, π, for P is π = [1/2  1/2] and, moreover,

(P + P² + · · · + P^N)/N → (1/2) [ 1  1 ; 1  1 ] = [ π ; π ].
Let us compute

[ E1R1 ; E2R1 ] = ( [ 1  0 ; 0  1 ] − [ 0  1 ; 0  0 ] )^{−1} [ 1 ; 1 ] = [ 2 ; 1 ]

and

[ E1R2 ; E2R2 ] = ( [ 1  0 ; 0  1 ] − [ 0  0 ; 1  0 ] )^{−1} [ 1 ; 1 ] = [ 1 ; 2 ],

so that indeed π1 = 1/E1R1 and π2 = 1/E2R2. Of course R1 = 2 (P1-a.s.) and R2 = 2 (P2-a.s.) so that it is obvious that E1R1 = E2R2 = 2.
Example 6.28. Again let S = {1, 2} and P = [ 1  0 ; 0  1 ] with jump diagram in Figure 6.5. In this case the chain is not irreducible, and every π = [a  b] with a + b = 1 and a, b ≥ 0 is an invariant distribution.

Fig. 6.5. A simple non-irreducible chain.
Example 6.29. Suppose that S = {1, 2, 3} and

P =
     1    2    3
1 [  0    1    0
2   1/2   0   1/2
3    1    0    0 ]

has the jump graph given in Figure 6.6. Notice that P²₁₁ > 0 and P³₁₁ > 0, so that P is “aperiodic.”

Fig. 6.6. A simple 3 state jump diagram.

We now find the invariant distribution:

Nul (P − I)^tr = Nul [ −1  1/2  1 ; 1  −1  0 ; 0  1/2  −1 ] = R · [ 2 ; 2 ; 1 ].

Therefore the invariant distribution is given by

π = (1/5) [2  2  1].
Let us now observe that

P² = [ 1/2  0  1/2 ; 1/2  1/2  0 ; 0  1  0 ],

P³ = [ 1/2  1/2  0 ; 1/4  1/2  1/4 ; 1/2  0  1/2 ],

P^20 = [ 409/1024  205/512  205/1024 ; 205/512  409/1024  205/1024 ; 205/512  205/512  51/256 ]
     = [ 0.39941  0.40039  0.20020 ; 0.40039  0.39941  0.20020 ; 0.40039  0.40039  0.19922 ].
Let us also compute E2R3 via

[ E1R3 ; E2R3 ; E3R3 ] = ( I − [ 0  1  0 ; 1/2  0  0 ; 1  0  0 ] )^{−1} [ 1 ; 1 ; 1 ] = [ 4 ; 3 ; 5 ],

so that 1/E3R3 = 1/5 = π3.
Example 6.30. The transition matrix

P =
     1    2    3
1 [ 1/4  1/2  1/4
2   1/2   0   1/2
3   1/3  1/3  1/3 ]

is represented by the jump diagram in Figure 6.7. This chain is aperiodic.

Fig. 6.7. In the above diagram there are jumps from 1 to 1 with probability 1/4 and jumps from 3 to 3 with probability 1/3 which are not explicitly shown but must be inferred by conservation of probability.
We find the invariant distribution as

Nul (P − I)^tr = Nul [ −3/4  1/2  1/3 ; 1/2  −1  1/3 ; 1/4  1/2  −2/3 ] = R · [ 1 ; 5/6 ; 1 ] = R · [ 6 ; 5 ; 6 ],

π = (1/17) [6  5  6] = [0.35294  0.29412  0.35294].
In this case

P^10 = [ 0.35298  0.29404  0.35298 ; 0.35289  0.29423  0.35289 ; 0.35295  0.29410  0.35295 ].
Let us also compute

[ E1R2 ; E2R2 ; E3R2 ] = ( I − [ 1/4  0  1/4 ; 1/2  0  1/2 ; 1/3  0  1/3 ] )^{−1} [ 1 ; 1 ; 1 ] = [ 11/5 ; 17/5 ; 13/5 ],

so that 1/E2R2 = 5/17 = π2.
Example 6.31. Consider the following Markov matrix,

P =
     1    2    3    4
1 [ 1/4  1/4  1/4  1/4
2   1/4   0    0   3/4
3   1/2  1/2   0    0
4    0   1/4  3/4   0 ]

with jump diagram in Figure 6.8.

Fig. 6.8. The jump diagram for P.

Since this matrix is doubly stochastic (i.e. ∑_{i=1}^4 P_{ij} = 1 for all j as well as ∑_{j=1}^4 P_{ij} = 1 for all i), it is easy to check that π = (1/4) [1  1  1  1].
Let us compute E3R3 as follows:

[ E1R3 ; E2R3 ; E3R3 ; E4R3 ] = ( I − [ 1/4  1/4  0  1/4 ; 1/4  0  0  3/4 ; 1/2  1/2  0  0 ; 0  1/4  0  0 ] )^{−1} [ 1 ; 1 ; 1 ; 1 ] = [ 50/17 ; 52/17 ; 4 ; 30/17 ],

so that E3R3 = 4 = 1/π3 as it should be. Similarly,

[ E1R2 ; E2R2 ; E3R2 ; E4R2 ] = ( I − [ 1/4  0  1/4  1/4 ; 1/4  0  0  3/4 ; 1/2  0  0  0 ; 0  0  3/4  0 ] )^{−1} [ 1 ; 1 ; 1 ; 1 ] = [ 54/17 ; 4 ; 44/17 ; 50/17 ],

and again E2R2 = 4 = 1/π2.
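The doubly stochastic property and the return-time computations of Example 6.31 can be verified in a few lines (a sketch, not part of the notes):

```python
import numpy as np

P = np.array([[1/4, 1/4, 1/4, 1/4],
              [1/4, 0,   0,   3/4],
              [1/2, 1/2, 0,   0],
              [0,   1/4, 3/4, 0]])

assert np.allclose(P.sum(axis=0), 1)   # columns also sum to 1: doubly stochastic
pi = np.full(4, 1/4)
assert np.allclose(pi @ P, pi)         # so the uniform distribution is invariant

# E_i R_3: zero out column 3 (index 2) of P and solve (I - P0) v = 1.
P0 = P.copy()
P0[:, 2] = 0.0
v = np.linalg.solve(np.eye(4) - P0, np.ones(4))
print(v)   # [50/17, 52/17, 4, 30/17]
```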
Example 6.32. Consider the following example,

P =
     1    2    3
1 [ 1/2  1/2   0
2    0   1/2  1/2
3   1/2  1/2   0 ]

with jump diagram given in Figure 6.9.

Fig. 6.9. The jump diagram associated to P.

We have
P² = [ 1/4  1/2  1/4 ; 1/4  1/2  1/4 ; 1/4  1/2  1/4 ]  and  P³ = [ 1/4  1/2  1/4 ; 1/4  1/2  1/4 ; 1/4  1/2  1/4 ].
To have a picture of what is going on here, imagine that π = (π1, π2, π3) represents the amount of sand at the sites 1, 2, and 3 respectively. During each time step we move the sand on the sites around according to the following rule: the sand at site j after one step is ∑_i πi P_{ij}, namely site i contributes the fraction P_{ij} of its sand, πi, to site j. Everyone does this simultaneously to arrive at a new distribution. Hence π is an invariant distribution if each πi remains unchanged, i.e. π = πP. (Keep in mind the sand is still moving around; it is just that the size of the piles remains unchanged.)

As a specific example, suppose π = (1, 0, 0) so that all of the sand starts at 1. After the first step, the pile at 1 is split into two and 1/2 is sent to 2 to get π1 = (1/2, 1/2, 0), which is the first row of P. At the next step the site 1 keeps 1/2 of its sand (= 1/4) and still receives nothing, while site 2 again receives the other 1/2 and keeps half of what it had (= 1/4 + 1/4), and site 3 then gets 1/2 · 1/2 = 1/4, so that π2 = [1/4  1/2  1/4], which is the first row of P². It turns out in this case that this is the invariant distribution. Formally,

[1/4  1/2  1/4] P = [1/4  1/2  1/4].

In general we expect to reach the invariant distribution only in the limit as n → ∞.
Notice that if π is any stationary distribution, then πP^n = π for all n, and in particular

π = πP² = [π1  π2  π3] [ 1/4  1/2  1/4 ; 1/4  1/2  1/4 ; 1/4  1/2  1/4 ] = [1/4  1/2  1/4].

Hence [1/4  1/2  1/4] is the unique stationary distribution for P in this case.
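The “sand” picture above is easy to animate numerically. A minimal sketch (not part of the notes): start with all mass at site 1 and repeatedly push it through P.

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.5, 0.0]])

print(np.linalg.matrix_power(P, 2))   # every row is already [1/4, 1/2, 1/4]

# The "sand" picture: start with all mass at site 1 and push it through P.
pi = np.array([1.0, 0.0, 0.0])
pi = pi @ P   # [1/2, 1/2, 0], the first row of P
pi = pi @ P   # [1/4, 1/2, 1/4], the first row of P^2
print(pi)
assert np.allclose(pi @ P, pi)   # from then on the piles no longer change
```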
Example 6.33 (§3.2, p. 108, Ehrenfest Urn Model). Let a beaker filled with a particle fluid mixture be divided into two parts A and B by a semipermeable membrane. Let Xn = (# of particles in A), which we assume evolves by choosing a particle at random from A ∪ B and then replacing this particle in the opposite bin from which it was found. Suppose there are N total particles in the flask; then the transition probabilities are given by

P_{ij} = P(Xn+1 = j | Xn = i) = { 0 if j ∉ {i − 1, i + 1};  i/N if j = i − 1;  (N − i)/N if j = i + 1 }.

For example, if N = 2 we have

(P_{ij}) =
     0    1    2
0 [  0    1    0
1   1/2   0   1/2
2    0    1    0 ],

and if N = 3, then we have in matrix form,

(P_{ij}) =
     0    1    2    3
0 [  0    1    0    0
1   1/3   0   2/3   0
2    0   2/3   0   1/3
3    0    0    1    0 ].
In the case N = 2,

P² = [ 1/2  0  1/2 ; 0  1  0 ; 1/2  0  1/2 ]  and  P³ = [ 0  1  0 ; 1/2  0  1/2 ; 0  1  0 ] = P,

and when N = 3,
P² = [ 1/3  0  2/3  0 ; 0  7/9  0  2/9 ; 2/9  0  7/9  0 ; 0  2/3  0  1/3 ],

P³ = [ 0  7/9  0  2/9 ; 7/27  0  20/27  0 ; 0  20/27  0  7/27 ; 2/9  0  7/9  0 ],

P^25 ≅ [ 0.0  0.75  0.0  0.25 ; 0.25  0.0  0.75  0.0 ; 0.0  0.75  0.0  0.25 ; 0.25  0.0  0.75  0.0 ],

P^26 ≅ [ 0.25  0.0  0.75  0.0 ; 0.0  0.75  0.0  0.25 ; 0.25  0.0  0.75  0.0 ; 0.0  0.75  0.0  0.25 ],

P^100 ≅ [ 0.25  0.0  0.75  0.0 ; 0.0  0.75  0.0  0.25 ; 0.25  0.0  0.75  0.0 ; 0.0  0.75  0.0  0.25 ].
We also have

(P − I)^tr = [ −1  1/3  0  0 ; 1  −1  2/3  0 ; 0  2/3  −1  1 ; 0  0  1/3  −1 ]

and

Nul((P − I)^tr) = R · [ 1 ; 3 ; 3 ; 1 ].

Hence if we take π = (1/8) [1  3  3  1], then

πP = (1/8) [1  3  3  1] P = (1/8) [1  3  3  1] = π

is the stationary distribution. Notice that

(1/2)(P^25 + P^26) ≅ [ 0.125  0.375  0.375  0.125 ; 0.125  0.375  0.375  0.125 ; 0.125  0.375  0.375  0.125 ; 0.125  0.375  0.375  0.125 ] = [ π ; π ; π ; π ].
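The Ehrenfest computations above can be sketched as follows (not part of the original notes); the chain has period 2, so P^n alone oscillates, but averaging two consecutive powers recovers π:

```python
import numpy as np

# Ehrenfest chain with N = 3 particles.
P = np.array([[0,   1,   0,   0],
              [1/3, 0,   2/3, 0],
              [0,   2/3, 0,   1/3],
              [0,   0,   1,   0]])

pi = np.array([1, 3, 3, 1]) / 8
assert np.allclose(pi @ P, pi)   # binomial(3, 1/2) weights are invariant

# The chain has period 2, so P^n itself does not converge, but the average of
# two consecutive powers does:
P25 = np.linalg.matrix_power(P, 25)
P26 = np.linalg.matrix_power(P, 26)
print((P25 + P26) / 2)   # every row ≈ [0.125, 0.375, 0.375, 0.125]
```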
Example 6.34. Let us consider the Markov matrix

P =
     1    2    3
1 [  0    1    0
2   1/2   0   1/2
3    1    0    0 ].

In this case we have

P^25 ≅ [ 0.3999  0.40015  0.19995 ; 0.40002  0.3999  0.20007 ; 0.40015  0.3999  0.19995 ],

P^26 ≅ [ 0.40002  0.3999  0.20007 ; 0.40002  0.40002  0.19995 ; 0.3999  0.40015  0.19995 ],

P^100 ≅ [ 0.4  0.4  0.2 ; 0.4  0.4  0.2 ; 0.4  0.4  0.2 ],

and we observe that

[0.4  0.4  0.2] P = [0.4  0.4  0.2],

so that π = [0.4  0.4  0.2] is a stationary distribution for P.
6.5.1 Life Time Processes
A computer component has life time T, with P (T = k) = ak for k ∈ N. Let Xn
denote the age of the component in service at time n. The set up is then
[0, T1] ∪ (T1, T1 + T2] ∪ (T1 + T2, T1 + T2 + T3] ∪ . . .
so for example if (T1, T2, T3, . . . ) = (1, 3, 4, . . . ) , then
Page: 51 job: 285notes macro: svmonob.cls date/time: 5-Jun-2015/8:11
52 6 First Step Analysis
X0 = 0, X1 = 0, X2 = 1, X3 = 2, X4 = 0, X5 = 1, X5 = 2, X5 = 3, X6 = 0, . . . .
The transition probabilities are then

P(X_{n+1} = 0 | X_n = k) = P(T = k + 1 | T > k) = a_{k+1} / Σ_{m>k} a_m

and

P(X_{n+1} = k + 1 | X_n = k) = P(T > k + 1 | T > k) = P(T > k + 1) / P(T > k)
  = Σ_{m>k+1} a_m / Σ_{m>k} a_m
  = (Σ_{m>k} a_m − a_{k+1}) / Σ_{m>k} a_m
  = 1 − a_{k+1} / Σ_{m>k} a_m.
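These transition probabilities are easy to tabulate numerically; the sketch below builds the age-chain matrix for a lifetime distribution with finite support (the particular numbers a_1, . . . , a_4 are borrowed from Example 6.35 below, which is an illustrative choice):

```python
import numpy as np

# Lifetime distribution a_k = P(T = k); here the distribution of
# Example 6.35: a_1, ..., a_4 = 0.1, 0.2, 0.3, 0.4.
a = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
K = max(a)  # ages are 0, 1, ..., K - 1

def tail(k):
    """P(T > k) = sum over m > k of a_m."""
    return sum(a.get(m, 0.0) for m in range(k + 1, K + 1))

# Age chain: from age k, go to age 0 with prob a_{k+1}/P(T > k),
# else advance to age k+1 with prob P(T > k+1)/P(T > k).
P = np.zeros((K, K))
for k in range(K):
    P[k, 0] = a.get(k + 1, 0.0) / tail(k)
    if k + 1 < K:
        P[k, k + 1] = tail(k + 1) / tail(k)

assert np.allclose(P.sum(axis=1), 1.0)  # each row is a distribution
print(np.round(P, 3))
```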
See Exercise IV.2.E6 of Karlin and Taylor for a concrete example involving achain of this form.
There is another way to look at this same situation, namely let Y_n denote the remaining life of the part in service at time n. So if (T_1, T_2, T_3, . . .) = (1, 3, 4, . . .), then

Y_0 = 1, Y_1 = 3, Y_2 = 2, Y_3 = 1, Y_4 = 4, Y_5 = 3, Y_6 = 2, Y_7 = 1, . . . ,
and the corresponding transition matrix is determined by

P(Y_{n+1} = k − 1 | Y_n = k) = 1 if k ≥ 2,

while

P(Y_{n+1} = k | Y_n = 1) = P(T = k).
Example 6.35 (Exercise IV.2.E6 revisited). Let Y_n denote the remaining life of the part in service at time n, so if (T_1, T_2, T_3, . . .) = (1, 3, 4, . . .), then

Y_0 = 1, Y_1 = 3, Y_2 = 2, Y_3 = 1, Y_4 = 4, Y_5 = 3, Y_6 = 2, Y_7 = 1, . . . .

If

k          0    1    2    3    4
P(T = k)   0    0.1  0.2  0.3  0.4,
the transition matrix is now given by (rows and columns indexed by the states 1, 2, 3, 4)

    [ 1/10  1/5  3/10  2/5 ]
P = [  1     0    0     0  ]
    [  0     1    0     0  ]
    [  0     0    1     0  ]

whose invariant distribution is given by

π = (1/30)[10 9 7 4] = [1/3  3/10  7/30  2/15].
The failure of a part is indicated by Y_n being 1, and so again the failure frequency is 1/3 of the time, as found before. Observe that the expected life time of a part is

E[T] = 1 · 0.1 + 2 · 0.2 + 3 · 0.3 + 4 · 0.4 = 3.

Thus we see that π_1 = 1/E T, which is what we should have expected. To go a little further, notice that from the jump diagram in Figure 6.10,
[Figure: jump diagram on the states 1, 2, 3, 4; from state 1 the chain jumps to state k with probability a_k, and from each state k ≥ 2 it moves to k − 1 with probability 1.]

Fig. 6.10. The jump diagram for this "renewal" chain.
one sees that

k             1          2          3          4
P_1(R_1 = k)  a_1 = 0.1  a_2 = 0.2  a_3 = 0.3  a_4 = 0.4

and therefore,

E_1 R_1 = 1a_1 + 2a_2 + 3a_3 + 4a_4 = E T,

and hence π_1 = 1/E_1 R_1 = 1/E T in general for this type of chain.
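Both claims, πP = π and π_1 = 1/E T, are quick to verify numerically:

```python
import numpy as np

# Remaining-life chain of Example 6.35 (states 1, 2, 3, 4).
P = np.array([[1/10, 1/5, 3/10, 2/5],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]])

pi = np.array([10, 9, 7, 4]) / 30
assert np.allclose(pi @ P, pi)   # pi is invariant

# Mean lifetime E[T] = 3, and pi_1 = 1/E[T] as claimed.
ET = 1 * 0.1 + 2 * 0.2 + 3 * 0.3 + 4 * 0.4
assert np.isclose(ET, 3.0)
assert np.isclose(pi[0], 1 / ET)  # pi_1 = 1/3
```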
6.5.2 Sampling Plans
This section summarizes the results of Section IV.2.3 of Karlin and Taylor. There one considers a production line where each item manufactured has probability 0 ≤ p ≤ 1 of being defective. Let i and r be two integers and sample the output of the line as follows. We begin by sampling every item until we have found i good items in a row. Then we sample each of the next items independently with probability 1/r, the decision being made at random as each item comes off the line. (If r = 6 we might throw a die each time a product comes off the line and sample that product when we roll a 6, say.) If we find a bad part we start the process over again. We now describe this as a Markov chain with states {E_k}_{k=0}^i, where E_k denotes that we have seen k good parts in a row for 0 ≤ k < i, and E_i indicates we are in stage II where we are randomly choosing to sample an item with probability 1/r. The transition probabilities for this chain are given by
P(X_{n+1} = E_{k+1} | X_n = E_k) = q := 1 − p for k = 0, 1, 2, . . . , i − 1,
P(X_{n+1} = E_0 | X_n = E_k) = p if 0 ≤ k ≤ i − 1,
P(X_{n+1} = E_0 | X_n = E_i) = p/r and P(X_{n+1} = E_i | X_n = E_i) = 1 − p/r,

with all other transition probabilities being zero. The stationary distribution for this chain satisfies the equations

π_k = Σ_{l=0}^{i} P(X_{n+1} = E_k | X_n = E_l) π_l,

so that

π_0 = p Σ_{k=0}^{i−1} π_k + (p/r) π_i,
π_1 = qπ_0,  π_2 = qπ_1,  . . . ,  π_{i−1} = qπ_{i−2},
π_i = qπ_{i−1} + (1 − p/r) π_i.

These equations may be solved (see Section IV.2.3 of Karlin and Taylor) to find in particular that

π_k = p(1 − p)^k / (1 + (r − 1)(1 − p)^i) for 0 ≤ k < i

and

π_i = r(1 − p)^i / (1 + (r − 1)(1 − p)^i).
See Karlin and Taylor for more comments on this solution.
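The displayed formulas can be checked numerically by solving πP = π directly; in the sketch below the parameter values p = 0.1, i = 3, r = 6 are arbitrary illustrative choices:

```python
import numpy as np

p, i, r = 0.1, 3, 6   # arbitrary illustrative parameters
q = 1 - p

# States E_0, ..., E_{i-1}, E_i (so i + 1 states in all).
P = np.zeros((i + 1, i + 1))
for k in range(i):
    P[k, 0] = p          # bad part: start over
    P[k, k + 1] = q      # good part: one more in a row
P[i, 0] = p / r          # sampled (prob 1/r) and found bad
P[i, i] = 1 - p / r      # otherwise stay in stage II
assert np.allclose(P.sum(axis=1), 1.0)

# Solve pi P = pi, sum(pi) = 1, by the eigenvector method.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

# Compare with the closed form quoted from Karlin and Taylor.
Z = 1 + (r - 1) * q**i
closed = np.array([p * q**k / Z for k in range(i)] + [r * q**i / Z])
assert np.allclose(pi, closed)
```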
6.5.3 Extra Homework Problems
Exercises 6.12 – 6.15 refer to the following Markov matrix (rows and columns indexed by the states 1, 2, . . . , 6):

    [  0    1    0    0    0    0  ]
    [ 1/2  1/2   0    0    0    0  ]
P = [  0    0   1/2  1/2   0    0  ]    (6.32)
    [  0    0    1    0    0    0  ]
    [  0   1/2   0    0    0   1/2 ]
    [  0    0    0   1/4  3/4   0  ]

We will let {X_n}_{n=0}^∞ denote the Markov chain associated to P.
Exercise 6.12. Make a jump diagram for this matrix and identify the recurrent and transient classes. Also find the invariant distributions for the chain restricted to each of the recurrent classes.
Exercise 6.13. Find all of the invariant distributions for P.
Exercise 6.14. Compute the hitting probabilities, h_5 = P_5(X_n hits {3, 4}) and h_6 = P_6(X_n hits {3, 4}).
Exercise 6.15. Find limn→∞ P6 (Xn = j) for j = 1, 2, 3, 4, 5, 6.
6.6 *Computations avoiding the first step analysis
You may safely skip the rest of this chapter!!

Theorem 6.36. Let n denote a non-negative integer. If h : B → R is measurable and either bounded or non-negative, then

E_x[h(X_n) : T_B = n] = (Q_A^{n−1} Q[1_B h])(x)

and

E_x[h(X_{T_B}) : T_B < ∞] = (Σ_{n=0}^∞ Q_A^n Q[1_B h])(x).    (6.33)

If g : A → R_+ is a measurable function, then for all x ∈ A and n ∈ N_0,

E_x[g(X_n) 1_{n<T_B}] = (Q_A^n g)(x).

In particular we have

E_x[Σ_{n<T_B} g(X_n)] = Σ_{n=0}^∞ (Q_A^n g)(x) =: u(x),    (6.34)

where by convention, Σ_{n<T_B} g(X_n) = 0 when T_B = 0.

Proof. Let x ∈ A. In computing each of these quantities we will use:

{T_B > n} = {X_i ∈ A for 0 ≤ i ≤ n} and
{T_B = n} = {X_i ∈ A for 0 ≤ i ≤ n − 1} ∩ {X_n ∈ B}.

From the second identity above it follows that for n ≥ 1,
E_x[h(X_n) : T_B = n] = E_x[h(X_n) : (X_1, . . . , X_{n−1}) ∈ A^{n−1}, X_n ∈ B]
  = ∫_{A^{n−1}×B} Π_{j=1}^n Q(x_{j−1}, dx_j) h(x_n)
  = (Q_A^{n−1} Q[1_B h])(x)

and therefore

E_x[h(X_{T_B}) : T_B < ∞] = Σ_{n=1}^∞ E_x[h(X_n) : T_B = n]
  = Σ_{n=1}^∞ Q_A^{n−1} Q[1_B h] = Σ_{n=0}^∞ Q_A^n Q[1_B h].

Similarly,

E_x[g(X_n) 1_{n<T_B}] = ∫_{A^n} Q(x, dx_1) Q(x_1, dx_2) . . . Q(x_{n−1}, dx_n) g(x_n) = (Q_A^n g)(x)

and therefore,

E_x[Σ_{n=0}^∞ g(X_n) 1_{n<T_B}] = Σ_{n=0}^∞ E_x[g(X_n) 1_{n<T_B}] = Σ_{n=0}^∞ (Q_A^n g)(x).
In practice it is not so easy to sum the series in Eqs. (6.33) and (6.34). Thus we would like to have another way to compute these quantities. Since Σ_{n=0}^∞ Q_A^n is a geometric series, we expect that

Σ_{n=0}^∞ Q_A^n = (I − Q_A)^{−1},

which is basically correct, at least when (I − Q_A) is invertible. This suggests that if u(x) = E_x[h(X_{T_B}) : T_B < ∞], then (see Eq. (6.33))

u = Q_A u + Q[1_B h] on A,    (6.35)

and if u(x) = E_x[Σ_{n<T_B} g(X_n)], then (see Eq. (6.34))

u = Q_A u + g on A.    (6.36)

That these equations are valid is the content of Corollary 6.42 below and of Theorem 6.6 above, which we proved using the "first step" analysis. We will give another direct proof in Theorem 6.41 below as well.
Lemma 6.37. Keeping the notation above we have

E_x T = Σ_{n=0}^∞ Σ_{y∈A} Q^n(x, y) for all x ∈ A,    (6.37)

where E_x T = ∞ is possible.

Proof. By definition of T we have, for x ∈ A and n ∈ N_0, that

P_x(T > n) = P_x(X_1, . . . , X_n ∈ A)
  = Σ_{x_1,...,x_n∈A} p(x, x_1) p(x_1, x_2) . . . p(x_{n−1}, x_n)
  = Σ_{y∈A} Q^n(x, y).    (6.38)

Therefore Eq. (6.37) now follows from Lemma 5.18 and Eq. (6.38).
Proposition 6.38. Let us continue the notation above and further assume that A is a finite set and

P_x(T < ∞) = P_x(X_n ∈ B for some n) > 0 ∀ x ∈ A.    (6.39)

Under these assumptions, E_x T < ∞ for all x ∈ A, and in particular P_x(T < ∞) = 1 for all x ∈ A. In this case we may write Eq. (6.37) as

(E_x T)_{x∈A} = (I − Q)^{−1} 1,    (6.40)

where 1(x) = 1 for all x ∈ A.

Proof. Since {T > n} ↓ {T = ∞} and P_x(T = ∞) < 1 for all x ∈ A, it follows that there exists an m ∈ N and 0 ≤ α < 1 such that P_x(T > m) ≤ α for all x ∈ A. Since P_x(T > m) = Σ_{y∈A} Q^m(x, y), it follows that the row sums of Q^m are all at most α < 1. Further observe that

Σ_{y∈A} Q^{2m}(x, y) = Σ_{y,z∈A} Q^m(x, z) Q^m(z, y) = Σ_{z∈A} Q^m(x, z) Σ_{y∈A} Q^m(z, y)
  ≤ Σ_{z∈A} Q^m(x, z) α ≤ α².
Similarly one may show that Σ_{y∈A} Q^{km}(x, y) ≤ α^k for all k ∈ N. Therefore from Eq. (6.38) with n replaced by km, we learn that P_x(T > km) ≤ α^k for all k ∈ N, which then implies that

Σ_{y∈A} Q^n(x, y) = P_x(T > n) ≤ α^{⌊n/m⌋} for all n ∈ N,

where ⌊t⌋ = k ∈ N_0 if k ≤ t < k + 1, i.e. ⌊t⌋ is the greatest integer less than or equal to t. Therefore, we have

E_x T = Σ_{n=0}^∞ Σ_{y∈A} Q^n(x, y) ≤ Σ_{n=0}^∞ α^{⌊n/m⌋} ≤ m · Σ_{l=0}^∞ α^l = m · 1/(1 − α) < ∞.
So it only remains to prove Eq. (6.40). From the above computations we see that Σ_{n=0}^∞ Q^n is convergent. Moreover,

(I − Q) Σ_{n=0}^∞ Q^n = Σ_{n=0}^∞ Q^n − Σ_{n=0}^∞ Q^{n+1} = I,

and therefore (I − Q) is invertible and Σ_{n=0}^∞ Q^n = (I − Q)^{−1}. Finally,

(I − Q)^{−1} 1 = Σ_{n=0}^∞ Q^n 1 = (Σ_{n=0}^∞ Σ_{y∈A} Q^n(x, y))_{x∈A} = (E_x T)_{x∈A},

as claimed.
Remark 6.39. Let {X_n}_{n=0}^∞ denote the fair random walk on {0, 1, 2, . . .} with 0 being an absorbing state. Let T = T_0, i.e. B = {0}, so that A = N is now an infinite set. From Remark 6.20, we learn that E_i T = ∞ for all i > 0. This shows that we can not in general drop the assumption that A (here A = {1, 2, . . .}) is a finite set in the statement of Proposition 6.38.
6.6.1 General facts about sub-probability kernels
Definition 6.40. Suppose (A,A) is a measurable space. A sub-probabilitykernel on (A,A) is a function ρ : A ×A → [0, 1] such that ρ (·, C) is A/BR –measurable for all C ∈ A and ρ (x, ·) : A → [0, 1] is a measure for all x ∈ A.
As with probability kernels we will identify ρ with the linear map ρ : A_b → A_b given by

(ρf)(x) = ρ(x, f) = ∫_A f(y) ρ(x, dy).

Of course we have in mind that A = S_A and ρ = Q_A. In the following lemma let ‖g‖_∞ := sup_{x∈A} |g(x)| for all g ∈ A_b.
Theorem 6.41. Let ρ be a sub-probability kernel on a measurable space (A, A) and define u_n(x) := (ρ^n 1)(x) for all x ∈ A and n ∈ N_0. Then:

1. u_n is a decreasing sequence, so that u := lim_{n→∞} u_n exists and is in A_b. (When ρ = Q_A, u_n(x) = P_x(T_B > n) ↓ u(x) = P_x(T_B = ∞) as n → ∞.)
2. The function u satisfies ρu = u.
3. If w ∈ A_b and ρw = w, then |w| ≤ ‖w‖_∞ u. In particular the equation ρw = w has a non-zero solution w ∈ A_b iff u ≠ 0.
4. If u = 0 and g ∈ A_b, then there is at most one w ∈ A_b such that w = ρw + g.
5. Let

U := Σ_{n=0}^∞ u_n = Σ_{n=0}^∞ ρ^n 1 : A → [0, ∞]    (6.41)

and suppose that U(x) < ∞ for all x ∈ A. Then for each g ∈ S_b,

w = Σ_{n=0}^∞ ρ^n g    (6.42)

is absolutely convergent,

|w| ≤ ‖g‖_∞ U,    (6.43)

ρ(x, |w|) < ∞ for all x ∈ A, and w solves w = ρw + g. Moreover if v also solves v = ρv + g and |v| ≤ CU for some C < ∞, then v = w. Observe that when ρ = Q_A,

U(x) = Σ_{n=0}^∞ P_x(T_B > n) = Σ_{n=0}^∞ E_x(1_{T_B>n}) = E_x(Σ_{n=0}^∞ 1_{T_B>n}) = E_x[T_B].

6. If g : A → [0, ∞] is any measurable function, then

w := Σ_{n=0}^∞ ρ^n g : A → [0, ∞]

is a solution to w = ρw + g. (It may be that w ≡ ∞ though!) Moreover if v : A → [0, ∞] satisfies v = ρv + g, then w ≤ v. Thus w is the minimal non-negative solution to v = ρv + g.
7. If there exists α < 1 such that u ≤ α on A, then u = 0. (When ρ = Q_A, this states that P_x(T_B = ∞) ≤ α for all x ∈ A implies P_x(T_B = ∞) = 0 for all x ∈ A.)
8. If there exists an α < 1 and an n ∈ N such that u_n = ρ^n 1 ≤ α on A, then there exists C < ∞ such that

u_k(x) = (ρ^k 1)(x) ≤ Cβ^k for all x ∈ A and k ∈ N_0,

where β := α^{1/n} < 1. In particular, U ≤ C(1 − β)^{−1} and u = 0 under this assumption. (When ρ = Q_A this assertion states: if P_x(T_B > n) ≤ α for all x ∈ A, then P_x(T_B > k) ≤ Cβ^k for all k ∈ N_0 and E_x T_B ≤ C(1 − β)^{−1}.)
Proof. We will prove each item in turn.

1. First observe that u_1(x) = ρ(x, A) ≤ 1 = u_0(x) and therefore,

u_{n+1} = ρ^{n+1} 1 = ρ^n u_1 ≤ ρ^n 1 = u_n.

We now let u := lim_{n→∞} u_n, so that u : A → [0, 1].
2. Using DCT we may let n → ∞ in the identity ρu_n = u_{n+1} in order to show ρu = u.
3. If w ∈ A_b with ρw = w, then

|w| = |ρ^n w| ≤ ρ^n |w| ≤ ‖w‖_∞ ρ^n 1 = ‖w‖_∞ · u_n.

Letting n → ∞ shows that |w| ≤ ‖w‖_∞ u.
4. If w_i ∈ A_b solves w_i = ρw_i + g for i = 1, 2, then w := w_2 − w_1 satisfies w = ρw and therefore |w| ≤ ‖w‖_∞ u = 0.
5. Let U := Σ_{n=0}^∞ u_n = Σ_{n=0}^∞ ρ^n 1 : A → [0, ∞] and suppose U(x) < ∞ for all x ∈ A. Then u_n(x) → 0 as n → ∞, and so bounded solutions to ρu = u are necessarily zero. Moreover we have, for all k ∈ N_0, that

ρ^k U = Σ_{n=0}^∞ ρ^k u_n = Σ_{n=0}^∞ u_{n+k} = Σ_{n=k}^∞ u_n ≤ U.    (6.44)

Since the tails of convergent series tend to zero, it follows that lim_{k→∞} ρ^k U = 0. Now if g ∈ S_b, we have

Σ_{n=0}^∞ |ρ^n g| ≤ Σ_{n=0}^∞ ρ^n |g| ≤ Σ_{n=0}^∞ ρ^n ‖g‖_∞ = ‖g‖_∞ · U < ∞    (6.45)

and therefore Σ_{n=0}^∞ ρ^n g is absolutely convergent. Making use of Eqs. (6.44) and (6.45) we see that

Σ_{n=1}^∞ ρ|ρ^{n−1} g| ≤ ‖g‖_∞ · ρU ≤ ‖g‖_∞ U < ∞

and therefore (using DCT),

w = Σ_{n=0}^∞ ρ^n g = g + Σ_{n=1}^∞ ρ^n g = g + ρ Σ_{n=1}^∞ ρ^{n−1} g = g + ρw,

i.e. w solves w = g + ρw. If v : A → R is measurable such that |v| ≤ CU and v = g + ρv, then y := w − v solves y = ρy with |y| ≤ (C + ‖g‖_∞)U. It follows that

|y| = |ρ^n y| ≤ (C + ‖g‖_∞) ρ^n U → 0 as n → ∞,

i.e. 0 = y = w − v.
6. If g ≥ 0 we may always define w by Eq. (6.42), allowing for w(x) = ∞ for some or even all x ∈ A. As in the proof of the previous item (with DCT replaced by MCT), it follows that w = ρw + g. If v ≥ 0 also solves v = g + ρv, then

v = g + ρ(g + ρv) = g + ρg + ρ²v,

and more generally by induction we have

v = Σ_{k=0}^n ρ^k g + ρ^{n+1} v ≥ Σ_{k=0}^n ρ^k g.

Letting n → ∞ in this last equation shows that v ≥ w.
7. If u ≤ α < 1 on A, then by item 3. with w = u we find that

u ≤ ‖u‖_∞ · u ≤ αu,

which clearly implies u = 0.
8. If u_n ≤ α < 1, then for any m ∈ N we have,

u_{n+m} = ρ^m u_n ≤ αρ^m 1 = αu_m.

Taking m = kn in this inequality shows u_{(k+1)n} ≤ αu_{kn}. Thus a simple induction argument shows u_{kn} ≤ α^k for all k ∈ N_0. For general l ∈ N_0 we write l = kn + r with 0 ≤ r < n. We then have,

u_l = u_{kn+r} ≤ u_{kn} ≤ α^k = α^{(l−r)/n} ≤ Cα^{l/n},

where C = α^{−(n−1)/n}.
Corollary 6.42. If h : B → [0, ∞] is measurable, then u(x) := E_x[h(X_{T_B}) : T_B < ∞] is the unique minimal non-negative solution to Eq. (6.35), while if g : A → [0, ∞] is measurable, then u(x) = E_x[Σ_{n<T_B} g(X_n)] is the unique minimal non-negative solution to Eq. (6.36).
Exercise 6.16. Keeping the notation of Exercises 6.9 and 6.11, use Corollary 6.42 to show again that P_x(T_B < ∞) = (q/p)^x for all x > 0 and E_x T_0 = x/(q − p) for x < 0. You should do so without making use of the extraneous hitting times T_n for n ≠ 0.
Corollary 6.43. If Px (TB =∞) = 0 for all x ∈ A and h : B → R is a boundedmeasurable function, then u (x) := Ex [h (XTB )] is the unique solution to Eq.(6.35).
Corollary 6.44. Suppose now that A = Bc is a finite subset of S such thatPx (TB =∞) < 1 for all x ∈ A. Then there exists C < ∞ and β ∈ (0, 1) suchthat Px (TB > n) ≤ Cβn and in particular ExTB <∞ for all x ∈ A.
Proof. Let α_0 = max_{x∈A} P_x(T_B = ∞) < 1. We know that

lim_{n→∞} P_x(T_B > n) = P_x(T_B = ∞) ≤ α_0 for all x ∈ A.

Therefore if α ∈ (α_0, 1), using the fact that A is a finite set, there exists an n sufficiently large such that P_x(T_B > n) ≤ α for all x ∈ A. The result now follows from item 8. of Theorem 6.41.
7
Long Run Behavior of Discrete Markov Chains
(This chapter needs more editing and in particular should include restatements and proofs of theorems already covered.) In this chapter, X_n will be a Markov chain with a finite or countable state space, S. To each state i ∈ S, let

T_i := min{n ≥ 0 : X_n = i} and R_i := min{n ≥ 1 : X_n = i}    (7.1)

be the first hitting time and the first passage (return) time of the chain to site i respectively, and let

M_i := Σ_{n=0}^∞ 1_{X_n=i} = Σ_{n=0}^∞ 1_i(X_n)    (7.2)

be the number of visits (returns) of {X_n}_{n≥0} to site i.
Definition 7.1. A state j is accessible from i (written i → j) iff P_i(T_j < ∞) > 0, and i ←→ j (i communicates with j) iff i → j and j → i. Notice that i ←→ i for all i ∈ S, and i → j iff there is a path i = x_0, x_1, . . . , x_n = j in S such that p(x_0, x_1) p(x_1, x_2) . . . p(x_{n−1}, x_n) > 0.

Definition 7.2. For each i ∈ S, let C_i := {j ∈ S : i ←→ j} be the communicating class of i. The state space, S, is partitioned into a disjoint union of its communicating classes.
Definition 7.3. A communicating class C ⊂ S is closed provided the probabil-ity that Xn leaves C given that it started in C is zero. In other words Pij = 0for all i ∈ C and j /∈ C. (Notice that if C is closed, then Xn restricted to C isa Markov chain.)
Definition 7.4. A state i ∈ S is:

1. transient if P_i(R_i < ∞) < 1 (⟺ P_i(R_i = ∞) > 0),
2. recurrent if P_i(R_i < ∞) = 1 (⟺ P_i(R_i = ∞) = 0),
   a) positive recurrent if 1/(E_i R_i) > 0, i.e. E_i R_i < ∞,
   b) null recurrent if it is recurrent (P_i(R_i < ∞) = 1) and 1/(E_i R_i) = 0, i.e. E_i R_i = ∞.

We let S_t, S_r, S_pr, and S_nr denote the transient, recurrent, positive recurrent, and null recurrent states respectively.
The next two sections give the main results of this chapter along with someillustrative examples. The remaining sections are devoted to some of the moretechnical aspects of the proofs. In particular see Examples 9.2, 9.3, and 9.4 forexamples of transient, null-recurrent, and positively recurrent states.
7.1 The Main Results
See Kallenberg [9, pages 148-158] for more information along the lines of whatis to follow.
Proposition 7.5 (Class properties). The notions of being recurrent, positiverecurrent, null recurrent, or transient are all class properties. Namely if C ⊂ Sis a communicating class then either all i ∈ C are recurrent, positive recurrent,null recurrent, or transient. Hence it makes sense to refer to C as being eitherrecurrent, positive recurrent, null recurrent, or transient.
Proof. See Proposition 7.14 for the assertion that being recurrent or tran-sient is a class property. For the fact that positive and null recurrence is a classproperty, see Proposition 8.5 below.
Lemma 7.6. Let C ⊂ S be a communicating class. Then
C not closed =⇒ C is transient
or equivalently put,
C is recurrent =⇒ C is closed.
Proof. If C is not closed and i ∈ C, there is a j ∉ C such that i → j, i.e. there is a path i = x_0, x_1, . . . , x_n = j, with all of the {x_j}_{j=0}^n being distinct, such that

P_i(X_0 = i, X_1 = x_1, . . . , X_{n−1} = x_{n−1}, X_n = x_n = j) > 0.

Since j ∉ C we must have j ↛ i, and in fact j cannot reach any state of C (for if it could, it could then reach i). Therefore, on the event

A := {X_0 = i, X_1 = x_1, . . . , X_{n−1} = x_{n−1}, X_n = x_n = j},

we have X_m ∉ C for all m ≥ n, and therefore R_i = ∞ on the event A, which has positive probability.
Proposition 7.7. Suppose that C ⊂ S is a finite communicating class and let T = inf{n ≥ 0 : X_n ∉ C} be the first exit time from C. If C is not closed, then not only is C transient but E_i T < ∞ for all i ∈ C. We also have the equivalence of the following statements:

1. C is closed.
2. C is positive recurrent.
3. C is recurrent.

In particular if #(S) < ∞, then the recurrent (= positively recurrent) states are precisely the union of the closed communication classes and the transient states are what is left over. [Warning: when #(S) = ∞ or, more importantly, #(C) = ∞, life is not so simple.]

Proof. These results follow fairly easily from Corollary 5.21 and the fact that

T = Σ_{i∈C} M_i.

Remark 7.8. Let {X_n}_{n=0}^∞ denote the fair random walk on {0, 1, 2, . . .} with 0 being an absorbing state. The communication classes are {0} and {1, 2, . . .}, with the latter class not being closed and hence transient. If T = T_0 is the first exit time from {1, 2, . . .}, then using Remark 6.20 it follows that E_i T = ∞ for all i > 0, which shows we can not drop the assumption that #(C) < ∞ in the first statement of Proposition 7.7. Similarly, using the fair random walk example, we see that it is not possible to drop the condition that #(C) < ∞ for the equivalence statements as well.
Example 7.9. Let P be the Markov matrix with jump diagram given in Figure 7.1. In this case the communication classes are {1, 2}, {3, 4}, and {5}. The latter two are closed and hence positively recurrent, while {1, 2} is transient.
Example 7.10. Let {X_n}_{n=0}^∞ denote the fair random walk on S = Z; then this chain is irreducible. On the other hand, if {X_n}_{n=0}^∞ is the fair random walk on {0, 1, 2, . . .} with 0 being an absorbing state, then the communication classes are {0} (closed) and {1, 2, . . .} (not closed).
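The communication classes of a finite chain can be found mechanically: i → j iff j is reachable in the directed graph with an edge wherever p(i, j) > 0, and the classes are the mutually-reachable groups of states. A sketch (the 4-state matrix is made up purely for illustration):

```python
import numpy as np

def comm_classes(P):
    """Communicating classes of a Markov matrix: i <-> j iff each is
    reachable from the other in the directed graph of positive entries."""
    n = len(P)
    reach = (P > 0) | np.eye(n, dtype=bool)
    # Transitive closure (Floyd-Warshall style).
    for k in range(n):
        for a in range(n):
            if reach[a, k]:
                reach[a] |= reach[k]
    classes = []
    for i in range(n):
        cls = frozenset(j for j in range(n) if reach[i, j] and reach[j, i])
        if cls not in classes:
            classes.append(cls)
    return classes

def is_closed(P, cls):
    """A class is closed iff no mass leaves it."""
    outside = [j for j in range(len(P)) if j not in cls]
    return all(P[i, j] == 0 for i in cls for j in outside)

# A made-up 4-state example: {0,1} is a closed class; {2} and {3} lead into it.
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.4, 0.6, 0.0, 0.0],
              [0.3, 0.0, 0.7, 0.0],
              [0.0, 0.2, 0.0, 0.8]])
for cls in comm_classes(P):
    print(sorted(cls), "closed" if is_closed(P, cls) else "not closed")
```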
Warning: if C ⊂ S is closed and # (C) = ∞, C could be recurrent or itcould be transient. Transient in this case means the walk goes off to “infinity.”The following proposition is a consequence of the strong Markov property inTheorem 5.16.
Proposition 7.11. If j ∈ S, k ∈ N, and ν : S → [0, 1] is any probability on S, then

P_ν(M_j ≥ k) = P_ν(T_j < ∞) · P_j(R_j < ∞)^{k−1}.    (7.3)
[Figure: jump diagram on the states 1, . . . , 5; the edge probabilities (1/2, 1/3, etc.) are not reproduced here.]

Fig. 7.1. A 5 state Markov chain with 3 communicating classes.
Proof. Intuitively, M_j ≥ k happens iff the chain first visits j, with probability P_ν(T_j < ∞), and then revisits j another k − 1 times, with the probability of each revisit being P_j(R_j < ∞). Since Markov chains are forgetful, these probabilities are all independent and hence we arrive at Eq. (7.3). See Proposition 8.2 below for the formal proof based on the strong Markov property in Theorem 5.16.

In more detail, fix j ∈ S and define stopping times {τ_n}_{n=1}^∞ such that τ_1 = T_j and

τ_{n+1} = inf{k > τ_n : X_k = j},

so that τ_n is the n-th time the chain {X_n}_{n=0}^∞ has visited site j. In particular it now follows that

P(M_j ≥ n) = P(τ_n < ∞)

and in particular

P(M_j ≥ 1) = P(τ_1 < ∞) = P(T_j < ∞).

Now suppose that n ≥ 1 is given. Since

τ_{n+1} − τ_n = R_j(X_{τ_n}, X_{τ_n+1}, . . .) on {τ_n < ∞},

we may apply the strong Markov property to conclude,

P(M_j ≥ n + 1) = P(τ_{n+1} < ∞) = P(τ_n < ∞, τ_{n+1} − τ_n < ∞)
  = E[E_{B_{τ_n}}[1_{R_j(X_{τ_n}, X_{τ_n+1}, . . .)<∞}] : τ_n < ∞]
  = E[E_{X_{τ_n}}[1_{R_j(X_0, X_1, . . .)<∞}] : τ_n < ∞]
  = E[E_j[1_{R_j(X_0, X_1, . . .)<∞}] : τ_n < ∞]
  = P_j(R_j < ∞) · P(τ_n < ∞) = P_j(R_j < ∞) · P(M_j ≥ n).
Equation (7.3) now follows by induction on n.
Corollary 7.12. If j ∈ S and ν : S → [0, 1] is any probability on S, then¹

P_ν(M_j = ∞) = P_ν(X_n = j i.o.) = P_ν(T_j < ∞) 1_{j∈S_r},    (7.4)
P_j(M_j = ∞) = P_j(X_n = j i.o.) = 1_{j∈S_r},    (7.5)

E_ν M_j = Σ_{n=0}^∞ Σ_{i∈S} ν(i) P^n_{ij} = P_ν(T_j < ∞) / (1 − P_j(R_j < ∞)),    (7.6)

and

E_i M_j = Σ_{n=0}^∞ P^n_{ij} = P_i(T_j < ∞) / (1 − P_j(R_j < ∞)),    (7.7)

where the following conventions are used in interpreting the right hand side of Eqs. (7.6) and (7.7): a/0 := ∞ if a > 0 while 0/0 := 0.

Proof. Since

{M_j ≥ k} ↓ {M_j = ∞} = {X_n = j i.o. n} as k ↑ ∞,

it follows, using Eq. (7.3), that

P_ν(X_n = j i.o. n) = lim_{k→∞} P_ν(M_j ≥ k) = P_ν(T_j < ∞) · lim_{k→∞} P_j(R_j < ∞)^{k−1},    (7.8)

which gives Eq. (7.4). Equation (7.5) follows by taking ν = δ_j in Eq. (7.4) and recalling that j ∈ S_r iff P_j(R_j < ∞) = 1. Similarly Eq. (7.7) is a special case of Eq. (7.6) with ν = δ_i. We now prove Eq. (7.6).
Using the definition of M_j in Eq. (7.2),

E_ν M_j = E_ν Σ_{n=0}^∞ 1_{X_n=j} = Σ_{n=0}^∞ E_ν 1_{X_n=j} = Σ_{n=0}^∞ P_ν(X_n = j) = Σ_{n=0}^∞ Σ_{i∈S} ν(i) P^n_{ij},

which is the first equality in Eq. (7.6). For the second, observe that

Σ_{k=1}^∞ P_ν(M_j ≥ k) = Σ_{k=1}^∞ E_ν 1_{M_j≥k} = E_ν Σ_{k=1}^∞ 1_{k≤M_j} = E_ν M_j.

On the other hand, using Eq. (7.3) we have

Σ_{k=1}^∞ P_ν(M_j ≥ k) = Σ_{k=1}^∞ P_ν(T_j < ∞) P_j(R_j < ∞)^{k−1} = P_ν(T_j < ∞) / (1 − P_j(R_j < ∞)),

provided a/0 := ∞ if a > 0 while 0/0 := 0.

It is worth remarking that if j ∈ S_t, then Eq. (7.6) asserts that

E_ν M_j = (the expected number of visits to j) < ∞,

which then implies that M_j is a finite valued random variable almost surely. Hence, for almost all sample paths, X_n can visit j at most a finite number of times.

¹ See Definition 2.4 to review the meaning of {X_n = j i.o.}. In this case, {X_n = j i.o.} = {M_j = ∞}.
Theorem 7.13 (Recurrent States). Let j ∈ S. Then the following are equivalent:

1. j is recurrent, i.e. P_j(R_j < ∞) = 1,
2. P_j(X_n = j i.o. n) = 1,
3. E_j M_j = Σ_{n=0}^∞ P^n_{jj} = ∞.

Proof. The equivalence of the first two items follows directly from Eq. (7.5), and the equivalence of items 1. and 3. follows directly from Eq. (7.7) with i = j.
Proposition 7.14. If i ←→ j, then i is recurrent iff j is recurrent, i.e. theproperty of being recurrent or transient is a class property.
Proof. For any α, β, n ∈ N_0 we have,

P^{n+α+β}_{ii} = Σ_{k,l∈S} P^α_{ik} P^n_{kl} P^β_{li} ≥ P^α_{ij} P^n_{jj} P^β_{ji},

and therefore,

Σ_{n=0}^∞ P^{n+α+β}_{ii} ≥ P^α_{ij} P^β_{ji} Σ_{n=0}^∞ P^n_{jj}.

Since i ←→ j we may choose α, β ∈ N so that c := P^α_{ij} P^β_{ji} > 0. Hence we learn that

Σ_{n=0}^∞ P^n_{ii} ≥ Σ_{n=0}^∞ P^{n+α+β}_{ii} ≥ c · Σ_{n=0}^∞ P^n_{jj},

which shows Σ_{n=0}^∞ P^n_{jj} = ∞ ⟹ Σ_{n=0}^∞ P^n_{ii} = ∞. By interchanging the roles of i and j we may similarly prove that Σ_{n=0}^∞ P^n_{ii} = ∞ ⟹ Σ_{n=0}^∞ P^n_{jj} = ∞. Thus using item 3. of Theorem 7.13, it follows that i is recurrent iff j is recurrent.
Corollary 7.15. If C ⊂ S_r is a recurrent communication class, then

P_i(R_j < ∞) = 1 = P_i(M_j = ∞) for all i, j ∈ C    (7.9)

and in fact

P_i(∩_{j∈C} {X_n = j i.o. n}) = P_i(∩_{j∈C} {M_j = ∞}) = 1 for all i ∈ C.    (7.10)

More generally, if ν : S → [0, 1] is a probability such that ν(i) = 0 for i ∉ C, then

P_ν(∩_{j∈C} {X_n = j i.o. n}) = 1.    (7.11)

In words, if we start in C then every state in C is visited an infinite number of times. (Notice that P_i(R_j < ∞) = P_i({X_n}_{n≥1} hits j).)
Proof. Let i, j ∈ C ⊂ S_r and choose m ∈ N such that P_j(X_m = i) > 0. Then using P_j(M_j = ∞) = 1 we learn,

P_j(X_m = i) = P_j(M_j = ∞, X_m = i) = P_j(M_j = ∞ | X_m = i) P_j(X_m = i)
  = P_i(M_j = ∞) P_j(X_m = i),

from which it follows that

1 = P_i(M_j = ∞) ≤ P_i(R_j < ∞), which proves Eq. (7.9).

Here we use

{M_j = ∞} = {Σ_{k=m}^∞ 1_j(X_k) = ∞}

along with the Markov property (see Theorem 5.14) to assert that

P_j(M_j = ∞ | X_m = i) = P_j(Σ_{k=m}^∞ 1_j(X_k) = ∞ | X_m = i)
  = P_i(Σ_{k=0}^∞ 1_j(X_k) = ∞) = P_i(M_j = ∞).

Equation (7.10) is a consequence of Eq. (7.9) and the fact that the countable intersection of probability one events is again a probability one event. Equation (7.11) follows by multiplying Eq. (7.10) by ν(i) and then summing on i ∈ C.
Theorem 7.16 (Transient States). Let j ∈ S. Then the following are equivalent:

1. j is transient, i.e. P_j(R_j < ∞) < 1,
2. P_j(X_n = j i.o. n) = 0, and
3. E_j M_j = Σ_{n=0}^∞ P^n_{jj} < ∞.

Moreover, if i ∈ S and j ∈ S_t, then

Σ_{n=0}^∞ P^n_{ij} = E_i M_j < ∞    (7.12)

⟹ P_i(X_n = j i.o. n) = 0 and lim_{n→∞} P_i(X_n = j) = lim_{n→∞} P^n_{ij} = 0,    (7.13)

and more generally if ν : S → [0, 1] is any probability, then

Σ_{n=0}^∞ P_ν(X_n = j) = E_ν M_j < ∞    (7.14)

⟹ P_ν(X_n = j i.o. n) = 0 and lim_{n→∞} P_ν(X_n = j) = lim_{n→∞} [νP^n]_j = 0.    (7.15)
Proof. The equivalence of the first two items follows directly from Eq. (7.5), and the equivalence of items 1. and 3. follows directly from Eq. (7.7) with i = j. The fact that E_i M_j < ∞ and E_ν M_j < ∞ in Eqs. (7.12) and (7.14) for all j ∈ S_t are consequences of Eqs. (7.7) and (7.6) respectively. The remaining implications in Eqs. (7.13) and (the more general) Eq. (7.15) are consequences of the facts: 1) the n-th term in a convergent series tends to zero as n → ∞, 2) {X_n = j i.o. n} = {M_j = ∞}, and 3) E_ν M_j < ∞ implies P_ν(M_j = ∞) = 0.
Corollary 7.17. 1) If the state space, S, is a finite set, then S_r ≠ ∅. 2) Any finite and closed communicating class C ⊂ S is recurrent.

Proof. First suppose that #(S) < ∞ and, for the sake of contradiction, suppose S_r = ∅, or equivalently that S = S_t. Then by Theorem 7.16, lim_{n→∞} P^n_{ij} = 0 for all i, j ∈ S. On the other hand, Σ_{j∈S} P^n_{ij} = 1, so that

1 = lim_{n→∞} Σ_{j∈S} P^n_{ij} = Σ_{j∈S} lim_{n→∞} P^n_{ij} = Σ_{j∈S} 0 = 0,

which is a contradiction. (Notice that if S were infinite, we could not (in general) interchange the limit and the above sum without some extra conditions.)

To prove the second statement, restrict X_n to C to get a Markov chain on the finite state space C. By what we have just proved, there is a recurrent state i ∈ C. Since recurrence is a class property, it follows that all states in C are recurrent.
Definition 7.18. A function π : S → [0, 1] is a sub-probability if Σ_{j∈S} π(j) ≤ 1. We call Σ_{j∈S} π(j) the mass of π. So a probability is a sub-probability with mass one.

Definition 7.19. We say a sub-probability, π : S → [0, 1], is invariant if πP = π, i.e.

Σ_{i∈S} π(i) P_{ij} = π(j) for all j ∈ S.    (7.16)

An invariant probability, π : S → [0, 1], is called an invariant distribution.
Example 7.20. If #(S) < ∞ and p : S × S → [0, 1] is a Markov transition matrix with column sums adding up to 1 (i.e. p is doubly stochastic), then π(i) := 1/#(S) is an invariant distribution for p. In particular, if p is a symmetric Markov transition matrix (p(i, j) = p(j, i) for all i, j ∈ S), then the uniform distribution π is an invariant distribution for p.
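A quick numerical confirmation (the doubly stochastic matrix below is a made-up example):

```python
import numpy as np

# A doubly stochastic (here symmetric) Markov matrix; the uniform
# distribution is invariant, as in Example 7.20.
P = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
assert np.allclose(P.sum(axis=0), 1) and np.allclose(P.sum(axis=1), 1)

pi = np.full(3, 1/3)
assert np.allclose(pi @ P, pi)  # uniform distribution is invariant
```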
Example 7.21. Let S be a finite set of nodes and G an undirected graph on S, i.e. G is a subset of S × S such that

1. (x, x) ∉ G for all x ∈ S,
2. if (x, y) ∈ G, then (y, x) ∈ G [the graph is undirected], and
3. for all x ∈ S, the set S_x := {y ∈ S : (x, y) ∈ G} is not empty. [We are not allowing for any isolated nodes in our graph.]

Let

ν(x) := #(S_x) = Σ_{y∈S} 1_{(x,y)∈G}

be the valence of G at x. The random walk on this graph is then the Markov chain on S with Markov transition matrix

p(x, y) := (1/ν(x)) 1_{S_x}(y) = (1/ν(x)) 1_{(x,y)∈G}.

Notice that

Σ_{x∈S} ν(x) p(x, y) = Σ_{x∈S} 1_{(x,y)∈G} = Σ_{x∈S} 1_{(y,x)∈G} = ν(y).

Thus if we let Z := Σ_{x∈S} ν(x) and π(x) := ν(x)/Z, then π is an invariant distribution for p.
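The identity Σ_x ν(x) p(x, y) = ν(y) is easy to confirm on a concrete graph (the 4-cycle with a chord below is an arbitrary choice):

```python
import numpy as np

# Random walk on a small undirected graph (a 4-cycle with one chord,
# chosen here just for illustration).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n = 4
A = np.zeros((n, n))
for x, y in edges:
    A[x, y] = A[y, x] = 1

nu = A.sum(axis=1)              # valence nu(x) = #(S_x)
P = A / nu[:, None]             # p(x, y) = 1_{(x,y) in G} / nu(x)

pi = nu / nu.sum()              # pi(x) = nu(x)/Z with Z = sum of valences
assert np.allclose(pi @ P, pi)  # pi is invariant, as computed above
print(pi)  # valences (3, 2, 3, 2)/10 -> [0.3 0.2 0.3 0.2]
```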
Theorem 7.22. Suppose that P = (p_{ij}) is an irreducible Markov kernel and

π_j := 1/(E_j R_j) for all j ∈ S.    (7.17)

Then:

1. For all i, j ∈ S, we have

lim_{N→∞} (1/N) Σ_{n=0}^N 1_{X_n=j} = π_j, P_i – a.s.,    (7.18)

and

lim_{N→∞} (1/N) Σ_{n=1}^N P_i(X_n = j) = lim_{N→∞} (1/N) Σ_{n=0}^N P^n_{ij} = π_j.    (7.19)

2. If µ : S → [0, 1] is an invariant sub-probability, then either µ(i) > 0 for all i or µ(i) = 0 for all i.
3. P has at most one invariant distribution.
4. P has a (necessarily unique) invariant distribution, µ : S → [0, 1], iff P is positive recurrent, in which case µ(i) = π(i) = 1/(E_i R_i) > 0 for all i ∈ S.

(These results may of course be applied to the restriction of a general non-irreducible Markov chain to any one of its communication classes.)

Proof. These results are the contents of Theorem 8.4 and Propositions 8.5 and 8.6 below.

Using this result we can give another proof of Proposition 7.7.
Using this result we can give another proof of Proposition 7.7.
Corollary 7.23. If C is a closed finite communicating class then C is positive recurrent. (Recall that we already know that C is recurrent by Corollary 7.17.)

Proof. For i, j ∈ C, let

π_j := lim_{N→∞} (1/N) Σ_{n=1}^N P_i(X_n = j) = 1/(E_j R_j)

as in Theorem 7.25. Since C is closed,

Σ_{j∈C} P_i(X_n = j) = 1,

and therefore,

Σ_{j∈C} π_j = lim_{N→∞} (1/N) Σ_{j∈C} Σ_{n=1}^N P_i(X_n = j) = lim_{N→∞} (1/N) Σ_{n=1}^N Σ_{j∈C} P_i(X_n = j) = 1.

Therefore π_j > 0 for some j ∈ C, and hence for all j ∈ C by Theorem 7.22 with S replaced by C. Hence we have E_j R_j < ∞, i.e. every j ∈ C is a positive recurrent state.
Remark 7.24 (The MCMC Seed). Suppose that S is a finite set, P = (p_{ij}) is an irreducible Markov kernel, ν is a probability on S, and f : S → C is a function. Then

E_π f := Σ_{i∈S} π(i) f(i) = lim_{N→∞} (1/N) Σ_{n=1}^N f(X_n), P_ν – a.s.,

where π is the invariant distribution for P. Indeed,

lim_{N→∞} (1/N) Σ_{n=1}^N f(X_n) = lim_{N→∞} (1/N) Σ_{n=1}^N Σ_{i∈S} f(i) 1_{X_n=i}
  = Σ_{i∈S} f(i) lim_{N→∞} (1/N) Σ_{n=1}^N 1_{X_n=i}
  = Σ_{i∈S} f(i) π(i), P_ν – a.s.
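This remark is the seed of the Markov chain Monte Carlo method: to estimate E_π f one runs a single long trajectory and averages f along it. A simulation sketch, using the chain of Example 6.34 (the function f, the run length, and the tolerance are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# The 3-state chain of Example 6.34, with invariant pi = [0.4, 0.4, 0.2].
P = np.array([[0, 1, 0],
              [1/2, 0, 1/2],
              [1, 0, 0]])
pi = np.array([0.4, 0.4, 0.2])
f = np.array([1.0, -2.0, 5.0])       # an arbitrary function on S

# Run one long trajectory and form the time average of f(X_n).
N = 200_000
x = 0
total = 0.0
for _ in range(N):
    x = rng.choice(3, p=P[x])
    total += f[x]

time_avg = total / N
assert abs(time_avg - pi @ f) < 0.05   # E_pi f = 0.4*1 - 0.8 + 1.0 = 0.6
print(time_avg)
```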
Theorem 7.25 (General Convergence Theorem). Let ν : S → [0, 1] be any probability, j ∈ S, and let C be the communicating class² containing j. Set

{X_n hits C} := {X_n ∈ C for some n}

and

π_j := π_j(ν) = P_ν(X_n hits C) / (E_j R_j),    (7.20)

where 1/∞ := 0. Then:

1. P_ν – a.s.,

lim_{N→∞} (1/N) Σ_{n=1}^N 1_{X_n=j} = (1/(E_j R_j)) · 1_{X_n hits C},    (7.21)

2.

lim_{N→∞} (1/N) Σ_{n=1}^N Σ_{i∈S} ν(i) P^n_{ij} = lim_{N→∞} (1/N) Σ_{n=1}^N P_ν(X_n = j) = π_j,    (7.22)

3. π is an invariant sub-probability for P, and
4. the mass of π, π(S) := Σ_{j∈S} π_j, is

π(S) = Σ_{C: pos. recurrent} P_ν(X_n hits C) ≤ 1.    (7.23)

² No assumptions on the type of this class are needed here.
Proof. If j ∈ S is a transient site, then according to Eq. (7.15), P_ν(M_j < ∞) = 1 and therefore lim_{N→∞} (1/N) Σ_{n=1}^N 1_{X_n=j} = 0, which agrees with Eq. (7.21) for j ∈ S_t.

So now suppose that j ∈ S_r. Let C be the communication class containing j and

T = T_C := inf{n ≥ 0 : X_n ∈ C}

be the first time when X_n enters C. It is clear that {R_j < ∞} ⊂ {T < ∞}. On the other hand, for any k ∈ C, it follows by the strong Markov property of Theorem 5.16 and Corollary 7.15 that, conditioned on {T < ∞, X_T = k}, X_n hits j i.o. and hence P(R_j < ∞ | T < ∞, X_T = k) = 1. Equivalently put,

P(R_j < ∞, T < ∞, X_T = k) = P(T < ∞, X_T = k) for all k ∈ C.

Summing this last equation on k ∈ C then shows

P(R_j < ∞) = P(R_j < ∞, T < ∞) = P(T < ∞),

and therefore {R_j < ∞} = {T < ∞} modulo an event with P_ν – probability zero.

Another application of the strong Markov property of Theorem 5.16, observing that X_{R_j} = j on {R_j < ∞}, allows us to conclude that the P_ν(·|R_j < ∞) = P_ν(·|T < ∞) – law of (X_{R_j}, X_{R_j+1}, X_{R_j+2}, . . .) is the same as the P_j – law of (X_0, X_1, X_2, . . .). Therefore, we may apply Theorem 7.22 to conclude that

lim_{N→∞} (1/N) Σ_{n=1}^N 1_{X_n=j} = lim_{N→∞} (1/N) Σ_{n=1}^N 1_{X_{R_j+n}=j} = 1/(E_j R_j), P_ν(·|R_j < ∞) – a.s.

On the other hand, on the event {R_j = ∞} we have lim_{N→∞} (1/N) Σ_{n=1}^N 1_{X_n=j} = 0. Thus we have shown, P_ν – a.s., that

lim_{N→∞} (1/N) Σ_{n=1}^N 1_{X_n=j} = (1/(E_j R_j)) 1_{R_j<∞} = (1/(E_j R_j)) 1_{T<∞} = (1/(E_j R_j)) 1_{X_n hits C},

which is Eq. (7.21). Taking expectations of this equation, using the dominated convergence theorem, gives Eq. (7.22).
Since EiRi =∞ unless i is a positive recurrent site, it follows that∑i∈S
πiPij =∑i∈Spr
πiPij =∑
C: pos-rec.
Pν (Xn hits C)∑i∈C
1
EiRiPij . (7.24)
As each positive recurrent class, C, is closed; if i ∈ C and j /∈ C, thenPij = 0. Therefore
∑i∈C
1EiRiPij is zero unless j ∈ C. So if j /∈ Spr we have∑
i∈S πiPij = 0 = πj and if j ∈ Spr, then by Theorem 7.22,
Page: 64 job: 285notes macro: svmonob.cls date/time: 5-Jun-2015/8:11
Σ_{i∈C} (1/(E_i R_i)) P_{ij} = 1_{j∈C} · (1/(E_j R_j)).

Using this result in Eq. (7.24) shows that

Σ_{i∈S} π_i P_{ij} = Σ_{C: pos-rec.} P_ν(X_n hits C) · 1_{j∈C} · (1/(E_j R_j)) = π_j
so that π is an invariant distribution. Similarly, using Theorem 7.22 again,
Σ_{i∈S} π_i = Σ_{C: pos-rec.} P_ν(X_n hits C) · (Σ_{i∈C} 1/(E_i R_i)) = Σ_{C: pos-rec.} P_ν(X_n hits C).

The point is that for any communicating class C of S we have

Σ_{i∈C} 1/(E_i R_i) = 1 if C is positive recurrent and 0 otherwise.
7.2 Aperiodic chains
Definition 7.26. A state i ∈ S is aperiodic if P^n_{ii} > 0 for all n sufficiently large.
Lemma 7.27. If i ∈ S is aperiodic and j ←→ i, then j is aperiodic. So beingaperiodic is a class property.
Proof. We have

P^{n+m+k}_{jj} = Σ_{w,z∈S} P^n_{j,w} P^m_{w,z} P^k_{z,j} ≥ P^n_{j,i} P^m_{i,i} P^k_{i,j}.

Since j ←→ i, there exist n, k ∈ N such that P^n_{j,i} > 0 and P^k_{i,j} > 0. Since P^m_{i,i} > 0 for all large m, it follows that P^{n+m+k}_{jj} > 0 for all large m and therefore j is aperiodic as well.
Lemma 7.28. A state i ∈ S is aperiodic iff 1 is the greatest common divisor of the set

{n ∈ N : P_i(X_n = i) = P^n_{ii} > 0}.
Proof. Use the number theory Lemma 7.39 below.
Theorem 7.29. If P is an irreducible, aperiodic, and recurrent Markov chain, then

lim_{n→∞} P^n_{ij} = π_j = 1/(E_j R_j).   (7.25)

More generally, if C is a communication class which is assumed to be aperiodic if it is recurrent, then

lim_{n→∞} P_ν(X_n = j) := lim_{n→∞} Σ_{i∈S} ν(i) P^n_{ij} = P_ν(T_C < ∞) · (1/(E_j R_j)) for all j ∈ C.   (7.26)
Proof. The proof uses the important idea of a coupling argument (we follow [13, Theorem 1.8.3] or Kallenberg [9, Chapter 8]). Here is the idea. Let X_n and Y_n be two independent Markov chains having P as their transition matrix. Then the chain (X_n, Y_n) ∈ S × S is again a Markov chain with transition matrix P ⊗ P. The aperiodicity and irreducibility assumptions guarantee that P ⊗ P is still irreducible and aperiodic (though only the irreducibility of P ⊗ P will actually be used). Now we have a couple of choices for how to proceed: we may let T be the hitting time by the chain (X_n, Y_n) of a fixed point (x, x) ∈ ∆ ⊂ S × S, or we may let T be the first hitting time of the diagonal ∆ itself. Either way, T is finite almost surely. We then let

Ȳ_n = Y_n if n ≤ T and Ȳ_n = X_n if n > T.
By the strong Markov property Y and Ȳ have the same distribution. Thus if f : S → R is a bounded function and µ, ν are two initial distributions on S we will have

E_µ f(X_n) − E_ν f(X_n) = E_µ f(X_n) − E_ν f(Y_n) = E_{µ⊗ν}[f(X_n) − f(Ȳ_n)] = E_{µ⊗ν}[f(X_n) − f(Ȳ_n) : n < T],

from which it follows that

|E_µ f(X_n) − E_ν f(X_n)| ≤ 2‖f‖_∞ P(T > n) → 0 as n → ∞.   (7.27)

This inequality shows that the initial distribution plays no role in the limiting distribution of {X_n}_{n=0}^∞.
Now assuming that an invariant distribution π exists³ for P (as we know this means positive recurrence, but we do not need this fact), then taking µ = π and using E_π f(X_n) = E_π f(X_0) = π(f), we find,
3 We still have to handle the null-recurrent case which is going to be omitted inthese notes but see Kallenberg [9, Chapter 8, Theorem 8.18, p. 152.]. The key newingredient not explained here may be found in [9, Lemma 8.21].
66 7 Long Run Behavior of Discrete Markov Chains
|E_ν f(X_n) − π(f)| ≤ 2‖f‖_∞ P(T > n) → 0 as n → ∞,

which shows Law_{P_ν}(X_n) ⇒ π as n → ∞, and this gives Eq. (7.26).

In Nate Eldredge's notes, the inequality in Eq. (7.27) is derived a bit differently. He takes T to be the hitting time of (x, x) ∈ ∆ and then notes by the strong Markov property that

E_µ f(X_n) = E_{µ⊗ν} f(X_n) = Σ_{k=0}^n E_{µ⊗ν}[f(X_n) | T = k] P(T = k) + E_{µ⊗ν}[f(X_n) : T > n]
= Σ_{k=0}^n E_x[f(X_{n−k})] P(T = k) + E_{µ⊗ν}[f(X_n) : T > n].

A similar calculation shows

E_ν f(X_n) = E_{µ⊗ν} f(Y_n) = Σ_{k=0}^n E_x[f(Y_{n−k})] P(T = k) + E_{µ⊗ν}[f(Y_n) : T > n]
= Σ_{k=0}^n E_x[f(X_{n−k})] P(T = k) + E_{µ⊗ν}[f(Y_n) : T > n].

Then subtracting these two equations shows

E_µ f(X_n) − E_ν f(X_n) = E_{µ⊗ν}[f(X_n) : T > n] − E_{µ⊗ν}[f(Y_n) : T > n] = E_{µ⊗ν}[f(X_n) − f(Y_n) : T > n],

and so again

|E_µ f(X_n) − E_ν f(X_n)| ≤ 2‖f‖_∞ P(T > n).
7.2.1 More finite state space examples
Example 7.30 (Analyzing a non-irreducible Markov chain). In this example we are going to analyze the limiting behavior of the non-irreducible Markov chain determined by the Markov matrix (rows and columns indexed by the states 1, . . . , 5),

P =
[ 0    1/2  0    0    1/2 ]
[ 1/2  0    0    1/2  0   ]
[ 0    0    1/2  1/2  0   ]
[ 0    0    1/3  2/3  0   ]
[ 0    0    0    0    1   ].
Here are the steps to follow.
Fig. 7.2. The jump diagram for P.
1. Find the jump diagram for P. In our case it is given in Figure 7.2.
2. Identify the communication classes. In our example they are {1, 2}, {5}, and {3, 4}. The first is not closed and hence transient, while the second two are closed, finite sets and hence recurrent.
3. Find the invariant distributions for the recurrent classes. For {5} it is simply π′_5 = [1], and for {3, 4} we must find the invariant distribution for the 2 × 2 Markov matrix (rows and columns indexed by 3, 4),

Q = [ 1/2 1/2 ; 1/3 2/3 ].

We do this in the usual way, namely

Nul(I − Q^tr) = Nul([1 0; 0 1] − [1/2 1/3; 1/2 2/3]) = R · [2; 3],

so that π′_{3,4} = (1/5)[2 3].
4. We can turn π′_{3,4} and π′_5 into invariant distributions for P by padding the row vectors with zeros to get

π_{3,4} = [0 0 2/5 3/5 0] and π_5 = [0 0 0 0 1].

The general invariant distribution may then be written as

π = α π_5 + β π_{3,4} with α, β ≥ 0 and α + β = 1.
5. We can now work out lim_{n→∞} P^n. If we start at site i we are considering the i-th row of lim_{n→∞} P^n. If we start in the recurrent class {3, 4} we will simply get π_{3,4} for these rows, and if we start in the recurrent class {5} we will get π_5. However, if we start in the non-closed transient class {1, 2} we have

first row of lim_{n→∞} P^n = P_1(X_n hits 5) π_5 + P_1(X_n hits {3, 4}) π_{3,4}   (7.28)

and

second row of lim_{n→∞} P^n = P_2(X_n hits 5) π_5 + P_2(X_n hits {3, 4}) π_{3,4}.   (7.29)
6. Compute the required hitting probabilities. Let us now compute the required hitting probabilities by taking B = {3, 4, 5} and A = {1, 2}. Then we have

[P_i(X_{T_B} = j)]_{i∈A, j∈B} = (I − P_{A×A})^{−1} P_{A×B}
= ([1 0; 0 1] − [0 1/2; 1/2 0])^{−1} [0 0 1/2; 0 1/2 0]
= [4/3 2/3; 2/3 4/3] [0 0 1/2; 0 1/2 0]
= [0 1/3 2/3; 0 2/3 1/3].

From this we learn

P_1(X_n hits 5) = 2/3, P_2(X_n hits 5) = 1/3,
P_1(X_n hits {3, 4}) = 1/3 and P_2(X_n hits {3, 4}) = 2/3.
7. Using these results in Eqs. (7.28) and (7.29) shows

first row of lim_{n→∞} P^n = (2/3) π_5 + (1/3) π_{3,4} = [0 0 2/15 1/5 2/3] = [0.0 0.0 0.13333 0.2 0.66667]

and

second row of lim_{n→∞} P^n = (1/3) π_5 + (2/3) π_{3,4} = (1/3)[0 0 0 0 1] + (2/3)[0 0 2/5 3/5 0] = [0 0 4/15 2/5 1/3] = [0.0 0.0 0.26667 0.4 0.33333].

These answers already compare well with

P^10 =
[ 9.7656×10^{−4}  0.0             0.13276  0.20024  0.66602 ]
[ 0.0             9.7656×10^{−4}  0.26626  0.39976  0.33301 ]
[ 0.0             0.0             0.4      0.60000  0.0     ]
[ 0.0             0.0             0.40000  0.6      0.0     ]
[ 0.0             0.0             0.0      0.0      1.0     ].
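The steps above are easy to check numerically. The following sketch (our own illustration, assuming NumPy is available) recomputes the hitting probabilities of step 6 and compares a large power of P with the limiting rows of step 7:

```python
import numpy as np

# Markov matrix of Example 7.30 (states 1..5 stored as indices 0..4).
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1/3, 2/3, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
])

# Step 6: hitting probabilities from A = {1,2} into B = {3,4,5},
# computed as (I - P_{A,A})^{-1} P_{A,B}.
A, B = [0, 1], [2, 3, 4]
H = np.linalg.solve(np.eye(2) - P[np.ix_(A, A)], P[np.ix_(A, B)])

# Step 7: limiting rows for the transient states are mixtures of the
# invariant distributions of the recurrent classes {3,4} and {5}.
pi34 = np.array([0, 0, 2/5, 3/5, 0.0])
pi5 = np.array([0, 0, 0, 0, 1.0])
row1 = H[0, 2] * pi5 + (H[0, 0] + H[0, 1]) * pi34

Pn = np.linalg.matrix_power(P, 200)   # proxy for lim P^n
print(np.round(H, 4))                 # first row should be [0, 1/3, 2/3]
print(np.allclose(Pn[0], row1))
```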
7.3 Periodic Chain Considerations
Definition 7.31. For each i ∈ S, let d(i) be the greatest common divisor of {n ≥ 1 : P^n_{ii} > 0}, with the convention that d(i) = 0 if P^n_{ii} = 0 for all n ≥ 1. We refer to d(i) as the period of i. We say a site i is aperiodic if d(i) = 1.
Example 7.32. Each site of the fair random walk on S = Z has period 2, while for the fair random walk on {0, 1, 2, . . .} with 0 being an absorbing state, each i ≥ 1 has period 2 while 0 has period 1, i.e. 0 is aperiodic.
Theorem 7.33. The period function is constant on each communication classof a Markov chain.
Proof. Let x, y ∈ C, a = d(x) and b = d(y). Now suppose that P^m_{xy} > 0 and P^n_{yx} > 0; then P^{m+n}_{x,x} ≥ P^m_{xy} P^n_{yx} > 0 and so a | (m + n). Further suppose that P^l_{y,y} > 0 for some l ∈ N; then

P^{m+n+l}_{x,x} ≥ P^m_{xy} P^l_{y,y} P^n_{yx} > 0

and therefore a | (m + n + l), which coupled with a | (m + n) implies a | l. We may therefore conclude that a ≤ b (in fact a | b) as b = gcd{l ∈ N : P^l_{y,y} > 0}. Similarly we show that b ≤ a and therefore b = a.
Lemma 7.34. If d(i) is the period of site i, then

1. if m ∈ N and P^m_{i,i} > 0 then d(i) divides m,
2. P^{n d(i)}_{i,i} > 0 for all n ∈ N sufficiently large, and
3. i is aperiodic iff P^n_{i,i} > 0 for all n ∈ N sufficiently large.

In summary, A_i := {m ∈ N : P^m_{i,i} > 0} ⊂ d(i)N and d(i)n ∈ A_i for all n ∈ N sufficiently large.

Proof. Choose n_1, . . . , n_k ∈ {n ≥ 1 : P^n_{ii} > 0} such that d(i) = gcd(n_1, . . . , n_k). For part 1 we also know that d(i) = gcd(n_1, . . . , n_k, m) and therefore d(i) divides m. For part 2, if m_1, . . . , m_k ∈ N we have,
(P^{Σ_{l=1}^k m_l n_l})_{i,i} ≥ Π_{l=1}^k [P^{n_l}_{i,i}]^{m_l} > 0.

This observation along with the number theoretic Lemma 7.39 below is enough to show P^{n d(i)}_{i,i} > 0 for all n ∈ N sufficiently large. The third item is a special case of item 2.
Example 7.35. Suppose that P = [0 1; 1 0]; then P^m = P if m is odd and P^m = [1 0; 0 1] if m is even. Therefore d(i) = 2 for i = 1, 2, and in this case P^{2n}_{i,i} = 1 > 0 for all n ∈ N. However, observe that P^2 is no longer irreducible – there are now two communication classes.
Example 7.36. Consider the Markov chain with jump diagram given in Figure 7.3. In this example, d(i) = 2 for all i and all states for P^2 are aperiodic.

Fig. 7.3. The jump diagrams for P and P^2. All arrows are assumed to have weight 1 unless otherwise specified. Notice that each state has period d = 2 and that P^2 is the transition matrix having two aperiodic communication classes.
However, P^2 is no longer irreducible. This is an indication of what happens in general. In terms of matrices (rows and columns indexed by the states 1, . . . , 6),

P =
[ 0    1    0    0    0    0   ]
[ 0    0    1    0    0    0   ]
[ 0    0    0    1    0    0   ]
[ 1/2  0    0    0    1/2  0   ]
[ 0    0    0    0    0    1   ]
[ 1    0    0    0    0    0   ]

and P^2 =
[ 0    0    1    0    0    0   ]
[ 0    0    0    1    0    0   ]
[ 1/2  0    0    0    1/2  0   ]
[ 0    1/2  0    0    0    1/2 ]
[ 1    0    0    0    0    0   ]
[ 0    1    0    0    0    0   ].
Example 7.37. Consider the Markov chain with jump diagram given in Figure 7.4.

Fig. 7.4. The jump diagrams for P and P^2. Assume there are no implied jumps from a site back to itself, i.e. P_{i,i} = 0 for all i. This chain is then irreducible and has period 2.

Assume there are no implied jumps from a site back to itself, i.e. P_{i,i} = 0 for all i. This chain is then irreducible and has period 2. To calculate the period notice that starting at y there is an obvious loop of length 4 and starting at x there is one of length 6. Therefore the period must divide both 4 and 6 and so must be either 2 or 1. The period is not 1, as one can only return to a site with an even number of jumps in this picture. If on the other hand there were any one vertex i such that P_{i,i} > 0, then the period of the chain would have been one, i.e. the chain would have been aperiodic. Further notice that the jump diagram for P^2 is no longer irreducible: the red vertices and the blue vertices split apart. This has to happen as a consequence of Proposition 7.38 below.
Proposition 7.38. If P is the Markov matrix for a finite state irreducible aperiodic chain, then there exists n_0 ∈ N such that P^n_{ij} > 0 for all i, j ∈ S and n ≥ n_0.
Proof. Let i, j ∈ S. By Lemma 7.34 with d(i) = 1 we know that P^m_{i,i} > 0 for all m large. As P is irreducible there exists a ∈ N such that P^a_{ij} > 0 and therefore P^{a+m}_{i,j} ≥ P^m_{i,i} P^a_{i,j} > 0 for all m sufficiently large. This shows for all i, j ∈ S there exists n_{i,j} ∈ N such that P^n_{ij} > 0 for all n ≥ n_{i,j}. Since S is finite we may now take n_0 := max{n_{i,j} : i, j ∈ S} < ∞.
7.3.1 A number theoretic lemma
Lemma 7.39 (A number theory lemma). Suppose that 1 is the greatest common divisor of a set of positive integers, Γ := {n_1, . . . , n_k}. Then there exists N ∈ N such that the set

A = {m_1 n_1 + · · · + m_k n_k : m_i ≥ 0 for all i}

contains all n ∈ N with n ≥ N. More generally, if q = gcd(Γ) (perhaps not 1), then A ⊂ qN and A contains all points qn for n sufficiently large.
Proof. First proof. The set I := {m_1 n_1 + · · · + m_k n_k : m_i ∈ Z for all i} is an ideal in Z and, as Z is a principal ideal domain, there is a q ∈ I with q > 0 such that I = qZ = {qm : m ∈ Z}. In fact q = min(I ∩ N). Since q ∈ I we know that q = m_1 n_1 + · · · + m_k n_k for some m_i ∈ Z, and so if l is a common divisor of n_1, . . . , n_k then l divides q. Moreover, as I = qZ and n_i ∈ I for all i, we know that q | n_i as well. This shows that q = gcd(n_1, n_2, . . . , n_k).

Now suppose that n ≥ n_1 + · · · + n_k is given and large (to be explained shortly). Then write n = l(n_1 + · · · + n_k) + r with l ∈ N and 0 ≤ r < n_1 + · · · + n_k, and therefore

nq = ql(n_1 + · · · + n_k) + rq = ql(n_1 + · · · + n_k) + r(m_1 n_1 + · · · + m_k n_k) = (ql + r m_1) n_1 + · · · + (ql + r m_k) n_k,

where

ql + r m_i ≥ ql − (n_1 + · · · + n_k)|m_i|,
which is greater than 0 for l, and hence n, sufficiently large.

Second proof. (The following proof is from Durrett [6].) We first will show that A contains two consecutive positive integers, a and a + 1. To prove this let

k := min{|b − a| : a, b ∈ A with a ≠ b}

and choose a, b ∈ A with b = a + k. If k > 1, there exists n ∈ Γ ⊂ A such that k does not divide n. Let us write n = mk + r with m ≥ 0 and 1 ≤ r < k. It then follows that (m + 1)b and (m + 1)a + n are both in A,

(m + 1)b = (m + 1)(a + k) > (m + 1)a + mk + r = (m + 1)a + n,

and

(m + 1)b − [(m + 1)a + n] = k − r < k.

This contradicts the definition of k and therefore k = 1.

Let N = a². If n ≥ N, then n − a² = ma + r for some m ≥ 0 and 0 ≤ r < a. Therefore

n = a² + ma + r = (a + m)a + r = (a + m − r)a + r(a + 1) ∈ A,

where a + m − r ≥ 0 since m ≥ 0 and r < a.
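The conclusion of Lemma 7.39 is easy to check by brute force. In the sketch below (our own illustration), Γ = {6, 10, 15} has gcd 1 although no pair of its elements is coprime; every n ≥ 30 turns out to be representable, while n = 29 is not:

```python
from math import gcd
from functools import reduce

def representable(n, gens):
    """Brute-force test whether n = m1*n1 + ... + mk*nk with all mi >= 0."""
    if n == 0:
        return True
    if not gens or n < 0:
        return False
    g, rest = gens[0], list(gens[1:])
    return any(representable(n - m * g, rest) for m in range(n // g + 1))

gamma = [6, 10, 15]                 # gcd is 1, yet no pair is coprime
assert reduce(gcd, gamma) == 1

print(all(representable(n, gamma) for n in range(30, 60)))  # True
print(representable(29, gamma))                             # False: last gap
```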
8 *Proofs of Long Run Results
In proving the results above, we are going to make essential use of a strongform of the Markov property of Theorem 5.16.
8.1 Strong Markov Property Consequences
Here is a special case of Theorem 5.16.
Theorem 8.1 (Strong Markov Property). Let ({X_n}_{n=0}^∞, {P_x}_{x∈S}, p) be a Markov chain as above and τ : Ω → [0, ∞] be a stopping time. Then

E_π[f(X_τ, X_{τ+1}, . . .) g_τ(X_0, . . . , X_τ) 1_{τ<∞}] = E_π[[E_{X_τ} f(X_0, X_1, . . .)] g_τ(X_0, . . . , X_τ) 1_{τ<∞}]   (8.1)

for all f, g = {g_n} ≥ 0 or f and g bounded.
Proof. Apply Eq. (5.7) of Theorem 5.16 with F (X0, X1, . . . ) =f (Xτ , Xτ+1, . . . ) gτ (X0, . . . , Xτ ) 1τ<∞.
Let

f^{(n)}_{ii} = P_i(R_i = n) = P_i(X_1 ≠ i, . . . , X_{n−1} ≠ i, X_n = i)

and m_{ij} := E_i(M_j) – the expected number of visits to j after n = 0.
Proposition 8.2. Let i ∈ S and n ≥ 1. Then P^n_{ii} satisfies the “renewal equation,”

P^n_{ii} = Σ_{k=1}^n P_i(R_i = k) P^{n−k}_{ii}.   (8.2)

Also if j ∈ S, k ∈ N, and ν : S → [0, 1] is any probability on S, then Eq. (7.3) holds, i.e.

P_ν(M_j ≥ k) = P_ν(T_j < ∞) · P_j(R_j < ∞)^{k−1}.   (8.3)
Proof. To prove Eq. (8.2) we first observe for n ≥ 1 that {X_n = i} is the disjoint union of {X_n = i, R_i = k} for 1 ≤ k ≤ n, and therefore¹,

P^n_{ii} = P_i(X_n = i) = Σ_{k=1}^n E_i(1_{R_i=k} · 1_{X_n=i}) = Σ_{k=1}^n E_i(1_{R_i=k}) P_i(X_{n−k} = i) = Σ_{k=1}^n P^{n−k}_{ii} P_i(R_i = k).

For Eq. (8.3) we have {M_j ≥ 1} = {R_j < ∞}, so that P_i(M_j ≥ 1) = P_i(R_j < ∞). For k ≥ 2, since R_j < ∞ if M_j ≥ 1, we have

P_i(M_j ≥ k) = P_i(M_j ≥ k | R_j < ∞) P_i(R_j < ∞).

Since, on {R_j < ∞}, X_{R_j} = j, it follows by the strong Markov property of Theorem 5.16 that

P_i(M_j ≥ k | R_j < ∞) = P_i(M_j ≥ k | R_j < ∞, X_{R_j} = j)
= P_i(1 + Σ_{n≥1} 1_{X_{R_j+n}=j} ≥ k | R_j < ∞, X_{R_j} = j)
= P_j(1 + Σ_{n≥1} 1_{X_n=j} ≥ k) = P_j(M_j ≥ k − 1).

¹ Alternatively, we could use the Markov property to show,

P^n_{ii} = P_i(X_n = i) = Σ_{k=1}^n P_i(R_i = k, X_n = i)
= Σ_{k=1}^n P_i(X_1 ≠ i, . . . , X_{k−1} ≠ i, X_k = i, X_n = i)
= Σ_{k=1}^n P_i(X_1 ≠ i, . . . , X_{k−1} ≠ i, X_k = i) P^{n−k}_{ii}
= Σ_{k=1}^n P^{n−k}_{ii} P_i(R_i = k).
By the last two displayed equations,
P_i(M_j ≥ k) = P_j(M_j ≥ k − 1) P_i(R_j < ∞).   (8.4)

Taking i = j in this equation shows

P_j(M_j ≥ k) = P_j(M_j ≥ k − 1) P_j(R_j < ∞)

and so by induction,

P_j(M_j ≥ k) = P_j(R_j < ∞)^k.   (8.5)
Equation (8.3) now follows from Eqs. (8.4) and (8.5).
8.2 Irreducible Recurrent Chains
For this section we are going to assume that X_n is an irreducible recurrent Markov chain. Let us now fix a state j ∈ S and define

τ_1 = R_j = min{n ≥ 1 : X_n = j},
τ_2 = min{m ≥ 1 : X_{m+τ_1} = j},
. . .
τ_n = min{m ≥ 1 : X_{m+T_{n−1}} = j}, where T_{n−1} := τ_1 + · · · + τ_{n−1},

so that τ_n is the time it takes for the chain to visit j after the (n − 1)-st visit to j. By Corollary 7.15 we know that P_i(τ_n < ∞) = 1 for all i ∈ S and n ∈ N. We will use the strong Markov property to prove the following key lemma in our development.
Lemma 8.3. We continue to use the notation above and in particular assume that X_n is an irreducible recurrent Markov chain. Then relative to any P_i with i ∈ S, {τ_n}_{n=1}^∞ is a sequence of independent random variables, {τ_n}_{n=2}^∞ are identically distributed, and P_i(τ_n = k) = P_j(τ_1 = k) for all k ∈ N_0 and n ≥ 2.

Proof. Let T_0 = 0 and then define T_k inductively by T_{k+1} = inf{n > T_k : X_n = j}, so that T_n is the time of the n-th visit of {X_n}_{n=1}^∞ to site j. Observe that T_1 = τ_1,

τ_{n+1}(X_0, X_1, . . .) = τ_1(X_{T_n}, X_{T_n+1}, X_{T_n+2}, . . .),

and (τ_1, . . . , τ_n) is a function of (X_0, . . . , X_{T_n}). Since P_i(T_n < ∞) = 1 (Corollary 7.15) and X_{T_n} = j, we may apply the strong Markov property in the form of Theorem 5.16 to learn:

1. τ_{n+1} is independent of (X_0, . . . , X_{T_n}) and hence τ_{n+1} is independent of (τ_1, . . . , τ_n), and
2. the distribution of τ_{n+1} under P_i is the same as the distribution of τ_1 under P_j.
The result now follows from these two observations and induction.
Theorem 8.4. Suppose that X_n is an irreducible recurrent Markov chain, and let j ∈ S be a fixed state. Define

π_j := 1/E_j(R_j),   (8.6)

with the understanding that π_j = 0 if E_j(R_j) = ∞. Then

lim_{N→∞} (1/N) Σ_{n=0}^N 1_{X_n=j} = π_j  P_i – a.s.   (8.7)

for all i ∈ S, and

lim_{N→∞} (1/N) Σ_{n=0}^N P^n_{ij} = π_j.   (8.8)
Proof. Let us first note that Eq. (8.8) follows by taking expectations of Eq.(8.7). So we must prove Eq. (8.7).
By Lemma 8.3, the sequence {τ_n}_{n≥2} is i.i.d. relative to P_i and E_i τ_n = E_j τ_1 = E_j R_j for all i ∈ S. We may now use the strong law of large numbers to conclude that

lim_{N→∞} (τ_1 + τ_2 + · · · + τ_N)/N = E_i τ_2 = E_j τ_1 = E_j R_j  (P_i – a.s.).   (8.9)

This may be expressed as follows: let R^{(N)}_j = τ_1 + τ_2 + · · · + τ_N be the time when the chain visits j for the N-th time; then

lim_{N→∞} R^{(N)}_j / N = E_j R_j  (P_i – a.s.).   (8.10)

Let

ν_N = Σ_{n=0}^N 1_{X_n=j}

be the number of times X_n visits j up to time N. Since j is visited infinitely often, ν_N → ∞ as N → ∞ and therefore lim_{N→∞} ν_{N+1}/ν_N = 1. Since there were ν_N visits to j in the first N steps, the time of the ν_N-th visit to j is less than or
equal to N, i.e. R^{(ν_N)}_j ≤ N. Similarly, the time R^{(ν_N+1)}_j of the (ν_N + 1)-st visit to j must be larger than N, so we have R^{(ν_N)}_j ≤ N ≤ R^{(ν_N+1)}_j. Putting these facts together along with Eq. (8.10) shows that

R^{(ν_N)}_j / ν_N ≤ N/ν_N ≤ (R^{(ν_N+1)}_j / (ν_N + 1)) · ((ν_N + 1)/ν_N),

where as N → ∞ the left side converges to E_j R_j and the right side converges to E_j R_j · 1. Hence lim_{N→∞} N/ν_N = E_j R_j for P_i – almost every sample path. Taking reciprocals of this last set of inequalities implies Eq. (8.7).
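Theorem 8.4 is easy to test by simulation. The following sketch (the chain and all parameters are our own choices, not from the text) uses a small birth-death chain whose stationary distribution is (1/4, 1/2, 1/4); the long-run fraction of time spent at state 0 approaches π_0 = 1/E_0 R_0 = 1/4:

```python
import random

random.seed(0)
# Birth-death chain on {0, 1, 2}; detailed balance gives pi = (1/4, 1/2, 1/4).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]

def step(i):
    """Sample the next state from row i of P by inverting the CDF."""
    u, c = random.random(), 0.0
    for j, p in enumerate(P[i]):
        c += p
        if u < c:
            return j
    return len(P) - 1

N, visits, x = 200_000, 0, 0
for _ in range(N):
    x = step(x)
    visits += (x == 0)

freq = visits / N    # long-run fraction of time at state 0
print(freq)          # should be close to pi_0 = 1/4
```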
Proposition 8.5. Suppose that X_n is an irreducible, recurrent Markov chain and let π_j = 1/(E_j R_j) for all j ∈ S as in Eq. (8.6). Then either π_i = 0 for all i ∈ S (in which case X_n is null recurrent) or π_i > 0 for all i ∈ S (in which case X_n is positive recurrent). Moreover, if π_i > 0 then

Σ_{i∈S} π_i = 1 and   (8.11)

Σ_{i∈S} π_i P_{ij} = π_j for all j ∈ S.   (8.12)

That is, π = (π_i)_{i∈S} is the unique stationary distribution for P.
Proof. Let us define

T^n_{ki} := (1/n) Σ_{l=1}^n P^l_{ki},   (8.13)

which, according to Theorem 8.4, satisfies

lim_{n→∞} T^n_{ki} = π_i for all i, k ∈ S.

Observe that

(T^n P)_{ki} = (1/n) Σ_{l=1}^n P^{l+1}_{ki} = (1/n) Σ_{l=1}^n P^l_{ki} + (1/n)[P^{n+1}_{ki} − P_{ki}] → π_i as n → ∞.
Let α := Σ_{i∈S} π_i. Since π_i = lim_{n→∞} T^n_{ki}, Fatou's lemma implies for all i, j ∈ S that

α = Σ_{i∈S} π_i = Σ_{i∈S} lim inf_{n→∞} T^n_{ki} ≤ lim inf_{n→∞} Σ_{i∈S} T^n_{ki} = 1

and

Σ_{i∈S} π_i P_{ij} = Σ_{i∈S} lim_{n→∞} T^n_{li} P_{ij} ≤ lim inf_{n→∞} Σ_{i∈S} T^n_{li} P_{ij} = lim inf_{n→∞} T^{n+1}_{lj} = π_j,

where l ∈ S is arbitrary. Thus

α := Σ_{i∈S} π_i ≤ 1 and Σ_{i∈S} π_i P_{ij} ≤ π_j for all j ∈ S.   (8.14)

By induction it also follows that

Σ_{i∈S} π_i P^k_{ij} ≤ π_j for all j ∈ S.   (8.15)

So if π_j = 0 for some j ∈ S, then given any i ∈ S there is an integer k such that P^k_{ij} > 0, and by Eq. (8.15) we learn that π_i = 0. This shows that either π_i = 0 for all i ∈ S or π_i > 0 for all i ∈ S.

For the rest of the proof we assume that π_i > 0 for all i ∈ S. If there were some j ∈ S such that Σ_{i∈S} π_i P_{ij} < π_j, we would have from Eq. (8.14) that

α = Σ_{i∈S} π_i = Σ_{i∈S} Σ_{j∈S} π_i P_{ij} = Σ_{j∈S} Σ_{i∈S} π_i P_{ij} < Σ_{j∈S} π_j = α,

which is a contradiction, and Eq. (8.12) is proved.

From Eq. (8.12) and induction we also have

Σ_{i∈S} π_i P^k_{ij} = π_j for all j ∈ S

for all k ∈ N, and therefore

Σ_{i∈S} π_i T^k_{ij} = π_j for all j ∈ S.   (8.16)

Since 0 ≤ T^k_{ij} ≤ 1 and Σ_{i∈S} π_i = α ≤ 1, we may use the dominated convergence theorem to pass to the limit as k → ∞ in Eq. (8.16) to find

π_j = lim_{k→∞} Σ_{i∈S} π_i T^k_{ij} = Σ_{i∈S} lim_{k→∞} π_i T^k_{ij} = Σ_{i∈S} π_i π_j = α π_j.

Since π_j > 0, this implies that α = 1 and hence Eq. (8.11) is now verified.
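The Cesàro averages T^n of Eq. (8.13) can be computed numerically. The sketch below (the chain is an arbitrary choice of ours) illustrates that the rows of T^n agree in the limit, independently of the starting state k, and that the limit satisfies πP = π:

```python
import numpy as np

# An irreducible (hence, being finite, positive recurrent) chain on 3 states.
P = np.array([[0.1, 0.9, 0.0],
              [0.4, 0.2, 0.4],
              [0.0, 0.6, 0.4]])

# T^n_{ki} = (1/n) sum_{l=1}^n (P^l)_{ki}, as in Eq. (8.13).
n = 20_000
acc, Pl = np.zeros_like(P), np.eye(3)
for _ in range(n):
    Pl = Pl @ P
    acc += Pl
T = acc / n

pi = T[0]   # any row gives (approximately) the same limit
print(np.allclose(T, np.tile(pi, (3, 1)), atol=1e-3))  # rows agree
print(np.allclose(pi @ P, pi, atol=1e-3))              # pi is stationary
```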
Proposition 8.6. Suppose that P is an irreducible Markov kernel which ad-mits a stationary distribution µ. Then P is positive recurrent and µj = πj =1/ (EjRj) for all j ∈ S. In particular, an irreducible Markov kernel has at mostone invariant distribution and it has exactly one iff P is positive recurrent.
Proof. Suppose that µ = (µ_i) is a stationary distribution for P, i.e. Σ_{i∈S} µ_i = 1 and µ_j = Σ_{i∈S} µ_i P_{ij} for all j ∈ S. Then we also have

µ_j = Σ_{i∈S} µ_i T^k_{ij} for all k ∈ N,   (8.17)

where T^k_{ij} is defined above in Eq. (8.13). As in the proof of Proposition 8.5, we may use the dominated convergence theorem to find

µ_j = lim_{k→∞} Σ_{i∈S} µ_i T^k_{ij} = Σ_{i∈S} lim_{k→∞} µ_i T^k_{ij} = Σ_{i∈S} µ_i π_j = π_j.

Alternative proof. If P were not positive recurrent then P would be either transient or null-recurrent, in which case lim_{n→∞} T^n_{ij} = 1/E_j(R_j) = 0 for all i, j. So letting k → ∞ in Eq. (8.17), using the dominated convergence theorem, allows us to conclude that µ_j = 0 for all j, which contradicts the fact that µ was assumed to be a distribution.
9 More on Transience and Recurrence
Remark 9.1. Let {X_n}_{n=0}^∞ denote the fair random walk on {0, 1, 2, . . .} with 0 being an absorbing state. The communication classes are now {0} and {1, 2, . . .}, with the latter class not being closed and hence transient. Using Exercise 6.7 or Exercise 6.8, it follows that E_i T = ∞ for all i > 0, which shows we cannot drop the assumption that #(C) < ∞ in the first statement in Proposition 7.7. Similarly, using the fair random walk example, we see that it is not possible to drop the condition that #(C) < ∞ for the equivalence statements as well.
The next examples show that if C ⊂ S is closed and # (C) = ∞, then Ccould be recurrent or it could be transient. Transient in this case means thechain goes off to “infinity,” i.e. eventually leaves every finite subset of C neverto return again.
Example 9.2. Let S = Z and X = X_n be the standard fair random walk on Z, i.e. P(X_{n+1} = x ± 1 | X_n = x) = 1/2. Then S itself is a closed class and every element of S is (null) recurrent. Indeed, using Exercise 6.5 or Exercise 6.6 and the first step analysis we know that

P_0[R_0 = ∞] = (1/2)(P_0[R_0 = ∞ | X_1 = 1] + P_0[R_0 = ∞ | X_1 = −1]) = (1/2)(P_1[T_0 = ∞] + P_{−1}[T_0 = ∞]) = (1/2)(0 + 0) = 0.

This shows 0 is recurrent. Similarly, using Exercise 6.7 or Exercise 6.8 and the first step analysis we find

E_0[R_0] = (1/2)(E_0[R_0 | X_1 = 1] + E_0[R_0 | X_1 = −1]) = (1/2)(1 + E_1[T_0] + 1 + E_{−1}[T_0]) = (1/2)(∞ + ∞) = ∞,

and so 0 is null recurrent. As this chain is invariant under translation, it follows that every x ∈ Z is a null recurrent site.
Example 9.3. Let S = Z and X = X_n be a biased random walk on Z, i.e. P(X_{n+1} = x + 1 | X_n = x) = p and P(X_{n+1} = x − 1 | X_n = x) = q := 1 − p with p > 1/2. Then every site is now transient. Recall from Exercises 6.9 and 6.10 (see Eq. (6.26)) that

P_x(T_0 < ∞) = (q/p)^x if x ≥ 0 and 1 if x < 0.   (9.1)

Using these results and the first step analysis implies

P_0[R_0 = ∞] = p P_0[R_0 = ∞ | X_1 = 1] + q P_0[R_0 = ∞ | X_1 = −1]
= p P_1[T_0 = ∞] + q P_{−1}[T_0 = ∞]
= p[1 − (q/p)^1] + q(1 − 1)
= p − q = 2p − 1 > 0.
Example 9.4. Again let S = Z and p ∈ (1/2, 1), and suppose that X_n is the random walk on Z described by the jump diagram in Figure 9.1.

Fig. 9.1. A positively recurrent Markov chain.

In this case, using the results of Exercise 6.11, we learn that

E_0[R_0] = (1/2)(E_0[R_0 | X_1 = 1] + E_0[R_0 | X_1 = −1])
= (1/2)(1 + E_1[T_0] + 1 + E_{−1}[T_0])
= 1 + (1/2)(1/(p − q) + 1/(p − q)) = 1 + 1/(p − q) = 2p/(2p − 1) < ∞.

This shows the site 0 is positively recurrent. Thus, according to Proposition 7.5, every site in Z is positively recurrent. (Notice that E_0[R_0] → ∞ as p ↓ 1/2, i.e. as the chain becomes closer to the unbiased random walk of Example 9.2.)
Theorem 9.5 (Recurrence Conditions). Let j ∈ S. Then the following are equivalent:

1. j is recurrent, i.e. P_j(R_j < ∞) = 1,
2. P_j(X_n = j i.o. n) = 1,
3. E_j M_j = Σ_{n=0}^∞ P^n_{jj} = ∞.

Moreover, if C ⊂ S is a recurrent communication class, then P_i(∩_{j∈C} {X_n = j i.o. n}) = 1 for all i ∈ C. In words, if we start in C then every state in C is visited an infinite number of times.
Theorem 9.6 (Transient States). Let j ∈ S. Then the following are equivalent:

1. j is transient, i.e. P_j(R_j < ∞) < 1,
2. P_j(X_n = j i.o. n) = 0, and
3. E_j M_j = Σ_{n=0}^∞ P^n_{jj} < ∞.

More generally, if ν : S → [0, 1] is any probability and j ∈ S is transient, then

E_ν M_j = Σ_{n=0}^∞ P_ν(X_n = j) < ∞ ⟹ lim_{n→∞} P_ν(X_n = j) = 0 and P_ν(X_n = j i.o. n) = 0.   (9.2)
Example 9.7. Let us revisit the fair random walk on Z described before Exercise 6.5. In this case P_0(X_n = 0) = 0 if n is odd and

P_0(X_{2n} = 0) = (2n choose n) (1/2)^{2n} = ((2n)!/(n!)²) (1/2)^{2n}.   (9.3)

Making use of Stirling's formula, n! ∼ √(2π) n^{n+1/2} e^{−n}, we find

(1/2)^{2n} (2n)!/(n!)² ∼ (1/2)^{2n} · (√(2π) (2n)^{2n+1/2} e^{−2n}) / (2π n^{2n+1} e^{−2n}) = √(1/π) · (1/√n),   (9.4)

and therefore

Σ_{n=0}^∞ P_0(X_n = 0) = Σ_{n=0}^∞ P_0(X_{2n} = 0) ∼ 1 + Σ_{n=1}^∞ √(1/π) · (1/√n) = ∞,

which shows again that this walk is recurrent. To now verify that this walk is null-recurrent it suffices to show there is no invariant probability distribution π for this walk. Such an invariant measure must satisfy

π(x) = (1/2)[π(x + 1) + π(x − 1)],

which has general solution given by π(x) = A + Bx. In order for π(x) ≥ 0 for all x we must take B = 0, i.e. π is constant. As Σ_{x∈Z} π(x) = A · ∞, there is no way to normalize π to become a probability distribution and hence {X_n}_{n=0}^∞ is null-recurrent.
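Both asymptotic claims of this example are easy to check numerically. The sketch below (our own illustration) computes a_n = P_0(X_{2n} = 0) via the recursion a_n = a_{n−1}(2n − 1)/(2n), which equals (2n choose n) 4^{−n} without forming any huge intermediate numbers:

```python
from math import pi, sqrt

# a_n = P_0(X_{2n} = 0) = prod_{k=1}^n (2k-1)/(2k), a_0 = 1.
probs = [1.0]
for k in range(1, 2001):
    probs.append(probs[-1] * (2 * k - 1) / (2 * k))

ratio = probs[500] * sqrt(pi * 500)   # Eq. (9.4) predicts this is ~ 1
partial = sum(probs)                  # partial sum of sum_n P_0(X_{2n} = 0)
print(ratio)
print(partial)   # grows like 2*sqrt(N/pi): the series diverges (recurrence)
```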
Fact 9.8. Simple random walk on Z^d is recurrent if d = 1 or 2 and is transient if d ≥ 3. [For an informal proof of this fact, see page 49 of Lawler. For a formal proof see Section 9.1 below or Todd Kemp's lecture notes.]
Example 9.9. The above method may easily be modified to show that the biased random walk on Z (see Exercise 6.9) is transient. In this case 1/2 < p < 1 and

P_0(X_{2n} = 0) = (2n choose n) p^n (1 − p)^n = (2n choose n) [p(1 − p)]^n.

Since p(1 − p) has a maximum at p = 1/2 of 1/4, we have ρ_p := 4p(1 − p) < 1 for 1/2 < p < 1. Therefore

P_0(X_{2n} = 0) = (2n choose n) [ρ_p/4]^n = (2n choose n) (1/2)^{2n} ρ_p^n ∼ √(1/π) · (1/√n) ρ_p^n.

Hence

Σ_{n=0}^∞ P_0(X_n = 0) = Σ_{n=0}^∞ P_0(X_{2n} = 0) ∼ 1 + Σ_{n=1}^∞ (1/√n) ρ_p^n ≤ 1 + 1/(1 − ρ_p) < ∞,

which again shows the biased random walk is transient.
Exercise 9.1. Let {X_n}_{n=0}^∞ be the fair random walk on Z (as in Exercise 6.5) starting at 0 and let

A_N := E[Σ_{k=0}^{2N} 1_{X_k=0}]

denote the expected number of visits to 0. Use Stirling's formula and integral approximations for Σ_{n=1}^N 1/√n to argue that A_N ∼ c√N for some constant c > 0.
9.1 *Transience and Recurrence for R.W.s by Fourier Series Methods
In the next result we will give another way to compute (or at least estimate) E_x M_y for random walks and thereby determine whether the walk is transient or recurrent.
Theorem 9.10. Let {ξ_i}_{i=1}^∞ be i.i.d. random vectors with values in Z^d and, for x ∈ Z^d, let X_n := x + ξ_1 + · · · + ξ_n for all n ≥ 1 with X_0 = x. As usual let M_z := Σ_{n=0}^∞ 1_{X_n=z} denote the number of visits to z ∈ Z^d. Then

E M_z = lim_{α↑1} (1/2π)^d ∫_{[−π,π]^d} e^{i(x−z)·θ} / (1 − α E[e^{iξ_1·θ}]) dθ,

where dθ = dθ_1 . . . dθ_d, and in particular

E M_x = lim_{α↑1} (1/2π)^d ∫_{[−π,π]^d} 1 / (1 − α E[e^{iξ_1·θ}]) dθ.   (9.5)
Proof. For 0 < α ≤ 1 let

M^{(α)}_y := Σ_{n=0}^∞ α^n 1_{X_n=y},

so that M^{(1)}_y = M_y. Given any θ ∈ [−π, π]^d we have

Σ_{y∈Z^d} E M^{(α)}_y e^{iy·θ} = E Σ_{y∈Z^d} M^{(α)}_y e^{iy·θ} = E Σ_{y∈Z^d} Σ_{n=0}^∞ α^n 1_{X_n=y} e^{iy·θ} = E Σ_{n=0}^∞ α^n e^{iθ·X_n} = Σ_{n=0}^∞ α^n E[e^{iθ·X_n}],

where

E[e^{iθ·X_n}] = e^{iθ·x} E[e^{iθ·(ξ_1+···+ξ_n)}] = e^{iθ·x} E[Π_{j=1}^n e^{iθ·ξ_j}] = e^{iθ·x} Π_{j=1}^n E e^{iθ·ξ_j} = e^{iθ·x} · (E[e^{iξ_1·θ}])^n.

Combining the last two equations shows

Σ_{y∈Z^d} E M^{(α)}_y e^{iy·θ} = e^{iθ·x} / (1 − α E[e^{iξ_1·θ}]).

Multiplying this equation by e^{−iz·θ} for some z ∈ Z^d we find, using the orthogonality of {e^{iy·θ}}_{y∈Z^d}, that

E M^{(α)}_z = (1/2π)^d ∫_{[−π,π]^d} (Σ_{y∈Z^d} E M^{(α)}_y e^{iy·θ}) e^{−iz·θ} dθ = (1/2π)^d ∫_{[−π,π]^d} e^{iθ·(x−z)} / (1 − α E[e^{iξ_1·θ}]) dθ.

Since M^{(α)}_z ↑ M_z as α ↑ 1, the result is now a consequence of the monotone convergence theorem.
Example 9.11. Suppose that P(ξ_i = 1) = p and P(ξ_i = −1) = q := 1 − p and x = 0. Then

E[e^{iξ_1·θ}] = p e^{iθ} + q e^{−iθ} = p(cos θ + i sin θ) + q(cos θ − i sin θ) = cos θ + i(p − q) sin θ.

Therefore, according to Eq. (9.5), we have

E_0 M_0 = lim_{α↑1} (1/2π) ∫_{−π}^π 1 / (1 − α(cos θ + i(p − q) sin θ)) dθ.   (9.6)

(We could compute the integral in Eq. (9.6) exactly using complex contour integral methods but I will not do this here.)
The integrand in Eq. (9.6) may be written, after multiplying through by the conjugate of the denominator, as

(1 − α cos θ + iα(p − q) sin θ) / ((1 − α cos θ)² + α²(p − q)² sin² θ).

As sin θ is odd while the denominator is even, the imaginary part integrates to zero and we find

E_0 M_0 = lim_{α↑1} (1/2π) ∫_{−π}^π (1 − α cos θ) / ((1 − α cos θ)² + α²(p − q)² sin² θ) dθ.   (9.7)
a. Let us first suppose that p = 1/2 = q, in which case the above equation reduces to

E_0 M_0 = lim_{α↑1} (1/2π) ∫_{−π}^π 1/(1 − α cos θ) dθ = (1/2π) ∫_{−π}^π 1/(1 − cos θ) dθ,

wherein we have used the MCT (for θ ∼ 0) and DCT (for θ away from 0) to justify passing the limit inside of the integral. Since 1 − cos θ ≈ θ²/2 for θ near zero and ∫_{−ε}^ε (1/θ²) dθ = ∞, it follows that E_0 M_0 = ∞ and the fair random walk on Z is recurrent.

b. Now suppose that p ≠ 1/2 and let us write α := 1 − ε for some ε which we will eventually let tend down to zero. With this notation the integrand f_α(θ) in Eq. (9.7) satisfies

f_α(θ) = (1 − cos θ + ε cos θ) / ((1 − cos θ + ε cos θ)² + (1 − ε)²(p − q)² sin² θ)
= (1 − cos θ) / ((1 − cos θ + ε cos θ)² + (1 − ε)²(p − q)² sin² θ) + (ε cos θ) / ((1 − cos θ + ε cos θ)² + (1 − ε)²(p − q)² sin² θ)
≤ (1 − cos θ) / ((1 − ε)²(p − q)² sin² θ) + (ε cos θ) / (ε² cos² θ + (1 − ε)²(p − q)² sin² θ).
The first term is bounded in θ and ε because

lim_{θ↓0} (1 − cos θ)/sin² θ = 1/2,

and therefore it only makes a finite contribution to the integral. Integrating the second term near zero and making the change of variables u = sin θ (so du = cos θ dθ) shows

∫_{−δ}^δ (ε cos θ) / (ε² cos² θ + (1 − ε)²(p − q)² sin² θ) dθ = ∫_{−sin δ}^{sin δ} ε / (ε²(1 − u²) + (1 − ε)²(p − q)² u²) du
≤ ∫_{−sin δ}^{sin δ} ε / (ε² (1/2) + (1/2)(p − q)² u²) du = 2 ∫_{−sin δ}^{sin δ} ε / (ε² + (p − q)² u²) du,

provided δ is sufficiently small but fixed and ε is small. Lastly, we make the change of variables u = εx/|p − q| in order to find

∫_{−δ}^δ (ε cos θ) / (ε² cos² θ + (1 − ε)²(p − q)² sin² θ) dθ ≤ (4/|p − q|) ∫_0^{sin(δ)|p−q|/ε} 1/(1 + x²) dx ↑ 2π/|p − q| < ∞ as ε ↓ 0.

Combining these estimates shows the limit in Eq. (9.7) is finite, so the random walk is transient when p ≠ 1/2.
Example 9.12 (Unbiased R.W. in Z^d). Now suppose that P(ξ_i = ±e_j) = 1/(2d) for j = 1, 2, . . . , d and X_n = ξ_1 + · · · + ξ_n. In this case

E[e^{iθ·ξ_1}] = (1/d)[cos(θ_1) + · · · + cos(θ_d)],

and so according to Eq. (9.5) we find (as before)

E M_0 = lim_{α↑1} (1/2π)^d ∫_{[−π,π]^d} 1 / (1 − α(1/d)[cos(θ_1) + · · · + cos(θ_d)]) dθ
= (1/2π)^d ∫_{[−π,π]^d} 1 / (1 − (1/d)[cos(θ_1) + · · · + cos(θ_d)]) dθ  (by MCT and DCT).

Again the integrand is singular near θ = 0, where

1 − (1/d)[cos(θ_1) + · · · + cos(θ_d)] ≅ 1 − (1/d)[d − (1/2)‖θ‖²] = (1/(2d))‖θ‖².

Hence it follows that E M_0 < ∞ iff ∫_{‖θ‖≤R} (1/‖θ‖²) dθ < ∞ for R < ∞. The last integral is well known to be finite iff d ≥ 3, as can be seen by computing in polar coordinates. For example, when d = 2 we have

∫_{‖θ‖≤R} (1/‖θ‖²) dθ = 2π ∫_0^R (1/r²) r dr = 2π ln r |_0^R = 2π(ln R − ln 0) = ∞,

while when d = 3,

∫_{‖θ‖≤R} (1/‖θ‖²) dθ = 4π ∫_0^R (1/r²) r² dr = 4πR < ∞.

In this way we have shown that the unbiased random walk in Z and Z² is recurrent while it is transient in Z^d for d ≥ 3.
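The dichotomy can also be seen numerically from Eq. (9.5). The sketch below (our own illustration: a crude midpoint-rule evaluation of the integral) shows the d = 1 integral blowing up as α ↑ 1 while the d = 3 integral stays bounded:

```python
import numpy as np

def green(alpha, d, m):
    """Midpoint-rule estimate of (2 pi)^{-d} * integral over [-pi, pi]^d of
    1 / (1 - alpha * (cos t1 + ... + cos td)/d), i.e. Eq. (9.5) for the
    unbiased walk (whose characteristic function is real)."""
    pts = np.linspace(-np.pi, np.pi, m, endpoint=False) + np.pi / m
    grids = np.meshgrid(*([pts] * d), indexing="ij")
    phi = sum(np.cos(g) for g in grids) / d
    return (1.0 / (1.0 - alpha * phi)).mean()

g1 = [green(a, 1, 2000) for a in (0.9, 0.99, 0.999)]
g3 = [green(a, 3, 40) for a in (0.9, 0.99, 0.999)]
print(g1)   # increases without bound as alpha -> 1 (recurrence in d = 1)
print(g3)   # stays bounded (transience in d = 3)
```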
10 Detail Balance and MMC
In this chapter, let π : S → (0, 1) be a positive probability on a (large and complicated) finite set S. One important application of Markov chains is sampling (approximately) from such probability distributions even when π itself is computationally intractable. In order to “sample” from π we first try to find an irreducible Markov matrix P on S such that πP = π. Then from Theorem 7.22, in the form of Remark 7.24, if {X_n}_{n=0}^∞ is the Markov chain associated to P we have

π(f) := Σ_{x∈S} f(x) π(x) = lim_{N→∞} (1/N) Σ_{n=1}^N f(X_n)  P_ν – a.s.

Thus we expect that if we simulate {X_n}_{n=0}^∞, we will have

π(f) ≅ (1/N) Σ_{n=1}^N f(X_n) for large N.

[The expected error in the above approximation is roughly O(1/√N), as we will indicate below in Corollary 10.26.] The key to implementing this method is to construct P in such a way that the associated chain is relatively easy to simulate.
Remark 10.1 (A useless choice for P). Suppose $\pi:S\to(0,1)$ is a distribution we would like to sample as the stationary distribution of a Markov chain. One way to construct such a chain would be to make the Markov matrix $P$ with all rows given by $\pi$. In this case we will have $\nu P=\pi$ and so $P_{\nu}\left(X_{1}=x\right)=\pi\left(x\right)$ for all $x\in S$. Thus for this chain we would reach the invariant distribution in one step. However, simulating this chain is equivalent to sampling from $\pi$, which is what we were trying to figure out how to do in the first place! We need a simpler Markov matrix, $P$, with $\pi P=\pi$.
10.1 Detail Balance
The next lemma gives a useful criterion on a Markov matrix $P$ guaranteeing that $\pi$ is an invariant (= stationary) distribution of $P$.
Lemma 10.2 (Detail balance). In general, if we can find a distribution, $\pi$, satisfying the detail balance equation,
\[
\pi_{i}P_{ij}=\pi_{j}P_{ji}\text{ for all }i\neq j,\tag{10.1}
\]
then $\pi$ is a stationary (= invariant) distribution, i.e. $\pi P=\pi$.

Proof. First proof. Intuitively, Eq. (10.1) states that sites $i$ and $j$ are always exchanging sand back and forth at equal rates. Hence if all sites are doing this, the size of the pile of sand at each site must remain unchanged.

Second proof. Summing Eq. (10.1) on $i$ while making use of the fact that $\sum_{i}P_{ji}=1$ for all $j$ implies,
\[
\sum_{i}\pi_{i}P_{ij}=\sum_{i}\pi_{j}P_{ji}=\pi_{j}.
\]
Example 10.3 (Simple Random Walks on Graphs). Let $S$ be a finite or countable set and $G$ be an unoriented graph over $S$ with no isolated vertices. Let us write $x\sim y$ if $\langle x,y\rangle\in G$, i.e. $\langle x,y\rangle$ is an edge of $G$, and $x\nsim y$ otherwise. The SRW (simple random walk) on $G$ has (by definition) transition probabilities
\[
p\left(x,y\right)=\frac{1}{v_{x}}1_{\langle x,y\rangle\in G}\quad\text{where}\quad v_{x}:=\left|\langle x\rangle_{G}\right|=\left|\left\{y\in S:y\sim x\right\}\right|
\]
is the valence of $x$ (which is $\ge1$ since $x$ is not isolated). It is now easy to check that $\left\{v_{x}\right\}_{x\in S}$ satisfies the detailed balance equation,
\[
v_{x}p\left(x,y\right)=v_{x}\frac{1}{v_{x}}1_{\langle x,y\rangle\in G}=1_{\langle x,y\rangle\in G}=v_{y}p\left(y,x\right).
\]
In particular, this shows that when $S$ is finite, $\pi\left(x\right)=v_{x}/\sum_{y\in S}v_{y}$ is a stationary distribution for $p$.
Remark 10.4. If $\left(P,\pi\right)$ are in detail balance and $\pi:S\to(0,1)$ is strictly positive, then $P_{ij}\neq0$ iff $P_{ji}\neq0$. Thus,
\[
G:=\left\{\langle i,j\rangle\in S\times S:i\neq j\text{ and }P_{ij}\neq0\right\}
\]
defines an unoriented graph on $S$. Moreover it is easy to check that $P$ is irreducible iff $G$ is connected.
Example 10.5. Capitalizing on Example 10.3, consider the following problem. A Knight starts at a corner of a standard 8 × 8 chessboard, and on each step moves at random. How long, on average, does it take to return to its starting position?

If $i$ is a corner, the question is what is $\mathbb{E}_{i}\left[T_{i}\right]$, where the Markov process in question is the one described in the problem: it is SRW on a graph whose vertices are the 64 squares of the chessboard, and two squares are connected in the graph if a Knight can move from one to the other. (Note: if the Knight can move from $i$ to $j$, it can do so in exactly two ways: 1 up/down and 2 left/right, or 2 left/right and 1 up/down. So choosing uniformly among moves or among positions amounts to the same thing.) This graph is connected (as a little thought will show), and so the SRW is irreducible; therefore there is a unique stationary distribution. By Example 10.3, the stationary distribution $\pi$ is given by $\pi\left(i\right)=v_{i}/\sum_{j}v_{j}$, and so by the Ergodic Theorem, $\mathbb{E}_{i}\left[T_{i}\right]=\frac{1}{\pi\left(i\right)}=\sum_{j}v_{j}/v_{i}$.

A Knight moves 2 units in one direction and 1 unit in the other direction. Starting at a corner (1, 1), the Knight can only move to the positions (3, 2) and (2, 3), so $v_{i}=2$. To solve the problem, we need to calculate $v_{j}$ for all starting positions $j$ on the board. By square symmetry, we need only calculate the numbers for the upper 4 × 4 grid. An easy calculation then gives these valences as
\[
\begin{array}{cccc}
2 & 3 & 4 & 4\\
3 & 4 & 6 & 6\\
4 & 6 & 8 & 8\\
4 & 6 & 8 & 8
\end{array}
\]
The sum is 84, and so the sum over the full chessboard is $4\cdot84=336$. Thus, the expected number of steps for the Knight to return to the corner is
\[
\mathbb{E}_{i}\left[T_{i}\right]=\frac{1}{v_{i}}\sum_{j}v_{j}=\frac{1}{2}\cdot336=168.
\]
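The valence count above is easy to verify by brute force; the following sketch (the function and variable names are ours, not from the notes) tabulates the knight-graph valences and recovers $\mathbb{E}_{i}\left[T_{i}\right]=168$.

```python
# Valences of the knight-move graph on an 8 x 8 board (Example 10.5).
KNIGHT_MOVES = [(1, 2), (2, 1), (-1, 2), (-2, 1),
                (1, -2), (2, -1), (-1, -2), (-2, -1)]

def valence(i, j, n=8):
    """Number of legal knight moves from square (i, j) on an n x n board."""
    return sum(0 <= i + di < n and 0 <= j + dj < n
               for di, dj in KNIGHT_MOVES)

total = sum(valence(i, j) for i in range(8) for j in range(8))
corner = valence(0, 0)
print(total, corner, total // corner)  # 336 2 168
```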
10.2 Some uniform measure MMC examples

Let us continue the notation in Example 10.3, i.e. suppose $S$ is a finite set and $G$ is a connected unoriented graph on $S$ with no isolated points. The simple random walk on $G$ would allow us to sample from the distribution $\pi\left(x\right)=v_{x}/\sum_{y\in S}v_{y}$. This is already useful, as finding $Z:=\sum_{y\in S}v_{y}$ may be intractable and, moreover, for large $S$, $Z$ will be very large, causing problems of its own. In this section we are going to be interested in sampling the uniform distribution, $\pi\left(x\right)=Z^{-1}$ on $S$, where now $Z:=\#\left(S\right)$. The difficulty is going to be that the set $S$ is typically very large and complicated, so that $Z:=\#\left(S\right)$ is large and perhaps not even known how to compute.

Let $K\ge\max_{x\in S}v_{x}$ be an upper bound on the valences of $G$ and then define
\[
p\left(x,y\right):=\frac{1}{K}1_{\langle x,y\rangle\in G}+c\left(x\right)1_{y=x}
\]
where $c\left(x\right)$ is chosen so that
\[
1=\sum_{y\in S}p\left(x,y\right)=\frac{1}{K}v_{x}+c\left(x\right)\implies c\left(x\right):=1-\frac{1}{K}v_{x}\ge0.
\]
Then clearly $p\left(x,y\right)$ is a symmetric irreducible Markov matrix and therefore the invariant distribution of $p$ is $\pi\left(x\right)=\frac{1}{\#\left(S\right)}$. Thus if $\left\{X_{n}\right\}_{n=0}^{\infty}$ is the Markov chain associated to $p$, we will have
\[
\frac{1}{\left|S\right|}\sum_{x\in S}f\left(x\right)=\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}f\left(X_{n}\right).
\]
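The lazy kernel $p\left(x,y\right)=\frac{1}{K}1_{\langle x,y\rangle\in G}+c\left(x\right)1_{y=x}$ can be checked numerically; the small path graph below is our own illustrative choice, not from the notes.

```python
import numpy as np

# Build the lazy kernel p(x,y) = (1/K) 1_{<x,y> in G} + c(x) 1_{y=x}
# on a 4-vertex path graph (connected, no isolated points).
edges = [(0, 1), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for x, y in edges:
    A[x, y] = A[y, x] = 1.0

K = int(A.sum(axis=1).max())              # K >= max valence (here K = 2)
P = A / K + np.diag(1.0 - A.sum(axis=1) / K)

assert np.allclose(P, P.T)                # symmetric
assert np.allclose(P.sum(axis=1), 1.0)    # Markov matrix
pi = np.full(n, 1.0 / n)
assert np.allclose(pi @ P, pi)            # uniform distribution is invariant
```

Since $p$ is symmetric and stochastic, it is doubly stochastic, which is exactly why the uniform distribution is invariant.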
Example 10.6. Suppose that $S$ is the set of $D\times D$ matrices with entries in $\left\{0,1\right\}$. We say that $\langle A,B\rangle\in G$ if $A,B\in S$ and $A$ and $B$ differ at exactly one location, i.e. $\left(A+B\right)\operatorname{mod}2$ is a matrix in $S$ with exactly one non-zero entry. In this case $v_{A}=D^{2}$ and the simple random walk on $S$ will sample from the uniform distribution. The associated Markov matrix is
\[
p\left(A,B\right)=\frac{1}{D^{2}}1_{A\sim B}.
\]
It is relatively easy to sample from this chain. To describe how to do this, for $i,j\in\left\{1,\dots,D\right\}$, let $E\left(i,j\right)\in S$ denote the matrix with 1 at the $\left(i,j\right)^{\text{th}}$ position and zeros at all other positions. To simulate the associated Markov chain, $\left\{X_{n}\right\}_{n=0}^{\infty}$, at each $n$ choose $i,j\in\left\{1,\dots,D\right\}$ uniformly at random so that $i$ is independent of $j$ and both are independent of the previous choices made. Then if $X_{n}=A\in S$ we take $X_{n+1}=\left(A+E\left(i,j\right)\right)\operatorname{mod}2$.
The next example gives a variant of this example.
Example 10.7 (HCC). Let us continue the notation in Example 10.6 and further let $S_{HC}$ denote those $A\in S$ such that there are no adjacent 1's in the matrix $A$. To be more precise, we are requiring for all $i,j\in\left\{1,\dots,D\right\}$ that
\[
A_{ij}\cdot\left(A_{i-1,j}+A_{i+1,j}+A_{i,j-1}+A_{i,j+1}\right)=0
\]
where by convention $A_{i,j}=0$ if either $i$ or $j$ is in $\left\{0,D+1\right\}$. We refer to $S_{HC}$ as the Hard Core Configuration space. This is a simple model of the configuration of an ideal gas (the 1s represent gas molecules, the 0s empty space). We would like to pick a random hard-core configuration – meaning sample from the uniform distribution on $S_{HC}$, i.e. each $A\in S_{HC}$ is to be chosen with weight $1/\left|S_{HC}\right|$. The problem is that $\left|S_{HC}\right|$ is unknown, even to exponential order, for large $D$. A simple upper bound is $2^{D^{2}}=\#\left(S\right)$. It is conjectured that $\left|S_{HC}\right|\sim\beta^{D^{2}}$ for some $\beta\in(1,2)$, but no one even has a good guess as to what $\beta$ is. Nevertheless, we can use the Markov chain associated to the Markov matrix,
\[
p_{HC}\left(A,B\right)=\frac{1}{D^{2}}1_{A\sim B}+c\left(A\right)1_{A=B},
\]
in order to sample from this distribution. Proposition 10.9 below will justify the following algorithm for sampling from this chain.

Sampling Algorithm. Suppose we have generated $\left(X_{0},X_{1},\dots,X_{n}\right)$ and $X_{n}=A\in S_{HC}$. To construct $X_{n+1}$:

1. Choose $i,j\in\left\{1,\dots,D\right\}$ uniformly at random so that $i$ is independent of $j$ and both are independent of the previous choices made.
2. Let $B:=\left(A+E\left(i,j\right)\right)\operatorname{mod}2$ and set
\[
X_{n+1}=\begin{cases}
B & \text{if }B\in S_{HC}\\
A & \text{otherwise.}
\end{cases}
\]
Note that $B\in S_{HC}$ iff
\[
B_{ij}\cdot\left(B_{i-1,j}+B_{i+1,j}+B_{i,j-1}+B_{i,j+1}\right)=0,
\]
where again $B_{kl}=0$ if either $k$ or $l$ is in $\left\{0,D+1\right\}$.
3. Loop on $n$.

You are asked to implement the algorithm in Example 10.7 for $D=50$ in one of your homework problems and use your simulations to estimate the probability that $A_{25,25}=1$ when $A$ is chosen uniformly at random from $S_{HC}$.
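The sampling algorithm above can be sketched as follows; here we use a small $D$ for illustration (the homework asks for $D=50$), and the estimate printed at the end carries Monte Carlo error.

```python
import random

def is_hard_core(A, D):
    """Check that no two 1's in A are horizontally or vertically adjacent."""
    for i in range(D):
        for j in range(D):
            if A[i][j] and ((i + 1 < D and A[i + 1][j]) or
                            (j + 1 < D and A[i][j + 1])):
                return False
    return True

def step(A, D, rng):
    """One step of the chain: propose flipping one uniformly chosen entry."""
    i, j = rng.randrange(D), rng.randrange(D)
    A[i][j] ^= 1                      # B = (A + E(i,j)) mod 2
    # accept only if B is still hard-core; it suffices to check around (i, j)
    if A[i][j] and any(A[x][y]
                       for x, y in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
                       if 0 <= x < D and 0 <= y < D):
        A[i][j] ^= 1                  # reject: stay at A
    return A

rng = random.Random(0)
D = 6
A = [[0] * D for _ in range(D)]       # start from the empty configuration
hits, N = 0, 20000
for n in range(N):
    A = step(A, D, rng)
    hits += A[D // 2][D // 2]
print("estimate of P(center = 1):", hits / N)
```

Only the flipped site needs to be re-checked, since $A$ was already hard-core everywhere else; this is what keeps each step $O(1)$.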
The easy proofs of the next two results are left to the reader.

Lemma 10.8 (Conditional detail balance). Suppose that $\left(\pi,P\right)$ satisfy detail balance and $\tilde{S}$ is a subset of $S$. Let
\[
\tilde{\pi}\left(x\right)=\pi\left(x|\tilde{S}\right)=\frac{\pi\left(x\right)}{\sum_{y\in\tilde{S}}\pi\left(y\right)}
\]
be the conditional distribution of $\pi$ given $\tilde{S}$. For $x,y\in\tilde{S}$ let,
\[
\tilde{p}\left(x,y\right):=\begin{cases}
p\left(x,y\right) & \text{if }x\neq y\\
c\left(x\right) & \text{if }x=y
\end{cases}
\quad\text{where}\quad c\left(x\right):=1-\sum_{y\in\tilde{S}\setminus\left\{x\right\}}p\left(x,y\right).
\]
Then $\tilde{P}$ is a Markov matrix on $\tilde{S}$ and $(\tilde{\pi},\tilde{P})$ are in detail balance. In particular $\tilde{\pi}$ is an invariant distribution for $\tilde{P}$.
Proposition 10.9 (Conditional sampling). Let us continue the notation in Lemma 10.8 and further suppose that $\left\{f_{n}:S\to S\right\}_{n=0}^{\infty}$ are random functions as in Theorem 5.8 so that $X_{n+1}=f_{n}\left(X_{n}\right)$ generates the Markov chain associated to $P$. If we let $\tilde{f}_{n}:\tilde{S}\to\tilde{S}$ be defined by
\[
\tilde{f}_{n}\left(x\right)=\begin{cases}
f_{n}\left(x\right) & \text{if }f_{n}\left(x\right)\in\tilde{S}\\
x & \text{otherwise,}
\end{cases}
\]
then $\{\tilde{f}_{n}\}_{n=0}^{\infty}$ are independent random functions which generate (in the sense that $\tilde{X}_{n+1}=\tilde{f}_{n}(\tilde{X}_{n})$) the Markov chain, $\{\tilde{X}_{n}\}_{n=0}^{\infty}$, associated to $\tilde{P}$.
Example 10.10. Consider the problem of graph colorings. Let $G=\left(V,E\right)$ be a finite graph. A $q$-coloring of $G$ (with $q\in\mathbb{N}$) is a function $f:V\to\left\{1,2,\dots,q\right\}$ with the property that, if $u,v\in V$ with $u\sim v$, then $f\left(u\right)\neq f\left(v\right)$. The set of $q$-colorings is very hard to count (especially for small $q$), so again we cannot directly sample a uniformly random $q$-coloring. Instead, define a Markov chain on the state space of all $q$-colorings $f$, where for $f\neq g$
\[
p\left(f,g\right)=\begin{cases}
\frac{1}{q\left|V\right|} & \text{if }f\text{ and }g\text{ differ at exactly one vertex,}\\
0 & \text{otherwise.}
\end{cases}
\]
Again: since there are at most $q$ different $q$-colorings of $G$ that differ from $f$ at a given vertex, and there are $\left|V\right|$ vertices, we have $\sum_{g\neq f}p\left(f,g\right)\le1$, and so setting $p\left(f,f\right)=1-\sum_{g\neq f}p\left(f,g\right)$ yields a stochastic matrix $p$ which is the transition kernel of a Markov chain. It is evidently symmetric, and by considerations like those in Example 10.7 the chain is irreducible and aperiodic (so long as $q$ is large enough for any $q$-colorings to exist!); hence, the stationary distribution is uniform, and this Markov chain converges to the uniform distribution.

To simulate the corresponding Markov chain: given $X_{n}=f$, choose one of the $\left|V\right|$ vertices, $v$, uniformly at random and one of the $q$ colors, $k$, uniformly at random. If the new function $g$ defined by $g\left(v\right)=k$ and $g\left(w\right)=f\left(w\right)$ for $w\neq v$ is a $q$-coloring of the graph, set $X_{n+1}=g$; otherwise, keep $X_{n+1}=X_{n}=f$. This process has the transition probabilities listed above, and so running it for large $n$ gives an approximately uniformly random $q$-coloring.
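The simulation loop just described can be sketched as follows; the small 4-cycle graph and the choice $q=3$ are our own illustrative assumptions.

```python
import random

def is_proper(coloring, edges):
    """A coloring is proper if adjacent vertices get different colors."""
    return all(coloring[u] != coloring[v] for u, v in edges)

def step(coloring, edges, q, rng):
    """Propose recoloring a uniform vertex with a uniform color."""
    v = rng.randrange(len(coloring))
    k = rng.randrange(1, q + 1)
    proposal = list(coloring)
    proposal[v] = k
    # accept only if the proposal is still a proper q-coloring
    return proposal if is_proper(proposal, edges) else coloring

rng = random.Random(1)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]     # a 4-cycle
q = 3
coloring = [1, 2, 1, 2]                      # a valid starting 3-coloring
for _ in range(5000):
    coloring = step(coloring, edges, q, rng)
assert is_proper(coloring, edges)            # the chain never leaves the state space
```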
10.3 The Metropolis-Hastings Algorithm

In Examples 10.6, 10.7, and 10.10, we constructed Markov chains with symmetric kernels, therefore having the uniform distribution as stationary, and converging to it. This is often a good way to simulate uniform samples. But how can we sample from a non-uniform distribution?

Let $S$ be a finite set, and let $\pi$ be a (strictly positive) distribution on $S$. First, construct an irreducible Markov chain on $S$ which has symmetric transition probabilities $q\left(i,j\right)=q\left(j,i\right)$. This chain is thus reversible, and has the uniform distribution as its stationary state; but $\pi$ is not necessarily uniform. Instead, we construct a new Markov chain with transition kernel
\[
p\left(i,j\right)=q\left(i,j\right)\min\left\{1,\frac{\pi\left(j\right)}{\pi\left(i\right)}\right\}\text{ for }i\neq j,\qquad p\left(i,i\right)=1-\sum_{j\neq i}p\left(i,j\right).
\]
By construction we have $p\left(i,j\right)\in\left[0,1\right]$, $p\left(i,j\right)=0$ iff $q\left(i,j\right)=0$, and
\[
\pi\left(i\right)p\left(i,j\right)=q\left(i,j\right)\min\left\{\pi\left(i\right),\pi\left(j\right)\right\}=q\left(j,i\right)\min\left\{\pi\left(j\right),\pi\left(i\right)\right\}=\pi\left(j\right)p\left(j,i\right),
\]
so $\pi$ satisfies the detailed balance condition for $p$. Consequently, $\pi$ is the stationary distribution for $p$. Moreover, it follows that $p$ is irreducible (since $q$ is) and therefore
\[
\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N}1_{X_{n}=j}=\pi\left(j\right)\text{ a.s.,}
\]
and, assuming aperiodicity, we will have $P_{\nu}\left(X_{n}=j\right)\to\pi\left(j\right)$ for all $j\in S$ independent of the starting distribution.
Simulations. Let $f:S\to S$ be a random function such that $P\left(f\left(i\right)=j\right)=q\left(i,j\right)$ for all $i,j\in S$. Let $U\in\left[0,1\right]$ be an independent uniform random variable and for $\alpha\ge0$, let $U_{\alpha}=1_{U\le\alpha}$. We then define,
\[
\tilde{f}\left(i\right):=\begin{cases}
f\left(i\right) & \text{if }U_{\pi\left(f\left(i\right)\right)/\pi\left(i\right)}=1\\
i & \text{if }U_{\pi\left(f\left(i\right)\right)/\pi\left(i\right)}=0
\end{cases}
=\begin{cases}
f\left(i\right) & \text{if }U\le\frac{\pi\left(f\left(i\right)\right)}{\pi\left(i\right)}\\
i & \text{if }U>\frac{\pi\left(f\left(i\right)\right)}{\pi\left(i\right)}.
\end{cases}
\]
Then for $j\neq i$,
\[
\left\{\tilde{f}\left(i\right)=j\right\}=\left\{f\left(i\right)=j,\ U\le\frac{\pi\left(f\left(i\right)\right)}{\pi\left(i\right)}\right\}
\]
so that
\[
p\left(i,j\right)=P\left(\tilde{f}\left(i\right)=j\right)=P\left(f\left(i\right)=j\right)P\left(U\le\frac{\pi\left(j\right)}{\pi\left(i\right)}\right)=q\left(i,j\right)\cdot\left[\frac{\pi\left(j\right)}{\pi\left(i\right)}\wedge1\right].
\]
For $j=i$ we have
\[
\left\{\tilde{f}\left(i\right)=i\right\}=\bigcup_{j\neq i}\left\{f\left(i\right)=j,\ U>\frac{\pi\left(j\right)}{\pi\left(i\right)}\right\}\cup\left\{f\left(i\right)=i\right\}
\]
and so
\[
p\left(i,i\right)=P\left(\tilde{f}\left(i\right)=i\right)=\sum_{j\neq i}q\left(i,j\right)\left[1-\min\left\{\frac{\pi\left(j\right)}{\pi\left(i\right)},1\right\}\right]+q\left(i,i\right)=1-\sum_{j\neq i}q\left(i,j\right)\min\left\{\frac{\pi\left(j\right)}{\pi\left(i\right)},1\right\}=1-\sum_{j\neq i}p\left(i,j\right),
\]
from which we see that $p\left(i,j\right)$ is a Markov matrix.

The chain dictated by $p\left(i,j\right)$ may thus be constructed by the following rules. Suppose we are currently at site $i$ and the proposal $Y_{1}$ is drawn according to $q\left(i,\cdot\right)$. Move to site $Y_{1}$ for sure if $\pi\left(Y_{1}\right)\ge\pi\left(i\right)$; otherwise flip a biased coin with probability of heads being $\pi\left(Y_{1}\right)/\pi\left(i\right)$, and move to $Y_{1}$ if the coin is heads, otherwise stay put at $i$. Put another way, spin the $q\left(i,\cdot\right)$ spinner; if it lands at $i$ stay at $i$, and if it lands at $j\neq i$ then move to site $j$ with probability $\min\left(\frac{\pi\left(j\right)}{\pi\left(i\right)},1\right)$.
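The Metropolis construction can be verified numerically; the target $\pi$ and the (symmetric, uniform) proposal $q$ below are our own small example.

```python
import numpy as np

# Build the Metropolis kernel p from a symmetric proposal q and target pi.
S = 5
pi = np.array([0.1, 0.3, 0.2, 0.25, 0.15])      # target distribution
q = np.full((S, S), 1.0 / S)                    # symmetric proposal kernel

# p(i,j) = q(i,j) min{1, pi(j)/pi(i)} for i != j; diagonal absorbs the slack
p = q * np.minimum(1.0, pi[None, :] / pi[:, None])
np.fill_diagonal(p, 0.0)
np.fill_diagonal(p, 1.0 - p.sum(axis=1))

assert np.allclose(p.sum(axis=1), 1.0)                    # Markov matrix
assert np.allclose(pi[:, None] * p, (pi[:, None] * p).T)  # detail balance
assert np.allclose(pi @ p, pi)                            # pi is stationary
```

Note that only the ratios $\pi\left(j\right)/\pi\left(i\right)$ enter the construction, which is why Metropolis-Hastings works even when $\pi$ is known only up to an intractable normalizing constant.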
There are many variations on the Metropolis-Hastings algorithm, all withthe same basic ingredients and goal: to use a (collection of correlated) Markovchain(s), designed to have a given stationary distribution π, to sample from πapproximately. They are collectively known as Markov chain Monte Carlosimulation methods. One place they are especially effective is in calculatinghigh-dimensional integrals (i.e. expectations), a common problem in multivari-ate statistics.
10.4 The Linear Algebra associated to detail balance

In this section suppose that $S$ is a finite set, $\nu:S\to(0,\infty)$ is a given probability on $S$, and $p:S\times S\to\left[0,1\right]$ is a Markov matrix. We let $\ell^{2}\left(\nu\right)$ denote the vector space of functions $f:S\to\mathbb{R}$ equipped with the inner product,
\[
\langle f,g\rangle_{\nu}=\sum_{x\in S}f\left(x\right)g\left(x\right)\nu\left(x\right)\quad\forall f,g\in\ell^{2}\left(\nu\right).
\]
We further let $P:\ell^{2}\left(\nu\right)\to\ell^{2}\left(\nu\right)$ be defined by
\[
Pf\left(x\right):=\sum_{y\in S}p\left(x,y\right)f\left(y\right).
\]
The only statement we really need from the next three results is the part of Corollary 10.13 which states: if $\left(\pi,P\right)$ satisfy detail balance, then $P$ is a symmetric operator on $\ell^{2}\left(\pi\right)$.
Lemma 10.11 (Adjoint of P). The adjoint of $P$ (relative to $\langle\cdot,\cdot\rangle_{\nu}$) is given by
\[
\left(P^{*}g\right)\left(x\right)=\sum_{y\in S}q_{\nu}\left(x,y\right)g\left(y\right)\quad\text{where}\tag{10.2}
\]
\[
q_{\nu}\left(x,y\right)=\frac{\nu\left(y\right)}{\nu\left(x\right)}p\left(y,x\right).\tag{10.3}
\]
Proof. This lemma is proved as a consequence of the simple identities,
\[
\langle Pf,g\rangle_{\nu}=\sum_{x\in S}\nu\left(x\right)g\left(x\right)\sum_{y\in S}p\left(x,y\right)f\left(y\right)=\sum_{x,y\in S}g\left(x\right)\frac{\nu\left(x\right)}{\nu\left(y\right)}p\left(x,y\right)f\left(y\right)\nu\left(y\right)=\sum_{y\in S}\left[\sum_{x\in S}q_{\nu}\left(y,x\right)g\left(x\right)\right]f\left(y\right)\nu\left(y\right)=\langle f,P^{*}g\rangle_{\nu}.
\]

Corollary 10.12. Keeping the notation in Lemma 10.11, the function $q_{\nu}\left(x,y\right)$ is a Markov matrix iff $\nu$ is an invariant measure for $P$, i.e. $\nu P=\nu$.

Proof. The matrix $q_{\nu}$ is a Markov matrix iff for all $x\in S$,
\[
1=\sum_{y\in S}q_{\nu}\left(x,y\right)=\sum_{y\in S}\frac{\nu\left(y\right)}{\nu\left(x\right)}p\left(y,x\right)=\frac{1}{\nu\left(x\right)}\cdot\left(\nu P\right)\left(x\right),
\]
from which the result immediately follows.

The next corollary follows immediately from Eq. (10.3).

Corollary 10.13. Keeping the notation in Lemma 10.11, we have $P=P^{*}$ iff $\nu$ and $p$ satisfy the detail balance equation.
10.5 Convergence rates of reversible chains

In order for the Metropolis-Hastings algorithm (or other MCMC methods) to be effective, the Markov chain designed to converge to the given distribution must converge reasonably fast. The term used to measure this rate is the mixing time of the chain. (To make this precise, we need a precise measure of closeness to the stationary distribution; then we can ask how long before the distance to stationarity is less than $\varepsilon>0$.)

Notation 10.14 If $\nu$ is a probability on $S$ and $f:S\to\mathbb{R}$ is a function, let
\[
\nu\left(f\right):=\mathbb{E}_{\nu}f=\sum_{x\in S}f\left(x\right)\nu\left(x\right).
\]

Assumption 10.15 For this section we assume $S$ is a finite set, $p:S\times S\to\left[0,1\right]$ is an irreducible Markov matrix,¹ and $\pi:S\to\left[0,1\right]$ is the unique invariant distribution for $p$, which we further assume satisfies the detail balance equation,
\[
\pi\left(x\right)p\left(x,y\right)=\pi\left(y\right)p\left(y,x\right).
\]
Theorem 10.16 (Spectral Theorem). If $P$ is a Markov matrix and $\pi$ is a distribution satisfying detail balance, then $Pf\left(x\right)=\sum_{y\in S}p\left(x,y\right)f\left(y\right)$ has an orthonormal (in the $\ell^{2}\left(\pi\right)$ – inner product) basis of eigenvectors, $\left\{f_{n}\right\}_{n=1}^{\left|S\right|}\subset\ell^{2}\left(\pi\right)$, with corresponding eigenvalues
\[
\sigma\left(P\right):=\left\{\lambda_{n}\right\}_{n=1}^{\left|S\right|}\subset\mathbb{R}.
\]
Proof. From Corollary 10.13 we know that $P$ is a symmetric operator on $\ell^{2}\left(\pi\right)$. The result is now a consequence of the spectral theorem from linear algebra.

As we know $P1=1$, let us agree to take $f_{1}=1$ and $\lambda_{1}=1$. Our next goal is to describe the structure of $\sigma\left(P\right)$ in more detail.
Example 10.17. The Markov matrix,
\[
P=\begin{bmatrix}0 & 1\\1 & 0\end{bmatrix},
\]
is irreducible, $\pi=\left[\frac{1}{2},\frac{1}{2}\right]$,
\[
f_{1}=\begin{bmatrix}1\\1\end{bmatrix}\quad\text{and}\quad f_{2}=\begin{bmatrix}1\\-1\end{bmatrix}
\]
with $\lambda_{1}=1$ and $\lambda_{2}=-1$. The fact that $\lambda_{2}=-1$ is a reflection of the fact that $P$ has period 2.
Example 10.18. The Markov matrix
\[
P=\begin{bmatrix}
0 & 1 & 0 & 0\\
1/2 & 0 & 0 & 1/2\\
0 & 0 & 0 & 1\\
0 & 1/2 & 1/2 & 0
\end{bmatrix}
\]
is irreducible, 2-periodic, and has
\[
\pi=\begin{bmatrix}\frac{1}{6} & \frac{1}{3} & \frac{1}{6} & \frac{1}{3}\end{bmatrix}
\]
as its invariant distribution, which satisfies detail balance. For example,
\[
\pi_{2}P_{24}=\frac{1}{3}\cdot\frac{1}{2}=\pi_{4}P_{42}.
\]
In this case, $\sigma\left(P\right)=\left\{-\frac{1}{2},1,\frac{1}{2},-1\right\}$.

¹ Thus we know the associated chain is irreducible and $p$ has a unique strictly positive invariant distribution, $\pi$, on $S$.
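The claims of Example 10.18 are easy to confirm numerically, as the following sketch shows.

```python
import numpy as np

# Check invariance, detail balance, and the spectrum of Example 10.18.
P = np.array([[0, 1, 0, 0],
              [0.5, 0, 0, 0.5],
              [0, 0, 0, 1],
              [0, 0.5, 0.5, 0]])
pi = np.array([1/6, 1/3, 1/6, 1/3])

assert np.allclose(pi @ P, pi)                            # pi P = pi
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)  # detail balance

eigvals = np.sort(np.linalg.eigvals(P).real)
assert np.allclose(eigvals, [-1, -0.5, 0.5, 1])           # real spectrum
```

The eigenvalues are real, as Theorem 10.16 predicts for a chain in detail balance, and $-1\in\sigma\left(P\right)$ reflects the 2-periodicity.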
Example 10.19. The Markov matrix,
\[
P=\begin{bmatrix}0 & 1 & 0\\0 & 0 & 1\\1 & 0 & 0\end{bmatrix},
\]
describes the chain which moves around the circle, $1\to2\to3\to1$. It is irreducible, 3-periodic,
\[
\sigma\left(P\right)=\left\{-\frac{1}{2}-\frac{\sqrt{3}}{2}i,\ -\frac{1}{2}+\frac{\sqrt{3}}{2}i,\ 1\right\},
\]
which are the third roots of unity, and $\pi=\left[1/3,1/3,1/3\right]$ is its invariant measure. This matrix does not satisfy detail balance.
Theorem 10.20. Keeping the notation and Assumption 10.15 above in force, $\lambda=1$ has multiplicity one and
\[
\alpha:=\max\left\{\left|\lambda\right|:\lambda\in\sigma\left(P\right)\setminus\left\{\pm1\right\}\right\}<1.
\]
Moreover, if $P$ is aperiodic then $-1\notin\sigma\left(P\right)$.

Proof. First suppose that $P$ is aperiodic. If $f\in\ell^{2}\left(\pi\right)$ is a unit vector such that $Pf=\lambda f$ and $f$ is perpendicular to 1, then
\[
0=\langle f,1\rangle_{\pi}=\pi\left(f\right)=\sum_{x\in S}f\left(x\right)\pi\left(x\right)
\]
forces $f\left(i\right)<0$ for some $i\in S$ and $f\left(j\right)>0$ for another $j\in S$. Since $P$ is aperiodic, we know there exists $n$ such that $P_{x,y}^{n}>0$ for all $x,y\in S$. Therefore it follows that
\[
\left|\lambda\right|^{n}\left|f\left(x\right)\right|=\left|\lambda^{n}f\left(x\right)\right|=\left|\left(P^{n}f\right)\left(x\right)\right|=\left|\sum_{y\in S}P_{xy}^{n}f\left(y\right)\right|<\sum_{y\in S}P_{xy}^{n}\left|f\left(y\right)\right|.
\]
Multiplying this equation by $\pi\left(x\right)$ and summing the result gives
\[
\left|\lambda\right|^{n}\pi\left(\left|f\right|\right)<\pi\left(\left|f\right|\right)\implies\left|\lambda\right|^{n}<1\implies\left|\lambda\right|<1,
\]
and this completes the proof in the aperiodic case.

General Case. We now drop the assumption that $P$ is aperiodic. Let $N\in\mathbb{N}$ and set
\[
Q:=\frac{1}{N}\sum_{n=1}^{N}P^{n}
\]
so that
\[
Qf\left(x\right)=\sum_{y\in S}q\left(x,y\right)f\left(y\right)\quad\text{where}\quad q\left(x,y\right)=\frac{1}{N}\sum_{n=1}^{N}P_{x,y}^{n}.
\]
As $P$ is irreducible, we know that for $N$ sufficiently large, $q\left(x,y\right)>0$ for all $x,y\in S$, so that $Q$ is an aperiodic irreducible Markov matrix.

Now suppose that $f\in\ell^{2}\left(\pi\right)$ is a unit vector such that $Pf=\lambda f$ and $f$ is perpendicular to 1. Then
\[
Qf=\frac{1}{N}\sum_{n=1}^{N}P^{n}f=\frac{1}{N}\sum_{n=1}^{N}\lambda^{n}f=c_{N}f\tag{10.4}
\]
where
\[
c_{N}=\begin{cases}
\frac{\lambda}{N}\cdot\frac{\lambda^{N}-1}{\lambda-1} & \text{if }\lambda\neq1\\
1 & \text{if }\lambda=1.
\end{cases}\tag{10.5}
\]
From the aperiodic case above applied to $Q$ we may conclude that $\left|c_{N}\right|<1$. On the other hand, from Eq. (10.5) we see that $\left|c_{N}\right|\to\infty$ if $\left|\lambda\right|>1$, which would contradict $\left|c_{N}\right|<1$, and so we may conclude $\left|\lambda\right|\le1$. If $\lambda=1$ we would have $c_{N}=1$, which again leads to a contradiction. Thus we may conclude that either $\left|\lambda\right|<1$ or $\lambda=-1$.
Notation 10.21 Let $\Pi_{\lambda}$ denote orthogonal projection onto the $\lambda$ – eigenspace of $P$ for each $\lambda\in\sigma\left(P\right)$. We then have,
\[
P=\Pi_{1}-\Pi_{-1}+R\quad\text{where}\quad R=\sum_{\lambda\in\sigma\left(P\right)\setminus\left\{\pm1\right\}}\lambda\Pi_{\lambda}.
\]
We further let
\[
\alpha:=\max\left\{\left|\lambda\right|:\lambda\in\sigma\left(P\right)\setminus\left\{\pm1\right\}\right\}<1.
\]
With this notation we have for $k\in\mathbb{N}$ that
\[
P^{k}=\Pi_{1}+\left(-1\right)^{k}\Pi_{-1}+R^{k}\tag{10.6}
\]
where $\left\|R^{k}\right\|\le\alpha^{k}$. Moreover if $P$ is aperiodic, then $\Pi_{-1}=0$. In all cases,
\[
\Pi_{1}f=\langle f,1\rangle_{\pi}1=\pi\left(f\right)1.
\]
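Eq. (10.6) can be checked numerically on the chain of Example 10.18 by diagonalizing $P$ in $\ell^{2}\left(\pi\right)$ via the symmetrization $D^{1/2}PD^{-1/2}$ with $D=\operatorname{diag}\left(\pi\right)$; this is a sketch, not part of the notes.

```python
import numpy as np

P = np.array([[0, 1, 0, 0],
              [0.5, 0, 0, 0.5],
              [0, 0, 0, 1],
              [0, 0.5, 0.5, 0]])
pi = np.array([1/6, 1/3, 1/6, 1/3])
D = np.diag(np.sqrt(pi))
Dinv = np.diag(1 / np.sqrt(pi))

lam, V = np.linalg.eigh(D @ P @ Dinv)    # symmetric by detail balance
# spectral projections of P, conjugated back from the symmetric picture
proj = {l: Dinv @ np.outer(V[:, k], V[:, k]) @ D
        for k, l in enumerate(np.round(lam, 12))}
alpha = max(abs(l) for l in proj if abs(abs(l) - 1) > 1e-9)   # here 1/2

for k in [1, 2, 5]:
    Rk = np.linalg.matrix_power(P, k) - proj[1.0] - (-1) ** k * proj[-1.0]
    # the l^2(pi) operator norm of R^k is at most alpha^k
    assert np.linalg.norm(D @ Rk @ Dinv, 2) <= alpha ** k + 1e-9
```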
Corollary 10.22. Let us keep Assumption 10.15. If $\nu:S\to\left[0,1\right]$ is a probability on $S$ and $f:S\to\mathbb{R}$ is any function, then
\[
\mathbb{E}_{\nu}\left[f\left(X_{n}\right)\right]=\pi\left(f\right)+\left(-1\right)^{n}\nu\left(\Pi_{-1}f\right)+\left\|f\right\|\cdot O_{\nu}\left(\alpha^{n}\right).\tag{10.7}
\]
In particular if $P$ is aperiodic, then
\[
\mathbb{E}_{\nu}\left[f\left(X_{n}\right)\right]=\pi\left(f\right)+\left\|f\right\|\cdot O_{\nu}\left(\alpha^{n}\right).
\]
Proof. Let $\rho\left(x\right):=\nu\left(x\right)/\pi\left(x\right)$ where $\pi$ is the invariant distribution for $P$. By the properties of Markov chains we have,
\[
\mathbb{E}_{\nu}\left[f\left(X_{n}\right)\right]=\sum_{x,y\in S}\nu\left(x\right)P_{x,y}^{n}f\left(y\right)=\sum_{x\in S}\nu\left(x\right)\left(P^{n}f\right)\left(x\right)=\sum_{x\in S}\left(P^{n}f\right)\left(x\right)\rho\left(x\right)\pi\left(x\right)=\langle P^{n}f,\rho\rangle_{\pi}=\langle f,P^{n}\rho\rangle_{\pi}.
\]
Making use of Eq. (10.6) we find,
\[
\langle f,P^{n}\rho\rangle_{\pi}=\langle f,\left(\Pi_{1}+\left(-1\right)^{n}\Pi_{-1}+R^{n}\right)\rho\rangle_{\pi}=\pi\left(f\right)+\left(-1\right)^{n}\langle\Pi_{-1}f,\rho\rangle_{\pi}+\left\|f\right\|O\left(\alpha^{n}\right)=\pi\left(f\right)+\left(-1\right)^{n}\nu\left(\Pi_{-1}f\right)+\left\|f\right\|O\left(\alpha^{n}\right).
\]
Remark 10.23. The above corollary suggests that, in the aperiodic case, in order to estimate $\pi\left(f\right)$ a good strategy is to choose $n$ fairly large so that the error estimate
\[
\left|\mathbb{E}_{\nu}\left[f\left(X_{n}\right)\right]-\pi\left(f\right)\right|\le C_{1}\alpha^{n}
\]
forces $\mathbb{E}_{\nu}\left[f\left(X_{n}\right)\right]\cong\pi\left(f\right)$. We then estimate $\mathbb{E}_{\nu}\left[f\left(X_{n}\right)\right]$ using the law of large numbers, i.e.
\[
\mathbb{E}_{\nu}\left[f\left(X_{n}\right)\right]\cong\frac{1}{N}\sum_{k=1}^{N}f\left(x_{n}\left(k\right)\right)
\]
where $\left\{x_{n}\left(k\right)\right\}_{k=1}^{N}$ are $N$ independent samples of $X_{n}$. We then expect to have
\[
\left|\pi\left(f\right)-\frac{1}{N}\sum_{k=1}^{N}f\left(x_{n}\left(k\right)\right)\right|\le C_{1}\alpha^{n}+C_{2}\frac{1}{\sqrt{N}}.
\]
The typical difficulties here are that $C_{1}$, $\alpha$, and the random constant $C_{2}$ are typically not easy to estimate.
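The strategy of Remark 10.23 can be sketched on a toy two-state chain of our own choosing (here $\alpha=1/2$, so $n=25$ burn-in steps make the bias negligible, and the remaining error is the $C_{2}/\sqrt{N}$ sampling term).

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.7, 0.3],
              [0.2, 0.8]])
pi = np.array([0.4, 0.6])           # stationary: pi P = pi
assert np.allclose(pi @ P, pi)

def run_chain(n, x0=0):
    """Run the chain n steps from x0 and return the final state."""
    x = x0
    for _ in range(n):
        x = rng.choice(2, p=P[x])
    return x

n, N = 25, 4000                      # n = burn-in, N independent samples
samples = np.array([run_chain(n) for _ in range(N)])
estimate = samples.mean()            # estimates pi(f) for f(x) = x, i.e. 0.6
assert abs(estimate - 0.6) < 0.05    # error ~ C1 * alpha^n + C2 / sqrt(N)
```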
Notation 10.24 Let $S_{N}\left(f\right):=\frac{1}{N}\sum_{n=1}^{N}f\left(X_{n}\right)$ denote the sample averages of $f:S\to\mathbb{R}$.
Corollary 10.25. Under Assumption 10.15,
\[
\mathbb{E}_{\nu}\left[S_{N}\left(f\right)\right]=\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{\nu}\left[f\left(X_{n}\right)\right]=\pi\left(f\right)+\left\|f\right\|O\left(\frac{1}{N}\right).\tag{10.8}
\]
Proof. Summing Eq. (10.7) on $n$ shows,
\[
\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{\nu}\left[f\left(X_{n}\right)\right]=\pi\left(f\right)+\frac{\varepsilon_{N}}{N}\nu\left(\Pi_{-1}f\right)+\left\|f\right\|O\left(\frac{1}{N}\sum_{n=1}^{N}\alpha^{n}\right)=\pi\left(f\right)+\frac{\varepsilon_{N}}{N}\nu\left(\Pi_{-1}f\right)+\left\|f\right\|O\left(\frac{\alpha}{N}\cdot\frac{1}{1-\alpha}\right)=\pi\left(f\right)+\left\|f\right\|O\left(\frac{1}{N}\right),
\]
where $\varepsilon_{N}\in\left\{0,\pm1\right\}$.
Corollary 10.26 ($L^{2}\left(P_{\nu}\right)$ – convergence rates). Given the assumptions above, for any initial distribution, $\nu$,
\[
\mathbb{E}_{\nu}\left|S_{N}\left(f\right)-\pi\left(f\right)\right|^{2}=O\left(\frac{1}{N}\right).
\]
Roughly speaking, this indicates that the best we can hope for in terms of pointwise convergence is
\[
\pi\left(f\right)=\frac{1}{N}\sum_{n=1}^{N}f\left(X_{n}\right)+O\left(\frac{1}{\sqrt{N}}\right)\quad P_{\nu}\text{ -- a.s.}
\]
[To get more detailed information about this topic, the reader might start by consulting [11], [4], and [12, Chapter 17] and the references therein.]

Proof. If we let $f=\pi\left(f\right)+h$ (i.e. $h=f-\pi\left(f\right)$), then
\[
S_{N}\left(f\right)=\frac{1}{N}\sum_{n=1}^{N}f\left(X_{n}\right)=\pi\left(f\right)+\frac{1}{N}\sum_{n=1}^{N}h\left(X_{n}\right)
\]
and therefore,
\[
\mathbb{E}\left[S_{N}\left(f\right)-\pi\left(f\right)\right]^{2}=\frac{1}{N^{2}}\sum_{m,n=1}^{N}\mathbb{E}\left[h\left(X_{n}\right)h\left(X_{m}\right)\right]=\frac{1}{N^{2}}\left[\sum_{m=1}^{N}\mathbb{E}\left[h^{2}\left(X_{m}\right)\right]+2\sum_{1\le m<n\le N}\mathbb{E}\left[h\left(X_{n}\right)h\left(X_{m}\right)\right]\right]\le\left\|h\right\|_{\infty}^{2}\frac{1}{N}+\frac{2}{N^{2}}\sum_{1\le m<n\le N}\mathbb{E}\left[h\left(X_{n}\right)h\left(X_{m}\right)\right].
\]
To finish the proof it suffices to show
\[
\sum_{1\le m<n\le N}\mathbb{E}\left[h\left(X_{n}\right)h\left(X_{m}\right)\right]=\sum_{1\le m<n\le N}\mathbb{E}\left[h\left(X_{m}\right)\left(P^{n-m}h\right)\left(X_{m}\right)\right]=O\left(N\right),
\]
wherein we have used the Markov property in the first equality.

Case 1. Suppose that $P$ is aperiodic. Under this additional assumption, we know that $\left\|P^{n-m}h\right\|_{\infty}\le C\left\|h\right\|_{\infty}\alpha^{n-m}$ and hence we find,
\[
\sum_{1\le m<n\le N}\mathbb{E}\left[h\left(X_{n}\right)h\left(X_{m}\right)\right]\le C\left\|h\right\|_{\infty}^{2}\sum_{1\le m<n\le N}\alpha^{n-m}.
\]
This completes the proof in this case upon observing that
\[
\sum_{1\le m<n\le N}O\left(\alpha^{n-m}\right)\le C\sum_{1\le m<n\le N}\alpha^{n-m}\le\frac{\alpha}{1-\alpha}N=O\left(N\right).\tag{10.9}
\]

Case 2. For the general case, we decompose $h$ as
\[
h=f_{-}+g=\Pi_{-1}f+\sum_{\lambda\in\sigma\left(P\right)\setminus\left\{\pm1\right\}}\Pi_{\lambda}f
\]
so that
\[
P^{n-m}h=\left(-1\right)^{n-m}f_{-}+P^{n-m}g.
\]
With this notation along with Eq. (10.7) we learn
\[
\mathbb{E}\left[h\left(X_{n}\right)h\left(X_{m}\right)\right]=\mathbb{E}\left[h\left(X_{m}\right)\left(P^{n-m}h\right)\left(X_{m}\right)\right]=\mathbb{E}\left[h\left(X_{m}\right)\left[\left(-1\right)^{n-m}f_{-}+P^{n-m}g\right]\left(X_{m}\right)\right]=\left(-1\right)^{n-m}\mathbb{E}\left[h\left(X_{m}\right)f_{-}\left(X_{m}\right)\right]+O\left(\alpha^{n-m}\right)=\left(-1\right)^{n-m}\pi\left(hf_{-}\right)+\left(-1\right)^{m}\nu\left(\Pi_{-1}\left[hf_{-}\right]\right)+O\left(\alpha^{m}\right)+O\left(\alpha^{n-m}\right).
\]
The last displayed equation combined with Eq. (10.9) and Lemma 10.27 below again implies
\[
\sum_{1\le m<n\le N}\mathbb{E}\left[h\left(X_{n}\right)h\left(X_{m}\right)\right]=O\left(N\right).
\]
Lemma 10.27. The following sums are all of order $O\left(N\right)$:
\[
\sum_{1\le m<n\le N}\left(-1\right)^{m},\quad\sum_{1\le m<n\le N}\left(-1\right)^{n},\quad\sum_{1\le m,n\le N}\left(-1\right)^{m\wedge n},\quad\sum_{1\le m,n\le N}\left(-1\right)^{m\vee n},\quad\text{and}\quad\sum_{1\le m,n\le N}\left(-1\right)^{\left|m-n\right|}.
\]
Proof. These results are elementary and are based on the fact that for any $0\le k<p<\infty$ we have
\[
\sum_{l=k}^{p}\left(-1\right)^{l}\in\left\{0,\pm1\right\}.
\]
For example,
\[
\sum_{1\le m,n\le N}\left(-1\right)^{m\wedge n}=\sum_{m=1}^{N}\left(-1\right)^{m}+2\sum_{1\le m<n\le N}\left(-1\right)^{m}\in\left\{0,\pm1\right\}+2\sum_{1<n\le N}\left\{0,\pm1\right\},
\]
which is certainly bounded by $CN$. The largest term is,
\[
\sum_{1\le m,n\le N}\left(-1\right)^{\left|m-n\right|}=\sum_{m=1}^{N}1+2\sum_{1\le m<n\le N}\left(-1\right)^{n-m}=N+2\sum_{1\le m<N}\left\{0,\pm1\right\},
\]
which is again bounded by $CN$.
Remark 10.28. The heart of the computations above may be summarized in the aperiodic case as follows.

1. For $n\ge m$,
\[
\mathbb{E}\left[f\left(X_{m}\right)g\left(X_{n}\right)\right]=\nu\left(P^{m}\left[f\cdot P^{n-m}g\right]\right)=\nu\left(P^{m}\left[f\cdot\left(\pi\left(g\right)+O\left(\alpha^{n-m}\right)\right)\right]\right)=\pi\left(g\right)\nu\left(P^{m}f\right)+O\left(\alpha^{n-m}\right)=\pi\left(g\right)\nu\left(\pi\left(f\right)+O\left(\alpha^{m}\right)\right)+O\left(\alpha^{n-m}\right)=\pi\left(g\right)\pi\left(f\right)+O\left(\alpha^{m}\right)+O\left(\alpha^{n-m}\right).
\]
Here we have used that $P$ is a contraction on $\ell^{2}\left(\pi\right)$ and $P^{m}f=\pi\left(f\right)+R^{m}f$ where
\[
\left\|R^{m}f\right\|_{\infty}\le C\left\|R^{m}f\right\|_{\ell^{2}\left(\pi\right)}\le C\left\|f\right\|_{\ell^{2}\left(\pi\right)}\alpha^{m}\le C\left\|f\right\|_{\infty}\alpha^{m}.
\]

2. Letting $h:=f-\pi\left(f\right)$ so that $\pi\left(h\right)=0$, we find
\[
\mathbb{E}\left[\left(S_{N}\left(f\right)-\pi\left(f\right)\right)^{2}\right]=\mathbb{E}\left[\left[S_{N}\left(f-\pi\left(f\right)\right)\right]^{2}\right]=\frac{1}{N^{2}}\sum_{m,n=1}^{N}\mathbb{E}\left[h\left(X_{m}\right)h\left(X_{n}\right)\right]\le\frac{N}{N^{2}}\left\|h\right\|_{\infty}^{2}+\frac{2}{N^{2}}\sum_{1\le m<n\le N}\mathbb{E}\left[h\left(X_{m}\right)h\left(X_{n}\right)\right]\le\frac{1}{N}\left\|h\right\|_{\infty}^{2}+\frac{2}{N^{2}}\sum_{1\le m<n\le N}\left[O\left(\alpha^{m}\right)+O\left(\alpha^{n-m}\right)\right]=\frac{1}{N}\left\|h\right\|_{\infty}^{2}+O\left(\frac{1}{N}\right),
\]
owing to the fact that
\[
\sum_{k=1}^{\infty}\alpha^{k}=\frac{\alpha}{1-\alpha}\quad\text{for }\left|\alpha\right|<1.
\]
10.6 *Reversible Markov Chains

Let $S$ be a finite or countable state space, $p:S\times S\to\left[0,1\right]$ be a Markov transition kernel, and $\nu:S\to\left[0,1\right]$ be a probability on $S$. We now wish to find conditions on $\nu$ and $p$ so that, for each $N\in\mathbb{N}$, $Y_{n}:=X_{N-n}$ for $0\le n\le N$ is distributed as a time homogeneous Markov chain. Thus we are looking for a Markov transition kernel, $q:S\times S\to\left[0,1\right]$, and a probability, $\tilde{\nu}:S\to\left[0,1\right]$, such that
\[
P_{\nu}\left(Y_{0}=x_{0},\dots,Y_{N}=x_{N}\right)=\tilde{\nu}\left(x_{0}\right)q\left(x_{0},x_{1}\right)\dots q\left(x_{N-1},x_{N}\right)
\]
for all $x_{j}\in S$. By assumption,
\[
P_{\nu}\left(Y_{0}=x_{0},\dots,Y_{N}=x_{N}\right)=P_{\nu}\left(X_{N}=x_{0},\dots,X_{0}=x_{N}\right)=\nu\left(x_{N}\right)p\left(x_{N},x_{N-1}\right)\dots p\left(x_{1},x_{0}\right).
\]
Comparing these last two displayed equations leads us to require, for all $N\in\mathbb{N}_{0}$ and $\left\{x_{j}\right\}\subset S$, that
\[
\tilde{\nu}\left(x_{0}\right)q\left(x_{0},x_{1}\right)\dots q\left(x_{N-1},x_{N}\right)=\nu\left(x_{N}\right)p\left(x_{N},x_{N-1}\right)\dots p\left(x_{1},x_{0}\right).
\]
Taking $N=0$ forces us to take $\tilde{\nu}=\nu$. Taking $N=1$ we have,
\[
\nu\left(x_{0}\right)q\left(x_{0},x_{1}\right)=\nu\left(x_{1}\right)p\left(x_{1},x_{0}\right)\tag{10.10}
\]
and then summing this equation on $x_{1}$ shows $\nu\left(x_{0}\right)=\left(\nu P\right)\left(x_{0}\right)$ and forces $\nu$ to be a stationary distribution for $p$.

From now on let us assume that $p$ has a strictly positive stationary distribution, $\pi$, i.e. that all communication classes of $p$ are positive recurrent. Let $\nu$ be one of these stationary distributions. Then from Eq. (10.10) we see that we must take,
\[
q\left(x_{0},x_{1}\right)=\frac{\nu\left(x_{1}\right)}{\nu\left(x_{0}\right)}p\left(x_{1},x_{0}\right)\ge0.
\]
Observe that
\[
\sum_{x_{1}\in S}q\left(x_{0},x_{1}\right)=\sum_{x_{1}\in S}\frac{\nu\left(x_{1}\right)}{\nu\left(x_{0}\right)}p\left(x_{1},x_{0}\right)=\frac{\left(\nu P\right)\left(x_{0}\right)}{\nu\left(x_{0}\right)}=\frac{\nu\left(x_{0}\right)}{\nu\left(x_{0}\right)}=1
\]
and that
\[
\sum_{x_{0}\in S}\nu\left(x_{0}\right)q\left(x_{0},x_{1}\right)=\sum_{x_{0}\in S}\nu\left(x_{1}\right)p\left(x_{1},x_{0}\right)=\nu\left(x_{1}\right),
\]
which shows that $q:S\times S\to\left[0,1\right]$ is a Markov transition kernel with stationary distribution, $\nu$. Finally we have
\[
\nu\left(x_{0}\right)q\left(x_{0},x_{1}\right)\dots q\left(x_{N-1},x_{N}\right)=\nu\left(x_{0}\right)\frac{\nu\left(x_{1}\right)}{\nu\left(x_{0}\right)}p\left(x_{1},x_{0}\right)\frac{\nu\left(x_{2}\right)}{\nu\left(x_{1}\right)}p\left(x_{2},x_{1}\right)\dots\frac{\nu\left(x_{N}\right)}{\nu\left(x_{N-1}\right)}p\left(x_{N},x_{N-1}\right)=\nu\left(x_{N}\right)p\left(x_{N},x_{N-1}\right)\dots p\left(x_{1},x_{0}\right)
\]
as desired.

Definition 10.29. We say the chain $\left\{X_{n}\right\}$ associated to $p$ is (time) reversible if there exists a distribution $\nu$ on $S$ such that for any $N\in\mathbb{N}$,
\[
\left(X_{0},\dots,X_{N}\right)\overset{P_{\nu}}{=}\left(X_{N},X_{N-1},\dots,X_{0}\right).
\]
What we have proved above is that $p$ is reversible iff there exists $\nu:S\to(0,\infty)$ such that
\[
\nu\left(x\right)p\left(x,y\right)=\nu\left(y\right)p\left(y,x\right)\quad\forall x,y\in S.
\]
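The time-reversal construction $q\left(x,y\right)=\frac{\nu\left(y\right)}{\nu\left(x\right)}p\left(y,x\right)$ can be checked numerically; the small non-reversible chain below is our own choice for illustration.

```python
import numpy as np

p = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
# stationary distribution: left eigenvector of p for eigenvalue 1
w, V = np.linalg.eig(p.T)
nu = np.real(V[:, np.argmin(np.abs(w - 1))])
nu = nu / nu.sum()

q = nu[None, :] * p.T / nu[:, None]      # q(x,y) = nu(y) p(y,x) / nu(x)
assert np.allclose(q.sum(axis=1), 1.0)   # q is a Markov kernel
assert np.allclose(nu @ q, nu)           # nu is stationary for q
# a two-step path under (nu, q) has the probability of the reversed path:
x, y, z = 0, 2, 1
assert np.isclose(nu[x] * q[x, y] * q[y, z], nu[z] * p[z, y] * p[y, x])
```

For this doubly stochastic chain $\nu$ is uniform and $q=p^{T}\neq p$, confirming that the chain is not reversible even though its reversal is a perfectly good Markov chain.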
11
Hidden Markov Models
Problem: Let $\left\{X_{n}\right\}\subset S$ be a stochastic process and $\left\{f_{n}:S\to O\right\}_{n=0}^{\infty}$ be a collection of random functions. It is often the case that we cannot observe $X_{n}$ directly but only some noisy output $\left\{U_{n}=f_{n}\left(X_{n}\right)\right\}_{n=0}^{\infty}$ of the $\left\{X_{n}\right\}$. Our goal is to predict the values of $\left\{X_{n}\right\}_{n=0}^{N}$ given that we have observed $u:=\left\{u_{n}\right\}_{n=0}^{N}\subset O$. One way to make such a prediction is based on finding the most probable trajectory, i.e. we are looking for $\left(x_{0}^{*},\dots,x_{N}^{*}\right)\in S^{N+1}$ which is a maximizer of the function,
\[
P\left(\left\{X_{n}=x_{n}\right\}_{n=0}^{N}\mid\left\{U_{n}=u_{n}\right\}_{n=0}^{N}\right)=\frac{P\left(\left\{X_{n}=x_{n}\right\}_{n=0}^{N},\left\{U_{n}=u_{n}\right\}_{n=0}^{N}\right)}{P\left(\left\{U_{n}=u_{n}\right\}_{n=0}^{N}\right)}.
\]
Since the denominator does not depend on $\left(x_{0},\dots,x_{N}\right)$, we may equivalently define $\left(x_{0}^{*},\dots,x_{N}^{*}\right)$ as,
\[
\left(x_{0}^{*},\dots,x_{N}^{*}\right)=\underset{\left(x_{0},\dots,x_{N}\right)\in S^{N+1}}{\arg\max}F_{N}\left(x_{0},\dots,x_{N}\right)
\]
where
\[
F_{N}\left(x_{0},\dots,x_{N}\right)=P\left(\left\{X_{n}=x_{n}\right\}_{n=0}^{N},\left\{U_{n}=u_{n}\right\}_{n=0}^{N}\right)=P\left(\left\{\left(X_{n},U_{n}\right)=\left(x_{n},u_{n}\right)\right\}_{n=0}^{N}\right).\tag{11.1}
\]
With no extra structure this maximization problem is intractable even for moderately large $N$ since it requires $\left|S\right|^{N+1}$ evaluations of $F_{N}$.¹ On the other hand, we will be able to say something about this problem if we make the following additional assumptions.

Assumption 11.1 Keeping the notation above, let us now further assume that:

1. $\left\{X_{n}\right\}$ is a (hidden) Markov chain with transition probabilities $p:S\times S\to\left[0,1\right]$ and initial distribution $\nu:S\to\left[0,1\right]$, and
2. the "random noise" functions, $\left\{f_{n}:S\to O\right\}_{n=0}^{\infty}$, are i.i.d. random functions which are independent of the chain $\left\{X_{n}\right\}_{n=0}^{\infty}$.

¹ It is worth observing that the dimension of the space of possible functions $F_{N}$ is also roughly $\left|S\right|^{N+1}$.
Notation 11.2 Under Assumption 11.1, the emission probabilities refers to the function $e:S\times O\to\left[0,1\right]$ defined by
\[
e\left(x,u\right):=P\left(f_{n}\left(x\right)=u\right)\quad\forall\left(x,u\right)\in S\times O.
\]
Given $u:=\left\{u_{n}\right\}_{n=0}^{N}\subset O$, if we let
\[
V_{0}\left(x_{0}\right)=\nu\left(x_{0}\right)e\left(x_{0},u_{0}\right)\quad\text{and}\quad q_{n}\left(x,y\right):=p\left(x,y\right)e\left(y,u_{n}\right),\tag{11.2}
\]
then the function $F_{N}$ in Eq. (11.1) becomes,
\[
F_{N}\left(x_{0},\dots,x_{N}\right)=P\left(\left\{X_{n}=x_{n}\right\}_{n=0}^{N}\right)\cdot P\left(\left\{f_{n}\left(x_{n}\right)=u_{n}\right\}_{n=0}^{N}\right)=\nu\left(x_{0}\right)e\left(x_{0},u_{0}\right)\prod_{n=1}^{N}p\left(x_{n-1},x_{n}\right)e\left(x_{n},u_{n}\right)\tag{11.3}
\]
\[
=V_{0}\left(x_{0}\right)\prod_{n=1}^{N}q_{n}\left(x_{n-1},x_{n}\right).\tag{11.4}
\]
Example 11.3 (Occasionally Dishonest Casino). In a casino game, a die is thrown. Usually a fair die is used, but sometimes the casino swaps it out for a loaded die. To be precise: there are two dice, F (fair) and L (loaded). For the fair die F, the probability of rolling $i$ is $\frac{1}{6}$ for $i=1,2,\dots,6$. For the loaded die, however, 1 is rolled with probability 0.5 while 2 through 6 are each rolled with probability 0.1. Each roll, the casino randomly switches out F for L with probability 0.05; if L is in use, they switch back to F next roll with probability 0.9.

In this example, we may take $S=\left\{F,L\right\}$ ($F$ = fair and $L$ = loaded) and $\left(X_{n}\right)_{n\ge0}$ to be the Markov chain keeping track of which die is in play, with transition probabilities (rows and columns ordered $F$, $L$)
\[
P=\begin{bmatrix}0.95 & 0.05\\0.9 & 0.1\end{bmatrix}.
\]
We further let $O=\left\{1,2,3,4,5,6\right\}$ and we are given the emission probabilities
\[
e\left(x,u\right)=\begin{array}{c|cccccc}
x\backslash u & 1 & 2 & 3 & 4 & 5 & 6\\\hline
F & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6\\
L & 0.5 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1
\end{array}
\]
Questions: You notice that 1 is rolled 6 times in a row. How likely is it that the fair die was in use for those rolls (given the above information)? How likely is it the loaded die was used in rolls 2, 3, and 5? More generally, what is the most likely sequence of dice that was used to generate this sequence of rolls?
Example 11.3 is a very typical situation that occurs in many real-worldproblems. Another application of Hidden Markov Models (HMMs) is in machinespeech recognition. Here we have S = words and O = wave forms.
11.1 The dynamic programming algorithm for the most likely trajectory

Let $V_{0}:S\to[0,\infty)$ and $q_{n}:S\times S\to[0,\infty)$ for $n\in\mathbb{N}$ be given functions and then set
\[
F_{n}\left(x_{0},x_{1},\dots,x_{n}\right)=V_{0}\left(x_{0}\right)q_{1}\left(x_{0},x_{1}\right)q_{2}\left(x_{1},x_{2}\right)\dots q_{n}\left(x_{n-1},x_{n}\right).\tag{11.5}
\]
The dimension of the space of functions, $F_{n}$, that have the form in Eq. (11.5) is approximately $n\left|S\right|^{2}+\left|S\right|$, which is much smaller than $\left|S\right|^{n+1}$ – the dimension of the space of all functions $F_{n}$ on $S^{n+1}$. Viterbi's algorithm exploits this fact in order to find a maximizer of $F_{n}$ with only order $n\left|S\right|^{2}$ operations rather than $\left|S\right|^{n+1}$. The key observation is that we may describe $F_{n}$ inductively as:
\[
F_{1}\left(x_{0},x_{1}\right)=V_{0}\left(x_{0}\right)q_{1}\left(x_{0},x_{1}\right)\quad\text{and}
\]
\[
F_{n}\left(x_{0},x_{1},\dots,x_{n}\right)=F_{n-1}\left(x_{0},x_{1},\dots,x_{n-1}\right)q_{n}\left(x_{n-1},x_{n}\right).\tag{11.6}
\]
From Eq. (11.6) it follows that
\[
\max_{x_{0},x_{1},\dots,x_{n}\in S}F_{n}\left(x_{0},x_{1},\dots,x_{n}\right)=\max_{x_{0},x_{1},\dots,x_{n}\in S}\left[F_{n-1}\left(x_{0},x_{1},\dots,x_{n-1}\right)q_{n}\left(x_{n-1},x_{n}\right)\right]=\max_{x_{n}\in S}\max_{x_{n-1}\in S}\left[\max_{x_{0},x_{1},\dots,x_{n-2}\in S}F_{n-1}\left(x_{0},x_{1},\dots,x_{n-1}\right)\right]q_{n}\left(x_{n-1},x_{n}\right).
\]
This identity suggests that we define (for $n\in\mathbb{N}$) $V_{n}:S\to[0,1]$ by,
\[
V_{n}\left(x_{n}\right):=\max_{x_{0},x_{1},\dots,x_{n-1}\in S}F_{n}\left(x_{0},x_{1},\dots,x_{n}\right)\text{ for }n\ge1.\tag{11.7}
\]
It then follows that
\[
\max_{x_{0},x_{1},\dots,x_{n}\in S}F_{n}\left(x_{0},x_{1},\dots,x_{n}\right)=\max_{x_{n}\in S}V_{n}\left(x_{n}\right)
\]
and
\[
V_{n}\left(x_{n}\right)=\max_{x_{0},\dots,x_{n-1}\in S}\left[F_{n-1}\left(x_{0},\dots,x_{n-1}\right)q_{n}\left(x_{n-1},x_{n}\right)\right]=\max_{x_{n-1}\in S}\max_{x_{0},\dots,x_{n-2}\in S}\left[F_{n-1}\left(x_{0},\dots,x_{n-1}\right)q_{n}\left(x_{n-1},x_{n}\right)\right]=\max_{x_{n-1}\in S}\left[V_{n-1}\left(x_{n-1}\right)q_{n}\left(x_{n-1},x_{n}\right)\right].
\]
Notation 11.4 If $G:S\to\mathbb{R}$ is a function on a finite set $S$, let $\arg\max_{x\in S}G\left(x\right)$ denote a point $x^{*}\in S$ such that $G\left(x^{*}\right)=\max_{x\in S}G\left(x\right)$.

The above comments lead to the following algorithm for finding a maximizer of $F_{N}$.
Theorem 11.5 (Viterbi’s most likely trajectory). Let us continue the no-tation above so that for n ≥ 1,
Vn (y) = maxx∈S
[Vn−1 (x) qn (x, y)]
and further letϕn (y) := arg max
x∈S[Vn−1 (x) qn (x, y)] .
Given N ∈ N, if we let
x∗N = arg maxx∈S
VN (x) ,
x∗N−1 = ϕN (x∗N ) ,
x∗N−2 = ϕN−1
(x∗N−1
),
...
x∗0 := ϕ1 (x∗1) ,
then(x∗0, x
∗1, . . . , x
∗N ) = arg max
(x0,...,xN )∈SN+1FN (x0, . . . , xN ) (11.8)
and
M := max(x0,...,xN )∈SN+1
FN (x0, . . . , xN ) = VN (x∗N ) = maxx
VN (x) . (11.9)
Proof. First off, Eq. (11.9) is just a rewrite of Eq. (11.7). Moreover,
\begin{align*}
M = V_N(x^*_N) &= \max_{x_{N-1} \in S} \left[ V_{N-1}(x_{N-1})\, q_N(x_{N-1}, x^*_N) \right] \\
&= V_{N-1}(x^*_{N-1})\, q_N(x^*_{N-1}, x^*_N),
\end{align*}
while
\begin{align*}
V_{N-1}(x^*_{N-1}) &= \max_{x_{N-2} \in S} \left[ V_{N-2}(x_{N-2})\, q_{N-1}(x_{N-2}, x^*_{N-1}) \right] \\
&= V_{N-2}(x^*_{N-2})\, q_{N-1}(x^*_{N-2}, x^*_{N-1}),
\end{align*}
so that
\[
M = V_{N-2}(x^*_{N-2})\, q_{N-1}(x^*_{N-2}, x^*_{N-1})\, q_N(x^*_{N-1}, x^*_N).
\]
Continuing this way inductively shows, for $1 \le k \le N$, that
\[
M = V_{N-k}(x^*_{N-k})\, q_{N-k+1}(x^*_{N-k}, x^*_{N-k+1}) \cdots q_{N-1}(x^*_{N-2}, x^*_{N-1})\, q_N(x^*_{N-1}, x^*_N),
\]
and in particular at $k = N$ we find
\[
M = V_0(x^*_0)\, q_1(x^*_0, x^*_1) \cdots q_{N-1}(x^*_{N-2}, x^*_{N-1})\, q_N(x^*_{N-1}, x^*_N) = F_N(x^*_0, x^*_1, \dots, x^*_N),
\]
which proves Eq. (11.8).
11.1.1 Viterbi’s algorithm in the Hidden Markov chain context
Let us now write out the pseudo-code for this algorithm in the hidden Markov chain context. We assume the setup described in Notation 11.2, so that $q_n(x, y) = p(x, y)\, e(y, u_n)$.

Hidden Markov chain pseudo-code:
For $y \in S$:
  $V_0(y) = \nu(y)\, e(y, u_0)$
End for.
For $n = 1$ to $N$:
  For $y \in S$:
    $V_n(y) := \max_{x \in S}\left[ V_{n-1}(x)\, q_n(x, y) \right] = \max_{x \in S}\left[ V_{n-1}(x)\, p(x, y)\, e(y, u_n) \right]$ and
    $\varphi_n(y) = \arg\max_{x \in S}\left[ V_{n-1}(x)\, q_n(x, y) \right] = \arg\max_{x \in S}\left[ V_{n-1}(x)\, p(x, y)\, e(y, u_n) \right]$
  End for.
End for.
$x^*_N := \arg\max_{y \in S} V_N(y)$
For $n = 1$ to $N$:
  $x^*_{N-n} = \varphi_{N-n+1}(x^*_{N-n+1})$
End for.
Output: $(x^*_0, x^*_1, \dots, x^*_N)$.

The output $\{x^*_k\}_{k=0}^N$ is a most likely trajectory given $\{u_k\}_{k=0}^N$, i.e.
\[
(x^*_0, x^*_1, \dots, x^*_N) = \arg\max_{(x_0, \dots, x_N) \in S^{N+1}} F_N(x_0, \dots, x_N).
\]

Complexity analysis: Notice that the computational complexity of this algorithm is approximately $C \cdot N |S|^2$ rather than $|S|^{N+1}$. Indeed, the factor of $N$ comes from the outer loop over $n = 1$ to $N$. Inside this loop there is a loop over $y \in S$, and for each $y$ roughly $|S|$ evaluations are needed to compute $V_n(y)$ and $\varphi_n(y)$. Thus at each index $n$ there are about $|S|^2$ operations.
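The dynamic-programming loop above is short to implement. The following is a minimal Python/NumPy sketch (an illustration, not part of the notes' formal development); it assumes the data of Notation 11.2 is stored as arrays, with `nu[x]` for $\nu(x)$, `p[x, y]` for $p(x, y)$, `e[x, u]` for $e(x, u)$, and states and observations labeled $0, 1, \dots$.

```python
import numpy as np

def viterbi(nu, p, e, obs):
    """Return a most likely trajectory (x*_0, ..., x*_N) and M = max F_N.

    nu  : initial distribution, shape (|S|,)
    p   : transition matrix p[x, y], shape (|S|, |S|)
    e   : emission matrix e[x, u], shape (|S|, |O|)
    obs : observed sequence (u_0, ..., u_N)
    """
    N = len(obs) - 1
    V = np.empty((N + 1, len(nu)))
    phi = np.empty((N + 1, len(nu)), dtype=int)
    V[0] = nu * e[:, obs[0]]                       # V_0(y) = nu(y) e(y, u_0)
    for n in range(1, N + 1):
        # scores[x, y] = V_{n-1}(x) q_n(x, y) with q_n(x, y) = p(x, y) e(y, u_n)
        scores = V[n - 1][:, None] * p * e[:, obs[n]][None, :]
        phi[n] = scores.argmax(axis=0)             # phi_n(y)
        V[n] = scores.max(axis=0)                  # V_n(y)
    x = np.empty(N + 1, dtype=int)                 # backtracking pass
    x[N] = V[N].argmax()
    for n in range(N, 0, -1):
        x[n - 1] = phi[n, x[n]]
    return x, V[N].max()
```

Each of the $N$ steps builds an $|S| \times |S|$ array of scores, in agreement with the $C \cdot N|S|^2$ operation count above.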
11.2 The computation of the conditional probabilities
If $u := (u_0, u_1, u_2, \dots, u_N) \in \mathcal{O}^{N+1}$ is a given observation sequence and $x = (x_0, x_1, x_2, \dots, x_N) \in S^{N+1}$ is a state sequence, we may ask how likely it is that $X^{(N)} := (X_0, X_1, X_2, \dots, X_N) = x$ given $U^{(N)} := (U_0, U_1, U_2, \dots, U_N) = u$, i.e. we want to compute
\[
\rho(x|u) := P\left( X^{(N)} = x \mid U^{(N)} = u \right)
= \frac{P\left( X^{(N)} = x,\, U^{(N)} = u \right)}{P\left( U^{(N)} = u \right)}
= \frac{\nu(x_0)\, e(x_0, u_0) \prod_{k=1}^N p(x_{k-1}, x_k)\, e(x_k, u_k)}{\sum_{y \in S^{N+1}} \nu(y_0)\, e(y_0, u_0) \prod_{k=1}^N p(y_{k-1}, y_k)\, e(y_k, u_k)}.
\]
The following theorem is the basis for a computationally tractable algorithm for computing $\rho(x|u)$.
Theorem 11.6. Fix an observation sequence $u = (u_0, u_1, \dots, u_N)$. Let
\[
\alpha_0(x) = P(U_0 = u_0, X_0 = x) = P(X_0 = x)\, e(x, u_0)
\]
and then define $\alpha_n : S \to [0,1]$ for $n = 1, \dots, N$ inductively by
\[
\alpha_{n+1}(y) = \sum_{x \in S} \alpha_n(x)\, p(x, y)\, e(y, u_{n+1}) \quad \text{for } y \in S, \tag{11.10}
\]
where $0 \le n < N$. Then
\[
P\left( U^{(N)} = u \right) = \sum_{x \in S} \alpha_N(x). \tag{11.11}
\]
Proof. For $0 \le n \le N$, let
\[
\alpha_n(x) = P\left( U^{(n)} = u^{(n)}, X_n = x \right) = \mathbb{E}\left[ 1_{U^{(n)} = u^{(n)}}\, 1_{X_n = x} \right].
\]
With this definition of $\alpha_n$ it is clear that Eq. (11.11) holds, so it only remains to show that Eq. (11.10) is valid. To see this is the case, first observe from Eq. (11.3) that the joint process $(X_n, U_n = f_n(X_n))_{n=0}^\infty$ is a Markov process with state space $S \times \mathcal{O}$ and transition probabilities
\[
P\left( (X_{n+1}, U_{n+1}) = (y, u) \mid (X_n, U_n) = (x, s) \right) = p(x, y)\, e(y, u).
\]
Therefore, if we let $\mathcal{F}_n := \sigma\left( (X_k, U_k)_{k \le n} \right)$, then by the Markov property and the basic properties of conditional expectations,
\begin{align*}
\alpha_{n+1}(y) &:= \mathbb{E}\left[ 1_{U^{(n)} = u^{(n)}} \cdot 1_{(X_{n+1}, U_{n+1}) = (y, u_{n+1})} \right] \\
&= \mathbb{E}\left[ 1_{U^{(n)} = u^{(n)}} \cdot \mathbb{E}\left[ 1_{(X_{n+1}, U_{n+1}) = (y, u_{n+1})} \mid \mathcal{F}_n \right] \right] \\
&= \mathbb{E}\left[ 1_{U^{(n)} = u^{(n)}} \cdot \mathbb{E}\left[ 1_{(X_{n+1}, U_{n+1}) = (y, u_{n+1})} \mid (X_n, U_n) \right] \right] \\
&= \mathbb{E}\left[ 1_{U^{(n)} = u^{(n)}} \cdot p(X_n, y)\, e(y, u_{n+1}) \right] \\
&= \sum_{x \in S} p(x, y)\, e(y, u_{n+1})\, \mathbb{E}\left[ 1_{U^{(n)} = u^{(n)}}\, 1_{X_n = x} \right] \\
&= \sum_{x \in S} p(x, y)\, e(y, u_{n+1})\, \alpha_n(x).
\end{align*}
Complexity of the algorithm. The complexity of this algorithm is again roughly $C N |S|^2$. Indeed, there are $N$ inductive steps in Eq. (11.10); computing the sum in Eq. (11.10) requires $C|S|$ operations, and this must be done $|S|$ times, once for each $y \in S$.
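The forward recursion (11.10)-(11.11) can be sketched in a few lines of Python/NumPy. As before, it is assumed (an illustration only) that `nu[x]` $= P(X_0 = x)$, `p[x, y]` $= p(x, y)$, and `e[x, u]` $= e(x, u)$ are stored as arrays:

```python
import numpy as np

def observation_likelihood(nu, p, e, obs):
    """P(U^{(N)} = u) computed by the forward recursion (11.10)-(11.11)."""
    alpha = nu * e[:, obs[0]]              # alpha_0(x) = P(X_0 = x) e(x, u_0)
    for u in obs[1:]:
        # alpha_{n+1}(y) = sum_x alpha_n(x) p(x, y) e(y, u_{n+1})
        alpha = (alpha @ p) * e[:, u]
    return alpha.sum()                     # Eq. (11.11)
```

Each matrix-vector product costs about $|S|^2$ operations, matching the $C N |S|^2$ count above.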
Part III
Martingales
12
(Sub and Super) Martingales
The typical setup in this chapter is that $\{X_k, Y_k, Z_k, \dots\}_{k=0}^\infty$ is some collection of stochastic processes on a probability space $(\Omega, P)$, and we let $\mathcal{F}_n = \sigma\left( \{X_k, Y_k, Z_k, \dots\}_{k=0}^n \right)$ for $n \in \mathbb{N}_0$ and $\mathcal{F} = \sigma\left( \{X_k, Y_k, Z_k, \dots\}_{k=0}^\infty \right)$. Notice that $\mathcal{F}_n \subset \mathcal{F}_{n+1} \subset \mathcal{F}$ for all $n = 0, 1, 2, \dots$. We say a stochastic process $\{U_n\}_{n=0}^\infty$ is adapted to $\{\mathcal{F}_n\}$ provided $U_n$ is $\mathcal{F}_n$-measurable for all $n \in \mathbb{N}_0$, i.e. $U_n = F_n\left( \{X_k, Y_k, Z_k, \dots\}_{k=0}^n \right)$ for some function $F_n$.
Definition 12.1. Let $X := \{X_n\}_{n=0}^\infty$ be an $\{\mathcal{F}_n\}$-adapted sequence of integrable random variables. Then:

1. $X$ is a $\{\mathcal{F}_n\}_{n=0}^\infty$-martingale if $\mathbb{E}[X_{n+1} | \mathcal{F}_n] = X_n$ a.s. for all $n \in \mathbb{N}_0$.
2. $X$ is a $\{\mathcal{F}_n\}_{n=0}^\infty$-submartingale if $\mathbb{E}[X_{n+1} | \mathcal{F}_n] \ge X_n$ a.s. for all $n \in \mathbb{N}_0$.
3. $X$ is a $\{\mathcal{F}_n\}_{n=0}^\infty$-supermartingale if $\mathbb{E}[X_{n+1} | \mathcal{F}_n] \le X_n$ a.s. for all $n \in \mathbb{N}_0$.
It is often fruitful to view $X_n$ as your earnings at time $n$ while playing some game of chance. In this interpretation, your expected earnings at time $n+1$, given the history of the game up to time $n$, are the same as, greater than, or less than your earnings at time $n$ according as $X = \{X_n\}_{n=0}^\infty$ is a martingale, submartingale, or supermartingale respectively. Thus martingales are fair games, submartingales are games which are favorable to the gambler (unfavorable to the casino), and supermartingales are games which are unfavorable to the gambler (favorable to the casino); see Example 12.35.
By induction one shows that $X$ is a supermartingale, martingale, or submartingale iff, respectively,
\[
\mathbb{E}[X_n \mid \mathcal{F}_m] \le X_m, \quad \mathbb{E}[X_n \mid \mathcal{F}_m] = X_m, \quad \text{or} \quad \mathbb{E}[X_n \mid \mathcal{F}_m] \ge X_m \quad \text{a.s. for all } n \ge m. \tag{12.1}
\]
This last equation may also be expressed as
\[
\mathbb{E}[X_n \mid \mathcal{F}_m] \le X_{n \wedge m}, \quad = X_{n \wedge m}, \quad \text{or} \quad \ge X_{n \wedge m} \quad \text{a.s. for all } m, n \in \mathbb{N}_0. \tag{12.2}
\]
The reader should also note that $n \mapsto \mathbb{E}[X_n]$ is then decreasing, constant, or increasing respectively.
12.1 (Sub and Super) Martingale Examples
Example 12.2. Suppose that $\{Z_n\}_{n=0}^\infty$ are independent¹ integrable random variables such that $\mathbb{E} Z_n = 0$ for all $n \ge 1$. Then $S_n := \sum_{k=0}^n Z_k$ is a martingale relative to the filtration $\mathcal{F}^Z_n := \sigma(Z_0, \dots, Z_n)$. Indeed,
\[
\mathbb{E}[S_{n+1} - S_n \mid \mathcal{F}_n] = \mathbb{E}[Z_{n+1} \mid \mathcal{F}_n] = \mathbb{E} Z_{n+1} = 0.
\]
This same computation also shows that $\{S_n\}_{n \ge 0}$ is a submartingale if $\mathbb{E} Z_n \ge 0$ and a supermartingale if $\mathbb{E} Z_n \le 0$ for all $n$.
Example 12.3 (Regular martingales). If $X$ is a random variable with $\mathbb{E}|X| < \infty$, then $X_n := \mathbb{E}[X | \mathcal{F}_n]$ is a martingale. Indeed, by the tower property of conditional expectations,
\[
\mathbb{E}[X_{n+1} | \mathcal{F}_n] = \mathbb{E}\left[ \mathbb{E}[X | \mathcal{F}_{n+1}] \mid \mathcal{F}_n \right] = \mathbb{E}[X | \mathcal{F}_n] = X_n \quad \text{a.s.}
\]
A martingale of this form is called a regular martingale. From Proposition 3.10 we know that
\[
\mathbb{E}|X_n| = \mathbb{E}\left| \mathbb{E}[X | \mathcal{F}_n] \right| \le \mathbb{E}|X|,
\]
which shows that $\sup_n \mathbb{E}|X_n| \le \mathbb{E}|X| < \infty$ for regular martingales. In the next exercise you will show that not all martingales are regular.
Exercise 12.1. Construct an example of a martingale $\{M_n\}_{n=0}^\infty$ such that $\mathbb{E}|M_n| \to \infty$ as $n \to \infty$. [In particular, $\{M_n\}_{n=1}^\infty$ will be a martingale which is not of the form $M_n = \mathbb{E}_{\mathcal{F}_n} X$ for some $X \in L^1(P)$.] Hint: try taking $M_n = \sum_{k=0}^n Z_k$ for a judicious choice of $\{Z_k\}_{k=0}^\infty$, which you should take to be independent, mean zero, and with $\mathbb{E}|Z_n|$ growing rather rapidly.
Example 12.4 ("Conditioning a δ-function"). Suppose that $\Omega = (0, 1]$, $\mathcal{F} = \mathcal{B}_{(0,1]}$, and $P = m$, Lebesgue measure. We then define random variables $\{Y_n\}_{n=1}^\infty$ with values in $\{0, 1\}$ so that
\[
\omega = \sum_{n=1}^\infty \frac{Y_n(\omega)}{2^n}, \quad \text{i.e.} \quad \omega = .Y_1(\omega)\, Y_2(\omega)\, Y_3(\omega) \dots
\]
¹ We do not need to assume that the $\{Z_n\}_{n=0}^\infty$ are identically distributed here!
is the binary expansion of $\omega$. We then let $\mathcal{F}_n := \sigma(Y_1, \dots, Y_n)$. A little thought shows that $X$ is $\mathcal{F}_n$-measurable iff $X$ is constant on each of the sub-intervals $\left( \frac{k-1}{2^n}, \frac{k}{2^n} \right]$ for $1 \le k \le 2^n$. As $\left\{ \sqrt{2^n}\, 1_{\left( \frac{k-1}{2^n}, \frac{k}{2^n} \right]} \right\}_{k=1}^{2^n}$ is an orthonormal basis for this collection of functions, we may conclude that
\begin{align*}
\mathbb{E}[X | \mathcal{F}_n](\omega) &= \sum_{k=1}^{2^n} \left[ \int_0^1 X(\sigma)\, \sqrt{2^n}\, 1_{\left( \frac{k-1}{2^n}, \frac{k}{2^n} \right]}(\sigma)\, d\sigma \right] \cdot \sqrt{2^n}\, 1_{\left( \frac{k-1}{2^n}, \frac{k}{2^n} \right]}(\omega) \\
&= \sum_{k=1}^{2^n} \left[ 2^n \int_{(k-1)/2^n}^{k/2^n} X(\sigma)\, d\sigma \right] \cdot 1_{\left( \frac{k-1}{2^n}, \frac{k}{2^n} \right]}(\omega).
\end{align*}
If we formally take $X = \delta_0$ to be the Dirac delta function at $0$ (with all its weight to the right of zero), then
\[
M_n := \text{``}\mathbb{E}[\delta_0 | \mathcal{F}_n]\text{''} = 2^n 1_{(0, 2^{-n}]} \quad \text{for } n \in \mathbb{N}.
\]
In Exercise 12.2 you are asked to verify directly that $M_n := 2^n 1_{(0, 2^{-n}]}$ is a martingale. Notice that $\mathbb{E}|M_n| = 1$ for all $n$. However, there is no $X \in L^1(\Omega, \mathcal{F}, P)$ such that $M_n = \mathbb{E}[X | \mathcal{F}_n]$. To verify this last assertion, suppose such an $X$ existed. We would then have, for $2^n > k > 0$ and any $m > n$, that
\[
\mathbb{E}\left[ X : \left( \tfrac{k}{2^n}, \tfrac{k+1}{2^n} \right] \right]
= \mathbb{E}\left[ \mathbb{E}_{\mathcal{F}_m} X : \left( \tfrac{k}{2^n}, \tfrac{k+1}{2^n} \right] \right]
= \mathbb{E}\left[ M_m : \left( \tfrac{k}{2^n}, \tfrac{k+1}{2^n} \right] \right] = 0.
\]
Using $\mathbb{E}[X : A] = 0$ for all $A$ in the $\pi$-system
\[
\mathcal{Q} := \bigcup_{n=1}^\infty \left\{ \left( \tfrac{k}{2^n}, \tfrac{k+1}{2^n} \right] : 0 < k < 2^n \right\},
\]
an application of the "$\pi$-$\lambda$ theorem" shows $\mathbb{E}[X : A] = 0$ for all $A \in \sigma(\mathcal{Q}) = \mathcal{F}$. By more general measure-theoretic arguments, this suffices to show $X = 0$ a.s., which is impossible since $1 = \mathbb{E} M_n = \mathbb{E} X$.
Moral: not all $L^1$-bounded martingales are regular. It is also worth noting that pointwise $M_\infty = \lim_{n \to \infty} M_n = 0$, yet
\[
\lim_{n \to \infty} \mathbb{E} M_n = 1 \ne 0 = \mathbb{E} M_\infty = \mathbb{E} \lim_{n \to \infty} M_n.
\]
Exercise 12.2. Show that $M_n := 2^n 1_{(0, 2^{-n}]}$ for $n \in \mathbb{N}$, as defined in Example 12.4, is a martingale.
Lemma 12.5. Let $X := \{X_n\}_{n=0}^\infty$ be an adapted process of integrable random variables on a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_n\}_{n=0}^\infty, P)$ and let $d_n := X_n - X_{n-1}$ with $X_{-1} := \mathbb{E} X_0$. Then $X$ is a martingale (respectively submartingale or supermartingale) iff $\mathbb{E}[d_{n+1} | \mathcal{F}_n] = 0$ ($\mathbb{E}[d_{n+1} | \mathcal{F}_n] \ge 0$ or $\mathbb{E}[d_{n+1} | \mathcal{F}_n] \le 0$ respectively) for all $n \in \mathbb{N}_0$.

Conversely, if $\{d_n\}_{n=1}^\infty$ is an adapted sequence of integrable random variables and $X_0$ is an $\mathcal{F}_0$-measurable integrable random variable, then $X_n = X_0 + \sum_{j=1}^n d_j$ is a martingale (respectively submartingale or supermartingale) iff $\mathbb{E}[d_{n+1} | \mathcal{F}_n] = 0$ ($\ge 0$ or $\le 0$ respectively) for all $n \in \mathbb{N}$.
Proof. We prove the assertion for martingales only, the others all being similar. Clearly $X$ is a martingale iff
\[
0 = \mathbb{E}[X_{n+1} | \mathcal{F}_n] - X_n = \mathbb{E}[X_{n+1} - X_n | \mathcal{F}_n] = \mathbb{E}[d_{n+1} | \mathcal{F}_n].
\]
The second assertion is an easy consequence of the first.
Example 12.6. Suppose that $\{Z_n\}_{n=0}^\infty$ is a sequence of independent integrable random variables, $X_n = Z_0 \cdots Z_n$, and $\mathcal{F}_n := \sigma(Z_0, \dots, Z_n)$. (Observe that $\mathbb{E}|X_n| = \prod_{k=0}^n \mathbb{E}|Z_k| < \infty$.) Since
\[
\mathbb{E}[X_{n+1} | \mathcal{F}_n] = \mathbb{E}[X_n Z_{n+1} | \mathcal{F}_n] = X_n\, \mathbb{E}[Z_{n+1} | \mathcal{F}_n] = X_n \cdot \mathbb{E}[Z_{n+1}] \quad \text{a.s.},
\]
it follows that $\{X_n\}_{n=0}^\infty$ is a martingale if $\mathbb{E} Z_n = 1$ for all $n$. If we further assume that $Z_n \ge 0$ for all $n$, so that $X_n \ge 0$, then $\{X_n\}_{n=0}^\infty$ is a supermartingale (submartingale) provided $\mathbb{E} Z_n \le 1$ ($\mathbb{E} Z_n \ge 1$) for all $n$.
Let us specialize the above example even further by taking $Z_n \overset{d}{=} p + U$, where $p \ge 0$ and $U$ is uniformly distributed on $[0, 1]$. In this case we have, by the strong law of large numbers, that
\[
\frac{1}{n} \ln X_n = \frac{1}{n} \sum_{k=0}^n \ln Z_k \to \mathbb{E}[\ln(p + U)] \quad \text{a.s.} \tag{12.3}
\]
An elementary computation shows
\[
\mathbb{E}[\ln(p + U)] = \int_0^1 \ln(p + x)\, dx = \int_p^{p+1} \ln x\, dx = (x \ln x - x)\Big|_{x=p}^{x=p+1} = (p+1)\ln(p+1) - p \ln p - 1.
\]
The function $f(p) := \mathbb{E}[\ln(p + U)]$ has a zero at $p = p_c \cong 0.54221$, with $f(p) < 0$ for $p < p_c$ and $f(p) > 0$ for $p > p_c$; see Figure 12.1. Combining these observations with Eq. (12.3) implies
\[
X_n \to \lim_{n \to \infty} \exp\left( n\, \mathbb{E}[\ln(p + U)] \right) =
\begin{cases}
0 & \text{if } p < p_c \\
? & \text{if } p = p_c \\
\infty & \text{if } p > p_c
\end{cases}
\quad \text{a.s.}
\]
Notice that $\mathbb{E} Z_n = p + 1/2$, and therefore $X_n$ is a martingale precisely when $p = 1/2$ and is a submartingale for $p > 1/2$. So for $1/2 < p < p_c$, $\{X_n\}_{n=1}^\infty$ is a positive submartingale with $\mathbb{E} X_n = (p + 1/2)^{n+1} \to \infty$, yet $\lim_{n \to \infty} X_n = 0$ a.s.
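The claims about $p_c$ and Eq. (12.3) are easy to check numerically. The following Python sketch (an illustration, not part of the notes) locates $p_c$ by bisection using the closed form for $f(p) = \mathbb{E}[\ln(p+U)]$, and samples $\frac{1}{n}\ln X_n$ to see the sign of the limit:

```python
import math
import random

def f(p):
    """E[ln(p + U)] = (p+1) ln(p+1) - p ln p - 1, with 0 ln 0 := 0."""
    return (p + 1) * math.log(p + 1) - (p * math.log(p) if p > 0 else 0.0) - 1

# f is increasing (f'(p) = ln(1 + 1/p) > 0) with f(0.5) < 0 < f(0.6);
# bisect to locate p_c ~ 0.54221
lo, hi = 0.5, 0.6
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = ((mid, hi) if f(mid) < 0 else (lo, mid))
p_c = (lo + hi) / 2

def log_Xn_over_n(p, n, rng):
    """One sample of (1/n) ln X_n; by Eq. (12.3) this is near f(p) for large n."""
    return sum(math.log(p + rng.random()) for _ in range(n)) / n
```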
Fig. 12.1. The graph of $\mathbb{E}[\ln(p + U)]$ as a function of $p$. This function has a zero at $p = p_c \cong 0.54221$.
Notation 12.7 Given a sequence $\{Z_k\}_{k=0}^\infty$, let $\Delta_k Z := Z_k - Z_{k-1}$ for $k = 1, 2, \dots$.
Lemma 12.8 (Doob Decomposition). Each adapted sequence $\{Z_n\}_{n=0}^\infty$ of integrable random variables has a unique decomposition
\[
Z_n = M_n + A_n, \tag{12.4}
\]
where $\{M_n\}_{n=0}^\infty$ is a martingale and $A_n$ is a predictable process such that $A_0 = 0$. Moreover, this decomposition is given by $A_0 = 0$,
\[
A_n := \sum_{k=1}^n \mathbb{E}_{\mathcal{F}_{k-1}}[\Delta_k Z] \quad \text{for } n \ge 1, \tag{12.5}
\]
and
\begin{align}
M_n &= Z_n - A_n = Z_n - \sum_{k=1}^n \mathbb{E}_{\mathcal{F}_{k-1}}[\Delta_k Z] \tag{12.6} \\
&= Z_0 + \sum_{k=1}^n \left( Z_k - \mathbb{E}_{\mathcal{F}_{k-1}} Z_k \right). \tag{12.7}
\end{align}
In particular, $\{Z_n\}_{n=0}^\infty$ is a submartingale (supermartingale) iff $A_n$ is increasing (decreasing) almost surely.
Proof. Assuming $Z_n$ has a decomposition as in Eq. (12.4), then
\[
\mathbb{E}_{\mathcal{F}_n}[\Delta_{n+1} Z] = \mathbb{E}_{\mathcal{F}_n}[\Delta_{n+1} M + \Delta_{n+1} A] = \Delta_{n+1} A, \tag{12.8}
\]
wherein we have used that $M$ is a martingale and $A$ is predictable, so that $\mathbb{E}_{\mathcal{F}_n}[\Delta_{n+1} M] = 0$ and $\mathbb{E}_{\mathcal{F}_n}[\Delta_{n+1} A] = \Delta_{n+1} A$. Hence we must define, for $n \ge 1$,
\[
A_n := \sum_{k=1}^n \Delta_k A = \sum_{k=1}^n \mathbb{E}_{\mathcal{F}_{k-1}}[\Delta_k Z],
\]
which is a predictable process. This proves the uniqueness of the decomposition and the validity of Eq. (12.5).

For existence, from Eq. (12.5) it follows that
\[
\mathbb{E}_{\mathcal{F}_n}[\Delta_{n+1} Z] = \Delta_{n+1} A = \mathbb{E}_{\mathcal{F}_n}[\Delta_{n+1} A].
\]
Hence, if we define $M_n := Z_n - A_n$, then
\[
\mathbb{E}_{\mathcal{F}_n}[\Delta_{n+1} M] = \mathbb{E}_{\mathcal{F}_n}[\Delta_{n+1} Z - \Delta_{n+1} A] = 0,
\]
and hence $\{M_n\}_{n=0}^\infty$ is a martingale. Moreover, Eq. (12.7) follows from Eq. (12.6) since
\[
M_n = Z_0 + \sum_{k=1}^n \left( \Delta_k Z - \mathbb{E}_{\mathcal{F}_{k-1}}[\Delta_k Z] \right)
\]
and
\begin{align*}
\Delta_k Z - \mathbb{E}_{\mathcal{F}_{k-1}}[\Delta_k Z] &= Z_k - Z_{k-1} - \mathbb{E}_{\mathcal{F}_{k-1}}[Z_k - Z_{k-1}] \\
&= Z_k - Z_{k-1} - \left( \mathbb{E}_{\mathcal{F}_{k-1}} Z_k - Z_{k-1} \right) = Z_k - \mathbb{E}_{\mathcal{F}_{k-1}} Z_k.
\end{align*}
Note: here is a direct check of the most useful result in this lemma, namely that
\[
M_n = Z_n - \sum_{k=1}^n \mathbb{E}_{\mathcal{F}_{k-1}}[\Delta_k Z]
\]
is a martingale. To verify this is the case, notice that
\[
M_n - M_{n-1} = Z_n - \sum_{k=1}^n \mathbb{E}_{\mathcal{F}_{k-1}}[\Delta_k Z] - \left( Z_{n-1} - \sum_{k=1}^{n-1} \mathbb{E}_{\mathcal{F}_{k-1}}[\Delta_k Z] \right) = \Delta_n Z - \mathbb{E}_{\mathcal{F}_{n-1}}[\Delta_n Z],
\]
and therefore
\[
\mathbb{E}_{\mathcal{F}_{n-1}}[M_n - M_{n-1}] = \mathbb{E}_{\mathcal{F}_{n-1}}[\Delta_n Z] - \mathbb{E}_{\mathcal{F}_{n-1}} \mathbb{E}_{\mathcal{F}_{n-1}}[\Delta_n Z] = 0.
\]
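Formula (12.5) can also be checked mechanically for a process driven by fair $\pm 1$ coin flips, where conditional expectations reduce to exact two-point averages. The following Python sketch (an illustration only) computes $A_n$ along a path; for $Z_n = S_n^2$, with $S_n$ the simple symmetric random walk, one has $\mathbb{E}_{\mathcal{F}_{k-1}}[\Delta_k Z] = 1$, so $A_n = n$ and $M_n = S_n^2 - n$ (compare Exercise 12.3 with $\sigma^2 = 1$):

```python
def cond_mean(g, prefix):
    """E[g(prefix + (X,)) | F_n] for a fair +-1 step X, by exact averaging."""
    return (g(prefix + (1,)) + g(prefix + (-1,))) / 2

def doob_A(Z, prefix):
    """Predictable part A_n = sum_{k<=n} E_{F_{k-1}}[Delta_k Z] along `prefix`."""
    return sum(cond_mean(Z, prefix[:k]) - Z(prefix[:k]) for k in range(len(prefix)))

# Z_n = S_n^2 for the simple symmetric random walk, as a worked example
Z = lambda path: sum(path) ** 2
```

Exhaustive enumeration over short paths then verifies both $A_n = n$ and the martingale property of $M_n = Z_n - A_n$ exactly.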
Exercise 12.3. Suppose that $\{Z_n\}_{n=0}^\infty$ are independent random variables such that $\sigma^2 := \mathbb{E} Z_n^2 < \infty$ and $\mathbb{E} Z_n = 0$ for all $n \ge 1$. As in Example 12.2, let $S_n := \sum_{k=0}^n Z_k$ be the martingale relative to the filtration $\mathcal{F}^Z_n := \sigma(Z_0, \dots, Z_n)$. Show
\[
M_n := S_n^2 - \sigma^2 n \quad \text{for } n \in \mathbb{N}_0
\]
is a martingale. [Hint: make use of the basic properties of conditional expectations.]
Exercise 12.4. Suppose that $\{X_n\}_{n=0}^\infty$ is a Markov chain on $S$ with one-step transition probabilities $P$, and $\mathcal{F}_n := \sigma(X_0, \dots, X_n)$. Let $f : S \to \mathbb{R}$ be a bounded (for simplicity) function and set $g(x) = (Pf)(x) - f(x)$. Show
\[
M_n := f(X_n) - \sum_{0 \le j < n} g(X_j) \quad \text{for } n \in \mathbb{N}_0
\]
is a martingale.
Exercise 12.5 (Quadratic Variation). Suppose $\{M_n\}_{n=0}^\infty$ is a square-integrable martingale. Show:

1. $\mathbb{E}\left[ M_{n+1}^2 - M_n^2 \mid \mathcal{F}_n \right] = \mathbb{E}\left[ (M_{n+1} - M_n)^2 \mid \mathcal{F}_n \right]$. Conclude from this that the Doob decomposition of $M_n^2$ is of the form
\[
M_n^2 = N_n + A_n,
\]
where
\[
A_n := \sum_{1 \le k \le n} \mathbb{E}\left[ (M_k - M_{k-1})^2 \mid \mathcal{F}_{k-1} \right].
\]
2. If we further assume that $M_k - M_{k-1}$ is independent of $\mathcal{F}_{k-1}$ for all $k = 1, 2, \dots$, explain why
\[
A_n = \sum_{1 \le k \le n} \mathbb{E}\left[ (M_k - M_{k-1})^2 \right].
\]
Proposition 12.9 (Markov Chains and Martingales). Suppose that $S$ is a finite or countable set, $P$ is a Markov matrix on $S$, and $\{X_n\}_{n=0}^\infty$ is the associated Markov chain. If $f : S \to \mathbb{R}$ is a function such that either $f \ge 0$ or $\mathbb{E}|f(X_n)| < \infty$ for all $n \in \mathbb{N}_0$, and $Pf = f$ (respectively $Pf \le f$), then $Z_n := f(X_n)$ is a martingale (respectively supermartingale). More generally, if $f : \mathbb{N}_0 \times S \to \mathbb{R}$ is such that either $f \ge 0$ or $\mathbb{E}|f(n, X_n)| < \infty$ for all $n \in \mathbb{N}_0$, and $Pf(n+1, \cdot) = f(n, \cdot)$ (respectively $Pf(n+1, \cdot) \le f(n, \cdot)$), then $Z_n := f(n, X_n)$ is a martingale (respectively supermartingale).
Proof. Using the Markov property and the definition of $P$ we find,²
\[
\mathbb{E}[Z_{n+1} | \mathcal{F}_n] = \mathbb{E}[f(n+1, X_{n+1}) | \mathcal{F}_n] = \mathbb{E}[f(n+1, X_{n+1}) | X_n] = [Pf(n+1, \cdot)](X_n).
\]
The latter expression is equal (respectively less than or equal) to $Z_n$ if $Pf(n+1, \cdot) = f(n, \cdot)$ (respectively $Pf(n+1, \cdot) \le f(n, \cdot)$) for all $n \ge 0$.
Remark 12.10. One way to find solutions to the equation $Pf(n+1, \cdot) = f(n, \cdot)$, at least for a finite number of $n$, is to let $g : S \to \mathbb{R}$ be an arbitrary function and $T \in \mathbb{N}$ be given, and then define
\[
f(n, y) := \left( P^{T-n} g \right)(y) \quad \text{for } 0 \le n \le T.
\]
Then $Pf(n+1, \cdot) = P\left( P^{T-n-1} g \right) = P^{T-n} g = f(n, \cdot)$, and we will have that
\[
Z_n = f(n, X_n) = \left( P^{T-n} g \right)(X_n)
\]
is a martingale for $0 \le n \le T$. If $f(n, \cdot)$ satisfies $Pf(n+1, \cdot) = f(n, \cdot)$ for all $n$, then we must have, with $f_0 := f(0, \cdot)$,
\[
f(n, \cdot) = P^{-n} f_0,
\]
where $P^{-1} g$ denotes a function $h$ solving $Ph = g$. In general $P$ is not invertible, and hence there may be no solution to $Ph = g$, or there might be many solutions.
Example 12.11. In special cases one can often make sense of these expressions. Let $S = \mathbb{Z}$, $S_n = X_0 + X_1 + \dots + X_n$, where $\{X_i\}_{i=1}^\infty$ are i.i.d. with $P(X_i = 1) = p \in (0, 1)$ and $P(X_i = -1) = q := 1 - p$, and $X_0$ is an $S$-valued random variable independent of $\{X_i\}_{i=1}^\infty$. Recall that $\{S_n\}_{n=0}^\infty$ is a time-homogeneous Markov chain with transition kernel determined by $Pf(x) = p f(x+1) + q f(x-1)$. As we have seen, if $f(x) = a + b\,(q/p)^x$, then $Pf = f$, and therefore
\[
M_n = a + b\,(q/p)^{S_n}
\]
is a martingale for all $a, b \in \mathbb{R}$. This is easily verified directly as well:
² Recall from Proposition 3.9 that $\mathbb{E}[f(n+1, X_{n+1}) | X_n] = h(X_n)$, where
\[
h(x) = \mathbb{E}[f(n+1, X_{n+1}) | X_n = x] = \sum_{y \in S} p(x, y)\, f(n+1, y) = [Pf(n+1, \cdot)](x).
\]
\begin{align*}
\mathbb{E}_{\mathcal{F}_n}\left( \frac{q}{p} \right)^{S_{n+1}}
&= \mathbb{E}_{\mathcal{F}_n}\left( \frac{q}{p} \right)^{S_n + X_{n+1}}
= \left( \frac{q}{p} \right)^{S_n} \mathbb{E}_{\mathcal{F}_n}\left( \frac{q}{p} \right)^{X_{n+1}}
= \left( \frac{q}{p} \right)^{S_n} \mathbb{E}\left( \frac{q}{p} \right)^{X_{n+1}} \\
&= \left( \frac{q}{p} \right)^{S_n} \cdot \left[ \left( \frac{q}{p} \right)^1 p + \left( \frac{q}{p} \right)^{-1} q \right]
= \left( \frac{q}{p} \right)^{S_n} \cdot [q + p] = \left( \frac{q}{p} \right)^{S_n}.
\end{align*}
Now suppose that $\lambda \ne 0$ and observe that $P\lambda^x = \left( p\lambda + q\lambda^{-1} \right)\lambda^x$. Thus it follows that we may set $P^{-1}\lambda^x = \left( p\lambda + q\lambda^{-1} \right)^{-1}\lambda^x$ and therefore conclude that
\[
f(n, x) := P^{-n}\lambda^x = \left( p\lambda + q\lambda^{-1} \right)^{-n}\lambda^x
\]
satisfies $Pf(n+1, \cdot) = f(n, \cdot)$. So if we suppose that $X_0$ is bounded, so that $S_n$ is bounded for all $n$, we will have that
\[
M_n = \left( p\lambda + q\lambda^{-1} \right)^{-n}\lambda^{S_n}, \quad n \ge 0,
\]
is a martingale for all $\lambda \ne 0$.
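The martingale property of $M_n = (p\lambda + q\lambda^{-1})^{-n}\lambda^{S_n}$ reduces to the one-step identity $p\,M(n+1, s+1) + q\,M(n+1, s-1) = M(n, s)$, which follows from $p\lambda^{s+1} + q\lambda^{s-1} = (p\lambda + q\lambda^{-1})\lambda^s$. It can be verified exactly in Python with `fractions`; the particular values of $p$ and $\lambda$ below are arbitrary illustrative choices, not taken from the notes:

```python
from fractions import Fraction

p, lam = Fraction(2, 3), Fraction(5, 2)    # arbitrary illustrative parameters
q = 1 - p
c = p * lam + q / lam                      # P lambda^x = c * lambda^x

def M(n, s):
    """M_n = (p lam + q/lam)^{-n} lam^{S_n}, evaluated at S_n = s."""
    return lam ** s / c ** n
```

Working over the rationals avoids any floating-point ambiguity, so the martingale identity holds as exact equality.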
Exercise 12.6. For $\theta \in \mathbb{R}$, let
\[
f_\theta(n, x) := P^{-n} e^{\theta x} = \left( p e^\theta + q e^{-\theta} \right)^{-n} e^{\theta x},
\]
so that $Pf_\theta(n+1, \cdot) = f_\theta(n, \cdot)$ for all $\theta \in \mathbb{R}$. Compute:

1. $f_\theta^{(k)}(n, x) := \left( \frac{d}{d\theta} \right)^k f_\theta(n, x)$ for $k = 1, 2$.
2. Use your results to show
\[
M_n^{(1)} := S_n - n(p - q)
\]
and
\[
M_n^{(2)} := \left( S_n - n(p - q) \right)^2 - 4npq
\]
are martingales.

(If you are ambitious you might also find $M_n^{(3)}$.)
Example 12.12 (Polya urns). (This example is a special case of Proposition 12.9.) Let us consider an urn that contains red and green balls, and let $(R_n, G_n)$ denote the number of red and green balls (respectively) in the urn at time $n$. If at a given time there are $r$ red balls and $g$ green balls in the urn, we draw one of these balls at random, replace it, and add $c$ more balls of the same color as the one drawn. Thus $(R_n, G_n)$ is a Markov process with transition probabilities determined by
\[
P\left( (R_{n+1}, G_{n+1}) = (r + c, g) \mid (R_n, G_n) = (r, g) \right) = \frac{r}{r + g} \quad \text{and} \quad
P\left( (R_{n+1}, G_{n+1}) = (r, g + c) \mid (R_n, G_n) = (r, g) \right) = \frac{g}{r + g}.
\]
We now let
\[
\mathcal{F}_n := \sigma\left( (R_k, G_k) : k \le n \right) = \sigma\left( X_k : k \le n \right)
\]
(with $X_k$ as defined below) and recall that, by the Markov property along with Proposition 3.9 (the basic formula for computing conditional expectations), we have
\[
\mathbb{E}[f(R_{n+1}, G_{n+1}) | \mathcal{F}_n] = \mathbb{E}[f(R_{n+1}, G_{n+1}) | (R_n, G_n)] = h(R_n, G_n),
\]
where
\[
h(r, g) := \mathbb{E}[f(R_{n+1}, G_{n+1}) | (R_n, G_n) = (r, g)] = \frac{r}{r + g} f(r + c, g) + \frac{g}{r + g} f(r, g + c).
\]
Thus we have shown
\[
\mathbb{E}[f(R_{n+1}, G_{n+1}) | \mathcal{F}_n] = \frac{R_n}{R_n + G_n} f(R_n + c, G_n) + \frac{G_n}{R_n + G_n} f(R_n, G_n + c). \tag{12.9}
\]
Next let us observe that $R_n + G_n = R_0 + G_0 + nc$, and hence if we let $X_n$ be the fraction of green balls in the urn at time $n$,
\[
X_n := \frac{G_n}{R_n + G_n} = \frac{G_n}{R_0 + G_0 + nc},
\]
then we claim that $\{X_n\}_{n=0}^\infty$ is an $\{\mathcal{F}_n\}$-martingale. Indeed, using Eq. (12.9) with $f(r, g) = \frac{g}{r + g}$, we learn
\begin{align*}
\mathbb{E}[X_{n+1} | \mathcal{F}_n] &= \mathbb{E}[X_{n+1} | (R_n, G_n)] \\
&= \frac{R_n}{R_n + G_n} \cdot \frac{G_n}{R_n + c + G_n} + \frac{G_n}{R_n + G_n} \cdot \frac{G_n + c}{R_n + G_n + c} \\
&= \frac{G_n}{R_n + G_n} \cdot \frac{R_n + G_n + c}{R_n + G_n + c} = X_n.
\end{align*}
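A quick simulation (an illustration, not part of the notes) shows the martingale property of the green-ball fraction in action: the average of $X_n$ over many independent runs stays near $X_0$, even though individual trajectories wander.

```python
import random

def polya_fractions(r0, g0, c, n_steps, rng):
    """One trajectory of X_n = G_n / (R_n + G_n) for the Polya urn."""
    r, g = r0, g0
    xs = [g / (r + g)]
    for _ in range(n_steps):
        if rng.random() < g / (r + g):    # a green ball is drawn w.p. g/(r+g)
            g += c                        # ... so add c green balls
        else:
            r += c                        # otherwise add c red balls
        xs.append(g / (r + g))
    return xs
```

With $R_0 = G_0 = 1$ and $c = 1$, for example, the Monte Carlo mean of $X_{50}$ is close to $X_0 = 1/2$, consistent with $\mathbb{E} X_n = X_0$.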
12.2 Jensen's and Hölder's Inequalities
Theorem 12.13 (Jensen's Inequality). Suppose that $(\Omega, P)$ is a probability space, $-\infty \le a < b \le \infty$, and $\varphi : (a, b) \to \mathbb{R}$ is a convex function (i.e. $\varphi''(x) \ge 0$ on $(a, b)$ when $\varphi$ is twice differentiable). If $f : \Omega \to (a, b)$ is a random variable with $\mathbb{E}|f| < \infty$, then
\[
\varphi(\mathbb{E} f) \le \mathbb{E}[\varphi(f)].
\]
[Here it may happen that $\varphi \circ f \notin L^1(P)$; in this case it is claimed that $\varphi \circ f$ is integrable in the extended sense and $\mathbb{E}[\varphi(f)] = \infty$.]
Proof. Let $t = \mathbb{E} f \in (a, b)$ and let $\beta \in \mathbb{R}$ ($\beta = \varphi'(t)$ when $\varphi'(t)$ exists) be such that
\[
\varphi(s) - \varphi(t) \ge \beta(s - t) \quad \text{for all } s \in (a, b); \tag{12.10}
\]
see Figure 12.2. Then integrating the inequality $\varphi(f) - \varphi(t) \ge \beta(f - t)$ implies that
\[
0 \le \mathbb{E}\varphi(f) - \varphi(t) = \mathbb{E}\varphi(f) - \varphi(\mathbb{E} f).
\]
Moreover, if $\varphi(f)$ is not integrable, then $\varphi(f) \ge \varphi(t) + \beta(f - t)$, which shows that the negative part of $\varphi(f)$ is integrable. Therefore, $\mathbb{E}\varphi(f) = \infty$ in this case.
Fig. 12.2. A convex function, $\varphi$, along with a chord and a tangent line. Notice that the tangent line is always below $\varphi$ and the chord lies above $\varphi$ between the points of intersection of the chord with the graph of $\varphi$.
Example 12.14. Since $e^x$ for $x \in \mathbb{R}$, $-\ln x$ for $x > 0$, and $x^p$ for $x \ge 0$ and $p \ge 1$ are all convex functions, we have the following inequalities:
\[
\exp(\mathbb{E} f) \le \mathbb{E} e^f, \tag{12.11}
\]
\[
\mathbb{E} \log |f| \le \log(\mathbb{E}|f|),
\]
and for $p \ge 1$,
\[
|\mathbb{E} f|^p \le (\mathbb{E}|f|)^p \le \mathbb{E}|f|^p.
\]
Example 12.15. As a special case of Eq. (12.11), if $p_i, s_i > 0$ for $i = 1, 2, \dots, n$ and $\sum_{i=1}^n \frac{1}{p_i} = 1$, then
\[
s_1 \cdots s_n = e^{\sum_{i=1}^n \ln s_i} = e^{\sum_{i=1}^n \frac{1}{p_i} \ln s_i^{p_i}} \le \sum_{i=1}^n \frac{1}{p_i} e^{\ln s_i^{p_i}} = \sum_{i=1}^n \frac{s_i^{p_i}}{p_i}. \tag{12.12}
\]
Indeed, we have applied Eq. (12.11) with $\Omega = \{1, 2, \dots, n\}$, $\mu = \sum_{i=1}^n \frac{1}{p_i} \delta_i$, and $f(i) := \ln s_i^{p_i}$. As a special case of Eq. (12.12), suppose that $s, t > 0$ and $p, q \in (1, \infty)$ with $q = \frac{p}{p-1}$ (i.e. $\frac{1}{p} + \frac{1}{q} = 1$); then
\[
st \le \frac{1}{p} s^p + \frac{1}{q} t^q. \tag{12.13}
\]
(When $p = q = 2$, the inequality in Eq. (12.13) follows from the inequality $0 \le (s - t)^2$.)

As another special case of Eq. (12.12), take $p_i = n$ and $s_i = a_i^{1/n}$ with $a_i > 0$; then we get the arithmetic-geometric mean inequality,
\[
\sqrt[n]{a_1 \cdots a_n} \le \frac{1}{n} \sum_{i=1}^n a_i. \tag{12.14}
\]
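Young's inequality (12.13) and the arithmetic-geometric mean inequality (12.14) are easy to spot-check numerically. The sketch below (an illustration only) computes the gaps, which must be non-negative; equality in the first occurs when $t = s^{p-1}$, the same equality case used in the proof of Hölder's inequality below.

```python
import math

def young_gap(s, t, p):
    """s^p/p + t^q/q - s*t >= 0 for s, t > 0 and 1/p + 1/q = 1 (Eq. (12.13))."""
    q = p / (p - 1)
    return s ** p / p + t ** q / q - s * t

def amgm_gap(xs):
    """Arithmetic mean minus geometric mean, >= 0 by Eq. (12.14)."""
    n = len(xs)
    return sum(xs) / n - math.prod(xs) ** (1 / n)
```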
Example 12.16. Let $(\Omega, P)$ be a probability space, $0 < p < q < \infty$, and $f : \Omega \to \mathbb{C}$ be a measurable function. Then by Jensen's inequality (applied to the convex function $x \mapsto x^{q/p}$ on $[0, \infty)$),
\[
(\mathbb{E}|f|^p)^{q/p} \le \mathbb{E}\left[ \left( |f|^p \right)^{q/p} \right] = \mathbb{E}|f|^q,
\]
from which it follows that $\|f\|_p \le \|f\|_q$. In particular, $L^q(P) \subset L^p(P)$ for all $0 < p < q < \infty$.
Theorem 12.17 (Hölder's inequality). Suppose that $1 \le p \le \infty$ and $q := \frac{p}{p-1}$, or equivalently $p^{-1} + q^{-1} = 1$. If $f$ and $g$ are measurable functions, then
\[
\|fg\|_1 \le \|f\|_p \cdot \|g\|_q. \tag{12.15}
\]
Assuming $p \in (1, \infty)$ and $\|f\|_p \cdot \|g\|_q < \infty$, equality holds in Eq. (12.15) iff $|f|^p$ and $|g|^q$ are linearly dependent as elements of $L^1$, which happens iff
\[
|g|^q \|f\|_p^p = \|g\|_q^q |f|^p \quad \text{a.s.} \tag{12.16}
\]
Proof. The cases $p = 1$, $q = \infty$ and $p = \infty$, $q = 1$ are easy to deal with and are left to the reader. So we now assume that $p, q \in (1, \infty)$. If $\|f\|_p = 0$ or $\infty$, or $\|g\|_q = 0$ or $\infty$, Eq. (12.15) is again easily verified. So we will now assume that $0 < \|f\|_p, \|g\|_q < \infty$. Taking $s = |f|/\|f\|_p$ and $t = |g|/\|g\|_q$ in Eq. (12.13) gives
\[
\frac{|fg|}{\|f\|_p \|g\|_q} \le \frac{1}{p} \frac{|f|^p}{\|f\|_p^p} + \frac{1}{q} \frac{|g|^q}{\|g\|_q^q}, \tag{12.17}
\]
with equality iff $|g|/\|g\|_q = |f|^{p-1}/\|f\|_p^{p-1} = |f|^{p/q}/\|f\|_p^{p/q}$, i.e. iff $|g|^q \|f\|_p^p = \|g\|_q^q |f|^p$. Integrating Eq. (12.17) implies
\[
\frac{\|fg\|_1}{\|f\|_p \|g\|_q} \le \frac{1}{p} + \frac{1}{q} = 1,
\]
with equality iff Eq. (12.16) holds. The proof is finished since it is easily checked that equality holds in Eq. (12.15) when $|f|^p = c|g|^q$ or $|g|^q = c|f|^p$ for some constant $c$.
Theorem 12.18 (Conditional Jensen's inequality). Let $(\Omega, P)$ be a probability space, $-\infty \le a < b \le \infty$, and $\varphi : (a, b) \to \mathbb{R}$ be a convex function. Assume $f \in L^1(\Omega, P)$ is a random variable satisfying $f \in (a, b)$ a.s. and $\varphi(f) \in L^1(\Omega, P)$. Then for any sub-sigma-algebra (of information) $\mathcal{G}$ on $\Omega$,
\[
\varphi(\mathbb{E}_{\mathcal{G}} f) \le \mathbb{E}_{\mathcal{G}}[\varphi(f)] \quad \text{a.s.} \tag{12.18}
\]
and
\[
\mathbb{E}[\varphi(\mathbb{E}_{\mathcal{G}} f)] \le \mathbb{E}[\varphi(f)], \tag{12.19}
\]
where $\mathbb{E}_{\mathcal{G}} f := \mathbb{E}[f | \mathcal{G}]$.
Proof (not completely rigorous). Assume that $\varphi$ is $C^1$, in which case Eq. (12.10) becomes
\[
\varphi(s) - \varphi(t) \ge \varphi'(t)(s - t) \quad \text{for all } s, t \in (a, b). \tag{12.20}
\]
We now take $s = f$ and $t = \mathbb{E}_{\mathcal{G}} f$ in this inequality to find
\[
\varphi(f) - \varphi(\mathbb{E}_{\mathcal{G}} f) \ge \varphi'(\mathbb{E}_{\mathcal{G}} f)(f - \mathbb{E}_{\mathcal{G}} f).
\]
The result now follows by taking $\mathbb{E}_{\mathcal{G}}$ of both sides of this inequality to learn
\[
\mathbb{E}_{\mathcal{G}}\varphi(f) - \varphi(\mathbb{E}_{\mathcal{G}} f) = \mathbb{E}_{\mathcal{G}}[\varphi(f) - \varphi(\mathbb{E}_{\mathcal{G}} f)]
\ge \mathbb{E}_{\mathcal{G}}[\varphi'(\mathbb{E}_{\mathcal{G}} f)(f - \mathbb{E}_{\mathcal{G}} f)] = \varphi'(\mathbb{E}_{\mathcal{G}} f) \cdot \mathbb{E}_{\mathcal{G}}[f - \mathbb{E}_{\mathcal{G}} f] = 0.
\]
The technical problem with this argument is the justification that $\mathbb{E}_{\mathcal{G}}[\varphi'(\mathbb{E}_{\mathcal{G}} f)(f - \mathbb{E}_{\mathcal{G}} f)] = \varphi'(\mathbb{E}_{\mathcal{G}} f)\, \mathbb{E}_{\mathcal{G}}[f - \mathbb{E}_{\mathcal{G}} f]$, since there is no reason for $\varphi'$ to be a bounded function. The honest proof below circumvents this technical detail.
*Honest proof. Let $\Lambda := \mathbb{Q} \cap (a, b)$, a countable dense subset of $(a, b)$. By the basic properties of convex functions,
\[
\varphi(s) \ge \varphi(t) + \varphi'_-(t)(s - t) \quad \text{for all } t, s \in (a, b), \tag{12.21}
\]
where $\varphi'_-(t)$ is the left-hand derivative of $\varphi$ at $t$. Taking $s = f$ and then taking conditional expectations imply
\[
\mathbb{E}_{\mathcal{G}}[\varphi(f)] \ge \mathbb{E}_{\mathcal{G}}\left[ \varphi(t) + \varphi'_-(t)(f - t) \right] = \varphi(t) + \varphi'_-(t)(\mathbb{E}_{\mathcal{G}} f - t) \quad \text{a.s.} \tag{12.22}
\]
Since this is true for all $t \in (a, b)$ (and hence all $t$ in the countable set $\Lambda$), we may conclude that
\[
\mathbb{E}_{\mathcal{G}}[\varphi(f)] \ge \sup_{t \in \Lambda}\left[ \varphi(t) + \varphi'_-(t)(\mathbb{E}_{\mathcal{G}} f - t) \right] \quad \text{a.s.}
\]
Since $\mathbb{E}_{\mathcal{G}} f \in (a, b)$, we may conclude that
\[
\sup_{t \in \Lambda}\left[ \varphi(t) + \varphi'_-(t)(\mathbb{E}_{\mathcal{G}} f - t) \right] = \varphi(\mathbb{E}_{\mathcal{G}} f) \quad \text{a.s.}
\]
Combining the last two estimates proves Eq. (12.18).

From Eq. (12.18) and Eq. (12.21) with $s = \mathbb{E}_{\mathcal{G}} f$ and $t \in (a, b)$ fixed we find
\[
\varphi(t) + \varphi'_-(t)(\mathbb{E}_{\mathcal{G}} f - t) \le \varphi(\mathbb{E}_{\mathcal{G}} f) \le \mathbb{E}_{\mathcal{G}}[\varphi(f)]. \tag{12.23}
\]
Therefore
\[
|\varphi(\mathbb{E}_{\mathcal{G}} f)| \le |\mathbb{E}_{\mathcal{G}}[\varphi(f)]| \vee \left| \varphi(t) + \varphi'_-(t)(\mathbb{E}_{\mathcal{G}} f - t) \right| \in L^1(\Omega, \mathcal{G}, P), \tag{12.24}
\]
which implies that $\varphi(\mathbb{E}_{\mathcal{G}} f) \in L^1(\Omega, \mathcal{G}, P)$. Taking expectations of Eq. (12.18) is now allowed and immediately gives Eq. (12.19).
Corollary 12.19. If $\{X_n\}_{n=0}^\infty$ is a martingale and $\varphi$ is a convex function such that $\mathbb{E}[\varphi(X_n)] < \infty$ for all $n$, then $\{\varphi(X_n)\}_{n=0}^\infty$ is a submartingale. [Typical examples are $\varphi(x) = |x|^p$ for $p \ge 1$.]

Proof. As $\{X_n\}_{n=0}^\infty$ is a martingale, $X_n = \mathbb{E}_{\mathcal{F}_n} X_{n+1}$. Applying $\varphi$ to both sides of this equation and then using the conditional form of Jensen's inequality shows
\[
\varphi(X_n) = \varphi(\mathbb{E}_{\mathcal{F}_n} X_{n+1}) \le \mathbb{E}_{\mathcal{F}_n}[\varphi(X_{n+1})].
\]
12.3 Stochastic Integrals and Optional Stopping
Notation 12.20 Suppose that $\{c_n\}_{n=1}^\infty$ and $\{x_n\}_{n=0}^\infty$ are two sequences of numbers. Let $c \cdot \Delta x = \{(c \cdot \Delta x)_n\}_{n \in \mathbb{N}_0}$ denote the sequence of numbers defined by $(c \cdot \Delta x)_0 = 0$ and
\[
(c \cdot \Delta x)_n = \sum_{j=1}^n c_j (x_j - x_{j-1}) = \sum_{j=1}^n c_j \Delta_j x \quad \text{for } n \ge 1.
\]
(For convenience of notation later we will interpret $\sum_{j=1}^0 c_j \Delta_j x = 0$.)
For a gambling interpretation of $(c \cdot \Delta x)_n$, let $x_j$ represent the price of a stock at time $j$. Suppose that you, the investor, buy $c_j$ shares at time $j-1$ and then sell these shares back at time $j$. With this interpretation, $c_j \Delta_j x$ represents your profit (or loss if negative) in the time interval from $j-1$ to $j$, and $(c \cdot \Delta x)_n$ represents your profit (or loss) from time $0$ to time $n$. By the way, if you want to buy $5$ shares of the stock at time $n = 3$ and then sell them all at time $9$, you would take $c_k = 5 \cdot 1_{3 < k \le 9}$, so that
\[
(c \cdot \Delta x)_9 = 5 \cdot \sum_{3 < k \le 9} \Delta_k x = 5 \cdot (x_9 - x_3)
\]
would represent your profit (or loss) for this transaction. The next example formalizes this observation.
Example 12.21. Suppose that $0 \le \sigma \le \tau$ where $\sigma, \tau \in \mathbb{N}_0$, and let $c_n := 1_{\sigma < n \le \tau}$. Then
\begin{align*}
(c \cdot \Delta x)_n &= \sum_{j=1}^n 1_{\sigma < j \le \tau}(x_j - x_{j-1}) = \sum_{j=1}^\infty 1_{\sigma < j \le \tau \wedge n}(x_j - x_{j-1}) \\
&= \sum_{j=1}^\infty 1_{\sigma \wedge n < j \le \tau \wedge n}(x_j - x_{j-1}) = x_{\tau \wedge n} - x_{\sigma \wedge n}.
\end{align*}
More generally, if $\sigma, \tau \in \mathbb{N}_0$ are arbitrary and $c_n := 1_{\sigma < n \le \tau}$, we will have $c_n = 1_{\sigma \wedge \tau < n \le \tau}$ and therefore
\[
(c \cdot \Delta x)_n = x_{\tau \wedge n} - x_{\sigma \wedge \tau \wedge n}.
\]
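The telescoping identity above can be checked directly in code. The sketch below (an illustration only) treats $x$ as a list of prices and $c$ as a function of the time index:

```python
def stochastic_integral(c, x, n):
    """(c . Delta x)_n = sum_{j=1}^{n} c(j) * (x[j] - x[j-1])."""
    return sum(c(j) * (x[j] - x[j - 1]) for j in range(1, n + 1))

def indicator(sigma, tau):
    """c_n = 1_{sigma < n <= tau}: buy at time sigma, sell at time tau."""
    return lambda n: 1 if sigma < n <= tau else 0
```

For $0 \le \sigma \le \tau$ one finds `stochastic_integral(indicator(sigma, tau), x, n)` equal to $x_{\tau \wedge n} - x_{\sigma \wedge n}$, as claimed.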
Notation 12.22 (Stochastic intervals) If $\sigma, \tau : \Omega \to \mathbb{N}$, let
\[
(\sigma, \tau] := \left\{ (\omega, n) \in \Omega \times \mathbb{N} : \sigma(\omega) < n \le \tau(\omega) \right\},
\]
and we will write $1_{(\sigma, \tau]}$ for the process $1_{\sigma < n \le \tau}$.
Definition 12.23. We say $\{C_n : \Omega \to S\}_{n=1}^\infty$ is predictable or previsible if each $C_n$ is $\mathcal{F}_{n-1}$-measurable for all $n \in \mathbb{N}$, i.e. $C_n = F_n\left( \{X_k, Y_k, Z_k, \dots\}_{k=0}^{n-1} \right)$ for some function $F_n$.
Lemma 12.24. If $\tau$ is a stopping time, then the processes $f_n := 1_{\tau \le n}$ and $f_n := 1_{\tau = n}$ are adapted, and $f_n := 1_{\tau < n}$ and $f_n := 1_{n \le \tau}$ are predictable. Moreover, if $\sigma$ and $\tau$ are two stopping times, then $f_n := 1_{\sigma < n \le \tau}$ is predictable.

Proof. These are all trivial to prove. For example, if $f_n := 1_{\sigma < n \le \tau}$, then $f_n$ is $\mathcal{F}_{n-1}$-measurable since
\[
\{\sigma < n \le \tau\} = \{\sigma < n\} \cap \{n \le \tau\} = \{\sigma < n\} \cap \{\tau < n\}^c \in \mathcal{F}_{n-1}.
\]
Proposition 12.25 (The Discrete Stochastic Integral). Let $X = \{X_n\}_{n=0}^\infty$ be an adapted integrable process, i.e. $\mathbb{E}|X_n| < \infty$ for all $n$. If $X$ is a martingale and $\{C_n\}_{n=1}^\infty$ is a predictable (as in Definition 12.23) sequence of bounded random variables, then $\{(C \cdot \Delta X)_n\}_{n=1}^\infty$ is still a martingale. If $X := \{X_n\}_{n=0}^\infty$ is a submartingale (supermartingale) (necessarily real-valued) and $C_n \ge 0$, then $\{(C \cdot \Delta X)_n\}_{n=1}^\infty$ is a submartingale (supermartingale). (In other words, $X$ is a submartingale if, no matter what your (non-negative) betting strategy is, you will make money on average.)
Proof. For any adapted process $X$, we have
\begin{align}
\mathbb{E}\left[ (C \cdot \Delta X)_{n+1} \mid \mathcal{F}_n \right] &= \mathbb{E}\left[ (C \cdot \Delta X)_n + C_{n+1}(X_{n+1} - X_n) \mid \mathcal{F}_n \right] \nonumber \\
&= (C \cdot \Delta X)_n + C_{n+1}\, \mathbb{E}\left[ X_{n+1} - X_n \mid \mathcal{F}_n \right], \tag{12.25}
\end{align}
from which the result easily follows.

Remark: for $n = 1$ we have
\[
\mathbb{E}\left[ (C \cdot \Delta X)_1 \mid \mathcal{F}_0 \right] = C_1 \cdot \mathbb{E}[\Delta_1 X | \mathcal{F}_0] =
\begin{cases}
0 & \text{if } X \text{ is a martingale} \\
\ge 0 & \text{if } X \text{ is a submartingale,}
\end{cases}
\]
and this explains why taking $(C \cdot \Delta X)_0 = 0$ is the correct choice here, given that
\[
(C \cdot \Delta X)_n = \sum_{k=1}^n C_k \Delta_k X \quad \text{when } n \ge 1.
\]
Example 12.26. Suppose that $\{X_n\}_{n=0}^\infty$ are mean-zero independent integrable random variables and $f_k : \mathbb{R}^k \to \mathbb{R}$ are bounded measurable functions for $k \in \mathbb{N}$. Then $\{Y_n\}_{n=0}^\infty$, defined by $Y_0 = 0$ and
\[
Y_n := \sum_{k=1}^n f_k(X_0, \dots, X_{k-1})\, X_k \quad \text{for } n \in \mathbb{N}, \tag{12.26}
\]
is a martingale relative to $\{\mathcal{F}^X_n\}_{n \ge 0}$. (This follows from Proposition 12.25 with the predictable process $C_k = f_k(X_0, \dots, X_{k-1})$ and the martingale $S_n := \sum_{k=0}^n X_k$ of Example 12.2, for which $\Delta_k S = X_k$.)
Notation 12.27 Given an $\{\mathcal{F}_n\}$-adapted process $X$ and an $\{\mathcal{F}_n\}$-stopping time $\tau$, let $X^\tau_n := X_{\tau \wedge n}$. We call $X^\tau := \{X^\tau_n\}_{n=0}^\infty$ the process $X$ stopped by $\tau$.
Theorem 12.28 (Optional stopping theorem). Suppose $X = \{X_n\}_{n=0}^\infty$ is a supermartingale, martingale, or submartingale with $\mathbb{E}|X_n| < \infty$ for all $n$. Then for every stopping time $\tau$, $X^\tau$ is an $\{\mathcal{F}_n\}_{n=0}^\infty$-supermartingale, martingale, or submartingale respectively.
Proof. If we define $C_k := 1_{k \le \tau}$, then
\[
(C \cdot \Delta X)_n = \sum_{k=1}^n 1_{k \le \tau}\, \Delta_k X = \sum_{k=1}^{n \wedge \tau} \Delta_k X = X_{n \wedge \tau} - X_0.
\]
This shows that
\[
X^\tau_n = X_0 + (C \cdot \Delta X)_n,
\]
and the result now follows from Proposition 12.25.
Corollary 12.29. Suppose $X = \{X_n\}_{n=0}^\infty$ is a martingale and $\tau$ is a stopping time such that $P(\tau = \infty) = 0$. If either $\mathbb{E}[\sup_n |X_n|] < \infty$ or there exists $C < \infty$ such that $\tau \le C$, then
\[
\mathbb{E} X_\tau = \mathbb{E} X_0. \tag{12.27}
\]

Proof. From Theorem 12.28 we know that
\[
\mathbb{E} X_0 = \mathbb{E}[X^\tau_n] = \mathbb{E}[X_{\tau \wedge n}] \quad \text{for all } n \in \mathbb{N}.
\]
- If we assume that $\tau \le C$, then by taking $n > C$ we learn that Eq. (12.27) holds.
- If we assume $\mathbb{E}[\sup_n |X_n|] < \infty$, then by the dominated convergence theorem (see Section 1.1),
\[
\mathbb{E} X_0 = \lim_{n \to \infty} \mathbb{E}[X_{\tau \wedge n}] = \mathbb{E}\left[ \lim_{n \to \infty} X_{\tau \wedge n} \right] = \mathbb{E}[X_\tau].
\]
Corollary 12.30 (Martingale Expectations). Suppose that $\{X_n\}_{n=0}^\infty$ is a martingale and $\tau$ is a stopping time such that $P(\tau = \infty) = 0$, $\mathbb{E}|X_\tau| < \infty$, and
\[
\lim_{n \to \infty} \mathbb{E}[|X_n| : \tau > n] = 0; \tag{12.28}
\]
then $\mathbb{E} X_\tau = \mathbb{E} X_0$.

Proof. We have
\[
X_\tau = X_{\tau \wedge n} + (X_\tau - X_n) 1_{\tau > n},
\]
and hence
\[
\mathbb{E}[X_\tau] = \mathbb{E}[X_{\tau \wedge n}] + \varepsilon_n = \mathbb{E} X_0 + \varepsilon_n,
\]
where
\[
|\varepsilon_n| = \left| \mathbb{E}[(X_\tau - X_n) 1_{\tau > n}] \right| \le \mathbb{E}[|X_\tau| 1_{\tau > n}] + \mathbb{E}[|X_n| 1_{\tau > n}].
\]
It then follows from the assumption in Eq. (12.28) and the dominated convergence theorem (see Section 1.1) that $\varepsilon_n \to 0$ as $n \to \infty$.
Example 12.31 (A simple betting game). Let $Z_0 = 0$ and $\{Z_n\}_{n=1}^\infty$ be i.i.d. Bernoulli random variables with $P(Z_n = \pm 1) = \frac{1}{2}$, and let $S_0 = 0$ and $S_n = \sum_{k=1}^n Z_k$ for all $n \ge 1$. We use the $\{Z_n\}_{n=1}^\infty$ to play a betting game. If a player bets $C_n$ dollars on the $n$th game, then she loses her bet to the house if $Z_n = -1$ and otherwise she gets double her wager back. Thus the change in her wealth $W_n$ due to playing the $n$th game is
\[
\Delta_n W := W_n - W_{n-1} = C_n Z_n,
\]
and so the change in her wealth after playing $n$ games is
\[
W_n - W_0 = \sum_{k=1}^n \Delta_k W = \sum_{k=1}^n C_k Z_k = \sum_{k=1}^n C_k \Delta_k S = (C \cdot \Delta S)_n.
\]
Example 12.32. One way to "beat" the game in Example 12.31 is to use the double-or-nothing betting strategy, i.e. start by betting \$1, keep doubling the bet until we win, and then quit playing. In more mathematical detail, let
\[
T = \inf\{ k \ge 1 : Z_k = 1 \} = \inf\{ k \ge 1 : S_k - S_{k-1} = 1 \}.
\]
Notice that $T$ is an $\{\mathcal{F}^Z_n\}$-stopping time since
\[
\{T = n\} = \{Z_1 = -1, \dots, Z_{n-1} = -1, Z_n = 1\} \in \mathcal{F}^Z_n \quad \text{for all } n,
\]
and
\[
P(T = \infty) = \lim_{n \to \infty} P(T > n) = \lim_{n \to \infty} P\left( \{Z_k = -1\}_{k=1}^n \right) = \lim_{n \to \infty} \left( \frac{1}{2} \right)^n = 0.
\]
With this notation we then let $C_k = 2^{k-1} 1_{k \le T}$, which is predictable by Lemma 12.24. Therefore
\[
W_n - W_0 = (C \cdot \Delta S)_n = \sum_{k=1}^n C_k (S_k - S_{k-1}) = \sum_{k=1}^n C_k Z_k
\]
is again a martingale. On the event $\{T > n\}$,
\[
W_n - W_0 = -\sum_{k=1}^n 2^{k-1} = -(2^n - 1),
\]
while on the event $\{T = n\}$,
\[
W_n - W_0 = -\left( 2^{n-1} - 1 \right) + 2^{n-1} = 1.
\]
Thus we have shown that $W_T - W_0 = 1$, and so $1 = \mathbb{E}[W_T - W_0] \ne 0$ in this case, even though $\mathbb{E}[W_n - W_0] = 0$ for all $n$. This example shows that the assumption in Eq. (12.28) of Corollary 12.30 cannot be eliminated in general.
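The double-or-nothing strategy is short to simulate (an illustration only): every run ends with a net gain of exactly one dollar, while the stopping time $T$ and the interim losses are unbounded, which is exactly why Eq. (12.28) fails here.

```python
import random

def double_or_nothing(rng, max_rounds=10_000):
    """Play until the first win; return (W_T - W_0, T)."""
    losses = 0
    for k in range(1, max_rounds + 1):
        bet = 2 ** (k - 1)              # C_k = 2^{k-1} on round k
        if rng.random() < 0.5:          # Z_k = +1: the wager is doubled back
            return losses + bet, k      # net gain is always exactly 1
        losses -= bet                   # Z_k = -1: the bet is lost
    raise RuntimeError("no win in max_rounds (probability 2**-max_rounds)")
```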
Example 12.33. Suppose that $\{X_n\}_{n=0}^\infty$ represents the value of a stock which is known to be a submartingale. At time $n-1$ you are allowed to buy $C_n \in [0, 1]$ shares of the stock, which you will then sell at time $n$. Your net gain (or loss) in this transaction is $C_n X_n - C_n X_{n-1} = C_n \Delta_n X$, and your wealth at time $n$ will be
\[
W_n = W_0 + \sum_{k=1}^n C_k \Delta_k X.
\]
The next lemma asserts that the way to maximize your expected gain is to choose $C_k = 1$ for all $k$, i.e. buy the maximum amount of stock you can at each stage. We will refer to this as the "all in" strategy.
The next lemma gives the optimal strategy (relative to expected return) forbuying stock which is known to be a submartingale.
Lemma 12.34 (*"All In"). If $\{X_n\}_{n=0}^\infty$ is a submartingale and $\{C_k\}_{k=1}^\infty$ is a previsible process with values in $[0, 1]$, then
\[
\mathbb{E}\left( \sum_{k=1}^n C_k \Delta_k X \right) \le \mathbb{E}[X_n - X_0],
\]
with equality when $C_k = 1$ for all $k$; i.e. the optimal strategy is to go all in.
Proof. Notice that $\{1 - C_k\}_{k=1}^\infty$ is a previsible non-negative process and therefore by Proposition 12.25,
$$\mathbb{E}\left(\sum_{k=1}^n (1 - C_k)\Delta_k X\right) \ge 0.$$
Since
$$X_n - X_0 = \sum_{k=1}^n \Delta_k X = \sum_{k=1}^n C_k\Delta_k X + \sum_{k=1}^n (1 - C_k)\Delta_k X,$$
it follows that
$$\mathbb{E}[X_n - X_0] = \mathbb{E}\left(\sum_{k=1}^n C_k\Delta_k X\right) + \mathbb{E}\left(\sum_{k=1}^n (1 - C_k)\Delta_k X\right) \ge \mathbb{E}\left(\sum_{k=1}^n C_k\Delta_k X\right).$$
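The "all in" conclusion is easy to see numerically. In the sketch below (an illustration, not from the notes) a biased $\pm 1$ walk with up-probability $p = 0.6$ stands in for the submartingale $X$, and the all-in strategy is compared with a hypothetical half-in one; the coupling through a common seed makes the comparison essentially deterministic.

```python
import random

def gains(strategy_c, n_steps=20, n_paths=20_000, p=0.6, seed=1):
    """Average winnings sum_k C_k * dX_k when X is a p-biased +-1 walk.

    strategy_c(k) returns the previsible bet C_k in [0, 1]; for
    simplicity it depends only on k here.  All numbers are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        for k in range(1, n_steps + 1):
            dx = 1 if rng.random() < p else -1   # submartingale increment, E[dx] = 0.2
            total += strategy_c(k) * dx
    return total / n_paths

all_in = gains(lambda k: 1.0)    # C_k = 1: expected gain n(2p - 1) = 4
half_in = gains(lambda k: 0.5)   # any C_k in [0, 1] does no better on average
assert all_in > half_in
assert 3.5 < all_in < 4.5
```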
Example 12.35 (*Setting the odds). Let $S$ be a finite set (think of the outcomes of a spinner, or dice, or a roulette wheel) and $p : S \to (0,1)$ be a probability function³. Let $\{Z_n\}_{n=1}^\infty$ be random functions with values in $S$ such that $p(s) = P(Z_n = s)$ for all $s \in S$. ($Z_n$ represents the outcome of the $n^{th}$ game.) Also let $\alpha : S \to [0,\infty)$ be the house's payoff function, i.e. for each dollar you (the gambler) bet on $s \in S$, the house will pay $\alpha(s)$ dollars back if $s$ is rolled. Further let $W : \Omega \to \mathbb{W}$ be a measurable function into some other measure space, $(\mathbb{W}, \mathcal{F})$, which is to represent your random (or not so random) "whims." We now assume that $Z_n$ is independent of $(W, Z_1, \dots, Z_{n-1})$ for each $n$, i.e. the dice are not influenced by the previous plays or your whims. If we let $\mathcal{F}_n := \sigma(W, Z_1, \dots, Z_n)$ with $\mathcal{F}_0 = \sigma(W)$, then we are assuming that $Z_n$ is independent of $\mathcal{F}_{n-1}$ for each $n \in \mathbb{N}$.
As a gambler, you are allowed to choose, before the $n^{th}$ game is played, the amounts $(C_n(s))_{s\in S}$ that you want to bet on each of the possible outcomes of the $n^{th}$ game. Assuming that you are not clairvoyant (i.e. can not see the future), these amounts may be random but must be $\mathcal{F}_{n-1}$ – measurable, that is $C_n(s) = C_n(W, Z_1, \dots, Z_{n-1}, s)$, i.e. $\{C_n(s)\}_{n=1}^\infty$ is a "previsible" process (see Definition 12.23 above). Thus if $X_0$ denotes your initial wealth (assumed to be a non-random quantity) and $X_n$ denotes your wealth just after the $n^{th}$ game is played, then
$$X_n - X_{n-1} = -\sum_{s\in S} C_n(s) + C_n(Z_n)\alpha(Z_n),$$
where $\sum_{s\in S} C_n(s)$ is your total bet on the $n^{th}$ game and $C_n(Z_n)\alpha(Z_n)$ represents the house's payoff to you for the $n^{th}$ game. Therefore it follows that
$$X_n = X_0 + \sum_{k=1}^n\left[-\sum_{s\in S} C_k(s) + C_k(Z_k)\alpha(Z_k)\right],$$
$X_n$ is $\mathcal{F}_n$ – measurable for each $n$, and
$$\mathbb{E}_{\mathcal{F}_{n-1}}[X_n - X_{n-1}] = -\sum_{s\in S} C_n(s) + \mathbb{E}_{\mathcal{F}_{n-1}}[C_n(Z_n)\alpha(Z_n)] = -\sum_{s\in S} C_n(s) + \sum_{s\in S} C_n(s)\alpha(s)p(s) = \sum_{s\in S} C_n(s)\left(\alpha(s)p(s) - 1\right).$$
³ To be concrete, take $S = \{2, \dots, 12\}$ representing the possible values for the sums of the upward pointing faces of two dice. Assuming the dice are independent and fair then determines $p : S \to (0,1)$. For example $p(2) = p(12) = 1/36$, $p(3) = p(11) = 1/18$, $p(7) = 1/6$, etc.
Thus it follows that, no matter the choice of the betting "strategy," $\{C_n(s) : s\in S\}_{n=1}^\infty$, we will have
$$\mathbb{E}_{\mathcal{F}_{n-1}}[X_n - X_{n-1}] \begin{cases} \ge 0 & \text{if } \alpha(\cdot)p(\cdot) \ge 1 \\ = 0 & \text{if } \alpha(\cdot)p(\cdot) = 1 \\ \le 0 & \text{if } \alpha(\cdot)p(\cdot) \le 1, \end{cases}$$
that is, $\{X_n\}_{n\ge 0}$ is a sub-martingale, martingale, or supermartingale depending on whether $\alpha\cdot p \ge 1$, $\alpha\cdot p = 1$, or $\alpha\cdot p \le 1$.
Moral: If the Casino wants to be guaranteed to make money on average, it had better choose $\alpha : S \to [0,\infty)$ such that $\alpha(s) < 1/p(s)$ for all $s \in S$. In this case the expected earnings of the gambler will be decreasing, which means the expected earnings of the Casino will be increasing.
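The drift formula above is easy to check by exact arithmetic. The sketch below (an illustration, not from the notes) uses the two-dice example of footnote 3 together with a hypothetical payoff $\alpha(s) = 0.95/p(s)$ — our choice of house edge — and verifies that the one-step drift $\sum_s C_n(s)(\alpha(s)p(s)-1)$ is $\le 0$ for several betting strategies.

```python
from fractions import Fraction

# Two-dice sum probabilities p(s), s = 2..12 (footnote 3).
p = {s: Fraction(6 - abs(s - 7), 36) for s in range(2, 13)}
assert sum(p.values()) == 1

# A house paying slightly less than the fair odds 1/p(s):
# alpha(s) = 0.95 / p(s)  (an illustrative choice, not from the notes).
alpha = {s: Fraction(95, 100) / p[s] for s in p}

def expected_one_step_gain(bets):
    """E_{F_{n-1}}[X_n - X_{n-1}] = sum_s C_n(s) (alpha(s) p(s) - 1)."""
    return sum(bets[s] * (alpha[s] * p[s] - 1) for s in bets)

# No matter how the gambler spreads the bets, the drift is <= 0,
# so the wealth process is a supermartingale.
strategies = [{7: Fraction(10)},
              {2: Fraction(1), 12: Fraction(1)},
              {s: Fraction(1) for s in p}]
for bets in strategies:
    assert expected_one_step_gain(bets) <= 0
```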
12.4 Uniform Integrability and Optional stopping
Definition 12.36. A sequence of random functions, $\{X_n\}_{n=0}^\infty$, is uniformly integrable (U.I. for short) if the tails are uniformly small in the following sense:
$$\varepsilon_K := \sup_n \mathbb{E}[|X_n| : |X_n| \ge K] \to 0 \text{ as } K \to \infty.$$
Lemma 12.37. If $\{X_n\}$ is U.I. then $\sup_n \mathbb{E}[|X_n|] < \infty$.

Proof. Fix a $K < \infty$ such that
$$\varepsilon_K := \sup_n \mathbb{E}[|X_n| : |X_n| \ge K] \le 1.$$
For this $K$ it follows that
$$\mathbb{E}[|X_n|] = \mathbb{E}[|X_n| : |X_n| \ge K] + \mathbb{E}[|X_n| : |X_n| < K] \le 1 + K$$
and so $M := \sup_n \mathbb{E}[|X_n|] \le 1 + K < \infty$.
Proposition 12.38. If $\{X_n\}$ is U.I., then for all $\varepsilon > 0$ there exists $\delta > 0$ such that $\sup_n \mathbb{E}[|X_n| : A] < \varepsilon$ whenever $A \subset \Omega$ with $P(A) < \delta$.

Proof. For any $A \subset \Omega$,
$$\mathbb{E}[|X_n| : A] = \mathbb{E}[|X_n| : A, |X_n| \ge K] + \mathbb{E}[|X_n| : A, |X_n| < K] \le \mathbb{E}[|X_n| : |X_n| \ge K] + KP(A)$$
and so
$$\sup_n \mathbb{E}[|X_n| : A] \le \varepsilon_K + KP(A). \quad (12.29)$$
Thus given $\varepsilon > 0$, choose $K$ so large that $\varepsilon_K < \varepsilon/2$ and let $\delta := \frac{\varepsilon}{2K}$, so that when $P(A) < \delta$ we will have $KP(A) < \varepsilon/2$, which combined with Eq. (12.29) implies $\sup_n \mathbb{E}[|X_n| : A] < \varepsilon$.
Corollary 12.39. Suppose that $\{X_n\}_{n=0}^\infty$ is a U.I. martingale and $\tau$ is a stopping time such that $P(\tau = \infty) = 0$ and $\mathbb{E}|X_\tau| < \infty$; then $\mathbb{E}X_\tau = \mathbb{E}X_0$.

Proof. According to Corollary 12.30 we need only use the U.I. assumption to show
$$\lim_{n\to\infty}\mathbb{E}[|X_n| : \tau > n] = 0.$$
But since $P(\tau > n) \to 0$ as $n \to \infty$, we may use Proposition 12.38 to conclude
$$\mathbb{E}[|X_n| : \tau > n] \le \sup_k \mathbb{E}[|X_k| : \tau > n] \to 0 \text{ as } n \to \infty.$$
Let us end this section with a couple of criteria for checking when $\{X_n\}$ is U.I.

Proposition 12.40 (DCT and U.I.). If $F := \sup_n |X_n| \in L^1(P)$ then $\{X_n\}$ is U.I.

Proof. We have
$$\sup_n \mathbb{E}[|X_n| : |X_n| \ge K] \le \mathbb{E}[F : F \ge K] \to 0 \text{ as } K \to \infty$$
by the dominated convergence theorem.
Proposition 12.41 ($p$ – integrability and U.I.). If there exists $p > 1$ such that $M := \sup_n \mathbb{E}|X_n|^p < \infty$, then $\{X_n\}$ is U.I.

Proof. If we let $\varepsilon := p - 1 > 0$, then
$$\mathbb{E}[|X_n| : |X_n| \ge K] = \mathbb{E}\left[|X_n|^{1+\varepsilon}\frac{1}{|X_n|^\varepsilon} : |X_n| \ge K\right] \le \frac{1}{K^\varepsilon}\mathbb{E}\left[|X_n|^{1+\varepsilon} : |X_n| \ge K\right] \le \frac{1}{K^\varepsilon}\mathbb{E}[|X_n|^p]$$
and so
$$\sup_n \mathbb{E}[|X_n| : |X_n| \ge K] \le \frac{M}{K^\varepsilon} \to 0 \text{ as } K \uparrow \infty.$$
12.5 Martingale Convergence Theorems
Definition 12.42. Suppose $X = \{X_n\}_{n=0}^\infty$ is a sequence of extended real numbers and $-\infty < a < b < \infty$. To each $N \in \mathbb{N}$, let $U_N^X(a,b)$ denote the number of up-crossings of $[a,b]$, i.e. the number of times that $\{X_n\}_{n=0}^N$ goes from below $a$ to above $b$.

Notice that if $\{C_n\}_{n=1}^\infty$ is the betting strategy "buy at or below $a$ and sell at or above $b$," then $U_N^X(a,b)$ counts the completed buy-low-sell-high round trips contributing to $[C\cdot\Delta X]_N$. In more detail, $C_n = 1$ iff $X_j \le a$ for some $0 \le j < n$ and, following $\{X_k\}_{k=0}^{n-1}$ backwards, we have $X_{n-k} < b$ for $k \ge 1$ until the first time that $X_k \le a$. In any other scenario, $C_n = 0$.

Lemma 12.43. Suppose $X = \{X_n\}_{n=0}^\infty$ is a sequence of extended real numbers such that $U_\infty^X(a,b) < \infty$ for all $a, b \in \mathbb{Q}$ with $a < b$. Then $X_\infty := \lim_{n\to\infty} X_n$ exists in $\bar{\mathbb{R}}$.
Proof. If $\lim_{n\to\infty} X_n$ does not exist in $\bar{\mathbb{R}}$, then there would exist $a, b \in \mathbb{Q}$ such that
$$\liminf_{n\to\infty} X_n < a < b < \limsup_{n\to\infty} X_n,$$
and for this choice of $a$ and $b$ we must have $X_n < a$ and $X_n > b$ infinitely often. Therefore $U_\infty^X(a,b) = \infty$.
Theorem 12.44 (Doob's Upcrossing Inequality). If $\{X_n\}_{n=0}^\infty$ is a martingale and $-\infty < a < b < \infty$, then for all $N \in \mathbb{N}$,
$$\mathbb{E}\left[U_N^X(a,b)\right] \le \frac{1}{b-a}\mathbb{E}\left[(X_N - a)^+\right] \le \frac{1}{b-a}\mathbb{E}|X_N - a|. \quad (12.30)$$

Proof. Let $\{C_k\}_{k=1}^\infty$ be the "buy at or below $a$ and sell at or above $b$" strategy explained after Definition 12.42. Our net winnings from this strategy up to time $N$, namely $[C\cdot\Delta X]_N$, satisfy
$$[C\cdot\Delta X]_N \ge (b-a)U_N^X(a,b) - [a - X_N]^+. \quad (12.31)$$
In words, our net gain in buying at or below $a$ and selling at or above $b$ is at least $(b-a)$ times the number of times we buy low and sell high, less a possible penalty for holding the stock below $a$ at the end of the day. From Proposition 12.25 we know that $\mathbb{E}[C\cdot\Delta X]_N = 0$ for all $N$, and therefore taking expectations of Eq. (12.31) gives Eq. (12.30).
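Counting up-crossings is easy to mechanize. The following sketch (illustrative only, not from the notes) counts $U_N(a,b)$ for a finite path and then checks the upcrossing inequality (12.30) empirically for a simple random walk, which is a martingale.

```python
import random

def upcrossings(path, a, b):
    """Count the up-crossings of [a, b] made by a finite path (Definition 12.42)."""
    count, below = 0, False
    for x in path:
        if x <= a:
            below = True            # a "buy" level has been touched
        elif below and x >= b:
            count += 1              # completed an up-crossing; "sell"
            below = False
    return count

assert upcrossings([0, 2, 0, 2, 0, 2], a=0, b=2) == 3
assert upcrossings([1, 1, 1], a=0, b=2) == 0

# Monte Carlo check of Eq. (12.30) for a simple random walk:
rng = random.Random(0)
N, a, b, trials = 50, -3, 3, 4000
u_sum = plus_sum = 0.0
for _ in range(trials):
    s, path = 0, [0]
    for _ in range(N):
        s += 1 if rng.random() < 0.5 else -1
        path.append(s)
    u_sum += upcrossings(path, a, b)
    plus_sum += max(path[-1] - a, 0)          # (X_N - a)^+
assert u_sum / trials <= plus_sum / ((b - a) * trials)
```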
Theorem 12.45 (Martingale Convergence Theorem). If $\{X_n\}_{n=0}^\infty$ is a martingale such that $C := \sup_n \mathbb{E}|X_n| < \infty$, then $X_\infty = \lim_{n\to\infty} X_n$ exists a.s. and $\mathbb{E}|X_\infty| \le C$.
Proof. Sketch. If $-\infty < a < b < \infty$, it follows from Eq. (12.30) and Fatou's lemma or the monotone convergence theorem (see Section 1.1) that
$$\mathbb{E}\left[U_\infty^X(a,b)\right] = \mathbb{E}\left[\lim_{N\to\infty} U_N^X(a,b)\right] \le \liminf_{N\to\infty}\mathbb{E}\left[U_N^X(a,b)\right] \le \frac{C + |a|}{b-a} < \infty.$$
From this we learn that $P\left(U_\infty^X(a,b) = \infty\right) = 0$ for all $a, b \in \mathbb{Q}$ with $-\infty < a < b < \infty$ and hence
$$P\left(U_\infty^X(a,b) < \infty \text{ for all } a, b \in \mathbb{Q} \text{ with } a < b\right) = 1.$$
From this and Lemma 12.43 it follows that $\lim_{n\to\infty} X_n = X_\infty$ exists a.s. as a random variable with values in $\mathbb{R}\cup\{\pm\infty\}$. Another application of Fatou's lemma then shows
$$\mathbb{E}|X_\infty| = \mathbb{E}\lim_{n\to\infty}|X_n| \le \liminf_{n\to\infty}\mathbb{E}|X_n| \le C < \infty.$$
Corollary 12.46 (Submartingale convergence theorem). If $\{X_n\}_{n=0}^\infty$ is a submartingale such that $C := \sup_n \mathbb{E}|X_n| < \infty$, then $X_\infty = \lim_{n\to\infty} X_n$ exists a.s. and $\mathbb{E}|X_\infty| \le C$.

Proof. By Doob's decomposition (Lemma 12.8), we have $X_n = M_n + A_n$ where $\{M_n\}_{n=0}^\infty$ is a martingale and
$$A_n = \sum_{k=1}^n \mathbb{E}[\Delta_k X|\mathcal{F}_{k-1}].$$
As $\{X_n\}_{n=0}^\infty$ is a submartingale we know $\mathbb{E}[\Delta_k X|\mathcal{F}_{k-1}] \ge 0$, and hence $A_n \ge 0$ for all $n$ and $A_n$ is increasing (i.e. non-decreasing) in $n$. Furthermore, by the monotone convergence theorem,
$$\mathbb{E}A_\infty = \lim_{n\to\infty}\mathbb{E}A_n = \lim_{n\to\infty}\sum_{k=1}^n \mathbb{E}\,\mathbb{E}[\Delta_k X|\mathcal{F}_{k-1}] = \lim_{n\to\infty}\sum_{k=1}^n \mathbb{E}[\Delta_k X] = \lim_{n\to\infty}\mathbb{E}[X_n - X_0] \le 2C.$$
Thus we may conclude for all $n \in \mathbb{N}_0$ that
$$\mathbb{E}|M_n| = \mathbb{E}|X_n - A_n| \le \mathbb{E}|X_n| + \mathbb{E}A_n \le \mathbb{E}|X_n| + \mathbb{E}A_\infty \le 3C < \infty.$$
Therefore $\lim_{n\to\infty} M_n = M_\infty$ exists by Theorem 12.45 and hence
$$\lim_{n\to\infty} X_n = \lim_{n\to\infty}[M_n + A_n] = M_\infty + A_\infty$$
exists as well.

Here is the theorem (stated without proof⁴) summarizing the key facts about uniformly integrable martingales.

Theorem 12.47 (Uniformly integrable martingales). If $\{X_n\}_{n=0}^\infty$ is a U.I. martingale then:

1. $X_\infty = \lim_{n\to\infty} X_n$ exists ($P$ – a.s.) (see Theorem 12.45 above) and $\lim_{n\to\infty}\mathbb{E}|X_n - X_\infty| = 0$, i.e. $X_n \to X_\infty$ in $L^1(P)$. [This in particular implies that $\lim_{n\to\infty}\mathbb{E}X_n = \mathbb{E}X_\infty$.]
2. $\{X_n\}_{n=0}^\infty$ is a regular martingale; in fact $X_n = \mathbb{E}[X_\infty|\mathcal{F}_n]$ for all $n$.
3. For any stopping time $\tau$ (no need to assume $P(\tau = \infty) = 0$) we have $\mathbb{E}|X_\tau| < \infty$ and $\mathbb{E}X_\tau = \mathbb{E}X_0$. [This even holds if $\tau = \infty$ a.s.] [This result extends Corollary 12.39.]
Example 12.48 (Random Sums). Suppose that $\{X_k\}_{k=1}^\infty$ are independent random variables such that $\mathbb{E}X_k = 0$. If we further know that $K := \sum_{k=1}^\infty \mathbb{E}X_k^2 < \infty$, then $\sum_{k=1}^\infty X_k$ exists a.s., i.e.
$$P\left(\sum_{k=1}^\infty X_k \text{ converges in } \mathbb{R}\right) = 1.$$
To prove this, note that $S_0 = 0$ and $S_n = \sum_{k=1}^n X_k$ is a martingale. Moreover, using the fact that variances of sums of independent random variables add, we find
$$\mathbb{E}S_n^2 = \mathrm{Var}(S_n) = \sum_{k=1}^n \mathrm{Var}(X_k) = \sum_{k=1}^n \mathbb{E}X_k^2 \le K < \infty.$$
From Proposition 12.41 we conclude that $\{S_n\}_{n=0}^\infty$ is a uniformly integrable martingale and in particular $\{S_n\}_{n=0}^\infty$ is $L^1$ – bounded.⁵ Therefore the result now follows from the martingale convergence theorem, Theorem 12.45. Moreover, making use of Theorem 12.47, we may conclude that
$$\mathbb{E}[S_\infty] = \mathbb{E}\left[\sum_{k=1}^\infty X_k\right] = \mathbb{E}S_0 = 0,$$
and more generally, if $\tau$ is any stopping time, we have $\mathbb{E}[S_\tau] = \mathbb{E}[S_0] = 0$.

For a more explicit example, if $X_k = \frac{1}{k}Z_k$ where $\{Z_k\}_{k=1}^\infty$ are i.i.d. with $P(Z_k = \pm 1) = \frac{1}{2}$, then $\sum_{k=1}^\infty \mathbb{E}X_k^2 = \sum_{k=1}^\infty \frac{1}{k^2} < \infty$ and therefore the random harmonic series $\sum_{k=1}^\infty \frac{1}{k}Z_k$ is almost surely convergent. However, notice that
⁴ Although the proof is not too hard to give using the material presented in these notes.
⁵ Alternatively, by Jensen's inequality, $[\mathbb{E}|S_n|]^2 \le \mathbb{E}S_n^2 \le K$, which again shows $\{S_n\}$ is $L^1$ – bounded.
$$\sum_{k=1}^\infty \left|\frac{1}{k}Z_k\right| = \sum_{k=1}^\infty \frac{1}{k} = \infty,$$
i.e. the series $\sum_{k=1}^\infty \frac{1}{k}Z_k$ is never absolutely convergent. For the full generalization of this result look up Kolmogorov's three series theorem.
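A quick simulation sketch of the random harmonic series (illustrative only): since the tail variance $\sum_{k>n} 1/k^2 \approx 1/n$, late partial sums along a sample path should be nearly constant, which is the Cauchy behavior behind the a.s. convergence.

```python
import random

def random_harmonic_partial_sums(n_terms, rng):
    """Partial sums S_n = sum_{k<=n} Z_k / k with Z_k = +-1 fair signs."""
    s, out = 0.0, []
    for k in range(1, n_terms + 1):
        s += (1 if rng.random() < 0.5 else -1) / k
        out.append(s)
    return out

rng = random.Random(7)
sums = random_harmonic_partial_sums(200_000, rng)
# The fluctuation after n terms has standard deviation ~ 1/sqrt(n),
# so the partial sums at 100k and 200k terms should be very close:
assert abs(sums[-1] - sums[100_000 - 1]) < 0.05
```

Replacing `1 / k` by the non-random `1 / k` summed with all plus signs would of course diverge, which is the absolute-convergence failure noted above.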
12.6 Submartingale Maximal Inequalities
Notation 12.49 (Running Maximum) If $X = \{X_n\}_{n=0}^\infty$ is a sequence of (extended) real numbers, we let
$$X_N^* := \max\{X_0, \dots, X_N\}. \quad (12.32)$$

Proposition 12.50 (Submartingale Maximal Inequalities). Let $\{X_n\}$ be a submartingale on a filtered probability space, $(\Omega, \{\mathcal{F}_n\}_{n=0}^\infty, P)$. Then for any $a \ge 0$ and $N \in \mathbb{N}_0$,
$$aP(X_N^* \ge a) \le \mathbb{E}[X_N : X_N^* \ge a] \le \mathbb{E}\left[X_N^+\right]. \quad (12.33)$$
Proof. First proof. The event $A := \{X_N^* \ge a\}$ decomposes as the disjoint union of the events $A_0 := \{X_0 \ge a\} \in \mathcal{F}_0$ and
$$A_k := \{X_0 < a, \dots, X_{k-1} < a, X_k \ge a\} \in \mathcal{F}_k \text{ for } 1 \le k \le N.$$
Hence we learn
$$\mathbb{E}[X_N : X_N^* \ge a] = \mathbb{E}[X_N : A] = \sum_{k=0}^N \mathbb{E}[X_N : A_k] = \sum_{k=0}^N \mathbb{E}\left[\mathbb{E}[X_N|\mathcal{F}_k] : A_k\right] \ge \sum_{k=0}^N \mathbb{E}[X_k : A_k] \ge \sum_{k=0}^N \mathbb{E}[a : A_k] = \sum_{k=0}^N a\,P(A_k) = a\,P(A) = aP(X_N^* \ge a),$$
where the first inequality is a consequence of $\{X_n\}$ being a sub-martingale and the second comes from the fact that $X_k \ge a$ on $A_k$.
Second proof. Let $\tau := \inf\{n : X_n \ge a\}$ and observe that
$$X_N^* \ge X_\tau \ge a \text{ on } \{\tau \le N\} = \{X_N^* \ge a\}. \quad (12.34)$$
Therefore
$$aP(X_N^* \ge a) = \mathbb{E}[a : X_N^* \ge a] = \mathbb{E}[a : \tau \le N] \le \mathbb{E}[X_\tau : \tau \le N] = \sum_{k=0}^N \mathbb{E}[X_k : \tau = k] \le \sum_{k=0}^N \mathbb{E}[X_N : \tau = k] = \mathbb{E}[X_N : \tau \le N] = \mathbb{E}[X_N : X_N^* \ge a] \le \mathbb{E}\left[X_N^+ : X_N^* \ge a\right] \le \mathbb{E}\left[X_N^+\right],$$
wherein we have used $\{\tau = k\} \in \mathcal{F}_k$ and that $\{X_k\}$ is a submartingale in the second inequality. So again we have shown Eq. (12.33) holds.
Corollary 12.51. If $\{M_n\}_{n=0}^\infty$ is a martingale and $M_N^* := \max\{M_0, \dots, M_N\}$, then for $a > 0$,
$$P\left(|M|_N^* \ge a\right) \le \frac{1}{a^p}\mathbb{E}|M_N|^p \text{ for all } p \ge 1 \quad (12.35)$$
and
$$P(M_N^* \ge a) \le \frac{1}{e^{\lambda a}}\mathbb{E}\left[e^{\lambda M_N}\right] \text{ for all } \lambda \ge 0. \quad (12.36)$$
[This corollary has fairly obvious generalizations to other convex functions, $\varphi$, than $\varphi(x) = |x|^p$ and $\varphi(x) = e^{\lambda x}$.]

Proof. By the conditional Jensen's inequality, it follows that $X_n := |M_n|^p$ is a submartingale, and so Eq. (12.35) follows from Eq. (12.33) with $a$ replaced by $a^p$. Again by the conditional Jensen's inequality, $\left\{X_n := e^{\lambda M_n}\right\}_{n=0}^\infty$ is a submartingale. Since $\lambda \ge 0$, the map $x \to e^{\lambda x}$ is non-decreasing and hence
$$\{M_N^* \ge a\} = \left\{X_N^* \ge e^{\lambda a}\right\},$$
and so Eq. (12.36) follows from Eq. (12.33) with $a$ replaced by $e^{\lambda a}$.
As an example of the above result we have the following corollary.

Corollary 12.52. Let $\{X_n\}$ be a sequence of Bernoulli random variables with $P(X_n = \pm 1) = \frac{1}{2}$ and let $S_0 = 0$, $S_n := X_1 + \dots + X_n$ for $n \in \mathbb{N}$. Then
$$\lim_{N\to\infty}\frac{|S|_N^*}{\sqrt{N}\ln N} := \lim_{N\to\infty}\frac{\max\{|S_0|, \dots, |S_N|\}}{\sqrt{N}\ln N} = 0 \text{ a.s.} \quad (12.37)$$
[This result actually holds for any sequence $\{X_n\}$ of i.i.d. random variables such that $\mathbb{E}X_n = 0$ and $\mathrm{Var}(X_n) = \mathbb{E}X_n^2 < \infty$.]
Proof. By Eq. (12.36), if $a, \lambda > 0$ then
$$P(S_N^* \ge a) \le \frac{1}{e^{\lambda a}}\mathbb{E}\left[e^{\lambda S_N}\right] = \frac{1}{e^{\lambda a}}\mathbb{E}\prod_{j=1}^N e^{\lambda X_j} = e^{-\lambda a}\prod_{j=1}^N \mathbb{E}\left[e^{\lambda X_j}\right] = e^{-\lambda a}\left[\cosh(\lambda)\right]^N \quad (12.38)$$
wherein we have used
$$\mathbb{E}\left[e^{\lambda X_j}\right] = \frac{1}{2}\left[e^\lambda + e^{-\lambda}\right] = \cosh(\lambda).$$
Making the substitutions $a \to a\sqrt{N}\ln N$ and $\lambda \to \lambda/\sqrt{N}$ in Eq. (12.38) shows
$$P\left(S_N^* \ge a\sqrt{N}\ln N\right) \le e^{-\lambda a\ln N}\left[\cosh\left(\frac{\lambda}{\sqrt{N}}\right)\right]^N = \frac{1}{N^{\lambda a}}\left[\cosh\left(\frac{\lambda}{\sqrt{N}}\right)\right]^N.$$
Given $a > 0$ we now choose $\lambda$ so that $\lambda a > 1$, in which case it follows that
$$\sum_{N=1}^\infty P\left(S_N^* \ge a\sqrt{N}\ln N\right) \le \sum_{N=1}^\infty \frac{1}{N^{\lambda a}}\left[\cosh\left(\frac{\lambda}{\sqrt{N}}\right)\right]^N < \infty,$$
wherein we have used
$$\lim_{N\to\infty}\left[\cosh\left(\frac{\lambda}{\sqrt{N}}\right)\right]^N = \lim_{N\to\infty}\left(1 + \frac{\lambda^2}{2N} + O\left(\frac{1}{N^2}\right)\right)^N = e^{\lambda^2/2}.$$
From this inequality and the Borel–Cantelli lemma we learn
$$P\left(S_N^* \ge a\sqrt{N}\ln N \text{ i.o. } N\right) = 0$$
and so with probability one we conclude $\frac{S_N^*}{\sqrt{N}\ln N} < a$ for almost all $N$. Since $a$ was arbitrary it follows that
$$\limsup_{N\to\infty}\frac{S_N^*}{\sqrt{N}\ln N} = 0 \text{ a.s.}$$
By replacing $X_n$ by $-X_n$ for all $n$ we may further conclude that
$$\limsup_{N\to\infty}\frac{(-S)_N^*}{\sqrt{N}\ln N} = 0 \text{ a.s.,}$$
and together these two statements prove Eq. (12.37) since $|S|_N^* = \max\left\{S_N^*, (-S)_N^*\right\}$.
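Corollary 12.52 can also be eyeballed numerically. The sketch below (parameters are our own illustrative choices) compares the average of $|S|_N^*/(\sqrt{N}\ln N)$ over many independent walks at two values of $N$; the normalized maximum shrinks as $N$ grows, consistent with Eq. (12.37).

```python
import math
import random

def mean_ratio(N, n_paths, seed):
    """Average of |S|*_N / (sqrt(N) ln N) over independent simple random walks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        s, m = 0, 0
        for _ in range(N):
            s += 1 if rng.random() < 0.5 else -1
            m = max(m, abs(s))                  # running maximum |S|*_n
        total += m / (math.sqrt(N) * math.log(N))
    return total / n_paths

r_small = mean_ratio(100, 300, seed=5)
r_large = mean_ratio(10_000, 300, seed=5)
assert r_large < r_small    # the normalized running maximum decays with N
```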
Remark 12.53. The above corollary is not a very good example, as the results follow just as easily without the use of Doob's maximal inequality. Indeed, the proof above goes through the same with $S_N^*$ replaced by $S_N$ everywhere, in which case we would conclude that $\lim_{N\to\infty}\frac{S_N}{\sqrt{N}\ln N} = 0$, or equivalently that for all $\varepsilon > 0$, $|S_N| \le \varepsilon\sqrt{N}\ln N$ for all $N \ge N_\varepsilon$, where $N_\varepsilon$ is a finite random variable. For $N \ge N_\varepsilon$ we have
$$|S|_N^* \le \left(\varepsilon\sqrt{N}\ln N\right)\vee|S|_{N_\varepsilon}^* \implies |S|_N^* \le \varepsilon\sqrt{N}\ln N \text{ for almost all } N.$$
This then shows that
$$P\left(\limsup_{N\to\infty}\frac{|S|_N^*}{\sqrt{N}\ln N} \le \varepsilon\right) = 1,$$
and as $\varepsilon > 0$ was arbitrary we may conclude that
$$P\left(\limsup_{N\to\infty}\frac{|S|_N^*}{\sqrt{N}\ln N} = 0\right) = 1.$$
12.6.1 *$L^p$ – inequalities

Lemma 12.54. Suppose that $X$ and $Y$ are two non-negative random variables such that $P(Y \ge y) \le \frac{1}{y}\mathbb{E}[X : Y \ge y]$ for all $y > 0$. Then for all $p \in (1,\infty)$,
$$\mathbb{E}Y^p \le \left(\frac{p}{p-1}\right)^p\mathbb{E}X^p. \quad (12.39)$$
Proof. We will begin by proving Eq. (12.39) under the additional assumption that $Y \in L^p(\Omega, \mathcal{B}, P)$. Since
$$\mathbb{E}Y^p = p\,\mathbb{E}\int_0^\infty 1_{y\le Y}\,y^{p-1}dy = p\int_0^\infty \mathbb{E}[1_{y\le Y}]\,y^{p-1}dy = p\int_0^\infty P(Y \ge y)\,y^{p-1}dy \le p\int_0^\infty \frac{1}{y}\mathbb{E}[X : Y \ge y]\,y^{p-1}dy = p\,\mathbb{E}\int_0^\infty X 1_{y\le Y}\,y^{p-2}dy = \frac{p}{p-1}\mathbb{E}\left[XY^{p-1}\right].$$
Now apply Hölder's inequality, with $q = p(p-1)^{-1}$, to find
$$\mathbb{E}\left[XY^{p-1}\right] \le \|X\|_p\cdot\left\|Y^{p-1}\right\|_q = \|X\|_p\cdot[\mathbb{E}|Y|^p]^{1/q}.$$
Combining these two inequalities and solving for $\|Y\|_p$ shows $\|Y\|_p \le \frac{p}{p-1}\|X\|_p$, which proves Eq. (12.39) under the additional restriction that $Y$ be in $L^p(\Omega, \mathcal{B}, P)$.
To remove the integrability restriction on $Y$, for $M > 0$ let $Z := Y\wedge M$ and observe that
$$P(Z \ge y) = P(Y \ge y) \le \frac{1}{y}\mathbb{E}[X : Y \ge y] = \frac{1}{y}\mathbb{E}[X : Z \ge y] \text{ if } y \le M,$$
while
$$P(Z \ge y) = 0 = \frac{1}{y}\mathbb{E}[X : Z \ge y] \text{ if } y > M.$$
Since $Z$ is bounded, the special case just proved shows
$$\mathbb{E}[(Y\wedge M)^p] = \mathbb{E}Z^p \le \left(\frac{p}{p-1}\right)^p\mathbb{E}X^p.$$
We may now use the MCT to pass to the limit, $M \uparrow \infty$, and hence conclude that Eq. (12.39) holds in general.
Corollary 12.55 (Doob's Inequality). If $X = \{X_n\}_{n=0}^\infty$ is a non-negative submartingale and $1 < p < \infty$, then
$$\mathbb{E}X_N^{*p} \le \left(\frac{p}{p-1}\right)^p\mathbb{E}X_N^p. \quad (12.40)$$

Proof. Equation (12.40) follows by applying Lemma 12.54 with the aid of Proposition 12.50.
Corollary 12.56 (Doob's Inequality). If $\{M_n\}_{n=0}^\infty$ is a martingale and $1 < p < \infty$, then for all $a > 0$ we have
$$P\left(|M|_N^* \ge a\right) \le \frac{1}{a}\mathbb{E}\left[|M_N| : |M|_N^* \ge a\right] \le \frac{1}{a}\mathbb{E}[|M_N|] \quad (12.41)$$
and
$$\mathbb{E}|M|_N^{*p} \le \left(\frac{p}{p-1}\right)^p\mathbb{E}|M_N|^p. \quad (12.42)$$

Proof. By the conditional Jensen's inequality (Theorem 12.18), it follows that $X_n := |M_n|$ is a submartingale. Hence Eq. (12.41) follows from Eq. (12.33), and Eq. (12.42) follows from Eq. (12.40).
Example 12.57. Let $\{X_n\}$ be a sequence of independent integrable random variables with mean zero, $S_0 = 0$, $S_n := X_1 + \dots + X_n$ for $n \in \mathbb{N}$, and $|S|_n^* = \max_{j\le n}|S_j|$. Since $\{S_n\}_{n=0}^\infty$ is a martingale, by the conditional Jensen's inequality (Theorem 12.18), $\left\{|S_n|^p\right\}_{n=1}^\infty$ is a (possibly extended) submartingale for any $p \in [1,\infty)$. Therefore an application of Eq. (12.33) of Proposition 12.50 shows
$$P\left(|S|_N^* \ge \alpha\right) = P\left(|S|_N^{*p} \ge \alpha^p\right) \le \frac{1}{\alpha^p}\mathbb{E}\left[|S_N|^p : |S|_N^* \ge \alpha\right].$$
(When $p = 2$, this is Kolmogorov's inequality.) From Corollary 12.56 we also know that
$$\mathbb{E}|S|_N^{*p} \le \left(\frac{p}{p-1}\right)^p\mathbb{E}|S_N|^p.$$
In particular when $p = 2$, this inequality becomes
$$\mathbb{E}|S|_N^{*2} \le 4\,\mathbb{E}|S_N|^2 = 4\sum_{n=1}^N \mathbb{E}|X_n|^2.$$
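The $p = 2$ case is easy to test by simulation. The sketch below (illustrative only; the constant 4 is typically far from tight) estimates $\mathbb{E}|S|_N^{*2}$ and $\mathbb{E}S_N^2$ for a simple random walk, where $\mathbb{E}S_N^2 = N$.

```python
import random

rng = random.Random(11)
N, trials = 100, 4000
max_sq_sum = end_sq_sum = 0.0
for _ in range(trials):
    s, m = 0, 0
    for _ in range(N):
        s += 1 if rng.random() < 0.5 else -1
        m = max(m, abs(s))           # running maximum |S|*_n
    max_sq_sum += m * m
    end_sq_sum += s * s

# E S_N^2 = N here, and Doob's L^2 inequality gives E (|S|*_N)^2 <= 4 N.
assert abs(end_sq_sum / trials - N) < 10
assert max_sq_sum / trials <= 4 * N
```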
12.7 Martingale Exercises
(The next four problems were taken directly from http://math.nyu.edu/~sheff/martingalenote.pdf.)
Exercise 12.7. Suppose Harriet has 7 dollars. Her plan is to make one dollarbets on fair coin tosses until her wealth reaches either 0 or 50, and then to gohome. What is the expected amount of money that Harriet will have when shegoes home? What is the probability that she will have 50 when she goes home?
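A Monte Carlo sketch (ours, not from the notes) can be used to check one's answer to Exercise 12.7; the assertions encode only crude confidence bands around the simulated values, not the exact answers.

```python
import random

def harriet_game(rng, start=7, target=50):
    """One play of Exercise 12.7: $1 fair-coin bets until wealth hits 0 or target."""
    w = start
    while 0 < w < target:
        w += 1 if rng.random() < 0.5 else -1
    return w

rng = random.Random(2015)
finals = [harriet_game(rng) for _ in range(2000)]
mean_final = sum(finals) / len(finals)
p_win = sum(1 for w in finals if w == 50) / len(finals)
# Optional stopping suggests the average final wealth should match the $7 start.
assert 5.0 < mean_final < 9.0
assert 0.10 < p_win < 0.18
```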
Exercise 12.8. Consider a contract that at time N will be worth either 100 or0. Let Sn be its price at time 0 ≤ n ≤ N . If Sn is a martingale, and S0 = 47,then what is the probability that the contract will be worth 100 at time N?
Exercise 12.9. Pedro plans to buy the contract in the previous problem attime 0 and sell it the first time T at which the price goes above 55 or below 15.What is the expected value of ST ? You may assume that the value, Sn, of thecontract is bounded – there is only a finite amount of money in the world upto time N. Also note, by assumption, T ≤ N.
Exercise 12.10. Suppose $S_N$ is with probability one either 100 or 0 and that $S_0 = 50$. Suppose further there is at least a 60% probability that the price will at some point dip below 40 and then subsequently rise above 60 before time $N$. Prove that $S_n$ cannot be a martingale. (I don't know if this problem is correct! But if we modify the 40 to a 30, the buy low sell high strategy will show that $S_n$ is not a martingale.)
12.7.1 More Random Walk Exercises
For the next four exercises, let $\{Z_n\}_{n=1}^\infty$ be a sequence of Bernoulli random variables with $P(Z_n = \pm 1) = \frac{1}{2}$ and let $S_0 = 0$ and $S_n := Z_1 + \dots + Z_n$. Then $S$ becomes a martingale relative to the filtration $\mathcal{F}_n := \sigma(Z_1, \dots, Z_n)$ with $\mathcal{F}_0 := \{\emptyset, \Omega\}$ – of course $S_n$ is the (fair) simple random walk on $\mathbb{Z}$. For any $a \in \mathbb{Z}$, let
$$\sigma_a := \inf\{n : S_n = a\}.$$
Exercise 12.11. For $a < 0 < b$ with $a, b \in \mathbb{Z}$, let $\tau = \sigma_a\wedge\sigma_b$. Explain why $\{S_{\tau\wedge n}\}_{n=0}^\infty$ is a bounded martingale and use this to show $P(\tau = \infty) = 0$. Hint: make use of the fact that $|S_n - S_{n-1}| = |Z_n| = 1$ for all $n$, and hence the only way $\lim_{n\to\infty} S_{\tau\wedge n}$ can exist is if it stops moving!
Exercise 12.12. In this exercise, you are asked to use the central limit theorem to prove again that $P(\tau = \infty) = 0$ as in Exercise 12.11. Hints: Use the central limit theorem to show
$$\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} f(x)e^{-x^2/2}dx \ge f(0)P(\tau = \infty) \quad (12.43)$$
for all $f \in C^3(\mathbb{R}\to[0,\infty))$ with $M := \sup_{x\in\mathbb{R}}\left|f^{(3)}(x)\right| < \infty$. Use this inequality to conclude that $P(\tau = \infty) = 0$.
Exercise 12.13. Show
$$P(\sigma_b < \sigma_a) = \frac{|a|}{b + |a|} \quad (12.44)$$
and use this to conclude $P(\sigma_b < \infty) = 1$, i.e. every $b \in \mathbb{N}$ is almost surely visited by $S_n$.

Hint: As in Exercise 12.11, notice that $\{S_{\tau\wedge n}\}_{n=0}^\infty$ is a bounded martingale, where $\tau := \sigma_a\wedge\sigma_b$. Now compute $\mathbb{E}[S_\tau] = \lim_{n\to\infty}\mathbb{E}[S_{\tau\wedge n}]$ in two different ways.
Exercise 12.14. Let $\tau := \sigma_a\wedge\sigma_b$. In this problem you are asked to show $\mathbb{E}[\tau] = |a|\,b$ with the aid of the following outline.

1. Use Exercise 12.5 above to conclude that $N_n := S_n^2 - n$ is a martingale.
2. Now show
$$0 = \mathbb{E}N_0 = \mathbb{E}N_{\tau\wedge n} = \mathbb{E}S_{\tau\wedge n}^2 - \mathbb{E}[\tau\wedge n]. \quad (12.45)$$
3. Now use DCT and MCT along with Exercise 12.13 to compute the limit as $n\to\infty$ in Eq. (12.45) to find
$$\mathbb{E}[\sigma_a\wedge\sigma_b] = \mathbb{E}[\tau] = b\,|a|. \quad (12.46)$$
4. By considering the limit $a\to-\infty$ in Eq. (12.46), show $\mathbb{E}[\sigma_b] = \infty$.
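The identity in Eq. (12.46) is easy to check by simulation (an illustrative sketch with our own choice of $a$ and $b$):

```python
import random

rng = random.Random(9)
a, b, trials = -3, 5, 3000
total_steps = 0
for _ in range(trials):
    s, t = 0, 0
    while a < s < b:                        # run until the walk hits a or b
        s += 1 if rng.random() < 0.5 else -1
        t += 1
    total_steps += t
mean_tau = total_steps / trials
# Exercise 12.14 predicts E[tau] = |a| * b = 15 for this choice of a and b.
assert 13 < mean_tau < 17
```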
For the next group of exercises we now suppose that $P(Z_n = 1) = p > \frac{1}{2}$ and $P(Z_n = -1) = q = 1 - p < \frac{1}{2}$. As before let $\mathcal{F}_n = \sigma(Z_1, \dots, Z_n)$, $S_0 = 0$ and $S_n = Z_1 + \dots + Z_n$ for $n \in \mathbb{N}$. Let us review the method above and what you did in Exercise 6.4 above.

In order to follow the procedures above, we start by looking for a function, $\varphi$, such that $\varphi(S_n)$ is a martingale. Such a function must satisfy
$$\varphi(S_n) = \mathbb{E}_{\mathcal{F}_n}\varphi(S_{n+1}) = \varphi(S_n + 1)p + \varphi(S_n - 1)q,$$
and this then leads us to try to solve the following difference equation for $\varphi$:
$$\varphi(x) = p\varphi(x+1) + q\varphi(x-1) \text{ for all } x \in \mathbb{Z}. \quad (12.47)$$
Similar to the theory of second order ODEs, this equation has two linearly independent solutions, which could be found by solving Eq. (12.47) with initial conditions $\varphi(0) = 1$, $\varphi(1) = 0$ and then with $\varphi(0) = 0$, $\varphi(1) = 1$, for example. Rather than doing this, motivated by second order constant coefficient ODEs, let us try to find solutions of the form $\varphi(x) = \lambda^x$ with $\lambda$ to be determined. Doing so leads to the equation $\lambda^x = p\lambda^{x+1} + q\lambda^{x-1}$, or equivalently to the characteristic equation
$$p\lambda^2 - \lambda + q = 0.$$
The solutions to this equation are
$$\lambda = \frac{1\pm\sqrt{1-4pq}}{2p} = \frac{1\pm\sqrt{1-4p(1-p)}}{2p} = \frac{1\pm\sqrt{4p^2-4p+1}}{2p} = \frac{1\pm\sqrt{(2p-1)^2}}{2p} = \{1, (1-p)/p\} = \{1, q/p\}.$$
The most general solution to Eq. (12.47) is then given by
$$\varphi(x) = A + B(q/p)^x.$$
Below we will take $A = 0$ and $B = 1$. As before, let $\sigma_a = \inf\{n \ge 0 : S_n = a\}$.
Exercise 12.15. Let $a < 0 < b$ and $\tau := \sigma_a\wedge\sigma_b$.

1. Apply the method in Exercise 12.11 with $S_n$ replaced by $M_n := (q/p)^{S_n}$ to show $P(\tau = \infty) = 0$. [Recall that $\{M_n\}_{n=1}^\infty$ is a martingale as explained in Example 12.11.]
2. Now use the method in Exercise 12.13 to show
$$P(\sigma_a < \sigma_b) = \frac{(q/p)^b - 1}{(q/p)^b - (q/p)^a}. \quad (12.48)$$
3. By letting $a\to-\infty$ in Eq. (12.48), conclude $P(\sigma_b = \infty) = 0$.
4. By letting $b\to\infty$ in Eq. (12.48), conclude $P(\sigma_a < \infty) = (q/p)^{|a|}$.
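Formula (12.48) can be checked by simulation; in the sketch below (our own illustrative parameters) the empirical hitting frequency of $a$ before $b$ is compared with the closed form.

```python
import random

p, a, b = 0.6, -5, 5
r = (1 - p) / p                                  # q/p
theory = (r ** b - 1) / (r ** b - r ** a)        # Eq. (12.48)

rng = random.Random(4)
trials, hits_a = 5000, 0
for _ in range(trials):
    s = 0
    while a < s < b:                             # biased walk until absorption
        s += 1 if rng.random() < p else -1
    hits_a += (s == a)
empirical = hits_a / trials
assert abs(empirical - theory) < 0.02
```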
Exercise 12.16. Verify that
$$M_n := S_n - n(p - q)$$
and
$$N_n := M_n^2 - \sigma^2 n$$
are martingales, where $\sigma^2 = 1 - (p-q)^2$. (This should be simple; see either Exercise 12.5 or Exercise 12.6.)
Exercise 12.17. Using Exercise 12.16, show
$$\mathbb{E}(\sigma_a\wedge\sigma_b) = \frac{b\left[1-(q/p)^a\right] + a\left[(q/p)^b - 1\right]}{(q/p)^b - (q/p)^a}(p-q)^{-1}. \quad (12.49)$$
By considering the limit of this equation as $a\to-\infty$, show
$$\mathbb{E}[\sigma_b] = \frac{b}{p-q},$$
and by considering the limit as $b\to\infty$, show $\mathbb{E}[\sigma_a] = \infty$.
12.7.2 More advanced martingale exercises
Exercise 12.18. Let $(M_n)_{n=0}^\infty$ be a martingale with $M_0 = 0$ and $\mathbb{E}[M_n^2] < \infty$ for all $n$. Show that for all $\lambda > 0$,
$$P\left(\max_{1\le m\le n} M_m \ge \lambda\right) \le \frac{\mathbb{E}[M_n^2]}{\mathbb{E}[M_n^2] + \lambda^2}.$$
Hints: First show for any $c > 0$ that $\left\{X_n := (M_n + c)^2\right\}_{n=0}^\infty$ is a submartingale, and then observe
$$\left\{\max_{1\le m\le n} M_m \ge \lambda\right\} \subset \left\{\max_{1\le m\le n} X_m \ge (\lambda + c)^2\right\}.$$
Now use Doob's maximal inequality (Proposition 12.50) to estimate the probability of the last set and then choose $c$ so as to optimize the resulting estimate you get for $P(\max_{1\le m\le n} M_m \ge \lambda)$. (Notice that this result applies to $-M_n$ as well, so it also holds that
$$P\left(\min_{1\le m\le n} M_m \le -\lambda\right) \le \frac{\mathbb{E}[M_n^2]}{\mathbb{E}[M_n^2] + \lambda^2} \text{ for all } \lambda > 0.)$$
Exercise 12.19. Let $\{Z_n\}_{n=1}^\infty$ be independent random variables, $S_0 = 0$, $S_n := Z_1 + \dots + Z_n$, and $f_n(\lambda) := \mathbb{E}\left[e^{i\lambda Z_n}\right]$. Suppose $\mathbb{E}e^{i\lambda S_N} = \prod_{n=1}^N f_n(\lambda)$ converges to a continuous function, $F(\lambda)$, as $N\to\infty$. Show for each $\lambda \in \mathbb{R}$ that
$$P\left(\lim_{n\to\infty} e^{i\lambda S_n} \text{ exists}\right) = 1. \quad (12.50)$$
Hints:

1. Show it is enough to find an $\varepsilon > 0$ such that Eq. (12.50) holds for $|\lambda| \le \varepsilon$.
2. Choose $\varepsilon > 0$ such that $|F(\lambda) - 1| < 1/2$ for $|\lambda| \le \varepsilon$. For $|\lambda| \le \varepsilon$, show $M_n(\lambda) := \frac{e^{i\lambda S_n}}{\mathbb{E}e^{i\lambda S_n}}$ is a bounded complex⁶ martingale relative to the filtration $\mathcal{F}_n = \sigma(Z_1, \dots, Z_n)$.
Lemma 12.58 (Protter [15, see the lemma on p. 22]). Let $\{x_n\}_{n=1}^\infty \subset \mathbb{R}$ be such that $\left\{e^{iux_n}\right\}_{n=1}^\infty$ is convergent for Lebesgue almost every $u \in \mathbb{R}$. Then $\lim_{n\to\infty} x_n$ exists in $\mathbb{R}$.

Proof. Let $U$ be a uniform random variable with values in $[0,1]$. By assumption, for any $t \in \mathbb{R}$, $\lim_{n\to\infty} e^{itUx_n}$ exists a.s. Thus if $\{n_k\}$ and $\{m_k\}$ are any increasing sequences, we have
$$\lim_{k\to\infty} e^{itUx_{n_k}} = \lim_{n\to\infty} e^{itUx_n} = \lim_{k\to\infty} e^{itUx_{m_k}} \text{ a.s.}$$
and therefore,
$$e^{it(Ux_{n_k} - Ux_{m_k})} = \frac{e^{itUx_{n_k}}}{e^{itUx_{m_k}}} \to 1 \text{ a.s. as } k\to\infty.$$
Hence by DCT it follows that
$$\mathbb{E}\left[e^{it(Ux_{n_k} - Ux_{m_k})}\right] \to 1 \text{ as } k\to\infty$$
and therefore
$$(x_{n_k} - x_{m_k})\cdot U = Ux_{n_k} - Ux_{m_k} \to 0$$
in distribution and hence in probability. But this can only happen if $(x_{n_k} - x_{m_k}) \to 0$ as $k\to\infty$. As $\{n_k\}$ and $\{m_k\}$ were arbitrary, this suffices to show $\{x_n\}$ is a Cauchy sequence.
Exercise 12.20 (Continuation of Exercise 12.19 – see Doob [5, Chapter VII.5]). Let $\{Z_n\}_{n=1}^\infty$ be independent random variables. Use Exercise 12.19 and Lemma 12.58 to prove that the series $\sum_{n=1}^\infty Z_n$ converges in $\mathbb{R}$ a.s. iff $\prod_{n=1}^N f_n(\lambda)$ converges to a continuous function, $F(\lambda)$, as $N\to\infty$. Conclude from this that $\sum_{n=1}^\infty Z_n$ is a.s. convergent iff $\sum_{n=1}^\infty Z_n$ is convergent in distribution.

⁶ Please use the obvious generalization of a martingale for complex valued processes. It will be useful to observe that the real and imaginary parts of a complex martingale are real martingales.
13
Some Martingale Examples and Applications
Exercise 13.1. Let $S_n$ be the total assets of an insurance company in year $n \in \mathbb{N}_0$. Assume $S_0 > 0$ is a constant and that for all $n \ge 1$, $S_n = S_{n-1} + \xi_n$, where $\xi_n = c - Z_n$ and $\{Z_n\}_{n=1}^\infty$ are i.i.d. random variables having the normal distribution with mean $\mu < c$ and variance $\sigma^2$, i.e. $Z_n \overset{d}{=} \mu + \sigma N$ where $N$ is a standard normal random variable.¹ Let
$$\tau = \inf\{n : S_n \le 0\} \text{ and } R = \{\tau < \infty\} = \{S_n \le 0 \text{ for some } n\}$$
be the event that the company eventually becomes bankrupt, i.e. is Ruined. Show
$$P(\text{Ruin}) = P(R) \le e^{-2(c-\mu)S_0/\sigma^2}.$$
Outline:

1. Show that $\lambda = -2(c-\mu)/\sigma^2 < 0$ satisfies $\mathbb{E}\left[e^{\lambda\xi_n}\right] = 1$.
2. With this $\lambda$ show
$$Y_n := \exp(\lambda S_n) = e^{\lambda S_0}\prod_{j=1}^n e^{\lambda\xi_j} \quad (13.1)$$
is a non-negative $\mathcal{F}_n = \sigma(Z_1, \dots, Z_n)$ – martingale.
3. Use a martingale convergence theorem to argue that $\lim_{n\to\infty} Y_n = Y_\infty$ exists a.s. and then use Fatou's lemma to show $\mathbb{E}Y_\tau \le e^{\lambda S_0}$.
4. Finally conclude that
$$P(R) \le \mathbb{E}[Y_\tau : \tau < \infty] \le \mathbb{E}Y_\tau \le e^{\lambda S_0} = e^{-2(c-\mu)S_0/\sigma^2}.$$

Observe that by the strong law of large numbers, $\lim_{n\to\infty}\frac{S_n}{n} = \mathbb{E}\xi_1 = c - \mu > 0$ a.s. Thus for large $n$ we have $S_n \sim n(c-\mu) \to \infty$ as $n\to\infty$. The question we have addressed is what happens to $S_n$ for intermediate values – in particular, what is the likelihood that $S_n$ makes a sufficiently "large deviation" from the "typical" value of $n(c-\mu)$ in order for the company to go bankrupt.

¹ The number $c$ is to be interpreted as the yearly premium, $\mu$ represents the mean payout in claims per year, and $\sigma N$ represents the random fluctuations which can occur from year to year.
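The ruin bound of Exercise 13.1 can be compared with simulation. The sketch below (illustrative parameters of our own choosing; the finite horizon is justified by the positive drift $c - \mu$, which makes late ruin negligible) checks that the empirical ruin frequency sits below $e^{\lambda S_0}$.

```python
import math
import random

S0, c, mu, sigma = 2.0, 1.0, 0.5, 1.0     # illustrative parameters with mu < c
lam = -2 * (c - mu) / sigma ** 2          # the lambda with E[exp(lam * xi_n)] = 1
bound = math.exp(lam * S0)                # claimed bound exp(-2(c-mu)S0/sigma^2)

rng = random.Random(13)
trials, horizon, ruined = 5000, 200, 0
for _ in range(trials):
    s = S0
    for _ in range(horizon):
        s += c - rng.gauss(mu, sigma)     # xi_n = c - Z_n
        if s <= 0:                        # ruin
            ruined += 1
            break
empirical = ruined / trials
assert empirical <= bound                 # consistent with P(Ruin) <= exp(lam * S0)
```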
13.1 Aside on Large Deviations
Definition 13.1. A real valued random variable, $Z$, is said to be exponentially integrable if $\mathbb{E}e^{\theta Z} < \infty$ for all $\theta \in \mathbb{R}$.
Theorem 13.2 (Large Deviation Upper Bound). Let, for $n \in \mathbb{N}$, $S_n := Z_1 + \dots + Z_n$, where $\{Z_n\}_{n=1}^\infty$ are i.i.d. exponentially integrable random variables such that $\mathbb{E}Z = 0$ where $Z \overset{d}{=} Z_n$. Then for all $\ell > 0$,
$$P(S_n \ge n\ell) \le e^{-nI(\ell)} \quad (13.2)$$
where
$$I(\ell) = \sup_{\theta\ge 0}(\theta\ell - \psi(\theta)) = \sup_{\theta\in\mathbb{R}}(\theta\ell - \psi(\theta)) \ge 0. \quad (13.3)$$
In particular,
$$\limsup_{n\to\infty}\frac{1}{n}\ln P(S_n \ge n\ell) \le -I(\ell) \text{ for all } \ell > 0. \quad (13.4)$$
Proof. For $\ell > 0$ we have, for any $\theta \ge 0$, that
$$P(S_n \ge n\ell) = P\left(e^{\theta S_n} \ge e^{\theta n\ell}\right) \le e^{-\theta n\ell}\mathbb{E}\left[e^{\theta S_n}\right] = \left(e^{-\theta\ell}\mathbb{E}\left[e^{\theta Z}\right]\right)^n.$$
Let $M(\theta) := \mathbb{E}\left[e^{\theta Z}\right]$ be the moment generating function of $Z$ and $\psi(\theta) = \ln M(\theta)$ be the log – moment generating function. Then we have just shown
$$P(S_n \ge n\ell) \le \exp(-n(\theta\ell - \psi(\theta))) \text{ for all } \theta \ge 0.$$
Minimizing the right side of this inequality over $\theta \ge 0$ gives the upper bound in Eq. (13.2), where $I(\ell)$ is given as in the first equality in Eq. (13.3).

To prove the second equality in Eq. (13.3), we use the fact that $e^{\theta x}$ is a convex function in $x$ for all $\theta \in \mathbb{R}$ and therefore, by Jensen's inequality,
$$M(\theta) = \mathbb{E}\left[e^{\theta Z}\right] \ge e^{\theta\mathbb{E}Z} = e^{\theta\cdot 0} = 1 \text{ for all } \theta \in \mathbb{R}.$$
This then implies that $\psi(\theta) = \ln M(\theta) \ge 0$ for all $\theta \in \mathbb{R}$. In particular, $\theta\ell - \psi(\theta) < 0$ for $\theta < 0$ while $[\theta\ell - \psi(\theta)]|_{\theta=0} = 0$, and therefore
$$\sup_{\theta\in\mathbb{R}}(\theta\ell - \psi(\theta)) = \sup_{\theta\ge 0}(\theta\ell - \psi(\theta)) \ge 0.$$
This completes the proof, as Eq. (13.4) follows easily from Eq. (13.2).
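For fair $\pm 1$ signs the rate function can be computed numerically and the Chernoff bound (13.2) compared with the exact binomial tail; the check below is deterministic, since (13.2) is a genuine inequality. The closed form quoted in the last assertion is the standard rate function for Rademacher variables (a known fact, not derived in the notes).

```python
import math
from math import comb

def rate_function(ell):
    """I(ell) = sup_theta (theta*ell - psi(theta)) for Z = +-1 fair signs,
    where psi(theta) = ln cosh(theta); maximized on a crude grid."""
    thetas = [i / 1000 for i in range(3001)]
    return max(t * ell - math.log(math.cosh(t)) for t in thetas)

n, ell = 100, 0.5
heads_needed = int((n + n * ell) // 2)     # S_n >= n*ell  <=>  at least 75 heads
exact_tail = sum(comb(n, k) for k in range(heads_needed, n + 1)) / 2 ** n
chernoff = math.exp(-n * rate_function(ell))

assert exact_tail <= chernoff              # Eq. (13.2) holds exactly

# Known closed form for the Rademacher rate function:
closed = ((1 + ell) / 2) * math.log(1 + ell) + ((1 - ell) / 2) * math.log(1 - ell)
assert abs(rate_function(ell) - closed) < 1e-4
```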
Theorem 13.3 (Large Deviation Lower Bound). If there is a maximizer, $\theta_0$, for the function $\theta \to \theta\ell - \psi(\theta)$, then
$$\liminf_{n\to\infty}\frac{1}{n}\ln P(S_n \ge n\ell) \ge -I(\ell) = -(\theta_0\ell - \psi(\theta_0)). \quad (13.5)$$

Proof. If there is a maximizer, $\theta_0$, for the function $\theta \to \theta\ell - \psi(\theta)$, then
$$0 = \ell - \psi'(\theta_0) = \ell - \frac{M'(\theta_0)}{M(\theta_0)} = \ell - \frac{\mathbb{E}\left[Ze^{\theta_0 Z}\right]}{M(\theta_0)}.$$
Thus if $W$ is a random variable with law determined by
$$\mathbb{E}[f(W)] = M(\theta_0)^{-1}\mathbb{E}\left[f(Z)e^{\theta_0 Z}\right]$$
for all non-negative functions $f : \mathbb{R}\to[0,\infty]$, then $\mathbb{E}[W] = \ell$.

Suppose that $\{W_n\}_{n=1}^\infty$ has been chosen to be a sequence of i.i.d. random variables such that $W_n \overset{d}{=} W$ for all $n$. Then, for all non-negative functions $f : \mathbb{R}^n\to[0,\infty]$, we have
$$\mathbb{E}[f(W_1, \dots, W_n)] = M(\theta_0)^{-n}\mathbb{E}\left[f(Z_1, \dots, Z_n)\prod_{i=1}^n e^{\theta_0 Z_i}\right] = M(\theta_0)^{-n}\mathbb{E}\left[f(Z_1, \dots, Z_n)e^{\theta_0 S_n}\right].$$
This is easily verified by showing the right side of this equation gives the correct expectations when $f$ is a product function. Replacing $f(z_1, \dots, z_n)$ by $M(\theta_0)^n e^{-\theta_0(z_1+\dots+z_n)}f(z_1, \dots, z_n)$ in the previous equation then shows
$$\mathbb{E}[f(Z_1, \dots, Z_n)] = M(\theta_0)^n\mathbb{E}\left[f(W_1, \dots, W_n)e^{-\theta_0 T_n}\right] \quad (13.6)$$
where $T_n := W_1 + \dots + W_n$.

Taking $\delta > 0$ and $f(z_1, \dots, z_n) = 1_{z_1+\dots+z_n\ge n\ell}$ in Eq. (13.6) shows
$$P(S_n \ge n\ell) = M(\theta_0)^n\mathbb{E}\left[e^{-\theta_0 T_n} : n\ell \le T_n\right] \ge M(\theta_0)^n\mathbb{E}\left[e^{-\theta_0 T_n} : n\ell \le T_n \le n(\ell+\delta)\right] \ge M(\theta_0)^n e^{-n\theta_0(\ell+\delta)}P[n\ell \le T_n \le n(\ell+\delta)] = e^{-nI(\ell)}e^{-n\theta_0\delta}P[n\ell \le T_n \le n(\ell+\delta)].$$
Taking logarithms of this equation, dividing by $n$, and letting $n\to\infty$, we learn
$$\liminf_{n\to\infty}\frac{1}{n}\ln P(S_n \ge n\ell) \ge -I(\ell) - \theta_0\delta + \lim_{n\to\infty}\frac{1}{n}\ln P[n\ell \le T_n \le n(\ell+\delta)] = -I(\ell) - \theta_0\delta + 0, \quad (13.7)$$
wherein we have used the central limit theorem to argue that
$$P[n\ell \le T_n \le n(\ell+\delta)] = P[0 \le T_n - n\ell \le n\delta] = P\left[0 \le \frac{T_n - n\ell}{\sqrt{n}} \le \sqrt{n}\delta\right] \to \frac{1}{2} \text{ as } n\to\infty.$$
Equation (13.5) now follows from Eq. (13.7) as $\delta > 0$ is arbitrary.
Example 13.4. Suppose that $Z \overset{d}{=} N(0, \sigma^2) \overset{d}{=} \sigma N$ where $N \overset{d}{=} N(0,1)$; then
$$M(\theta) = \mathbb{E}\left[e^{\theta Z}\right] = \mathbb{E}\left[e^{\theta\sigma N}\right] = \exp\left(\frac{1}{2}(\sigma\theta)^2\right)$$
and therefore $\psi(\theta) = \ln M(\theta) = \frac{1}{2}\sigma^2\theta^2$. Moreover, for $\ell > 0$,
$$\ell = \psi'(\theta) \implies \ell = \sigma^2\theta \implies \theta_0 = \frac{\ell}{\sigma^2}.$$
Thus it follows that
$$I(\ell) = \theta_0\ell - \psi(\theta_0) = \frac{\ell^2}{\sigma^2} - \frac{1}{2}\sigma^2\left(\frac{\ell}{\sigma^2}\right)^2 = \frac{1}{2}\frac{\ell^2}{\sigma^2}.$$
In this Gaussian case we actually know that $S_n \overset{d}{=} N(0, n\sigma^2)$ and therefore, by Mill's ratio,
$$P(S_n \ge n\ell) = P\left(\sqrt{n}\sigma N \ge n\ell\right) = P\left(N \ge \frac{\sqrt{n}\ell}{\sigma}\right) \sim \frac{\sigma}{\sqrt{2\pi n}\,\ell}e^{-n\frac{1}{2}\frac{\ell^2}{\sigma^2}} = \frac{\sigma}{\sqrt{2\pi n}\,\ell}e^{-nI(\ell)} \text{ as } n\to\infty.$$
Remark 13.5. The technique used in the proof of Theorem 13.3 was to make a change of measure so that the large deviation (from the usual) event with small probability became typical behavior with substantial probability. One could imagine making other types of changes of measure of the form
$$\mathbb{E}[f(W)] = \frac{\mathbb{E}[f(Z)\rho(Z)]}{\mathbb{E}[\rho(Z)]}$$
where $\rho$ is some positive function. Under this change of measure the analogue of Eq. (13.6) is
$$\mathbb{E}[f(Z_1, \dots, Z_n)] = (\mathbb{E}[\rho(Z)])^n\cdot\mathbb{E}\left[f(W_1, \dots, W_n)\prod_{j=1}^n \frac{1}{\rho(W_j)}\right].$$
However, to make this change of variables easy to deal with in the setting at hand, we would like to further have
$$\prod_{j=1}^n \frac{1}{\rho(W_j)} = f_n(T_n) = f_n(W_1 + \dots + W_n)$$
for some function $f_n$. Equivalently we would like, for some function $g_n$, that
$$\prod_{j=1}^n \rho(w_j) = g_n(w_1 + \dots + w_n)$$
for all $w_i$. Taking logarithms of this equation and differentiating in the $w_j$ and $w_k$ variables shows
$$\frac{\rho'(w_j)}{\rho(w_j)} = (\ln g_n)'(w_1 + \dots + w_n) = \frac{\rho'(w_k)}{\rho(w_k)}.$$
From the extremes of this last equation we conclude that $\rho'(w)/\rho(w) = c$ (for some constant $c$) and therefore $\rho(w) = Ke^{cw}$ for some constant $K$. This helps to explain why the exponential function is used in the above proof.
13.2 A Polya Urn Model
In this section we are going to analyze the long run behavior of the Polya urn Markov process which was introduced in Example 12.12. Recall that if the urn contains $r$ red balls and $g$ green balls at a given time, we draw one of these balls at random, replace it, and add $c$ more balls of the same color as the one drawn. Let $(r_n, g_n)$ be the number of red and green balls in the urn at time $n$. Then we have
$$P((r_{n+1}, g_{n+1}) = (r + c, g) \mid (r_n, g_n) = (r, g)) = \frac{r}{r+g}$$
and
$$P((r_{n+1}, g_{n+1}) = (r, g + c) \mid (r_n, g_n) = (r, g)) = \frac{g}{r+g}.$$
Let us observe that $r_n + g_n = r_0 + g_0 + nc$ and hence, if we let $X_n$ be the fraction of green balls in the urn at time $n$, then
$$X_n := \frac{g_n}{r_n + g_n} = \frac{g_n}{r_0 + g_0 + nc}.$$
We now claim that $\{X_n\}_{n=0}^\infty$ is a martingale relative to
$$\mathcal{F}_n := \sigma((r_k,g_k) : k \le n) = \sigma(X_k : k \le n).$$
Indeed,
$$E[X_{n+1}|\mathcal{F}_n] = E[X_{n+1}|X_n] = \frac{r_n}{r_n+g_n}\cdot\frac{g_n}{r_n+g_n+c} + \frac{g_n}{r_n+g_n}\cdot\frac{g_n+c}{r_n+g_n+c} = \frac{g_n}{r_n+g_n}\cdot\frac{r_n+g_n+c}{r_n+g_n+c} = X_n.$$
Since $X_n \ge 0$ and $EX_n = EX_0 < \infty$ for all $n,$ it follows by Theorem 12.45 that $X_\infty := \lim_{n\to\infty} X_n$ exists a.s. The distribution of $X_\infty$ is described in the next theorem.
Theorem 13.6. Let $\gamma := g/c,$ $\rho := r/c,$ and $\mu := \mathrm{Law}_P(X_\infty).$ Then $\mu$ is the beta distribution on $[0,1]$ with parameters $\gamma, \rho,$ i.e.
$$d\mu(x) = \frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)}\, x^{\gamma-1}(1-x)^{\rho-1}\,dx \text{ for } x\in[0,1]. \tag{13.8}$$
Proof. We will begin by computing the distribution of $X_n.$ As an example, the probability of drawing 3 greens and then 2 reds is
$$\frac{g}{r+g}\cdot\frac{g+c}{r+g+c}\cdot\frac{g+2c}{r+g+2c}\cdot\frac{r}{r+g+3c}\cdot\frac{r+c}{r+g+4c}.$$
More generally, the probability of first drawing $m$ greens and then $n-m$ reds is
$$\frac{g(g+c)\cdots(g+(m-1)c)\cdot r(r+c)\cdots(r+(n-m-1)c)}{(r+g)(r+g+c)\cdots(r+g+(n-1)c)}.$$
Since this is the same probability for any of the $\binom{n}{m}$ ways of drawing $m$ greens and $n-m$ reds in $n$ draws, we have
$$P(\text{Draw } m \text{ greens}) = \binom{n}{m}\frac{g(g+c)\cdots(g+(m-1)c)\cdot r(r+c)\cdots(r+(n-m-1)c)}{(r+g)(r+g+c)\cdots(r+g+(n-1)c)}$$
$$= \binom{n}{m}\frac{\gamma(\gamma+1)\cdots(\gamma+m-1)\cdot \rho(\rho+1)\cdots(\rho+n-m-1)}{(\rho+\gamma)(\rho+\gamma+1)\cdots(\rho+\gamma+n-1)}. \tag{13.9}$$
Before going to the general case let us warm up with the special case, g = r =c = 1. In this case Eq. (13.9) becomes,
$$P(\text{Draw } m \text{ greens}) = \binom{n}{m}\frac{1\cdot2\cdots m\,\cdot\,1\cdot2\cdots(n-m)}{2\cdot3\cdots(n+1)} = \frac{1}{n+1}.$$
On the set $\{\text{Draw } m \text{ greens}\}$ we have $X_n = \frac{1+m}{2+n}$ and hence it follows for any $f \in C([0,1])$ that
$$E[f(X_n)] = \sum_{m=0}^n f\Big(\frac{m+1}{n+2}\Big) P(\text{Draw } m \text{ greens}) = \sum_{m=0}^n f\Big(\frac{m+1}{n+2}\Big)\frac{1}{n+1}.$$
Therefore
$$E[f(X_\infty)] = \lim_{n\to\infty} E[f(X_n)] = \int_0^1 f(x)\,dx \tag{13.10}$$
and hence we may conclude that $X_\infty$ has the uniform distribution on $[0,1].$

For the general case, recall that $n! = \Gamma(n+1),$ $\Gamma(t+1) = t\Gamma(t),$ and therefore for $m \in \mathbb{N},$
$$\Gamma(x+m) = (x+m-1)(x+m-2)\cdots(x+1)\,x\,\Gamma(x). \tag{13.11}$$
Also recall Stirling's formula,
$$\Gamma(x) = \sqrt{2\pi}\, x^{x-1/2} e^{-x}\,[1+r(x)] \tag{13.12}$$
where $|r(x)| \to 0$ as $x \to \infty.$ To finish the proof we will follow the strategy of the proof of Eq. (13.10), using Stirling's formula to estimate the expression for $P(\text{Draw } m \text{ greens})$ in Eq. (13.9).

On the set $\{\text{Draw } m \text{ greens}\}$ we have
$$X_n = \frac{g+mc}{r+g+nc} = \frac{\gamma+m}{\rho+\gamma+n} =: x_m,$$
where $\rho := r/c$ and $\gamma := g/c.$ For later use notice that $\Delta_m x := x_{m+1}-x_m = \frac{1}{\rho+\gamma+n}.$
Using this notation we may rewrite Eq. (13.9) as
$$P(\text{Draw } m \text{ greens}) = \binom{n}{m}\frac{\frac{\Gamma(\gamma+m)}{\Gamma(\gamma)}\cdot\frac{\Gamma(\rho+n-m)}{\Gamma(\rho)}}{\frac{\Gamma(\rho+\gamma+n)}{\Gamma(\rho+\gamma)}} = \frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)}\cdot\frac{\Gamma(n+1)}{\Gamma(m+1)\Gamma(n-m+1)}\cdot\frac{\Gamma(\gamma+m)\Gamma(\rho+n-m)}{\Gamma(\rho+\gamma+n)}. \tag{13.13}$$
Now by Stirling's formula,
$$\frac{\Gamma(\gamma+m)}{\Gamma(m+1)} = \frac{(\gamma+m)^{\gamma+m-1/2} e^{-(\gamma+m)}[1+r(\gamma+m)]}{(1+m)^{m+1-1/2} e^{-(m+1)}[1+r(1+m)]} = (\gamma+m)^{\gamma-1}\Big(\frac{\gamma+m}{m+1}\Big)^{m+1/2} e^{-(\gamma-1)}\,\frac{1+r(\gamma+m)}{1+r(m+1)}$$
$$= (\gamma+m)^{\gamma-1}\Big(\frac{1+\gamma/m}{1+1/m}\Big)^{m+1/2} e^{-(\gamma-1)}\,\frac{1+r(\gamma+m)}{1+r(m+1)}.$$
We will keep $m$ fairly large, so that
$$\Big(\frac{1+\gamma/m}{1+1/m}\Big)^{m+1/2} = \exp\Big((m+1/2)\ln\frac{1+\gamma/m}{1+1/m}\Big) \cong \exp\big((m+1/2)(\gamma/m-1/m)\big) \cong e^{\gamma-1}.$$
Hence we have
$$\frac{\Gamma(\gamma+m)}{\Gamma(m+1)} \sim (\gamma+m)^{\gamma-1}.$$
Similarly, keeping $n-m$ fairly large, we also have
$$\frac{\Gamma(\rho+n-m)}{\Gamma(n-m+1)} \sim (\rho+n-m)^{\rho-1} \quad\text{and}\quad \frac{\Gamma(\rho+\gamma+n)}{\Gamma(n+1)} \sim (\rho+\gamma+n)^{\rho+\gamma-1}.$$
Combining these estimates with Eq. (13.13) gives
$$P(\text{Draw } m \text{ greens}) \sim \frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)}\cdot\frac{(\gamma+m)^{\gamma-1}(\rho+n-m)^{\rho-1}}{(\rho+\gamma+n)^{\rho+\gamma-1}} = \frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)}\cdot\frac{\big(\frac{\gamma+m}{\rho+\gamma+n}\big)^{\gamma-1}\big(\frac{\rho+n-m}{\rho+\gamma+n}\big)^{\rho-1}}{\rho+\gamma+n} = \frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)}\, x_m^{\gamma-1}(1-x_m)^{\rho-1}\,\Delta_m x.$$
Therefore, for any $f \in C([0,1]),$ it follows that
$$E[f(X_\infty)] = \lim_{n\to\infty} E[f(X_n)] = \lim_{n\to\infty}\sum_{m=0}^n f(x_m)\,\frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)}\, x_m^{\gamma-1}(1-x_m)^{\rho-1}\,\Delta_m x = \int_0^1 f(x)\,\frac{\Gamma(\rho+\gamma)}{\Gamma(\rho)\Gamma(\gamma)}\, x^{\gamma-1}(1-x)^{\rho-1}\,dx.$$
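Theorem 13.6 is easy to test by simulation. The sketch below runs the urn with $g = r = c = 1$ (so $\gamma = \rho = 1$ and the limit law is Beta$(1,1),$ i.e. uniform on $[0,1]$) and compares the empirical mean and variance of $X_n$ for large $n$ with $1/2$ and $1/12;$ the path length and sample count are arbitrary choices.

```python
import random

def polya_fraction(r0, g0, c, steps, rng):
    # Run one Polya urn path and return the final fraction of green balls.
    r, g = r0, g0
    for _ in range(steps):
        if rng.random() < g / (r + g):   # a green ball is drawn
            g += c
        else:                            # a red ball is drawn
            r += c
    return g / (r + g)

rng = random.Random(1)
samples = [polya_fraction(1, 1, 1, 1000, rng) for _ in range(4000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, var)  # Beta(1,1) = Uniform(0,1): mean 1/2, variance 1/12
```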
13.3 Galton Watson Branching Process

This section is taken from [6, p. 245–249].

Notation 13.7 Let $\{p_k\}_{k=0}^\infty$ be a probability on $\mathbb{N}_0$ where $p_k := P(\xi = k)$ is the offspring distribution. We assume that the mean number of offspring,
$$\mu := E\xi = \sum_{k=0}^\infty k p_k < \infty.$$

Notation 13.8 Let $\{\xi_i^n : i, n \ge 1\}$ be i.i.d. non-negative integer valued random variables with $P(\xi_i^n = k) = p_k$ for all $k \in \mathbb{N}_0.$ We also let $Y_n := \xi_1^n,$ $\{Y_n\}_{n=1}^\infty,$ in order to shorten notation later.

If $Z_n$ is the number of "people" (or organisms) alive in the $n$th generation, then we assume the $i$th organism has $\xi_i^{n+1}$ offspring, so that
$$Z_{n+1} = \xi_1^{n+1} + \dots + \xi_{Z_n}^{n+1} = \sum_{k=1}^\infty \big(\xi_1^{n+1}+\dots+\xi_k^{n+1}\big)\,1_{Z_n=k} \tag{13.14}$$
represents the number of people present in generation $n+1.$ We complete the description of the process $Z_n$ by setting $Z_0 = 1$ and $Z_{n+1} = 0$ if $Z_n = 0,$ i.e. once the population dies out it remains extinct forever after. The process $\{Z_n\}_{n\ge0}$ is called a Galton-Watson branching process, see Figure 13.1.
Standing assumption: We suppose that $p_1 < 1,$ for otherwise we would have $Z_n = 1$ for all $n.$
To understand $Z_n$ a bit better observe that
$$Z_0 = 1,\quad Z_1 = \xi_1^1,\quad Z_2 = \xi_1^2+\dots+\xi_{\xi_1^1}^2,\quad Z_3 = \xi_1^3+\dots+\xi_{Z_2}^3,\ \dots$$
The sample path in Figure 13.1 corresponds to
$$\xi_1^1 = 3;\quad \xi_1^2 = 2,\ \xi_2^2 = 0,\ \xi_3^2 = 3;\quad \xi_1^3 = \xi_2^3 = \xi_3^3 = \xi_4^3 = 0,\ \xi_5^3 = 4;\quad \xi_1^4 = \xi_2^4 = \xi_3^4 = \xi_4^4 = 0,$$
so that $Z_0 = 1,$ $Z_1 = 3,$ $Z_2 = 5,$ $Z_3 = 4.$

Fig. 13.1. A possible realization of a Galton Watson "tree."
In order to shorten the exposition we will make use of the following two intuitive facts:

1. The different branches of the Galton-Watson tree evolve independently of one another.
2. The process $\{Z_n\}_{n=0}^\infty$ is a Markov chain with transition probabilities
$$p(k,l) = P(Z_{n+1}=l\,|\,Z_n=k) = \begin{cases} \delta_{0,l} & \text{if } k = 0 \\ P(Y_1+\dots+Y_k=l) & \text{if } k \ge 1 \end{cases} = \delta_{k,0}\delta_{0,l} + 1_{k\ge1}\cdot P(Y_1+\dots+Y_k=l). \tag{13.15}$$
Lemma 13.9. If $f : \mathbb{N}_0 \to \mathbb{R}$ is a function then
$$(Pf)(k) = E[f(Y_1+\dots+Y_k)]$$
where $Y_1+\dots+Y_k := 0$ if $k = 0.$

Proof. This follows by the simple computation:
$$(Pf)(k) = \sum_{l=0}^\infty p(k,l)f(l) = \sum_{l=0}^\infty\big[\delta_{k,0}\delta_{0,l}+1_{k\ge1}\cdot P(Y_1+\dots+Y_k=l)\big]f(l) = \delta_{k,0}f(0)+1_{k\ge1}E[f(Y_1+\dots+Y_k)] = E[f(Y_1+\dots+Y_k)].$$
Let us evaluate $Pf$ for a couple of choices of $f.$ If $f(k) = k,$ then
$$(Pf)(k) = E[Y_1+\dots+Y_k] = k\mu \implies Pf = \mu f. \tag{13.16}$$
If $f(k) = \lambda^k$ for some $|\lambda| \le 1,$ then
$$\big(P\lambda^{(\cdot)}\big)(k) = E\big[\lambda^{Y_1+\dots+Y_k}\big] = \varphi(\lambda)^k, \tag{13.17}$$
where $\varphi(\lambda) := E[\lambda^{Y_1}]$ as in Eq. (13.20) below.
Corollary 13.10. The process $M_n := Z_n/\mu^n,$ $n \ge 0,$ is a positive martingale and in particular
$$EZ_n = \mu^n < \infty \text{ for all } n \in \mathbb{N}_0. \tag{13.18}$$

Proof. If $f(n) = n$ for all $n,$ then $Pf = \mu f$ by Eq. (13.16) and therefore
$$E[Z_{n+1}|\mathcal{F}_n] = (Pf)(Z_n) = \mu\, f(Z_n) = \mu Z_n.$$
Dividing this equation by $\mu^{n+1}$ then shows $E[M_{n+1}|\mathcal{F}_n] = M_n$ as desired. As $M_0 = 1$ it then follows that $EM_n = 1$ for all $n$ and this gives Eq. (13.18).
Theorem 13.11. If $\mu < 1,$ then, almost surely, $Z_n = 0$ for a.a. $n.$ In fact,
$$E[\text{Total number of organisms ever alive}] = E\Big[\sum_{n=0}^\infty Z_n\Big] < \infty.$$

Proof. When $\mu < 1,$ we have
$$E\sum_{n=0}^\infty Z_n = \sum_{n=0}^\infty \mu^n = \frac{1}{1-\mu} < \infty$$
and therefore $\sum_{n=0}^\infty Z_n < \infty$ a.s. As $Z_n \in \mathbb{N}_0$ for all $n,$ this can only happen if $Z_n = 0$ for almost all $n$ a.s.
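The expected total progeny $1/(1-\mu)$ can be checked by direct simulation. In the sketch below the offspring law $p_0 = 0.6,$ $p_1 = 0.3,$ $p_2 = 0.1$ (an illustrative choice with $\mu = 0.5$) gives expected total population $1/(1-1/2) = 2.$

```python
import random

def sample_offspring(dist, rng):
    # Sample from the offspring pmf given as a list [p_0, p_1, ...].
    u = rng.random()
    cum = 0.0
    for k, p in enumerate(dist):
        cum += p
        if u < cum:
            return k
    return len(dist) - 1

def total_progeny(dist, rng, cap=100_000):
    # Total number of organisms ever alive, starting from Z_0 = 1.
    z, total = 1, 1
    while z > 0 and total < cap:
        z = sum(sample_offspring(dist, rng) for _ in range(z))
        total += z
    return total

dist = [0.6, 0.3, 0.1]   # mu = 0.3 + 2 * 0.1 = 0.5 < 1, subcritical
rng = random.Random(2)
mean_total = sum(total_progeny(dist, rng) for _ in range(20_000)) / 20_000
print(mean_total)  # should be close to 1 / (1 - 0.5) = 2
```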
Theorem 13.12. If $\mu = 1$ and $P(\xi_i^m = 1) < 1,$ then again, almost surely, $Z_n = 0$ for a.a. $n.$

Proof. First note the assumption $\mu = 1$ and the standing assumption that $p_1 < 1$ imply $p_0 > 0.$ This further implies that for any $k \ge 1$ we have
$$P(Y_1+\dots+Y_k=0) \ge P(Y_1=0,\dots,Y_k=0) = p_0^k > 0,$$
which then implies
$$p(k,k) = P(Y_1+\dots+Y_k=k) < 1. \tag{13.19}$$
Because of Corollary 13.10 and the assumption that $\mu = 1,$ we know $\{Z_n\}_{n=1}^\infty$ is a martingale. This martingale, being positive, is $L^1$ bounded as
$$E[|Z_n|] = EZ_n = EZ_0 = 1 \text{ for all } n.$$
Therefore the martingale convergence theorem guarantees that $Z_\infty := \lim_{n\to\infty} Z_n$ exists a.s. with $EZ_\infty \le 1.$ Because $Z_n$ is integer valued, it must happen that $Z_n = Z_\infty$ for a.a. $n.$ If $k \in \mathbb{N},$ the event $\{Z_\infty = k\}$ can be expressed as
$$\{Z_\infty = k\} = \{Z_n = k \text{ a.a. } n\} = \cup_{M=1}^\infty\{Z_n = k \text{ for all } n \ge M\}.$$
As the sets in the union of this expression are increasing, it follows by the monotone convergence theorem (and then the DCT) that
$$P(Z_\infty=k) = \lim_{M\to\infty}P(Z_n=k \text{ for all } n\ge M) = \lim_{M\to\infty}\lim_{N\to\infty}P(Z_n=k \text{ for } M\le n\le N) = \lim_{M\to\infty}\lim_{N\to\infty}P(Z_M=k)\,p(k,k)^{N-M} = 0,$$
wherein we have made use of Eq. (13.19) in order to evaluate the limit. Thus it follows that
$$P(Z_\infty>0) = \sum_{k=1}^\infty P(Z_\infty=k) = 0.$$
The above argument does not apply to $k = 0$ since $p(0,0) = 1$ by definition.
Remark 13.13. By the way, the branching process $\{Z_n\}_{n=0}^\infty$ with $\mu = 1$ and $P(\xi = 1) < 1$ gives a nice example of a non-regular martingale. Indeed, if $Z$ were regular, we would have
$$Z_n = E\Big[\lim_{m\to\infty}Z_m\,\Big|\,\mathcal{F}_n\Big] = E[0|\mathcal{F}_n] = 0,$$
which is clearly false.
We now wish to consider the case where $\mu := EY_k = E[\xi_i^m] > 1.$ For $\lambda \in \mathbb{C}$ with $|\lambda| \le 1$ let
$$\varphi(\lambda) := E[\lambda^{Y_1}] = \sum_{k\ge0} p_k\lambda^k \tag{13.20}$$
be the (probability) generating function of $\{p_k\}_{k=0}^\infty.$ Notice that $\varphi(1) = 1$ and for $\lambda = s \in (-1,1)$ we have
$$\varphi'(s) = \sum_{k\ge0}kp_ks^{k-1} \quad\text{and}\quad \varphi''(s) = \sum_{k\ge0}k(k-1)p_ks^{k-2} \ge 0$$
with
$$\lim_{s\uparrow1}\varphi'(s) = \sum_{k\ge0}kp_k = E[\xi] =: \mu \quad\text{and}\quad \lim_{s\uparrow1}\varphi''(s) = \sum_{k\ge0}k(k-1)p_k = E[\xi(\xi-1)].$$
Therefore $\varphi$ is convex with $\varphi(0) = p_0,$ $\varphi(1) = 1$ and $\varphi'(1) = \mu.$
Lemma 13.14. If $\mu = \varphi'(1) > 1,$ there exists a unique $\rho \in [0,1)$ so that $\varphi(\rho) = \rho.$

Proof. See Figure 13.2 below.

Fig. 13.2. Figure associated to $\varphi(s) = \frac18(1+3s+3s^2+s^3),$ which is relevant for Exercise 3.13 of Durrett on p. 249. In this case $\rho \cong 0.23607.$
Theorem 13.15 (See Durrett [6], p. 247–248). If $\mu > 1,$ then
$$P_1(\text{Extinction}) = P_1\big(\lim_{n\to\infty}Z_n=0\big) = P_1(Z_n=0 \text{ for some } n) = \rho.$$

Proof. Since $\{Z_m=0\} \subset \{Z_{m+1}=0\},$ it follows that $\{Z_m=0\} \uparrow \{Z_n=0 \text{ for some } n\}$ and therefore, if $\theta_m := P_1(Z_m=0),$ then
$$P_1(Z_n=0 \text{ for some } n) = \lim_{m\to\infty}\theta_m.$$
Notice that $\theta_1 = P_1(Z_1=0) = p_0 \in (0,1).$ We now show $\theta_m = \varphi(\theta_{m-1}).$ To see this, conditioned on the set $\{Z_1=k\},$ $Z_m = 0$ iff all $k$ families die out in the remaining $m-1$ time units. Since each family evolves independently, the probability of this event is $\theta_{m-1}^k.$ [See Example 13.16 for a formal justification of this fact.] Combining this with $P_1(Z_1=k) = P(\xi_1^1=k) = p_k$ allows us to conclude by the first step analysis that
$$\theta_m = P_1(Z_m=0) = \sum_{k=0}^\infty P_1(Z_m=0, Z_1=k) = \sum_{k=0}^\infty P_1(Z_m=0|Z_1=k)P_1(Z_1=k) = \sum_{k=0}^\infty \theta_{m-1}^k p_k = \varphi(\theta_{m-1}).$$
It is now easy to see that $\theta_m \uparrow \rho$ as $m \uparrow \infty,$ again see Figure 13.3.
Fig. 13.3. The graphical interpretation of iterating $\theta_m = \varphi(\theta_{m-1})$ starting from $\theta_0 = P_1(Z_0=0) = 0$ and then $\theta_1 = \varphi(0) = p_0 = P_1(Z_1=0).$
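Both the fixed point iteration of the proof and the extinction probability itself can be checked numerically for the generating function of Figure 13.2, which is that of the Binomial$(3,1/2)$ offspring law (so $\mu = 3/2 > 1$). Iterating $\theta_m = \varphi(\theta_{m-1})$ converges to $\rho = \sqrt5 - 2 \cong 0.23607,$ and a Monte Carlo over simulated trees gives the same number; the population cap and generation count are illustrative truncations.

```python
import random

def phi(s):
    # Generating function of the Binomial(3, 1/2) offspring law from
    # Figure 13.2: phi(s) = (1 + 3s + 3s^2 + s^3)/8 = ((1 + s)/2)^3.
    return ((1.0 + s) / 2.0) ** 3

# Iterate theta_m = phi(theta_{m-1}) from theta_0 = 0, as in the proof.
theta = 0.0
for _ in range(200):
    theta = phi(theta)
print(theta)  # the extinction probability rho = sqrt(5) - 2

# Monte Carlo check: fraction of trees extinct (population capped at 200,
# beyond which extinction is essentially impossible since rho^200 ~ 0).
rng = random.Random(3)
trials, extinct = 5000, 0
for _ in range(trials):
    z = 1
    for _ in range(40):
        if z == 0 or z > 200:
            break
        # each organism independently has Binomial(3, 1/2) offspring
        z = sum(1 for _ in range(3 * z) if rng.random() < 0.5)
    if z == 0:
        extinct += 1
print(extinct / trials)
```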
13.3.1 Appendix: justifying assumptions

Exercise 13.2 (Markov Chain Products). Suppose that $\mathbf{P}$ and $\mathbf{Q}$ are Markov matrices on state spaces $S$ and $T.$ Then $\mathbf{P}\otimes\mathbf{Q}$ defined by
$$\mathbf{P}\otimes\mathbf{Q}((s,t),(s',t')) := \mathbf{P}(s,s')\cdot\mathbf{Q}(t,t')$$
defines a Markov transition matrix on $S\times T.$ Moreover, if $\{X_n\}$ and $\{Y_n\}$ are Markov chains on $S$ and $T$ respectively with transition matrices $\mathbf{P}$ and $\mathbf{Q}$ respectively, then $Z_n = (X_n,Y_n)$ is a Markov chain with transition matrix $\mathbf{P}\otimes\mathbf{Q}.$

Exercise 13.3 (Markov Chain Projections). Suppose that $\{X_n\}$ is a Markov chain on a state space $S$ with transition matrix $\mathbf{P}$ and $\pi : S \to T$ is a surjective (for simplicity) map. Then there is a transition matrix $\mathbf{Q}$ on $T$ such that $Y_n := \pi\circ X_n$ is a Markov chain on $T$ for all starting distributions on $S$ iff, for all $t \in T,$
$$\mathbf{P}(x,\pi^{-1}(t)) = \mathbf{P}(y,\pi^{-1}(t)) \text{ whenever } \pi(x)=\pi(y). \tag{13.21}$$
In this case,
$$\mathbf{Q}(s,t) = \mathbf{P}(x,\pi^{-1}(t)) := \sum_{y\in S:\,\pi(y)=t}\mathbf{P}(x,y) \text{ where } x\in\pi^{-1}(s). \tag{13.22}$$
Example 13.16. Using the above results we may now justify the fact that "the different branches of the Galton-Watson tree evolve independently of one another." Indeed, suppose that $\{X_n\}$ and $\{Y_n\}$ are two independent copies of the Galton-Watson process starting with $k$ and $l$ offspring respectively. Let $\pi : \mathbb{N}_0\times\mathbb{N}_0 \to \mathbb{N}_0$ be the addition map, $\pi(a,b) = a+b,$ and let
$$Z_n := X_n + Y_n = \pi(X_n,Y_n).$$
We claim that $\{Z_n\}$ is the Galton-Watson process starting with $k+l$ offspring. To see this we notice for $c \in \mathbb{N}_0$ (using the transition functions in Eq. (13.15)) that
$$\sum_{(x,y)\in\pi^{-1}(c)}\mathbf{P}\otimes\mathbf{P}((a,b),(x,y)) = \sum_{x+y=c}\mathbf{P}(a,x)\,\mathbf{P}(b,y) = \sum_{x+y=c}\big[\delta_{a,0}\delta_{0,x}+1_{a\ge1}\cdot P(Y_1+\dots+Y_a=x)\big]\cdot\big[\delta_{b,0}\delta_{0,y}+1_{b\ge1}\cdot P(\tilde Y_1+\dots+\tilde Y_b=y)\big].$$
Hence if $\{\tilde Y_n\}$ is an independent copy of $\{Y_n\},$ then assuming $a, b \ge 1$ we find
$$\sum_{x+y=c}P(Y_1+\dots+Y_a=x)\cdot P(\tilde Y_1+\dots+\tilde Y_b=y) = \sum_{x+y=c}P(Y_1+\dots+Y_a=x,\ \tilde Y_1+\dots+\tilde Y_b=y) = P(Y_1+\dots+Y_a+\tilde Y_1+\dots+\tilde Y_b=c) = p(a+b,c) = p(\pi(a,b),c).$$
When $a = 0,$ we get
$$\sum_{(x,y)\in\pi^{-1}(c)}\mathbf{P}\otimes\mathbf{P}((a,b),(x,y)) = \sum_{x+y=c}\delta_{0,x}\cdot\big[\delta_{b,0}\delta_{0,y}+1_{b\ge1}\cdot P(Y_1+\dots+Y_b=y)\big] = \delta_{b,0}\delta_{0,c}+1_{b\ge1}\cdot P(Y_1+\dots+Y_b=c) = p(a+b,c),$$
and similarly the statement holds if $b = 0.$ Thus we may now apply Exercise 13.3 to complete the proof.

Using the above results, if $\theta_m(k) := P_k(Z_m=0),$ then
$$\theta_m(k+l) = P_{k+l}(Z_m=0) = P_{(k,l)}(X_m+Y_m=0) = P_{(k,l)}(X_m=0, Y_m=0) = P_k(X_m=0)\,P_l(Y_m=0) = \theta_m(k)\,\theta_m(l)$$
and therefore $\theta_m(k) = \theta_m(1)^k.$
Part IV
Continuous Time Theory
14
Discrete State Space/Continuous Time
In this chapter, we continue to assume that $S$ is a finite or countable state space and $(\Omega, P)$ is a probability space.
14.1 Warm up exercises
Exercise 14.1 (Some Discrete Distributions). Let $p \in (0,1]$ and $\lambda > 0.$ In the two parts below, the distribution of $N$ will be described. In each case find the generating function (see Proposition 1.3) and use this to verify the stated values for $EN$ and $\mathrm{Var}(N).$

1. Geometric$(p)$: $P(N=k) = p(1-p)^{k-1}$ for $k \in \mathbb{N}.$ ($P(N=k)$ is the probability that the $k$th trial is the first success in a sequence of independent trials with probability of success $p.$) You should find $EN = 1/p$ and $\mathrm{Var}(N) = \frac{1-p}{p^2}.$
2. Poisson$(\lambda)$: $P(N=k) = \frac{\lambda^k}{k!}e^{-\lambda}$ for all $k \in \mathbb{N}_0.$ You should find $EN = \lambda = \mathrm{Var}(N).$
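The stated means and variances can be confirmed numerically by summing the (truncated) series for $EN$ and $EN^2$; the truncation point $200$ and the parameter values $p = 0.3,$ $\lambda = 2.5$ below are arbitrary.

```python
import math

def moments_from_pmf(pmf, kmax):
    # Compute EN and Var(N) directly from a truncated pmf.
    mean = sum(k * pmf(k) for k in range(kmax))
    second = sum(k * k * pmf(k) for k in range(kmax))
    return mean, second - mean ** 2

p, lam = 0.3, 2.5

geo = lambda k: p * (1 - p) ** (k - 1) if k >= 1 else 0.0
# Poisson pmf computed in log space to avoid huge factorials.
poi = lambda k: math.exp(k * math.log(lam) - math.lgamma(k + 1) - lam)

g_mean, g_var = moments_from_pmf(geo, 200)
p_mean, p_var = moments_from_pmf(poi, 200)
print(g_mean, g_var)   # 1/p and (1-p)/p^2
print(p_mean, p_var)   # lam and lam
```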
The next exercise deals with how to describe a "lazy" Markov chain. We will say a chain is lazy if $p(x,x) > 0$ for some $x \in S.$ The point being that if $p(x,x) > 0,$ then the chain starting at $x$ may be lazy and stay at $x$ for some period of time before deciding to jump to a new site. The next exercise describes lazy chains in terms of a non-lazy chain and the random times that the lazy chain will spend lounging at each site $x \in S.$ We will refer to this as the jump-hold description of the chain. We will give a similar description of chains on $S$ in the context of continuous time in Corollary 14.22 below.
Exercise 14.2 (Jump-Hold Description I). Let $S$ be a countable or finite set, $(\Omega, P, \{Y_n\}_{n=0}^\infty)$ be a Markov chain with transition kernel $\mathbf{P} := \{p(x,y)\}_{x,y\in S},$ and let $\nu(x) := P(Y_0=x)$ for all $x \in S.$ For simplicity let us assume there are no absorbing states¹ (i.e. $p(x,x) < 1$ for all $x \in S$) and then define

¹ A state $x$ is absorbing if $p(x,x) = 1,$ since in this case there is no chance for the chain to leave $x$ once it hits $x.$

$$\tilde p(x,y) := \begin{cases} \frac{p(x,y)}{1-p(x,x)} & \text{if } x\ne y \\ 0 & \text{if } x = y.\end{cases}$$
Let $j_k$ denote the time of the $k$th jump of the chain $\{Y_n\}_{n=0}^\infty,$ so that
$$j_1 := \inf\{n>0 : Y_n\ne Y_0\} \quad\text{and}\quad j_{k+1} := \inf\{n>j_k : Y_n\ne Y_{j_k}\}$$
with the convention that $j_0 = 0.$ Further let $\sigma_k := j_k - j_{k-1}$ denote the time spent between the $(k-1)$st and $k$th jumps of the chain $\{Y_n\}_{n=0}^\infty.$ Show:

1. For $\{x_k\}_{k=0}^n \subset S$ with $x_k \ne x_{k-1}$ for $k = 1,\dots,n$ and $m_1,\dots,m_n \in \mathbb{N},$
$$P\big(\big[\cap_{k=0}^n\{Y_{j_k}=x_k\}\big]\cap\big[\cap_{k=1}^n\{\sigma_k=m_k\}\big]\big) = \nu(x_0)\prod_{k=1}^n p(x_{k-1},x_{k-1})^{m_k-1}\,(1-p(x_{k-1},x_{k-1}))\,\tilde p(x_{k-1},x_k). \tag{14.1}$$
2. Summing the previous formula on $m_1,\dots,m_n \in \mathbb{N},$ conclude
$$P\big(\big[\cap_{k=0}^n\{Y_{j_k}=x_k\}\big]\big) = \nu(x_0)\cdot\prod_{k=1}^n\tilde p(x_{k-1},x_k),$$
i.e. this shows $\{Y_{j_k}\}_{k=0}^\infty$ is a Markov chain with transition kernel $\tilde p.$
3. Conclude that, relative to the conditional probability measure $P(\cdot\,|\,[\cap_{k=0}^n\{Y_{j_k}=x_k\}]),$ the $\{\sigma_k\}_{k=1}^n$ are independent geometric random variables with
$$\sigma_k \overset{d}{=} \mathrm{Geo}(1-p(x_{k-1},x_{k-1})) \text{ for } 1\le k\le n.$$
The reason that the geometric distribution appears in the above description is that it has the forgetful property described in Remark 1.4 above. This forgetfulness of the jump time intervals is a consequence of the memoryless property of a Markov chain. For continuous time chains we will see that we essentially need only replace the geometric distribution in the above description by the exponential distribution.
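The jump-hold description is easy to see in a simulation. The sketch below runs a lazy two-state chain (an illustrative choice with $p(0,0) = 0.6$) and records the completed sojourn lengths at state $0,$ whose empirical mean should be that of a $\mathrm{Geo}(1-p(0,0)) = \mathrm{Geo}(0.4)$ random variable, namely $2.5.$

```python
import random

# A lazy two-state chain: p(0,0)=0.6, p(0,1)=0.4, p(1,0)=0.5, p(1,1)=0.5.
P = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}

def step(x, rng):
    return 0 if rng.random() < P[x][0] else 1

rng = random.Random(4)
holds = []          # completed sojourn lengths at state 0
x, hold = 0, 0
for _ in range(200_000):
    y = step(x, rng)
    hold += 1
    if y != x:      # the chain jumped; the sojourn at x is complete
        if x == 0:
            holds.append(hold)
        x, hold = y, 0
mean_hold = sum(holds) / len(holds)
print(mean_hold)  # should be close to 1 / 0.4 = 2.5
```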
Definition 14.1. A random variable $T \ge 0$ is said to be exponential with parameter $\lambda \in [0,\infty)$ provided $P(T>t) = e^{-\lambda t}$ for all $t \ge 0.$ We will write $T \overset{d}{=} E(\lambda)$ for short.

Alternatively we can express this distribution as
$$P(T\in(t,t+dt]) = P(T>t)-P(T>t+dt) = e^{-\lambda t}-e^{-\lambda(t+dt)} = e^{-\lambda t}\big[1-e^{-\lambda\,dt}\big] = e^{-\lambda t}\lambda\,dt + o(dt) \text{ with } t\ge0.$$
To be more precise,
$$E[f(T)] = \int_0^\infty f(\tau)\,\lambda e^{-\lambda\tau}\,d\tau.$$
Notice that taking $f(\tau) = 1_{\tau>t}$ in the above expression again gives
$$P(T>t) = E[1_{T>t}] = \int_0^\infty 1_{\tau>t}\,\lambda e^{-\lambda\tau}\,d\tau = \int_t^\infty\lambda e^{-\lambda\tau}\,d\tau = e^{-\lambda t}.$$
Notice that $T$ has the following memoryless property,
$$P(T>s+t\,|\,T>s) = P(T>t) \text{ for all } s,t\ge0.$$
Indeed,
$$P(T>s+t\,|\,T>s) = \frac{P(T>s+t,\,T>s)}{P(T>s)} = \frac{P(T>s+t)}{P(T>s)} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda s}} = e^{-\lambda t} = P(T>t).$$
See Theorem 14.2 below for the converse assertion.
Exercise 14.3. Let $T$ be as in Definition 14.1 with $\lambda > 0.$ Show:
1. $ET = \lambda^{-1}$ and
2. $\mathrm{Var}(T) = \lambda^{-2}.$
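A quick Monte Carlo check of Exercise 14.3, using Python's `random.expovariate` (which is parametrized by the rate $\lambda$); $\lambda = 2$ and the sample size are arbitrary choices.

```python
import random

lam = 2.0
rng = random.Random(5)
samples = [rng.expovariate(lam) for _ in range(200_000)]
mean = sum(samples) / len(samples)
var = sum((t - mean) ** 2 for t in samples) / len(samples)
print(mean, var)  # should be close to 1/lam = 0.5 and 1/lam^2 = 0.25
```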
Theorem 14.2 (Memoryless property). A random variable $T \in (0,\infty]$ has an exponential distribution iff it satisfies the memoryless property:
$$P(T>s+t\,|\,T>s) = P(T>t) \text{ for all } s,t\ge0,$$
where as usual $P(A|B) := P(A\cap B)/P(B)$ when $P(B) > 0.$ (Note that $T \overset{d}{=} E(0)$ means that $P(T>t) = e^{0t} = 1$ for all $t > 0$ and therefore that $T = \infty$ a.s.)

Proof. (The following proof is taken from [13].) Suppose first that $T \overset{d}{=} E(\lambda)$ for some $\lambda > 0.$ Then
$$P(T>s+t\,|\,T>s) = \frac{P(T>s+t)}{P(T>s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(T>t).$$
For the converse, let $g(t) := P(T>t);$ then by assumption,
$$\frac{g(t+s)}{g(s)} = P(T>s+t\,|\,T>s) = P(T>t) = g(t)$$
whenever $g(s) \ne 0,$ and $g$ is a decreasing function. Therefore if $g(s) = 0$ for some $s > 0$ then $g(t) = 0$ for all $t > s.$ Thus it follows that
$$g(t+s) = g(t)g(s) \text{ for all } s,t\ge0.$$
Since $T > 0,$ we know that $g(1/n) = P(T>1/n) > 0$ for some $n$ and therefore $g(1) = g(1/n)^n > 0,$ and we may write $g(1) = e^{-\lambda}$ for some $0 \le \lambda < \infty.$ Observe for $p, q \in \mathbb{N}$ that $g(p/q) = g(1/q)^p,$ and taking $p = q$ then shows $e^{-\lambda} = g(1) = g(1/q)^q.$ Therefore $g(p/q) = e^{-\lambda p/q},$ so that $g(t) = e^{-\lambda t}$ for all $t \in \mathbb{Q}_+ := \mathbb{Q}\cap\mathbb{R}_+.$ Given $r, s \in \mathbb{Q}_+$ and $t \in \mathbb{R}$ such that $r \le t \le s,$ we have, since $g$ is decreasing, that
$$e^{-\lambda r} = g(r) \ge g(t) \ge g(s) = e^{-\lambda s}.$$
Hence letting $r \uparrow t$ and $s \downarrow t$ in the above equations shows that $g(t) = e^{-\lambda t}$ for all $t \in \mathbb{R}_+$ and therefore $T \overset{d}{=} E(\lambda).$
Theorem 14.3. Let $\{T_j\}_{j=1}^\infty$ be independent random variables such that $T_j \overset{d}{=} E(\lambda_j)$ with $0 < \lambda_j < \infty$ for all $j.$ Then:

1. If $\sum_{n=1}^\infty\lambda_n^{-1} < \infty$ then $P(\sum_{n=1}^\infty T_n = \infty) = 0$ (i.e. $P(\sum_{n=1}^\infty T_n < \infty) = 1$).
2. If $\sum_{n=1}^\infty\lambda_n^{-1} = \infty$ then $P(\sum_{n=1}^\infty T_n = \infty) = 1.$

In summary, there are precisely two possibilities: 1) $P(\sum_{n=1}^\infty T_n = \infty) = 0$ or 2) $P(\sum_{n=1}^\infty T_n = \infty) = 1,$ and moreover
$$P\Big(\sum_{n=1}^\infty T_n = \infty\Big) = \begin{cases} 0 & \text{iff } E\big[\sum_{n=1}^\infty T_n\big] = \sum_{n=1}^\infty\lambda_n^{-1} < \infty \\ 1 & \text{iff } E\big[\sum_{n=1}^\infty T_n\big] = \sum_{n=1}^\infty\lambda_n^{-1} = \infty.\end{cases}$$

Proof. 1. Since
$$E\Big[\sum_{n=1}^\infty T_n\Big] = \sum_{n=1}^\infty E[T_n] = \sum_{n=1}^\infty\lambda_n^{-1} < \infty,$$
it follows that $\sum_{n=1}^\infty T_n < \infty$ a.s., i.e. $P(\sum_{n=1}^\infty T_n = \infty) = 0.$

2. First observe that for $\alpha < \lambda,$ if $T \overset{d}{=} E(\lambda),$ then
$$E[e^{\alpha T}] = \int_0^\infty e^{\alpha\tau}\lambda e^{-\lambda\tau}\,d\tau = \frac{\lambda}{\lambda-\alpha} = \frac{1}{1-\alpha\lambda^{-1}}. \tag{14.2}$$
By the DCT, independence, and Eq. (14.2) with $\alpha = -1$ we find
$$E\big[e^{-\sum_{n=1}^\infty T_n}\big] = \lim_{N\to\infty}E\big[e^{-\sum_{n=1}^N T_n}\big] = \lim_{N\to\infty}\prod_{n=1}^N E\big[e^{-T_n}\big] = \lim_{N\to\infty}\prod_{n=1}^N\Big(\frac{1}{1+\lambda_n^{-1}}\Big) = \prod_{n=1}^\infty(1-a_n)$$
where
$$a_n = 1-\frac{1}{1+\lambda_n^{-1}} = \frac{1}{1+\lambda_n}.$$
Hence by Exercise 14.4 below, $E\big[e^{-\sum_{n=1}^\infty T_n}\big] = 0$ iff $\infty = \sum_{n=1}^\infty a_n,$ which happens iff $\sum_{n=1}^\infty\lambda_n^{-1} = \infty,$ as you should verify. This completes the proof since $E\big[e^{-\sum_{n=1}^\infty T_n}\big] = 0$ iff $e^{-\sum_{n=1}^\infty T_n} = 0$ a.s., or equivalently $\sum_{n=1}^\infty T_n = \infty$ a.s.
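A numerical illustration of case 1 of Theorem 14.3: with $\lambda_n = n^2$ we have $\sum\lambda_n^{-1} = \pi^2/6 < \infty,$ so the partial sums of the $T_n$ stabilize at a finite random limit whose mean matches $\sum_{n=1}^{50}1/n^2;$ the truncation at 50 terms and the repetition count are illustrative.

```python
import random

# lambda_n = n^2: sum of 1/lambda_n converges, so sum(T_n) is finite a.s.
rng = random.Random(6)
N, reps = 50, 5000
expected = sum(1.0 / n ** 2 for n in range(1, N + 1))
totals = [sum(rng.expovariate(n * n) for n in range(1, N + 1))
          for _ in range(reps)]
mean_total = sum(totals) / reps
print(mean_total, expected)  # the two numbers should nearly agree
```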
Lemma 14.4. If $0 \le x \le \frac12,$ then
$$e^{-2x} \le 1-x \le e^{-x}. \tag{14.3}$$
Moreover, the upper bound in Eq. (14.3) is valid for all $x \in \mathbb{R}.$

Proof. The upper bound follows by the convexity of $e^{-x},$ see Figure 14.1. For the lower bound we use the convexity of $\varphi(x) = e^{-2x}$ to conclude that the line joining $(0,1) = (0,\varphi(0))$ and $(1/2,e^{-1}) = (1/2,\varphi(1/2))$ lies above $\varphi(x)$ for $0 \le x \le 1/2.$ Then we use the fact that the line $1-x$ lies above this line to conclude the lower bound in Eq. (14.3), see Figure 14.2.

For $\{a_n\}_{n=1}^\infty \subset [0,1],$ let
$$\prod_{n=1}^\infty(1-a_n) := \lim_{N\to\infty}\prod_{n=1}^N(1-a_n).$$
The limit exists since $\prod_{n=1}^N(1-a_n)$ decreases as $N$ increases.

Exercise 14.4. Show: if $\{a_n\}_{n=1}^\infty \subset [0,1),$ then
$$\prod_{n=1}^\infty(1-a_n) = 0 \iff \sum_{n=1}^\infty a_n = \infty.$$
The implication $\Longleftarrow$ holds even if $a_n = 1$ is allowed.

Fig. 14.1. A graph of $1-x$ and $e^{-x}$ showing that $1-x \le e^{-x}$ for all $x.$

Fig. 14.2. A graph of $1-x$ (in red), the line joining $(0,1)$ and $(1/2,e^{-1})$ (in green), $e^{-x}$ (in purple), and $e^{-2x}$ (in black) showing that $e^{-2x} \le 1-x \le e^{-x}$ for all $x \in [0,1/2].$
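Exercise 14.4 can be illustrated with two deterministic examples: $a_n = 1/n$ (divergent sum, vanishing product, since $\prod_{n=2}^N(1-1/n) = 1/N$ telescopes) versus $a_n = 1/n^2$ (convergent sum, product telescoping to $(N+1)/(2N) \to 1/2 > 0$).

```python
N = 10_000

prod_divergent_sum = 1.0   # a_n = 1/n:   sum a_n = infinity
prod_convergent_sum = 1.0  # a_n = 1/n^2: sum a_n < infinity
for n in range(2, N + 1):
    prod_divergent_sum *= 1.0 - 1.0 / n
    prod_convergent_sum *= 1.0 - 1.0 / n ** 2

# Telescoping: prod_{n=2}^{N} (1 - 1/n) = 1/N -> 0, while
# prod_{n=2}^{N} (1 - 1/n^2) = (N+1)/(2N) -> 1/2 > 0.
print(prod_divergent_sum, prod_convergent_sum)
```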
14.2 Continuous Time Chains

Let $S$ be a countable or finite state space, but now let us replace discrete time, $\mathbb{N}_0,$ by its continuous cousin, $T := [0,\infty).$ We now start our study of continuous time homogeneous Markov chains. In more detail, we will assume that $\{X_t\}_{t\ge0}$ is a stochastic process whose sample paths are right continuous, see Figure 14.3. (These processes need not have left hand limits if there are an infinite number of jumps in a finite time interval. For the most part we will assume that this does not happen almost surely.) The fact that time is continuous adds some mathematical technicalities which typically will be suppressed in these notes unless it is too dangerous to do so.
Notation 14.5 (Jump and Sojourn times) Let $J_0 = 0$ and then define $J_n$ inductively by
$$J_{n+1} := \inf\{t > J_n : X_t \ne X_{J_n}\},$$
so that $J_n$ is the time of the $n$th jump of the process $\{X_t\}.$ We further define the sojourn times by
$$S_n := J_{n+1} - J_n,$$
so that $S_n$ is the time spent at the $n$th site visited by the chain. (This notation conflicts with Karlin and Taylor, who label the sojourn times by the states in $S.$)

Fig. 14.3. A typical sample path for a continuous time Markov chain with $S = \{1,2,3,4\}.$ The picture indicates the jump times $J_0,\dots,J_5$ and sojourn times $S_0,\dots,S_5$ for this random path.
In this continuous time setting, if $\{X_t\}_{t\ge0}$ is a collection of functions on $\Omega$ with values in $S,$ we let $\mathcal{F}_t^X := \sigma(X_s : s\le t)$ be the continuous time filtration determined by $\{X_t\}_{t\ge0}.$ As usual, a function $f : \Omega \to T$ is $\mathcal{F}_t$ – measurable iff $f = F(\{X_s\}_{s\le t})$ is a function of the path $s \to X_s$ restricted to $[0,t].$ As in the discrete time Markov chain setting, to each $i \in S$ we will write $P_i(A) := P(A|X_0=i).$ That is, $P_i$ is the probability associated to the scenario where the chain is forced to start at site $i.$ We now define, for $i, j \in S,$
$$\mathbf{P}_t(i,j) := P_i(X_t=j), \tag{14.4}$$
which is the probability of finding the chain at site $j$ at time $t$ given the chain starts at $i.$
Definition 14.6. The time homogeneous Markov property states: for every $0 \le s < t < \infty$ and bounded functions $f : S \to \mathbb{R},$ we require
$$E_i[f(X_t)|\mathcal{F}_s] = E_i[f(X_t)|\sigma(X_s)] = (\mathbf{P}_{t-s}f)(X_s)$$
where, as before,
$$(\mathbf{P}_tf)(x) = \sum_{y\in S}\mathbf{P}_t(x,y)f(y).$$
A more down to earth statement of this property is that for any choices of $0 = t_0 < t_1 < \dots < t_n = s < t$ and $i_1,\dots,i_n \in S,$
$$P_i(X_t=j\,|\,X_{t_1}=i_1,\dots,X_{t_n}=i_n) = \mathbf{P}_{t-s}(i_n,j), \tag{14.5}$$
whenever $P_i(X_{t_1}=i_1,\dots,X_{t_n}=i_n) > 0.$ In particular, if $P(X_s=k) > 0$ then
$$P_i(X_t=j\,|\,X_s=k) = \mathbf{P}_{t-s}(k,j). \tag{14.6}$$
Roughly speaking, the Markov property may be stated as follows: the probability that $X_t = j$ given knowledge of the process up to time $s$ is $\mathbf{P}_{t-s}(X_s,j).$ In symbols we might express this last sentence as
$$P_i\big(X_t=j\,|\,\{X_\tau\}_{\tau\le s}\big) = P_i(X_t=j\,|\,X_s) = \mathbf{P}_{t-s}(X_s,j).$$
So again a continuous time Markov process is forgetful in the sense that what the chain does for $t \ge s$ depends only on where the chain is located at time $s,$ namely $X_s,$ and not on how it got there. See Fact 14.21 below for a more general statement of this property. The next theorem gives our first basic description of a continuous time Markov process on a discrete state space in terms of "finite dimensional distributions," see Figure 14.4.
Theorem 14.7 (Finite dimensional distributions). Let $0 < t_1 < t_2 < \dots < t_n$ and $i_0,i_1,i_2,\dots,i_n \in S.$ Then
$$P_{i_0}(X_{t_1}=i_1, X_{t_2}=i_2,\dots,X_{t_n}=i_n) = \mathbf{P}_{t_1}(i_0,i_1)\,\mathbf{P}_{t_2-t_1}(i_1,i_2)\cdots\mathbf{P}_{t_n-t_{n-1}}(i_{n-1},i_n). \tag{14.7}$$

Proof. The proof is similar to that of Theorem 5.8 and Corollary 5.10. For notational simplicity let us suppose that $n = 3.$ We then have
$$P_{i_0}(X_{t_1}=i_1,X_{t_2}=i_2,X_{t_3}=i_3) = P_{i_0}(X_{t_3}=i_3\,|\,X_{t_1}=i_1,X_{t_2}=i_2)\,P_{i_0}(X_{t_1}=i_1,X_{t_2}=i_2)$$
$$= \mathbf{P}_{t_3-t_2}(i_2,i_3)\,P_{i_0}(X_{t_1}=i_1,X_{t_2}=i_2) = \mathbf{P}_{t_3-t_2}(i_2,i_3)\,P_{i_0}(X_{t_2}=i_2\,|\,X_{t_1}=i_1)\,P_{i_0}(X_{t_1}=i_1)$$
$$= \mathbf{P}_{t_3-t_2}(i_2,i_3)\,\mathbf{P}_{t_2-t_1}(i_1,i_2)\,\mathbf{P}_{t_1}(i_0,i_1),$$
wherein we have used the Markov property twice.
Fig. 14.4. In this figure we fix three times, $t_1,$ $t_2,$ and $t_3,$ at which we are going to probe the process $\{X_t\}_{t\ge0}.$ In general, this can be done for any finite number of times, and this is what we mean by finite dimensional distributions.
Proposition 14.8 (Properties of P). Let $\mathbf{P}_t(i,j) := P_i(X_t=j)$ be as above. Then:

1. For each $t \ge 0,$ $\mathbf{P}_t$ is a Markov matrix, i.e.
$$\sum_{j\in S}\mathbf{P}_t(i,j) = 1 \text{ for all } i\in S \quad\text{and}\quad \mathbf{P}_t(i,j)\ge0 \text{ for all } i,j\in S.$$
2. $\lim_{t\downarrow0}\mathbf{P}_t(i,j) = \delta_{ij}$ for all $i,j\in S.$
3. The Chapman – Kolmogorov equation holds:
$$\mathbf{P}_{t+s} = \mathbf{P}_s\mathbf{P}_t \text{ for all } s,t\ge0, \tag{14.8}$$
i.e.
$$\mathbf{P}_{t+s}(i,j) = \sum_{k\in S}\mathbf{P}_s(i,k)\,\mathbf{P}_t(k,j) \text{ for all } s,t\ge0. \tag{14.9}$$

We will call a family $\{\mathbf{P}_t\}_{t\ge0}$ satisfying items 1.–3. a continuous time Markov semigroup.

Proof. Most of the assertions follow from the basic properties of conditional probabilities. The assumed right continuity of $X_t$ implies that $\lim_{t\downarrow0}\mathbf{P}_t = \mathbf{P}_0 = I.$ From Equation (14.7) with $n = 2$ we learn that
$$\mathbf{P}_{t_2}(i_0,i_2) = \sum_{i_1\in S}P_{i_0}(X_{t_1}=i_1,X_{t_2}=i_2) = \sum_{i_1\in S}\mathbf{P}_{t_1}(i_0,i_1)\,\mathbf{P}_{t_2-t_1}(i_1,i_2) = [\mathbf{P}_{t_1}\mathbf{P}_{t_2-t_1}](i_0,i_2).$$
Definition 14.9 (Infinitesimal Generator). The infinitesimal generator, $\mathbf{A},$ of a Markov semigroup $\{\mathbf{P}_t\}_{t\ge0}$ is the matrix
$$\mathbf{A} := \frac{d}{dt}\Big|_{0^+}\mathbf{P}_t, \tag{14.10}$$
which we are assuming to exist here. [We will not go into the technicalities of this derivative in these notes.]

Assuming that the derivative in Eq. (14.10) exists, then
$$\frac{d}{dt}\mathbf{P}_t = \mathbf{A}\mathbf{P}_t \quad\text{(Kolmogorov's Backwards Equation)}$$
$$\frac{d}{dt}\mathbf{P}_t = \mathbf{P}_t\mathbf{A} \quad\text{(Kolmogorov's Forwards Equation).}$$
Indeed,
$$\frac{d}{dt}\mathbf{P}_t = \frac{d}{ds}\Big|_0\mathbf{P}_{t+s} = \frac{d}{ds}\Big|_0[\mathbf{P}_s\mathbf{P}_t] = \mathbf{A}\mathbf{P}_t \quad\text{and}\quad \frac{d}{dt}\mathbf{P}_t = \frac{d}{ds}\Big|_0\mathbf{P}_{t+s} = \frac{d}{ds}\Big|_0[\mathbf{P}_t\mathbf{P}_s] = \mathbf{P}_t\mathbf{A}.$$
We also must have:

1. Since $\mathbf{P}_t(x,y) \ge 0$ for all $t \ge 0$ and $x,y \in S,$ we have for $x \ne y$ that
$$\mathbf{A}(x,y) = \lim_{t\downarrow0}\frac{\mathbf{P}_t(x,y)-\mathbf{P}_0(x,y)}{t} = \lim_{t\downarrow0}\frac{\mathbf{P}_t(x,y)}{t} \ge 0.$$
2. Since $\mathbf{P}_t\mathbf{1} = \mathbf{1}$ we also have
$$0 = \frac{d}{dt}\Big|_{0^+}\mathbf{1} = \frac{d}{dt}\Big|_{0^+}\mathbf{P}_t\mathbf{1} = \mathbf{A}\mathbf{1},$$
i.e. $\mathbf{A}\mathbf{1} = 0$ and thus
$$a_x := \sum_{y\ne x}\mathbf{A}(x,y) = -\mathbf{A}(x,x) \ge 0.$$
Example 14.10. Suppose that $S = \{1,2,\dots,n\}$ and $\mathbf{P}_t$ is a Markov semigroup with infinitesimal generator $\mathbf{A},$ so that $\frac{d}{dt}\mathbf{P}_t = \mathbf{A}\mathbf{P}_t = \mathbf{P}_t\mathbf{A}.$ By assumption $\mathbf{P}_t(i,j) \ge 0$ for all $i,j \in S$ and $\sum_{j=1}^n\mathbf{P}_t(i,j) = 1$ for all $i \in S.$ We may write this last condition as $\mathbf{P}_t\mathbf{1} = \mathbf{1}$ for all $t \ge 0,$ where $\mathbf{1}$ denotes the vector in $\mathbb{R}^n$ with all entries being 1. Differentiating $\mathbf{P}_t\mathbf{1} = \mathbf{1}$ at $t = 0$ shows that $\mathbf{A}\mathbf{1} = 0,$ i.e. $\sum_{j=1}^n\mathbf{A}(i,j) = 0$ for all $i \in S.$ Since
$$\mathbf{A}(i,j) = \lim_{t\downarrow0}\frac{\mathbf{P}_t(i,j)-\delta_{ij}}{t},$$
if $i \ne j$ we will have
$$\mathbf{A}(i,j) = \lim_{t\downarrow0}\frac{\mathbf{P}_t(i,j)}{t} \ge 0.$$
Thus we have shown the infinitesimal generator $\mathbf{A}$ of $\mathbf{P}_t$ must satisfy $\mathbf{A}(i,j) \ge 0$ for all $i \ne j$ and $\sum_{j=1}^n\mathbf{A}(i,j) = 0$ for all $i \in S.$ In words, $\mathbf{A}$ is an $n\times n$ matrix with non-negative off diagonal entries whose row sums are all zero. You are asked to prove the converse in Exercise 14.5. So an explicit example of an infinitesimal generator when $S = \{1,2,3\}$ is
$$\mathbf{A} = \begin{pmatrix} -3 & 1 & 2 \\ 4 & -6 & 2 \\ 7 & 1 & -8 \end{pmatrix}.$$
In this case my computer finds
$$\mathbf{P}_t = \begin{pmatrix} \frac17e^{-7t}+\frac15e^{-10t}+\frac{23}{35} & \frac17-\frac17e^{-7t} & \frac15-\frac15e^{-10t} \\ \frac15e^{-10t}-\frac67e^{-7t}+\frac{23}{35} & \frac67e^{-7t}+\frac17 & \frac15-\frac15e^{-10t} \\ \frac17e^{-7t}-\frac45e^{-10t}+\frac{23}{35} & \frac17-\frac17e^{-7t} & \frac45e^{-10t}+\frac15 \end{pmatrix}.$$
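The displayed formula for $\mathbf{P}_t$ can be verified by computing $e^{t\mathbf{A}}$ numerically. The sketch below implements the matrix exponential by scaling and squaring of a truncated Taylor series (in practice one would use a library routine such as `scipy.linalg.expm`) and checks the $(1,1)$ entry against the closed form as well as the semigroup property $\mathbf{P}_{0.4}\mathbf{P}_{0.6} = \mathbf{P}_1.$

```python
import math

A = [[-3.0, 1.0, 2.0],
     [4.0, -6.0, 2.0],
     [7.0, 1.0, -8.0]]

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(M, t, squarings=20, terms=12):
    # Scaling and squaring: exp(tM) = (exp(tM / 2^s))^(2^s), with the
    # small exponential computed from its Taylor series.
    n = len(M)
    S = [[t * M[i][j] / 2 ** squarings for j in range(n)] for i in range(n)]
    E = [[float(i == j) for j in range(n)] for i in range(n)]      # identity
    term = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(1, terms):
        term = mat_mul(term, S)                    # now holds S^k * k!/k!
        term = [[x / k for x in row] for row in term]
        E = [[E[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        E = mat_mul(E, E)
    return E

P1 = expm(A, 1.0)
closed = math.exp(-7.0) / 7.0 + math.exp(-10.0) / 5.0 + 23.0 / 35.0
Psemi = mat_mul(expm(A, 0.4), expm(A, 0.6))        # semigroup check
print(P1[0][0], closed)             # (1,1) entry of P_1 vs. the formula
print([sum(row) for row in P1])     # row sums should all be 1
```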
Fact 14.11 Given a discrete state space $S,$ the collection of homogeneous Markov probabilities on trajectories $\{X_t\}_{t\ge0} \subset S$ as in Figure 14.3 is in one to one correspondence with continuous time Markov semigroups $\{\mathbf{P}_t\},$ which are in one to one correspondence with Markov infinitesimal generators $\mathbf{A},$ where $\mathbf{A}$ is a matrix indexed by $S$ with the properties stated after Definition 14.9. The correspondences are given (when no explosions occur) by
$$\mathbf{A} = \frac{d}{dt}\Big|_{0^+}\mathbf{P}_t,\quad \mathbf{P}_t = e^{t\mathbf{A}},\quad \mathbf{P}_t(i,j) := P_i(X_t=j),$$
where the probabilities $P_i$ are determined by Theorem 14.7. In more detail,
$$\{X_t\}_{t\ge0}\ \to\ \mathbf{P}_t(i,j) := P(X_{t+s}=j\,|\,X_s=i) \text{ for all } i,j\in S \text{ and } s,t\ge0,$$
$$\mathbf{P}_t(i,j)\ \to\ \{X_t\}_{t\ge0} \text{ with } P_i(X_{t_k}=i_k : 1\le k\le n) = \prod_{k=1}^n\mathbf{P}_{t_k-t_{k-1}}(i_{k-1},i_k),$$
$$\mathbf{P}_t\ \to\ \mathbf{A} := \frac{d}{dt}\Big|_{0^+}\mathbf{P}_t \quad\text{and}\quad \mathbf{A}\ \to\ \mathbf{P}_t := e^{t\mathbf{A}} = \sum_{n=0}^\infty\frac{t^n}{n!}\mathbf{A}^n.$$
Proposition 14.12. If $S = \mathbb{N}_0$ and
$$\mathbf{A} = \begin{pmatrix} -\lambda & \lambda & 0 & 0 & \cdots \\ 0 & -\lambda & \lambda & 0 & \cdots \\ 0 & 0 & -\lambda & \lambda & \cdots \\ \vdots & & \ddots & \ddots & \ddots \end{pmatrix} \tag{14.11}$$
then
$$\mathbf{P}_t = e^{t\mathbf{A}} = e^{-t\lambda}\begin{pmatrix} 1 & \lambda t & \frac{(\lambda t)^2}{2!} & \frac{(\lambda t)^3}{3!} & \cdots \\ 0 & 1 & \lambda t & \frac{(\lambda t)^2}{2!} & \cdots \\ 0 & 0 & 1 & \lambda t & \cdots \\ \vdots & & & \ddots & \ddots \end{pmatrix}. \tag{14.12}$$
In other words,
$$\mathbf{P}_t(m,n) = 1_{n\ge m}\,e^{-t\lambda}\frac{(\lambda t)^{n-m}}{(n-m)!}. \tag{14.13}$$
We will summarize the matrix $\mathbf{A},$ and hence the Markov chain, via the rate diagram
$$0\xrightarrow{\ \lambda\ }1\xrightarrow{\ \lambda\ }2\xrightarrow{\ \lambda\ }\cdots\xrightarrow{\ \lambda\ }(n-1)\xrightarrow{\ \lambda\ }n\xrightarrow{\ \lambda\ }\cdots.$$
Proof. First proof. Let $e_0(m) = \delta_{0,m}$ and $\pi_t = e_0\mathbf{P}_t$ be the top row of $\mathbf{P}_t.$ Then
$$\frac{d}{dt}\pi_t(0) = -\lambda\pi_t(0) \quad\text{and}\quad \frac{d}{dt}\pi_t(n) = -\lambda\pi_t(n)+\lambda\pi_t(n-1).$$
Thus if $w_t(n) := e^{\lambda t}\pi_t(n),$ we have, with the convention that $w_t(-1) = 0,$ that
$$\frac{d}{dt}w_t(n) = \lambda w_t(n-1) \text{ for all } n \text{ with } w_0(n) = \delta_{0,n}.$$
The solution to this system of equations is $w_t(n) = \frac{(\lambda t)^n}{n!},$ as can be found by induction on $n.$ This justifies the first row of $\mathbf{P}_t$ given in Eq. (14.12). The other rows are found similarly or using the above results and a little thought.
Second proof. Suppose that $\mathbf{P}_t$ is given as in Eq. (14.12), or equivalently Eq. (14.13). The row sums are clearly equal to one, and we can check directly that $\mathbf{P}_t\mathbf{P}_s = \mathbf{P}_{s+t}.$ To verify this we have
$$\sum_{k\in S}\mathbf{P}_t(i,k)\,\mathbf{P}_s(k,j) = \sum_{k\in\mathbb{N}_0}1_{k\ge i}\,e^{-\lambda t}\frac{(\lambda t)^{k-i}}{(k-i)!}\,1_{j\ge k}\,e^{-\lambda s}\frac{(\lambda s)^{j-k}}{(j-k)!} = 1_{i\le j}\,e^{-\lambda(t+s)}\sum_{i\le k\le j}\frac{(\lambda t)^{k-i}}{(k-i)!}\frac{(\lambda s)^{j-k}}{(j-k)!}. \tag{14.14}$$
Letting $k = i+m$ with $0 \le m \le j-i,$ the above sum may be written as
$$\sum_{m=0}^{j-i}\frac{(\lambda t)^m}{m!}\frac{(\lambda s)^{j-i-m}}{(j-i-m)!} = \frac{1}{(j-i)!}\sum_{m=0}^{j-i}\binom{j-i}{m}(\lambda t)^m(\lambda s)^{j-i-m}$$
and hence by the binomial formula we find
$$\sum_{i\le k\le j}\frac{(\lambda t)^{k-i}}{(k-i)!}\frac{(\lambda s)^{j-k}}{(j-k)!} = \frac{1}{(j-i)!}(\lambda t+\lambda s)^{j-i}.$$
Combining this with Eq. (14.14) shows that
$$\sum_{k\in S}\mathbf{P}_t(i,k)\,\mathbf{P}_s(k,j) = \mathbf{P}_{s+t}(i,j)$$
as desired. To finish the proof we now need only observe that $\frac{d}{dt}\big|_0\mathbf{P}_t = \mathbf{A},$ which follows from the facts that
$$\frac{d}{dt}e^{-\lambda t}\Big|_{t=0} = -\lambda,\quad \frac{d}{dt}(\lambda t)e^{-\lambda t}\Big|_{t=0} = \lambda,\quad\text{and}\quad \frac{d}{dt}\frac{(\lambda t)^k}{k!}e^{-\lambda t}\Big|_{t=0} = 0 \text{ for all } k\ge2.$$
Definition 14.13. The continuous time Markov chain $\{N_t\}_{t\ge0}$ associated to $\mathbf{A}$ in Eq. (14.11), when started at $N_0 = 0,$ is called the Poisson process with intensity $\lambda.$

Definition 14.14. A stochastic process $\{X_t\}_{t\ge0}$ is said to have independent increments if for any $n \in \mathbb{N}$ and $0 = t_0 < t_1 < \dots < t_n < \infty$ the random variables
$$\{X_{t_i}-X_{t_{i-1}}\}_{i=1}^n$$
are independent.
Proposition 14.15. The Poisson process $\{N_t\}_{t\ge0}$ has independent increments and, moreover, for each $0 \le s \le t < \infty,$
$$N_t - N_s \overset{d}{=} \mathrm{Poi}(\lambda(t-s)).$$

Proof. Let us give the proof when $n = 3,$ as the general case is similar. Given $k_1,k_2,k_3 \in \mathbb{N}_0$ and $0 = t_0 < t_1 < t_2 < t_3 < \infty$ we find
$$P_0(N_{t_1}-N_{t_0}=k_1,\,N_{t_2}-N_{t_1}=k_2,\,N_{t_3}-N_{t_2}=k_3) = P_0(N_{t_1}=k_1,\,N_{t_2}=k_1+k_2,\,N_{t_3}=k_1+k_2+k_3)$$
$$= \mathbf{P}_{t_1}(0,k_1)\,\mathbf{P}_{t_2-t_1}(k_1,k_1+k_2)\,\mathbf{P}_{t_3-t_2}(k_1+k_2,k_1+k_2+k_3)$$
$$= e^{-\lambda t_1}\frac{(\lambda t_1)^{k_1}}{k_1!}\cdot e^{-\lambda(t_2-t_1)}\frac{(\lambda(t_2-t_1))^{k_2}}{k_2!}\cdot e^{-\lambda(t_3-t_2)}\frac{(\lambda(t_3-t_2))^{k_3}}{k_3!},$$
which suffices to verify all the assertions in the proposition.
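Anticipating the jump-hold description of Corollary 14.22, a Poisson process can be simulated by accumulating i.i.d. $E(\lambda)$ sojourn times; the sketch below checks that $N_t$ then has mean and variance $\lambda t$ ($\lambda = 2$ and $t = 3$ are arbitrary choices).

```python
import random

def poisson_count(lam, t, rng):
    # Build the Poisson process from i.i.d. E(lam) sojourn times and
    # return N_t, the number of jumps that occur by time t.
    total, n = 0.0, 0
    while True:
        total += rng.expovariate(lam)
        if total > t:
            return n
        n += 1

lam, t = 2.0, 3.0
rng = random.Random(7)
counts = [poisson_count(lam, t, rng) for _ in range(20_000)]
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
print(mean, var)  # both should be close to lam * t = 6
```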
Exercise 14.5 (Look at). Suppose that $S = \{1,2,\dots,n\}$ and $A$ is a matrix such that $A(i,j)\ge 0$ for $i\ne j$ and $\sum_{j=1}^n A(i,j) = 0$ for all $i$. Show
\[
P_t = e^{tA} := \sum_{n=0}^{\infty}\frac{t^n}{n!}A^n \tag{14.15}
\]
is a time homogeneous Markov kernel.
Hints: 1. To show $P_t(i,j)\ge 0$ for all $t\ge 0$ and $i,j\in S$, write $P_t = e^{-t\lambda}e^{t(\lambda I+A)}$, where $\lambda>0$ is chosen so that $\lambda I + A$ has only non-negative entries. 2. To show $\sum_{j\in S}P_t(i,j) = 1$, compute $\frac{d}{dt}P_t\mathbf{1}$.
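A quick numerical experiment supports the two claims in the exercise. The generator below is a hypothetical $3\times 3$ choice (off-diagonal entries non-negative, row sums zero), and $e^{tA}$ is approximated by truncating the series (14.15), so this is a sketch rather than a proof:

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(A, t, terms=60):
    """Truncated series e^{tA} = sum_n t^n A^n / n!  (fine for small matrices)."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]         # holds t^k A^k / k!
    for k in range(1, terms):
        term = mat_mul(term, A)
        term = [[t * x / k for x in row] for row in term]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# hypothetical generator: off-diagonal >= 0, every row sums to 0
A = [[-3.0, 1.0, 2.0], [0.0, 0.0, 0.0], [2.0, 7.0, -9.0]]
Pt = expm(A, 0.5)
assert all(p >= -1e-12 for row in Pt for p in row)   # non-negative up to roundoff
assert all(abs(sum(row) - 1.0) < 1e-9 for row in Pt) # row sums are one
```

The row-sum check reflects Hint 2: since $A\mathbf{1} = 0$, every series term beyond $n = 0$ annihilates the constant vector.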
Theorem 14.16 (Feynman-Kac Formula). Continue the notation in Exercise 14.5 and let $\left(\Omega, \{\mathcal{F}_t\}_{t\ge 0}, \mathbb{P}_x, \{X_t\}_{t\ge 0}\right)$ be a time homogeneous Markov process (assumed to be right continuous) with transition kernels $\{P_t\}_{t\ge 0}$. Given $V:S\to\mathbb{R}$, let $T_t := T_t^V$ be defined by
\[
(T_t g)(x) = \mathbb{E}_x\left[\exp\left(\int_0^t V(X_s)\,ds\right) g(X_t)\right] \tag{14.16}
\]
for all $g:S\to\mathbb{R}$. Then $T_t$ satisfies
\[
\frac{d}{dt}T_t = T_t(A + M_V)\quad\text{with } T_0 = I, \tag{14.17}
\]
Page: 129 job: 285notes macro: svmonob.cls date/time: 5-Jun-2015/8:11
130 14 Discrete State Space/Continuous Time
where $M_V g := Vg$ for all $g:S\to\mathbb{R}$, i.e. $M_V$ is the diagonal matrix with $V(1),\dots,V(n)$ placed on the diagonal. We may summarize this result as
\[
\mathbb{E}_x\left[\exp\left(\int_0^t V(X_s)\,ds\right) g(X_t)\right] = \left(e^{t(A+M_V)}g\right)(x). \tag{14.18}
\]
Proof. To see what is going on, let us first assume that $\frac{d}{dt}T_t g$ exists, in which case we may compute it as $\frac{d}{dt}(T_t g)(x) = \frac{d}{dh}\big|_{0^+}(T_{t+h}g)(x)$. Then by the chain rule, the fundamental theorem of calculus, and the Markov property, we find
\begin{align*}
\frac{d}{dt}(T_t g)(x)
&= \frac{d}{dh}\Big|_{0^+}\mathbb{E}_x\left[\exp\left(\int_0^{t+h}V(X_s)\,ds\right)g(X_t)\right]
+ \frac{d}{dh}\Big|_{0^+}\mathbb{E}_x\left[\exp\left(\int_0^{t}V(X_s)\,ds\right)g(X_{t+h})\right]\\
&= \mathbb{E}_x\left[\frac{d}{dh}\Big|_{0^+}\exp\left(\int_0^{t+h}V(X_s)\,ds\right)g(X_t)\right]
+ \frac{d}{dh}\Big|_{0^+}\mathbb{E}_x\left[\exp\left(\int_0^{t}V(X_s)\,ds\right)\left(e^{hA}g\right)(X_t)\right]\\
&= \mathbb{E}_x\left[\exp\left(\int_0^{t}V(X_s)\,ds\right)V(X_t)\,g(X_t)\right]
+ \mathbb{E}_x\left[\exp\left(\int_0^{t}V(X_s)\,ds\right)(Ag)(X_t)\right]\\
&= T_t(Vg)(x) + T_t(Ag)(x),
\end{align*}
which gives Eq. (14.17). [It should be clear that $(T_0 g)(x) = g(x)$.]

We now give a rigorous proof. For $0\le\tau\le t<\infty$, let $Z_{\tau,t} := \exp\left(\int_\tau^t V(X_s)\,ds\right)$ and let $Z_t := Z_{0,t}$. For $h>0$,
\begin{align*}
(T_{t+h}g)(x) &= \mathbb{E}_x\left[Z_{t+h}\,g(X_{t+h})\right] = \mathbb{E}_x\left[Z_t Z_{t,t+h}\,g(X_{t+h})\right]\\
&= \mathbb{E}_x\left[Z_t\, g(X_{t+h})\right] + \mathbb{E}_x\left[Z_t\left[Z_{t,t+h}-1\right]g(X_{t+h})\right]\\
&= \mathbb{E}_x\left[Z_t\left(e^{hA}g\right)(X_t)\right] + \mathbb{E}_x\left[Z_t\left[Z_{t,t+h}-1\right]g(X_{t+h})\right].
\end{align*}
Therefore,
\[
\frac{(T_{t+h}g)(x) - (T_t g)(x)}{h}
= \mathbb{E}_x\left[Z_t\,\frac{\left(e^{hA}g\right)(X_t) - g(X_t)}{h}\right]
+ \mathbb{E}_x\left[Z_t\left[\frac{Z_{t,t+h}-1}{h}\right]g(X_{t+h})\right],
\]
and then letting $h\downarrow 0$ in this equation implies
\[
\frac{d}{dh}\Big|_{0^+}(T_{t+h}g)(x) = \mathbb{E}_x\left[Z_t(Ag)(X_t)\right] + \mathbb{E}_x\left[Z_t V(X_t)\,g(X_t)\right].
\]
This shows that $T_t$ is one-sided differentiable, and this one-sided derivative is given as in Eq. (14.17).
On the other hand, for $s,t>0$, using the Markov property,
\begin{align*}
T_{t+s}g(x) &= \mathbb{E}_x\left[Z_{t+s}\,g(X_{t+s})\right] = \mathbb{E}_x\left[Z_t Z_{t,t+s}\,g(X_{t+s})\right]\\
&= \mathbb{E}_x\left[Z_t\,\mathbb{E}_{\mathcal{F}_t}\left(Z_{t,t+s}\,g(X_{t+s})\right)\right] = \mathbb{E}_x\left[Z_t\,\mathbb{E}_{X_t}\left(Z_{0,s}\,g(X_s)\right)\right]\\
&= (T_t T_s g)(x),
\end{align*}
i.e. $\{T_t\}_{t>0}$ still has the semigroup property. So for $h>0$,
\[
T_t - T_{t-h} = T_{t-h}T_h - T_{t-h} = T_{t-h}(T_h - I),
\]
and hence
\[
\frac{T_t - T_{t-h}}{h} = T_{t-h}\,\frac{T_h - I}{h} \to T_t(A + M_V)\ \text{ as } h\downarrow 0,
\]
using that $T_t$ is continuous in $t$ together with the result we have already proved. This shows $T_t$ is differentiable in $t$ and Eq. (14.17) is valid.
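For a finite state space, Eq. (14.17) is a statement about matrices and can be checked by finite differences. The 2-state generator and potential below are hypothetical test data, $e^{tB}$ is computed by a truncated series, and this is only a numerical sketch of the identity $\frac{d}{dt}e^{t(A+M_V)} = e^{t(A+M_V)}(A+M_V)$:

```python
# hypothetical 2-state generator A and potential V; B = A + M_V as a plain matrix
A = [[-1.0, 1.0], [1.0, -1.0]]
V = [0.5, -0.25]
B = [[A[i][j] + (V[i] if i == j else 0.0) for j in range(2)] for i in range(2)]

def expm2(M, t, terms=40):
    """Truncated power series for e^{tM}, adequate for this small 2x2 example."""
    R = [[1.0, 0.0], [0.0, 1.0]]
    T = [[1.0, 0.0], [0.0, 1.0]]          # holds t^k M^k / k!
    for k in range(1, terms):
        T = [[t / k * sum(T[i][m] * M[m][j] for m in range(2)) for j in range(2)]
             for i in range(2)]
        R = [[R[i][j] + T[i][j] for j in range(2)] for i in range(2)]
    return R

t, h = 0.8, 1e-6
Tt, Tth = expm2(B, t), expm2(B, t + h)
TtB = [[sum(Tt[i][m] * B[m][j] for m in range(2)) for j in range(2)] for i in range(2)]
for i in range(2):
    for j in range(2):
        # (T_{t+h} - T_t)/h should approximate T_t (A + M_V) entrywise
        assert abs((Tth[i][j] - Tt[i][j]) / h - TtB[i][j]) < 1e-4
```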
14.3 Jump Hold Description

In this section we wish to give another description of the continuous time Markov chain directly associated to an infinitesimal generator $A$; see Corollary 14.22 below. The validity of this description is a consequence of the strong Markov property (see Fact 14.21) along with Theorem 14.19 below.
Definition 14.17. Given a Markov-semigroup infinitesimal generator, $A$, on $S$, let $\tilde{A}$ be the matrix indexed by $S$ defined by
\[
\tilde{A}(x,y) :=
\begin{cases}
A(x,y)/a_x & \text{if } x\ne y \text{ and } a_x\ne 0,\\
0 & \text{if } x = y \text{ and } a_x\ne 0,\\
0 & \text{if } x\ne y \text{ and } a_x = 0,\\
1 & \text{if } x = y \text{ and } a_x = 0.
\end{cases} \tag{14.19}
\]
Recall that
\[
a_x := -A(x,x) = \sum_{y:\,y\ne x} A(x,y).
\]
Notice that $\tilde{A}$ is itself a Markov matrix for a non-lazy chain provided $a_x\ne 0$ for all $x\in S$. We will typically denote the associated chain by $\{Y_n\}_{n=0}^\infty$. Perhaps it is worth remarking that if $x\in S$ is a point where $a_x = 0$, then $\pi_t(y) := e^{tA}(x,y) = \left[\delta_x e^{tA}\right](y)$ satisfies
\[
\dot{\pi}_t(y) = \left[\delta_x A e^{tA}\right](y) = \sum_{z\in S} A(x,z)\,e^{tA}(z,y) = 0,
\]
so that $\pi_t(y) = \pi_0(y) = \delta_x(y)$. This shows that $e^{tA}(x,y) = \delta_x(y)$ for all $t\ge 0$, and so if the chain starts at $x$ then it stays at $x$ when $a_x = 0$, which explains the last two rows in Eq. (14.19).
Example 14.18. If, with rows and columns indexed by $1,2,3$,
\[
A = \begin{pmatrix} -3 & 1 & 2\\ 0 & 0 & 0\\ 2 & 7 & -9 \end{pmatrix},
\quad\text{then}\quad
\tilde{A} = \begin{pmatrix} 0 & \tfrac{1}{3} & \tfrac{2}{3}\\[2pt] 0 & 1 & 0\\[2pt] \tfrac{2}{9} & \tfrac{7}{9} & 0 \end{pmatrix}.
\]
Theorem 14.19 (Jump distributions). Suppose that $A$ is a Markov infinitesimal generator and $x\in S$ is a non-absorbing state ($a_x>0$). Then for all $y\ne x$ and $t>0$,
\[
\mathbb{P}_x\left(X_{J_1} = y,\ J_1 > t\right) = e^{-t a_x}\,\frac{A(x,y)}{a_x}. \tag{14.20}
\]
In other words, given $X_0 = x$, $J_1\overset{d}{=}E(a_x)$ and $X_{J_1}$ (the first jump location) is independent of $J_1$ with
\[
\mathbb{P}_x\left(X_{J_1} = y\right) = \frac{A(x,y)}{a_x}.
\]
Proof. (We will give a bit of an informal proof here; the reader will need to consult the references for a fully rigorous proof.) Let $J_1 = \inf\{\tau>0 : X_\tau\ne X_0\}$. Referring to Figure 14.3, we then have,\footnote{Although we do not know where in $[t,t+dt]$ the jump time $J_1$ is, we do expect, with relatively high likelihood (relative to $dt$), that there will be at most one jump in this infinitesimally small interval, and hence we should have $X_{t+dt} = X_{J_1}$ when $J_1\in[t,t+dt]$.} with $\mathcal{P}_n = \left\{\frac{k}{n}\tau : 1\le k\le n\right\}$, that
\begin{align*}
\mathbb{P}_x\left(J_1\in[\tau,\tau+d\tau],\ X_{J_1}=y\right)
&= \mathbb{P}_x\left(J_1\in[\tau,\tau+d\tau],\ X_{\tau+d\tau}=y\right) + o(d\tau)\\
&= \lim_{n\to\infty}\mathbb{P}_x\left(\{X_s = x\}_{s\in\mathcal{P}_n},\ X_{\tau+d\tau}=y\right) + o(d\tau)\\
&= \lim_{n\to\infty}\left[P_{\tau/n}(x,x)\right]^n\cdot P_{d\tau}(x,y) + o(d\tau)\\
&= \lim_{n\to\infty}\left[1 - \frac{\tau}{n}\,a_x + o\!\left(\frac{1}{n}\right)\right]^n\cdot A(x,y)\,d\tau + o(d\tau)\\
&= e^{-\tau a_x}A(x,y)\,d\tau + o(d\tau).
\end{align*}
Integrating this equation over $\tau > t$ then shows
\[
\mathbb{P}_x\left(J_1 > t,\ X_{J_1} = y\right) = \int_t^\infty e^{-\tau a_x}A(x,y)\,d\tau = \frac{A(x,y)}{a_x}\,e^{-t a_x}.
\]
As a check, taking $t = 0$ above shows
\[
\mathbb{P}_x\left(X_{J_1} = y\right) = \frac{A(x,y)}{a_x},
\]
and summing this equation on $y\in S\setminus\{x\}$ then gives the required identity,
\[
\sum_{y\ne x}\mathbb{P}_x\left(X_{J_1} = y\right) = \sum_{y\ne x}\frac{A(x,y)}{a_x} = 1.
\]
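Theorem 14.19 is easy to test by Monte Carlo. The sketch below uses the generator of Example 14.18 (states re-indexed $0,1,2$), starts at the non-absorbing state $x = 0$, and checks that the mean holding time is $1/a_x$ and that the first jump lands at state $1$ with probability $A(x,1)/a_x = 1/3$:

```python
import random

random.seed(0)
A = [[-3.0, 1.0, 2.0], [0.0, 0.0, 0.0], [2.0, 7.0, -9.0]]   # Example 14.18
x = 0
a_x = -A[x][x]

N = 200_000
hold_sum, hits_state1 = 0.0, 0
for _ in range(N):
    hold_sum += random.expovariate(a_x)          # J_1 given X_0 = x is E(a_x)
    # first jump location: y != x with probability A(x, y)/a_x
    u, acc = random.random(), 0.0
    for y in range(3):
        if y == x:
            continue
        acc += A[x][y] / a_x
        if u < acc:
            hits_state1 += (y == 1)
            break

assert abs(hold_sum / N - 1 / a_x) < 0.01          # E[J_1] = 1/a_x = 1/3
assert abs(hits_state1 / N - A[x][1] / a_x) < 0.01 # P(X_{J_1} = 1) = 1/3
```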
Definition 14.20 (Informal). A stopping time, $T$, for $\{X_t\}$, is a random variable with the property that the event $\{T\le t\}$ is determined from the knowledge of $\{X_s : 0\le s\le t\}$. Alternatively put, for each $t\ge 0$ there is a functional, $f_t$, such that
\[
1_{T\le t} = f_t\left(X_s : 0\le s\le t\right).
\]
As in the discrete state space setting, the first time the chain hits some subset of states, $A\subset S$, is a typical example of a stopping time, whereas the last time the chain hits a set $A\subset S$ is typically not a stopping time. Similarly to the discrete time setting, the Markov property leads to a strong form of forgetfulness of the chain. This property is again called the strong Markov property, which we take for granted here.
Fact 14.21 (Strong Markov Property) If $\{X_t\}_{t\ge 0}$ is a Markov chain, $T$ is a stopping time, and $j\in S$, then, conditioned on $\{T<\infty\}$ and $\{X_T = j\}$,
\[
\{X_s : 0\le s\le T\}\ \text{ and }\ \{X_{t+T} : t\ge 0\}\ \text{ are independent},
\]
and $\{X_{t+T} : t\ge 0\}$ has the same distribution as $\{X_t\}_{t\ge 0}$ under $\mathbb{P}_j$.
Using Theorem 14.19 along with the strong Markov property leads to the following jump-hold description of the continuous time Markov chain, $\{X_t\}_{t\ge 0}$, associated to $A$.

Corollary 14.22. Let $A$ be the infinitesimal generator of a Markov semigroup $P_t$. Then the Markov chain, $\{X_t\}$, associated to $P_t$ may be described as follows. Let $\{Y_k\}_{k=0}^\infty$ denote the discrete time Markov chain with Markov matrix $\tilde{A}$ as in Eq. (14.19). Let $\{S_j\}_{j=0}^\infty$ be random times such that, given $\{Y_j = x_j : j\le n\}$, $S_j\overset{d}{=}E(a_{x_j})$ and the $\{S_j\}_{j=0}^n$ are independent for $0\le j\le n$.\footnote{See footnote 3 below.} Now let $N_t = \max\{j : S_0+\dots+S_{j-1}\le t\}$ (see Figure 14.3) and $X_t := Y_{N_t}$. Then $\{X_t\}_{t\ge 0}$ is the Markov process starting at $x$ with Markov semigroup $P_t = e^{tA}$.

Put another way, if $\{T_n\}_{n=0}^\infty$ are i.i.d. exponential random times with $T_n\overset{d}{=}E(1)$ for all $n$, then
\[
\{X_{J_n}, S_n\}_{n=0}^\infty \overset{d}{=} \left\{Y_n,\ \frac{T_n}{a_{Y_n}}\right\}_{n=0}^\infty \tag{14.21}
\]
provided $Y_0\overset{d}{=}X_0$.
Proof. First proof. Assuming the process $X$ starts at some $x\in S$, then: it stays at $x$ for an $E(a_x)$-distributed amount of time, $S_1$, then jumps to $x_1$ with probability $\tilde{A}(x,x_1)$; it stays at $x_1$ for an $E(a_{x_1})$ amount of time, $S_2$, independent of $S_1$, and then jumps to $x_2$ with probability $\tilde{A}(x_1,x_2)$; it stays at $x_2$ for an $E(a_{x_2})$ amount of time, $S_3$, independent of $S_1$ and $S_2$, and then jumps to $x_3$ with probability $\tilde{A}(x_2,x_3)$; and so on, where one should interpret $T\overset{d}{=}E(0)$ to mean that $T = \infty$.
Second proof. The argument in the proof of Theorem 14.19 easily generalizes to multiple jump times. For example, for $0\le\tau<u<\infty$ and $y,z\in S$ we find
\begin{align*}
&\mathbb{P}_x\left(J_1\in[\tau,\tau+d\tau],\ J_2\in[u,u+du],\ X_{J_1}=y,\ X_{J_2}=z\right)\\
&\cong \mathbb{P}_x\left(J_1\in[\tau,\tau+d\tau],\ J_2\in[u,u+du],\ X_{\tau+d\tau}=y,\ X_{u+du}=z\right)\\
&\cong \lim_{n\to\infty}\mathbb{P}_x\left(\left\{X_{\frac{k}{n}\tau}=x,\ X_{\tau+d\tau+\frac{k}{n}(u-(\tau+d\tau))}=y\right\}_{1\le k\le n},\ X_{\tau+d\tau}=y,\ X_{u+du}=z\right)\\
&\cong \lim_{n\to\infty}\left[P_{\tau/n}(x,x)\right]^n\cdot P_{d\tau}(x,y)\cdot\left[P_{(u-\tau)/n}(y,y)\right]^n\cdot P_{du}(y,z)\\
&\cong \lim_{n\to\infty}\left(1-\frac{\tau}{n}a_x+o\!\left(\frac{1}{n}\right)\right)^n\left(1-\frac{u-\tau}{n}a_y+o\!\left(\frac{1}{n}\right)\right)^n\cdot P_{d\tau}(x,y)\,P_{du}(y,z)\\
&= e^{-\tau a_x}e^{-(u-\tau)a_y}A(x,y)\,d\tau\,A(y,z)\,du + o\left((d\tau,du)\right),
\end{align*}
\footnote{A concrete way to choose the $\{S_j\}_{j=0}^\infty$ is as follows. Given a sequence, $\{T_j\}_{j=0}^\infty$, of i.i.d. $\exp(1)$ random variables which are independent of $\{Y_n\}_{n=0}^\infty$, define $S_j := T_j/a_{Y_j}$.}
from which we conclude that
\begin{align*}
&\mathbb{E}_x\left[F\left(J_1,J_2,X_0,X_{J_1},X_{J_2}\right)\right]\\
&= \sum_{y\ne x,\ z\ne y}\int_0^\infty d\tau\int_\tau^\infty du\; F(\tau,u,x,y,z)\, e^{-\tau a_x}e^{-(u-\tau)a_y}A(x,y)\,A(y,z)\\
&= \sum_{y\ne x,\ z\ne y}\int_0^\infty d\tau\int_\tau^\infty du\; F(\tau,u,x,y,z)\, a_x e^{-\tau a_x}\, a_y e^{-(u-\tau)a_y}\,\frac{A(x,y)}{a_x}\,\frac{A(y,z)}{a_y}\\
&= \sum_{y\ne x,\ z\ne y}\int_0^\infty d\tau\int_0^\infty dv\; F(\tau,v+\tau,x,y,z)\, a_x e^{-\tau a_x}\, a_y e^{-v a_y}\,\frac{A(x,y)}{a_x}\,\frac{A(y,z)}{a_y},
\end{align*}
where we should take $A(x,y)/a_x = 0$ if $a_x = 0$. If $T_1$ and $T_2$ are independent exponential random variables with parameter 1 and $\{Y_n\}$ is the Markov chain with transition matrix $\tilde{A}$, then
\[
\mathbb{E}_x\left[F\left(J_1,J_2,X_0,X_{J_1},X_{J_2}\right)\right]
= \mathbb{E}_x\left[F\left(\frac{T_1}{a_{Y_0}},\ \frac{T_1}{a_{Y_0}}+\frac{T_2}{a_{Y_1}},\ Y_0, Y_1, Y_2\right)\right].
\]
In other words,
\[
\left(J_1,J_2,X_0,X_{J_1},X_{J_2}\right) \overset{d}{=} \left(\frac{T_1}{a_{Y_0}},\ \frac{T_1}{a_{Y_0}}+\frac{T_2}{a_{Y_1}},\ Y_0, Y_1, Y_2\right).
\]
This has a clear generalization to any number of times. The proof of Eq. (14.21) is that, on one hand,
\[
\mathbb{P}\left(\{X_{J_n} = x_n,\ S_n > s_n\}_{n=0}^N\right)
= \mathbb{P}(X_0 = x_0)\; e^{-a_{x_0}s_0}\prod_{n=1}^N\left[\tilde{A}(x_{n-1},x_n)\, e^{-a_{x_n}s_n}\right],
\]
while on the other hand,
\begin{align*}
\mathbb{P}\left(\{Y_n = x_n,\ T_n/a_{Y_n} > s_n\}_{n=0}^N\right)
&= \mathbb{P}\left(\{Y_n = x_n,\ T_n > a_{x_n}s_n\}_{n=0}^N\right)\\
&= \mathbb{P}\left(\{Y_n = x_n\}_{n=0}^N\right)\cdot\mathbb{P}\left(\{T_n > a_{x_n}s_n\}_{n=0}^N\right)\\
&= \mathbb{P}(Y_0 = x_0)\prod_{n=1}^N\tilde{A}(x_{n-1},x_n)\cdot\prod_{n=0}^N e^{-a_{x_n}s_n}.
\end{align*}
Thus we have shown
\[
\mathbb{P}\left(\{X_{J_n} = x_n,\ S_n > s_n\}_{n=0}^N\right)
= \mathbb{P}\left(\{Y_n = x_n,\ T_n/a_{Y_n} > s_n\}_{n=0}^N\right)
\]
provided $\mathbb{P}(X_0 = x_0) = \mathbb{P}(Y_0 = x_0)$.
It is possible to show that the description in Corollary 14.22 defines a Markov process with the correct semigroup, $P_t$. For the details the reader is referred to Norris [13, Theorems 2.8.2 and 2.8.4].

Here is one more interpretation of the jump-hold or sojourn description of our continuous time Markov chains. When we arrive at a site $x\in S$ we start a collection of independent exponential random clocks, $\{T_y\}_{y\in S\setminus\{x\}}$, with $T_y\overset{d}{=}E(A(x,y))$, where we interpret $T_y = \infty$ if $A(x,y) = 0$. If the first clock to ring is the $y$ clock, then we jump to $y$. It is fairly simple to check that $J = \min\{T_z : z\ne x\}\overset{d}{=}E(a_x)$ and that the probability that the $y$ clock rings first is precisely $A(x,y)/a_x$. At the next site in the chain, we restart the process all over again.
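The clock-race interpretation can be tested directly by simulation. The two clock rates below are hypothetical values standing in for $A(x,y)$ and $A(x,z)$; the check is that the minimum of the clocks has mean $1/a_x$ and that the $y$ clock wins with probability $A(x,y)/a_x$:

```python
import random

random.seed(1)
rates = {"y": 1.0, "z": 2.0}          # hypothetical rates A(x,y), A(x,z)
a_x = sum(rates.values())

N = 200_000
first_is_y, j_sum = 0, 0.0
for _ in range(N):
    clocks = {s: random.expovariate(r) for s, r in rates.items()}
    winner = min(clocks, key=clocks.get)   # the first clock to ring
    first_is_y += (winner == "y")
    j_sum += min(clocks.values())          # J = min of the clocks

assert abs(j_sum / N - 1 / a_x) < 0.01               # J ~ E(a_x), mean 1/3
assert abs(first_is_y / N - rates["y"] / a_x) < 0.01 # P(y rings first) = 1/3
```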
The following theorem summarizes the "first jump" analysis in this continuous time context.

Theorem 14.23 (First Step Analysis). Let $J_0 = \inf\{t>0 : X_t\ne X_0\}$; then the strong Markov property in this context gives
\begin{align*}
\mathbb{E}_x[F(X)] &= \mathbb{E}_x\left[\mathbb{E}_x\left[F(X)\,|\,\mathcal{F}_{J_0}\right]\right] = \mathbb{E}_x\left[G(J_0, X_{J_0})\right]\\
&= \sum_{y\ne x}\int_0^\infty a_x e^{-a_x t}\, G(t,y)\,dt\cdot\frac{A(x,y)}{a_x}\\
&= \sum_{y\ne x}\int_0^\infty e^{-a_x t}A(x,y)\,G(t,y)\,dt,
\end{align*}
where
\[
G(t,y) = \mathbb{E}_y\left[F\left(x 1_{[0,t)} * X\right)\right]
\quad\text{and}\quad
\left[x 1_{[0,t)} * X\right]_s =
\begin{cases}
x & \text{if } 0\le s<t,\\
X_{s-t} & \text{if } s\ge t.
\end{cases}
\]
Example 14.24 (Expected Hitting Times). For example, if $F(X) = T_B(X)$ is the first hitting time of $B$ and we start at $x\in A := S\setminus B$, then for $y\in B$,
\[
G(t,y) = \mathbb{E}_y\left[T_B\left(x1_{[0,t)}*X\right)\right] = t,
\]
while for $y\in A$,
\[
G(t,y) = \mathbb{E}_y\left[T_B\left(x1_{[0,t)}*X\right)\right] = t + \mathbb{E}_y[T_B(X)].
\]
Thus if we let $w(x) := \mathbb{E}_x[T_B(X)]$, we find
\begin{align*}
w(x) &= \sum_{y\ne x}\int_0^\infty e^{-a_x t}A(x,y)\,t\,dt + \sum_{y\ne x,\ y\in A}\int_0^\infty e^{-a_x t}A(x,y)\,w(y)\,dt\\
&= \sum_{y\ne x}\frac{A(x,y)}{a_x^2} + \sum_{y\ne x,\ y\in A}\frac{1}{a_x}A(x,y)\,w(y).
\end{align*}
Since $\sum_{y\ne x}A(x,y) = a_x$, we have shown
\[
a_x w(x) = 1 + \sum_{y\ne x,\ y\in A} A(x,y)\,w(y).
\]
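The hitting-time equations can be checked against a jump-hold simulation on a small chain. The chain below is a hypothetical 3-state example with $B = \{3\}$: from state 1 the chain jumps to 2 at rate 1, and from state 2 it jumps to 1 or 3, each at rate 1. The equations $1\cdot w(1) = 1 + w(2)$ and $2\cdot w(2) = 1 + w(1)$ give $w(2) = 2$ and $w(1) = 3$:

```python
import random

random.seed(2)

def hit_time_of_3():
    """Jump-hold simulation of the hypothetical chain until it reaches B = {3}."""
    x, t = 1, 0.0
    while x != 3:
        if x == 1:
            t += random.expovariate(1.0)   # a_1 = 1; from 1 the chain must go to 2
            x = 2
        else:
            t += random.expovariate(2.0)   # a_2 = 2; jump to 1 or 3 with prob 1/2 each
            x = 1 if random.random() < 0.5 else 3
    return t

N = 100_000
est = sum(hit_time_of_3() for _ in range(N)) / N
# First-step equations: w(1) = 1 + w(2), 2 w(2) = 1 + w(1)  =>  w(1) = 3.
assert abs(est - 3.0) < 0.1
```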
14.4 Jump Hold Construction of the Poisson Process

Let $\{T_k\}_{k=1}^\infty$ be an i.i.d. sequence of exponential random times with parameter $\lambda$, i.e. $\mathbb{P}(T_k\in[t,t+dt]) = \lambda e^{-\lambda t}dt$. For each $n\in\mathbb{N}$ let $W_n := T_1+\dots+T_n$ be the "waiting time" for the $n$th event to occur. Because of Theorem 14.3 we know that $\lim_{n\to\infty}W_n = \infty$ a.s. [Sorry for the change of notation here! To match with the above section you should identify $T_k$ with $S_{k-1}$ and $W_n$ with $J_n$.]

Definition 14.25 (Poisson Process I). For any subset $A\subset\mathbb{R}_+$, let $N(A) := \sum_{n=1}^\infty 1_A(W_n)$ count the number of waiting times which occurred in $A$. When $A = (0,t]$ we will write $N_t := N((0,t])$ for all $t\ge 0$ and refer to $\{N_t\}_{t\ge 0}$ as the Poisson process with intensity $\lambda$.
The next few results summarize a number of the basic properties of this Poisson process. Many of the proofs will be left as exercises to the reader. We will use the following notation below: for each $n\in\mathbb{N}$ and $T\ge 0$ let
\[
\Delta_n(T) := \left\{(w_1,\dots,w_n)\in\mathbb{R}^n : 0<w_1<w_2<\dots<w_n<T\right\}
\]
and let
\[
\Delta_n := \cup_{T>0}\Delta_n(T) = \left\{(w_1,\dots,w_n)\in\mathbb{R}^n : 0<w_1<w_2<\dots<w_n<\infty\right\}.
\]
In the exercises below it will be very useful to observe that
\[
\{N_t = n\} = \{W_n\le t<W_{n+1}\}.
\]
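The jump-hold construction itself is a two-line simulation: accumulate i.i.d. exponential waiting times and count how many land in $(0,T]$. The intensity and horizon below are arbitrary test values; the check $\mathbb{E}[N_T] = \lambda T$ anticipates Exercise 14.8:

```python
import random

random.seed(3)
LAM, T = 2.0, 5.0            # arbitrary intensity and horizon

def N_T():
    """Count arrivals in (0, T]: N_T = max{n : W_n <= T}, W_n = T_1 + ... + T_n."""
    w, n = 0.0, 0
    while True:
        w += random.expovariate(LAM)
        if w > T:
            return n
        n += 1

trials = 50_000
mean = sum(N_T() for _ in range(trials)) / trials
assert abs(mean - LAM * T) < 0.15      # E[N_T] = lambda * T = 10
```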
Exercise 14.6. For $0\le s<t<\infty$, show by induction on $n$ that
\[
a_n(s,t) := \int_{s\le w_1\le w_2\le\dots\le w_n\le t} dw_1\dots dw_n = \frac{(t-s)^n}{n!}.
\]
Exercise 14.7. If $n\in\mathbb{N}$ and $g:\Delta_n\to\mathbb{R}$ is a bounded (or non-negative) function, then
\[
\mathbb{E}\left[g(W_1,\dots,W_n)\right] = \int_{\Delta_n} g(w_1,w_2,\dots,w_n)\,\lambda^n e^{-\lambda w_n}\,dw_1\dots dw_n. \tag{14.22}
\]

As a simple corollary we have the following result, which explains the name given to the Poisson process.

Exercise 14.8. Show $N_t\overset{d}{=}\mathrm{Poi}(\lambda t)$ for all $t>0$.

Exercise 14.9. Show for $0\le s<t<\infty$ that $N_t-N_s\overset{d}{=}\mathrm{Poi}\left(\lambda(t-s)\right)$ and that $N_s$ and $N_t-N_s$ are independent.
Hint: for $m,n\in\mathbb{N}_0$, show using the exercises above that
\[
\mathbb{P}_0\left(N_s = m,\ N_t = m+n\right) = e^{-\lambda s}\frac{(\lambda s)^m}{m!}\cdot e^{-\lambda(t-s)}\frac{\left(\lambda(t-s)\right)^n}{n!}. \tag{14.23}
\]
[This result easily generalizes to show $\{N_t\}_{t\ge 0}$ has independent increments, and using this one easily verifies that $\{N_t\}_{t\ge 0}$ is a Markov process with transition kernel given as in Eq. (14.13).]
14.4.1 More related results*

Corollary 14.26. If $n\in\mathbb{N}$, then $W_n\overset{d}{=}\mathrm{Gamma}\left(n,\lambda^{-1}\right)$.

Proof. Taking $g(w_1,w_2,\dots,w_n) = f(w_n)$ in Eq. (14.22), we find with the aid of Exercise 14.6 that
\[
\mathbb{E}[f(W_n)] = \int_{\Delta_n} f(w_n)\,\lambda^n e^{-\lambda w_n}\,dw_1\dots dw_n
= \int_0^\infty f(w)\,\frac{\lambda^n w^{n-1}}{(n-1)!}\,e^{-\lambda w}\,dw,
\]
which shows that $W_n\overset{d}{=}\mathrm{Gamma}\left(n,\lambda^{-1}\right)$.
Corollary 14.27. If $t\in\mathbb{R}_+$ and $f:\Delta_n(t)\to\mathbb{R}$ is a bounded (or non-negative) measurable function, then
\[
\mathbb{E}\left[f(W_1,\dots,W_n) : N_t = n\right] = \lambda^n e^{-\lambda t}\int_{\Delta_n(t)} f(w_1,w_2,\dots,w_n)\,dw_1\dots dw_n. \tag{14.24}
\]

Proof. Making use of the observation that $\{N_t = n\} = \{W_n\le t<W_{n+1}\}$, we may apply Eq. (14.22) at level $n+1$ with
\[
g(w_1,w_2,\dots,w_{n+1}) = f(w_1,w_2,\dots,w_n)\,1_{w_n\le t<w_{n+1}}
\]
to learn
\begin{align*}
\mathbb{E}\left[f(W_1,\dots,W_n) : N_t = n\right]
&= \int_{0<w_1<\dots<w_n<t<w_{n+1}} f(w_1,\dots,w_n)\,\lambda^{n+1}e^{-\lambda w_{n+1}}\,dw_1\dots dw_n\,dw_{n+1}\\
&= \int_{\Delta_n(t)} f(w_1,\dots,w_n)\,\lambda^n e^{-\lambda t}\,dw_1\dots dw_n.
\end{align*}
Definition 14.28 (Order Statistics). Suppose that $X_1,\dots,X_n$ are non-negative random variables such that $\mathbb{P}(X_i = X_j) = 0$ for all $i\ne j$. The order statistics of $X_1,\dots,X_n$ are the random variables $\tilde{X}_1,\tilde{X}_2,\dots,\tilde{X}_n$ defined by
\[
\tilde{X}_k = \min_{\#(\Lambda)=k}\max\{X_i : i\in\Lambda\}, \tag{14.25}
\]
where $\Lambda$ always denotes a subset of $\{1,2,\dots,n\}$ in Eq. (14.25).

The reader should verify that $\tilde{X}_1\le\tilde{X}_2\le\dots\le\tilde{X}_n$, that $\{X_1,\dots,X_n\} = \{\tilde{X}_1,\tilde{X}_2,\dots,\tilde{X}_n\}$ with repetitions, and that $\tilde{X}_1<\tilde{X}_2<\dots<\tilde{X}_n$ if $X_i\ne X_j$ for all $i\ne j$. In particular, if $\mathbb{P}(X_i = X_j) = 0$ for all $i\ne j$, then $\mathbb{P}\left(\cup_{i\ne j}\{X_i = X_j\}\right) = 0$ and $\tilde{X}_1<\tilde{X}_2<\dots<\tilde{X}_n$ a.s.
Exercise 14.10. Suppose that $X_1,\dots,X_n$ are non-negative\footnote{The non-negativity of the $X_i$ is not really necessary here, but this is all we need to consider.} random variables such that $\mathbb{P}(X_i = X_j) = 0$ for all $i\ne j$. Show:

1. If $f:\Delta_n\to\mathbb{R}$ is bounded (or non-negative) measurable, then
\[
\mathbb{E}\left[f\left(\tilde{X}_1,\dots,\tilde{X}_n\right)\right]
= \sum_{\sigma\in S_n}\mathbb{E}\left[f(X_{\sigma 1},\dots,X_{\sigma n}) : X_{\sigma 1}<X_{\sigma 2}<\dots<X_{\sigma n}\right], \tag{14.26}
\]
where $S_n$ is the permutation group on $\{1,2,\dots,n\}$.
2. If we further assume that $X_1,\dots,X_n$ are i.i.d. random variables, then
\[
\mathbb{E}\left[f\left(\tilde{X}_1,\dots,\tilde{X}_n\right)\right]
= n!\cdot\mathbb{E}\left[f(X_1,\dots,X_n) : X_1<X_2<\dots<X_n\right]. \tag{14.27}
\]
(It is not important that $f\left(\tilde{X}_1,\dots,\tilde{X}_n\right)$ is not defined on the null set $\cup_{i\ne j}\{X_i = X_j\}$.)
3. If $f:\mathbb{R}_+^n\to\mathbb{R}$ is a bounded (or non-negative) measurable symmetric function (i.e. $f(w_{\sigma 1},\dots,w_{\sigma n}) = f(w_1,\dots,w_n)$ for all $\sigma\in S_n$ and $(w_1,\dots,w_n)\in\mathbb{R}_+^n$), then
\[
\mathbb{E}\left[f\left(\tilde{X}_1,\dots,\tilde{X}_n\right)\right] = \mathbb{E}\left[f(X_1,\dots,X_n)\right].
\]
4. Suppose that $Y_1,\dots,Y_n$ is another collection of non-negative random variables such that $\mathbb{P}(Y_i = Y_j) = 0$ for all $i\ne j$ and such that
\[
\mathbb{E}\left[f(X_1,\dots,X_n)\right] = \mathbb{E}\left[f(Y_1,\dots,Y_n)\right]
\]
for all bounded (or non-negative) measurable symmetric functions $f:\mathbb{R}_+^n\to\mathbb{R}$. Show that
\[
\left(\tilde{X}_1,\dots,\tilde{X}_n\right)\overset{d}{=}\left(\tilde{Y}_1,\dots,\tilde{Y}_n\right).
\]
Hint: if $g:\Delta_n\to\mathbb{R}$ is a bounded measurable function, define $f:\mathbb{R}_+^n\to\mathbb{R}$ by
\[
f(y_1,\dots,y_n) = \sum_{\sigma\in S_n} 1_{y_{\sigma 1}<y_{\sigma 2}<\dots<y_{\sigma n}}\; g(y_{\sigma 1},y_{\sigma 2},\dots,y_{\sigma n})
\]
and then show $f$ is symmetric.
Exercise 14.11. Let $t\in\mathbb{R}_+$ and let $\{U_i\}_{i=1}^n$ be i.i.d. random variables uniformly distributed on $[0,t]$. Show that the order statistics, $\left(\tilde{U}_1,\dots,\tilde{U}_n\right)$, of $(U_1,\dots,U_n)$ have the same distribution as $(W_1,\dots,W_n)$ given $\{N_t = n\}$. (Thus, given $\{N_t = n\}$, the collection of points, $\{W_1,\dots,W_n\}$, has the same distribution as the collection of points, $\{U_1,\dots,U_n\}$, in $[0,t]$.)
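The conditional-uniformity claim of Exercise 14.11 can be probed by rejection sampling: simulate arrivals on $(0,T]$, keep only the runs with exactly $n$ arrivals, and compare the mean of the first arrival to the mean of the first order statistic of $n$ uniforms, which is $T/(n+1)$. The parameter choices below are arbitrary:

```python
import random

random.seed(4)
LAM, T, n_target = 1.0, 1.0, 3     # arbitrary test parameters

first_sum, count = 0.0, 0
for _ in range(200_000):
    pts, w = [], 0.0
    while True:                    # sample the waiting times in (0, T]
        w += random.expovariate(LAM)
        if w > T:
            break
        pts.append(w)
    if len(pts) == n_target:       # condition on {N_T = 3}
        first_sum += pts[0]
        count += 1

# The smallest of 3 i.i.d. uniforms on [0, T] has mean T/4.
assert count > 1000
assert abs(first_sum / count - T / 4) < 0.02
```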
Theorem 14.29 (Joint Distributions). If $\{A_i\}_{i=1}^k\subset\mathcal{B}_{[0,t]}$ is a partition of $[0,t]$, then $\{N(A_i)\}_{i=1}^k$ are independent random variables and $N(A)\overset{d}{=}\mathrm{Poi}\left(\lambda m(A)\right)$ for all $A\in\mathcal{B}_{[0,t]}$ with $m(A)<\infty$. In particular, if $0<t_1<t_2<\dots<t_n$, then $\left\{N_{t_i}-N_{t_{i-1}}\right\}_{i=1}^n$ are independent random variables and $N_t-N_s\overset{d}{=}\mathrm{Poi}\left(\lambda(t-s)\right)$ for all $0\le s<t<\infty$. (We say that $\{N_t\}_{t\ge 0}$ is a stochastic process with independent increments.)
Proof. If $z\in\mathbb{C}$ and $A\in\mathcal{B}_{[0,t]}$, then
\[
z^{N(A)} = z^{\sum_{i=1}^n 1_A(W_i)}\quad\text{on }\{N_t = n\}.
\]
Let $n\in\mathbb{N}$, $z_i\in\mathbb{C}$, and define
\[
f(w_1,\dots,w_n) = z_1^{\sum_{i=1}^n 1_{A_1}(w_i)}\cdots z_k^{\sum_{i=1}^n 1_{A_k}(w_i)},
\]
which is a symmetric function. On $\{N_t = n\}$ we have
\[
z_1^{N(A_1)}\cdots z_k^{N(A_k)} = f(W_1,\dots,W_n)
\]
and therefore,
\begin{align*}
\mathbb{E}\left[z_1^{N(A_1)}\cdots z_k^{N(A_k)}\,|\,N_t = n\right]
&= \mathbb{E}\left[f(W_1,\dots,W_n)\,|\,N_t = n\right]\\
&= \mathbb{E}\left[f(U_1,\dots,U_n)\right]\\
&= \mathbb{E}\left[z_1^{\sum_{i=1}^n 1_{A_1}(U_i)}\cdots z_k^{\sum_{i=1}^n 1_{A_k}(U_i)}\right]\\
&= \prod_{i=1}^n\mathbb{E}\left[z_1^{1_{A_1}(U_i)}\cdots z_k^{1_{A_k}(U_i)}\right]
= \left(\mathbb{E}\left[z_1^{1_{A_1}(U_1)}\cdots z_k^{1_{A_k}(U_1)}\right]\right)^n\\
&= \left(\frac{1}{t}\sum_{i=1}^k m(A_i)\,z_i\right)^n,
\end{align*}
wherein we have made use of the fact that $\{A_i\}_{i=1}^k$ is a partition of $[0,t]$, so that
\[
z_1^{1_{A_1}(U_1)}\cdots z_k^{1_{A_k}(U_1)} = \sum_{i=1}^k z_i\, 1_{A_i}(U_1).
\]
Thus it follows that
\begin{align*}
\mathbb{E}\left[z_1^{N(A_1)}\cdots z_k^{N(A_k)}\right]
&= \sum_{n=0}^\infty\mathbb{E}\left[z_1^{N(A_1)}\cdots z_k^{N(A_k)}\,|\,N_t = n\right]\mathbb{P}(N_t = n)\\
&= \sum_{n=0}^\infty\left(\frac{1}{t}\sum_{i=1}^k m(A_i)\,z_i\right)^n\frac{(\lambda t)^n}{n!}e^{-\lambda t}
= \sum_{n=0}^\infty\frac{1}{n!}\left(\lambda\sum_{i=1}^k m(A_i)\,z_i\right)^n e^{-\lambda t}\\
&= \exp\left(\lambda\left[\sum_{i=1}^k m(A_i)\,z_i - t\right]\right)
= \exp\left(\lambda\sum_{i=1}^k m(A_i)\,(z_i-1)\right).
\end{align*}
From this result it follows that $\{N(A_i)\}_{i=1}^k$ are independent random variables and $N(A)\overset{d}{=}\mathrm{Poi}\left(\lambda m(A)\right)$ for all $A\in\mathcal{B}_{[0,t]}$ with $m(A)<\infty$.

Alternatively, suppose that $a_i\in\mathbb{N}_0$ and $n := a_1+\dots+a_k$; then
\begin{align*}
\mathbb{P}\left[N(A_1) = a_1,\dots,N(A_k) = a_k\,|\,N_t = n\right]
&= \mathbb{P}\left[\sum_{i=1}^n 1_{A_l}(U_i) = a_l\ \text{for } 1\le l\le k\right]\\
&= \frac{n!}{a_1!\cdots a_k!}\prod_{l=1}^k\left[\frac{m(A_l)}{t}\right]^{a_l}
= \frac{n!}{t^n}\cdot\prod_{l=1}^k\frac{\left[m(A_l)\right]^{a_l}}{a_l!},
\end{align*}
and therefore,
\begin{align*}
\mathbb{P}\left[N(A_1) = a_1,\dots,N(A_k) = a_k\right]
&= \mathbb{P}\left[N(A_1) = a_1,\dots,N(A_k) = a_k\,|\,N_t = n\right]\cdot\mathbb{P}(N_t = n)\\
&= \frac{n!}{t^n}\cdot\prod_{l=1}^k\frac{\left[m(A_l)\right]^{a_l}}{a_l!}\cdot e^{-\lambda t}\frac{(\lambda t)^n}{n!}
= \prod_{l=1}^k\frac{\left[m(A_l)\right]^{a_l}}{a_l!}\cdot e^{-\lambda t}\lambda^n\\
&= \prod_{l=1}^k\frac{\left[\lambda m(A_l)\right]^{a_l}}{a_l!}\,e^{-\lambda m(A_l)},
\end{align*}
which shows that $\{N(A_l)\}_{l=1}^k$ are independent and that $N(A_l)\overset{d}{=}\mathrm{Poi}\left(\lambda m(A_l)\right)$ for each $l$.
Remark 14.30. If $A\in\mathcal{B}_{[0,\infty)}$ with $m(A) = \infty$, then $N(A) = \infty$ a.s. To prove this, observe that $N(A) = \uparrow\lim_{n\to\infty}N\left(A\cap[0,n]\right)$. Therefore, for any $k\in\mathbb{N}$ we have
\[
\mathbb{P}\left(N(A)\ge k\right)\ge\mathbb{P}\left(N(A\cap[0,n])\ge k\right)
= 1 - e^{-\lambda m(A\cap[0,n])}\sum_{0\le l<k}\frac{\left(\lambda m(A\cap[0,n])\right)^l}{l!}\to 1\ \text{ as } n\to\infty.
\]
This shows that $N(A)\ge k$ a.s. for all $k\in\mathbb{N}$, i.e. $N(A) = \infty$ a.s.
14.5 Long time behavior

In this section, suppose that $\{X_t\}_{t\ge 0}$ is a continuous time Markov chain with infinitesimal generator, $A$, so that
\[
\mathbb{P}\left(X_{t+h} = j\,|\,X_t = i\right) = \delta_{ij} + A(i,j)\,h + o(h).
\]
We further assume that $A$ completely determines the chain.

Definition 14.31. The chain, $\{X_t\}$, is irreducible iff the underlying discrete time jump chain, $\{Y_n\}$, determined by the Markov matrix, $\tilde{A}(i,j) := \frac{A(i,j)}{a_i}1_{i\ne j}$, is irreducible, where as usual $a_i := -A(i,i) = \sum_{j:j\ne i}A(i,j)$.

Remark 14.32. Using the sojourn time description of $X_t$, it is easy to see that $P_t(i,j) = \left(e^{tA}\right)_{ij} > 0$ for all $t>0$ and $i,j\in S$ if $\{X_t\}$ is irreducible. Moreover, if for all $i,j\in S$ we have $P_t(i,j)>0$ for some $t>0$, then the embedded chain $\{Y_n\}$ has a positive probability of hitting $j$ when started at $i$. As $i$ and $j$ were arbitrary, it follows that $\{Y_n\}$ must be irreducible and hence $\{X_t\}$ is irreducible. In short, the following are equivalent:

1. $\{X_t\}$ is irreducible,
2. for all $i,j\in S$, $P_t(i,j)>0$ for some $t>0$, and
3. $P_t(i,j)>0$ for all $t>0$ and $i,j\in S$.

In particular, all irreducible chains are "aperiodic."
The next theorem gives the basic limiting behavior of irreducible Markov chains. Before stating the theorem we need to introduce a little more notation.

Notation 14.33 Let $S_0$ be the time of the first jump of $X_t$ and let
\[
R_i := \min\{t\ge S_0 : X_t = i\}
\]
be the first time the chain hits the site $i$ after the first jump, and set
\[
\pi_i = \frac{1}{a_i\cdot\mathbb{E}_i R_i}\quad\text{where } a_i := -A(i,i).
\]
For the sample path in Figure 14.3, $R_1 = J_2$, $R_2 = J_4$, $R_3 = J_3$, and $R_4 = J_1$.

Theorem 14.34 (Limiting behavior). Let $\{X_t\}$ be an irreducible Markov chain. Then:

1. For all initial starting distributions, $\nu(j) := \mathbb{P}(X(0) = j)$ for $j\in S$, and all $j\in S$,
\[
\lim_{T\to\infty}\frac{1}{T}\int_0^T 1_{X_t = j}\,dt = \pi_j\quad(\mathbb{P}_\nu\text{ a.s.}). \tag{14.28}
\]
In words, the fraction of the time the chain spends at site $j$ is $\pi_j$.
2. $\lim_{t\to\infty}P_t(i,j) = \pi_j$, independent of $i$.
3. $\pi = (\pi_j)_{j\in S}$ is stationary, i.e. $0 = \pi A$, i.e.
\[
\sum_{i\in S}\pi_i A(i,j) = 0\ \text{ for all } j\in S,
\]
which is equivalent to $\pi P_t = \pi$ for all $t$ and to $\mathbb{P}_\pi(X_t = j) = \pi_j$ for all $t>0$ and $j\in S$.
4. If $\pi_i>0$ for some $i\in S$, then $\pi_i>0$ for all $i\in S$ and $\sum_{i\in S}\pi_i = 1$.
5. The $\pi_i$ are all positive iff there exists a solution, $\nu_i\ge 0$, to
\[
\sum_{i\in S}\nu_i A(i,j) = 0\ \text{ for all } j\in S\quad\text{with}\quad\sum_{i\in S}\nu_i = 1.
\]
If such a solution exists, it is unique and $\nu = \pi$.
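Items 1 and 3 of the theorem can be illustrated numerically on a hypothetical irreducible 3-state generator: verify that a candidate $\pi$ solves $\pi A = 0$, then confirm by a jump-hold simulation that the occupation fraction of a state matches $\pi$. (For the matrix below, solving the linear system by hand gives $\pi = (1/3,\,1/4,\,5/12)$.)

```python
import random

random.seed(5)
# hypothetical irreducible generator on S = {0, 1, 2}
A = [[-2.0, 1.0, 1.0], [1.0, -3.0, 2.0], [1.0, 1.0, -2.0]]

pi = (1 / 3, 1 / 4, 5 / 12)
for j in range(3):                       # check pi A = 0 column by column
    assert abs(sum(pi[i] * A[i][j] for i in range(3))) < 1e-12

# long-run fraction of time at state 0, via jump-hold simulation
x, occ, total = 0, 0.0, 0.0
for _ in range(300_000):
    a_x = -A[x][x]
    hold = random.expovariate(a_x)
    if x == 0:
        occ += hold
    total += hold
    u, acc = random.random(), 0.0
    for y in range(3):
        if y != x:
            acc += A[x][y] / a_x         # jump with the embedded chain A~
            if u < acc:
                x = y
                break

assert abs(occ / total - pi[0]) < 0.01   # matches Eq. (14.28)
```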
Proof. We refer the reader to [13, Theorem 3.8.1] for the full proof. Let us make a few comments on the proof, taking for granted that $\lim_{t\to\infty}P_t(i,j) =: \pi_j$ exists. (See part d) below for why the limit is independent of $i$ if it exists.)

a) Suppose that $\nu$ is a stationary distribution, i.e. $\nu P_t = \nu$. Then (by the dominated convergence theorem),
\[
\nu_j = \lim_{t\to\infty}\sum_i\nu_i P_t(i,j) = \sum_i\lim_{t\to\infty}\nu_i P_t(i,j) = \left(\sum_i\nu_i\right)\pi_j = \pi_j.
\]
Thus $\nu_j = \pi_j$. If $\pi_j = 0$ for all $j$, we must conclude there is no stationary distribution.

b) If we are in the finite state setting, the following computation is justified:
\[
\sum_{j\in S}\pi_j P_s(j,k) = \sum_{j\in S}\lim_{t\to\infty}P_t(i,j)\,P_s(j,k) = \lim_{t\to\infty}\sum_{j\in S}P_t(i,j)\,P_s(j,k)
= \lim_{t\to\infty}\left[P_t P_s\right]_{ik} = \lim_{t\to\infty}P_{t+s}(i,k) = \pi_k.
\]
This shows that $\pi P_s = \pi$ for all $s$, and differentiating this equation at $s = 0$ then shows $\pi A = 0$.

c) Let us now explain why
\[
\frac{1}{T}\int_0^T 1_{X_t = j}\,dt\to\frac{1}{a_j\cdot\mathbb{E}_j R_j}. \tag{14.29}
\]
The idea is that, because the chain is irreducible, no matter how we start the chain we will eventually hit the site $j$. Once we hit $j$, the (strong) Markov property implies the chain forgets how it got there and behaves as if it started at $j$. The size of $J_1$ (as long as it is finite) is negligible when computing the limit in Eq. (14.29), and so we may as well assume that the chain starts at $j$.
Now consider one typical cycle of the chain starting at $j$, jumping away at time $S_0$, and then returning to $j$ at time $R_j$. The average first jump time is $\mathbb{E}_j S_0 = 1/a_j$, while the average length of such a cycle is $\mathbb{E}_j R_j$. As the chain repeats this procedure over and over again with the same statistics, we expect (by a law of large numbers) that the average time spent at site $j$ is given by
\[
\frac{\mathbb{E}_j S_0}{\mathbb{E}_j R_j} = \frac{1/a_j}{\mathbb{E}_j R_j} = \frac{1}{a_j\cdot\mathbb{E}_j R_j}.
\]
This argument is made more precise using renewal process techniques, which unfortunately will not be covered in this course.
d) Taking expectations of Eq. (14.28) shows (at least when $\#(S)<\infty$)
\[
\pi_j = \lim_{T\to\infty}\frac{1}{T}\int_0^T\mathbb{P}_\nu(X_t = j)\,dt
= \lim_{T\to\infty}\frac{1}{T}\int_0^T\sum_i\nu_i P_t(i,j)\,dt
= \sum_i\nu_i\lim_{T\to\infty}\frac{1}{T}\int_0^T P_t(i,j)\,dt
= \sum_i\nu_i\lim_{t\to\infty}P_t(i,j)
\]
for any initial distribution $\nu$. Taking $\nu_i = \delta_{k,i}$ for any $k\in S$ we choose shows $\lim_{t\to\infty}P_t(k,j) = \pi_j = \frac{1}{a_j\cdot\mathbb{E}_j R_j}$, independent of $k$.
15 Brownian Motion

We are now going to take $S = \mathbb{R}$, or more generally $\mathbb{R}^d$ for some $d$. Rather than considering all continuous time Markov processes with values in $\mathbb{R}$ (or $\mathbb{R}^d$), we are going to focus our attention on one extremely important example, namely Brownian motion. Before restricting to the Brownian case, let us first discuss a more general class of stochastic processes.
15.1 Stationary and Independent Increment Processes

Definition 15.1. A stochastic process, $\left\{X_t:\Omega\to S := \mathbb{R}^d\right\}_{t\in T}$, has independent increments if for all finite subsets, $\Lambda = \{0\le t_0<t_1<\dots<t_n\}\subset T$, the random variables $\{X_0\}\cup\left\{X_{t_k}-X_{t_{k-1}}\right\}_{k=1}^n$ are independent. We refer to $X_t - X_s$ for $s<t$ as an increment of $X$.

Definition 15.2. A stochastic process, $\left\{X_t:\Omega\to S := \mathbb{R}^d\right\}_{t\in T}$, has stationary increments if for all $0\le s<t<\infty$, the law of $X_t - X_s$ depends only on $t-s$.
Example 15.3. The Poisson process of rate $\lambda$, $\{X_t\}_{t\ge 0}$, has stationary independent increments with $X_t - X_s\overset{d}{=}\mathrm{Poi}\left(\lambda(t-s)\right)$. Also, if we let
\[
\tilde{X}_t := X_t - \mathbb{E}X_t = X_t - \lambda t
\]
be the "centering" of $X_t$, then $\{\tilde{X}_t\}_{t\ge 0}$ is a mean zero process with independent increments. In this case, $\mathbb{E}\tilde{X}_t^2 = \mathrm{Var}(X_t) = \lambda t$.

Lemma 15.4. Suppose that $\{X_t\}_{t\ge 0}$ is a mean zero, square-integrable stochastic process with stationary independent increments. If we further assume that $X_0 = 0$ a.s. and $t\to\mathbb{E}X_t^2$ is continuous, then $\mathbb{E}X_t^2 = \lambda t$ for some $\lambda\ge 0$.
Proof. Let $\varphi(t) := \mathbb{E}X_t^2$ and let $s,t\ge 0$. Then
\begin{align*}
\varphi(t+s) = \mathbb{E}X_{t+s}^2 &= \mathbb{E}\left[(X_{t+s}-X_t)+X_t\right]^2\\
&= \mathbb{E}(X_{t+s}-X_t)^2 + \mathbb{E}X_t^2 + 2\,\mathbb{E}\left[(X_{t+s}-X_t)X_t\right]\\
&= \mathbb{E}X_s^2 + \mathbb{E}X_t^2 + 2\,\mathbb{E}\left[X_{t+s}-X_t\right]\cdot\mathbb{E}X_t\\
&= \varphi(t) + \varphi(s) + 0.
\end{align*}
Now let $\lambda := \varphi(1)$ and use the previous equation to learn, for $n\in\mathbb{N}$, that
\[
\lambda = \varphi(1) = \varphi\Big(\underbrace{\tfrac{1}{n}+\dots+\tfrac{1}{n}}_{n\text{-times}}\Big) = n\cdot\varphi\left(\frac{1}{n}\right),
\]
i.e. $\varphi\left(\frac{1}{n}\right) = \frac{1}{n}\lambda$. Similarly, if $k\in\mathbb{N}$ we may conclude
\[
\varphi\left(\frac{k}{n}\right) = \varphi\Big(\underbrace{\tfrac{1}{n}+\dots+\tfrac{1}{n}}_{k\text{-times}}\Big) = k\,\varphi\left(\frac{1}{n}\right) = \frac{k}{n}\lambda.
\]
Thus $\varphi(t) = \lambda t$ for all $t\in\mathbb{Q}_+$, and then by continuity we must have $\varphi(t) = \lambda t$ for all $t\in\mathbb{R}_+$.
Proposition 15.5. Suppose that $\left\{X_t:\Omega\to S := \mathbb{R}^d\right\}_{t\in T}$ is a stochastic process with stationary and independent increments and let $\mathcal{F}_t := \mathcal{F}_t^X$ for all $t\in T$. Then
\[
\mathbb{E}\left[f(X_t)\,|\,\mathcal{F}_s\right] = (P_{t-s}f)(X_s),\quad\text{where} \tag{15.1}
\]
\[
(P_t f)(x) = \mathbb{E}\left[f(x + X_t - X_0)\right]. \tag{15.2}
\]
Moreover, $\{P_t\}_{t\ge 0}$ satisfies the Chapman-Kolmogorov equation, i.e. $P_t P_s = P_{t+s}$ for all $s,t\ge 0$.

Proof. By assumption, $X_t - X_s$ is independent of $\mathcal{F}_s$, and therefore, by a jazzed up version of Proposition 3.15,
\[
\mathbb{E}\left[f(X_t)\,|\,\mathcal{F}_s\right] = \mathbb{E}\left[f(X_s + X_t - X_s)\,|\,\mathcal{F}_s\right] = G(X_s),
\]
where $G(x) = \mathbb{E}\left[f(x + X_t - X_s)\right]$. Because the increments are stationary, we may replace $X_t - X_s$ in the definition of $G$ above by $X_{t-s} - X_0$, which then proves Eq. (15.1) with $P_t$ as in Eq. (15.2).
To verify the Chapman-Kolmogorov equation, let $\tilde{X}$ denote an independent copy of $X$ (with expectation $\tilde{\mathbb{E}}$). Then
\begin{align*}
(P_t P_s f)(x) &= \mathbb{E}\left[(P_s f)(x+X_t-X_0)\right]
= \mathbb{E}\left[\tilde{\mathbb{E}}f\left(x+X_t-X_0+\tilde{X}_s-\tilde{X}_0\right)\right]\\
&= \mathbb{E}\left[\tilde{\mathbb{E}}f\left(x+X_t-X_0+\tilde{X}_{t+s}-\tilde{X}_t\right)\right]
= \mathbb{E}\left[f\left(x+X_t-X_0+X_{t+s}-X_t\right)\right]\\
&= \mathbb{E}\left[f(x+X_{t+s}-X_0)\right] = (P_{t+s}f)(x),
\end{align*}
where in the second-to-last equality we used that $X_{t+s}-X_t$ is independent of $X_t-X_0$ and has the same law as $\tilde{X}_{t+s}-\tilde{X}_t$.
Example 15.6. The Poisson process of rate $\lambda$, $\{X_t\}_{t\ge 0}$, has stationary independent increments with $X_t - X_s\overset{d}{=}\mathrm{Poi}\left(\lambda(t-s)\right)$. In this case,
\[
(P_t f)(x) = \mathbb{E}f(x + X_t - X_0) = \sum_{k=0}^\infty f(x+k)\,\frac{(\lambda t)^k}{k!}e^{-\lambda t}
\]
and we find
\begin{align*}
(Af)(x) = \frac{d}{dt}\Big|_{0^+}(P_t f)(x)
&= \sum_{k=0}^\infty f(x+k)\,\frac{d}{dt}\Big|_{0^+}\frac{(\lambda t)^k}{k!}e^{-\lambda t}\\
&= \sum_{k=0}^1 f(x+k)\,\frac{d}{dt}\Big|_{0^+}\frac{(\lambda t)^k}{k!}e^{-\lambda t}
= -\lambda f(x) + \lambda f(x+1).
\end{align*}
15.2 Normal/Gaussian Random Vectors
A random variable, Y, is normal with mean µ standard deviation σ2 iff
P (Y ∈ (y, y + dy]) =1√
2πσ2e−
12σ2
(y−µ)2dy. (15.3)
We will abbreviate this by writing Yd= N
(µ, σ2
). When µ = 0 and σ2 = 1 we
say Y is a standard normal random variable. It turns out that rather thanworking with densities of normal random variables it is typically easier to makeuse of the Laplace transform description of the distribution. To motivate thefollowing definition, starting with Eq. (15.3) we find for any λ ∈ R we have
E[eλY
]:=
∫ ∞−∞
eλy1√
2πσ2e−
12σ2
(y−µ)2dy
=
∫ ∞−∞
eλ(σx+µ) 1√2πe−
12x
2
dx
= eλµ ·∫ ∞−∞
1√2πe−
12 (x−σλ)2e
12σ
2λ2
dx
= e12σ
2λ2+λµ ·∫ ∞−∞
1√2πe−
12 z
2
dz
= e12σ
2λ2+λµ, (15.4)
wherein in the second line we made the change of variables, y = σx + µ, com-pleted the squares in the third line, made the change of variables z = x−σλ inthe fourth line, and used the well known fact that
√2π =
∫ ∞−∞
e−12 z
2
dz
in the last. Finally by differentiating Eq. (15.4) we find,
E[Y eλY
]=(σ2λ+ µ
)e
12σ
2λ2+λµ and
E[Y 2eλY
]=(σ2 +
(σ2λ+ µ
)2)e
12σ
2λ2+λµ.
Taking λ = 0 in these equations then implies,
µ = EY and σ2 + µ2 = E[Y 2]
and therefore σ2 = Var (Y ) . These comments serve as motivation for the fol-lowing alternative definition of Gaussian random variables and more generallyrandom vectors.
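The moment generating function formula (15.4) is easy to confirm by Monte Carlo; the parameter values below are arbitrary test choices:

```python
import random, math

random.seed(6)
mu, sigma, lam = 0.5, 1.2, 0.7       # arbitrary test values

N = 400_000
mgf = sum(math.exp(lam * random.gauss(mu, sigma)) for _ in range(N)) / N
exact = math.exp(0.5 * sigma ** 2 * lam ** 2 + lam * mu)   # Eq. (15.4)
assert abs(mgf / exact - 1) < 0.02
```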
Definition 15.7 (Normal / Gaussian Random Variable). An $\mathbb{R}^d$-valued random vector, $Z$, is said to be normal or Gaussian\footnote{I will use the terms normal and Gaussian interchangeably.} if for all $\lambda\in\mathbb{R}^d$,
\[
\mathbb{E}\left[e^{\lambda\cdot Z}\right] = \exp\left(\frac{1}{2}\mathrm{Var}(\lambda\cdot Z) + \mathbb{E}(\lambda\cdot Z)\right),
\]
where
\[
\mathrm{Var}(\lambda\cdot Z) = \mathbb{E}\left[(\lambda\cdot Z)^2\right] - \left(\mathbb{E}[\lambda\cdot Z]\right)^2
\]
is the variance of $\lambda\cdot Z$. We say that $Z$ is a standard normal vector if $\mathbb{E}Z = 0$ and $\mathrm{Var}(\lambda\cdot Z) = \lambda\cdot\lambda$, where
\[
\mu := \mathbb{E}Z = \left(\mathbb{E}Z_1,\ \mathbb{E}Z_2,\ \dots,\ \mathbb{E}Z_d\right)^{\mathsf{T}}
\]
is the mean of $Z$.
Lemma 15.8. If $Z\in\mathbb{R}^d$ is a Gaussian random vector, then for all $(k_1,\dots,k_d)\in\mathbb{N}^d$ we have
\[
\mathbb{E}\prod_{j=1}^d\left|Z_j^{k_j}\right| < \infty.
\]

Proof. A simple calculus exercise shows for each $k\ge 0$ that there exists $C_k<\infty$ such that
\[
|x|^k\le C_k e^{|x|}\le C_k\left[e^x + e^{-x}\right].
\]
Hence we know
\[
\mathbb{E}\prod_{j=1}^d\left|Z_j^{k_j}\right|\le\mathbb{E}\prod_{j=1}^d C_{k_j}\left[e^{Z_j}+e^{-Z_j}\right],
\]
and the latter integrand is a linear combination of random variables of the form $e^{a\cdot Z}$ with $a\in\{\pm 1\}^d$. By assumption, $\mathbb{E}\left[e^{a\cdot Z}\right]<\infty$ for all $a\in\mathbb{R}^d$, and hence we learn that
\[
\mathbb{E}\prod_{j=1}^d C_{k_j}\left[e^{Z_j}+e^{-Z_j}\right] < \infty.
\]
Fact 15.9 If $W$ and $Z$ are any two $\mathbb{R}^d$-valued random vectors such that $\mathbb{E}\left[e^{\lambda\cdot Z}\right]<\infty$ and $\mathbb{E}\left[e^{\lambda\cdot W}\right]<\infty$ for all $\lambda\in\mathbb{R}^d$, then
\[
\mathbb{E}\left[e^{\lambda\cdot Z}\right] = \mathbb{E}\left[e^{\lambda\cdot W}\right]\ \text{ for all }\lambda\in\mathbb{R}^d
\]
iff $W\overset{d}{=}Z$, where $W\overset{d}{=}Z$ by definition means
\[
\mathbb{E}\left[f(W)\right] = \mathbb{E}\left[f(Z)\right]
\]
for all functions, $f:\mathbb{R}^d\to\mathbb{R}$, such that the expectations make sense.

Fact 15.10 If $W\in\mathbb{R}^k$ and $Z\in\mathbb{R}^l$ are any two random vectors such that $\mathbb{E}\left[e^{a\cdot W}\right]<\infty$ and $\mathbb{E}\left[e^{b\cdot Z}\right]<\infty$ for all $a\in\mathbb{R}^k$ and $b\in\mathbb{R}^l$, then $W$ and $Z$ are independent iff
\[
\mathbb{E}\left[e^{a\cdot W}e^{b\cdot Z}\right] = \mathbb{E}\left[e^{a\cdot W}\right]\cdot\mathbb{E}\left[e^{b\cdot Z}\right]\quad\forall\ a\in\mathbb{R}^k\ \&\ b\in\mathbb{R}^l. \tag{15.5}
\]
Example 15.11. Suppose $W\in\mathbb{R}^k$ and $Z\in\mathbb{R}^l$ are two random vectors such that $(W,Z)\in\mathbb{R}^k\times\mathbb{R}^l$ is a Gaussian random vector. Then $W$ and $Z$ are independent iff $\mathrm{Cov}(W_i,Z_j) = 0$ for all $1\le i\le k$ and $1\le j\le l$. To keep the notation to a minimum, let us verify this fact when $k = l = 1$. Then, according to Fact 15.10, we need only verify Eq. (15.5) for $a,b\in\mathbb{R}$. However,
\begin{align*}
\mathrm{Var}\left((a,b)\cdot(W,Z)\right) &= \mathrm{Var}(aW+bZ)\\
&= \mathrm{Var}(aW) + \mathrm{Var}(bZ) + 2ab\,\mathrm{Cov}(W,Z)\\
&= \mathrm{Var}(aW) + \mathrm{Var}(bZ),
\end{align*}
and hence
\begin{align*}
\mathbb{E}\left[e^{aW}e^{bZ}\right] &= \mathbb{E}\left[e^{(a,b)\cdot(W,Z)}\right]
= \exp\left(\frac{1}{2}\mathrm{Var}\left((a,b)\cdot(W,Z)\right) + \mathbb{E}\left[(a,b)\cdot(W,Z)\right]\right)\\
&= \exp\left(\frac{1}{2}\mathrm{Var}(aW)+\mathbb{E}[aW]+\frac{1}{2}\mathrm{Var}(bZ)+\mathbb{E}[bZ]\right)
= \mathbb{E}\left[e^{aW}\right]\mathbb{E}\left[e^{bZ}\right].
\end{align*}
Remark 15.12. In general it is not true that two uncorrelated random variables are independent. For example, suppose that $X$ is any random variable with an even distribution, i.e. $X\overset{d}{=}-X$. Let $Y := |X|$, which is clearly not independent of $X$ unless $X\equiv 0$. Nevertheless, $X$ and $Y$ are uncorrelated. Indeed,
\[
\mathrm{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}X\cdot\mathbb{E}Y = \mathbb{E}\left[X|X|\right] - \mathbb{E}X\cdot\mathbb{E}|X| = 0,
\]
wherein we have used that both $X$ and $|X|X$ have even distributions and therefore have zero expectations. Indeed, if $Z\overset{d}{=}-Z$, then
\[
\mathbb{E}Z = \mathbb{E}[-Z] = -\mathbb{E}Z\implies\mathbb{E}Z = 0.
\]
Exercise 15.1. Suppose that $X$ and $Y$ are independent normal random variables. Show:

1. $Z = (X,Y)$ is a normal random vector, and
2. $W = X+Y$ is a normal random variable.
3. If $N$ is a standard normal random variable and $X$ is any normal random variable, show $X\overset{d}{=}\sigma N+\mu$, where $\mu = \mathbb{E}X$ and $\sigma = \sqrt{\mathrm{Var}(X)}$.

Notation 15.13 Suppose that $Y$ is a random variable. We write $Y\overset{d}{=}N\left(\mu,\sigma^2\right)$ to mean that $Y$ is a normal random variable with mean $\mu$ and variance $\sigma^2$. In particular, $Y\overset{d}{=}N(0,1)$ is used to indicate that $Y$ is a standard normal random variable.
15.3 Brownian Motion Defined

Definition 15.14 (Brownian Motion). Let $\left(\Omega,\mathcal{F},\{\mathcal{F}_t\}_{t\in\mathbb{R}_+},\mathbb{P}\right)$ be a filtered probability space. A real valued adapted process, $\{B_t:\Omega\to\mathbb{R}\}_{t\in\mathbb{R}_+}$, is called a Brownian motion if:

1. $\{B_t\}_{t\in\mathbb{R}_+}$ has independent increments, with the increment $B_t - B_s$ being independent of $\mathcal{F}_s$ for all $0\le s<t<\infty$;
2. for $0\le s<t$, $B_t - B_s\overset{d}{=}N(0,t-s)$, i.e. $B_t - B_s$ is a normal, mean zero random variable with variance $(t-s)$;
3. $t\to B_t(\omega)$ is continuous for all $\omega\in\Omega$.

Remark 15.15. In light of Lemma 15.4, we could have described a Brownian motion (up to scale) as a continuous-in-time stochastic process $\{B_t\}_{t\ge 0}$ with stationary, independent, mean zero Gaussian increments such that $B_0 = 0$ and $t\to\mathbb{E}B_t^2$ is continuous.
Exercise 15.2 (Brownian Motion). Let {B_t}_{t≥0} be a Brownian motion as in Definition 15.14.

1. Explain why {B_t}_{t≥0} is a time-homogeneous Markov process with transition operator,

(P_t f)(x) = ∫_R p_t(x, y) f(y) dy    (15.6)

where

p_t(x, y) = (1/√(2πt)) e^{−|y−x|^2/(2t)}.    (15.7)

In more detail, use Proposition 15.5 and Eq. (15.9) below to argue

E[f(B_t) | F_s] = (P_{t−s} f)(B_s).    (15.8)

2. Show by direct computation that P_t P_s = P_{t+s} for all s, t > 0. Hint: probably the easiest way to do this is to make use of Exercise 15.1 along with the identity,

(P_t f)(x) = E[f(x + √t Z)],    (15.9)

where Z =d N(0, 1).

3. Show by direct computation that p_t(x, y) of Eq. (15.7) satisfies the heat equation,

(d/dt) p_t(x, y) = (1/2)(d^2/dx^2) p_t(x, y) = (1/2)(d^2/dy^2) p_t(x, y) for t > 0.

4. Suppose that f : R → R is a twice continuously differentiable function with compact support. Show

(d/dt) P_t f = A P_t f = P_t A f for all t > 0,

where

A f(x) = (1/2) f''(x).

Note: you may use (under the above assumptions) without proof the fact that it is permissible to interchange the d/dt and d/dx derivatives with the integral in Eq. (15.6).
Modulo technical details, Exercise 15.2 shows that A = (1/2) d^2/dx^2 is the infinitesimal generator of Brownian motion, i.e. the infinitesimal generator of P_t in Eq. (15.6). The technical details we have ignored involve the proper function spaces in which to carry out these computations, along with a proper description of the domain of the operator A. We will have to postpone these somewhat delicate issues until later. By the way, it is no longer necessarily a good idea to try to recover P_t as ∑_{n=0}^∞ (t^n/n!) A^n in this example, since in order for ∑_{n=0}^∞ (t^n/n!) A^n f to make sense one needs to assume that f is at least C^∞, and even this will not guarantee convergence of the sum!
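The semigroup identity P_t P_s = P_{t+s} of part 2. of Exercise 15.2 can also be checked numerically by quadrature against the kernel of Eq. (15.7). The sketch below is only an illustration (the grid, the test function f(y) = e^{−y^2}, and all names are our own choices), not a substitute for the analytic proof the exercise asks for:

```python
import numpy as np

def heat_kernel(t, x, y):
    # p_t(x, y) = (2*pi*t)**(-1/2) * exp(-|y - x|**2 / (2t)), as in Eq. (15.7)
    return np.exp(-(y - x) ** 2 / (2.0 * t)) / np.sqrt(2.0 * np.pi * t)

def apply_Pt(t, f_vals, x, grid, dx):
    # (P_t f)(x) = \int p_t(x, y) f(y) dy, approximated by a Riemann sum
    return np.sum(heat_kernel(t, x, grid) * f_vals) * dx

grid = np.linspace(-25.0, 25.0, 4001)   # wide grid so the Gaussian tails are negligible
dx = grid[1] - grid[0]
f_vals = np.exp(-grid ** 2)             # test function f(y) = exp(-y^2)
s, t, x0 = 0.7, 1.3, 0.5

# Left side: tabulate (P_s f) on the grid, then apply P_t at x0.
Psf_vals = np.array([apply_Pt(s, f_vals, y, grid, dx) for y in grid])
lhs = apply_Pt(t, Psf_vals, x0, grid, dx)      # (P_t P_s f)(x0)
rhs = apply_Pt(t + s, f_vals, x0, grid, dx)    # (P_{t+s} f)(x0)
print(lhs, rhs)                                # the two numbers agree to quadrature error
```

The agreement of the two printed numbers is exactly the Chapman–Kolmogorov property that makes {P_t} a semigroup.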
Corollary 15.16. If 0 = t_0 < t_1 < ··· < t_n < ∞ and ∆_i t := t_i − t_{i−1}, then

E F(B_{t_1}, ..., B_{t_n}) = ∫ F(x_1, ..., x_n) p_{∆_1 t}(x_0, x_1) ··· p_{∆_n t}(x_{n−1}, x_n) dx_1 ··· dx_n    (15.10)

for all bounded or non-negative functions F : R^n → R. In particular, if J_i = (a_i, b_i) ⊂ R are given bounded intervals, then

P(B(t_i) ∈ J_i for i = 1, 2, ..., n)
 = ∫···∫_{J_1×···×J_n} p_{∆_1 t}(0, x_1) p_{∆_2 t}(x_1, x_2) ··· p_{∆_n t}(x_{n−1}, x_n) dx_1 ··· dx_n,    (15.11)

as follows from Eq. (15.10) by taking F(x_1, ..., x_n) := 1_{J_1}(x_1) ··· 1_{J_n}(x_n).
Proof. This result is a formal consequence of the Markov property (Proposition 15.5), similar to the discrete space case in Theorem 5.8. Rather than carry out this style of proof here, I will give a proof based directly on the independent increments of the Brownian motion. Let x_0 := 0. We are going to prove Eq. (15.10) by induction on n.
For n = 1, we have
E F(B_{t_1}) = E F(√t_1 N) = ∫_R p_{t_1}(0, y) F(y) dy,

which is Eq. (15.10) with n = 1. For the induction step, let N be a standard normal random variable independent of (B_{t_1}, ..., B_{t_{n−1}}); then

E F(B_{t_1}, ..., B_{t_n}) = E F(B_{t_1}, ..., B_{t_{n−1}}, B_{t_{n−1}} + (B_{t_n} − B_{t_{n−1}}))
 = E F(B_{t_1}, ..., B_{t_{n−1}}, B_{t_{n−1}} + √(∆_n t) N)
 = E ∫_R F(B_{t_1}, ..., B_{t_{n−1}}, y) p_{∆_n t}(B_{t_{n−1}}, y) dy
 = ∫_R E[F(B_{t_1}, ..., B_{t_{n−1}}, y) p_{∆_n t}(B_{t_{n−1}}, y)] dy.    (15.12)

By the induction hypothesis,

E[F(B_{t_1}, ..., B_{t_{n−1}}, y) p_{∆_n t}(B_{t_{n−1}}, y)]
 = ∫ F(x_1, ..., x_{n−1}, y) p_{∆_1 t}(x_0, x_1) ··· p_{∆_{n−1} t}(x_{n−2}, x_{n−1}) p_{∆_n t}(x_{n−1}, y) dx_1 ··· dx_{n−1}.    (15.13)

Combining Eqs. (15.12) and (15.13) and then replacing y by x_n verifies Eq. (15.10).
Remark 15.17 (Gaussian aspects of Brownian Motion). Brownian motion is a Gaussian process in that for all 0 = t_0 < t_1 < ··· < t_n < ∞, (B_{t_1}, ..., B_{t_n}) is a Gaussian random vector, since it is a linear transformation of the increment vector (∆_1 B, ..., ∆_n B), where ∆_i B := B_{t_i} − B_{t_{i−1}}. The distribution of a mean-zero Gaussian random vector is completely determined by its covariance matrix. In this case if 0 ≤ s < t < ∞, we have

E[B_s B_t] = E[B_s(B_s + B_t − B_s)] = E[B_s^2] + E[B_s] · E[B_t − B_s] = s.

In general we have

E[B_s B_t] = min(s, t) ∀ s, t ≥ 0.    (15.14)

Thus we could define Brownian motion {B_t}_{t≥0} to be the mean-zero continuous Gaussian process such that Eq. (15.14) holds.
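The covariance formula in Eq. (15.14) is easy to see in simulation. The sketch below (path counts, step size, and the chosen times s = 0.4, t = 0.8 are our own parameters) builds Brownian paths from independent N(0, dt) increments and estimates E[B_s B_t]:

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps, T = 100_000, 100, 1.0
dt = T / n_steps

# Brownian paths on [0, T] built from independent N(0, dt) increments.
increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
B = np.cumsum(increments, axis=1)     # column k approximates B at time (k+1)*dt

s_col, t_col = 39, 79                 # times s = 0.40 and t = 0.80
cov_est = np.mean(B[:, s_col] * B[:, t_col])
print(cov_est)                        # should be close to min(0.4, 0.8) = 0.4
```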
Remark 15.18 (White Noise). If we formally differentiate Eq. (15.14) with respect to s and t we find,

E[Ḃ_s Ḃ_t] = (d/dt)(d/ds) min(s, t) = (d/dt) 1_{s≤t} = δ(t − s).    (15.15)

Thus {Ḃ_t}_{t≥0} is a totally uncorrelated Gaussian process, which formally implies that the {Ḃ_t}_{t≥0} are all independent "random variables"; this process is referred to as white noise.

Warning: white noise is not an honest real-valued stochastic process, owing to the fact that Brownian motion is very rough and in particular nowhere differentiable. If Ḃ_t were to exist it should be Gaussian and hence satisfy E[Ḃ_t^2] < ∞. On the other hand, from Eq. (15.15) with s = t we are led to believe that

E[Ḃ_t^2] = δ(t − t) = δ(0) = ∞.

Nevertheless, ignoring these issues, it is common to model noise in a system by such a white noise. For example, we may have y(t) ∈ R^d satisfying, in a pristine environment, a differential equation of the form

ẏ(t) = f(t, y(t)).    (15.16)

However, the real world is not so pristine, and the trajectory y(t) is also influenced by noise in the system, which is often modeled by adding a term of the form g(t, y(t)) Ḃ_t to the right side of Eq. (15.16) to arrive at the "stochastic differential equation,"

ẏ(t) = f(t, y(t)) + g(t, y(t)) Ḃ_t.

In the last section of these notes we will begin to explain Ito's method for making sense of such equations.
Definition 15.19 (Multi-Dimensional Brownian Motion). For d ∈ N, we say an R^d-valued process, B_t = (B_t^1, ..., B_t^d)^tr, t ≥ 0, is a d-dimensional Brownian motion provided {B^i}_{i=1}^d is an independent collection of one-dimensional Brownian motions.
We state the following multi-dimensional version of the results argued above. The proof of this theorem will be left to the reader.

Theorem 15.20. A d-dimensional Brownian motion, {B_t}_{t≥0}, is a Markov process with transition semigroup, P_t, now defined by

(P_t f)(x) = E[f(x + √t Z)] = ∫_{R^d} p_t(x, y) f(y) dy,

where now Z is a standard normal random vector in R^d and

p_t(x, y) = (1/√(2πt))^d e^{−‖x−y‖^2/(2t)}.
Moreover we have

(d/dt) P_t = (1/2) ∆ P_t = P_t (1/2) ∆,

where now ∆ = ∑_{j=1}^d ∂^2/∂x_j^2 is the d-dimensional Laplacian.
15.4 Some “Brownian” Martingales
Definition 15.21 (Continuous time martingales). Given a filtered probability space, (Ω, F, {F_t}_{t≥0}, P), an adapted process, X_t : Ω → R, is said to be an (F_t)-martingale provided E|X_t| < ∞ for all t and

E[X_t − X_s | F_s] = 0 for all 0 ≤ s ≤ t < ∞.

If

E[X_t − X_s | F_s] ≥ 0 or E[X_t − X_s | F_s] ≤ 0 for all 0 ≤ s ≤ t < ∞,

then X is said to be a submartingale or supermartingale, respectively.
Theorem 15.22 ("Martingale problem"). Suppose that h : [0, T] × R^d → R is a continuous function such that ḣ(t, x) = (∂/∂t) h(t, x), ∇_x h(t, x), and ∆_x h(t, x) exist and are continuous on [0, T] × R^d and satisfy

sup_{0≤t≤T} E_ν[|h(t, B_t)| + |ḣ(t, B_t)| + |∇_x h(t, B_t)| + |∆_x h(t, B_t)|] < ∞.    (15.17)

Then the process,

M_t := h(t, B_t) − ∫_0^t [ḣ(τ, B_τ) + (1/2) ∆_x h(τ, B_τ)] dτ    (15.18)

is an {F_t}_{t∈R_+}-martingale. In particular, if h also satisfies the heat equation in reverse time,

L h(t, x) := ∂_t h(t, x) + (1/2) ∆_x h(t, x) = 0,    (15.19)

then M_t = h(t, B_t) is a martingale.
Proof. Working formally for the moment,

(d/dτ) E_ν[h(τ, B_τ) | F_s] = (d/dτ)[(P_{τ−s} h)(τ, B_s)]
 = (P_{τ−s} ḣ)(τ, B_s) + (1/2)(P_{τ−s} ∆h)(τ, B_s)
 = E_ν[ḣ(τ, B_τ) + (1/2) ∆h(τ, B_τ) | F_s],

wherein we have used the Markov property in the first and third lines and the chain rule in the second line. Integrating this equation over τ ∈ [s, t] then shows

E_ν[h(t, B_t) − h(s, B_s) | F_s] = E_ν[∫_s^t (ḣ(τ, B_τ) + (1/2) ∆h(τ, B_τ)) dτ | F_s].

This statement is equivalent to the statement that E_ν[M_t − M_s | F_s] = 0, i.e. to the assertion that {M_t}_{t≥0} is a martingale.
*(Omit the rest of this argument on first reading.) We now need to justify the above computations.

1. Let us first suppose there exists an R < ∞ such that h(t, x) = 0 if |x| = √(∑_{i=1}^d x_i^2) ≥ R. By a couple of integrations by parts, we find

(d/dτ)(P_{τ−s} h)(τ, x) = ∫_{R^d} (d/dτ)[p_{τ−s}(x − y) h(τ, y)] dy
 = ∫_{R^d} [(1/2)(∆ p_{τ−s})(x − y) h(τ, y) + p_{τ−s}(x − y) ḣ(τ, y)] dy
 = ∫_{R^d} [(1/2) p_{τ−s}(x − y) ∆_y h(τ, y) + p_{τ−s}(x − y) ḣ(τ, y)] dy
 = ∫_{R^d} p_{τ−s}(x − y) L h(τ, y) dy =: (P_{τ−s} L h)(τ, x),    (15.20)

where

L h(τ, y) := ḣ(τ, y) + (1/2) ∆_y h(τ, y).    (15.21)

Since

(P_{τ−s} h)(τ, x) := ∫_{R^d} p_{τ−s}(x − y) h(τ, y) dy = E[h(τ, x + √(τ − s) N)],

where N =d N(0, I_{d×d}), we see that (P_{τ−s} h)(τ, x) (and similarly (P_{τ−s} L h)(τ, x)) is a continuous function in (τ, x) for τ ≥ s. Moreover, (P_{τ−s} h)(τ, x)|_{τ=s} = h(s, x), and so by the fundamental theorem of calculus (using Eq. (15.20)),

(P_{t−s} h)(t, x) = h(s, x) + ∫_s^t (P_{τ−s} L h)(τ, x) dτ.    (15.22)
Hence for A ∈ F_s,

E_ν[h(t, B_t) : A] = E_ν[E_ν[h(t, B_t) | F_s] : A]
 = E_ν[(P_{t−s} h)(t, B_s) : A]
 = E_ν[h(s, B_s) + ∫_s^t (P_{τ−s} L h)(τ, B_s) dτ : A]
 = E_ν[h(s, B_s) : A] + ∫_s^t E_ν[(P_{τ−s} L h)(τ, B_s) : A] dτ.    (15.23)

Since

(P_{τ−s} L h)(τ, B_s) = E_ν[(L h)(τ, B_τ) | F_s],

we have

E_ν[(P_{τ−s} L h)(τ, B_s) : A] = E_ν[(L h)(τ, B_τ) : A],

which combined with Eq. (15.23) shows

E_ν[h(t, B_t) : A] = E_ν[h(s, B_s) : A] + ∫_s^t E_ν[(L h)(τ, B_τ) : A] dτ
 = E_ν[h(s, B_s) + ∫_s^t (L h)(τ, B_τ) dτ : A].
This proves {M_t}_{t≥0} is a martingale when h(t, x) = 0 for |x| ≥ R.

2. For the general case, let ϕ ∈ C_c^∞(R^d, [0, 1]) be such that ϕ = 1 in a neighborhood of 0 ∈ R^d, and for n ∈ N let ϕ_n(x) := ϕ(x/n). Observe that ϕ_n → 1, while ∇ϕ_n(x) = (1/n)(∇ϕ)(x/n) and ∆ϕ_n(x) = (1/n^2)(∆ϕ)(x/n) both go to zero boundedly as n → ∞. Applying case 1. to h_n(t, x) = ϕ_n(x) h(t, x), we find that

M_t^n := ϕ_n(B_t) h(t, B_t) − ∫_0^t ϕ_n(B_τ)[ḣ(τ, B_τ) + (1/2) ∆h(τ, B_τ)] dτ − ε_n(t)

is a martingale, where

ε_n(t) := ∫_0^t [(1/2) ∆ϕ_n(B_τ) h(τ, B_τ) + ∇ϕ_n(B_τ) · ∇h(τ, B_τ)] dτ.

By the DCT,

E_ν|ε_n(t)| ≤ ∫_0^t (1/2) E_ν|∆ϕ_n(B_τ) h(τ, B_τ)| dτ + ∫_0^t E_ν[|∇ϕ_n(B_τ)| |∇h(τ, B_τ)|] dτ → 0 as n → ∞,

and similarly,

E_ν|ϕ_n(B_t) h(t, B_t) − h(t, B_t)| → 0,
∫_0^t E_ν|ϕ_n(B_τ) ḣ(τ, B_τ) − ḣ(τ, B_τ)| dτ → 0, and
(1/2) ∫_0^t E_ν|ϕ_n(B_τ) ∆h(τ, B_τ) − ∆h(τ, B_τ)| dτ → 0

as n → ∞. From these comments one easily sees that E_ν|M_t^n − M_t| → 0 as n → ∞, which is sufficient to show {M_t}_{t∈[0,T]} is still a martingale.
Rewriting Eq. (15.18) we have

h(t, B_t) = M_t + ∫_0^t [ḣ(τ, B_τ) + (1/2) ∆_x h(τ, B_τ)] dτ.

Considering the simplest case where h(t, x) = f(x) and x ∈ R, i.e. d = 1, we have

f(B_t) = M_t + (1/2) ∫_0^t f''(B_τ) dτ,

where M_t is a martingale provided

sup_{0≤t≤T} E_ν[|f(B_t)| + |f'(B_t)| + |f''(B_t)|] < ∞.

From this it follows that if f'' ≥ 0 (i.e. f is subharmonic) then f(B_t) is a submartingale, and if f'' ≤ 0 (i.e. f is superharmonic) then f(B_t) is a supermartingale. More precisely, if f : R^d → R is a C^2 function, then f(B_t) is a "local" submartingale (supermartingale) iff ∆f ≥ 0 (∆f ≤ 0).
Exercise 15.3 (h-transforms of B_t). Let {B_t}_{t∈R_+} be a d-dimensional Brownian motion. Show the following processes are martingales:

1. M_t = u(B_t), where u : R^d → R is a harmonic function, ∆u = 0, such that sup_{t≤T} E[|u(B_t)| + |∇u(B_t)|] < ∞ for all T < ∞.
2. M_t = λ · B_t for all λ ∈ R^d.
3. M_t = e^{−λY_t} cos(λX_t) and M_t = e^{−λY_t} sin(λX_t), where B_t = (X_t, Y_t) is a two-dimensional Brownian motion and λ ∈ R.
4. M_t = |B_t|^2 − d·t.
5. M_t = (a · B_t)(b · B_t) − (a · b) t for all a, b ∈ R^d.
6. M_t := e^{λ·B_t − |λ|^2 t/2} for any λ ∈ R^d.
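A necessary condition for items 4. and 6. above is that the processes have constant expectation in t, equal to E[M_0]. This is cheap to check by Monte Carlo at a single time (the dimension, time, λ, and sample size below are our own illustrative choices; since B_t =d N(0, t I_d), no path simulation is needed):

```python
import numpy as np

rng = np.random.default_rng(1)
d, t, n = 2, 1.5, 400_000
lam = np.array([0.3, -0.5])

# For a fixed time t, B_t ~ N(0, t I_d).
Bt = rng.normal(0.0, np.sqrt(t), size=(n, d))

m4 = np.mean(np.sum(Bt ** 2, axis=1) - d * t)            # item 4: E[M_t] should be 0
m6 = np.mean(np.exp(Bt @ lam - 0.5 * (lam @ lam) * t))   # item 6: E[M_t] should be 1
print(m4, m6)
```

Of course constancy of E[M_t] is weaker than the martingale property; the exercise asks for the conditional statement.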
Corollary 15.23 (Compensators). Suppose h : [0, ∞) × R^d → R is a C^2 function such that

L h = ∂_t h(t, x) + (1/2) ∆_x h(t, x) = 0

and both h and h^2 satisfy the hypothesis of Theorem 15.22. If we let M_t denote the martingale, M_t := h(t, B_t), and

A_t := ∫_0^t |(∇_x h)(τ, B_τ)|^2 dτ,

then N_t := M_t^2 − A_t is a martingale. Thus the submartingale M_t^2 has the "Doob" decomposition,

M_t^2 = N_t + A_t,

and we call the increasing process, A_t, the compensator of M_t^2.

Proof. We need only apply Theorem 15.22 to h^2. In order to do this we need to compute L h^2. Since

∂_i^2 h^2 = ∂_i(2h · ∂_i h) = 2(∂_i h)^2 + 2h ∂_i^2 h,

we see that (1/2) ∆h^2 = h ∆h + |∇h|^2. Therefore,

L h^2 = 2h ḣ + h ∆h + |∇h|^2 = 2h L h + |∇h|^2 = |∇h|^2,

and hence the corollary follows from Eq. (15.18) with h replaced by h^2.
Exercise 15.4 (Compensators). Let {B_t}_{t∈R_+} be a d-dimensional Brownian motion. Find the compensator, A_t, for each of the following square-integrable martingales.

1. M_t = u(B_t), where u : R^d → R is a harmonic function, ∆u = 0, such that

sup_{t≤T} E[|u^2(B_t)| + |∇u(B_t)|^2] < ∞ for all T < ∞.

Verify the hypothesis of Corollary 15.23 to show

A_t = ∫_0^t |∇u(B_τ)|^2 dτ.

2. M_t = λ · B_t for all λ ∈ R^d.
3. M_t = |B_t|^2 − d·t.
4. M_t := e^{λ·B_t − |λ|^2 t/2} for any λ ∈ R^d.
Fact 15.24 (Lévy's criteria for BM) P. Lévy gives another description of Brownian motion, namely a continuous stochastic process, {X_t}_{t≥0}, is a Brownian motion if X_0 = 0, {X_t}_{t≥0} is a martingale, and {X_t^2 − t}_{t≥0} is a martingale.
15.5 Optional Sampling Results
Similar to Definition 14.20, we have the following informal definition of a stopping time in the Brownian motion setting.

Definition 15.25 (Informal). A stopping time, T, for {B_t}_{t≥0}, is a random variable with the property that the event {T ≤ t} is determined from the knowledge of {B_s : 0 ≤ s ≤ t}. Alternatively put, for each t ≥ 0 there is a functional, f_t, such that

1_{T≤t} = f_t(B_s : 0 ≤ s ≤ t).
In this continuous context the optional sampling theorem is as follows.
Theorem 15.26 (Continuous time optional sampling theorem). If {M_t}_{t≥0} is an {F_t^B}-martingale and σ and τ are two stopping times such that τ ≤ K for some non-random finite constant K < ∞, then M_τ ∈ L^1(Ω, F_τ, P), M_{σ∧τ} ∈ L^1(Ω, F_{σ∧τ}, P), and

M_{σ∧τ} = E[M_τ | F_σ].    (15.24)

Remark 15.27. In applying the optional sampling Theorem 15.26, observe that if τ is any optional time and K < ∞ is a constant, then τ ∧ K is a bounded optional time. Indeed,

{τ ∧ K < t} = {τ < t} ∈ F_t if t ≤ K, and {τ ∧ K < t} = Ω ∈ F_t if t > K.

Therefore if σ and τ are any optional times with σ ≤ τ, we may apply Theorem 15.26 with σ and τ replaced by σ ∧ K and τ ∧ K. One may then try to pass to the limit as K ↑ ∞ in the resulting identity.
For the next few results let d = 1, and for any y ∈ R let

τ_y := inf{t > 0 : B_t = y}.    (15.25)

Let us now fix a, b ∈ R with −∞ < a < 0 < b < ∞ and let τ := τ_a ∧ τ_b be the first exit time of the Brownian motion from the interval (a, b). By the law of large numbers for Brownian motion, see Corollary 15.32 below, we may deduce that P_0(τ < ∞) = 1. This may also be deduced from Lemma 15.45, but this is using a rather large hammer to conclude a simple result. We will give an independent proof here as well.

Proposition 15.28. With the notation above we have

Eτ = E[τ_a ∧ τ_b] = −ab = |a| b < ∞.

In particular, P(τ_a ∧ τ_b = ∞) = 0.
Fig. 15.1. A plot of f for a = −2 and b = 5.
Proof. Let u(x) = (x − a)(b − x); see Figure 15.1. Then u''(x) = −2, so (1/2)u'' = −1, and by Theorem 15.22 (or by Ito's lemma in Theorem 16.5 below) we know that

M_t = u(B_t) + t

is a martingale. So by the optional sampling Theorem 15.26 we may conclude that

E[u(B_{t∧τ})] + E[τ ∧ t] = E[M_{τ∧t}] = E[M_0] = u(B_0) = u(0) = −ab.    (15.26)

As u ≥ 0 on [a, b] and B_{t∧τ} ∈ [a, b], it follows that

|a| b = E[τ ∧ t] + E[u(B_{t∧τ})] ≥ E[τ ∧ t].

We may now use the monotone convergence theorem (see Section 1.1) to conclude,

Eτ = lim_{t↑∞} E[τ ∧ t] ≤ |a| b < ∞.

Next, using the facts: 1) τ is integrable and t ∧ τ → τ as t ↑ ∞; and 2) u is bounded on [a, b] and u(B_{t∧τ}) → u(B_τ) = 0 as t ↑ ∞; we may apply the dominated convergence theorem to pass to the limit in Eq. (15.26) to find

−ab = lim_{t↑∞} [E[τ ∧ t] + E[u(B_{t∧τ})]] = E[τ] + E[u(B_τ)] = Eτ.
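Proposition 15.28 is easy to test by an Euler discretization of the Brownian path (a sketch only; the step size dt, the interval (a, b) = (−1, 2), and the path count are our own choices, and the discrete-time scheme slightly overshoots the boundary, so the estimate carries an O(√dt) bias):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = -1.0, 2.0                  # interval (a, b) with a < 0 < b; |a| * b = 2
dt, n_paths, n_steps = 1e-3, 20_000, 40_000

B = np.zeros(n_paths)
exit_time = np.full(n_paths, np.nan)
alive = np.ones(n_paths, dtype=bool)
for k in range(1, n_steps + 1):
    # Advance only the paths that have not yet left (a, b).
    B[alive] += rng.normal(0.0, np.sqrt(dt), size=alive.sum())
    exited = alive & ((B <= a) | (B >= b))
    exit_time[exited] = k * dt
    alive &= ~exited
    if not alive.any():
        break

est = np.nanmean(exit_time)
print(est)                        # should be close to -a*b = |a| b = 2.0
```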
Proposition 15.29. With the notation above,

P(τ_b < τ_a) = (−a)/(b − a) and P(τ_b < ∞) = 1.    (15.27)

In particular, this shows that one-dimensional Brownian motion hits every point in R and, by the Markov property, is therefore recurrent. Again these results agree with what we found for simple random walks.

Proof. Now let u(x) = x − a, in which case u''(x) = 0. Therefore by Theorem 15.22 (or by Ito's lemma in Theorem 16.5 below) we know that M_t = u(B_t) = B_t − a is a martingale, and so by the optional sampling Theorem 15.26 we may conclude that

−a = E[M_0] = E[M_{t∧τ}] = E[B_{t∧τ} − a].    (15.28)

Since B_{t∧τ} ∈ [a, b] and P(τ = ∞) = 0, B_{t∧τ} → B_τ ∈ {a, b} boundedly as t ↑ ∞, and hence we may use the dominated convergence theorem to pass to the limit in Eq. (15.28) in order to learn

−a = E[B_τ − a] = 0 · P(B_τ = a) + (b − a) P(B_τ = b) = (b − a) P(τ_b < τ_a).

This proves the first equality in Eq. (15.27). Since {τ_b < τ_a} ⊂ {τ_b < ∞} for all a, it follows that

P(τ_b < ∞) ≥ (−a)/(b − a) → 1 as a ↓ −∞,

and therefore P(τ_b < ∞) = 1, proving the second equality in Eq. (15.27).
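The hitting probability P(τ_b < τ_a) = −a/(b − a) can be checked with the same kind of discretized simulation (a rough sketch with our own parameter choices; the discrete scheme again carries a small O(√dt) boundary-overshoot bias):

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = -2.0, 1.0                  # predicted P(tau_b < tau_a) = -a/(b - a) = 2/3
dt, n_paths = 1e-3, 20_000

B = np.zeros(n_paths)
hit_b = np.zeros(n_paths, dtype=bool)
alive = np.ones(n_paths, dtype=bool)
while alive.any():
    B[alive] += rng.normal(0.0, np.sqrt(dt), size=alive.sum())
    hit_b |= alive & (B >= b)     # record paths that reach b before a
    alive &= (B > a) & (B < b)    # a path dies once it leaves (a, b)

print(hit_b.mean())               # should be close to 2/3
```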
Exercise 15.5. By considering the martingale,

M_t := e^{λB_t − λ^2 t/2},

show

E_0[e^{−λτ_a}] = e^{−a√(2λ)}.    (15.29)
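Equation (15.29) can be sanity-checked without simulating any paths: by Lemma 15.45 below, P(τ_a ≤ t) = P(|B_t| ≥ a), which implies τ_a has the same law as a^2/Z^2 with Z standard normal. The sketch below (our own parameter values a = 1.3, λ = 0.7) samples τ_a exactly this way:

```python
import numpy as np

rng = np.random.default_rng(4)
a, lam = 1.3, 0.7

# tau_a =d a**2 / Z**2 with Z standard normal, since
# P(tau_a <= t) = P(|B_t| >= a) = P(a**2 / Z**2 <= t).
Z = rng.normal(size=2_000_000)
tau = a ** 2 / Z ** 2

mc = np.mean(np.exp(-lam * tau))
exact = np.exp(-a * np.sqrt(2.0 * lam))
print(mc, exact)                 # the Monte Carlo mean matches e^{-a sqrt(2 lam)}
```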
Remark 15.30. Equation (15.29) may also be proved using Lemma 15.45 directly. Indeed, if

p_t(x) := (1/√(2πt)) e^{−x^2/(2t)},

then

E_0[e^{−λτ_a}] = E_0[∫_{τ_a}^∞ λ e^{−λt} dt] = E_0[∫_0^∞ 1_{τ_a<t} λ e^{−λt} dt]
 = ∫_0^∞ P(τ_a < t) λ e^{−λt} dt = −2 ∫_0^∞ [∫_a^∞ p_t(x) dx] (d/dt) e^{−λt} dt.

Integrating by parts in t shows

−2 ∫_0^∞ [∫_a^∞ p_t(x) dx] (d/dt) e^{−λt} dt = 2 ∫_0^∞ [∫_a^∞ (d/dt) p_t(x) dx] e^{−λt} dt
 = 2 ∫_0^∞ [∫_a^∞ (1/2) p_t''(x) dx] e^{−λt} dt
 = −∫_0^∞ p_t'(a) e^{−λt} dt
 = −∫_0^∞ (1/√(2πt)) (−a/t) e^{−a^2/(2t)} e^{−λt} dt
 = (a/√(2π)) ∫_0^∞ t^{−3/2} e^{−a^2/(2t)} e^{−λt} dt,

where the last integral may be evaluated to be e^{−a√(2λ)} using the Laplace transform function of Mathematica.

Alternatively, consider

∫_0^∞ p_t(x) e^{−λt} dt = ∫_0^∞ (1/√(2πt)) e^{−x^2/(2t)} e^{−λt} dt = (1/√(2π)) ∫_0^∞ (1/√t) exp(−x^2/(2t) − λt) dt,

which may be evaluated using the theory of the Fourier transform (or characteristic functions if you prefer) as follows:

(√(2π)/(2m)) e^{−m|x|} = ∫_R (|ξ|^2 + m^2)^{−1} e^{iξx} d̂ξ
 = ∫_R [∫_0^∞ e^{−λ(|ξ|^2+m^2)} dλ] e^{iξx} d̂ξ
 = ∫_0^∞ dλ [e^{−λm^2} ∫_R d̂ξ e^{−λ|ξ|^2} e^{iξx}]
 = ∫_0^∞ e^{−λm^2} (2λ)^{−1/2} e^{−x^2/(4λ)} dλ,

where d̂ξ := (2π)^{−1/2} dξ. Now make an appropriate change of variables and carry out the remaining x integral to arrive at the result.
15.6 Scaling Properties of B. M.
Theorem 15.31 (Transformations preserving B. M.). Let {B_t}_{t≥0} be a Brownian motion and 𝓑_t := σ(B_s : s ≤ t). Then:

1. b_t = −B_t is again a Brownian motion.
2. if c > 0, then b_t := c^{−1/2} B_{ct} is again a Brownian motion.
3. b_t := tB_{1/t} for t > 0 and b_0 = 0 is a Brownian motion. In particular, lim_{t↓0} tB_{1/t} = 0 a.s.
4. for all T ∈ (0, ∞), b_t := B_{t+T} − B_T for t ≥ 0 is again a Brownian motion which is independent of 𝓑_T.
5. for all T ∈ (0, ∞), b_t := B_{T−t} − B_T for 0 ≤ t ≤ T is again a Brownian motion on [0, T].

Proof. It is clear that in each of the five cases above {b_t}_{t≥0} is still a Gaussian process. Hence to finish the proof it suffices to verify E[b_t b_s] = s ∧ t, which is routine in all cases. Let us work out item 3. in detail to illustrate the method. For 0 < s < t,

E[b_s b_t] = st E[B_{1/s} B_{1/t}] = st (s^{−1} ∧ t^{−1}) = st · t^{−1} = s.

Notice that t → b_t is continuous for t > 0, so to finish the proof we must show that lim_{t↓0} b_t = 0 a.s. However, this follows from "Kolmogorov's continuity criteria," which we do not cover here.
Corollary 15.32 (B. M. Law of Large Numbers). Suppose {B_t}_{t≥0} is a Brownian motion. Then, almost surely,

lim sup_{t→∞} |B_t|/t^β = 0 if β > 1/2 and lim sup_{t→∞} |B_t|/t^β = ∞ if β ∈ (0, 1/2).    (15.30)

Proof. We omit the full proof here, other than to note that we may replace the limiting behavior at ∞ by statements about the limiting behavior as t ↓ 0, because b_t := tB_{1/t} for t > 0 and b_0 = 0 is a Brownian motion.
15.7 Random Walks to Brownian Motion
Let {X_j}_{j=1}^∞ be a sequence of independent Bernoulli random variables with P(X_j = ±1) = 1/2, and let W_0 = 0 and W_n = X_1 + ··· + X_n be the random walk on Z. For each ε > 0, we would like to consider W_n at n = t/ε. We cannot expect W_{t/ε} to have a limit as ε → 0 without further scaling. To see what scaling is needed, recall that

Var(X_1) = E[X_1^2] = (1/2)·1^2 + (1/2)·(−1)^2 = 1,

and therefore Var(W_n) = n. Thus we have

Var(W_{t/ε}) = t/ε,

and hence to get a limit we should scale W_{t/ε} by √ε. These considerations motivate the following theorem.
Theorem 15.33. For all ε > 0, let {B_ε(t)}_{t≥0} be the continuous time process defined as follows:

1. If t = nε for some n ∈ N_0, let B_ε(nε) := √ε W_n, and
2. if nε < t < (n+1)ε, let B_ε(t) be given by

B_ε(t) = B_ε(nε) + ((t − nε)/ε)(B_ε((n+1)ε) − B_ε(nε))
 = √ε W_n + ((t − nε)/ε)(√ε W_{n+1} − √ε W_n)
 = √ε W_n + ((t − nε)/ε) √ε X_{n+1},

i.e. B_ε(t) is the linear interpolation between (nε, √ε W_n) and ((n+1)ε, √ε W_{n+1}); see Figure 15.2.

Then B_ε =⇒ B ("weak convergence") as ε ↓ 0, where B is a continuous random process which we call a Brownian motion.

Fig. 15.2. The four graphs are constructed (in Excel) from a single realization of a random walk. Each graph corresponds to a different scaling parameter, namely, ε ∈ {2^{−4}, 2^{−8}, 2^{−12}, 2^{−14}}. It is clear from these pictures that B_ε(t) is not converging to B(t) for each realization. The convergence is only in law.
Remark 15.34. Theorem 15.33 shows that Brownian motion is the result of combining additively the effects of many small random steps.
We may allow for more general random walk limits, which is the content of Donsker's invariance principle, or the so-called functional central limit theorem. The setup is to start with a "random walk," S_n := X_1 + ··· + X_n, where {X_n}_{n=1}^∞ are i.i.d. random variables with zero mean and variance one. We then define for each n ∈ N the following continuous process,

B_n(t) := (1/√n)(S_{[nt]} + (nt − [nt]) X_{[nt]+1}),    (15.31)

where for τ ∈ R_+, [τ] is the integer part of τ, i.e. the largest integer which is no greater than τ.

Theorem 15.35 (Donsker's invariance principle). Let Ω, B_n, and B be as above. Then B_n =⇒ B, i.e.

lim_{n→∞} E[F(B_n)] = E[F(B)]

for all bounded continuous functions F : Ω → R.
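A one-dimensional marginal of this convergence is easy to see numerically: for the Bernoulli walk, B_n(1) = W_n/√n should be approximately N(0, 1) for large n. In the sketch below (sample sizes are our own choices) W_n is sampled directly as 2·Binomial(n, 1/2) − n rather than by summing steps:

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_paths = 10_000, 1_000_000

# W_n = X_1 + ... + X_n with P(X_j = +-1) = 1/2 has the law of
# 2*Binomial(n, 1/2) - n; then B_n(1) = W_n / sqrt(n) as in Eq. (15.31).
W = 2.0 * rng.binomial(n, 0.5, size=n_paths) - n
Bn1 = W / np.sqrt(n)

print(Bn1.mean(), Bn1.var())       # should approximate E[B_1] = 0 and Var(B_1) = 1
print(np.mean(Bn1 <= 1.0))         # should approximate P(B_1 <= 1), about 0.8413
```

This checks only the time-1 marginal; Donsker's theorem is the much stronger statement about the law of the whole path.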
One method of proof, see [10] and [17], goes by the following two steps: 1) show that the finite dimensional distributions of B_n converge to those of B, as was done in Proposition ??; 2) show the distributions of B_n are tight. A proof of the tightness may be based on Ottaviani's maximal inequality in Corollary ??, [9, Corollary 16.7]. Another possible proof of this theorem is based on "Skorokhod's representation," see (for example) [9, Theorem 14.9 on p. 275] or [14].
15.8 Path Regularity Properties of BM
For the rest of this chapter we will assume that {B_t}_{t≥0} is a Brownian motion on some probability space, (Ω, F, {F_t}_{t≥0}, P), and F_t := σ(B_s : s ≤ t).

Notation 15.36 (Partitions) Given Π := {0 = t_0 < t_1 < ··· < t_n = T}, a partition of [0, T], let

∆_i B := B_{t_i} − B_{t_{i−1}} and ∆_i t := t_i − t_{i−1}

for all i = 1, 2, ..., n. Further let mesh(Π) := max_i |∆_i t| denote the mesh of the partition Π.
Exercise 15.6 (Quadratic Variation). Let

Π_m := {0 = t_0^m < t_1^m < ··· < t_{n_m}^m = T}

be a sequence of partitions such that mesh(Π_m) → 0 as m → ∞. Further let

Q_m := ∑_{i=1}^{n_m} (∆_i B)^2 := ∑_{i=1}^{n_m} (B_{t_i^m} − B_{t_{i−1}^m})^2.    (15.32)

Show

lim_{m→∞} E[(Q_m − T)^2] = 0.

Also show that

∑_{m=1}^∞ mesh(Π_m) < ∞ =⇒ E[∑_{m=1}^∞ (Q_m − T)^2] < ∞ =⇒ lim_{m→∞} Q_m = T a.s.

[These results are often abbreviated by writing dB_t^2 = dt.]

Hints: it is useful to observe: 1)

Q_m − T = ∑_{i=1}^{n_m} [(∆_i B)^2 − ∆_i t],

and 2) using ∆_i B =d √(∆_i t) N(0, 1), one easily shows there is a constant, c < ∞, such that

E[(∆_i B)^2 − ∆_i t]^2 = c (∆_i t)^2.    (15.33)
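The convergence Q_m → T is visible on a single simulated path. The sketch below (grid sizes are our own choices) samples one Brownian path on a fine dyadic grid and evaluates Q_m on nested coarser partitions of the same path:

```python
import numpy as np

rng = np.random.default_rng(6)
T, n_fine = 1.0, 2 ** 20

# One Brownian path on a fine dyadic grid; coarser nested partitions
# Pi_m are obtained below by subsampling the same path.
dB = rng.normal(0.0, np.sqrt(T / n_fine), size=n_fine)
B = np.concatenate([[0.0], np.cumsum(dB)])

for n_m in (2 ** 4, 2 ** 8, 2 ** 12, 2 ** 16):
    idx = np.arange(0, n_fine + 1, n_fine // n_m)   # partition with n_m cells
    Q_m = np.sum(np.diff(B[idx]) ** 2)
    print(n_m, Q_m)                                  # Q_m approaches T = 1 as mesh -> 0
```

By Eq. (15.33), Var(Q_m) is of order mesh(Π_m), which matches the shrinking fluctuations seen in the printed values.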
Proposition 15.37. Suppose that {Π_m}_{m=1}^∞ is a sequence of partitions of [0, T] such that Π_m ⊂ Π_{m+1} for all m and mesh(Π_m) → 0 as m → ∞. Then Q_m → T a.s., where Q_m is defined as in Eq. (15.32).
Corollary 15.38 (Roughness of Brownian Paths). A Brownian motion, {B_t}_{t≥0}, is not almost surely α-Hölder continuous for any α > 1/2.

Proof. According to Exercise 15.6, we may choose partitions, Π_m, such that mesh(Π_m) → 0 and Q_m → T a.s. If B were α-Hölder continuous for some α > 1/2, then

Q_m = ∑_{i=1}^{n_m} (∆_i^m B)^2 ≤ C ∑_{i=1}^{n_m} (∆_i^m t)^{2α} ≤ C max_i([∆_i^m t]^{2α−1}) ∑_{i=1}^{n_m} ∆_i^m t
 ≤ C [mesh(Π_m)]^{2α−1} T → 0 as m → ∞,

which contradicts the fact that Q_m → T as m → ∞.
15.9 The Strong Markov Property of Brownian Motion
Notation 15.39 If ν is a probability measure on R^d, we let P_ν indicate that we are starting our Brownian motion with B_0 distributed as ν. For Brownian motion, the expectation relative to P_ν can be defined by

E_ν[F(B_·)] = ∫_{R^d} E_0[F(x + B_·)] dν(x).
Theorem 15.40 (Strong Markov Property). Let ν be a probability measure on R^d, τ be an optional time with P_ν(τ < ∞) > 0, and let

b_t := B_{t+τ} − B_τ on {τ < ∞}.

Then, conditioned on {τ < ∞}, b is a Brownian motion starting at 0 ∈ R^d which is independent of F_τ^+. To be more precise, we are claiming, for all bounded measurable functions F : Ω → R^d and all A ∈ F_τ^+, that

E_ν[F(b) | τ < ∞] = E_0[F]    (15.34)

and

E_ν[F(b) 1_A | τ < ∞] = E_ν[F(b) | τ < ∞] · E_ν[1_A | τ < ∞].    (15.35)
Corollary 15.41. Let ν be a probability measure on (R^d, B_{R^d}), T > 0, and let

b_t := B_{t+T} − B_T for t ≥ 0.

Then b is a Brownian motion starting at 0 ∈ R^d which is independent of F_T, relative to the probability measure, P_ν, describing Brownian motion started with B_0 distributed as ν.

Proof. This is a direct consequence of Theorem 15.40 with τ = T, or it can be proved directly using the Markov property alone.
Proposition 15.42 (Stitching Lemma). Suppose that {B_t}_{t≥0} and {X_t}_{t≥0} are Brownian motions, τ is an optional time, and {X_t}_{t≥0} is independent of F_τ^+. Then

B̃_t := B_{t∧τ} + X_{(t−τ)_+}

is another Brownian motion.
For the next couple of results we will follow [9, Chapter 13] (also see [10,Section 2.8]) where more results along this line may be found.
Theorem 15.43 (Reflection Principle). Let τ be an optional time and {B_t}_{t≥0} be a Brownian motion. Then the "reflected" process (see Figure 15.3),

B̃_t := B_{t∧τ} − (B_t − B_{t∧τ}) = B_t if t ≤ τ, and = B_τ − (B_t − B_τ) if t > τ,

is again a Brownian motion.
Proof. Let T > 0 be given and set b_t := B_{t+τ∧T} − B_{τ∧T}. Then applying Proposition 15.42 with X = −b (which is again a Brownian motion independent of F_{τ∧T}^+) and τ replaced by τ ∧ T, we learn that B̃|_{[0,T]} =d B|_{[0,T]}. Again as T > 0 was arbitrary, we may conclude that B̃ =d B.
Lemma 15.44. If d = 1, then

P_0(B_t > a) ≤ min(√(t/(2πa^2)) e^{−a^2/(2t)}, (1/2) e^{−a^2/(2t)})

for all t > 0 and a > 0.

Proof. This follows from Lemma ?? and the fact that B_t =d √t N, where N is a standard normal random variable.
Lemma 15.45 (Running maximum). If B_t is a 1-dimensional Brownian motion and z > 0 and y ≥ 0, then

P(max_{s≤t} B_s ≥ z, B_t < z − y) = P(B_t > z + y)    (15.36)

and

P(max_{s≤t} B_s ≥ z) = 2P(B_t > z) = P(|B_t| > z).    (15.37)

In particular we have max_{s≤t} B_s =d |B_t|.
Fig. 15.3. A Brownian path B along with its reflection B̃ about the first time, τ, that B_t hits the level z.
Proof. Proof of Eq. (15.37). Let z > 0 be given. Since {max_{s≤t} B_s ≥ z, B_t ≥ z} = {B_t ≥ z} and P(B_t ≥ z) = P(B_t > z), we find

P(max_{s≤t} B_s ≥ z) = P(max_{s≤t} B_s ≥ z, B_t ≥ z) + P(max_{s≤t} B_s ≥ z, B_t < z)
 = P(B_t ≥ z) + P(max_{s≤t} B_s ≥ z, B_t < z)
 = P(B_t > z) + P(max_{s≤t} B_s ≥ z, B_t < z).

So to finish the proof of Eq. (15.37) it suffices to verify Eq. (15.36) for y = 0. Let τ := T_z = inf{t > 0 : B_t = z} and let B̃_t := B_{t∧τ} − b_{(t−τ)_+}, where b_t := (B_{t+τ} − B_τ) 1_{τ<∞}; see Figure 15.3. (By Corollary 15.32 we actually know that T_z < ∞ a.s., but we will not use this fact in the proof.) Observe from Figure 15.3 that

τ̃ := inf{t > 0 : B̃_t = z} = τ,
{B_t < z} = {B̃_t > z} on {τ ≤ t}, and    (15.38)
{max_{s≤t} B_s ≥ z} = {τ ≤ t} = {τ̃ ≤ t}.

Hence we have

P(max_{s≤t} B_s ≥ z, B_t < z) = P(τ ≤ t, B_t < z) = P(τ̃ ≤ t, B̃_t > z)
 = P(τ ≤ t, B_t > z) = P(B_t > z),    (15.39)

wherein we have used Law(B_·, τ) = Law(B̃_·, τ̃) in the third equality and {B_t > z} ⊂ {τ ≤ t} for the last equality.

Proof of Eq. (15.36) for general y ≥ 0. We simply follow the above proof, except we use the identity,

{B_t < z − y} = {B̃_t > z + y} on {τ ≤ t},

in place of Eq. (15.38). Since {B̃_t > z + y} ⊂ {τ̃ ≤ t}, working as above we find

P(max_{s≤t} B_s ≥ z, B_t < z − y) = P(τ ≤ t, B_t < z − y) = P(τ̃ = τ ≤ t, B̃_t > z + y)
 = P(τ ≤ t, B_t > z + y) = P(B_t > z + y).
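The reflection identity of Eq. (15.37) can be observed in simulation by comparing the discretely sampled running maximum with the endpoint distribution (a sketch with our own parameters; sampling the path on a grid slightly underestimates the true continuous maximum, so only approximate agreement should be expected):

```python
import numpy as np

rng = np.random.default_rng(7)
t, n_steps, n_paths = 1.0, 1_000, 10_000
dt = t / n_steps

dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
B = np.cumsum(dB, axis=1)
run_max = np.maximum(B.max(axis=1), 0.0)   # max over s <= t, including B_0 = 0

z = 0.8
lhs = np.mean(run_max >= z)                # P(max_{s<=t} B_s >= z)
rhs = 2.0 * np.mean(B[:, -1] > z)          # 2 P(B_t > z), as claimed in Eq. (15.37)
print(lhs, rhs)
```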
Remark 15.46. Notice that

P(max_{s≤t} B_s ≥ z) = P(|B_t| > z) = P(√t |B_1| > z) = P(|B_1| > z/√t) → 1 as t → ∞,

and therefore it follows that sup_{s<∞} B_s = ∞ a.s. In particular this shows that T_z < ∞ a.s. for all z ≥ 0, which we have already seen in Corollary 15.32.
Corollary 15.47. Suppose now that T = inf{t > 0 : |B_t| = a}, i.e. the first time B_t leaves the strip (−a, a). Then

P_0(T < t) ≤ 4P_0(B_t > a) = (4/√(2πt)) ∫_a^∞ e^{−x^2/(2t)} dx ≤ min(√(8t/(πa^2)) e^{−a^2/(2t)}, 1).    (15.40)

Notice that P_0(T < t) = P_0(B_t^* ≥ a), where B_t^* = max{|B_τ| : τ ≤ t}. So Eq. (15.40) may be rewritten as

P_0(B_t^* ≥ a) ≤ 4P_0(B_t > a) ≤ min(√(8t/(πa^2)) e^{−a^2/(2t)}, 1) ≤ 2e^{−a^2/(2t)}.    (15.41)

Proof. By definition T = T_a ∧ T_{−a}, so that {T < t} = {T_a < t} ∪ {T_{−a} < t} and therefore

P_0(T < t) ≤ P_0(T_a < t) + P_0(T_{−a} < t) = 2P_0(T_a < t) = 4P_0(B_t > a) = (4/√(2πt)) ∫_a^∞ e^{−x^2/(2t)} dx
 ≤ (4/√(2πt)) ∫_a^∞ (x/a) e^{−x^2/(2t)} dx = (4/√(2πt)) (−(t/a) e^{−x^2/(2t)})|_a^∞ = √(8t/(πa^2)) e^{−a^2/(2t)}.

This proves everything but the very last inequality in Eq. (15.41). To prove this inequality, first observe the elementary calculus inequality:

min((4/(√(2π) y)) e^{−y^2/2}, 1) ≤ 2e^{−y^2/2}.    (15.42)

Indeed, Eq. (15.42) holds if 4/(√(2π) y) ≤ 2, i.e. if y ≥ y_0 := 2/√(2π). The fact that Eq. (15.42) holds for y ≤ y_0 follows from the trivial inequality

1 ≤ 1.4552 ≅ 2e^{−1/π} = 2e^{−y_0^2/2}.

Finally, letting y = a/√t in Eq. (15.42) gives the last inequality in Eq. (15.41).
Corollary 15.48 (Fernique's Theorem). For all λ > 0,

E[e^{λ[max_{s≤t} B_s]^2}] = 1/√(1 − 2λt) if λ < 1/(2t), and = ∞ if λ ≥ 1/(2t).    (15.43)

Moreover,

E[e^{λ‖B_·‖_{∞,t}^2}] < ∞ iff λ < 1/(2t).    (15.44)

Proof. From Lemma 15.45,

E[e^{λ[max_{s≤t} B_s]^2}] = E[e^{λ|B_t|^2}] = E[e^{λt|B_1|^2}] = (1/√(2π)) ∫_R e^{λtx^2} e^{−x^2/2} dx,

which equals 1/√(1 − 2λt) if λ < 1/(2t) and ∞ if λ ≥ 1/(2t). By Corollary 15.47 one can show E[e^{λ‖B_·‖_{∞,t}^2}] < ∞ if λ < 1/(2t), while if λ ≥ 1/(2t) we will have

E[e^{λ‖B_·‖_{∞,t}^2}] ≥ E[e^{λ[max_{s≤t} B_s]^2}] = ∞.

Alternative weak bound. For a cruder estimate, notice that

‖B_·‖_{∞,t} := max_{s≤t} |B_s| = [max_{s≤t} B_s] ∨ [max_{s≤t}(−B_s)]

and hence

‖B_·‖_{∞,t}^2 = [max_{s≤t} B_s]^2 ∨ [max_{s≤t}(−B_s)]^2 ≤ [max_{s≤t} B_s]^2 + [max_{s≤t}(−B_s)]^2.

Therefore, using that {−B_s}_{s≥0} is still a Brownian motion along with the Cauchy–Schwarz inequality,

E[e^{λ‖B_·‖_{∞,t}^2}] ≤ E[e^{λ[max_{s≤t} B_s]^2} · e^{λ[max_{s≤t}(−B_s)]^2}]
 ≤ √(E[e^{2λ[max_{s≤t} B_s]^2}] · E[e^{2λ[max_{s≤t}(−B_s)]^2}]) = E[e^{2λ[max_{s≤t} B_s]^2}] < ∞ if λ < 1/(4t).
15.10 Dirichlet Problem and Brownian Motion
Theorem 15.49 (The Dirichlet problem). Suppose D is an open subset of R^d and τ := inf{t ≥ 0 : B_t ∈ D^c} is the first exit time from D. Given a bounded measurable function, f : bd(D) → R, let u : D → R be defined by (see Figure 15.4),

u(x) := E_x[f(B_τ) : τ < ∞] for x ∈ D.

Then u ∈ C^∞(D) and ∆u = 0 on D, i.e. u is a harmonic function.
Fig. 15.4. Brownian motion starting at x ∈ D and exiting on the boundary of D atBτ .
Proof. (Sketch.) Let x ∈ D and r > 0 be such that B(x, r) ⊂ D, and let σ := inf{t ≥ 0 : B_t ∉ B(x, r)} as in Figure 15.5. Setting F := f(B_τ) 1_{τ<∞}, we see that F ∘ θ_σ = F on {σ < ∞}. Moreover, by either Corollary 15.32 or by Lemma 15.45 (see Remark 15.46), we know that σ < ∞, P_x-a.s. Therefore by the strong Markov property,

u(x) = E_x[F] = E_x[F : σ < ∞] = E_x[F ∘ θ_σ : σ < ∞] = E_x[E_{B_σ} F : σ < ∞] = E_x[u(B_σ)].

Using the rotation invariance of Brownian motion, we may conclude that

E_x[u(B_σ)] = (1/ρ(bd(B(x, r)))) ∫_{bd(B(x,r))} u(y) dρ(y),

where ρ denotes surface measure on bd(B(x, r)). This shows that u satisfies the mean value property, i.e. u(x) is equal to its average over any sphere centered at x which is contained in D. It is well known, see for example [10, Proposition 4.2.5 on p. 242], that this property implies u ∈ C^∞(D) and ∆u = 0.
Fig. 15.5. Brownian motion starting at x ∈ D and exiting the boundary of B (x, r)at Bσ before exiting on the boundary of D at Bτ .
When the boundary of D is sufficiently regular and f is continuous on bd(D), it can be shown, for x ∈ bd(D), that u(y) → f(x) as y ∈ D tends to x. For more details in this direction, see [1], [2], [10, Section 4.2], [8], and [7].
16
A short introduction to Ito’s calculus
16.1 A short introduction to Ito’s calculus
Definition 16.1. The Ito integral of an adapted process¹, {f_t}_{t≥0}, is defined by

∫_0^T f dB = lim_{|Π|→0} ∑_{i=1}^n f_{t_{i−1}} (B_{t_i} − B_{t_{i−1}})    (16.1)

when the limit exists. Here Π denotes a partition of [0, T], so that Π = {0 = t_0 < t_1 < ··· < t_n = T}, and

|Π| = max_{1≤i≤n} ∆_i t, where ∆_i t = t_i − t_{i−1}.    (16.2)
Proposition 16.2. Keeping the notation in Definition 16.1 and further assuming E f_t² < ∞ for all t, we have
E[Σ_{i=1}^n f_{t_{i-1}} (B_{t_i} − B_{t_{i-1}})] = 0
and
E[Σ_{i=1}^n f_{t_{i-1}} (B_{t_i} − B_{t_{i-1}})]² = E Σ_{i=1}^n f²_{t_{i-1}} (t_i − t_{i-1}).
Proof. Since (B_{t_i} − B_{t_{i-1}}) is independent of f_{t_{i-1}}, we have
E[Σ_{i=1}^n f_{t_{i-1}} (B_{t_i} − B_{t_{i-1}})] = Σ_{i=1}^n E f_{t_{i-1}} · E(B_{t_i} − B_{t_{i-1}}) = Σ_{i=1}^n E f_{t_{i-1}} · 0 = 0.
For the second assertion, we write,
¹ To say f is adapted means that for each t ≥ 0, f_t should only depend on {B_s}_{s≤t}, i.e. f_t = F_t({B_s}_{s≤t}).
[Σ_{i=1}^n f_{t_{i-1}} (B_{t_i} − B_{t_{i-1}})]² = Σ_{i,j=1}^n f_{t_{j-1}} (B_{t_j} − B_{t_{j-1}}) f_{t_{i-1}} (B_{t_i} − B_{t_{i-1}}).
If j < i, then f_{t_{j-1}} (B_{t_j} − B_{t_{j-1}}) f_{t_{i-1}} is independent of (B_{t_i} − B_{t_{i-1}}) and therefore,
E[f_{t_{j-1}} (B_{t_j} − B_{t_{j-1}}) f_{t_{i-1}} (B_{t_i} − B_{t_{i-1}})] = E[f_{t_{j-1}} (B_{t_j} − B_{t_{j-1}}) f_{t_{i-1}}] · E(B_{t_i} − B_{t_{i-1}}) = 0.
Similarly, if i < j,
E[f_{t_{j-1}} (B_{t_j} − B_{t_{j-1}}) f_{t_{i-1}} (B_{t_i} − B_{t_{i-1}})] = 0.
Therefore,
E[Σ_{i=1}^n f_{t_{i-1}} (B_{t_i} − B_{t_{i-1}})]²
= Σ_{i,j=1}^n E[f_{t_{j-1}} (B_{t_j} − B_{t_{j-1}}) f_{t_{i-1}} (B_{t_i} − B_{t_{i-1}})]
= Σ_{i=1}^n E[f_{t_{i-1}} (B_{t_i} − B_{t_{i-1}}) f_{t_{i-1}} (B_{t_i} − B_{t_{i-1}})]
= Σ_{i=1}^n E[f²_{t_{i-1}} (B_{t_i} − B_{t_{i-1}})²]
= Σ_{i=1}^n E f²_{t_{i-1}} · E(B_{t_i} − B_{t_{i-1}})²
= Σ_{i=1}^n E f²_{t_{i-1}} (t_i − t_{i-1}) = E Σ_{i=1}^n f²_{t_{i-1}} (t_i − t_{i-1}),
where in the fourth equality we have used that B_{t_i} − B_{t_{i-1}} is independent of f_{t_{i-1}}.
This proposition motivates the following theorem, which will not be proved in full here.
Theorem 16.3. If {f_t}_{t≥0} is an adapted process such that E∫_0^T f_t² dt < ∞, then the Ito integral, ∫_0^T f dB, exists and satisfies,
E∫_0^T f dB = 0 and E(∫_0^T f dB)² = E∫_0^T f_t² dt.
[More generally, M_t := ∫_0^t f dB is a martingale such that M_t² − ∫_0^t f_s² ds is also a martingale.]
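The two identities of Theorem 16.3 can be illustrated numerically for the integrand f_t = B_t, for which E∫_0^1 f_t² dt = ∫_0^1 t dt = 1/2. The sketch below (helper name, partition size, and sample count are my own choices) averages the discrete Riemann-Ito sums over many simulated paths:

```python
import math
import random

def ito_sum(path):
    """Riemann-Ito sum  sum_i B_{t_{i-1}} (B_{t_i} - B_{t_{i-1}})  for f_t = B_t."""
    return sum(path[i - 1] * (path[i] - path[i - 1]) for i in range(1, len(path)))

rng = random.Random(1)
n, T = 50, 1.0
s = math.sqrt(T / n)  # increment standard deviation
sums = []
for _ in range(20000):
    path = [0.0]
    for _ in range(n):
        path.append(path[-1] + s * rng.gauss(0.0, 1.0))
    sums.append(ito_sum(path))

mean = sum(v for v in sums) / len(sums)
second = sum(v * v for v in sums) / len(sums)
# Theorem 16.3 predicts mean ~ 0 and second ~ 1/2 (up to O(1/n) partition error).
```

The finite partition makes the second moment (T²/2)(1 − 1/n) rather than T²/2 exactly, which is within the Monte Carlo noise here.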
Corollary 16.4. In particular, if τ is a bounded stopping time (say τ ≤ T < ∞), then
E∫_0^τ f dB = 0 and E(∫_0^τ f dB)² = E∫_0^τ f_t² dt.
Proof. The point is that, by the definition of a stopping time, 1_{0≤t≤τ} f_t is still an adapted process. Therefore we have,
E∫_0^τ f dB = E[∫_0^T 1_{0≤t≤τ} f_t dB_t] = 0
and
E(∫_0^τ f dB)² = E[∫_0^T 1_{0≤t≤τ} f_t dB_t]² = E[∫_0^T (1_{0≤t≤τ} f_t)² dt] = E[∫_0^τ f_t² dt].
Theorem 16.5 (Ito's Lemma). If f(t,x) is a C¹-function such that ∂_x² f(t,x) exists and is continuous, then
df(t,B_t) = [∂_t f(t,B_t) + ½ ∂_x² f(t,B_t)] dt + ∂_x f(t,B_t) dB_t = Lf(t,B_t) dt + ∂_x f(t,B_t) dB_t.
More precisely,
f(T,B_T) = f(0,B_0) + ∫_0^T ∂_x f(t,B_t) dB_t + ∫_0^T [∂_t f(t,B_t) + ½ ∂_x² f(t,B_t)] dt.   (16.3)
Roughly speaking, all differentials should be expanded out to second order using the Ito multiplication rules,
dB² = dt and dB dt = 0 = dt².
Proof. The rough idea is as follows. Let Π = {0 = t_0 < t_1 < · · · < t_n = T} be a partition of [0,T], so that by a telescoping series argument,
f(T,B_T) − f(0,B_0) = Σ_{i=1}^n [f(t_i,B_{t_i}) − f(t_{i-1},B_{t_{i-1}})].
Then by Taylor's theorem,
f(t_i,B_{t_i}) − f(t_{i-1},B_{t_{i-1}}) = ∂_t f(t_{i-1},B_{t_{i-1}}) Δ_i t + ∂_x f(t_{i-1},B_{t_{i-1}}) Δ_i B + ½ ∂_x² f(t_{i-1},B_{t_{i-1}}) [Δ_i B]² + O(Δ_i t², Δ_i B³).
The error terms are negligible, i.e.
lim_{|Π|→0} Σ_{i=1}^n O(Δ_i t², Δ_i B³) = 0,
and hence
f(T,B_T) − f(0,B_0) = lim_{|Π|→0} Σ_{i=1}^n [f(t_i,B_{t_i}) − f(t_{i-1},B_{t_{i-1}})]
= lim_{|Π|→0} Σ_{i=1}^n ∂_t f(t_{i-1},B_{t_{i-1}}) Δ_i t + lim_{|Π|→0} Σ_{i=1}^n ∂_x f(t_{i-1},B_{t_{i-1}}) Δ_i B + ½ lim_{|Π|→0} Σ_{i=1}^n ∂_x² f(t_{i-1},B_{t_{i-1}}) [Δ_i B]²
= ∫_0^T ∂_t f(t,B_t) dt + ∫_0^T ∂_x f(t,B_t) dB_t + ½ ∫_0^T ∂_x² f(t,B_t) dt,
where the last term follows by an extension of Exercise 15.6.
Example 16.6. Let us verify Ito's lemma in the special case where f(x) = x². For T > 0 and
Π = {0 = t_0 < t_1 < · · · < t_n = T}
a partition of [0,T], we have, with Δ_i B := B_{t_i} − B_{t_{i-1}},
B_T² = Σ_{i=1}^n (B²_{t_i} − B²_{t_{i-1}}) = Σ_{i=1}^n (B_{t_i} + B_{t_{i-1}})(B_{t_i} − B_{t_{i-1}}) = Σ_{i=1}^n (2B_{t_{i-1}} + Δ_i B) Δ_i B
= 2 Σ_{i=1}^n B_{t_{i-1}} Δ_i B + Q_Π → 2∫_0^T B dB + T as |Π| → 0
(where Q_Π := Σ_{i=1}^n (Δ_i B)²), and we have used Exercise 15.6 in finding the limit. Hence we conclude
B_T² = 2∫_0^T B dB + T,
which is exactly Eq. (16.3) for f(x) = x².
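The displayed telescoping identity holds exactly for every discrete path, while the quadratic variation Q_Π concentrates at T as the mesh shrinks. A quick check (pure Python sketch; the step count and seed are arbitrary):

```python
import math
import random

rng = random.Random(2)
n, T = 100000, 1.0
s = math.sqrt(T / n)
B = [0.0]
for _ in range(n):
    B.append(B[-1] + s * rng.gauss(0.0, 1.0))

ito = sum(B[i - 1] * (B[i] - B[i - 1]) for i in range(1, n + 1))
qv = sum((B[i] - B[i - 1]) ** 2 for i in range(1, n + 1))  # Q_Pi

# Discrete identity: B_T^2 = 2 * sum B_{t_{i-1}} dB + sum (dB)^2, exactly.
lhs = B[-1] ** 2
rhs = 2 * ito + qv
```

Here lhs and rhs agree to floating-point precision, while qv is close to T = 1 with fluctuations of order sqrt(2/n).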
16.2 Ito’s formula, heat equations, and harmonic functions
Suppose D is a bounded open subset of R^d and x ∈ D. Let B_t = (B_t¹, . . . , B_t^d) be a standard d-dimensional Brownian motion and let X_t := x + B_{t∧τ} where
τ = inf{t ≥ 0 : x + B_t ∉ D}
is the first exit time from D. [Warning: I will be using the "natural" multi-dimensional and "stopped" extensions of the Ito theory developed above.] In this multi-dimensional setting Ito's formula becomes,
df(t,B_t) = [∂_t f(t,B_t) + ½ Δf(t,B_t)] dt + ∇f(t,B_t) · dB_t.
The Ito multiplication table in this multi-dimensional setting is
dt · dB^i = 0, dt² = 0, dB^i dB^j = δ_{ij} dt.   (16.4)
Example 16.7 (Feynman-Kac formula). Suppose that V is a nice function on R^d and h(t,x) solves the partial differential equation,
∂_t h = ½ Δh − Vh with h(0,x) = f(x).
Given T > 0 let
M_t := h(T−t, X_t) e^{−∫_0^t V(X_s) ds} =: h(T−t, X_t) Z_t.
Then by Ito's lemma,
dM_t = ∇_x h(T−t, X_t) · Z_t dB_t + (½ Δh − ∂_t h)(T−t, X_t) Z_t dt − h(T−t, X_t) Z_t V(X_t) dt
= ∇_x h(T−t, X_t) · Z_t dB_t + (½ Δh − ∂_t h − Vh)(T−t, X_t) Z_t dt
= ∇_x h(T−t, X_t) · Z_t dB_t.
From this it follows that M_t is a martingale, and in particular EM_0 = EM_T, from which it follows
h(T,x) = E[f(X_T) e^{−∫_0^T V(X_s) ds}] = E[f(x+B_T) e^{−∫_0^T V(x+B_s) ds}].
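As a hedged numerical illustration of the Feynman-Kac formula: for a constant potential V ≡ c and f(x) = x², one checks directly that h(T,x) = e^{−cT}(x² + T) solves ∂_t h = ½Δh − Vh with h(0,x) = x². The sketch below (function name and all parameter values are my own choices) approximates the path integral ∫_0^T V(x+B_s) ds by a left-endpoint Riemann sum, as one would for a non-constant V:

```python
import math
import random

def feynman_kac_mc(x, T, V, f, n_steps=50, n_paths=20000, seed=3):
    """Monte Carlo estimate of h(T,x) = E[f(x + B_T) exp(-int_0^T V(x + B_s) ds)],
    with the time integral replaced by a left-endpoint Riemann sum."""
    rng = random.Random(seed)
    dt, s = T / n_steps, math.sqrt(T / n_steps)
    total = 0.0
    for _ in range(n_paths):
        pos, integral = x, 0.0
        for _ in range(n_steps):
            integral += V(pos) * dt
            pos += s * rng.gauss(0.0, 1.0)
        total += f(pos) * math.exp(-integral)
    return total / n_paths

# Test case: constant V = c, f(x) = x^2, so h(T,x) = e^{-cT} (x^2 + T).
c, x, T = 0.3, 0.5, 1.0
est = feynman_kac_mc(x, T, lambda y: c, lambda y: y * y)
exact = math.exp(-c * T) * (x * x + T)
```

For a constant potential the Riemann sum is exact, so the only errors are Monte Carlo fluctuations.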
Theorem 16.8. If u : D → R is a function such that Δu = g on D and u = f on ∂D, then
u(x) = E[f(x+B_τ) − ½ ∫_0^τ g(x+B_s) ds].   (16.5)
Proof. By Ito's formula,
d[u(X_t)] = 1_{t≤τ} ∇u(X_t) · dB_t + ½ (Δu)(X_t) 1_{t≤τ} dt = 1_{t≤τ} ∇u(X_t) · dB_t + ½ g(X_t) 1_{t≤τ} dt,
i.e.
u(X_T) = u(X_0) + ∫_0^T 1_{t≤τ} ∇u(X_t) · dB_t + ½ ∫_0^{T∧τ} g(X_t) dt.
Taking expectations of this equation and using X_0 = x shows
E[u(X_T)] = u(x) + E∫_0^T 1_{t≤τ} ∇u(X_t) · dB_t + ½ E∫_0^{T∧τ} g(X_t) dt = u(x) + ½ E∫_0^{T∧τ} g(X_t) dt.
Now letting T ↑ ∞ gives the desired result upon observing that
E[u(X_∞)] = E[u(x+B_τ)] = E[f(x+B_τ)].
Notation 16.9 In the future let us write E_x[F(B_(·))] for E[F(x+B_(·))].
Example 16.10. Suppose u : D → R solves Δu = −2 on D and u = 0 on ∂D; then Eq. (16.5) implies
u(x) = E_x[τ].   (16.6)
If instead we let u solve Δu = 0 on D and u = f on ∂D, then Eq. (16.5) implies
u(x) = E_x[f(B_τ)].
Here are some technical details for Eq. (16.6). If u solves Δu = −2 on D and u = 0 on ∂D, then by the maximum principle applied to −u we learn that u ≥ 0 on D. By the optional sampling theorem,
M_t := u(B_{t∧τ}) + t∧τ
is a martingale and therefore,
u(x) = E_x u(B_{0∧τ}) = E_x M_0 = E_x[u(B_{t∧τ}) + t∧τ] = E_x[u(B_{t∧τ})] + E_x[t∧τ]   (16.7)
≥ E_x[t∧τ].   (16.8)
We may now use the MCT to pass to the limit as t ↑ ∞ in the inequality in Eq. (16.8) to show E_x τ ≤ u(x) < ∞. Now that we know τ is integrable, and in particular P_x(τ < ∞) = 1, we may with the aid of the DCT let t ↑ ∞ in Eq. (16.7) to find,
u(x) = E_x[u(B_τ)] + E_x τ = E_x τ,
wherein we have used B_τ ∈ ∂D where u = 0 in the last equality.
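Eq. (16.6) is easy to test in one dimension: for D = (−1,1), u″ = −2 with u(±1) = 0 gives u(x) = 1 − x², so E_0[τ] = 1. A crude simulation (the step size and path count are my own choices; the time discretization slightly biases the exit time):

```python
import math
import random

def mean_exit_time(x, n_paths=4000, dt=2e-3, seed=4):
    """Estimate E_x[tau] for D = (-1, 1) in R by simulating 1-d Brownian motion
    on a time grid and recording the first grid time outside D."""
    rng = random.Random(seed)
    s = math.sqrt(dt)
    total = 0.0
    for _ in range(n_paths):
        pos, t = x, 0.0
        while -1.0 < pos < 1.0:
            pos += s * rng.gauss(0.0, 1.0)
            t += dt
        total += t
    return total / n_paths

# u(x) = 1 - x^2 predicts E_0[tau] = 1.
est = mean_exit_time(0.0)
```

Starting from x = 0.5 instead should give roughly 1 − 0.25 = 0.75, which is a useful second sanity check.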
16.3 A Simple Option Pricing Model
In this section we are going to try to explain the Black-Scholes formula for option pricing. The following excerpt is taken from http://en.wikipedia.org/wiki/Black-Scholes.
Robert C. Merton was the first to publish a paper expanding the mathematical understanding of the options pricing model, and coined the term "Black-Scholes" options pricing model, by enhancing work that was published by Fischer Black and Myron Scholes. The paper was first published in 1973. The foundation for their research relied on work developed by scholars such as Louis Bachelier, A. James Boness, Sheen T. Kassouf, Edward O. Thorp, and Paul Samuelson. The fundamental insight of Black-Scholes is that the option is implicitly priced if the stock is traded. Merton and Scholes received the 1997 Nobel Prize in Economics for this and related work. Though ineligible for the prize because of his death in 1995, Black was mentioned as a contributor by the Swedish academy.
Definition 16.11. A European stock option at time T with strike price K is a ticket that you would buy from a trader for the right to buy a particular stock at time T at a price K. If the stock price, S_T, at time T is greater than K, you could then buy the stock at price K and then instantly resell it for a profit of (S_T − K) dollars. If S_T < K, you would not turn in your ticket but would lose whatever you paid for the ticket. So the payoff of the option at time T is (S_T − K)_+.
Question: What should be the price (q) at time zero of such a stock option?
To answer this question, we will use a simplified version of a financial market which consists of only two assets: a no-risk bond worth β_t = β_0 e^{rt} (for some r > 0) dollars per share at time t and a risky stock worth S_t dollars per share. We are going to model S_t via a geometric "Brownian motion."
Definition 16.12 (Geometric Brownian Motion). Let σ > 0 and µ ∈ R be given parameters. We say that the solution to the "stochastic differential equation,"
dS_t / S_t = σ dB_t + µ dt   (16.9)
with S_0 being non-random is a geometric Brownian motion. More precisely, S_t is a solution to
S_t = S_0 + σ ∫_0^t S dB + µ ∫_0^t S_s ds.   (16.10)
(The parameters σ and µ measure the volatility and the drift or trend of the stock respectively.)
Notice that dS/S is the relative change of S and, formally, E(dS/S) = µ dt and Var(dS/S) = σ² dt. Taking expectations of Eq. (16.10) gives,
E S_t = S_0 + µ ∫_0^t E S_s ds.
Differentiating this equation then implies,
(d/dt) E S_t = µ E S_t with E S_0 = S_0,
which yields E S_t = S_0 e^{µt}. So on average, S_t is growing or decaying exponentially depending on the sign of µ.
Proposition 16.13 (Geometric Brownian motion). The stochastic differential Equation (16.10) has a unique solution given by
S_t = S_0 exp(σB_t + (µ − ½σ²)t).
Proof. We do not bother to give the proof of uniqueness here. To prove existence, let us look for a solution to Eq. (16.9) of the form
S_t = S_0 exp(aB_t + bt)
for some constants a and b. By Ito's lemma, using (d/dx)e^x = (d²/dx²)e^x = e^x and the multiplication rules, dB² = dt and dt² = dB · dt = 0, we find that
dS = S(a dB + b dt) + ½ S(a dB + b dt)² = S(a dB + b dt) + ½ S a² dt,
i.e.
dS/S = a dB + (b + ½a²) dt.
Comparing this with Eq. (16.9) shows that we should take a = σ and b = µ − ½σ² to get a solution.
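The closed form just derived makes E S_t = S_0 e^{µt} easy to verify by simulation, sampling S_t at a single time from B_t ~ N(0,t) (the parameter values below are arbitrary test choices):

```python
import math
import random

def gbm_terminal(S0, mu, sigma, t, rng):
    """Sample S_t = S_0 exp(sigma B_t + (mu - sigma^2/2) t) with B_t ~ N(0, t)."""
    B_t = math.sqrt(t) * rng.gauss(0.0, 1.0)
    return S0 * math.exp(sigma * B_t + (mu - 0.5 * sigma ** 2) * t)

rng = random.Random(5)
S0, mu, sigma, t = 1.0, 0.05, 0.2, 1.0
n = 20000
mean_S = sum(gbm_terminal(S0, mu, sigma, t, rng) for _ in range(n)) / n
# The ODE (d/dt) E S_t = mu E S_t predicts E S_t = S0 * e^{mu t}.
```

Note that the −½σ²t correction in the exponent is exactly what makes the mean come out to S_0 e^{µt} rather than something larger.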
Definition 16.14 (Holdings and Value Processes). Let (a_t, b_t) be the holdings process which denotes the number of shares of stock and bonds respectively that are held in the portfolio at time t. The value process, V_t, of the portfolio is
V_t = a_t S_t + b_t β_t.   (16.11)
Suppose time is partitioned as,
Π = {0 = t_0 < t_1 < t_2 < · · · < t_n = T}
for some time T in the future. Let us suppose that (a_t, b_t) is constant on the intervals, [0,t_1], (t_1,t_2], . . . , (t_{n-1},t_n], and let us write (a_t, b_t) = (a_{i-1}, b_{i-1}) for t_{i-1} < t ≤ t_i, see Figure 16.1.
Fig. 16.1. A possible graph of either a_t or b_t (in dollars) as a function of time.
Therefore the value of the portfolio is given by
V_t = a_{i-1} S_t + b_{i-1} β_t for t_{i-1} < t ≤ t_i.
We now assume that our holdings process is self-financing (i.e. we do not add any external money to the portfolio other than what was invested, V_0 = a_0 S_0 + b_0 β_0, at the initial time t = 0); then we must have²
a_{i-1} S_{t_i} + b_{i-1} β_{t_i} = V_{t_i} = a_i S_{t_i} + b_i β_{t_i} for all i.   (16.12)
That is to say, when we rebalance our portfolio at time t_i, we only use the money, V_{t_i} dollars, in the portfolio at time t_i. Using Eq. (16.12) at i and i−1 allows us to conclude,
V_{t_i} − V_{t_{i-1}} = a_{i-1} S_{t_i} + b_{i-1} β_{t_i} − (a_{i-1} S_{t_{i-1}} + b_{i-1} β_{t_{i-1}}) = a_{i-1} (S_{t_i} − S_{t_{i-1}}) + b_{i-1} (β_{t_i} − β_{t_{i-1}}) for all i,   (16.13)
² Equation (16.12) may be written as
(a_i − a_{i-1}) S_{t_i} + (b_i − b_{i-1}) β_{t_i} = 0.
This explains why the continuum limit of this equation is not S_t da_t + β_t db_t = 0 but rather must be interpreted as S_{t+dt} da_t + β_{t+dt} db_t = 0. It is also useful to observe that
d(XY)_t = X_{t+dt} Y_{t+dt} − X_t Y_t = (X_{t+dt} − X_t) Y_{t+dt} + X_t (Y_{t+dt} − Y_t),
and hence there is no quadratic differential term when d(XY) is written out this way.
which states that the change of the portfolio balance over the time interval (t_{i-1}, t_i] is due solely to the gain or loss made by the investments in the portfolio. (The Equations (16.12) and (16.13) are equivalent.) Summing Eq. (16.13) then gives,
V_{t_j} − V_0 = Σ_{i=1}^j a_{i-1} (S_{t_i} − S_{t_{i-1}}) + Σ_{i=1}^j b_{i-1} (β_{t_i} − β_{t_{i-1}})   (16.14)
= ∫_0^{t_j} a_r dS_r + ∫_0^{t_j} b_r dβ_r for all j.   (16.15)
More generally, if we throw any arbitrary point, t ∈ [0,T], into our partition we may conclude that
V_t = V_0 + ∫_0^t a dS + ∫_0^t b dβ for all 0 ≤ t ≤ T.   (16.16)
The interpretation of this equation is that V_t − V_0 is equal to the gains or losses due to trading, which is given by
∫_0^t a dS + ∫_0^t b dβ.
Equation (16.16) now makes sense even if we allow for continuous trading. The previous arguments show that the integrals appearing in Eq. (16.16) should be taken to be Ito integrals as defined in Definition 16.1. Moreover, if the investor does not have psychic abilities, we should assume that the holdings process is adapted.
16.4 The Black-Scholes Formula
Now that we have set the stage, we can try to price the option. (We will closely follow [3, p. 255-264] here.)
Fundamental Principle: The price of the option should be equal to the amount of money, V_0, that an investor would have to put into the bond-stock market at time t = 0 so that there exists a self-financing holding process (a_t, b_t) with
V_T = a_T S_T + b_T β_T = (S_T − K)_+.
Remark 16.15 (Money for nothing). If we price the option higher than V_0, i.e. q > V_0, we could make risk-free money by selling one of these options at q dollars, investing V_0 < q of this money using the holding process (a_t, b_t) to cover the payoff at time T, and then pocketing the difference, q − V_0.
If the price of the option were less than V_0, i.e. q < V_0, the investor should buy the option and then pursue the trading strategy (−a, −b). At time zero the investor has invested q + (−a_0 S_0 − b_0 β_0) = q − V_0 < 0 dollars, i.e. she is holding V_0 − q dollars in hand at time t = 0. The value of her portfolio at time T is now −V_T = −(S_T − K)_+. If S_T > K, the investor then exercises her option to pay off the debt she has accrued in the portfolio, and if S_T ≤ K, she does nothing since her portfolio is worth zero dollars. Either way, she still has the V_0 − q dollars in hand from the start of the transactions at t = 0.
Ansatz: We would like the price of the option q = V_0 to depend only on what we might know at the initial time. Thus we make the ansatz that q := f(S_0, T, K, r, σ², µ).³ (It is part of the content of the Black-Scholes formula that this ansatz is permissible.)
If we have a self-financing holding process (a_t, b_t), then (a_s, b_s)_{t≤s≤T} is also a self-financing holding process on [t,T] such that V_T = a_T S_T + b_T β_T = (S_T − K)_+. Therefore, given the ansatz and the fundamental principle above, if the stock price is S_t at time t, the option's price at this time should be
V_t = f(S_t, T−t) for all 0 ≤ t ≤ T.   (16.17)
By Ito's lemma,
dV_t = f_x(S_t, T−t) dS_t + ½ f_xx(S_t, T−t) dS_t² − f_t(S_t, T−t) dt
= f_x(S_t, T−t) S_t (σdB_t + µdt) + [½ f_xx(S_t, T−t) S_t² σ² − f_t(S_t, T−t)] dt
= f_x(S_t, T−t) S_t σ dB_t + [f_x(S_t, T−t) S_t µ + ½ f_xx(S_t, T−t) S_t² σ² − f_t(S_t, T−t)] dt.
On the other hand, from Eqs. (16.16) and (16.9) we know that
dV_t = a_t dS_t + b_t β_0 r e^{rt} dt = a_t S_t (σdB_t + µdt) + b_t β_0 r e^{rt} dt = a_t S_t σ dB_t + [a_t S_t µ + b_t β_0 r e^{rt}] dt.
Comparing these two equations implies,
a_t = f_x(S_t, T−t)   (16.18)
and
³ Since r, K, µ, and σ² are fixed, we will often drop them from the notation.
a_t S_t µ + b_t β_0 r e^{rt} = f_x(S_t, T−t) S_t µ + ½ f_xx(S_t, T−t) S_t² σ² − f_t(S_t, T−t).   (16.19)
Using Eq. (16.18) and
f(S_t, T−t) = V_t = a_t S_t + b_t β_0 e^{rt} = f_x(S_t, T−t) S_t + b_t β_0 e^{rt}
in Eq. (16.19) allows us to conclude,
½ f_xx(S_t, T−t) S_t² σ² − f_t(S_t, T−t) = r b_t β_0 e^{rt} = r f(S_t, T−t) − r f_x(S_t, T−t) S_t.
Thus we see that the unknown function f should solve the partial differential equation,
½ σ² x² f_xx(x, T−t) − f_t(x, T−t) = r f(x, T−t) − r x f_x(x, T−t) with f(x,0) = (x−K)_+,
i.e.
f_t(x,t) = ½ σ² x² f_xx(x,t) + r x f_x(x,t) − r f(x,t)   (16.20)
with f(x,0) = (x−K)_+.   (16.21)
Fact 16.16 Let N be a standard normal random variable and Φ(x) := P(N ≤ x). The solution to Eqs. (16.20) and (16.21) is given by;
f(x,t) = x Φ(g(x,t)) − K e^{−rt} Φ(h(x,t)),   (16.22)
where,
g(x,t) = [ln(x/K) + (r + ½σ²)t] / (σ√t),
h(x,t) = g(x,t) − σ√t.
Theorem 16.17 (Option Pricing). Given the above setup, the "rational price" of the European call option is
q = S_0 Φ([ln(S_0/K) + (r + ½σ²)T] / (σ√T)) − K e^{−rT} Φ([ln(S_0/K) + (r + ½σ²)T] / (σ√T) − σ√T),
where Φ(x) := P(N(0,1) ≤ x).
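Fact 16.16 and Theorem 16.17 can be implemented directly, with Φ expressed through the error function. As a cross-check, the sketch below also evaluates the well-known risk-neutral representation q = E[e^{−rT}(S_T − K)_+] with drift r in place of µ (a standard consequence of the PDE (16.20)-(16.21) via Feynman-Kac, used here only as a numerical sanity check; all parameter values are my own test choices):

```python
import math
import random

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(S0, K, r, sigma, T):
    """The price q of Theorem 16.17 (equivalently f(S0, T) of Fact 16.16)."""
    g = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    return S0 * Phi(g) - K * math.exp(-r * T) * Phi(g - sigma * math.sqrt(T))

S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
q = black_scholes_call(S0, K, r, sigma, T)

# Risk-neutral Monte Carlo: q = E[e^{-rT} (S_T - K)_+] with S_T a GBM of drift r.
rng = random.Random(6)
n = 200000
acc = 0.0
for _ in range(n):
    ST = S0 * math.exp(sigma * math.sqrt(T) * rng.gauss(0.0, 1.0)
                       + (r - 0.5 * sigma ** 2) * T)
    acc += math.exp(-r * T) * max(ST - K, 0.0)
mc = acc / n
```

Note that the true drift µ does not appear in either computation, which is the surprising content of the Black-Scholes analysis.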
Part V
Appendix
17
Analytic Facts
17.1 A Stirling’s Formula Like Approximation
Theorem 17.1. Suppose that f : (0,∞) → R is an increasing concave-down function (like f(x) = ln x) and let s_n := Σ_{k=1}^n f(k). Then
s_n − ½(f(n) + f(1)) ≤ ∫_1^n f(x) dx
≤ s_n − ½[f(n+1) + 2f(1)] + ½f(2)
≤ s_n − ½[f(n) + 2f(1)] + ½f(2).
Proof. On the interval [k−1, k], f(x) is larger than the straight line segment joining (k−1, f(k−1)) and (k, f(k)), and thus
½(f(k) + f(k−1)) ≤ ∫_{k−1}^k f(x) dx.
Summing this equation on k = 2, . . . , n shows,
s_n − ½(f(n) + f(1)) = Σ_{k=2}^n ½(f(k) + f(k−1)) ≤ Σ_{k=2}^n ∫_{k−1}^k f(x) dx = ∫_1^n f(x) dx.
For the upper bound on the integral we observe that, by concavity, f(x) ≤ f(k) + f′(k)(x − k) for all x, and therefore,
∫_{k−1}^k f(x) dx ≤ ∫_{k−1}^k [f(k) + f′(k)(x − k)] dx = f(k) − ½ f′(k).
Summing this equation on k = 2, . . . , n then implies,
∫_1^n f(x) dx ≤ Σ_{k=2}^n f(k) − ½ Σ_{k=2}^n f′(k).
Since f″(x) ≤ 0, f′(x) is decreasing, and therefore f′(x) ≤ f′(k−1) for x ∈ [k−1, k]; integrating this inequality over [k−1, k] gives
f(k) − f(k−1) ≤ f′(k−1).
Summing the result on k = 3, . . . , n+1 then shows,
f(n+1) − f(2) ≤ Σ_{k=2}^n f′(k),
and thus it follows that
∫_1^n f(x) dx ≤ Σ_{k=2}^n f(k) − ½(f(n+1) − f(2)) = s_n − ½[f(n+1) + 2f(1)] + ½f(2) ≤ s_n − ½[f(n) + 2f(1)] + ½f(2).
Example 17.2 (Approximating n!). Let us take f(x) = ln x and recall that
∫_1^n ln x dx = n ln n − n + 1.
Thus we may conclude that
s_n − ½ ln n ≤ n ln n − n + 1 ≤ s_n − ½ ln n + ½ ln 2.
Thus it follows that
(n + ½) ln n − n + 1 − ln√2 ≤ s_n ≤ (n + ½) ln n − n + 1.
Exponentiating this identity then gives the following upper and lower bounds on n!:
(e/√2) · e^{−n} n^{n+1/2} ≤ n! ≤ e · e^{−n} n^{n+1/2}.
These bounds compare well with Stirling's formula (Theorem 17.5), which implies,
n! ∼ √(2π) e^{−n} n^{n+1/2}, which by definition means lim_{n→∞} n! / (e^{−n} n^{n+1/2}) = √(2π).
Observe that
e/√2 ≅ 1.9221 ≤ √(2π) ≅ 2.5066 ≤ e ≅ 2.7183.
Definition 17.3 (Gamma Function). The Gamma function, Γ : R_+ → R_+, is defined by
Γ(x) := ∫_0^∞ u^{x−1} e^{−u} du.   (17.1)
(The reader should check that Γ(x) < ∞ for all x > 0.)
Here are some of the more basic properties of this function.
Example 17.4 (Γ-function properties). Let Γ be the gamma function; then:
1. Γ(1) = 1, as is easily verified.
2. Γ(x+1) = xΓ(x) for all x > 0, as follows by integration by parts;
Γ(x+1) = ∫_0^∞ e^{−u} u^x du = ∫_0^∞ u^x (−(d/du) e^{−u}) du = x ∫_0^∞ u^{x−1} e^{−u} du = x Γ(x).
In particular, it follows from items 1. and 2. and induction that
Γ(n+1) = n! for all n ∈ N.   (17.2)
3. Γ(1/2) = √π. This last assertion is a bit trickier. One proof is to make use of the fact (proved below in Lemma ??) that
∫_{−∞}^∞ e^{−ar²} dr = √(π/a) for all a > 0.   (17.3)
Taking a = 1 and making the change of variables u = r² then implies,
√π = ∫_{−∞}^∞ e^{−r²} dr = 2 ∫_0^∞ e^{−r²} dr = ∫_0^∞ u^{−1/2} e^{−u} du = Γ(1/2).
4. A simple induction argument using items 2. and 3. now shows that
Γ(n + ½) = [(2n−1)!! / 2^n] √π,
where (−1)!! := 1 and (2n−1)!! = (2n−1)(2n−3) · · · 3 · 1 for n ∈ N.
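Item 4. can be checked against math.gamma for small n (the helper name is mine; n = 0 recovers Γ(1/2) = √π with (−1)!! = 1):

```python
import math

def double_factorial_odd(n):
    """(2n-1)!! = (2n-1)(2n-3)...3*1, with (-1)!! := 1 when n = 0."""
    out = 1
    for j in range(1, 2 * n, 2):
        out *= j
    return out

# Pairs (Gamma(n + 1/2), (2n-1)!! / 2^n * sqrt(pi)) that should agree.
vals = [(math.gamma(n + 0.5), double_factorial_odd(n) / 2 ** n * math.sqrt(math.pi))
        for n in range(0, 10)]
```

The agreement is to floating-point precision, since both sides are evaluated by exact recursions.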
Theorem 17.5 (Stirling's formula). The Gamma function (see Definition 17.3) satisfies Stirling's formula,
lim_{x→∞} Γ(x+1) / (√(2π) e^{−x} x^{x+1/2}) = 1.   (17.4)
In particular, if n ∈ N, we have
n! = Γ(n+1) ∼ √(2π) e^{−n} n^{n+1/2},
where we write a_n ∼ b_n to mean lim_{n→∞} a_n / b_n = 1.
18
Multivariate Gaussians
18.1 Review of Gaussian Random Variables
Definition 18.1 (Normal / Gaussian Random Variable). A random variable, Y, is normal with mean µ and variance σ² iff
P(Y ∈ (y, y+dy]) = (1/√(2πσ²)) e^{−(y−µ)²/(2σ²)} dy.   (18.1)
We will abbreviate this by writing Y d= N(µ, σ²). When µ = 0 and σ² = 1 we say Y is a standard normal random variable. We will often denote standard normal random variables by Z.
Observe that Eq. (18.1) is equivalent to writing
E[f(Y)] = (1/√(2πσ²)) ∫_R f(y) e^{−(y−µ)²/(2σ²)} dy
for all bounded functions, f : R → R. Also observe that Y d= N(µ, σ²) is equivalent to Y d= σZ + µ. Indeed, by making the change of variable, y = σx + µ, we find
E[f(σZ + µ)] = (1/√(2π)) ∫_R f(σx + µ) e^{−x²/2} dx = (1/√(2π)) ∫_R f(y) e^{−(y−µ)²/(2σ²)} dy/σ = (1/√(2πσ²)) ∫_R f(y) e^{−(y−µ)²/(2σ²)} dy.
Lastly, the constant (2πσ²)^{−1/2} is chosen so that
(1/√(2πσ²)) ∫_R e^{−(y−µ)²/(2σ²)} dy = (1/√(2π)) ∫_R e^{−y²/2} dy = 1.
Lemma 18.2 (Integration by parts). If X d= N(0, σ²) for some σ² ≥ 0 and f : R → R is a C¹-function such that Xf(X), f′(X) and f(X) are all integrable random variables and¹ lim_{x→±∞} [f(x) e^{−x²/(2σ²)}] = 0, then
E[Xf(X)] = σ² E[f′(X)] = E[X²] · E[f′(X)].   (18.2)
¹ This last hypothesis is actually unnecessary!
Proof. If σ = 0 then X = 0 a.s. and both sides of Eq. (18.2) are zero. So we now suppose that σ > 0 and set C := 1/√(2πσ²). The result is a simple matter of integration by parts;
E[f′(X)] = C ∫_R f′(x) e^{−x²/(2σ²)} dx = C lim_{M→∞} ∫_{−M}^M f′(x) e^{−x²/(2σ²)} dx
= C lim_{M→∞} [f(x) e^{−x²/(2σ²)} |_{−M}^M − ∫_{−M}^M f(x) (d/dx) e^{−x²/(2σ²)} dx]
= C lim_{M→∞} ∫_{−M}^M f(x) (x/σ²) e^{−x²/(2σ²)} dx = (1/σ²) E[Xf(X)].
Example 18.3. Suppose that X d= N(0,1) and define α_k := E[X^{2k}] for all k ∈ N_0. By Lemma 18.2,
α_{k+1} = E[X^{2k+1} · X] = (2k+1) α_k with α_0 = 1.
Hence it follows that
α_1 = α_0 = 1, α_2 = 3α_1 = 3, α_3 = 5 · 3,
and by a simple induction argument,
EX^{2k} = α_k = (2k−1)!!,
where (−1)!! := 1.
Actually we can use the Γ-function to say more. Namely, for any β > −1,
E|X|^β = (1/√(2π)) ∫_R |x|^β e^{−x²/2} dx = √(2/π) ∫_0^∞ x^β e^{−x²/2} dx.
Now make the change of variables, y = x²/2 (i.e. x = √(2y) and dx = (1/√2) y^{−1/2} dy), to learn,
E|X|^β = (1/√π) ∫_0^∞ (2y)^{β/2} e^{−y} y^{−1/2} dy = (2^{β/2}/√π) ∫_0^∞ y^{(β+1)/2} e^{−y} y^{−1} dy = (2^{β/2}/√π) Γ((β+1)/2).
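The last formula packages both computations: β = 2k gives 2^k Γ(k+½)/√π = (2k−1)!!, and, e.g., β = 1 gives E|X| = √(2/π). A quick deterministic check (helper names are mine):

```python
import math

def abs_moment(beta):
    """E|X|^beta = 2^{beta/2} Gamma((beta+1)/2) / sqrt(pi) for X ~ N(0,1)."""
    return 2 ** (beta / 2.0) * math.gamma((beta + 1) / 2.0) / math.sqrt(math.pi)

def odd_double_factorial(k):
    """(2k-1)!!, with the convention (-1)!! = 1."""
    out = 1
    for j in range(1, 2 * k, 2):
        out *= j
    return out

even_moments = [(abs_moment(2 * k), odd_double_factorial(k)) for k in range(0, 8)]
half_moment = abs_moment(1.0)  # should equal sqrt(2/pi)
```

Both checks reduce to the recursion Γ(x+1) = xΓ(x) together with Γ(1/2) = √π.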
Exercise 18.1. Let q(x) be a polynomial² in x, Z d= N(0,1), and
u(t,x) := E[q(x + √t Z)]   (18.3)
= ∫_R (1/√(2πt)) e^{−(y−x)²/(2t)} q(y) dy.   (18.4)
Show u satisfies the heat equation,
∂u/∂t (t,x) = ½ ∂²u/∂x² (t,x) for all t > 0 and x ∈ R,
with u(0,x) = q(x).
Hints: Make use of Lemma 18.2 along with the fact (which is easily proved here) that
∂u/∂t (t,x) = E[∂/∂t q(x + √t Z)].
You will also have to use the corresponding fact for the x derivatives as well.
Exercise 18.2. Let q(x) be a polynomial in x, Z d= N(0,1), and Δ = d²/dx². Show
E[q(Z)] = (e^{Δ/2} q)(0) := Σ_{n=0}^∞ (1/n!) ((Δ/2)^n q)(0) = Σ_{n=0}^∞ (1/n!) (1/2^n) (Δ^n q)(0),
where the above sum is actually a finite sum since Δ^n q ≡ 0 if 2n > deg q.
Hint: let u(t) := E[q(√t Z)]. From your proof of Exercise 18.1 you should be able to see that u′(t) = ½ E[(Δq)(√t Z)]. This latter equation may be iterated in order to find u^{(n)}(t) for all n ≥ 0. With this information in hand you should be able to finish the proof with the aid of Taylor's theorem.
Example 18.4. Suppose that k ∈ N; then
E[Z^{2k}] = (e^{Δ/2} x^{2k})|_{x=0} = Σ_{n=0}^∞ (1/n!) (1/2^n) (Δ^n x^{2k})|_{x=0} = (1/k!) (1/2^k) Δ^k x^{2k} = (2k)! / (k! 2^k)
= [2k · (2k−1) · 2(k−1) · (2k−3) · · · · · (2·2) · 3 · 2 · 1] / (2^k k!) = (2k−1)!!,
in agreement with Example 18.3.
² Actually, q(x) can be any twice continuously differentiable function which along with its derivatives grows slower than e^{εx²} for every ε > 0.
Example 18.5. Let Z be a standard normal random variable and set f(λ) := E[e^{λZ²}] for λ < 1/2. Then f(0) = 1 and
f′(λ) = E[Z² e^{λZ²}] = E[(∂/∂Z)(Z e^{λZ²})] = E[e^{λZ²} + 2λZ² e^{λZ²}] = f(λ) + 2λ f′(λ).
Solving for f′(λ) we find,
f′(λ) = (1/(1−2λ)) f(λ) with f(0) = 1.
The solution to this equation is found in the usual way as,
ln f(λ) = ∫ f′(λ)/f(λ) dλ = ∫ dλ/(1−2λ) = −½ ln(1−2λ) + C.
Taking λ = 0 and using f(0) = 1, we find that C = 0 and therefore,
E[e^{λZ²}] = f(λ) = 1/√(1−2λ) for λ < ½.
This can also be shown by directly evaluating the integral,
E[e^{λZ²}] = ∫_R (1/√(2π)) e^{−½(1−2λ)z²} dz.
Exercise 18.3. Suppose that Z d= N(0,1) and λ ∈ R. Show
f(λ) := E[e^{iλZ}] = exp(−λ²/2).   (18.5)
Hint: You may use without proof that f′(λ) = iE[Z e^{iλZ}] (i.e. it is permissible to differentiate past the expectation). Assuming this, use Lemma 18.2 to see that f′(λ) satisfies a simple ordinary differential equation.
18.2 Gaussian Random Vectors
Definition 18.6 (Gaussian Random Vectors). A random vector, X ∈ R^d, is Gaussian iff
E[e^{iλ·X}] = exp(−½ Var(λ·X) + iE(λ·X)) for all λ ∈ R^d.   (18.6)
In short, X is a Gaussian random vector iff λ·X is a Gaussian random variable for all λ ∈ R^d. (Implicit in this definition is the assumption that E|X_j|² < ∞ for 1 ≤ j ≤ d.)
Notation 18.7 Let X be a random vector in R^d with second moments, i.e. E[X_k²] < ∞ for 1 ≤ k ≤ d. The mean of X is the vector µ = (µ_1, . . . , µ_d)^tr ∈ R^d with µ_k := EX_k for 1 ≤ k ≤ d, and the covariance matrix C = C(X) is the d×d matrix with entries,
C_kl := Cov(X_k, X_l) for 1 ≤ k, l ≤ d.   (18.7)
Exercise 18.4. Suppose that X is a random vector in R^d with second moments. Show for all λ = (λ_1, . . . , λ_d)^tr ∈ R^d that
E[λ·X] = λ·µ and Var(λ·X) = λ·Cλ.   (18.8)
Corollary 18.8. If Y d= N(µ, σ²), then
E[e^{iλY}] = exp(−½ λ²σ² + iµλ) for all λ ∈ R.   (18.9)
Conversely, if Y is a random variable such that Eq. (18.9) holds, then Y d= N(µ, σ²).
Proof. (⇒) From the remarks after Definition 18.1, we know that Y d= σZ + µ where Z d= N(0,1). Therefore,
E[e^{iλY}] = E[e^{iλ(σZ+µ)}] = e^{iλµ} E[e^{iλσZ}] = e^{iλµ} e^{−(λσ)²/2} = exp(−½ λ²σ² + iµλ).
(⇐) This follows from the basic fact that the characteristic function, or Fourier transform, of a distribution uniquely determines the distribution.
Remark 18.9 (Alternate characterization of being Gaussian). Given Corollary 18.8, Y is a Gaussian random variable iff EY² < ∞ and
E[e^{iλY}] = exp(−½ Var(λY) + iλEY) = exp(−(λ²/2) Var(Y) + iλEY) for all λ ∈ R.
Exercise 18.5. Suppose X_1 and X_2 are two independent Gaussian random variables with X_i d= N(0, σ_i²) for i = 1, 2. Show X_1 + X_2 is Gaussian and X_1 + X_2 d= N(0, σ_1² + σ_2²). (Hint: use Remark 18.9.)
Exercise 18.6. Suppose that Z d= N(0,1) and t ∈ R. Show E[e^{tZ}] = exp(t²/2). (You could follow the hint in Exercise 18.3, or you could use a completion-of-squares argument along with the translation invariance of Lebesgue measure.)
Exercise 18.7. Use Exercise 18.6 to give another proof that EZ^{2k} = (2k−1)!! when Z d= N(0,1).
Exercise 18.8. Let Z d= N(0,1) and α ∈ R; find ρ : R_+ → R_+ := (0,∞) such that
E[f(|Z|^α)] = ∫_{R_+} f(x) ρ(x) dx
for all continuous functions, f : R_+ → R, with compact support in R_+.
In particular, a random vector X in R^d with second moments is a Gaussian random vector iff
E[e^{iλ·X}] = exp(−½ Cλ·λ + iµ·λ) for all λ ∈ R^d.   (18.10)
We abbreviate Eq. (18.10) by writing X d= N(µ, C). Notice that it follows from Eq. (18.7) that C^tr = C and from Eq. (18.8) that C ≥ 0, i.e. λ·Cλ ≥ 0 for all λ ∈ R^d.
Definition 18.10. Given a Gaussian random vector, X, we call the pair, (C, µ)appearing in Eq. (18.10) the characteristics of X.
Lemma 18.11. Suppose that X = Σ_{l=1}^k Z_l v_l + µ where {Z_l}_{l=1}^k are i.i.d. standard normal random variables, µ ∈ R^d and v_l ∈ R^d for 1 ≤ l ≤ k. Then X d= N(µ, C) where C = Σ_{l=1}^k v_l v_l^tr.
Proof. Using the basic properties of independence and normal random variables, we find
E[e^{iλ·X}] = E[e^{i Σ_{l=1}^k Z_l λ·v_l + iλ·µ}] = e^{iλ·µ} Π_{l=1}^k E[e^{iZ_l λ·v_l}] = e^{iλ·µ} Π_{l=1}^k e^{−(λ·v_l)²/2} = exp(−½ Σ_{l=1}^k (λ·v_l)² + iλ·µ).
Since
Σ_{l=1}^k (λ·v_l)² = Σ_{l=1}^k λ·v_l (v_l^tr λ) = λ·(Σ_{l=1}^k v_l v_l^tr) λ,
we may conclude,
E[e^{iλ·X}] = exp(−½ Cλ·λ + iλ·µ),
i.e. X d= N(µ, C).
Exercise 18.9 (Existence of Gaussian random vectors for all C ≥ 0 and µ ∈ R^d). Suppose that µ ∈ R^d and C is a symmetric non-negative d×d matrix. By the spectral theorem we know there is an orthonormal basis {u_j}_{j=1}^d for R^d such that Cu_j = σ_j² u_j for some σ_j² ≥ 0. Let {Z_j}_{j=1}^d be i.i.d. standard normal random variables; show X := Σ_{j=1}^d Z_j σ_j u_j + µ d= N(µ, C).
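The construction in Exercise 18.9 works with any matrix A satisfying AA^tr = C (by Lemma 18.11, taking the v_l to be the columns of A). The sketch below uses a lower-triangular (Cholesky-type) square root instead of the spectral one, since it is easy to write down for a 2×2 matrix; the matrix C and all sample sizes are my own test values:

```python
import math
import random

# Target covariance and its lower-triangular square root A with A A^tr = C.
C = [[2.0, 0.6], [0.6, 1.0]]
a = math.sqrt(C[0][0])
b = C[0][1] / a
c = math.sqrt(C[1][1] - b * b)
A = [[a, 0.0], [b, c]]

def sample(rng, mu=(0.0, 0.0)):
    """One draw of X = A Z + mu with Z a pair of i.i.d. standard normals."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (A[0][0] * z1 + mu[0], A[1][0] * z1 + A[1][1] * z2 + mu[1])

rng = random.Random(7)
n = 50000
xs = [sample(rng) for _ in range(n)]
cov01 = sum(x * y for x, y in xs) / n   # should approximate C12 = 0.6
var0 = sum(x * x for x, _ in xs) / n    # should approximate C11 = 2.0
var1 = sum(y * y for _, y in xs) / n    # should approximate C22 = 1.0
```

The identity AA^tr = C holds exactly by construction; the empirical moments then match C up to Monte Carlo error.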
Theorem 18.12 (Gaussian Densities). Suppose that X d= N(µ, C) is an R^d-valued Gaussian random vector with C > 0 (for simplicity). Then
E[f(X)] = (1/√(det(2πC))) ∫_{R^d} f(x) exp(−½ C^{−1}(x−µ)·(x−µ)) dx   (18.11)
for bounded or non-negative functions, f : R^d → R.
Proof. Let us continue the notation in Exercise 18.9 and further let
A := [σ_1 u_1 | . . . | σ_d u_d] = UΣ,   (18.12)
where
U = [u_1 | . . . | u_d] and Σ = diag(σ_1, . . . , σ_d) = [σ_1 e_1 | . . . | σ_d e_d],
where {e_i}_{i=1}^d is the standard orthonormal basis for R^d. With this notation we know that X d= AZ + µ where Z = (Z_1, . . . , Z_d)^tr is a standard normal Gaussian vector. Therefore,
E[f(X)] = (2π)^{−d/2} ∫_{R^d} f(Az + µ) e^{−‖z‖²/2} dz,   (18.13)
wherein we have used
Π_{j=1}^d (1/√(2π)) e^{−z_j²/2} = (2π)^{−d/2} e^{−‖z‖²/2} with ‖z‖² := Σ_{j=1}^d z_j².
Making the change of variables x = Az + µ in Eq. (18.13) (i.e. z = A^{−1}(x−µ) and dz = dx/det A) implies
E[f(X)] = (1/((2π)^{d/2} det A)) ∫_{R^d} f(x) e^{−‖A^{−1}(x−µ)‖²/2} dx.   (18.14)
Recall from your linear algebra class (or just check) that CU = UΣ², i.e. C = UΣ²U^{−1} = UΣ²U^tr. Therefore,
AA^tr = UΣΣ^tr U^tr = UΣ²U^{−1} = C,   (18.15)
which then implies det A = √(det C) and, for all y ∈ R^d,
‖A^{−1}y‖² = (A^{−1})^tr A^{−1} y · y = (AA^tr)^{−1} y · y = C^{−1} y · y.
Equation (18.11) follows from these observations, Eq. (18.14), and the identity;
(2π)^{d/2} det A = (2π)^{d/2} √(det C) = √((2π)^d det C) = √(det(2πC)).
Theorem 18.13 (Gaussian Integration by Parts). Suppose that X = (X_1, . . . , X_d) is a mean zero Gaussian random vector and C_ij = Cov(X_i, X_j) = E[X_i X_j]. Then for any smooth function f : R^d → R such that f and all its derivatives grow slower than exp(|x|^α) for some α < 2, we have
E[X_i f(X_1, . . . , X_d)] = Σ_{k=1}^d C_ik E[(∂/∂X_k) f(X_1, . . . , X_d)] = Σ_{k=1}^d E[X_i X_k] · E[(∂/∂X_k) f(X_1, . . . , X_d)].
Here we write (∂/∂X_k) f(X_1, . . . , X_d) for (∂_k f)(X_1, . . . , X_d), where
(∂_k f)(x_1, . . . , x_d) := (∂/∂x_k) f(x_1, . . . , x_d) = (d/dt)|_0 f(x + t e_k),
where x := (x_1, . . . , x_d) and e_k is the kth standard basis vector for R^d.
Proof. From Exercise 18.9 we know X d= Y where Y = Σ_{j=1}^d σ_j Z_j u_j, {u_j}_{j=1}^d is an orthonormal basis for R^d such that Cu_j = σ_j² u_j, and {Z_j}_{j=1}^d are i.i.d. standard normal random variables. To simplify notation we define A := [σ_1 u_1 | . . . | σ_d u_d] as in Eq. (18.12), so that Y = AZ where Z = (Z_1, . . . , Z_d)^tr as in the proof of Theorem 18.12. From our previous observations and a simple generalization of Lemma 18.2, it follows that
C_ik = Cov(X_i, X_k) = Cov(Y_i, Y_k) = Σ_{j,m} Cov(A_ij Z_j, A_km Z_m) = Σ_{j,m} A_ij A_km Cov(Z_j, Z_m) = Σ_{j,m} A_ij A_km δ_jm = Σ_j A_ij A_kj = (AA^tr)_ik.
E[X_i f(X_1, . . . , X_d)] = E[Y_i f(Y_1, . . . , Y_d)] = Σ_j A_ij E[Z_j f((AZ)_1, . . . , (AZ)_d)]
= Σ_j A_ij E[(∂/∂Z_j) f((AZ)_1, . . . , (AZ)_d)]
= Σ_j A_ij E[Σ_k (∂_k f)((AZ)_1, . . . , (AZ)_d) · (∂/∂Z_j)(AZ)_k]
= Σ_{j,k} A_ij A_kj E[(∂_k f)(X_1, . . . , X_d)].
This completes the proof since, Σ_j A_ij A_kj = (AA^tr)_ik = C_ik, as we saw in Eq. (18.15).
Theorem 18.14 (Wick's Theorem). If X = (X_1, . . . , X_{2n}) is a mean zero Gaussian random vector, then
E[X_1 . . . X_{2n}] = Σ_{pairings} C_{i_1 j_1} . . . C_{i_n j_n},
where the sum is over all perfect pairings of {1, 2, . . . , 2n} and
C_ij = Cov(X_i, X_j) = E[X_i X_j].
Proof. From Theorem 18.13,
E[X_1 . . . X_{2n}] = Σ_j C_{1j} E[(∂/∂X_j)(X_2 . . . X_{2n})] = Σ_{j≥2} C_{1j} E[X_2 . . . X̂_j . . . X_{2n}],
where the hat indicates a term to be omitted. The result now basically follows by induction. For example,
E[X_1 X_2 X_3 X_4] = C_12 E[(∂/∂X_2)(X_2 X_3 X_4)] + C_13 E[(∂/∂X_3)(X_2 X_3 X_4)] + C_14 E[(∂/∂X_4)(X_2 X_3 X_4)]
= C_12 E[X_3 X_4] + C_13 E[X_2 X_4] + C_14 E[X_2 X_3]
= C_12 C_34 + C_13 C_24 + C_14 C_23.
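Wick's theorem applied with repeated indices, say to E[X_1² X_2²], gives C_11 C_22 + 2 C_12² (one pairing matches the two copies of X_1 together and the two copies of X_2 together; the other two pair each X_1 with an X_2). A Monte Carlo check on a vector X = AZ (the matrix A and sample size are my own test choices):

```python
import random

# X = A Z for a 2 x 2 lower-triangular A, so C = A A^tr.
A = [[1.0, 0.0], [0.5, 1.0]]
C11 = 1.0
C12 = 0.5
C22 = 0.5 ** 2 + 1.0 ** 2  # = 1.25
predicted = C11 * C22 + 2 * C12 ** 2  # Wick: E[X1^2 X2^2] = C11 C22 + 2 C12^2

rng = random.Random(8)
n = 100000
acc = 0.0
for _ in range(n):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x1 = z1
    x2 = 0.5 * z1 + z2
    acc += x1 * x1 * x2 * x2
empirical = acc / n
```

The empirical fourth moment should land near the predicted value 1.75, within Monte Carlo noise.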
Recall that if X_i and Y_j are independent, then Cov(X_i, Y_j) = 0, i.e. independence implies being uncorrelated. On the other hand, typically uncorrelated random variables are not independent. However, if the random variables involved are jointly Gaussian, then independence and being uncorrelated are actually the same thing!
Lemma 18.15. Suppose that $Z = (X, Y)^{\mathrm{tr}}$ is a Gaussian random vector with $X \in \mathbb{R}^k$ and $Y \in \mathbb{R}^l$. Then $X$ is independent of $Y$ iff $\operatorname{Cov}(X_i, Y_j) = 0$ for all $1 \le i \le k$ and $1 \le j \le l$.
Remark 18.16. Lemma 18.15 also holds more generally. Namely, if $\{X^l\}_{l=1}^n$ is a sequence of random vectors such that $(X^1, \dots, X^n)$ is a Gaussian random vector, then $\{X^l\}_{l=1}^n$ are independent iff $\operatorname{Cov}\big(X^l_i, X^{l'}_k\big) = 0$ for all $l \ne l'$ and all $i$ and $k$.
Exercise 18.10. Prove Lemma 18.15. Hint: by basic facts about the Fourier transform, it suffices to prove
\[
E\big[e^{i x \cdot X} e^{i y \cdot Y}\big] = E\big[e^{i x \cdot X}\big] \cdot E\big[e^{i y \cdot Y}\big]
\]
for all $x \in \mathbb{R}^k$ and $y \in \mathbb{R}^l$. If you get stuck, take a look at the proof of Corollary 18.17 below.
Corollary 18.17. Suppose that $X \in \mathbb{R}^k$ and $Y \in \mathbb{R}^l$ are two independent Gaussian random vectors; then $(X, Y)$ is also a Gaussian random vector. This corollary generalizes to any finite number of independent Gaussian random vectors.
Proof. Let $x \in \mathbb{R}^k$ and $y \in \mathbb{R}^l$; then
\begin{align*}
E\big[e^{i(x,y)\cdot(X,Y)}\big] &= E\big[e^{i(x \cdot X + y \cdot Y)}\big] = E\big[e^{i x \cdot X} e^{i y \cdot Y}\big] = E\big[e^{i x \cdot X}\big] \cdot E\big[e^{i y \cdot Y}\big] \\
&= \exp\Big(-\tfrac12 \operatorname{Var}(x \cdot X) + i E(x \cdot X)\Big) \times \exp\Big(-\tfrac12 \operatorname{Var}(y \cdot Y) + i E(y \cdot Y)\Big) \\
&= \exp\Big(-\tfrac12 \operatorname{Var}(x \cdot X) + i E(x \cdot X) - \tfrac12 \operatorname{Var}(y \cdot Y) + i E(y \cdot Y)\Big) \\
&= \exp\Big(-\tfrac12 \operatorname{Var}(x \cdot X + y \cdot Y) + i E(x \cdot X + y \cdot Y)\Big),
\end{align*}
where the last equality uses the independence of $X$ and $Y$ to write $\operatorname{Var}(x \cdot X) + \operatorname{Var}(y \cdot Y) = \operatorname{Var}(x \cdot X + y \cdot Y)$. This shows that $(X, Y)$ is again Gaussian.
Remark 18.18 (Be careful). If $X_1$ and $X_2$ are two standard normal random variables, it is not generally true that $(X_1, X_2)$ is a Gaussian random vector. For example, suppose $X_1 \stackrel{d}{=} N(0, 1)$ is a standard normal random variable and $\varepsilon$ is an independent Bernoulli random variable with $P(\varepsilon = \pm 1) = \tfrac12$. Then $X_2 := \varepsilon X_1 \stackrel{d}{=} N(0, 1)$, but $X := (X_1, X_2)$ is not a Gaussian random vector, as we now verify. If $\lambda = (\lambda_1, \lambda_2) \in \mathbb{R}^2$, then
\begin{align*}
E\big[e^{i\lambda \cdot X}\big] &= E\big[e^{i(\lambda_1 X_1 + \lambda_2 X_2)}\big] = E\big[e^{i(\lambda_1 X_1 + \lambda_2 \varepsilon X_1)}\big] \\
&= \frac12 \sum_{\tau = \pm 1} E\big[e^{i(\lambda_1 X_1 + \lambda_2 \tau X_1)}\big]
 = \frac12 \sum_{\tau = \pm 1} E\big[e^{i(\lambda_1 + \lambda_2 \tau) X_1}\big] \\
&= \frac12 \sum_{\tau = \pm 1} \exp\Big(-\tfrac12 (\lambda_1 + \lambda_2 \tau)^2\Big)
 = \frac12 \sum_{\tau = \pm 1} \exp\Big(-\tfrac12 \big(\lambda_1^2 + \lambda_2^2 + 2\tau \lambda_1 \lambda_2\big)\Big) \\
&= \frac12 e^{-\frac12(\lambda_1^2 + \lambda_2^2)} \cdot \big[\exp(-\lambda_1 \lambda_2) + \exp(\lambda_1 \lambda_2)\big] \\
&= e^{-\frac12(\lambda_1^2 + \lambda_2^2)} \cosh(\lambda_1 \lambda_2).
\end{align*}
On the other hand, $E[X_1^2] = E[X_2^2] = 1$ and
\[
E[X_1 X_2] = E[\varepsilon] \cdot E[X_1^2] = 0 \cdot 1 = 0,
\]
from which it follows that $X_1$ and $X_2$ are uncorrelated and $C_X = I_{2 \times 2}$. Thus if $X$ were Gaussian we would have
\[
E\big[e^{i\lambda \cdot X}\big] = \exp\Big(-\tfrac12 C_X \lambda \cdot \lambda\Big) = e^{-\frac12(\lambda_1^2 + \lambda_2^2)},
\]
which is just not the case!

Incidentally, this example also shows that two uncorrelated random variables need not be independent. For if $X_1$ and $X_2$ were independent, then again we would have
\[
E\big[e^{i\lambda \cdot X}\big] = E\big[e^{i(\lambda_1 X_1 + \lambda_2 X_2)}\big] = E\big[e^{i\lambda_1 X_1}\big] \cdot E\big[e^{i\lambda_2 X_2}\big] = e^{-\frac12 \lambda_1^2} e^{-\frac12 \lambda_2^2} = e^{-\frac12(\lambda_1^2 + \lambda_2^2)},
\]
which is not the case.
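This counterexample is easy to verify numerically. The sketch below (not from the notes) estimates the characteristic function of $X = (X_1, \varepsilon X_1)$ by Monte Carlo and compares it with the $\cosh$ formula above and with the value a bivariate Gaussian would have.

```python
# Check that X = (X1, eps*X1) has characteristic function
#   exp(-(l1^2 + l2^2)/2) * cosh(l1*l2),
# which differs from the bivariate-Gaussian value exp(-(l1^2 + l2^2)/2).
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
X1 = rng.standard_normal(n)
eps = rng.choice([-1.0, 1.0], size=n)      # independent Bernoulli sign
X2 = eps * X1

l1, l2 = 1.0, 1.0
# The characteristic function is real here (the law of X is symmetric),
# so the cosine part suffices.
empirical = np.mean(np.cos(l1 * X1 + l2 * X2))
formula = np.exp(-0.5 * (l1**2 + l2**2)) * np.cosh(l1 * l2)
gaussian = np.exp(-0.5 * (l1**2 + l2**2))
print(empirical, formula, gaussian)        # empirical matches formula, not gaussian
```

At $\lambda = (1,1)$ the $\cosh$ formula gives about $0.568$ while the Gaussian value is about $0.368$, and the Monte Carlo estimate lands on the former.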
The following theorem gives another useful way of computing Gaussian integrals of polynomials and exponential functions.
Theorem 18.19. Suppose $X \stackrel{d}{=} N(0, C)$ where $C$ is an $N \times N$ symmetric positive definite matrix. Let $L = L_C := \sum_{i,j=1}^N C_{ij}\, \partial_i \partial_j$ where $\partial_i := \partial/\partial x_i$. Then for any polynomial function $q : \mathbb{R}^N \to \mathbb{R}$,
\[
E[q(X)] = \big(e^{\frac12 L} q\big)(0) := \sum_{n=0}^\infty \frac{1}{n!} \Big(\Big(\frac{L}{2}\Big)^n q\Big)(0) \quad \text{(a finite sum).} \tag{18.16}
\]
Proof. This is a fairly straightforward extension of Exercise 18.2, so I will only provide a short outline of the proof. 1) Let $u(t) := E\big[q\big(\sqrt{t}\, X\big)\big]$. 2) Using Theorem 18.13 one shows that $\dot{u}(t) = \frac12 E\big[(Lq)\big(\sqrt{t}\, X\big)\big]$. 3) Iterating this result and then using Taylor's theorem finishes the proof, just as in Exercise 18.2.
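Equation (18.16) can be tried out directly in the one-dimensional case $N = 1$, where $L = c\, d^2/dx^2$ for $X \stackrel{d}{=} N(0, c)$. The sketch below (not from the notes) represents a polynomial by its coefficient list, applies $L$ repeatedly, and sums the finite series; for $q(x) = x^4$ and $c = 2$ it should recover $E[X^4] = 3c^2 = 12$.

```python
# One-dimensional instance of Eq. (18.16): E[q(X)] = (e^{L/2} q)(0)
# with L = c d^2/dx^2 for X ~ N(0, c) and q a polynomial.
from math import factorial

def second_derivative(coeffs):
    """coeffs[k] is the coefficient of x^k; return the coefficients of q''."""
    return [k * (k - 1) * coeffs[k] for k in range(2, len(coeffs))]

def gaussian_expectation(coeffs, c):
    """(e^{L/2} q)(0): sum over n of ((L/2)^n q)(0) / n!, a finite sum."""
    total, n, cur = 0.0, 0, list(coeffs)   # cur = (c d^2/dx^2)^n q
    while cur:
        total += cur[0] / (factorial(n) * 2**n)
        cur = [c * a for a in second_derivative(cur)]
        n += 1
    return total

val = gaussian_expectation([0, 0, 0, 0, 1], 2.0)   # q(x) = x^4, c = 2
print(val)                                         # 12.0 = 3 c^2
```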
Corollary 18.20. The function $u(t, x) := E\big[q\big(x + \sqrt{t}\, X\big)\big]$ solves the heat equation,
\[
\partial_t u(t, x) = \tfrac12 L_C u(t, x) \quad \text{with} \quad u(0, x) = q(x).
\]
If $X \stackrel{d}{=} N(0, 1)$ we have
\[
u(t, x) = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} q\big(x + \sqrt{t}\, z\big) e^{-\frac12 z^2}\, dz = \int_{\mathbb{R}} p_t(x, y)\, q(y)\, dy,
\]
where
\[
p_t(x, y) := \frac{1}{\sqrt{2\pi t}} \exp\Big(-\frac{1}{2t}(y - x)^2\Big).
\]
Theorem 18.21 (Gaussian Conditioning). Suppose that $(X, Y)$ is a Gaussian vector taking values in $\mathbb{R}^k \times \mathbb{R}^l$. Then
\[
E[f(Y)|X] = G(X) \tag{18.17}
\]
where
\[
G(x) := E[f(Ax + Z)], \tag{18.18}
\]
\[
A_{ik} = \sum_j \big[C^{-1}\big]_{kj}\, E[X_j Y_i], \quad \text{and} \quad C_{kj} = E[X_k X_j]. \tag{18.19}
\]
In particular, as a special case we have $E[Y|X] = AX$ where $A$ is given as above. [This is essentially only true in the Gaussian case.]$^4$

$^4$ If $C$ is not invertible, it means that $X$ takes values in a subspace of $\mathbb{R}^k$ (a.s.). One should then expand $X$ in terms of a basis for this subspace and work in that basis instead.

Proof. We first decompose $Y$ as $Y = AX + Z$ where $A$ is an $l \times k$ matrix and $Z \in \mathbb{R}^l$ is a Gaussian random vector independent of $X$. To construct $A$ we must solve the equations,
\[
0 = E\big[[Y - AX]_i X_j\big] = E[Y_i X_j] - A_{ik} E[X_k X_j] \quad \text{(sum on } k\text{)}
\]
where $1 \le i \le l$ and $1 \le j \le k$. The solutions to these equations are given in Eq. (18.19). Once this is done, it follows from Proposition 3.15 that
\[
E[f(Y)|X] = G(X) \quad \text{where} \quad G(x) := E[f(Ax + Z)].
\]