
  • 8/10/2019 ECE541_Lectures_9_2009

    1/81

ECE 541 - NOTES ON PROBABILITY THEORY AND STOCHASTIC PROCESSES

    MAJEED M. HAYAT

These notes are made available to students of the course ECE-541 at the University of New Mexico. They cannot and will not be used for any type of profitable reproduction or sale. Parts of the materials have been extracted from texts and notes by other authors. These include the text of Probability Theory by Y. Chow and H. Teicher, the notes on stochastic analysis by T. G. Kurtz, the text of Real and Complex Analysis by W. Rudin, the notes on measure and integration by A. Beck, the text of Probability and Measure by P. Billingsley, the text of Introduction to Stochastic Processes by P. G. Hoel, S. C. Port, and C. J. Stone, and possibly other sources.

In compiling these notes I tried, as much as possible, to make explicit reference to these sources. If any omission is found, it is due to my oversight, for which I seek the authors' forgiveness. I am truly indebted to the outstanding scholars and authors mentioned above, from whom I obtained most of the materials; I am truly privileged to have been taught by some of them. I would also like to thank Dr. Bradley Ratliff for helping in the tedious task of typing these notes.

    This material is intended as a graduate-level treatment of probability and stochastic

    processes. It requires basic undergraduate knowledge of probability, random variables,

probability distributions and density functions, and moments. A course like ECE 340 will cover these topics at an undergraduate level. The material also requires some knowledge of elementary analysis: concepts such as limits and continuity, basic set theory, some basic topology, Fourier and Laplace transforms, and elementary linear systems theory.

Date: November 4, 2009.


    1. Fundamental concepts

    1.1. Experiments. The most fundamental component in probability theory is the

    notion of a physical (or sometimes imaginary) experiment, whose outcome is

    revealed when the experiment is completed. Probability theory aims to provide the

    tools that will enable us to assess the likelihood of an outcome, or more generally, the

    likelihood of a collection of outcomes. Let us consider the following example:

Example 1. Shooting a dart: Consider shooting a single dart at a target (board) represented by the closed unit disc, D, which is centered at the point (0, 0). We write D = {(x, y) ∈ IR² : x² + y² ≤ 1}. Here, IR denotes the set of real numbers (same as (−∞, ∞)), and IR² is the set of all points in the plane (IR² = IR × IR, where × denotes the Cartesian product of sets). We read the above description of D as the set of all points (x, y) in IR² such that (or with the property) x² + y² ≤ 1.

1.2. Outcomes and the Sample Space. Now we define what we mean by an outcome: An outcome can be missing the target, or "missed" for short, in which case the dart misses the board entirely, or it can be the dart's location in the scenario that it hits the board. Note that we have implicitly decided (or chosen) that we do not care where the dart lands whenever it misses the board. (The definition of an outcome is totally arbitrary and therefore it is not unique for any experiment. It depends on whatever makes sense to us.) Mathematically, we form what is called the sample space as the set containing all possible outcomes. If we call this set Ω, then for our dart example, Ω = {missed} ∪ D, where the symbol ∪ denotes set union (we say x ∈ A ∪ B if and only if x ∈ A or x ∈ B). We write ω to denote a specific outcome from the sample space Ω. For example, we may have ω = missed, ω = (0, 0), and ω = (0.1, 0.2); however, according to our definition of an outcome, ω cannot be (1, 1).

1.3. Events. An event is a collection of outcomes, that is, a subset of Ω. Such a subset can be associated with a question that we may ask about the outcome of the experiment and whose answer can be determined after the outcome is revealed. For example, the question Q1: "Did the dart land within 0.5 of the bull's eye?" can be


associated with the subset of Ω (or event) given by E = {(x, y) ∈ IR² : x² + y² ≤ 1/4}. We would like to call the set E an event. Now consider the complement of Q1, that is, Q2: "Did the dart not land within 0.5 of the bull's eye?", with which we can associate the event Eᶜ, where the superscript c denotes set complementation (relative to Ω). Namely, Eᶜ = Ω \ E, which is the set of outcomes (or points or members) of Ω that are not already in E. The notation \ represents set difference (or subtraction). In general, for any two sets A and B, A \ B is the set of all elements that are in A but not in B. Note that Eᶜ = {(x, y) ∈ D : x² + y² > 1/4} ∪ {missed}. The point here is that if E is an event, then we would want Eᶜ to qualify as an event as well, since we would like to be able to ask the logical negative of any question. In addition, we would also like to be able to form a logical "or" of any two questions about the experiment outcome. Thus, if E₁ and E₂ are events, we would like their union to also be an event.

Here is another illustrative example: for each n = 1, 2, . . ., define the subset Eₙ = {(x, y) ∈ IR² : x² + y² ≤ 1 − 1/n} and let ⋃_{n=1}^∞ Eₙ be their countable union. (Notation: we say ω ∈ ⋃_{n=1}^∞ Eₙ if and only if ω ∈ Eₙ for some n. Similarly, we say ω ∈ ⋂_{n=1}^∞ Eₙ if and only if ω ∈ Eₙ for all n.) It is not hard to see (prove it) that ⋃_{n=1}^∞ Eₙ = E = {(x, y) ∈ IR² : x² + y² < 1}, which corresponds to the valid question "did the dart land inside the board?" Thus, we would want E (which is the countable union of events) to be an event as well.

Example 2. For each n ≥ 1, take Aₙ = (−1/n, 1/n). Then, ⋂_{n=1}^∞ Aₙ = {0} and ⋃_{n=1}^∞ Aₙ = (−1, 1). Prove these.
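These two limit sets can be sanity-checked numerically (a quick sketch, not a proof): the Aₙ are nested and decreasing, so the intersection of the first N of them is just A_N = (−1/N, 1/N), which shrinks toward {0}, while their union is already A₁ = (−1, 1).

```python
def in_A(x, n):
    """Membership test for A_n = (-1/n, 1/n)."""
    return -1.0 / n < x < 1.0 / n

def in_intersection(x, N):
    """x belongs to the intersection of A_1, ..., A_N."""
    return all(in_A(x, n) for n in range(1, N + 1))

def in_union(x, N):
    """x belongs to the union of A_1, ..., A_N."""
    return any(in_A(x, n) for n in range(1, N + 1))

# 0 stays in every A_n; any fixed x != 0 eventually drops out.
assert in_intersection(0.0, 10**5)
assert not in_intersection(0.001, 10**5)    # fails once n reaches 1000
# The union over any N is A_1 = (-1, 1).
assert in_union(0.999, 5) and not in_union(1.0, 5)
```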

Finally, we should be able to ask whether or not the experiment was conducted; that is, we would like to label the sample space Ω as an event as well. With this (hopefully) motivating introduction, we proceed to formally define what we mean by events.

Definition 1. A collection F of subsets of Ω is called a σ-algebra (read as "sigma-algebra") if:

(1) Ω ∈ F;
(2) if A ∈ F, then Aᶜ ∈ F;
(3) if A₁, A₂, . . . ∈ F, then ⋃_{n=1}^∞ Aₙ ∈ F.


requirements for a σ-algebra, but what we mean by "valid" is really up to us, as long as the self-consistency rules defining the collection of events are met.

Generation of σ-algebras: Let M be a collection of events (not necessarily a σ-algebra); that is, M ⊆ F. This collection could be a collection of certain events of interest. For example, in the dart experiment we may define M = {{missed}, {(x, y) ∈ IR² : 1/4 ≤ x² + y² ≤ 1/2}}. The main question now is the following: Can we construct a minimal (or smallest) σ-algebra that contains M? If such a σ-algebra exists, call it F_M, then it must possess the following property: if D is another σ-algebra containing M, then necessarily F_M ⊆ D. Hence F_M is minimal. The following theorem states that there is such a minimal σ-algebra.

Theorem 1. Let M be a collection of events; then there is a minimal σ-algebra containing M.

    Before we prove the Theorem let us look at an example.

Example 5. Suppose that Ω = (−∞, ∞), and M = {(−∞, 1), (0, ∞)}. It is easy to check that the minimal σ-algebra containing M is

F_M = {∅, Ω, (−∞, 1), (0, ∞), (0, 1), (−∞, 0], [1, ∞), (−∞, 0] ∪ [1, ∞)}.

Explain where each member is coming from.
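On a finite sample space the minimal σ-algebra of Theorem 1 can also be found by brute force: keep closing M under complements and unions until nothing new appears. The sketch below is my own illustration, not part of the notes; it uses a three-point Ω = {a, b, c} whose atoms play the roles of (−∞, 0], (0, 1), and [1, ∞) in Example 5, with generators {a, b} and {b, c} standing in for (−∞, 1) and (0, ∞).

```python
def generate_sigma_algebra(omega, generators):
    """Brute-force the minimal sigma-algebra on a finite omega containing
    the generator sets: repeatedly close under complement and (pairwise,
    hence finite) union until the collection stabilizes."""
    omega = frozenset(omega)
    F = {frozenset(), omega} | {frozenset(g) for g in generators}
    while True:
        new = set(F)
        new |= {omega - A for A in F}            # complements
        new |= {A | B for A in F for B in F}     # unions
        if new == F:
            return F
        F = new

# Finite stand-in for Example 5: a <-> (-inf,0], b <-> (0,1), c <-> [1,inf).
F = generate_sigma_algebra({'a', 'b', 'c'}, [{'a', 'b'}, {'b', 'c'}])
print(len(F))  # 8 members, matching the eight sets listed in Example 5
```

Each of the eight subsets of {a, b, c} appears, mirroring how (0, 1) arises as an intersection and (−∞, 0] ∪ [1, ∞) as its complement.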

Proof of Theorem: Let K_M be the collection of all σ-algebras that contain M. We observe that such a collection is not empty, since at the least it contains the power set 2^Ω. Let us label each member of K_M by an index α, namely D_α, where α ∈ I and I is some index set. Define F_M = ⋂_{α∈I} D_α. We need to show 1) that F_M is a σ-algebra containing M, and 2) that F_M is a minimal σ-algebra. Note that each D_α contains Ω (since each D_α is a σ-algebra); thus, F_M contains Ω. Next, if A ∈ F_M, then A ∈ D_α for each α ∈ I. Thus, Aᶜ ∈ D_α for each α ∈ I (since each D_α is a σ-algebra), which implies that Aᶜ ∈ ⋂_{α∈I} D_α. Next, suppose that A₁, A₂, . . . ∈ F_M. Then, we know that A₁, A₂, . . . ∈ D_α for each α ∈ I. Moreover, ⋃_{n=1}^∞ Aₙ ∈ D_α,


for each α ∈ I (again, since each D_α is a σ-algebra), and thus ⋃_{n=1}^∞ Aₙ ∈ ⋂_{α∈I} D_α. This completes the proof that F_M is a σ-algebra. Now suppose that F′_M is another σ-algebra containing M; we will show that F_M ⊆ F′_M. First note that since K_M is the collection of all σ-algebras containing M, it must be true that F′_M = D_α₀ for some α₀ ∈ I. Now if A ∈ F_M, then necessarily A ∈ D_α₀ (since D_α₀ is one of the members of the intersection that defines F_M), and consequently A ∈ F′_M, since F′_M = D_α₀. This establishes that F_M ⊆ F′_M and completes the proof of the theorem.

Example 6. Let U be the collection of all open sets in IR. Then, according to the above theorem, there exists a minimal σ-algebra containing U. This σ-algebra is called the Borel σ-algebra, B, and its elements are called the Borel subsets of IR. Note that, by virtue of set complementation, union and intersection, B contains all closed sets, half-open intervals, their countable unions, intersections, and so on.

(Reminder: A subset U of IR is called open if for every x ∈ U there exists an open interval centered at x which lies entirely in U. Closed sets are defined as complements of open sets. These definitions extend to IRⁿ in a straightforward manner.)

Restrictions of σ-algebras: Let F be a σ-algebra. For any measurable set U, we define F ∩ U as the collection of all intersections between U and the members of F; that is, F ∩ U = {V ∩ U : V ∈ F}. It is easy to show that F ∩ U is also a σ-algebra, which is called the restriction of F to U.

Example 7. Back to the dart experiment: What is a reasonable σ-algebra for this experiment? I would say any such σ-algebra should contain all the Borel subsets of D and the {missed} event. So we can take M = {{missed}} ∪ (B ∩ D). It is easy to check that in this case

F_M = ((B ∩ D) ∪ {missed}) ∪ (B ∩ D),    (1)

where for any σ-algebra F and any measurable set U, we define F ∪ U as the collection of all unions between U and the members of F; that is, F ∪ U = {V ∪ U : V ∈ F}. Note that F ∪ U is not always a σ-algebra (contrary to F ∩ U), but in this example it is, because {missed} is the complement of D.


We emphasize again that Y⁻¹((−∞, r)) is always an event; that is, Y⁻¹((−∞, r)) ∈ F_M, where F_M was defined earlier for this experiment in (1). Again, this is a direct consequence of the way we defined the function Y.

Motivated by these examples, we proceed to define what we mean by a random variable in general.

Definition 4. Let (Ω, F) be a measurable space. A transformation X : Ω → IR is said to be F-measurable if for every r ∈ IR, X⁻¹((−∞, r)) ∈ F. Any measurable X is called a random variable.

Now let X be a random variable and consider the collection of events M = {{ω : X(ω) ∈ (−∞, r)}, r ∈ IR}, which can also be written more conveniently as {X⁻¹((−∞, r)), r ∈ IR}. As before, let F_M be the minimal σ-algebra containing M. Then, F_M is the information that the random variable X conveys about the experiment. We denote such a σ-algebra from this point on as σ(X), the σ-algebra generated by the random variable X. In the above example of X, σ(X) = {∅, {missed}, Ω, D}. This concept is demonstrated in the next example. (Also see HW#1 for another example of σ(X).)

Example 8. Back to the dart example: Let M = {∅, {missed}, Ω}; then it is easy to check that F_M = {∅, {missed}, Ω, D}, which also happens to be a subset of F_M as defined in (1). Moreover, we observe that in fact X⁻¹((−∞, r)) ∈ F_M for any r ∈ IR. Intuitively, F_M, which we also call σ(X), can be identified as the set of all events that the function X can convey about the experiment. In particular, F_M consists of precisely those events whose occurrence or nonoccurrence can be determined through our observation of the value of X. In other words, F_M is all the information that the mapping X can provide about the outcome of the experiment. Note that F_M is much smaller than the original σ-algebra F, which was ((B ∩ D) ∪ {missed}) ∪ (B ∩ D). As we have seen before, {∅, {missed}, Ω, D} ⊆ ((B ∩ D) ∪ {missed}) ∪ (B ∩ D). Thus, X can only partially inform us of the true outcome of the experiment; it can only tell us if we hit the target or missed it, nothing else, which is precisely the information contained in F_M.


Facts about Measurable Transformations:

Theorem 2. Let (Ω, F) be a measurable space. The following statements are equivalent:

(1) X is a measurable transformation.
(2) X⁻¹((−∞, r]) ∈ F for all r ∈ IR.
(3) X⁻¹((r, ∞)) ∈ F for all r ∈ IR.
(4) X⁻¹([r, ∞)) ∈ F for all r ∈ IR.
(5) X⁻¹((a, b)) ∈ F for all a < b.
(6) X⁻¹([a, b)) ∈ F for all a < b.
(7) X⁻¹([a, b]) ∈ F for all a < b.
(8) X⁻¹((a, b]) ∈ F for all a < b.
(9) X⁻¹(B) ∈ F for all B ∈ B.

The equivalence among (1), (2) and (3) is left as an exercise (see HW1). The equivalence between (3) and (4) through (8) can be shown using the same technique used in HW1. To show that (1) implies (9) requires additional work and will not be shown here. In fact, (9) is the most powerful characterization of a random variable. Using (9), we can equivalently define σ(X) as {X⁻¹(B) : B ∈ B}, which can be directly shown to be a σ-algebra (see HW1).
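On a finite measurable space, condition (2) of Theorem 2 can be checked mechanically: the preimage X⁻¹((−∞, r]) only changes at the finitely many values X takes, so it suffices to test those thresholds. The toy space below is a hypothetical stand-in for the dart experiment (one "miss" point and three board points), not something from the notes.

```python
def preimage_le(X, omega, r):
    """X^{-1}((-inf, r]) as a frozenset, for X given as a dict on omega."""
    return frozenset(w for w in omega if X[w] <= r)

def is_measurable(X, omega, F):
    # On a finite space it suffices to test r at the values X takes:
    # between consecutive values the preimage does not change.
    return all(preimage_le(X, omega, r) in F for r in set(X.values()))

omega = {'miss', 'p1', 'p2', 'p3'}
# F: the sigma-algebra generated by the partition {miss} vs. the board.
F = {frozenset(), frozenset({'miss'}),
     frozenset({'p1', 'p2', 'p3'}), frozenset(omega)}

X = {'miss': 1.0, 'p1': 0.0, 'p2': 0.0, 'p3': 0.0}   # indicator of a miss
Y = {'miss': 1.0, 'p1': 0.0, 'p2': 0.5, 'p3': 0.0}   # "sees" p2 separately

print(is_measurable(X, omega, F))  # True: X only separates miss vs. hit
print(is_measurable(Y, omega, F))  # False: {p1, p3} is not in F
```

This mirrors the discussion of Example 8: a function is measurable with respect to F exactly when every question it can answer is already an event in F.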


1.5. Probability Measure.

Definition 5. Consider the measurable space (Ω, F), and a function P that maps F into IR. P is called a probability measure if:

(1) P(E) ≥ 0, for any E ∈ F.
(2) P(Ω) = 1.
(3) If E₁, E₂, . . . ∈ F and Eᵢ ∩ Eⱼ = ∅ when i ≠ j, then P(⋃_{n=1}^∞ Eₙ) = Σ_{n=1}^∞ P(Eₙ).

    The following properties follow directly from the above definition.

Property 1. P(∅) = 0.

Proof. Put E₁ = Ω and E₂ = E₃ = . . . = ∅ in (3) and use (2) to get 1 = P(Ω) = P(Ω ∪ ∅ ∪ ∅ ∪ . . .) = P(Ω) + Σ_{n=2}^∞ P(∅) = 1 + Σ_{n=2}^∞ P(∅). Thus, Σ_{n=2}^∞ P(∅) = 0, which implies that P(∅) = 0 since P(∅) ≥ 0 according to (1).

Property 2. If E₁, E₂, . . . , Eₙ ∈ F and Eᵢ ∩ Eⱼ = ∅ when i ≠ j, then P(⋃_{i=1}^n Eᵢ) = Σ_{i=1}^n P(Eᵢ).

Proof. Put Eₙ₊₁ = Eₙ₊₂ = . . . = ∅ and the result follows from (3), since P(∅) = 0 (from Property 1).

Property 3. If E₁, E₂ ∈ F and E₁ ⊆ E₂, then P(E₁) ≤ P(E₂).

Proof. Note that E₁ ∪ (E₂ \ E₁) = E₂ and E₁ ∩ (E₂ \ E₁) = ∅. Thus, by Property 2 (use n = 2), P(E₂) = P(E₁) + P(E₂ \ E₁) ≥ P(E₁), since P(E₂ \ E₁) ≥ 0.

Property 4. If A₁ ⊆ A₂ ⊆ A₃ ⊆ . . . ∈ F, then

(3)    lim_{n→∞} P(Aₙ) = P(⋃_{n=1}^∞ Aₙ).

Proof. Put B₁ = A₁, B₂ = A₂ \ A₁, . . . , Bₙ = Aₙ \ Aₙ₋₁, . . .. Then it is easy to check that ⋃_{n=1}^∞ Aₙ = ⋃_{n=1}^∞ Bₙ and ⋃_{n=1}^m Aₙ = ⋃_{n=1}^m Bₙ for any m ≥ 1, and that Bᵢ ∩ Bⱼ = ∅ when i ≠ j. Hence, P(⋃_{i=1}^n Aᵢ) = Σ_{i=1}^n P(Bᵢ) and P(⋃_{i=1}^∞ Aᵢ) = Σ_{i=1}^∞ P(Bᵢ).


But ⋃_{i=1}^n Aᵢ = Aₙ, so that P(Aₙ) = Σ_{i=1}^n P(Bᵢ). Now since Σ_{i=1}^n P(Bᵢ) converges to Σ_{i=1}^∞ P(Bᵢ), we conclude that P(Aₙ) converges to Σ_{i=1}^∞ P(Bᵢ), which is equal to P(⋃_{i=1}^∞ Aᵢ).

Property 5. If A₁ ⊇ A₂ ⊇ A₃ ⊇ . . ., then lim_{n→∞} P(Aₙ) = P(⋂_{n=1}^∞ Aₙ).

Proof. See HW.

Property 6. For any A ∈ F, 0 ≤ P(A) ≤ 1.

The triplet (Ω, F, P) is called a probability space.

Example 9. Recall the dart experiment. We now define P on (Ω, F_M) as follows. Assign P({missed}) = 0.5, and for any A ∈ B ∩ D, assign P(A) = area(A)/(2π). It is easy to check that P defines a probability measure. (For example, P(Ω) = P(D ∪ {missed}) = P(D) + P({missed}) = area(D)/(2π) + 0.5 = 0.5 + 0.5 = 1. Check the other requirements as an exercise.)
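This assignment can be sanity-checked by simulation. The sketch below is an illustration I am adding, not part of the notes; it samples "missed" with probability 0.5 and otherwise a uniform point on D, normalizing so that P(A) = area(A)/(2π) on the board (hence P(D) = π/(2π) = 1/2). The annulus 1/4 ≤ x² + y² ≤ 1/2 should then have probability (π/2 − π/4)/(2π) = 1/8.

```python
import math
import random

def sample_outcome(rng):
    """One dart throw: 'missed' w.p. 0.5, else a uniform point on the
    unit disc D (radius drawn by inverse-CDF: P(R <= r) = r^2)."""
    if rng.random() < 0.5:
        return 'missed'
    r = math.sqrt(rng.random())
    theta = 2 * math.pi * rng.random()
    return (r * math.cos(theta), r * math.sin(theta))

rng = random.Random(0)
n = 200_000
# Event A: the annulus 1/4 <= x^2 + y^2 <= 1/2.
hits = sum(1 for _ in range(n)
           if (w := sample_outcome(rng)) != 'missed'
           and 0.25 <= w[0]**2 + w[1]**2 <= 0.5)
empirical = hits / n
theoretical = (math.pi * 0.5 - math.pi * 0.25) / (2 * math.pi)  # = 1/8
print(abs(empirical - theoretical) < 0.01)
```

With 200,000 samples the Monte Carlo error is a few parts in a thousand, so the check passes comfortably.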

1.6. Distributions and distribution functions: Consider a probability space (Ω, F, P), and consider a random variable X defined on it. Up to this point, we have developed a formalism that allows us to ask questions of the form "what is the probability that X assumes a value in a Borel set B?" Symbolically, this is written as P({ω : X(ω) ∈ B}), or for short, P{X ∈ B}, with the understanding that the set {X ∈ B} is an event (i.e., a member of F). Answering all the questions of the above form is tantamount to assigning a number in the interval [0, 1] to every Borel set. Thus, we can think of a mapping from B into [0, 1] whose knowledge will provide an answer to all the questions of the form described earlier. We call this mapping the distribution of the random variable X, and it is denoted by μ_X. Formally, we have μ_X : B → [0, 1] according to the rule μ_X(B) = P{X ∈ B}, B ∈ B.

Proposition 1. μ_X defines a probability measure on the measurable space (IR, B).

Proof. See HW.


Distribution Functions: Recall from your undergraduate probability that we often associate with each random variable a distribution function, defined as F_X(x) = P{X ≤ x}. This function can also be obtained from the distribution of X, μ_X, by evaluating μ_X at B = (−∞, x], which is a Borel set. That is, F_X(x) = μ_X((−∞, x]). Note that for any x ≤ y, μ_X((x, y]) = F_X(y) − F_X(x).

Property 7. F_X is nondecreasing.

Proof. For x₁ ≤ x₂, (−∞, x₁] ⊆ (−∞, x₂], and μ_X((−∞, x₁]) ≤ μ_X((−∞, x₂]) since μ_X is a probability measure (see Property 3 above).

Property 8. lim_{x→∞} F_X(x) = 1 and lim_{x→−∞} F_X(x) = 0.

Proof. Note that (−∞, ∞) = ⋃_{n=1}^∞ (−∞, n], and by Property 4 above, lim_{n→∞} μ_X((−∞, n]) = μ_X(⋃_{n=1}^∞ (−∞, n]) = μ_X((−∞, ∞)) = 1, since μ_X is a probability measure. Thus we have proved that lim_{n→∞} F_X(n) = 1. Now the same argument can be repeated if we replace the sequence n by any increasing sequence xₙ. Thus, lim_{n→∞} F_X(xₙ) = 1 for any increasing sequence xₙ, and consequently lim_{x→∞} F_X(x) = 1.

The proof of the second assertion is left as an exercise.

Property 9. F_X is right continuous; that is, lim_{n→∞} F_X(xₙ) = F_X(y) for any monotonic and convergent sequence xₙ for which xₙ ↓ y.

Proof. For simplicity, assume that xₙ = y + n⁻¹. Note that F_X(y) = μ_X((−∞, y]), and (−∞, y] = ⋂_{n=1}^∞ (−∞, y + n⁻¹]. So, by Property 5 above, lim_{n→∞} μ_X((−∞, y + n⁻¹]) = μ_X(⋂_{n=1}^∞ (−∞, y + n⁻¹]) = μ_X((−∞, y]). Thus, we have proved that lim_{n→∞} F_X(y + n⁻¹) = F_X(y). In the same fashion, we can generalize the result to obtain lim_{n→∞} F_X(y + xₙ) = F_X(y) for any sequence for which xₙ ↓ 0. This completes the proof.

Property 10. F_X has a left limit at every point; that is, lim_{n→∞} F_X(xₙ) exists for any monotonic and convergent sequence xₙ.

Proof. For simplicity, assume that xₙ = y − n⁻¹ for some y. Note that (−∞, y) = ⋃_{n=1}^∞ (−∞, y − n⁻¹]. So, by Property 4 above, lim_{n→∞} F_X(y − n⁻¹) = lim_{n→∞} μ_X((−∞, y − n⁻¹]) = μ_X((−∞, y)).


Absolutely continuous random variables: If there is a Borel function f_X : IR → [0, ∞), with the property that ∫_{−∞}^∞ f_X(t) dt = 1, such that F_X(x) = ∫_{−∞}^x f_X(t) dt, then we say that X is absolutely continuous. Note that in this event we term f_X the probability density function (pdf) of X. Also note that if F_X is differentiable, then such a density exists and is given by the derivative of F_X.

Example 10. Let X be a uniformly-distributed random variable in [0, 2]. Let the function g be as shown in Fig. 1 below. Compute the distribution function of Y = g(X).

Figure 1. The function g(x) (left) and the resulting distribution function F_Y(y) (right).

Solution: If y < 0, F_Y(y) = 0 since Y is nonnegative. If 0 ≤ y ≤ 1, F_Y(y) = P({X ≤ y/2} ∪ {X > 1 − y/2}) = 0.5y + 0.5. Finally, if y > 1, F_Y(y) = 1 since Y ≤ 1. The graph of F_Y(y) is shown above. Note that F_Y(y) is indeed right continuous everywhere, but it is discontinuous at 0.

In this example the random variable X is absolutely continuous (why?). On the other hand, the random variable Y is not absolutely continuous, because there is no Borel function f_Y so that F_Y(x) = ∫_{−∞}^x f_Y(t) dt. Observe that F_Y has a jump at 0, and we cannot reproduce this jump by integrating a Borel function over it (we would need a delta function, which is not really a function, let alone a Borel function). It is not a discrete random variable either, since Y can assume any value from an uncountable collection of real numbers in [0, 1].
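This solution can be checked by simulation. In the sketch below I read g off Fig. 1 as the tent function g(x) = 2x on [0, 1/2], g(x) = 2 − 2x on [1/2, 1], and g(x) = 0 on (1, 2] (an assumption on my part, since only the figure defines g); with this choice {Y ≤ y} = {X ≤ y/2} ∪ {X > 1 − y/2} and F_Y(y) = 0.5y + 0.5 on [0, 1], including the jump of size 0.5 at y = 0 coming from P{Y = 0} = 0.5.

```python
import random

def g(x):
    # Tent function as read off Fig. 1 (an assumption of this sketch).
    if x <= 0.5:
        return 2 * x
    if x <= 1.0:
        return 2 - 2 * x
    return 0.0

rng = random.Random(1)
n = 200_000
ys = [g(2 * rng.random()) for _ in range(n)]   # X uniform on [0, 2]

def F_Y(y):
    """Empirical distribution function of Y = g(X)."""
    return sum(v <= y for v in ys) / n

# Expected: F_Y(0) = 0.5 (the jump), F_Y(0.5) = 0.75, F_Y(1) = 1.
print(abs(F_Y(0.0) - 0.50) < 0.01,
      abs(F_Y(0.5) - 0.75) < 0.01,
      abs(F_Y(1.0) - 1.00) < 0.01)
```

The empirical jump at 0 is exactly the non-absolutely-continuous part of Y discussed above.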


2. Expectations

Recall that in an undergraduate probability course one would talk about the expectation, average, or mean of a random variable. This is done by carrying out an integration (in the Riemann sense) with respect to a probability density function (pdf). It turns out that the definition of an expectation does not require having a pdf. It is based on a more-or-less intuitive notion of an average. We will follow this general approach here and then connect it to the usual expectation with respect to a pdf whenever the pdf exists. We begin by introducing the expectation of a nonnegative random variable, and will generalize thereafter.

Consider a nonnegative random variable X, and for each n ≥ 1 define the sum S_n = Σ_{i=1}^∞ (i/2^n) P{i/2^n < X ≤ (i+1)/2^n}. We claim that S_n is nondecreasing in n. If this is the case (to be proven shortly), then we know that S_n is either convergent (to a finite number) or S_n → ∞. In either case, we call the limit of S_n the expectation of X, and symbolically we denote it as E[X]. Thus, we define E[X] = lim_{n→∞} S_n. To see the

monotonicity of S_n, we follow Chow and Teicher [1] and observe that

{i/2^n < X ≤ (i+1)/2^n} = {2i/2^{n+1} < X ≤ (2i+2)/2^{n+1}} = {2i/2^{n+1} < X ≤ (2i+1)/2^{n+1}} ∪ {(2i+1)/2^{n+1} < X ≤ (2i+2)/2^{n+1}};

thus,

S_n = Σ_{i=1}^∞ (2i/2^{n+1}) (P{2i/2^{n+1} < X ≤ (2i+1)/2^{n+1}} + P{(2i+1)/2^{n+1} < X ≤ (2i+2)/2^{n+1}})
≤ (1/2^{n+1}) P{1/2^{n+1} < X ≤ 2/2^{n+1}} + Σ_{i=1}^∞ (2i/2^{n+1}) P{2i/2^{n+1} < X ≤ (2i+1)/2^{n+1}} + Σ_{i=1}^∞ ((2i+1)/2^{n+1}) P{(2i+1)/2^{n+1} < X ≤ (2i+2)/2^{n+1}} = S_{n+1}.

Markov Inequality: Let X be a random variable, let ε > 0, and let φ : [0, ∞) → [0, ∞) be nondecreasing with φ(ε) > 0. Then

P{|X| > ε} ≤ E[φ(|X|)]/φ(ε).

For example, taking φ(t) = t² and applying the inequality to X − E[X], we obtain the Chebyshev inequality P{|X − E[X]| > ε} ≤ E[(X − E[X])²]/ε². Note that the numerator is simply the variance of X. Also, if we take φ(t) = t, then we have P{|X| > ε} ≤ E[|X|]/ε.

Proof. Since φ is nondecreasing, {|X| > ε} ⊆ {φ(|X|) ≥ φ(ε)}. Further, note that 1 ≥ I_{{|X|>ε}}(ω) for any ω (since any indicator function is either 0 or 1). Now note that φ(|X|) ≥ φ(|X|) I_{{|X|>ε}} ≥ φ(ε) I_{{|X|>ε}}. Now take expectations of both ends of the inequalities to obtain E[φ(|X|)] ≥ φ(ε) E[I_{{|X|>ε}}] = φ(ε) P{|X| > ε}, and the desired result follows. (See the first special case after the definition of expectation.)
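Returning to the dyadic definition of expectation above: the sums S_n can be evaluated directly whenever P{a < X ≤ b} is available in closed form. The sketch below (my own example, not from the notes) does this for X exponential with mean 1, where P{a < X ≤ b} = e^{−a} − e^{−b}, and exhibits the claimed monotone convergence of S_n up to E[X] = 1.

```python
import math

def S(n, tail=50.0):
    """Dyadic lower sum S_n for E[X] with X ~ Exponential(1):
    S_n = sum_i (i/2^n) * P{i/2^n < X <= (i+1)/2^n},
    where P{a < X <= b} = exp(-a) - exp(-b)."""
    h = 2.0 ** (-n)
    total, i = 0.0, 1
    while i * h < tail:          # exp(-50) is negligible, safe to truncate
        a, b = i * h, (i + 1) * h
        total += a * (math.exp(-a) - math.exp(-b))
        i += 1
    return total

vals = [S(n) for n in range(1, 8)]
print(all(vals[k] <= vals[k + 1] for k in range(len(vals) - 1)))  # nondecreasing
print(abs(S(12) - 1.0) < 1e-3)   # S_n -> E[X] = 1, error at most 2^-n
```

Since X differs from its dyadic approximation by less than 2^{−n} pointwise, the gap E[X] − S_n is at most 2^{−n}, which the run confirms.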

Chernoff Bound: This is an application of the Markov inequality, and it is a useful tool for upper-bounding error probabilities in digital communications. Let X be a non-negative random variable and let λ be nonnegative. Then by taking φ(t) = e^{λt}, we obtain

P{X > x} ≤ e^{−λx} E[e^{λX}].

The right-hand side can be minimized over λ > 0 to yield what is called the Chernoff bound for a random variable X.
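A numerical sketch of this minimization (my own example, not from the notes): for X exponential with mean 1, E[e^{λX}] = 1/(1 − λ) for λ < 1; minimizing e^{−λx}/(1 − λ) over λ gives λ* = 1 − 1/x and the bound x·e^{1−x}, which indeed dominates the true tail P{X > x} = e^{−x}.

```python
import math

def chernoff_bound(x, grid=10_000):
    """min over lambda in (0, 1) of exp(-lambda*x) * E[exp(lambda*X)]
    for X ~ Exponential(1), whose MGF is 1/(1 - lambda) for lambda < 1."""
    best = float('inf')
    for k in range(1, grid):
        lam = k / grid                       # lambda ranges over (0, 1)
        best = min(best, math.exp(-lam * x) / (1.0 - lam))
    return best

x = 5.0
true_tail = math.exp(-x)                     # P{X > x} = e^{-x}
closed_form = x * math.exp(1 - x)            # analytic minimum at lambda* = 1 - 1/x
print(true_tail <= chernoff_bound(x) <= closed_form * 1.001)
```

The grid search lands on the analytic minimizer (λ* = 0.8 for x = 5), and the bound exceeds the exact tail by roughly a factor of x·e, illustrating that the Chernoff bound captures the exponential decay rate rather than the constant.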


3. Elementary Hilbert Space Theory

Most of the material in this section is extracted from the excellent book by W. Rudin, Real & Complex Analysis [2]. A complex vector space H is called an inner product space if for all x, y ∈ H we have a complex-valued scalar ⟨x, y⟩, read "the inner product between the vectors x and y," such that the following properties are satisfied:

(1) ⟨x, y⟩ = ⟨y, x⟩*, for all x, y ∈ H (where * denotes complex conjugation);
(2) ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩, for all x, y, z ∈ H;
(3) ⟨αx, y⟩ = α⟨x, y⟩, for all x, y ∈ H, α ∈ ℂ;
(4) ⟨x, x⟩ ≥ 0 for all x ∈ H, and ⟨x, x⟩ = 0 only if x = 0.

Note that (3) and (1) together imply ⟨x, αy⟩ = α*⟨x, y⟩. Also, (3) and (4) together imply ⟨x, x⟩ = 0 if and only if x = 0. It will be convenient to write ⟨x, x⟩ as ‖x‖², which we later use to introduce a norm on H.

3.1. Schwarz Inequality. It follows from properties (1)-(4) that |⟨x, y⟩| ≤ ‖x‖ ‖y‖ for all x, y ∈ H.

Proof: Put A = ‖x‖², C = ‖y‖², and B = |⟨x, y⟩|. By (4), ⟨x − rαy, x − rαy⟩ ≥ 0 for any choice of α ∈ ℂ and all r ∈ IR, which can be further written as (using (2)-(3)):

(6)    ‖x‖² − rα⟨y, x⟩ − rα*⟨x, y⟩ + r²‖y‖²|α|² ≥ 0.

Now choose α with |α| = 1 so that α⟨y, x⟩ = |⟨x, y⟩|. Thus, (6) can be cast as ‖x‖² − 2r|⟨x, y⟩| + r²‖y‖² ≥ 0, or A − 2rB + r²C ≥ 0, which is true for any r ∈ IR. Let r₁,₂ = (2B ± √(4B² − 4AC))/(2C) = (B ± √(B² − AC))/C denote the roots of the equation A − 2rB + r²C = 0. Since A − 2rB + r²C ≥ 0 for all r, it must be true that B² − AC ≤ 0 (since the roots cannot be real unless they are the same), which implies |⟨x, y⟩|² ≤ ‖x‖²‖y‖², or |⟨x, y⟩| ≤ ‖x‖ ‖y‖.

Note that |⟨x, y⟩| = ‖x‖ ‖y‖ whenever x = αy, α ∈ ℂ. Also, if |⟨x, y⟩| = ‖x‖ ‖y‖, then it is easy to verify that ⟨x − α(‖x‖/‖y‖)y, x − α(‖x‖/‖y‖)y⟩ = 0 (with α as chosen above), which implies (using (4)) x − α(‖x‖/‖y‖)y = 0. Thus, |⟨x, y⟩| = ‖x‖ ‖y‖ if and only if x is proportional to y.


3.2. Triangle Inequality. ‖x + y‖ ≤ ‖x‖ + ‖y‖, for all x, y ∈ H.

Proof. This follows from the Schwarz inequality. Expand ‖x + y‖² to obtain ‖x‖² + ‖y‖² + ⟨x, y⟩ + ⟨x, y⟩* ≤ ‖x‖² + ‖y‖² + 2|⟨x, y⟩|. Now by the Schwarz inequality, ‖x‖² + ‖y‖² + 2|⟨x, y⟩| ≤ ‖x‖² + ‖y‖² + 2‖x‖ ‖y‖ = (‖x‖ + ‖y‖)², from which the desired result follows. Note that ‖x + y‖² = ‖x‖² + ‖y‖² if ⟨x, y⟩ = 0 (why?).

This is a generalization of the customary triangle inequality for complex numbers, which states that |x − z| ≤ |x − y| + |y − z|, for all x, y, z ∈ ℂ.

3.3. Norm. We say ‖·‖ is a norm on H if:

(1) ‖x‖ ≥ 0;
(2) ‖x‖ = 0 only if x = 0;
(3) ‖x + y‖ ≤ ‖x‖ + ‖y‖;
(4) ‖αx‖ = |α| ‖x‖, α ∈ ℂ.

With the triangle inequality at hand, we can define ‖·‖ on members of H as follows: ‖x‖ = ⟨x, x⟩^{1/2}. You should check that this actually defines a norm. This yields a yardstick for the distance between two vectors x, y ∈ H, defined as ‖x − y‖. We can now say that H is a normed space.

3.4. Convergence. We can then talk about convergence: a sequence xₙ ∈ H is said to be convergent to x, written as xₙ → x, or lim_{n→∞} xₙ = x, if for every ε > 0 there exists N ∈ IN such that ‖xₙ − x‖ < ε whenever n > N.

3.5. Completeness. An inner-product space H is complete if every Cauchy sequence in H converges to a point in H. A sequence {yₙ}_{n=1}^∞ in H is called Cauchy if for any ε > 0 there exists N ∈ IN such that ‖yₙ − yₘ‖ < ε for all n, m > N.

Now, if H is complete, then H is called a Hilbert space.

Fact: If H is complete, then it is closed. This is because any convergent sequence is automatically a Cauchy sequence.

3.6. Convex Sets. A set E in a vector space V is said to be a convex set if for any x, y ∈ E and t ∈ (0, 1), the point zₜ = tx + (1 − t)y ∈ E. In other words,


the line segment between x and y lies in E. Note that if E is a convex set, then the translation of E, E + x = {y + x : y ∈ E}, is also a convex set.

3.7. Parallelogram Law. For any x and y in an inner-product space, ‖x + y‖² + ‖x − y‖² = 2‖x‖² + 2‖y‖². This can be verified simply using the properties of an inner product. See the schematic below.

Figure 2. A parallelogram with sides x and y and diagonals x + y and x − y.
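The Schwarz inequality, the triangle inequality, and the parallelogram law are easy to sanity-check numerically in the concrete inner-product space ℂⁿ with ⟨x, y⟩ = Σᵢ xᵢ yᵢ* (a sketch, not a proof):

```python
import random

def inner(x, y):
    """<x, y> = sum_i x_i * conj(y_i) on C^n."""
    return sum(a * b.conjugate() for a, b in zip(x, y))

def norm(x):
    return abs(inner(x, x)) ** 0.5

rng = random.Random(2)
rand_vec = lambda n: [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]

for _ in range(1000):
    x, y = rand_vec(4), rand_vec(4)
    xpy = [a + b for a, b in zip(x, y)]
    xmy = [a - b for a, b in zip(x, y)]
    assert abs(inner(x, y)) <= norm(x) * norm(y) + 1e-9           # Schwarz
    assert norm(xpy) <= norm(x) + norm(y) + 1e-9                  # triangle
    lhs = norm(xpy)**2 + norm(xmy)**2                             # parallelogram
    assert abs(lhs - 2 * norm(x)**2 - 2 * norm(y)**2) < 1e-9
print("all checks passed")
```

The small tolerances only absorb floating-point roundoff; the inequalities themselves hold exactly by the proofs above.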

3.8. Orthogonality. If ⟨x, y⟩ = 0, then we say that x and y are orthogonal. We write x ⊥ y. Note that ⊥ is a symmetric relation; that is, x ⊥ y ⇔ y ⊥ x. Pick a vector x ∈ H, then find all vectors in H that are orthogonal to x. We write this set as x⊥ = {y ∈ H : ⟨x, y⟩ = 0}, and x⊥ is a closed subspace. To see that it is a subspace, we must show that x⊥ is closed under addition and scalar multiplication; these are immediate from the definition of an inner product (see Assignment 4). To see that x⊥ is closed, we must show that the limit of every convergent sequence in x⊥ is also in x⊥. Let xₙ be a sequence in x⊥, and assume that xₙ → x₀. We need to show that x₀ ∈ x⊥. To see this, note that |⟨x₀, x⟩| = |⟨x₀ − xₙ + xₙ, x⟩| = |⟨x₀ − xₙ, x⟩ + ⟨xₙ, x⟩| = |⟨x₀ − xₙ, x⟩| ≤ ‖x₀ − xₙ‖ ‖x‖, by the Schwarz inequality; but the term on the right converges to zero, so |⟨x₀, x⟩| = 0, which implies x₀ ⊥ x and x₀ ∈ x⊥.

Let M ⊆ H be a subspace of H. We define M⊥ = ⋂_{x∈M} x⊥. It is easy to see that M⊥ is actually a subspace of H and that M ∩ M⊥ = {0}. It is also true that M⊥ is closed, which is simply because it is an intersection of closed sets (recall that x⊥ is closed).


Theorem 4. Any non-empty, closed and convex set E in a Hilbert space H contains a unique element of smallest norm. That is, there exists x₀ ∈ E such that ‖x₀‖ ≤ ‖x‖ for all x ∈ E.


Figure 3. Elevating M by x; Qx realizes the shortest distance from x + M to the origin.

Figure 4. The decomposition x = Px + Qx, with Px ∈ M and Qx ∈ M⊥.

Example 12. L²[a, b] = {f : ∫ₐᵇ |f(x)|² dx < ∞}, with the inner product ⟨f, g⟩ = ∫ₐᵇ f(x) g(x)* dx and norm ‖f‖ = ⟨f, f⟩^{1/2} = [∫ₐᵇ |f|² dx]^{1/2} = ‖f‖₂. Actually, we need the Schwarz inequality for expectations before we can see that we have an inner product in this case. More on this to follow.

Theorem 5. Projection Theorem. If M ⊆ H is a closed subspace of a Hilbert space H, then for every x ∈ H there exists a unique decomposition x = Px + Qx, where Px ∈ M and Qx ∈ M⊥. Moreover,

(1) ‖x‖² = ‖Px‖² + ‖Qx‖².
(2) If we think of Px and Qx as mappings from H to M and M⊥, respectively, then the mappings P and Q are linear.


(3) Px is the nearest point in M to x, and Qx is the nearest point in M⊥ to x. Px and Qx are called the orthogonal projections of x onto M and M⊥, respectively.

Proof.

Existence: Consider M + x; we claim that M + x is closed. Recall that M is closed, which means that if xₙ ∈ M and xₙ → x₀, then x₀ ∈ M. Pick a convergent sequence in x + M, call it zₙ. Now zₙ = x + yₙ for some yₙ ∈ M. Since zₙ is convergent, so is yₙ, but the limit of yₙ is in M, so lim_{n→∞} zₙ = x + lim_{n→∞} yₙ ∈ x + M.

We next show that x + M is convex. Pick x₁ and x₂ ∈ x + M. We need to show that for any 0 < λ < 1, λx₁ + (1 − λ)x₂ ∈ x + M. But x₁ = x + y₁ with y₁ ∈ M, and x₂ = x + y₂ with y₂ ∈ M. So λx₁ + (1 − λ)x₂ = x + λy₁ + (1 − λ)y₂ ∈ x + M, since λy₁ + (1 − λ)y₂ ∈ M.

By the minimal-norm theorem, there exists a member of x + M of smallest norm. Call it Qx, and let Px = x − Qx. Note that Px ∈ M. We need to show that Qx ∈ M⊥; namely, ⟨Qx, y⟩ = 0 for all y ∈ M. Call Qx = z, and note that ‖z‖ ≤ ‖z − αy‖ for every y ∈ M and α ∈ ℂ, since z − αy ∈ M + x. Pick y ∈ M with ‖y‖ = 1. Then 0 ≤ ‖z − αy‖² − ‖z‖² = −α*⟨z, y⟩ − α⟨y, z⟩ + |α|². Pick α = ⟨z, y⟩. We obtain 0 ≤ −|⟨z, y⟩|². This can hold only if ⟨z, y⟩ = 0, i.e., z is orthogonal to every y ∈ M; therefore, Qx ∈ M⊥.

Uniqueness: Suppose that x = Px + Qx = (Px)′ + (Qx)′, where Px, (Px)′ ∈ M and Qx, (Qx)′ ∈ M⊥. Then Px − (Px)′ = (Qx)′ − Qx, where the left side belongs to M while the right side belongs to M⊥. Hence, each side can only be the zero vector (why?), and we conclude that Px = (Px)′ and (Qx)′ = Qx.

Minimum Distance Properties: To show that Px is the nearest point in M to x, pick any y ∈ M and observe that ‖x − y‖² = ‖Px + Qx − y‖² = ‖Qx + (Px − y)‖² = ‖Qx‖² + ‖Px − y‖² (since Px − y ∈ M is orthogonal to Qx). The right-hand side is minimized when Px − y = 0, which happens if and only if y = Px. The fact that Qx is the nearest point in M⊥ to x can be shown similarly.

Linearity: Take x, y ∈ H. Then we have x = Px + Qx and y = Py + Qy. Now ax + by = aPx + aQx + bPy + bQy. On the other hand, ax + by = P(ax + by) + Q(ax + by).


Thus, we can write P(ax + by) − aPx − bPy = aQx + bQy − Q(ax + by) and observe that the left side is in M while the right side is in M⊥. Hence, each side can only be the zero vector. Therefore, P(ax + by) = aPx + bPy and Q(ax + by) = aQx + bQy.
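The decomposition x = Px + Qx is concrete in the Hilbert space IR³ (a toy example of my own): for a one-dimensional subspace M = span{m}, the projection is Px = (⟨x, m⟩/⟨m, m⟩) m, and both property (1) of the theorem and Qx ⊥ M can be verified directly.

```python
def dot(u, v):
    """Real inner product on R^n."""
    return sum(a * b for a, b in zip(u, v))

def project_onto_span(x, m):
    """Orthogonal decomposition x = Px + Qx with M = span{m}:
    Px = (<x, m> / <m, m>) m, and Qx = x - Px."""
    c = dot(x, m) / dot(m, m)
    Px = [c * a for a in m]
    Qx = [a - b for a, b in zip(x, Px)]
    return Px, Qx

x = [3.0, 1.0, 2.0]
m = [1.0, 1.0, 0.0]
Px, Qx = project_onto_span(x, m)
print(Px)                          # [2.0, 2.0, 0.0]
print(abs(dot(Qx, m)) < 1e-12)     # Qx is orthogonal to M: True
# Part (1) of the theorem: ||x||^2 = ||Px||^2 + ||Qx||^2
print(abs(dot(x, x) - dot(Px, Px) - dot(Qx, Qx)) < 1e-12)
```

Here ‖x‖² = 14 splits as ‖Px‖² + ‖Qx‖² = 8 + 6, and Px is the point of M nearest to x, exactly as the minimum-distance property asserts.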


we may not have an inner product. Can you see why?) We can also define the norm ‖X‖₂ = ⟨X, X⟩^{1/2}. This is called the L²-norm associated with (Ω, F, P). Hence, we can recast H as the vector space of all random variables that have finite L²-norm. This collection is often written as L²(Ω, F, P). It can be shown that L²(Ω, F, P) is complete (see [2], pp. 67). Hence, L²(Ω, F, P) is a Hilbert space. In fact, L²(Ω, D, P) is a Hilbert space for any σ-algebra D.

4.2. The L² conditional expectation. Let X, Y ∈ L²(Ω, F, P). Clearly, L²(Ω, σ(X), P) is a closed (why?) subspace of L²(Ω, F, P). We can now apply the projection theorem to Y, with L²(Ω, σ(X), P) being our closed subspace, and obtain the decomposition Y = PY + QY, where PY ∈ L²(Ω, σ(X), P) and QY ∈ L²(Ω, σ(X), P)⊥. We also have the property that ‖Y − PY‖₂ ≤ ‖Y − Y′‖₂ for all Y′ ∈ L²(Ω, σ(X), P). We call PY the conditional expectation of Y given σ(X), which we will write from this point on as E[Y | σ(X)].

    Exercise 1. (1) Show that if X is a D-measurable r.v., then so is aX, for any a ∈ ℝ. (2) Show that if X and Y are D-measurable, so is X + Y. [See Homework]

    The above exercise shows that the collection of square-integrable D-measurable random variables is indeed a vector space.

    Theorem 7. Consider (Ω, F, P), and let X be a random variable. A random variable Z is a σ(X)-measurable random variable if and only if Z = h(X) for some Borel

    function h. (The proof is based on a corollary to the Dynkin class theorem. We

    omit the proof, which can be found in [1].)

    Since E[X|σ(Y)] is σ(Y)-measurable by definition, we can write it explicitly as a Borel function of Y. For this reason, we often write E[X|Y] in place of E[X|σ(Y)].
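    As a quick numerical illustration (a sketch of my own, not from the notes; the model Y = X² + noise is an arbitrary choice), the following estimates E[Y|X] for a discrete X by within-class averaging, exhibiting the conditional expectation as a function h(X) as in Theorem 7, and checks the projection property that the residual Y − h(X) is orthogonal to functions of X:

    ```python
    import random

    # Sketch (toy model, not from the notes): with X discrete, estimate
    # h(t) = E[Y | X = t] by within-class averaging, and check the projection
    # property E[(Y - h(X)) g(X)] ~ 0 for a function g of X.
    random.seed(0)
    xs, ys = [], []
    for _ in range(150_000):
        x = random.choice([0, 1, 2])       # X takes three values
        y = x * x + random.gauss(0, 1)     # Y = X^2 + independent noise
        xs.append(x)
        ys.append(y)

    h = {}
    for t in (0, 1, 2):
        cls = [y for x, y in zip(xs, ys) if x == t]
        h[t] = sum(cls) / len(cls)         # empirical E[Y | X = t], about t^2

    # orthogonality of the residual to the sigma(X)-measurable variable g(X) = X
    resid = sum((y - h[x]) * x for x, y in zip(xs, ys)) / len(xs)
    print(h, resid)
    ```

    The estimated h(t) is close to t², and the residual correlation is close to zero, which is exactly the Hilbert-space projection picture.
    
    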

    4.3. Properties of Conditional Expectation. Our definition of conditional ex-

    pectation lends itself to many powerful properties that are useful in practice. One of

    the properties also leads to an equivalent definition of the conditional expectation,

    which is actually the way it is commonly done (see the comments after Property 4.3.3).


    However, proving the existence of the conditional expectation according to the alter-

    native definition requires knowledge of a major theorem in measure theory called theRadon-Nikodym Theorem, which is beyond the scope of this course. That is why in

    this course we took a path (for defining the conditional expectation) that is based on

    the projection theorem, which is an important theorem in signal processing for other

    reasons as well. Here are some key properties of conditional expectations.

    4.3.1. Property 1. For any constant a, E[a|D] = a. This follows trivially from the observation that Pa = a (why?).

    4.3.2. Property 2 (Linearity). For any constants a, b, E[aX + bY|D] = aE[X|D] + bE[Y|D]. This follows trivially from the linearity of the projection operator P (see the Projection Theorem).

    4.3.3. Property 3. Let Z = E[X|D]. Then E[XY] = E[ZY] for all Y ∈ L2(Ω, D, P).

    Interpretation: Z contains all the information in X that is relevant to any D-measurable random variable Y.

    Proof: Note that E[XY] = E[(PX + QX)Y] = E[(PX)Y] + E[(QX)Y] = E[(PX)Y] = E[ZY]. The last equality follows from the definition of Z.

    Conversely, if a random variable Z ∈ L2(Ω, D, P) has the property that E[ZY] = E[XY] for all Y ∈ L2(Ω, D, P), then Z = E[X|D] almost surely. To see this, we need to show that Z = PX almost surely. Note that by assumption E[Y(X − Z)] = 0 for any Y ∈ L2(Ω, D, P). Therefore, E[Y(PX + QX − Z)] = 0, or E[Y PX] + E[Y QX] − E[Y Z] = 0, or E[Y(Z − PX)] = 0 for all Y ∈ L2(Ω, D, P) (since E[Y QX] = 0). In particular, if we take Y = Z − PX we conclude that E[(Z − PX)²] = 0, which implies Z = PX almost surely.

    Thus, we arrive at an alternative definition for the L2 conditional expectation.


    Definition 6. We define Z = E[X|D] if (1) Z ∈ L2(Ω, D, P) and (2) E[ZY] = E[XY] for all Y ∈ L2(Ω, D, P).

    We will use this new definition frequently in the remainder of this chapter.

    4.3.4. Property 4 (Smoothing Property). For any X, E[X] = E[E[X|D]]. To see this, note that if Z = E[X|D], then E[ZY] = E[XY] for all Y ∈ L2(Ω, D, P). Now take Y = 1 to conclude that E[X] = E[Z].

    4.3.5. Property 5. If Y ∈ L2(Ω, D, P), then E[XY|D] = Y E[X|D]. To show this, we check the second definition of conditional expectation. Note that Y E[X|D] ∈ L2(Ω, D, P), so all we need to show is that E[(XY)W] = E[(Y E[X|D])W] for any W ∈ L2(Ω, D, P). But we already know from the definition of E[X|D] that E[(W Y)E[X|D]] = E[(W Y)X], since W Y ∈ L2(Ω, D, P).

    As a special case, we have E[XY] = E[E[XY|Y]] = E[Y E[X|Y]].

    4.3.6. Property 6. For any Borel function g : ℝ → ℝ, E[g(Y)|D] = g(Y) whenever g(Y) ∈ L2(Ω, D, P). (For example, we have E[g(Y)|Y] = g(Y).) To prove this, all we need to do is observe that g(Y) ∈ L2(Ω, D, P) and then apply Property 4.3.5 with X = 1.

    4.3.7. Property 7 (Iterated Conditioning). Let D1 ⊂ D2 ⊂ F. Then E[X|D1] = E[E[X|D2]|D1].

    Proof: Let Z = E[E[X|D2]|D1]. First note that Z ∈ L2(Ω, D1, P), so we only need to show that E[ZY] = E[XY] for all Y ∈ L2(Ω, D1, P). Now E[ZY] = E[Y E[E[X|D2]|D1]]. Since Y is D1-measurable, it is also D2-measurable; thus, we have E[Y E[E[X|D2]|D1]] = E[E[E[Y X|D2]|D1]] for all Y ∈ L2(Ω, D1, P). But by Property 4.3.4, E[E[E[Y X|D2]|D1]] = E[E[Y X|D2]] = E[Y X], which completes the proof.


    Also note that E[X|D1] = E[E[X|D1]|D2]. This is much easier to prove. (Prove it.)
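    Iterated conditioning can be checked mechanically on a finite sample space, where conditioning on the σ-algebra generated by a partition is just block averaging. The partitions and values below are a toy example of my own:

    ```python
    import random

    # Finite-sample-space sketch of iterated conditioning: with D1 coarser
    # than D2, conditioning on D2 first and then on D1 gives the same result
    # as conditioning on D1 directly.
    random.seed(1)
    omega = list(range(8))                  # sample space {0,...,7}, uniform P
    X = {w: random.uniform(-1, 1) for w in omega}

    D2_blocks = [[0, 1], [2, 3], [4, 5], [6, 7]]        # finer partition
    D1_blocks = [[0, 1, 2, 3], [4, 5, 6, 7]]            # coarser partition

    def cond_exp(f, blocks):
        """E[f | partition]: replace f on each block by the block average."""
        g = {}
        for blk in blocks:
            avg = sum(f[w] for w in blk) / len(blk)
            for w in blk:
                g[w] = avg
        return g

    lhs = cond_exp(cond_exp(X, D2_blocks), D1_blocks)   # E[ E[X|D2] | D1 ]
    rhs = cond_exp(X, D1_blocks)                        # E[X | D1]
    print(all(abs(lhs[w] - rhs[w]) < 1e-12 for w in omega))
    ```
    
    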

    The next property is very important in applications.

    4.3.8. Property 8. If X and Y are independent r.v.'s, and if g : ℝ² → ℝ is Borel measurable, then E[g(X, Y)|σ(Y)] = h(Y), where for any scalar t, h(t) = E[g(X, t)]. As a consequence, E[g(X, Y)] = E[h(Y)].

    Proof: We prove this in the case g(x, y) = g1(x)g2(y), where g1 and g2 are bounded, real-valued Borel functions. (The general result follows through an application of a corollary to the Dynkin class theorem.) First note that E[g1(X)g2(Y)|σ(Y)] = g2(Y)E[g1(X)|σ(Y)]. Also note that h(t) = g2(t)E[g1(X)], so h(Y) = g2(Y)E[g1(X)]. We need to show that h(Y) = E[g1(X)g2(Y)|σ(Y)]. To do so, we need to show that E[h(Y)Z] = E[g1(X)g2(Y)Z] for every Z ∈ L2(Ω, σ(Y), P). But E[h(Y)Z] = E[g2(Y)E[g1(X)]Z] = E[g1(X)]E[g2(Y)Z], and by the independence of X and Y the latter is equal to E[g1(X)g2(Y)Z].

    The stated consequence follows from Property 4.3.4. Also, the above result can be easily generalized to random vectors.
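    A small simulation of the consequence E[g(X, Y)] = E[h(Y)] with h(t) = E[g(X, t)]; the choices of g, X, and Y below are my own:

    ```python
    import random

    # Sketch: with X and Y independent, E[g(X, Y)] = E[h(Y)] where
    # h(t) = E[g(X, t)]. Here g(x, y) = (x + y)^2 and X is +/-1 with equal
    # probability, so h(t) = E[X^2] + 2 t E[X] + t^2 = 1 + t^2.
    random.seed(2)

    def g(x, y):
        return (x + y) ** 2

    N = 200_000
    samples_X = [random.choice([-1, 1]) for _ in range(N)]   # E[X]=0, E[X^2]=1
    samples_Y = [random.choice([0, 1, 2]) for _ in range(N)] # independent of X

    lhs = sum(g(x, y) for x, y in zip(samples_X, samples_Y)) / N
    rhs = sum(1 + y * y for y in samples_Y) / N              # E[h(Y)], h(t)=1+t^2
    print(lhs, rhs)
    ```

    Both estimates agree (the common value is E[1 + Y²] = 8/3 for this Y).
    
    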

    4.4. Some applications of Conditional Expectations.

    Example 13 (Photon counting).

    [Figure 5: radiation → detection → current. N photons arrive at the detector; each produces an electric pulse, and the pulses are tallied by a counter whose output is M.]


    Next, E[M²] = E[h₂(N)] = E[N² − N] E[G1]² + (E[G1]² + σ_G²) E[N] = E[G1]²(σ_N² + N̄² − N̄) + (E[G1]² + σ_G²) N̄, where N̄ = E[N] and σ_N² = E[N²] − N̄² is the variance of N. Finally, if we assume that N is a Poisson random variable (as in coherent light), so that σ_N² = N̄, the second moment of M becomes

    E[M²] = E[G1]² N̄² + (E[G1]² + σ_G²) N̄.
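    A Monte Carlo sketch of the Poisson case. It assumes the model of this example, M = G1 + ... + GN with i.i.d. gains Gi independent of N (the particular gain distribution below is an arbitrary choice of mine), and compares the empirical second moment with the formula above:

    ```python
    import math
    import random

    # Monte Carlo check (assumed model: M = G_1 + ... + G_N, i.i.d. gains G_i
    # independent of a Poisson count N) of the second-moment formula
    # E[M^2] = E[G1]^2 * Nbar^2 + (E[G1]^2 + var(G)) * Nbar.
    random.seed(3)

    def poisson(lam):
        """Knuth's method for a Poisson variate."""
        L, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= random.random()
            if p <= L:
                return k
            k += 1

    nbar = 4.0
    gains = [1, 2, 3]                      # G uniform on {1,2,3}: E[G]=2, var=2/3
    EG, varG = 2.0, 2.0 / 3.0

    trials = 100_000
    m2 = 0.0
    for _ in range(trials):
        N = poisson(nbar)
        M = sum(random.choice(gains) for _ in range(N))
        m2 += M * M
    m2 /= trials

    theory = EG ** 2 * nbar ** 2 + (EG ** 2 + varG) * nbar
    print(m2, theory)
    ```
    
    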

    Exercise: For an integer-valued random variable X, we define the generating function of X as Φ_X(s) = E[s^X], s ∈ ℂ, |s| ≤ 1. We can think of the generating function as the z-transform of the probability mass function associated with X.

    For the photon-counting example described above, show that Φ_M(s) = Φ_N(Φ_G(s)).
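    The identity in the exercise can also be checked numerically. The sketch below (my own choices of λ and gain distribution) uses a Poisson N, for which Φ_N(z) = e^{λ(z−1)} is available in closed form:

    ```python
    import math
    import random

    # Numerical sketch of Phi_M(s) = Phi_N(Phi_G(s)) for a randomly stopped
    # sum M = G_1 + ... + G_N (model assumed as in Example 13), with N
    # Poisson(lam) so that Phi_N(z) = exp(lam * (z - 1)).
    random.seed(4)
    lam = 3.0
    gains = [1, 2, 3]                                 # G uniform on {1,2,3}

    def poisson(l):
        L, k, p = math.exp(-l), 0, 1.0
        while True:
            p *= random.random()
            if p <= L:
                return k
            k += 1

    def phi_G(s):
        return (s + s ** 2 + s ** 3) / 3.0            # E[s^G]

    s = 0.7
    trials = 200_000
    emp = 0.0
    for _ in range(trials):
        M = sum(random.choice(gains) for _ in range(poisson(lam)))
        emp += s ** M                                 # empirical E[s^M]
    emp /= trials

    closed = math.exp(lam * (phi_G(s) - 1.0))         # Phi_N(Phi_G(s))
    print(emp, closed)
    ```
    
    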

    Example 14 (First Occurrence of Successive Hits).

    Consider the problem of flipping a coin successively. Suppose P{H} = p and P{T} = q = 1 − p. In this case, we define Ω = ×_{i=1}^∞ Ωᵢ, where for each i, Ωᵢ = {H, T}. This means that members of Ω are simply of the form ω = (ω₁, ω₂, ...), where ωᵢ ∈ Ωᵢ. For each i, we define Fᵢ = {∅, {H, T}, {T}, {H}}. We take the F associated with Ω as the minimal σ-algebra containing all cylinders with measurable bases in Ω. An event E in ×_{i=1}^∞ Ωᵢ is called a cylinder with a measurable base if E is of the form

    {(ω₁, ω₂, ...) : ω_{i1} ∈ B₁, ..., ω_{ik} ∈ B_k}, k finite, Bᵢ ∈ Fᵢ.

    In words, in a cylinder only finitely many of the coordinates are specified. (For example, think of a vertical cylinder in ℝ³, where we specify only two coordinates out of three; now extend this to ℝ^∞.) Let Xᵢ = I{ωᵢ = H} be a {0, 1}-valued random variable, and define X on ×_{i=1}^∞ Ωᵢ as X = (X₁, X₂, ...). Finally, we define P on cylinders in Ω by forming the product of the probabilities of the coordinates (enforcing independence). For each (Ωᵢ, Fᵢ), define Pᵢ{H} = p and Pᵢ{T} = q = 1 − p.

    We would like to define a random variable that tells us when a run of n successive heads appears for the first time. More precisely, define the random variable T₁ as a function of X as follows: T₁ = min{i ≥ 1 : Xᵢ = 1}. For example, if X = (0, 0, 1, 0, 1, 1, 0, 1, 1, 1, ...), then T₁ = 3. More generally, we define T_k = min{i : X_{i−k+1} = X_{i−k+2} = ... = Xᵢ = 1}. For each k, T_k is a r.v. on Ω. For example, if X = (0, 0, 1, 0, 1, 1, 0, 1, 1, 1, ...), then T₂ = 6, and T₃ = 10.


    Let y_k = E[T_k]. In what follows we will characterize y_k.

    Special case: k = 1. It is easy to see that P{T₁ = 1} = p, P{T₁ = 2} = qp, P{T₁ = 3} = q²p, ..., P{T₁ = n} = q^(n−1) p. Recall from undergraduate probability that the above probability law is called the geometric law, and T₁ is called a geometric random variable. Also, y₁ = Σ_{i=1}^∞ i p q^(i−1) = 1/p.

    General case: k > 1. We begin by observing that T_k is actually an explicit function, f say, of T_{k−1}, X_{T_{k−1}+1}, X_{T_{k−1}+2}, .... Note, however, that T_{k−1} and {X_{T_{k−1}+1}, X_{T_{k−1}+2}, ...} are independent. Therefore, by Property 4.3.8, E[T_k] = E[f(T_{k−1}, X_{T_{k−1}+1}, X_{T_{k−1}+2}, ...)] = E[h(T_{k−1})], where h(t) = E[f(t, X_{t+1}, X_{t+2}, ...)].

    It is easy to check (using the equivalent definition of a conditional expectation) that E[f(t, X_{t+1}, ...)|X_{t+1}] = (t + 1) I{X_{t+1} = 1} + (t + 1 + y_k) I{X_{t+1} = 0}. This essentially says that if it took us t flips to see k − 1 consecutive heads for the first time, then if the (t+1)st flip is a head, we have achieved k successive heads at time t + 1; alternatively, if the (t+1)st flip is a tail, then we have to start all over again (start afresh) while we have already wasted t + 1 units of time.

    Now, E[f(t, X_{t+1}, ...)] = E[E[f(t, X_{t+1}, ...)|X_{t+1}]] = E[(t + 1) I{X_{t+1} = 1} + (t + 1 + y_k) I{X_{t+1} = 0}] = (t + 1)p + (t + 1 + y_k)q. Thus, h(t) = (t + 1)p + (t + 1 + y_k)q, and y_k = E[h(T_{k−1})] = E[(T_{k−1} + 1)p + (T_{k−1} + 1 + y_k)q] = p + p y_{k−1} + q y_k + q y_{k−1} + q.

    Finally, we obtain

    (9) y_k = p^(−1) y_{k−1} + p^(−1).

    We now invoke the initial condition y₁ = 1/p, which completes the characterization of y_k.

    For example, if p = 1/2, then y₂ = 2y₁ + 2 = 6, and so on.
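    The recursion (9) is easy to check by simulation; the sketch below (parameters are my own choices) compares a Monte Carlo estimate of E[T_k] with the value produced by the recursion:

    ```python
    import random

    # Monte Carlo check of the recursion y_k = (y_{k-1} + 1)/p from (9):
    # expected time of the first run of k consecutive heads, P{H} = p.
    random.seed(5)
    p, k = 0.5, 3

    def first_run_time(p, k):
        run, t = 0, 0
        while True:
            t += 1
            run = run + 1 if random.random() < p else 0
            if run == k:
                return t

    trials = 100_000
    emp = sum(first_run_time(p, k) for _ in range(trials)) / trials

    y = 1.0 / p                       # y_1 = 1/p
    for _ in range(k - 1):
        y = (y + 1.0) / p             # y_k = p^{-1} y_{k-1} + p^{-1}
    print(emp, y)
    ```

    For p = 1/2 the recursion gives y₁ = 2, y₂ = 6, y₃ = 14, and the simulation agrees.
    
    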


    Example 15 (Importance of the Independence Hypothesis in Property 4.3.8).

    Let Y = XZ, where Z = 1 + X. Now, E[Y] = E[XZ] = E[X(1 + X)] = E[X] + E[X²]. However, if we erroneously attempt to use Property 4.3.8 (by ignoring the dependence of Z on X), we take h(t) = E[Xt] = E[X]t and obtain E[h(Z)] = E[E[X]Z] = E[X]E[Z] = E[X](1 + E[X]) = E[X] + E[X]². Note that the two are different since E[X²] ≠ E[X]² in general.

    4.5. The L1 conditional expectation. Consider a probability space (Ω, F, P) and let X be an integrable random variable, that is, E[|X|] < ∞. We write X ∈ L1(Ω, F, P). Let D be a sub-σ-algebra. For each M ≥ 1, we define the truncation X ∧ M as X when |X| ≤ M and M otherwise. Note that X ∧ M ∈ L2(Ω, F, P) (since it is bounded), so we can talk about E[(X ∧ M)|D], which we call Z_M. The question is: can we somehow use Z_M to define E[X|D]? Intuitively, we would want to think of E[X|D] as some kind of limit of Z_M as M → ∞. So one approach to the construction of E[X|D] is to take a limit of Z_M and then show that the limit, Z say, satisfies Definition 6 with L2 replaced by L1. It turns out that such an approach would lead to the following definition, which we will adopt from now on.

    Definition 7 (L1 Conditional Expectation). If X is an integrable random variable, then we define Z = E[X|D] if Z has the following two properties: (1) Z is D-measurable, and (2) E[ZY] = E[XY] for any bounded D-measurable random variable Y.

    All the properties of the L2 conditional expectation carry over to the L1 conditional expectation.

    We make the final remark that if a random variable X is square integrable, then it is integrable. The proof is easy. Suppose that X is square integrable. Then E[|X|] = E[|X| I{|X| ≤ 1}] + E[|X| I{|X| > 1}] ≤ E[I{|X| ≤ 1}] + E[X² I{|X| > 1}] ≤ 1 + E[X²] < ∞.

    4.6. Conditional probability. Recall that if P(B) > 0, then we define the conditional probability of A given that the event B has occurred as P(A|B) = P(A ∩ B)/P(B).


    What is the connection between this definition and our notion of a conditional expectation? Consider the σ-algebra D = {∅, Ω, B, Bᶜ}, and consider E[I_A|D]. Because of the special form of this D, and because we know that this conditional expectation is D-measurable, we infer that E[I_A|D] can assume only two values: one value on B and another value on Bᶜ. That is, we can write E[I_A|D] = a I_B + b I_{Bᶜ}, where a and b are constants. We claim that a = P(A ∩ B)/P(B) and b = P(A ∩ Bᶜ)/P(Bᶜ). Note that because P(B) > 0 and 1 − P(B) > 0, we are not dividing by zero. As seen from a homework problem, we can prove that P(A ∩ B)/P(B) I_B + P(A ∩ Bᶜ)/P(Bᶜ) I_{Bᶜ} is actually the conditional expectation E[I_A|D] by showing that it satisfies the two defining properties listed in Definition 6 or Definition 7. Thus, E[I_A|D] encompasses both P(A|B) and P(A|Bᶜ); that is, P(A|B) and P(A|Bᶜ) are simply the values of E[I_A|D] on B and Bᶜ, respectively. Also note that P(·|B) is actually a probability measure for each B as long as P(B) > 0.
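    On a finite sample space this claim can be verified directly; the events A and B below are a toy example of my own. The check confirms that Z = P(A|B) I_B + P(A|Bᶜ) I_{Bᶜ} satisfies the defining property E[ZY] = E[I_A Y] for every D-measurable Y:

    ```python
    # Finite-sample-space check: with D = {empty, Omega, B, B^c}, the variable
    # Z = P(A|B) I_B + P(A|B^c) I_{B^c} satisfies E[Z Y] = E[I_A Y] for every
    # D-measurable Y (which is necessarily of the form cB*I_B + cBc*I_{B^c}).
    omega = [0, 1, 2, 3, 4, 5]
    P = {w: 1 / 6 for w in omega}            # uniform probability
    A = {0, 1, 4}
    B = {0, 1, 2}
    Bc = set(omega) - B

    def prob(E_set):
        return sum(P[w] for w in E_set)

    a = prob(A & B) / prob(B)                # P(A|B)   = (2/6)/(3/6) = 2/3
    b = prob(A & Bc) / prob(Bc)              # P(A|B^c) = (1/6)/(3/6) = 1/3
    Z = {w: (a if w in B else b) for w in omega}

    def expect(f):
        return sum(f[w] * P[w] for w in omega)

    ok = True
    for (cB, cBc) in [(1, 0), (0, 1), (2.5, -1.0)]:
        Y = {w: (cB if w in B else cBc) for w in omega}
        ZY = {w: Z[w] * Y[w] for w in omega}
        IAY = {w: (1 if w in A else 0) * Y[w] for w in omega}
        ok = ok and abs(expect(ZY) - expect(IAY)) < 1e-12
    print(a, b, ok)
    ```
    
    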

    4.7. Joint densities and marginal densities. Let X and Y be random variables on (Ω, F, P) with a distribution μ_XY(B) defined on all Borel subsets B of the plane. Suppose that X and Y have a joint density. That is, there exists an integrable function f_XY(·, ·) : ℝ² → ℝ such that

    (10) μ_XY(B) = ∫∫_B f_XY(x, y) dx dy, for any B ∈ B².

    4.7.1. Marginal densities. Consider μ_XY(A × ℝ), where A ∈ B. Notice that A ↦ μ_XY(A × ℝ) is a probability measure. Also, by the integral representation shown in (10),

    (11) μ_XY(A × ℝ) = ∫∫_{A×ℝ} f_XY(x, y) dx dy = ∫_A ( ∫_ℝ f_XY(x, y) dy ) dx = ∫_A f_X(x) dx,

    where f_X(x) = ∫_ℝ f_XY(x, y) dy is called the marginal density of X. Note that f_X(·) qualifies as the pdf of X. Thus, we arrive at the familiar result that the pdf of X can be obtained from the joint pdf of X and Y through integration. Similarly, f_Y(y) = ∫_ℝ f_XY(x, y) dx.


    In the above, we have used two facts about integration: (1) that we can write a double integral as an iterated integral, and (2) that we can exchange the order of integration in the iterated integrals. These two properties are always true as long as the integrand is non-negative. This is a consequence of what is called Fubini's Theorem in analysis [2, Chapter 8]. To complete the proof, we need to address the concern over the points at which f_X(·) = 0 (in which case we will not be able to cancel the f_X's in the numerator and the denominator in the above development). However, it is straightforward to show that P{f_X(X) = 0} = 0, which guarantees that we can exclude these problematic points from the integration in the x-direction without changing the integral.

    With the above results, we can think of

    f_XY(x, y) / ∫_ℝ f_XY(x, y) dy

    as a conditional pdf. In particular, if we define

    (14) f_{Y|X}(y|x) = f_XY(x, y) / ∫_ℝ f_XY(x, y) dy = f_XY(x, y) / f_X(x),

    then we can calculate the conditional expectation E[Y|X] using the formula ∫_ℝ y f_{Y|X}(y|X) dy, which is the familiar result that we know from undergraduate probability.
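    A numerical sketch of (14), using a joint pdf of my own choosing, f_XY(x, y) = x + y on the unit square, for which E[Y|X = x] = (x/2 + 1/3)/(x + 1/2) in closed form:

    ```python
    # Numerical check of (14) for the joint pdf f_XY(x, y) = x + y on [0,1]^2:
    # E[Y | X = x] = (integral of y*f_XY dy) / (integral of f_XY dy), computed
    # by a midpoint Riemann sum and compared with the closed form.
    def f_xy(x, y):
        return x + y                      # a valid joint pdf on the unit square

    def cond_mean(x, n=10_000):
        dy = 1.0 / n
        num = sum((i + 0.5) * dy * f_xy(x, (i + 0.5) * dy) for i in range(n)) * dy
        den = sum(f_xy(x, (i + 0.5) * dy) for i in range(n)) * dy
        return num / den

    x = 0.25
    approx = cond_mean(x)
    exact = (x / 2 + 1 / 3) / (x + 1 / 2)
    print(approx, exact)
    ```
    
    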


    5. Convergence of random sequences

    Consider a sequence of random variables X₁, X₂, ..., defined on the product space (Ω, F, P), where as before, Ω = ×_{i=1}^∞ Ωᵢ is the infinite-dimensional Cartesian product space and F is the smallest σ-algebra containing all cylinders. Since each Xᵢ is a random variable, when we talk about convergence of random variables we must do so in the context of functions.

    5.1. Pointwise convergence. Let us take Ω = [0, 1], F = B[0, 1], and take P as Lebesgue measure (generalized length) on [0, 1]. Consider the sequence of random variables Xn(ω) defined as follows:

    (15) Xn(ω) = { n²ω, 0 ≤ ω ≤ 1/(2n); n − n²ω, 1/(2n) < ω ≤ 1/n; 0, otherwise. }

    It is straightforward to see that for any ω ∈ Ω, the sequence of random variables Xn(ω) converges to the constant function 0. We say that Xn(ω) → 0 pointwise, everywhere, or surely. Generally, a sequence of random variables Xn converges to a random variable X if for every ω ∈ Ω, Xn(ω) → X(ω). To be more precise, for every ε > 0 and every ω ∈ Ω, there exists n₀ ∈ ℕ such that |Xn(ω) − X(ω)| < ε whenever n ≥ n₀. Note that in this definition n₀ not only depends upon the choice of ε but also on the choice of ω. Note that in the example above we cannot drop the dependence of n₀ on ω. To see that, observe that sup_ω |Xn(ω) − 0| = n/2; thus, there is no single n₀ that will work for every ω.

    5.2. Uniform convergence. The strongest type of convergence is what's called uniform convergence, in which case n₀ is independent of the choice of ω. More precisely, a sequence of random variables Xn converges to a random variable X uniformly if for every ε > 0, there exists n₀ ∈ ℕ such that |Xn(ω) − X(ω)| < ε for any ω ∈ Ω whenever n ≥ n₀. This is equivalent to saying that for every ε > 0, there exists n₀ ∈ ℕ such that sup_ω |Xn(ω) − X(ω)| < ε whenever n ≥ n₀. Can you make a small change in the example above to make the convergence uniform?
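    The failure of uniform convergence in example (15) can be seen numerically: the peak sup_ω Xn(ω) = n/2 grows with n even though every fixed ω is eventually sent to 0. A small sketch (the grid resolution is an arbitrary choice of mine):

    ```python
    # Tent functions from (15): pointwise convergence to 0, but the sup over
    # omega equals n/2, so there is no single n0 working for every omega.
    def X(n, w):
        if 0 <= w <= 1 / (2 * n):
            return n * n * w
        if 1 / (2 * n) < w <= 1 / n:
            return n - n * n * w
        return 0.0

    grid = [i / 10_000 for i in range(10_001)]
    sup_vals = {n: max(X(n, w) for w in grid) for n in (10, 50, 100)}

    w0 = 0.3                                 # any fixed omega: X_n(w0) is 0 eventually
    pointwise = [X(n, w0) for n in (10, 50, 100)]
    print(sup_vals, pointwise)
    ```
    
    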


    5.3. Almost-sure convergence (or almost-everywhere convergence). Now consider a slight variation of the example in (15):

    (16) Xn(ω) = { n²ω, 0 < ω ≤ 1/(2n); n − n²ω, 1/(2n) < ω ≤ 1/n; 1, ω ∈ (1/n, 1] ∩ ℚ; 0, ω ∈ (1/n, 1] ∩ ℚᶜ. }

    Note that in this case Xn(ω) → 0 for ω ∈ [0, 1) ∩ ℚᶜ. Note that the Lebesgue measure (or generalized length) of the set of points for which Xn(ω) does not converge to the function 0 is zero. (The latter is because the measure of the set of rational numbers is zero. To see that, let r₁, r₂, ... be an enumeration of the rational numbers. Pick any ε > 0, and define the intervals Jn = (rn − ε2^(−n−1), rn + ε2^(−n−1)). Now, if we sum up the lengths of all of these intervals, we obtain ε. However, since ℚ ⊂ ∪_{n=1}^∞ Jn, the length of the set ℚ cannot exceed ε. Since ε can be selected arbitrarily small, we conclude that the Lebesgue measure of ℚ must be zero.) In this case, we say that the sequence of functions Xn converges to the constant random variable 0 almost everywhere.

    Generally, we say that a sequence of random variables Xn converges to a random variable X almost surely if there exists A ∈ F, with the property P(A) = 1, such that for every ω ∈ A, Xn(ω) → X(ω). This is the strongest convergence statement that we can make in a probabilistic sense (because we don't care about the points that belong to a set that has probability zero). Note that we can write this type of convergence by saying that Xn → X almost surely (or a.s.) if P{ω : lim_{n→∞} Xn(ω) = X(ω)} = 1, or simply P{lim_{n→∞} Xn = X} = 1.

    5.4. Convergence in probability (or convergence in measure). We say that the sequence Xn converges to a random variable X in probability (also called in measure) if for every ε > 0, lim_{n→∞} P{|Xn − X| > ε} = 0. This is a weaker type of convergence than almost-sure convergence, as seen next.

    Theorem 8. Almost-sure convergence implies convergence in probability.

    Proof

    Let ε > 0 be given. For each N ≥ 1, define E_N = {|Xn − X| > ε for some n ≥ N}.


    If we define E = ∩_{N=1}^∞ E_N, we observe that ω ∈ E implies that Xn(ω) does not converge to X(ω). (This is because if ω ∈ E, then no matter how large we pick N, say N₀, there will be an n ≥ N₀ such that |Xn − X| > ε.) Hence, since we know that Xn converges to X a.s., it must be true that P(E) = 0. Now observe that since E₁ ⊃ E₂ ⊃ ..., P(E) = lim_{N→∞} P(E_N), and therefore lim_{N→∞} P(E_N) = 0. Observe that if ω ∈ {|Xn − X| > ε} and n ≥ N, then ω ∈ {|Xn − X| > ε for some n ≥ N}. Thus, for n ≥ N, {|Xn − X| > ε} ⊂ E_N and P{|Xn − X| > ε} ≤ P(E_N). Hence, lim_{n→∞} P{|Xn − X| > ε} = 0.

    This theorem is not true, however, in infinite measure spaces. For example, take Ω = ℝ, F = B, and the Lebesgue measure m. Let Xn(ω) = ω/n. Then clearly, Xn → 0 everywhere, but m({|Xn − 0| > ε}) = ∞ for any ε > 0. Where did we use the fact that P is a finite measure in the proof of Theorem 8?

    5.5. Convergence in the mean-square sense (or in L2). We say that the sequence Xn converges to a random variable X in the mean-square sense (also in L2) if lim_{n→∞} E[|Xn − X|²] = 0. Recall that this is precisely the type of convergence we defined in the Hilbert-space context, which we used to introduce the conditional expectation. This is a stronger type of convergence than convergence in probability, as seen next.

    Theorem 9. Convergence in the mean-square sense implies convergence in probability.

    Proof

    This is an easy consequence of the Markov inequality when taking φ(x) = x², which yields P{|Xn − X| > ε} ≤ E[|Xn − X|²]/ε².

    Convergence in probability does not imply almost-sure convergence, as seen from the following example.


    Example 16 (The marching-band function). Consider Ω = [0, 1], take P to be Lebesgue measure on [0, 1], and define Xn as follows:

    (17) Xn(ω) = I_[2^(−1)(n−1), 2^(−1)n](ω), n ∈ {1, 2};
    Xn(ω) = I_[2^(−2)(n−1−2¹), 2^(−2)(n−2¹)](ω), n ∈ {1 + 2¹, ..., 2¹ + 2²};
    ...
    Xn(ω) = I_[2^(−k)(n−1−2¹−...−2^(k−1)), 2^(−k)(n−2¹−...−2^(k−1))](ω), n ∈ {1 + 2¹ + ... + 2^(k−1), ..., 2¹ + ... + 2^(k−1) + 2^k};
    ...

    Note that for any ω ∈ [0, 1], Xn(ω) = 1 for infinitely many n's; thus, Xn does not converge to 0 anywhere. At the same time, if n ∈ {1 + 2¹ + ... + 2^(k−1), ..., 2¹ + ... + 2^(k−1) + 2^k}, then P{|Xn − 0| > ε} = 2^(−k); hence, lim_{n→∞} P{|Xn − 0| > ε} = 0 and Xn converges to 0 in probability.
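    The marching-band construction can be sketched programmatically: the n-th function is the indicator of a dyadic interval whose length 2^(−k) shrinks with the row k, while any fixed ω lands in one interval of every row. The indexing helper below is my own reconstruction of (17):

    ```python
    # Marching-band indicators: the n-th function is 1 on a dyadic interval of
    # length 2^{-k}, where k is the "row" containing index n. Any fixed omega
    # is hit once per row (so infinitely often), yet the interval lengths -> 0.
    def interval(n):
        """Return (k, left, right) for the n-th marching-band indicator."""
        k, start = 1, 1
        while start + 2 ** k - 1 < n:     # advance to the row k containing n
            start += 2 ** k
            k += 1
        j = n - start                     # position within row k
        return k, j * 2.0 ** -k, (j + 1) * 2.0 ** -k

    w = 0.3
    hits = [n for n in range(1, 2 ** 10)
            if interval(n)[1] <= w <= interval(n)[2]]
    lengths = sorted({interval(n)[2] - interval(n)[1] for n in (1, 3, 7, 15)})
    print(len(hits), lengths)
    ```

    For ω = 0.3 the first 1023 indices already contain one hit in each of the first nine rows, while the interval lengths halve row by row.
    
    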

    Note that in the above example, lim_{n→∞} E[|Xn − 0|²] = 0, so Xn also converges to 0 in L2.

    Later we shall prove that any sequence of random variables that converges in probability has a subsequence that converges almost surely to the same limit.

    Recall the example given in (15) and let's modify it slightly as follows:

    (18) Xn(ω) = { n^(3/2)ω, 0 ≤ ω ≤ 1/(2n); n^(1/2) − n^(3/2)ω, 1/(2n) < ω ≤ 1/n; 0, otherwise. }

    In this case, Xn continues to converge to 0 at every point; nonetheless, E[|Xn − 0|²] = 1/12 for every n. Thus, Xn does not converge to 0 in L2.

    5.6. Convergence in distribution. A sequence Xn is said to converge to a random variable X in distribution if the distribution functions F_Xn(x) converge to the distribution function of X, F_X(x), at every point of continuity of F_X(x).

    Theorem 10. Convergence in probability implies convergence in distribution.


    Proof (Adapted from T. G. Kurtz)

    Pick ε > 0 and note that

    P{X ≤ x + ε} = P{X ≤ x + ε, Xn > x} + P{X ≤ x + ε, Xn ≤ x}

    and

    P{Xn ≤ x} = P{Xn ≤ x, X > x + ε} + P{Xn ≤ x, X ≤ x + ε}.

    Thus,

    (19) P{Xn ≤ x} − P{X ≤ x + ε} = P{Xn ≤ x, X > x + ε} − P{X ≤ x + ε, Xn > x}.

    Note that since {Xn ≤ x, X > x + ε} ⊂ {|Xn − X| > ε},

    P{Xn ≤ x} − P{X ≤ x + ε} ≤ P{|Xn − X| > ε}.

    By taking the limit superior of both sides and noting that lim_{n→∞} P{|Xn − X| > ε} = 0, we obtain lim sup_{n→∞} P{Xn ≤ x} ≤ P{X ≤ x + ε}.

    Now replace every occurrence of ε in (19) with −ε, multiply both sides by −1, and obtain

    −P{Xn ≤ x} + P{X ≤ x − ε} = −P{Xn ≤ x, X > x − ε} + P{X ≤ x − ε, Xn > x}.

    Since {X ≤ x − ε, Xn > x} ⊂ {|Xn − X| > ε}, we have

    −P{Xn ≤ x} + P{X ≤ x − ε} ≤ P{|Xn − X| > ε}.

    By taking the limit inferior of both sides, we obtain lim inf_{n→∞} P{Xn ≤ x} ≥ P{X ≤ x − ε}.

    Combining, we obtain P{X ≤ x − ε} ≤ lim inf_{n→∞} P{Xn ≤ x} ≤ lim sup_{n→∞} P{Xn ≤ x} ≤ P{X ≤ x + ε}. Thus, by letting ε → 0 we conclude that lim_{n→∞} F_Xn(x) = F_X(x) at every point of continuity of F_X.

    Theorem 11 (Bounded Convergence Theorem). Suppose that Xn converges to X in probability, and that |Xn| ≤ M a.s. for all n, for some fixed M. Then E[Xn] → E[X] as n → ∞.


    The following example shows that the boundedness requirement is not superfluous. Take Ω = [0, 1], F = B[0, 1], and m as Lebesgue measure (i.e., m(A) = ∫_A dx). Consider the functions fn(t) shown in the figure. Note that fn(t) → 0 for every t, but E[fn] = 1 for any n. Therefore, E[fn] does not converge to E[0] = 0.

    Proof of Theorem 11 (Adapted from A. Beck).

    Let En = {|Xn| > M} and note that P(En) = 0 by the boundedness assumption. Define E₀ = ∪_{n=1}^∞ En and note that P(E₀) ≤ Σ_{n=1}^∞ P(En) = P(E₁) + P(E₂) + ... = 0.

    We will first show that |X| ≤ M a.s. Consider W_{n,k} = {|Xn − X| > 1/k}, k = 1, 2, .... Since Xn converges to X in probability, P(W_{n,k}) → 0 as n → ∞. Let F_k = {|X| > M + 1/k}. Note that off E₀, F_k ⊂ W_{n,k} a.s., regardless of n, and P(F_k) ≤ P(W_{n,k}). Then P(F_k) ≤ lim_{n→∞} P(W_{n,k}) = 0, or P(F_k) = 0. Therefore, P(∪_{k=1}^∞ F_k) = 0. Since ∪_{k=1}^∞ F_k = {|X| > M}, we have |X| ≤ M a.s.

    Let ε > 0 be given. Let F₀ = ∪_{k=1}^∞ F_k and G_n = {|Xn − X| > ε}. Note that Ω can be written as the disjoint union (F₀ ∪ E₀) ∪ (G_n ∪ F₀ ∪ E₀)ᶜ ∪ (G_n \ (F₀ ∪ E₀)). Thus, |E[Xn − X]| ≤ E[|Xn − X|] = E[|Xn − X| I_{F₀∪E₀}] + E[|Xn − X| I_{G_n\(F₀∪E₀)}] + E[|Xn − X| I_{(G_n∪F₀∪E₀)ᶜ}] ≤ 0 + 2M E[I_{G_n}] + ε.

    Since P(G_n) → 0, there exists n₀ such that n ≥ n₀ implies P(G_n) ≤ ε/(2M). Hence, for n ≥ n₀, |E[Xn] − E[X]| ≤ 2M · ε/(2M) + ε = 2ε, which establishes E[Xn] → E[X] as n → ∞.


    Lemma 2 (Adapted from T. G. Kurtz). Suppose that X ≥ 0 a.s. Then lim_{M→∞} E[X ∧ M] = E[X].

    Proof

    First note that the statement is true when X is a discrete random variable. Then recall that we can always approximate a random variable X monotonically from below by a discrete random variable. More precisely, define Xn ↑ X as in (5) and note that Xn ≤ X and E[Xn] → E[X]. Now since Xn ∧ M ≤ X ∧ M ≤ X, we have

    E[Xn] = lim_{M→∞} E[Xn ∧ M] ≤ lim inf_{M→∞} E[X ∧ M] ≤ lim sup_{M→∞} E[X ∧ M] ≤ E[X].

    Take the limit as n → ∞ to obtain

    E[X] ≤ lim inf_{M→∞} E[X ∧ M] ≤ lim sup_{M→∞} E[X ∧ M] ≤ E[X],

    from which the lemma follows.

    Theorem 12 (Monotone Convergence Theorem, Adapted from T. Kurtz). Suppose that 0 ≤ Xn ↑ X and Xn converges to X in probability. Then lim_{n→∞} E[Xn] = E[X].

    Proof

    For M > 0, Xn ∧ M ≤ Xn ≤ X. Thus,

    E[Xn ∧ M] ≤ E[Xn] ≤ E[X].

    By the bounded convergence theorem, lim_{n→∞} E[Xn ∧ M] = E[X ∧ M]. Hence,

    E[X ∧ M] ≤ lim inf_{n→∞} E[Xn] ≤ lim sup_{n→∞} E[Xn] ≤ E[X].


    Now by Lemma 2, lim_{M→∞} E[X ∧ M] = E[X], and we finally obtain

    E[X] ≤ lim inf_{n→∞} E[Xn] ≤ lim sup_{n→∞} E[Xn] ≤ E[X],

    and the theorem is proven.

    Lemma 3 (Fatou's Lemma, Adapted from T. G. Kurtz). Suppose that Xn ≥ 0 a.s. and Xn converges to X in probability. Then lim inf_{n→∞} E[Xn] ≥ E[X].

    Proof

    For M > 0, Xn ∧ M ≤ Xn. Thus,

    E[X ∧ M] = lim_{n→∞} E[Xn ∧ M] ≤ lim inf_{n→∞} E[Xn],

    where the left equality is due to the bounded convergence theorem. Now by Lemma 2, lim_{M→∞} E[X ∧ M] = E[X], and we obtain E[X] ≤ lim inf_{n→∞} E[Xn].

    Theorem 13 (Dominated Convergence Theorem). Suppose that Xn → X in probability, |Xn| ≤ Y, and E[Y] < ∞. Then lim_{n→∞} E[Xn] = E[X].

    [...]

    Take Y = I{Zn − Zm > 0} − I{Zn − Zm ≤ 0}. In this case, (Zn − Zm)Y = |Zn − Zm|. Thus, we conclude that E[|Zn − Zm|] = E[(Xn − Xm)(I{Zn − Zm > 0} − I{Zn − Zm ≤ 0})].


    But the right-hand side is no greater than E[|Xn − Xm|] (why?), and we obtain E[|Zn − Zm|] ≤ E[|Xn − Xm|]. However, E[|Xn − Xm|] ≤ E[|X| I{|X| > min(m, n)}] → 0 by the dominated convergence theorem (verify this), and we conclude that lim_{m,n→∞} E[|Zn − Zm|] = 0.

    From a key theorem in analysis (e.g., see Rudin, Chapter 3), we know that any L1 space is complete. Thus, there exists Z ∈ L1(Ω, D, P) such that lim_{n→∞} E[|Zn − Z|] = 0. Further, Zn has a subsequence, Z_{n_k}, that converges almost surely to Z. We take this Z as a candidate for E[X|D]. But first, we must show that Z satisfies E[ZY] = E[XY] for any bounded, D-measurable Y. This is easy to show. Suppose that |Y| < M for some M. We already know that E[ZnY] = E[XnY], and by the dominated convergence theorem, E[XnY] → E[XY] (since |Xn| ≤ |X|). Also, |E[ZnY] − E[ZY]| ≤ E[|Zn − Z||Y|] ≤ E[|Zn − Z| M] → 0. Hence, E[ZnY] → E[ZY]. This leads to the conclusion that E[ZY] = E[XY] for any bounded, D-measurable Y.

    In summary, we have found a Z ∈ L1(Ω, D, P) such that E[ZY] = E[XY] for any bounded, D-measurable Y. We define this Z as E[X|D].

    5.8. Central Limit Theorems. To establish this theorem, we need the concept of a characteristic function. Let Y be a r.v. We define the characteristic function of Y, φ_Y(u), as E[e^{iuY}] = E[cos(uY)] + iE[sin(uY)], which always exists since cos(uY) and sin(uY) are bounded random variables.


    Lemma 4. If φ_X(u) is the characteristic function of X, then

    lim_{c→∞} (1/2π) ∫_{−c}^{c} ((e^{−iua} − e^{−iub})/(iu)) φ_X(u) du = P{a < X < b} + (P{X = a} + P{X = b})/2.

    See Chow and Teicher for the proof.

    As a special case, if X is an absolutely continuous random variable with pdf f_X, then (1/2π) ∫_{−∞}^{∞} e^{−iux} φ_X(u) du = f_X(x).

    Theorem 14 (Central limit theorem). Suppose {X_k} is a sequence of i.i.d. random variables with E[X_k] = μ and Var X_k = σ². Let S_n = Σ_{k=1}^n X_k and Y_n = (S_n − nμ)/(σ√n). Then Y_n converges to Z in distribution, where Z ~ N(0, 1) (a zero-mean Gaussian random variable with unit variance).

    Sketch of the proof

    Observe that

    φ_Yn(ν) = E[e^{jνYn}]
    = E[e^{jν Σ_{k=1}^n (X_k − μ)/(σ√n)}]
    = E[Π_{k=1}^n e^{j(ν/(σ√n))(X_k − μ)}]
    = Π_{k=1}^n E[e^{j(ν/(σ√n))(X_k − μ)}]
    = Π_{k=1}^n φ_{X_k − μ}(ν/(σ√n))
    = (φ_{X_1 − μ}(ν/(σ√n)))^n.

    We now expand φ_{X_k − μ}(ν) as

    φ_{X_k − μ}(ν) = E[e^{jν(X_k − μ)}]
    = E[1 + jν(X_k − μ) + ((jν)²/2!)(X_k − μ)² + ((jν)³/3!)(X_k − μ)³ + ...]
    = 1 + jν E[X_k − μ] − (ν²/2) E[(X_k − μ)²] + ((jν)³/3!) E[(X_k − μ)³] + ...
    = 1 − (ν²/2) E[(X_k − μ)²] + ((jν)³/3!) E[(X_k − μ)³] + ....


    Then we have

    φ_{X_k − μ}(ν/(σ√n)) = 1 − ((ν/(σ√n))²/2) E[(X_k − μ)²] + ((j(ν/(σ√n)))³/3!) E[(X_k − μ)³] + ...
    = 1 − (ν²/(2nσ²)) E[(X_k − μ)²] − (jν³/(3! n^(3/2) σ³)) E[(X_k − μ)³] + ...
    ≈ 1 − (ν²/(2nσ²)) E[(X_k − μ)²]
    = 1 − ν²/(2n).

    Then φ_Yn(ν) ≈ (1 − ν²/(2n))^n and lim_{n→∞} φ_Yn(ν) = lim_{n→∞} (1 − ν²/(2n))^n = e^{−ν²/2}. On the other hand, the characteristic function of Z is φ_Z(ν) = E[e^{jνZ}]. Then

    φ_Z(ν) = E[e^{jνZ}]
    = ∫_{−∞}^{∞} e^{jνz} f_Z(z) dz
    = ∫_{−∞}^{∞} e^{jνz} (1/√(2π)) e^{−z²/2} dz
    = (1/√(2π)) ∫_{−∞}^{∞} e^{−(z² − 2jνz)/2} dz
    = (1/√(2π)) ∫_{−∞}^{∞} e^{−((z − jν)² + ν²)/2} dz
    = (1/√(2π)) e^{−ν²/2} ∫_{−∞}^{∞} e^{−(z − jν)²/2} dz
    = e^{−ν²/2}.

    Therefore, lim_{n→∞} φ_Yn(ν) = φ_Z(ν), and F_Yn → F_Z as n → ∞ by the inversion lemma.
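    A Monte Carlo sketch of the theorem (the summand distribution, n, and the evaluation point below are my own choices):

    ```python
    import math
    import random

    # Monte Carlo sketch of the CLT: the standardized sum of n i.i.d.
    # Uniform(0,1) variables is approximately N(0,1) in distribution.
    random.seed(6)
    n, trials = 48, 20_000
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)

    def Yn():
        s = sum(random.random() for _ in range(n))
        return (s - n * mu) / (sigma * math.sqrt(n))

    samples = [Yn() for _ in range(trials)]

    def Phi(x):                       # standard normal CDF via math.erf
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    emp = sum(1 for y in samples if y <= 1.0) / trials
    print(emp, Phi(1.0))
    ```

    The empirical CDF of Y_n at the point 1 is close to Φ(1) ≈ 0.8413.
    
    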

    5.10. Central limit theorem for non-identical random variables: Lindeberg's Theorem. We first define what is called a triangular array. For each n ≥ 1, the sequence X_{n1}, ..., X_{n r_n} is a sequence of zero-mean independent random variables, where the r_n are arbitrarily specified integers. An example of r_n is n, which is where the term triangular comes from. Let S_n = X_{n1} + ... + X_{n r_n} and s_n² = Σ_{k=1}^{r_n} E[X_{nk}²]. Suppose further that for each n the sequence X_{n1}, ..., X_{n r_n} satisfies the Lindeberg condition


    stated below:

    (20) lim_{n→∞} Σ_{k=1}^{r_n} (1/s_n²) E[X_{nk}² I{|X_{nk}| ≥ ε s_n}] = 0, for any ε > 0.

    Then S_n/s_n converges in distribution to a unit-variance, zero-mean normal random variable.

    For a proof, see the text by P. Billingsley listed in the bibliography.

    Remark: It can be shown (e.g., with the help of the Bounded Convergence Theorem) that the Lindeberg condition is automatically satisfied if the sequence is identically distributed. See HW-7.
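    A sketch of Lindeberg's theorem on a concrete triangular array of my own choosing: bounded uniform summands, for which condition (20) holds since max_k |X_{nk}|/s_n → 0:

    ```python
    import math
    import random

    # Monte Carlo sketch of Lindeberg's theorem for a triangular array of
    # independent, zero-mean, non-identical X_{nk} ~ Uniform(-c_k, c_k); the
    # c_k are bounded, so the Lindeberg condition holds and S_n/s_n ~ N(0,1).
    random.seed(7)
    n, trials = 60, 20_000
    c = [1.0 + (k % 3) for k in range(n)]            # spreads 1, 2, 3, 1, 2, ...
    s_n = math.sqrt(sum(ck * ck / 3.0 for ck in c))  # Var Uniform(-c,c) = c^2/3

    def Phi(x):                                      # standard normal CDF
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    emp = 0
    for _ in range(trials):
        S = sum(random.uniform(-ck, ck) for ck in c)
        if S / s_n <= 0.5:
            emp += 1
    emp /= trials
    print(emp, Phi(0.5))
    ```
    
    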

    5.11. Strong Law of Large Numbers. Suppose that X₁, X₂, ... are i.i.d. random variables with E[X₁] = μ, and let S_n = n^(−1) Σ_{i=1}^n Xᵢ. We would like to show that S_n → μ almost surely. Note that if an outcome ω_o belongs to {|S_n − μ| > ε for infinitely many n}, where ε > 0, then we would know that S_n(ω_o) cannot converge to μ.

    More generally, we define the event {A_n occurs infinitely often}, or {A_n i.o.} for short, as

    {A_n i.o.} = ∩_{n=1}^∞ ∪_{k=n}^∞ A_k.

    The above event is also referred to as the limit superior of the sequence A_n; that is, {A_n i.o.} = lim sup_{n→∞} A_n. The terminology of limit superior makes sense since


    the sets $\bigcup_{k=n}^{\infty} A_k$ form a decreasing sequence in $n$, and the intersection yields the infimum of this decreasing sequence.

    Before we prove a key result on the probability of events such as $\{A_n \text{ i.o.}\}$, we will give an example showing their use.

    give an example showing their use.

    Let $Y_n$ be any sequence of random variables, and let $Y$ be a random variable (possibly a candidate limit of the sequence, provided that the sequence is convergent). Consider the event $\{\lim_{n} Y_n = Y\}$, the collection of all outcomes $\omega$ for which $Y_n(\omega) \to Y(\omega)$. It is easy to see that

    (21) $\{\lim_{n} Y_n = Y\}^c = \bigcup_{\epsilon > 0,\ \epsilon \in \mathbb{Q}} \{|Y_n - Y| > \epsilon \text{ i.o.}\}.$

    This is because if $\omega_o \in \{|Y_n - Y| > \epsilon \text{ i.o.}\}$ for some $\epsilon > 0$, then we are guaranteed that for any $n \ge 1$, $|Y_k - Y| > \epsilon$ for infinitely many $k \ge n$; thus, $Y_n(\omega_o)$ does not converge to $Y(\omega_o)$.

    Lemma 5 (Borel-Cantelli Lemma). Let $A_1, A_2, \ldots$ be any sequence of events. If $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P\{A_n \text{ i.o.}\} = 0$.

    To apply this to the strong law, assume (without loss of generality) that $\mu = 0$, and suppose that $E[X_1^4] < \infty$. In view of (21),

    (22) $\{\lim_{n} S_n = 0\}^c = \bigcup_{\epsilon > 0,\ \epsilon \in \mathbb{Q}} \{|S_n| > \epsilon \text{ i.o.}\},$

    so it suffices to show that $P\{|S_n| > \epsilon \text{ i.o.}\} = 0$ for every $\epsilon > 0$. By the Markov inequality (with $\varphi(x) = x^4$),

    $P\{|S_n| > \epsilon\} \le \frac{E[S_n^4]}{\epsilon^4} = \frac{\sum_{i=1}^n \sum_{j=1}^n \sum_{\ell=1}^n \sum_{m=1}^n E[X_i X_j X_\ell X_m]}{n^4 \epsilon^4}.$


    Considering the term in the numerator, we observe that all the terms in the summation that are of the form $E[X_i^3]E[X_j]$, $E[X_i^2]E[X_j]E[X_\ell]$, and $E[X_i]E[X_j]E[X_\ell]E[X_m]$ (with $i, j, \ell, m$ distinct) are zero. Hence,

    $\sum_{i=1}^n \sum_{j=1}^n \sum_{\ell=1}^n \sum_{m=1}^n E[X_i X_j X_\ell X_m] = n\,E[X_1^4] + 3n(n-1)\,E[X_1^2]E[X_1^2],$

    and therefore

    $P\{|S_n| > \epsilon\} \le \frac{n\,E[X_1^4] + 3n(n-1)\,(E[X_1^2])^2}{n^4 \epsilon^4}.$

    Hence, we conclude that $\sum_{n=1}^{\infty} P\{|S_n| > \epsilon\} < \infty$, and by the Borel-Cantelli Lemma, $P\{|S_n| > \epsilon \text{ i.o.}\} = 0$.

    Hence, by (22),

    $P\left(\{\lim_{n} S_n = 0\}^c\right) = 0,$

    and the theorem is proved.
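    The almost-sure convergence proved above can be visualized numerically; the sketch below (zero-mean uniform variables, with sample sizes, tolerance, and seed chosen by us for illustration) tracks the running mean along many independent sample paths:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # i.i.d. zero-mean variables with finite fourth moment: Uniform(-1, 1) (our choice)
    x = rng.uniform(-1, 1, size=(200, 10_000))               # 200 independent sample paths
    running_mean = np.cumsum(x, axis=1) / np.arange(1, 10_001)

    # Along (essentially) every path, the running mean S_n is near 0 for large n
    final = running_mean[:, -1]
    assert np.all(np.abs(final) < 0.05)
    ```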

    Actually, the general version of the strong law of large numbers does not require any assumptions on the second and fourth moments. We will not prove this version here, but it basically states that if $E[X_1]$ exists (i.e., $E|X_1| < \infty$), then $n^{-1}\sum_{i=1}^n X_i \to E[X_1]$ almost surely.


    where the last equality follows from the law of large numbers. Hence, it follows that

    $\lim_{n} n^{-1}\sum_{i=1}^n X_i \ge \lim_{M\to\infty} E[X_1 \wedge M] = E[X_1] = \infty,$

    where the last limit follows from Lemma 2.

    The next theorem uses the Borel-Cantelli Lemma to characterize convergence in probability.

    Theorem 16 (From Billingsley). A necessary and sufficient condition for $X_n \to X$ in probability is that each subsequence $\{X_{n_k}\}$ contains a further subsequence $\{X_{n_{k(i)}}\}$ such that $X_{n_{k(i)}} \to X$ a.s. as $i \to \infty$.

    Proof. If $X_n \to X$ in probability, then given $\{n_k\}$, choose a subsequence $\{n_{k(i)}\}$ so that $k \ge k(i)$ implies $P\{|X_{n_k} - X| \ge i^{-1}\} < 2^{-i}$; therefore,

    $\sum_{i=1}^{\infty} P\{|X_{n_{k(i)}} - X| \ge i^{-1}\} < \infty,$

    and by the Borel-Cantelli Lemma, $X_{n_{k(i)}} \to X$ a.s. as $i \to \infty$. Conversely, if $X_n$ does not converge to $X$ in probability, there exist some $\epsilon > 0$ and a subsequence $\{n_k\}$ for which $P\{|X_{n_k} - X| \ge \epsilon\} > \epsilon$ for all $k$. No subsequence of $\{X_{n_k}\}$ can converge to $X$ in probability; hence, none can converge to $X$ almost surely.

    Corollary 2. If $g: \mathbb{R} \to \mathbb{R}$ is continuous and $X_n \to X$ in probability, then $g(X_n) \to g(X)$ in probability.

    Proof: Left as an exercise.
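    A quick simulation consistent with the corollary. Here $X_n = X + Z_n/\sqrt{n}$ is a construction of ours that converges to $X$ in probability, and $g = \cos$ is an arbitrary continuous function:

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    g = np.cos                          # a continuous function

    def prob_deviation(n, eps=0.1, n_mc=50_000):
        x = rng.standard_normal(n_mc)                       # samples of the limit X
        xn = x + rng.standard_normal(n_mc) / np.sqrt(n)     # X_n -> X in probability
        return (np.mean(np.abs(xn - x) > eps),
                np.mean(np.abs(g(xn) - g(x)) > eps))

    p_small, q_small = prob_deviation(10)
    p_big, q_big = prob_deviation(10_000)
    # Both deviation probabilities shrink as n grows, for X_n and g(X_n) alike
    assert p_big < p_small and q_big <= q_small
    ```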


    6. Markov Chains

    The majority of the material in this section is extracted from the book by Billingsley (Probability and Measure) and the book by Hoel et al. (Introduction to Stochastic Processes).

    Up to this point we have seen many examples and properties (e.g., convergence)

    of iid sequences. However, many sequences have correlation among their members.

    Markov chains (MCs) are a class of sequences that exhibit a simple type of correlation

    in the sequence of random variables. In words, as far as calculating probabilities

    pertaining future values of the sequence conditional on the present and the past, it

    is sufficient to know the present. In other words, as far as the future is concerned,

    knowledge of the present contains all the knowledge of the past. We next define

    discrete-space MCs formally.

    Consider a sequence of discrete random variables $X_0, X_1, X_2, \ldots$ defined on some probability space $(\Omega, \mathcal{F}, P)$. Let $S$ be a finite or countable set representing the set of all possible values that each $X_n$ may assume. Without loss of generality, assume that $S = \{0, 1, 2, \ldots\}$. The sequence is a Markov chain if

    $P\{X_{n+1} = j \mid X_0 = i_0, \ldots, X_n = i_n\} = P\{X_{n+1} = j \mid X_n = i_n\}.$

    For each pair $i$ and $j$ in $S$, we define $p_{ij} = P\{X_{n+1} = j \mid X_n = i\}$. Note that we have implicitly assumed (for now) that $P\{X_{n+1} = j \mid X_n = i\}$ does not depend upon $n$, in which case the chain is called homogeneous. Also note that the definition of $p_{ij}$ implies that $\sum_{j \in S} p_{ij} = 1$ for every $i \in S$. The numbers $p_{ij}$ are called the transition probabilities of the MC. The initial probabilities are defined as $\pi_i = P\{X_0 = i\}$. Note that the only restriction on the $\pi_i$'s is that they are nonnegative and they must add up to 1. We denote the matrix whose $(i,j)$th entry is $p_{ij}$ by $IP$, which is termed the transition matrix of the MC. Note that if $S$ is infinite, then $IP$ is a matrix with infinitely many rows and columns. The transition matrix $IP$ is called a stochastic matrix, as each row sums to one and all its elements are nonnegative. Figure 6 shows a state diagram illustrating the transition probabilities for a two-state MC.


    Figure 6. State diagram of a two-state MC, with states 0 and 1 and transition probabilities $p_{11}$, $p_{12}$, $p_{21}$, $p_{22}$ labeling the edges.
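    The definition can be exercised with a short simulation of a homogeneous two-state chain. The transition probabilities below are illustrative numbers of ours, not read off Figure 6:

    ```python
    import numpy as np

    rng = np.random.default_rng(4)

    # Transition matrix of a two-state chain; rows sum to one
    P = np.array([[0.9, 0.1],    # p00, p01
                  [0.3, 0.7]])   # p10, p11

    def simulate(P, x0, n_steps):
        x, path = x0, [x0]
        for _ in range(n_steps):
            x = rng.choice(len(P), p=P[x])   # next state drawn from row x of P
            path.append(x)
        return np.array(path)

    path = simulate(P, 0, 50_000)
    # Long-run fraction of time in state 1; for this chain the stationary
    # value is p01/(p01 + p10) = 0.1/0.4 = 0.25
    assert abs(np.mean(path == 1) - 0.25) < 0.02
    ```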

    6.1. Higher-Order Transitions. Consider $P\{X_0 = i_0, X_1 = i_1, X_2 = i_2\}$. By applying Bayes' rule repeatedly (i.e., $P(A \cap B \cap C) = P(A \mid B \cap C)\,P(B \mid C)\,P(C)$), we obtain

    $P\{X_0 = i_0, X_1 = i_1, X_2 = i_2\} = P\{X_0 = i_0\}\,P\{X_1 = i_1 \mid X_0 = i_0\}\,P\{X_2 = i_2 \mid X_0 = i_0, X_1 = i_1\} = \pi_{i_0}\, p_{i_0 i_1}\, p_{i_1 i_2}.$

    More generally, it is easy to verify that

    $P\{X_0 = i_0, X_1 = i_1, \ldots, X_m = i_m\} = \pi_{i_0}\, p_{i_0 i_1} \cdots p_{i_{m-1} i_m}$

    for any sequence $i_0, i_1, \ldots, i_m$ of states.

    Further, it is also easy to see that

    $P\{X_3 = i_3, X_2 = i_2 \mid X_1 = i_1, X_0 = i_0\} = p_{i_1 i_2}\, p_{i_2 i_3}.$

    More generally,

    (23) $P\{X_{m+\ell} = j_\ell,\ 1 \le \ell \le n \mid X_s = i_s,\ 0 \le s \le m\} = p_{i_m j_1}\, p_{j_1 j_2} \cdots p_{j_{n-1} j_n}.$

    With these preliminaries, we can define the $n$-step transition probability as

    $p^{(n)}_{ij} = P\{X_{m+n} = j \mid X_m = i\}.$

    If we now observe that

    $\{X_{m+n} = j\} = \bigcup_{k_1, \ldots, k_{n-1}} \{X_{m+n} = j,\ X_{m+1} = k_1, \ldots, X_{m+n-1} = k_{n-1}\},$

    we conclude

    $p^{(n)}_{ij} = \sum_{k_1, \ldots, k_{n-1}} p_{i k_1}\, p_{k_1 k_2} \cdots p_{k_{n-1} j}.$


    From matrix theory, we recognize $p^{(n)}_{ij}$ as the $(i,j)$th entry of $IP^n$, the $n$th power of the transition matrix $IP$. It is convenient to use the convention $p^{(0)}_{ij} = \delta_{ij}$, consistent with the fact that $IP^0$ is the identity matrix $I$.

    Finally, it is straightforward to verify that

    $p^{(m+n)}_{ij} = \sum_{v \in S} p^{(m)}_{iv}\, p^{(n)}_{vj},$

    and that $\sum_{j \in S} p^{(n)}_{ij} = 1$. This means $IP^n = IP^{\ell}\, IP^{k}$ whenever $\ell + k = n$, which is consistent with taking powers of a matrix.

    For convenience, we now define the conditional probabilities $P_i(A) = P\{A \mid X_0 = i\}$, $A \in \mathcal{F}$. With this notation and as an immediate consequence of (23), we have

    $P_i\{X_1 = i_1, \ldots, X_m = j,\ X_{m+1} = j_{m+1}, \ldots, X_{m+n} = j_{m+n}\} = P_i\{X_1 = i_1, \ldots, X_m = j\}\; P_j\{X_1 = j_{m+1}, \ldots, X_n = j_{m+n}\}.$
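    The Chapman-Kolmogorov relation $p^{(m+n)}_{ij} = \sum_v p^{(m)}_{iv} p^{(n)}_{vj}$ and the row-sum property can be checked numerically for any stochastic matrix; the matrix below is an illustrative choice of ours:

    ```python
    import numpy as np

    P = np.array([[0.9, 0.1],
                  [0.3, 0.7]])              # any stochastic matrix works here

    P5 = np.linalg.matrix_power(P, 5)
    P3 = np.linalg.matrix_power(P, 3)
    P2 = np.linalg.matrix_power(P, 2)

    assert np.allclose(P5, P3 @ P2)             # IP^(3+2) = IP^3 IP^2
    assert np.allclose(P5.sum(axis=1), 1.0)     # each row of IP^n still sums to 1
    ```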

    Example 19. A gambling problem (see related homework problem).

    The initial fortune of a gambler is $X_0 = x_o$. After playing each hand, he/she increases (decreases) the fortune by one dollar with probability $p$ ($q$). The gambler quits if either her/his fortune becomes 0 (bankruptcy) or upon reaching a goal of $L$ dollars. Let $X_n$ denote the fortune at time $n$. Note that

    (24) $P\{X_n = j \mid X_1 = i_1, \ldots, X_{n-1} = i_{n-1}\} = P\{X_n = j \mid X_{n-1} = i_{n-1}\}.$

    Hence, the sequence $X_n$ is a Markov chain with state space $S = \{0, 1, \ldots, L\}$. Also, the time-independent transition probabilities are given by

    (25) $p_{ij} = P\{X_{n+1} = j \mid X_n = i\} = \begin{cases} p, & \text{if } j = i+1,\ 1 \le i \le L-1 \\ q, & \text{if } j = i-1,\ 1 \le i \le L-1 \\ 1, & \text{if } i = j = L \text{ or } i = j = 0 \\ 0, & \text{otherwise.} \end{cases}$


    The $(L+1) \times (L+1)$ probability transition matrix $IP = ((p_{ij}))$ is therefore

    (26) $IP = \begin{pmatrix} 1 & 0 & 0 & \cdots & \cdots & \cdots & 0 \\ q & 0 & p & 0 & \cdots & \cdots & \vdots \\ 0 & q & 0 & p & 0 & \cdots & \vdots \\ \vdots & & \ddots & \ddots & \ddots & & \vdots \\ 0 & 0 & \cdots & \cdots & q & 0 & p \\ 0 & 0 & \cdots & \cdots & \cdots & 0 & 1 \end{pmatrix}.$

    Note that the sum of any row of $IP$ is 1, a characteristic of a stochastic matrix.
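    The matrix in (26) is easy to build programmatically; the sketch below (the helper name is ours) constructs it for given $L$ and $p$ and checks the stochastic-matrix property:

    ```python
    import numpy as np

    def gamblers_ruin_matrix(L, p):
        """(L+1) x (L+1) transition matrix of Eq. (26); states 0 and L are absorbing."""
        q = 1 - p
        P = np.zeros((L + 1, L + 1))
        P[0, 0] = P[L, L] = 1.0
        for i in range(1, L):
            P[i, i + 1] = p      # win a dollar
            P[i, i - 1] = q      # lose a dollar
        return P

    P = gamblers_ruin_matrix(4, 0.6)
    assert np.allclose(P.sum(axis=1), 1.0)   # stochastic: every row sums to one
    ```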

    Exercise 2. Show that $\lambda = 1$ is always an eigenvalue of any stochastic matrix $IP$.

    Let $P(x)$ be the probability of achieving the goal of $L$ dollars, where $x$ is the initial fortune. In a homework problem we have shown that

    (27) $P(x) = pP(x+1) + qP(x-1), \quad x = 1, 2, \ldots, L-1,$

    with boundary conditions $P(0) = 0$ and $P(L) = 1$. Similarly, define $Q(x)$ as the probability of going bankrupt. Then,

    (28) $Q(x) = pQ(x+1) + qQ(x-1), \quad x = 1, 2, \ldots, L-1,$

    with boundary conditions $Q(0) = 1$ and $Q(L) = 0$.

    For example, if $p = q = \frac{1}{2}$, then (see homework solutions)

    (29) $P(x) = \frac{x}{L},$

    (30) $Q(x) = 1 - \frac{x}{L}.$

    Thus, $\lim_{L\to\infty} P(x) = 0$. (What is the implication of this?)

    Exercise 3. Show that

    (31) $\lim_{L\to\infty} P(x) = 0$

    when $q \ge p$.


    Back to the Gambler's ruin problem: take $L = 4$, $p = 0.6$. Then,

    (32) $IP = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0.4 & 0 & 0.6 & 0 & 0 \\ 0 & 0.4 & 0 & 0.6 & 0 \\ 0 & 0 & 0.4 & 0 & 0.6 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$

    Also, straightforward calculation yields

    (33) $IP^2 = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0.4 & 0.24 & 0 & 0.36 & 0 \\ 0.16 & 0 & 0.48 & 0 & 0.36 \\ 0 & 0.16 & 0 & 0.24 & 0.6 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix},$

    and

    (34) $\lim_{n\to\infty} IP^n = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0.5846 & 0 & 0 & 0 & 0.4154 \\ 0.3077 & 0 & 0 & 0 & 0.6923 \\ 0.1231 & 0 & 0 & 0 & 0.8769 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$

    (This can be obtained, for example, using the Cayley-Hamilton Theorem.) Note that the first column is

    (35) $Q = \begin{pmatrix} Q(0) \\ \vdots \\ Q(4) \end{pmatrix},$

    and the last column is

    (36) $P = \begin{pmatrix} P(0) \\ \vdots \\ P(4) \end{pmatrix}.$


    ECE 541 63

    However,kAjk is precisely the event{Xn = j i.o.}. Hence, we arrive at the conclu-

    sion

    (37) Pi{Xn= j i.o.}= 0, iffjj


    steps and then revisit $j$ in two steps, and so on. More precisely, we write

    $p^{(n)}_{ij} = P_i\{X_n = j\} = \sum_{s=0}^{n-1} P_i\{X_1 \ne j, \ldots, X_{n-s-1} \ne j,\ X_{n-s} = j,\ X_n = j\}$
    $= \sum_{s=0}^{n-1} P_i\{X_1 \ne j, \ldots, X_{n-s-1} \ne j,\ X_{n-s} = j\}\; P_j\{X_s = j\}$
    $= \sum_{s=0}^{n-1} f^{(n-s)}_{ij}\, p^{(s)}_{jj}.$

    Putting $j = i$ and summing over $n = 1$ to $n = \infty$, we obtain

    $\sum_{n=1}^{\infty} p^{(n)}_{ii} = \sum_{n=1}^{\infty} \sum_{s=0}^{n-1} f^{(n-s)}_{ii}\, p^{(s)}_{ii} = \sum_{s=0}^{\infty} p^{(s)}_{ii} \sum_{n=s+1}^{\infty} f^{(n-s)}_{ii} \le f_{ii} \sum_{s=0}^{\infty} p^{(s)}_{ii}.$

    The last inequality comes from the fact that $\sum_{n=s+1}^{\infty} f^{(n-s)}_{ii} = \sum_{u=1}^{\infty} f^{(u)}_{ii} = f_{ii}$. Realizing that $p^{(0)}_{ii} = 1$, we conclude $(1 - f_{ii}) \sum_{n=1}^{\infty} p^{(n)}_{ii} \le f_{ii}$. Hence, if $f_{ii} < 1$, then $\sum_{n=1}^{\infty} p^{(n)}_{ii} \le f_{ii}/(1 - f_{ii})$, and the series $\sum_{n=1}^{\infty} p^{(n)}_{ii}$ will therefore be convergent since its partial sums are increasing and bounded.

    Exercise 4. Show that in the gambler's problem discussed earlier, states 0 and $L$ are recurrent, but states $1, \ldots, L-1$ are transient.

    6.3. Irreducible chains. A MC is said to be irreducible if for every $i, j \in S$ there exists an $n$ (generally dependent upon $i$ and $j$) such that $p^{(n)}_{ij} > 0$. In words, if the chain is irreducible then we can always go from any state to any other state in finite time with a nonzero probability. The next theorem shows that for irreducible chains, either all states are recurrent or all states are transient; there is no other possibility. Thus, we can say that an irreducible MC is either recurrent or transient.

    Theorem 19 (Adapted from Billingsley). If the Markov chain is irreducible, then one of the following two alternatives holds.

    (i) All states are recurrent, $P_i\left(\bigcap_j \{X_n = j \text{ i.o.}\}\right) = 1$ for all $i$, and $\sum_n p^{(n)}_{ij} = \infty$ for all $i$ and $j$.


    (ii) All states are transient, $P_i\left(\bigcup_j \{X_n = j \text{ i.o.}\}\right) = 0$ for all $i$, and $\sum_n p^{(n)}_{ij} < \infty$ for all $i$ and $j$.

    Proof. Since the chain is irreducible, given states $i$ and $j$ there exist $r$ and $s$ such that $p^{(r)}_{ij} > 0$ and $p^{(s)}_{ji} > 0$. Note that if a chain with $X_0 = i$ visits $j$ from $i$ in $r$ steps, then visits $j$ from $j$ in $n$ steps, and then visits $i$ from $j$ in $s$ steps, then it has visited $i$ from $i$ in $r + n + s$ steps. Therefore,

    $p^{(r+s+n)}_{ii} \ge p^{(r)}_{ij}\, p^{(n)}_{jj}\, p^{(s)}_{ji}.$

    Now if we sum both sides of the above inequality over $n$ and use the fact that $p^{(r)}_{ij} p^{(s)}_{ji} > 0$, we conclude that $\sum_n p^{(n)}_{ii} = \infty$ whenever $\sum_n p^{(n)}_{jj} = \infty$; that is, the states are all recurrent together or all transient together. Under the first alternative it can further be shown that $f_{ij} = 1$; now by (38), $P_i\{X_n = j \text{ i.o.}\} = 1$. The only item left to prove is that $\sum_n p^{(n)}_{ij} < \infty$ under the second


    alternative. Note that if $\sum_n p^{(n)}_{jj} < \infty$, then, by the first-passage decomposition $p^{(n)}_{ij} = \sum_{s=0}^{n-1} f^{(n-s)}_{ij} p^{(s)}_{jj}$, we also have $\sum_n p^{(n)}_{ij} \le f_{ij} \sum_s p^{(s)}_{jj} < \infty$.


    ECE 541 67

    Note that pi+ qi+ri= 1, q0= 0, and pd = 0 ifd


    Thus, (39) becomes

    $u(j) - u(j+1) = \frac{\gamma_j}{\sum_{k=a}^{b-1} \gamma_k}, \quad a \le j < b.$

    Let us now sum this equation over $j = i, \ldots, b-1$ and use the fact that $u(b) = 0$ to obtain

    $u(i) = \frac{\sum_{j=i}^{b-1} \gamma_j}{\sum_{j=a}^{b-1} \gamma_j}, \quad a < i < b.$

    It now follows from the definition of $u(i)$ that

    (40) $P_i(T_a < T_b) = \frac{\sum_{j=i}^{b-1} \gamma_j}{\sum_{j=a}^{b-1} \gamma_j}, \quad a \le i < b,$

    or

    $P_i(T_b < T_a) = \frac{\sum_{j=a}^{i-1} \gamma_j}{\sum_{j=a}^{b-1} \gamma_j}, \quad a \le i < b.$

    As a special case of (40),

    (41) $P_1(T_0 < T_n) = 1 - \frac{1}{\sum_{j=0}^{n-1} \gamma_j}, \quad n > 1.$
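    Equation (41) can be spot-checked by Monte Carlo in the symmetric case $p_j = q_j = 1/2$, where every $\gamma_j = 1$ and hence $P_1(T_0 < T_n) = 1 - 1/n$; the trial count and seed below are arbitrary:

    ```python
    import numpy as np

    rng = np.random.default_rng(5)

    def hit_zero_before(n, trials=20_000):
        """Monte Carlo estimate of P1(T0 < Tn) for the symmetric walk (p = q = 1/2)."""
        hits = 0
        for _ in range(trials):
            x = 1                                # start at state 1
            while 0 < x < n:
                x += rng.choice((-1, 1))         # symmetric +/- 1 step
            hits += (x == 0)                     # did we reach 0 before n?
        return hits / trials

    est = hit_zero_before(5)
    assert abs(est - (1 - 1/5)) < 0.015          # Eq. (41): 1 - 1/n with n = 5
    ```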

    Note that

    (42) $1 \le T_2 < T_3 < \cdots,$

    which implies that $\{T_0 < T_n\}$, $n > 1$, are nondecreasing events. Hence,

    (43) $\lim_{n\to\infty} P_1(T_0 < T_n) = P_1\left(\bigcup_{n=1}^{\infty} \{T_0 < T_n\}\right).$

    Equation (42) implies $T_n \ge n - 1$, and thus $T_n \to \infty$ as $n \to \infty$. Hence,

    $\bigcup_{n=1}^{\infty} \{T_0 < T_n\} = \{T_0 < \infty\}.$


    We now show that the birth and death chain is recurrent if and only if

    $\sum_{j=0}^{\infty} \gamma_j = \infty.$

    From our earlier discussion, we know that if the irreducible birth and death chain is recurrent, then $P_1(T_0 < \infty) = 1$.


    $\pi_{j-1}\, p_{j-1} + \pi_j\, r_j + \pi_{j+1}\, q_{j+1} = \pi_j, \quad j \ge 1.$

    Using $p_j + q_j + r_j = 1$, we obtain

    $q_1 \pi_1 - p_0 \pi_0 = 0,$
    $q_{j+1} \pi_{j+1} - p_j \pi_j = q_j \pi_j - p_{j-1} \pi_{j-1}, \quad j \ge 1.$

    By induction, we obtain

    $q_{j+1} \pi_{j+1} - p_j \pi_j = 0, \quad j \ge 0,$

    and hence

    $\pi_{j+1} = \frac{p_j}{q_{j+1}}\, \pi_j, \quad j \ge 0.$

    Consequently, we obtain

    (45) $\pi_i = \frac{p_0 \cdots p_{i-1}}{q_1 \cdots q_i}\, \pi_0, \quad i \ge 1.$

    If we define

    $\xi_i = \begin{cases} 1, & i = 0, \\ \dfrac{p_0 \cdots p_{i-1}}{q_1 \cdots q_i}, & i \ge 1, \end{cases}$

    we can recast (45) as $\pi_i = \xi_i \pi_0$, $i \ge 0$. Realizing that the $\pi_i$ must sum to one, a stationary distribution exists if and only if $\sum_i \xi_i < \infty$, in which case $\pi_0 = \left(\sum_i \xi_i\right)^{-1}$.
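    A sketch verifying the stationary-distribution formula (45) for an assumed finite birth and death chain with constant $p_j = p$ and $q_j = q$ (the state-space size and the values of $p$, $q$ are illustrative; the boundary conventions $q_0 = 0$ and $p_N = 0$ are enforced in the construction):

    ```python
    import numpy as np

    N, p, q = 20, 0.3, 0.4
    xi = np.array([(p / q) ** i for i in range(N + 1)])  # xi_i = (p0...p_{i-1})/(q1...q_i)
    pi = xi / xi.sum()                                   # pi_i = xi_i * pi_0, normalized

    # Build the full transition matrix and verify stationarity: pi P = pi
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        if i < N: P[i, i + 1] = p
        if i > 0: P[i, i - 1] = q
        P[i, i] = 1 - P[i].sum()         # r_i, with q_0 = 0 and p_N = 0 at the ends
    assert np.allclose(pi @ P, pi)
    ```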


    7. Properties of Covariance Matrices and Data Whitening

    7.1. Preliminaries. For a random vector $X = [X_1 \ldots X_n]^T$, the covariance matrix $C_X$ is the $n \times n$ matrix $C_X = E[(X - E[X])(X - E[X])^T]$. This matrix has the following properties:

    (1) $C_X$ is symmetric, which is clear from the definition.

    (2) There is a matrix $\Phi$ such that $\Phi^T C_X \Phi = \Lambda$, where $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_1, \ldots, \lambda_n$ of the matrix $C_X$ (assume that they are distinct). The corresponding eigenvectors $v_1, \ldots, v_n$ form the columns of the matrix $\Phi$: $\Phi = [v_1 | v_2 | \cdots | v_n]$. This fact comes from elementary linear algebra. Because $C_X$ is symmetric, the eigenvectors corresponding to distinct eigenvalues are orthogonal (prove this). Moreover, $\Phi$ can be selected so that the eigenvectors are orthonormal; that is, $v_i^T v_j = \delta_{i,j}$.

    (3) The fact that the $v_i$'s are orthonormal implies that they span $\mathbb{R}^n$, and we also obtain the following representation for $X$:

    $X = \sum_{i=1}^{n} c_i v_i,$

    with $c_i = X^T v_i$.

    (4) $C_X$ is positive semi-definite, which means $x^T C_X x \ge 0$ for any $x \in \mathbb{R}^n$. To see this property, note that $x^T C_X x = x^T E[(X - E[X])(X - E[X])^T] x = E[\{x^T(X - E[X])\}^2] \ge 0$. This calculation also shows that as long as no component of $X$ can be written as a linear combination of the other components, then $x^T C_X x > 0$ for every $x \ne 0$, which is the defining property of positive definite matrices. (Recall that for any random variable $Y$, $E[Y^2] = 0$ if and only if $Y = 0$ almost surely.)

    In the homework, we looked at an estimate of $C_X$ from $K$ samples of the random vector $X$ as follows:

    $\hat{C}_X = \frac{1}{K-1} \sum_{i=1}^{K} X^{(i)} X^{(i)T}$

    (assuming, without loss of generality, that $E[X] = 0$), which has rank at most $K$ (see homework problems). Therefore, if $n > K$, then $\hat{C}_X$ is not full rank and hence $\hat{C}_X$ is not positive definite.
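    Both the eigendecomposition properties and the rank deficiency of the sample covariance can be demonstrated with numpy; the matrix and the sizes $n$, $K$ below are illustrative choices of ours:

    ```python
    import numpy as np

    rng = np.random.default_rng(6)

    # A covariance matrix with distinct eigenvalues
    C = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    lam, Phi = np.linalg.eigh(C)             # eigenvalues and orthonormal eigenvectors
    assert np.allclose(Phi.T @ C @ Phi, np.diag(lam))   # Phi^T C Phi = Lambda
    assert np.all(lam >= 0)                             # positive semi-definite

    # Sample covariance from K samples of an n-dimensional vector: rank at most K,
    # hence not positive definite when n > K
    n, K = 5, 3
    X = rng.standard_normal((n, K))          # K zero-mean samples as columns
    C_hat = (X @ X.T) / (K - 1)
    assert np.linalg.matrix_rank(C_hat) <= K
    ```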


    ECE 541 73

    To begin, we may want to try A again. We know thatA whitens X. Put T =

    AZ and note that CT = ACZA

    T

    = 1/2

    T

    CZ1/2

    , which is not necessarilya diagonal matrix (show this by example). Note, however, that if W is a matrix

    consisting of the eigenvectors of CT, corresponding to the eigenvalues 1, . . . , n,

    then WTCTW = M, where M is the diagonal matrix whose diagonal elements are

    1, . . . , n.

    Consider = WTA as a candidate for the desired transformation. We need to

    check if this new transformation diogonalizes both CX and CZ. To do so, consider

    D= X and note thatCD=WTACXA

    TW = WTIW =WTW =I (because W

    is normal matrix, i.e., W1 =WT).

    Next, letF = Z and writeCF= CZT =WTACZA

    TW. But,ACZAT =CT.

    Thus, CF=WTGW= M.

    7.3.1. Summary. Let $C_X$ and $C_Z$ be given. Let $B$ be as before and let $W$ be the matrix whose $i$th column is the eigenvector of $B^{-1} C_Z (B^{-1})^T$ corresponding to the $i$th eigenvalue. Let $\Psi = W^T B^{-1}$; then $Y = \Psi X$ and $T = \Psi Z$ have the following properties:

    (1) $C_Y = I$,

    (2) $C_T = M$, where

    (48) $M = \begin{pmatrix} \mu_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \mu_n \end{pmatrix},$

    and the $\mu_i$'s are the eigenvalues of $B^{-1} C_Z (B^{-1})^T$.

    Example 20. Let

    (49) $C_X = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}; \quad C_Z = \begin{pmatrix} 3 & -1 \\ -1 & 3 \end{pmatrix}.$

    Find the transformation $\Psi$.

    We first calculate $\Phi$. Set $|C_X - \lambda_i I| = 0$. We find that $\lambda_1 = 1$ and $\lambda_2 = 3$. Next, we find $v_1$ and $v_2$ using $C_X v_1 = \lambda_1 v_1$ and $C_X v_2 = \lambda_2 v_2$. Normalize these so


    that $\|v_1\| = \|v_2\| = 1$ and obtain

    (50) $v_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}; \quad v_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$

    We now form

    (51) $B^{-1} = \Lambda^{-1/2} \Phi^T = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \end{pmatrix}.$

    Next,

    (52) $B^{-1} C_Z (B^{-1})^T = \frac{1}{2} \begin{pmatrix} 8 & 0 \\ 0 & \frac{4}{3} \end{pmatrix} = \begin{pmatrix} 4 & 0 \\ 0 & \frac{2}{3} \end{pmatrix},$

    and its eigenvalues are $\mu_1 = 4$ and $\mu_2 = \frac{2}{3}$, with corresponding eigenvectors $[1, 0]^T$ and $[0, 1]^T$. Hence, $W = I$ and $\Psi = W^T B^{-1} = B^{-1}$.
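    Example 20 can be verified numerically. One caveat of the sketch below: numpy's `eigh` orders eigenvalues in ascending order, so the diagonal of $M$ may appear permuted relative to the hand calculation:

    ```python
    import numpy as np

    CX = np.array([[2.0, 1.0], [1.0, 2.0]])
    CZ = np.array([[3.0, -1.0], [-1.0, 3.0]])

    # Whitening matrix B^{-1} = Lambda^{-1/2} Phi^T from the eigen-pairs of CX
    lam, Phi = np.linalg.eigh(CX)                 # lam = [1, 3]
    Binv = np.diag(lam ** -0.5) @ Phi.T

    T = Binv @ CZ @ Binv.T                        # equals diag(4, 2/3) up to ordering
    mu, W = np.linalg.eigh(T)
    Psi = W.T @ Binv                              # the joint transformation

    assert np.allclose(Psi @ CX @ Psi.T, np.eye(2))      # C_Y = I
    assert np.allclose(Psi @ CZ @ Psi.T, np.diag(mu))    # C_T = M
    assert np.allclose(sorted(mu), [2/3, 4.0])           # eigenvalues 4 and 2/3
    ```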


    8. Continuous-time Wiener Filtering

    Suppose that we have the observation model

    $X(t) = S(t) + N(t),$

    where $S(t)$ and $N(t)$ are jointly wide-sense stationary random processes. We desire to find a physically realizable (causal) filter with impulse response $h(t)$ that minimizes the mean-square error (MSE)

    $E[\epsilon^2] = E[(S(t+\alpha) - \hat{S}(t))^2],$

    where $\hat{S}(t) = h(t) * X(t)$ is the filter output. If $\alpha = 0$, this problem is called the Wiener filtering problem; if $\alpha > 0$, it is called the Wiener prediction problem; and if $\alpha < 0$, it is called the Wiener smoothing problem.


    filter $h_r = h_0 + r g$, where $h_0$ is the optimal causal filter, $g$ is an arbitrary perturbation, and $r \in \mathbb{R}$. Thus, we can pursue this differentiation approach and write

    $E[\epsilon^2(r)] = R_{SS}(0) - 2\int_0^{\infty} [h_0(\tau) + r g(\tau)]\, R_{XS}(\tau + \alpha)\, d\tau + \int_0^{\infty}\!\!\int_0^{\infty} [h_0(\tau) + r g(\tau)][h_0(\sigma) + r g(\sigma)]\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma$

    $= R_{SS}(0) - 2\int_0^{\infty} h_0(\tau) R_{XS}(\tau + \alpha)\, d\tau - 2r\int_0^{\infty} g(\tau) R_{XS}(\tau + \alpha)\, d\tau$
    $\quad + \int_0^{\infty}\!\!\int_0^{\infty} r[g(\tau) h_0(\sigma) + g(\sigma) h_0(\tau)]\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma$
    $\quad + \int_0^{\infty}\!\!\int_0^{\infty} h_0(\tau) h_0(\sigma)\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma + \int_0^{\infty}\!\!\int_0^{\infty} r^2 g(\tau) g(\sigma)\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma.$

    Note that $R_{SS}(0)$ does not depend on $r$. Now take the partial derivative with respect to $r$ to obtain

    $\frac{\partial}{\partial r} E[\epsilon^2(r)] = -2\int_0^{\infty} R_{XS}(\tau + \alpha) g(\tau)\, d\tau + \int_0^{\infty}\!\!\int_0^{\infty} [g(\tau) h_0(\sigma) + g(\sigma) h_0(\tau)]\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma + 2r \int_0^{\infty}\!\!\int_0^{\infty} g(\tau) g(\sigma)\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma.$

    Setting $\frac{\partial}{\partial r} E[\epsilon^2(r)]\big|_{r=0} = 0$, we obtain

    $-2\int_0^{\infty} R_{XS}(\tau + \alpha) g(\tau)\, d\tau + \int_0^{\infty}\!\!\int_0^{\infty} [g(\tau) h_0(\sigma) + g(\sigma) h_0(\tau)]\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma = 0.$

    Using the fact that $R_{XX}(\cdot)$ is symmetric, i.e.,

    $\int_0^{\infty}\!\!\int_0^{\infty} g(\tau) h_0(\sigma)\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma = \int_0^{\infty}\!\!\int_0^{\infty} g(\sigma) h_0(\tau)\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma,$

    we get

    $\int_0^{\infty} g(\tau) \left[ -R_{XS}(\tau + \alpha) + \int_0^{\infty} h_0(\sigma)\, R_{XX}(\tau - \sigma)\, d\sigma \right] d\tau = 0.$

    Since this is true for all $g$, the only way this can hold is that the multiplier of $g$ inside the outer integral must be zero for almost all $\tau \ge 0$ (with respect to Lebesgue measure).

    Thus, we get the so-called Wiener Integral Equation:

    $R_{XS}(\tau + \alpha) = \int_0^{\infty} R_{XX}(\tau - \sigma)\, h_0(\sigma)\, d\sigma, \quad \tau \ge 0.$


    8.1. Non-causal Wiener Filtering. If we replace $\int_0^{\infty}$ with $\int_{-\infty}^{\infty}$ in Wiener's equation, i.e.,

    $R_{XS}(\tau + \alpha) = \int_{-\infty}^{\infty} R_{XX}(\tau - \sigma)\, h_0(\sigma)\, d\sigma, \quad \tau \in \mathbb{R},$

    we obtain the characterization of the optimal non-causal (non-realizable) filter. In the frequency domain, we have

    $S_{XS}(\omega)\, e^{j\omega\alpha} = S_{XX}(\omega)\, H_0(\omega),$

    and hence

    $H_0(\omega) = \frac{S_{XS}(\omega)}{S_{XX}(\omega)}\, e^{j\omega\alpha},$

    which is the optimum non-realizable filter.

    Suppose the signal and noise are uncorrelated (with the noise zero mean, so the cross terms vanish); then

    $R_{XX}(\tau) = E[X(t)X(t+\tau)] = E[(S(t) + N(t))(S(t+\tau) + N(t+\tau))] = R_{SS}(\tau) + R_{NN}(\tau).$

    Similarly, $R_{XS}(\tau) = R_{SS}(\tau)$. Then,

    $H_0(\omega) = \frac{S_{SS}(\omega)}{S_{SS}(\omega) + S_{NN}(\omega)}\, e^{j\omega\alpha}.$

    Example: Suppose that $N(t) \equiv 0$ and $\alpha > 0$ (the ideal prediction problem). Then $H_0(\omega) = e^{j\omega\alpha} \cdot [1]$, or $h_0(t) = \delta(t + \alpha)$, which is the trivial non-realizable predictor.

    Before we attempt to solve the Wiener integral equation for the realizable optimal filter, we will review germane aspects of spectral factorization.

    8.2. Review of Spectral Factorization. Note that we can approximate any power spectrum as that of white noise passed through a linear time-invariant (LTI) filter (why?). Suppose the transfer function of the filter is

    $H(\omega) = \frac{(j\omega)^n + a_{n-1}(j\omega)^{n-1} + \cdots + a_1 (j\omega) + a_0}{(j\omega)^d + b_{d-1}(j\omega)^{d-1} + \cdots + b_1 (j\omega) + b_0}.$

    If the input is white noise with power spectrum $\frac{N_0}{2}$, the output power spectrum is $\frac{N_0}{2}|H(\omega)|^2$. Note that we can write

    $H(\omega) = \frac{A(\omega^2) + j\omega B(\omega^2)}{C(\omega^2) + j\omega D(\omega^2)},$
