
  • 8/10/2019 ECE541_Lectures_9_2009

    1/81

ECE 541 - NOTES ON PROBABILITY THEORY AND STOCHASTIC PROCESSES

    MAJEED M. HAYAT

These notes are made available to students of the course ECE-541 at the University of New Mexico. They cannot and will not be used for any type of profitable reproduction or sale. Parts of the materials have been extracted from texts and notes by other authors. These include the text of Probability Theory by Y. Chow and H. Teicher, the notes on stochastic analysis by T. G. Kurtz, the text of Real and Complex Analysis by W. Rudin, the notes on measure and integration by A. Beck, the text of Probability and Measure by P. Billingsley, the text of Introduction to Stochastic Processes by P. G. Hoel, S. C. Port, and C. J. Stone, and possibly other sources.

In compiling these notes I tried, as much as possible, to make explicit reference to these sources. If any omission is found, it is due to my oversight, for which I seek the authors' forgiveness. I am truly indebted to the outstanding scholars and authors mentioned above, from whom I obtained most of the materials; I am truly privileged to have been taught by some of them. I would also like to thank Dr. Bradley Ratliff for helping in the tedious task of typing these notes.

    This material is intended as a graduate-level treatment of probability and stochastic

    processes. It requires basic undergraduate knowledge of probability, random variables,

probability distributions and density functions, and moments. A course like ECE 340 will cover these topics at an undergraduate level. The material also requires some knowledge of elementary analysis: concepts such as limits and continuity, basic set theory, some basic topology, Fourier and Laplace transforms, and elementary linear systems theory.

Date: November 4, 2009.


    1. Fundamental concepts

    1.1. Experiments. The most fundamental component in probability theory is the

    notion of a physical (or sometimes imaginary) experiment, whose outcome is

    revealed when the experiment is completed. Probability theory aims to provide the

    tools that will enable us to assess the likelihood of an outcome, or more generally, the

    likelihood of a collection of outcomes. Let us consider the following example:

Example 1. Shooting a dart: Consider shooting a single dart at a target (board) represented by the closed unit disc, D, which is centered at the point (0, 0). We write D = {(x, y) ∈ IR² : x² + y² ≤ 1}. Here, IR denotes the set of real numbers (same as (−∞, ∞)), and IR² is the set of all points in the plane (IR² = IR × IR, where × denotes the Cartesian product of sets). We read the above description of D as the set of all points (x, y) in IR² such that (or with the property) x² + y² ≤ 1.

1.2. Outcomes and the Sample Space. Now we define what we mean by an outcome: An outcome can be missing the target, or "missed" for short, in which case the dart misses the board entirely, or it can be the dart's location in the scenario that it hits the board. Note that we have implicitly decided (or chosen) that we do not care where the dart lands whenever it misses the board. (The definition of an outcome is totally arbitrary and therefore it is not unique for any experiment. It depends on whatever makes sense to us.) Mathematically, we form what is called the sample space as the set containing all possible outcomes. If we call this set Ω, then for our dart example, Ω = {missed} ∪ D, where the symbol ∪ denotes set union (we say x ∈ A ∪ B if and only if x ∈ A or x ∈ B). We write ω to denote a specific outcome from the sample space Ω. For example, we may have ω = missed, ω = (0, 0), and ω = (0.1, 0.2); however, according to our definition of an outcome, ω cannot be (1, 1).

1.3. Events. An event is a collection of outcomes, that is, a subset of Ω. Such a subset can be associated with a question that we may ask about the outcome of the experiment and whose answer can be determined after the outcome is revealed. For example, the question Q1: "Did the dart land within 0.5 of the bull's eye?" can be


associated with the subset of Ω (or event) given by E = {(x, y) ∈ IR² : x² + y² ≤ 1/4}. We would like to call the set E an event. Now consider the complement of Q1, that is, Q2: "Did the dart not land within 0.5 of the bull's eye?", with which we can associate the event Eᶜ, where the superscript c denotes set complementation (relative to Ω). Namely, Eᶜ = Ω \ E, which is the set of outcomes (or points or members) of Ω that are not already in E. The notation \ represents set difference (or subtraction). In general, for any two sets A and B, A \ B is the set of all elements that are in A but not in B. Note that Eᶜ = {(x, y) ∈ D : x² + y² > 1/4} ∪ {missed}. The point here is that if E is an event, then we would want Eᶜ to qualify as an event as well, since we would like to be able to ask the logical negative of any question. In addition, we would also like to be able to form a logical "or" of any two questions about the experiment outcome. Thus, if E₁ and E₂ are events, we would like their union to also be an event.

Here is another illustrative example: for each n = 1, 2, . . ., define the subset Eₙ = {(x, y) ∈ IR² : x² + y² ≤ 1 − 1/n} and let ⋃_{n=1}^∞ Eₙ be their countable union. (Notation: we say ω ∈ ⋃_{n=1}^∞ Eₙ if and only if ω ∈ Eₙ for some n. Similarly, we say ω ∈ ⋂_{n=1}^∞ Eₙ if and only if ω ∈ Eₙ for all n.) It is not hard to see (prove it) that ⋃_{n=1}^∞ Eₙ = E = {(x, y) ∈ IR² : x² + y² < 1}, which corresponds to the valid question "did the dart land inside the board?" Thus, we would want E (which is the countable union of events) to be an event as well.

Example 2. For each n ≥ 1, take Aₙ = (−1/n, 1/n). Then, ⋂_{n=1}^∞ Aₙ = {0} and ⋃_{n=1}^∞ Aₙ = (−1, 1). Prove these.
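These two limit sets can be sanity-checked numerically (a quick sketch, not a proof): the Aₙ are nested and decreasing, so the intersection of the first N of them is just A_N = (−1/N, 1/N), which shrinks toward {0}, while their union is already A₁ = (−1, 1).

```python
def in_A(x, n):
    """Membership test for A_n = (-1/n, 1/n)."""
    return -1.0 / n < x < 1.0 / n

def in_intersection(x, N):
    """x belongs to the intersection of A_1, ..., A_N."""
    return all(in_A(x, n) for n in range(1, N + 1))

def in_union(x, N):
    """x belongs to the union of A_1, ..., A_N."""
    return any(in_A(x, n) for n in range(1, N + 1))

# 0 stays in every A_n; any fixed x != 0 eventually drops out.
assert in_intersection(0.0, 10**5)
assert not in_intersection(0.001, 10**5)    # fails once n reaches 1000
# The union over any N is A_1 = (-1, 1).
assert in_union(0.999, 5) and not in_union(1.0, 5)
```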

Finally, we should be able to ask whether or not the experiment was conducted; that is, we would like to label the sample space Ω as an event as well. With this (hopefully) motivating introduction, we proceed to formally define what we mean by events.

Definition 1. A collection F of subsets of Ω is called a σ-algebra (read as "sigma-algebra") if:

(1) Ω ∈ F;
(2) if A ∈ F, then Aᶜ ∈ F;
(3) if A₁, A₂, . . . ∈ F, then ⋃_{n=1}^∞ Aₙ ∈ F.


requirements for a σ-algebra, but what we mean by "valid" is really up to us, as long as the self-consistency rules defining the collection of events are met.

Generation of σ-algebras: Let M be a collection of events (not necessarily a σ-algebra); that is, M ⊆ F. This collection could be a collection of certain events of interest. For example, in the dart experiment we may define M = {{missed}, {(x, y) ∈ IR² : 1/4 ≤ x² + y² ≤ 1/2}}. The main question now is the following: Can we construct a minimal (or smallest) σ-algebra that contains M? If such a σ-algebra exists, call it F_M, then it must possess the following property: if D is another σ-algebra containing M, then necessarily F_M ⊆ D. Hence F_M is minimal. The following theorem states that there is such a minimal σ-algebra.

Theorem 1. Let M be a collection of events; then there is a minimal σ-algebra containing M.

    Before we prove the Theorem let us look at an example.

Example 5. Suppose that Ω = (−∞, ∞), and M = {(−∞, 1), (0, ∞)}. It is easy to check that the minimal σ-algebra containing M is

F_M = {∅, Ω, (−∞, 1), (0, ∞), (0, 1), (−∞, 0], [1, ∞), (−∞, 0] ∪ [1, ∞)}.

Explain where each member is coming from.
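On a finite sample space the minimal σ-algebra of Theorem 1 can also be found by brute force: keep closing M under complements and unions until nothing new appears. The sketch below is my own illustration, not part of the notes; it uses a three-point Ω = {a, b, c} whose atoms play the roles of (−∞, 0], (0, 1), and [1, ∞) in Example 5, with generators {a, b} and {b, c} standing in for (−∞, 1) and (0, ∞).

```python
def generate_sigma_algebra(omega, generators):
    """Brute-force the minimal sigma-algebra on a finite omega containing
    the generator sets: repeatedly close under complement and (pairwise,
    hence finite) union until the collection stabilizes."""
    omega = frozenset(omega)
    F = {frozenset(), omega} | {frozenset(g) for g in generators}
    while True:
        new = set(F)
        new |= {omega - A for A in F}            # complements
        new |= {A | B for A in F for B in F}     # unions
        if new == F:
            return F
        F = new

# Finite stand-in for Example 5: a <-> (-inf,0], b <-> (0,1), c <-> [1,inf).
F = generate_sigma_algebra({'a', 'b', 'c'}, [{'a', 'b'}, {'b', 'c'}])
print(len(F))  # 8 members, matching the eight sets listed in Example 5
```

Each of the eight subsets of {a, b, c} appears, mirroring how (0, 1) arises as an intersection and (−∞, 0] ∪ [1, ∞) as its complement.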

Proof of Theorem: Let K_M be the collection of all σ-algebras that contain M. We observe that such a collection is not empty, since at the least it contains the power set 2^Ω. Let us label each member of K_M by an index α, namely D_α, where α ∈ I and I is some index set. Define F_M = ⋂_{α∈I} D_α. We need to show 1) that F_M is a σ-algebra containing M, and 2) that F_M is a minimal σ-algebra. Note that each D_α contains Ω (since each D_α is a σ-algebra); thus, F_M contains Ω. Next, if A ∈ F_M, then A ∈ D_α for each α ∈ I. Thus, Aᶜ ∈ D_α for each α ∈ I (since each D_α is a σ-algebra), which implies that Aᶜ ∈ ⋂_{α∈I} D_α. Next, suppose that A₁, A₂, . . . ∈ F_M. Then, we know that A₁, A₂, . . . ∈ D_α for each α ∈ I. Moreover, ⋃_{n=1}^∞ Aₙ ∈ D_α,


for each α ∈ I (again, since each D_α is a σ-algebra), and thus ⋃_{n=1}^∞ Aₙ ∈ ⋂_{α∈I} D_α. This completes the proof that F_M is a σ-algebra. Now suppose that F′_M is another σ-algebra containing M; we will show that F_M ⊆ F′_M. First note that since K_M is the collection of all σ-algebras containing M, it must be true that F′_M = D_α₀ for some α₀ ∈ I. Now if A ∈ F_M, then necessarily A ∈ D_α₀ (since D_α₀ is one of the members of the intersection that defines F_M), and consequently A ∈ F′_M, since F′_M = D_α₀. This establishes that F_M ⊆ F′_M and completes the proof of the theorem.

Example 6. Let U be the collection of all open sets in IR. Then, according to the above theorem, there exists a minimal σ-algebra containing U. This σ-algebra is called the Borel σ-algebra, B, and its elements are called the Borel subsets of IR. Note that, by virtue of set complementation, union and intersection, B contains all closed sets, half-open intervals, their countable unions, intersections, and so on.

(Reminder: A subset U of IR is called open if for every x ∈ U there exists an open interval centered at x which lies entirely in U. Closed sets are defined as complements of open sets. These definitions extend to IRⁿ in a straightforward manner.)

Restrictions of σ-algebras: Let F be a σ-algebra. For any measurable set U, we define F ∩ U as the collection of all intersections between U and the members of F; that is, F ∩ U = {V ∩ U : V ∈ F}. It is easy to show that F ∩ U is also a σ-algebra, which is called the restriction of F to U.

Example 7. Back to the dart experiment: What is a reasonable σ-algebra for this experiment? I would say any such σ-algebra should contain all the Borel subsets of D and the {missed} event. So we can take M = {{missed}} ∪ (B ∩ D). It is easy to check that in this case

F_M = ((B ∩ D) ∪ {missed}) ∪ (B ∩ D),    (1)

where for any σ-algebra F and any measurable set U, we define F ∪ U as the collection of all unions between U and the members of F; that is, F ∪ U = {V ∪ U : V ∈ F}. Note that F ∪ U is not always a σ-algebra (contrary to F ∩ U), but in this example it is, because {missed} is the complement of D.


We emphasize again that Y⁻¹((−∞, r)) is always an event; that is, Y⁻¹((−∞, r)) ∈ F_M, where F_M was defined earlier for this experiment in (1). Again, this is a direct consequence of the way we defined the function Y.

Motivated by these examples, we proceed to define what we mean by a random variable in general.

Definition 4. Let (Ω, F) be a measurable space. A transformation X : Ω → IR is said to be F-measurable if for every r ∈ IR, X⁻¹((−∞, r)) ∈ F. Any measurable X is called a random variable.

Now let X be a random variable and consider the collection of events M = {{ω : X(ω) ∈ (−∞, r)}, r ∈ IR}, which can also be written more conveniently as {X⁻¹((−∞, r)), r ∈ IR}. As before, let F_M be the minimal σ-algebra containing M. Then, F_M is the information that the random variable X conveys about the experiment. We denote such a σ-algebra from this point on as σ(X), the σ-algebra generated by the random variable X. In the above example of X, σ(X) = {∅, {missed}, Ω, D}. This concept is demonstrated in the next example. (Also see HW#1 for another example of σ(X).)

Example 8. Back to the dart example: Let M = {∅, {missed}, Ω}; then it is easy to check that F_M = {∅, {missed}, Ω, D}, which also happens to be a subset of F_M as defined in (1). Moreover, we observe that in fact X⁻¹((−∞, r)) ∈ F_M for any r ∈ IR. Intuitively, F_M, which we also call σ(X), can be identified as the set of all events that the function X can convey about the experiment. In particular, F_M consists of precisely those events whose occurrence or nonoccurrence can be determined through our observation of the value of X. In other words, F_M is all the information that the mapping X can provide about the outcome of the experiment. Note that F_M is much smaller than the original σ-algebra F, which was ((B ∩ D) ∪ {missed}) ∪ (B ∩ D). As we have seen before, {∅, {missed}, Ω, D} ⊆ ((B ∩ D) ∪ {missed}) ∪ (B ∩ D). Thus, X can only partially inform us of the true outcome of the experiment; it can only tell us if we hit the target or missed it, nothing else, which is precisely the information contained in F_M.


Facts about Measurable Transformations:

Theorem 2. Let (Ω, F) be a measurable space. The following statements are equivalent:

(1) X is a measurable transformation.
(2) X⁻¹((−∞, r]) ∈ F for all r ∈ IR.
(3) X⁻¹((r, ∞)) ∈ F for all r ∈ IR.
(4) X⁻¹([r, ∞)) ∈ F for all r ∈ IR.
(5) X⁻¹((a, b)) ∈ F for all a < b.
(6) X⁻¹([a, b)) ∈ F for all a < b.
(7) X⁻¹([a, b]) ∈ F for all a < b.
(8) X⁻¹((a, b]) ∈ F for all a < b.
(9) X⁻¹(B) ∈ F for all B ∈ B.

The equivalence among (1), (2) and (3) is left as an exercise (see HW1). The equivalence between (3) and (4) through (8) can be shown using the same technique used in HW1. To show that (1) implies (9) requires additional work and will not be shown here. In fact, (9) is the most powerful characterization of a random variable. Using (9), we can equivalently define σ(X) as {X⁻¹(B) : B ∈ B}, which can be directly shown to be a σ-algebra (see HW1).
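On a finite measurable space, condition (2) of Theorem 2 can be checked mechanically: the preimage X⁻¹((−∞, r]) only changes at the finitely many values X takes, so it suffices to test those thresholds. The toy space below is a hypothetical stand-in for the dart experiment (one "miss" point and three board points), not something from the notes.

```python
def preimage_le(X, omega, r):
    """X^{-1}((-inf, r]) as a frozenset, for X given as a dict on omega."""
    return frozenset(w for w in omega if X[w] <= r)

def is_measurable(X, omega, F):
    # On a finite space it suffices to test r at the values X takes:
    # between consecutive values the preimage does not change.
    return all(preimage_le(X, omega, r) in F for r in set(X.values()))

omega = {'miss', 'p1', 'p2', 'p3'}
# F: the sigma-algebra generated by the partition {miss} vs. the board.
F = {frozenset(), frozenset({'miss'}),
     frozenset({'p1', 'p2', 'p3'}), frozenset(omega)}

X = {'miss': 1.0, 'p1': 0.0, 'p2': 0.0, 'p3': 0.0}   # indicator of a miss
Y = {'miss': 1.0, 'p1': 0.0, 'p2': 0.5, 'p3': 0.0}   # "sees" p2 separately

print(is_measurable(X, omega, F))  # True: X only separates miss vs. hit
print(is_measurable(Y, omega, F))  # False: {p1, p3} is not in F
```

This mirrors the discussion of Example 8: a function is measurable with respect to F exactly when every question it can answer is already an event in F.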


1.5. Probability Measure.

Definition 5. Consider the measurable space (Ω, F), and a function P that maps F into IR. P is called a probability measure if:

(1) P(E) ≥ 0, for any E ∈ F.
(2) P(Ω) = 1.
(3) If E₁, E₂, . . . ∈ F and Eᵢ ∩ Eⱼ = ∅ when i ≠ j, then P(⋃_{n=1}^∞ Eₙ) = Σ_{n=1}^∞ P(Eₙ).

    The following properties follow directly from the above definition.

Property 1. P(∅) = 0.

Proof. Put E₁ = Ω and E₂ = E₃ = . . . = ∅ in (3) and use (2) to get 1 = P(Ω) = P(Ω ∪ ∅ ∪ ∅ ∪ . . .) = P(Ω) + Σ_{n=2}^∞ P(∅) = 1 + Σ_{n=2}^∞ P(∅). Thus, Σ_{n=2}^∞ P(∅) = 0, which implies that P(∅) = 0 since P(∅) ≥ 0 according to (1).

Property 2. If E₁, E₂, . . . , Eₙ ∈ F and Eᵢ ∩ Eⱼ = ∅ when i ≠ j, then P(⋃_{i=1}^n Eᵢ) = Σ_{i=1}^n P(Eᵢ).

Proof. Put Eₙ₊₁ = Eₙ₊₂ = . . . = ∅ and the result follows from (3), since P(∅) = 0 (from Property 1).

Property 3. If E₁, E₂ ∈ F and E₁ ⊆ E₂, then P(E₁) ≤ P(E₂).

Proof. Note that E₁ ∪ (E₂ \ E₁) = E₂ and E₁ ∩ (E₂ \ E₁) = ∅. Thus, by Property 2 (use n = 2), P(E₂) = P(E₁) + P(E₂ \ E₁) ≥ P(E₁), since P(E₂ \ E₁) ≥ 0.

Property 4. If A₁ ⊆ A₂ ⊆ A₃ ⊆ . . . ∈ F, then

(3)    lim_{n→∞} P(Aₙ) = P(⋃_{n=1}^∞ Aₙ).

Proof. Put B₁ = A₁, B₂ = A₂ \ A₁, . . . , Bₙ = Aₙ \ Aₙ₋₁, . . .. Then it is easy to check that ⋃_{n=1}^∞ Aₙ = ⋃_{n=1}^∞ Bₙ and ⋃_{n=1}^m Aₙ = ⋃_{n=1}^m Bₙ for any m ≥ 1, and that Bᵢ ∩ Bⱼ = ∅ when i ≠ j. Hence, P(⋃_{i=1}^n Aᵢ) = Σ_{i=1}^n P(Bᵢ) and P(⋃_{i=1}^∞ Aᵢ) = Σ_{i=1}^∞ P(Bᵢ).


But ⋃_{i=1}^n Aᵢ = Aₙ, so that P(Aₙ) = Σ_{i=1}^n P(Bᵢ). Now since Σ_{i=1}^n P(Bᵢ) converges to Σ_{i=1}^∞ P(Bᵢ), we conclude that P(Aₙ) converges to Σ_{i=1}^∞ P(Bᵢ), which is equal to P(⋃_{i=1}^∞ Aᵢ).

Property 5. If A₁ ⊇ A₂ ⊇ A₃ ⊇ . . ., then lim_{n→∞} P(Aₙ) = P(⋂_{n=1}^∞ Aₙ).

Proof. See HW.

Property 6. For any A ∈ F, 0 ≤ P(A) ≤ 1.

The triplet (Ω, F, P) is called a probability space.

Example 9. Recall the dart experiment. We now define P on (Ω, F_M) as follows. Assign P({missed}) = 0.5, and for any A ∈ B ∩ D, assign P(A) = area(A)/(2π). It is easy to check that P defines a probability measure. (For example, P(Ω) = P(D ∪ {missed}) = P(D) + P({missed}) = area(D)/(2π) + 0.5 = 0.5 + 0.5 = 1. Check the other requirements as an exercise.)
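This assignment can be sanity-checked by simulation. The sketch below is an illustration I am adding, not part of the notes; it samples "missed" with probability 0.5 and otherwise a uniform point on D, normalizing so that P(A) = area(A)/(2π) on the board (hence P(D) = π/(2π) = 1/2). The annulus 1/4 ≤ x² + y² ≤ 1/2 should then have probability (π/2 − π/4)/(2π) = 1/8.

```python
import math
import random

def sample_outcome(rng):
    """One dart throw: 'missed' w.p. 0.5, else a uniform point on the
    unit disc D (radius drawn by inverse-CDF: P(R <= r) = r^2)."""
    if rng.random() < 0.5:
        return 'missed'
    r = math.sqrt(rng.random())
    theta = 2 * math.pi * rng.random()
    return (r * math.cos(theta), r * math.sin(theta))

rng = random.Random(0)
n = 200_000
# Event A: the annulus 1/4 <= x^2 + y^2 <= 1/2.
hits = sum(1 for _ in range(n)
           if (w := sample_outcome(rng)) != 'missed'
           and 0.25 <= w[0]**2 + w[1]**2 <= 0.5)
empirical = hits / n
theoretical = (math.pi * 0.5 - math.pi * 0.25) / (2 * math.pi)  # = 1/8
print(abs(empirical - theoretical) < 0.01)
```

With 200,000 samples the Monte Carlo error is a few parts in a thousand, so the check passes comfortably.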

1.6. Distributions and distribution functions: Consider a probability space (Ω, F, P), and consider a random variable X defined on it. Up to this point, we have developed a formalism that allows us to ask questions of the form "what is the probability that X assumes a value in a Borel set B?" Symbolically, this is written as P({ω : X(ω) ∈ B}), or for short, P{X ∈ B}, with the understanding that the set {X ∈ B} is an event (i.e., a member of F). Answering all the questions of the above form is tantamount to assigning a number in the interval [0, 1] to every Borel set. Thus, we can think of a mapping from B into [0, 1] whose knowledge will provide an answer to all the questions of the form described earlier. We call this mapping the distribution of the random variable X, and it is denoted by μ_X. Formally, we have μ_X : B → [0, 1] according to the rule μ_X(B) = P{X ∈ B}, B ∈ B.

Proposition 1. μ_X defines a probability measure on the measurable space (IR, B).

Proof. See HW.


Distribution Functions: Recall from your undergraduate probability that we often associate with each random variable a distribution function, defined as F_X(x) = P{X ≤ x}. This function can also be obtained from the distribution of X, μ_X, by evaluating μ_X at B = (−∞, x], which is a Borel set. That is, F_X(x) = μ_X((−∞, x]). Note that for any x ≤ y, μ_X((x, y]) = F_X(y) − F_X(x).

Property 7. F_X is nondecreasing.

Proof. For x₁ ≤ x₂, (−∞, x₁] ⊆ (−∞, x₂], and μ_X((−∞, x₁]) ≤ μ_X((−∞, x₂]) since μ_X is a probability measure (see Property 3 above).

Property 8. lim_{x→∞} F_X(x) = 1 and lim_{x→−∞} F_X(x) = 0.

Proof. Note that (−∞, ∞) = ⋃_{n=1}^∞ (−∞, n], and by Property 4 above, lim_{n→∞} μ_X((−∞, n]) = μ_X(⋃_{n=1}^∞ (−∞, n]) = μ_X((−∞, ∞)) = 1, since μ_X is a probability measure. Thus we have proved that lim_{n→∞} F_X(n) = 1. Now the same argument can be repeated if we replace the sequence n by any increasing sequence xₙ. Thus, lim_{n→∞} F_X(xₙ) = 1 for any increasing sequence xₙ, and consequently lim_{x→∞} F_X(x) = 1.

The proof of the second assertion is left as an exercise.

Property 9. F_X is right continuous; that is, lim_{n→∞} F_X(xₙ) = F_X(y) for any monotonic and convergent sequence xₙ for which xₙ ↓ y.

Proof. For simplicity, assume that xₙ = y + n⁻¹. Note that F_X(y) = μ_X((−∞, y]), and (−∞, y] = ⋂_{n=1}^∞ (−∞, y + n⁻¹]. So, by Property 5 above, lim_{n→∞} μ_X((−∞, y + n⁻¹]) = μ_X(⋂_{n=1}^∞ (−∞, y + n⁻¹]) = μ_X((−∞, y]). Thus, we have proved that lim_{n→∞} F_X(y + n⁻¹) = F_X(y). In the same fashion, we can generalize the result to obtain lim_{n→∞} F_X(y + xₙ) = F_X(y) for any sequence for which xₙ ↓ 0. This completes the proof.

Property 10. F_X has a left limit at every point; that is, lim_{n→∞} F_X(xₙ) exists for any monotonic and convergent sequence xₙ.

Proof. For simplicity, assume that xₙ = y − n⁻¹ for some y. Note that (−∞, y) = ⋃_{n=1}^∞ (−∞, y − n⁻¹]. So, by Property 4 above, lim_{n→∞} F_X(y − n⁻¹) = lim_{n→∞} μ_X((−∞, y − n⁻¹]) = μ_X((−∞, y)).


Absolutely continuous random variables: If there is a Borel function f_X : IR → [0, ∞), with the property that ∫_{−∞}^∞ f_X(t) dt = 1, such that F_X(x) = ∫_{−∞}^x f_X(t) dt, then we say that X is absolutely continuous. Note that in this event we term f_X the probability density function (pdf) of X. Also note that if F_X is differentiable, then such a density exists and is given by the derivative of F_X.

Example 10. Let X be a uniformly-distributed random variable in [0, 2]. Let the function g be as shown in Fig. 1 below. Compute the distribution function of Y = g(X).

Figure 1. The function g(x) (left) and the resulting distribution function F_Y(y) (right).

Solution: If y < 0, F_Y(y) = 0 since Y is nonnegative. If 0 ≤ y ≤ 1, F_Y(y) = P({X ≤ y/2} ∪ {X > 1 − y/2}) = 0.5y + 0.5. Finally, if y > 1, F_Y(y) = 1 since Y ≤ 1. The graph of F_Y(y) is shown above. Note that F_Y(y) is indeed right continuous everywhere, but it is discontinuous at 0.

In this example the random variable X is absolutely continuous (why?). On the other hand, the random variable Y is not absolutely continuous, because there is no Borel function f_Y so that F_Y(x) = ∫_{−∞}^x f_Y(t) dt. Observe that F_Y has a jump at 0, and we cannot reproduce this jump by integrating a Borel function over it (we would need a delta function, which is not really a function, let alone a Borel function). It is not a discrete random variable either, since Y can assume any value from an uncountable collection of real numbers in [0, 1].
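This solution can be checked by simulation. In the sketch below I read g off Fig. 1 as the tent function g(x) = 2x on [0, 1/2], g(x) = 2 − 2x on [1/2, 1], and g(x) = 0 on (1, 2] (an assumption on my part, since only the figure defines g); with this choice {Y ≤ y} = {X ≤ y/2} ∪ {X > 1 − y/2} and F_Y(y) = 0.5y + 0.5 on [0, 1], including the jump of size 0.5 at y = 0 coming from P{Y = 0} = 0.5.

```python
import random

def g(x):
    # Tent function as read off Fig. 1 (an assumption of this sketch).
    if x <= 0.5:
        return 2 * x
    if x <= 1.0:
        return 2 - 2 * x
    return 0.0

rng = random.Random(1)
n = 200_000
ys = [g(2 * rng.random()) for _ in range(n)]   # X uniform on [0, 2]

def F_Y(y):
    """Empirical distribution function of Y = g(X)."""
    return sum(v <= y for v in ys) / n

# Expected: F_Y(0) = 0.5 (the jump), F_Y(0.5) = 0.75, F_Y(1) = 1.
print(abs(F_Y(0.0) - 0.50) < 0.01,
      abs(F_Y(0.5) - 0.75) < 0.01,
      abs(F_Y(1.0) - 1.00) < 0.01)
```

The empirical jump at 0 is exactly the non-absolutely-continuous part of Y discussed above.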


2. Expectations

Recall that in an undergraduate probability course one would talk about the expectation, average, or mean of a random variable. This is done by carrying out an integration (in the Riemann sense) with respect to a probability density function (pdf). It turns out that the definition of an expectation does not require having a pdf. It is based on a more-or-less intuitive notion of an average. We will follow this general approach here and then connect it to the usual expectation with respect to a pdf whenever the pdf exists. We begin by introducing the expectation of a nonnegative random variable, and will generalize thereafter.

Consider a nonnegative random variable X, and for each n ≥ 1 define the sum S_n = Σ_{i=1}^∞ (i/2^n) P{i/2^n < X ≤ (i+1)/2^n}. We claim that S_n is nondecreasing in n. If this is the case (to be proven shortly), then we know that S_n is either convergent (to a finite number) or S_n → ∞. In either case, we call the limit of S_n the expectation of X, and symbolically we denote it as E[X]. Thus, we define E[X] = lim_{n→∞} S_n. To see the

monotonicity of S_n, we follow Chow and Teicher [1] and observe that

{i/2^n < X ≤ (i+1)/2^n} = {2i/2^{n+1} < X ≤ (2i+2)/2^{n+1}} = {2i/2^{n+1} < X ≤ (2i+1)/2^{n+1}} ∪ {(2i+1)/2^{n+1} < X ≤ (2i+2)/2^{n+1}};

thus,

S_n = Σ_{i=1}^∞ (2i/2^{n+1}) (P{2i/2^{n+1} < X ≤ (2i+1)/2^{n+1}} + P{(2i+1)/2^{n+1} < X ≤ (2i+2)/2^{n+1}})
≤ (1/2^{n+1}) P{1/2^{n+1} < X ≤ 2/2^{n+1}} + Σ_{i=1}^∞ (2i/2^{n+1}) P{2i/2^{n+1} < X ≤ (2i+1)/2^{n+1}} + Σ_{i=1}^∞ ((2i+1)/2^{n+1}) P{(2i+1)/2^{n+1} < X ≤ (2i+2)/2^{n+1}} = S_{n+1}.

Markov Inequality: Let X be a random variable, let ε > 0, and let φ : [0, ∞) → [0, ∞) be nondecreasing with φ(ε) > 0. Then

P{|X| > ε} ≤ E[φ(|X|)]/φ(ε).

For example, taking φ(t) = t² and applying the inequality to X − E[X], we obtain the Chebyshev inequality P{|X − E[X]| > ε} ≤ E[(X − E[X])²]/ε². Note that the numerator is simply the variance of X. Also, if we take φ(t) = t, then we have P{|X| > ε} ≤ E[|X|]/ε.

Proof. Since φ is nondecreasing, {|X| > ε} ⊆ {φ(|X|) ≥ φ(ε)}. Further, note that 1 ≥ I_{{|X|>ε}}(ω) for any ω (since any indicator function is either 0 or 1). Now note that φ(|X|) ≥ φ(|X|) I_{{|X|>ε}} ≥ φ(ε) I_{{|X|>ε}}. Now take expectations of both ends of the inequalities to obtain E[φ(|X|)] ≥ φ(ε) E[I_{{|X|>ε}}] = φ(ε) P{|X| > ε}, and the desired result follows. (See the first special case after the definition of expectation.)
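Returning to the dyadic definition of expectation above: the sums S_n can be evaluated directly whenever P{a < X ≤ b} is available in closed form. The sketch below (my own example, not from the notes) does this for X exponential with mean 1, where P{a < X ≤ b} = e^{−a} − e^{−b}, and exhibits the claimed monotone convergence of S_n up to E[X] = 1.

```python
import math

def S(n, tail=50.0):
    """Dyadic lower sum S_n for E[X] with X ~ Exponential(1):
    S_n = sum_i (i/2^n) * P{i/2^n < X <= (i+1)/2^n},
    where P{a < X <= b} = exp(-a) - exp(-b)."""
    h = 2.0 ** (-n)
    total, i = 0.0, 1
    while i * h < tail:          # exp(-50) is negligible, safe to truncate
        a, b = i * h, (i + 1) * h
        total += a * (math.exp(-a) - math.exp(-b))
        i += 1
    return total

vals = [S(n) for n in range(1, 8)]
print(all(vals[k] <= vals[k + 1] for k in range(len(vals) - 1)))  # nondecreasing
print(abs(S(12) - 1.0) < 1e-3)   # S_n -> E[X] = 1, error at most 2^-n
```

Since X differs from its dyadic approximation by less than 2^{−n} pointwise, the gap E[X] − S_n is at most 2^{−n}, which the run confirms.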

Chernoff Bound: This is an application of the Markov inequality, and it is a useful tool for upper-bounding error probabilities in digital communications. Let X be a non-negative random variable and let λ be nonnegative. Then by taking φ(t) = e^{λt}, we obtain

P{X > x} ≤ e^{−λx} E[e^{λX}].

The right-hand side can be minimized over λ > 0 to yield what is called the Chernoff bound for a random variable X.
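A numerical sketch of this minimization (my own example, not from the notes): for X exponential with mean 1, E[e^{λX}] = 1/(1 − λ) for λ < 1; minimizing e^{−λx}/(1 − λ) over λ gives λ* = 1 − 1/x and the bound x·e^{1−x}, which indeed dominates the true tail P{X > x} = e^{−x}.

```python
import math

def chernoff_bound(x, grid=10_000):
    """min over lambda in (0, 1) of exp(-lambda*x) * E[exp(lambda*X)]
    for X ~ Exponential(1), whose MGF is 1/(1 - lambda) for lambda < 1."""
    best = float('inf')
    for k in range(1, grid):
        lam = k / grid                       # lambda ranges over (0, 1)
        best = min(best, math.exp(-lam * x) / (1.0 - lam))
    return best

x = 5.0
true_tail = math.exp(-x)                     # P{X > x} = e^{-x}
closed_form = x * math.exp(1 - x)            # analytic minimum at lambda* = 1 - 1/x
print(true_tail <= chernoff_bound(x) <= closed_form * 1.001)
```

The grid search lands on the analytic minimizer (λ* = 0.8 for x = 5), and the bound exceeds the exact tail by roughly a factor of x·e, illustrating that the Chernoff bound captures the exponential decay rate rather than the constant.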


3. Elementary Hilbert Space Theory

Most of the material in this section is extracted from the excellent book by W. Rudin, Real & Complex Analysis [2]. A complex vector space H is called an inner product space if for all x, y ∈ H we have a complex-valued scalar ⟨x, y⟩, read "the inner product between the vectors x and y," such that the following properties are satisfied:

(1) ⟨x, y⟩ = ⟨y, x⟩*, for all x, y ∈ H (where * denotes complex conjugation);
(2) ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩, for all x, y, z ∈ H;
(3) ⟨αx, y⟩ = α⟨x, y⟩, for all x, y ∈ H, α ∈ ℂ;
(4) ⟨x, x⟩ ≥ 0 for all x ∈ H, and ⟨x, x⟩ = 0 only if x = 0.

Note that (3) and (1) together imply ⟨x, αy⟩ = α*⟨x, y⟩. Also, (3) and (4) together imply ⟨x, x⟩ = 0 if and only if x = 0. It will be convenient to write ⟨x, x⟩ as ‖x‖², which we later use to introduce a norm on H.

3.1. Schwarz Inequality. It follows from properties (1)-(4) that |⟨x, y⟩| ≤ ‖x‖ ‖y‖ for all x, y ∈ H.

Proof: Put A = ‖x‖², C = ‖y‖², and B = |⟨x, y⟩|. By (4), ⟨x − rαy, x − rαy⟩ ≥ 0 for any choice of α ∈ ℂ and all r ∈ IR, which can be further written as (using (2)-(3)):

(6)    ‖x‖² − rα⟨y, x⟩ − rα*⟨x, y⟩ + r²‖y‖²|α|² ≥ 0.

Now choose α with |α| = 1 so that α⟨y, x⟩ = |⟨x, y⟩|. Thus, (6) can be cast as ‖x‖² − 2r|⟨x, y⟩| + r²‖y‖² ≥ 0, or A − 2rB + r²C ≥ 0, which is true for any r ∈ IR. Let r₁,₂ = (2B ± √(4B² − 4AC))/(2C) = (B ± √(B² − AC))/C denote the roots of the equation A − 2rB + r²C = 0. Since A − 2rB + r²C ≥ 0 for all r, it must be true that B² − AC ≤ 0 (since the roots cannot be real unless they are the same), which implies |⟨x, y⟩|² ≤ ‖x‖²‖y‖², or |⟨x, y⟩| ≤ ‖x‖ ‖y‖.

Note that |⟨x, y⟩| = ‖x‖ ‖y‖ whenever x = αy, α ∈ ℂ. Also, if |⟨x, y⟩| = ‖x‖ ‖y‖, then it is easy to verify that ⟨x − α(‖x‖/‖y‖)y, x − α(‖x‖/‖y‖)y⟩ = 0 (with α as chosen above), which implies (using (4)) x − α(‖x‖/‖y‖)y = 0. Thus, |⟨x, y⟩| = ‖x‖ ‖y‖ if and only if x is proportional to y.


3.2. Triangle Inequality. ‖x + y‖ ≤ ‖x‖ + ‖y‖, for all x, y ∈ H.

Proof. This follows from the Schwarz inequality. Expand ‖x + y‖² to obtain ‖x‖² + ‖y‖² + ⟨x, y⟩ + ⟨x, y⟩* ≤ ‖x‖² + ‖y‖² + 2|⟨x, y⟩|. Now by the Schwarz inequality, ‖x‖² + ‖y‖² + 2|⟨x, y⟩| ≤ ‖x‖² + ‖y‖² + 2‖x‖ ‖y‖ = (‖x‖ + ‖y‖)², from which the desired result follows. Note that ‖x + y‖² = ‖x‖² + ‖y‖² if ⟨x, y⟩ = 0 (why?).

This is a generalization of the customary triangle inequality for complex numbers, which states that |x − z| ≤ |x − y| + |y − z|, for all x, y, z ∈ ℂ.

3.3. Norm. We say ‖·‖ is a norm on H if:

(1) ‖x‖ ≥ 0;
(2) ‖x‖ = 0 only if x = 0;
(3) ‖x + y‖ ≤ ‖x‖ + ‖y‖;
(4) ‖αx‖ = |α| ‖x‖, α ∈ ℂ.

With the triangle inequality at hand, we can define ‖·‖ on members of H as follows: ‖x‖ = ⟨x, x⟩^{1/2}. You should check that this actually defines a norm. This yields a yardstick for the distance between two vectors x, y ∈ H, defined as ‖x − y‖. We can now say that H is a normed space.

3.4. Convergence. We can then talk about convergence: a sequence xₙ ∈ H is said to be convergent to x, written as xₙ → x, or lim_{n→∞} xₙ = x, if for every ε > 0 there exists N ∈ IN such that ‖xₙ − x‖ < ε whenever n > N.

3.5. Completeness. An inner-product space H is complete if every Cauchy sequence in H converges to a point in H. A sequence {yₙ}_{n=1}^∞ in H is called Cauchy if for any ε > 0 there exists N ∈ IN such that ‖yₙ − yₘ‖ < ε for all n, m > N.

Now, if H is complete, then H is called a Hilbert space.

Fact: If H is complete, then it is closed. This is because any convergent sequence is automatically a Cauchy sequence.

3.6. Convex Sets. A set E in a vector space V is said to be a convex set if for any x, y ∈ E and t ∈ (0, 1), the point zₜ = tx + (1 − t)y ∈ E. In other words,


the line segment between x and y lies in E. Note that if E is a convex set, then the translation of E, E + x = {y + x : y ∈ E}, is also a convex set.

3.7. Parallelogram Law. For any x and y in an inner-product space, ‖x + y‖² + ‖x − y‖² = 2‖x‖² + 2‖y‖². This can be verified simply using the properties of an inner product. See the schematic below.

Figure 2. A parallelogram with sides x and y and diagonals x + y and x − y.
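The Schwarz inequality, the triangle inequality, and the parallelogram law are easy to sanity-check numerically in the concrete inner-product space ℂⁿ with ⟨x, y⟩ = Σᵢ xᵢ yᵢ* (a sketch, not a proof):

```python
import random

def inner(x, y):
    """<x, y> = sum_i x_i * conj(y_i) on C^n."""
    return sum(a * b.conjugate() for a, b in zip(x, y))

def norm(x):
    return abs(inner(x, x)) ** 0.5

rng = random.Random(2)
rand_vec = lambda n: [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]

for _ in range(1000):
    x, y = rand_vec(4), rand_vec(4)
    xpy = [a + b for a, b in zip(x, y)]
    xmy = [a - b for a, b in zip(x, y)]
    assert abs(inner(x, y)) <= norm(x) * norm(y) + 1e-9           # Schwarz
    assert norm(xpy) <= norm(x) + norm(y) + 1e-9                  # triangle
    lhs = norm(xpy)**2 + norm(xmy)**2                             # parallelogram
    assert abs(lhs - 2 * norm(x)**2 - 2 * norm(y)**2) < 1e-9
print("all checks passed")
```

The small tolerances only absorb floating-point roundoff; the inequalities themselves hold exactly by the proofs above.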

3.8. Orthogonality. If ⟨x, y⟩ = 0, then we say that x and y are orthogonal. We write x ⊥ y. Note that ⊥ is a symmetric relation; that is, x ⊥ y ⇔ y ⊥ x. Pick a vector x ∈ H, then find all vectors in H that are orthogonal to x. We write this set as x⊥ = {y ∈ H : ⟨x, y⟩ = 0}, and x⊥ is a closed subspace. To see that it is a subspace, we must show that x⊥ is closed under addition and scalar multiplication; these are immediate from the definition of an inner product (see Assignment 4). To see that x⊥ is closed, we must show that the limit of every convergent sequence in x⊥ is also in x⊥. Let xₙ be a sequence in x⊥, and assume that xₙ → x₀. We need to show that x₀ ∈ x⊥. To see this, note that |⟨x₀, x⟩| = |⟨x₀ − xₙ + xₙ, x⟩| = |⟨x₀ − xₙ, x⟩ + ⟨xₙ, x⟩| = |⟨x₀ − xₙ, x⟩| ≤ ‖x₀ − xₙ‖ ‖x‖, by the Schwarz inequality; but the term on the right converges to zero, so |⟨x₀, x⟩| = 0, which implies x₀ ⊥ x and x₀ ∈ x⊥.

Let M ⊆ H be a subspace of H. We define M⊥ = ⋂_{x∈M} x⊥. It is easy to see that M⊥ is actually a subspace of H and that M ∩ M⊥ = {0}. It is also true that M⊥ is closed, which is simply because it is an intersection of closed sets (recall that x⊥ is closed).


Theorem 4. Any non-empty, closed and convex set E in a Hilbert space H contains a unique element of smallest norm. That is, there exists x₀ ∈ E such that ‖x₀‖ ≤ ‖x‖ for all x ∈ E.


Figure 3. Elevating M by x; Qx realizes the shortest distance from x + M to the origin.

Figure 4. The decomposition x = Px + Qx, with Px ∈ M and Qx ∈ M⊥.

Example 12. L²[a, b] = {f : ∫ₐᵇ |f(x)|² dx < ∞}, with the inner product ⟨f, g⟩ = ∫ₐᵇ f(x) g(x)* dx and norm ‖f‖ = ⟨f, f⟩^{1/2} = [∫ₐᵇ |f|² dx]^{1/2} = ‖f‖₂. Actually, we need the Schwarz inequality for expectations before we can see that we have an inner product in this case. More on this to follow.

Theorem 5. Projection Theorem. If M ⊆ H is a closed subspace of a Hilbert space H, then for every x ∈ H there exists a unique decomposition x = Px + Qx, where Px ∈ M and Qx ∈ M⊥. Moreover,

(1) ‖x‖² = ‖Px‖² + ‖Qx‖².
(2) If we think of Px and Qx as mappings from H to M and M⊥, respectively, then the mappings P and Q are linear.


(3) Px is the nearest point in M to x, and Qx is the nearest point in M⊥ to x. Px and Qx are called the orthogonal projections of x onto M and M⊥, respectively.

Proof.

Existence: Consider M + x; we claim that M + x is closed. Recall that M is closed, which means that if xₙ ∈ M and xₙ → x₀, then x₀ ∈ M. Pick a convergent sequence in x + M, call it zₙ. Now zₙ = x + yₙ for some yₙ ∈ M. Since zₙ is convergent, so is yₙ, but the limit of yₙ is in M, so lim_{n→∞} zₙ = x + lim_{n→∞} yₙ ∈ x + M.

We next show that x + M is convex. Pick x₁ and x₂ ∈ x + M. We need to show that for any 0 < λ < 1, λx₁ + (1 − λ)x₂ ∈ x + M. But x₁ = x + y₁ with y₁ ∈ M, and x₂ = x + y₂ with y₂ ∈ M. So λx₁ + (1 − λ)x₂ = x + λy₁ + (1 − λ)y₂ ∈ x + M, since λy₁ + (1 − λ)y₂ ∈ M.

By the minimal-norm theorem, there exists a member of x + M of smallest norm. Call it Qx, and let Px = x − Qx. Note that Px ∈ M. We need to show that Qx ∈ M⊥; namely, ⟨Qx, y⟩ = 0 for all y ∈ M. Call Qx = z, and note that ‖z‖ ≤ ‖z − αy‖ for every y ∈ M and α ∈ ℂ, since z − αy ∈ M + x. Pick y ∈ M with ‖y‖ = 1. Then 0 ≤ ‖z − αy‖² − ‖z‖² = −α*⟨z, y⟩ − α⟨y, z⟩ + |α|². Pick α = ⟨z, y⟩. We obtain 0 ≤ −|⟨z, y⟩|². This can hold only if ⟨z, y⟩ = 0, i.e., z is orthogonal to every y ∈ M; therefore, Qx ∈ M⊥.

Uniqueness: Suppose that x = Px + Qx = (Px)′ + (Qx)′, where Px, (Px)′ ∈ M and Qx, (Qx)′ ∈ M⊥. Then Px − (Px)′ = (Qx)′ − Qx, where the left side belongs to M while the right side belongs to M⊥. Hence, each side can only be the zero vector (why?), and we conclude that Px = (Px)′ and (Qx)′ = Qx.

Minimum Distance Properties: To show that Px is the nearest point in M to x, pick any y ∈ M and observe that ‖x − y‖² = ‖Px + Qx − y‖² = ‖Qx + (Px − y)‖² = ‖Qx‖² + ‖Px − y‖² (since Px − y ∈ M is orthogonal to Qx). The right-hand side is minimized when Px − y = 0, which happens if and only if y = Px. The fact that Qx is the nearest point in M⊥ to x can be shown similarly.

Linearity: Take x, y ∈ H. Then we have x = Px + Qx and y = Py + Qy. Now ax + by = aPx + aQx + bPy + bQy. On the other hand, ax + by = P(ax + by) + Q(ax + by).


Thus, we can write P(ax + by) − aPx − bPy = aQx + bQy − Q(ax + by) and observe that the left side is in M while the right side is in M⊥. Hence, each side can only be the zero vector. Therefore, P(ax + by) = aPx + bPy and Q(ax + by) = aQx + bQy.
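The decomposition x = Px + Qx is concrete in the Hilbert space IR³ (a toy example of my own): for a one-dimensional subspace M = span{m}, the projection is Px = (⟨x, m⟩/⟨m, m⟩) m, and both property (1) of the theorem and Qx ⊥ M can be verified directly.

```python
def dot(u, v):
    """Real inner product on R^n."""
    return sum(a * b for a, b in zip(u, v))

def project_onto_span(x, m):
    """Orthogonal decomposition x = Px + Qx with M = span{m}:
    Px = (<x, m> / <m, m>) m, and Qx = x - Px."""
    c = dot(x, m) / dot(m, m)
    Px = [c * a for a in m]
    Qx = [a - b for a, b in zip(x, Px)]
    return Px, Qx

x = [3.0, 1.0, 2.0]
m = [1.0, 1.0, 0.0]
Px, Qx = project_onto_span(x, m)
print(Px)                          # [2.0, 2.0, 0.0]
print(abs(dot(Qx, m)) < 1e-12)     # Qx is orthogonal to M: True
# Part (1) of the theorem: ||x||^2 = ||Px||^2 + ||Qx||^2
print(abs(dot(x, x) - dot(Px, Px) - dot(Qx, Qx)) < 1e-12)
```

Here ‖x‖² = 14 splits as ‖Px‖² + ‖Qx‖² = 8 + 6, and Px is the point of M nearest to x, exactly as the minimum-distance property asserts.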


we may not have an inner product. Can you see why?) We can also define the norm ‖X‖₂ = ⟨X, X⟩^{1/2}. This is called the L²-norm associated with (Ω, F, P). Hence, we can recast H as the vector space of all random variables that have finite L²-norm. This collection is often written as L²(Ω, F, P). It can be shown that L²(Ω, F, P) is complete (see [2], pp. 67). Hence, L²(Ω, F, P) is a Hilbert space. In fact, L²(Ω, D, P) is a Hilbert space for any σ-algebra D.

4.2. The L² conditional expectation. Let X, Y ∈ L²(Ω, F, P). Clearly, L²(Ω, σ(X), P) is a closed (why?) subspace of L²(Ω, F, P). We can now apply the projection theorem to Y, with L²(Ω, σ(X), P) being our closed subspace, and obtain the decomposition Y = PY + QY, where PY ∈ L²(Ω, σ(X), P) and QY ∈ L²(Ω, σ(X), P)⊥. We also have the property that ‖Y − PY‖₂ ≤ ‖Y − Y′‖₂ for all Y′ ∈ L²(Ω, σ(X), P). We call PY the conditional expectation of Y given σ(X), which we will write from this point on as E[Y | σ(X)].

    Exercise 1. (1) Show that if X is a D-measurable r.v., then so is aX, for any a ∈ ℝ. (2) Show that if X and Y are D-measurable, so is X + Y. [See Homework]

    The above exercise shows that the collection of square-integrable D-measurable random variables is indeed a vector space.

    Theorem 7. Consider (Ω, F, P), and let X be a random variable. A random variable Z is a σ(X)-measurable random variable if and only if Z = h(X) for some Borel

    function h. (The proof is based on a corollary to the Dynkin class theorem. We

    omit the proof, which can be found in [1].)

    Since E[X|σ(Y)] is σ(Y)-measurable by definition, we can write it explicitly as a Borel function of Y. For this reason, we often write E[X|Y] in place of E[X|σ(Y)].
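    As a quick numerical illustration (a sketch of my own, not from the notes; the model Y = X² + noise is an arbitrary choice), the following estimates E[Y|X] for a discrete X by within-class averaging, exhibiting the conditional expectation as a function h(X) as in Theorem 7, and checks the projection property that the residual Y − h(X) is orthogonal to functions of X:

    ```python
    import random

    # Sketch (toy model, not from the notes): with X discrete, estimate
    # h(t) = E[Y | X = t] by within-class averaging, and check the projection
    # property E[(Y - h(X)) g(X)] ~ 0 for a function g of X.
    random.seed(0)
    xs, ys = [], []
    for _ in range(150_000):
        x = random.choice([0, 1, 2])       # X takes three values
        y = x * x + random.gauss(0, 1)     # Y = X^2 + independent noise
        xs.append(x)
        ys.append(y)

    h = {}
    for t in (0, 1, 2):
        cls = [y for x, y in zip(xs, ys) if x == t]
        h[t] = sum(cls) / len(cls)         # empirical E[Y | X = t], about t^2

    # orthogonality of the residual to the sigma(X)-measurable variable g(X) = X
    resid = sum((y - h[x]) * x for x, y in zip(xs, ys)) / len(xs)
    print(h, resid)
    ```

    The estimated h(t) is close to t², and the residual correlation is close to zero, which is exactly the Hilbert-space projection picture.
    
    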

    4.3. Properties of Conditional Expectation. Our definition of conditional ex-

    pectation lends itself to many powerful properties that are useful in practice. One of

    the properties also leads to an equivalent definition of the conditional expectation,

    which is actually the way it is commonly done (see the comments after Property 4.3.3).


    However, proving the existence of the conditional expectation according to the alter-

    native definition requires knowledge of a major theorem in measure theory called theRadon-Nikodym Theorem, which is beyond the scope of this course. That is why in

    this course we took a path (for defining the conditional expectation) that is based on

    the projection theorem, which is an important theorem in signal processing for other

    reasons as well. Here are some key properties of conditional expectations.

    4.3.1. Property 1. For any constant a, E[a|D] = a. This follows trivially from the observation that Pa = a (why?).

    4.3.2. Property 2 (Linearity). For any constants a, b, E[aX + bY|D] = aE[X|D] + bE[Y|D]. This follows trivially from the linearity of the projection operator P (see the Projection Theorem).

    4.3.3. Property 3. Let Z = E[X|D]. Then E[XY] = E[ZY] for all Y ∈ L2(Ω, D, P).

    Interpretation: Z contains all the information in X that is relevant to any D-measurable random variable Y.

    Proof: Note that E[XY] = E[(PX + QX)Y] = E[(PX)Y] + E[(QX)Y] = E[(PX)Y] = E[ZY]. The last equality follows from the definition of Z.

    Conversely, if a random variable Z ∈ L2(Ω, D, P) has the property that E[ZY] = E[XY] for all Y ∈ L2(Ω, D, P), then Z = E[X|D] almost surely. To see this, we need to show that Z = PX almost surely. Note that by assumption E[Y(X − Z)] = 0 for any Y ∈ L2(Ω, D, P). Therefore, E[Y(PX + QX − Z)] = 0, or E[Y PX] + E[Y QX] − E[Y Z] = 0, or E[Y(Z − PX)] = 0 for all Y ∈ L2(Ω, D, P) (since E[Y QX] = 0). In particular, if we take Y = Z − PX we conclude that E[(Z − PX)²] = 0, which implies Z = PX almost surely.

    Thus, we arrive at an alternative definition for the L2 conditional expectation.


    Definition 6. We define Z = E[X|D] if (1) Z ∈ L2(Ω, D, P) and (2) E[ZY] = E[XY] for all Y ∈ L2(Ω, D, P).

    We will use this new definition frequently in the remainder of this chapter.

    4.3.4. Property 4 (Smoothing Property). For any X, E[X] = E[E[X|D]]. To see this, note that if Z = E[X|D], then E[ZY] = E[XY] for all Y ∈ L2(Ω, D, P). Now take Y = 1 to conclude that E[X] = E[Z].

    4.3.5. Property 5. If Y ∈ L2(Ω, D, P), then E[XY|D] = Y E[X|D]. To show this, we check the second definition of conditional expectation. Note that Y E[X|D] ∈ L2(Ω, D, P), so all we need to show is that E[(XY)W] = E[(Y E[X|D])W] for any W ∈ L2(Ω, D, P). But we already know from the definition of E[X|D] that E[(W Y)E[X|D]] = E[(W Y)X], since W Y ∈ L2(Ω, D, P).

    As a special case, we have E[XY] = E[E[XY|Y]] = E[Y E[X|Y]].

    4.3.6. Property 6. For any Borel function g : ℝ → ℝ, E[g(Y)|D] = g(Y) whenever g(Y) ∈ L2(Ω, D, P). (For example, we have E[g(Y)|Y] = g(Y).) To prove this, all we need to do is observe that g(Y) ∈ L2(Ω, D, P) and then apply Property 4.3.5 with X = 1.

    4.3.7. Property 7 (Iterated Conditioning). Let D1 ⊂ D2 ⊂ F. Then E[X|D1] = E[E[X|D2]|D1].

    Proof: Let Z = E[E[X|D2]|D1]. First note that Z ∈ L2(Ω, D1, P), so we only need to show that E[ZY] = E[XY] for all Y ∈ L2(Ω, D1, P). Now E[ZY] = E[Y E[E[X|D2]|D1]]. Since Y is D1-measurable, it is also D2-measurable; thus, we have E[Y E[E[X|D2]|D1]] = E[E[E[Y X|D2]|D1]] for all Y ∈ L2(Ω, D1, P). But by Property 4.3.4, E[E[E[Y X|D2]|D1]] = E[E[Y X|D2]] = E[Y X], which completes the proof.


    Also note that E[X|D1] = E[E[X|D1]|D2]. This is much easier to prove. (Prove it.)
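    Iterated conditioning can be checked mechanically on a finite sample space, where conditioning on the σ-algebra generated by a partition is just block averaging. The partitions and values below are a toy example of my own:

    ```python
    import random

    # Finite-sample-space sketch of iterated conditioning: with D1 coarser
    # than D2, conditioning on D2 first and then on D1 gives the same result
    # as conditioning on D1 directly.
    random.seed(1)
    omega = list(range(8))                  # sample space {0,...,7}, uniform P
    X = {w: random.uniform(-1, 1) for w in omega}

    D2_blocks = [[0, 1], [2, 3], [4, 5], [6, 7]]        # finer partition
    D1_blocks = [[0, 1, 2, 3], [4, 5, 6, 7]]            # coarser partition

    def cond_exp(f, blocks):
        """E[f | partition]: replace f on each block by the block average."""
        g = {}
        for blk in blocks:
            avg = sum(f[w] for w in blk) / len(blk)
            for w in blk:
                g[w] = avg
        return g

    lhs = cond_exp(cond_exp(X, D2_blocks), D1_blocks)   # E[ E[X|D2] | D1 ]
    rhs = cond_exp(X, D1_blocks)                        # E[X | D1]
    print(all(abs(lhs[w] - rhs[w]) < 1e-12 for w in omega))
    ```
    
    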

    The next property is very important in applications.

    4.3.8. Property 8. If X and Y are independent r.v.'s, and if g : ℝ² → ℝ is Borel measurable, then E[g(X, Y)|σ(Y)] = h(Y), where for any scalar t, h(t) = E[g(X, t)]. As a consequence, E[g(X, Y)] = E[h(Y)].

    Proof: We prove this in the case g(x, y) = g1(x)g2(y), where g1 and g2 are bounded, real-valued Borel functions. (The general result follows through an application of a corollary to the Dynkin class theorem.) First note that E[g1(X)g2(Y)|σ(Y)] = g2(Y)E[g1(X)|σ(Y)]. Also note that h(t) = g2(t)E[g1(X)], so h(Y) = g2(Y)E[g1(X)]. We need to show that h(Y) = E[g1(X)g2(Y)|σ(Y)]. To do so, we need to show that E[h(Y)Z] = E[g1(X)g2(Y)Z] for every Z ∈ L2(Ω, σ(Y), P). But E[h(Y)Z] = E[g2(Y)E[g1(X)]Z] = E[g1(X)]E[g2(Y)Z], and by the independence of X and Y the latter is equal to E[g1(X)g2(Y)Z].

    The stated consequence follows from Property 4.3.4. Also, the above result can be easily generalized to random vectors.
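    A small simulation of the consequence E[g(X, Y)] = E[h(Y)] with h(t) = E[g(X, t)]; the choices of g, X, and Y below are my own:

    ```python
    import random

    # Sketch: with X and Y independent, E[g(X, Y)] = E[h(Y)] where
    # h(t) = E[g(X, t)]. Here g(x, y) = (x + y)^2 and X is +/-1 with equal
    # probability, so h(t) = E[X^2] + 2 t E[X] + t^2 = 1 + t^2.
    random.seed(2)

    def g(x, y):
        return (x + y) ** 2

    N = 200_000
    samples_X = [random.choice([-1, 1]) for _ in range(N)]   # E[X]=0, E[X^2]=1
    samples_Y = [random.choice([0, 1, 2]) for _ in range(N)] # independent of X

    lhs = sum(g(x, y) for x, y in zip(samples_X, samples_Y)) / N
    rhs = sum(1 + y * y for y in samples_Y) / N              # E[h(Y)], h(t)=1+t^2
    print(lhs, rhs)
    ```

    Both estimates agree (the common value is E[1 + Y²] = 8/3 for this Y).
    
    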

    4.4. Some applications of Conditional Expectations.

    Example 13 (Photon counting).

    [Figure 5: radiation → detection → current. N photons arrive at the detector; each produces an electric pulse, and the pulses are tallied by a counter whose output is M.]


    Next, E[M²] = E[h₂(N)] = E[N² − N] E[G1]² + (E[G1]² + σ_G²) E[N] = E[G1]²(σ_N² + N̄² − N̄) + (E[G1]² + σ_G²) N̄, where N̄ = E[N] and σ_N² = E[N²] − N̄² is the variance of N. Finally, if we assume that N is a Poisson random variable (as in coherent light), so that σ_N² = N̄, the second moment of M becomes

    E[M²] = E[G1]² N̄² + (E[G1]² + σ_G²) N̄.
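    A Monte Carlo sketch of the Poisson case. It assumes the model of this example, M = G1 + ... + GN with i.i.d. gains Gi independent of N (the particular gain distribution below is an arbitrary choice of mine), and compares the empirical second moment with the formula above:

    ```python
    import math
    import random

    # Monte Carlo check (assumed model: M = G_1 + ... + G_N, i.i.d. gains G_i
    # independent of a Poisson count N) of the second-moment formula
    # E[M^2] = E[G1]^2 * Nbar^2 + (E[G1]^2 + var(G)) * Nbar.
    random.seed(3)

    def poisson(lam):
        """Knuth's method for a Poisson variate."""
        L, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= random.random()
            if p <= L:
                return k
            k += 1

    nbar = 4.0
    gains = [1, 2, 3]                      # G uniform on {1,2,3}: E[G]=2, var=2/3
    EG, varG = 2.0, 2.0 / 3.0

    trials = 100_000
    m2 = 0.0
    for _ in range(trials):
        N = poisson(nbar)
        M = sum(random.choice(gains) for _ in range(N))
        m2 += M * M
    m2 /= trials

    theory = EG ** 2 * nbar ** 2 + (EG ** 2 + varG) * nbar
    print(m2, theory)
    ```
    
    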

    Exercise: For an integer-valued random variable X, we define the generating function of X as Φ_X(s) = E[s^X], s ∈ ℂ, |s| ≤ 1. We can think of the generating function as the z-transform of the probability mass function associated with X.

    For the photon-counting example described above, show that Φ_M(s) = Φ_N(Φ_G(s)).
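    The identity in the exercise can also be checked numerically. The sketch below (my own choices of λ and gain distribution) uses a Poisson N, for which Φ_N(z) = e^{λ(z−1)} is available in closed form:

    ```python
    import math
    import random

    # Numerical sketch of Phi_M(s) = Phi_N(Phi_G(s)) for a randomly stopped
    # sum M = G_1 + ... + G_N (model assumed as in Example 13), with N
    # Poisson(lam) so that Phi_N(z) = exp(lam * (z - 1)).
    random.seed(4)
    lam = 3.0
    gains = [1, 2, 3]                                 # G uniform on {1,2,3}

    def poisson(l):
        L, k, p = math.exp(-l), 0, 1.0
        while True:
            p *= random.random()
            if p <= L:
                return k
            k += 1

    def phi_G(s):
        return (s + s ** 2 + s ** 3) / 3.0            # E[s^G]

    s = 0.7
    trials = 200_000
    emp = 0.0
    for _ in range(trials):
        M = sum(random.choice(gains) for _ in range(poisson(lam)))
        emp += s ** M                                 # empirical E[s^M]
    emp /= trials

    closed = math.exp(lam * (phi_G(s) - 1.0))         # Phi_N(Phi_G(s))
    print(emp, closed)
    ```
    
    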

    Example 14 (First Occurrence of Successive Hits).

    Consider the problem of flipping a coin successively. Suppose P{H} = p and P{T} = q = 1 − p. In this case, we define Ω = ×_{i=1}^∞ Ωᵢ, where for each i, Ωᵢ = {H, T}. This means that members of Ω are simply of the form ω = (ω₁, ω₂, ...), where ωᵢ ∈ Ωᵢ. For each i, we define Fᵢ = {∅, {H, T}, {T}, {H}}. We take the F associated with Ω as the minimal σ-algebra containing all cylinders with measurable bases in Ω. An event E in ×_{i=1}^∞ Ωᵢ is called a cylinder with a measurable base if E is of the form

    {(ω₁, ω₂, ...) : ω_{i1} ∈ B₁, ..., ω_{ik} ∈ B_k}, k finite, Bᵢ ∈ Fᵢ.

    In words, in a cylinder only finitely many of the coordinates are specified. (For example, think of a vertical cylinder in ℝ³, where we specify only two coordinates out of three; now extend this to ℝ^∞.) Let Xᵢ = I{ωᵢ = H} be a {0, 1}-valued random variable, and define X on ×_{i=1}^∞ Ωᵢ as X = (X₁, X₂, ...). Finally, we define P on cylinders in Ω by forming the product of the probabilities of the coordinates (enforcing independence). For each (Ωᵢ, Fᵢ), define Pᵢ{H} = p and Pᵢ{T} = q = 1 − p.

    We would like to define a random variable that tells us when a run of n successive heads appears for the first time. More precisely, define the random variable T₁ as a function of X as follows: T₁ = min{i ≥ 1 : Xᵢ = 1}. For example, if X = (0, 0, 1, 0, 1, 1, 0, 1, 1, 1, ...), then T₁ = 3. More generally, we define T_k = min{i : X_{i−k+1} = X_{i−k+2} = ... = Xᵢ = 1}. For each k, T_k is a r.v. on Ω. For example, if X = (0, 0, 1, 0, 1, 1, 0, 1, 1, 1, ...), then T₂ = 6, and T₃ = 10.


    Let y_k = E[T_k]. In what follows we will characterize y_k.

    Special case: k = 1. It is easy to see that P{T₁ = 1} = p, P{T₁ = 2} = qp, P{T₁ = 3} = q²p, ..., P{T₁ = n} = q^(n−1) p. Recall from undergraduate probability that the above probability law is called the geometric law, and T₁ is called a geometric random variable. Also, y₁ = Σ_{i=1}^∞ i p q^(i−1) = 1/p.

    General case: k > 1. We begin by observing that T_k is actually an explicit function, f say, of T_{k−1}, X_{T_{k−1}+1}, X_{T_{k−1}+2}, .... Note, however, that T_{k−1} and {X_{T_{k−1}+1}, X_{T_{k−1}+2}, ...} are independent. Therefore, by Property 4.3.8, E[T_k] = E[f(T_{k−1}, X_{T_{k−1}+1}, X_{T_{k−1}+2}, ...)] = E[h(T_{k−1})], where h(t) = E[f(t, X_{t+1}, X_{t+2}, ...)].

    It is easy to check (using the equivalent definition of a conditional expectation) that E[f(t, X_{t+1}, ...)|X_{t+1}] = (t + 1) I{X_{t+1} = 1} + (t + 1 + y_k) I{X_{t+1} = 0}. This essentially says that if it took us t flips to see k − 1 consecutive heads for the first time, then if the (t+1)st flip is a head, we have achieved k successive heads at time t + 1; alternatively, if the (t+1)st flip is a tail, then we have to start all over again (start afresh) while we have already wasted t + 1 units of time.

    Now, E[f(t, X_{t+1}, ...)] = E[E[f(t, X_{t+1}, ...)|X_{t+1}]] = E[(t + 1) I{X_{t+1} = 1} + (t + 1 + y_k) I{X_{t+1} = 0}] = (t + 1)p + (t + 1 + y_k)q. Thus, h(t) = (t + 1)p + (t + 1 + y_k)q, and y_k = E[h(T_{k−1})] = E[(T_{k−1} + 1)p + (T_{k−1} + 1 + y_k)q] = p + p y_{k−1} + q y_k + q y_{k−1} + q.

    Finally, we obtain

    (9) y_k = p^(−1) y_{k−1} + p^(−1).

    We now invoke the initial condition y₁ = 1/p, which completes the characterization of y_k.

    For example, if p = 1/2, then y₂ = 2y₁ + 2 = 6, and so on.
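    The recursion (9) is easy to check by simulation; the sketch below (parameters are my own choices) compares a Monte Carlo estimate of E[T_k] with the value produced by the recursion:

    ```python
    import random

    # Monte Carlo check of the recursion y_k = (y_{k-1} + 1)/p from (9):
    # expected time of the first run of k consecutive heads, P{H} = p.
    random.seed(5)
    p, k = 0.5, 3

    def first_run_time(p, k):
        run, t = 0, 0
        while True:
            t += 1
            run = run + 1 if random.random() < p else 0
            if run == k:
                return t

    trials = 100_000
    emp = sum(first_run_time(p, k) for _ in range(trials)) / trials

    y = 1.0 / p                       # y_1 = 1/p
    for _ in range(k - 1):
        y = (y + 1.0) / p             # y_k = p^{-1} y_{k-1} + p^{-1}
    print(emp, y)
    ```

    For p = 1/2 the recursion gives y₁ = 2, y₂ = 6, y₃ = 14, and the simulation agrees.
    
    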


    Example 15 (Importance of the Independence Hypothesis in Property 4.3.8).

    Let Y = XZ, where Z = 1 + X. Now, E[Y] = E[XZ] = E[X(1 + X)] = E[X] + E[X²]. However, if we erroneously attempt to use Property 4.3.8 (by ignoring the dependence of Z on X), we take h(t) = E[Xt] = E[X]t and obtain E[h(Z)] = E[E[X]Z] = E[X]E[Z] = E[X](1 + E[X]) = E[X] + E[X]². Note that the two are different since E[X²] ≠ E[X]² in general.

    4.5. The L1 conditional expectation. Consider a probability space (Ω, F, P) and let X be an integrable random variable, that is, E[|X|] < ∞. We write X ∈ L1(Ω, F, P). Let D be a sub-σ-algebra. For each M ≥ 1, we define the truncation X ∧ M as X when |X| ≤ M and M otherwise. Note that X ∧ M ∈ L2(Ω, F, P) (since it is bounded), so we can talk about E[(X ∧ M)|D], which we call Z_M. The question is: can we somehow use Z_M to define E[X|D]? Intuitively, we would want to think of E[X|D] as some kind of limit of Z_M as M → ∞. So one approach to the construction of E[X|D] is to take a limit of Z_M and then show that the limit, Z say, satisfies Definition 6 with L2 replaced by L1. It turns out that such an approach would lead to the following definition, which we will adopt from now on.

    Definition 7 (L1 Conditional Expectation). If X is an integrable random variable, then we define Z = E[X|D] if Z has the following two properties: (1) Z is D-measurable, and (2) E[ZY] = E[XY] for any bounded D-measurable random variable Y.

    All the properties of the L2 conditional expectation carry over to the L1 conditional expectation.

    We make the final remark that if a random variable X is square integrable, then it is integrable. The proof is easy. Suppose that X is square integrable. Then E[|X|] = E[|X| I{|X| ≤ 1}] + E[|X| I{|X| > 1}] ≤ E[I{|X| ≤ 1}] + E[X² I{|X| > 1}] ≤ 1 + E[X²] < ∞.

    4.6. Conditional probability. Recall that if P(B) > 0, then we define the conditional probability of A given that the event B has occurred as P(A|B) = P(A ∩ B)/P(B).


    What is the connection between this definition and our notion of a conditional expectation? Consider the σ-algebra D = {∅, Ω, B, Bᶜ}, and consider E[I_A|D]. Because of the special form of this D, and because we know that this conditional expectation is D-measurable, we infer that E[I_A|D] can assume only two values: one value on B and another value on Bᶜ. That is, we can write E[I_A|D] = a I_B + b I_{Bᶜ}, where a and b are constants. We claim that a = P(A ∩ B)/P(B) and b = P(A ∩ Bᶜ)/P(Bᶜ). Note that because P(B) > 0 and 1 − P(B) > 0, we are not dividing by zero. As seen from a homework problem, we can prove that P(A ∩ B)/P(B) I_B + P(A ∩ Bᶜ)/P(Bᶜ) I_{Bᶜ} is actually the conditional expectation E[I_A|D] by showing that it satisfies the two defining properties listed in Definition 6 or Definition 7. Thus, E[I_A|D] encompasses both P(A|B) and P(A|Bᶜ); that is, P(A|B) and P(A|Bᶜ) are simply the values of E[I_A|D] on B and Bᶜ, respectively. Also note that P(·|B) is actually a probability measure for each B as long as P(B) > 0.
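    On a finite sample space this claim can be verified directly; the events A and B below are a toy example of my own. The check confirms that Z = P(A|B) I_B + P(A|Bᶜ) I_{Bᶜ} satisfies the defining property E[ZY] = E[I_A Y] for every D-measurable Y:

    ```python
    # Finite-sample-space check: with D = {empty, Omega, B, B^c}, the variable
    # Z = P(A|B) I_B + P(A|B^c) I_{B^c} satisfies E[Z Y] = E[I_A Y] for every
    # D-measurable Y (which is necessarily of the form cB*I_B + cBc*I_{B^c}).
    omega = [0, 1, 2, 3, 4, 5]
    P = {w: 1 / 6 for w in omega}            # uniform probability
    A = {0, 1, 4}
    B = {0, 1, 2}
    Bc = set(omega) - B

    def prob(E_set):
        return sum(P[w] for w in E_set)

    a = prob(A & B) / prob(B)                # P(A|B)   = (2/6)/(3/6) = 2/3
    b = prob(A & Bc) / prob(Bc)              # P(A|B^c) = (1/6)/(3/6) = 1/3
    Z = {w: (a if w in B else b) for w in omega}

    def expect(f):
        return sum(f[w] * P[w] for w in omega)

    ok = True
    for (cB, cBc) in [(1, 0), (0, 1), (2.5, -1.0)]:
        Y = {w: (cB if w in B else cBc) for w in omega}
        ZY = {w: Z[w] * Y[w] for w in omega}
        IAY = {w: (1 if w in A else 0) * Y[w] for w in omega}
        ok = ok and abs(expect(ZY) - expect(IAY)) < 1e-12
    print(a, b, ok)
    ```
    
    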

    4.7. Joint densities and marginal densities. Let X and Y be random variables on (Ω, F, P) with a distribution μ_XY(B) defined on all Borel subsets B of the plane. Suppose that X and Y have a joint density. That is, there exists an integrable function f_XY(·, ·) : ℝ² → ℝ such that

    (10) μ_XY(B) = ∫∫_B f_XY(x, y) dx dy, for any B ∈ B².

    4.7.1. Marginal densities. Consider μ_XY(A × ℝ), where A ∈ B. Notice that A ↦ μ_XY(A × ℝ) is a probability measure. Also, by the integral representation shown in (10),

    (11) μ_XY(A × ℝ) = ∫∫_{A×ℝ} f_XY(x, y) dx dy = ∫_A ( ∫_ℝ f_XY(x, y) dy ) dx = ∫_A f_X(x) dx,

    where f_X(x) = ∫_ℝ f_XY(x, y) dy is called the marginal density of X. Note that f_X(·) qualifies as the pdf of X. Thus, we arrive at the familiar result that the pdf of X can be obtained from the joint pdf of X and Y through integration. Similarly, f_Y(y) = ∫_ℝ f_XY(x, y) dx.


    In the above, we have used two facts about integration: (1) that we can write a double integral as an iterated integral, and (2) that we can exchange the order of integration in the iterated integrals. These two properties are always true as long as the integrand is non-negative. This is a consequence of what is called Fubini's Theorem in analysis [2, Chapter 8]. To complete the proof, we need to address the concern over the points at which f_X(·) = 0 (in which case we will not be able to cancel the f_X's in the numerator and the denominator in the above development). However, it is straightforward to show that P{f_X(X) = 0} = 0, which guarantees that we can exclude these problematic points from the integration in the x-direction without changing the integral.

    With the above results, we can think of

    f_XY(x, y) / ∫_ℝ f_XY(x, y) dy

    as a conditional pdf. In particular, if we define

    (14) f_{Y|X}(y|x) = f_XY(x, y) / ∫_ℝ f_XY(x, y) dy = f_XY(x, y) / f_X(x),

    then we can calculate the conditional expectation E[Y|X] using the formula ∫_ℝ y f_{Y|X}(y|X) dy, which is the familiar result that we know from undergraduate probability.
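    A numerical sketch of (14), using a joint pdf of my own choosing, f_XY(x, y) = x + y on the unit square, for which E[Y|X = x] = (x/2 + 1/3)/(x + 1/2) in closed form:

    ```python
    # Numerical check of (14) for the joint pdf f_XY(x, y) = x + y on [0,1]^2:
    # E[Y | X = x] = (integral of y*f_XY dy) / (integral of f_XY dy), computed
    # by a midpoint Riemann sum and compared with the closed form.
    def f_xy(x, y):
        return x + y                      # a valid joint pdf on the unit square

    def cond_mean(x, n=10_000):
        dy = 1.0 / n
        num = sum((i + 0.5) * dy * f_xy(x, (i + 0.5) * dy) for i in range(n)) * dy
        den = sum(f_xy(x, (i + 0.5) * dy) for i in range(n)) * dy
        return num / den

    x = 0.25
    approx = cond_mean(x)
    exact = (x / 2 + 1 / 3) / (x + 1 / 2)
    print(approx, exact)
    ```
    
    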


    5. Convergence of random sequences

    Consider a sequence of random variables X₁, X₂, ..., defined on the product space (Ω, F, P), where as before, Ω = ×_{i=1}^∞ Ωᵢ is the infinite-dimensional Cartesian product space and F is the smallest σ-algebra containing all cylinders. Since each Xᵢ is a random variable, when we talk about convergence of random variables we must do so in the context of functions.

    5.1. Pointwise convergence. Let us take Ω = [0, 1], F = B[0, 1], and take P as Lebesgue measure (generalized length) on [0, 1]. Consider the sequence of random variables Xn(ω) defined as follows:

    (15) Xn(ω) = { n²ω, 0 ≤ ω ≤ 1/(2n); n − n²ω, 1/(2n) < ω ≤ 1/n; 0, otherwise. }

    It is straightforward to see that for any ω ∈ Ω, the sequence of random variables Xn(ω) converges to the constant function 0. We say that Xn(ω) → 0 pointwise, everywhere, or surely. Generally, a sequence of random variables Xn converges to a random variable X if for every ω ∈ Ω, Xn(ω) → X(ω). To be more precise, for every ε > 0 and every ω ∈ Ω, there exists n₀ ∈ ℕ such that |Xn(ω) − X(ω)| < ε whenever n ≥ n₀. Note that in this definition n₀ not only depends upon the choice of ε but also on the choice of ω. Note that in the example above we cannot drop the dependence of n₀ on ω. To see that, observe that sup_ω |Xn(ω) − 0| = n/2; thus, there is no single n₀ that will work for every ω.

    5.2. Uniform convergence. The strongest type of convergence is what's called uniform convergence, in which case n₀ is independent of the choice of ω. More precisely, a sequence of random variables Xn converges to a random variable X uniformly if for every ε > 0, there exists n₀ ∈ ℕ such that |Xn(ω) − X(ω)| < ε for any ω ∈ Ω whenever n ≥ n₀. This is equivalent to saying that for every ε > 0, there exists n₀ ∈ ℕ such that sup_ω |Xn(ω) − X(ω)| < ε whenever n ≥ n₀. Can you make a small change in the example above to make the convergence uniform?
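    The failure of uniform convergence in example (15) can be seen numerically: the peak sup_ω Xn(ω) = n/2 grows with n even though every fixed ω is eventually sent to 0. A small sketch (the grid resolution is an arbitrary choice of mine):

    ```python
    # Tent functions from (15): pointwise convergence to 0, but the sup over
    # omega equals n/2, so there is no single n0 working for every omega.
    def X(n, w):
        if 0 <= w <= 1 / (2 * n):
            return n * n * w
        if 1 / (2 * n) < w <= 1 / n:
            return n - n * n * w
        return 0.0

    grid = [i / 10_000 for i in range(10_001)]
    sup_vals = {n: max(X(n, w) for w in grid) for n in (10, 50, 100)}

    w0 = 0.3                                 # any fixed omega: X_n(w0) is 0 eventually
    pointwise = [X(n, w0) for n in (10, 50, 100)]
    print(sup_vals, pointwise)
    ```
    
    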


    5.3. Almost-sure convergence (or almost-everywhere convergence). Now consider a slight variation of the example in (15):

    (16) Xn(ω) = { n²ω, 0 < ω ≤ 1/(2n); n − n²ω, 1/(2n) < ω ≤ 1/n; 1, ω ∈ (1/n, 1] ∩ ℚ; 0, ω ∈ (1/n, 1] ∩ ℚᶜ. }

    Note that in this case Xn(ω) → 0 for ω ∈ [0, 1) ∩ ℚᶜ. Note that the Lebesgue measure (or generalized length) of the set of points for which Xn(ω) does not converge to the function 0 is zero. (The latter is because the measure of the set of rational numbers is zero. To see that, let r₁, r₂, ... be an enumeration of the rational numbers. Pick any ε > 0, and define the intervals Jn = (rn − ε2^(−n−1), rn + ε2^(−n−1)). Now, if we sum up the lengths of all of these intervals, we obtain ε. However, since ℚ ⊂ ∪_{n=1}^∞ Jn, the length of the set ℚ cannot exceed ε. Since ε can be selected arbitrarily small, we conclude that the Lebesgue measure of ℚ must be zero.) In this case, we say that the sequence of functions Xn converges to the constant random variable 0 almost everywhere.

    Generally, we say that a sequence of random variables Xn converges to a random variable X almost surely if there exists A ∈ F, with the property P(A) = 1, such that for every ω ∈ A, Xn(ω) → X(ω). This is the strongest convergence statement that we can make in a probabilistic sense (because we don't care about the points that belong to a set that has probability zero). Note that we can write this type of convergence by saying that Xn → X almost surely (or a.s.) if P{ω : lim_{n→∞} Xn(ω) = X(ω)} = 1, or simply P{lim_{n→∞} Xn = X} = 1.

    5.4. Convergence in probability (or convergence in measure). We say that the sequence Xn converges to a random variable X in probability (also called in measure) if for every ε > 0, lim_{n→∞} P{|Xn − X| > ε} = 0. This is a weaker type of convergence than almost-sure convergence, as seen next.

    Theorem 8. Almost-sure convergence implies convergence in probability.

    Proof

    Let ε > 0 be given. For each N ≥ 1, define E_N = {|Xn − X| > ε for some n ≥ N}.


    If we define E = ∩_{N=1}^∞ E_N, we observe that ω ∈ E implies that Xn(ω) does not converge to X(ω). (This is because if ω ∈ E, then no matter how large we pick N, say N₀, there will be an n ≥ N₀ such that |Xn − X| > ε.) Hence, since we know that Xn converges to X a.s., it must be true that P(E) = 0. Now observe that since E₁ ⊃ E₂ ⊃ ..., P(E) = lim_{N→∞} P(E_N), and therefore lim_{N→∞} P(E_N) = 0. Observe that if ω ∈ {|Xn − X| > ε} and n ≥ N, then ω ∈ {|Xn − X| > ε for some n ≥ N}. Thus, for n ≥ N, {|Xn − X| > ε} ⊂ E_N and P{|Xn − X| > ε} ≤ P(E_N). Hence, lim_{n→∞} P{|Xn − X| > ε} = 0.

    This theorem is not true, however, in infinite measure spaces. For example, take Ω = ℝ, F = B, and the Lebesgue measure m. Let Xn(ω) = ω/n. Then clearly, Xn → 0 everywhere, but m({|Xn − 0| > ε}) = ∞ for any ε > 0. Where did we use the fact that P is a finite measure in the proof of Theorem 8?

    5.5. Convergence in the mean-square sense (or in L2). We say that the sequence Xn converges to a random variable X in the mean-square sense (also in L2) if lim_{n→∞} E[|Xn − X|²] = 0. Recall that this is precisely the type of convergence we defined in the Hilbert-space context, which we used to introduce the conditional expectation. This is a stronger type of convergence than convergence in probability, as seen next.

    Theorem 9. Convergence in the mean-square sense implies convergence in probability.

    Proof

    This is an easy consequence of the Markov inequality when taking φ(x) = x², which yields P{|Xn − X| > ε} ≤ E[|Xn − X|²]/ε².

    Convergence in probability does not imply almost-sure convergence, as seen from the following example.


    Example 16 (The marching-band function). Consider Ω = [0, 1], take P to be Lebesgue measure on [0, 1], and define Xn as follows:

    (17) Xn(ω) = I_[2^(−1)(n−1), 2^(−1)n](ω), n ∈ {1, 2};
    Xn(ω) = I_[2^(−2)(n−1−2¹), 2^(−2)(n−2¹)](ω), n ∈ {1 + 2¹, ..., 2¹ + 2²};
    ...
    Xn(ω) = I_[2^(−k)(n−1−2¹−...−2^(k−1)), 2^(−k)(n−2¹−...−2^(k−1))](ω), n ∈ {1 + 2¹ + ... + 2^(k−1), ..., 2¹ + ... + 2^(k−1) + 2^k};
    ...

    Note that for any ω ∈ [0, 1], Xn(ω) = 1 for infinitely many n's; thus, Xn does not converge to 0 anywhere. At the same time, if n ∈ {1 + 2¹ + ... + 2^(k−1), ..., 2¹ + ... + 2^(k−1) + 2^k}, then P{|Xn − 0| > ε} = 2^(−k); hence, lim_{n→∞} P{|Xn − 0| > ε} = 0 and Xn converges to 0 in probability.
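    The marching-band construction can be sketched programmatically: the n-th function is the indicator of a dyadic interval whose length 2^(−k) shrinks with the row k, while any fixed ω lands in one interval of every row. The indexing helper below is my own reconstruction of (17):

    ```python
    # Marching-band indicators: the n-th function is 1 on a dyadic interval of
    # length 2^{-k}, where k is the "row" containing index n. Any fixed omega
    # is hit once per row (so infinitely often), yet the interval lengths -> 0.
    def interval(n):
        """Return (k, left, right) for the n-th marching-band indicator."""
        k, start = 1, 1
        while start + 2 ** k - 1 < n:     # advance to the row k containing n
            start += 2 ** k
            k += 1
        j = n - start                     # position within row k
        return k, j * 2.0 ** -k, (j + 1) * 2.0 ** -k

    w = 0.3
    hits = [n for n in range(1, 2 ** 10)
            if interval(n)[1] <= w <= interval(n)[2]]
    lengths = sorted({interval(n)[2] - interval(n)[1] for n in (1, 3, 7, 15)})
    print(len(hits), lengths)
    ```

    For ω = 0.3 the first 1023 indices already contain one hit in each of the first nine rows, while the interval lengths halve row by row.
    
    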

    Note that in the above example, lim_{n→∞} E[|Xn − 0|²] = 0, so Xn also converges to 0 in L2.

    Later we shall prove that any sequence of random variables that converges in probability has a subsequence that converges almost surely to the same limit.

    Recall the example given in (15) and let's modify it slightly as follows:

    (18) Xn(ω) = { n^(3/2)ω, 0 ≤ ω ≤ 1/(2n); n^(1/2) − n^(3/2)ω, 1/(2n) < ω ≤ 1/n; 0, otherwise. }

    In this case, Xn continues to converge to 0 at every point; nonetheless, E[|Xn − 0|²] = 1/12 for every n. Thus, Xn does not converge to 0 in L2.

    5.6. Convergence in distribution. A sequence Xn is said to converge to a random variable X in distribution if the distribution functions F_Xn(x) converge to the distribution function of X, F_X(x), at every point of continuity of F_X(x).

    Theorem 10. Convergence in probability implies convergence in distribution.


    Proof (Adapted from T. G. Kurtz)

    Pick ε > 0 and note that

    P{X ≤ x + ε} = P{X ≤ x + ε, Xn > x} + P{X ≤ x + ε, Xn ≤ x}

    and

    P{Xn ≤ x} = P{Xn ≤ x, X > x + ε} + P{Xn ≤ x, X ≤ x + ε}.

    Thus,

    (19) P{Xn ≤ x} − P{X ≤ x + ε} = P{Xn ≤ x, X > x + ε} − P{X ≤ x + ε, Xn > x}.

    Note that since {Xn ≤ x, X > x + ε} ⊂ {|Xn − X| > ε},

    P{Xn ≤ x} − P{X ≤ x + ε} ≤ P{|Xn − X| > ε}.

    By taking the limit superior of both sides and noting that lim_{n→∞} P{|Xn − X| > ε} = 0, we obtain lim sup_{n→∞} P{Xn ≤ x} ≤ P{X ≤ x + ε}.

    Now replace every occurrence of ε in (19) with −ε, multiply both sides by −1, and obtain

    −P{Xn ≤ x} + P{X ≤ x − ε} = −P{Xn ≤ x, X > x − ε} + P{X ≤ x − ε, Xn > x}.

    Since {X ≤ x − ε, Xn > x} ⊂ {|Xn − X| > ε}, we have

    −P{Xn ≤ x} + P{X ≤ x − ε} ≤ P{|Xn − X| > ε}.

    By taking the limit inferior of both sides, we obtain lim inf_{n→∞} P{Xn ≤ x} ≥ P{X ≤ x − ε}.

    Combining, we obtain P{X ≤ x − ε} ≤ lim inf_{n→∞} P{Xn ≤ x} ≤ lim sup_{n→∞} P{Xn ≤ x} ≤ P{X ≤ x + ε}. Thus, by letting ε → 0 we conclude that lim_{n→∞} F_Xn(x) = F_X(x) at every point of continuity of F_X.

    Theorem 11 (Bounded Convergence Theorem). Suppose that Xn converges to X in probability, and that |Xn| ≤ M a.s. for all n, for some fixed M. Then E[Xn] → E[X] as n → ∞.


    The following example shows that the boundedness requirement is not superfluous. Take Ω = [0, 1], F = B[0, 1], and m as Lebesgue measure (i.e., m(A) = ∫_A dx). Consider the functions fn(t) shown in the figure. Note that fn(t) → 0 for every t, but E[fn] = 1 for any n. Therefore, E[fn] does not converge to E[0] = 0.

    Proof of Theorem 11 (Adapted from A. Beck).

    Let En = {|Xn| > M} and note that P(En) = 0 by the boundedness assumption. Define E₀ = ∪_{n=1}^∞ En and note that P(E₀) ≤ Σ_{n=1}^∞ P(En) = P(E₁) + P(E₂) + ... = 0.

    We will first show that |X| ≤ M a.s. Consider W_{n,k} = {|Xn − X| > 1/k}, k = 1, 2, .... Since Xn converges to X in probability, P(W_{n,k}) → 0 as n → ∞. Let F_k = {|X| > M + 1/k}. Note that off E₀, F_k ⊂ W_{n,k} a.s., regardless of n, and P(F_k) ≤ P(W_{n,k}). Then P(F_k) ≤ lim_{n→∞} P(W_{n,k}) = 0, or P(F_k) = 0. Therefore, P(∪_{k=1}^∞ F_k) = 0. Since ∪_{k=1}^∞ F_k = {|X| > M}, we have |X| ≤ M a.s.

    Let ε > 0 be given. Let F₀ = ∪_{k=1}^∞ F_k and G_n = {|Xn − X| > ε}. Note that Ω can be written as the disjoint union (F₀ ∪ E₀) ∪ (G_n ∪ F₀ ∪ E₀)ᶜ ∪ (G_n \ (F₀ ∪ E₀)). Thus, |E[Xn − X]| ≤ E[|Xn − X|] = E[|Xn − X| I_{F₀∪E₀}] + E[|Xn − X| I_{G_n\(F₀∪E₀)}] + E[|Xn − X| I_{(G_n∪F₀∪E₀)ᶜ}] ≤ 0 + 2M E[I_{G_n}] + ε.

    Since P(G_n) → 0, there exists n₀ such that n ≥ n₀ implies P(G_n) ≤ ε/(2M). Hence, for n ≥ n₀, |E[Xn] − E[X]| ≤ 2M · ε/(2M) + ε = 2ε, which establishes E[Xn] → E[X] as n → ∞.


    Lemma 2 (Adapted from T. G. Kurtz). Suppose that X ≥ 0 a.s. Then lim_{M→∞} E[X ∧ M] = E[X].

    Proof

    First note that the statement is true when X is a discrete random variable. Then recall that we can always approximate a random variable X monotonically from below by a discrete random variable. More precisely, define Xn ↑ X as in (5) and note that Xn ≤ X and E[Xn] → E[X]. Now since Xn ∧ M ≤ X ∧ M ≤ X, we have

    E[Xn] = lim_{M→∞} E[Xn ∧ M] ≤ lim inf_{M→∞} E[X ∧ M] ≤ lim sup_{M→∞} E[X ∧ M] ≤ E[X].

    Take the limit as n → ∞ to obtain

    E[X] ≤ lim inf_{M→∞} E[X ∧ M] ≤ lim sup_{M→∞} E[X ∧ M] ≤ E[X],

    from which the lemma follows.

    Theorem 12 (Monotone Convergence Theorem, Adapted from T. Kurtz). Suppose that 0 ≤ Xn ↑ X and Xn converges to X in probability. Then lim_{n→∞} E[Xn] = E[X].

    Proof

    For M > 0, Xn ∧ M ≤ Xn ≤ X. Thus,

    E[Xn ∧ M] ≤ E[Xn] ≤ E[X].

    By the bounded convergence theorem, lim_{n→∞} E[Xn ∧ M] = E[X ∧ M]. Hence,

    E[X ∧ M] ≤ lim inf_{n→∞} E[Xn] ≤ lim sup_{n→∞} E[Xn] ≤ E[X].


    Now by Lemma 2, lim_{M→∞} E[X ∧ M] = E[X], and we finally obtain

    E[X] ≤ lim inf_{n→∞} E[Xn] ≤ lim sup_{n→∞} E[Xn] ≤ E[X],

    and the theorem is proven.

    Lemma 3 (Fatou's Lemma, Adapted from T. G. Kurtz). Suppose that Xn ≥ 0 a.s. and Xn converges to X in probability. Then lim inf_{n→∞} E[Xn] ≥ E[X].

    Proof

    For M > 0, Xn ∧ M ≤ Xn. Thus,

    E[X ∧ M] = lim_{n→∞} E[Xn ∧ M] ≤ lim inf_{n→∞} E[Xn],

    where the left equality is due to the bounded convergence theorem. Now by Lemma 2, lim_{M→∞} E[X ∧ M] = E[X], and we obtain E[X] ≤ lim inf_{n→∞} E[Xn].

    Theorem 13 (Dominated Convergence Theorem). Suppose that Xn → X in probability, |Xn| ≤ Y, and E[Y] < ∞. Then lim_{n→∞} E[Xn] = E[X].

    [...]

    Take Y = I{Zn − Zm > 0} − I{Zn − Zm ≤ 0}. In this case, (Zn − Zm)Y = |Zn − Zm|. Thus, we conclude that E[|Zn − Zm|] = E[(Xn − Xm)(I{Zn − Zm > 0} − I{Zn − Zm ≤ 0})].


    But the right-hand side is no greater than E[|Xn − Xm|] (why?), and we obtain E[|Zn − Zm|] ≤ E[|Xn − Xm|]. However, E[|Xn − Xm|] ≤ E[|X| I{|X| > min(m, n)}] → 0 by the dominated convergence theorem (verify this), and we conclude that lim_{m,n→∞} E[|Zn − Zm|] = 0.

    From a key theorem in analysis (e.g., see Rudin, Chapter 3), we know that any L1 space is complete. Thus, there exists Z ∈ L1(Ω, D, P) such that lim_{n→∞} E[|Zn − Z|] = 0. Further, Zn has a subsequence, Z_{n_k}, that converges almost surely to Z. We take this Z as a candidate for E[X|D]. But first, we must show that Z satisfies E[ZY] = E[XY] for any bounded, D-measurable Y. This is easy to show. Suppose that |Y| < M for some M. We already know that E[ZnY] = E[XnY], and by the dominated convergence theorem, E[XnY] → E[XY] (since |Xn| ≤ |X|). Also, |E[ZnY] − E[ZY]| ≤ E[|Zn − Z||Y|] ≤ E[|Zn − Z| M] → 0. Hence, E[ZnY] → E[ZY]. This leads to the conclusion that E[ZY] = E[XY] for any bounded, D-measurable Y.

    In summary, we have found a Z ∈ L1(Ω, D, P) such that E[ZY] = E[XY] for any bounded, D-measurable Y. We define this Z as E[X|D].

    5.8. Central Limit Theorems. To establish this theorem, we need the concept of a characteristic function. Let Y be a r.v. We define the characteristic function of Y, φ_Y(u), as E[e^{iuY}] = E[cos(uY)] + iE[sin(uY)], which always exists since cos(uY) and sin(uY) are bounded random variables.


    Lemma 4. If φ_X(u) is the characteristic function of X, then

    lim_{c→∞} (1/2π) ∫_{−c}^{c} ((e^{−iua} − e^{−iub})/(iu)) φ_X(u) du = P{a < X < b} + (P{X = a} + P{X = b})/2.

    See Chow and Teicher for the proof.

    As a special case, if X is an absolutely continuous random variable with pdf f_X, then (1/2π) ∫_{−∞}^{∞} e^{−iux} φ_X(u) du = f_X(x).

    Theorem 14 (Central limit theorem). Suppose {X_k} is a sequence of i.i.d. random variables with E[X_k] = μ and Var X_k = σ². Let S_n = Σ_{k=1}^n X_k and Y_n = (S_n − nμ)/(σ√n). Then Y_n converges to Z in distribution, where Z ~ N(0, 1) (a zero-mean Gaussian random variable with unit variance).

    Sketch of the proof

    Observe that

    φ_Yn(ν) = E[e^{jνYn}]
    = E[e^{jν Σ_{k=1}^n (X_k − μ)/(σ√n)}]
    = E[Π_{k=1}^n e^{j(ν/(σ√n))(X_k − μ)}]
    = Π_{k=1}^n E[e^{j(ν/(σ√n))(X_k − μ)}]
    = Π_{k=1}^n φ_{X_k − μ}(ν/(σ√n))
    = (φ_{X_1 − μ}(ν/(σ√n)))^n.

    We now expand φ_{X_k − μ}(ν) as

    φ_{X_k − μ}(ν) = E[e^{jν(X_k − μ)}]
    = E[1 + jν(X_k − μ) + ((jν)²/2!)(X_k − μ)² + ((jν)³/3!)(X_k − μ)³ + ...]
    = 1 + jν E[X_k − μ] − (ν²/2) E[(X_k − μ)²] + ((jν)³/3!) E[(X_k − μ)³] + ...
    = 1 − (ν²/2) E[(X_k − μ)²] + ((jν)³/3!) E[(X_k − μ)³] + ....


    Then we have

    φ_{X_k − μ}(ν/(σ√n)) = 1 − ((ν/(σ√n))²/2) E[(X_k − μ)²] + ((j(ν/(σ√n)))³/3!) E[(X_k − μ)³] + ...
    = 1 − (ν²/(2nσ²)) E[(X_k − μ)²] − (jν³/(3! n^(3/2) σ³)) E[(X_k − μ)³] + ...
    ≈ 1 − (ν²/(2nσ²)) E[(X_k − μ)²]
    = 1 − ν²/(2n).

    Then φ_Yn(ν) ≈ (1 − ν²/(2n))^n and lim_{n→∞} φ_Yn(ν) = lim_{n→∞} (1 − ν²/(2n))^n = e^{−ν²/2}. On the other hand, the characteristic function of Z is φ_Z(ν) = E[e^{jνZ}]. Then

    φ_Z(ν) = E[e^{jνZ}]
    = ∫_{−∞}^{∞} e^{jνz} f_Z(z) dz
    = ∫_{−∞}^{∞} e^{jνz} (1/√(2π)) e^{−z²/2} dz
    = (1/√(2π)) ∫_{−∞}^{∞} e^{−(z² − 2jνz)/2} dz
    = (1/√(2π)) ∫_{−∞}^{∞} e^{−((z − jν)² + ν²)/2} dz
    = (1/√(2π)) e^{−ν²/2} ∫_{−∞}^{∞} e^{−(z − jν)²/2} dz
    = e^{−ν²/2}.

    Therefore, lim_{n→∞} φ_Yn(ν) = φ_Z(ν), and F_Yn → F_Z as n → ∞ by the inversion lemma.
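    A Monte Carlo sketch of the theorem (the summand distribution, n, and the evaluation point below are my own choices):

    ```python
    import math
    import random

    # Monte Carlo sketch of the CLT: the standardized sum of n i.i.d.
    # Uniform(0,1) variables is approximately N(0,1) in distribution.
    random.seed(6)
    n, trials = 48, 20_000
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)

    def Yn():
        s = sum(random.random() for _ in range(n))
        return (s - n * mu) / (sigma * math.sqrt(n))

    samples = [Yn() for _ in range(trials)]

    def Phi(x):                       # standard normal CDF via math.erf
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    emp = sum(1 for y in samples if y <= 1.0) / trials
    print(emp, Phi(1.0))
    ```

    The empirical CDF of Y_n at the point 1 is close to Φ(1) ≈ 0.8413.
    
    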

    5.10. Central limit theorem for non-identical random variables: Lindeberg's Theorem. We first define what is called a triangular array. For each n ≥ 1, the sequence X_{n1}, ..., X_{n r_n} is a sequence of zero-mean independent random variables, where the r_n are arbitrarily specified integers. An example of r_n is n, which is where the term triangular comes from. Let S_n = X_{n1} + ... + X_{n r_n} and s_n² = Σ_{k=1}^{r_n} E[X_{nk}²]. Suppose further that for each n the sequence X_{n1}, ..., X_{n r_n} satisfies the Lindeberg condition


    stated below:

    (20) lim_{n→∞} Σ_{k=1}^{r_n} (1/s_n²) E[X_{nk}² I{|X_{nk}| ≥ ε s_n}] = 0, for any ε > 0.

    Then S_n/s_n converges in distribution to a unit-variance, zero-mean normal random variable.

    For a proof, see the text by P. Billingsley listed in the bibliography.

    Remark: It can be shown (e.g., with the help of the Bounded Convergence Theorem) that the Lindeberg condition is automatically satisfied if the sequence is identically distributed. See HW-7.
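    A sketch of Lindeberg's theorem on a concrete triangular array of my own choosing: bounded uniform summands, for which condition (20) holds since max_k |X_{nk}|/s_n → 0:

    ```python
    import math
    import random

    # Monte Carlo sketch of Lindeberg's theorem for a triangular array of
    # independent, zero-mean, non-identical X_{nk} ~ Uniform(-c_k, c_k); the
    # c_k are bounded, so the Lindeberg condition holds and S_n/s_n ~ N(0,1).
    random.seed(7)
    n, trials = 60, 20_000
    c = [1.0 + (k % 3) for k in range(n)]            # spreads 1, 2, 3, 1, 2, ...
    s_n = math.sqrt(sum(ck * ck / 3.0 for ck in c))  # Var Uniform(-c,c) = c^2/3

    def Phi(x):                                      # standard normal CDF
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    emp = 0
    for _ in range(trials):
        S = sum(random.uniform(-ck, ck) for ck in c)
        if S / s_n <= 0.5:
            emp += 1
    emp /= trials
    print(emp, Phi(0.5))
    ```
    
    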

    5.11. Strong Law of Large Numbers. Suppose that X₁, X₂, ... are i.i.d. random variables with E[X₁] = μ, and let S_n = n^(−1) Σ_{i=1}^n Xᵢ. We would like to show that S_n → μ almost surely. Note that if an outcome ω_o belongs to {|S_n − μ| > ε for infinitely many n}, where ε > 0, then we would know that S_n(ω_o) cannot converge to μ.

    More generally, we define the event {A_n occurs infinitely often}, or {A_n i.o.} for short, as

    {A_n i.o.} = ∩_{n=1}^∞ ∪_{k=n}^∞ A_k.

    The above event is also referred to as the limit superior of the sequence A_n; that is, {A_n i.o.} = lim sup_{n→∞} A_n. The terminology of limit superior makes sense since


    the sets $\bigcup_{k=n}^{\infty} A_k$ form a decreasing sequence in $n$, and the intersection yields the infimum of this decreasing sequence.

    Before we prove a key result on the probability of events such as $\{A_n \text{ i.o.}\}$, we will give an example showing their use.

    give an example showing their use.

    Let $Y_n$ be any sequence of random variables, and let $Y$ be a random variable (possibly a candidate limit of the sequence, provided that the sequence is convergent). Consider the event $\{\lim_{n} Y_n = Y\}$, the collection of all outcomes $\omega$ for which $Y_n(\omega) \to Y(\omega)$. It is easy to see that

    (21) $\{\lim_{n} Y_n = Y\}^c = \bigcup_{\epsilon > 0,\ \epsilon \in \mathbb{Q}} \{|Y_n - Y| > \epsilon \text{ i.o.}\}.$

    This is because if $\omega_o \in \{|Y_n - Y| > \epsilon \text{ i.o.}\}$ for some $\epsilon > 0$, then we are guaranteed that for any $n \ge 1$, $|Y_k - Y| > \epsilon$ for infinitely many $k \ge n$; thus, $Y_n(\omega_o)$ does not converge to $Y(\omega_o)$.

    Lemma 5 (Borel-Cantelli Lemma). Let $A_1, A_2, \ldots$ be any sequence of events. If $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P\{A_n \text{ i.o.}\} = 0$.

    To apply this to the strong law, assume (without loss of generality) that $\mu = 0$, and suppose that $E[X_1^4] < \infty$. In view of (21),

    (22) $\{\lim_{n} S_n = 0\}^c = \bigcup_{\epsilon > 0,\ \epsilon \in \mathbb{Q}} \{|S_n| > \epsilon \text{ i.o.}\},$

    so it suffices to show that $P\{|S_n| > \epsilon \text{ i.o.}\} = 0$ for every $\epsilon > 0$. By the Markov inequality (with $\varphi(x) = x^4$),

    $P\{|S_n| > \epsilon\} \le \frac{E[S_n^4]}{\epsilon^4} = \frac{\sum_{i=1}^n \sum_{j=1}^n \sum_{\ell=1}^n \sum_{m=1}^n E[X_i X_j X_\ell X_m]}{n^4 \epsilon^4}.$


    Considering the term in the numerator, we observe that all the terms in the summation that are of the form $E[X_i^3]E[X_j]$, $E[X_i^2]E[X_j]E[X_\ell]$, and $E[X_i]E[X_j]E[X_\ell]E[X_m]$ (with $i, j, \ell, m$ distinct) are zero. Hence,

    $\sum_{i=1}^n \sum_{j=1}^n \sum_{\ell=1}^n \sum_{m=1}^n E[X_i X_j X_\ell X_m] = n\,E[X_1^4] + 3n(n-1)\,E[X_1^2]E[X_1^2],$

    and therefore

    $P\{|S_n| > \epsilon\} \le \frac{n\,E[X_1^4] + 3n(n-1)\,(E[X_1^2])^2}{n^4 \epsilon^4}.$

    Hence, we conclude that $\sum_{n=1}^{\infty} P\{|S_n| > \epsilon\} < \infty$, and by the Borel-Cantelli Lemma, $P\{|S_n| > \epsilon \text{ i.o.}\} = 0$.

    Hence, by (22),

    $P\left(\{\lim_{n} S_n = 0\}^c\right) = 0,$

    and the theorem is proved.
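    The almost-sure convergence proved above can be visualized numerically; the sketch below (zero-mean uniform variables, with sample sizes, tolerance, and seed chosen by us for illustration) tracks the running mean along many independent sample paths:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # i.i.d. zero-mean variables with finite fourth moment: Uniform(-1, 1) (our choice)
    x = rng.uniform(-1, 1, size=(200, 10_000))               # 200 independent sample paths
    running_mean = np.cumsum(x, axis=1) / np.arange(1, 10_001)

    # Along (essentially) every path, the running mean S_n is near 0 for large n
    final = running_mean[:, -1]
    assert np.all(np.abs(final) < 0.05)
    ```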

    Actually, the general version of the strong law of large numbers does not require any assumptions on the second and fourth moments. We will not prove this version here, but it basically states that if $E[X_1]$ exists (i.e., $E|X_1| < \infty$), then $n^{-1}\sum_{i=1}^n X_i \to E[X_1]$ almost surely.


    where the last equality follows from the law of large numbers. Hence, it follows that

    $\lim_{n} n^{-1}\sum_{i=1}^n X_i \ge \lim_{M\to\infty} E[X_1 \wedge M] = E[X_1] = \infty,$

    where the last limit follows from Lemma 2.

    The next theorem uses the Borel-Cantelli Lemma to characterize convergence in probability.

    Theorem 16 (From Billingsley). A necessary and sufficient condition for $X_n \to X$ in probability is that each subsequence $\{X_{n_k}\}$ contains a further subsequence $\{X_{n_{k(i)}}\}$ such that $X_{n_{k(i)}} \to X$ a.s. as $i \to \infty$.

    Proof. If $X_n \to X$ in probability, then given $\{n_k\}$, choose a subsequence $\{n_{k(i)}\}$ so that $k \ge k(i)$ implies $P\{|X_{n_k} - X| \ge i^{-1}\} < 2^{-i}$; therefore,

    $\sum_{i=1}^{\infty} P\{|X_{n_{k(i)}} - X| \ge i^{-1}\} < \infty,$

    and by the Borel-Cantelli Lemma, $X_{n_{k(i)}} \to X$ a.s. as $i \to \infty$. Conversely, if $X_n$ does not converge to $X$ in probability, there exist some $\epsilon > 0$ and a subsequence $\{n_k\}$ for which $P\{|X_{n_k} - X| \ge \epsilon\} > \epsilon$ for all $k$. No subsequence of $\{X_{n_k}\}$ can converge to $X$ in probability; hence, none can converge to $X$ almost surely.

    Corollary 2. If $g: \mathbb{R} \to \mathbb{R}$ is continuous and $X_n \to X$ in probability, then $g(X_n) \to g(X)$ in probability.

    Proof: Left as an exercise.
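    A quick simulation consistent with the corollary. Here $X_n = X + Z_n/\sqrt{n}$ is a construction of ours that converges to $X$ in probability, and $g = \cos$ is an arbitrary continuous function:

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    g = np.cos                          # a continuous function

    def prob_deviation(n, eps=0.1, n_mc=50_000):
        x = rng.standard_normal(n_mc)                       # samples of the limit X
        xn = x + rng.standard_normal(n_mc) / np.sqrt(n)     # X_n -> X in probability
        return (np.mean(np.abs(xn - x) > eps),
                np.mean(np.abs(g(xn) - g(x)) > eps))

    p_small, q_small = prob_deviation(10)
    p_big, q_big = prob_deviation(10_000)
    # Both deviation probabilities shrink as n grows, for X_n and g(X_n) alike
    assert p_big < p_small and q_big <= q_small
    ```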


    6. Markov Chains

    The majority of the material in this section is extracted from the book by Billingsley (Probability and Measure) and the book by Hoel et al. (Introduction to Stochastic Processes).

    Up to this point we have seen many examples and properties (e.g., convergence)

    of iid sequences. However, many sequences have correlation among their members.

    Markov chains (MCs) are a class of sequences that exhibit a simple type of correlation

    in the sequence of random variables. In words, as far as calculating probabilities

    pertaining future values of the sequence conditional on the present and the past, it

    is sufficient to know the present. In other words, as far as the future is concerned,

    knowledge of the present contains all the knowledge of the past. We next define

    discrete-space MCs formally.

    Consider a sequence of discrete random variables $X_0, X_1, X_2, \ldots$ defined on some probability space $(\Omega, \mathcal{F}, P)$. Let $S$ be a finite or countable set representing the set of all possible values that each $X_n$ may assume. Without loss of generality, assume that $S = \{0, 1, 2, \ldots\}$. The sequence is a Markov chain if

    $P\{X_{n+1} = j \mid X_0 = i_0, \ldots, X_n = i_n\} = P\{X_{n+1} = j \mid X_n = i_n\}.$

    For each pair $i$ and $j$ in $S$, we define $p_{ij} = P\{X_{n+1} = j \mid X_n = i\}$. Note that we have implicitly assumed (for now) that $P\{X_{n+1} = j \mid X_n = i\}$ does not depend upon $n$, in which case the chain is called homogeneous. Also note that the definition of $p_{ij}$ implies that $\sum_{j \in S} p_{ij} = 1$ for every $i \in S$. The numbers $p_{ij}$ are called the transition probabilities of the MC. The initial probabilities are defined as $\pi_i = P\{X_0 = i\}$. Note that the only restriction on the $\pi_i$'s is that they are nonnegative and they must add up to 1. We denote the matrix whose $(i,j)$th entry is $p_{ij}$ by $IP$, which is termed the transition matrix of the MC. Note that if $S$ is infinite, then $IP$ is a matrix with infinitely many rows and columns. The transition matrix $IP$ is called a stochastic matrix, as each row sums to one and all its elements are nonnegative. Figure 6 shows a state diagram illustrating the transition probabilities for a two-state MC.


    Figure 6. State diagram of a two-state MC, with states 0 and 1 and transition probabilities $p_{11}$, $p_{12}$, $p_{21}$, $p_{22}$ labeling the edges.
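    The definition can be exercised with a short simulation of a homogeneous two-state chain. The transition probabilities below are illustrative numbers of ours, not read off Figure 6:

    ```python
    import numpy as np

    rng = np.random.default_rng(4)

    # Transition matrix of a two-state chain; rows sum to one
    P = np.array([[0.9, 0.1],    # p00, p01
                  [0.3, 0.7]])   # p10, p11

    def simulate(P, x0, n_steps):
        x, path = x0, [x0]
        for _ in range(n_steps):
            x = rng.choice(len(P), p=P[x])   # next state drawn from row x of P
            path.append(x)
        return np.array(path)

    path = simulate(P, 0, 50_000)
    # Long-run fraction of time in state 1; for this chain the stationary
    # value is p01/(p01 + p10) = 0.1/0.4 = 0.25
    assert abs(np.mean(path == 1) - 0.25) < 0.02
    ```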

    6.1. Higher-Order Transitions. Consider $P\{X_0 = i_0, X_1 = i_1, X_2 = i_2\}$. By applying Bayes' rule repeatedly (i.e., $P(A \cap B \cap C) = P(A \mid B \cap C)\,P(B \mid C)\,P(C)$), we obtain

    $P\{X_0 = i_0, X_1 = i_1, X_2 = i_2\} = P\{X_0 = i_0\}\,P\{X_1 = i_1 \mid X_0 = i_0\}\,P\{X_2 = i_2 \mid X_0 = i_0, X_1 = i_1\} = \pi_{i_0}\, p_{i_0 i_1}\, p_{i_1 i_2}.$

    More generally, it is easy to verify that

    $P\{X_0 = i_0, X_1 = i_1, \ldots, X_m = i_m\} = \pi_{i_0}\, p_{i_0 i_1} \cdots p_{i_{m-1} i_m}$

    for any sequence $i_0, i_1, \ldots, i_m$ of states.

    Further, it is also easy to see that

    $P\{X_3 = i_3, X_2 = i_2 \mid X_1 = i_1, X_0 = i_0\} = p_{i_1 i_2}\, p_{i_2 i_3}.$

    More generally,

    (23) $P\{X_{m+\ell} = j_\ell,\ 1 \le \ell \le n \mid X_s = i_s,\ 0 \le s \le m\} = p_{i_m j_1}\, p_{j_1 j_2} \cdots p_{j_{n-1} j_n}.$

    With these preliminaries, we can define the $n$-step transition probability as

    $p^{(n)}_{ij} = P\{X_{m+n} = j \mid X_m = i\}.$

    If we now observe that

    $\{X_{m+n} = j\} = \bigcup_{k_1, \ldots, k_{n-1}} \{X_{m+n} = j,\ X_{m+1} = k_1, \ldots, X_{m+n-1} = k_{n-1}\},$

    we conclude

    $p^{(n)}_{ij} = \sum_{k_1, \ldots, k_{n-1}} p_{i k_1}\, p_{k_1 k_2} \cdots p_{k_{n-1} j}.$


    From matrix theory, we recognize $p^{(n)}_{ij}$ as the $(i,j)$th entry of $IP^n$, the $n$th power of the transition matrix $IP$. It is convenient to use the convention $p^{(0)}_{ij} = \delta_{ij}$, consistent with the fact that $IP^0$ is the identity matrix $I$.

    Finally, it is straightforward to verify that

    $p^{(m+n)}_{ij} = \sum_{v \in S} p^{(m)}_{iv}\, p^{(n)}_{vj},$

    and that $\sum_{j \in S} p^{(n)}_{ij} = 1$. This means $IP^n = IP^{\ell}\, IP^{k}$ whenever $\ell + k = n$, which is consistent with taking powers of a matrix.

    For convenience, we now define the conditional probabilities $P_i(A) = P\{A \mid X_0 = i\}$, $A \in \mathcal{F}$. With this notation and as an immediate consequence of (23), we have

    $P_i\{X_1 = i_1, \ldots, X_m = j,\ X_{m+1} = j_{m+1}, \ldots, X_{m+n} = j_{m+n}\} = P_i\{X_1 = i_1, \ldots, X_m = j\}\; P_j\{X_1 = j_{m+1}, \ldots, X_n = j_{m+n}\}.$
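    The Chapman-Kolmogorov relation $p^{(m+n)}_{ij} = \sum_v p^{(m)}_{iv} p^{(n)}_{vj}$ and the row-sum property can be checked numerically for any stochastic matrix; the matrix below is an illustrative choice of ours:

    ```python
    import numpy as np

    P = np.array([[0.9, 0.1],
                  [0.3, 0.7]])              # any stochastic matrix works here

    P5 = np.linalg.matrix_power(P, 5)
    P3 = np.linalg.matrix_power(P, 3)
    P2 = np.linalg.matrix_power(P, 2)

    assert np.allclose(P5, P3 @ P2)             # IP^(3+2) = IP^3 IP^2
    assert np.allclose(P5.sum(axis=1), 1.0)     # each row of IP^n still sums to 1
    ```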

    Example 19. A gambling problem (see related homework problem).

    The initial fortune of a gambler is $X_0 = x_o$. After playing each hand, he/she increases (decreases) the fortune by one dollar with probability $p$ ($q$). The gambler quits if either her/his fortune becomes 0 (bankruptcy) or upon reaching a goal of $L$ dollars. Let $X_n$ denote the fortune at time $n$. Note that

    (24) $P\{X_n = j \mid X_1 = i_1, \ldots, X_{n-1} = i_{n-1}\} = P\{X_n = j \mid X_{n-1} = i_{n-1}\}.$

    Hence, the sequence $X_n$ is a Markov chain with state space $S = \{0, 1, \ldots, L\}$. Also, the time-independent transition probabilities are given by

    (25) $p_{ij} = P\{X_{n+1} = j \mid X_n = i\} = \begin{cases} p, & \text{if } j = i+1,\ 1 \le i \le L-1 \\ q, & \text{if } j = i-1,\ 1 \le i \le L-1 \\ 1, & \text{if } i = j = L \text{ or } i = j = 0 \\ 0, & \text{otherwise.} \end{cases}$


    The $(L+1) \times (L+1)$ probability transition matrix $IP = ((p_{ij}))$ is therefore

    (26) $IP = \begin{pmatrix} 1 & 0 & 0 & \cdots & \cdots & \cdots & 0 \\ q & 0 & p & 0 & \cdots & \cdots & \vdots \\ 0 & q & 0 & p & 0 & \cdots & \vdots \\ \vdots & & \ddots & \ddots & \ddots & & \vdots \\ 0 & 0 & \cdots & \cdots & q & 0 & p \\ 0 & 0 & \cdots & \cdots & \cdots & 0 & 1 \end{pmatrix}.$

    Note that the sum of any row of $IP$ is 1, a characteristic of a stochastic matrix.
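    The matrix in (26) is easy to build programmatically; the sketch below (the helper name is ours) constructs it for given $L$ and $p$ and checks the stochastic-matrix property:

    ```python
    import numpy as np

    def gamblers_ruin_matrix(L, p):
        """(L+1) x (L+1) transition matrix of Eq. (26); states 0 and L are absorbing."""
        q = 1 - p
        P = np.zeros((L + 1, L + 1))
        P[0, 0] = P[L, L] = 1.0
        for i in range(1, L):
            P[i, i + 1] = p      # win a dollar
            P[i, i - 1] = q      # lose a dollar
        return P

    P = gamblers_ruin_matrix(4, 0.6)
    assert np.allclose(P.sum(axis=1), 1.0)   # stochastic: every row sums to one
    ```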

    Exercise 2. Show that $\lambda = 1$ is always an eigenvalue of any stochastic matrix $IP$.

    Let $P(x)$ be the probability of achieving the goal of $L$ dollars, where $x$ is the initial fortune. In a homework problem we have shown that

    (27) $P(x) = pP(x+1) + qP(x-1), \quad x = 1, 2, \ldots, L-1,$

    with boundary conditions $P(0) = 0$ and $P(L) = 1$. Similarly, define $Q(x)$ as the probability of going bankrupt. Then,

    (28) $Q(x) = pQ(x+1) + qQ(x-1), \quad x = 1, 2, \ldots, L-1,$

    with boundary conditions $Q(0) = 1$ and $Q(L) = 0$.

    For example, if $p = q = \frac{1}{2}$, then (see homework solutions)

    (29) $P(x) = \frac{x}{L},$

    (30) $Q(x) = 1 - \frac{x}{L}.$

    Thus, $\lim_{L\to\infty} P(x) = 0$. (What is the implication of this?)

    Exercise 3. Show that

    (31) $\lim_{L\to\infty} P(x) = 0$

    when $q \ge p$.


    Back to the Gambler's ruin problem: take $L = 4$, $p = 0.6$. Then,

    (32) $IP = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0.4 & 0 & 0.6 & 0 & 0 \\ 0 & 0.4 & 0 & 0.6 & 0 \\ 0 & 0 & 0.4 & 0 & 0.6 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$

    Also, straightforward calculation yields

    (33) $IP^2 = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0.4 & 0.24 & 0 & 0.36 & 0 \\ 0.16 & 0 & 0.48 & 0 & 0.36 \\ 0 & 0.16 & 0 & 0.24 & 0.6 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix},$

    and

    (34) $\lim_{n\to\infty} IP^n = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0.5846 & 0 & 0 & 0 & 0.4154 \\ 0.3077 & 0 & 0 & 0 & 0.6923 \\ 0.1231 & 0 & 0 & 0 & 0.8769 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$

    (This can be obtained, for example, using the Cayley-Hamilton Theorem.) Note that the first column is

    (35) $Q = \begin{pmatrix} Q(0) \\ \vdots \\ Q(4) \end{pmatrix},$

    and the last column is

    (36) $P = \begin{pmatrix} P(0) \\ \vdots \\ P(4) \end{pmatrix}.$


    ECE 541 63

    However,kAjk is precisely the event{Xn = j i.o.}. Hence, we arrive at the conclu-

    sion

    (37) Pi{Xn= j i.o.}= 0, iffjj


    steps and then revisit $j$ in two steps, and so on. More precisely, we write

    $p^{(n)}_{ij} = P_i\{X_n = j\} = \sum_{s=0}^{n-1} P_i\{X_1 \ne j, \ldots, X_{n-s-1} \ne j,\ X_{n-s} = j,\ X_n = j\}$
    $= \sum_{s=0}^{n-1} P_i\{X_1 \ne j, \ldots, X_{n-s-1} \ne j,\ X_{n-s} = j\}\; P_j\{X_s = j\}$
    $= \sum_{s=0}^{n-1} f^{(n-s)}_{ij}\, p^{(s)}_{jj}.$

    Putting $j = i$ and summing over $n = 1$ to $n = \infty$, we obtain

    $\sum_{n=1}^{\infty} p^{(n)}_{ii} = \sum_{n=1}^{\infty} \sum_{s=0}^{n-1} f^{(n-s)}_{ii}\, p^{(s)}_{ii} = \sum_{s=0}^{\infty} p^{(s)}_{ii} \sum_{n=s+1}^{\infty} f^{(n-s)}_{ii} \le f_{ii} \sum_{s=0}^{\infty} p^{(s)}_{ii}.$

    The last inequality comes from the fact that $\sum_{n=s+1}^{\infty} f^{(n-s)}_{ii} = \sum_{u=1}^{\infty} f^{(u)}_{ii} = f_{ii}$. Realizing that $p^{(0)}_{ii} = 1$, we conclude $(1 - f_{ii}) \sum_{n=1}^{\infty} p^{(n)}_{ii} \le f_{ii}$. Hence, if $f_{ii} < 1$, then $\sum_{n=1}^{\infty} p^{(n)}_{ii} \le f_{ii}/(1 - f_{ii})$, and the series $\sum_{n=1}^{\infty} p^{(n)}_{ii}$ will therefore be convergent since its partial sums are increasing and bounded.

    Exercise 4. Show that in the gambler's problem discussed earlier, states 0 and $L$ are recurrent, but states $1, \ldots, L-1$ are transient.

    6.3. Irreducible chains. A MC is said to be irreducible if for every $i, j \in S$ there exists an $n$ (generally dependent upon $i$ and $j$) such that $p^{(n)}_{ij} > 0$. In words, if the chain is irreducible then we can always go from any state to any other state in finite time with a nonzero probability. The next theorem shows that for irreducible chains, either all states are recurrent or all states are transient; there is no other possibility. Thus, we can say that an irreducible MC is either recurrent or transient.

    Theorem 19 (Adapted from Billingsley). If the Markov chain is irreducible, then one of the following two alternatives holds.

    (i) All states are recurrent, $P_i\left(\bigcap_j \{X_n = j \text{ i.o.}\}\right) = 1$ for all $i$, and $\sum_n p^{(n)}_{ij} = \infty$ for all $i$ and $j$.


    (ii) All states are transient, $P_i\left(\bigcup_j \{X_n = j \text{ i.o.}\}\right) = 0$ for all $i$, and $\sum_n p^{(n)}_{ij} < \infty$ for all $i$ and $j$.

    Proof. Since the chain is irreducible, given states $i$ and $j$ there exist $r$ and $s$ such that $p^{(r)}_{ij} > 0$ and $p^{(s)}_{ji} > 0$. Note that if a chain with $X_0 = i$ visits $j$ from $i$ in $r$ steps, then visits $j$ from $j$ in $n$ steps, and then visits $i$ from $j$ in $s$ steps, then it has visited $i$ from $i$ in $r + n + s$ steps. Therefore,

    $p^{(r+s+n)}_{ii} \ge p^{(r)}_{ij}\, p^{(n)}_{jj}\, p^{(s)}_{ji}.$

    Now if we sum both sides of the above inequality over $n$ and use the fact that $p^{(r)}_{ij} p^{(s)}_{ji} > 0$, we conclude that $\sum_n p^{(n)}_{ii} = \infty$ whenever $\sum_n p^{(n)}_{jj} = \infty$; that is, the states are all recurrent together or all transient together. Under the first alternative it can further be shown that $f_{ij} = 1$; now by (38), $P_i\{X_n = j \text{ i.o.}\} = 1$. The only item left to prove is that $\sum_n p^{(n)}_{ij} < \infty$ under the second


    alternative. Note that if $\sum_n p^{(n)}_{jj} < \infty$, then, by the first-passage decomposition $p^{(n)}_{ij} = \sum_{s=0}^{n-1} f^{(n-s)}_{ij} p^{(s)}_{jj}$, we also have $\sum_n p^{(n)}_{ij} \le f_{ij} \sum_s p^{(s)}_{jj} < \infty$.


    ECE 541 67

    Note that pi+ qi+ri= 1, q0= 0, and pd = 0 ifd


    Thus, (39) becomes

    $u(j) - u(j+1) = \frac{\gamma_j}{\sum_{k=a}^{b-1} \gamma_k}, \quad a \le j < b.$

    Let us now sum this equation over $j = i, \ldots, b-1$ and use the fact that $u(b) = 0$ to obtain

    $u(i) = \frac{\sum_{j=i}^{b-1} \gamma_j}{\sum_{j=a}^{b-1} \gamma_j}, \quad a < i < b.$

    It now follows from the definition of $u(i)$ that

    (40) $P_i(T_a < T_b) = \frac{\sum_{j=i}^{b-1} \gamma_j}{\sum_{j=a}^{b-1} \gamma_j}, \quad a \le i < b,$

    or

    $P_i(T_b < T_a) = \frac{\sum_{j=a}^{i-1} \gamma_j}{\sum_{j=a}^{b-1} \gamma_j}, \quad a \le i < b.$

    As a special case of (40),

    (41) $P_1(T_0 < T_n) = 1 - \frac{1}{\sum_{j=0}^{n-1} \gamma_j}, \quad n > 1.$
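    Equation (41) can be spot-checked by Monte Carlo in the symmetric case $p_j = q_j = 1/2$, where every $\gamma_j = 1$ and hence $P_1(T_0 < T_n) = 1 - 1/n$; the trial count and seed below are arbitrary:

    ```python
    import numpy as np

    rng = np.random.default_rng(5)

    def hit_zero_before(n, trials=20_000):
        """Monte Carlo estimate of P1(T0 < Tn) for the symmetric walk (p = q = 1/2)."""
        hits = 0
        for _ in range(trials):
            x = 1                                # start at state 1
            while 0 < x < n:
                x += rng.choice((-1, 1))         # symmetric +/- 1 step
            hits += (x == 0)                     # did we reach 0 before n?
        return hits / trials

    est = hit_zero_before(5)
    assert abs(est - (1 - 1/5)) < 0.015          # Eq. (41): 1 - 1/n with n = 5
    ```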

    Note that

    (42) $1 \le T_2 < T_3 < \cdots,$

    which implies that $\{T_0 < T_n\}$, $n > 1$, are nondecreasing events. Hence,

    (43) $\lim_{n\to\infty} P_1(T_0 < T_n) = P_1\left(\bigcup_{n=1}^{\infty} \{T_0 < T_n\}\right).$

    Equation (42) implies $T_n \ge n - 1$, and thus $T_n \to \infty$ as $n \to \infty$. Hence,

    $\bigcup_{n=1}^{\infty} \{T_0 < T_n\} = \{T_0 < \infty\}.$


    We now show that the birth and death chain is recurrent if and only if

    $\sum_{j=0}^{\infty} \gamma_j = \infty.$

    From our earlier discussion, we know that if the irreducible birth and death chain is recurrent, then $P_1(T_0 < \infty) = 1$.


    $\pi_{j-1}\, p_{j-1} + \pi_j\, r_j + \pi_{j+1}\, q_{j+1} = \pi_j, \quad j \ge 1.$

    Using $p_j + q_j + r_j = 1$, we obtain

    $q_1 \pi_1 - p_0 \pi_0 = 0,$
    $q_{j+1} \pi_{j+1} - p_j \pi_j = q_j \pi_j - p_{j-1} \pi_{j-1}, \quad j \ge 1.$

    By induction, we obtain

    $q_{j+1} \pi_{j+1} - p_j \pi_j = 0, \quad j \ge 0,$

    and hence

    $\pi_{j+1} = \frac{p_j}{q_{j+1}}\, \pi_j, \quad j \ge 0.$

    Consequently, we obtain

    (45) $\pi_i = \frac{p_0 \cdots p_{i-1}}{q_1 \cdots q_i}\, \pi_0, \quad i \ge 1.$

    If we define

    $\xi_i = \begin{cases} 1, & i = 0, \\ \dfrac{p_0 \cdots p_{i-1}}{q_1 \cdots q_i}, & i \ge 1, \end{cases}$

    we can recast (45) as $\pi_i = \xi_i \pi_0$, $i \ge 0$. Realizing that the $\pi_i$ must sum to one, a stationary distribution exists if and only if $\sum_i \xi_i < \infty$, in which case $\pi_0 = \left(\sum_i \xi_i\right)^{-1}$.
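    A sketch verifying the stationary-distribution formula (45) for an assumed finite birth and death chain with constant $p_j = p$ and $q_j = q$ (the state-space size and the values of $p$, $q$ are illustrative; the boundary conventions $q_0 = 0$ and $p_N = 0$ are enforced in the construction):

    ```python
    import numpy as np

    N, p, q = 20, 0.3, 0.4
    xi = np.array([(p / q) ** i for i in range(N + 1)])  # xi_i = (p0...p_{i-1})/(q1...q_i)
    pi = xi / xi.sum()                                   # pi_i = xi_i * pi_0, normalized

    # Build the full transition matrix and verify stationarity: pi P = pi
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        if i < N: P[i, i + 1] = p
        if i > 0: P[i, i - 1] = q
        P[i, i] = 1 - P[i].sum()         # r_i, with q_0 = 0 and p_N = 0 at the ends
    assert np.allclose(pi @ P, pi)
    ```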


    7. Properties of Covariance Matrices and Data Whitening

    7.1. Preliminaries. For a random vector $X = [X_1 \ldots X_n]^T$, the covariance matrix $C_X$ is the $n \times n$ matrix $C_X = E[(X - E[X])(X - E[X])^T]$. This matrix has the following properties:

    (1) $C_X$ is symmetric, which is clear from the definition.

    (2) There is a matrix $\Phi$ such that $\Phi^T C_X \Phi = \Lambda$, where $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_1, \ldots, \lambda_n$ of the matrix $C_X$ (assume that they are distinct). The corresponding eigenvectors $v_1, \ldots, v_n$ form the columns of the matrix $\Phi$: $\Phi = [v_1 | v_2 | \cdots | v_n]$. This fact comes from elementary linear algebra. Because $C_X$ is symmetric, the eigenvectors corresponding to distinct eigenvalues are orthogonal (prove this). Moreover, $\Phi$ can be selected so that the eigenvectors are orthonormal; that is, $v_i^T v_j = \delta_{i,j}$.

    (3) The fact that the $v_i$'s are orthonormal implies that they span $\mathbb{R}^n$, and we also obtain the following representation for $X$:

    $X = \sum_{i=1}^{n} c_i v_i,$

    with $c_i = X^T v_i$.

    (4) $C_X$ is positive semi-definite, which means $x^T C_X x \ge 0$ for any $x \in \mathbb{R}^n$. To see this property, note that $x^T C_X x = x^T E[(X - E[X])(X - E[X])^T] x = E[\{x^T(X - E[X])\}^2] \ge 0$. This calculation also shows that as long as no component of $X$ can be written as a linear combination of the other components, then $x^T C_X x > 0$ for every $x \ne 0$, which is the defining property of positive definite matrices. (Recall that for any random variable $Y$, $E[Y^2] = 0$ if and only if $Y = 0$ almost surely.)

    In the homework, we looked at an estimate of $C_X$ from $K$ samples of the random vector $X$ as follows:

    $\hat{C}_X = \frac{1}{K-1} \sum_{i=1}^{K} X^{(i)} X^{(i)T}$

    (assuming, without loss of generality, that $E[X] = 0$), which has rank at most $K$ (see homework problems). Therefore, if $n > K$, then $\hat{C}_X$ is not full rank and hence $\hat{C}_X$ is not positive definite.
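    Both the eigendecomposition properties and the rank deficiency of the sample covariance can be demonstrated with numpy; the matrix and the sizes $n$, $K$ below are illustrative choices of ours:

    ```python
    import numpy as np

    rng = np.random.default_rng(6)

    # A covariance matrix with distinct eigenvalues
    C = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    lam, Phi = np.linalg.eigh(C)             # eigenvalues and orthonormal eigenvectors
    assert np.allclose(Phi.T @ C @ Phi, np.diag(lam))   # Phi^T C Phi = Lambda
    assert np.all(lam >= 0)                             # positive semi-definite

    # Sample covariance from K samples of an n-dimensional vector: rank at most K,
    # hence not positive definite when n > K
    n, K = 5, 3
    X = rng.standard_normal((n, K))          # K zero-mean samples as columns
    C_hat = (X @ X.T) / (K - 1)
    assert np.linalg.matrix_rank(C_hat) <= K
    ```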


    ECE 541 73

    To begin, we may want to try A again. We know thatA whitens X. Put T =

    AZ and note that CT = ACZA

    T

    = 1/2

    T

    CZ1/2

    , which is not necessarilya diagonal matrix (show this by example). Note, however, that if W is a matrix

    consisting of the eigenvectors of CT, corresponding to the eigenvalues 1, . . . , n,

    then WTCTW = M, where M is the diagonal matrix whose diagonal elements are

    1, . . . , n.

    Consider = WTA as a candidate for the desired transformation. We need to

    check if this new transformation diogonalizes both CX and CZ. To do so, consider

    D= X and note thatCD=WTACXA

    TW = WTIW =WTW =I (because W

    is normal matrix, i.e., W1 =WT).

    Next, letF = Z and writeCF= CZT =WTACZA

    TW. But,ACZAT =CT.

    Thus, CF=WTGW= M.

    7.3.1. Summary. Let $C_X$ and $C_Z$ be given. Let $B$ be as before and let $W$ be the matrix whose $i$th column is the eigenvector of $B^{-1} C_Z (B^{-1})^T$ corresponding to the $i$th eigenvalue. Let $\Psi = W^T B^{-1}$; then $Y = \Psi X$ and $T = \Psi Z$ have the following properties:

    (1) $C_Y = I$,

    (2) $C_T = M$, where

    (48) $M = \begin{pmatrix} \mu_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \mu_n \end{pmatrix},$

    and the $\mu_i$'s are the eigenvalues of $B^{-1} C_Z (B^{-1})^T$.

    Example 20. Let

    (49) $C_X = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}; \quad C_Z = \begin{pmatrix} 3 & -1 \\ -1 & 3 \end{pmatrix}.$

    Find the transformation $\Psi$.

    We first calculate $\Phi$. Set $|C_X - \lambda_i I| = 0$. We find that $\lambda_1 = 1$ and $\lambda_2 = 3$. Next, we find $v_1$ and $v_2$ using $C_X v_1 = \lambda_1 v_1$ and $C_X v_2 = \lambda_2 v_2$. Normalize these so


    that $\|v_1\| = \|v_2\| = 1$ and obtain

    (50) $v_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}; \quad v_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$

    We now form

    (51) $B^{-1} = \Lambda^{-1/2} \Phi^T = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \end{pmatrix}.$

    Next,

    (52) $B^{-1} C_Z (B^{-1})^T = \frac{1}{2} \begin{pmatrix} 8 & 0 \\ 0 & \frac{4}{3} \end{pmatrix} = \begin{pmatrix} 4 & 0 \\ 0 & \frac{2}{3} \end{pmatrix},$

    and its eigenvalues are $\mu_1 = 4$ and $\mu_2 = \frac{2}{3}$, with corresponding eigenvectors $[1, 0]^T$ and $[0, 1]^T$. Hence, $W = I$ and $\Psi = W^T B^{-1} = B^{-1}$.
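    Example 20 can be verified numerically. One caveat of the sketch below: numpy's `eigh` orders eigenvalues in ascending order, so the diagonal of $M$ may appear permuted relative to the hand calculation:

    ```python
    import numpy as np

    CX = np.array([[2.0, 1.0], [1.0, 2.0]])
    CZ = np.array([[3.0, -1.0], [-1.0, 3.0]])

    # Whitening matrix B^{-1} = Lambda^{-1/2} Phi^T from the eigen-pairs of CX
    lam, Phi = np.linalg.eigh(CX)                 # lam = [1, 3]
    Binv = np.diag(lam ** -0.5) @ Phi.T

    T = Binv @ CZ @ Binv.T                        # equals diag(4, 2/3) up to ordering
    mu, W = np.linalg.eigh(T)
    Psi = W.T @ Binv                              # the joint transformation

    assert np.allclose(Psi @ CX @ Psi.T, np.eye(2))      # C_Y = I
    assert np.allclose(Psi @ CZ @ Psi.T, np.diag(mu))    # C_T = M
    assert np.allclose(sorted(mu), [2/3, 4.0])           # eigenvalues 4 and 2/3
    ```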


    8. Continuous-time Wiener Filtering

    Suppose that we have the observation model

    $X(t) = S(t) + N(t),$

    where $S(t)$ and $N(t)$ are jointly wide-sense stationary random processes. We desire to find a physically realizable (causal) filter with impulse response $h(t)$ that minimizes the mean-square error (MSE)

    $E[\epsilon^2] = E[(S(t+\alpha) - \hat{S}(t))^2],$

    where $\hat{S}(t) = h(t) * X(t)$ is the filter output. If $\alpha = 0$, this problem is called the Wiener filtering problem; if $\alpha > 0$, it is called the Wiener prediction problem; and if $\alpha < 0$, it is called the Wiener smoothing problem.


    filter $h_r = h_0 + r g$, where $h_0$ is the optimal causal filter, $g$ is an arbitrary perturbation, and $r \in \mathbb{R}$. Thus, we can pursue this differentiation approach and write

    $E[\epsilon^2(r)] = R_{SS}(0) - 2\int_0^{\infty} [h_0(\tau) + r g(\tau)]\, R_{XS}(\tau + \alpha)\, d\tau + \int_0^{\infty}\!\!\int_0^{\infty} [h_0(\tau) + r g(\tau)][h_0(\sigma) + r g(\sigma)]\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma$

    $= R_{SS}(0) - 2\int_0^{\infty} h_0(\tau) R_{XS}(\tau + \alpha)\, d\tau - 2r\int_0^{\infty} g(\tau) R_{XS}(\tau + \alpha)\, d\tau$
    $\quad + \int_0^{\infty}\!\!\int_0^{\infty} r[g(\tau) h_0(\sigma) + g(\sigma) h_0(\tau)]\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma$
    $\quad + \int_0^{\infty}\!\!\int_0^{\infty} h_0(\tau) h_0(\sigma)\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma + \int_0^{\infty}\!\!\int_0^{\infty} r^2 g(\tau) g(\sigma)\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma.$

    Note that $R_{SS}(0)$ does not depend on $r$. Now take the partial derivative with respect to $r$ to obtain

    $\frac{\partial}{\partial r} E[\epsilon^2(r)] = -2\int_0^{\infty} R_{XS}(\tau + \alpha) g(\tau)\, d\tau + \int_0^{\infty}\!\!\int_0^{\infty} [g(\tau) h_0(\sigma) + g(\sigma) h_0(\tau)]\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma + 2r \int_0^{\infty}\!\!\int_0^{\infty} g(\tau) g(\sigma)\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma.$

    Setting $\frac{\partial}{\partial r} E[\epsilon^2(r)]\big|_{r=0} = 0$, we obtain

    $-2\int_0^{\infty} R_{XS}(\tau + \alpha) g(\tau)\, d\tau + \int_0^{\infty}\!\!\int_0^{\infty} [g(\tau) h_0(\sigma) + g(\sigma) h_0(\tau)]\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma = 0.$

    Using the fact that $R_{XX}(\cdot)$ is symmetric, i.e.,

    $\int_0^{\infty}\!\!\int_0^{\infty} g(\tau) h_0(\sigma)\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma = \int_0^{\infty}\!\!\int_0^{\infty} g(\sigma) h_0(\tau)\, R_{XX}(\tau - \sigma)\, d\tau\, d\sigma,$

    we get

    $\int_0^{\infty} g(\tau) \left[ -R_{XS}(\tau + \alpha) + \int_0^{\infty} h_0(\sigma)\, R_{XX}(\tau - \sigma)\, d\sigma \right] d\tau = 0.$

    Since this is true for all $g$, the only way this can hold is that the multiplier of $g$ inside the outer integral must be zero for almost all $\tau \ge 0$ (with respect to Lebesgue measure).

    Thus, we get the so-called Wiener Integral Equation:

    $R_{XS}(\tau + \alpha) = \int_0^{\infty} R_{XX}(\tau - \sigma)\, h_0(\sigma)\, d\sigma, \quad \tau \ge 0.$


    8.1. Non-causal Wiener Filtering. If we replace $\int_0^{\infty}$ with $\int_{-\infty}^{\infty}$ in Wiener's equation, i.e.,

    $R_{XS}(\tau + \alpha) = \int_{-\infty}^{\infty} R_{XX}(\tau - \sigma)\, h_0(\sigma)\, d\sigma, \quad \tau \in \mathbb{R},$

    we obtain the characterization of the optimal non-causal (non-realizable) filter. In the frequency domain, we have

    $S_{XS}(\omega)\, e^{j\omega\alpha} = S_{XX}(\omega)\, H_0(\omega),$

    and hence

    $H_0(\omega) = \frac{S_{XS}(\omega)}{S_{XX}(\omega)}\, e^{j\omega\alpha},$

    which is the optimum non-realizable filter.

    Suppose the signal and noise are uncorrelated (with the noise zero mean, so the cross terms vanish); then

    $R_{XX}(\tau) = E[X(t)X(t+\tau)] = E[(S(t) + N(t))(S(t+\tau) + N(t+\tau))] = R_{SS}(\tau) + R_{NN}(\tau).$

    Similarly, $R_{XS}(\tau) = R_{SS}(\tau)$. Then,

    $H_0(\omega) = \frac{S_{SS}(\omega)}{S_{SS}(\omega) + S_{NN}(\omega)}\, e^{j\omega\alpha}.$

    Example: Suppose that $N(t) \equiv 0$ and $\alpha > 0$ (the ideal prediction problem). Then $H_0(\omega) = e^{j\omega\alpha} \cdot [1]$, or $h_0(t) = \delta(t + \alpha)$, which is the trivial non-realizable predictor.

    Before we attempt to solve the Wiener integral equation for the realizable optimal filter, we will review germane aspects of spectral factorization.

    8.2. Review of Spectral Factorization. Note that we can approximate any power spectrum as that of white noise passed through a linear time-invariant (LTI) filter (why?). Suppose the transfer function of the filter is

    $H(\omega) = \frac{(j\omega)^n + a_{n-1}(j\omega)^{n-1} + \cdots + a_1 (j\omega) + a_0}{(j\omega)^d + b_{d-1}(j\omega)^{d-1} + \cdots + b_1 (j\omega) + b_0}.$

    If the input is white noise with power spectrum $\frac{N_0}{2}$, the output power spectrum is $\frac{N_0}{2}|H(\omega)|^2$. Note that we can write

    $H(\omega) = \frac{A(\omega^2) + j\omega B(\omega^2)}{C(\omega^2) + j\omega D(\omega^2)},$
