Joint Probability Distributions · Bayes’ Rule and Applications · Conditional Expectations, Variances Etc.
Risk Assessment and Management: Module 2
Introduction to Probability and Statistics
Lecture 2: Multiple Random Variables
M. Vidyasagar
Cecil & Ida Green Professor
The University of Texas at Dallas
Email: [email protected]
August 27, 2010
M. Vidyasagar Multiple Random Variables
Outline
1 Joint Probability Distributions
Joint and Marginal Probability Distributions
Independence, Conditional Probability Distributions
Covariance and Correlation Coefficients
2 Bayes’ Rule and Applications
A Motivating Example
Bayes’ Rule
3 Conditional Expectations, Variances Etc.
Conditional Expected Value: Definition
Conditional Expected Value: Example
Conditioning on an Event, Independent Events
Joint Probability Distributions: Motivating Examples
Joint probability distributions arise naturally when one conducts multiple experiments with random outcomes.
Example 1: An urn has 7 white balls and 3 black balls. We draw two balls in succession, replacing the first ball drawn before drawing a second time. What is the probability of drawing a white ball followed by a black ball? What is the probability of drawing one white ball and one black ball (in either order)?

Example 2: An urn has 7 white balls and 3 black balls. We draw two balls in succession, without replacing the first ball drawn before drawing a second time. What is the probability of drawing a white ball followed by a black ball? What is the probability of drawing one white ball and one black ball (in either order)?
Cartesian Products of Sets
Suppose we have two random variables X and Y, which may or may not influence each other. X takes values in A = {x1, . . . , xn} while Y takes values in B = {y1, . . . , ym}. Note that the two sets A and B could be different.

The joint random variable (X,Y) takes values in the so-called (Cartesian) product set A × B, which consists of all pairs of the form (xi, yj). Thus
A× B := {(xi, yj) : xi ∈ A, yj ∈ B}.
Note that A × B has nm elements.

The product set A × B is the sample space for the joint random variable (X,Y). The event space, as before, consists of all possible subsets of A × B.
Joint Distributions of Joint Random Variables
Suppose (X,Y) takes values in A × B. Its (joint) probability distribution is a vector φ with nm components, where
φij = Pr{(X,Y ) = (xi, yj)} = Pr{X = xi&Y = yj}.
As always, φij ≥ 0 for all i, j, and

∑_{i=1}^n ∑_{j=1}^m φij = 1.
What is the difference between the ‘joint’ r.v. (X,Y) and a plain old r.v. Z taking values in a set of cardinality nm?

So long as we always talk about X and Y together – nothing! But there are special notions for joint r.v.’s, such as independence, marginal and conditional distributions, etc.
Marginal Distributions
Suppose X, Y are r.v.’s assuming values in A, B respectively, with joint distribution φ. So

φij = Pr{X = xi & Y = yj}, ∀i, j.

Now let us ask: What is the probability that X = xi, and we don’t care what Y is?
Answer: The event {X = xi} is the union of the m disjoint events (X,Y) = (xi, y1) through (X,Y) = (xi, ym). In symbols,

{X = xi} = ⋃_{j=1}^m {(X,Y) = (xi, yj)}.
So it follows from earlier discussion that

Pr{X = xi} = ∑_{j=1}^m φij.
Marginal Distributions (Cont’d)
So the r.v. X by itself has the distribution denoted by φX, defined by

(φX)i := ∑_{j=1}^m φij, ∀i.
The distribution φX is referred to as the marginal distribution of X corresponding to the joint distribution φ. It is a probability distribution on the set A.

The marginal distribution of Y corresponding to the joint distribution φ is defined analogously by

(φY)j := ∑_{i=1}^n φij, ∀j.
Matrix Interpretation of Marginal Distributions
If X, Y take values in A, B with joint distribution φ, arrange the entries in a matrix as shown below.

X\Y   y1   ...  ym
x1    φ11  ...  φ1m
...   ...  ...  ...
xn    φn1  ...  φnm
If we call the above matrix Φ, then the vector φX is obtained by multiplying Φ on the right by a column vector of all ones, and φY is obtained by multiplying Φ on the left by a row vector of all ones.
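These slides contain no code, but the matrix picture is easy to check numerically. A minimal NumPy sketch (the matrix Φ below is the with-replacement urn distribution worked out later in this lecture; the variable names are ours):

```python
import numpy as np

# Joint distribution of the with-replacement urn draws, rows/cols = (W, B)
Phi = np.array([[0.49, 0.21],
                [0.21, 0.09]])

ones = np.ones(2)
phi_X = Phi @ ones      # right-multiply by a column of ones: row sums
phi_Y = ones @ Phi      # left-multiply by a row of ones: column sums

print(phi_X, phi_Y)     # both marginals are [0.7, 0.3]
```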
Independence of Two Random Variables
Suppose X, Y are r.v.’s assuming values in A, B respectively, with joint distribution φ. The random variables X, Y are said to be independent if

φij = (φX)i · (φY)j, ∀i, j.

In words, X and Y are independent if their joint probability distribution is just the product of the two individual marginal distributions.
An equivalent definition is: X and Y are independent if their jointprobability matrix Φ has rank one.
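Both characterizations (outer product of the marginals, and rank one) can be illustrated numerically. A sketch using the two urn distributions from this lecture; the helper name `is_independent` is ours:

```python
import numpy as np

def is_independent(Phi, tol=1e-12):
    """Check whether a joint probability matrix factors as the outer
    product of its marginals (equivalently, has rank one)."""
    phi_X = Phi.sum(axis=1)   # marginal of X (row sums)
    phi_Y = Phi.sum(axis=0)   # marginal of Y (column sums)
    return np.allclose(Phi, np.outer(phi_X, phi_Y), atol=tol)

with_repl    = np.array([[0.49, 0.21], [0.21, 0.09]])
without_repl = np.array([[42, 21], [21, 6]]) / 90

print(is_independent(with_repl))     # True
print(is_independent(without_repl))  # False
```

Note that `np.linalg.matrix_rank(with_repl)` is indeed 1, matching the rank-one characterization.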
Ball-Drawing from an Urn with Replacement
Suppose an urn has 7 white balls and 3 black balls. We draw a ball from this urn, and then replace the ball before drawing a second time.

Define X to be the color of the first ball drawn, and Y to be the color of the second ball. Since there are 7 white balls and 3 black balls each time, it is clear that

Pr{X = W} = 0.7, Pr{X = B} = 0.3,

and similarly for Y. Since we replace the ball, the outcome of X does not affect the outcome of Y. In other words, X and Y are independent in this case. So

Pr{(X,Y) = (W,W)} = Pr{X = W} × Pr{Y = W} = 0.49,

and similarly for the other three combinations.
Joint Distribution for Drawing with Replacement
If we draw with replacement, the joint probability distribution of X and Y is as shown below:

X\Y    W     B
W    0.49  0.21
B    0.21  0.09
With this information we can answer the questions raised earlier.

The probability of drawing a white ball followed by a black ball is

Pr{(X,Y) = (W,B)} = φ(W,B) = 0.21.

The event ‘drawing one white ball and one black ball (in either order)’ is the subset {(W,B), (B,W)}. So

Pφ({(W,B), (B,W)}) = φ(W,B) + φ(B,W) = 0.42.
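The two answers can also be sanity-checked by simulation. A small NumPy sketch (the seed and sample size are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Draw two balls with replacement from an urn with 7 white, 3 black.
first  = rng.random(n) < 0.7   # True = white
second = rng.random(n) < 0.7

wb = np.mean(first & ~second)      # white then black
mixed = np.mean(first != second)   # one of each, in either order

print(round(wb, 3), round(mixed, 3))   # close to 0.21 and 0.42
```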
Conditional Probabilities
Let φ be the joint distribution of (X,Y) as before. The conditional probability of X given the outcome Y = yj is defined as

Pr{X = xi|Y = yj} := Pr{X = xi & Y = yj} / Pr{Y = yj} = φij / ∑_{i'=1}^n φi'j.
Use φ{xi|yj} as a shorthand for Pr{X = xi|Y = yj}. Then the vector

φ{X|Y=yj} := [φ{x1|yj} . . . φ{xn|yj}]

is called the conditional probability distribution of X given the outcome Y = yj. Note that φ{X|Y=yj} is a distribution on the set A.
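Numerically, conditioning amounts to normalizing one column of the joint matrix. A sketch, using the without-replacement urn distribution derived a few slides ahead:

```python
import numpy as np

# Joint matrix of the without-replacement urn example (rows: X, cols: Y)
Phi = np.array([[42, 21],
                [21, 6]]) / 90

# Conditional distribution of X given Y = W: normalise the first column
col = Phi[:, 0]
phi_X_given_W = col / col.sum()

print(phi_X_given_W)   # [2/3, 1/3], a distribution on {W, B}
```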
Ball-Drawing from an Urn Without Replacement
Repeat the same experiment, except that we don’t replace the first ball we draw. Now things are trickier.

Suppose X = W, i.e., we draw a white ball the first time. Now there are only 6 white and 3 black balls left, so we can say that

Pr{Y = W|X = W} = 6/9, Pr{Y = B|X = W} = 3/9.

If we draw a black ball the first time, then there are 7 white balls and 2 black balls for the second draw. So

Pr{Y = W|X = B} = 7/9, Pr{Y = B|X = B} = 2/9.
Drawing Without Replacement (Cont’d)
We can construct the joint distribution φ by using the definition of conditional probabilities cleverly:

Pr{(X,Y) = (W,W)} = Pr{X = W} · Pr{Y = W|X = W} = 0.7 × 6/9 = 42/90,

and similarly for the other three combinations. This leads to the table below.

X\Y     W      B
W    42/90  21/90
B    21/90   6/90

Since Pr{(X,Y) = (W,W)} ≠ Pr{X = W} · Pr{Y = W}, it follows that X and Y are not independent!
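The chain-rule construction can be reproduced in a few lines of NumPy (variable names are ours):

```python
import numpy as np

p_X = np.array([0.7, 0.3])            # first draw: [W, B]
# Conditional distribution of the second draw given the first, Pr{Y|X}
p_Y_given_X = np.array([[6/9, 3/9],   # given X = W
                        [7/9, 2/9]])  # given X = B

# Chain rule: phi_ij = Pr{X = x_i} * Pr{Y = y_j | X = x_i}
Phi = p_X[:, None] * p_Y_given_X

print(Phi * 90)            # recovers the table [[42, 21], [21, 6]] / 90
print(Phi.sum(axis=0))     # marginal of Y: [0.7, 0.3]
```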
Probabilities of Some Events
With the probability table

X\Y     W      B
W    42/90  21/90
B    21/90   6/90

we can answer the questions raised in the beginning. The probability of drawing a white ball followed by a black ball is

Pr{(X,Y) = (W,B)} = φ(W,B) = 21/90 ≈ 0.2333.

The event ‘drawing one white ball and one black ball (in either order)’ is the subset {(W,B), (B,W)}. So

Pφ({(W,B), (B,W)}) = φ(W,B) + φ(B,W) = 42/90 ≈ 0.4667.
Two Counter-Intuitive Calculations
Let Φ denote the matrix of joint probabilities of X and Y, namely

Φ = [42/90  21/90]
    [21/90   6/90].

Note: (W,B) and (B,W) have the same probability even without replacement! This is not a coincidence. (See exercises)

Let us calculate the marginal distribution of Y, the second draw. Summing down the columns of the matrix gives

[Pr{Y = W}  Pr{Y = B}] = [1 1]Φ = [0.7  0.3]!

But there are only 9 balls when we draw Y. So how can Y have the same distribution as X? (See exercises)
Another Counter-Intuitive Calculation
Thus far we have studied Pr{Y|X}, which is natural because X is the outcome of the first draw.

As per the definition, we can also compute Pr{X|Y}, the distribution of the first outcome conditioned on the second outcome. Does this even make sense, or is it just mathematical jugglery?

The question makes sense. Suppose the urn had 1 white ball and N ≥ 2 black balls. We draw a ball (the outcome of X) and put it aside without looking at it. Then we draw another ball (the outcome of Y) without replacement.

Suppose the second ball is white (Y = W). Then we know for sure that X ≠ W. So Pr{X = W|Y = W} = 0 in this instance.
Another Counter-Intuitive Calculation (Cont’d)
Using the joint distribution matrix

[42/90  21/90]
[21/90   6/90]

and the definition, we can compute Pr{X|Y} for all outcomes of X and Y.

It turns out that Pr{X|Y} is the same as Pr{Y|X}! How is this possible? (See exercises)
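One way to see this numerically: normalize Φ by columns to get Pr{X|Y} and by rows to get Pr{Y|X}, and compare. A sketch:

```python
import numpy as np

Phi = np.array([[42, 21],
                [21, 6]]) / 90

# Pr{X = x_i | Y = y_j}: normalise each column
P_X_given_Y = Phi / Phi.sum(axis=0)
# Pr{Y = y_j | X = x_i}: normalise each row
P_Y_given_X = Phi / Phi.sum(axis=1, keepdims=True)

# Because Phi is symmetric and both marginals equal [0.7, 0.3],
# Pr{X = a | Y = b} coincides with Pr{Y = a | X = b}.
print(np.allclose(P_X_given_Y, P_Y_given_X.T))   # True
```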
More Than Two Random Variables
There is nothing magic about having just two r.v.s!
Suppose X, Y, Z are r.v.’s taking values in A, B, C = {z1, . . . , zl} respectively. So they have a joint distribution φ where

φijk = Pr{X = xi & Y = yj & Z = zk}.

We can define marginals of single and double r.v.’s, as well as conditional distributions, just as before.
Definitions of Marginal Distributions
Marginal distribution of a single r.v.:

(φX)i := ∑_{j=1}^m ∑_{k=1}^l φijk, ∀i,

and similarly for Y, Z.

Marginal (joint) distribution of two r.v.’s:

(φX,Y)ij := ∑_{k=1}^l φijk, ∀i, j,

and similarly for (Y,Z) and (X,Z).
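In code, these marginals are just sums of a 3-dimensional array over the appropriate axes. A sketch with a randomly generated joint distribution (shapes and seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
phi = rng.random((2, 3, 4))
phi /= phi.sum()                # a generic joint distribution phi_ijk

phi_X  = phi.sum(axis=(1, 2))   # marginal of X: sum over j and k
phi_XY = phi.sum(axis=2)        # joint marginal of (X, Y): sum over k

# Marginalising phi_XY over Y recovers phi_X (consistency check)
print(np.allclose(phi_XY.sum(axis=1), phi_X))   # True
```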
Definitions of Conditional Distributions
Conditional distribution of a single r.v.:

φ{xi|(yj,zk)} := φijk / ∑_{i'=1}^n φi'jk, i = 1, . . . , n,

and similarly for the other two cases.

Conditional distribution of a joint r.v.:

φ{(xi,yj)|zk} := φijk / ∑_{i'=1}^n ∑_{j'=1}^m φi'j'k, ∀i, j,

and similarly for the other two cases.
Law of Iterated Conditioning
Suppose X, Y, Z are three r.v.’s. We observe an outcome of Z, and it is zk. This gives a conditional distribution φ{(X,Y)|zk}. Then we observe an outcome of Y, and it is yj. So now we can further condition X on this new observation to compute φ{{(X,Y)|zk}|yj}.

Do we really need to do this in stages, or is this the same as conditioning X ‘in one go’, namely φ{X|(yj,zk)}?

Fortunately, both answers are the same. This may be called the ‘law of iterated conditioning’. (See exercises)
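The law can be checked numerically on a random joint distribution. A sketch (the indices j, k are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
phi = rng.random((3, 4, 5))
phi /= phi.sum()        # a generic joint distribution phi_ijk

j, k = 1, 2             # observed outcomes y_j and z_k

# One-shot conditioning: phi_{X | (y_j, z_k)}
one_shot = phi[:, j, k] / phi[:, j, k].sum()

# Two-stage: first condition (X, Y) on z_k, then condition on y_j
XY_given_z = phi[:, :, k] / phi[:, :, k].sum()
two_stage = XY_given_z[:, j] / XY_given_z[:, j].sum()

print(np.allclose(one_shot, two_stage))   # True
```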
Conditional Independence
Suppose X, Y, Z are r.v.’s. We say that Y, Z are conditionally independent given X if

φ{(yj,zk)|xi} = φ{yj|xi} · φ{zk|xi}, ∀i, j, k.

Baby (contrived) example: Suppose we draw a ball from an urn; call the outcome X. Then, without replacing it, we draw another ball; call the outcome Y. Then, after replacing the second ball drawn, we draw again; call the outcome Z.

Then Y, Z are not independent. However, Y and Z are conditionally independent given X. (See exercises)
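The contrived example can be worked out exactly for the 7-white, 3-black urn (variable names are ours). Given X, both Y and Z are draws from the same depleted urn, so the joint distribution factors as φijk = Pr{X=xi} · Pr{Y=yj|X=xi} · Pr{Z=zk|X=xi}:

```python
import numpy as np

p_X = np.array([0.7, 0.3])      # first draw: [W, B]
q = np.array([[6/9, 3/9],       # urn composition after removing a W
              [7/9, 2/9]])      # urn composition after removing a B

# phi_ijk = Pr{X=i} Pr{Y=j|X=i} Pr{Z=k|X=i}: conditional independence by design
phi = p_X[:, None, None] * q[:, :, None] * q[:, None, :]

phi_YZ = phi.sum(axis=0)
phi_Y, phi_Z = phi_YZ.sum(axis=1), phi_YZ.sum(axis=0)

print(np.allclose(phi_YZ, np.outer(phi_Y, phi_Z)))   # False: Y, Z dependent
# But given X = W, the joint of (Y, Z) factors exactly:
cond = phi[0] / phi[0].sum()
print(np.allclose(cond, np.outer(q[0], q[0])))        # True
```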
Covariance for Real-Valued Random Variables
Until now we have talked about general r.v.’s. Now suppose X, Y are r.v.’s assuming values in finite sets A, B which are subsets of the set R of real numbers. Let φ, φX, φY denote, as before, their joint and marginal distributions.

By viewing X as an r.v. with distribution φX, we can compute its mean and variance; similarly for Y. The focus of this discussion is on what happens for ‘cross’ terms of the form XY.
Recall that

V(X) = E[(X − E(X))^2], V(Y) = E[(Y − E(Y))^2]

denote the variances of X and Y respectively. Also, σ(X) = (V(X))^{1/2} and σ(Y) = (V(Y))^{1/2} denote the standard deviations of X and Y respectively.
Covariance for Real-Valued Random Variables (Cont’d)
Definition: The quantity C(X,Y) := E[(X − E(X))(Y − E(Y))] is called the covariance between X and Y, and the quantity

ρ(X,Y) := E[(X − E(X))(Y − E(Y))] / ({E[(X − E(X))^2]}^{1/2} · {E[(Y − E(Y))^2]}^{1/2}) = C(X,Y) / (σ(X)σ(Y))

is called the correlation coefficient between X and Y.

The correlation coefficient ρ(X,Y) always lies between −1 and +1. If ρ(X,Y) > 0, we say that X and Y are positively correlated; similarly, if ρ(X,Y) < 0, they are negatively correlated. If ρ(X,Y) = 0, we say that X and Y are uncorrelated.
Consequences of Definitions
Theorem: An equivalent expression for the covariance is

C(X,Y) = E(XY) − E(X)E(Y).

Theorem: If X, Y are independent, then

E[XY, φ] = E[X, φX] · E[Y, φY].

Theorem: If X, Y are independent, then they are uncorrelated. (Explain why)

Exercises: Prove the above theorems, and also give an example where X, Y are uncorrelated but not independent.

Fact: We have that

V(X + Y) = V(X) + 2C(X,Y) + V(Y).
Example
Recall the ball-in-an-urn example. There are 7 white balls and 3 black balls in an urn. We draw one ball and call the outcome X. Then, without replacing it, we draw a second ball and call the outcome Y. Now X, Y are abstract r.v.’s, so we define an associated ‘pay-off’ function f : {W,B} → R by f(W) = 0, f(B) = 1.

The probability table is

X\Y     W      B
W    42/90  21/90
B    21/90   6/90
So it is easy to verify that

E[f(X)] = E[f(Y)] = 0.3, E[f(X)f(Y)] = 6/90 ≈ 0.0667.

Since E[f(X)f(Y)] < E[f(X)] · E[f(Y)], f(X) and f(Y) are negatively correlated.
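The example’s numbers can be verified in a few lines (here f is the pay-off vector [f(W), f(B)]):

```python
import numpy as np

Phi = np.array([[42, 21],
                [21, 6]]) / 90
f = np.array([0.0, 1.0])      # pay-off: f(W) = 0, f(B) = 1

Ex  = f @ Phi.sum(axis=1)     # E[f(X)] via the marginal of X
Ey  = Phi.sum(axis=0) @ f     # E[f(Y)] via the marginal of Y
Exy = f @ Phi @ f             # E[f(X) f(Y)] = sum_ij phi_ij f_i f_j

cov = Exy - Ex * Ey           # negative, so negatively correlated
print(Ex, Ey, Exy, cov)
```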
Motivating Example for Bayes’ Rule
Bayes’ rule is very useful in distinguishing prior probabilities from posterior probabilities. This is best illustrated via an example.

Example: Someone has developed an HIV diagnostic test that has a 2% false negative rate and a 1% false positive rate. In other words, if the patient really has HIV, then the test comes out positive 98% of the time. If the patient does not have HIV, the test comes out negative 99% of the time.

Both of these are ‘prior’ probabilities. What we really want to know is the ‘posterior’ probability, namely: If the patient tests positive, what is the probability that he/she has HIV?
Motivating Example (Cont’d)
Suppose that 1% of the population has HIV. Define two r.v.’s: X ∈ {H,F} indicates whether the patient has HIV or is free from HIV, and Y ∈ {P,N} indicates whether the test comes out positive or negative.
The data we are given can be summarized as follows:
Pr{X = H} = 0.01.
Pr{Y = P |X = H} = 0.98,Pr{Y = N |X = F} = 0.99.
From this data we can easily infer the following additional information:
Pr{X = F} = 0.99.
Pr{Y = N |X = H} = 0.02,Pr{Y = P |X = F} = 0.01.
Motivating Example (Cont’d)
Now we can construct the joint distribution of X and Y .
X\Y     P       N
H    0.0098  0.0002
F    0.0099  0.9801
From this it follows that

Pr{Y = P} = 0.0098 + 0.0099 = 0.0197,

Pr{X = H|Y = P} = Pr{X = H & Y = P} / Pr{Y = P} = 0.0098/0.0197 ≈ 0.5.
So the test is totally useless: we might as well flip a coin as administer it!
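The whole calculation, in a short plain-Python sketch (variable names are ours):

```python
# Posterior probability of HIV given a positive test
# (prevalence 1%, 2% false negatives, 1% false positives)
p_H = 0.01              # Pr{X = H}
p_P_given_H = 0.98      # sensitivity: Pr{Y = P | X = H}
p_P_given_F = 0.01      # false positive rate: Pr{Y = P | X = F}

p_P = p_H * p_P_given_H + (1 - p_H) * p_P_given_F   # Pr{Y = P}
posterior = p_H * p_P_given_H / p_P                 # Pr{X = H | Y = P}

print(round(p_P, 4), round(posterior, 3))   # 0.0197 0.497
```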
Bayes’ Rule
Suppose X, Y are r.v.’s assuming values in finite sets A, B, and xi ∈ A, yj ∈ B. By the definition of conditional probability, we know that

Pr{X = xi|Y = yj} = Pr{X = xi & Y = yj} / Pr{Y = yj}.
Bayes’ rule consists of rewriting this formula in an equivalent form.
Bayes’ Rule – Version 1: Suppose X, Y are r.v.’s assuming values in finite sets A, B. Then

Pr{X = xi|Y = yj} = Pr{Y = yj|X = xi} · Pr{X = xi} / Pr{Y = yj}.
Bayes’ Rule (Cont’d)
Bayes’ Rule – Version 2: Suppose X, Y are r.v.’s assuming values in finite sets A, B. Then

Pr{X = xi|Y = yj} = Pr{Y = yj|X = xi} · Pr{X = xi} / ∑_{i'=1}^n Pr{Y = yj|X = xi'} · Pr{X = xi'}.

It can be recognized that the numerator is just Pr{X = xi & Y = yj}, while the denominator is just Pr{Y = yj}.
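Version 2 translates directly into code. A hedged sketch (the function name is ours), applied below to the variant of the HIV example with 0.1% prevalence discussed later in this lecture:

```python
def bayes_posterior(prior, likelihood_j):
    """Bayes' rule, Version 2: posterior over the x_i given the outcome y_j.

    prior[i]        = Pr{X = x_i}
    likelihood_j[i] = Pr{Y = y_j | X = x_i}
    """
    joint = [p * l for p, l in zip(prior, likelihood_j)]
    total = sum(joint)                  # the denominator, Pr{Y = y_j}
    return [v / total for v in joint]

# HIV example with 0.1% prevalence (states [H, F], observed outcome Y = P)
post = bayes_posterior([0.001, 0.999], [0.98, 0.01])
print(round(post[0], 3))    # roughly 0.089: even worse than the 1% case
```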
Motivating Example (Cont’d)
By Bayes’ rule, Version 1, we have

Pr{X = xi|Y = yj} = Pr{Y = yj|X = xi} · Pr{X = xi} / Pr{Y = yj}.

In our example Pr{Y = P|X = H} ≈ 1, because the diagnostic has a very low false negative rate.

However, Pr{X = H} = 0.01 (1% of the population has HIV) while Pr{Y = P} ≈ 0.02 (the test will be positive for about 2% of the population). So

Pr{X = H|Y = P} ≈ 0.01/0.02 = 0.5.
Motivating Example (Cont’d)
Now suppose that the fraction of the population that has HIV is not 1% but 0.1%. Then Pr{X = H} = 0.001. It can be computed that Pr{Y = P} ≈ 0.01, so

Pr{X = H|Y = P} ≈ 0.1.

So the test is even worse in this situation!

This is because most of the positive readings will be false positives from non-afflicted persons, and only a few positive readings will be from afflicted persons.
Conditional Expected Value: Motivation
Suppose as always that X, Y are r.v.’s assuming values in A, B with joint distribution φ. Suppose f : A → R is a real-valued function of the r.v. X. We can talk about the ‘unconditional’ as well as ‘conditional’ expected value of f.

If we know nothing about either X or Y, then the probability distribution of X is the ‘marginal’ distribution φX. If we denote fi = f(xi), then the ‘unconditional’ expected value of f is
E[f] = ∑_{i=1}^n fi (φX)i = ∑_{i=1}^n fi ∑_{j=1}^m φij.
Now suppose we measure Y and the outcome is Y = yj. We can ‘update’ the distribution of X to φ{X|Y=yj} and recompute the expected value.
Conditional Expected Value: Definition
With the outcome Y = yj, the conditional distribution of X now becomes

φ{X|Y=yj} = [φ{x1|yj} . . . φ{xn|yj}].
So the conditional expected value of f, given the outcome Y = yj, is defined as

E[f|Y = yj] = ∑_{i=1}^n fi φ{xi|yj}.
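A sketch of both the unconditional and conditional expected values, using the without-replacement urn distribution and the pay-off f(W) = 0, f(B) = 1 from earlier:

```python
import numpy as np

# Without-replacement urn distribution (rows: X, cols: Y)
Phi = np.array([[42, 21],
                [21, 6]]) / 90
f = np.array([0.0, 1.0])

E_f = f @ Phi.sum(axis=1)             # unconditional: E[f] = 0.3

j = 1                                 # observe Y = B
cond = Phi[:, j] / Phi[:, j].sum()    # phi_{X | Y = B} = [7/9, 2/9]
E_f_given_B = f @ cond                # conditional expected value = 2/9

print(E_f, E_f_given_B)
```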
Conditional Expected Value: Example
You have been given a speeding ticket in the amount of $50. You can contest it in court; if found guilty you pay $100, but if you are found not guilty you pay nothing. There are two judges who try such cases. Judge T is a ‘toughie’ who finds ‘guilty’ 70% of the time, while Judge S is a ‘softie’ who finds ‘guilty’ only 20% of the time. Judge T is more senior, so he tries 60% of the cases. You have the option of just paying the ticket right up until the moment your case comes up for trial.
Conditional Expected Value: Example (Cont’d)
The joint probability distribution is shown below, where J stands for the judge and D stands for the decision:

J\D     G     N
T    0.42  0.18
S    0.08  0.32

The fine ‘function’ is real-valued, with f(G) = 100 and f(N) = 0. Since the prior probability of being found guilty is 50%, the prior expected fine is $50, the same as the ticket. So you decide to ‘go for it’.
Conditional Expected Value: Example (Cont’d)
When you arrive at the courtroom you do not know which judge issitting, but he finds the defendant just before you ‘guilty’. Whatnow is your expected fine?
You need to compute the posterior probability Pr{J = T |D = G}.By Bayes’ rule,
Pr{J = T |D = G} =Pr{D = G|J = T} · Pr{J = T}
Pr{D = G}=
0.42
0.50= 0.84.
So now you are 84% sure that this is the ‘tough’ judge. The posterior probability distribution of J (the identity of the judge) is

[P(T)  P(S)] = [0.84  0.16].
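The Bayes' rule step can be sketched in Python (again, the encoding is ours):

```python
# Joint distribution Pr{J, D}: judge T or S, decision G (guilty) or N (not guilty)
joint = {("T", "G"): 0.42, ("T", "N"): 0.18,
         ("S", "G"): 0.08, ("S", "N"): 0.32}

# Pr{D = G}: marginalize the joint distribution over the judge
p_g = joint[("T", "G")] + joint[("S", "G")]

# Bayes' rule: Pr{J | D = G} = Pr{J, D = G} / Pr{D = G}
post_T = joint[("T", "G")] / p_g
post_S = joint[("S", "G")] / p_g
```

This reproduces the posterior distribution [0.84 0.16] over the two judges.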
Conditional Expected Value: Example (Cont’d)
Using P(T) = 0.84 and P(S) = 0.16 updates the probability table, as shown below:

J\D    G       N
T      0.588   0.252
S      0.032   0.128
The conditional expected value of the ‘fine function’ f is now 100 × (0.588 + 0.032) = $62, more than the value of the ticket.

So you should perhaps decide not to contest!
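The conditional expected fine can be verified with a few lines of Python (variable names are ours):

```python
post = {"T": 0.84, "S": 0.16}        # posterior distribution of the judge
p_g_given_j = {"T": 0.7, "S": 0.2}   # Pr{D = G | J}: each judge's 'guilty' rate
fine = {"G": 100, "N": 0}            # the fine 'function'

# Posterior probability of a guilty verdict: 0.84*0.7 + 0.16*0.2
p_g = sum(post[j] * p_g_given_j[j] for j in post)

# Conditional expected fine under the posterior distribution
exp_fine = fine["G"] * p_g + fine["N"] * (1 - p_g)
```

The result is an expected fine of $62, which exceeds the $50 ticket.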
Conditional Moments, Variance Etc.
The notions of moments, variance, mean, mode etc. all carry over from the ‘unconditional’ case to the ‘conditional’ case.

Whenever an observation is made, just replace the marginal distribution by the conditional distribution.
Conditioning on an Event: Definition
Until now we have been conditioning on an ‘outcome’: thus we compute the conditional distribution of X given an observation (or outcome) Y = yj. But it is also possible to condition on an ‘event’ – and this can be done for just a single r.v.; we do not need a ‘joint’ r.v.
Suppose X is a r.v. taking values in a finite set A with distribution φ, and suppose S, T ⊆ A. Thus S, T are ‘events’. We define the conditional probability of T given S by

Pr(T | S) := Pr(S ∩ T) / Pr(S).

In terms of φ we can also write this as

Pφ(T | S) = Pφ(S ∩ T) / Pφ(S).
Conditioning on an Event: Example
Suppose we draw one card from a standard deck of cards. So X is a r.v. assuming one of 52 values. Assume that each card is equally likely, so that φ is the uniform distribution.

Now suppose S is the event that X is a red card, while T is the event that X is an honor, that is, 10 through ace. What is the conditional probability that X is an honor given that it is a red card?
Since there are 26 red cards, Pφ(S) = 26/52 = 0.5. Similarly, there are 20 honors, so Pφ(T) = 20/52 = 5/13. Finally, there are 10 red honor cards, so Pφ(S ∩ T) = 10/52 = 5/26. So

Pφ(T | S) = (5/26) / 0.5 = 5/13.
Conditioning on an Event: Formula
Suppose φ is the distribution of X, and S ⊆ A. Then the conditional distribution φ|S on A is defined by

φ|S(xi) := 0 if xi ∉ S, and φi / Pφ(S) if xi ∈ S.
So the conditional distribution is defined for every element of the original sample space A, but is concentrated on the set S. The corresponding probability measure is precisely what was defined earlier.
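This definition translates directly into a small Python function (a sketch, with a fair-die example of our own choosing):

```python
def condition(phi, S):
    """Conditional distribution phi|S: zero off S, renormalized on S."""
    p_S = sum(p for x, p in phi.items() if x in S)
    return {x: (p / p_S if x in S else 0.0) for x, p in phi.items()}

# Example: a fair die conditioned on the event 'the outcome is even'
phi = {i: 1 / 6 for i in range(1, 7)}
evens = {2, 4, 6}
phi_S = condition(phi, evens)   # each even face now has probability 1/3
```

Note that φ|S is still a distribution on all of A, as stated above: odd faces get probability 0, and the whole thing sums to 1.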
Independent Events
Two events S, T ⊆ A are said to be independent if
Pφ(S ∩ T ) = Pφ(S)Pφ(T ).
For instance, in the card example above, S was the event that the card was red, and T was the event that the card was an honor. We found that

P(S) = 0.5, P(T) = 5/13, P(S ∩ T) = 5/26.

Since the above relationship is satisfied, we can say that whether a card is an honor or not is independent of whether it is red or not.
This example is ‘obvious’ – but there can also be some ‘non-obvious’ examples, as we shall see next!
(We leave off the subscript φ for clarity.)
A Counter-Intuitive Example
Suppose N is some number, say 20. Define A = {1, . . . , N} and let φ be the uniform distribution on A. Pick two prime numbers, say 2 and 3. Define S to be the event that X is divisible by 2, and T to be the event that X is divisible by 3. So S ∩ T is the event that X is divisible by both 2 and 3.
An easy calculation shows that
P(S) = 10/20 = 0.5, P(T) = 6/20 = 0.3, P(S ∩ T) = 3/20 = 0.15.
So once again S and T are independent.
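This ‘non-obvious’ independence can be verified by direct enumeration (a sketch; the helper names are ours):

```python
N = 20
A = range(1, N + 1)
S = {x for x in A if x % 2 == 0}   # divisible by 2
T = {x for x in A if x % 3 == 0}   # divisible by 3

def prob(E):
    # Uniform distribution on A = {1, ..., N}
    return len(E) / N

# Independence requires P(S ∩ T) = P(S) * P(T)
lhs = prob(S & T)          # multiples of 6 in A: {6, 12, 18}, so 3/20
rhs = prob(S) * prob(T)    # (10/20) * (6/20)
```

Both sides come out to 0.15, confirming independence for this choice of N.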
The exercises bring out some more facets of this problem.