
Business Statistics 41000:

Probability 1

Drew D. Creal

University of Chicago, Booth School of Business

Week 3: January 24 and 25, 2014


Class information

- Drew D. Creal
- Email: dcreal@chicagobooth.edu
- Office: 404 Harper Center
- Office hours: email me for an appointment
- Office phone: 773.834.5249

Course homepage

http://faculty.chicagobooth.edu/drew.creal/teaching/index.html

Course schedule

- Week 1: Plotting and summarizing univariate data
- Week 2: Plotting and summarizing bivariate data
- Week 3: Probability 1
- Week 4: Probability 2
- Week 5: Probability 3
- Week 6: In-class exam
- Week 7: Statistical inference 1
- Week 8: Statistical inference 2
- Week 9: Simple linear regression
- Week 10: Multiple linear regression

Outline of today’s topics

I. Discrete random variables (AWZ p. 215-216)
   - Discrete probability distributions (AWZ p. 949-950)
   - The Bernoulli distribution
   - Computing the probabilities of subsets of outcomes
II. Expectation and variance of a discrete random variable
III. Mode of a discrete random variable
IV. Conditional, marginal, and joint distributions (AWZ p. 230-236)
V. Several random variables

Why probability?


Why probability?

- In lectures #1 and #2, we looked at various types of data in different ways.
- We learned to use plots and numerical summary statistics to identify patterns in the data and see how variables related to one another.
- If we find patterns, we can use them to predict.
- For example, we used regression to predict the sales price of a house given its size.

Why probability?

- To make predictions, we use a mathematical model for the relationship.
- However, in business and economic applications, these specifications are rarely exact.

Instead of saying: “if x is this, then y must be that”

we want to say: “if x is this, then y will probably be within this range of values.”

Probability is a way of modelling uncertainty mathematically.

Example: Gallup poll

Gallup (1/22/14): In U.S., 65% Dissatisfied With How Gov’t System Works

“Sixty-five percent of Americans are dissatisfied with the nation’s system of government and how well it works, the highest percentage in Gallup’s trend since 2001. Dissatisfaction is up five points since last year, and has edged above the previous high from 2012 (64%)....

Results:....are based on telephone interviews conducted Jan. 5-8, 2014, with a random sample of 1,018 adults....the margin of sampling error is ±3 percentage points at the 95% confidence level.”

Source: www.gallop.com/poll

Example: Rasmussen poll

Rasmussen: 68% Expect NSA Phone Spying To Stay the Same or Increase

“Despite President Obama’s announcement of tighter controls on the National Security Agency’s domestic spying efforts, two-out-of-three U.S. voters think spying on the phone calls of ordinary Americans will stay the same or increase.

.....The margin of sampling error for the full sample of 1,000 Likely Voters is ±3 percentage points with a 95% level of confidence.”

Source: www.rasmussenreports.com

Why probability?

- In the previous examples, they mention the sampling “error.”
- What do they mean by this?
- How are they estimating the “error”?

Why probability?

Answer: they took a random sample and computed

2 × 0.5/√1018 = 0.0313 ≈ 3%

2 × 0.5/√1000 = 0.0316 ≈ 3%

These calculations come from a probability model, which we will study extensively!!

Importantly, this model is based on a set of assumptions that could be wrong!

You need to understand these assumptions and be able to think critically about them!
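The margin-of-error arithmetic above is easy to check. A quick sketch in Python (not part of the slides), reproducing the ±3 points reported by both polls:

```python
import math

# The "plus or minus 3 points" reported by both polls is approximately
# 2 * 0.5 / sqrt(n): the half-width of a 95% interval for a proportion,
# evaluated at the worst case p = 0.5.
def margin_of_error(n):
    return 2 * 0.5 / math.sqrt(n)

print(round(margin_of_error(1018), 4))  # Gallup sample of 1,018: 0.0313
print(round(margin_of_error(1000), 4))  # Rasmussen sample of 1,000: 0.0316
```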

Discrete Random Variables


Discrete random variables

Suppose you are a manager trying to estimate the number of units of a product you will sell next quarter.

Suppose you know (unrealistically) that sales will be 1, 2, 3, or 4 (thousand) units.

But, you are not sure which one it will be.

First, why is sales a discrete random variable?

Discrete random variables

Let the random variable S denote sales.

Since S can only take on the values 1, 2, 3, or 4, it is a discrete random variable.

A probability distribution is a way to express this uncertainty mathematically.

  s     p(s)
  1     0.095
  2     0.230
  3     0.440
  4     0.235

(The s column lists the possible values, or outcomes; the p(s) column gives the probability of each value.)

Discrete random variables

In a probability distribution, the probabilities always sum to one by definition.

  s     p(s)
  1     0.095
  2     0.230
  3     0.440
  4     0.235
        1.000

p(s) = Prob(S = s)
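A probability distribution like the sales table is easy to represent on a computer. A minimal sketch in Python (not part of the slides), storing the table as a dict and checking the two defining properties:

```python
# The sales distribution as a dict mapping outcome -> probability.
p = {1: 0.095, 2: 0.230, 3: 0.440, 4: 0.235}

# Every probability lies between 0 and 1...
assert all(0 <= prob <= 1 for prob in p.values())
# ...and the probabilities sum to one (up to floating-point error).
assert abs(sum(p.values()) - 1.0) < 1e-12
```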

Remarks on notation

- In words, the notation Prob(X = x) means “the probability that the random variable X takes on the number x.”
- It is common convention to use capital letters (or words) such as X or Z to denote a random variable.
- The possible values that a random variable can take on are also known as outcomes.
- It is common for lower case letters such as x or z to denote the outcomes.
- It is common to abbreviate random variable as r.v.

A picture of the discrete random variable’s distribution

[Bar chart of p(s) against s: bars of height 0.095, 0.230, 0.440, and 0.235 at s = 1, 2, 3, and 4.]

Discrete random variable

A discrete random variable is a numeric quantity that can take on any one of a countable number of possible values. However, it is unknown in advance which value will occur.

Remarks:

- This is how we quantify or model uncertainty when a random event or experiment can take on a countable number of values.
- We list the possible values the variable can take on (i.e. the outcomes).
- We assign to each number (or outcome) a probability.

Discrete random variable

Remarks continued:

- A probability is a number between 0 and 1.
- When the probabilities are summed up over the possible outcomes, the probabilities always sum to one.
- The word “discrete” emphasizes that the outcomes are countable (we can create a list of them).
- In our sales example, there were only 4 possible outcomes for the r.v. S.
- Later, we will study continuous random variables which may take on a continuous range of values.

Example: coin tossing

- Imagine a random experiment where we toss two coins.
- We define the random variable X to be the number of heads in two tosses.
- We assume each coin is “fair” so that the probability of tossing a head or tail is 1/2.
- Before tossing the coins, we know that there are 3 possible outcomes: x = 0, 1, and 2.

  x     p(x)
  0     0.25
  1     0.50
  2     0.25

The probability distribution of the random variable X.

Probability distribution of a discrete r.v.

The probability distribution of a discrete random variable has two parts:

1.) a list of the possible outcomes.
2.) a list of the probabilities for each outcome.

  x     p(x)
  x1    p1
  x2    p2
  ...   ...

For a discrete r.v., we can think of a probability distribution as a table.

Remarks on notation

- You will often see probabilities written as p(x) or Prob(X = x) or Pr(X = x) or P(X = x) or pX(x).
- These are all common notation for the same thing. It just depends on the author’s preferences.
- With the notation p(x) it should be understood from the context that you are talking about the random variable X which may take on an outcome x.
- In our sales example, p(1) is the probability that our sales during the next quarter is 1,000 units.

Interpreting probabilities

The easiest way to interpret probabilities is...

- Probability is a measure of uncertainty with values between 0 and 1.
- An outcome with a probability of 0 will basically never happen.
- An outcome with a probability of 1 will basically always happen.

Interpreting probabilities

There are more “philosophical” ways of interpreting probabilities.

Two common ways are: frequentist and subjective (Bayesian).

Consider again the example where we toss two fair coins.

  x     p(x)
  0     0.25
  1     0.50
  2     0.25

Frequentist: In the long run, if I toss the two coins over and over and over..., I will get 1 head 50% of the time.

Subjective: I am indifferent between betting on the event “1 head” or the event “0 or 2 heads.”

Interpreting probabilities

Consider the sales example again.

  s     p(s)
  1     0.095
  2     0.230
  3     0.440
  4     0.235

[Bar chart of p(s) against s for this distribution.]

“It’s about twice as likely that we will sell 3,000 units as it is that we will sell 2,000 or 4,000 units.”

“If all our quarters were like this, we would see sales of 1,000 units about once in every 10 quarters.”

Assigning Probabilities to Categorical Variables

- Remember that part of our definition of a random variable is that its value always takes on a “number.”
- What about situations where we have a random experiment and the variable of interest is a categorical variable (which is typically not a number)?
- In this case, we just assign a number to each category.
- We can then assign a probability to each number.

Assigning Probabilities to Categorical Variables

Example: For the variable “Reg” from the British marketing data set, we assigned each region a number.

  1  “Scotland”
  2  “North West”
  3  “North”
  4  “Yorkshire & Humberside”
  5  “East Midlands”
  6  “East Anglia”
  7  “South East”
  8  “Greater London”
  9  “South West”
  10 “Wales”
  11 “West Midlands”

Assigning Probabilities to Categorical Variables

Example: Who will win the MVP at the Super Bowl?

- We let the random variable M be the football player that wins the MVP.
- We label each player with a number.

  1  “Peyton Manning”    4  “Russell Wilson”
  2  “Wes Welker”        5  “Marshawn Lynch”
  3  “Eric Decker”       6  “Richard Sherman”

- The outcomes are m = 1, 2, 3, 4, 5, 6.
- We then assign probabilities to each outcome.
- P(M = 1) = p(1) = P(Peyton Manning wins the MVP)

The Bernoulli (and uniform) distributions


The Bernoulli distribution

Our fundamental discrete random variable is the dummy variable, with two outcomes 0 and 1.

Some examples are:

Clinical trials: T is a r.v. describing a test for a disease. T = 0 if the person does not have the disease. T = 1 if they do.

Marketing example: B is a r.v. describing whether a person buys a product. B = 0 if the person does not buy the product. B = 1 if they do.

Sports: Rafael Nadal is about to hit a first serve. A is a r.v. describing whether he hits an ace. A = 0 if he does not hit an ace. A = 1 if he does.

The Bernoulli distribution

- In general, we say a discrete random variable taking on only two values (such as a dummy variable) has a Bernoulli distribution.
- You may often hear it called a “Bernoulli Trial.”
- The value 1 is often called a “success.”
- Suppose we label this random variable X, then we have

  x     p(x)
  0     1 − p
  1     p

  Pr(X = 1) = p

- Notation: X ∼ Bernoulli(p)

The Bernoulli distribution

X ∼ Bernoulli(p) means that X is a discrete r.v. with the following probability distribution:

  x     p(x)
  0     1 − p
  1     p

where p is the probability that X equals 1.

- In words, “the random variable X is distributed as Bernoulli with parameter p.”
- The “Bernoulli” is a family of probability distributions, where each probability distribution is indexed by the parameter p.

The Bernoulli distribution: further examples

Example: Tossing a fair coin.

Let X = 1 if the toss is heads and 0 otherwise.

Then X ∼ Bernoulli(0.5).

  x     p(x)
  0     0.5
  1     0.5

The discrete Uniform distribution

X ∼ Discrete Uniform means that X is a discrete r.v. taking on a finite number of values with equal probabilities.

- If there are N outcomes, the probabilities are all 1/N.

Probabilities of Subsets of Outcomes

Example: Suppose we toss a fair six-sided die. Let Z denote the outcome of the toss.

What is Pr(2 < Z < 5)?

In other words, what is the probability that we roll a 3 or 4?

  z     p(z)
  1     1/6
  2     1/6
  3     1/6
  4     1/6
  5     1/6
  6     1/6

Probabilities of Subsets of Outcomes

To compute the probability that any one of a group of outcomes occurs, we sum up their probabilities.

Pr(a < X < b) = Σ_{a<x<b} p(x)

Example: Tossing a die.

Pr(2 < Z < 5) = p(3) + p(4) = 1/6 + 1/6 = 1/3

Probabilities of Subsets of Outcomes

Sometimes, we may also want to know if something is greater (less) than or equal to!

Example: Tossing a die.

Pr(2 ≤ Z < 5) = p(2) + p(3) + p(4) = 1/6 + 1/6 + 1/6 = 1/2

Probabilities of Subsets of Outcomes

Example: Let’s return to our sales example where S denotes the sales of units of our product (in thousands).

  s     p(s)
  1     0.095
  2     0.230
  3     0.440
  4     0.235

What is the probability that we sell more than 1,000 units next quarter?

Pr(S > 1) = p(2) + p(3) + p(4)
          = 0.23 + 0.44 + 0.235
          = 0.905

Probabilities of Subsets of Outcomes

Example: let’s do it again!

  s     p(s)
  1     0.095
  2     0.230
  3     0.440
  4     0.235

What is the probability that we sell more than 1,000 units next quarter?

We could have done it like this:

Pr(S > 1) = Pr(S ≠ 1)
          = 1 − p(1)
          = 0.905

Probabilities of Subsets of Outcomes

Example: one more time!

  s     p(s)
  1     0.095
  2     0.230
  3     0.440
  4     0.235

What is the probability that we sell 3,000 units or less next quarter?

Pr(S ≤ 3) = p(1) + p(2) + p(3)
          = 0.095 + 0.23 + 0.44
          = 0.765
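The three sales calculations above can be reproduced in a few lines. A sketch in Python (not part of the slides), using the sales distribution:

```python
p = {1: 0.095, 2: 0.230, 3: 0.440, 4: 0.235}

# "OR means ADD": sum the probabilities of the outcomes in the subset.
pr_more_than_1 = sum(prob for s, prob in p.items() if s > 1)   # 0.905
# "NOT means ONE MINUS": the complement rule gives the same answer.
pr_not_1 = 1 - p[1]                                            # 0.905
# Pr(S <= 3): sum over the outcomes 1, 2, and 3.
pr_at_most_3 = sum(prob for s, prob in p.items() if s <= 3)    # 0.765
```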

Probabilities of Subsets of Outcomes

Here are two helpful reminders.

1. “OR means ADD”: Pr(X = a OR X = b) = p(a) + p(b)

   As long as two events cannot both happen, the probability of either is the sum of the probabilities.

2. “NOT means ONE MINUS”: Pr(X ≠ a) = 1 − p(a)

   The probability that something does NOT happen is one minus the probability that it does.

Expectation and Variance of a Random Variable

Expectation of a discrete random variable

Example: consider again the random variable S denoting sales.

  s     p(s)
  1     0.095
  2     0.230
  3     0.440
  4     0.235

Now, imagine your boss asks you to predict sales next quarter.

You have to come up with one number (a “guess”) even though you are not sure.

What number would you choose?

Expectation of a discrete random variable

One option (and not the only one) is to report the expected value.

The expected value of a discrete random variable is:

E[X] = Σ_{all x} x ∗ p(x)

In words, the expected value is “the sum of the possible outcomes x where each one is weighted by its probability p(x).”

IMPORTANT: This is similar to the sample mean but this is NOT the same thing. We will discuss this later on below.

Computing the expected value

Example: consider again the random variable S which denotes the sales of our product.

  s     p(s)
  1     0.095
  2     0.230
  3     0.440
  4     0.235

E(S) = .095 ∗ 1 + .23 ∗ 2 + .44 ∗ 3 + .235 ∗ 4
     = 2.815

Yes, it does seem weird that 2.815, which is our “guess” for sales, is not one of the possible values. Think of this as saying “we think sales is likely to be somewhere around 3 thousand units, but it’s more likely to be under 3 thousand than over.”
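The expected-value calculation is exactly a probability-weighted sum, which is one line of code. A sketch in Python (not part of the slides):

```python
p = {1: 0.095, 2: 0.230, 3: 0.440, 4: 0.235}

# E[S]: each outcome weighted by its probability.
mean = sum(s * prob for s, prob in p.items())  # 2.815
```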

Notation for the expected value

- Different authors use different notation for the expected value, including E(X) and E[X].
- It is common notation in statistics to use the Greek symbol μ or μx, which is pronounced as “mu.”
- We often say “mean” instead of “expected value.” What we mean by this is that the “expected value of X” is the “mean of the r.v. X.”

Sample Mean vs. the Expected Value

The Sample Mean:
- A variable in a data set is an observed set of values.
- The sample mean of a variable in our data is (1/n) Σ_{i=1}^{n} x_i.
- It is the average of the observed values in the data set.

The Expected Value:
- A random variable is a mathematical model for an uncertain quantity.
- The expected value (mean) of a r.v. is E[X] = Σ_{all x} x ∗ p(x).
- It is the average of the possible values taken by a r.v., weighted by their probabilities.

Expected Value of a Function of a Discrete Random Variable

Sometimes we will be interested in the expected value of some function of a random variable. For example, let W be the prize a game show contestant ends up with.

Example: Deal or No Deal

George has cases worth $5, $400, $10,000, and $1,000,000 remaining. There are 4 outcomes and each is equally likely. The banker’s offer is $189,000.

The expected value is

E[W] = .25 ∗ 5 + .25 ∗ 400 + .25 ∗ 10,000 + .25 ∗ 1,000,000
     = $252,601.25

Is the banker’s offer a good or bad deal?

Expected Value of a Function of a Discrete Random Variable

BUT, this assumes people choose based on expected values. Economists believe in diminishing marginal utility of income. The more wealth you have, the less utility you get from each additional $1.

This is often modeled with a utility function over wealth.

Let’s assume something simple such as U(W) = √W.

[Plot of the utility function U(W) = √W against wealth from $0 to $300,000.]

Expected Value of a Function of a Discrete Random Variable

To compute the expected value E[f(W)] = E[√W] in the case of a discrete random variable, just take the function f(.) of each possible outcome, then multiply by the probability and add them together.

What is George’s expected utility?

E[√W] = .25√5 + .25√400 + .25√10,000 + .25√1,000,000
      = 280.56

Compare this to the utility of the banker’s offer: √189,000 = 434.74
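The Deal or No Deal numbers can be checked directly: apply the function to each outcome, weight by the probabilities, and sum. A sketch in Python (not part of the slides):

```python
import math

# The remaining cases from the example; each has probability 0.25.
prizes = [5, 400, 10_000, 1_000_000]
offer = 189_000

expected_value   = sum(0.25 * w for w in prizes)             # 252601.25
expected_utility = sum(0.25 * math.sqrt(w) for w in prizes)  # ~280.56
offer_utility    = math.sqrt(offer)                          # ~434.74
```

With U(W) = √W, the certain offer’s utility (434.74) exceeds the gamble’s expected utility (280.56), even though the gamble’s expected dollar value is larger.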

Variance of a Discrete Random Variable

To understand how much the discrete random variable X varies about its mean (expected value), we define the variance.

The variance of a discrete random variable X is:

Var[X] = Σ_{all x} p(x)(x − μx)² = E[(X − μx)²]

- In words, the variance “is the expected squared distance of the r.v. X from its mean.”
- If we take μx to be our prediction for X, you can think of it as a weighted average of the “squared prediction error.”

Variance of a Discrete Random Variable

Example: consider again the random variable S denoting sales next quarter.

  s     p(s)
  1     0.095
  2     0.230
  3     0.440
  4     0.235

Imagine your boss asks you to also report the uncertainty associated with your predicted sales next quarter.

E[S] = 2.815

V[S] = .095(1 − 2.815)² + .23(2 − 2.815)²
     + .44(3 − 2.815)² + .235(4 − 2.815)²
     = 0.811

The units are in squared thousands of units sold.
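The variance calculation follows the same weighted-sum pattern as the mean. A sketch in Python (not part of the slides), reproducing the slide’s numbers:

```python
p = {1: 0.095, 2: 0.230, 3: 0.440, 4: 0.235}

mu = sum(s * prob for s, prob in p.items())               # 2.815
# Var[S]: probability-weighted squared distance from the mean.
var = sum(prob * (s - mu) ** 2 for s, prob in p.items())  # ~0.811
# The standard deviation puts the answer back in thousands of units.
sd = var ** 0.5                                           # ~0.90
```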

Remarks on Notation

- For the variance Var[X] of X, it is also common to use the abbreviated version V[X].
- It is common notation in statistics to use the Greek symbol σ² or σ²x, which is pronounced as “sigma squared.”

Standard deviation of a Discrete Random Variable

The standard deviation of a discrete random variable X is:

σX = √(σ²X)

The standard deviation of a random variable X is the square root of the variance of X.

Example: consider again the random variable S denoting sales next quarter.

Consider two different distributions for sales denoted by p1(s) and p2(s).

  s     p1(s)   p2(s)
  1     0.01    0.30
  2     0.10    0.30
  3     0.80    0.20
  4     0.09    0.20

[Bar chart comparing p1(s) and p2(s).]

Which distribution (p1(s) or p2(s)) has the larger expected value and/or variance? (Answers on the next slide.)

Example: If these were the distributions, what are the expected values and variances?

E1(S) = .01 ∗ 1 + .1 ∗ 2 + .8 ∗ 3 + .09 ∗ 4 = 2.97
E2(S) = .3 ∗ 1 + .3 ∗ 2 + .2 ∗ 3 + .2 ∗ 4 = 2.3

V1(S) = .01(1 − 2.97)² + .1(2 − 2.97)² + .8(3 − 2.97)² + .09(4 − 2.97)² = 0.2291
V2(S) = .3(1 − 2.3)² + .3(2 − 2.3)² + .2(3 − 2.3)² + .2(4 − 2.3)² = 1.21

(NOTE: the notation E1(S) and V1(S) are the mean and variance of the first probability distribution p1(s).)

Example: consider again the random variable S denoting sales next quarter.

Consider three more distributions for sales denoted by p3(s), p4(s), and p5(s). (NOTE: p4(s) is the same as the original distribution above.)

  s     p3(s)   p4(s)   p5(s)
  1     0.20    0.095   0.05
  2     0.30    0.230   0.20
  3     0.30    0.440   0.50
  4     0.20    0.235   0.25

[Bar chart comparing p3(s), p4(s), and p5(s).]

The means are E3(S) = 2.5, E4(S) = 2.815, E5(S) = 2.95 while the variances are V3(S) = 1.05, V4(S) = 0.811, V5(S) = 0.648.

Mean and Variance of a Bernoulli Distribution

Suppose X ∼ Bernoulli(p), then the mean and variance are

E(X) = p ∗ 1 + (1 − p) ∗ 0 = p

V(X) = p(1 − p)² + (1 − p)(0 − p)²
     = p(1 − p)[(1 − p) + p] = p(1 − p)

- For what value of p is the mean the smallest (biggest)?
- For what value of p is the variance the smallest (biggest)?
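One way to explore the two questions above is to evaluate the mean p and variance p(1 − p) over a grid. A sketch in Python (not part of the slides):

```python
# Bernoulli(p): mean is p, variance is p * (1 - p).
# Scan a grid of p values to see where the variance peaks.
ps = [i / 100 for i in range(101)]
variances = [p * (1 - p) for p in ps]
p_at_max = ps[variances.index(max(variances))]  # 0.5
```

The mean p is smallest at p = 0 and largest at p = 1; the variance is zero at both endpoints (no uncertainty) and largest at p = 0.5, where the two outcomes are equally likely.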

Final Comments on the Mean and Variance

- The sample mean, sample variance, and sample standard deviation of a set of numbers are sample statistics computed from observed data.
- The mean, variance, and standard deviation of a random variable are properties of its probability distribution, which is a mathematical model of uncertainty.
- They do share a lot of the same properties.
- The distinction between them is subtle but important for later on in the course!!

The mode of a discrete distribution

For a discrete r.v. X, the mode of its probability distribution is the most likely value.

- In other words, the mode is the outcome x that has the largest probability.
- The mode does not have to be unique because there could be multiple outcomes that share the largest probability.

The mode of a discrete distribution

Consider the two different distributions for sales S denoted by p1(s) and p2(s).

  s     p1(s)   p2(s)
  1     0.01    0.30
  2     0.10    0.30
  3     0.80    0.20
  4     0.09    0.20

[Bar chart comparing p1(s) and p2(s).]

What are the modes of the distributions p1(s) and p2(s)?

Remarks on the mode

- The mode of a probability distribution is not the same thing as the sample mode.
- The sample mode is the value that occurs most frequently in a dataset.
- For discrete numeric data, you may occasionally see the sample mode reported.

Conditional, Marginal, and Joint Distributions

Conditional, Marginal, and Joint Distributions

- What happens when there are two (or more) variables that we are uncertain about?
- How do we describe them probabilistically?
- We want to use probability to understand how two (or more) variables are related.
- In this section, we extend the results above to more than one variable.

Extending the sales example to include economic conditions

Example: consider a slightly more complicated but (potentially) more realistic version of our sales example where we also take into account the condition of the economy.

We want to think about the economy and our sales “together,” that is, jointly.

For simplicity, our model thinks of the economy next quarter as either up or down.

It is a Bernoulli random variable!

Extending the sales example to include economic conditions

Example continued:

Again, the random variable S denotes sales (in thousands of units) next quarter.

Let E denote the economy next quarter where E = 1 if the economy is up and E = 0 if it is down.

How can we think about E and S together?

Example continued:

First: what do we think will happen with the economy? Up or down?

Second: given the economy is up (down), what will happen to sales?

Suppose we know

p(E = 1) = p(Up) = 0.7

which of course implies that

p(E = 0) = p(Down) = 0.3

Our model for the economy is

E ∼ Bernoulli(0.7)

Example continued:

Question: If the economy is up, will it be more or less likely that sales will take on higher values? How can we represent this mathematically?

Example continued:

Answer: Specify two different probability distributions for S, one for each possible value that E can take on!

p(S = s|E = 1): the distribution of sales given that the economy is up.

p(S = s|E = 0): the distribution of sales given that the economy is down.

Example continued: Suppose we decide

  s     p(s|E = 1)        s     p(s|E = 0)
  1     0.05              1     0.20
  2     0.20              2     0.30
  3     0.50              3     0.30
  4     0.25              4     0.20

These are called conditional probability distributions. (NOTE: These are the same as the earlier distributions p5(s) and p3(s).)

Conditional on the economy being up (E = 1), sales of our product are more likely to be higher than when the economy is down (E = 0).

If our product is actually procyclical, then this is likely to be a better model of reality than our earlier model.

We just defined two different probability distributions for the random variable S depending on the value of E.

We can easily compute the expected value and variance of each of these distributions.

  s     p(s|E = 1)        s     p(s|E = 0)
  1     0.05              1     0.20
  2     0.20              2     0.30
  3     0.50              3     0.30
  4     0.25              4     0.20

E(S|E = 1) = .05 ∗ 1 + .2 ∗ 2 + .5 ∗ 3 + .25 ∗ 4 = 2.95
E(S|E = 0) = .2 ∗ 1 + .3 ∗ 2 + .3 ∗ 3 + .2 ∗ 4 = 2.5

These are called conditional means.

We can also compute the variances of the conditional probability distributions.

V(S|E = 1) = .05(1 − 2.95)² + .2(2 − 2.95)² + .5(3 − 2.95)² + .25(4 − 2.95)²
           = .05 ∗ 3.8025 + .2 ∗ 0.9025 + .5 ∗ 0.0025 + .25 ∗ 1.1025
           = 0.1901 + 0.1805 + 0.00125 + 0.2756 = 0.6475

V(S|E = 0) = .2(1 − 2.5)² + .3(2 − 2.5)² + .3(3 − 2.5)² + .2(4 − 2.5)²
           = .2 ∗ 2.25 + .3 ∗ 0.25 + .3 ∗ 0.25 + .2 ∗ 2.25
           = 0.45 + 0.075 + 0.075 + 0.45 = 1.05

These are called conditional variances.

Conditional means and variances

- The mean (variance) of the conditional distribution is called a conditional mean (variance).
- The distributions p(E|S) and p(S|E) are both conditional distributions.
- Both of these distributions have a conditional mean.

  E[E|S] = Σ_{all e} e ∗ p(E = e|S)
  E[S|E] = Σ_{all s} s ∗ p(S = s|E)

- The conditional mean of p(E|S) depends on the outcome of the random variable S.

Example continued:

- We’ve said what we think will happen for the economy E.
- We’ve said what we think will happen for sales S given we know E.
- What will happen for E and S jointly?

70% of the time the economy goes up, and 1/4 of those times sales = 4.

25% of 70% is 17.5%

Pr(S = 4 and E = 1) = Pr(E = 1) ∗ Pr(S = 4|E = 1)
                    = .7 ∗ .25
                    = .175

Computing joint probabilities

There are eight possible outcomes for (S, E). Following the probability tree (the marginal of E times the conditional of S given E):

P(S = 4 and E = 1) = 0.7 ∗ 0.25 = 0.175
P(S = 3 and E = 1) = 0.7 ∗ 0.50 = 0.35
P(S = 2 and E = 1) = 0.7 ∗ 0.20 = 0.14
P(S = 1 and E = 1) = 0.7 ∗ 0.05 = 0.035
P(S = 4 and E = 0) = 0.3 ∗ 0.20 = 0.06
P(S = 3 and E = 0) = 0.3 ∗ 0.30 = 0.09
P(S = 2 and E = 0) = 0.3 ∗ 0.30 = 0.09
P(S = 1 and E = 0) = 0.3 ∗ 0.20 = 0.06

When both variables are discrete, we can display the joint probability distribution of (E, S) in a table:

  (e, s)   Pr(E = e and S = s)
  (1, 4)   0.175
  (1, 3)   0.350
  (1, 2)   0.140
  (1, 1)   0.035
  (0, 4)   0.060
  (0, 3)   0.090
  (0, 2)   0.090
  (0, 1)   0.060

- There are eight possible values for the pair of random variables (E, S).
- We list the eight outcomes.
- Then, we list the probability of each outcome (which we calculated on the previous slide).
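Building the joint distribution from the marginal of E and the conditionals of S given E is a mechanical multiplication. A sketch in Python (not part of the slides):

```python
# Marginal of E and the two conditional distributions of S given E.
p_e = {0: 0.3, 1: 0.7}
p_s_given_e = {
    0: {1: 0.20, 2: 0.30, 3: 0.30, 4: 0.20},
    1: {1: 0.05, 2: 0.20, 3: 0.50, 4: 0.25},
}

# Joint: p(e, s) = p(e) * p(s | e) for each of the eight outcomes.
joint = {(e, s): p_e[e] * p_s_given_e[e][s]
         for e in p_e for s in p_s_given_e[e]}
# e.g. joint[(1, 4)] = 0.7 * 0.25 = 0.175
```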

When there are only two discrete random variables, we can also display the joint distribution of E and S in a different table.

Rows are values of E, columns are values of S.

              S
          1       2       3       4
  E  0    0.060   0.090   0.090   0.060
     1    0.035   0.140   0.350   0.175

- What is Pr(E = 1 and S = 4)?
- Answer: 0.175
- If we don’t know anything about E, what is Pr(S = 4)?
- Answer: .06 + .175 = .235 = pS(4)

Marginal distributions

What is the probability of S if we know nothing about E?

              S
          1       2       3       4
  E  0    0.060   0.090   0.090   0.060
     1    0.035   0.140   0.350   0.175
  pS(s)   0.095   0.230   0.440   0.235

- To obtain the probability distribution pS(s), we add the joint probabilities for each outcome (i.e. add downwards).
- For example:

  PS(1) = P(S = 1, E = 0) + P(S = 1, E = 1)
        = 0.06 + 0.035 = 0.095

Marginal distributions

What is the probability of E if we know nothing about S?

              S
          1       2       3       4       pE(e)
  E  0    0.060   0.090   0.090   0.060   0.300
     1    0.035   0.140   0.350   0.175   0.700
  pS(s)   0.095   0.230   0.440   0.235

- To obtain the probability distribution pE(e), we add the joint probabilities for each outcome (i.e. add sideways).
- For example:

  PE(0) = P(S = 1, E = 0) + P(S = 2, E = 0) + P(S = 3, E = 0) + P(S = 4, E = 0)
        = 0.060 + 0.090 + 0.090 + 0.060 = 0.3
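Both marginals come from the same operation on the joint table: sum out the variable you don’t care about. A sketch in Python (not part of the slides):

```python
from collections import defaultdict

# The eight joint probabilities for (e, s) from the table.
joint = {(0, 1): 0.060, (0, 2): 0.090, (0, 3): 0.090, (0, 4): 0.060,
         (1, 1): 0.035, (1, 2): 0.140, (1, 3): 0.350, (1, 4): 0.175}

# Marginals: sum the joint over the variable you are not interested in.
p_s, p_e = defaultdict(float), defaultdict(float)
for (e, s), pr in joint.items():
    p_e[e] += pr  # "add sideways" across each row
    p_s[s] += pr  # "add downwards" in each column
# p_s -> {1: 0.095, 2: 0.23, 3: 0.44, 4: 0.235}; p_e -> {0: 0.3, 1: 0.7}
```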

Marginal distributions

              S
          1       2       3       4       pE(e)
  E  0    0.060   0.090   0.090   0.060   0.300
     1    0.035   0.140   0.350   0.175   0.700
  pS(s)   0.095   0.230   0.440   0.235

- The distributions pE(e) and pS(s) are called marginal distributions.
- Why? (Notice where they appear: in the margins of the joint table.)

Conditional versus Marginals

Remember the three distributions p3(s), p4(s), and p5(s) from above?

It turns out that:

p3(s) = p(s|E = 0)
p4(s) = pS(s)
p5(s) = p(s|E = 1)

[Bar chart comparing p3(s), p4(s), and p5(s).]

- p4(s) is the marginal distribution.
- Notice that it lies “in-between” the two conditional distributions.

Conditional Probability Distribution

The conditional probability that Y turns out to be y given you know that X = x is denoted by

Pr(Y = y|X = x)

- In words, the “conditional prob. dist. of the random variable Y conditional on X is the probability that Y = y given that we know X = x.”
- A conditional probability distribution is a new probability distribution for the random variable Y given that we know X = x.

(NOTE: In our example, S was analogous to Y and E to X.)

Joint Probability Distribution

The joint probability that Y turns out to be y and that X turns out to be x is denoted by

Pr(Y = y, X = x) = Pr(Y = y and X = x)

- In words, “a joint probability distribution specifies the probability that Y = y AND X = x.”
- It describes our uncertainty over both Y and X at the same time.

(NOTE: In our example, S was analogous to Y and E to X.)

Remarks on Notation

- The notation for the conditional, marginal, and joint distributions often gets abused and may be confusing.
- For the joint distribution, you may often see: P(Y = y, X = x) = Pr(Y = y and X = x) = p(y, x).
- The order in which the variables are written does not matter. Pr(Y = y and X = x) is the same as Pr(X = x and Y = y).
- For the conditional distribution, authors often write: P(Y = y|X = x) = p(y|x)
- For the marginal distribution, we can use pX(x) or p(x) or P(X = x) as before. These are all the same.

Two Important Relationships

Relationship between Joint and Conditional

p(y, x) = p(x) ∗ p(y|x) = p(y) ∗ p(x|y)

Relationship between Joint and Marginal

p(x) = Σ_y p(y, x)
p(y) = Σ_x p(y, x)

Example: consider again the sales example with the r.v.s (S, E).

JOINT: p(4, 1) = 0.175

In words, “What’s the chance the economy is up AND sales is 4 units?”

CONDITIONAL: p(4|1) = 0.25

In words, “GIVEN you know the economy is up, what is the chance sales turns out to be 4 units?”

Example continued:

MARGINAL: p(4) = pS(4) = .235 = .175 + .06

In words, “What’s the chance sales turns out to be 4 units?”

MARGINAL: p(1) = pE(1) = .7 = .175 + .35 + .14 + .035

In words, “What’s the chance the economy will be up?”

(NOTE: This last one can be a bit confusing because p(1) is ambiguous as both E and S can take on the value 1.)

Conditionals from Joints

We derived the joint distribution of (E, S) by first considering the marginal of E and then thinking about the conditional distribution of S|E.

An alternative approach is to start with a joint distribution p(y, x) and the marginal pX(x) and then obtain the conditional distribution.

p(y, x) = pX(x) p(y|x)  =>  p(y|x) = p(y, x) / pX(x)

(Note: in the expression on the right, the denominator is the marginal probability.)

Example: given that the economy is up (E = 1), what is the probability that sales is 4?

              S
          1       2       3       4       pE(e)
  E  0    0.060   0.090   0.090   0.060   0.300
     1    0.035   0.140   0.350   0.175   0.700
  pS(s)   0.095   0.230   0.440   0.235

Using the marginal P(E = 1) and the joint probability P(S = 4, E = 1) we have

P(S = 4|E = 1) = P(S = 4, E = 1) / P(E = 1) = 0.175 / 0.7 = 0.25
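Recovering a conditional from the joint table is one division once the marginal has been summed. A sketch in Python (not part of the slides):

```python
joint = {(0, 1): 0.060, (0, 2): 0.090, (0, 3): 0.090, (0, 4): 0.060,
         (1, 1): 0.035, (1, 2): 0.140, (1, 3): 0.350, (1, 4): 0.175}

# p(y | x) = p(y, x) / p(x): divide the joint by the marginal.
p_e1 = sum(pr for (e, s), pr in joint.items() if e == 1)  # 0.7
p_s4_given_e1 = joint[(1, 4)] / p_e1                      # 0.175 / 0.7 = 0.25
```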

Example: given that sales is (S = 4), what is the probability that the economy is up?

              S
          1       2       3       4       pE(e)
  E  0    0.060   0.090   0.090   0.060   0.300
     1    0.035   0.140   0.350   0.175   0.700
  pS(s)   0.095   0.230   0.440   0.235

Using the marginal P(S = 4) and the joint probability P(S = 4, E = 1) we have

P(E = 1|S = 4) = P(S = 4, E = 1) / P(S = 4) = 0.175 / 0.235 = 0.745

(NOTE: even though we started with distributions for E and S|E, we can still calculate p(E|S).)

In general, you can compute the joint from marginals andconditionals and the other way around.

Which way you do it will depend on the problem.

Example: suppose you toss two fair coins: X is the first, Y is the second. (NOTE: X = 1 is a head.)

What is P(X = 1 and Y = 1) = P(two heads)?

I There are 4 possible outcomes for the two coins and each is equally likely, so it is 1/4.

I P(X = 1 and Y = 1) = P(X = 1)P(Y = 1|X = 1) = (1/2)(1/2) = 1/4.

91
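Both routes to the answer can be verified with a few lines of Python. This sketch (my own illustration, not from the slides) enumerates the four equally likely outcomes and compares against the marginal-times-conditional decomposition:

```python
from itertools import product

# Enumerate the 4 equally likely outcomes for two fair coins (1 = head).
outcomes = list(product([0, 1], repeat=2))
p_two_heads = sum(1 for x, y in outcomes if x == 1 and y == 1) / len(outcomes)

# Same answer via P(X=1)P(Y=1|X=1); the coins are independent,
# so P(Y=1|X=1) = P(Y=1) = 1/2.
p_via_decomposition = 0.5 * 0.5

print(p_two_heads)          # 0.25
print(p_via_decomposition)  # 0.25
```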

Bayes Theorem

92

Bayes Theorem

In many situations, you will know one conditional distribution p(y|x) and the marginal distribution pX(x), but you are really interested in the other conditional distribution p(x|y).

Given that we know p(y|x) and pX(x), can we compute p(x|y)?

93

Example: Testing for a Disease

Let D = 1 indicate you have a certain (rare) disease and let T = 1 indicate that you tested positive for it.

Suppose we know the marginal probability P(D = 1) and the conditional probabilities P(T = 1|D = 1) and P(T = 1|D = 0).

With P(D = 1) = 0.02, P(D = 0) = 0.98, P(T = 1|D = 1) = 0.95, P(T = 0|D = 1) = 0.05, P(T = 1|D = 0) = 0.01, and P(T = 0|D = 0) = 0.99, the joint probabilities are:

D = 1:   T = 1:  P(D = 1 and T = 1) = 0.02 * 0.95 = 0.019
         T = 0:  P(D = 1 and T = 0) = 0.02 * 0.05 = 0.001

D = 0:   T = 1:  P(D = 0 and T = 1) = 0.98 * 0.01 = 0.0098
         T = 0:  P(D = 0 and T = 0) = 0.98 * 0.99 = 0.9702

94

We start with info about D and T|D. But if you are the patient who tests positive for a disease, you care about P(D = 1|T = 1)!

Given that you have tested positive, what is the probability that you have the disease?

             D
          0        1
T    0   0.9702   0.001
     1   0.0098   0.019

P(D = 1|T = 1) = P(D = 1, T = 1) / P(T = 1) = 0.019 / (0.019 + 0.0098) = 0.66

95
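The disease-testing numbers above are easy to reproduce. Here is a minimal sketch (my own, using only the probabilities stated on the slide) that builds the joint probabilities for a positive test and applies Bayes' Theorem:

```python
# Numbers from the slide: P(D=1)=0.02, P(T=1|D=1)=0.95, P(T=1|D=0)=0.01.
p_D1 = 0.02
p_T1_given_D1 = 0.95
p_T1_given_D0 = 0.01

# Joint probabilities for a positive test result.
p_T1_and_D1 = p_D1 * p_T1_given_D1        # 0.019 (per the slide)
p_T1_and_D0 = (1 - p_D1) * p_T1_given_D0  # 0.0098 (per the slide)

# Bayes' Theorem: condition on T = 1.
p_D1_given_T1 = p_T1_and_D1 / (p_T1_and_D1 + p_T1_and_D0)

print(round(p_D1_given_T1, 2))  # 0.66
```

Note that even with a 95% accurate test, a positive result gives only a 66% chance of disease because the disease itself is rare.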

Bayes Theorem

Computing p(x|y) from pX(x) and p(y|x) is called Bayes' Theorem.

p(x|y) = p(y, x) / pY(y) = p(y, x) / ∑_{all x} p(y, x) = pX(x)p(y|x) / ∑_{all x} pX(x)p(y|x)

Example: (from the last slide...)

p(D = 1|T = 1) = p(T = 1|D = 1)p(D = 1) / [p(T = 1|D = 1)p(D = 1) + p(T = 1|D = 0)p(D = 0)]

96

Bayes Theorem

Suppose that 52% of the U.S. population is currently Democrat and the remainder is Republican.

Let the r.v. D = 1 if a person is a Democrat and zero otherwise.

Recently, a poll was taken asking each voter their party and whether or not they would vote for the healthcare bill.

Let the r.v. H = 1 if they would vote for the bill and zero otherwise.

The results of the poll indicated that 55% of Democrats would vote for the healthcare bill while only 10% of Republicans would.

A distant friend of yours said that if given the chance she would vote for the bill. For her, what is P(D = 1|H = 1)?

97

Bayes Theorem

We can apply Bayes' Theorem:

p(D = 1|H = 1) = p(H = 1|D = 1)p(D = 1) / [p(H = 1|D = 1)p(D = 1) + p(H = 1|D = 0)p(D = 0)]

               = (0.55)(0.52) / [(0.55)(0.52) + (0.10)(0.48)]

               = 0.286 / (0.286 + 0.048) = 0.856

98
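The same calculation generalizes to any two binary random variables. As an illustration (the helper function name is my own, not from the slides), here is the poll example wrapped in a small reusable Bayes' Theorem function:

```python
def bayes_binary(p_x1, p_y1_given_x1, p_y1_given_x0):
    """Return P(X=1 | Y=1) for binary X and Y via Bayes' Theorem.
    (Hypothetical helper, not from the slides.)"""
    numer = p_y1_given_x1 * p_x1
    denom = numer + p_y1_given_x0 * (1 - p_x1)
    return numer / denom

# Slide numbers: P(D=1)=0.52, P(H=1|D=1)=0.55, P(H=1|D=0)=0.10.
p_D1_given_H1 = bayes_binary(0.52, 0.55, 0.10)
print(round(p_D1_given_H1, 3))  # 0.856
```

The same function reproduces the disease example with bayes_binary(0.02, 0.95, 0.01).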

Many Random Variables

99

Many Random variables

As we have seen in looking at data, we often want to think about more than two variables at a time.

We can extend the approach we used with two variables.

Suppose we have three random variables (Y1,Y2,Y3).

p(y1, y2, y3) = p(y3|y2, y1)p(y2|y1)p(y1)

The joint distribution of all three variables can be broken down into the marginal and conditional distributions.

100

Sampling without Replacement

Example:

Suppose we have 10 voters. 4 are Republican and 6 are Democrat.

We randomly choose 3. Let Yi = 1 if the i-th voter chosen is a Democrat and 0 if they are Republican, for i = 1, 2, 3.

What is the probability of three Democrats?

In other words, P(Y1 = 1,Y2 = 1,Y3 = 1) = p(1, 1, 1)?

101

Sampling without Replacement

The answer is

p(Y1 = 1)p(Y2 = 1|Y1 = 1)p(Y3 = 1|Y1 = 1, Y2 = 1) = (6/10)(5/9)(4/8) = 1/6

Step 1: p(Y1 = 1) = 6/10 because 6 out of 10 voters are Democrats.

Step 2: p(Y2 = 1|Y1 = 1) = 5/9 because conditional on Y1 = 1 there are now only 9 voters remaining and 5 are Democrats.

Step 3: p(Y3 = 1|Y2 = 1, Y1 = 1) = 4/8 because conditional on Y1 = 1 and Y2 = 1 there are only 8 voters remaining and 4 are Democrats.

Key Point: If Y1 = 1 then we do not “replace” the Democrat that was chosen first (and so on). This person can’t be chosen again.

102
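The three steps above multiply a sequence of conditional probabilities, shrinking the pool by one at each draw. A minimal sketch of that product, using exact fractions so the answer matches the slide:

```python
from fractions import Fraction

# P(Y1=1, Y2=1, Y3=1) when drawing 3 voters without replacement
# from a pool of 6 Democrats and 4 Republicans.
democrats, total = 6, 10
p = Fraction(1)
for draw in range(3):
    # Conditional probability the next draw is a Democrat, given the
    # previous draws were all Democrats: one fewer Democrat and one
    # fewer voter remain after each draw.
    p *= Fraction(democrats - draw, total - draw)

print(p)  # 1/6
```

Using Fraction avoids floating-point rounding and reproduces (6/10)(5/9)(4/8) = 1/6 exactly.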

Example continued: There are a total of 8 outcomes. The logic behind how each probability is calculated is the same as on the last slide.

(y1, y2, y3)   p(y1, y2, y3)
(0, 0, 0)      1/30
(0, 0, 1)      1/10
(0, 1, 0)      1/10
(1, 0, 0)      1/10
(0, 1, 1)      1/6
(1, 0, 1)      1/6
(1, 1, 0)      1/6
(1, 1, 1)      1/6

What is the marginal distribution of Y1? Find all the outcomes where Y1 = 1 and add the probabilities.

p(Y1 = 1) = p(1, 0, 0) + p(1, 0, 1) + p(1, 1, 0) + p(1, 1, 1)

          = 1/10 + 1/6 + 1/6 + 1/6 = 6/10

103
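The table of 8 outcomes and the marginalization can be built programmatically. This sketch (my own illustration of the slide's logic) computes each joint probability by multiplying the sequential conditionals, then sums over the outcomes with y1 = 1:

```python
from fractions import Fraction
from itertools import product

def joint_prob(outcome, n_dem=6, n_rep=4):
    """Joint probability of an outcome (y1, y2, y3) when sampling
    without replacement; each draw's conditional probability uses
    the counts remaining after the earlier draws."""
    dem, rep = n_dem, n_rep
    p = Fraction(1)
    for y in outcome:
        if y == 1:
            p *= Fraction(dem, dem + rep)
            dem -= 1
        else:
            p *= Fraction(rep, dem + rep)
            rep -= 1
    return p

# The full joint distribution over all 8 outcomes.
joint = {o: joint_prob(o) for o in product([0, 1], repeat=3)}

# Marginal P(Y1 = 1): add the probabilities of outcomes with y1 = 1.
p_Y1 = sum(p for o, p in joint.items() if o[0] == 1)
print(p_Y1)                # 3/5, i.e. 6/10
print(sum(joint.values())) # 1 (sanity check: probabilities sum to one)
```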

Many Random variables

Above, we had three random variables (Y1,Y2,Y3).

Then, we decomposed their joint distribution as

p(y1, y2, y3) = p(y3|y2, y1)p(y2|y1)p(y1)

This is true for as many variables as you want.

p(y1, y2, . . . , yn) = p(yn|yn−1, yn−2, . . . , y2, y1) . . . p(y3|y2, y1)p(y2|y1)p(y1)

This is important because it allows us to extend our results to n random variables.

104