
Chapter 4: Probability Distributions

4.1 Random Variables

A random variable is a function X that assigns a numerical value x to each possible outcome in the sample space.

An event can be associated with a single value of the random variable, or it can be associated with a range of values of the random variable. The probability of an event can then be described as

P(A) = P(X = x_i)  or  P(A) = P(x_l ≀ X ≀ x_u)

Other sets of values of the random variable could also describe an event. If x_i, i = 1, 2, β‹―, N are all the possible values of the random variable associated with the sample space, then

Ξ£_{i=1}^{N} P(X = x_i) = 1

e.g. Each (composite) outcome consists of 3 ratings (M, P, C), where M is one of M1, M2; P is one of P1, P2; and C is one of C1, C2, C3. Let M1, P1 and C1 be the preferred ratings. Let X be the function that assigns to each outcome the number of preferred ratings each outcome possesses.

Since each outcome has a probability, we can compute the probability of getting each value x = 0, 1, 2, 3 of the function X. Of the 12 (M, P, C) combinations, the first nine probabilities, with the X value of each outcome, are

probability | 0.03  0.06  0.07  0.02  0.01  0.01  0.09  0.16  0.01  …
x           | 3     2     2     2     1     1     2     1     1     …

Summing the probabilities over the outcomes sharing each value of x gives

x | P(X = x)
3 | 0.03
2 | 0.29
1 | 0.50
0 | 0.18

Random variables X can be classified by the number of values x they can assume. The two common types are
discrete random variables, with a finite or countably infinite number of values
continuous random variables, having a continuum of values for x

1. A value of a random variable may correspond to several random events.
2. An event may correspond to a range of values (or ranges of values) of a random variable.
3. But a given value (in its legal range) of a random variable corresponds to a random event.
4. Different values of the random variable correspond to mutually exclusive random events.
5. Each value of a random variable has a corresponding probability.
6. All possible values of a random variable correspond to the entire sample space.
7. The summation of probabilities corresponding to all values of a random variable must equal unity.

A fundamental problem is to find the probability of occurrence for each possible value x of the random variable X:

P(X = x) = Ξ£ P(A),  summed over all outcomes A assigned value x

This is the problem of identifying the probability distribution for a random variable. The probability distribution of a discrete random variable X can be listed as a table of the possible values x together with the probability P(X = x) for each, e.g.

x1 | P(X = x1)
x2 | P(X = x2)
x3 | P(X = x3)
…

It is standard notation to refer to the values P(X = x) of the probability distribution by f(x):

f(x) ≑ P(X = x)

The probability distribution always satisfies the conditions

f(x) β‰₯ 0  and  Ξ£_{all x} f(x) = 1

e.g. Can f(x) = (x βˆ’ 2)/2 for x = 1, 2, 3, 4 serve as a probability distribution? No: f(1) = βˆ’1/2 < 0 violates the first condition.

e.g. Can f(x) = xΒ²/25 for x = 0, 1, 2, 3, 4? No: Ξ£_{all x} f(x) = 30/25 β‰  1 violates the second condition.

Since the probability distribution for a discrete random variable is a tabular list, it can also be represented as a histogram, the probability histogram.

For a discrete random variable, the height of the bin at value x is f(x); the width of the bin is meaningless. The probability histogram for a discrete random variable is commonly drawn either with touching bins or in Pareto style (also referred to as a bar chart).

[Figure: probability histogram f(x) for the number of preferred ratings]

Of course one can also compute the cumulative distribution function (or cumulative probability function)

𝐹 π‘₯ = 𝑃 𝑋 ≀ π‘₯ for all βˆ’ ∞ ≀ π‘₯ ≀ ∞ and plot it in the ways learned in chapter 2 (with consideration that the x-axis is not continuous but discrete).

We now start to discuss the probability distributions for many discrete random variables that occur in nature

F(x) for number preferred ratings

4.2 Binomial Distribution

Bernoulli distribution: In probability theory and statistics, the Bernoulli distribution, named after Swiss scientist Jacob Bernoulli, is a discrete probability distribution, which takes value 1 with success probability 𝑝 and value 0 with failure probability π‘ž = 1 βˆ’ 𝑝 . So if X is a random variable with this distribution, we have:

P(X = 1) = p;  P(X = 0) = q = 1 βˆ’ p.

Mean and variance of a random variable X:

(1) Mean (mathematical expectation, expectation, average, etc.):

ΞΌ = xΜ„ = E[X] = Ξ£_i x_i P(X = x_i)

(2) Variance:

Var(X) = E[(X βˆ’ ΞΌ)Β²] = σ² = Ξ£_i (x_i βˆ’ ΞΌ)Β² P(X = x_i)

Οƒ is called the standard deviation. For a random variable with the Bernoulli distribution, we have

ΞΌ = E[X] = p
Var(X) = σ² = (1 βˆ’ p)Β² p + pΒ² q = qΒ² p + pΒ² q = pq(p + q) = pq

Binomial Distribution: We can refer to the ordered sequence of length n as a series of n repeated trials, where each trial produces a result that is either β€œsuccess” or β€œfailure”. We are interested in the random variable that reports the number x of successes in n trials. Each trial is a Bernoulli trial, which satisfies:
a) there are only two outcomes for each trial
b) the probability of success is the same for each trial
c) the outcomes for different trials are independent

We are talking about the events 𝐴𝑖 in the sample space S where 𝐴1= s _ _ _ _ …. _; 𝐴2= _ s _ _ _ …. _; 𝐴3= _ _ s _ _ …. _; … ; 𝐴𝑛= _ _ _ _ _ …. s; where by b) P(𝐴1) = P(𝐴2) = … = P(𝐴𝑛) and by c) P(𝐴𝑖 ∩ 𝐴𝑗) = P(𝐴𝑖) Β· P(𝐴𝑗) for all distinct pairs i , j

e.g. police roadblock checking for drivers who are wearing seatbelts. Condition a): two outcomes, β€œy” or β€œn”. Conditions b) & c): if the events A1 to An contain all cars stopped, then b) and c) will be satisfied. If, however, event A1 is broken into two (mutually exclusive) sub-events: A1<, which is all outcomes s _ _ _ … _ in which driver 1 is less than 21, and A1>, which is all outcomes s _ _ _ … _ in which driver 1 is 21 or older,

it is entirely likely that P(𝐴1<) β‰  P(𝐴1>), and we would not be dealing with Bernoulli trials.

If someone caught not wearing a seatbelt began to warn oncoming cars approaching the roadblock, then P(Ai ∩ Aj) β‰  P(Ai) Β· P(Aj) for all i, j pairs, and we would also not be dealing with Bernoulli trials.

Note that in our definition of Bernoulli trials the number of trials n is fixed in advance

All Bernoulli trials of length n have the same probability distribution! (This is a consequence of the assumptions behind the definition of Bernoulli trials.) This probability distribution is called the Binomial probability distribution for n. (It is called this because each trial has one of two outcomes, β€œs” or β€œf”, and the sequences generated (the composite outcomes) are binomial sequences.)

e.g. Binomial probability distribution for n = 3. The sample space has 2Β³ = 8 outcomes

sss  ssf  sfs  fss  sff  fsf  ffs  fff

with RV values 3 for sss; 2 for ssf, sfs, fss; 1 for sff, fsf, ffs; and 0 for fff. For p = Β½: P(sss) = Β½ Β· Β½ Β· Β½ = 1/8; P(ssf) = Β½ Β· Β½ Β· (1βˆ’Β½) = 1/8; P(fsf) = (1βˆ’Β½) Β· Β½ Β· (1βˆ’Β½) = 1/8; etc.

Probability distribution:

x    | 0    1    2    3
f(x) | 1/8  3/8  3/8  1/8

Equivalently,

f(0) = C(3,0) (Β½)^0 (1βˆ’Β½)^3 = 1/8
f(1) = C(3,1) (Β½)^1 (1βˆ’Β½)^2 = 3/8
f(2) = C(3,2) (Β½)^2 (1βˆ’Β½)^1 = 3/8
f(3) = C(3,3) (Β½)^3 (1βˆ’Β½)^0 = 1/8

where C(n, x) = n!/((nβˆ’x)! x!) is the binomial coefficient. From this example, we see that the binomial probability distribution, which governs Bernoulli trials of length n, is

f(x) ≑ b(x; n, p) = C(n, x) p^x (1 βˆ’ p)^{nβˆ’x}    (BPD)

where p is the (common) probability of success in any trial, and x = 0, 1, 2, …., n
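The (BPD) formula translates directly into code. A minimal Python sketch (the function name binom_pmf is our own), using math.comb for C(n, x); it reproduces the n = 3, p = Β½ table above:

from math import comb

def binom_pmf(x, n, p):
    """b(x; n, p): probability of x successes in n Bernoulli trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# n = 3, p = 1/2 gives the table above: [0.125, 0.375, 0.375, 0.125]
print([binom_pmf(x, 3, 0.5) for x in range(4)])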

Note: 1. The term on the RHS of (BPD) is the x’th term of the binomial expansion of [p + (1 βˆ’ p)]^n, i.e.

[p + (1 βˆ’ p)]^n = Ξ£_{x=0}^{n} C(n, x) p^x (1 βˆ’ p)^{nβˆ’x}

which also proves that

Ξ£_{x=0}^{n} C(n, x) p^x (1 βˆ’ p)^{nβˆ’x} = 1^n = 1

2. (BPD) is a 2-parameter family of distribution functions characterized by choice of n and p.

e.g. In 60% of all solar-heat installations, the utility bill is reduced by at least 1/3. What is the probability that the utility bill will be reduced by at least 1/3 in

a) 4 of 5 installations? b) at least 4 of 5 installations?

a) β€œs” = β€œat least 1/3” (i.e. 1/3 or greater); β€œf” = β€œless than 1/3”. P(Ai) = p = 0.6. Assume c) of the Bernoulli trial assumptions holds.

Then f(4) = b(4; 5, 0.6) = C(5,4) (0.6)^4 (0.4)^1 = 0.259

b) We want f(4) + f(5) = b(4; 5, 0.6) + b(5; 5, 0.6) = C(5,4) (0.6)^4 (0.4)^1 + C(5,5) (0.6)^5 (0.4)^0 = 0.259 + 0.078 = 0.337

[Figure: examples of binomial distributions for various n and p]

Cumulative binomial probability distribution

B(x; n, p) ≑ Ξ£_{k=0}^{x} b(k; n, p)    (CBPD)

is the probability of x or fewer successes in n Bernoulli trials, where p is the probability of success on each trial. From (CBPD) we see

b(x; n, p) = B(x; n, p) βˆ’ B(x βˆ’ 1; n, p)

Values of 𝑩 𝒙;𝒏, 𝒑 are tabulated for various n and p values in Table 1 of Appendix B

[Figure: cumulative binomial distribution]

e.g. The probability is 0.05 that a column flange fails under a given load L. What is the probability that, among 16 columns, a) at most 2 will fail, b) at least 4 will fail?

a) B(2; 16, 0.05) = b(0; 16, 0.05) + b(1; 16, 0.05) + b(2; 16, 0.05)

b) 1.0 βˆ’ B(3; 16, 0.05)

e.g. Claim: the probability of repair for a hard drive within 12 months is 0.10. Preliminary data show 5 of 20 hard drives required repair in the first 12 months after manufacture. Does the initial production run support the claim?

β€œs” = repair within 12 months. p = 0.10. Assume Bernoulli trials. 1.0 βˆ’ B(4; 20, 0.10) = 0.0432 is the probability of seeing 5 or more hard drives requiring repair in 12 months. This says that in only 4% of all year-long periods (i.e. in roughly 1 year out of 25) should one see 5 or more hard drives needing repair. The fact that we saw this happen in the very first year makes us suspicious of the manufacturer's claim (but does NOT prove that the manufacturer's claim is wrong!).
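Both of these cumulative-binomial answers can be computed directly rather than read from Table 1 of Appendix B. A minimal Python sketch (the helper name binom_cdf is our own):

from math import comb

def binom_cdf(x, n, p):
    """B(x; n, p) = sum of b(k; n, p) for k = 0..x."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

print(binom_cdf(2, 16, 0.05))        # flange example (a): about 0.957
print(1.0 - binom_cdf(3, 16, 0.05))  # flange example (b): about 0.007
print(1.0 - binom_cdf(4, 20, 0.10))  # hard-drive example: about 0.0432, as in the text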

Shape of binomial probability histograms, e.g. b(x; 5, p):

[Figure: b(x; 5, p) histograms β€” positively skewed, symmetric, negatively skewed]

b(x; n, 0.5) is always symmetric; b(x; n, p) is positively skewed for p < 0.5 (tail on the positive side) and negatively skewed for p > 0.5 (tail on the negative side). The symmetry for p = 0.5 follows from

b(x; n, 0.5) = b(n βˆ’ x; n, 0.5)

4.3 Hypergeometric probability distribution

In Bernoulli trials, one can get β€œs” with probability p and β€œf” with probability 1βˆ’p in every trial (i.e. Bernoulli trials can be thought of as β€œsample with replacement”)

Consider a variation of the problem in which there is a total of only a outcomes available that are successes (have RV value β€œs”) and N βˆ’ a outcomes that are failures. (e.g. there are N radios; a of them are defective and N βˆ’ a of them work.)

We want to run n trials, (e.g. in each trial we pick a radio), but outcomes are sampled without replacement (that is, once a radio is picked, it is no longer available to be picked again).

As we run each trial, we assume that whatever outcomes are left, whether having RV value β€œs” or β€œf”, have the same chance of being selected in the next trial (i.e. we are assuming classical probability, where the chance of picking a particular value of the RV is in proportion to the number of outcomes that have that RV value).

Thus, for x ≀ a, the probability of getting x successes in n trials, when a of the N outcomes are successes, is

(the number of n-arrangements (permutations) having x successes and n βˆ’ x failures) / (the number of n-arrangements (permutations) of N things)

That is, over the trial slots _ _ _ _ _ … _ numbered 1, 2, 3, 4, 5, …, n:

pick which x of the trials hold successes: C(n, x) ways
pick x of the a success outcomes and arrange them in all possible ways in those x trials: P(a, x) ways
pick n βˆ’ x of the N βˆ’ a failure outcomes and arrange them in all possible ways in the remaining n βˆ’ x trials: P(N βˆ’ a, n βˆ’ x) ways
total possible n-arrangements of N things: P(N, n)

where P(m, k) = m!/(m βˆ’ k)! counts permutations. Therefore

f(x) = C(n, x) P(a, x) P(N βˆ’ a, n βˆ’ x) / P(N, n)

i.e.

f(x) = [n!/((n βˆ’ x)! x!)] Β· [a!/(a βˆ’ x)!] Β· [(N βˆ’ a)!/(N βˆ’ a βˆ’ (n βˆ’ x))!] / [N!/(N βˆ’ n)!]

     = [a!/((a βˆ’ x)! x!)] Β· [(N βˆ’ a)!/((N βˆ’ a βˆ’ (n βˆ’ x))! (n βˆ’ x)!)] / [N!/((N βˆ’ n)! n!)]

     = C(a, x) C(N βˆ’ a, n βˆ’ x) / C(N, n)

This defines the hypergeometric probability distribution

h(x; n, a, N) = C(a, x) C(N βˆ’ a, n βˆ’ x) / C(N, n),  x = 0, 1, 2, …, a;  n ≀ N

e.g. PC has 20 identical car chargers, 5 are defective. PC will randomly ship 10. What is the probability that 2 of those shipped will be defective?

h(2; 10, 5, 20) = C(5, 2) C(15, 8) / C(20, 10)

= [5!/(3! 2!)] Β· [15!/(7! 8!)] / [20!/(10! 10!)]

= (10 Β· 6435) / 184756 = 0.348

e.g. redo using 100 car chargers and 25 defective

h(2; 10, 25, 100) = C(25, 2) C(75, 8) / C(100, 10) = 0.292

e.g. approximate this using the binomial distribution

b(2; 10, p β‰ˆ 25/100) = C(10, 2) (0.25)^2 (0.75)^8 = 0.282

The hypergeometric distribution h(x; n, a, N) approaches the binomial distribution b(x; n, p = a/N) in the limit N β†’ ∞,

i.e. the binomial distribution can be used to approximate the hypergeometric distribution when n ≀ N/10.
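A short Python sketch illustrating this limit (the function names are our own): hold n = 10 and a/N = 1/4 fixed while N grows, and compare with b(2; 10, 0.25):

from math import comb

def h(x, n, a, N):
    return comb(a, x) * comb(N - a, n - x) / comb(N, n)

def b(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Keep a/N = 0.25 fixed and let N grow: h(2; 10, a, N) approaches b(2; 10, 0.25).
for N in (20, 100, 1000, 10000):
    print(N, round(h(2, 10, N // 4, N), 4), round(b(2, 10, 0.25), 4))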

4.4 Mean and Variance of a Probability Distribution

Consider the values x1, x2, β‹―, xn. As discussed in Chapter 2, the sample mean is

xΜ„ = (Ξ£_{i=1}^{n} x_i)/n = Ξ£_{i=1}^{n} x_i Β· (1/n)

We can view each term in the RHS as x_i Β· f(x_i), where f(x_i) = 1/n is the probability associated with each value (each value appears once in the list, and each is equally likely).

Let X be a discrete random variable having values x1, x2, β‹―, xn, with probabilities f(x_i). The mean value of the RV, a.k.a. the mean value of the probability distribution, is

ΞΌ = Ξ£_{all x} x Β· f(x)

e.g. Mean value for the probability distribution of the number of heads obtained in 3 flips of a coin.

There are 2Β³ = 8 outcomes. The RV β€œnumber of heads in 3 flips” has 4 possible values: 0, 1, 2, and 3 heads, with probabilities f(0) = 1/8, f(1) = 3/8, f(2) = 3/8, f(3) = 1/8. Therefore the mean value is

ΞΌ = 0Β·(1/8) + 1Β·(3/8) + 2Β·(3/8) + 3Β·(1/8) = 3/2

The mean value for the Binomial distribution:

ΞΌ = Ξ£_{x=0}^{n} x Β· b(x; n, p) = Ξ£_{x=0}^{n} x Β· C(n, x) p^x (1 βˆ’ p)^{nβˆ’x}

= Ξ£_{x=1}^{n} x Β· [n!/((n βˆ’ x)! x!)] p^x (1 βˆ’ p)^{nβˆ’x}

= Ξ£_{x=1}^{n} [n (n βˆ’ 1)!/((n βˆ’ x)! (x βˆ’ 1)!)] p Β· p^{xβˆ’1} (1 βˆ’ p)^{nβˆ’x}

= n p Ξ£_{x=1}^{n} [(n βˆ’ 1)!/((n βˆ’ x)! (x βˆ’ 1)!)] p^{xβˆ’1} (1 βˆ’ p)^{nβˆ’x}

Let y = x βˆ’ 1 and m = n βˆ’ 1:

ΞΌ = n p Ξ£_{y=0}^{m} [m!/((m βˆ’ y)! y!)] p^y (1 βˆ’ p)^{mβˆ’y} = n p [p + (1 βˆ’ p)]^m = n p Β· 1^m

The mean value for the binomial distribution b(x; n, p) is ΞΌ = n p

e.g. Since the RV β€œnumber of heads in three tosses” is a Bernoulli trial RV with p = 0.5, its mean value must be n p = 3 Β·Β½ = 3/2 as shown on the previous slide.

The mean value of the hypergeometric distribution 𝒉(𝒙; 𝒏, 𝒂, 𝑡) is given by

πœ‡ = 𝑛 βˆ™π‘Ž

𝑁

(This is β€œeasy” to remember. The formula is similar to the binomial distribution if one β€œrecognizes” 𝑝 = π‘Ž 𝑁 as the hypergeometric probability in the limit of large N.)

e.g. PC has 20 identical car charges, 5 are defective. PC will randomly ship 10. On average (over many trials of shipping 10), how many defective car chargers will be included in the order.

We want the mean of β„Ž(π‘₯; 10,5,20). The mean value is ΞΌ = 10 Β· 5/20 = 2.5

Recall from Chapter 2 that the sum of the sample deviations Ξ£_{i=1}^{n} (x_i βˆ’ xΜ„) = 0.

If ΞΌ is the mean of the probability distribution f(x), then note that

Ξ£_{all x} (x βˆ’ ΞΌ) Β· f(x) = Ξ£_{all x} x Β· f(x) βˆ’ ΞΌ Ξ£_{all x} f(x) = ΞΌ βˆ’ ΞΌ = 0

Therefore, in analogy to the sample variance defined in Chapter 2, we define the variance of the probability distribution f(x) as

𝜎2 = π‘₯ βˆ’ πœ‡ 2 βˆ™ 𝑓(π‘₯)π‘Žπ‘™π‘™ π‘₯

Similarly we define the standard deviation of the probability distribution f(x) as

𝜎 = 𝜎2 = π‘₯ βˆ’ πœ‡ 2 βˆ™ 𝑓(π‘₯)

π‘Žπ‘™π‘™ π‘₯

The variance for the binomial distribution 𝑏(π‘₯; 𝑛, 𝑝)

σ² = n Β· p Β· (1 βˆ’ p) = ΞΌ Β· (1 βˆ’ p)

e.g. The standard deviation for the number of heads in 3 flips of a coin is

Οƒ = √(3 Β· Β½ Β· (1 βˆ’ Β½)) = √(3/4) = √3/2 = 0.866

The variance for the hypergeometric distribution is

σ² = n Β· (a/N) Β· (1 βˆ’ a/N) Β· (N βˆ’ n)/(N βˆ’ 1)

e.g. The standard deviation for the number of defective car chargers in shipments of 10 is

Οƒ = √(10 Β· (5/20) Β· (1 βˆ’ 5/20) Β· (20 βˆ’ 10)/(20 βˆ’ 1)) = √(75/76) = 0.99

The factor (N βˆ’ n)/(N βˆ’ 1) β†’ 1 as N β†’ ∞, recovering the binomial variance with p = a/N.

The moments of a probability distribution: The k’th moment about the origin (usually just called the k’th moment) of a probability distribution is defined as

ΞΌ'_k = Ξ£_{all x} x^k Β· f(x)

Note: the mean of a probability distribution is the 1st moment (about the origin).

The k’th moment about the mean of a probability distribution is defined as

ΞΌ_k = Ξ£_{all x} (x βˆ’ ΞΌ)^k Β· f(x)

Notes:
the 1st moment about the mean is ΞΌ_1 = 0
the 2nd moment about the mean, ΞΌ_2, is the variance
ΞΌ_3/σ³ (built from the 3rd moment about the mean) is the skewness (describes the symmetry)
ΞΌ_4/σ⁴ (built from the 4th moment about the mean) is the kurtosis (describes the β€œpeakedness”)

Note:

σ² = Ξ£_{all x} (x βˆ’ ΞΌ)Β² Β· f(x) = Ξ£_{all x} (xΒ² βˆ’ 2xΞΌ + ΞΌΒ²) f(x)

= Ξ£_{all x} xΒ² f(x) βˆ’ 2ΞΌ Ξ£_{all x} x f(x) + ΞΌΒ² Ξ£_{all x} f(x)

= ΞΌ'_2 βˆ’ 2ΞΌΒ² + ΞΌΒ²

Therefore we have the result

σ² = ΞΌ'_2 βˆ’ ΞΌΒ²

Since computation of ΞΌ'_2 and ΞΌ does not involve squaring differences within the sum, this route can be more straightforward.

e.g. Consider the R.V. which is the number of points obtained on a single roll of a die. The R.V. has values 1,2,3,4,5,6. What is the variance of the probability distribution behind this RV?

The probability distribution is f(x) = 1/6 for each x. Therefore the mean is

πœ‡ = 1 βˆ™1

6+ 2 βˆ™1

6+ 3 βˆ™1

6+ 4 βˆ™1

6+ 5 βˆ™1

6+ 6 βˆ™1

6= 6 βˆ™ 7

2 βˆ™ 6=7

2

The second moment about the origin is

πœ‡2β€² = 12 βˆ™

1

6+ 22 βˆ™

1

6+ 32 βˆ™

1

6+ 42 βˆ™

1

6+ 52 βˆ™

1

6+ 62 βˆ™

1

6=91

6

Therefore 𝜎2 = 91

6βˆ’49

4= 35

12
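A quick numerical check of the shortcut σ² = ΞΌ'_2 βˆ’ ΞΌΒ² for the die (a Python sketch under the same {value: probability} convention used above):

die = {x: 1/6 for x in range(1, 7)}

mu      = sum(x * f for x, f in die.items())      # 7/2
mu2_raw = sum(x**2 * f for x, f in die.items())   # 91/6, the 2nd moment about the origin

print(mu2_raw - mu**2)   # 2.9166... = 35/12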

4.5 Chebyshev’s Theorem

Theorem 4.1 If a probability distribution has mean ΞΌ and standard deviation Οƒ, then the probability of getting a value that deviates from ΞΌ by at least kΟƒ is at most 1/kΒ², i.e. the probability of getting a result x such that |x βˆ’ ΞΌ| β‰₯ kΟƒ satisfies

P(|x βˆ’ ΞΌ| β‰₯ kΟƒ) ≀ 1/kΒ²

Chebyshev’s theorem quantifies the statement that the probability of getting a result x decreases as x moves further away from ΞΌ.

Note: k can be any positive number (it does not have to be an integer).

Corollary 4.1 If a probability distribution has mean ΞΌ and standard deviation Οƒ, then the probability of getting a value that deviates from ΞΌ by at most kΟƒ is at least 1 βˆ’ 1/kΒ²:

P(|x βˆ’ ΞΌ| ≀ kΟƒ) β‰₯ 1 βˆ’ 1/kΒ²

e.g. The number of customers who visit a car dealer’s showroom on a Saturday morning is an RV with mean 18 and standard deviation 2.5. With what probability can we assert there will be more than 8 but fewer than 28 customers?

This problem sets kΟƒ = 10, making k = 4. Thus

P(|x βˆ’ 18| ≀ 4 Β· 2.5) β‰₯ 1 βˆ’ 1/4Β² = 15/16

Chebyshev’s theorem holds for all probability distributions, but it works better for some than for others (gives a β€œsharper” estimate).
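One way to see this β€œsharpness” point concretely is to compare the bound with an exact probability for a specific distribution. A Python sketch (the choice of b(x; 100, 0.5) as the test distribution is our own illustration):

from math import comb, sqrt

n, p = 100, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))   # mu = 50, sigma = 5

k = 2
# Exact P(|X - mu| >= k*sigma) by summing the binomial pmf over both tails.
tail = sum(comb(n, x) * p**x * (1 - p)**(n - x)
           for x in range(n + 1) if abs(x - mu) >= k * sigma)

print(tail, "<=", 1 / k**2)   # about 0.057 <= 0.25: the bound holds, but is loose here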

4.6 Poisson distribution

Consider the binomial distribution

𝑏 π‘₯; 𝑛, 𝑝 = 𝑛π‘₯𝑝π‘₯(1 βˆ’ 𝑝)π‘›βˆ’π‘₯

Write 𝑝 as 𝑝 = Ξ»/𝑛 where Ξ» is a constant. In the limit 𝑛 β†’ ∞, the𝑛 𝑝 β†’ 0 and the binomial distribution becomes the Poisson probability distribution

𝑓 π‘₯; Ξ» = Ξ»π‘₯π‘’βˆ’Ξ»

π‘₯! for π‘₯ = 0, 1, 2, 3, …

As derived, the Poisson distribution describes the probability distribution for an infinite (in practice very large) number of Bernoulli trials when the probability of success in each trial is vanishingly small (in practice – very small).

Since the Poisson random variable has a countably infinite number of possible values, we technically have to modify the third axiom (property) that probabilities must obey to include such sample spaces. The third axiom stated that the probability function is an additive set function. The appropriate modification is

Axiom 3’ If 𝐴1, 𝐴2, 𝐴3, β‹― is a countably infinite sequence of mutually exclusive events in S, then

P(A1 βˆͺ A2 βˆͺ A3 βˆͺ β‹―) = P(A1) + P(A2) + P(A3) + β‹―

Note that the Poisson distribution satisfies Ξ£_{all x} f(x; Ξ») = 1. Proof:

Ξ£_{x=0}^{∞} Ξ»^x e^{βˆ’Ξ»} / x! = e^{βˆ’Ξ»} Ξ£_{x=0}^{∞} Ξ»^x / x! = e^{βˆ’Ξ»} e^{Ξ»} = 1

using the Taylor series expansion of e^{Ξ»}.

The cumulative Poisson distribution F(x; Ξ») = Ξ£_{k=0}^{x} f(k; Ξ») is tabulated for select values of x and Ξ» in Appendix B (Table 2).

e.g. 5% of bound books have defective bindings. What is the probability that 2 out of 100 books will have defective bindings using (a) the binomial distribution, (b) the Poisson distribution as an approximation

(a) b(2; 100, 0.05) = C(100, 2) (0.05)^2 (0.95)^98 = 0.081

(b) Ξ» = 0.05 Β· 100 = 5. f(2; 5) = 5Β² e^{βˆ’5} / 2! = 0.084
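A Python sketch reproducing this comparison (the function names are our own):

from math import comb, exp, factorial

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)

print(binom_pmf(2, 100, 0.05))   # 0.081 (exact binomial)
print(poisson_pmf(2, 5.0))       # 0.084 (Poisson approximation, lambda = np = 5)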

e.g. There are 3,840 generators. The probability is 1/1,200 that any one will fail in a year. What is the probability of finding 0, 1, 2, 3, 4, … failures in any given year

Ξ» = 3840/1200 = 3.2. We want the probabilities f(0; 3.2), f(1; 3.2), f(2; 3.2), etc. Using the property f(x; Ξ») = F(x; Ξ») βˆ’ F(x βˆ’ 1; Ξ») we can compute these probabilities from Table 2 of Appendix B:

x         | 0     1     2     3     4     5     6     7     8
f(x; 3.2) | 0.041 0.130 0.209 0.223 0.178 0.114 0.060 0.028 0.011

The mean value for the Poisson probability distribution is ΞΌ = Ξ». The variance for the Poisson probability distribution is σ² = Ξ»,

i.e. the standard deviation for the Poisson distribution is Οƒ = βˆšΞ».

Proof for mean:

ΞΌ = Ξ£_{x=0}^{∞} x Β· Ξ»^x e^{βˆ’Ξ»} / x! = Ξ» e^{βˆ’Ξ»} Ξ£_{x=1}^{∞} Ξ»^{xβˆ’1} / (x βˆ’ 1)!

Let y = x βˆ’ 1:

ΞΌ = Ξ» e^{βˆ’Ξ»} Ξ£_{y=0}^{∞} Ξ»^y / y! = Ξ» e^{βˆ’Ξ»} e^{Ξ»} = Ξ»

The average Ξ» is usually approximated by running many long (but finite) trials.

e.g. An average of 1.3 gamma rays per millisecond is recorded coming from a radioactive substance. Assuming the RV β€œnumber of gamma rays per millisecond” has a probability distribution that is Poisson (i.e., is a Poisson process), what is the probability of seeing 1 or more gamma rays in the next millisecond?

Ξ» = 1.3. We want P(X β‰₯ 1) = 1.0 βˆ’ P(X = 0) = 1.0 βˆ’ 1.3^0 e^{βˆ’1.3} / 0! = 1.0 βˆ’ e^{βˆ’1.3} = 0.727

4.7 Poisson Processes

Consider a random process (a physical process controlled, wholly or in part, by a chance mechanism) in time. To find the probability of the process generating x successes over a time interval T, divide T into n equal intervals βˆ†t = T/n (n is large, βˆ†t is small). Assume the following hold:
1. The probability of success during βˆ†t is Ξ± βˆ†t
2. The probability of more than one success during βˆ†t is negligible
3. The probability of success during each time interval βˆ†t does not depend on what happened in a prior interval.

These assumptions describe Bernoulli trials, with n = T/βˆ†t and p = Ξ± βˆ†t, and the probability of x successes in n intervals is b(x; T/βˆ†t, Ξ± βˆ†t).

As n β†’ ∞, p β†’ 0 (as βˆ†t β†’ 0) and the probability of x successes is governed by the Poisson probability distribution with Ξ» = n p = Ξ±T.

Since λ is the mean (average) number of successes over time T, we see that 𝜢 is the mean number of successes per unit time.

e.g. A bank receives, on average, 6 bad checks per day. What are the probabilities it will receive (a) 4 bad checks on a given day (b) 10 bad checks over a 2 day period

(a) Ξ± = 6, Ξ» = 6 Β· 1 = 6. Therefore f(4; 6) = 6^4 e^{βˆ’6} / 4! = 0.134

(b) Ξ± = 6, Ξ» = 6 Β· 2 = 12. Therefore f(10; 12) = 12^10 e^{βˆ’12} / 10! = F(10; 12) βˆ’ F(9; 12) = 0.105

e.g. a process generates 0.2 imperfections per minute. Find probabilities of (a) 1 imperfection in 3 minutes (b) at least 2 imperfections in 5 minutes (c) at most 1 imperfection in 15 minutes

(a) Ξ» = 0.2 Β· 3 = 0.6. Want f(1; 0.6) = F(1; 0.6) βˆ’ F(0; 0.6)

(b) Ξ» = 0.2 Β· 5 = 1.0. Want 1.0 βˆ’ F(1; 1.0)

(c) Ξ» = 0.2 Β· 15 = 3.0. Want F(1; 3.0)
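These three values can be computed directly instead of being read from Table 2. A Python sketch (the helper name poisson_cdf is our own):

from math import exp, factorial

def poisson_cdf(x, lam):
    """F(x; lambda) = sum of f(k; lambda) for k = 0..x."""
    return sum(lam**k * exp(-lam) / factorial(k) for k in range(x + 1))

print(poisson_cdf(1, 0.6) - poisson_cdf(0, 0.6))  # (a) f(1; 0.6), about 0.329
print(1.0 - poisson_cdf(1, 1.0))                  # (b) about 0.264
print(poisson_cdf(1, 3.0))                        # (c) about 0.199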

4.8 Geometric and Negative Binomial Distributions

Consider the sample space of outcomes for countably infinite Bernoulli trials (i.e. the three Bernoulli assumptions hold). In particular, β€œs” occurs with probability p and β€œf” with probability 1 βˆ’ p. We want to know the probability that the first success occurs on the x’th trial.

Divide the sample space into the following events

A1: s _ _ _ _ …   AΜ„1: f _ _ _ _ …   A1 βˆͺ AΜ„1 = S
A2: f s _ _ _ …   AΜ„2: f f _ _ _ …   A2 βˆͺ AΜ„2 = AΜ„1
A3: f f s _ _ …   AΜ„3: f f f _ _ …   A3 βˆͺ AΜ„3 = AΜ„2
A4: f f f s _ …   AΜ„4: f f f f _ …   A4 βˆͺ AΜ„4 = AΜ„3
etc.

with probabilities

P(A1) = p,  P(A2) = p(1 βˆ’ p),  P(A3) = p(1 βˆ’ p)^2,  P(A4) = p(1 βˆ’ p)^3,  P(A5) = p(1 βˆ’ p)^4,  P(A6) = p(1 βˆ’ p)^5,  …

Since the sum of the probabilities of all outcomes must equal 1, we see that

P(A1) + P(A2) + P(A3) + P(A4) + β‹― = p + p(1 βˆ’ p) + p(1 βˆ’ p)^2 + p(1 βˆ’ p)^3 + β‹― = Ξ£_{x=1}^{∞} p(1 βˆ’ p)^{xβˆ’1} = 1

Let the sample space consist of outcomes each of which consists of countably infinite Bernoulli trials. Let p be the probability of success in each Bernoulli trial. Then the geometric probability distribution

g(x; p) = p(1 βˆ’ p)^{xβˆ’1},  x = 1, 2, 3, 4, …

describes the probability that the first success occurs on the x’th trial.

e.g. A measuring device has a 5% probability of showing excessive drift during a measurement. What is the probability that the first time the device exhibits excessive drift occurs on the sixth measurement?

p = 0.05. We want g(6; 0.05) = 0.05 (0.95)^5 = 0.039
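A Python sketch of the geometric distribution (the function name is our own), checking this value and the normalization Ξ£ g(x; p) = 1:

def geom_pmf(x, p):
    """g(x; p): probability the first success occurs on trial x."""
    return p * (1 - p)**(x - 1)

print(geom_pmf(6, 0.05))                               # 0.0387, as in the example
print(sum(geom_pmf(x, 0.05) for x in range(1, 2000)))  # ~1.0: probabilities sum to 1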

Assume you are dealing with Bernoulli trials governed by probability p and you would like to know how many trials x you need to make in order to observe r successes. (Clearly π‘Ÿ ≀ π‘₯) To have exactly r successes in x trials, the r’th success has to occur on trial x, and the previous π‘Ÿ βˆ’ 1 successes have to occur in the previous π‘₯ βˆ’ 1 trials. Therefore the probability that the r’th success occurs on the x’th trial must be

f(π‘₯) = (probability of π‘Ÿ βˆ’ 1 successes in π‘₯ βˆ’ 1 trials) x (probability of β€œs” on trial x)

= 𝑏 π‘Ÿ βˆ’ 1; π‘₯ βˆ’ 1, 𝑝 βˆ™ 𝑝

f(π‘₯) =π‘₯ βˆ’ 1π‘Ÿ βˆ’ 1

π‘π‘Ÿβˆ’1(1 βˆ’ 𝑝)π‘₯βˆ’π‘Ÿβˆ™ 𝑝 =π‘₯ βˆ’ 1π‘Ÿ βˆ’ 1

π‘π‘Ÿ(1 βˆ’ 𝑝)π‘₯βˆ’π‘Ÿ

This is the negative binomial probability distribution

f(x) = C(x βˆ’ 1, r βˆ’ 1) p^r (1 βˆ’ p)^{xβˆ’r}  for x = r, r + 1, r + 2, …

As C(n, k) = C(n, n βˆ’ k), the negative binomial probability distribution can also be written

f(x) = C(x βˆ’ 1, x βˆ’ r) p^r (1 βˆ’ p)^{xβˆ’r}

It can be shown that C(x βˆ’ 1, x βˆ’ r) = (βˆ’1)^{xβˆ’r} C(βˆ’r, x βˆ’ r), a binomial coefficient with a negative upper argument, explaining the name β€œnegative” binomial distribution.
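A Python sketch of the negative binomial pmf (the function name is our own); note that setting r = 1 recovers the geometric distribution:

from math import comb

def nbinom_pmf(x, r, p):
    """Probability the r'th success occurs on trial x (x = r, r+1, ...)."""
    return comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

# r = 1 reduces to the geometric distribution g(x; p):
print(nbinom_pmf(6, 1, 0.05))   # 0.0387, same as g(6; 0.05)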

Recap: Sample space: outcomes are Bernoulli trials of fixed length n; probability of β€œs” is p. The probability of getting x successes in the n trials is given by the binomial distribution b(x; n, p), x = 0, 1, 2, 3, …, n. If n is large and p is small, b(x; n, p) β‰ˆ f(x; Ξ»), where Ξ» = np and f(x; Ξ») is the Poisson distribution.

Sample space: outcomes are Bernoulli trials of countably infinite length; probability of β€œs” is p. The probability of getting the first success on the x’th trial is given by the geometric distribution g(x; p), x = 1, 2, 3, 4, …. The probability that the r’th success occurs on the x’th trial is given by f(x) = b(r βˆ’ 1; x βˆ’ 1, p) Β· p, x = r, r + 1, r + 2, …

Recap: Sample space: time recordings of a random process occurring over a continuous time interval T. The random process produces only β€œs” or β€œf”. Let Ξ± denote the average number of β€œs” produced per unit time. Further assume:
1. the probability of β€œs” during a small time interval βˆ†t is Ξ±βˆ†t
2. the probability of more than one β€œs” in βˆ†t is negligible
3. the probability of β€œs” in a later βˆ†t is independent of what occurs earlier
Then the probability of x successes during time interval T is given by the Poisson distribution f(x; Ξ») where Ξ» = Ξ±T.

4.9 The Multinomial Distribution

Sample space: sequences of trials of length n. We assume:
1) Each trial has k possible distinct outcomes: type 1, type 2, type 3, …, type k
2) Outcome type i occurs with probability p_i for each trial, where Ξ£_{i=1}^{k} p_i = 1
3) The outcomes for different trials are independent
(i.e. we assume β€œmultinomial Bernoulli” trials). In the n trials, we want to know the probability f(x1, x2, x3, …, xk) that there are

x1 outcomes of type 1, x2 outcomes of type 2, …, xk outcomes of type k

where Ξ£_{i=1}^{k} x_i = n

For fixed values of x1, x2, x3, …, xk, there are

C(n, x1) C(n βˆ’ x1, x2) C(n βˆ’ x1 βˆ’ x2, x3) β‹― C(n βˆ’ x1 βˆ’ x2 βˆ’ β‹― βˆ’ x_{kβˆ’1}, xk) = n!/(x1! x2! x3! β‹― xk!)

outcomes that have these k values.

(AMS 301 students will recognize this as P(n; x1, x2, x3, …, xk), the number of ways to arrange n objects when there are x1 of type 1, x2 of type 2, …, and xk of type k.) Each such outcome has probability p1^{x1} p2^{x2} p3^{x3} β‹― pk^{xk}. Summing the probabilities for these outcomes we have

f(x1, x2, x3, …, xk) = [n!/(x1! x2! x3! β‹― xk!)] p1^{x1} p2^{x2} p3^{x3} β‹― pk^{xk}

This is the multinomial probability distribution, with the conditions that each x_i β‰₯ 0 and Ξ£_{i=1}^{k} x_i = n.

e.g. 1. 30% of light bulbs will survive less than 40 hours of continuous use; 2. 50% will survive from 40 to 80 hours of continuous use; 3. 20% will survive longer than 80 hours of continuous use. What is the probability that, among 8 light bulbs, 2 will be of type 1, 5 of type 2, and 1 of type 3?

We want f(2, 5, 1) = [8!/(2! 5! 1!)] (0.3)^2 (0.5)^5 (0.2)^1 = 0.0945
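A Python sketch of the multinomial pmf applied to the light-bulb example (the function name is our own):

from math import factorial, prod

def multinomial_pmf(xs, ps):
    """f(x1, ..., xk) for n = sum(xs) trials with type probabilities ps."""
    n = sum(xs)
    coef = factorial(n)
    for x in xs:
        coef //= factorial(x)      # n!/(x1! x2! ... xk!), always an integer
    return coef * prod(p**x for p, x in zip(ps, xs))

print(multinomial_pmf([2, 5, 1], [0.3, 0.5, 0.2]))   # 0.0945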

4.10 Generating discrete random variables that obey different probability distributions

Observation: It is relatively simple to generate the random values 0, 1, 2, …, 9 with equal likelihood (i.e. each with probability 1/10):
draw the numbers (with replacement) from a hat
roll a balanced, 10-sided die

It is also relatively straightforward to write a computer program that generates the integers 0, 1, 2, …, 9 with equal-likelihood.

Consequently, it is possible to generate
all 2-digit numbers (outcomes) 00 to 99 with equal likelihood (1/100)
all 3-digit numbers (outcomes) 000 to 999 with equal likelihood (1/1000)
etc.


Consider the RV β€œnumber of heads in 3 tosses of a coin”. The probability distribution for this RV is

x    | 0            1            2            3
f(x) | 1/8 = 0.125  3/8 = 0.375  3/8 = 0.375  1/8 = 0.125
F(x) | 0.125        0.500        0.875        1.000

[Figure: the 3-digit outcomes 000–999 on a line, partitioned at F(0), F(1), F(2) into segments assigned the values x1 = 0, x2 = 1, x3 = 2, x4 = 3]

i.e. all the outcomes 000 – 124 are assigned the RV value 0, all the outcomes 125 – 499 are assigned the RV value 1, all the outcomes 500 – 874 are assigned the RV value 2, and all the outcomes 875 – 999 are assigned the RV value 3. Thus RV value 0 occurs with probability 1/8, RV value 1 with probability 3/8, RV value 2 with probability 3/8, and RV value 3 with probability 1/8.

Thus a sequence of outcomes generated randomly (with equal likelihood), say

197, 365, 157, 520, 946, 951, 948, 568, 586, 089

is interpreted as the random values (numbers of heads)

1, 1, 1, 2, 3, 3, 3, 2, 2, 0
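This outcome-to-value assignment is exactly inverse-CDF sampling, and it is easy to automate. A Python sketch (using Python's random module in place of Table 7; all names are our own):

import random

def sample(pmf, rng=random):
    """Draw one value of X by inverting the cumulative distribution:
    generate a uniform u in [0, 1) and return the first x with F(x) > u."""
    u = rng.random()
    cumulative = 0.0
    for x, f in sorted(pmf.items()):
        cumulative += f
        if u < cumulative:
            return x
    return x   # guard against floating-point rounding; F should end at 1.0

coin = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}    # heads in 3 coin tosses
draws = [sample(coin) for _ in range(10000)]
print([draws.count(x) / 10000 for x in range(4)])  # ~[0.125, 0.375, 0.375, 0.125]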

Table 7 in Appendix B presents a long list of the integers 0, …, 9 generated with equal-likelihood. One can use the table to randomly generate lists of 1-digit, 2-digit, 3-digit, etc. outcomes (by taking non-overlapping combinations and starting in different places)

e.g. RV = number of cars arriving at a toll booth per minute

x    | 0     1     2     3     4     5     6     7     8     9
f(x) | 0.082 0.205 0.256 0.214 0.134 0.067 0.028 0.010 0.003 0.001
F(x) | 0.082 0.287 0.543 0.757 0.891 0.958 0.986 0.996 0.999 1.000

[Figure: the 3-digit outcomes 000–999 partitioned at F(0), F(1), F(2), F(3), F(4), … into segments assigned the values 0, 1, 2, 3, 4, …]

Classical probability versus frequentist probability

Recall: classical probability counts outcomes and assumes all outcomes occur with equal likelihood. Frequentist probability measures the frequency of occurrence of outcomes from past β€œexperiments”. So what do two dice really do when thrown at the same time?

Classical probability: distinct (i.e. different colored) dice: there are 36 distinct outcomes, each appearing with equal likelihood; therefore the (unordered) outcome 1,2 has probability 2/36.

identical dice: there are 21 distinct outcomes, each appearing with equal likelihood; therefore the (unordered) outcome 1,2 has probability 1/21.

Frequentist probability: distinct dice: the (unordered) outcome 1,2 has measured probability 2/36, in agreement with classical probability.

identical dice: the (unordered) outcome 1,2 has measured probability 2/36 (!), in disagreement with classical probability.

For identical dice, the classical view of probability assumes all 21 outcomes occur with equal probability. This is not what occurs in practice: each of the (unordered) outcomes i, j where i β‰  j occurs more frequently than the outcomes i, i.

β€œWhy” is the frequentist approach correct? Clearly the frequency of getting unordered outcomes cannot depend on the color of the dice being thrown (i.e. the color of the dice cannot affect frequency of occurrence). Thus two identical dice must generate outcomes with the same frequencies as two differently-colored dice. Note: that is not to say that the classical probability view is completely wrong. The classical view correctly counts the number of different outcomes in each case (identical and different dice). However, it computes probability incorrectly for the identical case. The frequentist view concentrates on assigning probabilities to each outcome. In the frequentist view, the number of outcomes for two identical dice is still 21, but the probabilities assigned to the i,i and i,j outcomes are different.