MATH 323: Probability




MATH 323: Probability
Anna Brandenberger

February 10, 2019

This is a transcript of the lectures given by Prof. David Stephens ([email protected], BH 1225) during the fall semester of the 2018-2019 academic year (09-12 2018) for the Probability class (math 323). Subjects covered are: sample space, events, conditional probability, independence of events, Bayes' Theorem; basic combinatorial probability, random variables, discrete and continuous univariate and multivariate distributions; independence of random variables; inequalities, weak law of large numbers, central limit theorem.

1 Basics of Probability

Probability is a numerical value representing the chance of a particular event occurring given a particular set of circumstances.

1.1 Review of Set Theory

Definition 1.1. A set S is a collection of elements s ∈ S:

• finite: finite number of elements

• countable: countably infinite number of elements

• uncountable: uncountably infinite number of elements

Definition 1.2. The power set P(A) of A is the set of all subsets of A: P(A) = {B | B ⊆ A}.

Set Operations

1. intersection: s ∈ A ∩ B ⇐⇒ s ∈ A and s ∈ B
   extends to A_1, A_2, . . . , A_K (finite): s ∈ ⋂_{k=1}^K A_k ⇐⇒ s ∈ A_k ∀k

   For A ⊆ S: A ∩ ∅ = ∅, A ∪ ∅ = A, A ∩ S = A, A ∪ S = S. Intersection ∩ and union ∪ are commutative and associative.

2. union: s ∈ A ∪ B ⇐⇒ s ∈ A or s ∈ B
   extends to A_1, A_2, . . . , A_K (finite; technically also to a countably infinite number of sets): s ∈ ⋃_{k=1}^K A_k ⇐⇒ ∃k s.t. s ∈ A_k

3. complement: for A ⊆ S, s ∈ A′ ⇐⇒ s ∈ S but s ∉ A

4. set difference: A − B ≡ A \ B ≡ A ∩ B′

5. exclusive or (xor): A ⊕ B ≡ A ∪ B − A ∩ B

Definition 1.3. A_1, A_2, . . . , A_K is called a partition of S if these sets are pairwise disjoint (A_j ∩ A_k = ∅ ∀j ≠ k) and exhaustive (⋃_{k=1}^K A_k = S).

Theorem 1.1 (De Morgan’s Laws).

A′ ∩ B′ = (A ∪ B)′    A′ ∪ B′ = (A ∩ B)′

Proof. Let A_1 = A ∪ B, A_2 = A′ ∩ B′. To show that A′_1 = A_2, one must show A_1 ∩ A_2 = ∅ and A_1 ∪ A_2 = S.

A_1 ∩ A_2 = (A ∪ B) ∩ A_2 = (A ∩ A_2) ∪ (B ∩ A_2)
          = (A ∩ A′ ∩ B′) ∪ (B ∩ A′ ∩ B′)
          = ∅ ∪ ∅ = ∅

A_1 ∪ A_2 = A_1 ∪ (A′ ∩ B′) = (A_1 ∪ A′) ∩ (A_1 ∪ B′)
          = (A ∪ B ∪ A′) ∩ (A ∪ B ∪ B′)
          = S ∩ S = S
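As a quick sanity check (Python, not part of the original lecture), De Morgan's laws can be verified on a small concrete sample space using set operations:

```python
# Verify De Morgan's laws on a small concrete sample space.
S = set(range(10))       # sample space
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

def complement(X):
    return S - X

# (A ∪ B)′ = A′ ∩ B′
assert complement(A | B) == complement(A) & complement(B)
# (A ∩ B)′ = A′ ∪ B′
assert complement(A & B) == complement(A) | complement(B)
```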


1.2 Sample Spaces and Events

Definition 1.4. The set S of possible outcomes of an experiment is the sample space of the experiment, and the individual elements in S are sample points or outcomes.

Definition 1.5. An event A is a collection of sample outcomes A ⊆ S. Individual sample outcomes E_1, E_2, . . . , E_K, · · · ∈ A are termed simple (elementary) events. We say that A occurs if the outcome s ∈ A. S is the certain event; ∅ the impossible event.

Definition 1.6. The probability function P(.) assigns numerical values to events: P : P(S) → R, A ↦ p (i.e. P(A) = p for A ∈ P(S)).

1.3 Axioms and Consequences

Theorem 1.2. Basic probability axioms:

1. P(A) ≥ 0

2. P(S) = 1

3. P is countably additive: if A_1, A_2, . . . s.t. A_j ∩ A_k = ∅ ∀j ≠ k, then

P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i)

Corollary 1.3. ∀A, P(A′) = 1 − P(A)
Proof. S = A ∪ A′. By Theorem 1.2(3), P(S) = P(A) + P(A′) = 1 =⇒ P(A′) = 1 − P(A)

Corollary 1.4. P(∅) = 0
Proof. Follows from Corollary 1.3 with A = S.

Corollary 1.5. ∀A, P(A) ≤ 1
Proof. By Theorem 1.2 and Corollary 1.3, 1 = P(S) = P(A) + P(A′) ≥ P(A)

Corollary 1.6. ∀A ⊆ B, P(A) ≤ P(B)
Proof. Take B = A ∪ (A′ ∩ B). Then P(B) = P(A) + P(A′ ∩ B) ≥ P(A)

Corollary 1.7 (General Addition Rule).

P(A ∪ B) = P(A) + P(B)− P(A ∩ B)

Proof. First, (A ∪ B) = A ∪ (A′ ∩ B)
=⇒ P(A ∪ B) = P(A) + P(A′ ∩ B) [1]
Then, take B = (A ∩ B) ∪ (A′ ∩ B)
=⇒ P(B) = P(A ∩ B) + P(A′ ∩ B) [2]
Taking [1] − [2], we get:
P(A ∪ B) − P(B) = P(A) − P(A ∩ B)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Using inductive arguments, one can construct a formula for P(⋃_{i=1}^n A_i) (c.f. math 240) but we will mostly use:

Theorem 1.8 (Boole's Inequality).

P(⋃_{i=1}^n A_i) ≤ ∑_{i=1}^n P(A_i)


As a method, one can lay out probabilities in table form such as in Table 1, ensuring that each entry satisfies 0 ≤ p ≤ 1 and that probabilities along rows and columns add up correctly.

          A           A′           Total
  B       p_{A∩B}     p_{A′∩B}     p_B
  B′      p_{A∩B′}    p_{A′∩B′}    p_{B′}
  Total   p_A         p_{A′}       p_S = 1

Table 1: Probabilities laid out in a table.

1.4 Specifying probabilities

Examples of experiments with equally likely sample outcomes include fair coins, dice and cards.

Definition 1.7 (Equally likely sample outcomes). Suppose S finite, with sample outcomes E_1, . . . , E_N; then P(E_i) = 1/N ∀i = 1, . . . , N.

For an event A ⊆ S, ∃n ≤ N s.t. A = ⋃_{i=1}^n E_{a_i}, a_i ∈ {1, . . . , N}

=⇒ P(A) = n/N = (# sample outcomes in A)/(# sample outcomes in S)
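The counting rule P(A) = n/N can be illustrated with a small Python enumeration (not from the notes); here the event is "the sum of two fair dice is 7":

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice.
S = list(product(range(1, 7), repeat=2))
A = [s for s in S if s[0] + s[1] == 7]   # event: sum is 7

p = Fraction(len(A), len(S))             # P(A) = #A / #S
assert p == Fraction(1, 6)               # 6 favourable outcomes out of 36
```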

Definition 1.8 (Relative frequencies). For a given experiment with a sample space S and interest event A, consider a finite sequence of N repeat experiments and n the number of times that A occurs. The frequentist definition of probability generalizes the previous definition to P(A) = lim_{N→∞} n/N, the relative frequency of appearance of A.

Definition 1.9 (Subjective assessment). For a given experiment with sample space S, the probability of event A is further generalized to a numerical representation of one's own personal degree of belief that the actual outcome lies in A, assuming one is internally consistent, i.e. rational and coherent.

1.5 Combinatorial Rules

c.f. math 240 notes for examples.

1. Multiplication principle: a sequence of k operations i, each with n_i possible outcomes, results in n_1 × n_2 × · · · × n_k possible sequences of outcomes.

2. Selection principles: selecting from a finite set can be done:

• with replacement: each selection is independent;

• without replacement: set is depleted by each selection.

3. Ordering: the order of a sequence of selections can be:

• important (312 distinct from 123);

• unimportant (312 identical to 123).

4. Distinguishable items: objects being selected can be:

• distinguishable: individually labelled (e.g. lottery balls);

• indistinguishable: labelled according to a type (e.g. colors).


Definition 1.10. A permutation is an ordered arrangement of distinct objects. The number of ways of ordering an n-set taking r at a time is

P^n_r = n!/(n − r)!

Definition 1.11. The number of ways of partitioning n distinct elements into k disjoint subsets of sizes n_1, . . . , n_k is the multinomial coefficient

N = (n choose n_1, · · · , n_k) := n!/(n_1! × n_2! × · · · × n_k!)

a generalization of the binomial coefficient (n choose j) = n!/(j!(n − j)!) := (n choose j, n − j).

Definition 1.12. The number of combinations C^n_r is the number of subsets of size r that can be selected from n objects:

C^n_r = (n choose r) = P^n_r / r! = n!/(r!(n − r)!)
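These counting formulas are available directly in Python's standard library; a short check (not part of the notes) ties them together:

```python
import math

n, r = 5, 3
# P^n_r = n!/(n-r)!: ordered selections without replacement
assert math.perm(n, r) == math.factorial(n) // math.factorial(n - r) == 60
# C^n_r = P^n_r / r! = n!/(r!(n-r)!)
assert math.comb(n, r) == math.perm(n, r) // math.factorial(r) == 10

# Multinomial coefficient (n choose n1,...,nk) = n!/(n1!...nk!)
def multinomial(*parts):
    out = math.factorial(sum(parts))
    for k in parts:
        out //= math.factorial(k)
    return out

assert multinomial(2, 1, 2) == 30   # 5!/(2!·1!·2!)
```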

1.6 Conditional Probability and Independence

Given partial knowledge (B happened), i.e. knowing s ∈ B ⊆ S, one can make a more accurate probability assessment of some other event A, as s ∈ A ∩ B.

Definition 1.13. The conditional probability of A given B, where P(B) > 0, is

P(A | B) = P(A ∩ B)/P(B)

Corollary 1.9. P(A | S) = P(A);  P(A | A) = 1;  P(A | B) ≤ 1

Warning 1. Important distinction between P(A ∩ B) and P(A | B):

• P(A ∩ B): chance of A and B occurring relative to S

• P(A | B): chance of A and B occurring relative to B; non-symmetric: in general, P(A | B) ≠ P(B | A).

Theorem 1.10. The conditional probability function satisfies the probability axioms (Theorem 1.2).

Definition 1.14. Two events A, B ⊆ S are independent if

P(A | B) = P(A) ⇐⇒ P(A ∩ B) = P(A)P(B)

Remark 2. Notice then that P(B | A) = P(A ∩ B)/P(A) = P(A)P(B)/P(A) = P(B).

Example 1.1. Given three two-sided cards, one RR, one RB and one BB, one card is selected randomly and one side displayed: it is red (R). Find the probability that the other side is red.

Answer. S = {R_1, R_2, B, R, B_1, B_2}
Card 1 (RR) selected: A = {R_1, R_2} =⇒ P(A) = 2/6
Red side exposed: B = {R_1, R_2, R} =⇒ P(B) = 3/6
P(A | B) = P(A ∩ B)/P(B) = (2/6)/(3/6) = 2/3
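Example 1.1 can also be settled by exhaustive enumeration of the six equally likely (card, side) outcomes; a small Python check (not part of the lecture) reproduces the 2/3:

```python
from fractions import Fraction

# Each card is a pair of side colours; each of the 6 sides equally likely.
cards = [("R", "R"), ("R", "B"), ("B", "B")]
outcomes = [(card, i) for card in cards for i in range(2)]

shown_red = [(card, i) for card, i in outcomes if card[i] == "R"]
other_red = [(card, i) for card, i in shown_red if card[1 - i] == "R"]

# P(other side red | shown side red)
p = Fraction(len(other_red), len(shown_red))
assert p == Fraction(2, 3)
```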

Proof. 1. P(A | B) = P(A ∩ B)/P(B) ≥ 0

2. P(S | B) = P(S ∩ B)/P(B) = P(B)/P(B) = 1

3. P(⋃_{i=1}^∞ A_i | B) = P((⋃_{i=1}^∞ A_i) ∩ B)/P(B) = ∑_{i=1}^∞ P(A_i ∩ B)/P(B) = ∑_{i=1}^∞ P(A_i | B), since the A_i are disjoint ∀i, so the (A_i ∩ B) are also disjoint.

Warning 3. Not the same as mutual exclusivity where P(A ∩ B) = 0 !


Warning 4. For multiple events A_1, . . . , A_n ⊆ S, we can talk about pairwise independence between any A_i, A_j with i ≠ j, but this does not automatically generalize (c.f. Example 1.2).

Example 1.2. Rolling two dice with independent outcomes, let A_1: first roll outcome is odd; A_2: second roll outcome is odd; A_3: total outcome is odd. Then P(A_1) = P(A_2) = 1/2 and P(A_1 | A_3) = P(A_2 | A_3) = 1/2. But P(A_1 ∩ A_2 ∩ A_3) = 0, P(A_1 | A_2 ∩ A_3) = 0, etc.

Definition 1.15. Events A_1, . . . , A_K are mutually independent if, for all subsets I of {1, 2, . . . , K}:

P(⋂_{k∈I} A_k) = ∏_{k∈I} P(A_k)

Definition 1.16. Consider events A_1, A_2, B ⊆ S with P(B) > 0. A_1 and A_2 are conditionally independent given B if

P(A_1 | A_2 ∩ B) = P(A_1 | B) ⇐⇒ P(A_1 ∩ A_2 | B) = P(A_1 | B)P(A_2 | B)

Theorem 1.11 (General Multiplication Rule).

P(A_1, A_2, . . . , A_K) = P(A_1)P(A_2 | A_1)P(A_3 | A_1 ∩ A_2) · · · P(A_K | A_1 ∩ A_2 ∩ · · · ∩ A_{K−1})

Proof. Trivial, recursively (Chain Rule for probabilities).

We can use probability trees to display joint probabilities of multiple events, where the junctions are the events and the branches correspond to possible sequences of choices of events. We then multiply along a branch to get the joint probability.

1.7 Important Theorems

Theorem 1.12 (Theorem of Total Probability). Given a partition B_1, . . . , B_n of the sample space S into n subsets s.t. P(B_i) > 0, then for any event A ⊆ S,

P(A) = ∑_{i=1}^n P(A | B_i)P(B_i)

Theorem 1.13 (Bayes' Theorem). For any A, B ⊆ S where P(A) > 0 and P(B) > 0:

P(B | A) = P(A | B)P(B)/P(A)

Furthermore, given B_1, B_2, . . . , B_n forming a partition of S:

P(B_i | A) = P(A | B_i)P(B_i)/P(A) = P(A | B_i)P(B_i) / ∑_{j=1}^n P(A | B_j)P(B_j)

Proof. (Theorem 1.12) A = (A ∩ B_1) ∪ · · · ∪ (A ∩ B_n), so

P(A) = P(A ∩ B_1) + · · · + P(A ∩ B_n)
     = P(A | B_1)P(B_1) + · · · + P(A | B_n)P(B_n)
     = ∑_{i=1}^n P(A | B_i)P(B_i)

Proof. (Theorem 1.13) By Definition 1.13,

P(A | B)P(B) = P(A ∩ B) = P(B | A)P(A)

With the stated conditions, we have

P(B | A) = P(A | B)P(B)/P(A)

The second part of the theorem comes from replacing the denominator using Theorem 1.12.

Warning 5. Although ∑_{i=1}^n P(B_i | A) = 1, in general ∑_{i=1}^n P(A | B_i) ≠ 1, due to the use of different conditioning sets.

Remark 6. Bayes' Theorem is often used to make probability statements concerning an event B that has not been observed, given an event A that has been observed.


Remark 7 (Prosecutor’s Fallacy).

P("Evidence" | "Guilt") ≠ P("Guilt" | "Evidence")

Example 1.3 (Medical screening). Given events A (positive screening test) and B (person is actually a sufferer), we are interested in P(B | A). Suppose P(B) = p; P(A | B) = 1 − α (true positive rate); P(A | B′) = β (false positive rate) for probabilities p, α, β.

P(A) = P(A | B)P(B) + P(A | B′)P(B′) = (1 − α)p + β(1 − p)

P(B | A) = P(A | B)P(B)/P(A) = (1 − α)p / [(1 − α)p + β(1 − p)]

P(B | A′) = αp / [αp + (1 − β)(1 − p)] by similar logic.
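Plugging in illustrative numbers makes the screening formulas concrete. The values of p, α, β below are hypothetical (not from the notes), chosen to show that when prevalence p is small, most positives are false even for a good test:

```python
from fractions import Fraction

# Hypothetical illustrative values (not from the lecture):
p     = Fraction(1, 100)   # prevalence P(B)
alpha = Fraction(5, 100)   # so true positive rate P(A|B) = 1 - alpha
beta  = Fraction(2, 100)   # false positive rate P(A|B')

pA = (1 - alpha) * p + beta * (1 - p)     # theorem of total probability
pB_given_A = (1 - alpha) * p / pA         # Bayes' theorem

assert pB_given_A == Fraction(95, 293)    # exact posterior, ≈ 0.32
assert pB_given_A < Fraction(1, 3)        # most positives are false here
```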

Remark 8. Recall that the odds on B are defined as P(B)/P(B′) = P(B)/(1 − P(B)); and the conditional odds given A (P(A) > 0) are P(B | A)/P(B′ | A) = P(B | A)/(1 − P(B | A)). Thus the odds change by a factor of P(A | B)/P(A | B′) when one adds the information A.

Corollary 1.14 (Bayes' Theorem for multiple events). Given events A_1, A_2 ⊆ S where P(A_1 ∩ A_2) > 0 and given B_1, B_2, . . . , B_n forming a partition of S:

P(B_i | A_1 ∩ A_2) = P(A_1 ∩ A_2 | B_i)P(B_i) / ∑_{k=1}^n P(A_1 ∩ A_2 | B_k)P(B_k)


2 Random Variables and Probability Distributions

2.1 Random Variables

Definition 2.1. A random variable is a mapping Y : S → R, s ↦ y, from an arbitrary space S (where the experiment sample outcomes are defined) to the real line R.

Remark 9. We use uppercase letters X, Y, Z to denote random variables and lowercase x, y, z to denote the real values taken by the random variable.

Y(.) is usually a many-to-one mapping. Also, we usually suppress the dependence on s, writing Y rather than Y(s).

Definition 2.2. A random variable is discrete if the set of values to which Y assigns positive probability is countable: {y_1, y_2, . . . , y_n, . . . }, denoted Y.

Corollary 2.1. The probabilities assigned to individual points in the set Y must satisfy

∑_{y∈Y} P(Y = y) = 1

Proof. Follows from the probability axioms.

2.2 Probability Mass Function

Definition 2.3. The probability mass function (pmf) p(.) for the random variable Y records how probability is distributed across points in R: p(y) = P(Y = y), where p(y) > 0 for y ∈ Y. It is also denoted f(y), p_Y(y) or f_Y(y). It must:

(a) specify probabilities, i.e. 0 ≤ p(y) ≤ 1

(b) satisfy Corollary 2.1: ∑_{y∈Y} p(y) = 1

Remark 10. The pmf can be defined pointwise (for a discrete random variable) or in some functional form.

Remark 11. All pmfs take the form p(y) = c g(y), where g(y) also contains information on the set Y for which p(y) > 0:

∑_{y∈Y} p(y) = 1 =⇒ ∑_{y∈Y} g(y) = 1/c

2.3 Moments: expectation and variance

Definition 2.4. The expectation (expected value) E[Y] of the r.v. Y is E[Y] = ∑_{y∈Y} y p(y), the centre of mass of the probability distribution.

Example 2.1 (Discrete Uniform Distribution). Suppose Y = {y_1, . . . , y_n}, p(y) = 1/n. Then E[Y] = (1/n) ∑_{i=1}^n y_i = ȳ. Here p(y) = c is constant and the number of elements in Y is 1/c.

Let's now deal with the expectation of a function g(Y).

Definition 2.5. Let g(.) be a real-valued function and p(y) the pmf of Y. The set of values of g(Y) is G = {g_i, i = 1, . . . , m : g_i = g(y), some y ∈ Y}, and

E[g(Y)] = ∑_{y∈Y} g(y)p(y)


Definition 2.6. The variance V[Y] of Y is a measure of how dispersed the probability distribution is, defined as

V[Y] = E[(Y − µ)²] where µ = E[Y]
     = ∑_y (y − µ)² p(y) ≥ 0

Definition 2.7. The standard deviation of Y is √V[Y].

Theorem 2.2. The expectation E[Y] satisfies linearity.

Remark 12. V[Y] can be infinite even if E[Y] is bounded.

Theorem 2.3. E[(Y − µ)²] = E[Y²] − µ²

Corollary 2.4. V[aY + b] = a²V[Y]

Proof. (Theorem 2.2) Let g(Y) = a g_1(Y) + b g_2(Y) for separate g_1(.) and g_2(.).

E[g(Y)] = E[a g_1(Y) + b g_2(Y)] = ∑_y (a g_1(y) + b g_2(y)) p(y)
        = a ∑_y g_1(y)p(y) + b ∑_y g_2(y)p(y)
        = a E[g_1(Y)] + b E[g_2(Y)]

Proof. (Theorem 2.3)

E[(Y − µ)²] = E[Y² − 2Yµ + µ²]
            = E[Y²] − 2µE[Y] + µ²
            = E[Y²] − 2µ² + µ²
            = E[Y²] − µ²

Proof. (Corollary 2.4)

V[aY + b] = E[(aY + b − E[aY + b])²]
          = E[(aY + b − (aE[Y] + b))²]
          = E[(aY + b − aµ − b)²]
          = E[a²(Y − µ)²]
          = a²V[Y]
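The identities above can be checked exactly on a small pmf with rational arithmetic (a quick Python check, not part of the lecture):

```python
from fractions import Fraction

# A small pmf on Y = {0, 1, 2}.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
assert sum(pmf.values()) == 1

E  = sum(y * p for y, p in pmf.items())                 # E[Y]
E2 = sum(y**2 * p for y, p in pmf.items())              # E[Y^2]
V  = sum((y - E)**2 * p for y, p in pmf.items())        # V[Y]

assert V == E2 - E**2                                    # Theorem 2.3

a, b = 3, 7
Eab = sum((a*y + b) * p for y, p in pmf.items())
Vab = sum((a*y + b - Eab)**2 * p for y, p in pmf.items())
assert Eab == a*E + b                                    # linearity (Thm 2.2)
assert Vab == a**2 * V                                   # Corollary 2.4
```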

2.4 Binomial distribution

Considering an experiment with two outcomes 'success' and 'failure', define the random variable Y to be Y = 1 for 'success' and Y = 0 for 'failure', with P(1) = p, P(0) = q = 1 − p.

Definition 2.8. The Bernoulli distribution is defined with pmf

p_Y(y) = p^y q^{1−y} = p^y (1 − p)^{1−y} ∀y ∈ {0, 1}, given p ∈ (0, 1)

Theorem 2.5. The expectation of the Bernoulli dist. is µ = E[Y]= p.

Theorem 2.6. The variance of the Bernoulli dist. is V[Y]= p(1− p).

Proof. (Theorem 2.5)

E[Y] = ∑_{y=0}^1 y p(y) = 0·p(0) + 1·p(1) = 0·q + 1·p = p

Proof. (Theorem 2.6)

V[Y] = E[Y²] − µ² = ∑_{y=0}^1 y² p(y) − µ² = 1²·p(1) − p² = p − p² = p(1 − p) = pq

Proof. (Theorem 2.7)

E[Y] = ∑_{y=0}^n y (n choose y) p^y q^{n−y}
     = n ∑_{y=1}^n [(n−1)! / ((y−1)!(n−y)!)] p^y q^{n−y}
     = n ∑_{j=0}^{n−1} [(n−1)! / (j!((n−1)−j)!)] p^{j+1} q^{n−j−1}
     = np ∑_{j=0}^{n−1} [(n−1)! / (j!((n−1)−j)!)] p^j q^{n−1−j}
     = np(p + q)^{n−1} = np

We now extend the definition to a sequence of multiple experiments with two outcomes, and have the r.v. Y record the number of 'successes': Y = y ⇔ "y successes in n trials". Consider a sequence with y successes and n − y failures: the probability of any such sequence is p^y q^{n−y}. The number of possible such sequences is C^n_y = (n choose y).

Definition 2.9. The Binomial distribution is defined with pmf

p_Y(y) ≡ P(Y = y) = (n choose y) p^y q^{n−y} ∀y ∈ {0, 1, 2, . . . , n}

Theorem 2.7. The expectation of the binomial dist. is E[Y] = np.


Theorem 2.8. The variance of the binomial dist. is V[Y]= npq.

Proof.

E[Y²] = E[Y(Y − 1)] + E[Y] (by linearity)

E[Y(Y − 1)] = ∑_{y=0}^n y(y − 1) (n choose y) p^y q^{n−y} = ∑_{y=2}^n [y(y − 1) n! / (y!(n−y)!)] p^y q^{n−y}
            = n(n − 1) ∑_{y=2}^n [(n−2)! / ((y−2)!(n−y)!)] p^y q^{n−y}
            = n(n − 1) ∑_{j=0}^{n−2} [(n−2)! / (j!(n−2−j)!)] p^{j+2} q^{n−j−2}
            = n(n − 1)p² ∑_{j=0}^{n−2} [(n−2)! / (j!(n−2−j)!)] p^j q^{n−2−j}
            = n(n − 1)p²(p + q)^{n−2} = n(n − 1)p²

V[Y] = E[Y²] − µ² = n(n − 1)p² + np − (np)²
     = np − np² = npq

Remark 13. Notice how the Bernoullidistribution is a special case of thebinomial distribution with n = 1.
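The binomial moment formulas can be checked exactly in Python (a sanity check, not part of the lecture), using rational arithmetic so that E[Y] = np and V[Y] = npq hold with no rounding:

```python
from fractions import Fraction
from math import comb

n, p = 10, Fraction(3, 10)
q = 1 - p
pmf = {y: comb(n, y) * p**y * q**(n - y) for y in range(n + 1)}

assert sum(pmf.values()) == 1                   # (p + q)^n = 1
E = sum(y * pr for y, pr in pmf.items())
V = sum(y**2 * pr for y, pr in pmf.items()) - E**2
assert E == n * p                               # Theorem 2.7
assert V == n * p * q                           # Theorem 2.8
```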

2.5 Geometric Distribution

Consider independent binary trials, where one counts the number of trials carried out until the first success. Let Y be the r.v. recording the number of trials carried out up to and including the first success: Y = y ⇔ "y − 1 failures followed by 1 success".

Definition 2.10. The geometric distribution is defined with pmf

p(y) ≡ P(Y = y) = q^{y−1} p ∀y ∈ {1, 2, . . . }

Remark 14. Notice that

∑_{y=1}^∞ q^{y−1} p = p ∑_{j=0}^∞ q^j = p · 1/(1 − q) = 1

Remark 15. The geometric distribution can also be written with pmf

p(y) ≡ P(Y* = y) = q^y p ∀y ∈ {0, 1, 2, . . . }

where the r.v. Y* then records the number of failures observed before the first success: Y* = Y − 1.

Theorem 2.9. The expectation of the geometric dist. is E[Y] = 1/p.

Theorem 2.10. The variance of the geometric dist. is V[Y] = (1 − p)/p².

Proof. (of Theorem 2.9)

E[Y] = ∑_{y=1}^∞ y q^{y−1} p = p ∑_{y=1}^∞ y q^{y−1} = p · 1/(1 − q)² = p · 1/p² = 1/p

Proof. (of Theorem 2.10)

E[Y(Y − 1)] = ∑_{y=2}^∞ y(y − 1) p(y) = pq ∑_{y=2}^∞ y(y − 1) q^{y−2} = pq d²/dq² (∑_{y=0}^∞ q^y)
            = pq d²/dq² (1/(1 − q)) = pq · 2/(1 − q)³ = 2q/p²

V[Y] = E[Y(Y − 1)] + E[Y] − (E[Y])²
     = 2(1 − p)/p² + 1/p − 1/p² = (1 − p)/p²

Definition 2.11. The cumulative distribution function (cdf) of adistribution is defined as F(y) = P(Y ≤ y).


Theorem 2.11. The cdf of the geometric distribution is

F(y) = P(Y ≤ y) = 1 − (1 − p)^y, y ∈ {1, 2, . . . }

Proof.

P(Y > y) = ∑_{j=y+1}^∞ P(Y = j) = ∑_{j=y+1}^∞ q^{j−1} p = p q^y ∑_{j=y+1}^∞ q^{j−y−1} = p q^y / (1 − q)
         = q^y ∀y ∈ {1, 2, . . . }

P(Y ≤ y) = 1 − P(Y > y) = 1 − (1 − p)^y ∀y ∈ {1, 2, . . . }

We can define all 'interval' probabilities such as P(Y ≤ a), P(a ≤ Y ≤ b), etc. via F(y), since for all distributions,

P(Y ∈ X) = ∑_{y∈X} p(y)
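The closed form F(y) = 1 − (1 − p)^y can be checked exactly against the partial sums of the pmf (Python, not part of the notes):

```python
from fractions import Fraction

p = Fraction(1, 4)
q = 1 - p

def cdf(y):
    # F(y) = 1 - (1 - p)^y, Theorem 2.11
    return 1 - q**y

# Compare against the direct partial sum of the geometric pmf.
for y in range(1, 20):
    direct = sum(q**(j - 1) * p for j in range(1, y + 1))
    assert direct == cdf(y)
```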

2.6 Negative Binomial Distribution

Consider independent binary trials, where one counts the number of trials carried out until the rth success. Let r.v. Y record the number of trials carried out until the rth success: Y = y ⇔ A ∩ B, where A = "r − 1 successes in the first y − 1 trials" and B = "success on the yth trial".

Definition 2.12. The negative binomial distribution has pmf

p(y) ≡ P(Y = y) = (y−1 choose r−1) p^r q^{y−r} ∀y ∈ {r, r + 1, . . . }

Exercise 2.2. Show for the negative binomial distribution that ∑_{y=r}^∞ p(y) = 1.

Remark 16. As for the geometric distribution, the negative binomial dist. can also be written with pmf

p(y) = P(Y* = y) = (y+r−1 choose r−1) p^r q^y ∀y ∈ {0, 1, 2, . . . }

where the r.v. Y* records the number of failures observed up to the rth success: Y* = Y − r.

Proofs left to the reader.

Theorem 2.12. The expectation of the negative binomial distribution is E[Y] = r/p.

Theorem 2.13. The variance of the negative binomial distribution is V[Y] = r(1 − p)/p².

2.7 Hypergeometric Distribution

(ry): number of ways of selecting y type

I from the r available.(N−r

n−y): number of ways of selectingn− y type II from the N − r available.(N

n ): total number of ways of selecting nitems from N.

Consider a finite set of N items with r type I and N − r type II. Weselect n items without replacement. Let r.v. Y = y⇔ "y type I items".

Definition 2.13. The hypergeometric distribution is defined with pmf

p(y) ≡ P(Y = y) =(r

y)× (N−rn−y)

(Nn )

∀y ∈ Y


Remark 17. An equivalent expression for p(y) is

p(y) = (n choose y)(N−n choose r−y) / (N choose r)

Remark 18. When N and r are large, we can approximate the hypergeometric by a binomial distribution:

p(y) ≈ (n choose y) (r/N)^y (1 − r/N)^{n−y}
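Both the normalization of the hypergeometric pmf (Vandermonde's identity) and the equivalence claimed in Remark 17 can be checked exactly (Python, not part of the notes):

```python
from fractions import Fraction
from math import comb

N, r, n = 20, 8, 5
pmf = {y: Fraction(comb(r, y) * comb(N - r, n - y), comb(N, n))
       for y in range(0, min(r, n) + 1)}

# Normalization: sum_y C(r,y) C(N-r,n-y) = C(N,n)  (Vandermonde)
assert sum(pmf.values()) == 1

# Remark 17: the equivalent expression gives the same probabilities.
for y in pmf:
    assert pmf[y] == Fraction(comb(n, y) * comb(N - n, r - y), comb(N, r))
```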

2.8 Poisson distribution

Consider the binomial experiment setting, and take the limiting case where we allow n to get larger and p to get smaller, keeping n × p constant: i.e. take p = λ/n and let n be large. Here λ is the rate at which successes occur in a continuous approximation to the experiment. An example use case would be: "If Mike makes an average of 2 sales a day, what are the odds that he makes 3 sales tomorrow?" Then,

p(y) = (n choose y) p^y (1 − p)^{n−y} = [n!/(y!(n−y)!)] (λ/n)^y (1 − λ/n)^{n−y}

     = (λ^y/y!) (1 − λ/n)^n · [n!/((n−y)! n^y)] (1 − λ/n)^{−y}

where the last two factors → 1 and (1 − λ/n)^n → e^{−λ} as n → ∞, so

p(y) → (λ^y/y!) e^{−λ}

Definition 2.14. The Poisson distribution is defined with pmf

p(y) = (λ^y/y!) e^{−λ} ∀y ∈ {0, 1, 2, . . . }

Theorem 2.14. The expectation of the Poisson dist. is E[Y] = λ.

Theorem 2.15. The variance of the Poisson dist. is V[Y] = λ.

Proof. (Theorem 2.14)

E[Y] = ∑_{y=0}^∞ y p(y) = ∑_{y=0}^∞ y (λ^y/y!) e^{−λ}
     = e^{−λ} ∑_{y=1}^∞ λ^y/(y − 1)!
     = e^{−λ} λ ∑_{y=1}^∞ λ^{y−1}/(y − 1)!
     = e^{−λ} λ e^λ = λ

Proof. (Theorem 2.15)

E[Y(Y − 1)] = ∑_{y=1}^∞ y(y − 1) (λ^y/y!) e^{−λ}
            = e^{−λ} λ² ∑_{y=2}^∞ λ^{y−2}/(y − 2)! = e^{−λ} λ² e^λ = λ²

V[Y] = E[Y(Y − 1)] + E[Y] − µ²
     = λ² + λ − λ² = λ

Remark 19. The Poisson distribution corresponds to a model whereincidents occur independently at a constant rate λ.
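The limiting argument above can be illustrated numerically: with n large and p = λ/n, the binomial pmf is already very close to the Poisson pmf (a Python check, not part of the lecture):

```python
from math import comb, exp, factorial

lam = 2.0

def poisson(y):
    # λ^y e^{-λ} / y!
    return lam**y * exp(-lam) / factorial(y)

n = 10_000          # large n, small p = λ/n, with n·p = λ fixed
p = lam / n
for y in range(6):
    binom = comb(n, y) * p**y * (1 - p)**(n - y)
    assert abs(binom - poisson(y)) < 1e-3   # binomial → Poisson
```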

2.9 Moments and Moment-Generating Functions

Definition 2.15. Let µ = E[Y]. We define

kth moment: µ′_k = E[Y^k] = ∑_y y^k p(y)

kth central moment: µ_k = E[(Y − µ)^k] = ∑_y (y − µ)^k p(y)

Definition 2.16. The generating function g(t) for a sequence {a_j} ⊂ R is the function defined by the sum

g(t) = ∑_j a_j t^j

Example 2.3. The generating function for {a_j} = (n choose j), ∀j = 0, 1, . . . , is

g(t) = (1 + t)^n = ∑_{j=0}^n (n choose j) t^j

Definition 2.17. Exponential generating function: g(t) = ∑_j a_j t^j/j!.


Definition 2.18. For pmf p(y), the moment generating function (mgf) m(t) is the exponential generating function for the moments of p(y), requiring the sum to be finite in a neighbourhood of t = 0:

m(t) = ∑_{j=0}^∞ µ′_j t^j/j!

Remark 20. It is difficult to see the use for moment generating functions right now, but the mgf can be used to identify distributions, since two random variables with the same mgf share the same distribution (Theorem 2.18), and it can be used to calculate the moments of a random variable.

Theorem 2.16. The moment generating function can be written

m(t) = E[e^{tY}]

Theorem 2.17. The kth derivative of m(t) evaluated at t = 0 gives

m^{(k)}(0) = (d^k/dt^k) m(t) |_{t=0} = µ′_k

Theorem 2.18 (Uniqueness). Suppose m_1(t) and m_2(t) are two mgfs for pmfs p_1(y) and p_2(y). Then:

m_1(t) = m_2(t) ∀|t| ≤ b for some b ⇐⇒ p_1(y) = p_2(y) ∀y

Proof. (Theorem 2.16)

m(t) = ∑_{j=0}^∞ µ′_j t^j/j! = ∑_{j=0}^∞ ∑_y y^j p(y) t^j/j!
     = ∑_y (∑_{j=0}^∞ (yt)^j/j!) p(y)
     = ∑_y e^{ty} p(y)

Proof. (Theorem 2.17) Simply write out the terms.

Proof. (Theorem 2.18) Beyond the scope of this course.

2.10 Probability-Generating Functions

Definition 2.19. The probability-generating function (pgf) for random variable Y with P(Y = i) = p_i is

G(t) = E[t^Y] = ∑_{i=0}^∞ p_i t^i

Definition 2.20. The kth factorial moment µ_[k] is defined as

µ_[k] = E[Y(Y − 1) · · · (Y − k + 1)]

Remark 21. Note that

G^{(k)}(t) = (d^k/dt^k) (∑_{i=0}^∞ p_i t^i) = ∑_{i=k}^∞ i(i − 1) · · · (i − k + 1) p_i t^{i−k}

G^{(k)}(1) = ∑_{i=k}^∞ i(i − 1) · · · (i − k + 1) p_i = µ_[k]

Corollary 2.19. G(t) ≡ E[t^Y] = E[exp(Y ln t)] = m(ln t)


2.11 Continuous Random Variables

Theorem 2.20. Properties of the cdf F(y) (c.f. Definition 2.11):

1. "starts at zero": F(−∞) ≡ lim_{y→−∞} F(y) = 0

2. "ends at one": F(∞) ≡ lim_{y→∞} F(y) = 1

3. "non-decreasing in between": y_1 < y_2 =⇒ F(y_1) ≤ F(y_2)

Definition 2.21. A r.v. Y with cdf F(y) is called continuous if F(y) is continuous for all real y, i.e. ∀y,

lim_{ε→0+} F(y + ε) = lim_{ε→0+} F(y − ε) = F(y)

Warning 22. If F(y) is continuous, then the probability attached to any individual point in the space is null: P(Y = y) = 0 ∀y ∈ R.

2.12 Probability Density Function

Definition 2.22. For a continuous cdf F(y), the probability density function (pdf) f(y) is the first derivative of F(.) at y, when it exists:

f(y) = dF(y)/dy =⇒ F(y) = ∫_{−∞}^y f(t) dt

Theorem 2.21. Properties of the pdf f(y):

1. non-negative: f(y) ≥ 0, −∞ < y < ∞

2. integrates to 1: ∫_{−∞}^∞ f(y) dy = 1

Proof. Follows from the probability axioms and the properties in Theorem 2.20.

Warning 23. The probability density function does not itself represent probabilities: f(y) is not P(Y = y), and f(y) may exceed 1 at a point.

We can compute the interval probability P(a ≤ Y ≤ b) as

P(a ≤ Y ≤ b) = P(Y ≤ b) − P(Y ≤ a) = F(b) − F(a) = ∫_a^b f(t) dt

All the definitions and theorems shown for the discrete case still hold for the continuous case:

1. Expectation: for continuous r.v. Y and a function g(Y),

E[Y] = ∫_{−∞}^∞ y f(y) dy = ∫_Y y f(y) dy = µ

E[g(Y)] = ∫_{−∞}^∞ g(y) f(y) dy (both: only if the integral is finite)

2. Linearity of expectation (Theorem 2.2) still holds.

3. Variance: for continuous r.v. Y,

V[Y] = E[(Y − µ)²] = ∫_{−∞}^∞ (y − µ)² f(y) dy


4. Moments, central moments and factorial moments:

µ′_k = E[Y^k] = ∫_{−∞}^∞ y^k f(y) dy

µ_k = E[(Y − µ)^k] = ∫_{−∞}^∞ (y − µ)^k f(y) dy

µ_[k] = E[Y(Y − 1) · · · (Y − k + 1)] = ∫_{−∞}^∞ y(y − 1) · · · (y − k + 1) f(y) dy

5. Moment generating function m(t): for |t| ≤ b, b > 0,

m(t) = E[e^{tY}] = ∫_{−∞}^∞ e^{ty} f(y) dy

6. Factorial moment generating function G(t): for 1 − b ≤ t ≤ 1 + b, b > 0,

G(t) = E[t^Y] = ∫_{−∞}^∞ t^y f(y) dy

2.13 Distributions for continuous RVs

Definition 2.23. The continuous uniform distribution has a pdf that is constant on some interval (θ_1, θ_2), θ_1 < θ_2:

f(y) = 1/(θ_2 − θ_1) ∀ θ_1 ≤ y ≤ θ_2, and zero otherwise.

By direct calculation, we can compute that the cdf F(y) is

F(y) = 0 for y ≤ θ_1;  (y − θ_1)/(θ_2 − θ_1) for θ_1 ≤ y ≤ θ_2;  1 for y ≥ θ_2

and also find E[Y] = (θ_1 + θ_2)/2 and V[Y] = (θ_2 − θ_1)²/12.

Definition 2.24. The Normal (Gaussian) distribution is defined with parameters µ and σ by the pdf

f(y) = [1/(σ√(2π))] exp(−(y − µ)²/(2σ²)) ∀ −∞ < y < ∞

Figure 1: Normal distribution for various values of µ and σ.

It can be shown that E[Y] = µ and V[Y] = σ².

The cdf F(y) = ∫_{−∞}^y [1/(σ√(2π))] exp(−(t − µ)²/(2σ²)) dt cannot be computed analytically. Instead, if Y has a Normal distribution with parameters µ and σ, and we consider the r.v. Z = (Y − µ)/σ, then ∀z ∈ R,

F_Z(z) = P(Z ≤ z) = P(σZ + µ ≤ σz + µ) = P(Y ≤ σz + µ)

       = F_Y(σz + µ) = ∫_{−∞}^{σz+µ} [1/(σ√(2π))] exp(−(t − µ)²/(2σ²)) dt

       = ∫_{−∞}^z [1/√(2π)] exp(−u²/2) du, substituting u = (t − µ)/σ
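Although the Normal cdf has no elementary closed form, the standardization above means any Normal probability reduces to the standard Normal cdf Φ, which Python exposes through the error function (a sketch, not from the notes):

```python
from math import erf, sqrt

def std_normal_cdf(z):
    # Φ(z) = ∫_{-∞}^z (1/√(2π)) e^{-u²/2} du, expressed via erf
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_cdf(y, mu, sigma):
    # F_Y(y) = Φ((y - µ)/σ), using the standardization Z = (Y - µ)/σ
    return std_normal_cdf((y - mu) / sigma)

assert abs(std_normal_cdf(0.0) - 0.5) < 1e-12        # Φ(0) = 1/2
assert abs(normal_cdf(3.0, 3.0, 2.0) - 0.5) < 1e-12  # F_Y(µ) = 1/2
assert abs(std_normal_cdf(1.0) + std_normal_cdf(-1.0) - 1.0) < 1e-12
```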


Definition 2.25. The Gamma distribution with parameters α, β > 0 has pdf

f(y) = 0 for y < 0;  f(y) = [1/(β^α Γ(α))] y^{α−1} e^{−y/β} for 0 ≤ y < ∞

where Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt, which satisfies Γ(α) = (α − 1)Γ(α − 1), and for n ∈ N, Γ(n) = (n − 1)! with Γ(1) = 1.

Figure 2: Gamma distribution for various values of α and β, which are respectively the shape and scale parameters.

The Gamma distribution is to the Exponential distribution what the Binomial is to the Bernoulli distribution. An example use case of the Gamma dist.: "How long does it take for a bucket to fill up if left out in the rain?" The exponential distribution is known as the waiting-time distribution.

It can be shown that E[Y] = αβ and V[Y] = αβ².

Special cases of the Gamma distribution are

1. chi-square distribution with µ degrees of freedom: α = µ/2, β = 2

2. exponential distribution: α = 1, f(y) = (1/β) e^{−y/β}, 0 ≤ y < ∞

The moment-generating function of the Gamma distribution is

m(t) = E[e^{tY}] = ∫_0^∞ e^{ty} [1/(β^α Γ(α))] y^{α−1} e^{−y/β} dy = (1/(1 − βt))^α

Definition 2.26. The beta distribution is suitable for an r.v. defined on [0, 1], with pdf

f(y) = [Γ(α + β)/(Γ(α)Γ(β))] y^{α−1}(1 − y)^{β−1} = [1/B(α, β)] y^{α−1}(1 − y)^{β−1}, 0 ≤ y ≤ 1

where B(α, β) is the Beta function B(α, β) = ∫_0^1 y^{α−1}(1 − y)^{β−1} dy.

Figure 3: Beta distribution for various values of α and β.

It can be found that E[Y^k] = B(α + k, β)/B(α, β) for k = 1, 2, . . .

Remark 24. A special case of the beta function with α = β = 1 has B(1, 1) = Γ(1)Γ(1)/Γ(2) = (1 · 1)/1 = 1 and results in the pdf

f(y) = 1 ∀ 0 ≤ y ≤ 1, and zero otherwise:

the continuous uniform distribution with θ_1 = 0, θ_2 = 1.
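For integer parameters, Γ(n) = (n − 1)! makes the beta moment formula E[Y^k] = B(α + k, β)/B(α, β) exactly computable; the check below (Python, not from the notes) also verifies the standard beta mean and variance formulas, which are known results rather than ones derived here:

```python
from fractions import Fraction
from math import factorial

def B(a, b):
    # Beta function via Γ(n) = (n-1)!, valid for integer arguments
    return Fraction(factorial(a - 1) * factorial(b - 1),
                    factorial(a + b - 1))

alpha, beta = 2, 3
# E[Y^k] = B(α + k, β) / B(α, β)
E1 = B(alpha + 1, beta) / B(alpha, beta)
E2 = B(alpha + 2, beta) / B(alpha, beta)

assert E1 == Fraction(alpha, alpha + beta)          # standard beta mean
assert E2 - E1**2 == Fraction(alpha * beta,
    (alpha + beta)**2 * (alpha + beta + 1))         # standard beta variance
```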


3 Probability Calculation Methods

3.1 Transformations in One Dimension

Given r.v. Y defined on Y (the region on which Y is nontrivial), we are interested in the transformed r.v. U (defined on U), given by

U = h(Y) where h(.) : Y → R

Discrete Case

Suppose h(.) maps values from (countable) {y_1, y_2, . . . , y_n, . . . } to (countable) {u_1, u_2, . . . , u_n, . . . } with u_i = h(y_j). We typically deal with simple transformations, e.g.

U = 2Y,  U = exp(−Y),  U = Y/(Y + 1),  U = Y²

• if h(.) is 1-1, then identify the set {u_i}_{i=1}^∞ that {y_i}_{i=1}^∞ map onto, and for each i, assign p_Y(y_i) to u_i to obtain p_U(u_i).

• if h(.) is not 1-1, we must identify the set U and compute for each case the probability assigned to the values in the set.

Continuous Case

We again want to see how values in the set Y are mapped to a set U. Consider B ⊆ U, the set of values that points in A ⊆ Y can be mapped to under the transformation h(.). Recall that for a continuous r.v. Y, P(Y = y) = 0 ∀y ∈ R.

B = {u : u = h(y), y ∈ A} =⇒ P(Y ∈ A) = P(U ∈ B)

So to compute F_U(.) or f_U(.), fix B ⊆ U. If h(.) is a 1-1 transformation, then the inverse transformation h^{−1}(.) can usually be computed explicitly, and we know P(U ∈ B) = P(Y ∈ h^{−1}(B)).

Theorem 3.1. f_U(u) = f_Y(h^{−1}(u)) |d h^{−1}(u)/du|

Proof. Two cases depending on h(.):

• 1-1 increasing:

F_U(u) = P(U ≤ u) = P(h(Y) ≤ u) = P(Y ≤ h^{−1}(u)) = F_Y(h^{−1}(u))

f_U(u) = dF_U(u)/du = dF_Y(h^{−1}(u))/du = f_Y(h^{−1}(u)) · d h^{−1}(u)/du, where d h^{−1}(u)/du ≥ 0

• 1-1 decreasing:

F_U(u) = P(U ≤ u) = P(h(Y) ≤ u) = P(Y ≥ h^{−1}(u)) = 1 − F_Y(h^{−1}(u))

f_U(u) = dF_U(u)/du = −dF_Y(h^{−1}(u))/du = −f_Y(h^{−1}(u)) · d h^{−1}(u)/du, where d h^{−1}(u)/du ≤ 0

The identity P(U ∈ B) = P(Y ∈ h^{−1}(B)) can be re-written as

∫_B f_U(u) du = ∫_{h^{−1}(B)} f_Y(y) dy

Example 3.1. If U = cY + d for c > 0 and d,

m_U(t) = E[e^{tU}] = E[e^{t(cY+d)}] = E[e^{(ct)Y} e^{dt}] = e^{dt} E[e^{(ct)Y}]

=⇒ m_U(t) = e^{dt} m_Y(ct)
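Theorem 3.1 can be illustrated with a concrete transformation (Python, not part of the notes): take Y ~ Uniform(0, 1) and U = Y², which is 1-1 increasing on (0, 1). Then h⁻¹(u) = √u, dh⁻¹/du = 1/(2√u) and f_Y ≡ 1, so f_U(u) = 1/(2√u); integrating this density should reproduce F_U(u) = P(Y ≤ √u) = √u:

```python
from math import sqrt

def f_U(u):
    # Theorem 3.1 with h(y) = y², f_Y ≡ 1 on (0,1):
    # f_U(u) = f_Y(√u) · |d√u/du| = 1/(2√u)
    return 1.0 / (2.0 * sqrt(u))

def F_U(u, steps=200_000):
    # Midpoint-rule quadrature of f_U over (0, u)
    h = u / steps
    return sum(f_U((k + 0.5) * h) * h for k in range(steps))

# The cdf obtained by integrating f_U matches P(Y² ≤ u) = √u.
for u in (0.25, 0.5, 0.81):
    assert abs(F_U(u) - sqrt(u)) < 2e-3
```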


3.2 Techniques for sums of random variables

Discrete Random Variables

Let Y1 and Y2 be independent discrete random variables taking values on Y1 and Y2 respectively, and let Y = Y1 + Y2. The important components of this calculation are:

• the partition of the event (Y = y) according to the possible values of Y1
• the theorem of total probability
• the independence of Y1 and Y2

P(Y = y) = P( ⋃_{y1∈Y1} (Y = y) ∩ (Y1 = y1) )

= ∑_{y1∈Y1} P((Y = y) ∩ (Y1 = y1))

= ∑_{y1∈Y1} P((Y2 = y − y1) ∩ (Y1 = y1))

= ∑_{y1∈Y1} P(Y2 = y − y1) P(Y1 = y1)

= ∑_{y1∈Y∗1} P(Y2 = y − y1) P(Y1 = y1)

where Y∗1 = {y1 : y1 ∈ Y1 and y − y1 ∈ Y2}.

Theorem 3.2. If Y1 and Y2 are independent Poisson variables with parameters λ1 and λ2, then Y = Y1 + Y2 also has a Poisson distribution, with parameter λ1 + λ2.

Proof.

P(Y = y) = ∑_{y1=0}^{y} P(Y1 = y1) P(Y2 = y − y1)

= ∑_{y1=0}^{y} (λ1^{y1} e^{−λ1} / y1!) · (λ2^{y−y1} e^{−λ2} / (y − y1)!)

= e^{−(λ1+λ2)} λ2^{y} ∑_{y1=0}^{y} (λ1/λ2)^{y1} · 1/(y1! (y − y1)!)

= (e^{−(λ1+λ2)} λ2^{y} / y!) ∑_{y1=0}^{y} (y choose y1) (λ1/λ2)^{y1}

= (e^{−(λ1+λ2)} λ2^{y} / y!) (1 + λ1/λ2)^{y}  (binomial theorem)

= e^{−(λ1+λ2)} (λ1 + λ2)^{y} / y!,  y = 0, 1, 2, . . .

which is the Poisson pmf with parameter λ1 + λ2.
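Theorem 3.2 can be checked numerically by computing the convolution sum directly; the sketch below (an illustration, with arbitrary test values λ1 = 2, λ2 = 3.5 of my choosing) compares it to the Poisson(λ1 + λ2) pmf.

```python
import math

def pois_pmf(y, lam):
    """Poisson pmf P(Y = y) for parameter lam."""
    return lam**y * math.exp(-lam) / math.factorial(y)

def conv_pmf(y, lam1, lam2):
    """P(Y1 + Y2 = y) via the discrete convolution sum over y1 = 0..y."""
    return sum(pois_pmf(y1, lam1) * pois_pmf(y - y1, lam2)
               for y1 in range(y + 1))

lam1, lam2 = 2.0, 3.5
for y in range(10):
    # convolution of the two pmfs agrees with Poisson(lam1 + lam2)
    assert abs(conv_pmf(y, lam1, lam2) - pois_pmf(y, lam1 + lam2)) < 1e-12
print("convolution matches Poisson(lam1 + lam2)")
```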

Continuous Random Variables

If Y1 and Y2 are independent continuous random variables, then the sum random variable Y = Y1 + Y2 is also continuous and we aim to compute FY(y) = P(Y ≤ y). By definition, with fY(y) the pdf of Y,

FY(y) = P(Y ≤ y) = ∫_{−∞}^{y} fY(t) dt, where (Y ≤ y) = (Y1 + Y2 ≤ y)

From the discrete case above, we analogize to:

Definition 3.1. Convolution formula for the pdf of Y = Y1 + Y2:

fY(y) = ∫_{−∞}^{∞} fY2(y − y1) fY1(y1) dy1

where fY(.), fY1(.) and fY2(.) are the pdf's for Y, Y1 and Y2 respectively.

We can extend this result to more independent r.v.'s by applying the convolution formula recursively, e.g.

Y = Y1 + Y2 + Y3 = (Y1 + Y2) + Y3
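As a sketch of the convolution formula in action (the uniform example and tolerances are my own), numerically convolving two Uniform(0, 1) pdfs should recover the triangular density fY(y) = y on [0, 1] and 2 − y on [1, 2].

```python
def f_uniform(y):
    """pdf of Uniform(0, 1)."""
    return 1.0 if 0.0 <= y <= 1.0 else 0.0

def conv_pdf(f1, f2, y, lo=-2.0, hi=3.0, n=50_000):
    """Numeric convolution f_Y(y) = int f2(y - y1) f1(y1) dy1 (midpoint rule)."""
    h = (hi - lo) / n
    return sum(f2(y - (lo + (i + 0.5) * h)) * f1(lo + (i + 0.5) * h)
               for i in range(n)) * h

# Sum of two independent Uniform(0,1) r.v.'s has the triangular density:
# f_Y(y) = y on [0,1], 2 - y on [1,2].
for y, expected in [(0.5, 0.5), (1.0, 1.0), (1.5, 0.5)]:
    assert abs(conv_pdf(f_uniform, f_uniform, y) - expected) < 1e-3
print("numeric convolution matches the triangular density")
```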


4 Multivariate Distributions

We now consider probability specifications for multiple random variables where there is not necessarily independence. We consider the vector random variable (Y1, Y2, . . . , Yn).

We can think of p(y1, y2) as specifying probabilities in a table with Y1 and Y2 values in rows / columns. We can find the joint pmf for various situations using combinatorial arguments.

4.1 Discrete joint distributions

Definition 4.1. Let Y1, . . . , Yn be discrete random variables on Z. The joint (discrete bivariate for n = 2) probability mass function is

p(y1, . . . , yn) = P( ⋂_{i=1}^{n} (Yi = yi) ) ≡ P(Y1 = y1, . . . , Yn = yn)

Theorem 4.1. The joint pmf has the basic properties:

0 ≤ p(y1, . . . , yn) ≤ 1  ∀yi, i ∈ [1, n]

∑_{y1=−∞}^{∞} · · · ∑_{yn=−∞}^{∞} p(y1, . . . , yn) = 1

4.2 Marginal CDFs and PDFs

Definition 4.2. Given a joint pmf pY1,Y2(., .) for Y1 and Y2, the marginal probability mass function for Yi is merely the pmf pYi of Yi, which can be found via the Theorem of Total Probability (1.12):

pY1(y1) = P(Y1 = y1) = ∑_{y2=−∞}^{∞} pY1,Y2(y1, y2)

Definition 4.3. The conditional mass function for Y2 given Y1 = y1 is

pY2|Y1(y2 | y1) = P(Y2 = y2 | Y1 = y1) = pY1,Y2(y1, y2) / pY1(y1)

If pY1,Y2(y1, y2) specifies the values in a probability table, we compute the marginal pmfs for Y1 and Y2 by summing respectively across the rows and down the columns of the table, schematically:

             Y2
           1       2       3    | pY1(.)
 Y1   1    ·       ·       ·    |
      2    ·       ·       ·    | pY1(2)
      3    ·       ·       ·    |
   pY2(.)  ·     pY2(2)    ·    |   1

Recall the fundamental result

pY1,Y2(y1, y2) = pY1(y1)pY2 |Y1(y2 | y1) when pY1(y1) > 0
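The table computation can be sketched in code; the joint pmf values below are hypothetical, chosen only to illustrate summing out one variable and forming a conditional pmf.

```python
# A hypothetical joint pmf table for (Y1, Y2); keys are (y1, y2).
joint = {
    (1, 1): 0.10, (1, 2): 0.05, (1, 3): 0.15,
    (2, 1): 0.20, (2, 2): 0.10, (2, 3): 0.10,
    (3, 1): 0.05, (3, 2): 0.15, (3, 3): 0.10,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12  # a joint pmf sums to 1

# Marginals: sum out the other variable (theorem of total probability).
p_Y1 = {y1: sum(p for (a, _), p in joint.items() if a == y1) for y1 in (1, 2, 3)}
p_Y2 = {y2: sum(p for (_, b), p in joint.items() if b == y2) for y2 in (1, 2, 3)}

# Conditional pmf of Y2 given Y1 = 2: p(y2 | y1) = p(y1, y2) / p_Y1(y1).
p_Y2_given_Y1_2 = {y2: joint[(2, y2)] / p_Y1[2] for y2 in (1, 2, 3)}

print(round(p_Y1[2], 6))             # 0.4  (= 0.20 + 0.10 + 0.10)
print(round(p_Y2_given_Y1_2[1], 6))  # 0.5  (= 0.20 / 0.40)
```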

Definition 4.4. Discrete r.v.'s Y1, . . . , Yn are independent if

pY1,...,Yn(y1, . . . , yn) = ∏_{i=1}^{n} pYi(yi)

i.e. pYi|Yj(yi | yj) = pYi(yi) for all i ≠ j.


Definition 4.5. The joint cumulative distribution function (cdf) FY1,Y2(y1, y2) and joint probability density function (pdf) fY1,Y2(y1, y2) satisfy, ∀(y1, y2) ∈ R²,

FY1,Y2(y1, y2) = P((Y1 ≤ y1) ∩ (Y2 ≤ y2)) = ∫_{−∞}^{y1} ∫_{−∞}^{y2} fY1,Y2(t1, t2) dt2 dt1

Theorem 4.2. The joint cdf has the following properties:

lim_{y1→−∞} lim_{y2→−∞} FY1,Y2(y1, y2) = 0  lim_{y1→∞} lim_{y2→∞} FY1,Y2(y1, y2) = 1

FY1,Y2(y1, y2) ≤ FY1,Y2(y1 + c, y2) and FY1,Y2(y1, y2) ≤ FY1,Y2(y1, y2 + c)  ∀y1, y2, c > 0

lim_{y1→∞} FY1,Y2(y1, y2) = FY2(y2)  lim_{y2→∞} FY1,Y2(y1, y2) = FY1(y1)

Theorem 4.3. By the probability axioms, the joint pdf

• is non-negative: fY1,Y2(y1, y2) ≥ 0

• integrates to 1: ∫_{−∞}^{∞} ∫_{−∞}^{∞} fY1,Y2(y1, y2) dy2 dy1 = 1

Definition 4.6. The marginal probability density function (pdf) fYi(yi), for each fixed yi, i ∈ {1, 2}, is

fYi(yi) = ∫_{−∞}^{∞} fY1,Y2(y1, y2) dyj,  j ≠ i

Definition 4.7. The conditional probability density function (pdf) is

fY2|Y1(y2 | y1) = fY1,Y2(y1, y2) / fY1(y1)

As in the discrete case, we have the chain rule factorization

fY1,Y2(y1, y2) = fY1(y1) fY2|Y1(y2 | y1) = fY2(y2) fY1|Y2(y1 | y2)

To compute fY1,Y2(y1, y2) from FY1,Y2(y1, y2), use

fY1,Y2(y1, y2) = ∂²/∂y1∂y2 ( FY1,Y2(y1, y2) )

where the partial differentiations commute provided the second derivatives are continuous at that point.

The conditional pdf fY2|Y1(y2 | y1) is a pdf in y2 for every fixed y1, so it is non-negative and integrates to 1 over y2.

4.3 Independence and Correlation

Definition 4.8. Continuous r.v.'s (Y1, . . . , Yn) are independent if

fY1,...,Yn(y1, . . . , yn) = ∏_{i=1}^{n} fYi(yi)

In the bivariate case, we have

fY2|Y1(y2 | y1) = fY2(y2)  and  fY1|Y2(y1 | y2) = fY1(y1)

Remark 25. An easy way to prove that two r.v.'s are not independent is to check whether Y12 ≠ Y1 × Y2, where Y12 is the support of the joint pdf and Y1 × Y2 is the Cartesian product of the marginal supports.


Extending the idea of expectation to multiple variables Y1, . . . , Yn, we get the following, respectively discrete and continuous, expressions:

E[g(Y1, . . . , Yn)] = ∑_{y1=−∞}^{∞} · · · ∑_{yn=−∞}^{∞} g(y1, . . . , yn) pY1,...,Yn(y1, . . . , yn)

E[g(Y1, . . . , Yn)] = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(y1, . . . , yn) fY1,...,Yn(y1, . . . , yn) dy1 · · · dyn

and linearity of expectation (Theorem 2.2) still holds.

For multiple variables, we consider the pairwise covariances between all of them and collect the results into a matrix called the covariance matrix.

Definition 4.9. The covariance Cov [Y1, Y2] between Y1 and Y2 with expectations µ1 and µ2 is

Cov [Y1, Y2] = E[(Y1 − µ1)(Y2 − µ2)] = E[Y1Y2] − µ1µ2

Definition 4.10. The correlation Corr [Y1, Y2] between Y1 and Y2 is

Corr [Y1, Y2] = Cov [Y1, Y2] / √(V[Y1] V[Y2])

Theorem 4.4. −1 ≤ Corr [Y1, Y2] ≤ 1

Definition 4.11. Y1 and Y2 are uncorrelated if Cov [Y1, Y2] = Corr [Y1, Y2] = 0.

Theorem 4.5. For Y1 and Y2 independent,

E[g1(Y1) g2(Y2)] = E[g1(Y1)] E[g2(Y2)]

Theorem 4.6. If Y1 and Y2 are independent then they are uncorrelated.

Proof (Theorem 4.5).

E[g1(Y1) g2(Y2)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g1(y1) g2(y2) fY1,Y2(y1, y2) dy1 dy2

= ( ∫_{−∞}^{∞} g1(y1) fY1(y1) dy1 ) ( ∫_{−∞}^{∞} g2(y2) fY2(y2) dy2 )

= E[g1(Y1)] E[g2(Y2)]

Proof (Theorem 4.6).

Cov [Y1, Y2] = E[Y1Y2] − µ1µ2 = E[Y1] E[Y2] − µ1µ2 = 0

using Theorem 4.5 with g1, g2 the identity.

Warning 26. Uncorrelated does not imply independent!
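A minimal sketch of Warning 26, using the classic counterexample (my choice, not from the lecture): Y1 uniform on {−1, 0, 1} and Y2 = Y1² are uncorrelated but clearly dependent.

```python
# Counterexample (hypothetical): Y1 uniform on {-1, 0, 1}, Y2 = Y1^2.
pmf_Y1 = {-1: 1/3, 0: 1/3, 1: 1/3}

E_Y1 = sum(y * p for y, p in pmf_Y1.items())            # E[Y1] = 0
E_Y2 = sum((y ** 2) * p for y, p in pmf_Y1.items())     # E[Y2] = 2/3
E_Y1Y2 = sum(y * (y ** 2) * p for y, p in pmf_Y1.items())  # E[Y1^3] = 0

cov = E_Y1Y2 - E_Y1 * E_Y2
print(cov)  # 0.0: uncorrelated

# ...but not independent: P(Y1 = 0, Y2 = 0) = P(Y1 = 0) = 1/3,
# while P(Y1 = 0) * P(Y2 = 0) = 1/3 * 1/3 = 1/9.
p_joint_00 = pmf_Y1[0]
p_prod_00 = pmf_Y1[0] * pmf_Y1[0]
print(p_joint_00 > p_prod_00)  # True
```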

4.4 Transformations of random variables

Linear Combinations

Let Y1, . . . , Yn with E[Yi] = µi and X1, . . . , Xm with E[Xj] = ξj, and define U1 = ∑_{i=1}^{n} ai Yi and U2 = ∑_{j=1}^{m} bj Xj. We get the following:

Theorem 4.7.

E[U1] = ∑_{i=1}^{n} ai E[Yi] = ∑_{i=1}^{n} ai µi

V[U1] = ∑_{i=1}^{n} ai² V[Yi] + 2 ∑_{j=2}^{n} ∑_{i=1}^{j−1} ai aj Cov [Yi, Yj]

Cov [U1, U2] = ∑_{i=1}^{n} ∑_{j=1}^{m} ai bj Cov [Yi, Xj]
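Theorem 4.7's variance formula can be verified directly on a small example; the joint pmf and coefficients below are hypothetical, chosen only for illustration.

```python
# Check Theorem 4.7 on a small hypothetical joint pmf for (Y1, Y2).
joint = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}
a1, a2 = 2.0, -3.0

def E(g):
    """Expectation of g(y1, y2) under the joint pmf."""
    return sum(g(y1, y2) * p for (y1, y2), p in joint.items())

mu1, mu2 = E(lambda y1, y2: y1), E(lambda y1, y2: y2)
v1 = E(lambda y1, y2: (y1 - mu1) ** 2)
v2 = E(lambda y1, y2: (y2 - mu2) ** 2)
cov = E(lambda y1, y2: (y1 - mu1) * (y2 - mu2))

# Direct variance of U1 = a1*Y1 + a2*Y2 vs. the formula
# V[U1] = a1^2 V[Y1] + a2^2 V[Y2] + 2 a1 a2 Cov[Y1, Y2].
mu_U = E(lambda y1, y2: a1 * y1 + a2 * y2)
v_U_direct = E(lambda y1, y2: (a1 * y1 + a2 * y2 - mu_U) ** 2)
v_U_formula = a1**2 * v1 + a2**2 * v2 + 2 * a1 * a2 * cov
assert abs(v_U_direct - v_U_formula) < 1e-12
print("variance formula verified")
```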


Example 4.1 (Sum of Binomial r.v.'s). For Y1, Y2 independent Binomial r.v.'s with parameters (n1, p) and (n2, p) we have, for Y = Y1 + Y2,

mY(t) = mY1(t) mY2(t) = (pe^t + q)^{n1} (pe^t + q)^{n2} = (pe^t + q)^{n1+n2}

Notice this is the mgf of the Binomial(n1 + n2, p) distribution.

Example 4.2 (Sum of Poisson r.v.'s). Similarly to the previous example, for Y1 and Y2 independent Poisson r.v.'s with parameters λ1 and λ2, we find

mY(t) = exp((λ1 + λ2)(e^t − 1))

the mgf of the Poisson distribution with parameter λ1 + λ2.

Example 4.3 (Sum of Normal r.v.'s). Similarly, for Y1 and Y2 independent Normal r.v.'s with parameters (µ1, σ1²) and (µ2, σ2²), we get

mY(t) = exp((µ1 + µ2)t + (σ1² + σ2²)t²/2)

the mgf of the Normal distribution with parameters (µ1 + µ2, σ1² + σ2²).

Back to sums: consider Y = Y1 + Y2 and recall the convolution formula (Definition 3.1). An easier method than computing convolutions is to use the moment generating function of U = g(Y) (Definition 2.18), mU(t) = E[e^{tU}] = E[e^{t g(Y)}]. If Y1 and Y2 are independent and Y = a1 g1(Y1) + a2 g2(Y2):

mY(t) = E[e^{t(a1 g1(Y1) + a2 g2(Y2))}] = E[e^{t a1 g1(Y1)}] E[e^{t a2 g2(Y2)}]

Theorem 4.8. For Y = Y1 + Y2 where Y1, Y2 are independent,

mY(t) = mY1(t) mY2(t)

Proof. mY(t) = E[e^{t(Y1+Y2)}] = E[e^{tY1}] E[e^{tY2}] = mY1(t) mY2(t)

General Transformations

Approaching from a 'change of variables' perspective:

Theorem 4.9. Consider the bijective transformation

(Y1, Y2) −→ (U1, U2)

where we can write

Y1 = h1^{−1}(U1, U2) and Y2 = h2^{−1}(U1, U2)

Then for (u1, u2) ∈ R²,

fU1,U2(u1, u2) = fY1,Y2(h1^{−1}(u1, u2), h2^{−1}(u1, u2)) |J(u1, u2)|

where the Jacobian J(u1, u2) is the determinant of the 2 × 2 matrix of partial derivatives

[ ∂h1^{−1}/∂u1  ∂h1^{−1}/∂u2 ; ∂h2^{−1}/∂u1  ∂h2^{−1}/∂u2 ]

all evaluated at (u1, u2).

Proof. c.f. textbook.
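A numerical sketch of Theorem 4.9 (example of my choosing, not from the lecture): for Y1, Y2 iid Exponential(1) and U1 = Y1 + Y2, U2 = Y1/(Y1 + Y2), the inverse is Y1 = U1U2, Y2 = U1(1 − U2) with |J(u1, u2)| = u1, giving fU1,U2(u1, u2) = u1 e^{−u1} on u1 > 0, 0 < u2 < 1.

```python
import math

# (Y1, Y2) iid Exponential(1); U1 = Y1 + Y2, U2 = Y1 / (Y1 + Y2).
# Inverse: Y1 = U1*U2, Y2 = U1*(1 - U2); Jacobian J(u1, u2) = -u1, |J| = u1.
def f_Y(y1, y2):
    return math.exp(-(y1 + y2)) if y1 > 0 and y2 > 0 else 0.0

def f_U(u1, u2):
    """Theorem 4.9: f_U(u1, u2) = f_Y(u1*u2, u1*(1 - u2)) * |u1|."""
    return f_Y(u1 * u2, u1 * (1.0 - u2)) * abs(u1)

# f_U factorizes: Gamma(2,1) density u1*e^{-u1} in u1, Uniform(0,1) in u2.
print(round(f_U(2.0, 0.3), 6), round(2.0 * math.exp(-2.0), 6))  # equal

# Midpoint-rule check that f_U integrates to 1 over u1 in (0,20), u2 in (0,1):
n = 2000
h1, h2 = 20.0 / n, 1.0 / 50
total = sum(f_U((i + 0.5) * h1, (j + 0.5) * h2) * h1 * h2
            for i in range(n) for j in range(50))
print(round(total, 4))  # ~ 1.0
```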


5 Probability Inequalities and Theorems

5.1 Probability Inequalities

Theorem 5.1 (Markov's Inequality). Suppose g(y) is a function of a continuous r.v. Y such that E[|g(Y)|] < ∞. For any k > 0 let

Ak = {y ∈ R : |g(y)| ≥ k}

Then we have

P(|g(Y)| ≥ k) ≤ E[|g(Y)|] / k

Proof. We want to prove that E[|g(Y)|] ≥ k P(|g(Y)| ≥ k):

E[|g(Y)|] = ∫_{−∞}^{∞} |g(y)| fY(y) dy = ∫_{Ak} |g(y)| fY(y) dy + ∫_{Ak′} |g(y)| fY(y) dy

≥ ∫_{Ak} |g(y)| fY(y) dy ≥ ∫_{Ak} k fY(y) dy = k P(|g(Y)| ≥ k)

Markov’s Inequality holds for dis-crete r.v.’s as well (just replace all theintegrals with sums in the proof).

It can be written equivalently as

P(|g(Y)| < k

)≥ 1−

E[|g(Y)|

]k
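A quick check of Markov's inequality (the example distribution and test values are my own): for Y ~ Exponential(1), E[Y] = 1 and P(Y ≥ k) = e^{−k}, which indeed never exceeds the bound 1/k.

```python
import math

# Markov's inequality for Y ~ Exponential(1): E[|Y|] = 1, P(Y >= k) = e^{-k}.
E_Y = 1.0
for k in [0.5, 1.0, 2.0, 5.0, 10.0]:
    tail = math.exp(-k)   # exact P(|Y| >= k); Y is non-negative here
    bound = E_Y / k       # Markov bound E[|Y|] / k
    assert tail <= bound
    print(f"k={k:>4}: P(Y >= k) = {tail:.4f} <= {bound:.4f}")
```

Note the bound is loose for small k (it can even exceed 1) and tightens only slowly; Chebychev's inequality below is sharper for deviations from the mean.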

Theorem 5.2 (Chebychev's Inequality). For any distribution with finite mean µ and variance σ², we have that ∀c > 0,

P(µ − cσ < Y < µ + cσ) ≥ 1 − 1/c²

Proof. Let g(Y) = (Y − µ)² in Theorem 5.1 (in its equivalent form):

P((Y − µ)² < k) ≥ 1 − E[(Y − µ)²] / k

Recall E[(Y − µ)²] = σ², so choose k = c²σ²:

P((Y − µ)² < c²σ²) = P(√((Y − µ)²) < √(c²σ²)) = P(|Y − µ| < cσ)

P(|Y − µ| < cσ) ≥ 1 − E[(Y − µ)²] / (c²σ²) = 1 − 1/c²

Theorem 5.2 bounds how much probability can be present in the tails of the distribution. It can be written equivalently as

P(|Y − µ| ≥ cσ) ≤ 1/c²

Proof (of Theorem 5.3, stated below). Recall that for U = ∑_{i=1}^{n} Yi we have E[U] = nµ and V[U] = nσ². Then

E[Yn] = (1/n) E[ ∑_{i=1}^{n} Yi ] = nµ/n = µ

V[Yn] = (1/n²) V[ ∑_{i=1}^{n} Yi ] = nσ²/n² = σ²/n

By Chebychev's inequality (Theorem 5.2),

P(|Yn − µ| ≥ cσ/√n) ≤ 1/c²  ∀c > 0

Let ε = cσ/√n, so that c = √n ε/σ:

=⇒ P(|Yn − µ| ≥ ε) ≤ σ²/(nε²)

where ε is arbitrary.

Theorem 5.3 (Sample mean). Suppose Y1, . . . , Yn are independent r.v.'s that have the same distribution, with expectation µ and variance σ². Let Yn = (1/n) ∑_{i=1}^{n} Yi be the sample mean r.v. Then ∀ε > 0,

P(|Yn − µ| ≥ ε) ≤ σ²/(nε²)

5.2 Convergence and Theorems

Definition 5.1. A sequence of r.v.'s Y1, Y2, . . . , Yn, . . . converges in probability to c ∈ R as n → ∞, written Yn →p c, if ∀ε > 0,

lim_{n→∞} P(|Yn − c| < ε) = 1


Theorem 5.4 (Weak Law of Large Numbers). Suppose Y1, . . . , Yn are iid with expectation µ and variance σ². Then the sample mean Yn as defined in Theorem 5.3 converges in probability to µ:

Yn →p µ  (as n → ∞)

Remark 27. The n in the denominator of the bound in Theorem 5.3 tells us that the rate of convergence is at least of order n.

In other words, the sample mean r.v. converges in probability to the expectation of the distribution of the Yi's.

Proof. Follows directly from Chebychev (Theorem 5.2):

P(|Yn − µ| ≥ ε) ≤ σ²/(nε²) ⇔ P(|Yn − µ| < ε) ≥ 1 − σ²/(nε²)  ∀ε > 0

and 1 − σ²/(nε²) → 1 as n → ∞, so

lim_{n→∞} P(|Yn − µ| < ε) = 1

So Yn →p µ.
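The WLLN can be illustrated by simulation (a sketch, with my own choice of distribution and ε): for Yi ~ Uniform(0, 1), the empirical frequency of |Yn − µ| ≥ ε shrinks as n grows and stays below the bound of Theorem 5.3.

```python
import random

random.seed(0)

# Y_i ~ Uniform(0, 1): mu = 1/2, sigma^2 = 1/12 (hypothetical example).
mu, var, eps = 0.5, 1.0 / 12.0, 0.05

def prop_far(n, reps=2000):
    """Empirical P(|Ybar_n - mu| >= eps) over `reps` simulated sample means."""
    count = 0
    for _ in range(reps):
        ybar = sum(random.random() for _ in range(n)) / n
        if abs(ybar - mu) >= eps:
            count += 1
    return count / reps

for n in [10, 100, 1000]:
    bound = var / (n * eps * eps)  # Theorem 5.3's bound sigma^2 / (n eps^2)
    print(f"n={n:>5}: empirical {prop_far(n):.3f}, "
          f"Chebychev bound {min(bound, 1):.3f}")
```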

Definition 5.2. Consider a sequence of cdfs F1(y), F2(y), . . . , Fn(y), . . . such that lim_{n→∞} Fn(y) = F(y) at all points where F(y) is continuous. Then F(y) is the limiting distribution for the sequence.

Definition 5.3. If a sequence of r.v.'s Y1, Y2, . . . , Yn, . . . has respective cdfs F1(y), F2(y), . . . , Fn(y), . . . converging (in the sense of Definition 5.2) to the cdf F(y) of a r.v. Y, this sequence converges in distribution to Y as n → ∞, written Yn →d Y.

Theorem 5.5 (Central Limit Theorem). Suppose Y1, . . . , Yn are iid with expectation µ and variance σ². Consider the sample mean Yn as defined in Theorem 5.3. The random variable Un = √n (Yn − µ)/σ converges in distribution to the standard normal distribution Normal(0, 1):

lim_{n→∞} FUn(u) = lim_{n→∞} P(Un ≤ u) = ∫_{−∞}^{u} (1/√(2π)) e^{−t²/2} dt

Recalling Yn = (1/n) ∑_{i=1}^{n} Yi,

Un = (Yn − µ)/(σ/√n) = (1/(σ√n)) ( ∑_{i=1}^{n} Yi − nµ )

In other words, irrespective of the actual distribution of Y1, . . . , Yn, provided the conditions on expectation and variance are met, the distributions of Yn = (1/n) ∑_{i=1}^{n} Yi and Sn = ∑_{i=1}^{n} Yi can be approximated using a Normal distribution:

Yn ≈ Normal(µ, σ²/n)  Sn ≈ Normal(nµ, nσ²)

Proof. Let Zi = (Yi − µ)/σ for i = 1, 2, . . . , n. Then E[Zi] = 0 and V[Zi] = 1, so

mZ(0) = 1,  mZ^{(1)}(0) = E[Zi] = 0

(Recall that for X = σY + µ, mX(t) = e^{µt} mY(σt).)

Notice that Un = (1/√n) ∑_{i=1}^{n} Zi, so mUn(t) = (mZ(t/√n))^n. Taylor expanding mZ(t) at t = 0:

mZ(t) = mZ(0) + t mZ^{(1)}(0) + (t²/2) mZ^{(2)}(t0),  0 < t0 < t

= 1 + (t²/2) mZ^{(2)}(t0)

=⇒ mUn(t) = ( 1 + (t²/(2n)) mZ^{(2)}(t0) )^n,  t0 ∈ [0, t/√n] → 0 as n → ∞

Since E[Zi²] = mZ^{(2)}(0) = 1,

lim_{n→∞} mUn(t) = lim_{n→∞} ( 1 + (t²/(2n)) mZ^{(2)}(0) )^n = lim_{n→∞} ( 1 + t²/(2n) )^n = e^{t²/2}

by the limit definition of the exponential, which is the mgf of Normal(0, 1).

=⇒ Un →d Normal(0, 1)
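The CLT can be illustrated by simulation (a sketch, with the skewed Exponential(1) distribution and sample sizes chosen by me): the empirical cdf of Un at a few points is compared to the standard normal cdf Φ.

```python
import math
import random

random.seed(1)

# CLT sketch: Y_i ~ Exponential(1), so mu = 1 and sigma = 1 (a skewed choice).
def u_n(n):
    """One draw of U_n = sqrt(n) * (Ybar_n - mu) / sigma = (sum - n) / sqrt(n)."""
    s = sum(random.expovariate(1.0) for _ in range(n))
    return (s - n) / math.sqrt(n)

def Phi(u):
    """Standard normal cdf via math.erf."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

n, reps = 200, 5000
draws = [u_n(n) for _ in range(reps)]
for u in [-1.0, 0.0, 1.0]:
    emp = sum(d <= u for d in draws) / reps
    print(f"P(U_n <= {u:+.0f}): empirical {emp:.3f} vs Phi {Phi(u):.3f}")
```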