Welcome to
STAB52
Instructor: Dr. Ken Butler
1
Contact information
(on Intranet: intranet.utsc.utoronto.ca, My Courses)
• E-mail: [email protected]
• Office: H 417
• Office hours: to be announced
• Phone: 5654 (416-287-5654)
2
Probability Models
3
Measuring uncertainty
4
Random Variables and
Distributions
5
Random Variables
Suppose we flip two (fair) coins, and note whether each coin
(ordered) comes up H or T.
• Sample space is S = {HH,HT, TH, TT}.
• Probability measure assigns 1/4 to each of the 4 outcomes.
What about “number of heads”? Could be 0, 1 or 2:
• P(0 heads) = P(TT) = 1/4
• P(1 head) = P(TH) + P(HT) = 1/2
• P(2 heads) = P(HH) = 1/4.
6
“Number of heads” is random variable: function from S to R. That
is, given outcome, get value of random variable.
Random variables can be any function from S to R. If
S = {rain, snow, clear}, random variable X could be
X(rain) = 3
X(snow) = 6
X(clear) = −2.7.
7
Some more examples of random variables
Roll a fair 6-sided die, so that S = {1, 2, 3, 4, 5, 6}. Let X be the
number of spots showing, let Y be square of number of spots. If s is
number of spots, let W = s + 10, let U = s2 − 5s + 3, etc.
In previous situation, let C = 3 regardless of s. C is constant
random variable.
Suppose have event A, only interested in whether A happens or
not. Define indicator random variable I to be 1 if A happens, 0
otherwise. Example (rolling die) I6(s) = 1 if s = 6, 0 otherwise.
8
≤, =, sum for random variables
Imagine rolling a fair die again, S = {1, 2, 3, 4, 5, 6}. Let X = s,
and let Y = X + I6.
X is number of spots, I6 is 1 if you roll a 6 and 0 otherwise. What
does Y mean?
Eg. roll a 4, X = 4, Y = 4 + 0 = 4. But if you roll a 6,
Y = 6 + 1 = 7. (That is, Y is the number of spots plus a “bonus
point” if you roll a 6.)
Sum of random variables (like Y here) for any outcome is sum of
their values for that outcome.
9
Also: if s = 1, 2, 3, 4, 5, values of X and Y are same. If s = 6,
X < Y .
Say that random variable X ≤ Y if value of X ≤ value of Y for
every single outcome. True in example.
Say that random variable X = Y if value of X equals value of Y
for every single outcome. Not true in example (different when
outcome is s = 6).
For constant random variable c, X ≤ c if all possible values of X
are ≤ c.
10
When S is infinite
When S infinite, random variable can take infinitely many different
values (but may not).
Example: S = {1, 2, 3, . . .}. If X = s, X takes all infinitely many
values in S. But define Y = 3 if s ≤ 4, Y = 2 if 4 < s ≤ 10,
Y = 1 when s > 10. Y has only finitely many (3) different values.
11
Distributions of random variables
A random variable can be described by listing all its possible values
and their probabilities. Started this chapter with a coin-flipping
example:
Flip two (fair) coins, and note whether each coin (ordered) comes up
H or T.
Let X be “number of heads”. Could be 0, 1 or 2:
• P(X = 0) = P(TT) = 1/4
• P(X = 1) = P(TH) + P(HT) = 1/2
• P(X = 2) = P(HH) = 1/4.
Called the distribution of X .
12
Notice how can talk about P (X = s) for some s. In this case,
listing all the s for which P (X = s) > 0 describes distribution.
Consider now random variable X taking values in [0, 1] with
P(a ≤ X ≤ b) = b − a
for 0 ≤ a ≤ b ≤ 1. Try to figure out eg. P(X = 0.4): it is
P(0.4 ≤ X ≤ 0.4) = 0.4 − 0.4 = 0.
Can’t define probability of a value, but still can define probability of
landing in subset of R (namely interval).
13
To account for all of this, define distribution of random variable X
as: collection of probabilities P (X ∈ B) for all subsets B of
real numbers.
Works for both examples above. Eg. in first example,
P(X ≤ 1) = P(X = 0) + P(X = 1) = 3/4.
In practice, often messy to define probabilities for “all possible
subsets”. Think first about examples like 1st, “discrete”, where can
talk about probabilities of individual values. Then consider
“continuous” case (like 2nd), where have to look at intervals.
14
Discrete distributions
Often it makes sense to talk about individual probs, P (X = x).
When all the probability is included in these probs, ie.
∑_x P(X = x) = 1,
don't need to look at anything else.
Another way to look at it: there is a finite or countable set of x
values, x1, x2, . . ., each having probability pi = P(X = xi), such
that ∑_i pi = 1.
Either of these is a definition of a discrete distribution.
15
Compare case where P(a ≤ X ≤ b) = b − a: P(X = x) = 0
for all x, so not a discrete distribution.
Another example: suppose X = −1 with prob 1/2, and for
0 ≤ a ≤ x ≤ b ≤ 1, P(a ≤ X ≤ b) = (b − a)/2. Can talk
about P(X = −1) = 1/2, but P(X = x) = 0 for any other x. So
not a discrete distribution.
Notation for discrete distributions (emphasizes the function):
pX(x) = P(X = x),
called the probability function or mass function.
Now look at some important discrete distributions.
16
Degenerate distributions
If random variable C is constant, equal to c, then P(C = c) = 1
and P(C = x) = 0 for any x ≠ c. Since
∑_x P(C = x) = P(C = c) = 1, this is a proper (though dull)
discrete distribution. Called a degenerate distribution or point
mass.
17
Bernoulli distribution
Flip a coin once, let X be number of heads (has to be 0 or 1).
Suppose P (head) = θ, so P (tail) = 1− θ. Then
pX(1) = P (X = 1) = P (head) = θ;
pX(0) = P (X = 0) = P (tail) = 1− θ.
X said to have Bernoulli distribution; write X ∼ Bernoulli(θ).
Application: any kind of “success/failure”. Denote “success” by 1,
“failure” by 0. Or selection from population with two kinds of
individual like male/female.
18
Binomial distribution
Now suppose we flip the coin n times (independently) and again
count number of heads. Probability of exactly x heads is
pX(x) = P(X = x) = (n choose x) θ^x (1 − θ)^(n−x).
X said to have binomial distribution, written
X ∼ Binomial(n, θ).
Applications: as for Bernoulli. Eg. randomly select 100 Canadian
adults, let X be number of females.
19
Let X ∼ Binomial(4, 0.5), Y ∼ Binomial(4, 0.2). Then
x P(X=x) P(Y=x)
0 0.0625 0.4096
1 0.2500 0.4096
2 0.3750 0.1536
3 0.2500 0.0256
4 0.0625 0.0016
X probs symmetric about x = 2, Y more likely to be 0 or 1
because successes less likely.
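A minimal sketch (assuming Python with scipy is available) reproducing the table above from the binomial probability function:

from scipy.stats import binom

for x in range(5):
    px = binom.pmf(x, 4, 0.5)   # X ~ Binomial(4, 0.5)
    py = binom.pmf(x, 4, 0.2)   # Y ~ Binomial(4, 0.2)
    print(f"{x}  {px:.4f}  {py:.4f}")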
Bernoulli and binomial count successes in fixed number of trials.
Could also look at waiting time problem: fix successes, count
number of trials needed to get them.
20
Geometric distribution
Same situation as for binomial: number of trials, independent, equal
prob. θ. Let X now be number of tails before 1st head.
X = k means we observe k tails, and then a head, so
pX(k) = P(X = k) = (1 − θ)^k θ, k = 0, 1, 2, . . .
X can be as large as you like, since you might wait a long time for
the first head. (Compare binomial: can’t have more than n
successes in n trials).
X has geometric distribution, prob. θ, written X ∼ Geometric(θ).
Applications: number of working light bulbs tested until first one that
fails; number of at-bats for baseball player until first hit.
21
Examples: suppose X1 ∼ Geometric(0.8) and
X2 ∼ Geometric(0.5).
k P (X1 = k) P (X2 = k)
0 0.8 0.5
1 0.16 0.25
2 0.032 0.125
3 0.0064 0.0625
4 0.00128 0.03125
. . . . . .
When θ larger, 1st success probably sooner.
Also: probabilities form geometric series, hence the name.
22
Negative binomial distribution
To take geometric one stage further: Let r be a fixed number, let Y
be the number of tails before the r-th head.
Y = k means we observe r − 1 heads and k tails, in any order,
followed by a head (must finish with a head). That is r + k − 1 flips
before the final head. Prob. of this is
pY(k) = P(Y = k) = (r + k − 1 choose r − 1) θ^(r−1) (1 − θ)^k · θ
                 = (r + k − 1 choose k) θ^r (1 − θ)^k.
Write this Y ∼ Negative-Binomial(r, θ).
23
Applications: can re-use geometric distribution examples. Thus:
number of working lightbulbs tested until 5th non-working one
encountered; number of at-bats until baseball player achieves 10th
hit.
Numerical examples: let Y1 ∼ Negative-Binomial(4, 0.8) and
Y2 ∼ Negative-Binomial(3, 0.5).
k P(Y1=k) P(Y2=k)
0 0.4096 0.1250
1 0.3276 0.1875
2 0.1638 0.1875
3 0.0655 0.1562
4 0.0229 0.1171
5 0.0073 0.0820
6 0.0022 0.0546
24
With Y1, “heads” are likely so probably won’t see many tails before
4th H. With Y2, heads not so likely but only need to see 3 before
stopping.
General note: some books count total number of trials until first (or
r-th) head for geometric and negative binomial distributions. Gives
random variables 1 + X and r + Y as defined above.
25
Poisson distribution
Suppose X ∼ Binomial(n, λ/n). We’ll think of λ as being fixed
and see what happens as n →∞. That is, what if the number of
trials gets very large but the prob. of success gets very small?
Then
P(X = x) = (n choose x) (λ/n)^x (1 − λ/n)^(n−x)
         = [n! / (x! (n − x)! n^x)] λ^x (1 − λ/n)^n (1 − λ/n)^(−x).
26
Thinking of x as fixed (for now) and letting n → ∞: the behaviour
of the factorials is determined by the highest power of n. Thus n!
behaves like n^n, (n − x)! behaves like n^(n−x), and hence
n! / [(n − x)! n^x] → 1.
Also,
(1 − λ/n)^(−x) → 1
because 1 − λ/n → 1 and raising it to a fixed power changes
nothing.
27
Finally,
lim_{n→∞} (1 − λ/n)^n
is a famous limit from calculus; it is e^(−λ). Thus
lim_{n→∞} P(X = x) = e^(−λ) λ^x / x!.
A random variable Y with P(Y = y) = e^(−λ) λ^y / y! is said to have
a Poisson(λ) distribution, written Y ∼ Poisson(λ).
The Poisson distribution is a good model for rare events: that is,
events which have a large number of “chances” to happen, but have
a very small probability of happening at each “chance”. λ represents
“rate” at which events happen; doesn’t have to be integer.
28
Applications of Poisson distribution are things like: number of house
fires in a city on a given day, number of phone calls arriving at a
switchboard in an hour, number of radioactive events recorded by a
Geiger counter.
Let X ∼ Poisson(2), Y ∼ Poisson(0.8):
29
lam=2 lam=0.8
x P(X=x) P(Y=x)
0 0.1353 0.4493
1 0.2707 0.3595
2 0.2707 0.1438
3 0.1804 0.0383
4 0.0902 0.0077
5 0.0361 0.0012
6 0.0120 0.0002
... ...
• When λ is integer, highest prob at that integer and next lower
• Otherwise, highest prob at next lower integer (so when λ < 1,
highest prob at x = 0).
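A sketch (assuming scipy) that reproduces the Poisson table above:

from scipy.stats import poisson

for x in range(7):
    print(f"{x}  {poisson.pmf(x, 2):.4f}  {poisson.pmf(x, 0.8):.4f}")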
30
Hypergeometric distribution
Introduction
Imagine a pot containing 10 balls, 7 red and 3 green. Prob. of
drawing a red ball is 0.7 (7/10). If we put the ball drawn back in the
pot, prob. of drawing a red ball the next time is still 0.7.
Thus, drawing with replacement, number of red balls in 4 draws
R ∼ Binomial(4, 0.7). Therefore
P(R = 4) = (4 choose 4) (0.7)^4 (0.3)^0 = 0.2401.
31
Now suppose we draw without replacement: that is, don’t put balls
back in pot after drawing. If we draw a red ball 1st time, there are
only 6 red balls out of 9 balls left.
Should be harder to draw 4 red balls in 4 draws because there are
fewer left after we draw each one: now
P(R = 4) = (7/10) · (6/9) · (5/8) · (4/7) = 0.1667.
This is not so bad, but suppose we now want P (R = 3), say?
Need general principle for drawing without replacement.
32
The hypergeometric formula
Introduce symbols: suppose draw n balls out of a pot containing N
total. Suppose M of the balls in the pot are red. Let X be number
of red balls drawn. What is P (X = x)?
Need to count ways:
• Number of ways to draw n balls out of N in pot: (N choose n).
• Number of ways to draw x red balls out of M red balls in pot: (M choose x).
• Number of ways to draw n − x green balls out of N − M green balls in pot: (N − M choose n − x).
P (X = x) is number of ways to draw the red and green balls
33
divided by number of ways to draw n balls out of N :
P(X = x) = (M choose x) (N − M choose n − x) / (N choose n).
X said to have hypergeometric distribution:
X ∼ Hypergeometric(N, M, n). Checks:
M + (N −M) = N and x + (n− x) = n. Restrictions on x?
• Number of red balls: x ≤ n and x ≤ M so x ≤ min(n,M).
• Number of green balls: n − x ≤ n and n − x ≤ N − M, so
x ≥ 0 and x ≥ n + M − N, so x ≥ max(0, n + M − N).
34
Example 1: let X ∼ Hypergeometric(10, 7, 4):
x P(X=x)
0 0.0000
1 0.0333
2 0.3000
3 0.5000
4 0.1667
10 balls in pot, 7 red, 4 drawn. Cannot draw 0 red, because that
would mean drawing 4 green, and only 3 in pot. (Also cannot draw
more than 4 red because only drawing 4).
35
Example 2: let Y ∼ Hypergeometric(5, 3, 4):
y P(Y=y)
0 0.0
1 0.0
2 0.6
3 0.4
4 0.0
5 0.0
5 balls in pot, 3 red and 2 green, draw 4. Cannot draw more than 3
red. But also cannot draw only 0 or 1 red, because that would mean
drawing 4 or 3 green, and aren’t that many in the pot.
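A sketch (assuming scipy). In the slide's notation X ∼ Hypergeometric(N, M, n); scipy.stats.hypergeom.pmf(x, N, M, n) takes arguments in that same order (scipy's docs call them M, n, N, meaning total, successes, draws).

from scipy.stats import hypergeom

print([round(hypergeom.pmf(x, 10, 7, 4), 4) for x in range(5)])  # Example 1
print([round(hypergeom.pmf(y, 5, 3, 4), 4) for y in range(5)])   # Example 2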
36
Applications
Anything that involves drawing without replacement from a finite set
of elements. Includes sampling, eg. selecting people to include in
opinion poll. (Don’t want to select same person twice). People
sampled from might agree (red ball) or disagree (green ball) with
question asked.
Large N
If N large, might imagine that it doesn’t matter much whether you
replace balls in pot or not. In other words, for large N , binomial
would be decent approximation. Turns out to be true:
If X ∼ Hypergeometric(N, M, n) and N large, then X has
37
approx. same distribution as Y ∼ Binomial(n,M/N).
38
Continuous distributions
Suppose, for random variable X ,
P (a ≤ X ≤ b) = b− a
for 0 ≤ a ≤ b ≤ 1.
Is legitimate probability since 0 ≤ b− a ≤ 1. But
P (X = a) = a− a = 0 for any a, so not discrete distribution.
Where did the probability go?
39
Cumulative distribution functions
40
One-dimensional change of variable
41
Joint Distributions
Know how to describe random variables one at a time: probability
function (discrete), density function (continuous), cumulative
distribution function (either).
But two random variables X , Y might be related. Don’t have a way
to describe this.
Example: X ∼ Bernoulli(2/3). Let Y = 1−X .
Y ∼ Bernoulli(1/3) (count failures not successes). X, Y
related, but doesn’t show in individual probability functions.
42
Joint probability functions
Can simply find probability of all possible combinations of values for
X, Y . Uses individual probability functions and relationship.
In example: if X = 0, then Y = 1; if X = 1, then Y = 0.
Possible values for Y depend on value of X . Also,
P (X = 1) = 2/3.
Notation: pX,Y (x, y) = P (X = x, Y = y) (comma is “and”),
called joint probability function. In example:
pX,Y (1, 0) = 2/3; pX,Y (0, 1) = 1/3.
Are only possible combinations of X and Y values.
43
Often convenient to depict as table. Above example:
x \ y    0     1
  0      0    1/3
  1     2/3    0
Another:
u \ v    0     1     2
  0     1/3   1/6   1/6
  1     1/6   1/12  1/12
Note that all the probabilities sum to 1, because joint probability
function covers all possibilities.
44
Joint density functions
If random variables continuous, joint probability function makes no
sense; instead, define joint density function f(x, y) that
expresses chance of being “near” (X = x, Y = y).
Joint density function also covers all possible values of X, Y , so
integrates to 1 when integrated over both x and y.
Example: f(x, y) = 4x^2 y + 2y^5, 0 ≤ x, y ≤ 1 (page 85).
45
Sometimes possible values of Y depend on value of X . Account for
in integration.
Example: f(x, y) = 120x^3 y for x ≥ 0, y ≥ 0, x + y ≤ 1. (Thus
if X = 0.6, Y cannot exceed 0.4.) Region forms a triangle: Figure
2.7.3 of text (p. 85). Verify the density by letting the y limits of integration
depend on x (y from 0 to 1 − x).
46
Bivariate normal distribution
Suppose X, Y both have standard normal distributions, and
suppose −1 < ρ < 1. Then the bivariate standard normal
distribution with correlation ρ has joint density function
f(x, y) = [1 / (2π √(1 − ρ^2))] exp{ −[1 / (2(1 − ρ^2))] (x^2 + y^2 − 2ρxy) }.
Plotting in 3D (Figure 2.7.4) gives a 3D bell shape.
ρ measures relationship between X and Y :
• ρ = 0: no relationship
• ρ > 0: when X > 0, Y likely > 0
• ρ < 0: when X > 0, Y likely < 0.
47
Bivariate standard normal has peak at (0, 0). Replacing x by
(x− µ1)/σ1 and y by (y − µ2)/σ2 shifts peak to (µ1, µ2) and
changes decrease of density away from peak (larger σ values mean
slower decrease).
48
Calculating probabilities
For a continuous random variable X, calculate probabilities by
integrating, eg. P(a < X ≤ b) = ∫_a^b f(x) dx.
Same idea for continuous joint distribution, integrating over x and y.
Example: f(x, y) = 120x^3 y for x ≥ 0, y ≥ 0, x + y ≤ 1. Find
P(0.5 ≤ X ≤ 0.7, Y > 0.2).
Draw a picture. Region is a trapezoid: y between 0.2 and the diagonal line
(1 − x), x between the given limits. Integrate over y first, then x, to get
0.294.
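A numerical sketch of this calculation (assuming scipy; dblquad integrates over y first, then x):

from scipy.integrate import dblquad

f = lambda y, x: 120 * x**3 * y          # joint density on the triangle

# Check the density integrates to 1 over x >= 0, y >= 0, x + y <= 1:
total, _ = dblquad(f, 0, 1, lambda x: 0, lambda x: 1 - x)
print(total)                              # about 1.0

# P(0.5 <= X <= 0.7, Y > 0.2): y runs from 0.2 up to 1 - x
prob, _ = dblquad(f, 0.5, 0.7, lambda x: 0.2, lambda x: 1 - x)
print(prob)                               # about 0.294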
49
Marginal distributions
Started from individual distributions for X, Y plus relationship. But:
start from joint, get individual?
One way: get distribution of X by “averaging” over distribution of Y .
Discrete: simply row and column totals. Example:
u \ v    0     1     2    Sum
  0     1/3   1/6   1/6   2/3
  1     1/6   1/12  1/12  1/3
Sum     1/2   1/4   1/4    1
50
Without knowledge of V , U twice as likely 0 as 1; without
knowledge of U , V twice as likely 0 as 1 or 2.
Row totals here give marginal distribution of U ; column totals
here marginal distribution of V . Each marginal distribution is
proper probability distribution (probs sum to 1).
51
Continuous: integrate over other variable. Get marginal density
function.
Example: f(x, y) = 120x^3 y for x ≥ 0, y ≥ 0, x + y ≤ 1.
Marginal density for X: integrate over y. Limits 0 to 1 − x; get
fX(x) = ∫_0^(1−x) 120x^3 y dy = 60x^3 (1 − x)^2.
For Y: integrate over x, limits 0 to 1 − y:
fY(y) = ∫_0^(1−y) 120x^3 y dx = 30y(1 − y)^4.
“Integrating out” unwanted variable.
Alternative approach via cumulative; text page 79.
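A symbolic sketch of the "integrating out" step (assuming sympy):

import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
f = 120 * x**3 * y

fX = sp.integrate(f, (y, 0, 1 - x))   # marginal of X: 60*x**3*(1 - x)**2
fY = sp.integrate(f, (x, 0, 1 - y))   # marginal of Y: 30*y*(1 - y)**4
print(sp.factor(fX), sp.factor(fY))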
52
Example 2: bivariate standard normal. Recall the standard normal
density; it integrates to 1, so
∫_{−∞}^{∞} (1/√(2π)) exp[−u^2/2] du = 1.
Marginal distribution of X in the bivariate standard normal: integrate out
y:
fX(x) = ∫_{−∞}^{∞} [1 / (2π √(1 − ρ^2))] exp[ −(x^2 + y^2 − 2ρxy) / (2(1 − ρ^2)) ] dy.
Substitution: let u = (y − ρx)/√(1 − ρ^2), so du = dy/√(1 − ρ^2).
Then
u^2 = (y^2 − 2ρxy + ρ^2 x^2) / (1 − ρ^2),
53
which is nearly what appears inside “exp”. Precisely:
fX(x) = ∫_{−∞}^{∞} (1/(2π)) exp[ −(u^2 + x^2)/2 ] du
      = (1/√(2π)) exp(−x^2/2) ∫_{−∞}^{∞} (1/√(2π)) exp(−u^2/2) du.
The integral is 1 (that of a standard normal density), so
fX(x) = (1/√(2π)) exp(−x^2/2):
that is, the marginal distribution of X is standard normal.
54
Conditioning and Independence
Marginal distribution: of one variable, ignorant about other.
But what if we knew X ; what then about distribution of Y ?
Example 1:
x \ y    0     1
  0      0    1/3
  1     2/3    0
Suppose X = 1. Then ignore 1st row.
55
But 2nd row is not a probability distribution (sum 2/3, not 1). Idea: divide
by the sum. Then if X = 1, P(Y = 0) = 1 and P(Y = 1) = 0: that
is, if X = 1, Y is certain to be 0. Called the conditional distribution of
Y given X = 1.
If X = 0, Y certain to be 1. Conditional distribution of Y different
for different X : Y depends on X .
Notation: as for conditional probability. Eg. above:
P (Y = 1|X = 0) = 1.
56
Example 2:
u \ v    0     1     2
  0     1/3   1/6   1/6
  1     1/6   1/12  1/12
Conditional distribution of V given U = 0? Use the U = 0 row. This
sums to 2/3, so divide by this to get P(V = 0|U = 0) = 1/2,
P(V = 1|U = 0) = 1/4, P(V = 2|U = 0) = 1/4.
U = 1 line sums to 1/3; conditional distribution of V given U = 1 is
same as given U = 0.
In example 2, does not matter what U is – conditional distribution of
V same. Say that V and U are independent.
57
Two examples give extreme cases. In Example 1, knowing X gave
Y with certainty; in example 2, knowing U said nothing about V .
Most cases in between: knowing one variable has some effect on
distribution of other.
Symbols:
P(Y = b|X = a) = P(X = a, Y = b) / ∑_y P(X = a, Y = y)
               = P(X = a, Y = b) / P(X = a).
Denominator is the marginal probability that X = a.
58
Conditioning on continuous random variables
Continuous case: no probabilities, so replace with density functions;
replace sum by integral. This gives the conditional density function:
fY|X(y|x) = fX,Y(x, y) / ∫_{−∞}^{∞} fX,Y(x, y) dy = fX,Y(x, y) / fX(x),
replacing the infinities by the actual limits for y. The denominator depends on x
only; it is the marginal density function for X.
Then use conditional density to evaluate conditional probabilities.
59
Example: fX,Y(x, y) = 4x^2 y + 2y^5 for x, y between 0 and 1, 0
otherwise. Find P(0.2 ≤ Y ≤ 0.3|X = 0.8).
Steps: find marginal density of X, use it to find conditional density of
Y given X, integrate conditional density to find probability.
Answers: marginal density of X is fX(x) = 2x^2 + 1/3 for
0 ≤ x ≤ 1, 0 otherwise. Conditional density of Y|X is
fY|X(y|x) = (4x^2 y + 2y^5) / (2x^2 + 1/3);
then integrate over 0.2 ≤ y ≤ 0.3 and put in x = 0.8 to get
P(0.2 ≤ Y ≤ 0.3|X = 0.8) = 0.0395.
60
Followup: what happens to P (0.2 ≤ Y ≤ 0.3) if X changes?
One answer: P (0.2 ≤ Y ≤ 0.3|X = 0.4) = 0.0242. So
probability does change as X changes; Y does depend on X .
However, change in probability quite small; dependence is not very
strong.
61
Law of total probability
Because
fY|X(y|x) = fX,Y(x, y) / ∫_{−∞}^{∞} fX,Y(x, y) dy = fX,Y(x, y) / fX(x),
it is also true that
fX,Y(x, y) = fX(x) fY|X(y|x).
62
So
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_c^d ∫_a^b fX,Y(x, y) dx dy
                          = ∫_c^d ∫_a^b fX(x) fY|X(y|x) dx dy.
In words: can find probabilities either using joint density or using a
marginal and a conditional density. Can use whichever easier.
63
Independence of random variables
Recall this joint distribution:
u \ v    0     1     2
  0     1/3   1/6   1/6
  1     1/6   1/12  1/12
Sum     1/2   1/4   1/4
Conditional distribution of V same given U = 0 and given U = 1.
Also same as marginal distribution of V . Knowing U says nothing
about V .
(Also, conditional dist. of U same for all V and same as marginal for
U .)
64
Suggests definition: random variables independent if conditional
distribution always same, and always same as marginal.
Mathematics: X, Y independent if
pY(y) = pY|X(y|x) = pX,Y(x, y) / pX(x),
so that
pX,Y(x, y) = pX(x) pY(y).
This is usually easiest check:
• if pX,Y (x, y) = pX(x)pY (y) for all x, y, then X, Y
independent.
• if pX,Y(x, y) ≠ pX(x)pY(y) for any one (x, y) pair, then
X, Y not independent.
65
For example above: P(U = 0) = 2/3, P(U = 1) = 1/3;
P(V = 0) = 1/2, P(V = 1) = P(V = 2) = 1/4. Also,
P(U = 0)P(V = 0) = (2/3)·(1/2) = 1/3 = P(U = 0, V = 0).
Repeat for all u and v: proves independence.
66
Compare this joint distribution:
x \ y    0     1
  0      0    1/3
  1     2/3    0
Now,
P(X = 0)P(Y = 0) = (1/3)·(2/3) = 2/9
and P(X = 0, Y = 0) = 0 ≠ 2/9. One calculation shows X, Y
not independent.
not independent.
67
Independence of continuous random variables
As usual, turn probability into density. If
fX,Y (x, y) = fX(x)fY (y)
for all x, y, then continuous random variables X, Y independent. If
it fails for any (x, y) pair, not independent.
Example: suppose fX(x) = 2x^2 + 1/3, fY(y) = (4/3)y + 2y^5,
fX,Y(x, y) = 4x^2 y + 2y^5 for 0 ≤ x, y ≤ 1. Then
fX(x) fY(y) = (2x^2 + 1/3)((4/3)y + 2y^5),
which cannot be simplified to fX,Y(x, y). So X, Y not
independent.
68
Order statistics
Suppose that X1, X2, . . . , Xn all, independently, have same
distribution (a sample from distribution). Suppose common cdf
FX(x).
For example: take 20 people, give each IQ test. Without knowing
about individuals, use same distribution for each. What might
highest score in sample be?
Idea: more people sampled, higher the highest score could be (get
more chances to see a very high score).
69
Let M = max(X1, X2, . . . , Xn). Then
P(M ≤ m) = P(X1 ≤ m, X2 ≤ m, . . . , Xn ≤ m)
          = P(X1 ≤ m) P(X2 ≤ m) · · · P(Xn ≤ m)
          = [FX(m)]^n.
If X continuous, differentiate to get density.
Example: each Xi ∼ Uniform[0, 1]. Then FX(x) = x, so
P(M ≤ m) = m^n.
If n = 5, P(M ≤ 0.9) = 0.9^5 = 0.590; if n = 20,
P(M ≤ 0.9) = 0.9^20 = 0.1216, much smaller. That is, with
more observations, the maximum is likely to be higher (less likely to
be low).
70
Similar idea for minimum: let K = min(X1, X2, . . . , Xn). Then
P(K ≤ k) = 1 − P(K > k)
          = 1 − P(X1 > k, X2 > k, . . . , Xn > k)
          = 1 − P(X1 > k) P(X2 > k) · · · P(Xn > k)
          = 1 − (1 − FX(k))^n.
Example: if n = 10, Xi ∼ Uniform[0, 1], then
P(K ≤ 0.2) = 1 − (1 − 0.2)^10 = 0.8926.
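A simulation sketch (assuming numpy) checking the max/min formulas above:

import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(size=(100_000, 20))          # 100,000 samples of n = 20 uniforms

print(np.mean(u.max(axis=1) <= 0.9))         # about 0.9**20 = 0.1216
u10 = u[:, :10]                              # first 10 columns: n = 10
print(np.mean(u10.min(axis=1) <= 0.2))       # about 1 - 0.8**10 = 0.8926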
71
Simulating probability distributions
So far, considered mathematical properties of distributions:
probabilities, densities, cdf’s etc. But some distributions difficult to
understand or use.
Alternative: generate random values from the distribution. Uses:
• approximation of difficult-to-calculate quantities
• simulation of complex systems
• generating potential solutions for difficult problems
• random choices for quizzes, computer games
• understanding behaviour of samples (chapter 4)
72
Pseudo-random numbers
In practice, don’t get actual random numbers, but pseudo-random
numbers. These follow recipe, but look random. (Paradox?)
Not so bad, because crucial feature: unpredictable – cannot easily
say what comes next.
Typical method: multiplicative congruential generator. Start with
initial “seed” value R0, then, for n = 0, 1, . . .:
Rn+1 = 106Rn + 1283 (mod 6075)
(“take remainder on division by 6075”).
73
Eg. start with R0 = 1001:
R1 = 106(1001) + 1283 = 107389 ≡ 4114 (mod 6075)
R2 = 106(4114) + 1283 = 437367 ≡ 6042 (mod 6075)
R3 = 106(6042) + 1283 = 641735 ≡ 3860 (mod 6075)
and so on, with 0 ≤ Ri < 6075.
Gives up to 6075 different random integers before repeating itself.
Suitable choice of constants gives long “period” and unpredictable
sequence. (Number theory.)
In practice, use much larger constants – get many more possible
random numbers.
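The generator on the slide, written out as a small Python sketch:

def lcg(seed, n, a=106, c=1283, m=6075):
    """Return n pseudo-random integers R_1, R_2, ... with R_{i+1} = (a*R_i + c) mod m."""
    r = seed
    out = []
    for _ in range(n):
        r = (a * r + c) % m
        out.append(r)
    return out

print(lcg(1001, 3))   # [4114, 6042, 3860], as on the slide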
74
Continuous uniform on [0, 1]
To get (pseudo-) random values from Uniform[0, 1], take
pseudo-random integers and divide by maximum. Result has
approx. uniform distribution.
With generator above, max value is 6075, so random uniform values
are 4114/6075 = 0.677, 6042/6075 = 0.995,
3860/6075 = 0.635. (Only 6075 possible values, so only 3 or so
digits trustworthy.)
“Random numbers” in calculators, Excel etc. of this kind.
Random Uniform[0, 1] values are used as building block for
random values from other distributions. Eg. random
Y ∼ Uniform[0, b]: multiply a random Uniform[0, 1] by b.
75
Bernoulli distribution
Suppose we want to simulate X ∼ Bernoulli(0.4): single trial,
prob. 0.4 of success.
Take single random uniform U . If U ≤ 0.4, take X = 1 (success),
otherwise take X = 0 (failure).
Works because U ≤ 0.4 about 0.4 of the time, so will get
successes about 0.4 of the time (long run).
In general, for X ∼ Bernoulli(θ), take X = 1 if U ≤ θ, 0
otherwise.
76
Binomial and geometric distributions
If Y ∼ Binomial(n, θ), Y = X1 + X2 + · · ·+ Xn where
Xi ∼ Bernoulli(θ). So just generate n random Bernoullis and
add them up.
Similarly, if Z ∼ Geometric(θ), Z is number of failures (in
Bernoulli trials) before 1st success. So get random value of Z like
this:
1. set Z = 0
2. generate U from Uniform[0, 1]
3. if U ≤ θ, stop with current Z
4. otherwise, add 1 to Z and return to step 2.
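The two recipes above, written out as a sketch (assuming numpy only for the uniform values):

import numpy as np
rng = np.random.default_rng(2)

def rbinomial(n, theta):
    """Sum of n Bernoulli(theta) trials."""
    return sum(1 for _ in range(n) if rng.uniform() <= theta)

def rgeometric(theta):
    """Number of failures before the first success."""
    z = 0
    while rng.uniform() > theta:   # U > theta means failure; add 1 and try again
        z += 1
    return z

print(rbinomial(5, 0.4), rgeometric(0.4))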
77
Inverse-CDF method
Cdf F (x) = P (X ≤ x) defined for all x.
Also, in set of possible X-values (where f(x) > 0), F (x)
invertible: for any p, exactly one x where F (x) = p.
Example: X ∼ Exponential(λ). Then F(x) = 1 − e^(−λx). For
x > 0, write p = F(x), and solve for x to get
x = −(1/λ) ln(1 − p).
Then generate a random p from Uniform[0, 1], and put it in the
formula to get a random X.
78
For instance, if λ = 2, might have p = 0.7 and hence random X is
−(1/2) ln(1 − 0.7) = 0.602.
Why does this work in general?
Let Y be any random variable; let F(y) = P(Y ≤ y) be the cdf of Y.
Define random variable W = F(Y). Then
P(W ≤ w) = P(F(Y) ≤ w) = P(Y ≤ F^(−1)(w)) = F(F^(−1)(w)) = w.
That is, W ∼ Uniform[0, 1] whatever the distribution of Y.
79
So: to simulate Y , simulate W , then use relationship
Y = F−1(W ) to simulate Y (by using simulated uniform in place
of W ).
This was done above for exponential. Called inverse-CDF method.
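Inverse-CDF sampling for the exponential, as above, in a sketch (assuming numpy):

import numpy as np
rng = np.random.default_rng(3)

lam = 2.0
p = rng.uniform(size=100_000)
x = -np.log(1 - p) / lam        # x = -(1/lambda) ln(1 - p)

print(x.mean())                 # about 1/lambda = 0.5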
80
Also works for discrete. Example: Poisson(0.7) has this cdf:
x 0 1 2 3 4
P (X ≤ x) 0.497 0.844 0.966 0.994 0.999
Procedure: get random U ∼ Uniform[0, 1]. If U ≤ 0.497, take
random X = 0; else if U ≤ 0.844, take X = 1, . . . , else if
U > 0.999, take X = 5.
(Higher values possible, but very unlikely; for more accuracy use
more digits.)
81
Normal distribution
Difficult to simulate from (cannot invert cdf).
But consider X , Y with bivariate standard normal distribution,
correlation 0. Joint density is
fX,Y(x, y) = (1/(2π)) exp{ −(x^2 + y^2)/2 }.
Thinking of (x, y) as a point in R^2, note that the density depends only on
the distance from the origin (r^2 = x^2 + y^2), not on the angle.
So generate random (x, y) pair by generating random angle
θ ∼ Uniform[0, 2π], random distance, separately.
(details: 2-variable transformation using Jacobian determinant.)
82
Density function for distance R is
fR(r) = r e^(−r^2/2)
and cdf is
FR(r) = ∫_0^r t e^(−t^2/2) dt = 1 − e^(−r^2/2)
(eg. use substitution u = t^2/2, du = t dt).
FR(r) invertible; let p = FR(r), solve for r to get
r = √(−2 ln(1 − p)).
Get random R by taking U ∼ Uniform[0, 1], using it for p above.
83
Finally, convert random R, θ to (X, Y ) using polar coordinate
formulas
X = R cos θ; Y = R sin θ.
Example: suppose random θ = 1.8 (radians), U = 0.3. Then
R = √(−2 ln(1 − 0.3)) = 0.8446. So
X = 0.8446 cos 1.8 = −0.19; Y = 0.8446 sin 1.8 = 0.82.
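The polar construction above in a sketch (assuming numpy):

import numpy as np
rng = np.random.default_rng(4)

n = 100_000
theta = rng.uniform(0, 2 * np.pi, n)
r = np.sqrt(-2 * np.log(1 - rng.uniform(size=n)))
x, y = r * np.cos(theta), r * np.sin(theta)

print(x.mean(), x.std(), y.std())   # roughly 0, 1, 1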
84
Rejection methods
Inverse-CDF method doesn’t always work – cdf can be too
complicated to invert. Example: X ∼ Gamma(3, 1), with density
function
f(x) = (x^2 / 2) e^(−x).
This has maximum 2e^(−2) = 0.2707 at x = 2. Density “small”
beyond x = 10.
85
Idea: sample random point (X, Y) in a rectangle enclosing f(x),
with 0 ≤ X ≤ 10, 0 ≤ Y ≤ 2e^(−2) (using uniform distributions):
• if point below density function (Y ≤ f(X)), take X as random
value from distribution
• otherwise, reject (X, Y ) pair and try again.
Chance of X-value being accepted proportional to density f(X):
when value more likely in distribution, more likely to be accepted.
86
Example:
X 7.3 1.0 2.7 1.7 9.4 5.5
Y 0.206 0.130 0.023 0.256 0.197 0.203
f(X) 0.018 0.184 0.245 0.264 0.004 0.062
reject y n n n y y
Values 7.3, 9.4, 5.5 rejected; 1.0, 2.7, 1.7 random values from
Gamma(3, 1).
Needed 12 random uniforms to generate 3 random gammas.
87
Can be made more sophisticated. Let g(x) be a density function that
is easy to sample from, such that f(x) ≤ cg(x) for all x (choose
c). Above, g(x) = 1, c = 2e^(−2).
Generate random value X from distribution with density g(x).
Generate random Y ∼ Uniform[0, cg(X)]. If Y ≤ f(X),
accept X ; otherwise, reject and try again.
Efficiency of rejection method greatest when cg(x) only slightly
greater than f(x); then, very little rejection.
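A rejection-sampling sketch for Gamma(3, 1) on [0, 10], as described above (assuming numpy; the cutoff at 10 ignores a tiny amount of tail probability, and the function name is made up for illustration):

import numpy as np
rng = np.random.default_rng(5)

f = lambda x: x**2 / 2 * np.exp(-x)       # Gamma(3, 1) density
top = 2 * np.exp(-2)                      # maximum of f, at x = 2

def rgamma31():
    while True:
        x = rng.uniform(0, 10)            # proposal
        y = rng.uniform(0, top)
        if y <= f(x):                     # accept if the point falls under the density
            return x

sample = [rgamma31() for _ in range(50_000)]
print(np.mean(sample))                    # close to 3, the Gamma(3, 1) mean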
88
Simulation in Minitab
Minitab can generate random values from many distributions (using
methods above or variations).
Basic procedure:
• Select Calc, Random Data
• Select desired distribution
• Fill in number of random values to generate
• Fill in (empty) column to store values
• Fill in parameters of distribution (if any)
• Click OK.
89
Examples: Uniform[0, 1], Bernoulli(0.4), Binomial(5, 0.4),
Exponential(2), Poisson(0.7), Normal(0, 1).
To generate random values from another distribution, generate
column of values from Uniform[0, 1], then use Calculator to create
desired values (p. 47–48 of manual).
Recall random values actually “pseudo-random”: starting at same
seed value gives same sequence of random values. Can set seed
value in Minitab (Calc, Set Base) to get reproducible random values.
90
Expectation
91
Introduction
Game: toss fair coin, win $2 for a head, lose $1 for a tail.
Amount you win is random variable W with
P(W = 2) = P(W = −1) = 1/2.
Could win or lose on any one play, but (a) winning and losing equally
likely, (b) amount won greater than amount lost.
Would probably play this game given chance, because expect to win
in long run, on average over many plays, even though anything
possible.
92
Expected value of a random variable is its long-run average. For W
above, expect equal numbers of 2's and −1's, so expected value
would be
E(W) = (2 + (−1))/2 = 1/2.
Another: suppose Y = 7 always (ie. P(Y = 7) = 1,
P(Y = k) = 0 for k ≠ 7). Then E(Y) should be 7.
Another: roll 2 dice. Win $30 for double 6, lose $1 otherwise. Looks
good because potential win greater than potential loss, but win very
unlikely. How to balance? For winnings random variable V , what is
E(V )?
93
Expectation for discrete random variables
Define expected value (expectation) of random variable X:
E(X) = ∑_x x P(X = x),
“sum of value times probability”. Sum over all possible x.
Check for above examples:
E(W) = 2 · (1/2) + (−1) · (1/2) = 1/2
E(Y) = 7 · 1 = 7
E(V) = 30 · (1/36) + (−1) · (35/36) = −5/36
94
First 2 as expected.
For V, prob. of double 6 is 1/36, so chance of losing is 1 − 1/36 = 35/36. Even
though the prize is large (win $30 for double 6), E(V) < 0, so would lose
in the long run: the win probability is small enough to outweigh the large prize.
Formula much easier than reasoning out – less thought!
Now suppose X ∼ Bernoulli(θ). What is E(X)?
X = 1 with prob θ, 0 with prob 1− θ, so:
E(X) = 1 · θ + 0 · (1− θ) = θ.
In long run, average X equal to success probability.
Makes sense (think of θ = 0 and θ = 1 as extreme cases).
95
Expectation for geometric and Poisson distributions
To find more complicated expectations, cleverness can be needed
to figure out sum.
Suppose Z ∼ Geometric(θ), so P(Z = k) = θ(1 − θ)^k. Then
E(Z) = ∑_{k=0}^{∞} k θ(1 − θ)^k = (1 − θ)/θ.
Method: write (1 − θ)E(Z) to look like E(Z) but with k − 1 in
place of k, then subtract.
Mean is the odds against success: if failure is 4 times more likely than
success, on average get 4 failures before the 1st success.
96
If X ∼ Poisson(λ), then
E(X) = ∑_{k=0}^{∞} k · e^(−λ) λ^k / k!.
Note that the k = 0 term is 0, so start the sum at k = 1, then let
l = k − 1 to get
E(X) = λ ∑_{l=0}^{∞} e^(−λ) λ^l / l!.
The sum is of all the probabilities from a Poisson distribution, so is
1. (Or, ∑_{l=0}^{∞} (λ^l / l!) is the Maclaurin series for e^λ.)
So for X ∼ Poisson(λ), E(X) = λ. Thus parameter λ is in fact the
mean.
97
St Petersburg Paradox
Game: toss a fair coin, let Z be the number of tails before the 1st head. Win 2^Z
dollars. Thus for TTTH, win 2^3 = $8. Expected winnings (fair price
to pay to play)?
∑_{k=0}^{∞} 2^k · (1/2)^k · (1/2) = ∑_{k=0}^{∞} 1/2 = ∞.
How can this be? Only ever win finite amount.
Play game 10 times:
Z 0 1 0 0 3 0 3 0 6 1
Winnings 1 2 1 1 8 1 8 1 64 2
Mean winnings $8.90, larger than actual winnings 90% of time!
98
Problem is that any one big payoff completely dominates average,
and by playing game enough times, can make it very likely that a
very big payoff will occur.
If there is a maximum payoff, say $2^30, expectation is finite (about $15.50).
When random variable can be arbitrarily large, expectation may not
be finite. But can be finite – compare Poisson, where probabilities
decrease faster than values increase. Similarly, lotteries with very
big prizes still have expected winnings less than ticket price
(because chance of winning big prize small enough).
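A simulation sketch of the game (assuming numpy): the sample mean keeps getting dragged upward by rare huge payoffs and never settles down.

import numpy as np
rng = np.random.default_rng(6)

def one_play():
    z = 0
    while rng.uniform() < 0.5:   # tail: keep flipping
        z += 1
    return 2 ** z                # win 2^Z dollars

for n in (10, 1_000, 100_000):
    wins = [one_play() for _ in range(n)]
    print(n, np.mean(wins))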
99
Utility and Kelly betting
In St Petersburg paradox, expectation didn’t tell story, because “fair
price” ought to be finite. Changing game by a little changed
expected winnings a lot.
Most bets look like this: win known $w if you win, lose $1 if you lose.
Suppose probability of winning is θ. Then expectation is
E = wθ + (−1)(1− θ) = θ(w + 1)− 1
which is positive if θ > 1/(w + 1).
For instance, if w = 2, E > 0 if θ > 1/3. That is, if you believe
your chance of winning is better than 1/3, you should bet, because in the
long run you win more than you lose.
100
If bet more than $1, wins and losses increase in proportion: on bet
of $b, win $wb or lose $b.
Positive expectation seems to say “bet everything you have”: far too
risky for most! Always possibility of losing.
Idea: consider utility of money, not same as money itself. If you
only have $10, $1 is a lot of money (has great utility), but if you have
$1 million, $1 almost meaningless.
Utility of money varies between people, but could be proportional to
current fortune. Then, utility of money depends on log of $ amount.
101
Suppose we currently have $c, and want to choose b for bet above,
assuming all else known. Then fortune after the bet is F = c + bw
if we win (prob θ), F = c− b if we lose (prob 1− θ). Utility idea:
choose b to maximize E(ln F ):
E(ln F ) = θ ln(c + bw) + (1− θ) ln(c− b).
Take derivative (with respect to b), set to 0:
dE(ln F)/db = w·θ/(c + bw) + (−1)·(1 − θ)/(c − b)
            = [θw(c − b) − (1 − θ)(c + bw)] / [(c + bw)(c − b)].
Zero when numerator zero; solve for b to get
b = c{θ(w + 1) − 1}/w = cE/w.
This is called the Kelly bet. (If negative, don’t bet anything!)
102
Examples, with c = 100:
• w = 9, θ = 1/8. E = θ(w + 1) − 1 = 0.25, so Kelly bet
b = 100(0.25)/9 = $2.78.
• w = 1.5, θ = 1/2. E = 0.25 again; Kelly bet
b = 100(0.25)/1.5 = $16.67.
Note: expected winnings same in both cases, but bet less when
w = 9: more risk because less likely to win.
In general, bet fraction of current fortune that is bigger when
expected winnings bigger and chance of winning bigger.
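The Kelly formula b = cE/w from the slide, as a small sketch (function name made up for illustration):

def kelly_bet(c, w, theta):
    """Bet size maximizing E(ln F) for current fortune c, payout w per $1, win prob theta."""
    E = theta * (w + 1) - 1          # expected winnings per $1 bet
    return max(0.0, c * E / w)       # never bet if the expectation is negative

print(kelly_bet(100, 9, 1/8))        # 2.78
print(kelly_bet(100, 1.5, 1/2))      # 16.67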
103
Expectation of functions of random variables
In the St Petersburg problem above, the random variable was the number of
tails Z, but winnings 2^Z. In effect, found that E(2^Z) was infinite.
Method: sum values of 2^Z times probability.
Formally: let g(X) be some function of random variable X. Then
E(g(X)) = ∑_x g(x) P(X = x).
104
Linearity of expected values
Suppose we have two random variables X, Y . What is
E(X + Y )?
Go back to definition, bearing in mind that X,Y might be related,
so have to use joint probability function:
E(X + Y) = ∑_x ∑_y (x + y) P(X = x, Y = y)
         = ∑_x x P(X = x) + ∑_y y P(Y = y)
         = E(X) + E(Y).
Details: expand out (x + y) in the first sum, recognize (eg.) that
∑_y P(X = x, Y = y) = P(X = x) (marginal distribution).
105
Same logic shows that E(aX + bY ) = aE(X) + bE(Y ).
Likewise,
E(X1 + X2 + · · ·+ Xn) = E(X1) + E(X2) + · · ·+ E(Xn).
Also, if Y = 1 always, we get E(aX + b) = aE(X) + b.
106
Expectation for binomial distribution
If Y ∼ Binomial(n, θ), then Y actually sum of Bernoullis:
Y = X1 + X2 + · · ·+ Xn, where Xi ∼ Bernoulli(θ).
Know that E(Xi) = θ, so (by result on previous page)
E(Y ) = θ + θ + · · ·+ θ = nθ.
Makes sense: eg. if you succeed on one-third of trials on average
(θ = 1/3), and you have n = 30 trials, you'd expect 10 successes,
and nθ = 10.
107
Independence and E(XY )
Since E(X + Y ) = E(X) + E(Y ) for all X and Y , tempting to
claim that E(XY ) = E(X)E(Y ). But is this true?
Consider this joint distribution:
        Y = 1   Y = 2   Total
X = 0    1/3     1/6     1/2
X = 1    1/4     1/4     1/2
Total    7/12    5/12     1
Using the marginal distributions, E(X) = 1/2 and E(Y) = 17/12. What is
E(XY)?
108
When X = 0, XY = 0 for all Y. So P(XY = 0) = 1/3 + 1/6 = 1/2.
XY = 1 when X = 1, Y = 1, so P(XY = 1) = 1/4. Likewise,
XY = 2 when X = 1, Y = 2, so P(XY = 2) = 1/4. Hence
E(XY) = 0 · (1/2) + 1 · (1/4) + 2 · (1/4) = 3/4.
But
E(X)E(Y) = (1/2) · (17/12) = 17/24 ≠ 3/4.
So E(XY) ≠ E(X)E(Y) in general.
109
But what if X,Y independent? Then
E(XY) = ∑_x ∑_y x y P(X = x) P(Y = y) = E(X)E(Y),
rearranging, because the joint prob is the product of the marginals.
So, if X, Y independent, then E(XY ) = E(X)E(Y ), but not
necessarily otherwise.
See later (in “covariance”) that difference E(XY )− E(X)E(Y )
measures extent of non-independence of X and Y .
110
Monotonicity of expectation
Suppose X, Y discrete random variables such that X ≤ Y . (That
is, for any event giving X = x and Y = y, x ≤ y always.
Example: roll 2 dice, let X be score on 1st die, Y be total score on 2
dice.)
How do E(X), E(Y ) compare?
Idea: let Z = Y − X. Then Z ≥ 0, discrete, and
E(Z) = ∑_{z≥0} z P(Z = z). All terms in the sum are positive or 0, so
E(Z) ≥ 0. But E(Z) = E(Y − X) = E(Y) − E(X). Hence
E(Y) − E(X) ≥ 0.
Conclusion: if X ≤ Y , then E(X) ≤ E(Y ).
111
Expectation for continuous random
variables
Can't use formula
E(X) = ∑_x x P(X = x)
because the probability of a particular value is not meaningful for continuous
X.
because probability of particular value not meaningful for continuous
X .
Standard procedure: replace probability by density function, replace
sum by integral.
112
That is, if X is a continuous random variable, define
E(X) = ∫_{−∞}^{∞} x f(x) dx.
In the integral, replace the infinite limits by the actual upper and lower limits.
113
Examples
Suppose X ∼ Uniform[0, 1], so f(x) = 1, 0 ≤ x ≤ 1. Then
E(X) = ∫_0^1 x · 1 dx = [x^2/2]_0^1 = 1/2.
As you would have guessed.
Suppose W ∼ Exponential(λ). Then
E(W) = ∫_0^∞ w λ e^(−λw) dw.
Integrate by parts with u = w, v′ = λe^(−λw): E(W) = 1/λ.
If W represents time between events, E(W) is in units of time, so λ is
in units of 1/time: a rate, the number of events per unit time.
114
Suppose Z ∼ N(0, 1), so f(z) = (1/√(2π)) e^(−z^2/2). Then
E(Z) = ∫_{−∞}^{∞} (1/√(2π)) z e^(−z^2/2) dz.
Replacing z by −z gives the negative of the function in the integral, ie. z f(z) is
an odd function. Hence the integral is 0, so E(Z) = 0. (Alternative:
substitute u = z^2/2.)
115
As for discrete, expectation may not be finite.
f(x) = 1/x^2, x ≥ 1 is a proper density, but for a random variable X
with this distribution:
E(X) = ∫_1^∞ x · (1/x^2) dx = ∫_1^∞ (1/x) dx = [ln x]_1^∞ = ∞.
Problem: though the density decreases as x increases, it does not do so
fast enough to make the E(X) integral converge.
116
Properties of expectation for continuous random
variables
These are same as for discrete variables. Proofs use integrals and
densities not sums, but otherwise very similar. Suppose X has
density fX(x) and X,Y have joint density fX,Y (x, y):
• E(g(X)) = ∫_{−∞}^{∞} g(x) fX(x) dx
• E(h(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) fX,Y(x, y) dx dy
• E(aX + bY) = aE(X) + bE(Y)
• If X, Y independent, then E(XY) = E(X)E(Y)
• If X ≤ Y, then E(X) ≤ E(Y).
117
Expectations for general uniform and normal
distributions
Suppose X ∼ Uniform[a, b]. Then
U = (X − a)/(b − a) ∼ Uniform[0, 1], so E(U) = 1/2.
Write in terms of X: X = a + (b − a)U, so
E(X) = a + (b − a)E(U) = (a + b)/2. Again as expected.
Now suppose X ∼ Normal(µ, σ^2). Then
Z = (X − µ)/σ ∼ N(0, 1). Write X = µ + σZ; then
E(X) = µ + σE(Z) = µ + σ(0) = µ.
That is, parameter µ in normal distribution is the mean.
118
Variance, covariance and correlation
Compare random variables:
Z = 10 with prob 1; Y = 5 or 15, each with prob 1/2.
E(Z) = E(Y) = 10, but Y is further from the mean than Z.
Expectation only gives the long-run average of a random variable, not how
much higher/lower than average it could be. For this, use variance:
Var(X) = E[(X − µX)^2], where µX = E(X).
119
For discrete X, Var(X) = ∑_x (x − µX)^2 P(X = x). So:
Var(Z) = (10 − 10)^2 · 1 = 0;
Var(Y) = (5 − 10)^2 · (1/2) + (15 − 10)^2 · (1/2) = 25.
Here, Var(Y ) > Var(Z) because Y tends to be further from its
mean than Z does.
(Here, Y always further from mean than Z . But in general,
Var(Y ) > Var(Z) means Y likely to be further from mean than
Z .)
120
More about variance
Because (X − µX)2 ≥ 0, Var(X) ≥ 0 for all random variables
X .
Var(X) = 0 only if X does not vary (compare Z). No upper limit
on variance; larger variance means more unpredictable (can get
further from mean).
Why square? Cannot just omit: E(X − µX) = E(X)− µX = 0
always. Absolute value E(|X − µX |) possible, but hard to work
with (not differentiable).
121
Standard deviation
If random variable X in metres, Var(X) in metres-squared. For
interpretation, suggests using square root of variance:
SD(X) = √(Var(X)),
which would be in metres. Called the standard deviation of X.
SD easier for interpretation, variance easier for algebra.
122
Variance of Bernoulli
If X ∼ Bernoulli(θ), E(X) = θ, and
Var(X) = ∑_x (x − θ)^2 P(X = x)
       = (1 − θ)^2 θ + (0 − θ)^2 (1 − θ)
       = θ(1 − θ)(1 − θ + θ) = θ(1 − θ).
This is 0 if θ = 0 or 1 (when results are completely predictable) and
maximum, 1/4, when θ = 1/2.
123
Useful properties of variance
Var(aX + b) = a^2 Var(X).
Because variance in squared units, changing X eg. from metres to
feet multiplies variance not by 3.3 but by that squared.
Also, adding b changes mean of X , but doesn’t change how spread
out distribution is (shifts left/right).
Var(X) = E(X^2) − µX^2.
Useful result for finding variances in practice, since E(X^2) is not
usually too hard.
124
Proofs: use definition of variance as expectation, then rules of
expectation.
Bernoulli revisited: E(X^2) = 1^2·θ + 0^2·(1 − θ) = θ, so
Var(X) = θ − θ^2 = θ(1 − θ) as before.
125
Variance of exponential distribution
For continuous distributions, find E(X2) or variance using integral.
W ∼ Exponential(λ): already know E(W ) = 1/λ. Find
Var(W) by first finding E(W^2), using integration by parts:
E(W^2) = ∫_0^∞ w^2 λ e^(−λw) dw = [−w^2 e^(−λw)]_0^∞ + (2/λ) ∫_0^∞ w λ e^(−λw) dw.
The square bracket is 0; the integral is E(W) = 1/λ. Hence
E(W^2) = (2/λ)(1/λ) = 2/λ^2, and
Var(W) = 2/λ^2 − (1/λ)^2 = 1/λ^2.
For exponential distribution, variance is square of mean.
126
Variance of normal random variable
Suppose Z ∼ N(0, 1). Know that E(Z) = 0, so
Var(Z) = E(Z^2) − 0^2 = E(Z^2). Thus
Var(Z) = ∫_{−∞}^{∞} z^2 (1/√(2π)) e^(−z^2/2) dz.
To tackle by parts: let u = z/√(2π), v′ = z e^(−z^2/2). v′ has
antiderivative v = −e^(−z^2/2). Gives
127
Var(Z) = [−(z/√(2π)) e^(−z^2/2)]_{−∞}^{∞} + ∫_{−∞}^{∞} (1/√(2π)) e^(−z^2/2) dz.
The square bracket is 0 (e^(−z^2/2) → 0 very fast); the integral is that of the density of
Z, so is 1. Hence Var(Z) = 1.
Suppose now X ∼ N(µ, σ^2). Then Z = (X − µ)/σ, so
X = µ + σZ. So Var(X) = σ^2 Var(Z) = σ^2. That is,
parameter σ^2 in the normal distribution is the variance.
128
Covariance
Consider discrete joint distribution:
Y = 1 Y = 2 sum
X = 0 0.4 0.2 0.6
X = 1 0.1 0.3 0.4
sum 0.5 0.5
If X = 0, Y more likely to be small; if X = 1, Y more likely to be
large. X, Y vary together.
Idea: covariance Cov(X, Y ) = E[(X − µX)(Y − µY )].
129
Here, µX = E(X) = 0.4, µY = E(Y) = 1.5, so take all
combinations of (X − µX, Y − µY) values and their probs:
Cov(X, Y) = (0 − 0.4)(1 − 1.5)(0.4) + (0 − 0.4)(2 − 1.5)(0.2)
          + (1 − 0.4)(1 − 1.5)(0.1) + (1 − 0.4)(2 − 1.5)(0.3)
          = 0.08 − 0.04 − 0.03 + 0.09 = 0.10.
Result positive. (X, Y ) combinations where (X − µX)(Y − µY )
positive outweigh those where negative. That is, when X large, Y
more likely to be large as well (and small with small).
Covariance can be negative: then large X goes with small Y and
vice versa. Covariance 0: no trend.
130
Calculating covariances
Useful formula:
Cov(X, Y ) = E(XY )− E(X)E(Y ).
Proof: definition of covariance, properties of expectation.
Previous example revisited:
E(XY ) = (0)(1)(0.4)+(0)(2)(0.2)+(1)(1)(0.1)+(1)(2)(0.3) = 0.7;
Cov(X, Y ) = 0.7− (0.4)(1.5) = 0.1.
As with corresponding variance formula, useful for calculations.
131
Covariance and independence
If X,Y independent, then E(XY ) = E(X)E(Y ), so
Cov(X, Y ) = E(XY )− E(X)E(Y ) = 0.
But covariance could be 0 without independence. Example:
(X, Y) = (−1, 1), (0, 0), (1, 1), each with prob 1/3. E(X) = 0,
E(Y) = 2/3, E(XY) = (−1)(1/3) + (0)(1/3) + (1)(1/3) = 0, so
Cov(X, Y) = 0 − (0)(2/3) = 0. But X, Y not independent: given
X, know Y exactly.
Relationship between X, Y not a trend: as X increases, Y
decreases then increases. No general statement about Y
large/small as X increases.
Fact: if X,Y bivariate normal, covariance 0 implies independence.
132
Variance of sum
Previously found that E(X + Y ) = E(X) + E(Y ) for all X,Y .
Corresponding formula for variances?
Derive formula for Var(X + Y ) by writing as expectation,
expanding out square, recognizing terms:
Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X,Y ).
Logic: if Cov(X,Y ) > 0, X, Y big/small together, sum could be
very big/small, variance large. If Cov(X, Y ) < 0, large X
compensates small Y and vice versa, sum of moderate size,
variance small.
If X,Y independent, then Var(X + Y ) = Var(X) + Var(Y ).
133
Variance of binomial distribution
Suppose X ∼ Binomial(n, θ). Then can write
X = Y1 + Y2 + · · ·+ Yn,
where Yi ∼ Bernoulli(θ) independently. So
Var(X) = Var(Y1) + Var(Y2) + · · ·+ Var(Yn)
= θ(1− θ) + θ(1− θ) + · · ·+ θ(1− θ)
= nθ(1− θ).
Variance increases as n increases (fixed θ) because range of
possible #successes becomes wider.
134
Correlation
Covariance hard to interpret. Eg. size of positive correlation says
little about X,Y relationship.
Suppose X height (metres), Y weight (kg). Units of covariance m
× kg. Measure height in inches, weight in lbs: covariance in
different units.
Try for scale-free quantity. Covariance measures how X, Y vary
together: suggests use of variances. Var(X) is in m^2, Var(Y) in kg^2, so the
right scaling is by the square root of each. Define correlation:
Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).
135
Example: (X, Y) = (0, 1), (1, 3), each with prob 1/2.
E(X) = 0.5, E(Y) = 2; XY = 0 or 3, each with prob 1/2, so
Cov(X, Y) = 3/2 − (0.5)(2) = 1/2.
Also, Var(X) = 1/4, Var(Y) = 1, so
Corr(X, Y) = (1/2) / √((1/4)(1)) = 1.
When X larger (1 vs. 0), Y also larger (3 vs. 1) for certain: a perfect
trend. So this should be largest possible correlation.
(Proof later: Cauchy-Schwartz inequality.)
136
More about correlation
Smallest possible correlation is −1, when larger X always goes
with smaller Y (eg. (X, Y) = (0, 1), (1, −3), each with prob 1/2).
If X,Y independent, covariance 0, so correlation 0 also.
In-between values represent in-between trends. Eg.
Corr(X, Y ) = 0.5: larger X with larger Y most of the time, but
not always.
Correlation actually measures extent of linear relationship between
random variables. X, Y in example related by Y = 2X + 1.
Perfect nonlinear relationship won’t give correlation±1.
137
Viewing correlation by simulation
Useful to have sense of what correlation “looks like”.
Generate random normals with required correlation, plot.
Suppose X, Y ∼ N(0, 1) independently. Then use X and
Z = αX + Y for a suitable choice of α: correlated if α ≠ 0
because X appears in both. Can show Cov(X, αX + Y) = α and
Corr(X, αX + Y) = α/√(1 + α^2).
Choose α to get desired correlation ρ: α = ±ρ/√(1 − ρ^2).
138
Correlation 0.95:
[scatterplot of z against x]
139
Correlation -0.8:
[scatterplot of z against x]
140
Correlation 0.5:
[scatterplot of z against x]
141
Correlation -0.2:
[scatterplot of z against x]
142
Moment-generating functions
Means and variances (and eg. E(X3)) can be messy: each one
needs an integral (sum) to be solved. Would be nice to have function
that gives E(Xk) more easily than by integration (summing).
Consider mX(s) = E(e^(sX)). Function of s.
Maclaurin series for the exp function:
mX(s) = E(1) + sE(X) + (s^2/2!)E(X^2) + (s^3/3!)E(X^3) + · · · .
143
Differentiate both sides (as a function of s):
m′X(s) = E(X) + sE(X^2) + (s^2/2!)E(X^3) + · · ·
Putting s = 0 gives m′X(0) = E(X). Differentiate again:
m′′X(s) = E(X^2) + sE(X^3) + · · ·
so that m′′X(0) = E(X^2).
By the same process, find E(X^k) by differentiating mX(s) k times,
and setting s = 0. Differentiating is easier than integrating!
E(X^k) is called the k-th moment of the distribution of X; the function mX(s),
used to get moments, is called the moment generating function for X.
144
If X discrete,
mX(s) = E(e^(sX)) = ∑_x e^(sx) P(X = x),
and if X continuous,
mX(s) = E(e^(sX)) = ∫_{−∞}^{∞} e^(sx) fX(x) dx.
145
Examples of moment generating functions
Bernoulli is easiest of all:
mX(s) = e^(s·0) P(X = 0) + e^(s·1) P(X = 1) = 1 − θ + θe^s.
So:
m′X(s) = θe^s ⇒ E(X) = θ
m′′X(s) = θe^s ⇒ E(X^2) = θ
and indeed E(X^k) = θ for all k. Also,
Var(X) = E(X^2) − [E(X)]^2 = θ − θ^2 = θ(1 − θ).
146
Now try X ∼ Exponential(λ), continuous:
mX(s) = E(e^(sX)) = ∫_0^∞ e^(sx) λ e^(−λx) dx = λ(λ − s)^(−1)
after some algebra. (Requires s < λ.)
m′X(s) = λ(λ − s)^(−2), so E(X) = m′X(0) = 1/λ.
m′′X(s) = 2λ(λ − s)^(−3), so E(X^2) = m′′X(0) = 2/λ^2. Hence
Var(X) = 2/λ^2 − (1/λ)^2 = 1/λ^2.
147
More about moment-generating functions
If X ∼ Poisson(λ), then
mX(s) = e^(λ(e^s − 1)).
If X ∼ N(0, 1), then
mX(s) = e^(s^2/2).
Facts:
• If X, Y independent, mX+Y(s) = mX(s) mY(s). (Mgf of sum is product of
moment-generating functions.)
• maX+b(s) = e^(bs) mX(as). (Mgf of a linear function related to
mgf of the original random variable.)
148
Proofs from definition.
First result very useful: distribution of sum very difficult to find, but
can get moments for sum much more easily.
If X ∼ Binomial(n, θ), then X = Y1 + Y2 + · · · + Yn where
each Yi ∼ Bernoulli(θ). Hence
mX(s) = [mYi(s)]^n = (1 − θ + θe^s)^n.
If X ∼ N(µ, σ^2), X = µ + σZ where Z ∼ N(0, 1). Thus
mX(s) = mσZ+µ(s) = e^(µs) mZ(σs) = e^(µs + σ^2 s^2/2).
149
Using mgfs to recognize distributions
Important result, called uniqueness theorem. Suppose X has mgf
finite for−s0 < s < s0; suppose mX(s) = mY (s) for
−s0 < s < s0. Then X , Y have same distribution.
In other words: if mgf of X is that of known distribution, then X
must have that distribution.
Example: X, Y ∼ Poisson(λ), independent. X + Y has mgf
mX+Y(s) = {e^(λ(e^s − 1))}^2 = e^(2λ(e^s − 1)).
This is the mgf of Poisson(2λ), so X + Y ∼ Poisson(2λ).
150
Conditional Expectation
Consider this joint distribution (Ex. 3.5.2):
        X = 5   X = 8   sum
Y = 0    1/7     3/7    4/7
Y = 3    1/7      0     1/7
Y = 4    1/7     1/7    2/7
sum      3/7     4/7
X, Y related: if Y = 0, then X more likely to be 8.
151
Suppose Y = 3. Then P(X = 5|Y = 3) = (1/7)/(1/7) = 1,
P(X = 8|Y = 3) = 0/(1/7) = 0. If Y = 3, then X is certain to be
5, so E(X|Y = 3) = 5.
Now suppose Y = 4:
P(X = 5|Y = 4) = (1/7)/(1/7 + 1/7) = 1/2 = P(X = 8|Y = 4).
If Y = 4, then the average X is E(X|Y = 4) = 5 · (1/2) + 8 · (1/2) = 6.5.
Likewise, E(X|Y = 0) = 7.25.
152
These expectations from conditional distribution called conditional
expectations. E(X|Y = y) varies from 5 to 7.25 depending on
value of Y ; “on average, X depends on Y ”.
In general, if X, Y related, then mean of X depends on Y .
Calculate conditional distribution of X|Y , find X-expectation. This
is conditional expectation.
153
Conditional expectation: continuous case
Same principle: find expectation of conditional distribution. Now use
joint and marginal densities to find conditional density; then
integrate to get expectation.
Example: fX,Y(x, y) = 4x^2 y + 2y^5, 0 ≤ x, y ≤ 1.
Conditional density fX|Y(x|y) = fX,Y(x, y)/fY(y). So first find
marginal density fY(y) by integrating out x from the joint density:
fY(y) = (4/3)y + 2y^5. Has no x in it. Hence
fX|Y(x|y) = (4x^2 y + 2y^5) / ((4/3)y + 2y^5).
154
Note: only x in the numerator, so not so hard. Thus
E(X|Y = y) = ∫_0^1 x · (4x^2 y + 2y^5) / ((4/3)y + 2y^5) dx = (1 + y^4) / (4/3 + 2y^4).
Depends slightly on Y: E(X|Y = 0) = 0.75,
E(X|Y = 0.5) = 0.729, E(X|Y = 1) = 0.6. As Y increases,
X decreases, on average.
155
Conditional expectations as random variables
Without particular Y -value in mind, can define E(X|Y ) by taking
E(X|Y = y) and replacing y by Y. Above example:
E(X|Y) = (1 + Y^4) / (4/3 + 2Y^4).
This kind of conditional expectation is random variable (function of
random variable Y ).
156
As random variable, E(X|Y ) must have expectation,
E[E(X|Y)]. What is it? Directly, as a function of y:
E[E(X|Y)] = ∫_0^1 E(X|Y = y) fY(y) dy = 2/3
(much cancellation). Now: the marginal density of X is
fX(x) = 2x^2 + 1/3 (integrate out y from the joint density), so
E(X) = ∫_0^1 x (2x^2 + 1/3) dx = 2/3 = E[E(X|Y)].
Not a coincidence. Illustrates theorem of total expectation:
E[E(X|Y )] = E(X). In words: effect of varying Y is to change
E(X|Y ), but E[E(X|Y )] averages out these effects, leaving only
overall average of X .
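A symbolic sketch checking E[E(X|Y)] = E(X) = 2/3 for this example (assuming sympy):

import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
f = 4 * x**2 * y + 2 * y**5               # joint density on [0,1] x [0,1]

fY = sp.integrate(f, (x, 0, 1))           # marginal of Y
EX_given_Y = sp.integrate(x * f / fY, (x, 0, 1))
print(sp.simplify(sp.integrate(EX_given_Y * fY, (y, 0, 1))))   # 2/3

fX = sp.integrate(f, (y, 0, 1))           # marginal of X
print(sp.simplify(sp.integrate(x * fX, (x, 0, 1))))            # 2/3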
157
Conditional variance
Conditional variance is variance of conditional distribution.
Return to previous discrete example:
        X = 5   X = 8   sum
Y = 0    1/7     3/7    4/7
Y = 3    1/7      0     1/7
Y = 4    1/7     1/7    2/7
sum      3/7     4/7
If Y = 3, X certain to be 5, so Var(X|Y = 3) = 0.
But if Y = 4, X equally likely 5 or 8; Var(X|Y = 4) = 2.25.
158
(Calculation: E(X|Y = 4) = 6.5, E(X^2|Y = 4) = 44.5,
Var(X|Y = 4) = 44.5 − (6.5)^2 = 2.25.)
Another expression of how Y affects X . If know Y = 3, know X
exactly, but if Y = 4, more uncertain about possible X .
159
Inequalities relating probability, mean and
variance
Mean and variance are closely related to probabilities. There are general
relationships true for a wide range of random variables and
distributions.
Markov inequality: If X cannot be negative, then
P(X ≥ a) ≤ E(X)/a.
In words: if mean small, X unlikely to be very large.
160
Chebychev inequality:
P(|Y − µY| ≥ a) ≤ Var(Y)/a^2.
In words: if variance small, Y unlikely to be far from mean.
(Variations in spelling: best English transliteration from Russian
probably “Chebyshov”.)
161
Example: suppose X = 0, 1, 2 each with probability 1/3. Then
E(X) = 1, E(X^2) = 5/3, so Var(X) = 2/3.
Markov with a = 1.5 says P(X ≥ 1.5) ≤ 1/1.5 = 2/3. Actual
P(X ≥ 1.5) = P(X = 2) = 1/3, which is indeed ≤ 2/3.
Chebychev with a = 0.9:
P(|X − 1| ≥ 0.9) ≤ (2/3)/(0.9)^2 = 0.823. Actual
P(|X − 1| ≥ 0.9) = P(X ≤ 0.1) + P(X ≥ 1.9) = P(X = 0) + P(X = 2) = 2/3.
Bounds from Markov and Chebychev inequalities often not very
close to truth, but guaranteed, so can use inequalities to prove
results.
162
Proof of Markov inequality
Uses idea that if Z ≤ X , then E(Z) ≤ E(X).
Define random variable Z = a if X ≥ a, 0 otherwise. Because
X ≥ 0, value of Z always≤ that of X : Z ≤ X .
E(Z) = aP (X ≥ a) + 0P (X < a) = aP (X ≥ a).
But Z ≤ X so E(Z) ≤ E(X) and therefore
aP (X ≥ a) ≤ E(X). Divide both sides by a. Done.
163
Proof of Chebychev inequality
This uses Markov’s inequality with clever choice of random variable.
Let X = (Y − µY)^2; X ≥ 0. Then Markov's inequality (with a^2
replacing a) says
P(X ≥ a^2) ≤ E(X)/a^2 ⇒ P[(Y − µY)^2 ≥ a^2] ≤ E[(Y − µY)^2]/a^2.
In the last inequality, E[·] is Var(Y). On the left, both terms in the probability are
≥ 0, so can take square roots on both sides. Gives
P(|Y − µY| ≥ a) ≤ Var(Y)/a^2,
which is Chebychev’s inequality. Done.
164
Cauchy-Schwartz and Jensen inequalities
Cauchy-Schwartz:
|Cov(X, Y)| ≤ √(Var(X) Var(Y)) ⇒ |Corr(X, Y)| ≤ 1.
Proof: page 188 of text. Idea, for X, Y having mean 0: write
E[(X − λY)^2] in terms of variances and covariances; the result must
be ≥ 0.
Jensen’s inequality relates E(g(X)) and g(E(X)). Specifically,
if g(x) is concave up (that is, g′′(x) > 0), then
g(E(X)) ≤ E(g(X)).
165
Proof: Tangent line to concave-up function always≤ function
(picture). Consider tangent line to g(x) at x = E(X); suppose
equation is a + bx. Then g(E(X)) = a + bE(X). Also, line
≤ g(x) everywhere else, so
a + bX ≤ g(X) ⇒ E(a + bX) ≤ E(g(X))
⇒ a + bE(X) ≤ E(g(X))
⇒ g(E(X)) ≤ E(g(X)).
Done.
(Note: text uses “convex” for “concave up”.)
166
Consequences of Jensen’s inequality
Take g(x) = x^2. Then (E(X))^2 ≤ E(X^2). But
Var(X) = E(X^2) − (E(X))^2 ≥ 0, so knew that anyway.
Another: suppose X = 1, 2, 3, each with prob 1/3. Then E(X) = 2.
But get another kind of average by multiplying the 3 possible values and
taking the 3rd root. This is called the geometric mean. Here it is
(1·2·3)^(1/3) = 1.817. Ordinary mean greater than geometric mean.
Look at the log of the geometric mean:
ln{(1·2·3)^(1/3)} = (1/3) ln(1·2·3) = (1/3)(ln 1 + ln 2 + ln 3) = E(ln X).
Thus the geometric mean is e^(E(ln X)).
167
Jensen: − ln x is concave up for x > 0, so
− ln(E(X)) ≤ E(− ln X) ⇒ ln(E(X)) ≥ E(ln X).
Exponentiate both sides (e^{ln y} = y):
E(X) ≥ e^{E(ln X)}.
This says that for any positive random variable X , the ordinary
mean will always be≥ the geometric mean.
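A small simulation illustrating the ordinary-mean/geometric-mean comparison (a sketch; the positive distribution chosen here is arbitrary):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.1, 2.0, size=100_000)    # any positive random variable will do

arithmetic = x.mean()                      # estimate of E(X)
geometric = np.exp(np.log(x).mean())       # estimate of exp(E(ln X))
print(arithmetic, geometric)               # ordinary mean >= geometric mean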
168
Sampling Distributions and
Limits
169
Introduction: roulette
See http://tinyurl.com/238p5 for intro to game.
Basic idea: bet on number or number combination. Roulette wheel
spun, one number is winner. Your bet wins if it contains winning
number.
Wheel also contains numbers 0, 00. Winning bets paid as if 0, 00
absent (advantage to casino).
Bet 1: “high number”: win with 19–36, lose otherwise. Bet $1, win
$1 if win. Let W be winnings on one play; P(W = 1) = 18/38,
P(W = −1) = 20/38. Then
E(W) = 1 · (18/38) + (−1) · (20/38) = −2/38 ≈ −$0.05.
170
Bet 2: “lucky number”: win if 24 comes up, lose otherwise. Win $35
for $1 bet. Now P (W = 35) = 1/38, P (W = −1) = 37/38, so
E(W) = 35 · (1/38) + (−1) · (37/38) = −2/38 ≈ −$0.05.
In both bets, lose 5 cents per $ bet in long run.
Play game not once but many times. Interested in total winnings, or
mean winnings per play. Let Wi be winnings on play i; then mean
winnings per play Mn over n plays is
Mn = (1/n) Σ_{i=1}^n Wi.
Investigate behaviour of Mn by simulation.
171
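The course runs this simulation in Minitab; a rough Python equivalent (a sketch — bet probabilities as above, seed and play count arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n_plays = 1000

# High-number bet: win $1 with prob 18/38, lose $1 otherwise.
w_high = np.where(rng.random(n_plays) < 18/38, 1, -1)

# Lucky-number bet: win $35 with prob 1/38, lose $1 otherwise.
w_lucky = np.where(rng.random(n_plays) < 1/38, 35, -1)

# Running mean winnings per play, M_n, for n = 1, ..., n_plays.
n = np.arange(1, n_plays + 1)
m_high = np.cumsum(w_high) / n
m_lucky = np.cumsum(w_lucky) / n
print(m_high[-1], m_lucky[-1])    # both should end up near -2/38 = -0.0526

Plotting m_high and m_lucky against n reproduces pictures like the ones on the next slides.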
High-number, 30 plays:
[Plot: M_n (mean winnings per play) against n, for n = 1 to 30.]
172
High-number, 1000 plays:
[Plot: M_n against n, for n = 1 to 1000.]
173
Lucky-number, 1000 plays:
[Plot: M_n against n, for n = 1 to 1000.]
174
Notes about roulette simulation
1st graph: in high-number bet, fortune goes up/down by $1 per play;
winnings/play pattern similar. On this sequence, in profit after 30
plays, but losing after 15.
2nd graph: same bet, 1000 plays. Less fluctuation after more trials;
winnings per play apparently tending to dotted line, E(W ). (Other
simulations have different shape but similar end behaviour.)
3rd graph: lucky-number bet, 1000 plays. Large jump upwards on
each win. Picture more erratic than for high-number bet; long-term
behaviour not clear yet. (Need more plays.)
175
Understanding Mn mathematically: mean, variance
Mn = (1/n) Σ_{i=1}^n Wi
is sum. Wi in sum independent, each same distribution (one spin of
wheel has no effect on other spins). So can calculate E(Mn) and
Var(Mn).
Already found E(Wi) = −2/38 for both our bets.
Find variances for bets: for high-number bet, Var(Wi) = 0.9972;
for lucky-number bet Var(Wi) = 33.21.
176
For mean:
E(Mn) = (1/n) Σ_{i=1}^n E(Wi) = (1/n) Σ_{i=1}^n (−2/38) = −2/38,
since there are n terms in the sum, all the same.
That is, regardless of how long you play, you will lose 5 cents per $
bet on average.
177
Var(Mn) = (1/n²) Σ_{i=1}^n Var(Wi) = Var(Wi)/n.
Sum has n terms all equal to variance of one play’s winnings. So for
high-number bet, Var(Mn) = 0.9972/n, for lucky-number bet,
Var(Mn) = 33.21/n.
For any particular n, variance for high-number bet lower. Supports
simulation: high-number bet results more predictable.
In both cases, as n →∞, Var(Mn) → 0. Longer you play, more
predictable Mn is.
178
Distribution of Mn
Mean and variance not whole story – want to know things like
P (Mn > 0) (chance of profit). For this, need distribution of Mn.
Start with M2 (2 plays). Do lucky-number bet (P(W = 35) = 1/38,
P(W = −1) = 37/38).
4 possibilities:
• win both times. M2 = (35 + 35)/2 = 35;
P(M2 = 35) = (1/38)² = 1/1444 ≈ 0.0007.
• win on 1st, lose on 2nd. M2 = (35 + (−1))/2 = 17; prob is
(1/38) · (37/38) = 37/1444.
179
• lose on 1st, win on 2nd. Again M2 = 17 and prob is same as
above. Thus overall P(M2 = 17) = 74/1444 ≈ 0.0512.
• lose on both. M2 = ((−1) + (−1))/2 = −1;
P(M2 = −1) = (37/38)² = 1369/1444 ≈ 0.9480.
Calculation complicated, even for n = 2, because have to consider
all possible combinations.
In general: this kind of distribution very difficult to find exactly. So
look for approximations to it.
180
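For small n the exact distribution can also be enumerated by brute force; a Python sketch (function name my own):

from itertools import product
from collections import defaultdict

def exact_mn_distribution(n, outcomes=((35, 1/38), (-1, 37/38))):
    # Exact distribution of mean winnings per play over n lucky-number bets.
    dist = defaultdict(float)
    for plays in product(outcomes, repeat=n):
        mean = sum(w for w, _ in plays) / n
        prob = 1.0
        for _, q in plays:
            prob *= q
        dist[mean] += prob
    return dict(dist)

print(exact_mn_distribution(2))
# {35.0: 0.0007, 17.0: 0.0512, -1.0: 0.9480} (rounded) -- matches the slide.
# The number of win/lose sequences grows as 2^n, which is why approximations are needed.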
Sampling distributions
Suppose X1, X2, . . . , Xn are random variables, each independent
and with same distribution. For example:
• Xi is winnings from i-th play of a roulette bet.
• Xi is height of i-th randomly chosen Canadian.
• Xi = 1 if randomly chosen voter supports Liberal party,
Xi = 0 otherwise.
• Xi is randomly generated value from a distribution with density
fX(x).
In each case: underlying phenomenon of interest, collect data at
random to help understand phenomenon.
181
Summarize Xi values using random variable
Yn = h(X1, X2, . . . , Xn) for some function h (eg. mean, like
Mn).
Some jargon:
• total collection of individuals (all possible spins of roulette
wheel, all Canadians, all possible values) called population.
• particular individuals selected, or Xi values obtained from
them, called sample.
• Yn defined above called sample statistic.
Usually don’t know about population, so draw conclusion about it
based on sample.
182
First: opposite problem: if we know population, find out what
samples from it look like.
“At random” important, and specific. Each individual value in
population must have correct chance of being in sample (same
chance, for human populations), and each must be in sample or not
independently of others.
Aim: learn about distribution of Yn, called sampling distribution.
General statements difficult. Approach: find what happens as
n →∞, then use result as approximation for finite n.
183
Convergence in probability; weak law of
large numbers
In mathematics, accustomed to convergence ideas. Eg. if
an = 1 − 1/n, so that a1 = 0, a2 = 1/2, a3 = 2/3, etc., an → 1
(converges to 1) as n → ∞ because, by taking n large enough, all
values after an as close to 1 as desired.
For sequence X1, X2, . . . of random variables, what is meaning of
Xn → Y , where Y is random variable?
184
Different possibilities. One idea: “prob of Xn being far from Y goes
to 0 as n gets large”. Leads to definition:
Sequence {Xn} converges in probability to Y if, for all ε > 0,
lim_{n→∞} P(|Xn − Y| ≥ ε) = 0. Notation: Xn →P Y.
Example: suppose U ∼ Uniform[0, 1]. Let Xn = 3 when
U ≤ (2/3)(1 − 1/n) and 8 otherwise.
Thus when n = 1, X1 must be 8. If U > 2/3, Xn remains 8 forever,
but if U ≤ 2/3, then U ≤ (2/3)(1 − 1/n) eventually, so Xn becomes 3 for
some n, then remains 3 forever.
(Cannot know which will happen since U random variable.)
185
Now define Y = 3 if U ≤ 2/3 and Y = 8 otherwise. Same as
“eventual” Xn, so should have Xn →P Y. Correct?
P(|Xn − Y| ≥ ε) ≤ P(Xn ≠ Y)
  = P((2/3)(1 − 1/n) < U < 2/3)
  = 2/(3n).
This tends to 0 as n → ∞, so Xn →P Y.
186
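A simulation sketch of this example (the estimated P(Xn ≠ Y) should shrink like 2/(3n); seed and sample size arbitrary):

import numpy as np

rng = np.random.default_rng(2)
u = rng.uniform(0, 1, size=200_000)
y = np.where(u <= 2/3, 3, 8)

for n in (1, 5, 25, 125):
    xn = np.where(u <= (2/3) * (1 - 1/n), 3, 8)
    print(n, np.mean(xn != y), 2 / (3 * n))   # empirical P(Xn != Y) vs 2/(3n)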
Convergence to a constant
What if Y not random variable, but number?
Example: suppose Zn ∼ Exponential(n). Then E(Zn) = 1/n,
suggesting that Zn typically gets smaller and smaller. Does
Zn →P 0?
P(|Zn − 0| ≥ ε) = P(Zn ≥ ε) = ∫_ε^∞ n e^{−nx} dx = e^{−nε}.
For any fixed ε, P(|Zn − 0| ≥ ε) → 0, so Zn →P 0.
Important special case (usually easier to handle).
187
Convergence to mean
Suppose sequence {Yn} has E(Yn) = µ for all n. Then Yn →P µ
if P(|Yn − µ| ≥ ε) → 0.
But recall Chebychev's inequality,
P(|Y − µY| ≥ a) ≤ Var(Y)/a². Here:
P(|Yn − µ| ≥ ε) ≤ Var(Yn)/ε².
For fixed ε, right side (and hence left side) tends to 0 if
Var(Yn) → 0, in which case Yn →P µ.
(Logically: if Var(Yn) getting smaller, Yn becoming closer to their
mean µ.)
188
Weak Law of Large Numbers
Return to X1, X2, . . . , Xn being a random sample from some
population with mean E(Xi) = µ and variance Var(Xi) = v.
Consider sample mean
Mn = (1/n) Σ_{i=1}^n Xi.
Intuitively, expect Mn to be “close” to population mean µ, and to get
closer as n increases (more information in larger sample).
Does Mn →P µ? Re-do roulette calculations to show that
E(Mn) = µ and Var(Mn) = Var(Xi)/n = v/n.
189
Now, {Mn} is sequence of random variables with same mean µ.
Result of section “convergence to mean” says that Mn →P µ if
Var(Mn) → 0. But here, Var(Mn) = v/n → 0. This proves
that Mn →P µ.
This justifies use of sample mean as estimate of the population
mean. Can estimate average height of all Canadians by measuring
average height of sample of Canadians; the larger the sample,
closer estimate will likely be.
Important result, called weak law of large numbers.
190
To generalize: suppose now that Xn do not all have same variance,
but Var(Xi) = vi. Then
Var(Mn) = (1/n²) Σ_{i=1}^n vi.
This might not → 0. But suppose that vi ≤ v for all i. Then
Var(Mn) = (1/n²) Σ_{i=1}^n vi ≤ (1/n²) Σ_{i=1}^n v = v/n → 0.
In other words, Mn →P µ even if the variances are not all equal,
provided that they are bounded.
191
Convergence with probability 1
Previous example: suppose U ∼ Uniform[0, 1]. Let Xn = 3
when U ≤ (2/3)(1 − 1/n) and 8 otherwise. Let Y = 3 if U ≤ 2/3 and
Y = 8 otherwise. Concluded that Xn →P Y.
Take another approach. Suppose we knew U , eg. suppose
U = 0.4. Then
0.4 ≤ (2/3)(1 − 1/n) ⇒ n ≥ 5/2.
Thus X1 = X2 = 8, X3 = X4 = · · · = 3. This is ordinary
sequence of numbers, converges to 3. Also, if U = 0.4, Y = 3.
192
In general: if U < 2/3, Xn = 8 for n ≤ 2/(2 − 3U) and Xn = 3
after that. If U > 2/3, Xn = 8 for all n.
In both cases, Xn → Y as ordinary sequence for any particular
value of U . Potentially different idea of convergence of random
variables.
Definition: Xn converges to Y with probability 1 if
P(lim_{n→∞} Xn = Y) = 1. Also “converges almost surely”;
notation Xn →a.s. Y.
In words: consider all ways to get (number) sequences {Xn}; for
each, consider corresponding Y. If Xn → Y always, then
Xn →a.s. Y.
193
Is it same as convergence in probability?
Example: let U ∼ Uniform[0, 1], and define {Xn} like this:
• X1 = 1 if 0 ≤ U < 1/2, 0 otherwise
• X2 = 1 if 1/2 ≤ U < 1, 0 otherwise
• X3 = 1 if 0 ≤ U < 1/4, 0 otherwise
• X4 = 1 if 1/4 ≤ U < 1/2, 0 otherwise
• X5 = 1 if 1/2 ≤ U < 3/4, 0 otherwise
• X6 = 1 if 3/4 ≤ U < 1, 0 otherwise
• X7 = 1 if 0 ≤ U < 1/8, 0 otherwise
• X8 = 1 if 1/8 ≤ U < 1/4, 0 otherwise, etc.
194
(Divided [0, 1] into 2, then 4, then 8,. . . intervals.)
Intervals getting shorter, so P(Xn = 1) decreasing. Indeed, for
ε < 1, P(|Xn − 0| ≥ ε) = P(Xn = 1) → 0, so Xn →P 0.
Suppose U = 0.2. Then Xn = 0 except for
X1 = X3 = X8 = · · · = 1. Beyond any n, always another
Xn = 1 (always another interval containing 0.2). So for U = 0.2,
number sequence {Xn} has no limit. Hence not true that Xn →a.s. 0.
Example shows that the two convergence ideas are different –
convergence with probability 1 harder to achieve.
195
Strong law of large numbers
Random sample X1, X2, . . . , Xn with E(Xi) = µ,
Var(Xi) ≤ v; let Mn = (Σ_{i=1}^n Xi)/n be sample mean.
Already showed that Mn →P µ (“weak law of large numbers”).
Also strong law of large numbers: Mn →a.s. µ. Proof difficult.
In words: out of (infinitely) many different sequences {Mn} obtainable, every one of them converges to µ.
196
Convergence in distribution
Consider independent sequence of random variables {Xn} with
P(Xn = 1) = 1/2 + 1/n and P(Xn = 0) = 1/2 − 1/n. Also, let
P(Y = 0) = P(Y = 1) = 1/2, independently of the Xn.
Now, take ε < 1. Then P(|Xn − Y| ≥ ε) = P(Xn ≠ Y). Could
have Xn = 0, Y = 1 or Xn = 1, Y = 0; use independence:
P(Xn ≠ Y) = (1/2 − 1/n)(1/2) + (1/2 + 1/n)(1/2) = 1/2.
Not → 0, so not true that Xn →P Y.
197
But Xn does converge to Y in sense that
P(Xn = 1) → 1/2 = P(Y = 1) and
P(Xn = 0) → 1/2 = P(Y = 0). Called convergence in
distribution.
To make definition: note that P(Xn = x) meaningless for
continuous Xn, so work with P(Xn ≤ x) instead.
Then: {Xn} converges in distribution to Y if
P(Xn ≤ x) → P(Y ≤ x) for all x. Notation: Xn →D Y.
198
Example: Poisson approximation to binomial
Suppose Xn ∼ Binomial(n, λ/n) (that is, trials increasing but
success prob decreasing so that E(Xn) = n(λ/n) = λ constant).
Then
P(Xn = j) = (n choose j) (λ/n)^j (1 − λ/n)^{n−j} → e^{−λ} λ^j / j!,
which is P(Y = j) when Y ∼ Poisson(λ). That is,
Xn →D Poisson(λ).
(Proof based on lim_{n→∞} (1 − x/n)^n = e^{−x}.)
Suggests that if n large and θ small, Poisson is good approx to
binomial.
199
Try this: take λ = 1.5 for n = 2, 5, 10, 20, 100:
x n=2 n=5 n=10 n=20 n=100 Poisson
0 0.0625 0.1680 0.1968 0.2102 0.2206 0.2231
1 0.3750 0.3601 0.3474 0.3410 0.3359 0.3346
2 0.5625 0.3087 0.2758 0.2626 0.2532 0.2510
3 0.0000 0.1323 0.1298 0.1277 0.1259 0.1255
4 0.0000 0.0283 0.0400 0.0440 0.0465 0.0470
5 0.0000 0.0024 0.0084 0.0114 0.0136 0.0141
6 0.0000 0.0000 0.0012 0.0023 0.0032 0.0035
Approx for n = 20 not bad; for n = 100 is very good.
200
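A sketch that reproduces this table (assumes scipy is available):

from scipy.stats import binom, poisson

lam = 1.5
for j in range(7):
    row = [binom.pmf(j, n, lam / n) for n in (2, 5, 10, 20, 100)]
    row.append(poisson.pmf(j, lam))
    print(j, " ".join(f"{q:.4f}" for q in row))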
Convergence in distribution and moment generating
functions
Moment-generating function mY (s) for random variable Y is
function of s.
Uniqueness theorem: if mX(s) = mY (s) for all s where both
finite, then X, Y have same distribution.
Suggests following (true) result: if {Xn} is sequence of random
variables with mXn(s) → mY (s) (for all s where both sides finite),
then Xn →D Y.
201
Central Limit Theorem
Return to “random sample” X1, X2, . . . , Xn; suppose E(Xi) = 0
and Var(Xi) = 1.
Define Mn = (Σ_{i=1}^n Xi)/n. Does Mn converge in distribution to
anything interesting?
Well, E(Mn) = 0 but Var(Mn) = 1/n → 0. So look instead at
Zn = √n · Mn: E(Zn) = 0 and Var(Zn) = 1. Then
Zn = (Σ_{i=1}^n Xi)/√n.
202
Moment-generating function for Xi is
mXi(s) = 1 + s E(Xi) + (s²/2!) E(Xi²) + (s³/3!) E(Xi³) + · · · ;
here E(Xi) = 0, Var(Xi) = 1 so E(Xi²) = 1, giving
mXi(s) = 1 + s²/2 + (s³/3!) E(Xi³) + · · · .
Now, by rules for mgf's,
mZn(s) = mX1(s/√n) · mX2(s/√n) · · · mXn(s/√n)
       = {mXi(s/√n)}ⁿ
       = (1 + s²/(2n) + (s³/(3! n^{3/2})) E(Xi³) + · · ·)ⁿ.
203
Recall that as n → ∞, (1 + y/n)ⁿ → e^y. Above, the terms in s³
and higher contribute less and less as n increases, so only the 1
and s²/(2n) terms in bracket have effect. Thus
lim_{n→∞} mZn(s) = lim_{n→∞} (1 + s²/(2n))ⁿ = e^{s²/2}
which is mgf of standard normal distribution.
Thus, remarkable fact: regardless of distribution of Xi,
Zn →D N(0, 1).
Also works for Xi with any mean and variance: standardized
Mn →D N(0, 1). Called central limit theorem.
204
Exact distribution of Mn very difficult to find. But if n “large”,
distribution can be approximated very well by normal distribution,
easier to work with.
This is reason for studying normal distribution.
Note that theorem uses convergence in distribution, so that it is the
cdf that converges, not the density function. Important if Xi discrete.
Also, for approximation, don’t need to be so careful about
standardization. Any sum/mean for large n works.
205
CLT by simulation
Let U1, U2, . . . ∼ Uniform[0, 1]; investigate distribution of
Yn = (U1 + U2 + · · ·+ Un)/n for various n. Uniform[0, 1]
distribution completely unlike normal. Do by simulation:
1. choose “large” number of Yn’s to simulate (eg. nsim = 10, 000)
2. in each of n columns, generate nsim random values from
Uniform[0, 1]
3. calculate simulated Yn values as row means. Eg. for n = 5,
let c10=rmean(c1-c5).
4. Draw histogram of results, compare normal distribution shape.
Normal good if curve through top middle of histogram bars.
206
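A rough Python version of the same simulation (the slides use Minitab; names and seed here are my own):

import numpy as np

rng = np.random.default_rng(3)
nsim, n = 10_000, 5

u = rng.uniform(0, 1, size=(nsim, n))    # nsim rows of n Uniform[0,1] values
y = u.mean(axis=1)                       # row means: simulated values of Y_n

# Compare with the normal distribution having the same mean and variance:
# E(Y_n) = 1/2, Var(Y_n) = (1/12)/n.
print(y.mean(), y.var(), 0.5, 1 / (12 * n))
# A density-scale histogram of y already looks close to normal for n = 5.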
[Histogram of y (density scale), with normal shape for comparison.]
n = 2: normal too high at top, too low elsewhere.
207
[Histogram of y (density scale), with normal shape for comparison.]
n = 5: much closer approx.
208
[Histogram of y (density scale), with normal shape for comparison.]
n = 20: almost perfect.
209
Normal approx to binomial
Binomial is sum of Bernoullis, so CLT should apply if #trials n large.
Suppose Y ∼ Binomial(4, 0.5). Then E(Y ) = 2, Var(Y ) = 1.
Exact P (Y ≤ 1):
P(Y ≤ 1) = (4 choose 0)(0.5)⁰(1 − 0.5)⁴ + (4 choose 1)(0.5)¹(0.5)³ = 0.3125.
Take X ∼ N(2, 1) (same mean, variance as Y ). P (X ≤ 1)?
P(X ≤ 1) = P(Z ≤ (1 − 2)/√1) = P(Z ≤ −1) = 0.1587.
Not very close!
210
Problem: X continuous, but Y discrete. Y ≤ 1 really “Y ≤ anything rounding to 1”. Suggests approximating P(Y ≤ 1) by
P(X ≤ 1.5):
P(X ≤ 1.5) = P(Z ≤ (1.5 − 2)/√1) = P(Z ≤ −0.5) = 0.3085.
For such small n, really very close to P (Y ≤ 1) = 0.3125.
In general, add 0.5 for ≤ and subtract 0.5 for <. Called continuity
correction; do whenever discrete distribution approximated by
continuous.
(Alternatively: for binomial, P(Y ≤ 1) ≠ P(Y < 1), but for
normal, P(X ≤ 1) = P(X < 1).)
211
Compare Y ∼ Binomial(20, 0.5); E(Y) = 10, Var(Y) = 5.
Then exact P(Y ≤ 8) = 0.2517; approx by X ∼ N(10, 5) as
P(Y ≤ 8) ≈ P(X ≤ 8.5) = P(Z ≤ (8.5 − 10)/√5) = P(Z ≤ −0.67) = 0.2514.
Now, approx very good.
212
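A quick check of both numbers, plus the effect of omitting the continuity correction (a sketch, assumes scipy):

from math import sqrt
from scipy.stats import binom, norm

n, p = 20, 0.5
mu, var = n * p, n * p * (1 - p)

exact = binom.cdf(8, n, p)                    # P(Y <= 8), exact
approx_cc = norm.cdf((8.5 - mu) / sqrt(var))  # with continuity correction
approx_raw = norm.cdf((8 - mu) / sqrt(var))   # without
print(exact, approx_cc, approx_raw)           # about 0.2517, 0.2514, 0.186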
If p 6= 0.5, binomial skewed; skewness decreases as n increases.
So need larger n for p far from 0.5.
Example: n = 20, p = 0.1. Simulate and plot using Minitab:
MTB > random 1000 c3;
SUBC> binomial 20 0.1.
MTB > hist c3
Shape clearly skewed, not normal. n = 20 not large enough here.
Rule of thumb: normal approx OK if np ≥ 5 and n(1− p) ≥ 5.
Examples: n = 4, p = 0.5: np = 2 < 5, no good.
n = 20, p = 0.5: np = n(1− p) = 10 ≥ 5, good;
n = 20, p = 0.1: np = 2 < 5, no good.
213
Monte Carlo integration
Integral I = ∫₀¹ sin(x⁴) dx: impossible algebraically (no
antiderivative). Get approximate answer numerically eg. by
Simpson's rule. But can also recognize that
I = E{sin(U⁴)}
where U ∼ Uniform[0, 1]. I is “average” of sin(U⁴), suggesting
procedure:
1. Generate U randomly from Uniform[0, 1].
2. Calculate T = sin(U4)
3. Repeat steps 1 and 2 many times, find mean value m of T .
214
Minitab commands to do this (U in c1, T in c2):
MTB > random 1000 c1;
SUBC> uniform 0 1.
MTB > let c2=sin(c1**4)
MTB > mean c2
I got m = 0.19704. How accurate?
m observed value of random variable M . M mean of 1000 values,
so central limit theorem applies: approx normal distribution.
Mean, variance unknown but estimate using sample mean 0.19704,
sample SD 0.25221: E(M) ≈ 0.19704,
Var(M) = σ²/n ≈ 0.25221²/1000 = 6.36 × 10⁻⁵.
215
Now, 99.7% of normal distribution within mean ± 3 × SD, so I
almost certainly in
0.19704 ± 3√(6.36 × 10⁻⁵) = (0.173, 0.221).
To get more accurate answer, get more simulated values.
216
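A Python sketch of the same Monte Carlo estimate (the Minitab session above is the course's version; seed and number of repetitions arbitrary):

import numpy as np

rng = np.random.default_rng(4)
n = 1000

u = rng.uniform(0, 1, size=n)
t = np.sin(u**4)                      # T = sin(U^4); E(T) is the integral I

m = t.mean()
se = t.std(ddof=1) / np.sqrt(n)       # standard error of the mean
print(m, (m - 3 * se, m + 3 * se))    # estimate and a "3 SD" interval for I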
Recognizing as expectation
Consider now I = ∫₀^∞ 5x cos(x²) e^{−5x} dx.
Again impossible algebraically; because of limits, can’t use previous
trick.
Idea: use distribution with right limits and density in integral. Here,
Exponential(5) has density 5e^{−5x} on correct interval, so
I = E{X cos(X²)} where X ∼ Exponential(5).
Minitab annoyance: its exponential dist has parameter 1/λ, so we
have to feed in 1/5 = 0.2.
217
Commands:
MTB > random 1000 c1;
SUBC> exponential 0.2.
MTB > let c2=c1*cos(c1**2)
MTB > describe c2
I got mean 0.1884, SD 0.1731, so this area almost certainly in
0.1884 ± 3 × 0.1731/√1000 = (0.1720, 0.2048).
218
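The corresponding Python sketch (numpy's exponential generator also takes the scale 1/λ, the same parameterization quirk as Minitab):

import numpy as np

rng = np.random.default_rng(5)
n = 1000

x = rng.exponential(scale=1/5, size=n)   # X ~ Exponential with rate 5
t = x * np.cos(x**2)                     # I = E{X cos(X^2)}

m = t.mean()
se = t.std(ddof=1) / np.sqrt(n)
print(m, (m - 3 * se, m + 3 * se))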
Approximating sampling distributions
Central Limit Theorem only applies to means (sums), so is no help
for other quantities (median, variance etc).
Can approximate sampling distributions for these by simulation.
Idea:
1. simulate random sample from population
2. calculate sample quantity
3. repeat steps 1 and 2 many times, summarize results.
219
Sampling distribution of sample median in normal
population
Suppose X1, X2, . . . , Xn is random sample from normal
population mean 10, SD 2; take n = 3.
MTB > Random 500 c1-c3;
SUBC> Normal 10 2.
MTB > RMedian c1-c3 c4.
Samples in rows; use “row statistics” to get sample medians.
220
Shape is very like normal, even for such small sample.
221
Sampling distribution of sample variance in normal
population
Again suppose X1, X2, . . . , Xn ∼ N(10, 22). Now take n = 5:
MTB > Random 500 c1-c5;
SUBC> Normal 10 2.
MTB > RStDev c1-c5 c6.
MTB > let c7=c6*c6
MTB > histogram c7
(samples in rows again; variance as square of SD.)
222
Shape definitely skewed right: not normal-shaped.
223
Normal distribution theory
Normal distribution arises often from CLT, so worth knowing
properties and related distributions. These used frequently in
Chapter 5 and beyond (STAB57).
First: suppose U, V are independent. Then Cov(U, V ) =
E(UV )− E(U)E(V ) = E(U)E(V )− E(U)E(V ) = 0 as
expected.
But: now suppose that Cov(U, V ) = 0. If U, V normal, then (fact)
U, V independent.
That is, for normal U, V , Cov(U, V ) = 0 if and only if U, V
independent. Not true for other distributions.
224
The chi-squared distribution
Suppose Z ∼ N(0, 1). What is distribution of W = Z2? Can’t
use usual transformation because Z2 neither increasing nor
decreasing.
FW(w) = P(W ≤ w) = P(Z² ≤ w) = P(−√w ≤ Z ≤ √w).
This as integral is
FW(w) = ∫_{−√w}^{√w} e^{−z²/2}/√(2π) dz
      = ∫_{−∞}^{√w} e^{−z²/2}/√(2π) dz − ∫_{−∞}^{−√w} e^{−z²/2}/√(2π) dz.
225
Differentiate both sides and simplify to get
fW(w) = (1/√(2πw)) e^{−w/2}.
This is called chi-squared distribution with 1 degree of freedom
(df). Written W ∼ χ²₁.
Now suppose Z1, Z2, . . . , Zn ∼ N(0, 1) independently.
Distribution of W = Z1² + Z2² + · · · + Zn² called chi-squared with
n degrees of freedom. Written W ∼ χ²ₙ.
What is E(W)?
E(W) = E(Σ_{i=1}^n Zi²) = Σ_{i=1}^n E(Zi²) = n(1) = n
since E(Zi²) = Var(Zi) = 1.
226
To get density function of χ²ₙ, compare gamma density with χ²₁:
λ^α w^{α−1} e^{−λw} / Γ(α) = (1/√(2πw)) e^{−w/2}
if α = 1/2 and λ = 1/2. That is, χ²₁ = Gamma(1/2, 1/2).
If Zi² ∼ χ²₁, use mgf formula for gamma dist to write
m_{Zi²}(s) = (1/2)^{1/2} (1/2 − s)^{−1/2}.
227
If W = Σ_{i=1}^n Zi² ∼ χ²ₙ, mgf of W is n copies of m_{Zi²}(s)
multiplied together, ie.
mW(s) = (1/2)^{n/2} (1/2 − s)^{−n/2}
which is mgf of Gamma(n/2, 1/2). Using formula for gamma
density, then, for W ∼ χ²ₙ,
fW(w) = (1/(2^{n/2} Γ(n/2))) w^{n/2−1} e^{−w/2}.
Has skew-to-right shape (picture page 225).
228
Distribution of sample variance
Suppose X1, X2, . . . , Xn ∼ N(µ, σ²). Define X̄ = Σ_{i=1}^n Xi/n
to be sample mean, S² = Σ_{i=1}^n (Xi − X̄)²/(n − 1) to be sample
variance.
Know that X̄ ∼ N(µ, σ²/n). Distribution of S²?
Actually look at (n − 1)S²/σ² = Σ_{i=1}^n (Xi − X̄)²/σ². Can write
(p. 235) as sum of n − 1 squared N(0, 1)'s, so
(n − 1)S²/σ² ∼ χ²_{n−1}.
Fact: E(S²) = σ² (explains division by n − 1).
229
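A small simulation check of this fact (a sketch; compares simulated moments of (n − 1)S²/σ² with those of χ²_{n−1}, whose mean is n − 1 and variance 2(n − 1)):

import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n, nsim = 10, 2, 5, 100_000

x = rng.normal(mu, sigma, size=(nsim, n))
s2 = x.var(axis=1, ddof=1)                # sample variances S^2, one per row
w = (n - 1) * s2 / sigma**2

print(w.mean(), w.var(), n - 1, 2 * (n - 1))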
The t distribution
Standardize X̄: (X̄ − µ)/√(σ²/n) ∼ N(0, 1).
But what if σ² unknown? Idea: replace σ² by sample variance S².
Distribution of result no longer normal (even though Xi are).
(X̄ − µ)/√(S²/n) = [(X̄ − µ)/√(σ²/n)] · 1/√[((n − 1)S²/σ²)/(n − 1)] = Z/√(Y/(n − 1))
where Z ∼ N(0, 1) and Y ∼ χ²_{n−1}.
This called t distribution with n − 1 degrees of freedom, written
t_{n−1}.
230
What happens as n increases? Write
Y/(n − 1) = Σ_{i=1}^{n−1} Zi²/(n − 1) where Zi ∼ N(0, 1). Then
E(Y/(n − 1)) = 1. Let k = Var(Zi²); then
Var(Y/(n − 1)) = (n − 1)k/(n − 1)² = k/(n − 1) → 0.
That is, Y/(n − 1) →P 1 and therefore
Z/√(Y/(n − 1)) →D N(0, 1);
that is, for large n, the t distribution with n − 1 df well approximated
by N(0, 1).
t distribution hard to work with; use tables/software for probabilities.
231
The F distribution
Suppose S1² and S2² sample variances from independent samples
sizes m, n, both from normal populations with variance σ². Then
might compare variances by looking at ratio R = S1²/S2²:
R = S1²/S2² = {[(m − 1)S1²/σ²] / [(n − 1)S2²/σ²]} · {[1/(m − 1)] / [1/(n − 1)]}
  = [X/(m − 1)] / [Y/(n − 1)]
where X ∼ χ²_{m−1} and Y ∼ χ²_{n−1}.
This defined to have F distribution with m − 1 and n − 1
degrees of freedom, written F(m − 1, n − 1).
232
Properties of F distribution
Ratio could have been S2²/S1² = 1/R with similar result: therefore,
if R ∼ F(m − 1, n − 1), then 1/R ∼ F(n − 1, m − 1).
Suppose T = X/√(Y/(n − 1)) ∼ t_{n−1}. Then
T² = (X²/1) / (Y/(n − 1))
is a χ²₁/1 over χ²_{n−1}/(n − 1); that is, T² ∼ F(1, n − 1).
233
In
R = [X/(m − 1)] / [Y/(n − 1)]:
if n → ∞, know that Y/(n − 1) →P 1, and numerator of
R ∼ χ²_{m−1}/(m − 1).
Hence, as n → ∞,
(m − 1)R →D χ²_{m−1}.
Thus χ²_{m−1} is useful approx to F(m − 1, n − 1) if n large.
234
Stochastic Processes
235
Random walks
Consider gambling game: win $1 with prob p, lose $1 with prob q
(p + q = 1). Each play independent. Start with fortune a; let Xn
denote fortune after n plays.
Thus X0 = a; X1 = a + 1 if win (prob p), X1 = a− 1 if lose
(prob q).
Sequence {Xn} of random variables called random walk.
236
Properties of random walk
At each step, two possible outcomes (win/lose), same prob p of
winning, independent. So number of wins Wn ∼ Binomial(n, p).
With Wn wins, must be n−Wn losses, so fortune after Wn wins is
Xn = a + (1)Wn + (−1)(n−Wn) = a + 2Wn − n.
Since E(Wn) = np, have
E(Xn) = a + 2np − n = a + 2n(p − 1/2).
Also
Var(Xn) = 2² Var(Wn) = 4np(1 − p).
237
Since Wn ∼ Binomial(n, p), have
P(Wn = j) = (n choose j) p^j q^{n−j};
write in terms of Xn to get
P(Xn = a + k) = P(a + k = a + 2Wn − n)
             = P(Wn = (n + k)/2)
             = (n choose (n + k)/2) p^{(n+k)/2} q^{(n−k)/2}.
Only certain values of Xn possible; formula fails for impossible
values.
238
Examples
Suppose a = 5, p = 1/4. Then
E(Xn) = 5 + 2n(1/4 − 1/2) = 5 − n/2. Expect fortune to decrease
on average.
What is P(X3 = 6)? Write 6 = 5 + 1 so k = 1, n = 3;
(n + k)/2 = 2 and (n − k)/2 = 1:
P(X3 = 6) = (3 choose 2) (1/4)² (3/4)¹ = 9/64.
How about P(X9 = 7)? This is P(X9 = 5 + 2), so n = 9 and
k = 2. But (n + k)/2 = (9 + 2)/2 not integer, so formula fails.
X9 cannot be 7 (in fact X9 must be even).
239
Now suppose a = 20, p = 2/3. Then
E(Xn) = 20 + 2n(2/3 − 1/2) = 20 + n/3,
increasing with n.
Find P(X5 = 21), where 21 = 20 + 1: n = 5, k = 1 so (n + k)/2 = 3,
(n − k)/2 = 2 and
P(X5 = 21) = (5 choose 3) (2/3)³ (1/3)² ≈ 0.329,
fairly likely.
240
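A sketch that checks these numbers via the P(Xn = a + k) formula above (assumes scipy; function name my own):

from scipy.stats import binom

def random_walk_pmf(n, k, p):
    # P(X_n = a + k): needs (n + k)/2 wins out of n plays.
    wins, rem = divmod(n + k, 2)
    if rem != 0 or not 0 <= wins <= n:
        return 0.0                      # impossible value of X_n
    return binom.pmf(wins, n, p)

print(random_walk_pmf(3, 1, 1/4))   # 9/64 = 0.140625
print(random_walk_pmf(9, 2, 1/4))   # 0.0 -- X_9 cannot equal 7 when a = 5
print(random_walk_pmf(5, 1, 2/3))   # about 0.329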
Gambler’s ruin
Suppose we gamble with aim to reach fortune c > 0. How likely do
we succeed before fortune reaches 0 (run out of money)?
Hard to see answer: no idea how long it takes to reach c or 0.
Idea: let S(a) be prob of reaching c first starting from fortune a.
Then for all c > 0, S(0) = 0, S(c) = 1. Also, if current fortune a,
fortune at next step either a + 1 or a− 1, leading to
S(a) = pS(a + 1) + qS(a− 1).
241
Solve above recurrence relation to get formula: if p = 1/2,
S(a) = a/c; otherwise,
S(a) = [1 − (q/p)^a] / [1 − (q/p)^c].
Example: start with $20, want to win $50. If p = 1/2, chance of
success is 20/50 = 0.4. If p = 0.51, chance of success is
S(20) = [1 − (0.49/0.51)^20] / [1 − (0.49/0.51)^50] ≈ 0.637.
Even a very small edge makes success much more likely. (Even
small disadvantage makes eventual failure much more likely.)
242
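A direct translation of the success-probability formula (sketch; function name my own):

def ruin_success_prob(a, c, p):
    # Probability of reaching fortune c before 0, starting from fortune a.
    if p == 0.5:
        return a / c
    r = (1 - p) / p                  # q/p
    return (1 - r**a) / (1 - r**c)

print(ruin_success_prob(20, 50, 0.50))   # 0.4
print(ruin_success_prob(20, 50, 0.51))   # about 0.637
print(ruin_success_prob(20, 50, 0.49))   # about 0.19 -- a small edge against hurts a lot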
Markov Chains
Simple model of weather:
• if sunny today, prob 0.7 of sunny tomorrow, prob 0.3 of rainy.
• if rainy today, prob 0.4 of sunny tomorrow, prob 0.6 of rainy.
Weather has two states (sunny, rainy). From one day to next,
weather may change state.
Probs above called transition probabilities. This kind of probability
model called Markov chain.
243
Can write as matrix:
P = [ 0.7 0.3
      0.4 0.6 ]
where element pij is P (go to state j|currently state i).
Note assumption: only need to know weather today to predict
weather tomorrow. (If weather today known, past weather
irrelevant). Called Markov property.
Suppose sunny today. Chance of sun in two days?
One idea: list possibilities. Two: SSS, SRS. Use transition probs to
get (0.7)(0.7) + (0.3)(0.4) = 0.61.
244
Another: calculate matrix P 2:
P² = [ 0.7 0.3 ] [ 0.7 0.3 ]   [ 0.61 0.39 ]
     [ 0.4 0.6 ] [ 0.4 0.6 ] = [ 0.52 0.48 ].
Note that top-left calculation same as 1st idea above.
Matrix P 2 gives two-step transition probs. That is, if sunny today,
prob of sunny in 2 days’ time 0.61; if rainy today, almost even
chance of being rainy in 2 days.
In general, P n gives n-step transition probs (weather in n days’
time given weather today).
245
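A numpy sketch of the same calculation (matrix_power gives the n-step transition probabilities):

import numpy as np
from numpy.linalg import matrix_power

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])   # rows: today sunny/rainy; columns: tomorrow sunny/rainy

print(matrix_power(P, 2))    # two-step transition probs: [[0.61 0.39], [0.52 0.48]]
print(matrix_power(P, 8))    # each row close to the stationary distribution (4/7, 3/7)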
Another example
“Ehrenfest’s Urn”: Two urns, containing total of 4 balls. Choose one
ball at random, take out of current urn, place in other urn. Keep
track of number of balls in urn 1.
Transition matrix (states 0, 1, 2, 3, 4 balls in urn 1):
P = [ 0    1    0    0    0
      1/4  0    3/4  0    0
      0    2/4  0    2/4  0
      0    0    3/4  0    1/4
      0    0    0    1    0 ]
Apparent tendency for number of balls in 2 urns to even out.
246
Find likely number of balls in urn 1 after 9 steps by finding P 9. (Use
Minitab: see section E.1 of manual, p. 162.) Answer (rounded):
P⁹ = [ 0      0.5   0      0.5   0
       0.125  0     0.75   0     0.125
       0      0.5   0      0.5   0
       0.125  0     0.75   0     0.125
       0      0.5   0      0.5   0 ]
Start with even number of balls in urn 1: end with either odd
number, equally likely. Start with odd number: end with even
number, most likely 2.
247
Stationary distributions
Instead of starting from particular state, pick starting state from
prob. distribution θ = (θ1, θ2, . . .).
In weather example: suppose 80% chance today sunny, so
θ = (0.80, 0.20).
To get prob of each state n steps later, multiply θ as row vector by
P n. Weather example, for n = 2 days later:
(0.8  0.2) P² = (0.8  0.2) [ 0.61 0.39 ]
                           [ 0.52 0.48 ] = (0.592  0.408).
248
Suppose we could find θ such that θP = θ. Then starting
distribution θ would be stationary: (marginal) prob of sunny day
same for all days.
Can try directly for weather example:
(θ1  θ2) P = (0.7θ1 + 0.4θ2   0.3θ1 + 0.6θ2) = (θ1  θ2).
2 equations in 2 unknowns, collapse into one equation
0.3θ1 − 0.4θ2 = 0, but θi are probs so that θ1 + θ2 = 1 also.
Solve: θ1 = 4/7, θ2 = 3/7.
More generally: solve θP = θ by transposing both sides to get
P T θT = θT . Like solution to Av = λv with λ = 1: stationary
prob θ is eigenvector of P T with eigenvalue 1.
249
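A numpy sketch of finding the stationary distribution as an eigenvector of P-transpose with eigenvalue 1 (scaling so the entries sum to 1, as advised below):

import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

vals, vecs = np.linalg.eig(P.T)
i = np.argmin(np.abs(vals - 1))    # eigenvalue closest to 1
theta = np.real(vecs[:, i])
theta = theta / theta.sum()        # scale so the probabilities sum to 1
print(theta)                       # [0.5714 0.4286] = (4/7, 3/7)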
Can use Minitab to get eigenvalues/vectors (manual p. 167). Usually
need to scale eigenvector to get probs summing to 1.
Ehrenfest urn example: 5 eigenvectors; one with eigenvalue 1 is
(0.120, 0.478, 0.717, 0.478, 0.120), scaling to (1/16, 4/16, 6/16, 4/16, 1/16).
(Actually binomial probs: see text p. 595).
250
Limiting distributions
If initial state chosen from stationary distribution, then prob of each
state remains same for all time.
Also: if watch Markov chain for many steps, should not matter much
which state we began in.
Weather example: 8-step transition matrix is
P⁸ = [ 0.57146 0.42854 ]   [ 4/7 3/7 ]
     [ 0.57139 0.42861 ] ≈ [ 4/7 3/7 ]
Starting either from sunny or rainy day, chance of sunny day in 8
days’ time is about 4/7. Same as stationary distribution.
251
Compare Ehrenfest urn example:
P⁸ ≈ [ 0.125  0     0.75   0     0.125
       0      0.5   0      0.5   0
       0.125  0     0.75   0     0.125
       0      0.5   0      0.5   0
       0.125  0     0.75   0     0.125 ]
not getting stationary distribution in each row.
Problem here: number of balls in urn 1 always goes from odd to
even or vice versa. So eg. P (1 ball in urn 1 after n steps)
alternates between 0 and positive; cannot have limit. Chain called
periodic.
252
Consider a third example:
P = [ 0.5   0.5   0
      0.75  0.25  0
      0     0     1 ].
Has two eigenvectors for eigenvalue 1: (0.6, 0.4, 0) and (0, 0, 1).
Note: start in state 1 or 2, can never reach state 3. Start in state 3,
can never reach states 1 or 2.
Such chain called reducible: can split up into two chains, {1, 2} and {3}, and treat each separately.
253
Markov chain limit theorem
Previous work suggests following theorem:
Suppose a Markov chain has a stationary distribution, is not
reducible, and is not periodic. Then its stationary distribution also
gives the probability, as n →∞, of being in any particular state
after n steps.
In effect, the stationary distribution gives approx to long-term
behaviour of chain.
254
... that’s all, folks!
255