Quick Tour of Basic Probability Theory
CS224W: Social and Information Network Analysis, Fall 2012
Outline
Today’s goal: A gentle refresher on probability
You should have seen this before
- Basic definitions
- Random variables
- Maximum likelihood estimation
Fundamentals of Probability
- Sample space Ω: the set of all possible outcomes
- Event space $\mathcal{F} = 2^{\Omega}$ (an event is a subset of the sample space)
- Probability measure: a function $P : \mathcal{F} \to \mathbb{R}$ such that:
  - $P(A) \ge 0$ for all $A \in \mathcal{F}$
  - $P(\Omega) = 1$
  - For pairwise disjoint events $A_i$, $P(\bigcup_i A_i) = \sum_i P(A_i)$
In this session, I’ll focus mostly on the discrete case (things are basically the same in the continuous case).
Example
Consider throwing a die twice:
- Sample space $\Omega = \{1, 2, 3, 4, 5, 6\} \times \{1, 2, 3, 4, 5, 6\}$
- Event space $\mathcal{F} = 2^{\Omega}$ (example events: let A be the event that the sum is even and let B be the event that we roll at least one 6).
- Probability measure: for a fair die all 36 outcomes are equally likely, so P is simple counting in this simple discrete case. Example: P(A) = 18/36 = 1/2, P(B) = 11/36.
Note that multiple events can happen simultaneously, e.g. if we roll a 6 then a 2, the outcome is (6, 2), and both A and B have occurred.
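The counting argument can be checked by brute force. This is a minimal sketch, assuming a fair die; the helper names `prob`, `is_a` and `is_b` are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs of faces from two throws.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a predicate on outcomes) by counting."""
    favorable = sum(1 for outcome in omega if event(outcome))
    return Fraction(favorable, len(omega))

def is_a(outcome):
    # A: the sum of the two throws is even
    return sum(outcome) % 2 == 0

def is_b(outcome):
    # B: at least one throw shows a 6
    return 6 in outcome

p_a = prob(is_a)
p_b = prob(is_b)
print(p_a, p_b)
```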
Union
For any two events A and B, the probability of their union (“A or B”) is:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
e.g. for the two dice events A and B, P(A ∩ B) = 5/36, so P(A ∪ B) = 18/36 + 11/36 − 5/36 = 24/36 = 2/3
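A quick check of the union formula on the dice example, sketched with Python sets (events are stored as sets of outcomes; the names are illustrative only):

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))
A = {o for o in omega if sum(o) % 2 == 0}  # sum is even
B = {o for o in omega if 6 in o}           # at least one 6

def prob(event):
    """Probability of a set of equally likely outcomes."""
    return Fraction(len(event), len(omega))

p_union_direct = prob(A | B)                        # count the union directly
p_union_formula = prob(A) + prob(B) - prob(A & B)   # inclusion-exclusion
print(p_union_direct, p_union_formula)
```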
Conditional probability
Let A and B be two events with P(B) > 0. Then the conditional probability of A given B is:
P(A|B) = P(A ∩ B) / P(B)
“What’s the probability of A once we know B has happened?”
Rewriting gives us the useful product rule:
P(A ∩ B) = P(A|B)P(B)
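For the dice events used earlier (A: even sum, B: at least one 6), conditioning amounts to restricting the count to outcomes in B. A small sketch, again assuming a fair die:

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))
A = {o for o in omega if sum(o) % 2 == 0}  # sum is even
B = {o for o in omega if 6 in o}           # at least one 6

def prob(event):
    return Fraction(len(event), len(omega))

# Definition: P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

# Product rule: P(A ∩ B) = P(A|B) P(B)
product_rule_holds = prob(A & B) == p_a_given_b * prob(B)
print(p_a_given_b, product_rule_holds)
```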
Independence
Two events A and B are independent if
P(A ∩ B) = P(A)P(B)
Equivalently (when P(A) > 0 and P(B) > 0): P(A|B) = P(A) and P(B|A) = P(B)
Intuitively, knowing A doesn’t tell you anything about B and vice versa.
But beware of relying on your intuition: rolling two dice (x_a and x_b), the events x_a = 2 and x_a + x_b = k are independent if k = 7 and dependent for every other possible sum k.
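The claim about k = 7 is easy to verify exhaustively. A sketch that tests the definition P(A ∩ B) = P(A)P(B) for every possible sum (the function name `independent` is illustrative):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for o in omega if event(o)), len(omega))

def independent(k):
    """True if the events {x_a = 2} and {x_a + x_b = k} are independent."""
    p_first = prob(lambda o: o[0] == 2)
    p_sum = prob(lambda o: o[0] + o[1] == k)
    p_both = prob(lambda o: o[0] == 2 and o[0] + o[1] == k)
    return p_both == p_first * p_sum

independent_sums = [k for k in range(2, 13) if independent(k)]
print(independent_sums)
```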
Union bound
Recall that P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any two events A and B.
If we’re trying to upper bound the probability that A or B happens, the worst case is that A and B are disjoint (so P(A ∩ B) = 0).
The surprisingly useful union bound now follows. Let $A_i$ be some (not necessarily independent!) events, then:
$P\left(\bigcup_i A_i\right) \le \sum_i P(A_i)$
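A small numeric illustration of the union bound on two dice. The three overlapping events below are chosen arbitrarily for this sketch and do not come from the slides.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))

# Three overlapping (and not independent) events, picked for illustration.
events = [
    {o for o in omega if o[0] == 6},     # first throw is a 6
    {o for o in omega if o[1] == 6},     # second throw is a 6
    {o for o in omega if sum(o) >= 10},  # sum is at least 10
]

def prob(event):
    return Fraction(len(event), len(omega))

p_union = prob(set().union(*events))      # exact probability of the union
bound = sum(prob(e) for e in events)      # union bound: sum of probabilities
print(p_union, bound)
```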
Bayes’ Rule
Most important basic rule of probability!
For two events A and B (such that P(B) ≠ 0):
P(A|B) = P(B|A) P(A) / P(B)
Often used to update beliefs:
posterior = “support B provides for A” × “prior”, i.e. P(A|B) = [P(B|A) / P(B)] × P(A)
Bayes’ Rule Example
Your friend told you she had a great conversation with someone on the Caltrain. Not knowing anything else, your prior belief that her conversation partner was a woman is 50%. Let W denote this event. Let L denote the event that her conversation partner has long hair. If you learn L to be true, how should you update your beliefs about W?
Suppose P(W) = 0.5, P(L) = 0.6, and P(L|W) = 0.75 are known.
Then P(W|L) = P(L|W) P(W) / P(L) = 0.75 × 0.5 / 0.6 = 0.625 = 62.5%.
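The update is a one-line computation. A minimal sketch using exact fractions; the function name `posterior` is made up for this example.

```python
from fractions import Fraction

def posterior(likelihood, prior, evidence):
    """Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)."""
    return likelihood * prior / evidence

p_w = Fraction(1, 2)          # prior P(W) = 0.5
p_l = Fraction(3, 5)          # P(L) = 0.6
p_l_given_w = Fraction(3, 4)  # P(L|W) = 0.75

p_w_given_l = posterior(p_l_given_w, p_w, p_l)
print(p_w_given_l, float(p_w_given_l))
```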
Random Variables
A random variable is technically a function X : Ω → R
Probabilities of random variable events come from the underlying P function: P(X = k) = P({ω ∈ Ω : X(ω) = k})
It’s called a random variable because it’s a variable that doesn’t take on a single, deterministic value, but it can take on a set of different values, each with an associated probability.
e.g. Let X be a random variable that counts the number of 6’s we roll in 2 rolls of a die.
P(X = 2) = P({(6, 6)}) = 1/36
P(X = 1) = P({(1, 6)}) + . . . + P({(5, 6)}) + P({(6, 1)}) + . . . + P({(6, 5)}) = 10/36
P(X = 0) = 25/36
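Treating X literally as a function on outcomes, its distribution can be tabulated by counting. A sketch, assuming a fair die:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))

def X(outcome):
    """Random variable: number of 6's in the two rolls."""
    return sum(1 for face in outcome if face == 6)

# P(X = k) is the probability of the set of outcomes that X maps to k.
counts = Counter(X(o) for o in omega)
pmf_x = {k: Fraction(c, len(omega)) for k, c in sorted(counts.items())}
print(pmf_x)
```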
Distributions
A probability mass function (pmf) assigns a probability to eachpossible value of a random variable (in the discrete case)
Example: funny die [figure omitted]
Another example: distribution over the sum of two die rolls [figure omitted]
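The distribution over the sum of two die rolls can be recomputed directly. A sketch assuming two fair dice, with a rough text histogram in place of a plot:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely outcomes give each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf_sum = {s: Fraction(c, 36) for s, c in sorted(counts.items())}

for s, p in pmf_sum.items():
    # One '#' per outcome that produces this sum.
    print(f"{s:2d} {'#' * counts[s]} {p}")
```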
Probability density functions
The PDF $f$ of a continuous random variable X describes the relative likelihood for X to take on a given value:
$P[a \le X \le b] = \int_a^b f(x)\,dx$
Cumulative Distributions
The CDF of a random variable X is:
F (x) = P(X ≤ x)
Properties of Distribution Functions
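- CDF (cumulative distribution function):
  - $0 \le F_X(x) \le 1$
  - $F_X$ is monotone non-decreasing, with $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$
- pmf:
  - $0 \le p_X(x) \le 1$
  - $\sum_x p_X(x) = 1$
  - $\sum_{x \in A} p_X(x) = P(X \in A)$
- pdf:
  - $f_X(x) \ge 0$
  - $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
  - $\int_{x \in A} f_X(x)\,dx = P(X \in A)$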
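The pmf and CDF properties can be sanity-checked on a concrete discrete example. This sketch uses the sum of two fair dice; the helper `cdf` is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {s: Fraction(c, 36) for s, c in sorted(counts.items())}

def cdf(x):
    """F(x) = P(X <= x), obtained by summing the pmf."""
    return sum((p for s, p in pmf.items() if s <= x), Fraction(0))

# Evaluate the CDF on a range that covers the whole support (2..12).
cdf_values = [cdf(x) for x in range(1, 14)]
print(sum(pmf.values()), cdf(7), cdf_values[0], cdf_values[-1])
```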
Some Common Random Variables
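- X ∼ Bernoulli(p) (0 ≤ p ≤ 1): $p_X(x) = \begin{cases} p & x = 1, \\ 1 - p & x = 0. \end{cases}$
- X ∼ Geometric(p) (0 < p ≤ 1): $p_X(x) = p(1 - p)^{x - 1}$ for x = 1, 2, . . .
- X ∼ Uniform(a, b) (a < b): $f_X(x) = \begin{cases} \frac{1}{b - a} & a \le x \le b, \\ 0 & \text{otherwise.} \end{cases}$
- X ∼ Normal(µ, σ²): $f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$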
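The four formulas above translate almost literally into code. A plain-Python sketch; the function names are made up here, and the Normal is parametrized by the standard deviation `sigma` rather than the variance.

```python
import math

def bernoulli_pmf(x, p):
    """p if x = 1, 1 - p if x = 0, and 0 elsewhere."""
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return 0.0

def geometric_pmf(x, p):
    """p (1 - p)^(x - 1) for integer x >= 1."""
    return p * (1 - p) ** (x - 1) if x >= 1 else 0.0

def uniform_pdf(x, a, b):
    """1 / (b - a) on [a, b], 0 otherwise."""
    return 1 / (b - a) if a <= x <= b else 0.0

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma^2) at x."""
    z = (x - mu) ** 2 / (2 * sigma ** 2)
    return math.exp(-z) / (math.sqrt(2 * math.pi) * sigma)

print(geometric_pmf(3, 0.5), uniform_pdf(0.5, 0, 2), normal_pdf(0, 0, 1))
```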
Expectation and Variance
- If the discrete random variable X has pmf p(x), then the expectation is $E[X] = \sum_x x \cdot p(x)$
- Continuous case is similar: $E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx$
- Expectation is linear:
  - for any constant a ∈ R, E[a] = a
  - E[a · g(X) + b · h(X)] = a E[g(X)] + b E[h(X)]
- $Var[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2$
- Variance is not linear (e.g. $Var[aX + b] = a^2\,Var[X]$)
Example: expectation of rolling a die once: 1 · 1/6 + 2 · 1/6 + 3 · 1/6 + 4 · 1/6 + 5 · 1/6 + 6 · 1/6 = 3.5
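The die example extends naturally to the variance. A sketch with exact fractions, assuming a fair die; it also shows that doubling X quadruples the variance, which is why variance is not linear.

```python
from fractions import Fraction

pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(g):
    """E[g(X)] = sum over x of g(x) p(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(lambda x: x)
variance = expectation(lambda x: x * x) - mean ** 2  # E[X^2] - E[X]^2

# Variance of 2X, computed from the same definition.
variance_2x = expectation(lambda x: (2 * x) ** 2) - expectation(lambda x: 2 * x) ** 2
print(mean, variance, variance_2x)
```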
Indicator variables
An indicator variable just indicates whether an event occurs or not:
$I_A = \begin{cases} 1 & \text{if } A \text{ occurs,} \\ 0 & \text{otherwise.} \end{cases}$
They have a very useful property:
$E[I_A] = 1 \cdot P(I_A = 1) + 0 \cdot P(I_A = 0) = P(I_A = 1) = P(A)$
Method of indicators
Goal: find expected number of successes out of N trials
Method: define an indicator (Bernoulli) random variable for each trial, find the expected value of the sum
Example: N professors are at dinner and each takes a random coat when they leave. Expected number of profs with the right coat?
Let G be the number of profs who get the right coat, and let $G_i$ be an indicator for the event that professor i gets their own coat. Then
$G = G_1 + G_2 + \ldots + G_N$
These events are not independent!
But linearity of expectation saves us. Each professor is equally likely to end up with any of the N coats, so $E[G_i] = P(G_i = 1) = 1/N$, and:
$E[G] = E[G_1 + G_2 + \ldots + G_N]$
$= E[G_1] + E[G_2] + \ldots + E[G_N]$
$= 1/N + 1/N + \ldots + 1/N = 1$
Remember: linearity of expectation does not assume independence!
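For small N the coat example can be checked exhaustively. This sketch assumes every assignment of coats (every permutation) is equally likely; `expected_matches` is an illustrative name.

```python
from fractions import Fraction
from itertools import permutations

def expected_matches(n_profs):
    """Exact E[G]: average number of professors with their own coat."""
    perms = list(permutations(range(n_profs)))
    # perm[i] is the coat taken by professor i; a match is perm[i] == i.
    total = sum(
        sum(1 for i, coat in enumerate(perm) if i == coat) for perm in perms
    )
    return Fraction(total, len(perms))

results = {n: expected_matches(n) for n in range(1, 7)}
print(results)
```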
Some Useful Inequalities
- Markov’s Inequality: let X be a non-negative random variable and a > 0. Then:
  $P(X \ge a) \le \frac{E[X]}{a}$
  Example: back to the professors and their coats. We know that E[G] = 1, so applying Markov’s Inequality gives us:
  $P(G \ge a) \le \frac{1}{a}$
  Plugging in a = 5, we get that the chance that at least 5 professors get the right coats is no higher than 20% (regardless of N).
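- Chernoff bound: let $X_1, \ldots, X_n$ be independent Bernoulli random variables with $P(X_i = 1) = p_i$. Denoting $\mu = E[\sum_{i=1}^{n} X_i] = \sum_{i=1}^{n} p_i$,
  $P\left(\sum_{i=1}^{n} X_i \ge (1 + \delta)\mu\right) \le \left(\frac{e^{\delta}}{(1 + \delta)^{1 + \delta}}\right)^{\mu}$
  for any δ > 0. Multiple variants of Chernoff-type bounds exist, which can be useful in different settings.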
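Both inequalities can be compared against exact tail probabilities on small cases. In this sketch the concrete numbers (N = 6 professors, n = 20 fair coins, δ = 0.5) are hypothetical choices for illustration, not values from the slides.

```python
from fractions import Fraction
from itertools import permutations
from math import comb, exp

# Markov: exact P(G >= a) for the coat example versus the bound E[G]/a = 1/a.
def coat_tail(n_profs, a):
    perms = list(permutations(range(n_profs)))
    hits = sum(
        1 for perm in perms
        if sum(1 for i, coat in enumerate(perm) if i == coat) >= a
    )
    return Fraction(hits, len(perms))

markov_exact = coat_tail(6, 5)  # N = 6, a = 5
markov_bound = Fraction(1, 5)   # E[G] / a with E[G] = 1

# Chernoff: sum of n independent Bernoulli(p) variables, all with the same p.
n, p, delta = 20, 0.5, 0.5
mu = n * p
threshold = (1 + delta) * mu
chernoff_exact = sum(
    comb(n, k) * p ** k * (1 - p) ** (n - k)
    for k in range(n + 1) if k >= threshold
)
chernoff_bound = (exp(delta) / (1 + delta) ** (1 + delta)) ** mu

print(markov_exact, markov_bound)
print(chernoff_exact, chernoff_bound)
```

In both cases the bound holds but is far from tight, which is typical: these inequalities trade precision for generality.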
Parameter Estimation: Maximum Likelihood
- Say we have a parametrized distribution $f_X(x; \theta)$ and we don’t know the parameter(s) θ.
- IID samples $x_1, \ldots, x_n$ observed.
- Goal: Estimate θ
- The maximum likelihood estimator (MLE) is the value of θ that maximizes the likelihood of observing the data you observed.
MLE Example
Say you flip a coin with unknown bias θ of landing heads n times and get $n_H$ heads and $n_T$ tails. What’s the maximum likelihood estimate of the coin’s bias?
The likelihood of observing the data given a particular θ is $P(D|\theta) = \theta^{n_H}(1 - \theta)^{n_T}$.
Take logs: $\log P(D|\theta) = n_H \log(\theta) + n_T \log(1 - \theta)$.
MLE Example continued
Take the derivative and set to 0:
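$\frac{d}{d\theta} \log P(D|\theta) = 0$
$\frac{d}{d\theta} \left[ n_H \log(\theta) + n_T \log(1 - \theta) \right] = 0$
$\frac{n_H}{\theta} - \frac{n_T}{1 - \theta} = 0$
$\theta = \frac{n_H}{n_H + n_T}$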
Sometimes it is not possible to find the optimal estimate in closed form, in which case iterative methods must be used.
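The closed-form answer can be cross-checked numerically. This sketch uses a plain grid search over θ (a crude stand-in for a proper iterative optimizer) on hypothetical data of 7 heads and 3 tails.

```python
import math

def log_likelihood(theta, n_heads, n_tails):
    """log P(D | theta) = n_H log(theta) + n_T log(1 - theta)."""
    return n_heads * math.log(theta) + n_tails * math.log(1 - theta)

n_heads, n_tails = 7, 3  # hypothetical observations

# Evaluate the log-likelihood on a grid in (0, 1) and keep the best theta.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, n_heads, n_tails))

# Closed-form MLE from the derivation: n_H / (n_H + n_T).
theta_closed = n_heads / (n_heads + n_tails)
print(theta_grid, theta_closed)
```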
Interesting limits
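- $\lim_{n \to \infty} \left(1 + \frac{k}{n}\right)^n = e^k$
- $n! \sim \sqrt{2\pi n} \left(\frac{n}{e}\right)^n$ as n → ∞ (Stirling’s approximation; the right-hand side is also a lower bound on n!)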
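Both limits are easy to observe numerically. A sketch with arbitrarily chosen values of k and n:

```python
import math

# (1 + k/n)^n approaches e^k as n grows.
k = 2
approx_exp = (1 + k / 1_000_000) ** 1_000_000
exp_error = abs(approx_exp - math.exp(k))

def stirling(n):
    """Stirling's approximation sqrt(2 pi n) (n / e)^n."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

# The ratio stirling(n) / n! stays below 1 and tends to 1.
ratios = {n: stirling(n) / math.factorial(n) for n in (1, 10, 100)}
print(exp_error, ratios)
```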