Probability Basics for Machine Learning
CSC411 Shenlong Wang*
Friday, January 15, 2015
*Based on many others’ slides and resources from Wikipedia
Outline
• Motivation
• Notation, definitions, laws
• Exponential family distributions
  – E.g. Normal distribution
• Parameter estimation
• Conjugate prior
Why Represent Uncertainty?
• The world is full of uncertainty
  – “Is there a person in this image?”
  – “What will the weather be like today?”
  – “Will I like this movie?”
• We’re trying to build systems that understand and (possibly) interact with the real world
• We often can’t prove something is true, but we can still ask how likely different outcomes are or ask for the most likely explanation
Why Use Probability to Represent Uncertainty?
• Write down simple, reasonable criteria that you'd want from a system of uncertainty (common sense stuff), and you always get probability:
  – “Probability theory is nothing but common sense reduced to calculation.” — Pierre Laplace, 1812
• Cox Axioms (Cox 1946); see Bishop, Section 1.2.3
• We will restrict ourselves to a relatively informal discussion of probability theory.
Notation
• A random variable X represents outcomes or states of the world.
• We will write p(x) to mean Probability(X = x)
• Sample space: the space of all possible outcomes (may be discrete, continuous, or mixed)
• p(x) is the probability mass (density) function
  – Assigns a number to each point in sample space
  – Non-negative, sums (integrates) to 1
  – Intuitively: how often does x occur, how much do we believe in x.
Joint Probability Distribution
• Prob(X=x, Y=y)
  – “Probability of X=x and Y=y”
  – p(x, y)
Conditional Probability Distribution
• Prob(X=x|Y=y)
  – “Probability of X=x given Y=y”
  – p(x|y) = p(x,y)/p(y)
The Rules of Probability
• Sum Rule (marginalization/summing out):

  p(x) = Σ_y p(x, y)

  p(x1) = Σ_{x2} Σ_{x3} … Σ_{xN} p(x1, x2, …, xN)

• Product/Chain Rule:

  p(x, y) = p(y|x) p(x)

  p(x1, …, xN) = p(x1) p(x2|x1) … p(xN|x1, …, x_{N−1})
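The sum and product rules can be sanity-checked numerically; the joint table below is made up for illustration:

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index x in {0,1}, columns y in {0,1,2}
p_xy = np.array([[0.1, 0.2, 0.1],
                 [0.3, 0.2, 0.1]])

# Sum rule: p(x) = sum over y of p(x, y)
p_x = p_xy.sum(axis=1)

# Product rule: p(x, y) = p(y|x) p(x)
p_y_given_x = p_xy / p_x[:, None]
print(p_x)                                            # [0.4 0.6]
print(np.allclose(p_y_given_x * p_x[:, None], p_xy))  # True
```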
Bayes’ Rule
• One of the most important formulas in probability theory
• This gives us a way of “reversing” conditional probabilities
  p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / Σ_{x'} p(y|x') p(x')
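A small numerical illustration of reversing a conditional with Bayes’ rule; the prior and likelihood numbers here are invented for the sketch:

```python
# Prior over two hypotheses, and likelihood of observing y=1 under each
p_x = {0: 0.7, 1: 0.3}
p_y1_given_x = {0: 0.1, 1: 0.8}

# Bayes' rule: p(x|y=1) = p(y=1|x) p(x) / sum over x' of p(y=1|x') p(x')
evidence = sum(p_y1_given_x[x] * p_x[x] for x in p_x)
posterior = {x: p_y1_given_x[x] * p_x[x] / evidence for x in p_x}

print(round(evidence, 4))       # 0.31
print(round(posterior[1], 4))   # 0.7742 -- the observation shifts belief toward hypothesis 1
```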
Independence
• Two random variables are said to be independent iff their joint distribution factors
• Two random variables are conditionally independent given a third if they are independent after conditioning on the third
  X ⊥ Y ⇔ p(x, y) = p(y|x) p(x) = p(x|y) p(y) = p(x) p(y)

  X ⊥ Y | Z ⇔ p(x, y|z) = p(y|x, z) p(x|z) = p(y|z) p(x|z)  ∀z
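A quick check of the factorization (not from the slides): a joint built as an outer product of marginals is independent by construction, and recovering the marginals confirms p(x, y) = p(x) p(y):

```python
import numpy as np

# Construct an independent joint: p(x, y) = p(x) p(y)
p_x = np.array([0.2, 0.8])
p_y = np.array([0.5, 0.3, 0.2])
joint = np.outer(p_x, p_y)

# Recover the marginals and test the factorization
marg_x = joint.sum(axis=1)
marg_y = joint.sum(axis=0)
print(np.allclose(joint, np.outer(marg_x, marg_y)))  # True
```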
Continuous Random Variables
• Outcomes are real values. Probability density functions define distributions.
  – E.g.,
• Continuous joint distributions: replace sums with integrals, and everything holds
  – E.g., marginalization and conditional probability
  P(x|µ, σ) = 1/(√(2π) σ) exp{−(x − µ)²/(2σ²)}

  P(x, z) = ∫_y P(x, y, z) dy = ∫_y P(x, z|y) P(y) dy
Summarizing Probability Distributions
• It is often useful to give summaries of distributions without defining the whole distribution (e.g., mean and variance)
• Mean: x̄ = E[x] = ∫ x · p(x) dx
• Variance: var(x) = ∫ (x − E[x])² · p(x) dx = E[x²] − E[x]²
• Nth moment (about c): µ_n = ∫ (x − c)^n · p(x) dx
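The identity var(x) = E[x²] − E[x]² can be verified by numerical integration; the Gaussian parameters below are arbitrary choices for the sketch:

```python
import numpy as np

# Gaussian with mean 1 and variance 4, evaluated on a fine grid
mu, sigma = 1.0, 2.0
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = x[1] - x[0]
p = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

mean = (x * p).sum() * dx                 # E[x]
second = (x ** 2 * p).sum() * dx          # E[x^2]
var = ((x - mean) ** 2 * p).sum() * dx    # var(x), computed directly

print(round(mean, 4), round(var, 4))          # 1.0 4.0
print(abs(var - (second - mean ** 2)) < 1e-6)  # True: var = E[x^2] - E[x]^2
```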
Exponential Family
• Family of probability distributions
• Many of the standard distributions belong to this family
  – Bernoulli, binomial/multinomial, Poisson, Normal (Gaussian), beta/Dirichlet, …
• Share many important properties
  – e.g. they have a conjugate prior (we’ll get to that later; important for Bayesian statistics)
• First – let’s see some examples
Definition
• The exponential family of distributions over x, given parameter η (eta), is the set of distributions of the form

  p(x|η) = h(x) g(η) exp{η^T u(x)}

• x – scalar/vector, discrete/continuous
• η – ‘natural parameters’
• u(x) – some function of x (sufficient statistic)
• g(η) – normalizer:

  g(η) ∫ h(x) exp{η^T u(x)} dx = 1
Example 1: Bernoulli
• Binary random variable: X ∈ {0,1}
• Coin toss: p(heads) = µ, with µ ∈ [0,1]

  p(x|µ) = µ^x (1 − µ)^(1−x)
Example 1: Bernoulli
  p(x|µ) = µ^x (1 − µ)^(1−x)
         = exp{x ln µ + (1 − x) ln(1 − µ)}
         = (1 − µ) exp{ln(µ/(1 − µ)) x}

Matching the form p(x|η) = h(x) g(η) exp{η^T u(x)}:

  h(x) = 1
  u(x) = x
  η = ln(µ/(1 − µ))  ⇒  µ = σ(η) = 1/(1 + e^(−η))
  g(η) = σ(−η)

so that

  p(x|η) = σ(−η) exp(ηx)
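The standard and natural-parameter forms of the Bernoulli can be compared numerically; µ = 0.3 is an arbitrary choice:

```python
import math

mu = 0.3
eta = math.log(mu / (1 - mu))                # natural parameter ln(mu/(1-mu))
sigmoid = lambda a: 1 / (1 + math.exp(-a))

for x in (0, 1):
    standard = mu ** x * (1 - mu) ** (1 - x)       # mu^x (1-mu)^(1-x)
    exp_form = sigmoid(-eta) * math.exp(eta * x)   # sigma(-eta) exp(eta x)
    print(x, abs(standard - exp_form) < 1e-12)     # True for both values of x
```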
Example 2: Multinomial
• p(value k) = µk
• For a single observation – die toss
  – Sometimes called Categorical
• For multiple observations – integer counts on N trials
  – Prob(1 came out 3 times, 2 came out once, …, 6 came out 7 times if I tossed a die 20 times)
  µk ∈ [0,1],  Σ_{k=1..M} µk = 1

  P(x1, …, xM|µ) = (N! / Π_k xk!) Π_{k=1..M} µk^xk,  with Σ_{k=1..M} xk = N
Example 2: Multinomial (1 observation)

  P(x1, …, xM|µ) = Π_{k=1..M} µk^xk = exp{Σ_{k=1..M} xk ln µk}

Matching p(x|η) = h(x) g(η) exp{η^T u(x)}:

  h(x) = 1,  u(x) = x

  p(x|η) = exp(η^T x)

• Parameters are not independent due to the constraint of summing to 1; there is a slightly more involved notation to address that, see Bishop 2.4
Example 3: Normal (Gaussian) Distribution
• Gaussian (Normal)

  p(x|µ, σ) = 1/(√(2π) σ) exp{−(x − µ)²/(2σ²)}
Example 3: Normal (Gaussian) Distribution
• µ is the mean
• σ² is the variance
• Can verify these by computing integrals. E.g.,

  ∫_{−∞}^{∞} x · 1/(√(2π) σ) exp{−(x − µ)²/(2σ²)} dx = µ
Example 3: Normal (Gaussian) Distribution
• Multivariate Gaussian

  P(x|µ, Σ) = |2πΣ|^(−1/2) exp{−(1/2)(x − µ)^T Σ^(−1) (x − µ)}
Example 3: Normal (Gaussian) Distribution
• Multivariate Gaussian
• x is now a vector
• µ is the mean vector
• Σ is the covariance matrix

  p(x|µ, Σ) = |2πΣ|^(−1/2) exp{−(1/2)(x − µ)^T Σ^(−1) (x − µ)}
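The multivariate density can be evaluated directly with NumPy; this is a sketch for illustration, not the course’s reference implementation:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    # |2*pi*Sigma|^(-1/2) exp{-1/2 (x-mu)^T Sigma^{-1} (x-mu)}
    d = x - mu
    return (np.linalg.det(2 * np.pi * Sigma) ** -0.5
            * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)))

mu = np.zeros(2)
Sigma = np.eye(2)
print(mvn_pdf(mu, mu, Sigma))   # value at the mode: 1/(2*pi) ~ 0.1592
```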
Important Properties of Gaussians
• All marginals of a Gaussian are again Gaussian
• Any conditional of a Gaussian is Gaussian
• The product of two Gaussians is again Gaussian
• Even the sum of two independent Gaussian RVs is a Gaussian.
Exponential Family Representation

  p(x|µ, σ²) = 1/√(2πσ²) exp{−(x − µ)²/(2σ²)}
             = (2π)^(−1/2) · (1/σ) exp{−µ²/(2σ²)} · exp{(µ/σ²) x − (1/(2σ²)) x²}
             = h(x) g(η) exp{η^T u(x)}

with

  h(x) = (2π)^(−1/2)
  η = [µ/σ², −1/(2σ²)]^T
  u(x) = [x, x²]^T
  g(η) = (−2η2)^(1/2) exp(η1²/(4η2))
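This parameterization can be checked numerically; the values of µ, σ², and x below are arbitrary:

```python
import numpy as np

mu, sigma2, x = 1.5, 0.8, 0.3
eta1, eta2 = mu / sigma2, -1 / (2 * sigma2)               # natural parameters
h = (2 * np.pi) ** -0.5                                    # h(x)
g = (-2 * eta2) ** 0.5 * np.exp(eta1 ** 2 / (4 * eta2))    # g(eta)

exp_form = h * g * np.exp(eta1 * x + eta2 * x ** 2)
standard = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
print(np.isclose(exp_form, standard))   # True
```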
Example: Maximum Likelihood For a 1D Gaussian
• Suppose we are given a data set of samples of a Gaussian random variable X, D={x1, …, xN}, and told that the variance of the data is σ²
What is our best guess of µ?
*Need to assume the data is independent and identically distributed (i.i.d.)
[Figure: observed samples x1, x2, …, xN on the real line]
Example: Maximum Likelihood For a 1D Gaussian
What is our best guess of µ?
• We can write down the likelihood function:
• We want to choose the µ that maximizes this expression
  – Take the log, then basic calculus: differentiate w.r.t. µ, set the derivative to 0, solve for µ to get the sample mean
  p(d|µ) = Π_{i=1..N} p(xi|µ, σ) = Π_{i=1..N} 1/(√(2π) σ) exp{−(xi − µ)²/(2σ²)}

  µ_ML = (1/N) Σ_{i=1..N} xi
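A quick simulation (synthetic data, seed chosen arbitrarily) showing that the ML estimate is just the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma = 2.0, 1.0
data = rng.normal(true_mu, sigma, size=1000)   # i.i.d. samples, known variance

mu_ml = data.mean()          # mu_ML = (1/N) sum_i x_i
print(round(mu_ml, 3))       # close to the true mean 2.0 for N=1000
```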
Example: Maximum Likelihood For a 1D Gaussian
[Figure: samples x1, x2, …, xN with the fitted Gaussian N(µ_ML, σ_ML)]
Maximum Likelihood
ML estimation of model parameters for the Exponential Family
  p(D|η) = p(x1, …, xN|η) = (Π_{n=1..N} h(xn)) g(η)^N exp{η^T Σ_{n=1..N} u(xn)}

Take ∂/∂η of ln p(D|η), set it to 0, and solve for ∇g(η):

  −∇ ln g(η_ML) = (1/N) Σ_{n=1..N} u(xn)
• Can in principle be solved to get an estimate for η.
• The solution for the ML estimator depends on the data only through the sum over u, which is therefore called the sufficient statistic.
• This is what we need to store in order to estimate the parameters.
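For the Bernoulli, u(x) = x, so the sufficient statistic is just the count of ones; the toy data below is made up:

```python
# Toy coin-flip data; only sum(data) needs to be stored to estimate mu
data = [1, 0, 1, 1, 0, 1, 0, 1]
suff_stat = sum(data)              # sum_n u(x_n): all the data tells us about mu
mu_hat = suff_stat / len(data)     # ML estimate recovered from the statistic alone

print(suff_stat, mu_hat)           # 5 0.625
```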
Bayesian Probabilities

  p(θ|d) = p(d|θ) p(θ) / p(d),  where  p(d) = ∫ p(d|θ) p(θ) dθ

• p(d|θ) is the likelihood function
• p(θ) is the prior probability of (or our prior belief over) θ
  – our beliefs over what models are likely or not before seeing any data
• p(d) is the normalization constant or partition function
• p(θ|d) is the posterior distribution
  – readjustment of our prior beliefs in the face of data
Example: Bayesian Inference For a 1D Gaussian
• Suppose we have a prior belief that the mean of some random variable X is µ0, and the variance of our belief is σ0²
• We are then given a data set of samples of X, d={x1, …, xN}, and somehow know that the variance of the data is σ²

What is the posterior distribution over (our belief about the value of) µ?
Example: Bayesian Inference For a 1D Gaussian
[Figure: observed samples x1, x2, …, xN]
Example: Bayesian Inference For a 1D Gaussian
[Figure: samples x1, x2, …, xN with the prior belief N(µ0, σ0)]
Example: Bayesian Inference For a 1D Gaussian
• Remember from earlier
  p(µ|d) = p(d|µ) p(µ) / p(d)

• p(d|µ) is the likelihood function:

  p(d|µ) = Π_{i=1..N} P(xi|µ, σ) = Π_{i=1..N} 1/(√(2π) σ) exp{−(xi − µ)²/(2σ²)}

• p(µ) is the prior probability of (or our prior belief over) µ:

  p(µ|µ0, σ0) = 1/(√(2π) σ0) exp{−(µ − µ0)²/(2σ0²)}
Example: Bayesian Inference For a 1D Gaussian
  p(µ|D) ∝ p(D|µ) p(µ) = Normal(µ|µN, σN)

where

  µN = σ²/(Nσ0² + σ²) · µ0 + Nσ0²/(Nσ0² + σ²) · µ_ML

  1/σN² = 1/σ0² + N/σ²
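Plugging made-up numbers into these posterior formulas (prior N(0, 1), known data variance 4):

```python
import numpy as np

mu0, sigma0_sq = 0.0, 1.0                    # prior mean and variance
sigma_sq = 4.0                               # known data variance
data = np.array([2.1, 1.9, 2.4, 1.6, 2.0])   # invented samples
N, mu_ml = len(data), data.mean()

w = N * sigma0_sq + sigma_sq
mu_N = (sigma_sq / w) * mu0 + (N * sigma0_sq / w) * mu_ml
sigma_N_sq = 1 / (1 / sigma0_sq + N / sigma_sq)

# Posterior mean lies between the prior mean (0) and mu_ML (2.0)
print(round(mu_ml, 4), round(mu_N, 4), round(sigma_N_sq, 4))   # 2.0 1.1111 0.4444
```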
Example: Bayesian Inference For a 1D Gaussian
[Figure: prior belief N(µ0, σ0) and maximum likelihood fit N(µ_ML, σ_ML) over the samples]
Example: Bayesian Inference For a 1D Gaussian
[Figure: posterior distribution N(µN, σN) lying between the prior belief and the maximum likelihood fit]
[Image from xkcd.com]
Conjugate Priors
• Notice in the Gaussian parameter estimation example that the functional form of the posterior was that of the prior (Gaussian)
• Priors that lead to that form are called ‘conjugate priors’
• For any member of the exponential family there exists a conjugate prior that can be written like
• Multiply by the likelihood to obtain a posterior (up to normalization) of the form
• Notice the addition to the sufficient statistic
• ν is the effective number of pseudo-observations.
  p(η|χ, ν) = f(χ, ν) g(η)^ν exp{ν η^T χ}

  p(η|D, χ, ν) ∝ g(η)^(ν+N) exp{η^T (Σ_{n=1..N} u(xn) + νχ)}
Conjugate Priors – Examples
• Beta for Bernoulli/binomial
• Dirichlet for categorical/multinomial
• Normal for mean of Normal
• And many more...
  – Conjugate Prior Table:
    • http://en.wikipedia.org/wiki/Conjugate_prior
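As a concrete instance of the first pairing, a Beta–Bernoulli update with made-up pseudo-counts and data:

```python
# Beta(a, b) prior on the Bernoulli parameter mu; posterior is Beta(a + heads, b + tails)
a, b = 2, 2                       # prior pseudo-counts (arbitrary choice)
flips = [1, 1, 0, 1, 0, 1, 1]     # observed coin flips
heads = sum(flips)
tails = len(flips) - heads

a_post, b_post = a + heads, b + tails       # conjugate update: just add counts
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(posterior_mean, 3))   # 7 4 0.636
```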