Intro to Probability
Slides from Professor Pan,Yan, SYSU
Probability Theory: Example of a random experiment
– We poll 60 users who are using one of two search engines and record the following:
[Figure: scatter plot in which each point corresponds to one of the 60 users. Horizontal axis: X, the number of "good hits" returned by the search engine (0 to 8). Vertical axis: Y, which of the two search engines was used.]
Probability Theory: Random variables
– X and Y are called random variables
– Each has its own sample space:
• S_X = {0,1,2,3,4,5,6,7,8}
• S_Y = {1,2}
Probability Theory: Probability
– P(X=i,Y=j) is the probability (relative frequency) of observing X = i and Y = j
– P(X,Y) refers to the whole table of probabilities
– Properties: 0 ≤ P ≤ 1, Σ P = 1
P(X=i,Y=j) (columns: X = 0..8; one row per value of Y):
X =       0     1     2     3     4     5     6     7     8
          3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60
          0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60
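As a rough illustration, the joint table can be built from raw counts. The sketch below assumes the counts out of 60 read off the table above; the names `counts` and `joint` are illustrative, and the code does not need to know which row belongs to which search engine.

```python
from fractions import Fraction

# Counts out of 60 users, one row per search engine, columns X = 0..8.
counts = [
    [3, 6, 8, 8, 5, 3, 1, 0, 0],
    [0, 0, 0, 1, 4, 5, 8, 6, 2],
]
total = sum(sum(row) for row in counts)  # number of users polled

# Joint probability P(X=i, Y=j) as a relative frequency.
joint = [[Fraction(c, total) for c in row] for row in counts]

# Check the two properties: every entry lies in [0, 1] and the table sums to 1.
all_in_range = all(0 <= p <= 1 for row in joint for p in row)
table_sum = sum(p for row in joint for p in row)
print(total, all_in_range, table_sum)
```

Exact fractions are used here so that the "sums to 1" check is not blurred by floating-point rounding.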
Probability Theory: Marginal probability
– P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y
[Figure: plots of the marginals P(X) and P(Y).]
– From the table: P(X=i) = Σ_j P(X=i,Y=j)
Note that Σ_i P(X=i) = 1 and Σ_j P(Y=j) = 1
P(X=i,Y=j) with its marginals (columns: X = 0..8; one row per value of Y):
X =       0     1     2     3     4     5     6     7     8     P(Y=j)
          3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60  34/60
          0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60  26/60
P(X=i)    3/60  6/60  8/60  9/60  9/60  8/60  9/60  6/60  2/60
SUM RULE
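The sum rule amounts to adding up rows or columns of the joint table. A minimal sketch, assuming the same counts as in the running example (the variable names are illustrative):

```python
from fractions import Fraction

# Joint counts out of 60 (rows: the two search engines, columns: X = 0..8).
counts = [
    [3, 6, 8, 8, 5, 3, 1, 0, 0],
    [0, 0, 0, 1, 4, 5, 8, 6, 2],
]
joint = [[Fraction(c, 60) for c in row] for row in counts]

# SUM RULE: sum over the variable that is being ignored.
p_x = [sum(col) for col in zip(*joint)]   # P(X=i) = sum_j P(X=i, Y=j)
p_y = [sum(row) for row in joint]         # P(Y=j) = sum_i P(X=i, Y=j)

print([str(p) for p in p_x])
print([str(p) for p in p_y])
```

Note that `Fraction` prints values in lowest terms, so 34/60 appears as 17/30.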
Probability Theory: Conditional probability
– P(X=i|Y=j) is the probability that X=i, given that Y=j
– From the table: P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j)
[Figure: P(X|Y=1) plotted over X = 0..8, together with P(Y=1).]
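The definition can be tried out on the running example. The sketch below is only an illustration: it assumes that the first row of the table is search engine Y = 1 and the second is Y = 2, which the transcript does not state, and the helper name `conditional_x_given_y` is made up for this example.

```python
from fractions import Fraction

# ASSUMPTION: the first table row is search engine Y = 1, the second is Y = 2.
counts = {
    1: [3, 6, 8, 8, 5, 3, 1, 0, 0],
    2: [0, 0, 0, 1, 4, 5, 8, 6, 2],
}
joint = {j: [Fraction(c, 60) for c in row] for j, row in counts.items()}

def conditional_x_given_y(j):
    """Return the list P(X=i | Y=j) for i = 0..8."""
    p_y = sum(joint[j])                  # marginal P(Y=j) by the sum rule
    return [p / p_y for p in joint[j]]   # P(X=i, Y=j) / P(Y=j)

p_x_given_y1 = conditional_x_given_y(1)
print([str(p) for p in p_x_given_y1])
```

Dividing by P(Y=j) renormalizes one row of the joint table so that it sums to 1 on its own.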
Probability Theory: Conditional probability
– How about the opposite conditional probability, P(Y=j|X=i)?
– P(Y=j|X=i) = P(X=i,Y=j) / P(X=i). Note that Σ_j P(Y=j|X=i) = 1
P(X=i,Y=j) (columns: X = 0..8; one row per value of Y):
X =       0     1     2     3     4     5     6     7     8
          3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60
          0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60
P(X=i)    3/60  6/60  8/60  9/60  9/60  8/60  9/60  6/60  2/60
P(Y=j|X=i) (same layout):
X =       0     1     2     3     4     5     6     7     8
          3/3   6/6   8/8   8/9   5/9   3/8   1/9   0/6   0/2
          0/3   0/6   0/8   1/9   4/9   5/8   8/9   6/6   2/2
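The table of P(Y=j|X=i) can be reproduced by dividing each column of the joint table by its column total. A small sketch, assuming the counts of the running example (names are illustrative):

```python
from fractions import Fraction

# Joint counts out of 60; rows are the two search engines, columns are X = 0..8.
counts = [
    [3, 6, 8, 8, 5, 3, 1, 0, 0],
    [0, 0, 0, 1, 4, 5, 8, 6, 2],
]
joint = [[Fraction(c, 60) for c in row] for row in counts]
p_x = [sum(col) for col in zip(*joint)]  # marginal P(X=i)

# P(Y=j | X=i) = P(X=i, Y=j) / P(X=i), stored as one column per value of X.
p_y_given_x = [[joint[j][i] / p_x[i] for j in range(2)] for i in range(9)]

for i, column in enumerate(p_y_given_x):
    print(i, [str(p) for p in column], "sum =", sum(column))
```

Every column sums to 1, which is the normalization noted on the slide. The division is safe here only because every P(X=i) in this example is nonzero.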
Summary of types of probability
• Joint probability: P(X,Y)
• Marginal probability (ignore other variable): P(X) and P(Y)
• Conditional probability (condition on the other variable having a certain value): P(X|Y) and P(Y|X)
Probability Theory: Constructing joint probability
– Suppose we know
• The probability that the user will pick each search engine, P(Y=j), and
• For each search engine, the probability of each number of good hits, P(X=i|Y=j)
– Can we construct the joint probability, P(X=i,Y=j)?
– Yes. Rearranging P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j), we get P(X=i,Y=j) = P(X=i|Y=j) P(Y=j)
PRODUCT RULE
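To see the product rule at work, one can start from P(Y=j) and P(X=i|Y=j) and rebuild the joint table. The sketch below takes its numbers from the running example; the pairing of rows with engines is an assumption, and the variable names are illustrative.

```python
from fractions import Fraction

# Known ingredients (values taken from the running example; the row order is
# an assumption, since the transcript does not say which engine is which).
p_y = [Fraction(34, 60), Fraction(26, 60)]             # P(Y=j)
p_x_given_y = [                                         # P(X=i | Y=j)
    [Fraction(c, 34) for c in [3, 6, 8, 8, 5, 3, 1, 0, 0]],
    [Fraction(c, 26) for c in [0, 0, 0, 1, 4, 5, 8, 6, 2]],
]

# PRODUCT RULE: P(X=i, Y=j) = P(X=i | Y=j) P(Y=j)
joint = [[p * p_y[j] for p in p_x_given_y[j]] for j in range(2)]

print([str(p) for p in joint[0]])
print([str(p) for p in joint[1]])
```

The result matches the original table of relative frequencies, entry by entry.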
Summary of computational rules
• SUM RULE: P(X) = Σ_Y P(X,Y)
P(Y) = Σ_X P(X,Y)
– Notation: We simplify P(X=i,Y=j) to P(X,Y) for clarity
• PRODUCT RULE: P(X,Y) = P(X|Y) P(Y)
P(X,Y) = P(Y|X) P(X)
Ordinal variables
• In our example, X has a natural order 0…8
– X is a number of hits, and
– For the ordering of the columns in the table below, nearby X’s have similar probabilities
• Y does not have a natural order
[Table: columns ordered by X = 0, 1, ..., 8]
Probabilities for real numbers
• Can’t we treat real numbers as IEEE DOUBLES with 2^64 possible values?
• Hah, hah. No!
• How about quantizing real variables to a reasonable number of values?
• Sometimes works, but…
– We need to carefully account for ordinality
– Doing so can lead to cumbersome mathematics
Probability theory for real numbers
• Quantize X using bins of width Δ
• Then, X ∈ {.., -2Δ, -Δ, 0, Δ, 2Δ, ..}
• Define P_Q(X=x) = Probability that x < X ≤ x+Δ
• Problem: P_Q(X=x) depends on the choice of Δ
• Solution: Let Δ → 0
• Problem: In that case, P_Q(X=x) → 0
• Solution: Define a probability density
P(x) = lim_{Δ→0} P_Q(X=x)/Δ
= lim_{Δ→0} (Probability that x < X ≤ x+Δ)/Δ
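The limiting argument can be checked numerically. The sketch below is an illustration under an assumption not made on the slide: X is taken to be a zero-mean, unit-variance Gaussian, chosen only because its cumulative distribution is available through `math.erf`. The function names are made up for this example.

```python
import math

def std_normal_cdf(x):
    """CDF of a zero-mean, unit-variance Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    """Density of the same Gaussian, used as the reference value."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def quantized_prob(x, delta):
    """P_Q(X=x): probability that x < X <= x + delta."""
    return std_normal_cdf(x + delta) - std_normal_cdf(x)

x = 0.5
for delta in (1.0, 0.1, 0.01, 0.001):
    pq = quantized_prob(x, delta)
    print(f"delta={delta:<6} P_Q={pq:.6f}  P_Q/delta={pq / delta:.6f}")
print("density at x:", round(std_normal_pdf(x), 6))
```

As the bin width shrinks, P_Q itself goes to zero while the ratio P_Q/Δ settles toward the density at x.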
Probability theory for real numbers
Probability density
– Suppose P(x) is a probability density
– Properties
• P(x) ≥ 0
• It is NOT necessary that P(x) ≤ 1
• ∫_x P(x) dx = 1
– Probabilities of intervals:
P(a < X ≤ b) = ∫_{x=a}^{b} P(x) dx
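A density that exceeds 1 is easy to construct. The sketch below uses a uniform density on [0, 0.5] as a hypothetical example (it is not from the slides) and approximates the integrals with a simple midpoint rule; `uniform_density` and `integrate` are illustrative names.

```python
def uniform_density(x, lo=0.0, hi=0.5):
    """Density of a uniform variable on [lo, hi]; equals 1/(hi-lo) inside."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width

peak = uniform_density(0.25)                       # density value exceeds 1
total = integrate(uniform_density, -1.0, 1.0)      # still integrates to 1
p_interval = integrate(uniform_density, 0.1, 0.3)  # P(0.1 < X <= 0.3)
print(peak, round(total, 6), round(p_interval, 6))
```

The density value is 2 everywhere on [0, 0.5], yet the area under the curve is 1, and the probability of an interval is the area over that interval.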
Probability theory for real numbers
Joint, marginal and conditional densities
• Suppose P(x,y) is a joint probability density
– ∫_x ∫_y P(x,y) dx dy = 1
– P( (X,Y) ∈ R ) = ∫∫_R P(x,y) dx dy
• Marginal density: P(x) = ∫_y P(x,y) dy
• Conditional density: P(x|y) = P(x,y) / P(y)
[Figure: a region R in the (x, y) plane.]
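These definitions can be exercised on a toy joint density. The example below assumes P(x,y) = x + y on the unit square, a hypothetical choice made only because its integrals are easy to verify by hand; the integrals are approximated with a midpoint rule, and all function names are illustrative.

```python
def joint_density(x, y):
    """Hypothetical joint density P(x, y) = x + y on the unit square."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0

def integrate(f, a, b, steps=400):
    """Midpoint-rule approximation of the integral of f from a to b."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width

def marginal_x(x):
    """P(x) = integral over y of P(x, y)."""
    return integrate(lambda y: joint_density(x, y), 0.0, 1.0)

def marginal_y(y):
    """P(y) = integral over x of P(x, y)."""
    return integrate(lambda x: joint_density(x, y), 0.0, 1.0)

def conditional_x_given_y(x, y):
    """P(x | y) = P(x, y) / P(y)."""
    return joint_density(x, y) / marginal_y(y)

total = integrate(marginal_x, 0.0, 1.0)  # double integral of P(x, y)
# P((X, Y) in R) for the rectangle R = [0, 0.5] x [0, 0.5]
p_region = integrate(
    lambda x: integrate(lambda y: joint_density(x, y), 0.0, 0.5), 0.0, 0.5
)
print(round(total, 6), round(marginal_x(0.3), 6), round(p_region, 6))
```

For this density the marginal works out by hand to P(x) = x + 1/2, which gives a quick check on the numerical result.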
The Gaussian distribution
N(x|μ,σ²) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²))
where μ is the mean and σ is the standard deviation
Mean and variance
• The mean of X is E[X] = Σ_X X P(X)
or E[X] = ∫_x x P(x) dx
• The variance of X is VAR(X) = Σ_X (X−E[X])² P(X)
or VAR(X) = ∫_x (x−E[X])² P(x) dx
• The std dev of X is STD(X) = SQRT(VAR(X))
• The covariance of X and Y is
COV(X,Y) = Σ_X Σ_Y (X−E[X]) (Y−E[Y]) P(X,Y)
or COV(X,Y) = ∫_x ∫_y (x−E[X]) (y−E[Y]) P(x,y) dx dy
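Applied to the search-engine table, the discrete formulas look as follows. This is a sketch under one loudly stated assumption: the first table row is taken to be Y = 1 and the second Y = 2, which the transcript does not confirm. The mean and variance of X do not depend on that choice, but the covariance does.

```python
from fractions import Fraction

# ASSUMPTION: the first table row is Y = 1 and the second is Y = 2.
counts = {
    1: [3, 6, 8, 8, 5, 3, 1, 0, 0],
    2: [0, 0, 0, 1, 4, 5, 8, 6, 2],
}
# Joint probability as a dict keyed by (x, y).
joint = {
    (x, y): Fraction(c, 60)
    for y, row in counts.items()
    for x, c in enumerate(row)
}

# Summing over the joint table applies the sum rule implicitly.
mean_x = sum(x * p for (x, y), p in joint.items())
mean_y = sum(y * p for (x, y), p in joint.items())
var_x = sum((x - mean_x) ** 2 * p for (x, y), p in joint.items())
std_x = float(var_x) ** 0.5
cov_xy = sum((x - mean_x) * (y - mean_y) * p for (x, y), p in joint.items())

print(mean_x, var_x, round(std_x, 4), cov_xy)
```

Under the stated row assignment the covariance is positive, reflecting that the second row of the table concentrates on larger numbers of good hits.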
Mean and variance of the Gaussian
E[X] = μ
VAR(X) = σ²
STD(X) = σ
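These identities can be checked numerically by plugging the Gaussian density into the integral definitions of mean and variance. The sketch below uses arbitrary illustrative parameters and a midpoint-rule integral truncated at ten standard deviations; none of these choices come from the slides.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian density N(x | mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width

mu, sigma = 1.5, 2.0                         # arbitrary illustrative parameters
lo, hi = mu - 10 * sigma, mu + 10 * sigma    # tails beyond this are negligible

mean = integrate(lambda x: x * gaussian_pdf(x, mu, sigma), lo, hi)
var = integrate(lambda x: (x - mean) ** 2 * gaussian_pdf(x, mu, sigma), lo, hi)
print(round(mean, 6), round(var, 6), round(math.sqrt(var), 6))
```

The computed mean, variance and standard deviation should agree with μ, σ² and σ up to numerical integration error.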
How can we use probability as a framework for machine learning?
Maximum likelihood estimation
• Say we have a density P(x|θ) with parameter θ
• The likelihood of a set of independent and identically drawn (IID) data x = (x1,…,xN) is
P(x|θ) = Π_{n=1}^{N} P(xn|θ)
• The log-likelihood is L = ln P(x|θ) = Σ_{n=1}^{N} ln P(xn|θ)
• The maximum likelihood (ML) estimate of θ is
θ_ML = argmax_θ L = argmax_θ Σ_{n=1}^{N} ln P(xn|θ)
• Example: For Gaussian likelihood P(x|θ) = N(x|μ,σ²),
L = −(1/(2σ²)) Σ_{n=1}^{N} (xn−μ)² − (N/2) ln σ² − (N/2) ln(2π)
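For the Gaussian case the maximization can be carried out in closed form. The sketch below relies on the standard result (not derived on the slide) that setting the derivatives of L to zero gives the sample mean and the biased sample variance. The data values are made up and the function names are illustrative.

```python
import math

def gaussian_log_likelihood(data, mu, var):
    """L = sum_n ln N(x_n | mu, var) for IID data."""
    n = len(data)
    sq = sum((x - mu) ** 2 for x in data)
    return (-sq / (2.0 * var)
            - 0.5 * n * math.log(var)
            - 0.5 * n * math.log(2.0 * math.pi))

def gaussian_ml_estimate(data):
    """Closed-form maximizer of L: sample mean and (biased) sample variance."""
    n = len(data)
    mu_ml = sum(data) / n
    var_ml = sum((x - mu_ml) ** 2 for x in data) / n
    return mu_ml, var_ml

data = [2.1, 1.7, 3.0, 2.4, 1.9, 2.8]   # made-up observations
mu_ml, var_ml = gaussian_ml_estimate(data)
best = gaussian_log_likelihood(data, mu_ml, var_ml)
print(mu_ml, var_ml, best)
```

Evaluating `gaussian_log_likelihood` at nearby parameter values gives a smaller L, which is a quick sanity check that the closed-form estimate is the maximizer.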
Comments on notation from now on
• Instead of Σ_j P(X=i,Y=j), we write Σ_Y P(X,Y)
• P() and p() are used interchangeably
• Discrete and continuous variables treated the same, so Σ_X, ∫_X, Σ_x and ∫_x are interchangeable
• θ_ML and θ^ML are interchangeable
• argmax_θ f(θ) is the value of θ that maximizes f(θ)
• In the context of data x1,…,xN, symbols x, X and their boldface forms refer to the entire set of data
• N(x|μ,σ²) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²))
• log() = ln() and exp(x) = e^x
• p_context(x) and p(x|context) are interchangeable
Questions?
Objective of regression: Minimize error
E(w) = ½ Σ_n ( tn − y(xn,w) )²
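The error function can be evaluated directly once a model y(x,w) is chosen. The slide does not specify the model, so the sketch below assumes a polynomial in x with coefficient vector w. The data points and the names `predict` and `sum_of_squares_error` are made up for illustration.

```python
def predict(x, w):
    """Hypothetical model: polynomial y(x, w) = w[0] + w[1] x + w[2] x^2 + ..."""
    return sum(w_k * x ** k for k, w_k in enumerate(w))

def sum_of_squares_error(w, xs, ts):
    """E(w) = 1/2 * sum_n (t_n - y(x_n, w))^2."""
    return 0.5 * sum((t - predict(x, w)) ** 2 for x, t in zip(xs, ts))

xs = [0.0, 1.0, 2.0, 3.0]        # made-up inputs
ts = [1.0, 3.0, 5.0, 7.0]        # made-up targets, exactly t = 1 + 2x

error_exact = sum_of_squares_error([1.0, 2.0], xs, ts)  # weights that fit exactly
error_shift = sum_of_squares_error([0.0, 2.0], xs, ts)  # every residual equals 1
print(error_exact, error_shift)
```

With weights that reproduce the targets the error is zero; shifting the intercept by 1 leaves four residuals of 1, so E(w) = ½ · 4 = 2.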