Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good,...
Uploaded by stephanie-gardner
![Page 1: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/1.jpg)
Probability and Maximum Likelihood
![Page 2: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/2.jpg)
How are we doing on the pass sequence?
• This fit is pretty good, but…
[Figure: fit to the pass sequence. Axis label: Hand-labeled horizontal coordinate, t]
The red line doesn’t reveal different levels of uncertainty in predictions
Cross validation reduced the training data, so the red line isn’t as accurate as it should be
Choosing a particular M and w seems wrong – we should hedge our bets
![Page 3: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/3.jpg)
Probability Theory
Example of a random experiment
– We poll 60 users who are using one of two search engines and record the following:
[Figure: scatter plot in which each point corresponds to one of the 60 users. Horizontal axis X (0 to 8): number of “good hits” returned by the search engine. The other axis separates the two search engines.]
![Page 4: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/4.jpg)
Probability Theory
Random variables
– X and Y are called random variables
– Each has its own sample space:
• $S_X = \{0,1,2,3,4,5,6,7,8\}$
• $S_Y = \{1,2\}$
![Page 5: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/5.jpg)
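Probability Theory
Probability
– P(X=i,Y=j) is the probability (relative frequency) of observing X = i and Y = j
– P(X,Y) refers to the whole table of probabilities
– Properties: $0 \le P \le 1$, $\sum_{i,j} P(X=i,Y=j) = 1$

P(X=i,Y=j), one row per search engine (value of Y):

| X | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| | 3/60 | 6/60 | 8/60 | 8/60 | 5/60 | 3/60 | 1/60 | 0/60 | 0/60 |
| | 0/60 | 0/60 | 0/60 | 1/60 | 4/60 | 5/60 | 8/60 | 6/60 | 2/60 |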
![Page 6: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/6.jpg)
Probability Theory
Marginal probability
– P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y
[Figure: the marginals P(X) and P(Y); horizontal axis X = 0 … 8]
![Page 7: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/7.jpg)
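Probability Theory
Marginal probability
– P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y
– From the table: $P(X=i) = \sum_j P(X=i,Y=j)$ (SUM RULE)
Note that $\sum_i P(X=i) = 1$ and $\sum_j P(Y=j) = 1$

P(X=i,Y=j) with its marginals (one row per value of Y):

| X | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | P(Y=j) |
|---|---|---|---|---|---|---|---|---|---|---|
| | 3/60 | 6/60 | 8/60 | 8/60 | 5/60 | 3/60 | 1/60 | 0/60 | 0/60 | 34/60 |
| | 0/60 | 0/60 | 0/60 | 1/60 | 4/60 | 5/60 | 8/60 | 6/60 | 2/60 | 26/60 |
| P(X=i) | 3/60 | 6/60 | 8/60 | 9/60 | 9/60 | 8/60 | 9/60 | 6/60 | 2/60 | |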
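As a small check of the sum rule, the marginals can be recomputed from the counts in the table. This is only a sketch: the variable names are made up for illustration, and the row order simply follows the slide, which does not say which row is Y=1 and which is Y=2.

```python
from fractions import Fraction

# Counts behind the joint table P(X=i, Y=j): one row per search engine,
# one column per number of good hits X = 0..8 (row order as on the slide).
counts = [
    [3, 6, 8, 8, 5, 3, 1, 0, 0],
    [0, 0, 0, 1, 4, 5, 8, 6, 2],
]
total = sum(sum(row) for row in counts)  # number of polled users

# Joint probabilities as exact fractions (relative frequencies).
joint = [[Fraction(c, total) for c in row] for row in counts]

# Sum rule: sum the joint over the variable being ignored.
p_x = [sum(row[i] for row in joint) for i in range(9)]  # P(X=i)
p_y = [sum(row) for row in joint]                       # P(Y=j)

print("P(X=i):", [str(p) for p in p_x])
print("P(Y=j):", [str(p) for p in p_y])
```

`Fraction` prints values in lowest terms, so 3/60 appears as 1/20; the values are the same as in the table.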
![Page 8: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/8.jpg)
Probability Theory
Conditional probability
– P(X=i|Y=j) is the probability that X=i, given that Y=j
– From the table: P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j)
[Figure: the conditional P(X|Y=1) and the marginal P(Y=1); horizontal axis X = 0 … 8]
![Page 9: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/9.jpg)
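Probability Theory
Conditional probability
– How about the opposite conditional probability, P(Y=j|X=i)?
– P(Y=j|X=i) = P(X=i,Y=j) / P(X=i)
Note that $\sum_j P(Y=j|X=i) = 1$

P(X=i,Y=j) with the marginal P(X=i) (one row per value of Y):

| X | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| | 3/60 | 6/60 | 8/60 | 8/60 | 5/60 | 3/60 | 1/60 | 0/60 | 0/60 |
| | 0/60 | 0/60 | 0/60 | 1/60 | 4/60 | 5/60 | 8/60 | 6/60 | 2/60 |
| P(X=i) | 3/60 | 6/60 | 8/60 | 9/60 | 9/60 | 8/60 | 9/60 | 6/60 | 2/60 |

P(Y=j|X=i), same row order (each column sums to 1):

| X | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| | 3/3 | 6/6 | 8/8 | 8/9 | 5/9 | 3/8 | 1/9 | 0/6 | 0/2 |
| | 0/3 | 0/6 | 0/8 | 1/9 | 4/9 | 5/8 | 8/9 | 6/6 | 2/2 |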
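The two conditional tables can be computed from the same counts by dividing each entry by the appropriate marginal. The sketch below uses illustrative names and keeps the slide's row order; the mapping of rows to Y=1 and Y=2 is not given in the text.

```python
from fractions import Fraction

# Counts behind P(X=i, Y=j); rows follow the order of the slide's table.
counts = [
    [3, 6, 8, 8, 5, 3, 1, 0, 0],
    [0, 0, 0, 1, 4, 5, 8, 6, 2],
]

# Column totals are the counts behind P(X=i); row totals are behind P(Y=j).
x_totals = [sum(col) for col in zip(*counts)]
y_totals = [sum(row) for row in counts]

# P(Y=j | X=i) = P(X=i, Y=j) / P(X=i); the common factor 1/60 cancels.
p_y_given_x = [
    [Fraction(c, n) for c, n in zip(row, x_totals)]
    for row in counts
]

# P(X=i | Y=j) = P(X=i, Y=j) / P(Y=j).
p_x_given_y = [
    [Fraction(c, n) for c in row]
    for row, n in zip(counts, y_totals)
]

# Each conditional distribution sums to 1 over the variable left of the bar.
columns_ok = all(sum(row[i] for row in p_y_given_x) == 1 for i in range(9))
rows_ok = all(sum(row) == 1 for row in p_x_given_y)
print(columns_ok, rows_ok)
```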
![Page 10: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/10.jpg)
Summary of types of probability
• Joint probability: P(X,Y)
• Marginal probability (ignore other variable): P(X) and P(Y)
• Conditional probability (condition on the other variable having a certain value): P(X|Y) and P(Y|X)
![Page 11: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/11.jpg)
Probability Theory
Constructing joint probability
– Suppose we know
• The probability that the user will pick each search engine, P(Y=j), and
• For each search engine, the probability of each number of good hits, P(X=i|Y=j)
– Can we construct the joint probability, P(X=i,Y=j)?
– Yes. Rearranging P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j), we get P(X=i,Y=j) = P(X=i|Y=j) P(Y=j)
PRODUCT RULE
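The product rule can be tried on the search-engine example. The values of P(Y=j) and P(X=i|Y=j) below are derived from the 60-user table shown earlier (row totals 34 and 26); the names are illustrative, and the row order is the slide's, since the text does not say which row is which engine.

```python
from fractions import Fraction

# P(Y=j): probability that a user picks each search engine.
p_y = [Fraction(34, 60), Fraction(26, 60)]

# P(X=i | Y=j): distribution of good hits for each engine (one row per engine).
p_x_given_y = [
    [Fraction(c, 34) for c in [3, 6, 8, 8, 5, 3, 1, 0, 0]],
    [Fraction(c, 26) for c in [0, 0, 0, 1, 4, 5, 8, 6, 2]],
]

# Product rule: P(X=i, Y=j) = P(X=i | Y=j) * P(Y=j).
joint = [
    [p_cond * p_y[j] for p_cond in p_x_given_y[j]]
    for j in range(len(p_y))
]

for row in joint:
    print([str(p) for p in row])
```

The result reproduces the original joint table, which is the point of the slide: the marginal of one variable plus the conditional of the other carries the same information as the joint.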
![Page 12: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/12.jpg)
Summary of computational rules
• SUM RULE: $P(X) = \sum_Y P(X,Y)$
$P(Y) = \sum_X P(X,Y)$
– Notation: We simplify P(X=i,Y=j) to P(X,Y) for clarity
• PRODUCT RULE: P(X,Y) = P(X|Y) P(Y)
P(X,Y) = P(Y|X) P(X)
![Page 13: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/13.jpg)
Ordinal variables
• In our example, X has a natural order 0…8
– X is a number of hits, and
– For the ordering of the columns in the table below, nearby X’s have similar probabilities
• Y does not have a natural order
[Figure: the table of P(X,Y), with columns ordered X = 0, 1, …, 8]
![Page 14: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/14.jpg)
Probabilities for real numbers
• Can’t we treat real numbers as IEEE DOUBLES with $2^{64}$ possible values?
• Hah, hah. No!
• How about quantizing real variables to a reasonable number of values?
• Sometimes works, but…
– We need to carefully account for ordinality
– Doing so can lead to cumbersome mathematics
![Page 15: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/15.jpg)
Probability theory for real numbers
• Quantize X using bins of width $\Delta$
• Then, $X \in \{\dots, -2\Delta, -\Delta, 0, \Delta, 2\Delta, \dots\}$
• Define $P_Q(X=x)$ = Probability that $x < X \le x+\Delta$
• Problem: $P_Q(X=x)$ depends on the choice of $\Delta$
• Solution: Let $\Delta \to 0$
• Problem: In that case, $P_Q(X=x) \to 0$
• Solution: Define a probability density
$P(x) = \lim_{\Delta \to 0} P_Q(X=x)/\Delta$
$= \lim_{\Delta \to 0} (\text{Probability that } x < X \le x+\Delta)/\Delta$
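The limit can be watched numerically. The sketch below assumes a standard Gaussian purely as a convenient example distribution (any distribution with a known cumulative distribution function would do); the function names are made up for illustration.

```python
import math

def std_normal_cdf(x):
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    """Density of the standard Gaussian, for comparison with the limit."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def quantized_prob(x, delta):
    """P_Q(X=x): probability that x < X <= x + delta."""
    return std_normal_cdf(x + delta) - std_normal_cdf(x)

x = 0.5
for delta in (1.0, 0.1, 0.01, 0.001):
    p_q = quantized_prob(x, delta)
    # p_q shrinks towards 0, while p_q / delta settles to a finite value.
    print(f"delta={delta:<6} P_Q={p_q:.6f}  P_Q/delta={p_q / delta:.6f}")

print(f"density at x: {std_normal_pdf(x):.6f}")
```

The bin probability itself vanishes as the bin narrows, but the ratio to the bin width approaches the density at x.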
![Page 16: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/16.jpg)
Probability theory for real numbers
Probability density
– Suppose P(x) is a probability density
– Properties
• $P(x) \ge 0$
• It is NOT necessary that $P(x) \le 1$
• $\int_x P(x)\,dx = 1$
– Probabilities of intervals:
$P(a < X \le b) = \int_{x=a}^{b} P(x)\,dx$
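A density that exceeds 1 is easy to construct. The sketch below assumes a uniform density on (0, 0.5], whose height is 2, and uses a simple midpoint rule for the integrals; both choices are illustrative and not taken from the slides.

```python
def density(x):
    """Uniform density on (0, 0.5]: height 2 inside, 0 outside."""
    return 2.0 if 0.0 < x <= 0.5 else 0.0

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))

total = integrate(density, -1.0, 1.0)      # should be close to 1
p_interval = integrate(density, 0.1, 0.3)  # P(0.1 < X <= 0.3)

print("peak density:", density(0.25))
print("total area:", total)
print("P(0.1 < X <= 0.3):", p_interval)
```

The density value is 2 everywhere on the interval, yet every probability obtained by integrating it stays between 0 and 1.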
![Page 17: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/17.jpg)
Probability theory for real numbers
Joint, marginal and conditional densities
• Suppose P(x,y) is a joint probability density
– $\int_x \int_y P(x,y)\,dx\,dy = 1$
– $P((X,Y) \in R) = \iint_R P(x,y)\,dx\,dy$
• Marginal density: $P(x) = \int_y P(x,y)\,dy$
• Conditional density: $P(x|y) = P(x,y) / P(y)$
[Figure: a region R in the (x, y) plane]
![Page 18: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/18.jpg)
The Gaussian distribution
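$\mathcal{N}(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, where $\sigma$ is the standard deviation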
![Page 19: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/19.jpg)
The Gaussian distribution
The “precision” is the inverse of the variance, $(\sigma^2)^{-1}$
$\sigma$ is the standard deviation
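For reference, here is a sketch of the Gaussian density written both ways: in terms of the standard deviation and in terms of the precision (taken here to mean the inverse variance). The formula is the standard Gaussian density; the function names are illustrative.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    var = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gaussian_pdf_precision(x, mu, precision):
    """Same density, parameterised by the precision = 1 / variance."""
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(
        -0.5 * precision * (x - mu) ** 2
    )

mu, sigma = 1.0, 0.1
for x in (0.8, 1.0, 1.2):
    a = gaussian_pdf(x, mu, sigma)
    b = gaussian_pdf_precision(x, mu, 1.0 / sigma ** 2)
    print(x, a, b)
```

With a small standard deviation the peak value is well above 1, which is allowed for a density.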
![Page 20: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/20.jpg)
Mean and variance
• The mean of X is $E[X] = \sum_X X\,P(X)$
or $E[X] = \int_x x\,P(x)\,dx$
• The variance of X is $VAR(X) = \sum_X (X - E[X])^2 P(X)$
or $VAR(X) = \int_x (x - E[X])^2 P(x)\,dx$
• The std dev of X is STD(X) = SQRT(VAR(X))
• The covariance of X and Y is
$COV(X,Y) = \sum_X \sum_Y (X - E[X])(Y - E[Y])\,P(X,Y)$
or $COV(X,Y) = \int_x \int_y (x - E[X])(y - E[Y])\,P(x,y)\,dx\,dy$
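The discrete formulas can be applied to the search-engine table from the earlier slides. This sketch assumes the first row of that table is Y=1 and the second is Y=2, which the text does not state; swapping the labels leaves the mean and variance of X unchanged and flips the sign of the covariance.

```python
from fractions import Fraction

# Counts behind P(X=i, Y=j); ASSUMPTION: first row is Y=1, second row is Y=2.
counts = [
    [3, 6, 8, 8, 5, 3, 1, 0, 0],
    [0, 0, 0, 1, 4, 5, 8, 6, 2],
]
y_values = [1, 2]

# Joint distribution as a dictionary {(x, y): probability}.
joint = {
    (x, y): Fraction(counts[r][x], 60)
    for r, y in enumerate(y_values)
    for x in range(9)
}

# Mean, variance and covariance, computed directly from the definitions.
mean_x = sum(x * p for (x, _), p in joint.items())
mean_y = sum(y * p for (_, y), p in joint.items())
var_x = sum((x - mean_x) ** 2 * p for (x, _), p in joint.items())
std_x = float(var_x) ** 0.5
cov_xy = sum((x - mean_x) * (y - mean_y) * p for (x, y), p in joint.items())

print("E[X] =", mean_x, " VAR(X) =", var_x, " STD(X) =", std_x)
print("E[Y] =", mean_y, " COV(X,Y) =", cov_xy)
```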
![Page 21: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/21.jpg)
Mean and variance of the Gaussian
$E[X] = \mu$
$VAR(X) = \sigma^2$
$STD(X) = \sigma$
![Page 22: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/22.jpg)
How can we use probability as a framework for machine learning?
![Page 23: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/23.jpg)
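Maximum likelihood estimation
• Say we have a density $P(x|\theta)$ with parameter $\theta$
• The likelihood of a set of independent and identically distributed (IID) data $\mathbf{x} = (x_1,\dots,x_N)$ is
$P(\mathbf{x}|\theta) = \prod_{n=1}^{N} P(x_n|\theta)$
• The log-likelihood is $L = \ln P(\mathbf{x}|\theta) = \sum_{n=1}^{N} \ln P(x_n|\theta)$
• The maximum likelihood (ML) estimate of $\theta$ is
$\theta_{ML} = \arg\max_\theta L = \arg\max_\theta \sum_{n=1}^{N} \ln P(x_n|\theta)$
• Example: For Gaussian likelihood $P(x|\theta) = \mathcal{N}(x|\mu,\sigma^2)$,
$L = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)$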
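For the Gaussian example, the maximum of the log-likelihood has a well-known closed form: the sample mean and the mean squared deviation from it. The sketch below uses a made-up data set and illustrative function names, and compares the log-likelihood at the ML estimate with a few nearby parameter values.

```python
import math

def gaussian_log_likelihood(data, mu, var):
    """Sum over n of ln N(x_n | mu, var) for IID data."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (
        -squared_error / (2.0 * var)
        - 0.5 * n * math.log(var)
        - 0.5 * n * math.log(2.0 * math.pi)
    )

def gaussian_ml_estimate(data):
    """Closed-form ML estimates: sample mean and (biased) sample variance."""
    n = len(data)
    mu_ml = sum(data) / n
    var_ml = sum((x - mu_ml) ** 2 for x in data) / n
    return mu_ml, var_ml

data = [2.1, 1.9, 3.2, 2.8, 2.5, 1.7]  # made-up sample, for illustration only
mu_ml, var_ml = gaussian_ml_estimate(data)
best = gaussian_log_likelihood(data, mu_ml, var_ml)

print("mu_ML =", mu_ml, " var_ML =", var_ml, " L =", best)
for d_mu, d_var in ((0.2, 0.0), (-0.2, 0.0), (0.0, 0.1), (0.0, -0.1)):
    # Moving away from the ML estimate should not increase L.
    print(d_mu, d_var, gaussian_log_likelihood(data, mu_ml + d_mu, var_ml + d_var))
```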
![Page 24: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/24.jpg)
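Comments on notation from now on
• Instead of $\sum_j P(X=i,Y=j)$, we write $\sum_Y P(X,Y)$
• P() and p() are used interchangeably
• Discrete and continuous variables are treated the same, so upper-case X and lower-case x, in any typeface, are interchangeable
• $\theta_{ML}$ and $\theta^{ML}$ are interchangeable
• $\arg\max_\theta f(\theta)$ is the value of $\theta$ that maximizes $f(\theta)$
• In the context of data $x_1,\dots,x_N$, the symbols x and X, in any typeface, refer to the entire set of data
• $\mathcal{N}(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
• log() = ln() and exp(x) = $e^x$
• $p_{\text{context}}(x)$ and p(x|context) are interchangeable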
![Page 25: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/25.jpg)
![Page 26: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/26.jpg)
Questions?
![Page 27: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/27.jpg)
How are we doing on the pass sequence?
• This fit is pretty good, but…
[Figure: fit to the pass sequence. Axis label: Hand-labeled horizontal coordinate, t]
The red line doesn’t reveal different levels of uncertainty in predictions
Cross validation reduced the training data, so the red line isn’t as accurate as it should be
Choosing a particular M and w seems wrong – we should hedge our bets
No progress!
But…
![Page 28: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/28.jpg)
![Page 29: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/29.jpg)
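Maximum likelihood estimation
• Example: For Gaussian likelihood $P(x|\theta) = \mathcal{N}(x|\mu,\sigma^2)$,
$L = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)$
Objective of regression: Minimize error
$E(w) = \frac{1}{2} \sum_n (t_n - y(x_n,w))^2$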
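One way to see the connection: if each target t_n is modelled as Gaussian with mean y(x_n, w) and a fixed variance, the log-likelihood equals -E(w)/variance plus a term that does not depend on w, so maximizing L over w is the same as minimizing E(w). The sketch below assumes a straight-line model y(x, w) = w0 + w1 x, a fixed noise variance, and made-up data; none of these specifics come from the slides.

```python
import math

def predict(x, w):
    """ASSUMED model for illustration: a straight line y(x, w) = w0 + w1 * x."""
    return w[0] + w[1] * x

def sum_squares_error(xs, ts, w):
    """E(w) = 1/2 * sum_n (t_n - y(x_n, w))^2."""
    return 0.5 * sum((t - predict(x, w)) ** 2 for x, t in zip(xs, ts))

def log_likelihood(xs, ts, w, var):
    """Sum of ln N(t_n | y(x_n, w), var) = -E(w)/var - (N/2) ln(2 pi var)."""
    n = len(xs)
    return -sum_squares_error(xs, ts, w) / var - 0.5 * n * math.log(2.0 * math.pi * var)

def least_squares_line(xs, ts):
    """Closed-form minimiser of E(w) for the straight-line model."""
    n = len(xs)
    mean_x, mean_t = sum(xs) / n, sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    w1 = sxt / sxx
    return (mean_t - w1 * mean_x, w1)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # made-up inputs
ts = [0.1, 0.9, 2.2, 2.8, 4.1]  # made-up targets
var = 0.25                      # assumed fixed noise variance

w_ls = least_squares_line(xs, ts)
for w in (w_ls, (0.0, 1.0), (0.5, 0.8)):
    # Lower error always goes with higher log-likelihood.
    print(w, sum_squares_error(xs, ts, w), log_likelihood(xs, ts, w, var))
```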
![Page 30: Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.](https://reader033.fdocuments.in/reader033/viewer/2022061306/55147584550346284e8b627a/html5/thumbnails/30.jpg)