Probability and Maximum Likelihood

Transcript of the "Probability and Maximum Likelihood" lecture slides (30 pages).

Page 1

Probability and Maximum Likelihood

Page 2

How are we doing on the pass sequence?

• This fit is pretty good, but…

[Figure: the pass-sequence data with the fitted red curve; axis label: "Hand-labeled horizontal coordinate, t".]

– The red line doesn’t reveal different levels of uncertainty in predictions

– Cross validation reduced the training data, so the red line isn’t as accurate as it should be

– Choosing a particular M and w seems wrong – we should hedge our bets

Page 3

Probability Theory: Example of a random experiment

– We poll 60 users who are using one of two search engines and record the following:

[Figure: scatter plot in which each point corresponds to one of the 60 users. Horizontal axis: number of “good hits” returned by the search engine, X = 0, 1, …, 8. Vertical axis: which of the two search engines was used.]

Page 4

Probability Theory: Random variables

[Figure: the same scatter plot of the 60 users.]

– X and Y are called random variables
– Each has its own sample space:

• SX = {0,1,2,3,4,5,6,7,8}

• SY = {1,2}

Page 5

Probability Theory: Probability

– P(X=i,Y=j) is the probability (relative frequency) of observing X = i and Y = j
– P(X,Y) refers to the whole table of probabilities
– Properties: 0 ≤ P(X=i,Y=j) ≤ 1, Σi Σj P(X=i,Y=j) = 1

P(X=i,Y=j):

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8
  Y=1   3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60
  Y=2   0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60
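The joint table is easy to reproduce and sanity-check numerically. A minimal sketch in Python/NumPy (the array name P_XY and the layout, rows for Y and columns for X, are my own choices, not from the slides):

```python
import numpy as np

# Joint counts from the slide's table: rows are Y = 1, 2; columns are X = 0..8.
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])

P_XY = counts / counts.sum()       # P(X=i, Y=j) as relative frequencies (counts / 60)

# Properties from the slide: every entry lies in [0, 1] ...
assert np.all((P_XY >= 0) & (P_XY <= 1))
# ... and the whole table sums to 1.
assert np.isclose(P_XY.sum(), 1.0)
```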

Page 6

Probability Theory: Marginal probability

– P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y

[Figure: bar plots of the marginal distributions P(X) over X = 0…8 and P(Y) over the two search engines.]

Page 7

Probability Theory: Marginal probability

– P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y
– From the table: P(X=i) = Σj P(X=i,Y=j)

Note that Σi P(X=i) = 1 and Σj P(Y=j) = 1

P(X=i,Y=j), with marginals P(X=i) and P(Y=j):

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8   P(Y=j)
  Y=1   3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60   34/60
  Y=2   0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60   26/60
P(X=i)  3/60  6/60  8/60  9/60  9/60  8/60  9/60  6/60  2/60

SUM RULE
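The sum rule can be verified directly on the same table; a sketch, again with an illustrative P_XY array laid out as above:

```python
import numpy as np

# Joint table from the slides (rows: Y = 1, 2; columns: X = 0..8), as counts / 60.
P_XY = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                 [0, 0, 0, 1, 4, 5, 8, 6, 2]]) / 60

# Sum rule: marginalize by summing out the other variable.
P_X = P_XY.sum(axis=0)   # P(X=i) = sum_j P(X=i, Y=j)
P_Y = P_XY.sum(axis=1)   # P(Y=j) = sum_i P(X=i, Y=j)

print(P_X * 60)          # [3. 6. 8. 9. 9. 8. 9. 6. 2.], matching the bottom row above
print(P_Y * 60)          # [34. 26.], matching the right-hand column above
assert np.isclose(P_X.sum(), 1.0) and np.isclose(P_Y.sum(), 1.0)
```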

Page 8

Probability Theory: Conditional probability

– P(X=i|Y=j) is the probability that X = i, given that Y = j
– From the table: P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j)

[Figure: bar plot of the conditional distribution P(X|Y=1) over X = 0…8, i.e., the Y=1 row of the joint table rescaled by P(Y=1).]
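Conditioning on Y just rescales a row of the joint table by the corresponding marginal; a short sketch:

```python
import numpy as np

# Joint table from the slides (rows: Y = 1, 2; columns: X = 0..8).
P_XY = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                 [0, 0, 0, 1, 4, 5, 8, 6, 2]]) / 60

P_Y = P_XY.sum(axis=1, keepdims=True)   # P(Y=j), shape (2, 1)
P_X_given_Y = P_XY / P_Y                # P(X=i | Y=j) = P(X=i, Y=j) / P(Y=j)

print(P_X_given_Y[0])                   # P(X | Y=1): the Y=1 counts divided by 34
assert np.allclose(P_X_given_Y.sum(axis=1), 1.0)   # each conditional sums to 1
```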

Page 9

Probability Theory: Conditional probability

– How about the opposite conditional probability, P(Y=j|X=i)?
– P(Y=j|X=i) = P(X=i,Y=j) / P(X=i)

Note that Σj P(Y=j|X=i) = 1

P(X=i,Y=j), with marginal P(X=i):

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8
  Y=1   3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60
  Y=2   0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60
P(X=i)  3/60  6/60  8/60  9/60  9/60  8/60  9/60  6/60  2/60

P(Y=j|X=i):

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8
  Y=1    3/3   6/6   8/8   8/9   5/9   3/8   1/9   0/6   0/2
  Y=2    0/3   0/6   0/8   1/9   4/9   5/8   8/9   6/6   2/2
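The opposite conditional normalizes each column of the joint table by P(X=i); a sketch that reproduces the P(Y=j|X=i) table above:

```python
import numpy as np

# Joint table from the slides (rows: Y = 1, 2; columns: X = 0..8).
P_XY = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                 [0, 0, 0, 1, 4, 5, 8, 6, 2]]) / 60

P_X = P_XY.sum(axis=0, keepdims=True)   # P(X=i), shape (1, 9)
P_Y_given_X = P_XY / P_X                # P(Y=j | X=i) = P(X=i, Y=j) / P(X=i)

print(P_Y_given_X[:, 3])                # column X=3: [8/9, 1/9], as in the table
assert np.allclose(P_Y_given_X.sum(axis=0), 1.0)   # each column sums to 1
```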

Page 10

Summary of types of probability

• Joint probability: P(X,Y)

• Marginal probability (ignore other variable): P(X) and P(Y)

• Conditional probability (condition on the other variable having a certain value): P(X|Y) and P(Y|X)

Page 11

Probability Theory: Constructing joint probability

– Suppose we know
  • The probability that the user will pick each search engine, P(Y=j), and
  • For each search engine, the probability of each number of good hits, P(X=i|Y=j)

– Can we construct the joint probability, P(X=i,Y=j)?

– Yes. Rearranging P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j), we get P(X=i,Y=j) = P(X=i|Y=j) P(Y=j)

PRODUCT RULE
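A sketch of the product rule on the same example: starting from P(Y=j) and P(X=i|Y=j), the joint table is recovered exactly.

```python
import numpy as np

# Given quantities: P(Y=j) and, for each engine, P(X=i | Y=j).
counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])
P_Y = np.array([34, 26]) / 60
P_X_given_Y = counts / counts.sum(axis=1, keepdims=True)   # each row sums to 1

# Product rule: P(X=i, Y=j) = P(X=i | Y=j) P(Y=j).
P_XY = P_X_given_Y * P_Y[:, None]

assert np.allclose(P_XY, counts / 60)   # recovers the joint table from the slides
```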

Page 12

Summary of computational rules

• SUM RULE: P(X) = ΣY P(X,Y)

  P(Y) = ΣX P(X,Y)

– Notation: We simplify P(X=i,Y=j) to P(X,Y) for clarity

• PRODUCT RULE: P(X,Y) = P(X|Y) P(Y)

P(X,Y) = P(Y|X) P(X)

Page 13

Ordinal variables

• In our example, X has a natural order 0…8
  – X is a number of hits, and
  – For this ordering of the columns in the joint table P(X=i,Y=j), nearby X’s have similar probabilities

• Y does not have a natural order

Page 14

Probabilities for real numbers

• Can’t we treat real numbers as IEEE DOUBLES with 2^64 possible values?

• Hah, hah. No!

• How about quantizing real variables to a reasonable number of values?

• Sometimes works, but…
  – We need to carefully account for ordinality
  – Doing so can lead to cumbersome mathematics

Page 15

Probability theory for real numbers

• Quantize X using bins of width Δ
• Then, X ∈ {…, -2Δ, -Δ, 0, Δ, 2Δ, …}

• Define PQ(X=x) = Probability that x < X ≤ x+Δ

• Problem: PQ(X=x) depends on the choice of Δ
• Solution: Let Δ → 0
• Problem: In that case, PQ(X=x) → 0

• Solution: Define a probability density

  P(x) = lim Δ→0 PQ(X=x)/Δ
       = lim Δ→0 (Probability that x < X ≤ x+Δ)/Δ
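A sketch of this limiting definition for a standard Gaussian X (a distribution chosen only so the bin probability has a closed form via the error function): as the bin width Δ shrinks, PQ(X=x)/Δ approaches the density P(x).

```python
import math

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    # CDF of N(mu, sigma^2), written with the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # Density of N(mu, sigma^2)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

x = 0.5
for delta in [1.0, 0.1, 0.01, 0.001]:
    p_bin = gaussian_cdf(x + delta) - gaussian_cdf(x)   # PQ(X=x) = P(x < X <= x+delta)
    print(delta, p_bin / delta)                         # ratio -> P(x) as delta -> 0
print(gaussian_pdf(x))                                  # the limiting value, ~0.3521
```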

Page 16

Probability theory for real numbers

Probability density

– Suppose P(x) is a probability density

– Properties

• P(x) ≥ 0
• It is NOT necessary that P(x) ≤ 1

• ∫x P(x) dx = 1

– Probabilities of intervals:

  P(a < X ≤ b) = ∫x=a..b P(x) dx
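A sketch checking the interval formula numerically for a standard Gaussian density (the helper names gaussian_pdf and gaussian_cdf are my own):

```python
import math
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

a, b = -1.0, 2.0
xs = np.linspace(a, b, 10001)
pdf = gaussian_pdf(xs)

# P(a < X <= b) as the integral of the density over (a, b], via the trapezoidal rule.
p_interval = float(np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(xs)))
print(p_interval, gaussian_cdf(b) - gaussian_cdf(a))   # both ~0.8186

# The density integrates to 1 over the whole line (approximated on a wide grid),
# even though P(x) itself may exceed 1 for small sigma.
wide = np.linspace(-10.0, 10.0, 100001)
pw = gaussian_pdf(wide)
print(float(np.sum(0.5 * (pw[1:] + pw[:-1]) * np.diff(wide))))   # ~1.0
```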

Page 17

Probability theory for real numbers

Joint, marginal and conditional densities

• Suppose P(x,y) is a joint probability density

– ∫x ∫y P(x,y) dx dy = 1

– P( (X,Y) ∈ R ) = ∫∫R P(x,y) dx dy

• Marginal density: P(x) = ∫y P(x,y) dy

• Conditional density: P(x|y) = P(x,y) / P(y)

[Figure: a region R in the (x, y) plane.]
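A sketch of marginal and conditional densities on a grid, using a joint density built from two independent standard Gaussians purely to keep the example self-contained:

```python
import math
import numpy as np

def gaussian_pdf(t, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Illustrative joint density on a grid: P(x, y) = N(x|0,1) N(y|0,1).
xs = np.linspace(-6.0, 6.0, 601)
ys = np.linspace(-6.0, 6.0, 601)
dx, dy = xs[1] - xs[0], ys[1] - ys[0]
P_xy = gaussian_pdf(xs)[:, None] * gaussian_pdf(ys)[None, :]

print(P_xy.sum() * dx * dy)                     # integrates to ~1 over the grid

# Marginal density: P(x) = integral of P(x, y) dy (Riemann sum over the grid).
P_x = P_xy.sum(axis=1) * dy
print(np.max(np.abs(P_x - gaussian_pdf(xs))))   # agrees with N(x|0,1) up to grid error

# Conditional density at y = 1: P(x | y) = P(x, y) / P(y).
P_y = P_xy.sum(axis=0) * dx
j = int(np.argmin(np.abs(ys - 1.0)))
P_x_given_y = P_xy[:, j] / P_y[j]
print(P_x_given_y.sum() * dx)                   # a density in x: integrates to ~1
```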

Page 18

The Gaussian distribution

N(x|μ,σ²) = 1/√(2πσ²) · exp( -(x-μ)² / (2σ²) )

μ is the mean and σ is the standard deviation

Page 19

The Gaussian distribution

The same density can be written with the “precision” β = σ⁻², the inverse variance:

N(x|μ,β⁻¹) = √(β/2π) · exp( -β(x-μ)²/2 )

σ is the standard deviation
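A sketch of the density in both parameterizations (the precision form is written here as β = 1/σ², as reconstructed above); the two functions return identical values:

```python
import math

def gaussian_pdf(x, mu, sigma):
    # N(x | mu, sigma^2), with sigma the standard deviation
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_pdf_precision(x, mu, beta):
    # The same density written with the precision beta = 1 / sigma^2
    return math.sqrt(beta / (2.0 * math.pi)) * math.exp(-0.5 * beta * (x - mu) ** 2)

mu, sigma = 2.0, 0.5
x = 2.3
print(gaussian_pdf(x, mu, sigma))
print(gaussian_pdf_precision(x, mu, 1.0 / sigma ** 2))   # identical value
```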

Page 20

Mean and variance

• The mean of X is E[X] = ΣX X P(X)

  or E[X] = ∫x x P(x) dx

• The variance of X is VAR(X) = ΣX (X - E[X])² P(X)

  or VAR(X) = ∫x (x - E[X])² P(x) dx

• The std dev of X is STD(X) = SQRT(VAR(X))

• The covariance of X and Y is

  COV(X,Y) = ΣX ΣY (X - E[X]) (Y - E[Y]) P(X,Y)

  or COV(X,Y) = ∫x ∫y (x - E[X]) (y - E[Y]) P(x,y) dx dy
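These definitions can be evaluated directly on the discrete search-engine example; a sketch (treating Y’s labels 1 and 2 as numbers only for the sake of the covariance formula):

```python
import numpy as np

# Joint table from the search-engine example (rows: Y = 1, 2; columns: X = 0..8).
P_XY = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                 [0, 0, 0, 1, 4, 5, 8, 6, 2]]) / 60
x_vals = np.arange(9)
y_vals = np.array([1, 2])

P_X = P_XY.sum(axis=0)
P_Y = P_XY.sum(axis=1)

E_X = np.sum(x_vals * P_X)                 # E[X] = sum_X X P(X)
VAR_X = np.sum((x_vals - E_X) ** 2 * P_X)  # VAR(X) = sum_X (X - E[X])^2 P(X)
STD_X = np.sqrt(VAR_X)                     # STD(X) = sqrt(VAR(X))
E_Y = np.sum(y_vals * P_Y)
COV_XY = np.sum((x_vals[None, :] - E_X)    # COV(X,Y) = sum_{X,Y} (X - E[X]) ...
                * (y_vals[:, None] - E_Y)  #            ... (Y - E[Y]) P(X,Y)
                * P_XY)

print(E_X, VAR_X, STD_X, COV_XY)
```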

Page 21

Mean and variance of the Gaussian

E[X] = μ, VAR(X) = σ²

STD(X) = σ
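A quick Monte Carlo check of these facts (the particular μ and σ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
x = rng.normal(mu, sigma, size=1_000_000)   # samples from N(mu, sigma^2)

print(x.mean())   # ~1.5  : E[X] = mu
print(x.var())    # ~0.49 : VAR(X) = sigma^2
print(x.std())    # ~0.7  : STD(X) = sigma
```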

Page 22

How can we use probability as a framework for machine learning?

Page 23

Maximum likelihood estimation

• Say we have a density P(x|θ) with parameter θ

• The likelihood of a set of independent and identically distributed (IID) data x = (x1,…,xN) is

  P(x|θ) = Πn=1..N P(xn|θ)

• The log-likelihood is L = ln P(x|θ) = Σn=1..N ln P(xn|θ)

• The maximum likelihood (ML) estimate of θ is

  θML = argmaxθ L = argmaxθ Σn=1..N ln P(xn|θ)

• Example: For the Gaussian likelihood P(x|θ) = N(x|μ,σ²),

  L = -1/(2σ²) Σn=1..N (xn - μ)² - (N/2) ln σ² - (N/2) ln(2π)
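A sketch of Gaussian maximum likelihood on simulated data: the log-likelihood above is maximized by the sample mean and the (1/N) sample variance, which a coarse grid search over (μ, σ²) confirms. The data and grid here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=500)   # simulated IID data, true mu = 3, sigma^2 = 4
N = len(x)

def log_likelihood(mu, sigma2):
    # L = -(1/(2 sigma^2)) sum_n (x_n - mu)^2 - (N/2) ln sigma^2 - (N/2) ln(2 pi)
    return (-0.5 * np.sum((x - mu) ** 2) / sigma2
            - 0.5 * N * np.log(sigma2)
            - 0.5 * N * np.log(2.0 * np.pi))

mu_ml = x.mean()                        # closed-form ML estimate of mu
sigma2_ml = np.mean((x - mu_ml) ** 2)   # closed-form ML estimate of sigma^2
print(mu_ml, sigma2_ml)

# Coarse grid search: the maximizer lands next to the closed-form estimates.
grid = [(m, s2) for m in np.linspace(2.0, 4.0, 81) for s2 in np.linspace(1.0, 9.0, 81)]
print(max(grid, key=lambda p: log_likelihood(*p)))
```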

Page 24

Comments on notation from now on

• Instead of Σj P(X=i,Y=j), we write ΣY P(X,Y)

• P() and p() are used interchangeably

• Discrete and continuous variables are treated the same, so ΣX and ∫x dx are interchangeable

• θML and θ̂ML are interchangeable

• argmaxθ f(θ) is the value of θ that maximizes f(θ)

• In the context of data x1,…,xN, the symbols x and X refer to the entire set of data

• N(x|μ,σ²) = 1/√(2πσ²) · exp( -(x-μ)² / (2σ²) )

• log() = ln() and exp(x) = e^x

• pcontext(x) and p(x|context) are interchangable

Page 25

(Repeat of the "Maximum likelihood estimation" slide from Page 23.)

Page 26

Questions?

Page 27

How are we doing on the pass sequence?

• This fit is pretty good, but…

[Figure: the pass-sequence data with the fitted red curve; axis label: "Hand-labeled horizontal coordinate, t".]

– The red line doesn’t reveal different levels of uncertainty in predictions

– Cross validation reduced the training data, so the red line isn’t as accurate as it should be

– Choosing a particular M and w seems wrong – we should hedge our bets

No progress! But…

Page 28

(Repeat of the "Maximum likelihood estimation" slide from Page 23.)

Page 29

Maximum likelihood estimation

• Example: For the Gaussian likelihood P(x|θ) = N(x|μ,σ²),

  L = -1/(2σ²) Σn ( xn - μ )² - (N/2) ln σ² - (N/2) ln(2π)

Objective of regression: Minimize the error

  E(w) = ½ Σn ( tn - y(xn,w) )²

The squared-error sum in L has the same form as E(w): maximizing a Gaussian likelihood corresponds to minimizing a sum of squared errors.
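A sketch of that connection on simulated data: under a Gaussian noise model for the targets, the w that maximizes the likelihood is exactly the least-squares w that minimizes E(w). The data, noise level, and polynomial order below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)   # simulated targets

M = 3                                        # illustrative polynomial order
Phi = np.vander(x, M + 1, increasing=True)   # design matrix: columns 1, x, x^2, x^3

# If t_n = y(x_n, w) + Gaussian noise, then maximizing ln P(t | w, sigma^2) over w
# is the same as minimizing E(w) = 1/2 sum_n (t_n - y(x_n, w))^2, solved here by
# least squares.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

def E(w):
    return 0.5 * np.sum((t - Phi @ w) ** 2)

print(w_ml)
print(E(w_ml), E(w_ml + 0.01))   # any perturbation of w_ml increases the error
```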

Page 30