Probability and Maximum Likelihood

Transcript of the "Probability and Maximum Likelihood" lecture slides (30 pages).

Page 1

Probability and Maximum Likelihood

Page 2

How are we doing on the pass sequence?

• This fit is pretty good, but…

[Figure: the pass-sequence data with the fitted red curve; axis label: "Hand-labeled horizontal coordinate, t".]

– The red line doesn’t reveal different levels of uncertainty in predictions

– Cross validation reduced the training data, so the red line isn’t as accurate as it should be

– Choosing a particular M and w seems wrong – we should hedge our bets

Page 3

Probability Theory: Example of a random experiment

– We poll 60 users who are using one of two search engines and record the following:

[Figure: scatter plot in which each point corresponds to one of the 60 users. Horizontal axis: number of “good hits” returned by the search engine, X = 0, 1, …, 8. Vertical axis: which of the two search engines was used.]

Page 4

Probability Theory: Random variables

[Figure: the same scatter plot of the 60 users.]

– X and Y are called random variables
– Each has its own sample space:

• SX = {0,1,2,3,4,5,6,7,8}

• SY = {1,2}

Page 5

Probability Theory: Probability

– P(X=i,Y=j) is the probability (relative frequency) of observing X = i and Y = j
– P(X,Y) refers to the whole table of probabilities
– Properties: 0 ≤ P(X=i,Y=j) ≤ 1, Σi Σj P(X=i,Y=j) = 1

P(X=i,Y=j):

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8
  Y=1   3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60
  Y=2   0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60
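The joint table is easy to reproduce and sanity-check numerically. A minimal sketch in Python/NumPy (the array name P_XY and the layout, rows for Y and columns for X, are my own choices, not from the slides):

```python
import numpy as np

# Joint counts from the slide's table: rows are Y = 1, 2; columns are X = 0..8.
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])

P_XY = counts / counts.sum()       # P(X=i, Y=j) as relative frequencies (counts / 60)

# Properties from the slide: every entry lies in [0, 1] ...
assert np.all((P_XY >= 0) & (P_XY <= 1))
# ... and the whole table sums to 1.
assert np.isclose(P_XY.sum(), 1.0)
```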

Page 6

Probability Theory: Marginal probability

– P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y

[Figure: bar plots of the marginal distributions P(X) over X = 0…8 and P(Y) over the two search engines.]

Page 7

Probability Theory: Marginal probability

– P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y
– From the table: P(X=i) = Σj P(X=i,Y=j)

Note that Σi P(X=i) = 1 and Σj P(Y=j) = 1

P(X=i,Y=j), with marginals P(X=i) and P(Y=j):

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8   P(Y=j)
  Y=1   3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60   34/60
  Y=2   0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60   26/60
P(X=i)  3/60  6/60  8/60  9/60  9/60  8/60  9/60  6/60  2/60

SUM RULE
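The sum rule can be verified directly on the same table; a sketch, again with an illustrative P_XY array laid out as above:

```python
import numpy as np

# Joint table from the slides (rows: Y = 1, 2; columns: X = 0..8), as counts / 60.
P_XY = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                 [0, 0, 0, 1, 4, 5, 8, 6, 2]]) / 60

# Sum rule: marginalize by summing out the other variable.
P_X = P_XY.sum(axis=0)   # P(X=i) = sum_j P(X=i, Y=j)
P_Y = P_XY.sum(axis=1)   # P(Y=j) = sum_i P(X=i, Y=j)

print(P_X * 60)          # [3. 6. 8. 9. 9. 8. 9. 6. 2.], matching the bottom row above
print(P_Y * 60)          # [34. 26.], matching the right-hand column above
assert np.isclose(P_X.sum(), 1.0) and np.isclose(P_Y.sum(), 1.0)
```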

Page 8

Probability Theory: Conditional probability

– P(X=i|Y=j) is the probability that X = i, given that Y = j
– From the table: P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j)

[Figure: bar plot of the conditional distribution P(X|Y=1) over X = 0…8, i.e., the Y=1 row of the joint table rescaled by P(Y=1).]
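Conditioning on Y just rescales a row of the joint table by the corresponding marginal; a short sketch:

```python
import numpy as np

# Joint table from the slides (rows: Y = 1, 2; columns: X = 0..8).
P_XY = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                 [0, 0, 0, 1, 4, 5, 8, 6, 2]]) / 60

P_Y = P_XY.sum(axis=1, keepdims=True)   # P(Y=j), shape (2, 1)
P_X_given_Y = P_XY / P_Y                # P(X=i | Y=j) = P(X=i, Y=j) / P(Y=j)

print(P_X_given_Y[0])                   # P(X | Y=1): the Y=1 counts divided by 34
assert np.allclose(P_X_given_Y.sum(axis=1), 1.0)   # each conditional sums to 1
```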

Page 9

Probability Theory: Conditional probability

– How about the opposite conditional probability, P(Y=j|X=i)?
– P(Y=j|X=i) = P(X=i,Y=j) / P(X=i)

Note that Σj P(Y=j|X=i) = 1

P(X=i,Y=j), with marginal P(X=i):

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8
  Y=1   3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60
  Y=2   0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60
P(X=i)  3/60  6/60  8/60  9/60  9/60  8/60  9/60  6/60  2/60

P(Y=j|X=i):

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8
  Y=1    3/3   6/6   8/8   8/9   5/9   3/8   1/9   0/6   0/2
  Y=2    0/3   0/6   0/8   1/9   4/9   5/8   8/9   6/6   2/2
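The opposite conditional normalizes each column of the joint table by P(X=i); a sketch that reproduces the P(Y=j|X=i) table above:

```python
import numpy as np

# Joint table from the slides (rows: Y = 1, 2; columns: X = 0..8).
P_XY = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                 [0, 0, 0, 1, 4, 5, 8, 6, 2]]) / 60

P_X = P_XY.sum(axis=0, keepdims=True)   # P(X=i), shape (1, 9)
P_Y_given_X = P_XY / P_X                # P(Y=j | X=i) = P(X=i, Y=j) / P(X=i)

print(P_Y_given_X[:, 3])                # column X=3: [8/9, 1/9], as in the table
assert np.allclose(P_Y_given_X.sum(axis=0), 1.0)   # each column sums to 1
```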

Page 10

Summary of types of probability

• Joint probability: P(X,Y)

• Marginal probability (ignore other variable): P(X) and P(Y)

• Conditional probability (condition on the other variable having a certain value): P(X|Y) and P(Y|X)

Page 11

Probability Theory: Constructing joint probability

– Suppose we know
  • The probability that the user will pick each search engine, P(Y=j), and
  • For each search engine, the probability of each number of good hits, P(X=i|Y=j)

– Can we construct the joint probability, P(X=i,Y=j)?

– Yes. Rearranging P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j), we get P(X=i,Y=j) = P(X=i|Y=j) P(Y=j)

PRODUCT RULE
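A sketch of the product rule on the same example: starting from P(Y=j) and P(X=i|Y=j), the joint table is recovered exactly.

```python
import numpy as np

# Given quantities: P(Y=j) and, for each engine, P(X=i | Y=j).
counts = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                   [0, 0, 0, 1, 4, 5, 8, 6, 2]])
P_Y = np.array([34, 26]) / 60
P_X_given_Y = counts / counts.sum(axis=1, keepdims=True)   # each row sums to 1

# Product rule: P(X=i, Y=j) = P(X=i | Y=j) P(Y=j).
P_XY = P_X_given_Y * P_Y[:, None]

assert np.allclose(P_XY, counts / 60)   # recovers the joint table from the slides
```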

Page 12

Summary of computational rules

• SUM RULE: P(X) = ΣY P(X,Y)

  P(Y) = ΣX P(X,Y)

– Notation: We simplify P(X=i,Y=j) to P(X,Y) for clarity

• PRODUCT RULE: P(X,Y) = P(X|Y) P(Y)

P(X,Y) = P(Y|X) P(X)

Page 13

Ordinal variables

• In our example, X has a natural order 0…8
  – X is a number of hits, and
  – For this ordering of the columns in the joint table P(X=i,Y=j), nearby X’s have similar probabilities

• Y does not have a natural order

Page 14

Probabilities for real numbers

• Can’t we treat real numbers as IEEE DOUBLES with 2^64 possible values?

• Hah, hah. No!

• How about quantizing real variables to a reasonable number of values?

• Sometimes works, but…
  – We need to carefully account for ordinality
  – Doing so can lead to cumbersome mathematics

Page 15

Probability theory for real numbers

• Quantize X using bins of width Δ
• Then, X ∈ {…, -2Δ, -Δ, 0, Δ, 2Δ, …}

• Define PQ(X=x) = Probability that x < X ≤ x+Δ

• Problem: PQ(X=x) depends on the choice of Δ
• Solution: Let Δ → 0
• Problem: In that case, PQ(X=x) → 0

• Solution: Define a probability density

  P(x) = lim Δ→0 PQ(X=x)/Δ
       = lim Δ→0 (Probability that x < X ≤ x+Δ)/Δ
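A sketch of this limiting definition for a standard Gaussian X (a distribution chosen only so the bin probability has a closed form via the error function): as the bin width Δ shrinks, PQ(X=x)/Δ approaches the density P(x).

```python
import math

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    # CDF of N(mu, sigma^2), written with the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # Density of N(mu, sigma^2)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

x = 0.5
for delta in [1.0, 0.1, 0.01, 0.001]:
    p_bin = gaussian_cdf(x + delta) - gaussian_cdf(x)   # PQ(X=x) = P(x < X <= x+delta)
    print(delta, p_bin / delta)                         # ratio -> P(x) as delta -> 0
print(gaussian_pdf(x))                                  # the limiting value, ~0.3521
```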

Page 16

Probability theory for real numbers

Probability density

– Suppose P(x) is a probability density

– Properties

• P(x) ≥ 0
• It is NOT necessary that P(x) ≤ 1

• ∫x P(x) dx = 1

– Probabilities of intervals:

  P(a < X ≤ b) = ∫x=a..b P(x) dx
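A sketch checking the interval formula numerically for a standard Gaussian density (the helper names gaussian_pdf and gaussian_cdf are my own):

```python
import math
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

a, b = -1.0, 2.0
xs = np.linspace(a, b, 10001)
pdf = gaussian_pdf(xs)

# P(a < X <= b) as the integral of the density over (a, b], via the trapezoidal rule.
p_interval = float(np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(xs)))
print(p_interval, gaussian_cdf(b) - gaussian_cdf(a))   # both ~0.8186

# The density integrates to 1 over the whole line (approximated on a wide grid),
# even though P(x) itself may exceed 1 for small sigma.
wide = np.linspace(-10.0, 10.0, 100001)
pw = gaussian_pdf(wide)
print(float(np.sum(0.5 * (pw[1:] + pw[:-1]) * np.diff(wide))))   # ~1.0
```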

Page 17

Probability theory for real numbers

Joint, marginal and conditional densities

• Suppose P(x,y) is a joint probability density

– ∫x ∫y P(x,y) dx dy = 1

– P( (X,Y) ∈ R ) = ∫∫R P(x,y) dx dy

• Marginal density: P(x) = ∫y P(x,y) dy

• Conditional density: P(x|y) = P(x,y) / P(y)

[Figure: a region R in the (x, y) plane.]
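A sketch of marginal and conditional densities on a grid, using a joint density built from two independent standard Gaussians purely to keep the example self-contained:

```python
import math
import numpy as np

def gaussian_pdf(t, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Illustrative joint density on a grid: P(x, y) = N(x|0,1) N(y|0,1).
xs = np.linspace(-6.0, 6.0, 601)
ys = np.linspace(-6.0, 6.0, 601)
dx, dy = xs[1] - xs[0], ys[1] - ys[0]
P_xy = gaussian_pdf(xs)[:, None] * gaussian_pdf(ys)[None, :]

print(P_xy.sum() * dx * dy)                     # integrates to ~1 over the grid

# Marginal density: P(x) = integral of P(x, y) dy (Riemann sum over the grid).
P_x = P_xy.sum(axis=1) * dy
print(np.max(np.abs(P_x - gaussian_pdf(xs))))   # agrees with N(x|0,1) up to grid error

# Conditional density at y = 1: P(x | y) = P(x, y) / P(y).
P_y = P_xy.sum(axis=0) * dx
j = int(np.argmin(np.abs(ys - 1.0)))
P_x_given_y = P_xy[:, j] / P_y[j]
print(P_x_given_y.sum() * dx)                   # a density in x: integrates to ~1
```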

Page 18

The Gaussian distribution

N(x|μ,σ²) = 1/√(2πσ²) · exp( -(x-μ)² / (2σ²) )

μ is the mean and σ is the standard deviation

Page 19

The Gaussian distribution

The same density can be written with the “precision” β = σ⁻², the inverse variance:

N(x|μ,β⁻¹) = √(β/2π) · exp( -β(x-μ)²/2 )

σ is the standard deviation
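A sketch of the density in both parameterizations (the precision form is written here as β = 1/σ², as reconstructed above); the two functions return identical values:

```python
import math

def gaussian_pdf(x, mu, sigma):
    # N(x | mu, sigma^2), with sigma the standard deviation
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_pdf_precision(x, mu, beta):
    # The same density written with the precision beta = 1 / sigma^2
    return math.sqrt(beta / (2.0 * math.pi)) * math.exp(-0.5 * beta * (x - mu) ** 2)

mu, sigma = 2.0, 0.5
x = 2.3
print(gaussian_pdf(x, mu, sigma))
print(gaussian_pdf_precision(x, mu, 1.0 / sigma ** 2))   # identical value
```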

Page 20

Mean and variance

• The mean of X is E[X] = ΣX X P(X)

  or E[X] = ∫x x P(x) dx

• The variance of X is VAR(X) = ΣX (X - E[X])² P(X)

  or VAR(X) = ∫x (x - E[X])² P(x) dx

• The std dev of X is STD(X) = SQRT(VAR(X))

• The covariance of X and Y is

  COV(X,Y) = ΣX ΣY (X - E[X]) (Y - E[Y]) P(X,Y)

  or COV(X,Y) = ∫x ∫y (x - E[X]) (y - E[Y]) P(x,y) dx dy
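These definitions can be evaluated directly on the discrete search-engine example; a sketch (treating Y’s labels 1 and 2 as numbers only for the sake of the covariance formula):

```python
import numpy as np

# Joint table from the search-engine example (rows: Y = 1, 2; columns: X = 0..8).
P_XY = np.array([[3, 6, 8, 8, 5, 3, 1, 0, 0],
                 [0, 0, 0, 1, 4, 5, 8, 6, 2]]) / 60
x_vals = np.arange(9)
y_vals = np.array([1, 2])

P_X = P_XY.sum(axis=0)
P_Y = P_XY.sum(axis=1)

E_X = np.sum(x_vals * P_X)                 # E[X] = sum_X X P(X)
VAR_X = np.sum((x_vals - E_X) ** 2 * P_X)  # VAR(X) = sum_X (X - E[X])^2 P(X)
STD_X = np.sqrt(VAR_X)                     # STD(X) = sqrt(VAR(X))
E_Y = np.sum(y_vals * P_Y)
COV_XY = np.sum((x_vals[None, :] - E_X)    # COV(X,Y) = sum_{X,Y} (X - E[X]) ...
                * (y_vals[:, None] - E_Y)  #            ... (Y - E[Y]) P(X,Y)
                * P_XY)

print(E_X, VAR_X, STD_X, COV_XY)
```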

Page 21

Mean and variance of the Gaussian

E[X] = μ, VAR(X) = σ²

STD(X) = σ
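A quick Monte Carlo check of these facts (the particular μ and σ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
x = rng.normal(mu, sigma, size=1_000_000)   # samples from N(mu, sigma^2)

print(x.mean())   # ~1.5  : E[X] = mu
print(x.var())    # ~0.49 : VAR(X) = sigma^2
print(x.std())    # ~0.7  : STD(X) = sigma
```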

Page 22

How can we use probability as a framework for machine learning?

Page 23

Maximum likelihood estimation

• Say we have a density P(x|θ) with parameter θ

• The likelihood of a set of independent and identically distributed (IID) data x = (x1,…,xN) is

  P(x|θ) = Πn=1..N P(xn|θ)

• The log-likelihood is L = ln P(x|θ) = Σn=1..N ln P(xn|θ)

• The maximum likelihood (ML) estimate of θ is

  θML = argmaxθ L = argmaxθ Σn=1..N ln P(xn|θ)

• Example: For the Gaussian likelihood P(x|θ) = N(x|μ,σ²),

  L = -1/(2σ²) Σn=1..N (xn - μ)² - (N/2) ln σ² - (N/2) ln(2π)
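A sketch of Gaussian maximum likelihood on simulated data: the log-likelihood above is maximized by the sample mean and the (1/N) sample variance, which a coarse grid search over (μ, σ²) confirms. The data and grid here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=500)   # simulated IID data, true mu = 3, sigma^2 = 4
N = len(x)

def log_likelihood(mu, sigma2):
    # L = -(1/(2 sigma^2)) sum_n (x_n - mu)^2 - (N/2) ln sigma^2 - (N/2) ln(2 pi)
    return (-0.5 * np.sum((x - mu) ** 2) / sigma2
            - 0.5 * N * np.log(sigma2)
            - 0.5 * N * np.log(2.0 * np.pi))

mu_ml = x.mean()                        # closed-form ML estimate of mu
sigma2_ml = np.mean((x - mu_ml) ** 2)   # closed-form ML estimate of sigma^2
print(mu_ml, sigma2_ml)

# Coarse grid search: the maximizer lands next to the closed-form estimates.
grid = [(m, s2) for m in np.linspace(2.0, 4.0, 81) for s2 in np.linspace(1.0, 9.0, 81)]
print(max(grid, key=lambda p: log_likelihood(*p)))
```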

Page 24

Comments on notation from now on

• Instead of Σj P(X=i,Y=j), we write ΣY P(X,Y)

• P() and p() are used interchangeably

• Discrete and continuous variables are treated the same, so ΣX and ∫x dx are interchangeable

• θML and θ̂ML are interchangeable

• argmaxθ f(θ) is the value of θ that maximizes f(θ)

• In the context of data x1,…,xN, the symbols x and X refer to the entire set of data

• N(x|μ,σ²) = 1/√(2πσ²) · exp( -(x-μ)² / (2σ²) )

• log() = ln() and exp(x) = e^x

• pcontext(x) and p(x|context) are interchangable

Page 25

(Repeat of the "Maximum likelihood estimation" slide from Page 23.)

Page 26

Questions?

Page 27

How are we doing on the pass sequence?

• This fit is pretty good, but…

[Figure: the pass-sequence data with the fitted red curve; axis label: "Hand-labeled horizontal coordinate, t".]

– The red line doesn’t reveal different levels of uncertainty in predictions

– Cross validation reduced the training data, so the red line isn’t as accurate as it should be

– Choosing a particular M and w seems wrong – we should hedge our bets

No progress! But…

Page 28

(Repeat of the "Maximum likelihood estimation" slide from Page 23.)

Page 29

Maximum likelihood estimation

• Example: For the Gaussian likelihood P(x|θ) = N(x|μ,σ²),

  L = -1/(2σ²) Σn ( xn - μ )² - (N/2) ln σ² - (N/2) ln(2π)

Objective of regression: Minimize the error

  E(w) = ½ Σn ( tn - y(xn,w) )²

The squared-error sum in L has the same form as E(w): maximizing a Gaussian likelihood corresponds to minimizing a sum of squared errors.
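A sketch of that connection on simulated data: under a Gaussian noise model for the targets, the w that maximizes the likelihood is exactly the least-squares w that minimizes E(w). The data, noise level, and polynomial order below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)   # simulated targets

M = 3                                        # illustrative polynomial order
Phi = np.vander(x, M + 1, increasing=True)   # design matrix: columns 1, x, x^2, x^3

# If t_n = y(x_n, w) + Gaussian noise, then maximizing ln P(t | w, sigma^2) over w
# is the same as minimizing E(w) = 1/2 sum_n (t_n - y(x_n, w))^2, solved here by
# least squares.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

def E(w):
    return 0.5 * np.sum((t - Phi @ w) ** 2)

print(w_ml)
print(E(w_ml), E(w_ml + 0.01))   # any perturbation of w_ml increases the error
```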

Page 30