Introduction to Machine Learning CMU-10701aarti/Class/10701_Spring14/slides/MLE_MAP_Part1… ·...

Introduction to Machine Learning CMU-10701 2. MLE, MAP, Bayes classification Barnabás Póczos & Aarti Singh 2014 Spring

Introduction to Machine Learning

CMU-10701

2. MLE, MAP, Bayes classification

Barnabás Póczos & Aarti Singh

2014 Spring

Administration

http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/index.html

• Blackboard manager & Peer grading: Dani
• Webpage manager and autolab: Pulkit
• Camera man: Pengtao
• Homework manager: Jit
• Piazza manager: Prashant

Recitation: Wean 7500, 6pm-7pm, on Wednesdays

Outline

Theory:

• Probabilities:
  – Dependence, Independence, Conditional Independence
• Parameter estimation:
  – Maximum Likelihood Estimation (MLE)
  – Maximum a posteriori (MAP)
• Bayes rule
• Naïve Bayes Classifier

Application:

Naïve Bayes Classifier for
• Spam filtering
• “Mind reading” = fMRI data processing


Independence

Independence

Independent random variables: P(X, Y) = P(X) P(Y), or equivalently P(X | Y) = P(X)

Y and X don’t contain information about each other.

Observing Y doesn’t help predict X.

Observing X doesn’t help predict Y.

Examples:

Independent: Winning on roulette this week and next week.

Dependent: Russian roulette
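
As a minimal sketch of the factorization test for independence, the snippet below checks whether a small joint probability table equals the product of its marginals. The function name `is_independent` and the two example tables are illustrative assumptions, not part of the lecture.

```python
from fractions import Fraction
from itertools import product


def is_independent(joint):
    """Return True if P(x, y) == P(x) * P(y) for every pair (x, y).

    `joint` maps (x, y) pairs to probabilities; missing pairs count as 0.
    """
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    # Marginals obtained by summing the joint table over the other variable.
    p_x = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y]
               for x, y in product(xs, ys))


# Hypothetical example 1: two separate fair coin flips.
two_flips = {(x, y): Fraction(1, 4) for x in "HT" for y in "HT"}

# Hypothetical example 2: Y is a copy of X, so Y tells us everything about X.
copied = {("H", "H"): Fraction(1, 2), ("T", "T"): Fraction(1, 2)}

print(is_independent(two_flips))  # independent table
print(is_independent(copied))     # dependent table
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.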

Dependent / Independent

[Figure: two plots of Y against X, one titled “Independent X,Y” and one titled “Dependent X,Y”]

Conditionally Independent

Conditionally independent: Knowing Z makes X and Y independent

Examples:

Dependent: shoe size and reading skills

Conditionally independent: shoe size and reading skills given …? age

Storks deliver babies: Highly statistically significant correlation exists between stork populations and human birth rates across Europe.

Conditionally Independent

London taxi drivers: A survey has pointed out a positive and significant correlation between the number of accidents and wearing coats. They concluded that coats could hinder movements of drivers and be the cause of accidents. A new law was prepared to prohibit drivers from wearing coats when driving.

Finally another study pointed out that people wear coats when it rains…

Correlation ≠ Causation

[Figure: comic from xkcd.com]

Conditional Independence

Formally: X is conditionally independent of Y given Z:

P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all x, y, z

Equivalent to:

P(X, Y | Z) = P(X | Z) P(Y | Z)

Note: P(Thunder | Rain, Lightning) = P(Thunder | Lightning) does NOT mean Thunder is independent of Rain

But given Lightning, knowing Rain doesn’t give more info about Thunder

Conditional vs. Marginal Independence

• C calls A and B separately and tells them a number n ∈ {1,…,10}
• Due to noise in the phone, A and B each imperfectly (and independently) draw a conclusion about what the number was.
• A thinks the number was na and B thinks it was nb.
• Are na and nb marginally independent?
  – No, we expect e.g. P(na = 1 | nb = 1) > P(na = 1)
• Are na and nb conditionally independent given n?
  – Yes, because if we know the true number, the outcomes na and nb are purely determined by the noise in each phone.
    P(na = 1 | nb = 1, n = 2) = P(na = 1 | n = 2)

[Figure: diagram of the variables n, na and nb]
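
The phone example can be checked by exact enumeration. The sketch below assumes a specific noise model that the slide does not give: n is uniform on {1,…,10}, each listener hears the true number with probability 4/5, and otherwise hears one of the other nine numbers uniformly at random. The names `p_hear`, `joint` and `prob` are illustrative.

```python
from fractions import Fraction
from itertools import product

NUMBERS = range(1, 11)
P_CORRECT = Fraction(4, 5)  # assumed probability of hearing the right number


def p_hear(heard, true):
    """Assumed noise model: P(listener concludes `heard` | C said `true`)."""
    if heard == true:
        return P_CORRECT
    return (1 - P_CORRECT) / 9


def joint(n, na, nb):
    """P(n, na, nb): uniform n, and independent noise on the two calls."""
    return Fraction(1, 10) * p_hear(na, n) * p_hear(nb, n)


def prob(event):
    """Probability of an event given as a predicate over (n, na, nb)."""
    return sum(joint(n, na, nb)
               for n, na, nb in product(NUMBERS, repeat=3)
               if event(n, na, nb))


# Marginal question: does knowing nb change our belief about na?
p_na1 = prob(lambda n, na, nb: na == 1)
p_na1_given_nb1 = (prob(lambda n, na, nb: na == 1 and nb == 1)
                   / prob(lambda n, na, nb: nb == 1))

# Conditional question: once n is known, does nb still matter?
p_na1_given_n2 = (prob(lambda n, na, nb: na == 1 and n == 2)
                  / prob(lambda n, na, nb: n == 2))
p_na1_given_nb1_n2 = (prob(lambda n, na, nb: na == 1 and nb == 1 and n == 2)
                      / prob(lambda n, na, nb: nb == 1 and n == 2))

print(p_na1, p_na1_given_nb1)              # these differ: marginally dependent
print(p_na1_given_n2, p_na1_given_nb1_n2)  # these match: independent given n
```

Under this assumed noise model, P(na = 1) is 1/10 while P(na = 1 | nb = 1) is 29/45, and both conditional probabilities given n = 2 equal 1/45.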

Estimating Probabilities

Our first machine learning problem:

Parameter estimation: MLE, MAP

Flipping a Coin

I have a coin. If I flip it, what’s the probability that it will fall with the head up?

Let us flip it a few times to estimate the probability:

[Figure: five coin flips, three of them heads]

The estimated probability is: 3/5 (“frequency of heads”)

Flipping a Coin

The estimated probability is: 3/5 (“frequency of heads”)

Questions:

(1) Why frequency of heads???

(2) How good is this estimation???

(3) Why is this a machine learning problem???

We are going to answer these questions

Question (1)

Why frequency of heads???

• Frequency of heads is exactly the maximum likelihood estimator for this problem
• MLE has nice properties (interpretation, statistical guarantees, simple)

Maximum Likelihood Estimation

MLE for Bernoulli distribution

Data, D = [sequence of observed coin flips]

P(Heads) = θ, P(Tails) = 1-θ

Flips are i.i.d.:
– Independent events
– Identically distributed according to Bernoulli distribution

MLE: Choose θ that maximizes the probability of observed data
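
To make "choose θ that maximizes the probability of observed data" concrete, here is a rough numerical sketch: it evaluates the Bernoulli likelihood on a grid of candidate θ values and picks the largest. The grid search and the names `bernoulli_likelihood` and `grid_mle` are illustrative assumptions; the data are the 3 heads and 2 tails from the coin example.

```python
def bernoulli_likelihood(theta, heads, tails):
    """Probability of one particular i.i.d. sequence with the given counts."""
    return theta ** heads * (1.0 - theta) ** tails


def grid_mle(heads, tails, steps=1000):
    """Return the grid point in [0, 1] with the highest likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda theta: bernoulli_likelihood(theta, heads, tails))


theta_hat = grid_mle(heads=3, tails=2)
print(theta_hat)  # close to the frequency of heads, 3/5
```

A grid search is only a stand-in for the calculus argument; it is used here because it needs no derivation and makes the shape of the likelihood easy to inspect.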

Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of observed data

θ̂_MLE = arg max_θ P(D | θ)
= arg max_θ ∏_{i=1..n} P(x_i | θ)   [independent draws]
= arg max_θ ∏_{i: x_i=H} θ ∏_{i: x_i=T} (1-θ)   [identically distributed]
= arg max_θ θ^{αH} (1-θ)^{αT}

where x_i is the i-th flip, and αH and αT are the numbers of heads and tails in D.

Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of observed data

θ̂_MLE = αH / (αH + αT), where αH and αT are the observed numbers of heads and tails

That’s exactly the “Frequency of heads”
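
A worked derivation of this result, sketched under the assumptions that the data contain αH heads and αT tails, that the likelihood is θ^{αH}(1-θ)^{αT}, and that the maximum lies strictly inside (0, 1). Maximizing the likelihood is the same as maximizing its logarithm, so we set the derivative of the log-likelihood to zero:

```latex
\begin{align*}
\ln P(D \mid \theta) &= \alpha_H \ln\theta + \alpha_T \ln(1-\theta) \\
\frac{d}{d\theta}\ln P(D \mid \theta)
  &= \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0 \\
\alpha_H\,(1-\theta) &= \alpha_T\,\theta \\
\hat{\theta}_{\mathrm{MLE}} &= \frac{\alpha_H}{\alpha_H + \alpha_T}
\end{align*}
```

The second derivative, -αH/θ² - αT/(1-θ)², is negative, so this stationary point is a maximum. With 3 heads and 2 tails it gives 3/5.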

Question (2)

How good is this MLE???

How many flips do I need?

I flipped the coin 5 times: 3 heads, 2 tails

What if I flipped 30 heads and 20 tails?

• Which estimator should we trust more?
• The more the merrier???

Simple bound

Let θ* be the true parameter.

For n = αH+αT, and θ̂_MLE = αH / (αH + αT),

For any ε>0:

Hoeffding’s inequality: P(|θ̂_MLE - θ*| ≥ ε) ≤ 2 exp(-2nε²)

Probably Approximately Correct (PAC) Learning

I want to know the coin parameter θ, within ε = 0.1 error with probability at least 1-δ = 0.95.

How many flips do I need?

By Hoeffding’s inequality it is enough that 2 exp(-2nε²) ≤ δ.

Sample complexity: n ≥ ln(2/δ) / (2ε²)
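
As a small worked example, the sketch below plugs ε = 0.1 and δ = 0.05 into the sample-complexity bound that follows from Hoeffding's inequality, n ≥ ln(2/δ) / (2ε²). The function name `hoeffding_sample_size` is an illustrative choice.

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Smallest integer n with 2 * exp(-2 * n * epsilon**2) <= delta."""
    if epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


# Within 0.1 of the true parameter with probability at least 0.95.
n_flips = hoeffding_sample_size(epsilon=0.1, delta=0.05)
print(n_flips)
```

For these values the bound gives 185 flips. It is a worst-case guarantee that holds for any true θ*, so a particular coin may need fewer flips in practice.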

Question (3)

Why is this a machine learning problem???

• improve their performance (accuracy of the predicted prob.)
• at some task (predicting the probability of heads)
• with experience (the more coins we flip the better we are)

What about continuous features?

[Figure: data points on a number line marked from 3 to 9]

Let us try Gaussians…

[Figure: two Gaussian density plots, each labelled µ=0 and σ²]

MLE for Gaussian mean and variance

Choose θ = (µ,σ²) that maximizes the probability of observed data

θ̂_MLE = arg max_θ P(D | θ)
= arg max_θ ∏_{i=1..n} P(x_i | θ)   [independent draws]
= arg max_θ ∏_{i=1..n} (1/√(2πσ²)) exp(-(x_i - µ)² / (2σ²))   [identically distributed]

where x_1, …, x_n are the observed data points.

MLE for Gaussian mean and variance

µ̂_MLE = (1/n) ∑_{i=1..n} x_i

σ̂²_MLE = (1/n) ∑_{i=1..n} (x_i - µ̂_MLE)²

Note: MLE for the variance of a Gaussian is biased

[Expected result of estimation is not the true parameter!]

Unbiased variance estimator: σ̂²_unbiased = (1/(n-1)) ∑_{i=1..n} (x_i - µ̂_MLE)²
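
The difference between the two variance estimators is only the divisor: n for the MLE, n-1 for the unbiased estimator. A minimal sketch on made-up data follows; the sample values and function names are hypothetical, and the comparison with Python's `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n-1) is included only as a cross-check.

```python
import statistics


def gaussian_mle(samples):
    """Return (mean, variance) maximizing the Gaussian likelihood."""
    n = len(samples)
    mu = sum(samples) / n
    var_mle = sum((x - mu) ** 2 for x in samples) / n  # divides by n: biased
    return mu, var_mle


def unbiased_variance(samples):
    """Sample variance with the n - 1 divisor."""
    n = len(samples)
    mu = sum(samples) / n
    return sum((x - mu) ** 2 for x in samples) / (n - 1)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical measurements
mu_hat, var_hat = gaussian_mle(data)
var_unbiased = unbiased_variance(data)

print(mu_hat, var_hat, var_unbiased)
print(statistics.pvariance(data), statistics.variance(data))  # cross-check
```

On this sample the MLE variance is 4.0 and the unbiased estimate is 32/7, so the MLE is smaller by the factor (n-1)/n = 7/8.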

What about prior knowledge? (MAP Estimation)

What about prior knowledge?

We know the coin is “close” to 50-50. What can we do now?

The Bayesian way…

Rather than estimating a single θ, we obtain a distribution over possible values of θ

[Figure: distributions over θ labelled “Before data” (around 50-50) and “After data”]

Prior distribution

What prior? What distribution do we want for a prior?
• Represents expert knowledge (philosophical approach)
• Simple posterior form (engineer’s approach)

Uninformative priors:
• Uniform distribution

Conjugate priors:
• Closed-form representation of posterior
• P(θ) and P(θ|D) have the same form

In order to proceed we will need:

Bayes Rule

Chain Rule & Bayes Rule

Chain rule: P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)

Bayes rule: P(X | Y) = P(Y | X) P(X) / P(Y)

Bayes rule is important for reverse conditioning.
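
Reverse conditioning can be illustrated with a toy version of the spam-filtering application from the outline. All numbers below are made up for illustration: a 1/5 prior probability of spam, and the word "free" appearing in 1/2 of spam and 1/20 of non-spam messages. The sketch turns P(word | spam) into P(spam | word).

```python
from fractions import Fraction

# Hypothetical numbers, chosen only to illustrate the formula.
p_spam = Fraction(1, 5)                 # prior P(spam)
p_word_given_spam = Fraction(1, 2)      # P("free" | spam)
p_word_given_ham = Fraction(1, 20)      # P("free" | not spam)

# Chain rule plus marginalization gives the evidence P("free").
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes rule: P(spam | "free") = P("free" | spam) P(spam) / P("free").
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(p_word, p_spam_given_word)
```

With these assumed numbers the evidence is 7/50 and the posterior is 5/7: seeing the word raises the probability of spam from 1/5 to 5/7.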

Bayesian Learning

• Use Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D)

• Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)

posterior ∝ likelihood × prior

MLE vs. MAP

• Maximum Likelihood estimation (MLE)
  Choose value that maximizes the probability of observed data
  θ̂_MLE = arg max_θ P(D | θ)

• Maximum a posteriori (MAP) estimation
  Choose value that is most probable given observed data and prior belief
  θ̂_MAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ)

When is MAP same as MLE?

MAP estimation for Binomial distribution

Coin flip problem: Likelihood is Binomial

P(D | θ) = C(n, αH) θ^{αH} (1-θ)^{αT}

If the prior is Beta distribution with parameters βH, βT,

P(θ) = θ^{βH-1} (1-θ)^{βT-1} / B(βH, βT) ~ Beta(βH, βT)

⇒ posterior is Beta distribution

Beta function: B(x, y) = ∫₀¹ t^{x-1} (1-t)^{y-1} dt

MAP estimation for Binomial distribution

Likelihood is Binomial: P(D | θ) = C(n, αH) θ^{αH} (1-θ)^{αT}

Prior is Beta distribution: P(θ) = θ^{βH-1} (1-θ)^{βT-1} / B(βH, βT) ~ Beta(βH, βT)

⇒ posterior is Beta distribution: P(θ | D) ~ Beta(βH + αH, βT + αT)

P(θ) and P(θ|D) have the same form! [Conjugate prior]
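
A minimal sketch of the conjugate update, under stated assumptions: the prior is Beta(a, b), observing `heads` heads and `tails` tails gives the posterior Beta(a + heads, b + tails), and the MAP estimate is the posterior mode (a - 1)/(a + b - 2), which is valid when both parameters exceed 1. The Beta(5, 5) prior is a hypothetical way to encode "close to 50-50"; the function names are illustrative.

```python
from fractions import Fraction


def beta_posterior(prior_a, prior_b, heads, tails):
    """Conjugate update: add the observed counts to the prior parameters."""
    return prior_a + heads, prior_b + tails


def beta_mode(a, b):
    """Mode of Beta(a, b), defined here only for a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula needs a > 1 and b > 1")
    return Fraction(a - 1, a + b - 2)


heads, tails = 3, 2
theta_mle = Fraction(heads, heads + tails)

# Hypothetical informative prior centred on 1/2.
post_a, post_b = beta_posterior(5, 5, heads, tails)
theta_map = beta_mode(post_a, post_b)

# Uniform prior Beta(1, 1): the posterior mode coincides with the MLE.
flat_a, flat_b = beta_posterior(1, 1, heads, tails)
theta_map_flat = beta_mode(flat_a, flat_b)

print(theta_mle, theta_map, theta_map_flat)
```

With the assumed Beta(5, 5) prior the MAP estimate is 7/13, pulled from the MLE 3/5 toward 1/2; with the uniform prior the MAP estimate equals the MLE.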

Beta distribution

Beta(α, β) becomes more concentrated as the values of α, β increase

Beta conjugate prior

As n = αH + αT increases, i.e. as we get more samples, the effect of the prior is “washed out”
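
The washing-out effect can be seen numerically. The sketch below assumes a Beta prior with parameters (5, 5), a hypothetical choice, and uses the posterior-mode formula for a Beta prior, MAP = (αH + βH - 1) / (αH + αT + βH + βT - 2). It keeps the fraction of heads fixed at 3/5 while the number of flips grows and records the gap between the MAP and MLE estimates.

```python
from fractions import Fraction

PRIOR_HEADS, PRIOR_TAILS = 5, 5  # hypothetical Beta prior centred on 1/2


def map_estimate(heads, tails):
    """Posterior mode under the assumed Beta(PRIOR_HEADS, PRIOR_TAILS) prior."""
    return Fraction(heads + PRIOR_HEADS - 1,
                    heads + tails + PRIOR_HEADS + PRIOR_TAILS - 2)


def mle_estimate(heads, tails):
    """Frequency of heads."""
    return Fraction(heads, heads + tails)


gaps = []
for scale in (1, 10, 100):          # n = 5, 50, 500 flips, always 60% heads
    heads, tails = 3 * scale, 2 * scale
    gaps.append(abs(map_estimate(heads, tails) - mle_estimate(heads, tails)))

print(gaps)  # the gap shrinks as n grows
```

Under these assumptions the gap falls from 4/65 at n = 5 to 2/145 at n = 50 and 1/635 at n = 500.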

From Binomial to Multinomial

Example: Dice roll problem (6 outcomes instead of 2)

Likelihood is ~ Multinomial(θ = {θ1, θ2, … , θk})

If prior is Dirichlet distribution with parameters β1, …, βk,

P(θ) = ∏_{i=1..k} θ_i^{β_i - 1} / B(β1, …, βk) ~ Dirichlet(β1, …, βk)

Then posterior is Dirichlet distribution

P(θ | D) ~ Dirichlet(β1 + α1, …, βk + αk), where α_i is the number of times outcome i was observed

For Multinomial, conjugate prior is Dirichlet distribution.

http://en.wikipedia.org/wiki/Dirichlet_distribution
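
For the dice example, the conjugate update works the same way as for the coin: observed counts are added to the prior parameters. The sketch below uses hypothetical prior parameters and roll counts, and takes the MAP estimate to be the Dirichlet mode, (a_i - 1) / (sum of a - k), which assumes every posterior parameter exceeds 1.

```python
from fractions import Fraction


def dirichlet_posterior(prior, counts):
    """Conjugate update: add each outcome's count to its prior parameter."""
    return [b + c for b, c in zip(prior, counts)]


def dirichlet_mode(params):
    """Mode of a Dirichlet distribution; requires every parameter > 1."""
    if any(a <= 1 for a in params):
        raise ValueError("mode formula needs all parameters > 1")
    total = sum(params) - len(params)
    return [Fraction(a - 1, total) for a in params]


prior = [2, 2, 2, 2, 2, 2]        # hypothetical symmetric prior over six faces
counts = [3, 1, 0, 2, 4, 2]       # hypothetical tallies from 12 rolls

posterior = dirichlet_posterior(prior, counts)
theta_map = dirichlet_mode(posterior)

print(posterior)
print(theta_map)
```

Note that the face that was never rolled still gets a non-zero MAP probability (1/18 here), which is one practical reason to use a prior with small samples.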

Conjugate prior for Gaussian?

Conjugate prior on mean: Gaussian

Conjugate prior on covariance matrix: Inverse Wishart

Bayesians vs. Frequentists

“You are no good when sample is small”

“You give a different answer for different priors”