Why I am a Bayesian (and why you should become one, too)

or

Classical statistics considered harmful

Kevin Murphy
UBC CS & Stats

9 February 2005

Where does the title come from?

• “Why I am not a Bayesian”, Glymour, 1981

• “Why Glymour is a Bayesian”, Rosenkrantz, 1983

• “Why isn’t everyone a Bayesian?”, Efron, 1986

• “Bayesianism and causality, or, why I am only a half-Bayesian”, Pearl, 2001

Many other such philosophical essays…

Frequentist vs Bayesian

Frequentist:

• Prob = objective relative frequencies

• Params are fixed unknown constants, so cannot write e.g. P(θ=0.5|D)

• Estimators should be good when averaged across many trials

Bayesian:

• Prob = degrees of belief (uncertainty)

• Can write P(anything|D)

• Estimators should be good for the available data

Source: “All of statistics”, Larry Wasserman

Outline

• Hypothesis testing – Bayesian approach

• Hypothesis testing – classical approach

• What’s wrong with the classical approach?

Coin flipping

HHTHT

HHHHH

What process produced these sequences?

The following slides are from Tenenbaum & Griffiths

Hypotheses in coin flipping

• Fair coin, P(H) = 0.5

• Coin with P(H) = p

• Markov model

• Hidden Markov model

• ...

Statistical models: describe processes by which D could be generated

D = HHTHT

Hypotheses in coin flipping

• Fair coin, P(H) = 0.5

• Coin with P(H) = p

• Markov model

• Hidden Markov model

• ...

Generative models: describe processes by which D could be generated

D = HHTHT

Representing generative models

• Graphical model notation
  – Pearl (1988), Jordan (1998)

• Variables are nodes, edges indicate dependency

• Directed edges show causal process of data generation

[Figure: the data HHTHT as observed nodes d1 ... d5, and graphical models over nodes d1 d2 d3 d4 for the fair coin, P(H) = 0.5, and for the Markov model]

Models with latent structure

• Not all nodes in a graphical model need to be observed

• Some variables reflect latent structure, used in generating D but unobserved

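[Figure: the data HHTHT as observed nodes d1 ... d5; hidden Markov model with latent nodes s1 s2 s3 s4 and observed nodes d1 d2 d3 d4; model P(H) = p with latent node p and observed nodes d1 d2 d3 d4]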

How do we select the “best” model?

Bayes’ rule

$$p(h \mid d) = \frac{p(d \mid h)\, p(h)}{\sum_{h' \in H} p(d \mid h')\, p(h')}$$

• p(h|d): posterior probability

• p(d|h): likelihood

• p(h): prior probability

• Denominator: sum over space of hypotheses H
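Bayes' rule over a finite hypothesis space can be sketched in a few lines of Python. This is only an illustration: the function name `posterior` and the two-hypothesis example (data d = HH, equal priors) are assumptions made here, not part of the original slides.

```python
def posterior(priors, likelihoods):
    """Return p(h | d) for every hypothesis h, given p(h) and p(d | h)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # sum over the space of hypotheses
    if evidence == 0:
        raise ValueError("data has zero probability under every hypothesis")
    return {h: joint[h] / evidence for h in joint}


# Hypothetical example: data d = HH, two hypotheses with equal priors.
priors = {"fair": 0.5, "always_heads": 0.5}
likelihoods = {"fair": 0.25, "always_heads": 1.0}  # p(HH | h)
post = posterior(priors, likelihoods)
print(post)
```

The denominator is the same for every hypothesis, which is why only the products prior × likelihood matter when hypotheses are compared.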

The origin of Bayes’ rule

• A simple consequence of using probability to represent degrees of belief

• For any two random variables:

$$p(A \,\&\, B) = p(A)\, p(B \mid A)$$

$$p(A \,\&\, B) = p(B)\, p(A \mid B)$$

$$p(B)\, p(A \mid B) = p(A)\, p(B \mid A)$$

$$p(A \mid B) = \frac{p(A)\, p(B \mid A)}{p(B)}$$

Why represent degrees of belief with probabilities?

• Good statistics
  – consistency, and worst-case error bounds.

• Cox Axioms
  – necessary to cohere with common sense

• “Dutch Book” + Survival of the Fittest
  – if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.

• Provides a theory of incremental learning
  – a common currency for combining prior knowledge and the lessons of experience.

Hypotheses in Bayesian inference

• Hypotheses H refer to processes that could have generated the data D

• Bayesian inference provides a distribution over these hypotheses, given D

• P(D|H) is the probability of D being generated by the process identified by H

• Hypotheses H are mutually exclusive: only one process could have generated D

Coin flipping

• Comparing two simple hypotheses
  – P(H) = 0.5 vs. P(H) = 1.0

• Comparing simple and complex hypotheses
  – P(H) = 0.5 vs. P(H) = p

Comparing two simple hypotheses

• Contrast simple hypotheses:
  – H1: “fair coin”, P(H) = 0.5
  – H2: “always heads”, P(H) = 1.0

• Bayes’ rule:

$$P(H \mid D) = \frac{P(H)\, P(D \mid H)}{P(D)}$$

• With two hypotheses, use odds form

Bayes’ rule in odds form

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

Posterior odds = Bayes factor (likelihood ratio) × prior odds

Data = HHTHT

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

D: HHTHT
H1, H2: “fair coin”, “always heads”

P(D|H1) = 1/2^5, P(H1) = 999/1000

P(D|H2) = 0, P(H2) = 1/1000

P(H1|D) / P(H2|D) = infinity

Data = HHHHH

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

D: HHHHH
H1, H2: “fair coin”, “always heads”

P(D|H1) = 1/2^5, P(H1) = 999/1000

P(D|H2) = 1, P(H2) = 1/1000

P(H1|D) / P(H2|D) ≈ 30

Data = HHHHHHHHHH

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

D: HHHHHHHHHH
H1, H2: “fair coin”, “always heads”

P(D|H1) = 1/2^10, P(H1) = 999/1000

P(D|H2) = 1, P(H2) = 1/1000

P(H1|D) / P(H2|D) ≈ 1
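The three posterior-odds calculations above can be reproduced with a short Python sketch. The function names are illustrative assumptions; the prior P(H1) = 999/1000 is the one used on the slides. Exact fractions are used so that the only rounding happens at the end.

```python
from fractions import Fraction


def likelihood(data, p_heads):
    """Probability of an exact H/T sequence when each flip is heads with probability p_heads."""
    result = Fraction(1)
    for flip in data:
        result *= p_heads if flip == "H" else 1 - p_heads
    return result


def posterior_odds(data, prior_h1=Fraction(999, 1000)):
    """Posterior odds of H1 (fair coin) against H2 (always heads)."""
    prior_odds = prior_h1 / (1 - prior_h1)
    like_h1 = likelihood(data, Fraction(1, 2))
    like_h2 = likelihood(data, Fraction(1))
    if like_h2 == 0:
        return float("inf")  # a single tail rules out "always heads"
    return float(prior_odds * like_h1 / like_h2)


for data in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    print(data, posterior_odds(data))
```

For HHHHH the exact value is 999/32, roughly 31, which the slide rounds to 30; for ten heads it is 999/1024, just under 1.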

Coin flipping

• Comparing two simple hypotheses
  – P(H) = 0.5 vs. P(H) = 1.0

• Comparing simple and complex hypotheses
  – P(H) = 0.5 vs. P(H) = p

Comparing simple and complex hypotheses

• Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?

[Figure: graphical model for the fair coin, P(H) = 0.5, over nodes d1 d2 d3 d4, vs. graphical model for P(H) = p, with latent node p over nodes d1 d2 d3 d4]

• P(H) = p is more complex than P(H) = 0.5 in two ways:
  – P(H) = 0.5 is a special case of P(H) = p
  – for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5

Comparing simple and complex hypotheses

[Figure: plot with vertical axis “Probability”, built up over three slides; annotations: HHHHH, p = 1.0 and HHTHT, p = 0.6]

• P(H) = p is more complex than P(H) = 0.5 in two ways:
  – P(H) = 0.5 is a special case of P(H) = p
  – for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5

• How can we deal with this?
  – frequentist: hypothesis testing
  – information theorist: minimum description length
  – Bayesian: just use probability theory!
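The second point (some p always explains the observed sequence at least as well as p = 0.5) can be checked numerically. The sketch below assumes the best p is the observed fraction of heads, which is the standard maximum-likelihood estimate for a Bernoulli sequence; the function names are illustrative.

```python
def sequence_likelihood(data, p):
    """P(D | p) for an exact sequence of 'H'/'T' flips."""
    n_heads = data.count("H")
    n_tails = data.count("T")
    return p ** n_heads * (1 - p) ** n_tails


def best_p(data):
    """Maximum-likelihood value of p: the observed fraction of heads."""
    return data.count("H") / len(data)


for data in ["HHHHH", "HHTHT"]:
    p_hat = best_p(data)
    print(data, p_hat, sequence_likelihood(data, p_hat), sequence_likelihood(data, 0.5))
```

This reproduces the annotations p = 1.0 for HHHHH and p = 0.6 for HHTHT, and shows why raw likelihood alone cannot decide between the fair coin and the more flexible model.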

Comparing simple and complex hypotheses

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

Computing P(D|H1) is easy:

$$P(D \mid H_1) = 1/2^N$$

Compute P(D|H2) by averaging over p:

$$P(D \mid H_2) = \int_0^1 P(D \mid p)\, P(p)\, dp$$

Here P(D|H2) is the marginal likelihood, P(D|p) is the likelihood, and P(p) is the prior.

Likelihood and prior

• Likelihood:

$$P(D \mid p) = p^{N_H} (1-p)^{N_T}$$

  – N_H: number of heads
  – N_T: number of tails

• Prior:

$$P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$$ ?

A simple method of specifying priors

• Imagine some fictitious trials, reflecting a set of previous experiences
  – strategy often used with neural networks

• e.g., F ={1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair

• In fact, this is a sensible statistical idea...

Likelihood and prior

• Likelihood:

$$P(D \mid p) = p^{N_H} (1-p)^{N_T}$$

  – N_H: number of heads
  – N_T: number of tails

• Prior:

$$P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$$

  – F_H: fictitious observations of heads
  – F_T: fictitious observations of tails
  – this is a Beta(F_H, F_T) distribution (pseudo-counts)

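Posterior ∝ prior × likelihood

• Prior

$$P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$$

• Likelihood

$$P(D \mid p) = p^{N_H} (1-p)^{N_T}$$

• Posterior

$$P(p \mid D) \propto p^{N_H + F_H - 1} (1-p)^{N_T + F_T - 1}$$

Same form!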

Conjugate priors

• Exist for many standard distributions
  – formula for exponential family conjugacy

• Define prior in terms of fictitious observations

• Beta is conjugate to Bernoulli (coin-flipping)

[Figure: Beta priors for FH = FT = 1, FH = FT = 3, and FH = FT = 1000]
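Because the Beta prior is conjugate to the Bernoulli likelihood, updating amounts to adding observed counts to the fictitious ones. A minimal sketch, using the three prior strengths shown in the figure and the data HHHHH from earlier slides (the function names are assumptions made for illustration):

```python
def beta_update(f_heads, f_tails, data):
    """Posterior pseudo-counts after observing a sequence of 'H'/'T' flips."""
    return f_heads + data.count("H"), f_tails + data.count("T")


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


for f in (1, 3, 1000):
    a, b = beta_update(f, f, "HHHHH")
    print(f"FH = FT = {f}: posterior Beta({a}, {b}), mean {beta_mean(a, b):.4f}")
```

With FH = FT = 1000 the posterior mean barely moves from 0.5 after five heads, which is the "strong expectation that any new coin will be fair" in numerical form.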

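Normalizing constants

• Prior

$$P(p) = \frac{1}{B(F_H, F_T)}\, p^{F_H - 1} (1-p)^{F_T - 1}$$

• Normalizing constant for Beta distribution

$$B(F_H, F_T) = \int_0^1 p^{F_H - 1} (1-p)^{F_T - 1}\, dp = \frac{\Gamma(F_H)\, \Gamma(F_T)}{\Gamma(F_H + F_T)}$$

• Posterior

$$P(p \mid D) = \frac{1}{B(N_H + F_H, N_T + F_T)}\, p^{N_H + F_H - 1} (1-p)^{N_T + F_T - 1}$$

• Hence marginal likelihood is

$$P(D) = \frac{B(N_H + F_H, N_T + F_T)}{B(F_H, F_T)}$$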

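Comparing simple and complex hypotheses

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

Computing P(D|H1) is easy (likelihood for H1):

$$P(D \mid H_1) = 1/2^N$$

Compute P(D|H2) by averaging over p (marginal likelihood, or “evidence”, for H2):

$$P(D \mid H_2) = \int_0^1 P(D \mid p)\, P(p)\, dp$$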

Marginal likelihood for H1 and H2

[Figure: plot with vertical axis “Probability”]

Marginal likelihood is an average over all values of p
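As a rough numerical illustration of the marginal likelihood, the sketch below compares H1 (fair coin) with H2 (unknown p) for the two example sequences. It assumes a Beta(F_H, F_T) prior on p under H2, defaulting to the uniform prior F_H = F_T = 1, and uses the standard closed form of the Beta integral; the function names are illustrative.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def marginal_likelihood(n_heads, n_tails, f_heads=1.0, f_tails=1.0):
    """P(D | H2) for an exact sequence, with p ~ Beta(f_heads, f_tails)."""
    return exp(log_beta(n_heads + f_heads, n_tails + f_tails) - log_beta(f_heads, f_tails))


def bayes_factor_fair_vs_unknown(data, f_heads=1.0, f_tails=1.0):
    """P(D | H1) / P(D | H2): values above 1 favor the fair coin."""
    n_heads, n_tails = data.count("H"), data.count("T")
    like_h1 = 0.5 ** len(data)
    return like_h1 / marginal_likelihood(n_heads, n_tails, f_heads, f_tails)


for data in ["HHTHT", "HHHHH"]:
    print(data, bayes_factor_fair_vs_unknown(data))
```

Under the uniform prior, HHTHT favors the fair coin (15/8) while HHHHH favors the flexible model (3/16), even though the flexible model can always match or beat the fair coin on maximum likelihood.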

Sensitivity to hyper-parameters

Bayesian model selection

• Simple and complex hypotheses can be compared directly using Bayes’ rule
  – requires summing over latent variables

• Complex hypotheses are penalized for their greater flexibility: “Bayesian Occam’s razor”

• Maximum likelihood cannot be used for model selection (always prefers hypothesis with largest number of parameters)

Outline

• Hypothesis testing – Bayesian approach

• Hypothesis testing – classical approach

• What’s wrong with the classical approach?

Example: Belgian euro-coins

• A Belgian euro spun N=250 times came up heads X=140.

• “It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%” – Barry Blight, LSE (reported in Guardian, 2002)

Source: MacKay exercise 3.15

Classical hypothesis testing

• Null hypothesis H0, e.g. θ = 0.5 (unbiased coin)

• For classical analysis, don’t need to specify alternative hypothesis, but later we will use H1: θ ≠ 0.5

• Need a decision rule that maps data D to accept/reject of H0.

• Define a scalar measure of deviance d(D) from the null hypothesis, e.g., N_h or χ²

P-values

• Define p-value of threshold τ as

$$\text{pval}(\tau) = P(d(D) \ge \tau \mid H_0) = P(D \in R \mid H_0), \quad R = \{D : d(D) \ge \tau\}$$

• Intuitively, p-value of data is probability of getting data at least that extreme given H0

• Usually choose τ so that false rejection rate of H0 is below significance level α = 0.05

• Often use asymptotic approximation to distribution of d(D) under H0 as N → ∞

P-value for euro coins

• N = 250 trials, X=140 heads

• P-value is “less than 7%”

• If N=250 and X=141, pval = 0.0497, so we can reject the null hypothesis at the significance level of 5%.

• This does not mean P(H0|D)=0.07!

MATLAB, with n = 250: pval = (1 - binocdf(139, n, 0.5)) + binocdf(110, n, 0.5)
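The same two-sided p-value can be computed exactly in Python using only the standard library. This is a sketch that mirrors the MATLAB expression above; the helper names are assumptions, and the symmetric two-tail rule is only appropriate for the null hypothesis p = 0.5.

```python
from math import comb


def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p), by exact summation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def two_sided_p_value(x, n):
    """P-value for H0: p = 0.5, counting outcomes at least as far from n/2 as x."""
    upper = max(x, n - x)
    lower = n - upper
    tail_high = 1.0 - binom_cdf(upper - 1, n)  # P(X >= upper)
    tail_low = binom_cdf(lower, n)             # P(X <= lower)
    return min(1.0, tail_high + tail_low)


print(two_sided_p_value(140, 250))  # the euro-coin data
print(two_sided_p_value(141, 250))  # one more head
```

One extra head moves the result from just under 7% to just under 5%, which is the point made on the slide about the 5% threshold.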

Bayesian analysis of euro-coin

• Assume P(H0)=P(H1)=0.5

• Assume P(p) ~ Beta(α,α)

• Setting α=1 yields a uniform (non-informative) prior.

Bayesian analysis of euro-coin

• If α=1, the Bayes factor is B = P(D|H1)/P(D|H0) ≈ 0.48, so H0 (unbiased) is (slightly) more probable than H1 (biased).

• By varying α over a large range, the best we can do is make B=1.9, which does not strongly support the biased coin hypothesis.

• Other priors yield similar results.

• Bayesian analysis contradicts classical analysis.
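The Bayes factor for the euro-coin data can be sketched as follows, assuming p ~ Beta(α, α) under H1 and p = 0.5 under H0, with B defined as P(D|H1)/P(D|H0). The grid of α values is an arbitrary choice made for this illustration, and the function names are hypothetical.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bayes_factor_biased_vs_fair(n_heads, n_tails, alpha):
    """B = P(D | H1) / P(D | H0) with p ~ Beta(alpha, alpha) under H1."""
    log_evidence_h1 = log_beta(n_heads + alpha, n_tails + alpha) - log_beta(alpha, alpha)
    log_evidence_h0 = (n_heads + n_tails) * log(0.5)
    return exp(log_evidence_h1 - log_evidence_h0)


b_uniform = bayes_factor_biased_vs_fair(140, 110, 1.0)

# Assumed search grid: alpha = 1, 2, ..., 500.
alphas = [float(a) for a in range(1, 501)]
best_alpha = max(alphas, key=lambda a: bayes_factor_biased_vs_fair(140, 110, a))
b_best = bayes_factor_biased_vs_fair(140, 110, best_alpha)

print(b_uniform, best_alpha, b_best)
```

Even the most favorable α on this grid leaves B below 2, so the data that look "significant" at roughly the 7% level give at most weak support to the biased-coin hypothesis.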

Outline

• Hypothesis testing – Bayesian approach

• Hypothesis testing – classical approach

• What’s wrong with the classical approach?
  – Violates likelihood principle
  – Violates stopping rule principle
  – Violates common sense

The likelihood principle

• In order to choose between hypotheses H0 and H1 given observed data, one should ask how likely the observed data are; do not ask questions about data that we might have observed but did not, such as the probability under H0 of results more extreme than the one observed.

• This principle can be proved from two simpler principles called conditionality and sufficiency.

Frequentist statistics violates the likelihood principle

• “The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred.” – Jeffreys, 1961

Another example

• Suppose X ~ N(θ,σ²); we observe x=3

• Compare H0: θ=0 with H1: θ>0

• P-value = P(X ≥ 3|H0)=0.001 (taking σ=1), so reject H0

• Bayesian approach: update P(θ|X) using conjugate analysis; compute Bayes factor to compare H0 and H1

When are P-values valid?

• Suppose X ~ N(θ,σ²); we observe X=x.

• One-sided hypothesis test: H0: θ ≤ θ0 vs H1: θ > θ0

• If P(θ) ∝ 1, then P(θ|x) ~ N(x,σ²), so

$$P(H_0 \mid x) = P(\theta \le \theta_0 \mid x) = \Phi\left(\frac{\theta_0 - x}{\sigma}\right)$$

• P-value is the same in this case, since Gaussian is symmetric in its arguments
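The agreement between the one-sided p-value and the posterior probability of H0 under a flat prior can be checked numerically. The sketch below assumes the earlier example values x = 3, θ0 = 0 and σ = 1; the function names are illustrative.

```python
from math import erfc, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * erfc(-z / sqrt(2.0))


def one_sided_p_value(x, theta0, sigma):
    """P(X >= x | theta = theta0) for X ~ N(theta0, sigma^2)."""
    return 0.5 * erfc((x - theta0) / (sigma * sqrt(2.0)))


def posterior_prob_h0(x, theta0, sigma):
    """P(theta <= theta0 | x) when the prior on theta is flat, so theta | x ~ N(x, sigma^2)."""
    return normal_cdf((theta0 - x) / sigma)


p_value = one_sided_p_value(3.0, 0.0, 1.0)
post_h0 = posterior_prob_h0(3.0, 0.0, 1.0)
print(p_value, post_h0)
```

The two numbers coincide only because of the flat prior and the symmetry of the Gaussian; with a point null such as θ = 0 exactly, the comparison needs a Bayes factor instead.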

Outline

• Hypothesis testing – Bayesian approach

• Hypothesis testing – classical approach

• What’s wrong with the classical approach?
  – Violates likelihood principle
  – Violates stopping rule principle
  – Violates common sense

Stopping rule principle

• Inferences you make should only depend on the observed data, not the reasons why this data was collected.

• If you look at your data to decide when to stop collecting, this should not change any conclusions you draw.

• Follows from likelihood principle.

Frequentist statistics violates stopping rule principle

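• Observe D=TTTHTTTTHTTH (3 heads in 12 tosses). Is there evidence of bias (Pt > Ph)?

• Let X=3 heads be the observed random variable and N=12 trials be a fixed constant. Define H0: Ph=0.5. Then, at the 5% level, there is no significant evidence of bias:

$$P(X \le 3 \mid N = 12, H_0) = \sum_{x=0}^{3} \binom{12}{x} 0.5^{12} \approx 0.073$$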

Frequentist statistics violates stopping rule principle

• Suppose the data was generated by tossing coins until we got X=3 heads.

• Now X=3 heads is a fixed constant and N=12 is a random variable. Now there is significant evidence of bias!

$$P(N \ge 12 \mid X = 3, H_0) = \sum_{n=12}^{\infty} \binom{n-1}{x-1} 0.5^{n} \approx 0.033$$

First n-1 trials contain x-1 heads; last trial always heads
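Both p-values for the 3-heads-in-12-tosses data can be computed exactly. The sketch below assumes a fair coin under H0 and uses the fact that N ≥ n happens exactly when the first n-1 tosses contain at most x-1 heads; the function names are illustrative.

```python
from math import comb


def binomial_p_value(x, n):
    """P(X <= x) when n fair tosses are fixed in advance."""
    return sum(comb(n, k) for k in range(x + 1)) / 2**n


def negative_binomial_p_value(x, n):
    """P(N >= n) when a fair coin is tossed until the x-th head appears."""
    # N >= n exactly when the first n - 1 tosses contain at most x - 1 heads.
    return sum(comb(n - 1, k) for k in range(x)) / 2 ** (n - 1)


print(binomial_p_value(3, 12))           # fixed N = 12
print(negative_binomial_p_value(3, 12))  # stop at X = 3 heads
```

The same twelve tosses give about 0.073 under one stopping rule and about 0.033 under the other, so the verdict at the 5% level depends on the experimenter's intentions rather than on the data alone.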

Ignoring stopping criterion can mislead classical estimators

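• Let Xi ~ Bernoulli(θ)

• Max lik. estimator: $\hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} X_i$

• MLE is unbiased: $E[\hat{\theta}] = \theta$

• Toss coin; if head, stop, else toss second coin. P(H)=θ, P(TH)=(1-θ)θ, P(TT)=(1-θ)².

• Now MLE is biased! $E[\hat{\theta}] = \theta \cdot 1 + (1-\theta)\theta \cdot \tfrac{1}{2} + (1-\theta)^2 \cdot 0 = \theta + \tfrac{1}{2}\theta(1-\theta)$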

• Many classical rules for assessing significance when complex stopping rules are used.
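The bias introduced by the stopping rule can be verified by enumerating outcomes. This sketch compares the expected MLE under the "stop after the first head, at most two tosses" rule with the expected MLE for two tosses fixed in advance; the function names are assumptions made for illustration.

```python
from itertools import product


def expected_mle_fixed(theta, n=2):
    """E[MLE] when exactly n tosses are always made."""
    total = 0.0
    for flips in product([1, 0], repeat=n):  # 1 = head, 0 = tail
        prob = 1.0
        for f in flips:
            prob *= theta if f == 1 else 1 - theta
        total += prob * sum(flips) / n
    return total


def expected_mle_with_stopping(theta):
    """E[MLE] when we stop after a head on the first toss, else toss once more."""
    outcomes = [
        (theta, 1.0),                # H: one toss, one head
        ((1 - theta) * theta, 0.5),  # TH: two tosses, one head
        ((1 - theta) ** 2, 0.0),     # TT: two tosses, no heads
    ]
    return sum(prob * mle for prob, mle in outcomes)


print(expected_mle_fixed(0.5), expected_mle_with_stopping(0.5))
```

For θ = 0.5 the fixed design gives an expected estimate of 0.5, while the stopping rule inflates it to 0.625.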

Outline

• Hypothesis testing – Bayesian approach

• Hypothesis testing – classical approach

• What’s wrong with the classical approach?
  – Violates likelihood principle
  – Violates stopping rule principle
  – Violates common sense

Confidence intervals

• An interval (θmin(D), θmax(D)) is a 95% CI if θ lies inside this interval 95% of the time across repeated draws D ~ P(·|θ)

• This does not mean P(θ ∈ CI|D) = 0.95!

Source: MacKay sec 37.3

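Example

• Draw 2 integers x1, x2 from

$$P(x \mid \theta) = \begin{cases} 0.5 & x = \theta \\ 0.5 & x = \theta + 1 \\ 0 & \text{otherwise} \end{cases}$$

• If θ=39, we would expect each of (x1,x2) = (39,39), (39,40), (40,39), (40,40) with probability 0.25

• Define confidence interval as

$$[\theta_{\min}(D), \theta_{\max}(D)] = [\min(x_1, x_2), \min(x_1, x_2)]$$

• e.g. (x1,x2)=(40,39), CI=(39,39)

• 75% of the time, this will contain the true θ

CIs violate common sense

• If (x1,x2)=(39,39), then CI=(39,39) at level 75%. But clearly P(θ=39|D)=P(θ=38|D)=0.5

• If (x1,x2)=(39,40), then CI=(39,39), but clearly P(θ=39|D)=1.0.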
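The coverage claim and the two posteriors can be checked by enumeration. The sketch assumes each draw equals θ or θ+1 with probability 0.5, the interval is [min(x1,x2), min(x1,x2)], and the prior over integer θ is flat (an assumption needed for the posterior); the function names are illustrative.

```python
from itertools import product


def confidence_interval(x1, x2):
    """The interval [min(x1, x2), min(x1, x2)] from the example."""
    low = min(x1, x2)
    return (low, low)


def coverage(theta):
    """Fraction of the four equally likely samples whose interval contains theta."""
    outcomes = list(product([theta, theta + 1], repeat=2))
    hits = 0
    for x1, x2 in outcomes:
        low, high = confidence_interval(x1, x2)
        if low <= theta <= high:
            hits += 1
    return hits / len(outcomes)


def posterior(x1, x2):
    """P(theta | x1, x2) under a flat prior over the integers."""
    candidates = [
        t
        for t in range(min(x1, x2) - 1, max(x1, x2) + 1)
        if all(x in (t, t + 1) for x in (x1, x2))
    ]
    # Every consistent theta gives the sample the same likelihood (0.25).
    return {t: 1 / len(candidates) for t in candidates}


print(coverage(39))       # 0.75
print(posterior(39, 39))  # {38: 0.5, 39: 0.5}
print(posterior(39, 40))  # {39: 1.0}
```

The 75% figure is a property of the procedure averaged over samples; for any particular sample the posterior probability that θ lies in the interval is either 0.5 or 1.0.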

What’s wrong with the classical approach?

• Violates likelihood principle

• Violates stopping rule principle

• Violates common sense

What’s right about the Bayesian approach?

• Simple and natural

• Optimal mechanism for reasoning under uncertainty

• Generalization of Aristotelian logic that reduces to deductive logic if our hypotheses are either true or false

• Supports interesting (human-like) kinds of learning

Bayesian humor

• “A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.”