
Bayesian Statistics

Adrian Raftery and Jeff Gill

One-day course for the American Sociological Association

August 15, 2002

Bayes Course, ASA Meeting, August 2002. © Adrian E. Raftery 2002

Outline

1. Bayes’s theorem

2. Bayesian estimation
   • One-parameter case
   • Conjugate priors
   • "Noninformative" priors
   • Multiparameter case
   • Integrating out parameters
   • Asymptotic approximations
   • When is Bayes useful?
   • Example: regression in macrosociology

3. Bayesian testing and model selection
   • Bayesian testing: Bayes factors
   • Bayesian model selection: posterior model probabilities
   • Bayesian model averaging: accounting for model uncertainty
   • Examples

4. Further reading


Purposes of Statistics

• Scientific inference:
  – Find causes
  – Quantify effects
  – Compare competing (causal) theories

• Prediction:
  – Policy-making
  – Forecasting (e.g. future population, results of legislation)
  – Control of processes

• Decision-making


Standard (frequentist) Statistics

• Estimation is based on finding a good point estimate, and assessing its performance under repetitions of the experiment (or survey) that gave rise to the data.

• The best point estimate is often the maximum likelihood estimator. In large samples, for regular models, this is the most efficient estimator (i.e. the one with the smallest mean squared error). In relatively simple models, the MLE is often the "obvious" estimator. For example, for estimating the mean of the normal distribution, the MLE is just the sample mean.

• For testing one hypothesis against another one within which it is nested (i.e. of which it is a special case), the best test is often the likelihood ratio test.

• Standard statistical methods for testing nonnested models against one another, or for choosing among many models, are not well developed.


Bayesian Statistics

• Based on the idea of expressing uncertainty about the (unknown) state of nature in terms of probability.

• You start with a probability distribution reflecting your current state of knowledge. When new data become available, you update your probability distribution in light of the new data. In a probability framework, there is only one way to do this: via Bayes's theorem.

• This solves many of the technical problems of standard statistics: nonregular models, testing nonnested models, choosing among many models. It also provides a way of incorporating external information (outside the current data set).


• The key idea is subjective probability. The current distribution of the state of nature reflects your opinion. This has been criticized as non-scientific. However, it turns out that when there is a moderate amount of evidence, even people who disagree violently initially end up in substantial agreement, so long as they follow Bayes's theorem. And if there isn't enough evidence, it's reasonable for people who disagreed to start with to go on disagreeing (although not as much as at first).


Bayes’s Theorem: Notation

Bayes's theorem relates to the problem of adjudicating between competing hypotheses given observations.

Suppose A is an event, i.e. something that either happens or doesn't.

Suppose B_1, ..., B_K are other events that form a partition. This means that their union is the certain event (i.e. at least one of them is sure to be the case), and their intersections are empty. Mathematically:

  B_1 ∪ B_2 ∪ ... ∪ B_K = S,

where S is the certain event, and

  B_i ∩ B_j = ∅ for all i ≠ j,

where ∅ is the null event.

B_1, ..., B_K can be thought of as competing hypotheses to explain the event observed, A.


Bayes’s Theorem

Bayes's Theorem: In that case, the conditional probability of B_i given A is

  P(B_i | A) = P(A | B_i) P(B_i) / P(A).

To calculate P(A), we may need a further result, the Law of Total Probability: the overall, or marginal, probability of the event A, P(A), can be expressed in terms of the probabilities of the B_i and the conditional probabilities of A given each of the B_i's, as follows:

  P(A) = Σ_i P(A | B_i) P(B_i).


Bayes’s Theorem: An Example

Example 1: An item is produced in 3 different factories, B_1, B_2, and B_3.

The proportions produced in the 3 factories, and the proportions defective in each, are as follows:

  Factory   % produced   % defective
  B_1           50            2
  B_2           30            3
  B_3           20            4

An item is purchased and found to be defective. This is event A.

What is the probability that it was from factory B_1?

First, we find the overall probability of a defective, P(A), from the Law of Total Probability:

  P(A) = P(A | B_1) P(B_1) + P(A | B_2) P(B_2) + P(A | B_3) P(B_3)
       = (.02)(.50) + (.03)(.30) + (.04)(.20) = .027.

Then, Bayes's theorem tells us the probability that the item was from factory B_1:

  P(B_1 | A) = P(A | B_1) P(B_1) / P(A) = .010 / .027 ≈ .37.

This makes intuitive sense: before we found out that the item was defective, we knew that the probability it was from factory B_1 was .50. Then we found out it was defective. Factory B_1 has a lower rate of defectives than the other two, so finding out that the item was defective made it less likely to be from factory B_1, i.e. to have a probability lower than .50. And, indeed, so it is: .37 instead of .50.

Another Version of Bayes's Theorem:

  P(B_i | A) ∝ P(A | B_i) P(B_i),

where "∝" means "proportional to."

To implement this, we calculate P(A | B_i) P(B_i) for each i, add them up, and then divide by the sum so that they add up to 1 (which they have to, because they're probabilities of a partition).

Example 1 (ctd):

  P(A | B_1) P(B_1) = (.02)(.50) = .010
  P(A | B_2) P(B_2) = (.03)(.30) = .009
  P(A | B_3) P(B_3) = (.04)(.20) = .008
  Sum = P(A) = .027

Then

  P(B_1 | A) = .010 / .027 ≈ .37.
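This multiply-then-normalize recipe is easy to carry out directly. Here is a minimal sketch in Python using the numbers from Example 1 (the dictionary keys are just labels for this illustration):

```python
# Bayes's theorem over a partition: multiply prior by likelihood, then normalize.
priors = {"B1": 0.50, "B2": 0.30, "B3": 0.20}        # P(B_i): % produced
likelihoods = {"B1": 0.02, "B2": 0.03, "B3": 0.04}   # P(A | B_i): % defective

unnormalized = {k: priors[k] * likelihoods[k] for k in priors}
marginal = sum(unnormalized.values())                 # P(A) = 0.027
posterior = {k: v / marginal for k, v in unnormalized.items()}

print(marginal)          # 0.027
print(posterior["B1"])   # about 0.37, as in the example
```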

Another way of looking at this is that B_1, B_2, B_3 are the possible states of nature, and that A is the data (a single datum, strictly). We then use the data to decide how likely the different states of nature are relative to one another. This is the idea that underlies Bayesian statistics.

P(A | B_i) is the probability of the data given the state of nature B_i. This is called the likelihood of B_i.

P(B_1) is the probability that the item was from B_1 before we knew whether or not it was defective, i.e. before we observed the data. This is called the prior probability of B_1.


P(A) is called the marginal probability of the data or, for reasons we will see later, the integrated likelihood.

P(B_i | A) is called the posterior probability of B_i given A.

The set of posterior probabilities

  { P(B_1 | A), P(B_2 | A), P(B_3 | A) }

is called the posterior distribution of the state of nature.

In Bayesian statistics, all inference is based on the posterior distribution.


Bayesian Estimation of One Parameter

Now we consider the situation where the state of nature is a parameter to be estimated, denoted by θ.

For now, we'll just consider the case where θ is one-dimensional, i.e. where there's only one parameter. An example is the mean of a distribution.

This is like the factories and defectives Example 1, but with the difference that the possible states of nature form a continuum, at least approximately, instead of a small number of discrete values. The same basic theory applies, though, with probabilities replaced by probability densities, and sums replaced by integrals.

We assume that for each possible value of θ, we know what the probability of the data D is, p(D | θ). As before, this is called the likelihood.

We also assume that we have a probability density function (pdf), p(θ), that tells us the relative probability of each value of θ before observing the data.


As before, this is called the prior distribution. This can come from prior knowledge. Often it's specified roughly, so that the prior distribution covers the range of plausible values and is fairly flat over that range. We'll see that there is a sense in which the precise form of the prior distribution doesn't matter too much for estimation. We'll give examples in a bit.

Bayes's Theorem for Parameter Estimation:

Version 1: The posterior distribution of θ given data D is given by

  p(θ | D) = p(D | θ) p(θ) / p(D) = (likelihood × prior) / (integrated likelihood),

where

  p(D) = ∫ p(D | θ) p(θ) dθ,

the integral being over all values of θ, is the integrated likelihood.


Version 2:

  p(θ | D) ∝ p(D | θ) p(θ),

i.e. posterior ∝ likelihood × prior.

This gives the posterior distribution only up to a multiplicative constant, but often this is enough, and it avoids the difficulty of evaluating the integrated likelihood (also called the normalizing constant in this context).
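Version 2 is easy to illustrate numerically. The following is a minimal sketch (not from the slides) that evaluates likelihood × prior on a grid of θ values and then normalizes; the Beta(2, 2) prior and the binomial data (7 successes in 10 trials) are made-up illustrative assumptions.

```python
# Grid approximation to "posterior is proportional to likelihood x prior".
import numpy as np
from scipy import stats

theta = np.linspace(0.001, 0.999, 999)       # grid of parameter values
prior = stats.beta.pdf(theta, 2, 2)          # p(theta): assumed Beta(2, 2) prior
likelihood = stats.binom.pmf(7, 10, theta)   # p(D | theta): 7 successes in 10 trials

unnormalized = likelihood * prior
dtheta = theta[1] - theta[0]
posterior = unnormalized / (unnormalized.sum() * dtheta)   # divide by the normalizing constant

print((posterior * dtheta).sum())            # 1.0: it is now a proper density
print(theta[np.argmax(posterior)])           # posterior mode, about 0.67
```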


Example: Normal Mean with Known Variance and One Observation

Example 2 (Box and Tiao 1973): Two physicists, A and B, are trying to estimate a physical constant, μ.

They each have prior views based on their professional experience, their reading of the literature, and so on. We will approximate the prior distribution of μ by a normal distribution,

  μ ~ N(μ_0, σ_0²).

Suppose now that an unbiased method of experimental measurement is available, and that an observation y made by this method approximately follows a normal distribution with mean μ and variance σ², where σ² is known from calibration studies. Then the likelihood is

  y | μ ~ N(μ, σ²).

Then it can be shown that the posterior distribution of μ


given y, p(μ | y), is also a normal distribution,

  μ | y ~ N(μ_1, σ_1²),

with mean

  μ_1 = (τ_0 μ_0 + τ y) / (τ_0 + τ)

and variance such that

  1/σ_1² = τ_0 + τ,

where τ_0 = 1/σ_0² and τ = 1/σ².

The reciprocal of the variance of a distribution is often called its precision, because the bigger the variance, the lower the precision. Thus τ_0 is the prior precision, and τ is the observation precision.

The posterior mean μ_1 is a weighted average of the prior mean μ_0 and the observation y, with the weights being proportional to the associated precisions.


This is an appealing result.

The posterior precision is the sum of the prior and

observation precisions, reflecting the fact that the two

sources of information are pooled together.
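As a quick numerical check of the formulas above, here is a minimal sketch; the prior (μ_0 = 4, σ_0 = 2), the observation y = 6, and the measurement standard deviation σ = 1 are made-up illustrative values, not numbers from the slides.

```python
# Posterior for a normal mean with known variance, from one observation.
mu0, sigma0 = 4.0, 2.0      # prior mean and prior standard deviation (assumed)
y, sigma = 6.0, 1.0         # observation and known measurement sd (assumed)

tau0 = 1.0 / sigma0**2      # prior precision
tau = 1.0 / sigma**2        # observation precision

post_precision = tau0 + tau                          # precisions add
post_mean = (tau0 * mu0 + tau * y) / post_precision  # precision-weighted average
post_sd = post_precision ** -0.5

print(post_mean, round(post_sd, 3))   # 5.6 and 0.894: pulled toward the more precise source
```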


Normal Mean with Multiple Observations

Now suppose that, instead of one measurement y, we have n independent measurements made with the same experimental method, y_1, ..., y_n.

Then y_1, ..., y_n are conditionally independent given μ. This means that, if we knew μ, knowing the value of y_1 would tell us nothing about y_2, and similarly for any pair of y_i values.

Is this true if we don't know μ? Why?

y_1, ..., y_n are also said to be exchangeable.

Then the likelihood is got by multiplying up the likelihoods for the individual y_i's:

  p(y_1, ..., y_n | μ) = p(y_1 | μ) × ... × p(y_n | μ).

It can be shown that this is proportional (as a function of μ) to a normal density with mean ȳ and standard deviation σ/√n.


Then the posterior distribution is again normal,

  μ | y_1, ..., y_n ~ N(μ_n, σ_n²),

with mean

  μ_n = (τ_0 μ_0 + nτ ȳ) / (τ_0 + nτ)

and variance such that

  1/σ_n² = τ_0 + nτ,

where τ_0 = 1/σ_0² and τ = 1/σ².

Thus the posterior mean is again a weighted average of the prior mean and the mean of the data. The weight associated with the mean of the data is proportional to the number of data points. The weight associated with the prior remains constant as the amount of data increases. Thus, with large samples the prior matters very little. This is a very general result for Bayesian statistics, and helps to justify its use.
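The shrinking influence of the prior is easy to see numerically. This minimal sketch reuses the assumed prior from the previous sketch (μ_0 = 4, σ_0 = 2, σ = 1) together with an assumed sample mean ȳ = 6, and varies n:

```python
# How the weight on the prior shrinks as the number of observations n grows.
mu0, sigma0, sigma, ybar = 4.0, 2.0, 1.0, 6.0   # assumed illustrative values
tau0, tau = 1.0 / sigma0**2, 1.0 / sigma**2

for n in (1, 5, 25, 100):
    post_mean = (tau0 * mu0 + n * tau * ybar) / (tau0 + n * tau)
    post_sd = (tau0 + n * tau) ** -0.5
    prior_weight = tau0 / (tau0 + n * tau)
    print(n, round(post_mean, 3), round(post_sd, 3), round(prior_weight, 3))
# As n grows, the posterior mean approaches ybar and the prior's weight goes to 0.
```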


Inference: Summarizing the Posterior Distribution

In Bayesian statistics, the posterior distribution is "all ye know on Earth, and all ye need to know". It tells us the probability that the parameter of interest lies in any interval, given all our current information. A plot of the posterior density is often useful.

Point Estimation: The search for a point estimate is meaningless, except in the context of a specific decision problem (and most decisions don't call for point estimates). A numerical value can be useful for saying where the "center" of the distribution is. The posterior mode (the most likely value) is the most intuitive summary, but often the posterior mean is the most easily available. The posterior mode and mean are usually close together, but not always. Example: estimating the size of a hard-to-find population (the number of homeless, the number of unregistered guns in America, etc.).

Interval Estimation: The most intuitive interval estimate is formed by the lower 2.5th percentile and the upper 97.5th percentile of the posterior distribution for a 95% interval (and similarly for other intervals). There are other proposals in the Bayesian literature, like the highest posterior density region, but in my view these do not have much scientific interest.

Roughly summarizing the posterior distribution: Often, in practice, the posterior mean and posterior standard deviation are reported. These are like the MLE and standard error, and are often close to them numerically. Posterior mean ± 2 posterior standard deviations is a rough 95% confidence interval.
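When the posterior is available as a sample of simulated values (as in the simulation methods discussed later), these summaries are one-liners. A minimal sketch, using a normal sample only as a stand-in for real posterior draws:

```python
# Summarizing a posterior distribution from simulated draws.
import numpy as np

rng = np.random.default_rng(1)
draws = rng.normal(loc=5.6, scale=0.894, size=50_000)   # stand-in posterior sample

post_mean = draws.mean()
post_sd = draws.std()
lo, hi = np.percentile(draws, [2.5, 97.5])              # central 95% posterior interval

print(round(post_mean, 2), round(post_sd, 2))           # about 5.6 and 0.89
print(round(lo, 2), round(hi, 2))                       # about 3.85 to 7.35
print(round(post_mean - 2 * post_sd, 2),
      round(post_mean + 2 * post_sd, 2))                # rough interval: mean ± 2 sd, very similar
```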


Conjugate Priors

In the physical constant example, the prior was normal,

and the posterior was too.

So the data updated the parameters of the prior

distribution, but not its form.

This can be very useful in practical work.

A prior distribution that has this property is called a

conjugate prior.

Often priors of this form are flexible enough to

represent prior knowledge fairly well.

Most priors used in applied Bayesian work are

conjugate.


Examples of Conjugate Priors

Some examples of conjugate priors for one-parameter models:

  Model                         Prior distribution
  Normal with known variance    Normal (for the mean)
  Normal with known mean        Gamma (for the precision, 1/variance)
  Binomial                      Beta
  Poisson                       Gamma
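As an illustration of how painless a conjugate update is, here is a minimal sketch of the Beta-Binomial line of the table; the Beta(1, 1) prior and the data (7 successes, 3 failures) are made-up illustrative values.

```python
# Conjugate updating: Beta prior + binomial data -> Beta posterior.
a, b = 1.0, 1.0               # Beta(1, 1) prior, i.e. uniform on the proportion (assumed)
successes, failures = 7, 3    # assumed data

a_post = a + successes        # conjugacy: the posterior is Beta again,
b_post = b + failures         # with the counts simply added to the prior parameters

post_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(post_mean, 3))   # 8.0 4.0 0.667
```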


“Noninformative” Priors

There have been many efforts to find priors that carry

no information, or “noninformative” priors.

In general, this has turned out to be a modern version of

the Philosopher’s Stone.

There are some very simple problems for which there

are agreed “reference” priors.

One example is the normal mean problem, for which a flat prior,

  p(μ) ∝ 1,

is often used.

This is an “improper” prior, i.e. it does not integrate up

to 1, because it is constant over the whole real line.

Instead, it integrates up to infinity.

Nevertheless, the resulting posterior distribution is

proper.

When there is more than one parameter, though,

“noninformative” priors turn out to be very informative

about some aspects of the problem, in an unexpected

way.


Improper “noninformative” priors can lead to paradoxes

and strange behavior, and should be used with extreme

caution.

The current trend in applied Bayesian statistical work is

towards informative and, if necessary, spread out but

proper prior distributions.


More Than One Parameter

Suppose that we have two parameters in the model, θ_1 and θ_2. One example is the normal distribution (mean and variance).

Then we have a joint prior distribution, p(θ_1, θ_2). Often, parameters are independent a priori.

We also have a joint likelihood, p(D | θ_1, θ_2).

And so we have a joint posterior distribution, exactly as in the one-parameter case:

  p(θ_1, θ_2 | D) ∝ p(D | θ_1, θ_2) p(θ_1, θ_2).

Usually, we're interested in parameters individually. To get the posterior distribution of θ_1 on its own, for example, we must integrate out θ_2, as follows:

  p(θ_1 | D) = ∫ p(θ_1, θ_2 | D) dθ_2.     (1)

This follows from the Law of Total Probability.

(1) is called the marginal posterior distribution of θ_1. We can then summarize the posterior distribution of θ_1 in the same way as when there's only one parameter (posterior mean or mode, posterior standard deviation, posterior percentiles, plot of the posterior density).

The same approach holds when there are more than two parameters (e.g. in regression). Then the integral in (1) is a multiple integral over all the parameters except θ_1.


Integrating Out Other Parameters

Sometimes the integral in (1) can be evaluated analytically (i.e. a mathematical expression can be found for it in terms of θ_1). Often it cannot, especially when there are many parameters. Here are some ways to evaluate it:

• Analytical evaluation: This is the best, if it can be done.

• Asymptotic approximation: Approximately, in large samples, for regular models, the posterior distribution is multivariate normal with mean at the MLE and the same covariance matrix as the MLE, i.e. the inverse of the Fisher information matrix. Then the marginal distribution of each parameter is just normal, with variance equal to the corresponding diagonal element of the inverse Fisher information matrix.

• Direct simulation: Sometimes it is possible to simulate from the posterior distribution directly, even if it is hard to integrate it out. Then you can simulate a big sample, and just strip out the θ_1 values. This gives you a sample from the marginal posterior distribution of θ_1, which can be used to estimate the posterior mean, standard deviation, percentiles, and so on.

  This is the case for the normal distribution with both mean and variance unknown. Then the posterior distribution has the form

    p(θ_1, θ_2 | D) = p(θ_1 | θ_2, D) p(θ_2 | D) = Normal(θ_1 | θ_2, D) × Gamma(θ_2 | D),

  where θ_1 is the mean and θ_2 is the precision (the reciprocal of the variance). This can be simulated using an algorithm such as the following (a code sketch is given after this list). Repeat many times:

  1. Simulate a value of θ_2 from the gamma distribution. This can be done directly using available software.

  2. Simulate a value of θ_1 from the normal distribution, using the value of θ_2 simulated in step 1.

• Markov chain Monte Carlo (MCMC) simulation: See Jeff Gill's lecture this afternoon.
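Here is a minimal sketch of the two-step direct-simulation scheme above for the normal/gamma case. The posterior hyperparameters (a, b, m, k below) are made-up illustrative values, not quantities from the slides.

```python
# Direct simulation from a posterior of the form Gamma (precision) x Normal (mean | precision).
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, 2.0      # assumed shape and rate of the gamma posterior for the precision
m, k = 1.5, 10.0     # assumed center and "effective sample size" for the mean

n_draws = 100_000
theta2 = rng.gamma(shape=a, scale=1.0 / b, size=n_draws)     # step 1: draw the precision
theta1 = rng.normal(loc=m, scale=1.0 / np.sqrt(k * theta2))  # step 2: draw the mean given it

# Strip out the theta1 values: a sample from the marginal posterior of the mean.
print(theta1.mean(), theta1.std())
print(np.percentile(theta1, [2.5, 97.5]))
```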


When is Bayes Better?

We have seen that Bayesian statistics gives very similar results to standard statistics when three conditions hold:

1. The model is regular (i.e., roughly, the MLE is asymptotically normal, which requires, for example, that the likelihood be smooth and that the amount of information about each parameter increase as n → ∞),

2. There's at least a moderate amount of data, and

3. We're doing estimation, rather than testing or model selection.

Bayesian statistics takes more work in standard situations, because you have to assess the prior and investigate sensitivity to it. Thus, when these 3 conditions hold, Bayesian statistics involves more work than standard statistics (mostly MLE and asymptotic standard errors), but yields similar results.


So it doesn't seem too worthwhile in this case. Bayesian statistics can be better in other situations.

• Irregular models: The Bayesian solution is immediate. Bayesian statistics doesn't need regularity conditions to work. Examples include: estimating population size; change-point models; hierarchical models (see Jeff's lecture).

• Not much data: Here we can get bad solutions, and prior information can help a lot. Examples abound in macrosociology.

• Testing and model selection: Here Bayesian solutions seem more general and avoid many difficulties with standard methods (nonnested models, many models, failure to consider power when setting significance levels).


Example: Bayesian Inference in Comparative Research
(Western and Jackman, 1994, APSR)

Problems in comparative research (macrosociology):

• Few cases (e.g. the 23 OECD countries)
• Quite a few parameters in regressions
• Collinearity
• Result: weak inferences


Example: Explaining Union Density

• Data: 20 democratic countries
• Dependent variable: union density
• Independent variables: left government, labor-force size, economic concentration
• Method: linear regression


Bayesian Model Selection

• How probable is a model given the data, conditionally on the set of models considered, M_1, ..., M_K?

• Posterior model probability given data D:

  P(M_k | D) ∝ P(D | M_k) P(M_k).

• Integrated likelihood of a model:

  P(D | M_k) = ∫ P(D | θ_k, M_k) P(θ_k | M_k) dθ_k = ∫ likelihood × prior dθ_k.

  This comes from the Law of Total Probability.


Bayesian Model Selection (ctd)

Posterior odds for M_0 against M_1:

  P(M_0 | D) / P(M_1 | D) = [P(D | M_0) / P(D | M_1)] × [P(M_0) / P(M_1)]
                          = Bayes factor (B_01) × prior odds,

where P(M_k) is the prior probability of M_k (often taken to be equal).
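These relations are just arithmetic once the integrated likelihoods are in hand. A minimal sketch, in which the log integrated likelihoods and the equal prior model probabilities are made-up illustrative values:

```python
# Posterior model probabilities and the Bayes factor from integrated likelihoods.
import math

log_intlik = {"M0": -105.2, "M1": -103.8}   # log P(D | M_k), assumed values
prior = {"M0": 0.5, "M1": 0.5}              # equal prior model probabilities (assumed)

# P(M_k | D) is proportional to P(D | M_k) P(M_k); normalize to sum to 1.
unnorm = {k: math.exp(log_intlik[k]) * prior[k] for k in prior}
total = sum(unnorm.values())
post = {k: round(v / total, 3) for k, v in unnorm.items()}

bayes_factor_01 = math.exp(log_intlik["M0"] - log_intlik["M1"])   # B_01
print(post)                        # {'M0': 0.198, 'M1': 0.802}
print(round(bayes_factor_01, 3))   # 0.247; with equal priors this equals the posterior odds
```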


Properties

Theorem 1: For two nested models, model choice based on the Bayes factor minimizes the Total Error Rate (= Type I error rate + Type II error rate), on average over data sets drawn from the prior.

• Different interpretation of the prior: the set of parameter values over which we would like good performance (cf. simulation studies).


Bayesian Model Averaging

Suppose Δ is a quantity of interest which has the same interpretation over the models considered, e.g. it is an observable quantity that can be predicted, at least asymptotically.

Then if there are several models, its posterior distribution is a weighted average over the models:

  p(Δ | D) = Σ_k p(Δ | M_k, D) P(M_k | D).


Estimation via Bayesian Model Averaging

• Estimation: The BMA estimate of a parameter θ is

  θ̂_BMA = Σ_k θ̂_k P(M_k | D),

where θ̂_k denotes the posterior mean of θ under model M_k (often ≈ the MLE).

• Theorem 2: θ̂_BMA minimizes MSE among point estimators, where MSE is calculated for data sets drawn from the prior.
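The BMA estimate is simply a probability-weighted average of the per-model estimates. A minimal sketch, where the three models, their estimates, and their posterior probabilities are all made-up illustrative values:

```python
# BMA point estimate: average the per-model estimates, weighted by P(M_k | D).
estimates = {"M1": 0.00, "M2": 0.42, "M3": 0.55}    # theta-hat under each model (assumed)
post_prob = {"M1": 0.15, "M2": 0.60, "M3": 0.25}    # posterior model probabilities (assumed)

theta_bma = sum(estimates[k] * post_prob[k] for k in estimates)
print(theta_bma)   # about 0.39: pulled toward 0 by the model that omits the effect
```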


Comments on Bayesian Model Selection/Averaging

• Deals easily with multiple (> 2) models.
• Deals easily with nonnested models.
• For significance tests, provides a way of choosing the size of the test to balance power and significance; the threshold increases slowly with n.
• Deals with model uncertainty (data mining).
• Point null hypotheses approximate interval nulls, so long as the width is less than about 1/2 standard error (Berger and Delampady 1987).


The BIC Approximation

• BIC = 2 log(maximized likelihood) − (number of parameters) × log(n).

• Theorem 3:

  2 log B_10 = BIC + O(1),

i.e. BIC approximates twice the log Bayes factor to within O(1), no matter what the prior is. The O(1) term is unimportant in large samples, so 2 log B_10 / BIC → 1, and BIC is consistent. (Cox and Hinkley 1978, Schwarz 1978)
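A minimal sketch of the BIC quantity just defined, comparing two nested models; the log-likelihoods, parameter counts, and sample size are made-up illustrative values:

```python
# BIC = 2 * (maximized log-likelihood) - (number of parameters) * log(n);
# the BIC difference between two models approximates 2 log B_10.
import math

def bic(loglik, n_params, n):
    return 2.0 * loglik - n_params * math.log(n)

n = 500                                        # assumed sample size
bic_0 = bic(loglik=-742.1, n_params=3, n=n)    # smaller model M_0 (assumed fit)
bic_1 = bic(loglik=-735.4, n_params=4, n=n)    # larger model M_1 (assumed fit)

approx_2logB10 = bic_1 - bic_0                 # approximate 2 log B_10
print(round(bic_0, 1), round(bic_1, 1), round(approx_2logB10, 1))
# 2*(742.1 - 735.4) = 13.4 gained in fit vs. log(500) = 6.2 paid for the extra
# parameter, so the approximation is about +7.2, favoring M_1.
```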


The BIC Approximation (ctd)

• Theorem 4: If the prior is

  θ | M_k ~ N(θ̂_k, I_1(θ̂_k)⁻¹),

where I_1 is the expected information matrix for one observation (the unit information prior, UIP), then

  2 log B_10 = BIC + O(n^(−1/2)),

i.e. the approximation is much better for the UIP (Kass and Wasserman 1995).

• What if the prior is wrong?

[4 slides here: UIP plot, criticism, BF vs prior sd plot, response.]


Small Simulation Study

Based on the Weakliem example.

• Formulate a 2 × 2 table as a loglinear model with ANOVA parametrization.
• Keep the main effects constant at their values in the Weakliem data.
• Set the log-odds ratio to 0 (M_0), or draw it from a normal prior centered at 0 (M_1) (Weakliem's recommendation).

[Figure 1: Simulated Odds Ratios — histogram of the simulated odds ratios (proportion vs. odds ratio, 0 to 20).]


Tests Assessed

Test

LRT 5%

BIC

BF: default GLIB (scale = 1.65)

BF: right prior


Tests: Total Error Rates

Total error rate = Type I error rate + Type II error rate

  Test                 Total Error Rate (× 1000)
  LRT 5%                       163
  BIC                          160
  BF: default GLIB             154
  BF: right prior              153


Calibration of Tests

Of those data sets for which the p-value fell between .01 and .05 (one star), what proportion actually had an odds ratio that was different from 1?

We might hope for somewhere in the region 95%–99%.

Actually, it was 39%.


Calibration of Bayes Factors

Of those data sets for which the

posterior probability of an association

is between 50% and 95% (weak to

positive evidence), what proportion

actually had an odds ratio that was

different from 1?

We might hope, somewhere in the

region 50%–95%. (Halfway = 73%).

Actually, it was:

BIC: 94%

GLIB default: 71%

GLIB right prior: 73%


When BIC and a 5% Test Disagree: Is BIC Really Too Conservative?

• Consider those data sets for which a 5% test "rejects" independence (i.e. p < .05), but BIC does not (i.e. BIC favors independence).

• If BIC were really too conservative, we would expect association to be present in most of these cases, probably not far from 95% of them.

• Actually, it was present in only 48% of these cases.


Estimators

Estimator

Full model:
1. MLE
2. Bayes: GLIB
3. Bayes: right prior

Model selection:
4. 5% LRT → MLE
5. BIC → MLE
6. Bayes: GLIB
7. Bayes: right prior

BMA:
8. BMA: BIC → MLE
9. BMA: GLIB
10. BMA: right prior


Estimators: MSEs

  Estimator                   Total MSE
  Full model
  1. MLE                         49
  2. Bayes: GLIB                 48
  3. Bayes: right prior          48
  Model selection
  4. 5% LRT → MLE                37
  5. BIC → MLE                   35
  6. Bayes: GLIB                 35
  7. Bayes: right prior          34
  BMA
  8. BMA: BIC → MLE              33
  9. BMA: GLIB                   32
  10. BMA: right prior           32


Estimation: Comments

• Overall, BMA > model selection > full model.

• Different trade-offs between MSEs under the two models.

• Right prior (slightly) > GLIB default > BIC → MLE > LRT → MLE > full model.

• Full model less good (MLE and Bayes).

• This can guide the choice of "n" in BIC. E.g. for event history models, it is better to choose the number of events than the number of individuals, or of exposure times.


The Hazelrigg-Garnier Data Revisited

Each country's 3 × 3 table is shown as three rows of three counts:

Australia        | Belgium          | France             | Hungary
292   170    29  | 497   100    12  | 2085  1047    74   |  479   190    14
290   608    37  | 300   434     7  |  936  2367    57   | 1029  2615   347
 81   171   175  | 102   101   129  |  592  1255  1587   |  516  3110  3751

Italy            | Japan            | Philippines        | Spain
233    75    10  | 465   122    21  |  239   110    76   | 7622  2124   379
104   291    23  | 159   258    20  |   91   292   111   | 3495  9072   597
 71   212   320  | 285   307   333  |  317   527  3098   | 4597  8173 14833

United States    | West Germany     | West Malaysia      | Yugoslavia
1650   641   34  | 3634   850  270  |  406   235   144   |   61    24     7
1618  2692   70  | 1021  1694  306  |  176   369   183   |   37    92    13
 694  1648  644  | 1068  1310 1927  |  315   578  2311   |   77   148   223

Denmark          | Finland          | Norway             | Sweden
 79    34     2  |  39    29     2  |   90    29     5   |   89    30     0
 55   119     8  |  24   115    10  |   72    89    11   |   81   142     3
 25    48    84  |  40    66    79  |   41    47    47   |   27    48    29


The Quasi-Symmetry Model

1 4 0

0 2 0

0 0 3

• Accounts for 99.7% of the deviance under independence.

• Theoretically grounded.

• No easily discernible patterns in the residuals.

• BUT the deviance is 150 on 16 d.f., so p < .0001. An apparently good model is rejected.

• BIC seems to resolve the dilemma: BIC = −36 favors the QS model.

• A more refined analysis using Weakliem's prior for parameter 4 gives the same conclusion, with a more "exact" Bayes factor (from GLIB) in place of BIC. The conclusion is insensitive to the prior standard deviation.


Further Model Search

• One should continue to search for better models if the deviance from the BIC-best model is big enough:

  #  Model            Deviance   d.f.    BIC
  1  Independence        42970     64   42227
  2  Quasi-symmetry        150     16     −36
  3  Saturated               0      0       0
  4  Explanatory           490     46     −43
  5  Farm asymmetry         26     14    −137

• Weakliem's preferred model is #5, which is also preferred by BIC, but rejected by a 5% significance test.


Concluding Remarks

� Bayes factors seem to perform well as tests

(in terms of total error rate). This seems

fairly robust to the prior used. They also

seem well calibrated.

� In the small example considered, Bayes

factors based on “good” priors did better

than BIC, which did better than a 5% LRT.

The GLIB default prior had similar

performance to the optimal.

� For estimation, BMA did better in MSE

terms than model selection estimators,

which did better than estimation for the full

model. These results were robust to the

prior, and BIC did almost as well as more

exact Bayes factors.


Concluding Remarks (ctd)

� When the model doesn’t hold, we can

assess methods using out-of-sample

predictive performance. BMA has

consistently done better than model

selection methods (Bayes or non-Bayes).

(e.g. Volinsky et al 1995)

� It’s important to assess whether any of the

models considered fit the data well.

Diagnostics are useful to suggest better

models, but do not necessarily rule out the

use of a model that is better than others by

Bayes factors.

� Even if a Bayes factor prefers one model to

another, the search for better models

should continue (as in the

Hazelrigg-Garnier example).


Papers and Software

www.stat.washington.edu/raftery → Research → Bayesian Model Selection

BMA Homepage: www.research.att.com/~volinsky/bma.html


Further Reading: Books

• Introductory: Peter Lee (1989). Bayesian Statistics: An Introduction.

• Theory: Jose Bernardo and Adrian Smith (1994). Bayesian Theory.

• Applied: Andrew Gelman et al. (1995). Bayesian Data Analysis.


Further Reading: Review articles

• Bayesian estimation: W. Edwards, H. Lindman and L. Savage (1963). Bayesian statistical inference for psychological research. Psychological Review 70, 193-242.

• Bayesian testing: R. Kass and A. Raftery (1995). Bayes factors. J. Amer. Statist. Ass. 90, 773-795. www.stat.washington.edu/tech.reports/tr254.ps

• Bayesian model selection: A. Raftery (1995). Bayesian model selection in social research (with discussion). Sociological Methodology 25, 111-195. www.stat.washington.edu/tech.reports/bic.ps

• Bayesian model averaging: J.A. Hoeting et al. (1999). Bayesian model averaging: A tutorial (with discussion). Statistical Science 14, 382-417. www.stat.washington.edu/www/research/online/hoeting1999.pdf
