Probability Frequentist versus Bayesian and why it matters Roger Barlow Manchester University 11 th February 2008

Probability: Frequentist versus Bayesian, and why it matters

Roger Barlow

Manchester University

11th February 2008

Frequentist and Bayesian Probability Roger Barlow Slide 2

Outline

Different definitions of probability: Frequentist and Bayesian

Measurements: definitions usually give the same results

Differences in dealing with
  Non-Gaussian measurements
  Small number counting
  Constrained parameters

Difficulties for both

Conclusions and recommendations

Frequentist and Bayesian Probability Roger Barlow Slide 3

What is Probability?

A is some possible event or fact.

What is P(A)?

Classical: An intrinsic property

Frequentist: $\lim_{N\to\infty} N(A)/N$

Bayesian: My degree of belief in A

What do we mean by P(A)?

Frequentist and Bayesian Probability Roger Barlow Slide 4

Classical (Laplace and others)

Symmetry factor
  Coin – 1/2
  Cards – 1/52
  Dice – 1/6
  Roulette – 1/37 (European wheel)

Equally likely outcomes

Extend to more complicated systems of several coins, many cards, etc.

“The probability of an event is the ratio of the number of cases favourable to it, to the number of all cases possible when nothing leads us to expect that any one of these cases should occur more than any other, which renders them, for us, equally possible.” (Laplace, Théorie analytique des probabilités)

Frequentist and Bayesian Probability Roger Barlow Slide 5

Classical Probability Breaks down

Can’t handle continuous variables

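Bertrand’s paradox: inscribe an equilateral triangle in a circle and draw a chord of the circle at random. What is the probability that the chord is longer than the side of the triangle?

Answer: 1/3 or 1/2 or 1/4, depending on what ‘at random’ means

Cannot enumerate ‘equally likely’ cases in a unique way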
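
The three answers to Bertrand's paradox can be checked with a small Monte Carlo. This is only an illustrative sketch: the unit circle, the three sampling rules, the function name `bertrand` and the seed are assumptions made here, not part of the talk.

```python
import math
import random

def bertrand(n_trials=200_000, seed=1):
    """Estimate P(chord longer than triangle side) for three ways of
    choosing a 'random' chord of a unit circle (all three are assumptions
    about what 'at random' means)."""
    rng = random.Random(seed)
    side = math.sqrt(3.0)  # side of the equilateral triangle inscribed in a unit circle
    hits = {"endpoints": 0, "radius": 0, "midpoint": 0}
    for _ in range(n_trials):
        # Rule 1: both endpoints uniform on the circumference.
        dtheta = rng.uniform(0.0, 2.0 * math.pi)
        if 2.0 * math.sin(dtheta / 2.0) > side:
            hits["endpoints"] += 1
        # Rule 2: distance of the chord from the centre uniform in [0, 1].
        d = rng.random()
        if 2.0 * math.sqrt(1.0 - d * d) > side:
            hits["radius"] += 1
        # Rule 3: chord midpoint uniform over the area of the disc,
        # so its distance from the centre is sqrt(u) with u uniform.
        d = math.sqrt(rng.random())
        if 2.0 * math.sqrt(1.0 - d * d) > side:
            hits["midpoint"] += 1
    return {rule: count / n_trials for rule, count in hits.items()}

result = bertrand()
print(result)  # the three estimates should sit near 1/3, 1/2 and 1/4
```

Each rule is a perfectly reasonable reading of 'equally likely', which is the point of the paradox.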

Frequentist and Bayesian Probability Roger Barlow Slide 6

Frequentist Probability (von Mises, Fisher)

Limit of frequency: $P(A) = \lim_{N\to\infty} N(A)/N$

This was a property of the classical definition, now promoted to become a definition itself

P(A) depends not just on A but on the ensemble – which must be specified.

This leads to two surprising features

[Figure: the event A inside the ‘Ensemble of Everything’]

Frequentist and Bayesian Probability Roger Barlow Slide 7

Feature 1: There can be many ensembles

Probabilities belong to the event and the ensemble

Insurance company data shows P(death) for 40 year old male clients = 1.4% (example due to von Mises)

Does this mean a particular 40 year old German has a 98.6% chance of reaching his 41st birthday?

No. He belongs to many ensembles:
  German insured males
  German males
  Insured nonsmoking vegetarians
  German insured male racing drivers
  …

Each of these gives a different number. All equally valid.

Frequentist and Bayesian Probability Roger Barlow Slide 8

Feature 2: Unique events have no ensemble

Some events are unique.

Consider

“It will probably rain tomorrow.”

There is only one tomorrow (Tuesday 12th February). There is NO ensemble. P(rain) is either 0/1 =0 or 1/1 = 1

Strict frequentists cannot say 'It will probably rain tomorrow'.

This presents severe social problems.

Frequentist and Bayesian Probability Roger Barlow Slide 9

Circumventing the limitation

A frequentist can say:

“The statement ‘It will rain tomorrow’ has a 70% probability of being true.”

by assembling an ensemble of statements and ascertaining that 70% (say) are true.

(E.g. Weather forecasts with a verified track record)

Say “It will rain tomorrow” with 70% confidence

For unique events, confidence level statements replace probability statements.

Frequentist and Bayesian Probability Roger Barlow Slide 10

Bayesian (Subjective) Probability

P(A) is a number describing my degree of belief in A

1 = certain belief. 0 = total disbelief

Can be calibrated against simple classical probabilities. P(A)=0.5 means: I would be indifferent given the choice of betting on A or betting on a coin toss.

A can be anything: death, rain, horse races, existence of SUSY…

Very adaptable. But no guarantee my P(A) is the same as your P(A). Subjective = unscientific?

Frequentist and Bayesian Probability Roger Barlow Slide 11

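Bayes’ Theorem

General (uncontroversial) form

$P(A|B)\,P(B) = P(A \,\&\, B) = P(B|A)\,P(A)$

$P(A|B) = \dfrac{P(B|A)\,P(A)}{P(B)}$

P(B) can be written $P(B|A)\,P(A) + P(B|\text{not } A)\,(1-P(A))$

Examples:

People: $P(\text{Artist}|\text{Beard}) = \dfrac{P(\text{Beard}|\text{Artist})\,P(\text{Artist})}{P(\text{Beard})}$

π/K Cherenkov counter: $P(\pi|\text{signal}) = \dfrac{P(\text{signal}|\pi)\,P(\pi)}{P(\text{signal})}$

Medical diagnosis: $P(\text{disease}|\text{symptom}) = \dfrac{P(\text{symptom}|\text{disease})\,P(\text{disease})}{P(\text{symptom})}$

Numerically, with $P(B|A)=0.9$, $P(B|\text{not } A)=0.01$ and $P(A)=0.5$: $0.9 \times 0.5/(0.9 \times 0.5 + 0.01 \times 0.5) = 0.989$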
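
The numerical example on this slide is a direct application of the theorem with P(B) expanded over A and not-A. A minimal sketch, where the function name `posterior` and its argument names are illustrative choices rather than anything from the talk:

```python
def posterior(p_b_given_a, p_b_given_not_a, p_a):
    """Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B),
    with P(B) = P(B|A) P(A) + P(B|not A) (1 - P(A))."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    return p_b_given_a * p_a / p_b

# Numbers from the slide: 0.9, 0.01 and a prior of 0.5.
print(round(posterior(0.9, 0.01, 0.5), 3))  # 0.989
```

Lowering the prior `p_a` pulls the answer down sharply, which is why the prior matters so much in the diagnosis example.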

Frequentist and Bayesian Probability Roger Barlow Slide 12

Bayes’ Theorem

“Bayesian” form

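$P(\text{Theory}|\text{Data}) = \dfrac{P(\text{Data}|\text{Theory})\,P(\text{Theory})}{P(\text{Data})}$

“Theory” may be an event (e.g. rain tomorrow)

Or a parameter value (e.g. Higgs mass): then P(Theory) is a function

$P(M_H|\text{Data}) = \dfrac{P(\text{Data}|M_H)\,P(M_H)}{P(\text{Data})}$

Here $P(M_H|\text{Data})$ is the posterior and $P(M_H)$ is the prior.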

Frequentist and Bayesian Probability Roger Barlow Slide 13

Measurements: Bayes at work

Result value x, theoretical ‘true’ value μ

$P(\mu|x) \propto P(x|\mu)\,P(\mu)$

Prior is generally taken as uniform

Ignore normalisation problems

Construct theory of measurements – prior of second measurement is posterior of the first

$P(x|\mu)$ is often Gaussian, but can be anything (Poisson, etc.)

For Gaussian measurement and uniform prior, get Gaussian posterior

[Figure: posterior = likelihood × prior]
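
The idea that the posterior of one measurement becomes the prior of the next can be tried on a grid. The sketch below is an illustration only: the two measurements (10 ± 1 and 11 ± 2), the grid range and the helper names are assumptions invented for the example.

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P(x | mu) for a Gaussian measurement with resolution sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def update(prior, grid, x, sigma):
    """Posterior on the grid: prior times likelihood, then normalise."""
    step = grid[1] - grid[0]
    unnorm = [p * gaussian_likelihood(x, mu, sigma) for p, mu in zip(prior, grid)]
    norm = sum(unnorm) * step
    return [u / norm for u in unnorm]

def mean_and_sd(density, grid):
    """Mean and standard deviation of a density tabulated on the grid."""
    step = grid[1] - grid[0]
    mean = sum(d * mu for d, mu in zip(density, grid)) * step
    var = sum(d * (mu - mean) ** 2 for d, mu in zip(density, grid)) * step
    return mean, math.sqrt(var)

grid = [i * 0.01 for i in range(2001)]            # mu from 0 to 20
flat = [1.0] * len(grid)                          # uniform prior (unnormalised)
post1 = update(flat, grid, x=10.0, sigma=1.0)     # first measurement
post2 = update(post1, grid, x=11.0, sigma=2.0)    # posterior 1 used as prior 2
print(mean_and_sd(post1, grid))
print(mean_and_sd(post2, grid))
```

With a uniform prior the first posterior is just the Gaussian centred on the result, and the second reproduces the usual weighted average of the two measurements.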

Frequentist and Bayesian Probability Roger Barlow Slide 14

Aside: “Objective” Bayesian statistics

Attempt to lay down rule for choice of prior

‘Uniform’ is not enough. Uniform in what?

Suggestion (Jeffreys): uniform in a variable for which the expected Fisher information $\langle d^2 \ln L/dx^2 \rangle$ is constant (statisticians call this a ‘flat prior’).

Has not met with general agreement – different measurements of the same quantity have different objective priors

Frequentist and Bayesian Probability Roger Barlow Slide 15

Measurement and Frequentist probability

$M_T = 174 \pm 3$ GeV: what does it mean?

For true value μ the probability (density) for a result x is (for the usual Gaussian measurement)

$P(x; \mu, \sigma) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\left[-\dfrac{(x-\mu)^2}{2\sigma^2}\right]$

For a given μ, the probability that x lies within $\mu \pm \sigma$ is 68%. This does not mean that for a given x, the ‘inverse’ probability that μ lies within $x \pm \sigma$ is 68%

$P(x; \mu, \sigma)$ cannot be used as a probability for μ. (It is called the likelihood function for μ given x.)

$M_T = 174 \pm 3$ GeV: is there a 68% probability that $M_T$ lies between 171 and 177 GeV?

No. $M_T$ is unique. It is either in the range or outside. (Soon we’ll know.)

But $\mu \pm 3$ does bracket x 68% of the time: the statement ‘$M_T$ lies between 171 and 177 GeV’ has a 68% probability of being true.

MT lies between 171 and 177 GeV with 68% confidence
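
The frequentist reading can be illustrated by simulating the ensemble of statements. In this sketch the true value (174 GeV) is simply assumed for the sake of the simulation, and the function name, number of pseudo-experiments and seed are likewise illustrative.

```python
import random

def fraction_of_true_statements(mu_true=174.0, sigma=3.0,
                                n_experiments=100_000, seed=2):
    """Simulate many measurements x ~ Gauss(mu_true, sigma) and count how
    often the statement 'mu lies in [x - sigma, x + sigma]' is true."""
    rng = random.Random(seed)
    n_true = 0
    for _ in range(n_experiments):
        x = rng.gauss(mu_true, sigma)
        if x - sigma <= mu_true <= x + sigma:
            n_true += 1
    return n_true / n_experiments

print(fraction_of_true_statements())  # should come out close to 0.68
```

The true value never changes between pseudo-experiments; it is the quoted interval that fluctuates, and about 68% of the quoted intervals contain it.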

Frequentist and Bayesian Probability Roger Barlow Slide 16

Pause for breath

For Gaussian measurements of quantities with no constraints/objective prior knowledge the same results are given by:
  Frequentist confidence intervals
  Bayesian posteriors from uniform priors

A frequentist and a simple Bayesian will report the same outcome from the same raw data, except one will say ‘confidence’ and the other ‘probability’. They mean something different but such concerns can be left to the philosophers

Frequentist and Bayesian Probability Roger Barlow Slide 17

Frequentist confidence intervals beyond the simple Gaussian

Select Confidence Level value CL and strategy

From $P(x; \mu)$ choose construction (functions $x_1(\mu)$, $x_2(\mu)$) for which

$P(x \in [x_1(\mu), x_2(\mu)]) \ge CL$ for all μ

Given a measurement X, make statement

$\mu \in [\mu_{LO}, \mu_{HI}]$ @ CL

where $X = x_2(\mu_{LO})$, $X = x_1(\mu_{HI})$ (Neyman technique)

Frequentist and Bayesian Probability Roger Barlow Slide 18

Confidence Belt

[Figure: confidence belt in the (x, μ) plane, with the measurement X marked]

Example: proportional Gaussian

$\sigma = 0.1\,\mu$

Measures with 10% accuracy

Result (say) 100.0

$\mu_{LO} = 90.91$, $\mu_{HI} = 111.1$

Constructed horizontally such that the probability of a result lying inside the belt is 68% (or whatever)

Read vertically using the measurement
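
For the proportional Gaussian the belt edges are straight lines, so the Neyman construction can be inverted by hand. A minimal sketch, assuming a central 68% belt with edges at μ(1 ∓ 0.1); the function and argument names are illustrative.

```python
def proportional_gaussian_interval(x_measured, frac=0.1, n_sigma=1.0):
    """Central interval for a Gaussian whose sigma is frac * mu.

    The belt edges are x1(mu) = mu * (1 - n_sigma * frac) and
    x2(mu) = mu * (1 + n_sigma * frac); the limits solve
    X = x2(mu_lo) and X = x1(mu_hi).
    """
    mu_lo = x_measured / (1.0 + n_sigma * frac)
    mu_hi = x_measured / (1.0 - n_sigma * frac)
    return mu_lo, mu_hi

lo, hi = proportional_gaussian_interval(100.0)
print(round(lo, 2), round(hi, 2))  # 90.91 111.11
```

The interval is asymmetric about the measurement because the resolution is larger at the upper end of the belt than at the lower end.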

Frequentist and Bayesian Probability Roger Barlow Slide 19

Bayesian Proportional Gaussian

Likelihood function

$C \exp\left(-\tfrac{1}{2}(\mu-100)^2/(0.1\,\mu)^2\right)$

Integration gives C = 0.03888

68% (central) limits: 92.6 and 113.8

Different techniques give different answers

[Figure: posterior with 16% in each tail and 68% in the central region]
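
The Bayesian limits can be approximated by integrating the likelihood numerically with a uniform prior. This is a rough sketch under stated assumptions: the integration range (50 to 200), the midpoint rule and the function names are choices made here, and the normalisation constant depends slightly on that range, so it need not match the slide's value exactly.

```python
import math

def likelihood(mu, x=100.0, frac=0.1):
    """Unnormalised likelihood for a result x when sigma = frac * mu."""
    return math.exp(-0.5 * ((mu - x) / (frac * mu)) ** 2)

def central_interval(lo=50.0, hi=200.0, steps=150_000, tail=0.16):
    """Normalise the likelihood over [lo, hi] (uniform prior assumed) and
    return (C, mu_lo, mu_hi) with probability 'tail' below mu_lo and above mu_hi."""
    step = (hi - lo) / steps
    grid = [lo + (i + 0.5) * step for i in range(steps)]
    weights = [likelihood(mu) for mu in grid]
    total = sum(weights)
    cumulative = 0.0
    mu_lo = mu_hi = None
    for mu, w in zip(grid, weights):
        cumulative += w
        if mu_lo is None and cumulative >= tail * total:
            mu_lo = mu
        if cumulative >= (1.0 - tail) * total:
            mu_hi = mu
            break
    return 1.0 / (total * step), mu_lo, mu_hi

c, mu_lo, mu_hi = central_interval()
print(c, mu_lo, mu_hi)  # compare with the slide's 92.6 and 113.8
```

Both limits sit above the frequentist ones from the confidence belt, which is the slide's point that different techniques give different answers.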

Frequentist and Bayesian Probability Roger Barlow Slide 20

Small number counting experiments

Poisson distribution $P(r; \mu) = e^{-\mu} \mu^r / r!$

For large μ can use Gaussian approximation. But not for small μ

Frequentists: choose CL. Just use one curve to give upper limit

Discrete observable makes smooth curves into ugly staircases

Observe n. Quote upper limit as $\mu_{HI}$ from solving

$\sum_{r=0}^{n} P(r; \mu_{HI}) = \sum_{r=0}^{n} e^{-\mu_{HI}} \mu_{HI}^r / r! = 1 - CL$

Translation: n is small, so μ can’t be very large. If the true value is $\mu_{HI}$ (or higher) then the chance of a result this small (or smaller) is only (1−CL) (or less)

Frequentist and Bayesian Probability Roger Barlow Slide 21

Frequentist Poisson Table

Upper limits

n    90%     95%     99%
0    2.30    3.00    4.61
1    3.89    4.74    6.64
2    5.32    6.30    8.41
3    6.68    7.75    10.05
4    7.99    9.15    11.60
5    9.27    10.51   13.11

.....
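
The table entries can be regenerated by solving the defining equation, sum over r from 0 to n of P(r; μ_HI) = 1 − CL, numerically. A minimal sketch using bisection; the function names and the search bracket [0, 100] are assumptions of this example.

```python
import math

def poisson_cdf(n, mu):
    """Sum of exp(-mu) mu^r / r! for r = 0..n."""
    term = math.exp(-mu)
    total = term
    for r in range(1, n + 1):
        term *= mu / r
        total += term
    return total

def upper_limit(n, cl):
    """Solve poisson_cdf(n, mu_hi) = 1 - cl for mu_hi by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(n, mid) > 1.0 - cl:
            lo = mid   # probability of so few events still too large: go higher
        else:
            hi = mid
    return 0.5 * (lo + hi)

for n in range(6):
    print(n, [round(upper_limit(n, cl), 2) for cl in (0.90, 0.95, 0.99)])
```

Bisection works here because the cumulative sum falls monotonically as μ increases.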

Frequentist and Bayesian Probability Roger Barlow Slide 22

Bayesian limits from small number counts

$P(r; \mu) = \exp(-\mu)\,\mu^r / r!$

With uniform prior this gives posterior for μ

Shown for various small r results

Read off intervals...

[Figure: posterior P(μ) for r = 0, 1, 2 and 6]

Frequentist and Bayesian Probability Roger Barlow Slide 23

Upper limits

Upper limit from n events:

$\int_0^{\mu_{HI}} \exp(-\mu)\,\mu^n / n!\; d\mu = CL$

Repeated integration by parts:

$\sum_{r=0}^{n} \exp(-\mu_{HI})\,\mu_{HI}^r / r! = 1 - CL$

Same as frequentist limit

This is a coincidence! Lower limit formula is not the same
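
The integration-by-parts identity behind this coincidence can be checked numerically: the posterior integral up to μ_HI and the Poisson sum at μ_HI should add up to one. A small sketch, with a midpoint-rule integral and function names chosen for this illustration.

```python
import math

def posterior_integral(n, mu_hi, steps=20_000):
    """Midpoint-rule integral of exp(-mu) mu^n / n! from 0 to mu_hi
    (the posterior for n observed events with a uniform prior)."""
    step = mu_hi / steps
    total = 0.0
    for i in range(steps):
        mu = (i + 0.5) * step
        total += math.exp(-mu) * mu ** n / math.factorial(n)
    return total * step

def poisson_sum(n, mu_hi):
    """Sum of exp(-mu_hi) mu_hi^r / r! for r = 0..n."""
    return sum(math.exp(-mu_hi) * mu_hi ** r / math.factorial(r)
               for r in range(n + 1))

for n, mu_hi in [(0, 2.30), (2, 6.30), (5, 10.51)]:
    print(n, posterior_integral(n, mu_hi), 1.0 - poisson_sum(n, mu_hi))
```

The two columns agree, so the Bayesian upper limit with a uniform prior coincides with the frequentist one for any n.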

Frequentist and Bayesian Probability Roger Barlow Slide 24

Result depends on Prior

Example: 90% CL Limit from 0 events

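[Figure: likelihood × prior = posterior for two different flat priors. The two 90% upper limits shown are 2.30 and 1.65; 2.30 is the value for a prior flat in μ.]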

Frequentist and Bayesian Probability Roger Barlow Slide 25

Which is right?

Bayesian Method is generally easier, conceptually and in practice

Frequentist method is truly objective. Bayesian probability is personal ‘degree of belief’. This does not worry biologists but should worry physicists.

Ambiguity appears in Bayesian results as differences in prior give different answers, though with enough data these differences vanish

Check for ‘robustness under change of prior’ is standard statistical technique, generally ignored by physicists

‘Uniform priors’ is not a sufficient answer. Uniform in what?

Frequentist and Bayesian Probability Roger Barlow Slide 26

Problems for Frequentists: Add a background

$\mu = S + b$

Frequentist method (b known, μ measured, S wanted):
  1. Find range for μ
  2. Subtract b to get range for S

Examples:
  See 5 events, background 1.2: 95% upper limit 10.5 → 9.3
  See 5 events, background 5.1: 95% upper limit 10.5 → 5.4 ?
  See 5 events, background 10.6: 95% upper limit 10.5 → −0.1

Frequentist and Bayesian Probability Roger Barlow Slide 27

S< -0.1? What’s going on?

If N<b we know that there is a downward fluctuation in the background. (Which happens…)

But there is no way of incorporating this information without messing up the ensemble

Really strict frequentist procedure is to go ahead and publish.

We know that 5% of 95%CL statements are wrong – this is one of them

Suppressing this publication will bias the global results

Frequentist and Bayesian Probability Roger Barlow Slide 28

Similar problems
  Expected number of events must be non-negative
  Mass of an object must be non-negative
  Mass-squared of an object must be non-negative
  Higgs mass from EW fits must be bigger than LEP2 limit of 114 GeV

3 Solutions
  Publish a ‘clearly crazy’ result
  Use Feldman-Cousins technique
  Switch to Bayesian analysis

Frequentist and Bayesian Probability Roger Barlow Slide 29

$\mu = S + b$ for Bayesians

No problem!

Prior for μ is uniform for $\mu \ge b$ (that is, $S \ge 0$)

Multiply and normalise as before

Posterior ∝ Likelihood × Prior

Read off Confidence Levels by integrating posterior

[Figure: posterior = likelihood × prior]

Frequentist and Bayesian Probability Roger Barlow Slide 30

Another Aside: Coverage

Given $P(x; \mu)$ and an ensemble of possible measurements $\{x_i\}$ and some confidence level algorithm, coverage is how often ‘$\mu_{LO} \le \mu \le \mu_{HI}$’ is true.

Isn’t that just the confidence level? Not quite.

Discrete observables may mean the confidence belt is not exact: err on the side of caution

Other ‘nuisance’ parameters may need to be taken account of, again erring on the side of caution

Coverage depends on μ. For a frequentist it is never less than the CL (that would be ‘undercoverage’). It may be more (‘overcoverage’): this is to be minimised but not crucial

For a Bayesian coverage is technically irrelevant, but in practice useful
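
For a discrete observable the coverage can be computed exactly by summing Poisson probabilities over the outcomes whose quoted limit contains the true value. The sketch below does this for the frequentist Poisson upper limit; the function names, the bisection bracket and the cut-off `n_max` are assumptions of this illustration.

```python
import math

def poisson_cdf(n, mu):
    """Sum of exp(-mu) mu^r / r! for r = 0..n."""
    term = math.exp(-mu)
    total = term
    for r in range(1, n + 1):
        term *= mu / r
        total += term
    return total

def upper_limit(n, cl):
    """Frequentist upper limit: solve poisson_cdf(n, mu_hi) = 1 - cl."""
    lo, hi = 0.0, 200.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(n, mid) > 1.0 - cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def coverage(mu_true, cl=0.90, n_max=60):
    """Probability that the quoted upper limit is at or above mu_true."""
    total = 0.0
    prob = math.exp(-mu_true)          # P(0; mu_true)
    for n in range(n_max + 1):
        if n > 0:
            prob *= mu_true / n        # P(n; mu_true) from P(n - 1; mu_true)
        if upper_limit(n, cl) >= mu_true:
            total += prob
    return total

for mu in (0.5, 2.5, 5.0, 7.5):
    print(mu, round(coverage(mu), 4))
```

The coverage jumps each time μ crosses one of the tabulated limits, staying at or above 90% throughout: the staircase overcovers rather than undercovers.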

Frequentist and Bayesian Probability Roger Barlow Slide 31

Bayesian pitfall (Heinrich and others)

Observe n events from Poisson with $\mu = \epsilon S + b$

Channel strength S: unknown. Flat prior

Efficiency × luminosity ε: from sub-experiment with flat prior

Background b: from sub-experiment with flat prior

Investigated coverage and all OK

Partition into classes (e.g. different run periods)

Coverage falls!

Frequentist and Bayesian Probability Roger Barlow Slide 32

What’s happening

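Problem due to efficiency × luminosity priors

$\epsilon = \epsilon_1 + \epsilon_2 + \epsilon_3 + \dots$

Uniform density in all N components means $P(\epsilon) \propto \epsilon^{N-1}$

‘Solve’ by taking priors $P(\epsilon_i) \propto 1/\epsilon_i$ (there are arguments that you should have done so in the first place: Jeffreys’ prior)

[Figure: the $(\epsilon_1, \epsilon_2)$ plane]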

Frequentist and Bayesian Probability Roger Barlow Slide 33

Another example: Unitarity triangle

Measure CKM angle by measuring B decays (charged and neutral, branching ratios and CP asymmetries). 6 quantities.

Many different parametrisations suggested

Uniform priors in different parametrisations give different results from each other and from a Frequentist analysis (according to CKMfitter: disputed by UTfit)

For a complex number $z = x + iy = r e^{i\theta}$ a flat prior in x and y is not the same as a flat prior in r and θ
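
The reason the two flat priors differ is the Jacobian of the change of variables. A short worked step, writing θ for the phase of z (a symbol chosen here for illustration):

```latex
\[
  dx\,dy = r\,dr\,d\theta
  \quad\Longrightarrow\quad
  p(r,\theta) = p(x,y)\,r .
\]
% A prior flat in (x, y) therefore grows linearly with r:
\[
  p(x,y) = \text{const}
  \;\Longrightarrow\;
  p(r,\theta) \propto r ,
\]
% while a prior flat in (r, theta) is peaked at the origin in (x, y):
\[
  p(r,\theta) = \text{const}
  \;\Longrightarrow\;
  p(x,y) \propto \frac{1}{r} = \frac{1}{\sqrt{x^2+y^2}} .
\]
```

So 'uniform' in one parametrisation favours large moduli while 'uniform' in the other favours small ones, and posteriors built on them can differ.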

Frequentist and Bayesian Probability Roger Barlow Slide 34

Loss of ambiguities?

Toy example

Measure $X = (\alpha+\beta)^2 = 1.00 \pm 0.07$, $Y = \beta^2 = 1.10 \pm 0.07$

α is interesting but β is a nuisance parameter

Clearly 4 fold ambiguity: $\beta \approx \pm 1$, $\alpha \approx 0$ or $\mp 2$

Frequentists stop there

Bayesians integrate over β and get a peak at $\alpha \approx 0$ double that at $\alpha \approx \pm 2$ – feature persists whatever the prior used for β.

Is this valid? (Real example will be more subtle)

Frequentist and Bayesian Probability Roger Barlow Slide 35

Conclusions

Frequentist statistics cannot do everything

Bayesian statistics can be dangerous. Choose between
  1) Never use it
  2) Use only if frequentist method has problems
  3) Use only with care and expert guidance and always check for robustness under different priors
  4) Use as investigative tool to explore possible interpretations
  5) Just plug in the package and write down the results

But always know what you are doing and say what you are doing.

Frequentist and Bayesian Probability Roger Barlow Slide 36

Backup slides

Frequentist and Bayesian Probability Roger Barlow Slide 37

Incorporating Constraints: Poisson

Work with total source strength (s+b) you know is greater than the background b

Need to solve

$1 - CL = \dfrac{\sum_{r=0}^{n} e^{-(s+b)}\,(s+b)^r / r!}{\sum_{r=0}^{n} e^{-b}\, b^r / r!}$

Formula not as obvious as it looks.
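
The constrained formula can be solved for s by bisection, in the same way as the unconstrained Poisson limit. A minimal sketch; the function names and the search bracket are assumptions of this example, and the inputs reuse the '5 events' cases from the background-subtraction slide.

```python
import math

def poisson_cdf(n, mu):
    """Sum of exp(-mu) mu^r / r! for r = 0..n."""
    term = math.exp(-mu)
    total = term
    for r in range(1, n + 1):
        term *= mu / r
        total += term
    return total

def constrained_upper_limit(n, b, cl):
    """Solve poisson_cdf(n, s + b) / poisson_cdf(n, b) = 1 - cl for s >= 0."""
    denominator = poisson_cdf(n, b)
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(n, mid + b) / denominator > 1.0 - cl:
            lo = mid   # ratio still too large: the limit on s is higher
        else:
            hi = mid
    return 0.5 * (lo + hi)

for background in (1.2, 5.1, 10.6):
    print(background, round(constrained_upper_limit(5, background, 0.95), 2))
```

Because the ratio equals one at s = 0 and falls as s grows, the limit on s is always positive, even when the background exceeds the number of events seen.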

Frequentist and Bayesian Probability Roger Barlow Slide 38

Feldman Cousins Method

Works by attacking what looks like a different problem...

Example:

You have a background of 3.2

Observe 5 events? Quote one-sided upper limit (9.27 − 3.2 = 6.07 @ 90%)

Observe 25 events? Quote two-sided limits

Physicists are human

Ideal Physicist
  1. Choose strategy
  2. Examine data
  3. Quote result

Real Physicist
  1. Examine data
  2. Choose strategy
  3. Quote result

Also called* ‘the Unified Approach’

* by Feldman and Cousins, mostly

Frequentist and Bayesian Probability Roger Barlow Slide 39

Feldman Cousins: $\mu = s + b$

b is known. N is measured. s is what we're after

Choosing between a 1-sided and a 2-sided limit after looking at the data is called 'flip-flopping' and is BAD because it wrecks the whole design of the Confidence Belt

Suggested solution:

1) Construct belts at chosen CL as before

2) Find new ranking strategy to determine what's inside and what's outside

[Figure: confidence belts for 1-sided 90% and 2-sided 90% limits]

Frequentist and Bayesian Probability Roger Barlow Slide 40

Feldman Cousins: Ranking

First idea (almost right)

Sum/integrate over range of (s+b) values with highest probabilities for this observed N. (Advantage that this is the shortest interval)

Glitch: Suppose N small (low fluctuation). P(N; s+b) will be small for any s and never get counted

Instead: compare to 'best' probability for this N, at s=N−b or s=0, and rank on that number

Such a plot does an automatic ‘flip-flop’
  N ~ b: single sided limit (upper bound) for s
  N >> b: 2 sided limits for s

Frequentist and Bayesian Probability Roger Barlow Slide 41

How it works

Has to be computed for the appropriate value of background b. (Sounds complicated, but there is lots of software around)

As n increases, flips from 1-sided to 2-sided limits, but in such a way that the probability of being in the belt is preserved

[Figure: Feldman-Cousins confidence belt, s versus n]

Means that sensible 1-sided limits are quoted instead of nonsensical 2-sided limits!

Frequentist and Bayesian Probability Roger Barlow Slide 42

Arguments against using Feldman Cousins

Argument 1

It takes control out of the hands of the physicist. You might want to quote a 2 sided limit for an expected process, an upper limit for something weird

Counter argument:

This is the virtue of the method. This control invalidates the conventional technique. The physicist can use their discretion over the CL. In rare cases it is permissible to say “We set a 2 sided limit, but we're not claiming a signal”

Frequentist and Bayesian Probability Roger Barlow Slide 43

Feldman Cousins: Argument 2

If zero events are observed by two experiments, the one with the higher background b will quote the lower limit. This is unfair to hardworking physicists

Counterargument

An experiment with higher background has to be ‘lucky’ to get zero events. Luckier experiments will always quote better limits. Averaging over luck, lower values of b get lower limits to report.

Example: you reward a good student with a lottery ticket which has a 10% chance of winning £10. A moderate student gets a ticket with a 1% chance of winning £20. They both win. Were you unfair?

Frequentist and Bayesian Probability Roger Barlow Slide 44

3. Including Systematic Errors

$\mu = aS + b$

μ is predicted number of events

S is (unknown) signal source strength. Probably a cross section or branching ratio or decay rate

a is an acceptance/luminosity factor known with some (systematic) error

b is the background rate, known with some (systematic) error

Frequentist and Bayesian Probability Roger Barlow Slide 45

3.1 Full Bayesian

Assume priors
  For S (uniform?)
  For a (Gaussian?)
  For b (Poisson or Gaussian?)

Write down the posterior P(S,a,b).

Integrate over all a,b to get marginalised P(S)

Read off desired limits by integration

Frequentist and Bayesian Probability Roger Barlow Slide 46

3.2 Hybrid Bayesian

Assume priors
  For a (Gaussian?)
  For b (Poisson or Gaussian?)

Integrate over all a,b to get marginalised P(r,S)

Read off desired limits by $\sum_{r=0}^{n} P(r,S) = 1 - CL$ etc.

Done approximately for small errors (Cousins and Highland). Shows that limits are pretty insensitive to $\sigma_a$, $\sigma_b$

Numerically for general errors (RB: Java applet on SLAC web page). Includes 3 priors (for a) that give slightly different results

Frequentist and Bayesian Probability Roger Barlow Slide 47

3.3-3.9

Extend Feldman Cousins

Profile Likelihood: Use $P(S) = P(n, S, a_{max}, b_{max})$ where $a_{max}$, $b_{max}$ give maximum for this S, n

Empirical Bayes

And more…

Results being compared as outcome from Banff workshop

Frequentist and Bayesian Probability Roger Barlow Slide 48

Summary

Straight Frequentist approach is objective and clean but sometimes gives ‘crazy’ results

Bayesian approach is valuable but has problems. Check for robustness under choice of prior

Feldman-Cousins deserves more widespread adoption

Lots of work still going on

This will all be needed at the LHC