
Frequentist versus Bayesian

Glen Cowan, Statistics in HEP, IoP Half Day Meeting, 16 November 2005, Manchester

The Bayesian approach

In Bayesian statistics we can associate a probability with a hypothesis, e.g., a parameter value θ.

Interpret probability of θ as 'degree of belief' (subjective).

Need to start with a 'prior pdf' π(θ); this reflects the degree of belief about θ before doing the experiment.

Our experiment has data x → likelihood function L(x|θ).

Bayes' theorem tells how our beliefs should be updated in light of the data x:

p(θ|x) = L(x|θ) π(θ) / ∫ L(x|θ') π(θ') dθ'

Posterior pdf p(θ|x) contains all our knowledge about θ.
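A minimal numerical sketch of this update, on a grid, with a hypothetical single Gaussian measurement and a flat prior (all numbers invented for illustration):

```python
import numpy as np

# Grid of hypothesised parameter values theta
theta = np.linspace(-5.0, 5.0, 1001)
dtheta = theta[1] - theta[0]

# Prior pdf pi(theta): flat ('prior ignorance')
prior = np.ones_like(theta)

# Likelihood L(x|theta): one Gaussian measurement x = 1.2 +- 1.0 (invented)
x, sigma = 1.2, 1.0
likelihood = np.exp(-0.5 * ((x - theta) / sigma) ** 2)

# Bayes' theorem: posterior proportional to likelihood * prior,
# normalised so it integrates to 1 over the grid
posterior = likelihood * prior
posterior /= posterior.sum() * dtheta

print("posterior mean:", (theta * posterior).sum() * dtheta)
```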


Case #4: Bayesian method

We need to assign prior probabilities to θ0 and θ1, e.g., a factorizing prior π(θ0) π(θ1).

Putting this into Bayes' theorem gives:

p(θ0, θ1 | x) ∝ L(x | θ0, θ1) π(θ0) π(θ1)

posterior ∝ likelihood × prior

π(θ1) ← based on a previous measurement

π(θ0) reflects 'prior ignorance'; in any case much broader than the likelihood.


Bayesian method (continued)

The ability to marginalize over nuisance parameters is an important feature of Bayesian statistics.

We then integrate (marginalize) p(θ0, θ1 | x) to find p(θ0 | x):

p(θ0 | x) = ∫ p(θ0, θ1 | x) dθ1

In this example we can do the integral in closed form (rare).
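When the integral is not tractable, the same marginalization can be done numerically. A sketch with an assumed toy model (x = θ0 + θ1 smeared by Gaussian resolution, flat prior on θ0, Gaussian prior on θ1 from a 'previous measurement'; all numbers invented):

```python
import numpy as np

# Joint posterior p(theta0, theta1 | x) on a grid, then marginalise over
# the nuisance parameter theta1.
theta0 = np.linspace(-5, 5, 401)
theta1 = np.linspace(-5, 5, 401)
T0, T1 = np.meshgrid(theta0, theta1, indexing="ij")

x, sigma = 2.0, 1.0                        # observed value and resolution
likelihood = np.exp(-0.5 * ((x - T0 - T1) / sigma) ** 2)
prior_t0 = np.ones_like(T0)                # flat ('prior ignorance')
prior_t1 = np.exp(-0.5 * (T1 / 0.5) ** 2)  # previous measurement: 0.0 +- 0.5

joint = likelihood * prior_t0 * prior_t1   # proportional to p(theta0,theta1|x)

# Marginalise: p(theta0|x) = integral of p(theta0,theta1|x) d(theta1)
d1 = theta1[1] - theta1[0]
p_theta0 = joint.sum(axis=1) * d1
p_theta0 /= p_theta0.sum() * (theta0[1] - theta0[0])

print("posterior mean of theta0:",
      (theta0 * p_theta0).sum() * (theta0[1] - theta0[0]))
```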

Bayesian Statistics at Work: The Troublesome Extraction of the Angle α

Stéphane T’JAMPENS

LAPP (CNRS/IN2P3 & Université de Savoie)

J. Charles, A. Höcker, H. Lacker, F.R. Le Diberder, S. T'Jampens, hep-ph/0607246

Digression: Statistics

D.R. Cox, Principles of Statistical Inference, CUP (2006)
W.T. Eadie et al., Statistical Methods in Experimental Physics, NHP (1971)
www.phystat.org

Statistics tries answering a wide variety of questions → two main (different!) frameworks:

Frequentist: probability about the data (randomness of measurements), given the model

P(data | model)

Hypothesis testing: given a model, assess the consistency of the data with a particular parameter value → 1−CL curve (by varying the parameter value)

[only repeatable events (Sampling Theory)]

Bayesian: probability about the model (degree of belief), given the data

P(model | data) ∝ Likelihood(data; model) × Prior(model)

Bayesian Statistics in 1 slide

Bayesian: probability about the model (degree of belief), given the data

P(model | data) ∝ Likelihood(data; model) × Prior(model)

“it treats information derived from data (“likelihood”) as on exactly equal footing with probabilities derived from vague and unspecified sources (“prior”). The assumption that all aspects of uncertainties are directly comparable is often unacceptable.”

“nothing guarantees that my uncertainty assessment is any good for you - I'm just expressing an opinion (degree of belief). To convince you that it's a good uncertainty assessment, I need to show that the statistical model I created makes good predictions in situations where we know what the truth is, and the process of calibrating predictions against reality is inherently frequentist.”(e.g., MC simulations)

Bayes' rule

The Bayesian approach is based on the use of inverse probability (the "posterior"):

P(model | data) ∝ Likelihood(data; model) × Prior(model)

Cox – Principles of Statistical Inference (2006)

Uniform prior: model of ignorance?

A central problem: specifying a prior distribution for a parameter about which nothing is known → flat prior.

Problems:

Not re-parametrization invariant (metric dependent): uniform in θ is not uniform in z = cos θ (see the sketch after this list).

Favors large values too much [the prior probability for the range 0.1 to 1 is 10 times less than for 1 to 10]

Flat priors in several dimensions may produce clearly unacceptable answers.

In simple problems, appropriate* flat priors yield essentially the same answer as non-Bayesian sampling theory. However, in other situations, particularly those involving more than two parameters, ignorance priors lead to different and entirely unacceptable answers. (*appropriate: uniform prior for a scalar location parameter, Jeffreys' prior for a scalar scale parameter.)

Cox – Principles of Statistical Inference (2006)
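A small Monte Carlo check of the metric dependence noted above (the range [0, π] for θ and the binning are arbitrary choices):

```python
import numpy as np

# 'Flat prior' is metric dependent: a prior uniform in theta is not
# uniform in z = cos(theta).  Sample theta uniformly and look at z.
rng = np.random.default_rng(1)
theta = rng.uniform(0.0, np.pi, 1_000_000)
z = np.cos(theta)

# Histogram of z: the density is 1/(pi*sqrt(1-z^2)), piling up near
# z = +-1, even though we claimed 'ignorance' about theta.
hist, edges = np.histogram(z, bins=20, range=(-1, 1), density=True)
for lo, h in zip(edges[:-1], hist):
    print(f"z in [{lo:+.1f}, {lo + 0.1:+.1f}): density {h:.2f}")
```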

Uniform Prior in Multidimensional Parameter Space

Hypersphere: one knows nothing about the individual Cartesian coordinates x, y, z, …

What do we know about the radius r = √(x² + y² + …)?

[Figure: distribution of the radius r for a 6D space.]

One has achieved the remarkable feat of learning something about the radius of the hypersphere, whereas one knew nothing about the Cartesian coordinates and without making any experiment.
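A minimal simulation of this effect, assuming flat priors on each coordinate of a box in 6 dimensions (the box and sample size are arbitrary):

```python
import numpy as np

# 'Knowing nothing' about each coordinate (flat prior on a box) implies
# a peaked distribution for the radius in high dimensions.
rng = np.random.default_rng(1)
dim = 6
x = rng.uniform(-1.0, 1.0, size=(1_000_000, dim))  # flat in each coordinate
r = np.sqrt((x ** 2).sum(axis=1))

print("mean radius:", r.mean())
print("std  radius:", r.std())  # modest spread: we have 'learned' about r
```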

Isospin Analysis: B→hh (J. Charles et al., hep-ph/0607246)

Gronau/London (1990)

MA: Modulus & Argument; RI: Real & Imaginary (two parametrizations of the amplitudes)

Improper posterior

Isospin Analysis: removing information from B0→π0π0

No model-independent constraint on α can be inferred in this case.

Information is extracted on α, which is introduced by the priors (where else?).

Conclusion

Statistics is not a science, it is mathematics (Nature will not decide for us). [You will not learn it in physics books → go to the professional literature!]

Many attempts to define an "ignorance" prior that would "let the data speak by themselves", but none is convincing. Priors are informative.

Quite generally a prior that gives results that are reasonable from various viewpoints for a single parameter will have unappealing features if applied independently to many parameters.

In a multiparameter space, credible Bayesian intervals generally under-cover.

If the problem has some invariance properties, then the prior should have the corresponding structure. The specification of priors is fraught with pitfalls (especially in high dimensions).

Examine the consequences of your assumptions (metric, priors, etc.). Check for robustness: vary your assumptions. Exploring the frequentist properties of the result should be strongly encouraged.

PHYSTAT Conferences:

http://www.phystat.org

α[ππ]: B-factories status LP07

Isospin analysis: reminder

• Neglecting EW penguins, the amplitudes of the SU(2)-related B→ππ modes are:

√2 A+0 = √2 A(Bu → π+π0) = e^(-iα) (T+- + T00)    √2 Abar+0 = e^(+iα) (T+- + T00)
A+- = A(Bd → π+π-) = e^(-iα) T+- + P+-            Abar+- = e^(+iα) T+- + P+-
√2 A00 = √2 A(Bd → π0π0) = e^(-iα) T00 - P+-      √2 Abar00 = e^(+iα) T00 - P+-

• SU(2) triangular relation: A+0 = A+-/√2 + A00
• Same for B→ρρ decays dominated by the longitudinally polarized ρ (CP-even final state)
• B+-, C+- → |A+-|, |Abar+-|
• B00, C00 → |A00|, |Abar00|
• B+0 → |A+0| = |Abar+0|
• S+- → sin(2αeff): 2-fold αeff in [0, π]
• S00 → relative phase between A00 & Abar00
• Closing the SU(2) triangles → 8-fold α

[Figure: the B and Bbar SU(2) triangles (A+0 = A+-/√2 + A00 and its conjugate) in the complex (Re, Im) plane, with relative angle ΔΦ = 2α (2αeff); per-mode observables: ππ – C00 but no S00; ρρ – no C00/S00, or C00 AND S00.]

• sin(2αeff) from B → (π/ρ)+(π/ρ)-: 2 solutions for αeff in [0, π]
• Δα = α - αeff from the SU(2) B/Bbar triangles: 1, 2 or 4 solutions for Δα (depending on triangle closure)

→ 2, 4 or 8 solutions for α = αeff + Δα

[Figure: number of Δα solutions – 4-fold, 2-fold, 1-fold ('plateau'), 1-fold (peak) – as a function of A00/A+0 and (A+-/√2)/A+0.]
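A sketch of the first of these ambiguities, the 2-fold αeff in [0, π] from sin(2αeff); the value of S+- below is invented for illustration:

```python
import numpy as np

# S+- determines only sin(2*alpha_eff), so alpha_eff is 2-fold in [0, pi].
S = 0.65                      # toy value of S+- (invented)

x = np.arcsin(S)              # one solution for 2*alpha_eff
two_alpha = [x % (2 * np.pi), (np.pi - x) % (2 * np.pi)]
alpha_eff = sorted(t / 2 for t in two_alpha)

print("alpha_eff solutions in [0, pi]:", [f"{a:.3f}" for a in alpha_eff])
# Each Delta-alpha solution (1, 2 or 4 of them) then gives
# alpha = alpha_eff + Delta-alpha: 2, 4 or 8 solutions in total.
```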

Developments in Bayesian Priors

Roger Barlow

Manchester IoP meeting

November 16th 2005

Plan

• Probability
  – Frequentist
  – Bayesian
• Bayes Theorem
  – Priors
• Prior pitfalls (1): Le Diberder
• Prior pitfalls (2): Heinrich
• Jeffreys' Prior
  – Fisher Information
• Reference Priors: Demortier

Probability

Probability as limit of frequency

P(A) = lim N_A / N_total (as N_total → ∞)

Usual definition taught to students

Makes sense

Works well most of the time – but not all.
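A quick simulation of this limiting-frequency definition, using a fair die as the repeatable event (seed and sample sizes arbitrary):

```python
import numpy as np

# Empirical frequency N_A / N_total of 'die shows 6' converging to P(A) = 1/6
rng = np.random.default_rng(1)
for n in (100, 10_000, 1_000_000):
    rolls = rng.integers(1, 7, size=n)          # fair six-sided die
    print(f"N = {n:>9}:  N_A/N = {(rolls == 6).mean():.4f}   (1/6 = {1/6:.4f})")
```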

Frequentist probability

A frequentist cannot say: “It will probably rain tomorrow.”
They can say: “The statement ‘It will rain tomorrow’ is probably true.”

A frequentist cannot say: “Mt=174.3±5.1 GeV means the top quark mass lies between 169.2 and 179.4, with 68% probability.”
They can say: “Mt=174.3±5.1 GeV means: the top quark mass lies between 169.2 and 179.4, at 68% confidence.”

Bayesian Probability

P(A) expresses my belief that A is true

Limits: 0 (impossible) and 1 (certain)

Calibrated off clear-cut instances (coins, dice, urns)

Frequentist versus Bayesian?

Two sorts of probability – totally different. (Bayesian probability also known as Inverse Probability.)

Rivals? Religious differences? Particle physicists tend to be frequentists; cosmologists tend to be Bayesians.

No. Two different tools for practitioners. Important to:
• Be aware of the limits and pitfalls of both
• Always be aware which you're using

Bayes Theorem (1763)

P(A|B) P(B) = P(A and B) = P(B|A) P(A)

P(A|B) = P(B|A) P(A) / P(B)

Frequentist use, e.g., Čerenkov counter:

P(π | signal) = P(signal | π) P(π) / P(signal)
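A worked instance of this use of Bayes' theorem; the efficiencies and beam fractions below are invented for illustration:

```python
# Bayes' theorem for particle ID: P(pi | signal) from P(signal | pi)
# and the prior particle fractions in the beam (all numbers invented).
p_signal_given_pi = 0.95   # Cherenkov fires for a pion
p_signal_given_K = 0.05    # mis-ID rate for a kaon
p_pi, p_K = 0.80, 0.20     # prior beam composition

p_signal = p_signal_given_pi * p_pi + p_signal_given_K * p_K
p_pi_given_signal = p_signal_given_pi * p_pi / p_signal
print(f"P(pi | signal) = {p_pi_given_signal:.3f}")
```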

Bayesian use

P(theory | data) = P(data | theory) P(theory) / P(data)

Bayesian Prior

P(theory) is the Prior

Expresses prior belief that the theory is true

Can be function of parameter:

P(Mtop), P(MH), P(α,β,γ)

Bayes' Theorem describes the way prior belief is modified by experimental data.

But what do you take as initial prior?

Uniform Prior

General usage: choose P(a) uniform in a (principle of insufficient reason).

Often 'improper': ∫ P(a) da = ∞. Though posterior P(a|x) comes out sensible.

BUT! If P(a) is uniform, then P(a²), P(ln a), P(√a), … are not. Insufficient reason is not valid (unless a is 'most fundamental' – whatever that means).

Statisticians handle this: check results for 'robustness' under different priors.
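A sketch of such a robustness check for a hypothetical Poisson counting experiment (n = 5 observed events, invented), comparing a prior flat in the rate a with one flat in ln a:

```python
import numpy as np

# Same data, different 'ignorance' priors.  Poisson counting experiment
# with n observed events; compare posteriors for the rate 'a' under
# P(a) flat in a versus flat in ln(a) (i.e. P(a) proportional to 1/a).
n = 5
a = np.linspace(1e-3, 30, 10_000)
loglike = n * np.log(a) - a                 # Poisson log-likelihood (up to const)

for name, logprior in [("flat in a", np.zeros_like(a)),
                       ("flat in ln a", -np.log(a))]:
    post = np.exp(loglike + logprior)
    post /= post.sum() * (a[1] - a[0])      # normalise on the grid
    mean = (a * post).sum() * (a[1] - a[0])
    print(f"{name:>13}: posterior mean = {mean:.2f}")
```

Here the two 'ignorance' priors shift the posterior mean by a full unit (6.0 versus 5.0): the data do not speak entirely by themselves.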