
Frequentist and Bayesian Confidence Limits

Günter Zech∗

Universität Siegen, D-57068 Siegen

June 6, 2001

Abstract

Frequentist (classical) and Bayesian approaches to the construction of confidence limits are compared. Various examples which illustrate specific problems are presented. The Likelihood Principle and the Stopping Rule Paradox are discussed. The performance of the different methods is investigated relative to the properties coherence, precision, bias, universality and simplicity. A proposal on how to define error limits in various cases is derived from the comparison. The proposed limits are based on the likelihood function only and follow in most cases the general practice in high energy physics. Classical methods are not recommended because they violate the Likelihood Principle, can produce physically inconsistent results, and suffer from a lack of precision and generality. The extreme Bayesian approach, with an arbitrary choice of the prior probability density or with priors deduced from scaling laws, is also rejected.

Contents

1. Introduction
  1.1 Scope of this article
  1.2 A first glance at the problem
2. Classical confidence limits
  2.1 Visualization
  2.2 Classical confidence limits in one dimension - definitions
    2.21 Central intervals
    2.22 Equal probability density intervals
    2.23 Minimum size intervals
    2.24 Symmetric intervals
    2.25 Selective intervals
    2.26 One-sided intervals
    2.27 Which definition is the best?
  2.3 Two simple examples
  2.4 Digital measurements
  2.5 External constraints
  2.6 Classical confidence limits with several parameters
  2.7 Nuisance parameters
    2.71 Factorization and restructuring
    2.72 Global limits
    2.73 Other methods
  2.8 Upper and lower limits
  2.9 Upper limits for Poisson distributed signals
    2.91 Rate limits with uncertainty in the flux
    2.92 Poisson limits with background
    2.93 Limits with re-normalized background
    2.94 Uncertainty in the background prediction
  2.10 Discrete parameters
3. Unified approaches
  3.1 Basic ideas of the unified approach
  3.2 Difficulties with two-sided constraints
  3.3 External constraint and distributions with tails
  3.4 Artificial correlations of independent parameters
  3.5 Upper Poisson limits
  3.6 Restriction due to unification
  3.7 Alternative unified methods
4. Likelihood ratio limits and Bayesian confidence intervals
  4.1 Inverse probability
  4.2 Interval definition
    4.21 Bayesian intervals
    4.22 Likelihood ratio intervals
  4.3 Problems with likelihood ratio intervals
  4.4 The prior parameter density and the parameter choice
  4.5 External constraints
  4.6 Several parameters and nuisance parameters
  4.7 Upper and lower limits
  4.8 Discrete parameters
5. The Likelihood Principle and information
  5.1 The Likelihood Principle
  5.2 The Stopping Rule Paradox
  5.3 Stein's example
    5.31 The paradox and its solution
    5.32 Classical treatment
  5.4 Stone's example
  5.5 Goodness-of-fit tests
  5.6 Randomization
6. Comparison of the methods
  6.1 Difference in philosophy
  6.2 Relevance, consistency and precision
  6.3 Coverage
  6.4 Metric - Invariance under variable transformations
  6.5 Error treatment
    6.51 Error definition
    6.52 Error propagation in general
    6.53 Systematic errors
  6.6 Combining data
  6.7 Bias
  6.8 Subjectivity
  6.9 Universality
  6.10 Simplicity
7. Summary and proposed conventions
  7.1 Classical procedure
  7.2 Likelihood and Bayesian methods
  7.3 Comparison
  7.4 Proposed Conventions
  7.5 Summary of the summary
A Shortest classical confidence interval for a lifetime measurement
B Objective prior density for an exponential decay

∗ E-mail: [email protected]

1. Introduction

1.1 Scope of this article

The progress of experimental sciences is, to a large extent, due to the assignment of uncertainties to experimental results. The information contained in a measurement, or in a parameter deduced from it, is incompletely documented and more or less useless unless some kind of error is attributed to the data. The precision of measurements has to be known i) to combine data from different experiments, ii) to deduce secondary parameters from it and iii) to test predictions of theories. Different statistical methods have to be judged on their ability to fulfill these tasks.

There is still no consensus on how to define error or confidence intervals among the different schools of statistics, represented by frequentists and Bayesians. One of the reasons for the continuing debate between these two parties is that statistics is partially an experimental science, as stressed by Jaynes [1], and partially a mathematical discipline, as expressed by Fisher: the first sentence in his famous book on statistics [2] is "The science of statistics is essentially a branch of Applied Mathematics". Thus one expects statistical methods not only to handle all kinds of practical problems but also to be deducible from a few axioms and to provide correct solutions for sophisticated exotic examples, requirements which are not even fulfilled by old and reputed sciences like physics.

Corresponding to the two main lines of statistical thought, we are confronted with different kinds of error interval definitions, the classical (or frequentist) one and some more or less Bayesian-inspired ones. The majority of particle physicists intellectually favor the first but in practice use the second. Both methods are mathematically consistent. In most cases their results are very similar, but there are also situations where they differ considerably. These cases exhibit either low event numbers, large measurement errors or parameters restricted by physical limits, like positive mass, |cos θ| ≤ 1, positive rates etc.

Some standard is badly needed. For example, there exist at present at least eight different methods for the single problem of computing an upper limit for Poisson distributed events.

The purpose of this article is not to repeat all the philosophical arguments in favor of the Bayesian or the classical school. They can be found in many text books, for example in Refs. [3, 4, 5], and in more or less profound articles and reports [6, 7]. Further references are given in an article by Cousins [7]. I am convinced that methods from both schools are valid and partially complementary. Pattern recognition, noise suppression and the analysis of time series are fields where Bayesian methods dominate, while goodness-of-fit techniques are based on classical statistics. It should be possible to agree on methods of interval estimation which are acceptable to both classical and Bayesian scientists.

We restrict our discussion to parameters of a completely defined theory and to errors deduced from an unbiased data sample. In this respect the situation in physics is different from that in the social, medical or economic sciences, where usually models have to be used which often forbid the use of a likelihood function.

In this report, the emphasis is mainly put on performance and less on the mathematical and statistical foundation. An exception is a discussion of the Likelihood Principle, which is fundamental for all modern statistics. The intention is to confront the procedures with the problems to be solved in physics and to judge them on the basis of their usefulness. Even though the challenge is in real physics cases, it is in simple examples that we gain clarity and insight. Thus simple examples are selected which illustrate the essential problems.

We focus on the following set of criteria:

1. Error intervals and one-sided limits have to measure the precision of an experiment.

2. They should be selective (powerful in classical notation [8]).

3. They have to be unique and consistent: Equally precise measurements have equal errors. More precise measurements provide smaller intervals than less precise measurements.

4. Subjective input has to be avoided.

5. The procedure should allow results from different experiments to be combined with minimum loss of information.

6. The information given should provide a firm basis for decisions.

7. The method should be as general as possible. Ad hoc solutions for special cases should be avoided.

8. Last but not least, we emphasize simplicity and transparency.

Most of these points have received little attention in the ongoing discussion in the physics community, and the examples in the professional statistical literature hardly touch our problems.

The present discussion in the high energy physics community focuses on upper limit determinations, and here especially on the Poisson case. However, upper limits should not be regarded in isolation from the general problem of error assignment.

In the following section we will confront the classical method with examples which demonstrate its main difficulties and limitations. Section 3 deals with the unified approach proposed by Feldman and Cousins. In Section 4 we investigate methods based on the likelihood function and discuss related problems. Section 5 is devoted to the Likelihood Principle and Section 6 contains a systematic comparison of the methods with respect to the issues mentioned above. Section 7, finally, concludes with some recommendations.


We emphasize low-statistics experiments. Thus least squares (χ²) methods will often not be applicable. For simplicity, we will usually assume that a likelihood function of the parameters of interest is available and that it has at most one significant maximum. For some applications (averaging of results) not only an interval has to be estimated but also a parameter point. Normally, the maximum likelihood estimate is chosen. In classical approaches, the intervals do not necessarily contain the likelihood estimate, and special prescriptions for the combination of results are necessary.

Most of our conclusions are equally valid for χ² methods.

1.2 A first glance at the problem

Before we start to discuss details, let us look in a very qualitative way at the main difference between a frequentist approach and methods based on the likelihood function.

As a simple example we imagine a measurement x and a probability distribution function¹ (pdf) depending on a parameter θ. How should we select the parameters which we want to keep inside our confidence interval? In Figure 1 we display a measurement and two Gaussian probability densities corresponding to two different parameter values θ₁ and θ₂.

Classical confidence limits (CCL) measure tail probabilities. A parameter value is accepted inside the confidence interval if the measurement is not too far in the tail of the corresponding pdf. Essentially, the integral over the tail beyond the measurement determines whether it is included. Classical methods would preferentially accept the parameter θ₂ corresponding to the wide peak, the measurement being less than one standard deviation off. Thus, they may exclude parameter values better supported by the data than those which they include.

Methods using the likelihood function favor the parameter corresponding to the narrow peak, which has the larger probability density at the data point, a quantity we usually call the likelihood of θ.

Which of the two approaches is the better one? Given x with no additional information, we certainly would bet for θ₁, with betting odds corresponding to the likelihood ratio in favor of θ₁. However, we then clearly favor precise predictions over crude ones. Assume θ₂ applies. The chance to accept it inside a certain likelihood interval is smaller than the corresponding chance for θ₁. The likelihood method is unfair to θ₂.

The classical limits involve an integration in the sample space, thus depending on the probability density of data that have not been observed. Bayesians object to using such irrelevant information. They rely on the likelihood function, transform it into a probability density of the parameter and usually integrate it to compute moments. Thus their conclusions depend on the somewhat arbitrary choice of the metric of the parameter space². This is not acceptable to frequentists.

The simplest and most common procedure is to define intervals based solely on the likelihood function. It depends only on the local probability density of the observed data and does not include subjective or irrelevant elements. Admittedly, restricting interval estimation to the information contained in the likelihood function - which is a mere parametrization of the data - does not permit us to deduce probabilities or confidence levels in the probabilistic sense.

2. Classical confidence limits

The defining property of classical confidence limits (CCL) is coverage: if a large number of experiments perform measurements of a parameter with confidence level³ α, the fraction α of the limits will contain the true value of the parameter inside the confidence limits. Thus, the measure of accuracy is defined pre-experimentally and evaluated by counting the successes of a large number of fictive experiments.

Fig. 1: The likelihood is larger for parameter θ₁, but the measurement is less (more) than one standard deviation off θ₂ (θ₁). Classical approaches include θ₂ and exclude θ₁ within a 68.3% confidence interval.

¹ In most cases we do not distinguish between discrete and continuous probability distribution functions.
² We consider only uniform prior densities. There is no loss of generality, see Section 4.
³ Throughout this article we use the generic name "confidence level" to measure the quality of the interval. More precise expressions are "coverage" or "p-value" for classical limits, "probability" for Bayesian limits and "ratio" for likelihood ratio limits. The confidence level α corresponds to 1 − α in most of the literature.

In the following we show how confidence limits fulfilling the coverage requirement can be constructed.

2.1 Visualization

We illustrate the concept of CCL for a measurement (statistic) consisting of a vector (x₁, x₂) and a two-dimensional parameter space (see Figure 2). In a first step we associate to each point θ₁, θ₂ in the parameter space a closed probability contour in the sample space containing a measurement with probability α. For example, the probability contour labeled a in the sample space corresponds to the parameter values of point A in the parameter space. The curve (confidence contour) connecting all points in the parameter space whose probability contours in the sample space pass through the actual measurement x₁, x₂ encloses the confidence region of confidence level (C.L.) α. All parameter values contained in the confidence region will contain the measurement inside their probability contour.

Whatever the actual values of the parameters are, the measurement will produce with probability α a confidence contour which contains these parameters.

Frequently, a measurement is composed of many independent single measurements which are combined into an estimator θ̂ of the parameter. Then the two plots of Figure 2 can be combined into a single graph (see Figure 3).

Figures 2 and 3 demonstrate some of the requirements necessary for the construction of an exact confidence region:

1. The sample space must be continuous. (Discrete distributions, thus all digital measurements and in principle also Poisson processes, are excluded.)

2. The probability contours should enclose a simply connected region.

3. The parameter space has to be continuous.

4. The parameter space should be infinite.

The restriction (1) is usually overcome by relaxing the requirement of exact coverage. The probability contours are enlarged to contain at least the fraction α of measurements. Confidence limits with minimum overcoverage are chosen. This makes sense only when the density of points in the sample space is relatively large.

Fig. 2: Two-parameter classical confidence limit for a measurement x₁, x₂. The dashed contours labeled with small letters in the sample space correspond to probability contours of the parameter pairs labeled with capital letters in the parameter space.

Fig. 3: Two-parameter classical confidence limit. The dashed probability contours labeled with small letters contain an estimate of the true value (capital letter) with probability α.

central interval              P(x ≤ x₁|θ) = P(x ≥ x₂|θ) = (1 − α)/2
equal probability densities   f(x₁|θ) = f(x₂|θ)
minimum size                  θ_high − θ_low is minimum
symmetric                     θ_high − θ̂ = θ̂ − θ_low
likelihood ratio ordering     f(x₁|θ)/f(x₁|θ_best) = f(x₂|θ)/f(x₂|θ_best)
one-sided                     θ_low = −∞ or θ_high = ∞

Table 1: Some choices for classical confidence intervals

A possibility to conserve exact coverage is to randomize the confidence belts. The sample points are associated to the probability region of one or another parameter according to an appropriate probability distribution. This method is quite popular in some research fields and discussed extensively in the statistical literature, but will not be followed here. It introduces an additional unnecessary uncertainty. It seems stupid to make one's decisions depend on the outcome of a coin toss.

The restriction from point 4) concerns the mathematical property of the probability density and not physical constraints. This problem is absent in unified approaches (see Section 3).

There is considerable freedom in the choice of the probability contours, but to ensure coverage they have to be defined independently of the result of the experiment. Usually, contours are locations of constant probability density.

2.2 Classical confidence limits in one dimension - definitions

For each possible value of the parameter θ we fix a probability interval [x₁(θ), x₂(θ)] fulfilling

P(x₁ ≤ x ≤ x₂ | θ) = ∫_{x₁}^{x₂} f(x|θ) dx = α

where f(x|θ) is the probability density function. Using the construction explained above, we find for a measurement x the confidence limits θ_low and θ_high from

x₁(θ_high) = x
x₂(θ_low) = x

The definitions do not fix the limits completely. Additional constraints have to be added. Some sensible choices are listed in Table 1. (We only consider distributions with a single maximum.)

2.21 Central intervals

The standard choice is central intervals. For a given parameter value, the measurement is equally likely to fall into the lower tail and into the upper tail of the distribution. The Particle Data Group (PDG) [9] advocates central intervals, though the last edition also proposes likelihood ratio intervals. Central intervals are invariant against variable and parameter transformations. Their application is restricted to the simple case with one variable and one parameter. When central intervals are considered together with a point estimate (measurement, parameter fit), the obvious parameter choice is the limit of zero interval length (α = 0), the median, and not the value which maximizes the likelihood.
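For a concrete picture of the construction, the central limits can be obtained numerically by solving the defining equations for θ_low and θ_high at the observed x. The following sketch (not part of the paper) does this with a root finder, using the diffusion pdf of Example 1 in Section 2.3, f(x|θ) with variance cθ; the values c = 1 and x_obs = 2 are invented for the illustration.

```python
# Minimal sketch of the numerical construction of central classical limits:
# the observed x must sit at the (1-alpha)/2 lower tail of f(x|theta_high)
# and at the (1-alpha)/2 upper tail of f(x|theta_low).
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def cdf(x, theta, c=1.0):
    """P(X <= x | theta) for a Gaussian with mean theta and variance c*theta."""
    return norm.cdf(x, loc=theta, scale=np.sqrt(c * theta))

def central_limits(x_obs, alpha=0.683, theta_max=100.0):
    # upper limit: x_obs at the lower (1-alpha)/2 tail of f(x|theta_high)
    theta_high = brentq(lambda t: cdf(x_obs, t) - (1 - alpha) / 2, 1e-9, theta_max)
    # lower limit: x_obs at the upper (1-alpha)/2 tail of f(x|theta_low)
    theta_low = brentq(lambda t: cdf(x_obs, t) - (1 + alpha) / 2, 1e-9, theta_max)
    return theta_low, theta_high

print(central_limits(2.0))   # approximately (1.0, 4.0) for c = 1
```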

2.22 Equal probability density intervals

Equal probability density intervals are preferable because in the majority of cases they are shorter and less biased than central intervals, and the concept is also applicable to the multidimensional case.


They coincide with central intervals for symmetric distributions. A disadvantage is the non-invariance of the definition under variable transformations (see Section 6.4).

2.23 Minimum size intervals

Of course we would like the confidence interval to be as short as possible [10]. The construction of minimum size intervals, if possible at all, is a difficult task [3, 10]. In Appendix A we illustrate how pivotal quantities can be used to compute such intervals. Clearly, these intervals depend by definition on the parameter choice. When we determine the mean lifetime of a particle, the minimum size limits for the lifetime will not transform into limits of the decay constant with the same property.

2.24 Symmetric intervals

Often it is reasonable to quote symmetric errors relative to an estimate θ̂ of the parameter. Again this type of interval is difficult to construct and not invariant under parameter transformations.

2.25 Selective intervals

One could try to select confidence intervals which minimize the probability to contain wrong parameter values [8]. These intervals are called shortest by Neyman and most selective by Kendall and Stuart [11]. Since this condition cannot be fulfilled for two-sided intervals independent of the value of the true parameter, Neyman has proposed the weaker condition that the coverage for all wrong parameter values always has to be less than α. It defines the shortest unbiased or most selective unbiased (MSU) intervals⁴.

Most selective unbiased intervals can be constructed [12] with the likelihood ratio ordering (not to be mixed up with likelihood ratio intervals). Here the probability contour (in the sample space) of a parameter corresponds to constant R, defined by

R(x|θ) = f(x|θ) / f(x|θ_best)    (1)

where θ_best is the parameter value with the largest likelihood for a fictive measurement x. Qualitatively this means that preferentially those values of x are added to the probability interval of θ where competitive values of the parameter have a low likelihood. Selective intervals have the attractive property that they are invariant under transformations of variables and parameters, independent of their dimension. Usually the limits are close to likelihood ratio intervals (see Section 4.2) and shorter than central intervals⁵. The construction of the limits is quite tedious unless a simple sufficient statistic can be found. In the general case, where sufficiency requires a full data sample of, say, 1000 events, one has to find likelihood ratio contours in a 1000-dimensional space. For this reason, MSU intervals have not become popular in the past and were assumed to be useful only for the so-called exponential family [13] of pdfs, where a reduction of the sample space by means of sufficient statistics is admitted. Nowadays, the computation has become easily feasible on PCs. However, the programming effort is not negligible.

The likelihood ratio ordering is applied in the unified approach [15] which will be discussed inSection 3.

2.26 One-sided intervals

One-sided intervals define upper and lower limits. Here, obviously, the limit should be a function of a sufficient statistic or, equivalently, of the likelihood.

⁴ These intervals are closely related to uniformly most powerful and uniformly most powerful unbiased tests [13].
⁵ The likelihood ratio ordering minimizes the probability of errors of the second kind for one-sided intervals. For two-sided intervals this probability can only be evaluated using relative prior probabilities of the parameters located at the two sides of the confidence interval [8, 3, 14].


Fig. 4: Position measurement from drift time. The error is due to diffusion. Classical confidence intervals are shown.

2.27 Which definition is the best?

No general answer can be given. The choice may be different when we are interested in the uncertainty of a measurement of a particle track or in the verification of a theory. In the former case one may prefer criteria based on the mean squared deviation of the limits from the true parameter value [16]. If probability arguments dominate, Neyman's MSU intervals are most attractive. They are the optimum classical choice. Their only disadvantage is the complicated numerical evaluation of the limits.

To simplify the discussion, in the following section (except for the first example) we do not apply the Neyman construction but follow the more popular line using central or equal probability intervals. The MSU prescription will be treated separately in Section 3, dealing with the unified approach.

2.3 Two simple examples

Example 1 To illustrate the concept of classical confidence levels in one dimension we consider an electron moving in a gaseous detector. The measured drift time is converted to a distance x. The uncertainty is due to diffusion. The probability density is

f(x) = (1/√(2πcθ)) exp[−(x − θ)²/(2cθ)]

where c is a constant. Figure 4 shows a measurement, the likelihood function and the confidence limits for a central interval and for a MSU interval. The likelihood ordering is indicated in Figure 5 for θ = 1. The 68.3% probability interval covers the region of x where the likelihood ratio is largest. The dip at x = 0 is due to the fact that f(0|θ_best) = f(0|θ = 0) = ∞. The MSU interval is shorter than the central interval and very close to the likelihood ratio interval.
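The MSU (likelihood-ratio-ordering) interval of this example can be approximated numerically: for each trial θ one ranks the sample values x by R(x|θ) of Eq. (1), keeps the most-favored region of probability content α, and accepts θ if the observed x lies inside it. The sketch below is illustrative only; the grids, c = 1 and x_obs = 2 are assumed values, not numbers from the paper.

```python
# Sketch of the likelihood-ratio (MSU) ordering of Eq. (1) for f(x|theta) = N(theta, c*theta).
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

C = 1.0
GRID = np.linspace(1e-3, 30.0, 6000)      # sample-space grid for x
DX = GRID[1] - GRID[0]

def pdf(x, theta):
    return norm.pdf(x, loc=theta, scale=np.sqrt(C * theta))

def f_best(x):
    """sup over theta > 0 of f(x|theta), found numerically."""
    res = minimize_scalar(lambda t: -pdf(x, t), bounds=(1e-6, 1e3), method="bounded")
    return pdf(x, res.x)

DENOM = np.array([f_best(x) for x in GRID])   # depends on x only, computed once

def accepted(theta, x_obs, alpha=0.683):
    f = pdf(GRID, theta)
    order = np.argsort(-(f / DENOM))          # largest likelihood ratio first
    kept = order[np.cumsum(f[order]) * DX <= alpha]
    return np.min(np.abs(GRID[kept] - x_obs)) <= DX

thetas = np.linspace(0.5, 8.0, 300)
inside = [t for t in thetas if accepted(t, 2.0)]
print(min(inside), max(inside))               # approximate MSU interval for x_obs = 2
```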

We now turn to problematic situations.

Fig. 5: Likelihood ratio ordering.

Example 2 We extract the slope of a linear distribution

f(x; θ) = (1/2)(1 + θx)    (2)

with −1 < x < 1. Our data sample consists of two events observed at symmetric locations x₁, x₂. From the observed sample mean s = (x₁ + x₂)/2 we construct the central limits

∫_{−1}^{s} g(s′; θ_high) ds′ = ∫_{s}^{1} g(s′; θ_low) ds′ = (1 − α)/2

where g(s; θ) is the distribution of the sample mean of two measurements, each following f(x; θ). For the specific result a) x₁ = −0.5, x₂ = 0.5, we find s = 0 and θ_low = −0.44, θ_high = 0.44 with C.L. = 0.683. If instead the result of the measurement had been b) x₁ = 0, x₂ = 0 or c) x₁ = −1, x₂ = 1, the same confidence limits would have been obtained as in case a).

This is one of the weak points of classical confidence limits: they can provide stringent limits even when the experimental data are non-informative, as in case b). This is compensated by relatively loose limits for informative observations, as in case c). CCL do not necessarily exhaust the experimental information.

Of course, in the quoted example the poor behavior can be avoided also in classical statistics by working in the two-dimensional space of x₁, x₂. For a sample of size n, probability contours have to be defined in an n-dimensional space because no obvious simple sufficient statistic is available. The computation of the limits becomes complex but is easily feasible with present PCs.

2.4 Digital measurements

Example 3 A particle track passes at the unknown position µ through a proportional wire chamber. The measured coordinate x is set equal to the wire location x_w. The probability density for a measurement x,

f(x|µ) = δ(x − x_w),

is independent of the true location µ. Thus it is impossible to define a classical confidence interval, except a trivial one with full overcoverage. This difficulty is common to all digital measurements because they violate condition 1 of Section 2.1. Thus a large class of measurements cannot be handled in classical statistics.

Even orthodox physicists use the obvious Bayesian solution (see Section 4) to this problem.


2.5 External constraints

One of the main objections against classical confidence limits is related to situations where the parameter space is restricted by external constraints. To illustrate the problem, we take up an often quoted example:

Example 4 A physical quantity, like the mass of a particle, with a resolution following a normal distribution is constrained to positive values. Figure 6 shows typical central confidence bounds which extend into the unphysical region. In extreme cases a measurement may produce a 90% confidence interval which does not cover physical values at all.

An even more complicated situation for the standard classical method is encountered in the following example, corresponding to the distribution of Example 2.

Example 5 For a sample of 100 events following the distribution of Eq. 2, a likelihood analysis gives a best value for the slope parameter of θ = 0.92 (see Figure 7). When we try to compute central classical 68.3% confidence bounds, we get θ_low = 0.82 and find no upper bound inside the allowed range −1 ≤ θ ≤ 1. We cannot compute a limit above 1 either, because slopes θ > 1 correspond to negative probabilities for small values of x.

I know of no simple classical solution to this problem (except with likelihood ordering in a hundred-dimensional space), which is related to condition 4 of Section 2.1.

Another difficulty arises for parameters which are restricted from both sides:

Example 6 A particle passes through a small scintillator and another position-sensitive detector with Gaussian resolution. Both boundaries of the classical error interval are in the region forbidden by the scintillator signal (see Figure 8). The classical error is twice as large as the r.m.s. width. It is meaningless.

2.6 Classical confidence limits with several parameters

In case several parameters have to be determined, the notion of central intervals no longer makes sense and the construction of intervals of minimal size is exceedingly complex. It is advisable to use curves of equal probability or likelihood ratio boundaries.

Very disturbing confidence intervals may be obtained in several dimensions when one tries to minimize the size of the interval. Here, a standard example is the case of the simple normal distribution in more than two dimensions, where occasionally ridiculously tiny intervals are obtained [17].

2.7 Nuisance parameters

In most experimental situations we have to estimate several parameters but are interested in only one of them. The other parameters, the nuisance parameters, have to be eliminated. A large number of research articles have been published on the subject. A detailed discussion and references can be found in Basu's work [18].

Feldman [14] emphasizes that a general treatment of nuisance parameters is given by Kendall and Stuart in connection with the likelihood ratio ordering. This is not true. The authors state (notation slightly modified) "For the likelihood ratio method to be useful ... the distribution of R(x|θ) has to be free of nuisance parameters." and give a relevant example.


Fig. 6: Confidence limits for a measurement with Gaussian errors and a physical boundary. The left side shows 68.3% confidence intervals, the right side 90% upper limits. The labels refer to classical (c), unified classical (cu) and Bayesian (b).


Fig. 7: Likelihood of the slope parameter of a linear distribution. Classical confidence limits cannot be defined because the slope is undefined outside the physical interval.

Fig. 8: Classical confidence bounds cannot be applied to a parameter space bounded from both sides.


2.71 Factorization and restructuring

It is easy to eliminate the nuisance parameter if the pdf factorizes into a function f_ν of the nuisance parameter ν and a function f_θ of the parameter of interest θ:

f(x|θ, ν) = f_θ(x|θ) f_ν(x|ν)    (3)

Then the limits for θ are independent of a specific choice of ν. If this condition is not realized, one may try to restructure the problem, looking for another parameter η(θ, ν) which fulfills the condition:

f(x|θ, η) = f_θ(x|θ) f_η(x|η)

Then again, fixing the new parameter η to an arbitrary value has no influence on the confidence limits for the parameter of interest and we may just forget about it.

The following example is taken from Edwards [19].

Example 7 The absorption of a piece of material is determined from the two rates r₁, r₂ measured with and without the absorber. The rates follow Poisson distributions with mean values θ₁, θ₂:

f(r₁, r₂) = e^{−(θ₁+θ₂)} θ₁^{r₁} θ₂^{r₂} / (r₁! r₂!)

We are interested in the ratio φ = θ₁/θ₂. By a clever choice of the nuisance parameter ν,

φ = θ₁/θ₂,    ν = θ₁ + θ₂,

we get

f = e^{−ν} ν^{r₁+r₂} / [(1 + 1/φ)^{r₁} (1 + φ)^{r₂} r₁! r₂!] = f_ν(r₁ + r₂|ν) f_φ(r₁, r₂|φ)

where the dependence of f on the interesting parameter φ and the function of the nuisance parameter ν factorize. The sum r₁ + r₂, which is proportional to the measuring time, obviously contains no information about the absorption parameter φ. It is ancillary and we can condition on r₁ + r₂. It is interesting to notice that f_φ is not a function of the measured absorption ratio r₁/r₂ only.⁶

A lot of effort has been invested to solve the factorization problem [20]. The solutions usually yield the same results as integrating out the nuisance parameter.
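The factorization of Example 7 is easy to verify numerically: conditional on n = r₁ + r₂, the number r₁ follows a binomial distribution with p = φ/(1 + φ), which depends only on the ratio φ and not on ν. A small sketch follows; the rates and observed counts are invented for the illustration.

```python
# Numerical check of the factorization in Example 7.
import numpy as np
from scipy.stats import poisson, binom

theta1, theta2 = 6.0, 3.0                 # assumed true rates
phi, nu = theta1 / theta2, theta1 + theta2
r1, r2 = 8, 3                             # an assumed observation
n = r1 + r2

joint = poisson.pmf(r1, theta1) * poisson.pmf(r2, theta2)
factorized = poisson.pmf(n, nu) * binom.pmf(r1, n, phi / (1 + phi))
print(joint, factorized)                  # identical: the pdf factorizes as claimed

# conditioned on n, the likelihood of phi is the binomial factor alone
phis = np.linspace(0.1, 20.0, 2000)
logL = binom.logpmf(r1, n, phis / (1 + phis))
print(phis[np.argmax(logL)])              # maximum near r1/r2 = 2.67
```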

That restructuring is not always possible is seen in the following example.

Example 8 The mean decay rate γ is to be determined from a sample of 20 events which contain an unknown number of background events with decay rate γ_b = 0.2. Likelihood contours for the two parameters γ and the number of signal events are shown in Figure 9. There is no obvious way to disentangle the parameters.

⁶ Exactly the same situation is found in the determination of ε′ from a double ratio of CP-violating kaon decays.


Fig. 9: Likelihood contours for Example 8. For better visualization the discrete values of the nuisance parameter "number of events" are connected.

2.72 Global limits

A possible but not satisfactory solution to eliminate a nuisance parameter is to compute a conservative limit. How this can be done is indicated by the dashed lines in the right-hand graph of Figure 2 for the two-parameter case: a global confidence interval is computed for all parameters, including the nuisance parameters. The projection of the contour onto the parameter of interest provides a conservative confidence interval for this parameter.

Looking again at Figure 2, we realize that by squeezing the confidence ellipse in the direction of the parameter of interest and by stretching it in the other direction, the confidence level probably can be maintained but a narrower projected interval could be obtained. To perform this deformation in a consistent way such that minimum overcoverage is guaranteed is certainly not easy. An example is discussed in Section 2.9.

2.73 Other methods

Another popular method used by classical statisticians [21, 22] to eliminate the nuisance parameter is to estimate it away, i.e. to replace it by the best estimate. This proposal not only violates the coverage principle but is clearly incorrect when the parameters are correlated. The stronger the correlation between the parameters, the smaller the confidence interval will become. In the limit of full correlation it will approach zero!

A more sensible way is to condition on the sensitivity of the data relative to the nuisance parameter. We search for a statistic Y_ν which is ancillary in ν and use the pdf f(Y_ν|θ) to determine the confidence interval. Usually Y_ν will not be sufficient for θ and the approach is not optimum, since part of the experimental information has to be ignored.

2.8 Upper and lower limits

Frequently, we want to give upper or lower limits for a physical parameter. Then, of course, we use one-sided bounds.


Example 9 In a measurement of the neutrino mass the result is m = (−2 ± 2) eV with Gaussian errors independent of the true value. A 90% confidence upper limit m_u is defined classically by

0.9 = ∫_{m}^{∞} η(x|m_u, 2) dx

where η denotes the normal distribution centered at m_u with width 2. The upper 90% confidence limit is m_u < 0.6 eV. For a measurement m = (−4 ± 2) eV the limit would be m_u < −1.4 eV, in the unphysical region. This is a frequently discussed problem. The fact that we confirm with 90% confidence something which is obviously wrong does not contradict the concept of classical confidence limits. The coverage is guaranteed for an ensemble of experiments and the unphysical intervals occur only in a small fraction of them. Whether such confidence statements make sense is a different story⁷.
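A minimal sketch of this computation: the defining integral is equivalent to m_u = m + z_{0.9}·σ, with z_{0.9} the 90% quantile of the standard normal distribution.

```python
# Classical 90% upper limit for a Gaussian measurement with known resolution (Example 9).
from scipy.stats import norm

def upper_limit(m, sigma, cl=0.9):
    return m + norm.ppf(cl) * sigma

print(upper_limit(-2.0, 2.0))   # about 0.56 eV, quoted as 0.6 eV above
print(upper_limit(-4.0, 2.0))   # about -1.4 eV, the unphysical limit
```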

2.9 Upper limits for Poisson distributed signals

In particle physics, by far the most frequent case is the calculation of upper limits for the mean of Poisson distributed numbers, P(k|µ) = e^{−µ}µ^k/k!. As stated above, for discrete measurements the frequentist approach has to accept overcoverage. (Remember, we do not consider randomization.) In the approximation with minimum overcoverage the upper limit µ for n events is given by:

α = Σ_{i=n+1}^{∞} P(i|µ)

1 − α = Σ_{i=0}^{n} P(i|µ)

In words, this means: if the limit corresponded to the true parameter value, the probability to observe n events or fewer would be equal to 1 − α.

In the majority of the experiments no event (n = 0) is found. The classical 90% upper confidence limit is then µ = 2.3, as shown in Figure 10.

Now assume the true value is µ = 0. Obviously, no event would be found in repeated experiments and the coverage is 100%, independent of the nominal confidence value. It is impossible to avoid the complete overcoverage. Published exotic particle searches are much more often right than indicated by the given confidence level.
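The classical Poisson upper limit is easily obtained numerically by solving 1 − α = Σ_{i=0}^{n} P(i|µ) for µ; a minimal sketch:

```python
# Classical upper limit for a Poisson mean with n observed events and no background.
from scipy.stats import poisson
from scipy.optimize import brentq

def poisson_upper_limit(n, alpha=0.9):
    return brentq(lambda mu: poisson.cdf(n, mu) - (1 - alpha), 1e-10, 100.0)

print(poisson_upper_limit(0))   # 2.30, the 90% limit for zero events
print(poisson_upper_limit(2))   # 5.32
```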

2.91 Rate limits with uncertainty in the flux

The flux is a nuisance parameter which is to be eliminated. We apply the procedure outlined above and obtain a conservative global limit.

Example 10 We observe zero events (n = 0) of a certain type, measure the flux f = 1 ± 0.05 and want to compute an upper limit for the rate. To simplify the problem for the present purpose we assume that the observed flux distribution is Gaussian, with width independent of the mean value. The event number is Poisson distributed. As usual, we first have to construct probability contours. Figure 11 (left) shows 90% probability contours for three different flux and rate values. Measured flux and event numbers will fall in the shaded area of the left-hand upper graph with 90% probability for true flux f = 1.0 and true rate r = 3.0. It extends from at least one event to infinity. There are two other probability contours which just exclude the actual observation (f = 1, n = 0) with 90% probability. The construction of the corresponding confidence region (Figure 11, right) for the two parameters follows the recipe of Section 2. The confidence contour is given by the true values of the left-hand graph. A conservative limit for the rate alone is indicated by the dash-dotted line. The global limit depends on the construction of the probability contours. The two lower plots correspond to probability regions which are wider in the flux and narrower in the rate variable. The second construction obviously is superior to the first because it gives a more restrictive limit. In any case the limit is considerably worse than in an experiment without uncertainty in the flux.

⁷ Savage et al. [23]: "The only use I know for a confidence interval is to have confidence in it."

Fig. 10: Construction of the classical 90% confidence upper limit for zero events observed. The Poisson distribution with mean 2.3 predicts zero events in 10% of the cases.

This result is in contradiction to Refs. [7, 21], where the contrary is claimed, namely that the flux uncertainty improves the limit, a conclusion which the authors seem to accept without questioning their recipe!

2.92 Poisson limits with background

The situation becomes complex when the experimental data contain background. Assuming the background expectation b is precisely known, the probability to find k events is

W(k) = Σ_{i=0}^{k} Σ_{j=0}^{k} P(i|µ) Q(j|b) δ_{i+j,k}    (4)
     = Σ_{i=0}^{k} P(i|µ) Q(k − i|b)    (5)

with Q(j|b) the background distribution. (We sum over all combinations of background j and signal i which add up to k events.) Usually the background also follows a Poisson distribution:

W(k) = Σ_{i=0}^{k} P(i|µ) P(k − i|b)    (6)
     = P(k|µ + b)    (7)

Fig. 11: Probability contours (left) and confidence regions (right) for a Poisson rate with flux uncertainty. The conservative 90% upper limit of the rate is indicated by the dashed line. The lower plots, with 3 st. dev. limits in the flux, provide more restrictive 90% limits than the upper plots with 2 st. dev. flux limits.

Then the probability to find n or fewer events is

1 − α = Σ_{k=0}^{n} W(k) = Σ_{k=0}^{n} P(k|µ + b)    (8)

Solving the last equation for µ, we get the upper limit with confidence α. Apparently, for given n, the limit becomes more restrictive the larger b is. On average, experiments with large background get as good limits as experiments with little background, and occasionally, due to background fluctuations, the numerical evaluation produces even negative limits. The limits do not represent the precision of the measurement.
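A sketch of the computation of Eq. (8), reproducing the "standard classical" row of Table 2 below (including the negative limit for n = 0, b = 3):

```python
# Standard classical upper limit with known background b, Eq. (8):
# solve 1 - alpha = sum_{k=0}^{n} P(k|mu + b) for mu; negative roots are allowed.
from scipy.stats import poisson
from scipy.optimize import brentq

def upper_limit_with_background(n, b, alpha=0.9):
    return brentq(lambda mu: poisson.cdf(n, mu + b) - (1 - alpha), -b + 1e-10, 100.0)

for n, b in [(0, 0), (0, 1), (0, 2), (0, 3), (2, 2)]:
    print(n, b, round(upper_limit_with_background(n, b), 2))
# prints 2.3, 1.3, 0.3, -0.7, 3.32 -- the "standard classical" row of Table 2
```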

It is instructive to study the case where no event is observed but where background is expected:

Example 11 In a garden there are apple and pear trees close together. Usually during the night some pears fall from the trees. One morning, looking from his window, the proprietor, who is interested in apples, finds that no fruit is lying in the grass. Since it is still quite dark he is unable to distinguish apples from pears. He concludes that the average rate of falling apples per night is less than 2.3 with 90% confidence level. His wife, who is a classical statistician, tells him that his rate limit is too high because he has forgotten to subtract the expected pear background. He argues, "there are no pears", but she insists and explains to him that if he ignores the pears that could have been there but weren't, he would violate the coverage requirement. In the meantime it has become bright outside, and pears and apples - which both are not there - are now distinguishable. Even though the evidence has not changed, the classical limit has.

The 90% confidence limit for zero events observed and background expectation b = 0 is µ = 2.3. For b = 2 it is µ′ = 0.3, much lower. Classical confidence limits are different for two experiments with exactly the same experimental evidence relative to the signal (no signal event seen). This conclusion is absolutely intolerable.

Feldman and Cousins consider this kind of objection as "based on a misplaced Bayesian interpretation of classical intervals" [15]. It is hard to detect a Bayesian origin in a generally accepted principle in science, namely that two measurements containing the same information should give identical results. The criticism here is not that CCLs are inherently wrong, but that their application to the computation of upper limits when background is expected does not make sense, i.e. these limits do not measure the precision of the experiment. This is also illustrated in the following example, which is a more scientific replica of Example 11:

Example 12 An experiment is undertaken to search for events predicted by some exotic theory. In a pre-defined kinematic region no event is found. A search in a corresponding control region predicts b background events. After the limit is determined, a new kinematical cut is found which completely eliminates all background. The improved analysis produces a much less stringent classical limit!

The 90% upper limits for some special cases are collected in Table 2. The upper rate limit for no event found but three background events expected is negative.

2.93 Limits with re-normalized background

To avoid the unacceptable situation, I had proposed [24] a modified frequentist approach to the calculation of the Poissonian limits. It takes into account that the background k has to be less than or equal to the number n of observed events. The a priori background distribution for this condition (k ≤ n) is

Q′(k|b) = Q(k|b) / Σ_{i=0}^{n} Q(i|b)

We replace Q by Q′ in Equation 5 and obtain for the Poisson case:

1 − α = Σ_{k=0}^{n} P(k|µ + b) / Σ_{k=0}^{n} P(k|b)    (9)

                     n=0, b=0   n=0, b=1   n=0, b=2   n=0, b=3   n=2, b=2
standard classical     2.30       1.30       0.30      -0.70       3.32
unified classical      2.44       1.61       1.26       1.08       3.91
uniform Bayesian       2.30       2.30       2.30       2.30       3.88

Table 2: 90 percent confidence limits in frequentist and Bayesian approaches

The interpretation is: the probability 1 − α to observe n or fewer events, for a signal mean µ and a background mean b, with the restriction that the background does not exceed the observed number, is given by Equation 9. The resulting limits respect the Likelihood Principle (see Section 5) and thus are consistent. The standard classical limits depend on the background distribution for background larger than the observed event number. This information, which clearly is irrelevant for estimating the signal, is ignored in the modified approach (Eq. 9).

Formula 9 accidentally coincides with that of the uniform Bayesian method. Interesting applications of the method, with some variations, are found in Refs. [25, 26, 27, 28].
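The modified limit of Eq. (9) can be computed in the same way as the standard one; the sketch below reproduces the behavior discussed above, namely that for n = 0 the limit stays at 2.3 regardless of the expected background.

```python
# Modified frequentist limit of Eq. (9): the background is conditioned on not
# exceeding the observed number of events.
from scipy.stats import poisson
from scipy.optimize import brentq

def modified_upper_limit(n, b, alpha=0.9):
    ratio = lambda mu: poisson.cdf(n, mu + b) / poisson.cdf(n, b)
    return brentq(lambda mu: ratio(mu) - (1 - alpha), 1e-10, 100.0)

print(round(modified_upper_limit(0, 2), 2))   # 2.30, unchanged by the background
print(round(modified_upper_limit(2, 2), 2))   # 3.88, the uniform Bayesian value in Table 2
```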

Highland [29] has criticized my derivation. He claimed (in my words):

1) Q′ is not the relevant background distribution. He proposes a different distribution which he derives from Bayes' theorem and which depends on µ.

2) The modified limits do not have minimum overcoverage as required by the strict application of the Neyman construction.

The first objection is not justified [30], in my opinion. I computed exactly what is stated in the interpretation. The construction of the product formula 4 requires that the background distribution is independent of the signal distribution, a condition not satisfied by Highland's construction. He apparently realized this himself, otherwise he would have derived a relation for the computation of the upper limit.

The second statement certainly is correct, but in my paper no claim relative to the coverage had been made. (See also Ref. [31].) In view of the unavoidable complete overcoverage for µ = 0 of all classical methods, the moderate overcoverage of relation 9, which avoids inconsistencies, seems to be acceptable to many pragmatic frequentists.

2.94 Uncertainty in the background prediction

Often the background expectation b is not known precisely, since it is estimated from side bands or from other measurements with limited statistics. We have to distinguish two cases.

a) The pdf g(b) of b is known. Then we can integrate over b and obtain for the conventional classical expression

1 − α = ∫ g(b) Σ_{k=0}^{n} P(k|µ + b) db

and the modified frequentist formula 9 becomes

1 − α = ∫ g(b) Σ_{k=0}^{n} P(k|µ + b) db / ∫ g(b) Σ_{k=0}^{n} P(k|b) db
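Both expressions can be evaluated by a straightforward numerical integration over b. The sketch below assumes, purely for illustration, a truncated Gaussian g(b) with mean 2 and width 0.5; these numbers are not from the paper.

```python
# Numerical evaluation of the two expressions of case (a), with the sums over k smeared by g(b).
import numpy as np
from scipy.stats import poisson, norm
from scipy.optimize import brentq

b_grid = np.linspace(0.0, 10.0, 2001)
g = norm.pdf(b_grid, loc=2.0, scale=0.5)
g /= np.trapz(g, b_grid)                       # re-normalize after truncation at b >= 0

def smeared_cdf(n, mu):
    """integral of g(b) * sum_{k<=n} P(k|mu+b) db"""
    return np.trapz(g * poisson.cdf(n, mu + b_grid), b_grid)

def upper_limit(n, alpha=0.9, modified=False):
    denom = smeared_cdf(n, 0.0) if modified else 1.0
    return brentq(lambda mu: smeared_cdf(n, mu) / denom - (1 - alpha), 1e-10, 100.0)

print(upper_limit(0))                  # conventional: pushed well below 2.3 by the background
print(upper_limit(0, modified=True))   # modified: stays at 2.3 for n = 0
```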


b) We know only the likelihood function of b. Then b is a nuisance parameter. Thus the methods outlined in Section 2.7 have to be applied. Again, I would not support the proposal of Cousins [21], "replace the nuisance parameter by the best estimate", since the result often violates the coverage principle. The "global" method should be used.

2.10 Discrete parameters

For discrete parameters it usually does not make much sense to give error limits, but we would like to associate relative confidence values to the parameters.

Example 13 Two theories H1, H2 predict the time of an earthquake with Gaussian precision

H1 : t₁ = (7.50 ± 2.25) h
H2 : t₂ = (50 ± 100) h

The event actually takes place at time t_m = 10 h. The predictions together with the observation are displayed in Figure 12 (top). The prediction of H2 is rather vague but includes the measurement within half a standard deviation. H1 is rather precise but misses the observation within the errors.

A similar example is analyzed in detail in an article by Jeffreys and Berger [32].

There is no obvious way to associate classical confidence levels to the two possible solutions. A rational extension of the frequentist methods applied to continuous parameters would be to compare the tail probabilities of the two hypotheses. Tail probabilities, however, are quite misleading, as illustrated below in Example 14. They would support H2, contrary to our intuition which is clearly in favor of H1. For this reason classical statisticians prefer [33] to apply the Neyman-Pearson test based on the likelihood ratio, in our case equal to 26 in favor of H1.
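The quoted ratio of 26 is easy to verify numerically (a minimal sketch; the likelihood of each hypothesis is simply the Gaussian density evaluated at the observed time):

```python
from scipy.stats import norm

t_m = 10.0                                  # observed time of the earthquake (hours)
L1 = norm.pdf(t_m, loc=7.5, scale=2.25)     # hypothesis H1
L2 = norm.pdf(t_m, loc=50.0, scale=100.0)   # hypothesis H2
print(L1 / L2)                              # likelihood ratio, about 26 in favour of H1
```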

It is interesting to consider the modified example:

Example 14 Another theory, H3(t3), depending on the unknown parameter t3, predicts the Gaussian probability density

f(t) = \frac{25}{\sqrt{2\pi}\, t_3^2} \exp\left(-\frac{625\,(t-t_3)^2}{2\, t_3^4}\right)

for the time t. The classical confidence limits for the same measurement as above, tm = 10 h, are 7.66 h < t3 < ∞, slightly excluding t1 but in perfect agreement with t2. Figure 12 (bottom) shows the corresponding likelihood function and the classical confidence limit.

The probability density of our example has been constructed such that it includes H1 and H2 for the specific values of t3 equal to t1, t2. Thus the likelihood ratio f(t1)/f(t2) is identical to that of the previous example. Let us assume that the alternative theories H1 and H2 (which are both compatible with H3) were developed after H3. H1, which is by far more likely than H2, could have been excluded on the basis of the measurement and the classical confidence limits.

The preceding example shows that the concept of classical confidence limits for continuous parameters is not compatible with methods based on the likelihood function. We may construct a transition from the discrete case to the continuous one by adding more and more hypotheses, but a transition from likelihood based methods to CCL is impossible8.

The two classical concepts, confidence limit and Neyman-Pearson test, lack a common basis. The Neyman-Pearson Lemma would favor likelihood ratio intervals. These intervals, however, usually have no well defined coverage. Of all classical ordering schemes the likelihood ratio ordering comes nearest to the optimum.

8We could also use the δ-distribution and not distinguish between discrete and continuous functions.


Fig. 12: Predictions from two discrete hypotheses H1, H2 and measurement (top) and log-likelihood for the parametrization of the two hypotheses (bottom). The likelihood ratio strongly favors H1, which is excluded by the classical confidence limits.


3. Unified approaches

Feldman and Cousins have proposed a new approach to the computation of classical confidence bounds which avoids the occurrence of unphysical confidence regions, one of the most problematic features of the conventional classical confidence limits. In addition it unifies the two procedures “computation of confidence intervals” and “computation of one-sided confidence limits”. The unified treatment has already been adopted by several experiments and is recommended by the Particle Data Group [9]. However, as shown below, it has serious deficiencies.

3.1 Basic ideas of the unified approach

The unified approach has two basic ingredients:

1) It unifies the two procedures “computation of a two-sided interval” and “computation of an upper limit”. The scientist fixes the confidence level before he looks at the data, and then the data provide either an error interval or a lower or upper limit, depending on the possibility to obtain an error interval within the allowed physical region. The method thus avoids a violation of the coverage principle which is obvious in the commonly used procedure, where the selection of one- or two-sided bounds is based on the data and personal prejudice.

2) It uses the likelihood ratio ordering principle (see Section 2.2) which has the attractive property of being invariant against transformations of the sample space variables. In addition, unphysical intervals or limits are avoided. Here the trick is to require that the quantity θbest of relation (1) is inside the range allowed by the laws of physics. As discussed in Section 2, the likelihood ordering corresponds to MSU intervals and is an old concept. What is new is the application to cases where the parameter range is restricted by external bounds. The parameter θbest in Relation 1 is now a value of θ inside the allowed parameter interval which maximizes the likelihood.

In practice, the main impact of this method is on the computation of upper limits. Experiments have “improved” their upper limits by switching to the unified approach [34, 35].

The new approach has attractive properties; however, all those problems which are intrinsic to the philosophy of classical statistics remain unsolved and additional complications are introduced [36], as discussed below. Further pathological cases have been presented by Punzi [37] and Bouchet [38].

3.2 Difficulties with two-sided constraints

One of the advantages of the unified approach is the improved handling of physical bounds. This is shown in Figure 6 for a Gaussian with external bounds. The unified intervals avoid unphysical values of the parameter, but these problems persist when a parameter is bounded from both sides.

Example 15 We resume Example 6, Figure 8. A particle track is measured by a combination of a proportional wire chamber and a position detector with Gaussian resolution. Let us assume a measurement x = 0 of a parameter µ with a physical bound −1 < µ < 1 and a Gaussian resolution of σ = 1.1. The Bayesian r.m.s. error derived from the likelihood function is 0.54. Since there are two boundaries, the procedure applied in Example 4 no longer works. Requiring a 68.3% confidence level produces at the same time an upper and a lower limit. It is impossible to fulfill the coverage requirement, except if the complete range of x is taken (complete coverage).
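The quoted r.m.s. error of 0.54 can be reproduced by normalizing the Gaussian likelihood to the allowed region, e.g. (a minimal numerical sketch):

```python
import numpy as np
from scipy.stats import norm

# Posterior for mu with a uniform prior on the allowed region -1 < mu < 1,
# measurement x = 0 with Gaussian resolution sigma = 1.1 (Example 15).
mu = np.linspace(-1.0, 1.0, 10001)
dmu = mu[1] - mu[0]
post = norm.pdf(0.0, loc=mu, scale=1.1)
post /= post.sum() * dmu                         # normalize to the physical region
mean = np.sum(mu * post) * dmu
rms = np.sqrt(np.sum((mu - mean) ** 2 * post) * dmu)
print(rms)                                       # about 0.54
```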

A similar example is discussed by Kendall and Stuart [39]: “It may be true, but would be absurd to assert −1 ≤ µ ≤ +2 if we know already that 0 ≤ µ ≤ 1. Of course we could truncate our interval to accord with the prior information. In our example, we could assert 0 ≤ µ ≤ 1: the observation would have added nothing to our knowledge.”


Fig. 13: Probability distribution (top) and corresponding likelihood ratio (center) for the superposition of two Gaussians in the unified approach. Since regions are added to the probability interval using the likelihood ratio as an ordering scheme, disconnected intervals (shaded) are obtained. A Breit-Wigner pdf shows a similar behavior (bottom).

3.3 External constraint and distributions with tails

Difficulties occur also with one-sided physical bounds when the resolution function has tails. Then we produce disconnected confidence intervals.

Example 16 We consider the superposition of a narrow and a wide Gaussian. (It is quite common that distributions have non-Gaussian tails.)

f(x|\mu) = \frac{1}{\sqrt{2\pi}} \left\{ 0.9 \exp\left(-(x-\mu)^2/2\right) + \exp\left(-(x-\mu)^2/0.02\right) \right\} \qquad (10)

with the additional requirement of positive parameter values µ. The prescription for the construction of the probability intervals leads to disconnected interval regions and cannot produce confidence intervals. This is shown in Figure 13.
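The disconnected regions can be exhibited numerically. The following sketch (my construction; the tested parameter value µ = 0.5 and the grid are arbitrary choices) ranks the observations x by the likelihood ratio of Equation 10 with the physically allowed best estimate, and reports for which confidence contents the acceptance region splits into two pieces:

```python
import numpy as np

def f(x, mu):
    # Eq. 10: superposition of a wide (sigma = 1) and a narrow (sigma = 0.1) Gaussian
    return (0.9 * np.exp(-(x - mu)**2 / 2.0) + np.exp(-(x - mu)**2 / 0.02)) / np.sqrt(2 * np.pi)

x = np.linspace(-6.0, 8.0, 200001)
dx = x[1] - x[0]
mu = 0.5                                 # tested parameter value, respecting mu >= 0
mu_best = np.maximum(x, 0.0)             # best physically allowed parameter for each x
R = f(x, mu) / f(x, mu_best)             # likelihood ratio ordering variable

def n_pieces(mask):
    # number of disjoint intervals in the boolean acceptance mask
    return np.count_nonzero(np.diff(mask.astype(int)) == 1) + (1 if mask[0] else 0)

# For each threshold c, the acceptance region is {x : R >= c}; its probability
# content under f(x|mu) is the confidence level.
for c in np.linspace(0.95, 0.05, 19):
    mask = R >= c
    content = f(x[mask], mu).sum() * dx
    print(f"c={c:.2f}  CL={content:.3f}  pieces={n_pieces(mask)}")
```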



Fig. 14: Probability contours (schematic) for a two-dimensional Gaussian near a boundary in the unified approach.

The same difficulty arises for the Breit-Wigner distribution (see Figure 13, bottom). In fact, it is only a very special class of distributions which can be handled with the likelihood ratio selection of probability intervals near physical boundaries.

The problem is absent for pdfs with a concave logarithm (log-concave densities),

\frac{d^2 \ln f}{dx^2} < 0 .

Thus, near physical boundaries the unified approach in the present form is essentially restricted to Gaussian-like pdfs.

3.4 Artificial correlations of independent parameters

The likelihood ratio ordering is the only classical method which is invariant against parameter and observable transformations independent of the dimension of the associated spaces. Since the confidence intervals have the properties of probabilities, one has to be careful in the interpretation of error bounds near physical boundaries: artificial correlations between independent parameters are introduced.

Example 17 A two-dimensional Gaussian distributed measurement is located near a physical boundary. The two coordinates x, y are independent. Figure 14 shows schematically the probability contours for the conventional classical and the unified approach. For the latter the error in x shrinks due to the unphysical y region.

The problem indicated above is common to all methods with intervals defined by coverage or probability excluding unphysical regions. It is difficult to define limits which at the same time indicate the precision of a measurement and the likelihood or confidence that they contain the true value.

3.5 Upper Poisson limits

Table 2 contains 90% C.L. upper limits for Poisson distributed signals with background. For the case n = 0, b = 3 the unified approach avoids the unphysical limit of the conventional classical method but finds a limit which is more restrictive than that of a much more sensitive experiment with no background expected and twice the flux! Compared to the conventional approach the situation has improved - from a Bayesian point of view - but the basic problem is not solved and the inconsistencies discussed in Section 2.9 persist.

Figure 15 compares the coverage and the interval lengths of the unified method with the Bayesian one with uniform prior (to be discussed below). The nominal coverage is 90%. Unavoidably, for µ = 0, b = 0 there is maximum over-coverage, and in the range 0 ≤ µ ≤ 2.2 the average coverage is 96%.


Fig. 15: Coverage and confidence interval width for the unified approach and the Bayesian case. The Bayesian curves are computed according to the unified prescription.


Fig. 16: Likelihood function for zero observed events and 90% confidence upper limits with and without background expectation. The labels refer to [15] (f), Bayesian (b), [40] (g) and [41] (r).

3.6 Restriction due to unification

Let us assume that in a search for a Susy particle a positive result is found which, however, is compatible with background within two standard deviations. Certainly, one would prefer to publish an upper limit rather than a measurement, contrary to the prescription of the unified method. The authors defend their method arguing that a measurement with a given error interval can always be interpreted as a limit - which is true.

3.7 Alternative unified methods

Recently, modified versions of the unified treatment of the Poisson case have been published [40, 41, 37, 42, 43].

The results of Giunti are nearer to the uniform Bayesian ones, but wider than those of Feldman and Cousins.

Roe and Woodroofe [41] re-invented the method proposed by myself ten years earlier [24] and added the unification principle. The authors did not realize that they do not fulfil the coverage requirement. In their second paper [42] they present a Bayesian method which is claimed to have good coverage properties, but this is also true for the standard Bayesian approach.

Punzi [37] modifies the definition of confidence intervals such that they fulfil the Likelihood Principle. This approach is in some sense attractive, but the intervals become quite wide and it is difficult to interpret them as standard error bounds.

Ciampolillo [44] uses as estimator the statistic which maximizes the likelihood inside the physically allowed region. In most cases it coincides with the parameter or parameter set which maximizes the likelihood function. Usually, the maximum likelihood estimator is not a sufficient statistic and thus will not provide optimum precision (see Example 2). Similarly, Mandelkern and Schulz [43] use


an estimator confined to the physical domain. Both approaches lead to intervals which are independent of the location of the measurement within the unphysical region. I find it quite unsatisfactory that two Gaussian measurements x1 = −0.1 and x2 = −2 with the bound µ > 0 and width σ = 1 yield the same confidence interval. Certainly, our betting odds would be different in the two cases.

In summary, of all proposed unified methods only the Feldman/Cousins approach has a simple statistical foundation. Roe/Woodroofe's method is not correct from a classical frequentist point of view, Punzi's method is theoretically interesting but not suited for error definitions, and the other prescriptions represent unsatisfactory ad-hoc solutions.

Figures 16 and 17 compare upper limits, coverage and interval lengths for some approaches for b = 0 and b = 3. (Part of the data has been taken from Ref. [41].) The likelihood functions for n = 0, b = 0 and b = 3 are of course identical up to an irrelevant constant factor. Apparently, Feldman and Cousins avoid under-coverage, while the other approaches try to come closer to the nominal confidence level but in some cases are below it. The interval widths are similar for all methods.

4. Likelihood ratio limits and Bayesian confidence intervals

4.1 Inverse probability

Bayesians treat parameters as random variables. The combined probability density f(x, θ) of the measured quantity x and the parameter θ can be conditioned on the outcome of one of the two variates using Bayes' theorem:

f(x,\theta) = f_x(x|\theta)\,\pi_\theta(\theta) = f_\theta(\theta|x)\,\pi_x(x)

f_\theta(\theta|x) = \frac{f_x(x|\theta)\,\pi_\theta(\theta)}{\pi_x(x)} \qquad (11)

where the functions π are the marginal densities. The density πθ(θ) in this context is usually called the prior density of the parameter and gives the probability density for θ prior to the measurement. For a given measurement x the conditional density fx can be identified with the likelihood function. The marginal distribution πx is just a multiplicative factor independent of θ and is eliminated by the normalization requirement:

f_\theta(\theta|x) \propto L(x,\theta)\,\pi_\theta(\theta)

f_\theta(\theta|x) = \frac{L(x,\theta)\,\pi_\theta(\theta)}{\int_{-\infty}^{\infty} L(x,\theta)\,\pi_\theta(\theta)\,d\theta} \qquad (12)

The prior density πθ has to guarantee that the normalization integral converges.

In the literature fθ is often called inverse probability to emphasize the change of role between x and θ.

The relation (12) contains one parameter and one measurement variable x, but x can also be interpreted as a vector, a set of individual measurements, independent and identically distributed (i.i.d.) following the same distribution f0(x|θ), or any kind of statistic, and θ may be a set of parameters.

4.2 Interval definition

4.21 Bayesian intervals

Having the probability density of the parameter in hand, it is easy to compute mean values, r.m.s. errors or probabilities for upper or lower bounds (see Figure 18).


Fig. 17: Top: Coverage as a function of the Poisson rate for expected background b = 3. The nominal coverage is 90%. The labels refer to [15] (f), Bayesian (b), [40] (g) and [41] (r). Bottom: Interval length as a function of the observed number of events.


Fig. 18: Likelihood ratio limits (left) and Bayesian limits (right).

There is again some freedom in defining the Bayesian limits. One possibility would be to quote the mean and the variance of the parameter, another possibility is to compute intervals of a given probability9. The first possibility emphasizes error propagation, the second is better adapted to hypothesis testing. In the latter case, the interval boundaries correspond to equal probability density of the parameter10 (equal likelihood for a uniform prior). Again a lack of standardization is apparent for the choice of the interval.

4.22 Likelihood ratio intervals

Since the likelihood function itself represents the information of the data relative to the parameter and is independent of the problematic choice of a prior, it makes sense to publish this function or to parametrize it. Usually the log-likelihood, in this context called support function [45] or support-by-data, an expression introduced by Hacking [46], is more convenient than the likelihood function itself.

For continuous parameters the usual way is to compress the information into likelihood limits. The condition

L_{\max}/L(\theta_{\mathrm{low}}) = L_{\max}/L(\theta_{\mathrm{high}}) = e^{\Delta} \qquad (13)

\ln L(\theta_{\mathrm{low}}) = \ln L_{\max} - \Delta = \ln L(\theta_{\mathrm{high}}) \qquad (14)

where Lmax is the maximum of the function in the allowed parameter range θmin < θ < θmax, fixes a likelihood ratio interval θlow < θ < θhigh (see Fig. 18). The values ∆ = 0.5 and ∆ = 2 define one and two standard deviation likelihood limits (support intervals). In the limit where f(x) is a perfect Gaussian (the width being independent of the parameter) these bounds correspond to C.C.L. of 0.683 and 0.954 confidence.
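Equation 14 is easily solved numerically. A minimal sketch for a Poisson likelihood with n = 4 observed events (the bracketing values passed to the root finder are ad hoc):

```python
import numpy as np
from scipy.optimize import brentq

def loglike(theta, n):
    # Poisson log-likelihood up to a constant: n ln(theta) - theta
    return n * np.log(theta) - theta

def likelihood_ratio_interval(n, delta=0.5):
    """Solve ln L(theta) = ln L_max - delta on both sides of the maximum (Eq. 14)."""
    theta_hat = n                                 # maximum of the Poisson likelihood
    target = loglike(theta_hat, n) - delta
    low = brentq(lambda t: loglike(t, n) - target, 1e-9, theta_hat)
    high = brentq(lambda t: loglike(t, n) - target, theta_hat, 10 * theta_hat + 10)
    return low, high

print(likelihood_ratio_interval(4))   # roughly (2.32, 6.36)
```

The resulting interval, roughly (2.32, 6.36), corresponds to the likelihood interval 4 (−1.68, +2.36) of Figure 23.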

The one-dimensional likelihood limits transform into boundaries in several dimensions. For example, for two parameters we get the confidence contour in the θ1 − θ2 plane:

\ln L(x; \theta_1, \theta_2) = \ln L_{\max} - \Delta

The value of ∆ is again equal to 0.5 (2) for the one (two) standard deviation case. The contour is invariant under transformations of the parameter space.

9In the literature the expression degree of belief is used instead of probability to emphasize the dependence on a prior density. For simplicity we will stick to the expression confidence level.

10Central intervals are wider and restricted to one dimension.


Upper Poisson limits are usually computed from Bayesian probability intervals. D'Agostini [47] emphasizes the likelihood ratio as a sensible measure of upper limits, for example in Higgs searches. In this way the dependence of the limit on the prior of the Bayesian methods is avoided.

4.3 Problems with likelihood ratio intervals

Many of the situations where the classical methods find difficulties are also problematic for likelihood ratio intervals.

• The elimination of nuisance parameters is as problematic as in the frequentist methods.

• Digital measurements have constant likelihood functions and cannot be handled.

• The error limits for functions with long tails (like the Breit-Wigner pdf) are misleading.

• When the likelihood function has its mathematical maximum outside the physical region, the resulting one-sided likelihood ratio interval for ∆ = 0.5 may be unreasonably short.

• The same situation occurs when the maximum is at the border of the allowed observable space.

A frequently discussed example is:

Example 18 The width of a uniform distribution

f(x|\theta) = \frac{1}{\theta}, \quad 0 < x < \theta,

is estimated from n observations xi. The likelihood function is

L = \theta^{-n}, \quad \theta \geq x_{\max},

with a 1/√e likelihood ratio interval xmax < θ < xmax e^{1/(2n)}, which for n = 10 is only about half of the classical and the Bayesian widths. (This example is relevant for the determination of the time zero t0 from a sample of measured drift times.)
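A small numerical check (assuming a flat prior in θ and taking a one-sided 68.3% Bayesian interval above xmax for the comparison; the classical interval is not reproduced here):

```python
import numpy as np

n, xmax = 10, 1.0
# 1/sqrt(e) likelihood ratio interval (Delta = 0.5): xmax < theta < xmax * exp(1/(2n))
lr_width = xmax * (np.exp(1.0 / (2 * n)) - 1.0)

# Bayesian 68.3% interval with a flat prior: posterior ~ theta^(-n) for theta >= xmax
# (normalizable for n >= 2); the tail condition can be solved analytically.
bayes_up = xmax * (1.0 - 0.683) ** (-1.0 / (n - 1))
bayes_width = bayes_up - xmax

print(lr_width, bayes_width)   # about 0.051 versus 0.136
```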

We will come back to likelihood ratio intervals in Sections 6 and 7. In the following we concentrate on the Bayesian method with uniform prior but free choice of parameter.

4.4 The prior parameter density and the parameter choice

The Bayesian method would be ideal if we knew the prior of the parameter. Clearly the problem lies in the prior parameter density, and some statisticians completely reject the whole concept of a prior density. The following example demonstrates that at least in some cases the Bayesian way is plausible.

Example 19 An unstable particle with known mean life τ has decayed at time θ. We are interested in θ. A measurement with Gaussian resolution s, η(t|θ, s), finds the value t. The prior density π(θ) for θ is proportional to exp(−θ/τ). Using (12) we get

f_\theta(\theta) = \frac{\eta(t|\theta,s)\, e^{-\theta/\tau}}{\int_0^\infty \eta(t|\theta,s)\, e^{-\theta/\tau}\, d\theta}
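Numerically the posterior is obtained by evaluating the numerator on a grid and normalizing; a minimal sketch with illustrative values of τ, s and t:

```python
import numpy as np
from scipy.stats import norm

tau, s, t = 2.0, 0.5, 1.2        # illustrative mean life, Gaussian resolution and measured time
theta = np.linspace(0.0, 30.0, 30001)
dth = theta[1] - theta[0]
post = norm.pdf(t, loc=theta, scale=s) * np.exp(-theta / tau)   # eta(t|theta,s) * exp(-theta/tau)
post /= post.sum() * dth                                        # normalization as in the formula above
print((theta * post).sum() * dth)                               # posterior mean decay time theta
```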

However, there is no obvious way to fix the prior density in the following example:

Example 20 We find a significant peak in a mass spectrum and associate it to a so far unknown particle. To compute the density of the mass m we need a pre-experimental mass density π(m).

There are all kinds of intermediate cases:


Example 21 When we reconstruct particle tracks from wire chamber data, we always assume a flat track density (uniform prior) between two adjacent wires. Is this assumption, which is deduced from experience, justified also for very rare event types? Is it valid independent of the wire spacing?

In the absence of quantitative information on the prior parameter density, there is no rigorous way to fix it. Some Bayesians use scaling laws to select a specific prior [5] or propose to express our complete ignorance about a parameter by the choice of a uniform prior distribution. An example for the application of scaling is presented in Appendix B. Some scientists invoke the Principle of Maximum Entropy to fix the prior density [48]. I cannot find all these arguments convincing, but nevertheless I consider it very sensible to choose a flat prior, as is common practice. This attitude is shared by most physicists [49]11. I do not know of any important experimental result in particle physics analyzed with a non-uniform prior. In Example 21 it is quite legitimate to assume a uniform track density at the scale of the wire spacing. Similarly, for a fit of the Z0 mass there are no reasons to prefer a certain mass within the small range allowed by the measurement. This fact translates also into the quasi independence of the result of the parameter selection. Assuming a flat prior for the mass squared would not noticeably alter the result.

The constant prior density is what physicists use in practice. The probability density is then obtained by normalizing the likelihood function. The obvious objection to this recipe is that it is not invariant against parameter transformations. A flat prior of θ1 is incompatible with a flat prior of θ2 unless the relation between the two parameters is linear, since we have to fulfill

\pi_1(\theta_1)\, d\theta_1 = \pi_2(\theta_2)\, d\theta_2

The formal contradiction is often unimportant in practice unless the intervals are so large that a linear relation between θ1 and θ2 is a bad approximation.

One should also note that fixing the prior density to be uniform does not really restrict the Bayesian choice: there is the additional freedom to select the parameter. A density g1 for parameter θ1 with arbitrary prior π1 can be transformed into a density g2(θ2) ∝ L2(θ2)dθ2 = L1(θ1)π1(θ1)dθ1 for the parameter θ2(θ1) with a constant prior distribution. Thus, for example, a physicist who would like to choose a prior density πτ ∝ 1/τ² for the mean life τ is advised to use instead the decay parameter γ = 1/τ with a constant prior density. In the following we will stick to the convention of a uniform prior but allow for a free choice of the parameter.

An example where the parameter choice matters is the following:

Example 22 The decay time of a particle is measured. The result t is used to estimate its mean life τ. Using a flat prior we find the posterior parameter density

f(\tau) \sim \frac{1}{\tau}\, e^{-t/\tau}

which is not normalizable.

Thus a flat prior density is not always a sensible choice for an arbitrarily selected parameter. In the preceding example the reason is clear: a flat prior density would predict the same probability for a mean life of an unknown particle in the two intervals 0 < τ < 1 ns and 1 s < τ < 1 s + 1 ns, which obviously is a fairly exotic choice. When we choose instead of the mean life the decay constant γ with a flat prior, we obtain

f(\gamma) = \frac{\gamma\, e^{-\gamma t}}{\int_0^\infty \gamma\, e^{-\gamma t}\, d\gamma} = t^2\, \gamma\, e^{-\gamma t}

The second choice is also supported by the more Gaussian shape of the likelihood function of γ, as illustrated in Figure 19 for a measurement from two decays. The figure also gives the distributions of the likelihood estimators to indicate how frequent they are.
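The effect of the parameter choice can be checked with a short numerical sketch (the two decay times are illustrative; posterior medians are used since the posterior mean of τ does not exist for two events):

```python
import numpy as np

t_obs = np.array([0.8, 1.2])             # two observed decay times (illustrative values)
T = t_obs.sum()

def median_of(x, pdf):
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]
    return x[np.searchsorted(cdf, 0.5)]

# Flat prior in the mean life tau: posterior ~ tau^(-2) exp(-T/tau)
tau = np.linspace(1e-3, 200.0, 400001)
med_tau = median_of(tau, tau**-2 * np.exp(-T / tau))

# Flat prior in the decay constant gamma = 1/tau: posterior ~ gamma^2 exp(-gamma*T)
gamma = np.linspace(1e-4, 20.0, 400001)
med_gamma = median_of(gamma, gamma**2 * np.exp(-gamma * T))

print(med_tau, 1.0 / med_gamma)          # about 2.9 versus 0.75: the parameter choice matters
```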

11A different point of view is expressed in Refs. [7, 50].


Fig. 19: Distribution of the maximum likelihood estimates for the mean life (a) and the decay constant (b) for two decays. The corresponding likelihood functions are shown in (c) and (d) for estimators equal to one.


Fig. 20: Likelihood function renormalized to the physically allowed region. The construction of the upper limit is indicated.

One possible criterion for a sensible parameter choice is the shape of the likelihood function. One could try to find a parameter with an approximately Gaussian shaped likelihood function and use a constant prior.

We resume the digital position measurement by a proportional wire chamber (Example 21). The likelihood is constant inside the interval allowed by the measuring device. It does not make sense to transform the parameter to obtain a Gaussian shape. Here it is common practice to use a constant prior probability and to associate to the central value the r.m.s. error interval ±2 mm/√12 of a flat probability distribution.

4.5 External constraints

Bayesians account for external constraints by using vanishing prior functions, which is equivalent to normalizing the parameter density to the relevant parameter interval, as is illustrated in Figure 20. The same procedure is applied to the Gaussian of Example 4 in Figure 6.

4.6 Several parameters and nuisance parameters

The extension of the Bayesian scheme to several parameters is straightforward. Nuisance parameters can be integrated out. To do so, in principle, we have to choose the full prior density of all parameters:

f(\theta) \sim \int L(\theta, \phi, x)\, \pi(\theta, \phi)\, d\phi

Even in case the likelihood function of the parameters factorizes, we cannot separate the parameters without making the additional assumption that also the prior density can be written as a product of independent priors:

f(\theta) \sim L(\theta, x) \int L(\phi, x)\, \pi(\theta, \phi)\, d\phi \sim L(\theta, x)\, \pi_\theta(\theta)


Fig. 21: 90% C.L. upper limits in the Poisson case for the classical and the Bayesian approach. The expected background is 4, the parameter n is the number of observed events. The unphysical limits of the classical approach for n < 2 are suppressed. The integrated probability corresponds to the true value µ = 0.

with

\pi_\theta(\theta) = \int L(\phi, x)\, \pi(\theta, \phi)\, d\phi

Since we have decided to accept only uniform prior densities the difficulty is absent.

4.7 Upper and lower limits

Bayesians compute the upper limit µ from the tail of the inverse probability density, which for a constant prior is the normalized likelihood function:

\alpha = \int_{\mu}^{\infty} L(\mu')\, d\mu' \Big/ \int_{-\infty}^{\infty} L(\mu')\, d\mu' \qquad (15)

For the Poisson process with Poisson distributed background we get the limit µ from

\alpha = \int_{\mu}^{\infty} P(n|\mu'+b)\, d\mu' \Big/ \int_{0}^{\infty} P(n|\mu'+b)\, d\mu' = \frac{\sum_{i=0}^{n} P(i|\mu+b)}{\sum_{i=0}^{n} P(i|b)}

which is identical to the frequentist formula with normalization to possible background [24] and, for zero background expectation, to the classical limit. Figure 21 compares the classical 90% confidence limits to the Bayesian ones for four expected background events. It indicates also the frequency of the occurrence of a certain number n of observed events for µ = 0.

It is easy to generalize the Bayesian procedure to an arbitrarily distributed background with the probability Q(m|b) to obtain m background events for a given parameter b:


L(\mu) = \sum_{i=0}^{n} P(i|\mu)\, Q(n-i|b)

This likelihood function has to be inserted in Equation 15. The background parameter b may follow another distribution g(b):

L(\mu) = \sum_{i=0}^{n} P(i|\mu) \int_{-\infty}^{\infty} Q(n-i|b)\, g(b)\, db

The latter expression is useful if the background mean has an uncertainty, for example when it is deduced from the sidebands of an experimental histogram.
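A sketch of how the two expressions above can be evaluated numerically (a Poisson Q and a Gaussian g(b) truncated at b ≥ 0 are assumed; all numbers are illustrative):

```python
import numpy as np
from scipy.stats import poisson, norm

# Illustrative numbers: n observed events, background estimate b0 with uncertainty sb.
n, b0, sb, cl = 5, 3.0, 1.0, 0.9

# g(b): Gaussian for the background expectation, truncated at b >= 0
b = np.linspace(0.0, b0 + 6.0 * sb, 400)
db = b[1] - b[0]
g = norm.pdf(b, b0, sb)
g /= g.sum() * db

# q[i] = integral Q(n-i | b) g(b) db, with a Poisson Q
i = np.arange(n + 1)
q = np.array([np.sum(poisson.pmf(n - k, b) * g) * db for k in i])

def L(mu):
    # L(mu) = sum_i P(i|mu) * q[i]
    return np.sum(poisson.pmf(i, mu) * q)

# Upper limit from the tail of the normalized likelihood (flat prior), Eq. 15
mu = np.linspace(0.0, 40.0, 4001)
Lmu = np.array([L(m) for m in mu])
cdf = np.cumsum(Lmu)
cdf /= cdf[-1]
print(mu[np.searchsorted(cdf, cl)])      # 90% C.L. upper limit on the signal mean
```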

4.8 Discrete parameters

The discrete case is automatically included in the continuous one if we allow for δ-functions. The probability for hypothesis i is given by

P_i = \frac{L_i\, \pi_i}{\sum_i L_i\, \pi_i}

where the likelihoods are multiplied by the prior probabilities. The confidence level for an interval translates into the confidence attributed to a hypothesis. Usually, one would set the priors all equal and obtain the simple likelihood ratio used in Examples 10 and 13. Another example (quoted in Ref. [7]) follows:

Example 23 The Mark II collaboration reported a measurement of the number of neutrinos Nν = 2.8 ± 0.6, which is slightly lower than the known number of three neutrino generations. They take advantage of the negative fluctuation of their result and derive from the tail distribution the 95% classical confidence limit Nν < 3.9. Bayesians would quote the ratio of the likelihoods for the hypotheses of three and four neutrino species, L(3|2.8 ± 0.6)/L(4|2.8 ± 0.6) = 7.0 (assuming Gaussian distributed errors), which in principle could be multiplied by prior probabilities. The difference between these results is considerable: the classical tail limit indicates a factor 20 in favor of the three neutrino hypothesis.
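The likelihood ratio of 7.0 follows directly from the two Gaussian densities (a one-line check):

```python
from scipy.stats import norm

x, s = 2.8, 0.6                 # measured number of neutrino species and its Gaussian error
lr = norm.pdf(x, 3.0, s) / norm.pdf(x, 4.0, s)
print(lr)                       # about 7, the ratio of the 3- to the 4-neutrino hypothesis
```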

5. The Likelihood Principle and information

In this article I hope to convince the reader that all parameter and interval inference should be based solely on the likelihood function. To this end we start with a discussion of the Likelihood Principle which plays a key role in modern statistics.

5.1 The Likelihood Principle

Many statisticians (Fisher, Barnard, Birnbaum, Hacking, Basu and others) have contributed to the development of the Likelihood Principle (LP). References, derivations and extensive discussions are found in a summary of Basu's articles [51] and the book by Berger and Wolpert [52].

We assume that our measurements x follow a given probability density f(x; θ) with unknown parameter θ.

The Likelihood Principle (weak form) states:

All information contained in a measurement x with respect to the parameter θ is summarized by the likelihood function L(θ; x).


A multiplicative constant c(x) in L (corresponding to an additive constant not depending on θ in the log-likelihood) is irrelevant. The LP in this form is almost evident, since the likelihood function is a sufficient statistic. Methods not respecting the LP - inference not based on L and f - do not use the full information contained in the data sample12. In its stronger, more ambitious form, the LP asserts that inference on a parameter must be the same in two experiments with proportional likelihood functions and identical parameter space, even when the pdfs are different (see Stein's example below).

The LP relies on the exact validity of the probability density which is used, a condition usually fulfilled in the exact sciences. It is less useful in social, economic and medical applications where only approximate theoretical descriptions are available. Methods based on the likelihood function are often very sensitive to unknown biases, background and losses.

The LP can be derived [52, 51] from the Sufficiency Principle:

If T′ = T(x′1, x′2, ..., x′n) is a sufficient statistic in an experiment and if the same value T′ = T′′ = T(x′′1, x′′2, ..., x′′n) is obtained in another experiment following f(X1, X2, ..., Xn|θ), the evidence for θ is the same in both experiments.

and the Weak Conditioning Principle (in sloppy notation):

Performing the mixed experiment, where one of the experiments E1 and E2 is selected at random, is equivalent to performing directly the selected experiment.

However, there exist reservations against the proof of the LP and some prominent statisticians do not believe in it. Barnard and Birnbaum, originally promoters of the LP, ultimately came close to rejecting it or to strongly restricting its applicability. One of the reasons for rejecting the LP is its incompatibility with classical concepts13. We will come back to some of the most disturbing examples. Other criticism is related to questioning the Sufficiency Principle and to unacceptable consequences of the LP in specific cases [54, 55, 56, 57]. Most statisticians, frequentists or Bayesians, usually accept that the likelihood function summarizes the experimental information, but some statisticians like Barnard feel that parameter inference should include system information like the pdf or, if relevant, the applied stopping rule.

The acceptance of the LP is almost unavoidable if the likelihood ratio for discrete hypotheses is accepted to contain the full information of a measurement relevant for comparing different hypotheses. It is also unavoidable if the concept of a prior probability density for the parameter is accepted. Then Bayes' theorem can be used to deduce the probability density g(θ|x) of the parameter θ conditioned on the outcome x of the measurement (see Equation 11):

g(\theta|x) \sim L(\theta|x)\, \pi(\theta) \qquad (16)

Assuming that there is independent information on g(θ|x) other than that provided by the likelihood function leads to a contradiction: then the additional information could be used to partially infer, by using Equation 16, π(θ), which, however, by definition is independent of the data.

Thus the only way out is to deny the existence of the prior density. Indeed, at first glance it is not obvious what meaning we should attribute, for example, to a prior density of the muon neutrino mass. But for the proof of the LP it is not necessary to know this function; it is sufficient to assume that it exists, that God has chosen this mass at random according to an arbitrary distribution π(θ) which we do not know. Such an assumption would certainly not affect our scientific analysis techniques.

There is one point of caution: the integral ∫ L(θ|x)π(θ) dx dθ has to be finite and the prior parameter density has to be normalizable. Thus in some cases the assumption of a uniform prior extending over an infinite range can lead to problematic results. In practice, this is not a problem. For example, restricting the Higgs mass to positive values below the Planck mass would be acceptable to every educated

12Information as used here is not to be mistaken for the technical meaning given to this word by Fisher [53].
13James [33] denies the LP because classical confidence limits cannot be deduced from the likelihood function. Classical limits depend on irrelevant information (see Examples 11, 12) not retained in the likelihood function. The problem is in the classical limits, not in the LP.


scientist.

Inferring the mean µ and the width σ of a Gaussian from a single measurement x, we realize that the likelihood function is infinite for µ = x, σ = 0, which seems to imply this solution with certainty. This was one of the examples that had cast doubt on Birnbaum's belief in the LP [56]. Close inspection, however, reveals that the normalization condition is violated.

Of course, by the extraction of parameters from the likelihood function, as done for the likelihood ratio limits and the Bayesian limits, we unavoidably lose information. In the classical method, however, the LP is explicitly violated. This is obvious for the Poisson upper limits with background expected and no event observed, where the classical limit depends on the background expectation while the likelihood function is independent of it.

The violation of the LP by classical methods does not imply that they are wrong. They do not always exhaust the available information and as a consequence different limits can be obtained for the same observation.

The likelihood function ignores the full probability density for a given value of the parameter. Only the value of the probability density for the actual observation enters. If the probability density of the remaining sample space contains information relevant for the inference of the unknown parameter, the LP is wrong, as is argued by many classical statisticians. In order to investigate this question, we come back to our earthquake example (Example 13).

Let us modify the sample space while keeping the likelihoods constant. We change both Gaussians into square distributions as indicated in Figure 12 for hypothesis H1. (For H2 the difference is hardly visible.) The modification changes the relative tail probabilities but it does not alter the relative merits of the two hypotheses, as can be checked by a Monte Carlo simulation. This conclusion also follows from the fact that the Neyman-Pearson test, which only uses the likelihood ratio, is uniformly most powerful. Thus in the simple two-parameter case the likelihood ratio contains all relevant information.

When we switch to the situation with a continuous parameter like in Example 13, the likelihood function contains the full information for a choice between two arbitrarily selected parameter values (discrete hypotheses). It is then rather hard to imagine that the likelihood function is insufficient for other parameter inferences.

5.2 The Stopping Rule Paradox

The likelihood function is independent of the sequence of i.i.d. observations. Thus, since according to the LP parameter inference should be based solely on the observed sample, it should be independent of any sequential sampling plan [55]. For example, stopping data taking in an experiment after a "golden event" has been observed should not introduce a bias. This contradicts our naive intuition and corresponds to the so-called Stopping Rule Paradox. Similarly, some people believe that one can bias the luck in a casino by playing until the balance is positive and stopping then.

The Stopping Rule Principle is of considerable importance for measurement techniques14.

Frequentists [33] claim that stopping rules introduce biases and consequently deny the LP. One has to admit that the Stopping Rule Principle is hard to digest.

It is easy to see that there is no bias in the example quoted above. Let us imagine an infinitely long measurement where many "golden events" are recorded. The experiment now is cut into many sub-experiments, each stopping after a "golden event", and the sub-experiments are analyzed separately. Since nothing has changed, the sub-experiments cannot be biased, and since the sum of the log-likelihoods of the subsamples is equal to the log-likelihood of the full sample, it is also guaranteed that combining the

14Edwards, Lindman, Savage [58]: "The irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels ... Many experimenters would like to feel free to collect data until they have either conclusively proved their point, conclusively disproved it, or run out of time, money, or patience."


Fig. 22: Rate measurement. The time sequence is stopped when 3 events are observed within 1 second (indicated by a bar) and a new measurement is started. Since the combination of many conditioned measurements is equivalent to a long unconditioned measurement, this stopping rule cannot introduce a bias.

Fig. 23: Likelihood function for the decay rate: 4 events recorded in 1 second, or 1 second needed to observe 4 events. The classical error limits depend on the stopping condition, i.e. fixed time interval or fixed event number: θ = 4 (−1.92, +1.92) for the classical fixed-time case, θ = 4 (−1.92, +3.16) for the classical fixed-event-number case, and θ = 4 (−1.68, +2.36) for the likelihood interval.

results of a large number of experiments performed with an arbitrary stopping rule asymptotically gives the correct result. Of course, the condition that the subsamples are unbiased is necessary but not sufficient for the validity of the Stopping Rule Principle. The effect of the stopping rule "stop when 3 events are found within 1 second" is presented in Figure 22.

In the following example we explicitly study a specific stopping rule.

Example 24 In a rate measurement 4 events have been observed within the time interval t. Does the uncertainty on the rate θ depend on the two different experimental conditions: a) the observation time had been fixed; b) the measurement had been stopped after 4 events? In case a) the uncertainty is due to the event number fluctuation and in b) to the time variation. The LP denies any difference, only the data actually seen matter, while for a classical treatment also the data that could have been observed are important. The likelihood functions La, Lb derived from the Poisson distribution and the decay time distribution, respectively, are proportional to each other and depend only on the observation time and


the number of events seen (see Figure 23):

L_a(\theta, n) = P(n, \theta t) = \frac{e^{-\theta t} (\theta t)^4}{4!} \propto \theta^4 e^{-\theta t}

L_b(\theta, t) = \frac{t^3 \theta^4}{3!}\, e^{-\theta t} \propto \theta^4 e^{-\theta t}

This example is well suited to explain the essence of the LP. Followers of the LP would bet the same amount of money that θ is located in a certain arbitrarily selected interval in both experiments.

The stopping rule may be tested by a Monte Carlo simulation. In the previous example we could select two arbitrary rate values θ1 and θ2 and simulate a large number of the two kinds of experiments. For the experiments of type a) only results with 4 events found are retained and the ratio of successes Ra for the two values of θ is formed. For the experiments of type b) a small time slice around t is fixed and Rb is formed from the number of cases where four events are recorded. The two ratios should be equal if the information relative to the parameter is the same in both variants. Actually, a simulation is not necessary. The quantities Ra and Rb are the likelihood ratios, which agree because the likelihood functions are the same up to a constant factor. The stopping rule does not favor one or the other θ value. There is no bias.
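The Monte Carlo test described above can be written down in a few lines (a minimal sketch; θ1, θ2 and the width of the time slice are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
theta1, theta2, n_target, t_obs, N = 3.0, 5.0, 4, 1.0, 200000

def success_fraction_fixed_time(theta):
    # type a): fixed observation time, count experiments that yield exactly n_target events
    counts = rng.poisson(theta * t_obs, N)
    return np.mean(counts == n_target)

def success_fraction_fixed_count(theta, eps=0.02):
    # type b): stop after n_target events, count experiments whose total time falls near t_obs
    times = rng.gamma(n_target, 1.0 / theta, N)
    return np.mean(np.abs(times - t_obs) < eps)

Ra = success_fraction_fixed_time(theta1) / success_fraction_fixed_time(theta2)
Rb = success_fraction_fixed_count(theta1) / success_fraction_fixed_count(theta2)
lr = (theta1 / theta2) ** n_target * np.exp(-(theta1 - theta2) * t_obs)
print(Ra, Rb, lr)   # all three agree within statistical fluctuations
```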

A classical treatment produces inconsistent results for the two equivalent experiments. The error limits depend on the way we are looking at the data (Figure 23).

The intuitive explanation for the independence of inference of a sequential stopping rule also indicates its limitations. The stopping rule has to be simple in the sense that the different pieces obtained by cutting an infinitely long chain of measurements cannot be classified differently. However, experiments stopped when the result is significant, when for a long time no event has been recorded, or when money has run out should be analyzed ignoring the reason for stopping data taking.

The following example cited in the statistical literature is more subtle. Basu gives a detailed discussion [59].

Example 25 We sample measurements xi following a Gaussian with unknown mean θ and width equal to one. The average x̄ is formed after each observation. The experiment is stopped after observation n when one of the following conditions is fulfilled:

|\bar{x}| > \frac{2}{\sqrt{n}}, \quad \text{or}\quad n = 1000 \qquad (17)

The idea behind this construction is to discriminate the value θ = 0, which will be off by 2 standard deviations in all cases where the first condition is realized. (One may want to stop clinical trials when a 2 standard deviation significance is reached for the effectiveness of a certain drug.) The second condition is introduced to avoid extremely large samples for θ = 0 (non-performable experiments). Let us assume that an experiment produces an event of size n = 100 and |x̄| = 0.21 > 2/√100. When we ignore the stopping condition, a frequentist would exclude the parameter value θ = 0 with > 95% confidence (tail probability). Knowing the stopping rule, he would realize that the first of the two conditions (17) has been fulfilled and probably be willing to accept θ = 0 as a possible value. Wouldn't we base the inference not only on |x̄| and n but include the stopping condition in our considerations? Basu [51] and also Berger and Wolpert [52] deny this and argue that our intuition is misled. In fact, the likelihood at θ = 0 is not really reduced by the stopping rule; it is the integrated tail which is small, because the likelihood function becomes narrower with increasing n.

The last example certainly is not easy to digest, and it is less the problem to understand that θ = 0 is not excluded but rather to accept that some value different from zero has a very large value of the


likelihood even when θ = 0 is correct. Again, a careful analysis reveals that also this situation conforms to the LP. However, it has to be admitted that likelihood ratio limits would not represent the measurement.

The stopping rule changes the experiment and some rules are not very helpful in providing precise results. But this is not in contradiction to the fact that, once the data are at hand, knowing the stopping rule does not help in getting a more precise inference of the parameter of interest.

5.3 Stein’s example

In 1960, L. J. Savage emphasized [60]: "If the LP were untenable, clear-cut counter-examples by now would have come forward. But such examples seem, rather, to illuminate, strengthen, and confirm the principle." One year later C. Stein [61] came up with his famous example, which we will discuss in the following section. Since then, other more or less exotic counter-examples have been invented to disprove the LP. The corresponding distributions exhibit strange infinities and are far from what we can expect in physics. Nevertheless, they are very instructive.

Stein presented the probability density

f(x, \theta) = \frac{a}{x} \exp\left\{-100\, \frac{(\theta - x)^2}{2 x^2}\right\}, \quad \text{for } 0 \leq x \leq b\theta,

where for simplicity we have fixed the slope parameter.

It resembles a Gaussian with width proportional to the mean value, as shown in Figures 24a and 24b, but it has a long tail towards large values of x. The range of x is restricted by the cutoff parameter b to obtain a normalizable density. Since the 1/x term almost ensures convergence, the upper limit of x may be rather large for not too large a. The two constants a, b are fixed by the normalization condition and the requirement that x exceeds θ by at least a factor of ten in 99% of the cases:

\int_0^{b\theta} f(x, \theta)\, dx = 1, \qquad \int_{10\theta}^{b\theta} f(x, \theta)\, dx = 0.99 \qquad (18)

The cutoff parameter is huge, bθ > 10^(10^22), and a ≈ 0.04. Figures 24a, b are plots of Stein's function with the specific choice θ = 5. Only about 1% of the x values lie inside the bell shaped part (x < 50) of the probability density, the remaining part is located in the long right-hand tail.

5.31 The paradox and its solution

For a single observation x the likelihood function for the Stein pdf,

L(\theta) \sim \exp\left\{-100\, \frac{(\theta - x)^2}{2 x^2}\right\}, \quad \text{for } x/b < \theta < \infty,

is proportional to the likelihood function of a Gaussian centered at x, with width s = x/10 and restricted to positive θ. It is displayed in Figure 24c for x = 8. (There is a small difference in the validity range of θ for the Gaussian and Stein's distribution which numerically is of no importance.)

If the likelihood function contains the full information relative to the parameter, the inferences on θ should be the same for the very different distributions, the Gaussian and the Stein function. However, one might conclude from Equation 18 that the Stein parameter θ is 10 times smaller than the measurement x in 99% of the cases, while the Gaussian case would favor a parameter in the vicinity of the measurement [61]. It seems silly to select the same confidence interval in both cases.

The solution of Stein’s paradox can be summarized as follows:


Fig. 24: Stein's test function with a long tail towards large x. The maximum is next to the parameter value. The Gaussian likelihood function for a measurement x = 8 is shown in c).


• The LP does not relate different probability densities. It just states that the likelihood function contains the full information relative to the unknown parameter. Thus there is no contradiction in the fact that the same likelihood belongs to two different probability densities.

• The conclusion that, because x is larger than 10θ in 99% of the cases, for a given x the value of θ is smaller than x/10 in 99% of the cases, independent of x, is not correct. Classical confidence should not be mixed up with probability.

• Related to the last point: the same θ interval corresponds to very different coverages in the two cases, Gaussian and Stein pdf. Coverage, however, is a frequentist concept which is not computable from the likelihood function and not relevant here.

It is easily seen that for any reasonable normalizable prior density the inferences for the Gaussian and the Stein cases should be identical, while the classical intervals are rather strange in some cases.

Let us look at an example: we select a uniform prior function π(θ) = 0.1 for 0 < θ < 10. In most cases one would get x much larger than θ, say x = 10^1000. Obviously, we cannot learn much from such a measurement, a fact which translates into a likelihood function which is completely flat in the allowed θ range:

L = \exp\left\{-100\, \frac{(\theta - x)^2}{2 x^2}\right\} \sim \exp\left\{-100\,\theta/x\right\} \approx 1

The same result would be obtained for a Gaussian pdf, except that the occurrence of very large x is rare.

Now assume x = 5. Then

L(\theta) \sim \exp\left\{-\frac{(\theta - 5)^2}{2 \cdot 0.5^2}\right\}, \quad \text{for } 5/b < \theta < 10,

is a Gaussian of width s = 0.5 and we guess that θ is near to x. In both cases, for the Gaussian pdf and the Stein pdf, the results derived from the likelihood function are very reasonable. The difference between the two is that in the former case likelihood functions centered around small x are frequent and in the latter those centered at large x, but for a given observation this is irrelevant. The reader is invited to invent other prior densities and to search for inconsistencies in the Bayesian treatment.
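A short numerical sketch of the Bayesian treatment (uniform prior restricted to 0 < θ < 10 as above; the large observation is taken as 10^6 instead of 10^1000 merely to stay within floating point range):

```python
import numpy as np

def stein_likelihood(theta, x):
    # L(theta) ~ exp(-100 (theta - x)^2 / (2 x^2)), the Gaussian-shaped part of Stein's pdf
    return np.exp(-100.0 * (theta - x) ** 2 / (2.0 * x ** 2))

theta = np.linspace(1e-6, 10.0, 100001)      # uniform prior restricted to 0 < theta < 10
dth = theta[1] - theta[0]

for x in (5.0, 1.0e6):                       # a "bell" observation and a typical tail observation
    post = stein_likelihood(theta, x)
    post /= post.sum() * dth
    mean = (theta * post).sum() * dth
    rms = np.sqrt(((theta - mean) ** 2 * post).sum() * dth)
    print(x, mean, rms)   # x = 5: posterior concentrated near 5; x = 1e6: essentially flat over (0, 10)
```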

The argument of Stein relies on the following frequentist observation. Assume θ values are selected in a range θ < θmax. Then in 99% of all cases the likelihood function would favor θ ≈ x > 10 θmax, an estimate which is considerably worse than the classical one. The flaw of the argument is due to the inconsistency of selecting specific test parameters while using a uniform prior of θ extending to some 10^(10^20) θmax.

5.32 Classical treatment

What would a classical treatment give for central 99% confidence intervals?

\int_x^{b\theta_{\mathrm{low}}} f(x', \theta_{\mathrm{low}})\, dx' = 0.005 \implies \theta_{\mathrm{low}} \ll x

\int_0^{x} f(x', \theta_{\mathrm{high}})\, dx' = 0.005 \implies \theta_{\mathrm{high}} \approx x

For any reasonably small x both limits would be nearly zero. For very large x the classical procedure produces a rather wide θ interval. Qualitatively, the classical interval fails to contain the true value


Fig. 25: Modified Stone's example: the problem is to infer the last step knowing the start and stop positions.

in the 1% of cases where the measurement falls into the bell shaped part of the distribution but produces a safe result in the remaining 99% where x is located in the tail. Thus the correct coverage is guaranteed but in some cases ridiculous error limits have to be quoted. This is the well known problem of classical statistics: it can produce results that are known to be wrong. In the remaining cases the interval will be enormously wide and extend to extremely large values.

Example 26 A Susy particle has a mass of θ = 1 eV/c². A physicist who ignores this value designs an experiment to measure the mass in the range 0.001 eV/c² < θ < 1 TeV/c² through a quantity x depending on θ according to the Stein distribution. (It is certainly not easy to determine a parameter θ over 15 orders of magnitude from a quantity x which may vary over 10^20 orders of magnitude.) A 99% confidence level is envisaged. A frequentist analysis of the data will provide wrong limits 0 < θ ≲ 0.1 eV/c² in 1% of the cases and huge, useless intervals in the remaining ones (except for a tiny fraction). In contrast, a Bayesian physicist will obtain a correct and useful limit in 1% of the cases and a result similar to the classical one in the remaining cases.

It is impossible to invent a realistic experimental situation where the Bayesian treatment of the Stein problem produces wrong results. The Stein function plays with logarithmic infinities, which cannot occur in reality. Limiting the uniform prior15 to a reasonable range avoids the paradoxical conclusion of Stein.

5.4 Stone’s example

Another nice example has been presented by M. Stone [62], which, I regret, I have to shorten and to simplify. The Stone example is much closer to problems arising in physics than the rather exotic Stein Paradox.

Example 27 A drunken sailor starts from a fixed intersection of streets and proceeds at random through the square net of streets oriented along the directions N, W, S, E (see Figure 25)16.

15Any other normalizable prior would produce very similar results.
16The analogy to physics is not between the sailor and drunken physicists but between his random walk and diffusion.


At a certain intersection x he stops. The problem is to guess the last crossing before the stop, i.e. the direction. We name the four possible origins n, w, s, e. The likelihood for e is 1/4, equal to the probability to arrive at x starting in a random direction from e. The same probabilities apply for the other three locations. With a "non-informative" prior for the locations, one has to associate equal probabilities to all four possibilities. This is in contradiction to the correct intuition that a position between the starting point and the stop is more likely than 1/4.

The example does not directly involve the LP [63] but it questions inference based solely on the likelihood function. Here, the solution is rather simple: there is information about the prior and it is clearly not uniform. (In the original version of the example the prior has an even stronger influence.) Including the prior information and using the constant likelihood, we obtain a correct result [63, 52].

It is not clear what a classical solution of the problem would look like.

The fact that so far no simple counter-example to the LP has been found provides strong support for this principle.

5.5 Goodness-of-fit tests

Significance tests cannot be based on the LP since there is no parameter differentiating between the hypothesis at hand and the rather diffuse alternatives. For this reason they have been criticized especially by Bayesians [5, 64] and likelihood amateurs [19]. However, as long as no better alternatives are available, classical methods should be used. Goodness-of-fit tests like the Neyman-Pearson or Kolmogorov tests are very useful and cannot easily be replaced by likelihood based recipes. (Note that also here the subjectivity problem is present. The binning has to be chosen according to the Bayesian expectation of possible deviations. A rough binning neglects narrow deviations, and with fine binning in the Neyman-Pearson test the statistical fluctuations hide systematic deviations [65]. Also the fact that we use one-sided tests has a Bayesian origin.)
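As a purely illustrative sketch (the toy model - a flat distribution plus a narrow Gaussian bump - is my own and not taken from the text), the following code shows how the Pearson χ² p-value of one and the same data set changes with the chosen binning.

```python
"""Toy illustration of the binning dependence of a Pearson chi-square
goodness-of-fit test.  Assumed data: uniform on [0,1] plus a narrow
Gaussian bump; the hypothesis tested is a flat distribution."""
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_flat, n_bump = 9900, 100
data = np.concatenate([rng.uniform(0, 1, n_flat),
                       rng.normal(0.5, 0.005, n_bump)])

for nbins in (5, 20, 100, 1000):
    observed, _ = np.histogram(data, bins=nbins, range=(0.0, 1.0))
    expected = np.full(nbins, data.size / nbins)   # flat hypothesis
    chi2 = ((observed - expected) ** 2 / expected).sum()
    ndf = nbins - 1
    p = stats.chi2.sf(chi2, ndf)
    print(f"{nbins:5d} bins: chi2/ndf = {chi2/ndf:6.2f}, p = {p:.3g}")
```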

It should also be clear that the usefulness of recipes which are not in accordance with the LP does not falsify the latter. Significance tests have no bearing whatsoever on interval estimation.

5.6 Randomization

Randomization is an important tool in many statistical applications like clinical tests and survey studies. For example we might select randomly 50 CERN physicists and ask them their opinion on closing down the LEP machine. It might turn out that by accident the majority of the selected physicists are involved in LEP experiments. This post-experimental information has to be ignored, in contradiction with the LP.

Strong criticism of randomization is expressed by Basu [66], and it seems obvious that with a good unbiased judgement one should be able to get better performance than by random selection. Yet, there are cases where this classical procedure is very useful - think of Monte Carlo simulation - and especially for experiments with large statistics the Bayesian objections are less convincing. This point of view is shared by prominent Bayesians like Savage [60]: "The theory of personal probability ... does lead to new insight into the role and limitations of randomization but does by no means deprive randomization of its important function in statistics."

In conclusion, we should neither exclude classical methods nor use them as an argument against the LP.

6. Comparison of the methods

6.1 Difference in philosophy

Before we compare the frequentist methods with the likelihood based approaches in detail, let us come back to the example (see Figure 1) presented in the introduction which is repeated in more detail as Example 13/14.



Frequentists test the compatibility of data with a parameter and, independent of the details of the theoretical prediction, they want to be right in a prefixed fraction of cases. The confidence level is a kind of "inclusion probability" of the correct parameter value inside the interval and is fixed. The "exclusion probability" of wrong parameter values, which depends on the width of the probability distribution, is at least partially ignored in the classical approaches. In the language of discrete parameters one could say that the "error of the second kind" is not fully considered. The wider the distribution is, the likelier it is to include a wrong parameter in the confidence interval.

Since the confidence interval is based on an integration of the pdf, it depends on data that could have been observed. The probability to observe the actual measurement x is not enough for classical reasoning. Intuitively, one would find the two parameter values θ1 and θ2 which satisfy f(x|θ1) = f(x|θ2) equally acceptable, but the non-locality of the frequentist coverage introduces a difference.

On the other hand, in concepts based on the likelihood function only the value of the pdf at the observed data is important. The methods are local in the sample space. To illustrate the philosophy behind it, it is simpler to argue with discrete variables.

Assume we have seven different kinds of dice, all differently biased, except one which is unbiased. We do not know the number of dice of each kind. Each of the biased dice favors a different number with probability 1/2. The remaining numbers have probabilities of 1/10. We observe the number "3" and have to make a choice. Our decision will be based on the content of the following table only, and we would ignore the remaining sample space.

die    1     2     3     4     5     6     7
p(3)   0.1   0.1   0.5   0.1   0.1   0.1   1/6

We would select die "3". If we repeated the experiment many times, we would always select one of the first six dice and never die "7". Thus a frequentist would argue that the procedure is inadmissible and that also die "7" should be given a chance of 1/7, which would guarantee equal coverage for all hypotheses. A Bayesian would not care about long term coverage nor about p(1), p(2), p(4), p(5), p(6), the probabilities of sample variables not observed. He wants to make the optimum decision in the present case and selects the die which gives the largest likelihood. Why is long term coverage not an issue? The Bayesian updates his information by multiplying the likelihoods of the individual trials, and after 10 experiments he would probably find a higher likelihood for die "7" than for the other dice.
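A minimal sketch of this updating, assuming that the unbiased die "7" is the one actually thrown: the log-likelihoods of the seven hypotheses are accumulated over repeated throws, and after a handful of throws die "7" typically takes the lead, as stated above.

```python
"""Sketch of the seven-dice example.  Each biased die i (i = 1..6) shows
the number i with probability 1/2 and every other number with probability
1/10; die 7 is unbiased.  We assume die 7 is the one actually thrown."""
import numpy as np

rng = np.random.default_rng(3)

# p[k][x] = probability that die k+1 shows the number x+1
p = np.full((7, 6), 0.1)
for i in range(6):
    p[i, i] = 0.5
p[6, :] = 1.0 / 6.0            # the unbiased die

logL = np.zeros(7)
throws = rng.integers(0, 6, size=10)    # ten throws of the unbiased die
for n, x in enumerate(throws, start=1):
    logL += np.log(p[:, x])             # multiply likelihoods = add logs
    best = np.argmax(logL) + 1
    print(f"after {n:2d} throws (last = {x+1}): "
          f"largest likelihood for die {best}")
```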

Let us look at another very extreme example.

Example 28 Both the sample variable x and the parameter θ are restricted to the interval [0, 1].

f(x|θ) = δ(x − θ)   for θ = 0.453
f(x|θ) = 1          otherwise

The result of a precise measurement is x = 0.453. The classical confidence interval would be the full interval 0 ≤ θ ≤ 1. The Bayesian conclusion would be θ = 0.453 ± 0 with 100% confidence. I believe everybody would intuitively support the Bayesian decision.

Classical confidence intervals are not well qualified for decision making. Even advocates of the classical concept admit that for decision making classical limits may not be optimum [15]. The question is then, what else should they be used for? Frequentists, however, argue that the intention of inference is not necessarily to optimize decisions. Kendall and Stuart [67] write: "Some writers have gone so far as to argue that all estimation and hypothesis-testing are, in fact, decision-making operations. We emphatically disagree, both that all statistical inquiry emerges in decisions and that the consequences of many decisions can be evaluated numerically."


Narsky [68], who compares several different approaches to the estimation of upper Poisson limits, states: "There is no such thing as the best procedure for upper limit estimation. An experimentalist is free to choose any procedure she/he likes, based on her/his belief and experience. The only requirement is that the chosen procedure must have a strict mathematical foundation." This opinion is typical for many papers on confidence limits. However, "the real test of the pudding is in its eating" and not in the beauty of the cooking recipe. We should state what we aim for and should not forget that what we measure has practical implications, hopefully!

6.2 Relevance, consistency and precision

A. W. F. Edwards writes [19]: "Relative support (of a hypothesis or a parameter) must be consistent in different applications, so that we are content to react equally to equal values, and it must not be affected by information judged intuitively to be irrelevant."

The most obvious example for the violation of this principle in all classical approaches is the Poisson case with no event observed where expected background is absent (Examples 11, 12). Other examples are related to the dependence of classical results on the sampling plan (Example 24). For this reason alone, classical limits are to be discarded as a useful measure of uncertainty.

Formally, the fact that different limits are obtained in equivalent situations manifests itself as a violation of the LP [30]. In the Poisson case, for n events found and b background events expected, the likelihood function for the signal parameter µ is

L(µ) ∼ Σ_{b′=0}^{n} P(b′|b) P(n − b′|µ)

and for n = 0 it reduces to

L(µ) ∝ exp(−µ)

independent of the background expectation. In Figure 16 the likelihood function is compared to the classical and Bayesian results.
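A short numerical sketch of the formula above (using scipy's Poisson probabilities; the chosen values of n, b and the grid are arbitrary) confirms that the shape of L(µ) is independent of the background expectation when n = 0, while it does depend on b for n > 0.

```python
"""Sketch of the likelihood L(mu) ~ sum over b' of P(b'|b)*P(n-b'|mu);
normalization constants are irrelevant for its shape."""
import numpy as np
from scipy.stats import poisson

def likelihood(mu, n, b):
    bprime = np.arange(n + 1)
    return np.sum(poisson.pmf(bprime, b) * poisson.pmf(n - bprime, mu))

mu_grid = np.linspace(0.0, 10.0, 101)
for n in (0, 3):
    for b in (0.0, 1.0, 3.0):
        L = np.array([likelihood(m, n, b) for m in mu_grid])
        L /= L.max()                          # normalize to the maximum
        # for n = 0 this ratio is exp(-2) for every background expectation b
        print(f"n={n}, b={b}: L(mu=2)/L_max = {L[20]:.3f}")   # mu_grid[20] = 2
```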

Similarly, sampling plans lead to different classical limits for identical data samples (Example 24). Another example is presented by Berger and Berry [69].

In addition, we require that a measure of uncertainty respects a logical ordering principle: The more precise a measurement is, the smaller should be the associated error interval. An experimental result of zero events (n = 0) found with background expectation b = 0 is clearly more exclusive than a result n = 1 with b = 3.¹⁷ The corresponding 90% confidence limits 2.44 and 1.88 of the unified approach [15] invert the sequence of precision.

The deficiency of classical upper limits is also recognized by some frequentists. They accept that the classical confidence intervals are not always useful measures of the uncertainty. (But what else should they be used for?) To cure the problem, Feldman and Cousins [15] propose to supplement the confidence limit by an additional quantity called sensitivity. However, they do not explain how to combine the two quantities to produce a sensitive measure for the relative precision of different experimental results and how to test the compatibility of the measurement with a theory.

Bayesian methods are in agreement with the LP and respect the consistency property. (Problems may occur for probability densities with multiple maxima when the prior density is updated with increasing event number [70]. These exotic cases are not relevant in real physics problems and cannot be handled by classical methods either.)

A general comparison of the precision of different methods is difficult because there is no generally accepted definition of precision. Usually, precision is measured by the variance of the estimator (efficiency). This is problematic because the variance is not invariant against parameter transformations (see Example 30 below). An efficient estimator of θ does not necessarily transform into an efficient estimator of θ′ when the parameters are related by θ′ = f(θ). The precision of an interval is even harder to judge. Qualitatively, the precision should decrease with the interval length and the failure rate.

¹⁷ Cousins's justification [21] does not distinguish between experiment and experimental result.



In the Bayesian philosophy with the assumption of a uniform prior density, likelihood ratio intervals are optimum. For a given interval length, the probability to contain the true parameter value is maximum. In other words, avoiding the notion of prior density, the likelihood interval contains the parameter values best supported by the data.

Classical methods hardly discuss precision. "Minimum length" intervals are not very popular. Clearly, classical intervals cannot be optimum when they violate the likelihood principle because then they ignore available information. For example, it is easy to improve negative Poisson limits or unphysical intervals. Qualitatively, the intervals from the unified approach are expected to be superior to other classical intervals.

6.3 Coverage

In the classical approach we have the attractive property that a well known fraction of experimental results are within the quoted error limits. Thus the concept of coverage is able to provide some confidence in the scientific procedure. However, confidence can also be attributed to intervals based on the likelihood function.

Major criticism of the classical concepts is related to the requirement of a pre-experimental definition of the analysis methods, necessary to guarantee coverage. Conditioning on the actual observation is forbidden. However, as shown by Kiefer [71], conditioning and coverage are not necessarily exclusive.

On the other hand, for an individual experiment we cannot deduce from the computed coverage a probability that the result is true, and it is the latter we need to make decisions.

Example 29 Throwing a coin, an experimenter quotes that the Higgs mass is larger than 100 GeV/c² with 50% confidence. It would be stupid to deduce from this "measurement" a probability of 50% for the Higgs mass to be larger than the arbitrarily selected mass value of 100 GeV/c².

The admiration of coverage by many physicists is due to an illusion: They intuitively mix coverage with probability even if they intellectually realize the difference and admit the contrary when confronted with difficult examples. In reality, the only possible way to derive a probability is to apply the Bayesian prior and to integrate the likelihood function. We cannot do any better. If we measure a quantity for the first time and if we have no idea about it, we cannot give a probability and the notion of coverage makes little sense. On the other hand, coverage is not a completely useless quality as some Bayesians claim. Coverage intervals retain a large fraction of the information contained in the likelihood function.

Coverage is a property of an ensemble of experiments and makes sense if an experiment is repeated many times¹⁸. Let us assume that we have 10 experiments measuring a particle mass with different precision and each providing a 90% confidence interval. It will not be probable that more than one or two of the limits will not contain the true value. This will certainly give us an idea where the true value may be located, but a quantitative evaluation is difficult.
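A toy sketch of this ensemble view (Gaussian measurements with known resolution; all numbers assumed): the 90% central intervals cover the true value in roughly 90% of a large number of repetitions, and in a set of ten experiments typically one interval misses.

```python
"""Frequentist coverage as an ensemble property: Gaussian measurements of
a 'mass' m_true with known resolution.  For any single experiment the
90% interval x +- 1.645*sigma either covers the true value or it does not."""
import numpy as np

rng = np.random.default_rng(11)
m_true, sigma, z90 = 5.0, 1.0, 1.645

x = rng.normal(m_true, sigma, size=100_000)
covered = np.abs(x - m_true) < z90 * sigma
print(f"empirical coverage: {covered.mean():.3f}")     # close to 0.90

# ten experiments of different precision, as in the text
sigmas = rng.uniform(0.5, 2.0, size=10)
xs = rng.normal(m_true, sigmas)
misses = np.abs(xs - m_true) > z90 * sigmas
print(f"out of 10 intervals, {misses.sum()} miss the true value")
```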

The violation of coverage due to the separate treatment of upper limits and intervals was the main motivation for the development of the unified approach. In the years before the adoption of this new scheme, rare decay searches usually presented Bayesian upper limits - not intervals - when the result was compatible with background. It is hard to detect negative consequences of the former procedure for the progress of physics.

The main problems with the coverage paradigm are the following:

¹⁸ Insurance companies may rely on coverage.


• Coverage often is only approximate. There is complete overcoverage in Poisson distributed data with mean zero and in digital measurements.

• The usual treatment of nuisance parameters [21] leads to undercoverage.

• It is difficult to retain coverage when results are averaged.

Classical confidence cannot be attributed to intervals based on the likelihood function. The PDG [9] gives the advice to determine the true coverage by a Monte Carlo simulation. This is not possible. To a given parameter interval sometimes many different confidence levels can be associated, and in some cases it is impossible to interpret likelihood intervals as confidence intervals.

6.4 Metric - Invariance under variable transformations

We consider the following non-linear parameter transformation θ′ = u(θ). (Linear transformations do not pose problems.)

θ′ = u(θ)    (19)

θ′_low = u(θ_low),    θ′_high = u(θ_high)

Clearly, the probabilities P(θ_low < θ < θ_high) and P(θ′_low < θ′ < θ′_high) are the same.

In addition, we may transform the sample space variables and consequently also the probability densities f(x) −→ g(x′).

Invariance means that i) the probability and ii) the interval limits are the same independent of the initial choice of the parameter and the sample variable to which the definition applies.

In the frequentist scheme both conditions are satisfied with the likelihood ratio ordering used in the unified approach, and in the conventional scheme when central intervals are chosen - a choice which is restricted to one dimension and which is not very sensible. Equal probability density intervals defined in x usually differ from equal probability density intervals in x′. Shortest confidence intervals depend on the parameter choice.

General transformations of a multi-dimensional sample space may lead to concave probability contours and ruin the whole concept.

Conceptually and formally the parameter and sample variable dependence is acceptable: The parameters and the sample variables have to be chosen pre-experimentally in a reasonable way, independent of the result of the experiment, similarly to all the other analysis procedures in the classical method. Nevertheless, it is disturbing that the physicist's choice to determine a confidence interval from a mass squared distribution instead of from a simple mass distribution will produce different limits. Thus the likelihood ratio ordering, which is free of this problem, is certainly to be preferred to the other classical methods.

Likelihood ratio intervals are strictly invariant under parameter transformations. The support by the data is independent of the choice of the parameter. It is also invariant against transformations of the sample variable.

The Bayesian method conserves probability if the prior is transformed according to:

π′(θ′)dθ′ = π(θ)dθ

Of course not both prior densities π and π′ can be uniform. The invariance of the probability is guaranteed, but the size of the interval depends on the primary parameter choice for which we required π(θ) = const. With our requirement of using uniform priors, the Bayesian prescription is not invariant under transformations of the parameter.
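The effect can be made explicit with a small numerical sketch (the data - two events of an exponential decay - are assumed, and only central probability intervals are compared): a uniform prior in the decay constant and a uniform prior in the mean life lead to different intervals for the same quantity.

```python
"""Sketch of the parameter dependence of Bayesian intervals with uniform
priors.  Assumed data: two exponential decay times with t1 + t2 = S = 2.
68.3% central intervals for the mean life tau are computed (a) with a
uniform prior in the decay constant gamma and (b) with a uniform prior
in tau itself; the two intervals differ."""
import numpy as np

S, n, cl = 2.0, 2, 0.683
gamma = np.linspace(1e-4, 10.0, 200_000)     # grid in the decay constant

def central_interval(density, x):
    cdf = np.cumsum(density)
    cdf /= cdf[-1]
    lo = x[np.searchsorted(cdf, (1 - cl) / 2)]
    hi = x[np.searchsorted(cdf, (1 + cl) / 2)]
    return lo, hi

# (a) uniform prior in gamma: posterior ~ gamma^n * exp(-gamma*S)
g_lo, g_hi = central_interval(gamma**n * np.exp(-gamma * S), gamma)
# (b) uniform prior in tau = 1/gamma: prior density in gamma ~ 1/gamma^2
h_lo, h_hi = central_interval(gamma**(n - 2) * np.exp(-gamma * S), gamma)

# central intervals transform with the quantiles, so the tau limits are 1/gamma
print(f"uniform prior in gamma: {1/g_hi:.2f} < tau < {1/g_lo:.2f}")
print(f"uniform prior in tau  : {1/h_hi:.2f} < tau < {1/h_lo:.2f}")
```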

The fact that the choice of a parameter space affects the result of an analysis is quite common. Very useful parameters like root mean square errors are not invariant. Inference procedures like pattern recognition independent of a selected space are unthinkable.


Fig. 26: Lifetime measurements using 10 and 2 events. The error bars from different methods are given for a measurement coinciding with the true lifetime. The letters refer to classical (c), likelihood (l) and Bayesian (b).

I find it rather natural that a scientist chooses not only the selection criteria but also the metric of the parameter and sample spaces. In many cases there is quite some common agreement on a reasonable choice (for example 1/p for tracking, m² for neutrino mass limits, γ for the decay rate).

6.5 Error treatment

Error propagation is simple when the distributions are Gaussian and the functions are linear. All other cases are difficult to handle and very little guidance can be found in the literature. A detailed discussion of the problem is beyond the scope of this article.

Error propagation requires not only an error interval but also a parameter estimate. We are used to maximum likelihood estimates. For classical methods other choices are better adapted. Central intervals are well compatible with the median, which we obtain when we shrink the confidence interval to zero width. Medians, however, are not very efficient estimators.

The errors - better called uncertainties - of a measurement are usually not accurately known and this is also not necessary. A precision of about 10% is good enough in most cases. Thus we should be satisfied with reasonable approximations.

6.51 Error definition

In the classical schemes error intervals correspond to confidence intervals. They are well defined, except for the arbitrariness of the definition of the probability intervals.

Physicists usually publish the asymmetric likelihood ratio limits which provide approximate information on the shape of the likelihood function.

The Bayesian way to define probability distributions for the parameters gives full freedom (which is a disadvantage) in defining the error. An obvious choice is to use the mean value, the variance and, when necessary, skewness and kurtosis of the parameter probability density. It has the advantage that tails in the distributions are taken into account.

A comparison of error bounds of lifetime measurements from the different methods is presented in Figure 26 where we show the belts for a measurement λ = 1 obtained from ten and two event samples, respectively. (Here equal tail is synonymous with central, short indicates that the interval has minimum length.) There are considerable differences for small event numbers. In the case of ten events the differences between the various definitions are already negligible.



The Bureau International des Poids et Mesures (BIPM) and the International Organization for Standardization (ISO) have issued some recommendations [72] on how to define errors. After having distributed a questionnaire among a large number of scientists, BIPM and ISO have essentially adopted a Bayesian inspired method. They recommend to quantify the uncertainty by the variance and not by a classical confidence interval or a likelihood ratio interval.

6.52 Error propagation in general

To conserve the confidence level or the probability, all methods call for an explicit transformation following generalized Equations 19. The multivariate problem y_k = y_k(x_1, x_2, ..., x_N) involves a variable transformation plus the elimination of nuisance parameters. Here again, classical methods fail. Only the Bayesian way permits a consistent solution of the problem.

Transformations conserving probability often produce strong biases and very asymmetric error bounds in the new variables. Following the ISO recommendations one has to accept a certain violation of probability conservation and work with the moments of the Bayesian distributions or the asymmetric likelihood errors.

A reasonable approximation is obtained by inserting the moments of the probability densities of the parameters x_i into the Taylor expansions of y_k.
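A minimal sketch of this approximation for an assumed toy case y = 1/x (the input mean and standard deviation are invented for the illustration): the first two moments of y follow from the Taylor expansion around the mean of x and agree reasonably well with a direct Monte Carlo propagation.

```python
"""Propagating moments through a Taylor expansion, toy case y = 1/x with
x Gaussian of mean 1.0 and standard deviation 0.1 (assumed numbers)."""
import numpy as np

mu_x, sig_x = 1.0, 0.1
y   = lambda x: 1.0 / x
dy  = lambda x: -1.0 / x**2          # first derivative
d2y = lambda x: 2.0 / x**3           # second derivative

# Taylor expansion around the mean, keeping the second-order term
mean_y = y(mu_x) + 0.5 * d2y(mu_x) * sig_x**2
var_y  = dy(mu_x)**2 * sig_x**2

# Monte Carlo check of the approximation
rng = np.random.default_rng(2)
xs = rng.normal(mu_x, sig_x, size=1_000_000)
ys = y(xs)
print(f"Taylor : mean = {mean_y:.3f}, std = {np.sqrt(var_y):.3f}")
print(f"MC     : mean = {ys.mean():.3f}, std = {ys.std():.3f}")
```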

6.53 Systematic errors

The separation of statistical and systematic errors often is rather artificial. Usually, errors are called systematic if their size is badly known. Bayesians have no conceptual problem with systematic errors, whereas it seems difficult to incorporate the notion of coverage in the definition of systematic errors.

6.6 Combining data

One of the simpler problems of error propagation is the computation of a mean value from individual measurements of the same quantity.

The possibility to increase precision by combining data from different experiments guarantees a continuous progress of knowledge in science.

Following the LP we conclude:

The only way to combine data without loss of information is to add their log-likelihood functions.

This first step does not require a definition of the error.

Frequentist methods cannot compute confidence limits from the likelihood function. Thus they have to invent other prescriptions to combine data. No generally accepted method is known to me. Especially in unified approaches, information in addition to the error limits is necessary to perform sensible combinations. Complicated procedures have been invented to combine upper limits [73].

The usual averaging procedure, weighting the individual measurements with the inverse of the error squared, is an approximation based on the central limit theorem, and assumes that the errors are proportional to the standard deviation of a probability density and that the measurement itself corresponds to the expectation value. It is a Bayesian method.
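The difference between the exact combination and the weighted-mean approximation can be seen in a small sketch with two Poisson rate measurements (the counts and measuring times are assumed numbers): adding the log-likelihoods gives the pooled estimate, whereas the 1/σ² weighted mean of the individual estimates deviates from it when the event numbers are small.

```python
"""Sketch: combining two Poisson rate measurements, n1 = 1 events in
t1 = 1 and n2 = 8 events in t2 = 2 time units (assumed numbers)."""
import numpy as np

n = np.array([1.0, 8.0])
t = np.array([1.0, 2.0])
mu = np.linspace(0.01, 10.0, 5000)        # grid for the rate parameter

# log-likelihoods of the individual experiments (constants dropped)
logL = np.array([[ni * np.log(m * ti) - m * ti for m in mu]
                 for ni, ti in zip(n, t)])
combined = logL.sum(axis=0)               # addition of the log-likelihoods
mu_ml = mu[np.argmax(combined)]

# weighted mean of the individual estimates mu_i = n_i/t_i
mu_i = n / t
w = 1.0 / (n / t**2)                      # 1/sigma^2 with sigma^2 = n/t^2
mu_w = np.sum(w * mu_i) / np.sum(w)

print(f"adding log-likelihoods: mu = {mu_ml:.2f}")   # (n1+n2)/(t1+t2) = 3.0
print(f"weighted average      : mu = {mu_w:.2f}")    # here 2.0
```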

The PDG [9] uses the following recipe for the combination of measurements with asymmetric errors: The mean value is computed iteratively, weighting the individual measurements with the inverse of the error squared. The error is a function of the resulting mean value. When the mean value is located inside the error interval, a linear interpolation between the lower and upper side errors is used. When the mean value is outside the interval, the corresponding lower or upper error is chosen.


method                                          ⟨τ/τ0⟩   ⟨γ/γ0⟩

adding likelihood functions                       1        1
classical, weight: σ⁻²                            0.0      0.67
likelihood, weight: PDG                           0.26     0.80
likelihood, weight: PDG extended                  0.69     0.85
Bayesian mean, uniform prior, weight: σ⁻²         ∞        1

Table 3: Average of an infinite number of equivalent lifetime measurements using different weighting procedures

In principle, it would be more consistent to apply the linear error function also outside the interval. (Let us call this method PDG extended.) On the other hand, the pragmatic procedure of the PDG is preferable because it is less sensitive to tails in the distributions.

Applying the usual prescriptions to add measurements with error intervals as described in the PDG booklets i) certainly will not conserve coverage and ii) the result will depend on the chosen scheme of defining the bounds.

For asymmetric likelihood functions the standard averaging procedure, where the inverse of the full width is used as a weighting factor, is only approximately correct. Better results are obtained with the PDG prescription. Alternatively, one could approximate the log-likelihood by a third-order polynomial (using the asymmetric errors) and add the approximated log-likelihoods.

Example 30 Two events are observed from an exponential decay with true mean life τ0 = 1/γ0. The maximum likelihood estimate is used either for τ or for γ. We assume that an infinite number of identical experiments is performed and that the results are combined. In Table 3 we summarize the results of different averaging procedures. There is no unique prescription for averaging classical intervals. To compute the classical value given in the table, the maximum likelihood estimate and central intervals were used.

In this special example a consistent result is obtained with the Bayesian method using a uniform prior for the decay constant. This shows how critical the choice of the parameter is in the Bayesian approach. It is also clear that an educated choice is important for the pragmatic procedures. From the discussion of Example 22 it should be clear that the decay constant is the better parameter. Methods approximating the likelihood function provide reasonable results unless the likelihood function is very asymmetric. The conservative weighting procedure of the PDG is inferior to the extended scheme (which may be dangerous in other cases). The simple classical average is the least precise of all.
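The numbers of Table 3 can be checked qualitatively with a Monte Carlo sketch. It assumes that the 1/σ² weights use errors proportional to the estimate itself, as for an exponential decay; the exact Table 3 values refer to the limit of infinitely many experiments, so the simulation only reproduces them approximately.

```python
"""Monte Carlo sketch in the spirit of Example 30 / Table 3: many
two-event lifetime experiments (tau0 = 1) combined in different ways.
Assumption: the 'error' in the 1/sigma^2 weights is proportional to
the estimate itself (constant factors cancel in the weighted mean)."""
import numpy as np

rng = np.random.default_rng(5)
n_exp, n_ev, tau0 = 200_000, 2, 1.0
S = rng.exponential(tau0, size=(n_exp, n_ev)).sum(axis=1)   # t1 + t2

tau_hat = S / n_ev            # ML estimate of tau
gam_hat = n_ev / S            # ML estimate of gamma = 1/tau

# adding log-likelihoods: pooled estimate from all events together (~ 1)
print("adding likelihoods : tau =", S.sum() / (n_exp * n_ev))

# weighted averages with weight 1/sigma^2, sigma proportional to the estimate
print("weighted tau-hat   :", np.average(tau_hat, weights=1/tau_hat**2))  # drifts to 0
print("weighted gamma-hat :", np.average(gam_hat, weights=1/gam_hat**2))  # ~ 0.67

# Bayesian mean of gamma with uniform prior: (n_ev+1)/S, variance (n_ev+1)/S^2
bayes_gamma = (n_ev + 1) / S
print("Bayesian gamma mean:", np.average(bayes_gamma, weights=S**2))      # ~ 1
```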

6.7 Bias

Most of the text books on statistics and also the PDG claim that parameters obtained from measurements or estimators should be unbiased, and when they are not, we are advised to correct for the biases. One argument for correcting for a bias is the property of the maximum likelihood estimate to be efficient (most precise) when unbiased. Estimates from small samples often are biased.

Since both properties, efficiency and bias, are not invariant under a change of parameter, the requirement that estimated parameters should be unbiased is not so obvious. The situation is more tricky than it appears at first sight. What should we conclude from the fact that the maximum likelihood estimate of γ from an exponential decay distribution is biased but that of τ = 1/γ is not? Is the result for τ more precise than that for γ, and should we correct γ for its bias?

A bias in itself is not necessarily a bad property. What really matters is a bias of the estimator weighted with its inverse error squared, the quantity which is used for averaging data. When we compute an average of several unbiased measurements, the result may very well be biased.


An average of unbiased measurements with errors proportional to the parameter value will be systematically too small. We have demonstrated this in the previous section (see Table 3). The unbiased lifetime estimate produces a strongly biased average. On the other hand, when we use a biased maximum likelihood estimator and add the log-likelihoods from many similar experiments, the bias which is present in all individual experiments asymptotically disappears.

Biases usually are correlated with small event numbers and large errors, which require a cautious treatment also in the context of error propagation. A non-linear function y(x_ub), with x_ub unbiased and with large uncertainty, will give a biased result y. The bias-corrected estimators x_ub, y_ub would not follow the same function, i.e. y_ub(x_ub) ≠ y(x_ub), and consequently confidence limits which are invariant under the variable transformation y(x) would not be invariant under y_ub(x_ub) and thus not be relevant for the estimate y_ub.

We will have to be very careful in the treatment of biases.

Not much can be said about the bias observed in classical methods. There are too many ways to define the limits, and a clear prescription for combining measurements is lacking.

6.8 Subjectivity

In the field of statistics the word subjective usually is associated with probabilities not definable through frequencies and also serves as a somewhat negative label for the Bayesian school. Here, we will use it in its common meaning. In experimental design and data analysis many decisions have to be made which are not obvious and consequently depend on personal judgements. However, these decisions should be concerned merely with the quality (precision) of the result and avoid as much as possible introducing biases. Where this is unavoidable, the procedure has to be documented and allow the reader of a publication to use the results without adopting the personal judgements of the publishing group. To be more precise, we would like the results to be as much as possible independent of event selection criteria, stopping rules, Monte Carlo parameters, fitting procedures, etc.

Bayesian methods often are accused of being subjective, and sometimes they really are. However, the Bayesian confidence intervals as defined in this article obviously contain no subjective elements except for the choice of the parameter metric, which is documented. Transformations to other parameters are possible with the published results, of course with limited precision.

Likelihood ratio intervals are rather insensitive to subjective decisions.

On the other hand it is very difficult to avoid subjective elements in a classical confidence level analysis: Consider two identical experiments looking for a rare event to be identified in a mass plot. In the first experiment, within a range corresponding to 3 standard deviations of the mass resolution, no event is observed at the expected mass and a 90% upper limit for the occurrence is computed. In the second experiment there are a few events in the region of interest, but these events are compatible with background. Usually the experimentalist would try to reduce the background by applying some clever cuts, and if this is not successful he would estimate the background probability and include it in the computation of the limit. He also might decide not to publish, because his colleague from the first experiment has obtained a more restrictive limit. These subjective decisions completely ruin the classical probability interpretation, which is only valid if the scientist decides how to treat the data and whether to publish before he looks at them - not a very realistic scenario.

6.9 Universality

As has been shown in our examples, the classical scheme has a limited range of applications: There are problems with physical bounds and with discrete samples (digital measurements). It produces useless intervals in other cases, and so far there is no general method to eliminate nuisance parameters. Especially in Poisson processes with background and uncertainty on the background mean there are unsolved problems. The unified approach in the version proposed by Feldman and Cousins improves the situation for some problems but reduces the applicability of the method for others. There remains the problem of interpretation and application: Are the classical bounds useful measures of the uncertainty of measurements, and how should we use them in error handling and hypothesis testing?



The use of simple likelihood ratio intervals is subject to similar restrictions as the classical confidence intervals. When there is no smooth likelihood function with a maximum in the physical region, the standard limits do not indicate the same precision as in the standard case. For this reason, upper limits deduced from the likelihood ratio should not be transformed into confidence statements. Likelihood ratio intervals are more general than classical limits in some respects: They are able to handle a discrete sample space, and there is a logical transition from continuous parameter problems to discrete hypothesis evaluation. They have a clear and simple interpretation, the interval is linked to the maximum likelihood estimate, and the combination of errors is straightforward.

The Bayesian method is the most general approach. It is able to cope with all the difficulties mentioned above.

6.10 Simplicity

Simplicity is an important quality not only for physical theories but also for statistical methods.

While methods based on the likelihood function are fairly simple both in their conception and their application, the classical approaches are complicated. This is the main reason why the latter are not used for standard interval estimation. Hardly any physicist knows the different ways to define the probability intervals. Thus, for standard applications in physics, classical methods are mainly of academic interest.

The newly proposed unified approaches for the computation of upper limits are even more complicated. Even specialists have difficulties understanding the details. I am convinced that only a fraction of the authors of the recent papers on neutrino oscillations is familiar with the underlying calculation of the limits and is able to interpret them correctly. The variety of different classical approaches to Poisson upper limits adds to the confusion. Practical applications will in some situations require considerable computing time. While combining results based on the likelihood function is transparent and simple in most cases, the combination of classical limits, if possible at all, is a quite tricky task.

7. Summary and proposed conventions

The error of a measurement of an estimated parameter is not uniquely defined, but obviously we would like our error interval to include the true value with well defined confidence and to be as small as possible. Of course, as discussed above, there are other features which are important when we compare different methods: The procedures should be simple, robust, cover all standard cases and enable us to combine results from different experiments. Parameter and interval estimation is used in very different applications, such as simple few-parameter estimates, upper limit computation, unfolding and image reconstruction. It is not obvious that one single method would be optimum for all these situations.

In the standard situation with high event numbers the likelihood can be approximated by a Gaussian and all methods give the same results within an adequate precision.

7.1 Classical procedure

The magic principle of classical confidence bounds is coverage. Knowing that a certain fraction of published measurements cover the true parameter value within their error limits is an attractive property at first sight, but unfortunately it is not very useful in practice. After all, experimental results should be used to make decisions, but even frequentists admit that classical bounds are not well suited for this purpose.

The main difficulties of the standard classical approaches are:

• They produce inconsistent results.


• Their range of applicability is restricted:

– Discrete variates (for example digital measurements) cannot be treated.

– Discrete parameters cannot be treated.

– Parameters defined in a restricted region cannot be handled.

• The results in many cases do not represent the precision of the measurement:

– The experimental information is not always fully used.

– The error interval sometimes is in an unphysical region.

– Strict limits sometimes correspond to measurements with low precision whereas loose limits may be associated with precise measurements.

• The requirement of pre-experimental fixing of the whole analysis method is very unrealistic and would, if followed, in many cases prevent the scientist from extracting the full information from the data.

• Error intervals are parameter dependent, with the exception of one-dimensional central intervals and unified likelihood ratio ordered intervals.

• The method in its practical application is partially subjective due to the arbitrary choice of the probability interval, due to selection criteria based on the data, and because of the many adhoceries necessary to handle the problematic cases.

• A prescription for how to combine results from different experiments is missing. Biases are not discussed.

• Nuisance parameters (background with uncertainty in the mean) often cannot be handled satisfactorily. Estimating them out produces undercoverage.

The unified classical approach provides the required coverage (approximate in the Poisson case) but shares most of the other difficulties mentioned above and adds new ones. It is applicable only under special conditions and it introduces artificial correlations between independent variables when the parameters are near physical boundaries.

If I had to choose a frequentist method, I would select the likelihood ordering for standard problems. In situations with physical boundaries, I would follow Bouchez [38] and ignore those. This would make the combination of experimental results easier. I would relax the coverage requirement, which is anyhow violated on many occasions, not follow the unification concept, and treat Poisson limits as exposed in [24].

7.2 Likelihood and Bayesian methods

The log-likelihood function contains the full experimental information. It measures the relative support of different hypotheses or parameter values by the data. It fulfills transitivity, it is additive and it is invariant against transformations of the parameters or the data. Thus it provides a sound basis for all kinds of parameter and interval inference.

Likelihood ratio limits parametrize the experimental observations in a sensible way. They allow the combination of data and error propagation, and they provide a useful measure of the uncertainty. The use of likelihood ratio errors (without integration) has some restrictions:

• The likelihood function has to be smooth.

• Digital measurements cannot be treated.

• Likelihood ratio limits deduced from likelihood functions that are very different from Gaussians (upper Poisson limits) have to be interpreted with care.

A fully Bayesian treatment with a free choice of the prior parameter density is not advisable and is not further considered here. Fixing the prior density to be uniform retains the mathematical simplicity and thus is very attractive, but also has disadvantages:

• The choice of the parameter is rather arbitrary.

• It is less directly related to the data than the likelihood function.


method                 classical   unified cl.   likelihood   Bayesian u.p.   Bayesian a.p.

consistency               --           --            ++             +               +
precision                 --           -             +              +               +
universality              --           --            -              +               ++
simplicity                -            --            ++             +               +
variable transform.       -            ++            ++             --              --
nuisance parameter        -            -             -              +               +
error propagation         -            -             +              +               +
combining data            -            -             +              +               -
coverage                  +            ++            --             --              --
objectivity               -            -             ++             +               -
discrete hypothesis       -            -             +              +               +

Table 4: Comparison of different approaches to define error intervals

Bayesian methods have been criticized [21] because they do not provide goodness-of-fit tests. This is correct but misleading in the present context. Goodness-of-fit tests are irrelevant for parameter interval estimation, which relies on a correct description of the data by the parametrization, whereas the tests question the parametrization.

7.3 Comparison

Obviously all methods have advantages and disadvantages, as summarized in Table 4 where we compare the conventional classical, the unified classical, the likelihood and the Bayesian methods with uniform and arbitrary prior densities. The distribution of the plus and minus signs indicates the author's personal conclusion. The attributes have to be weighted with the importance of the evaluated property.

7.4 Proposed Conventions

There are two correlated principles which I am not ready to give up:

1. Methods have to be consistent.

2. Since all relevant information for parameter inference is contained in the likelihood function, error intervals and limits should be based on the likelihood function only.

These principles exclude the use of classical limits.

All further conventions are debatable. I propose the following guidelines, which present my personal view and are intended as a basis for a wider discussion.

1. Whenever possible the full likelihood function should be published. It contains the full experimental information and permits combining the results of different experiments in an optimum way. This is especially important when the likelihood is strongly non-Gaussian (strongly asymmetric, cut by external bounds, has several maxima etc.). A sensible way to present the likelihood function is to plot it in log-scale and normalized to its maximum (a minimal sketch is given after this list). A similar proposal has been made by D'Agostini [47]. This facilitates reading off likelihood ratios. Also in two dimensions the likelihood contour plots give a very instructive presentation of the data. Isn't the two-dimensional likelihood representation of Figure 9 more illuminating than drawing confidence contours supplemented by sensitivity curves?

2. Data are combined by adding the log-likelihoods. When the likelihood function is not known, parametrizations are used to approximate it.


3. If the likelihood is smooth and has a single maximum, the likelihood ratio limits should be used to define the error interval. These limits are invariant under parameter transformation. For the measurement of the parameter the value maximizing the likelihood function is chosen. No correction for biased likelihood estimators is applied. The errors usually are asymmetric. These limits can also be interpreted as Bayesian one standard deviation errors for the specific choice of the parameter variable where the likelihood of the parameter has a Gaussian shape.

4. Nuisance parameters are eliminated by integrating them out using a uniform prior. A correlation coefficient should be computed.

5. For digital measurements the Bayesian mean and r.m.s. should be used.

6. In cases where the likelihood function is restricted by physical or mathematical bounds and where there are no good reasons to reject a uniform prior, the measurement and its errors, defined as the mean and r.m.s., should be computed in the Bayesian way.

7. a) Upper and lower limits are computed from the tails of the Bayesian probability distributions. b) Even better suited are the likelihood ratio limits. They have the disadvantage of being less popular.

8. Non-uniform prior densities should not be used.

9. It is the scientist’s choice whether to present an error interval or a one-sided limit.

10. In any case the applied procedure has to be documented.
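As an illustration of guideline 1 (referred to above), the following sketch tabulates and plots a likelihood function normalized to its maximum on a log scale, using matplotlib. The toy input - a Poisson measurement with n = 3 observed events and an expected background of b = 1 - is assumed purely for the purpose of the example.

```python
"""Sketch of guideline 1: present ln L normalized to its maximum.
Assumed toy data: n = 3 observed events, expected background b = 1,
with the background-summed Poisson likelihood used earlier in the text."""
import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt

n, b = 3, 1.0
mu = np.linspace(0.0, 12.0, 400)
bprime = np.arange(n + 1)
L = np.array([np.sum(poisson.pmf(bprime, b) * poisson.pmf(n - bprime, m))
              for m in mu])
logL = np.log(L) - np.log(L.max())        # normalized: maximum at 0

plt.plot(mu, logL)
plt.axhline(-0.5, linestyle="--")         # drop of 1/2: likelihood ratio limits
plt.xlabel("signal mean $\\mu$")
plt.ylabel("$\\ln L(\\mu) - \\ln L_{max}$")
plt.ylim(-4, 0.2)
plt.savefig("loglikelihood_mu.png")
```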

These recipes contain no subjective element and correspond more or less to our everyday practice. An exception is Poisson limits, where for strange reasons the coverage principle - though only approximately realized - has gained preference.

I am sure many frequentist physicists will object to these propositions. They are invited to present their solutions to the many examples presented in this article, to define error limits precisely, and to explain how to use them for error propagation, combination of results and hypothesis exclusion. If they accept the LP they will have to explain why the likelihood function or its parametrization is not sufficient to compute the confidence limits.

It is obvious from the discussion of the treatment of errors in the preceding section that there is a lack of statistical framework. Even in the pure Bayesian philosophy it is not clear whether to choose moments or probability intervals to characterize the errors. In any case, compared to frequentist methods, likelihood based methods are better suited for error calculations. This does not mean that classical methods in general are useless or that the full Bayesian philosophy should be adopted.

For checking statistical methods, frequentist arguments may be useful. Bayesian confidence intervals with low coverage could be suspicious, but they are not necessarily unrealistic.

An example of a bad use of the Bayesian freedom is provided by some unfolding methods. In some cases the regularization reduces the error below the purely statistical one, and results are obtained which are more precise than they would have been in an experiment with no smearing where unfolding would not have been necessary.

7.5 Summary of the summary

We have got nothing but the likelihood function. We have to parameterize it in a sensible way such that the parameters represent the accuracy of the measurement. If this is not possible, we have to provide the full function.

Acknowledgment

I am grateful to Fred James for many critical comments on an early version of this report, and to Bob Cousins and Giulio D'Agostini for numerous interesting discussions, oral and by email.


A Shortest classical confidence interval for a lifetime measurement

In some cases we can find a pivotal quantity q(x, θ) of a measurement x and a parameter θ. A pivot has a probability density which is independent of the parameter θ (like x − θ for a Gaussian with mean θ) and consequently the probability limits q1, q2 of q are also independent of θ:

P(q1 < q(x, θ) < q2) = α    (20)

By solving the inequality for θ we find:

P(θ1(q1, x) < θ < θ2(q2, x)) = α    (21)

Equation (20) relates q1 and q2(q1, α). The remaining free parameter q1 is used to minimize |θ2 − θ1|.

We look at the example of a lifetime measurement x. The width of a confidence interval [γ1, γ2] of the decay constant γ is minimized. Here p = γx is a pivotal quantity.

f(x) = γ exp(−γx)
p = γx
g(p) = exp(−p)

For a given confidence level α the two limits depend on each other: q2(α, q1). The limits transform to limits on the parameter θ: [θ(q1), θ(q2)]. Then |θ(q2(α, q1)) − θ(q1)| is minimized:

exp(−q1) − exp(−q2) = α
q2 = − ln(exp(−q1) − α)
q1 < γx < q2 = − ln(exp(−q1) − α)
q1/x < γ < − ln(exp(−q1) − α)/x

Minimizing the interval length is equivalent to minimizing

ln [ exp(−q1) / (exp(−q1) − α) ]

Taking into account the boundary condition q1 ≥ 0, we find q1 = 0, q2 = − ln(1 − α) and

γ1 = 0,    γ2 = −(1/x) ln(1 − α)

For α = 0.6826 and a measurement x = 1 we find γ1 = 0, γ2 = 1.15. (The likelihood limits are γ1 = 0.30, γ2 = 2.36.)

When we play the same game with the parameter τ = 1/γ and the same pivot we obtain

q′1 < x/τ < q′2 = − ln(exp(−q′1) − α)
x / (− ln(exp(−q′1) − α)) < τ < x/q′1

Minimizing

1/q′1 + 1/ln(exp(−q′1) − α)

we find τ1 = 0.17, τ2 = 2.65. (The likelihood limits are τ1 = 0.42, τ2 = 3.31.)

The limits obviously depend on the choice of the parameter.
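The numbers quoted above can be reproduced with a few lines of code (a sketch; the bracketing intervals for the root finding and the bounded minimization are my own choices):

```python
"""Numerical cross-check of the appendix numbers (x = 1, alpha = 0.6826):
shortest classical intervals built from the pivot p = gamma*x, and the
likelihood ratio limits where ln L drops by 1/2 from its maximum."""
import numpy as np
from scipy.optimize import brentq, minimize_scalar

x, alpha = 1.0, 0.6826

# shortest classical interval for gamma: the minimum sits at q1 = 0
g1, g2 = 0.0, -np.log(1.0 - alpha) / x
print(f"classical gamma interval : [{g1:.2f}, {g2:.2f}]")

# shortest classical interval for tau: minimize 1/q1 + 1/ln(exp(-q1)-alpha)
length = lambda q1: 1.0 / q1 + 1.0 / np.log(np.exp(-q1) - alpha)
res = minimize_scalar(length, bounds=(1e-6, -np.log(alpha) - 1e-6),
                      method="bounded")
q1 = res.x
t1 = x / (-np.log(np.exp(-q1) - alpha))
t2 = x / q1
print(f"classical tau interval   : [{t1:.2f}, {t2:.2f}]")

# likelihood ratio limits for gamma: ln L = ln(gamma) - gamma*x, drop of 1/2
f = lambda g: (np.log(g) - g * x) - (np.log(1.0 / x) - 1.0) + 0.5
print(f"likelihood gamma interval: [{brentq(f, 1e-6, 1.0 / x):.2f}, "
      f"{brentq(f, 1.0 / x, 20.0):.2f}]")
```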


B Objective prior density for an exponential decay

In the literature we find arguments for objective priors. A nice example which illustrates the methods used is the derivation of the prior for the decay constant of particle decays.

Two different particles with exponential decays have the probability densities of the decay time t and decay constant λ

f(t, λ) = f(t|λ) π(λ)
f′(t′, λ′) = f(t′|λ′) π(λ′)

where π(λ) is the prior density and f(t|λ) = λ exp(−λt). The decay constants λ and λ′ are related by λ′ = αλ. This fixes α. Then for times t′ = t/α we have the transformation

f(t, λ) dt dλ = f′(t′, λ′) dt′ dλ′

from which we obtain

απ(αλ) = π(λ)

This relation is satisfied by

π(λ) = const./λ

The prior density is inversely proportional to the decay constant. The flaw of the argument is easily identified: Why should there be a universal prior density π(λ) for all particles? There is no reason for this assumption, which was used when we tacitly substituted π′(λ′) with π(λ′).

We can also argue in a slightly different way: Two physicists measure the lifetime of the same particle in different time units, t in seconds and t′ in minutes. The numerical values are then related by t′ = t/α and λ′ = αλ as above with α = 60. The prior densities have to fulfill

π(λ)dλ = π′(λ′)dλ′

Now the two physicists choose the same functional form of the prior density. Thus we get

π(λ) dλ = π(λ′) dλ′
π(λ) dλ = π(αλ) α dλ
π(λ) = const./λ    (22)

The resulting prior densities are not normalizable, a fact which already indicates that the result cannot be correct. Again there is no compelling reason why π should be equal to π′. It is also easily seen that prior densities different from (22) do not lead to contradictions. For example a uniform prior in the interval 0 < t < 1 sec would produce π(t) = 1 and π′(t′) = 1/60.

Similar plausibility arguments as those used above are applied for the derivation of a prior density for the Poisson parameter. They are in no way convincing.

It is impossible to determine prior densities from scaling laws or symmetries alone. We can use non-uniform prior densities when they are explicitly given as in Example 17.

References

[1] E. T. Jaynes, Bayesian Methods: General Background. An Introductory Tutorial. Maximum Entropy and Bayesian Methods in Applied Statistics, ed. J. H. Justice, Cambridge University Press (1984) 1.


[2] R. A. Fisher, Statistical Methods, Experimental Design and Scientific Inference, Oxford University Press (1990) 1 (first published in 1925).

[3] M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, Vol. 2, Griffin (1979).

[4] I. J. Good, Good Thinking, The Foundations of Probability and Its Applications, Univ. of Minnesota Press (1983).

G. E. P. Box and G. C. Tiao, Bayesian Inference in Statistical Analysis, Addison-Wesley (1973).

B. P. Roe, Probability and Statistics in Experimental Physics, Springer (1992).

[5] H. Jeffreys, Theory of Probability, Clarendon Press, Oxford (1961).

[6] B. Efron, Why isn't Everyone a Bayesian? Am. Stat. 40 (1986) 1.

D. V. Lindley, Wald Lectures: Bayesian Statistics (with discussion), Statistical Science 5 (1990) 44.

G. D'Agostini, Bayesian Reasoning versus Conventional Statistics in High Energy Physics, presented at the XVIII International Workshop on Maximum Entropy and Bayesian Methods, Garching, Preprint physics/9811046 (1998).

[7] R. D. Cousins, Why Isn't Every Physicist a Bayesian? Am. J. Phys. 63 (1995) 398.

[8] J. Neyman, Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability, Phil. Trans. A236 (1937) 333.

[9] C. Caso et al., Review of particle physics, Eur. Phys. J. C 3 (1998).

[10] T. A. Bancroft and Chien-Pai Han, Statistical theory and inference in research, ed. D. B. Owen, Dekker (1981).

[11] M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, Vol. 2, page 128, Griffin (1979).

[12] M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, Vol. 2, page 240, Griffin (1979).

[13] E. L. Lehmann, Testing Statistical Hypotheses, Wiley (1959).

[14] G. J. Feldman, talk given at the Workshop on Confidence Limits, held at FNAL (2000), unpublished.

[15] G. J. Feldman, R. D. Cousins, Unified approach to the classical statistical analysis of small signals, Phys. Rev. D 57 (1998) 3873.

[16] H. L. Harter, Criteria for best substitute interval estimators, with application to the normal distribution, J. Amer. Statist. Ass. 59 (1964) 1133.

[17] J. O. Berger and R. L. Wolpert, The Likelihood Principle, Lecture Notes of Inst. of Math. Stat., page 9, ed. S. S. Gupta, Hayward, Ca (1984).

W. James and C. Stein, Estimation with quadratic loss, Fourth Berkeley Symposium Math. Statist. and Prob., University of California Press, Berkeley (1961) 361.

P. Clifford, Interval estimation as viewed from the world of mathematical statistics, contribution to the Workshop on Confidence Limits, held at CERN, CERN 2000-005 (2000) 157.

[18] D. Basu, Statistical Information and Likelihood, Lecture Notes in Statistics, page 114, ed. J. K. Ghosh, Springer (1988).


[19] A. W. F. Edwards, Likelihood, page 114, The John Hopkins University Press (1992).

[20] J. D. Kalbfleisch and D. A. Sprott, Applications of likelihood methods to models involving largenumbers of parameters, J. Roy. Statist. Soc. B 32 (1970) 175.

[21] R. D. Cousins, Comments on methods for setting confidence limits, contribution to the Workshopon Confidence Limits, held at CERN, CERN 2000-005 (2000) 49.

[22] G. J. Feldman, in the Pannel Discussion of Workshop on Confidence Limits, held at CERN, CERN2000-005 (2000) 280.

[23] L. J. Savage et al., The Foundation of Statistical Inference, Methuen, London (1962).

[24] G. Zech, Upper limits in experiments with background or measurement errors, Nucl. Instr. andMeth. A277 (1989) 608.

[25] A. L. Read, Optimal statistical analysis of search results based on the likelihood ratio and its appli-cation to the search for the MSM Higgs boson at

√s = 161 and 172 GeV, DELPHI 97-158 PHYS

737 (1997).

[26] K. K. Gan et al., Incorporation of the statistical uncertainty in the background estimate into theupper limit on the signal., Nucl. Instr. and Meth. A412 (1998) 475.

[27] A. L. Read, Modified frequentist analysis of search results (The CLs method), contribution to theWorkshop on Confidence Limits, held at CERN, CERN 2000-005 (2000) 81.

[28] J. Conway, Setting limits and making discoveries in CDF, contribution to the Workshop on Confi-dence Limits, held at CERN, CERN 2000-005 (2000) 247.

[29] V. L. Highland, Comments on “Upper limits in experiments with background or measurement er-rors”, Nucl. Instr. and Meth. A398 (1997) 429.

[30] G. Zech, Reply to ’Comments on “Upper limits in experiments with background or measurementerrors” ’, Nucl. Instr. and Meth. A398 (1997) 431.

[31] R. D. Cousins, Additional Comments on Methods for Setting Confidence Limits, Contribution toWorkshop on Confidence Limits, held at FNAL (2000), unpublished.

[32] W. H. Jeffreys and J. O. Berger, Ockham’s Razor and Bayesian Analysis, American Scientist 80(1992) 64.

[33] F. James, private communication (1999).

[34] J. Kleinfeller et al., KARMEN 2: A new precision in the search for neutrino oscillations, Proc. ofthe 29th Int. Conf. on High Energy Phys., Vancouver (1998).

[35] L. Baudis et al., Limits on the Majorana neutrino mass in the 0.1 eV range, hep-ex/9902014 (1999).

[36] G. Zech, Objections to the unified approach to the computation of classical confidence limits,physics/9809035 (1998).

G. Zech, Confronting Classical and Bayesian Confidence Limits to Examples, presented at theWorkshop on Confidence Limits, held at CERN, CERN 2000-005, (2000) 141.

[37] G. Punzi, A stronger classical definition of confidence limits, hep-ex/9912048, and contribution to the Workshop on Confidence Limits, held at CERN, CERN 2000-005 (2000) 207.

[38] J. Bouchez, Confidence belts on bounded parameters, hep-ex/0001036 (2000).

[39] M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, Vol. 2, page 167, Griffin (1979).

[40] C. Giunti, A new ordering principle for the classical statistical analysis of Poisson processes with background, Phys. Rev. D 59, 053001 (1999).

[41] B. P. Roe and M. B. Woodroofe, Improved probability method for estimating signal in the presence of background, Phys. Rev. D 60, 053009 (1999).

[42] B. P. Roe and M. B. Woodroofe, Setting Confidence Belts, hep-ex/0007048 (2000).

[43] M. Mandelkern and J. Schultz, The Statistical Analysis of Gaussian and Poisson Signals Near Physical Boundaries, J. Math. Phys. 41, 5701 (2000), hep-ex/9910041.

[44] S. Ciampolillo, Small signal with background: Objective confidence intervals and regions for physical parameters from the principle of maximum likelihood, Nuovo Cimento A 111 (1998) 1415.

[45] H. Jeffreys, Further significance tests, Proc. Roy. Soc. A146 (1936) 416.

[46] I. Hacking, Logic of Statistical Inference, page 49, Cambridge University Press, Cambridge (1965).

[47] Confidence limits: What is the Problem? Is there the Solution?, contribution to the Workshop on Confidence Limits, held at CERN, CERN 2000-005 (2000) 3.

[48] E. T. Jaynes, Prior Probabilities, IEEE Trans. Systems Science and Cybernetics SSC-4 (1968) 227.

[49] G. D’Agostini, Jeffreys Priors versus Experienced Physicist Priors. Arguments against Objective Bayesian Theory, contribution to the 6th Valencia International Meeting on Bayesian Statistics, Alcossebre, preprint physics/9811045 (1998).

[50] H. B. Prosper, Small-signal analysis in high-energy physics: A Bayesian approach, Phys. Rev. D 37 (1988) 1153.

D. A. Williams, Comment on “Small-signal analysis in high-energy physics: A Bayesian approach”, Phys. Rev. D 38 (1988) 3582.

[51] D. Basu, Statistical Information and Likelihood, Lecture Notes in Statistics, ed. J. K. Ghosh, Springer-Verlag, New York (1988).

[52] J. O. Berger and R. L. Wolpert, The Likelihood Principle, Lecture Notes of Inst. of Math. Stat., Hayward, CA, ed. S. S. Gupta (1984).

[53] R. A. Fisher, Theory of statistical estimation, Proc. Camb. Phil. Soc. 22 (1925) 700-725, reprinted in Contributions to Mathematical Statistics, Wiley, New York (1950).

[54] J. D. Kalbfleisch, Sufficiency and conditionality, Biometrika 62 (1975) 251.

[55] G. A. Barnard, G. M. Jenkins and C. B. Winsten, Likelihood inference and time series, J. Roy. Statist. Soc. A 125 (1962) 32.

[56] G. A. Barnard and V. P. Godambe, Memorial Article, Allan Birnbaum 1923-1976, Ann. of Stat. 10 (1982) 1033.

[57] A. Birnbaum, More on concepts of statistical evidence, J. Amer. Statist. Assoc. 67 (1972) 858.

[58] W. Edwards, H. Lindman and L. J. Savage, Bayesian statistical inference for psychological research, Psychological Review 70 (1963) 193.

[59] D. Basu, Statistical Information and Likelihood, Lecture Notes in Statistics, page 69, ed. J. K. Ghosh, Springer-Verlag, New York (1988).

[60] L. J. Savage, The Foundations of Statistics Reconsidered, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, ed. J. Neyman (1961) 575.

[61] C. Stein, A remark on the likelihood principle, J. Roy. Statist. Soc. A 125 (1962) 565.

[62] M. Stone, Strong inconsistency from uniform priors, J. Amer. Statist. Assoc. 71 (1976) 114.

[63] B. Hill, On some statistical paradoxes and non-conglomerability, in Bayesian Statistics, ed. J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, University Press, Valencia (1981).

[64] I. J. Good, Some logic and history of hypothesis testing, in Philosophy in Economics, ed. J. Pitt, Reidel, Dordrecht (1981).

[65] G. Zech, Comparing statistical data to Monte Carlo simulation - parameter fitting and unfolding, DESY 95-113 (1995).

[66] D. Basu, On the relevance of randomization in data analysis, in Survey Sampling and Measurement, ed. N. K. Namboodiri, Academic Press, New York (1978).

[67] M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, Vol. 2, page 653, Griffin (1979).

[68] I. Narsky, Estimation of Upper Limits Using a Poisson Statistic, hep-ex/9904025 (1999).

[69] J. O. Berger and D. A. Berry, Statistical Analysis and the Illusion of Objectivity, American Scientist 76 (1988) 159.

[70] P. Diaconis and D. Freedman, On inconsistent Bayes estimates of location, Annals of Statistics 14 (1986) 68.

[71] J. Kiefer, Conditional Confidence Statements and Confidence Estimation, J. Amer. Statist. Assoc. 72 (1977) 789.

[72] Bureau International des Poids et Mesures, Rapport du Groupe de travail sur l’expression des incertitudes, P.V. du CIPM (1981) 49.

P. Giacomo, On the expression of uncertainties, in Quantum Metrology and Fundamental Physical Constants, ed. P. H. Cutler and A. A. Lucas, Plenum Press (1983).

International Organization for Standardization (ISO), Guide to the expression of uncertainty in measurement, Geneva (1993).

[73] P. Janot and F. Le Diberder, Combining “Limits”, CERN-PPE/97-53 (1997).

LEP working group for Higgs boson searches, P. Bock et al., Lower bound for the SM Higgs boson mass: combined result from four LEP experiments, CERN/LEPC 97-11 (1997).

T. Junk, Confidence Level Computation for Combining Searches with Small Statistics, hep-ex/9902006 (1999).
