Hypothesis Tests, Significance - Caltech High Energy Physics

45
Hypothesis Tests, Significance Significance as hypothesis test Pitfalls Avoiding pitfalls Goodness-of-fit Context is frequency statistics. 1 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Transcript of Hypothesis Tests, Significance - Caltech High Energy Physics

Page 1: Hypothesis Tests, Significance - Caltech High Energy Physics

Hypothesis Tests, Significance

Significance as hypothesis test

Pitfalls

Avoiding pitfalls

Goodness-of-fit

Context is frequency statistics.

1 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 2: Hypothesis Tests, Significance - Caltech High Energy Physics

Significance as Hypothesis Test

When asking for the “significance” of an observation (of, perhaps a new

effect), you ask for a test of the hypotheses:

Null hypothesis H0 : There is no new effect;

against

Alternative hypothesis H1 : There is a new effect.

Reject the null hypothesis (that is, claim a new effect) if the observa-

tion falls in a region that is “unlikely” if the null hypothesis is correct.

“Significance” (as typically used in HEP, e.g., “a significance of 5σ”)

is the probability that we erroneously reject the null hypothesis. Also

called “confidence level”, or “P -value”, or the probability of a “Type I

error”.

2 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 3: Hypothesis Tests, Significance - Caltech High Energy Physics

Significance

A 68% confidence interval does not always tell you much about signif-icance. The tails may be non-normal. A separate analysis is generallyrequired, which models the tails appropriately.

No recommendation on when to label result as “significant”. Labelimplies interpretation.

– No uniform prescription seems to make sense, involves judgement.Eg, bizarre new particle vs. expected branching fraction.

– Not our most essential experimental role; up to consumer ulti-mately to decide what they want to believe.

Nevertheless, people insist on making qualitative statements (“observa-tion of”, “evidence for”, “discovery of”, “not significant”, “consistentwith”)

Code: “observation of” ≡ > 4σ, “evidence for” ≡ > 3σ

3 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 4: Hypothesis Tests, Significance - Caltech High Energy Physics

Significance (more semantics. . . )From Physics Today, http://www.physicstoday.org/pt/vol-54/iss-9/p19.html(coloring mine, references deleted)nb: Just an observation on human nature, no criticism should be inferred.

“In March, back-to-back papers in Physical Review Letters reported themeasurement of CP symmetry violation in the decay of neutral B mesonsby groups in Japan and California. Now the word “measurement” hasbeen replaced by “observation” in the titles of two new back-to-backreports by these same groups in the 27 August Physical Review Letters.That is to say, with a lot more data and improved event reconstruction,the BaBar collaboration at SLAC and the Belle collaboration at KEK inJapan have at last produced the first compelling evidence of CP violationin any system other than the neutral K mesons.”

Some people think a measurement should not be called a “measurement”unless the result is significantly different from zero. A senior AssistantEditor at a prominent journal suggested that “bounds on” might be moreappropriate than “measurement” in reference to a CP asymmetry anglewhich was observed as consistent with zero.Finding sin 2β = 0.00 ± 0.01 would be pretty exciting.But it isn’t a “measurement”?

4 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 5: Hypothesis Tests, Significance - Caltech High Energy Physics

Computation of significance: Example

Flat normal background (avg 100/bin) + gaussian signal (100 events,

fixed) with mean 0, σ = 1

0

02

04

06

08

001

021

041

061

01864202-4-6-8-01-

Generated signal = 100 eventsCut&Count signal = 194 39 eventsSignificance = 5.0Expected significance = 2.7

x

Even

ts/0

.5

+_

Distribution is normal here, so P = 5.7 × 10−7 (two-tail).

Note: H0 is flat distribution with no signal.

H1 is Nsig �= 0.5 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 6: Hypothesis Tests, Significance - Caltech High Energy Physics

Note: The significance is not obtained by dividing the signal estimate

(194) by the uncertainty in the signal (39), 194/39 = 5.0. That would

be akin to asking how likely a signal of the estimated size would be to

fluctuate to zero. It is a good approximation in this example, however,

because B/S is large.

The significance was estimated here with a simple cut-and-count method:

– Level of background estimated from region |x| > 3.

– Counts in “signal region” |x| < 3 added up.

– The excess in the signal region over the estimated background is di-

vided by the square root of the estimated background in the signal region.

In this example, we assumed we knew the mean and width of the signal

we were looking for, as well as the background shape. The uncertainty

in the background estimate is negligible in this example.

A more sophisticated fit may yield a more powerful test by incorpo-

rating the known shape of the signal.6 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 7: Hypothesis Tests, Significance - Caltech High Energy Physics

What about Systematic Uncertainties?

B(e+e− → Nobel prize) = 10 ± 1 ± 5.

They may be important!

– Maybe the ±5 is a systematic uncertainty in the estimate of the

background expectation. A “10σ” statistical significance is really

only a “2σ” effect.

They may be irrelevant

– Maybe the ±5 is a systematic uncertainty on the efficiency, entering

as a multiplicative factor. It makes no difference to the significance

whether the result is 10 ± 1 or 5 ± 0.5.

They may be “fuzzy”

– E.g., how is the background expectation estimated? What is the

sampling distribution?

– E.g., Are “theoretical” uncertainties present?7 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 8: Hypothesis Tests, Significance - Caltech High Energy Physics

Systematic Uncertainties

“Blind checks” and “educated checks”:

– Blind check: testing for mistakes; no correction is expected. If

pass test, no contribution to systematic error. Eg, divide data into

chronological subsets and compare results.

– Educated check: measuring biases, corrections. May affect quoted

result. Always contributes to systematic error. Eg, dependence of

efficiency on model.

Quote systematic uncertainty separately from statistical

– Systematic uncertainty may contain statistical components, eg, MC

statistics in evaluation of efficiency.

8 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 9: Hypothesis Tests, Significance - Caltech High Energy Physics

Systematic Uncertainties Example: D mixing and DCSD revisited

– Want simple procedure. Willing toaccept approximation.

– Scale statistical-only contour uniformlyalong ray from best-fit value.

– Factor is√

1 +∑m2

i , where mi isan estimate of the effect of system-atic uncertainty i measured in unitsof the statistical uncertainty. Thisestimate is obtained by determiningthe effect of the systematic uncer-tainty on x′2, y′. -3 / 102x’

-0.5 0 0.5 1 1.5 2 2.5

y’

-0.06

-0.04

-0.02

-0

0.02

0.04

-3 / 102x’-0.5 0 0.5 1 1.5 2 2.5

y’

-0.06

-0.04

-0.02

-0

0.02

0.04

, y’)2

Physical (x’, y’) No CPV

2Central (x’95% CL CPV allowed95% CL CPV allowed, stat only95% CL CP conserved95% CL CP conserved, stat only

Method conservative (≡ lazy) in sense that scaling for a given systematic in one direc-

tion is applied uniformly in all directions. On the other hand, a linear approximation

is being made.

9 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 10: Hypothesis Tests, Significance - Caltech High Energy Physics

Aside: Significance as “nσ”HEP parlance is to say an effect has, e.g., “5σ” significance.

At face value, this means the observation is “5 standard deviations” awayfrom the mean:

σ ≡ 〈(x− x)2〉.

But we often don’t really mean this. Note that a 5σ effect of this sortmay not be improbable:

x

P(x)

0.1

0.8

0 1-1

0.1

P (|x− x| = 5σ) = 20% !

10 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 11: Hypothesis Tests, Significance - Caltech High Energy Physics

Aside: Significance as “nσ” (continued)Instead, we often mean that the probability (P -value) for the effect is

given by the probability of a fluctuation in a normal distribution 5σ fromthe mean, i.e.,

P = P (|x| > 5), for x ∈ N (0, 1)

= 5.7 × 10−7

(two-tailed probability).

But sometimes we really do mean 5σ, usually presuming that the sam-pling distribution is approximately normal. [This may not be an accuratepresumption when far out in the tails!]

Also now popular to call√−2Δ lnL the “n” in “nσ”.

From: L0(θ = 0;x) = 1√2πσ

exp[−1

2(x/σ

)2], Lmax(θ = x;x) = 1√

2πσ,

giving√−2Δ lnL =

√Δχ2 = x/σ = n.

Desirable to be more concise by quoting probabilities, or “P -values” asis common in the statistics world. At least say what you mean!

11 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 12: Hypothesis Tests, Significance - Caltech High Energy Physics

Estimating significance: Pitfalls

What are the dangers? In a nutshell:Unknown or unknowable sampling distributions

Ways to not know the distribution:

The Improbable Tails

Systematic Unknowns

The Stopping Problem

The exploratory Bump Hunt

12 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 13: Hypothesis Tests, Significance - Caltech High Energy Physics

The Improbable Tails

Our earlier example was known to be normal sampling.0

02

04

06

08

001

021

041

061

01864202-4-6-8-01-

Generated signal = 100 eventsCut&Count signal = 194 39 eventsSignificance = 5.0Expected significance = 2.7

x

Even

ts/0

.5

+_

Often, this is true approximately (central limit theorem).

But for significance, often interested in distribution far into the tails.The normal approximation may be very bad here!

If there is any doubt, need to compute the actual distribution. Typicallythis is done with a “toy Monte Carlo” to simulate the distribution of thesignificance statistic.

To get to the tails, this may require a fair amount of computing time.

Still need to be wary of pushing calculation beyond its validity as a modelof the actual distribution.

BaBar (and others) now routinely performs these calculations.

13 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 14: Hypothesis Tests, Significance - Caltech High Energy Physics

Systematic UnknownsNuisance parameters

– Unknown, but relevant parameters. Estimated somehow, but withsome uncertainty. Even if sampling distribution is known, cannot ingeneral derive exact P -values in lower dimensional parameter space.

– Central Limit Theorem is our friend.

– Can try other values besides best estimate of nuisance parameter.

– See our discussion of confidence intervals.

“Theoretical” systematic uncertainties. Guesses, no sampling distri-bution.

– Use worst case values when evaluating significance. requires under-standing what is meant by the theory “errors”.

– Or, give the dependence, e.g., as a range.

14 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 15: Hypothesis Tests, Significance - Caltech High Energy Physics

The Stopping ProblemThere is a strong tendency to work on an analysis until we are convincedthat we got it “right”, then we stop.

Simple example: “Keep sampling” until we are satisfied.

Motivate our example:

• Ample historical evidence that experimental measurements are some-times biased by some preconception of what the answer “should be”.For example, a preconception could be based on the result of anotherexperiment, or on some theoretical prejudice.

• A model for such a biased experiment is that the experimenter works“hard” until s/he gets the expected result, and then quits. Let’s Con-sider a simple example of a distribution which could result from sucha scenario.

15 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 16: Hypothesis Tests, Significance - Caltech High Energy Physics

Stopping Problem: Normal likelihood function example

• Consider an experiment in which a measurement of a parameter θcorresponds to sampling from a Gaussian distribution of standard de-viation one:

N (x; θ, 1)dx =1√2πe−(x−θ)2/2dx.

x

Freq

uenc

y

θ− 2 θ θ + 2

• Suppose the experimenter has a prejudice that θ is greater than one.

• Subconsciously, s/he makes measurements until the sample mean, m =1n

∑ni=1 xi, is greater than one, or until s/he becomes convinced (or

tired) after a maximum of N measurements.

• The experimenter then uses the sample mean, m, to estimate θ.

16 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 17: Hypothesis Tests, Significance - Caltech High Energy Physics

Stopping Problem: Normal likelihood function exampleFor illustration, assume that N = 2. In terms of the random variables

m and n, the pdf is:

f(m,n; θ) =

⎧⎪⎨⎪⎩

1√2πe−

12(m−θ)2, n = 1, m > 1

0, n = 1, m < 11πe

−(m−θ)2∫ 1

−∞ e−(x−m)2 dx n = 2

Num

ber

of e

xper

imen

ts

Sample mean

-4 -2 0 2 4 0

1000

2000

3000

4000

Histogram of sampling dis-tribution for m, with pdfgiven by above equation,for θ = 0.

The likelihood function, as a function of θ, has the shape of a normaldistribution, given any experimental result. The peak is at θ = m, so mis the maximum likelihood estimator for θ.

17 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 18: Hypothesis Tests, Significance - Caltech High Energy Physics

Stopping Problem: Normal likelihood function example

In spite of the normal form of the likelihood function, the samplemean is not sampled from a normal distribution. The “4σ” tail is moreprobable (for some θ) than the experimenter thinks.

0

10000.0

20000.0

30000.0

40000.0

50000.0

60000.0

70000.0

4202-4-6-8-θ

Pro

babi

lity(

<

m -

4s )

θ

Straight line is probability for a “4 ” fluctuationof a normal distrinbution.

σ

18 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 19: Hypothesis Tests, Significance - Caltech High Energy Physics

Stopping Problem: Normal likelihood function exampleThe likelihood function, as a function of θ, has the shape of a normal

distribution, given any experimental result. The peak is at θ = m, so mis the maximum likelihood estimator for θ. In spite of the normal form ofthe likelihood function, the sample mean is not sampled from a normaldistribution. The interval defined by where the likelihood function fallsby e−1/2 does not correspond to a 68% CI:

Prob

abili

ty (θ

-<θ<θ

+)

θ-2 0 2 4

0.60

0.65

0.70

0.75

19 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 20: Hypothesis Tests, Significance - Caltech High Energy Physics

Stopping Problem: Normal likelihood function example

• The experimenter in this scenario thinks s/he is taking n samples froma normal distribution, and makes probability statements (e.g., aboutsignificance) according to a normal distribution.

• S/he gets an erroneous result because of the mistake in the distribution.

• If the experimenter realizes that sampling was actually from a non-normal distribution, s/he can do a more careful analysis to obtainmore valid results.

Related effect: First “observations” tend to be biased high.Example? BES early fD = 371 vs fD ∼ 220 MeV more recently. Nothing“wrong” with this, but best estimates should average in earlier “null”data. Importance of reporting null results and quoting combinable re-sults, e.g., two-sided confidence intervals.

20 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 21: Hypothesis Tests, Significance - Caltech High Energy Physics

The Exploratory Bump Hunt

Context is the typical bump-hunting approach in HEP

That is, analysis is developed concurrent with looking at the data

We see a “bump”, how significant is it?

)2) (GeV/cψJ/-π+πm(3.8 4 4.2 4.4 4.6 4.8 5

2E

vent

s / 2

0 M

eV/c

0

10

20

30

40

)2) (GeV/cψJ/-π+πm(3.8 4 4.2 4.4 4.6 4.8 5

2E

vent

s / 2

0 M

eV/c

0

10

20

30

40

)2) (GeV/cψJ/-π+πm(3.8 4 4.2 4.4 4.6 4.8 5

2E

vent

s / 2

0 M

eV/c

0

10

20

30

40

)2) (GeV/cψJ/-π+πm(3.8 4 4.2 4.4 4.6 4.8 5

2E

vent

s / 2

0 M

eV/c

0

10

20

30

40

3.6 3.8 4 4.2 4.4 4.6 4.8 51

10

210

310

410

BaBare+e− → γISRπ

+π−J/ψP (H0) ∼ 10−11 − 10−16

If the analysis is our typical bump-hunt approach, it is impossible tocompute the significance!

We don’t know the sampling distribution (under null hypothesis).

Hence, we cannot compute probabilities (under null hypothesis, in par-ticular).

21 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 22: Hypothesis Tests, Significance - Caltech High Energy Physics

Bump Hunting – Example IRecall our earlier example:

Flat normal background (avg 100/bin) + gaussian signal (100 events)with mean 0, σ = 1

0

02

04

06

08

001

021

041

061

01864202-4-6-8-01-

Generated signal = 100 eventsCut&Count signal = 194 39 eventsSignificance = 5.0Expected significance = 2.7

x

Even

ts/0

.5

+_

Distribution is normal here, so P = 5.7 × 10−7 (two-tail).

Note: H0 is flat distribution with no signal.H1 is Nsig �= 0.

As long as we knew at the outset that we were looking for an effectwith mean 0 and σ = 1, we make correct inferences.

22 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 23: Hypothesis Tests, Significance - Caltech High Energy Physics

Bump Hunting – Example II

Looking for a mass bump.

0

02

04

06

08

001

021

041

061

0908070605040302010 0 0 0 00 00 0 0

Even

ts/1

0 M

eV

m(pp) - 2m(p) (MeV)_

Evidence for a threshold enhancement!Evidence for a threshold enhancement!

Background level = 100 events/binSignal = 40 12 eventsStatistical significance = 4.0

+_σ

One-tailed probability (P-value) = 2.6 10_ 5X

23 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 24: Hypothesis Tests, Significance - Caltech High Energy Physics

So, what does it mean?

Even though we cannot estimate probabilities, we still quote numbersintended to relay an idea of significance, usually quoted as some num-ber of “sigma”.

What we (usually) mean by this is:“If I had done the analysis in a controlled manner, and had beeninterested in the observed value for the mean and width, then thenull hypothesis would require a fluctuation of this number of standarddeviations of a Gaussian distribution to produce a bump as large as Isee”

24 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 25: Hypothesis Tests, Significance - Caltech High Energy Physics

Is it useful?With this understanding of the meaning, it is perhaps not completelyuseless, since we can interpret it in the context of our experience.

However, our experience is that we are sometimes fooled!pentaquarks example

Life is short, we have also made great discoveries with this approach.Just remember that quoted significances are highly misleading.

Note: The “threshold enhancement” in Example II is a statistical fluc-tuation! The sampling distribution was the same N (100, 10) for all bins.

0

02

04

06

08

001

021

041

061

0908070605040302010 0 0 0 00 00 0 0

Even

ts/1

0 M

eV

m(pp) - 2m(p) (MeV)_

Evidence for a threshold enhancement!Evidence for a threshold enhancement!

Background level = 100 events/binSignal = 40 12 eventsStatistical significance = 4.0

+_σ

One-tailed probability (P-value) = 2.6 10_ 5X

25 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 26: Hypothesis Tests, Significance - Caltech High Energy Physics

Avoiding pitfalls

Model those tails!

“Conservatism”

– Based on our experience with the failings of our methodology, wedon’t claim new discoveries lightly.

Do it better: Be Blind

26 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 27: Hypothesis Tests, Significance - Caltech High Energy Physics

It can be done “better”

Better means with more meaningful significance estimates, and avoid-ance of bias.

Do a “blind analysis”

– Blind means you design the experiment (analysis) before you lookat the results.

Goal is to know the sampling distribution.

BaBar routinely blinds it’s analyses, when there is a well-defined quantity(e.g., CP asymmetry or branching fraction) being measured.

27 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 28: Hypothesis Tests, Significance - Caltech High Energy Physics

Blinding MethodologiesThere are several basic approaches to blinding an analysis, which maybe chosen according to the problem. Many variations on these themes!

Don’t look inside the box

You can look, but keep the answer hidden

Obscure the real data (e.g., adding simulated signal to data)

Design on a dataset that will be thrown away

Divide and conquer (e.g., BNL muon g − 2 result, ωp (magnetic field)and ωa (muon precession) were analyzed independently, before com-bining to obtain g − 2; arXiv:hep-0401008v3)

[Reference: Klein & Roodman, Ann.Rev.Nucl.Part.Sci. 55 (2005) 141.]

28 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 29: Hypothesis Tests, Significance - Caltech High Energy Physics

Blind AnalysisMany BaBar results are obtained in “blind analyses”. Purpose is toavoid introduction of bias.

Not all: Exploratory analyses not blind, eg, discovery of theD∗sJ(2317)+.

Example of a blind analysis: B± → K±e+e−

After event selection, look for signal in distribution of two kinematicvariables ΔE and mES. Monte Carlo, control sample (usually a type ofdata resembling signal) studies of entire sideband and fit region. Dataonly looked at in large sideband region prior to unblinding.

) 2 (GeV/cESm5 5.05 5.1 5.15 5.2 5.25 5.3

E (

GeV

-0.5

-0.4

-0.3

-0.2

-0.1

-0

0.1

0.2

0.3

0.4

0.5

-e+e + K→ +B

large sideband

fit region

signal region

Signal MCData after unblinding

29 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 30: Hypothesis Tests, Significance - Caltech High Energy Physics

Blind Analysis (continued)Issue: Updating a blind analysis with additional data.

Simply add data, no change in analysis.

– May be impractical, undesirable, eg, re-reconstruction of entire dataset;improvements in tools such as particle identification.

– Sometimes would like to work harder to optimize analysis, or op-timize on different criterion (eg, for most precision instead of mostsensitivity).

Notion of “reblinding” and re-optimization.

– Considered safe to use variables which have not been inspected toocarefully in blind region.

– Not called a blind analysis.

“Pure” approach: Do a blind analysis only on the new dateset.

– OK to use whatever was learned from the old dataset.

– Can combine the results from the old and the new data.

30 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 31: Hypothesis Tests, Significance - Caltech High Energy Physics

Blind Analysis – Hiding the answer

In this approach, no data is hidden, just the “answer” (typically, anumber).

For example, a hidden, possibly random, offset may be applied tothe real answer to prevent it from being seen before the analysis isdesigned.

Δt(Hidden) =(

1−1

)(TAG)Δt + Offset,

Top: Not hidden; Bottom: HiddenBaBar, from Klein& Roodman, op. cit.

where TAG = ±1 according to the B flavor tag. This algorithm permitsviewing the decay time distributions without revealing the asymmetry.

31 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 32: Hypothesis Tests, Significance - Caltech High Energy Physics

Blinding a bump hunt

No different in principle than our other blind analyses.

The more you know (about the signal you are interested in), the betteryou can do.

Yes, it is harder. It takes discipline, and it might even cost “sensitiv-ity”.

But it is at least worth thinking about.

32 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 33: Hypothesis Tests, Significance - Caltech High Energy Physics

What cuts do I make?Again, the more we know, the better off we are!

– If we have good simulation of background processes, or backgrounddata, that’s great! We’ll use it.

– The less we know, the bigger the risk that we won’t be optimal.That’s life. Being non-optimal doesn’t mean wrong.

One approach: Sacrifice a portion of data for cut optimization.

– Ignore any “signals”. Or tune on them. . .

– If signal is real, it should show up in remaining blinded data. If it isa fluctuation, it will disappear.

0

02

04

06

08

001

021

041

061

0908070605040302010 0 0 0 00 00 0 0

Even

ts/1

0 M

eV

m(pp) - 2m(p) (MeV)_

Evidence for a threshold enhancement!Evidence for a threshold enhancement!

Background level = 100 events/binSignal = 40 12 eventsStatistical significance = 4.0

+_σ

One-tailed probability (P-value) = 2.6 10_ 5X

⇒⇒80

100

120

140

0

20

40

60

0 10 20 30 40 50 60 70 80 90

– Systematic effects will show up both places. But that’s anothermatter.

33 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 34: Hypothesis Tests, Significance - Caltech High Energy Physics

Significance/GOF: Counting Degrees of Freedom

The following situation arises with some frequency (with variations):

I do two fits to the same dataset (say a histogram with N bins):

Fit A has nA parameters, with χ2A [or perhaps −2 lnLA].

Fit B has a subset nB of the parameters in fit A, with χ2B, where the

nA − nB other parameters (call them θ) are fixed at zero.

What is the distribution of χ2B − χ2

A?

See SWiG thread:

http://babar-hn.slac.stanford.edu:5090/HyperNews/get/Statistics/300.html

Luc Demortier in:http://phystat-lhc.web.cern.ch/phystat-lhc/program.html[this is recommended reading]

34 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 35: Hypothesis Tests, Significance - Caltech High Energy Physics

Counting Degrees of Freedom (continued)In the asymptotic limit (that is, as long as the normal sampling distri-bution is a valid approximation),

Δχ2 ≡ χ2B − χ2

A

is the same as a likelihood ratio (−2 lnλ) statistic for the test:

H0 : θ = 0 against H1 : some θ �= 0

Δχ2 is distributed according to a χ2(NDOF = nA − nB) distribution aslong as:

Parameter estimates in λ are consistent (converge to the correct values)under H0.

Parameter values in the null hypothesis are interior points of the main-tained hypothesis (union of H0 and H1).

There are no additional nuisance parameters under the alternativehypothesis.

35 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 36: Hypothesis Tests, Significance - Caltech High Energy Physics

Counting Degrees of Freedom (continued)

A commonly-encountered situation violates these requirements: Wecompare fits with and without a signal component to estimate signifi-cance of the signal.

If the signal fit has, e.g., a parameter for mass, this constitutes anadditional nuisance parameter under the alternative hypothesis.

If the fit for signal constrains the yield to be non-negative, this violatesthe interior point requirement.

Freq

uenc

y

0 2 4 6 8 10 12

040

0080

0012

000

2Δχ (H0-H1)

H0: Fit with no signalH1: Float signal yield

Curve: χ2 (1 dof )

Freq

uenc

y

0 2 4 6 8 10 12

050

0015

000

2500

0

2Δχ (H0-H1)

H0: Fit with no signalH1: Float signal yield, constrained 0

_>

Curve: χ2 (1 dof )

Freq

uenc

y0 2 4 6 8 10 12

010

0030

0050

00

H0: Fit with no signalH1: Float signal yield and location

Blue: χ2 (1 dof )

2Δχ (H0-H1)

χ2 (2 dof )Red:

36 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 37: Hypothesis Tests, Significance - Caltech High Energy Physics

Counting Degrees of Freedom - Sample fits50

080

0

H0 fit H1 fit

Fix meanAny signal

0 40 80

500

800

x

Cou

nts/

bin

H1 fit

Fix meanSignal>=0

0 40 80x

H1 fit

Float meanAny signal

500

800

H0 fit H1 fit

Fix meanAny signal

0 40 80

500

800

x

Cou

nts/

bin

H1 fit

Fix meanSignal>=0

0 40 80x

H1 fit

Float meanAny signal

Fits to data with no signal. Fits to data with a signal.

37 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 38: Hypothesis Tests, Significance - Caltech High Energy Physics

Significance – Conclusions

Evaluation of significance is a “hypothesis test”. It is essentially thesame problem as evaluating confidence intervals.

– Except for the more obvious role played by improbable tails.

Pitfalls amount to (not) knowing the sampling distribution.

Techniques exist to avoid pitfalls:

– Simulating the sampling distribution

– Vary the nuisance/theory parameters

– Blind your experiment

Doing it properly requires patience and discipline; the benefit is a moremeaningful, convincing result to yourself and to others.

38 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 39: Hypothesis Tests, Significance - Caltech High Energy Physics

Goodness-of-FitNo perfect general goodness-of-fit test:

– Given a dataset generated under null hypothesis, can usually finda test which rejects the null hypothesis (ie, choosing the test afteryou see the data is dangerous).

– Given a dataset generated under alternative hypothesis, can usuallyfind a test for which the null passes (ie, should think about whatyou want to test for).

Nominal recommendation:

– If you have a specific question, test for that.

– χ2 test when valid.

– Consider more general likelihood ratio test, Kolmogorov-Smirnov,etc., otherwise.

– Monte Carlo evaluation of distribution of test statistic.

39 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 40: Hypothesis Tests, Significance - Caltech High Energy Physics

Goodness-of-Fit (continued)

But recognize when test may not answer desired question, eg, in sin 2β

analysis, a likelihood ratio (or a χ2) test on the time distribution may

have little sensitivity to testing goodness-of-fit of the asymmetry.

Ent

ries

/ 0.

6 ps

B0 tags

B− 0 tags

Background

Δt (ps)

Raw

Asy

mm

etry 0

50

100

150

-0.5

0

0.5

-5 0 5

40 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 41: Hypothesis Tests, Significance - Caltech High Energy Physics

Consistency of two correlated results

BaBar has encountered several times the question of whether a newanalysis is consistent with an old analysis.

Often, new analysis is a combination of additional data plus changed(improved. . . ) analysis of original data.

The stickiest issue is handling the correlation in testing for consistencyin the overlapping data.

People sometimes have difficulty understanding that statistical differ-ences can arise even comparing results based on the same events.

Given a sampling θ1, θ2 from a bivariate normal distributionN(θ, σ1, σ2, ρ),with 〈θ1〉 = 〈θ2〉 = θ, the difference Δθ ≡ θ2 − θ1 is N(0, σ)-distributedwith σ2 = σ2

1 + σ22 − 2ρσ1σ2.

If the correlation is unknown, all we can say is that the variance of the

difference is in the range (σ1 − σ2)2 . . . (σ1 + σ2)2. If we at least believe

ρ ≥ 0 then the maximum variance of the difference is σ21 + σ2

2.

41 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 42: Hypothesis Tests, Significance - Caltech High Energy Physics

Consistency – Simple example of two analyses on same eventsSuppose we measure a neutrino mass, m, in a sample of n = 10

independent events. The measurements are xi, i = 1, . . . , 10. Assumethe sampling distribution for xi is N(m,σi).

We may form unbiased estimator, m1, for m:

m1 = 1n

∑ni=1 xi ±

√1n2

∑ni=1 σ

2i .

The result (from a MC) is m1 = 0.058 ± 0.039.

Then we notice that we have some further information which might beuseful: we know the experimental resolutions, σi for each measurement.We form another unbiased estimator, m2, for m:

m2 =

∑ni=1

xiσ2i∑n

i=11σ2i

± 1√∑ni=1

1σ2i

.

The result (from the same MC) is m1 = 0.000 ± 0.016.42 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 43: Hypothesis Tests, Significance - Caltech High Energy Physics

Example continued

The results are certainly correlated, so question of consistency arises

(we know the error on the difference is between 0.023 and 0.055). In this

example, the difference between the results is 0.058 ± 0.036, where the

0.036 error includes the correlation (ρ = 0.41).

43 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 44: Hypothesis Tests, Significance - Caltech High Energy Physics

Consistency – Evaluating the CorrelationArt Snyder developed an approximate formula for evaluating the correlation in a com-parison of maximum likelihood analyses (eg, in one-dimensional case).

Suppose we perform two maximum likelihood analysis, with event likelihoods L1, L2, onthe same set of events [nb, may use different information in each analysis]. The resultsare estimators θ1, θ2 for parameter θ. The correlation coefficient ρ may be estimatedaccording to:

ρ ≈∑N

i=1Rid lnL1i

dθ|θ=θ1

d lnL2idθ

|θ=θ2√(∑Ni=1

d2 lnL1idθ2 |θ=θ0

)(∑Ni=1

d2 lnL2idθ2 |θ=θ0

),

where (θ0 is an expansion reference point)

Ri =[1 − (θ1 − θ0)

d2 lnL1i

dθ2|θ=θ0

/d lnL1i

dθ|θ=θ0

] [1 − (θ2 − θ0)

d2 lnL2i

dθ2|θ=θ0

/d lnL2i

dθ|θ=θ0

].

If θ0 ≈ θ1 ≈ θ2, then

ρ ≈ σθ1σθ2

N∑i=1

d lnL1i

dθ|θ=θ0

d lnL2i

dθ|θ=θ0

,

where σ2θk≡ 1/

∑Ni=1

(dLkidθ

|θ=θ0

)2

.

44 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Page 45: Hypothesis Tests, Significance - Caltech High Energy Physics

Consistency – Example: sin 2β

32 × 106BB pairs – PRL, vol 87, 27 August 2001:

sin 2β = 0.59 ± 0.14(stat) ± 0.05(syst)

62 × 106BB pairs – SLAC-PUB-9153, March 2002:

sin 2β = 0.75 ± 0.09(stat) ± 0.04(syst)

Second result includes the earlier data, re-reconstructed. Analysis in-volves multivariate maximum likelihood fits; reprocessing changes, eg,relative likelihood for an event to be signal or background. Not simplycounting events. Question: are the two results statistically consistent?

If these were independent data sets, a difference of 0.16±0.17 would notbe a worry. The issue is the correlation.

A specialized analysis deriving from the previous formula is performed

on the events in common between the two analyses. A correlation of

ρ = 0.87 is deduced, yielding a difference of ∼ 2.2σ.

45 Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008