TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility...

58
TODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping vs Optional Continuation 5. Gambling Interpretation, Again 6. Types of S-Values / GROW S-Values [Next Week No Lecture โ€“ Homework due Mo May 11]

Transcript of TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility...

Page 1: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

TODAY: Safe Testing

1. Reproducibility Crisis/Problems with p-values

2. The S-Value

3. Optional Continuation

4. Optional Stopping vs Optional Continuation

5. Gambling Interpretation, Again

6. Types of S-Values / GROW S-Values

[Next Week No Lecture โ€“ Homework due Mo May 11]

Page 2: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Testing

Peter Grรผnwald

Centrum Wiskunde & Informatica โ€“ Amsterdam

Mathematical Institute โ€“ Leiden University

with Rianne de Heide,

Wouter Koolen, Judith

ter Schure, Alexander

Ly, Rosanne Turner

Page 3: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Slate Sep 10th 2016: yet another classic finding in

psychologyโ€”that you can smile your way to

happinessโ€”just blew upโ€ฆ

Reproducibility Crisis

Cover Story of

Economist (2013),

Science (2014)

Page 4: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Reasons for Reproducibility Crisis

1. Publication Bias

2. Problems with Hypothesis Testing Methodology

Page 5: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping
Page 6: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping
Page 7: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Xkcd.org

Page 8: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Reasons for Reproducibility Crisis

1. Publication Bias

2. Problems with Hypothesis Testing Methodology

Page 9: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Replication Crisis in

Science somehow related to use of p-values and

significance testingโ€ฆ

Page 10: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Replication Crisis in

Science somehow related to use of p-values and

significance testingโ€ฆ

Page 11: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Replication Crisis in

Science somehow related to use of p-values and

significance testingโ€ฆ

Page 12: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Replication Crisis in

Science somehow related to use of p-values and

significance testingโ€ฆ

Page 13: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Replication Crisis in

Science somehow related to use of p-values and

significance testingโ€ฆ

Page 14: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

โ€ข Suppose reseach group A tests medication, gets

โ€˜almost significantโ€™ result.

โ€ข ...whence group B tries again on new data. How to

combine their test results?

โ€ข Standard Method 1: sweep data together, recompute p-

value. This is not correct; type-I error guarantee does not

hold any more

โ€ข Standard method 2: use Fisherโ€™s method for

combining p-values. Again not correct, since tests

cannot be viewed as independent

โ€ข Standard method 3: multiply p-values. Just plain

wrong โ€“ a mortal sin!

โ€ข With the type of โ€œp-valueโ€ introduced here, despite

dependence, evidences can still be safely multiplied

P-value Problem:

Combining Dependent Tests

Page 15: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

โ€ข Suppose reseach group A tests medication, gets

โ€˜almost significantโ€™ result.

โ€ข Sometimes group A canโ€™t resist to test a few

more subjects themselves...

โ€ข A recent survey revealed that 55% of psychologists have

succumbed to this practice (and then treat data as if large

sample size was determined in advance)

โ€ข But isnโ€™t this just cheating?

โ€ข Not clear: what if you submit a paper and the referee

asks you to test a couple more subjects? Should you

refuse because it invalidates your p-values!?

P-value Problem (b):

Extending Your Test

Page 16: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

S is the new P

โ€ข We propose a generic replacement of

the ๐‘-value that we call the ๐‘†-value

โ€ข ๐‘†-values handle optional continuation

(to the next test (and the next, and ..))

without any problems

(can simply multiply S-values of

individual tests, despite dependencies)

Page 17: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

S is the new P

โ€ข We propose a generic replacement of

the ๐‘-value that we call the ๐‘†-value

โ€ข ๐‘†-values handle optional continuation

(to the next test (and the next, and ..))

without any problems

(can simply multiply S-values of

individual tests, despite dependencies)

S-values have Fisherian, Neymanian and Bayes-

Jeffreysโ€™ aspects to them, all at the same time

Cf. J. Berger (2003, IMS Medaillion Lecture): Could

Neyman, Fisher and Jeffreys have agreed on testing?

Page 18: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

S-Values: General Definition

โ€ข Let ๐ป0 = ๐‘ƒ๐œƒ ๐œƒ โˆˆ ฮ˜0} represent the null hypothesis

โ€ข Assume data ๐‘‹1, ๐‘‹2, โ€ฆ are i.i.d. under all ๐‘ƒ โˆˆ ๐ป0 .

โ€ข Let ๐ป1= ๐‘ƒ๐œƒ ๐œƒ โˆˆ ฮ˜1} represent alternative hypothesis

โ€ข An S-value for sample size ๐‘› is a function

such that for all ๐‘ƒ0 โˆˆ ๐ป0 , we have

Page 19: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

First Interpretation: p-values

โ€ข Proposition: Let S be an S-value. Then ๐‘†โˆ’1 ๐‘‹๐‘› is a

conservative p-value, i.e. p-value with wiggle room:

โ€ข for all ๐‘ƒ โˆˆ ๐ป0, all 0 โ‰ค ๐›ผ โ‰ค 1 ,

โ€ข Proof: just Markovโ€™s inequality!

Page 20: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Tests

โ€ข The Safe Test against ๐ป0 at level ๐›ผ based on S-

value S is defined as the test which rejects ๐ป0 if

S ๐‘‹๐‘› โ‰ฅ1

๐›ผ

โ€ข Since ๐‘†โˆ’1 is a conservative ๐‘-valuue...

โ€ข ....the safe test which rejects ๐ป0 iff ๐‘†(๐‘‹๐‘›) โ‰ฅ 20 , i.e.

๐‘†โˆ’1 ๐‘‹๐‘› โ‰ค 0.05 , has Type-I Error Bound of 0.05

Page 21: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Testing and Bayes

โ€ข Bayes factor hypothesis testing

with ๐ป0 = ๐‘๐œƒ ๐œƒ โˆˆ ฮ˜0} vs ๐ป1 = ๐‘๐œƒ ๐œƒ โˆˆ ฮ˜1} :

Evidence in favour of ๐ป1 measured by

where

(Jeffreys โ€˜39)

Page 22: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Testing and Bayes, simple ๐‘ฏ๐ŸŽ

Bayes factor hypothesis testing

between ๐ป0 = { ๐‘0} and ๐ป1 = ๐‘๐œƒ ๐œƒ โˆˆ ฮ˜1} :

Bayes factor of form

Note that (no matter what prior ๐‘Š1 we chose)

Page 23: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Testing and Bayes, simple ๐‘ฏ๐ŸŽ

Bayes factor hypothesis testing

between ๐ป0 = { ๐‘0} and ๐ป1 = ๐‘๐œƒ ๐œƒ โˆˆ ฮ˜1} :

Bayes factor of form

Note that (no matter what prior ๐‘Š1 we chose)

The Bayes Factor for Simple ๐‘ฏ๐ŸŽ

is an S-value!

Page 24: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Default S-Value โ‰  Neyman

1. ๐ป0 and ๐ป1 are point hypotheses โ€“ then default S-

value is:

... the safe test based on ๐‘† looks a bit like, but is not a

standard Neyman-Pearson test.

Safe Test: reject if ๐‘† ๐‘‹๐œ โ‰ฅ 1/๐›ผ

NP: reject if ๐‘† ๐‘‹๐œ โ‰ฅ 1/๐ต with ๐ต s.t. ๐‘ƒ0 ๐‘† ๐‘‹๐œ โ‰ฅ ๐ต = ๐›ผ

more conservative

Page 25: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Tests are Safe

under optional continuation

โ€ข Suppose we observe data (๐‘‹1, ๐‘Œ1), ๐‘‹2, ๐‘Œ2 , โ€ฆ

โ€ข ๐‘Œ๐‘–: side information

...coming in batches of size ๐‘›1, ๐‘›2, โ€ฆ , ๐‘›๐‘˜. Let

โ€ข We first evaluate some S-value ๐‘†1 on (๐‘‹1, โ€ฆ , ๐‘‹๐‘›1).

โ€ข If outcome is in certain range (e.g. promising but not

conclusive) and ๐‘Œ๐‘›1has certain values (e.g. โ€˜boss has

money to collect more dataโ€™) then....

we evaluate some S-value ๐‘†2 on ๐‘‹๐‘›1+1, โ€ฆ , ๐‘‹๐‘2 ,

otherwise we stop.

Page 26: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Tests are Safe

โ€ข We first evaluate ๐‘†1.

โ€ข If outcome is in certain range and ๐‘Œ๐‘›1 has certain

values then we evaluate ๐‘†2 ; otherwise we stop.

โ€ข If outcome of ๐‘†2 is in certain range and ๐‘Œ๐‘2 has

certain values then we compute ๐‘†3 , else we stop.

โ€ข ...and so on

โ€ข ...when we finally stop, after say ๐พ data batches, we

report as final result the product

โ€ข First Result, Informally: any ๐‘บ composed of S-

values in this manner is itself an S-value,

irrespective of the stop/continue rule used!

Page 27: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Tests are Safe

โ€ข ๐‘†๐‘— may be same function as ๐‘†๐‘—โˆ’1, e.g. (simple ๐ป0)

โ€ข But choice of ๐‘—th S-value ๐‘†๐‘— may also depend on

previous ๐‘‹๐‘๐‘— , ๐‘Œ๐‘๐‘— , e.g.

and then (full compatibility with Bayesian updating)

Page 28: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Tests are Safe

Let ๐‘†1 be S-value on . For ๐‘— = 1,2,โ€ฆ , let

be any collection of S-values defined on

Let be arbitrary

stop/continue strategy, and:

Define if

Define if

else

else

and so on...

Define if

Page 29: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Tests are Safe

Theorem:

Suppose that for all ๐‘–, ๐‘‹๐‘– โŠฅ ๐‘‹๐‘–โˆ’1, ๐‘Œ๐‘–โˆ’1 . Then

๐‘†, the end-product of all employed S-values

๐‘†1, ๐‘†๐‘”1 , ๐‘†๐‘”2, โ€ฆ is itself an S-value

โ€ข Technically, the process is a

nonnegative supermartingale (Ville โ€˜39) and the theorem is

proved using Doobโ€™s optional stopping theorem

Page 30: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Tests are Safe

Theorem:

๐‘† , the end-product of all employed S-values

๐‘†1, ๐‘†๐‘”1 , ๐‘†๐‘”2, โ€ฆ is itself an S-value

Corollary: Type-I Error Guarantee Preserved

under Optional Continuation

Suppose we combine S-values with arbitrary

stop/continue strategy and reject ๐ป0 when final ๐‘† has

๐‘†โˆ’1 โ‰ค 0.05 . Then resulting test is a safe test and our

Type-I Error is guaranteed to be below 0.05!

Page 31: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Tests are Safe

Theorem:

๐‘† , the end-product of all employed S-values

๐‘†1, ๐‘†๐‘”1 , ๐‘†๐‘”2, โ€ฆ is itself an S-value

Corollary: Type-I Error Guarantee Preserved

under Optional Continuation

Suppose we combine S-values with arbitrary

stop/continue strategy and reject ๐ป0 when final ๐‘† has

๐‘†โˆ’1 โ‰ค 0.05 . Then resulting test is a safe test and our

Type-I Error is guaranteed to be below 0.05!

Page 32: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Generalizing the Result

Theorem says:

โ€ข Let where

s.t.

, is S-value:

โ€ข Suppose that for all ๐‘–, ๐‘‹๐‘– โŠฅ ๐‘‹๐‘–โˆ’1, ๐‘Œ๐‘–โˆ’1 .

โ€ข Let ๐œ be smallest ๐‘— such that ๐‘”๐‘— ๐‘๐‘๐‘— = ๐‘ ๐‘ก๐‘œ๐‘

โ€ข Then is an S-value

Page 33: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Generalizing the Result

Theorem says:

โ€ข Let where

For all ๐‘—, let and let โ€ฆ

, is S-value:

โ€ข s.t.

โ€ข Suppose that for all ๐‘–, ๐‘‹๐‘– โŠฅ ๐‘‹๐‘–โˆ’1, ๐‘Œ๐‘–โˆ’1 .

โ€ข Let ๐œ be smallest ๐‘— such that ๐‘”๐‘— ๐‘๐‘๐‘— = ๐‘ ๐‘ก๐‘œ๐‘

โ€ข Then is an S-value

Page 34: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Generalizing the Result

Theorem says: for all ๐‘—,

โ€ข Let ๐‘‹โŸจ๐‘—โŸฉ = (๐‘‹๐‘๐‘—โˆ’1 +1, โ€ฆ , ๐‘‹๐‘๐‘—) ; ๐‘‹โŸจ๐‘—โŸฉ = (๐‘‹1, โ€ฆ , ๐‘‹๐‘๐‘—

);

similarly for ๐‘Œ, ๐‘

โ€ข Let be a function such that

โ€ข Let ๐œ be a stopping time for the process ๐‘ 1 , ๐‘ 2 , โ€ฆ

โ€ข Then is an S-value

Page 35: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Optional Stopping vs

Continuation

โ€ข Let ๐‘‹โŸจ๐‘—โŸฉ = (๐‘‹๐‘๐‘—โˆ’1 +1, โ€ฆ , ๐‘‹๐‘๐‘—) ; ๐‘‹โŸจ๐‘—โŸฉ = (๐‘‹1, โ€ฆ , ๐‘‹๐‘๐‘—

)

โ€ข Let be a function such that

โ€ข Let ๐œ be a stopping time for the process ๐‘ 1 , ๐‘ 2 , โ€ฆ

โ€ข Then is an S-value

Optional Continuation: we have batches of data

๐‘โŸจ1โŸฉ, ๐‘โŸจ2โŸฉ, โ€ฆ and we can do optional stopping at the batch-

level (and obtain an S-Value and preserve Type I error

guarantees)

Traditional Optional Stopping: we take ๐‘โŸจ๐‘— โŸฉ = ๐‘๐‘— for all ๐‘—.

Page 36: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Optional Stopping vs

Continuation

โ€ข In many but certainly not all cases, we can also do

optional stopping based on S-values.

โ€ข Suppose that ๐ป0 = {๐‘ƒ0}, , no ๐‘Œ๐‘– โ€˜s

โ€ข Then and we can do optional

stopping at each ๐‘— and not just optional continuation

between โ€˜blocksโ€™

โ€ข โ€ฆbut if ๐ป0 composite then sometimes the only

conditional S-value satisfying

is given by the trivial S-value ๐‘†๐‘—+1 = 1 . Then OS

impossible (but OC with batches of size ๐‘›๐‘— โ‰ซ 1 still

possible)

Page 37: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

(Again:) Safe Testing = Gambling!

โ€ข At time 1 you can buy ticket 1 for 1$. It pays off

๐‘†1(๐‘‹1, โ€ฆ , ๐‘‹๐‘›1) $ after ๐‘›1 steps

โ€ข At time 2 you can buy ticket 2 for 1$. It pays off

๐‘†2(๐‘‹๐‘›1+1, โ€ฆ , ๐‘‹๐‘2) $ after ๐‘›2 further steps.... and so on.

You may buy multiple and fractional nrs of tickets.

Kelly (1956)

Page 38: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Testing = Gambling!

โ€ข At time 1 you can buy ticket 1 for 1$. It pays off

๐‘†1(๐‘‹1, โ€ฆ , ๐‘‹๐‘›1) $ after ๐‘›1 steps

โ€ข At time 2 you can buy ticket 2 for 1$. It pays off

๐‘†2(๐‘‹๐‘›1+1, โ€ฆ , ๐‘‹๐‘2) $ after ๐‘›2 further steps.... and so on.

You may buy multiple and fractional nrs of tickets.

โ€ข You start by investing 1$ in ticket 1.

Page 39: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Testing = Gambling!

โ€ข At time 1 you can buy ticket 1 for 1$. It pays off

๐‘†1(๐‘‹1, โ€ฆ , ๐‘‹๐‘›1) $ after ๐‘›1 steps

โ€ข At time 2 you can buy ticket 2 for 1$. It pays off

๐‘†2(๐‘‹๐‘›1+1, โ€ฆ , ๐‘‹๐‘2) $ after ๐‘›2 further steps.... and so on.

You may buy multiple and fractional nrs of tickets.

โ€ข You start by investing 1$ in ticket 1.

โ€ข After ๐‘›1 outcomes you either stop with end capital ๐‘†1or you continue and buy ๐‘†1 tickets of type 2.

Page 40: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Testing = Gambling!

โ€ข At time 1 you can buy ticket 1 for 1$. It pays off

๐‘†1(๐‘‹1, โ€ฆ , ๐‘‹๐‘›1) $ after ๐‘›1 steps

โ€ข At time 2 you can buy ticket 2 for 1$. It pays off

๐‘†2(๐‘‹๐‘›1+1, โ€ฆ , ๐‘‹๐‘2) $ after ๐‘›2 further steps.... and so on.

You may buy multiple and fractional nrs of tickets.

โ€ข You start by investing 1$ in ticket 1.

โ€ข After ๐‘›1 outcomes you either stop with end capital ๐‘†1or you continue and buy ๐‘†1 tickets of type 2. After ๐‘2 =๐‘›1 + ๐‘›2 outcomes you stop with end capital ๐‘†1 โ‹… ๐‘†2 or

you continue and buy ๐‘†1 โ‹… ๐‘†2 tickets of type 3, and so

on..

Page 41: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Testing = Gambling!

โ€ข You start by investing 1$ in ticket 1.

โ€ข After ๐‘›1 outcomes you either stop with end capital ๐‘†1or you continue and buy ๐‘†1 tickets of type 2. After

๐‘2 = ๐‘›1 + ๐‘›2 outcomes you stop with end capital ๐‘†1 โ‹…๐‘†2 or you continue and buy ๐‘†1 โ‹… ๐‘†2 tickets of type 3,

and so on...

โ€ข ๐‘บ is simply your end capital

โ€ข Your donโ€™t expect to gain money, no matter what the

stop/continuation rule since none of individual

gambles ๐‘บ๐’Œ are strictly favorable to you

Page 42: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Testing = Gambling!

โ€ข You start by investing 1$ in ticket 1.

โ€ข After ๐‘›1 outcomes you either stop with end capital ๐‘†1or you continue and buy ๐‘†1 tickets of type 2. After

๐‘2 = ๐‘›1 + ๐‘›2 outcomes you stop with end capital ๐‘†1 โ‹…๐‘†2 or you continue and buy ๐‘†1 โ‹… ๐‘†2 tickets of type 3,

and so on...

โ€ข ๐‘บ is simply your end capital

โ€ข Your donโ€™t expect to gain money, no matter what the

stop/continuation rule since none of individual

gambles ๐‘บ๐’Œ are strictly favorable to you

โ€ข Hence a large value of ๐‘บ indicates that something

very unlikely has happened under ๐ป0 ...

Page 43: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Default S-Value โ‰  Neyman

1. ๐ป0 and ๐ป1 are point hypotheses โ€“ then default S-

value is:

... the safe test based on ๐‘† looks a bit like, but is not a

standard Neyman-Pearson test.

Safe Test: reject if ๐‘† ๐‘‹๐œ โ‰ฅ 1/๐›ผ

NP: reject if ๐‘† ๐‘‹๐œ โ‰ฅ 1/๐ต with ๐ต s.t. ๐‘ƒ0 ๐‘† ๐‘‹๐œ โ‰ฅ ๐ต = ๐›ผ

more conservative

Page 44: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

SafeTests & Neyman-Pearson, again

โ€ข Let ๐‘ be a strict ๐‘-value: for all ๐‘ƒ โˆˆ ๐ป0, ๐‘ƒ ๐‘ โ‰ค ๐›ผ = ๐›ผ

โ€ข Let ๐‘† =1

๐›ผif ๐‘ โ‰ค ๐›ผ , and ๐‘† =0 otherwise

โ€ข Then for all ๐‘ƒ โˆˆ ๐ป0,

...so ๐‘† is an S-value, and obviously, the safe test based

on ๐‘† rejects iff ๐‘ โ‰ค ๐›ผ. It thus implements the Neyman-

Pearson test at significance level ๐›ผ.

Page 45: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

โ€ข Let ๐‘ be a strict ๐‘-value: for all ๐‘ƒ โˆˆ ๐ป0, ๐‘ƒ ๐‘ โ‰ค ๐›ผ = ๐›ผ

โ€ข Let ๐‘† =1

๐›ผif ๐‘ โ‰ค ๐›ผ , and ๐‘† =0 otherwise

โ€ข Then for all ๐‘ƒ โˆˆ ๐ป0,

...so ๐‘† is an S-value, and obviously, the safe test based

on ๐‘† rejects iff ๐‘ โ‰ค ๐›ผ. t thus implements the Neyman-

Pearson test at significance level ๐›ผ.

...but it is a very silly S-value to use! With

probability ๐œถ, you loose all your capital, and you will

never make up for that in the future!

SafeTests & Neyman-Pearson, again

Page 46: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Safe Tests and Neyman-Pearson,

again

โ€ข The Safe Test based on an S-Value that is a

likelihood ratio is not a Neyman-Pearson test (it is

more conservative)

โ€ข Neyman-Pearson tests (that only report โ€˜rejectโ€™

and โ€˜acceptโ€™, and not the p-value) are (other) Safe

Tests, but useless ones corresponding to

irresponsible gambling...

Page 47: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Some S-Values are

Better than Others

โ€ข The Trivial S-Value ๐‘† = 1 is valid, but useless

โ€ข The Neyman-Pearson S-value is valid, but extremely

dangerous to use!

โ€ข We need some idea of โ€˜optimal S-valueโ€™

Page 48: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

How to design S-Values?

โ€ข Suppose we are willing to admit that weโ€™ll only be

able to tell ๐ป0 and ๐ป1 apart if ๐‘ƒ โˆˆ ๐ป0 โˆช ๐ป1โ€ฒ for some

๐ป1โ€ฒ โŠ‚ ๐ป1 that excludes points that are โ€˜too closeโ€™ to ๐ป0

e.g.

โ€ข We can then look for the GROW (growth-optimal in

worst-case) S-value achieving

Page 49: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

GROW: an analogue of Power

โ€ข The GROW (growth-optimal in worst-case) S-value

relative to ๐ป1,๐›ฟ is the S-value achieving

where the supremum is over all ๐‘†-values relative to ๐ป0โ€ข ...so we donโ€™t expect to gain anything when investing

in ๐‘† under ๐ป0โ€ข ...but among all such ๐‘† we pick the one(s) that make

us rich fastest if we keep reinvesting in new gambles

under ๐ป1

Page 50: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Main Theorem

(will be made precise in 3 weeks)

โ€ข Under โ€˜hardly any conditionsโ€™ on ๐ป0 and ๐ป1 a GROW

S-value exists! [G., De Heide, Koolen, 2019]

Page 51: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

โ€ข Jerzy Neyman: alternative exists, โ€œinductive . .

... behaviourโ€, โ€˜significance levelโ€™ and power

Sir Ronald Fisher: test statistic rather than

alternative, p-value indicates โ€œunlikelinessโ€

โ€ข Sir Harold Jeffreys: Bayesian, alternative exists,

absolutely no p-values

J. Berger (2003, IMS Medaillion Lecture): Could

Neyman, Fisher and Jeffreys have agreed on testing?

... Using S-Values we can unify/correct the central

ideas

Three Philosophies of Testing

Page 52: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

โ€œFisherianโ€ Example

2. Ryabko & Monarevโ€™s (2005)

Compression-based randomness test

R&M checked whether sequences generated by

famous random number generators can be

compressed by standard data compressors such as

gzip and rar

Answer: yes! 200 bits compression for file of 10

megabytes

Page 53: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Additional Background Slides

Page 54: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Null Hypothesis Testing

โ€ข Let ๐ป0 = ๐‘ƒ๐œƒ ๐œƒ โˆˆ ฮ˜0} represent the null hypothesis

โ€ข For simplicity, today we assume data ๐‘‹1, ๐‘‹2, โ€ฆ are

i.i.d. under all ๐‘ƒ โˆˆ ๐ป0 .

โ€ข Let ๐ป1= ๐‘ƒ๐œƒ ๐œƒ โˆˆ ฮ˜1} represent alternative hypothesis

โ€ข Example: testing whether a coin is fair

Under ๐‘ƒ๐œƒ , data are i.i.d. Bernoulli ๐œƒ

ฮ˜0 =1

2, ฮ˜1 = 0,1 โˆ–

1

2

Standard test would measure frequency of 1s

Page 55: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Null Hypothesis Testing

โ€ข Let ๐ป0 = ๐‘ƒ๐œƒ ๐œƒ โˆˆ ฮ˜0} represent the null hypothesis

โ€ข Let ๐ป1= ๐‘ƒ๐œƒ ๐œƒ โˆˆ ฮ˜1} represent alternative hypothesis

โ€ข Example: testing whether a coin is fair

Under ๐‘ƒ๐œƒ , data are i.i.d. Bernoulli ๐œƒ

ฮ˜0 =1

2, ฮ˜1 = 0,1 โˆ–

1

2

Standard test would measure frequency of 1s

Simple ๐ป0

Page 56: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Null Hypothesis Testing

โ€ข Let ๐ป0 = ๐‘ƒ๐œƒ ๐œƒ โˆˆ ฮ˜0} represent the null hypothesis

โ€ข Let ๐ป1= ๐‘ƒ๐œƒ ๐œƒ โˆˆ ฮ˜1} represent alternative hypothesis

โ€ข Example: t-test (most used test world-wide)

๐ป0: ๐‘‹๐‘– โˆผ๐‘–.๐‘–.๐‘‘. ๐‘ 0, ๐œŽ2 vs.

๐ป1 : ๐‘‹๐‘– โˆผ๐‘–.๐‘–.๐‘‘. ๐‘ ๐œ‡, ๐œŽ2 for some ๐œ‡ โ‰  0

๐œŽ2 unknown (โ€˜nuisanceโ€™) parameter

๐ป0 = ๐‘ƒ๐œŽ ๐œŽ โˆˆ 0,โˆž }

๐ป1 = ๐‘ƒ๐œŽ,๐œ‡ ๐œŽ โˆˆ 0,โˆž , ๐œ‡ โˆˆ โ„ โˆ– 0 }

Page 57: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Null Hypothesis Testing

โ€ข Let ๐ป0 = ๐‘ƒ๐œƒ ๐œƒ โˆˆ ฮ˜0} represent the null hypothesis

โ€ข Let ๐ป1= ๐‘ƒ๐œƒ ๐œƒ โˆˆ ฮ˜1} represent alternative hypothesis

โ€ข Example: t-test (most used test world-wide)

๐ป0: ๐‘‹๐‘– โˆผ๐‘–.๐‘–.๐‘‘. ๐‘ 0, ๐œŽ2 vs.

๐ป1 : ๐‘‹๐‘– โˆผ๐‘–.๐‘–.๐‘‘. ๐‘ ๐œ‡, ๐œŽ2 for some ๐œ‡ โ‰  0

๐œŽ2 unknown (โ€˜nuisanceโ€™) parameter

๐ป0 = ๐‘ƒ๐œŽ ๐œŽ โˆˆ 0,โˆž }

๐ป1 = ๐‘ƒ๐œŽ,๐œ‡ ๐œŽ โˆˆ 0,โˆž , ๐œ‡ โˆˆ โ„ โˆ– 0 }

Composite ๐ป0

Page 58: TODAY: Safe Testingpdg/teaching/inflearn/slides6.pdfTODAY: Safe Testing 1. Reproducibility Crisis/Problems with p-values 2. The S-Value 3. Optional Continuation 4. Optional Stopping

Standard Method:

p-value, significance

โ€ข Let ๐ป0 = ๐‘ƒ๐œƒ ๐œƒ โˆˆ ฮ˜0} represent the null hypothesis

โ€ข A (โ€œnonstrictโ€) p-value is a random variable (!) such

that, for all ๐œƒ โˆˆ ฮ˜0 ,