Error Statistical Control: Forfeit at your Peril
Deborah Mayo

• A central task of philosophers of science is to address conceptual, logical, and methodological discomforts of scientific practices—still, we're rarely called in.

• Psychology has always been more self-conscious than most fields.

• To its credit, the replication crises led to programs to restore credibility: fraud busting, reproducibility studies.

• There are proposed methodological reforms—many welcome, some of them quite radical.

• Without a better understanding of the problems, many reforms are likely to leave us worse off.


1. A Paradox for Significance Test Critics

Critic 1: It's much too easy to get a small P-value.
Critic 2: We find it very difficult to replicate the small P-values others found.

Is it easy or is it hard?

R. A. Fisher: it's easy to lie with statistics by selective reporting (he called it the "political principle"). Sufficient finagling—cherry-picking, P-hacking, significance seeking—may practically guarantee that a researcher's preferred hypothesis gets support, even if it's unwarranted by evidence.
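A quick Monte Carlo sketch makes the "practical guarantee" concrete (my illustration, not from the talk; assumes numpy and scipy are available). A researcher who tests 20 independent, truly null effects and reports only the best-looking one will claim "significance at the .05 level" about 64 percent of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_tests, n_obs = 10_000, 20, 30

hits = 0
for _ in range(n_studies):
    # 20 independent "effects", all truly null: N(0, 1) observations
    data = rng.normal(0.0, 1.0, size=(n_tests, n_obs))
    # two-sided one-sample t-test of H0: mu = 0 for each effect
    pvals = stats.ttest_1samp(data, 0.0, axis=1).pvalue
    # cherry-picking: report only the smallest P-value of the batch
    hits += pvals.min() < 0.05

print(hits / n_studies)   # ~0.64, not the nominal 0.05
print(1 - 0.95**n_tests)  # analytic check: 0.6415...
```

The nominal level of the reported test is .05; the actual probability of reporting some "significant" effect is about .64, the same figure Selvin computes below.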


2. Bad Statistics

Severity Requirement: If data x0 agree with a hypothesis H, but the test procedure had little or no capability, i.e., little or no probability, of finding flaws with H (even if H is incorrect), then x0 provide poor evidence for H.

Such a test we would say fails a minimal requirement for a stringent or severe test.

• This seems utterly uncontroversial.


• Methods that scrutinize a test’s capabilities, according to their severity, I call error statistical.

• Existing error probabilities (confidence levels, significance levels) may, but need not, provide severity assessments.

• Why a new name? The familiar labels—frequentist, sampling theory, Fisherian, Neyman-Pearsonian—are too associated with hard-line views.


3. Two main views of the role of probability in inference

Probabilism. To provide an assignment of degree of probability, confirmation, support, or belief in a hypothesis, absolute or comparative, given data x0 (e.g., Bayesian, likelihoodist)—with due regard for inner coherency.

Performance. To ensure long-run reliability of methods: coverage probabilities, control of the relative frequency of erroneous inferences in a long-run series of trials (behavioristic Neyman-Pearson).

What happened to using probability to assess the error-probing capacity by the severity criterion?


• Neither “probabilism” nor “performance” directly captures it.

• Good long-run performance is a necessary, not a sufficient, condition for avoiding insevere tests.

• The problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, barn hunting, are not problems about long runs—

• It's that we cannot say about the case at hand that it has done a good job of avoiding the sources of misinterpreting data.


• Probabilism says H is not warranted unless it's true or probable (or increases probability, makes firmer—some use Bayes rule, but its use doesn't make you Bayesian).

• Performance says H is not warranted unless it stems from a method with low long-run error.

• Error Statistics (Probativism) says H is not warranted unless something (a fair amount) has been done to probe ways we can be wrong about H.


• If you assume probabilism is required for inference, it follows that error probabilities are relevant for inference only by misinterpretation. False!

• Error probabilities have a crucial role in appraising well-testedness, which is very different from appraising believability, plausibility, confirmation.

• It's crucial to be able to say that H is highly believable or plausible but poorly tested. (Both H and not-H may be poorly tested.)

• Probabilists can allow for the distinct task of severe testing.


• It’s not that I’m keen to defend many common uses of significance tests; my  work  in  philosophy  of  statistics  has  been  to  provide  the  long  sought  “evidential  interpretation”  (Birnbaum)  of  frequentist  methods,  to  avoid  classic  fallacies.  

• It’s just that the criticisms are based on serious

misunderstandings of the nature and role of these methods; consequently so are many “reforms”.

• Note: The severity construal blends testing and subsumes

(improves) interval estimation, but I keep to testing talk to underscore the probative demand.


4. Biasing selection effects

One function of severity is to identify which selection effects are problematic (not all are).

Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed.

Picking up on these alterations is precisely what enables error statistics to be self-correcting—let me illustrate.


Capitalizing on Chance

We often see articles on fallacious significance levels:

"When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level… Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be 'significant at the 5 percent level.' … The actual level of significance is not 5 percent, but 64 percent!" (Selvin, 1970, p. 104)
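Selvin's 64 percent is the same family-wise arithmetic as in the simulation of Section 1: with twenty independent tests of true nulls, each at the .05 level, P(at least one "significant" difference) = 1 − (.95)^20 ≈ .64.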


• This is from a contributor to Morrison and Henkel's Significance Test Controversy way back in 1970!

• They were clear on the fallacy: blurring the "computed" or "nominal" significance level with the "actual" or "warranted" level.

• There are many more ways you can be wrong with hunting (a different sample space).

• Nowadays, we're likely to see the tests blamed for permitting such misuses (instead of the testers).

• Even worse are those statistical accounts where the abuse vanishes!


What defies scientific sense? On some views, biasing selection effects are irrelevant…

Stephen Goodman (epidemiologist): "Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of 'objectivity' that is often made for the P-value." (1999, p. 1010)


5. Likelihood Principle (LP)

The vanishing act takes us to the pivot point around which much debate in philosophy of statistics revolves. In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses:

P(x0; H1)/P(x0; H0)

Different forms: posterior probabilities, positive B-boost, Bayes factor. The data x0 are fixed, while the hypotheses vary.


Savage on the LP:

"According to Bayes's theorem, P(x|µ)… constitutes the entire evidence of the experiment, that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ…" (Savage 1962, p. 17)


All error probabilities violate the LP (even without selection effects): "Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space." (Lindley 1971, p. 436)

That is why properties of the sampling distribution of the test statistic d(X) disappear for accounts that condition on the particular data x0.
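A classic textbook illustration of the LP at work (my example, not from the slides; assumes scipy): 9 successes and 3 failures with success probability θ, testing H0: θ = 0.5 against θ > 0.5. Whether n = 12 trials were fixed in advance (binomial sampling) or sampling continued until the 3rd failure (negative binomial sampling), the likelihoods are proportional (each is θ⁹(1 − θ)³ times a constant), so the LP says the evidence is identical. The error probabilities disagree:

```python
from scipy import stats

# Same data, two sampling plans; H0: theta = 0.5 vs. H1: theta > 0.5.
# Binomial plan: n = 12 fixed, observe 9 successes.
p_binom = stats.binom.sf(8, 12, 0.5)    # P(X >= 9 | n = 12, theta = 0.5)

# Negative binomial plan: sample until the 3rd failure; 9 successes observed.
# scipy's nbinom counts the complementary outcome before the 3rd stop-event;
# with theta = 0.5 the labeling is symmetric, so this gives the right tail.
p_nbinom = stats.nbinom.sf(8, 3, 0.5)   # P(9 or more successes before 3rd failure)

print(round(p_binom, 4))   # 0.073  -> not significant at the .05 level
print(round(p_nbinom, 4))  # 0.0327 -> significant at the .05 level
```

The P-values differ because the sample spaces differ: exactly the information the LP deems irrelevant.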


Paradox of Optional Stopping: Error-probing capabilities are altered not just by cherry picking and data dredging, but also via data-dependent stopping rules:

We have a random sample from a Normal distribution with mean µ and standard deviation σ, Xi ~ N(µ, σ²), and the two-sided test

H0: µ = 0 vs. H1: µ ≠ 0.

Instead of fixing the sample size n in advance, in some tests n is determined by a stopping rule:

Keep sampling until H0 is rejected at the .05 level,

i.e., keep sampling until |x̄| ≥ 1.96 σ/√n.


"Trying and trying again": having failed to rack up a 1.96 SD difference after, say, 10 trials, the researcher went on to 20, 30, and so on, until finally obtaining a 1.96 SD difference.

Nominal vs. actual significance levels: with n fixed, the type 1 error probability is .05. With this stopping rule, the actual significance level differs from, and will be greater than, .05.
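A minimal simulation of the stopping rule (my sketch, not from the talk; assumes numpy; σ = 1, H0 true, at most 1,000 looks):

```python
import numpy as np

rng = np.random.default_rng(7)
n_sim, max_n = 5_000, 1_000

rejections = 0
for _ in range(n_sim):
    x = rng.normal(0.0, 1.0, size=max_n)  # H0: mu = 0 is true, sigma = 1
    csum = np.cumsum(x)
    n = np.arange(1, max_n + 1)
    z = np.abs(csum) / np.sqrt(n)         # |xbar| / (sigma / sqrt(n)) at each look
    rejections += np.any(z >= 1.96)       # did the rule ever "reject"?

print(rejections / n_sim)  # well above the nominal .05, and it grows with max_n
```

By the law of the iterated logarithm the 1.96 boundary is crossed with probability 1 as the number of looks grows, so a sufficiently persistent researcher is guaranteed a nominally significant result against a true null.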


Jimmy Savage (1959 forum) audaciously declared:

"optional stopping is no sin"

so the problem must be with significance levels (because they pick up on it).

On the other side: Peter Armitage, who had brought up the problem, also uses biblical language:

"thou shalt be misled"

if thou dost not know the person tried and tried again. (p. 72)


Where the Bayesians here claim:

"This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels" (in the sense of Neyman and Pearson) (Edwards, Lindman, and Savage 1963, p. 239).

The frequentists: while it may restore "simplicity and freedom," it does so at the cost of being unable to adequately control the probabilities of misleading interpretations of data (Birnbaum).


6. Current Reforms are Probabilist

Probabilist reforms to replace tests (and CIs) with likelihood ratios, Bayes factors, HPD intervals, or to just lower the P-value (so that the maximally likely alternative gets a .95 posterior), while ignoring biasing selection effects, will fail.

The same P-hacked hypothesis can occur in Bayes factors; optional stopping can exclude true nulls from HPD intervals.

With one big difference: your direct basis for criticism and possible adjustments has just vanished. To repeat: properties of the sampling distribution of d(X) disappear for accounts that condition on the particular data.
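The HPD point follows from simple arithmetic (assuming, for illustration, a flat prior on µ with σ known, so the 95% HPD interval coincides numerically with the CI, x̄ ± 1.96σ/√n): the optional stopping rule above halts exactly when |x̄| ≥ 1.96σ/√n, at which point the interval excludes the true value µ = 0 by construction, yet the posterior calculation, conditioning only on the particular data, registers nothing amiss.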


7. How might probabilists block intuitively unwarranted inferences?

(Consider first the subjective Bayesian.) When we hear there's statistical evidence for some unbelievable claim (distinguishing shades of grey and being politically moderate; ovulation and voting preferences), some probabilists claim: you see, if our beliefs were mixed into the interpretation of the evidence, we wouldn't be fooled.

We know these things are unbelievable, a subjective Bayesian might say. That could work in some cases (though it still wouldn't show what researchers had done wrong)—a battle of beliefs.


It wouldn’t help with our most important problem: (2 critics) How to distinguish the warrant for a single hypothesis H with different methods (e.g., one has biasing selection effects, another, registered results and precautions)? Besides, as committees investigating questionable practices know, researchers really do sincerely believe their hypotheses. So now you’ve got two sources of flexibility, priors and biasing selection effects (which can no longer be criticized).


8. Conventional: Bayesian-Frequentist reconciliations?

The most popular probabilisms these days are non-subjective (default, reference):

• because of the difficulty of eliciting subjective priors,
• the reluctance of scientists to allow subjective beliefs to overshadow the information provided by data.

Default, or reference, priors are designed to prevent prior beliefs from influencing the posteriors.

• A classic conundrum: no general non-informative prior exists, so most are conventional.


"The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…" (Cox and Mayo 2010, p. 299)

• Prior probability: an undefined mathematical construct for obtaining posteriors (giving highest weight to data, or satisfying invariance, or matching frequentists, or…).

Leading conventional Bayesians (J. Berger) still tout their methods as free of concerns with selection effects and stopping rules (the stopping rule principle).


There are some Bayesians who don't see themselves as fitting under either the subjective or conventional heading, and may even reject probabilism…


Before concluding: I don't ignore fallacies of current methods.

9. How the severity analysis avoids classic fallacies

Fallacies of Rejection: Statistical vs. Substantive Significance

i. Take statistical significance as evidence of a substantive theory H* that explains the effect.

ii. Infer a discrepancy from the null beyond what the test warrants.

(i) is handled with severity: flaws in the substantive alternative H* have not been probed by the test, so the inference from a statistically significant result to H* fails to pass with severity.


Merely refuting the null hypothesis is too weak to corroborate substantive H*: "we have to have 'Popperian risk', 'severe test' [as in Mayo], or what philosopher Wesley Salmon called 'a highly improbable coincidence'" (Meehl and Waller 2002, p. 184).

• NHSTs (supposedly) allow moving from statistical to substantive; if so, they exist only as abuses of tests: the move is not licensed by any legitimate test.

• Severity applies informally: much more attention is needed to these quasi-formal statistical–substantive links. Do those proxy variables capture the intended treatments? Do the measurements reflect the theoretical phenomenon?


Fallacies of Rejection (ii): Infer a discrepancy beyond what's warranted. Severity sets up a discrepancy parameter γ (never just report a P-value).

A statistically significant effect may not warrant a meaningful effect—especially with n sufficiently large: the large-n problem.

• Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2).

What's more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn't go off unless the house is fully ablaze? The larger sample size is like the alarm that goes off with burnt toast.
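A numerical sketch of the large-n point (my numbers, not the slides'; assumes scipy; one-sided test of H0: µ ≤ 0 with σ = 1, and a just-significant result d(x0) = 1.96 in each case):

```python
from math import sqrt
from scipy.stats import norm

# SEV(mu > mu1) on a significant result = P(d(X) <= d(x0); mu = mu1)
#                                       = Phi((xbar - mu1) / (sigma / sqrt(n)))
def sev_greater(xbar, mu1, sigma, n):
    return norm.cdf((xbar - mu1) / (sigma / sqrt(n)))

for n in (100, 10_000):
    xbar = 1.96 / sqrt(n)   # same z-score, hence same nominal P-value (~.025)
    print(n, round(sev_greater(xbar, mu1=0.1, sigma=1.0, n=n), 3))
# n = 100:    SEV(mu > 0.1) ~ 0.832 -> fair indication of a discrepancy >= 0.1
# n = 10000:  SEV(mu > 0.1) ~ 0.0   -> same P-value, almost no indication
```

The identical P-value warrants a much smaller discrepancy at n = 10,000: it is the burnt-toast alarm.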


Fallacy of Non-Significant Results: Insensitive Tests

• Negative results do not warrant a 0 discrepancy from the null, but we can use severity to rule out discrepancies that, with high probability, would have resulted in a larger difference than observed.

• Akin to power analysis (Cohen), but sensitive to x0.

• We sometimes hear that negative results are uninformative: not so.

There is no point in running replication research if your account views negative results as uninformative.


Confidence Intervals also require supplementing

There's a duality between tests and intervals: values within the (1 − α) CI are non-rejectable at the α level (but that doesn't make them well-warranted).

• Still too dichotomous: in/out, plausible/not plausible (permits fallacies of rejection/non-rejection)
• Justified in terms of long-run coverage
• All members of the CI treated on par
• Fixed confidence levels (SEV needs several benchmarks)
• Estimation is important, but we need tests for distinguishing real and spurious effects, and for checking assumptions of statistical models


10. Error Statistical Control: Forfeit at your Peril

• The role of error probabilities in inference is not long-run error control, but to severely probe flaws in your inference today.

• It's not a high posterior probability in H that's wanted (however construed), but a high probability that our procedure would have unearthed flaws in H.

• Many reforms are based on assuming a philosophy of probabilism.

• The danger is that some reforms may enable, rather than directly reveal, illicit inferences due to biasing selection effects.


Mayo and Cox (2010): Frequentist Principle of Evidence (FEV); SEV: Mayo and Spanos (2006)

• FEV/SEV, insignificant result: A moderate P-value is evidence of the absence of a discrepancy δ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist.

• FEV/SEV, significant result: d(X) > d(x0) is evidence of a discrepancy δ from H0 if and only if there is a high probability the test would have yielded d(X) < d(x0) were a discrepancy as large as δ absent.


Test T+: Normal testing: H0: µ ≤ µ0 vs. H1: µ > µ0, σ known.

(FEV/SEV): If d(x0) is not statistically significant, then µ ≤ M0 + kε σ/√n passes the test T+ with severity (1 − ε).

(FEV/SEV): If d(x0) is statistically significant, then µ > M0 − kε σ/√n passes the test T+ with severity (1 − ε),

where P(d(X) > kε) = ε and M0 is the observed sample mean.
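These two benchmarks are easy to compute. A minimal sketch (mine, with hypothetical numbers; assumes scipy) of the severity bounds for test T+:

```python
from math import sqrt
from scipy.stats import norm

def sev_bound(m0, sigma, n, eps, significant):
    """mu-value bounding the claim that passes test T+ with severity 1 - eps.

    Insignificant d(x0): 'mu <= m0 + k_eps * sigma/sqrt(n)' passes with SEV 1 - eps.
    Significant   d(x0): 'mu >  m0 - k_eps * sigma/sqrt(n)' passes with SEV 1 - eps.
    """
    k_eps = norm.isf(eps)        # P(Z > k_eps) = eps for standard Normal Z
    se = sigma / sqrt(n)
    return m0 - k_eps * se if significant else m0 + k_eps * se

# Hypothetical data: mu0 = 0, sigma = 2, n = 100.
# Observed mean m0 = 0.5 gives d(x0) = 2.5 (significant at .05):
print(sev_bound(0.5, 2.0, 100, 0.05, True))    # mu > 0.171 passes with SEV .95
# Observed mean m0 = 0.2 gives d(x0) = 1.0 (insignificant):
print(sev_bound(0.2, 2.0, 100, 0.05, False))   # mu <= 0.529 passes with SEV .95
```

The insignificant branch is the severity answer to the fallacy of non-significant results above: it bounds the discrepancies a negative result can rule out.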


REFERENCES

Armitage, P. 1962. "Contribution to Discussion." In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen.

Barnard, G. A. 1972. "The Logic of Statistical Inference (Review of The Logic of Statistical Inference by Ian Hacking)." British Journal for the Philosophy of Science 23 (2): 123–132.

Berger, J. O. 2006. "The Case for Objective Bayesian Analysis." Bayesian Analysis 1 (3): 385–402.

Birnbaum, A. 1970. "Statistical Methods in Scientific Inference (Letter to the Editor)." Nature 225 (5237): 1033.

Cox, D. R., and D. G. Mayo. 2010. "Objectivity and Conditionality in Frequentist Inference." In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 276–304. Cambridge: Cambridge University Press.

Edwards, W., H. Lindman, and L. J. Savage. 1963. "Bayesian Statistical Inference for Psychological Research." Psychological Review 70 (3): 193–242.

Fisher, R. A. 1955. "Statistical Methods and Scientific Induction." Journal of the Royal Statistical Society, Series B (Methodological) 17 (1): 69–78.

Goodman, S. N. 1999. "Toward Evidence-Based Medical Statistics. 2: The Bayes Factor." Annals of Internal Medicine 130: 1005–1013.


Lindley, D. V. 1971. "The Estimation of Many Parameters." In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston.

Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.

Mayo, D. G., and A. Spanos. 2006. "Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction." British Journal for the Philosophy of Science 57 (2): 323–357.

Mayo, D. G., and A. Spanos. 2011. "Error Statistics." In Philosophy of Statistics, edited by P. S. Bandyopadhyay and M. R. Forster, 152–198. Handbook of the Philosophy of Science, vol. 7. The Netherlands: Elsevier.

Meehl, P. E., and N. G. Waller. 2002. "The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude." Psychological Methods 7 (3): 283–300.

Morrison, D. E., and R. E. Henkel, eds. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine.

Pearson, E. S., and J. Neyman. 1930. "On the Problem of Two Samples." In Joint Statistical Papers by J. Neyman and E. S. Pearson, 99–115. Berkeley: University of California Press. First published in Bul. Acad. Pol. Sci., 73–96.

Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen.

Selvin, H. 1970. "A Critique of Tests of Significance in Survey Research." In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94–106. Chicago: Aldine.