
Alphabet Soup: Blurring the Distinctions Between p’s and α’s in Psychological Research

Raymond Hubbard
Drake University

Theory & Psychology © 2004 Sage Publications. Vol. 14(3): 295–327. DOI: 10.1177/0959354304043638. www.sagepublications.com

Abstract. Confusion over the reporting and interpretation of results of commonly employed classical statistical tests is recorded in a sample of 1,645 papers from 12 psychology journals for the period 1990 through 2002. The confusion arises because researchers mistakenly believe that their interpretation is guided by a single unified theory of statistical inference. But this is not so: classical statistical testing is a nameless amalgamation of the rival and often contradictory approaches developed by Ronald Fisher, on the one hand, and Jerzy Neyman and Egon Pearson, on the other. In particular, there is extensive failure to acknowledge the incompatibility of Fisher’s evidential p value with the Type I error rate, α, of Neyman–Pearson statistical orthodoxy. The distinction between evidence (p’s) and errors (α’s) is not trivial. Rather, it reveals the basic differences underlying Fisher’s ideas on significance testing and inductive inference, and Neyman–Pearson views on hypothesis testing and inductive behavior. So complete is this misunderstanding over measures of evidence versus error that it is not viewed as even being a problem among the vast majority of researchers and other relevant parties. These include the APA Task Force on Statistical Inference, and those writing the guidelines concerning statistical testing mandated in APA Publication Manuals. The result is that, despite supplanting Fisher’s significance-testing paradigm some fifty years or so ago, recognizable applications of Neyman–Pearson theory are few and far between in psychology’s empirical literature. On the other hand, Fisher’s influence is ubiquitous.

Key Words: Fisher, hybrid statistical model, inductive behavior, inductive inference, Neyman–Pearson

It is my personal belief that an objective look at the record will show that Fisher contributed a large number of statistical methods, but that Neyman contributed the basis of statistical thinking. (Lucien LeCam, quoted in Reid, 1982, p. 268)

Extensive confusion prevails among psychologists concerning the reporting and interpretation of results of classical statistical tests. The reason for this confusion is that textbooks on statistical methods in psychology and the social sciences usually present the subject matter as a single, comprehensive, uncontroversial theory of statistical inference. These texts rarely allude to the fact that classical statistical inference as it is generally portrayed is in fact an anonymous hybrid consisting of the union of the ideas developed by Ronald Fisher, on the one hand, and Jerzy Neyman and Egon Pearson, on the other (Gigerenzer, 1993; Gigerenzer & Murray, 1987; Gigerenzer et al., 1989; Huberty, 1993; Huberty & Pike, 1999). It is a union that neither side would have agreed to, given the pronounced philosophical and methodological differences between them. In fact, a bitter debate raged over the years between the Fisherian and Neyman–Pearson camps.

The seminal work of Gigerenzer and his colleagues (Gigerenzer, 1993; Gigerenzer & Murray, 1987; Gigerenzer et al., 1989) and Huberty (1993; Huberty & Pike, 1999) notwithstanding, most researchers in psychology and elsewhere remain uninformed about the historical development of methods of statistical inference, and of the mixing of Fisherian and Neyman–Pearson concepts. In particular, there is widespread failure to acknowledge the incompatibility of Fisher’s evidential p value (actually, Karl Pearson [1900] introduced the modern p value, but Fisher popularized it) with Neyman–Pearson’s Type I error rate, α (Goodman, 1993). The difference between evidence (p’s) and errors (α’s) is not some semantic splitting of hairs. Rather, it points to the basic distinctions between Fisher’s notions of significance testing and inductive inference, versus Neyman–Pearson’s ideas on hypothesis testing and inductive behavior. But since statistics textbooks often surreptitiously blend concepts from both sides, misunderstandings concerning the reporting and interpretation of statistical tests are virtually guaranteed. Adding insult to injury, the confusion over measures of evidence versus errors is so completely ingrained that it is not even seen as being a problem among the rank and file of researchers.

As proof of this, even critics of statistical testing often fail to distinguish between p’s and α’s, and the repercussions this has on the meaning of empirical results. Thus, many of these critics (e.g. Carver, 1978; Kirk, 1996; Krueger, 2001; Rozeboom, 1960; Wilkinson & the APA Task Force on Statistical Inference, 1999) unwittingly adopt a Fisherian stance inasmuch as they talk almost exclusively in terms of p, as opposed to α, values. Still other critics or discussants of statistical testing (e.g. American Psychological Association, 1994, 2001; Cohen, 1990, 1994; Dar, Serlin, & Omer, 1994; Falk & Greenbaum, 1995; Loftus, 1996; Mulaik, Raju, & Harshman, 1997; Nickerson, 2000; Rosnow & Rosenthal, 1989; Schmidt, 1996) are inclined, erroneously, to use p’s and α’s interchangeably. Additional examples of recent studies critiquing the merits of statistical testing that nonetheless continue to offer incorrect advice regarding the interpretation of p’s and α’s are readily adduced from the literature (e.g. Chow, 1996, 1998; Clark, 1999; Daniel, 1998; Dixon & O’Reilly, 1999; Finch, Cumming, & Thomason, 2001; Grayson, Pattison, & Robins, 1997; Hyde, 2001; Macdonald, 1997; Nix & Barnette, 1998). The varying levels of confusion exhibited in so many articles dealing with the meaning and interpretation of classical statistical tests point to the need to become familiar with their historical development. Krantz (1999) would surely agree with this assessment.

In light of the above concerns, the present paper addresses how the confusion between p’s and α’s came about. I do this by first reporting on the major differences in the structure of the Fisherian and Neyman–Pearson schools of thought. In doing so, I typically let the protagonists speak for themselves. This is an absolute necessity given that their own names are conspicuously absent from the textbooks used to help teach psychologists about statistical methods. Because textbook authors almost uniformly do not cite and discuss Fisher’s and Neyman–Pearson’s respective contributions to the statistics literature, it is hardly surprising to learn that present researchers are not familiar with them. Second, I show how the competing ideas from the two camps have been inadvertently merged. The upshot is that although Neyman–Pearson theory claimed the mantle of statistical orthodoxy some fifty or so years ago (Hogben, 1957; LeCam & Lehmann, 1974; Nester, 1996; Royall, 1997; Spielman, 1974), it is Fisher’s influence which dominates statistical testing procedures in psychology today. Third, empirical evidence is gathered from a random sample of articles in 12 psychology journals for the period 1990–2002 detailing the widespread confusion among researchers caused by the mixing of Fisherian and Neyman–Pearson perspectives. This evidence is manifested in how researchers, through their misunderstandings of the differences between p’s and α’s, almost universally misreport and misinterpret the outcomes of statistical tests. They can scarcely help it, for such misreporting and misinterpretation is virtually sanctioned in the advice found in APA Publication Manuals (1994, 2001). The end result is that applications of classical statistical testing in psychology are largely meaningless. And this signals the need for changes in the way in which it is taught in the classroom. More specifically, hopes for eliminating (or at least drastically reducing) the mass confusion over the meanings of p’s and α’s must rest on acquainting students with the fundamentals of the historical development of Fisherian and Neyman–Pearson statistical testing. The present paper attempts to do this.

Comparing and Contrasting the Fisherian and Neyman–Pearson Paradigms

Fisher’s Paradigm of Significance Testing

Fisher’s ideas on significance testing, popularized in the many editions of his widely influential books Statistical Methods for Research Workers (1925) and The Design of Experiments (1935a), were enthusiastically received by practitioners. At the heart of his conception of inductive inference is what he termed the null hypothesis, H0. Although briefly dabbling with Bayesian approaches (Zabell, 1992), Fisher quickly renounced the methods of inverse probability, or the probability of a hypothesis (H) given the data (x), Pr(H | x), instead championing the direct probability, Pr(x | H). In particular, Fisher used disparities in the data to reject the null hypothesis, that is, the probability of the data conditional on a true null hypothesis, or Pr(x | H0). Consequently, a significance test is a means of determining the probability of a result, in addition to more extreme ones, on a null hypothesis of no effect or relationship.

In Fisher’s model the researcher proposes a null hypothesis that a sample comes from a hypothetical infinite population with a known sampling distribution. As Gigerenzer and Murray (1987) comment, the null hypothesis is rejected ‘if our sample statistic deviates from the mean of the sampling distribution by more than a criterion, which corresponds to alpha, the level of significance’ (p. 10). In other words, the p value from a significance test is regarded as a measure of the implausibility of the actual observations (as well as more extreme and unobserved ones) obtained in an experiment or other study, assuming a true null hypothesis. The rationale for the significance test is that if the data are seen as being rare or highly discrepant under H0 this constitutes inductive evidence against H0. Fisher (1966) noted that ‘It is usual and convenient for experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard’ (p. 13). Thus, Fisher’s significance testing revolves around the rejection of the null hypothesis at the p ≤ .05 level. If, in an experiment, the researcher obtains a p value of, say, .05 or .01 on a true null hypothesis, it would be interpreted to mean that the probability of obtaining such an extreme (or more extreme) value is only 5% or 1%. (Hence, Fisher is a frequentist, but not in the same sense as Neyman–Pearson.) For Fisher (1966), then, ‘Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis’ (p. 16).
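To make the mechanics concrete, here is a minimal sketch of a Fisherian significance test in Python. It is an illustration only, not an example from the article: the data and numbers are invented, and the test (a two-sided one-sample t test via scipy) is simply one common way of obtaining a p value.

```python
import numpy as np
from scipy import stats

# Hypothetical example: 20 difference scores; H0 says their population mean is 0.
rng = np.random.default_rng(1)
diffs = rng.normal(loc=0.6, scale=1.0, size=20)

# Fisher's p value: the probability, computed under a true H0, of a test
# statistic at least as extreme as the one observed.
t_stat, p_value = stats.ttest_1samp(diffs, popmean=0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# In Fisher's reading, the smaller the p, the stronger the inductive evidence
# against H0; no alternative hypothesis and no Type I error rate appear.
```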

In the Fisherian paradigm, an event is deemed established when we can conduct experiments that rarely fail to yield statistically significant (p ≤ .05) results. As mentioned earlier, Fisher considered p values from single experiments as supplying inductive evidence against the null hypothesis, with smaller p values indicating greater evidence (Johnstone, 1986, 1987b; Spielman, 1974). According to Fisher’s famous disjunction, a p value ≤ .05 on the null hypothesis shows that either a rare event has occurred or else the null hypothesis is false (Seidenfeld, 1979).

Fisher was sure that statistics could play a major role in fostering inductive inference, that is, drawing inferences from the particular to the general, from samples to populations. According to him, ‘Inductive inference is the only process known to us by which essentially new knowledge comes into the world’ (Fisher, 1966, p. 7). But Fisher (1958) was wary that mathematicians (certainly Neyman) did not necessarily subscribe to his inductivist viewpoint:

In that field of deductive logic, at least when carried out with mathematical symbols, [mathematicians] are of course experts. But it would be a mistake to think that mathematicians as such are particularly good at the inductive logical processes which are needed in improving our knowledge of the natural world, in reasoning from observational facts to the inferences which those facts warrant. (p. 261)

Fisher never wavered in his belief that inductive reasoning was the chief mechanism of knowledge development, and for him the p values from significance tests were evidential.

Neyman–Pearson’s Paradigm of Hypothesis Testing

The Neyman–Pearson (1928a, 1928b, 1933) statistical paradigm is widely accepted as the norm in classical statistical circles (Carlson, 1976; Hogben, 1957; LeCam & Lehmann, 1974; Nester, 1996; Oakes, 1986; Royall, 1997; Spielman, 1974). Their work on hypothesis testing, terminology they preferred to distinguish it from Fisher’s ‘significance testing’, was quite distinct from the latter’s framework of inductive inference.

The Neyman–Pearson approach postulates two competing hypotheses: the null hypothesis (H0) and the alternative hypothesis (HA). In justifying the need for an alternative hypothesis, Neyman (1977) wrote:

. . . in addition to H [the null hypothesis] there must exist some other hypotheses, one of which may conceivably be true. Here, then, we come to the concept of the ‘set of all admissible hypotheses’ which is frequently denoted by the letter Ω. Naturally, Ω must contain H. Let H̄ denote the complement, say Ω – H = H̄. It will be noticed that when speaking of a test of the hypothesis H, we really speak of its test ‘against the alternative H̄.’ This is quite important. The fact is that, unless the alternative H̄ is specified, the problem of an optimal test of H is indeterminate. (p. 104)

Neyman–Pearson considered Fisher’s usage of the occurrence of rare or implausible results to reject H0 to be an inadequate vehicle for hypothesis testing. Something more was needed. They wanted to see whether this same improbable outcome under H0 is more likely to occur under a competing hypothesis. As Pearson later explained: ‘The rational human mind did not discard a hypothesis until it could conceive at least one plausible alternative hypothesis’ (E.S. Pearson, 1990, p. 82). Even William S. Gosset (‘Student’, of t test fame), a man whom Fisher admired, saw the need for an alternative hypothesis. In response to a letter from Pearson, Gosset wrote that ‘the only valid reason for rejecting any statistical hypothesis, no matter how unlikely, is that some alternative hypothesis explains the observed events with a greater degree of probability’ (quoted in Reid, 1982, p. 62). The inclusion of an alternative hypothesis by Neyman–Pearson critically distinguishes their approach from Fisher’s, and this was an issue of great contention between the two camps over the years.

In Neyman–Pearson theory, the investigator selects a (typically point) null hypothesis and tests it against the alternative hypothesis. Their work introduced the probabilities of committing two kinds of error, namely false rejection (Type I error) and false acceptance (Type II error) of the null hypothesis. The former probability is called α, while the latter probability is called β.

Eschewing Fisher’s ideas about hypothetical infinite populations, Neyman–Pearson results are predicated on the assumption of repeated random sampling from a defined population (Gigerenzer & Murray, 1987). Therefore, Neyman–Pearson theory is best equipped for handling situations where repeated random sampling has meaning, such as in the case of quality-control experiments. In these narrow circumstances, the Neyman–Pearson frequentist interpretation of probability makes sense: α is the long-run relative frequency of Type I errors conditional on the null being true, and β is the counterpart for Type II errors.
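The long-run character of α is easy to exhibit by simulation. The following sketch (an illustration of the frequentist interpretation just described, not anything from the original article) repeats an experiment many times with the null hypothesis true throughout; the relative frequency of rejections settles at the fixed α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_reps, n = 0.05, 10_000, 30

# Repeated random sampling under a true H0 (population mean is 0).
rejections = 0
for _ in range(n_reps):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    if stats.ttest_1samp(sample, popmean=0.0).pvalue <= alpha:
        rejections += 1

# The long-run relative frequency of (false) rejections converges on alpha.
print(f"Long-run Type I error frequency: {rejections / n_reps:.3f}")  # ~0.05
```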

The Neyman–Pearson theory of hypothesis testing introduced the entirely new concept of the power of a statistical test. The power of a test, or (1 – β), is the probability of rejecting a false null hypothesis. Since the power of a test to detect a particular effect size in the population can be calculated before conducting the research, it is useful in the design of experiments. In Fisher’s significance-testing scheme, however, there is no alternative hypothesis (HA), making the ideas about Type II errors and the power of the test irrelevant. Fisher (1935b) pointed this out when rebuking Neyman and Pearson without naming them: ‘In fact . . . “errors of the second kind” are committed only by those who misunderstand the nature and application of tests of significance’ (p. 474). And he subsequently added:

The notion of an error of the so-called ‘second kind,’ due to accepting the null hypothesis ‘when it is false’ . . . has no meaning with respect to simple tests of significance, in which the only available expectations are those which flow from the null hypothesis being true. (Fisher, 1966, p. 17)

Fisher denied the need for an alternative hypothesis, and strenuously opposed its incorporation by Neyman–Pearson (Gigerenzer & Murray, 1987; Hacking, 1965).

Fisher (1966), however, touches upon the concept of the power of a test when discussing the ‘sensitiveness’ of an experiment:

By increasing the size of the experiment we can render it more sensitive, meaning by this that it will allow of the detection of a lower degree of sensory discrimination, or, in other words, of a quantitatively smaller departure from the null hypothesis. Since in every case the experiment is capable of disproving, but never of proving this hypothesis, we may say that the value of the experiment is increased whenever it permits the null hypothesis to be more readily disproved. (pp. 21–22)

And Neyman (1967) was, of course, familiar with this: ‘The consideration of power is occasionally implicit in Fisher’s writings, but I would have liked to see it treated explicitly’ (p. 1459).
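Because power can be computed before any data are collected, it drives sample-size planning in the Neyman–Pearson framework. Here is a minimal design-stage sketch using the statsmodels package; the effect size, α and target power are assumed values chosen for illustration, not figures from the text.

```python
from statsmodels.stats.power import TTestIndPower

# Neyman-Pearson design: fix alpha and the target power (1 - beta) in
# advance, then solve for the sample size needed to detect a given effect.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,  # assumed standardized effect (Cohen's d)
    alpha=0.05,       # prespecified Type I error rate
    power=0.80,       # prespecified 1 - beta
)
print(f"Required n per group: {n_per_group:.1f}")  # about 64 per group
```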

Whereas Fisher’s view of inductive inference centered on the rejection of the null hypothesis, Neyman and Pearson had no time at all for the very idea of inductive reasoning. Their concept of inductive behavior sought to provide rules for making decisions between two hypotheses, regardless of the researcher’s belief in either one. Neyman (1950) made this quite explicit:

Thus, to accept a hypothesis H means only to decide to take action A rather than action B. This does not mean that we necessarily believe that the hypothesis H is true. . . . [while rejecting H] . . . means only that the rule prescribes action B and does not imply that we believe that H is false. (pp. 259–260)

Neyman–Pearson theory, therefore, substitutes the idea of inductive behavior for that of inductive inference. According to Neyman (1971):

The description of the theory of statistics involving a reference to behavior, for example, behavioristic statistics, has been introduced to contrast with what has been termed inductive reasoning. Rather than speak of inductive reasoning I prefer to speak of inductive behavior. (p. 1)

And ‘The term “inductive behavior” means simply the habit of humans and other animals (Pavlov’s dogs, etc.) to adjust their actions to noticed frequencies of events, so as to avoid undesirable consequences’ (Neyman, 1961, p. 148; see also Neyman, 1962). Further defending his preference for inductive behavior over inductive inference, Neyman (1957) acknowledged his suspicions about the latter ‘because of its dogmatism, lack of clarity, and because of the absence of consideration of consequences of the various actions contemplated’ (p. 16). In presenting his decision rules for taking action A rather than B, Neyman (1950) emphasized that ‘the theory of probability and statistics both play an important role, and there is a considerable amount of reasoning involved. As usual, however, the reasoning is all deductive’ (p. 1).

The deductive character of the Neyman–Pearson model proceeds from the general to the particular. They came up with a ‘rule of behavior’ for selecting between two alternative courses of action, accepting or rejecting the null hypothesis, such that ‘in the long run of experience, we shall not be too often wrong’ (Neyman & Pearson, 1933, p. 291). Whether to accept or reject the hypothesis in their framework depends on the cost trade-offs involved with committing a Type I or Type II error. These costs are independent of statistical theory. They must be estimated by the researcher in the context of each particular problem. Neyman and Pearson (1933) advised:

. . . in some cases it will be more important to avoid the first [type of error], in others the second [type of error]. . . . From the point of view of mathematical theory all we can do is to show how the risk of errors may be controlled or minimised. The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator. (p. 296)

After heeding such advice, the researcher would design an experiment to control the probabilities of the α and β error rates, with the ‘best’ test being the one that minimizes β subject to a bound on α (Lehmann, 1993). In determining what this bound on α should be, Neyman (1950) later stated that the control of Type I errors was more important than that of Type II errors:

The problem of testing statistical hypotheses is the problem of selecting critical regions. When attempting to solve this problem, one must remember that the purpose of testing hypotheses is to avoid errors insofar as possible. Because an error of the first kind is more important to avoid than an error of the second kind, our first requirement is that the test should reject the hypothesis tested when it is true very infrequently. . . . To put it differently, when selecting tests, we begin by making an effort to control the frequency of the errors of the first kind (the more important errors to avoid), and then think of errors of the second kind. The ordinary procedure is to fix arbitrarily a small number α . . . and to require that the probability of committing an error of the first kind does not exceed α. (p. 265)

Consequently, α is specified or fixed prior to the collection of the data. Because of this, Neyman–Pearson methodology is sometimes labeled the fixed α (Huberty, 1993), fixed level (Lehmann, 1993) or fixed size (Seidenfeld, 1979) approach. This contrasts α with Fisher’s p value, which is a random variable whose distribution is uniform over the interval [0, 1] under the null hypothesis. The α and β error rates define a ‘critical’ or ‘rejection’ region for the test statistic, say z or t > 1.96. If the test statistic falls in the critical region, H0 is rejected in favor of HA; otherwise H0 is retained (Goodman, 1993; Huberty, 1993). Furthermore, descriptions of Neyman–Pearson theory refer to the rejection of H0 when H0 is true—the Type I error probability, α—as the ‘significance level’ of a test. As we shall see below, calling the Type I error probability the significance level of a statistical test was something quite unacceptable to Fisher. It has also helped to create enormous confusion among researchers concerning the meaning and interpretation of ‘statistical significance’.
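The contrast between the fixed α and the data-dependent p can be shown in a few lines of simulation (illustrative only; the 1.96 cutoff corresponds to a two-sided z test at α = .05):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Neyman-Pearson: the critical region is fixed before the data are seen.
alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)     # |z| > 1.96, two-sided

# One hypothetical experiment under a true H0: the output is a decision.
z = rng.standard_normal(100).mean() * 10   # z statistic for n = 100
print("reject H0" if abs(z) > z_crit else "retain H0")

# Fisher's p, by contrast, is a random variable: under a true H0 it is
# uniformly distributed on [0, 1], as simulated p values show.
p_vals = 2 * stats.norm.sf(np.abs(rng.standard_normal(10_000)))
print(np.histogram(p_vals, bins=10, range=(0, 1))[0])  # roughly equal counts
```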

Recall that Fisher regarded his significance tests as constituting inductive evidence against the null hypothesis in single experiments (Johnstone, 1987a; Kyburg, 1974; Seidenfeld, 1979). Neyman–Pearson hypothesis tests, on the other hand, do not permit an inference to be made about the outcome of any individual hypothesis that the researcher is examining. Neyman and Pearson (1933) were unequivocal about this: ‘We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis’ (pp. 290–291). But since scientists are in the business of gleaning evidence from individual studies, this limitation of Neyman–Pearson theory is acute. Nor, for that matter, does the Neyman–Pearson model allow an inference to be made in the case of ongoing, repetitive studies. Thus, Grayson, Pattison and Robins (1997) were incorrect when they stated that ‘one implication of a strictly frequentist [Neyman–Pearson] approach is that we can only make inferences on the basis of a long run of repeated trials’ (p. 68). Neyman–Pearson theory is strictly behavioral; it is non-evidential in both the short and long runs. As Neyman (1942) wrote: ‘it will be seen that the theory of testing hypotheses has no claim of any contribution to . . . “inductive reasoning” ’ (p. 301).

Fisher (1959) recognized this, commenting that the Neyman–Pearson ‘procedure is devised for a whole class of cases. No particular thought is given to each case as it arises, nor is the tester’s capacity for learning exercised’ (p. 100). Instead, the investigator is only allowed to make a decision about the likely outcome of a hypothesis as if it had been subjected, as Fisher (1956) observed, to ‘an endless series of repeated trials which will never take place’ (p. 99). In the vast majority of applied work, repeated random sampling does not occur; empirical findings are usually limited to a single sample.

Fisher conceded that Neyman and Pearson’s contribution, which he referred to as an ‘acceptance procedures’ approach, had merit in the context of quality-control decisions. For example, he acknowledged: ‘I am casting no contempt on acceptance procedures, and I am thankful, whenever I travel by air, that the high level of precision and reliability required can really be achieved by such means’ (Fisher, 1955, p. 69). This concession aside, Fisher (1959) was resolute in his objections to Neyman–Pearson ideas about hypothesis testing as an appropriate method for guiding scientific research:

The ‘Theory of Testing Hypotheses’ was a later attempt, by authors who had taken no part in the development of [significance] tests, or in their scientific application, to reinterpret them in terms of an imagined process of acceptance sampling, such as was beginning to be used in commerce; although such processes have a logical basis very different from those of a scientist engaged in gaining from his observations an improved understanding of reality. (pp. 4–5)

He insisted that

. . . the logical differences between [acceptance procedures] and the work of scientific discovery by physical or biological experimentation seem to me so wide that the analogy between them is not helpful, and the identification of the two sorts of operation is decidedly misleading. (Fisher, 1955, pp. 69–70)

In further distancing himself from Neyman–Pearson methodology, Fisher (1955) drew attention to the fact that:

From a test of significance, however, we learn more than that the body of data at our disposal would have passed an acceptance test at some particular level; we may learn, if we wish to, and it is to this that we usually pay attention, at what level it would have been doubtful; doing this we have a genuine measure of the confidence with which any particular opinion may be held, in view of our particular data. From a strictly realistic viewpoint we have no expectation of an unending sequence of similar bodies of data, to each of which a mechanical ‘yes or no’ response is to be given. What we look forward to in science is further data, probably of a somewhat different kind, which may confirm or elaborate the conclusions we have drawn; but perhaps of the same kind, which may then be added to what we have already, to form an enlarged basis for induction. (p. 74)

The above discussion shows that Fisher and Neyman–Pearson disagreed vehemently over both the nature of statistical methods and their approaches to the conduct of science per se. Indeed, ongoing exchanges of a frequently acrimonious nature passed between Fisher and Neyman–Pearson as both sides promulgated their respective conceptions of statistical analysis and the scientific method.

Minding One’s p’s and α’s

Users of statistical techniques in the social and medical sciences are almost totally unaware of the distinctions, described above, between Fisher’s ideas on significance testing and Neyman–Pearson thoughts on hypothesis testing (Gigerenzer, 1993; Goodman, 1993, 1999; Huberty, 1993; Royall, 1997). This is through no fault of their own; after all, they have been taught from numerous well-regarded textbooks on statistical methods. Unfortunately, many of these same textbooks combine, without acknowledgement, incongruous ideas from both the Fisherian and Neyman–Pearson camps. This is something that both sides found appalling. Ironically, as will be seen, the end result of this unintentional mixing of Fisherian with Neyman–Pearson ideas is that although the latter’s work came to be accepted as statistical orthodoxy about fifty years ago (Hogben, 1957; Spielman, 1974), it is Fisher’s methods that flourish today. As Royall (1997) observed:

The distinction between Neyman–Pearson tests and [Fisher’s] significance tests is not made consistently clear in modern statistical writing and teaching. Mathematical statistical textbooks tend to present Neyman–Pearson theory, while statistical methods textbooks tend to lean more towards significance tests. The terminology is not standard, and the same terms and symbols are often used in both contexts, blurring the differences between them. (p. 64)

Johnstone (1986) and Keuzenkamp and Magnus (1995) maintain that statistical testing usually follows Neyman–Pearson formally, but Fisher philosophically. For instance, Fisher’s notion of disproving the null hypothesis is taught along with the Neyman–Pearson concepts of alternative hypotheses, Type II errors and the power of a statistical test. In addition, textbook descriptions of Neyman–Pearson theory often refer to the Type I error probability as the ‘significance level’ (Goodman, 1999; Kempthorne, 1976; Royall, 1997).

But the quintessential example of the bewilderment caused by the forging of Fisher’s ideas on inductive inference with the Neyman–Pearson principle of inductive behavior is the widely unappreciated fact that the former’s p value is incompatible with the Neyman–Pearson hypothesis test in which it has become embedded (Goodman, 1993). Despite this fundamental incompatibility, the end result of this mixing is that the p value is now indelibly linked in researchers’ minds with the Type I error rate, α. And this is precisely what Fisher (1955) had earlier complained about when he accused Neyman–Pearson of attempting ‘to assimilate a test of significance to an acceptance procedure’ (p. 74). Because of this assimilation, much empirical work in psychology and the social and biological sciences proceeds in the following manner: the investigator states the null (H0) and alternative (HA) hypotheses, the Type I error rate/significance level, α, and presumably—but rarely—calculates the statistical power of the test (e.g. t). These steps are in accordance with Neyman–Pearson convention. After this, the test statistic is computed for the sample data, and in an effort to have the best of both worlds, an associated p value (significance probability) is calculated. The p value is then erroneously interpreted as a frequency-based ‘observed’ Type I error rate, α (Goodman, 1993), and at the same time as an incorrect (i.e. p < α) measure of evidence against H0.
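In code, the hybrid ritual just described looks like the following sketch (hypothetical data; the comments mark which step belongs to which camp):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(loc=0.0, scale=1.0, size=40)  # invented sample data
group_b = rng.normal(loc=0.4, scale=1.0, size=40)

alpha = 0.05                                   # Neyman-Pearson: fixed Type I error rate
t_stat, p = stats.ttest_ind(group_a, group_b)  # Fisher-style evidential p value

# The hybrid step: the data-dependent p is compared with the prespecified
# alpha and then reported as if it were itself an 'observed' error rate --
# exactly the conflation of evidence (p) and error (alpha) at issue here.
verdict = "statistically significant" if p <= alpha else "not significant"
print(f"{verdict}: t = {t_stat:.2f}, p = {p:.4f}")
```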

The p Value as a Type I Error Rate

Even staunch critics of, and other commentators on, statistical testing in psychology occasionally commit this error. Thus Dar et al. (1994) noted: ‘The sample p value, in the context of null hypothesis testing, is involved . . . in determining whether the predetermined criterion of Type I error, the alpha level, has been surpassed’ (p. 76). Likewise, Meehl (1967) misreported that the investigator ‘gleefully records the tiny probability number “p < .001,” and there is a tendency to feel that the extreme smallness of this probability of a Type I error is somehow transferable’ (p. 107, my emphasis).

Nickerson (2000) also makes the mistake of drawing a parallel between p values and Type I error rates: ‘The value of p that is obtained as the result of NHST is the probability of a Type I error on the assumption that the null hypothesis is true’ (p. 243). He later goes on to compound this mistake by adding: ‘Both p and α represent bounds on the probability of Type I error . . . p is the probability of a Type I error resulting from a particular test if the null hypothesis is true’ (p. 259, my emphasis). Here, Nickerson misinterprets a p value as an ‘observed’ Type I error rate, something which is impossible since the latter applies only to long-run frequencies, not to individual instances. Neyman (1971) expressed this as follows:

It would be nice if something could be done to guard against errors in each particular case. However, as long as the postulate is maintained that the observations are subject to variation affected by chance (in the sense of frequentist theory of probability), all that appears possible to do is to control the frequencies of errors in a sequence of situations. (p. 13)

The p value is not a Type I error rate, long-run or otherwise; it is a measure of inductive evidence against H0. Type I errors play no role in Fisher’s paradigm.

This misinterpretation of his evidential p value as a Neyman–Pearson Type I error rate severely upset Fisher, who was adamant that the significance level of a statistical test had no ongoing sampling interpretation. With regard to the .05 level, Fisher (1929) early on warned that this does not mean that the researcher ‘allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained’ (p. 191). The significance level, for Fisher, was a measure of evidence for the ‘objective’ disbelief in the null hypothesis; it had no long-run frequentist characteristics.

Again, Fisher (1950) protested that his tests of significance

. . . have been most unwarrantably ignored in at least one pretentious work on ‘Testing Statistical Hypotheses’ . . . Pearson and Neyman have laid it down axiomatically that the level of significance of a test must be equated to the frequency of a wrong decision ‘in repeated samples from the same population.’ This idea was foreign to the development of tests of significance given by the author in 1925. (p. 35.173a, my emphasis)

Seidenfeld (1979) exposed the difference between the two schools of thought on this crucial matter:

. . . such a frequency property has little or no connection with the interpretation of the [Fisherian significance] test. To repeat, the correct interpretation is through the disjunction, either a rare event has occurred or the null hypothesis is false. (p. 79)

In highlighting the discrepancies between p’s and α’s, Gigerenzer (1993) offered the following:

For Fisher, the exact level of significance is a property of the data (i.e., a relation between a body of data and a theory); for Neyman and Pearson, alpha is a property of the test, not of the data. Level of significance [p value] and alpha are not the same thing. (p. 317, my emphasis)

Despite the above cautions about p values not being Type I error rates, it is sobering to note that even well-known statisticians such as Barnard (1985), Gibbons and Pratt (1975) and Hinkley (1987) nevertheless make the mistake of equating them. Yet, as Berger and Delampady (1987) warn, the interpretation of the p value as an error rate is strictly forbidden:

P-values are not a repetitive error rate . . . A Neyman–Pearson error probability, α, has the actual frequentist interpretation that a long series of α level tests will reject no more than 100α% of true H0, but the data-dependent P-values have no such interpretation. (p. 329)

At the same time it must be underlined that Neyman–Pearson would not endorse an inferential or epistemic interpretation of statistical testing, as manifested in a p value. Their theory is behavioral, not evidential, and they would likewise complain that the p value is not a Type I error rate.

It should therefore be pointed out that in his effort to partially resolve discrepancies between the Fisherian and Neyman–Pearson programs, Lehmann (1993) similarly fails to distinguish between measures of evidence versus error. He refers to the Type I error rate as the significance level of the test, when for Fisher this was determined by p values and not α’s. And we have seen that misconstruing the evidential p value as a Neyman–Pearson Type I error rate was anathema to both Fisher and Neyman–Pearson.

The p Value as a Quasi-measure of Evidence Against H0 (p < α)

While the p value is being erroneously reported as a Neyman–Pearson Type I error rate, it will be interpreted simultaneously in an incorrect quasi-Fisherian manner as evidence against H0. If p < α, a statistically significant finding is announced, and the null hypothesis is disproved. For example, Leavens and Hopkins’ (1998) declaration that ‘alpha was set at p < .05 for all tests’ (p. 816) reflects a common tendency among researchers to confuse p’s and α’s in a statistical significance testing framework. Clark-Carter (1997) goes further in this regard by invoking the great man himself: ‘According to Fisher, if p were greater than α then the null hypothesis could not be accepted’ (p. 71). Fisher, of course, would have taken umbrage at such a statement, just as he would have with Clark’s (1999) assertion that ‘in Fisher’s original work in agriculture, alpha was set a priori’ (p. 283), with Huberty (1993, p. 328) and Huberty and Pike’s (1999, p. 11) suggestions that Fisher encouraged the use of α = .05, with Cortina and Dunlap’s (1997) recommendation to compare observed probabilities with predetermined α cut-off values, and with Chow’s (1996) error of investing his (Fisher’s) tests with both p’s and α’s. Fisher had no use for the concept α. Again, in an otherwise thoughtful article titled ‘The Appropriate Use of Null Hypothesis Testing’, Frick (1996) nonetheless makes the mistake of using p’s and α’s interchangeably: ‘Finally, the obtained value of p is compared to a criterion alpha, which is conventionally set at .05 . . . When p is less than .05, the experimenter has sufficient empirical evidence to support a claim’ (p. 385).

In addition, Nickerson (2000), who earlier misinterpreted a p value as a Type I error rate, follows Frick (1996) in ascribing an evidential meaning to the p value when it is directly compared with this error rate:

A specified significance level conventionally designated α (alpha) serves as a decision criterion, and the null hypothesis is rejected only if the value of p yielded by the test is not greater than the value of α. If α is set at .05, say, and a significance test yields a value of p equal to or less than .05, the null hypothesis is rejected and the result is said to be statistically significant at that level. (Nickerson, 2000, pp. 242–243)

Nix and Barnette (1998) do likewise in a paper subtitled ‘A Review of Null Hypothesis Significance Testing’: ‘As such, p values lower than the alpha value are viewed as a rejection of the null hypothesis, and p values equal to or greater than the alpha value are viewed as a failure to reject’ (p. 6).

Yet we have seen that interpreting p values as evidence against the null hypothesis in a single experiment is impossible in the Neyman–Pearson framework. Their approach centers on decision rules with a priori stated error rates, α and β, which are limiting frequencies based on long-run repeated sampling. If a result falls into the critical region, H0 is rejected and HA is accepted; otherwise H0 is accepted and HA is rejected (Goodman, 1993; Huberty, 1993). Interestingly, this last claim contradicts Fisher’s (1966) remark that ‘the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation’ (p. 16). In the Neyman–Pearson framework one can indeed ‘accept’ the null hypothesis. Neyman’s (1942) advice makes this plain:

. . . we may say that any test of a statistical hypothesis consists in a selection of a certain region, w0, in the n dimensional experimental space W, and in basing our decision as to the hypothesis H0 on whether the experimental point E′, determined by the actual observations, falls within w0 or not. If it does, the hypothesis H0 will be rejected, if it does not, it will be accepted. (p. 303, my emphasis)

Note, then, the distinctly Fisherian bent adopted by Wilkinson and the APA Task Force on Statistical Inference (1999) when they recommend: ‘Never use the unfortunate expression “accept the null hypothesis” ’ (p. 599). To reiterate, this advice is at odds with, ostensibly, Neyman–Pearson statistical convention.

Further Confusion over p’s and α’s

In the Neyman–Pearson decision model the researcher is only allowed to say whether or not the result fell in the critical region, not where it fell, as might be indicated by a p value. Thus, if the Type I error rate, α, is fixed at its usual .05 value before (as it must be) the study is carried out, and the researcher subsequently obtains a p value of, say, .0014, this exact value cannot be reported in a Neyman–Pearson hypothesis test (Oakes, 1986). This is because, Goodman (1993, 1999) explains, α is the probability of a set of possible results that may fall anywhere in the tail area of the distribution under the null hypothesis, and we cannot know in advance which of these particular results will arise. This differs from the tail area for the p value, which is known only after the result is observed, and which, by definition, will always lie exactly on the border of that tail area. Consequently, a predetermined Type I error rate cannot be conveniently renegotiated as a measure of evidence after the result is observed (Royall, 1997).

Despite the above, Wilkinson and the APA Task Force on Statistical Inference (1999) again display a Fisherian, rather than a Neyman–Pearsonian, orientation when they state: ‘It is hard to imagine a situation in which a dichotomous accept–reject decision is better than reporting an actual p value’ (p. 599). But it is not at all difficult to imagine such a situation in a Neyman–Pearson context. On the contrary, the dichotomous accept–reject decision is all that is possible in their statistical calculus. Furthermore, p values, exact or otherwise, play no role in the Neyman–Pearson model.

By the same reasoning, it is not permissible to report what Goodman (1993, p. 489) calls ‘roving alphas’, whereby p values are assigned a limited number of categories of Type I error rates, such as p < .05, p < .01, p < .001, and so on. As mentioned earlier, and elaborated below, a Type I error rate, α, must be fixed before the data are collected, and any ex post facto reinterpretation of values like p < .05, p < .01, and so on, as variable Type I error rates applicable to different parts of any given study is strictly inadmissible.

Why Must α Be Fixed Before the Study is Carried Out?

At this juncture it is helpful to be reminded why it is necessary to fix α prior to conducting the study, rather than allow it to be flexible (like a p value), and how the failure to do so has tainted researcher behavior. Alpha must be selected before performing the study so as to constrain the Type I error rate to some agreed-upon level. In specifying α in advance, Neyman (1977) admitted that in their efforts to build a frequentist theory of testing hypotheses, he and Pearson were inspired by the French mathematician Borel’s advice that:

. . . (a) the criterion to test a hypothesis (a ‘statistical hypothesis’) using some observations must be selected not after the examination of the results of observation, but before, and (b) this criterion should be a function of the observations ‘en quelque sorte remarquable’. (p. 103)


Royall (1997, pp. 119–121) addresses the rationale for prespecifying α, and I paraphrase him here. Suppose a researcher decides not to supply α ahead of time, but waits instead until the data are in. It is decided that if the result is statistically significant at the .01 level, it will be reported as such, and if it falls between the .01 and .05 levels, it will be recorded as being statistically significant at the .05 level. In the Neyman–Pearson model, this test procedure is improper and yields misleading results, namely, that while H0 will occasionally be rejected at the .01 level, it will always be rejected any time the result is statistically significant at the .05 level. Thus, the research report does not have a Type I error probability of .01, but only .05. If the prespecified α is fixed at the .05 level, no extra observations can legitimize a claim of statistical significance at the .01 level. This, Royall (1997) points out, makes perfect sense within Neyman–Pearson theory, although no sense at all from a scientific perspective, where additional observations would be seen as preferable. And this deficiency in the Neyman–Pearson paradigm has encouraged researchers, albeit unwittingly, to incorrectly supplement their testing procedures with ‘roving alpha’ p values.
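Royall’s point is easy to verify numerically. Under a true H0 the p value is uniform on [0, 1], so relabeling results after the fact leaves the procedure’s overall Type I error rate at .05 (a sketch with simulated p values, not taken from the original):

```python
import numpy as np

rng = np.random.default_rng(11)
p = rng.uniform(size=1_000_000)  # p values under a true H0 are Uniform(0, 1)

reported_at_01 = (p < .01).mean()                 # labeled 'significant at .01'
reported_at_05 = ((p >= .01) & (p < .05)).mean()  # relabeled 'significant at .05'

# The labels differ, but the procedure rejects H0 whenever p < .05:
print(reported_at_01 + reported_at_05)            # ~0.05, not 0.01
```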

The Neyman–Pearson hypothesis testing program strictly disallows any ‘peeking’ at the data and (repeated) sequential testing (Cornfield, 1966; Royall, 1997). Cornfield (1966, p. 19) tells of a situation he calls common among statistics consultants: suppose, after gathering n observations, a researcher does not quite reject H0 at the prespecified α = .05 level. The researcher still believes the null hypothesis is false, and that if s/he had obtained a statistically significant result, the findings would be submitted for publication. The investigator then asks the statistician how many more data points would be necessary to reject the null. And, in Neyman–Pearson theory, the answer is no amount of (even extreme) additional data points would allow rejection at the .05 level. As incredible as this answer seems, it is correct. If the null hypothesis is true, there is a .05 chance of its being rejected after the first study. As Cornfield (1966) remarks, however, to this chance we must add the probability of rejecting H0 in the second study, conditional on our failure to do so after the first one. This, in turn, raises the total probability of the erroneous rejection of H0 to over .05. Cornfield shows that as the n size in the second study is continuously increased, the significance level approaches .0975 (= .05 + .95 × .05). This demonstrates that no amount of collateral data points can be gathered to reject H0 at the .05 level. Royall (1997) puts it this way: ‘Choosing to operate at the 5% level means allowing only a 5% chance of erroneously rejecting the hypothesis, and the experimenter has already taken that chance. He spent his 5% when he tested after the first n observations’ (p. 111). But once again, despite being in complete accord with Neyman–Pearson theory, this explanation runs counter to common sense. And once again, researchers mistakenly augment the Neyman–Pearson hypothesis-testing model with Fisherian p values, thereby introducing a thicket of ‘roving’ or pseudo-alphas in their reports.
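Cornfield’s .0975 figure can be reproduced by simulation. The sketch below uses assumed sample sizes: as the second batch grows much larger than the first, the pooled test becomes nearly independent of the first one, and the overall error rate approaches .05 + .95 × .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, n1, n2, reps = 0.05, 30, 10_000, 5_000

false_rejections = 0
for _ in range(reps):
    x1 = rng.standard_normal(n1)               # H0 is true throughout
    if stats.ttest_1samp(x1, popmean=0.0).pvalue <= alpha:
        false_rejections += 1                  # rejected after the first study
        continue
    pooled = np.concatenate([x1, rng.standard_normal(n2)])
    if stats.ttest_1samp(pooled, popmean=0.0).pvalue <= alpha:
        false_rejections += 1                  # rejected after 'more data'

# The realized Type I error rate exceeds the nominal .05, nearing .0975.
print(f"Overall Type I error rate: {false_rejections / reps:.3f}")
```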

Further confounding matters, these variable Type I error ‘p’ values are also interpreted in an evidential fashion. This happens, for example, when p < .05 is deemed ‘significant’, p < .01 is ‘highly significant’, p < .001 is ‘extremely significant’, and so on.

Goodman (1993, 1999) and Royall (1997) caution that because of its apparent similarity with the Neyman–Pearson Type I error rate, α, Fisher’s p value has been subsumed within the former’s paradigm. As a result, the p value has been contemporaneously interpreted as both a measure of evidence and an ‘observed’ error rate. This has created extensive confusion over the meaning of p values and α levels. Unfortunately, as Goodman (1992) warned,

. . . because p-values and the critical regions of hypothesis tests are both tail area probabilities, they are easy to confuse. This confusion blurs the division between concepts of evidence and error for the statistician, and obscures it completely for nearly everyone else. (p. 897)

Fisher, Neyman–Pearson and the .05 Level

Finally, it is ironic that the confusion surrounding the differences between p’s and α’s was inadvertently fueled by Neyman and Pearson themselves. This occurred when they used, for convenience, Fisher’s 5% and 1% significance levels to help define their Type I error rates (E.S. Pearson, 1962).

Fisher’s popularizing of such nominal levels of statistical significance is a curious, and influential, historical oddity. While working on Statistical Methods for Research Workers, Fisher was denied permission by Karl Pearson to reproduce W.P. Elderton’s table of χ2 from the first volume of Biometrika, and therefore created his own version. In doing so, Egon Pearson (1990) wrote: ‘[Fisher] gave the values of [Karl Pearson’s] χ2 [and Student’s t] for selected values of P . . . instead of P for arbitrary χ2, and thus introduced the concept of nominal levels of significance’ (p. 52, my emphasis). As noted, Fisher’s use of 5% and 1% levels was also adopted by Neyman–Pearson. And Fisher (1959) criticized them for doing so, explaining that ‘no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas’ (p. 42). No wonder that many researchers confuse Fisher’s evidential p values with Neyman–Pearson behavioral error rates when both concepts are ossified at the 5% and 1% levels.


Confusion Over p’s and α’s: APA Publication Manuals

Adding fuel to the fire, there is also ongoing confusion and equivocation over the interpretation of p values and α levels in various APA Publication Manuals. I noted previously that a prespecified Type I error rate (α) cannot be simultaneously viewed as a measure of evidence (p) once the result is observed. Yet this mistake is actively endorsed as official policy in an earlier APA Publication Manual (1994):

For example, given a true null hypothesis, the probability of obtaining the particular value of the statistic you computed might be [p =] .008. Many statistical packages now provide these exact [p] values. You can report this distinct piece of information in addition to specifying whether you rejected or failed to reject the null hypothesis using the specified alpha level.

With an alpha level of .05, the effect of age was statistically significant,F (1, 123) = 7.27, p = .008. (p. 17)

And psychologists are provided with mixed messages by statements such as: ‘Before you begin to report specific results, you should routinely state the particular alpha level you selected for the statistical tests you conducted: [for example,] An alpha level of .05 was used for all statistical tests’ (APA, 1994, p. 17). This is sound advice that is congruent with Neyman–Pearson principles. However, it is immediately followed with some poor advice: ‘If you do not make a general statement about the alpha level, specify the alpha level when reporting each result’ (p. 17). This recommendation promotes unsound statistical practice by virtually sanctioning the habitual use of untenable ‘roving alphas’. Such a situation, whether applied to actual α’s or (more likely) p values, is completely unacceptable: α must be determined prior to, not after, data collection and statistical analysis.

The latest APA Publication Manual (2001), following on the heels of the report of Wilkinson and the APA Task Force on Statistical Inference (1999), continues in a similar vein. Like its 1994 predecessor, for instance, the 2001 Manual distinguishes the two types of probabilities associated with statistical testing, p and α (although both editions incorrectly call the p value a likelihood when it is a probability). But the new version likewise proceeds to dispense erroneous counsel:

The APA is neutral on which interpretation [p or α] is to be preferred in psychological research. . . . Because most statistical packages now report the p value . . . and because this probability can be interpreted according to either mode of thinking, in general it is the exact probability (p value) that should be reported. (APA, 2001, p. 24, my emphasis)

But this is not the case. The p value is not a Type I error rate, and an α level is not evidential. The Manual goes on to declare:

There will be cases—for example, large tables of correlations or complex tables of path coefficients—where the reporting of exact probabilities could be awkward. In these cases, you may prefer to identify or highlight a subset of values in the table that reach some prespecified level of statistical significance. To do so, follow those values with a single asterisk (*) or double asterisk (**) to indicate p < .05 or p < .01, respectively. When using prespecified significance levels, you should routinely state the particular alpha level you selected for the statistical tests you conducted:

An alpha level of .05 was used for all statistical tests. (p. 25)

Such recommendations are misleading to authors. Just after being told to present exact, data-dependent p values wherever feasible, we are also instructed to use both p values and α levels to indicate prespecified (yet variable) levels of statistical significance. This virtually obligates the researcher to consider a p value as an ‘observed’ Type I error rate.

This obligation is all but cemented in another passage, where two examples are given:

Two common approaches for reporting statistical results using the exact probability formulation are as follows:

With an alpha level of .05, the effect of age was statistically significant, F(1, 123) = 7.27, p < .01.

The effect of age was not statistically significant, F(1, 123) = 2.45, p = .12.

The second example should be used only if you have included a statement of significance level earlier in your article. (p. 25)

Note that the first example uses p and α interchangeably, as if they are the same thing. Moreover, p < .01 is not an exact probability. Example two, which does use an exact probability, implies that such probabilities cannot be used without first announcing an ‘overall’ significance level—a prespecified α—to which all others bow. This is not the case.2

Because some statisticians, some quantitative psychologists and APA Publication Manuals are unclear about the differences between p’s and α’s, it is to be expected that confusion levels among psychologists in general will be high. I assess the magnitude of this confusion in a sample of APA journals.

Confusion Over p’s and α’s: Empirical Evidence

An empirical investigation was made of the way in which the results of statistical tests are reported in the psychology literature. In particular, a randomly selected issue of each of 12 psychology journals—the American Psychologist (AP), Developmental Psychology (DP), Journal of Abnormal Psychology (JAbP), Journal of Applied Psychology (JAP), Journal of Comparative Psychology (JComP), Journal of Consulting and Clinical Psychology (JCCP), Journal of Counseling Psychology (JCouP), Journal of Educational Psychology (JEP), Journal of Experimental Psychology: General (JExPG), Journal of Personality and Social Psychology (JPSP), Psychological Bulletin (PB) and Psychological Review (PR)—was examined for every year from 1990 through 2002 to ascertain the number of empirical articles and notes they contained. Table 1 shows that this examination produced a sample of 1,750 empirical papers, of which 1,645, or 94%, used statistical tests. There is some variability in the percentage of empirical papers using statistical testing, with a low of 40% for AP and a high of 99.2% for JAbP. Moreover, six of the journals (DP, JAbP, JCCP, JEP, JExPG and JPSP) had in excess of 95% of their published empirical works featuring such tests.

Even though the Fisherian evidential p value from a significance test is at odds with the Neyman–Pearson hypothesis-testing approach, Table 2 reveals that p values are ubiquitous in psychology’s empirical literature. In marked contrast, α levels are few and far between.

Of the 1,645 papers using statistical tests, 1,090, or 66.3%, employed ‘roving alphas’, that is, a discrete, graduated number of p values interpreted variously as Type I error rates and/or measures of evidence against H0, usually p < .05, p < .01, p < .001, and so on. In other words, these p values are sometimes viewed as an ‘observed’ Type I error rate, meaning that they are not pre-assigned, or fixed, error levels that would be dictated by Neyman–Pearson theory. Instead, these ‘error rates’ are incorrectly determined solely by the data.
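To make the pattern concrete, a minimal sketch (Python; the function name and the sample p values are hypothetical) shows how a ‘roving alpha’ label is assigned after the fact:

    # Hypothetical illustration of the 'roving alpha' reporting style.
    def roving_alpha_label(p: float) -> str:
        for cutoff in (0.001, 0.01, 0.05):   # graduated, data-determined levels
            if p < cutoff:
                return f"p < {cutoff}"
        return "n.s."                        # not significant at any cutoff

    for p in (0.0004, 0.008, 0.032, 0.12):
        print(p, "->", roving_alpha_label(p))
    # None of these labels was fixed before the data were seen, which is
    # precisely why they are not Neyman-Pearson Type I error rates.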

TABLE 1. The incidence of statistical testing in 12 psychology journals: 1990–2002

Journal   Number of   Number of          Number of empirical   Percentage of empirical
          papers      empirical papers   papers using          papers using
                                         statistical tests     statistical tests

AP        149         10                 4                     40.0
DP        222         213                211                   99.1
JAbP      242         240                238                   99.2
JAP       193         186                176                   94.6
JCCP      273         236                230                   97.5
JComP     148         146                138                   94.5
JCouP     176         157                140                   89.2
JEP       193         187                178                   95.2
JExPG     100         86                 85                    98.8
JPSP      191         188                180                   95.7
PB        128         41                 32                    78.0
PR        110         60                 33                    55.0

Totals    2,125       1,750              1,645                 94.0


TABLE 2. The reporting of results of statistical tests

Level   ‘Roving      Exact p       Combination of Ep’s with   ‘Fixed’ level values           Unspecified   Total
        alphas’ (R)  values (Ep)   fixed p values and         p’s   ‘Significant’   α’s
                                   ‘roving alphas’

.05     1,090        54            Ep + .05      18           47    5               11        6
.01                                Ep + .01       5           17    1                1
.001                               Ep + .001      7           17    –                2
Other                              Ep + R       354            7    1                2

Total   1,090        54            384                        88    7               16        6             1,645
%       66.3         3.3           23.3                       5.3   0.4             1.0       0.4           100.0


Corroborating my figures on the incidence of ‘roving alphas’—66.3% overall, and 74.4% for the JAP—Finch et al. (2001) found that these were used in 77% of empirical articles in the JAP for the period 1955 to 1999, and in almost 100% of such articles in the British Journal of Psychology for 1999. They concluded: ‘Typical practice is to report relative p values using two or more implicit alpha levels, most commonly .05 and .01 and sometimes .001’ (p. 198, my emphasis). Interestingly, however, they are not critical, as they should be, of this custom of interpreting p values as Type I error rates. Gigerenzer (1993), on the other hand, most certainly is. When discussing how p values are usually reported in the literature—p < .05, p < .01, p < .001—he observes: ‘Neyman and Pearson would have rejected this practice: These p values are not the probability of Type I errors . . . [values] such as p < .05, which look like probabilities of Type I errors but aren’t’ (p. 329).

Further clouding the issue, these same p values will be interpreted simultaneously in a quasi-evidential manner as a basis for rejecting H0 if p < α. This includes, in many cases, mistakenly using the p value as a surrogate measure for effect sizes (e.g. p < .05 is ‘significant’, p < .01 is ‘very significant’, p < .001 is ‘extremely significant’, etc.). In short, these ‘roving alphas’ can assume a number of incorrect and contradictory interpretations.

After the ‘roving alphas’, an additional 54 (3.3%) papers reported ‘exact’ p values, and a further 384 (23.3%) presented various combinations of exact p’s with either ‘roving alphas’ or fixed p values. Conservatively, therefore, some 1,474, or 89.6% (i.e. ‘roving alphas’ plus the combination of exact p’s and ‘roving alphas’), of empirical articles in a sample of psychology journals report the results of statistical tests in a fashion that is at variance with Neyman–Pearson convention. At the same time they violate Fisherian theory, as when p values are misinterpreted as both Type I error rates and as quasi-Fisherian (i.e. p < α) measures of evidence against the null.

Another 6 (0.4%) studies were insufficiently clear about the disposition of a result in their accounts (other than comments such as ‘This result was statistically significant’). I therefore assigned them to the ‘Unspecified’ category.

Thus, 111 (6.7%) studies remain eligible for the reporting of ‘fixed’ level α values in the manner intended by Neyman–Pearson. However, 88 of these 111 studies reported ‘fixed p’ rather than fixed α levels. After subtracting this group, a mere 23 (1.4%) studies remain eligible. Of these 23 papers, 7 simply refer to their published results as being ‘significant’ at the .05, .01 levels, and so on, but provide no information about p values and/or α levels. In the final analysis, only 16 of 1,645 empirical papers using statistical tests, or 1%, explicitly used α levels.
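The winnowing can be retraced directly from the Table 2 totals; the following sketch (plain Python arithmetic, with hypothetical variable names) reproduces the counts just cited:

    # Re-deriving the counts reported above from the Table 2 totals.
    total = 1645
    roving, exact_p, combos, unspecified = 1090, 54, 384, 6

    at_variance = roving + combos                               # 1,474 papers
    print(at_variance, round(100 * at_variance / total, 1))     # 1474 89.6

    eligible = total - roving - exact_p - combos - unspecified  # 111 papers
    fixed_p, significant_only = 88, 7
    alpha_users = eligible - fixed_p - significant_only         # 23 - 7 = 16 papers
    print(eligible, alpha_users, round(100 * alpha_users / total, 1))  # 111 16 1.0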

Table 3 makes it abundantly clear that articles in all 12 psychology journals reflect confusion in the reporting of results of statistical tests.


TABLE 3. The reporting of results of statistical tests by journal

          ‘Roving        Exact p       Combination of     ‘Fixed’ level values
          alphas’ (R)    values (Ep)   Ep’s with fixed    p’s          ‘Significant’   α’s         Unspecified
Journal   No.     %      No.    %      p values and R     No.    %     No.    %        No.    %    No.    %
                                       No.     %

AP          2    50.0     –     –        1    25.0          1   25.0    –     –         –     –     –     –
DP        160    75.8    11    5.2      34    16.1          6    2.8    –     –         –     –     –     –
JAbP      140    58.8     8    3.4      82    34.5          3    1.3    3    1.3        1    0.4    1    0.4
JAP       131    74.4     –     –       19    10.8         23   13.1    1    0.6        2    1.1    1    0.6
JCCP      136    59.4    14    6.1      68    29.7         11    4.8    –     –         –     –     –     –
JComP      84    60.9     4    2.9      38    27.5          8    5.8    1    0.7        2    1.4    1    0.7
JCouP      88    62.9     1    0.7      40    28.6          9    6.4    –     –         2    1.4    –     –
JEP       126    70.8     5    2.8      34    19.1          8    4.5    –     –         4    2.2    1    0.6
JExPG      55    64.7     3    3.5      14    16.5          7    8.2    1    1.2        3    3.5    2    2.4
JPSP      133    73.9     1    0.6      43    23.9          3    1.7    –     –         –     –     –     –
PB         16    50.0     4   12.5       6    18.8          4   12.5    –     –         2    6.3    –     –
PR         19    57.6     3    9.1       5    15.2          5   15.2    1    3.0        –     –     –     –

Totals  1,090    66.3    54    3.3     384    23.3         88    5.3    7    0.4       16    1.0    6    0.4


Fisher’s p values (‘roving alphas’, exact, and in various combinations) dominate the way in which the results of statistical tests are portrayed in psychology journals. Neyman–Pearson fixed α’s are practically invisible. Indeed, 5 of the 12 journals in this sample (AP, DP, JCCP, JPSP, PR) report no use of fixed α’s whatsoever. The fact that Neyman–Pearson theory is recognized as statistical orthodoxy is overwhelmingly belied by the results in this paper. In psychology journals, Fisher’s legacy thrives; Neyman–Pearson viewpoints, on the other hand, receive only lip service.

Discussion

Confusion over the interpretation of classical statistical tests is so universal as to render their application almost meaningless. This ubiquitous confusion among researchers over the meaning of p values and α levels is easier to understand when it is pointed out that both expressions are used to refer to the ‘significance level’ of a test. But their interpretations are poles apart. The level of significance expressed by a p value in a Fisherian significance test indicates the probability of encountering data this extreme (or more so) under a null hypothesis. This p value does not pertain to some prespecified error rate, but instead is determined by the data, and has epistemic implications through its use as a measure of inductive evidence against H0 in individual studies. Contrast this with the significance level called α in a Neyman–Pearson hypothesis test. Here, the emphasis is on minimizing Type II, or β, errors (i.e. false acceptance of a null hypothesis) subject to a bound on Type I, or α, errors (i.e. false rejections of a null hypothesis). Moreover, this error minimization applies only to long-run repeated sampling from the same population, not to single experiments, and is a directive for behaviors, not a way of gathering evidence. Viewed from this perspective, the two notions of ‘statistical significance’ could hardly be more distant in meaning. And both the Fisherian and Neyman–Pearson camps, as has been shown repeatedly throughout this paper, were keenly aware of these differences and attempted, in vain as it turns out, to communicate them to research workers. These differences are summarized in Table 4.
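A small simulation (a sketch assuming Python with numpy and scipy) makes the contrast tangible. Under a true null hypothesis the p value varies from sample to sample, whereas the pre-assigned α surfaces only as the long-run frequency of erroneous rejections:

    # p as a data-based random variable vs. alpha as a long-run error rate.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    ALPHA = 0.05                      # fixed before any data are seen
    p_values = []

    for _ in range(10_000):           # repeated sampling with H0 true:
        a = rng.normal(0, 1, 30)      # both groups share the same mean,
        b = rng.normal(0, 1, 30)      # so every rejection is a Type I error
        p_values.append(stats.ttest_ind(a, b).pvalue)

    p_values = np.array(p_values)
    print(p_values[:5].round(3))      # p varies from study to study
    print((p_values < ALPHA).mean())  # ~0.05: the long-run rejection rate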

Yet these distinguishing characteristics of p’s and α’s are rarely spelled out in the literature. On the contrary, they tend to be used equivalently, even being compared with one another (e.g. p < α), especially in statistics textbooks aimed at applied researchers. Usually, in such texts, an anonymous account of standard Neyman–Pearson doctrine is put forward initially, and is often followed by an equally anonymous discussion of ‘the p value approach’. This transition from α levels to p values (and their intermixing) is typically seamless, as if it constituted a natural progression through different parts of the same coherent statistical whole.

Of course, a cynic might argue: so what if researchers cannot correctly distinguish between the meanings of p’s and α’s? S/he may ask just exactly how the confusion over p’s and α’s has impeded knowledge growth in the discipline. But surely continued, misinformed statistical testing cannot be justified on the grounds that its pernicious consequences are difficult to isolate.

Another line of argument that might be adopted by those seeking to maintain the status quo with regard to statistical testing could be that, in their interchangeable uses of p values and α’s, researchers are simply picking and choosing from the ‘best parts’ of the (unstated) Fisherian and Neyman–Pearson paradigms. Why should a researcher, one might say, feel obligated to use only one or the other of the statistical testing methods? It might be thought that employing an α level here and a p value there does no damage. But these kinds of rationalizations of improper statistical practice do not wash. The Fisherian and Neyman–Pearson conceptions of statistical testing are incommensurate. The researcher is not at liberty to pick and choose, albeit unintentionally, from two incompatible statistical approaches. Yet the crux of the problem is that this is precisely what the overwhelming majority of psychology (and other) researchers do in their empirical analyses. For example, it would be surprising indeed to learn that the author(s) of even a single article from among the 1,645 examined in our sample consciously planned to analyze the data from both the Fisherian and Neyman–Pearsonian perspectives. The proof of the pudding is seen in the fact that the unawareness among researchers of the distinctions between these perspectives is so ingrained that using p’s and α’s interchangeably is not perceived as even being a problem in the first place.

TABLE 4. Contrasting p’s and α’s

p value                                            α level

Fisherian significance level                       Neyman–Pearson significance level
Significance test                                  Hypothesis test
Evidence against H0                                Type I error—erroneous rejection of H0
Inductive philosophy—from particular to general    Deductive philosophy—from general to particular
Inductive inference—guidelines for interpreting    Inductive behavior—guidelines for making
  strength of evidence in data                       decisions based on data
Data-based random variable                         Pre-assigned fixed value
Property of data                                   Property of test
Short-run—applies to any single                    Long-run—applies only to ongoing, identical
  experiment/study                                   repetitions of original experiment/study,
                                                     not to any given study
Hypothetical infinite population                   Clearly defined population

The foregoing account illustrates why Nickerson’s (2000) statement that ‘reporting p values is not necessarily inconsistent with using an α criterion’ (p. 277) is incorrect. Contrary to Nickerson’s claims (pp. 243 and 242–243, respectively), a p value is not a bound on α, nor is α a decision criterion for p. Fisher and Neyman–Pearson themselves would have denied such assertions, for they continue to cause trouble.

While not writing about the confusion regarding p’s and α’s per se, Tryon’s (1998) comments about general misunderstandings of statistical tests by psychologists are relevant here:

. . . the fact that statistical experts and investigators publishing in the best journals cannot consistently interpret the results of these analyses is extremely disturbing. Seventy-two years of education have resulted in miniscule, if any, progress toward correcting this situation. It is difficult to estimate the handicap that widespread, incorrect, and intractable use of a primary data analytic method has on a scientific discipline, but the deleterious effects are undoubtedly substantial. (p. 796)

I share Tryon’s sentiments.

Conclusions

Among psychology researchers, the p value is interpreted simultaneously in Neyman–Pearson terms as a deductive assessment of error in long-run repeated sampling situations, and in a Fisherian sense as a measure of inductive evidence in a single study. Clearly, this is asking the impossible. More pointedly, a p value from a Fisherian significance test has no place in the Neyman–Pearson hypothesis-testing framework. And the Neyman–Pearson Type I error rate, α, has no place in Fisher’s significance-testing approach. Contrary to popular belief, p’s and α’s are not the same thing; they measure different concepts.

Such distinctions notwithstanding, the mixing of measures of evidence (p’s) with measures of error (α’s) is standard practice in the classroom and in scholarly journals. Because of this, statistical testing is largely devoid of meaning—a p and/or α value can mean almost anything to the applied researcher, and hence nothing. Furthermore, this mixing of ideas from both schools of thought has not been evenhanded. Paradoxically, while the Neyman–Pearson hypothesis-testing framework began supplanting Fisher’s views on significance testing half a century ago, this is not evident in journal articles. Here, it is Fisher’s legacy that is omnipresent.

Despite trenchant criticism of classical statistical testing (e.g. Schmidt, 1996; Thompson, 1994a, 1994b, 1997),3 following the recommendations of Wilkinson and the APA Task Force on Statistical Inference (1999), and the continued coverage given to it in the latest APA Publication Manual (2001), it appears that such testing is here to stay. My advice to those who wish to continue using these tests is to be more deliberate about their application. At the very least, researchers should purposely determine whether their concerns are with controlling errors or collecting evidence. If the former, investigators should adopt a Neyman–Pearson approach to guide behavior, but one which makes a serious attempt to estimate the likely costs associated with Type I and II errors. Researchers choosing this option should discontinue using ‘roving alphas’ and the like, and stick with fixed, pre-assigned α’s. If the focus of a study is more evidential in nature, the use of Fisher’s p value would be more appropriate, preferably in its exact format so as to once more avoid the ‘roving alphas’ dilemma.
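What such deliberateness might look like in practice is sketched below, assuming Python with scipy; the noncentrality value is purely hypothetical and serves only to illustrate taking Type II error seriously:

    # Two deliberate modes of testing, kept separate rather than mixed.
    from scipy import stats

    dfn, dfd, F_obs = 1, 123, 7.27

    # Neyman-Pearson mode: alpha fixed in advance; report a decision, not a p,
    # and weigh the Type II error under a stated alternative.
    ALPHA = 0.05
    F_crit = stats.f.isf(ALPHA, dfn, dfd)         # ~3.92
    print("reject H0" if F_obs > F_crit else "retain H0")
    beta = stats.ncf.cdf(F_crit, dfn, dfd, 9)     # beta at a hypothetical
    print(round(beta, 2))                         # noncentrality of 9

    # Fisherian mode: report the exact p as graded evidence against H0.
    print(round(stats.f.sf(F_obs, dfn, dfd), 3))  # 0.008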

For myself, I see little of value in classical statistical testing—whether of the p’s or α’s variety. A better route for assessing the reliability, validity and generalizability of empirical findings is to establish systematic replication with extension research programs focusing on sample statistics, effect sizes and their confidence intervals (Cumming & Finch, 2001; Hubbard, 1995; Hubbard, Parsa, & Luthy, 1997; Hubbard & Ryan, 2000; Thompson, 1994b, 1997, 1999, 2002). But that is another story.

Notes

1. The wording by Gigerenzer and Murray here is unfortunate, likely to cause mischief, and necessitates a digression. Their phraseology appears to convey to readers the impression that p values and α levels are the same thing. This uncharacteristic lapse in communication by Gigerenzer and Murray can, no doubt, be attributed to what Nickerson (2000, pp. 261–262) calls ‘linguistic ambiguity’, whereby even experts on matters relating to statistical testing can occasionally use misleading language. In fact, we shall see later that Gigerenzer and his colleagues are acutely aware of the distinctions between p’s and α’s.

2. An anonymous reviewer offered three (speculative) reasons why the APA Publication Manuals bestow erroneous and inconsistent advice. First, across decades of editions, some views were included at one point and not taken out when alternative views were embedded. Second, most of the APA staff do not have training in statistical methods. Third, this same staff do not seem to be overly concerned about statistical issues, as evidenced by (a) the small amount of coverage devoted to them, and (b) the lack of attention in the latest (2001) edition to revisions involving statistical content. See, in this context, Fidler (2002).

3. Indeed, such criticism can also be found in disciplines as diverse as accounting (Lindsay, 1995), economics (McCloskey & Ziliak, 1996), marketing (Hubbard & Lindsay, 2002; Sawyer & Peter, 1983) and wildlife sciences (Anderson, Burnham, & Thompson, 2000). Moreover, this criticism has escalated decade by decade, judging by the number of articles published on the topic (cf. Hubbard & Ryan, 2000).


References

American Psychological Association. (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: Author.
American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.
Anderson, D.R., Burnham, K.P., & Thompson, W. (2000). Null hypothesis testing: Problems, prevalence, and an alternative. Journal of Wildlife Management, 64, 912–923.
Barnard, G.A. (1985). A coherent view of statistical inference. Technical Report Series, Department of Statistics & Actuarial Science, University of Waterloo, Ontario, Canada.
Berger, J.O., & Delampady, M. (1987). Testing precise hypotheses (with comments). Statistical Science, 2, 317–352.
Carlson, R. (1976). The logic of tests of significance. Philosophy of Science, 43, 116–128.
Carver, R.P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378–399.
Chow, S.L. (1996). Statistical significance: Rationale, validity and utility. Thousand Oaks, CA: Sage.
Chow, S.L. (1998). Precis of statistical significance: Rationale, validity and utility (with comments). Behavioral and Brain Sciences, 21, 169–239.
Clark, C.M. (1999). Further considerations of null hypothesis testing. Journal of Clinical and Experimental Neuropsychology, 21, 283–284.
Clark-Carter, D. (1997). The account taken of statistical power in research published in the British Journal of Psychology. British Journal of Psychology, 88, 71–83.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cornfield, J. (1966). Sequential trials, sequential analysis, and the likelihood principle. The American Statistician, 20, 18–23.
Cortina, J.M., & Dunlap, W.P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2, 161–172.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532–574.
Daniel, L.G. (1998). Statistical significance testing: A historical overview of misuse and misinterpretation with implications for the editorial policies of educational journals. Research in the Schools, 5, 23–32.
Dar, R., Serlin, R.C., & Omer, H. (1994). Misuse of statistical tests in three decades of psychotherapy research. Journal of Consulting and Clinical Psychology, 62, 75–82.
Dixon, P., & O’Reilly, T. (1999). Scientific versus statistical inference. Canadian Journal of Experimental Psychology, 53, 133–149.
Falk, R., & Greenbaum, C.W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory & Psychology, 5, 75–98.
Fidler, F. (2002). The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement, 62, 749–770.
Finch, S., Cumming, G., & Thomason, N. (2001). Reporting of statistical inference in the Journal of Applied Psychology: Little evidence of reform. Educational and Psychological Measurement, 61, 181–210.
Fisher, R.A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.
Fisher, R.A. (1929). The statistical method in psychical research. Proceedings of the Society for Psychical Research, London, 39, 189–192.
Fisher, R.A. (1935a). The design of experiments. Edinburgh: Oliver & Boyd.
Fisher, R.A. (1935b). Statistical tests. Nature, 136, 474.
Fisher, R.A. (1950). Contributions to mathematical statistics. London: Chapman & Hall.
Fisher, R.A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, B, 17, 69–78.
Fisher, R.A. (1956). Statistical methods and scientific inference. Edinburgh: Oliver & Boyd.
Fisher, R.A. (1958). The nature of probability. The Centennial Review of Arts and Sciences, 2, 261–274.
Fisher, R.A. (1959). Statistical methods and scientific inference (2nd rev. ed.). Edinburgh: Oliver & Boyd.
Fisher, R.A. (1966). The design of experiments (8th ed.). Edinburgh: Oliver & Boyd.
Frick, R.W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379–390.
Gibbons, J.D., & Pratt, J.W. (1975). P-values: Interpretation and methodology. The American Statistician, 29, 20–25.
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C.A. Lewis (Eds.), A handbook for data analysis in the behavioral sciences—Methodological issues (pp. 311–339). Hillsdale, NJ: Erlbaum.
Gigerenzer, G., & Murray, D.J. (1987). Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Kruger, L. (1989). The empire of chance. New York: Cambridge University Press.
Goodman, S.N. (1992). A comment on replication, P-values and evidence. Statistics in Medicine, 11, 875–879.
Goodman, S.N. (1993). p values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137, 485–496.
Goodman, S.N. (1999). Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine, 130, 995–1004.
Grayson, D., Pattison, P., & Robins, G. (1997). Evidence, inference, and the ‘rejection’ of the significance test. Australian Journal of Psychology, 49, 64–70.
Hacking, I. (1965). Logic of statistical inference. New York: Cambridge University Press.
Hinkley, D.V. (1987). Comment. Journal of the American Statistical Association, 82, 128–129.
Hogben, L. (1957). Statistical theory. New York: Norton.
Hubbard, R. (1995). The earth is highly significantly round (p < .0001). American Psychologist, 50, 1098.
Hubbard, R., & Lindsay, R.M. (2002). How the emphasis on ‘original’ empirical marketing research impedes knowledge development. Marketing Theory, 2, 381–402.
Hubbard, R., Parsa, R.A., & Luthy, M.R. (1997). The spread of statistical significance testing in psychology: The case of the Journal of Applied Psychology, 1917–1994. Theory & Psychology, 7, 545–554.
Hubbard, R., & Ryan, P.A. (2000). The historical growth of statistical significance testing in psychology—and its future prospects. Educational and Psychological Measurement, 60, 661–681.
Huberty, C.J. (1993). Historical origins of statistical testing practices: The treatment of Fisher versus Neyman–Pearson views in textbooks. Journal of Experimental Education, 61, 317–333.
Huberty, C.J., & Pike, C.J. (1999). On some history regarding statistical testing. In B. Thompson (Ed.), Advances in social science methodology (Vol. 5, pp. 1–22). Stamford, CT: JAI Press.
Hyde, J.S. (2001). Reporting effect sizes: The roles of editors, textbook authors, and publication manuals. Educational and Psychological Measurement, 61, 225–228.
Johnstone, D.J. (1986). Tests of significance in theory and practice (with comments). The Statistician, 35, 491–504.
Johnstone, D.J. (1987a). On the interpretation of hypothesis tests following Neyman and Pearson. In R. Viertl (Ed.), Probability and Bayesian statistics (pp. 311–339). New York: Plenum.
Johnstone, D.J. (1987b). Tests of significance following R.A. Fisher. The British Journal for the Philosophy of Science, 38, 481–499.
Kempthorne, O. (1976). Of what use are tests of significance and tests of hypothesis. Communications in Statistics—Theory and Methods, A5(8), 763–777.
Keuzenkamp, H.A., & Magnus, J.R. (1995). On tests and significance in econometrics. Journal of Econometrics, 67, 5–24.
Kirk, R.E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746–759.
Krantz, D.H. (1999). The null hypothesis testing controversy in psychology. Journal of the American Statistical Association, 94, 1372–1381.
Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16–26.
Kyburg, H.E. (1974). The logical foundations of statistical inference. Dordrecht: Reidel.
Leavens, D.A., & Hopkins, W.D. (1998). Intentional communication by chimpanzees: A cross-sectional study of the use of referential gestures. Developmental Psychology, 34, 813–822.
LeCam, L., & Lehmann, E.L. (1974). J. Neyman: On the occasion of his 80th birthday. The Annals of Statistics, 2, vii–xiii.
Lehmann, E.L. (1993). The Fisher, Neyman–Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88, 1242–1249.
Lindsay, R.M. (1995). Reconsidering the status of tests of significance: An alternative criterion of adequacy. Accounting, Organizations and Society, 20, 35–53.
Loftus, G.R. (1996). Psychology will be a much better science when we change the way we analyze data. Psychological Science, 7, 161–171.
Macdonald, R.R. (1997). On statistical testing in psychology. British Journal of Psychology, 88, 333–347.
McCloskey, D.N., & Ziliak, S.T. (1996). The standard error of regressions. Journal of Economic Literature, 34, 97–114.
Meehl, P.E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103–115.
Mulaik, S.A., Raju, N.S., & Harshman, R.A. (1997). There is a time and a place for significance testing. In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 65–116). Hillsdale, NJ: Erlbaum.
Nester, M.R. (1996). An applied statistician’s creed. Applied Statistics, 45, 401–410.
Neyman, J. (1942). Basic ideas and some recent results of the theory of testing statistical hypotheses. Journal of the Royal Statistical Society, 105, 292–327.
Neyman, J. (1950). First course in probability and statistics. New York: Holt.
Neyman, J. (1957). ‘Inductive behavior’ as a basic concept of philosophy of science. International Statistical Review, 25, 7–22.
Neyman, J. (1961). Silver jubilee of my dispute with Fisher. Journal of the Operations Research Society of Japan, 3, 145–154.
Neyman, J. (1962). Two breakthroughs in the theory of statistical decision making. Review of the International Statistical Institute, 30, 11–27.
Neyman, J. (1967). R.A. Fisher (1890–1962), an appreciation. Science, 156, 1456–1460.
Neyman, J. (1971). Foundations of behavioristic statistics (with comments). In V.P. Godambe & D.A. Sprott (Eds.), Foundations of statistical inference (pp. 1–19). Toronto: Holt, Rinehart & Winston of Canada.
Neyman, J. (1977). Frequentist probability and frequentist statistics. Synthese, 36, 97–131.
Neyman, J., & Pearson, E.S. (1928a). On the use and interpretation of certain test criteria for purposes of statistical inference. Part I. Biometrika, 20A, 175–240.
Neyman, J., & Pearson, E.S. (1928b). On the use and interpretation of certain test criteria for purposes of statistical inference. Part II. Biometrika, 20A, 263–294.
Neyman, J., & Pearson, E.S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Ser. A, 231, 289–337.
Nickerson, R.S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301.
Nix, T.W., & Barnette, J.J. (1998). The data analysis dilemma: Ban or abandon. A review of null hypothesis significance testing. Research in the Schools, 5, 3–14.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.
Pearson, E.S. (1962). Some thoughts on statistical inference. Annals of Mathematical Statistics, 33, 394–403.
Pearson, E.S. (1990). ‘Student’: A statistical biography of William Sealy Gosset. Oxford: Clarendon Press.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, 50, 157–175.
Reid, C. (1982). Neyman—from life. Berlin: Springer-Verlag.
Rosnow, R.L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276–1284.
Royall, R.M. (1997). Statistical evidence: A likelihood paradigm. New York: Chapman & Hall.
Rozeboom, W.W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416–428.
Sawyer, A.G., & Peter, J.P. (1983). The significance of statistical significance tests in marketing research. Journal of Marketing Research, 20, 122–133.
Schmidt, F.L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods, 1, 115–129.
Seidenfeld, T. (1979). Philosophical problems of statistical inference: Learning from R.A. Fisher. Dordrecht: Reidel.
Spielman, S. (1974). The logic of tests of significance. Philosophy of Science, 41, 211–226.
Thompson, B. (1994a). Guidelines for authors. Educational and Psychological Measurement, 54, 837–847.
Thompson, B. (1994b). The pivotal role of replication in psychological research: Empirically evaluating the replicability of sample results. Journal of Personality, 62, 157–176.
Thompson, B. (1997). Editorial policies regarding statistical significance tests: Further comments. Educational Researcher, 26, 29–32.
Thompson, B. (1999). If statistical significance tests are broken/misused, what practices should supplement or replace them? Theory & Psychology, 9, 165–181.
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31, 25–32.
Tryon, W.W. (1998). The inscrutable null hypothesis. American Psychologist, 53, 796.
Wilkinson, L., & the APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Zabell, S.L. (1992). R.A. Fisher and the fiducial argument. Statistical Science, 7, 369–387.

Acknowledgements. The author thanks Susie Bayarri, Jim Berger and Steve Goodman for conversations and correspondence related to the topics discussed in this paper. I am indebted to Gerd Gigerenzer and an anonymous reviewer for suggestions that have improved the manuscript. Any errors are my responsibility.

Raymond Hubbard is Thomas F. Sheehan Distinguished Professor of Marketing in the College of Business and Public Administration, Drake University. His research interests include applied methodology and the sociology of knowledge development in the management and social sciences. He has published several articles on these topics. Address: College of Business and Public Administration, Drake University, Des Moines, IA 50311, USA. [email: [email protected]]
