
THE ACCOUNTING REVIEW, Vol. LXII, No. 1, January 1987

EDUCATION RESEARCH

Frank Selto, Editor

Inference from Empirical Research

David Burgstahler

ABSTRACT: Many researchers describe themselves as Bayesians in that they revise their prior beliefs based on observed empirical evidence. However, most studies are designed and reported as classical hypothesis tests, and research design issues are typically considered as determinants of abstract properties of statistical tests. Thus, although the primary function of empirical research is to influence beliefs, research design issues are seldom considered in their fundamental role as determinants of beliefs. In this paper, a Bayesian perspective is used to analyze the role of basic properties of hypothesis tests in the revision of beliefs. Two main points are emphasized. First, hypothesis tests with low power are not only undesirable ex ante (because of the low probability of observing significant results) but also ex post (because little probability revision should be induced even when significant results are observed). Second, irrespective of the usual issues of statistical and methodological validity, the effective level of tests in published research is likely to exceed the stated level, thus reducing the amount of probability revision justified by reported results. In combination, these conclusions are especially troublesome. If tests reported in the accounting literature are characterized by both low power and high effective levels, the results of published tests properly have little or no impact on the beliefs of a Bayesian. The Bayesian framework is useful in understanding and analyzing the tradeoffs which are an inherent part of empirical research. The analysis here identifies a Bayesian motivation for the common recommendations that researchers should attempt to maximize power in the design and execution of empirical tests and attempt to maintain the effective level of tests at their stated levels. Further, the analysis demonstrates the importance of explicit descriptions of research choices to allow (Bayesian) readers to properly revise their beliefs in response to reported empirical evidence. Finally, the model illustrates the role of prior beliefs and the characteristics of empirical tests in research and publication decisions.

THE merits of formal Bayesian methods have been discussed in numerous articles and books.[1] However, this discussion has had relatively little effect on the way research is conducted and reported in accounting and related fields.[2] Most studies are designed and reported in terms of classical hypothesis tests, and applications of formal Bayesian methods are the exception rather than the rule. Consequently, most empirical issues are analyzed only in terms

I would like to thank participants in the 1983 UBC-Oregon-Washington Research Conference and especially Robert Bowen, Lane Daley, Jim Jiambalvo, Bill Kinney, Eric Noreen, Jamie Pratt, Graeme Rankine, Ed Rice, Hein Schreuder, and three anonymous reviewers for helpful comments on earlier drafts of this paper.

David Burgstahler is Assistant Professor of Accounting, University of Washington.

Manuscript received July 1985. Revisions received April 1986 and June 1986. Accepted July 1986.

203


204 The Accounting Review, January 1987

of the effect on properties of classical hypothesis tests. The more fundamental issue of the effect on belief revision is not directly addressed.

Although formal Bayesian methods are seldom used, many accounting researchers describe themselves as Bayesians who revise their prior beliefs based on observed empirical evidence. The purpose of this paper is to explore the implications of a Bayesian approach to revising beliefs based on the reported results of classical hypothesis tests. This framework provides useful insights into the process by which empirical evidence is used to revise beliefs. For example, the Bayesian perspective demonstrates the distinction between statistical significance and statistical persuasiveness by showing that there are situations where statistically significant results should not lead to revision of the reader's prior beliefs. Further, the characteristics of the research and publication process suggest that these situations may not be unusual.

The remainder of the paper is organized as follows: After briefly reviewing the fundamentals of classical hypothesis testing in the following section, a Bayesian framework for integrating empirical evidence with prior beliefs to form posterior beliefs is described in the next section.[3] In this framework, three factors jointly determine posterior beliefs. The first two, the power and significance level of the empirical test, are discussed in the succeeding two sections. The third factor, prior beliefs, is discussed next along with other factors which influence research and publication decisions. Finally, a summary and conclusions are presented.

HYPOTHESIS TESTING

Development of a hypothesis test comprises three interrelated steps: (1) choose a test statistic, (2) derive the distribution of the test statistic under the null hypothesis, and (3) define a rejection region such that the probability of observing a test statistic in the rejection region if the null hypothesis holds is some (pre-specified) significance level. If the observed value of the test statistic falls in the rejection region, the null hypothesis is rejected; otherwise, the null hypothesis is not rejected. Since the probability of the statistic falling in the rejection region if the null hypothesis holds is relatively small (i.e., equal to the level of significance), observation of a significant statistic is interpreted as evidence against the null hypothesis.
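These three steps can be made concrete with a small sketch (mine, not the paper's; it assumes a one-sided test whose statistic is standard normal under the null):

```python
from statistics import NormalDist  # standard library

# Step 1: choose a test statistic -- here, a z-statistic.
# Step 2: under the null hypothesis, the z-statistic is standard normal.
# Step 3: define the rejection region so that P(reject | null) = significance level.
alpha = 0.05
cutoff = NormalDist().inv_cdf(1 - alpha)  # upper-tail critical value, about 1.645

def significant(z_observed, alpha=0.05):
    """True when the observed statistic falls in the rejection region."""
    return z_observed > NormalDist().inv_cdf(1 - alpha)

print(round(cutoff, 3))   # 1.645
print(significant(2.10))  # True: the null hypothesis is rejected
print(significant(1.20))  # False: the null hypothesis is not rejected
```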

The preceding description omits some important aspects of the problem of designing a satisfactory hypothesis test, but the scope of published descriptions of hypothesis tests (and, by implication, the scope of issues considered in constructing tests) is often similarly limited. In developing and describing a hypothesis test, attention is typically focused on the significance level of the test. However, if the level of significance were the only concern, a random number generator could be used to construct an essentially costless test of any hypothesis.

Another critical concern is the power of the test.[4] The alternative hypothesis implies the distribution for the test statistic which determines the power of the test, i.e., the probability that the test statistic will fall in the rejection region (will be significant) when the alternative hypothesis holds.[5]

[1] See, for example, Savage [1954], Raiffa and Schlaifer [1961], Edwards, Lindman, and Savage [1963], or Zellner [1971].

[2] Efron [1986] suggests several factors which may explain why most scientific data analysis is carried out in a non-Bayesian framework.

[3] Similar models are found in Zellner [1971, Chapter 10] and Judge et al. [1985].

[4] In the larger context of a decision problem, the loss function must also be viewed as an integral part of hypothesis testing (see Savage [1954, Chapter 16]).

In evaluating results of completed empirical research, there tends to be an emphasis on either the null or the alternative hypothesis distribution, but not both. When a significant test statistic is observed, factors which might have caused the effective level of the test to exceed the stated level are explored. On the other hand, when the observed test statistic is not significant, factors which might have caused the power of the test to be low are examined. However, as emphasized in the following sections, both the level and power of a test are important regardless of the outcome of the test, i.e., regardless of whether the observed statistic is significant.

A BAYESIAN FRAMEWORK FOR INTEGRATING EMPIRICAL EVIDENCE

Let Ho and HA, respectively, denote the null and alternative hypotheses. Let the observation of a significant test statistic be denoted by S and observation of a nonsignificant test statistic by NS. Finally, let the probability of observing S when the null hypothesis holds be α and let the probability of observing NS when the alternative holds be β. Then the following familiar table of probabilities of observing S or NS conditional on Ho or HA applies:

CONDITIONAL PROBABILITIES OF OUTCOMES

                                   State
Value of Test Statistic    Ho                 HA
NS                         P[NS|Ho] = 1-α     P[NS|HA] = β
S                          P[S|Ho] = α        P[S|HA] = 1-β

A Bayesian will combine prior beliefs about Ho with observed empirical evidence to form a posterior belief. For example, the posterior belief in Ho given that empirical evidence S is observed is:

P[Ho|S] = P[S|Ho]P[Ho] / (P[S|Ho]P[Ho] + P[S|HA]P[HA]) = αP[Ho] / (αP[Ho] + (1-β)P[HA])   (1)

where P[Ho] and P[HA] denote the prior beliefs in Ho and HA, respectively. Similarly,

P[HA|S] = (1-β)P[HA] / (αP[Ho] + (1-β)P[HA]) = 1 - P[Ho|S]   (2)

In the following sections, the posterior belief in the null hypothesis given observation of a significant statistic, P[Ho|S], is discussed as a function of the power of the test (1-β), the level of the test (α), and the prior belief in the null hypothesis.[6]
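Equations (1) and (2) are straightforward to evaluate numerically. The sketch below (mine, not the paper's) takes the level α, the power 1-β, and the prior P[Ho]:

```python
def posterior_after_significant(alpha, power, prior_null):
    """Return (P[Ho|S], P[Ha|S]) from equations (1) and (2)."""
    prior_alt = 1.0 - prior_null
    p_ho = alpha * prior_null / (alpha * prior_null + power * prior_alt)
    return p_ho, 1.0 - p_ho

# A powerful test: a significant result sharply reduces belief in the null.
print(posterior_after_significant(alpha=0.05, power=0.90, prior_null=0.5))
# -> roughly (0.053, 0.947)

# A weak test whose power barely exceeds its level: almost no revision.
print(posterior_after_significant(alpha=0.05, power=0.06, prior_null=0.5))
# -> roughly (0.455, 0.545)
```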

POWER AS A DETERMINANT OF POSTERIORS

Although the issue of power is often discussed when a nonsignificant test statistic is observed, the importance of power when a significant statistic is observed has received little attention.[7] The power of a test which has resulted in a significant test statistic is commonly viewed as a moot issue; there seems little reason to be concerned with the power of a test which has been "powerful enough" to yield a significant test statistic. However, equations (1) and (2) show that power (1-β) remains a critical component of the belief revision process even when a significant test statistic is observed. A Bayesian cannot know how (or even whether) to revise beliefs in response to a significant test statistic without knowledge of the power of the test.

[5] Composite alternative hypotheses would imply a set of alternative hypothesis distributions, giving rise to a power function defined over the set of alternative hypotheses. Since composite alternative hypotheses complicate the exposition without changing the qualitative conclusions, only simple alternatives are considered here. For an application of the model in Section 3 to composite alternatives, see Zellner [1971].

[6] Since P[Ho|S] = 1 - P[HA|S], discussion of the posterior belief in the alternative hypothesis would be logically equivalent.

FIGURE 1
POSTERIOR BELIEF IN THE NULL AFTER OBSERVING A SIGNIFICANT TEST STATISTIC (FOR PRIOR BELIEF OF .5)
[Figure: posterior belief in the null (vertical axis) plotted against power of the test (horizontal axis, 0.0 to 1.0), with three curves for significance levels .20, .10, and .05.]

This point can be further illustrated by examining posterior beliefs given observation of a significant test statistic as a function of power. Figure 1 plots the relationship between posterior belief in the null hypothesis and power of the test for three different significance levels, .05, .10, and .20, with prior beliefs P[Ho] = P[HA] = .5.[8] For more powerful tests, there is a lower degree of belief in the null (and greater belief in the alternative) as a result of observation of a significant statistic.[9] If the power of a test is as low as the significance level of the test, then the posterior will be unchanged from the prior regardless of the outcome of the test.[10] This is illustrated in Figure 1 where posterior beliefs are seen to be equal to the prior belief (.5) at the value where power is equal to the level of the test. Intuitively, when a significant test statistic is equally likely under the null and alternative hypotheses (i.e., when the level and power of the test are the same), results of the test do not change beliefs about the null hypothesis.[11]

[7] For example, Cook and Campbell [1979, p. 40] emphasize the importance of analysis of power when no effect is observed but do not mention the importance of power when a significant effect is observed. See also Simonds and Collins [1978, pp. 649-650], Foster [1980, p. 41], Foster [1981, pp. 220-222], Ball and Foster [1982, p. 186], Beaver [1982, pp. 327-328], and Kinney [1986, pp. 345-348].
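The fixed point described above, where the posterior equals the prior whenever power equals the level, can be checked numerically with equation (1). A sketch (mine, assuming the simple-alternative model):

```python
def posterior_null(alpha, power, prior_null=0.5):
    # Equation (1): P[Ho|S] for a simple alternative hypothesis.
    return alpha * prior_null / (alpha * prior_null + power * (1.0 - prior_null))

# When power equals the significance level, a significant result is uninformative:
for level in (0.05, 0.10, 0.20):
    print(level, posterior_null(level, power=level))  # posterior stays at the .5 prior

# Tracing the level-.05 curve of Figure 1 as power rises:
for power in (0.05, 0.25, 0.50, 0.90):
    print(power, round(posterior_null(0.05, power), 3))
```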

In empirical accounting research, decisions made by researchers often result in bias toward non-rejection of the null hypothesis, i.e., they reduce the power of the test.[12] The ex ante risk that the reduction in power will result in insignificant (and probably unpublishable) results is well-known and widely acknowledged. However, it is not widely recognized that this is more than an ex ante problem. Even if the results from a test with low power are significant, the results have little value; even significant results from a low-power test do not cause much change in a Bayesian's beliefs about the null hypothesis.

Researchers sometimes expect a single set of data to serve two purposes. Results are calculated to decide whether a test is sufficiently powerful as well as to draw a conclusion about the hypothesis being tested. In fact, researchers may even mistakenly assert that a significant result from a low-power test is more convincing evidence against the null than a significant result from a high-power test because a more extreme test statistic is required to attain significance for a low-power test.[13] However, not all tests which yield significant results are powerful tests and it is inappropriate to judge the power of a test based solely on the significance of the observed results. Moreover, from equations (1) and (2) it is clear that the power of the test is a required input to Bayesian belief revision in response to empirical results.

In summary, the power of a test is critical in drawing inferences from empirical results. The statement that significant results are evidence against the null hypothesis is, by itself, incomplete; significant results are only strong evidence against the null for powerful tests.

EFFECTIVE LEVEL OF PUBLISHED EMPIRICAL TESTS

The role of the effective level of published tests in belief revision is illustrated by the three curves in Figure 1 which show posterior beliefs for tests of three different significance levels. Observation of a test statistic significant at a lower level, α, results in a lower posterior belief in the null hypothesis, P[Ho|S]. Thus, the effective level of a published result is a critical component of the belief revision process for a Bayesian.

[8] The specific significance levels and prior beliefs plotted in Figure 1 were chosen to provide quantitative examples, but similar qualitative conclusions would apply for any (non-dogmatic) priors and any (non-zero) significance levels.

[9] Note, however, that even for a test with power of 1.0, a significant statistic does not result in a posterior belief in the alternative of 1.0 unless the level of the test is 0.0. An ideal test with power 1.0 and level 0.0 would enable the researcher to reason by logical implication rather than by statistical inference (see Bakan [1966]).

[10] When the power of the test is less than the effective level of the test, observation of a "significant" test statistic actually results in an increased belief in the null hypothesis. In this somewhat perverse situation, referred to as a biased hypothesis test [Mood, Graybill, and Boes, 1974], a significant test statistic is actually more likely under the null than under the alternative and thus increases belief in the null.

[11] Thus, a "significant" test statistic from the random number generator mentioned earlier would not change beliefs because a significant test statistic would be equally likely under the null and alternative hypotheses.

[12] For some specific examples, see Beaver, Clarke, and Wright [1979, pp. 326-327], Beaver, Lambert, and Morse [1980, pp. 11-12], and Bowen, Noreen, and Lacey [1981, p. 178].

[13] For example, Zmijewski and Hagerman [1981, p. 138] construct a test so as "to intentionally bias the estimator in favor of the null hypothesis to provide for a stronger test." Similarly, Bakan [1966] concludes that a significant result from a test with a small sample size is more convincing than a significant result from a test with a larger sample size.
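The biased-test case noted above, where power falls below the effective level, is easy to verify with Bayes' rule. In this sketch (mine, not the paper's, assuming a simple alternative), a "significant" result raises belief in the null:

```python
def posterior_null(alpha, power, prior_null=0.5):
    # P[Ho|S] by Bayes' rule, simple alternative hypothesis assumed.
    return alpha * prior_null / (alpha * prior_null + power * (1.0 - prior_null))

prior = 0.5
post = posterior_null(alpha=0.05, power=0.03, prior_null=prior)  # power < level
print(round(post, 3))  # 0.625 -- belief in the null INCREASES after a "significant" result
assert post > prior
```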

The publication process serves as a filter which affects the probability of observing a published significant test statistic when the null hypothesis holds.[14] Some of the incentives which determine the characteristics of this filter are discussed in the following section. In this section, behavioral factors which may cause the effective level of the tests reported in the literature to exceed their nominal level are discussed.[15] These behaviors create biases which are described below as researcher-induced or editor-induced, though the categories are not always distinct. Researchers may be aware of many of these biases individually but their joint and cumulative effects are not often considered, nor is their impact on a Bayesian integration of empirical evidence with prior beliefs.

Researcher-induced biases are the result of choices made in the course of a research project. Greenwald [1975, p. 3] summarizes a number of researcher behaviors which lead to a high probability of observing significant results in published research even when the null hypothesis holds (i.e., a high effective level for published tests), including:

1. submitting results for publication more often when the null hypothesis has been rejected than when it has not been rejected;

2. continuing research on a problem when results have been close to rejection of the null hypothesis ('near significant'), while abandoning the problem if rejection of the null hypothesis is not close;

3. elevating ancillary hypothesis tests or fortuitous findings to prominence in reports of studies for which the major dependent variables did not provide a clear rejection of the null hypothesis;

4. revising otherwise adequate operationalizations of variables when unable to obtain rejection of the null hypothesis and continuing to revise until the null hypothesis is (at last!) rejected or until the problem is abandoned without publication;

5. failing to report initial data collections (renamed as 'pilot data' or 'false starts') in a series of studies that eventually leads to a prediction-confirming rejection of the null hypothesis; and

6. failing to detect data analysis errors when an analysis has rejected the null hypothesis by miscomputation, while vigilantly checking and rechecking computations if the null hypothesis has not been rejected.

The effective significance level of published results may also be increased by a practice sometimes described as "refining a theory" in the course of a research project. For example, a project might begin with a theory which suggests the direction of a relationship between two or more variables but is not sufficiently refined to specify the precise functional form of the relationship between the dependent variable and the independent variables. The theory may not be sufficiently well-developed to specify which variables should be included, whether the relationship should be linear or quadratic, or whether the effects of independent variables are additive or multiplicative. As the research progresses, the theory is sometimes "refined" by trying several functional forms and selecting the one which is most consistent with the data. For example, Bowen [1981] explicitly uses significance and consistency with theory as criteria for choosing a particular functional form for a regression equation.[16] When data-fitting is used to refine the form of a regression equation and the same data are used for hypothesis tests, the effective significance level of the test may substantially exceed the stated level.[17]

[14] Further discussion of this point is found in Sterling [1959], Bakan [1966], Lykken [1968], Walster and Cleary [1970], Greenwald [1975], and McCloskey [1983].

[15] The discussion in this section focuses on some behavioral factors which may lead to invalid nominal significance levels. A more general discussion of threats to validity is found in Cook and Campbell [1979].

The most troublesome aspect of the behaviors discussed above is that their precise effects on the effective level of published tests are difficult to identify and even more difficult to quantify. However, it would be naive to conclude that because the effects cannot easily be quantified, they must be unimportant. Rather, researchers must attempt to minimize these effects by meticulous adherence to carefully planned research designs. Researchers should begin with a well-developed theory which specifies the relevant variables and the functional form of the relationship. Whenever possible, major research decisions such as operational definitions of variables, data to be collected, and sample sizes should be made in advance and researchers should be reluctant to modify the research plan unless the modified plan is viewed as a separate test requiring new data. Only by conducting carefully planned tests of well-developed theories can researchers effectively control the significance level of reported tests.

The proportion of incorrect rejections in published research is also increased by the behavior of journal editors and reviewers. If, as casual empiricism suggests, stricter editorial standards are applied to completed research with nonsignificant results, the proportion of published incorrect rejections of the null hypothesis will be higher than implied by the nominal level of the tests.[18] In fact, if only rejections of null hypotheses were published, then, even for a correct null hypothesis, the probability of observing a published study which rejects the hypothesis would increase rapidly as the number of tests of the hypothesis increased.[19]

Most researchers conscientiously attempt to avoid the behaviors described above, though few can claim to be free of their effects. Also, although there is probably some editorial bias against publishing insignificant results, editors generally do not follow a policy of simply publishing all significant results and rejecting all nonsignificant results. Well-designed and executed research is published despite statistically insignificant results; poorly-designed and executed research is rejected despite statistically significant results.

[16] While other researchers in accounting undoubtedly consider significance and consistency in selecting a model to be used for significance tests, Bowen is especially conscientious in explicitly reporting his criteria. When specifications are tried and discarded without being reported, the reader has no way to assess the impact of alternative specifications on the effective level of the reported tests. For other specific examples, see McCloskey's [1985] description of a sample of ten recent papers from economics using regression analysis, of which only two do not admit experimenting with alternative specifications.

[17] Freedman [1983] demonstrates how seriously the effective significance level can be distorted by selecting variables for inclusion in a regression equation based on significance of the individual coefficients.

[18] In order to eliminate the bias in acceptance criteria, Walster and Cleary [1970] suggest an editorial policy in which data and results are withheld when an article is submitted for review so that publication decisions are based on the design rather than the results. Kinney [1986] suggests a similar policy for evaluating dissertation research.

[19] To make the risk of observing a significant result purely by chance more concrete, let k be the number of independent tests conducted, all at level α. Then, the probability of observing at least one rejection of the hypothesis is 1-(1-α)^k. As k goes from 1 to 5 to 10, the probability of observing at least one rejection goes from .05 to .22 to .40 for α=.05 and from .10 to .40 to .65 for α=.10.
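The compounding of chance rejections across k independent tests follows 1-(1-α)^k, and a quick check (my sketch, not the paper's) reproduces the cited figures up to rounding:

```python
def prob_any_rejection(k, alpha):
    """Probability of at least one rejection among k independent level-alpha tests of a true null."""
    return 1.0 - (1.0 - alpha) ** k

for alpha in (0.05, 0.10):
    probs = [round(prob_any_rejection(k, alpha), 2) for k in (1, 5, 10)]
    print(alpha, probs)
# alpha = .05 -> roughly .05, .23, .40; alpha = .10 -> roughly .10, .41, .65
```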

Nonetheless, the conclusion remains: If any of these behaviors is exhibited by researchers or editors, the effective proportion of published incorrect rejections of null hypotheses may systematically exceed the proportion implied by the nominal level of the tests conducted. As a consequence, the belief revision by a Bayesian in response to a published empirical result should be less than would be implied by the (incorrect) nominal level of significance. Moreover, readers attempting to integrate published results with their prior beliefs cannot determine the effects of unrecorded choices made by researchers and editors and thus are missing a critical input to the belief revision process.[20][21]

PRIOR BELIEFS AND INCENTIVES FOR RESEARCH PRODUCTION AND DISSEMINATION

The role of prior beliefs in belief revision is clear from equations (1) and (2); posterior beliefs are directly related to prior beliefs and individuals with different prior beliefs will have different posterior beliefs. In this section, the discussion focuses on another aspect of prior beliefs, the role of prior beliefs in research and publication decisions. Prior beliefs influence decisions because of their role in researchers' perceptions of the probable outcome of proposed research projects and in editors' interpretations of submitted results. The influence of prior beliefs is complex; researchers' decisions reflect perceptions of editors' beliefs and editors' decisions are influenced by perceptions of researchers' beliefs.

A test of a hypothesis must satisfy two necessary conditions to be considered for publication: The hypothesis (1) must not be trivial, and (2) must not be quasi-certain. A null hypothesis not related to a substantive economic issue is trivial. A null hypothesis for which substantially all researchers hold prior beliefs "close to" zero or for which substantially all researchers hold prior beliefs "close to" one is quasi-certain. The results of a test of a quasi-certain hypothesis are not informative because they are "close to" perfectly predictable.[22] The tastes and preferences of an editor determine how these definitions are operationalized. The results of tests of hypotheses which an editor deems trivial or quasi-certain will not be published. In the remainder of this section, all hypotheses contemplated by researchers and editors are assumed to meet these two necessary conditions.

Editors may be reluctant to publish significant results when prior beliefs are strong, i.e., when the prior belief in the null hypothesis is either very low or very high. When the prior belief in the null is low, significant results are consistent with the priors and may not change beliefs much.²³ An editor may be reluctant to expend resources to publish results that merely confirm strong prior beliefs. On the other hand, when the prior belief in the null is high, an editor may also be reluctant to publish significant results. For example, with a prior belief in the null of .95, significant results from a test with a level of .10 and power of .90 would cause posteriors to be revised downward to approximately .68, a fairly substantial revision of beliefs. However, the posterior belief in the null hypothesis given a significant test statistic can also be interpreted as the probability that the significant results are "incorrect." Thus, the posteriors indicate that the odds are greater than two to one that the significant result is "incorrect!" Because editors are likely to be reluctant to publish a result that has a high probability of being incorrect, hypotheses for which priors are strong require especially high-power tests with carefully controlled significance levels.

²⁰ Based on a number of assumptions about researcher and editor behaviors, Greenwald [1975] estimated that the proportion of incorrect rejections of null hypotheses appearing in the social psychology literature may be as high as .30 (for tests conducted at the .05 level). Although this estimate does not necessarily apply to research in accounting (since different behavioral assumptions would lead to different estimates of the proportion of incorrect rejections), it does give an impression of the potential magnitude of the effects of these biases.

²¹ The level of published tests remains a concern even with publication of the results of more than one test of a hypothesis. Christie [1985] proposes an interesting technique for combining evidence from separate tests of a hypothesis. Since the technique requires valid assessments of effective significance levels for the individual tests, correct nominal significance levels are essential.

²² It might seem that test results inconsistent with highly certain priors would be especially interesting. However, as illustrated in the following paragraph, unless it can be established that the test had very high power and a very low significance level, it may be more likely that the results are incorrect than that the prior beliefs are incorrect.
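The revision in this example follows directly from Bayes' rule. A short calculation (the function below is my own sketch, not the paper's notation) reproduces the numbers:

```python
def posterior_null_given_significant(prior, level, power):
    """Posterior belief in the null after observing a significant result.

    By Bayes' rule, P(H0 | sig) = P(sig | H0) P(H0) / P(sig), where
    P(sig | H0) is the test's level (alpha) and P(sig | H0 false) is
    its power.
    """
    p_significant = level * prior + power * (1.0 - prior)
    return level * prior / p_significant

# The paper's example: prior .95, level .10, power .90.
p = posterior_null_given_significant(prior=0.95, level=0.10, power=0.90)
print(round(p, 2))  # 0.68 -- odds better than two to one that the result is "incorrect"
```

With these numbers the posterior is .095/.14 ≈ .679, matching the approximate .68 in the text.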

A primary purpose of publishing the results of empirical research is to influence researcher beliefs about a hypothesis and thus influence the further development of theory. However, a secondary effect, apparent from equation (3) below, is that changes in beliefs affect the expected utility to other researchers of devoting resources to tests of a hypothesis. Thus, publication of results which are largely consistent with prior beliefs can be valuable when the results strengthen beliefs so that other researchers will not choose to expend additional resources to conduct tests of the hypothesis.²⁴

A researcher will choose the research project, P, which maximizes expected utility,

    E[U(P)] = Σ_s Σ_o Pr(s) Pr(o | s, P) U(s, o),    (3)

where the states s are "null true" and "null false" and the outcomes o are "significant" and "not significant."

The expected utility of a proposed project depends on prior beliefs about the hypothesis as well as the power and level of the proposed hypothesis test. Different researchers may have different prior beliefs and may have access to tests with different powers, or have different perceptions of the power of a given test. These differences, along with other factors, lead different researchers to choose different research projects.²⁵
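As a concrete sketch of this choice problem, the snippet below evaluates the expected utility of equation (3) for two candidate tests. The function name and the payoff numbers are invented for illustration; the probabilities use the classical definitions of level and power.

```python
# Hypothetical sketch of the project-choice calculation in equation (3):
# expected utility sums Pr(state) * Pr(outcome | state, P) * U(state, outcome)
# over the states (null true / null false) and outcomes (significant or not).

def expected_utility(prior_null, level, power, utility):
    """Expected utility of a test with the given level and power, given the
    researcher's prior belief in the null.  `utility` maps (state, outcome)
    pairs to payoffs; the payoffs here are purely illustrative."""
    probs = {
        ("null_true", "significant"): prior_null * level,
        ("null_true", "not_significant"): prior_null * (1 - level),
        ("null_false", "significant"): (1 - prior_null) * power,
        ("null_false", "not_significant"): (1 - prior_null) * (1 - power),
    }
    return sum(p * utility[key] for key, p in probs.items())

# Invented payoffs: publication (a significant result) is rewarded,
# especially when the null is in fact false.
payoffs = {
    ("null_true", "significant"): 2.0,
    ("null_true", "not_significant"): 0.0,
    ("null_false", "significant"): 10.0,
    ("null_false", "not_significant"): 0.0,
}

high_power = expected_utility(prior_null=0.3, level=0.05, power=0.8, utility=payoffs)
low_power = expected_utility(prior_null=0.3, level=0.05, power=0.2, utility=payoffs)
print(high_power > low_power)  # True: the more powerful test is preferred, before costs
```

Costs are netted out of the payoffs here; as the text goes on to argue, once a powerful test is costly enough, the low-power, low-cost project can dominate.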

The expected utility of a proposed project also depends on the utility of the state/outcome combinations. Publications are major determinants of the decisions which determine most monetary and nonmonetary rewards for academic researchers, including promotion and tenure decisions, salary increases, consulting rates, and job changes. Consequently, the utility of each state/outcome combination depends primarily on the product of the reward to publication and the probability of publication. Publications that deal with more important hypotheses and that have a larger impact on beliefs are likely to lead to greater rewards to the researcher. The probability of publication depends on researcher perceptions of editorial policies and beliefs.

²³ The extreme case is the test of a hypothesis which is obviously false; significant results from a test of such a (quasi-)certain hypothesis do not change beliefs at all.

²⁴ A third purpose of publication is to support and disseminate methodological innovation which improves the level and power of subsequent tests, thus increasing the belief revision supported by the results of subsequent research.

²⁵ More generally, researchers and editors must choose a set of research projects rather than just a single research project. Choosing a portfolio of projects which maximizes expected utility is a more complex problem than the one considered here and raises additional issues beyond the scope of this discussion.

²⁶ The utility-maximizing P* may be the null research project, which allows a researcher to engage in alternative activities, e.g., consulting or consumption of leisure.

If, as suggested in the previous section, researchers perceive that significant results are much more likely to be published, then researchers will choose the project which yields an optimal combination of probability of significant results and low cost. Assuming that the utility for significant results ("rejecting the null") is much higher when the null is false, researchers will prefer powerful tests of null hypotheses for which they have low prior beliefs. Thus, the utility-maximizing project may be a powerful test of a hypothesis which the researcher believes is likely to be false. However, powerful tests of hypotheses are often costly; for example, a high-power test may require a large amount of researcher time for data collection and analysis or may require that the researcher acquire additional expertise. Consequently, the utility-maximizing research project could also be a low-power, but low-cost, test of a null hypothesis which the researcher believes is likely to be false.

Under some circumstances, researchers could also have high utility for rejecting a true hypothesis. For example, researchers may have high utility for publishing significant, but incorrect, results if subsequent results which contradict the original results are not likely to be published until after some decision affected by publication (e.g., tenure) has been made. Alternatively, for very costly tests or tests for which it is difficult to assess power, there may be little chance that a replication contradicting the results will ever be conducted and published to reveal an error. Even if subsequent tests reveal that the hypothesis is true, the impact on the original researcher's reputation and wealth may be small as long as there is no evidence of deceit or incompetence. Thus, because of the possibility of observing significant results even when the null hypothesis holds, the utility-maximizing research project might be a low-cost test of a null hypothesis which the researcher believes is likely to be true. In fact, in choosing a project from the set of research projects for which the power of the test is nearly as low as the level, the primary consideration would be the cost of available tests, rather than researcher beliefs about the hypotheses.
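The last point can be made numerically: when power is nearly as low as the level, the likelihood ratio carried by a significant result is close to one, so posteriors stay close to priors. A small sketch (the function name is mine, not the paper's):

```python
def posterior_null_given_significant(prior, level, power):
    # Bayes' rule: P(sig | null true) = level, P(sig | null false) = power.
    return level * prior / (level * prior + power * (1 - prior))

prior = 0.5
# Power barely above the level: the likelihood ratio is near one, so a
# "significant" result moves beliefs very little.
weak = posterior_null_given_significant(prior, level=0.05, power=0.06)
# A genuinely powerful test at the same level moves beliefs substantially.
strong = posterior_null_given_significant(prior, level=0.05, power=0.80)
print(round(weak, 3), round(strong, 3))  # 0.455 0.059
```

The weak test leaves the posterior within five points of the .5 prior, which is why such results should induce little belief revision no matter how they turn out.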

Prior beliefs play an important role throughout the research process. They directly determine posterior beliefs as shown in equations (1) and (2). However, they also play an important role in the decisions of editors and researchers. Further, the power and level of available tests and the costs associated with each test are determinants of researcher decisions. The model presented in this section helps to clarify the relationships among these factors in research decisions.

CONCLUSIONS

Research practices which lower the power of tests have sometimes been accepted in accounting research without consideration of their implications. Similarly, the implications of research and editorial practices which increase the effective level of reported tests have received little attention. The model discussed in this paper demonstrates that tests with low power are not only undesirable ex ante, because of the low probability of observing significant results, but also ex post, because little probability revision should be induced by the results of low-power tests regardless of whether significant results are observed.


Further, the discussion suggests that, irrespective of the usual issues of statistical and methodological validity, the effective level of tests in published research is likely to exceed the stated level, thus reducing the amount of belief revision justified by reported results.

The Bayesian framework presented here is useful in analyzing and understanding the trade-offs which are an inherent part of empirical research. The model clarifies the relationship between research decisions and characteristics of tests, prior beliefs, and individual utilities. Broadly stated, the recommendations which follow from the analysis are little different from those of an introductory discussion of experimental design: Researchers should attempt to maximize power in the design and execution of empirical tests and attempt to maintain the effective level of tests at their stated levels. However, the analysis identifies the reasons for these recommendations. The results of tests with low power have little value (even if the results are statistically significant) because low-power tests should induce little belief revision regardless of their outcome. Behaviors which increase the effective level of published empirical tests are undesirable because an increase in the effective level is accompanied by a decrease in the belief revision justified by the results. If these recommendations were not followed, published empirical results would properly have little or no effect on the beliefs of accounting researchers.
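The cost of an inflated effective level can be illustrated with the same Bayes'-rule posterior used in the paper's examples (the function and the numbers below are my own illustration):

```python
def posterior_null_given_significant(prior, level, power):
    # Bayes' rule with P(sig | null true) = level, P(sig | null false) = power.
    return level * prior / (level * prior + power * (1 - prior))

# Belief revision shrinks as the effective level creeps up toward the power:
posteriors = [posterior_null_given_significant(prior=0.5, level=lvl, power=0.6)
              for lvl in (0.05, 0.15, 0.30)]
print([round(p, 3) for p in posteriors])  # [0.077, 0.2, 0.333]
```

Holding power fixed, each increase in the effective level pulls the posterior back toward the .5 prior, i.e., the reported "significant" result justifies less and less revision.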

REFERENCES

Bakan, D., "The Test of Significance in Psychological Research," Psychological Bulletin (December 1966), pp. 423-437.

Ball, R., "Discussion of Accounting for Research and Development Costs: The Impact on Research and Development Expenditures," Journal of Accounting Research (Supplement, 1980), pp. 27-37.

Ball, R., and G. Foster, "Corporate Financial Reporting: A Methodological Review of Empirical Research," Journal of Accounting Research (Supplement, 1982), pp. 161-234.

Beaver, W., "Discussion of Market-Based Empirical Research in Accounting: A Review, Interpretation, and Extension," Journal of Accounting Research (Supplement, 1982), pp. 323-331.

Beaver, W., R. Clarke, and W. Wright, "The Association Between Unsystematic Security Returns and the Magnitude of Earnings Forecast Errors," Journal of Accounting Research (Autumn 1979), pp. 316-340.

Beaver, W., R. Lambert, and D. Morse, "The Information Content of Security Prices," Journal of Accounting and Economics (March 1980), pp. 3-28.

Bowen, R., "Valuation of Earnings Components in the Electric Utility Industry," THE ACCOUNTING REVIEW (January 1981), pp. 1-22.

Bowen, R., E. Noreen, and J. Lacey, "Determinants of the Corporate Decision to Capitalize Interest," Journal of Accounting and Economics (August 1981), pp. 151-179.

Box, G., "Sampling and Bayes Inference in Scientific Modelling and Robustness," Journal of the Royal Statistical Society, Series A (1980), pp. 383-430.

Christie, A., "Evaluation of the Consistency of Evidence on Contracting and Political Cost Hypotheses Across and Within Studies," unpublished paper, School of Accounting, University of Southern California (February 1985).

Cook, T., and D. Campbell, Quasi-Experimentation: Design and Analysis Issues for Field Settings (Houghton Mifflin, 1979).

Edwards, W., H. Lindman, and L. Savage, "Bayesian Statistical Inference for Psychological Research," Psychological Review (1963), pp. 193-242.

Efron, B., "Why Isn't Everyone a Bayesian?" The American Statistician (February 1986), pp. 1-5.

Foster, G., "Accounting Policy Decisions and Capital Market Research," Journal of Accounting and Economics (March 1980), pp. 29-62.

Foster, G., "Intra-industry Information Transfers Associated with Earnings Releases," Journal of Accounting and Economics (1981), pp. 201-232.

Freedman, D., "A Note on Screening Regression Equations," The American Statistician (May 1983), pp. 152-155.

Greenwald, A., "Consequences of Prejudice Against the Null Hypothesis," Psychological Bulletin (January 1975), pp. 1-20.

Judge, G., W. Griffiths, R. Hill, H. Lutkepohl, and T. Lee, The Theory and Practice of Econometrics, 2nd Ed. (John Wiley & Sons, 1985).

Kinney, W., "Empirical Accounting Research Design for Ph.D. Students," THE ACCOUNTING REVIEW (April 1986), pp. 338-350.

Lykken, D., "Statistical Significance in Psychological Research," Psychological Bulletin (1968), pp. 151-159.

McCloskey, D., "The Rhetoric of Economics," Journal of Economic Literature (1983), pp. 481-517.

McCloskey, D., "The Loss Function Has Been Mislaid: The Rhetoric of Significance Tests," American Economic Review (1985), pp. 201-205.

Mood, A., F. Graybill, and D. Boes, Introduction to the Theory of Statistics, 3rd Ed. (McGraw-Hill, 1974).

Raiffa, H., and R. Schlaifer, Applied Statistical Decision Theory (Graduate School of Business Administration, Harvard University, 1961).

Savage, L., The Foundations of Statistics (John Wiley & Sons, 1954).

Simonds, R., and D. Collins, "Line of Business Reporting and Security Prices: An Analysis of an SEC Disclosure Rule: Comment," The Bell Journal of Economics (Autumn 1978), pp. 646-658.

Sterling, T., "Publication Decisions and Their Possible Effects on Inferences Drawn from Tests of Significance, or Vice Versa," Journal of the American Statistical Association (March 1959), pp. 30-34.

Walster, G., and T. Cleary, "A Proposal for a New Editorial Policy in the Social Sciences," The American Statistician (April 1970), pp. 16-19.

Zellner, A., An Introduction to Bayesian Inference in Econometrics (John Wiley & Sons, 1971).

Zmijewski, M., and R. Hagerman, "An Income Strategy Approach to the Positive Theory of Accounting Standard Setting Choice," Journal of Accounting and Economics (1980), pp. 129-149.
