Statistical Science
1991, Vol. 6, No. 4, 363-403

Replication and Meta-Analysis in Parapsychology

Jessica Utts

[Jessica Utts is Associate Professor, Division of Statistics, University of California at Davis, 469 Kerr Hall, Davis, California 95616.]

Abstract. Parapsychology, the laboratory study of psychic phenomena, has had its history interwoven with that of statistics. Many of the controversies in parapsychology have focused on statistical issues, and statistical models have played an integral role in the experimental work. Recently, parapsychologists have been using meta-analysis as a tool for synthesizing large bodies of work. This paper presents an overview of the use of statistics in parapsychology and offers a summary of the meta-analyses that have been conducted. It begins with some anecdotal information about the involvement of statistics and statisticians with the early history of parapsychology. Next, it is argued that most nonstatisticians do not appreciate the connection between power and "successful" replication of experimental effects. Returning to parapsychology, a particular experimental regime is examined by summarizing an extended debate over the interpretation of the results. A new set of experiments designed to resolve the debate is then reviewed. Finally, meta-analyses from several areas of parapsychology are summarized. It is concluded that the overall evidence indicates that there is an anomalous effect in need of an explanation.

Key words and phrases: Effect size, psychic research, statistical controversies, randomness, vote-counting.

1. INTRODUCTION

In a June 1990 Gallup Poll, 49% of the 1236 respondents claimed to believe in extrasensory perception (ESP), and one in four claimed to have had a personal experience involving telepathy (Gallup and Newport, 1991). Other surveys have shown even higher percentages; the University of Chicago's National Opinion Research Center recently surveyed 1473 adults, of which 67% claimed that they had experienced ESP (Greeley, 1987).

Public opinion is a poor arbiter of science, however, and experience is a poor substitute for the scientific method. For more than a century, small numbers of scientists have been conducting laboratory experiments to study phenomena such as telepathy, clairvoyance and precognition, collectively known as "psi" abilities. This paper will examine some of that work, as well as some of the statistical controversies it has generated.

Parapsychology, as this field is called, has been a source of controversy throughout its history. Strong beliefs tend to be resistant to change even in the face of data, and many people, scientists included, seem to have made up their minds on the question without examining any empirical data at all. A critic of parapsychology recently acknowledged that "The level of the debate during the past 130 years has been an embarrassment for anyone who would like to believe that scholars and scientists adhere to standards of rationality and fair play" (Hyman, 1985a, page 89). While much of the controversy has focused on poor experimental design and potential fraud, there have been attacks and defenses of the statistical methods as well, sometimes calling into question the very foundations of probability and statistical inference.

Most of the criticisms have been leveled by psychologists. For example, a 1988 report of the U.S. National Academy of Sciences concluded that "The committee finds no scientific justification from research conducted over a period of 130 years for the existence of parapsychological phenomena" (Druckman and Swets, 1988, page 22).


The chapter on parapsychology was written by a subcommittee chaired by a psychologist who had published a similar conclusion prior to his appointment to the committee (Hyman, 1985a, page 7). There were no parapsychologists involved with the writing of the report. Resulting accusations of bias (Palmer, Honorton and Utts, 1989) led U.S. Senator Claiborne Pell to request that the Congressional Office of Technology Assessment (OTA) conduct an investigation with a more balanced group. A one-day workshop was held on September 30, 1988, bringing together parapsychologists, critics and experts in some related fields (including the author of this paper). The report concluded that parapsychology needs a fairer hearing across a broader spectrum of the scientific community, so that emotionality does not impede objective assessment of experimental results (Office of Technology Assessment, 1989).

It is in the spirit of the OTA report that this article is written. After Section 2, which offers an anecdotal account of the role of statisticians and statistics in parapsychology, the discussion turns to the more general question of replication of experimental results. Section 3 illustrates how replication has been (mis)interpreted by scientists in many fields. Returning to parapsychology in Section 4, a particular experimental regime called the ganzfeld is described, and an extended debate about the interpretation of the experimental results is discussed. Section 5 examines a meta-analysis of recent ganzfeld experiments designed to resolve the debate. Finally, Section 6 contains a brief account of meta-analyses that have been conducted in other areas of parapsychology, and conclusions are given in Section 7.

2. STATISTICS AND PARAPSYCHOLOGY

Parapsychology had its beginnings in the investigation of purported mediums and other anecdotal claims in the late 19th century. The Society for Psychical Research was founded in Britain in 1882, and its American counterpart was founded in Boston in 1884. While these organizations and their members were primarily involved with investigating anecdotal material, a few of the early researchers were already conducting forced-choice experiments such as card-guessing. (Forced-choice experiments are like multiple choice tests; on each trial the subject must guess from a small, known set of possibilities.) Notable among these was Nobel Laureate Charles Richet, who is generally credited with being the first to recognize that probability theory could be applied to card-guessing experiments (Rhine, 1977, page 26; Richet, 1884).

F. Y. Edgeworth, partly in response to what he considered to be incorrect analyses of these experiments, offered one of the earliest treatises on the statistical evaluation of forced-choice experiments in two articles published in the Proceedings of the Society for Psychical Research (Edgeworth, 1885, 1886). Unfortunately, as noted by Mauskopf and McVaugh (1979) in their historical account of the period, Edgeworth's papers were "perhaps too difficult for their immediate audience" (page 105).

Edgeworth began his analysis by using Bayes' theorem to derive the formula for the posterior probability that chance was operating, given the data. He then continued with "an argument savouring more of Bernoulli than Bayes" in which "it is consonant, I submit, to experience, to put 1/2 both for r and P, that is, for both the prior probability that chance alone was operating, and the prior probability that there should have been some additional agency." He then reasoned (using a Taylor series expansion of the posterior probability formula) that if there were a large probability of observing the data given that some additional agency was at work, and a small objective probability of the data under chance, then the latter (binomial) probability "may be taken as a rough measure of the sought a posteriori probability in favour of mere chance" (page 195). Edgeworth concluded his article by applying his method to some data published previously in the same journal. He found the probability against chance to be 0.99996, which he said "may fairly be regarded as physical certainty" (page 199). He concluded:

Such is the evidence which the calculus of probabilities affords as to the existence of an agency other than mere chance. The calculus is silent as to the nature of that agency, whether it is more likely to be vulgar illusion or extraordinary law. That is a question to be decided, not by formulae and figures, but by general philosophy and common sense [page 199].
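Edgeworth's basic calculation is easy to reproduce in modern terms. The sketch below is a minimal illustration rather than his exact computation: it uses equal priors of 1/2 as he did, but treats the "additional agency" hypothesis as a simple binomial with a hypothetical, illustrative success probability (Edgeworth handled that term differently, via a Taylor expansion).

    from scipy.stats import binom

    # Posterior probability that chance alone was operating, given k successes
    # in n trials, with priors of 1/2 on "chance" and "some additional agency".
    # p_agency is a hypothetical value chosen only for illustration.
    def posterior_chance(k, n, p_chance, p_agency):
        like_chance = binom.pmf(k, n, p_chance)
        like_agency = binom.pmf(k, n, p_agency)
        return 0.5 * like_chance / (0.5 * like_chance + 0.5 * like_agency)

    # Example: 10 hits in 20 trials when chance predicts a 1-in-20 hit rate.
    print(posterior_chance(10, 20, 0.05, 0.5))   # posterior for "mere chance" is tiny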

Both the statistical arguments and the experimental controls in these early experiments were somewhat loose. For example, Edgeworth treated as binomial an experiment in which one person chose a string of eight letters and another attempted to guess the string. Since it has long been understood that people are poor random number (or letter) generators, there is no statistical basis for analyzing such an experiment. Nonetheless, Edgeworth and his contemporaries set the stage for the use of controlled experiments with statistical evaluation in laboratory parapsychology. An interesting historical account of Edgeworth's involvement and the role telepathy experiments played in the early history of randomization and experimental design is provided by Hacking (1988).


One of the first American researchers to use statistical methods in parapsychology was John Edgar Coover, who was the Thomas Welton Stanford Psychical Research Fellow in the Psychology Department at Stanford University from 1912 to 1937 (Dommeyer, 1975). In 1917, Coover published a large volume summarizing his work (Coover, 1917). Coover believed that his results were consistent with chance, but others have argued that Coover's definition of significance was too strict (Dommeyer, 1975). For example, in one evaluation of his telepathy experiments, Coover found a two-tailed p-value of 0.0062.

He concluded, "Since this value, then, lies within the field of chance deviation, although the probability of its occurrence by chance is fairly low, it cannot be accepted as a decisive indication of some cause beyond chance which operated in favor of success in guessing" (Coover, 1917, page 82). On the next page, he made it explicit that he would require a p-value of 0.0000221 to declare that something other than chance was operating.

It was during the summer of 1930, with the card-guessing experiments of J. B. Rhine at Duke University, that parapsychology began to take hold as a laboratory science. Rhine's laboratory still exists under the name of the Foundation for Research on the Nature of Man, housed at the edge of the Duke University campus.

It wasn't long after Rhine published his first book, Extrasensory Perception, in 1934, that the attacks on his methodology began. Since his claims were wholly based on statistical analyses of his experiments, the statistical methods were closely scrutinized by critics anxious to find a conventional explanation for Rhine's positive results.

The most persistent critic was a psychologist from McGill University named Chester Kellogg (Mauskopf and McVaugh, 1979). Kellogg's main argument was that Rhine was using the binomial distribution (and normal approximation) on a series of trials that were not independent. The experiments in question consisted of having a subject guess the order of a deck of 25 cards, with five each of five symbols, so technically Kellogg was correct.
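Kellogg's point is easy to check by simulation. The minimal sketch below (my illustration, not Kellogg's computation) assumes a fixed call sequence of five of each symbol, a simplification of real calls, and compares the hit-count distribution against a shuffled closed deck with the binomial model Rhine used: the mean agrees, but the variance does not, so binomial p-values are slightly off.

    import random

    # Hits against a closed deck of 25 cards (five each of five symbols)
    # are not independent Bernoulli trials.
    deck = [s for s in range(5) for _ in range(5)]
    calls = deck[:]                     # fixed call sequence, five of each symbol

    def one_run():
        shuffled = deck[:]
        random.shuffle(shuffled)
        return sum(c == d for c, d in zip(calls, shuffled))

    hits = [one_run() for _ in range(100_000)]
    mean = sum(hits) / len(hits)
    var = sum((h - mean) ** 2 for h in hits) / (len(hits) - 1)
    print(mean, var)   # mean ~5.0, variance ~4.17, vs binomial 25 * 0.2 * 0.8 = 4.0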

By 1937, several mathematicians and statisticians had come to Rhine's aid. Mauskopf and McVaugh (1979) speculated that since statistics was itself a young discipline, "a number of statisticians were equally outraged by Kellogg, whose arguments they saw as discrediting their profession" (page 258). The major technical work, which acknowledged that Kellogg's criticisms were accurate but did little to change the significance of the results, was conducted by Charles Stuart and Joseph A. Greenwood and published in the first volume of the Journal of Parapsychology (Stuart and Greenwood, 1937). Stuart, who had been an undergraduate in mathematics at Duke, was one of Rhine's early subjects and continued to work with him as a researcher until Stuart's death in 1947. Greenwood was a Duke mathematician, who apparently converted to a statistician at the urging of Rhine.

Another prominent figure who was distressed with Kellogg's attack was E. V. Huntington, a mathematician at Harvard. After corresponding with Rhine, Huntington decided that, rather than further confuse the public with a technical reply to Kellogg's arguments, a simple statement should be made to the effect that the mathematical issues in Rhine's work had been resolved. Huntington must have successfully convinced his former student, Burton Camp of Wesleyan, that this was a wise approach. Camp was the 1937 President of the IMS. When the annual meetings were held in December of 1937 (jointly with the AMS and AAAS), Camp released a statement to the press that read:

Dr. Rhine's investigations have two aspects: experimental and statistical. On the experimental side mathematicians, of course, have nothing to say. On the statistical side, however, recent mathematical work has established the fact that, assuming that the experiments have been properly performed, the statistical analysis is essentially valid. If the Rhine investigation is to be fairly attacked, it must be on other than mathematical grounds [Camp, 1937].

One statistician who did emerge as a critic was William Feller. In a talk at the Duke Mathematical Seminar on April 24, 1940, Feller raised three criticisms of Rhine's work (Feller, 1940). They had been raised before by others (and continue to be raised even today). The first was that inadequate shuffling of the cards resulted in additional information from one series to the next. The second was what is now known as the "file-drawer effect," namely, that if one combines the results of published studies only, there is sure to be a bias in favor of successful studies. The third was that the results were enhanced by the use of optional stopping, that is, by not specifying the number of trials in advance. All three of these criticisms were addressed in a rejoinder by Greenwood and Stuart (1940), but Feller was never convinced. Even in its third edition published in 1968, his book An Introduction to Probability Theory and Its Applications still contains his conclusion about Greenwood and Stuart: "Both their arithmetic and their experiments have a distinct tinge of the supernatural" (Feller, 1968, page 407).


In his discussion of Feller's position, Diaconis (1978) remarked, "I believe Feller was confused. He seemed to have decided the opposition was wrong and that was that."
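Feller's third criticism, optional stopping, is easy to demonstrate. The simulation below is a minimal sketch under assumed parameters (chance guessing at p = 0.2, testing after at least 25 trials, a 1000-trial ceiling), not a model of any particular Rhine experiment; far more than 5% of such "experiments" end up nominally significant even though the null is true.

    import math, random

    # Optional stopping inflates the nominal significance level: stop the
    # moment the running z statistic crosses 1.645 (nominal one-tailed 5%).
    def optional_stop(max_trials=1000, p=0.2, z_crit=1.645):
        hits = 0
        for n in range(1, max_trials + 1):
            hits += random.random() < p
            z = (hits - n * p) / math.sqrt(n * p * (1 - p))
            if n >= 25 and z >= z_crit:   # begin testing after 25 trials
                return True               # declared "significant"
        return False

    runs = 2000
    sig = sum(optional_stop() for _ in range(runs))
    print(sig / runs)   # well above 0.05, despite a true null hypothesis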

Several statisticians have contributed to the literature in parapsychology to greater or lesser degrees. T. N. E. Greville developed applicable statistical methods for many of the experiments in parapsychology and was Statistical Editor of the Journal of Parapsychology (with J. A. Greenwood) from its start in 1937 through Volume 31 in 1967; Fisher (1924, 1929) addressed some specific problems in card-guessing experiments; Wilks (1965a, b) described various statistical methods for parapsychology; Lindley (1957) presented a Bayesian analysis of some parapsychology data; and Diaconis (1978) pointed out some problems with certain experiments and presented a method for analyzing experiments when feedback is given.

Occasionally, attacks on parapsychology have taken the form of attacks on statistical inference in general, at least as it is applied to real data. Spencer-Brown (1957) attempted to show that true randomness is impossible, at least in finite sequences, and that this could be the explanation for the results in parapsychology. That argument re-emerged in a recent debate on the role of randomness in parapsychology, initiated by psychologist J. Barnard Gilmore (Gilmore, 1989, 1990; Utts, 1989; Palmer, 1989, 1990). Gilmore stated that "The agnostic statistician, advising on research in psi, should take account of the possible inappropriateness of classical inferential statistics" (1989, page 338). In his second paper, Gilmore reviewed several non-psi studies showing purportedly random systems that do not behave as they should under randomness (e.g., Iversen, Longcor, Mosteller, Gilbert and Youtz, 1971; Spencer-Brown, 1957). Gilmore concluded that "Anomalous data . . . should not be found nearly so often if classical statistics offers a valid model of reality" (1990, page 54), thus rejecting the use of classical statistical inference for real-world applications in general.

3. REPLICATION

Implicit and explicit in the literature on parapsychology is the assumption that, in order to truly establish itself, the field needs to find a repeatable experiment. For example, Diaconis (1978) started the summary of his article in Science with the words "In search of repeatable ESP experiments, modern investigators . . ." (page 131). On October 28-29, 1983, the 32nd International Conference of the Parapsychology Foundation was held in San Antonio, Texas, to address "The Repeatability Problem in Parapsychology."

The Conference Proceedings (Shapin and Coly, 1985) reflect the diverse views among parapsychologists on the nature of the problem. Honorton (1985a) and Rao (1985), for example, both argued that strict replication is uncommon in most branches of science and that parapsychology should not be singled out as unique in this regard. Other authors expressed disappointment in the lack of a single repeatable experiment in parapsychology, with titles such as "Unrepeatability: Parapsychology's Only Finding" (Blackmore, 1985), and "Research Strategies for Dealing with Unstable Phenomena" (Beloff, 1985).

It has never been clear, however, just exactly what would constitute acceptable evidence of a repeatable experiment. In the early days of investigation, the major critics "insisted that it would be sufficient for Rhine and Soal to convince them of ESP if a parapsychologist could perform successfully a single 'fraud-proof' experiment" (Hyman, 1985a, page 71). However, as soon as well-designed experiments showing statistical significance emerged, the critics realized that a single experiment could be statistically significant just by chance. British psychologist C. E. M. Hansel quantified the new expectation, that the experiment should be repeated a few times, as follows:

If a result is significant at the .01 level and this result is not due to chance but to information reaching the subject, it may be expected that by making two further sets of trials the antichance odds of one hundred to one will be increased to around a million to one, thus enabling the effects of ESP, or whatever is responsible for the original result, to manifest itself to such an extent that there will be little doubt that the result is not due to chance [Hansel, 1980, page 298].

In other words, three consecutive experiments at p ≤ 0.01 would convince Hansel that something other than chance was at work.

This argument implies that if a particular experiment produces a statistically significant result, but subsequent replications fail to attain significance, then the original result was probably due to chance, or at least remains unconvincing. The problem with this line of reasoning is that there is no consideration given to sample size or power. Only an experiment with extremely high power should be expected to be successful three times in succession.
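The point can be made concrete with a short calculation. For independent replications, the chance that a real effect passes Hansel's criterion is the cube of the single-experiment power; the power values below are hypothetical, chosen only to show the pattern.

    # Probability of three consecutive "successful" replications,
    # as a function of the power of a single experiment.
    for power in (0.25, 0.50, 0.80, 0.95):
        print(power, round(power ** 3, 3))
    # power 0.50 -> 0.125: a genuine effect studied with 50% power
    # passes the three-in-a-row test only about once in eight attempts.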

It is perhaps a failure of the way statistics is taught that many scientists do not understand the importance of power in defining successful replication. To illustrate this point, psychologists Tversky and Kahneman (1982) distributed a questionnaire to their colleagues at a professional meeting, with the question:



An investigator has reported a result that you consider implausible. He ran 15 subjects, and reported a significant value, t = 2.46. Another investigator has attempted to duplicate his procedure, and he obtained a nonsignificant value of t with the same number of subjects. The direction was the same in both sets of data. You are reviewing the literature. What is the highest value of t in the second set of data that you would describe as a failure to replicate? [1982, page 28]

In reporting their results, Tversky and Kahneman stated:

The majority of our respondents regarded t = 1.70 as a failure to replicate. If the data of two such studies (t = 2.46 and t = 1.70) are pooled, the value of t for the combined data is about 3.00 (assuming equal variances). Thus, we are faced with a paradoxical state of affairs, in which the same data that would increase our confidence in the finding when viewed as part of the original study, shake our confidence when viewed as an independent study [1982, page 28].
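Their pooled value is easy to verify with a standard back-of-the-envelope approximation: for two independent studies of equal size and equal variance, pooling the raw data gives a combined t of roughly (t1 + t2)/sqrt(2), ignoring the small change in degrees of freedom.

    import math

    t1, t2 = 2.46, 1.70
    t_pooled = (t1 + t2) / math.sqrt(2)   # equal-n, equal-variance approximation
    print(round(t_pooled, 2))             # ~2.94, i.e., "about 3.00"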

At a recent presentation to the History and Philosophy of Science Seminar at the University of California at Davis, I asked the following question. Two scientists, Professors A and B, each have a theory they would like to demonstrate. Each plans to run a fixed number of Bernoulli trials and then test H0: p = 0.25 versus Ha: p > 0.25. Professor A has access to large numbers of students each semester to use as subjects. In his first experiment, he runs 100 subjects, and there are 33 successes (p = 0.04, one-tailed). Knowing the importance of replication, Professor A runs an additional 100 subjects as a second experiment. He finds 36 successes (p = 0.009, one-tailed).

Professor B only teaches small classes. Each quarter, she runs an experiment on her students to test her theory. She carries out ten studies this way, with the results in Table 1.

I asked the audience by a show of hands to indicate whether or not they felt the scientists had successfully demonstrated their theories. Professor A's theory received overwhelming support, with approximately 20 votes, while Professor B's theory received only one vote.

If you aggregate the results of the experiments for each professor, you will notice that each conducted 200 trials, and Professor B actually demonstrated a higher level of success than Professor A, with 71 as opposed to 69 successful trials. The one-tailed p-values for the combined trials are 0.0017 for Professor A and 0.0006 for Professor B.
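These combined p-values can be checked with an exact binomial tail calculation (a minimal sketch using scipy):

    from scipy.stats import binom

    # One-tailed exact binomial p-values for the combined 200 trials,
    # testing H0: p = 0.25 against Ha: p > 0.25.
    for hits in (69, 71):                        # Professor A, Professor B
        p_value = binom.sf(hits - 1, 200, 0.25)  # P(X >= hits)
        print(hits, round(p_value, 4))           # ~0.0017 and ~0.0006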

To address the question of replication more explicitly, I also posed the following scenario. In December of 1987, it was decided to prematurely terminate a study on the effects of aspirin in reducing heart attacks because the data were so convincing (see, e.g., Greenhouse and Greenhouse, 1988; Rosenthal, 1990a). The physician-subjects had been randomly assigned to take aspirin or a placebo. There were 104 heart attacks among the 11,037 subjects in the aspirin group, and 189 heart attacks among the 11,034 subjects in the placebo group (chi-square = 25.01, p < 0.00001).

After showing the results of that study, I presented the audience with two hypothetical experiments conducted to try to replicate the original result, with outcomes in Table 2.

I asked the audience to indicate which one they thought was a more successful replication. The audience chose the second one, as would most journal editors, because of the "significant" p-value. In fact, the first replication has almost exactly the same proportion of heart attacks in the two groups as the original study and is thus a very close replication of that result. The second replication has

TABLE 1
Attempted replications for Professor B

n    Number of successes    One-tailed p-value
[table body missing]

TABLE 2
Hypothetical replications of the aspirin/heart attack study

                 Replication 1          Replication 2
                 Heart attack           Heart attack
                 Yes       No           Yes       No
Aspirin           11      1156           20      2314
Placebo           19      1090           48      2170
Chi-square     2.596, p = 0.11       13.206, p = 0.0003
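The chi-square statistics in Table 2, and for the original study, can be reproduced with a standard 2 x 2 Pearson test; note how closely Replication 1's aspirin-group heart attack rate tracks the original's, despite its nonsignificant p-value. A minimal sketch:

    from scipy.stats import chi2_contingency

    tables = {
        "original":      [[104, 10933], [189, 10845]],
        "replication 1": [[11, 1156], [19, 1090]],
        "replication 2": [[20, 2314], [48, 2170]],
    }
    for name, table in tables.items():
        chi2, p, _, _ = chi2_contingency(table, correction=False)
        aspirin_rate = table[0][0] / sum(table[0])
        print(name, round(chi2, 3), round(p, 4), round(aspirin_rate, 4))
    # original and replication 1 both have an aspirin-group rate of ~0.0094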


4. THE GANZFELD DEBATE IN PARAPSYCHOLOGY

4.1 Free-Response Experiments

. . . target, rather than being forced to make a choice from a small discrete set of possibilities. Various types of target material have been used, including pictures, short segments of movies on video tapes, actual locations and small objects.

Despite the more complex target material, the statistical methods used to analyze these experiments are similar to those for forced-choice experiments. A typical experiment proceeds as follows. Before conducting any trials, a large pool of potential targets is assembled, usually in packets of four. Similarity of targets within a packet is kept to a minimum, for reasons made clear below. At the start of an experimental session, after the subject is sequestered in an isolated room, a target is selected at random from the pool. A sender is placed in another room with the target. The subject is asked to provide a verbal or written description of what he or she thinks is in the target, knowing only that it is a photograph, an object, etc.

After the subject's description has been recorded and secured against the potential for later alteration, a judge (who may or may not be the subject) is given a copy of the subject's description and the four possible targets that were in the packet with the correct target. A properly conducted experiment either uses video tapes or has two identical sets of target material and uses the duplicate set for this part of the process, to ensure that clues such as fingerprints don't give away the answer. Based on the subject's description, and of course on a blind basis, the judge is asked to either rank the four choices from most to least likely to have been the target, or to select the one from the four that seems to best match the subject's description. If ranks are used, the statistical analysis proceeds by summing the ranks over a series of trials and comparing the sum to what would be expected by chance. If the selection method is used, a direct hit occurs if the correct target is chosen, and the number of direct hits over a series of trials is compared to the number expected in a binomial experiment with p = 0.25.
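Both scoring methods reduce to simple null distributions: under the null hypothesis, the rank of the correct target on each trial is uniform on {1, 2, 3, 4} (mean 2.5, variance 1.25), and a direct hit is Bernoulli with p = 0.25. The sketch below computes both test statistics for a hypothetical, made-up series of ranks:

    import math
    from scipy.stats import binom

    ranks = [1, 3, 1, 2, 4, 1, 2, 2, 1, 3]   # illustrative data only
    n = len(ranks)

    # Sum-of-ranks analysis: low sums indicate psi-hitting.
    z_ranks = (sum(ranks) - 2.5 * n) / math.sqrt(1.25 * n)

    # Direct-hit analysis: count of rank-1 trials versus Binomial(n, 0.25).
    hits = sum(r == 1 for r in ranks)
    p_hits = binom.sf(hits - 1, n, 0.25)

    print(round(z_ranks, 2), hits, round(p_hits, 3))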

Note that the subjects' responses cannot be considered to be random in any sense, so probability assessments are based on the random selection of the target and decoys. In a correctly designed experiment, the probability of a direct hit by chance is 0.25 on each trial, regardless of the response, and the trials are independent. These and other issues related to analyzing free-response experiments are discussed by Utts (1991).

4.2 The Psi Ganzfeld Experiments

The ganzfeld procedure is a particular kind of free-response experiment utilizing a perceptual isolation technique originally developed by Gestalt psychologists for other purposes. Evidence from spontaneous case studies and experimental work had led parapsychologists to a model proposing that psychic functioning may be masked by sensory input and by inattention to internal states (Honorton, 1977). The ganzfeld procedure was specifically designed to test whether or not reduction of external "noise" would enhance psi performance.

In these experiments, the subject is placed in a comfortable reclining chair in an acoustically shielded room. To create a mild form of sensory deprivation, the subject wears headphones through which white noise is played, and stares into a constant field of red light. This is achieved by taping halved translucent ping-pong balls over the eyes and then illuminating the room with red light. In the psi ganzfeld experiments, the subject speaks into a microphone and attempts to describe the target material being observed by the sender in a distant room.

At the 1982 Annual Meeting of the Parapsychological Association, a debate took place over the degree to which the results of the psi ganzfeld experiments constituted evidence of psi abilities. Psychologist and critic Ray Hyman and parapsychologist Charles Honorton each analyzed the results of all known psi ganzfeld experiments to date, and they reached strikingly different conclusions (Honorton, 1985b; Hyman, 1985b). The debate continued with the publication of their arguments in separate articles in the March 1985 issue of the Journal of Parapsychology. Finally, in the December 1986 issue of the Journal of Parapsychology, Hyman and Honorton (1986) wrote a joint article in which they highlighted their agreements and disagreements and outlined detailed criteria for future experiments. That same issue contained commentaries on the debate by 10 other authors.

The data base analyzed by Hyman and Honorton (1986) consisted of results taken from 34 reports written by a total of 47 authors. Honorton counted 42 separate experiments described in the reports, of which 28 reported enough information to determine the number of direct hits achieved. Twenty-three of the studies (55%) were classified by Honorton as having achieved statistical significance at 0.05.

4.3 The Vote-Counting Debate

Vote-counting is the term commonly used for the technique of drawing inferences about an experimental effect by counting the number of significant versus nonsignificant studies of the effect.


Hedges and Olkin (1985) give a detailed analysis of the inadequacy of this method, showing that it is more and more likely to make the wrong decision as the number of studies increases. While Hyman acknowledged that vote-counting raises many problems (Hyman, 1985b, page 8), he nonetheless spent half of his critique of the ganzfeld studies showing why Honorton's count of 55% was wrong.
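The Hedges and Olkin point is easy to illustrate with ganzfeld-like parameters: when individual studies are small, the proportion that reach significance hovers near the single-study power no matter how many studies of a real effect accumulate. A minimal sketch, assuming 28 trials per study and a true hit rate of 0.33 against a chance rate of 0.25:

    from scipy.stats import binom

    n, p_true, p_null = 28, 0.33, 0.25

    # Smallest hit count significant at the one-tailed 0.05 level.
    crit = min(k for k in range(n + 1) if binom.sf(k - 1, n, p_null) <= 0.05)

    power = binom.sf(crit - 1, n, p_true)
    print(crit, round(power, 3))
    # power ~0.18 (consistent with the 0.181 figure cited later in this
    # section): most studies of a real effect "fail", however many are run.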

Hyman's first complaint was that several of the studies contained multiple conditions, each of which should be considered as a separate study. Using this definition he counted 80 studies (thus further reducing the sample sizes of the individual studies), of which 25 (31%) were successful. Honorton's response to this was to invite readers to examine the studies and decide for themselves if the varying conditions constituted separate experiments.

Hyman next postulated that there was selection bias, so that significant studies were more likely to be reported. He raised some important issues about how pilot studies may be terminated and not reported if they don't show significant results, or may at least be subject to optional stopping, allowing the experimenter to determine the number of trials. He also presented a chi-square analysis that suggests "a tendency to report studies with a small sample only if they have significant results" (Hyman, 1985b, page 14), but I have questioned his analysis elsewhere (Utts, 1986, page 397).

Honorton refuted Hyman's argument with four rejoinders (Honorton, 1985b, page 66). In addition to reinterpreting Hyman's chi-square analysis, Honorton pointed out that the Parapsychological Association has an official policy encouraging the publication of nonsignificant results in its journals and proceedings, that a large number of reported ganzfeld studies did not achieve statistical significance and that there would have to be 15 studies in the "file-drawer" for every one reported to cancel out the observed significant results.

The remainder of Hyman's vote-counting analysis consisted of showing that the effective error rate for each study was actually much higher than the nominal 5%. For example, each study could have been analyzed using the direct hit measure, the sum of ranks measure or one of two other measures used for free-response analyses. Hyman carried out a simulation study that showed the true error rate would be 0.22 if significance was defined by requiring at least one of these four measures to achieve the 0.05 level. He suggested several other ways in which multiple testing could occur and concluded that the effective error rate in each experiment was not the nominal 0.05, but rather was probably close to the 31% he had determined to be the actual success rate in his vote-count.

Honorton acknowledged that there was a multiple testing problem, but he had a two-fold response.

First, he applied a Bonferroni correction and found that the number of significant studies (using his definition of a study) only dropped from 55% to 45%. Next, he proposed that a uniform index of success be applied to all studies. He used the number of direct hits, since it was by far the most commonly reported measure and was the measure used in the first published psi ganzfeld study. He then conducted a detailed analysis of the 28 studies reporting direct hits and found that 43% were significant at 0.05 on that measure alone. Further, he showed that significant effects were reported by six of the 10 independent investigators and thus were not due to just one or two investigators or laboratories. He also noted that success rates were very similar for reports published in refereed journals and those published in unrefereed monographs and abstracts.

While Hyman's arguments identified issues such as selective reporting and optional stopping that should be considered in any meta-analysis, the dependence of significance levels on sample size makes the vote-counting technique almost useless for assessing the magnitude of the effect. Consider, for example, the 24 studies where the direct hit measure was reported and the chance probability of a direct hit was 0.25, the most common type of study in the data base. (There were four direct hit studies with other chance probabilities and 14 that did not report direct hits.) Of the 24 studies, 13 (54%) were nonsignificant at α = 0.05, one-tailed. But if the 367 trials in these "failed" replications are combined, there are 106 direct hits, z = 1.66, and p = 0.0485, one-tailed. This is reminiscent of the dilemma of Professor B in Section 3.
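The combined figure can be verified with a continuity-corrected normal approximation to the binomial (a minimal check):

    import math

    n, hits, p0 = 367, 106, 0.25
    mu = n * p0
    sd = math.sqrt(n * p0 * (1 - p0))
    z = (hits - 0.5 - mu) / sd        # continuity-corrected z statistic
    print(round(z, 2))                # ~1.66, one-tailed p ~0.0485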

Power is typically very low for these studies. The median sample size for the studies reporting direct hits was 28. If there is a real effect and it increases the success probability from the chance 0.25 to an actual 0.33 (a value whose rationale will be made clear below), the power for a study with 28 trials is only 0.181 (Utts, 1986). It should be no surprise that there is a "repeatability problem" in parapsychology.

4.4 Flaw Analysis and Future Recommendations

The second half of Hyman's paper consisted of a "Meta-Analysis of Flaws and Successful Outcomes" (1985b, page 30), designed to explore whether or not various measures of success were related to specific flaws in the experiments. While many critics have argued that the results in parapsychology can be explained by experimental flaws, Hyman's analysis was the first to attempt to quantify the relationship between flaws and significant results.


Hyman identified 12 potential flaws in the ganzfeld experiments, such as inadequate randomization, multiple tests used without adjusting the significance level (thus inflating the significance level from the nominal 5%) and failure to use a duplicate set of targets for the judging process (thus allowing possible clues such as fingerprints). Using cluster and factor analyses, the 12 binary flaw variables were combined into three new variables, which Hyman named General Security, Statistics and Controls.

Several analyses were then conducted. The one reported with the most detail is a factor analysis utilizing 17 variables for each of 36 studies. Four factors emerged from the analysis. From these, Hyman concluded that security had increased over the years, that the significance level tended to be inflated the most for the most complex studies and that both effect size and level of significance were correlated with the existence of flaws.

Following his factor analysis, Hyman picked the three flaws that seemed to be most highly correlated with success, which were inadequate attention to both randomization and documentation and the potential for ordinary communication between the sender and receiver. A regression equation was then computed using each of the three flaws as dummy variables, and the effect size for the experiment as the dependent variable. From this equation, Hyman concluded that a study without these three flaws would be predicted to have a hit rate of 27%. He concluded that this is "well within the statistical neighborhood of the 25% chance rate" (1985b, page 37), and thus "the ganzfeld psi data base, despite initial impressions, is inadequate either to support the contention of a repeatable study or to demonstrate the reality of psi" (page 38).

Honorton discounted both Hyman's flaw classification and his analysis. He did not deny that flaws existed, but he objected that Hyman's analysis was faulty and impossible to interpret. Honorton asked psychometrician David Saunders to write an Appendix to his article, evaluating Hyman's analysis. Saunders first criticized Hyman's use of a factor analysis with 17 variables (many of which were dichotomous) and only 36 cases and concluded that "the entire analysis is meaningless" (Saunders, 1985, page 87). He then noted that Hyman's choice of the three flaws to include in his regression analysis constituted a clear case of multiple analysis, since there were 84 possible sets of three that could have been selected (out of nine potential flaws), and Hyman chose the set most highly correlated with effect size. Again, Saunders concluded that any interpretation drawn from [the regression analysis] "must be regarded as meaningless" (1985, page 88).

Hyman's results were also contradicted by Harris and Rosenthal (1988b) in an analysis requested by Hyman in his capacity as Chair of the National Academy of Sciences' Subcommittee on Parapsychology. Using Hyman's flaw classifications and a multivariate analysis, Harris and Rosenthal concluded that "Our analysis of the effects of flaws on study outcome lends no support to the hypothesis that ganzfeld research results are a significant function of the set of flaw variables" (1988b, page 3).

Hyman and Honorton were in the process of preparing papers for a second round of debate when they were invited to lunch together at the 1986 Meeting of the Parapsychological Association. They discovered that they were in general agreement on several major issues, and they decided to coauthor a "Joint Communiqué" (Hyman and Honorton, 1986). It is clear from their paper that they both thought it was more important to set the stage for future experimentation than to continue the technical arguments over the current data base. In the abstract to their paper, they wrote:

We agree that there is an overall significant effect in this data base that cannot reasonably be explained by selective reporting or multiple analysis. We continue to differ over the degree to which the effect constitutes evidence for psi, but we agree that the final verdict awaits the outcome of future experiments conducted by a broader range of investigators and according to more stringent standards [page 351].

The paper then outlined what these standards should be. They included controls against any kind of sensory leakage, thorough testing and documentation of randomization methods used, better reporting of judging and feedback protocols, control for multiple analyses and advance specification of number of trials and type of experiment. Indeed, any area of research could benefit from such a careful list of procedural recommendations.

4.5 Rosenthal's Meta-Analysis

The same issue of the Journal of Parapsychology in which the "Joint Communiqué" appeared also carried commentaries on the debate by 10 separate authors. In his commentary, psychologist Robert Rosenthal, one of the pioneers of meta-analysis in psychology, summarized the aspects of Hyman's and Honorton's work that would typically be included in a meta-analysis (Rosenthal, 1986). It is worth reviewing Rosenthal's results so that they can be used as a basis of comparison for the more recent psi ganzfeld studies reported in Section 5. Rosenthal, like Hyman and Honorton, focused only on the 28 studies for which direct hits were known.


He chose to use an effect size measure called Cohen's h, which is the difference between the arcsine-transformed proportions of direct hits that were observed and expected:

    h = 2(arcsin √p1 − arcsin √p2).

One advantage of this measure over the difference in raw proportions is that it can be used to compare experiments with different chance hit rates.
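A direct computation (a minimal sketch) reproduces the effect sizes quoted below for the ganzfeld data base:

    import math

    def cohens_h(p1, p2):
        # Difference of arcsine-transformed proportions.
        return 2 * (math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2)))

    print(round(cohens_h(0.40, 0.25), 2))   # ~0.32, the median effect size
    print(round(cohens_h(0.38, 0.25), 2))   # ~0.28, the mean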

If the observed and expected numbers of hits were identical, the effect size would be zero. Of the 28 studies, 23 (82%) had effect sizes greater than zero, with a median effect size of 0.32 and a mean of 0.28. These correspond to direct hit rates of 0.40 and 0.38 respectively, when 0.25 is expected by chance. A 95% confidence interval for the true effect size is from 0.11 to 0.45, corresponding to direct hit rates of from 0.30 to 0.46 when chance is 0.25.

A common technique in meta-analysis is to calculate a "combined z," found by summing the individual z scores and dividing by the square root of the number of studies. The result should have a standard normal distribution if each z score has a standard normal distribution. For the ganzfeld studies, Rosenthal reported a combined z of 6.60 with a p-value of 3.37 × 10⁻¹¹. He also reiterated Honorton's file-drawer assessment by calculating that there would have to be 423 studies unreported to negate the significant effect in the 28 direct hit studies.
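Both figures follow from the Stouffer combination rule and Rosenthal's (1979) fail-safe formula; the check below recovers the sum of the individual z scores from the reported combined z.

    import math

    k, z_combined = 28, 6.60
    sum_z = z_combined * math.sqrt(k)    # Stouffer: combined z = (sum z_i)/sqrt(k)

    # Fail-safe N: number of unreported null studies needed to pull the
    # combined result down to z = 1.645 (p = 0.05, one-tailed).
    fail_safe = (sum_z / 1.645) ** 2 - k
    print(round(fail_safe))              # ~423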

Finally, Rosenthal acknowledged that, because of the flaws in the data base and the potential for at least a small file-drawer effect, the true average effect size was probably closer to 0.18 than 0.28. He concluded, "Thus, when the accuracy rate expected under the null is 1/4, we might estimate the obtained accuracy rate to be about 1/3" (1986, page 333). This is the value used for the earlier power calculation.

It is worth mentioning that Rosenthal was commissioned by the National Academy of Sciences to prepare a background paper to accompany its 1988 report on parapsychology. That paper (Harris and Rosenthal, 1988a) contained much of the same analysis as his commentary summarized above. Ironically, the discussion of the ganzfeld work in the National Academy report focused on Hyman's 1985 analysis, but never mentioned the work it had commissioned Rosenthal to perform, which contradicted the final conclusion in the report.

5. A META-ANALYSIS OF RECENT GANZFELD EXPERIMENTS

After the initial exchange with Hyman at the 1982 Parapsychological Association Meeting, Honorton and his colleagues developed an automated ganzfeld experiment that was designed to eliminate the methodological flaws identified by Hyman. The execution and reporting of the experiments followed the detailed guidelines agreed upon by Hyman and Honorton.

Using this "autoganzfeld" experiment, experimental series were conducted by eight experimenters between February 1983 and September 1989, when the equipment had to be dismantled due to lack of funding. In this section, the results of these experiments are summarized and compared to the earlier ganzfeld studies. Much of the information is derived from Honorton et al. (1990).

5.1 The Automated Ganzfeld Procedure

Like earlier ganzfeld studies, the autoganzfeld experiments require four participants. The first is the Receiver (R), who attempts to identify the target material being observed by the Sender (S). The Experimenter (E) prepares R for the task, elicits the response from R and supervises R's judging of the response against the four potential targets. (Judging is double blind; E does not know which is the correct target.) The fourth participant is the lab assistant (LA) whose only task is to instruct the computer to randomly select the target. No one involved in the experiment knows the identity of the target.

Both R and S are sequestered in sound-isolated, electrically shielded rooms. R is prepared as in earlier ganzfeld studies, with white noise and a field of red light. In a nonadjacent room, S watches the target material on a television and can hear R's target description ("mentation") as it is being given. The mentation is also tape recorded.

The judging process takes place immediately after the 30-minute sending period. On a TV monitor in the isolated room, R views the four choices from the target pack that contains the actual target. R is asked to rate each one according to how closely it matches the ganzfeld mentation. The ratings are converted to ranks and, if the correct target is ranked first, a direct hit is scored. The entire process is automatically recorded by the computer. The computer then displays the correct choice to R as feedback.

There were 160 preselected targets, used with replacement, in 10 of the series. They were arranged in packets of four, and the decoys for a given target were always the remaining three in the same set. Thus, even if a particular target in a set were consistently favored by Rs, the probability of a direct hit under the null hypothesis would remain at 1/4. Popular targets should be no more likely to be selected by the computer's random number generator than any of the others in the set. The selection of the target by the computer is the only source of randomness in these experiments. This is an important point, and one that is often misunderstood. (See Utts, 1991, for elucidation.)
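This design property can be checked by simulation: even a maximally biased judge scores direct hits 25% of the time under the null, because the target is chosen uniformly within its packet. The sketch below assumes a hypothetical worst-case judge who always picks the same packet position.

    import random

    # Under the null hypothesis the judge's choice is independent of the
    # computer's uniform selection within the four-target packet.
    def trial():
        target = random.randrange(4)   # computer's uniform choice in the packet
        judged = 0                     # biased judge: always picks position 0
        return target == judged

    n = 200_000
    print(sum(trial() for _ in range(n)) / n)   # ~0.25, regardless of the bias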

Eighty of the targets were "dynamic," consisting of scenes from movies, documentaries and cartoons; 80 were "static," consisting of photographs, art prints and advertisements. The four targets within each set were all of the same type. Earlier studies indicated that dynamic targets were more likely to produce successful results, and one of the goals of the new experiments was to test that theory.

The randomization procedure used to select the target and the order of presentation for judging was thoroughly tested before and during the experiments. A detailed description is given by Honorton et al. (1990, pages 118-120).

Three of the series were pilot series, five were formal series with novice receivers, and three were formal series with experienced receivers. The last series with experienced receivers was the only one that did not use the 160 targets. Instead, it used only one set of four dynamic targets in which one target had previously received several first place ranks and one had never received a first place rank. The receivers, none of whom had had prior exposure to that target pack, were not aware that only one target pack was being used. They each contributed one session only to the series. This will be called the "special series" in what follows.

Except for two of the pilot series, numbers of trials were planned in advance for each series. Unfortunately, three of the formal series were not yet completed when the funding ran out, including the special series, and one pilot study with advance planning was terminated early when the experimenter relocated. There were no unreported trials during the 6-year period under review, so there was no "file drawer."

Overall, there were 183 Rs who contributed only one trial and 58 who contributed more than one, for a total of 241 participants and 355 trials. Only 23 Rs had previously participated in ganzfeld experiments, and 194 Rs (81%) had never participated in any parapsychological research.

5.2 Results

While acknowledging that no probabilistic conclusions can be drawn from qualitative data, Honorton et al. (1990) included several examples of session excerpts that Rs identified as providing the basis for their target rating. To give a flavor for the dream-like quality of the mentation and the amount of information that can be lost by only assigning a rank, the first example is reproduced here. The target was a painting by Salvador Dali called "Christ Crucified." The correct target received a first place rank. The part of the mentation R used to make this assessment read:

I think of guides, like spirit guides, leading me and I come into a court with a king. It's quiet. . . . It's like heaven. The king is something like Jesus. Woman. Now I'm just sort of somersaulting through heaven. . . . Brooding. . . . Aztecs, the Sun God. . . . High priest. . . . Fear. . . . Graves. Woman. Prayer. . . . Funeral. . . . Dark. Death. . . . Souls. . . . Ten Commandments. Moses. . . . [Honorton et al., 1990].

    the 355 tria ls, for a hit ra te of 34.4% (exact bino-mial p-value 0.00005) when 25% were expectedby chance. Cohen's h is 0.20, and a 95% confidenceinterval for the overall hit rate is from 0.30 to 0.39.This calculation assumes, of course, that the proba-bility of a direct hit is constant and independentacross trials, an assumption that may be question-able except under the null hypothesis of no psiabilities.

Honorton et al. (1990) also calculated effect sizes for each of the series and each of the eight experimenters. All but one of the series (the first novice series) had positive effect sizes, as did all of the experimenters.

The special series with experienced Rs had an exceptionally high effect size with h = 0.81, corresponding to 16 direct hits out of 25 trials (64%), but the remaining series and the experimenters had relatively homogeneous effect sizes given the amount of variability expected by chance. If the special series is removed, the overall hit rate is 32.1%, h = 0.16. Thus, the positive effects are not due to just one series or one experimenter.

Of the 218 trials contributed by novices, 71 were direct hits (32.5%, h = 0.17), compared with 51 hits in the 137 trials by those with prior ganzfeld experience (37%, h = 0.26). The hit rates and effect sizes were 31% (h = 0.14) for the combined pilot series, 32.5% (h = 0.17) for the combined formal novice series, and 41.5% (h = 0.35) for the combined experienced series. The last figure drops to 31.6% if the outlier series is removed. Finally, without the outlier series the hit rate for the combined series where all of the planned trials were completed was 31.2% (h = 0.14), while it was 35% (h = 0.22) for the combined series that were terminated early. Thus, optional stopping cannot account for the positive effect.


There were two interesting comparisons that had been suggested by earlier work and were preplanned in these experiments. The first was to compare results for trials with dynamic targets with those for static targets. In the 190 dynamic target sessions there were 77 direct hits (40%, h = 0.32), and for the static targets there were 45 hits in 165 trials (27%, h = 0.05), thus indicating that dynamic targets produced far more successful results.

The second comparison of interest was whether or not the sender was a friend of the receiver. This was a choice the receiver could make. If he or she did not bring a friend, a lab member acted as sender. There were 211 trials with friends as senders (some of whom were also lab staff), resulting in 76 direct hits (36%, h = 0.24). Four trials used no sender. The remaining 140 trials used nonfriend lab staff as senders and resulted in 46 direct hits (33%, h = 0.18). Thus, trials with friends as senders were slightly more successful than those without.

Consonant with the definition of replication based on consistent effect sizes, it is informative to compare the autoganzfeld experiments with the direct hit studies in the previous data base. The overall success rates are extremely similar. The overall direct hit rate was 34.4% for the autoganzfeld studies and was 38% for the comparable direct hit studies in the earlier meta-analysis. Rosenthal's (1986) adjustment for flaws had placed a more conservative estimate at 33%, very close to the observed 34.4% in the new studies.

One limitation of this work is that the autoganzfeld studies, while conducted by eight experimenters, all used the same equipment in the same laboratory. Unfortunately, the level of funding available in parapsychology and the cost in time and equipment to conduct proper experiments make it difficult to amass large amounts of data across laboratories. Another autoganzfeld laboratory is currently being constructed at the University of Edinburgh in Scotland, so interlaboratory comparisons may be possible in the near future.

Based on the effect size observed to date, large samples are needed to achieve reasonable power. If there is a constant effect across all trials, resulting in 33% direct hits when 25% are expected by chance, to achieve a one-tailed significance level of 0.05 with 95% probability would require 345 sessions.
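The 345-session figure follows from the standard normal-approximation sample-size formula for a one-sample test of a proportion (a minimal check):

    import math

    p0, p1 = 0.25, 0.33
    z_alpha = z_beta = 1.645          # one-tailed alpha = 0.05, power = 0.95

    n = ((z_alpha * math.sqrt(p0 * (1 - p0)) +
          z_beta * math.sqrt(p1 * (1 - p1))) / (p1 - p0)) ** 2
    print(math.ceil(n))               # 345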

We end this section by returning to the aspirin and heart attack example in Section 3 and expanding a comparison noted by Atkinson, Atkinson, Smith and Bem (1990, page 237). Computing the equivalent of Cohen's h for comparing observed heart attack rates in the aspirin and placebo groups results in h = 0.068. Thus, the effect size observed in the ganzfeld data base is triple the much publicized effect of aspirin on heart attacks.

6. OTHER META-ANALYSES IN PARAPSYCHOLOGY

Four additional meta-analyses have been conducted in various areas of parapsychology since the original ganzfeld meta-analyses were reported. Three of the four analyses focused on evidence of psi abilities, while the fourth examined the relationship between extroversion and psychic functioning. In this section, each of the four analyses will be briefly summarized.

There are only a handful of English-language journals and proceedings in parapsychology, so retrieval of the relevant studies in each of the four cases was simple to accomplish by searching those sources in detail and by searching other bibliographic data bases for keywords.

Each analysis included an overall summary, an analysis of the quality of the studies versus the size of the effect and a "file-drawer" analysis to determine the possible number of unreported studies. Three of the four also contained comparisons across various conditions.

6.1 Forced-Choice Precognition Experiments

Honorton and Ferrari (1989) analyzed forced-choice experiments conducted from 1935 to 1987, in which the target material was randomly selected after the subject had attempted to predict what it would be. The time delay in selecting the target ranged from under a second to one year. Target material included items as diverse as ESP cards and automated random number generators. Two investigators, S. G. Soal and Walter J. Levy, were not included because some of their work has been suspected to be fraudulent.

Overall Results. There were 309 studies reported by 62 senior authors, including more than 50,000 subjects and nearly two million individual trials. Honorton and Ferrari used z/√n as the measure of effect size (ES) for each study, where n was the number of Bernoulli trials in the study. They reported a mean ES of 0.020, and a mean z-score of 0.65 over all studies. They also reported a combined z of 11.41, p = 6.3 × 10⁻²⁵. Some 30% (92) of the studies were statistically significant at α = 0.05. The mean ES per investigator was 0.033, and the significant results were not due to just a few investigators.


Quality. Eight dichotomous quality measures were assigned to each study, resulting in possible scores from zero for the lowest quality, to eight for the highest. They included features such as adequate randomization, preplanned analysis and automated recording of the results. The correlation between study quality and effect size was 0.081, indicating a slight tendency for higher quality studies to be more successful, contrary to claims by critics that the opposite would be true. There was a clear relationship between quality and year of publication, presumably because over the years experimenters in parapsychology have responded to suggestions from critics for improving their methodology.

File Drawer. Following Rosenthal (1984), the authors calculated the "fail-safe N" indicating the number of unreported studies that would have to be sitting in file drawers in order to negate the significant effect. They found N = 14,268, or a ratio of 46 unreported studies for each one reported. They also followed a suggestion by Dawes, Landman and Williams (1984) and computed the mean z for all studies with z > 1.65. If such studies were a random sample from the upper 5% tail of a N(0,1) distribution, the mean would be 2.06. In this case it was 3.61. They concluded that selective reporting could not explain these results.

Comparisons. Four variables were identified that appeared to have a systematic relationship to study outcome. The first was that the 25 studies using subjects selected on the basis of good past performance were more successful than the 223 using unselected subjects, with mean effect sizes of 0.051 and 0.008, respectively. Second, the 97 studies testing subjects individually were more successful than the 105 studies that used group testing; mean effect sizes were 0.021 and 0.004, respectively. Timing of feedback was the third moderating variable, but information was only available for 104 studies. The 15 studies that never told the subjects what the targets were had a mean effect size of -0.001. Feedback after each trial produced the best results; the mean ES for the 47 studies was 0.035. Feedback after each set of trials resulted in a mean ES of 0.023 (21 studies), while delayed feedback (also 21 studies) yielded a mean ES of only 0.009. There is a clear ordering; as the gap between time of feedback and time of the actual guesses decreased, effect sizes increased.

The fourth variable was the time interval between the subject's guess and the actual target selection, available for 144 studies. The best results were for the 31 studies that generated targets less than a second after the guess (mean ES = 0.045), while the worst were for the seven studies that delayed target selection by at least a month (mean ES = 0.001). The mean effect sizes showed a clear trend, decreasing in order as the time interval increased from minutes to hours to days to weeks to months.

6.2 Attempts to Influence Random Physical Systems

Radin and Nelson (1989) examined studies designed to test the hypothesis that "The statistical output of an electronic RNG [random number generator] is correlated with observer intention in accordance with prespecified instructions" (page 1502). These experiments typically involve RNGs based on radioactive decay, electronic noise or pseudorandom number sequences seeded with true random sources. Usually the subject is instructed to try to influence the results of a string of binary trials by mental intention alone. A typical protocol would ask a subject to press a button (thus starting the collection of a fixed-length sequence of bits), and then try to influence the random source to produce more zeroes or more ones. A run might consist of three successive button presses, one each in which the desired result was more zeroes or more ones, and one as a control with no conscious intention. A score would then be computed for each button press.
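As an illustration of the protocol, the following Python sketch simulates one tri-polar run under the null hypothesis of a fair binary source; the 200-bit sequence length is a made-up parameter, not one taken from these studies.

    import random

    # One fixed-length button press under the null (fair source).
    def button_press(n_bits: int = 200) -> float:
        """Return the z-score of the number of ones in n_bits bits."""
        ones = sum(random.randint(0, 1) for _ in range(n_bits))
        mean, sd = n_bits / 2, (n_bits / 4) ** 0.5   # Binomial(n, 1/2)
        return (ones - mean) / sd

    # A tri-polar run: aim for more ones, aim for more zeroes, control.
    z_high, z_low, z_control = (button_press() for _ in range(3))
    es = z_high / 200 ** 0.5   # the z / sqrt(n) effect size used here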

The 832 studies in the analysis were conducted from 1959 to 1987 and included 235 control studies, in which the output of the RNGs was recorded but there was no conscious intention involved. These were usually conducted before and during the experimental series, as tests of the RNGs.

Results. The effect size measure used was again z/√n, where z was positive if more bits of the specified type were achieved. The mean effect size for control studies was not significantly different from zero (−1.0 × 10⁻⁵). The mean effect size for the experimental studies was also very small, 3.2 × 10⁻⁴, but it was significantly higher than the mean ES for the control studies (z = 4.1).

Quality. Sixteen quality measures were defined and assigned to each study, under the four general categories of procedures, statistics, data and the RNG device. A score of 16 reflected the highest quality. The authors regressed mean effect size on mean quality for each investigator and found a slope of 2.5 × 10⁻⁵, with standard error of 3.2 × 10⁻⁵, indicating little relationship between quality and outcome. They also calculated a weighted mean effect size, using quality scores as weights, and found that it was very similar to the unweighted mean ES. They concluded that "differences in methodological quality are not significant predictors of effect size" (page 1507).

File Drawer. Radin and Nelson used several methods for estimating the number of unreported studies (pages 1508-1510). Their estimates ranged from 200 to 1000, based on models assuming that all significant studies were reported. They calculated the fail-safe N to be 54,000.

6.3 Attempts to Influence Dice

Radin and Ferrari (1991) examined 148 studies, published from 1935 to 1987, designed to test whether or not consciousness can influence the results of tossing dice. They also found 31 control studies in which no conscious intention was involved.

Results. The effect size measure used was z/√n, where z was based on the number of throws in which the die landed with the desired face (or faces) up, in n throws. The weighted mean ES for the experimental studies was 0.0122 with a standard error of 0.00062; for the control studies the mean and standard error were 0.00093 and 0.00255, respectively. Weights for each study were determined by quality, giving more weight to high-quality studies. Combined z scores for the experimental and control studies were reported by Radin and Ferrari to be 18.2 and 0.18, respectively.

Quality. Eleven dichotomous quality measures were assigned, ranging from automated recording to whether or not control studies were interspersed with the experimental studies. The final quality score for each study combined these with information on method of tossing the dice, and with source of subject (defined below). A regression of quality score versus effect size resulted in a slope of 0.002, with a standard error of 0.0011. However, when effect sizes were weighted by sample size, there was a significant relationship between quality and effect size, leading Radin and Ferrari to conclude that higher-quality studies produced lower weighted effect sizes.

File Drawer. Radin and Ferrari calculated Rosenthal's fail-safe N for this analysis to be 17,974. Using the assumption that all significant studies were reported, they estimated the number of unreported studies to be 1152. As a final assessment, they compared studies published before and after 1975, when the Journal of Parapsychology adopted an official policy of publishing nonsignificant results. They concluded, based on that analysis, that more nonsignificant studies were published after 1975, and thus "We must consider the overall (1935-1987) data base as suspect with respect to the filedrawer problem."

Comparisons. Radin and Ferrari noted that there was bias in both the experimental and control studies across die face. Six was the face most likely to come up, consistent with the observation that it has the least mass. Therefore, they examined results for the subset of 69 studies in which targets were evenly balanced among the six faces. They still found a significant effect, with mean and standard error for effect size of 8.6 × 10⁻³ and 1.1 × 10⁻³, respectively. The combined z was 7.617 for these studies.
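For concreteness, the effect size just defined can be computed directly from a study's counts; the counts in the example call below are hypothetical, not from any particular study.

    import math

    # Effect size z / sqrt(n) for a single dice study, with z computed
    # from the number of throws showing the target face.
    def dice_es(hits: int, n: int, p: float = 1 / 6) -> float:
        z = (hits - n * p) / math.sqrt(n * p * (1 - p))
        return z / math.sqrt(n)

    print(dice_es(180, 1000))   # ~0.036 for 180 target faces in 1000 throws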

They also compared effect sizes across types of subjects used in the studies, categorizing them as unselected, experimenter and other subjects, experimenter as sole subject, and specially selected subjects. Like Honorton and Ferrari (1989), they found the highest mean ES for studies with selected subjects; it was approximately 0.02, more than twice that for unselected subjects.

6.4 Extroversion and ESP Performance

Honorton, Ferrari and Bem (1991) conducted a meta-analysis to examine the relationship between scores on tests of extroversion and scores on psi-related tasks. They found 60 studies by 17 investigators, conducted from 1945 to 1983.

Results. The effect size measure used for this analysis was the correlation between each subject's extroversion score and ESP score. A variety of measures had been used for both scores across studies, so various correlation coefficients were used. Nonetheless, a stem and leaf diagram of the correlations showed an approximate bell shape with mean and standard deviation of 0.19 and 0.26, respectively, and with an additional outlier at r = 0.91. Honorton et al. reported that when weighted by degrees of freedom, the weighted mean r was 0.14, with a 95% confidence interval covering 0.10 to 0.19.

Forced Choice versus Free Response Results. Because forced-choice and free-response tests differ qualitatively, Honorton et al. chose to examine their relationship to extroversion separately. They found that for free-response studies there was a significant correlation between extroversion and ESP scores, with mean r = 0.20 and z = 4.46. Further, this effect was homogeneous across both investigators and extroversion scales.
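Honorton et al. do not spell out the weighting formula; a standard construction for a df-weighted mean correlation with a 95% confidence interval, sketched below in Python under that assumption, works through Fisher's z-transformation (the study r's and sample sizes shown are hypothetical).

    import math

    # Weighted mean correlation and 95% CI via Fisher's z-transformation;
    # the weight n - 3 is the inverse variance of each transformed r.
    def mean_correlation(rs, ns):
        zs = [math.atanh(r) for r in rs]
        ws = [n - 3 for n in ns]
        zbar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
        half = 1.96 / math.sqrt(sum(ws))
        return math.tanh(zbar), (math.tanh(zbar - half),
                                 math.tanh(zbar + half))

    r_mean, ci = mean_correlation([0.25, 0.10, 0.19], [43, 62, 28])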

For forced-choice studies, there was a significant correlation between ESP and extroversion, but only for those studies that reported the ESP results to the subjects before measuring extroversion. Honorton et al. speculated that the relationship was an artifact, in which extroversion scores were temporarily inflated as a result of positive feedback on ESP performance.

Confirmation with New Data. Following the extroversion/ESP meta-analysis, Honorton et al. attempted to confirm the relationship using the autoganzfeld data base. Extroversion scores based on the Myers-Briggs Type Indicator were available for 221 of the 241 subjects who had participated in autoganzfeld studies.

The correlation between extroversion scores and ganzfeld rating scores was r = 0.18, with a 95% confidence interval from 0.05 to 0.30. This is consistent with the mean correlation of 0.20 for free-response experiments, determined from the meta-analysis. These correlations indicate that extroverted subjects can produce higher scores in free-response ESP tests.
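Assuming the interval was computed by the usual Fisher z-transformation, it can be checked directly from r = 0.18 and n = 221:

    import math

    r, n = 0.18, 221
    z, se = math.atanh(r), 1 / math.sqrt(n - 3)
    lo, hi = math.tanh(z - 1.96 * se), math.tanh(z + 1.96 * se)
    print(f"({lo:.2f}, {hi:.2f})")   # (0.05, 0.30), matching the interval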

7. CONCLUSIONS

Parapsychologists often make a distinction between "proof-oriented research" and "process-oriented research." The former is typically conducted to test the hypothesis that psi abilities exist, while the latter is designed to answer questions about how psychic functioning works. Proof-oriented research has dominated the literature in parapsychology. Unfortunately, many of the studies used small samples and would thus be nonsignificant even if a moderate-sized effect exists.

The recent focus on meta-analysis in parapsychology has revealed that there are small but consistently nonzero effects across studies, experimenters and laboratories. The sizes of the effects in forced-choice studies appear to be comparable to those reported in some medical studies that had been heralded as breakthroughs. (See Section 5; also Honorton and Ferrari, 1989, page 301.) Free-response studies show effect sizes of far greater magnitude.

A promising direction for future process-oriented research is to examine the causes of individual differences in psychic functioning. The ESP/extroversion meta-analysis is a step in that direction.

In keeping with the idea of individual differences, Bayes and empirical Bayes methods would appear to make more sense than the classical inference methods commonly used, since they would allow individual abilities and beliefs to be modeled. Jeffreys (1990) reported a Bayesian analysis of some of the RNG experiments and showed that conclusions were closely tied to prior beliefs even though hundreds of thousands of trials were available.

It may be that the nonzero effects observed in the meta-analyses can be explained by something other than ESP, such as shortcomings in our understanding of randomness and independence. Nonetheless, there is an anomaly that needs an explanation. As I have argued elsewhere (Utts, 1987), research in parapsychology should receive more support from the scientific community. If ESP does not exist, there is little to be lost by erring in the direction of further research, which may in fact uncover other anomalies. If ESP does exist, there is much to be lost by not doing process-oriented research, and much to be gained by discovering how to enhance and apply these abilities to important world problems.

ACKNOWLEDGMENTS

I would like to thank Deborah Delanoy, Charles Honorton, Wesley Johnson, Scott Plous and an anonymous reviewer for their helpful comments on an earlier draft of this paper, and Robert Rosenthal and Charles Honorton for discussions that helped clarify details.

REFERENCES

ATKINSON, R. L., ATKINSON, R. C., SMITH, E. E. and BEM, D. J. (1990). Introduction to Psychology, 10th ed. Harcourt Brace Jovanovich, San Diego.
BELOFF, J. (1985). Research strategies for dealing with unstable phenomena. In The Repeatability Problem in Parapsychology (B. Shapin and L. Coly, eds.) 1-21. Parapsychology Foundation, New York.
BLACKMORE, S. J. (1985). Unrepeatability: Parapsychology's only finding. In The Repeatability Problem in Parapsychology (B. Shapin and L. Coly, eds.) 183-206. Parapsychology Foundation, New York.
BURDICK, D. S. and KELLY, E. F. (1977). Statistical methods in parapsychological research. In Handbook of Parapsychology (B. B. Wolman, ed.) 81-130. Van Nostrand Reinhold, New York.
CAMP, B. H. (1937). (Statement in Notes Section.) Journal of Parapsychology 1 305.
COHEN, J. (1990). Things I have learned (so far). American Psychologist 45 1304-1312.
COOVER, J. E. (1917). Experiments in Psychical Research at Leland Stanford Junior University. Stanford Univ.
DAWES, R. M., LANDMAN, J. and WILLIAMS, M. (1984). Reply to Kurosawa. American Psychologist 39 74-75.
DIACONIS, P. (1978). Statistical problems in ESP research. Science 201 131-136.
DOMMEYER, F. C. (1975). Psychical research at Stanford University. Journal of Parapsychology 39 173-205.
DRUCKMAN, D. and SWETS, J. A., eds. (1988). Enhancing Human Performance: Issues, Theories, and Techniques. National Academy Press, Washington, D.C.
EDGEWORTH, F. Y. (1885). The calculus of probabilities applied to psychical research. In Proceedings of the Society for Psychical Research 3 190-199.
EDGEWORTH, F. Y. (1886). The calculus of probabilities applied to psychical research. II. In Proceedings of the Society for Psychical Research 4 189-208.
FELLER, W. K. (1940). Statistical aspects of ESP. Journal of Parapsychology 4 271-297.
FELLER, W. K. (1968). An Introduction to Probability Theory and Its Applications 1, 3rd ed. Wiley, New York.
FISHER, R. A. (1924). A method of scoring coincidences in tests with playing cards. In Proceedings of the Society for Psychical Research 34 181-185.
FISHER, R. A. (1929). The statistical method in psychical research. In Proceedings of the Society for Psychical Research 39 189-192.
GALLUP, G. H., JR. and NEWPORT, F. (1991). Belief in paranormal phenomena among adult Americans. Skeptical Inquirer 15 137-146.
GARDNER, M. J. and ALTMAN, D. G. (1986). Confidence intervals rather than p-values: Estimation rather than hypothesis testing. British Medical Journal 292 746-750.
GILMORE, J. B. (1989). Randomness and the search for psi. Journal of Parapsychology 53 309-340.
GILMORE, J. B. (1990). Anomalous significance in pararandom and psi-free domains. Journal of Parapsychology 54 53-58.
GREELEY, A. (1987). Mysticism goes mainstream. American Health 7 47-49.
GREENHOUSE, J. B. and GREENHOUSE, S. W. (1988). An aspirin a day...? Chance 1 24-31.
GREENWOOD, J. A. and STUART, C. E. (1940). A review of Dr. Feller's critique. Journal of Parapsychology 4 299-319.
HACKING, I. (1988). Telepathy: Origins of randomization in experimental design. Isis 79 427-451.
HANSEL, C. E. M. (1980). ESP and Parapsychology: A Critical Re-evaluation. Prometheus Books, Buffalo, N.Y.
HARRIS, M. J. and ROSENTHAL, R. (1988a). Interpersonal Expectancy Effects and Human Performance Research. National Academy Press, Washington, D.C.
HARRIS, M. J. and ROSENTHAL, R. (1988b). Postscript to Interpersonal Expectancy Effects and Human Performance Research. National Academy Press, Washington, D.C.
HEDGES, L. V. and OLKIN, I. (1985). Statistical Methods for Meta-Analysis. Academic, Orlando, Fla.
HONORTON, C. (1977). Psi and internal attention states. In Handbook of Parapsychology (B. B. Wolman, ed.) 435-472. Van Nostrand Reinhold, New York.
HONORTON, C. (1985a). How to evaluate and improve the replicability of parapsychological effects. In The Repeatability Problem in Parapsychology (B. Shapin and L. Coly, eds.) 238-255. Parapsychology Foundation, New York.
HONORTON, C. (1985b). Meta-analysis of psi ganzfeld research: A response to Hyman. Journal of Parapsychology 49 51-91.
HONORTON, C., BERGER, R. E., VARVOGLIS, M. P., QUANT, M., DERR, P., SCHECHTER, E. I. and FERRARI, D. C. (1990). Psi communication in the ganzfeld: Experiments with an automated testing system and a comparison with a meta-analysis of earlier studies. Journal of Parapsychology 54 99-139.
HONORTON, C. and FERRARI, D. C. (1989). "Future telling": A meta-analysis of forced-choice precognition experiments, 1935-1987. Journal of Parapsychology 53 281-308.
HONORTON, C., FERRARI, D. C. and BEM, D. J. (1991). Extraversion and ESP performance: A meta-analysis and a new confirmation. Research in Parapsychology 1990. The Scarecrow Press, Metuchen, N.J. To appear.
HYMAN, R. (1985a). A critical overview of parapsychology. In A Skeptic's Handbook of Parapsychology (P. Kurtz, ed.) 1-96. Prometheus Books, Buffalo, N.Y.
HYMAN, R. (1985b). The ganzfeld psi experiment: A critical appraisal. Journal of Parapsychology 49 3-49.
HYMAN, R. and HONORTON, C. (1986). A joint communiqué: The psi ganzfeld controversy. Journal of Parapsychology 50 351-364.
IVERSEN, G. R., LONGCOR, W. H., MOSTELLER, F., GILBERT, J. P. and YOUTZ, C. (1971). Bias and runs in dice throwing and recording: A few million throws. Psychometrika 36 1-19.
JEFFREYS, W. H. (1990). Bayesian analysis of random event generator data. Journal of Scientific Exploration 4 153-169.
LINDLEY, D. V. (1957). A statistical paradox. Biometrika 44 187-192.
MAUSKOPF, S. H. and MCVAUGH, M. R. (1979). The Elusive Science: Origins of Experimental Psychical Research. Johns Hopkins Univ. Press.
MCVAUGH, M. R. and MAUSKOPF, S. H. (1976). J. B. Rhine's Extrasensory Perception and its background in psychical research. Isis 67 161-189.
NEULIEP, J. W., ed. (1990). Handbook of replication research in the behavioral and social sciences. Journal of Social Behavior and Personality 5(4) 1-510.
OFFICE OF TECHNOLOGY ASSESSMENT (1989). Report of a workshop on experimental parapsychology. Journal of the American Society for Psychical Research 83 317-339.
PALMER, J. (1989). A reply to Gilmore. Journal of Parapsychology 53 341-344.
PALMER, J. (1990). Reply to Gilmore: Round two. Journal of Parapsychology 54 59-61.
PALMER, J. A., HONORTON, C. and UTTS, J. (1989). Reply to the National Research Council study on parapsychology. Journal of the American Society for Psychical Research 83 31-49.
RADIN, D. I. and FERRARI, D. C. (1991). Effects of consciousness on the fall of dice: A meta-analysis. Journal of Scientific Exploration 5 61-83.
RADIN, D. I. and NELSON, R. D. (1989). Evidence for consciousness-related anomalies in random physical systems. Foundations of Physics 19 1499-1514.
RAO, K. R. (1985). Replication in conventional and controversial sciences. In The Repeatability Problem in Parapsychology (B. Shapin and L. Coly, eds.) 22-41. Parapsychology Foundation, New York.
RHINE, J. B. (1934). Extrasensory Perception. Boston Society for Psychical Research, Boston. (Reprinted by Branden Press, 1964.)
RHINE, J. B. (1977). History of experimental studies. In Handbook of Parapsychology (B. B. Wolman, ed.) 25-47. Van Nostrand Reinhold, New York.
RICHET, C. (1884). La suggestion mentale et le calcul des probabilités. Revue Philosophique 18 608-674.
ROSENTHAL, R. (1984). Meta-Analytic Procedures for Social Research. Sage, Beverly Hills.
ROSENTHAL, R. (1986). Meta-analytic procedures and the nature of replication: The ganzfeld debate. Journal of Parapsychology 50 315-336.
ROSENTHAL, R. (1990a). How are we doing in soft psychology? American Psychologist 45 775-777.
ROSENTHAL, R. (1990b). Replication in behavioral research. Journal of Social Behavior and Personality 5 1-30.
SAUNDERS, D. R. (1985). On Hyman's factor analyses. Journal of Parapsychology 49 86-88.
SHAPIN, B. and COLY, L., eds. (1985). The Repeatability Problem in Parapsychology. Parapsychology Foundation, New York.
SPENCER-BROWN, G. (1957). Probability and Scientific Inference. Longmans Green, London and New York.
STUART, C. E. and GREENWOOD, J. A. (1937). A review of criticisms of the mathematical evaluation of ESP data. Journal of Parapsychology 1 295-304.
TVERSKY, A. and KAHNEMAN, D. (1982). Belief in the law of small numbers. In Judgment Under Uncertainty: Heuristics and Biases (D. Kahneman, P. Slovic and A. Tversky, eds.) 23-31. Cambridge Univ. Press.
UTTS, J. (1986). The ganzfeld debate: A statistician's perspective. Journal of Parapsychology 50 395-402.
UTTS, J. (1987). Psi, statistics, and society. Behavioral and Brain Sciences 10 615-616.
UTTS, J. (1988). Successful replication versus statistical significance. Journal of Parapsychology 52 305-320.
UTTS, J. (1989). Randomness and randomization tests: A reply to Gilmore. Journal of Parapsychology 53 345-351.
UTTS, J. (1991). Analyzing free-response data: A progress report. In Psi Research Methodology: A Re-examination (L. Coly, ed.). Parapsychology Foundation, New York. To appear.
WILKS, S. S. (1965a). Statistical aspects of experiments in telepathy. N.Y. Statistician 16(6) 1-3.
WILKS, S. S. (1965b). Statistical aspects of experiments in telepathy. N.Y. Statistician 16(7) 4-6.


Comment

M. J. Bayarri and James Berger

1. INTRODUCTION

There are many fascinating issues discussed in this paper. Several concern parapsychology itself and the interpretation of statistical methodology therein. We are not experts in parapsychology, and so have only one comment concerning such matters: In Section 3 we briefly discuss the need to switch from P-values to Bayes factors in discussing evidence concerning parapsychology.

A more general issue raised in the paper is that of replication. It is quite illuminating to consider the issue of replication from a Bayesian perspective, and this is done in Section 2 of our discussion.

2. REPLICATION

Many insightful observations concerning replication are given in the article, and these spurred us to determine if they could be quantified within Bayesian reasoning. Quantification requires clear delineation of the possible purposes of replication, and at least two are obvious. The first is simple reduction of random error, achieved by obtaining more observations from the replication. The second purpose is to search for possible bias in the original experiment. We use "bias" in a loose sense here, to refer to any of the huge number of ways in which the effects being measured by the experiment can differ from the actual effects of interest. Thus a clinical trial without a placebo can suffer a "placebo bias"; a survey can suffer a bias due to the sampling frame being unrepresentative of the actual population; and possible sources of bias in parapsychological experiments have been extensively discussed.

Replication to Reduce Random Error

If the sole goal of replication of an experiment is to reduce random error, matters are very straightforward. Reviewing the Bayesian way of studying this issue is, however, useful and will be done through the following simple example.

M. J. Bayarri is Titular Professor, Department of Statistics and Operations Research, University of Valencia, Avenida Dr. Moliner 50, 46100 Burjassot, Valencia, Spain. James Berger is the Richard M. Brumfield Distinguished Professor of Statistics, Purdue University, West Lafayette, Indiana 47907.

EXAMPLE. Consider the example from Tversky and Kahneman (1982), in which an experiment results in a standardized test statistic of z₁ = 2.46. (We will assume normality to keep computations trivial.) The question is: What is the highest value of z₂ in a second set of data that would be considered a failure to replicate? Two possible precise versions of this question are: Question 1: What is the probability of observing z₂ for which the null hypothesis would be rejected in the replicated experiment? Question 2: What value of z₂ would leave one's overall opinion about the null hypothesis unchanged?

Consider the simple case where Z₁ ~ N(θ, 1) and (independently) Z₂ ~ N(θ, 1), where θ is the mean and 1 is the standard deviation of the normal distribution. Note that we are considering the case in which no experimental bias is suspected, and so the means for each experiment are assumed to be the same.

Suppose that it is desired to test H₀: θ ≤ 0 versus H₁: θ > 0, and suppose that initial prior opinion about θ can be described by the noninformative prior π(θ) = 1. We consider the one-sided testing problem with a constant prior in this section, because it is known that then the posterior probability of H₀, to be denoted by P(H₀ | data), equals the P-value, allowing us to avoid complications arising from differences between Bayesian and classical answers.

After observing z₁ = 2.46, the posterior distribution of θ is

    θ | z₁ ~ N(z₁, 1) = N(2.46, 1).

Question 1 then has the answer (using predictive Bayesian reasoning)

    P(rejecting at level α | z₁) = P(Z₂ ≥ c | z₁) = Φ((z₁ − c)/√2),

where Φ is the standard normal cdf and c is the (one-sided) critical value corresponding to the level, α, of the test; given z₁, the predictive distribution of Z₂ is N(z₁, 2). For instance, if α = 0.05, then this probability equals 0.7178, demonstrating that there is a quite substantial probability that the second experiment will fail to reject. If α is chosen to be the observed significance level from the first experiment, so that c = z₁, then the probability that the second experiment will reject is just 1/2. This is nothing but a statement of the well-known "martingale" property of Bayesianism, that what you expect to see in the future is just what you know today. In a sense, therefore, question 1 is exposed as being uninteresting.

Question 2 more properly focuses on the fact that the stated goal of replication here is simply to reduce uncertainty in stated conclusions. The answer to the question follows immediately from noting that the posterior from the combined data (z₁, z₂) is

    θ | z₁, z₂ ~ N((z₁ + z₂)/2, 1/2),

so that

    P(H₀ | z₁, z₂) = Φ(−(z₁ + z₂)/√2).

Setting this equal to P(H₀ | z₁) = Φ(−z₁) and solving for z₂ yields z₂ = (√2 − 1)z₁ = 1.02. Any value of z₂ greater than this will increase the total evidence against H₀, while any value smaller than 1.02 will decrease the evidence.
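Both answers are easily verified numerically; the Python sketch below assumes nothing beyond the flat-prior normal model just described.

    from math import sqrt
    from statistics import NormalDist

    Phi = NormalDist().cdf
    z1, c = 2.46, 1.645           # observed z and one-sided 0.05 cutoff

    # Question 1: predictively, Z2 | z1 ~ N(z1, 2).
    print(Phi((z1 - c) / sqrt(2)))   # 0.7178: chance the replication rejects

    # Question 2: P(H0 | z1, z2) equals P(H0 | z1) when (z1 + z2)/sqrt(2) = z1.
    print((sqrt(2) - 1) * z1)        # 1.02: break-even value of z2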

Replication to Detect Bias

The aspirin example dramatically raises the issue of bias detection as a motive for replication. Professor Utts observes that replication 1 gives results that are fully compatible with those of the original study, which could be interpreted as suggesting that there is no bias in the original study, while replication 2 would raise serious concerns of bias. We became very interested in the implicit suggestion that replication 2 would thus lead to less overall evidence against the null hypothesis than would replication 1, even though in isolation replication 2 was much more "significant" than was replication 1. In attempting to see if this is so, we considered the Bayesian approach to study of bias within the framework of the aspirin example.

EXAMPLE. For simplicity in the aspirin example, we reduce consideration to

θ = true difference in heart attack rates between aspirin and placebo populations, multiplied by 1000;

Y = difference in observed heart attack rates between aspirin and placebo groups in original study, multiplied by 1000;

Xᵢ = difference in observed heart attack rates between aspirin and placebo groups in Replication i, multiplied by 1000.

We assume that the replication studies are extremely well designed and implemented, so that one is very confident that the Xᵢ have mean θ. Using normal approximations for convenience, the data can be summarized as

    X₁ ~ N(θ, 4.82),    X₂ ~ N(θ, 3.63),

with actual observations x₁ = 7.704 and x₂ = 13.07.

Consider now the bias issue. We assume that the original experiment is somewhat suspect in this regard, and we will model bias by defining the mean of Y to be θ + β, where β is the unknown bias. Then the data in the original experiment can be summarized by

    Y ~ N(θ + β, (1.54)²),

with the actual observation being y = 7.707.

Bayesian analysis requires specification of a prior distribution, π(β), for the suspected amount of bias. Of particular interest then are the posterior distribution of β, assuming replication i has been performed, given by

    π(β | y, xᵢ) ∝ π(β) exp(−(β − (y − xᵢ))² / (2((1.54)² + σᵢ²))),

where σᵢ² is the variance (4.82 or 3.63) from replication i; and the posterior probability of H₀, given by averaging P(H₀ | y, xᵢ, β) over this posterior for β.

Recall that our goal here was to see if Bayesian analysis can reproduce the intuition that the original experiment could be trusted if replication 1 had been done, while it could not be trusted (in spite of its much larger sample size) had replication 2 been performed. Establishing this requires finding a prior distribution π(β) for which π(β | y, x₁) has little effect on P(H₀ | y, x₁), but π(β | y, x₂) has a large effect on P(H₀ | y, x₂). To achieve the first objective, π(β) must be tightly concentrated near zero. To achieve the second, π(β) must be such that large | y − xᵢ |, which suggests presence of a large bias, can result in a substantial shift of posterior mass for β away from zero.


A sensible candidate for the prior density π(β) is the Cauchy(0, V) density

    π(β) = (1/(πV)) · 1/(1 + (β/V)²).

Flat-tailed densities, such as this, are well known to have the property that when discordant data are observed (e.g., when | y − xᵢ | is large), substantial mass shifts away from the prior center towards the likelihood center. It is easy to see that a normal prior for β cannot have the desired behavior.
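The posterior quantities in this example can be reproduced by one-dimensional quadrature over β. The Python sketch below encodes one natural reading of the model (flat prior on θ, β ~ Cauchy(0, V), and the likelihood for β reduced through y − x₁); the authors' exact computations may differ in detail.

    import numpy as np
    from scipy import stats

    tau2, sig2 = 1.54 ** 2, 4.82    # variances: original study, replication 1
    y, x = 7.707, 7.704             # observed differences (x 1000)
    V = 1.54 / 100                  # Cauchy scale of the bias prior

    beta = np.linspace(-60, 60, 400_001)
    # With theta integrated out, y - x ~ N(beta, tau2 + sig2) given beta.
    post = (stats.cauchy.pdf(beta, scale=V)
            * stats.norm.pdf(y - x, loc=beta, scale=np.sqrt(tau2 + sig2)))
    post /= np.trapz(post, beta)

    # Given beta, theta is normal with precision-weighted mean m(beta).
    v = 1 / (1 / tau2 + 1 / sig2)
    m = v * ((y - beta) / tau2 + x / sig2)
    p_H0 = np.trapz(stats.norm.cdf(-m / np.sqrt(v)) * post, beta)
    print(p_H0)   # on the order of 1e-7, versus ~5e-10 if no bias is allowed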

Our first surprise in consideration of these priors was how small V needed to be chosen in order for P(H₀ | y, x₁) to be unaffected by the bias. For instance, even with V = 1.54/100 (recall that 1.54 was the standard deviation of Y from the original experiment), computation yields P(H₀ | y, x₁) = 4.3 × 10⁻⁷, compared with the P-value (and posterior probability from the original experiment assuming no bias) of 2.8 × 10⁻⁷. There is a clear lesson here; even very small suspicions of bias can drastically alter a small P-value. Note that replication 1 is very consistent with the presence of no bias, and so the posterior distribution for the bias remains tightly concentrated near zero; for instance, the mean of the posterior for β is then 7.2 × 10⁻⁵ and the standard deviation is 0.25.

When we turned attention to replication 2, we found that it did not seriously change the prior perceptions of bias. Examination quickly revealed the reason; even the maximum likelihood estimate of the bias is no more than 1.4 standard deviations from zero, which is not enough to change strong prior beliefs. We, therefore, considered a third experiment, defined in Table 1. Transforming to approximate normality, as before, yields a normal summary for X₃, with x₃ = 22.72 being the actual observation. The maximum likelihood estimate of bias is now 3.95 standard deviations from zero, so there is potential for a substantial change in opinion about the bias.

Sure enough, computation when V = 1.54/100 yields that E[β | y, x₃] = 4.9 with (posterior) standard deviation equal to 6.62, which is a dramatic shift from prior opinion (that is, Cauchy(0, 1.54/100)). The effect of this is to essentially ignore the original experiment in overall assessments of evidence. For instance, P(H₀ | y, x₃) = 3.81 × 10⁻¹⁰, which is very close to P(H₀ | x₃) = 3.29 × 10⁻¹⁰. Note that, if β were set equal to zero, the overall posterior probability of H₀ (and P-value) w

TABLE 1
Frequency of heart attacks in replication 3

              Yes      No
Aspirin        23       9
Placebo        54    2116