Geobiology. 2019;1–11. wileyonlinelibrary.com/journal/gbi | © 2019 John Wiley & Sons Ltd

Received: 4 June 2018 | Revised: 30 October 2018 | Accepted: 30 December 2018 | DOI: 10.1111/gbi.12333

PERSPECTIVE

Statistical inference and reproducibility in geobiology

1 | INTRODUCTION

The late, great Karl Turekian would often joke about the number of new data points required for a geochemical paper. The answer was one: When combined with a previously published data point, a “best-fit” line could be drawn between the two points, and the slope of the line calculated, thereby giving rate. The joke had its roots in a seminar Dr. Turekian gave far earlier in his career, where the entire talk focused on the first data point of a novel measurement (a hard-won data point, but only one nonetheless). When an apparently exasperated audience member pointed this out, Turekian reportedly replied “well it's one more than anyone else has!” (Thiemens, Davis, Grossman, & Colman, 2013). This anecdote raises two important points about our field: First, establishing new analytical techniques is laborious, expensive, and requires considerable vision and skill, and second, at the nascent stages of any new record, the number of observations will be small.

Geobiology is now at the point where some mature data records (for instance δ13C or δ34S) have had thousands or tens of thousands of data points generated. There also exist millions of individual gene sequences from environmental genomics surveys. In contrast, emerging proxies (for instance, selenium isotopes, I/Ca ratios, carbonate “clumped” isotopes), or biomarker studies and technical geomicrobiological experiments still face the issue Turekian described and have only a handful of measurements or are transitioning toward larger datasets. The question remains how best to interpret data-rich records on the one hand, and scattered, hard-won data points on the other hand.
Here, using examples from within geobiology as well as the development of other fields such as ecology, psychology, and medicine, we argue that increased clarity regarding significance testing, multiple comparisons, and effect sizes will help researchers avoid false-positive inferences, better estimate the magnitude of changes in proxy records or experiments, and ultimately yield a richer understanding of geobiological phenomena.

2 | STATISTICS AND REPRODUCIBILITY IN OTHER FIELDS AND IN GEOBIOLOGY

We start by examining statistical practice and reproducibility outside of our field, before relating these broader themes back to geobiology. Science as a whole is currently described as facing a “crisis of reproducibility,” with diverse studies in multiple fields failing to replicate published findings (see Baker, 2016 for a Nature survey of reproducibility across fields). The issue here is not technical reproducibility at the sample level (e.g., “If I put this same sample in a mass spectrometer, will I get the same result twice?”) but rather at the level of the effects observed in the manuscript (e.g., a given treatment causes a specific outcome, or a predictor variable is correlated with a response variable). For example, independent efforts to reproduce “landmark papers” in cancer biomedical research by pharmaceutical companies Amgen (Begley & Ellis, 2012) and Bayer (Prinz, Schlange, & Asadullah, 2011) have only been able to replicate 11% and 20%–25% of published results, respectively. Similarly, a critical study of gene-by-environment (GxE) interactions in psychiatry found that only 27% of published replication attempts were positive (Duncan & Keller, 2011). Further, this small proportion of apparently positive replications was likely inflated; there were clear signatures of publication bias (the tendency to publish significant results more readily than non-significant results) among replication attempts in the GxE literature.
In psychology, a large-scale replication effort found that only 39% of studies could be replicated (Open Science Collaboration, 2015), and similar problems plague much of neuroscience research (Button et al., 2013). The studies mentioned here are likely just the tip of the iceberg (Begley & Ioannidis, 2015; Ioannidis, 2005). The causes of these low rates of replication are varied. There are likely different underlying causes for poor reproducibility across fields, and the severity of the problem undoubtedly varies as well. Our personal experience trying to implement published protocols without the highly detailed, laboratory-specific knowledge that is often omitted (generally for space reasons) from Materials and Methods sections suggests to us that at least some percentage of failed replications are caused by inadvertent methodological differences. Indeed, the Open Science Collaboration replication of psychology studies was attacked for poor adherence to original protocols (Gilbert, King, Pettigrew, & Wilson, 2016). In order to achieve precise replication of protocols, in extreme cases researchers have had to travel to other laboratories and work side-by-side to identify seemingly trivial methodological differences—vigorous stirring versus prolonged gentle shaking to isolate cells, for instance—that had an outsized effect on reproducibility (Hines, Su, Kuhn, Polyak, & Bissell, 2014; Lithgow, Driscoll, & Phillips, 2017). Methodological differences, though, can likely only explain a portion of the failed replication attempts. True scientific fraud is also something that makes headlines and erodes public trust, but again likely only accounts for a small proportion of replication failures (Bouter, Tijdink, Axelsen, Martinson, & ter Riet, 2016; Fanelli, 2009).


So, what causes such poor reproducibility? While recognizing again that the causes can vary across fields, clearly some of the most important factors are under-powered studies, a reliance on Null Hypothesis Significance Testing (NHST, e.g., p < 0.05) to determine “truth” or whether a paper should be published, and a lack of correction for implicit or explicit multiple comparisons (more broadly, “researcher degrees of freedom”). These issues were known within some fields, but were brought to more widespread attention through the publication of a provocative 2005 essay, “Why Most Published Research Findings Are False” (Ioannidis, 2005). Ioannidis identified the following causal factors as leading to the overall very low percentages of positive replication findings: (a) small sample sizes, (b) small effect sizes, (c) high numbers of tested relationships, (d) flexibility in designs, definitions, and what represents an unequivocal outcome, (e) conflicts of interest or prejudice within a field, thus increasing bias, and (f) “hot” fields of science, where more teams are simultaneously testing many different relationships. This tendency is exacerbated by the incentive structure for journals and authors to publish positive results and publish often.

Our field of geobiology is likely less beset by some of these issues related to, for instance, publication bias. The reason why is that data analysis is often accomplished through visual rather than statistical comparison; explicit p-values that would be used as the arbiter of accept/reject decisions in other fields are not generated. Figure 1 documents the use of statistical testing in the journal Geobiology compared to a sister publication by Wiley, Marine Ecology. This comparison is not meant to single out Geobiology as a journal; we are certain the results would be similar in our other disciplinary journals. In this comparison, we considered the 100 most recent papers at the time of writing that reported new data (so review papers, commentaries, or papers that were descriptive in nature, such as describing stromatolite morphologies—basically any paper where no new numerical data were generated—were excluded). Papers were considered to have a “statistical analysis” simply if there was some effort to understand the possible influence of error and sampling on the precision of the inference, recognizing this can take many forms (e.g., formal NHST, bootstrapping, and Bayesian posterior probabilities). In the marine ecology journal, essentially every paper (97%) reporting new data used a statistical analysis (similar results were reported by Fidler, Burgman, Cumming, Buttrose, & Thomason, 2006, for other ecology journals). In Geobiology, the percentage with any sort of statistical analysis is far lower, at 38% (chi-squared test; p = 2.0 × 10⁻¹⁸). We recognize that some of this difference may be due to different data types and approaches, but our personal observation is that many studies in Geobiology are comparing groups of data visually rather than statistically. Further, the statistical analyses that are published in Geobiology are concentrated in molecular phylogenetic studies, where bootstraps or Bayesian posterior probabilities are commonly used to assess precision. Statistical testing is far less common in non-phylogenetic studies (e.g., many geomicrobiology, geochemistry, biomarker, and biomineralization studies). Comparing the most recent papers in Geobiology against the first 100 data-driven papers in the journal (first published in 2003) reveals that the percentage with a statistical test has increased slightly (from 26% to 38%) but the difference between the two time intervals examined is not statistically significant (using p < 0.05 to declare statistical significance; chi-sq = 2.8, df = 1, p = 0.095).

We propose that increased emphasis on statistical testing as a critical step in data analysis is needed in the field of geobiology. Put simply, why build a scientific worldview based on studies where it has not been demonstrated that the observed differences are more than what would be expected from sampling variability? On the other hand, blind reliance on statistical testing will not be helpful either. As cogently argued by Fidler, Cumming, Burgman, and Thomason (2004), ecology as a field is mired in statistical fallacies: specifically, the erroneous beliefs that p-values are a direct index of effect size, and that the p-value represents the probability that the null hypothesis is true (or false). Consequently, simply running more statistical analyses is not a sufficient avenue to accurate, reproducible, and correctly interpreted findings. Here, we hope to use the opportunity provided by broader discussions about statistics and reproducibility in science to review three important concepts—(a) significance testing, (b) multiple comparisons, and (c) effect size—and translate them to our field of geobiology. This manuscript is intended to be educational and to start a discussion regarding proper statistical analyses in geobiology. Our focus is on what we believe are the more familiar terms of formal NHST (i.e., a frequentist approach), but we discuss the goal of a more flexible approach to statistical thinking at the conclusion of the paper. In this spirit, we recognize that these topics will be familiar to many readers, and indeed aside from a geobiological spin, there is little intellectual territory here that has not been extensively covered in medical, psychological, and ecological journals. However, for readers less familiar with these topics we hope this essay may provide useful guidance for avoiding some of the problems of reproducibility that have plagued other fields. Geobiology ultimately has the opportunity to be among the select group of fields that do not have major reproducibility problems.

FIGURE 1 Use of statistical testing in the 100 most recent data-driven papers in Geobiology (present) versus Marine Ecology. The first 100 papers in Geobiology (early) were also compared. “Statistical tests” were broadly defined (for instance, bootstrapping, Bayesian approaches, etc., and not only formal Null Hypothesis Significance Testing). Papers without a “Materials and Methods” section or that did not present new numerical data (e.g., review/synthesis papers, modeling papers) were not included in the comparison. [Bar chart: “Use of statistics in 100 papers from each journal”; groups: Geobiology (early), Geobiology (present), Marine Ecology; categories: statistical test vs. no statistical test]

Take home message: There are acknowledged issues with Null Hypothesis Significance Testing (including publication bias and replicability in science at large), but studies should not rely on visual analysis alone. A formal examination of the size of the effect relative to sampling variation is a key tool in evaluating new scientific claims.

3 | NULL HYPOTHESIS SIGNIFICANCE TESTING (NHST)

Despite its widespread use in science, the development of NHST has its roots in industrial applications. For instance, Ronald Fisher developed the analysis of variance working with experimental crop data, and William Gosset (nominated here as the patron saint of geobiological data analysis) developed the Student's t test working to increase yields at the Guinness Brewery (Student, 1908). When faced with two (or more) groups of samples, each with scatter, the pertinent question is whether the sets of samples may actually represent the same underlying data distribution, with observed variation between groups arising from sampling of this distribution (the null hypothesis, H0, is that there is no difference between groups; in other words, there is only one underlying distribution). In NHST, this question is addressed by comparing the means or medians of the groups (and data variability) and calculating the probability that a result at least that extreme would be found if the groups were drawn from the same distribution or population. This probability is expressed as the p-value. Incorrectly rejecting the null hypothesis (a false positive) is referred to as Type I error, and incorrectly failing to reject the null hypothesis (a false negative) is Type II error. A variety of parametric (those that depend on a specified probability distribution from which the data are drawn) and nonparametric (those that do not) statistical analyses exist to make these comparisons between two or more groups. A full review of particular analyses is beyond the scope of this article and is best found in statistical textbooks and web resources.
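This logic can be made concrete with a permutation test, which directly estimates "the probability of a result at least this extreme, given that all samples come from one underlying distribution." The sketch below uses made-up δ34S values for two hypothetical stratigraphic units (not data from any real study), and only the Python standard library:

```python
import random
import statistics

def permutation_test(group_a, group_b, n_perm=10_000, seed=42):
    """Two-sided permutation test for a difference in group means.

    Returns the fraction of random label shufflings whose absolute
    mean difference is at least as large as the observed one --
    an empirical p-value under the null of one shared distribution.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical d34S values (permil) from two stratigraphic units:
unit_1 = [18.2, 19.1, 17.8, 20.3, 18.9, 19.5]
unit_2 = [21.0, 20.4, 22.1, 19.8, 21.6, 20.9]
p = permutation_test(unit_1, unit_2)
print(p)
```

Note that the p-value answers only the question the text describes: how often sampling variability alone would produce a difference this large. It says nothing about the probability that the null hypothesis itself is true.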

Choosing the correct statistical analysis for a given set of data poses some issues, but the more important issue—the focus of this piece—is understanding the fundamental logic of statistical tests: What they do and do not tell you. It is errors in logic, rather than someone using a t test when they should have used a Wilcoxon, which are most problematic. One common fallacy regarding NHST is that the level of significance rejecting the null hypothesis (the p-value) is the probability that the null hypothesis is correct. There is a wide literature discussing this fallacy, with one of the best being Cohen's 1994 essay “The Earth is round (p < 0.05).” He notes that what we want to know, as researchers, is “given these data, what is the probability that H0 is true?” But what NHST tells us is “given that H0 is true, what is the probability of these (or more extreme) data?” In other words, rejecting a specific null hypothesis does not actually confirm any underlying truth or theory. Addressing this requires knowing the likelihood that a real effect exists in the first place, which may be difficult to calculate. Even if these odds can be calculated, the combined uncertainty results in more substantial murkiness about the results than the p-value alone indicates (Nuzzo, 2014, provides more information on what a p-value really tells you—in a form more palatable to the average reader than Cohen—as well as a historical discussion of how the p = 0.05 threshold came about).

Another common error in interpreting formal statistical tests regards “mechanical dichotomous decisions around a sacred 0.05 criterion” (Cohen, 1994). Obvious to most readers is that a 5.5% probability of generating results at least that extreme (e.g., p = 0.055) is basically no different, in an interpretive sense, than a 4.5% probability (p = 0.045). In other words, p < 0.05, p < 0.01, and p < 0.005 (Benjamin et al., 2018, recently advocated for the more stringent p < 0.005 statistical criterion) are all arbitrary criteria, although still useful conventions. A memorable quote by Rosnow and Rosenthal (1989) states:

We want to underscore that, surely, God loves the 0.06 nearly as much as the 0.05. Can there be any doubt that God views the strength of evidence for or against the null as a fairly continuous function of the magnitude of p?

Yet while researchers inherently “know” p = 0.045 and 0.055 are really no different, one result is deemed “true” and publishable (positive publication bias) and one is deemed inconsequential and ignored. Or in Rosnow and Rosenthal's more dramatic prose:

It may not be an exaggeration to say that for many PhD students, for whom the 0.05 alpha has acquired an almost ontological mystique, it can mean joy, a doctoral degree, and a tenure-track position at a major university if their dissertation p is <0.05. However, if the p is >0.05, it can mean ruin, despair, and their advisor's suddenly thinking of a new control condition that should be run.

Take home message: Null Hypothesis Significance Testing investigates the probability (expressed as p-values) of finding results as extreme as the observed data, given the null hypothesis. Of note, p-values cannot address the likelihood that a hypothesis is correct. p-value thresholds represent arbitrary cutoffs rather than true/false criteria for accept/reject decisions.

4 | MULTIPLE TESTING AND RESEARCHER DEGREES OF FREEDOM

A colleague once described a conference—in the pre-PowerPoint days—in which geologists were instructed to bring an overhead transparency summarizing the record of their subfield through Earth history (tectonic events, fossil trends, geochemical proxy records, etc.) along the same x-axis temporal scale. The participants then took turns swapping different transparencies on and off the projector, looking for correlations. This strategy was innovative from a data exploration perspective—and indeed sounds intellectually stimulating—but ultimately is a nightmare regarding multiple comparisons. The multiple comparison problem essentially results from “more shots on goal.” It can be illustrated with the following example—the probability of flipping any given coin as “heads” 10 times out of 10 is very low; however, if this is done over and over, the probability that one coin will be “heads” every time obviously increases. It would be incorrect, though, to conclude that that coin is different from the rest. Specifically for NHST, the probability of a false positive in a battery of tests will be 1 − (1 − α)^k, where alpha (α) is the significance level required (e.g., p < 0.05) and k is the number of tests performed (discussed in detail in Streiner, 2015). For 10 separate tests at the standard level of significance, the probability of a false positive is 1 − (1 − 0.05)^10, or about 40%. This issue is perhaps even better illustrated by Figure 2 below (analyzing the effect of jelly bean colors).
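The family-wise error rate formula is easy to verify directly; a one-function sketch:

```python
def familywise_error_rate(alpha, k):
    """Probability of at least one false positive among k independent
    tests, each run at significance level alpha: 1 - (1 - alpha)**k."""
    return 1 - (1 - alpha) ** k

# 10 tests at alpha = 0.05 -> ~40% chance of at least one false positive;
# 20 tests (as in the jelly bean comic of Figure 2) -> ~64%.
print(round(familywise_error_rate(0.05, 10), 3))  # 0.401
print(round(familywise_error_rate(0.05, 20), 3))  # 0.642
```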

Uncorrected multiple comparisons are one of the primary causes of replicability issues across many scientific fields. In some cases, these comparisons may have been done explicitly, such as with the green jelly beans. Many reports in the 1990s and 2000s about what genes “cause” specific effects in humans were the spurious result of trawling limited genetic data across comprehensive epidemiological datasets and looking for “significant” correlations. An analog in our field may be instances where biological data of interest (e.g., microbial abundance/ecological traits/gene expression) are collected from a modern environment, such as a hot spring, alongside environmental data. Which environmental parameters are correlated with the biological metric of interest? (leaving aside the broader question of correlation and causality). It would be inappropriate to simply conduct a pairwise comparison of all the environmental parameters with the biological metric without correction for multiple tests. By chance alone, some parameters might be significantly correlated: In null datasets, p < 0.05 occurs, by definition, 5% of the time.
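A small simulation makes the point: even when a "biological metric" and a suite of environmental parameters are all pure, uncorrelated noise, pairwise testing at α = 0.05 will still flag parameters as significant by chance. The data below are simulated, not from any real hot spring, and the permutation-based p-value is an illustrative sketch:

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient, standard library only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def perm_pvalue(x, y, n_perm=1000, rng=None):
    """Permutation p-value for |r| under the null of no association."""
    rng = rng or random.Random()
    observed = abs(pearson_r(x, y))
    y = list(y)  # shuffle a copy, not the caller's data
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y)
        if abs(pearson_r(x, y)) >= observed:
            hits += 1
    return hits / n_perm

rng = random.Random(1)
n_samples, n_params = 30, 20
# A "biological metric" and 20 environmental parameters, all pure noise:
biology = [rng.gauss(0, 1) for _ in range(n_samples)]
params = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_params)]
# Count of null correlations declared "significant" at p < 0.05:
n_sig = sum(perm_pvalue(param, biology, rng=rng) < 0.05 for param in params)
print(n_sig)
```

On average about 1 in 20 of these null tests will come out "significant," exactly the 5% baseline the text describes.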

Perhaps more insidiously, multiple comparisons can be done implicitly or unconsciously. For instance, a researcher may complete a compilation of fossil data from shelly invertebrates and then half-asleep in the shower mentally wander through all the different geological data records (sea level, temperature, redox proxies, strontium isotopes, etc.) before snapping awake after noting that the identified fossil trend looks very similar to a previously published record of calcium isotopes. A single statistical comparison is made, and voila: p < 0.05. Perhaps something in the calcium cycle is affecting the livelihood of these shell-forming organisms? Maybe. In this case, the researcher did not explicitly test each comparison with a p-value, as in the jelly bean or hot springs example, but the researcher still mentally conducted the equivalent of swapping out overhead transparencies: multiple comparisons until a match was found. A related problem is exploratory data analysis as data are being generated (the role of exploratory data analysis is discussed further below). A range of analyses might be conducted, with one predictor variable out of many yielding a significant correlation. When the full dataset is generated, “only one” explicit test is conducted on the final dataset and included in the publication, with the researcher honestly forgetting just how many analyses had been conducted. Or, a spuriously significant p-value is found early on, say for all available marine samples, which then disappears as more data are added. A second analysis, looking at individual ocean basins, and a third, looking by depth class, are conducted, perhaps excluding some extreme outliers, until another significant p-value reappears [so-called p-hacking, or “researcher degrees of freedom”; Simmons, Nelson, and Simonsohn (2011)]. It is then this sub-group analysis that is emphasized in the manuscript. More often than not, it is an unwitting error by a scientist excitedly analyzing their data. Remember Feynman's quote that “the first principle is that you must not fool yourself—and you are the easiest person to fool.” This is not to discourage data exploration, but correctly accounting for these comparisons—or at least remaining cognizant of the issue—will ultimately be key to producing lasting insights.

4.1 | Avoiding multiple comparison pitfalls

How then should one account for multiple comparisons? The best approach is early, explicit planning. The gold standard, as practiced in clinical trials, is pre-registration (for instance, on ClinicalTrials.gov run by the U.S. National Library of Medicine). In this strategy, the researcher publishes a white paper prior to starting the experiments, explicitly detailing how the data will be collected (including the stopping point), and the number and types of statistical analyses to be conducted. They are then held to this plan, or the trial is not considered valid. Such a strategy is often considered an unrealistic ideal outside of clinical trials, but notably an effort to publicly post methodologies a priori has recently been initiated for molecular phylogenetics (Phylotocol; DeBiasse & Ryan, 2018), a field particularly susceptible to “researcher degrees of freedom.” Whether or not such registries are appropriate for geobiology in the long term is a subject for debate, and certainly the approach ignores the fact that much of our science (especially field science) is truly “discovery driven” rather than “hypothesis driven.” Nonetheless, increased pre-experiment effort put into planning statistical analyses will be effective in reducing multiple comparison “creep.” This may be particularly useful to bring up during project planning with early-career researchers, as essentially all aspects of experimental design are being taught at that point. Discussions of study-level reproducibility should also start making it into undergraduate and graduate geobiology curricula, as for instance is occurring in some psychology programs (Button, 2018).

Moving from pre-experiment awareness and planning to post-experiment data analysis, there are several techniques to account for multiple comparisons (“multiple testing correction” or “alpha inflation correction”). Certain Bayesian approaches may not require explicit correction (see Gelman, Hill, & Yajima, 2012), but in a frequentist context a common approach is to apply a direct correction that accounts for the increased family-wise error rate. The Bonferroni correction, for instance, divides the level of significance required (e.g., α = 0.05) by the number of independent analyses conducted. So, a researcher investigating how brachiopod body size changed between four different stratigraphic formations would require p < 0.008 (an overall α = 0.05 divided by 6 independent pairwise tests = 0.008) in order to achieve significance. Another common practice for such an analysis (multiple pairwise comparisons) is to conduct an omnibus test such as ANOVA, followed by a post hoc test such as the Tukey HSD test that directly accounts for increased family-wise error.
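The Bonferroni arithmetic for the hypothetical four-formation brachiopod study can be sketched in a few lines (formation names are placeholders):

```python
from itertools import combinations

# Four hypothetical stratigraphic formations:
formations = ["Fm A", "Fm B", "Fm C", "Fm D"]
pairs = list(combinations(formations, 2))
n_tests = len(pairs)                       # 6 pairwise comparisons
alpha_corrected = 0.05 / n_tests           # Bonferroni-adjusted threshold
print(n_tests, round(alpha_corrected, 4))  # 6 0.0083
```

Each of the six pairwise tests would then be judged against the corrected threshold (~0.008) rather than 0.05.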

This general practice of alpha inflation correction has receivedsomecriticism, as the thresholds for significancecanbeoverly con-servative, leading tomore false negatives and potentially restrictingthepathoffuturescientificcuriosity(Moran,2003;Rothman,1990).Suchcriticismscommonlynotethatstudiesintheirfieldsoftenhave“asmallnumberofreplicates,highvariability,and(subsequently) lowstatisticalpower” (Moran,2003).Dr.Turekianmightwell relate!Theargumentputforthbysuchpapersisthattheincreasedincidenceinfalsepositivesisnotreallyanissue,becauseotherresearcherswillre-peattheexperiments,beunabletoreplicatethem,andthefalseclaimswilleventuallydisappearfromtheliterature.However,carefulstudyofreplicationattempts inpsychiatryandpsychologyhasdemonstratedthat(a)uncorrectedmultiplecomparisontestingdoesempiricallyleadtoamorassofpublishedfalsepositives(Duncan&Keller,2011),and(b)theoriginalhypothesesdonotsimplydisappearfromtheliteraturebuthaveincrediblepersistence(Ioannidis,2012).Thecausesofsuchper-sistencearevariedbutbasicallyboildowntolowincentivesforjournalsorauthorstopublishnegativereplications(Ioannidis,2012;Simmonsetal.,2011).Whiletheseargumentsonthestringencyofmultipletest-ingcorrectionsarenotmeritless (seenextparagraph), inouropinionwidespreadTypeIerrors(falsepositives)areafargreaterhindrancetotheadvancementofsciencethanTypeIIerrors(falsenegatives).

That said, determining the correct balance between the likelihood of false negatives and false positives, as well as encouraging cutting-edge methods development that must start with small datasets (Turekian's point), is obviously a complicated endeavor. Several other methods, such as the Holm and Hochberg methods (reviewed by Streiner, 2015), offer protection against alpha inflation but are not as conservative as a strict Bonferroni correction. Such corrections should also follow common sense, as they can degenerate into the absurd (García, 2004; Streiner, 2015). For instance, how many truly independent comparisons are really being conducted? Oceanographic factors such as temperature, oxygen, pH, light, and pressure can all be broadly correlated with depth. Comparing all of these against microbial abundances might result in multiple significant correlations that become (inappropriately) non-significant after correction for multiple comparisons. Another issue to consider is whether there may be pre-existing hypotheses that motivated the study. For the brachiopodologist studying body size across four formations, they may be testing previous hypotheses of body size evolution during a specific time period (e.g., Heim, Knope, Schaal, Wang, & Payne, 2015). The real test may be between the stratigraphically lowest and highest formations, and it may be unduly stringent to require a very low p-value resulting from the multiple pairwise comparisons.

FIGURE 2 Figure modified from the webcomic XKCD.com. Multiple statistical comparisons increase the probability of a false-positive result (“more shots on goal”). Specifically, with twenty independent comparisons as shown in the comic, the probability of a false positive is ~65% (Streiner, 2015)
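The Holm step-down procedure mentioned above is straightforward to implement; this is a minimal sketch with made-up p-values (not from any real comparison):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down correction: compare the i-th smallest p-value
    against alpha / (k - i) and stop at the first failure.
    Returns a list of booleans (reject H0?) in the original order."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (k - rank):
            reject[idx] = True
        else:
            break  # all larger p-values automatically fail too
    return reject

# Hypothetical p-values from six pairwise comparisons:
pvals = [0.001, 0.012, 0.041, 0.009, 0.200, 0.030]
print(holm_reject(pvals))
```

With these inputs Holm rejects three of the six null hypotheses, whereas a strict Bonferroni cutoff (0.05/6 ≈ 0.008) would reject only one, illustrating why Holm is described as less conservative while still controlling the family-wise error rate.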

This difference has been discussed by Streiner (2015) as the difference between hypothesis testing (which may not require correction) and hypothesis generating (which should be reported as tentative and/or exploratory results). Or in plainer language, exploratory data analysis can be good, and explicit hypothesis testing can be good, but the approach being used should be clear. Indeed, there are many powerful techniques (including new machine learning techniques) to understand which of multiple predictor variables might best explain variance in the response variable of interest (a situation we often face in geobiology). The results of such analyses, though, cannot be turned around and presented as an a priori hypothesis complete with a significant p-value. Dr. Brian McGill's blog post provides a cogent "defense" of exploratory data analysis (https://dynamicecology.wordpress.com/2013/10/16/in-praise-of-exploratory-statistics). He closes, "I use exploratory statistics and I'm proud of it! And if I claim something was a hypothesis it really was an a priori hypothesis. You can trust me because I am out and proud about using exploratory statistics."

The key thread running through these statistical commentaries regarding multiple comparisons is conscientiousness—personally with respect to how many analyses have been conducted, but also with respect to how the full scope of statistical procedures is described in the paper. Simmons et al. (2011) provide excellent guidelines in this regard for both authors and reviewers. Notably, while promoting strict reporting guidelines, these authors also advocate for increased tolerance of statistical "imperfection" by reviewers and editors in well-documented papers: "one reason researchers exploit research degrees of freedom is the unreasonable expectation we often impose as reviewers for every data pattern to be (significantly) as predicted. Underpowered studies with perfect results are the ones that should invite extra scrutiny."

Finally, at the end of this discussion, it is worthwhile to ask yourself the basic question of whether multiple testing is an issue in your particular research area. If you are working solely with a single proxy record, or an isolated genetic system, the answer might be no. The issue arises if you want to understand what correlates with your data—if there is a large constellation of possible correlates, there is an equally large likelihood of spurious correlations. Learning from the abysmal record of replication in other fields, caution against alpha inflation in these cases will be a cornerstone to a robust and healthy field of geobiology.

Take home message: Vigilance against alpha inflation starts with the individual researcher, ideally during the pre-experimental design phase. Data exploration is encouraged, but trawling through data for significant correlations that are then presented as an a priori hypothesis must be avoided: Papers should explicitly state if they are to be viewed as exploratory. The full sweep of data collected, statistical tests conducted, and choices of data inclusion/exclusion (or other "degrees of freedom") by a researcher should be made clear in publication in order to avoid cherry-picking (Simmons et al., 2011).

5 | EFFECT SIZE

So let us say you have found a significant difference between two groups of data, and through attentive practice and analysis, you have determined it is not a chance result based on numerous "shots on goal." The question now is—does the result matter? This last question seems silly, but a mistaken focus on significant p-values as the be-all/end-all, instead of on effect size, has hampered progress in fields such as ecology (Fidler et al., 2004). Simply, effect size is a quantitative measure of the magnitude of a phenomenon. Statistical power is the likelihood that an analysis will detect a real effect (as determined by a significant p-value), and is governed by the size of the effect, the variation present in the groups, the number of samples in the analysis, and the threshold for significance (α). Thus, two groups with widely separated means (large effect size) and little within-group variation will require relatively few samples for a well-powered study.
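Effect size can be made concrete with a standard measure such as Cohen's d, which scales the difference in group means by the pooled standard deviation. The numbers below are hypothetical, loosely echoing the group summaries in Figure 3:

```python
import math

# A sketch of Cohen's d: a standardized effect size for two groups.
# d ~ 0.2 is conventionally "small", ~0.5 "medium", ~0.8 "large".
def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    )
    return (mean2 - mean1) / pooled_sd

# Widely separated means, large within-group variation (cf. Figure 3a)
d_large = cohens_d(4.3, 2.3, 10, 6.4, 2.8, 10)
# Nearly identical means, small within-group variation (cf. Figure 3c)
d_small = cohens_d(3.9, 0.2, 100, 4.0, 0.3, 100)
```

Note that d says nothing about significance: the larger effect above comes from the panel whose p-value is non-significant, precisely because the sample is small.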

The flip side of this—and why p-values must be regarded as the statistical likelihood a result that extreme would be found by chance, rather than how "important" a result is—is that given enough samples, literally any effect size, no matter how small, can be detected. This has received considerable attention in the statistical literature:

[The null hypothesis] can only be true in the bowels of a computer processor running a Monte Carlo study (and even then a stray electron may make it false). If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what's the big deal in rejecting it?

(Cohen, 1990)
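Cohen's point is easy to demonstrate by simulation. Here two groups differ by a true effect of only 2% of a standard deviation (think of a very slightly weighted coin), yet with enough samples the test confidently rejects the null. The numbers are illustrative, and a normal approximation stands in for the full t test, which is reasonable at this sample size:

```python
import math
import random

random.seed(42)

def two_sample_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (mb - ma) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2))

n = 200_000  # a very large sample per group
group_a = [random.gauss(0.00, 1.0) for _ in range(n)]
group_b = [random.gauss(0.02, 1.0) for _ in range(n)]  # tiny but nonzero effect
p = two_sample_p(group_a, group_b)  # significant despite a trivial effect
```

With n = 10 per group, the same tiny effect would essentially never be detected; the "significance" here is purely a function of sample size, not importance.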

At this point, you may be asking, "since what we are really interested in is effect size, and given Cohen's statement that the null hypothesis is always false, do we even need to run statistical tests?" Yes! Without a significant result, there is no reason to believe the observed results may not be due to sampling variation. In other words, significance is the jumping off point, the license to start talking about effect size from a position of confidence. For further reading, the relationship between sample size, p-values, and accuracy in inferring effect size is intelligently dissected by Halsey, Curran-Everett, Vowler, and Drummond (2015).

What constitutes an important effect will vary by field. Returning to the coin toss example, given millions or billions or trillions of flips—whatever the required sample size may be—the differing weights of the coin sides will ultimately cause one side to land up significantly more times than the other. Yet, this does not matter when two friends are choosing a restaurant: It is still ~50:50, and the minuscule effect size is irrelevant in this instance. This is the difference between a significant effect and an important effect. Nonetheless, this fallacy is often committed in the literature:

All psychologists know that statistically significant does not mean plain-English significant, but if one reads the literature, one often discovers that a finding reported in the Results section studded with asterisks implicitly becomes in the Discussion section highly significant or very highly significant, important, big!

(Cohen, 1994)

Sometimes, though, small effect sizes really do matter. As an example, human geneticists have learned that the majority of phenotypes are not controlled by a single gene/locus (such as the case for Huntington's disease). Rather, characteristics like height, and the risk of diseases such as heart disease, Type II diabetes, and depression are caused by the (largely) additive effects of thousands of genetic loci. In the case of psychiatric disorders, each individual locus explains far less than 1% of variance in risk for a psychiatric disorder (i.e., small individual effects) and yet total genetic contributions explain 40%–80% of the variance in risk for these disorders (large summed effect) (Duncan et al., 2017; Ripke et al., 2014; Wray et al., 2018). Thus, as our questions move toward datasets with hundreds or thousands of measurements, the focus must be on geobiologically important effect sizes (which will vary by question) and confidence intervals rather than simply statistical significance.

6 | INTERPRETATIVE EXAMPLES

The relationship between p-values (significance), effect size, and sample size is illustrated in Figure 3. Note there is no relationship implied between the left and right panels; they are simply illustrative examples of these concepts. In Panel A, a researcher has identified relationships that appear interesting to pursue, with the Group B mean ~50% higher than Group A in the t test example (left side). On the right side, the predictor variable accounts for 27% of the variation (R² value) in the regression example. But the results are not statistically significant. Especially in light of recent claims that p < 0.05 is too lax a standard (Benjamin et al., 2018), there is a strong likelihood that this result—specifically the observed size and direction of the effect—is due to sampling effects (Halsey et al., 2015). If this were your hard-won data, it is important to avoid the temptation of knowingly or unknowingly manipulating the data (p-hacking) to achieve a "significant result." For instance, simply removing the lowest data point in Group B results in p = 0.03, significant! Maybe there is an eminently logical reason to exclude that data point—perhaps it is from a different and inappropriate lithology, or the sampling methodology was actually different. The goal though is to avoid inventing post hoc justifications for data manipulation, as it is so easy to fool yourself, particularly if removing that point provides support for long-held ideas (confirmation bias). If such data exclusions are made, they should be noted clearly in the manuscript, and both sets of analyses, with a reasoned explanation, should be included (Simmons et al., 2011). Moving toward shared transparency within scientific subfields for data collection, reporting, and presentation methods will be instrumental in helping researchers present reasonable results with less threat of conscious/unconscious p-hacking.

Panel B depicts essentially the same analysis as in Panel A, but with more samples. In this case, the result is significant, and the researcher can feel more confident describing the size and direction of the effect. Note that if Panel B were an extension of the study in Panel A, the best practice (sometimes difficult to achieve but nonetheless the best practice) is actually to establish a pre-determined stopping point for data collection (Simmons et al., 2011). Continuing to add bits of data, with the analysis rerun each time, effectively represents multiple tests.
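How costly is the "drop one inconvenient point" habit? A Monte Carlo sketch makes it concrete. Both groups below are drawn from the same distribution, so every rejection is a false positive; the critical t values are hardcoded approximations, and all numbers are illustrative:

```python
import math
import random

random.seed(0)

def t_stat(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (mb - ma) / math.sqrt(sp2 * (1 / na + 1 / nb))

CRIT_18, CRIT_17 = 2.101, 2.110  # two-sided 5% critical t for df = 18 and 17

trials, honest_hits, hacked_hits = 4000, 0, 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(10)]
    b = [random.gauss(0, 1) for _ in range(10)]
    if abs(t_stat(a, b)) > CRIT_18:
        honest_hits += 1
        hacked_hits += 1
    else:
        # "Rescue" the result: drop the point in B that most opposes the
        # observed direction of the difference, then re-test.
        b2 = sorted(b)[1:] if sum(b) / 10 > sum(a) / 10 else sorted(b)[:-1]
        if abs(t_stat(a, b2)) > CRIT_17:
            hacked_hits += 1

honest_rate = honest_hits / trials  # close to the nominal 5%
hacked_rate = hacked_hits / trials  # noticeably inflated above 5%
```

The honest procedure yields false positives near the nominal 5%; the two-step procedure inflates that rate substantially, even though each individual test looks perfectly legitimate in the final manuscript.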

Panel C depicts how, with larger sample sizes, the power to identify small but statistically significant effects also increases. In the comparison of groups A and B, the means only differ by ~3%, but the result is highly significant (p = 0.007). In the regression analysis, the predictor variable only describes 4% of the variation in the response variable, but nonetheless, the result is significant by traditional measures (p = 0.047). Looking at the scatter in this plot is instructive, as it appears as a cloud of points with no correlation, but statistically, a small correlation does exist. In other words, it visually illustrates the quotes from Cohen: Effect size, not significance alone, is the end goal. As also previously discussed, the ultimate interpretation of importance is question- and field-specific. To reiterate the point, robust increases in crop yields of 4% may feed millions, whereas a change of 4% in modern marine sulfate levels (e.g., ~1 mM) may have little relevance to current research questions regarding sulfur biogeochemistry.

7 | BEST PRACTICES MOVING FORWARD

We start this "best practices" section from a humble position, as we are by no means trained statisticians but rather enthusiastic advocates for increased statistical rigor in geobiology. We have made (or will make) technical and logical errors in our published papers and almost certainly have made unconscious "researcher degrees of freedom" decisions that impacted results (Simmons et al., 2011). Even Jacob Cohen, whose thoughtful papers we have cited numerous times, was called out for incorrect statistical logic (Oakes, 1986). Learning correct statistical practice and logic is thus a career-long endeavor for most scientists.

As a first step, we suggest that both reviewers and authors insist on some form of statistical analysis as a best practice approach. The critical concept here is that your particular set of measurements is not a static truth about a group. Rather, they are values drawn from a distribution, and repeated draws may yield very different results—especially at low sample sizes (Halsey et al., 2015). We do recognize that many geobiological studies will not lend themselves to formal statistical tests and that there is a healthy tradition of discovery-based geobiological science—especially in field studies—that should not be stifled. Nonetheless, if a paper is reporting observed differences between groups or correlations between variables, it is scientific due-diligence to investigate the degree to which these differences would be expected given the null hypothesis (or more broadly, the uncertainty of the result).

FIGURE 3 The relationship between significance, effect size, and sample size shown with hypothetical data points. This is depicted for measurements sampled from two different groups (left side) and correlation between a predictor variable and response variable (right side); note there is no strict relationship between the data in the two panels. (a) Small sample sizes may reveal a large effect, but the result is not significant and should be considered an intriguing finding rather than a confident conclusion. (b) With increased sampling, a researcher may (or may not) reveal a significant effect. The sign (direction) of this effect is likely correct, while increased sampling will lead to better accuracy of the effect's magnitude. (c) Given enough sampling, even the smallest effect size will become statistically significant. The right side plot in c is visually informative in this regard—what appears to be a cloud of points is, statistically speaking, correlated. For illustrative purposes, we have described effects here as "large" and "small," but as discussed in the text, this distinction will be field- and question-specific

[Figure 3 panel summaries: (a) Non-significant result, possibly(?) large effect size (10 samples): Group A (4.3 ± 2.3*), Group B (6.4 ± 2.8*), p = 0.11; regression p = 0.12, R² = 0.27*. (b) Significant result, large effect size (25 samples): Group A (4.2 ± 2.1), Group B (6.6 ± 2.0), p = 0.0001; regression p < 0.0001, R² = 0.53. (c) Significant result, small effect size (100 samples): Group A (3.9 ± 0.2), Group B (4.0 ± 0.3), p = 0.007; regression p = 0.047, R² = 0.04. *Effect sizes should not be conclusively interpreted in the absence of significant p-values.]

In this respect, we note that we have focused primarily here on formal NHST, but there are certainly other strategies such as model-based approaches, information theoretic approaches, and/or bootstrapping that achieve the same general goal of understanding true relationships (as distinguished from spurious results that are simply due to sampling variability). For instance, maximum likelihood and Bayesian approaches to assess the impact of random error are the common practice in molecular phylogenetics. The debate about whether formal NHST should be retained and improved or done away with is decades old, rages still, and is not something we can adequately address here (Benjamin et al., 2018; Cohen, 1994; Falk & Greenbaum, 1995; Fidler et al., 2004; Halsey et al., 2015). And as Cohen (1994) noted, there is no magic alternative to NHST. Certainly, though, all approaches discussed above are more likely to lead to correct inference than visual inspection of data.
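Of these alternatives, bootstrapping is the simplest to sketch: resample the observed data with replacement and let the spread of the resampled statistic stand in for its sampling uncertainty, with no distributional assumptions. The data values below are invented for illustration:

```python
import random

random.seed(1)

data = [4.1, 5.6, 3.2, 6.8, 4.9, 5.5, 2.9, 6.1, 5.0, 4.4]  # hypothetical measurements

def bootstrap_ci(values, n_boot=10_000, level=0.95):
    """Percentile bootstrap confidence interval for the mean."""
    means = sorted(
        sum(random.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot)]
    return lo, hi

lo, hi = bootstrap_ci(data)
```

Reporting the interval (lo, hi) alongside the mean communicates both the effect and its uncertainty, which is the same general goal a p-value serves but in a form that keeps the focus on magnitude.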

As statistical tests are increasingly implemented in geobiology, the next step is to learn from the mistakes of other fields and help each other not fall foul of the fallacies described in this essay. Specifically, the points to avoid are:

1. Not correctly accounting for multiple testing.
2. Considering the p-value to be the probability that the hypothesis is correct.
3. Considering p = 0.045 to be a dramatically stronger rejection of the null hypothesis than p = 0.055 ("mechanical dichotomous decisions").
4. Not explicitly reporting "researcher degrees of freedom" (Simmons et al., 2011).
5. Considering every "significant" result to be an "important" result. Rather, significance should typically be the requirement for positing a scientific effect, which must then be put into appropriate context.

Of these, multiple testing and "researcher degrees of freedom" are likely the most problematic with respect to reproducibility, especially as they can often be done unconsciously. Explicit planning at the start of a study is crucial in this regard. Also important at the planning stage is power analysis. Ioannidis (2005) notes that in terms of achieving long-lasting scientific insights, fewer well-powered studies are vastly preferable to many low-powered studies, and certainly the field benefits from not chasing false leads. Further, as demonstrated by Halsey et al. (2015), larger sample sizes and well-powered studies also more precisely estimate the effect size, which is after all what we are interested in. The challenge here is an obvious conflict between incentive structures for the field as a whole and individual researchers, specifically early-career researchers. Such considerations should ideally play into evolving discussions on how post-docs, faculty positions, and tenure are evaluated. Fortunately, in geobiology we are often addressing first-order questions with large effects, and the required increase in sample size to achieve a well-powered study is often not that large (Sterne & Smith, 2001).
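Power analysis at the planning stage need not involve lookup tables; a short simulation answers the practical question directly. Here the assumed true difference (1.0) and measurement scatter (SD = 2.0) are invented placeholders to be swapped for values realistic for your proxy:

```python
import math
import random

random.seed(7)

def estimated_power(n, true_diff=1.0, sd=2.0, crit=2.0, trials=2000):
    """Fraction of simulated experiments (n samples/group) reaching p < ~0.05."""
    hits = 0
    for _ in range(trials):
        a = [random.gauss(0.0, sd) for _ in range(n)]
        b = [random.gauss(true_diff, sd) for _ in range(n)]
        ma, mb = sum(a) / n, sum(b) / n
        va = sum((x - ma) ** 2 for x in a) / (n - 1)
        vb = sum((x - mb) ** 2 for x in b) / (n - 1)
        t = (mb - ma) / math.sqrt((va + vb) / n)
        if abs(t) > crit:  # ~5% two-sided threshold for moderate n
            hits += 1
    return hits / trials

power_10 = estimated_power(10)  # badly underpowered at this effect size
power_50 = estimated_power(50)  # far better odds of detecting the effect
```

Running this before collecting data tells you whether the planned sample size gives a realistic chance of detecting the effect you hypothesize, or whether the study is set up to chase noise.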

As one final note regarding reproducibility, correctly documenting original data and metadata in accessible supplementary documents or data repositories is key to allowing researchers to test results and build on previous studies in meta- and mega-analyses (see for instance Ioannidis et al., 2009). Original code used for analyses must also be adequately curated in a public repository; this step is common in biological fields such as ecology (Cooper et al., 2017; Ram, 2013), but is not yet common across geobiology.

To avoid feeding a publication bias monster, and to encourage new developments in a manner Karl Turekian would be proud of, we do not advocate a strict requirement of significance for publication. In our opinion, the results in Figure 3, Panel A, if from an emerging proxy record, would be quite suggestive and should be considered for publication, but the reader should be notified how likely the data are (given the null hypothesis). Power analysis is also helpful in this regard (Cohen, 1992). As an example, an influential genetics paper was published in Nature, despite having null results for the primary hypothesis (The International Schizophrenia Consortium, 2009). This paper provided strong evidence that significant results would be detectable with larger sample sizes in the near future, and it was the application of a recently developed statistical technique (polygenic risk scoring) that made it worthy of publication in Nature.

Moving to the longer-term view, the "best practices" described above are hopefully useful, but the goal for the next generation of geobiologists should not be best practices lists taped to the side of cubicles. Put simply, even the most well-intentioned of "best practices" lists can lead to a "cookbook" view of data analysis, where there is a right and a wrong way to do things, and statistics are a computer button to push after data acquisition. Rather, the goal should be a situation where, for many, computational reasoning and data science are a natural, integrated part of our science alongside field and laboratory skills. The main need for this is simply that many of our scientific questions do not readily conform to classic statistical tests. As an example, Keller and Schoene (2012) investigated how igneous geochemistry has changed through Earth history, and recognized that these rocks are not evenly sampled in space and time—some plutons are heavily sampled, while others were sampled rarely or not at all. In other words, samples are not independent. To address this issue of sampling heterogeneity, they utilized a re-weighted bootstrapping approach, with bootstrap weights for a given sample related to the spatial and temporal proximity to other samples. Paleontologists have also addressed the same issue of sampling heterogeneity, but using different methods appropriate to the data archive of that field (e.g., Alroy, 2010). Neither of these solutions came from a statistics "cookbook," and achieving the flexibility to design the most appropriate test (be it frequentist, likelihood, or Bayesian), or to perform numerical experiments testing different scenarios, will require a foundation in statistics but also, importantly, computational thinking (Weintrop et al., 2016). Geobiology has had considerable success in breaking field boundaries and educating students who are as comfortable with a rock hammer as a pipette; integrating a computational and statistical perspective into this training will be the next step to drawing robust insights from ever-larger geobiological datasets.
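The spirit of such a re-weighted bootstrap can be sketched schematically. This is emphatically not the Keller and Schoene (2012) implementation—their weighting uses full spatiotemporal proximity—but a toy version in which a heavily sampled age interval is down-weighted so it cannot dominate the resampled mean. Ages and values are invented:

```python
import random

random.seed(3)

# Five samples cluster near 2.5 Ga; two other intervals have one sample each.
ages = [2.50, 2.51, 2.52, 2.53, 2.54, 1.00, 0.10]    # Ga (invented)
values = [10.0, 10.2, 9.8, 10.1, 9.9, 5.0, 1.0]      # hypothetical geochemical values

# Weight each sample inversely to how many samples fall within 0.1 Gyr of it,
# so the oversampled cluster counts roughly as one observation.
weights = [
    1.0 / sum(1 for other in ages if abs(other - age) <= 0.1)
    for age in ages
]

def weighted_bootstrap_mean(vals, wts, n_boot=5000):
    """Mean of weighted bootstrap resamples."""
    means = [
        sum(random.choices(vals, weights=wts, k=len(vals))) / len(vals)
        for _ in range(n_boot)
    ]
    return sum(means) / n_boot

naive_mean = sum(values) / len(values)                # dominated by the 2.5 Ga cluster
reweighted = weighted_bootstrap_mean(values, weights) # treats intervals more evenly
```

The naive mean sits near the oversampled cluster's value, while the re-weighted estimate moves toward an even treatment of the three sampled intervals—exactly the kind of flexible, question-specific test design that no cookbook provides.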


In closing, we do not expect a statistical revolution in geobiology overnight. The common implementation of statistical analyses in fields like ecology took decades. In fact, Gosset ("Student") wrote to Fisher regarding the t test that "I am sending you a copy of Student's Tables as you are the only [person] that's ever likely to use them!" (cited in Box, 1981). We hope this Perspective helps start a dialogue regarding statistical practice in geobiology, while also recognizing it is heavily colored by our perspective—we look forward to seeing commentary from other perspectives (different subfields of geobiology, Bayesianists, etc.). Ultimately, this will help us avoid the problems with reproducibility present in other fields, be more confident in our results, and as a field move more quickly toward deeper geobiological understanding.

ACKNOWLEDGMENTS

We thank David Johnston, Jon Payne, Matt Clapham, and an anonymous reviewer for comments on a previous version of this manuscript, and James Farquhar, Anne Dekas, Joe Ryan, David Evans, and Alex Bradley for helpful discussion. We thank Randall Munroe of XKCD.com for permission to reproduce Figure 2. EAS and SAT were funded by a Sloan Research Fellowship.

CONFLICT OF INTEREST

The authors declare no conflicts of interest.

ORCID

Erik A. Sperling https://orcid.org/0000-0001-9590-371X

Erik A. Sperling¹

Sabrina Tecklenburg¹

Laramie E. Duncan²

¹Department of Geological Sciences, Stanford University, Stanford, California

²Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, California

Correspondence
Erik A. Sperling, Department of Geological Sciences, Stanford University, Stanford, CA.
Email: [email protected]

REFERENCES

Alroy, J. (2010). The shifting balance of diversity among major marine animal groups. Science, 329, 1191–1194. https://doi.org/10.1126/science.1189910

Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature News, 533, 452. https://doi.org/10.1038/533452a

Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483, 531–533. https://doi.org/10.1038/483531a

Begley, C. G., & Ioannidis, J. P. A. (2015). Reproducibility in science: Improving the standard for basic and preclinical research. Circulation Research, 116, 116–126. https://doi.org/10.1161/CIRCRESAHA.114.303819

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-017-0189-z

Bouter, L. M., Tijdink, J., Axelsen, N., Martinson, B. C., & ter Riet, G. (2016). Ranking major and minor research misbehaviors: Results from a survey among participants of four World Conferences on Research Integrity. Research Integrity and Peer Review, 1, 17. https://doi.org/10.1186/s41073-016-0024-5

Box, J. F. (1981). Gosset, Fisher, and the t distribution. The American Statistician, 35, 61–66.

Button, K. (2018). Reboot undergraduate courses for reproducibility. Nature, 561, 287. https://doi.org/10.1038/d41586-018-06692-8

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376. https://doi.org/10.1038/nrn3475

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312. https://doi.org/10.1037/0003-066X.45.12.1304

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. https://doi.org/10.1037/0033-2909.112.1.155

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003. https://doi.org/10.1037/0003-066X.49.12.997

Cooper, N., Hsing, P.-Y., Croucher, M., Graham, L., James, T., Krystalli, A., & Primeau, F. (2017). A guide to reproducible code in ecology and evolution. BES Guides to Better Science. London, UK: British Ecological Society.

DeBiasse, M. B., & Ryan, J. F. (2018). Phylotocol: Promoting transparency and overcoming bias in phylogenetics. Systematic Biology, syy090. https://doi.org/10.1093/sysbio/syy090

Duncan, L. E., & Keller, M. C. (2011). A critical review of the first 10 years of candidate gene-by-environment interaction research in psychiatry. The American Journal of Psychiatry, 168, 1041–1049. https://doi.org/10.1176/appi.ajp.2011.11020191

Duncan, L., Yilmaz, Z., Gaspar, H., Walters, R., Goldstein, J., Anttila, V., … Bulik, C. M. (2017). Significant locus and metabolic genetic correlations revealed in genome-wide association study of anorexia nervosa. American Journal of Psychiatry, 174, 850–858. https://doi.org/10.1176/appi.ajp.2017.16121402

Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory & Psychology, 5, 75–98. https://doi.org/10.1177/0959354395051004

Fanelli, D. (2009). How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS ONE, 4, e5738. https://doi.org/10.1371/journal.pone.0005738

Fidler, F., Burgman, M. A., Cumming, G., Buttrose, R., & Thomason, N. (2006). Impact of criticism of null-hypothesis significance testing on statistical reporting practices in conservation biology. Conservation Biology, 20, 1539–1544. https://doi.org/10.1111/j.1523-1739.2006.00525.x

Fidler, F., Cumming, G., Burgman, M., & Thomason, N. (2004). Statistical reform in medicine, psychology and ecology. The Journal of Socio-Economics, 33, 615–630. https://doi.org/10.1016/j.socec.2004.09.035

García, L. V. (2004). Escaping the Bonferroni iron claw in ecological studies. Oikos, 105, 657–663. https://doi.org/10.1111/j.0030-1299.2004.13046.x

Gelman, A., Hill, J., & Yajima, M. (2012). Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5, 189–211. https://doi.org/10.1080/19345747.2011.618213

Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on "Estimating the reproducibility of psychological science". Science, 351, 1037. https://doi.org/10.1126/science.aad7243


Halsey, L. G., Curran-Everett, D., Vowler, S. L., & Drummond, G. B. (2015). The fickle P value generates irreproducible results. Nature Methods, 12(3), 179–185. https://doi.org/10.1038/nmeth.3288

Heim, N. A., Knope, M. L., Schaal, E. K., Wang, S. C., & Payne, J. L. (2015). Cope's rule in the evolution of marine animals. Science, 347, 867–870. https://doi.org/10.1126/science.1260065

Hines, W. C., Su, Y., Kuhn, I., Polyak, K., & Bissell, M. J. (2014). Sorting out the FACS: A devil in the details. Cell Reports, 6, 779–781. https://doi.org/10.1016/j.celrep.2014.02.021

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2, e124. https://doi.org/10.1371/journal.pmed.0020124

Ioannidis, J. P. A. (2012). Why science is not necessarily self-correcting. Perspectives on Psychological Science, 7, 645–654. https://doi.org/10.1177/1745691612464056

Ioannidis, J. P. A., Allison, D. B., Ball, C. A., Coulibaly, I., Cui, X., Culhane, A. C., … van Noort, V. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics, 41, 149–155. https://doi.org/10.1038/ng.295

Keller, C. B., & Schoene, B. (2012). Statistical geochemistry reveals disruption in secular lithospheric evolution about 2.5 Gyr ago. Nature, 485, 490–493. https://doi.org/10.1038/nature11024

Lithgow, G. J., Driscoll, M., & Phillips, P. (2017). A long journey to reproducible results. Nature News, 548, 387. https://doi.org/10.1038/548387a

Moran, M. D. (2003). Arguments for rejecting the sequential Bonferroni in ecological studies. Oikos, 100, 403–405. https://doi.org/10.1034/j.1600-0706.2003.12010.x

Nuzzo, R. (2014). Scientific method: Statistical errors. Nature News, 506, 150. https://doi.org/10.1038/506150a

Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. Chichester, UK: Wiley.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716.

Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10, 712. https://doi.org/10.1038/nrd3439-c1

Ram, K. (2013). Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8, 7. https://doi.org/10.1186/1751-0473-8-7

Ripke, S., Corvin, A., Walters, J. T. R., Farh, K.-H., Holmans, P. A., Lee, P., … O'Donovan, M. C. (2014). Biological insights from 108 schizophrenia-associated genetic loci. Nature, 511, 421–427.

Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276–1284. https://doi.org/10.1037/0003-066X.44.10.1276

Rothman, K. J. (1990). No adjustments are needed for multiple comparisons. Epidemiology, 1, 43–46. https://doi.org/10.1097/00001648-199001000-00010

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. https://doi.org/10.1177/0956797611417632

Sterne, J. A. C., & Smith, G. D. (2001). Sifting the evidence—what's wrong with significance tests? British Medical Journal, 322, 226–231. https://doi.org/10.1136/bmj.322.7280.226

Streiner, D. L. (2015). Best (but oft-forgotten) practices: The multiple problems of multiplicity—whether and how to correct for many statistical tests. The American Journal of Clinical Nutrition, 102, 721–728. https://doi.org/10.3945/ajcn.115.113548

Student (1908). The probable error of a mean. Biometrika, 6, 1–25.

The International Schizophrenia Consortium (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460, 748–752.

Thiemens, M. H., Davis, A. M., Grossman, L., & Colman, A. S. (2013). Turekian reflections. Proceedings of the National Academy of Sciences of the United States of America, 110, 16289–16290. https://doi.org/10.1073/pnas.1315804110

Weintrop, D., Beheshti, E., Horn, M., Orton, K., Jona, K., Trouille, L., & Wilensky, U. (2016). Defining computational thinking for mathematics and science classrooms. Journal of Science Education and Technology, 25, 127–147. https://doi.org/10.1007/s10956-015-9581-5

Wray, N. R., Ripke, S., Mattheisen, M., Trzaskowski, M., Byrne, E. M., Abdellaoui, A., … Sullivan, P. F. (2018). Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nature Genetics, 50, 668–681. https://doi.org/10.1038/s41588-018-0090-3