Transcript of “The crumbling wall: data archiving and reproducibility in published science”
Tim Vines, University of British Columbia
The crumbling wall: data archiving and reproducibility
in published science
Arianne Albert, Rose Andrew, Florence Débarre, Dan Bock, Michelle Franklin, Kim Gilbert, Nolan Kane, Jean-Sébastien Moore, Brook Moyers, Sébastien Renaut, Diana Rennison, Thor Veen, Tim Vines, and Sam Yeaman
Reproducibility
• Science is the search for general ‘rules’
• Replication tests different circumstances
• Reproducibility checks existing results
Reproducibility
• We hope bad papers will be discarded
• But maybe many papers are ‘wrong’?
  – We need to quantify this problem…
Reproducibility
• Reproducibility needs the original data
• Then we need to repeat the analyses
• Here are two iterations of this process…
How does the availability of data change with time since publication?
Vines et al. Current Biology 2014
Introduction
Michener et al. (1997) Nongeospatial metadata for the ecological sciences. Ecol. Appl. 7:330
[Timeline figure, after Michener et al. 1997: from “Study published”, information decays over time: minor details lost/forgotten, important metadata lost/forgotten, accidental data loss, career changes and broken email, death of researcher]
• How fast does this happen?
• What are the main causes of data loss?
  – [Timeline figure: from “Study published” (0–35 years): accidental data loss, data storage defunct, career change and email breaks, death of researcher]
• Ask for datasets, see how many you get…
Methods
• Need to control for data type
  – morphological data from animals & plants
  – used in a Discriminant Function Analysis
• 516 studies in odd years 1991–2011
• Asked for data by email
  – searched for emails in paper and online
  – contacted first, last & corresponding authors
• “We want to try repeating your DFA”
  – part of study on reproducibility and paper age
• Author motivation:
  – we’re trapped in a burning building vs.
  – we want to print it out for wallpaper
• Our request is fairly common practice
  – expect 20–50% for 2011
• Motivation sets total % of data we receive
• But our focus is on how % changes with time
  – as long as we get some data we’re OK
• If data were gone, we asked for the reason
Results
[Figure: P(extant data) vs. age of paper (years, 5–20): probability that data are still extant (i.e. received + couldn’t be shared)]
• Odds of data being extant fall by 8% per year
• Almost all gone after 20 years
  – just 3 of 61 datasets extant for 1991 and 1993
• Why were we unable to get the data?
  – which reasons are related to paper age?
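An 8% annual fall in odds compounds quickly. A minimal sketch of the decay; the baseline odds at publication here are an illustrative assumption, not the study’s fitted intercept:

```python
def p_extant(years, base_odds=1.0, odds_ratio_per_year=0.92):
    """Probability that data are still extant after `years`,
    if the odds of being extant fall by 8% per year.
    `base_odds` (1.0, i.e. 50:50 at publication) is illustrative."""
    odds = base_odds * odds_ratio_per_year ** years
    return odds / (1 + odds)

for t in (0, 5, 10, 20):
    print(t, round(p_extant(t), 2))
```

Even from generous starting odds, only a small fraction of datasets would survive two decades, which matches the 3-of-61 figure above.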
[Figure: P(email got through) vs. age of paper (years, 5–20): probability that at least one email for the authors we contacted didn’t bounce]
[Figure: P(response | email got through) vs. age of paper (years, 5–20): given that at least one email didn’t bounce, probability we got a response. Motivation to respond is unrelated to paper age.]
[Figure: P(useful response | response) vs. age of paper (years, 5–20): given that we got a response, probability we heard about the data]
[Figure: P(data extant | useful response) vs. age of paper (years, 5–20): given that we heard about the data, probability the data are extant]
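The four figures above decompose data recovery into a chain of conditional probabilities, so the overall chance of obtaining extant data is their product. A sketch with purely illustrative stage values (these are assumptions, not the study’s estimates):

```python
# Each stage matches one conditional probability from the figures above;
# the numbers are illustrative assumptions, not fitted values.
stages = {
    "email got through": 0.80,
    "response | email got through": 0.70,
    "useful response | response": 0.90,
    "data extant | useful response": 0.60,
}

p_extant = 1.0
for name, p in stages.items():
    p_extant *= p
    print(f"after '{name}': {p_extant:.3f}")
# A weak link anywhere in the chain (e.g. bouncing emails on old
# papers) caps the overall recovery rate.
```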
Conclusions
• Data held by authors disappears fast
• Almost all gone after 20 years
• Archiving at publication really is crucial
Vines et al. Current Biology 2014
Reproducibility Part I: Discriminant Functions
Reproducibility Part I
• We received 101 files from authors
  – these are only the first step
• Are these the actual data from the paper?
• We tried to repeat their DFA
• What’s a Discriminant Function Analysis?
  – you have 2 or more groups of something
  – you want to be able to tell the groups apart
  – the groups differ in e.g. size & shape
  – you measure a few things
  – the DF says what aspect of size/shape is best for distinguishing the groups
  – the DFA produces three useful numbers:
    1. the percent variance explained by the 1st axis
    2. the loading coefficients
    3. the percentage of individuals correctly assigned
• We tried to reproduce these metrics
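For the two-group case the calculation behind a DFA can be sketched in a few lines. This is a minimal pure-Python Fisher discriminant on made-up morphometric data (the measurements and group sizes are invented for illustration); with only two groups there is a single axis, so it explains 100% of the between-group variance, leaving the loadings and the percent correctly assigned as the interesting metrics.

```python
# Minimal two-group Fisher discriminant (a sketch of what a DFA computes).
# The "length, width" measurements below are invented for illustration.
# The discriminant axis is w = Sw^-1 (m1 - m2).

def mean(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def scatter(rows, m):
    # 2x2 within-group scatter matrix
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

g1 = [(4.0, 2.1), (4.2, 2.0), (3.9, 2.2), (4.1, 2.3)]  # group 1
g2 = [(5.0, 2.6), (5.2, 2.7), (4.9, 2.5), (5.1, 2.8)]  # group 2

m1, m2 = mean(g1), mean(g2)
s1, s2 = scatter(g1, m1), scatter(g2, m2)
sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]

# invert the pooled 2x2 scatter matrix
det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
inv = [[sw[1][1] / det, -sw[0][1] / det],
       [-sw[1][0] / det, sw[0][0] / det]]

diff = [m1[0] - m2[0], m1[1] - m2[1]]
w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
     inv[1][0] * diff[0] + inv[1][1] * diff[1]]  # loading coefficients

# percent correctly assigned: project onto w, split at the midpoint
def project(p):
    return w[0] * p[0] + w[1] * p[1]

cut = (project(m1) + project(m2)) / 2
correct = sum(project(p) > cut for p in g1) + sum(project(p) < cut for p in g2)
pct_correct = 100 * correct / (len(g1) + len(g2))
print("loadings:", [round(x, 2) for x in w],
      "| correctly assigned:", pct_correct, "%")
```

With cleanly separated groups like these, 100% of individuals are correctly assigned; real morphometric data usually overlap more.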
Reproducibility Part I
• We started with 101 studies
  – 16 didn’t contain any of our three metrics
  – these were excluded
• What happened with the rest?
Outcome                      Percent
Unclear methods                    4
Insufficient metadata              7
Incorrect/incomplete data          9
[Subtotal]                      [20]
Reanalysis attempted:
  Results don’t match             21
  Some metrics match              18
  All metrics match               31
Overall total                    100
Reproducibility Part I
• Only 52 could be reproduced
  – 10% of the 516 datasets requested
• How far off were we?
[Scatter plot: Reanalyzed % Variance Explained vs. Published % Variance Explained (40–100)]
83% of 23 reanalyses within 5% of published value
[Scatter plot: Reanalyzed Coefficient vs. Published Coefficient]
70% of 23 reanalyses within 1 decimal place of published value
[Scatter plot: Reanalyzed % Correctly Assigned vs. Published % Correctly Assigned (60–100)]
75% of 56 reanalyses within 5% of published value
Reproducibility Part I
• Strong differences between metrics
Conclusions
Reproducibility Part I
• Getting the data is the biggest obstacle
  – accounts for 80% of the total
• Poor curation takes out only 4%
  – 22% of received datasets
• For DFA, reproducibility is quite good
  – but depends a lot on the metric used
Do data archiving policies work?
• journals now have data archiving policies
• four flavours:
  1. no policy
  2. recommend
  3. require
     a. no ‘data availability’ statement
     b. ‘data availability’ statement

Vines et al. (2013) FASEB J.
• focus on a single type of data
  – genetic data used in STRUCTURE
• must have an established online archive
  – in this case Dryad (or supp. mat.)
• found 229 papers from 2011–12
  – what % had data available?
[Bar chart: % of eligible papers with data available online, by journal. Left to right: no archiving policy (Cons. Gen. n=47, Crop Science n=12, Genetica n=9, TAG n=21), recommends archiving (BMC Evol. Biol. n=13, BJLS n=13, J. Heredity n=12, PLoS One n=51), mandates archiving (J. Evolutionary Biology n=10, Evolution n=6, Heredity n=7 with no data statement; Molecular Ecology n=28 with a data statement)]
Conclusions
• journals need to get tough
• give priority to papers with good archiving?
• have reviewers assess data statement
“Papers with exemplary data and code archiving are more valuable for future research, and, all else being equal, these are more likely to get accepted for publication”
How journals can boost data sharing

The journal ecosystem is a powerful filter of scientific literature, promoting the best work into the best journals. Why not use a similar mechanism to encourage more comprehensive data sharing?

Several journals have introduced policies mandating that data be shared on a public archive at publication. However, these have met with limited success, perhaps because of authors’ fear of losing control, being scooped in subsequent papers or having errors exposed. Moreover, compliance with data sharing policies is typically only checked after the paper is accepted.

To spur excellence in data sharing, journals must recognise that better sharing leads to stronger papers, and judge their submissions accordingly. Articles with feeble sharing efforts should either improve or be rejected. A focus on publishing verifiable research correspondingly boosts journal reputation, and signals to the author community that withholding data restricts them to publication in less prestigious journals.

Timothy H. Vines, University of British Columbia
Reproducibility Part II: genetic data
Gilbert et al. (2012) Molecular Ecology
Reproducibility Part II
• Reproducing simple stats (a DFA) was OK
• modern stats are more sophisticated
• most involve numerical optimization
  – can get a different answer each time
Reproducibility Part II
• 34 datasets from the previous study
• all have a STRUCTURE analysis
• this uses extensive numerical optimization
• output is K, the number of distinct clusters
Reproducibility Part II
• Can we reproduce their value of K?
• 4 studies were excluded
  – no data, irregular use of STRUCTURE
• Reanalyzed remaining 30 datasets
Outcome                     No. datasets   Percent
Strange use of STRUCTURE               2         6
Missing data                           2         6
Incorrect/incomplete data              3         9
Reanalysis attempted:
  K didn’t match                       6        18
  K matched                           21        62
Overall total                         34       100
Reproducibility Part II
• How close did we get?
[Scatter plot: chosen K value of reanalysis vs. author’s chosen K value (0–6 on both axes)]
Reproducibility Part II
• Most mismatches from poor software use
  – stochastic methods need many iterations
  – too few and the answer is unreliable
• Poor curation was less of a problem
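The iteration problem is generic to stochastic optimizers, not specific to STRUCTURE. A toy sketch with a 1-D k-means (all data and settings are invented): runs cut short vary wildly from restart to restart, while longer runs with multiple restarts settle on the same answer.

```python
import random

# Toy illustration (not STRUCTURE itself): a stochastic clustering run
# given too little computation is unstable; enough iterations plus
# restarts stabilize it. Data and settings are invented.

random.seed(1)
data = ([random.gauss(0, 0.3) for _ in range(30)] +
        [random.gauss(5, 0.3) for _ in range(30)] +
        [random.gauss(10, 0.3) for _ in range(30)])

def kmeans_cost(points, k, iters, rng):
    """Run 1-D k-means from a random start for `iters` iterations;
    return the final sum of squared distances to the nearest center."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sum(min(abs(p - c) for c in centers) ** 2 for p in points)

rng = random.Random(2)
hasty = [kmeans_cost(data, 3, iters=1, rng=rng) for _ in range(10)]
careful = [min(kmeans_cost(data, 3, iters=25, rng=rng) for _ in range(20))
           for _ in range(10)]

print("hasty spread:  ", round(max(hasty) - min(hasty), 2))
print("careful spread:", round(max(careful) - min(careful), 2))
```

The careful runs converge to the same low cost every time; the hasty single-iteration runs scatter, which is the same failure mode as reporting K from an under-iterated STRUCTURE run.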
Grand Conclusions
• STRUCTURE reproducibility > DFA
  – 65% vs 50%
• Is under 100% reproducibility unacceptable?
• Maybe replication is more important
• Data availability is the biggest problem
  – without it, reproducibility = 0
• We need stronger data archiving policies
• May mean better science as well
  – someone will check your data…
Thanks to: Arianne Albert, Rose Andrew, Florence Débarre, Dan Bock, Michelle Franklin, Kim Gilbert, Nolan Kane, Jean-Sébastien Moore, Brook Moyers, Sébastien Renaut, Diana Rennison, Loren Rieseberg, Thor Veen, Mike Whitlock, Sam Yeaman