Transcript of “The crumbling wall: data archiving and reproducibility in published science”
Tim Vines, University of British Columbia
The crumbling wall: data archiving and reproducibility
in published science
Arianne Albert, Rose Andrew, Florence Débarre, Dan Bock, Michelle Franklin, Kim Gilbert, Nolan Kane, Jean-Sébastien Moore, Brook Moyers, Sébastien Renaut, Diana Rennison, Thor Veen, Tim Vines, and Sam Yeaman
Reproducibility
• Science is the search for general ‘rules’
• Replication tests different circumstances
• Reproducibility checks existing results
Reproducibility
• We hope bad papers will be discarded
• But maybe many papers are ‘wrong’?
  – We need to quantify this problem…
Reproducibility
• Reproducibility needs the original data
• Then we need to repeat the analyses
• Here are two iterations of this process…
How does the availability of data change with time since publication?
Vines et al. Current Biology 2014
Introduction
Michener et al. (1997) Nongeospatial metadata for the ecological sciences. Ecol. Appl. 7:330
[Timeline figure, after Michener et al. 1997: from “Study published”, information decays over time: minor details lost/forgotten, important metadata lost/forgotten, accidental data loss, career changes and broken email, death of researcher]
• How fast does this happen?
• What are the main causes of data loss?
  – [Timeline figure: from “Study published” (0–35 years): accidental data loss, data storage defunct, career change and email breaks, death of researcher]
• Ask for datasets, see how many you get…
Methods
• Need to control for data type
  – morphological data from animals & plants
  – used in a Discriminant Function Analysis
• 516 studies in odd years 1991–2011
• Asked for data by email
  – searched for emails in paper and online
  – contacted first, last & corresponding authors
• “We want to try repeating your DFA”
  – part of study on reproducibility and paper age
• Author motivation:
  – we’re trapped in a burning building vs.
  – we want to print it out for wallpaper
• Our request is fairly common practice
  – expect 20–50% for 2011
• Motivation sets total % of data we receive
• But our focus is on how % changes with time
  – as long as we get some data we’re OK
• If data were gone, we asked for the reason
Results
[Figure: P(extant data) vs. age of paper (years, 5–20): probability that data are still extant (i.e. received + couldn’t be shared)]
• Odds of data being extant fall by 8% per year
• Almost all gone after 20 years
  – just 3 of 61 datasets extant for 1991 and 1993
• Why were we unable to get the data?
  – which reasons are related to paper age?
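An 8% annual fall in odds compounds quickly. A minimal sketch of the decay; the baseline odds at publication here are an illustrative assumption, not the study’s fitted intercept:

```python
def p_extant(years, base_odds=1.0, odds_ratio_per_year=0.92):
    """Probability that data are still extant after `years`,
    if the odds of being extant fall by 8% per year.
    `base_odds` (1.0, i.e. 50:50 at publication) is illustrative."""
    odds = base_odds * odds_ratio_per_year ** years
    return odds / (1 + odds)

for t in (0, 5, 10, 20):
    print(t, round(p_extant(t), 2))
```

Even from generous starting odds, only a small fraction of datasets would survive two decades, which matches the 3-of-61 figure above.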
[Figure: P(email got through) vs. age of paper (years, 5–20): probability that at least one email for the authors we contacted didn’t bounce]
[Figure: P(response | email got through) vs. age of paper (years, 5–20): given that at least one email didn’t bounce, probability we got a response. Motivation to respond is unrelated to paper age.]
[Figure: P(useful response | response) vs. age of paper (years, 5–20): given that we got a response, probability we heard about the data]
[Figure: P(data extant | useful response) vs. age of paper (years, 5–20): given that we heard about the data, probability the data are extant]
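The four figures above decompose data recovery into a chain of conditional probabilities, so the overall chance of obtaining extant data is their product. A sketch with purely illustrative stage values (these are assumptions, not the study’s estimates):

```python
# Each stage matches one conditional probability from the figures above;
# the numbers are illustrative assumptions, not fitted values.
stages = {
    "email got through": 0.80,
    "response | email got through": 0.70,
    "useful response | response": 0.90,
    "data extant | useful response": 0.60,
}

p_extant = 1.0
for name, p in stages.items():
    p_extant *= p
    print(f"after '{name}': {p_extant:.3f}")
# A weak link anywhere in the chain (e.g. bouncing emails on old
# papers) caps the overall recovery rate.
```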
Conclusions
• Data held by authors disappears fast
• Almost all gone after 20 years
• Archiving at publication really is crucial
Vines et al. Current Biology 2014
Reproducibility Part I: Discriminant Functions
Reproducibility Part I
• We received 101 files from authors
  – these are only the first step
• Are these the actual data from the paper?
• We tried to repeat their DFA
• What’s a Discriminant Function Analysis?
  – you have 2 or more groups of something
  – you want to be able to tell the groups apart
  – the groups differ in e.g. size & shape
  – you measure a few things
  – the DF says what aspect of size/shape is best for distinguishing the groups
  – the DFA produces three useful numbers:
    1. the percent variance explained by the 1st axis
    2. the loading coefficients
    3. the percentage of individuals correctly assigned
• We tried to reproduce these metrics
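For the two-group case the calculation behind a DFA can be sketched in a few lines. This is a minimal pure-Python Fisher discriminant on made-up morphometric data (the measurements and group sizes are invented for illustration); with only two groups there is a single axis, so it explains 100% of the between-group variance, leaving the loadings and the percent correctly assigned as the interesting metrics.

```python
# Minimal two-group Fisher discriminant (a sketch of what a DFA computes).
# The "length, width" measurements below are invented for illustration.
# The discriminant axis is w = Sw^-1 (m1 - m2).

def mean(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def scatter(rows, m):
    # 2x2 within-group scatter matrix
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

g1 = [(4.0, 2.1), (4.2, 2.0), (3.9, 2.2), (4.1, 2.3)]  # group 1
g2 = [(5.0, 2.6), (5.2, 2.7), (4.9, 2.5), (5.1, 2.8)]  # group 2

m1, m2 = mean(g1), mean(g2)
s1, s2 = scatter(g1, m1), scatter(g2, m2)
sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]

# invert the pooled 2x2 scatter matrix
det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
inv = [[sw[1][1] / det, -sw[0][1] / det],
       [-sw[1][0] / det, sw[0][0] / det]]

diff = [m1[0] - m2[0], m1[1] - m2[1]]
w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
     inv[1][0] * diff[0] + inv[1][1] * diff[1]]  # loading coefficients

# percent correctly assigned: project onto w, split at the midpoint
def project(p):
    return w[0] * p[0] + w[1] * p[1]

cut = (project(m1) + project(m2)) / 2
correct = sum(project(p) > cut for p in g1) + sum(project(p) < cut for p in g2)
pct_correct = 100 * correct / (len(g1) + len(g2))
print("loadings:", [round(x, 2) for x in w],
      "| correctly assigned:", pct_correct, "%")
```

With cleanly separated groups like these, 100% of individuals are correctly assigned; real morphometric data usually overlap more.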
Reproducibility Part I
• We started with 101 studies
  – 16 didn’t contain any of our three metrics
  – these were excluded
• What happened with the rest?
Outcome                      Percent
Unclear methods                    4
Insufficient metadata              7
Incorrect/incomplete data          9
[Subtotal]                      [20]
Reanalysis attempted:
  Results don’t match             21
  Some metrics match              18
  All metrics match               31
Overall total                    100
Reproducibility Part I
• Only 52 could be reproduced
  – 10% of the 516 datasets requested
• How far off were we?
[Scatter plot: Reanalyzed % Variance Explained vs. Published % Variance Explained (40–100)]
83% of 23 reanalyses within 5% of published value
[Scatter plot: Reanalyzed Coefficient vs. Published Coefficient]
70% of 23 reanalyses within 1 decimal place of published value
[Scatter plot: Reanalyzed % Correctly Assigned vs. Published % Correctly Assigned (60–100)]
75% of 56 reanalyses within 5% of published value
Reproducibility Part I
• Strong differences between metrics
Conclusions
Reproducibility Part I
• Getting the data is the biggest obstacle
  – accounts for 80% of the total
• Poor curation takes out only 4%
  – 22% of received datasets
• For DFA, reproducibility is quite good
  – but depends a lot on the metric used
Do data archiving policies work?
• journals now have data archiving policies
• four flavours:
  1. no policy
  2. recommend
  3. require
     a. no ‘data availability’ statement
     b. ‘data availability’ statement

Vines et al. (2013) FASEB J.
• focus on a single type of data
  – genetic data used in STRUCTURE
• must have an established online archive
  – in this case Dryad (or supp. mat.)
• found 229 papers from 2011–12
  – what % had data available?
[Bar chart: % of eligible papers with data available online, by journal. Left to right: no archiving policy (Cons. Gen. n=47, Crop Science n=12, Genetica n=9, TAG n=21), recommends archiving (BMC Evol. Biol. n=13, BJLS n=13, J. Heredity n=12, PLoS One n=51), mandates archiving (J. Evolutionary Biology n=10, Evolution n=6, Heredity n=7 with no data statement; Molecular Ecology n=28 with a data statement)]
Conclusions
• journals need to get tough
• give priority to papers with good archiving?
• have reviewers assess data statement
“Papers with exemplary data and code archiving are more valuable for future research, and, all else being equal, these are more likely to get accepted for publication”
How journals can boost data sharing

The journal ecosystem is a powerful filter of scientific literature, promoting the best work into the best journals. Why not use a similar mechanism to encourage more comprehensive data sharing?

Several journals have introduced policies mandating that data be shared on a public archive at publication. However, these have met with limited success, perhaps because of authors’ fear of losing control, being scooped in subsequent papers or having errors exposed. Moreover, compliance with data sharing policies is typically only checked after the paper is accepted.

To spur excellence in data sharing, journals must recognise that better sharing leads to stronger papers, and judge their submissions accordingly. Articles with feeble sharing efforts should either improve or be rejected. A focus on publishing verifiable research correspondingly boosts journal reputation, and signals to the author community that withholding data restricts them to publication in less prestigious journals.

Timothy H. Vines, University of British Columbia
Reproducibility Part II: genetic data
Gilbert et al. (2012) Molecular Ecology
Reproducibility Part II
• Reproducing simple stats (a DFA) was OK
• modern stats are more sophisticated
• most involve numerical optimization
  – can get a different answer each time
Reproducibility Part II
• 34 datasets from the previous study
• all have a STRUCTURE analysis
• this uses extensive numerical optimization
• output is K, the number of distinct clusters
Reproducibility Part II
• Can we reproduce their value of K?
• 4 studies were excluded
  – no data, irregular use of STRUCTURE
• Reanalyzed remaining 30 datasets
Outcome                     No. datasets   Percent
Strange use of STRUCTURE               2         6
Missing data                           2         6
Incorrect/incomplete data              3         9
Reanalysis attempted:
  K didn’t match                       6        18
  K matched                           21        62
Overall total                         34       100
Reproducibility Part II
• How close did we get?
[Scatter plot: chosen K value of reanalysis vs. author’s chosen K value (0–6 on both axes)]
Reproducibility Part II
• Most mismatches from poor software use
  – stochastic methods need many iterations
  – too few and the answer is unreliable
• Poor curation was less of a problem
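The iteration problem is generic to stochastic optimizers, not specific to STRUCTURE. A toy sketch with a 1-D k-means (all data and settings are invented): runs cut short vary wildly from restart to restart, while longer runs with multiple restarts settle on the same answer.

```python
import random

# Toy illustration (not STRUCTURE itself): a stochastic clustering run
# given too little computation is unstable; enough iterations plus
# restarts stabilize it. Data and settings are invented.

random.seed(1)
data = ([random.gauss(0, 0.3) for _ in range(30)] +
        [random.gauss(5, 0.3) for _ in range(30)] +
        [random.gauss(10, 0.3) for _ in range(30)])

def kmeans_cost(points, k, iters, rng):
    """Run 1-D k-means from a random start for `iters` iterations;
    return the final sum of squared distances to the nearest center."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sum(min(abs(p - c) for c in centers) ** 2 for p in points)

rng = random.Random(2)
hasty = [kmeans_cost(data, 3, iters=1, rng=rng) for _ in range(10)]
careful = [min(kmeans_cost(data, 3, iters=25, rng=rng) for _ in range(20))
           for _ in range(10)]

print("hasty spread:  ", round(max(hasty) - min(hasty), 2))
print("careful spread:", round(max(careful) - min(careful), 2))
```

The careful runs converge to the same low cost every time; the hasty single-iteration runs scatter, which is the same failure mode as reporting K from an under-iterated STRUCTURE run.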
Grand Conclusions
• STRUCTURE reproducibility > DFA
  – 65% vs 50%
• Is under 100% reproducibility unacceptable?
• Maybe replication is more important
• Data availability is the biggest problem
  – without it, reproducibility = 0
• We need stronger data archiving policies
• May mean better science as well
  – someone will check your data…
Thanks to: Arianne Albert, Rose Andrew, Florence Débarre, Dan Bock, Michelle Franklin, Kim Gilbert, Nolan Kane, Jean-Sébastien Moore, Brook Moyers, Sébastien Renaut, Diana Rennison, Loren Rieseberg, Thor Veen, Mike Whitlock, Sam Yeaman