Data Reuse, Sharing, and Production: An article-centric investigation of data citation practices in prominent journals Sarah Judson DataONE Summer 2010

Motivation

Many journals have data citation, or at least data sharing, policies.
– Most are “recommendations”
– Many will soon be mandatory
– But are they enacted?

Multiple depositories exist for data sharing
– Allow browsing for available data
– Provide space for data storage
– Recommend how data reuse should be properly credited
– But are they utilized?

Intentions

Report current status of data citation and sharing to relevant journals

Recommend best practices
– Increase ability to retrace and reuse data
– Ease transition to mandatory policies
– Promote appropriate credit to data author

Background

Advent of data sharing/citation policies

Continued expression of the need for increased data sharing, esp. for meta-analysis and global change studies

Similar studies in Biomedical journals* or focused on Genbank**, but few in Ecological/Evolutionary journals

*Piwowar and Chapman. 2010. Public sharing of research datasets: A pilot study of associations. Journal of Informetrics 4(2):148-156

**Noor et al. 2006. Data Sharing: How Much Doesn't Get Submitted to GenBank? PLoS Biol. 4(7):e228

Research Questions

What are current practices for data citation within articles?
– Do authors tend to cite the dataset itself or a related paper?
– How does the author obtain the dataset?

How do these practices vary across discipline, journal, data type, data source?
– Are data citation practices influenced more by attitude of the discipline towards data sharing or journal policy?

How have these practices varied across time?
– Does increased data reuse/sharing correlate with changes in journal policy?
– Does data reuse/sharing simply increase with time since the advent of the internet?

Angles of Attack

“Snapshot” approach
– 1st issue in 2010 for journals of interest
  • To assess “current state”
  • To evaluate utility of a particular journal for more detailed “Time Series” investigation

“Time Series” approach
– Random sample of 25 articles per journal per year
  • To investigate trends over time, especially considering changes in journal data/citation policies

Nitty-Gritty Methods

Random sampling
– Export all articles and accompanying metadata
  • 2005-2010
  • Journal-specific
– Assign record number to each article
– Generate random numbers to select 25 articles

Data Extraction
– Recorded on Excel spreadsheet, uploaded weekly to GoogleDocs
– Read Journal Citation/Data Policy in Preparation for Extraction
– Read through articles manually
  • Special attention to the Methods and Acknowledgements sections
  • Identify instances of data reuse and sharing
– Copy relevant excerpts
  • Code according to established fields
– Record additional metadata
  • Open access, Discipline, Submission to Publication duration, etc.
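
The random sampling step on this slide (number every exported record, then draw 25 record numbers at random) could be sketched roughly as follows. This is only an illustration, not the procedure actually used: the function name `sample_articles`, the placeholder export list, and the seed are all assumptions.

```python
import random


def sample_articles(records, k=25, seed=None):
    """Assign record numbers 1..N to the exported records, then draw k at random."""
    numbered = {i: rec for i, rec in enumerate(records, start=1)}
    rng = random.Random(seed)          # a fixed seed makes the draw repeatable
    k = min(k, len(numbered))          # small journal-years may have fewer than k articles
    chosen = sorted(rng.sample(range(1, len(numbered) + 1), k))
    return [(n, numbered[n]) for n in chosen]


# Hypothetical export: 120 placeholder records for one journal in one year
export = [f"article-{i:03d}" for i in range(1, 121)]
picked = sample_articles(export, k=25, seed=2010)
```

Sampling without replacement keeps any article from being selected twice, and recording the seed lets the same 25 record numbers be regenerated later.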

Extracted Fields

ISI metadata
– DOI
– Author and affiliation
– Abstract and keywords
– Journal and ISSN

For each instance of Data Reuse, Sharing, or Production
– Depository
– Type of InText and Bibliographic Citation
  • Author-Year, URL, Accession #
– How dataset acquired
  • Is depository clearly referenced?
  • Was it obtained from a colleague?
  • Is it previous work by one of the authors?
– Where citation occurs
– Type of Dataset
  • Gene Sequence, Phylogenetic Tree, Ecological, etc.
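
One way to picture a single coded row of the spreadsheet is as a small record type. The sketch below is an assumption about how the fields on this slide might be laid out; the class name, field names, and the example values (including the placeholder DOI) are hypothetical and are not taken from the actual coding sheet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataInstance:
    """One instance of data reuse, sharing, or production found in an article."""
    doi: str
    journal: str
    kind: str                                  # "reuse", "sharing", or "production"
    depository: Optional[str] = None           # e.g. "GenBank"; None if not indicated
    intext_citation: Optional[str] = None      # "author-year", "url", or "accession"
    has_bibliographic_citation: bool = False
    acquisition: Optional[str] = None          # e.g. "depository", "colleague", "own previous work"
    citation_location: Optional[str] = None    # e.g. "Methods", "Results", "Table"
    dataset_type: Optional[str] = None         # e.g. "Gene Sequence", "Phylogenetic Tree"
    excerpt: str = ""                          # relevant text copied from the article


# Placeholder values for illustration only
row = DataInstance(
    doi="10.0000/example",
    journal="Systematic Biology",
    kind="reuse",
    depository="GenBank",
    intext_citation="accession",
    has_bibliographic_citation=True,
    citation_location="Methods",
    dataset_type="Gene Sequence",
)
```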

Selected Journals

Dryad “Top Three”
– Justification:
  1. Most currently posted datasets: are they really being reused?
  2. Known "High Impact" Journals
  3. Cover target disciplines and depositories
– Systematic Biology (Systematics, Phylogenetics/geography)
– American Naturalist (Behavior, Natural History, Ecology)
– Molecular Ecology (Genetics, Molecular Evolution)

Other options: ESA family, Discipline-specific, Broad

Limitations

Only looking at a few journals and disciplines

Relying only on the main text
– Not looking at supplementary material unless article extremely unclear
– Have to assume if it wasn’t stated, it wasn’t reused/shared

Would have developed automated extraction, text coding if time permitted
– Process more articles
– Remove bias
– Standardization
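
As a rough idea of what the automated extraction mentioned above might look like (none was built for this study), the sketch below scans an excerpt for depository names and for strings shaped like classic GenBank nucleotide accessions: one letter plus five digits, or two letters plus six digits. The pattern, the depository list, and the function name `scan_excerpt` are assumptions; a real tool would need more formats and manual checking.

```python
import re

# Classic GenBank nucleotide accession shapes only (assumption: other formats ignored)
ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")
DEPOSITORIES = ("GenBank", "TreeBASE", "Dryad")


def scan_excerpt(text):
    """Return accession-like strings and depository names found in an excerpt."""
    lowered = text.lower()
    return {
        "accessions": ACCESSION.findall(text),
        "depositories": [d for d in DEPOSITORIES if d.lower() in lowered],
    }


excerpt = (
    "Previously published sequence data were used for V. velella 18S "
    "(Collins, 2002, GenBank AF358087), P. porpita 18S (Collins, 2002, AF358086)"
)
result = scan_excerpt(excerpt)
# result["accessions"]   -> ["AF358087", "AF358086"]
# result["depositories"] -> ["GenBank"]
```

A scan like this only flags candidates. It cannot tell whether a mention is reuse or sharing, so each hit would still need to be coded by a reader.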

Unresolved Problems (suggestions please!)

Data Type Classification
– Easy: Gene Sequences and Phylogenetic Trees
– Biology vs. Ecology
– Subdivisions in Biology, Earth, etc.
  • Bio: Morphology, Behavior
  • Eco: Competition, Community
  • Earth: Soils, GIS

“Articles” according to ISI
– AmNat:
  • High % are models
  • Notes and Comments
  • Natural History Miscellany
– SysBio: Points of View

Author Recurrence
– SysBio: only 50 articles per year and multiple publications/accreditations to the same people (Wiens, Sullivan)
– AmNat: less pronounced problem (Abrams)

Findings

Qualitative observations

Good citation, bad citation

Journal Comparison

Time Series
– % Reuse (% Sharing results not presented)
– Data type
– Depository

Qualitative Observations

– Internal (journal) supplementary depositories used more as a dump than for reusable data
  • Additional or color figures and tables
  • Statistical outputs
– InText citations allude to a raw data supplement, but it often ends up being raw results
– Defunct data storage
  • Personal URLs
  • Problem retrieving supplementary data (SysBio 2005-8)
– More data produced than shared
– Alignments and Trees often not posted to TreeBase
– Ecological datasets grossly under-shared

Haphazard citation practices

Accessions cited in Text vs. Table

Author vs. Accession

Only depository referenced
– Especially with large datasets

Some in Methods, some in Results
– Majority of reuses cited in Methods
– Sharing cited roughly 50/50 between Methods and Results

Crediting self before others
– Bibliographic citations not given or only for same author
– Give article citation for self, but not accession; accession for others but not their article

Disparate citation formats within a single paper

Good Citation

“Previously published sequence data were used for V. velella 18S (Collins, 2002, GenBank AF358087), P. porpita 18S (Collins, 2002, AF358086), Staurocladia wellingtoni 18S (Collins, 2002, AF358084), S. wellingtoni 16S (Schuchert, 2005, AJ580934), Hydra circumcincta 18S (Medina et al., 2001, AF358080), and H. vulgaris 16S (Pont-Kingdon et al., 2000, AF100773).”

Taxon

Gene region

Author-Year
– accompanying bibliographic citation

GenBank Accession

Bad Citation

Incomplete
– “The sequences, which were all produced in our previous studies (Aceto et al. 1999; Cozzolino et al. 2001) and are available in GenBank”
– Usually missing accession, sometimes author and depository
– Sometimes the info is buried in tables or not given for large compilations

Unclear
– “During annual aerial surveys, observers sketch the extent of defoliation from the air on paper or digital maps (Ciesla 2000) that are then compiled as a series of polygons in a geographical information system (GIS) (Liebhold et al. 1997).”
– Who is the original data author?
  • Are these theoretical, methodological or data citations?
  • Bibliographic citations occasionally shed light
– Where is the data stored?

What is a good citation?

Data easily retraceable

Proper credit given

Criteria
– Depository mentioned in text
– Accession mentioned in text
– Author credit given in Bibliography
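
The three criteria above can be expressed as a simple check. The sketch below treats a citation as "good" only when all three are met and otherwise reports which ones are missing. The slides do not state where the line between "Ok" and "Bad" falls, so no such threshold is assumed here; the function and variable names are illustrative.

```python
CRITERIA = (
    "depository mentioned in text",
    "accession mentioned in text",
    "author credit given in bibliography",
)


def unmet_criteria(depository_in_text, accession_in_text, author_in_bibliography):
    """List the criteria a data citation fails to meet."""
    flags = (depository_in_text, accession_in_text, author_in_bibliography)
    return [name for name, met in zip(CRITERIA, flags) if not met]


def is_good_citation(depository_in_text, accession_in_text, author_in_bibliography):
    """A citation counts as 'good' here only if every criterion is met."""
    return not unmet_criteria(depository_in_text, accession_in_text, author_in_bibliography)


# A citation naming the depository and the authors but giving no accession
missing = unmet_criteria(True, False, True)
# missing -> ["accession mentioned in text"]
```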

Citations: Systematic Biology

[Chart: counts of Bad, Good, and Ok citations (pivot field: IdealCitationAll)]

Citations: American Naturalist

[Chart: counts of Bad, Good, and Ok citations (pivot field: IdealCitationAll)]

Journal Comparison: Snapshot

Systematic Biology
– Data Reuse
  • Frequent use of Genbank
  • Occasional use of Treebase
– Data Sharing
  • Often post to Treebase, but often unclear about GA vs. PT
  • Internal: difficult (no unique accession [generic URL]; not accessible pre-2008)

American Naturalist
– Data Reuse
  • Varied data (biological)
  • Often extracted from literature or used to validate a model
– Data Sharing
  • Occasional sharing: Dryad, Treebase, Genbank, internal

Molecular Ecology
– Data Reuse
  • Frequent use of Genbank, but steadily drops off after 2009
  • Some morphological data matrices
– Data Sharing
  • Posting to Genbank, but alternatively given in Methods and Results
  • Level of accessibility varies widely

Ecology
– Data Reuse
  • Minor datasets
– Data Sharing
  • Extensive datasets rarely shared
  • Ecological Archives (accessible but used for excess figures and results)

Journal Comparison: % Reuse and Sharing in 2010

            Data Reuse   Data Sharing   "Good" Citations
AmNat       47%          13%            0%
Ecology     48%          5%             0%
MolecEco    90%          70%            30%
SysBio      86%          62%            14%

Percent Reuse over time

            AmNat   SysBio
2005        20%     73%
2006        20%     50%
2007        24%     67%
2008        32%     75%
2009        28%     64%
2010        47%     83%
All Years   27%     68%
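
As a quick arithmetic check on the table above, the sketch below compares the reported "All Years" figures with the plain average of the six yearly percentages. The yearly values are copied from the slide. The comparison is only illustrative, since the per-year article counts behind the reported figures are not given.

```python
# Percent of sampled articles reusing data, by journal and year (from the slide)
reuse = {
    "AmNat": {2005: 20, 2006: 20, 2007: 24, 2008: 32, 2009: 28, 2010: 47},
    "SysBio": {2005: 73, 2006: 50, 2007: 67, 2008: 75, 2009: 64, 2010: 83},
}
reported_all_years = {"AmNat": 27, "SysBio": 68}

# Unweighted mean of the yearly percentages for each journal
simple_means = {
    journal: sum(by_year.values()) / len(by_year)
    for journal, by_year in reuse.items()
}

for journal, mean in simple_means.items():
    print(f"{journal}: mean of yearly values {mean:.1f}%, "
          f"reported all-years {reported_all_years[journal]}%")
```

The unweighted means come out slightly above the reported "All Years" values. One possible explanation, stated here as an assumption, is that the reported figures pool all articles and the years contribute unequal numbers of articles.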

Depository: Systematic Biology

[Chart: data by depository; categories: Genbank, Treebase, Database, URL, Extraction, Not Indicated, Other]

Depository: American Naturalist

[Chart: data by depository; categories: Genbank, Treebase, Database, URL, Extraction, Not Indicated, Other]

Data Types: Systematic Biology

[Chart: data by type; categories: Gene Sequence, Organism Biology, Phylogenetic Tree, G.I.S., Unclassified]

Data Types: American Naturalist

[Chart: data by type; categories: Gene Sequence, Organism Biology, Phylogenetic Tree, G.I.S., Unclassified]

Back to the big picture

Inform journals and depositories about current practices vs. policy

Best Practices recommendations

Continued research on trends in data citation

Suggested Best Practices

Accession numbers and Authors of each dataset (reused and shared) given in the Methods or Supplementary Table referenced in the Methods

– Authors not charged extra page/online fees
– Authors allowed to exceed Reference limit to credit data

Editorial enforcement
– Checklist

Internal Depositories made more accessible
– Usable formats
– Unique and Stable URLs

Long Term Best Practices

Separate Supplementary Data Section
– Example: Molecular Ecology
  • SysBio added a separate section but it is defunct
  • AmNat has an “Online enhancements” header
– For both internal and external deposits
– Distinguish from “data-dump” (extra figures, outputs)
– Accompanying References section
  • Unlimited length

DATA cited, in addition to publication
– Could combine into a new reference type: Author. Year. Title. Journal. Pages. Depository. Accession.

Track on par with publications in ISI, etc.
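
The combined reference type proposed above (Author. Year. Title. Journal. Pages. Depository. Accession.) is simple to assemble mechanically. The sketch below shows one possible formatter. The function name is hypothetical, and the title, journal, and pages in the example are placeholders rather than the real bibliographic details.

```python
def format_data_reference(author, year, title, journal, pages, depository, accession):
    """Join the seven fields in the proposed order, each ending in a period."""
    parts = [author, str(year), title, journal, pages, depository, accession]
    return " ".join(part.rstrip(".") + "." for part in parts)


# Author, year, depository and accession follow the "Good Citation" example;
# title, journal and pages are placeholders for illustration only.
reference = format_data_reference(
    author="Collins",
    year=2002,
    title="Example title",
    journal="Example Journal",
    pages="1-10",
    depository="GenBank",
    accession="AF358087",
)
# reference -> "Collins. 2002. Example title. Example Journal. 1-10. GenBank. AF358087."
```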

Continued Research

Snapshot and time series of Molecular Ecology
– Possibly Ecology if time permits
– Alternative (suggestions please!): Just snapshots

Trends over time
– Have reuse and sharing increased?
– Have citation practices improved over time?
  • Is this influenced by journal/depository recommendations on citations?

Correlation with influential factors
– Is there more data reuse in articles that are also open access or share data?
– Are certain dataset types or article disciplines more inclined to reuse/share data?

Data shared vs. data produced

Sync with Journal/Depository Metadata (Nic) and Search Findings (Valerie)
– Refine “Good” citation criteria
  • Journal and depository specific

Additional Exploration

Track the cited or shared datasets

Look at supplemental data alone
– Internal (journal repositories)
  • Additional data not cited in text?
  • Data dumps?
– Ease of access
  • Accuracy of accession numbers

Actual data reusability
– Method/processing metadata
– File format

Software/model reuse and sharing
– R-packages, GUIs
– Encouraged by American Naturalist

Databases
– Independent databases vs. depositories
  • % utilized out of available
– Caching/stability options, linking metadata to depositories

Final products

Reports to requisite journals/depositories

Potential Manuscripts
– Journal Comparison: Citation Practices
– Treebase: Shared vs. Produced
– Best Practices recommendations

Shared dataset!

Thanks for listening!

Questions? Suggestions?

– Unresolved problems
– Continued Research

Hurdles

Determining extracted fields

Coding data now vs. later

In light of data/citation policies….

Compare the “performance” of SysBio and AmNat against their depository and journal policies (do they meet the requirements?), or state this in the future research section

OWW: Nic – do “editor” instructions or other sections of policy indicate how data/citation policies are enforced?