Open data: Enhancing preservation, reproducibility, and innovation
OPEN DATA: ENHANCING
PRESERVATION, REPRODUCIBILITY, AND
INNOVATION
Clarke Iakovakis
Scholarly Communications Librarian
Neumann Library
CC BY-SA 3.0-2.5-2.0-1.0 image courtesy Daniel Tenerife - Own work. Title: "Social Red" https://commons.wikimedia.org/wiki/File:Social_Red.jpg#mediaviewer/File:Social_Red.jpg
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
CHANGE IN MINDSET
“data is no longer regarded as static or stale, whose usefulness is finished once the purpose for which it was collected was achieved.”
- Kenneth Cukier and Viktor Mayer-Schönberger
“in some fields, the data are coming to be viewed as an essential end product of research, comparable in value to journal articles or conference papers”
- Christine Borgman
OUTLINE
• Data-centric scholarship
• Benefits & challenges of open data
• Defining open data
• Reproducibility
• Public use & data management plans
• Data reuse
• Concerns and open questions
• Where to deposit data?
SECTION 1
FROM DOCUMENT TO DATA-CENTRIC VIEW OF SCHOLARSHIP
WHAT WE MEAN BY “DATA”
A wide definition:
any information that can be stored in digital form, including text, numbers, images, video or movies,
audio, software, algorithms, equations, animations, models, simulations, etc. Such data
may be generated by...observation, computation, or experiment
- National Science Board
National Science Board. Long-Lived Data Collections: Enabling Research and Education in the 21st Century. Arlington, VA (2005): 13. https://www.nsf.gov/pubs/2005/nsb0540/nsb0540_3.pdf
WHAT IS RESEARCH DATA?
collected, observed, accessed, or created, for the purposes of analysis to produce and validate original
research results.
What is a routine collection at one point can become research data in the future
Thus research data are very much about when they are used, as well as what they constitute, and the purpose for which they are to be used
University of Edinburgh. “Research Data Explained.” http://mantra.edina.ac.uk/researchdataexplained/
WHAT IS RESEARCH DATA?
Hard science: scientific data generated by instrumented research projects
Social science: data generated from government statistics, online surveys, behavioral models
Humanities: bodies of text, digital images and video, models of historic sites
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
DATA INTENSIVE RESEARCH
Applying information technology to research problems
Collaborations across disciplines & increasing size of collaborations
Increasing the complexity and quantity of research data
DATA INTENSIVE RESEARCH
• Scientific instruments generate data at greater speeds, densities, and detail
• Digitization of older print & analog data
• Born digital data
• Data storage capacity increases & storage costs decrease, enabling preservation of data
• Improvements in searching, analysis & visualization tools
World’s technological installed capacity to store information (table SA1) (16).
M Hilbert, and P López Science 2011;332:60-65
SLOAN DIGITAL SKY SURVEY
The most distant quasar ever discovered (at least as of October 2003). The redshift 6.4 quasar is seen at a time when the universe was just 800 million years old. The light-travel time from this object to us is about 13 billion years.
http://www.sdss.org
Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value
- Clive Humby
Image © Against All Odds Productions
VALUE OF DATA
Pryor, Graham. “Why Manage Research Data?” Managing Research Data. London: Facet Publishing, 2012.
VALUE OF DATA
Value of a dataset can be
• Immediate
• Gained over time
• Transient
• Little (i.e. it’s easier to recreate than curate)
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
VALUE OF DATA
“Fundamentally, there is a shift from a document-centric view of scholarship to a data-centric view of scholarship”
- Sayeed Choudhury
Choudhury, Sayeed. "Data curation: An ecological perspective." College & Research Libraries News 71, no. 4 (2010): 194-196.
WHY OPEN?
Data that underpin a journal article should be made concurrently available in an accessible database.
We are now on the brink of an achievable aim: for all science literature to be online, for all of the data to be online and for the two to be interoperable.
Adapted from Jonathan Tedds Distinguished Lecture at DLab, UC Berkeley, 12 Sep 2013: "The Open Research Challenge: Peer Review and Publication of Research Data". Licensed under CC BY
Royal Society June 2012, Science as an Open Enterprise, http://royalsociety.org/policy/projects/science-public-enterprise/report/
DATA AVAILABILITY
Vines, Timothy H, Arianne Y K. Albert, Rose L Andrew, Florence Débarre, Dan G Bock, Michelle T Franklin, Kimberly J Gilbert, et al. "The Availability of Research Data Declines Rapidly with Article Age." Current Biology 24, no. 1 (January 6, 2014): 94-97. https://linkinghub.elsevier.com/retrieve/pii/S0960-9822(13)01400-0
Researchers requested data sets from a relatively homogenous set of 516 articles published 1991-2011 in the field of zoology
Tracking down the authors & getting a response was the first challenge.
For every yearly increase in article age, the odds of the data set being reported as extant decreased by 17%
When the authors did give the status of their data, the proportion of data sets that still existed dropped from 100%
in 2011 to 33% in 1991
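The 17% figure describes odds rather than probabilities, so it compounds multiplicatively with article age. A minimal Python sketch of that arithmetic follows; the function names and the 90% starting probability are illustrative assumptions, not values reported by Vines et al.

```python
def odds_multiplier(years: int, yearly_decline: float = 0.17) -> float:
    """Factor by which the odds shrink after `years` years of a constant yearly decline."""
    return (1.0 - yearly_decline) ** years


def probability_after(start_probability: float, years: int,
                      yearly_decline: float = 0.17) -> float:
    """Convert a starting probability (below 1.0) to odds, apply the decline, convert back."""
    start_odds = start_probability / (1.0 - start_probability)
    odds = start_odds * odds_multiplier(years, yearly_decline)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical starting point: a 90% chance that the data set still exists.
    for years in (0, 5, 10, 20):
        print(years, round(probability_after(0.90, years), 3))
```

Because the decline acts on the odds, the probability falls slowly at first and faster once it nears 50%, which is why a modest-sounding yearly figure still adds up to large losses over two decades.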
Many of these missing data sets could be retrieved only with considerable effort by the
authors, and others are completely lost to science
DATA LOSS
Adapted from Mitcham, Jenny & Lindsey Myers. “Managing your research data”. Licensed under CC BY-NC-SA
DATA LOSS
• Human error
• Natural disaster
• Facilities infrastructure failure
• Storage failure
• Server hardware/software failure
• Application software failure
• Format obsolescence
• Legal encumbrance
• Malicious attack
• Loss of staffing competencies
• Loss of institutional commitment
• Loss of financial stability
Peters, Christie. Research Data Management: Basics and Best Practices. http://uknowledge.uky.edu/cgi/viewcontent.cgi?article=1000&context=rdsc_workshops. Licensed under CC BY
DISCUSSION
• Have you seen a shift to a data-centric research culture in your discipline?
• Is data availability a concern among you or your colleagues?
• Other ideas & questions
• Up next: Open Data Benefits and Challenges
SECTION 2
BENEFITS AND CHALLENGES OF OPEN DATA
WHAT IS OPEN DATA?
HIGH ASPIRATIONS, LOW UPTAKE
• Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities (2003: 572 institutions)
• Recommendations for Access to Data from Publicly Funded Research (2006, all OECD member states)
CULTURE CHANGE?
A survey of 17,000 UK doctoral students
Showed that they are privately open to sharing resources
But in practice, followed behaviors of supervisors
And fear losing future publication opportunities
Researchers of Tomorrow – The Research Behaviour of Generation Y Doctoral Students. London, United Kingdom: JISC. Retrieved from: http://www.jisc.ac.uk/publications/reports/2012/researchers-of-tomorrow.
Tenopir, C, Dalton, E D, Allard, S, Frame, M, Pjesivac, I, Birch, B, Pollock, D and Dorsett, K (2015). Changes in Data Sharing and Data Reuse Practices and Perceptions among Scientists Worldwide. PLoS ONE 10(8): e0134826. DOI: https://doi.org/10.1371/journal.pone.0134826
STRUCTURAL BARRIERS
Small data could initially be published as part of the original publication as tables
As size and complexity of data grew and publishers enforced page limits, data publication
was prohibited or impossible
Klump, J., (2017). Data as Social Capital and the Gift Culture in Research. Data Science Journal. 16, p.14. DOI: http://doi.org/10.5334/dsj-2017-014
WHAT IS OPEN DATA?
The Open Definition (opendefinition.org):
“Open data and content can be freely used, modified, and shared by anyone for
any purpose”
Screencap © Open Data Commons Attribution License:
http://opendatacommons.org/licenses/by/summary/
FAIR Data Principles:
Findable
Accessible
Interoperable
Reusable
WHAT IS OPEN DATA?
OPEN SCIENCE
CC-BY-SA image courtesy PLOS One. Credit: Ainsley Seago.
doi:10.1371/journal.pbio.1001779.g001
Table © Fecher & Friesike. From Open Science: One Term, Five Schools of Thought
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2272036
RATIONALES FOR SHARING RESEARCH DATA
• Stakeholders
  • Researchers
  • Public
  • Journals
  • Funders
  • Libraries
• Motivations to share
  • Needs of research community
  • Needs of the public at large
• Beneficiaries of sharing
  • Those who produce the data
  • Those who use the data
REPLICATION & REPRODUCIBILITY
Alexander, Ruth. “Reinhart, Rogoff... and Herndon: The student who caught out the profs.” BBC News http://www.bbc.com/news/magazine-22223190
“This week, economists have been astonished to find that a famous academic paper often
used to make the case for austerity cuts contains major
errors. Another surprise is that the mistakes, by two eminent
Harvard professors, were spotted by a student doing his
homework.”
REPLICATION/REPRODUCIBILITY
• 90% of respondents to a recent survey in Nature agreed that there is a ‘reproducibility crisis’
• Increasing number of retractions
• Failures to replicate high profile studies
• Underlying causes
  • Mechanized reporting of statistical results
  • Publication bias towards statistically significant results
Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).
REPLICATION/REPRODUCIBILITY
• Transparency and Openness Promotion (TOP) Guidelines (https://osf.io/9f6gx/wiki/Guidelines/)
• Badges for articles with open data
• The Peer Reviewers' Openness Initiative
• Open Science Framework Reproducibility Project (https://osf.io/ezcuj/wiki/home/)
• Science Exchange Reproducibility Initiative (http://validation.scienceexchange.com/#/)
JOURNAL MANDATES
• Mandatory requirement to archive data publicly unless there is a valid reason not to
• Response to low voluntary uptake
• To allow reproduction of reported results
• Ecology, evolution, biology
• These policies do work to increase data archiving
• However, the quality varies…
Roche DG, Kruuk LEB, Lanfear R, Binning SA (2015) Public Data Archiving in Ecology and Evolution: How Well Are We Doing? PLoS Biol 13(11): e1002295. https://doi.org/10.1371/journal.pbio.1002295
JOURNAL MANDATES
Researchers surveyed 100 datasets associated with nonmolecular studies in journals that commonly publish ecological and evolutionary research and have a strong public data archiving (PDA) policy.
Out of these datasets, 56% were incomplete, and 64% were archived in a way that partially or entirely
prevented reuse.
Roche DG, Kruuk LEB, Lanfear R, Binning SA (2015) Public Data Archiving in Ecology and Evolution: How Well Are We Doing? PLoS Biol 13(11): e1002295. https://doi.org/10.1371/journal.pbio.1002295
REPLICATION/REPRODUCIBILITY
“True reproducibility requires deep engagement with the epistemological questions of a given research specialty, and the very different ways in which investigators obtain and value evidence”
“As rationale for sharing research, reproducibility…risks reducing the research process to a set of mechanistic procedures”
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
REPRODUCIBILITY AS RATIONALE
• Where data deposit is required as condition of publication (e.g. Protein Data Bank), researchers will comply
• Data sharing more likely if
  • Materials/documentation are automated
  • Data is not sensitive/no licensing restrictions apply
  • Publication is completed
  • Data is not part of a long-term study integral to researcher’s career
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
DISCUSSION
Is there a reproducibility crisis?
If so, to what extent can data sharing remedy the crisis?
Other questions/comments
Up next: data management plans & sharing for public use
PUBLIC USE
Tax monies should be leveraged to serve the public good
Data should not be hoarded by researchers
Public understanding of research
Evidence-based advocacy
Education & teaching
Citizen science
Policymakers
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
Data Sharing in the Sciences
OPEN GOVERNMENT DATA
White House Office of Science & Technology Policy memo: “Expanding Public Access to the Results of Federally Funded Research” (Feb 2013)
digitally formatted scientific data resulting from unclassified research supported wholly
or in part by Federal funding should be stored and publicly accessible to search, retrieve,
and analyze.
OPEN GOVERNMENT DATA
• Data is hard (or even impossible) to find
• Data cannot be readily used
• Unavailable, unclear, restrictive licensing terms
https://blog.okfn.org/files/2017/06/FinalreportTheStateofOpenGovernmentDatain2017.pdf
Global Open Data Index (GODI): https://index.okfn.org/
“Measures the openness of government data according to the Open Definition”
DATA MANAGEMENT POLICIES
• NSF
• NIH
• NEH
• NASA
• NOAA
• CDC
• Gates Foundation
http://dms.data.jhu.edu/data-management-resources/plan-research/funders-data-sharing-requirement/funder-data-related-mandates-and-public-access-plans/
DATA MANAGEMENT PLANS
• Roles and responsibilities
• Description of data and metadata
• Storage, backup and security
• Provisions for privacy, confidentiality, intellectual property rights and other rights
• Data access and sharing
• Data reuse, redistribution and production of derivatives
• Archiving and preservation
University of Iowa Libraries. Data Management Plans. Licensed under CC BY. http://guides.lib.uiowa.edu/c.php?g=132111&p=900990
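One way to put the list above to work is a simple completeness check on a draft plan. The Python sketch below is a hypothetical helper, not part of any funder's or library's tooling; the section names only paraphrase the components listed above.

```python
# Section names paraphrase the plan components listed on the slide (an assumption,
# not an official template).
REQUIRED_SECTIONS = (
    "roles and responsibilities",
    "description of data and metadata",
    "storage, backup and security",
    "privacy, confidentiality and intellectual property",
    "data access and sharing",
    "reuse, redistribution and derivatives",
    "archiving and preservation",
)


def missing_sections(draft: dict) -> list:
    """Return the required sections that are absent or left blank in a draft plan."""
    return [name for name in REQUIRED_SECTIONS
            if not str(draft.get(name) or "").strip()]


if __name__ == "__main__":
    # Hypothetical draft with only two sections filled in.
    draft_plan = {
        "roles and responsibilities": "PI oversees data; a graduate assistant curates files.",
        "data access and sharing": "Deposit in a repository after publication.",
    }
    for name in missing_sections(draft_plan):
        print("Missing:", name)
```

A check like this only confirms that each section has some text; whether the content is adequate is still judged by reviewers and program staff.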
NSF DATA SHARING POLICY
What constitutes reasonable data management and access
…and reasonable length of time
will be determined by the community of interest through the process of peer review and program
management
NSF Data Management & Sharing FAQ. https://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp
NSF DATA SHARING POLICY
Annual reports must include information on the progress on data management and sharing of
research products
Final project reports are to contain a more thorough updating of the original DMP, including
how your data is archived.
http://dms.data.jhu.edu/data-management-resources/plan-research/funders-data-sharing-requirement/funder-data-related-mandates-and-public-access-plans/
PUBLIC HEALTH
WHO seeks a paradigm shift in the approach to information sharing in emergencies, from one limited by embargoes set for publication timelines, to open sharing using modern fit-for-purpose pre-publication platforms.
Opting in to data and results sharing should be the default practice and the onus should be placed on data generators and stewards at the local, national and international level to explain any decision to opt out from sharing data and results during public health emergencies
World Health Organization “Developing global norms for sharing data and results during public health emergencies”
PUBLIC HEALTH
Many publishers, NGOs and research funders committed to free research sharing in light of the
Zika outbreak
Wellcome Trust. “Statement on data sharing in public health emergencies.”
Journal signatories will make all content concerning the Zika virus free to access. Any data or
preprint deposited for unrestricted dissemination ahead of submission of any paper will not pre-empt its
publication in these journals.
Funder signatories will require researchers undertaking work relevant to public health emergencies to
set in place mechanisms to share quality-assured interim and final data as rapidly and widely as
possible, including with public health and research communities and the World Health Organization.
Wiley, Taylor and Francis, and Elsevier are not signatories.
SHERPA JULIET
Searchable database and single focal point of up-to-date information concerning funders'
policies and their requirements on open access, publication and data archiving.
http://v2.sherpa.ac.uk/juliet/
DISCUSSION
Do you have experience with data management plans?
Up next: Data reuse
DATA REUSE
ASKING NEW QUESTIONS OF EXTANT DATA
• Encourages meta-analyses & data combination
• Exploring new questions and identifying new relationships
HUBBLE SPACE TELESCOPE DATA REUSE
• General Observer (GO) paper: At least one author was investigator on the GO proposal that obtained the data.
• Archival Research (AR) paper: No overlap between the paper authors and investigators on the GO proposal that obtained the data.
• GO+AR: Combination of GO data sets with AR data sets.
Adapted from Jonathan Tedds Distinguished Lecture at DLab, UC Berkeley, 12 Sep 2013: "The Open Research Challenge: Peer Review and Publication of Research Data". Licensed under CC BY.
Royal Society June 2012, Science as an Open Enterprise, http://royalsociety.org/policy/projects/science-public-enterprise/report/
Papers based upon reuse of archived observations now
exceed those based on the use described
in the original proposal.
https://archive.stsci.edu/hst/bibliography/pubstat.html
ASKING NEW QUESTIONS OF EXTANT DATA
• Assessing veracity requires domain expertise & misinterpretation is a serious risk
• Depends on extensive documentation & description
• The farther the user is from the point of data origin
  • The more documentation required
  • The more effort required by reuser
  • Greater the risk of misinterpretation
• Benefits prospective users more than producers
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
ADVANCING RESEARCH AND INNOVATION
• Data-intensive fields (astronomy, social sciences, economics)
• Comparisons across time and space (ecology, biology, sociology)
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
ADVANCING RESEARCH AND INNOVATION
• Maximizing the use of data
• Increasing the impact of findings
• Progressing the state of research
• Laying broader foundation for knowledge
• Diversifying perspectives
Fischer, B.A., & Zigmond, M.J. (2010). The essential nature of sharing in science. Science and Engineering Ethics, 16(4), 783–799.
DATA SHARING ASSOCIATED WITH CITATION IMPACT
Examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data.
The 48% of trials with publicly available microarray data received 85% of the aggregate citations
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. https://doi.org/10.1371/journal.pone.0000308
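The two percentages above imply an unadjusted per-trial citation ratio, which the short Python sketch below works out. It uses only the 48% and 85% figures from the slide; the function name is illustrative, and the result is a raw aggregate ratio that does not control for journal, publication date, or any other factor.

```python
def per_trial_citation_ratio(share_of_trials: float, share_of_citations: float) -> float:
    """Average citations per data-sharing trial divided by the average per non-sharing trial."""
    sharing_rate = share_of_citations / share_of_trials
    non_sharing_rate = (1.0 - share_of_citations) / (1.0 - share_of_trials)
    return sharing_rate / non_sharing_rate


if __name__ == "__main__":
    ratio = per_trial_citation_ratio(0.48, 0.85)
    print(f"Unadjusted per-trial citation ratio: {ratio:.1f}")
```

A ratio well above 1 means the average trial with public data drew more citations than the average trial without, but a raw comparison like this says nothing about why.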
DATA SHARING ASSOCIATED WITH CITATION IMPACT
Does not imply causation
• But there may be mechanisms by which data sharing did stimulate greater citations
• Exposure
• Reanalysis
• Enthusiasm and synergy around a specific research question
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. https://doi.org/10.1371/journal.pone.0000308
SECTION 3
CONCERNS AND OPEN QUESTIONS
RESEARCHER CONCERNS
• Data is competitive advantage
• Data is intellectual capital
• Time & effort required to prepare data for archiving
• Lack of recognition & other extrinsic incentives
• Concerns about data misinterpretation
Roche, D. G., Kruuk, L. E. B., Lanfear, R., & Binning, S. A. (2015). Public data archiving in ecology and evolution: How well are we doing? PLoS Biology, 13(11). doi:http://dx.doi.org/10.1371/journal.pbio.1002295
OPEN QUESTIONS
• What data to share?
• What is sharing?
• What is interpretable and reusable?
• How to reward/give credit?
• How to document without extensive labor?
• How to handle misuse/misinterpretation?
• Restricting access/de-identification
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
OPEN QUESTIONS
• Lack of demonstrated demand for research data outside genomics, climate science, astronomy, social science, demographics
• How open is it?
• Who owns the copyright? Is data public domain?
• How to validate data?
• Preserving data
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
SECTION 4
WHERE TO ARCHIVE?
• Downloadable files on author’s webpage
• Repositories
• Publishers
• Torrents (academictorrents.com)
• APIs (Application Programming Interfaces)
DATA REPOSITORIES
• Institutional Repositories
• Subject Repositories
http://oad.simmons.edu/oadwiki/Data_repositories
DATA REPOSITORIES
• Government (data.gov, data.worldbank.org)
• Multidisciplinary (figshare)
ACCESSING & USING OPEN DATA
• Open source software: R
  • rOpenSci (ropensci.org)
  • rOpenGov (https://ropengov.github.io/projects/)
• Run My Code (http://www.runmycode.org)
• Google Public Data Explorer (https://www.google.com/publicdata)
www.r-project.org
QUESTIONS?