April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific...

NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management April 23, 2014 Speakers: Jan Brase, Jared Lyle, Mercè Crosas, Michael Witt, Christine Borgman, Adriane Chapman, David Wilcox, Judy Ruttenberg http://www.niso.org/news/events/2014/virtual/data_d

description

About the Virtual Conference

With the expansion of digital data collection and the increased expectations of data sharing, researchers are turning to their libraries or institutional repositories as a place to store and preserve that data. Many institutions have created such data management services and see the data curation role as a growing and important element of their service portfolio. While some of the experience in managing other types of digital resources is transferrable, the management of large-scale scientific data has many special requirements and challenges. From metadata collection and cataloging data sources, to identification, discovery, and preservation, best practices and standards are still in their infancy. This Virtual Conference will explore in greater depth than traditional webinars some of the practical lessons from those who have implemented data management and developed best practices, as well as provide some insight into the evolving issues the community faces. It will include discussions related to certification of trusted repositories, provenance and identification issues around data, data citation, preservation, and the work of several repository networks to advance distribution of scientific information.

Page 1: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

April 23, 2014

Speakers:

Jan Brase, Jared Lyle, Mercè Crosas, Michael Witt, Christine Borgman, Adriane Chapman, David Wilcox, Judy Ruttenberg

http://www.niso.org/news/events/2014/virtual/data_deluge/

NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Agenda

11:00 a.m. - 11:10 a.m. Introduction
Todd Carpenter, Executive Director, NISO

11:10 a.m. - 12:00 p.m. Keynote Speaker: DataCite – A Global Approach for Better Data Sharing
Jan Brase, Ph.D., German National Library of Science and Technology

12:00 p.m. - 12:30 p.m. Guidelines and Resources for Office of Science and Technology Policy (OSTP) Data Access Plans
Jared Lyle, Director of Data Curation Services, Interuniversity Consortium for Political and Social Research (ICPSR), University of Michigan

12:30 p.m. - 1:00 p.m. Joint Declaration of Data Citation Principles: Implementation and Compliance in the Dataverse Repository
Mercè Crosas, Ph.D., Director of Data Science, Institute for Quantitative Social Science (IQSS), Harvard University

1:00 p.m. - 1:45 p.m. Lunch Break

1:45 p.m. - 2:15 p.m. Purdue University Research Repository (PURR): A Commitment to Supporting Researchers
Michael Witt, Head, Distributed Data Curation Center (D2C2); Associate Professor of Library Science, Purdue University Research Repository (PURR)

2:15 p.m. - 2:45 p.m. The Roles of Data Citation in Data Management
Christine L. Borgman, Professor & Presidential Chair in Information Studies, UCLA

2:45 p.m. - 3:15 p.m. Is This Data Fit for My Use? The Challenges and Opportunities Data Provenance Presents
Adriane Chapman, MITRE

3:15 p.m. - 3:30 p.m. Afternoon Break

3:30 p.m. - 4:00 p.m. A Durable Space: Technologies for Accessing Our Collective Digital Heritage
David Wilcox, Product Manager, DuraSpace

4:00 p.m. - 4:30 p.m. The SHared Access Research Ecosystem (SHARE) Project: A Joint Initiative of ARL, AAU, and APLU
Judy Ruttenberg, Program Director for Transforming Research Libraries, Association of Research Libraries (ARL)

4:30 p.m. - 5:00 p.m. Conference Roundtable
Moderated by Todd Carpenter, Executive Director, NISO

DataCite – A global approach for better data sharing

Jan Brase, DataCite

NISO virtual conference, April 23rd 2014

A thousand years ago: science was empirical

describing natural phenomena

Last few hundred years: theoretical branch

using models, generalizations

Last few decades: a computational branch

simulating complex phenomena

Today: data exploration (eScience)

unify theory, experiment, and simulation

Jim Gray, eScience Group, Microsoft Research

Science Paradigms

Scientific Information is more than a journal article or a book

Libraries should open their catalogues to any kind of information

The catalogue of the future is NOT ONLY a window to the library's holdings, but also

a portal in a network of trusted providers of scientific content

Consequences for Libraries

We do not have it

BUT

We know where you can find it

And here is the link to it!

Simulation

Scientific Films

3D Objects

Grey Literature

Research Data

Software

Including non-classical publications

Why is this a role for libraries?

• Libraries have a history of bringing scientific information to the public

• Libraries have a tendency to be persistent

• A project will be forgotten in 40 years; the library will very likely still exist then

• Libraries are very trustworthy organisations

DataCite

High visibility of the content

Easy re-use and verification.

Scientific reputation for the collection and documentation of content (Citation Index)

Encouraging the Brussels declaration on STM publishing

Avoiding duplications

Motivation for new research

What if any kind of scientific content were citable?

How to achieve this?

Science is global
• It needs global standards
• Global workflows
• Cooperation of global players

Science is carried out locally
• By local scientists
• Being part of local infrastructures
• Having local funders

Global consortium carried by local institutions

focused on improving the scholarly infrastructure around datasets and other non-textual information

focused on working with data centres and organisations that hold content

Providing standards, workflows and best-practice

Initially, but not exclusively, based on the DOI system

Founded December 1st 2009 in London

DataCite

DataCite structure

[Diagram; recoverable labels only. Boxes: International DOI Foundation; DataCite; Managing Agent (TIB); Member Institution (shown twice, each with three Data Centres); Associate Stakeholder. Relationship labels: Member; Carries; Works with.]

DataCite members

1. Technische Informationsbibliothek (TIB)
2. Canada Institute for Scientific and Technical Information (CISTI)
3. California Digital Library, USA
4. Purdue University, USA
5. Office of Scientific and Technical Information (OSTI), USA
6. Library of TU Delft, The Netherlands
7. Technical Information Center of Denmark
8. The British Library
9. ZB Med, Germany
10. ZBW, Germany
11. Gesis, Germany
12. Library of ETH Zürich
13. L’Institut de l’Information Scientifique et Technique (INIST), France
14. Swedish National Data Service (SND)
15. Australian National Data Service (ANDS)
16. Conferenza dei Rettori delle Università Italiane (CRUI)
17. National Research Council of Thailand (NRCT)
18. The Hungarian Academy of Sciences
19. University of Tartu, Estonia
20. Japan Link Center (JaLC)
21. South African Environmental Observation Network (SAEON)
22. European Organisation for Nuclear Research (CERN)

Affiliated members:

1. Digital Curation Center (UK)
2. Microsoft Research
3. Interuniversity Consortium for Political and Social Research (ICPSR)
4. Korea Institute of Science and Technology Information (KISTI)
5. Beijing Genomics Institute (BGI)
6. IEEE
7. Harvard University Library
8. World Data System (WDS)
9. GWDG

[Figure: sediment-core profiles for cores PS1389-3, PS1390-3, PS1431-1, PS1640-1 and PS1648-1, each plotting IRD (grav/10 cm3), Sand (%), CaCO3 (%), TOC (%), Radio (%/sand) and Smect (%/clay) against age (kyr; max. 233.55 kyr, PS1389-3ff).]

[Figure: map covering 54°0' to 55°30' and 11° to 15°, scale 1:2695194 at latitude 0°. Legend: World vector shore line; Grain size class KOLP A; Grain size class KOEHN2; Grain size class KOEHN; Geochemistry; Grain size class KOLP B; Grain size class KOLP DIN; 20 m. Source: Baltic Sea Research Institute, Warnemünde.]

Earthquake events => doi:10.1594/GFZ.GEOFON.gfz2009kciu

Climate models => doi:10.1594/WDCC/dphase_mpeps

Sea bed photos => doi:10.1594/PANGAEA.757741

Distributed samples => doi:10.1594/PANGAEA.51749

Medical case studies => doi:10.1594/eaacinet2007/CR/5-270407

Computational model => doi:10.4225/02/4E9F69C011BC8

Audio record => doi:10.1594/PANGAEA.339110

Grey Literature => doi:10.2314/GBV:489185967

Videos => doi:10.3207/2959859860

What type of data are we talking about?
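
Every identifier on this slide has the same shape: a prefix starting with "10." that names the registrant, a slash, and a suffix chosen by the data centre (which may itself contain slashes or colons). A minimal Python sketch of splitting such a DOI name and turning it into a resolvable link follows. The function names `split_doi` and `resolver_url` are hypothetical helpers for illustration, not part of any DataCite tool, and the default proxy is the http://dx.doi.org address used elsewhere in these slides.

```python
from typing import Tuple


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI name like 'doi:10.1594/PANGAEA.757741' into (prefix, suffix)."""
    name = doi.strip()
    # Drop an optional 'doi:' scheme label, as written on the slide.
    if name.lower().startswith("doi:"):
        name = name[len("doi:"):]
    # Only the FIRST slash separates prefix and suffix; the suffix may contain more.
    prefix, sep, suffix = name.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI name: {doi!r}")
    return prefix, suffix


def resolver_url(doi: str, proxy: str = "http://dx.doi.org") -> str:
    """Build a link that a DOI proxy can resolve (proxy address is an assumption)."""
    prefix, suffix = split_doi(doi)
    return f"{proxy}/{prefix}/{suffix}"


if __name__ == "__main__":
    for example in ("doi:10.1594/PANGAEA.757741", "doi:10.1594/eaacinet2007/CR/5-270407"):
        print(split_doi(example), resolver_url(example))
```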

Anything that is the foundation of further research

is research data

Data is evidence

Over 3,200,000 DOI names registered so far.

290 data centers.

10,000,000 resolutions in 2013.

DataCite Metadata schema published (in cooperation with all members) http://schema.datacite.org

DataCite MetadataStore

http://search.datacite.org

DataCite in 2014

OAI and Statistics

OAI Harvester

http://oai.datacite.org

DataCite statistics (resolution and registration)

http://stats.datacite.org
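
An OAI harvester is normally queried with OAI-PMH requests, which are plain HTTP GETs carrying a `verb` parameter. The sketch below only builds such a request URL and does not contact the service. The `/oai` endpoint path and the `oai_dc` metadata prefix are assumptions (the slide gives only the host name), and `list_records_url` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: the OAI-PMH endpoint lives under /oai on the host named on the slide.
OAI_BASE = "http://oai.datacite.org/oai"


def list_records_url(metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None,
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (standard OAI-PMH 2.0 parameters)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # restrict to one set, e.g. one data centre
    if from_date:
        params["from"] = from_date    # selective harvesting by datestamp (YYYY-MM-DD)
    return f"{OAI_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    print(list_records_url(from_date="2014-01-01"))
```

A harvester would fetch this URL and keep following the `resumptionToken` returned in each response until the list is exhausted.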

DataCite Content Service

Service for displaying DataCite metadata

Different formats (BibTeX, RIS, RDF, etc.)

Content Negotiation (through MIME type)

• Access through DOI proxy (http://dx.doi.org)

• First implemented by CNRI and CrossRef

Documentation:

http://www.crosscite.org/cn/

Content negotiation

Optimized for machine-to-machine (m2m) communication using the Accept header of the HTTP protocol

curl -L -H "Accept: MIME_TYPE" http://dx.doi.org/DOI

Try out a shortcut in any web browser:

http://data.datacite.org/MIME_TYPE/DOI

http://data.crossref.org/DOI
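
The curl command above can be mirrored in Python's standard library. This is a minimal sketch that only constructs the request and the browser shortcut URL; it does not go over the network. `negotiation_request` and `shortcut_url` are hypothetical helper names, and the DOI and MIME type in the usage lines are the ones that appear in the slides that follow.

```python
import urllib.request


def negotiation_request(doi: str, mime_type: str,
                        proxy: str = "http://dx.doi.org") -> urllib.request.Request:
    """Equivalent of: curl -L -H "Accept: MIME_TYPE" http://dx.doi.org/DOI"""
    return urllib.request.Request(f"{proxy}/{doi}", headers={"Accept": mime_type})


def shortcut_url(doi: str, mime_type: str,
                 base: str = "http://data.datacite.org") -> str:
    """Browser shortcut form: http://data.datacite.org/MIME_TYPE/DOI"""
    return f"{base}/{mime_type}/{doi}"


if __name__ == "__main__":
    req = negotiation_request("10.5524/100005", "application/rdf+xml")
    print(req.full_url, req.get_header("Accept"))
    print(shortcut_url("10.5524/100005", "application/rdf+xml"))
    # To actually fetch the metadata (network access needed, redirects are followed):
    #   with urllib.request.urlopen(req) as response:
    #       body = response.read().decode("utf-8")
```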

Resolving to the citation

http://data.datacite.org/application/x-datacite+text/10.5524/100005

Li, J; Zhang, G; Lambert, D; Wang, J (2011): Genomic data from Emperor penguin. GigaScience. http://dx.doi.org/10.5524/100005

Resolving to the RDF metadata

http://data.datacite.org/application/rdf+xml/10.5524/100005

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:j.0="http://purl.org/dc/terms/" >
  <rdf:Description rdf:about="http://dx.doi.org/10.5524/100005">
    <j.0:identifier>10.5524/100005</j.0:identifier>
    <j.0:creator>Li, J</j.0:creator>
    <j.0:creator>Zhang, G</j.0:creator>
    <j.0:creator>Wang, J</j.0:creator>
    <owl:sameAs>doi:10.5524/100005</owl:sameAs>
    <owl:sameAs>info:doi/10.5524/100005</owl:sameAs>
    <j.0:publisher>GigaScience</j.0:publisher>
    <j.0:creator>Lambert, D</j.0:creator>
    <j.0:date>2011</j.0:date>
    <j.0:title>Genomic data from the Emperor penguin (Aptenodytes forsteri)</j.0:title>
  </rdf:Description>
</rdf:RDF>
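
To show what a machine client can do with this response, here is a small sketch that reads the RDF/XML with Python's standard XML parser and pulls out the Dublin Core fields. It treats the document as plain XML rather than using a full RDF library, which is a simplification, and `summarize` is a hypothetical helper name. The embedded sample is a copy of the record on this slide.

```python
import xml.etree.ElementTree as ET

# Copy of the RDF/XML shown on the slide for doi:10.5524/100005.
RDF_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:j.0="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://dx.doi.org/10.5524/100005">
    <j.0:identifier>10.5524/100005</j.0:identifier>
    <j.0:creator>Li, J</j.0:creator>
    <j.0:creator>Zhang, G</j.0:creator>
    <j.0:creator>Wang, J</j.0:creator>
    <owl:sameAs>doi:10.5524/100005</owl:sameAs>
    <owl:sameAs>info:doi/10.5524/100005</owl:sameAs>
    <j.0:publisher>GigaScience</j.0:publisher>
    <j.0:creator>Lambert, D</j.0:creator>
    <j.0:date>2011</j.0:date>
    <j.0:title>Genomic data from the Emperor penguin (Aptenodytes forsteri)</j.0:title>
  </rdf:Description>
</rdf:RDF>"""

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Prefix names here are local choices; only the namespace URIs must match the document.
NS = {"rdf": RDF_NS, "dcterms": "http://purl.org/dc/terms/"}


def summarize(rdf_xml: str) -> dict:
    """Extract subject URI, title, creators, publisher and date from one rdf:Description."""
    root = ET.fromstring(rdf_xml)
    desc = root.find("rdf:Description", NS)
    return {
        "about": desc.get(f"{{{RDF_NS}}}about"),
        "title": desc.findtext("dcterms:title", namespaces=NS),
        "creators": [el.text for el in desc.findall("dcterms:creator", NS)],
        "publisher": desc.findtext("dcterms:publisher", namespaces=NS),
        "date": desc.findtext("dcterms:date", namespaces=NS),
    }


if __name__ == "__main__":
    print(summarize(RDF_XML))
```

Note that the creators come back in document order, which differs from the author order in the formatted citation; RDF statements carry no inherent ordering.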

Example of use

This allows persistent identification of RDF statements!

Implemented for all of the more than 65 million CrossRef and DataCite DOI names

Example of use:

DOI Citation Formatter

http://www.crosscite.org/citeproc/

2012: STM, CrossRef and DataCite Joint Statement

1. To improve the availability and findability of research data, the Signers encourage authors of research papers to deposit researcher-validated data in trustworthy and reliable Data Archives.

2. The Signers encourage Data Archives to enable bi-directional linking between datasets and publications by using established and community-endorsed unique persistent identifiers such as database accession codes and DOIs.

3. The Signers encourage publishers and data archives to make visible or increase visibility of these links from publications to datasets and vice versa.

Example

The dataset:
Storz, D et al. (2009): Planktic foraminiferal flux and faunal composition of sediment trap L1_K276 in the northeastern Atlantic. http://dx.doi.org/10.1594/PANGAEA.724325

Is supplement to the article:
Storz, David; Schulz, Hartmut; Waniek, Joanna J; Schulz-Bull, Detlef; Kucera, Michal (2009): Seasonal and interannual variability of the planktic foraminiferal flux in the vicinity of the Azores Current. Deep-Sea Research Part I-Oceanographic Research Papers, 56(1), 107-124, http://dx.doi.org/10.1016/j.dsr.2008.08.009
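
The bi-directional linking in this example can be sketched as a pair of typed links, where each direction is derived from the other. The relation names `IsSupplementTo` and `IsSupplementedBy` are modeled on the relation-type vocabulary of the DataCite Metadata Schema; that these exact terms are recorded for this dataset is an assumption, and the `Link` class is hypothetical.

```python
from dataclasses import dataclass

# Each relation type paired with its inverse (assumed vocabulary, see lead-in).
INVERSE = {
    "IsSupplementTo": "IsSupplementedBy",
    "IsSupplementedBy": "IsSupplementTo",
}


@dataclass(frozen=True)
class Link:
    """A typed link between two DOI names."""
    source: str
    relation: str
    target: str

    def inverse(self) -> "Link":
        """Return the same link seen from the other side."""
        return Link(self.target, INVERSE[self.relation], self.source)


if __name__ == "__main__":
    dataset_to_article = Link("10.1594/PANGAEA.724325", "IsSupplementTo",
                              "10.1016/j.dsr.2008.08.009")
    # The data archive stores the first link; the publisher can display the second.
    print(dataset_to_article)
    print(dataset_to_article.inverse())
```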

Next steps

ODIN project with ORCID.

http://datacite.labs.orcid-eu.org/

MoU with Thomson Reuters to cooperate on the Data Citation Index

DataCite plugin for next DSpace release (early 2014)

Cooperation

MoU with ORCID

Agreement with Re3Data and DataBib to include their service in 2016

MoU with RDA to become organisational affiliate

2014 Annual conference

Let us get back to libraries

The wave

Growth of Information –

Diversity of media types and formats

User requirements, e.g. Science 2.0, collaborative networks, social media

A threat?

Information overload is only a problem for manual curation.

Google is not complaining about data deluge—they’re constantly trying to get more data.

The more data you throw, the better the filter gets.

To develop and maintain these tools is a classical task for libraries!

Don’t turn off the taps, build boats.

It is not only a challenge …

… it is an opportunity

We all should ride the wave …

Thank you!

Guidelines and Resources for OSTP Data Access Plans

NISO Webinar, April 2014

www.icpsr.umich.edu/datamanagement

The OSTP Memo: Guidelines for Response

• Released February 2013, this memo directs funding agencies with an annual R&D budget over $100 million to develop a public access plan for disseminating the results of their research

• ICPSR stresses that standards and guidelines for many of the requirements currently exist

• The slides to follow provide an overview of the access plan elements including guidelines and resources on how to respond to meet digital data requirements in the memo

The OSTP Memo – A Review

• Released February 22, 2013
• A concern for investment: “Policies that mobilize these publications and data for re-use through preservation and broader public access also maximize the impact and accountability of the Federal research investment.”
• Federal agencies with over $100 M annually in R&D expenditures to develop plans to support increased public access to the results of research funded by the Federal Government
• Plans to contain eight points

The Eight Points of the Plan

1. Strategy for leveraging existing archives
2. Strategy to improve the public’s ability to locate and access digital data
3. Approach to optimize search, archival, and dissemination features that encourage innovation in accessibility & interoperability and ensure long-term stewardship
4. A plan to notify awardees & researchers of their obligations
5. Strategy for measuring and enforcing compliance with the plan
6. Identification of resources within the existing agency budget to implement plan
7. Timeline for implementation
8. Identification of special circumstances that prevent the agency from meeting memo objectives

Data Portion of Memo - 13 Elements

• The portion of the memo describing objectives for public access to data stresses 13 elements for a public access plan
• The elements are also summarized online within ICPSR’s Web site: http://icpsr.umich.edu/content/datamanagement/ostp.html

http://sites.nationalacademies.org/DBASSE/CurrentProjects/DBASSE_082378

http://www.icpsr.umich.edu/files/ICPSR/ICPSRComments.pdf

RDAP 2014 Panel: Funding agency (NOAA, NSF, NIH) responses to federal requirements for public access to research results

Wendy Kozlowski (Cornell), Moderator

http://www.slideshare.net/asist_org/rdap14-ostp-panel-introduction

http://www.slideshare.net/asist_org/rdap-3-2714thakur

ICPSR – a 50-Year History of Providing Access to Research Data

Established in 1962, ICPSR maintains and shares over 8,600 research datasets and hosts 16 public-access specialized collections of data funded by various government agencies and foundations. Our mission:

ICPSR advances and expands social and behavioral research, acting as a global leader in data stewardship and providing rich data resources and responsive educational opportunities for present and future generations.

ICPSR’s Data Management & Curation Goals

• Quality - Data at ICPSR are enhanced with meaningful information to make them complete, self-explanatory, and usable for future researchers

• Access – Sought by over 730 member institutions and indexed by all the major search engines, ICPSR data are easily discoverable and widely accessible to the public.

• Citation - By providing standardized and well-recognized data citations, ICPSR ensures that data producers receive credit for their archived data

• Preservation – For over 50 years, ICPSR has preserved its data resources for the long-term, guarding against deterioration, accidental loss, and digital obsolescence

• Confidentiality - Stringent protections are in place for securing and distributing sensitive data

• Educational Support – ICPSR has a long tradition of supporting training in quantitative methods, scientific data management, and resources for instruction

http://icpsr.umich.edu/datamanagement/ostp.html

ICPSR’s Guidelines for OSTP Data Access Plan Page

Maximize Access

"Maximize access, by the general public and without charge, to digitally formatted scientific data created with Federal funds"

• Increasing access to research data prevents the duplication of effort, provides accountability and verification of research results, and increases opportunities for innovation and collaboration.

• Finding and accessing data in repositories requires descriptive metadata ("data about data") in standard, machine-actionable form. Metadata help search engines find data, and help researchers understand the context of data collections.

• Standards already exist: see Data Documentation Initiative – http://www.ddialliance.org/

Maximize Access cont.

• Access also involves knowing how to interpret the data. Incomplete data limit reuse. Obsolete data formats can be unreadable.
– Repositories 'curate' or enhance data to make it complete, self-explanatory, and usable for future researchers. This includes adding descriptive labels, correcting coding errors, gathering documentation, and standardizing the final versions of files. This is called “data curation.”
– Like museums that curate art or artifacts for study and understanding now and in the future, data archives curate data with the same goals.
• Data curation is crucial to maximizing access. Resources for curating data:
– ICPSR's Guide to Social Science Data Preparation and Archiving
– UK Data Archive's Managing and Sharing Data guide.

Protect Confidentiality and Privacy

• It is critically important to protect the identities of research subjects.
• Disclosure risk is a term that is often used for the possibility that a data record from a study could be linked to a specific person.
• Concerns about disclosure risk have grown as more datasets have become available online, and it has become easier to link research datasets with publicly available external databases.

Protect Confidentiality and Privacy cont.

Protecting confidentiality of research subjects is not a viable argument for not sharing data. Infrastructure, including virtual and physical data enclaves, already exists:

• Restricted-Use Data are made available for research purposes for use by investigators who agree to stringent conditions for the use of the data and its physical safekeeping.
• Enclave Data are those datasets which present especially acute disclosure risks. They can be accessed only on-site in ICPSR's physical data enclave in Ann Arbor. Investigators must be approved. Their notes and analytic output are reviewed by ICPSR staff.

Balance Demands of Long-term Preservation and Access

• Preserving digital data requires much more than storing files on a server, desktop, or in the cloud!
• Digital preservation is the active and ongoing management of digital content to lengthen the lifespan and mitigate against loss, including physical deterioration, format obsolescence, and hardware and software failure.

Balance Demands of Long-term Preservation and Access cont.

• Not all data are worth preserving indefinitely; less valuable or easily producible data may be preserved for shorter periods.
• Establish selection and appraisal guidelines that make it clear what to save or discard.
– Selection criteria consider factors like availability, confidentiality, copyright, quality, file format, and financial commitment.

Use of Data Management Plans

• Data management plans describe how researchers will provide for long-term preservation of, and access to, scientific data in digital formats.
• Data management plans provide opportunities for researchers to manage and curate their data more actively from project inception to completion.
• See ICPSR's resource: Guidelines for Effective Data Management Plans

Include Cost of Data Management in Funding Proposals

• Data management services carry real costs, ranging from personnel to storage to software.
• Just as maintenance costs are routinely built into physical infrastructure development, data management costs should be built into data development.
• Long-term access to data requires durable institutions that plan on a scale of decades and even generations.
• Cost resources:
– DataONE's Provide budget information for your data management plan
– UK Data Archive's Costing Tool: Data Management Planning.

Evaluate Data Management Plans & Ensure Compliance

• Plans help researchers prepare for working with and preserving data, help repositories get ready to accession data and provide access, and help agencies understand the community's needs for archiving and access. Evaluation helps refine plans so they are realistic and attainable.
• If data management plans are to be a standard component of funding applications, funding recipients should be held accountable for deviations from the originally stated plans.

Promote Public Deposit of Data

• Public deposit of data helps to ensure the long-term accessibility and preservation of the data.
• It removes the burden of ongoing maintenance and care (and user support) from the researcher and provides a stable system to which data can be entrusted.
• Many sustainable online repositories are already available to host and archive research data. These may include discipline-specific repositories, archives administered by funding agencies, or institutional repositories.
• Databib, a searchable directory of over 500 research data repositories, can help locate relevant repositories by subject area.

Page 71: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Preserve Intellectual Property Rights and Commercial Interests

Original research may be both commercially valuable and proprietary. There are several approaches to managing these interests, including:

– Tailor copyright and patent licenses, such as through Creative Commons licenses

– Establish an embargo period or delayed dissemination on distribution.

Page 72: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Private-sector Cooperation to Improve Access

Encourage cooperation with the private sector to improve data access and compatibility. Issues to consider:

• What funding structures will be in place to ensure that both organizations involved are benefiting from the partnership?
• Will the partnership require any rights to be transferred to the private organization?
• How does private-sector cooperation affect access restrictions and intellectual property concerns?

Page 73: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Mechanisms for Identification & Attribution of Data

• Properly citing data encourages the replication of scientific results, improves research standards, guarantees persistent reference, and gives proper credit to data producers.

• Citing data is straightforward. Each citation must include the basic elements that allow a unique dataset to be identified over time: title, author, date, version, and persistent identifier.

• Resources: ICPSR's Data Citations page , IASSIST's Quick Guide to Data Citation, DataCite.
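The five basic citation elements named above can be treated as a simple completeness checklist. The following is a minimal sketch, not ICPSR's or DataCite's tooling; the class name, field names, and sample values are all illustrative assumptions.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DataCitation:
    """The five basic elements from the slide (field names are hypothetical)."""
    title: str
    author: str
    date: str
    version: str
    persistent_id: str


def missing_elements(citation: DataCitation) -> List[str]:
    """Return the names of any citation elements that were left blank."""
    return [f.name for f in fields(citation)
            if not getattr(citation, f.name).strip()]


# Placeholder record with no persistent identifier yet.
example = DataCitation(
    title="Example Survey Data",
    author="Doe, Jane",
    date="2014",
    version="V1",
    persistent_id="",
)
print(missing_elements(example))
```

A check like this could run at deposit time, before a citation is displayed to users.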

Page 74: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Data Stewardship Workforce Development

In coordination with other agencies and the private sector, support training, education, and workforce development related to scientific data management, analysis, storage, preservation, and stewardship. Recent data stewardship workforce development in the United States has included:

• Digital Preservation Outreach and Education, from the Library of Congress
• Digital Preservation Management tutorial, from Cornell University, ICPSR, and MIT
• DigCCurr, from the University of North Carolina

Page 75: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Data Stewardship Workforce Development cont.

ICPSR hosts data stewardship courses as part of its Summer Program in Quantitative Methods of Social Research. These include:

• Curating and Managing Research Data for Re-Use
• Assessing and Mitigating Disclosure Risk: Essentials for Social Science
• Providing Social Science Data Services: Strategies for Design and Operation

Page 76: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Long-term Support for Repository Development

• ICPSR advocates long-term funding for specialized, long-lived, trustworthy, and sustainable repositories that can mediate between the needs of scientific disciplines and data preservation requirements.

• As digital data management becomes an increasingly important part of scientific research, funding agencies must contribute to the developing ecosystem of services and technologies that support access to and preservation of data.

• For more information, including various long-term funding models, see ICPSR’s 2013 position paper – “The Price of Keeping Knowledge”

Page 77: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Get More Information

• Visit ICPSR’s Data Management & Curation site: http://www.icpsr.umich.edu/datamanagement
• Contact us:
– [email protected]
– (734) 647-2200

Page 78: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Acknowledgements:

Linda Detterman, Emily Reynolds, Gavin Strassel

Page 79: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Thank you!

[email protected]

Page 80: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Joint Declaration of Data Citation Principles: Implementation and Compliance in the Dataverse Repository

Mercè Crosas, Ph.D.

Twitter: @mercecrosas

Director of Data Science

Institute for Quantitative Social Science, Harvard University

NISO Virtual Conference, April 23, 2014

Page 81: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

A brief History of Data Citation

Altman M., Crosas M., 2014, “The Evolution of Data Citation: From Principles to Implementation” IASSIST Quarterly, In Press

1906: Chicago Manual of Style

Standards in Scholarly Citation: author/creator, title, dates, publisher or distributor of the work

1960: First scientific digital data archives

1977 – 1998: ASBR (“Data File” type); MARC (machine readable catalog)

1999 – 2014: Data Repositories (NESSTAR, Dataverse, Dryad, Figshare); DOI services (DataCite)

Page 82: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

The Making of the Principles

Decades of research and practices in data citation

Consolidated to a single set of Principles

By a synthesis group representing 25+ organizations

Driven by the premise that:

"sound, reproducible scholarship rests upon a foundation of robust, accessible data"

and

"data should be considered legitimate, citable products of research"

Page 83: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Joint Declaration of Data Citation Principles

1 Importance

2 Credit and Attribution

3 Evidence

4 Unique Identification

5 Access

6 Persistence

7 Specificity and Verifiability

8 Interoperability and flexibility

Full Principles: https://www.force11.org/datacitation

Endorsement: https://www.force11.org/datacitation/endorsements

Page 84: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Joint Declaration of Data Citation Principles

1. Importance

Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.

Page 85: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Joint Declaration of Data Citation Principles

2. Credit and Attribution

Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.

Page 86: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Joint Declaration of Data Citation Principles

3. Evidence

In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited.

Page 87: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Joint Declaration of Data Citation Principles

4. Unique Identification

A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.

Page 88: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Joint Declaration of Data Citation Principles

5. Access

Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.

Page 89: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Joint Declaration of Data Citation Principles

6. Persistence

Unique identifiers, and metadata describing the data, and its disposition, should persist --  even beyond the lifespan of the data they describe.

Page 90: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Joint Declaration of Data Citation Principles

7. Specificity and Verifiability

Data citations should facilitate identification of, access to, and verification of the specific data that support a claim.  Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific time slice, version and/or granular portion of data retrieved subsequently is the same as was originally cited.

Page 91: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Joint Declaration of Data Citation Principles

8. Interoperability and flexibility

Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities.

Page 92: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

About Dataverse

A software framework to build data repositories.

Provides a preservation and archival infrastructure,

… while researchers share, keep control of and get recognition for their data through a web interface.

Harvard Dataverse is open to all researchers and disciplines.

It contains more than 50,000 data sets.

Other large Dataverse instances throughout the world: ODUM at UNC, Dutch Universities, Scholar Portal, Fudan University.

Dataverse 4.0 (June 2014) brings an entirely new UI and improved data publishing workflows.

Page 93: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Data Citation Implementation in Dataverse

The Dataverse generates a Data Citation for each deposited data set compliant with the Principles:

Authors, Year, Dataset Title, DOI, Data Repository, UNF, version

Example:

Logan Vidal, 2013, "ANES data coding", http://dx.doi.org/10.7910/DVN/23274, Harvard Dataverse, UNF:5:0fdUNzmCsyeqrVKtgUG74A==, V8
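The field order listed above (Authors, Year, Dataset Title, DOI, Data Repository, UNF, version) can be assembled mechanically. This is a rough sketch only: the function name, its parameters, and the choice of separators are assumptions for illustration, not Dataverse's actual implementation.

```python
from typing import List


def format_dataverse_style_citation(authors: List[str], year: int, title: str,
                                    doi_url: str, repository: str,
                                    unf: str, version: str) -> str:
    """Join the citation fields in the order shown on the slide.

    Hypothetical helper: separators are a guess based on the slide's example.
    """
    author_text = "; ".join(authors)
    parts = [author_text, str(year), f'"{title}"', doi_url,
             repository, unf, version]
    return ", ".join(parts)


# Values taken from the example citation on the slide.
citation = format_dataverse_style_citation(
    authors=["Logan Vidal"],
    year=2013,
    title="ANES data coding",
    doi_url="http://dx.doi.org/10.7910/DVN/23274",
    repository="Harvard Dataverse",
    unf="UNF:5:0fdUNzmCsyeqrVKtgUG74A==",
    version="V8",
)
print(citation)
```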

Page 94: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Compliant with Principle 2

Principle 2:

Credit and Attribution: …facilitate giving scholarly credit and … attribution to all contributors to the data, …

Authors, Year, Dataset Title, DOI, Data Repository, UNF, version

Page 95: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Compliant with Principles 4, 5, 6

Principles 4, 5, 6

Unique Identification: …machine actionable, globally unique, and widely used by a community …

Access: … access to the data themselves and to such associated metadata, documentation, code, and other materials …

Persistence: … even beyond the lifespan of the data they describe.

Authors, Year, Dataset Title, DOI, Data Repository, UNF, version

Resolves to landing page with access to metadata, docs, code and data
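A DOI is machine actionable because it can be turned into a resolver URL that redirects to the landing page. The sketch below only normalizes the identifier string and makes no network request; the function name and the default resolver address are assumptions chosen for illustration.

```python
def doi_to_url(identifier: str, resolver: str = "https://doi.org/") -> str:
    """Turn a DOI written in common forms into a single resolver URL."""
    doi = identifier.strip()
    known_prefixes = (
        "https://doi.org/", "http://doi.org/",
        "https://dx.doi.org/", "http://dx.doi.org/",
        "doi:",
    )
    # Strip at most one known prefix, matching case-insensitively.
    for prefix in known_prefixes:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI name begins with the directory indicator "10.".
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {identifier!r}")
    return resolver + doi


print(doi_to_url("doi:10.7910/DVN/23274"))
print(doi_to_url("http://dx.doi.org/10.7910/DVN/23274"))
```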

Page 96: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Landing Page Example: Metadata

Page 97: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Landing Page Example: Data, Code & Docs

Page 98: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Compliant with Principle 7

Principle 7

Specificity and Verifiability: …provenance and fixity sufficient to facilitate verifying that the specific time slice, version and/or granular portion of data …

Authors, Year, Dataset Title, DOI, Data Repository, UNF, version

Universal Numerical Fingerprint: Independent of format
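To illustrate what "independent of format" can mean, here is a toy fingerprint that hashes a canonical representation of the values rather than the bytes of a file. It is explicitly not the UNF algorithm: the normalization rule, the digest truncation, and the "TOY:" prefix are all assumptions made for this sketch.

```python
import base64
import hashlib
from typing import Iterable, Union

Number = Union[int, float, str]


def toy_fingerprint(values: Iterable[Number], digits: int = 7) -> str:
    """Hash a canonical text form of numeric values (NOT the real UNF)."""
    canonical = []
    for value in values:
        number = float(value)
        # Fixed significant digits in exponent notation, so "1.50" and 1.5
        # normalize to the same string.
        canonical.append(f"{number:.{digits - 1}e}")
    payload = "\n".join(canonical).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    return "TOY:" + base64.b64encode(digest[:16]).decode("ascii")


# The same data as read from a CSV file (strings) and from a binary format.
from_csv = ["1.50", "2", "3.25"]
from_binary = [1.5, 2.0, 3.25]
print(toy_fingerprint(from_csv) == toy_fingerprint(from_binary))
```

Because the hash depends on the normalized values, converting a dataset between file formats leaves the fingerprint unchanged, while altering a value changes it.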

Page 99: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Example of version History

Page 100: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Compliant with Principle 8

Principle 8: Interoperability and flexibility:

Dataverse exports all citation metadata in XML and JSON formats
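As a rough illustration of serving the same citation metadata in two machine-readable formats, the sketch below serializes one record to JSON and to XML with the Python standard library. The record keys and the XML element names are hypothetical and do not reproduce Dataverse's export schemas.

```python
import json
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical flat record; the keys are chosen for illustration only.
record: Dict[str, str] = {
    "author": "Logan Vidal",
    "year": "2013",
    "title": "ANES data coding",
    "doi": "10.7910/DVN/23274",
    "repository": "Harvard Dataverse",
    "version": "V8",
}


def to_json(metadata: Dict[str, str]) -> str:
    """Serialize the record as pretty-printed JSON."""
    return json.dumps(metadata, indent=2, sort_keys=True)


def to_xml(metadata: Dict[str, str]) -> str:
    """Serialize the record as one XML element per field."""
    root = ET.Element("citation")
    for key, value in metadata.items():
        child = ET.SubElement(root, key)
        child.text = value
    return ET.tostring(root, encoding="unicode")


print(to_json(record))
print(to_xml(record))
```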

Page 101: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Implementation Suggestions for Publishers

Upgrade data citation to references section [Principle 1: Importance]

In article, cite data by claim [Principle 3: Evidence]

Provide guidelines for authors based on Principles, but customized to each journal [Principle 8: Interoperability and Flexibility]

Interoperate with, or recommend, trusted Data Repositories compliant with the Principles

Build tools to access machine-readable metadata from datasets

Want to be involved?

Join the Data Citation Implementation group:

https://www.force11.org/datacitationimplementation

Page 102: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Remaining Challenges

Challenges of Provenance: what is the chain of ownership and transformations to the data?

Challenges of Identity: what should be cited? at what level of granularity and versioning for large, dynamic datasets?

Challenges of Attribution: How do you support attribution for hundreds or thousands of contributors?

Altman M., Crosas M., 2014, “The Evolution of Data Citation: From Principles to Implementation” IASSIST Quarterly, In Press

Page 103: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

NISO VIRTUAL CONFERENCE, APRIL 23, 2014 – SUCCESSFUL TECHNIQUES FOR SCIENTIFIC DATA MANAGEMENT

Purdue University Research Repository (PURR): A Commitment to Supporting Researchers

Michael Witt

Head, Distributed Data Curation Center

Associate Professor of Library Science

http://www.lib.purdue.edu/research/witt

E-mail: [email protected]

Page 104: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

OVERVIEW

1. Preaching to the choir, but still: Data
2. Ecosystem of data repositories
3. Our campus data repository & service (PURR)
   a. Data management planning
   b. Project space for collaboration
   c. Publishing data
   d. Archiving data
4. Creating opportunities for liaison librarians & helping to operationalize library research data services
5. Roles and collaboration
6. Conclusion

Page 105: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

DATA = EVIDENCE

http://epicgraphic.com/data-cake

Page 106: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

FUNDING AGENCY MANDATES

Page 107: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

ECOSYSTEM OF DATA REPOSITORIES

• Publisher, e.g., Dryad
• Sub/Disciplinary, e.g., RKMP
• Consortium, e.g., ICPSR
• Country, e.g., Research Data Australia
• Government, e.g., data.gc.ca
• Research center, e.g., NASA GES DISC
• Instrument, e.g., CHANDRA
• General-purpose, e.g., Figshare
• Roll-your-own, e.g., Dataverse
• University, e.g., PURR
• Many others…

Page 108: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

CAMPUS COLLABORATION

The PURR service is a collaborative effort of the Purdue University Libraries, Office of the Vice President for Research, and Information Technology at Purdue. PURR is a designated university core research facility.

Designated community: Purdue University faculty, staff, and graduate student researchers; their collaborators; and the current and future consumers of their data.

Page 109: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

LIBRARY STRATEGIC PLAN

Data is written into the three pillars of our strategic plan:

• Learning: “…information literacy defined broadly to include digital information literacy, science literacy, data literacy, health literacy, etc…”

• Scholarly Communication: “Lead in data-related scholarship and initiatives”

• Global Challenges: “We will lead in international initiatives in information literacy and e-science and … contribute to international information literacy, learning spaces, data management, and scholarly communication initiatives.”

https://www.lib.purdue.edu/sites/default/files/admin/plan2016.pdf

Page 111: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

CURATION LIFECYCLE SERVICE MODEL

Witt, M. (2012). Co-designing, Co-developing, and Co-implementing an Institutional Data Repository Service. Journal of Library Administration, 52(2). DOI:10.1080/01930826.2012.655607. http://docs.lib.purdue.edu/lib_fsdocs/6/

Digital Curation Centre’s Curation Lifecycle Model: http://www.dcc.ac.uk/resources/curation-lifecycle-model

Page 112: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

PURR SERVICE – INTERNAL MODEL

Page 113: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

PURR SERVICE – EXTERNAL MODEL

Page 114: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

INTRO TO PURR VIDEO

http://www.youtube.com/watch?v=Yw0IJj7FqA8

Page 115: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

PURR POSTCARD AND POSTER

Page 116: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Dimensions of Discovery (Winter 2013). Office of the Vice President for Research, Purdue University, http://www.purdue.edu/research/vpr/publications/docs/dimensions/Winter2013.pdf

Page 117: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

DATA MANAGEMENT PLANS

• Boilerplate text
• Example DMPs
• DMP Self-Assessment
• DMPTool
• Workshops
• Tutorials
• Reference and consultation with subject-specialist librarian and/or data services specialist

https://purr.purdue.edu/dmp

Page 118: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

CREATE PROJECT AND COLLABORATE

Create:
• any Purdue faculty, staff, or graduate student researcher can create projects
• describe the project
• disclaim use of sensitive or restricted data
• receive a default allocation of storage
• register a grant award to increase allocation
• invite collaborators to join project

Collaborate:
• git repository to share and version files (Google Drive integration)
• wiki
• blog
• to-do list management and project notes
• newsfeed
• stage data publications

Page 119: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

SENSITIVE AND RESTRICTED DATA

Sensitive data: Information whose access must be guarded due to proprietary, ethical, or privacy considerations. This classification applies even though there may not be a civil statute requiring this protection.

Restricted data: Information protected because of protective statutes, policies or regulations. This level also represents information that isn't by default protected by legal statute, but for which the Information Owner has exercised their right to restrict access.

http://www.purdue.edu/securepurdue/policies/dataConfident/restrictions.cfm

• FERPA: Registrar
• HIPAA: Health Center
• IRB: Human Research Protection Program
• Export Control: Vice President for Research

Page 120: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management
Page 121: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

PROJECT SPACE

PURR project tutorial video: http://www.youtube.com/watch?v=q5xGO_oF9uQ

Page 122: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

STORAGE MENU

https://purr.purdue.edu/about/pricing

Page 123: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

DATA PUBLICATION

PURR publication tutorial video: http://www.youtube.com/watch?v=jYBcsfiRhio

Page 124: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

PRESERVATION AND STEWARDSHIP

Initial commitment of 10 years
• data producer or dept can fund for longer
• otherwise remanded to library collection

Design guided by ISO 16363 / TRAC
• Organization infrastructure
• Digital object management
• Technical infrastructure & Security Risk Management

Page 125: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

ARCHIVAL INFORMATION PACKAGE

BagIt “bag” contains:
• bag declaration file, manifest file, data files

Metadata file (XML):
• METS wrapper
• Dublin Core and MODS (descriptive metadata)
• PREMIS (preservation metadata)

MetaArchive: LOCKSS replication network (7 copies)
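The three parts of a bag named above (declaration file, manifest file, data files) can be sketched with the Python standard library. This is an illustration of the general BagIt layout under stated assumptions, not PURR's packaging module: the function name, the SHA-256 manifest choice, and the default version string are assumptions, and the METS, MODS, and PREMIS metadata file is omitted.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def write_minimal_bag(bag_dir: Path, files: Dict[str, bytes],
                      bagit_version: str = "1.0") -> None:
    """Write a bag declaration, a payload manifest, and a data/ directory."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)

    manifest_lines = []
    for name, content in sorted(files.items()):
        (data_dir / name).write_bytes(content)
        checksum = hashlib.sha256(content).hexdigest()
        # One manifest line per payload file: checksum, then its path.
        manifest_lines.append(f"{checksum}  data/{name}")

    # Bag declaration file.
    (bag_dir / "bagit.txt").write_text(
        f"BagIt-Version: {bagit_version}\n"
        "Tag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # Payload manifest file.
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "example_bag"
    write_minimal_bag(bag, {"observations.csv": b"id,value\n1,42\n"})
    print(sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*")))
```

Recomputing the checksums in the manifest at a later date is one way a repository can confirm that replicated copies still match the original deposit.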

Page 126: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

SUPPORTING POLICIES

• Terms of Deposit
• Collection Development Policy
• Preservation Policy
• Preservation Strategies
• File Format Recommendations
• Preservation Support Policy

https://purr.purdue.edu/legal/terms

Page 127: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

REPOSITORY SOFTWARE: HUBZERO

• HUBzero, open source software: http://hubzero.org
• Maintained by HUBzero Foundation, originally funded by NSF
• Over 50 hubs online, supporting different virtual scientific communities, hundreds of thousands of users
• http://nanoHUB.org - grandfather of the hubs, exemplar
• Built to facilitate virtual communities and online, scientific collaboration, research/teaching
• Collaborate, develop, publish, access, execute, and manage content using a web browser
• Software tools, documents, multimedia, learning objects, datasets, etc.
• Social network functionality and collaboration features
• LAMP stack, Joomla framework, OpenVZ and Rappture, git, etc.
• EZID interface to mint DataCite DOIs (coming soon: ORCID)
• Some extensions customized for PURR not in core distribution

Page 128: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

PURR TEAM

• Executive Committee: Dean of Libraries, Vice President for Research, Chief Information Officer

• Steering Committee: 2 from libraries, 2 from IT, 2 from research office and sponsored programs, 3 domain faculty researchers

• Personnel: Project Director (.50), Technologists (3.85), HUBzero Liaison (.35), Metadata Specialist (.20), Digital Archivist (.25), Digital Data Repository Specialist (1.0)

Page 129: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

LIBRARIES PURR TEAM

PURR Project Director (50%)

Michael Witt

Three examples of responsibilities:

• resourcing (personnel, budget, coffee, etc.)

• oversees development roadmap, service definition and design

• communicates across constituencies

Page 130: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

LIBRARIES PURR TEAM

Digital Data Repository Specialist

Courtney Matthews

Three examples of responsibilities:

• primary point of contact for helping users and librarians utilize PURR

• coordinates outreach, support, and development (tons of community engagement)

• helps to acquire, organize, and ingest data collections

Page 131: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

LIBRARIES PURR TEAM

Digital Library Software Developer

Mark Fisher

Three examples of responsibilities:

• developing a module to create archival information packages from datasets published in PURR

• integrating PURR with MetaArchive, a LOCKSS preservation network

• web and graphics design to keep the PURR website current and dynamic

Page 132: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

LIBRARIES PURR TEAM

Digital Archivist (25%)

Carly Dearborn

Three examples of responsibilities:

• define and implement AIP as well as long-term digital object management and supporting practices

• lead policy development and documentation such as PURR’s preservation policy, preservation strategies, file format recommendations, and preservation support policy

• consult with data producers and librarians on file formats, appraisal of data collections, and data management planning

Page 133: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

LIBRARIES PURR TEAM

Metadata Specialist (20%)

Amy Barton

Three examples of responsibilities:

• consult with data producers and librarians to identify and apply appropriate metadata schemas and vocabularies to describe datasets

• design and implement metadata for preservation, findability, and citability (i.e., DataCite DOIs)

• enhance and provide quality assurance for metadata for acquired data collections

Page 134: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

KEY PLAYERS: SUBJECT LIBRARIANS

Page 135: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

KEY PLAYERS: DATA SPECIALISTS

Page 136: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Librarians consult on data management plans in their subject areas.

Creating opportunities for librarians to interact with researchers about data

Page 137: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Librarian is notified by e-mail when a new project is created or a grant is awarded, based on department affiliation of Purdue project owner.

Creating opportunities for librarians to interact with researchers about data

Page 138: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Librarian may consult or collaborate on project if needed.

Creating opportunities for librarians to interact with researchers about data

Page 139: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Librarians review and post submitted datasets.

Creating opportunities for librarians to interact with researchers about data

Page 140: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

At the end of initial commitment (10 years), archived and published datasets are remanded to the Libraries’ collection. A librarian working with the digital archivist selects (or not) the dataset for the collection.

Creating opportunities for librarians to interact with researchers about data

Page 141: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

CONCLUSION

• Soft launch in 2012; 2013 was our first full year
• PURR included in 1,040 data management plans with proposals from Purdue (tracked by our sponsored programs office)
• 79 grants awarded
• 1,466 registered researchers
• 331 active research projects
• Average project team size: 4 people
• Average files per project: 67 files

DMP analysis (n=111 NSF proposals from Purdue, Jan-Jun 2013)
• 49% PURR
• 29% Local computer or server
• 14% Disciplinary repository (e.g., ICPSR, Protein Data Bank, nanoHUB, NEES)
• 8% No data or not applicable
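The percentages in the DMP analysis can be sanity-checked with a few lines of arithmetic. The sketch below confirms that the shares sum to 100% and back-calculates approximate proposal counts from n=111; those counts are estimates derived from rounded percentages, not figures reported on the slide, and the variable names are illustrative.

```python
n_proposals = 111

# Shares as stated on the slide, in percent.
shares_percent = {
    "PURR": 49,
    "Local computer or server": 29,
    "Disciplinary repository": 14,
    "No data or not applicable": 8,
}

total_percent = sum(shares_percent.values())

# Approximate counts only: the slide gives rounded percentages, so these
# are reconstructions rather than reported numbers.
approx_counts = {
    category: round(n_proposals * percent / 100)
    for category, percent in shares_percent.items()
}

print(total_percent)
print(approx_counts)
```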

Page 142: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

THANK YOU

PURR: http://purr.purdue.edu

Michael Witt

Head, Distributed Data Curation Center

Associate Professor of Library Science

http://www.lib.purdue.edu/research/witt

E-mail: [email protected]

Page 143: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

The Roles of Data Citation in Data Management

NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

http://www.niso.org/news/events/2014/virtual/data_deluge/

Christine L. Borgman

Professor and Presidential Chair in Information Studies

University of California, Los Angeles

hudsonalpha.org

NASA Astronomy Picture of the Day

Page 144: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Deluge!!!

Data!

Scientists

Social Scientists

Funding agencies Policy makers

Humanists

Librarians

http://www.guzer.com/pictures/suprise_suprise.jpg

Publishers Internet architects

Page 145: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

http://www.census.gov/population/cen2000/map02.gif

What are data?

ncl.ucar.edu

http://onlineqda.hud.ac.uk/Intro_QDA/Examples_of_Qualitative_Data.php

Marie Curie’s notebook aip.org

hudsonalpha.org

NASA Astronomy Picture of the Day

Page 146: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Data are representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship.

C.L. Borgman, 2014, forthcoming, Big Data, Little Data, No Data: Scholarship in the Networked World, MIT Press.

hudsonalpha.org

Page 147: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Publications are arguments made by authors, and data are the evidence used to support the arguments.

C.L. Borgman, 2014, forthcoming, Big Data, Little Data, No Data: Scholarship in the Networked World, MIT Press.

Page 148: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Citing publications vs. data

• If publications are the stars and planets of the scientific universe, data are the ‘dark matter’ – influential but largely unobserved in our mapping process*

*CODATA-ICSTI Task Group on Data Citation Standards and Practices, 2013, p. 54

Page 149: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Authorship and Attribution

• Publications
– Independent units
– Authorship is negotiated

• Data
– Compound objects
– Ownership is rarely clear
– Attribution

• Long term responsibility: Investigators
• Expertise for interpretation: Data collectors and analysts

hudsonalpha.org

Page 150: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Attribution of data

• Legal responsibility
  – Licensed data
  – Specific attribution required

• Scholarly credit: contributorship
  – Author of data
  – Contributor of data to this publication
  – Colleague who shared data
  – Software developer
  – Data collector
  – Instrument builder
  – Data curator
  – Data manager
  – Data scientist
  – Field site staff
  – Data calibration
  – Data analysis, visualization
  – Funding source
  – Data repository
  – Lab director
  – Principal investigator
  – University research office
  – Research subjects
  – Research workers, e.g., citizen science…

Page 151: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Scholarly credit

• Publications
• Publications
• Publications
• Publications
• Publications
• Publications
• Awards and honors
• Grants
• Teaching
• Service
• Data

http://blog.startfreshtoday.com/Portals/170402/images/improve-credit-score1.jpg

Page 152: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Everyone is overwhelmed with life and email and, in academia, trying to get funding and write papers. Whether something is open or not open is not highest on the priority list. There’s still need for making people aware of open science issues and making it easy for them to participate if they want to.

Jonathan Eisen, genetics professor at the University of California, Davis

DESPITE BEING GOOD FOR YOU AND FOR SCIENCE, TOO MANY CHALLENGES AND TOO LITTLE TIME

Rewards for publications

Effort to document data

Competition, priority

Control, ownership

Slide courtesy of Mercè Crosas, Harvard IQSS; Mashup of Borgman and Crosas slides

Page 153: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Data citation as solution to…

• Credit• Attribution

• Discovery

Page 154: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Research practices

• Goal is publications that report the research

Vs.

• Goal is data that are reusable by others

Image: Alyssa Goodman, Harvard Astronomy

Page 155: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Scientific data creation, use, and reuse*

• What are the characteristics of data use and reuse within each research community?

• How do characteristics of data use and reuse vary within and between research communities?

Fastlizard4’s image of a Geiger counter setup to measure background radiation (flickr.com)


* Wynholds, L. A., Wallis, J. C., Borgman, C. L., Sands, A., & Traweek, S. (2012). Data, data use, and scientific inquiry: two case studies of data practices. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries (pp. 19–22). New York, NY, USA: ACM. doi:10.1145/2232817.2232822

* Wallis, J. C., Rolando, E., & Borgman, C. L. (2013). If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology. PLoS ONE, 8(7), e67332. doi:10.1371/journal.pone.0067332

Page 156: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Research Sites

• Center for Embedded Networked Sensing
  – Science research
    • Environment
    • Seismology
  – Technology research
    • Instrumentation
    • Networks
  – Small science
  – Circa 300 partners

• Sloan Digital Sky Survey
  – Science research
    • Astronomy
    • Astrophysics
  – Technology research
    • Instrumentation
    • Databases
  – Big science
  – Circa 400 partners

Page 157: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Interview Questions

Topic Question CENS SDSS

Data Types
Within your work, what is typically considered to be “data?” X X
How do you distinguish between different levels or states of data? X

Data Sources
What are the main sources of data for your research projects? X
Do you routinely or have you ever used data that you did not generate yourself, or from beyond the immediate project team? X X

Data Use
When you look at data, what are you hoping to find in it? X X
When, if ever, do you reuse your datasets? X X

Page 158: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Dimensions of Data

• Observed vs. simulated data
• Lab generated vs. field collected
• Collected by team vs. obtained from external sources
• Old vs. new data
• Raw vs. processed data
• Foreground vs. background data

Page 159: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Research findings

• Uses of data vary by type of inquiry
• Foreground data
  – Research questions
  – Curated
  – Cited
• Background data
  – Necessary for comparison or calibration
  – Rarely curated
  – Rarely cited
• Value of data lies in their use
• “Use” of data is not reflected in citations

http://drpinna.com/the-gold-standard-22948

Page 160: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Sharing and discovering data

• Means to share data
  – Curated data archives: NASA, UKDA, ICPSR…
  – Contributor-curated collections
  – Research domain collections
  – University repositories
  – Personal websites
  – ftp sites

• Release upon request*

http://www.zippykidstore.com/

*Wallis, J. C., Rolando, E., & Borgman, C. L. (2013). If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology. PLoS ONE, 8(7), e67332. doi:10.1371/journal.pone.0067332


Page 161: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Discoverability

• Data are inseparable from
  – Code
  – Technical standards
  – Documentation
  – Instrumentation
  – Calibration
  – Provenance
  – Workflows
  – Local practices
  – Physical samples

http://peacetour.org/sites/default/files/code4peace-logo2-v3-color-sm.jpg

Page 162: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Usability of cited objects

• Identify the form and content
• Interpret
• Evaluate
• Open
• Read
• Compute upon
• Reuse
• Combine
• Describe
• Annotate…

Page 163: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Identity and persistence of digital objects

• Identity
  – Identifiers
    • DOI, Handles, URI, PURL…
  – Naming and namespaces
    • Authors/creators: ORCID, VIAF…
    • Generic/specific: registry number…
  – Description
    • Self-describing
    • Metadata augmentation

• Persistence
  – Permanent
  – Long-lived
  – Scratch spaces
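Identifiers such as DOIs are usually exchanged as resolvable links. The following is a minimal Python sketch, assuming the public doi.org resolver; the helper name `doi_to_url` is hypothetical and not part of the talk. It normalizes a DOI written in one of several common forms into a single link, using a DOI cited earlier in this presentation as input.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver

# Prefixes that people commonly paste in front of the bare identifier.
KNOWN_PREFIXES = ("https://doi.org/", "http://doi.org/", "http://dx.doi.org/", "doi:")


def doi_to_url(doi: str) -> str:
    """Normalize a DOI string (hypothetical helper) into a resolvable URL."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI starts with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("doi:10.1371/journal.pone.0067332"))
```

The sketch only rewrites the string; it does not check that the DOI is registered or that the target is still available, which is the persistence question the slide raises separately.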

http://web-interview-questions.blogspot.com/2010_06_21_archive.html

Page 164: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Intellectual property

• What can I do with this object?
• What rights are associated?
  – Reuse
  – Reproduce
  – Attribute
• Who owns the rights?
• How open are data?
  – Open data
  – Open bibliography

http://pzwart.wdka.hro.nl/mdr/research/lliang/mdr/mdr_images/opencontent.jpg/

Page 165: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Implications for data management

• Authors of publications
  – Cite publications for their data, findings, and other content
  – Cite your data as you wish others to cite them
  – Cite others’ data and publications as they wish to be cited

• Data archives
  – Add metadata for discovery of datasets
  – Add metadata for interpretation and provenance

• Institutional repositories, bibliographic databases
  – Establish standards and practices for citing data sources
  – Coordinate communities, e.g., telescope bibliography, IAU*

*IAU Working Group Libraries. (2013). Best Practices for Creating a Telescope Bibliography. IAU-Commission5 - WG Libraries. http://iau-commission5.wikispaces.com/WG+Libraries

Page 166: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Data Citation and Attribution


Uhlir, P. F. (Ed.). (2012). For Attribution -- Developing Data Attribution and Citation Practices and Standards: Summary of an International Workshop. Washington, D.C.: The National Academies Press. Retrieved from http://www.nap.edu/catalog.php?record_id=13564

Data Science Journal, Volume 12, 13 September 2013

2012

CODATA-ICSTI Task Group on Data Citation and Attribution. Co-Chairs: Jan Brase, Sarah Callaghan, Christine Borgman

Page 167: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Research funding acknowledgements

Research reported here is supported in part by grants from the National Science Foundation and the Alfred P. Sloan Foundation:

The Transformation of Knowledge, Culture, and Practice in Data-Driven Science: A Knowledge Infrastructures Perspective, Sloan Award # 20113194, CL Borgman, UCLA, PI; S Traweek, UCLA, Co-PI

The Data Conservancy, NSF Cooperative Agreement (DataNet) award OCI0830976, Sayeed Choudhury, Johns Hopkins University, PI

The Center for Embedded Networked Sensing (CENS) is funded by NSF Cooperative Agreement #CCR-0120778, Deborah L. Estrin, UCLA, PI

Towards a Virtual Organization for Data Cyberinfrastructure, NSF #OCI-0750529, C.L. Borgman, UCLA, PI; G. Bowker, Santa Clara University, Co-PI; Thomas Finholt, University of Michigan, Co-PI

Monitoring, Modeling & Memory: Dynamics of Data and Knowledge in Scientific Cyberinfrastructures: NSF #0827322, P.N. Edwards, UM, PI; Co-PIs C.L. Borgman, UCLA; G. Bowker, SCU and Pittsburgh; T. Finholt, UM; S. Jackson, UM; D. Ribes, Georgetown; S.L. Star, SCU and Pittsburgh


Page 168: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Finding and following digital objects

• Discoverability
  – Identify existence
  – Locate
  – Retrieve

• Provenance
  – Chain of custody
  – Transformations from original state

• Relationships
  – Units identified
  – Links between units
  – Actions on relationships

http://chicagoist.com/2008/10/09/a_gourmet_oasis_provenance_food_and.php

Page 169: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Infrastructure for digital objects

• Social practice
• Usability
• Identity
• Persistence
• Discoverability
• Provenance
• Relationships
• Intellectual property
• Policy

http://datalib.ed.ac.uk/GRAPHICS/blue_data.gif

Page 170: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Social practice

• Why cite data?
  – Reproduce research
  – Replicate findings
  – Reuse data

• Why attribute data?
  – Social expectation
  – Legal responsibility

• How to cite data?
  – Bibliographic reference
  – Identifier
  – Link
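The three ways to cite data named above (bibliographic reference, identifier, link) can be combined in one record. The sketch below is illustrative only: the `DataCitation` class, its field names, the output layout, and the sample dataset and DOI are all hypothetical and do not follow any particular citation standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataCitation:
    """Hypothetical record holding the elements of a data citation."""
    creators: List[str]
    year: int
    title: str
    publisher: str
    doi: str

    def reference(self) -> str:
        # Bibliographic reference: who, when, what, held where.
        authors = "; ".join(self.creators)
        return f"{authors} ({self.year}). {self.title} [Data set]. {self.publisher}."

    def link(self) -> str:
        # Link: the identifier expressed as a resolvable URL.
        return f"https://doi.org/{self.doi}"

    def full(self) -> str:
        # Reference, identifier, and link together.
        return f"{self.reference()} doi:{self.doi} {self.link()}"


# Made-up example values, not a real dataset.
citation = DataCitation(
    creators=["Doe, J."],
    year=2014,
    title="Example Sensor Readings",
    publisher="Example Data Archive",
    doi="10.1234/example",
)
print(citation.full())
```

Keeping the identifier as its own field, rather than burying it in the reference string, is one way to let the same record serve both human readers and machine linking.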


http://farm2.static.flickr.com/1207/707625876_46aa44851f_o.jpg

Page 171: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Foreground vs Background

Aspect | Foreground data | Background data
Uses | Research questions | Comparison, calibration
Reuses | Internal data sources | External data sources
Disposition | Retain, curate | Discard
Value | Reference in paper | Rarely cited

Page 172: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

CENS Data: Foreground vs background

CENTER FOR EMBEDDED NETWORKED SENSING (UCLA, USC, UCR, CALTECH, UCM)

[Figure: CENS data types grouped into four categories: Sensor Collected Application Data, Sensor Collected Proprioceptive Data, Sensor Collected Performance Data, and Hand Collected Application Data. Labels shown: flow, water depth, ammonium, ammonia, phosphate, water temp, pH, temperature, conductivity, chlorophyll, GPS/location, time, sap flow, CO2, humidity, rainfall, packets transmitted, packets received, ORP, PAR, motor speed, rudder angle, heading, roll/pitch/yaw, soil moisture, nitrate, calcium, chloride, water potential, wind speed, wind direction, wind duration, leaf wetness, routing table, neighbor table, fault detection, awake time, organism presence, organism concentration, battery voltage, mercury, methylmercury, nutrient concentration, nutrient presence, LandSat images, Mosscam, CDOM, bird calls.]

Page 173: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Astronomy data: Foreground vs. background

Type | Source Named | Genre
Catalog (Data) index | SIMBAD, VizieR | Obs
Curated Data Collection | NASA Exoplanet Database | Obs
Data Archive | Multi-mission Archive at STScI (MAST), Infrared Science Archive (IRSA) | Obs
Federated Data Query Services | Virtual Observatory Services (NVO, IVOA) | Obs
Ground Based Instruments | DEep Imaging Multi-Object Spectrograph (DEIMOS), Keck Observatories, Laser Interferometer Gravitational-Wave Observatory (LIGO) | Obs
Ground Based Sky Surveys | Deep Lens Survey, DEEP2 Galaxy Redshift Survey, Catalina Transients Survey, Palomar-Quest Survey, Sloan Digital Sky Survey (SDSS), Digitized Palomar Observatory Sky Survey (DPOSS), SDSS Value Added Catalogs | Obs
Physical Constants | NIST Atomic Spectra Database | Exp
Publications Index | SAO/NASA Astrophysics Data System | Mixed
Simulation | Millennium Simulation Database | Sim
Space Based Instruments | Chandra X-Ray Observatory, Fermi Large Area Telescope, Far Ultraviolet Spectroscopic Explorer (FUSE), Galaxy Evolution Explorer (GALEX), Hubble Space Telescope, Spitzer Space Telescope, XMM X-ray Telescope | Obs
Space Based Sky Surveys | Two Micron All Sky Survey (2MASS), Infrared Astronomical Satellite Survey (IRAS), Wide-field Infrared Survey Explorer (WISE) | Obs

Page 174: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

© 2012 The MITRE Corporation. All rights reserved.

Adriane Chapman

M. David Allen

Barbara Blaustein

Is this data fit for my use?

The challenges and opportunities provenance presents

Information graphic courtesy of FreeDigitalPhotos.net

Public Release #12-1548.

Page 175: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management


What is Provenance?


Page 176: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Is Data Fit for a Specific Use?

■ Provenance can help in evaluating whether data is fit for a specific purpose
  – Does the data item derive from an Internet source?
  – Were untrusted organizations involved in producing the data item?

■ Provenance “in the raw” is not always useful to users
  – Generally presented as a directed acyclic graph (DAG)
  – Many users have a good intuitive understanding of simple graphs, BUT provenance graphs are often large and unwieldy
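The two fitness questions on this slide can be read as predicates evaluated over everything a data item was derived from. Below is a minimal Python sketch under stated assumptions: the graph, node names, attribute keys (`org`, `source`), and the trusted-organization list are all hypothetical, and this is not how any particular provenance system stores or queries its data.

```python
from typing import Callable, Dict, List, Set

# Hypothetical provenance graph: each node maps to the nodes it was derived from.
PARENTS: Dict[str, List[str]] = {
    "report": ["cleaned"],
    "cleaned": ["web_scrape", "survey"],
    "web_scrape": [],
    "survey": [],
}

# Hypothetical metadata recorded for each node.
ATTRS: Dict[str, Dict[str, str]] = {
    "report": {"org": "OurLab", "source": "internal"},
    "cleaned": {"org": "OurLab", "source": "internal"},
    "web_scrape": {"org": "UnknownCo", "source": "internet"},
    "survey": {"org": "OurLab", "source": "field"},
}


def ancestors(node: str) -> Set[str]:
    """Return every node that `node` was directly or indirectly derived from."""
    seen: Set[str] = set()
    stack = list(PARENTS.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def any_ancestor(node: str, predicate: Callable[[Dict[str, str]], bool]) -> bool:
    """True if the predicate holds for the metadata of at least one ancestor."""
    return any(predicate(ATTRS[a]) for a in ancestors(node))


TRUSTED_ORGS = {"OurLab"}  # assumption made for this example only

from_internet = any_ancestor("report", lambda a: a["source"] == "internet")
untrusted_involved = any_ancestor("report", lambda a: a["org"] not in TRUSTED_ORGS)
print(from_internet, untrusted_involved)
```

A yes/no answer of this kind spares the user from reading the whole graph, which is the difficulty the slide points out.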

Page 177: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management


Use Case


Page 178: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Financial Systemic Risk Analysis

[Figure labels: Analysts build and run Financial Models. Decision Makers. Question posed: Are there systemic risks to the health of the financial system?]

Public Release #12-3756

Page 179: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Systemic Risk: The IT Problem

■ To monitor systemic risk, regulators have hundreds of analysts, running hundreds of models…
  – …against hundreds of data sets at various time scales…
  – …each with thousands of different parameter settings

■ Currently, care and feeding of these models (especially data extract-transform-load) is ad hoc

■ Result: Current simulation environments don’t support analysts’ need to find and interpret data across the resulting millions of simulation executions

Page 180: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management


Data Provenance Challenge

I ran a flow of funds model from the University of Vermont back in May. Which version did I use? What transformations did I perform on the input data sets?

Which model runs used the 1Q 2011 version of the FDIC’s Uniform Bank Performance Reports?

Who is running Prof. Jones’ model? What input data are they using it with and with what parameters?


Page 181: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management


Data Provenance Example

[Provenance graph figure. Data nodes: d1 through d8. Process nodes: Filter (P1), Multi-market model (P2), Filter (P3), Multi-market model (P4), Filter (P5), Time Series Normalization (P6), Link-based Classification Model (P7). Annotations shown on the graph: Source: Thomson Reuters order book data; Source: Nanex order book data; Version: 1, Time-horizon = 2016, Invoked-by: Jones; Version: 2, Time-horizon = 2018, Invoked-by: Smith; Year: 2010, Sector: Technology; Year: 2001-2010, Sector: Housing; Periodicity: Quarterly, Invoked-by: Roberts; Outlook: “Excellent”, “Poor”, “Fair”.]
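Questions such as "which model runs used this data set?" and "who is running this model?" become graph traversals once provenance is recorded. The Python sketch below borrows node labels from this slide (d1 to d5, P1 to P4, and the Jones and Smith annotations), but the edges and the assignment of annotations to nodes are an illustrative guess, since the slide's actual wiring is not recoverable from the transcript. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

# (from, to) edges. ASSUMPTION: this wiring is invented for illustration.
EDGES = [
    ("d1", "P1"), ("P1", "d2"), ("d2", "P2"), ("P2", "d3"),
    ("d1", "P3"), ("P3", "d4"), ("d4", "P4"), ("P4", "d5"),
]

# Node annotations; which annotation belongs to which node is also assumed.
ATTRS: Dict[str, Dict[str, object]] = {
    "d1": {"source": "order book data"},
    "P2": {"kind": "Multi-market model", "version": 1, "invoked_by": "Jones"},
    "P4": {"kind": "Multi-market model", "version": 2, "invoked_by": "Smith"},
}

children: Dict[str, List[str]] = defaultdict(list)
for src, dst in EDGES:
    children[src].append(dst)


def downstream(node: str) -> Set[str]:
    """Everything derived, directly or indirectly, from `node`."""
    seen: Set[str] = set()
    stack = [node]
    while stack:
        for nxt in children[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def model_runs_using(dataset: str, kind: str) -> List[str]:
    """Process nodes of the given kind that consumed `dataset` at any distance."""
    return sorted(n for n in downstream(dataset) if ATTRS.get(n, {}).get("kind") == kind)


def invokers(kind: str) -> Dict[str, object]:
    """Map each run of the given kind to the person who invoked it."""
    return {n: a["invoked_by"] for n, a in ATTRS.items() if a.get("kind") == kind}


print(model_runs_using("d1", "Multi-market model"))
print(invokers("Multi-market model"))
```

The same traversal run in the opposite direction (from a result back to its inputs) answers the "which version did I use, and what transformations did I apply?" form of the question.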


Page 182: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management


FitnessWidgets


Page 183: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management


Ease of Use for End Users

Data-centric goal: build tools and applications over provenance information to support a user’s needs.

Information graphic courtesy of FreeDigitalPhotos.net

Page 184: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Fitness Widgets: Pre-defined queries operating over provenance graphs

■ Ad hoc, user-defined

Page 185: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Fitness Widgets: Pre-defined queries operating over provenance graphs

■ Complex, pre-defined

Page 186: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management


More Complex: Cross-organizational “double counting”


Page 187: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management


The Skeletons


Page 188: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

PLUS Provenance Manager

[Architecture diagram. Components: Users & Applications; Administrators; PLUS Provenance Manager; Provenance Store (MySQL); Applications & Capture Agents. Operations: Report, Annotate, Retrieve; Administer (access control, archiving, etc.). Capture paths: API (provenance-aware applications); Web Proxy (provenance-aware applications); coordination points for automatic provenance capture.]

A. Chapman, M.D. Allen, B. Blaustein, L. Seligman, “PLUS: A Provenance Manager for Integrated Information,” IEEE Int. Conf. on Information Reuse and Integration (IRI ’11), Las Vegas, NV, August 2011

Approved for Public Release 10-4145. © 2010 The MITRE Corporation. All rights reserved.

Page 189: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management


Architectural Options for Lineage Capture

■ “Smart Applications”
  – Strategy: Each application calls the lineage API to log whatever it thinks is important.
  – But, unrealistic for legacy applications

■ “Interceptors”
  – Strategy: Listen in to whatever is happening, and log silently as it happens
  – Requires a small number of points of lineage capture: ESBs are ideal, since they act as central “routers”

■ “Wrappers”
  – Strategy: Write a transparent wrapper service. Make sure all orchestrations call the wrapper service with enough information for the wrapper to invoke the real thing.
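The “Wrappers” strategy can be illustrated at small scale with a function decorator: callers invoke the wrapper, which runs the real step and records lineage as a side effect. This is a minimal Python sketch, not the PLUS implementation; `with_lineage`, `LINEAGE_LOG`, and `filter_year` are hypothetical names, and an in-memory list stands in for a provenance store.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Stand-in for a provenance store (assumption: a real system would persist this).
LINEAGE_LOG: List[Dict[str, Any]] = []


def with_lineage(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a processing step so every call is logged without changing the step."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result = func(*args, **kwargs)  # invoke "the real thing"
        LINEAGE_LOG.append({
            "process": func.__name__,
            "inputs": args,
            "params": kwargs,
            "output": result,
            "timestamp": time.time(),
        })
        return result
    return wrapper


@with_lineage
def filter_year(rows: List[Dict[str, int]], year: int) -> List[Dict[str, int]]:
    """Example processing step: keep only the rows for one year."""
    return [row for row in rows if row["year"] == year]


rows = [{"year": 2010, "v": 1}, {"year": 2011, "v": 2}]
filtered = filter_year(rows, year=2010)
print(filtered, LINEAGE_LOG[0]["process"])
```

The wrapped function itself stays unaware of provenance, which is what distinguishes this option from the “Smart Applications” strategy, where each application logs lineage explicitly.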

Public Release #10-1285

Page 190: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management


Page 191: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

View Provenance

The provenance graph is built automatically over time by “watching” users’ actions.

Page 192: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Get Details

The system can show relationship information and metadata details.

Page 193: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Sort Information

The system provides ways to get information “at a glance”, e.g., which organizations own the data that was used.

Page 194: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

FitnessWidgets

FitnessWidgets help the analyst assess data products for their specific use.

Page 195: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Annotations

Annotate any node. Information can be propagated through the graph.

Page 196: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Conclusions

■ Provenance keeps track of who did what to data, and when.

■ Provenance can help
  – Determine what data to use
  – Find data
  – Know what happened to the data

■ It is not a silver bullet
  – Capture is hard

■ Determine what pieces of information are vital to judging “fitness”, and try to capture those

Page 197: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

SHARE PROJECT UPDATE

Judy Ruttenberg, Program Director

Association of Research Libraries

NISO Virtual Conference: Dealing with the Data Deluge

April 23, 2014

Page 198: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Higher education & research community

• Preservation, access, and reuse of research outputs (data, articles, and more)

• Interlocking layers & services to better understand what research is being produced, and to render that research as accessible as possible

• Leverage existing ecosystem

Page 199: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Formation and context of SHARE

• Institutional OA policies
• AAU-ARL Task Force on Scholarly Communication
• Funder mandates
  – 2013 OSTP Memorandum
  – 2014 Omnibus Appropriations
  – Private and other funder policies

Page 200: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Who is SHARE?

Steering Group
• Provost, Library directors, CIO, SRO
• ARL, AAU, APLU, CNI, SPARC, NLM (federal agency liaison)

Staff
• Project Manager (ARL), Technical Director, Product/Community Lead, Development Team

Working Groups
• Repository, Workflow, Technical, Communications

Page 201: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Layers & Services of SHARE

Notification Service: Project underway
  – Beta release fall 2014
  – Full release fall 2015

Concurrent planning for interactive systems:
  – Registry
  – Discovery
  – Aggregation

Page 202: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

SHARE Notification Service

Problem Statement:
• Difficult to keep abreast of the release of publications, datasets, other research outputs
• No single, structured way to report research output releases in a timely and ubiquitous manner

Page 203: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

SHARE Notification Service

Outcome & Goal:
• Know that research output exists
• Enable, short-term & with high-latency:
  – Repository Managers to identify articles/papers/reports for deposit
  – University and funding agency grant administrators to determine compliance with public access policies

Page 204: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

SHARE Notification Service – Building Blocks

Page 205: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

SHARE Notification Service – Information Flow

Page 206: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

SHARE Research Release Events

Page 207: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

SHARE Research Release Events

Page 208: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

SHARE Registry Layer

Page 209: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

SHARE Registry & Discovery Layers

Page 210: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Other Community Initiatives

• CHORUS
• ORCID
• CrossRef
• International

Page 211: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Long-term planning

• Data
• Author rights: An intellectual property rights strategy, including the promotion of university-based open access policies and favorable licensing terms, will be part of the scaffolding that will enable the layers of SHARE to develop

Page 212: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Staying connected with SHARE:

www.arl.org/share

www.facebook.com/SHARE.research

www.twitter.com/share_research

Page 213: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

NISO Virtual Conference

Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

NISO Virtual Conference • April 23, 2014

Questions?

All questions will be posted with presenter answers on the NISO website following the webinar:

http://www.niso.org/news/events/2014/virtual/data_deluge/

Page 214: April 23 NISO Virtual Conference: Dealing with the Data Deluge: Successful Techniques for Scientific Data Management

Thank you for joining us today. Please take a moment to fill out the brief online survey.

We look forward to hearing from you!

THANK YOU