Inroads into Data: Getting Involved in Data at Your Institution

Inroads into Data: Getting Involved in Data at Your Institution Margaret Henderson Director, Research Data Management [email protected] @mehlibrarian Beyond the SEA Webinar, November 18, 2015

Transcript of Inroads into Data: Getting Involved in Data at Your Institution

Page 1: Inroads into Data: Getting Involved in Data at Your Institution

Inroads into Data: Getting Involved in Data at Your Institution

Margaret Henderson
Director, Research Data Management

[email protected]
@mehlibrarian

Beyond the SEA Webinar, November 18, 2015

Page 2: Inroads into Data: Getting Involved in Data at Your Institution

“I believe that knowledge rather than the format or container should drive our work.”

~ Lucretia McClure, 1997

http://www.mlanet.org/blog/mcclure,-lucretia-w.-(ahip,-fmla)

Page 3: Inroads into Data: Getting Involved in Data at Your Institution

Case 1

Page 4: Inroads into Data: Getting Involved in Data at Your Institution

Case 2

http://retractionwatch.com/2015/08/17/trouble-with-data-prove-toxic-for-a-pair-of-toxicology-papers/

Page 5: Inroads into Data: Getting Involved in Data at Your Institution

Case 3

Page 7: Inroads into Data: Getting Involved in Data at Your Institution
Page 8: Inroads into Data: Getting Involved in Data at Your Institution

What is Data?

• Research results
• Admission records
• Student course marks
• Patient health records
• Financial statement
• Supply order information
• Inventories
• Surgery counts
• Surgery records
• Genetic sequences
• Computer software
• Study protocols
• Clinical case histories
• Samples
• Physical collections
• Cell lines
• Spectroscopic data
• Oral history interviews
• Surveys
• Laboratory Notebooks

Page 9: Inroads into Data: Getting Involved in Data at Your Institution

“If it gives you pain, it is Big Data.”

~ Donald Brown, Director of Virginia Integrative Data Institute, speaking at Research Data and Technology Fair presented by Claude Moore Health Sciences Library, University of Virginia Health System

Presentation link at http://guides.hsl.virginia.edu/research-fair

Page 10: Inroads into Data: Getting Involved in Data at Your Institution
Page 11: Inroads into Data: Getting Involved in Data at Your Institution

The Value of Reference Skills

https://commons.wikimedia.org/wiki/File:1930%27s_-_ca._-_Alma_Custead,_Librarian,_and_Staff.jpg

Page 12: Inroads into Data: Getting Involved in Data at Your Institution

Environmental Scan

• PEST – political, economic, social, and technological factors
• PESTEL – add environmental and legal factors
• SWOT – strengths, weaknesses, opportunities, and threats
• Six Forces Model – competition, new entrants, end users, suppliers, substitutes, and complementary products

Page 13: Inroads into Data: Getting Involved in Data at Your Institution

Potential Departments
• Information Technology/Technology Services – backups and security
• Office of Research – grants, research output for assessment, patents
• Administration – people, financial, facilities data
• Records – patient health records
• Statistics or Biostatistics department

Page 14: Inroads into Data: Getting Involved in Data at Your Institution

The Noun Project - http://t.co/oGuXfP7NBq

Page 15: Inroads into Data: Getting Involved in Data at Your Institution

Data Life Cycle

http://www.dcc.ac.uk/resources/curation-lifecycle-model

Page 16: Inroads into Data: Getting Involved in Data at Your Institution

Simplified Data Lifecycle

Data Management Plan and Ownership

Organizing and folder and file name suggestions

Metadata or Readme files

Clean data and statistics help

IR, subject repository, or journal that includes supporting data.

Stable file formats, duration as per funder or other policy.

Page 17: Inroads into Data: Getting Involved in Data at Your Institution

Plan

Data Management Plan and Ownership

Page 18: Inroads into Data: Getting Involved in Data at Your Institution

Data Management Plans
Outlines how a researcher will:
• collect
• organize
• back up
• store
• share
the data for a project, and indicates who the data steward will be.
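One way to make this outline concrete is to treat a plan as a checklist. The sketch below is only an illustration: the class name `DataManagementPlan`, its field names, and the `missing_sections` helper are hypothetical, and the example answers are placeholders. It does not reflect any funder's template or the DMPTool.

```python
from dataclasses import dataclass, fields


@dataclass
class DataManagementPlan:
    """Hypothetical checklist mirroring the slide's outline of a DMP."""
    collect: str = ""
    organize: str = ""
    back_up: str = ""
    store: str = ""
    share: str = ""
    data_steward: str = ""

    def missing_sections(self) -> list[str]:
        # A section counts as missing when it is empty or only whitespace.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Placeholder answers; "store" and "share" are deliberately left blank.
plan = DataManagementPlan(
    collect="Microscope images exported as TIFF",
    organize="One folder per experiment, dated file names",
    back_up="Nightly copy to an external drive",
    data_steward="Principal investigator",
)
print(plan.missing_sections())
```

A check like this only confirms that every part of the outline has been addressed; whether the answers satisfy a particular funder still needs a human reviewer.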

Page 19: Inroads into Data: Getting Involved in Data at Your Institution

DMPTool

https://dmptool.org/

Page 20: Inroads into Data: Getting Involved in Data at Your Institution
Page 21: Inroads into Data: Getting Involved in Data at Your Institution

NIH Policies
• Public Access: ...all investigators funded by the NIH submit or have submitted for them to the National Library of Medicine’s PubMed Central an electronic version of their final peer-reviewed manuscripts upon acceptance for publication, to be made publicly available no later than 12 months after the official date of publication. https://publicaccess.nih.gov/

• Data Sharing: extension of NIH policy on sharing research resources, and reaffirms NIH support for the concept of data sharing. http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html

• Genomic Data Sharing: Applies to all NIH-funded research that generates large-scale human or non-human genomic data, as well as the use of those data for subsequent research. Requires “Genomic Data Sharing Plan”. Allows for expenses in project budget. http://grants.nih.gov/grants/guide/notice-files/NOT-OD-07-088.html

Page 22: Inroads into Data: Getting Involved in Data at Your Institution

NSF Policies

NSF Data Sharing Policy
Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing. See Award & Administration Guide (AAG) Chapter VI.D.4. http://www.nsf.gov/bfa/dias/policy/dmp.jsp

NSF Data Management Plan Requirements
Proposals submitted or due on or after January 18, 2011, must include a supplementary document of no more than two pages labeled “Data Management Plan”. This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results. See Grant Proposal Guide (GPG) Chapter II.C.2.j for full policy implementation. https://www.nsf.gov/eng/general/dmp.jsp

Page 23: Inroads into Data: Getting Involved in Data at Your Institution

Slide courtesy of Amanda Whitmire

Page 24: Inroads into Data: Getting Involved in Data at Your Institution

OSTP Memorandum
Increasing Access to the Results of Federally Funded Scientific Research - February 22, 2013

“ensuring that, … the direct results of federally funded scientific research are made available to and useful for the public, industry, and the scientific community. Such results include peer-reviewed publications and digital data.”

“develop plans to make the results of federally-funded research publically available free of charge within 12 months after original publication.”
https://www.whitehouse.gov/blog/2013/02/22/expanding-public-access-results-federally-funded-research

Page 25: Inroads into Data: Getting Involved in Data at Your Institution

Data Management Plans

• All agencies will require a data management plan.

• “Not all data need to be shared or preserved. The costs and benefits of doing so should be considered in data management planning.” DOE third principle http://science.energy.gov/funding-opportunities/digital-data-management/

• DOE and NSF have indicated they will review and evaluate DMPs

Page 26: Inroads into Data: Getting Involved in Data at Your Institution

Data Sharing
• Digitally formatted data arising from unclassified, publicly releasable research and programs.
• Decentralized approach to data storage.
• Allow for inclusion of costs for data management and access.
• Will establish a system to enable the identification, attribution, (federated) storage, and access of digital data.

From NASA FAQ
• “First of all, be reassured that we are not going to force you to reveal your precious proprietary data prior to publication. No personal, proprietary or ITAR data is included.”

http://science.nasa.gov/researchers/sara/faqs/dmp-faq-roses/

Page 27: Inroads into Data: Getting Involved in Data at Your Institution

https://commons.wikimedia.org/wiki/File:SMPTE_Color_Bars.svg

AND NOW BACK TO OUR REGULARLY SCHEDULED PROGRAM

Page 28: Inroads into Data: Getting Involved in Data at Your Institution

Ownership

• Check institutional policy
• Consult with legal counsel for your institution
• Can’t copyright data so think about licensing
• How to License Research Data http://www.dcc.ac.uk/resources/how-guides/license-research-data
• Patient Record Ownership by State http://www.healthinfolaw.org/comparative-analysis/who-owns-medical-records-50-state-comparison

Page 29: Inroads into Data: Getting Involved in Data at Your Institution

Collect

Organizing and folder and file name suggestions

Page 31: Inroads into Data: Getting Involved in Data at Your Institution

Organizing
What makes sense for person or group:
• File type
• Date
• Type of analysis
• Project

MyDocuments\Research\Sample20.tiff
vs.
C:\\NSFGrant2020\CellDynamics\Images\RatCell_141020.tiff
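A folder layout like the second example can be set up before data collection starts. The following is a minimal sketch, assuming a project / study / content-type hierarchy: `NSFGrant2020`, `CellDynamics`, and `Images` come from the slide, while the `Analysis` folder and the function name `create_project_tree` are hypothetical.

```python
import tempfile
from pathlib import Path


def create_project_tree(root: Path, project: str, study: str, subfolders: list[str]) -> list[Path]:
    """Create root/project/study/<subfolder> for each subfolder and return the paths."""
    created = []
    for name in subfolders:
        folder = root / project / study / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Use a throwaway directory so the sketch does not touch real data.
with tempfile.TemporaryDirectory() as tmp:
    paths = create_project_tree(Path(tmp), "NSFGrant2020", "CellDynamics", ["Images", "Analysis"])
    relative = [p.relative_to(tmp).as_posix() for p in paths]
    print(relative)
```

Grouping by project first is only one of the options on the slide; a group that thinks in terms of dates or file types could reorder the levels.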

Page 32: Inroads into Data: Getting Involved in Data at Your Institution

Naming

Use file naming conventions for related files
• Be consistent
• Short yet descriptive
• Avoid spaces and special characters
e.g. File2020.xls vs. Project_experiment_celltype_YYYYMMDD.xls
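A convention is easier to keep when it can be checked automatically. The sketch below assumes the pattern from the slide's example (three labels, an eight-digit date, then an extension); the regular expression, the function names, and the sample file names are illustrative assumptions, not a standard.

```python
import re

# Hypothetical convention based on the slide's example:
# Project_experiment_celltype_YYYYMMDD.ext
NAME_PATTERN = re.compile(r"^[A-Za-z0-9]+_[A-Za-z0-9]+_[A-Za-z0-9]+_\d{8}\.[A-Za-z0-9]+$")


def follows_convention(filename: str) -> bool:
    """Return True when the name has three labels, an 8-digit date, and an extension."""
    return NAME_PATTERN.match(filename) is not None


def problem_characters(filename: str) -> list[str]:
    """List spaces and special characters: anything other than letters, digits, _ - and ."""
    return sorted(set(re.findall(r"[^A-Za-z0-9_.\-]", filename)))


print(follows_convention("CellDynamics_exp01_RatCell_20141020.xls"))
print(follows_convention("File2020.xls"))
print(problem_characters("my data (final).xls"))
```

The pattern checks only the shape of the name. It does not confirm that the eight digits form a real date.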

Page 33: Inroads into Data: Getting Involved in Data at Your Institution

Possible elements for file names
• Project/grant name and/or number.
• Date of creation: useful for version control, e.g. YYYYMMDD
• Name of creator/investigator: last name first followed by (initials of) first name.
• Name of research team/department associated with the data.
• Description of content/subject descriptor.
• Data collection method (instrument, site, etc.).
• Version number.
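These elements can be assembled in a fixed order so that every file in a project is named the same way. This is a sketch under assumptions: the class `FileNameParts`, the element order, the underscore separator, and the `v02` version style are choices made for illustration, and `SmithJ` is a made-up creator.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FileNameParts:
    """A subset of the slide's elements; order and separator are assumptions."""
    project: str
    description: str
    created: date
    creator: Optional[str] = None   # last name first, then initial, e.g. "SmithJ"
    version: Optional[int] = None

    def build(self, extension: str) -> str:
        parts = [self.project, self.description]
        if self.creator:
            parts.append(self.creator)
        # YYYYMMDD sorts chronologically when file names are sorted alphabetically.
        parts.append(self.created.strftime("%Y%m%d"))
        if self.version is not None:
            parts.append(f"v{self.version:02d}")
        return "_".join(parts) + "." + extension.lstrip(".")


name = FileNameParts("CellDynamics", "RatCell", date(2014, 10, 20),
                     creator="SmithJ", version=2).build("tiff")
print(name)
```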

Page 34: Inroads into Data: Getting Involved in Data at Your Institution

Describe

Metadata or Readme files

Page 36: Inroads into Data: Getting Involved in Data at Your Institution

Metadata
• Descriptive – describes object in question, whole dataset and each element of the set
• Administrative – preservation, IP rights
• Structural – physical and logical structure of digital object
• Metadata Standards Directory http://rd-alliance.github.io/metadata-directory/

Page 37: Inroads into Data: Getting Involved in Data at Your Institution

Readme Files
• Names + contact information for people associated with the project
• List of files, including a description of their relationship to one another
• Copyright + licensing information
• Limitations of the data
• Funding sources / institutional support
• Any information necessary for someone with no knowledge of your research to understand and / or replicate your work.
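The items listed above map naturally onto a plain-text template. The sketch below is one possible layout, not a required format: the function `build_readme`, the section headings, and every value in the example are placeholders invented for illustration.

```python
def build_readme(title: str, contacts: list[str], files: dict[str, str],
                 license_info: str, limitations: str, funding: str, notes: str) -> str:
    """Assemble a plain-text readme with one section per item on the slide."""
    lines = [title, "=" * len(title), ""]
    lines += ["CONTACTS"] + [f"  {c}" for c in contacts] + [""]
    lines += ["FILES"] + [f"  {name}: {desc}" for name, desc in files.items()] + [""]
    lines += ["COPYRIGHT AND LICENSING", f"  {license_info}", ""]
    lines += ["LIMITATIONS OF THE DATA", f"  {limitations}", ""]
    lines += ["FUNDING AND INSTITUTIONAL SUPPORT", f"  {funding}", ""]
    lines += ["ADDITIONAL INFORMATION", f"  {notes}", ""]
    return "\n".join(lines)


# All values below are placeholders.
readme = build_readme(
    title="Example cell imaging dataset",
    contacts=["A. Researcher, researcher@example.org"],
    files={
        "images/": "raw TIFF images, one per sample",
        "counts.csv": "cell counts per image, derived from images/",
    },
    license_info="License to be chosen with institutional counsel",
    limitations="Images from a single instrument",
    funding="Grant number placeholder",
    notes="See the data dictionary for variable definitions.",
)
print(readme)
```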

Page 39: Inroads into Data: Getting Involved in Data at Your Institution
Page 40: Inroads into Data: Getting Involved in Data at Your Institution
Page 41: Inroads into Data: Getting Involved in Data at Your Institution

All Points Alone Points

Page 42: Inroads into Data: Getting Involved in Data at Your Institution

Data Dictionary

• Define terms used
• If measurements are made, give units and explain exactly how they were measured or calculated
• How each item is recorded, especially when there are multiple options, e.g. date
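A data dictionary can be kept as a small table next to the data. The sketch below assumes four columns (variable, definition, units, format) and two made-up variables; none of these names come from the presentation. It also shows how a recorded date format can be checked against the dictionary.

```python
import csv
import io
from datetime import datetime

# Hypothetical entries: variable names, units, and formats are placeholders.
DATA_DICTIONARY = [
    {"variable": "visit_date", "definition": "Date of clinic visit",
     "units": "", "format": "%Y-%m-%d"},
    {"variable": "weight", "definition": "Body weight, measured on a calibrated scale",
     "units": "kg", "format": "decimal"},
]


def dictionary_as_csv(entries: list[dict]) -> str:
    """Write the dictionary as CSV text so it can be saved alongside the dataset."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["variable", "definition", "units", "format"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def matches_date_format(value: str, fmt: str) -> bool:
    """Return True when the value can be parsed with the documented date format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


print(dictionary_as_csv(DATA_DICTIONARY))
print(matches_date_format("2015-11-18", "%Y-%m-%d"))
print(matches_date_format("11/18/2015", "%Y-%m-%d"))
```

Writing the date format down once and checking against it is what prevents the ambiguity the slide warns about, where the same date can be recorded in several ways.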

Page 45: Inroads into Data: Getting Involved in Data at Your Institution

Process & Analyze

Clean data and statistics help

Page 46: Inroads into Data: Getting Involved in Data at Your Institution

You Can’t Do It All

https://twitter.com/kdnuggets/status/663427070677118976

Page 47: Inroads into Data: Getting Involved in Data at Your Institution

Tools for Data Cleaning

• OpenRefine – to clean and transform data to different formats http://openrefine.org/

• Trifacta Wrangler – free version of the program, so some limitations https://www.trifacta.com/trifacta-wrangler

• NLM-Scrubber – clinical text de-identification https://scrubber.nlm.nih.gov/

• Johns Hopkins Coursera on Data Science https://www.coursera.org/specializations/jhudatascience

Page 48: Inroads into Data: Getting Involved in Data at Your Institution

Analysis and Visualization

• The R Project - language and environment for statistical computing and graphics https://www.r-project.org/

• Tableau Public – analytical tools and visualizations without learning a programming language https://public.tableau.com/s/

• Flowing Data - Nathan Yau has written a couple of books on statistics and visualization; his website has examples, tutorials and more http://flowingdata.com/

Page 50: Inroads into Data: Getting Involved in Data at Your Institution

Publish & Share

IR, subject repository, or journal that includes supporting data.

Page 51: Inroads into Data: Getting Involved in Data at Your Institution

Sharing Data
• Helps to avoid duplication, thereby reducing costs and wasted effort.
• Promotes scientific integrity and debate.
• Enables scrutiny of research findings and allows for validation of results.
• Leads to new collaborations between data users and data creators.
• Improves research and leads to better science.
• Enables the exploration of topics not envisioned by the initial investigators.
• Permits the creation of new datasets by combining data from multiple sources.
• Increases citations.*

* A study by Piwowar, Day and Fridsma showed a 69% increase in citation, http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0000308

Page 52: Inroads into Data: Getting Involved in Data at Your Institution

Ways to Share Data
Upload to open repository; general, subject, or institutional.
• figshare http://figshare.com/
• Zenodo https://zenodo.org/
• Open Science Framework https://osf.io/
• DataVerse http://dataverse.org/
• Search Registry of Research Data Repositories http://www.re3data.org/

Page 53: Inroads into Data: Getting Involved in Data at Your Institution

Supplemental file with journal article or link to the upload.
– Be sure to check the contract.
– Will the data be available to the public as per OSTP if grant funded?
– Will the rights conflict with institutional ownership of the data?

Tried and true methods?
Send files upon request.
Upload to personal web site.

Page 54: Inroads into Data: Getting Involved in Data at Your Institution

Sharing Sensitive Data

http://iom.nationalacademies.org/Reports/2015/Sharing-Clinical-Trial-Data.aspx

Page 55: Inroads into Data: Getting Involved in Data at Your Institution

Controlled Access
• Researchers must request access to database, explaining research and providing IRB approval forms.
• Data must be anonymized in some way before being made publicly available.
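As a rough illustration of "anonymized in some way", the sketch below drops direct identifiers and replaces the record ID with a random study code. This is pseudonymization only and is an assumption about one possible first step; it is not a complete de-identification method and does not by itself meet any legal or IRB standard. The column names, the `pseudonymize` function, and the sample records are all invented.

```python
import secrets

# Hypothetical column names for direct identifiers.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def pseudonymize(records: list[dict], id_field: str = "patient_id"):
    """Drop direct identifiers and swap the ID for a random study code.

    Returns the cleaned records and the ID-to-code key, which must be
    stored separately and securely (or destroyed).
    """
    codes = {}
    cleaned = []
    for record in records:
        original_id = record[id_field]
        if original_id not in codes:
            codes[original_id] = "P" + secrets.token_hex(4)
        row = {k: v for k, v in record.items()
               if k not in DIRECT_IDENTIFIERS and k != id_field}
        row["study_code"] = codes[original_id]
        cleaned.append(row)
    return cleaned, codes


# Invented sample records.
records = [
    {"patient_id": 101, "name": "Pat Example", "email": "pat@example.org", "glucose": 5.4},
    {"patient_id": 101, "name": "Pat Example", "email": "pat@example.org", "glucose": 5.9},
    {"patient_id": 102, "name": "Sam Sample", "email": "sam@example.org", "glucose": 6.1},
]
cleaned, key = pseudonymize(records)
print(cleaned)
```

Remaining fields such as dates, locations, or rare diagnoses can still identify a person, so a real release would need review against the applicable policy.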

Page 57: Inroads into Data: Getting Involved in Data at Your Institution

Preserve

Stable file formats, duration as per funder or other policy.

Page 58: Inroads into Data: Getting Involved in Data at Your Institution

Storage vs Backup

storage = working files
The files you access regularly and change frequently. In general, losing your storage means losing current versions of the data.

backup = regular process of copying data separate from storage.
You don’t really need it until you lose data, but when you need to restore a file it will be the most important process you have in place.
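The distinction can be sketched in a few lines: working files live in one folder, and a backup is a dated copy made somewhere else. The function name `back_up`, the folder names, and the date-stamp style are assumptions for illustration; a real setup would copy to a different device and run on a schedule.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def back_up(working_dir: Path, backup_root: Path, stamp: str) -> Path:
    """Copy the working files into backup_root/<name>_<stamp> and return the new folder."""
    target = backup_root / f"{working_dir.name}_{stamp}"
    shutil.copytree(working_dir, target)
    return target


# Throwaway folders stand in for real storage and a real backup location.
with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp) / "project"
    storage.mkdir()
    (storage / "results.csv").write_text("sample,value\nA,1\n")

    backup = back_up(storage, Path(tmp) / "backups", datetime.now().strftime("%Y%m%d"))

    # Simulate losing the working copy, then restore it from the backup.
    (storage / "results.csv").unlink()
    restored = (backup / "results.csv").read_text()
    print(restored)
```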

Page 59: Inroads into Data: Getting Involved in Data at Your Institution

Rule of 3

Keep THREE copies of your data
– TWO onsite
– ONE offsite
Example
– One: Laptop
– Two: External hard drive
– Three: Cloud storage
This ensures that your storage and backup are not all in the same place – that’s too risky!

http://dataabinitio.com/?p=320
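The Rule of 3 can be expressed as a simple check. In the sketch below, the `Copy` class and the `satisfies_rule_of_three` function are hypothetical names; comparing SHA-256 checksums to confirm that the copies are identical is an addition for illustration and is not part of the rule as stated on the slide.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Copy:
    location: str
    offsite: bool
    content: bytes  # stand-in for the bytes of the stored file


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def satisfies_rule_of_three(copies: list[Copy]) -> bool:
    """True when at least two onsite copies and one offsite copy exist and all match."""
    onsite = sum(1 for c in copies if not c.offsite)
    offsite = sum(1 for c in copies if c.offsite)
    identical = len({checksum(c.content) for c in copies}) == 1
    return onsite >= 2 and offsite >= 1 and identical


data = b"sample,value\nA,1\n"
copies = [
    Copy("Laptop", offsite=False, content=data),
    Copy("External hard drive", offsite=False, content=data),
    Copy("Cloud storage", offsite=True, content=data),
]
print(satisfies_rule_of_three(copies))
print(satisfies_rule_of_three(copies[:2]))
```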

Page 60: Inroads into Data: Getting Involved in Data at Your Institution

Preservation

Page 61: Inroads into Data: Getting Involved in Data at Your Institution

Considerations

• How long must the data be kept?
• What is the long-term value of the data?

Page 62: Inroads into Data: Getting Involved in Data at Your Institution

Appraisal of Data

1. Relevance to Mission
2. Scientific, Social, Cultural, Historical Value
3. Uniqueness
4. Potential for Redistribution
5. Non-Replicability
6. Economic Case
7. Full Documentation
from NECDMC, Module 7 activity, http://library.umassmed.edu/necdmc/modules based on Whyte and Wilson http://www.dcc.ac.uk/resources/how-guides/appraise-select-data

Page 63: Inroads into Data: Getting Involved in Data at Your Institution

Where to Preserve Data

• Dryad
• Figshare
• Subject Repository
• Institutional Repository
• Government Repository

Page 64: Inroads into Data: Getting Involved in Data at Your Institution

Don’t Forget Print
• Set a schedule to scan lab notebooks and other print materials (makes for a good back up and easier to share data within group).
• Print original should have similar security to digital data (i.e. good, secure storage and labelling of files).

Page 65: Inroads into Data: Getting Involved in Data at Your Institution

Reusing Data

Page 66: Inroads into Data: Getting Involved in Data at Your Institution

Data Information Literacy

DIL http://www.datainfolit.org/

https://www.dataone.org/education-modules

The New England Collaborative Data Management Curriculum (NECDMC) http://library.umassmed.edu/necdmc/index

Page 67: Inroads into Data: Getting Involved in Data at Your Institution

ARE YOU DONE YET?

Page 68: Inroads into Data: Getting Involved in Data at Your Institution

Case 1

• Data Dictionary
• Readme File

Page 69: Inroads into Data: Getting Involved in Data at Your Institution

Case 2

• Rule of 3
• Learn statistics

Page 70: Inroads into Data: Getting Involved in Data at Your Institution

Case 3

• PI needs to check notebook and provide guidance

Page 71: Inroads into Data: Getting Involved in Data at Your Institution

Case 4

• Rule of 3!

Page 72: Inroads into Data: Getting Involved in Data at Your Institution

Librarians and Data
• Subject headings = Organization
• Cataloging = Metadata
• Reference = Data Reference and Interviewing
• Collections = Purchasing data sets, Deciding what data to keep
• Archives = Preservation, Deciding what to keep
• Instruction = Instruction
• Policy = Funder Policies
• Scholarly Communication = Data Citation, Licensing

Page 73: Inroads into Data: Getting Involved in Data at Your Institution

Moss Terrarium in a Jar by Gergely Hideg

Portland Japanese Grdn 11'0709 - 085 by studio-d

Great Dixter Gardens, Sussex, England by ukgardenphotos

Ascott House Gardens, Buckinghamshire, UK by ukgardenphotos

Majorelle Garden (Marrakech, Morocco) by My Wave Pictures

dish gardens by hortulus_aptus

Jane and Trevor's back garden by scrappy annie

Garden metaphor and design by Jamene Brooks-Kieffer

Page 74: Inroads into Data: Getting Involved in Data at Your Institution

You can’t transplant everything

Green Elephants Garden Sculptures by epsos
#MDLS15

Page 75: Inroads into Data: Getting Involved in Data at Your Institution

A garden is…

Local

Cultivated

Intentional

Air Plant Globe Terrarium 2 by cierah

Page 76: Inroads into Data: Getting Involved in Data at Your Institution

Cactus Garden at Knott's Berry Farms by dailyorganizedchaos

Happy Easter from Georgia's Callaway Gardens! by ugardener

final terrarium by bangada

What is your local like?

Page 77: Inroads into Data: Getting Involved in Data at Your Institution

https://www.flickr.com/photos/travelinlibrarian/223839049 by Michael Sauers

Page 78: Inroads into Data: Getting Involved in Data at Your Institution

References
• Bishop, D. 2015. Who’s Afraid of Open Data. Blog post on BishopBlog. http://deevybee.blogspot.co.uk/2015/11/whos-afraid-of-open-data.html
• Carlson, Jake R. 2011. "Demystifying the Data Interview: Developing a Foundation for Reference Librarians to Talk with Researchers about their Data." Reference Services Review 40 (1): 7-23.
• Choudhury, S. 2013. Open Access & Data Management Are Do-Able Through Partnerships. In: ASERL; 2013 Summertime Summit: "Liaison Roles in Open Access & Data Management: Equal Parts Inspiration & Perspiration," https://smartech.gatech.edu/handle/1853/48696
• Christensen-Dalsgaard, et al. 2012. Ten Recommendations for Libraries to Get Started with Research Data Management: Final report of the LIBER working group on E-Science / Research Data Management. Ligue des Bibliothèques Européennes de Recherche (LIBER). http://libereurope.eu/wp-content/uploads/The%20research%20data%20group%202012%20v7%20final.pdf
• McClure, Lucretia W. 1997. "Knowledge and the Container." In Health Information Management. What Strategies? Proceedings of the 5th European Conference of Medical and Health Libraries, Coimbra, Portugal, September 18–21, 1996, edited by Suzanne Bakker, 258-260: Springer Netherlands. doi:10.1007/978-94-015-8786-0_86
• Rinehart, Amanda K. September 2015. "Getting Emotional about Data: The Soft Side of Data Management Services." C&RL News 76 (8): 437-440.
• Ross, Catherine Sheldrick, Kirsti Nilsen, and Marie L. Radford. 2009. Conducting the Reference Interview: A how-to-do-it Manual for Librarians. 2nd ed. New York: Neal-Schuman Publishers.

Page 79: Inroads into Data: Getting Involved in Data at Your Institution

Resources
• Educating Yourself on Research Data Management: Resources and Opportunities (resource list). Greater Midwest Region webinar by Abigail Goben and Rebecca Raszewski, Nov. 16, 2015
• Midwest Data Librarians Symposium - presentations and other materials http://dc.uwm.edu/mdls/2015/
• Pinfield, Stephen, Andrew M. Cox, and Jen Smith. 2014. "Research Data Management and Libraries: Relationships, Activities, Drivers and Influences." PloS One 9, no. 12: e114734. doi:10.1371/journal.pone.0114734
• Sweeney L, Crosas M, Bar-Sinai M. Sharing Sensitive Data with Confidence: The Datatags System. Technology Science. 2015101601. October 16, 2015. http://techscience.org/a/2015101601
• Table of NIH Data Sharing Policies and Repositories https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_policies.html