Data Management Best Practices
March 3, 2015
Chris Eaker, Data Curation Librarian
chris@utk.edu
CC image by University of Maryland Press Releases on Flickr
Adapted from curriculum developed by

Introduction to Data Management

Outline for Today
- Why Manage Data?
- Data Sharing
- Data Entry & Manipulation
- Quality Control & Assurance
- Backup
- Metadata
- Data Citation

Why Manage Data?
- Organization
- Reproducibility
- Version control
- Quality control
- Valuable asset
- Accuracy
- Integrity
- Data sharing
- Sustainability & accessibility

Manage your data for yourself:
- Keep yourself organized: be able to find your files (data inputs, analytic scripts, outputs at various stages of the analytic process, etc.)
- Track your science processes for reproducibility: be able to match up your outputs with the exact inputs and transformations that produced them
- Better control versions of data: easily identify versions that can be periodically purged
- Quality control your data more efficiently

Data is a valuable asset: it is expensive and time consuming to collect.

Data should be managed to:
- maximize the effective use and value of data and information assets
- continually improve quality, including data accuracy, integrity, integration, timeliness of data capture and presentation, relevance, and usefulness
- ensure appropriate use of data and information
- facilitate data sharing
- ensure sustainability and accessibility in the long term for re-use in science

Summary
If data are:
- Well-organized
- Documented
- Preserved
- Accessible
- Verified as to accuracy and validity
Result is:
- High quality data
- Easy to share and re-use in science
- Citation and credibility to the researcher
- Cost-savings to science

To summarize, the goal of effective data management is to produce data that are well organized, both within the files themselves and across groups of files; documented adequately with metadata; preserved for future reuse; accessible by others for that reuse; and accurate and valid.

If data are all those things, then you will have high quality data that are easy to share and reuse in science. The data can then be cited by other researchers, which adds credibility to the researcher who prepared them. Overall, it saves money and advances science.

Outline for Today
- Why Manage Data?
- Data Sharing

One of the main goals of effective data management is to facilitate sharing data with other researchers. Grant funding agencies want to see a greater return on their investment. Let's talk a bit about data sharing and why it's important.

Why Share Data?
Data sharing requires effort, resources, and faith in others. Why do it?
For the benefit of:
- the public
- the research sponsor
- the research community
- the researcher
CC image by Jessica Lucia on Flickr

Why expend the extra effort to share data? Because it benefits the public, the research sponsor, the research community and, perhaps most importantly, the researcher.

Value of Data Sharing: To the Public
A better informed public yields better decision making with regard to:
- environmental and economic planning
- federal, state, and local policies
- social choices such as use of tax dollars and education options
- personal lifestyle and health such as nutrition and recreation
CC image by falonyates on Flickr

How does the public benefit from shared research? The more informed the public is, the better they are able to understand and contribute toward effective public and personal decisions:
- The public needs data to help with environmental and economic planning.
- Data help inform federal, state, and local policies.
- The public can use data to help with social choices such as whom they will vote for, how they want their tax dollars to be used, and where they will send their children to school.
- Data can even help them make personal lifestyle and health choices about exercise, smoking, and nutrition.

Value of Data Sharing: To the Research Sponsor
Organizations that sponsor research must maximize the value of research dollars.
Data sharing enhances the value of research investments by enabling:
- verification of performance metrics and outcomes
- new research and increased return on investment
- advancement of the science
- reduced data duplication expenditures

Why do research sponsors encourage data sharing? Because sponsors have an obligation to maximize the value of the research dollars they invest. Data sharing enhances the value of the research investment by enabling external reviewers to verify the project performance metrics and outcomes. This not only increases the credibility of the data but also spurs new research that can build upon the initial investment and advance the science rather than duplicate expenditures.

Value of Data Sharing: To Scientific Community
Access to related research enables community members to:
- build upon the work of others
- perform meta analyses
- share resources and perspectives
CC image by Lawrence Berkeley National Laboratory on Flickr

The scientific community as a whole also benefits from sharing among researchers. Data sharing allows researchers to build upon one another's work and to further, rather than duplicate, the science by exploring new findings or combining findings into meta analyses that cannot be performed with individual data. In sharing data, the scientific community expands both individual perspectives and the collective comprehension.

Value of Data Sharing: To Scientific Community
Access to related research enables community members to (cont'd):
- increase transparency, reproducibility and comparability of results
- expand methodology assessment, recommendations and improvement
- educate new researchers as to the most current and significant findings

Access to related research enables members of the scientific community to better reproduce, compare and assess methods and results. Scientists are able to learn from one another and educate new researchers as to the most current and significant findings.

Value of Data Sharing: To the Scientist
Scientists that share data gain the benefit of:
- recognition
- improved data quality
- greater opportunity for data exchange
- improved connections
CC image by SLU Madrid Campus on Flickr

And finally, how does the independent researcher benefit from data sharing? When scientists share their data, they gain recognition as an authoritative source and respect as a wise investment for research dollars. When data are exposed, feedback from the broader community can be used to improve the quality and presentation of the data. Sharing data also creates more opportunities for data exchange and for networking with peers and potential collaborators.

I'm going to go through four steps to make data shareable. Then we'll go through the nuts and bolts of executing those four steps.

Making Data Shareable
Step One: Create robust metadata that is discoverable
- Geographic and temporal coverage
- Discipline specific metadata schema
- Discipline specific vocabulary
- Describe attributes

The more robust your metadata, the more easily your data will be discovered and the more appropriately they will be used.

Specifically, when creating metadata:
- Be specific about the geographic and time period coverage of your data.
- Use a discipline specific metadata schema, if possible.
- Use discipline specific themes, place names, and keywords.
- Describe any attributes (variable names, specimen names, etc.) thoroughly.

Making Data Shareable
Step Two: Include archival and reference information
- Include a data citation
- Include a persistent identifier (e.g. DOI)

Data Citation Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20

Step 2: Be sure to include archival and reference information with properly formatted data citations for sources and content. Include persistent data identifiers with the data citation. We're going to go into data citation in more depth later, but this is an example of a citation for a data set on the Dryad Data Repository. As you can see, there is a DOI added to the end, which allows the data to be located easily.
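
Steps One and Two can be sketched in a few lines of Python. This is only a minimal illustration, not a real metadata standard: the field names, the REQUIRED_FIELDS list, and the helper functions format_data_citation and missing_fields are all hypothetical, and the coverage, keyword, and attribute values are placeholders. In practice you would follow the discipline specific schema for your field. The citation values are taken from the Dryad example above.

```python
import json

# Hypothetical checklist of fields, loosely following Steps One and Two.
# These names are assumptions for illustration, not a real schema.
REQUIRED_FIELDS = (
    "title",
    "geographic_coverage",
    "temporal_coverage",
    "keywords",
    "attributes",
    "citation",
)


def format_data_citation(creator, year, title, repository, doi):
    """Build a citation string in the style of the Dryad example."""
    return f"{creator} {year}. {title}. {repository}. doi:{doi}"


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


title = (
    "Data from: Testing for unequal rates of morphological diversification "
    "in the absence of a detailed phylogeny: a case study from characiform fishes"
)

citation = format_data_citation(
    creator="Sidlauskas, B.",
    year=2007,
    title=title,
    repository="Dryad Digital Repository",
    doi="10.5061/dryad.20",
)

# Placeholder values only: replace them with the real coverage, vocabulary,
# and attribute descriptions for your own data set.
record = {
    "title": title,
    "geographic_coverage": "PLACEHOLDER: place name or bounding box",
    "temporal_coverage": "PLACEHOLDER: start date to end date",
    "keywords": ["PLACEHOLDER: discipline specific keyword"],
    "attributes": [
        {
            "name": "PLACEHOLDER_variable_name",
            "description": "PLACEHOLDER: what the variable measures",
            "unit": "PLACEHOLDER: unit of measurement",
        }
    ],
    "citation": citation,
}

print(citation)
print("Missing fields:", missing_fields(record))
print(json.dumps(record, indent=2))
```

Running missing_fields on a draft record before publishing is a quick way to catch an empty coverage or attribute section; the check only confirms that a field is filled in, not that its content is accurate.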

Making Data Shareable
Step Three: Have data contributors review your metadata to ensure validity and organizational correctness
- Are the processes described accurately?
- Are all contributions adequately identified?
- Has management reviewed the product and documentation?
- Is the funding organization properly recognized?

Step 3: Be sure to have data contributors review their metadata to ensure validity and organizational correctness. Are the processes correct? Is your contribution adequately represented and reflected? Are your organization and the funding organization properly recognized? Be sure to get management and sponsor approval on the data publication, including the content, the presentation, and the manner in which contributors are identified.

Making Data Shareable
Step Four: Publish your data and metadata via:
Data Repositories/Clearinghouses
- Discipline-specific
  - Sciences
    - Knowledge Network for Biocomplexity (KNB) Data Portal
    - Long Term Ecological Research (LTER) Network Data Portal
  - Social Sciences
    - ICPSR
- Institutional
  - Trace

The last step to make your data shareable: Step 4: Publish your metadata in data portals and clearinghouses. Seek out relevant government portals and portals developed by specific communities of practice. The nice