Best practices data management

Best Practices Creating and Managing Research Data Presented by Sherry Lake [email protected] http://dmconsult.library.virginia.edu/

Transcript of Best practices data management

  1. 1. Best Practices Creating and Managing Research Data. Presented by Sherry Lake [email protected] http://dmconsult.library.virginia.edu/ [Diagram: Data Life Cycle, with labels Re-Purpose, Re-Use, Deposit, Data Collection, Data Analysis, Data Sharing, Proposal Planning Writing, Data Discovery, End of Project, Data Archive, Project Start Up]
  2. 2. Why Manage Your Data?
  3. 3. Best Practices for Creating Data: 1. Use Consistent Data Organization 2. Use Standardized Naming, Codes and Formats 3. Assign Descriptive File Names 4. Perform Basic Quality Assurance / Quality Control 5. Preserve Information - Use Scripted Languages 6. Define Contents of Data Files; Create Documentation 7. Use Consistent, Stable and Open File Formats
  4. 4. Spreadsheet Examples
  5. 5. Spreadsheets
  6. 6. Consistent Data Organization: Spreadsheets (such as those found in Excel) are sometimes a necessary evil. They allow shortcuts which will result in your data not being machine-readable. But there are some simple steps you can take to ensure that you are creating spreadsheets that are machine-readable and will withstand the test of time.
  7. 7. Spreadsheets
  8. 8. Spreadsheet Problems?
  9. 9. Problems: Dates are not stored consistently. Values are labeled inconsistently. Data coding is inconsistent. The order of values is different.
  10. 10. Problems: Confusion between numbers and text. Different types of data are stored in the same columns. The spreadsheet loses interpretability if it is sorted.
  11. 11. How would you correct this file?
  12. 12. Spreadsheet Best Practices: Include a header line as the 1st line (or record). Label each column with a short but descriptive name. Names should be unique. Use letters, numbers, or _ (underscore). Do not include blank spaces or symbols (+ - & ^ *).
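The header rules on this slide can be checked mechanically. The following is a minimal sketch, not part of the original presentation; the function name `check_header` and the sample column names are hypothetical, and the allowed-character pattern is an assumption drawn from the slide's "letters, numbers, or _" rule.

```python
import re

# Assumed rule: a name may contain only letters, digits, and underscores.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def check_header(names):
    """Return a list of problems found in a list of column names."""
    problems = []
    seen = set()
    for name in names:
        if not VALID_NAME.fullmatch(name):
            problems.append(f"blank or disallowed character in {name!r}")
        if name in seen:
            problems.append(f"duplicate name {name!r}")
        seen.add(name)
    return problems


# Hypothetical header with two problems: a space and a repeated name.
print(check_header(["site_id", "precip", "air temp", "precip"]))
```

An empty list means the header passed both checks; anything else names the offending column.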
  13. 13. Spreadsheet Best Practices: Columns of data should be consistent. Use the same naming convention for text data. Each line should be complete. Each line should have a unique identifier.
  14. 14. Spreadsheet Best Practices: Columns should include only a single kind of data: text or string data, integer numbers, or floating point or real numbers.
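As an illustration of the row and column rules (complete lines, a unique identifier, one kind of data per column), here is a hedged sketch using only Python's standard `csv` module. The sample data, the column names, and the choice of `site_id` as the identifier are all hypothetical.

```python
import csv
import io

# Hypothetical sample data; "site_id" is assumed to be the unique identifier.
SAMPLE = """site_id,precip,temp
A1,12.5,20.1
A2,n/a,19.8
A2,3.0,
"""


def is_float(text):
    """True if the text parses as a floating point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_rows(text, id_column, numeric_columns):
    """Return (line number, problem) pairs for a CSV given as a string."""
    problems = []
    seen_ids = set()
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if any(value is None or value == "" for value in row.values()):
            problems.append((line_no, "incomplete row"))
        if row[id_column] in seen_ids:
            problems.append((line_no, "duplicate identifier"))
        seen_ids.add(row[id_column])
        for column in numeric_columns:
            if row[column] and not is_float(row[column]):
                problems.append((line_no, f"non-numeric value in {column}"))
    return problems


for problem in check_rows(SAMPLE, "site_id", ["precip", "temp"]):
    print(problem)
```

The check reports text in a numeric column, a repeated identifier, and an empty cell, which correspond to the three rules above.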
  15. 15. Use Naming Standards & Codes: Use commonly accepted label names that describe the contents (e.g., precip for precipitation). Use consistent capitalization (e.g., not temp, Temp, and TEMP in the same file). Use standard codes: State Postal (VA, MA); FIPS Codes for Counties and County Equivalent Entities (http://www.census.gov/geo/reference/codes/cou.html).
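Inconsistent capitalization is easy to detect with a script. This is a small sketch under the assumption that labels differing only by case are meant to be the same label; the function name is hypothetical.

```python
from collections import defaultdict


def inconsistent_capitalization(labels):
    """Group labels that differ only by case and return the conflicting groups."""
    groups = defaultdict(set)
    for label in labels:
        groups[label.lower()].add(label)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


# The slide's own example of what to avoid: temp, Temp, and TEMP in one file.
print(inconsistent_capitalization(["temp", "Temp", "TEMP", "precip"]))
```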
  16. 16. Use Standardized Formats: Use standardized formats for units: International System of Units (SI) (http://physics.nist.gov/Pubs/SP330/sp330.pdf). ISO 8601 Standard for Date and Time: YYYY-MM-DDThh:mm:ss.sTZD, e.g., 2009-10-13T09:12:34.9Z or 2009-10-13T09:12:34.9+05:00. Spatial Coordinates for Latitude/Longitude: +/- DD.DDDDD, e.g., -78.476 (longitude), +38.029 (latitude).
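Most scripting languages can write and read ISO 8601 timestamps directly. The sketch below uses Python's standard `datetime` module; the timestamp and the helper `format_coordinate` are chosen for illustration only. Note that Python prints microseconds and writes UTC as `+00:00` rather than `Z`; both are valid ISO 8601 extended forms.

```python
from datetime import datetime, timedelta, timezone

# A timestamp chosen for illustration, once in UTC and once at UTC+05:00.
stamp_utc = datetime(2009, 10, 13, 9, 12, 34, 900000, tzinfo=timezone.utc)
stamp_plus5 = datetime(
    2009, 10, 13, 9, 12, 34, 900000, tzinfo=timezone(timedelta(hours=5))
)

text_utc = stamp_utc.isoformat()
text_plus5 = stamp_plus5.isoformat()
print(text_utc)
print(text_plus5)

# Reading the text back recovers the same instant, offset included.
parsed = datetime.fromisoformat(text_plus5)


def format_coordinate(value):
    """Format decimal degrees with an explicit sign and five decimals."""
    return f"{value:+.5f}"


print(format_coordinate(-78.476), format_coordinate(38.029))
```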
  17. 17. File Names
  18. 18. File Names: Use descriptive names. Not too long; CamelCase. Try to include time: date using YYYYMMDD. Use version numbers. Don't use spaces; may use - or _. Don't change default extensions.
  19. 19. Organize Files Logically: Make sure your file system is logical and efficient. Example folders: Biodiversity, Lake, Grassland, Experiments, Field Work. Example file names: Biodiv_H20_heatExp_2005_2008.csv, Biodiv_H20_predatorExp_2001_2003.csv, Biodiv_H20_planktonCount_start2001_active.csv, Biodiv_H20_chla_profiles_2003.csv. File name components: Project Name, Location, Experiment Name, Date, File Format.
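File names that follow a fixed pattern are best generated rather than typed. The following is a minimal sketch assuming the component order shown on the slide (project, location, experiment, date) plus a version number; the function name, the `v02` version style, and the sample date are hypothetical.

```python
import re
from datetime import date


def build_file_name(project, location, experiment, when, version, extension="csv"):
    """Join name components with underscores, using a YYYYMMDD date."""
    parts = [
        project,
        location,
        experiment,
        when.strftime("%Y%m%d"),
        f"v{version:02d}",
    ]
    # Drop spaces and symbols inside each part; underscore is kept as the separator.
    cleaned = [re.sub(r"[^A-Za-z0-9-]", "", part) for part in parts]
    return "_".join(cleaned) + "." + extension


name = build_file_name("Biodiv", "H20", "heat Exp", date(2005, 6, 1), 2)
print(name)
```

Generating names this way keeps the component order and date format identical across every file in the project.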
  20. 20. Data Validation: Check for missing, impossible, anomalous values (plotting, mapping). Examine summary statistics. Verify data transfers from notebooks to digital files. Verify data conversion from one file format to another. Source: Hook, et al. 2010. Best Practices for Preparing Environmental Data Sets to Share and Archive. Available online: http://daac.ornl.gov/PI/BestPractices-2010.pdf.
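A first pass at these checks can be scripted. This sketch flags missing and out-of-range values and computes summary statistics for one column; the missing-value code `-9999`, the plausible range, and the sample readings are assumptions made for illustration.

```python
import statistics

MISSING_CODE = "-9999"  # hypothetical missing-value code


def validate(values, low, high):
    """Return indexes of missing and impossible values plus a summary of the rest."""
    missing = [i for i, v in enumerate(values) if v in ("", MISSING_CODE)]
    numbers = [
        (i, float(v)) for i, v in enumerate(values) if v not in ("", MISSING_CODE)
    ]
    impossible = [i for i, x in numbers if not low <= x <= high]
    valid = [x for _, x in numbers if low <= x <= high]
    summary = None
    if valid:
        summary = {
            "n": len(valid),
            "min": min(valid),
            "max": max(valid),
            "mean": statistics.mean(valid),
        }
    return missing, impossible, summary


# Hypothetical air temperatures in degrees Celsius, as read from a text file.
readings = ["20.5", "21.0", "", "-9999", "210.0", "19.5"]
missing, impossible, summary = validate(readings, low=-50.0, high=60.0)
print(missing, impossible, summary)
```

A range check only catches impossible values; anomalous but possible values still call for the plots and maps the slide recommends.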
  21. 21. Data Manipulation: You will need to repeat reduction and analysis procedures many times, and you need a workflow that recognizes this. Scripted languages can help capture the workflow. You could just document all steps by hand; after the 20th iteration through your data set, however, you may feel more fondly towards scripted languages. Learn the analytical tools of your field. Talk to colleagues, etc., and choose at least one tool to master.
  22. 22. Preserve Information: Keep the original (raw) file. Do not include transformations, interpolations, etc. in it. Consider making the raw data read-only. Use a processing script (R) and save the results as a new file.
  23. 23. Preserving: Scripted Notes. Use a scripted language to process data: R statistical package (free, powerful), SAS, MATLAB. Processing scripts record the processing: steps are recorded in textual format, can be easily revised and re-executed, and are easy to document. GUI-based analysis may be easier, but is harder to reproduce.
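The raw-file workflow can be sketched in a few lines. The slides name R, SAS, and MATLAB; Python is used here only as one more scripted language. The file names and the processing step (trimming whitespace from each cell) are hypothetical stand-ins for a real transformation.

```python
import csv
import os
import stat
import tempfile


def make_read_only(path):
    """Remove write permission so the raw file cannot be edited by accident."""
    os.chmod(path, stat.S_IREAD)


def process(raw_path, out_path):
    """Hypothetical processing step: trim whitespace from every cell."""
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([cell.strip() for cell in row])


with tempfile.TemporaryDirectory() as folder:
    raw = os.path.join(folder, "raw.csv")
    clean = os.path.join(folder, "clean.csv")
    with open(raw, "w", newline="") as handle:
        handle.write("site_id, precip\nA1 , 12.5\n")

    make_read_only(raw)   # the raw file is never modified after this point
    process(raw, clean)   # results always go to a new file

    raw_mode = stat.S_IMODE(os.stat(raw).st_mode)
    with open(clean, newline="") as handle:
        result = handle.read()

print(result)
```

Because every step lives in the script, rerunning it after a correction to the raw data reproduces the cleaned file exactly.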
  24. 24. Data Documentation (Metadata): Informal or formal methods to describe your data. Important if you want to reuse your own data in the future. Also necessary when sharing your data.
  25. 25. Define Contents of Data Files: Create a Project Document File (Lab Notebook) with details such as: names of data & analysis files associated with the study; definitions for data and codes (include missing value codes, names); units of measure (accuracy and precision); standards or instrument calibrations.
  26. 26. Data Dictionary Example
  27. 27. Data Dictionary Example
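The example slides are images and are not reproduced in this transcript. As a stand-in, here is a hedged sketch of a small data dictionary kept as a CSV file; it is not the presenter's example, and every variable, unit, and code in it is hypothetical.

```python
import csv
import io

FIELDS = ["variable", "description", "type", "units", "missing_code"]

# Hypothetical entries; a real dictionary describes the project's own variables.
DICTIONARY = [
    {"variable": "site_id", "description": "Unique site identifier",
     "type": "text", "units": "", "missing_code": ""},
    {"variable": "precip", "description": "Daily precipitation",
     "type": "float", "units": "mm", "missing_code": "-9999"},
]


def write_dictionary(entries, handle):
    """Write the data dictionary as CSV, one row per variable."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(entries)


def undocumented(columns, entries):
    """Return data-file columns that have no entry in the dictionary."""
    documented = {entry["variable"] for entry in entries}
    return [column for column in columns if column not in documented]


buffer = io.StringIO()
write_dictionary(DICTIONARY, buffer)
print(buffer.getvalue())
print(undocumented(["site_id", "precip", "temp"], DICTIONARY))
```

Keeping the dictionary in a plain-text file next to the data lets a script confirm that every column in a data file is documented.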
  28. 28. Data Documentation. Project Documentation: context of data collection; data collection methods; structure, organization of data files; data sources used; data validation, quality assurance; transformations of data from the raw data through analysis; information on confidentiality, access and use conditions. Dataset Documentation: variable names and descriptions; explanation of codes and schemas used; algorithms used to transform data; file format and software (including version) used.
  29. 29. File Format Sustainability. Types and examples: Text: ASCII, Word, PDF. Numerical: ASCII, SPSS, STATA, Excel, Access, MySQL. Multimedia: Jpeg, tiff, mpeg, quicktime. Models: 3D, statistical. Software: Java, C, Fortran. Domain-specific: FITS in astronomy, CIF in chemistry. Instrument-specific: Olympus Confocal Microscope Data Format.
  30. 30. Choosing File Formats: For accessible data (in the future), choose formats that are: non-proprietary (software formats); open, documented standards; common, used by the research community; standard representation (ASCII, Unicode); unencrypted & uncompressed.
  31. 31. Best Practices for Creating Data: 1. Use Consistent Data Organization 2. Use Standardized Naming, Codes and Formats 3. Assign Descriptive File Names 4. Perform Basic Quality Assurance / Quality Control 5. Preserve Information - Use Scripted Languages 6. Define Contents of Data Files; Create Documentation 7. Use Consistent, Stable and Open File Formats
  32. 32. Following these Best Practices will improve the usability of the data by you or by others, make your data computer ready, and save you time.
  33. 33. [Diagram: Research Life Cycle and Data Life Cycle, with labels Re-Purpose, Re-Use, Deposit, Data Collection, Data Analysis, Data Sharing, Proposal Planning Writing, Data Discovery, End of Project, Data Archive, Project Start Up]
  34. 34. Managing Data in the Data Life Cycle: choosing file formats; file naming conventions; document all data details; access control & security; backup & storage.
  35. 35. Data Security & Access Control: Network security: keep confidential or sensitive data off internet servers or computers connected to the internet. Physical security: access to buildings and rooms. Computer systems & files: use passwords on files/system; virus protection.
  36. 36. Backup Your Data: Reduce the risk of damage or loss. Use multiple locations (here, near, far). Create a backup schedule. Use a reliable backup medium. Test your backup system (i.e., test file recovery).
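One way to test file recovery is to compare a checksum of the original file with a checksum of the restored copy. The sketch below uses SHA-256 from Python's standard library; the file names are hypothetical, and a plain file copy stands in for a real backup-and-restore cycle.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(original, restored):
    """True if the restored file is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)


with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "data.csv")
    restored = os.path.join(folder, "data_restored.csv")
    with open(original, "w") as handle:
        handle.write("site_id,precip\nA1,12.5\n")

    shutil.copyfile(original, restored)  # stand-in for backup and restore
    intact = backup_is_intact(original, restored)

    with open(restored, "a") as handle:  # simulate a damaged copy
        handle.write("corrupted")
    damaged = backup_is_intact(original, restored)

print(intact, damaged)
```

The same comparison can be used to verify a data transfer or a copy to a new storage medium.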
  37. 37. Storage & Backup
  38. 38. Sustainable Storage. Lifespan of Storage Media: http://www.crashplan.com/medialifespan/
  39. 39. Best Practices Bibliography Borer, E. T., Seabloom, E. W., Jones, M. B., & Schildhauer, M. (2009). Some simple guidelines for effective data management. Bulletin of the Ecological Society of America, 90(2), 205-214. http://dx.doi.org/10.1890/0012-9623-90.2.205 Graham, A., McNeill, K., Stout, A., & Sweeney, L. (2010). Data Management and Publishing. Retrieved 05/31/2012, from http://libraries.mit.edu/guides/subjects/data-management/. Hook, L. A., Santhana Vannan, S.K., Beaty, T. W., Cook, R. B. and Wilson, B.E. (2010). Best Practices for Preparing Environmental Data Sets to Share and Archive. Available online (http://daac.ornl.gov/PI/BestPractices-2010.pdf) from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. http://dx.doi.org/10.3334/ORNLDAAC/BestPractices-2010.
  40. 40. Best Practices Bibliography (Cont.) Inter-university Consortium for Political and Social Research (ICPSR). (2012). Guide to social science data preparation and archiving: Best practices throughout the data cycle (5th ed.). Ann Arbor, MI. Retrieved 05/31/2012, from http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf. Van den Eynden, V., Corti, L., Woollard, M. & Bishop, L. (2011). Managing and Sharing Data: A Best Practice Guide for Researchers (3rd ed.). Retrieved 05/31/2012, from http://www.data-archive.ac.uk/media/2894/managingsharing.pdf.