Introduction to Data Management
CC image by University of Maryland Press Releases on Flickr
Adapted from curriculum developed by
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Metadata
• Data Citation
• Organization
• Reproducibility
• Version control
• Quality control
• Valuable asset
• Accuracy
• Integrity
• Data sharing
• Sustainability & accessibility
• If data are:
o Well-organized
o Documented
o Preserved
o Accessible
o Verified as to accuracy and validity
• Result is:
o High quality data
o Easy to share and re-use in science
o Citation and credibility to the researcher
o Cost-savings to science
• Why Manage Data?
• Data Sharing
Data sharing requires effort, resources, and faith in others. Why do it?
For the benefit of:
o the public
o the research sponsor
o the research community
o the researcher
CC image by Jessica Lucia on Flickr
A better informed public yields better decision making with regard to:
o Environmental and economic planning
o Federal, state, and local policies
o Social choices, such as use of tax dollars and education options
o Personal lifestyle and health, such as nutrition and recreation
CC image by falonyates on Flickr
• Organizations that sponsor research must maximize the value of research dollars
• Data sharing enhances the value of research investments by enabling:
o verification of performance metrics and outcomes
o new research and increased return on investment
o advancement of the science
o reduced data duplication expenditures
Access to related research enables community members to:
o build upon the work of others
o perform meta-analyses
o share resources and perspectives
CC image by Lawrence Berkeley National Laboratory on Flickr
Access to related research enables community members to (cont’d):
o increase transparency, reproducibility and comparability of results
o expand methodology assessment, recommendations and improvement
o educate new researchers as to the most current and significant findings
Scientists who share data gain the benefit of:
o recognition
o improved data quality
o greater opportunity for data exchange
o improved connections
CC image by SLU Madrid Campus on Flickr
Step One:
Create robust metadata that is discoverable
o Geographic and temporal coverage
o Discipline-specific metadata schema
o Discipline-specific vocabulary
o Describe attributes
Step Two:
Include archival and reference information
o Include a data citation
o Include a persistent identifier (e.g., DOI)
Data Citation Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
Step Three:
Have data contributors review your metadata to ensure validity and organizational ‘correctness’
o are the processes described accurately?
o are all contributions adequately identified?
o has management reviewed the product and documentation?
o is the funding organization properly recognized?
Step Four:
Publish your data and metadata via:
Data Repositories/Clearinghouses
• Discipline-specific
◦ Sciences
– Knowledge Network for Biodiversity (KNB) Data Portal
– Long Term Ecological Research (LTER) Network Data Portal
◦ Social Sciences
– ICPSR
• Institutional
◦ Trace
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Create data sets that are:
o Valid
o Organized to support ease of use
CC image by Travis S on Flickr
• Inconsistency between data collection events
– Location of Date information
– Inconsistent Date format
– Column names
– Order of columns
• Inconsistency between data collection events
– Different site spellings, capitalization, and spaces in site names make the data hard to filter
– Codes used for site names for some data, but spelled out for others
– Mean1 value is in Weight column
– Text and numbers in same column – what is the mean of 12, “escaped < 15”, and 91?
• Columns of data are consistent: only numbers, dates, or text
• Consistent Names, Codes, Formats (date) used in each column
• Data are all in one table, which is much easier for a statistical program to work with than multiple small tables which each require human intervention
• Descriptive column names
◦ Soil T30 → Soil_Temp_30cm
◦ Species-Code → Species_Code (avoid using -, +, *, ^ in column names; some software may interpret these symbols as operators)
• Descriptive file names
◦ Mammal data-.csv → FieldVisit1_SmallMammalData_2010-04-11.csv
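The column-name advice above can be partly automated. The sketch below is one possible approach, not part of the original curriculum: the helper name `safe_column_name` is hypothetical, and it only removes risky symbols. Making a name descriptive (for example, adding the measurement depth) still needs a person.

```python
import re

def safe_column_name(name):
    """Replace runs of characters other than letters, digits, and
    underscores with a single underscore (hypothetical helper)."""
    cleaned = re.sub(r"[^0-9A-Za-z_]+", "_", name.strip())
    # Drop underscores left over at either end, e.g. from a trailing "*".
    return cleaned.strip("_")

renamed = [safe_column_name(n) for n in ["Species-Code", "Soil T30"]]
```

Running a helper like this over every header before analysis keeps symbols such as - and + from being read as operators.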
• Enter complete lines of data
Sorting an Excel file with empty cells is not a good idea!
• Missing data
o Preferably leave field empty (NULL = no value)
o In numeric fields, use a distinct value such as 9999 to indicate a missing value
o In text fields, use NA (“Not Applicable” or “Not Available”)
o Use data flags in a separate column to qualify missing values
Date      Time  NO3_N_Conc  NO3_N_Conc_Flag
20081011  1300  0.013
20081011  1330  0.016
20081011  1400              M1
20081011  1430  0.018
20081011  1500  0.001       E1
M1 = missing; no sample collected
E1 = estimated from grab sample
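As an illustration of why an empty field plus a flag column is convenient, the sketch below reads the nitrate table above with Python's standard `csv` module. The CSV layout and the function name `read_nitrate` are assumptions made for this example. Empty concentrations become `None`, so they cannot be mistaken for zeros when a mean is calculated.

```python
import csv
import io

# Assumed CSV version of the table above; an empty field means "no value".
RAW = """Date,Time,NO3_N_Conc,NO3_N_Conc_Flag
20081011,1300,0.013,
20081011,1330,0.016,
20081011,1400,,M1
20081011,1430,0.018,
20081011,1500,0.001,E1
"""

FLAG_MEANINGS = {
    "M1": "missing; no sample collected",
    "E1": "estimated from grab sample",
}

def read_nitrate(text):
    """Parse rows, turning empty concentration fields into None."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        conc = rec["NO3_N_Conc"].strip()
        rows.append({
            "date": rec["Date"],
            "time": rec["Time"],
            "conc": float(conc) if conc else None,
            "flag": rec["NO3_N_Conc_Flag"].strip() or None,
        })
    return rows

rows = read_nitrate(RAW)
# Only real measurements enter the mean; the flagged gap is skipped.
valid = [r["conc"] for r in rows if r["conc"] is not None]
mean_conc = sum(valid) / len(valid)
```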
Spreadsheets:
• Great for charts, graphs, calculations
• Flexible about cell content type: cells in the same column can contain numbers or text
• Lack record integrity: a column can be sorted independently of all others
• Easy to use, but harder to maintain as the complexity and size of the data grow
Databases:
• Easy to query to select portions of data
• Data fields are typed; for example, only integers are allowed in integer fields
• Columns cannot be sorted independently of each other
• Steeper learning curve than a spreadsheet
• A set of tables
• Relationships
• A command language
Sample sites: *siteID, site_name, latitude, longitude, description
Species: *speciesID, species_name, common_name, family, order
Samples: *sampleID, siteID, sample_date, speciesID, height, flowering, flag, comments
Date          Site         Height               Flowering
<dates only>  <text only>  <real numbers only>  <‘y’ and ‘n’ only>
Advantages
• quality control
• performance
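The quality-control advantage of typed fields can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not the curriculum's own example: the table name `samples`, the column names, and the helper `try_insert` are hypothetical. SQLite itself is loosely typed, so the sketch uses CHECK constraints to get the "only this kind of value" behavior described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE samples (
        -- date() returns NULL for text it cannot parse, so IS yields 0
        sample_date TEXT NOT NULL CHECK (date(sample_date) IS sample_date),
        site        TEXT NOT NULL,
        height      REAL CHECK (height IS NULL
                                OR typeof(height) IN ('real', 'integer')),
        flowering   TEXT CHECK (flowering IN ('y', 'n'))
    )
""")

def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO samples VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert(("2010-02-13", "A", 12.5, "y"))
rejected_text_height = try_insert(("2010-02-13", "A", "tall", "y"))
rejected_flag = try_insert(("2010-02-13", "A", 12.5, "maybe"))
```

Invalid values are refused at entry time instead of being discovered later during analysis.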
Date       Site  Species  Flowering?
2/13/2010  A     BOGR2    y
2/13/2010  B     HODR     y
4/15/2010  B     BOER4    y
4/15/2010  C     PLJA     n
Site  Latitude  Longitude
A     34.1      -109.3
B     35.2      -108.6
C     32.6      -107.5
Mix and match data on the fly:
Date       Site  Species  Flowering?  Latitude  Longitude
2/13/2010  A     BOGR2    y           34.1      -109.3
2/13/2010  B     HODR     y           35.2      -108.6
4/15/2010  B     BOER4    y           35.2      -108.6
4/15/2010  C     PLJA     n           32.6      -107.5
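Combining the sample and site tables above corresponds to a join. The sketch below reproduces it with Python's built-in `sqlite3` module. The table and column names are assumptions for this example, and the dates are rewritten in ISO form (YYYY-MM-DD) so that they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites (site TEXT PRIMARY KEY, latitude REAL, longitude REAL);
    CREATE TABLE samples (sample_date TEXT, site TEXT REFERENCES sites(site),
                          species TEXT, flowering TEXT);
    INSERT INTO sites VALUES
        ('A', 34.1, -109.3), ('B', 35.2, -108.6), ('C', 32.6, -107.5);
    INSERT INTO samples VALUES
        ('2010-02-13', 'A', 'BOGR2', 'y'),
        ('2010-02-13', 'B', 'HODR',  'y'),
        ('2010-04-15', 'B', 'BOER4', 'y'),
        ('2010-04-15', 'C', 'PLJA',  'n');
""")

# Site coordinates are stored once and attached to each sample on request.
joined = conn.execute("""
    SELECT s.sample_date, s.site, s.species, s.flowering,
           t.latitude, t.longitude
    FROM samples AS s
    JOIN sites AS t ON s.site = t.site
    ORDER BY s.sample_date, s.site
""").fetchall()
```

Because each site's coordinates live in one row, a corrected latitude only needs to be changed in one place.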
This table is called SoilTemp:
Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-01  C     R          30           12.8
2010-02-01  B     C          10           13.2
2010-02-02  C     R          0            6.3
2010-02-02  A     N          0            15.1
SQL examples:
Select Date, Plot, Treatment, SensorDepth, Soil_Temperature from SoilTemp where Date = '2010-02-01'
Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-01  C     R          30           12.8
2010-02-01  B     C          10           13.2
Select * from SoilTemp where Treatment='N' and SensorDepth='0'
Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-02  A     N          0            15.1
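The two SoilTemp queries above can be tried without installing a database server. The sketch below loads the four rows into an in-memory SQLite database through Python's built-in `sqlite3` module. The column types are assumptions, and the search values are passed as parameters rather than typed into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SoilTemp (
        Date TEXT, Plot TEXT, Treatment TEXT,
        SensorDepth INTEGER, Soil_Temperature REAL
    )
""")
conn.executemany(
    "INSERT INTO SoilTemp VALUES (?, ?, ?, ?, ?)",
    [
        ("2010-02-01", "C", "R", 30, 12.8),
        ("2010-02-01", "B", "C", 10, 13.2),
        ("2010-02-02", "C", "R", 0, 6.3),
        ("2010-02-02", "A", "N", 0, 15.1),
    ],
)

# First query: every reading taken on one date.
by_date = conn.execute(
    "SELECT Date, Plot, Treatment, SensorDepth, Soil_Temperature "
    "FROM SoilTemp WHERE Date = ?",
    ("2010-02-01",),
).fetchall()

# Second query: surface readings for treatment N.
by_treatment = conn.execute(
    "SELECT * FROM SoilTemp WHERE Treatment = ? AND SensorDepth = ?",
    ("N", 0),
).fetchall()
```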
• Be aware of Best Practices when designing data file structures
• Choose a data entry method that allows some validation of data as it is entered
• Consider investing time in learning how to use a database if datasets are large or complex
CC image by fo.ol on Flickr
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Errors of Commission
o Incorrect or inaccurate data entered
o Examples: malfunctioning instrument, mistyped data
• Errors of Omission
o Data or metadata not recorded
o Examples: inadequate documentation, human error, anomalies in the field
CC image by Nick J Webb on Flickr
• Define & enforce standards
◦ Formats
◦ Codes
◦ Measurement units
◦ Metadata
• Assign responsibility for data quality
◦ Be sure assigned person is educated in QA/QC
• Double entry
◦ Data keyed in by two independent people
◦ Check for agreement with computer verification
• Record a reading of the data and transcribe from the recording
• Use text-to-speech program to read data back
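The computer verification step of double entry can be as simple as a cell-by-cell comparison of the two keyed versions. The sketch below is a minimal illustration. The function name `compare_entries` and the sample rows are hypothetical, and each entry is assumed to be a list of dictionaries such as `csv.DictReader` produces.

```python
def compare_entries(first, second):
    """Return (row_index, column, value_a, value_b) for every cell
    where two independently keyed copies disagree. Row indexes start at 0."""
    if len(first) != len(second):
        raise ValueError("the two entries have different numbers of rows")
    mismatches = []
    for index, (row_a, row_b) in enumerate(zip(first, second)):
        for column in row_a:
            if row_a[column] != row_b.get(column):
                mismatches.append(
                    (index, column, row_a[column], row_b.get(column))
                )
    return mismatches

# Hypothetical data keyed in by two people; the second transposed two digits.
entry_a = [{"Site": "A", "Weight": "12"}, {"Site": "B", "Weight": "91"}]
entry_b = [{"Site": "A", "Weight": "12"}, {"Site": "B", "Weight": "19"}]
differences = compare_entries(entry_a, entry_b)
```

Each reported mismatch is then checked against the original field sheet. The comparison alone cannot say which copy is right.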
CC image by weskriesel on Flickr
• Design data storage well
◦ Minimize the number of times items must be entered repeatedly
◦ Use consistent terminology
◦ Atomize data: one cell per piece of information
• Document changes to data
◦ Avoids duplicate error checking
◦ Allows undo if necessary
• Make sure data line up in proper columns
• No missing, impossible, or anomalous values
• Perform statistical summaries
CC image by chesapeakeclimate on Flickr
• Look for outliers
◦ Outliers are extreme values for a variable given the statistical model being used
◦ The goal is not to eliminate outliers but to identify potential data contamination
• Methods to look for outliers
◦ Graphical
– Normal probability plots
– Regression
– Scatter plots
◦ Plotting on maps
◦ Deviation
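The deviation method is not spelled out above. One common reading, assumed here, is to flag values that lie more than a chosen number of standard deviations from the mean. The function name `flag_by_deviation`, the threshold of 2, and the sample values are all hypothetical choices for this sketch.

```python
import statistics

def flag_by_deviation(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` sample
    standard deviations from the mean. Flagged values are candidates
    for review, not for automatic deletion."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    if spread == 0:
        return []
    return [
        (index, value)
        for index, value in enumerate(values)
        if abs(value - mean) / spread > threshold
    ]

# Hypothetical measurements with one suspicious reading at the end.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 58]
suspects = flag_by_deviation(readings)
```

Because an extreme value inflates both the mean and the standard deviation, this check works best alongside the graphical methods listed above.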
• Beware of errors of commission and omission
• Execute quality assurance and quality control strategies
◦ Data entry
◦ Data visualization
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Backups vs. Archives
◦ Backups: a copy (or copies) of the original file is made before the original is overwritten
◦ Archives: preservation of the file
• Data Preservation
o Includes archiving in addition to processes such as data rescue, data reformatting, data conversion, metadata
• Backups
o periodic snapshots
o usually copies of files
o performed on regular schedule
• Archiving
o preserve data
o usually the final version
o performed at the end of a project or milestones
It is a good idea to have multiple copies of your backups and archives, in case one copy fails.
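Keeping multiple copies only helps if each copy is intact. The sketch below copies a file to several locations and confirms each copy with a SHA-256 checksum. The checksum step and the function names `sha256_of` and `backup_file` are assumptions added for illustration. A real backup plan would also cover scheduling and off-site storage.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_file(source, backup_dirs):
    """Copy `source` into every directory in `backup_dirs` and verify
    each copy against the checksum of the original."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        if sha256_of(target) != expected:
            raise OSError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```

A call such as `backup_file("FieldVisit1_SmallMammalData_2010-04-11.csv", ["/path/to/disk1", "/path/to/disk2"])` would return the paths of the verified copies. The directory names here are placeholders.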
• Limit or negate loss of data
• Save time, money, productivity
• Help prepare for disasters
o Accidental deletions
o Fires, natural disasters
o Software bugs, hardware failures
• Reproduce results
• Respond to data requests
• Limit liability
CC image courtesy of Brian J Matis on Flickr
• Includes backups and archiving
• Also includes
◦ data conversion
◦ data reformatting
◦ data rescue
• Data Conversions and Formats
o Use non-proprietary, standard formats
o Textual documents: .txt
o Spreadsheets: .csv
o Digital images: .tiff
• Versioning
• File Naming
• Create a backup plan that clearly identifies:
o roles,
o responsibilities,
o where the data are backed up,
o how often the files are backed up,
o how to access the files,
o recommended file formats to be used, and
o policies for migrating data to assure data are not lost due to media degradation or changing formats or programs
• Review your backup plan regularly
• Update as needed
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Metadata
• When you provide data to someone else, what types of information would you want to include with the data?
• When you receive a dataset from an external source, what types of details do you want to know about the data?
• Why were the data created?
• What limitations, if any, do the data have?
• What do the data mean?
• How should the data be cited if they are reused in a new study?
• What are the data gaps?
• What processes were used for creating the data?
• Are there any fees associated with the data?
• In what scale were the data created?
• What do the values in the tables mean?
• What software do I need in order to read the data?
• What projection are the data in?
• Can I give these data to someone else?
Metadata is: Data ‘reporting’
• WHO created the data?
• WHAT is the content of the data?
• WHEN were the data created?
• WHERE is it geographically?
• HOW were the data developed?
• WHY were the data developed?
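The six questions above can double as a completeness check for a metadata record. The sketch below is a simplified illustration, not a formal metadata standard. The field names, the helper `missing_fields`, and every value in the example record are placeholders invented for the example.

```python
import json

REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")

def missing_fields(record):
    """Return the reporting questions a metadata record leaves unanswered."""
    return [
        field for field in REQUIRED_FIELDS
        if not str(record.get(field, "")).strip()
    ]

# Placeholder record; every value here is made up for illustration.
record = {
    "who": "Example Researcher, Example University",
    "what": "Soil temperature at three sensor depths",
    "when": "2010-02-01 to 2010-02-02",
    "where": "Plots A, B, and C at an example field site",
    "how": "Buried temperature sensors read once per day",
    "why": "",
}

unanswered = missing_fields(record)
as_json = json.dumps(record, indent=2)  # plain text that can travel with the data
```

In practice the same questions are answered through a discipline-specific metadata standard rather than an ad hoc dictionary.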
Photo by Michelle Chang. All Rights Reserved.
Author(s) Boullosa, Carmen.
Title(s) They're cows, we're pigs / by Carmen Boullosa
Place New York : Grove Press, 1997.
Physical Descr viii, 180 p ; 22 cm.
Subject(s) Pirates Caribbean Area Fiction.
Format Fiction
CC image by USDAgov on Flickr
CC image by Mskadu on Flickr
[Figure: loss of data details over time. Vertical axis: data details; horizontal axis: time. From Michener et al. 1997.]
• Time of data development
• Specific details about problems with individual items or specific dates are lost relatively rapidly
• General details about datasets are lost through time
• Accident or technology change may make data unusable
• Retirement or career change makes access to “mental storage” difficult or unlikely
• Loss of data developer leads to loss of remaining information
• A Standard provides a structure to describe data with:
◦ Common terms to allow consistency between records
◦ Common definitions for easier interpretation
◦ Common language for ease of communication
◦ Common structure to quickly locate information
• In search and retrieval, standards provide:
◦ Documentation structure in a reliable and predictable format for computer interpretation
◦ A uniform summary description of the dataset
CC image by ccarlstead on Flickr
CC image by Ilike on Flickr
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Metadata
• Data Citation
• Similar to citing a published article or book
o Provide information necessary to identify and locate the work cited
• No standards yet
• Use format recommended by journal, repository, or professional organization
CC image by Paxsimius on Flickr
Breitburg DL, Hondorp D, Audemard C, Carnegie RB, Burrell RB, Trice M, Clark V (2015) Data from: Landscape-level variation in disease susceptibility related to shallow-water hypoxia. PLOS ONE. http://dx.doi.org/10.5061/dryad.9k231
A persistent identifier should be included in the citation:
• DOI (Digital Object Identifier)
o Globally unique, alphanumeric string assigned by a registration agency to identify content and provide a persistent link to its location
o May be assigned to any item of intellectual property that is defined by structured metadata
o Examples: 10.1234/NP5678; 10.5678/ISBN-0-7645-4889-4; 10.2224/2004-10-ISO-DOI
• The UT Libraries can assign a DOI to your data set.
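A DOI becomes a persistent link when it is appended to the doi.org resolver address. The sketch below normalizes the forms of DOI that appear in the citations above into one resolver URL. The function name `doi_to_url` is hypothetical, and its validity check is deliberately loose (it only looks for the leading "10." and a slash).

```python
def doi_to_url(identifier):
    """Turn a DOI written as 'doi:10.x/y', a bare '10.x/y', or an older
    dx.doi.org link into a https://doi.org/ resolver URL."""
    doi = identifier.strip()
    prefixes = (
        "https://doi.org/", "http://doi.org/",
        "https://dx.doi.org/", "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Loose sanity check only; it does not confirm the DOI is registered.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {identifier!r}")
    return "https://doi.org/" + doi

links = [
    doi_to_url("doi:10.5061/dryad.20"),
    doi_to_url("http://dx.doi.org/10.5061/dryad.9k231"),
]
```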
Chris Eaker
Data Curation Librarian
Hodges Library, Room 236
865-974-4404
Available to help with…
• Data management plan support
• Data repositories
• Metadata
• Data management consulting