Data Management Best Practices

Transcript
Page 1: Data Management Best Practices

Introduction to Data Management

CC image by University of Maryland Press Releases on Flickr

Adapted from curriculum developed by

Page 2: Data Management Best Practices

Introduction to Data Management

• Why Manage Data?

• Data Sharing

• Data Entry & Manipulation

• Quality Control & Assurance

• Backup

• Metadata

• Data Citation

Page 3: Data Management Best Practices

Introduction to Data Management

• Organization

• Reproducibility

• Version control

• Quality control

• Valuable asset

• Accuracy

• Integrity

• Data sharing

• Sustainability & accessibility

Page 4: Data Management Best Practices

Introduction to Data Management

• If data are:

o Well-organized

o Documented

o Preserved

o Accessible

o Verified for accuracy and validity

• Result is:

o High quality data

o Easy to share and re-use in science

o Citation and credibility to the researcher

o Cost-savings to science

Page 5: Data Management Best Practices

Introduction to Data Management

• Why Manage Data?

• Data Sharing

Page 6: Data Management Best Practices

Introduction to Data Management

Data sharing requires effort, resources, and faith in others. Why do it?

For the benefit of:

o the public

o the research sponsor

o the research community

o the researcher

CC image by Jessica Lucia on Flickr

Page 7: Data Management Best Practices

Introduction to Data Management

A better informed public yields better decision making with regard to:

o Environmental and economic planning

o Federal, state, and local policies

o social choices such as use of tax dollars and education options

o personal lifestyle and health such as nutrition and recreation

CC image by falonyates on Flickr

Page 8: Data Management Best Practices

Introduction to Data Management

• Organizations that sponsor research must maximize the value of research dollars

• Data sharing enhances the value of research investments by enabling:

o verification of performance metrics and outcomes

o new research and increased return on investment

o advancement of the science

o reduced data duplication expenditures

Page 9: Data Management Best Practices

Introduction to Data Management

Access to related research enables community members to:

o build upon the work of others

o perform meta analyses

o share resources and perspectives

CC image by Lawrence Berkeley National Laboratory on Flickr

Page 10: Data Management Best Practices

Introduction to Data Management

Access to related research enables community members to (cont’d):

o increase transparency, reproducibility and comparability of results

o expand methodology assessment, recommendations and improvement

o educate new researchers as to the most current and significant findings

Page 11: Data Management Best Practices

Introduction to Data Management

Scientists who share data gain the benefit of:

o recognition

o improved data quality

o greater opportunity for data exchange

o improved connections

CC image by SLU Madrid Campus on Flickr

Page 12: Data Management Best Practices

Introduction to Data Management

Step One:

Create robust metadata that is discoverable

o Geographic and temporal coverage

o Discipline specific metadata schema

o Discipline specific vocabulary

o Describe attributes

Page 13: Data Management Best Practices

Introduction to Data Management

Step Two:

Include archival and reference information

o Include a data citation

o Include Persistent Identifier (e.g. DOI)

Data Citation Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20

Page 14: Data Management Best Practices

Introduction to Data Management

Step Three:

Have data contributors review your metadata to ensure validity and organizational ‘correctness’

o are the processes described accurately?

o are all contributions adequately identified?

o has management reviewed the product and documentation?

o is the funding organization properly recognized?

Page 15: Data Management Best Practices

Introduction to Data Management

Step Four:

Publish your data and metadata via:

Data Repositories/Clearinghouses

• Discipline-specific
  ◦ Sciences
    • Knowledge Network for Biodiversity (KNB) Data Portal
    • Long Term Ecological Research (LTER) Network Data Portal
  ◦ Social Sciences
    • ICPSR

• Institutional
  ◦ Trace

Page 16: Data Management Best Practices

Introduction to Data Management

• Why Manage Data?

• Data Sharing

• Data Entry & Manipulation

Page 17: Data Management Best Practices

Introduction to Data Management

• Create data sets that are:

o Valid

o Organized to support ease of use

CC image by Travis S on Flickr

Page 18: Data Management Best Practices

Introduction to Data Management

• Inconsistency between data collection events

– Location of Date information

– Inconsistent Date format

– Column names

– Order of columns

Page 19: Data Management Best Practices

Introduction to Data Management

• Inconsistency between data collection events

– Different site spellings, capitalization, spaces in site names—hard to filter

– Codes used for site names for some data, but spelled out for others

– Mean1 value is in Weight column

– Text and numbers in same column – what is the mean of 12, “escaped < 15”, and 91?
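The inconsistencies listed above can often be reduced before analysis by normalizing site names and dates. The following is a minimal sketch, not a definitive cleaning routine. The site-code lookup is hypothetical, and the list of accepted date formats is an assumption (in particular, slash-separated dates are assumed to be month/day/year).

```python
from datetime import datetime

# Hypothetical lookup from site codes to spelled-out names (illustrative only).
SITE_CODES = {"dc": "deep creek", "bl": "bear lake"}

def normalize_site(raw: str) -> str:
    """Collapse repeated whitespace, lowercase, and expand known site codes."""
    key = " ".join(raw.split()).lower()
    return SITE_CODES.get(key, key)

def normalize_date(raw: str) -> str:
    """Return an ISO 8601 date (YYYY-MM-DD) from a few assumed input formats."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next assumed format
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Values that do not match any assumed format raise an error rather than being guessed, so they can be reviewed by a person.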

Page 20: Data Management Best Practices

Introduction to Data Management

• Columns of data are consistent: only numbers, dates, or text

• Consistent Names, Codes, Formats (date) used in each column

• Data are all in one table, which is much easier for a statistical program to work with than multiple small tables which each require human intervention
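One way to check that a column holds only numbers is to list every entry that cannot be read as a number. This is an illustrative sketch; the function name and the sample column are assumptions that reuse the mixed values mentioned earlier in the slides.

```python
def non_numeric_entries(values):
    """Return (position, value) pairs for entries that are not valid numbers."""
    problems = []
    for position, value in enumerate(values):
        try:
            float(value)
        except ValueError:
            problems.append((position, value))
    return problems

# A weight column that mixes numbers and text.
weights = ["12", "escaped < 15", "91"]
problems = non_numeric_entries(weights)
```

Here `problems` lists the text entry so it can be moved to a comments or flag column instead of the numeric column.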

Page 21: Data Management Best Practices

Introduction to Data Management

• Descriptive column names

◦ Soil T30 → Soil_Temp_30cm

◦ Species-Code → Species_Code (avoid using -, +, *, ^ in column names; some software may interpret these symbols as operators)

• Descriptive file names

◦ Mammal data-.csv → FieldVisit1_SmallMammalData_2010-04-11.csv
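The naming advice above can be applied programmatically. The sketch below is one possible approach, with hypothetical helper names: it replaces operator-like symbols and spaces in column names with underscores, and builds a file name from a collection event, a content description, and an ISO date.

```python
import re
from datetime import date

def safe_column_name(name: str) -> str:
    """Replace runs of characters other than letters, digits, and _ with _."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name).strip("_")

def data_file_name(event: str, content: str, collected: date, ext: str = "csv") -> str:
    """Build Event_Content_YYYY-MM-DD.ext with spaces and symbols removed."""
    parts = [re.sub(r"[^A-Za-z0-9]+", "", part) for part in (event, content)]
    return f"{parts[0]}_{parts[1]}_{collected.isoformat()}.{ext}"

column = safe_column_name("Species-Code")
file_name = data_file_name("Field Visit 1", "Small Mammal Data", date(2010, 4, 11))
```

Putting the date in year-month-day order means that files sort chronologically when listed by name.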

Page 22: Data Management Best Practices

Introduction to Data Management

• Enter complete lines of data

Sorting an Excel file with empty cells is not a good idea!

Page 23: Data Management Best Practices

Introduction to Data Management

• Missing data

o Preferably leave field empty (NULL = no value)

o In numeric fields, use a distinct value such as 9999 to indicate a missing value

o In text fields, use NA (“Not Applicable” or “Not Available”)

o Use Data flags in a separate column to qualify missing value

Date      Time  NO3_N_Conc  NO3_N_Conc_Flag
20081011  1300  0.013
20081011  1330  0.016
20081011  1400              M1
20081011  1430  0.018
20081011  1500  0.001       E1

M1 = missing; no sample collected
E1 = estimated from grab sample
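A separate flag column keeps the measurement column purely numeric, so missing values can be skipped without special cases. The sketch below assumes the table above has been saved as comma-separated text with an empty field for the missing value; that file layout is an assumption, not something shown on the slide.

```python
import csv
import io
from statistics import mean

# The example table, written as CSV (assumed layout).
RAW = """Date,Time,NO3_N_Conc,NO3_N_Conc_Flag
20081011,1300,0.013,
20081011,1330,0.016,
20081011,1400,,M1
20081011,1430,0.018,
20081011,1500,0.001,E1
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Only non-empty concentration fields are converted to numbers.
measured = [float(r["NO3_N_Conc"]) for r in rows if r["NO3_N_Conc"] != ""]

# Flags are kept alongside the time they apply to.
flags = {r["Time"]: r["NO3_N_Conc_Flag"] for r in rows if r["NO3_N_Conc_Flag"]}

mean_conc = mean(measured)
```

The estimated value (flag E1) is still included in the mean here; whether to include flagged values is an analysis decision that the flag column makes explicit.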

Page 24: Data Management Best Practices

Introduction to Data Management

Page 25: Data Management Best Practices

Introduction to Data Management


Page 26: Data Management Best Practices

Introduction to Data Management

Spreadsheets

• Great for charts, graphs, calculations

• Flexible about cell content type: cells in the same column can contain numbers or text

• Lack record integrity (a column can be sorted independently of all others)

• Easy to use, but harder to maintain as complexity and size of data grow

Databases

• Easy to query to select portions of data

• Data fields are typed; for example, only integers are allowed in integer fields

• Columns cannot be sorted independently of each other

• Steeper learning curve than a spreadsheet

Page 27: Data Management Best Practices

Introduction to Data Management

• A set of tables

• Relationships

• A command language

Sample sites: *siteID, site_name, latitude, longitude, description

Species: *speciesID, species_name, common_name, family, order

Samples: *sampleID, siteID, sample_date, speciesID, height, flowering, flag, comments

Page 28: Data Management Best Practices

Introduction to Data Management

Date          Site         Height               Flowering
<dates only>  <text only>  <real numbers only>  <‘y’ and ‘n’ only>

Advantages

• quality control

• performance
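Typed and constrained fields let the database reject bad values at entry time. The following sketch uses Python's built-in sqlite3 module purely as an illustration; the table and column names are assumptions. Note that SQLite is more permissive about column types than most database systems, so this example relies on an explicit CHECK constraint for the Flowering column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE samples (
        sample_date TEXT NOT NULL,
        site        TEXT NOT NULL,
        height      REAL,
        flowering   TEXT CHECK (flowering IN ('y', 'n'))
    )
""")

# A valid row is accepted.
conn.execute("INSERT INTO samples VALUES ('2010-02-13', 'A', 12.5, 'y')")

# A row with an invalid flowering code is rejected by the CHECK constraint.
try:
    conn.execute("INSERT INTO samples VALUES ('2010-02-13', 'B', 9.0, 'maybe')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
```

Only the valid row is stored, which is the quality-control advantage named above.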

Page 29: Data Management Best Practices

Introduction to Data Management

Date       Site  Species  Flowering?
2/13/2010  A     BOGR2    y
2/13/2010  B     HODR     y
4/15/2010  B     BOER4    y
4/15/2010  C     PLJA     n

Site  Latitude  Longitude
A     34.1      -109.3
B     35.2      -108.6
C     32.6      -107.5

Date       Site  Species  Flowering?  Latitude  Longitude
2/13/2010  A     BOGR2    y           34.1      -109.3
2/13/2010  B     HODR     y           35.2      -108.6
4/15/2010  B     BOER4    y           35.2      -108.6
4/15/2010  C     PLJA     n           32.6      -107.5

Mix and match data on the fly
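Combining the observation table with the site table is a join on the shared Site column. Here is a minimal sketch using Python's built-in sqlite3 module; the table names `observations` and `sites` are assumptions, and the dates are written in ISO format so that they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites (site TEXT PRIMARY KEY, latitude REAL, longitude REAL);
    CREATE TABLE observations (date TEXT, site TEXT REFERENCES sites(site),
                               species TEXT, flowering TEXT);
    INSERT INTO sites VALUES
        ('A', 34.1, -109.3), ('B', 35.2, -108.6), ('C', 32.6, -107.5);
    INSERT INTO observations VALUES
        ('2010-02-13', 'A', 'BOGR2', 'y'),
        ('2010-02-13', 'B', 'HODR',  'y'),
        ('2010-04-15', 'B', 'BOER4', 'y'),
        ('2010-04-15', 'C', 'PLJA',  'n');
""")

# Each observation picks up the coordinates of its site.
joined = conn.execute("""
    SELECT o.date, o.site, o.species, o.flowering, s.latitude, s.longitude
    FROM observations AS o
    JOIN sites AS s ON s.site = o.site
    ORDER BY o.date, o.site
""").fetchall()
```

Because the coordinates are stored once per site, correcting a latitude means changing one row rather than every observation at that site.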

Page 30: Data Management Best Practices

Introduction to Data Management

This table is called SoilTemp

Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-01  C     R          30           12.8
2010-02-01  B     C          10           13.2
2010-02-02  C     R          0            6.3
2010-02-02  A     N          0            15.1

SQL examples:

SELECT Date, Plot, Treatment, SensorDepth, Soil_Temperature FROM SoilTemp WHERE Date = '2010-02-01'

Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-01  C     R          30           12.8
2010-02-01  B     C          10           13.2

SELECT * FROM SoilTemp WHERE Treatment = 'N' AND SensorDepth = 0

Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-02  A     N          0            15.1
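The two queries can be tried out with Python's built-in sqlite3 module. This is a sketch under the assumption that SensorDepth is stored as an integer and Soil_Temperature as a real number; the slide does not state the column types. Query values are passed as parameters rather than typed into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SoilTemp (
        Date TEXT, Plot TEXT, Treatment TEXT,
        SensorDepth INTEGER, Soil_Temperature REAL
    )
""")
conn.executemany(
    "INSERT INTO SoilTemp VALUES (?, ?, ?, ?, ?)",
    [
        ("2010-02-01", "C", "R", 30, 12.8),
        ("2010-02-01", "B", "C", 10, 13.2),
        ("2010-02-02", "C", "R", 0, 6.3),
        ("2010-02-02", "A", "N", 0, 15.1),
    ],
)

# First example: all readings from one date.
by_date = conn.execute(
    "SELECT Date, Plot, Treatment, SensorDepth, Soil_Temperature "
    "FROM SoilTemp WHERE Date = ? ORDER BY rowid",
    ("2010-02-01",),
).fetchall()

# Second example: treatment N at the surface (depth 0).
by_treatment = conn.execute(
    "SELECT * FROM SoilTemp WHERE Treatment = ? AND SensorDepth = ?",
    ("N", 0),
).fetchall()
```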

Page 31: Data Management Best Practices

Introduction to Data Management

Page 32: Data Management Best Practices

Introduction to Data Management

• Be aware of Best Practices when designing data file structures

• Choose a data entry method that allows some validation of data as it is entered

• Consider investing time in learning how to use a database if datasets are large or complex

CC image by fo.ol on Flickr

Page 33: Data Management Best Practices

Introduction to Data Management

• Why Manage Data?

• Data Sharing

• Data Entry & Manipulation

• Quality Control & Assurance

Page 34: Data Management Best Practices

Introduction to Data Management

• Errors of Commission

o Incorrect or inaccurate data entered

o Examples: malfunctioning instrument, mistyped data

• Errors of Omission

o Data or metadata not recorded

o Examples: inadequate documentation, human error, anomalies in the field

CC image by Nick J Webb on Flickr

Page 35: Data Management Best Practices

Introduction to Data Management

• Define & enforce standards

◦ Formats

◦ Codes

◦ Measurement units

◦ Metadata

• Assign responsibility for data quality

◦ Be sure assigned person is educated in QA/QC

Page 36: Data Management Best Practices

Introduction to Data Management

• Double entry

◦ Data keyed in by two independent people

◦ Check for agreement with computer verification

• Record a reading of the data and transcribe from the recording

• Use text-to-speech program to read data back
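The computer verification step of double entry amounts to comparing the two keyed versions field by field. The sketch below is one simple way to do that; the function name and the sample rows are hypothetical.

```python
def entry_mismatches(first, second):
    """Return (row, column, first_value, second_value) for every disagreement."""
    if len(first) != len(second):
        raise ValueError("the two entries have different numbers of rows")
    mismatches = []
    for row_no, (row_a, row_b) in enumerate(zip(first, second), start=1):
        if len(row_a) != len(row_b):
            raise ValueError(f"row {row_no} has different numbers of fields")
        for col_no, (a, b) in enumerate(zip(row_a, row_b), start=1):
            # Leading and trailing spaces are ignored when comparing.
            if a.strip() != b.strip():
                mismatches.append((row_no, col_no, a, b))
    return mismatches

# Hypothetical data keyed in by two people from the same field sheet.
keyed_by_person_1 = [["2010-02-01", "C", "12.8"], ["2010-02-01", "B", "13.2"]]
keyed_by_person_2 = [["2010-02-01", "C", "12.8"], ["2010-02-01", "B", "18.2"]]

differences = entry_mismatches(keyed_by_person_1, keyed_by_person_2)
```

Each reported difference is then resolved against the original field sheet rather than by picking one version.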

CC image by weskriesel on Flickr

Page 37: Data Management Best Practices

Introduction to Data Management

• Design data storage well

◦ Minimize the number of items that must be entered repeatedly

◦ Use consistent terminology

◦ Atomize data: one cell per piece of information

• Document changes to data

◦ Avoids duplicate error checking

◦ Allows undo if necessary

Page 38: Data Management Best Practices

Introduction to Data Management

• Make sure data line up in proper columns

• No missing, impossible, or anomalous values

• Perform statistical summaries
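A short statistical summary makes missing and impossible values easy to spot. The sketch below assumes missing values are represented as `None` and that the plausible range is supplied by someone who knows the measurement; the sample values and range are illustrative only.

```python
from statistics import mean, median

def summarize(values, lower, upper):
    """Summarize a numeric column and list values outside [lower, upper]."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        "out_of_range": [v for v in present if not lower <= v <= upper],
    }

# Illustrative soil temperatures with one gap and one suspicious entry.
temps = [12.8, 13.2, 6.3, None, 15.1, 151.0]
report = summarize(temps, lower=-40.0, upper=60.0)
```

A large gap between the mean and the median, as in this example, is often a hint that an extreme value deserves a second look.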

CC image by chesapeakeclimate on Flickr

Page 39: Data Management Best Practices

Introduction to Data Management

• Look for outliers

◦ Outliers are extreme values for a variable given the statistical model being used

◦ The goal is not to eliminate outliers but to identify potential data contamination


Page 40: Data Management Best Practices

Introduction to Data Management

• Methods to look for outliers

◦ Graphical

• Normal probability plots

• Regression

• Scatter plots

◦ Plotting on maps

◦ Deviation
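As one example of a deviation-based check, the sketch below flags values that lie far from the median, measured in units of the median absolute deviation (MAD). This particular method is a choice made for illustration and is not named in the slides. The constant 0.6745 and the cutoff 3.5 follow a commonly cited rule of thumb for the modified z-score and should be treated as adjustable assumptions.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Illustrative measurements with one extreme value.
readings = [10, 12, 11, 13, 12, 11, 50]
flagged = mad_outliers(readings)
```

The function only reports candidates. In keeping with the goal stated earlier, flagged values should be checked for contamination, not deleted automatically.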

Page 41: Data Management Best Practices

Introduction to Data Management

• Beware of errors of commission and omission

• Execute quality assurance and quality control strategies

◦ Data entry

◦ Data visualization

Page 42: Data Management Best Practices

Introduction to Data Management

• Why Manage Data?

• Data Sharing

• Data Entry & Manipulation

• Quality Control & Assurance

• Backup

Page 43: Data Management Best Practices

Introduction to Data Management

• Backups vs. Archives

◦ Backups: a copy (or copies) of the original file is made before the original is overwritten

◦ Archives: preservation of the file

• Data Preservation

◦ Includes archiving in addition to processes such as data rescue, data reformatting, data conversion, metadata

Page 44: Data Management Best Practices

Introduction to Data Management

• Backups

o periodic snapshots

o usually copies of files

o performed on regular schedule

• Archiving

o preserve data

o usually the final version

o performed at the end of a project or milestones

It is a good idea to have multiple copies of your backups and archives, in case one copy fails.
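A backup is only useful if the copy matches the original. The following sketch copies a file into a backup directory and verifies the copy with a SHA-256 checksum. The function names and directory layout are assumptions; a real backup plan would also cover scheduling and off-site copies.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and confirm the copy is identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} failed verification")
    return target
```

Calling `backup_file` once per destination (for example, an external drive and a network share) gives the multiple copies recommended above.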

Page 45: Data Management Best Practices

Introduction to Data Management

• Limit or negate loss of data

• Save time, money, productivity

• Help prepare for disasters

o Accidental deletions

o Fires, natural disasters

o Software bugs, hardware failures

• Reproduce results

• Respond to data requests

• Limit liability

CC Image courtesy of Brian J Matis on Flickr

Page 46: Data Management Best Practices

Introduction to Data Management

• Includes backups and archiving

• Also includes

◦ data conversion

◦ data reformatting

◦ data rescue

Page 47: Data Management Best Practices

Introduction to Data Management

• Data Conversions and Formats

o Use non-proprietary, standard formats

o Textual documents .txt

o Spreadsheets .csv

o Digital images .tiff

• Versioning

• File Naming

Page 48: Data Management Best Practices

Introduction to Data Management

• Create a backup plan that clearly identifies:

o roles,

o responsibilities,

o where the data is backed up,

o how often the files are backed up,

o how to access the files,

o recommended file formats to be used, and

o policies for migrating data to ensure data are not lost due to media degradation or changing formats or programs

• Review your backup plan regularly

• Update as needed

Page 49: Data Management Best Practices

Introduction to Data Management

• Why Manage Data?

• Data Sharing

• Data Entry & Manipulation

• Quality Control & Assurance

• Backup

• Metadata

Page 50: Data Management Best Practices

Introduction to Data Management

• When you provide data to someone else, what types of information would you want to include with the data?

• When you receive a dataset from an external source, what types of details do you want to know about the data?

Page 51: Data Management Best Practices

Introduction to Data Management

• Why were the data created?

• What limitations, if any, do the data have?

• What do the data mean?

• How should the data be cited if they are re-used in a new study?

Page 52: Data Management Best Practices

Introduction to Data Management

• What are the data gaps?

• What processes were used for creating the data?

• Are there any fees associated with the data?

• In what scale were the data created?

• What do the values in the tables mean?

• What software do I need in order to read the data?

• What projection are the data in?

• Can I give these data to someone else?

Page 53: Data Management Best Practices

Introduction to Data Management

Metadata is: Data ‘reporting’

• WHO created the data?

• WHAT is the content of the data?

• WHEN were the data created?

• WHERE is it geographically?

• HOW were the data developed?

• WHY were the data developed?
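The six questions can be treated as a checklist for a metadata record. The sketch below stores the answers in a plain dictionary and reports which questions are still unanswered. The field names and the example record are hypothetical and do not follow any particular metadata standard.

```python
REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")

def missing_metadata(record: dict) -> list:
    """Return the required fields that are absent or blank in the record."""
    return [
        field for field in REQUIRED_FIELDS
        if not str(record.get(field, "")).strip()
    ]

# Hypothetical record for a soil temperature data set.
record = {
    "who": "A. Researcher (hypothetical)",
    "what": "Soil temperature by plot, treatment, and sensor depth",
    "when": "2010-02-01 to 2010-02-02",
    "where": "",  # location not yet filled in
    "how": "Buried temperature sensors read daily",
}

unanswered = missing_metadata(record)
```

Running the check before sharing a data set shows which questions still need answers; here the location and the purpose are missing.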

Photo by Michelle Chang. All Rights Reserved

Page 54: Data Management Best Practices

Introduction to Data Management

Author(s)       Boullosa, Carmen.
Title(s)        They're cows, we're pigs / by Carmen Boullosa
Place           New York : Grove Press, 1997.
Physical Descr  viii, 180 p ; 22 cm.
Subject(s)      Pirates Caribbean Area Fiction.
Format          Fiction

CC image by USDAgov on Flickr

CC image by Mskadu on Flickr

Page 55: Data Management Best Practices

Introduction to Data Management

Figure (from Michener et al 1997): data details (vertical axis) plotted against time (horizontal axis), with these labels:

• Time of data development

• Specific details about problems with individual items or specific dates are lost relatively rapidly

• General details about datasets are lost through time

• Accident or technology change may make data unusable

• Retirement or career change makes access to “mental storage” difficult or unlikely

• Loss of data developer leads to loss of remaining information

Page 56: Data Management Best Practices

Introduction to Data Management

• A Standard provides a structure to describe data with:

◦ Common terms to allow consistency between records

◦ Common definitions for easier interpretation

◦ Common language for ease of communication

◦ Common structure to quickly locate information

• In search and retrieval, standards provide:

◦ Documentation structure in a reliable and predictable format for computer interpretation

◦ A uniform summary description of the dataset

CC image by ccarlstead on Flickr

Page 57: Data Management Best Practices

Introduction to Data Management

CC image by Ilike on Flickr

Page 58: Data Management Best Practices

Introduction to Data Management

• Why Manage Data?

• Data Sharing

• Data Entry & Manipulation

• Quality Control & Assurance

• Backup

• Metadata

• Data Citation

Page 59: Data Management Best Practices

Introduction to Data Management

• Similar to citing a published article or book

o Provide information necessary to identify and locate the work cited

• No standards yet

• Use format recommended by journal, repository, or professional organization

CC image by Paxsimius on Flickr

Page 60: Data Management Best Practices

Introduction to Data Management

Breitburg DL, Hondorp D, Audemard C, Carnegie RB, Burrell RB, Trice M, Clark V (2015) Data from: Landscape-level variation in disease susceptibility related to shallow-water hypoxia. PLOS ONE. http://dx.doi.org/10.5061/dryad.9k231


Page 66: Data Management Best Practices

Introduction to Data Management

A persistent identifier should be included in the citation:

• DOI (Digital Object Identifier)

o Globally unique, alphanumeric string assigned by a registration agency to identify content and provide a persistent link to its location.

o May be assigned to any item of intellectual property that is defined by structured metadata

o Examples: 10.1234/NP5678; 10.5678/ISBN-0-7645-4889-4; 10.2224/2004-10-ISO-DOI

• The UT Libraries can assign a DOI to your data set.
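A DOI becomes a clickable persistent link when it is appended to the doi.org resolver address. The sketch below shows that conversion with a loose format check. The pattern used (prefix "10.", four to nine digits, a slash, then a suffix) is an assumption that covers common DOIs, not a full validator.

```python
import re

# Loose pattern for typical DOIs (an assumption, not a complete specification).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def doi_to_url(doi: str) -> str:
    """Turn a DOI such as 'doi:10.5061/dryad.20' into a resolver link."""
    cleaned = doi.strip()
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a recognizable DOI: {doi!r}")
    return "https://doi.org/" + cleaned

link = doi_to_url("doi:10.5061/dryad.20")
```

Including the link form in a data citation lets readers reach the data set even if the hosting repository reorganizes its pages.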

Page 67: Data Management Best Practices

Introduction to Data Management

Chris Eaker

Data Curation Librarian

Hodges Library, Room 236

865-974-4404

[email protected]

Available to help with…

• Data management plan support

• Data repositories

• Metadata

• Data management consulting