Data Stewardship for Scientists, for CLIR Postdoc Workshop

Data Stewardship for Researchers: Tips, Tools, & Guidance. Carly Strasser, PhD, California Digital Library. @carlystrasser, [email protected]. 31 July 2013, CLIR Symposium. Images from Calisphere, courtesy of UC Riverside, California Museum of Photography, and from Calisphere, courtesy of Thousand Oaks Library.

description

Presentation for CLIR/DLF Postdoctoral Fellows on data management for scientists; Bryn Mawr College 31 July 2013.

Transcript of Data Stewardship for Scientists, for CLIR Postdoc Workshop

Page 1: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Data  Stewardship  for  Researchers  

Carly  Strasser,  PhD  California  Digital  Library  

@carlystrasser  [email protected]  

31  July  2013  CLIR  Symposium  

 

From Calisphere, Courtesy of UC Riverside, California Museum of Photography

Tips,  Tools,  &  Guidance    

From Calisphere, Courtesy of Thousand Oaks Library

Page 2: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Roadmap

1. Background
2. Why you should care
3. Best practices
4. Toolbox

Page 3: Data Stewardship for Scientists, for CLIR Postdoc Workshop

NSF  funded  DataNet  Project  Office  of  Cyberinfrastructure  

Two main goals:
1. Build a network for data repositories
2. Build community around data

Focus on Earth | environmental | ecological | oceanographic data

Page 4: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Why  don’t  people  share  data?  

Is data management being taught?

Do attitudes about sharing differ among disciplines?

How  can  we  promote  storing  data  in  repositories?  

What  barriers  to  sharing  can  we  eliminate?  

What  role  can  libraries  play  in  data  education?  

Page 5: Data Stewardship for Scientists, for CLIR Postdoc Workshop
Page 6: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Why  is  data  management      a  hot  topic?  

From  Flickr  by  Velo  Steve  

Page 7: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Back in the day…

Da  Vinci  

Curie  

Newton  

classicalschool.blogspot.com  

Darwin  

Page 8: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Digital data

From Flickr by Flickmor
From Flickr by US Army Environmental Command
From Flickr by DW0825
C. Strasser
Courtesy of WHOI
From Flickr by deltaMike

Page 9: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Digital  data  +    

Complex  workflows  

Page 10: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From  Flickr  by  ~Minnea~  

Data  management  Documentation  Reproducibility  

Page 11: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From  Flickr  by  iowa_spirit_walker  

• Cost
• Confusion about standards
• Lack of training
• Fear of lost rights or benefits
• No incentives

Page 12: Data Stewardship for Scientists, for CLIR Postdoc Workshop

THE TRUTH

RESEARCHERS NEED TO KNOW ABOUT

Data management
Metadata
Data repositories
Data sharing

From sandierpastures.com

Page 13: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From  Flickr  by  johntrainor  

Who  cares?  

Page 14: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From Flickr by hyperion327

From Flickr by Redden-McAllister

Page 15: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Back in February:

… “Federal agencies investing in research and development (more than $100 million in annual expenditures) must have clear and coordinated policies for increasing public access to research products.”

Page 16: Data Stewardship for Scientists, for CLIR Postdoc Workshop

1. Maximize free public access

2. Ensure researchers create data management plans

3.  Allow  costs  for  data  preservation  and  access  in  proposal  budgets  

4.  Ensure  evaluation  of  data  management  plan  merits  

5.  Ensure  researchers  comply  with  their  data  management  plans  

6.  Promote  data  deposition  into  public  repositories  

7.  Develop  approaches  for  identification  and  attribution  of  datasets  

8.  Educate  folks  about  data  stewardship  

From  Flickr  by  Joe  Crimmings  Photography  

Page 17: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From Flickr by twm1340

Culture  Shift  Ahead  

Page 18: Data Stewardship for Scientists, for CLIR Postdoc Workshop

science  source  notebook  content  access  data  government  knowledge  

From Flickr by cdsessums

Page 19: Data Stewardship for Scientists, for CLIR Postdoc Workshop

flowingdata.com

Map  of  Scientific  Collaborations  

Page 20: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From Flickr by ~shorts and longs

Publications  &    Their  Citation     &  data  availability  

Page 21: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Data  are  being  recognized  as  first  class  products  of  research  

From  Flickr  by  Richard  Moross  

Page 22: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Data  management  plans  

Data  sharing  mandates  

Data  publications  

Data  citation  

From  Flickr  by  torkildr  

Page 23: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Data  publications  Data  citation  

Data  management  plans  Data  sharing  mandates  

Page 24: Data Stewardship for Scientists, for CLIR Postdoc Workshop

What  should  researchers  be  doing?  

From  Flickr  by  whatthefeed  

NOT V

Page 25: Data Stewardship for Scientists, for CLIR Postdoc Workshop

C:\Documents and Settings\hampton\My Documents\NCEAS Distributed Graduate Seminars\[Wash Cres Lake Dec 15 Dont_Use.xls]Sheet1

Stable Isotope Data Sheet

Sampling Site / Identifier: Wash Cresc Lake
Sample Type: Algal Washed Rocks
Date: Dec. 16
Tray ID and Sequence: Tray 004
Peter's lab
Don't use - old data

Reference statistics: SD for delta 13C = 0.07; SD for delta 15N = 0.15

Position  SampleID       Weight (mg)  %C     delta 13C  delta 13C_ca  %N    delta 15N  delta 15N_ca  Spec. No.  Extra cells
A1        ref            0.98         38.27  -25.05     -24.59        1.96  4.12       3.47          25354
A2        ref            0.98         39.78  -25.00     -24.54        2.03  4.01       3.36          25356
A3        ref            0.98         40.37  -24.99     -24.53        2.04  4.09       3.44          25358
A4        ref            1.01         42.23  -25.06     -24.60        2.17  4.20       3.55          25360      Shore Avg Con
A5        ALG01          3.05         1.88   -24.34     -23.88        0.17  -1.65      -2.30         25362      c -1.26 -27.22
A6        Lk Outlet Alg  3.06         31.55  -30.17     -29.71        0.92  0.87       0.22          25364      1.26 0.32
A7        ALG03          2.91         6.85   -21.11     -20.65        0.48  -0.97      -1.62         25366      c
A8        ALG05          2.91         35.56  -28.05     -27.59        2.30  0.59       -0.06         25368
A9        ALG07          3.04         33.49  -29.56     -29.10        1.68  0.79       0.14          25370
A10       ALG06          2.95         41.17  -27.32     -26.86        1.97  2.71       2.06          25372
B1        ALG04          3.01         43.74  -27.50     -27.04        1.36  0.99       0.34          25374      c
B2        ALG02          3            4.51   -22.68     -22.22        0.34  4.31       3.66          25376
B3        ALG01          2.99         1.59   -24.58     -24.12        0.15  -1.69      -2.34         25378      c
B4        ALG03          2.92         4.37   -21.06     -20.60        0.34  -1.52      -2.17         25380      c
B5        ALG07          2.9          33.58  -29.44     -28.98        1.74  0.62       -0.03         25382
B6        ref            1.01         44.94  -25.00     -24.54        2.59  3.96       3.31          25384
B7        ref            0.99         42.28  -24.87     -24.41        2.37  4.33       3.68          25386
B8        Lk Outlet Alg  3.04         31.43  -29.69     -29.23        1.07  0.95       0.30          25388
B9        ALG06          3.09         35.57  -27.26     -26.80        1.96  2.79       2.14          25390
B10       ALG02          3.05         5.52   -22.31     -21.85        0.45  4.72       4.07          25392
C1        ALG04          2.98         37.90  -27.42     -26.96        1.36  1.21       0.56          25394      c
C2        ALG05          3.04         31.74  -27.93     -27.47        2.40  0.73       0.08          25396
C3        ref            0.99         38.46  -25.09     -24.63        2.40  4.37       3.72          25398

Unlabeled cells below the table: 23.78 1.17

From  Stephanie  Hampton  (2010)      ESA  Workshop  on  Best  Practices  

2  tables   Random  notes  

From  Stephanie  Hampton  

Page 26: Data Stewardship for Scientists, for CLIR Postdoc Workshop

[Same spreadsheet as on Page 25]

From  Stephanie  Hampton  (2010)      ESA  Workshop  on  Best  Practices  

Wash  Cres  Lake  Dec  15  Dont_Use.xls  

From  Stephanie  Hampton  

Page 27: Data Stewardship for Scientists, for CLIR Postdoc Workshop

[Same spreadsheet as on Page 25, with regression output pasted beside the data:]

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.283158
R Square           0.080178
Adjusted R Square  -0.022024
Standard Error     1.906378
Observations       11

ANOVA
            df  SS        MS        F         Significance F
Regression  1   2.851116  2.851116  0.784507  0.398813
Residual    9   32.7085   3.634278
Total       10  35.55962

              Coefficients  Standard Error  t Stat     P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept     -4.297428     4.671099        -0.920003  0.381568  -14.8642   6.269341   -14.8642     6.269341
X Variable 1  -0.158022     0.17841         -0.885724  0.398813  -0.561612  0.245569   -0.561612    0.245569

Random  stats  output  

From  Stephanie  Hampton  

Page 28: Data Stewardship for Scientists, for CLIR Postdoc Workshop

[Same spreadsheet and regression output as on Page 27, with a transposed copy of part of the data and an embedded chart added:]

SampleID      ALG03   ALG05   ALG07   ALG06   ALG04   ALG02   ALG01   ALG03   ALG07
Weight (mg)   2.91    2.91    3.04    2.95    3.01    3       2.99    2.92    2.9
%C            6.85    35.56   33.49   41.17   43.74   4.51    1.59    4.37    33.58
delta 13C     -21.11  -28.05  -29.56  -27.32  -27.50  -22.68  -24.58  -21.06  -29.44
delta 13C_ca  -20.65  -27.59  -29.10  -26.86  -27.04  -22.22  -24.12  -20.60  -28.98
%N            0.48    2.30    1.68    1.97    1.36    0.34    0.15    0.34    1.74
delta 15N     -0.97   0.59    0.79    2.71    0.99    4.31    -1.69   -1.52   0.62
delta 15N_ca  -1.62   -0.06   0.14    2.06    0.34    3.66    -2.34   -2.17   -0.03

[Chart: x-axis from -35.00 to 0.00, y-axis from -3.00 to 4.00, one series labeled "Series1"]

From  Stephanie  Hampton  

Page 29: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From  Flickr  by  whatthefeed  

What  should  researchers  be  doing?  

Page 30: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Best Practices

data management

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse

From Flickr by Big Swede Guy

Page 31: Data Stewardship for Scientists, for CLIR Postdoc Workshop

2. Data collection & organization

Create unique identifiers
• Decide on naming scheme early
• Create a key
• Different for each sample

From Flickr by sjbresnahan
From Flickr by zebbie
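A minimal Python sketch of the identifier advice above. The ID pattern (site code, year, running number), the function names, and the sample descriptions are assumptions made for illustration; they do not come from the slides.

```python
def make_sample_ids(site_code, year, n_samples, start=1):
    """Return n_samples unique IDs such as 'NAN-2010-0001' (hypothetical scheme)."""
    return [f"{site_code}-{year}-{i:04d}" for i in range(start, start + n_samples)]


def build_key(sample_ids, descriptions):
    """Pair each ID with a human-readable description; refuse duplicate IDs."""
    if len(set(sample_ids)) != len(sample_ids):
        raise ValueError("sample IDs must be unique")
    if len(sample_ids) != len(descriptions):
        raise ValueError("need exactly one description per sample ID")
    return dict(zip(sample_ids, descriptions))


# Decide on the scheme once, then generate IDs and the key together.
ids = make_sample_ids("NAN", 2010, 3)
key = build_key(ids, ["algae, shore", "algae, lake outlet", "reference"])
```

Generating the IDs from one function, rather than typing them by hand, is one way to keep the scheme consistent across a project.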

Page 32: Data Stewardship for Scientists, for CLIR Postdoc Workshop

2. Data collection & organization

Standardize
• Consistent within columns: only numbers, dates, or text
• Consistent names, codes, formats

Modified from K. Vanderbilt
From Pink Floyd, The Wall (themurkyfringe.com)
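One way to check the "consistent within columns" rule is a short script that flags any column mixing numbers, dates, and text. This is a sketch only: the column names, the sample rows, and the choice of ISO dates (YYYY-MM-DD) are assumptions, not part of the slides.

```python
import csv
import io
from datetime import datetime


def classify(value):
    """Label a cell as 'number', 'date' (ISO YYYY-MM-DD), or 'text'."""
    try:
        float(value)
        return "number"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"


def mixed_columns(csv_text):
    """Return {column: sorted kinds} for columns that hold more than one kind."""
    kinds = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if value:  # skip blank cells; missing values are a separate check
                kinds.setdefault(column, set()).add(classify(value))
    return {c: sorted(k) for c, k in kinds.items() if len(k) > 1}


# Hypothetical rows: 'heavy' breaks the numeric weight column.
sample = (
    "sample_id,weight_mg,collected\n"
    "ALG01,3.05,2010-12-15\n"
    "ALG02,heavy,2010-12-16\n"
)
problems = mixed_columns(sample)
```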

Page 33: Data Stewardship for Scientists, for CLIR Postdoc Workshop

2. Data collection & organization

Standardize
• Reduce possibility of manual error by constraining entry choices

Google Docs Forms
Excel lists
Data validation

Modified from K. Vanderbilt
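The slide shows Google Docs Forms and Excel for constraining entries. The same idea can be sketched in a script: accept a value only if it belongs to a fixed list. The allowed sample types below are hypothetical.

```python
# Hypothetical controlled vocabulary; a real project would define its own.
ALLOWED_SAMPLE_TYPES = frozenset({"algae", "reference", "sediment"})


def validate_entry(entry, allowed=ALLOWED_SAMPLE_TYPES):
    """Normalize case and whitespace, then reject anything not in the list."""
    cleaned = entry.strip().lower()
    if cleaned not in allowed:
        raise ValueError(f"{entry!r} is not one of {sorted(allowed)}")
    return cleaned


accepted = validate_entry("  Algae ")
```

A misspelling such as "alage" raises an error at entry time instead of becoming a silent extra category in the data.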

Page 34: Data Stewardship for Scientists, for CLIR Postdoc Workshop

2.  Data  collection  &  organization  

   

Create parameter table

Create a site table

From  doi:10.3334/ORNLDAAC/777  

From  doi:10.3334/ORNLDAAC/777  

From  R  Cook,  ESA  Best  Practices  Workshop  2010  

Page 35: Data Stewardship for Scientists, for CLIR Postdoc Workshop

2. Data collection & organization

Use descriptive file names
• Unique
• Reflect contents

Bad: Mydata.xls, 2001_data.csv, best version.txt

Better: Eaffinis_nanaimo_2010_counts.xls
(Eaffinis = study organism; nanaimo = site name; 2010 = year; counts = what was measured)

From R Cook, ESA Best Practices Workshop 2010

*Not  for  everyone  

*  
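A small sketch of how the "Better" file name could be assembled from its four components. The function name and the rule of stripping non-alphanumeric characters are assumptions added for illustration.

```python
import re


def descriptive_filename(organism, site, year, measured, extension="csv"):
    """Join organism, site, year, and measurement into one underscore-separated name."""
    parts = [organism, site, str(year), measured]
    # Drop spaces and punctuation so the name stays portable across systems.
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", part) for part in parts]
    if not all(cleaned):
        raise ValueError("every file name component must be non-empty")
    return "_".join(cleaned) + "." + extension


name = descriptive_filename("Eaffinis", "nanaimo", 2010, "counts", "xls")
# name == "Eaffinis_nanaimo_2010_counts.xls"
```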

Page 36: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Organize  files    logically  

Biodiversity  

Lake  

Experiments  

Field  work  

Grassland  

Biodiv_H20_heatExp_2005to2008.csv
Biodiv_H20_predatorExp_2001to2003.csv
…
Biodiv_H20_PlanktonCount_2001toActive.csv
Biodiv_H20_ChlAprofiles_2003.csv
…

From  S.  Hampton  

2.  Data  collection  &  organization  

Page 37: Data Stewardship for Scientists, for CLIR Postdoc Workshop

2. Data collection & organization

Preserve information
• Keep raw data raw
• Use scripts to process data & save them with data

Raw data as .csv

R script for processing & analysis
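The slide pairs a raw .csv file with an R script. The sketch below shows the same pattern in Python instead: the raw file is only ever read, and cleaned rows go to a separate file. The cleaning rule (trim whitespace, drop incomplete rows) and the function name are assumptions.

```python
import csv
from pathlib import Path


def process_raw(raw_path, processed_path):
    """Read the raw CSV, write cleaned rows to a new file, return rows kept."""
    raw_path, processed_path = Path(raw_path), Path(processed_path)
    if raw_path.resolve() == processed_path.resolve():
        raise ValueError("refusing to overwrite the raw data file")
    kept = 0
    with raw_path.open(newline="") as src, processed_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Assumed rule: keep only rows where every cell has a value.
            if all((value or "").strip() for value in row.values()):
                writer.writerow({name: value.strip() for name, value in row.items()})
                kept += 1
    return kept
```

Saving a script like this next to the data records exactly how the processed file was derived from the raw one.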

Page 38: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Best Practices

data management

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse

From Flickr by Big Swede Guy

Page 39: Data Stewardship for Scientists, for CLIR Postdoc Workshop

3. Quality control and quality assurance

Before data collection
• Define & enforce standards
• Assign responsibility for data quality

From Flickr by StacieBee

Page 40: Data Stewardship for Scientists, for CLIR Postdoc Workshop

3. Quality control and quality assurance

After data entry
• Check for missing, impossible, anomalous values
• Perform statistical summaries
• Look for outliers

[Example plot: x-axis from 0 to 40, y-axis from 0 to 60]
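The three checks above can be sketched in a few lines of Python. The readings, the plausible range, and the z-score cutoff of 2.0 are all assumptions chosen for illustration; a real project would set its own limits.

```python
import statistics


def quality_report(values, lower, upper, z_cutoff=2.0):
    """Return indices of missing, impossible, and outlying values, plus a summary."""
    missing = [i for i, v in enumerate(values) if v is None]
    present = [(i, v) for i, v in enumerate(values) if v is not None]
    impossible = [i for i, v in present if not lower <= v <= upper]
    plausible = [(i, v) for i, v in present if lower <= v <= upper]
    numbers = [v for _, v in plausible]
    mean = statistics.mean(numbers)
    sd = statistics.stdev(numbers)  # needs at least two plausible values
    outliers = [i for i, v in plausible if sd > 0 and abs(v - mean) / sd > z_cutoff]
    return {"missing": missing, "impossible": impossible,
            "outliers": outliers, "mean": mean, "sd": sd}


# Hypothetical temperature readings: one blank, one sentinel value, one spike.
readings = [10.1, 10.3, None, 9.9, 10.0, -999.0, 10.2, 10.1, 9.8, 25.0]
report = quality_report(readings, lower=-5.0, upper=40.0)
```

With few observations a z-score can never get very large, so the cutoff here is deliberately low; a plot of the data remains a useful second check.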

Page 41: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Best Practices

data management

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse

From Flickr by Big Swede Guy

Page 42: Data Stewardship for Scientists, for CLIR Postdoc Workshop

4. Metadata basics

Why are you promoting Excel?

What is metadata?

Page 43: Data Stewardship for Scientists, for CLIR Postdoc Workshop

4. Metadata basics

• Digital context
  • Name of the data set
  • The name(s) of the data file(s) in the data set
  • Date the data set was last modified
  • Example data file records for each data type file
  • Pertinent companion files
  • List of related or ancillary data sets
  • Software (including version number) used to prepare/read the data set
  • Data processing that was performed

• Personnel & stakeholders
  • Who collected
  • Who to contact with questions
  • Funders

• Scientific context
  • Scientific reason why the data were collected
  • What data were collected
  • What instruments (including model & serial number) were used
  • Environmental conditions during collection
  • Where collected & spatial resolution
  • When collected & temporal resolution
  • Standards or calibrations used

• Information about parameters
  • How each was measured or produced
  • Units of measure
  • Format used in the data set
  • Precision & accuracy if known

• Information about data
  • Definitions of codes used
  • Quality assurance & control measures
  • Known problems that limit data use (e.g. uncertainty, sampling problems)
  • How to cite the data set
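As an illustration of the checklist above, the sketch below stores a few of those items in a plain record and reports which ones are still empty. The field names and every value are hypothetical, and this is not one of the formal metadata standards.

```python
import json

# A subset of the checklist, with illustrative field names.
REQUIRED_FIELDS = (
    "dataset_name", "file_names", "last_modified", "contact",
    "purpose", "units", "code_definitions", "how_to_cite",
)


def missing_metadata(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Hypothetical record; 'contact' and 'how_to_cite' are left blank on purpose.
record = {
    "dataset_name": "Example lake isotope survey",
    "file_names": ["Eaffinis_nanaimo_2010_counts.xls"],
    "last_modified": "2013-07-31",
    "contact": "",
    "purpose": "Illustrative entry only",
    "units": {"weight": "mg"},
    "code_definitions": {"ref": "reference standard"},
    "how_to_cite": "",
}

gaps = missing_metadata(record)
print(json.dumps(record, indent=2))
```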

Page 44: Data Stewardship for Scientists, for CLIR Postdoc Workshop

4. Metadata basics

What is metadata?

Select the appropriate standard
• Provides structure to describe data: common terms | definitions | language | structure
• Lots of different standards: EML, FGDC, ISO19115, DarwinCore, …
• Tools for creating metadata files: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)

Page 45: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Best Practices

data management

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse

From Flickr by Big Swede Guy

Page 46: Data Stewardship for Scientists, for CLIR Postdoc Workshop

5. Workflows

Workflow: how you get from the raw data to the final products of your research

Simple workflows: flow charts

[Flow chart: Temperature data and Salinity data → Data import into R → Data in R format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production]
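The flow chart can also be written as a short script in which each box becomes a function. The slide's workflow runs in R; this sketch uses Python instead, and the sample values and plausible ranges are assumptions. Graph production is left out.

```python
import csv
import io
import statistics


def import_data(csv_text):
    """Box 1: parse CSV text into rows of floats (None for blank cells)."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        rows.append({name: float(cell) if cell.strip() else None
                     for name, cell in record.items()})
    return rows


def clean(rows, limits):
    """Box 2: keep rows where every variable is present and within its limits."""
    return [row for row in rows
            if all(row[name] is not None and low <= row[name] <= high
                   for name, (low, high) in limits.items())]


def summarize(rows):
    """Box 3: mean and sample standard deviation for each variable."""
    return {name: (statistics.mean([r[name] for r in rows]),
                   statistics.stdev([r[name] for r in rows]))
            for name in rows[0]}


# Hypothetical raw data with one impossible temperature and one blank cell.
RAW = "temperature,salinity\n10.0,30.0\n12.0,32.0\n999,33.0\n14.0,34.0\n,31.0\n"
LIMITS = {"temperature": (-2.0, 40.0), "salinity": (0.0, 45.0)}  # assumed ranges

summary = summarize(clean(import_data(RAW), LIMITS))
```

Because each step is a named function, the script doubles as documentation of the order in which the data were handled.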

Page 47: Data Stewardship for Scientists, for CLIR Postdoc Workshop

5. Workflows

Workflow: how you get from the raw data to the final products of your research

Simple workflows: commented scripts
• R, SAS, MATLAB
• Well-documented code is…
  Easier to review
  Easier to share
  Easier to repeat analysis

#  %  $  

&  

Page 48: Data Stewardship for Scientists, for CLIR Postdoc Workshop

5. Workflows

Fancy Schmancy workflows: Kepler

Resulting output

https://kepler-project.org

Page 49: Data Stewardship for Scientists, for CLIR Postdoc Workshop

5. Workflows

Workflows enable…

Reproducibility: can someone independently validate findings?

Transparency: others can understand how you arrived at your results

Executability: others can re-run or re-use your analysis

Coming Soon: workflow sharing requirements!

From Flickr by merlinprincesse

Page 50: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Best Practices

data management

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse

From Flickr by Big Swede Guy

Page 51: Data Stewardship for Scientists, for CLIR Postdoc Workshop

6. Data stewardship & reuse

Use stable formats: csv, txt, tiff

Create back-up copies: original, near, far

Periodically test ability to restore information

Modified from R. Cook
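A sketch of the "original, near, far" advice: copy a file to two other locations, then confirm with a checksum that each copy still matches. The directory arguments and function names are placeholders; in practice the "far" copy would sit on a different machine or service.

```python
import hashlib
import shutil
from pathlib import Path


def sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(original, near_dir, far_dir):
    """Copy the original into a 'near' and a 'far' directory; return the copies."""
    original = Path(original)
    copies = []
    for directory in (Path(near_dir), Path(far_dir)):
        directory.mkdir(parents=True, exist_ok=True)
        copies.append(Path(shutil.copy2(original, directory / original.name)))
    return copies


def verify_copies(original, copies):
    """Return the copies whose contents no longer match the original."""
    expected = sha256(original)
    return [copy for copy in copies if sha256(copy) != expected]
```

Running `verify_copies` on a schedule is one simple way to test that the back-ups can still be restored.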

Page 52: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Store  your  data  in  a  repository  

Institutional  archive  

Discipline/specialty  archive  

   

 

6.  Data  stewardship  &  reuse  

From  Flickr  by  torkildr  

Ask  a  librarian  

Repos  of  repos:  

databib.org  

re3data.org  

Page 53: Data Stewardship for Scientists, for CLIR Postdoc Workshop

6. Data stewardship & reuse

Practice Data Citation
• Allows readers to find data products
• Get credit for data and publications
• Promotes reproducibility
• Better measure of research impact

Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20

(The DOI is the Persistent Unique Identifier.)
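The example citation follows a simple pattern: author, year, title, repository, identifier. A small helper can assemble it from those parts. The function name and the exact punctuation are assumptions modeled on the example, not a formal citation standard.

```python
def format_data_citation(author, year, title, repository, doi):
    """Assemble a data citation in the pattern of the slide's Dryad example."""
    return f"{author} {year}. {title}. {repository}. doi:{doi}"


citation = format_data_citation(
    author="Sidlauskas, B.",
    year=2007,
    title=("Data from: Testing for unequal rates of morphological "
           "diversification in the absence of a detailed phylogeny: "
           "a case study from characiform fishes"),
    repository="Dryad Digital Repository",
    doi="10.5061/dryad.20",
)
```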

Page 54: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Best Practices

data management

1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse

From Flickr by Big Swede Guy

Page 55: Data Stewardship for Scientists, for CLIR Postdoc Workshop

What is a data management plan?

A document that describes what you will do with your data throughout the research project

From Flickr by Barbies Land

Page 56: Data Stewardship for Scientists, for CLIR Postdoc Workshop

DMP for funders: A short plan submitted alongside grant applications

But they all have different requirements and express them in different ways

An outline of
– what will be collected
– methods
– standards
– metadata
– sharing/access
– long-term storage

Includes how and why

From Flickr by 401(K) 2013

Page 57: Data Stewardship for Scientists, for CLIR Postdoc Workshop

NSF DMP Requirements

From Grant Proposal Guide:

DMP supplement may include:

1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project

2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)

3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements

4. policies and provisions for re-use, re-distribution, and the production of derivatives

5. plans for archiving data, samples, and other research products, and for preservation of access to them
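The five items above can double as a checklist for a draft plan. The sketch below reports which sections of a draft are still empty. The section keys and the draft text are hypothetical; this is not an NSF tool or format.

```python
# Keys are illustrative labels for the five items listed above.
NSF_DMP_SECTIONS = (
    "types_of_data",
    "standards",
    "access_and_sharing",
    "reuse_and_redistribution",
    "archiving_and_preservation",
)


def unaddressed_sections(plan):
    """Return the sections that are missing or blank in a draft plan."""
    return [section for section in NSF_DMP_SECTIONS
            if not plan.get(section, "").strip()]


# Hypothetical draft with two sections still to write.
draft = {
    "types_of_data": "Stable isotope measurements stored as .csv files.",
    "standards": "Metadata written in EML.",
    "access_and_sharing": "",
    "archiving_and_preservation": "Deposit in a discipline repository.",
}

todo = unaddressed_sections(draft)
```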

Page 58: Data Stewardship for Scientists, for CLIR Postdoc Workshop

1. Types of data & other information

• Types of data
• Existing data
• How/when/where created?
• How processed?
• Quality control
• Security
• Who is responsible

biology.kenyon.edu  

C.  Strasser  

From  Flickr  by  Lazurite  

Page 59: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Wired.com  

2. Data & metadata standards

• Metadata needed
• How captured
• Standards

Page 60: Data Stewardship for Scientists, for CLIR Postdoc Workshop

3. Policies for access & sharing
4. Policies for re-use & re-distribution

• Obligation to share
• How/when/where available
• Getting access
• Copyright / IP
• Permission restrictions
• Embargo periods
• Ethics/privacy
• How cited

From Flickr by maryfrancesmain

Page 61: Data Stewardship for Scientists, for CLIR Postdoc Workshop

•  What  &  where    

•  Metadata  

•  Who’s  responsible  

5.  Plans  for  archiving  &  preservation  

From  Flickr  by  theManWhoSurfedTooMuch  

Page 62: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Don’t  forget  the  budget  

dorrvs.com  

Page 63: Data Stewardship for Scientists, for CLIR Postdoc Workshop

NSF’s  Vision*  

DMPs  and  their  evaluation  will  grow  &  change  over  time    

Peer  review  will  determine  next  steps  

Community-driven guidelines

Evaluation  will  vary  with  directorate,  division,  &  program  officer  

 

*Unofficially  

Page 64: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From Flickr by celikins

Where  to  start?  

Page 65: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From  Flickr  by  Andy  Graulund  

Make a resolution
• Triage on current projects
• Get advisor, lab mates, collaborators on board
• Do better next time

Page 66: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Start  working  online  

From  Flickr  by  karindalziel  

Page 67: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From  Flickr  by  karindalziel  

E-notebooks

Online science

http://datapub.cdlib.org/software-for-reproducibility-part-2-the-tools/

Reproducibility  

Page 68: Data Stewardship for Scientists, for CLIR Postdoc Workshop
Page 69: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From Flickr by dipster1

Toolbox  

Page 70: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Step-by-step wizard for generating DMP

create | edit | re-use | share

Free & open to community

dmptool.org                    Write  a  DMP  

Page 71: Data Stewardship for Scientists, for CLIR Postdoc Workshop

databib.org  

Where  should  I  put  my  data?  

Find  a  repository  

Page 72: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Get  help  

From Flickr by thewmatt

Page 73: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Get help from your library

From Flickr by North Carolina Digital Heritage Center

From  Flickr  by  Madison  Guy  

Page 74: Data Stewardship for Scientists, for CLIR Postdoc Workshop

NSF  funded  DataNet  Project  Office  of  Cyberinfrastructure  

www.dataone.org  

Get  help  

Page 75: Data Stewardship for Scientists, for CLIR Postdoc Workshop


Page 76: Data Stewardship for Scientists, for CLIR Postdoc Workshop

• Data Education Tutorials
• Database of best practices & software tools
• Primer on data management
• Investigator Toolkit

www.dataone.org  

Page 77: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From  Flickr  by  Skakerman  

A  word  about  Metrics…  

Page 78: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Articles  are  the  butterfly  pinned  on  the  wall.  Pretty  but  not  very  useful.  They  are  only  the  advertisements  for  scholarship.      –  A.  Levi,  U.  Maryland  College  of  Information  Studies    

From  Flickr  by  LisaW123  

Page 79: Data Stewardship for Scientists, for CLIR Postdoc Workshop

How to incentivize good data stewardship?

Data  Citation  

Altmetrics  (Alternative  Metrics)  

From  Flickr  by  chriscook04  

Page 80: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From  Flickr  by  dotpolka  

Doing  science  is  a  privilege  –  not  a  right  

Page 81: Data Stewardship for Scientists, for CLIR Postdoc Workshop

 There  is  a  social  contract  of  science:  we  have  an  obligation  to  ensure  dissemination,  validation,  &  advancement.  

To  not  do  so  is  science  malpractice.    

Who's  responsible?  Researchers,  publishers,  libraries,  repositories…    

–  Brian  Hole,  Ubiquity  Press  at  UCL  

From  Flickr  by  mikerosebery  

Page 82: Data Stewardship for Scientists, for CLIR Postdoc Workshop

From  Flickr  by  Michael  Tinkler  

Page 83: Data Stewardship for Scientists, for CLIR Postdoc Workshop

Data  Pub  Blog:  datapub.cdlib.org  

Page 84: Data Stewardship for Scientists, for CLIR Postdoc Workshop

My website: carlystrasser.net
Email me: [email protected]
Tweet me: @carlystrasser
My slides: slideshare.net/carlystrasser