The Data Author’s Perspective: Lessons Learned From Data Creation to Data Curation

33
1 The Data Author’s Perspective: Lessons Learned From Data Creation to Data Curation Collect Store Searc h Retrieve Analyze Presen t Jeff Dozier James E. Frew

description

Collect. Store. Present. Analyze. Search. Retrieve. The Data Author’s Perspective: Lessons Learned From Data Creation to Data Curation. Jeff Dozier James E. Frew. Snow spectral reflectance and absorption coefficient of ice. Landsat Thematic Mapper (TM) band combinations. - PowerPoint PPT Presentation

Transcript of The Data Author’s Perspective: Lessons Learned From Data Creation to Data Curation

Page 1: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

1

The Data Author’s Perspective: Lessons Learned From Data Creation to Data Curation

Collect

Store

Search

Retrieve

Analyze

PresentJeff DozierJames E. Frew

Page 2: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

2

Snow spectral reflectance and absorption coefficient of ice

0.0

0.2

0.4

0.6

0.8

1.0

0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4

wavelength, mm

refl

ect

an

ce

1.0E-09

1.0E-08

1.0E-07

1.0E-06

1.0E-05

1.0E-04

1.0E-03

1.0E-02

abso

rpti

on

co

effi

cien

t

0.05 mm

0.2 mm

0.5 mm

1.0 mm

absorptioncoefficient

Page 3: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

3

Landsat Thematic Mapper (TM) band combinations

Bands 4,3,2 (R,G,B) Bands 5,4,2 (R,G,B)

Page 4: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

4

0

20

40

60

80

100

0.3 0.8 1.3 1.8 2.3wavelength (mm)

refl

ec

tan

ce

(%

)

snow

vegetation

rock

equal snow-veg-rock

80% snow, 10% veg, 10% rock

20% snow, 50% veg, 30% rock

What you see, through Earth’s atmosphere

Page 5: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

5

Spatial, spectral characteristics of Landsat and MODIS

0.4 0.5 0.6 0.8 1.0 1.2 1.5 2 3 4 5 6 8 10 12 15

10

20

30

50

100

200

300

500

1000

wavelength, m

IFO

V, m

Landsatpanchromatic

visible NIR SWIR

thermal

MODIShigh-resolution

“land”

ocean/atmosphere

visible

visible

NIR

NIR

SWIR

thermal

Page 6: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

6

0

20

40

60

80

100

0.3 0.8 1.3 1.8 2.3wavelength (mm)

refl

ec

tan

ce

(%

)

snow

vegetation

rock

equal snow-veg-rock

80% snow, 10% veg, 10% rock

20% snow, 50% veg, 30% rock

What a multispectral sensor sees

Page 7: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

7

Set of equations for each pixel

1 1 12 11

2 2 22 22

2

, , ,

, , ,

, , ,

where 0 1 and 1

Solve for , and (least squares)

Still to co

snow M

snow M

MN snow N N NM

i

R r cF

R r cF

FR r c

F

r c

1

2

N

a

a

a

F

F

nsider: better corrections for illumination angle,

viewing angle, subpixel topography, and vegetation

Page 8: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

8

Fractional snow cover, Sierra Nevada, March 7 2004

Page 9: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

9

Sierra Nevada topography

Page 10: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

10

Daily MODIS acquisition, processing for Sierra Nevada snow cover and albedo

Ingest from NASA DAACs

Sierra Nevada = 36 MB/daySnow-covered land = 8 GB/day

reproject,mosaic,subset,format

MODIS snow cover & albedo algorithm

Database

Sierra Nevada = 10 MB/daySnow-covered land = 2 GB/day

MODsterTerra

ServerAlexandria

Page 11: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

11

Examples of fractional snow cover, January through April 2004

Jan 01 2004

Jan 17 2004

Mar 26 2004

Apr 08 2004

Page 12: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

12

Examples of grain size, January through April 2004

Jan 01 2004

Jan 17 2004

Mar 26 2004

Apr 08 2004

Page 13: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

13

2004, March 3 vs March 4 2004, March 4 vs March 5 2004, March 5 vs March 7

2004, March 7 vs March 8

SCA (%)

CCA (%) Sensor zenith (degrees)

March 3 73 0 50

March 4 74 18 48

March 5 78 27 36

March 7 69 9 15

March 8 55 31 62

Effect of vegetation

Page 14: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

14

Applications: snowmelt modeling, Marble Fork of the Kaweah River(Molotch et al., GRL, 2004)

Snow Covered Area net radiation > 0 degree days > 0

where:

mq = Energy to water depth conversion, 0.026 cm W-1 m2 day-1

convection parameter, based on wind speed, humidity, and roughnessra

Melt Flux net q d rR m T a SCA

Page 15: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

15

Magnitude of snowmelt: Modeled – Observed snow water equivalence

assumedalbedo

assumed w/ update

AVIRISalbedo

SWE difference, cm Tokopah basin, Sierra Nevada

Page 16: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

16

The data author’s perspective on drivers and constraints

• The science information user:– I want reliable, timely, usable science information products

» Accessibility

» Accountability

• The funding agencies and the science community:– We want this to be done by a distributed federation of providers,

not just by data centers

» Scalability

• The science information provider:– I’m doing just fine, thanks.

» Transparency

Page 17: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

17

Research vs. production computing

Research computing is …

• Heterogeneous– multiple platforms,

applications, languages

• Idiosyncratic– researchers typically have

highly customized computing environments

• Problem-driven– focus on results, not processes

Production computing is …• Robust

– reliable, not just correct

• Standardized– can easily substitute

components for repair, upgrade, etc.

• Scalable– accommodates steady or

increasing demand for product

Page 18: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

18

Principles

• Goal– Help scientists become information providers in a

federated data system

• Prime Directive– Minimal disruption of a working scientist’s

computational environment

• Ultimate product– Software, system architecture, and procedures for

turning science projects into a federation of providers

Page 19: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

19

Model structure for MODIS snow-covered area and albedo

Basinmask

Processing Lineage

Watershedinfo

MODIScloud mask

(48 bits)

MODIS 7 land bands (112 bits)

MODIS quality flags

Topography

MODIS snow cover and grain

size

MODISview

angles

Solarzenith,

azimuth

Snowfraction

albedoRMSerror

Vegfraction

Soilfraction

Shadefraction

Open water

fraction

Quality flag

Page 20: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

20

Lineage: current best practice

Page 21: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

21

ESSW: Our Earth System Science WorkbenchProducer and consumer issues can both be addressed

by a laboratory metaphor• Experiment

– Network of models– … ingesting / synthesizing data– … generating products

• Laboratory– Experiment execution environment

» Computing + storage = accessibility + scalability

• Lab Notebook– Persistent storage that can be queried– Keeps track of all experiments

» Documentation + lineage = accountability

Page 22: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

22

Use existing science applications• No “standard” Earth science computing

environment– commercial packages (ArcGIS, ENVI, MATLAB, …)

– public packages/models (MM5, MODTRAN, …)

– locally-developed codes

• Example: Snow cover from AVHRR commercial + standalone programs– parameters highly customized for UCSB

• How do we get these programs to– communicate

– cooperate

with the Earth System Science Workbench (ESSW), without rewriting?

Navigate(Manual/Automatic)

Receive

Ingest and Calibrate

Rectify

Snow-Covered Area

SnowMaps

Page 23: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

23

Wrap Your App: Scripts talk to ESSW

• No changes,just additions– Wrapper scripts

» Make program (groups) look like ESSW experiments

– ESSW daemon

» Convertswrapper outputtodatabase input

– ESSW database

» Stores converted wrapper output

Navigate(Manual/Automatic)

Receive

Ingest and Calibrate

Rectify

Snow-Covered Area

SnowMaps

ESSWDatabase

Perl APIESSW

daemon

XML + SQL

MySQL

JDBC

Java

Perl

Page 24: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

24

avhrr_handNav

AVHRR telemetry ingest

AHVRR Level 1Bproduct

AVHRR Level 1B:navigatedMulti-channel

snow-coveredarea

algorithm

AVHRR Level 0 product

avhrr_copyNav

Hand navigation details

Snow-covered area

Copynavigatedimage

SCA: navigated

avhrr_snowModel

avhrr_navd_sca

Hand navigationprocedure

avhrr_ingest

avhrr_navd_l1b

avhrr_L0

avhrr_l1b

avhrr_sca

Detailedexample

Page 25: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

25

ESSW Lessons

• Providers are customers– Federations aren’t much good unless scientists are happy to put

information in them

• A light touch is the right touch– Wrapping is easier for scientists and their programmers to deal with than

complete re-engineering

• Scientists do write scripts, but not necessarily Perl– Scripting (gluing stuff together) comes naturally to scientists

• Scientists don’t write DTDs

• Nobody calls metadata APIs

ESSW was automatic, but not automatic enough…

Page 26: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

26

ES3 : Earth System Science Server

cheap server

RAID 5 controller

cheap server

(mirror)

RAID 5 controller

Back Up Brick (BUB)

read read (backup)

write

cheap server

RAID 5 controller

cheap server

(mirror)

RAID 5 controller

Back Up Brick (BUB)

read read (backup)

write

data lineage tracking

BUB data storage ROCKS processing

clusters

Alexandria Digital Library

Microsoft TerraServer

MODster

OpenDAP

MODIS

Corona

AVHRR

Watershed-scale snow

product

Global-scale snow

product

Page 27: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

27

From ESSW to ES3: Summary

• Perl wrappers Probulators

• Perl API web services + RDF messages

• SQL XML database(s)

Page 28: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

28

From wrappers to probulators

Wrappers: active lineage• Good

– Complete control over what gets recorded– Single language/API for all wrapped events– Not tied to execution

» You can even lie about what happened

• Bad– Must explicitly script everything– Scripts can drift from reality

» You can even lie about what happened

Page 29: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

29

From wrappers to probulators

Probulators: passive lineage• Good

– Record what actually happened» Not just what you think happened» Not what didn’t happen

– Automatic: don’t have to write new scripts for everything

• Bad– Different flavors for different environments

» Can’t just do everything in Perl…

Page 30: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

30

Probulator flavors• Instrumentation

– Insert lineage capture instructions directly into science codes» e.g. “I just created file ‘foo’”

– Typical implementation: preprocessor/precompiler

• Overriding– Replace standard routines/libraries with lineage-capturing versions

» e.g. open(…) → snoopy_open(…)– Typical implementation: modify execution environment

» environment variables» configuration files

• Passive monitoring– Trace program execution

» e.g. “called open() with args foo, bar, …”– Typical implementation: strace’d shell

Page 31: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

31

ES3 lineage architecture

probulator1

probulatorn

logger transmitter ES3 core

logfiles

Page 32: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

32

Now What?

• Probulator reports not universally unique– Q: How hook separate reports together?

– A: Logger assigns UUIDs to

» Data streams

» Processes

» Jobs (workflows)

• Lineage not explicit– Q: How publish lineage?

– A: ES3 Core builds serialized graph

Page 33: The Data Author’s Perspective:  Lessons Learned From Data Creation to Data Curation

33

Products available from http://www.snow.ucsb.edu (forthcoming)

• Fractional snow-covered area, grain size (and contaminants) from daily MODIS images– Quality flags for cloud cover, highly oblique viewing– Fractional coverage of other endmembers

• Best estimate of snow-covered area and broadband albedo on that date– Extrapolating from previous values to that date and

smoothing

• End-of-season reanalysis of daily snow-covered area and broadband albedo– Interpolation, smoothing, comparison with in situ snow

pillow data