Early Experience Prototyping a Science Data Server for Environmental Data Deb Agarwal (LBL)...
Early Experience Prototyping a Science Data Server for Environmental Data
Deb Agarwal (LBL), Catharine van Ingen (MSFT)
25 October 2006
Outline
• Water and ecological data archives and other sources
• Typical small group collaboration needs
• Berkeley Water Center and Ameriflux collaboration
• Common problems
Unprecedented Data Availability
[Figure: Example Carbon-Climate Datasets. Panels: Soils, Climate, Remote Sensing. Groupings: Observatory datasets, Spatially continuous datasets]
Ameriflux Collaboration Overview
• 149 sites across the Americas
• Each site reports a minimum of 22 common measurements.
• Communal science: each principal investigator acts independently to prepare and publish data.
• Second-level data are published to and archived at Oak Ridge.
• Total data reported to date is on the order of 150M half-hourly measurements.
• http://public.ornl.gov/ameriflux/
[Figure labels: T AIR, T SOIL, Onset of photosynthesis]
Typical Data Flow Today
• Prior to analysis, data and ancillary data must be assembled, checked, and cleaned
– Some of this is mundane (e.g. unit conversions)
– Some requires domain-specific knowledge, including instrumentation or location knowledge
– Ancillary data is often critical to understanding and using the data
• After all that, data are often misplaced, scattered, and even lost
– Provenance is in the mind of the beholder
– “Everybody knows” yet no one is sure
[Diagram: Internet Data Archives, Local Measurements, Large Models, Legacy Sources]
Improved Data Flow
• Local repository for data and ancillary data assembled by a small scientific collaboration from a wide variety of sources
– A common “safe deposit box”
– Versioned and logged to provide basic provenance
• Simple interactions with existing and emerging internet portals for data and ancillary data download, and, over time, upload
– Simplify data assembly by adding automation for tracking and data conversions
[Diagram: Legacy Sources, Internet Data Archives, Local Measurements, Large Models]
Data Curation Today
• Well curated large government operated sites
– Clear protocols for measurement updates, recalibrations, changes
– Emerging standards or long-standing practices for measurement naming and reported units
– http://waterdata.usgs.gov/nwis
• Somewhat curated smaller organization sites
– Best-effort use of common measurement naming and units
– As data sharing increases, “best” practices tend to emerge
– http://public.ornl.gov/ameriflux/
• Locator catalog sites
– Help locate similar data across websites
– http://www.cuahshi.org/hdas
• Everybody else
– Naming, units, and recalibrations unclear
– Moving to an ideal: http://www2.ncsu.edu/ncsu/CIL/WRRI/neuse.html
Data Curation Challenges
• Cross source and over time rationalization
– Different naming and units conventions
– Distinguish derived and non-derived measurements: VPD computed from Rh
• Convert basic measurements to useful inputs for science
– Algorithms still evolving for smoothing (obviously?) data and gap-filling
– Archive tends to represent instrumentation; science tends to represent physical system
• Convert from basic science data to useful inputs for public policy
– $40K acre-foot for Central Valley irrigation water; ~80% of that is energy cost
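The derived versus non-derived distinction can be made concrete with the VPD example. The slides do not say which formula the sites use, so this is only a minimal sketch: it assumes the widely used Tetens approximation for saturation vapor pressure, and the function names and the `derived_from` tag are hypothetical.

```python
import math
from typing import Dict, List, Optional


def saturation_vapor_pressure_kpa(temp_c: float) -> float:
    """Tetens approximation over water; temp in deg C, result in kPa.

    Assumption: the slides do not name a formula, this is one common choice.
    """
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def vpd_kpa(temp_c: float, rh_percent: float) -> float:
    """Vapor pressure deficit derived from air temperature and relative humidity."""
    if not 0.0 <= rh_percent <= 100.0:
        raise ValueError(f"relative humidity out of range: {rh_percent}")
    return saturation_vapor_pressure_kpa(temp_c) * (1.0 - rh_percent / 100.0)


def tag_measurement(name: str, value: float,
                    derived_from: Optional[List[str]] = None) -> Dict[str, object]:
    """Attach a hypothetical provenance tag so derived values stay distinguishable."""
    return {
        "name": name,
        "value": value,
        "derived": derived_from is not None,
        "derived_from": list(derived_from or []),
    }


if __name__ == "__main__":
    ta, rh = 25.0, 50.0
    print(tag_measurement("TA", ta))
    print(tag_measurement("VPD", vpd_kpa(ta, rh), derived_from=["TA", "RH"]))
```

Keeping the inputs alongside the derived value means a later recalibration of TA or RH can be traced through to every VPD value computed from it.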
[Figure: bar chart with one bar per Ameriflux site (Brazil, Canada, and USA sites); y axis 0 to 7,000,000. Legend of measurement types: APAR, CO2, DT, FC, FG, FH2O, FPAR, GPP, H, H2O, LE, Leafwetness, NEE, O3, Other, PAR, PREC, PRESS, Rd, Rg, Rgl, RH, Rn, Sa, SCO2, SVP, SWC, TA, TAU, Tbole, Tdew, TS, U, UST, UW, VPD, WD, WS]
Odd Microclimate Effects or Error in Time Reporting?
[Figure: Average Air Temperature at Two Nearby Sites]
Scientific Data Server Goals
• Act as a local repository for data and metadata assembled by a small group of scientists from a wide variety of sources
– Simplify provenance by providing a common “safe deposit box” for assembled data
• Interact simply with existing and emerging internet portals for data and metadata download, and, over time, upload
– Simplify data assembly by adding automation
– Simplify name space confusion by adding explicit decode translation
• Support basic analyses across the entire dataset for both data cleaning and science
– Simplify mundane data handling tasks
– Simplify quality checking and data selection by enabling data browsing
Scientific Data Server Logical Overview
[Diagram: Scientific Data Server components]
• Data Access and Analysis Tools
• Latest Dataset Database and Latest Dataset Cube
• Staging Databases and Cubes
• Last Known Good Dataset(s) Database and Cubes
• Older Dataset(s) Archive Database
• Private Data Analysis Databases and Cubes
Analysis tools and interfaces shown in the diagram:
• Excel, Matlab, SPlus, SAS, ArcGIS
• Simple web data plots and tables
• BigPlot data browsing
• Computational Models
• Flat file data import/export
Data Staging Pipeline
[Diagram boxes: Scheduled download from Website; Incremental Data Copy to Active Database; Basic Data Checks; Stage Data Decode; Convert to CSV Canonical Form; Load CSV files into Staging Database]
• Data can be downloaded from internet sites regularly
– Sometimes the only way to detect changed data is to compare with the data already archived
– The download is relatively cheap; the subsequent staging is expensive
• New or changed data are discovered during staging
– Simple checksum before load
– Chunk checksum after decode
– Comparison query if requested
• Decode stage is critical for handling the uncontrolled vocabularies
– Measurement type, location offset, quality indicators, units, and derivation methods are often encoded in column headers
• Incremental copy moves staged data to one or more sitesets
– Automated via siteset:site:source mapping
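The checksum-based change detection mentioned on this slide could look roughly like the following. This is a sketch under stated assumptions, not the server's actual implementation: the slides do not name a hash function or chunk size, so SHA-256 and a fixed row count per chunk are illustrative choices, and all function names are hypothetical.

```python
import hashlib
from typing import Iterable, List


def file_checksum(data: bytes) -> str:
    """Whole-file digest: a cheap test before any load work is done."""
    return hashlib.sha256(data).hexdigest()


def chunk_checksums(rows: Iterable[str], chunk_size: int = 1000) -> List[str]:
    """Digest decoded rows in fixed-size chunks (chunk_size is an assumption)."""
    digests: List[str] = []
    hasher = hashlib.sha256()
    count = 0
    for row in rows:
        hasher.update(row.encode("utf-8"))
        hasher.update(b"\n")  # row separator so ["ab", "c"] != ["a", "bc"]
        count += 1
        if count == chunk_size:
            digests.append(hasher.hexdigest())
            hasher = hashlib.sha256()
            count = 0
    if count:  # trailing partial chunk
        digests.append(hasher.hexdigest())
    return digests


def changed_chunks(old: List[str], new: List[str]) -> List[int]:
    """Indices of chunks in `new` that differ from, or extend beyond, `old`."""
    changed = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    changed.extend(range(min(len(old), len(new)), len(new)))
    return changed


if __name__ == "__main__":
    archived = [f"2004-01-01T00:{m:02d},TA,{m}" for m in range(6)]
    downloaded = list(archived)
    downloaded[3] = "2004-01-01T00:03,TA,99"  # a recalibrated value
    print(changed_chunks(chunk_checksums(archived, 2),
                         chunk_checksums(downloaded, 2)))
```

Only the chunks reported as changed would then need the more expensive comparison query or a reload.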
Column Decode Today
[Datumtype][repeat][_offset][_offset][extended datumtype][units]
• Datumtype: the short (<16 characters) name for the data.
– Example: TA, PREC, or LE.
• Repeat: an optional number indicating that multiple measurements were taken at the same site and offset.
– Example: TA2.
• [_offset][_offset]: major and minor parts of the z offset.
– Example: SWC_10 (SWC at 10 cm) or TA_10_7 (TA at 10.7 m).
• Extended datumtype: any remaining column text.
– Example: “fir”, “E”, “sfc”, “wangrot”, “_cum”
• Units: measurement units.
– Example: w/m2, or deg C.
• 1243 unique column header strings now
– Roughly 70% of that due to offset or two extended datumtypes
• Another ~100 arriving now
– Quality and algorithm derivation provenance
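A decoder for the header pattern above might be sketched as follows. Several things here are assumptions rather than details from the slides: the datumtype is matched against a small known vocabulary (so that names ending in digits, such as CO2, are not misread as a repeat count), the extended datumtype and units are left together as trailing text, and the offset is returned without units because the header alone does not say whether it is in cm or m. The names `KNOWN_DATUMTYPES`, `DecodedColumn`, and `decode_header` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Illustrative subset of datumtypes; a real vocabulary would be much larger.
KNOWN_DATUMTYPES = {"TA", "TS", "PREC", "LE", "SWC", "CO2", "H2O",
                    "VPD", "RH", "GPP", "NEE"}

# Everything after the datumtype: optional repeat, up to two offsets, the rest.
REST = re.compile(
    r"^(?P<repeat>\d+)?(?:_(?P<major>\d+))?(?:_(?P<minor>\d+))?(?P<extended>.*)$",
    re.DOTALL,
)


class DecodedColumn(NamedTuple):
    datumtype: str
    repeat: Optional[int]
    offset: Optional[float]  # unitless: cm versus m is not in the header
    extended: str            # extended datumtype and units, not split further


def decode_header(header: str) -> DecodedColumn:
    text = header.strip()
    candidates = [d for d in KNOWN_DATUMTYPES if text.upper().startswith(d)]
    if not candidates:
        raise ValueError(f"unknown datumtype in header: {header!r}")
    datumtype = max(candidates, key=len)  # longest known prefix wins
    match = REST.match(text[len(datumtype):])
    assert match is not None  # every group is optional, so this always matches
    repeat = int(match.group("repeat")) if match.group("repeat") else None
    major, minor = match.group("major"), match.group("minor")
    offset: Optional[float] = None
    if major is not None:
        # TA_10_7 is read as 10.7, following the example on the slide.
        offset = float(f"{major}.{minor}") if minor is not None else float(major)
    return DecodedColumn(datumtype, repeat, offset, match.group("extended").strip())


if __name__ == "__main__":
    for name in ("TA2", "SWC_10", "TA_10_7", "PREC_cum", "CO2_5"):
        print(name, "->", decode_header(name))
```

Matching against a vocabulary rather than a pure regular expression is a deliberate choice here: with more than a thousand distinct header strings, an unknown prefix is better reported as an error than silently decoded.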
Browsing for Data Availability
Data Availability by Site
Measuring temperature is easy; deriving ecosystem production is problematic
[Figure: GPP Data Availability, by site and year, 1990 to 2007]
[Figure: TA Data Availability, by site and year, 1990 to 2007]
Browsing for Data Applicability
• Real field data has both short-term gaps and longer-term outages due to instrument outages
– The utility of the data depends on the nature of the science being performed
– Browsing data counts can give rapid insight into how the data can be used before more complex analyses are performed
[Figure: Data Count (0 to 6000) by month (1 to 12) for the following sites, listed with latitude]
55.86306 BOREAS NSA - 1981 burn site
55.879002 BOREAS NSA - Old Black Spruce
55.90583 BOREAS NSA - 1930 burn site
55.911671 BOREAS NSA - 1963 burn site
55.916672 BOREAS NSA - 1989 burn site
56.63583 BOREAS NSA - 1998 burn site
69.133331 AK Happy Valley
70.281471 AK Upad
70.496002 AK Atqasuk
Data often missing in the winter!
[Figure: Average Temperature (Deg C, -15 to 30) versus Latitude (20 to 80)]
What’s going on at higher latitudes? (It should be getting colder)
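The data-count browsing described on this slide can be approximated with a few lines of code. This is only a sketch: it assumes records arrive as (timestamp, value) pairs with missing values already mapped to None, and the function names and the 50% threshold are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, List, Optional, Tuple

Record = Tuple[datetime, Optional[float]]


def monthly_counts(records: Iterable[Record]) -> Dict[int, int]:
    """Count non-missing values per calendar month (1 to 12)."""
    counts: Counter = Counter()
    for timestamp, value in records:
        if value is not None:
            counts[timestamp.month] += 1
    return {month: counts.get(month, 0) for month in range(1, 13)}


def sparse_months(counts: Dict[int, int], min_fraction: float = 0.5) -> List[int]:
    """Months whose count falls below a fraction of the best-covered month."""
    peak = max(counts.values())
    if peak == 0:
        return sorted(counts)
    return [m for m in sorted(counts) if counts[m] < min_fraction * peak]


if __name__ == "__main__":
    # Synthetic example: good summer coverage, sparse or absent winter data.
    records: List[Record] = []
    for month in range(4, 10):
        records += [(datetime(2004, month, day), 1.0) for day in range(1, 11)]
    records += [(datetime(2004, 1, 1), 1.0), (datetime(2004, 12, 1), None)]
    counts = monthly_counts(records)
    print(counts)
    print("sparse months:", sparse_months(counts))
```

An average computed over a site with sparse winter months will be biased warm, which is one possible reading of the latitude plot on this slide.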
Curation Learnings To Date
• Ancillary data is as important as data
– Comparing sites of like vegetation, climate as important as latitude or other physical quantity
– Only some are numeric, most are debated, some vary with time
– Curate the two together
• Controlled vocabularies are hard
– Humans like making up names and have a hard time remembering 100+ names
– Assume a decode step in the staging pipeline
• Data analysis and data cleaning are intertwined
– Data cleaning is always on-going
– Some measurements can be used as indicators of quality of other measurements
– Share the simple tools and visualizations
The saga continues at http://dsd.lbl.gov/BWC/amfluxblog/ and http://research.microsoft.com/~vaningen/BWC/BWC.htm
Acknowledgements
Berkeley Water Center, University of California, Berkeley, Lawrence Berkeley Laboratory: Deb Agarwal, Monte Good, Susan Hubbard, James Hunt, Matt Rodriguez, Yoram Rubin
Microsoft: Jim Gray, Tony Hey, Dan Fay, Stuart Ozer, SQL product team
Ameriflux Collaboration: Dennis Baldocchi, Beverly Law, Gretchen Miller, Tara Stiefl, Mathias Goeckede, Mattias Falk, Tom Boden