Data Integration Goal
• Single interface
  – Access and download *all* data, near real time
• Include all necessary metadata where possible
• Currently
  – 6 groups, 20 sites
  – Soon: 5 more groups
  – http://dbbadger.gleonrcn.org
Challenges with Global and Grassroots
• Entire globe covered
  – Long distances involved
• Grassroots, Bottom Up
  – No Top Level Funding or Mandates
  – Sites have different staffing and funding levels
• Existing systems
  – Some have extensive in-place infrastructure
  – Some have no existing systems
• Tens of diverse groups
  – Common language, vocabulary?
Many Potential Strategies
• Transferring Data
  – Push
  – Pull
• Archiving Data
  – Distributed (Federated)
  – Centralized
Distributed Storage
• No replication
• Control Designation
  – Each site stores its own data
• Highly susceptible to faults
• Potentially poor performance
[Diagram: Central Portal]
Centralized Storage
• Good query performance
• Less susceptible to network faults
• Responsibility and control change
  – Sites want local copy of data
[Diagram: Central DB]
Moving Data
• Push
  – Sites send data
  – Central listens
• Pull
  – Central requests data
  – Sites listen for requests
[Diagram: Central, with "Data Request" and "Data Sent" arrows]
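The two transfer modes differ only in who initiates the exchange. A minimal in-memory sketch, assuming hypothetical `Site` and `Central` classes (no networking; these names are illustrative and not part of any GLEON software):

```python
from dataclasses import dataclass, field


@dataclass
class Central:
    """Hypothetical central store; keeps every reading it receives."""
    received: list = field(default_factory=list)

    def listen(self, site_name, readings):
        # Push: central passively accepts whatever a site sends.
        self.received.extend((site_name, r) for r in readings)

    def request(self, site):
        # Pull: central initiates and the site answers the request.
        self.listen(site.name, site.handle_request())


@dataclass
class Site:
    """Hypothetical site holding readings not yet transferred."""
    name: str
    pending: list = field(default_factory=list)

    def push(self, central):
        # Push: the site initiates the transfer.
        central.listen(self.name, self.handle_request())

    def handle_request(self):
        # Hand over pending readings and clear the local queue.
        readings, self.pending = self.pending, []
        return readings


central = Central()
Site("site_a", [1.0, 2.0]).push(central)   # site-initiated
central.request(Site("site_b", [3.0]))     # central-initiated
```

In both cases the same readings end up at the central store; what changes is which side has to be reachable and listening.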
GLEON Model: Mixed
• All data are stored centrally
  – Some replicated at local site
• Pull: Sites with existing systems
  – Based on XML standard
• Push: Sites with GLEON system
[Diagram: Central]
Data-Integration Project
• XML Based Standard
  – Sites expose data
  – Data are harvested
• Underlying DB can be anything
• Still creates issues
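As an illustration of the expose-and-harvest idea, the sketch below parses a small XML document with Python's standard `xml.etree.ElementTree`. The element and attribute names (`observations`, `obs`, `site`, `variable`, `time`) and the readings are invented for this example; the actual XML standard is not shown in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload a site might expose; not the real format.
EXPOSED_XML = """
<observations site="ExampleLake">
  <obs variable="Water_Temperature" time="2010-06-01T00:00:00">18.2</obs>
  <obs variable="Air_Temperature" time="2010-06-01T00:00:00">21.5</obs>
</observations>
"""


def harvest(xml_text):
    """Turn exposed XML into plain dictionaries, independent of the site's DB."""
    root = ET.fromstring(xml_text)
    site = root.get("site")
    return [
        {
            "site": site,
            "variable": obs.get("variable"),
            "time": obs.get("time"),
            "value": float(obs.text),
        }
        for obs in root.findall("obs")
    ]


rows = harvest(EXPOSED_XML)
```

Because the harvester only sees the XML, the database underneath each site can be anything, as the slide notes.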
ZiggyStardust
• Source: any data originator
• Repository: any ‘next step’ for data
• Filter
  – QA/QC
  – Event Detection
  – Derived product generator
• Notification Services
[Diagram: Source, Filter, Filter, Filter, Repository/Middleware]
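The source, filter, repository chain can be sketched as plain function composition. This is an illustration only, not ZiggyStardust's actual API; the function names, the plausible-range limits, the sample readings, and the derived Fahrenheit value are all assumptions made for the example.

```python
def source():
    """Hypothetical data originator yielding (variable, value) pairs."""
    yield from [("Water_Temperature", 18.2),
                ("Water_Temperature", -999.0),   # made-up sensor error code
                ("Water_Temperature", 19.0)]


def qaqc_filter(records, low=-5.0, high=45.0):
    """QA/QC step: drop values outside an assumed plausible range."""
    return (r for r in records if low <= r[1] <= high)


def derived_filter(records):
    """Derived-product step: append the value converted to Fahrenheit."""
    return ((name, c, c * 9 / 5 + 32) for name, c in records)


def run(src, filters, repository):
    """Pass the source through each filter in order, then store the result."""
    stream = src
    for f in filters:
        stream = f(stream)
    repository.extend(stream)


repository = []   # stands in for any 'next step' for data
run(source(), [qaqc_filter, derived_filter], repository)
```

Each filter takes an iterable and returns one, so filters can be reordered or added without touching the source or the repository.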
• Core Data
  – Value
• Metadata
  – Site
  – Variable
  – Offset
  – Source
  – Aggregation
  – RepNumber
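One way to read this layout is a record whose core datum is a single value, annotated by the metadata fields listed. The sketch below uses Python dataclasses; the field types, the class names, and the example values are assumptions, since the slide gives only the field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    # Field names follow the slide; the types are assumed.
    site: str
    variable: str
    offset: float       # assumed to mean sensor depth or height
    source: str
    aggregation: str
    rep_number: int


@dataclass(frozen=True)
class Observation:
    value: float        # the core datum
    metadata: Metadata


# Made-up example values.
meta = Metadata("ExampleLake", "Water_Temperature", 0.5, "buoy", "Mean", 1)
obs = Observation(18.2, meta)
```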
Data Storage: Flat Structure
• Create data table
• DateTime column
• Each variable is a unique column
Mendota_Buoy_Table:
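A flat table of this kind might look like the following sketch, written against an in-memory SQLite database. The variable columns and the readings are hypothetical, not the actual Mendota buoy schema or data; only the table name and the DateTime column come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per timestamp, one column per variable (column names are hypothetical).
conn.execute("""
    CREATE TABLE Mendota_Buoy_Table (
        DateTime          TEXT PRIMARY KEY,
        Water_Temperature REAL,
        Air_Temperature   REAL
    )
""")
conn.executemany(
    "INSERT INTO Mendota_Buoy_Table VALUES (?, ?, ?)",
    [("2010-06-01 00:00", 18.2, 21.5),    # made-up readings
     ("2010-06-01 01:00", 18.1, 20.9)],
)
# In a flat layout, adding a new variable means altering the table itself.
conn.execute("ALTER TABLE Mendota_Buoy_Table ADD COLUMN Relative_Humidity REAL")
columns = [row[1] for row in conn.execute("PRAGMA table_info(Mendota_Buoy_Table)")]
```

The `ALTER TABLE` step shows the main cost of the flat structure: every new sensor or variable changes the schema.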
Data Storage: Vega Data Model
• Data Model similar to “Star” database schema
  – Vega is a star
• ‘Data Stream’ as core entity
• Inspiration from CUAHSI’s Observation Data Model
Data Stream
• Data
  – Same metadata
  – Change only in time
• Example
  – Var: Water Temperature
  – Site: Lake Annie
  – Unit: C
  – Depth: 0.5m
  – Aggregation: 24:00 Mean
Vega Data Model
• Value oriented structure
• Store data from any number of sites
• Highly optimized ‘Values’ table
• Query Times < 1 sec
• GLEON central
  – Now 40 million values
[Diagram: Streams]
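A value-oriented layout along these lines can be sketched with two tables: one row per stream holding the shared metadata, and one narrow row per value. The table and column names below are assumptions for illustration, not the actual Vega schema (the values table is named `data_values` here only to avoid the SQL keyword). The stream metadata reuses the Lake Annie example; the readings are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: metadata is stored once per stream.
    CREATE TABLE streams (
        stream_id   INTEGER PRIMARY KEY,
        site        TEXT,
        variable    TEXT,
        unit        TEXT,
        depth_m     REAL,
        aggregation TEXT
    );
    -- The values table stays narrow: stream, time, value.
    CREATE TABLE data_values (
        stream_id INTEGER REFERENCES streams(stream_id),
        date_time TEXT,
        value     REAL,
        PRIMARY KEY (stream_id, date_time)
    );
""")
conn.execute(
    "INSERT INTO streams VALUES "
    "(1, 'Lake Annie', 'Water_Temperature', 'C', 0.5, '24:00 Mean')"
)
conn.executemany(
    "INSERT INTO data_values VALUES (1, ?, ?)",
    [("2010-06-01", 27.9), ("2010-06-02", 28.3)],   # made-up readings
)
rows = conn.execute("""
    SELECT s.site, v.date_time, v.value
    FROM data_values AS v
    JOIN streams AS s ON s.stream_id = v.stream_id
    ORDER BY v.date_time
""").fetchall()
```

Unlike the flat layout, a new variable or site adds a row to `streams` rather than a column, so the schema stays fixed as sites are added.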
Controlled Vocabulary
• AirTemp?
• Air Temperature?
• RH?
• RelHum?

• Water_Temperature
• Air_Temperature
• Phycocyanin
• Precipitation
• Relative_Humidity
• Etc…
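A controlled vocabulary can be enforced with a simple lookup from site-specific names to the agreed terms. The sketch below is a minimal illustration; the alias table and the `normalize` function are hypothetical, and only the variable names themselves come from the slide.

```python
# Controlled terms listed on the slide.
CONTROLLED = {"Water_Temperature", "Air_Temperature", "Phycocyanin",
              "Precipitation", "Relative_Humidity"}

# Hypothetical alias table mapping site-specific names to controlled terms.
ALIASES = {
    "airtemp": "Air_Temperature",
    "air temperature": "Air_Temperature",
    "rh": "Relative_Humidity",
    "relhum": "Relative_Humidity",
}


def normalize(name):
    """Return the controlled term for a variable name, or raise if unknown."""
    if name in CONTROLLED:
        return name
    key = name.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    raise KeyError(f"no controlled term for {name!r}")
```

For example, `normalize("AirTemp")` and `normalize("Air Temperature")` both map to `Air_Temperature`, while an unrecognized name raises instead of silently creating a new variable.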
Software Sharing and Reuse
“Good programmers write good code. Great programmers steal great code.” – Unknown
Science and Software Development Parallels
• Science
  – Heavily collaborative
  – Sharing ideas and results
  – Benefits from openness
• Software Development
  – Could do the same
  – Open source community an example
• (Other Expertise)
  – Gleon.org
[Chart: Level of Collaboration, Science and Software Dev]
Grass Roots Software Dev Model
• Open Source/Free Software Community
  – Can be hugely successful
  – Many high profile projects
• Lake Analyzer first example
  – Received input to improve algorithms
  – Available to everyone
• ZiggyStardust, VADER, others also available
Current Challenges
• Metadata (Quality Control Specifically)
  – Collection, standards, storage
  – Challenging for real time and streaming data
  – Meaningful output
  – Replicating updates
• Metadata (Controlled vocabulary)
  – Correct way to differentiate variables?
• Other Observations (manually sampled)
  – Expand to more diverse datasets