GLEON Data Management Luke Winslow PASEO 3/18/09.


GLEON Data Management

Luke Winslow, PASEO 3/18/09

GLEON Sites, September 2008

Lake Observatory

+ IT Development

Buoys (Hi-res Data)

Data Integration Goal

• Single interface
  – Access and download *all* data, near real time
• Include all necessary metadata, where possible
• Currently
  – 6 groups, 20 sites
  – Soon: 5 more groups
  – http://dbbadger.gleonrcn.org

Challenges with Global and Grassroots

• Entire globe covered
  – Long distances involved
• Grassroots, Bottom Up
  – No Top Level Funding or Mandates
  – Sites have different staffing and funding levels
• Existing systems
  – Some have extensive in-place infrastructure
  – Some have no existing systems
• Tens of diverse groups
  – Common language, vocabulary?

Many Potential Strategies

• Transferring Data
  – Push
  – Pull
• Archiving Data
  – Distributed (Federated)
  – Centralized

Distributed Storage

• No replication
• Control Designation
  – Each site stores their data
• Highly susceptible to faults
• Potentially poor performance

[Diagram: Central Portal]

Centralized Storage

• Good query performance

• Less susceptible to network faults

• Responsibility and control change
  – Sites want local copy of data

[Diagram: Central DB]

Moving Data

• Push
  – Sites send data
  – Central listens
• Pull
  – Central requests data
  – Sites listen for requests

[Diagram: Central, Data Request, Data Sent]
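
The difference between the two transfer modes is mostly a question of who initiates the exchange. A minimal sketch, assuming hypothetical `Site` and `Central` classes that are not part of any GLEON software:

```python
from dataclasses import dataclass, field


@dataclass
class Central:
    """Hypothetical central store; keeps every reading it receives."""
    readings: list = field(default_factory=list)

    def receive(self, site_name, batch):
        # Push: central only listens; the site decides when to send.
        self.readings.extend((site_name, r) for r in batch)

    def harvest(self, site):
        # Pull: central initiates the request and the site answers it.
        batch = site.handle_request()
        self.readings.extend((site.name, r) for r in batch)


@dataclass
class Site:
    """Hypothetical site holding a buffer of readings not yet transferred."""
    name: str
    buffer: list = field(default_factory=list)

    def push_to(self, central):
        # The site sends its buffered readings, then clears the buffer.
        central.receive(self.name, self.buffer)
        self.buffer = []

    def handle_request(self):
        # The site hands over its buffered readings when asked.
        batch, self.buffer = self.buffer, []
        return batch
```

In the push case the site calls `push_to(central)`; in the pull case central calls `harvest(site)`. The stored result is the same either way.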

GLEON Model: Mixed

• All data are stored centrally
  – Some replicated at local site
• Pull: Sites with existing systems
  – Based on XML standard
• Push: Sites with GLEON system

[Diagram: Central]

Data-Integration Project

• XML Based Standard
  – Sites expose data
  – Data are harvested
• Underlying DB can be anything
• Still creates issues
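
As a rough illustration of harvesting, the sketch below parses a document that a site might expose. The element and attribute names (`observations`, `obs`, `variable`, `time`, `value`) and the sample readings are invented for the example and are not taken from the actual XML standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical document a site might expose; names and numbers are
# illustrative only and do not come from the GLEON XML standard.
SAMPLE_XML = """
<observations site="Example_Lake">
  <obs variable="Water_Temperature" time="2009-03-18T12:00:00" value="4.2"/>
  <obs variable="Air_Temperature" time="2009-03-18T12:00:00" value="1.5"/>
</observations>
"""


def harvest(xml_text):
    """Parse one exposed document into (site, variable, time, value) rows."""
    root = ET.fromstring(xml_text)
    site = root.get("site")
    return [
        (site, obs.get("variable"), obs.get("time"), float(obs.get("value")))
        for obs in root.iter("obs")
    ]
```

Because the harvester only sees the XML, the database behind each site can stay whatever it already is.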

ZiggyStardust

• Source: any data originator
• Repository: any ‘next step’ for data
• Filter
  – QA/QC
  – Event Detection
  – Derived product generator
• Notification Services

[Diagram: Source, Filter, Filter, Filter, Repository/Middleware]
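
One way to picture the source, filter, and repository roles is as a chain of functions. This is a minimal sketch with made-up function names and range bounds; it does not reproduce the ZiggyStardust API.

```python
def qaqc_filter(readings, low=-5.0, high=40.0):
    """QA/QC: drop values outside a plausible range (bounds are assumed)."""
    return [r for r in readings if low <= r <= high]


def mean_filter(readings):
    """Derived product: reduce the readings to a single mean value."""
    return [sum(readings) / len(readings)] if readings else []


def run_pipeline(source, filters, repository):
    """Pass data from a source through each filter, then store the result."""
    data = list(source)
    for f in filters:
        data = f(data)
    repository.extend(data)
    return repository
```

For example, `run_pipeline([10.0, 20.0, 999.0], [qaqc_filter, mean_filter], [])` drops the out-of-range reading and stores the mean of the remaining two.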

Core Data
  – Value

Metadata
  – Site
  – Variable
  – Offset
  – Source
  – Aggregation
  – RepNumber

Data Storage: Flat Structure

• Create data table
• DateTime column
• Each variable is unique column

Mendota_Buoy_Table:
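
A minimal sketch of what a flat layout could look like, using SQLite. The variable columns and the sample row are hypothetical; only the table name and the DateTime column come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns: one column per variable, one row per timestamp.
conn.execute(
    """CREATE TABLE Mendota_Buoy_Table (
           DateTime TEXT PRIMARY KEY,
           Water_Temperature REAL,
           Air_Temperature REAL
       )"""
)
# Made-up sample row for illustration.
conn.execute(
    "INSERT INTO Mendota_Buoy_Table VALUES (?, ?, ?)",
    ("2009-03-18 12:00", 4.2, 1.5),
)
# Adding a new sensor means altering the table to add a column.
conn.execute("ALTER TABLE Mendota_Buoy_Table ADD COLUMN Relative_Humidity REAL")
flat_row = conn.execute("SELECT * FROM Mendota_Buoy_Table").fetchone()
```

The layout is easy to read, but every new variable changes the table definition, and each site needs its own table.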

Data Storage: Vega Data Model

• Data Model similar to “Star” database schema
  – Vega is a star
• ‘Data Stream’ as core entity
• Inspiration from CUAHSI’s Observation Data Model

Data Stream

• Data
  – Same metadata
  – Change only in time
• Example
  – Var: Water Temperature
  – Site: Lake Annie
  – Unit: C
  – Depth: 0.5m
  – Aggregation: 24:00 Mean
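
The data stream idea can be sketched as an immutable record of metadata with a list of time-stamped values attached. The class and field names below are assumptions, and the timestamps and numbers are made up for illustration; only the metadata values come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStream:
    """Metadata shared by every value in a stream; only time varies."""
    variable: str
    site: str
    unit: str
    depth_m: float
    aggregation: str


stream = DataStream(
    variable="Water Temperature",
    site="Lake Annie",
    unit="C",
    depth_m=0.5,
    aggregation="24:00 Mean",
)

# Each value carries only a timestamp and a number (both made up here);
# everything else is looked up from the stream it belongs to.
values = {stream: [("2009-03-17", 21.4), ("2009-03-18", 21.9)]}
```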

Vega Data Model

• Value oriented structure
• Store data from any number of sites
• Highly optimized ‘Values’ table
• Query Times < 1 sec
• GLEON central
  – Now 40 million values

[Diagram: Streams]
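
A value-oriented layout can be sketched with two tables: one row per stream, and one narrow row per value. The table and column names here are assumptions rather than the actual Vega schema, and the sample values are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- Hypothetical schema: stream metadata is stored once per stream.
    CREATE TABLE streams (
        stream_id INTEGER PRIMARY KEY,
        site TEXT, variable TEXT, unit TEXT, depth_m REAL, aggregation TEXT
    );
    -- The values table stays narrow: stream, time, number.
    CREATE TABLE data_values (
        stream_id INTEGER REFERENCES streams(stream_id),
        date_time TEXT,
        value REAL,
        PRIMARY KEY (stream_id, date_time)
    );
    """
)
db.execute(
    "INSERT INTO streams VALUES "
    "(1, 'Lake Annie', 'Water_Temperature', 'C', 0.5, '24:00 Mean')"
)
db.executemany(
    "INSERT INTO data_values VALUES (?, ?, ?)",
    [(1, "2009-03-17", 21.4), (1, "2009-03-18", 21.9)],
)
vega_rows = db.execute(
    """SELECT s.site, v.date_time, v.value
       FROM data_values AS v JOIN streams AS s USING (stream_id)
       WHERE s.variable = 'Water_Temperature'
       ORDER BY v.date_time"""
).fetchall()
```

Adding a new site or variable only adds rows to `streams`; the table definitions do not change, and the composite primary key gives the values table an index to query against.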

Controlled Vocabulary

• AirTemp?
• Air Temperature?
• RH?
• RelHum?

• Water_Temperature
• Air_Temperature
• Phycocyanin
• Precipitation
• Relative_Humidity
• Etc…
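
Mapping site-specific labels onto the controlled terms can be as simple as a lookup table. A minimal sketch, assuming a hypothetical synonym table and a `to_controlled` helper that are not part of any GLEON tool:

```python
# Controlled terms listed on the slide.
CONTROLLED = {
    "Water_Temperature",
    "Air_Temperature",
    "Phycocyanin",
    "Precipitation",
    "Relative_Humidity",
}

# Hypothetical synonym table; keys are lower-cased site-specific labels.
SYNONYMS = {
    "airtemp": "Air_Temperature",
    "air temperature": "Air_Temperature",
    "rh": "Relative_Humidity",
    "relhum": "Relative_Humidity",
}


def to_controlled(label):
    """Return the controlled term for a label, or raise if it is unknown."""
    if label in CONTROLLED:
        return label
    key = label.strip().lower()
    if key in SYNONYMS:
        return SYNONYMS[key]
    raise ValueError(f"no controlled term for {label!r}")
```

Raising on unknown labels, rather than passing them through, keeps unmapped names from silently entering the central store.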

Software Sharing and Reuse

“Good programmers write good code. Great programmers steal great code.” – Unknown

Science and Software Development Parallels

• Science
  – Heavily collaborative
  – Sharing ideas and results
  – Benefits from openness
• Software Development
  – Could do the same
  – Open source community an example
• (Other Expertise)
  – Gleon.org

[Chart: Level of Collaboration, Science vs. Software Dev]

Grass Roots Software Dev Model

• Open Source/Free Software Community
  – Can be hugely successful
  – Many high profile projects
• Lake Analyzer: first example
  – Received input to improve algorithms
  – Available to everyone
• ZiggyStardust, VADER, others also available

Current Challenges

• Metadata (Quality Control Specifically)
  – Collection, standards, storage
  – Challenging for real time and streaming data
  – Meaningful output
  – Replicating updates
• Metadata (Controlled vocabulary)
  – Correct way to differentiate variables?
• Other Observations (manually sampled)
  – Expand to more diverse datasets

Questions?

• Acknowledgements
  – All GLEON Members, Tim Kratz, Paul Hanson, Tom Harmon, and all others that have contributed ideas and support
  – NSF Grants DEB-0217533, DBI-0639229, and DBI-0446017, and the Gordon and Betty Moore Foundation