Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Chad Berkley, National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara; Long Term Ecological Research Network Office, University of New Mexico; University of Kansas; San Diego Supercomputer Center. http://seek.ecoinformatics.org. December 4, 2003, Edinburgh, Scotland.


Page 1: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Chad Berkley, NCEAS

National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara

Long Term Ecological Research Network Office, University of New Mexico

University of Kansas

San Diego Supercomputer Center

Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

http://seek.ecoinformatics.org

December 4, 2003, Edinburgh, Scotland

Page 2: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Outline

Quick history

SEEK overview

Ecological Metadata Language

Using workflows in Ecology

Workflow editing with Kepler

Future visions

Page 3: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

History

Late 1990s – patterns noticed in the problems surrounding data synthesis at NCEAS

1999 – Michener et al. paper on ecological metadata

2000 – Knowledge Network for Biocomplexity (KNB): Morpho, Metacat, Ecological Metadata Language; some footholds into workflow creation and execution

2003 – Science Environment for Ecological Knowledge (SEEK) grant: continues the work done on the KNB grant; emphasis on using metadata for advanced data processing

Page 4: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

SEEK approach

General approach to specific ecological problems

Data described with adequate metadata in a grid accessible repository

Reasoning engine (ontology based) to locate and extract data and processes

Modeling system to put it all together and control execution flow

Page 5: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

SEEK Components

Ecogrid: analysis library; metadata and data repository

Semantic Mediation System: controlled semantic vocabulary; ontological discovery system

Analysis and Modeling System (Kepler): workflow control system; utilizes resources from other components

Page 6: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

SEEK Architecture

Page 7: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Ecological Metadata Language

Common language for archiving and transport of datasets

XML based

Designed for/by the ecological community

Describes physical and logical structure of data

Also includes project, literature and software information

SEEK will add semantic information

Page 8: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Workflows in SEEK

In the SEEK model, data ingestion/cleaning is metadata driven (specifically with EML)

Output generation includes creating appropriate metadata

The analysis pipeline itself becomes metadata

Page 9: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Metadata driven data ingestion

Key information needed to read and machine process a data file is in the metadata

File descriptors (CSV, Excel, RDBMS, etc.)

Entity (table) and Attribute (column) descriptions: name; type (integer, float, string, etc.); codes (missing values, nulls, etc.)

In the future, this will include semantic typing
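A minimal sketch of metadata-driven ingestion, assuming a simplified dictionary in place of real EML (which is XML and much richer). The names `METADATA`, `CASTS` and `ingest` are hypothetical and only illustrate how attribute names, types and missing-value codes could drive parsing without any file-specific code.

```python
import csv
import io

# Hypothetical, simplified stand-in for the EML fields named above;
# real EML is an XML document, not a Python dictionary.
METADATA = {
    "delimiter": ",",
    "header_lines": 1,
    "attributes": [
        {"name": "site", "type": "string", "missing": []},
        {"name": "count", "type": "integer", "missing": ["-999"]},
        {"name": "area_m2", "type": "float", "missing": ["NA", ""]},
    ],
}

# Map declared attribute types to Python conversion functions.
CASTS = {"string": str, "integer": int, "float": float}


def ingest(text, metadata):
    """Parse delimited text into typed records using only the metadata."""
    reader = csv.reader(io.StringIO(text), delimiter=metadata["delimiter"])
    rows = list(reader)[metadata["header_lines"]:]  # skip header lines
    records = []
    for row in rows:
        record = {}
        for attr, raw in zip(metadata["attributes"], row):
            if raw in attr["missing"]:
                record[attr["name"]] = None  # declared missing-value code
            else:
                record[attr["name"]] = CASTS[attr["type"]](raw)
        records.append(record)
    return records


RAW = "site,count,area_m2\nA,12,4.0\nB,-999,2.5\nC,7,NA\n"
records = ingest(RAW, METADATA)
```

Changing the file layout then means editing the metadata, not the reader.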

Page 10: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Metadata revision

Metadata is revised following any transformation

Versioning of metadata and data is very important

This process results in a lineage of the data file as it has been transformed
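One way to picture this lineage is as a chain of metadata revisions, each pointing back to the one it was derived from. The sketch below is illustrative only: `MetadataVersion`, `revise` and `lineage` are hypothetical names, and the slides do not say how SEEK actually stores versions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetadataVersion:
    """One revision of a metadata record (hypothetical structure)."""
    revision: int
    description: str
    derived_from: Optional["MetadataVersion"] = None


def revise(previous, description):
    """Create the next revision after a transformation of the data."""
    return MetadataVersion(previous.revision + 1, description, previous)


def lineage(version):
    """Walk back to the original and return the steps oldest first."""
    steps = []
    while version is not None:
        steps.append((version.revision, version.description))
        version = version.derived_from
    return list(reversed(steps))


original = MetadataVersion(1, "raw field counts")
cleaned = revise(original, "missing-value codes replaced with nulls")
converted = revise(cleaned, "area converted from hectares to square meters")
history = lineage(converted)
```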

Page 11: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Typical ecological workflow example

Workflows can automate the integration process if data is described with adequate structured metadata

Page 12: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Homogeneous data integration

Integration of homogeneous or mostly homogeneous data via EML metadata is relatively straightforward

Page 13: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Heterogeneous Data integration

Integration of heterogeneous data requires much more advanced metadata and processing

Attributes must be semantically typed

Collection protocols must be known

Units and measurement scale must be known

Measurement mechanics must be known (i.e. that Density = Count/Area)
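As a small worked example of why units and measurement mechanics matter, the sketch below derives density from count and area for two sites that report area in different units. The unit table `AREA_TO_M2` and the function `density_per_m2` are assumptions made for illustration, not part of SEEK.

```python
# Hypothetical unit table: factor converting each area unit to square meters.
AREA_TO_M2 = {"m2": 1.0, "ha": 10_000.0, "km2": 1_000_000.0}


def density_per_m2(count, area, area_unit):
    """Density = Count / Area, with area normalized to square meters."""
    if area <= 0:
        raise ValueError("area must be positive")
    return count / (area * AREA_TO_M2[area_unit])


# Two sites with the same density, recorded with different area units.
site_a = density_per_m2(50, 100.0, "m2")
site_b = density_per_m2(5000, 1.0, "ha")
```

Without the unit metadata, the raw numbers (50 per 100 versus 5000 per 1) would look incomparable even though both sites have the same density.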

Page 14: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Semantic typing and ontologies

Label data with semantic types

Label inputs and outputs of analytical components with semantic types

Use Semantic Mediation System (SMS) to generate transformation steps (beware analytical constraints)

Use SMS to discover relevant components

Ontology – specification of a conceptualization (a knowledge map)

[Figure labels: Data, Ontology, Workflow Components]
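The discovery idea can be sketched with a toy ontology: data and component inputs carry semantic types, and a component is relevant when each of its inputs can be filled by some available data type or one of that type's ancestors. Everything here (`IS_A`, `COMPONENTS`, `discover`) is hypothetical and far simpler than a real Semantic Mediation System.

```python
# Toy "is a" hierarchy standing in for an ontology (hypothetical concepts).
IS_A = {
    "TreeCount": "Count",
    "Count": "Measurement",
    "PlotArea": "Area",
    "Area": "Measurement",
    "Density": "Measurement",
}


def is_subtype(concept, ancestor):
    """True if concept equals ancestor or descends from it in IS_A."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


# Analytical components labeled with semantic input and output types.
COMPONENTS = {
    "compute_density": {"inputs": ["Count", "Area"], "output": "Density"},
    "summarize": {"inputs": ["Measurement"], "output": "Measurement"},
    "rescale_area": {"inputs": ["Area"], "output": "Area"},
}


def discover(data_types):
    """Names of components whose every input matches some data type."""
    matches = []
    for name, spec in COMPONENTS.items():
        if all(any(is_subtype(d, needed) for d in data_types)
               for needed in spec["inputs"]):
            matches.append(name)
    return sorted(matches)


found = discover(["TreeCount", "PlotArea"])
only_counts = discover(["TreeCount"])
```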

Page 15: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Measurement Ontology

Density is part of a larger measurement ontology

SEEK’s intent is to create one or more community created ecological ontologies

Creates a controlled vocabulary for ecological metadata

More about this in Bertram’s talk

Page 16: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

About Kepler

Kepler is the name of the SEEK/SDM additions to the Ptolemy modeling system

Ptolemy was designed by the UC Berkeley EECS department

Primary use is modeling EE circuits

Free, open source, pure Java

Flexible design

GUI for building workflows

Page 17: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Kepler

A Kepler model consists of linked “actors” (which correspond to workflow steps)

Timing is controlled by a “director”

All actors are written in Java but can call other applications (such as SAS and MATLAB or native language code via JNI)

Actors can call arbitrary Web (or Grid) Services

Ptolemy already has a very large inventory of actors

Easy to use, drag ‘n drop interface
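The actor and director roles can be illustrated with a toy model. This is not the Ptolemy or Kepler API (those are Java); `Actor` and `SequentialDirector` are hypothetical classes, and the director shown is just one simple scheduling policy: fire each actor once, after everything upstream of it has fired.

```python
class Actor:
    """A workflow step: a function plus links to its upstream actors."""

    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func
        self.inputs = list(inputs)  # upstream actors feeding this one

    def fire(self, values):
        return self.func(*values)


class SequentialDirector:
    """Decides when actors fire: each once, after all of its inputs."""

    def run(self, actors):
        results = {}
        pending = list(actors)
        while pending:
            ready = [a for a in pending
                     if all(u.name in results for u in a.inputs)]
            if not ready:
                raise ValueError("cycle or missing upstream actor")
            for actor in ready:
                args = [results[u.name] for u in actor.inputs]
                results[actor.name] = actor.fire(args)
                pending.remove(actor)
        return results


counts = Actor("counts", lambda: [12, 7, 30])
area = Actor("area", lambda: 4.0)
density = Actor("density", lambda c, a: [x / a for x in c],
                inputs=[counts, area])

# Actors are listed out of order on purpose; the director sorts out timing.
results = SequentialDirector().run([density, counts, area])
```

Separating the two roles means the same linked actors could be run under a different timing policy by swapping the director.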

Page 18: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

SEEK Contributions to Kepler (so far)

EML data ingestion actor

Actor design tool

Page 19: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

EML data ingestion actor

Ingests any data format described by EML metadata

Converts raw data to Kepler format

Data can then be operated on with other actors

Produces one output port for each attribute in the dataset

Individual attributes can then be mapped to other actors
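The "one output port per attribute" idea amounts to turning row-oriented records into one stream per column. A rough sketch, assuming attribute names come from the metadata; `to_output_ports` is a hypothetical helper, not the actual actor code.

```python
def to_output_ports(attribute_names, rows):
    """Split rows into one list ("port") per attribute, keyed by name."""
    ports = {name: [] for name in attribute_names}
    for row in rows:
        if len(row) != len(attribute_names):
            raise ValueError("row does not match the declared attributes")
        for name, value in zip(attribute_names, row):
            ports[name].append(value)
    return ports


ports = to_output_ports(
    ["site", "count", "area_m2"],
    [("A", 12, 4.0), ("B", 7, 2.5)],
)

# A downstream step can be wired to a single attribute's port.
total_count = sum(ports["count"])
```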

Page 20: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Ptolemy model with EML ingestion actor

Page 21: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

SEEK Contributions to Kepler (so far)

EML data ingestion actor

Actor design tool

Page 22: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Actor design tool

Allows “place-holder” actors to be defined on the fly by non-programmers during workflow creation

Domain scientists can thereby create workflows without programming knowledge

Workflows created with these actors can be executed once their functionality is implemented by a programmer

Allows quick prototyping of workflows by domain scientists

“Place-holder” actors can still be linked to other working actors
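A place-holder actor can be thought of as a named step with declared inputs and outputs but no body yet. The sketch below is an assumption about how such a step might behave, not the design tool's implementation: `PlaceholderActor` and `shannon_index` are hypothetical, and firing the actor fails until a programmer supplies the function.

```python
import math


class PlaceholderActor:
    """A workflow step defined by its interface before it has any code."""

    def __init__(self, name, inputs, outputs, description):
        self.name = name
        self.inputs = list(inputs)
        self.outputs = list(outputs)
        self.description = description
        self._implementation = None

    @property
    def is_implemented(self):
        return self._implementation is not None

    def implement(self, func):
        """Attach the real functionality, written later by a programmer."""
        self._implementation = func

    def fire(self, **inputs):
        if self._implementation is None:
            raise NotImplementedError(f"{self.name} is still a place-holder")
        missing = set(self.inputs) - set(inputs)
        if missing:
            raise ValueError(f"missing inputs: {sorted(missing)}")
        return self._implementation(**inputs)


# Defined by a domain scientist while sketching the workflow.
shannon = PlaceholderActor(
    "shannon_index",
    inputs=["counts"],
    outputs=["index"],
    description="Shannon diversity index of species counts",
)
before = shannon.is_implemented


# Supplied later by a programmer.
def shannon_index(counts):
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


shannon.implement(shannon_index)
value = shannon.fire(counts=[10, 10])
```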

Page 23: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Ptolemy and dynamically created actor

Page 24: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

How domain scientists will benefit

More fully automated integration systems

A library of pre-defined analytical processes which can be executed on heterogeneous data

Semantic data discovery and processing

Automated unit and measurement scale conversions

A fuller understanding of cross site research implications

Page 25: Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

Acknowledgements

This material is based upon work supported by:

The National Science Foundation under Grant Numbers 9980154, 9904777, and 0225676 to NCEAS and its collaborators.

The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.

Primary Collaborators: University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)

More info: http://seek.ecoinformatics.org

Questions? IRC: irc.ecoinformatics.org #seek