Metadata, Provenance, and Search in e-Science
Beth Plale, Director, Center for Data and Search Informatics
School of Informatics, Indiana University
Sept 17, 2007
Credits:
PhD students Yogesh Simmhan, Nithya Vijayakumar, and Scott Jensen.
Dennis Gannon, IU, key collaborator on discovery cyberinfrastructure
Nature of Computational Science Discovery
• Extract data from heterogeneous databases,
• Execute task sequences ("workflows") on your behalf,
• Mine data from sensors and instruments and respond,
• Try out new algorithms,
• Explore data through visualization, and
• Go back and repeat steps again: with new data, answering new questions, or with new algorithms.
• How is this discovery process supported today?
  • Through cyberinfrastructure that supports
  • On-demand knowledge discovery
  • Automated experiment management (data and workflow)
  • Data protection, and automated data product provenance tracking.
CyberInfrastructure: framework for discovery
• Plug-and-play data sources and analysis tools; complex what-if scenarios. Through:
  • A user portal
  • A personal metadata catalog of data exploration results
  • A data product index/catalog
  • A data provenance service
  • A workflow engine and composition tools
• Tied together with an Internet-scale event bus.
• Results publishable to a digital library.
Cyberinfrastructure for computing: DSI DataCenter
Supports analysis, use, visualization and search research. Supports multiple datasets.
Distributed services provide functional capability
Vision for Data Handling
• Capturing metadata about data sets as they are generated is key
  • Syntactic: file size, date of creation
  • Semantic or domain specific: spatial region, logical time
• Context of a file is a key search parameter
• Provenance, or the history of a data product, is needed to assess quality
• The volume of data used in computational science is too large for users to manage themselves: manage it on their behalf
• Indexes help efficiency
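To make the two metadata layers concrete, here is a minimal sketch of a data-product record carrying both syntactic and semantic attributes. The class and field names are illustrative inventions, not the actual LEAD Metadata Schema.

```python
from dataclasses import dataclass

@dataclass
class SyntacticMetadata:
    file_size_bytes: int   # syntactic: file size
    created: str           # syntactic: ISO-8601 date of creation

@dataclass
class SemanticMetadata:
    spatial_region: tuple  # semantic: (lat_min, lon_min, lat_max, lon_max)
    logical_time: str      # semantic: e.g., forecast valid time

@dataclass
class DataProductRecord:
    uri: str
    syntactic: SyntacticMetadata
    semantic: SemanticMetadata

# Hypothetical record captured as a product is generated
record = DataProductRecord(
    uri="lead:product/wrf-output-0001",
    syntactic=SyntacticMetadata(file_size_bytes=52_428_800,
                                created="2007-03-27T13:00:00Z"),
    semantic=SemanticMetadata(spatial_region=(33.0, -98.0, 37.0, -94.0),
                              logical_time="2007-03-27T18:00:00Z"),
)
```

Capturing both layers at generation time is what makes the context of the file searchable later.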
The Realization in Software
(Diagram: the user's browser connects to the portal server, which fronts the distributed services: a data catalog service, the myLEAD user metadata catalog, the myLEAD agent service, a data management service, an application factory, and a workflow engine executing the workflow graph, with application services, a compute engine, and data storage behind them. A provenance collection service listens on the event notification bus that ties all of the services together.)
The infrastructure is portal based; that is, all services are available through a web server.
e-Science Gateway Architecture
• Gateway services: grid portal server, proxy certificate server (vault), events & messaging, resource broker, community & user metadata catalog, workflow engine, resource registry, application deployment, and the user's grid desktop.
• Core grid services: execution management, information services, self-management, data services, resource management, and security services.
• Resource virtualization (OGSA) over compute resources, data resources, and instruments & sensors.
[1] Service Oriented Architectures for Science Gateways on Grid Systems, Gannon, D., et al., ICSOC, 2005
LEAD-CI Cyberinfrastructure
• Workflows run on the LEADgrid and on TeraGrid.
• Portal and persistent back-end web services run on LEADgrid.
• Data storage resources for storing user-generated data products are provided by Indiana University.
Typical weather forecast run as a workflow (diagram):
• Pre-Processing: arpssfc, arpstrn, 88d2arps, mci2arps, nids2arps, Ext2arps-ibc, Ext2arps-lbc
• Assimilation: ADAS assimilation
• Forecast: arps2wrf, WRF, wrf2arps
• Visualization: arpsplot, IDV viz
• Inputs: terrain data files, surface data files, ETA/RUC/GFS data, radar data (Level II and Level III), satellite data, and surface, upper-air mesonet & wind profiler data
~400 data products are consumed and produced (transformed) during the workflow lifecycle.
To set up a workflow experiment, we select a workflow (not shown), then set model parameters here.
Supported community data collections
Data Integration
• Data sources, each with a crosswalk point of presence:
  • Oklahoma: CASA radar collection, months of data (ftp)
  • Indiana: latest 3 days of the Unidata IDD distribution (XML web server)
  • Colorado: Level II and III radar, latest 3 days; ETA, NCEP, NAM, METAR, etc. (XML web servers)
• Local view: a crosswalk point of presence supports crawling and publishes a difference list as LEAD Metadata Schema (LMS) documents.
• Globally integrated view: the Data Catalog Service. Its crawler crawls the catalogs and builds an index of the results (XMLDB native XML database, with Lucene for the index); a web service API accepts Boolean search queries with spatial/temporal support and returns a list of results as LEAD Metadata Schema documents.
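The globally integrated search above can be sketched in miniature: a Boolean keyword match combined with a spatial-overlap test and a time window. The in-memory list stands in for the XMLDB/Lucene index, and the document shapes and field names are illustrative, not the LMS.

```python
# Toy catalog entries standing in for crawled metadata documents
docs = [
    {"id": "casa-001", "keywords": {"radar", "casa"},
     "bbox": (34.0, -98.0, 36.0, -96.0), "time": "2007-03-27T13:05:00Z"},
    {"id": "idd-042", "keywords": {"metar", "surface"},
     "bbox": (39.0, -87.0, 41.0, -85.0), "time": "2007-03-27T14:10:00Z"},
]

def overlaps(a, b):
    """True if two (lat_min, lon_min, lat_max, lon_max) boxes intersect."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def search(must_have, bbox, t_start, t_end):
    """AND of keyword terms, plus spatial overlap and a time window."""
    return [d["id"] for d in docs
            if must_have <= d["keywords"]           # Boolean AND of terms
            and overlaps(d["bbox"], bbox)           # spatial constraint
            and t_start <= d["time"] <= t_end]      # temporal constraint

hits = search({"radar"}, (33.0, -99.0, 37.0, -95.0),
              "2007-03-27T13:00:00Z", "2007-03-27T14:00:00Z")
# hits == ["casa-001"]
```

ISO-8601 timestamps are used so the temporal constraint works as plain string comparison.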
LEAD Personal Workspace
• CyberInfrastructure extends the user's desktop to incorporate a vast data analysis space.
• As users go about doing scientific experiments, the CI manages back-end storage and compute resources.
• The portal provides ways to explore, search, and discover this data.
• Metadata about experiments is largely automatically generated, and highly searchable.
• Metadata describes the data object (the file) in application-rich terms, and provides a URI to a data service that can resolve an abstract unique identifier to a real, on-line data "file".
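The resolution step in the last bullet can be sketched as a simple lookup: the catalog holds an abstract identifier, and a data service maps it to a concrete on-line location. The mapping table, identifier, and URL below are invented for illustration.

```python
# Hypothetical replica table maintained by a data management service
replica_table = {
    "lead:product/wrf-output-0001":
        "gsiftp://datastore.example.edu/lead/wrf-output-0001.nc",
}

def resolve(abstract_id):
    """Map an abstract data-product ID to a real, on-line location."""
    return replica_table.get(abstract_id)

url = resolve("lead:product/wrf-output-0001")
```

Keeping the abstract ID in the metadata, rather than a hard path, lets the storage layer move files without invalidating catalog entries.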
Searching for experiments using model configuration parameters: 2 attributes selected
Searching for experiments based on model parameters: 4 returned experiments; one displayed
How forecast model configuration parameters are stored in the personal catalog: the forecast model configuration file is handed off to a plugin that shreds the XML document into queryable attributes associated with the experiment.
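A minimal sketch of that shredding step: flatten the leaf elements of a configuration document into attribute/value pairs that can be stored with the experiment. The element names below are invented for illustration, not the actual model configuration schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical model configuration document
config_xml = """
<modelConfig>
  <forecastHours>6</forecastHours>
  <gridSpacingKm>2</gridSpacingKm>
  <microphysicsScheme>lin</microphysicsScheme>
</modelConfig>
"""

def shred(xml_text):
    """Flatten leaf elements into a dict of queryable attributes."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text.strip() for child in root}

attributes = shred(config_xml)
# Stored with the experiment record, these make searches such as
# "forecastHours = 6" possible against the personal catalog.
```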
What & Why of Provenance
• Derivation history of a data product
  • What application created the data (and when, and where)
  • Its parameters & configuration
  • Other input data used by the application
• A workflow is composed from building blocks like these, so provenance for the data used in a workflow gives a workflow trace.

(Diagram: Application A reads Data.In.1 and Data.In.2 under Config.A and writes Data.Out.1.)

Data Provenance :: Data.Out.1
  Process: Application_A
  Timestamp: 2006-06-23T12:45:23
  Host: tyr20.cs.indiana.edu
  Input: Data.In.1, Data.In.2
  Config: Config.A
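The record above transcribes directly into a data structure; only the class and field names below are mine, the values are from the slide.

```python
from dataclasses import dataclass

@dataclass
class DataProvenance:
    product: str      # the derived data product
    process: str      # what application created it
    timestamp: str    # when
    host: str         # where
    inputs: list      # other input data used by the application
    config: str       # its parameters & configuration

prov = DataProvenance(
    product="Data.Out.1",
    process="Application_A",
    timestamp="2006-06-23T12:45:23",
    host="tyr20.cs.indiana.edu",
    inputs=["Data.In.1", "Data.In.2"],
    config="Config.A",
)
```

Chaining such records, output of one application appearing as input to the next, is what yields the workflow trace.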
The What & Why of Provenance
• Trace workflow execution
  • What services were used during workflow execution?
  • Were all steps of the execution successful?
• Audit trail
  • What resources were used during workflow execution?
• Data quality & reuse
  • What applications were used to derive data products?
  • Which workflows use a certain data product?
• Attribution
  • Who performed the experiment?
  • Who owns the workflow & data products?
• Discovery
  • Locate data generated by a workflow
  • Locate workflows containing App-X that succeeded
Karma Provenance Service
(Diagram: a workflow instance of ten services, each consuming and producing 10 data products, together with the workflow engine, publishes provenance activities as notifications on the message bus (WS-Messenger notification broker, WS-Eventing service API). The collection framework's provenance listener subscribes to the activity notifications and stores them in an activity DB. A provenance query API serves a provenance browser client, which queries for workflow, process, and data provenance.)
• Workflow-Started & -Finished activities
• Application-Started & -Finished, Data-Produced & -Consumed activities

A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Simmhan, Y., et al., ICWS Conference, 2006
Generating Karma Provenance Activities
• Instrument applications to publish provenance
• A simple Java library is available to
  • Create provenance activities
  • Publish activities as messages
• Jython "wrapper" scripts use the library to publish provenance & invoke the application
• A generic factory toolkit easily converts applications to web services, with built-in provenance instrumentation
Sample Sequence of Activities
appStarted(App1)
info('App1 starting')
fileReceiveStarted(File1)
  -- do gridftp get to stage input file File1 --
fileReceiveFinished(File1)
fileConsumed(File1)
computationStarted(Code1)
  -- call Fortran code Code1 to process input files --
computationFinished(Code1)
fileProduced(File2)
fileSendStarted(File2)
  -- do gridftp put to save output file File2 --
fileSendFinished(File2)
publishURL(File2)
appFinishedSuccess(App1, File2) | appFinishedFailed(App1, ERR)
flush()
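A wrapper script emitting this sequence might look like the sketch below. The activity names follow the slide; `publish` merely collects them into a list, standing in for the library call that would send each one as a notification, and the gridftp and Fortran steps are stubbed as comments.

```python
activities = []

def publish(activity, *args):
    """Stand-in for the provenance library's publish call."""
    activities.append((activity, args))

def run_wrapped_app(app, infile, outfile):
    publish("appStarted", app)
    publish("fileReceiveStarted", infile)   # gridftp get would run here
    publish("fileReceiveFinished", infile)
    publish("fileConsumed", infile)
    publish("computationStarted", "Code1")  # Fortran code would run here
    publish("computationFinished", "Code1")
    publish("fileProduced", outfile)
    publish("fileSendStarted", outfile)     # gridftp put would run here
    publish("fileSendFinished", outfile)
    publish("publishURL", outfile)
    publish("appFinishedSuccess", app, outfile)

run_wrapped_app("App1", "File1", "File2")
```

Because the wrapper brackets every stage (receive, compute, send), a failure at any point can be reported as appFinishedFailed with the error in hand.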
Performance perturbation
(Chart: cumulative execution time in seconds, with and without provenance, for each workflow application script in the execution sequence: Start, Terrain PreProc, Surface PreProc, 3D Interp, ARPS2WRF, WRF, WRF2ARPS, ARPS Plot, PS2Image. Cumulative time reaches roughly 2800 seconds, while the provenance overhead for each script stays within a few seconds, so the perturbation is negligible.)
Standalone tool for provenance collection and experience reuse: future direction
The forecast start time can also be set to occur on severe weather conditions (not shown here).
Weather triggered workflows
• The goal is cyberinfrastructure that allows scientists and students to run weather models dynamically and adaptively in response to weather events.
• This is accomplished by coupling event processing with triggered forecast workflows.
• Vijayakumar et al. (2006) presented a framework for this purpose:
  • The event-processing system does temporal and spatial filtering.
  • A storm detection algorithm (SDA) detects storm events in the remaining streams.
  • The SDA returns detected storm events.
  • The event-processing system generates a trigger to the workflow engine.
Continuous stream mining
• In stream mining of weather data, the events of interest are anomalies.
• Event-processing queries can be deployed to sites in the LEAD grid (rectangles in the figure).
• Data streams are delivered to each site through the Unidata Internet Data Dissemination system.
• CEP enables real-time response to the weather.
(Figure legend: query, computation node, data generation source.)
Example CEP query
• Scientists can set up a 6-hour weather forecast over a region, say a 700 sq. mile bounding box, and submit a workflow that will run sometime in the future.
• A CEP query detects severe storm conditions developing in the region.
• The forecast workflow is started at a future point in time, as determined by the CEP query.
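The trigger condition such a CEP query evaluates per storm event can be sketched as a predicate: fire when a detected storm is severe enough and falls inside the user's bounding box. The severity field and threshold are assumptions for illustration; the real queries are SQL-based.

```python
# Hypothetical region of interest: (lat_min, lon_min, lat_max, lon_max)
REGION = (34.0, -98.0, 35.0, -97.0)

def in_region(lat, lon, box=REGION):
    """Spatial filter: is the storm inside the bounding box?"""
    return box[0] <= lat <= box[2] and box[1] <= lon <= box[3]

def should_trigger(event, severity_threshold=50.0):
    """True when a detected storm is severe and inside the region."""
    return (event["severity"] >= severity_threshold
            and in_region(event["lat"], event["lon"]))

triggered = should_trigger({"lat": 34.5, "lon": -97.5, "severity": 61.0})
# triggered is True: the workflow engine would receive a trigger
```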
Stream Provenance Tracking
• Data stream provenance: the derivation history of a data product, where the data product is a derived, time-bounded stream.
• Stream provenance can establish correlations between significant events (e.g., storm occurrences).
• Anticipate resource needs by examining provenance data and discovering trends in weather forecast model output.
• Determine when the next wave of users will arrive, and where their resources might need to be allocated.
Stream processing as part of cyberinfrastructure
• SQL-based queries respond to input streams event-by-event within a stream, and concurrently across streams.
• Each query generates a time-bounded output stream.
(Diagram: the architecture from "The Realization in Software", extended with the Calder stream mining service, which hosts the mining queries and receives NEXRAD streams from Doppler radars.)
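The two bullets above can be sketched as a generator: each incoming event is evaluated immediately, and matches are emitted in time-bounded batches. Windowing here is by event count for simplicity, and the `dbz` field is an invented stand-in for a radar observation; Calder's real queries are SQL compiled for its engine.

```python
def continuous_query(stream, window_size, predicate):
    """Evaluate `predicate` per event; yield one batch per window."""
    out, seen = [], 0
    for event in stream:
        seen += 1
        if predicate(event):        # evaluated event-by-event, not per batch
            out.append(event)
        if seen == window_size:     # close the time-bounded output window
            yield out
            out, seen = [], 0

# Toy radar stream: reflectivity values, two events per window
stream = [{"dbz": 20}, {"dbz": 55}, {"dbz": 10}, {"dbz": 60}]
batches = list(continuous_query(stream, 2, lambda e: e["dbz"] >= 50))
# batches == [[{"dbz": 55}], [{"dbz": 60}]]
```

A generator keeps per-event latency constant, which is what lets the same query run concurrently over many streams.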
Provenance Service in Calder
(Diagram: a user query is obtained as a continuous query, compiled from SQL to a TCL query, and distributed by the planner service to the execution engines of the computational mesh; a ring buffer is set up at a rowset service that aggregates the derived result streams. The provenance service receives updates on stream rates, approximations, etc., and on query start/stop and distribution-plan changes; it processes the updates, stores them in a DB, and returns results if any. Calder-internal messaging carries the process flow and invocations; provenance updates travel as WS-Messenger notifications.)
Provenance Update Handling Scalability
• Update processing time: the time from the instant a user sends a notification to the instant the provenance service completes the corresponding update.
• Experiment:
  • Bombard the provenance service at different update rates by simulating many clients sending provenance updates simultaneously.
  • Measure the incoming rate at the provenance service and the overall time taken to handle each update.
  • Overhead includes the time to create a message, send and receive it through WS-Messenger, process it, and store it in the DB.
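A toy version of that measurement harness, with a local function standing in for the WS-Messenger delivery and DB store, might look like this. Everything here is a stand-in; the point is only the shape of the experiment: send n updates, time each one, aggregate.

```python
import time

store = []

def handle_update(update):
    """Stand-in for message parse + DB insert at the provenance service."""
    store.append(update)

def measure(n_updates):
    """Send simulated updates and record per-update handling time."""
    latencies = []
    for i in range(n_updates):
        t0 = time.perf_counter()
        handle_update({"seq": i})
        latencies.append(time.perf_counter() - t0)
    return latencies

lat = measure(100)
mean_latency = sum(lat) / len(lat)
```

In the real experiment the clock spans the full notification path, so `mean_latency` would include WS-Messenger transport, not just the local store.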
• Problem:
  • Severe weather can bring many storms over a local region of interest.
  • It is infeasible and unnecessary to run the weather model in response to each of them.
• Solution:
  • Group storm events into spatial clusters.
  • Trigger model runs in response to clusters of storms.
Spatial Clustering: DBSCAN algorithm*
• DBSCAN is a density-based clustering algorithm; it can do spatial clustering when location parameters are treated as features.
• The DBSCAN algorithm has two parameters:
  • ε: the radius within which a point is considered to be a neighbor of another point
  • minPts: the minimum number of neighboring points that a point has to have to be considered a core point
• The two parameters determine the clustering result.
* Mining work done by Xiang Li, University of Alabama in Huntsville
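To make the two parameters concrete, here is a minimal DBSCAN over (lat, lon) points; production code would use a library implementation such as scikit-learn's `DBSCAN`. Distances are plain Euclidean and the sample points are invented, so treat this as a sketch of the algorithm, not the study's code.

```python
import math

def dbscan(points, eps, min_pts):
    """Return a cluster label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        # All points within eps of point i (includes i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # noise (may be reclaimed as border)
            continue
        cluster += 1                  # i is a core point: new cluster
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:    # j is itself core: keep expanding
                queue.extend(jn)
    return labels

# Three nearby storms and one distant one
storms = [(34.0, -97.0), (34.1, -97.1), (34.05, -97.05), (39.0, -86.0)]
labels = dbscan(storms, eps=0.3, min_pts=2)
# labels == [0, 0, 0, -1]: one cluster plus one noise point
```

Raising ε or lowering minPts merges sparser storms into clusters, which is exactly how the two parameters determine the result.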
Data
• WSR-88D radar data on 3/27/2007
• A total of 134 radar sites covering CONUS
• The time period examined is between 1:00 pm and 6:00 pm EST.
• The 5-hour period is divided into 20 time intervals of 15 minutes each; storm events within the same time interval are clustered. (Shown: storm events detected 1:00 pm – 1:15 pm.)
Algorithm comparison: DBSCAN and K-means
Number of clusters: 3
Time period: 1:00 pm – 1:15 pm
K-means result
DBSCAN result
Conclusion: the DBSCAN algorithm performs better than the k-means algorithm
Future Work
• Publication of provenance to a digital library
• Generalized support for metadata systems
• Enhanced support for mining triggers
• Personal weather predictor
  • The LEAD framework packaged onto a single 8-16 core multicore machine
  • Expands educational opportunities: suitable for small schools
  • Engages communities beyond meteorologists
Thank you for the interest.
Thanks to my many domain science and CS collaborators, to my students, and to the funding agencies.
Please feel free to contact me at [email protected]