Metadata, Provenance, and Search in e-Science
Beth Plale, Director, Center for Data and Search Informatics
School of Informatics, Indiana University
Sept 17, 2007
Credits:
PhD students Yogesh Simmhan, Nithya Vijayakumar, and Scott Jensen.
Dennis Gannon, IU, key collaborator on discovery cyberinfrastructure
Nature of Computational Science Discovery
• Extract data from heterogeneous databases,
• Execute task sequences ("workflows") on your behalf,
• Mine data from sensors and instruments and respond,
• Try out new algorithms,
• Explore data through visualization, and
• Go back and repeat steps again: with new data, answering new questions, or with new algorithms.
• How is this discovery process supported today?
  • Through cyberinfrastructure that supports
  • On-demand knowledge discovery
  • Automated experiment management (data and workflow)
  • Data protection, and automated data product provenance tracking.
CyberInfrastructure: framework for discovery
• Plug-and-play data sources and analysis tools; complex what-if scenarios. Through:
  • A user portal
  • A personal metadata catalog of data exploration results
  • A data product index/catalog
  • A data provenance service
  • A workflow engine and composition tools
• Tied together with an Internet-scale event bus.
• Results publishable to a digital library.
Cyberinfrastructure for computing: DSI DataCenter
Supports analysis, use, visualization and search research. Supports multiple datasets.
Distributed services provide functional capability
Vision for Data Handling
• Capturing metadata about data sets as they are generated is key
  • Syntactic: file size, date of creation
  • Semantic or domain specific: spatial region, logical time
• Context of a file is a key search parameter
• Provenance, or the history of a data product, is needed to assess quality
• The volume of data used in computational science is too large for users to manage themselves: manage it on their behalf
• Indexes help efficiency
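To make the two metadata layers concrete, here is a minimal sketch of a data-product record carrying both syntactic and semantic attributes. The class and field names are illustrative inventions, not the actual LEAD Metadata Schema.

```python
from dataclasses import dataclass

@dataclass
class SyntacticMetadata:
    file_size_bytes: int   # syntactic: file size
    created: str           # syntactic: ISO-8601 date of creation

@dataclass
class SemanticMetadata:
    spatial_region: tuple  # semantic: (lat_min, lon_min, lat_max, lon_max)
    logical_time: str      # semantic: e.g., forecast valid time

@dataclass
class DataProductRecord:
    uri: str
    syntactic: SyntacticMetadata
    semantic: SemanticMetadata

# Hypothetical record captured as a product is generated
record = DataProductRecord(
    uri="lead:product/wrf-output-0001",
    syntactic=SyntacticMetadata(file_size_bytes=52_428_800,
                                created="2007-03-27T13:00:00Z"),
    semantic=SemanticMetadata(spatial_region=(33.0, -98.0, 37.0, -94.0),
                              logical_time="2007-03-27T18:00:00Z"),
)
```

Capturing both layers at generation time is what makes the context of the file searchable later.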
The Realization in Software
(Diagram: the user's browser connects to the portal server, which fronts the distributed services: a data catalog service, the myLEAD user metadata catalog, the myLEAD agent service, a data management service, an application factory, and a workflow engine executing the workflow graph, with application services, a compute engine, and data storage behind them. A provenance collection service listens on the event notification bus that ties all of the services together.)
The infrastructure is portal based; that is, all services are available through a web server.
e-Science Gateway Architecture
• Gateway services: grid portal server, proxy certificate server (vault), events & messaging, resource broker, community & user metadata catalog, workflow engine, resource registry, application deployment, and the user's grid desktop.
• Core grid services: execution management, information services, self-management, data services, resource management, and security services.
• Resource virtualization (OGSA) over compute resources, data resources, and instruments & sensors.
[1] Service Oriented Architectures for Science Gateways on Grid Systems, Gannon, D., et al., ICSOC, 2005
LEAD-CI Cyberinfrastructure
• Workflows run on the LEADgrid and on TeraGrid.
• Portal and persistent back-end web services run on LEADgrid.
• Data storage resources for storing user-generated data products are provided by Indiana University.
Typical weather forecast run as a workflow (diagram):
• Pre-Processing: arpssfc, arpstrn, 88d2arps, mci2arps, nids2arps, Ext2arps-ibc, Ext2arps-lbc
• Assimilation: ADAS assimilation
• Forecast: arps2wrf, WRF, wrf2arps
• Visualization: arpsplot, IDV viz
• Inputs: terrain data files, surface data files, ETA/RUC/GFS data, radar data (Level II and Level III), satellite data, and surface, upper-air mesonet & wind profiler data
~400 data products are consumed and produced (transformed) during the workflow lifecycle.
To set up a workflow experiment, we select a workflow (not shown), then set model parameters here.
Supported community data collections
Data Integration
• Data sources, each with a crosswalk point of presence:
  • Oklahoma: CASA radar collection, months of data (ftp)
  • Indiana: latest 3 days of the Unidata IDD distribution (XML web server)
  • Colorado: Level II and III radar, latest 3 days; ETA, NCEP, NAM, METAR, etc. (XML web servers)
• Local view: a crosswalk point of presence supports crawling and publishes a difference list as LEAD Metadata Schema (LMS) documents.
• Globally integrated view: the Data Catalog Service. Its crawler crawls the catalogs and builds an index of the results (XMLDB native XML database, with Lucene for the index); a web service API accepts Boolean search queries with spatial/temporal support and returns a list of results as LEAD Metadata Schema documents.
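The globally integrated search above can be sketched in miniature: a Boolean keyword match combined with a spatial-overlap test and a time window. The in-memory list stands in for the XMLDB/Lucene index, and the document shapes and field names are illustrative, not the LMS.

```python
# Toy catalog entries standing in for crawled metadata documents
docs = [
    {"id": "casa-001", "keywords": {"radar", "casa"},
     "bbox": (34.0, -98.0, 36.0, -96.0), "time": "2007-03-27T13:05:00Z"},
    {"id": "idd-042", "keywords": {"metar", "surface"},
     "bbox": (39.0, -87.0, 41.0, -85.0), "time": "2007-03-27T14:10:00Z"},
]

def overlaps(a, b):
    """True if two (lat_min, lon_min, lat_max, lon_max) boxes intersect."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def search(must_have, bbox, t_start, t_end):
    """AND of keyword terms, plus spatial overlap and a time window."""
    return [d["id"] for d in docs
            if must_have <= d["keywords"]           # Boolean AND of terms
            and overlaps(d["bbox"], bbox)           # spatial constraint
            and t_start <= d["time"] <= t_end]      # temporal constraint

hits = search({"radar"}, (33.0, -99.0, 37.0, -95.0),
              "2007-03-27T13:00:00Z", "2007-03-27T14:00:00Z")
# hits == ["casa-001"]
```

ISO-8601 timestamps are used so the temporal constraint works as plain string comparison.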
LEAD Personal Workspace
• CyberInfrastructure extends the user's desktop to incorporate a vast data analysis space.
• As users go about doing scientific experiments, the CI manages back-end storage and compute resources.
• The portal provides ways to explore, search, and discover this data.
• Metadata about experiments is largely automatically generated, and highly searchable.
• Metadata describes the data object (the file) in application-rich terms, and provides a URI to a data service that can resolve an abstract unique identifier to a real, on-line data "file".
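The resolution step in the last bullet can be sketched as a simple lookup: the catalog holds an abstract identifier, and a data service maps it to a concrete on-line location. The mapping table, identifier, and URL below are invented for illustration.

```python
# Hypothetical replica table maintained by a data management service
replica_table = {
    "lead:product/wrf-output-0001":
        "gsiftp://datastore.example.edu/lead/wrf-output-0001.nc",
}

def resolve(abstract_id):
    """Map an abstract data-product ID to a real, on-line location."""
    return replica_table.get(abstract_id)

url = resolve("lead:product/wrf-output-0001")
```

Keeping the abstract ID in the metadata, rather than a hard path, lets the storage layer move files without invalidating catalog entries.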
Searching for experiments using model configuration parameters: 2 attributes selected
Searching for experiments based on model parameters: 4 returned experiments; one displayed
How forecast model configuration parameters are stored in the personal catalog: the forecast model configuration file is handed off to a plugin that shreds the XML document into queryable attributes associated with the experiment.
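A minimal sketch of that shredding step: flatten the leaf elements of a configuration document into attribute/value pairs that can be stored with the experiment. The element names below are invented for illustration, not the actual model configuration schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical model configuration document
config_xml = """
<modelConfig>
  <forecastHours>6</forecastHours>
  <gridSpacingKm>2</gridSpacingKm>
  <microphysicsScheme>lin</microphysicsScheme>
</modelConfig>
"""

def shred(xml_text):
    """Flatten leaf elements into a dict of queryable attributes."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text.strip() for child in root}

attributes = shred(config_xml)
# Stored with the experiment record, these make searches such as
# "forecastHours = 6" possible against the personal catalog.
```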
What & Why of Provenance
• Derivation history of a data product
  • What application created the data (and when, and where)
  • Its parameters & configuration
  • Other input data used by the application
• A workflow is composed from building blocks like these, so provenance for the data used in a workflow gives a workflow trace.

(Diagram: Application A reads Data.In.1 and Data.In.2 under Config.A and writes Data.Out.1.)

Data Provenance :: Data.Out.1
  Process: Application_A
  Timestamp: 2006-06-23T12:45:23
  Host: tyr20.cs.indiana.edu
  Input: Data.In.1, Data.In.2
  Config: Config.A
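The record above transcribes directly into a data structure; only the class and field names below are mine, the values are from the slide.

```python
from dataclasses import dataclass

@dataclass
class DataProvenance:
    product: str      # the derived data product
    process: str      # what application created it
    timestamp: str    # when
    host: str         # where
    inputs: list      # other input data used by the application
    config: str       # its parameters & configuration

prov = DataProvenance(
    product="Data.Out.1",
    process="Application_A",
    timestamp="2006-06-23T12:45:23",
    host="tyr20.cs.indiana.edu",
    inputs=["Data.In.1", "Data.In.2"],
    config="Config.A",
)
```

Chaining such records, output of one application appearing as input to the next, is what yields the workflow trace.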
The What & Why of Provenance
• Trace workflow execution
  • What services were used during workflow execution?
  • Were all steps of the execution successful?
• Audit trail
  • What resources were used during workflow execution?
• Data quality & reuse
  • What applications were used to derive data products?
  • Which workflows use a certain data product?
• Attribution
  • Who performed the experiment?
  • Who owns the workflow & data products?
• Discovery
  • Locate data generated by a workflow
  • Locate workflows containing App-X that succeeded
Karma Provenance Service
(Diagram: a workflow instance of ten services, each consuming and producing 10 data products, together with the workflow engine, publishes provenance activities as notifications on the message bus (WS-Messenger notification broker, WS-Eventing service API). The collection framework's provenance listener subscribes to the activity notifications and stores them in an activity DB. A provenance query API serves a provenance browser client, which queries for workflow, process, and data provenance.)
• Workflow-Started & -Finished activities
• Application-Started & -Finished, Data-Produced & -Consumed activities

A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Simmhan, Y., et al., ICWS Conference, 2006
Generating Karma Provenance Activities
• Instrument applications to publish provenance
• A simple Java library is available to
  • Create provenance activities
  • Publish activities as messages
• Jython "wrapper" scripts use the library to publish provenance & invoke the application
• A generic factory toolkit easily converts applications to web services, with built-in provenance instrumentation
Sample Sequence of Activities
appStarted(App1)
info('App1 starting')
fileReceiveStarted(File1)
  -- do gridftp get to stage input file File1 --
fileReceiveFinished(File1)
fileConsumed(File1)
computationStarted(Code1)
  -- call Fortran code Code1 to process input files --
computationFinished(Code1)
fileProduced(File2)
fileSendStarted(File2)
  -- do gridftp put to save output file File2 --
fileSendFinished(File2)
publishURL(File2)
appFinishedSuccess(App1, File2) | appFinishedFailed(App1, ERR)
flush()
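A wrapper script emitting this sequence might look like the sketch below. The activity names follow the slide; `publish` merely collects them into a list, standing in for the library call that would send each one as a notification, and the gridftp and Fortran steps are stubbed as comments.

```python
activities = []

def publish(activity, *args):
    """Stand-in for the provenance library's publish call."""
    activities.append((activity, args))

def run_wrapped_app(app, infile, outfile):
    publish("appStarted", app)
    publish("fileReceiveStarted", infile)   # gridftp get would run here
    publish("fileReceiveFinished", infile)
    publish("fileConsumed", infile)
    publish("computationStarted", "Code1")  # Fortran code would run here
    publish("computationFinished", "Code1")
    publish("fileProduced", outfile)
    publish("fileSendStarted", outfile)     # gridftp put would run here
    publish("fileSendFinished", outfile)
    publish("publishURL", outfile)
    publish("appFinishedSuccess", app, outfile)

run_wrapped_app("App1", "File1", "File2")
```

Because the wrapper brackets every stage (receive, compute, send), a failure at any point can be reported as appFinishedFailed with the error in hand.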
Performance perturbation
(Chart: cumulative execution time in seconds, with and without provenance, for each workflow application script in the execution sequence: Start, Terrain PreProc, Surface PreProc, 3D Interp, ARPS2WRF, WRF, WRF2ARPS, ARPS Plot, PS2Image. Cumulative time reaches roughly 2800 seconds, while the provenance overhead for each script stays within a few seconds, so the perturbation is negligible.)
Standalone tool for provenance collection and experience reuse: future direction
The forecast start time can also be set to occur on severe weather conditions (not shown here).
Weather triggered workflows
• The goal is cyberinfrastructure that allows scientists and students to run weather models dynamically and adaptively in response to weather events.
• This is accomplished by coupling event processing with triggered forecast workflows.
• Vijayakumar et al. (2006) presented a framework for this purpose:
  • The event-processing system does temporal and spatial filtering.
  • A storm detection algorithm (SDA) detects storm events in the remaining streams.
  • The SDA returns detected storm events.
  • The event-processing system generates a trigger to the workflow engine.
Continuous stream mining
• In stream mining of weather data, the events of interest are anomalies.
• Event-processing queries can be deployed to sites in the LEAD grid (rectangles in the figure).
• Data streams are delivered to each site through the Unidata Internet Data Dissemination system.
• CEP enables real-time response to the weather.
(Figure legend: query, computation node, data generation source.)
Example CEP query
• Scientists can set up a 6-hour weather forecast over a region, say a 700 sq. mile bounding box, and submit a workflow that will run sometime in the future.
• A CEP query detects severe storm conditions developing in the region.
• The forecast workflow is started at a future point in time, as determined by the CEP query.
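The trigger condition such a CEP query evaluates per storm event can be sketched as a predicate: fire when a detected storm is severe enough and falls inside the user's bounding box. The severity field and threshold are assumptions for illustration; the real queries are SQL-based.

```python
# Hypothetical region of interest: (lat_min, lon_min, lat_max, lon_max)
REGION = (34.0, -98.0, 35.0, -97.0)

def in_region(lat, lon, box=REGION):
    """Spatial filter: is the storm inside the bounding box?"""
    return box[0] <= lat <= box[2] and box[1] <= lon <= box[3]

def should_trigger(event, severity_threshold=50.0):
    """True when a detected storm is severe and inside the region."""
    return (event["severity"] >= severity_threshold
            and in_region(event["lat"], event["lon"]))

triggered = should_trigger({"lat": 34.5, "lon": -97.5, "severity": 61.0})
# triggered is True: the workflow engine would receive a trigger
```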
Stream Provenance Tracking
• Data stream provenance: the derivation history of a data product, where the data product is a derived, time-bounded stream.
• Stream provenance can establish correlations between significant events (e.g., storm occurrences).
• Anticipate resource needs by examining provenance data and discovering trends in weather forecast model output.
• Determine when the next wave of users will arrive, and where their resources might need to be allocated.
Stream processing as part of cyberinfrastructure
• SQL-based queries respond to input streams event-by-event within a stream, and concurrently across streams.
• Each query generates a time-bounded output stream.
(Diagram: the architecture from "The Realization in Software", extended with the Calder stream mining service, which hosts the mining queries and receives NEXRAD streams from Doppler radars.)
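The two bullets above can be sketched as a generator: each incoming event is evaluated immediately, and matches are emitted in time-bounded batches. Windowing here is by event count for simplicity, and the `dbz` field is an invented stand-in for a radar observation; Calder's real queries are SQL compiled for its engine.

```python
def continuous_query(stream, window_size, predicate):
    """Evaluate `predicate` per event; yield one batch per window."""
    out, seen = [], 0
    for event in stream:
        seen += 1
        if predicate(event):        # evaluated event-by-event, not per batch
            out.append(event)
        if seen == window_size:     # close the time-bounded output window
            yield out
            out, seen = [], 0

# Toy radar stream: reflectivity values, two events per window
stream = [{"dbz": 20}, {"dbz": 55}, {"dbz": 10}, {"dbz": 60}]
batches = list(continuous_query(stream, 2, lambda e: e["dbz"] >= 50))
# batches == [[{"dbz": 55}], [{"dbz": 60}]]
```

A generator keeps per-event latency constant, which is what lets the same query run concurrently over many streams.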
Provenance Service in Calder
(Diagram: a user query is obtained as a continuous query, compiled from SQL to a TCL query, and distributed by the planner service to the execution engines of the computational mesh; a ring buffer is set up at a rowset service that aggregates the derived result streams. The provenance service receives updates on stream rates, approximations, etc., and on query start/stop and distribution-plan changes; it processes the updates, stores them in a DB, and returns results if any. Calder-internal messaging carries the process flow and invocations; provenance updates travel as WS-Messenger notifications.)
Provenance Update Handling Scalability
• Update processing time: the time from the instant a user sends a notification to the instant the provenance service completes the corresponding update.
• Experiment:
  • Bombard the provenance service at different update rates by simulating many clients sending provenance updates simultaneously.
  • Measure the incoming rate at the provenance service and the overall time taken to handle each update.
  • Overhead includes the time to create a message, send and receive it through WS-Messenger, process it, and store it in the DB.
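A toy version of that measurement harness, with a local function standing in for the WS-Messenger delivery and DB store, might look like this. Everything here is a stand-in; the point is only the shape of the experiment: send n updates, time each one, aggregate.

```python
import time

store = []

def handle_update(update):
    """Stand-in for message parse + DB insert at the provenance service."""
    store.append(update)

def measure(n_updates):
    """Send simulated updates and record per-update handling time."""
    latencies = []
    for i in range(n_updates):
        t0 = time.perf_counter()
        handle_update({"seq": i})
        latencies.append(time.perf_counter() - t0)
    return latencies

lat = measure(100)
mean_latency = sum(lat) / len(lat)
```

In the real experiment the clock spans the full notification path, so `mean_latency` would include WS-Messenger transport, not just the local store.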
• Problem:
  • Severe weather can bring many storms over a local region of interest.
  • It is infeasible and unnecessary to run the weather model in response to each of them.
• Solution:
  • Group storm events into spatial clusters.
  • Trigger model runs in response to clusters of storms.
Spatial Clustering: DBSCAN algorithm*
• DBSCAN is a density-based clustering algorithm; it can do spatial clustering when location parameters are treated as features.
• The DBSCAN algorithm has two parameters:
  • ε: the radius within which a point is considered to be a neighbor of another point
  • minPts: the minimum number of neighboring points that a point has to have to be considered a core point
• The two parameters determine the clustering result.
* Mining work done by Xiang Li, University of Alabama in Huntsville
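To make the two parameters concrete, here is a minimal DBSCAN over (lat, lon) points; production code would use a library implementation such as scikit-learn's `DBSCAN`. Distances are plain Euclidean and the sample points are invented, so treat this as a sketch of the algorithm, not the study's code.

```python
import math

def dbscan(points, eps, min_pts):
    """Return a cluster label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        # All points within eps of point i (includes i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # noise (may be reclaimed as border)
            continue
        cluster += 1                  # i is a core point: new cluster
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:    # j is itself core: keep expanding
                queue.extend(jn)
    return labels

# Three nearby storms and one distant one
storms = [(34.0, -97.0), (34.1, -97.1), (34.05, -97.05), (39.0, -86.0)]
labels = dbscan(storms, eps=0.3, min_pts=2)
# labels == [0, 0, 0, -1]: one cluster plus one noise point
```

Raising ε or lowering minPts merges sparser storms into clusters, which is exactly how the two parameters determine the result.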
Data
• WSR-88D radar data on 3/27/2007
• A total of 134 radar sites covering CONUS
• The time period examined is between 1:00 pm and 6:00 pm EST.
• The 5-hour period is divided into 20 time intervals of 15 minutes each; storm events within the same time interval are clustered. (Shown: storm events detected 1:00 pm – 1:15 pm.)
Algorithm comparison: DBSCAN and K-means
Number of clusters: 3
Time period: 1:00 pm – 1:15 pm
K-means result
DBSCAN result
Conclusion: the DBSCAN algorithm performs better than the k-means algorithm
Future Work
• Publication of provenance to a digital library
• Generalized support for metadata systems
• Enhanced support for mining triggers
• Personal weather predictor
  • The LEAD framework packaged onto a single 8-16 core multicore machine
  • Expands educational opportunities: suitable for small schools
  • Engages communities beyond meteorologists
Thank you for the interest.
Thanks to my many domain science and CS collaborators, to my students, and to the funding agencies.
Please feel free to contact me at [email protected]