Grid-enabled Collaborative Research Applications Internet2 Member Meeting Spring, 2003 Sara J. Graves Director, Information Technology and Systems Center University Professor, Computer Science Department University of Alabama in Huntsville Director, Information Technology Research Center National Space Science and Technology Center 256-824-6064 [email protected] http://www.itsc.uah.edu

“…drowning in data but starving for knowledge”

[Figure labels: User Community; Information]

Data glut affects business, medicine, military, science

How do we leverage data to make BETTER decisions?

Collaborative Research Applications

• Enabling Technologies for Collaborative Research
– Grid-Enabled Data Mining Services
– Interchange Technology Mark-ups
– Collaboration Tools

• Collaborative Research Applications on the Grid
– TeraGrid Expeditions
– Linked Environments for Atmospheric Discovery
– Propulsion Research: Rocket Engine Advancement Program 2

Data Mining

• Automated discovery of patterns, anomalies from vast observational data sets

• Derived knowledge for decision making, predictions and disaster response

• ADaM – Algorithm Development and Mining System

http://datamining.itsc.uah.edu

Mining Environment: When, Where, Who and Why?

WHEN
• Real Time
• On-Ingest
• On-Demand
• Repeatedly

WHERE
• User Workstation
• Data Mining Center
• GRID

WHO
• End Users
• Domain Experts
• Mining Experts

WHY
• Event
• Relationship
• Association
• Corroboration
• Collaboration

[Figure center label: Data Mining]

Iterative Nature of the Data Mining Process

[Figure: cycle of stages with iteration between them. Labels: DATA; PREPROCESSING; CLEANING and INTEGRATION; SELECTION and TRANSFORMATION; MINING; DISCOVERY; EVALUATION and PRESENTATION; KNOWLEDGE]

ADaM Engine Architecture

[Figure: data flow through the engine. Labels: Data; Translated Data; Preprocessed Data; Patterns/Models; Results]

Input
• HDF
• HDF-EOS
• GIF
• PIP-2
• SSM/I Pathfinder
• SSM/I TDR
• SSM/I NESDIS Lvl 1B
• SSM/I MSFC Brightness Temp
• US Rain
• Landsat
• ASCII Grass
• Vectors (ASCII Text)
• Intergraph Raster
• Others...

Processing: Preprocessing
• Selection and Sampling
– Subsetting
– Subsampling
– Select by Value
– Coincidence Search
• Grid Manipulation
– Grid Creation
– Bin Aggregate
– Bin Select
– Grid Aggregate
– Grid Select
– Find Holes
• Image Processing
– Cropping
– Inversion
– Thresholding
• Others...

Processing: Analysis
• Clustering
– K Means
– Isodata
– Maximum
• Pattern Recognition
– Bayes Classifier
– Min. Dist. Classifier
• Image Analysis
– Boundary Detection
– Cooccurrence Matrix
– Dilation and Erosion
– Histogram Operations
– Polygon Circumscript
– Spatial Filtering
– Texture Operations
• Genetic Algorithms
• Neural Networks
• Others...

Output
• GIF Images
• HDF-EOS
• HDF Raster Images
• HDF SDS
• Polygons (ASCII, DXF)
• SSM/I MSFC Brightness Temp
• TIFF Images
• Others...
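As an illustration of one analysis operation named on this slide, the minimum-distance classifier, here is a minimal Python sketch. It is not ADaM code: the function names, class labels and sample values are invented for the example. Each sample is assigned to the class whose mean feature vector is nearest.

```python
import math


def class_means(training):
    """Compute the mean feature vector of each labeled class."""
    means = {}
    for label, samples in training.items():
        count = len(samples)
        dims = len(samples[0])
        means[label] = [sum(s[d] for s in samples) / count for d in range(dims)]
    return means


def classify(sample, means):
    """Return the label whose class mean is closest (Euclidean distance)."""
    return min(means, key=lambda label: math.dist(sample, means[label]))


# Hypothetical two-feature training data (illustrative values only).
training = {
    "clear": [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15)],
    "cloud": [(0.80, 0.90), (0.90, 0.80), (0.85, 0.85)],
}
means = class_means(training)
label = classify((0.70, 0.75), means)
print(label)
```

The classifier needs only one pass over the training data to compute the means, which is why it is a common baseline when comparing accuracy against classification time.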

Mining Environments

Multilevel Mining (ADaM)
– Complete System (Client and Engine)
– Mining Engine (User provides its own client)
– Application Specific Mining Systems
– Operations Tool Kit
– Stand Alone Mining Algorithms
– Data Fusion

Distributed/Federated Mining
– Distributed services
– Distributed data
– Chaining using Interchange Technologies

On-board Mining (EVE)
– Real time and distributed mining
– Processing environment constraints

Grid-Enabled Data Mining Services

• Distributed researchers, data sources, storage and computational resources in a secure environment

• ADaM data mining modules as Open Grid Services Architecture (OGSA) services

Data Mining / Earth Science Collaboration: Tropical Cyclone Detection

[Figure labels: Advanced Microwave Sounding Unit (AMSU-A) Data; Calibration/Limb Correction/Converted to Tb; Data Archive; Mining Environment; Result (Hurricane Floyd); Knowledge Base; Further Analysis]

Mining Plan:
• Water cover mask to eliminate land
• Laplacian filter to compute temperature gradients
• Science Algorithm to estimate wind speed
• Contiguous regions with wind speeds above a desired threshold identified
• Additional test to eliminate false positives
• Maximum wind speed and location produced

Results are placed on the web, made available to National Hurricane Center & Joint Typhoon Warning Center, and stored for further analysis

http://pm-esip.msfc.nasa.gov/cyclone
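The steps of the mining plan can be sketched in Python as below. This is only an outline under stated assumptions: the wind-speed relation is a placeholder (the actual science algorithm is not given on the slide), the false-positive test is assumed to be a minimum region size, and all function names and grid values are hypothetical.

```python
from collections import deque


def laplacian(grid):
    """4-neighbor discrete Laplacian; border cells are left at 0."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = (grid[r - 1][c] + grid[r + 1][c] + grid[r][c - 1]
                         + grid[r][c + 1] - 4.0 * grid[r][c])
    return out


def estimate_wind(lap, scale=1.0):
    """PLACEHOLDER for the science algorithm: wind = scale * |Laplacian|."""
    return [[scale * abs(v) for v in row] for row in lap]


def regions_above(wind, water, threshold):
    """Group 4-connected water cells whose wind exceeds the threshold."""
    rows, cols = len(wind), len(wind[0])
    seen, regions = set(), []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or not water[r][c] or wind[r][c] <= threshold:
                continue
            seen.add((r, c))
            queue, region = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen and water[ny][nx]
                            and wind[ny][nx] > threshold):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def detect(tb, water, threshold, min_cells=2):
    """Return (max wind, (row, col)) for each region that passes the size test."""
    wind = estimate_wind(laplacian(tb))
    results = []
    for region in regions_above(wind, water, threshold):
        if len(region) < min_cells:  # assumed false-positive test
            continue
        row, col = max(region, key=lambda p: wind[p[0]][p[1]])
        results.append((wind[row][col], (row, col)))
    return results


# Illustrative 5x5 brightness-temperature grid with one warm anomaly.
tb = [[250.0] * 5 for _ in range(5)]
tb[2][2] = 260.0
all_water = [[True] * 5 for _ in range(5)]
print(detect(tb, all_water, threshold=5.0))
```

Masking the anomaly cell as land splits the response into isolated single cells, which the size test then discards.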

Data Mining / Earth Science Collaboration: Classification Based on Texture Features

Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery

Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds

Comparison based on
– Accuracy of detection
– Amount of time required to classify

Parallel Version of Cloud Extraction

[Figure: GOES Image feeds a Laplacian Filter, a Sobel Horizontal Filter and a Sobel Vertical Filter; Energy Computation blocks (four shown) feed a Classifier, which produces the Cloud Image]

• GOES images can be used to recognize cumulus cloud fields

• Cumulus clouds are small and do not show up well in 4km resolution IR channels

• Detection of cumulus cloud fields in GOES can be accomplished by using texture features or edge detectors

• Three edge detection filters are used together to detect cumulus clouds, which lends itself to implementation on a parallel cluster

[Figure: GOES Image; Cumulus Cloud Mask]
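A minimal Python sketch of the filter, energy and classifier arrangement follows. Assumptions: threads stand in for cluster nodes, "energy" is taken to be the mean squared filter response, and the classifier is a placeholder threshold on total energy. None of this is the project's actual code.

```python
from concurrent.futures import ThreadPoolExecutor

LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_H = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to horizontal edges
SOBEL_V = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to vertical edges


def apply_kernel(image, kernel):
    """Apply a 3x3 kernel to interior pixels (correlation form; borders stay 0)."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = float(sum(kernel[i][j] * image[r - 1 + i][c - 1 + j]
                                  for i in range(3) for j in range(3)))
    return out


def energy(response):
    """Mean squared filter response, used here as a simple texture energy."""
    count = len(response) * len(response[0])
    return sum(v * v for row in response for v in row) / count


def filter_energy(job):
    """Run one filter and its energy computation; one job per worker."""
    image, kernel = job
    return energy(apply_kernel(image, kernel))


def texture_features(image):
    """Compute the three filter energies concurrently."""
    jobs = [(image, k) for k in (LAPLACIAN, SOBEL_H, SOBEL_V)]
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(filter_energy, jobs))


def is_cumulus(features, threshold=1.0):
    """PLACEHOLDER classifier: flag the tile when total energy is high."""
    return sum(features) > threshold


# Illustrative tiles: a flat field and a finely textured checkerboard.
flat = [[1] * 5 for _ in range(5)]
textured = [[(r + c) % 2 for c in range(5)] for r in range(5)]
print(texture_features(flat), texture_features(textured))
```

Because the three filters share no state, each (filter, energy) pair can run on its own worker and only the small feature vector has to be gathered for the classifier.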

Data Mining / Earth Science Collaboration: Detecting Signatures

• Detecting mesocyclone signatures from Radar data

• Science Rationale: Mesocyclone is an indicator of Tornadic activity

• Developing an algorithm based on wind velocity shear signatures
– Improve accuracy and reduce false alarm rates

Data Mining / Space Science Collaboration: Boundary Detection and Quantification

• Analysis of polar cap auroras in large volumes of spacecraft UV images

• Scientific Rationale:
– Indicators to predict geomagnetic storm
   • Damage satellites
   • Disrupt radio connection

• Developing different mining algorithms to detect and quantify polar cap boundary

[Figure: Polar Cap Boundary, panels A, B, C, D]

Data Mining / BioInformatics Collaboration: Genome Patterns

Text Pattern Recognition: Used to search for text patterns in bioscience data as well as other text documents.

[Figure labels: Genome DB; Mining Engine (Input Modules, Analysis Modules, Output Modules); Mining Results: MCSs; Event/Relationship Search System; Knowledge base; Scientists]

Sensor Data Characteristics

• Many different formats, types and structures

• Different states of processing (raw, calibrated, derived, modeled or interpreted)

• Enormous volumes

• Heterogeneity leads to data usability problems

• Earth science data comes in:
– Different formats, types and structures
– Different states of processing (raw, calibrated, derived, modeled or interpreted)
– Enormous volumes

• Heterogeneity leads to data usability problems

• One approach: Standard data formats
– Difficult to implement and enforce
– Can’t anticipate all needs
– Some data can’t be modeled or is lost in translation
– The cost of converting legacy data

• A better approach: Interchange Technologies
– Earth Science Markup Language

The Problem

[Figure: DATA FORMAT 1, DATA FORMAT 2 and DATA FORMAT 3 reach the APPLICATION through READER 1, READER 2 and a FORMAT CONVERTER]

The Solution

[Figure: DATA FORMAT 1, DATA FORMAT 2 and DATA FORMAT 3 are each paired with an ESML FILE and reach the APPLICATION through the ESML LIBRARY]

Interchange Technologies: Accessing Heterogeneous Data

What is ESML?

It is a specialized markup language for Earth Science metadata based on XML - NOT another data format.

It is a machine-readable and -interpretable representation of the structure, semantics and content of any data file, regardless of data format

ESML description files contain external metadata that can be generated by either data producer or data consumer (at collection, data set, and/or granule level)

ESML provides the benefits of a standard, self-describing data format (like HDF, HDF-EOS, netCDF, geoTIFF, …) without the cost of data conversion

ESML is the basis for core Interchange Technology that allows data/application interoperability

ESML complements and extends data catalogs such as FGDC and GCMD by providing the use/access information those directories lack.

http://esml.itsc.uah.edu
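The idea of an external description file that tells a generic library how to read a data file can be sketched in Python as follows. The markup here is hypothetical: the element names, attributes and field names are invented for the example and are not the real ESML schema, and the reader is a stand-in rather than the ESML library.

```python
import struct
import xml.etree.ElementTree as ET

# HYPOTHETICAL description markup (NOT the real ESML schema): it states the
# byte order and the sequence of fields in each fixed-size binary record.
DESCRIPTION = """
<description byteorder="big">
  <field name="latitude" type="float32"/>
  <field name="longitude" type="float32"/>
  <field name="brightness_temp" type="int16" scale="0.01"/>
</description>
"""

TYPE_CODES = {"float32": "f", "int16": "h"}  # struct format characters


def build_reader(description_xml):
    """Turn a description into a function that decodes raw bytes."""
    root = ET.fromstring(description_xml)
    order = ">" if root.get("byteorder") == "big" else "<"
    fields = [(f.get("name"), TYPE_CODES[f.get("type")],
               float(f.get("scale", "1")))
              for f in root.findall("field")]
    record = struct.Struct(order + "".join(code for _, code, _ in fields))

    def read(data):
        rows = []
        for values in record.iter_unpack(data):
            rows.append({name: value * scale
                         for (name, _, scale), value in zip(fields, values)})
        return rows

    return read


# Two illustrative records in the layout that the description declares.
raw = (struct.pack(">ffh", 34.5, -86.5, 25012)
       + struct.pack(">ffh", 35.0, -87.0, 24875))
records = build_reader(DESCRIPTION)(raw)
print(records)
```

The data bytes are never converted or rewritten; supporting another layout means writing another description, not another reader.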

Components of the ESML Interchange Technology

ESML CONSISTS OF:

(1) MARKUPS: ESML FILE. External description file for dataset or formats

(2) RULES FOR THE MARKUPS: ESML SCHEMA. Rules that govern the description of the data files

(3) MIDDLEWARE FOR AUTOMATION: ESML LIBRARY. Library parses and interprets the description file and figures out how to read the data

[Figure: DATA FORMAT 1, DATA FORMAT 2, DATA FORMAT 3 and OTHER FORMATS are each described by an ESML FILE governed by the ESML SCHEMA; the ESML LIBRARY serves the ESML EDITOR, the ESML DATA BROWSER, the ADaM DATA MINING SYSTEM and OTHER APPLICATIONS]

ESML in Numerical Modeling

Purpose:
• Use ESML to incorporate observational data into the numerical models for simulation

Scientists can:
• Select remote files across the network
• Select different observational data to increase the model prediction accuracy

[Figure: GOES Skin Temp, Insolation Products, Soundings and Others pass over the Network, each with an ESML file, through the ESML Library into NUMERICAL WEATHER MODELS (MM5, ETA, RAMS) to produce a Prediction. Inset plot axes: Sea Surface Temperature (TMI), Degree Kelvin; Chn 5 Temperature (AMSU), Degree Kelvin]

Collaboration Tools

CAMEX-4 campaign

• Data acquisition and integration from multiple platforms and instruments for quick exploitation

• Intra-project communications before, during, and after CAMEX campaigns

• Collaborators included NASA, NOAA, USAF, and multiple universities

Technologies to coordinate complex projects

http://camex.msfc.nasa.gov

CAMEX-4 Distributed Mission Coordination

[Figure: a Coordination Clearinghouse (RDBMS, Web-based interface, Data management) connects Forecasters, Radars, Mission Managers, the Experiment PI, NASA Aircraft, NOAA Aircraft and USAF Aircraft. NASA managers review status. Aircraft Crew: maintenance and report status.]

Modeling Environment for Atmospheric Discovery (MEAD): Use of the TeraGrid Infrastructure

• will develop/adapt a cyberinfrastructure that will enable simulation, datamining, and visualization of hurricanes and storms

• will integrate model and grid workflow management, data management, model coupling, and analysis/mining of large, ensemble datasets.

•Argonne National Lab

•Georgia Tech University

•Indiana University

•Lawrence Berkeley National Lab

•NCSA

•NOAA/FSL

•NOAA/NSSL

•Northwestern University

•Ohio State University

•Oklahoma University

•Portland State University

•Rice University

•Rutgers

•UAH

•UCAR

•University of Wisconsin

•University of Minnesota

Primary MEAD Software Components

• WRF Model (Weather Research and Forecasting)

• ROMS Model (Regional Ocean Modeling System)

• Coupled WRF/ROMS Model

• D2K (Data to Knowledge)

• ADaM (Algorithm Development and Mining System)

• Visualization Engines (NCAR Graphics, Vis5D, IDV-VisAD, HVR, VTK)

• netCDF, HDF5, ESML

• Middleware (Globus, JavaCog, GridFTP)

• Metadata Catalogue Service

Example MEAD Workflow

[Figure: three phases. Initial Setup: Initial Data and Parameters. Model Execution: Multiple WRF Models (Weather) and Multiple ROMS Models (Ocean), with inter-model communications, producing Model Results. Post Run Analysis: Data Mining (ADaM) and Visualization.]

Need the Grid to support the huge computational, data storage and post analysis requirements
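The shape of such a workflow (setup, an ensemble of model runs, then post-run mining) can be sketched in Python. Everything here is a stand-in: `run_model` returns a toy value rather than invoking WRF or ROMS, the relation between sea-surface offset and peak wind is invented, and a local thread pool takes the place of Grid resources.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def run_model(member):
    """STAND-IN for one ensemble member; returns a toy model result."""
    name, sst_offset = member
    # Toy relation for illustration only: warmer sea surface, stronger wind.
    return {"member": name, "peak_wind": 40.0 + 5.0 * sst_offset}


def mine(results, threshold):
    """Post-run analysis: flag members whose peak wind exceeds the threshold."""
    return [r["member"] for r in results if r["peak_wind"] > threshold]


def workflow(members, threshold=45.0):
    """Run all ensemble members concurrently, then mine the combined results."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_model, members))
    return {
        "flagged": mine(results, threshold),
        "mean_peak_wind": mean(r["peak_wind"] for r in results),
    }


# Initial setup: hypothetical ensemble members and their parameter offsets.
members = [("m0", 0.0), ("m1", 1.0), ("m2", 2.0)]
summary = workflow(members)
print(summary)
```

In a real deployment each member would be a large simulation with its own output files, which is where the computational, storage and post-analysis demands noted above come from.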

Linked Environments for Atmospheric Discovery (LEAD)

Create for the university community an integrated, scalable framework for use in accessing, preparing, assimilating, predicting, managing, mining/analyzing, and displaying a broad array of meteorological and related information independent of format and physical location.

Collaborators:
– University of Oklahoma
– University of Alabama in Huntsville
– UCAR/Unidata
– Indiana University
– University of Illinois/NCSA
– Millersville University
– Howard University
– Colorado State University

LEAD Architecture

[Figure: layered architecture, top to bottom]

MyLEAD Portal / MyLEAD Virtual Environment

Application Services
• Data Mining
• Visualization tools
• Models
• Others…

Middleware
• Data Management
• Workflow Management
• Monitoring
• Interchange Technologies
• Workflow Orchestration
• Semantics for data and services
• Personal Data Space
• Resource Allocation
• Scheduling
• Security
• Others…

Grid and Web infrastructure

Distributed Resources
• pools of workstations
• clusters
• national supercomputer facilities
• tertiary storage
• scientific instruments

Collaborative Environment for Propulsion Research: Rocket Engine Advancement Program 2

• Consortium of propulsion research centers:
– Auburn University
– Purdue University
– Pennsylvania State University
– Tuskegee University
– University of Alabama in Huntsville
– University of Tennessee
– NASA Marshall Space Flight Center
– NASA Glenn Research Center

• Grid configuration will make distributed computational and data resources available to researchers without having to negotiate separate access to each resource.

• Linking or integration of multiple distributed experiment steps into a single investigation for more timely results and analysis.

• Will rely on the security capabilities of the Grid due to the sensitive nature of the propulsion research.

Collaborative Environment for Propulsion Research: Rocket Engine Advancement Program 2

[Figure labels: REAP2 User Portal; REAP2 Grid Portal; Supercomputer; Cluster(s); Test Equipment; Data and Results]

Evolution of Frameworks for Advanced Applications

• Changing Computational Landscape
– GRIDS
– Clusters
– Web Services
– Pervasive Computing
– On-Board Processing

• Middleware for applications on GRID/Clusters
– Automate parallelization of mining tasks
– Estimate resource requirements using the computational complexity of the algorithms

• Federated Model for Mining
– Individual components that can be distributed and can execute across different platforms
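One way to read the middleware bullets is sketched below in Python: estimate each task's cost from the complexity of its algorithm, then spread tasks over nodes. The cost models, the operations-per-second figure, the task names and the greedy longest-first assignment are all assumptions made for this illustration, not a description of any existing middleware.

```python
import math

# ASSUMED cost models: operation count as a function of input size n.
# Real estimates would need measured constants for each algorithm.
COMPLEXITY = {
    "threshold": lambda n: n,
    "sort": lambda n: n * math.log2(n) if n > 1 else n,
    "pairwise_distance": lambda n: n * n,
}


def estimate_seconds(algorithm, n, ops_per_second=1e9):
    """Estimate run time from the algorithm's complexity and input size."""
    return COMPLEXITY[algorithm](n) / ops_per_second


def assign(tasks, nodes):
    """Give each task, longest first, to the currently least-loaded node."""
    loads = [0.0] * nodes
    plan = [[] for _ in range(nodes)]
    ordered = sorted(tasks, key=lambda t: estimate_seconds(t[1], t[2]),
                     reverse=True)
    for name, algorithm, n in ordered:
        node = loads.index(min(loads))
        plan[node].append(name)
        loads[node] += estimate_seconds(algorithm, n)
    return plan, loads


# Hypothetical mining tasks: (name, algorithm, input size).
tasks = [
    ("a", "pairwise_distance", 1000),
    ("b", "threshold", 2_000_000),
    ("c", "threshold", 500_000),
]
plan, loads = assign(tasks, nodes=2)
print(plan, loads)
```

Because the estimate depends only on the algorithm and the input size, it can be made before any data is moved, which matters when the components are federated across different platforms.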