SAN DIEGO SUPERCOMPUTER CENTER, UCSD

NATIONAL PARTNERSHIP FOR ADVANCED COMPUTATIONAL INFRASTRUCTURE

Data Mining and Middleware Workshop, Minnesota, Sept 2003

GEMS and Data Mining: Building the Grid Infrastructure

Chaitan Baru, Program Co-Director

Data and Knowledge Systems, San Diego Supercomputer Center

SDSC Organizational Structure (www.sdsc.edu)

Office of the Director
• Fran Berman, Director
• Alan Blatecky, Exec Director
• Richard Moore, NPACI Exec Director
• Anke Kamrath, COO

~ 600 employees/students total

Data and Knowledge Systems (DAKS)
• Data integration
• Distributed data management
• Scientific databases
• Data mining
• Scientific data visualization

Integrative Computational Sciences (ICS)
• Computational chemistry
• Applied math
• Ecoinformatics
• Environmental Science
• Computational Economics
• User Services

Integrative Biological Sciences (IBS)
• Molecular biology
• Neuroscience
• Structural Genomics
• Cell Signaling
• Proteomics

High-End Computing (HEC)
• Production systems

Grids and Clusters (G&C)
• Cluster management
• Portals
• Grid middleware

Networking and Security (N&S)
• Production networking and security
• Research on network monitoring

Education and Training

Communications and Outreach

Business Office

The DAKS Program

• Organized as a set of R&D Labs
  1. Knowledge-based Integration (Bertram Ludaescher)
  2. Advanced Query Processing (Amarnath Gupta)
  3. Advanced Database Projects (David Archbell)
  4. Data Mining (Tony Fountain)
  5. Visualization (Michael Bailey)
  6. Spatial Information Systems (Ilya Zaslavsky)
  7. Geoinformatics (Dogan Seber)
  8. Storage Resource Broker, SRB (Arcot Rajasekar)
  9. Sustainable Archives and Digital Library Technology (Richard Marciano)

Outline

• Some distributed/grid computing environments
  • TeraGrid, NPACI Grid, GEON, BIRN, LTER Network
  • Hardware, software, middleware
• Middleware for data management, exploration, and mining
  • Some data-oriented / data-intensive application use cases
  • Data-oriented middleware
    • SRB, SKIDLKit, GEMS

Prototype for Cyberinfrastructure

TeraGrid: Common TeraGrid Software Stack (CTSS)

• OS: Linux (SuSE), but also others
• Compilers: gcc, Intel C/C++, Intel Fortran
• MPICH
• Schedulers: OpenPBS, Maui
• Grid Services: Globus GT2.2.4, gsi, Condor-G, CACL
• Math Libs
• I/O: HDF4/5, GPFS, PVFS
• Collection Management: SRB client
• Monitoring: Ganglia, Clumon

NPACI Grid Sites and Platforms

[Figure: map of NPACI Grid sites and platforms. Labels as extracted:]

• U. Michigan: AMD Athlon
• AMD Opteron
• UT Austin: Power 4
• Cray-Dell Linux cluster
• Blue Horizon, DataStar

SDSC DataStar

• Next major acquisition at SDSC
• IBM Power-based system, optimized for data-oriented applications (large I/O as well as DBMS)
• Likely to be ~7TF system
• 128 x 8 processor nodes, 16GB/node (2TB memory)
• 8 x 32 processor nodes (6 @ 64GB/node, 1 @ 128GB, 1 @ 256GB) (768GB memory)
• High-speed switch interconnect
• FCS interfaces to SAN-based disk
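
The memory totals quoted on this slide follow directly from the node counts. A quick check, using only the figures given above (the variable names are illustrative, and 1 TB is taken as 1024 GB):

```python
# Aggregate memory of the planned DataStar configuration, from the slide's figures.
GB_PER_TB = 1024

# 128 nodes with 8 processors and 16 GB each.
small_nodes_gb = 128 * 16

# 8 nodes with 32 processors: six at 64 GB, one at 128 GB, one at 256 GB.
large_nodes_gb = 6 * 64 + 1 * 128 + 1 * 256

# Total processor count across both node types.
total_processors = 128 * 8 + 8 * 32

print(f"8-way nodes:  {small_nodes_gb} GB = {small_nodes_gb / GB_PER_TB:.0f} TB")
print(f"32-way nodes: {large_nodes_gb} GB")
print(f"processors:   {total_processors}")
```

The two memory figures reproduce the "2TB" and "768GB" on the slide; the processor total is derived here and is not stated in the original.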

NPACKage: Focus on impact, interoperability and usability

• NPACKage
  • Interoperable collection of NPACI SW targeted for national-scale distribution
• NPACKage Components
  • The Globus Toolkit™
  • GSI-OpenSSH
  • Network Weather Service
  • DataCutter
  • Ganglia
  • LAPACK for Clusters (LFC)
  • MyProxy
  • GridConfig
  • Condor-G
  • Storage Resource Broker (SRB)
  • Grid Portal Toolkit (GridPort)
  • MPICH-G2
  • APST (AppLeS Parameter Sweep Template)
  • Kx509
• Technology integration
  • All-to-all interoperability
• Packaging and deployment
• Maintenance
• User support
  • Documentation
  • Consulting
  • Help-desk
• User feedback key to improvement in FY’04

Biomedical Informatics Research Network: Participating Sites

PI of BIRN CC: Mark Ellisman
Co-I’s of BIRN CC: Chaitan Baru, Phil Papadopoulos, Amarnath Gupta, Bertram Ludaescher

BIRN: Commonality is the Key

• Hardware – HP DL380 processors, common CISCO switch, Netscout monitoring software, gigabit connectivity
• Operating Systems – Red Hat Linux
• Database – Oracle
• Applications – Storage Resource Broker, data integration and mediators, variability in back-up solutions
• BIRN Portal – common user interface, able to launch unique user applications

BIRN Project Objectives

Establish a stable, high performance network linking key Biotechnology Centers and General Clinical Research Centers

Establish distributed and linked data collections with partnering groups: create a “Data GRID”

Facilitate the use of "grid-based" computational infrastructure and integrate BIRN with other GRID middleware projects

Enable data mining from multiple data collections or databases on neuroimaging and bioinformatics

Build a stable software and hardware infrastructure that will allow centers to coordinate efforts to accumulate larger studies than can be carried out at one site.

The GEON Grid

• OptIPuter / GEON Project – connect NASA Goddard to SDSC via optic fiber

[Figure: map of GEON Grid nodes. Legend: 5-node cluster, 2-node DB store; 1-node. Other labels: 1TF cluster, Livermore]

Partner Projects
• Chronos
• CUAHSI

Partner services
• USGS
• Geological Survey of Canada
• ESRI
• NASA

SDSC PI: Chaitan Baru
SDSC co-PI’s: Phil Papadopoulos, Bertram Ludaescher, Michael Bailey

GEON Software Stack

• OGSA
• Information Integration software
  • IBM Information Integrator
  • SDSC GEMS
• Grid Data services
  • Replication – Grid Movement and Replication
  • Replica Location Services
  • Community Authorization Service
  • Grid Monitoring and Discovery, Network Weather Service, …
• GEON Portal Development
  • Search and Discovery interface
  • Workflow specification, customization, execution
  • Data and Information Visualization tools

GMR Architecture
Collaboration with IBM Almaden (Inderpal Narang et al)

[Figure: GMR architecture diagram. Boxes: Client, GMR Monitor, GMR Service, Capture Service, Capture Adapter, Apply Service, Apply Adapter, Adaption Layer, Data Source, Data Target. Arrows: Grid Service Calls and Notifications, Grid Service Calls, Data Transfer]

Coordination responsibilities
• Manages all subscriptions for data movement/replication
• Handles QoS (latency, security)
• Handles scheduling
• Coordinates recovery
• Handles notifications, billing, auditing

Capture Service
• Buffers data
• Controls flow

Capture Adapter
• Handles data type specific interaction
• Provides capture service with chunks of data from the data source

Apply Service
• Maintains persistent statistics for recovery

Apply Adapter
• Handles data type specific interaction
• Applies data chunks to data target

Example
• Data Source: Gravity DataSet (Oracle, UTEP node)
• Data Target: Gravity Cache (Postgresql, SDSC node)
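
The capture/apply split described on this slide can be sketched as a producer and consumer joined by a bounded buffer. This is a minimal illustration and not GMR code: the function names, the chunk size, and the in-memory lists standing in for the source and target databases are all assumptions.

```python
import queue
import threading

CHUNK_ROWS = 2      # hypothetical chunk size
BUFFER_CHUNKS = 4   # bounded buffer: a full queue blocks the producer


def capture_adapter(source_rows, chunk_rows=CHUNK_ROWS):
    """Source-specific side: yield chunks of rows from the data source."""
    for start in range(0, len(source_rows), chunk_rows):
        yield source_rows[start:start + chunk_rows]


def capture_service(source_rows, buffer):
    """Buffer chunks; blocking on a full queue acts as simple flow control."""
    for chunk in capture_adapter(source_rows):
        buffer.put(chunk)
    buffer.put(None)  # end-of-stream marker


def apply_service(buffer, target_rows, stats):
    """Apply chunks to the target and record progress.

    A real service would persist these statistics so a failed transfer
    can resume; here they live in a plain dict.
    """
    while True:
        chunk = buffer.get()
        if chunk is None:
            break
        target_rows.extend(chunk)  # stands in for the apply adapter
        stats["chunks_applied"] += 1
        stats["rows_applied"] += len(chunk)


def replicate(source_rows):
    """Run capture in a background thread and apply in the caller."""
    buffer = queue.Queue(maxsize=BUFFER_CHUNKS)
    target_rows = []
    stats = {"chunks_applied": 0, "rows_applied": 0}
    producer = threading.Thread(target=capture_service, args=(source_rows, buffer))
    producer.start()
    apply_service(buffer, target_rows, stats)
    producer.join()
    return target_rows, stats
```

Calling `replicate(rows)` on a list of rows returns a copy of those rows plus the progress counters. Keeping the adapters separate from the services is what lets the same pipeline move data between different database types, as in the Oracle-to-Postgresql example above.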

Building the BIRN Portal

Schematic overview of the layered software architecture leveraging Grid middleware technologies to link users to distributed resources

Application Use Cases

• Different classes of I/O
  • Read and/or generate large, individual files
    • “traditional” supercomputing applications
  • Read large data collections
    • E.g., Digital sky, system log files
  • Database applications
    • E.g., Digital sky, Protein Data Bank
• Remote vs. local data
  • Compute engines remote from data archives or “data owners”
  • Staging vs prefetching vs. synchronous I/O
    • Ability to reserve disk vs. rewriting I/O calls vs. fast communications
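
The three remote-data strategies named on this slide differ only in when the transfer happens relative to the computation. The sketch below is an illustration under simplifying assumptions: `REMOTE` is a hypothetical in-memory stand-in for a remote archive, and `fetch` replaces a real wide-area read.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical archive: five blocks of four integers each.
REMOTE = {i: list(range(i * 4, i * 4 + 4)) for i in range(5)}


def fetch(block_id):
    """Stand-in for a remote read; a real one would cross the WAN."""
    return REMOTE[block_id]


def synchronous(block_ids):
    """Fetch each block at the moment the computation needs it."""
    return sum(sum(fetch(b)) for b in block_ids)


def staged(block_ids):
    """Copy everything to local (reserved) disk first, then compute locally."""
    local = {b: fetch(b) for b in block_ids}
    return sum(sum(local[b]) for b in block_ids)


def prefetched(block_ids):
    """Overlap the fetch of the next block with work on the current one."""
    if not block_ids:
        return 0
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, block_ids[0])
        for nxt in list(block_ids[1:]) + [None]:
            data = pending.result()
            if nxt is not None:
                pending = pool.submit(fetch, nxt)  # starts while we compute
            total += sum(data)
    return total
```

All three return the same answer; the trade-off is the one the slide lists. Staging needs reserved disk at the compute site, prefetching needs the I/O calls rewritten, and synchronous access needs a fast enough network to be tolerable.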

Data Middleware

• SDSC Storage Resource Broker (SRB)
  • See http://srb.sdsc.edu
• SKIDLkit (SDSC Knowledge and Information Discovery Lab Kit)
  • Led by Tony Fountain
  • See http://www.sdsc.edu/SKIDL
  • Web-services based environment to provide access to data sources and analysis tools
• SDSC Grid-Enabled Mediation Services (GEMS)

SDSC Storage Resource Broker

[Figure: SRB architecture diagram, summarized below]

• User Application
• SRB clients: mySRB, UNIX shell (s-commands), inQ, C/C++ libs
• Interfaces: Posix I/O, Metadata querying interfaces
• SRB, with MCAT, Remote Proxies, DataCutter, Metadata Extraction
• Archives: HPSS, ADSM, UniTree, DMF
• Databases: DB2, Oracle, Sybase…
• File Systems: Unix, NT, Mac OSX…
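
The brokering idea in this diagram, a single logical namespace backed by a metadata catalog and many kinds of storage, can be sketched in a few lines. This is not the SRB API: `Broker`, `MemoryResource`, and every method name are hypothetical, and the catalog dict only gestures at the role MCAT plays.

```python
class MemoryResource:
    """Stand-in for one storage back end (archive, database, file system)."""

    def __init__(self):
        self._objects = {}

    def write(self, physical_path, data):
        self._objects[physical_path] = data

    def read(self, physical_path):
        return self._objects[physical_path]


class Broker:
    """Resolves logical names through a catalog, then calls the right resource."""

    def __init__(self, resources):
        self.resources = resources  # resource name -> driver object
        self.catalog = {}           # logical name -> catalog entry
        self._next_id = 0           # used to build unique physical paths

    def put(self, logical_name, data, resource, metadata=None):
        physical_path = f"/{resource}/{self._next_id}"
        self._next_id += 1
        self.resources[resource].write(physical_path, data)
        self.catalog[logical_name] = {
            "resource": resource,
            "path": physical_path,
            "metadata": dict(metadata or {}),
        }

    def get(self, logical_name):
        """Read by logical name; the caller never sees the physical location."""
        entry = self.catalog[logical_name]
        return self.resources[entry["resource"]].read(entry["path"])

    def query(self, **attrs):
        """Return logical names whose metadata matches every given attribute."""
        return sorted(
            name for name, entry in self.catalog.items()
            if all(entry["metadata"].get(k) == v for k, v in attrs.items())
        )
```

A client would call `broker.get("/home/demo/sky.fits")` without knowing whether the bytes live in an archive, a database, or a file system, which is the property the diagram is illustrating.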

SKIDLKit and the LTER Project: Current Status

• Work with four key LTER sites (NTL, VCR, AND, JRN)
• Extend to other sites, and implement as Grid services
• Based on Apache Tomcat 4.1.24, Apache Axis 1.1, JDK 1.4
• Generic Service Wrappers using Java & JDBC:
  • For Oracle database at NTL site, MySQL database at VCR site, SQL Server database at AND site
  • And, using the site’s EML (Ecological Metadata Language) config file
• Designed a simple standard in XML to unify climate data expression across four LTER sites
• Access to some data mining tools
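
The generic-wrapper idea can be illustrated as one function that reads a per-site mapping and emits records in a shared XML form. The slide describes Java and JDBC wrappers; this sketch swaps in Python's standard `sqlite3` module purely for illustration. The table and column names, the `SITE_CONFIG` mapping (standing in for what a site's EML file would supply), and the XML element names are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical per-site mapping from the shared vocabulary to local names.
# Identifiers come from trusted configuration, not from user input.
SITE_CONFIG = {
    "NTL": {"table": "met", "date": "sampledate", "air_temp_c": "air_temp"},
    "VCR": {"table": "climate", "date": "obs_day", "air_temp_c": "tavg"},
}


def climate_xml(conn, site):
    """Query one site's database and emit records in a common XML form."""
    cfg = SITE_CONFIG[site]
    sql = (f'SELECT {cfg["date"]}, {cfg["air_temp_c"]} '
           f'FROM {cfg["table"]} ORDER BY {cfg["date"]}')
    root = ET.Element("climateData", site=site)
    for date, temp in conn.execute(sql):
        obs = ET.SubElement(root, "observation", date=date)
        ET.SubElement(obs, "airTemperature", units="celsius").text = str(temp)
    return ET.tostring(root, encoding="unicode")


def demo_database():
    """In-memory stand-in for a site database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE met (sampledate TEXT, air_temp REAL)")
    conn.executemany("INSERT INTO met VALUES (?, ?)",
                     [("2003-09-01", 18.5), ("2003-09-02", 17.0)])
    return conn
```

Calling `climate_xml(demo_database(), "NTL")` yields a `climateData` document with one `observation` element per row. Adding a site then means adding a configuration entry and leaving the wrapper code unchanged.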

An International Computational Grid for Ecology and the Environment

[Figure: map of participating sites and the protocols linking them]

Sites
• LTER-AND, Corvallis, OR
• LTER-NTL, Madison, WI
• SDSC, La Jolla, CA
• LTER-VCR, Charlottesville, VA
• CAS / CNIC, Beijing, China
• NARC, Tsukuba, Japan
• NCHC, Hsinchu, Taiwan

Link labels
• SOAP / XML
• JDBC
• JDBC / EML
• HTTP

Legend
• SOAP Servers where web services are deployed
• Database Servers where data sources are hosted
• Sensor Data from web cam deployed at fields
• Underwater Sensors

SDSC Grid-Enabled Mediation Services (GEMS)

• Based on XML, XQuery (next generation of MIX: Mediation of Information using XML)
• Defined in terms of a set of services that are used at:
  • “Registration time”
    • Dataset registration, schema registration, ontology registration
    • Source content and capability related services: e.g., “term resolution” service, capability description service, …
  • “View definition time”
    • Data Integration Services, Discovery services
  • “Query formulation time”
  • Query runtime
    • Dynamic binding of logical to physical resources
  • Administrative Services
    • Services to manage access controls, control replicas, …
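
The registration-time services listed above can be pictured as writes into a registry that later phases read. The sketch below is an assumption-laden illustration, not the GEMS interface: the class, its method names, and the sample dataset and concept names are invented for the example.

```python
class MetadataRegistry:
    """Hypothetical registry filled at registration time, read at query time."""

    def __init__(self):
        self.datasets = {}  # dataset name -> {"schema": [...], "capabilities": {...}}
        self.ontology = {}  # concept -> set of (dataset, field) pairs

    def register_dataset(self, name, schema, capabilities):
        """Dataset and schema registration in one step."""
        self.datasets[name] = {"schema": list(schema),
                               "capabilities": set(capabilities)}

    def register_term(self, concept, dataset, field):
        """Ontology registration: tie a shared concept to a source field."""
        if field not in self.datasets[dataset]["schema"]:
            raise ValueError(f"{dataset} has no field {field!r}")
        self.ontology.setdefault(concept, set()).add((dataset, field))

    def resolve_term(self, concept):
        """Term resolution: which registered fields carry this concept?"""
        return sorted(self.ontology.get(concept, ()))

    def sources_with(self, capability):
        """Capability description: which sources support an operation?"""
        return sorted(name for name, d in self.datasets.items()
                      if capability in d["capabilities"])
```

At view-definition or query-formulation time, a mediator could call `resolve_term("rock_age")` to find every source field that carries that concept, and `sources_with("spatial_filter")` to learn which sources can evaluate a spatial predicate themselves.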

GRID SERVICES FOR MAP INTEGRATION

Integrate Geologic Data From Multiple Sources Using Ontology and Map Assembly Web Services (to be deployed by USGS)

[Figure: map integration diagram. Components: Mediator, Legend Generator, Map Assembler, Ontology]

ArcIMS Services wrapped in WSDL/SOAP

GEON: Information Integration

Chronos

PaleoStrat

Neptune

PaleoBiology

EGI

PaleoGeography

PaleoBiology

GeMS Components

[Figure: GeMS component diagram. Components as labeled:]

• Client
• Query Optimization & Plan Generation
• Verification, Access Control, and Query Rewrite
• Result Assembly (e.g. map generation)
• Ontology Service
• Community Authorization Service
• Monitoring & Discovery Service
• Network Weather Service
• Replica Location Service
• Registration Services
• Metadata Registry
• Deployment Services
• Data Integration Services
• Distributed Compute and Storage Resources
  • Databases
  • File systems
  • Compute Resources

GeMS Request Processing Scenario

[Figure: GeMS request processing diagram. Components as labeled:]

• Client
• Mediator
• Ontology Service(s)
• GeMS Query Planner
• GeMS Plan
• GeMS Logical-to-Physical Query Plan “binding”
• Query Result Assembly
• Wrappers, each in front of a site's Published Data and Private Data
• Replicas
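
The request flow in this diagram (plan, bind logical sources to physical replicas, query through wrappers, assemble results) can be sketched as follows. This is an illustration under stated assumptions and not GeMS code: the function names are invented, the `load` table stands in for whatever monitoring information a real binder would consult, and each wrapper simply hides rows marked private.

```python
class Wrapper:
    """Exposes one site's data; private rows stay behind the wrapper."""

    def __init__(self, rows):
        self.rows = rows  # each row is a dict with a boolean "public" key

    def query(self, predicate):
        return [r for r in self.rows if r["public"] and predicate(r)]


def plan(logical_sources):
    """Logical plan: one step per logical source (no optimization here)."""
    return [{"source": s} for s in logical_sources]


def bind(logical_plan, replicas, load):
    """Bind each logical source to its least-loaded physical replica."""
    bound = []
    for step in logical_plan:
        site = min(replicas[step["source"]], key=lambda s: load[s])
        bound.append({**step, "site": site})
    return bound


def execute(bound_plan, wrappers, predicate):
    """Run each bound step through its wrapper and assemble one result."""
    result = []
    for step in bound_plan:
        result.extend(wrappers[step["site"]].query(predicate))
    return result
```

Binding late, at query runtime, is what lets the same logical plan run against whichever replica is currently the better choice.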

Some Issues

• Function shipping versus data shipping
• Need to deal with different levels of access provided by different sites, example:
  • Native API access to databases
  • JDBC
  • Web services (with full query vs limited query access)
  • Read-only vs read-write (dealing with temp results, annotations)
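
One way to reason about function shipping versus data shipping is to compare the bytes that must cross the network in each case. The sketch below is a deliberately simple model with assumed inputs: it ignores compute time and latency, and the `remote_execution_allowed` flag stands in for the access-level differences listed above (a source offering only limited query access cannot run shipped code).

```python
def data_shipping_seconds(dataset_bytes, bandwidth_bytes_per_s):
    """Move the whole dataset to the compute site, then process it there."""
    return dataset_bytes / bandwidth_bytes_per_s


def function_shipping_seconds(code_bytes, result_bytes, bandwidth_bytes_per_s):
    """Send the operation to the data; only code and result cross the network."""
    return (code_bytes + result_bytes) / bandwidth_bytes_per_s


def cheaper_strategy(dataset_bytes, code_bytes, result_bytes,
                     bandwidth_bytes_per_s, remote_execution_allowed=True):
    """Pick by transfer time; fall back to data shipping if the site forbids remote code."""
    if not remote_execution_allowed:
        return "data shipping"
    data = data_shipping_seconds(dataset_bytes, bandwidth_bytes_per_s)
    func = function_shipping_seconds(code_bytes, result_bytes, bandwidth_bytes_per_s)
    return "function shipping" if func < data else "data shipping"
```

With these assumptions, a large dataset and a small result favor function shipping whenever the site permits it, while a small dataset or a restrictive access level pushes the choice back to data shipping.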

Contact Info

Chaitan [email protected]

SDSC Machine Room Data Architecture

[Figure: SDSC machine room data architecture diagram, summarized below]

Summary
• .5 PB disk, 6 PB archive
• 1 GB/s disk-to-tape
• Optimized support for DB2 (Regatta) / Oracle (Sun 15K)

Networks
• LAN (multiple GbE, TCP/IP)
• SAN (2 Gb/s, SCSI)
• WAN (30 Gb/s)
• SCSI/IP or FC/IP

Systems
• Blue Horizon
• HPSS
• Linux cluster, 4TF
• Sun F15K
• DataStar
• Power 4 DB
• Servers
• Vis Engine

Storage
• FC Disk Cache (400 TB)
• FC GPFS Disk (100TB)
• Local Disk (50TB)
• 200 MB/s per controller
• Silos and Tape, 6 PB, 1 GB/sec disk to tape, 32 tape drives
• 30 MB/s per drive
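
The summary figures on this slide appear to be round-ups of the component figures in the diagram. The check below uses only numbers from the slide; the assumption that the ".5 PB disk" and "1 GB/s disk-to-tape" values are sums of these particular components is an inference, not something the slide states.

```python
# Figures as they appear on the slide.
disk_tb = {"FC disk cache": 400, "FC GPFS disk": 100, "local disk": 50}
tape_drives = 32
mb_per_s_per_drive = 30

# Assumed relationship: the summary values are totals of the parts.
total_disk_tb = sum(disk_tb.values())
aggregate_tape_mb_s = tape_drives * mb_per_s_per_drive

print(f"disk: {total_disk_tb} TB (~{total_disk_tb / 1000:.2f} PB)")
print(f"tape: {aggregate_tape_mb_s} MB/s (~{aggregate_tape_mb_s / 1000:.2f} GB/s)")
```

Under that assumption the disk pools total 550 TB and the 32 drives together move 960 MB/s, which is consistent with the rounded ".5 PB" and "1 GB/s" on the slide.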

Current DTF/ETF Sites

UT Austin

Indiana

Oak Ridge