SAN DIEGO SUPERCOMPUTER CENTER, UCSD
NATIONAL PARTNERSHIP FOR ADVANCED COMPUTATIONAL INFRASTRUCTURE
Data Mining and Middleware Workshop, Minnesota, Sept 2003
GEMS and Data Mining: Building the Grid Infrastructure
Chaitan Baru, Program Co-Director
Data and Knowledge Systems, San Diego Supercomputer Center
SDSC Organizational Structure (www.sdsc.edu)
Office of the Director
• Fran Berman, Director
• Alan Blatecky, Executive Director
• Richard Moore, NPACI Executive Director
• Anke Kamrath, COO
~ 600 employees/students total
Divisions and focus areas:
• Data and Knowledge Systems (DAKS): data integration, distributed data management, scientific databases, data mining, scientific data visualization
• Integrative Computational Sciences (ICS): computational chemistry, applied math, ecoinformatics, environmental science, computational economics, user services
• Integrative Biological Sciences (IBS): molecular biology, neuroscience, structural genomics, cell signaling, proteomics
• High-End Computing (HEC): production systems
• Grids and Clusters (G&C): cluster management, portals, grid middleware
• Networking and Security (N&S): production networking and security; research on network monitoring
• Education and Training
• Communications and Outreach
• Business Office
The DAKS Program
• Organized as a set of R&D Labs:
1. Knowledge-based Integration (Bertram Ludaescher)
2. Advanced Query Processing (Amarnath Gupta)
3. Advanced Database Projects (David Archbell)
4. Data Mining (Tony Fountain)
5. Visualization (Michael Bailey)
6. Spatial Information Systems (Ilya Zaslavsky)
7. Geoinformatics (Dogan Seber)
8. Storage Resource Broker, SRB (Arcot Rajasekar)
9. Sustainable Archives and Digital Library Technology (Richard Marciano)
Outline
• Some distributed/grid computing environments
  • TeraGrid, NPACI Grid, GEON, BIRN, LTER Network
  • Hardware, software, middleware
• Middleware for data management, exploration, and mining
  • Some data-oriented / data-intensive application use cases
  • Data-oriented middleware
  • SRB, SKIDLkit, GEMS
Prototype for Cyberinfrastructure
TeraGrid: Common TeraGrid Software Stack (CTSS)
• OS: Linux (SuSE), but also others
• Compilers: gcc, Intel C/C++, Intel Fortran
• MPI: MPICH
• Schedulers: OpenPBS, Maui
• Grid services: Globus GT2.2.4, GSI, Condor-G, CACL
• Math libraries
• I/O: HDF4/5, GPFS, PVFS
• Collection management: SRB client
• Monitoring: Ganglia, Clumon
NPACI Grid Sites and Platforms
[Map of sites and platforms: U. Michigan (AMD Athlon), an AMD Opteron system, UT Austin (Power 4), a Cray-Dell Linux cluster, and SDSC's Blue Horizon and DataStar]
SDSC DataStar
• Next major acquisition at SDSC
• IBM Power-based system, optimized for data-oriented applications (large I/O as well as DBMS)
• Likely to be a ~7 TF system
• 128 x 8-processor nodes, 16 GB/node (2 TB memory)
• 8 x 32-processor nodes (6 @ 64 GB/node, 1 @ 128 GB, 1 @ 256 GB) (768 GB memory)
• High-speed switch interconnect
• FCS interfaces to SAN-based disk
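The stated node and memory totals are easy to verify; a quick back-of-the-envelope check (illustrative Python, using only the figures on the slide):

```python
# Check the DataStar configuration as stated on the slide:
# 128 x 8-way nodes at 16 GB each, plus 8 x 32-way nodes
# (six at 64 GB, one at 128 GB, one at 256 GB).
thin_nodes, thin_procs, thin_mem_gb = 128, 8, 16
fat_nodes, fat_procs = 8, 32
fat_mem_gb = [64] * 6 + [128, 256]

total_procs = thin_nodes * thin_procs + fat_nodes * fat_procs
thin_mem = thin_nodes * thin_mem_gb   # GB across the 8-way nodes
fat_mem = sum(fat_mem_gb)             # GB across the 32-way nodes

print(total_procs)   # 1280 processors
print(thin_mem)      # 2048 GB, i.e. the 2 TB quoted
print(fat_mem)       # 768 GB, matching the slide
```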
NPACKage: Focus on impact, interoperability and usability
• NPACKage: an interoperable collection of NPACI software targeted for national-scale distribution
• NPACKage components:
  • The Globus Toolkit™
  • GSI-OpenSSH
  • Network Weather Service
  • DataCutter
  • Ganglia
  • LAPACK for Clusters (LFC)
  • MyProxy
  • GridConfig
  • Condor-G
  • Storage Resource Broker (SRB)
  • Grid Portal Toolkit (GridPort)
  • MPICH-G2
  • APST (AppLeS Parameter Sweep Template)
  • Kx509
• Technology integration: all-to-all interoperability
• Packaging and deployment
• Maintenance and user support: documentation, consulting, help desk
• User feedback key to improvement in FY’04
Biomedical Informatics Research Network: Participating Sites
PI of BIRN CC: Mark Ellisman
Co-Is of BIRN CC: Chaitan Baru, Phil Papadopoulos, Amarnath Gupta, Bertram Ludaescher
BIRN: Commonality is the Key
• Hardware: HP DL380 servers, common Cisco switch, NetScout monitoring software, gigabit connectivity
• Operating system: Red Hat Linux
• Database: Oracle
• Applications: Storage Resource Broker, data integration and mediators; variability in back-up solutions
• BIRN Portal: common user interface, able to launch unique user applications
BIRN Project Objectives
• Establish a stable, high-performance network linking key Biotechnology Centers and General Clinical Research Centers
• Establish distributed and linked data collections with partnering groups to create a “Data Grid”
• Facilitate the use of “grid-based” computational infrastructure and integrate BIRN with other Grid middleware projects
• Enable data mining from multiple data collections or databases on neuroimaging and bioinformatics
• Build a stable software and hardware infrastructure that will allow centers to coordinate efforts to accumulate larger studies than can be carried out at one site
The GEON Grid
• OptIPuter / GEON Project: connect NASA Goddard to SDSC via optic fiber
[Map of GEON grid sites: 5-node clusters, 2-node DB stores, 1-node sites, and a 1 TF cluster at Livermore; partner projects Chronos and CUAHSI; partner services from USGS, the Geological Survey of Canada, ESRI, and NASA]
SDSC PI: Chaitan Baru
SDSC co-PIs: Phil Papadopoulos, Bertram Ludaescher, Michael Bailey
GEON Software Stack
• OGSA
• Information integration software
  • IBM Information Integrator
  • SDSC GEMS
• Grid data services
  • Replication: Grid Movement and Replication (GMR)
  • Replica Location Service
  • Community Authorization Service
  • Grid Monitoring and Discovery, Network Weather Service, …
• GEON Portal development
  • Search and discovery interface
  • Workflow specification, customization, execution
  • Data and information visualization tools
GMR Architecture
Collaboration with IBM Almaden (Inderpal Narang et al.)
• GMR Monitor: manages all subscriptions for data movement/replication; handles QoS (latency, security), scheduling, notifications, billing, and auditing; coordinates recovery
• Capture Service: buffers data, controls flow
• Capture Adapter: handles data-type-specific interaction; provides the capture service with chunks of data from the data source
• Apply Service: maintains persistent statistics for recovery
• Apply Adapter: handles data-type-specific interaction; applies data chunks to the data target
• The GMR service, client, and adaptation layer communicate via grid service calls and notifications; data transfer flows from data source to data target
• Example: gravity dataset (Oracle, UTEP node) replicated to a gravity cache (PostgreSQL, SDSC node)
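The capture/apply split described above can be sketched as follows (illustrative Python, not the actual GMR implementation; all class and method names are hypothetical):

```python
# Sketch of the capture/apply pattern: a capture adapter reads
# type-specific chunks from a source, the capture service buffers and
# flow-controls them, and an apply adapter writes them to a target.
from collections import deque

class CaptureAdapter:
    """Yields data chunks from a (here: in-memory) source."""
    def __init__(self, source_rows, chunk_size=2):
        self.source_rows = source_rows
        self.chunk_size = chunk_size
    def chunks(self):
        for i in range(0, len(self.source_rows), self.chunk_size):
            yield self.source_rows[i:i + self.chunk_size]

class CaptureService:
    """Buffers chunks and controls flow with a bounded queue."""
    def __init__(self, adapter, max_buffered=4):
        self.adapter = adapter
        self.buffer = deque(maxlen=max_buffered)
    def fill(self):
        for chunk in self.adapter.chunks():
            self.buffer.append(chunk)
    def drain(self):
        while self.buffer:
            yield self.buffer.popleft()

class ApplyAdapter:
    """Applies chunks to the target (a list standing in for a DBMS)."""
    def __init__(self, target):
        self.target = target
    def apply(self, chunk):
        self.target.extend(chunk)

# Replicate a toy "gravity dataset" from source to target.
source = [("station1", 9.81), ("station2", 9.79), ("station3", 9.80)]
target = []
capture = CaptureService(CaptureAdapter(source))
capture.fill()
applier = ApplyAdapter(target)
for chunk in capture.drain():
    applier.apply(chunk)
print(target == source)  # True: source replicated to target
```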
Building the BIRN Portal
Schematic overview of the layered software architecture leveraging Grid middleware technologies to link users to distributed resources
Application Use Cases
• Different classes of I/O
  • Read and/or generate large, individual files: “traditional” supercomputing applications
  • Read large data collections: e.g., digital sky surveys, system log files
  • Database applications: e.g., digital sky surveys, Protein Data Bank
• Remote vs. local data
  • Compute engines remote from data archives or “data owners”
• Staging vs. prefetching vs. synchronous I/O
  • Ability to reserve disk vs. rewriting I/O calls vs. fast communications
Data Middleware
• SDSC Storage Resource Broker (SRB): see http://srb.sdsc.edu
• SKIDLkit (SDSC Knowledge and Information Discovery Lab Kit)
  • Led by Tony Fountain; see http://www.sdsc.edu/SKIDL
  • Web-services-based environment providing access to data sources and analysis tools
• SDSC Grid-Enabled Mediation Services (GEMS)
SDSC Storage Resource Broker (SRB)
• User applications access data through POSIX I/O and metadata querying interfaces
• SRB clients: mySRB, UNIX shell (S-commands), inQ, C/C++ libraries
• MCAT: the metadata catalog
• Remote proxies support server-side processing such as DataCutter and metadata extraction
• Uniform access to heterogeneous back-end storage:
  • Archives: HPSS, ADSM, UniTree, DMF
  • Databases: DB2, Oracle, Sybase, …
  • File systems: Unix, NT, Mac OS X, …
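The broker idea, one client interface over many storage back ends plus a metadata catalog, can be sketched as follows (illustrative Python; this is not the SRB API, and all names here are hypothetical):

```python
# Sketch of a storage broker: one POSIX-like interface dispatching to
# interchangeable back ends, with a small catalog in the MCAT role.
from abc import ABC, abstractmethod

class StorageDriver(ABC):
    @abstractmethod
    def read(self, path): ...
    @abstractmethod
    def write(self, path, data): ...

class InMemoryDriver(StorageDriver):
    """Stands in for any real back end (HPSS, DB2, a Unix FS, ...)."""
    def __init__(self):
        self.objects = {}
    def read(self, path):
        return self.objects[path]
    def write(self, path, data):
        self.objects[path] = data

class Broker:
    """Routes logical URIs like 'hpss:/x' to the registered driver and
    records user-defined metadata for attribute-based discovery."""
    def __init__(self):
        self.drivers, self.catalog = {}, {}
    def register(self, scheme, driver):
        self.drivers[scheme] = driver
    def put(self, uri, data, **metadata):
        scheme, path = uri.split(":", 1)
        self.drivers[scheme].write(path, data)
        self.catalog[uri] = metadata
    def get(self, uri):
        scheme, path = uri.split(":", 1)
        return self.drivers[scheme].read(path)
    def query(self, **attrs):
        return [u for u, m in self.catalog.items()
                if all(m.get(k) == v for k, v in attrs.items())]

broker = Broker()
broker.register("hpss", InMemoryDriver())
broker.register("unix", InMemoryDriver())
broker.put("hpss:/sky/img1", b"...", survey="digital-sky")
broker.put("unix:/logs/day1", b"...", survey="syslog")
print(broker.query(survey="digital-sky"))  # ['hpss:/sky/img1']
```

The client never names a back end directly; adding a new archive type means registering a new driver, not changing applications.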
SKIDLkit and the LTER Project: Current Status
• Work with four key LTER sites (NTL, VCR, AND, JRN)
• Extend to other sites, and implement as Grid services
• Based on Apache Tomcat 4.1.24, Apache Axis 1.1, JDK 1.4
• Generic service wrappers using Java & JDBC:
  • For the Oracle database at the NTL site, the MySQL database at the VCR site, and the SQL Server database at the AND site
  • Using each site’s EML (Ecological Metadata Language) config file
• Designed a simple XML standard to unify climate data expression across the four LTER sites
• Access to some data mining tools
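A unifying climate-data format of the kind described might look like this (illustrative Python sketch rather than the project's Java wrappers; the element names, field names, and unit conversions are hypothetical, not the project's actual standard):

```python
# Normalize heterogeneous per-site climate records into one simple XML
# form, as the SKIDLkit wrappers do across LTER sites.
import xml.etree.ElementTree as ET

# Two sites expose the same measurement under different column names
# and units (Fahrenheit at one hypothetical site, Celsius at another).
ntl_row = {"DATE_OBS": "2003-09-01", "AIRTEMP_F": 68.0}
vcr_row = {"obs_date": "2003-09-01", "air_temp_c": 20.0}

def to_unified_xml(site, date, temp_c):
    """Emit one record in the shared (hypothetical) schema."""
    rec = ET.Element("climateRecord", site=site)
    ET.SubElement(rec, "date").text = date
    ET.SubElement(rec, "airTempC").text = f"{temp_c:.1f}"
    return rec

records = [
    to_unified_xml("NTL", ntl_row["DATE_OBS"],
                   (ntl_row["AIRTEMP_F"] - 32) * 5 / 9),
    to_unified_xml("VCR", vcr_row["obs_date"], vcr_row["air_temp_c"]),
]
for rec in records:
    print(ET.tostring(rec, encoding="unicode"))
```

Once every site emits the shared schema, downstream data mining tools can consume all sites' climate data through one parser.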
An International Computational Grid for Ecology and the Environment
[Map: SOAP servers (where web services are deployed) and database servers (where data sources are hosted) at LTER-AND (Corvallis, OR), LTER-NTL (Madison, WI), LTER-VCR (Charlottesville, VA), SDSC (La Jolla, CA), CAS/CNIC (Beijing, China), NARC (Tsukuba, Japan), and NCHC (Hsinchu, Taiwan). Sites communicate via SOAP/XML; databases are accessed via JDBC (JDBC/EML at some sites); sensor data, including a field-deployed web cam and underwater sensors, is accessed via HTTP]
SDSC Grid-Enabled Mediation Services (GEMS)
• Based on XML and XQuery (next generation of MIX, Mediation of Information using XML)
• Defined in terms of a set of services used at:
  • “Registration time”
    • Dataset registration, schema registration, ontology registration
    • Source content and capability related services: e.g., “term resolution” service, capability description service, …
  • “View definition time”
    • Data integration services, discovery services
  • “Query formulation time”
  • Query runtime
    • Dynamic binding of logical to physical resources
• Administrative services
  • Services to manage access controls, control replicas, …
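The split between registration time and query runtime, and in particular the dynamic binding of logical to physical resources, can be sketched as follows (illustrative Python; all names are hypothetical, not the GEMS interfaces):

```python
# Sketch: sources register replicas under a logical name at
# registration time; a query is bound to a physical replica only when
# it actually runs.
class MediatorRegistry:
    def __init__(self):
        self.replicas = {}  # logical name -> list of physical URIs
    def register(self, logical, physical_uri):
        self.replicas.setdefault(logical, []).append(physical_uri)
    def bind(self, logical, prefer=None):
        """Dynamic binding: pick a physical resource at query time,
        optionally preferring, e.g., a nearby site."""
        candidates = self.replicas[logical]
        if prefer:
            for uri in candidates:
                if prefer in uri:
                    return uri
        return candidates[0]

reg = MediatorRegistry()
reg.register("gravity", "oracle://utep.edu/gravity")
reg.register("gravity", "postgresql://sdsc.edu/gravity_cache")

# At runtime, the same logical dataset binds to different physical
# resources depending on, e.g., locality.
print(reg.bind("gravity"))                 # oracle://utep.edu/gravity
print(reg.bind("gravity", prefer="sdsc"))  # postgresql://sdsc.edu/gravity_cache
```

The example reuses the gravity dataset/cache pair from the GMR slide: a query plan written against the logical name "gravity" runs unchanged whichever replica is chosen.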
Grid Services for Map Integration
Integrate geologic data from multiple sources using ontology and map assembly web services (to be deployed by USGS)
• Mediator, with Legend Generator and Map Assembler components
• Ontology service
• ArcIMS services wrapped in WSDL/SOAP
GEON: Information Integration
[Diagram: integration across sources including Chronos, PaleoStrat, Neptune, PaleoBiology, EGI, and PaleoGeography]
GeMS Components
• Client-facing query pipeline: verification, access control, and query rewrite; query optimization and plan generation; result assembly (e.g., map generation)
• Mediation services: Ontology Service, Registration Services, Metadata Registry, Deployment Services, Data Integration Services
• Grid services used: Community Authorization Service, Monitoring and Discovery Service, Network Weather Service, Replica Location Service
• Runs over distributed compute and storage resources: compute resources, databases, file systems
GeMS Request Processing Scenario
• The client submits a query to the mediator
• The GeMS query planner, consulting ontology service(s), produces a logical GeMS plan
• The logical plan is “bound” to a physical query plan
• The plan executes against wrapped sources, each exposing published data over private data, some with replicas
• Results flow back through result assembly to the client
Some Issues
• Function shipping versus data shipping
• Need to deal with different levels of access provided by different sites, for example:
  • Native API access to databases
  • JDBC
  • Web services (with full query vs. limited query access)
  • Read-only vs. read-write (dealing with temporary results, annotations)
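The function-shipping vs. data-shipping trade-off comes down to bytes moved; a toy cost comparison (illustrative Python, with hypothetical sizes):

```python
# Ship the function (send the query to the data, return only results)
# vs. ship the data (pull the whole collection to the compute site and
# filter locally). The sizes below are made up for illustration.
def bytes_moved_function_shipping(query_size, result_size):
    return query_size + result_size

def bytes_moved_data_shipping(dataset_size):
    return dataset_size

dataset_size = 10_000_000_000   # 10 GB remote collection
query_size = 1_000              # small query text
result_size = 5_000_000         # 5 MB of matching rows

fs = bytes_moved_function_shipping(query_size, result_size)
ds = bytes_moved_data_shipping(dataset_size)
print(fs < ds)  # True: for selective queries, function shipping wins
```

The catch, as the slide notes, is that function shipping requires the remote site to accept and execute queries (native API, JDBC, or full-query web services); a site offering only bulk or limited access forces the mediator back toward data shipping.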
Contact Info
Chaitan Baru
baru@sdsc.edu
SDSC Machine Room Data Architecture
• Networks: LAN (multiple GbE, TCP/IP); SAN (2 Gb/s, SCSI); WAN (30 Gb/s); SCSI/IP or FC/IP
• Compute: Blue Horizon, 4 TF Linux cluster, Sun F15K, DataStar, Power 4 DB, servers, VisEngine
• Disk: FC disk cache (400 TB); FC GPFS disk (100 TB, 200 MB/s per controller); local disk (50 TB)
• Archive: HPSS; silos and tape, 6 PB; 32 tape drives at 30 MB/s per drive; 1 GB/s disk to tape
• Summary: 0.5 PB disk, 6 PB archive, 1 GB/s disk-to-tape, optimized support for DB2 (Regatta) / Oracle (Sun F15K)
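The quoted ~1 GB/s disk-to-tape figure follows from the drive count; a quick check (illustrative Python, using only the numbers on the slide):

```python
# Aggregate tape bandwidth: 32 drives at 30 MB/s each, against the
# quoted ~1 GB/s disk-to-tape figure.
drives, per_drive_mb_s = 32, 30
aggregate_mb_s = drives * per_drive_mb_s
print(aggregate_mb_s)  # 960 MB/s, i.e. roughly the 1 GB/s quoted
```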