Fox
2
January 4, 2005
Project sponsors
Earth System Grid - DOE/SciDAC
Coupled Energetics and Dynamics of Atmospheric Regions - NSF/GEO/ATM
Virtual Solar-Terrestrial Observatory - NSF/CISE/SCI
Related DODS/OPeNDAP work - NASA and NCAR/HAO
Overview
Report on experience with data ‘systems’ and data ‘frameworks’
• CEDARWEB
• Earth System Grid
Compare and contrast success in terms of use(rs)
Technology integration - when and how does it work and scale?
Outline a merged approach for Virtual Observatory concept
CEDARWEB: heritage
CEDAR is a large scientific and technical community focusing on the Earth’s middle and upper atmosphere. The program features ground-based observing networks, models and integrative studies. It is funded by NSF and is in its third phase (third decade).
CEDAR data history
• Started as an incoherent radar database in 1983 as a tape archive (data back to 1966)
• Grew by the late 80’s, adding other instruments, models, indices
• Went on-line in the early 90’s (became a single-tiered data system)
• Web access in 1996, three versions of the interface
Holdings - some satellite data, geophysical indices, models (GCM, empirical, tides, etc.), ISRs, HF Radars, Digisondes, FPIs, IR Michelson Interferometers, Spectrometers, Airglow Imagers, All-Sky Cameras, LIDARs, Multi-Channel Photometers, MST Radars, MF Radars, LF Radars, Meteor Wind Radars, Campaigns, Presentations, Surveys, Jobs, Workshops, etc.
Community of 600+, 300+ registered users, ~100 active data users per year
• NCAR tasked with community support, and especially in the early days to ‘take care’ of the data and work with data providers and users
• Significant effort in catalogs, metadata, controlled vocabulary
• System has labored in getting past the code/mnemonic schemes of the past, base data format
CEDAR pre-web
Data query, selection and retrieval interface, without any integrated tools or ability to preview data before retrieving it.
CEDARWEB 3.x
Data query, selection and retrieval interface, with integrated tools, e.g. ability to plot (preview) data before retrieving it.
CEDARWEB 3.1
Ability to quickly plot data to assess suitability, quality, and produce a quick copy with some customization for a preliminary study.
Experience: CEDARWEB
Don’t just provide data, but also build in community information and ancillary information that is of value.
Inside CEDARWEB
• Rich metadata; categorized
• OPeNDAP for data access and transport
• MySQL for catalog and user records
• HTTPS and cookies for session authentication
• Script-enabled interface with plotting built in (ION) delivers HTML to browsers
• ‘Hides’ organizational data record structure (sort of)
• Low-level data product, but also high-level
• Disconnect between delivery of data and attributes
Today: framework is inside the data system!
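As a rough illustration of the catalog side of this design, the sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. The table layout, column names and sample rows are hypothetical; the slides do not show the actual CEDARWEB schema.

```python
import sqlite3

# Hypothetical catalog layout (assumption): the real CEDARWEB schema is not
# shown in the slides, and sqlite3 stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalog ("
    " instrument TEXT, kind TEXT, start_year INTEGER, end_year INTEGER)"
)
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        ("Example ISR", "ISR", 1966, 2004),
        ("Example FPI", "FPI", 1989, 2004),
        ("Example lidar", "LIDAR", 1995, 2001),
    ],
)


def find_instruments(kind, year):
    """Return names of instruments of a given kind with data covering `year`."""
    rows = conn.execute(
        "SELECT instrument FROM catalog"
        " WHERE kind = ? AND start_year <= ? AND end_year >= ?"
        " ORDER BY instrument",
        (kind, year, year),
    )
    return [name for (name,) in rows]
```

A query such as `find_instruments("ISR", 1970)` is the kind of selection a query interface could issue before handing the matching records to the data access layer.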
Experience: CEDARWEB
CEDARWEB has been developed and improved over more than 10 years of interaction with users, data providers, and a community steering committee. Each of these elements has directly contributed to changes in what services are provided, what information and materials are made available via the web site and what levels of authorization and authentication are required.
Biggest lesson: the systems approach has worked because of the heritage of the data collection, but users (esp. new or very experienced ones) see a barrier to entry and don’t understand where the system starts and stops.
http://cedarweb.hao.ucar.edu
Earth System Grid Overview
The goal of ESG is to make climate data – particularly climate model data – an easily accessible community resource. The project is funded by the SciDAC program: Scientific Discovery through Advanced Computing.
Enabling researchers to understand and make effective use of very large, distributed climate datasets is critical. The broad strategy is to develop a collection of server-side capabilities that minimize the amount of data movement.
Multiple interfaces to ESG will allow researchers to focus on science rather than issues of data transfer, format, and data set manipulation.
The foundation is Globus Grid technology.
ESG: U.S. Collaborations & Development
ORNL: Climate storage & computational resources
LANL: Next generation coupled models & computing
ANL: Computational grids & grid-based applications
USC/ISI: Computational grids & grid-based applications
NCAR: Climate change prediction and scenarios
LBNL: Climate storage facility
LLNL: Model diagnostics & inter-comparison
ESG leverages existing software and projects
• DODS/OPeNDAP: Distributed Oceanographic Data System (Unidata)
  Integrations of Globus GridFTP, DODS data access
• THREDDS: THematic Real‑time Environmental Distributed Data Services (Unidata)
• LAS: Live Access Server (NOAA Pacific Marine Environmental Laboratory)
  Works with CDAT, Ferret, GrADS, …
• CDAT: Climate Data Analysis Tools (PCMDI), includes CDMS: Climate Data Management System, VCDAT visualization
• Community Data Portal project (NCAR)
• NCL (NCAR)
• Globus Grid technology (ANL, ISI): GridFTP, CAS Community Access Portal
ESG: Requirements & Priority Matrix
Requirement                          ESG Developer  ESG Administrator  ESG User
ESG Services
  Framework                                H               H              H
  Automatic Installation                   L               L              H
Distributed Computing
  Authorization & Authentication           H               H              M
  Registration                             H               H              L
  Event Services                           L               L              M
  Task Management                          L               L              L
  Logging Services                         L               H              H
Data Systems
  Search and Discovery                     M               H              H
  Data movement (transport)                L               H              H
  Metadata framework                       H               H              M
  Collaboratories                          M               L              H
Tools
  Analysis                                 M               M              H
  Visualization                            L               L              H
  Collaboration                            M               M              H
L = LOW, M = MEDIUM, H = HIGH
The Earth System Grid
[Figure: ESG architecture diagram spanning NCAR, LBNL, LLNL, ISI, ANL and ORNL. Service groups shown: SECURITY services, METADATA services, FRAMEWORK services, TRANSPORT services, ANALYSIS & VIZ services, MONITORING services and DATA storage. Components labeled in the diagram: GSI, CAS server and CAS clients, MyProxy server and client, Tomcat, Axis, GRAM, auth metadata, RLS, MySQL, THREDDS catalogs, Xindice, OGSA-DAI, MCS, gridFTP server/client, HRM, openDAPg server, NCL openDAPg client, CDAT openDAPg client, LAS server, SLAMON daemon, disk, NERSC HPSS, ORNL HPSS, NCAR MSS.]
Community Data Portal
Free text search
Applications
Live Access
News
Authentication
THREDDS catalog
LAS/CDAT: Example of a Web-based Data Portal
Technology: Web Based (end user requirements)
• LAS, DODS, ESG (i.e., Globus), CDAT
Portal should hide/simplify the Grid for users
• Single sign-on
• Community-based authorization
• Simplified resource location
• Remote job submission, management
Accesses the ESG Grid Testbed
ESG: Example of a Web-based Data Portal (serving 40+ simulations: AMIP, CMIP, and PCM)
Metadata-centric view of ESG services
[Figure: METADATA SERVICES at the center, connected to the other ESG services. Services shown: USER AUTHENTICATION AND AUTHORIZATION, DATA TRANSPORT, SYSTEM MONITORING AND CONTROL, DATA SEARCH & DISCOVERY, DATA ANALYSIS & VISUALIZATION, DATA BROWSING. Metadata types labeled on the connections: access and authorization, location, logging, content, annotation & history, aggregation, cataloguing.]
ESG Metadata Services Architecture
3-layer architecture:
• Metadata Holdings: physical metadata content, stored in a system of relational and/or XML native databases
• Core Metadata Services: modules and libraries that mediate all access to the Metadata Holdings (insert, update, delete, query) – expose an API that hides the specific implementation of the databases and query languages
• High Level Metadata Services: system of applications that make use of the Core Metadata Services to fulfill a specific atomic functionality – will be invoked by external clients
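The three layers can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ESG implementation: the class names, method names and the in-memory backend are all hypothetical.

```python
from abc import ABC, abstractmethod


class MetadataHoldings(ABC):
    """Layer 1: physical metadata storage (relational or XML database)."""

    @abstractmethod
    def insert(self, key, record): ...

    @abstractmethod
    def query(self, predicate): ...


class InMemoryHoldings(MetadataHoldings):
    """Toy backend standing in for a real database (assumption)."""

    def __init__(self):
        self._records = {}

    def insert(self, key, record):
        self._records[key] = dict(record)

    def query(self, predicate):
        return [k for k in sorted(self._records) if predicate(self._records[k])]


class CoreMetadataService:
    """Layer 2: mediates all access; callers never see the backend or its query language."""

    def __init__(self, holdings):
        self._holdings = holdings

    def register(self, dataset_id, **attributes):
        self._holdings.insert(dataset_id, attributes)

    def find(self, **criteria):
        return self._holdings.query(
            lambda rec: all(rec.get(k) == v for k, v in criteria.items())
        )


def search_by_model(core, model_name):
    """Layer 3: a high-level service that fulfills one specific function."""
    return core.find(model=model_name)
```

Because the high-level function only talks to `CoreMetadataService`, the backend could be swapped for a relational or XML store without changing client code, which is the point of the middle layer.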
[Figure: ESG metadata services layer diagram. METADATA HOLDINGS: Replica Location Services, Metadata Cataloguing Services, XML DB, THREDDS catalogs. CORE METADATA SERVICES: metadata access (update, insert, delete, query), service translation library. HIGH LEVEL METADATA SERVICES, with modules labeled: metadata extraction, display, browsing, search/query & discovery, annotation, validation, aggregation, conversion, metadata & data registration. ESG CLIENTS API & USER INTERFACES, with areas labeled: publishing, search & discovery, administration, browsing & display, analysis & visualization.]
ESG Metadata Services Goal Functionality
Services responsible for the creation, management and utilization of metadata associated with geophysical data
Functionality:
• Metadata extraction (automatically, from files in different formats and according to various possible metadata standards)
• Metadata conversion (from one standard to another)
• Metadata aggregation (associated with data collections)
• Metadata annotation (manually by humans)
• Metadata validation (basic quality control of metadata)
• Registration (population of metadata holdings)
• Harvesting (combination of metadata from different repositories)
• Metadata browsing and display (for humans)
• Search and discovery of data through metadata
• Metadata query (by agents or clients for data analysis and visualization)
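Two of the functions listed above, conversion and validation, are simple enough to sketch. The field names and the two "conventions" below are invented for illustration and do not correspond to any real metadata standard or to ESG code.

```python
# Hypothetical field mapping between two made-up metadata conventions (assumption).
A_TO_B = {"title": "dataset_name", "inst": "instrument", "t_start": "start_time"}

# Fields that the target convention is assumed to require.
REQUIRED_B = ("dataset_name", "instrument", "start_time")


def convert(record_a):
    """Metadata conversion: rename known fields, pass unknown ones through unchanged."""
    return {A_TO_B.get(key, key): value for key, value in record_a.items()}


def validate(record_b):
    """Metadata validation: return the required fields that are missing or empty."""
    return [name for name in REQUIRED_B if not record_b.get(name)]
```

A registration step could call `convert` and then refuse to populate the holdings when `validate` returns a non-empty list.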
ESG Metadata Services Current Development
Currently have in production the following technologies:
• Replica Location Services: database to manage and index multiple copies of the same data stored at different centers
• Metadata Cataloguing Services: relational database to store scientific metadata (developed for high energy physics and geophysical data)
• XML native (**) and SQL databases
• THREDDS (by Unidata): system for hierarchical cataloguing of datasets and associated metadata (http://www.unidata.ucar.edu/projects/THREDDS)
• NcML (NetCDF Markup Language): XML language for encoding of metadata associated with data in netCDF format (and more…)
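The idea behind a replica location service, one logical name resolving to several physical copies, can be sketched as follows. This is a toy in-memory catalog, not the Globus RLS API; the class, its methods and the example.org host names are assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


class ReplicaCatalog:
    """Maps a logical file name to the physical copies held at different centers."""

    def __init__(self):
        self._replicas = defaultdict(list)

    def register(self, logical_name, physical_url):
        """Record a physical copy, ignoring exact duplicates."""
        if physical_url not in self._replicas[logical_name]:
            self._replicas[logical_name].append(physical_url)

    def lookup(self, logical_name):
        """Return all known physical URLs for a logical name."""
        return list(self._replicas.get(logical_name, []))

    def pick(self, logical_name, preferred_host):
        """Prefer a replica on `preferred_host`, else fall back to the first one."""
        urls = self.lookup(logical_name)
        for url in urls:
            if urlparse(url).hostname == preferred_host:
                return url
        return urls[0] if urls else None
```

A client near one center could call `pick` with that center's host name so that a nearby copy is used when one exists.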
ESG Metadata Policy
Premise: geophysical sciences are too broad and complex to impose a single, all-encompassing metadata standard to capture the relevant information for all datasets, projects, instruments, scientists
• ESG will not mandate use of any metadata schema or convention
• Allow data providers, scientists to use their metadata of choice, provide technologies and tools to store and access metadata through common services (MCS, XML DB, THREDDS catalogs)
• Encourage development and reuse of a limited set of domain-specific standards (climate data, radar data, airborne instrumentation etc.), encoding in XML (according to community developed schemas), interoperability and combination of schemas (XML namespaces and RDF-based ontologies - developed but not used)
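Combining schemas through XML namespaces can be illustrated with Python's standard ElementTree module. The namespace URIs and element names below are made up for the example; they are not ESG or community schemas.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs and all element names are invented for illustration.
NS = {
    "clim": "http://example.org/schema/climate",
    "radar": "http://example.org/schema/radar",
}

DOC = """
<dataset xmlns:clim="http://example.org/schema/climate"
         xmlns:radar="http://example.org/schema/radar">
  <clim:variable>air_temperature</clim:variable>
  <radar:site>Example Station</radar:site>
</dataset>
"""


def read_fields(xml_text):
    """Pull one field from each schema out of a single combined record."""
    root = ET.fromstring(xml_text)
    return {
        "variable": root.findtext("clim:variable", namespaces=NS),
        "site": root.findtext("radar:site", namespaces=NS),
    }
```

Each domain keeps its own vocabulary, and a single record can carry elements from both without name clashes.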
OPeNDAP for ESG II
Since ~1995, DODS has been based on HTTP and a CGI-style architecture
Two concerns:
• Application support and performance of HTTP
• Housekeeping abilities of the CGI architecture
Solution: evolve OPeNDAP, the discipline-neutral aspect of DODS
OPeNDAP ctd.
• Data transport protocol and access protocol separated
• Revised server architecture
• Address Grid-style authentication
• Memory management
• Exception handling
• All these changes and retain interoperation with HTTP and CGI
• Advanced requirements: URL should support more than one dataset, or object, i.e. aggregation
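For context, a classic HTTP-based DODS/OPeNDAP (DAP2) request encodes the response type as a URL suffix and the requested subset as a constraint expression after the `?`. The sketch below builds such URLs; the helper names and the example.org dataset are hypothetical, and it does not attempt the multi-dataset aggregation syntax, which the slides do not specify.

```python
def hyperslab(var, *ranges):
    """Build a projection such as temp[0:1:9][5:1:5] from (start, stride, stop) triples."""
    dims = "".join(f"[{start}:{stride}:{stop}]" for start, stride, stop in ranges)
    return f"{var}{dims}"


def dap_url(dataset_url, response, *projections):
    """Append a response suffix (e.g. dds, das, ascii, dods) and a constraint expression."""
    url = f"{dataset_url}.{response}"
    if projections:
        url += "?" + ",".join(projections)
    return url


# Example: ask for all of `time` and the first ten values of `temp` as ASCII.
EXAMPLE = dap_url(
    "http://example.org/dods/sample.nc",  # hypothetical dataset URL
    "ascii",
    "time",
    hyperslab("temp", (0, 1, 9)),
)
```

Since one URL of this form names exactly one dataset, supporting aggregation means extending the URL syntax, which fits the later remark that the URL syntax is more complex.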
OPeNDAP 3.x vs OPeNDAP-g Architecture
OPeNDAP 3.x:
• Simple and easy to install
• One CGI process per URL request
• Limited memory management – external
• Limited scalability
• Limited status reporting to web server
• Returns data stream from one format
OPeNDAP-g:
• Standalone server or httpd module
• Can manage multiple daemon processes
• Strong memory management – internal
• Reuse processes, scales
• Coupled to OPeNDAP server for status
• Returns multiple formats in a single stream, multiple protocols
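The difference between starting one process per request and reusing a fixed set of workers can be sketched with a worker pool. This is only an analogy under stated assumptions: it uses Python threads rather than server daemon processes, and the function names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle(request_id):
    """Pretend to serve one request; report which worker handled it."""
    return request_id, threading.current_thread().name


def serve_with_pool(request_ids, workers=2):
    """Serve many requests with a fixed set of reusable workers."""
    with ThreadPoolExecutor(max_workers=workers, thread_name_prefix="worker") as pool:
        # pool.map preserves the order of the incoming requests.
        return list(pool.map(handle, request_ids))
```

However many requests arrive, at most `workers` distinct workers are ever created, so per-request start-up cost is avoided and resource use stays bounded.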
Status
• Refactor core classes to remove http/libwww, etc.
• Operational/production release of standalone OPeNDAP server (no dependence on web server)
• Multi-protocol support: file, http, GridFTP, ftp, etc.
• Re-architected for aggregation support and performance
• Run OPeNDAP server as a client to GridFTP server
• Portal application client in production, prototype of netCDF client operational
• Authentication is handled outside OPeNDAP server
• URL syntax is more complex
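Multi-protocol support usually comes down to dispatching on the URL scheme. The sketch below shows that pattern only; the handler functions are placeholders that return a description instead of moving data, and none of the names come from the OPeNDAP code base.

```python
from urllib.parse import urlparse


# Placeholder handlers (assumption): a real server would open the resource here.
def _from_file(parts):
    return f"read local file {parts.path}"


def _from_http(parts):
    return f"HTTP GET from {parts.netloc}"


def _from_gridftp(parts):
    return f"GridFTP transfer from {parts.netloc}"


def _from_ftp(parts):
    return f"FTP transfer from {parts.netloc}"


HANDLERS = {
    "file": _from_file,
    "http": _from_http,
    "gsiftp": _from_gridftp,  # URL scheme conventionally used for GridFTP
    "ftp": _from_ftp,
}


def fetch(url):
    """Select a transport handler from the URL scheme."""
    parts = urlparse(url)
    try:
        handler = HANDLERS[parts.scheme]
    except KeyError:
        raise ValueError(f"unsupported protocol: {parts.scheme!r}") from None
    return handler(parts)
```

Adding another transport then means registering one more entry in `HANDLERS`, without touching the data access logic.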
ESG: Framework experience
ESG is a highly collaborative effort and will allow users to quickly access data storage facilities storing petabytes of raw or processed data in an application independent manner.
Payoffs of this distributed collaborative infrastructure have included:
• Distributed data-sharing, RLS works! SRM/HRM work! OPeNDAP-g works!
• Simplified data discovery of climate data, the work on metadata paid off! Scalability?
• Large-scale climate data processing and analysis via highly integrated portal
• Increased collaboration among climate research scientists, people use it!
• Aid in climate assessments and estimates of future climate variability and trends, IPCC!
Authentication and authorization have been a significant challenge
• GSI to CAS
• MyProxy - session based and seems to work well, more compatible with heterogeneous framework services
• SAML is working for multi-file batch transfer
ESG: Framework experience
• Privatization
  • Portal interface (and much of the holdings) are cloned
  • Closed communities are breeding dead-end alley developments, e.g. delivering netCDF
• Transport - GridFTP versus HTTP
  • Server to server
  • Very good performance
  • Depends on a very specific version of GridFTP server (stripped)
  • Clients are not as capable due to ‘weight’ of Globus, revert to HTTP
• Scalability and response times (data AND metadata)
  • Framework architecture supports re-layering for tuning
• Service monitoring to support the distributed collaborative infrastructure: need lots or all services to really make a production environment work
• Many Globus services not used (GRIS, MDS, GIIS, …)
Feeling lucky? Try out ESG by visiting the website at: http://www.earthsystemgrid.org
Success?
• Users are generally happy
• Exploited new technology components
• Integration - when and how does it work and scale?
  • XML, SQL, DODS, OPeNDAP and OPeNDAP-g, portals
  • P2P - clients are not as ready as we think
• Globus provides a suite of framework components, some are easier to integrate than others, some just don’t fit our use-cases and architecture
• Data framework - e.g. OPeNDAP has been extremely successful
User needs
In discussions with data providers and users, the needs are clear:
“Fast access to ‘portable’ data, in a way that works with the tools we have; information must be easy to access, retrieve and work with.”
Too often users (and data providers) have to deal with the organizational structure of the data sets, which varies significantly: data may be stored at one site in a small number of large files while similar data may be stored at another site in a large number of relatively smaller files. There is an equally large problem with the range of metadata descriptions for the data. Users often only want subsets of the data and struggle with getting them efficiently. One user expresses it as:
“(Please) solve the interface problem.”
Vision for building science cyberinfrastructure
• Use-case, then requirements
• Then derive architecture and choose technology components
• Build a working system for users from the start
• Get your funding source and community to commit to an evolving architecture
• If you choose a major framework technology, e.g. Globus, OPeNDAP, THREDDS, partner with them
• Data framework - e.g. OPeNDAP has been extremely successful
One paradigm
Goal - find the right balance of data/model holdings, portals and client software that a researcher can use without effort or interference, as if all the materials were available on his/her local computer.
E.g. the Virtual Solar-Terrestrial Observatory (VSTO) is proposed to be:
• a distributed, scalable education and research environment for searching, integrating, and analyzing observational, experimental and model databases in the fields of solar, solar-terrestrial and space physics
Comprises:
• a system-like framework which provides virtual access to specific data, model, tool and material archives containing items from a variety of space- and ground-based instruments and experiments, as well as individual and community modeling and software efforts bridging research and educational use
Virtual Observatory? Need better glue
• Basic problem: schema are categorized rather than developed from an object model/class hierarchy -> significantly limits non-human use. However, they all form the basis to organize catalog interfaces for all types of data, images, etc.
• This limits data systems utilizing frameworks and prevents frameworks from truly interoperating (SOAP, WSDL only a start)
• Directories, e.g. NASA GCMD, CEDAR catalog, FITS (flat) keyword/value pairs, are being turned into ontologies (SWEET, VSTO)
• Markup languages, e.g. ESML, SPDML, ESG/ncML are excellent bases
• Evolve, recast, merge (where appropriate) using formal processes, tools with intended use in mind - for interface specifications, reasoning, validation, etc. beyond the usual search and access
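The contrast between flat categories and a class hierarchy can be made concrete with a small sketch. The hierarchy below is hypothetical and far simpler than a real ontology such as those named above; it only shows why a program can answer "all radars" from a hierarchy but not from unrelated keywords.

```python
class Instrument:
    """Root of a small, hypothetical instrument hierarchy."""

    def __init__(self, name):
        self.name = name


class Radar(Instrument):
    pass


class IncoherentScatterRadar(Radar):
    pass


class MeteorWindRadar(Radar):
    pass


class OpticalInstrument(Instrument):
    pass


class FabryPerotInterferometer(OpticalInstrument):
    pass


def select(instruments, kind):
    """Return names of all instruments that are a `kind`, including its subclasses."""
    return [inst.name for inst in instruments if isinstance(inst, kind)]
```

With a flat keyword list, software would need a hand-maintained table to know that an incoherent scatter radar is a radar; here that relationship is part of the model and can be used for reasoning and validation.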
Summary
Basic success in both data systems and data framework approaches
Satisfying user and sponsor needs (from ‘just’ to ‘outstanding’)
Experience with Globus ranges from very good, to not ready for our need
Experience with OPeNDAP is very good, especially with core services
Scalability and performance require an adaptable architecture which is something system-level interfaces can still hide from the user
Challenge - to bring these attributes to a framework, i.e. in which the user is more exposed
Interoperate, interoperate, interoperate - interface, interface, interface
User interfaces still require significant HCI efforts
Metadata services are extremely important