DataLab-SAD-1.00
System Architecture Document
For the
NOAO Data Lab Project
Revised: March 4, 2015
Revision History

Date          Author          Changes / Comments                       Version
Sep 10, 2014  M. Fitzpatrick  First Draft                              0.1
Dec 01, 2014  M. Fitzpatrick  Restructured draft                       0.2
Dec 04, 2014  M. Fitzpatrick  Another restructure                      0.3
Dec 22, 2014  M. Fitzpatrick  More text                                0.4
Jan 12, 2015  M. Fitzpatrick  Fleshing out contents                    0.5
Jan 15, 2015  M. Fitzpatrick  Incorporated comments, arch description  0.6
Jan 21, 2015  M. Fitzpatrick  First complete draft                     0.7
Jan 26, 2015  M. Fitzpatrick  Typos, 4-level TOC                       0.71
Jan 27, 2015  M. Fitzpatrick  Included Ridgway comments, new logo      0.72
Jan 27, 2015  M. Fitzpatrick  Included Sec. 7 tracking from Mighell    0.80
Mar 04, 2015  K. Mighell      Final edits                              1.00
Table of Contents

1 Document Overview
  1.1 Purpose
  1.2 Document Scope
  1.3 Referenced Documents
  1.4 Key Concepts
    1.4.1 Large Catalogs
    1.4.2 Data Publication and Data Services
    1.4.3 Virtual Storage
    1.4.4 Compute Services
    1.4.5 Task Containers
    1.4.6 FUSE Filesystems
    1.4.7 Visualization
    1.4.8 Distributable Data Lab Components
  1.5 Abbreviations and Acronyms
  1.6 System Context for the NOAO Data Lab
2 Software Architecture
  2.1 Infrastructure Architecture
    2.1.1 Presentation Layer
      2.1.1.1 Astronomer's Desktop Tools
      2.1.1.2 Data Lab Operations Tools
    2.1.2 Public Services Layer
    2.1.3 Private Services Layer
    2.1.4 Data Access Services Layer
      2.1.4.1 TAP
      2.1.4.2 SIA/SCS/SSA
      2.1.4.3 VOSpace
      2.1.4.4 SQL Service
    2.1.5 Resource Layer
      2.1.5.1 External Resources
  2.2 Component Descriptions
    2.2.1 Authentication (Services Layer)
    2.2.2 Query Manager (Services Layer)
    2.2.3 Job Manager (Services Layer)
    2.2.4 Virtual Storage Manager (Services Layer)
    2.2.5 Resource Resolver Interface (Services Layer)
    2.2.6 Public Repository (Services Layer)
    2.2.7 Private Repository (Services Layer)
    2.2.8 Operations Monitor (Services Layer)
    2.2.9 Data Access Services (Data Access Layer)
    2.2.10 SQL Service (Data Access Layer)
3 Software Deployment
  3.1 Client Software
  3.2 Content Servers
    3.2.1 Large Catalogs
    3.2.2 NSA Proxy/SIA Service
    3.2.3 Survey/PI Data Access Services
  3.3 Storage Servers
  3.4 Compute Servers
  3.5 MyDB Server
  3.6 Data Lab Services Server
4 Distributable Data Lab Components
  4.1 Software Packaging and Distribution
  4.2 Virtual Storage
  4.3 Data Publication
  4.4 Processing Tools and Services
  4.5 An Example
5 System Interfaces
  5.1 Security
  5.2 Command-line Tools
  5.3 Web Portals
  5.4 Legacy Applications
  5.5 Data Query
  5.6 Processing Task Control
    5.6.1 Task Containers
    5.6.2 Job Control
  5.7 Virtual Storage
6 Implementation Tools and Standards
  6.1 Implementation Languages
    6.1.1 Language Versions
  6.2 Development Platforms
  6.3 Software Development Standards
    6.3.1 Software Licensing
    6.3.2 Public Repository
    6.3.3 Private Repository
    6.3.4 Testing Framework
    6.3.5 Bug and Issue Tracking
  6.4 Web Interfaces
  6.5 Database Technologies
  6.6 Machine Virtualization
7 Requirements Tracking
  7.1 Core Data Lab Capabilities
  7.2 User-Provided Science Capabilities
Appendix I: Vocabulary / Acronyms Used
Appendix II: List of Figures
1 Document Overview
1.1 Purpose

This System Architecture Document for the NOAO Data Lab is intended to:

1. Provide a high-level conceptual design of a Data Lab system that satisfies all Operational and Science requirements.
2. Describe and define the components of the Data Lab, their implementation and interfaces.
3. Describe the interaction between components to show how the functional requirements of the system are satisfied.
1.2 Document Scope

The scope of this document is the entire Data Lab Project. This document will evolve over time as requirements and designs are finalized.
1.3 Referenced Documents

This document may reference additional documentation identified below.

[1] Science Use Cases (SUC)
[2] Science Requirements Document (SRD)
[3] Operational Concepts Document (OCD)
[4] Operational Requirements Document (ORD)
[5] Project Execution Plan (PEP)
1.4 Key Concepts

Throughout this document we may use several phrases or terms that refer to specific Data Lab components or activities. These are briefly explained here for context; a more detailed explanation is provided in the documents referenced in Section 1.3 above and in the descriptions given below.
1.4.1 Large Catalogs

The term Large Catalogs is used for a specific dataset requiring dedicated hardware to manage distributed query processing. Examples include the Dark Energy Survey (DES) Catalog, but the term generally refers to any database too large to fit on a typical modern desktop machine.
1.4.2 Data Publication and Data Services

The term Data Publication may be used to refer to datasets hosted in the Data Lab and served publicly through standard Virtual Observatory (VO) interfaces. These typically represent high-level data products (images, catalogs, spectra, time series, etc.) created by a Survey Team or individual PI.

The term Data Service may be used to refer to any web-service providing an interface to query and access a data collection. This may include Large Catalogs or private databases that use custom interfaces.
1.4.3 Virtual Storage

The term Virtual Storage is used to refer to the Data Lab services managing distributed storage of data through web interfaces. It is similar to Cloud Storage¹ but in the Data Lab is more closely associated with a service running at a particular location.
1.4.4 Compute Services

The term Compute Services refers to data processing elements of the Data Lab. They may be implemented as web services available as a RESTful interface (i.e. an HTTP service) or as a specific computational task executed as part of a larger workflow. Within the Data Lab these services are used to perform general transformation of data files (e.g., an image cutout service) or some specific analysis (e.g., to detect variability in a time series).
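As a sketch of how a RESTful Compute Service might be invoked, the following builds a GET request for a hypothetical image-cutout endpoint. The base URL and parameter names here are illustrative assumptions, not part of any Data Lab specification:

```python
from urllib.parse import urlencode

def cutout_url(base_url, ra, dec, size_deg):
    """Build a request URL for a hypothetical image-cutout Compute
    Service; the parameter names are illustrative only."""
    params = {"ra": ra, "dec": dec, "size": size_deg, "format": "fits"}
    return base_url + "?" + urlencode(params)

# An authorized client would then fetch this URL over HTTP.
url = cutout_url("https://datalab.example.org/svc/cutout", 150.1, 2.2, 0.05)
```

The same pattern applies to any service exposed as an HTTP interface: the client encodes its task parameters into a request and retrieves the transformed data as the response.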
1.4.5 Task Containers

A task container, specifically a Linux Container, is a virtualization method for running applications in isolated Linux systems (i.e. the containers) on a single host machine. Unlike Virtual Machines (VMs) that virtualize an entire machine, containers are generally much lighter-weight and share elements of the host machine (e.g., binaries and libraries), allowing them to be started almost instantly. Specific task dependencies (e.g., language versions) can be bundled with the container, allowing for a more heterogeneous computing environment. Within the Data Lab, containers are used to package Compute Services and distributable software.
1.4.6 FUSE Filesystems

A FUSE (Filesystem in Userspace) filesystem is an operating system mechanism that allows a non-privileged user to mount a data source as a standard Unix filesystem. Within the Data Lab, this is used to mount a user's virtual storage to provide transparent access to their data without requiring applications to use the web-service protocol that implements the storage.
1.4.7 Visualization

Visualization is used to refer to the plotting or image display capabilities in the Data Lab and on the astronomer's desktop. Examples may include purpose-built web tools or the use of more general plotting or display tools that interact with Data Lab components. Within the Data Lab architecture, visualization is an application or a Compute Service and not specifically an architectural component.
1.4.8 Distributable Data Lab Components

Distributable components are those elements of the Data Lab that can be downloaded and installed on a user's machine. These can be services such as Virtual Storage that operate on local data, or tasks developed in the Data Lab that execute as command-line tools in the user's environment. Distributable components are described in Section 4 below.
1.5 Abbreviations and Acronyms

A complete list of acronyms and abbreviations used in this document is given in Appendix I.
¹ http://en.wikipedia.org/wiki/Cloud_storage
1.6 System Context for the NOAO Data Lab

As a Project, the Data Lab is developed within the Science Data Management (SDM) group at NOAO (soon to be the NOAO System Science and Data Center, NSSDC). The NOAO Science Archive (NSA) will continue to ingest and archive raw image data from KPNO and CTIO as well as the pipeline-reduced data (i.e. from the NOAO High-Performance Pipeline System, NHPPS) for the Mosaic, NEWFIRM, and DECam instruments. Data Lab will not replace the NSA; rather it will work alongside NSA or act as a client when pixel data are required.
Figure 1.3: Context Diagram for the NOAO Data Lab.
Within NOAO, the Survey Programs and individual Principal Investigator (PI) programs will continue to have their raw data archived by the NSA, but may also choose to import that data into the Data Lab for further analysis or as a means to share intermediate results or offer collaborative access to the data using the Virtual Storage services provided. Additionally, users may wish to publish their final data products (e.g., catalogs, image stacks, etc.) using the Data Publication services.
Within the wider astronomical community, Data Lab will be a consumer of data and services from other data centers, using both standard Virtual Observatory (VO) and proprietary protocols. Additionally, community PI users may request a Data Lab account in support of their science program, either importing data for analysis or to be used in conjunction with services provided by the Data Lab (e.g., catalog cross-‐matching, target lists for pixel data, etc.). Lastly, the distributable parts of the Data Lab running on a user’s local hardware can interact with similar components within the main Data Lab itself, for example, to sync data to virtual storage or to do local software development before uploading a workflow for a long-‐running job.
2 Software Architecture
2.1 Infrastructure Architecture
Figure 2.1: Data Lab software architecture diagram. Elements of the diagram are described in more detail below.
2.1.1 Presentation Layer

The components of the Data Lab's Presentation Layer are shown in blue in the top level of Figure 2.1. This layer consists of:

• Web-page interfaces to specific services, including:
  o A Data Lab login portal allowing users to authenticate themselves to the system and access resources assigned to them. This includes a control page to manage the user's information (password, contact addresses, etc.) once logged in.
  o Resource-specific web pages, including:
    § Web-based virtual storage browsers
    § Dataset-specific query interfaces, e.g., a custom interface for the DES catalog and query pages and descriptions of published datasets
    § Data publication tools
  o Compute-process status and monitoring pages.
• Command-line applications, including:
  o Desktop tools run on the Astronomer's Desktop that access Data Lab services remotely.
  o Science workflows created within the Data Lab. These include tools and scripted applications executed within the user's Data Lab login shell.
• Legacy software that may use Data Lab services either through existing standard interfaces (e.g., HTTP requests) or inclusion of Data Lab client code. These may be individual tasks or development environments/languages.
• User-developed code as described in Sec. 2.1.1.1 below.
2.1.1.1 Astronomer's Desktop Tools

Astronomer's Desktop Tools are defined to be the web interfaces, software distributions, or command-line tools described in Sec. 2.1.1 that are executed by users of the Data Lab. These tools may also be run within the Data Lab system (e.g., from the login-shell portal) and include any application or interface that uses Data Lab services but was developed by an individual astronomer or science collaboration.
2.1.1.2 Data Lab Operations Tools

Operations Tools are defined to be the web interface and command-line tools used by Data Lab Operators to manage and monitor the system. These tools will generally not be available to normal users and include:

• Utilities to manage user accounts
• System backup and restore commands
• Tools to monitor and control Data Lab components
• Utilities used in Data Publishing
• Tools for system logging and reporting
Tools with administrative functions that may be useful to users when managing Data Lab components running on their local machine are considered User Tools and may have different capabilities.
2.1.2 Public Services Layer

The components of the Data Lab's Public Services Layer are shown in cyan in the second level of Figure 2.1. This layer consists of:
• Authentication services. This is the primary interface for clients to identify themselves to the Data Lab. See Sec. 2.2.1.
• Authorization services. This is the primary interface for clients to obtain permission to access resources in the Data Lab. See Sec. 2.2.1. These functions will be provided by the Authentication Service.
• The Job Manager. This is the primary interface for clients to submit processing jobs to the Data Lab. See Sec. 2.2.3.
• The Query Manager. This is the primary interface for clients to submit queries to the Data Lab data resources. See Sec. 2.2.2.
• The Resource Resolver. This is the primary interface for clients to resolve URIs to service endpoints in the Data Lab. It may be replaced by a VO Publishing Registry in the future and serves as a Registry proxy in the interim. See Sec. 2.2.5.
Role: This layer exposes a client-facing API used by the Presentation Layer to access Data Lab components. It is built on internal services whose functionality may also be accessed via alternate APIs from lower-level services or interfaces; those internal interfaces are described below. Other services (e.g., data access or virtual storage) may also be public in the sense of being available to clients, but these are exposed as standard interfaces outside the context of the Data Lab (e.g., a VO Simple Image Access service).
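The Resource Resolver's role can be illustrated with a minimal in-memory sketch that maps resource identifiers to service endpoints. The identifiers and URLs below are invented for illustration and do not reflect actual Data Lab registry contents:

```python
# Minimal resolver sketch: maps IVOA-style resource identifiers to
# service endpoints. Entries are illustrative only; a real
# implementation would proxy a VO Publishing Registry.
_RESOURCES = {
    "ivo://datalab.noao/tap": "https://datalab.example.org/tap",
    "ivo://datalab.noao/vospace": "https://datalab.example.org/vospace",
}

def resolve(uri):
    """Return the service endpoint for a resource URI, or raise KeyError."""
    try:
        return _RESOURCES[uri]
    except KeyError:
        raise KeyError("unknown resource: %s" % uri)

endpoint = resolve("ivo://datalab.noao/tap")
```

Clients then construct protocol-specific requests against the returned endpoint, so only the resolver needs updating when a service moves.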
2.1.3 Private Services Layer

The components of the Data Lab's Private Services Layer are shown in yellow in the second level of Figure 2.1. This layer consists of:
• Private repository services. This is the internal repository used for Data Lab Operations or for development in progress by both Users and Developers, and is distinct from the public repository used for released software distributions. See Sec. 2.2.7.
• Operations monitoring and logging services. These are used primarily by Data Lab Operators to monitor system health (using internal interfaces), to log system activity and to generate usage reports. See Sec. 2.2.8.
Role: These are the primary interfaces for Data Lab Developers and Operators to work within the system. Users may be allowed use of or access to some private repositories but are not guaranteed access to all administrative services.
2.1.4 Data Access Services Layer

The components of the Data Lab's Data Access Services Layer are shown in magenta in the third level of Figure 2.1. This layer consists of:
• Simple VO Data Access Layer interfaces. These services are distinguished by the property that they permit a query of a service based on a minimal set of parameters, e.g., a radius/box around some celestial position, a bandpass, or data type. These services are standard VO protocols implemented as a minimal interface to all data services.
• Advanced VO Data Access Layer (DAL) interfaces. These services provide for more complex queries of data, e.g., an SQL-like query of a database schema, or a custom interface to a specific Data Lab resource. These services are appropriate only for catalog datasets; however, there is no guarantee that all catalogs will implement this interface. Advanced services may also include protocols which permit processing of data before returning the result of a query.
• Virtual Storage interfaces. These services provide a high-level interface to the virtual storage suitable for clients in the Presentation Layer (e.g., a web-based storage browser). They hide many of the details of the underlying protocol and present an abstract interface to the virtual storage system (thus allowing use of both local and remote resources transparently).
• Custom SQL database access interfaces. In certain instances it is preferable to bypass other data access interfaces in order to talk to the database directly (e.g., from legacy clients or low-level utility code). These interfaces will allow authorized clients to access the database on a read-only basis.
2.1.4.1 TAP

TAP (Table Access Protocol)² is a web-service protocol from the Virtual Observatory that provides access to collections of tabular data. Large or complex catalogs are typically stored in a relational database; TAP services allow clients to query any of the columns in any of the database tables, perform joins with user-supplied tables, and submit queries using a SQL variant (ADQL, the Astronomical Data Query Language³) with extension functions specific to astronomy. TAP services require no special authentication/authorization. Primary functions of TAP services are:
• To respond to data queries of complex tabular data collections,
• To respond to metadata queries to allow clients to determine the names of tables and columns to be used in queries,
² Dowler, P., Rixon, G., Tody, D., "Table Access Protocol, Version 1.0", http://ivoa.net/documents/TAP/, IVOA Recommendation, 27 March 2010
³ Ortiz, I., et al., "IVOA Astronomical Data Query Language, Version 2.0", http://ivoa.net/documents/ADQL/, IVOA Recommendation, 30 October 2008
• To respond to standard interface queries used to supply metadata about service availability (e.g., for operational monitoring services),
• To provide synchronous and asynchronous execution of queries.

Role: Within the Data Lab architecture TAP services provide a VO standard interface to expose data to legacy client applications through VO-compliant protocols.
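A synchronous TAP query is an ordinary HTTP request carrying REQUEST, LANG, and QUERY parameters, per the TAP standard. The sketch below builds such a request; the base URL and table name are placeholders, not real Data Lab endpoints:

```python
from urllib.parse import urlencode

def tap_sync_url(base_url, adql):
    """Build a synchronous TAP query URL following the TAP 1.0
    protocol; the service endpoint passed in below is a placeholder."""
    params = {"REQUEST": "doQuery", "LANG": "ADQL", "QUERY": adql}
    return base_url + "/sync?" + urlencode(params)

url = tap_sync_url("https://datalab.example.org/tap",
                   "SELECT TOP 10 ra, dec FROM des.catalog")
```

An asynchronous query would instead be POSTed to the service's /async endpoint and polled for completion, which is how long-running catalog queries avoid HTTP timeouts.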
2.1.4.2 SIA/SCS/SSA

The Virtual Observatory simple protocols support parameterized queries of data collections of a specific type, e.g., images (SIA, the Simple Image Access protocol), catalogs (SCS, the Simple Cone Search protocol), or spectra (SSA, the Simple Spectral Access protocol), amongst others. These interfaces are ideal for web-form clients or single-object queries in a synchronous execution environment. The Simple Cone Search (SCS) used for catalog queries returns results directly; the image and spectral forms permit a query and then return a result table with enough information to allow a client to decide which data to actually download in a second step. A celestial position is usually the key search parameter; however, results may additionally be constrained by other metadata such as the bandpass, time of observation, resolution element, etc.
These services require no special authentication/authorization and may optionally be layered upon an underlying TAP service.

Role: Within the Data Lab architecture the Simple services provide a VO standard interface to expose data to legacy client applications through VO-compliant protocols.
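A cone search illustrates how minimal these parameterized queries are: per the SCS standard, a position (RA, DEC) and search radius (SR) in decimal degrees are enough. The endpoint below is a placeholder:

```python
from urllib.parse import urlencode

def cone_search_url(base_url, ra_deg, dec_deg, sr_deg):
    """Build a Simple Cone Search query URL with the RA, DEC, and SR
    parameters defined by the SCS standard; the service endpoint used
    below is a placeholder."""
    params = {"RA": ra_deg, "DEC": dec_deg, "SR": sr_deg}
    return base_url + "?" + urlencode(params)

url = cone_search_url("https://datalab.example.org/scs", 180.0, -30.0, 0.1)
```

The service responds with a VOTable of matching catalog rows, which is why these interfaces suit web forms and single-object lookups.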
2.1.4.3 VOSpace

The VOSpace protocol will be used to implement the Data Lab virtual storage system. Clients will be able to access their storage using VOSpace protocols by communicating directly with the service. Transfers into/out of the space may be synchronous or asynchronous as allowed by the protocol; however, asynchronous jobs will be managed by the VOSpace service itself and not the Data Lab Job Manager.
This service uses the Authorization service to verify that the requesting client has permission to use the resource. The Storage Manager and Job Manager both use this service.
Role: Within the Data Lab architecture, the VOSpace service provides a standard interface to the user’s virtual storage space for legacy VO applications. Exposing the service implementation at this level additionally allows it to be packaged and exported for use outside the Data Lab.
2.1.4.4 SQL Service
The SQL service provides an abstract database interface that allows clients low-level access to query a database or process the results. Clients are presented with a uniform interface regardless of the backend database used; however, the abstraction supports only the common intersection of capabilities available in the databases used within the Data Lab. Direct access to the database with this service is useful in the following scenarios:
• A client application wishes to step through a query result row-by-row,
• The entire result set must be copied to another database in the most efficient way possible,
• The results are to be serialized into a format other than what the VO services provide,
• A query is more complex than can be handled by a VO data service.
This service uses the Authorization service to verify that the requesting client has permission to use the resource.
Role: Within the Data Lab architecture, the SQL service provides direct (but authorized) access to databases for use by client code that needs a low-level database interface. In some cases, this interface will be used to optimize higher-level Data Lab functionality (e.g., saving query results to a user’s personal database).
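The row-by-row access scenario from the list above can be sketched with a generator over a DB-API cursor. SQLite is used here only as a stand-in backend; the table name and columns are invented for illustration.

```python
import sqlite3

def iter_rows(conn, query, params=()):
    """Step through a query result one row at a time rather than
    fetching the whole result set into memory."""
    cur = conn.execute(query, params)
    while True:
        row = cur.fetchone()
        if row is None:
            break
        yield row

# In-memory stand-in for a Data Lab catalog database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER, ra REAL, dec REAL)")
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)",
                 [(1, 180.1, -30.2), (2, 181.5, -29.8)])
rows = list(iter_rows(conn, "SELECT id, ra FROM objects WHERE ra > ?", (181.0,)))
```

The same cursor-based pattern supports the bulk-copy scenario: rows can be streamed from one database connection into `executemany` on another without materializing the full result set.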
2.1.5 Resource Layer
The components of the Data Lab’s Resource Layer are shown in grey in the lower part of Figure 2.1. This layer consists of:
• Databases used for published data collections and operational purposes. Public databases are those used to support Data Services; Private databases are used for internal Data Lab operations (e.g., logging systems, job control, etc.).
• Storage resources. The development hardware system will have available up to 400 TB of disk storage for use with virtual storage of user files and database storage of published data. We expect the hardware allocation to be re-provisioned prior to full production.
• Compute resources. The development hardware system will have up to XYZ cores available to support database, compute, visualization and storage operations. We expect the hardware allocation to be re-provisioned prior to full production.
• External data and compute resources. See Sec 2.1.5.1.
2.1.5.1 External Resources
External Resources are defined to be resources that may be accessed in a workflow or from a Data Lab component, but are not maintained or managed by the Data Lab directly.
2.1.5.1.1 NOAO Science Archive
Data Lab will use the NOAO Science Archive (NSA) as the primary source of raw and pipeline-reduced DECam, MOSAIC and NEWFIRM image data. The NSA provides a Simple Image Access interface as well as other custom interfaces that may be used to query for images or to access a specific image. Clients of the NSA include core Data Lab components and tools used in user-defined analysis.
Role: Within the Data Lab architecture, the NSA provides image data that may be used in science workflows that require access to source pixels. Full-sized images may be sent to Compute Services for additional processing (e.g., cutouts) before results are returned to the user.
2.1.5.1.2 External Data Services (VO Data)
External data services refer to all non-NOAO data sources that may be used within the Data Lab. These may be either VO data services or data available through custom interfaces requiring specialized clients (e.g., the Sesame name resolver). Clients for these services include core Data Lab components (e.g., compute or visualization services) and tools used in user-defined analysis.
Role: Within the Data Lab architecture, external data services are used to supply additional data needed by science workflows or core Data Lab components.
2.1.5.1.3 External Compute Services (VO Services)
External compute services refer to all non-NOAO compute services that may be used within the Data Lab. These may be either standard VO services (e.g., DataLink4) or services available through custom interfaces requiring specialized clients that perform a specific function (e.g., catalog cross-match). Clients for these services include core Data Lab components (e.g., compute or visualization services) and tools used in user-defined analysis.
Role: Within the Data Lab architecture, external compute services are used to perform some action on data used in science workflows or by core Data Lab components.
2.2 Component Descriptions
This section provides a more detailed description of the individual components listed in Section 2.1 above.
2.2.1 Authentication (Services Layer)
The Authentication Service implements the following functionality:
• Maintains a database of users and account information (login name, password, contact info, etc.).
• Provides an interface allowing users to retrieve or modify their account information.
• Provides an interface that allows an application to present a login/password and receive an authorization credential.
• Provides an interface allowing users to create named Groups of users.
• Provides an interface allowing the owner of a Group to add or delete users from the Group.
• Provides an interface allowing the owner of a Group to set or get access permissions to a specific resource on behalf of all members of a Group.
• Provides an interface allowing the owner of a Group to transfer ownership to a member of the Group.
• Provides an interface allowing clients to get a list of all members of a Group, or of all Groups to which the user belongs.
• Verifies that a user (identified by a credential) is authorized to access a specific resource.
• Verifies that a user (identified by a credential) is a member of a specified Group and thus has all the privileges of that Group.
• Provides an interface that allows privileged users (i.e. Data Lab Operators, identified with a secure credential) full access to all user records.
A detailed description of the Authentication Service design and requirements is to be provided in a future document.
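The Group-based authorization checks described above can be modeled as a small data structure. This is a toy sketch, not the service design: the class, group and resource names are all invented for illustration.

```python
class AuthService:
    """Toy model of Group-based resource authorization."""

    def __init__(self):
        self.groups = {}        # group name -> set of user names
        self.permissions = {}   # resource -> set of group names allowed

    def create_group(self, name, members):
        self.groups[name] = set(members)

    def grant(self, resource, group):
        self.permissions.setdefault(resource, set()).add(group)

    def is_authorized(self, user, resource):
        # A user may access a resource if any Group they belong to
        # has been granted permission on that resource.
        allowed = self.permissions.get(resource, set())
        return any(user in self.groups.get(g, set()) for g in allowed)

auth = AuthService()
auth.create_group("survey-team", ["alice", "bob"])
auth.grant("vos://survey/results", "survey-team")
```

In the real service the user identity would come from a presented credential rather than a bare name, and the group and permission tables would live in the user database.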
2.2.2 Query Manager (Services Layer)
The Query Manager implements the following functionality:
• Presents a uniform HTTP interface to functionality available in the Large Catalog data services:
o Table Access Protocol (TAP)
o SQL Service
• Composes the query into a form suitable for the requested service.
• Validates that the user is authorized to access the requested service.
• Submits the query to the requested service for synchronous or asynchronous execution.
• Maintains the state of submitted asynchronous jobs, allowing clients to poll for status or results from completed queries.
• Serializes the query results into a user-specified format (e.g., HTML, CSV/TSV/ASV, FITS).
• Returns results to the calling application, or orchestrates the process to save results to Virtual Storage or a MyDB personal database.
4 DataLink is a VO specification for connecting metadata discovered about a dataset to the data, metadata products, or web services that can act upon the data. Examples include finding links to preview or progenitor datasets, or services that can extract cutouts, re-orient / re-scale images, etc.
The Query Manager does not provide an interface to the Simple Cone Search (SCS) services because of their mandatory synchronous execution. The Query Manager may call the Job Manager to process asynchronous query jobs.
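The serialization step above can be sketched for the delimited-text formats; the function name and dispatch scheme are illustrative only, and a real implementation would add writers for FITS, HTML, etc.

```python
import csv
import io

def serialize(rows, columns, fmt="csv"):
    """Serialize query results to a delimited-text format.

    Only CSV/TSV are shown here; other user-specified formats
    (FITS, HTML) would dispatch to their own writers.
    """
    delim = {"csv": ",", "tsv": "\t"}[fmt]
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delim, lineterminator="\n")
    writer.writerow(columns)   # header row
    writer.writerows(rows)     # data rows
    return buf.getvalue()

text = serialize([(1, 180.1), (2, 181.5)], ["id", "ra"], fmt="csv")
```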
A detailed description of the Query Manager design and requirements is to be provided in a future document.
2.2.3 Job Manager (Services Layer)
The Job Manager implements the following functionality:
• Validates that the user is authorized to submit a job for execution.
• Provides an interface to queue jobs for execution.
• Provides an interface to determine the status of all queued and running jobs.
• Provides an interface to set or change the properties of a queued job (e.g., execution time).
• Provides an interface to change the state of a queued job (e.g., to begin execution immediately).
• Provides an interface to remove jobs from a queue, or to kill a running job.
• Provides an interface to execute jobs on remote servers in either a synchronous or asynchronous manner.
• Creates the compute job on the remote server.
• Sets the parameters for the remote compute job.
• Collects results from the remote compute job and presents them to the calling client.
• Starts and/or stops the compute job on the remote server.
• Maintains a history of submitted jobs and their status for the service-monitoring task.
The Job Manager provides a simplified job-control interface for clients and implements the Universal Worker Service (UWS)5 design pattern internally to manage individual jobs.
A detailed description of the Job Manager design and requirements is to be provided in a future document.
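The UWS design pattern mentioned above is essentially a job-phase state machine. The sketch below is a simplified subset of the UWS lifecycle (the full pattern also defines phases such as HELD and SUSPENDED); the class and transition table are illustrative, not the Job Manager design.

```python
# Legal phase transitions, simplified from the UWS job lifecycle.
TRANSITIONS = {
    "PENDING":   {"QUEUED", "ABORTED"},
    "QUEUED":    {"EXECUTING", "ABORTED"},
    "EXECUTING": {"COMPLETED", "ERROR", "ABORTED"},
    "COMPLETED": set(),
    "ERROR":     set(),
    "ABORTED":   set(),
}

class Job:
    """A UWS-style job whose phase may only advance along legal edges."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.phase = "PENDING"

    def advance(self, new_phase):
        if new_phase not in TRANSITIONS[self.phase]:
            raise ValueError(f"illegal transition {self.phase} -> {new_phase}")
        self.phase = new_phase

job = Job("query-42")
job.advance("QUEUED")
job.advance("EXECUTING")
job.advance("COMPLETED")
```

Clients polling the Job Manager for status would simply read back the current phase string, which is what the UWS REST interface exposes.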
2.2.4 Virtual Storage Manager (Services Layer)
The Virtual Storage Manager implements the following functionality:
• Validates that the user is authorized to access the requested virtual storage space. • Provides an interface to browse the contents of the space. • Provides an interface to move data into the storage space from a user’s desktop. • Provides an interface to move data into the storage space from an external URL. • Provides an interface to directly access data stored in the space, e.g., for file download or transfer. • Provides an interface to move data between separate instances of the space. • Provides an interface to set or display VOSpace properties, capabilities and views.
5 http://www.ivoa.net/documents/UWS
A detailed description of the Virtual Storage Manager design and requirements is to be provided in a future document.
2.2.5 Resource Resolver Interface (Services Layer)
The Resource Resolver implements the following functionality:
• Provides an interface to allow clients to retrieve information about a Data Lab service given a resource URI. Clients may request a single value or the entire record.
• Provides an interface to allow services (or Operators) to register new resource records describing the service.
• Provides an interface to allow services (or Operators) to remove their resource records once they become invalid (i.e. the service moves or shuts down).
• Provides an interface to allow clients to list all available services.
• Provides an interface to allow clients to search for services by keyword or service type.
Public data and compute services will be registered with the VO Registry to allow external clients to discover and access the service directly. The Resource Resolver is used primarily for services that are internal to the Data Lab (e.g., available compute servers or virtual storage instances) or that may be transient (e.g., VO data access services created as part of a VOSpace capability). This allows client software to access a service such as a VOSpace using a location-independent URI that the resolver will translate into a service URL endpoint for use by an application. External services running as distributed Data Lab components will register themselves when started to make the service known to other Data Lab tasks.
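The URI-to-endpoint translation can be sketched as a registry of resource records. The URI, endpoint and record fields below are hypothetical; real records would carry additional metadata (capabilities, availability, etc.).

```python
class ResourceResolver:
    """Sketch of location-independent URI -> service endpoint resolution."""

    def __init__(self):
        self.records = {}

    def register(self, uri, endpoint, svc_type):
        self.records[uri] = {"endpoint": endpoint, "type": svc_type}

    def unregister(self, uri):
        # Services remove their records when they move or shut down.
        self.records.pop(uri, None)

    def resolve(self, uri):
        # Translate a resource URI into a concrete service URL.
        return self.records[uri]["endpoint"]

    def search(self, svc_type):
        return [u for u, r in self.records.items() if r["type"] == svc_type]

resolver = ResourceResolver()
# Hypothetical URI and endpoint, for illustration only.
resolver.register("ivo://datalab.noao/vospace", "http://dldev1:8080/vospace", "vospace")
endpoint = resolver.resolve("ivo://datalab.noao/vospace")
```

A client built against the URI never needs to change when the service is redeployed to a different host; only the registered record does.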
A detailed description of the Resource Resolver design and requirements is to be provided in a future document.
2.2.6 Public Repository (Services Layer)
Data Lab will use GitHub (http://www.github.com) as the Public software and document repository. This repository will be used for all released software and documentation.
2.2.7 Private Repository (Services Layer)
Data Lab will use an internal instance of the GitLab (http://about.gitlab.com) Git repository management software as the Private repository. This repository will be used for software or documentation in development, as well as for operational data such as configuration files, deployment notes, etc.
2.2.8 Operations Monitor (Services Layer)
The Operations Monitor implements the following functionality:
• Provides an interface to add or remove services from monitoring.
• Provides an interface to display summary information from the Job Manager.
• Provides an interface to display the current status and availability of services being monitored.
• Regularly accesses each of the VO services under its control to determine if the service is responding correctly.
• Regularly accesses supporting software services (e.g., databases, web servers, etc.) to determine if the service is responding correctly.
• Sends alerts (in a TBD manner) to Data Lab Operators summarizing the list of any non-working services.
The VO data access services will provide VO Support Interface (VOSI) methods that the Operations Monitor will use to determine service availability; these services may be accessed directly by client applications as well.
A detailed description of the Operations Monitor design and requirements is to be provided in a future document.
2.2.9 Data Access Services (Data Access Layer)
The following Data Access services are provided within the architecture:
SCS (Simple Cone Search) – Provides a synchronous VO-standard positional-query interface to catalogs. Details are provided at http://www.ivoa.net/Documents/latest/ConeSearch.html
SIA (Simple Image Access) – Provides a synchronous VO-standard query interface to image data collections. Details are provided at http://www.ivoa.net/Documents/SIA
SSA (Simple Spectral Access) – Provides a synchronous VO-standard query interface to spectral data collections. Details are provided at http://www.ivoa.net/Documents/SSA
TAP (Table Access Protocol) – Provides a service protocol for general table data access using either synchronous or asynchronous access methods under a Universal Worker Service (UWS) interface. Details are provided at http://www.ivoa.net/Documents/TAP
VOSpace – Virtual storage services can optionally provide functionality to make data searchable using one or more of the above service types in addition to the direct access capabilities of the service itself.
Client applications may access these services directly; they may also be accessed by the Job Manager as part of a Compute Service.
Additional information about individual Data Access Services is given in Sec 2.1.4.
2.2.10 SQL Service (Data Access Layer)
The SQL Service implements the following functionality:
• Validates that the user is authorized to access the requested resource (when required).
• Provides an abstract read-only database API allowing client applications to:
o Submit SQL queries directly to the database,
o Step through a query result row-by-row,
o Copy a query result set to another database efficiently.
• Provides synchronous or asynchronous (UWS) job control for queries.
• Provides an interface to allow the upload of user-specified temporary tables to be used in a query (e.g., to perform a join operation using user data).
The SQL Service provides an alternate interface to databases for clients that do not wish to, or cannot, use a TAP service. The SQL Service does not provide support for ADQL functions used in queries or metadata discovery using the TAP_SCHEMA mechanism found in a TAP service. The VO Support Interfaces (VOSI) endpoints will be implemented as a means to inform clients about service availability and basic column information.
A detailed description of the SQL Service design and requirements is to be provided in a future document.
3 Software Deployment
The infrastructure for the Data Lab will be deployed to a mix of dedicated and shared hardware resources at the NOAO headquarters in Tucson. The deployment described here reflects a conceptual layout more than a specific hardware configuration because of the fluid nature of the shared hardware and expected future upgrades to the system.
Figure 2.4: The Data Lab deployment diagram. Arrows indicate the flow of requests and data in the system.
3.1 Client Software
Client software is any application that makes use of Data Lab services. Examples include:
• The Astronomer’s Desktop – This may include web pages running in a browser, command-line tools installed as part of a Data Lab software distribution, or legacy code used for analysis.
• A user-supplied analysis script – This includes Data Lab command-line tools running in the user’s login shell called from a scripting environment (e.g., C-shell/Bourne, Python, IRAF, IDL, etc.), or applications developed by the astronomer using programmatic interfaces.
• A visualization tool requesting data for display – Core Data Lab plotting or image display tools may act as clients for data services based on user-supplied query parameters.
• A higher-level Data Lab system component – The Query and Job Manager interfaces may act as clients for lower-level VO protocol services.
3.2 Content Servers
Content servers are the machines that host the data access services, i.e. the VO protocol services for the supported data types as well as the custom Data Lab interfaces for access to large catalogs. These machines provide read-only access to data available from the Data Lab.
3.2.1 Large Catalogs
Large catalogs will be served using a distributed database and will involve multiple machines to host the dataset. The master node is responsible for the primary interface to the data; there may be multiple worker nodes in the background to process individual queries on the partitioned data.
The master node is responsible for:
• Managing the query as either a synchronous or asynchronous job
• Distributing the query amongst the worker nodes
• Collating results of queries from worker nodes
• Responding to job-control requests from the Query Manager
Data Lab will use the QServ database system6 from LSST as the distributed database system for Large Catalogs. Because this system is still in development, we expect to re-deploy these services multiple times as it evolves. Additional information on the QServ system is available in Sec. 6.5.
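The master node's distribute/collate responsibilities follow a scatter-gather pattern, which can be sketched generically. This is not QServ code: the partitions are plain in-memory lists and the predicate stands in for a per-chunk query.

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_worker(partition, predicate):
    """Stand-in for a worker node scanning its chunk of a partitioned catalog."""
    return [row for row in partition if predicate(row)]

def distributed_query(partitions, predicate):
    """Master node: scatter the query to the workers, then collate results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_on_worker, p, predicate) for p in partitions]
        results = []
        for f in futures:
            results.extend(f.result())
    return results

# Three partitions of (id, magnitude) rows; select the bright objects.
partitions = [[(1, 20.5), (2, 21.0)], [(3, 19.9)], [(4, 22.3)]]
bright = distributed_query(partitions, lambda row: row[1] < 21.0)
```

In QServ the "workers" are separate machines holding spatially partitioned chunks of the catalog, and the collation step may also need to merge-sort or aggregate, but the control flow is the same.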
3.2.2 NSA Proxy/SIA Service
The NOAO Science Archive (NSA) currently provides a VO Simple Image Access (SIA) version 1 (v1) service, but as of this writing the prospect for an SIA version 2 (v2) compatible service is uncertain. While the current service provides a basic positional query capability, the SIA v2 service allows additional constraints as part of the standard query (e.g., bandpass or temporal constraints) that may prove necessary for some science cases. Both SIA v1 and v2 allow service-specific parameters to take advantage of native archive functionality.
Data Lab will present an SIA v2 interface to the NSA (as it will to all image services) in one of two ways:
• Using the native SIA v2 service (if it is available), • Using a proxy service to the NSA that takes advantage of an existing generic SQL-‐query interface to
the NSA to provide an SIA v2 façade interface
In either case, the query service will execute on a Data Lab content server that is connected over a socket to the NSA service and not directly on NSA hardware.
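The façade option amounts to translating SIA v2 query parameters into SQL against the archive's metadata tables. A minimal sketch follows; the table and column names are invented, and the positional constraint is a naive bounding box (a real service would use proper spherical geometry, including the cos(dec) correction).

```python
def sia2_to_sql(pos=None, band=None):
    """Translate a subset of SIA v2 parameters into a SQL WHERE clause
    against a hypothetical image-metadata table."""
    clauses, args = [], []
    if pos is not None:
        ra, dec, radius = pos   # CIRCLE <ra> <dec> <radius>, in degrees
        clauses.append("(ra BETWEEN ? AND ?) AND (dec BETWEEN ? AND ?)")
        args += [ra - radius, ra + radius, dec - radius, dec + radius]
    if band is not None:
        lo, hi = band           # wavelength interval, in metres
        clauses.append("(wave_min <= ? AND wave_max >= ?)")
        args += [hi, lo]
    where = " AND ".join(clauses) if clauses else "1=1"
    return f"SELECT * FROM images WHERE {where}", args

sql, args = sia2_to_sql(pos=(180.0, -30.0, 0.5))
```

The proxy would run such generated queries through the NSA's generic SQL interface and serialize the matching rows as an SIA v2 result table.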
3.2.3 Survey/PI Data Access Services
These services represent the NOAO Survey/PI data publication component of the Data Lab. They are individual data collections hosted by the Data Lab that present standard VO interfaces to their data as independent services (i.e. there is no global entry point into all the collections, just the individual services). These datasets are generally much smaller than Large Catalogs but may represent complex data collections, i.e. multiple services for the catalog, image and/or spectral data holdings in the collection. Each service in this case is independent of any others and so may be deployed on a different machine if needed.
6 Wang, D.L., Monkewitz, S.M., Lim, K.-T., Becla, J., “Qserv: A distributed shared-nothing database for the LSST catalog”, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–11, 12–18 Nov. 2011
Because clients access these services independently, their physical location is irrelevant; accessing the data as a coherent collection is the job of a knowledgeable client application, or of an advanced service such as DataLink7 that understands the links between the services (e.g., that a particular catalog object may have an associated spectrum or image cutout).
3.3 Storage Servers
The storage server manages the Data Lab Virtual Storage system, providing the central VOSpace interface in the system. As of this writing, and for development purposes, a disk system of ~400 TB is available on a shared-use basis; we expect this to be augmented to provide dedicated storage prior to public release of the Data Lab. The disk array is managed using the GPFS8 file system already in operation at NOAO.
3.4 Compute Servers
The compute server is a multi-CPU, multi-core system9 to be used for parallel execution of processing tasks in workflows (e.g., image cutouts or reprojections). Processes execute as containerized applications under a specific userid as described in Sec. 1.4.5; as such, each process will have access to a user’s virtual storage space (mounted in the container as a user filesystem) to provide file access for the task without requiring it to use specific Data Lab interfaces.
Multiple machines may be deployed to allow for greater capacity; each machine will have a modest amount of local disk that may be used for intermediate processing. The Job Manager is responsible for starting the process on the machine and shutting it down once processing is complete.
3.5 MyDB Server
The personal database server is configured as a large database machine and is responsible for handling the MyDB personal databases assigned to each user. Access to the MyDB tables requires authentication in the Data Lab. New tables in the database may be created from:
• A query result on either the Content Server or Large Catalogs,
• A saved result from a client query of external data resources (e.g., a VO service result),
• Data saved to virtual storage that uses a VOSpace capability to create a database table.
3.6 Data Lab Services Server
This server will host the bulk of the public Data Lab services, including:
• The Job Manager, responsible for managing tasks running on the Compute Server,
• The Query Manager, responsible for asynchronous queries of the Content Server or Large Catalogs,
• Any needed proxy services for the NOAO Science Archive,
• The Storage Manager, responsible for access to virtual storage,
• The Resource Resolver, responsible for resolving local resource URIs into service endpoints,
• The Authentication service, responsible for providing secure access to Data Lab services.
7 Dowler, P., et al., “DataLink, Version 1.0”, http://ivoa.net/documents/DataLink/, IVOA Recommendation 05 May 2014
8 http://en.wikipedia.org/wiki/IBM_General_Parallel_File_System
9 For development we are using a 16-CPU quad-core server with 16 GB of RAM and 16 TB of local disk.
4 Distributable Data Lab Components
The public services provided by the Data Lab, such as virtual storage or data publication, generally require system privileges and additional software in order to be deployed (e.g., an application server such as Tomcat and/or database backing), in addition to hardware adequate to support multiple users. However, some services can be simplified or packaged in a way that allows them to be distributed for use on a single-user machine with minimal installation requirements. This has a number of advantages:
• Users can create a local Data Lab environment for use in software development prior to executing a workflow on a full dataset within the Data Lab.
• Data Lab functionality can be exported to a user’s machine instead of requiring user data to be imported into the Data Lab for use, allowing components to run closer to the data on which they operate.
• Components can work together intelligently, e.g., virtual storage can be synchronized between multiple sites or data services on the user’s machine can be used transparently within a workflow executing in the Data Lab.
• User hardware can be leveraged to increase the effective computing capacity available.
The deployment of services within the Data Lab already involves multiple machines; the goal of distributable components is simply to extend this concept to services running on machines outside the primary NOAO data center.
4.1 Software Packaging and Distribution
Data Lab software will be distributed in three ways:
1. Public Repository: Users will be able to access sources for all software components from the project’s public GitHub repository, allowing them to retrieve only the code of interest. Minimal installation documentation will be available with each component; however, the user will be required to configure all the software manually to deploy a working system. This method of distribution will be most useful to developers wishing to modify the code to add new functionality.
2. Containerized Applications: Individual components will be packaged as containerized Docker applications, requiring only minimal configuration and the Docker framework to be available to execute. Although each container could theoretically be run individually, multiple containers will be packaged into a download file to provide users with a coherent system of Data Lab capabilities.
3. Virtual Machine (VM) Images: Machine virtualization will be used during development not only to create testing and development platforms, but also to maximize utilization of the available hardware during operations. Various VM images will be created and configured with appropriate Data Lab services to provide standard platforms for various purposes, e.g., as a “content server” or a “compute server”, or as a “large catalog worker node”. Additionally, VMs may be configured with pre-installed analysis environments that will serve as the base operating system for user login shells. All of the machine images will be available for users to download for use locally.
Links to the GitHub repository and the other download files will be available from the project web site. In the following sections we discuss distribution using the containerized application model only.
4.2 Virtual Storage
The virtual storage system in the Data Lab will be implemented as a VOSpace10 service, where the protocol defines a web-service interface to manage distributed storage. The service is typically deployed as a web application and the contents of the space are managed in a database; the physical storage of files may use any number of backend systems, including a standard local file system. Data Lab currently provides both a Java and a Python reference implementation of the VOSpace protocol; from the client perspective they are identical in terms of core functionality.
The Python implementation is ideally suited for distribution since the language easily supports an embedded web server and database (e.g., SQLite11), making the entire service self-contained. Further, the container mechanism allows the VOSpace and its supporting software to be packaged in a way that isolates them from the underlying system, reducing the installation process for the user to enabling Docker on the machine (trivial for both Linux and Mac systems), optionally modifying a local configuration file, and then simply executing the container to run the service.
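The embedded-database bookkeeping at the heart of such a service can be sketched as follows. This is not the reference implementation: the class, schema and URI forms are invented to illustrate the "node metadata in SQLite, file bytes on local disk" split.

```python
import sqlite3

class NodeStore:
    """Minimal sketch of VOSpace node bookkeeping: node metadata kept
    in an embedded SQLite database, file contents on the local disk."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS nodes "
            "(uri TEXT PRIMARY KEY, type TEXT, phys_path TEXT)")

    def add_node(self, uri, node_type, phys_path):
        self.conn.execute("INSERT INTO nodes VALUES (?, ?, ?)",
                          (uri, node_type, phys_path))

    def list_children(self, container_uri):
        # Children share the container URI as a path prefix.
        prefix = container_uri.rstrip("/") + "/"
        cur = self.conn.execute(
            "SELECT uri FROM nodes WHERE uri LIKE ?", (prefix + "%",))
        return [r[0] for r in cur.fetchall()]

store = NodeStore()
store.add_node("vos://demo!vospace/data", "ContainerNode", "/srv/vos/data")
store.add_node("vos://demo!vospace/data/img1.fits", "DataNode",
               "/srv/vos/data/img1.fits")
children = store.list_children("vos://demo!vospace/data")
```

Because both the web server and this database can live inside one process, the whole service fits naturally into a single Docker container backed by a mounted directory.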
Figure 4.2: Architecture of the Virtual Storage service Docker container.
VOSpace capabilities and views are supported either by external applications or are integrated into the implementation of the VOSpace itself. These external applications are themselves packaged as containerized tasks that are available as part of the software distribution discussed in Sec 4.1. The service will need persistent storage to maintain the database contents; this can be achieved by using a specialized data storage container that can be shared by all distributed components, or a directory mounted from the user’s local machine. Bundling the support tasks as part of the service container may also be considered as an alternative distribution mode (see Figure 4.2).
The functionality required in the support tools includes:
• FITS header metadata scraping tools (i.e. tools used to collect FITS header information or other metadata from keywords or the file contents) to enable creation of a searchable database for an SIA/SSA service on image or spectral data stored in containers. These tools are the same as those used when creating a public data service in the Data Lab and can be containerized for distribution.
• Image conversion tools to support alternate formats of image data (e.g., to create previews from FITS image files).
• Table conversion tools to support alternate views of table data or for use in loading a database table. Supported formats will minimally include:
o VOTable (XML)
o FITS BINTABLE
o SExtractor output files
o CSV, TSV and ASCII table files
10 http://www.ivoa.net/documents/VOSpace/
11 http://www.sqlite.org/
• General task execution code to allow arbitrary processing of container contents.
• Any other code needed to support a specific capability or view on a container.
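The table-conversion requirement above can be sketched for the simplest case, CSV to VOTable. The function below is illustrative only: it treats every column as a string, whereas real converters would also emit per-column datatypes and units.

```python
import csv
import io
from xml.sax.saxutils import escape

def csv_to_votable(csv_text):
    """Convert a simple CSV table (header row first) into a minimal
    VOTable document."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    fields = "".join(
        f'<FIELD name="{escape(h)}" datatype="char" arraysize="*"/>'
        for h in header)
    body = "".join(
        "<TR>" + "".join(f"<TD>{escape(v)}</TD>" for v in row) + "</TR>"
        for row in data)
    return ('<VOTABLE version="1.3"><RESOURCE><TABLE>'
            f'{fields}<DATA><TABLEDATA>{body}</TABLEDATA></DATA>'
            '</TABLE></RESOURCE></VOTABLE>')

vot = csv_to_votable("id,ra\n1,180.1\n2,181.5\n")
```

Converters for FITS BINTABLE or SExtractor output would follow the same read-rows / emit-format shape, dispatching on the requested view.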
A configuration file will be used to manage the service options (e.g., directory to be managed, service URL including port number) or to enable specific capabilities/views.
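Such a configuration file might look like the following INI-style sketch; the section and option names here are hypothetical, since the actual configuration schema will be defined in the service design document.

```python
import configparser

# Hypothetical configuration for a distributed Virtual Storage service.
SAMPLE = """
[vospace]
root_dir = /data/vospace
service_url = http://localhost:8004/vospace
capabilities = preview, table_ingest
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)
root = config["vospace"]["root_dir"]
caps = [c.strip() for c in config["vospace"]["capabilities"].split(",")]
```

At container start-up the service would read this file (mounted from the user's machine), manage the named directory, and enable only the listed capabilities.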
4.3 Data Publication
Creating “simple” VO data services (i.e. SIA/SSA/SCS) will be done using the DALServer12 framework from the VAO, as this provides a configuration-only option for creating a service from an existing database. DALServer runs as a web application deployed to an application server such as Tomcat and can provide service endpoints for multiple datasets. The framework and its supporting code can be containerized for distribution, again with the user only needing to provide configuration information for the service instead of deploying all of the underlying code.
As with the VOSpace service discussed above, the persistent storage required to operate the service can make use of a specialized storage container or a local disk mount. Pre-existing databases will be accessed directly from the DALServer using connection information provided by the configuration file. New searchable databases can be created using the metadata collection tools of Sec. 4.4; a database within the publication container (backed by the persistent storage) will be available to users when creating data services if one is not otherwise available.
DALServer can additionally build a web-page interface to its services that allows browser-based query and access to the data. Because these services may not be registered with the VO for public access, legacy applications, desktop tools and programmatic VO interfaces can query and access the service by calling the service endpoint directly. At this time, advanced publication services (i.e. VO TAP interfaces) are not being considered for distribution due to their complexity.
4.4 Processing Tools and Services
A number of Compute Services used within the Data Lab will also be useful to users on their local machines, e.g., image cutout or catalog crossmatch tools. Additionally, command-line tools used to collect metadata, convert file formats, or other utilities used within the Data Lab may be needed by the user. Within the main Data Lab (i.e. the system running in the NOAO computer center) these tools are all containerized so that they can (optionally) be run asynchronously and in parallel under the control of the Job Manager using the UWS design pattern. However, in the distributable Data Lab these tools are likely to be either called directly by the user or from a scripted application, and will be packaged as command-line tasks that always execute synchronously. Applications requiring a container (to bundle dependencies or isolate them from the underlying system) will be wrapped with a shell command to provide the task interface.
4.5 An Example

As an example of how distributable Data Lab components might be used, consider the situation shown in Figure 4.5 below:
• A PI (User 1) has access to all Data Lab services running in the main NOAO computer center (as shown in the top box) in addition to services installed on his/her local machine.
[12] http://vaosa-vm1.aoc.nrao.edu/vo/dalserver/
• A student (User 2) has installed only the virtual storage service and will be analyzing data using legacy tools.
• Queries from the PI's desktop to the Large Catalog service (red arrows) can store results in the PI's virtual storage space; these results are then copied back automatically for the PI and student to use at a later time. Alternatively, the PI may choose to store the results in the Data Lab MyDB database; other desktop tools may in turn query these results later.
• Tasks the PI may have created in the Data Lab (blue arrows) can query data services running on the PI’s desktop (e.g., from a local analysis) and have the results stored to the student’s storage for further analysis.
Figure 4.5: Example uses of downloadable Data Lab components.
5 System Interfaces

All components described in Sec 2.1 and 2.2 present specific interfaces to either user-facing client software or other Data Lab components. Where a detailed design document exists (or will exist) for a specific component, the interface will be detailed in that document and referenced in the section describing the component here. Standard Web or VO interfaces (e.g., HTTP, VOSpace, SIA, etc.) will reference the appropriate specification when needed. Here we describe the interfaces built within the system.
5.1 Security

Resources in the Data Lab will require differing levels of security:

None at all: For completely public services such as the Large Catalog or published datasets.
Proprietary: For services such as the NSA, where the user may be required to identify themselves before gaining access to proprietary data. (Note the NSA also presents a public data service requiring no special authorization.)
Restricted: For resources allocated to registered Data Lab users only, e.g., a personal database or virtual storage space.
External resources may additionally have their own authentication requirements, e.g., the X.509 certificate required by some Grid computing networks or other VO services. Security in the Data Lab, then, is a matter of providing an authentication method to protect allocated resources and secure users' data in the Data Lab itself, and of managing the credentials needed to access external services that may be called from within the Data Lab by applications or services.
Registered Data Lab users may log in using their Data Lab Identity username/password combination to establish a session; a user logged in under their NOAO Identity (as is used for access to the NSA) or via the VO Single Sign-On toolkit will similarly be recognized so long as those identities match a registered Data Lab user account. For simplicity, the Data Lab identity will be used to authenticate the user to all services operated by the Data Lab; a resource can then use the Authorization Service to determine whether the user has permission to use it.
A user's account record will contain multiple pieces of information used to determine which resources may be accessed, and how they are accessed. For example, the user may import multiple identity tokens to be used with their account and then associate specific services with a particular token. The Authorization Service will not only answer whether a user can access a particular resource, but can also respond to indicate that a particular identity token (e.g., a cookie or an X.509 certificate) should be passed when accessing the resource.
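To make the decision flow concrete, the sketch below models it in plain Python: a hypothetical Authorization Service answers both whether a user may access a resource and which identity token should be presented. The resource names, token mapping, and function signature are illustrative assumptions, not the actual service API; only the three security levels come from Sec 5.1.

```python
# Security levels from Sec 5.1 (illustrative resource names):
RESOURCES = {
    "large_catalog": "public",
    "nsa_proprietary": "proprietary",
    "vospace:user1": "restricted",
}

def authorize(user, resource, tokens):
    """Return (allowed, token_to_present) for a user/resource pair.

    `user` is None for anonymous access; `tokens` maps resource names to
    identity tokens (e.g., a cookie or X.509 certificate) that the user
    has imported into their account.
    """
    level = RESOURCES.get(resource)
    if level == "public":
        return True, None              # no credential needed
    if user is None:
        return False, None             # proprietary/restricted: must log in
    return True, tokens.get(resource)  # pass any token tied to this resource

# An anonymous query to a public service succeeds without a token:
print(authorize(None, "large_catalog", {}))  # (True, None)
```

A real deployment would expose this decision over HTTP so that each service can delegate the check rather than re-implement it.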
The high-level Authorization Service interface is described in Sec 2.2.1. A detailed description of the Authentication Service design and requirements is to be provided in a future document.
5.2 Command-line Tools

Command-line tools provide a high-level client interface to Data Lab functionality that can easily be used by users directly and called from many analysis environments. These tools can also serve as a testing interface during development and for monitoring the health of the system once in operations.
Data Lab will develop a suite of command-line tools interfacing to its components that can be used by both anonymous users (for access to public services) and authorized users (for access to restricted Data Lab resources). For example, a "login" command would authorize a user to access Data Lab virtual storage services or proprietary data in subsequent command calls, whereas an anonymous (i.e., unauthenticated) "query" command would return only the public data available from that service directly to the user.
The description of the planned command-line tools is contained in the PEP and the detailed design documents it references.
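One common way to realize the "login first, then query" pattern described above is for the login command to persist a session token that subsequent commands attach to their requests. The sketch below assumes this design; the token file location, header name, and function names are all hypothetical, not the actual Data Lab tools.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical location for the saved session token.
TOKEN_FILE = Path(tempfile.gettempdir()) / "datalab_token.json"

def login(username, token):
    """Persist a session token so later command-line calls are authorized."""
    TOKEN_FILE.write_text(json.dumps({"user": username, "token": token}))

def current_token():
    """Return the saved token, or None for an anonymous session."""
    if not TOKEN_FILE.exists():
        return None
    return json.loads(TOKEN_FILE.read_text())["token"]

def query(adql):
    """Attach the token when present; anonymous queries see public data only."""
    headers = {}
    tok = current_token()
    if tok is not None:
        headers["X-DL-AuthToken"] = tok  # hypothetical header name
    # ...issue the HTTP request for `adql` with `headers` here...
    return headers

login("demo", "s3cr3t-token")
print(query("SELECT TOP 5 * FROM catalog"))  # {'X-DL-AuthToken': 's3cr3t-token'}
```

Because the token lives outside any single process, independent command invocations (or legacy scripts that shell out to the tools) share the same authenticated session.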
5.3 Web Portals

The term Web Portal as used here refers to any web-page interface used to access a component of the Data Lab system. These include specific web pages for:
• Authorization: This web page is responsible for allowing a user to “Log In” to the system and obtain the appropriate credential for accessing his/her Data Lab resources. This page is visible to all public visitors.
• User Management: This web page allows the user to set or change personal information related to his/her account (e.g., contact email, password reset) and to set permissions on user-defined groups created to share resources. This page is only visible once the user has identified himself/herself to the system.
• Data query and access services: These pages allow a user to query a data service and view the results. Specific pages with added functionality may be created for high-value datasets (e.g., Large Catalogs). In other cases a standardized interface will be created automatically from a template for user datasets published through the Data Lab. These pages are visible to all public visitors. However, additional features may be revealed (e.g., a search for proprietary data or the ability to save to virtual storage) to users with proper credentials.
• Virtual storage services: This web page allows a user to browse his/her virtual storage holdings, navigating between containers or viewing individual files. Users can also use this page to designate items as public, shared or private to restrict access. Additionally, users can enable/disable capabilities and views associated with containers. This page is only visible once the user has identified himself/herself to the system.
• Job submission, control and monitoring: This page allows users to submit new jobs for execution (query or processing), check the status of previously submitted jobs, or cancel running jobs. This page is only visible once the user has identified himself/herself to the system.
• Admin Portals: Operations staff will have access to several administrative web pages not available to public visitors or registered users. These include special-purpose pages needed to:
o Manage user accounts, e.g., to allow bulk registration of users, to delete users, or to edit account information.
o Monitor or manage compute jobs. This provides the same functionality as the user-level job page; however, all jobs for all users on all servers are visible.
In some cases, administrative functions will be available on other portal pages when logged in through an administrative account.
5.4 Legacy Applications

Legacy applications or analysis environments[13] can interact with the Data Lab in one or more of the following ways:
• Existing support for Data Lab interfaces: Data Lab exposes a number of standard VO protocols for its services. Legacy systems that already support these protocols may use the Data Lab services directly. Additionally, tools that can access data given a URI will be able to access a limited number of Data Lab services, either for direct access to files or for the result of a service call.
• Transparent access to Data Lab services: In some circumstances, Data Lab capabilities can be used without any special interface. For example,
o FUSE-mounted filesystems will provide access to virtual storage. Users will authenticate themselves when mounting the storage locally; however, legacy tools will be unaware that the data are remote.
o Legacy apps may modify a local filesystem under control of a local VOSpace service. The service will track changes to the files to keep the Data Lab service interface current; the contents of the controlled space may be synchronized with other Data Lab services transparently.
o Storage (local or remote) under VOSpace control may provide capabilities that allow a legacy app to access data in alternate formats transparently.
• Updated code using Data Lab programmatic interfaces: In later releases of Data Lab, programmatic interfaces will be available to allow apps to work directly with Data Lab components. Once these are available, legacy tools may optionally be updated to use them, allowing tighter integration between the legacy application and the Data Lab.
5.5 Data Query

The high-level Query Manager interface is described in Sec 2.2.2. The public VO data services will all expose the interface appropriate to the service type as specified by the corresponding IVOA standard. These include:
• Simple Image Access (SIA, for images)
• Simple Spectral Access (SSA, for spectra)
• Simple Cone Search (SCS, for catalogs)
• Table Access Protocol (TAP, for tabular collections)
These services may optionally be provided by VOSpace containers and will use the same VO interface standards. Additionally, these services will implement the VO Support Interface (VOSI) recommendation; its service endpoints are used by the operations monitoring system to check on service availability. Both VOSpace and TAP services implement the Universal Worker Service (UWS) recommendation as part of their public interface.
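A monitoring system can exploit the VOSI availability endpoint with very little code: the response is a small namespaced XML document containing an `<available>` flag. The sketch below parses such a response; the sample document is illustrative (based on the VOSI availability schema), not a capture from a Data Lab service.

```python
import xml.etree.ElementTree as ET

def parse_availability(xml_text):
    """Extract the <available> flag from a VOSI availability response."""
    root = ET.fromstring(xml_text)
    # VOSI responses are namespaced; match on the local tag name.
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "available":
            return elem.text.strip().lower() == "true"
    return False

# Illustrative availability document:
sample = """<availability xmlns="http://www.ivoa.net/xml/VOSIAvailability/v1.0">
  <available>true</available>
  <note>service is accepting queries</note>
</availability>"""
print(parse_availability(sample))  # True
```

A monitor would fetch this document periodically from each service's availability endpoint and raise an alert when the flag goes false or the request fails.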
[13] Defined to be tools commonly used in the community prior to public release of the Data Lab.
5.6 Processing Task Control

The high-level Job Manager interface is described in Sec 2.2.3. Much of the analysis and functional work of the Data Lab will be done by tasks, i.e., applications or web services that manipulate data or perform some specific analysis. To provide the functionality needed at all levels of the Data Lab, some new development will be undertaken; in other cases existing applications will be used (or will require only a small wrapper interface). This implies that a broad mix of runtime environments will be required to support the heterogeneous collection of tasks.
5.6.1 Task Containers

Traditionally, virtual machines could be used to configure multiple environments; however, in many cases we don't need to virtualize an entire machine just to support a single application. Linux containers[14] provide a method to run applications in an isolated environment much more efficiently (i.e., many more tasks can be supported on the same physical hardware, and startup times for the tasks are sub-second). Data Lab will use the Docker[15] container system to build self-contained application containers (see Figure 5.6a) that will be executed on the compute servers under the control of the Job Manager.
Figure 5.6(a): Components of a Linux task container.
Containers are composed of a base operating system (OS) image that can share binaries and libraries with the host machine, meaning they are usually much smaller than the entire OS. We can then add special Data Lab support code (e.g., libraries or utility tools) that may be needed by the task; this can optionally be stripped down further to minimize the container size. Together, the base OS and the Data Lab code can form the basis for other containers, either to standardize the environments or to create custom environments that support different language or OS versions that may be needed by an application. Additionally, we can mount a user's virtual storage space using the FUSE mechanism, as well as a specialized storage container used as a disk cache that can be shared between instances of a task container (providing faster I/O than virtual storage or network access, since the disk cache is on the host machine).
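The layering just described maps naturally onto a container build file. The fragment below is a hypothetical sketch of that layering, not the project's actual build configuration; all paths and names are placeholders.

```dockerfile
# Hypothetical layering of a Data Lab task container (names are placeholders).
FROM centos:6.5                    # base OS image, shared with other containers

# Data Lab support code common to many tasks; this layer can itself be
# published as a base image for custom task containers.
COPY datalab-libs/  /opt/datalab/lib/
COPY datalab-tools/ /opt/datalab/bin/
ENV PATH=/opt/datalab/bin:$PATH

# The task itself, installed as if on a real machine.
COPY cutout-task/ /opt/task/
ENTRYPOINT ["/opt/task/run"]
```

Because each `COPY` adds a layer on top of the shared base, many task containers can reuse the same OS and support-code layers, keeping per-task images small.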
5.6.2 Job Control

The application itself is installed in the container as if it were installed on a real machine; this may include the configuration of web servers or other services used by the application. In cases where the container provides a web service, it can be deployed directly by the Job Manager (or be a persistent service running on the machine), since containers have individual IP addresses and port mapping allows multiple
[14] https://linuxcontainers.org/
[15] http://www.docker.com
containers to co-exist without conflict. If the container is used for an application, then an additional tasking interface is built into the container configuration to control execution of the task (see Figure 5.6b).
The Job Manager spawns containers on the compute server when a new job is to be created. This can be done using a simple ssh interface to initiate the job on the remote machine and then interact with the remote process. In a synchronous job (left side, Fig 5.6b) the tasking interface executes the task and redirects the task's standard I/O streams (i.e., stdin/stdout/stderr) to sockets used to communicate with the Job Manager[16]. Once the task is complete, the interface cleans up the process and the container exits. In this case the Job Manager must supply all information needed to start the task when it is executed, e.g., through command-line arguments.
In an asynchronous job (right side, Fig 5.6b), the tasking interface first creates a Universal Worker Service (UWS) client as the control process that responds to requests from the Job Manager for the lifespan of the job. The UWS design pattern provides a set of HTTP service endpoints that allow the Job Manager to set task parameters, start/stop task execution, poll for completion status, and collect results. Upon receipt of the start request, the UWS client forks the application and sets up the stdio sockets as in a synchronous job; however, the output streams are saved to a result object that isn't returned until the task exits and the Job Manager requests it. During execution the UWS client can respond to status requests so the Job Manager knows when the task has completed, or it can abort the task once some execution time limit has been exceeded. In this mode, the Job Manager is responsible for notifying the calling client that the task has completed, for returning results, and for issuing the task cleanup request once the task is no longer needed.
Figure 5.6(b): Breakdown of task execution for synchronous (left) and asynchronous (right) jobs.
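The asynchronous pattern can be summarized as a job resource that moves through well-defined phases under client control. The sketch below models only that state logic in plain Python; the real UWS interface is a set of HTTP endpoints (phase, parameters, results, etc.) rather than an in-process object, and the phase names here are a subset of those the UWS pattern defines.

```python
class UWSJob:
    """Minimal model of a UWS-style job lifecycle (state logic only)."""

    def __init__(self):
        self.phase = "PENDING"   # parameters may still be set
        self.parameters = {}
        self.result = None       # held until the Job Manager requests it

    def set_parameter(self, name, value):
        if self.phase != "PENDING":
            raise RuntimeError("parameters are fixed once the job starts")
        self.parameters[name] = value

    def start(self, task):
        """RUN request: execute the task; results are held until requested."""
        self.phase = "EXECUTING"
        try:
            self.result = task(self.parameters)
            self.phase = "COMPLETED"
        except Exception:
            self.phase = "ERROR"

    def abort(self):
        """ABORT request, e.g., when an execution time limit is exceeded."""
        self.phase = "ABORTED"

job = UWSJob()
job.set_parameter("ra", 185.0)
job.start(lambda p: {"cutout": p["ra"]})
print(job.phase, job.result)  # COMPLETED {'cutout': 185.0}
```

Polling in this model is just reading `phase`; over HTTP it is a GET on the job's phase endpoint, which is what lets the Job Manager supervise many containers at once.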
Task containers isolate applications both from the underlying system and from other containers that may be running the same task, greatly simplifying their deployment to compute servers and their use in massively parallel workflows. The Job Manager is able to distribute execution to provide load-balancing capabilities, and since it is a web service itself, it could likewise be packaged as a container and made part of the Data Lab software distribution. Similarly, the containers could be deployed under other execution frameworks (e.g., Condor).
[16] A similar mechanism is used in the IRAF Networking protocol to provide access to remote data and tasks.
5.7 Virtual Storage

The high-level Storage Manager interface is described in Sec 2.2.4. The virtual storage system will be implemented using the VOSpace standard for distributed storage. This interface will be exposed to clients to allow direct access to the storage space and provides the low-level interface used by the Storage Manager. Client applications and users must identify themselves to the Authorization Service before gaining access to this resource. Users can additionally access the storage if it is mounted using a FUSE (Filesystem in Userspace[17]) client. In this case, the client accesses the space using the standard VOSpace protocols; however, to the user it appears to be a normal Unix filesystem. The FUSE client is responsible for authenticating itself to the service using the Authorization Service. Data stored in virtual storage that may be exposed through a data access service (e.g., via a capability on a storage container) will be interfaced as described in Sec 4.2.
6 Implementation Tools and Standards

This section describes the planned implementation tools and technologies to be used in the Data Lab. Additional tools may be used as necessary and will be documented in the detailed design for the component in question.
6.1 Implementation Languages

Data Lab will not mandate that a particular development language be used for all components, given the reliance on adapting existing code bases [DL-ORD-51010].
• Modification of existing software will be done using the original implementation language. As needed, code may be updated to use a more modern version of the language.
• New development will be done using the most appropriate language for the tool or service being implemented.
• Client-side interfaces will generally be implemented using C/C++ as the core language, with multi-language interface bindings generated by SWIG where appropriate, in order to be as open as possible to user-provided application development.
• A limited suite of custom client-side interfaces will be implemented for Python application development (currently the most popular scripting language among astronomers). Updates to an existing similar Python interface will be preferred over new development.
• All dependencies for software tools (e.g., specific versions of libraries or third-party code) must be justified.
6.1.1 Language Versions

All core component services must be compatible with the following language versions:

• Java-based applications/services must be compatible with Java 7
[17] http://fuse.sourceforge.net/
• Python-based applications/services must be compatible with Python 2.7

Exceptions will be made in cases where a physical/virtual machine is dedicated to hosting a single service requiring an alternate version of the language.
6.2 Development Platforms

Data Lab hardware will use a standardized operating system across all machines hosting core services. At present this is Linux CentOS 6.5; the operating system used in the final deployment of the Data Lab may change, subject to a requirement that a common OS be used across all NOAO SDM (Science Data Management) systems. Two exceptions will be made:
1. A collaboration or user requesting a private Virtual Machine is free to request an alternate base operating system (however, support for these systems will be limited).
2. Users creating containerized applications are free to use a base image from another operating system.
Client software is expected to run on modern versions of Linux and Mac OSX.
6.3 Software Development Standards

Data Lab will not require use of a specific integrated development environment (IDE) for implementation of services. Release documentation must specify the complete process for building, configuring and deploying a tool or service from source code.
6.3.1 Software Licensing

All Data Lab software will be Open Source and available under a {TBD} license. Individual applications or components may have different licenses, with the goal that all software be released under the most lenient license possible. Software imported and extended for use in the Data Lab will be made available under the original software license terms. Data Lab will not use proprietary software that cannot be redistributed.
6.3.2 Public Repository

All software released through the Data Lab will be available on the GitHub (github.com) public repository. Deployment of an application/service within the Data Lab will use code available from this repository.
6.3.3 Private Repository

Data Lab shall maintain a self-hosted GitLab (gitlab.com) repository for code not yet released but still under version control. This repository solution is compatible with GitHub, making it possible to migrate code to the public repository when software is released publicly.
6.3.4 Testing Framework
See description in the PEP document.
6.3.5 Bug and Issue Tracking

Data Lab will use the JIRA issue and project tracking system already in use by SDM. Additional issue tracking will be done using the public GitHub repository mechanism.
6.4 Web Interfaces

Data Lab will use the Apache HTTP Server and Apache Tomcat for public web interfaces. The GitLab repository requires an alternate HTTP server (Nginx) and will be isolated from machines hosting public services to avoid potential conflicts with the Apache servers.
6.5 Database Technologies

Data Lab will support a number of different databases within the system:

• QServ will be used to host extremely large catalogs (e.g., DES) on dedicated machines.
• MySQL and PostgreSQL will be available on machines hosting public data services. Either or both of these databases may be used depending on the optimizations required during data publication.
• A user's MyDB database will use MySQL to maintain maximum compatibility with results obtained from the QServ-based datasets.
• SQLite may be used internally by some applications or services.
Use of other databases will not be supported without sufficient justification.
6.6 Machine Virtualization

Data Lab currently uses Oracle's VirtualBox product as its machine virtualization tool to create and maintain Virtual Machines (VMs) within the Data Lab. Virtual machines are used to create single-purpose machines (e.g., to host internal administrative services) in order to maximize hardware utilization. Process virtualization will use the Docker (docker.io) container system to create distributed applications and/or compute services that can run in isolation from other processes on the machine. Because of their lightweight nature and portability, containers are ideal for building specialized services within the Data Lab that can be deployed as needed.
7 Requirements Tracking

This section traces the elements of the architecture presented here back to the Science Use Cases presented in the SUC. Unless otherwise specified, numbers in the trace references refer to sections in this document.
7.1 Core Data Lab Capabilities

Access to SQL catalogs
DL-SRD-21000: DL must provide access to SQL catalogs with command line tools for experienced users. (Traces: 2.1.4.4, 2.1.4.1, 5.2, DL-OCD-2500)
DL-SRD-21002: DL must provide access to SQL catalogs with Web-based tools for intermediate and novice users. (Traces: 2.1.4.4, 2.1.4.1, 5.3, DL-OCD-2500)
DL-SRD-21004: DL must provide the capability to create table joins of DL-based SQL catalogs. (Traces: 2.1.4.4, 2.1.4.1)
DL-SRD-21006: DL must provide asynchronous state-full access to DL-based SQL catalogs. (Traces: 2.1.4.4, 2.1.4.1, DL-OCD-3130, DL-OCD-3135)
DL-SRD-21008: DL must provide synchronous access to DL-based SQL catalogs. (Traces: 2.1.4.4, 2.1.4.1, DL-OCD-3125)
User database storage (local & remote)
DL-SRD-21050: DL must provide for the storage of databases on the user's desktop computer (local storage). (Traces: 2.2.4, 2.2.2, DL-OCD-3111)
DL-SRD-21055: DL must provide for the storage of databases at the DL (remote storage). (Traces: 2.2.4, 2.2.2, DL-OCD-3110)

Light curve data generation from catalogs
DL-SRD-21100: The DL must provide the means to generate light curve data from multi-epoch flux/magnitude measurements in SQL catalogs served by the DL. (Traces: 1.4.4, DL-OCD-2540)

Virtual Storage Service
DL-SRD-21150: DL must provide means for users to store results of SQL database queries near the computational resources serving the major SQL catalogs at the DL. (Traces: 2.2.4, DL-OCD-3200)
DL-SRD-21151: DL must provide means for users to create their own data objects (files) in the DL distributed storage network. (Traces: 2.2.4)
DL-SRD-21152: DL must provide means for users to delete their own data objects in the DL distributed storage network.
DL-SRD-21153: DL must provide means for users to upload data objects from local (desktop) to remote storage. (Traces: 2.2.4, DL-OCD-4125)
DL-SRD-21154: DL must provide means for users to download data objects from remote to local storage. (Traces: 2.2.4, DL-OCD-2508)
DL-SRD-21155: DL must provide means for users to manipulate metadata of their own data objects. (Traces: 2.2.4)
DL-SRD-21156: DL must provide means for users to set access privileges of their own data objects. (Traces: 2.2.4, DL-OCD-3205)
DL-SRD-21157: DL must provide means for users to access the content of data objects within the DL distributed storage network. (Traces: 2.2.4, DL-OCD-2573)
IVOA registry searches
DL-SRD-21200: DL must provide means for users to search (cone-searches) for observations (images/fluxes) obtained at different wavelengths. (Traces: 2.1.4.2, 2.2.5)
DL-SRD-21205: DL must provide means for users to search for Spectral Energy Distributions (galaxies and stars).

Access to external image surveys
DL-SRD-21250: DL must provide means for users to access significant ground-based optical/near-infrared image surveys (e.g., DSS, 2MASS, ESO Vista Hemisphere Survey). (Traces: 2.1.5.1, DL-OCD-2520, DL-OCD-2522)

Galactic Extinction/Reddening Service
DL-SRD-21300: DL must provide means for users to get extinction and reddening values due to Galactic dust as a function of position on the sky. (Traces: 2.1.5.1)

Magellanic Clouds Extinction Service
DL-SRD-21350: DL must provide means for users to get extinction due to dust in the Magellanic Clouds as a function of position on the sky. (Traces: 2.1.5.1)

Color-Magnitude & Hess Diagram plotting tool
DL-SRD-21400: DL must provide Color-Magnitude and Hess Diagram (with the option of contour overlays) plotting tools to enable the graphical analysis of data samples of possibly millions of stars. (Traces: 1.4.7, DL-OCD-2554)

Variable resolution display tool for remote users
DL-SRD-21450: DL must provide an interactive plotting/visualization/analysis tool with variable resolution for remote users in order to improve the user interaction experience. (Traces: 1.4.7)

Phase-folded light curves
DL-SRD-21500: DL must provide the means to produce phase-folded light curves for a given period value from light curve data. (Traces: 1.4.7, DL-OCD-2552)

Create animations/movies of variable objects
DL-SRD-21550: DL must provide the means to create animations/movies of observations of variable objects (e.g., RR Lyrae stars, supernovae, etc.). (Traces: 1.4.7, DL-OCD-2553)
Image Cutout Service
DL-SRD-21600: DL must provide a general asynchronous state-full Image Cutout Service that will serve subimages (image cutouts) of images based at the DL. (Traces: 1.4.4, DL-OCD-2524, DL-OCD-3415)
DL-SRD-21601: The Image Cutout Service must be able to be run in a synchronous mode. (Traces: 1.4.4, DL-OCD-3410)
DL-SRD-21602: The Image Cutout Service must be capable of delivering small images ("postage stamps" with ~100 pixels) with position/orientation metadata from DL-based images. (Traces: 1.4.4, DL-OCD-2521)
DL-SRD-21603: The Image Cutout Service must be capable of delivering large images (millions of pixels covering possibly more than one deg^2) with position/orientation metadata from DL-based images. (Traces: 1.4.4, DL-OCD-2521)
DL-SRD-21604: The Image Cutout Service must be able to serve 100,000 image cutouts as part of large asynchronous batch jobs. (Traces: 1.4.4)
DL-SRD-21605: The Image Cutout Service must serve images in a format suitable for the creation of animations/movies of variable objects (see DL-SRD-21550).

Task automation tools
DL-SRD-21650: DL must provide task automation tools to enable computation tasks (workloads) to be spread over many cores and/or machines. (Traces: 1.4.4)

Positional Cross-Match Service
DL-SRD-21700: DL must provide an asynchronous state-full Positional Cross-Match Service (PCMS) that will enable a DL user to cross-match objects with positions in a custom database against SQL catalogs served by the DL (e.g., the DES catalog). (Traces: 1.4.4, DL-OCD-2504)
DL-SRD-21702: The PCMS must be able to be run in a synchronous mode. (Traces: 1.4.4, DL-OCD-3410)
DL-SRD-21704: The PCMS must be able to process a million object positions as part of large asynchronous batch jobs. (Traces: 1.4.4)
DL-SRD-21706: The PCMS must provide access to external robust IVOA-standards-compliant positional cross-match services (e.g., CDS's VizieR).

Periodogram Service
DL-SRD-21750: The DL should provide an asynchronous state-full Periodogram Service that will return periodograms of time series data. (Traces: 1.4.4)
DL-SRD-21752: The Periodogram Service should be able to be run in a synchronous mode. (Traces: 1.4.4, DL-OCD-3410)
DL-SRD-21754: The Periodogram Service should be able to analyze 5,000 light curves as part of large asynchronous batch jobs. (Traces: 1.4.4)
DL-SRD-21756: The Periodogram Service should be able to process light curves that were generated by DL-SRD-21100. (Traces: 1.4.4, DL-OCD-2504)

Stellar photometry codes
DL-SRD-21800: DL must provide users with at least one executable binary of a standard stellar photometry code (e.g., SExtractor, DoPHOT, DAOPHOT, etc.) for inclusion in user-designed photometric pipelines. (Traces: 1.4.4, DL-OCD-2530)

Statistical time series analysis tools
DL-SRD-21850: DL must provide light curve (time series) statistical analysis tools that determine if the flux of an object varies in time for a given level of statistical significance. (Traces: 1.4.4, DL-OCD-2542)
DL-SRD-21855: The statistical time series analysis tools should be able to determine the statistical nature of a variable object: periodic, aperiodic (semiregular), random (stochastic), or transient.

Compute Service
DL-SRD-21900: The DL should provide an asynchronous state-full Compute Service that would do computationally intensive calculations in a few hours or days instead of weeks or months. (Traces: 1.4.4, DL-OCD-2572)
7.2 User-‐Provided Science Capabilities User tools to determine crowding factor DL-‐SRD-‐22000 DL should provide the means to enable user-‐defined database tools to determine crowding factor (nearest neighbor distances, N-‐point correlations, etc.) .
1.4.4 DL-‐OCD-‐2570
Database of theoretical isochrones DL-‐SRD-‐22050 DL should provide at least one set of theoretical stellar isochrones transformed to the DES (SDSS) filter set. 1.4.4
Registration of large/complex images DL-‐SRD-‐22100 DL should provide resources to enable the spatial register/cross-‐match large/complex images observed with different filters, exposure times, rotation angles, etc.
1.4.4
Capture interactive results DL-‐SRD-‐22150 DL should provide resources to enable the capture of interactive results for reproducibility and sharing within collaborations. 1.4.4
Poisson-‐based Matched-‐Filter Service DL-‐SRD-‐22200 DL should provide resources to enable the identification of unique stellar populations in complex stellar fields contaminated by multiple external stellar populations in the Milky Way or other Local Group.
1.4.4
Estimate reddening of RR Lyraes from light curves DL-‐SRD-‐22250 DL should provide resources to enable analysis tools to estimate reddening of individual RR Lyrae stars based on time series observations. 1.4.4
User-defined analysis tools
DL-SRD-22300: DL should provide resources to enable user-defined analysis tools (code, scripts, templates, etc.). [1.4.4; DL-OCD-2543, DL-OCD-2570]
High-order Polynomial Background Fitting
DL-SRD-22350: DL should provide resources to enable high-order polynomial background fitting in complex star fields. [1.4.4]
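The background fitting of DL-SRD-22350 amounts to an ordinary least-squares fit of a 2-D polynomial surface to the image; a production tool would iterate with sigma-clipping to reject stars, which this sketch omits. The function name, polynomial order, and image sizes below are illustrative assumptions.

```python
import numpy as np

def fit_background(image, order=4):
    """Least-squares fit of a 2-D polynomial background to an image.

    Builds a design matrix of x^i * y^j terms with i + j <= order,
    solves for the coefficients, and returns the evaluated background
    model with the same shape as the input image.
    """
    ny, nx = image.shape
    y, x = np.mgrid[0:ny, 0:nx]
    # Normalize coordinates to [-1, 1] for numerical stability.
    xn = 2.0 * x / (nx - 1) - 1.0
    yn = 2.0 * y / (ny - 1) - 1.0
    terms = [xn**i * yn**j
             for i in range(order + 1)
             for j in range(order + 1 - i)]
    A = np.stack([t.ravel() for t in terms], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, image.ravel(), rcond=None)
    return (A @ coeffs).reshape(ny, nx)

# Recover a known smooth gradient from a hypothetical 64x64 image.
ny, nx = 64, 64
y, x = np.mgrid[0:ny, 0:nx]
truth = 50.0 + 0.1 * x + 0.05 * y + 0.001 * x * y
model = fit_background(truth, order=4)
print(np.max(np.abs(model - truth)))  # residual near machine precision
```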
Digital image filters for feature/object detection
DL-SRD-22400: DL should provide digital filters for feature recognition/object detection in images. [1.4.4]
Variable Object Classification Service
DL-SRD-22450: DL should provide resources to determine what type of variable an object is based on its light curve. [1.4.4]
Galaxy Morphology Analysis Service
DL-SRD-22500: DL should provide resources to enable galaxy morphology analysis codes such as Galfit, Galphot, etc. to analyze large (many-pixel) galaxy images. [1.4.4; DL-OCD-2571, DL-OCD-2572]
DL-SRD-22505: DL should provide resources to enable morphological analysis of galaxy blob images (with a small number of pixels). [1.4.4]
Single-Galaxy Photometric Redshift
DL-SRD-22550: DL should provide resources to enable the determination of photometric redshifts of a galaxy from multiband observations using Spectral Energy Distribution (SED) template libraries. [1.4.4]
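The template-fitting approach behind DL-SRD-22550 reduces to a chi-squared minimization over a redshift grid. The sketch below assumes the SED templates have already been integrated through the survey filters at each trial redshift; the function name, grids, and fluxes are hypothetical, and a real service would marginalize over multiple template families.

```python
import numpy as np

def photo_z(obs_flux, obs_err, template_grid, z_grid):
    """Brute-force template-fitting photometric redshift.

    template_grid has shape (n_z, n_bands): row k holds a template
    SED integrated through the survey filters at redshift z_grid[k].
    The template normalization is solved analytically at each trial
    redshift, and the redshift with the lowest chi-squared wins.
    """
    obs = np.asarray(obs_flux, dtype=float)
    w = 1.0 / np.asarray(obs_err, dtype=float) ** 2
    chi2 = np.empty(len(z_grid))
    for k, model in enumerate(template_grid):
        # Optimal scale factor minimizing chi-squared for this model.
        a = np.sum(w * obs * model) / np.sum(w * model**2)
        chi2[k] = np.sum(w * (obs - a * model) ** 2)
    return z_grid[np.argmin(chi2)]

# Hypothetical 5-band fluxes from a toy template family whose
# spectral shape changes smoothly with redshift.
z_grid = np.linspace(0.0, 2.0, 201)
bands = np.arange(5)
templates = np.array([np.exp(-0.3 * z * bands) for z in z_grid])
obs = templates[100] * 2.5          # true redshift z = 1.0
z_best = photo_z(obs, np.full(5, 0.01), templates, z_grid)
print(z_best)  # 1.0
```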
Interactive User-Defined Plotting/Visualization Tools
DL-SRD-22600: DL should provide resources to enable graphical user interface tools developed by users to enhance the visualization or understanding of complex images or databases. [1.4.4]
Astrometry for large images
DL-SRD-22650: DL should provide resources to enable the development of tools for the computation of astrometric solutions for large astronomical images that may be distorted due to imager optics. [1.4.4]
Extended object/non-point-source detection
DL-SRD-22700: DL should provide resources to enable the development of image analysis tools for the detection of astrophysical objects that are not point sources. [1.4.4]
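The filtering-based detection of DL-SRD-22400 (and, with an extended kernel, of DL-SRD-22700) amounts to smoothing the image with a matched filter and thresholding above the background noise. A minimal sketch follows; the function name, kernel shape, threshold, and synthetic frame are illustrative assumptions rather than Data Lab specifications.

```python
import numpy as np
from numpy.fft import rfft2, irfft2

def detect_sources(image, kernel_sigma=2.0, nsigma=5.0):
    """Smooth an image with a Gaussian matched filter and flag pixels
    more than nsigma noise-sigmas above the background.

    Returns a boolean detection mask with the same shape as the image.
    """
    r = int(3 * kernel_sigma)
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    kern = np.exp(-(x**2 + y**2) / (2 * kernel_sigma**2))
    kern /= kern.sum()
    # Circular FFT convolution: embed the kernel in a full-size array
    # and roll it so its center sits at pixel (0, 0).
    pad = np.zeros(image.shape, dtype=float)
    pad[:2 * r + 1, :2 * r + 1] = kern
    pad = np.roll(pad, (-r, -r), axis=(0, 1))
    smooth = irfft2(rfft2(image) * rfft2(pad), s=image.shape)
    # Robust background and noise from the median absolute deviation.
    bkg = np.median(smooth)
    sigma = 1.4826 * np.median(np.abs(smooth - bkg))
    return smooth > bkg + nsigma * sigma

# Hypothetical 64x64 frame: unit Gaussian noise plus one bright blob.
rng = np.random.default_rng(42)
img = rng.normal(0.0, 1.0, (64, 64))
yy, xx = np.mgrid[0:64, 0:64]
img += 50.0 * np.exp(-((xx - 32)**2 + (yy - 32)**2) / (2 * 2.0**2))
mask = detect_sources(img)
print(bool(mask[32, 32]))  # True: the blob is recovered
```

Extended, low-surface-brightness sources would use a larger kernel matched to the expected object scale, plus segmentation of the resulting mask into distinct detections.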
Appendix I: Vocabulary / Acronyms Used

AAS (American Astronomical Society)
ADASS (Astronomical Data Analysis Software and Systems) conference
ADQL (Astronomical Data Query Language) An SQL-like language which includes astronomical facilities to query a database.
AGN (Active Galactic Nucleus)
API (Application Programming Interface) The documentation of the interface to a software library or tool.
ASCII (American Standard Code for Information Interchange) A character-encoding scheme based on the English alphabet in which 128 specific characters are encoded into 7-bit binary integers.
ASV (ASCII Space Values)
AURA (Association of Universities for Research in Astronomy)
CADC (Canadian Astronomy Data Centre)
CDS (Centre de Données astronomiques de Strasbourg)
CMD (Color Magnitude Diagram)
CSV (Comma Separated Values)
CTIO (Cerro Tololo Inter-American Observatory)
DAL (Data Access Layer) The VO protocols that define how VO applications access data resources.
Datalink VO protocol for associating complex astronomical data.
DECaLS (DECam Legacy Survey)
DECam (Dark Energy Camera) A 520-megapixel digital camera on the Blanco 4-m telescope at CTIO.
DES (Dark Energy Survey) A survey to probe the origin of the accelerating Universe and help uncover the nature of dark energy by measuring the 14-billion-year history of cosmic expansion with high precision over five years, beginning in summer 2013.
DESI (Dark Energy Spectroscopic Instrument) An instrument to measure the effect of dark energy on the expansion of the universe by obtaining optical spectra for tens of millions of galaxies and quasars (beginning 2018).
DESDM (Dark Energy Survey Data Management) Project that developed and operates the DESDM system at NCSA.
DL (Data Lab)
DAOPHOT Package for crowded-field stellar photometry.
Docker An open platform for developers and system administrators to build, ship, and run distributed applications.
DoPHOT CCD PSF-fitting photometry program.
DS9 SAOimage DS9, an astronomical imaging and data visualization application.
DSS (Digitized Sky Survey)
ESO (European Southern Observatory)
FITS (Flexible Image Transport System) An open standard defining a digital file format for storage, transmission, and processing of astronomical (and other scientific) data.
FTP (File Transfer Protocol) A standard network protocol used to transfer computer files from one host to another over a TCP-based network.
FUSE (FileSystem in User Space) An operating system mechanism that lets non-privileged users create their own file systems.
GAVO (German Astrophysical Virtual Observatory)
GMS (Group Management Services)
GPFS (General Parallel File System) A high-performance clustered file system developed by IBM.
Hess diagram Plots the relative density of the occurrence of stars at different color-magnitude positions of the Hertzsprung-Russell diagram for a given galaxy.
HSB (High Surface Brightness)
HST (Hubble Space Telescope)
HTTP (HyperText Transfer Protocol) An application protocol for distributed, collaborative, hypermedia information systems.
IDL (Interactive Data Language) A programming language used for data visualization and analysis.
IPAC (Infrared Processing and Analysis Center)
IRAF (Image Reduction and Analysis Facility) NOAO image reduction/analysis and visualization software system.
IVOA (International Virtual Observatory Alliance) The international VO community responsible for developing VO standards.
JIRA A commercial tool for software teams to plan, build, and track projects.
JPEG (Joint Photographic Experts Group) Lossy compression for digital images.
LDAP (Lightweight Directory Access Protocol) An industry-standard application protocol for accessing and maintaining distributed directory information services over an Internet Protocol (IP) network.
LSB (Low Surface Brightness)
LMC (Large Magellanic Cloud)
LSST (Large Synoptic Survey Telescope)
MAST (Mikulski Archive for Space Telescopes)
MPC (Minor Planet Center)
MCs (Magellanic Clouds)
MySQL Popular open-source database.
MyDB A read-write database available to users for saving results from queries of read-only databases. This is similar to the SDSS MyDB concept.
MySQLdb Simple database wrapper for MySQL.
NASA (National Aeronautics and Space Administration)
NCSA (National Center for Supercomputing Applications)
NHPPS (NOAO High-Performance Pipeline System) An event-driven, multi-process executor system developed to manage pipeline applications in a coarse-grained, distributed processing environment.
NOAO (National Optical Astronomy Observatory)
NSA (NOAO Science Archive)
NSSDC (NOAO System Science and Data Center)
OCD (Operational Concept Document)
ORD (Operational Requirements Document)
OS (Operating System)
PCMS (Positional Cross-Match Service)
PNG (Portable Network Graphics) Raster graphics file format that supports lossless data compression.
PSF (Point Spread Function)
QServ The LSST database management system.
R A programming language and software environment for statistical computing and graphics.
RDBMS (Relational DataBase Management System) A DBMS that represents data using a relational database.
Relational database A database that stores data in a structure consisting of one or more tables (aka relations) of rows and columns, which may be interconnected.
ReST (Representational State Transfer) An approach to web services that uses the standard HTTP GET and POST methods.
SAD (System Architecture Design) document
SAMP (Simple Applications Messaging Protocol) A VO protocol for desktop messaging.
SCS (Simple Cone Search)
SDM (Science Data Management) group
SDSS (Sloan Digital Sky Survey)
SED (Spectral Energy Distribution) Plot of brightness or flux density versus frequency or wavelength.
SExtractor A program that builds a catalogue of objects from an astronomical image.
SIA/SIAP (Simple Image Access Protocol) A VO protocol that supports queries for images available in a given data collection near a given position on the sky.
SMASH (Survey of the MAgellanic Stellar History) PI: Nidever
SMC (Small Magellanic Cloud)
SN (Supernova)
SQL (Structured Query Language) The standard language used to communicate with RDBMSs.
SQLite A software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.
SRD (Science Requirements Document)
SSH (Secure Shell)
SSA (Simple Spectral Access) A VO protocol for spectral query/retrieval.
SSO (Single Sign-On)
SUC (Science Use Case) document
SVC An abbreviation for a Web service.
SWIG (Simplified Wrapper and Interface Generator) An open-source software tool used to connect C or C++ programs or libraries with scripting languages.
TAP (Table Access Protocol) A VO protocol for querying remote databases.
TB (Terabyte) 10^12 bytes, or 1,000,000,000,000 bytes (base 10).
TiB (Tebibyte) 2^40 bytes, or 1,099,511,627,776 bytes (base 2).
TCP (Transmission Control Protocol) One of the core protocols of the Internet protocol suite, commonly referred to as TCP/IP.
TSV (Tab-Separated Values) A simple file format often used to move tabular data between computer programs that support the format, e.g., transferring information from a database program to a spreadsheet.
URI (Uniform Resource Identifier) An address standard for a resource available on the Internet.
URL (Uniform Resource Locator) The global address of documents and other resources on the World Wide Web. The address contains two parts: specification of the protocol to be used in accessing the resource, and its network location.
UWS (Universal Worker Service) A pattern that defines how to manage asynchronous execution of jobs on a service.
VAO (Virtual Astronomical Observatory) The US VO project.
VM (Virtual Machine)
VO (Virtual Observatory)
VOSI (VO Support Interfaces) The minimum interface that a SOAP- or REST-based web service requires for compatibility with the IVOA.
VOSpace The IVOA interface to distributed storage that specifies how VO agents and applications can use network-attached data stores to persist and exchange data in a standard way.
XML (eXtensible Markup Language)
2MASS-PSC (2 Micron All Sky Survey – Point Source Catalog)
Appendix II: List of Figures

Figure 1.3 (page 7): Context Diagram for the NOAO Data Lab.
Figure 2.1 (page 8): Data Lab software architecture diagram.
Figure 2.4 (page 17): The Data Lab deployment diagram.
Figure 4.2 (page 21): Architecture of the Virtual Storage service Docker container.
Figure 4.5 (page 23): Example uses of downloadable Data Lab components.
Figure 5.6(a) (page 27): Components of a Linux task container.
Figure 5.6(b) (page 28): Breakdown of task execution for synchronous and asynchronous jobs.