The SMB Archive System: Data Backup Across the Web


Page 1: The SMB Archive System: Data Backup Across the Web

The SMB Archive System: Data Backup Across the Web

Kenneth R. Sharp

Stanford Synchrotron Radiation Laboratory

Page 2: The SMB Archive System: Data Backup Across the Web

Why a high capacity, long term data archive is needed

Need a replacement for tapes

• Tapes age and medium formats change rapidly.

• Storage capacity and reliability of tapes are limited.

• Much manual bookkeeping is needed to keep track of data stored on tapes.

Need to support large-area CCD detectors

• Three Q315 detectors will be generating 20-80 MB files at a much increased rate when the SPEAR3 upgrade is complete.

• RAID data storage at SSRL will be 24 TB in 2004--all that data must be backed up somehow!

• Need to archive data as rapidly as it is collected.

Need to support high-throughput structural biology

• Automated beam lines will generate huge amounts of data.

• Large numbers of samples and targets require that metadata be stored and tracked systematically.

• Data must be archived automatically and be easy to retrieve.

Page 3: The SMB Archive System: Data Backup Across the Web

SMB Archive Uses NPACI Resources at SDSC

High Performance Storage System (HPSS)

• Centralized long term data storage system at SDSC.

• Stores over 344 TB of data in 18 million files (Jan 2002).

• Capacity: 2000 GBytes Disk; 6000 TBytes Tape Storage.

Storage Resource Broker (SRB)

• Client-server middleware provides uniform interface for accessing heterogeneous resources over the network.

• Presents data in hierarchical folders with data and access controls.

• May be used to store and retrieve data on the HPSS at SDSC.

• Powerful metadata querying system allows data sets to be accessed based on their attributes.

• Data sets can be replicated over multiple resources.

• Organizations may install and maintain their own SRB Servers.

We use the SRB installation at SDSC.
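To make the idea of attribute-based access concrete, here is a minimal Python sketch of looking up data sets by their metadata rather than by folder path. It does not use the SRB or MCAT API; the `catalog` dictionary, the attribute names, and the `find_datasets` helper are all hypothetical stand-ins for illustration.

```python
def find_datasets(catalog, **wanted):
    """Return names of data sets whose attributes match every requested value.

    `catalog` is a hypothetical in-memory mapping of data set name -> attribute
    dictionary; a real metadata catalog would be queried over the network.
    """
    return sorted(
        name
        for name, attrs in catalog.items()
        if all(attrs.get(key) == value for key, value in wanted.items())
    )


# Hypothetical catalog entries, for illustration only.
catalog = {
    "/archive/run1": {"sample": "lysozyme", "detector": "Q315"},
    "/archive/run2": {"sample": "lysozyme", "detector": "other"},
    "/archive/run3": {"sample": "insulin", "detector": "Q315"},
}
matches = find_datasets(catalog, sample="lysozyme", detector="Q315")
```

The point of the sketch is only that a query names attributes, not locations; where the matching files physically live is left to the storage system.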

National Partnership for Advanced Computational Infrastructure (NPACI)

• Mission: advance science by creating national computational infrastructure: the Grid.

• Maintains resources at San Diego Supercomputer Center (SDSC) including HPSS, SRB.

Page 4: The SMB Archive System: Data Backup Across the Web

Organizations Using SRB

• Digital Libraries
    • UCB, Umich, UCSB, Stanford, CDL
    • NSF NSDL - UCAR / DLESE

• NASA Information Power Grid

• Astronomy
    • National Virtual Observatory
    • 2MASS Project (2 Micron All Sky Survey)

• Particle Physics
    • Particle Physics Data Grid (DOE)
    • GriPhyN

• Medicine
    • Digital Embryo (NLM)

• Earth Systems Sciences
    • ESIPS
    • LTER

• Persistent Archives
    • NARA
    • LOC

• Neuro Science & Molecular Science
    • TeleScience/NCMIR, BIRN
    • SLAC, AfCS, …

Page 5: The SMB Archive System: Data Backup Across the Web

InQ SRB client for Microsoft Windows

SRB client applications

• Users must be able to upload data, download data, and view the data in the archive.

• Users perform these functions via SRB client applications.

• Available clients: Command-line programs (“S Commands”), InQ, MySRB.

• Tools for custom clients: SRB C library; Java API.

InQ for Microsoft Windows

• InQ is the easiest to use client provided by NPACI.

• Individual files or entire folders may be uploaded or downloaded.

• Files in the archive may be browsed either by directory structure or by data attributes.

Limitations of InQ

• Runs only on Microsoft Windows platforms.

• Windows is not the major platform used at synchrotron light sources or in crystallography research labs.

• No batch job capability for long archive jobs.

• Exposes confusing SRB features and terminology (resources, containers, collections, etc).

Page 6: The SMB Archive System: Data Backup Across the Web

MySRB web browser-based SRB client

MySRB

• MySRB is a powerful web-based SRB client which can be run from standard web browsers.

• Files in the archive may be browsed either by directory structure or by data attributes.

Limitations of MySRB

• No way to upload or download more than one file at a time.

• The otherwise rich functionality and powerful features are confusing to users.

The bottom line:

• Capabilities of HPSS and SRB far exceed the perceived needs of our beam line users.

• Our users need a customized interface with simplified functionality.

• Additional infrastructure had to be designed and implemented in order to make the SRB a viable storage system for crystallographic data.

• A browser-based user interface is ideal.

Page 7: The SMB Archive System: Data Backup Across the Web

The SMB Archive interface for using the SRB

Simple archive job definition

• Users may rapidly browse their /home and /data directories at SSRL.

• Directory contents are listed in the browser window.

• Directories may be navigated by clicking on directory names.

• Files to be uploaded may be filtered according to a list of wildcards.

• Subdirectories may be archived recursively.

• The only SRB related information required is the name of the new data collection to create.
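The wildcard filtering and recursive selection described above can be sketched in a few lines of Python. This is an illustration only, not the Archive System's code (which the talk says is written in Java); the function name `select_files` and its parameters are assumptions made for the example.

```python
import fnmatch
import os


def select_files(root, patterns, recursive=True):
    """Return paths (relative to root) of files whose names match any wildcard."""
    selected = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # fnmatchcase compares case-sensitively on every platform.
            if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns):
                full_path = os.path.join(dirpath, name)
                selected.append(os.path.relpath(full_path, root))
        if not recursive:
            # Emptying dirnames in place stops os.walk from descending further.
            dirnames.clear()
    return sorted(selected)
```

For example, `select_files("/data/user", ["*.img", "*.log"])` would list every matching file under a hypothetical `/data/user` tree, while `recursive=False` restricts the selection to the top-level directory.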

Convenient web browser interface

• Users may define archive jobs over the web from anywhere in the world using any common type of computer.

• Users need only log in with their SMB Unix account name and password.

Page 8: The SMB Archive System: Data Backup Across the Web

Monitoring archive jobs and downloading data

Batch operation

• Archive job runs in background once definition is confirmed.

• Browser does not hang during archival.

• New jobs may be started while previously defined jobs are in progress.

• Automatically restarts jobs if HPSS is unavailable.

• A job status page indicates definitions and status of all running jobs.

• User may abort running jobs.

• E-mail is sent to the user when a job is started and again when it is completed.
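One simple way to get the "restart when the storage system is unavailable" behaviour is a retry loop around the job. The sketch below is an assumption about how such a loop could look, not the Archive System's actual logic; `StorageUnavailable`, `run_with_restart`, and the default delay and attempt count are hypothetical.

```python
import time


class StorageUnavailable(Exception):
    """Hypothetical error raised when the remote store cannot be reached."""


def run_with_restart(job, max_attempts=5, delay_seconds=60.0, sleep=time.sleep):
    """Run job(); if the store is unavailable, wait and start the job again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except StorageUnavailable:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            sleep(delay_seconds)
```

The `sleep` parameter is injectable so the loop can be exercised without real waiting. A production job would probably also record which files had already been transferred, so that a restart does not repeat completed work.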

Similar interface for data download

• Users browse their archived data sets in exactly the same fashion.

• Data may be downloaded from the archive to a directory at SSRL (analogous to an upload job).

• Another option is to download selected files in one or more tar files directly to any computer on the Internet.
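Packing selected files into "one or more tar files" can be illustrated with Python's standard `tarfile` module. The splitting rule used here (start a new tar file once a size limit would be exceeded) is an assumption; the talk does not say how the Archive System decides when to start another tar file. The name `bundle_into_tars` and its parameters are hypothetical.

```python
import os
import tarfile


def bundle_into_tars(paths, out_dir, max_bytes, prefix="download"):
    """Pack files into numbered tar files, opening a new one when max_bytes would be exceeded."""
    tar_paths = []
    current, current_size = None, 0
    for path in paths:
        size = os.path.getsize(path)
        # Start a new tar file at the beginning, or when adding this file
        # would push a non-empty tar file over the limit.
        if current is None or (current_size > 0 and current_size + size > max_bytes):
            if current is not None:
                current.close()
            tar_path = os.path.join(out_dir, f"{prefix}_{len(tar_paths) + 1}.tar")
            current, current_size = tarfile.open(tar_path, "w"), 0
            tar_paths.append(tar_path)
        # Store only the base name; files with equal names from different
        # directories would collide in this simplified sketch.
        current.add(path, arcname=os.path.basename(path))
        current_size += size
    if current is not None:
        current.close()
    return tar_paths
```

A file larger than `max_bytes` still gets a tar file of its own, since the limit is only checked against tar files that already hold something.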

Page 9: The SMB Archive System: Data Backup Across the Web

Archive System Infrastructure

But first a word about SRB Accounts:

• An SRB account (independent of the SSRL Unix Account) is required to archive data.

• Your SRB account permits you to upload/download any data using SRB clients.

• Handy web page on our site to create an SRB account: https://smb.slac.stanford.edu/secure/collaboratory/archive_system/SRBAccountForm.html

The Archive System uses the following software elements:

• Apache Web Server (v1.3.27)

• Apache Tomcat Servlet Container (v4.1.24)

• Java 2 Runtime (v1.4.1)

• SMB Authentication Gateway Server

• SMB Impersonation Server

• SRB JARGON Java API (v1.1)

• Archive System Servlets (for Upload, Download, and Job Maintenance)

• Archive System Background Applications

• All Archive System applications and servlets are written in Java.

• Archive System front-end is made up of Java servlets.

• Archive System back-end is made up of Java applications.

• All infrastructure elements are either available for free or are home-grown.

Page 10: The SMB Archive System: Data Backup Across the Web

Significant infrastructure is required to provide this “simple” interface--but the payoff is huge.

Authentication Gateway Server

• Java servlet that provides a common authentication protocol for all web-based and stand-alone applications.

• Used to authenticate archive system users.

• All web-based software developed at SSRL is being updated to use this single authentication server.

• Support for the authentication server has already been integrated into Blu-Ice/DCS.

• Allows users to navigate seamlessly between applications without authenticating multiple times.

• Will eventually allow access to beamline systems to be controlled automatically based on the beam schedule.

• Access to other resources (computing, data directories, etc.) available 24/7
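The single sign-on behaviour described above (authenticate once, then move between applications) is commonly built on session tokens issued by one central service. The Python sketch below shows that general pattern under stated assumptions; it is not the SMB Authentication Gateway Server's protocol, and `SessionStore`, its methods, and the one-hour lifetime are hypothetical.

```python
import secrets
import time


class SessionStore:
    """In-memory stand-in for a central authentication service."""

    def __init__(self, check_password, lifetime_seconds=3600, clock=time.time):
        self._check_password = check_password  # callable(user, password) -> bool
        self._lifetime = lifetime_seconds
        self._clock = clock
        self._sessions = {}  # token -> (user, expiry time)

    def login(self, user, password):
        """Return a new session token, or None if the credentials are rejected."""
        if not self._check_password(user, password):
            return None
        token = secrets.token_urlsafe(32)
        self._sessions[token] = (user, self._clock() + self._lifetime)
        return token

    def validate(self, token):
        """Return the user name for a live token, or None if unknown or expired."""
        entry = self._sessions.get(token)
        if entry is None:
            return None
        user, expires = entry
        if self._clock() >= expires:
            del self._sessions[token]
            return None
        return user
```

Each application would pass the token it receives to `validate` instead of asking for a password again, which is what lets a user move between applications without logging in repeatedly.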

Impersonation Server

• Unix daemon that can run any non-interactive program on behalf of any Unix user.

• Enables web applications to run background jobs for a user with the actual rights of the Unix user account.

• Accepts commands via the HTTP protocol.

• Verifies authentication information with the Authentication Server.

• Used by the archive system to list directories in the web browser and run background archive jobs as the user.

• Will allow further analyses to be automatically initiated by the beam line control system.
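The core decision an impersonation service makes (check the caller's session, then run a non-interactive program as that Unix user) can be sketched as follows. This is a simplified illustration rather than the SMB Impersonation Server: the HTTP layer is omitted, `run_as_user` and the `validate` callback are hypothetical, and the `user=` argument of `subprocess.run` needs Python 3.9 or later on a POSIX system plus a process privileged enough to switch users.

```python
import subprocess


def run_as_user(token, argv, validate, run=subprocess.run):
    """Run argv as the Unix user that owns the session token.

    `validate(token)` is a hypothetical callback to the authentication
    service; it returns a user name, or None for an invalid session.
    """
    user = validate(token)
    if user is None:
        raise PermissionError("invalid or expired session")
    # stdin is closed because only non-interactive programs are supported.
    return run(
        argv,
        user=user,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        check=False,
    )
```

Passing the command as an argument list rather than a shell string avoids shell interpretation of anything the web front-end supplies, which matters for a service that runs commands on behalf of remote callers.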

Page 11: The SMB Archive System: Data Backup Across the Web

Archive System Web Architecture

[Diagram: Archive System Web Architecture. Components shown: Web Browser; Internet; Internet (Backbone). At SMB: Apache; Archive Servlets (Tomcat) with Define Upload, Define Download, and View Job Status; Authentication; Impersonation; Archive Jobs (background) with Upload Jobs, Download Jobs, and Job Maintenance. At SDSC: MCAT; SRB; HPSS with Disk Cache and Tape Storage.]

Page 12: The SMB Archive System: Data Backup Across the Web

Archive Projects for the next year

• Optimize data transfer rates between SSRL and SDSC.

• Provide stand-alone application for users wishing to download datasets directly from the SRB.

• Implement other functions available in InQ and MySRB for manipulating existing collections (replicate, delete, etc.).

• Provide option for automatic data upload from Blu-Ice.

• Provide link from Blu-Ice to automatically start browser and load Archive page w/o user having to log in again. (New Authentication Server makes this possible.)

• Provide additional options for using SRB Metadata Catalog (MCAT) to describe, index, and retrieve data files.

The Collaboratory for Macromolecular Crystallography is supported by the NIH, NCRR as a supplement to the SSRL Synchrotron Radiation Structural Biology Resource (P41-RR-01209). The SSRL Structural Molecular Biology program is funded by DOE BER, NIH NCRR, and NIH NIGMS.