ASDC Data Storage Re-architecture


Page 1

ASDC Data Storage Re-architecture

• As a prudent means of data stewardship, we want to routinely examine the ASDC data holdings provided by each data provider

• Our goal is to maximize the use of the storage resources (currently online disks and tape) to optimally serve the needs of users of the data

• NASA IT Security directives: All publicly accessible data and websites must be moved to DMZ (outside internal campus network)

Page 2

ASDC Data Storage Re-architecture

Some questions to ask as we work with projects to examine data sets:

• Was the data set only used as inputs for generation of final products and is no longer required?
– Has it been superseded by another version and is now obsolete?
– Has it been replaced by an alternative data set and is no longer required?
– Does the data set still need to remain in the ASDC archive?

• What level of protection does the data set warrant?
– ASDC will archive and preserve long term all publicly orderable data sets. This includes a disaster recovery copy at an off-site location.
– ASDC will archive ancillary data sets used to support production of the current and previous version of final data products
– ASDC is not chartered to permanently archive ancillary data sets where another organization has responsibility for long term archive

• Is the data set being stored in the most effective location for required user or application access?
– Publicly orderable data sets will be located on fast access disk system in LaRC DMZ
– Data sets required for production and use by local scientists will be stored in the DPO
– Data sets not required for current or near-term access will likely be retrieved from the tape archive.
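The three location rules above can be sketched as a small decision function. This is only an illustration in Python: the function name and the two boolean flags are hypothetical and not part of any ASDC system, and the returned labels simply paraphrase the slide.

```python
from typing import List


def access_locations(publicly_orderable: bool, needed_for_production: bool) -> List[str]:
    """Return where a data set would be served from, following the slide's rules.

    Both flags are hypothetical inputs invented for this sketch.
    """
    locations = []
    if publicly_orderable:
        # Publicly orderable data sets sit on fast access disk in the LaRC DMZ.
        locations.append("DMZ disk")
    if needed_for_production:
        # Production inputs and data used by local scientists live in the DPO.
        locations.append("DPO")
    if not locations:
        # Nothing needs current or near-term access: retrieve from tape on demand.
        locations.append("tape archive")
    return locations
```

For example, a data set that is both publicly orderable and a production input would be served from both disk locations, while one that is neither would be retrieved from tape.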

Page 3

ASDC Data Storage Re-architecture: Architecture of Current Data Storage

ANGe sends data to 3 storage locations:

• Tape Archive: Oracle SL8500 with LTO-4/LTO-6 tapes (4/5 PB)

• DPO: SGI/NetApp IS5000 Storage Units (using 3.3 PB of 5.5 PB)

• Orders Cache: IBM DS5300 RAID Unit (524 TB)

Page 4

ASDC Data Storage Re-architecture: Architecture of Current Data Storage (cont.)

• Tape Archive
– Oracle SL8500 Tape Library
– Primary archive media for all data archived in ANGe
– 24 x LTO-4 tape drives; 12 x LTO-6 tape drives
– 10,000 tape slots available
– 6,000 LTO-4 tapes (800 GB each; 4 PB capacity)
– 2,000 LTO-6 tapes (2.5 TB each; 5 PB capacity)
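The tape counts above can be checked with a little arithmetic. The sketch below is in Python and assumes decimal units (1 PB = 1,000,000 GB), native uncompressed tape capacities, and that 10,000 is the library's total slot count; the function and variable names are illustrative only. Under these assumptions the LTO-4 pool comes to a nominal 4.8 PB, somewhat above the 4 PB quoted on the slide, and the LTO-6 pool to 5 PB.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units


def pool_capacity_pb(tape_count: int, gb_per_tape: int) -> float:
    """Nominal capacity of a pool of identical tapes, in petabytes."""
    return tape_count * gb_per_tape / GB_PER_PB


lto4_pb = pool_capacity_pb(6_000, 800)    # 6,000 tapes at 800 GB each -> 4.8
lto6_pb = pool_capacity_pb(2_000, 2_500)  # 2,000 tapes at 2.5 TB each -> 5.0

# Slots left if every tape listed on the slide occupies one slot.
slots_free = 10_000 - (6_000 + 2_000)     # -> 2000
```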

Page 5

ASDC Data Storage Re-architecture: Architecture of Current Data Storage (cont.)

• DPO (Data Products Online)
– Online disk storage for most of the data archived via ANGe (/ASDC_archive4, /ASDC_archive5, /ASDC_archive6)
– SGI/NetApp IS5000 Storage Systems (5.5 PB usable storage); configured as 5 GPFS Building Blocks (1.1 PB each)

• Orders Cache
– Online disk storage located in the DMZ for caching the most frequently ordered data products
– IBM DS5300 Storage System (520 TB usable storage)

Page 6

Data Capacities as of May 2016 ~ 2.3 PB

Page 7

ASDC Data Storage Re-architecture: Technology Refresh

• New tape technology: LTO-4 (800 GB) → LTO-6 (2.5 TB) per tape

• Quantum StorNext Software → IBM HPSS
– Cost savings:
  • StorNext is licensed by the capacity of data stored; HPSS license cost does not increase as data volume continues to grow; must store data on shelf to live within licensed capacity
  • StorNext has been operationally expensive due to software instabilities
– IBM HPSS can be integrated with existing IBM GPFS file systems to provide an end-to-end data management solution
– ANGe can write a single copy of data files to DPO and IBM software will handle writing tape copies

Page 8

ASDC Data Storage Re-architecture: Other IBM HPSS Pros

• IBM High Performance Storage System (HPSS) has been used at NASA Langley for over 20 years and is deployed in production environments at many multi-petabyte sites around the world.

• IBM has a track record of helping customers transition from StorNext and other archiving solutions.

Page 9

[Diagram: GHI architecture. Labels shown: ANGe (Ingest/Archive); DPO (GPFS) with ASDC_archive4, ASDC_archive5, ASDC_archive6; Migration Policy; Threshold Policy; GHI session; Process Manager; Scheduler Daemon; Event Daemon; Mount Daemon; ghi_migrate, ghi_recall, ghi_list, ghi_stage; GHI IOM; I/O Manager; ISHTAR; HPSS; HPSS Mover; HPSS Core; GHI DB; Configuration Manager]

Page 10

ASDC Data Storage Re-architecture: Storage Tiers for ASDC Data

Storage Tier: DMZ Data Store
Description: DMZ accessible online disk-based data store
Current Medium: SGI/NetApp disk system; ~1 PB capacity
Profile: High performance IBM GPFS file system; low latency; 4 TB disk drives with RAID 6 protection

Storage Tier: Internal Data Store (DPO)
Description: Internally accessible only, online disk-based data store
Current Medium: SGI/NetApp disk system; ~4 PB capacity
Profile: High performance IBM GPFS file system; low latency; 4 TB disk drives with RAID 6 protection

Storage Tier: Tape Archive Data Store
Description: Data on tapes in local tape library
Current Medium: Oracle SL8500 tape library with LTO-6 tapes; 5 PB capacity
Profile: IBM HPSS managed tape archive with two copies of data (separate tapes)

Storage Tier: DR Tape Data Store
Description: Data on tapes stored at Disaster Recovery site
Current Medium: Iron Mountain (Ashland, VA)
Profile: Second tape copy of data sent to DR site within 14 days after creation
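The 14-day rule for the disaster recovery copy lends itself to a simple date check. The following Python sketch assumes a hypothetical per-file creation date and a shipped flag; neither the function names nor these inputs come from the slides, only the 14-day window does.

```python
from datetime import date, timedelta

DR_WINDOW_DAYS = 14  # from the slide: second copy sent within 14 days of creation


def dr_ship_deadline(created: date) -> date:
    """Last day on which the second tape copy may be sent to the DR site."""
    return created + timedelta(days=DR_WINDOW_DAYS)


def dr_copy_overdue(created: date, today: date, shipped: bool) -> bool:
    """True when the DR copy has not been shipped and the window has passed."""
    return not shipped and today > dr_ship_deadline(created)
```

A monitoring job could run such a check daily over newly archived files and flag any whose DR copy is overdue.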

Page 11

ASDC Data Storage Re-architecture: Data Policy
Establish Data Retention Policies by Data Categories

Categories intended to cover all ASDC data holdings (CALIPSO, CERES, MISR, MODIS, etc.)

Code   Data Category
P0     Publicly Orderable Data (current versions)
P0CP   Publicly Orderable Data (current versions; input for current production stream)
P1     Publicly Orderable Data (1 version back from current version)
P1CP   Publicly Orderable Data (1 version back; input for current production stream)
P2     Publicly Orderable Data (2 or more versions back from current version)
P2CP   Publicly Orderable Data (2 or more versions back; input for current production stream)
L0     Level 0 Data (including associated orbit/attitude)
I0     Intermediate Data (current production stream)
I1     Intermediate Data (non-current production stream)
A0     Ancillary Data (used in current or planned production)
A1     Ancillary Data (used for non-current production)
ANP    Ancillary Data (not used for production)
QA     QA/QC Output Data (short term validation)
VAL    Validation Output Data (long term validation)
PD     Project Documentation (ATDBs, mission website, etc.)
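The category codes above form a small lookup table. As a sketch of how they might be carried in software, the Python below stores them in a dictionary and adds one helper; the dictionary name and the helper are hypothetical, and the descriptions are abbreviated from the slide.

```python
# Category codes and short descriptions, abbreviated from the slide.
DATA_CATEGORIES = {
    "P0": "Publicly Orderable Data (current versions)",
    "P0CP": "Publicly Orderable Data (current; input for current production)",
    "P1": "Publicly Orderable Data (1 version back)",
    "P1CP": "Publicly Orderable Data (1 version back; input for current production)",
    "P2": "Publicly Orderable Data (2 or more versions back)",
    "P2CP": "Publicly Orderable Data (2 or more versions back; input for current production)",
    "L0": "Level 0 Data",
    "I0": "Intermediate Data (current production stream)",
    "I1": "Intermediate Data (non-current production stream)",
    "A0": "Ancillary Data (current or planned production)",
    "A1": "Ancillary Data (non-current production)",
    "ANP": "Ancillary Data (not used for production)",
    "QA": "QA/QC Output Data",
    "VAL": "Validation Output Data",
    "PD": "Project Documentation",
}


def is_current_production_input(code: str) -> bool:
    """True for the 'CP' variants, which feed the current production stream."""
    if code not in DATA_CATEGORIES:
        raise KeyError(f"unknown data category code: {code}")
    return code.endswith("CP")
```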

Page 12

ASDC Data Storage Re-architecture: Data Policy
Publicly Orderable Data (current versions)

Data Category Description: Current versions of our publicly orderable data products; the most recent ones the science teams point people toward for any defined period of time and spatial region.
Category: Publicly orderable
Required for current production processing input: Some products
Required for historical production processing: n/a
Version: Current
Availability: Publicly available
Access Speed Required: High
Retention Priority: High

Example Data Sets: CALIPSO V3/V4; CATS V2; CERES Ed3/Ed4; MISR V006; MOPITT V006; SAGE III Meteor-3M V004; TES V006

1. Data will be actively stored and managed on fast access storage tier and be accessible to the public outside the campus firewall (in DMZ)
2. Data will be actively stored and managed on fast access storage tier designated for inputs to production and use by local scientists and DMT
3. Copy of data will be actively stored and managed in archive storage tier with high latency (tape media)
4. Copy of data will be actively stored and managed at disaster recovery facility; integrity of data stored at DR site validated annually
5. Revisit status of data annually or when a new version is published
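One way to make a per-category policy like this machine-readable is a small immutable record. The Python sketch below is an assumption-laden illustration: the `RetentionPolicy` class and its field names are hypothetical, and mapping items 1 to 4 onto the four storage tier names from the Storage Tiers slide is an interpretation, not something the slides state.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical record for one data category's retention policy."""
    code: str
    availability: str
    access_speed: str
    retention_priority: str
    storage_tiers: Tuple[str, ...]
    review: str


# Values taken from the slide; tier names are an assumed mapping of items 1 to 4.
P0_POLICY = RetentionPolicy(
    code="P0",
    availability="Publicly available",
    access_speed="High",
    retention_priority="High",
    storage_tiers=(
        "DMZ Data Store",
        "Internal Data Store (DPO)",
        "Tape Archive Data Store",
        "DR Tape Data Store",
    ),
    review="annually or when a new version is published",
)


def stored_on(policy: RetentionPolicy, tier: str) -> bool:
    """True if the policy keeps a copy of the data on the named tier."""
    return tier in policy.storage_tiers
```

Records for the other categories would differ mainly in which tiers they list, which is what would drive migration of data off disk.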

Page 13

ASDC Data Storage Re-architecture: The Plan Forward

• New HPSS configuration being deployed by IBM with initial operations March/April 2017

• Need data retention policies in place to govern the migration of data from disk when it no longer requires immediate and rapid access

• Need to remove data from the archive that no longer has value

• Will continue to work with Walt in defining data policies for CERES data sets