Data Challenges from the NASA Perspective

Science Mission Directorate: Space Science Archives at NASA. Jeffrey Hayes, Science Mission Directorate, NASA HQ. July 15, 2004

Transcript of Data Challenges from the NASA Perspective

Page 1: Data Challenges from the NASA Perspective

Science Mission Directorate

Space Science Archives at NASA

Jeffrey Hayes Science Mission Directorate, NASA HQ

July 15, 2004

Page 2: Data Challenges from the NASA Perspective

The funding agencies and muses

How we really work!

(Note where Urania is…)

Page 3: Data Challenges from the NASA Perspective

Space Science Archives

The idea of archiving data is old in the NASA science community. Data are the only legacy a mission has once it has ended.

Because data were being archived in a haphazard manner early on, it was decided in the late 1960s to form a central archive to capture all Space Science data: the National Space Science Data Center (NSSDC) at Goddard Space Flight Center (GSFC). This is still the “deep archive” for all NASA Space Science missions.

Page 4: Data Challenges from the NASA Perspective

Space Science Archives

Over the years the NSSDC evolved to include not only Space Science data, but also large ground-based astronomical catalogues needed by both the NASA and the general science communities.

All data were available on request.

All data were maintained on (by today’s standards) very primitive media (e.g. cards, 7-track mag. tape).

This led to questions of accessibility and usefulness.

Page 5: Data Challenges from the NASA Perspective

Standards

• For data to be useful, it must conform to a form that most of the community agrees to, and can understand.

– Catalogue data was standardized on the 80-character line of data (hold-over from punch cards). There were limitations, because some records were longer than 80 characters. Image data was essentially photographic prints. Spectroscopy was in lists of lines and wavelengths.

– All of these formats had severe limitations for transport and further analysis.
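To make the 80-character limitation concrete, here is a minimal sketch of how a fixed-width catalogue record might be read. The field names, column positions, and the sample record are assumptions made up for illustration; they do not describe any actual NSSDC catalogue.

```python
# Hypothetical column layout for one 80-character catalogue record.
# Field names and column positions are illustrative assumptions only.
FIELDS = {
    "name": (0, 10),
    "ra_deg": (10, 20),
    "dec_deg": (20, 30),
    "vmag": (30, 36),
    "remarks": (36, 80),
}


def parse_card(line: str) -> dict:
    """Split one fixed-width record into named fields."""
    card = line.rstrip("\n").ljust(80)
    if len(card) > 80:
        # The limitation mentioned on the slide: longer records do not fit.
        raise ValueError("record exceeds 80 characters")
    record = {key: card[start:stop].strip() for key, (start, stop) in FIELDS.items()}
    for key in ("ra_deg", "dec_deg", "vmag"):
        record[key] = float(record[key])
    return record


# A made-up record, built column by column so the widths are explicit.
sample = (
    "STAR 0001".ljust(10)
    + "10.68471".rjust(10)
    + "41.26875".rjust(10)
    + "7.25".rjust(6)
    + " example remark"
)
star = parse_card(sample)
```

Every value's meaning depends on its column position, and a record that needs more than 80 characters has nowhere to go, which is one reason such formats were awkward to transport and analyse.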

Page 6: Data Challenges from the NASA Perspective

Standards

• In 1981 the Flexible Image Transport System (FITS) standard was published (Wells et al.). This was a self-describing data structure that allowed for the storage of data on a computer as a file with an embedded header.

• Quickly became the standard in the astronomical and parts of the Solar physics communities, and with substantial modifications, is still the standard.

• All FITS data can be read by all FITS readers (with provisos).

• Became a NASA standard for data in 1999.

• New standards are coming on-line (XML, VOTable).
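The "self-describing" idea can be sketched in a few lines. The example below assumes only the basic FITS header layout (80-character keyword records, an END record, padding to a 2880-byte block) and handles just logical and integer values. It is a simplified illustration with hypothetical helper names, not a conforming FITS reader or writer.

```python
CARD = 80     # each header record ("card") is 80 characters
BLOCK = 2880  # FITS files are organised in 2880-byte blocks


def make_card(keyword: str, value) -> str:
    """Format one 'KEYWORD = value' card (logical and integer values only)."""
    if isinstance(value, bool):
        text = "T" if value else "F"
    else:
        text = str(int(value))
    # Keyword in columns 1-8, '= ' in columns 9-10, value right-justified to column 30.
    return f"{keyword:<8}= {text:>20}".ljust(CARD)


def make_header(cards: dict) -> bytes:
    """Join the cards, append END, and pad with spaces to a whole block."""
    body = "".join(make_card(k, v) for k, v in cards.items()) + "END".ljust(CARD)
    blocks = -(-len(body) // BLOCK)  # ceiling division
    return body.ljust(blocks * BLOCK).encode("ascii")


def read_header(raw: bytes) -> dict:
    """Recover keyword/value pairs by walking the header 80 characters at a time."""
    header = {}
    for i in range(0, len(raw), CARD):
        card = raw[i:i + CARD].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] == "= ":
            # Anything after '/' is a comment (safe here: no string values).
            header[keyword] = card[10:].split("/")[0].strip()
    return header


raw = make_header({"SIMPLE": True, "BITPIX": 16, "NAXIS": 2, "NAXIS1": 512, "NAXIS2": 512})
header = read_header(raw)
```

A reader that knows nothing about the image in advance can recover its dimensions and pixel type from the header alone. In practice an established FITS library would be used rather than hand-written parsing.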

Page 7: Data Challenges from the NASA Perspective

Archival Evolution

The NSSDC was a great idea, but the advances in computer science made it out of date. Astronomers wanted access to digital data that was straightforward and did not require translation from one data format to another. There was also a desire for rapid dissemination of that data other than by US Mail.

The growth in the Internet and the decline of computer hardware prices made this possible.

Page 8: Data Challenges from the NASA Perspective

Archival Evolution

• In the mid-1980s NASA tried to develop the Astrophysics Data System (ADS), which would combine all astronomical data. It was a failure because there was not enough compute power or network capacity for such a system. (A sideline to the ADS work was digitizing all the astronomical literature, and the ADS is now the premier site for such data in the world. All scanned papers are in PDF or JPEG formats.)

• In 1990 NASA started the HEASARC with the more limited aim of systematically archiving all High Energy Astrophysics data. This was very successful and became the model for other space science archives.

Page 9: Data Challenges from the NASA Perspective

Active Archives

• HEASARC was the first of a series of active archives that allowed the community to interact with the data by allowing down-loads of post-Level 1 data sets. The archive model now consists of scientists who maintain the integrity of the data, and also develop new and better tools for the manipulation and analysis of these data.

• The model has been generalized across wavelength regimes: HEA -- HEASARC; UV/Optical -- MAST; IR/sub-mm -- IRSA; CMB -- LAMBDA; etc.

• The planetary sciences have the Planetary Data System (PDS) which parallels the Astronomy archives, but with differences (i.e. includes FITS as well as other standards, and nodes based on science discipline type).

Page 10: Data Challenges from the NASA Perspective

Philosophy

• The last decade has seen phenomenal growth in the power and capability of computers. This has allowed for the rapid evolution of archives and their ability to respond to the community’s needs.

• We have moved from a mainframe, static archive philosophy, to one that is more mobile and dynamic and evolves through feedback with more sophisticated data products.

• We have moved away from simple curation to managing the data. Data must now be migrated from one medium to another with a reasonable plan on how to do it.

Page 11: Data Challenges from the NASA Perspective

Philosophy

• It is NASA policy that all science data come into the public domain as soon as possible. The researcher signs an agreement that all data taken by a NASA mission will be archived and after a suitable length of time (within 6 to 12 months from the date of observation/data acquisition), the data become publicly accessible. There are very few exceptions to this rule.
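As a rough illustration of the 6 to 12 month window, an archive could compute the date on which an observation becomes public along the following lines. The function names and the 12-month default are assumptions for this sketch, not a description of any archive's actual software.

```python
import calendar
from datetime import date


def add_months(day: date, months: int) -> date:
    """Return the same day-of-month `months` later, clamped to the month's length."""
    index = day.year * 12 + (day.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def public_release_date(observed: date, proprietary_months: int = 12) -> date:
    """Date on which data taken on `observed` would become publicly accessible."""
    return add_months(observed, proprietary_months)


release = public_release_date(date(2004, 7, 15))                       # 12-month period
early = public_release_date(date(2004, 8, 31), proprietary_months=6)   # 6-month period
```

The clamping step matters for observations taken at the end of a month: six months after 31 August lands on the last day of February.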

Page 12: Data Challenges from the NASA Perspective

Hardware

• Hardware is now cheap. Both memory and disks are to the point where it is logical and practical to spin all data possible and make it accessible via the Web to all users.

• Less than 10 years ago, we were still using 9-track tapes, some Exabytes, some DAT, and M/O disks. Little data was “spinning”. Now with the advent of RAID disks on the TB scale, flash drives on the GB scale, and laptops with G-flop compute power, the only problem is bandwidth. That too is getting cheaper and more accessible.

• The NASA archives are trying to keep abreast of all these developments.

Page 13: Data Challenges from the NASA Perspective

Software

• The archiving tools used are usually COTS products. It is not cost effective to develop entire stand-alone SQL systems for archives. The customization is in adapting their use to an astronomical context. However, as the evolution of database management continues, we are seeing tremendous flexibility in how these data can be managed.

• The compute power also allows for the development of very sophisticated data imaging, manipulation, and analysis tools. This is now considered to be within the purview of the active science archives.

Page 14: Data Challenges from the NASA Perspective

Management

• Day-to-day operations are managed at the active archive level by a scientist responsible to NASA HQ for the assigned activities.

• There are biannual meetings of a coordinating committee (ADEC) with NASA HQ having ex-officio membership.

• The active Astronomy archives are peer-reviewed every 4 years in the Astronomy Senior Review process. On-going and new activities are proposed and judged. Funding can be reallocated as needed.

Page 15: Data Challenges from the NASA Perspective

Operational Archives

• The suite of NASA Space Science archives currently consists of:

– 1 “deep” archive:

— NSSDC

– 7 active archives:

— HEASARC, MAST, IRSA, LAMBDA, MSC, PDS, SSDC

– 2 data services:

— ADS, NED

– 3 on-going Great Observatory missions have stand-alone archives which are associated with the above active archives:

— HST, CXO, SIRTF

Page 16: Data Challenges from the NASA Perspective

Interoperability

• There is a movement in the astronomical community to have access to data from multiple wavelength regimes in order to cross-correlate them. The Space Science archives are now working on a plan to implement such interoperability (the NVO by another name). In addition, we want to incorporate both theory and modeling data in the infrastructure.

• This promises a huge leap in science by using the Internet, Grid technologies, and very fast computing techniques.

• NASA is working on a response to the white paper produced by the archives.
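The cross-correlation of data from different wavelength regimes mentioned above usually starts with a positional cross-match between catalogues. The sketch below is a deliberately simple brute-force version. The catalogue entries, source names, and the 2 arcsecond match radius are all invented for illustration, and a real service would use a spatial index rather than comparing every pair.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(catalogue_a, catalogue_b, radius_arcsec=2.0):
    """Pair each source in A with its nearest neighbour in B within the radius."""
    matches = []
    for name_a, ra_a, dec_a in catalogue_a:
        best = None
        for name_b, ra_b, dec_b in catalogue_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b) * 3600.0
            if sep <= radius_arcsec and (best is None or sep < best[1]):
                best = (name_b, sep)
        if best is not None:
            matches.append((name_a, best[0], best[1]))
    return matches


# Made-up (name, RA, Dec) entries standing in for two archives' catalogues.
xray = [("X1", 150.00000, 2.20000), ("X2", 150.10000, 2.30000)]
infrared = [("IR1", 150.00020, 2.20010), ("IR2", 151.00000, 2.00000)]
pairs = cross_match(xray, infrared)
```

Here only X1 and IR1 lie within the match radius of each other, so X2 is left without an infrared counterpart. Interoperability between archives amounts to making this kind of query possible without first downloading and reformatting each catalogue by hand.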

Page 17: Data Challenges from the NASA Perspective

Historical Data

• One last point: What to do with old, pre-digital data?

– Photographic plates exist in huge numbers and are both fragile and of finite lifetime. Harvard and Caltech are digitizing their plate collections, but most other institutions cannot because of the cost.

– Do we accept the loss of such historical data, or do we collect only those collections at large or national observatories which have a uniform pedigree?

Should data have expiration dates? (Like milk.)

Page 18: Data Challenges from the NASA Perspective

NASA Astronomy Archives

Backup Material

Page 19: Data Challenges from the NASA Perspective

Data Archive Centers: LAMBDA

Legacy Archive for Microwave Background Data Analysis (LAMBDA)

• http://lambda.gsfc.nasa.gov/

• “One Stop Shopping for CMB Researchers”

• Contains Cosmic Microwave Background data and data products from WMAP, COBE, IRAS, SWAS missions; related software (CMBFAST, HEALPix etc); and archives of news and science papers.

Page 20: Data Challenges from the NASA Perspective

Data Archive Centers: IRSA

NASA/IPAC InfraRed Science Archive (IRSA)

• http://irsa.ipac.caltech.edu/

• “Archive node for scientific data sets from NASA’s infrared and sub-millimeter astronomy projects and missions”

• Contains data from 2MASS, IRAS, MSX, SWAS, ISO, Spitzer, and related inventory, software, and data exploration services.

Page 21: Data Challenges from the NASA Perspective

Data Archive Centers: MAST

Multimission Archive at Space Telescope (MAST)

• http://archive.stsci.edu/

• “Supports a variety of astronomical data archives, with the primary focus on scientifically related data sets in the optical, ultraviolet, and near-infrared parts of the spectrum”

• Contains data and data products from HST, FUSE, IUE, EUVE, ASTRO, HUT, UIT, WUPPE, and others;

• Also, catalogues and surveys from GALEX, SDSS, GSC, DSS, VLA-FIRST, relevant software (STSDAS), etc.

Page 22: Data Challenges from the NASA Perspective

Data Archive Centers: HEASARC

High Energy Astrophysics Science Archive Research Center (HEASARC)

• http://heasarc.gsfc.nasa.gov/

• “An archive of astronomy data from extreme ultraviolet, X-ray, and gamma-ray observatories”

• Contains data from ASCA, BeppoSAX, CGRO, Chandra, EUVE, HETE-2, Integral, ROSAT, RXTE, XMM-Newton, and others. In the future will serve data from Astro-E2 and Swift. Also multi-mission software and analysis tools, and information for educators and the public.

Page 23: Data Challenges from the NASA Perspective

Data Archive Centers: CXC

Chandra X-ray Center (CXC)

• http://chandra.harvard.edu/

• Center for Chandra science and calibration data, proposer information, data analysis software assistance, public information and education resources.

Page 24: Data Challenges from the NASA Perspective

Services: ADS and NED

NASA Astrophysics Data System (ADS)

• http://adswww.harvard.edu/

• The main body of data in the ADS consists of bibliographic records searchable through database queries, and full-text scans of much of the astronomical literature.

NASA/IPAC Extragalactic Database (NED)

• http://nedwww.ipac.caltech.edu/

• Database built around a master list of extragalactic objects; bibliographic references, photometry, position and redshift data, etc.

Page 25: Data Challenges from the NASA Perspective

Data Archive Centers: MSC

Michelson Science Center (MSC)

• http://msc.caltech.edu/

• “Science operations and analysis service organization for selected NASA Origins Theme projects” - software infrastructure, science ops and consulting to Navigator Program projects and their user communities;

• Up-and-coming archive for data from the Palomar Testbed Interferometer, Keck Interferometer, SIM, and TPF.

Page 26: Data Challenges from the NASA Perspective

Data Archive Centers: NSSDC

National Space Science Data Center (NSSDC)

• http://nssdc.gsfc.nasa.gov/

• “The NSSDC is responsible for the long term archiving and preservation of all space science data” - provides a permanent archive for OSS data (for space physics, solar physics and planetary/lunar, as well as astrophysics)

• Relatively recent data is held on CD-ROMs; older astrophysics datasets available on offline media.

Page 27: Data Challenges from the NASA Perspective

Data Archive Centers: SSDC

Solar Science Data Center (SSDC)

• http://ssdc.gsfc.nasa.gov/

• Provides a permanent archive for Solar data (for space physics, solar physics as well as upper atmospheric physics)

• Relatively recent data is held on CD-ROMs; older astrophysics datasets available on offline media.

• Data is in mixed formats: mainly FITS for imaging, but HDF used for spectra.

• Colocated with NSSDC at GSFC.

Page 28: Data Challenges from the NASA Perspective

Data Archive Centers: PDS

Planetary Data System (PDS)

• http://pds.jpl.nasa.gov/

• Provides active archive for various aspects of planetary mission data. Unlike other centers, it is discipline specific, not wavelength specific (i.e. rings, small bodies, satellites, etc.)

• Relatively recent data is held on CD-ROMs; older datasets available on RAID disks or on offline media.

• Data is in mixed formats: some FITS for imaging, but mainly HDF of various flavors for other data.

• Various discipline nodes across the country with central coordination at JPL Central Node.

Page 29: Data Challenges from the NASA Perspective

Data Archives: Funding Levels

NASA archive funding levels in FY04 (in $M)

LAMBDA        0.9
IRSA          1.2
MAST/HST      0.85 (HST ~4M)
HEASARC       2.8
CXC           (archive costs hard to deconvolve from overall Center costs)
NSSDC/SSDC    ~7 (complicated because of shared costs from Solar Science)
MSC           7.5
ADS           0.8
NED           1.3
PDS           9 (for all nodes)