US GPO AIP Independence Test

Posted on 06-Jan-2016



US GPO AIP Independence Test

CS 496A – Senior Design

Team members: Antonio Castillo, Johnny Ng, Aram Weintraub, Tin-Shuk Wong

Faculty advisor: Dr. Russ Abbott

GPO contact: Kate Zwaard

Overview

Background
  OAIS
  FDsys
  AIP
  METS, MODS, and PREMIS

Project Objectives

Solution Strategy
  XML parsing
  A note on deliverables
  Repositories
  Testing

Conclusion

OAIS: Open Archival Information System

“An OAIS is an archive consisting of an organization of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community”

Developed by the Consultative Committee for Space Data Systems (CCSDS) and published as ISO 14721:2003

FDsys: Federal Digital System

FDsys is an OAIS maintained by the U.S. Government Printing Office to provide public access to information submitted by Congress and Federal agencies.

OAIS Primary Functions

Ingest – Turn SIPs into AIPs
Archival Storage – Storage and retrieval of AIPs
Data Management – Populating, maintaining, and accessing the varieties of information
Administration – Controls day-to-day operations
Preservation Planning – Maintaining archive accessibility
Access – Functions for accessing the archive

Information Package: a critical component of OAIS

The information package is a conceptual linking of content information with its preservation description and packaging information.

Three kinds of information packages (submitted for ingest, stored in the archive, and delivered on access):
SIP – Submission Information Package
AIP – Archival Information Package
DIP – Dissemination Information Package

AIP

Archival Information Package

What is an AIP?
METS
MODS
PREMIS

Project Objectives:

Prove AIP Independence

Improve GPO's file system.

AIP: METS

Understanding METS

Schema

File format

Seven major sections

AIP: METS Schema

Five of the seven major sections (Structural Links and Behavior are not shown):
METS Header
Descriptive Metadata
Administrative Metadata
File Section
Structural Map
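To make the section layout above concrete, here is a minimal sketch that lists which top-level METS sections a document contains, using the namespace-aware DOM parser from JAXP. The class name MetsSections and the tiny inline sample are hypothetical and are not taken from FDsys data; the element names (metsHdr, dmdSec, amdSec, fileSec, structMap) and the namespace URI are those defined by the METS schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class MetsSections {
    // Namespace URI declared by the METS schema.
    static final String METS_NS = "http://www.loc.gov/METS/";

    /** Returns the local names of the root's direct child elements that are in the METS namespace. */
    static List<String> topLevelSections(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for getNamespaceURI/getLocalName
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> names = new ArrayList<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && METS_NS.equals(n.getNamespaceURI())) {
                names.add(n.getLocalName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, heavily trimmed METS skeleton used only for illustration.
        String sample = "<mets xmlns=\"http://www.loc.gov/METS/\">"
                + "<metsHdr/><dmdSec ID=\"d1\"/><amdSec/><fileSec/>"
                + "<structMap><div/></structMap></mets>";
        System.out.println(topLevelSections(sample));
    }
}
```

The parser here is non-validating, so the sketch only reports which sections are present; it does not check the file against the METS schema.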

AIP: MODS

Descriptive metadata

Extension to METS

Top-level elements
Mandatory
Recommended
Optional


AIP: PREMIS

Preservation metadata

Extension to METS

PREMIS Data Model
Intellectual Entity
Object Entity
Event Entity
Agent Entity
Rights Entity*


Solution Strategy

The data we have received are AIPs, not SIPs. Repository software can only ingest SIPs. We must therefore write scripts that parse the AIPs and construct SIPs from an arbitrary file structure, and then ingest those SIPs into repository software to create new AIPs for the same information.

XML Parsing

We plan to use the Java programming language for our scripting needs.
The Java API for XML Processing (JAXP) is the standard Java library for parsing XML.
It provides several different representations of XML, such as DOM trees and SAX event streams.
After being rendered human-readable, the AIP files will need to be converted into a new SIP schema of our own design, which would describe only the information that still appears relevant.

XML Parsing Example

This is a portion of a sample FDsys MODS file that summarizes a bill in Congress:

<extension>
  <collectionCode>BILLS</collectionCode>
  <searchTitle>To increase Federal Pell Grants for the children of fallen public safety officers, and for other purposes.;Officer Daniel Faulkner Children of Fallen Heroes Scholarship Act of 2010;S. 3880 (IS)</searchTitle>
  <category>Bills and Statutes</category>
  <waisDatabaseName>111_cong_bills</waisDatabaseName>
  <branch>legislative</branch>
  <dateIngested>2010-10-06</dateIngested>
</extension>

XML Parsing Example

We might expect this type of output once properly parsed:

<extension>
Collection code: “BILLS”
Search title: “To increase Federal Pell Grants for the children of fallen public safety officers, and for other purposes.;Officer Daniel Faulkner Children of Fallen Heroes Scholarship Act of 2010;S. 3880 (IS)”
Category: “Bills and Statutes”
WAIS database name: “111_cong_bills”
Branch: legislative
Date ingested: 2010-10-06
</extension>
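A minimal sketch of how the extension fragment above could be read with the JAXP DOM parser follows. The class ExtensionParser and its parseFields method are hypothetical names chosen for illustration, the searchTitle element is left out of the inline sample for brevity, and the sketch assumes the fragment carries no namespace prefixes, which may not hold for a complete MODS file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class ExtensionParser {
    /** Maps each child element name of the root to its trimmed text content, in document order. */
    static Map<String, String> parseFields(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> fields = new LinkedHashMap<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                fields.put(n.getNodeName(), n.getTextContent().trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Trimmed copy of the sample fragment shown in the slides.
        String xml = "<extension>"
                + "<collectionCode>BILLS</collectionCode>"
                + "<category>Bills and Statutes</category>"
                + "<waisDatabaseName>111_cong_bills</waisDatabaseName>"
                + "<branch>legislative</branch>"
                + "<dateIngested>2010-10-06</dateIngested>"
                + "</extension>";
        // Print one "name: value" line per field.
        for (Map.Entry<String, String> e : parseFields(xml).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

A LinkedHashMap keeps the fields in the order they appear in the file, which makes the printed output easy to compare against the source XML by eye. Mapping element names such as collectionCode to readable labels such as "Collection code" would be a separate lookup step.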

A Note on Deliverables

Because our aim is not to design software, this is not a typical computer science design project. Instead, we are conducting coded experimental tests on real data and forming conclusions based on the results.

Deliverables will most likely include:
a written report of our findings and recommendations
a reorganized version of the input data

Testing

After parsing and organizing the data, it will be important to perform checks to ensure that the reconstruction is accurate.
We may send a preliminary report to GPO for verification.
The exact testing procedure is still undefined, as we haven’t had a chance to investigate the data in depth yet.
Our goals should be clearer once we understand exactly what type of data we are dealing with.
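Since the testing procedure is still undefined, the following is only one possible kind of accuracy check, not the team's chosen method: comparing SHA-256 digests of a content file before and after reconstruction. The class FixityCheck and its method names are hypothetical, and the byte arrays stand in for file contents that would normally be read from disk.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class FixityCheck {
    /** Returns the SHA-256 digest of the data as a lowercase hex string. */
    static String sha256Hex(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** True when both byte sequences produce the same SHA-256 digest. */
    static boolean sameContent(byte[] original, byte[] reconstructed) throws Exception {
        return sha256Hex(original).equals(sha256Hex(reconstructed));
    }

    public static void main(String[] args) throws Exception {
        // Placeholder data; a real check would read the original and re-ingested files.
        byte[] original = "sample bill text".getBytes(StandardCharsets.UTF_8);
        byte[] reconstructed = "sample bill text".getBytes(StandardCharsets.UTF_8);
        System.out.println("Content unchanged: " + sameContent(original, reconstructed));
    }
}
```

A digest comparison only confirms that content files survived the round trip byte for byte; checking that the regenerated metadata is equivalent would need a separate, field-level comparison.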

Repositories

Third-party repository software will ingest the SIPs we create.
DSpace, Fedora Commons (DuraSpace)
Based on a few simple technologies:
Java
MySQL
Apache Tomcat
JavaScript Server

Conclusion

Our thanks to Kate, Dr. Abbott, and Dr. Pamula for their support.