PAWN: A Novel Ingestion Workflow Technology for Digital Preservation
Mike Smorul, Joseph JaJa, Yang Wang, and Fritz McCall
Overall Principles
Consistent with the Open Archival Information System (OAIS) model
Distributed, secure ingestion
Use of web/grid technologies – platform independent
Minimal client-side requirements
Ease of integration with archival storage or data grid systems.
[Diagram: Producer, Producer Management Interface, Producer data suppliers, Archive, Management Server]
Producer
Provides data to an Archive based on a prior agreement.
Consists of a management/metadata server and an ingestion client.
Provides initial arrangement, context, and metadata.
[Diagram, Archive – receiving: Producer 1, Producer 2, Producer n; Load Balancer; Bitstream Validation Service; Digital Archive]
Archive – receiving
Receives data from a Producer.
Validates bitstreams and metadata, and sends acknowledgement to Producer.
Arranges into collections and specifies preservation policy.
Publishes bitstreams into a digital archive.
Archive – Long term preservation
Implemented using grid technologies.
Uses the existing prototype NARA/UMD/SDSC site.
Automated replication and integrity checking.
Enforces access control and preservation policy.
Ingestion Workflow
1. Negotiate Submission Agreement.
2. Workflow Initialization and Submission Information Package (SIP) creation.
3. Transfer of SIPs to archive.
4. Validation of SIP transfer.
5. Organization of data into collections and transfer into persistent archive.
Submission Agreement
Based on data appraisal and record schedule, including format and metadata.
Create machine actionable set of rules describing items.
Final Submission Agreement is composed of:
METS document for application defaults
METS Constraint document to limit METS form to submission parameters
METS Overview
Provides a framework for linking structural organization of objects with metadata.
Using XML namespaces, metadata from various XML schemas can be attached to objects, e.g., Dublin Core, FGDC, etc.
Extensible for more complex metadata
http://www.loc.gov/standards/mets/
Sample METS Document
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/TR/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
  <metsHdr>
    <agent ROLE="CREATOR">
      <name>toaster@hostname</name>
    </agent>
  </metsHdr>
  <fileSec>
    <fileGrp>
      <file ID="5" MIMETYPE="application/octet-stream" SIZE="67624" CREATED="2002-08-21T15:36:05"
            CHECKSUM="2CE7D79E40BD6C6A65A6684B6FD3D08C" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="/nfshomes/toaster/iscsi/GFS-contrib-5.1.tar.gz"/>
      </file>
    </fileGrp>
    <fileGrp>
      <file ID="7" MIMETYPE="application/octet-stream" SIZE="2517" CREATED="2002-09-06T17:06:07"
            CHECKSUM="767185AA022180E701324C592E1C36E3" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="/nfshomes/toaster/iscsi/gfs.out"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div ID="3" LABEL="iscsi">
      <fptr FILEID="5"/>
      <fptr FILEID="7"/>
    </div>
  </structMap>
</mets>
Metadata Linking
Structural Organization
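As a rough illustration of how the sample document ties files to structure, the sketch below reads a trimmed copy of it with Python's standard `xml.etree.ElementTree`. The function names `list_files` and `files_by_div` are hypothetical helpers introduced here, not part of PAWN, and the `metsHdr` element and `CREATED` attributes are omitted only to keep the example short.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the sample document.
NS = {"mets": "http://www.loc.gov/METS/", "xlink": "http://www.w3.org/TR/xlink"}

# Trimmed copy of the sample (metsHdr and CREATED attributes left out).
SAMPLE = """<mets xmlns="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/TR/xlink">
  <fileSec>
    <fileGrp>
      <file ID="5" MIMETYPE="application/octet-stream" SIZE="67624"
            CHECKSUM="2CE7D79E40BD6C6A65A6684B6FD3D08C" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple"
                xlink:href="/nfshomes/toaster/iscsi/GFS-contrib-5.1.tar.gz"/>
      </file>
    </fileGrp>
    <fileGrp>
      <file ID="7" MIMETYPE="application/octet-stream" SIZE="2517"
            CHECKSUM="767185AA022180E701324C592E1C36E3" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple"
                xlink:href="/nfshomes/toaster/iscsi/gfs.out"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div ID="3" LABEL="iscsi"><fptr FILEID="5"/><fptr FILEID="7"/></div>
  </structMap>
</mets>"""


def list_files(root):
    """Map each file ID to its location, checksum, and size (hypothetical helper)."""
    files = {}
    for f in root.iterfind(".//mets:file", NS):
        loc = f.find("mets:FLocat", NS)
        files[f.get("ID")] = {
            "href": loc.get("{http://www.w3.org/TR/xlink}href"),
            "checksum": f.get("CHECKSUM"),
            "size": int(f.get("SIZE")),
        }
    return files


def files_by_div(root):
    """Map each structMap div LABEL to the file IDs its fptr children reference."""
    return {
        div.get("LABEL"): [p.get("FILEID") for p in div.findall("mets:fptr", NS)]
        for div in root.iterfind(".//mets:div", NS)
    }


root = ET.fromstring(SAMPLE)
print(list_files(root))
print(files_by_div(root))
```

The file section carries the per-object metadata, while the structural map only points at file IDs, which is why the two lookups are kept separate here.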
Why METS Constraints?
METS doesn’t provide a way to create machine-interpretable rules describing a collection
e.g., allow only JPEG files in certain structural areas
METS profiles allow for developer-interpretable rules, not machine-interpretable ones
METS Constraints
Allows structural, metadata, and file constraints.
Structural Constraints: restrict child divs and restrict pointers to div, file, and other METS documents
File Constraints: restrict files by MIME type or validation tests
Metadata Constraints: restrict allowed metadata schemas.
METS Constraints Example
<techMD ID="WORD97">
  <mdWrap LABEL="MS Word 97">
    <arc:valgrp required="yes">
      <arc:valtest class="wordextension" required="no"/>
      <arc:valtest class="wordparser" required="yes"/>
    </arc:valgrp>
  </mdWrap>
</techMD>
......
<structMap TYPE="logical">
  <div ID="DIV1" LABEL="Toxic Chemical Release Inventory System">
    <div ID="DIV2" ORDER="1" LABEL="Reports for 1997" DMDID="tree97"></div>
    <div ID="DIV3" ORDER="2" LABEL="Meeting Notes for 1997" DMDID="tree98"></div>
    ...
  </div>
  ...
</structMap>
…
<divrule ID="DIV1" FILEALLOW="no" DIVALLOW="NO"></divrule>
<divrule ID="DIV2" FILEALLOW="yes" DIVALLOW="yes"></divrule>
<divrule ID="DIV3" FILEALLOW="yes" DIVALLOW="NO">
  <filegrp>
    <file ID="GIF98"/>
    <file ID="WORD97"/>
  </filegrp>
</divrule>
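One way to picture how `divrule` elements could be enforced is the sketch below. It rests on assumptions the slides do not confirm: `FILEALLOW` is read as "this div may hold file pointers directly", `DIVALLOW` as "the producer may add child divs beyond those named in the agreement", and attribute values are compared case-insensitively. The `<rules>` wrapper element, the submitted structure, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# divrule elements from the example, inside a hypothetical <rules> wrapper.
RULES_XML = """<rules>
  <divrule ID="DIV1" FILEALLOW="no" DIVALLOW="NO"></divrule>
  <divrule ID="DIV2" FILEALLOW="yes" DIVALLOW="yes"></divrule>
  <divrule ID="DIV3" FILEALLOW="yes" DIVALLOW="NO"></divrule>
</rules>"""

# A made-up submission that breaks two of the rules.
SUBMITTED_XML = """<structMap>
  <div ID="DIV1">
    <fptr FILEID="F1"/>
    <div ID="DIV2"><div ID="NEW1"/><fptr FILEID="F2"/></div>
    <div ID="DIV3"><div ID="NEW2"/></div>
  </div>
</structMap>"""


def load_rules(xml_text):
    """Read divrule elements into {div ID: {"files": bool, "divs": bool}}."""
    rules = {}
    for r in ET.fromstring(xml_text).iter("divrule"):
        rules[r.get("ID")] = {
            "files": r.get("FILEALLOW", "no").lower() == "yes",
            "divs": r.get("DIVALLOW", "no").lower() == "yes",
        }
    return rules


def violations(struct_xml, rules):
    """List rule violations in a submitted structMap (assumed semantics)."""
    problems = []
    for div in ET.fromstring(struct_xml).iter("div"):
        rule = rules.get(div.get("ID"))
        if rule is None:
            continue  # producer-added divs have no rule of their own in this sketch
        if not rule["files"] and div.find("fptr") is not None:
            problems.append(f"{div.get('ID')}: files not allowed")
        new_children = [c.get("ID") for c in div.findall("div")
                        if c.get("ID") not in rules]
        if not rule["divs"] and new_children:
            problems.append(
                f"{div.get('ID')}: new child divs not allowed: {new_children}")
    return problems


for problem in violations(SUBMITTED_XML, load_rules(RULES_XML)):
    print(problem)
```

Because the rules are plain data, the same check could run on the client before transfer and again on the receiving server.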
Ingestion Workflow
1. Negotiate Submission Agreement.
2. Workflow Initialization and Submission Information Package creation.
3. Transfer of SIPs to archive.
4. Validation of SIP transfer.
5. Organization of data into collections and transfer into persistent archive.
Initialize Ingestion workflow
Instantiate Producer management server to track registered objects
Establish a working trust relationship with the Archive
Issue clients.
Create SIP
Each client registers objects stored locally with producer management server
Register file types, validation tests, etc.
Client follows rules in Submission Agreement
Producer-wide agents can arrange registered objects to give a broader context
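The registration step can be sketched as building a METS `file` entry for a locally stored object. This is a minimal illustration, not the PAWN client: `describe_file` is a hypothetical name, the MIME type is guessed with Python's `mimetypes` module, and the attribute set simply mirrors the sample METS document (ID, MIMETYPE, SIZE, CHECKSUM, CHECKSUMTYPE, FLocat).

```python
import hashlib
import mimetypes
import os
import xml.etree.ElementTree as ET

# Namespace URIs as used in the sample METS document.
METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/TR/xlink"


def describe_file(path, file_id):
    """Build a METS <file> element for a local file (hypothetical helper)."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large objects do not need to fit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            md5.update(chunk)
    # Fall back to a generic type when the extension is unknown.
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    element = ET.Element(f"{{{METS}}}file", {
        "ID": file_id,
        "MIMETYPE": mime,
        "SIZE": str(os.path.getsize(path)),
        "CHECKSUM": md5.hexdigest().upper(),
        "CHECKSUMTYPE": "MD5",
    })
    ET.SubElement(element, f"{{{METS}}}FLocat", {
        "LOCTYPE": "URL",
        f"{{{XLINK}}}type": "simple",
        f"{{{XLINK}}}href": path,
    })
    return element
```

Calling `ET.tostring(describe_file(path, "5"))` would yield markup that a client could send to the producer management server; how PAWN actually transports it is not specified in the slides.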
SIP Example
METS handles all areas of a SIP except Physical Object and Descriptive Information
Descriptive Information can be embedded into METS as a third-party XML schema
[Diagram: OAIS Information Package]
Content Information: Physical Object, Representation Information
Preservation Description Information: Provenance, Fixity, Reference, Context
Packaging Information
Descriptive Information
Client Interface
Ingestion Workflow
1. Negotiate Submission Agreement.
2. Workflow Initialization and Submission Information Package creation.
3. Transfer of SIPs to archive.
4. Validation of SIP transfer.
5. Organization of data into collections and transfer into persistent archive.
Transfer SIP to archive
Retrieve previously registered SIP from producer management server
Authenticate to archive
Update provenance information in METS document with file structure of SIP
Transfer METS document describing SIP and container for SIP physical objects
Archive acknowledges transfer completion to producer management server
Ingestion Workflow
1. Negotiate Submission Agreement.
2. Workflow Initialization and Submission Information Package creation.
3. Transfer of SIP to archive.
4. Validation of SIP transfer.
5. Organization of data into collections and transfer into persistent archive.
Validation of SIP transfer
Check incoming SIP against constraints documents.
Ensure object integrity by verifying checksums/cryptographic digests
Validate bitstreams against tests described in METS document
Update METS document with validation results and movement of objects on receiving server
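The integrity check in this step can be sketched with Python's `hashlib`. The function names and the shape of the `declared` and `received` mappings are assumptions made for illustration; the only detail taken from the slides is that each file carries a CHECKSUM and CHECKSUMTYPE in the METS document.

```python
import hashlib


def verify_checksum(data, expected_hex, checksum_type="MD5"):
    """Return True if the digest of data matches the value declared in METS."""
    # Map names such as "MD5" or "SHA-1" onto hashlib algorithm names.
    algorithm = checksum_type.lower().replace("-", "")
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest.upper() == expected_hex.upper()


def validate_transfer(declared, received):
    """Compare declared checksums with received bytes.

    declared: {file_id: (checksum_type, expected_hex)}  (assumed shape)
    received: {file_id: bytes}                          (assumed shape)
    """
    report = {}
    for file_id, (checksum_type, expected) in declared.items():
        if file_id not in received:
            report[file_id] = "missing"
        elif verify_checksum(received[file_id], expected, checksum_type):
            report[file_id] = "ok"
        else:
            report[file_id] = "checksum mismatch"
    return report
```

A per-file report like this is the kind of result that could be written back into the METS document as validation results, though the slides do not say how PAWN records them.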
Ingestion Workflow
1. Negotiate Submission Agreement.
2. Workflow Initialization and Submission Information Package creation.
3. Transfer of SIP to archive.
4. Validation of SIP transfer.
5. Organization of data into collections and transfer into persistent archive.
Final transfer to archive
Transfer objects to digital archive
Update provenance information in METS document with handle to object in archive
Transfer METS document into archive
Return accept/reject messages to producer metadata server
Component Overview
[Diagram components: Producer Management Interface, Producer data suppliers, Archive Management Interface, Bitstream Validation Service, Archive Data Grid]
[Diagram interactions: CRL check; Success/Failure notification of ingestion; METS document registration/retrieval]
Producer Components
Database to track registered objects
Certificate Authority management
Web service for archive security check
Management server supplies web service interfaces to ingestion clients and management operations.
Clients are designed to be standalone, with security certificates issued by producer
Archive Components
Receiving servers validate connecting clients and validate SIPs
Validation Services are simple web service calls.
Abstract I/O layer into digital archive.
All components are scalable using standard load balancing techniques.
Recap
Implemented using web technologies
Architecture independent
OAIS compliant
XML based metadata
METS based SIPs
Add-on constraints describing Submission Agreement
Questions?
For more information: http://www.umiacs.umd.edu/research/adapt