PAWN: A Novel Ingestion Workflow Technology for Digital Preservation


Page 1: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Mike Smorul, Joseph JaJa, Yang Wang, and Fritz McCall

Page 2: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Overall Principles

Consistent with the Open Archival Information System (OAIS) model

Distributed, secure ingestion

Use of web/grid technologies – platform independent

Minimal client-side requirements

Ease of integration with archival storage or data grid systems.

Page 3: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Producer

Producer Management Interface

Producer data suppliers

Archive

Management Server

Page 4: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Producer

Provides data to an Archive based on a prior agreement.

Consists of a management/metadata server and an ingestion client.

Provides initial arrangement, context, and metadata.

Page 5: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Archive – receiving

Bitstream Validation Service

Digital Archive

Load Balancer

Producer 1

Producer n

Producer 2

Page 6: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Archive – receiving

Receives data from a Producer.

Validates bitstreams and metadata, and sends acknowledgement to Producer.

Arranges into collections and specifies preservation policy.

Publishes bitstreams into a digital archive.

Page 7: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Archive – Long term preservation

Implemented using grid technologies.

Uses the existing prototype NARA/UMD/SDSC site.

Automated replication and integrity checking.

Enforces access control and preservation policy

Page 8: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Ingestion Workflow

1. Negotiate Submission Agreement.

2. Workflow Initialization and Submission Information Package (SIP) creation.

3. Transfer of SIPs to archive.

4. Validation of SIP transfer.

5. Organization of data into collections and transfer into persistent archive.
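The five steps run strictly in sequence, which can be sketched as an ordered enumeration. This is only an illustrative sketch: the names `IngestionStage`, `next_stage`, and `run_workflow` are hypothetical and do not come from the PAWN software.

```python
from enum import IntEnum


class IngestionStage(IntEnum):
    """The five workflow steps, in the order the slide lists them (names are assumptions)."""
    NEGOTIATE_AGREEMENT = 1
    CREATE_SIP = 2
    TRANSFER_SIP = 3
    VALIDATE_TRANSFER = 4
    ARCHIVE = 5


def next_stage(stage):
    """Return the stage that follows `stage`, or None after the last one."""
    if stage is IngestionStage.ARCHIVE:
        return None
    return IngestionStage(stage + 1)


def run_workflow():
    """Walk through every stage in order and return the names visited."""
    visited = []
    stage = IngestionStage.NEGOTIATE_AGREEMENT
    while stage is not None:
        visited.append(stage.name)
        stage = next_stage(stage)
    return visited
```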

Page 9: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Submission Agreement

Based on data appraisal and record schedule, including format and metadata.

Create a machine-actionable set of rules describing items.

Final Submission Agreement is composed of:

METS document for application defaults

METS Constraint document to limit METS form to submission parameters

Page 10: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

METS Overview

Provides a framework for linking structural organization of objects with metadata.

Using XML namespaces, metadata from various XML schemas can be attached to objects (e.g., Dublin Core, FGDC).

Extensible for more complex metadata

http://www.loc.gov/standards/mets/

Page 11: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Sample METS Document

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/TR/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
  <metsHdr>
    <agent ROLE="CREATOR">
      <name>toaster@hostname</name>
    </agent>
  </metsHdr>
  <fileSec>
    <fileGrp>
      <file ID="5" MIMETYPE="application/octet-stream" SIZE="67624" CREATED="2002-08-21T15:36:05"
            CHECKSUM="2CE7D79E40BD6C6A65A6684B6FD3D08C" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="/nfshomes/toaster/iscsi/GFS-contrib-5.1.tar.gz"/>
      </file>
    </fileGrp>
    <fileGrp>
      <file ID="7" MIMETYPE="application/octet-stream" SIZE="2517" CREATED="2002-09-06T17:06:07"
            CHECKSUM="767185AA022180E701324C592E1C36E3" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="/nfshomes/toaster/iscsi/gfs.out"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div ID="3" LABEL="iscsi">
      <fptr FILEID="5"/>
      <fptr FILEID="7"/>
    </div>
  </structMap>
</mets>

Metadata Linking

Structural Organization
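To illustrate how the structural map links to the file section, the following sketch resolves each `fptr` in a `div` to the file location it points at. It uses only Python's standard `xml.etree.ElementTree`; the helper names `q` and `files_by_div` are hypothetical, and the embedded document is a trimmed copy of the sample above.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/TR/xlink"  # namespace URI as declared in the sample

SAMPLE = """<mets xmlns="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/TR/xlink">
  <fileSec>
    <fileGrp>
      <file ID="5" CHECKSUM="2CE7D79E40BD6C6A65A6684B6FD3D08C" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="/nfshomes/toaster/iscsi/GFS-contrib-5.1.tar.gz"/>
      </file>
    </fileGrp>
    <fileGrp>
      <file ID="7" CHECKSUM="767185AA022180E701324C592E1C36E3" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="/nfshomes/toaster/iscsi/gfs.out"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div ID="3" LABEL="iscsi"><fptr FILEID="5"/><fptr FILEID="7"/></div>
  </structMap>
</mets>"""


def q(ns, tag):
    """Build the '{namespace}tag' form that ElementTree expects."""
    return "{%s}%s" % (ns, tag)


def files_by_div(mets_xml):
    """Map each structMap div LABEL to the file locations its fptr elements reference."""
    root = ET.fromstring(mets_xml)
    # Index every <file> by its ID so that fptr FILEID values can be resolved.
    locations = {}
    for file_el in root.iter(q(METS_NS, "file")):
        flocat = file_el.find(q(METS_NS, "FLocat"))
        locations[file_el.get("ID")] = flocat.get(q(XLINK_NS, "href"))
    result = {}
    for div in root.iter(q(METS_NS, "div")):
        file_ids = [p.get("FILEID") for p in div.findall(q(METS_NS, "fptr"))]
        result[div.get("LABEL")] = [locations[i] for i in file_ids]
    return result
```

Calling `files_by_div(SAMPLE)` returns one entry for the `iscsi` div listing both file paths.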

Page 12: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Why METS Constraints?

METS doesn’t provide a way to create machine-interpretable rules describing a collection (e.g., allow only JPEG files in certain structural areas).

METS profiles allow for developer-interpretable rules, not machine-interpretable ones.

Page 13: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

METS Constraints

Allows structural, metadata, and file constraints.

Structural Constraints: Restrict child divs and restrict pointers to div, file, and other METS documents

File Constraints: Restrict files by MIME type or validation tests

Metadata Constraints: Restrict allowed metadata schema.

Page 14: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

METS Constraints Example

<techMD ID="WORD97">
  <mdWrap LABEL="MS Word 97">
    <arc:valgrp required="yes">
      <arc:valtest class="wordextension" required="no"/>
      <arc:valtest class="wordparser" required="yes"/>
    </arc:valgrp>
  </mdWrap>
</techMD>
......
<structMap TYPE="logical">
  <div ID="DIV1" LABEL="Toxic Chemical Release Inventory System">
    <div ID="DIV2" ORDER="1" LABEL="Reports for 1997" DMDID="tree97"></div>
    <div ID="DIV3" ORDER="2" LABEL="Meeting Notes for 1997" DMDID="tree98"></div>
    ...
  </div>
  ...
</structMap>
…
<divrule ID="DIV1" FILEALLOW="no" DIVALLOW="NO"></divrule>
<divrule ID="DIV2" FILEALLOW="yes" DIVALLOW="yes"></divrule>
<divrule ID="DIV3" FILEALLOW="yes" DIVALLOW="NO">
  <filegrp>
    <file ID="GIF98"/>
    <file ID="WORD97"/>
  </filegrp>
</divrule>
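One way a client could apply the `divrule` entries above is sketched below. The semantics are an assumption, since the slides do not spell them out: `FILEALLOW` is read as "files may be placed in this div", `DIVALLOW` as "child divs may be created", and a `filegrp` as the list of permitted file types. `DivRule`, `RULES`, `may_add_file`, and `may_add_child_div` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class DivRule:
    file_allow: bool   # assumed meaning of FILEALLOW
    div_allow: bool    # assumed meaning of DIVALLOW
    allowed_types: set = field(default_factory=set)  # empty set is treated as "any type"


# Hand-transcribed from the three divrule elements in the example.
RULES = {
    "DIV1": DivRule(file_allow=False, div_allow=False),
    "DIV2": DivRule(file_allow=True, div_allow=True),
    "DIV3": DivRule(file_allow=True, div_allow=False, allowed_types={"GIF98", "WORD97"}),
}


def may_add_file(div_id, file_type):
    """Return True if a file of `file_type` may be registered under `div_id`."""
    rule = RULES.get(div_id)
    if rule is None or not rule.file_allow:
        return False
    return not rule.allowed_types or file_type in rule.allowed_types


def may_add_child_div(div_id):
    """Return True if a new child div may be created under `div_id`."""
    rule = RULES.get(div_id)
    return rule is not None and rule.div_allow
```

Under these assumptions a Word 97 file is accepted in `DIV3` but rejected in `DIV1`, and only `DIV2` accepts new child divs.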

Page 15: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Ingestion Workflow

1. Negotiate Submission Agreement.

2. Workflow Initialization and Submission Information Package creation.

3. Transfer of SIPs to archive.

4. Validation of SIP transfer.

5. Organization of data into collections and transfer into persistent archive.

Page 16: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Initialize Ingestion workflow

Instantiate Producer management server to track registered objects

Establish a working trust relationship with the Archive

Issue clients.

Page 17: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Create SIP

Each client registers objects stored locally with the producer management server.

Register file types, validation tests, etc.

Client follows rules in Submission Agreement.

Producer-wide agents can arrange registered objects to give a broader context.
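As a rough sketch of what registering a local object might involve, the function below builds a METS `file` entry with the same attributes as the sample document shown earlier (`SIZE`, `CHECKSUM`, `CHECKSUMTYPE`, and an `FLocat` child). The function name `build_file_entry` is hypothetical, and reading the whole file into memory is a simplification rather than how PAWN itself works.

```python
import hashlib
import os
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/TR/xlink"  # namespace URI as declared in the sample


def build_file_entry(path, file_id, mimetype="application/octet-stream"):
    """Return a METS <file> element describing the local file at `path`."""
    # Simplification: the whole file is read at once to compute the digest.
    with open(path, "rb") as fh:
        digest = hashlib.md5(fh.read()).hexdigest().upper()
    entry = ET.Element("{%s}file" % METS_NS, {
        "ID": file_id,
        "MIMETYPE": mimetype,
        "SIZE": str(os.path.getsize(path)),
        "CHECKSUM": digest,
        "CHECKSUMTYPE": "MD5",
    })
    # The location is stored as an xlink:href, as in the sample document.
    ET.SubElement(entry, "{%s}FLocat" % METS_NS, {
        "LOCTYPE": "URL",
        "{%s}type" % XLINK_NS: "simple",
        "{%s}href" % XLINK_NS: path,
    })
    return entry
```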

Page 18: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

SIP Example

METS handles all areas of a SIP except Physical Object and Descriptive Information

Descriptive Information can be embedded into METS as a third-party XML schema

OAIS Information Package

Content Information

· Physical Object

· Representation Information

Preservation Description Information

· Provenance

· Fixity

· Reference

· Context

Packaging Information

Descriptive Information

Page 19: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Client Interface

Page 20: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Ingestion Workflow

1. Negotiate Submission Agreement.

2. Workflow Initialization and Submission Information Package creation.

3. Transfer of SIPs to archive.

4. Validation of SIP transfer.

5. Organization of data into collections and transfer into persistent archive.

Page 21: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Transfer SIP to archive

Retrieve previously registered SIP from producer management server

Authenticate to archive

Update provenance information in METS document with file structure of SIP

Transfer METS document describing SIP and container for SIP physical objects

Archive acknowledges transfer completion to producer management server

Page 22: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Ingestion Workflow

1. Negotiate Submission Agreement.

2. Workflow Initialization and Submission Information Package creation.

3. Transfer of SIP to archive.

4. Validation of SIP transfer.

5. Organization of data into collections and transfer into persistent archive.

Page 23: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Validation of SIP transfer

Check incoming SIP against constraint documents.

Ensure object integrity by verifying checksums/cryptographic digests

Validate bitstreams against tests described in METS document

Update METS document with validation results and movement of objects on receiving server
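The integrity check can be illustrated with Python's standard `hashlib`. This is a minimal sketch: the manifest is assumed to be a mapping from file ID to the `CHECKSUM` and `CHECKSUMTYPE` values recorded in the METS document, and `verify_checksum` and `validate_sip` are hypothetical names.

```python
import hashlib


def verify_checksum(data, expected_hex, algorithm="MD5"):
    """Recompute the digest of `data` and compare it, case-insensitively, to the recorded value."""
    digest = hashlib.new(algorithm.lower(), data).hexdigest()
    return digest == expected_hex.lower()


def validate_sip(objects, manifest):
    """Check every manifest entry against the received bytes.

    objects:  {file_id: bytes} as received by the archive (assumed shape)
    manifest: {file_id: (checksum_hex, checksum_type)} taken from the METS document
    Returns {file_id: bool}; a missing object counts as a failure.
    """
    results = {}
    for file_id, (checksum, checksum_type) in manifest.items():
        data = objects.get(file_id)
        results[file_id] = data is not None and verify_checksum(data, checksum, checksum_type)
    return results
```

The per-file results could then be written back into the METS document as the validation outcome.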

Page 24: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Ingestion Workflow

1. Negotiate Submission Agreement.

2. Workflow Initialization and Submission Information Package creation.

3. Transfer of SIP to archive.

4. Validation of SIP transfer.

5. Organization of data into collections and transfer into persistent archive.

Page 25: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Final transfer to archive

Transfer objects to digital archive

Update provenance information in METS document with handle to object in archive

Transfer METS document into archive

Return accept/reject messages to producer metadata server

Page 26: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Component Overview

CRL check

Success/Failure notification of ingestion

METS document registration/retrieval

Producer Management Interface

Archive Management Interface

Producer data suppliers

Bitstream Validation Service

Archive Data Grid

Page 27: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Producer Components

Database to track registered objects

Certificate Authority management

Web service for archive security check

Management server supplies web service interfaces to ingestion clients and management operations.

Clients are designed to be standalone, with security certificates issued by producer

Page 28: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Archive Components

Receiving servers validate connecting clients and validate SIPs

Validation Services are simple web service calls.

Abstract I/O layer into digital archive.

All components are scalable using standard load balancing techniques.

Page 29: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Recap

Implemented using web technologies

Architecture independent

OAIS compliant

XML based metadata

METS based SIPs

Add-on constraints describing Submission Agreement

Page 30: PAWN: A Novel Ingestion Workflow Technology for Digital Preservation

Questions?

For more information: http://www.umiacs.umd.edu/research/adapt