Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008
Preservation Characterization
Stephen Abrams
California Digital Library
[email protected]
Characterization
• /ker-ik-t(ə-)rə-zā'-shən/ noun
1. The action or result of characterizing.
2. Description of characteristics or essential features.
Characterization
• Knowing what you have, as a stable starting point for iterative preservation analysis, planning, and action
Characterization
Preservation action
Preservation planning
Adapted from A. Brown, “Developing Practical Approaches to Active Preservation,” IJDC 1:2 (June 2007).
What? So what?
• What do you have?
– Identification
– Feature extraction
– Conformance
• What should you do with what you have?
– Assessment
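The three "what do you have?" steps can be illustrated with a minimal sketch for a single format. This is not JHOVE code: the function names are hypothetical, and the only format knowledge assumed is the GIF header layout (a 3-byte signature, a 3-byte version, then little-endian 16-bit width and height).

```python
import struct

def identify(data: bytes) -> str:
    """Identification: does the byte stream start with the GIF signature?"""
    return "GIF" if data[:3] == b"GIF" else "unknown"

def extract_features(data: bytes) -> dict:
    """Feature extraction: read version and logical screen size from the header."""
    if len(data) < 10:
        raise ValueError("too short to hold a GIF header")
    width, height = struct.unpack("<HH", data[6:10])  # two little-endian uint16
    return {
        "version": data[3:6].decode("ascii", errors="replace"),
        "width": width,
        "height": height,
    }

def check_conformance(features: dict) -> list:
    """Conformance: compare extracted features with what the format allows."""
    problems = []
    if features["version"] not in ("87a", "89a"):
        problems.append("unrecognized version " + features["version"])
    return problems

# A hand-built 10-byte header stands in for a real file.
sample = b"GIF89a" + struct.pack("<HH", 640, 480)
fmt = identify(sample)
features = extract_features(sample)
problems = check_conformance(features)
```

Assessment, the "so what?" step, would then take `features` and `problems` as input and apply local policy to them.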
Ingest workflow
[Figure: ingest workflow diagram. Recoverable labels: Producer; Archive; Content; Metadata; Identification; Feature extraction; Conformance; Package; SIP; Unpackage; Metadata′; Consistency; Ingest; Assessment; Policy rules.]
Migration workflow
[Figure: migration workflow diagram. Recoverable labels: Content; Metadata; Assessment; Policy rules; Migration; Content′; Identification; Feature extraction; Conformance; Metadata′; Equivalence; (Re)Ingest; AIP; Unpackage.]
Two approaches to characterization
• Implicit
– Custom grammars defining a single format processed by a generic engine that understands all grammars
• Unix file
• National Archives (UK) DROID
• Open Grid Forum DFDL
• Planets XCEL/XCDL
• Explicit
– Plug-in framework with custom modules that each understand a single format
• NLNZ Metadata Extractor
• JHOVE
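The contrast between the two approaches can be sketched in a few lines. This is an illustration only: the signature table is a very reduced stand-in for a real format grammar, the class and function names are hypothetical, and nothing here reflects the actual data formats or APIs of DROID, DFDL, XCEL/XCDL, or JHOVE.

```python
# Implicit style: formats are described as data, and one generic engine
# interprets every description.
SIGNATURES = [
    {"format": "GIF",  "offset": 0, "magic": b"GIF8"},
    {"format": "JPEG", "offset": 0, "magic": b"\xff\xd8\xff"},
    {"format": "TIFF", "offset": 0, "magic": b"II*\x00"},
    {"format": "TIFF", "offset": 0, "magic": b"MM\x00*"},
    {"format": "PDF",  "offset": 0, "magic": b"%PDF-"},
]

def identify_implicit(data):
    """Generic engine: knows nothing about any format beyond the table."""
    for sig in SIGNATURES:
        start = sig["offset"]
        if data[start:start + len(sig["magic"])] == sig["magic"]:
            return sig["format"]
    return None

# Explicit style: each format gets its own module with hand-written logic.
class WaveModule:
    name = "WAVE"

    def matches(self, data):
        # RIFF container whose form type at byte 8 is "WAVE"
        return data[:4] == b"RIFF" and data[8:12] == b"WAVE"

class TiffModule:
    name = "TIFF"

    def matches(self, data):
        return data[:4] in (b"II*\x00", b"MM\x00*")

MODULES = [WaveModule(), TiffModule()]

def identify_explicit(data):
    """Plug-in framework: asks each registered module in turn."""
    for module in MODULES:
        if module.matches(data):
            return module.name
    return None
```

Adding a format means adding a table row in the first style and writing a new class in the second.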
Why choose one over the other?
• Implicit
– Pro: More sustainable in the long term
– Con: Is the formal notation rich enough to capture all nuances of formats of interest?
• Explicit
– Pro: It’s just programming
– Con: It’s more programming
JHOVE
• Extensible framework for format identification, validation, and characterization
– Pluggable format-specific modules for:
• GIF, JPEG, JPEG 2000, TIFF
• AIFF, WAVE
• ASCII, HTML, UTF-8, XML
– GUI, command-line, and Java API
• Collaborative project of Harvard University and the JSTOR Electronic-Archive Initiative
– Funded by Andrew W. Mellon Foundation
– GNU LGPL license
JHOVE2
• A next generation architecture for format-aware preservation processing
– Three-fold goals:
• Re-factor the existing architecture to achieve higher performance, simplify system integration, and encourage third-party enhancement
• Provide significant new function
• (Re-) Implement modules
• Collaborative project of CDL, Portico, and Stanford University
– Funded by Library of Congress/NDIIPP
– Open source BSD license
JHOVE2 enhancements
• JHOVE assumed 1 object = 1 file = 1 format
• But what about…
– TIFF with embedded ICC profile and XMP metadata
1 object = 1 file = 3 formats
– JPEG 2000 JPX fragmentation
1 object = n files = 1 format
– ESRI Shapefile
1 object = 3 files = 3 formats
• JHOVE2 will support 1 object = n files = m formats
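One way to picture the "1 object = n files = m formats" relationship is a small data model. The class and field names below are hypothetical and are not taken from the JHOVE2 design; the sketch only shows that the cases listed above fit a single structure.

```python
from dataclasses import dataclass, field

@dataclass
class FileUnit:
    name: str
    formats: list  # primary format first, then any embedded formats

@dataclass
class DigitalObject:
    label: str
    files: list = field(default_factory=list)

    def shape(self):
        """Return (number of files, number of distinct formats)."""
        distinct = {fmt for unit in self.files for fmt in unit.formats}
        return len(self.files), len(distinct)

# 1 object = 1 file = 3 formats
tiff = DigitalObject("scan", [FileUnit("scan.tif", ["TIFF", "ICC", "XMP"])])

# 1 object = n files = 1 format (two fragments shown for illustration)
jpx = DigitalObject("map", [
    FileUnit("map.jpx", ["JPEG 2000"]),
    FileUnit("map-fragment1.jpx", ["JPEG 2000"]),
])

# 1 object = 3 files = 3 formats
shapefile = DigitalObject("parcels", [
    FileUnit("parcels.shp", ["Shapefile main file"]),
    FileUnit("parcels.shx", ["Shapefile index file"]),
    FileUnit("parcels.dbf", ["dBASE table"]),
])
```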
JHOVE2 enhancements
• Generic plug-in interface
• Configurable set of modules iteratively invoked against each object
• Inter-module memory structure for stateful processing
• Identification de-coupled from conformance
• Standardized handling of format profiles and error reporting
• Configurable conformance criteria
• API level support for limited editing
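Several of these points (a generic plug-in interface, a configurable module sequence, shared inter-module state, and identification separated from later steps) can be sketched together. The module names, the `process(data, state)` signature, and the registry are assumptions made for illustration, not the JHOVE2 API.

```python
class IdentificationModule:
    name = "identify"

    def process(self, data, state):
        # Records its result in the shared state for later modules to use.
        if all(byte < 0x80 for byte in data):
            state["format"] = "ASCII"
        else:
            try:
                data.decode("utf-8")
                state["format"] = "UTF-8"
            except UnicodeDecodeError:
                state["format"] = "unknown"

class LineCountModule:
    name = "features"

    def process(self, data, state):
        # Relies on the format recorded by an earlier module, if any.
        if state.get("format") in ("ASCII", "UTF-8"):
            state["lines"] = len(data.decode("utf-8").splitlines())

# Every module exposes the same interface, so the framework stays generic.
REGISTRY = {m.name: m for m in (IdentificationModule(), LineCountModule())}

def characterize(data, configured):
    """Invoke the configured modules in order against one object."""
    state = {}  # inter-module memory
    for name in configured:
        REGISTRY[name].process(data, state)
    return state
```

Because the sequence is configuration rather than code, `characterize(data, ["identify"])` runs identification alone, while `["identify", "features"]` adds feature extraction on top of it.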
JHOVE2 modules
• Identification
• Feature extraction and conformance for:
– GIF, JPEG, JPEG 2000, TIFF
– AIFF, WAVE
– ASCII, HTML, SGML, UTF-8, XML
– PDF
– Shapefile
– ICC
• Symbolic display of selected binary formats
• Assessment based on prior characterization and locally-defined policy rules and heuristics
JHOVE2 modules
• Identification
• Feature extraction and conformance for:
– JPEG 2000, TIFF
– WAVE
– ASCII, SGML, UTF-8, XML
– PDF
– Shapefile
– ICC
• Symbolic display of selected binary formats
• Assessment based on prior characterization and locally-defined policy rules and heuristics
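Assessment against locally-defined policy rules might look roughly like the sketch below. The rule set, the dictionary keys, and the `assess` function are all hypothetical; the point is only that assessment consumes the output of earlier characterization rather than re-reading the content.

```python
# Each rule is a label plus a test applied to a prior characterization result.
POLICY_RULES = [
    ("format is on the locally accepted list",
     lambda c: c["format"] in {"TIFF", "JPEG 2000", "PDF"}),
    ("object passed its conformance checks",
     lambda c: c["valid"]),
]

def assess(characterization, rules=POLICY_RULES):
    """Apply local policy to characterization output and report failures."""
    failed = [label for label, test in rules if not test(characterization)]
    return {"acceptable": not failed, "failed_rules": failed}

good = assess({"format": "TIFF", "valid": True})
poor = assess({"format": "GIF", "valid": False})
```

Two institutions could run the same characterization and still reach different assessments simply by supplying different rule lists.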
JHOVE2 data abstraction
• Determine the “natural” conceptual structures of a format and their component attributes
– Each such structure maps to a class with methods for parsing, validating, reporting, and serializing
– Each such attribute maps to a field with accessor and mutator methods
• UTF-8 Character
• TIFF IFH and IFD
• JPEG 2000 Box
• PDF boolean, number, string, name, array, dictionary, stream, and null
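As an illustration of the mapping from structure to class and from attribute to field, here is a minimal sketch for the TIFF image file header (IFH). It assumes only the 8-byte header layout (a 2-byte byte-order mark, the 16-bit magic number 42, and a 32-bit offset to the first IFD). The class and method names are hypothetical and are not the JHOVE2 API.

```python
import struct

class TiffIFH:
    """TIFF image file header: byte order, magic number, first IFD offset."""

    def __init__(self, byte_order="II", magic=42, first_ifd_offset=8):
        self._byte_order = byte_order
        self._magic = magic
        self._first_ifd_offset = first_ifd_offset

    # Attribute -> field with accessor and mutator
    @property
    def first_ifd_offset(self):
        return self._first_ifd_offset

    @first_ifd_offset.setter
    def first_ifd_offset(self, value):
        self._first_ifd_offset = value

    def _endian(self):
        return ">" if self._byte_order == "MM" else "<"

    @classmethod
    def parse(cls, data):
        if len(data) < 8:
            raise ValueError("a TIFF header needs 8 bytes")
        ifh = cls(byte_order=data[:2].decode("latin-1"))
        ifh._magic, ifh._first_ifd_offset = struct.unpack(
            ifh._endian() + "HI", data[2:8])
        return ifh

    def validate(self):
        errors = []
        if self._byte_order not in ("II", "MM"):
            errors.append("bad byte order mark")
        if self._magic != 42:
            errors.append("bad magic number")
        if self._first_ifd_offset < 8:
            errors.append("first IFD offset points inside the header")
        return errors

    def report(self):
        return {"byteOrder": self._byte_order, "magic": self._magic,
                "firstIFDOffset": self._first_ifd_offset}

    def serialize(self):
        return struct.pack(self._endian() + "2sHI",
                           self._byte_order.encode("latin-1"),
                           self._magic, self._first_ifd_offset)
```

The mutator plus `serialize` hints at how limited editing at the API level could work: change a field, then write the structure back out.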
JHOVE2 timeline
• Months 1-6 Outreach, design, and prototyping
• Months 7-9 Core APIs and framework
• Months 10-24 Modules
For more information…
www.significantproperties.org.uk
droid.sourceforge.net
forge.gridforum.org/projects/dfdl-wg
hki.uni-koeln.de/planets/
meta-extractor.sourceforge.net
hul.harvard.edu/jhove
www.ucop.edu:8080/display/JHOVE2Info/Home