Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008...

18
Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California Digital Library [email protected]

Transcript of Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008...

Page 1: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008

Preservation Characterization

Stephen Abrams

California Digital [email protected]

Page 2: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

Characterization

• /ker-ik-t(ə-)rə-zā'-shən/ noun

1. The action or result of characterizing.

2. Description of characteristics or essential features.

Page 3: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

Characterization

• Knowing what you have, as a stable starting point for iterative preservation analysis, planning, and action

Characterization

Preservation action

Preservation planning

Adopted from A. Brown, “Developing Practical Approaches to Active Preservation,” IJDC 1:2 (June 2007).

Page 4: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

What? So what?

• What do you have?

– Identification– Feature extraction– Conformance

• What should you do with what you have?

– Assessment

Page 5: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

Ingest workflow

Content

Metadata

Identification Feature extract Conformance

Package SIP Unpackage

Content

Metadata

Identification Feature extract Conformance

Metadata ′

Producer

Consistency Ingest

Archive

Policy rules

Assessment

Policy rules

Assessment

Page 6: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

Migration workflow

Content

Metadata

Assessment

Policy rules

Migration

Content ′

Identify Feature extract Conformance

Metadata ′

Equivalence (Re)IngestAIP Unpackage Assessment

Policy rules

Page 7: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

Two approaches to characterization

• Implicit

– Custom grammars defining a single format processed by a generic engine that understands all grammars

• Unix file

• National Archives (UK) DROID

• Open Grid Forum DFDL

• Planets XCEL/XCDL

• Explicit

– Plug-in framework with custom modules that each understand a single format

• NLNZ Metadata Extractor

• JHOVE

Page 8: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

Why choose one over the over?

• Implicit

– Pro More sustainable in the long term

– Con Is the formal notation rich enough to capture allnuances of formats of interest?

• Explicit

– Pro It’s just programming

– Con It’s more programming

Page 9: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

JH VE

• Extensible framework for format identification, validation, and characterization

– Pluggable format-specific modules for:

• GIF, JPEG, JPEG 2000, TIFF

• AIFF, WAVE

• ASCII, HTML, UTF-8, XML

• PDF

– GUI, command-line, and Java API

• Collaborative project of Harvard University and the JSTOR Electronic-Archive Initiative– Funded by Andrew W. Mellon Foundation

– GNU LGPL license

Page 10: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

JH VE2

• A next generation architecture for format-aware preservation processing

– Three-fold goals:

• Re-factor the existing architecture to achieve higher performance, simplify system integration, and encourage third-party enhancement

• Provide significant new function

• (Re-) Implement modules

• Collaborative project of CDL, Portico, and Stanford University

– Funded by Library of Congress/NDIIPP– Open source BSD license

Page 11: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

JH VE2 enhancements

• JHOVE assumed 1 object = 1 file = 1 format

• But what about…

– TIFF with embedded ICC profile and XMP metadata

1 object = 1 file = 3 formats

– JPEG 2000 JPX fragmentation

1 object = n files = 1 format

– ESRI Shapefile

1 object = 3 files = 3 formats

• JHOVE2 will support 1 object = n files = m formats

Page 12: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

JH VE2 enhancements

• Generic plug-in interface

• Configurable set of modules iteratively invoked against each object

• Inter-module memory structure for stateful processing

• Identification de-coupled from conformance

• Standardized handling of format profiles and error reporting

• Configurable conformance criteria

• API level support for limited editing

Page 13: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

JH VE2 modules

• Identification

• Feature extraction and conformance for:

– GIF, JPEG, JPEG 2000, TIFF– AIFF, WAVE– ASCII, HTML, SGML, UTF-8, XML– PDF– Shapefile– ICC

• Symbolic display of selected binary formats

• Assessment based on prior characterization and locally-defined policy rules and heuristics

Page 14: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

JH VE2 modules

• Identification

• Feature extraction and conformance for:

– GIF, JPEG, JPEG 2000, TIFF– AIFF, WAVE– ASCII, HTML, SGML, UTF-8, XML– PDF– Shapefile– ICC

• Symbolic display of selected binary formats

• Assessment based on prior characterization and locally-defined policy rules and heuristics

Page 15: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

JH VE2 modules

• Identification

• Feature extraction and conformance for:

– JPEG 2000, TIFF– WAVE– ASCII, SGML, UTF-8, XML– PDF– Shapefile– ICC

• Symbolic display of selected binary formats

• Assessment based on prior characterization and locally-defined policy rules and heuristics

Page 16: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

JH VE2 data abstraction

• Determine the “natural” conceptual structures of a format and their component attributes

– Each such structure maps to a class with methods for parsing, validating, reporting, and serializing

– Each such attribute maps to a field with accessor and mutator methods

• UTF-8 Character

• TIFF IFH and IFD

• JPEG 2000 Box

• PDF boolean, number, string, name, array, dictionary, stream, and null

Page 17: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

JH VE2 timeline

• Months 1-6 Outreach, design, and prototyping

• Months 7-9 Core APIs and framework

• Months 10-24 Modules

Page 18: Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization Stephen Abrams California.

For more information…

www.significantproperties.org.uk

droid.sourceforge.net

forge.gridforum.org/projects/dfdl-wg

hki.uni-koeln.de/planets/

meta-extractor.sourceforge.net

hul.harvard.edu/jhove

www.ucop.edu:8080/display/JHOVE2Info/Home