Jay Gattuso: Persistently Identifying Formats
Uploaded by future-perfect-2012
‘Persistently’ Identifying Formats
PRONOM, DROID and the NDHA
Jay Gattuso Digital Preservation Analyst
National Digital Heritage Archive
National Library of New Zealand
Summary
How Rosetta uses DROID
How DROID has changed
Research NDHA completed
Results
Recommendations
DROID & PRONOM
• PRONOM is the most widely used file format registry in the sector
• DROID is a tool that ‘identifies’ file types (based on PRONOM records)
• Both are from TNA (UK)
• DROID Signature v59
– 551 signature sets
– 864 file type records
EP/1958/2520-F Registry, Hunter Building, Victoria University of Wellington
Photograph taken for the Evening Post newspaper, 31 Jul 1958
Alexander Turnbull Library
www.nationalarchives.gov.uk/PRONOM/Default.aspx
Rosetta – A Brief History
• NLNZ Digital Preservation Repository
• 4 years since inception
• 18 months out of project
• 8 significant upgrades/software revisions
• ~6 million digital objects to date
• Backbone of the ANZ GDAP
1/1-000008-G Smiley's stables and horse repository, Whanganui
Harding, William James, 1826-1899: Negatives of Wanganui district. Alexander Turnbull Library
Write Once, Read Many
Inside Rosetta, format identification is a ‘WORM’ process.
As part of the ingest routine, format identification is undertaken automatically, written to the file records and the system database, and used thereafter as a consistent ‘label’.
E-272-f-001 Abbot, John 1751-1840: Original drawings of insects by J Abott. [1816?]
Alexander Turnbull Library
We rely on the persistence of the label to accurately plan activities and ‘measure’ the content or shape of the repository.
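The write-once behaviour described above can be sketched as follows. This is a minimal illustration, not Rosetta's implementation: the class name `FormatLabelStore`, its methods and the object identifiers are all hypothetical, and a plain dictionary stands in for the file records and system database.

```python
from collections import Counter


class FormatLabelStore:
    """Hypothetical write-once store: one format label per digital object."""

    def __init__(self):
        self._labels = {}

    def record(self, object_id, puid):
        # 'Write once': the label is set at ingest and never overwritten.
        if object_id in self._labels:
            raise ValueError(f"{object_id} already has a format label")
        self._labels[object_id] = puid

    def label(self, object_id):
        # 'Read many': every later activity sees the same label.
        return self._labels[object_id]

    def shape(self):
        # The 'shape' of the repository: object count per format label.
        return Counter(self._labels.values())


store = FormatLabelStore()
store.record("do-1", "fmt/12")  # invented object IDs, PUIDs taken from the slides
store.record("do-2", "fmt/12")
store.record("do-3", "fmt/14")
print(store.shape())
```

Any planning or measurement built on `shape()` is only as reliable as the labels written at ingest, which is why the persistence of those labels matters.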
Behaviours and functions based on DROID format assertions
Rosetta uses DROID to automatically establish format type.
Rosetta Overview
Validation Stack
Automated Format Identification via DROID
Shape Sorting...
Where:
• The area inside the box is Rosetta
• Each block is a DO
• Each shape is a format
• The ‘Sorter’ is DROID
Shape Sorting...
Process:
• A record is kept of the ‘shape’ the DO entered the box via
• The record is used by the system to trigger activities
• The DO can be removed from the box using the same shaped hole it used on entry
Shape Sorting...
Expectations:
• The ‘Sorter’ never changes
• The blocks never change
• A DO placed in the box yesterday will be the same shape tomorrow
• A DO placed in the box yesterday will be extractable via the shape tomorrow
Shape Sorting...
The reality for NDHA:
• DROID has undergone 2 major revisions
• Container signatures have been included
• Since Rosetta v1 release:
– 406 new formats
– 600 changes to signatures
– (This is generally a good thing!)
• Rosetta has used DROID versions 3 and 5, currently testing with 6
• Rosetta has used DROID signature versions v13, v37, v45 and v49, testing with v52
• Proposal to use a new DROID method in Rosetta
• How has/will this affect the way we characterise Digital Objects at the NDHA?
Identifying and Quantifying Change
EP/1958/0585-F Signature of Queen Elizabeth II in a visitors book
Negatives of the Evening Post newspaper. Feb 1958
Alexander Turnbull Library
• Source set:
– 26,000 digital objects
– ~600 Gb of content
– spanning 61 format types
– all from the live system
• DROID v3, DROID v5, DROID v6 and DROID v6 ‘FAST’ tested
• Signatures v13, v37, v45, v49 and v50 tested
• All files tested with and without file extensions
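The test conditions listed above can be laid out as a matrix. The sketch below assumes the full cross-product of DROID builds, signature versions and extension modes; the slides do not say whether every combination was actually run (an older DROID build may not accept a newer signature file), so treat the count as an upper bound. The names `DROID_BUILDS`, `SIGNATURE_VERSIONS` and `build_test_matrix` are illustrative only.

```python
from itertools import product

# Conditions as listed on the slide.
DROID_BUILDS = ["v3", "v5", "v6", "v6 FAST"]
SIGNATURE_VERSIONS = ["v13", "v37", "v45", "v49", "v50"]
EXTENSION_MODES = [True, False]  # file extension present / stripped


def build_test_matrix():
    """Return one dict per candidate test run (assumed full cross-product)."""
    return [
        {"droid": droid, "signature": signature, "with_extension": with_ext}
        for droid, signature, with_ext in product(
            DROID_BUILDS, SIGNATURE_VERSIONS, EXTENSION_MODES
        )
    ]


runs = build_test_matrix()
print(len(runs))           # number of candidate runs per digital object
print(len(runs) * 26_000)  # upper bound on assertions for the 26,000-object set
```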
Identifying and Quantifying Change
EP/1990/0432/29-F New school patrol system being tested, Wellington
Photograph taken by John Nicholson ca 2 Feb 1990
Alexander Turnbull Library
• 1 million DROID ‘assertions’ captured
• Python and MySQL used to sort, clean, filter, draw graphics and otherwise interpret results
• Paper completed and will be available on the OPF website
www.openplanetsfoundation.org
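One way the captured assertions could be queried for persistence is sketched below. It is not the NDHA's code: the standard-library `sqlite3` module stands in for MySQL so the example is self-contained, the table layout is an assumption, and the sample rows are invented for illustration (only the PUIDs themselves appear in the slides).

```python
import sqlite3

# Invented sample rows: (object_id, droid_build, signature_version, puid).
SAMPLE_ASSERTIONS = [
    ("do-1", "v3", "v13", "fmt/12"),
    ("do-1", "v6", "v50", "fmt/12"),
    ("do-2", "v3", "v13", "x-fmt/263"),
    ("do-2", "v6", "v50", "fmt/189"),
]


def puid_persistence(rows):
    """Map each object ID to True if every test run asserted the same PUID."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE assertions "
        "(object_id TEXT, droid TEXT, signature TEXT, puid TEXT)"
    )
    conn.executemany("INSERT INTO assertions VALUES (?, ?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT object_id, COUNT(DISTINCT puid) "
        "FROM assertions GROUP BY object_id ORDER BY object_id"
    )
    # One distinct PUID across all runs means the label persisted.
    result = {object_id: count == 1 for object_id, count in cursor}
    conn.close()
    return result


print(puid_persistence(SAMPLE_ASSERTIONS))
```

Grouping the same table by file extension instead of object ID would give per-file-type figures of the kind reported in the results slides.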
Identifying and Quantifying Change
DCDL-0004533 Eric Idle. 5 December, 2007.
Webb, Murray, 1947- : Digital caricatures published from 29 July 2005 onwards
Alexander Turnbull Library
Summary of Results
Of the 61 tested file types:
75% performed identically for all tested versions of DROID and signature versions
fmt/49 (RTF 1.4)
Summary of Results
Of the 61 tested file types:
40% consistently offered a single PUID across the range of DROID tests
By extension: gif, avi, png, jpg, html, xml, bmp, wp, and some subsets of doc, ppt and exe
fmt/12 (PNG 1.1)
Summary of Results
Of the 61 tested file types:
In 26% of the file types, multiple PUIDs are equally asserted by DROID at various times.
By extension: docx, xlsx, pptx, some pdf, doc, xls, ppt, txt, log, aiff, and arc
fmt/7 (TIF format)
Summary of Results
Of the 61 tested file types:
In 16% of the file types, DROID version 6 in ‘FAST’ mode performs differently from DROID version 6 in standard mode
By extension: epub, mp4, flac, wav, zip and some subsets of pdf, xls, tif and exe
fmt/6 (Waveform Audio)
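A comparison of this kind can be sketched as below, assuming each run yields a mapping from filename to asserted PUID (or `None` where nothing was asserted). The function name and the sample mappings are invented for illustration and are not drawn from the study's data.

```python
import os


def extensions_that_differ(standard, fast):
    """Return the sorted extensions where the two modes assert different PUIDs."""
    differing = set()
    for filename, puid in standard.items():
        if fast.get(filename) != puid:
            extension = os.path.splitext(filename)[1].lstrip(".").lower()
            differing.add(extension)
    return sorted(differing)


# Invented sample results for two files.
standard_mode = {"a.wav": "fmt/6", "b.png": "fmt/12"}
fast_mode = {"a.wav": None, "b.png": "fmt/12"}
print(extensions_that_differ(standard_mode, fast_mode))
```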
Recommendation 1
There is a clear need for a community owned dataset that spans the PRONOM catalogue to support testing
(This should be community created)
ExL-fmt/62 - fmt/189 (MS Open Office XML 2007)
Recommendation 2
It is strongly recommended that more research is undertaken into the persistence of PUIDs, to give a more complete history of file type assertions by PRONOM/DROID
fmt/14 (PDF 1.0)
Recommendation 3
Given the variances observed, especially with DROID v6 ‘FAST’ mode, it is recommended that all signatures are robustly tested prior to release, and that efforts are made to maintain consistency with legacy signatures and limit the impact on users
x-fmt/263 (ZIP format)
Recap
How Rosetta uses DROID
How DROID has changed
Research NDHA completed
Results
Recommendations
Thank you
Rosetta demo – Wednesday 28th March 9am to 1pm @ NLNZ - 77 Thorndon Quay
Paper available through the Open Planets Website www.openplanetsfoundation.org