Evaluation of format identification tools


Description

Johan van der Knijff, The National Library of the Netherlands, presented his evaluation of format identification tools. He concluded by discussing the potential next steps for tools like DROID, following the results of his evaluation. This talk was given as part of The Future of File Format Identification: PRONOM and DROID User Consultation, in collaboration with the Digital Preservation Coalition at The National Archives, UK, on 28 November 2011. Listen to the full presentation at http://media.nationalarchives.gov.uk/index.php/johan-van-der-knijff-evaluation-of-format-identification-tools/

Transcript of Evaluation of format identification tools

Page 1: Evaluation of format identification tools


Johan van der Knijff Koninklijke Bibliotheek – National Library of the Netherlands

[email protected]

The Future of File Format Identification DPC / The National Archives, Kew, London, 28.11.2011

Evaluation of format identification tools

Page 2: Evaluation of format identification tools


Background & context

Page 3: Evaluation of format identification tools

About SCAPE

SCAlable Preservation Environments

EU-funded FP7 project; 16 partners

Scalable services for preservation and preservation planning

Semi-automated workflows for large-scale, heterogeneous collections of complex digital objects

Page 4: Evaluation of format identification tools

Importance of identification

[Diagram: an object can be migrated to a new format, monitored by automated watch (producing alerts), or emulated (emulated performance); every one of these preservation actions starts from the same question]

What is it? Identification!

Page 5: Evaluation of format identification tools

Evaluation of identification tools

Which tools are suitable for the SCAPE architecture?

Specific strengths/weaknesses

Decide on needed enhancements and modifications

Hopefully provide some useful input to developers as well!

Page 6: Evaluation of format identification tools

Tools

DROID 6.0

FIDO 0.9.3/0.9.5

Unix File Utility 5.03

FITS 0.5 (uses DROID 3.0)

JHOVE2 (uses DROID 4.0)
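A minimal sketch of how such a tool can be plugged into an automated workflow, using the Unix File utility as the example; the helper name identify_mime and the choice of MIME-type output are illustrative assumptions rather than anything prescribed by the evaluation.

```python
import subprocess

def identify_mime(path):
    """Return the MIME type reported by the Unix File utility for one file.

    Each call spawns a new process, i.e. the one-file-per-invocation
    pattern discussed later in the talk.
    """
    result = subprocess.run(
        ["file", "-b", "--mime-type", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Example: identify_mime("article.pdf") would return something like 'application/pdf'.
```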

Page 7: Evaluation of format identification tools

Evaluation framework

Total of 22 criteria, broadly covering:

Usability in automated workflow (interface, dependencies)

Fit to requirements of an archival setting: format coverage, extendibility, accuracy

Output: format, identifiers, granularity

User documentation

Performance, stability and error handling

Page 8: Evaluation of format identification tools

Key principles

Hands-on testing using real data is essential!

Page 9: Evaluation of format identification tools

Key principles

Inform tool developers of the results, and give them the opportunity to provide feedback

Page 10: Evaluation of format identification tools


Performance tests

Page 11: Evaluation of format identification tools

Test data: KB Scientific Journals set

N: 11,892 files
Size (min): 11 bytes
Size (median): 4,737 bytes
Size (max): 25,495,289 bytes
Total size: 1.15 GB

Page 12: Evaluation of format identification tools

[Diagram: File 1 → tool → output; File 2 → tool → output; ...; File n → tool → output]

One file per tool invocation

Page 13: Evaluation of format identification tools

[Diagram: File 1, File 2, ..., File n → tool → output]

Many files per tool invocation
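A small sketch to make the two invocation patterns concrete, with the Unix File utility standing in for any of the tested tools and the helper names chosen for illustration: the first function starts the tool once per file and pays the startup cost n times, the second hands the whole batch to a single invocation.

```python
import subprocess

def one_file_per_invocation(paths):
    # The tool is started once for every file: n processes, n startup costs.
    return [
        subprocess.run(["file", "-b", "--mime-type", p],
                       capture_output=True, text=True).stdout.strip()
        for p in paths
    ]

def many_files_per_invocation(paths):
    # The tool is started once for the whole batch: one process, one startup cost.
    # (For very large batches, File can also read the list of names from a
    # file via its -f option instead of taking them on the command line.)
    output = subprocess.run(["file", "-b", "--mime-type", *paths],
                            capture_output=True, text=True).stdout
    return output.strip().splitlines()
```

For File itself the difference is modest; for the Java-based tools the startup cost per invocation dominates, which is what the conclusions on page 19 attribute to slow initialisation.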

Page 14: Evaluation of format identification tools

One-file per invocation

Page 15: Evaluation of format identification tools

One-file per invocation

Page 16: Evaluation of format identification tools

One-file per invocation

Page 17: Evaluation of format identification tools

Comparison: one-file per invocation

[Comparison chart: File, FIDO, FITS, JHOVE2, DROID]

Page 18: Evaluation of format identification tools

Comparison: many-files per invocation

[Comparison chart: File, FIDO, JHOVE2, DROID]

Page 19: Evaluation of format identification tools

Performance: main conclusions

All tested Java-based tools slow for one-file-per-invocation use case

Performance much better for many-files-per-invocation use case

Slow initialisation seems to be the main culprit:
- Actual processing time per file: milliseconds
- Tool initialisation time: several seconds!
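A rough back-of-the-envelope calculation shows the scale of the problem; the two-second initialisation time and five milliseconds of processing per file are illustrative assumptions, not measured figures from the evaluation.

```python
n_files = 11892        # size of the KB Scientific Journals test set
startup_s = 2.0        # assumed tool initialisation time per invocation
per_file_s = 0.005     # assumed actual processing time per file

one_per_invocation = n_files * (startup_s + per_file_s) / 3600
batch_invocation = (startup_s + n_files * per_file_s) / 3600

print(f"one file per invocation:   ~{one_per_invocation:.1f} hours")       # ~6.6 hours
print(f"many files per invocation: ~{batch_invocation * 60:.1f} minutes")  # ~1.0 minutes
```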

Page 20: Evaluation of format identification tools

So is this really a problem?

Depends on required throughput

Depends on workflow interface (command line or Java API)

Depends on organisation of workflow

Depends on purpose (e.g. pre-ingest vs profiling of large file collections)

Page 21: Evaluation of format identification tools

Apples vs oranges

FITS, JHOVE2: wrappers; also feature extraction and validation

DROID 6, FIDO: recurse into ZIP files; File doesn't!
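As a minimal illustration of what recursing into a container means (the helper name is an assumption): a ZIP file holds member files that each need identification in their own right, which DROID 6 and FIDO attempt and File does not.

```python
import zipfile

def zip_members(path):
    # List the member files inside a ZIP container; each of these would
    # still need signature-based identification of its own.
    with zipfile.ZipFile(path) as zf:
        return [name for name in zf.namelist() if not name.endswith("/")]
```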

Page 22: Evaluation of format identification tools


Miscellaneous observations

Page 23: Evaluation of format identification tools

Other observations

Signature-based identification doesn't work too well for text-based formats (including XML); see the sketch after these observations

File outperforms other tools on format coverage and performance; management of signatures (‘magic’ file) awkward

DROID 6 output handling clumsy in automated workflows (separate DROID invocation needed for exporting profile information!)
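A sketch of why signature matching struggles with text-based formats; the signature table below is a tiny illustrative subset, not a real PRONOM signature file or 'magic' database. Binary formats start with a fixed byte sequence, whereas plain text, CSV, or XML without a fixed preamble gives the matcher nothing to latch onto at offset 0.

```python
# Tiny illustrative subset of byte signatures; real tools use far richer
# signature files (PRONOM) or 'magic' databases.
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"PK\x03\x04": "ZIP container",
}

def identify_by_signature(path):
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, fmt in SIGNATURES.items():
        if header.startswith(magic):
            return fmt
    # A plain-text or XML file with no fixed byte sequence at the start
    # falls through here: the signature approach alone cannot identify it.
    return None
```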

Page 24: Evaluation of format identification tools

Response to this work so far

FIDO: version 0.9.6 released in October; fixes most reported issues

FITS: version 0.6 released in October; various enhancements based on outcome of evaluation

DROID, JHOVE2: both provided feedback and will consider test results for upcoming releases

Page 25: Evaluation of format identification tools

Possible next steps

Improve evaluation of accuracy

Keep up with tool updates; keep this work up-to-date

Publish all used scripts and detailed description of analysis methods so others can contribute more easily

Use publicly available test corpus (e.g. Govdocs1)

Page 26: Evaluation of format identification tools


More about SCAPE:

#SCAPEProject

http://www.scape-project.eu

Link to full report on OPF blog:

www.openplanetsfoundation.org/blogs/2011-09-21-evaluation-identification-tools-first-results-scape