File Format Identification and Archival Processing
![Page 1: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/1.jpg)
File Format Identification and Archival Processing
William Underwood, GTRI, Atlanta, Georgia
NARA Briefing, Washington, DC, February 6, 2009
![Page 2: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/2.jpg)
Overview
• Background
• File Command - Magic Expressions
• DROID - File Format Signature Expressions
• Comparison - File Command/Magic & DROID/FF Signatures
• Summary
![Page 3: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/3.jpg)
Background – Projects
Presidential Electronic Records PilOt System (PERPOS) (2001-2006)
Advanced Decision Support for Archival Processing of Presidential Electronic Records (2007-2009)
![Page 4: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/4.jpg)
Background: Electronic Records at George H.W. Bush Pres. Library
One of the first presidential libraries to have electronic presidential records, particularly from hard drives
◦ Word Processing Files
◦ Databases
◦ Spreadsheets
◦ Presentations
◦ Email
◦ Computer Programs
◦ Scanned Paper Records
![Page 5: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/5.jpg)
Background: Where We Began
The archival functions needed to process paper records are well understood.
We had few tools to identify, view or review electronic records in response to PRA/FOIA requests.
Tools Initially Needed:
◦ File Format Identification Tool
◦ Viewers for Records in Legacy File Formats
◦ Tool for Filtering OS and Office Applications Software from User-created Files
◦ Tools for Converting Legacy to Current Formats
◦ Tools to Support Redaction of E-records
![Page 6: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/6.jpg)
Background: Evolutionary Prototyping
Result: Integrated set of tools called PERPOS
![Page 7: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/7.jpg)
Background: Archival Activities Supported by PERPOS
![Page 8: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/8.jpg)
Contents of PC Hard Disk
![Page 9: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/9.jpg)
File Format Names
![Page 10: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/10.jpg)
Filter Contents of a Hard Drive
![Page 11: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/11.jpg)
OS and Software Application Files Blocked by Filter
![Page 12: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/12.jpg)
File Types of Passed Files
![Page 13: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/13.jpg)
Properties of Filtered Files
![Page 14: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/14.jpg)
OS/App Hash Code Filter
![Page 15: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/15.jpg)
National Software Reference Library
![Page 16: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/16.jpg)
NSRL Reference Data Set
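The OS/App hash code filter named on these slides can be sketched as follows. This is a minimal illustration, not the PERPOS implementation: it assumes the filter compares each file's SHA-1 digest against a set of digests of known operating system and application files, such as those published in the NSRL Reference Data Set. The function names and the `known_hashes` set are hypothetical, and loading the actual reference data is not shown.

```python
import hashlib
from pathlib import Path


def sha1_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the uppercase hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def split_by_known_hashes(paths, known_hashes):
    """Partition paths into (blocked, passed).

    A file is blocked when its digest appears in known_hashes (a hypothetical
    set of uppercase hex SHA-1 strings for known OS/application files);
    everything else is passed on as a candidate user-created file.
    """
    blocked, passed = [], []
    for path in paths:
        (blocked if sha1_of(path) in known_hashes else passed).append(path)
    return blocked, passed
```

A hash match only says that a file is byte-for-byte identical to a known software file, so files that pass the filter still need format identification and review.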
![Page 17: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/17.jpg)
Viewers, Archive Extractors, Password Recovery, Decrypters, Converters, Repairers
![Page 18: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/18.jpg)
Magic File – Man Page
![Page 19: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/19.jpg)
Magic File – Man Page
![Page 20: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/20.jpg)
Magic File – Man Page
![Page 21: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/21.jpg)
Extensions of File Command and Magic File
• Magic for individual file formats
• Output of file command/magic file is File Format ID
• Rewriting file command code for identifying Characteristics of Text files and Document Types
• Defined approx. 750 file format signatures
• Collected examples of approx. 500 of the file format types
• Created File Signature Database
• Verified that magic file correctly identifies approx. 500 File Types
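The idea behind a magic test (read a typed value at an offset and compare it with an expected value) can be sketched as below. This is a simplified illustration rather than the file command's actual code or the GTRI extension of it: the rule tuples, the small set of supported types, and the description strings are assumptions made for the example.

```python
import struct

# struct format codes for a few magic-style numeric types.
_FORMATS = {
    "byte": "B",
    "beshort": ">H",
    "leshort": "<H",
    "belong": ">I",
    "lelong": "<I",
}

# Illustrative rules: (offset, type, expected value, description).
# They are evaluated in order; the first rule that matches wins.
RULES = [
    (0, "string", b"%PDF-", "PDF document"),
    (0, "string", b"GIF8", "GIF image data"),
    (0, "beshort", 0xFFD8, "JPEG image data"),
]


def rule_matches(data: bytes, offset: int, kind: str, expected) -> bool:
    """Check one rule against the bytes of a file."""
    if kind == "string":
        return data[offset:offset + len(expected)] == expected
    fmt = _FORMATS[kind]
    if offset + struct.calcsize(fmt) > len(data):
        return False  # file too short to hold a value of this type here
    return struct.unpack_from(fmt, data, offset)[0] == expected


def identify(data: bytes, rules=RULES):
    """Return the description of the first matching rule, or None."""
    for offset, kind, expected, description in rules:
        if rule_matches(data, offset, kind, expected):
            return description
    return None
```

For example, `identify(b"%PDF-1.4\n")` returns the description of the first rule, and a file matching no rule yields `None`.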
![Page 22: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/22.jpg)
![Page 23: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/23.jpg)
![Page 24: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/24.jpg)
![Page 25: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/25.jpg)
GUI for File Type Identifier
![Page 26: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/26.jpg)
• File signatures for about 200 file formats that are currently defined in the DROID File Signature file only by file name extensions
◦ Examples: Microsoft Outlook Personal Folders (97-2002), AIFF (Compressed), AutoCAD Design Web Format, Adobe FrameMaker Document, Applixware Spreadsheet, ChiWriter 3 Document
• File signatures for about 300 file formats that probably should be included in the PRONOM Registry and DROID Signature File
◦ Examples: MHTML Web Page Archive, Outlook Express E-mail Folder, Autodesk Revit Project, CATIA Model File V4, CATIA Drawing V5, ClarisWorks 3 Document, MacWrite 4.x Document, PDF/X-1a
![Page 27: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/27.jpg)
DROID – File Signature Expressions
In PRONOM, an internal signature is composed of one or more byte sequences, each comprising a continuous sequence of hexadecimal byte values and, optionally, regular expressions. A signature byte sequence is modelled by describing its starting position within a bitstream and its value.
The starting position can be one of two basic types:
• Absolute: the byte sequence starts at a fixed position within the bitstream. This position is described as an offset from either the beginning or the end of the bitstream.
• Variable: the byte sequence can start at any offset within the bitstream. The byte sequence can be located by examining the entire bitstream.
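The two position types can be illustrated with a short sketch. It is not DROID's code: the function name and the `anchor` labels are hypothetical, offsets are assumed to be non-negative, and an end-relative offset is assumed to count the bytes between the end of the sequence and the end of the bitstream.

```python
def matches_at(data: bytes, pattern: bytes, anchor: str, offset: int = 0) -> bool:
    """Test a fixed byte sequence under one of three positioning rules.

    anchor == "BOF":      absolute, pattern starts `offset` bytes from the start
    anchor == "EOF":      absolute, pattern ends `offset` bytes before the end
    anchor == "variable": pattern may start anywhere in the bitstream
    """
    if anchor == "BOF":
        return data[offset:offset + len(pattern)] == pattern
    if anchor == "EOF":
        end = len(data) - offset
        start = end - len(pattern)
        return start >= 0 and data[start:end] == pattern
    if anchor == "variable":
        return pattern in data  # scans the entire bitstream
    raise ValueError(f"unknown anchor: {anchor!r}")
```

The absolute cases inspect only `len(pattern)` bytes, while the variable case may scan the whole file, which is why absolute positions are cheaper to test.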
![Page 28: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/28.jpg)
The value of the byte sequence is defined as a sequence of hexadecimal values, optionally incorporating any of the following regular expressions:
• ??: wildcard matching any pair of hexadecimal values (i.e. a single byte).
• *: wildcard matching any number of bytes (0 or more).
• {n}: wildcard matching n bytes, where n is an integer.
• {m-n}: wildcard matching between m-n bytes inclusive, where m and n are integers or ‘*’.
• (a|b): wildcard matching one from a list of values (e.g. a or b), where each value is a hexadecimal byte sequence of arbitrary length containing no wildcards.
• [a:b]: wildcard matching any sequence of bytes which lies lexicographically between a and b, inclusive (where both a and b are byte sequences of the same length, containing no wildcards, and where a is less than b). The endian-ness of a and b are the same as the endian-ness of the signature as a whole.
• [!a]: wildcard matching any sequence of bytes other than a itself (where a is a byte sequence containing no wildcards).
• [!a:b]: wildcard matching any sequence of bytes which does not lie lexicographically between a and b, inclusive (where a and b are both byte sequences of the same length, containing no wildcards, and where a is less than b).
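One way to experiment with this syntax is to translate a signature into an ordinary regular expression over bytes. The sketch below does that under stated simplifications and is not how DROID itself evaluates signatures: `to_regex` is a hypothetical helper, `[a:b]` and `[!a:b]` are handled only for single-byte bounds, `{m-n}` accepts ‘*’ only as the upper bound, and the signatures in the usage note are illustrative rather than copied from the PRONOM registry.

```python
import re

_TOKEN = re.compile(
    r"""
    (?P<any>\?\?)
  | (?P<star>\*)
  | \{(?P<lo>\d+)(?:-(?P<hi>\d+|\*))?\}
  | \((?P<alts>[0-9A-Fa-f|]+)\)
  | \[(?P<neg>!)?(?P<a>[0-9A-Fa-f]+)(?::(?P<b>[0-9A-Fa-f]+))?\]
  | (?P<hex>[0-9A-Fa-f]{2})
    """,
    re.VERBOSE,
)


def _bytes(hex_text: str) -> str:
    """Render hex text such as '4D5A' as the regex escapes \\x4d\\x5a."""
    return "".join(f"\\x{b:02x}" for b in bytes.fromhex(hex_text))


def to_regex(signature: str) -> re.Pattern:
    """Translate a signature expression into a compiled bytes regex."""
    parts, pos = [], 0
    while pos < len(signature):
        m = _TOKEN.match(signature, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        pos = m.end()
        if m["any"]:
            parts.append(".")                      # ??: exactly one byte
        elif m["star"]:
            parts.append(".*?")                    # *: zero or more bytes
        elif m["lo"] is not None:
            lo, hi = m["lo"], m["hi"]
            if hi is None:
                parts.append(f".{{{lo}}}")         # {n}
            else:
                upper = "" if hi == "*" else hi
                parts.append(f".{{{lo},{upper}}}")  # {m-n} or {m-*}
        elif m["alts"]:
            options = "|".join(_bytes(x) for x in m["alts"].split("|"))
            parts.append(f"(?:{options})")         # (a|b)
        elif m["a"] is not None:
            a, b, neg = m["a"], m["b"], m["neg"]
            if b is None:
                if not neg:
                    raise ValueError("expected [!a], [a:b] or [!a:b]")
                # [!a]: len(a)/2 bytes that are not equal to a
                parts.append(f"(?!{_bytes(a)}).{{{len(a) // 2}}}")
            elif len(a) == 2 and len(b) == 2:
                parts.append(f"[{'^' if neg else ''}{_bytes(a)}-{_bytes(b)}]")
            else:
                raise ValueError("multi-byte ranges are not handled in this sketch")
        else:
            parts.append(_bytes(m["hex"]))         # literal byte
    # DOTALL so that '.' also matches the newline byte 0x0A.
    return re.compile("".join(parts).encode("ascii"), re.DOTALL)
```

With this helper, `to_regex("255044462D312E[30:37]").match(data)` tests a sequence at offset 0 from the beginning of the bitstream, while `.search(data)` corresponds to a variable position.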
![Page 29: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/29.jpg)
![Page 30: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/30.jpg)
DROID Applied to Sample Files
![Page 31: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/31.jpg)
Comparison of DROID and GTRI file Type Identifier Technologies
DROID
• Matches sequences of hex values at offsets
• Regular expressions on hex values
• Efficient substring search
• Identifies all possible signatures and then selects the one of highest priority
• Includes offsets from EOF
GTRI File Type Identifier
• Matches a variety of data types at offsets
• Regular expressions on strings in lines
• Less efficient substring search, but more indirect offsets increase efficiency
• Preorders signatures and stops search when pattern matches
• Lacks offsets from EOF
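The two selection strategies in this comparison can be contrasted in a few lines. The sketch below is an assumption-laden illustration, not code from either tool: the `Signature` type, the integer priorities, and both sample signatures (including the "mimetype" check) are hypothetical.

```python
from typing import Callable, NamedTuple, Optional


class Signature(NamedTuple):
    name: str
    priority: int                    # larger value = more specific format
    test: Callable[[bytes], bool]


# Hypothetical signatures: a generic container and a more specific format.
SIGNATURES = [
    Signature("Generic ZIP container", 1,
              lambda d: d.startswith(b"PK\x03\x04")),
    Signature("ZIP-based document (hypothetical)", 2,
              lambda d: d.startswith(b"PK\x03\x04") and b"mimetype" in d[:64]),
]


def identify_all_then_select(data: bytes, signatures=SIGNATURES) -> Optional[str]:
    """Evaluate every signature, then keep the hit with the highest priority."""
    hits = [s for s in signatures if s.test(data)]
    return max(hits, key=lambda s: s.priority).name if hits else None


def identify_first_match(data: bytes, signatures=SIGNATURES) -> Optional[str]:
    """Order signatures from most to least specific and stop at the first hit."""
    for s in sorted(signatures, key=lambda s: s.priority, reverse=True):
        if s.test(data):
            return s.name
    return None
```

When the preordering follows the same priorities, both functions return the same name; the difference is that the first-match version can skip the remaining tests once a pattern matches.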
![Page 32: File Format Identification and Archival Processing](https://reader035.fdocuments.in/reader035/viewer/2022062305/568148b2550346895db5c823/html5/thumbnails/32.jpg)
Summary
• PERPOS File Format Resources
◦ File Format Signatures
◦ File Format Specifications/Reverse Engineering Documents
◦ Software: Viewers/players, Archive Extractors, Converters, Password Recovery & Decryption, Repairers
◦ Sample Files
• Research Issues
◦ File Signature Representation Languages
◦ Metadata Extraction Languages
◦ File Format Description Languages