The BagIt file package format

1
The BagIt file package format John Kunze and Stephen Abrams, California Digital Library (CDL) A BagIt bag directory listing A BagIt bag is a special directory (or Windows folder) The directory name is your choice, but inside the top level BagIt reserves names for required files: data : a subdirectory where you can put anything bagit.txt : a 2-line file declaring this is a bag manifest-md5.txt : a list of files present MyFirstBag/ \--- data/ Replicas for a rainy day Need a data package format that can carry any data Suitable for disk-based and network-based transfer Main purpose is to move data from one digital library or archive to another for safe-keeping Not important whether the receiver provide access to or even understand meaning of the sender’s content Fig. 1. Sleeping better at night is more likely for the archivist who has replicas safely tucked away at other memory organizations. e³°°° | manifest-md5.txt | bagit.txt | bag-info.txt data/ | my.pdf | my.xsl | my.mpg Fig. 2. Payload directory, data,holds anything you want. “Tag” files (purple) describe the bag itself, including list of files and checksums in the manifest (algorithms other than md5 possible) and an optional “bag-info.txt” metadata file. .mpg 1. start: put your files… .xls .pdf 3. create a tag manifest-md5.txt bagit.txt . . . .mpg .xls .pdf data/ 2. in a directory, data 4. put tag next to data Bag&Tag Holey bags! for efficiency A common optional tag file lists “holes” to be filled Payload is incomplete until these files are fetched File fetch.txt lists URLs for receiver to grab Benefits include (network transfers only): No need to stage extra data copy at either end Cheap parallelism from fetching URLs in batches .mpg .xls .pdf data data Fig. 3. On the left is a bag received with missing files, or “holes”. The “fetch.txt” file lists URLs and corresponding payload filenames that the receiver must fetch before declaring the bag on the right complete. (http) manifest-md5.txt bagit.txt fetch.txt manifest-md5.txt bagit.txt fetch.txt .mpg .xls .pdf data manifest-md5.txt bagit.txt BagIt credits and details Authors from LC and CDL: Andy Boyko, John Kunze, Justin Littman, Liz Madden, Brian Vargas Many thanks to: Stephen Abrams, Mike Ashenfelder, Scott Fisher, Erik Hetzner, Keith Johnson, David Loy, Tracy Seneca, Mark Phillips, Adam Turoff, Jim Tuttle BagIt specification: http://www.ietf.org/internet-drafts/draft-kunze-bagit-03.txt http://www.cdlib.org/inside/diglib/bagit/bagitspec.html BagIt was informed by Enclose-and-Deposit method, Tabata & Sugimoto, LC’s eDeposit Pilot and NDIIPP AIHT ARC/WARC aggregate file format Summary BagIt is a file hierarchy, suitable for disk- or network- based transfer, and maybe bag return, with just enough structure to safely enclose a manifest, checksums, tag info, and an arbitrary payload. Bags in action Library of Congress (LC) grantees send bags For example, bagged web crawls funded by LC are sent by CDL to LC, with replica sent to San Diego clarrisa~ LaurelJukebox For further information Please contact jak@ucop .edu or [email protected] For information on CDL’s Preservation Program, see http://www. cdlib . org/programs/digital_preservation .html

description

 

Transcript of The BagIt file package format

Page 1: The BagIt file package format

The BagIt file package formatJohn Kunze and Stephen Abrams, California Digital Library (CDL)

A BagIt bag directory listing• A BagIt bag is a special directory (or Windows folder)• The directory name is your choice, but inside the top

level BagIt reserves names for required files:• data : a subdirectory where you can put anything• bagit.txt : a 2-line file declaring this is a bag• manifest-md5.txt : a list of files present

MyFirstBag/ | manifest-md5.txt | bagit.txt | bag-info.txt \--- data/ | my.pdf | my.xsl | my.mpg

Replicas for a rainy day• Need a data package format that can carry any data• Suitable for disk-based and network-based transfer• Main purpose is to move data from one digital library

or archive to another for safe-keeping• Not important whether the receiver provide access to

or even understand meaning of the sender’s content

Fig. 1. Sleeping better atnight is more likely for thearchivist who has replicassafely tucked away at othermemory organizations.

e³°°°

| manifest-md5.txt | bagit.txt | bag-info.txt

data/ | my.pdf | my.xsl | my.mpg

Fig. 2. Payload directory, data,holds anything you want.“Tag” files (purple) describe the bag itself, including list offiles and checksums in the manifest (algorithms other thanmd5 possible) and an optional “bag-info.txt” metadata file.

.mpg

1. start: put your files…

.xls.pdf

3. create a tagmanifest-md5.txt

bagit.txt. . .

.mpg

.xls.pdf

data/

2. in a directory, data

4. put tag next to data

Bag&Tag

Holey bags! for efficiency• A common optional tag file lists “holes” to be filled• Payload is incomplete until these files are fetched• File fetch.txt lists URLs for receiver to grab• Benefits include (network transfers only):

• No need to stage extra data copy at either end• Cheap parallelism from fetching URLs in batches

.mpg

.xls

.pdf

datadata

Fig. 3. On the left is a bag received with missing files, or“holes”. The “fetch.txt” file lists URLs and correspondingpayload filenames that the receiver must fetch beforedeclaring the bag on the right complete.

(http)

manifest-md5.txtbagit.txtfetch.txt

manifest-md5.txtbagit.txtfetch.txt

.mpg

.xls.pdf

data

manifest-md5.txtbagit.txt

BagIt credits and detailsAuthors from LC and CDL: Andy Boyko, John Kunze,

Justin Littman, Liz Madden, Brian VargasMany thanks to: Stephen Abrams, Mike Ashenfelder, Scott

Fisher, Erik Hetzner, Keith Johnson, David Loy, TracySeneca, Mark Phillips, Adam Turoff, Jim Tuttle

BagIt specification:http://www.ietf.org/internet-drafts/draft-kunze-bagit-03.txthttp://www.cdlib.org/inside/diglib/bagit/bagitspec.html

BagIt was informed by• Enclose-and-Deposit method, Tabata & Sugimoto,• LC’s eDeposit Pilot and NDIIPP AIHT• ARC/WARC aggregate file format

SummaryBagIt is a file hierarchy, suitable for disk- or network-

based transfer, and maybe bag return, with justenough structure to safely enclose a manifest,checksums, tag info, and an arbitrary payload.

Bags in action

Library of Congress (LC) grantees send bags• For example, bagged web crawls funded by LC are

sent by CDL to LC, with replica sent to San Diego

clarrisa~

LaurelJukebox

For further informationPlease contact [email protected] or [email protected] information on CDL’s Preservation Program, see http://www.cdlib.org/programs/digital_preservation.html