Preserving a Born-Digital Archive: The H-Net E-Mail Lists
Lisa M. Schmidt
http://www.h-net.org/archive/
MATRIX: The Center for Humane Arts, Letters & Social Sciences Online
Michigan State University
November 16, 2009
Preserving the H-Net E-Mail Lists
• H-Net Background
• Original “Preservation” Practices
• Use of the Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)
• Preservation Improvements
H-Net: Humanities and Social Sciences Online
• International consortium of scholars and teachers
• Oldest collection of born-digital and content-moderated arts, humanities, and social science material on the Internet
• Hosted by MATRIX
H-Net: Humanities and Social Sciences Online
• Valuable scholarly resource
– More than 180 networks, or e-mail lists, with more than 130,000 unique subscribers
– More than 5,000 posts per month
– More than 230 “private” lists
– 230,000 message views in single week
• More than 1 million e-mail messages
MATRIX
• Digital humanities research center
• Devoted to the application of new technologies in teaching, research, and outreach
• Creates and maintains digital libraries of humanities and social science materials
• Provides training in computing and new teaching technologies
• Creates forums for the exchange of ideas and expertise
NHPRC Grant
• Conduct assessment of existing H-Net preservation policies and practices
• Apply OCLC/CRL TRAC checklist
• Develop and implement an improved long-term preservation plan
• Useful to those managing large collections of electronic records
• Research semantic clustering search techniques
How H-Net Works: Backup & Security
• 3 TB of data, including H-Net
• Server rack kept in climate controlled, physically secured room
• Daily incremental backups, weekly full
• Full, “permanent” tape backups every four months
How H-Net Works
• H-Net runs on LISTSERV software
• Submission policies
– Users must be list subscribers to post
– Messages written in plain text
– No attachments allowed on public lists
How H-Net Works: An Archival Perspective
• Appraisal/Acquisition/Accession
– All approved messages permanently archived
– Editors approve and post messages
– Messages post from a few seconds up to several days after approval
How H-Net Works: An Archival Perspective
Message Posting Process
How H-Net Works: An Archival Perspective
• Arrangement
– Messages kept in flat text files called “notebooks”
– Single notebook includes messages posted during seven-day time period, concatenated in original order
How H-Net Works: An Archival Perspective
• Arrangement
– Notebooks appear to be arranged in original order within each list directory
How H-Net Works: An Archival Perspective
• Description
– Most descriptive metadata for messages automatically generated on creation/posting
– “Author’s Subject” inserted by creator
How H-Net Works: An Archival Perspective
Notebook File Naming
• Notebook description contained in filename
– Ex. “h-africa.log0802a”

Period    Day of Month
a         1-7
b         8-14
c         15-21
d         22-28
e         29-31
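
The naming scheme above can be sketched in a few lines of Python. This is an illustration only, not H-Net's or LISTSERV's actual code. It assumes that the four digits in “h-africa.log0802a” are a two-digit year followed by a two-digit month, which the slide does not state. The function names are hypothetical.

```python
from datetime import date

# Period letters and the days of the month they cover, as listed in the table.
PERIODS = [("a", 1, 7), ("b", 8, 14), ("c", 15, 21), ("d", 22, 28), ("e", 29, 31)]


def period_for_day(day: int) -> str:
    """Return the period letter (a-e) for a day of the month."""
    for letter, first, last in PERIODS:
        if first <= day <= last:
            return letter
    raise ValueError(f"day of month out of range: {day}")


def notebook_filename(list_name: str, posted: date) -> str:
    """Build a notebook filename for the week in which a message was posted.

    Assumption: the digits after "log" are YYMM (two-digit year, then month).
    """
    return f"{list_name.lower()}.log{posted:%y%m}{period_for_day(posted.day)}"


print(notebook_filename("H-Africa", date(2008, 2, 5)))  # h-africa.log0802a
```

Period “e” covers at most three days, so the final notebook of a month is shorter than the others.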
How H-Net Works: Message Retrieval
• BRS Database
– Newest notebook messages parsed and copied every 24 hours
– MD5 hashes created for each message
– Available for full-text search
• MySQL Database Cache
– Key metadata extracted, MD5 hashes created, written to database cache
– Enables more efficient browsing
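
As a rough illustration of hashing a message so that the hash can serve as a lookup key, the sketch below computes an MD5 digest in two encodings and uses one as the key of a small in-memory cache. The slides do not say which bytes of a message are hashed or how the digest is encoded, so both are assumptions, and the dictionary is only a hypothetical stand-in for the MySQL cache. An unpadded base64 MD5 digest is 22 characters long, the same length as the msg value in the retrieval URL shown in these slides, but that match is an observation, not a confirmed derivation.

```python
import base64
import hashlib


def message_key(message_text: str) -> dict:
    """Return the MD5 digest of a message as hex and as unpadded base64.

    Assumption: the whole message text, encoded as UTF-8, is hashed.
    """
    digest = hashlib.md5(message_text.encode("utf-8")).digest()
    return {
        "hex": digest.hex(),
        "base64": base64.b64encode(digest).decode("ascii").rstrip("="),
    }


# Hypothetical stand-in for the database cache: key metadata stored by hash.
cache = {}
message = "From: editor@example.org\nSubject: Test posting\n\nBody text\n"
key = message_key(message)["base64"]
cache[key] = {"subject": "Test posting"}

print(len(key), cache[key]["subject"])
```

MD5 is used here for discovery only. It is not a suitable basis for fixity checks.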
How H-Net Works: Message Retrieval
Message Metadata Stored in MySQL Database
How H-Net Works: Message Retrieval
http://h-net.msu.edu/cgi-bin/logbrowse.pl?trx=vx&list=H-Albion&month=0808&week=b&msg=w8utW6nKNO1FuY19vSK2mo&user=&pw=
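
The query string of this retrieval URL can be taken apart with Python's standard library, as sketched below. Mapping the list, month, and week parameters onto a notebook filename is an assumption based on the naming scheme described earlier in these slides. The slides do not document what each parameter means.

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "http://h-net.msu.edu/cgi-bin/logbrowse.pl?trx=vx&list=H-Albion"
    "&month=0808&week=b&msg=w8utW6nKNO1FuY19vSK2mo&user=&pw="
)

# keep_blank_values=True keeps the empty "user" and "pw" parameters.
params = parse_qs(urlsplit(url).query, keep_blank_values=True)
fields = {name: values[0] for name, values in params.items()}

# Assumption: list + month + week identify the notebook holding the message.
notebook = f"{fields['list'].lower()}.log{fields['month']}{fields['week']}"

print(fields["list"], fields["month"], fields["week"])
print(notebook)
```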
How H-Net Works
Message Ingest, Storage, and Retrieval Processes
Original “Preservation” Practices
• Backup, but only local—and no true archiving
• No normalization or migration strategy
– Message/notebook content: No need
  • Created and stored in plain text formats
  • XML encoding only required with proprietary e-mail formats
– Needed for attachments on private lists
Original “Preservation” Practices
• Authenticity
– Informal check by author and/or editor on posting
– Broken URL on message retrieval attempt
– Cached metadata as PDI
  • Reference, Content, Provenance Information
  • MD5 hashes for message discovery, not fixity
  • No Fixity Information for notebook files
• Policies
– No documented preservation policies
Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)
• TRAC 1.0 published in February 2007
• For certification by third party or self assessment
• Three sections
– A. Organizational Infrastructure
– B. Digital Object Management
– C. Technologies, Technical Infrastructure, & Security
• 84 audit criteria
Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)
• Compare core audit criteria to local capabilities—“Gap Analysis,” illuminating areas requiring improvement
• Formulate strategies to narrow the gap and improve trustworthiness of repository
Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)
• Example 1: Repository has formal succession plan
– H-Net: No succession plan in place
– Narrow the gap: Identify, negotiate with, and make preliminary plans with potential successor; document intent, describing what’s needed in successor
Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)
• Example 2: Repository functions on well-supported operating systems and other core infrastructural software
– H-Net: Servers run on Debian distribution of Linux
– No gap!
Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)
The TRAC Experience
• Thorough yet flexible, leaving room for interpretation, lots of options for supporting documentation/evidence
• Good snapshot of current state of repository
• Clarifies what’s needed to narrow the gap
• Great internal audit tool
• Useful for certification of a trusted digital repository
Preservation Improvements: Backup & Archival Storage
Backup
• Long-term (“permanent”) backup tape sets stored offsite, put on 3-year retention schedule
• Reciprocal backup storage arrangement with ICPSR
Archival Storage
• Annual copying to tape of H-Net data, databases, scripts
• Media refreshment every 5 years
• Future: Copy to alternative storage repository
• Future: Participation in distributed archival storage system
Preservation Improvements: Authenticity
Fixity: Individual Messages (SIPs/AIPs)
• Shorten time window for generation of hashes
• Create database of SHA-256 hashes for fixity checks
• Validate message hashes on notebook completion
Fixity: Notebook Files (AICs)
• Create SHA-256 message digests on completion of notebooks
• Calculate SHA-256 message digests for existing notebooks
• Create database of SHA-256 message digests for fixity checks
• Validate notebook hashes on weekly basis
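
The notebook fixity steps listed above could look roughly like the following sketch: record a SHA-256 digest for each completed notebook, then recompute and compare on a schedule. It is a minimal illustration under stated assumptions, not H-Net's implementation. SQLite stands in for whatever database the project uses, and the table and function names are hypothetical.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def open_digest_db(db_path: str) -> sqlite3.Connection:
    """Open (or create) the digest database. Hypothetical schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notebook_digest "
        "(path TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    return conn


def record_digest(conn: sqlite3.Connection, notebook: Path) -> None:
    """Store the digest of a completed notebook."""
    conn.execute(
        "INSERT OR REPLACE INTO notebook_digest (path, sha256) VALUES (?, ?)",
        (str(notebook), sha256_of_file(notebook)),
    )
    conn.commit()


def failed_fixity_checks(conn: sqlite3.Connection) -> list:
    """Return the paths of notebooks that are missing or whose digest changed."""
    rows = conn.execute("SELECT path, sha256 FROM notebook_digest").fetchall()
    return [
        path
        for path, expected in rows
        if not Path(path).exists() or sha256_of_file(Path(path)) != expected
    ]


# Small demonstration with a temporary notebook file.
with tempfile.TemporaryDirectory() as tmp:
    notebook = Path(tmp) / "h-africa.log0802a"
    notebook.write_text("first message\n")
    conn = open_digest_db(":memory:")
    record_digest(conn, notebook)
    print(failed_fixity_checks(conn))  # no changes since the digest was recorded
    notebook.write_text("altered content\n")
    print(failed_fixity_checks(conn))  # the altered notebook is now reported
    conn.close()
```

A weekly run of a check like `failed_fixity_checks` would correspond to the last bullet above. Scheduling is left out of the sketch.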
Preservation Improvements: Attachments
• Found with < 0.01% of H-Net messages
– MS Office, PDF, image files
• Provide constructed URLs, as with public lists
• Provide download links
• No file normalization or migration plan
– Most files should open in viewers, later versions of applications
– MATRIX will help users if problems arise
Preservation Improvements: Digital Preservation Policies
• Documented digital preservation policies and procedures for the H-Net e-mail lists
– http://www.h-net.org/archive/doc.php
• Based on the Digital Preservation Policy Framework developed by Nancy McGovern of ICPSR
– Digital Preservation Management Workshop/Tutorial
– Roadmap to developing and documenting policies
– Wealth of examples
Preservation Improvements: Narrowing the Gap
• Lather, rinse, repeat: New TRAC assessment
• Technical improvements
• Digital preservation policies
Conclusions
• Relevant to e-mail preservation discussion
• Applicable to preservation of LISTSERV-based and other e-mail lists
• Testbed for other preservation tools and systems
• Useful foundation for digital preservation planning at Michigan State
References
• Digital Preservation Management Tutorial, http://www.icpsr.umich.edu/dpm/dpm-eng/eng_index.html
• H-Net Archives Project, http://www.h-net.org/archive/
• H-Net: Humanities and Social Sciences Online, http://www.h-net.org
• MATRIX: The Center for Humane Arts, Letters, and Social Sciences Online, http://www.matrix.msu.edu
• OAIS Reference Model, http://public.ccsds.org/publications/archive/650x0b1.pdf
• Trusted Digital Repositories: Attributes and Responsibilities, http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf
• Trustworthy Repositories Audit & Certification: Criteria and Checklist, http://www.crl.edu/PDF/trac.pdf