Preserving a Born-Digital Archive: The H-Net E-Mail Lists Lisa M. Schmidt lisa.schmidt@matrix.msu.edu http://www.h-net.org/archive/ MATRIX: The Center for Humane Arts, Letters & Social Sciences Online Michigan State University November 16, 2009

Preserving a Born-Digital Archive: The H-Net E-Mail Lists

Lisa M. Schmidt

lisa.schmidt@matrix.msu.edu

http://www.h-net.org/archive/

MATRIX: The Center for Humane Arts, Letters & Social Sciences Online

Michigan State University

November 16, 2009

Preserving the H-Net E-Mail Lists

• H-Net Background

• Original “Preservation” Practices

• Use of the Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)

• Preservation Improvements

H-Net: Humanities and Social Sciences Online

• International consortium of scholars and teachers

• Oldest collection of born-digital and content-moderated arts, humanities, and social science material on the Internet

• Hosted by MATRIX

H-Net: Humanities and Social Sciences Online

• Valuable scholarly resource
  – More than 180 networks, or e-mail lists, with more than 130,000 unique subscribers
  – More than 5,000 posts per month
  – More than 230 “private” lists
  – 230,000 message views in single week

• More than 1 million e-mail messages

MATRIX

• Digital humanities research center

• Devoted to the application of new technologies in teaching, research, and outreach

• Creates and maintains digital libraries of humanities and social science materials

• Provides training in computing and new teaching technologies

• Creates forums for the exchange of ideas and expertise

NHPRC Grant

• Conduct assessment of existing H-Net preservation policies and practices

• Apply OCLC/CRL TRAC checklist

• Develop and implement an improved long-term preservation plan

• Useful to those managing large collections of electronic records

• Research semantic clustering search techniques

How H-Net Works: Backup & Security

• 3 TB of data, including H-Net

• Server rack kept in climate controlled, physically secured room

• Daily incremental backups, weekly full

• Full, “permanent” tape backups every four months

How H-Net Works

• H-Net runs on LISTSERV software

• Submission policies
  – Users must be list subscribers to post
  – Messages written in plain text
  – No attachments allowed on public lists

How H-Net Works: An Archival Perspective

• Appraisal/Acquisition/Accession
  – All approved messages permanently archived
  – Editors approve and post messages
  – Messages post from a few seconds up to several days after approval

How H-Net Works: An Archival Perspective

Figure: Message Posting Process

How H-Net Works: An Archival Perspective

• Arrangement
  – Messages kept in flat text files called “notebooks”
  – Single notebook includes messages posted during seven-day time period, concatenated in original order

How H-Net Works: An Archival Perspective

• Arrangement
  – Notebooks appear to be arranged in original order within each list directory

How H-Net Works: An Archival Perspective

• Description
  – Most descriptive metadata for messages automatically generated on creation/posting
  – “Author’s Subject” inserted by creator

How H-Net Works: An Archival Perspective

Notebook File Naming

• Notebook description contained in filename
  – Ex. “h-africa.log0802a”

Period   Day of Month
a        1-7
b        8-14
c        15-21
d        22-28
e        29-31
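The naming scheme can be sketched in a few lines of Python. The period letters come from the table on this slide. Two things are assumptions made for illustration: that the four digits in “h-africa.log0802a” are a two-digit year followed by a two-digit month, and that list names are lowercased in filenames. The function names are hypothetical.

```python
from datetime import date

# Period letters from the slide: a = days 1-7, b = 8-14,
# c = 15-21, d = 22-28, e = 29-31.
PERIOD_LETTERS = "abcde"


def period_letter(day_of_month: int) -> str:
    """Map a day of the month (1-31) to its notebook period letter."""
    if not 1 <= day_of_month <= 31:
        raise ValueError(f"day of month out of range: {day_of_month}")
    return PERIOD_LETTERS[(day_of_month - 1) // 7]


def notebook_filename(list_name: str, posted: date) -> str:
    """Build a notebook filename such as 'h-africa.log0802a'.

    Assumptions: the digits are YYMM, and the list name is lowercased.
    """
    return f"{list_name.lower()}.log{posted:%y%m}{period_letter(posted.day)}"


print(notebook_filename("h-africa", date(2008, 2, 5)))  # h-africa.log0802a
```

Because period “e” covers only days 29-31, the last notebook of a month is shorter than the others, and empty in a 28-day February.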

How H-Net Works: Message Retrieval

• BRS Database
  – Newest notebook messages parsed and copied every 24 hours
  – MD5 hashes created for each message
  – Available for full-text search

• MySQL Database Cache
  – Key metadata extracted, MD5 hashes created, written to database cache
  – Enables more efficient browsing
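A rough sketch of the caching step, under stated assumptions: SQLite stands in for MySQL so the example is self-contained, the table layout and function names are hypothetical, and the sample message is invented. It is not H-Net's actual script. It shows only the general idea of hashing a message with MD5 and storing a few headers keyed by that hash.

```python
import hashlib
import sqlite3
from email import message_from_string

# Invented sample message; real notebook entries will differ.
SAMPLE_MESSAGE = """\
From: Example Author <author@example.org>
Subject: Query: archival sources
Date: Fri, 08 Aug 2008 10:15:00 -0400

Plain-text body of the post.
"""


def message_md5(raw: str) -> str:
    """Return the MD5 hex digest of a message's raw text (UTF-8 encoded)."""
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def cache_message(conn: sqlite3.Connection, list_name: str, raw: str) -> str:
    """Extract key headers, hash the message, and store both in the cache."""
    msg = message_from_string(raw)
    digest = message_md5(raw)
    conn.execute(
        "INSERT OR REPLACE INTO message_cache VALUES (?, ?, ?, ?, ?)",
        (digest, list_name, msg["From"], msg["Subject"], msg["Date"]),
    )
    return digest


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL cache
conn.execute(
    "CREATE TABLE message_cache ("
    "md5 TEXT PRIMARY KEY, list_name TEXT, author TEXT, subject TEXT, posted TEXT)"
)
digest = cache_message(conn, "H-Albion", SAMPLE_MESSAGE)
row = conn.execute(
    "SELECT subject FROM message_cache WHERE md5 = ?", (digest,)
).fetchone()
print(row[0])  # Query: archival sources
```

Here the hash serves as a lookup key for the message, not as a fixity check.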

How H-Net Works: Message Retrieval

Figure: Message Metadata Stored in MySQL Database

How H-Net Works: Message Retrieval

http://h-net.msu.edu/cgi-bin/logbrowse.pl?trx=vx&list=H-Albion&month=0808&week=b&msg=w8utW6nKNO1FuY19vSK2mo&user=&pw=
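To see what the constructed URL carries, it can be split into its query parameters with Python's standard library. Reading list, month, and week as the coordinates of a notebook, by analogy with the “h-africa.log0802a” filename example, is an interpretation and not something the slides state.

```python
from urllib.parse import parse_qs, urlsplit

URL = (
    "http://h-net.msu.edu/cgi-bin/logbrowse.pl"
    "?trx=vx&list=H-Albion&month=0808&week=b"
    "&msg=w8utW6nKNO1FuY19vSK2mo&user=&pw="
)

parts = urlsplit(URL)
# keep_blank_values=True retains the empty user= and pw= parameters.
params = {
    key: values[0]
    for key, values in parse_qs(parts.query, keep_blank_values=True).items()
}

# Assumption: list + month + week identify the notebook file.
notebook = f"{params['list'].lower()}.log{params['month']}{params['week']}"

print(parts.path)     # /cgi-bin/logbrowse.pl
print(notebook)       # h-albion.log0808b
print(params["msg"])  # w8utW6nKNO1FuY19vSK2mo
```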

How H-Net Works

Figure: Message Ingest, Storage, and Retrieval Processes

Original “Preservation” Practices

• Backup, but only local—and no true archiving

• No normalization or migration strategy
  – Message/notebook content: No need
    • Created and stored in plain text formats
    • XML encoding only required with proprietary e-mail formats
  – Needed for attachments on private lists

Original “Preservation” Practices

• Authenticity
  – Informal check by author and/or editor on posting
  – Broken URL on message retrieval attempt
  – Cached metadata as PDI
    • Reference, Context, Provenance Information
    • MD5 hashes for message discovery, not fixity
    • No Fixity Information for notebook files

• Policies
  – No documented preservation policies

Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)

• TRAC 1.0 published in February 2007

• For certification by third party or self assessment

• Three sections
  – A. Organizational Infrastructure
  – B. Digital Object Management
  – C. Technologies, Technical Infrastructure, & Security

• 84 audit criteria

Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)

• Compare core audit criteria to local capabilities—“Gap Analysis,” illuminating areas requiring improvement

• Formulate strategies to narrow the gap and improve trustworthiness of repository

Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)

• Example 1: Repository has formal succession plan
  – H-Net: No succession plan in place
  – Narrow the gap: Identify, negotiate with, and make preliminary plans with potential successor; document intent, describing what’s needed in successor

Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)

• Example 2: Repository functions on well-supported operating systems and other core infrastructural software
  – H-Net: Servers run on Debian distribution of Linux
  – No gap!
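One way to keep such a gap analysis in a reviewable form is a small record per criterion. This is a hypothetical structure for illustration, not part of TRAC or of the H-Net project. The two entries restate Examples 1 and 2, and the class and field names are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    criterion: str          # audit criterion, as quoted on the slides
    local_state: str        # what the repository does today
    action: Optional[str]   # strategy to narrow the gap; None means no gap

    @property
    def has_gap(self) -> bool:
        return self.action is not None


findings = [
    Finding(
        "Repository has formal succession plan",
        "No succession plan in place",
        "Identify and negotiate with potential successor; document intent",
    ),
    Finding(
        "Repository functions on well-supported operating systems "
        "and other core infrastructural software",
        "Servers run on Debian distribution of Linux",
        None,
    ),
]

# The open gaps are the entries that still need a strategy carried out.
gaps = [f for f in findings if f.has_gap]
for f in gaps:
    print(f"GAP: {f.criterion} -> {f.action}")
```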

The TRAC Experience

• Thorough yet flexible, leaving room for interpretation, lots of options for supporting documentation/evidence

• Good snapshot of current state of repository

• Clarifies what’s needed to narrow the gap

• Great internal audit tool

• Useful for certification of a trusted digital repository

Preservation Improvements: Backup & Archival Storage

Backup

• Long-term (“permanent”) backup tape sets stored offsite, put on 3-year retention schedule

• Reciprocal backup storage arrangement with ICPSR

Archival Storage

• Annual copying to tape of H-Net data, databases, scripts

• Media refreshment every 5 years

• Future: Copy to alternative storage repository

• Future: Participation in distributed archival storage system

Preservation Improvements: Authenticity

Fixity: Individual Messages (SIPs/AIPs)

• Shorten time window for generation of hashes

• Create database of SHA-256 hashes for fixity checks

• Validate message hashes on notebook completion

Fixity: Notebook Files (AICs)

• Create SHA-256 message digests on completion of notebooks

• Calculate SHA-256 message digests for existing notebooks

• Create database of SHA-256 message digests for fixity checks

• Validate notebook hashes on weekly basis
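A minimal sketch of a notebook fixity check, assuming an SQLite table as the digest database and a temporary directory as the notebook store. Both are stand-ins, and the function names are hypothetical. The idea is to record a SHA-256 digest when a notebook is complete, then recompute and compare it on each validation pass.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def register(conn: sqlite3.Connection, path: Path) -> None:
    """Record the digest of a completed notebook."""
    conn.execute(
        "INSERT OR REPLACE INTO fixity VALUES (?, ?)",
        (path.name, sha256_of(path)),
    )


def validate(conn: sqlite3.Connection, directory: Path) -> list:
    """Return names of notebooks that are missing or whose digest changed."""
    failed = []
    rows = conn.execute("SELECT name, sha256 FROM fixity ORDER BY name").fetchall()
    for name, stored in rows:
        path = directory / name
        if not path.is_file() or sha256_of(path) != stored:
            failed.append(name)
    return failed


conn = sqlite3.connect(":memory:")  # stand-in for the digest database
conn.execute("CREATE TABLE fixity (name TEXT PRIMARY KEY, sha256 TEXT)")

with tempfile.TemporaryDirectory() as tmp:
    notebook = Path(tmp) / "h-africa.log0802a"
    notebook.write_text("first message\nsecond message\n", encoding="utf-8")
    register(conn, notebook)
    clean = validate(conn, Path(tmp))    # nothing has changed yet
    notebook.write_text("tampered\n", encoding="utf-8")
    damaged = validate(conn, Path(tmp))  # digest no longer matches

print(clean, damaged)  # [] ['h-africa.log0802a']
```

For the weekly validation described above, a scheduled job could call `validate` and report any names it returns.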

Preservation Improvements: Attachments

• Found with < 0.01% of H-Net messages
  – MS Office, PDF, image files

• Provide constructed URLs, as with public lists

• Provide download links

• No file normalization or migration plan
  – Most files should open in viewers, later versions of applications
  – MATRIX will help users if problems arise

Preservation Improvements: Digital Preservation Policies

• Documented digital preservation policies and procedures for the H-Net e-mail lists
  – http://www.h-net.org/archive/doc.php

• Based on the Digital Preservation Policy Framework developed by Nancy McGovern of ICPSR
  – Digital Preservation Management Workshop/Tutorial
  – Roadmap to developing and documenting policies
  – Wealth of examples

Preservation Improvements: Narrowing the Gap

• Lather, rinse, repeat: New TRAC assessment

• Technical improvements

• Digital preservation policies

Conclusions

• Relevant to e-mail preservation discussion

• Applicable to preservation of LISTSERV-based and other e-mail lists

• Testbed for other preservation tools and systems

• Useful foundation for digital preservation planning at Michigan State

References

• Digital Preservation Management Tutorial, http://www.icpsr.umich.edu/dpm/dpm-eng/eng_index.html

• H-Net Archives Project, http://www.h-net.org/archive/

• H-Net: Humanities and Social Sciences Online, http://www.h-net.org

• MATRIX: The Center for Humane Arts, Letters, and Social Sciences Online, http://www.matrix.msu.edu

• OAIS Reference Model, http://public.ccsds.org/publications/archive/650x0b1.pdf

• Trusted Digital Repositories: Attributes and Responsibilities, http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf

• Trustworthy Repositories Audit & Certification: Criteria and Checklist, http://www.crl.edu/PDF/trac.pdf