Preserving email: The PeDALS approach
![Page 1: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/1.jpg)
The PeDALS approach
![Page 2: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/2.jpg)
Pete Watters, Arizona State Library, project [email protected]
Richard Pearce-Moses, Clayton State University, Georgia, principal [email protected]
Brian Schnackel, Arizona State Library, lead [email protected]
![Page 3: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/3.jpg)
PeDALS strives for OAIS compliance
Archivists focus on process, not individual records
Business rules…
• generate normalized metadata
• transform SIPs into standardized AIPs
• create DIPs for each record
![Page 4: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/4.jpg)
Suited to the PeDALS methodology
Born digital
Potential for historical value
Message transmission information provides a rich source of metadata
All partners had Outlook PST files
![Page 5: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/5.jpg)
Atomize individual messages
• To store as individual AIPs
• To disseminate as browser-friendly DIPs
Create a database of rich metadata
• From the process: to support administration
• From the email headers: to support discovery
• From BagIt, New Zealand Metadata Extractor, other sources: to support preservation
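The header-to-database step can be sketched as follows. This is a minimal illustration, not the PeDALS implementation: it assumes each atomized message is available as RFC 5322 text, uses Python's standard `email` and `sqlite3` modules, and the table and column names (`message`, `recipient`) and the sample message are hypothetical.

```python
import sqlite3
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message; names and addresses are invented.
SAMPLE = """\
From: "Doe, Jane" <jane.doe@example.org>
To: John Roe <john.roe@example.org>, records@example.org
Subject: Budget meeting
Date: Mon, 01 Mar 2010 09:30:00 -0700
Message-ID: <abc123@example.org>

See attached agenda.
"""

def header_metadata(raw):
    """Pull discovery metadata out of one message's headers."""
    msg = message_from_string(raw)
    sender = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return {
        "message_id": msg["Message-ID"],
        "sent": parsedate_to_datetime(msg["Date"]).isoformat(),
        "subject": msg["Subject"] or "",
        "sender": sender[0][1] if sender else "",
        "recipients": [address for _name, address in recipients],
    }

def store(conn, meta):
    """Write one message row plus one row per recipient (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS message "
        "(message_id TEXT PRIMARY KEY, sent TEXT, subject TEXT, sender TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recipient (message_id TEXT, address TEXT)"
    )
    conn.execute(
        "INSERT INTO message VALUES (?, ?, ?, ?)",
        (meta["message_id"], meta["sent"], meta["subject"], meta["sender"]),
    )
    conn.executemany(
        "INSERT INTO recipient VALUES (?, ?)",
        [(meta["message_id"], address) for address in meta["recipients"]],
    )

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store(connection, header_metadata(SAMPLE))
    print(connection.execute("SELECT * FROM message").fetchall())
```

Keeping recipients in their own table means a query such as "all messages sent to this address" does not require parsing a delimited field.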
![Page 6: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/6.jpg)
PeDALS is intended for permanent records
PeDALS is not a records management system
Deleting files is difficult at best
![Page 7: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/7.jpg)
When negotiating with the originating office, archivists encourage weeding PSTs of non-permanent records
Archivists work with rules rather than records – they don’t have time to weed the collections
If you give us junk, we’ll archive junk.
PSTs plucked from hard drives can work, but are more likely to generate errors during processing.
![Page 8: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/8.jpg)
Metadata taken from headers was surprisingly messy
One response is to learn to cope with a complete lack of authority control
Or possibly correct by “data wrangling” from within the database
![Page 9: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/9.jpg)
Senders and recipients can be an email address or display name from one or more contact lists
“Janet Napolitano” or “[email protected]” or “[email protected]” or “Napolitano, Janet” or “Janet” or “J Napolitano”?
Subject line is not a reliable source for titles or abstracts – often blank, repetitive, or a remnant from an unrelated message
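One way to picture "data wrangling" of sender and recipient values is a crude matching key that groups obvious variants. The sketch below is an assumption about how such cleanup might start, not the project's method. The function names are hypothetical, and the rules (lowercasing, flipping "Last, First") are deliberately simple.

```python
from collections import defaultdict
from email.utils import parseaddr

def name_key(raw):
    """Return a crude matching key for a sender or recipient string."""
    text = raw.strip()
    if "@" in text:
        # Treat anything containing an address as that address, lowercased.
        return ("address", parseaddr(text)[1].lower())
    text = text.strip('"').strip()
    if "," in text:
        # Flip "Last, First" into "First Last".
        last, _, first = text.partition(",")
        text = f"{first.strip()} {last.strip()}"
    return ("name", " ".join(text.lower().split()))

def group_variants(values):
    """Group raw header values that share the same matching key."""
    groups = defaultdict(set)
    for value in values:
        groups[name_key(value)].add(value)
    return dict(groups)

if __name__ == "__main__":
    seen = ["Janet Napolitano", "Napolitano, Janet ", "Janet", "J Napolitano"]
    for key, variants in group_variants(seen).items():
        print(key, sorted(variants))
```

This merges “Janet Napolitano” with “Napolitano, Janet” but leaves “Janet” and “J Napolitano” as separate keys. Resolving those still needs human judgment or an authority table, which is the gap the slide describes.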
![Page 10: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/10.jpg)
Email (and other records) may be open to the public by statute, but some content may be sensitive
• Personally identifying information
• Private information (intimate, of no public interest)
Repositories must develop procedures and policies for aggregates that may have some records with sensitive information
![Page 11: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/11.jpg)
Boucher/Stearns draft legislation for online privacy would require “notice to and consent of an individual prior to the collection and disclosure of certain personal information” such as street and email addresses, phone numbers, aliases, and other common information.
Excludes government agencies, but may include academic libraries.
Possible chilling effect on archives: Keeping such information confidential would effectively block access to email and many other records
![Page 12: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/12.jpg)
PST file structure was proprietary
Considered third-party Outlook plug-ins
• Smithsonian Institution had done research
http://siarchives.si.edu/cerp/RAC_SIA_CERP_tools_V2_CC.pdf
Adopted open-source PST export utility
• No longer supported
• Written in Visual Basic
http://www.genusa.com/utils/pmseu.htm
![Page 13: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/13.jpg)
Could generate human-readable XML of email messages
Was based on code open to public
Did not require understanding of PST structure
![Page 14: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/14.jpg)
It’s more than just email
What to do with tasks, calendar items, contacts?
Need to give the archivist the ability to decide what to keep
What about viruses, corrupt attachments?
![Page 15: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/15.jpg)
What is the record? What are we authenticating?
PST as database; messages are constructs of fields in tables tied together by keys and other tables
XML is best way to preserve these relations and dependencies
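To illustrate how XML nesting can carry the relationships that keys carry inside a PST, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`message`, `recipient`, `attachment`) and the input dictionary are hypothetical. They do not reproduce the schema PeDALS or the export utility actually used.

```python
import xml.etree.ElementTree as ET

def message_to_xml(message):
    """Serialize one message, with its recipients and attachments, as XML."""
    root = ET.Element("message", id=message["id"])
    ET.SubElement(root, "subject").text = message["subject"]
    ET.SubElement(root, "sender").text = message["sender"]

    # Recipients and attachments are nested under the message they belong
    # to, so the dependency is explicit without separate key tables.
    recipients = ET.SubElement(root, "recipients")
    for kind, address in message["recipients"]:
        ET.SubElement(recipients, "recipient", type=kind).text = address

    attachments = ET.SubElement(root, "attachments")
    for item in message["attachments"]:
        ET.SubElement(
            attachments, "attachment",
            filename=item["filename"], size=str(item["size"]),
        )

    ET.SubElement(root, "body").text = message["body"]
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    example = {
        "id": "abc123",
        "subject": "Budget meeting",
        "sender": "jane.doe@example.org",
        "recipients": [("to", "john.roe@example.org"), ("cc", "records@example.org")],
        "attachments": [{"filename": "agenda.doc", "size": 24576}],
        "body": "See attached agenda.",
    }
    print(message_to_xml(example))
```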
![Page 16: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/16.jpg)
Did not use the full record
Had almost no way to handle errors
Tended to break when dealing with large PST files that had not been curated
Required a copy of Outlook
Ran very slowly
![Page 17: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/17.jpg)
In late February, Microsoft released the PST specification
http://msdn.microsoft.com/en-us/library/ff385210.aspx
203 pages of techspeak with some errors and inaccuracies
Based on the spec, we’ve been developing a file-based tool that doesn’t require Outlook.
![Page 18: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/18.jpg)
Generates XML from the entire PST file
Much improved exception handling
Does not require Outlook
Runs much more quickly
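The general idea behind better exception handling in a batch converter can be sketched as below: isolate each item so that one bad message or attachment is recorded rather than halting the run. This is an illustrative pattern only, with hypothetical names, and it makes no claim about how the PeDALS tool is structured.

```python
def process_all(items, convert):
    """Convert every item; collect failures instead of stopping at the first."""
    results, failures = [], []
    for index, item in enumerate(items):
        try:
            results.append(convert(item))
        except Exception as exc:  # broad on purpose: record and keep going
            failures.append((index, repr(exc)))
    return results, failures

if __name__ == "__main__":
    def to_upper(text):
        # Stand-in converter; raises on items that are not strings.
        return text.upper()

    done, failed = process_all(["first", None, "third"], to_upper)
    print(done)
    print(failed)
```

The failure list can then be written to the administrative metadata so an archivist can review which items did not convert.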
![Page 19: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/19.jpg)
File-based processor was slow to develop because of some errors in Microsoft’s documentation.
Test on as many PST samples as possible. Don’t rely on small curated samples.
Discovered differences between Unicode PST files and earlier ANSI-encoded files.
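A file-based tool has to decide early which of the two layouts it is reading. The sketch below checks the header signature and version field. The offsets and values (the `!BDN` signature at byte 0, a two-byte little-endian version at byte 10, 14 or 15 for ANSI, 23 and above for Unicode) reflect my reading of Microsoft's published format document and should be verified against it. The function name is hypothetical.

```python
import struct

PST_MAGIC = b"!BDN"      # signature expected at the start of the file
VERSION_OFFSET = 10      # assumed position of the 2-byte version field

def pst_format(header):
    """Classify a PST header as ANSI or Unicode from its version field."""
    if len(header) < VERSION_OFFSET + 2 or header[:4] != PST_MAGIC:
        raise ValueError("not a PST file: missing !BDN signature")
    (version,) = struct.unpack_from("<H", header, VERSION_OFFSET)
    if version in (14, 15):
        return "ANSI"
    if version >= 23:
        return "Unicode"
    return f"unknown (version={version})"

def pst_format_of_file(path):
    """Read only the first bytes of a file and classify it."""
    with open(path, "rb") as handle:
        return pst_format(handle.read(VERSION_OFFSET + 2))
```

Because the two layouts use different field widths throughout the file, branching on this value up front avoids misreading every later structure.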
![Page 20: Preserving email: The PeDALS approach](https://reader033.fdocuments.in/reader033/viewer/2022051816/5462e6b3b4af9f4e1c8b497d/html5/thumbnails/20.jpg)
PSTs are not created automatically in Outlook 2010
But they can be generated manually and can remain part of a scheduled retention routine