University Archives University Archives & Archive-It WebCom 2011-03-29.
University Archives
University Archives & Archive-It
WebCom 2011-03-29
The Duke University Archives is responsible for the collection and management of records of
enduring value created by the University's administrative offices and academic units.
The Archives also acquires records of student, faculty and staff organizations, selected
personal papers, and books, images, audio, and other documentation about Duke
University.
Archive-It Service Agreement
• $6,000 Subscription Fee
• 8,000,000 URLs
• 0.75 TB storage
• 1-2 Active Collections
• Maximum 200 Active Seeds
• Collection & Crawl Interface
• Search Portal
• All data will be copied to Internet Archive’s Wayback Machine on contract termination
Front Page
Collection Page
Page Capture Index
http://wayback.archive-it.org/1858/*/http://news.duke.edu/
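The capture index address above follows a visible pattern: the Archive-It Wayback host, a number (1858, which appears to be the collection identifier), a `*` in the position where a capture timestamp would go, and then the original page URL. A minimal sketch of building such addresses, assuming that reading of the URL is right; the function name and the `timestamp` parameter are hypothetical and not part of any Archive-It client library:

```python
WAYBACK_HOST = "http://wayback.archive-it.org"


def wayback_url(collection_id: int, page_url: str, timestamp: str = "*") -> str:
    """Build a Wayback address for a page in one collection.

    timestamp="*" asks for the index of all captures (as on the slide).
    Passing a capture timestamp instead is an assumption based on the
    usual Wayback URL layout, not something the slides state.
    """
    return f"{WAYBACK_HOST}/{collection_id}/{timestamp}/{page_url}"


if __name__ == "__main__":
    # Reproduces the address shown on the "Page Capture Index" slide.
    print(wayback_url(1858, "http://news.duke.edu/"))
```

Swapping in another seed URL gives the capture index for that page within the same collection.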
Page View
Priorities this past year…
• Institutes & Student Groups
– Have a relatively short life
– The Archives rarely receives records transfers
from these groups
• Units with existing relationships
• Opportunities as they arise
Crawl of duke.edu
• Started March 4, 2011 4:34:20 PM
• Completed March 7, 2011 5:46:52 PM
• Average Doc Rate 13.66 URLs/sec
• Average KB Rate 1,646 KB/s
• Total Documents 3,594,845
• Total Data 413.2 GB
• Duke Domains Found 1,698
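The crawl figures can be cross-checked with a little arithmetic. The sketch below recomputes the average rates from the start time, end time, and totals, and for a sense of scale compares the totals with the 8,000,000 URL and 0.75 TB limits in the service agreement. It assumes binary units (1 GB = 1024 × 1024 KB, 1 TB = 1024 GB); whether this particular crawl was counted against those limits is not stated in the slides.

```python
from datetime import datetime

# Figures taken from the crawl report.
started = datetime(2011, 3, 4, 16, 34, 20)
completed = datetime(2011, 3, 7, 17, 46, 52)
total_documents = 3_594_845
total_data_gb = 413.2

# Limits taken from the service agreement.
url_limit = 8_000_000
storage_limit_tb = 0.75

elapsed = (completed - started).total_seconds()  # 263552.0 seconds

# Average rates over the whole crawl window.
doc_rate = total_documents / elapsed
kb_rate = total_data_gb * 1024 * 1024 / elapsed  # assumes 1 GB = 1024*1024 KB

# Share of the subscription limits this one crawl would represent.
url_share = total_documents / url_limit
storage_share = total_data_gb / (storage_limit_tb * 1024)  # assumes 1 TB = 1024 GB

if __name__ == "__main__":
    print(f"elapsed:       {elapsed / 3600:.1f} hours")
    print(f"doc rate:      {doc_rate:.2f} URLs/sec")
    print(f"data rate:     {kb_rate:,.0f} KB/s")
    print(f"URL share:     {url_share:.0%}")
    print(f"storage share: {storage_share:.0%}")
```

The recomputed rates land within a fraction of a percent of the reported 13.66 URLs/sec and 1,646 KB/s; the small gap is probably down to how the crawler measures active time.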
Issues
• Capturing the “Look & Feel” of a site
• Crawler Traps (e.g. calendars)
• Junk URLs (e.g. bad CMS link generation)
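To make the crawler-trap and junk-URL issues concrete, here is a rough heuristic for flagging suspicious URLs in a crawl report. It is an illustrative sketch only, not how Archive-It or its crawler detects traps; the patterns, the depth threshold, and the `www.example.edu` host are all assumptions.

```python
import re
from urllib.parse import parse_qs, urlparse

# Hypothetical patterns: a /YYYY/MM or /YYYY/MM/DD path, or date-like query keys.
DATE_PATH = re.compile(r"/(19|20)\d{2}/\d{1,2}(/\d{1,2})?(/|$)")
DATE_PARAMS = {"year", "month", "day", "date"}
MAX_DEPTH = 12  # arbitrary cut-off for "suspiciously deep" paths


def trap_reasons(url: str) -> list[str]:
    """Return the reasons a URL looks like a crawler trap or junk link."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    reasons = []
    # Calendars can generate a page for every day of every year.
    if DATE_PATH.search(parts.path) or DATE_PARAMS & set(parse_qs(parts.query)):
        reasons.append("calendar-style date")
    # Bad relative links in a CMS often repeat the same segment over and over.
    if any(segments.count(s) >= 3 for s in set(segments)):
        reasons.append("repeated path segment")
    if len(segments) > MAX_DEPTH:
        reasons.append("very deep path")
    return reasons


if __name__ == "__main__":
    for candidate in (
        "http://www.example.edu/calendar/2011/03/29/",
        "http://www.example.edu/a/b/a/b/a/b/page.html",
        "http://www.example.edu/news/policy.html",
    ):
        print(candidate, trap_reasons(candidate))
```

A check like this will also flag legitimate date-based news archives, so it is better suited to producing a list for human review than to excluding URLs automatically.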
Robots Exclusions
We do want:
• Look & Feel
– JavaScript
– CSS
– Images
• Policy
• Publications
• Events
• RSS
We (usually) don’t want:
• Every day of every year
• Your taxonomies
• Administrative pages
• Maps/GIS
• “Personal” pages
User-agent: archive.org_bot
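The slide shows only the `User-agent` line, so the sketch below fills in a hypothetical set of `Disallow` rules matching the "usually don't want" list and checks them with Python's standard `urllib.robotparser`. The paths (`/calendar/`, `/admin/`, `/maps/`) and the `www.example.edu` host are made up for illustration; a real site would use its own paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only the User-agent line comes from the slide.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /calendar/
Disallow: /admin/
Disallow: /maps/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def archive_bot_may_fetch(url: str) -> bool:
    """True if the rules above let archive.org_bot fetch the URL."""
    return parser.can_fetch("archive.org_bot", url)


if __name__ == "__main__":
    for candidate in (
        "http://www.example.edu/calendar/2011/03/29",  # blocked: every day of every year
        "http://www.example.edu/admin/login",          # blocked: administrative pages
        "http://www.example.edu/css/site.css",         # allowed: look & feel
        "http://www.example.edu/policy/index.html",    # allowed: policy
    ):
        print(candidate, archive_bot_may_fetch(candidate))
```

Because the rules are scoped to `archive.org_bot`, other crawlers are unaffected. Keeping CSS, JavaScript, and image paths out of the `Disallow` list is what lets the archived copy keep the site's look and feel.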
Contact me:
Seth Shaw
Electronic Records Archivist
Duke University Archives
684.6181