University Archives

University Archives & Archive-It

WebCom 2011-03-29

The Duke University Archives is responsible for the collection and management of records of enduring value created by the University's administrative offices and academic units. The Archives also acquires records of student, faculty, and staff organizations, selected personal papers, and books, images, audio, and other documentation about Duke University.

Archive-It Service Agreement

• $6,000 Subscription Fee
• 8,000,000 URLs
• 0.75 TB storage
• 1-2 Active Collections
• Maximum 200 Active Seeds
• Collection & Crawl Interface
• Search Portal
• All data will be copied to Internet Archive’s Wayback Machine on contract termination

Front Page

Collection Page

Page Capture Index

http://wayback.archive-it.org/1858/*/http://news.duke.edu/
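The capture index address above appears to follow the pattern base URL, collection number, `*`, then the page being looked up. A minimal sketch of building such an address, assuming that pattern holds for other pages in the same collection (the function name `capture_index_url` is made up for illustration and is not part of any Archive-It tool):

```python
# Base address and collection number are taken from the slide's example URL.
WAYBACK_BASE = "http://wayback.archive-it.org"
DUKE_COLLECTION_ID = 1858


def capture_index_url(collection_id: int, page_url: str) -> str:
    """Build the address that lists all captures of page_url in a collection.

    Assumption: the "*" path segment asks for the full capture index,
    as in the example shown on the slide.
    """
    return f"{WAYBACK_BASE}/{collection_id}/*/{page_url}"


print(capture_index_url(DUKE_COLLECTION_ID, "http://news.duke.edu/"))
```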

Page View

Priorities this past year…

• Institutes & Student Groups
  – Have a relatively short life
  – The Archives rarely receives records transfers from these groups
• Units with existing relationships
• Opportunities as they arise

Crawl of duke.edu

• Started March 4, 2011 4:34:20 PM
• Completed March 7, 2011 5:46:52 PM
• Average Doc Rate 13.66 URLs/sec
• Average KB Rate 1,646 KB/s
• Total Documents 3,594,845
• Total Data 413.2 GB
• Duke Domains Found 1,698
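The crawl figures can be checked against each other and against the subscription limits with some quick arithmetic. The sketch below recomputes the average rates from the start and end times and estimates how much of the URL and storage allowance this one crawl used. It assumes binary units (1 GB = 1024 × 1024 KB, 1 TB = 1024 GB) and that the averages were taken over wall-clock time. Archive-It may calculate them differently, so small differences from the reported rates are expected.

```python
from datetime import datetime

# Figures as reported on the crawl slide.
started = datetime(2011, 3, 4, 16, 34, 20)
completed = datetime(2011, 3, 7, 17, 46, 52)
total_documents = 3_594_845
total_data_gb = 413.2

# Limits from the service agreement slide.
url_budget = 8_000_000
storage_budget_tb = 0.75

elapsed_seconds = (completed - started).total_seconds()

# Average rates over the whole crawl (wall-clock time is an assumption).
docs_per_second = total_documents / elapsed_seconds
kb_per_second = total_data_gb * 1024 * 1024 / elapsed_seconds

# Share of the annual allowance consumed by this single crawl.
url_share = total_documents / url_budget
storage_share = total_data_gb / (storage_budget_tb * 1024)

print(f"Elapsed: {elapsed_seconds / 3600:.1f} hours")
print(f"Average doc rate: {docs_per_second:.2f} URLs/sec")
print(f"Average KB rate: {kb_per_second:,.0f} KB/s")
print(f"URL allowance used: {url_share:.1%}")
print(f"Storage allowance used: {storage_share:.1%}")
```

Under these assumptions the recomputed rates land close to the reported averages, and the single crawl of duke.edu accounts for roughly half of both the URL and storage allowances.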

Issues

• Capturing the “Look & Feel” of a site
• Crawler Traps (e.g. calendars)
• Junk URLs (e.g. bad CMS link generation)
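One way to spot the last two issues in a crawl report is to flag suspicious URLs for human review. The sketch below is a hypothetical heuristic and not something Archive-It provides. The function name, the date pattern, and the repetition threshold are all assumptions chosen for illustration. It flags calendar pages that carry a date (which can generate a page for every day of every year) and paths where the same segment repeats several times (a common symptom of bad relative links from a CMS).

```python
import re
from urllib.parse import urlparse

# Hypothetical heuristics; the pattern and threshold are illustrative only.
YEAR_MONTH = re.compile(r"(?:19|20)\d{2}[-/](?:0?[1-9]|1[0-2])")
MAX_SEGMENT_REPEATS = 2


def looks_like_trap(url: str) -> bool:
    """Return True if the URL should be reviewed as a possible trap or junk link."""
    parsed = urlparse(url)
    path = parsed.path.lower()

    # Calendar-style trap: a calendar path combined with a date
    # in either the path or the query string.
    if "calendar" in path and YEAR_MONTH.search(path + "?" + parsed.query):
        return True

    # Junk link: the same path segment repeated more often than the threshold.
    segments = [segment for segment in path.split("/") if segment]
    return any(segments.count(segment) > MAX_SEGMENT_REPEATS for segment in set(segments))


for candidate in [
    "http://example.edu/calendar/2011/03/29",
    "http://example.edu/a/b/a/b/a/b/page.html",
    "http://example.edu/news/2011/03/story.html",
]:
    print(candidate, looks_like_trap(candidate))
```

A heuristic like this only produces candidates. Dated news stories or event listings that the Archives does want would still need a person to confirm before anything is excluded.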

Robots Exclusions

We do want:
• Look & Feel
  – JavaScript
  – CSS
  – Images
• Policy
• Publications
• Events
• RSS

We (usually) don’t want:
• Every day of every year
• Your taxonomies
• Administrative pages
• Maps/GIS
• “Personal” pages

User-agent: archive.org_bot
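A site owner can preview how a set of robots.txt rules would treat the Archive-It crawler before publishing them. The sketch below uses Python's standard `urllib.robotparser` module with a made-up robots.txt. The paths (`/calendar/`, `/admin/`, `/maps/`) and the `example.edu` host are assumptions for illustration, not Duke's actual rules. Note that Python's parser does simple prefix matching, so wildcard patterns that some crawlers honor are not modeled here.

```python
from urllib.robotparser import RobotFileParser

# User agent named on the slide.
ARCHIVE_IT_AGENT = "archive.org_bot"

# Hypothetical rules: block the usual unwanted areas, leave everything else
# (including CSS, JavaScript, and images) open to the crawler.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /calendar/
Disallow: /admin/
Disallow: /maps/
"""


def crawl_decisions(robots_txt: str, urls: list[str]) -> dict[str, bool]:
    """Map each URL to True if the Archive-It crawler may fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(ARCHIVE_IT_AGENT, url) for url in urls}


decisions = crawl_decisions(
    SAMPLE_ROBOTS_TXT,
    [
        "http://example.edu/css/site.css",
        "http://example.edu/js/menu.js",
        "http://example.edu/calendar/2011/03/29",
        "http://example.edu/admin/login",
    ],
)
for url, allowed in decisions.items():
    print("allow" if allowed else "block", url)
```

Keeping stylesheets, scripts, and images open to `archive.org_bot` is what lets a capture preserve a site's look and feel, while the blocked paths correspond to the "usually don't want" list.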

Contact me:

Seth Shaw

Electronic Records Archivist

Duke University Archives

684.6181

seth.shaw@duke.edu