Farl web archiving
A survey of web-based art resources with findings applicable to FARL electronic records collection development
Alison Rhonemus, LIS 698, Seminar and Practicum, Dr. Tula Giannini
Frick Art Reference Library
Deborah Kempe, Chief, Collections Management & Access
Web Survey and Collection Development
Coffee on the terrace
M-LEAD-TWO
Intern enterprises - "collection assessments, digital resource surveys, web archiving, provide support for important consortial programs such as shared resources"
● Brooklyn Museum: Mark Daly, Ronnette Hope, Project Manager: Emily Atwater
● NYARC Latin American Resources (MOMA): Ralph Baylor
● FARL: Gretchen Nadasky, Alison Rhonemus
Frick Art Reference Library
In early 2011, the Frick Art Reference Library and the Thomas J. Watson Library at The Metropolitan Museum of Art completed a pilot project to address coordinated collecting of born-digital auction catalogs using ContentDM and Archive-It.
The FARL web archiving program is situated in Collection Development. Current plans for website capture include online auction catalogs and art web resources cataloged by NYARC.
Fellow M-LEAD-TWO intern Gretchen Nadasky has just described online auction catalogs. My project focused on NYARC-cataloged websites.
Web Archiving
"The Internet Archive is already doing it."
Actually, the IA is providing the tools for other institutions to use in archiving.
Archive-It uses open source tools developed by the Internet Archive
● Heritrix Web Crawler
● Wayback Interface
● WARC format, an ISO standard
the report and manual checks
Partner and WAYBACK interface
Quality Assurance
• Password-protected sites – cannot be archived
• JavaScript – more complicated implementations can be difficult to capture and display. Ongoing area of development.
• Videos – difficulty with some proprietary formats
• Form- and database-driven content – may be archived using a sitemap or other direct links to the content.
Evaluating seeds
Robots.txt Blocks
The crawler by default respects all robots.txt files. Check post-crawl reports for blocked seeds or documents.
If your site is blocked:
a) Contact the site owner and ask if they will unblock it
b) Ask your Partner Specialist to turn on the "ignore robots" feature in your account
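A robots.txt block can also be spotted before a crawl is run. The following is a minimal sketch, not part of the Archive-It tooling: it uses Python's standard `urllib.robotparser` to test candidate seeds against the text of a site's robots.txt file. The function name `blocked_seeds`, the sample rules, and the example.org URLs are hypothetical, and the user-agent token `archive.org_bot` is an assumption that should be checked against the crawler's current documentation.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the crawler announces itself with this user-agent token.
CRAWLER_AGENT = "archive.org_bot"


def blocked_seeds(robots_txt, seeds, agent=CRAWLER_AGENT):
    """Return the seeds that the given robots.txt text disallows for `agent`."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [seed for seed in seeds if not parser.can_fetch(agent, seed)]


# Hypothetical robots.txt content and seed list, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

SAMPLE_SEEDS = [
    "https://example.org/catalog/",
    "https://example.org/private/sale1.pdf",
]

if __name__ == "__main__":
    for seed in blocked_seeds(SAMPLE_ROBOTS, SAMPLE_SEEDS):
        print("blocked by robots.txt:", seed)
```

A seed flagged this way is a candidate for option a) or b) above; the post-crawl report remains the authoritative record of what was actually blocked.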
Notes:
/ denotes single directory seed
subdomains.archive.org (add individually or expand seed)
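The two notes above can be illustrated with a simplified scope check. This is a sketch under stated assumptions, not the crawler's actual scoping logic: it treats a seed ending in "/" as covering only that directory, and it treats a subdomain as out of scope unless it is added as its own seed. The function name `in_scope` and the example.org URLs are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(seed, url):
    """Simplified check: is `url` on the seed's host and under the seed's directory?"""
    s, u = urlsplit(seed), urlsplit(url)
    # A subdomain is a different host, so it needs its own seed in this model.
    if s.hostname != u.hostname:
        return False
    seed_path = s.path or "/"
    # Keep everything up to and including the last "/" as the directory prefix.
    prefix = seed_path[: seed_path.rfind("/") + 1]
    return (u.path or "/").startswith(prefix)


if __name__ == "__main__":
    seed = "http://example.org/art/"
    for candidate in (
        "http://example.org/art/page.html",
        "http://example.org/news/",
        "http://images.example.org/art/a.jpg",
    ):
        print(candidate, in_scope(seed, candidate))
```

In this model the image host `images.example.org` falls outside the seed, which is why subdomains have to be added individually or the seed expanded.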
Site Survey Criteria
● html/flash/pdf
● images
● embedded material
● links
● directories and subdomains
● terms, rights statements and permissions
Obvious ruse
More of the obvious
Sites created without the intention of being archived are the sites in need of
archiving.
Survey Says
● 257 cataloged entries
● 168 resources are possible to capture
● 82 resources would require more research or display definite red flags for web archiving.
● PDFs are available for at least some of the content in 75 resources.
● Flash was an element in 23 resources
● 16 sites used HTML5
● 54 used a CMS like Drupal or WordPress
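To put the tallies in proportion, each count can be expressed as a share of the 257 cataloged entries. The short sketch below does only that arithmetic; the counts are the ones reported above, while the labels and the helper name `share` are illustrative. The categories overlap, so the shares are not expected to sum to 100.

```python
TOTAL_ENTRIES = 257

# Counts as reported in the survey; the labels are shortened for display.
TALLIES = {
    "possible to capture": 168,
    "more research / red flags": 82,
    "PDFs for some content": 75,
    "Flash element": 23,
    "HTML5": 16,
    "CMS (Drupal, WordPress, ...)": 54,
}


def share(count, total=TOTAL_ENTRIES):
    """Return `count` as a percentage of `total`, rounded to one decimal place."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    for label, count in TALLIES.items():
        print(f"{label}: {count} of {TOTAL_ENTRIES} ({share(count)}%)")
```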
There were 3 cataloged resources no longer available on the live web but viewable through Internet Archive. Another 2 defunct resources were not available through Internet Archive. The main page for one of these lost resources was available as a snapshot in WAYBACK but the actual cataloged resource was not available.
Change is Constant
Archive-It Updates:
● Heritrix 1 series to Heritrix 3 series (February)
● Archive-It 4.8 (May)
Archive-It 4.8
Plans
● Upcoming grants
● Capture of NYARC institution websites
● Include Wayback interface links in Arcade catalog records
● Continue to identify websites for capture and implement capture
Conclusions
○ Digital resources not prevalent enough to reassign current staff
○ Website capture most costly in terms of staff time
○ Copyright continues to be an issue
○ Long term digital preservation needs yet to be assessed
○ Capture of Frick Collection and NYARC sites will pose a challenging test case