FARL web archiving

A survey of web-based art resources with findings applicable to FARL electronic records collection development

Alison Rhonemus, LIS 698, Seminar and Practicum, Dr. Tula Giannini

Frick Art Reference Library

Deborah Kempe, Chief, Collections Management & Access

Web Survey and Collection Development

Coffee on the terrace

M-LEAD-TWO

Intern enterprises - "collection assessments, digital resource surveys, web archiving, provide support for important consortial programs such as shared resources"

● Brooklyn Museum: Mark Daly, Ronnette Hope, Project Manager: Emily Atwater

● NYARC Latin American Resources (MOMA): Ralph Baylor

● FARL: Gretchen Nadasky, Alison Rhonemus

Frick Art Reference Library

In early 2011, the Frick Art Reference Library and the Thomas J. Watson Library at The Metropolitan Museum of Art completed a pilot project to address coordinated collecting of born-digital auction catalogs using ContentDM and Archive-It.

http://www.contentdm.org/

https://archive-it.org/

The FARL web archiving program is situated in Collection Development. Current plans for website capture include online auction catalogs and art web resources cataloged by NYARC.

Fellow MLEAD-TWO intern Gretchen Nadasky has just described online auction catalogs. My project focused on NYARC cataloged websites.

Web Archiving

"The Internet Archive is already doing it."

Actually, the IA is providing the tools for other institutions to use in archiving.

Archive-It uses open source tools developed by the Internet Archive:

● Heritrix Web Crawler

● Wayback Interface

● WARC format, an ISO standard

the report and manual checks

Partner and WAYBACK interface

Quality Assurance

https://partner.archive-it.org/archiveit/partner/home.html?accountId=484&cid=194666

• Password-protected sites – cannot be archived

• JavaScript – more complicated implementations can be difficult to capture and display. This is an ongoing area of development.

• Videos – difficulty with some proprietary formats

• Form- and database-driven content – may be archived using a sitemap or other direct links to the content.

Evaluating seeds

Robots.txt Blocks

The crawler by default respects all robots.txt files. Check post-crawl reports for blocked seeds or documents.

If your site is blocked:

a) Contact the site owner and ask if they will unblock it

b) Ask your Partner Specialist to turn on the "ignore robots" feature in your account
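A robots.txt block can also be spotted before a crawl is run. The following is a minimal sketch, not part of Archive-It itself, that uses Python's standard `urllib.robotparser` to test candidate seeds against a site's robots.txt rules. The function name `blocked_seeds`, the example URLs, and the user agent string `archive.org_bot` are assumptions made for illustration; confirm the crawler's actual user agent with your Partner Specialist.

```python
from urllib.robotparser import RobotFileParser


def blocked_seeds(robots_txt, seeds, user_agent="archive.org_bot"):
    """Return the seeds that the given robots.txt text disallows.

    robots_txt is the text of the site's robots.txt file.
    user_agent is an assumed crawler name, used here for illustration only.
    """
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [seed for seed in seeds if not parser.can_fetch(user_agent, seed)]


if __name__ == "__main__":
    # Hypothetical robots.txt that blocks one directory for all crawlers.
    sample_rules = "User-agent: *\nDisallow: /private/\n"
    candidate_seeds = [
        "http://example.org/",
        "http://example.org/private/catalog.html",
    ]
    for seed in blocked_seeds(sample_rules, candidate_seeds):
        print("Blocked by robots.txt:", seed)
```

Seeds reported as blocked are the ones to raise with the site owner or the Partner Specialist, as described above.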

Notes:

/ denotes single directory seed

subdomains.archive.org (add individually or expand seed)
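The two notes above can be illustrated with a simplified scope check. This is a sketch under stated assumptions, not the crawler's actual scoping logic: it assumes a seed ending in "/" limits the crawl to that directory on the same host, that a seed with no path covers the whole host, and that a subdomain counts as a different host. The function name `in_scope` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(seed, url):
    """Simplified test of whether url falls under seed (assumed rules)."""
    seed_parts, url_parts = urlsplit(seed), urlsplit(url)
    # A subdomain is a different host, so it must be added as its own seed.
    if seed_parts.hostname != url_parts.hostname:
        return False
    seed_path = seed_parts.path or "/"
    url_path = url_parts.path or "/"
    # A trailing slash marks a directory seed: only that directory is in scope.
    if seed_path.endswith("/"):
        return url_path.startswith(seed_path)
    # Assumption: a seed without a trailing slash is treated as the whole host.
    return True


if __name__ == "__main__":
    seed = "http://example.org/exhibits/"
    print(in_scope(seed, "http://example.org/exhibits/2013/a.html"))
    print(in_scope(seed, "http://example.org/about.html"))
    print(in_scope(seed, "http://blog.example.org/exhibits/"))
```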

Site Survey Criteria

● html/flash/pdf

● images

● embedded material

● links

● directories and subdomains

● terms, rights statements and permissions

Obvious ruse

More of the obvious

Sites created without the intention of being archived are the sites in need of archiving.

Survey Says

● 257 cataloged entries

● 168 resources are possible to capture

● 82 resources would require more research or display definite red flags for web archiving.

● PDFs are available for at least some of the content in 75 resources.

● Flash was an element in 23 resources

● 16 sites used HTML5

● 54 used a CMS like Drupal or WordPress

There were 3 cataloged resources no longer available on the live web but viewable through Internet Archive. Another 2 defunct resources were not available through Internet Archive. The main page for one of these lost resources was available as a snapshot in WAYBACK but the actual cataloged resource was not available.
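Checks like these can be partly automated with the Internet Archive's Wayback Availability API, which returns JSON describing the closest snapshot of a URL. The sketch below only builds the query URL and reads a response; it does not perform the network request, and the survey itself may have been done by hand. The function names and the sample response are illustrative assumptions.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url):
    """Build the availability query URL for a cataloged resource."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": url})


def closest_snapshot(response_text):
    """Return the closest snapshot URL from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


if __name__ == "__main__":
    print(availability_query("example.org"))
    # Hypothetical response text, shaped like the API's JSON output.
    sample = json.dumps({
        "archived_snapshots": {
            "closest": {
                "available": True,
                "url": "http://web.archive.org/web/20130919044612/http://example.org/",
                "timestamp": "20130919044612",
                "status": "200",
            }
        }
    })
    print(closest_snapshot(sample))
```

A `None` result for a cataloged URL corresponds to the lost resources described above. A snapshot of the main page does not guarantee that the cataloged resource itself was captured, so each resource URL should be checked directly.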

Change is Constant

Archive-It Updates:

● Heritrix 1 series to Heritrix 3 series (February)

● Archive-It 4.8 (May)

Archive-It 4.8

Plans

● Upcoming grants

● Capture of NYARC institution websites

● Include Wayback interface links in Arcade catalog records

● Continue to identify websites for capture and implement capture

Conclusions

○ Digital resources not prevalent enough to reassign current staff

○ Website capture most costly in terms of staff time

○ Copyright continues to be an issue

○ Long term digital preservation needs yet to be assessed

○ Capture of Frick Collection and NYARC sites will pose a challenging test case
