-
Columbia University Libraries & Information Services
Making the Ephemeral Endure:
Collecting the Web in Research Libraries
Association of College and Research Libraries / Annual Conference / Philadelphia, PA / April 01, 2011
Hashtag #webarchives
-
COLLECTING WEB RESOURCES:
OVERVIEW
• Why
• Who
• Columbia context
• How
• Some Issues
• Questions
-
WHY IT MATTERS…
-
THE LIVE SITE TODAY
-
WHY IT MATTERS
• “Joaquim Chissano Appointed UN Special Envoy for LRA-Affected Region 2007,” Uganda-CAN [article on-line]; available from http://www.ugandacan.org/item/1846; accessed 17 May 2007
-- citation from Human Rights Review, March 2009
-
THE LIVE SITE TODAY
The domain ugandacan.org may be for sale. Click here for details.
Welcome to ugandacan.org
Related Searches
• Mountain Gorilla
• Cheap Air
• Uganda
• Volunteering
• Uganda Travel
• Hotel Deals
• Discount Airfare
• Luxury Car Rental
• Uganda Tours
• Cruise Vacation
• Vacation Package Deal
RELATED SEARCHES:
• Car Rental
• Travel Insurance
• Cheap Airfare
• Family Vacation Deals
-
WHY COLLECT WEB RESOURCES
• Libraries build research collections by
selecting, acquiring, describing,
organizing, managing, and preserving
relevant resources
• Libraries have stable models for collecting
non-digital print resources: the roles of
selectors, acquisition departments,
catalogers, and preservation units are
well understood
-
WHY COLLECT WEB RESOURCES
For commercial digital resources a different
model has emerged:
• resource bundling
• licensed access rather than physical
receipt
• vendor-supplied cataloging
• collective preservation efforts (LOCKSS,
Portico)
– Libraries’ financial investment in these
resources has ensured that they are managed
-
WHY COLLECT WEB RESOURCES
What about non-commercial web resources?
• Many have high research value
• May supplement or replace existing print resources
But as yet we have no common model for:
• Identifying relevant resources
• Integrating access with other collections
• Securing permissions for harvesting
• Preservation
-
A LOT OF CONTENT
-
MUCH OF IT IS NOT COLLECTED
Refugees International
• 40 documents on web site
• 0 in Columbia collections
• 10 listed in OCLC
• 1 held by more than 2 libraries
• No library holds more than 3
-
WHO: SOME KEY PROGRAMS
International Internet Preservation Consortium (IIPC)
• Members include over 30 international libraries and the Internet Archive.
• http://netpreserve.org
Archive-It (Internet Archive)
• Over 100 institutions using Archive-It software. Includes universities, schools, state libraries, museums …
• http://www.archive-it.org
Web Archiving Service (California Digital Library)
• 16 partner institutions
• http://webarchives.cdlib.org
Independent Initiatives
• Commercial web archiving services (Hanzo, Iterasi)
• National institutions (libraries, archives)
-
MELLON PROJECT ON WEB RESOURCES
COLLECTION PROGRAM DEVELOPMENT
Collection Building
Make non-commercial web resources of scholarly value an
integral part of Columbia’s collection
building
Workflow
Move web resource collection from a
project-based activity to part of routine
workflow
Collaboration
Develop complementary and
collaborative approaches with other research institutions
-
TERMINOLOGY
Project Program
Web archiving
Collecting web resources
-
CREDITS: COLUMBIA TEAM
• Bob Wolven (Associate University Librarian for Bibliographic Services and Collection Development)
• Stephen Davis (Director, Columbia Libraries Digital Program)
• Pamela Graham (Director, Area Studies and CHRDR)
• Kate Harcourt (Director, Original and Special Materials Cataloging)
• Alex Thurman (Web Collection Curator)
• Tessa Fallon (Web Collection Curator)
-
GENERAL WEB ARCHIVING WORKFLOW
• Selection
• Permissions
• Crawling
• Quality Review
• Description
• Access
-
REQUIREMENTS
• Crawler: tool(s) for capturing websites
• Access/Rendering: mechanism for viewing captured websites
• Storage: for data collected and created by the crawler
-
Selection
• Event based
– “Arab Spring” websites 2011
– Japan earthquake 2011
• Thematic
– Human Rights
– Blogging in Iran
• Institutional
– Avery Library, Historic Preservation
– Burke Library, NYC Religions
• Domain
– Top level domains such as .uk
-
CUL EXPERIENCE
Selection models in use at Columbia:
• Subject specialists
• Public nomination form
• Internet Resource Cataloging Request
• Coordination with other library collections
– Avery Fine Arts and Architecture Library
– Burke Library/Union Theological Seminary
– Rare Book and Manuscript Library
– Columbia University Archives
-
Copyright + Permissions
• Unlike in other countries, no US institution has a mandate to archive websites
• Section 108 Study Group recommendations for web archiving (http://www.section108.gov/)
• Internet Archive Model
• Oakland Archive Policy (http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html)
-
CUL EXPERIENCE
• Permissions request created in
consultation with legal counsel
• Request permission from site owners
• Response from site owners
• Complications
– Identifying site owners
– Third party copyright
– Extent of permission
-
Crawling
• Collection of URLs required to reproduce a website
• Test crawls gauge size of sites and flag potential crawl issues
• Actual crawls may take hours or weeks
• Product: WARC files (ISO 28500)
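As a rough illustration of what "collection of URLs required to reproduce a website" involves, the sketch below parses one fetched page and splits the URLs it references into those on the seed's host and those elsewhere. It is not how Heritrix or any of the services named in this talk work internally; the names `LinkCollector` and `discover_urls` and the same-host scope rule are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlsplit


class LinkCollector(HTMLParser):
    """Collects URLs a page references (links, images, scripts, stylesheets)."""

    # Hypothetical, deliberately small mapping of tag -> attribute holding a URL.
    URL_ATTRS = {"a": "href", "link": "href", "img": "src", "script": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references and drop "#fragment" parts.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.urls.add(absolute)


def discover_urls(base_url, html):
    """Return (in_scope, out_of_scope) URL sets for one fetched page.

    Assumption: "in scope" simply means "same host as the page itself".
    """
    collector = LinkCollector(base_url)
    collector.feed(html)
    seed_host = urlsplit(base_url).hostname
    in_scope = {u for u in collector.urls if urlsplit(u).hostname == seed_host}
    return in_scope, collector.urls - in_scope


page = '<a href="/about#team">About</a><img src="logo.png">' \
       '<a href="http://other.example/x">Elsewhere</a>'
in_scope, out_of_scope = discover_urls("http://example.org/index.html", page)
print(sorted(in_scope))
print(sorted(out_of_scope))
```

Counting the in-scope URLs found over a handful of pages gives a crude version of the size estimate a test crawl provides; a real crawler also has to handle CSS, JavaScript, and media references that this sketch ignores.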
-
SERVICES + OPEN SOURCE
Web Archiving Services
• Archive-It
• CDL-WAS
• Hanzo Archives
• Internet Memory Foundation
• Iterasi
Open Source/Free
• Heritrix + Wayback Machine
• Web Curator Tool
• NetarchiveSuite
• HTTrack
• GNU Wget
• WebCollect toolbar
-
SERVICES VS. OPEN SOURCE
Web Archiving Services
• Customer support and training
• External storage
• Development of interface
• Management of crawler
Open Source
• Customizable
• Free software
• Crawled sites are stored locally
-
CUL EXPERIENCE
-
HERITRIX SCREENSHOT
-
Quality Review
• Crawler-generated reports
– Crawler traps
– Robots.txt
– URLs captured
• Crawled sites
– Formatting/style
– Navigation
– Multimedia
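One quality-review question is whether missing content was simply excluded by the site's robots.txt. The sketch below, using Python's standard `urllib.robotparser`, lists which URLs a robots-honoring crawler would skip. The function name `blocked_by_robots`, the sample rules, and the user-agent string are hypothetical; whether a given crawl honors robots.txt at all depends on how the crawler is configured.

```python
from urllib.robotparser import RobotFileParser


def blocked_by_robots(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


# Hypothetical robots.txt content and URLs, for illustration only.
sample_rules = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "http://example.org/index.html",
    "http://example.org/private/report.pdf",
]
print(blocked_by_robots(sample_rules, "example-crawler", candidates))
```

Comparing this list against the crawler's own report of URLs captured helps separate content that was deliberately excluded from content that failed to capture for other reasons.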
-
CUL EXPERIENCE
-
Scoping
• Testing phase or post-crawl
• Excluding out-of-scope URLs
• Expanding scope
– Additional domains (common: blogs, other languages, subordinate sections of an organization)
• Excluding crawler traps
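The scoping decisions listed here can be pictured as a small rule set: a list of allowed hosts (the seed plus any additional domains), exclusion patterns for out-of-scope URLs, and a heuristic for crawler traps. The sketch below is one possible way to express that, not the rule syntax of any particular crawler; the function names, thresholds, and example hosts are all assumptions.

```python
import re
from urllib.parse import urlsplit


def looks_like_trap(url, max_depth=12, max_repeats=2):
    """Heuristic: very deep paths or repeating path segments suggest a trap."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    return any(segments.count(s) > max_repeats for s in set(segments))


def in_scope(url, allowed_hosts, exclude_patterns=()):
    """Apply host rules, then exclusion patterns, then the trap heuristic."""
    host = urlsplit(url).hostname or ""
    # A host is allowed if it equals, or is a subdomain of, an allowed host.
    if not any(host == h or host.endswith("." + h) for h in allowed_hosts):
        return False
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    return not looks_like_trap(url)


# Hypothetical scope: the organization's site plus its separately hosted blog.
allowed = {"example.org", "exampleblog.example.com"}
excluded = [r"/calendar\?"]  # endless calendar pages are a classic trap

for candidate in [
    "http://www.example.org/reports/2010.pdf",
    "http://exampleblog.example.com/post/1",
    "http://ads.example.net/banner",
    "http://example.org/calendar?month=2031-01",
    "http://example.org/a/b/a/b/a/b/page",
]:
    print(candidate, in_scope(candidate, allowed, excluded))
```

Expanding scope then amounts to adding a host to `allowed`, and excluding out-of-scope URLs or traps to adding a pattern, which is why these adjustments can be made either during testing or after a crawl.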
-
CUL EXPERIENCE
Archived-site snapshot--problems
Live Site
-
Access
• Wayback Machine or equivalent necessary to render the WARC files
• Description: metadata created by cataloging staff
• Access
– Web archiving service
– OPAC
– Portal
– Consortium
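Replay tools in the Wayback Machine family address a capture by combining a 14-digit timestamp with the original URL, which is what lets a catalog record or portal link point at an archived copy instead of the live site. The sketch below builds such a link. The function name `replay_url` is hypothetical, and the default base is the Internet Archive's public Wayback Machine; a hosted service or local installation would use its own base address.

```python
from datetime import datetime, timezone


def replay_url(original_url, captured_at,
               replay_base="https://web.archive.org/web"):
    """Build a Wayback-style replay link: <base>/<YYYYMMDDhhmmss>/<original URL>."""
    timestamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{replay_base}/{timestamp}/{original_url}"


# The URL and access date come from the citation shown earlier in this talk.
# This sketch does not check whether a capture near that date actually exists.
link = replay_url("http://www.ugandacan.org/item/1846",
                  datetime(2007, 5, 17, tzinfo=timezone.utc))
print(link)
```

A link of this form could be stored in the descriptive metadata, so that access through the OPAC or a portal does not depend on the live site still being what it was when cited.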
-
CUL EXPERIENCE
CLIO (http://clio.cul.columbia.edu:7018/vwebv/searchBasic?sk=CLIO)
Archive-It (http://www.archive-it.org/)
WorldCat (http://www.worldcat.org/)
-
Challenges
• Rapidly changing technologies used in website development
• Dynamic pages, deep web, other inaccessible content
• Providing access across collections and avoiding web archive silos
• Aggregation of data: long-term storage and responsibility
• Long-term preservation challenges
-
ISSUES
Scale, sustainability
• Matching scale to program objectives
• Budgeting for storage, staffing
Scope; collection policy
• Limit to a few concentrations or broader?
• Defining by source (.org), format (.pdf), topic?
• What happens to resources excluded?
Collaboration, external
• Duplication/overlap with related initiatives
• Complementary approaches: frequency, access, scoping
• Role of Archive-It partners, consortia (2CUL), NDSA
-
MORE ISSUES
Coordination, internal
• Relation to institutional repository, archival collections, e-archives
Staffing, roles
• Centralized vs distributed effort
• Impact on selectors, cataloging, archivists, digital program
Impact on print collecting
• Potential for “e-only”
-
STILL MORE ISSUES
Technical
• Local vs. hosted storage
• Open source, local development vs externally-supported toolkit
• Moving from harvesting to archiving
Access, Use, Assessment
• Use cases for portals, cross-collection searching
• Disclosure outside local context
-
WHAT DO LIBRARIES DO?
• Libraries build research collections by selecting, acquiring, describing, organizing, managing, and preserving relevant resources
• Libraries manage business transactions necessary to provide access to resources needed for research
• Libraries preserve research resources to enable access to be restored if lost
-
ADDITIONAL INFORMATION
CUL Mellon Project on Web Resource Collection Development Program
• Project Information: https://www1.columbia.edu/sec/cu/libraries/bts/web_resource_collection/index.html
• Human Rights Web Archive: http://www.archive-it.org/public/collection.html?id=1068
• Archive-It Collections Page: http://www.archive-it.org/public/partner.html?id=304
• Human Rights Web Archive Delicious Survey: http://www.delicious.com/hrwebproject
Other Web Archives
• Archive-It Partners: http://www.archive-it.org/
• IIPC: http://netpreserve.org/
• Internet Archive: http://www.archive.org/
• Internet Memory Foundation: http://internetmemory.org/en/
• Web Archiving Initiatives wiki: http://en.wikipedia.org/wiki/List_of_Web_Archiving_Initiatives
Services + Tools
• Heritrix: http://crawler.archive.org/
• Wayback Machine (newest Beta version): http://waybackmachine.org/
• Archive-It: http://www.archive-it.org/
• CDL-WAS: http://webarchives.cdlib.org/
• NetarchiveSuite: http://netarchive.dk/suite/
• Internet Memory Foundation: http://internetmemory.org/en/
• Web Curator Tool: http://webcurator.sourceforge.net/
• WebCollect toolbar: http://download.cnet.com/windows/webcollect/3260-20_4-6285468.html?tag=rb_content;contentMain
• GNU Wget: http://www.gnu.org/software/wget/
• HTTrack: http://www.httrack.com/