Making the Ephemeral Endure: Collecting the Web in Research … · 2011. 4. 1. ·...

Columbia University Libraries & Information Services · Making the Ephemeral Endure: Collecting the Web in Research Libraries · Association of College and Research Libraries / Annual Conference / Philadelphia, PA / April 01, 2011 · Hashtag #webarchives


  • Columbia University Libraries & Information Services

    Making the Ephemeral Endure:

    Collecting the Web in Research Libraries

    Association of College and Research Libraries / Annual Conference / Philadelphia, PA / April 01, 2011

    Hashtag #webarchives

  • COLLECTING WEB RESOURCES: OVERVIEW

    • Why

    • Who

    • Columbia context

    • How

    • Some Issues

    • Questions

  • WHY IT MATTERS…

  • THE LIVE SITE TODAY

  • WHY IT MATTERS

    • “Joaquim Chissano Appointed UN Special Envoy for LRA-Affected Region 2007,” Uganda-CAN [article on-line]; available from http://www.ugandacan.org/item/1846; accessed 17 May 2007

    -- citation from Human Rights Review, March 2009

  • THE LIVE SITE TODAY

    The domain ugandacan.org may be for sale. Click here for details.

    Welcome to ugandacan.org

    Related Searches

    • Mountain Gorilla

    • Cheap Air

    • Uganda

    • Volunteering

    • Uganda Travel

    • Hotel Deals

    • Discount Airfare

    • Luxury Car Rental

    • Uganda Tours

    • Cruise Vacation

    • Vacation Package Deal

    RELATED SEARCHES:

    • Car Rental

    • Travel Insurance

    • Cheap Airfare

    • Family Vacation Deals

  • WHY COLLECT WEB RESOURCES

    • Libraries build research collections by selecting, acquiring, describing, organizing, managing, and preserving relevant resources

    • Libraries have stable models for collecting non-digital print resources: the roles of selectors, acquisition departments, catalogers, and preservation units are well understood

  • WHY COLLECT WEB RESOURCES

    For commercial digital resources a different model has emerged:

    • resource bundling

    • licensed access rather than physical receipt

    • vendor-supplied cataloging

    • collective preservation efforts (LOCKSS, Portico)

    – Libraries’ financial investment in these resources has ensured that they are managed

  • WHY COLLECT WEB RESOURCES

    What about non-commercial web resources?

    • Many have high research value

    • May supplement or replace existing print resources

    But as yet we have no common model for:

    • Identifying relevant resources

    • Integrating access with other collections

    • Securing permissions for harvesting

    • Preservation

  • A LOT OF CONTENT

  • MUCH OF IT IS NOT COLLECTED

    Refugees International

    • 40 documents on web site

    • 0 in Columbia collections

    • 10 listed in OCLC

    • 1 held by more than 2 libraries

    • No library holds more than 3

  • WHO: SOME KEY PROGRAMS

    International Internet Preservation Consortium (IIPC)

    • Members include over 30 international libraries and the Internet Archive.

    • http://netpreserve.org

    Archive-It (Internet Archive)

    • Over 100 institutions using Archive-It software. Includes universities, schools, state libraries, museums …

    • http://www.archive-it.org

    Web Archiving Service (California Digital Library)

    • 16 partner institutions

    • http://webarchives.cdlib.org

    Independent Initiatives

    • Commercial web archiving services (Hanzo, Iterasi)

    • National institutions (libraries, archives)

  • MELLON PROJECT ON WEB RESOURCES COLLECTION PROGRAM DEVELOPMENT

    Collection Building

    Make non-commercial web resources of scholarly value an integral part of Columbia’s collection building

    Workflow

    Move web resource collection from a project-based activity to part of routine workflow

    Collaboration

    Develop complementary and collaborative approaches with other research institutions

  • TERMINOLOGY

    Project Program

    Web archiving

    Collecting web resources

  • CREDITS: COLUMBIA TEAM

    • Bob Wolven (Associate University Librarian for Bibliographic Services and Collection Development)

    • Stephen Davis (Director, Columbia Libraries Digital Program)

    • Pamela Graham (Director, Area Studies and CHRDR)

    • Kate Harcourt (Director, Original and Special Materials Cataloging)

    • Alex Thurman (Web Collection Curator)

    • Tessa Fallon (Web Collection Curator)

  • GENERAL WEB ARCHIVING WORKFLOW

    Selection → Permissions → Crawling → Quality Review → Description → Access

  • REQUIREMENTS

    Crawler

    • Tool(s) for capturing websites

    Access/Rendering

    • Mechanism for viewing captured websites

    Storage

    • Storage for data collected and created by the crawler

  • Selection

    • Event based

    – “Arab Spring” websites 2011

    – Japan earthquake 2011

    • Thematic

    – Human Rights

    – Blogging in Iran

    • Institutional

    – Avery Library, Historic Preservation

    – Burke Library, NYC Religions

    • Domain

    – Top level domains such as .uk

  • CUL EXPERIENCE

    Selection models in use at Columbia:

    • Subject specialists

    • Public nomination form

    • Internet Resource Cataloging Request

    • Coordination with other library collections

    – Avery Fine Arts and Architecture Library

    – Burke Library/Union Theological Seminary

    – Rare Book and Manuscript Library

    – Columbia University Archives

  • Copyright + Permissions

    • Unlike in other countries, there is no mandate for any US institution to archive websites

    • Section 108 Study Group recommendations for web archiving (http://www.section108.gov/)

    • Internet Archive Model

    • Oakland Archive Policy (http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html)

  • CUL EXPERIENCE

    • Permissions request created in consultation with legal counsel

    • Request permission from site owners

    • Response from site owners

    • Complications

    – Identifying site owners

    – Third party copyright

    – Extent of permission

  • Crawling

    • Collection of URLs required to reproduce a website

    • Test crawls gauge size of sites and flag potential crawl issues

    • Actual crawls may take hours or weeks

    • Product: WARC files (ISO 28500)
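To make the "Product: WARC files" item concrete, here is a minimal sketch of how a single WARC `response` record might be serialized. It is an illustration only, not how Heritrix or any service writes its files: the function name `build_warc_record`, the example URL, and the sample payload are all hypothetical, and a production crawler records additional fields (digests, IP address, a `warcinfo` record, and so on).

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, http_payload: bytes, when: datetime) -> bytes:
    """Serialize one WARC 'response' record (sketch; real crawlers add more fields)."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_payload))),
    ]
    # Record layout: version line, named fields, blank line, block, two CRLFs.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    return head.encode("utf-8") + http_payload + b"\r\n\r\n"


# Hypothetical captured HTTP response for a hypothetical URL.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
record = build_warc_record(
    "http://example.org/", payload, datetime(2011, 4, 1, tzinfo=timezone.utc)
)
print(record.decode("utf-8"))
```

The point of the sketch is that each capture keeps the original URL, the capture time, and the full HTTP response together, which is what later allows a replay tool to render the page as it was.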

  • SERVICES + OPEN SOURCE

    Web Archiving Services

    • Archive-It

    • CDL-WAS

    • Hanzo Archives

    • Internet Memory Foundation

    • Iterasi

    Open Source/Free

    • Heritrix + Wayback Machine

    • Web Curator Tool

    • NetarchiveSuite

    • HTTrack

    • GNU Wget

    • WebCollect toolbar

  • SERVICES VS. OPEN SOURCE

    Web Archiving Services

    • Customer support and training

    • External storage

    • Development of interface

    • Management of crawler

    Open Source

    • Customizable

    • Free software

    • Crawled sites are stored locally

  • CUL EXPERIENCE

  • HERITRIX SCREENSHOT

  • Quality Review

    • Crawler-generated reports

    – Crawler traps

    – Robots.txt

    – URLs captured

    • Crawled sites

    – Formatting/style

    – Navigation

    – Multimedia
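One way to follow up on the Robots.txt item during quality review is to check which of a site's URLs its robots.txt would block, since blocked stylesheets or images often explain a broken-looking capture. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the user-agent name `archive-bot`, and the example.org URLs are hypothetical stand-ins.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for a hypothetical site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /images/
"""


def blocked_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt rules would keep a crawler from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


candidates = [
    "http://example.org/about.html",
    "http://example.org/images/logo.png",
]
blocked = blocked_urls(ROBOTS_TXT, "archive-bot", candidates)
print(blocked)
```

A list like this can be compared against the crawler's own report to decide whether missing formatting or multimedia is a robots.txt exclusion or a genuine crawl problem.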

  • CUL EXPERIENCE

  • Scoping

    • Testing phase or post-crawl

    • Excluding out-of-scope URLs

    • Expanding scope

    – Additional domains (common: blogs, other languages, subordinate sections of an organization)

    • Excluding crawler traps
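The scoping decisions listed above can be pictured as a filter applied to every URL the crawler discovers. The following is a rough sketch, not the rule syntax of Heritrix or Archive-It: the host list, the exclusion patterns, and the repeated-segment threshold used to guess at crawler traps are all assumptions chosen for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical scope settings for one seed site.
IN_SCOPE_HOSTS = {"www.example.org", "blog.example.org"}  # seed plus an added blog domain
EXCLUDE_PATTERNS = [re.compile(r"/calendar/"), re.compile(r"[?&]sessionid=")]
MAX_REPEATED_SEGMENT = 2  # more repeats than this suggests a looping path


def looks_like_trap(path):
    """Flag paths in which any segment repeats more than MAX_REPEATED_SEGMENT times."""
    segments = [s for s in path.split("/") if s]
    return any(segments.count(s) > MAX_REPEATED_SEGMENT for s in set(segments))


def in_scope(url):
    """Decide whether a discovered URL should be crawled."""
    parts = urlparse(url)
    if parts.hostname not in IN_SCOPE_HOSTS:
        return False  # out-of-scope host
    if any(p.search(url) for p in EXCLUDE_PATTERNS):
        return False  # explicitly excluded pattern
    return not looks_like_trap(parts.path)


for candidate in [
    "http://www.example.org/reports/2011.pdf",
    "http://ads.example.com/banner",
    "http://www.example.org/a/b/a/b/a/b",
]:
    print(candidate, in_scope(candidate))
```

Expanding scope then amounts to adding a host to the allowed set, and excluding a trap amounts to adding a pattern, which is why these adjustments can be made either during test crawls or after reviewing a finished crawl.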

  • CUL EXPERIENCE

    Archived-site snapshot--problems

    Live Site

  • Access

    • Wayback Machine or equivalent necessary to render the WARC files

    • Description: metadata created by cataloging staff

    • Access

    – Web archiving service

    – OPAC

    – Portal

    – Consortium
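For linking from a catalog record or portal to an archived capture, a replay link usually combines a capture timestamp with the original URL. The sketch below assumes the layout commonly seen in Wayback-style replay links (a base address, a 14-digit timestamp, then the original URL); the base URL, the function name `replay_url`, and the chosen date are assumptions for illustration, and the example does not claim that a capture exists for that date.

```python
from datetime import datetime


def replay_url(base, captured, original_url):
    """Build a Wayback-style replay link: <base>/<YYYYMMDDhhmmss>/<original URL>."""
    timestamp = captured.strftime("%Y%m%d%H%M%S")
    return f"{base.rstrip('/')}/{timestamp}/{original_url}"


# Assumed base address; a hosted service or local install would use its own.
url = replay_url(
    "http://web.archive.org/web",
    datetime(2007, 5, 17),
    "http://www.ugandacan.org/item/1846",
)
print(url)
```

Because the original URL is preserved inside the link, a citation to a vanished page can still be matched to its archived copy by URL and date.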

  • CUL EXPERIENCE

    • CLIO (http://clio.cul.columbia.edu:7018/vwebv/searchBasic?sk=CLIO)

    • Archive-It (http://www.archive-it.org/)

    • WorldCat (http://www.worldcat.org/)

  • Challenges

    • Rapidly changing technologies used in website development

    • Dynamic pages, deep web, other inaccessible content

    • Providing access across collections and avoiding web archive silos

    • Aggregation of data: long-term storage and responsibility

    • Long-term preservation challenges

  • ISSUES

    Scale, sustainability

    • Matching scale to program objectives

    • Budgeting for storage, staffing

    Scope; collection policy

    • Limit to a few concentrations or broader?

    • Defining by source (.org), format (.pdf), topic?

    • What happens to resources excluded?

    Collaboration, external

    • Duplication/overlap with related initiatives

    • Complementary approaches: frequency, access, scoping

    • Role of Archive-It partners, consortia (2CUL), NDSA

  • MORE ISSUES

    Coordination, internal

    • Relation to institutional repository, archival collections, e-archives

    Staffing, roles

    • Centralized vs distributed effort

    • Impact on selectors, cataloging, archivists, digital program

    Impact on print collecting

    • Potential for “e-only”

  • STILL MORE ISSUES

    Technical

    • Local vs. hosted storage

    • Open source, local development vs externally-supported toolkit

    • Moving from harvesting to archiving

    Access, Use, Assessment

    • Use cases for portals, cross-collection searching

    • Disclosure outside local context

  • WHAT DO LIBRARIES DO?

    • Libraries build research collections by selecting, acquiring, describing, organizing, managing, and preserving relevant resources

    • Libraries manage business transactions necessary to provide access to resources needed for research

    • Libraries preserve research resources to enable access to be restored if lost

  • ADDITIONAL INFORMATION

    CUL Mellon Project on Web Resource Collection Development Program

    • Project Information: https://www1.columbia.edu/sec/cu/libraries/bts/web_resource_collection/index.html

    • Human Rights Web Archive: http://www.archive-it.org/public/collection.html?id=1068

    • Archive-It Collections Page: http://www.archive-it.org/public/partner.html?id=304

    • Human Rights Web Archive Delicious Survey: http://www.delicious.com/hrwebproject

    Other Web Archives

    • Archive-It Partners: http://www.archive-it.org/

    • IIPC: http://netpreserve.org/

    • Internet Archive: http://www.archive.org/

    • Internet Memory Foundation: http://internetmemory.org/en/

    • Web Archiving Initiatives wiki: http://en.wikipedia.org/wiki/List_of_Web_Archiving_Initiatives

    Services + Tools:

    • Heritrix: http://crawler.archive.org/

    • Wayback Machine (newest Beta version): http://waybackmachine.org/

    • Archive-It: http://www.archive-it.org/

    • CDL-WAS: http://webarchives.cdlib.org/

    • NetarchiveSuite: http://netarchive.dk/suite/

    • Internet Memory Foundation: http://internetmemory.org/en/

    • Web Curator Tool: http://webcurator.sourceforge.net/

    • Web Collect Toolbar: http://download.cnet.com/windows/webcollect/3260-20_4-6285468.html?tag=rb_content;contentMain

    • GNU Wget: http://www.gnu.org/software/wget/

    • HTTrack: http://www.httrack.com/