Strategies for Collecting and Preserving Open Access Materials on the Web

William Y. Arms

Cornell University

Federal Library and Information Center Committee

Open Access Materials on the Web

The Library of Congress: the Web Preservation Project

The Library of Congress collects the cultural and intellectual output of today for the benefit of future generations.

An ever-increasing amount of this material is born digital.

The library has:

• privileged legal position

• generous public funding

... but cannot do everything!

Step 1: Open Access Materials on the Web

Approaches to Preservation of the Web

OPEN ACCESS

• Selective collection of open access web: librarianship in a new domain

• Bulk collection of open access web: automated processes

CLOSED ACCESS

• Partnership with publishers: publishers and libraries as partners

Example: Web Preservation Project Pilot

• Small number of web sites nominated by selection officers. Three chosen for close study.

http://www.whitehouse.gov/ http://www.algore2000.com/ http://www.georgewbush.com/

• Copies downloaded using HTTrack mirroring program. Inspected for errors, anomalies, etc.

• Catalog records created using OCLC's CORC software and loaded into the Library of Congress's ILS.

• Trial web site developed to evaluate user interfaces.

Example: The Internet Archive

Example: National Library of Australia

Example: National Library of Sweden

Selection and Collection

Collecting: Making a Snapshot

[Diagram: Web site -> Download -> Snapshot -> Archive]

A web site is downloaded, using a mirroring program. A snapshot is stored in an archive.
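
As a rough illustration of the snapshot step, the Python sketch below stores one downloaded page under a directory named for the host and the collection time. It is not the project's method: the pilot used the HTTrack mirroring program, which also follows links and fetches images and other files. The function names, the directory layout, and the file name `index.html` are assumptions made for this example.

```python
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch(url: str) -> bytes:
    """Download a single resource (a real mirroring tool also follows links)."""
    with urlopen(url) as response:
        return response.read()


def store_snapshot(archive_root: Path, url: str, content: bytes,
                   taken_at: datetime) -> Path:
    """Write the bytes to archive_root/<host>/<UTC timestamp>/index.html."""
    host = urlparse(url).netloc
    stamp = taken_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    snapshot_dir = archive_root / host / stamp
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    target = snapshot_dir / "index.html"
    target.write_bytes(content)  # stored unedited, exactly as downloaded
    return target
```

Calling `store_snapshot(Path("archive"), url, fetch(url), datetime.now(timezone.utc))` at scheduled intervals would leave one directory per snapshot, which is the arrangement the periodic-snapshot slide describes.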

Collecting: Periodic Snapshots

[Diagram: Web site -> Snapshot 1, Snapshot 2, Snapshot 3 -> Archive]

At scheduled time intervals, additional snapshots are made.

Selection Decisions

Which sites to collect

• Bulk -- collect all within a certain category

• Selective -- collect sites selected by a librarian

How often to make snapshots

• Monthly, weekly, or depending on circumstances

Which content to collect

• HTML pages only

• Text and images only

• Everything

Examples of Selection Decisions

Project            Selection   Frequency   Content

Internet Archive   bulk        monthly     HTML + images
Pandora            selective   varies      all
Kulturarw3         bulk        sweeps      all
Web Preservation   selective   irregular   all

Legal Issues

Legal position of archives that download open access materials is unclear

• Preservation is in the national interest

• See the discussion in The Digital Dilemma

• Crucial factor is economic impact on copyright owners

• Library of Congress has no special position except via copyright deposit

Legal Issues: Thoughts and Actions

• Presumption is that downloading open access materials is permitted by the publisher ....

... unless other indication given, e.g., robot exclusion using robots.txt file

• Different parties to consider

=> Library of Congress

=> other national libraries

=> partners of the Library of Congress and national libraries

=> independent archives

• U.S. Copyright Office has offered to help clarify the legal position
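
The robots.txt check mentioned above can be sketched with Python's standard `urllib.robotparser` module. This is an illustration only: the helper name `may_collect`, the user-agent string `archive-bot`, and the sample exclusion rules are assumptions, and passing this check says nothing about the legal questions raised on this slide.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def may_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the site's robots.txt does not exclude this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler would normally download `/robots.txt` from each site before collecting and skip any URL for which `may_collect` returns `False`.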

Access to Collections

Access: Analysis by Computer

[Diagram: Snapshots 1, 2, 3 in the Archive -> Analysis by computer]

Access: Analysis by Patron

[Diagram: Web site -> Snapshots 1, 2, 3 in the Archive -> Access versions 1, 2, 3 -> Analysis by patron; the snapshots also feed Analysis by computer]

Access Decisions

Style of access

• Analysis of snapshot files by computer

• Analysis of Web access version by patron

Editing

• Minimal editing to make access version

• Fuller editing to maintain experience

• Automatic or by hand

Policy

• Who has access to the collections?

Examples of Access Decisions

Project            Style        Editing

Internet Archive   computer     none
Pandora            researcher   some
Kulturarw3         ?            ?
Web Preservation   researcher   some

Information Discovery

Options for Information Discovery

Very large numbers of Web sites will be collected and preserved. Some form of index or catalog is required.

Options

• List of sites (e.g., Internet Archive)

=> Access by URL + date

• Automatic index (e.g., Web search engines)

• Catalog (e.g., Web Preservation Project)

=> Record for individual site or group of sites

=> Access through library catalog
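
The simplest of these options, access by URL + date, can be sketched as a lookup that returns the most recent snapshot taken on or before a requested date. The class name `SnapshotIndex` and its methods are hypothetical; this is not a description of how the Internet Archive implements its list of sites.

```python
from bisect import bisect_right, insort
from datetime import date
from typing import Dict, List, Optional


class SnapshotIndex:
    """Maps each URL to a sorted list of the dates on which it was collected."""

    def __init__(self) -> None:
        self._dates: Dict[str, List[date]] = {}

    def add(self, url: str, taken: date) -> None:
        # insort keeps the per-URL list sorted as snapshots arrive.
        insort(self._dates.setdefault(url, []), taken)

    def lookup(self, url: str, wanted: date) -> Optional[date]:
        """Return the latest snapshot date <= wanted, or None if there is none."""
        dates = self._dates.get(url, [])
        position = bisect_right(dates, wanted)
        return dates[position - 1] if position else None
```

An automatic index or a catalog record would sit on top of a lookup like this, since both ultimately have to resolve to a particular site and a particular snapshot.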

Information Discovery: Web Preservation Project

Procedure

• MARC catalog records created using OCLC's CORC system.

• Loaded into the Library of Congress's ILS.

Observations

• Catalog effort similar to other electronic files

• Continual changes between snapshots

• Some similarities to serials

• No significant workflow difficulties

Storage

Storage: Preservation Versions

[Diagram: successive preservation versions of Snapshot 1 and Access 1]

Over time, other versions of a snapshot will be made for preservation.

Storage Decisions: Size

Each Web site will be stored many times

• Repeated snapshots

• Access versions

• Preservation versions

Saving space

• Many files are repeated (e.g., video clips)

• Storing a single copy saves space, but leads to more complex computer systems

• Compressing files saves space, but leads to more complex computer systems
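
Both space-saving ideas can be sketched together: each distinct file is stored once, compressed, under its content hash, and every snapshot keeps only a manifest of paths pointing at those hashes. The class `DedupStore` and its in-memory dictionaries are assumptions for illustration. The manifest is exactly the kind of added system complexity the slide warns about.

```python
import hashlib
import zlib
from typing import Dict


class DedupStore:
    """Content-addressed store: identical files share one compressed copy."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}                # digest -> compressed bytes
        self.manifests: Dict[str, Dict[str, str]] = {}   # snapshot -> {path: digest}

    def add_file(self, snapshot_id: str, path: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self.blobs:                     # store each distinct file once
            self.blobs[digest] = zlib.compress(content)
        self.manifests.setdefault(snapshot_id, {})[path] = digest
        return digest

    def read_file(self, snapshot_id: str, path: str) -> bytes:
        digest = self.manifests[snapshot_id][path]
        return zlib.decompress(self.blobs[digest])
```

If a video clip is unchanged between two monthly snapshots, the second `add_file` call adds only a manifest entry, not a second copy of the clip.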

Very Rough Estimates of Size and Cost

Public web sites (OCLC, February 2000)         2,900,000
Library of Congress collects 1%                30,000
Average size of site                           60 Mbytes
Size of 30,000 sites                           1.8 terabytes
Storage requirements/year (monthly snapshot)   21.6 terabytes
Storage requirements (no duplicates)           5.0 terabytes
Cost per year ($25,000 per terabyte)           $125,000
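
The arithmetic behind these estimates can be checked with a few lines of Python. The script assumes decimal units (1 terabyte = 1,000,000 Mbytes) and takes the 5.0 terabyte no-duplicates figure directly from the slide, since the slide does not say how it was derived. All variable names are illustrative.

```python
PUBLIC_SITES = 2_900_000        # OCLC, February 2000
COLLECTED_SITES = 30_000        # the slide rounds 1% (29,000) up to 30,000
AVERAGE_SITE_MB = 60
SNAPSHOTS_PER_YEAR = 12         # monthly snapshots
MB_PER_TB = 1_000_000           # assumption: decimal units
DEDUPLICATED_TB = 5.0           # taken from the slide, not derived here
COST_PER_TB = 25_000            # dollars

one_percent = PUBLIC_SITES // 100
one_pass_tb = COLLECTED_SITES * AVERAGE_SITE_MB / MB_PER_TB
yearly_tb = COLLECTED_SITES * AVERAGE_SITE_MB * SNAPSHOTS_PER_YEAR / MB_PER_TB
yearly_cost = DEDUPLICATED_TB * COST_PER_TB

print(one_percent, one_pass_tb, yearly_tb, yearly_cost)
```

Note that the cost line is based on the deduplicated 5.0 terabytes. Storing every monthly snapshot in full at the same price would cost 21.6 × $25,000 = $540,000 per year.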

Storage Decisions: Identification

Identification of Web site

• URL, but Web sites may change their URL

• URN (e.g., Handle or PURL)

Identification and provenance of versions

• Web site identifier

• Collection information (date, time, etc.)

• History of changes
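
One way to picture the identification and provenance information listed above is as a small record attached to every stored version. The sketch below is hypothetical: the field names, the `kind` values, and the version identifier format are assumptions, not a schema used by the Web Preservation Project.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class VersionRecord:
    site_id: str                        # persistent identifier, e.g. a Handle or PURL
    url: str                            # URL at the time of collection (may change later)
    collected_at: datetime              # collection date and time
    kind: str = "snapshot"              # "snapshot", "access", or "preservation"
    derived_from: Optional[str] = None  # version_id of the source version, if any
    history: List[str] = field(default_factory=list)  # history of changes

    def version_id(self) -> str:
        return f"{self.site_id}@{self.collected_at:%Y%m%dT%H%M%SZ}#{self.kind}"

    def record_change(self, note: str) -> None:
        self.history.append(note)
```

Keeping `site_id` separate from `url` reflects the first point on the slide: the URL is recorded as it was on the collection date, while the persistent identifier stays stable if the site moves.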

Workflow

[Diagram components: Web site, Web Crawler, Accession Control, Process, Archive (snapshot), Catalog, External Access, Analysis by patron, Analysis by computer]

Preservation

Objective

Objective is to preserve the digital collections in a manner that makes them usable for scholarship and research in the future.

What is preserved?

• Preservation of bits

• Preservation of content

• Preservation of experience

How is it used?

• Analysis by computer program

• Analysis by human researcher

• Viewed by human researcher

Process of Preservation

[Diagram: Version 1 (Time 0) -> Version 2 (Time 1) -> Version 3 (Time 2)]

This process may be applied to either the snapshot or the access version.

Preservation: Refreshing

Each version is created from the previous by exactly copying the bits.

• Keeps the exact files for all time

• Preserves bits and content, but not always in an accessible form

• Later computers and software are unlikely to support today's protocols, formats, languages, etc.

Keeping the unedited snapshot files by repeated refreshing should be a basic part of any preservation strategy.
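
Refreshing can be sketched as a copy followed by a checksum comparison, so that each new version is known to contain exactly the same bits as the one before. The function names are assumptions, and a production archive would also record the checksum alongside the file for later audits.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, target: Path) -> str:
    """Copy source to target bit for bit and verify the copy; return the checksum."""
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    before, after = sha256_of(source), sha256_of(target)
    if before != after:
        raise IOError(f"refresh of {source} failed verification")
    return after
```

The check confirms only that the bits survived the move to new media. Whether later software can still interpret those bits is the separate problem that migration addresses.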

Preservation: Automatic Migration of Individual Files

As protocols, formats, languages, etc. become obsolete, convert individual files to new standards.

• Can be carried out automatically

• Preserves content and helps toward preservation of experience

• Effectiveness depends on availability of conversion tools and the complexity and quality of original source

• Migrated versions will steadily diverge from original

• Web sites will eventually cease to function

Automated migration of individual files is the basic technique for keeping web sites functional at moderate cost.

Preservation: Automatic Migration with Manual Editing

In conjunction with automatic migration, web sites are reviewed by a librarian and edited as necessary to preserve functionality.

• The only method that can be expected to preserve the experience of using web sites

• Migrated versions will steadily diverge from original

• Some web sites will be impossible to edit without changing the experience

Manual editing is very expensive and is therefore suitable for only a small number of particularly important sites.

Acknowledgements

The members of the Web Preservation Project are:

Roger Adkin

Cassy Ammen

William Arms

Allene Hayes

Melissa Levine

Diane Kresh

Barbara Tillett