1 BCS, Oxfordshire, 19 February, 2004 WEB ARCHIVING issues and challenges Deborah Woodyard Digital Preservation Coordinator

BCS, Oxfordshire, 19 February, 2004

WEB ARCHIVING issues and challenges

Deborah Woodyard, Digital Preservation Coordinator

Where to start?

Selection
- Collection Development Policy

Need to be able to find them again
- Cataloguing issues
- 404 Not Found

Need to capture web sites
- Who is responsible for capture?
- Who is responsible for preservation/access?
- What does this mean?

Define a web site - where are the boundaries?
- Links
- Content on other sites / other servers
- Changes with time – significant change

Technical issues – Capture software

Capture software
- Taking ‘Snapshots’
- Follow directory structure or links?
- Where to break links / replace broken links?
- Relative vs absolute linking
- No changes to code for authenticity
- Preserve ‘original’ version, provide ‘access’ version
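The relative vs absolute linking question, and the split between an untouched ‘original’ and a browsable ‘access’ version, can be illustrated with a minimal Python sketch. The helper name to_access_link and the choice of a single host name as the site boundary are assumptions made for illustration only; they are not taken from the talk or from any particular capture tool.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def to_access_link(page_url: str, href: str, site_host: str) -> str:
    """Return the link to write into the ACCESS copy of page_url.

    Hypothetical helper: the original capture is never modified.
    URL fragments (#...) are ignored in this sketch.
    """
    # Resolve the href (relative or absolute) against the page it appears on.
    absolute = urljoin(page_url, href)
    target = urlsplit(absolute)

    # Assumed boundary rule: anything on another host is outside the site,
    # so the link is left pointing at the live web.
    if target.netloc != site_host:
        return absolute

    # Inside the boundary: make the link relative so it still works when
    # the archived copy is served from a different location.
    page_dir = posixpath.dirname(urlsplit(page_url).path) or "/"
    relative = posixpath.relpath(target.path or "/", start=page_dir)
    return relative + (f"?{target.query}" if target.query else "")
```

For example, an absolute link from http://example.org/a/b/index.html to http://example.org/a/img/logo.gif would become ../img/logo.gif in the access copy, while a link to another host would be left unchanged.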

Obey robots.txt exclusions

Politeness – server load

Quality control checking
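The robots.txt and politeness points above can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the user-agent string "example-archiver", the one-second default delay and the function names are assumptions, and the robots.txt text is passed in directly so that the sketch needs no network access.

```python
from urllib.robotparser import RobotFileParser


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that was already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_plan(parser, urls, agent="example-archiver", default_delay=1.0):
    """Return (URLs the crawler may capture, seconds to wait between requests).

    The agent name and default delay are illustrative assumptions.
    """
    # Honour a Crawl-delay directive if present, else fall back to a default.
    delay = parser.crawl_delay(agent) or default_delay
    # Drop every URL that robots.txt excludes for this user agent.
    allowed = [url for url in urls if parser.can_fetch(agent, url)]
    return allowed, delay
```

A crawler would then sleep for the returned delay between requests to the same server, which is one simple way to limit server load.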

Technical issues - Web sites

File types - HTML, GIF, JPEG, JavaScript, ASP, etc.

Software plug-ins
- permission
- access

Dynamic database driven sites
- producing static pages
- producing pages on-the-fly

Frequency of capture

Extent of capture
- volume
- duplication
- storage and access to partial sites

Technical issues – storage and access

Management and storage
- high volume
- multiple captures
- long term, including storage system migration
- disaster recovery

Permanent naming

Ensuring authenticity
- trusted digital repository
- checksums, signatures – long term

Signifying access to archived version
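The checksum item under "Ensuring authenticity" can be illustrated with a short Python sketch using the standard hashlib module. The choice of SHA-256 and the function names are assumptions for illustration; the talk does not name an algorithm or a tool.

```python
import hashlib
from typing import Iterable, Iterator


def sha256_hex(chunks: Iterable[bytes]) -> str:
    """Compute a SHA-256 checksum over a stream of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def read_chunks(path: str, size: int = 1 << 20) -> Iterator[bytes]:
    """Yield a stored file in fixed-size chunks so large captures fit in memory."""
    with open(path, "rb") as handle:
        while chunk := handle.read(size):
            yield chunk


def fixity_ok(chunks: Iterable[bytes], recorded_hex: str) -> bool:
    """Compare the current checksum with the one recorded at capture time."""
    return sha256_hex(chunks) == recorded_hex.lower()
```

The checksum would be recorded when a capture is stored, for example with sha256_hex(read_chunks(path)), and re-checked after events such as a storage system migration. A checksum only detects change; it does not by itself establish who made the copy, which is where signatures and a trusted repository come in.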

Technical issues - preservation

Preserve bits

Preserve intellectual object, + ‘look & feel’

Preserve functionality

Technology changes
- physical storage
- hardware platform
- operating systems
- application software
- HTML

Technical issues – preservation strategies

Metadata for preservation
- describe bits: how and where stored
- describe how to interpret/use bits
- describe the context for the bits
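The three kinds of preservation metadata listed above (storage, interpretation, context) could be recorded along the following lines. This Python sketch is purely illustrative: the class name and every field name are hypothetical and do not follow any particular metadata standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PreservationRecord:
    """Hypothetical per-file record; all field names are assumptions."""

    # Describe the bits: how and where stored.
    storage_location: str
    checksum_sha256: str
    size_bytes: int
    # Describe how to interpret/use the bits.
    mime_type: str
    rendering_notes: str
    # Describe the context for the bits.
    source_url: str
    captured_at: str  # ISO 8601 timestamp of the capture
    capture_tool: str

    def to_json(self) -> str:
        """Serialise the record so it can be stored beside the file."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Keeping the record in a plain, self-describing format such as JSON is one way to keep the metadata itself readable as technology changes.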

Migration
- in part / in whole
- valid code?
- keep all versions?
- manage multiple versions

Emulation
- of software / OS / platform

LEGAL DISCUSSION

Minimise risk
- Capture non-commercial sites
- Preserve without providing access
- Embargo or limit access
- Document actions taken
- Maintain ability to remove access

Cost

£££ ??

- to do it

- of not doing it

PROJECTS

General project types:

Selective
- narrow, high quality, low volume

Comprehensive
- broad, lower quality, high volume

Combination
- useful, high quality, high volume

PROJECTS

British Library involvement:

Domain.UK - selective

UK Web Archiving Consortium - selective

International Internet Preservation Consortium (IIPC) – comprehensive/combination

Project details

Domain.uk
- WebWhacker, HTTrack
- Regular captures of simple sites
- Staff PC (later networked drive), very small
- No access

UK WAC
- UK partners sharing one system
- PANDAS management, HTTrack, Oracle
- Manual selection, cataloguing and quality checking
- Web interface for capture and public access

Project details

IIPC

Comprehensive automated selection
- links in / links out
- authority / hits
- rare words

Designing new crawler / harvester

Developing technical architecture

Deep web?

Access challenging
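The "links in / links out" signal listed under automated selection can be illustrated with a small Python sketch that counts cross-site links per host. This is an assumption-laden illustration, not the IIPC's method: the function name, the edge-list input and the decision to ignore links within the same host are all choices made here for clarity.

```python
from collections import Counter
from urllib.parse import urlsplit


def link_counts(edges):
    """Count cross-host links from (source_url, target_url) pairs.

    Returns (links_in, links_out), two Counters keyed by host name.
    Links within a single host are ignored (an assumption of this sketch).
    """
    links_in, links_out = Counter(), Counter()
    for source, target in edges:
        src_host = urlsplit(source).netloc
        dst_host = urlsplit(target).netloc
        if src_host == dst_host:
            continue  # internal navigation says little about authority
        links_out[src_host] += 1
        links_in[dst_host] += 1
    return links_in, links_out
```

A host with many incoming links from other hosts might be ranked higher for capture, and such counts could be combined with other signals like hits or rare words.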

FUTURE WORK

Expand collection

Collaborative projects, inc. automated capture and metadata generation

Legal deposit instruments for web archiving

Provide restricted access

USEFUL REFERENCES

http://library.wellcome.ac.uk/projects/archiving_reports.shtml

Collecting and preserving the World Wide Web: A feasibility study undertaken for the JISC and Wellcome Trust. Michael Day, UKOLN, University of Bath. Version 1.0 - 25 February 2003

Legal issues relating to the archiving of Internet resources in the UK, EU, US and Australia. Andrew Charlesworth, University of Bristol, Centre for IT and Law. Version 1.0 - 25 February 2003

2nd ECDL workshop on Web archiving: http://bibnum.bnf.fr/ecdl/2002/index.html

Digital Preservation Coalition: http://www.dpconline.org/