1 BCS, Oxfordshire, 19 February, 2004 WEB ARCHIVING issues and challenges Deborah Woodyard Digital...
1
BCS, Oxfordshire, 19 February, 2004
WEB ARCHIVING issues and challenges
Deborah Woodyard, Digital Preservation Coordinator
2
Where to start?
Selection
- Collection Development Policy
Need to be able to find them again
- Cataloguing issues
- 404 Not Found
Need to capture web sites
- Who is responsible for capture?
- Who is responsible for preservation/access?
- What does this mean?
Define a web site - where are the boundaries?
- Links
- Content on other sites / other servers
- Changes with time – significant change
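One way to make the boundary question concrete is to treat a link as part of the site only when it resolves to the same host as the starting page. The sketch below is a minimal illustration under that assumption; the function name `in_scope` and the example URLs are hypothetical, and a real collection policy may well draw the line differently (for example, by including images served from another server).

```python
from urllib.parse import urljoin, urlparse


def in_scope(seed_url: str, link: str) -> bool:
    """Return True if `link` resolves to the same host as `seed_url`."""
    # urljoin resolves relative links against the seed page's URL.
    absolute = urljoin(seed_url, link)
    return urlparse(absolute).netloc.lower() == urlparse(seed_url).netloc.lower()


# Hypothetical seed page and links found on it.
seed = "http://www.example.org/index.html"
links = ["about.html", "/images/logo.gif", "http://other.example.com/page.html"]

for link in links:
    print(link, in_scope(seed, link))
```

A same-host rule is simple to apply but will exclude content hosted elsewhere, which is exactly the trade-off the slide raises.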
3
Technical issues – Capture software
Capture software
- Taking ‘Snapshots’
- Follow directory structure or links?
- Where to break links / replace broken links?
- Relative vs absolute linking
- No changes to code for authenticity
- Preserve ‘original’ version, provide ‘access’ version
Obey robots.txt exclusions
Politeness – server load
Quality control checking
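The robots.txt and politeness points can be illustrated with Python's standard `urllib.robotparser` module. This is only a sketch: the robots.txt content, the user agent name `example-archive-bot`, and the one-second fallback delay are assumptions made for the example, not settings from any of the capture tools named in this talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-archive-bot"  # hypothetical crawler name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_capture(url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


# Politeness: wait between requests; fall back to 1 second (an assumed default)
# if the site does not state a crawl delay.
delay_seconds = parser.crawl_delay(USER_AGENT) or 1

print(may_capture("http://www.example.org/index.html"))
print(may_capture("http://www.example.org/private/report.html"))
print(delay_seconds)
```

A harvester would call `may_capture` before each request and sleep for `delay_seconds` between requests to the same server.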
4
Technical issues - Web sites
File types - HTML, GIF, JPEG, JavaScript, ASP, etc.
Software plug-ins
- permission
- access
Dynamic database driven sites
- producing static pages
- producing pages on-the-fly
Frequency of capture
Extent of capture
- volume
- duplication
- storage and access to partial sites
5
Technical issues – storage and access
Management and storage
- high volume
- multiple captures
- long term, inc. storage system migration
- disaster recovery
Permanent naming
Ensuring authenticity
- trusted digital repository
- checksums, signatures – long term
Signifying access to archived version
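The checksum point above can be sketched as a simple fixity check: record a digest when a file is archived, then recompute it later and compare. The choice of SHA-256 and the helper names `sha256_of` and `verify` are assumptions for illustration; the talk does not specify an algorithm.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, recorded_digest: str) -> bool:
    """Return True if the file still matches the digest recorded at capture."""
    return sha256_of(path) == recorded_digest


# Demonstration with a temporary file standing in for an archived page.
with tempfile.TemporaryDirectory() as tmp:
    archived = Path(tmp) / "index.html"
    archived.write_bytes(b"<html>archived page</html>")
    recorded = sha256_of(archived)          # stored alongside the capture
    intact = verify(archived, recorded)     # unchanged file

    archived.write_bytes(b"<html>altered page</html>")
    still_intact = verify(archived, recorded)  # file has been modified

print(intact, still_intact)
```

A checksum detects change but not who made it; signatures and a trusted repository address that separate question.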
6
Technical issues - preservation
Preserve bits
Preserve intellectual object, + ‘look & feel’
Preserve functionality
Technology changes
- physical storage
- hardware platform
- operating systems
- application software
- HTML
7
Technical issues – preservation strategies
Metadata for preservation
- describe bits: how and where stored
- describe how to interpret/use bits
- describe the context for the bits
Migration
- in part / in whole
- valid code?
- keep all versions?
- manage multiple versions
Emulation
- of software / OS / platform
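The three kinds of preservation metadata listed above (where the bits are stored, how to interpret them, and their context) could be kept together in one record per captured file. The sketch below is purely illustrative: the class `PreservationRecord`, every field name, and all the values are hypothetical and do not follow any particular metadata standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PreservationRecord:
    # Describe the bits: how and where they are stored.
    storage_path: str
    size_bytes: int
    # Describe how to interpret/use the bits.
    mime_type: str
    rendering_software: str
    # Describe the context for the bits.
    source_url: str
    capture_date: str


# Hypothetical values for a single captured page.
record = PreservationRecord(
    storage_path="/archive/2004/02/19/index.html",
    size_bytes=2048,
    mime_type="text/html",
    rendering_software="web browser supporting HTML 4.01",
    source_url="http://www.example.org/index.html",
    capture_date="2004-02-19",
)

# Serialise to JSON so the record can be stored next to the file it describes.
serialised = json.dumps(asdict(record), indent=2)
print(serialised)
```

After a migration, a new record for the migrated version could sit beside the original, which is one way to manage multiple versions.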
8
LEGAL DISCUSSION
Minimise risk
- Capture non-commercial sites
- Preserve without providing access
- Embargo or limit access
- Document actions taken
- Maintain ability to remove access
9
Cost
£££ ??
- to do it
- of not doing it
10
PROJECTS
General project types:
Selective - narrow, high quality, low volume
Comprehensive - broad, lower quality, high volume
Combination - useful, high quality, high volume
11
PROJECTS
British Library involvement:
Domain.UK - selective
UK Web Archiving Consortium - selective
International Internet Preservation Consortium (IIPC) – comprehensive/combination
12
Project details
Domain.uk
- WebWhacker, HTTrack
- Regular captures of simple sites
- Staff PC (later networked drive), very small
- No access
UK WAC
- UK partners sharing one system
- PANDAS management, HTTrack, Oracle
- Manual selection, cataloguing and quality checking
- Web interface for capture and public access
13
Project details
IIPC
Comprehensive automated selection
- links in / links out
- authority / hits
- rare words
Designing new crawler / harvester
Developing technical architecture
Deep web?
Access challenging
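As a rough illustration of the "links in / links out" signal mentioned above, the sketch below counts inbound and outbound links over a tiny link graph and ranks sites by inbound links. The graph data and the ranking rule are assumptions for the example; this is not a description of the IIPC's actual selection method.

```python
from collections import Counter

# Hypothetical link graph: each site maps to the sites it links to.
link_graph = {
    "a.example.org": ["b.example.org", "c.example.org"],
    "b.example.org": ["c.example.org"],
    "c.example.org": ["a.example.org"],
    "d.example.org": ["c.example.org"],
}

# Links out: number of distinct sites each site points to.
links_out = {site: len(set(targets)) for site, targets in link_graph.items()}

# Links in: number of distinct sites pointing at each site.
links_in = Counter(
    target for targets in link_graph.values() for target in set(targets)
)

# Rank candidate sites by inbound links, highest first.
ranked = sorted(link_graph, key=lambda site: links_in[site], reverse=True)

for site in ranked:
    print(site, "in:", links_in[site], "out:", links_out[site])
```

Link counts alone favour well-connected sites, which is presumably why the slide lists them alongside other signals such as hits and rare words.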
14
FUTURE WORK
Expand collection
Collaborative projects, inc. automated capture and metadata generation
Legal deposit instruments for web archiving
Provide restricted access
15
USEFUL REFERENCES
http://library.wellcome.ac.uk/projects/archiving_reports.shtml
Collecting and preserving the World Wide Web: A feasibility study undertaken for the JISC and Wellcome Trust
Michael Day, UKOLN, University of Bath
Version 1.0 - 25 February 2003
Legal issues relating to the archiving of Internet resources in the UK, EU, US and Australia
Andrew Charlesworth, University of Bristol, Centre for IT and Law
Version 1.0 - 25 February 2003
2nd ECDL workshop on Web archiving
http://bibnum.bnf.fr/ecdl/2002/index.html
Digital Preservation Coalition
http://www.dpconline.org/