CDL's Web Archiving System
Erik Hetzner
UC3, California Digital Library
16 June 2011
Erik Hetzner (UC3, California Digital Library) CDL’s Web Archiving System 16 June 2011 1 / 24
Introduction
We don’t decide what to collect.
We don’t decide when to collect it.
We build tools to allow curators to make those decisions.
Vital statistics
35 public archives
16 partners
2724 web sites
289,272,095 URLs (×2)
16.1 TB (×2)
The Web Archiving Service
Archive
Search
Site list
Archived page
How we do it
Collection focus (unofficial)
Middle East political sites (Stanford)
Social movements (Tamiment, NYU)
California government sites (UC)
Tools
Heritrix 1.14.x
Open-source Wayback
Nutchwax (moving to Solr)
CDL’s legacy Digital Preservation Repository
. . . and a lot of UI code
. . . ARC management, indexing scripts, etc.
Difficulties
Web archiving is easy*, but there are some difficulties.
Uneven coverage
We only crawl what our curators select.
Human selection
High precision; low recall.
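To make the trade-off concrete, here is a minimal sketch of how precision and recall could be computed for a curated selection. The site names and counts are invented for illustration and do not come from the service's data.

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) of a selection against the relevant set."""
    selected, relevant = set(selected), set(relevant)
    hits = selected & relevant
    # Precision: share of selected sites that are actually relevant.
    precision = len(hits) / len(selected) if selected else 0.0
    # Recall: share of relevant sites that were actually selected.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical topic with 100 relevant sites; a curator hand-picks 10 of them.
relevant_sites = {f"site{i}.example.org" for i in range(100)}
curated_sites = {f"site{i}.example.org" for i in range(10)}

precision, recall = precision_recall(curated_sites, relevant_sites)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Every hand-picked site is relevant, so precision is high, but most relevant sites are never selected, so recall is low.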
Scale
We are not at Internet Archive scale, but we are big enough that it takes a long time to do anything.
Collection mismatch
Our crawls are organized into ‘collections’. Everybody [?] else has ‘one big archive’.
Politics
We are customer-driven: we need to convince customers that collaboration is good for them.
Possibilities
What’s on our plate
Deduplication
. . . requires a new index (Solr)
Moving to our new Merritt repository
Implementing Memento
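One way to see why deduplication needs an index: each new capture has to be checked against everything already stored. The sketch below is an assumption-laden illustration, not the service's implementation. It uses an in-memory dict keyed by SHA-1 payload digest where a real system would query a search index such as Solr, and the record layout and the `dedupe` function name are hypothetical.

```python
import hashlib


def dedupe(captures):
    """Store each distinct payload once; later copies become pointers.

    captures: iterable of (url, timestamp, payload_bytes) tuples.
    Returns a list of record dicts in input order.
    """
    index = {}  # digest -> (url, timestamp) of the first stored copy
    records = []
    for url, timestamp, payload in captures:
        digest = hashlib.sha1(payload).hexdigest()
        record = {"url": url, "timestamp": timestamp, "digest": digest}
        if digest in index:
            # Identical content already stored: keep only a reference to it.
            record["duplicate_of"] = index[digest]
        else:
            index[digest] = (url, timestamp)
            record["payload"] = payload
        records.append(record)
    return records


# Hypothetical captures: the page is unchanged between the first two crawls.
captures = [
    ("http://example.org/", "20110101000000", b"<html>same</html>"),
    ("http://example.org/", "20110201000000", b"<html>same</html>"),
    ("http://example.org/", "20110301000000", b"<html>changed</html>"),
]
records = dedupe(captures)
stored = [r for r in records if "payload" in r]
print(f"{len(stored)} of {len(records)} payloads stored")
```

The lookup `digest in index` is the step that has to scale to every capture in the archive, which is why the choice of index matters.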
Evaluating community needs
What do we have that you need? What do you have that we need?
Collaboration with researchers
The hard, fun problems are not necessarily the ones that we need solved.
But maybe we can work it out.
Temporal search
How can we rank (and display) results across time?
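One possible answer, sketched here purely as an illustration, is to collapse all captures of the same URL into a single result, rank by the best-scoring capture, and display the span of capture dates. The hit fields (`url`, `timestamp`, `score`) and the function name are assumptions, not the service's search schema.

```python
from collections import defaultdict


def group_hits_by_url(hits):
    """Collapse per-capture search hits into one ranked result per URL.

    hits: list of dicts with 'url', 'timestamp' (YYYYMMDDhhmmss) and 'score'.
    """
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["url"]].append(hit)

    results = []
    for url, group in groups.items():
        # Fixed-width timestamps sort correctly as plain strings.
        timestamps = sorted(h["timestamp"] for h in group)
        results.append({
            "url": url,
            "best_score": max(h["score"] for h in group),
            "first_capture": timestamps[0],
            "last_capture": timestamps[-1],
            "captures": len(group),
        })
    # Rank URLs by their best-scoring capture, highest first.
    results.sort(key=lambda r: r["best_score"], reverse=True)
    return results


# Hypothetical hits: two captures of one page, one capture of another.
hits = [
    {"url": "http://a.example/", "timestamp": "20100501120000", "score": 0.4},
    {"url": "http://b.example/", "timestamp": "20110101120000", "score": 0.7},
    {"url": "http://a.example/", "timestamp": "20110301120000", "score": 0.9},
]
for result in group_hits_by_url(hits):
    print(result["url"], result["first_capture"], result["last_capture"])
```

Taking the maximum score is only one choice; averaging across captures or favoring recent ones would give different rankings, which is part of what makes the question open.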
Standards
Standards for sharing, or providing computational access to, metadata or full content.
The changing web
Flash and HTML5 throw a monkeywrench in the web.
Cross-archive collections
There is no reason why our curators should only be using ‘our’ crawls. How can we build collections that span archives?
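As a rough sketch of what a collection spanning archives might look like as data, a collection could hold references to captures rather than the captures themselves, with each reference naming the archive that holds it. The class names, fields, and archive labels below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CaptureRef:
    """Pointer to one capture held by some archive (all fields hypothetical)."""
    archive: str    # label for the holding archive
    url: str        # original URL of the captured page
    timestamp: str  # capture time, YYYYMMDDhhmmss


@dataclass
class Collection:
    """A curator-built collection that stores references, not content."""
    name: str
    captures: list = field(default_factory=list)

    def add(self, ref):
        # Ignore exact duplicates so a capture is listed only once.
        if ref not in self.captures:
            self.captures.append(ref)

    def archives(self):
        """Return the sorted labels of every archive the collection draws on."""
        return sorted({ref.archive for ref in self.captures})


collection = Collection("Example topic")
collection.add(CaptureRef("archive-a", "http://example.org/", "20110101000000"))
collection.add(CaptureRef("archive-b", "http://example.org/", "20100601000000"))
collection.add(CaptureRef("archive-a", "http://example.org/", "20110101000000"))
print(collection.archives(), len(collection.captures))
```

Keeping only references means the curator's collection does not depend on which archive ran the crawl, though it does depend on each archive exposing stable identifiers for its captures.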
CDL’s Web Archiving Service
We build tools; curators build collections.
We are ready to be part of a global web archive infrastructure.
What next?
Thanks for having me, and thanks for listening.
[email protected]