CDL's Web Archiving System

CDL’s Web Archiving System

Erik Hetzner

UC3, California Digital Library

16 June 2011

Erik Hetzner (UC3, California Digital Library) CDL’s Web Archiving System 16 June 2011 1 / 24

Introduction

We don’t decide what to collect.

We don’t decide when to collect it.

We build tools to allow curators to make those decisions.

Introduction

Vital statistics

35 public archives

16 partners

2724 web sites

289,272,095 URLs (×2)

16.1 TB (×2)

Introduction

The Web Archiving Service

Introduction

Archive

Introduction

Search

Introduction

Site list

Introduction

Archived page

How we do it

Collection focus (unofficial)

Middle East political sites (Stanford)

Social movements (Tamiment, NYU)

California government sites (UC)

How we do it

Tools

Heritrix 1.14.x

Open-source Wayback

Nutchwax (moving to Solr)

CDL’s legacy Digital Preservation Repository

. . . and a lot of UI code

. . . ARC management, indexing scripts, etc.

Difficulties

Web archiving is easy*, but there are some difficulties.

Difficulties

Uneven coverage

We only crawl what our curators select.

Difficulties

Human selection

High precision; low recall.
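To make the trade-off concrete, here is a minimal sketch of how precision and recall could be computed for a curator-selected crawl list. The site names and counts are hypothetical and chosen only for illustration; they are not figures from the service.

```python
def precision_recall(selected: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a set of selected sites.

    precision = share of selected sites that are relevant
    recall    = share of relevant sites that were selected
    """
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 100 sites are relevant to a topic, and a curator
# hand-picks 10 of them, all of which really are relevant.
relevant = {f"site{i}.example" for i in range(100)}
selected = {f"site{i}.example" for i in range(10)}

precision, recall = precision_recall(selected, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case every selected site is relevant, so precision is perfect, but most relevant sites are never crawled, so recall is low.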

Difficulties

Scale

We are not Internet Archive scale, but we are big enough that it takes a long time to do anything.

Difficulties

Collection mismatch

Our crawls are organized into ‘collections’. Everybody [?] else has ‘one big archive’.

Difficulties

Politics

We are customer-driven: we need to convince customers that collaboration is good for them.

Possibilities

What’s on our plate

Deduplication

. . . requires a new index (Solr)

Moving to our new Merritt repository

Implementing Memento
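As a rough illustration of why deduplication needs an index, the sketch below keys each captured payload by a content digest and records later identical captures as references to the first one. This is an assumption-laden toy, not the service's implementation: the in-memory dictionary merely stands in for whatever index (Solr, per the slide) would actually hold the digests, and the capture identifiers and function names are hypothetical.

```python
import hashlib


def payload_digest(payload: bytes) -> str:
    """Digest of the response body, used as the deduplication key."""
    return "sha1:" + hashlib.sha1(payload).hexdigest()


def deduplicate(captures):
    """Split captures into those to store and those that duplicate an earlier one.

    captures: iterable of (capture_id, payload) pairs in crawl order.
    Returns (stored_ids, duplicates), where duplicates is a list of
    (duplicate_id, original_id) pairs.
    """
    index = {}  # digest -> id of the first capture with that payload
    stored, duplicates = [], []
    for capture_id, payload in captures:
        digest = payload_digest(payload)
        if digest in index:
            duplicates.append((capture_id, index[digest]))
        else:
            index[digest] = capture_id
            stored.append(capture_id)
    return stored, duplicates


# Hypothetical captures: the same page fetched unchanged in two crawls.
captures = [
    ("crawl1/a", b"<html>home</html>"),
    ("crawl2/a", b"<html>home</html>"),
    ("crawl2/b", b"<html>news</html>"),
]
stored, duplicates = deduplicate(captures)
print(stored)
print(duplicates)
```

The lookup by digest is the step that requires an index: every new capture has to be checked against everything already held.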

Possibilities

Evaluating community needs

What do we have that you need? What do you have that we need?

Possibilities

Collaboration with researchers

The hard, fun problems are not necessarily the ones that we need to be solved.

But maybe we can work it out.

Possibilities

Temporal search

How can we rank (and display) results across time?
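The question is open, but one possible starting point can be sketched: collapse search hits that are captures of the same URL into a single result, rank results by their best-scoring capture, and list each result's captures chronologically for display. Everything here is an assumption made for illustration, including the hit format, the scores, and the function name; it is not a description of how the service ranks results.

```python
from collections import defaultdict
from datetime import datetime


def group_hits_by_url(hits):
    """Collapse per-capture hits into one result per URL.

    hits: iterable of (url, capture_time, score) tuples.
    Returns a list of (url, best_score, capture_times), ordered by
    best_score descending, with capture_times in chronological order.
    """
    grouped = defaultdict(list)
    for url, capture_time, score in hits:
        grouped[url].append((capture_time, score))

    results = []
    for url, captures in grouped.items():
        captures.sort()  # chronological: tuples sort by capture_time first
        best_score = max(score for _, score in captures)
        results.append((url, best_score, [t for t, _ in captures]))

    results.sort(key=lambda result: result[1], reverse=True)
    return results


# Hypothetical hits: two captures of one page, one capture of another.
hits = [
    ("http://example.org/", datetime(2009, 5, 1), 0.4),
    ("http://example.org/", datetime(2010, 5, 1), 0.9),
    ("http://example.org/news", datetime(2010, 6, 1), 0.7),
]
results = group_hits_by_url(hits)
for url, best_score, times in results:
    print(url, best_score, [t.date().isoformat() for t in times])
```

Ranking by the best capture is only one choice; weighting by recency or by how much a page changed between captures would give different orderings.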

Possibilities

Standards

Standards for sharing, or providing computational access to, metadata or full content.

Possibilities

The changing web

Flash and HTML5 throw a monkeywrench in the web.

Possibilities

Cross-archive collections

There is no reason why our curators should only be using ‘our’ crawls. How can we build collections that span archives?

Possibilities

CDL’s Web Archiving Service

We build tools; curators build collections.

We are ready to be part of a global web archive infrastructure.

What next?

Thanks for having me, and thanks for [email protected]
