Tools Overview - UNT Digital Library

Tools Overview Mark Phillips June 4, 2008 Texas Conference on Digital Libraries


Tools Overview

Mark Phillips

June 4, 2008

Texas Conference on Digital Libraries

Outline

Crawlers

Heritrix

HTTrack

wget

Display

Wayback Machine

WERA

Locally Developed Tools

Search

NutchWAX

wget

Powerful command line tool

Part of the GNU Project

First released in 1996

Used by SDSC for large scale NSDL crawling

Simple example:

wget -rx http://example.org
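For repeated crawls, the one-liner above can be wrapped in a small script. The following is a minimal sketch in Python and is not part of the original talk: the function names `build_wget_command` and `crawl` and their parameters are hypothetical. It relies only on standard wget options: `-r` (recursive), `-x` (force directory creation), `--level` (maximum recursion depth), and `--wait` (seconds to pause between requests).

```python
import subprocess


def build_wget_command(url, depth=None, wait_seconds=None):
    """Return the wget argument list for a recursive crawl of `url`.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # -r: recursive retrieval, -x: force creation of directories
    cmd = ["wget", "-r", "-x"]
    if depth is not None:
        # Limit how many link levels wget follows from the start page
        cmd.append(f"--level={depth}")
    if wait_seconds is not None:
        # Pause between requests to be polite to the remote server
        cmd.append(f"--wait={wait_seconds}")
    cmd.append(url)
    return cmd


def crawl(url, **options):
    """Run wget; raises CalledProcessError if wget exits with an error."""
    subprocess.run(build_wget_command(url, **options), check=True)
```

Calling `build_wget_command("http://example.org")` yields the same command as the slide; passing `depth` or `wait_seconds` adds the corresponding option.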

HTTrack

Website Copier and Offline Browser

Written by Xavier Roche

Graphical user interface

Easy to use for small to medium sized crawls

Rewrites links for easy offline browsing and hosting of harvested content.

Drawback:

Doesn't create “archival” web crawls at this time.

Heritrix

Developed by the Internet Archive

Open source, written in Java

Interfaces: command line, JMX, web-based GUI

Highly scalable, customizable

Growing developer community

Used by many National Libraries, Universities and private companies.

Writes “archival quality” output of crawls

Writes ARC and WARC files natively

Wayback

Wayback is an open source Java implementation of the Internet Archive Wayback Machine.

The original Wayback Machine was written in Perl, and IP-related issues keep it from being released.

Used to “play back” a harvested collection stored in a group of ARC/WARC files

Wayback Archival URL

Standard way of serving content back to users.

Similar to IA's Wayback Machine

Users search on a URL

Users are presented with a list of capture dates

User selects a date and is presented with the content

Unique URLs: http://web.archive.org/web/20060610172052/http://tdl.org/
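The archival URL above combines a capture timestamp with the original URL. The following is a minimal Python sketch of how such a URL might be taken apart and put back together. It assumes the timestamp is always 14 digits (`YYYYMMDDhhmmss`) and that the access point path is `/web/`, as in the example; the function names are hypothetical and this is not Wayback's own code.

```python
import re
from datetime import datetime

# Assumed layout: <access point>/web/<14-digit timestamp>/<original URL>
ARCHIVAL_URL = re.compile(
    r"^(?P<prefix>https?://[^/]+/web/)(?P<timestamp>\d{14})/(?P<original>.+)$"
)

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def parse_archival_url(url):
    """Split an archival URL into (capture datetime, original URL)."""
    match = ARCHIVAL_URL.match(url)
    if match is None:
        raise ValueError(f"not an archival URL: {url!r}")
    captured = datetime.strptime(match.group("timestamp"), TIMESTAMP_FORMAT)
    return captured, match.group("original")


def build_archival_url(captured, original,
                       prefix="http://web.archive.org/web/"):
    """Build an archival URL from a capture datetime and an original URL."""
    return f"{prefix}{captured.strftime(TIMESTAMP_FORMAT)}/{original}"
```

For the example URL, `parse_archival_url` returns a capture time of June 10, 2006 at 17:20:52 and the original URL `http://tdl.org/`.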

Wayback (cont)

URLs are rewritten using JavaScript that Wayback embeds in the HTML. The client's browser executes the JavaScript, which rewrites the links on the page.

A non-JavaScript version of Wayback is in development.

Installations using this tool can grow quite large (millions of pages easily)
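The link rewriting described on this slide happens in the browser via JavaScript. Purely to illustrate the transformation itself, here is a minimal sketch in Python (a swapped-in language, not Wayback's implementation). The function name `rewrite_links`, the regular expression, and the default prefix are all assumptions; the sketch only handles absolute `http`/`https` URLs in quoted `href` and `src` attributes and would rewrite an already-archival link a second time.

```python
import re

# Matches href="..." or src="..." whose value is an absolute http(s) URL
LINK_ATTR = re.compile(
    r'''(?P<attr>\b(?:href|src)\s*=\s*)(?P<quote>["'])(?P<url>https?://[^"']+)(?P=quote)''',
    re.IGNORECASE,
)


def rewrite_links(html, timestamp, prefix="http://web.archive.org/web/"):
    """Point absolute links in `html` at the archived copy for `timestamp`."""

    def replace(match):
        # Prepend the access point and capture timestamp to the original URL
        archival = f"{prefix}{timestamp}/{match.group('url')}"
        quote = match.group("quote")
        return f"{match.group('attr')}{quote}{archival}{quote}"

    return LINK_ATTR.sub(replace, html)
```

Relative links such as `/logo.png` are left untouched by this sketch; a fuller version would first resolve them against the URL of the captured page.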

NutchWAX

Nutch + WAX (Web Archiving eXtension)

Provides search access to web archive collections.

Built on top of Nutch, Lucene, Hadoop

Used to provide search to Archive-It collections

Testing for 20th century search

Questions?