Tools Overview - UNT Digital Library
Outline
Crawlers: Heritrix, HTTrack, wget
Display: Wayback Machine, WERA, Locally Developed Tools
Search: NutchWAX
wget
Powerful command line tool
Part of the GNU Project
First released in 1996
Used by SDSC for large scale NSDL crawling
simple example:
wget -rx http://example.org
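The short flags in the example can be spelled out. The following is a minimal sketch, assuming the crawl is launched from a Python script: -r is the short form of --recursive and -x is the short form of --force-directories. The helper name build_wget_command is hypothetical, and the command is only assembled and printed here, not executed.

```python
import shlex


def build_wget_command(url):
    """Return the argument list equivalent to `wget -rx <url>`.

    build_wget_command is a hypothetical helper for illustration,
    not part of wget itself.
    """
    return [
        "wget",
        "--recursive",          # long form of -r: follow links and fetch them too
        "--force-directories",  # long form of -x: always create a host/path directory tree
        url,
    ]


command = build_wget_command("http://example.org")
print(shlex.join(command))
```

To actually start the crawl, the list could be passed to subprocess.run(command, check=True).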
HTTrack
Website Copier and Offline Browser
Written by Xavier Roche
GUI interface
Easy to use for small to medium sized crawls
Rewrites links for easy offline browsing and hosting of harvested content.
Drawback:
Doesn't create “archival” web crawls at this time.
Heritrix
Developed by the Internet Archive
Open source, written in Java
Command Line, JMX, Web Based GUI
Highly scalable, customizable
Growing developer community
Used by many National Libraries, Universities and private companies.
Writes “archival quality” output of crawls
Writes arc and warc files natively
Wayback
Wayback is an open source Java implementation of the Internet Archive Wayback Machine.
The original Wayback Machine was written in Perl and has some IP-related issues that keep it from being released.
Is used to “play back” a harvested collection contained in a group of arc/warc files
Wayback Archival URL
Standard way of serving content back to users.
Similar to IA's Wayback Machine
Users search on a URL
Presented with a list of captured dates
User selects date and is presented with content
Unique URLs: http://web.archive.org/web/20060610172052/http://tdl.org/
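The structure of the example URL can be sketched in code. This is a rough illustration, assuming the 14-digit segment is a capture timestamp in YYYYMMDDhhmmss form followed by the original URL; the function names build_archival_url and parse_archival_url are hypothetical and not part of Wayback.

```python
from datetime import datetime

# Prefix taken from the example URL on the slide.
ARCHIVE_PREFIX = "http://web.archive.org/web/"
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # assumed layout: YYYYMMDDhhmmss


def build_archival_url(captured_at, original_url, prefix=ARCHIVE_PREFIX):
    """Combine a capture time and the original URL into one archival URL."""
    return f"{prefix}{captured_at.strftime(TIMESTAMP_FORMAT)}/{original_url}"


def parse_archival_url(archival_url, prefix=ARCHIVE_PREFIX):
    """Split an archival URL into (capture time, original URL)."""
    if not archival_url.startswith(prefix):
        raise ValueError("not an archival URL under the expected prefix")
    # Everything up to the first "/" after the prefix is the timestamp.
    timestamp, _, original_url = archival_url[len(prefix):].partition("/")
    return datetime.strptime(timestamp, TIMESTAMP_FORMAT), original_url


example = "http://web.archive.org/web/20060610172052/http://tdl.org/"
captured_at, original = parse_archival_url(example)
print(captured_at.isoformat(), original)
# prints: 2006-06-10T17:20:52 http://tdl.org/
```

Because the capture time is part of the URL, two captures of the same page on different dates get different addresses.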
Wayback (cont)
URLs are rewritten using JavaScript that Wayback embeds in the HTML. The client's browser executes the JavaScript, which rewrites the links on the page.
A non-JavaScript version of Wayback is in development.
Installations using this tool can grow quite large (millions of pages easily)
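The link rewriting described above can be illustrated with a small sketch. Wayback does this in JavaScript in the browser; the version below shows the same transformation in Python and is not Wayback's actual code. It assumes each link is resolved against the page's original URL and then prefixed with the archive prefix and the capture timestamp. The function name rewrite_link and the page path /members/ are hypothetical.

```python
from urllib.parse import urljoin

# Prefix taken from the example archival URL in the slides.
ARCHIVE_PREFIX = "http://web.archive.org/web/"


def rewrite_link(href, page_url, timestamp, prefix=ARCHIVE_PREFIX):
    """Point a link found on an archived page back into the archive."""
    # Resolve relative links against the page's original (live web) URL.
    absolute = urljoin(page_url, href)
    # Prefix with the archive location and the capture timestamp of the page.
    return f"{prefix}{timestamp}/{absolute}"


page_url = "http://tdl.org/members/"  # hypothetical page inside the example site
timestamp = "20060610172052"          # capture timestamp from the slide's example URL

for href in ["/about/", "contact.html", "http://example.org/"]:
    print(rewrite_link(href, page_url, timestamp))
```

With links rewritten this way, a user who clicks through an archived page stays inside the archive at roughly the same capture date instead of landing on the live site.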
NutchWAX
Nutch + WAX (Web Archiving eXtension)
Provides search access to web archive collections.
Built on top of Nutch, Lucene, Hadoop
Used to provide search to Archive-It collections
Testing for 20th century search