Preserving the web

Post on 19-Jan-2015


An overview of web archiving and of the experience of Texas A&M University - Commerce using the Internet Archive's Archive-It service.


Preserving the Web: One institution’s foray into Digital Preservation through Web Archiving

Jeremy Floyd
Texas A&M University – Commerce

Jeremy.Floyd@tamuc.edu
twitter @jjamesfloyd

Why save the web?

Google Data Center. The Dalles, Oregon 2012 <http://www.google.com/about/datacenters/gallery/>

Approaches and Considerations

• Do It Yourself Approach
• IT infrastructure
• Level of ‘In-house’ Expertise
• Long Term Digital Preservation
• Hosted Solutions
• Annual Expenditure
• Options for Joining a Consortium or Collaborative

Alington, Greg. 1936. “A Book Mark Would be Better.” Made for the Illinois WPA Art Project. From Library of Congress Prints and Photographs Online Catalog <http://www.loc.gov/pictures/item/2011645389/>

HTTrack

• Free open source software
• Allows downloading of websites to a local drive
• Preserves content and structure of target sites

OCLC Web Harvester

• Runs OCLC’s own Webcrawler
• Can Import Directly into CONTENTdm and Connexion Catalog
• Discoverable in WorldCat
• Can be Saved in OCLC Digital Archive

California Digital Library Web Archiving Service

• Free to join for all UC departments and organizations (charged only for storage)

• Fee based subscription service for all other institutions

• Utilizes the Heritrix web crawler for capture, Wayback for display, and the Nutchwax search engine for search

• 56 public archives
• 21 partners
• 4407 web sites
• 616,585,489 documents
• 32.3 TB of data

The Internet Archive's Archive-It

• Subscription Service
• Heritrix web crawler
• Nutchwax search engine
• Wayback Machine browser
- All developed and maintained by the Internet Archive

• More than 225 partner organizations
• 5,214,935,471 URLs in 2,056 collections
• Partners in 45 states and 15 countries, including university libraries, state archives, historical societies, federal institutions, NGOs, public libraries, and museums

Texas A&M University – Commerce partnered with Archive-It

Gathering Support Among Constituencies and Stakeholders

All aboard! Liberty Bond fourth issue Sept. 28 - Oct. 19, 1918. From Library of Congress Prints and Photographs Online Catalog <http://www.loc.gov/pictures/item/00652400/>

Selecting Seed URLs

University Websites
http://www.tamuc.edu/
http://web.tamuc.edu/
http://catalog.tamuc.edu/
http://pride.tamuc.edu/
http://www.tamu-commercedining.com/
http://tamuc.orgsync.com/
http://www.lionathletics.com/

Facebook
http://www.facebook.com/tamucommerce/
http://www.facebook.com/TAMUCLibraries/
http://www.facebook.com/pages/AM-Commerce-Lion-Athletics/242136009137926?ref=ts/
http://www.facebook.com/TAMUCspirit/
http://www.facebook.com/tamucalumni/

Twitter
http://twitter.com/TAMU_Commerce/
http://twitter.com/Lion_Athletics/
http://twitter.com/ketrradio/
http://twitter.com/TheEastTexan/
http://twitter.com/LionsAfterDark/
http://twitter.com/TAMUC_News/
http://twitter.com/LionSafety/
http://twitter.com/TAMUCalumni/
http://twitter.com/TAMUC_Mesquite/

YouTube
http://www.youtube.com/user/LionsMedia/

University News and Media
http://www.ketr.org/
http://TheEastTexanOnline.com
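A seed list of this kind can be sanity-checked before it is handed to a crawler. The following is a minimal Python sketch and not part of the original presentation: the helper name `group_seeds_by_host` is hypothetical, only a handful of the seeds are shown, and a duplicate entry has been added on purpose to show the de-duplication.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# A few seeds from the slide; the full list would be handled the same way.
SEEDS = [
    "http://www.tamuc.edu/",
    "http://web.tamuc.edu/",
    "http://catalog.tamuc.edu/",
    "http://www.facebook.com/tamucommerce/",
    "http://www.facebook.com/TAMUCLibraries/",
    "http://twitter.com/TAMU_Commerce/",
    "http://twitter.com/TAMU_Commerce/",  # deliberate duplicate for the demo
    "http://TheEastTexanOnline.com",
]


def group_seeds_by_host(seeds):
    """Hypothetical helper: drop duplicate seeds and group the rest by host."""
    groups = defaultdict(list)
    seen = set()
    for seed in seeds:
        parts = urlsplit(seed.strip())
        host = parts.netloc.lower()  # host names are case-insensitive
        key = (host, parts.path.rstrip("/"), parts.query)
        if key in seen:
            continue
        seen.add(key)
        groups[host].append(seed)
    return dict(groups)


if __name__ == "__main__":
    for host, urls in sorted(group_seeds_by_host(SEEDS).items()):
        print(host, len(urls))
```

Grouping by host makes it easy to see how much of a collection depends on third-party platforms (Facebook, Twitter, YouTube) as opposed to sites the institution controls.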

Managing Scope and Frequency of Crawls

robots.txt

“Robots- Electro and Sparko” 1940. Still image. Computer History Museum <http://www.computerhistory.org/collections/accession/102693536>
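A robots.txt file tells crawlers which paths a site owner does not want fetched, so it directly affects what ends up in a web archive. The sketch below uses Python's standard `urllib.robotparser` to show how such rules are evaluated. The robots.txt content, the `www.example.edu` host, and the `example-archive-bot` user agent are all invented for illustration; they are not taken from the presentation or from any real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is kept out of /private/,
# except one (made-up) archiving crawler that is allowed everywhere.
ROBOTS_TXT = """\
User-agent: example-archive-bot
Disallow:

User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.edu/private/report.html"
    print(allowed("example-archive-bot", url))
    print(allowed("SomeOtherBot", url))
```

When a site blocks content that belongs in the collection, the usual options are to ask the site owner to add an exception for the archiving crawler or to document the gap.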

Crawler Traps

“It’s A Trap” 2010. Know Your Meme <http://knowyourmeme.com/memes/its-a-trap>
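Crawler traps are URL patterns that generate an effectively endless supply of links, such as calendars that always offer a "next month" page or paths that keep repeating the same directories. As a rough illustration only, the Python sketch below flags URLs whose paths are very deep or repeat a segment several times. The function name and both thresholds are assumptions chosen for the example, not rules used by Heritrix or Archive-It.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url, max_depth=10, max_repeats=3):
    """Heuristic only: flag very deep paths or paths that repeat a segment."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    counts = Counter(segments)
    return any(n >= max_repeats for n in counts.values())


if __name__ == "__main__":
    for candidate in (
        "http://www.example.edu/calendar/2015/01/",
        "http://www.example.edu/a/b/a/b/a/b/",
    ):
        print(candidate, looks_like_trap(candidate))
```

A heuristic like this produces false positives, so it is better suited to reviewing crawl reports by hand than to blocking URLs automatically.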

Adding Descriptive Metadata

Rebecca Goldman. 2009. “Core Values.” Derangement and Description. <http://derangementanddescription.wordpress.com/2009/07/13/core-values/>
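Descriptive metadata for an archived seed can be as simple as a handful of labelled fields. The slide does not name a schema, so the Python sketch below assumes Dublin Core-style elements. The `build_record` helper, the `record` wrapper element, and the field values are illustrative and are not taken from the university's actual collection.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace (an assumed schema choice for this sketch).
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Illustrative values only.
SEED_METADATA = {
    "title": "Texas A&M University - Commerce home page",
    "publisher": "Texas A&M University - Commerce",
    "identifier": "http://www.tamuc.edu/",
}


def build_record(fields):
    """Hypothetical helper: serialize a dict of fields as dc:* elements."""
    root = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_record(SEED_METADATA))
```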

Establishing a Workflow

Access and Future Growth

Further Resources
• Niu, Jinfang. 2012. “An Overview of Web Archiving.” D-Lib Magazine 18(3/4). http://www.dlib.org/dlib/march12/niu/03niu1.html
• LOC Signal Blog: http://blogs.loc.gov/digitalpreservation/
• International Internet Preservation Consortium (IIPC) http://netpreserve.org/
• International Web Archiving Workshop (2001 – 2010) http://www.iwaw.net/
• Society of American Archivists: Web Archiving Roundtable http://www2.archivists.org/groups/web-archiving-roundtable/

email: Jeremy.Floyd@tamuc.edu
twitter: @jjamesfloyd

http://www.slideshare.net/jjamesfloyd/preserving-the-web/