Preserving the web

Preserving the Web: One institution’s foray into Digital Preservation through Web Archiving
Jeremy Floyd
Texas A&M University – Commerce
Twitter: @jjamesfloyd

description

An overview of web archiving and of Texas A&M University – Commerce's experience using the Internet Archive's Archive-It service.

Transcript of Preserving the web

Page 1: Preserving the web

Preserving the Web: One institution’s foray into Digital Preservation through Web Archiving

Jeremy Floyd
Texas A&M University – Commerce
@jjamesfloyd

Page 2: Preserving the web

Why save the web?

Google Data Center. The Dalles, Oregon 2012 <http://www.google.com/about/datacenters/gallery/>

Page 3: Preserving the web

Approaches and Considerations

• Do It Yourself Approach
• IT infrastructure
• Level of ‘In-house’ Expertise
• Long Term Digital Preservation

• Hosted Solutions
• Annual Expenditure
• Options for Joining a Consortium or Collaborative

Alington, Greg. 1936. “A Book Mark Would be Better.” Made for the Illinois WPA Art Project. from Library of Congress Print and Photographs Online Catalog <http://www.loc.gov/pictures/item/2011645389/>

Page 4: Preserving the web

HTTrack

• Free open source software
• Allows downloading of websites to a local drive
• Preserves content and structure of target sites
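As a rough illustration of the do-it-yourself route, the sketch below assembles an HTTrack command line from Python. It is not taken from the presentation: the output directory, the depth limit, and the helper name `build_httrack_command` are assumptions chosen for the example, and it only prints the command rather than running a crawl.

```python
def build_httrack_command(seed_url, output_dir, depth=3):
    """Return an argv list that would mirror seed_url into output_dir.

    The helper name and the default depth are illustrative assumptions.
    """
    return [
        "httrack",
        seed_url,
        "-O", output_dir,  # where the mirror, logs, and cache are written
        f"-r{depth}",      # limit how many links deep the mirror goes
        "-v",              # show progress on the terminal
    ]


if __name__ == "__main__":
    # The seed is one of the university sites; the output path is hypothetical.
    cmd = build_httrack_command("http://www.tamuc.edu/", "./tamuc-mirror")
    print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run` would start the mirror on a machine where HTTrack is installed. A local copy made this way still leaves long-term storage and preservation to the institution.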

Page 5: Preserving the web

OCLC Web Harvester

• Runs OCLC’s own Webcrawler
• Can Import Directly into CONTENTdm and Connexion Catalog
• Discoverable in WorldCat
• Can be Saved in OCLC Digital Archive

Page 6: Preserving the web

California Digital Library Web Archiving Service

• Free to join for all UC departments and organizations (charged only for storage)
• Fee-based subscription service for all other institutions
• Uses the Heritrix web crawler for capture, Wayback for display, and the Nutchwax search engine
• 56 public archives
• 21 partners
• 4407 web sites
• 616,585,489 documents
• 32.3 TB of data

Page 7: Preserving the web

The Internet Archive’s Archive-It

• Subscription Service
• Heritrix web crawler
• Nutchwax search engine
• Wayback Machine browser
- All developed and maintained by the Internet Archive

• More than 225 partner organizations
• 5,214,935,471 URLs in 2,056 collections
• Partners in 45 states and 15 countries, including university libraries, state archives, historical societies, federal institutions, NGOs, public libraries, and museums

Page 8: Preserving the web

Texas A&M University – Commerce partnered with Archive-It

Page 9: Preserving the web

Gathering Support Among Constituencies and Stakeholders

All aboard! Liberty Bond fourth issue Sept. 28 - Oct. 19, 1918. from Library of Congress Print and Photographs Online Catalog <http://www.loc.gov/pictures/item/00652400/>

Page 10: Preserving the web

Selecting Seed URLs

University Websites
http://www.tamuc.edu/
http://web.tamuc.edu/
http://catalog.tamuc.edu/
http://pride.tamuc.edu/
http://www.tamu-commercedining.com/
http://tamuc.orgsync.com/
http://www.lionathletics.com/

Facebook
http://www.facebook.com/tamucommerce/
http://www.facebook.com/TAMUCLibraries/
http://www.facebook.com/pages/AM-Commerce-Lion-Athletics/242136009137926?ref=ts/
http://www.facebook.com/TAMUCspirit/
http://www.facebook.com/tamucalumni/

Twitter
http://twitter.com/TAMU_Commerce/
http://twitter.com/Lion_Athletics/
http://twitter.com/ketrradio/
http://twitter.com/TheEastTexan/
http://twitter.com/LionsAfterDark/
http://twitter.com/TAMUC_News/
http://twitter.com/LionSafety/
http://twitter.com/TAMUCalumni/
http://twitter.com/TAMUC_Mesquite/

YouTube
http://www.youtube.com/user/LionsMedia/

University News and Media
http://www.ketr.org/
http://TheEastTexanOnline.com
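The seed list mixes university-run hosts with third-party platforms, so it can help to see how the seeds are distributed across hosts before setting crawl scope. The following is a minimal sketch of that grouping, not part of the original workflow; the function name `group_seeds_by_host` is hypothetical and only a few of the seeds are repeated here.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_seeds_by_host(seeds):
    """Group seed URLs by hostname, ignoring case and a leading 'www.'."""
    groups = defaultdict(list)
    for seed in seeds:
        seed = seed.strip()
        host = urlsplit(seed).hostname or ""  # hostname is lowercased
        if host.startswith("www."):
            host = host[4:]
        groups[host].append(seed)
    return dict(groups)


if __name__ == "__main__":
    seeds = [
        "http://www.tamuc.edu/",
        "http://twitter.com/TAMU_Commerce/",
        "http://twitter.com/Lion_Athletics/",
        "http://TheEastTexanOnline.com",
    ]
    for host, urls in sorted(group_seeds_by_host(seeds).items()):
        print(host, len(urls))
```

Hosts that carry many seeds, such as the social media platforms, are natural candidates for tighter scope rules or a different crawl frequency.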

Page 11: Preserving the web

Managing Scope and Frequency of Crawls

Page 12: Preserving the web

robots.txt

“Robots- Electro and Sparko” 1940. still image. Computer History Museum <http://www.computerhistory.org/collections/accession/102693536>
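A site's robots.txt file tells crawlers which paths they are asked to stay out of, which in turn affects what a crawl can capture. The sketch below uses Python's standard `urllib.robotparser` to test URLs against a made-up robots.txt. The rules, the `example.edu` host, and the `archive.org_bot` user-agent string are assumptions for illustration, not details from the presentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would fetch the site's own file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /search
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    agent = "archive.org_bot"  # assumed crawler name, used only as an example
    for url in ("http://www.example.edu/about/",
                "http://www.example.edu/calendar/2012/05/"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

Running a check like this against each seed before a crawl shows which content would be excluded unless the site owner changes the file or the crawl is configured differently.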

Page 13: Preserving the web

Crawler Traps

“It’s A Trap” 2010. Know Your Meme <http://knowyourmeme.com/memes/its-a-trap>
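Crawler traps are parts of a site, such as endless calendars or looping relative links, that generate an effectively unlimited number of URLs. One simple way to spot likely traps in a crawl report is to flag URLs whose paths are unusually deep or repeat the same segment many times. The sketch below is a heuristic under assumed thresholds, not the rule set used by any particular crawler; `looks_like_trap` and its defaults are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url, max_depth=12, max_repeats=3):
    """Flag URLs with very deep paths or heavily repeated path segments.

    Both thresholds are illustrative assumptions and would need tuning.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    counts = Counter(segments)
    return any(n >= max_repeats for n in counts.values())


if __name__ == "__main__":
    for url in ("http://www.example.edu/library/hours/",
                "http://www.example.edu/a/b/a/b/a/b/"):
        print(url, looks_like_trap(url))
```

URLs flagged this way still need a human look, since some legitimate sites have deep or repetitive paths.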

Page 14: Preserving the web

Adding Descriptive Metadata

Rebecca Goldman. 2009. “Core Values.” Derangement and Description. <http://derangementanddescription.wordpress.com/2009/07/13/core-values/>

Page 15: Preserving the web

Establishing a Workflow

Page 16: Preserving the web

Access and Future Growth

Page 17: Preserving the web

Further Resources

• Niu, Jinfang. 2012. “An Overview of Web Archiving.” D-Lib Magazine 18(3/4). http://www.dlib.org/dlib/march12/niu/03niu1.html
• LOC Signal Blog: http://blogs.loc.gov/digitalpreservation/
• International Internet Preservation Consortium (IIPC): http://netpreserve.org/
• International Web Archiving Workshop (2001 – 2010): http://www.iwaw.net/
• Society of American Archivists: Web Archiving Roundtable: http://www2.archivists.org/groups/web-archiving-roundtable/

Twitter: @jjamesfloyd

http://www.slideshare.net/jjamesfloyd/preserving-the-web/