Preserving the web
Preserving the Web: One Institution’s Foray into Digital Preservation through Web Archiving
Jeremy Floyd
Texas A&M University – Commerce
[email protected] @jjamesfloyd
Why save the web?
Google Data Center. The Dalles, Oregon 2012 <http://www.google.com/about/datacenters/gallery/>
Approaches and Considerations
• Do It Yourself Approach
  • IT infrastructure
  • Level of ‘In-house’ Expertise
  • Long Term Digital Preservation
• Hosted Solutions
  • Annual Expenditure
  • Options for Joining a Consortium or Collaborative
Alington, Greg. 1936. “A Book Mark Would be Better.” Made for the Illinois WPA Art Project. from Library of Congress Print and Photographs Online Catalog <http://www.loc.gov/pictures/item/2011645389/>
HTTrack
• Free open source software
• Allows downloading of websites to a local drive
• Preserves content and structure of target sites
OCLC Web Harvester
• Runs OCLC’s own web crawler
• Can Import Directly into CONTENTdm and Connexion Catalog
• Discoverable in WorldCat
• Can be Saved in OCLC Digital Archive
California Digital Library: Web Archiving Service
• Free to join for all UC departments and organizations (charged only for storage)
• Fee-based subscription service for all other institutions
• Uses the Heritrix web crawler for capture, Wayback for display, and the NutchWAX search engine
• 56 public archives
• 21 partners
• 4,407 web sites
• 616,585,489 documents
• 32.3 TB of data
The Internet Archive: Archive-It
• Subscription Service
• Heritrix web crawler
• NutchWAX search engine
• Wayback Machine browser
- All developed and maintained by the Internet Archive
• More than 225 partner organizations
• 5,214,935,471 URLs in 2,056 collections
• Partners in 45 states and 15 countries, including university libraries, state archives, historical societies, federal institutions, NGOs, public libraries, and museums
Texas A&M University – Commerce partnered with Archive-It
Gathering Support Among Constituencies and Stakeholders
All aboard! Liberty Bond fourth issue Sept. 28 - Oct. 19, 1918. from Library of Congress Print and Photographs Online Catalog <http://www.loc.gov/pictures/item/00652400/>
Selecting Seed URLs
University Websites
http://www.tamuc.edu/
http://web.tamuc.edu/
http://catalog.tamuc.edu/
http://pride.tamuc.edu/
http://www.tamu-commercedining.com/
http://tamuc.orgsync.com/
http://www.lionathletics.com/
Facebook
http://www.facebook.com/tamucommerce/
http://www.facebook.com/TAMUCLibraries/
http://www.facebook.com/pages/AM-Commerce-Lion-Athletics/242136009137926?ref=ts/
http://www.facebook.com/TAMUCspirit/
http://www.facebook.com/tamucalumni/
Twitter
http://twitter.com/TAMU_Commerce/
http://twitter.com/Lion_Athletics/
http://twitter.com/ketrradio/
http://twitter.com/TheEastTexan/
http://twitter.com/LionsAfterDark/
http://twitter.com/TAMUC_News/
http://twitter.com/LionSafety/
http://twitter.com/TAMUCalumni/
http://twitter.com/TAMUC_Mesquite/
YouTube
http://www.youtube.com/user/LionsMedia/
University News and Media
http://www.ketr.org/
http://TheEastTexanOnline.com
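A seed list like this one effectively defines what a crawl may capture. The following is a minimal sketch of that idea, assuming a simplified rule in which a URL is in scope only when it shares a seed's host (ignoring a leading "www.") and sits under the seed's path. This rule and the function names are illustrative assumptions, not Archive-It's actual scoping logic, and the discovered URLs in the usage lines are hypothetical.

```python
from urllib.parse import urlsplit

# A few seeds taken from the slide above.
SEEDS = [
    "http://www.tamuc.edu/",
    "http://www.facebook.com/tamucommerce/",
    "http://twitter.com/TAMU_Commerce/",
]


def _host(url):
    """Return the lowercased host, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def in_scope(url, seeds):
    """Assumed rule: same host as a seed, and path under the seed's path."""
    target_host = _host(url)
    target_path = (urlsplit(url).path or "/") + "/"
    for seed in seeds:
        seed_path = urlsplit(seed).path or "/"
        if not seed_path.endswith("/"):
            seed_path += "/"
        if target_host == _host(seed) and target_path.startswith(seed_path):
            return True
    return False


# Hypothetical discovered URLs, for illustration only.
on_campus_site = in_scope("http://www.tamuc.edu/admissions/", SEEDS)
other_facebook_page = in_scope("http://www.facebook.com/someoneelse/", SEEDS)
other_subdomain = in_scope("http://web.tamuc.edu/", SEEDS)
```

Under this assumed rule, a subdomain such as web.tamuc.edu falls outside the scope of the www.tamuc.edu seed, which is one reason a seed list may name each subdomain and each social media account separately.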
Managing Scope and Frequency of Crawls
robots.txt
“Robots- Electro and Sparko” 1940. still image. Computer History Museum <http://www.computerhistory.org/collections/accession/102693536>
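A robots.txt file tells crawlers which paths a site owner does not want fetched, which in turn affects what a web archive can capture. As a rough illustration, the sketch below uses Python's standard `urllib.robotparser` to test URLs against a set of rules. The rules, the `example-archiver` user agent, and the `www.example.edu` URLs are all made up for the example and are not taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /search
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network request is made.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-archiver"):
    """Return True if the parsed rules permit this user agent to fetch url."""
    return parser.can_fetch(user_agent, url)


about_page = allowed("http://www.example.edu/about/")
calendar_page = allowed("http://www.example.edu/calendar/2012/03/")
```

Here the calendar page is blocked by the `Disallow: /calendar/` rule, while the about page is not matched by any rule and is therefore allowed.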
Crawler Traps
“It’s A Trap” 2010. Know Your Meme <http://knowyourmeme.com/memes/its-a-trap>
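A crawler trap is a part of a site that generates an effectively endless supply of URLs, such as a calendar with a "next" link on every page. The sketch below shows one simple way such URLs might be flagged, assuming two heuristics chosen for illustration: a path that is unusually deep, or a path segment that repeats several times. The thresholds and the example URLs are assumptions, and production crawlers use their own, more elaborate rules.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url, max_depth=8, max_repeats=3):
    """Heuristic check: flag very deep paths or heavily repeated segments."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    counts = Counter(segments)
    return any(n >= max_repeats for n in counts.values())


# Hypothetical URLs, for illustration only.
normal_page = looks_like_trap("http://www.example.edu/about/staff/")
looping_page = looks_like_trap("http://www.example.edu/events/next/next/next/next/")
```

A check like this trades precision for simplicity: it can miss traps built from query strings and can flag legitimate deep directory structures, so flagged URLs are better treated as candidates for review than as certain traps.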
Adding Descriptive Metadata
Rebecca Goldman. 2009. “Core Values.” Derangement and Description. <http://derangementanddescription.wordpress.com/2009/07/13/core-values/>
Establishing a Workflow
Access and Future Growth
Further Resources
• Niu, Jinfang. 2012. “An Overview of Web Archiving.” D-Lib Magazine 18(3/4). http://www.dlib.org/dlib/march12/niu/03niu1.html
• LOC Signal Blog: http://blogs.loc.gov/digitalpreservation/
• International Internet Preservation Consortium (IIPC): http://netpreserve.org/
• International Web Archiving Workshop (2001 – 2010): http://www.iwaw.net/
• Society of American Archivists, Web Archiving Roundtable: http://www2.archivists.org/groups/web-archiving-roundtable/
email: [email protected]
@jjamesfloyd
http://www.slideshare.net/jjamesfloyd/preserving-the-web/