Archival Web Research Datasets
www.archive.org
Internet Archive
• Established in 1996
• 501(c)(3) nonprofit organization
• Over twenty petabytes (compressed) of publicly accessible archival material
• Technology partner to libraries, archives, museums, universities, research institutes, and memory institutions
• Currently archiving books, texts, film, video, audio, images, software, educational content, and the Internet
IA Web Archive
• Began in 1996
• 415+ billion publicly accessible web instances
• Operate web-wide, survey, end-of-life, selective, and resource-specific harvests
• Develop freely available, open source web archiving and access tools
Approaches to Collecting…
• Thematic / topical collections
• Resource-specific crawls: PDFs, videos, etc.
• Exhaustive: end-of-life and closure crawls, national domain crawls for .au, .es, .fr, .il, .nz, .se, etc.
• Broad survey crawls: domain-wide for .org/.net/.edu/.gov/.com
• No More 404s project
Primary Methods of Web Harvest
• Proprietary web crawlers that harvest and preserve docs using the ARC file standard
– E.g.
• Open source web crawls that harvest and preserve docs using the W/ARC file standard
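To make the W/ARC container a little more concrete, here is a minimal sketch of reading a single uncompressed WARC 1.0 record with the Python standard library only. It assumes the basic record layout (a version line, named header fields, a blank line, then Content-Length bytes of content). The names read_warc_record and SAMPLE are hypothetical and do not come from the slides. Real crawl files are typically gzip-compressed and carry more header fields, so a dedicated library is usually the better choice in practice.

```python
import io


def read_warc_record(stream):
    """Read one WARC record from a binary stream.

    Returns (version, headers, payload), or None when the stream is exhausted.
    Simplified sketch: no gzip handling, no header line continuations.
    """
    line = stream.readline()
    # Skip the blank lines that separate consecutive records.
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None

    version = line.strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC record: {version!r}")

    # Named fields run until the first empty line.
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # Content-Length gives the size of the content block in bytes.
    payload = stream.read(int(headers["Content-Length"]))
    return version, headers, payload


# Hand-written illustrative record, not taken from any real crawl.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

if __name__ == "__main__":
    print(read_warc_record(io.BytesIO(SAMPLE)))
```

Calling read_warc_record repeatedly on the same stream walks through the records one at a time until it returns None.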
Public Data Extractions & Crawl Datasets: ArchiveHub
• Hurricane Katrina extraction
• Senate.gov/House.gov extractions
• NARA Congressional Crawls (2006 – 2012)
• Occupy Wall Street extraction
• Superstorm Sandy extraction
• US Media extraction
Public dataset: Wide – 00002 (2011)
• Available for download
• Crawl start: 09 Mar, 2011
• Crawl end: 23 Dec, 2011
• Captures: 2.7 Billion
• Unique URLs: 2.2 Billion
• Hosts: 29 Million
• Size: 80 TB
• Contact: [email protected]
Public dataset: Wide – 00005 (2012)
• Available soon!
• Crawl start: 30 Apr, 2012
• Crawl end: 11 Sep, 2012
• Captures: 11.6 Billion
• Unique URLs: 4.2 Billion
• Hosts: 31 Million
• Size: 360 TB
• Release paired with hackathons
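The two Wide crawl summaries can be compared with a little arithmetic. The sketch below derives captures per unique URL and average stored bytes per capture from the figures on the slides. It assumes that "TB" means decimal terabytes (10^12 bytes) and that the rounded slide figures are accurate enough for a rough comparison. The names CRAWLS and crawl_ratios are hypothetical.

```python
TB = 10**12  # assumption: decimal terabytes

# Figures copied from the two dataset slides.
CRAWLS = {
    "Wide-00002": {"captures": 2.7e9, "unique_urls": 2.2e9, "size_tb": 80},
    "Wide-00005": {"captures": 11.6e9, "unique_urls": 4.2e9, "size_tb": 360},
}


def crawl_ratios(stats):
    """Derive rough per-URL and per-capture figures from crawl totals."""
    return {
        "captures_per_url": stats["captures"] / stats["unique_urls"],
        "avg_bytes_per_capture": stats["size_tb"] * TB / stats["captures"],
    }


if __name__ == "__main__":
    for name, stats in CRAWLS.items():
        r = crawl_ratios(stats)
        print(f"{name}: {r['captures_per_url']:.2f} captures/URL, "
              f"{r['avg_bytes_per_capture'] / 1000:.1f} kB/capture")
```

Under these assumptions the 2012 crawl revisits each URL more than twice as often on average as the 2011 crawl, while the average stored size per capture stays at roughly 30 kB in both.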
Public dataset: .gov (1995 – 2013)
• Hosted by Altiscale (https://www.altiscale.com/)
• Date range: 1995 – September 30, 2013
• Captures: 4.1 Billion
• Size: 285 TB
• Deduped and compressed size is ~90 TB, plus indexes
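As a rough illustration of what deduplication and compression buy for the .gov dataset, the sketch below computes the reduction ratio implied by the two sizes on the slide. It treats "~90 TB" as exactly 90 TB and ignores the indexes, whose size the slide does not give. The function name storage_reduction is hypothetical.

```python
def storage_reduction(raw_tb, stored_tb):
    """Return (reduction ratio, fraction of space saved) for two sizes in TB."""
    ratio = raw_tb / stored_tb
    saved = 1 - stored_tb / raw_tb
    return ratio, saved


if __name__ == "__main__":
    # 285 TB raw, ~90 TB after deduplication and compression (from the slide).
    ratio, saved = storage_reduction(285, 90)
    print(f"about {ratio:.1f}x smaller, about {saved:.0%} saved")
```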
Additional Extracted Datasets
Extracted longitudinal web data for:
• .uk
• .pt
• .ie
• .il
• .is
• .dk
• .fr
• .de (in process)
Contact respective national libraries for access!
500,000+; 1,643,000+; 2,000,000+; 2,500,000+; 6,300,000+; 415,000,000,000+
Books; Moving Images; Audio Recordings; Hours of TV; Digital Texts; Archived Web Pages