Internet Archive: Archive-It and Contract Crawling, C. Mumma
Web Crawling Tools and Services from the Internet Archive: Archive-It and Contract Crawling
Courtney C. Mumma, Internet Archive
November 17, 2016 - Dutch Institute for Sound and Vision
Talk overview
● Archiving the web at IA
● Partnerships and services
○ Contract crawls
○ Archive-It
■ Research Services
○ Interoperability & Distributed Preservation
● New technology for new challenges
The Internet Archive
Non-Profit Library
Founded in 1996 by Brewster Kahle
Universal Access to All Knowledge
30,000,000,000,000,000 Bytes Archived (30 PetaBytes)
20 Years of Archiving the Web
500,000,000,000+ URLs
Global Wayback
● Broad snapshot
● Deep crawl on popular sites
● Broad crawl on known domains
● No more 404s
● On-demand
● Donated and targeted crawls
● https://web-beta.archive.org/ with KEYWORD SEARCH and more!
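A minimal sketch of how a captured page in the Wayback Machine can be addressed, assuming the public replay URL pattern (`/web/<14-digit timestamp>/<original URL>`) and the public availability endpoint at `archive.org/wayback/available`. The function names are illustrative and are not taken from the talk. No network request is made here; the code only builds the URLs.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed public endpoints; not named in the slides.
WAYBACK_REPLAY = "https://web.archive.org/web"
AVAILABILITY_API = "https://archive.org/wayback/available"


def wayback_timestamp(moment):
    """Format a timezone-aware datetime as a 14-digit UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def replay_url(original_url, moment):
    """Build a replay URL for the capture closest to `moment`."""
    return f"{WAYBACK_REPLAY}/{wayback_timestamp(moment)}/{original_url}"


def availability_query(original_url, moment):
    """Build a query URL asking which capture is closest to `moment`."""
    params = urlencode(
        {"url": original_url, "timestamp": wayback_timestamp(moment)}
    )
    return f"{AVAILABILITY_API}?{params}"


if __name__ == "__main__":
    when = datetime(2016, 11, 17, 12, 0, 0, tzinfo=timezone.utc)
    print(replay_url("http://example.com/", when))
    print(availability_query("http://example.com/", when))
```

The Wayback Machine redirects a replay URL to the nearest capture it holds, so the timestamp does not need to match a capture exactly.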
Contract Crawling
Domain-scale • Run by Internet Archive • Average 300 million URLs per collection
Partial List of Partners
• National Libraries of Australia and New Zealand
• U.S. National Archives and Library of Congress
• Luxembourg National Library
• Israel National Library
Partial List of Collections
• Iraq War (2003-2011)
• 2005 US Supreme Court Nominations
Archive-It
Web based - nothing to install
Fully hosted service with unlimited support
Simple to select, manage, scope and catalog with metadata
10 different crawl frequencies
Includes quick access and storage
html, videos, audio, social media, PDFs, images, news
Full text search
Restricted access options
How our partners use Archive-It
● Enhance and supplement traditional offline collections
○ archives, topical collections
● Support records retention and archival policies
● Capture event-based content
○ Spontaneous
○ Planned
● Individual organizations and consortial collaboration
Goals of Archive-It Research Services
● Expand access models for web archives
● Enable new insights into collections
● Leverage Internet Archive infrastructure for large-scale processing to produce datasets for research
● Facilitate computational analysis and new use cases
● Increase use, visibility, and value of Archive-It partner collections
Web Archives Datasets
Archive-It Research Services
http://bit.ly/ait_ars
Lost in the maze in Labyrinth (1986, LucasFilm, screen capture)
WARCs, CDXs and derivatives
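To make the CDX index format concrete, here is a minimal sketch of reading one index line. It assumes the commonly seen 11-field, space-separated layout; real CDX files declare their own field order in a header line, so the field list below is an assumption rather than the layout of any particular Archive-It dataset. The sample line is made up for illustration.

```python
# Assumed 11-field CDX layout; real files may use a different field order.
CDX11_FIELDS = (
    "urlkey",      # canonicalized (SURT-style) form of the URL
    "timestamp",   # 14-digit capture time
    "original",    # URL as crawled
    "mimetype",
    "statuscode",
    "digest",      # content hash
    "redirect",
    "metatags",
    "length",      # compressed record length in bytes
    "offset",      # byte offset of the record in the WARC file
    "filename",    # WARC file holding the record
)


def parse_cdx_line(line, fields=CDX11_FIELDS):
    """Split one space-separated CDX line into a dict keyed by field name."""
    parts = line.split()
    if len(parts) != len(fields):
        raise ValueError(
            f"expected {len(fields)} fields, found {len(parts)}"
        )
    return dict(zip(fields, parts))


if __name__ == "__main__":
    # Hypothetical sample line, not taken from a real index.
    sample = (
        "com,example)/ 20161117120000 http://example.com/ text/html 200 "
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 1234 5678 example.warc.gz"
    )
    record = parse_cdx_line(sample)
    print(record["timestamp"], record["original"], record["filename"])
```

The `offset` and `length` fields are what let an access tool read a single capture out of a large WARC file without scanning the whole file.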
Access
Storage
Preservation
Content Mgmt
Web Archiving Tools
APIs (application programming interfaces)
● Interoperability
● Flexibility and modularity
● Loose coupling of services (so we can improve pieces as needed)
● Scalability - Bulk data upload and download
Ongoing efforts
• Open Wayback
• Social media / Dynamic content
– Brozzler and Umbra (Archive-It)
– Social Feed Manager (GWU)
• URL nomination tools (UNT)
• Capture tools (GWU, IA, Rhizome)
• WASAPI - Community building and API
• Memento
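As an illustration of the Memento item above: the Memento protocol (RFC 7089) has a client ask for a past version of a page by sending an `Accept-Datetime` request header to a TimeGate. The sketch below only builds that header. The helper name is hypothetical, and which TimeGate to send the header to is left to the reader.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(moment):
    """Return request headers asking a TimeGate for the capture near `moment`.

    `moment` must be timezone-aware; it is converted to UTC because
    Accept-Datetime values are expressed in GMT.
    """
    utc_moment = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_moment, usegmt=True)}


if __name__ == "__main__":
    when = datetime(2016, 11, 17, 12, 0, 0, tzinfo=timezone.utc)
    print(memento_headers(when))
```

A TimeGate that supports the protocol typically answers with a redirect to the closest capture and a `Link` header pointing to related captures.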
BROZZLER!