Gordon Mohr Chief Technologist, Web Projects Internet Archive An Introduction To Heritrix.
Gordon Mohr
Chief Technologist, Web Projects
Internet Archive
An Introduction To Heritrix
Web Collection
• Since 1996
• Over 4 × 10^10 resources (URI + time)
• Over 400 TB (compressed)
Web Collection: via Alexa
• Alexa Internet
  – Private company
  – Crawling for IA since 1996
• 2-month rolling snapshots
  – Recent: 3 billion URIs, 35 million websites, 20 TB
• Crawling software
  – Sophisticated
  – Weighted towards popular sites
  – Proprietary: we only receive the data
Heritrix: Motivations #1
• Deeper, specialized, in-house crawling
  – Sites of topical interest
  – Contractual crawls for libraries and governments
    • US Library of Congress
      – Elections, current events, government websites
    • UK Public Records Office, US National Archives
      – Government websites
  – Using our own software & machines
Heritrix: Motivations #2
• Open source
  – Encourage collaboration on features and best practices
  – Avoid duplication of work, incompatibilities
• Archival-quality
  – Perfect copies
  – Keep up with changing web
  – Meet evolving needs of Internet Archive and International Internet Preservation Consortium
Heritrix
New
Open-source
Extensible
Web-scale
Archival-quality
Web crawling software
Heritrix: Use Cases
• Broad Crawling
  – Large, as-much-as-possible
• Focused Crawling
  – Collect specific sites/topics deeply
• Continuous Crawling
  – Revisit changed sites
• Experimental Crawling
  – Novel approaches
Heritrix: Project
• Heritrix means heiress
• Java, modular
• Project website: http://crawler.archive.org
  – News, downloads, documentation
  – Sourceforge: open source hosting site
    • Source-code control (CVS)
    • Issue databases
• “Lesser” GPL license
• Outside contributions
Heritrix: Milestones
• Summer 2003: Prototypes created and tested against existing crawlers; requirements collected from IA and IIPC
• October 2003-April 2004: Nordic Web Archive programmers join project, add capabilities
• January 2004: First public beta (0.2.0)
  – Used for all in-house crawling since
• February & June 2004: Workshops for Heritrix users at national libraries
• August 2004: Version 1.0.0 released
Heritrix: Architecture
• Basic loop:
  1. Choose a URI from among all those scheduled
  2. Fetch that URI
  3. Analyze or archive the results
  4. Select discovered URIs of interest, and add to those scheduled
  5. Note that the URI is done and repeat
• Parallelized across threads (and eventually, machines)
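The basic loop above can be sketched in a few lines of Python. This is an illustrative, single-threaded sketch only, not Heritrix code (Heritrix itself is written in Java). The `crawl` function, the `fetch` and `in_scope` callables, and the in-memory `FAKE_WEB` table are all hypothetical stand-ins introduced for the example.

```python
from collections import deque

# Hypothetical in-memory "web": URI -> list of URIs that the page links to.
FAKE_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://other.example/"],
    "http://example.org/b": [],
    "http://other.example/": [],
}

def crawl(seeds, fetch, in_scope):
    """Run the five-step loop until nothing is left to fetch."""
    scheduled = deque(seeds)   # URIs waiting to be fetched
    seen = set(seeds)          # every URI ever scheduled
    done = []                  # finished URIs, in fetch order
    while scheduled:
        uri = scheduled.popleft()        # 1. choose a scheduled URI
        links = fetch(uri)               # 2. fetch it
        # 3. analysis/archiving of the result would happen here (omitted)
        for link in links:               # 4. keep discovered URIs of interest
            if link not in seen and in_scope(link):
                seen.add(link)
                scheduled.append(link)
        done.append(uri)                 # 5. note the URI is done, repeat
    return done

order = crawl(
    ["http://example.org/"],
    fetch=lambda uri: FAKE_WEB.get(uri, []),
    in_scope=lambda uri: uri.startswith("http://example.org/"),
)
print(order)
```

A parallel version would have several worker threads share the scheduled and seen structures; that is left out here to keep the loop itself visible.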
Key components of Heritrix
• Scope: which URIs should be included (seeds + rules)
• Frontier: which URIs are done, or waiting to be done (queues and lists/maps)
• Processor chains: configurable sequential tasks to do to each URI (code modules + configuration)
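As a rough illustration of "seeds + rules", a scope can be modelled as a list of seed URIs plus a few accept/reject checks. The `Scope` class below, its `max_hops` and `reject_suffixes` parameters, and the specific rules are assumptions made for this sketch; they are not Heritrix's actual scope classes or settings.

```python
from urllib.parse import urlsplit

class Scope:
    """Hypothetical scope: seeds plus simple accept/reject rules."""

    def __init__(self, seeds, max_hops=3, reject_suffixes=(".exe", ".iso")):
        self.seeds = list(seeds)
        # Rule source: only hosts that appear in the seed list are in scope.
        self.seed_hosts = {urlsplit(s).hostname for s in self.seeds}
        self.max_hops = max_hops
        self.reject_suffixes = tuple(reject_suffixes)

    def accepts(self, uri, hops):
        parts = urlsplit(uri)
        if parts.scheme not in ("http", "https"):
            return False                       # rule: web URIs only
        if hops > self.max_hops:
            return False                       # rule: limit link distance from a seed
        if parts.path.lower().endswith(self.reject_suffixes):
            return False                       # rule: skip unwanted file types
        return parts.hostname in self.seed_hosts  # rule: stay on seed hosts

scope = Scope(["http://example.org/"])
print(scope.accepts("http://example.org/page", hops=1))
print(scope.accepts("http://other.example/", hops=1))
```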
Heritrix: Architecture
[Diagram: components inside the CrawlController]
• ServerCache
• Scope
• Frontier
  – Already Included URIs
  – URI Work Queues
• ToeThreads (several), each running the processor chains
  – Prefetch Chain: Preselector, PreconditionEnforcer
  – Fetch Chain: FetchDNS, FetchHTTP
  – Extractor Chain: ExtractorHTML, ExtractorJS
  – Write Chain: ARCWriterProcessor
  – Postprocess Chain: CrawlStateUpdater, Postselector
• Calls shown on the Frontier: next(CrawlURI), finished(CrawlURI), schedule(URI)
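The three calls labelled in the diagram, next(CrawlURI), finished(CrawlURI) and schedule(URI), suggest a small Frontier interface backed by an "already included" set and per-host work queues. The Python class below is a minimal sketch of that idea, not the Heritrix implementation. The rule that only one URI per host is in flight at a time is an assumption of this sketch, as are the attribute names.

```python
from collections import deque
from urllib.parse import urlsplit

class Frontier:
    """Hypothetical frontier: an 'already included' set plus per-host work queues."""

    def __init__(self):
        self.already_included = set()  # every URI ever scheduled
        self.queues = {}               # host -> deque of waiting URIs
        self.in_progress = set()       # hosts with a URI currently being processed
        self.finished_count = 0

    def schedule(self, uri):
        """Add a URI unless it has been seen before; return True if it was added."""
        if uri in self.already_included:
            return False
        self.already_included.add(uri)
        host = urlsplit(uri).hostname
        self.queues.setdefault(host, deque()).append(uri)
        return True

    def next(self):
        """Hand out a URI from the first host queue that is not busy, or None."""
        for host, queue in self.queues.items():
            if queue and host not in self.in_progress:
                self.in_progress.add(host)
                return queue.popleft()
        return None

    def finished(self, uri):
        """Mark a URI as done and free its host queue."""
        self.in_progress.discard(urlsplit(uri).hostname)
        self.finished_count += 1

frontier = Frontier()
for u in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    frontier.schedule(u)
duplicate_added = frontier.schedule("http://a.example/1")
first = frontier.next()    # from a.example
second = frontier.next()   # a.example is busy, so b.example is served
third = frontier.next()    # nothing available until a URI is finished
frontier.finished(first)
fourth = frontier.next()   # a.example is free again
```

In this sketch a worker thread would call next(), run the URI through the processor chains, and then call finished().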
Heritrix: Processor Chains
• Prefetch
  – Ensure conditions are met
• Fetch
  – Network activity (HTTP, DNS, FTP, etc.)
• Extract
  – Analyze – especially for new URIs
• Write
  – Save archival copy to disk
• Postprocess
  – Feed URIs back to Frontier, update crawler state
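The five chains above can be pictured as an ordered list of small steps that each URI passes through. The Python sketch below is illustrative only: the processor functions, the dictionary standing in for a crawl URI, and the `PAGES`, `ARCHIVE` and `SCHEDULED` tables are hypothetical, and real processors are Java modules with their own configuration. For simplicity a rejected URI simply stops here, which may differ from how Heritrix handles such cases.

```python
import re

PAGES = {"http://example.org/": '<a href="http://example.org/a">a</a>'}  # fake web
ARCHIVE = {}     # stands in for archive files on disk
SCHEDULED = []   # stands in for the Frontier

def preselect(curi):
    # Prefetch: check preconditions (here: only http:// URIs are allowed)
    if not curi["uri"].startswith("http://"):
        curi["skip"] = True

def fetch(curi):
    # Fetch: network activity, replaced here by a dictionary lookup
    curi["content"] = PAGES.get(curi["uri"], "")

def extract(curi):
    # Extract: look for new URIs in the fetched content
    curi["out_links"] = re.findall(r'href="([^"]+)"', curi["content"])

def write(curi):
    # Write: save the archival copy
    ARCHIVE[curi["uri"]] = curi["content"]

def postprocess(curi):
    # Postprocess: feed discovered URIs back for scheduling
    SCHEDULED.extend(curi["out_links"])

CHAINS = [
    ("prefetch", [preselect]),
    ("fetch", [fetch]),
    ("extract", [extract]),
    ("write", [write]),
    ("postprocess", [postprocess]),
]

def process(uri):
    """Run one URI through every chain in order, stopping if it is rejected."""
    curi = {"uri": uri, "skip": False, "content": "", "out_links": []}
    for _name, processors in CHAINS:
        for processor in processors:
            processor(curi)
            if curi["skip"]:
                return curi
    return curi

ok = process("http://example.org/")
rejected = process("ftp://example.org/file")
```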
Heritrix: Features & Limitations
• Other key features:
  – Web UI console to control & monitor crawl
  – Very configurable inclusion, exclusion, politeness policies
• Limitations:
  – Requires sophisticated operator
  – Large crawls hit single-machine limits
  – No capacity for automatic revisit of changed material
• Generally:
  – Good for focused & experimental crawling use cases; not yet for broad and continuous
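To make "politeness policies" concrete, one common approach is to wait between requests to the same host for a time proportional to how long the last fetch took, clamped to a minimum and maximum. The function below is a sketch of that general idea under assumed parameter names and default values; it is not taken from the slides and should not be read as Heritrix's actual settings.

```python
def politeness_delay_ms(last_fetch_ms, delay_factor=5.0,
                        min_delay_ms=2000, max_delay_ms=30000):
    """Hypothetical policy: wait delay_factor times the last fetch duration,
    but never less than min_delay_ms or more than max_delay_ms."""
    proposed = last_fetch_ms * delay_factor
    return int(min(max(proposed, min_delay_ms), max_delay_ms))

fast = politeness_delay_ms(100)      # quick response: the minimum applies
typical = politeness_delay_ms(1000)  # proportional wait
slow = politeness_delay_ms(10000)    # slow server: the maximum applies
```

A policy of this kind slows the crawler down automatically on servers that respond slowly, which is one reason crawl speed can be bounded by politeness instead of by the crawler itself.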
Heritrix console
Heritrix settings
Heritrix logs
Heritrix reports
Heritrix: Current Uses
• Weekly, Monthly, 6-monthly, and special one-time crawls
• Hundreds to thousands of specific target sites
• Over 20 million collected URIs per crawl
• Crawls run for 1-2 weeks
Heritrix: Performance
• Not yet stressed or optimized
  – Current crawls limited by material to crawl and chosen politeness, not our performance
• Typical observed rates (actual focused crawls)
  – 20-40 URIs/sec (peaking over 60)
  – 2-3 Mbps (peaking over 20 Mbps)
• Limits imposed by memory usage
  – Over 10,000 hosts/over 10 million URIs (512 MB machine, more on larger machines)
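The figures above can be cross-checked with some back-of-the-envelope arithmetic. The short Python calculation below assumes a crawl runs continuously at the quoted rates, that 1 Mbps means 1,000,000 bits per second, and that the whole 512 MB were available for per-URI state. None of these assumptions is stated on the slides, so the results are rough bounds only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def uris_collected(rate_per_sec, days):
    """URIs fetched when crawling non-stop at a constant rate."""
    return rate_per_sec * days * SECONDS_PER_DAY

def avg_bytes_per_uri(mbps, uris_per_sec):
    """Average transfer size implied by a bandwidth and a fetch rate."""
    return mbps * 1_000_000 / 8 / uris_per_sec

low = uris_collected(20, 7)     # 20 URIs/sec for one week
high = uris_collected(40, 14)   # 40 URIs/sec for two weeks

size_a = avg_bytes_per_uri(2, 20)   # 2 Mbps at 20 URIs/sec
size_b = avg_bytes_per_uri(3, 40)   # 3 Mbps at 40 URIs/sec

# Upper bound on memory per URI if 10 million URIs fit in 512 MB.
bytes_per_uri_state = (512 * 1024 * 1024) // 10_000_000

print(low, high, size_a, size_b, bytes_per_uri_state)
```

Under these assumptions a 1-2 week crawl collects roughly 12 to 48 million URIs, which brackets the "over 20 million collected URIs per crawl" reported for current uses, and the average transfer is on the order of 10 KB per URI.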
Heritrix: Future Plans
• Larger scale crawl capacity
  – Giant focused crawls
  – Broad whole-web crawls
• New protocols & formats
• Automate expert operator tasks
• Continuous and dynamic crawling
  – Revisit sites as they change
  – Dynamically rank sites and URIs
Latest Developments
• 1.2 Release (next week)
  – Configurable canonicalization
    • Handles common session-IDs, URI variations
  – Politeness by IP address
  – Experimental more memory-efficient Frontier
  – Bug fixes
• 1.4 Release (January 2005)
  – Memory robustness
  – Experimental multi-machine distribution support
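Canonicalization of the kind mentioned for the 1.2 release, where session IDs and trivial URI variations are removed so that the same page is not crawled twice, can be illustrated with a short Python function. The rules chosen here (lowercasing scheme and host, stripping `;jsessionid=` path parameters, dropping a few session-ID query parameters, discarding the fragment) and the `SESSION_PARAMS` list are assumptions for this sketch, not the rules Heritrix ships with.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of query parameters that carry session IDs.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid"}

def canonicalize(uri):
    """Reduce URI variants to one form for 'already seen' comparison."""
    parts = urlsplit(uri)
    # Strip ";jsessionid=..." path parameters.
    path = re.sub(r";jsessionid=[^/?#]*", "", parts.path, flags=re.IGNORECASE)
    # Drop session-ID query parameters, keeping the rest in order.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    query = urlencode(kept)
    # Lowercase scheme and host; discard the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))

a = canonicalize("HTTP://WWW.Example.org/page;jsessionid=ABC123?id=7&PHPSESSID=xyz#top")
b = canonicalize("http://www.example.org/page?id=7")
print(a)
print(a == b)
```

In a crawler, the canonical form would typically be used only as the key for the "already included" check, while the original URI is the one actually fetched. Note that `urlencode` may re-encode some query values, which is acceptable for a comparison key.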
The End
• Questions?