Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam...


Page 1: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND.

Parallel Crawlers
AND
Efficient URL Caching for World Wide Web Crawling

Presenter: Sawood Alam (salam@cs.odu.edu)

Parallel Crawlers

Hector Garcia-Molina, Stanford University
Junghoo Cho, University of California

Abstract
- Design an effective and scalable parallel crawler
- Propose multiple architectures for a parallel crawler
- Identify fundamental issues related to parallel crawling
- Metrics to evaluate a parallel crawler
- Compare the proposed architectures using 40 million pages

Challenges for parallel crawlers
- Overlap
- Quality
- Communication bandwidth

Advantages
- Scalability
- Network-load dispersion
- Network-load reduction
  - Compression
  - Difference
  - Summarization

Related work
- General architecture
- Page selection
- Page update

Geographical categorization
- Intra-site parallel crawler
- Distributed crawler

Communication
- Independent
- Dynamic assignment
- Static assignment

Crawling modes (Static)
- Firewall mode
- Cross-over mode
- Exchange mode
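
The three modes differ mainly in what a crawling process does with a link that points outside its own partition. The sketch below is one possible reading, not the authors' implementation: it assumes firewall mode discards such links, cross-over mode sets them aside until the process has exhausted its own partition, and exchange mode forwards them to the owning process. The names `Mode` and `route_link` and the action strings are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    FIREWALL = "firewall"
    CROSS_OVER = "cross-over"
    EXCHANGE = "exchange"


def route_link(mode, owner, me):
    """Decide what process `me` does with a link whose partition owner is `owner`.

    Returns one of (all hypothetical labels):
      "enqueue": add to the local frontier
      "drop":    ignore the link
      "hold":    keep aside, follow only when the local frontier is empty
      "send":    forward the URL to the owning process
    """
    if owner == me:
        return "enqueue"          # intra-partition links are always followed
    if mode is Mode.FIREWALL:
        return "drop"             # assumed: never leave the partition
    if mode is Mode.CROSS_OVER:
        return "hold"             # assumed: follow only when out of local work
    return "send"                 # exchange mode: hand the URL to its owner


for mode in Mode:
    print(mode.value, route_link(mode, owner=1, me=0))
```

Under this reading, firewall mode trades coverage for zero communication, cross-over mode trades overlap for coverage, and exchange mode pays a communication cost to avoid both.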

URL exchange minimization
- Batch communication
- Replication

Partitioning function
- URL-hash based
- Site-hash based
- Hierarchical
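
A minimal sketch of the three partitioning functions, assuming `n` crawling processes numbered from 0. The helper names, the use of SHA-256, and the top-level-domain table for the hierarchical scheme are illustrative assumptions, not details from the slides.

```python
import hashlib
from urllib.parse import urlsplit


def _bucket(key, n):
    """Map a string to one of n processes with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def url_hash_partition(url, n):
    # Hash the whole URL: pages of one site scatter across processes.
    return _bucket(url, n)


def site_hash_partition(url, n):
    # Hash only the host name: all pages of a site go to the same process.
    return _bucket(urlsplit(url).hostname or "", n)


def hierarchical_partition(url, tld_to_process, default=0):
    # Assign by top-level domain using a hand-written (hypothetical) table.
    host = urlsplit(url).hostname or ""
    return tld_to_process.get(host.rsplit(".", 1)[-1], default)


print(site_hash_partition("http://example.com/a", 4),
      site_hash_partition("http://example.com/b/c", 4))
```

Because most links stay within a site, hashing on the site rather than the full URL keeps more links inside one partition, which matters for every mode that treats inter-partition links specially.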

Evaluation models
- Overlap
- Coverage
- Quality
- Communication overhead
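
Three of these metrics reduce to simple ratios. The sketch below assumes the following definitions, which the slides do not spell out: overlap is (N - I) / I for N downloads of which I are unique, coverage is the fraction of the target set that was downloaded, and communication overhead is the number of exchanged messages per downloaded page. Quality is left out because it depends on a page-importance measure. Function names are hypothetical.

```python
def overlap(downloaded):
    """(N - I) / I, where N = total downloads and I = unique pages."""
    n = len(downloaded)
    i = len(set(downloaded))
    return (n - i) / i


def coverage(downloaded, target):
    """Fraction of the pages the crawler should fetch that it did fetch."""
    target = set(target)
    return len(set(downloaded) & target) / len(target)


def communication_overhead(messages_exchanged, pages_downloaded):
    """Inter-process URL messages per downloaded page."""
    return messages_exchanged / pages_downloaded


log = ["a", "b", "a", "c"]          # combined download log of all processes
print(overlap(log), coverage(log, ["a", "b", "c", "d"]))
```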

Firewall mode and coverage

Cross-over mode and overlap

Exchange mode and communication

Quality and batch communication

Conclusion
- Firewall mode is good if the number of processes is <= 4
- URL exchange imposes a network overhead of < 1%
- Quality is maintained even with batch communication
- Replicating 10,000 to 100,000 popular URLs can reduce communication overhead by 40%

Efficient URL Caching for World Wide Web Crawling

Andrei Z. Broder, IBM TJ Watson Research Center
Janet L. Wiener, Hewlett Packard
Marc Najork, Microsoft Research

Introduction
- Fetch a page
- Parse it to extract all linked URLs
- For all the URLs not seen before, repeat the process
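
The loop described above can be sketched as a breadth-first traversal. To keep the example runnable without network access, fetching and parsing are folded into a caller-supplied `fetch_links` function, which is a hypothetical stand-in, and the toy link graph is invented for illustration.

```python
from collections import deque


def crawl(seed, fetch_links, limit=1000):
    """Visit pages breadth-first, following only URLs not seen before."""
    seen = {seed}                 # the "seen" test is what a URL cache speeds up
    frontier = deque([seed])
    visited = []
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        visited.append(url)                 # "fetch a page"
        for link in fetch_links(url):       # "extract all linked URLs"
            if link not in seen:            # "not seen before"
                seen.add(link)
                frontier.append(link)
    return visited


# Toy link graph standing in for the web.
toy_web = {"a": ["b", "c"], "b": ["a", "c"], "c": ["d"], "d": []}
print(crawl("a", lambda url: toy_web[url]))
```

At web scale the `seen` set does not fit in memory, so the membership test becomes the expensive step that caching targets.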

Challenges
- The web is very large (coverage)
  - doubling every 9-12 months
- Web pages are changing rapidly (freshness)
  - all changes (40% weekly)
  - changes by a third or more (7% weekly)

Crawlers
- IA crawler
- Original Google crawler
- Mercator web crawler
- Cho and Garcia-Molina’s crawler
- WebFountain
- UbiCrawler
- Shkapenyuk and Suel’s crawler

Caching
- Analogous to OS cache
- Non-uniformity of requests
- Temporal correlation or locality of reference

Caching algorithms
- Infinite cache (INFINITE)
- Clairvoyant caching (MIN)
- Least recently used (LRU)
- CLOCK
- Random replacement (RANDOM)
- Static caching (STATIC)
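
To make two of these policies concrete, the sketch below counts cache misses for LRU and CLOCK over a stream of URLs with a cache of `k` entries. It is a simulation sketch under assumptions, not the authors' code: in particular, this CLOCK variant sets the mark bit when an entry is inserted, which is one common choice among several. Function names and the sample stream are hypothetical.

```python
from collections import OrderedDict


def lru_misses(stream, k):
    """Count misses for least-recently-used replacement."""
    cache = OrderedDict()
    misses = 0
    for url in stream:
        if url in cache:
            cache.move_to_end(url)          # mark as most recently used
        else:
            misses += 1
            if len(cache) >= k:
                cache.popitem(last=False)   # evict least recently used
            cache[url] = True
    return misses


def clock_misses(stream, k):
    """Count misses for CLOCK: a circular buffer with one mark bit per slot."""
    slots = [None] * k
    marked = [False] * k
    where = {}                              # url -> slot index
    hand = 0
    misses = 0
    for url in stream:
        if url in where:
            marked[where[url]] = True       # hit: give the entry a second chance
            continue
        misses += 1
        while marked[hand]:                 # sweep, clearing mark bits
            marked[hand] = False
            hand = (hand + 1) % k
        if slots[hand] is not None:
            del where[slots[hand]]          # evict the unmarked entry
        slots[hand] = url
        where[url] = hand
        marked[hand] = True                 # assumption: new entries start marked
        hand = (hand + 1) % k
    return misses


stream = ["a", "b", "a", "c", "a", "d", "a"]
print(lru_misses(stream, 2), clock_misses(stream, 2))
```

CLOCK approximates LRU while touching only a single bit on a hit, which is why it is attractive when many crawler threads share one cache.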

Experimental setup

URL Streams
- full trace
- cross sub-trace

Result plots

Result plots

Result plots

Results
- LRU and CLOCK performed equally well, but slightly worse than MIN except in the critical region (for both traces)
- RANDOM is slightly inferior to CLOCK and LRU, while STATIC is generally much worse
- This indicates considerable locality of reference in the traces
- For a very large cache, STATIC is better than MIN (excluding the initial k misses)
- STATIC is relatively better for the cross trace
  - Lack of deep links; links often point to home pages
  - The intersection between the most popular URLs and the cross trace tends to be larger

Critical region
- Miss rate for all efficient algorithms is constant (~70%) in k = 2^14 - 2^18
- Above k = 2^18 miss rate drops abruptly to ~20%

Cache Implementation

Conclusions and future directions
- 1,800 simulations over 26.86 billion URLs showed that a cache of 50,000 entries gives an 80% hit rate
- A cache size of 100 to 500 entries per thread is recommended
- A CLOCK or RANDOM implementation using a scatter table with circular chains is recommended
- To what extent does the graph traversal order affect caching?
- Is a global cache or a per-thread cache better?

THANKS