How to Crawl the Web. Looksmart.com, 12/13/2002. Junghoo “John” Cho, UCLA.
What is a Crawler?
[Figure: crawler loop with the steps init, get next url, get page, extract urls; data stores: initial urls, to visit urls, visited urls, web pages; pages are fetched from the web]
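The loop in the figure can be sketched roughly as follows. This is a minimal illustration, not the WebBase implementation: `crawl`, `fetch`, `extract_urls`, `max_pages`, and the in-memory `TOY_WEB` are hypothetical names chosen here so the example runs without network access.

```python
from collections import deque


def crawl(initial_urls, fetch, extract_urls, max_pages=100):
    """Breadth-first crawl loop: get next url, get page, extract urls."""
    to_visit = deque(initial_urls)  # "to visit urls", seeded by "initial urls"
    visited = set()                 # "visited urls"
    pages = {}                      # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()    # get next url
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)           # get page
        if page is None:            # unreachable or missing page
            continue
        pages[url] = page
        for link in extract_urls(page):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages


# Toy "web" (an assumption for illustration): each page is just its list of out-links.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "missing"], "c": ["a"]}

pages = crawl(["a"], fetch=TOY_WEB.get, extract_urls=lambda page: page)
```

Passing `fetch` and `extract_urls` as parameters keeps the loop itself independent of how pages are downloaded and parsed.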
Applications
Internet Search Engines
– Google, AltaVista
Comparison Shopping Services
– My Simon, BizRate
Data mining
– Stanford Web Base, IBM Web Fountain
Prototype WebBase Crawler
Web Base Project
BackRub Crawler, PageRank
New Web Base Crawler
– 20,000 lines in C/C++
– 130M pages collected
Crawling Issues (1)
Load at visited web sites
– Space out requests to a site
– Limit number of requests to a site per day
– Limit depth of crawl
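The three limits above could be enforced with a small per-site policy object. The sketch below is an assumption-laden illustration: the class name `PolitenessPolicy`, its method names, and the default values are hypothetical, and time is passed in explicitly (in seconds) so the logic can be checked without waiting.

```python
from urllib.parse import urlsplit

SECONDS_PER_DAY = 86400


class PolitenessPolicy:
    """Per-site limits: minimum delay, daily request cap, maximum crawl depth."""

    def __init__(self, min_delay=10.0, max_per_day=3000, max_depth=5):
        self.min_delay = min_delay      # seconds between requests to one site
        self.max_per_day = max_per_day  # requests allowed per site per day
        self.max_depth = max_depth      # link distance from the site root
        self._last_request = {}         # site -> time of last request
        self._daily_count = {}          # (site, day index) -> requests made

    def allowed(self, url, depth, now):
        """Return True if fetching `url` at time `now` respects all limits."""
        site = urlsplit(url).netloc
        if depth > self.max_depth:
            return False
        day = int(now // SECONDS_PER_DAY)
        if self._daily_count.get((site, day), 0) >= self.max_per_day:
            return False
        last = self._last_request.get(site)
        return last is None or now - last >= self.min_delay

    def record(self, url, now):
        """Register that `url` was fetched at time `now`."""
        site = urlsplit(url).netloc
        day = int(now // SECONDS_PER_DAY)
        self._last_request[site] = now
        self._daily_count[(site, day)] = self._daily_count.get((site, day), 0) + 1
```

A crawler would call `allowed` before each fetch and `record` after it, re-queueing any URL that is not yet allowed.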
Crawling Issues (2)
Load at crawler
– Parallelize
[Figure: two crawler loops (init, get next url, get page, extract urls) side by side, with the stores initial urls, to visit urls, visited urls, web pages; a “?” marks how the second loop connects to them]
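One common way to answer the “?” in the figure is to partition URLs among crawler processes by site, so that each site is handled by exactly one process. The sketch below illustrates that idea only; it is not necessarily the scheme the talk has in mind, and `assign_crawler` is a hypothetical helper name.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index based on a stable hash of its host name."""
    host = urlsplit(url).netloc.lower()
    # hashlib gives the same value in every process, unlike the built-in hash().
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers
```

Because the assignment depends only on the host, all pages of a site land on the same crawler, which also keeps per-site load limits local to one process.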
Crawling Issues (3)
Scope of crawl
– Not enough space for “all” pages
– Not enough time to visit “all” pages
Solution: Visit “important” pages
[Figure: set of visited pages, with pages labeled “Intel”]
Crawling Issues (4)
Replication
– Pages mirrored at multiple locations
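As a simple illustration of the problem, exact mirrors can be found by grouping pages on a fingerprint of their content. This sketch only catches byte-identical copies (after whitespace normalization) and is an assumption for illustration, not the replicated-collection technique from the author's research; `content_fingerprint` and `group_mirrors` are hypothetical names.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(html):
    """Hash of the page text with runs of whitespace collapsed."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_mirrors(pages):
    """Return lists of URLs whose content is identical (possible mirrors)."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[content_fingerprint(html)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```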
Crawling Issues (5)
Incremental crawling
– How do we avoid crawling from scratch?
– How do we keep pages “fresh”?
My Research On Crawler
Load on sites [PAWS00]
Parallel crawler [WWW01]
Page selection [WWW7]
Replicated page detection [SIGMOD00]
Page freshness [SIGMOD00, VLDB01]
Crawler architecture [VLDB00]
Outline of This Talk
How can we keep pages fresh?
How does the Web change?
What do we mean by “fresh” pages?
How should we refresh pages?
Web Evolution Experiment
How often does a Web page change?
How long does a page stay on the Web?
How long does it take for 50% of the Web to change?
How do we model Web changes?
Experimental Setup
February 17 to June 24, 1999
270 sites visited (with permission)
– identified 400 sites with highest “PageRank”
– contacted administrators
720,000 pages collected
– 3,000 pages from each site daily
– start at root, visit breadth first (get new & old pages)
– ran only 9pm - 6am, 10 seconds between site requests
Average Change Interval
[Figure: bar chart. x-axis: average change interval, with bins 1day, 1day-1week, 1week-1month, 1month-4months, 4months; y-axis: fraction of pages, 0.00 to 0.35]
Change Interval – By Domain
[Figure: bar chart. x-axis: average change interval, with bins 1day, 1day-1week, 1week-1month, 1month-4months, 4months; y-axis: fraction of pages, 0 to 0.6; series: com, netorg, edu, gov]
Modeling Web Evolution
Poisson process with rate $\lambda$
$T$ is time to next event
$f_T(t) = \lambda e^{-\lambda t}$ ($t > 0$)
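For a Poisson process with rate λ, the time to the next change is exponentially distributed, which gives both the density on this slide and the probability that a page changes within a given time. A small sketch follows; the function names are chosen here for illustration.

```python
import math


def interval_pdf(t, change_rate):
    """Density f_T(t) = lambda * exp(-lambda * t) of the time to the next change."""
    if t <= 0:
        return 0.0
    return change_rate * math.exp(-change_rate * t)


def prob_change_within(t, change_rate):
    """P(T <= t) = 1 - exp(-lambda * t): chance of at least one change within t."""
    if t <= 0:
        return 0.0
    return -math.expm1(-change_rate * t)
```

For example, `prob_change_within(10, 0.1)` is the probability that a page changing once every 10 days on average has changed at least once after 10 days, which equals 1 - 1/e.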
Change Interval of Pages
[Figure: for pages that change every 10 days on average. x-axis: interval in days; y-axis: fraction of changes with given interval; observed data compared with the Poisson model]
Change Metrics
Freshness
– Freshness of element $e_i$ at time $t$ is
$F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$, $0$ otherwise
– Freshness of the database $S$ at time $t$ is
$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$
(Assume “equal importance” of pages)
[Figure: the web and the database, each holding elements $e_i$]
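The freshness definition translates directly into code. In this sketch, "up-to-date" is modeled as the local copy having the same version tag (for instance a checksum) as the live page; that representation and the function names are assumptions made for illustration.

```python
def element_freshness(local_version, live_version):
    """F(e_i; t): 1 if the local copy matches the live page, 0 otherwise."""
    return 1 if local_version == live_version else 0


def database_freshness(local_copies, live_pages):
    """F(S; t): fraction of the N local copies that are up-to-date."""
    n = len(local_copies)
    return sum(
        element_freshness(version, live_pages.get(url))
        for url, version in local_copies.items()
    ) / n
```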
Change Metrics
Age
– Age of element $e_i$ at time $t$ is
$A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$, $t - (\text{modification time of } e_i)$ otherwise
– Age of the database $S$ at time $t$ is
$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$
(Assume “equal importance” of pages)
[Figure: the web and the database, each holding elements $e_i$]
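The age metric can be sketched the same way. Here each element is represented by the time at which its live page was modified after the last refresh, or `None` if the local copy is still up-to-date; this representation and the function names are assumptions for illustration.

```python
def element_age(modified_at, now):
    """A(e_i; t): 0 if up-to-date, otherwise time elapsed since the modification."""
    if modified_at is None:
        return 0.0
    return now - modified_at


def database_age(modification_times, now):
    """A(S; t): average age over all N elements."""
    return sum(element_age(m, now) for m in modification_times) / len(modification_times)
```

With four elements, two of which went stale at times 3 and 7, the database age at time 10 is (0 + 7 + 0 + 3) / 4 = 2.5.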
Change Metrics
[Figure: $F(e_i)$ (between 0 and 1) and $A(e_i)$ (starting from 0) plotted against time, with update and refresh events marked; time averages of both metrics]
Refresh Order
Fixed order
– Explicit list of URLs to visit
Random order
– Start from seed URLs & follow links
Purely random
– Refresh pages on demand, as requested by user
[Figure: the web and the database, each holding elements $e_i$]
Freshness vs. Revisit Frequency
$r = \lambda / f$ = average change frequency / average visit frequency
Age vs. Revisit Frequency
$r = \lambda / f$ = average change frequency / average visit frequency
= Age / time to refresh all N elements
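The curves themselves did not survive in this transcript. As a hedged reconstruction, the sketch below computes the time-averaged freshness and age of a single page that changes as a Poisson process with rate λ and is refreshed at fixed intervals of 1/f (the fixed-order case). Under those assumptions, averaging e^{-λt} over one refresh interval gives freshness (1 - e^{-r}) / r with r = λ/f, and averaging the expected age gives (1/f)(1/2 - 1/r + (1 - e^{-r}) / r²). The function names are chosen here.

```python
import math


def expected_freshness(change_rate, visit_rate):
    """Time-averaged freshness (1 - e^-r) / r, with r = change_rate / visit_rate."""
    if change_rate == 0:
        return 1.0
    r = change_rate / visit_rate
    return -math.expm1(-r) / r


def expected_age(change_rate, visit_rate):
    """Time-averaged age (1/f) * (1/2 - 1/r + (1 - e^-r) / r^2)."""
    if change_rate == 0:
        return 0.0
    r = change_rate / visit_rate
    return (0.5 - 1.0 / r + -math.expm1(-r) / (r * r)) / visit_rate
```

Both functions depend on the rates only through the ratio r (age additionally scales with the refresh interval 1/f), which is why r is the natural x-axis for these two slides.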
Trick Question
Two page database
e1 changes daily
e2 changes once a week
Can visit one page per week
How should we visit pages?
– e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform]
– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 … [proportional]
– e1 e1 e1 e1 e1 e1 ...
– e2 e2 e2 e2 e2 e2 ...
– ?
[Figure: the web and the database, each holding e1 and e2]
Proportional Often Not Good!
Visit fast changing e1
get 1/2 day of freshness
Visit slow changing e2
get 1/2 week of freshness
Visiting e2 is a better deal!
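The comparison can be checked numerically for the two-page database of the trick question. The sketch below assumes that both pages change as Poisson processes (7 changes per week for e1, 1 per week for e2), that each page is refreshed at regular intervals, and that the total budget is one visit per week; under those assumptions the time-averaged freshness of a page is (1 - e^{-r}) / r with r = change rate / visit rate. All names are chosen here for illustration.

```python
import math


def expected_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page; 0 if it is never visited."""
    if visit_rate == 0:
        return 0.0
    r = change_rate / visit_rate
    return -math.expm1(-r) / r


CHANGE_RATES = {"e1": 7.0, "e2": 1.0}  # changes per week

# Visits per week for each page; every policy spends one visit per week in total.
POLICIES = {
    "uniform": {"e1": 0.5, "e2": 0.5},
    "proportional": {"e1": 7 / 8, "e2": 1 / 8},
    "only e1": {"e1": 1.0, "e2": 0.0},
    "only e2": {"e1": 0.0, "e2": 1.0},
}


def database_freshness(visit_rates):
    """Average freshness over the two pages (equal importance)."""
    return sum(
        expected_freshness(CHANGE_RATES[p], visit_rates[p]) for p in CHANGE_RATES
    ) / len(CHANGE_RATES)


scores = {name: database_freshness(rates) for name, rates in POLICIES.items()}
best = max(scores, key=scores.get)
```

Among these four schedules, visiting only e2 gives the highest average freshness, and the proportional schedule scores below the uniform one, in line with the slide's point.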
Optimal Refresh Frequency
Problem
Given $\lambda_i$ and $f$,
find $f_1, f_2, \ldots, f_N$ that maximize the freshness of the database
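The objective formula on this slide did not survive extraction. One plausible way to state the problem, assuming the objective is the time-averaged database freshness with equal page importance and that $f$ is the average visit frequency, is:

```latex
\max_{f_1, \ldots, f_N} \; \bar{F}(S) = \frac{1}{N} \sum_{i=1}^{N} \bar{F}(\lambda_i, f_i)
\quad \text{subject to} \quad
\frac{1}{N} \sum_{i=1}^{N} f_i = f, \qquad f_i \ge 0 .
```

Here $\bar{F}(\lambda_i, f_i)$ denotes the time-averaged freshness of page $e_i$ when it changes at rate $\lambda_i$ and is refreshed at frequency $f_i$; this notation is an assumption introduced for the sketch.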
Optimal Refresh Frequency
• Shape of curve is the same in all cases
• Holds for any change frequency distribution
Optimal Refresh for Age
• Shape of curve is the same in all cases
• Holds for any change frequency distribution
Comparing Policies

| Policy | Freshness | Age |
| --- | --- | --- |
| Proportional | 0.12 | 400 days |
| Uniform | 0.57 | 5.6 days |
| Optimal | 0.62 | 4.3 days |

Based on statistics from the experiment and a revisit frequency of once a month
Not Every Page is Equal!
[Figure: pages e1 and e2; one is accessed by users 20 times/day, the other 10 times/day]
Some pages are “more important”
In general,
$F(S) = w_1 F(e_1) + w_2 F(e_2)$
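A weighted version of the freshness metric can be sketched as below. The normalization by the sum of the weights, the use of access counts as weights, and the assignment of access counts to pages are all assumptions made for illustration; the slide does not spell them out.

```python
def weighted_freshness(freshness, weights):
    """Weighted average of per-page freshness values (weights need not sum to 1)."""
    total_weight = sum(weights[page] for page in freshness)
    return sum(weights[page] * freshness[page] for page in freshness) / total_weight


# Hypothetical example: weights proportional to user accesses per day.
accesses_per_day = {"e1": 10, "e2": 20}
current_freshness = {"e1": 1, "e2": 0}  # e1 is up-to-date, e2 is stale
score = weighted_freshness(current_freshness, accesses_per_day)
```

With equal weights this reduces to the unweighted database freshness; with the weights above, a stale copy of the more frequently accessed page costs twice as much.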
Weighted Freshness
[Figure: curves for w = 1 and w = 2; axis label: f]
Change Frequency Estimation
How to estimate change frequency?
– Naïve Estimator: X/T
– X: number of detected changes
– T: monitoring period
– 2 changes in 10 days: 0.2 times/day
[Figure: timeline with page visits 1 day apart, the times the page changed, and the changes detected]
Incomplete change history
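The "incomplete change history" problem is that a visit can only reveal whether the page changed since the previous visit, not how many times. The sketch below makes this concrete with a hypothetical change history; `detected_changes`, `naive_estimate`, and the example times are assumptions for illustration.

```python
def detected_changes(change_times, visit_times):
    """Count visits at which at least one change since the previous visit is seen."""
    detected = 0
    previous_visit = 0.0
    for visit in sorted(visit_times):
        if any(previous_visit < change <= visit for change in change_times):
            detected += 1
        previous_visit = visit
    return detected


def naive_estimate(num_detected, monitoring_period):
    """Naive estimator X / T."""
    return num_detected / monitoring_period


# Hypothetical history: the page changes twice on day 1 and once on day 4.
changes = [0.2, 0.5, 3.1]
visits = range(1, 11)  # one visit per day for 10 days
x = detected_changes(changes, visits)
estimate = naive_estimate(x, 10)
```

Here the page actually changed three times, but daily visits detect only two changes, so the naive estimate is 0.2 times/day rather than 0.3.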
Improved Estimator
Based on the Poisson model
– X: number of detected changes
– N: number of accesses
– f: access frequency
3 changes in 10 days: 0.36 times/day
Accounts for “missed” changes
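The estimator's formula is missing from this transcript. One estimator that follows from the Poisson model and reproduces the slide's example is sketched below: with accesses 1/f apart, the probability of seeing no change at an access is e^{-λ/f}; equating that to the observed fraction (N - X) / N and solving gives λ = -f ln(1 - X/N). The example assumes one access per day for 10 days (N = 10, f = 1/day). Treat this as a reconstruction under those assumptions, not as the slide's exact formula.

```python
import math


def poisson_estimate(num_detected, num_accesses, access_frequency):
    """Estimate the change rate as -f * ln(1 - X/N) under the Poisson model."""
    if num_detected >= num_accesses:
        # Every access saw a change: the data give no finite upper bound.
        return math.inf
    unchanged_fraction = (num_accesses - num_detected) / num_accesses
    return -access_frequency * math.log(unchanged_fraction)


estimate = poisson_estimate(num_detected=3, num_accesses=10, access_frequency=1.0)
```

The result rounds to 0.36 changes/day, higher than the naive 3/10 = 0.3, because some accesses are presumed to have hidden more than one change.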
Improvement Significant?
Application to a Web crawler
– Visit pages once every week for 5 weeks
– Estimate change frequency
– Adjust revisit frequency based on the estimate
» Uniform: do not adjust
» Naïve: based on the naïve estimator
» Ours: based on our improved estimator
Improvement from Our Estimator

| Policy | Detected changes | Ratio to uniform |
| --- | --- | --- |
| Uniform | 2,147,589 | 100% |
| Naïve | 4,145,582 | 193% |
| Ours | 4,892,116 | 228% |

(9,200,000 visits in total)
Other Estimators
Irregular access interval
Last-modified date
Categorization
Summary
Web evolution experiment
Change metric
Refresh policy
Frequency estimator
The End
Thank you for your attention
For more information visit
http://www-db.stanford.edu/~cho/