Monitoring the dynamic Web to respond to Continuous Queries Presented by Qing Cao CS851 Spring 2005.
Talk outline
Introduction and Motivation
Previous Approaches
Paper Contributions
• Continuous Adaptive Monitoring (CAM)
• How to allocate limited polling resources among pages
• How to schedule monitoring tasks
Experiments and Evaluations
Critique
Conclusions
Problem Context
Current web pages are highly dynamic
• 40% of commercial pages change per day
• 23% of all pages change per day (Sethuraman et al.)
How can search engines handle a user’s long-term request on a particular topic?
• Need to monitor a set of web pages (how often?)
• Analyze the differences
• Send the results to the user
Application Example
Google News page shot on Apr 16
Application Example
Many websites allow users to receive email alerts or updated news on particular events
A special kind of query!
CNN webpage shot on Apr 16
Goal of this Paper
Continuous Adaptive Monitoring (CAM): allocate limited resources (such as bandwidth and computation power) to the monitoring tasks so that misses of updated pages are minimized
User Model: Continuous Queries (CQ)
Users issue long-lived queries of interest
Pages of interest may be added, modified, and deleted
System continually updates responses
Discrete vs. continuous queries
Discrete queries
• Query lives for an “instant”, one-shot answer
• Optimize content freshness each time
• Usually handled by page crawlers (such as Google Robot), with diverse periods
Continuous queries
• Queries have positive lifetime, many updates over time
• Updates must track changes continuously over certain periods of time
• Dynamic monitoring with more restrictive network resources, using new services such as CAM
Talk outline
Introduction and motivation
Previous approaches
Our contributions
• Continuous Adaptive Monitoring (CAM)
• How to allocate limited polling resources among pages
• How to schedule monitoring tasks
Experiments
Critique
Conclusion
Related work
CONQUER and WebCQ (Liu, Pu and Tang)
• Query language and architecture for CQ
• Do not address monitoring for freshness optimization
NIAGARA (DeWitt and Naughton)
• Query evaluation and optimization techniques
• Database query optimization setting
ChangeDetector (Boyapati et al.)
• Fixed-priority polling for a given set of pages
Freshness for discrete queries
• Poisson updates (Cho and Garcia-Molina)
• Quasi-deterministic and other distributions (Sethuraman, Wolf, Squillante, Yu)
Alternative Solution: RSS (Rich Site Summary or RDF Site Summary)
An XML format for news and content syndication, in which headlines and links to the actual content are made available to Web sites. After the publishing site creates an RSS file of its content, other Web sites may use the headline feed, and the content can be read with a standard Web browser or with specialized RSS viewers.
RSS is a push-pull scheme, unlike the scheme discussed here, which is purely pull-based.
Paper Contributions
New monitoring framework to fit statistical models of page change behavior
Freshness optimization problem constrained by network resources
Two-phase solution to optimization tailored to CQ search systems
• Resource allocation (knapsack)
• Poll scheduling (flow-shop)
Continuous Adaptive Monitoring
Consider an epoch
Consider a large set of pages
At each time step $j$, each page $i$ has probability $\rho_{i,j}$ of an update
• Can capture predictable periodicity
$\sum_j \rho_{i,j} = \lambda_i$, the expected number of updates to page $i$ (its change rate) in an epoch
Decision variables $y_{i,j}$
• Whether page $i$ is visited at time step $j$
• Optimization goal
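Goal
The goal is to minimize the weighted importance of changes that are not reported to the users:
$$\min \sum_{i \in P} W_i E_i$$
where the sum runs over the pages $i$ in $P$, $W_i$ is the importance of page $i$, and $E_i$ is the expected number of updates to page $i$ that go unreported
Put another way, the update information reported for page $i$ is
$$R_i = \sum_j \rho_{i,j}\, y_{i,j}$$
The goal is to maximize the importance-weighted updates reported, $\sum_i W_i R_i$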
Constraints and Metrics
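The system is subject to a polling resource constraint:
$$\sum_{i,j} y_{i,j} = C$$
The returned information ratio (RIR) metric is:
$$\mathrm{RIR} = \frac{\sum_i W_i \sum_j \rho_{i,j}\, y_{i,j}}{\sum_i W_i \lambda_i}$$
The numerator is the importance-weighted updates captured by the system; the denominator is the total importance-weighted expected updates, with $\lambda_i = \sum_j \rho_{i,j}$
The goal is equivalent to maximizing RIR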
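As a rough illustration of the metric, the sketch below computes RIR for a toy instance. The function name `returned_info_ratio` and the list-of-lists layout (`rho[i][j]` for update probabilities, `y[i][j]` for poll decisions) are assumptions made for this example, not something taken from the paper.

```python
def returned_info_ratio(rho, weights, y):
    """Importance-weighted updates captured, divided by the
    importance-weighted expected updates (hypothetical helper)."""
    captured = 0.0
    expected = 0.0
    for w, rho_i, y_i in zip(weights, rho, y):
        # Updates captured for page i: sum over time steps that are polled.
        captured += w * sum(p * v for p, v in zip(rho_i, y_i))
        # Expected updates for page i: its change rate over the epoch.
        expected += w * sum(rho_i)
    return captured / expected


# Toy data: two pages, three time steps, two polls spent on page 0.
rho = [[0.5, 0.1, 0.4], [0.1, 0.1, 0.1]]
weights = [2.0, 1.0]
y = [[1, 0, 1], [0, 0, 0]]
rir = returned_info_ratio(rho, weights, y)
```

Polling every page at every time step gives an RIR of 1; any smaller poll budget gives a value between 0 and 1.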
CAM System Overview
Time proceeds in epochs
At the end of every epoch we re-evaluate
• Relevance
• Update probabilities
For the next epoch
• We select instants at which to poll each page (resource allocation)
• Schedule these instants subject to resource constraints
[Figure: the CAM monitoring cycle, with stages: determining relevant pages, tracking, resource allocation, scheduling]
Resource allocation
Existing policies
• Uniform: resources (number of polls) are distributed uniformly among all pages, irrespective of their change frequency
• Proportional: the number of polls allocated to a page is proportional to the frequency with which it changes
Better policies also exist, such as those that take the weights of different pages into account
CAM: Discrete, Separable and Convex
• Better than uniform and proportional
• Proportional better than uniform
• Well-studied optimization problem
BUT EXACTLY HOW DO THEY DO IT?
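The slides leave the allocation algorithm open, so the following is only a sketch under a strong assumption: it takes the objective exactly as written earlier in the deck (maximize the importance-weighted sum of $\rho_{i,j}\, y_{i,j}$ with $C$ polls in total). Under that simplified objective each poll contributes independently, so choosing the $C$ largest values of $W_i \rho_{i,j}$ is optimal. This is not claimed to be the paper's method, and `allocate_polls` is a hypothetical name.

```python
import heapq


def allocate_polls(rho, weights, capacity):
    """Return a 0/1 matrix y with at most `capacity` ones, placed on the
    (page, time step) pairs with the largest weighted update probability."""
    candidates = [
        (weights[i] * p, i, j)
        for i, row in enumerate(rho)
        for j, p in enumerate(row)
    ]
    y = [[0] * len(row) for row in rho]
    # Keep the `capacity` most valuable polling instants.
    for _, i, j in heapq.nlargest(capacity, candidates):
        y[i][j] = 1
    return y


# Toy data: two pages, two time steps, budget of two polls.
rho = [[0.9, 0.1], [0.2, 0.3]]
weights = [1.0, 2.0]
y = allocate_polls(rho, weights, capacity=2)
```

The per-instant limit on concurrent fetches is ignored here; in the two-phase design described in this talk, that limit is handled by the scheduling phase.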
Scheduling
Suppose our crawler can fetch $M$ pages concurrently, and an epoch is $T$ time steps long
Then we can fetch a total of $C = MT$ pages during an epoch
• Ensured by the resource allocation phase
But at each instant we cannot schedule more than $M$ fetches
• Want small planned-to-actual poll delays
• May fail to schedule all poll jobs in an epoch
[Figure: the CAM monitoring cycle (determining relevant pages, parameter tracking, resource allocation, scheduling), with the label “Tentative $y_{i,j}$s”]
A flow-shop problem
$M$ “machines” are available at any time
Each $y_{i,j}$ that is equal to 1 is a “job”
Job $k$ is “released” at time step $r_k$ ($= j$)
“Processing time” = crawl time = $t_k$
“Completion time” of job $k$ is $C_k$
Want to minimize the “total flow” $\sum_k (C_k - r_k)$
NP-hard problem
• The paper uses a 1.58 heuristic algorithm
[Figure: job schedule, with axes Time and Job]
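To make the total-flow objective concrete, here is a minimal sketch of a first-come-first-served list scheduler on $M$ machines. It is a simple stand-in chosen for illustration, not the 1.58 heuristic the paper uses, and the name `schedule_fifo` and the `(release, duration)` job tuples are assumptions.

```python
import heapq


def schedule_fifo(jobs, machines):
    """Assign jobs, in release order, to the machine that frees up first.

    jobs: list of (release_time, crawl_time) pairs.
    Returns (plan, total_flow), where plan lists (release, start, finish)
    and total_flow is the sum of finish - release over all jobs.
    """
    free_at = [0] * machines          # time at which each machine is idle
    heapq.heapify(free_at)
    plan = []
    total_flow = 0
    for release, duration in sorted(jobs):
        # A job cannot start before its release or before a machine is free.
        start = max(release, heapq.heappop(free_at))
        finish = start + duration
        heapq.heappush(free_at, finish)
        plan.append((release, start, finish))
        total_flow += finish - release
    return plan, total_flow


# Four poll jobs competing for two concurrent fetch slots.
jobs = [(0, 2), (0, 2), (0, 2), (1, 1)]
plan, total_flow = schedule_fifo(jobs, machines=2)
```

In this toy run the third and fourth jobs start later than their release times, which is the planned-to-actual poll delay that the scheduling phase tries to keep small.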
Evaluation: Preparing Knowledge
Zipfian Distribution (Power Law)
• Zipf's law is named after the Harvard linguistics professor George Kingsley Zipf (1902-1950)
• Zipf curves follow a straight line when plotted on a double-logarithmic diagram
Zipf Properties
Three Key Observations:
• A few elements score very high (the left tail in the diagram)
• A medium number of elements have middle-of-the-road scores (the middle part of the diagram)
• A huge number of elements score very low (the right tail in the diagram)
Zipf distributions have been shown to characterize the use of words in a natural language
Web use follows a Zipf distribution
Other interesting examples include file popularity and web requests in caching
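The straight-line property on a log-log plot can be checked numerically. The sketch below builds normalized Zipf weights, where the element of rank $r$ gets a share proportional to $1/r^{\theta}$; the function name `zipf_weights` and the parameter name `theta` are assumptions for this example.

```python
import math


def zipf_weights(n, theta):
    """Normalized weights proportional to 1 / rank**theta for ranks 1..n."""
    raw = [1.0 / (rank ** theta) for rank in range(1, n + 1)]
    total = sum(raw)
    return [v / total for v in raw]


w = zipf_weights(500, 1.0)

# Slope of log(weight) against log(rank), measured between ranks 1 and 100.
slope = (math.log(w[99]) - math.log(w[0])) / (math.log(100) - math.log(1))
```

The measured slope equals $-\theta$ up to rounding, so a larger Zipf parameter means a steeper line and a more skewed distribution.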
Experiments
Data (Synthetic)
• Change frequency distribution: a few pages change very often (Zipf)
• Update probability distribution: a few $\rho_{i,j}$'s are large, most are small (Zipf again)
• Page importance distribution: also Zipf (Wolman, 1999)
[Figure: number of pages versus change frequency]
Experiment Parameters
Comparison Baselines:
• Uniform: resources (monitoring tasks) are allocated uniformly across all pages
• Proportional: resources are allocated to pages in proportion to their change frequencies
Parameters:
• Number of Queries = 500
• Number of Pages = 500
• Number of Monitoring Tasks = 1000-50000
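The two baselines can be sketched as follows. How leftover polls are handed out when the budget does not divide evenly is not stated in the slides; the rounding rules below (first pages get the extras for uniform, largest fractional share first for proportional) are assumptions, as are the function names.

```python
def uniform_allocation(change_rates, budget):
    """Give every page the same number of polls, ignoring change rates."""
    n = len(change_rates)
    base, extra = divmod(budget, n)
    # The first `extra` pages receive one additional poll each.
    return [base + (1 if i < extra else 0) for i in range(n)]


def proportional_allocation(change_rates, budget):
    """Give each page polls in proportion to its change rate."""
    total = sum(change_rates)
    shares = [budget * r / total for r in change_rates]
    polls = [int(s) for s in shares]
    leftovers = budget - sum(polls)
    # Hand remaining polls to the pages with the largest fractional share.
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - polls[i], reverse=True)
    for i in order[:leftovers]:
        polls[i] += 1
    return polls


rates = [8, 1, 1]
uniform = uniform_allocation(rates, budget=10)
proportional = proportional_allocation(rates, budget=10)
```

With these toy rates the proportional baseline concentrates polls on the fast-changing page, while the uniform baseline spreads them almost evenly.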
CAM > Proportional > Uniform
Uniform update and importance distribution
Plot RIR against the ratio of resources to expected changes
RIR for CAM is more than 3 times better
Proportional is better than uniform in the CQ setting
[Figure: RIR versus monitor/change ratio for Uniform, Proportional, and CAM]
Resource allocation
Sort pages by increasing change rate
Uniform spends the same resources on each bin
Proportional wastes fewer resources on slow-changing bins, but is not aggressive enough
CAM invests more aggressively in fast-changing bins, achieving the greatest RIR
Skewed Distribution Effect (1)
CAM performs better as the distribution of update rates becomes more skewed
Skewed Distribution Effect (2)
More information is obtained as the update distribution becomes more skewed
Skewed Distribution Effect (3)
As the Zipf parameter increases, CAM performs better
[Figure: RIR versus Zipf parameter for CAM, Proportional, and Uniform]
CAM Performance with Epochs
CAM improves over initial epochs
Change distribution estimates stabilize within a few epochs
Effect of Monitoring Task ratio on CAM
Only when the ratio is 50 can CAM obtain all information.
Is this good performance?
Scheduling Performance
The first figure shows the document size distribution, and the second figure shows the loss of information due to scheduling
Some monitoring tasks are not done
Some monitoring tasks are delayed
Experiments on real pages (from one of the authors’ talks)
Eight sites with dynamic cricket match information
• In fact, Zipfian updates
Adversarial setup: monitor/change < 1
• CAM close to best possible
For M/C = 2, CAM reports 80% of the changed information
[Figure: number of changes per page, by page index (1 to 8)]
[Figure: RIR versus monitoring-change ratio for Uniform, Proportional, and CAM]
Critique
The paper describes its algorithms vaguely, and it is unclear how the experiments were performed
The performance is not good: CAM needs 50 times as many monitoring tasks as there are actual updates to capture all information
Several assumptions do not hold: for example, successive updates are typically correlated, but the paper assumes that the information from one update is completely lost when the next update occurs
Conclusion
Continuous queries are inherently different from discrete queries
Approach used in CAM
• Identify relevant pages
• Track the pages as they change
• Characterize page change behavior
• Decide when to monitor the pages in future
The CAM approach performs better than the naïve approaches (uniform and proportional)