Monitoring the dynamic Web to respond to Continuous Queries
Sandeep Pandey, Krithi Ramamritham, Soumen Chakrabarti
IIT Bombay
www.cse.iitb.ac.in/laiir/
2
Motivation
Web pages change rapidly:
• 40% of commercial pages
• 23% of all pages
change per day (Sethuraman et al.)
Current search engine users
• Need to repeat queries (how often?) and
• Diff results with recent versions
• Or poll frequently updated collections (e.g., Google news)
3
Continuous Queries (CQ)
Users register long-lived queries of interest
Pages of interest may be added, modified, and deleted
System continually updates responses
Example applications
• Commuter updates: traffic and weather conditions
• Alerts on cricket scores, stock portfolios
4
Discrete vs. continuous queries
Discrete queries
• Query lives for an “instant”, one-shot answer
• Optimize corpus freshness at all times
• Objective penalizes delay from update to refresh
• Usually handled by bulk crawls with diverse periods
Continuous queries
• Queries have positive lifetime, many updates over time
• Updates must track changes closely
• Objective penalizes number or importance of missed updates
• Dynamic monitoring with more restrictive network resources
5
Talk outline
Introduction and motivation
Previous approaches
Our contributions
• Continuous Adaptive Monitoring (CAM)
• How to allocate limited polling resources among pages
• How to schedule poll instants
Experiments
Conclusion
6
Related work
CONQUER and WebCQ (Liu, Pu and Tang)
• Query language and architecture for CQ
• Do not address monitoring for freshness optimization
NIAGARA (DeWitt and Naughton)
• Query evaluation and optimization techniques
• Database query optimization setting
ChangeDetector (Boyapati et al.)
• Fixed-priority polling for given set of pages
Freshness for discrete queries
• Poisson updates (Cho and Garcia-Molina)
• Quasi-deterministic and other distributions (Sethuraman, Wolf, Squillante, Yu)
7
Our contributions
New statistical recency objective for CQs
New monitoring framework to fit statistical models of page change behavior
Recency optimization problem constrained by network resources
Two-phase solution to optimization tailored to CQ search systems
• Resource allocation (knapsack)
• Poll scheduling (flow-shop)
8
Continuous Adaptive Monitoring
Planning horizon or “epoch”
Time proceeds in discrete steps {j} over epoch
Each time step j, each page i has probability $\rho_{i,j}$ of an update
• Can capture predictable bursts, periodicity
$\sum_j \rho_{i,j} = \lambda_i$, the expected number of updates to page i (“change rate”)
Decision variables $y_{ij}$
• Is page i polled at time step j?
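The model on this slide can be written down in a few lines. The following is a minimal sketch, assuming update probabilities are stored as a page-by-step matrix; the names `rho`, `y`, and `change_rates` and all numeric values are illustrative and do not come from the talk.

```python
# rho[i][j]: assumed probability that page i is updated at time step j.
rho = [
    [0.1, 0.1, 0.8, 0.1],  # page 0: a predictable burst at step 2
    [0.5, 0.0, 0.5, 0.0],  # page 1: periodic updates
]


def change_rates(rho):
    """Expected number of updates per page over the epoch: sum over j of rho[i][j]."""
    return [sum(row) for row in rho]


# y[i][j] = 1 if page i is polled at time step j, else 0 (decision variables).
y = [
    [0, 0, 1, 0],
    [1, 0, 1, 0],
]

rates = change_rates(rho)
polls_used = sum(sum(row) for row in y)
```

Here `rates` holds each page's change rate and `polls_used` counts how many polls the chosen decision variables consume in the epoch.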
9
Profit, relevance and importance
Each registered query q has a profit $\pi_q$
Relevance $r_{iq}$ of page i w.r.t. query q
• We use cosine in TFIDF space as in IR
• Other measures (e.g. PageRank) may be integrated
Page i has “importance” $W_i$, a function of
• Currently resident queries and their “profits”
• Relevance of page i to each resident query
Importance: $W_i = \sum_q \pi_q r_{iq}$
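As an illustration of how page importance combines query profits with relevance, here is a small sketch. It assumes pages and queries are already represented as sparse term-weight vectors (plain dicts); the helper names `cosine` and `importance` and the example weights are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def importance(page_vec, queries):
    """Sum of profit * relevance over resident queries.

    queries: list of (profit, query_vector) pairs.
    """
    return sum(profit * cosine(page_vec, q_vec) for profit, q_vec in queries)


page = {"cricket": 2.0, "score": 1.0}
queries = [
    (3.0, {"cricket": 1.0}),  # hypothetical query with profit 3
    (1.0, {"stock": 1.0}),    # unrelated query: relevance 0
]
w = importance(page, queries)
```

Only the first query contributes to `w`, because the page shares no terms with the second one.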
10
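Returned Information Ratio
Update information reported for page i is
$R_i = \sum_j \rho_{i,j} y_{ij}$
Goal is to maximize importance-weighted updates reported, $\sum_i W_i R_i$, subject to polling resource constraint
$\sum_{i,j} y_{ij} \le C$
Returned info ratio (RIR) is
$\mathrm{RIR} = \frac{\sum_i \sum_j W_i \rho_{i,j} y_{ij}}{\sum_i W_i \lambda_i}$
where $\lambda_i = \sum_j \rho_{i,j}$ is the expected number of updates to page i
• Numerator: importance-weighted updates captured by system
• Denominator: total importance-weighted expected updates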
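The ratio can be computed directly from the three ingredients named on this slide. The sketch below assumes importance is a list `W`, update probabilities a matrix `rho`, and poll decisions a 0/1 matrix `y`; the function name `rir` and the numbers are illustrative only.

```python
def rir(W, rho, y):
    """Returned information ratio: captured over expected importance-weighted updates."""
    captured = sum(
        W[i] * rho[i][j] * y[i][j]
        for i in range(len(W))
        for j in range(len(rho[i]))
    )
    expected = sum(W[i] * sum(rho[i]) for i in range(len(W)))
    return captured / expected if expected else 0.0


W = [2.0, 1.0]
rho = [[0.1, 0.1, 0.8, 0.0], [0.5, 0.0, 0.5, 0.0]]
y = [[0, 0, 1, 0], [1, 0, 0, 0]]  # two polls in total
value = rir(W, rho, y)
```

With these inputs the two polls capture 2.0 * 0.8 + 1.0 * 0.5 = 2.1 of the 3.0 importance-weighted expected updates.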
11
CAM system overview
Time proceeds in epochs
At the end of every epoch we re-evaluate
• Relevance
• Update probabilities
For the next epoch
• We select instants at which to poll each page (resource allocation)
• Schedule these instants subject to resource constraint
[Figure: CAM system stages: determining relevant pages, parameter tracking, resource allocation, scheduling, monitoring]
13
Resource allocation
Existing policies
• Uniform: Resources (#polls) distributed uniformly among all pages irrespective of their change frequency
• Proportional: #polls allocated to a page is proportional to the frequency with which it changes
For discrete queries, uniform is better than proportional for any inter-update distribution
CAM: solve a knapsack problem
• Better than uniform and proportional
• Proportional better than uniform
• Evidence that CQ objective ≠ discrete objective
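To make the allocation step concrete, here is a simplified sketch rather than the CAM algorithm itself. It assumes every poll costs one unit and that the gain of polling page i at step j is its importance times its update probability at that step, in which case the knapsack reduces to picking the C most valuable (page, step) pairs. The name `allocate_greedy` and the data are hypothetical.

```python
def allocate_greedy(W, rho, C):
    """Choose up to C (page, step) pairs with the largest W[i] * rho[i][j].

    Simplifying assumption: unit cost per poll and additive gains, so
    taking the top-C candidates solves the resulting knapsack exactly.
    """
    candidates = [
        (W[i] * rho[i][j], i, j)
        for i in range(len(W))
        for j in range(len(rho[i]))
    ]
    # Highest gain first; ties broken by page index, then time step.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    y = [[0] * len(row) for row in rho]
    for gain, i, j in candidates[:C]:
        if gain > 0:
            y[i][j] = 1
    return y


W = [2.0, 1.0]
rho = [[0.1, 0.1, 0.8, 0.0], [0.5, 0.0, 0.5, 0.0]]
y = allocate_greedy(W, rho, C=3)
```

The result is only a tentative plan: it says which instants are worth polling, not whether the crawler can actually serve them all at those instants.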
14
Scheduling
Suppose our crawler can fetch M pages concurrently, and an epoch is T time steps long
Then we can fetch a total of C = MT pages during an epoch
• Ensured by resource allocation phase
But at each instant we cannot schedule more than M fetches
• Want small planned-to-actual poll delays
• May fail to schedule all poll jobs in an epoch
[Figure: CAM system stages (determining relevant pages, parameter tracking, resource allocation, scheduling, monitoring), labeled “Tentative $y_{ij}$s”]
15
A flow-shop problem
M “machines” available at any time
Each $y_{ij}$ which is equal to 1 is a “job”
Job k is “released” at time step $r_k$ (= j)
“Processing time” = crawl time = $t_k$
“Completion time” of job k is $C_k$
Want to minimize “total flow” $\sum_k (C_k - r_k)$
NP-hard problem
• We use earliest deadline heuristic
[Figure: schedule of jobs over time; axes: Time, Job]
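A greedy scheduler for this setting might look like the sketch below. It is not the authors' heuristic: the slides do not define deadlines, so this version assumes unit crawl time and serves ready jobs in order of release time. The function name `schedule` and the example inputs are hypothetical.

```python
import heapq


def schedule(release_times, M, T):
    """Greedily place unit-time poll jobs into steps 0..T-1, at most M per step.

    Ready jobs are served earliest-release-first (a stand-in for an
    earliest-deadline rule). Returns (completion, total_flow, unscheduled).
    """
    pending = sorted((r, k) for k, r in enumerate(release_times))
    ready = []        # min-heap of (release_time, job_id)
    completion = {}   # job_id -> completion time
    nxt = 0
    for step in range(T):
        # Release every job whose planned poll instant has arrived.
        while nxt < len(pending) and pending[nxt][0] <= step:
            heapq.heappush(ready, pending[nxt])
            nxt += 1
        # Run at most M jobs in this step.
        for _ in range(M):
            if not ready:
                break
            _, k = heapq.heappop(ready)
            completion[k] = step + 1
    total_flow = sum(completion[k] - release_times[k] for k in completion)
    unscheduled = [k for k in range(len(release_times)) if k not in completion]
    return completion, total_flow, unscheduled


# Four jobs fit the epoch budget M*T = 6, but three are released at the last step.
completion, total_flow, unscheduled = schedule([0, 2, 2, 2], M=2, T=3)
```

The example shows the failure mode mentioned on the scheduling slide: the total number of jobs respects the epoch budget, yet one job cannot be placed because too many are released at the same instant.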
16
Experiments
Synthetic data
• Change frequency distribution: a few pages change very often (Zipfian)
• Update probability distribution: a few $\rho_{i,j}$’s are large, most are small (Zipfian again)
• Page importance distribution: also Zipfian (Wolman, 1999)
Real data
• Eight cricket score sites
• High update rate
[Figure: histogram of number of pages vs. change frequency]
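Synthetic inputs of this shape are easy to generate. The sketch below is one possible way to do it, assuming a rank-based Zipf law with exponent `s` spread over a fixed expected number of updates; the function names, the seed, and the parameter values are illustrative and are not taken from the experiments.

```python
import random


def zipf_weights(n, s):
    """Normalized Zipf weights: weight of rank r is proportional to 1 / r**s."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def synthetic_change_rates(n_pages, total_updates, s=1.0, seed=0):
    """Spread total_updates expected updates over pages with Zipfian skew."""
    rng = random.Random(seed)
    weights = zipf_weights(n_pages, s)
    rng.shuffle(weights)  # so page index does not encode rank
    return [total_updates * w for w in weights]


rates = synthetic_change_rates(n_pages=10, total_updates=100.0, s=1.0)
```

Setting `s` to 0 gives a uniform distribution, and larger values concentrate the updates on fewer pages.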
17
CAM > Proportional > Uniform
Uniform update and importance distributions
Plot RIR against ratio of resources to expected changes
RIR for CAM is >3 times better
Proportional is better than uniform in the CQ setting
• Intuition from “minimum total stale duration” does not apply to CQ
[Figure: RIR vs. monitor/change ratio for Uniform, Proportional, and CAM]
18
Resource allocation
[Figure: RIR per page bin (bins 1 to 10) for Uniform, Proportional, and CAM, with total info]
Sort pages by increasing change rate
Place in ten equally populated bins (10=fastest)
Uniform spends same resource for each bin
Proportional wastes fewer resources on slow-changing bins, but is not aggressive enough
CAM invests more aggressively in fast-changing bins, achieving the greatest RIR
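The binning used for this plot can be sketched as follows, assuming each page has a known change rate. The function name `bin_pages` is hypothetical, and bins here are indexed from 0 (slowest) rather than from 1.

```python
def bin_pages(change_rates, n_bins=10):
    """Sort pages by increasing change rate and split them into equally populated bins.

    Returns a list of bins, each a list of page indices; the last bin holds
    the fastest-changing pages.
    """
    order = sorted(range(len(change_rates)), key=lambda i: change_rates[i])
    bins = [[] for _ in range(n_bins)]
    for pos, page in enumerate(order):
        bins[pos * n_bins // len(order)].append(page)
    return bins


bins = bin_pages([0.5, 0.1, 0.9, 0.3], n_bins=2)
```

Per-bin RIR can then be obtained by restricting the RIR sums to the pages in each bin.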
19
Skew-handling and adaptation
Fixed monitoring/change ratio
Vary skew in update probability distribution
CAM’s gains increase with skew
CAM improves over initial epochs
Change distribution estimates stabilize within a few epochs
[Figure: RIR vs. Zipf parameter for CAM, Proportional, and Uniform]
20
Experiments on real pages
Eight sites with dynamic cricket match information
• In fact, Zipfian updates
Adversarial setup: monitor/change < 1
• CAM close to best possible
For a monitor/change ratio (M/C) of 2, CAM reports 80% of the information changed
[Figure: number of changes per page index (pages 1 to 8)]
[Figure: RIR vs. monitoring-change ratio for Uniform, Proportional, and CAM]
21
Conclusion
Continuous queries are inherently different from discrete queries
Approach used in CAM
• Identify relevant pages
• Track the pages as they change
• Characterize page change behavior
• Decide when to monitor the pages in future
CAM approach performs better than the naïve approaches (uniform and proportional)
22
References
J. Cho, H. Garcia-Molina. Synchronizing a database to improve freshness. ACM SIGMOD, 2000.
J. Cho, H. Garcia-Molina. Estimating frequency of change. Technical Report, 2000.
J. Sethuraman, J. L. Wolf, M. S. Squillante, P. S. Yu. Optimal crawling strategies for Web search engines. World Wide Web, 2002.