Modeling Web Content Dynamics
![Page 1: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/1.jpg)
Modeling Web Content Dynamics
Brian Brewington, George Cybenko
IMA February 2001
![Page 2: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/2.jpg)
Observing changing information sources
An index of changing information sources must re-index items periodically to keep the index from becoming out-of-date.
What does it mean for an observer or index to be “up-to-date” or “current”?
Our work on the web has two parts:
– Estimation of change rates for a large sample of web pages
– Re-indexing speed requirements with respect to a formal definition of “up-to-date”
![Page 3: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/3.jpg)
Your brain is good at this
Where is your visual attention directed when driving a car? Why?
Form state estimates; re-observe when uncertainty becomes too large.
![Page 4: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/4.jpg)
Ingredients
1. A formal definition of “up-to-dateness”
2. Data
3. Scheduling to optimize “up-to-dateness”
![Page 5: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/5.jpg)
A meaning for “up to date”
An index entry is (α, β)-current if it is correct to within a grace period of time β, with probability at least α.
To be “β-current”: no alteration is allowed in the gray region.
[Figure: timeline marking t_0 (last observed), t_n − β (start of the grace period), t_n (now), and t_0 + T (next observed). The gray region runs from t_0 to t_n − β.]
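The definition above can be sketched as a small check. This is a minimal illustration, not the authors' code: the function name `is_beta_current` and its arguments are hypothetical, and it assumes the full list of change times is known, which a real index never has.

```python
from bisect import bisect_right

def is_beta_current(change_times, t_last_observed, t_now, beta):
    """Return True if the entry is current to within the grace period `beta`.

    The entry counts as current when no change falls in the interval
    (t_last_observed, t_now - beta]. Changes inside the final `beta`
    time units before `t_now` are forgiven by the grace period.
    All names here are illustrative assumptions.
    """
    window_end = t_now - beta
    if window_end <= t_last_observed:
        # The whole elapsed time is still covered by the grace period.
        return True
    times = sorted(change_times)
    first = bisect_right(times, t_last_observed)
    last = bisect_right(times, window_end)
    return first == last  # no change times inside the window

if __name__ == "__main__":
    changes = [5.0]  # hypothetical change at t = 5
    print(is_beta_current(changes, t_last_observed=0.0, t_now=6.0, beta=2.0))
    print(is_beta_current(changes, t_last_observed=0.0, t_now=6.0, beta=0.5))
```

The probability in the definition would then be the fraction of times (or entries) for which this check holds.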
![Page 6: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/6.jpg)
(α, β)-currency has meaning in many contexts
Any source has a spectrum of possibilities; here are some possible values (guesses):
– Newspaper: (0.9, 1 day)
– Television news: (0.95, 1 hour)
– Broker watching stocks: (0.95, 30 min)
– Air traffic controller: (0.95, 20 sec)
– Web search engine: (0.6, 1 day)
– An old web page’s links: (0.4, 70 days)
![Page 7: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/7.jpg)
![Page 8: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/8.jpg)
Collecting web page data
Our web page data comes from a web monitoring service.
The Informant runs periodic standing user queries against four search engines and monitors user-selected URLs. When new or updated results appear, users are notified via email.
We download ~100,000 pages per day for ~30,000 users.
See http://informant.dartmouth.edu
![Page 9: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/9.jpg)
Sampling issues
Biased towards search engine results in the top 10 for users’ queries
No more than one observation of a page per day; pages are usually observed once every three days.
Queries and page checks are run only at night, so sample times are correlated.
Filesystem timestamps are available for about 65% of our observations.
![Page 10: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/10.jpg)
Data in our collection
As of March 2000, we had observations of about 3 million web pages. The data in the paper spans 7 months.
Each page is observed an average of 12 times, and the average time span of observation is 38 days.
Each observation includes:
– “Last-Modified” timestamps, when available
– Observation time (using the remote server’s if possible)
– Document summary information
» Number of bytes (“Content-Length”)
» Number of images, tables, forms, lists, banner ads
» 16-bit hash of text, hyperlinks, and image references
![Page 11: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/11.jpg)
“Lifetimes” vs. “ages”
We can model objects as having independent, identically-distributed time periods between modifications. We call these “lifetimes.”
The “age” is the time since the present lifetime began.
By analogy, think of replacement parts, each with an independent lifetime length.
[Figure: age plotted against time (0 to 4). Each marked point is a change. Lifetimes (labeled L1, L2) of 1.53, 1.14, 0.62, and 0.84 are shown.]
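The two quantities can be computed from a list of modification times as in the sketch below. The helper names `lifetimes` and `age_at` are hypothetical, and the example change times are an assumption: they are built by accumulating the four lifetime values from the figure, starting at time 0.

```python
from itertools import accumulate

def lifetimes(change_times):
    """Time between each pair of successive modifications."""
    ts = sorted(change_times)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]

def age_at(change_times, t):
    """Time since the most recent modification at or before `t`.

    Returns None if no modification has happened yet at time `t`.
    """
    past = [c for c in change_times if c <= t]
    if not past:
        return None
    return t - max(past)

if __name__ == "__main__":
    # Assumed change times: cumulative sums of the figure's lifetimes.
    changes = list(accumulate([0.0, 1.53, 1.14, 0.62, 0.84]))
    print(lifetimes(changes))
    print(age_at(changes, 2.0))
```

Note that the age resets to zero at every change, so a single observation of the age says something about the current lifetime without needing a second visit.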
![Page 12: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/12.jpg)
Determining dynamics from the time data
Two ways to find the distribution of change rates:
1. Observe the time between successive modifications. (Lifetimes)
Good: direct measurement of time between changes
Bad: aliasing possible; needs repeat observations
2. Observe the time since the most recent modification. (Ages)
Good: doesn’t have aliasing problems; works without having to make repeat observations
Bad: requires that we accurately account for growth
![Page 13: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/13.jpg)
Sampling the lifetime distribution
There are two problems with trying to sample the difference of successive change times (x = modification, o = observation):
1. A second observation (o) can miss changes (x); in the example, it misses two.
2. The observation window may not be big enough to see any changes (x).
[Figure: two timelines of modifications (x) and observations (o), marking the observation timespan, the actual lifetime, and the observed lifetime.]
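The first problem (missed changes) can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' experiment: changes are assumed to be Poisson with a mean lifetime of 1 day, the page is checked every 3 days, and each check is assumed to reveal the time of the most recent modification (as a "Last-Modified" timestamp would). The function name and all parameter values are hypothetical.

```python
import random
from statistics import mean

def simulate_lifetime_sampling(mean_lifetime=1.0, obs_period=3.0,
                               span=3000.0, seed=0):
    """Compare actual lifetimes with lifetimes inferred from periodic checks."""
    rng = random.Random(seed)

    # Actual modification times: exponential gaps (Poisson changes).
    changes = []
    t = rng.expovariate(1.0 / mean_lifetime)
    while t <= span:
        changes.append(t)
        t += rng.expovariate(1.0 / mean_lifetime)
    actual = [b - a for a, b in zip(changes, changes[1:])]

    # Each check records the latest modification time visible at that moment.
    seen = []
    idx = 0
    latest = None
    for k in range(1, int(span // obs_period) + 1):
        t_obs = k * obs_period
        while idx < len(changes) and changes[idx] <= t_obs:
            latest = changes[idx]
            idx += 1
        if latest is not None and (not seen or seen[-1] != latest):
            seen.append(latest)

    # Differences of successive distinct timestamps skip any missed changes.
    observed = [b - a for a, b in zip(seen, seen[1:])]
    return mean(actual), mean(observed)

if __name__ == "__main__":
    actual_mean, observed_mean = simulate_lifetime_sampling()
    print(f"mean actual lifetime:   {actual_mean:.2f}")
    print(f"mean observed lifetime: {observed_mean:.2f}")
```

Because several changes can fall between two checks, the observed lifetimes come out longer than the actual ones, which is the aliasing effect named on the earlier slide.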
![Page 14: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/14.jpg)
Web page age CDF
[Figure: cumulative probability (0 to 1) versus age in days (log scale; ticks at 1 day, 10 days, 100 days).]
• Median age 120 days
• Upper 25% > 1 year
• Lowest 25% < 1 month
![Page 15: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/15.jpg)
Empirical lifetime distribution
[Figure, two panels. Lifetime PDF: probability density (log scale, 10^-4 to 10^-2) versus lifetime in days (0 to 600). Lifetime CDF: cumulative probability (0.2 to 1) versus lifetime in days (log scale, 10^0 to 10^2).]
![Page 16: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/16.jpg)
When do changes happen?
Change times, mod 24×7 hours, show that more changes happen during the span of US working hours (8AM to 8PM, EST).
[Figure: relative frequency (×10^-3, 0 to 4) versus time since Thursday 12:00 GMT in hours (0 to 150); the horizontal axis is also labeled by day, from Wednesday afternoon through Wednesday morning.]
![Page 17: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/17.jpg)
Distribution of mean change times
The Weibull distribution, a generalized exponential, models mean lifetimes fairly well:
$$f(t) = \frac{\sigma}{\delta}\left(\frac{t}{\delta}\right)^{\sigma-1} e^{-(t/\delta)^{\sigma}}$$
This can be used to find an age or lifetime CDF for any shape parameter σ and scale parameter δ. But for the age CDF, a growth model is needed, so age-based estimates can be inaccurate.
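The Weibull density and its CDF are short enough to write out directly. The sketch below uses the standard two-parameter form; the function names are hypothetical, and the example values 1.4 (shape) and 152.2 (scale, in days) are the ones that appear on the fitted lifetime CDF plot in this deck, read here as shape and scale in that order.

```python
import math

def weibull_pdf(t, shape, scale):
    """Standard two-parameter Weibull density; zero for t <= 0."""
    if t <= 0:
        return 0.0
    x = t / scale
    return (shape / scale) * x ** (shape - 1) * math.exp(-(x ** shape))

def weibull_cdf(t, shape, scale):
    """Probability that a Weibull-distributed value is at most t."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))

if __name__ == "__main__":
    shape, scale = 1.4, 152.2  # values shown on the fitted CDF plot
    for days in (10, 100, 365):
        print(days, round(weibull_cdf(days, shape, scale), 3))
```

With shape equal to 1 the density reduces to an ordinary exponential, which is why the slide calls the Weibull a generalized exponential.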
![Page 18: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/18.jpg)
[Figure: lifetime CDF F (σ = 1.4, δ = 152.2). Cumulative probability (0 to 1) versus lifetime in days (log scale, 10^0 to 10^3), with trial and reference curves.]
![Page 19: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/19.jpg)
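(α, β)-currency for Poisson source
A single source has Poisson changes at rate λ. If re-indexed every T time units, the expected probability of the index entry being β-current is:
$$\Pr(\beta\text{-current}) = \frac{\beta}{T} + \frac{1 - e^{-\lambda (T - \beta)}}{\lambda T} = \frac{\beta}{T} + \frac{1 - e^{-z\,(1 - \beta/T)}}{z}, \qquad z = \lambda T$$
[Figure: probability versus expected changes per check period, λT (log scale, 10^-2 to 10^2), with curves for β/T = 0.9, 0.6, 0.25, and 0.0, and the asymptote 1/(λT).]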
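The closed form for a single Poisson source can be checked numerically. The sketch below assumes the entry is inspected at a uniformly random moment within a re-indexing period: with probability grace/period the moment is still inside the grace period, and otherwise the entry is current only if no change occurred in the preceding window. The function names and the example numbers (rate 0.1, period 10, grace 2) are hypothetical.

```python
import math
import random

def prob_current(rate, period, grace):
    """Expected probability of being current within `grace` for a Poisson source."""
    if grace >= period or rate == 0:
        return 1.0
    return grace / period + (1.0 - math.exp(-rate * (period - grace))) / (rate * period)

def prob_current_monte_carlo(rate, period, grace, trials=100_000, seed=0):
    """Estimate the same probability by simulating random inspection times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        elapsed = rng.uniform(0.0, period)   # time since the last re-index
        window = elapsed - grace             # span in which no change is allowed
        if window <= 0:
            hits += 1
        elif rng.expovariate(rate) > window:  # first change comes after the window
            hits += 1
    return hits / trials

if __name__ == "__main__":
    print(prob_current(0.1, 10.0, 2.0))
    print(prob_current_monte_carlo(0.1, 10.0, 2.0))
```

With no grace period and many expected changes per check, the formula approaches one over the expected number of changes per period, which matches the asymptote drawn on the plot.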
![Page 20: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/20.jpg)
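Probability of currency over a collection
Expected probability of a random index entry being β-current (given distribution f(t) of mean change times t):
$$\alpha = \int_0^{\infty} f(t)\left[\frac{\beta}{T} + \frac{1 - e^{-(T-\beta)/t}}{T/t}\right] dt$$
where f(t) is the distribution of average lifetimes (Weibull, with shape σ and scale δ),
$$f(t) = \frac{\sigma}{\delta}\left(\frac{t}{\delta}\right)^{\sigma-1} e^{-(t/\delta)^{\sigma}},$$
and the bracketed term is the probability of being β-current given average lifetime t.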
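The integral over the collection can be approximated with a simple midpoint rule. This is a rough numerical sketch, not the authors' computation: it assumes a Weibull distribution of mean lifetimes with shape 1.4 and scale 152.2 days (the values shown on the fitted CDF plot), truncates the integral at 20 times the scale, and uses hypothetical function names throughout.

```python
import math

def weibull_pdf(t, shape, scale):
    """Standard two-parameter Weibull density; zero for t <= 0."""
    if t <= 0:
        return 0.0
    x = t / scale
    return (shape / scale) * x ** (shape - 1) * math.exp(-(x ** shape))

def prob_current_given_mean_lifetime(t, period, grace):
    """Per-source probability, with Poisson rate 1/t for mean lifetime t."""
    if grace >= period:
        return 1.0
    rate = 1.0 / t
    return grace / period + (1.0 - math.exp(-rate * (period - grace))) / (rate * period)

def collection_currency(period, grace, shape=1.4, scale=152.2, steps=20000):
    """Midpoint-rule estimate of the expected probability over the collection."""
    upper = 20.0 * scale          # assumed cutoff; the Weibull tail beyond it is negligible
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h         # midpoint of the i-th slice, never exactly zero
        total += weibull_pdf(t, shape, scale) * prob_current_given_mean_lifetime(t, period, grace)
    return total * h

if __name__ == "__main__":
    for period in (10.0, 18.0, 60.0):
        print(period, round(collection_currency(period, grace=7.0), 3))
```

Sweeping the period for a fixed grace period and looking for where the result crosses a target probability is one way to trace out a level set like the one discussed on the following slides.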
![Page 21: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/21.jpg)
Index performance surface: α as a function of T, β/T
Surface formed by integrating out the rate dependence
Large period T implies α = β/T
Plane shown for α = 0.95 intersects the surface at a level set of (β, T) pairs
![Page 22: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/22.jpg)
[Figure: 95% level set of (T, β) pairs. Grace period β [days] (log scale, 10^-1 to 10^2) versus re-indexing period T [days] (log scale, 10^1 to 10^2), with age-based and lifetime-based curves. Annotated grace periods: β = 1 day, 1 week, 1 month, 1 year. Annotated re-indexing periods: T = 8.5, 11.5, 18, 23, 50, and 59 days.]
![Page 23: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/23.jpg)
Bandwidth needed for (0.95, 1-week) currency
For fixed-period checks, we can estimate processing speed requirements.
For (0.95, 1-week) currency of this collection:
– Must re-index with period around 18 days.
– A (0.95, 1-week) index of the whole web (~800 million pages) processes about 50 megabits/sec.
– A more “modest” (0.95, 1-week) index of 150 million pages will process 9 megabits/sec.
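The bandwidth figures follow from simple arithmetic: pages times bytes per page, spread over one re-indexing period. The slides do not state the page size used, so the 12,000 bytes per page below is an assumption, chosen because it roughly reproduces the quoted numbers; the function name is also hypothetical.

```python
def required_bandwidth_mbps(num_pages, period_days, avg_page_bytes):
    """Average megabits per second needed to fetch every page once per period."""
    seconds = period_days * 24 * 60 * 60
    bits = num_pages * avg_page_bytes * 8
    return bits / seconds / 1e6

if __name__ == "__main__":
    ASSUMED_PAGE_BYTES = 12_000  # assumption, not stated in the slides
    whole_web = required_bandwidth_mbps(800e6, 18, ASSUMED_PAGE_BYTES)
    modest = required_bandwidth_mbps(150e6, 18, ASSUMED_PAGE_BYTES)
    print(f"800 million pages: {whole_web:.1f} Mbit/s")
    print(f"150 million pages: {modest:.1f} Mbit/s")
```

Since the requirement scales linearly with the number of pages and inversely with the period, halving the re-indexing period doubles the bandwidth.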
![Page 24: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/24.jpg)
Empirical search engine currency
[Figure: curves for Google, Infoseek, AltaVista, and Northern Light; horizontal axis in days (log scale, 10^0 to 10^3), vertical axis from 0.4 to 1.]
![Page 25: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/25.jpg)
A calculus for currency
If x is current and y is current, then (x, y) is max current.
Extend this to other atomic operations on information, e.g. composition.
![Page 26: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/26.jpg)
Summary
– About one in five pages has been modified within the last 12 days.
– (0.95, 1-week) on our collection: must observe every 18 days.
– Ideas: More specialty search engines? Distributed monitoring/remote update?
– Other work: algorithms for scheduling observation based on source change rate and importance.
![Page 27: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/27.jpg)
Mathematics of “Semantic Hacking”
![Page 28: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/28.jpg)
Problem
Attack type: target (ordered from easy to detect to hard to detect)
– Denial of service attacks: infrastructure
– System attacks: systems
– Semantic attacks: information
![Page 29: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/29.jpg)
Distribution of information
“Gaussian” is expected.
Outliers
Collusion?
![Page 30: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/30.jpg)
What makes a good mystery/thriller?
“Correct” conclusion
“Wrong” conclusion
A wrong conclusion can be reached by one large, detectable bad decision or a sequence of small, undetectably perturbed decisions.
Understand the whole sequence of decisions, not just one in isolation.
![Page 31: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/31.jpg)
Ongoing research
Develop a model of such “semantic attacks”.
Develop a way to quantify such things.
Develop some tools for detecting/managing complex decision sequences.
Make information/decision systems more robust.
![Page 32: Modeling Web Content Dynamics](https://reader035.fdocuments.in/reader035/viewer/2022062309/5681399f550346895da13ada/html5/thumbnails/32.jpg)
Acknowledgements
DARPA contract F30602-98-2-0107
DoD MURI (AFOSR contract F49620-97-1-03821)
NSF KDI Grant 9873138