Addressing Incompleteness and Noise in Evolving Web Snapshots
KJDB2007
Masashi Toyoda
IIS, University of Tokyo
Web as a Projection of the World
• The Web now reflects various events in the real and virtual worlds
• The evolution of past topics can be tracked by observing the Web
• Identifying and tracking new information is important for observing new trends
  – Sociology, marketing, and survey research
[Figure: topics such as war, tsunami, sports, and computer viruses appearing in online news, weblogs, and BBS]
Massive Periodic Crawling for Observing Trends on the Web
[Figure: at each time T1, T2, ..., TN, a crawler takes a snapshot of the WWW into the archive; successive snapshots are compared]
[Figure: volume of the Web archive for each crawl: 1999/08, 2000/08, 2001/10, 2002/02, 2002/08, 2002/12, 2003/02, 2003/07, 2004/01, 2004/05, 2005/07, 2006/06]
Observing Trends on the Web
• WebRelievo [Toyoda 2005]
  – Evolution of link structure
Issues of Observing Evolution
• Incompleteness of snapshots
  – Cannot crawl the entire Web
  – Time of creation/deletion is uncertain for many pages
• Spam and mirror sites
  – Increasing number of spam sites that deceive search engines (9% to 25% of sites)
  – Many mirror sites (22% to 29% of pages)
[Figure: volume of the Web archive per crawl, 1999/08 to 2006/06; an example of link spamming]
What's Really New on the Web? Identifying New Pages from a Series of Unstable Web Snapshots [WWW2006]
Masashi Toyoda and Masaru Kitsuregawa
IIS, University of Tokyo
Problems in Massive Periodic Crawling
• The whole Web cannot be crawled
  – The number of uncrawled pages overwhelms the number of crawled pages, even after crawling 1B pages [Eiron et al. 2004]
• Web sites may be temporarily unavailable
  – Server and network troubles
• The novelty of a page crawled for the first time remains uncertain
  – The page might have existed at the previous crawl
  – The "Last-Modified" time guarantees only that the page is older than that time
Our Contribution
• Propose a novelty measure for estimating the certainty that a newly crawled page is really new
  – New pages can be extracted from a series of unstable snapshots
• Evaluate the precision, recall, and miss rate of the novelty measure
• Apply the novelty measure to our Web time machine application
Novelty Measure: Old and Unknown Pages
[Figure: crawled pages W(t-1) at time t-1 and W(t) at time t; pages in W(t) split into old pages O(t), also seen at t-1, and unknown pages U(t), whose existence at t-1 is uncertain]
Novelty Measure
• If all in-links to p come from pages crawled in the last 2 snapshots (L2(t)), then N(p) ≈ 1 and p is new
[Figure: page p at time t with all in-links from pages in L2(t)]
Novelty Measure
• N(p) is discounted when the novelty of some in-link sources is unknown
[Figure: page p with four in-links, one of them from an unknown page q; N(p) ≈ 0.75]
Novelty Measure
• If some in-links come from pages in U(t), what is N(p)?
[Figure: page p with an in-link from an unknown page q in U(t); N(p) ≈ ?]
Novelty Measure
• Determine the novelty measure recursively
[Figure: page p with four in-links, three from L2(t) pages and one from q with N(q) ≈ 0.5, giving N(p) ≈ (3 + 0.5) / 4]
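The recursive computation walked through on these slides can be illustrated in code. This is a reconstruction from the worked examples only (the page names and graph are hypothetical, and the paper's damping factor δ is omitted, i.e. δ = 0):

```python
def novelty(page, in_links, crawled_both, memo=None):
    """Sketch of the recursive novelty measure implied by the slides.
    in_links: dict page -> list of in-link source pages at time t.
    crawled_both: L2(t), pages crawled at both t-1 and t."""
    if memo is None:
        memo = {}
    if page in memo:
        return memo[page]
    memo[page] = 0.0  # break cycles: unresolved pages contribute nothing
    sources = in_links.get(page, [])
    if not sources:
        return 0.0
    total = 0.0
    for q in sources:
        if q in crawled_both:
            total += 1.0  # source seen in both crawls: its link to p is certainly new
        else:
            total += novelty(q, in_links, crawled_both, memo)
    memo[page] = total / len(sources)
    return memo[page]

# Toy graph matching the slides' worked figures (page names are ours):
# p has 4 in-links, 3 from L2(t) pages and 1 from the unknown page q;
# q has 2 in-links, 1 from L2(t) and 1 from a page with no information.
in_links = {'p': ['a', 'b', 'c', 'q'], 'q': ['d', 'e']}
crawled_both = {'a', 'b', 'c', 'd'}
n_q = novelty('q', in_links, crawled_both)   # (1 + 0) / 2 = 0.5
n_p = novelty('p', in_links, crawled_both)   # (3 + 0.5) / 4 = 0.875
```

With the unknown page e contributing 0, the values reproduce the N(q) ≈ 0.5 and N(p) ≈ (3 + 0.5) / 4 shown in the figures.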
Definition of Novelty Measure
• δ: damping factor
  – The probability that there were links to p before t-1
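The formula itself did not survive in this transcript. A reconstruction consistent with the worked examples on the previous slides (the exact form and placement of δ in the WWW2006 paper may differ) is:

```latex
N(p) \;\approx\; (1-\delta)\,\frac{1}{|I(p)|}\sum_{q \in I(p)} n(q),
\qquad
n(q) =
\begin{cases}
1 & \text{if } q \in L_2(t)\\[2pt]
N(q) & \text{if } q \in U(t)
\end{cases}
```

where I(p) is the set of in-link sources of p; with δ = 0 this reproduces N(p) ≈ (3 + 0.5) / 4 from the example.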
Experiment: Data Set
• A massively crawled Japanese web archive
Time Period Crawled pages Links
1999 Jul to Aug 17M 120M
2000 Jun to Aug 17M 112M
2001 Oct 40M 331M
2002 Feb 45M 375M
2003 Feb 66M 1058M
2003 Jul 97M 1589M
2004 Jan 81M 3452M
2004 May 96M 4505M
Experiment: Precision
• Given a threshold θ, p is judged to be novel when θ < N(p)
  – Precision: #(correctly judged) / #(judged to be novel)
  – Recall: #(correctly judged) / #(all novel pages)
• Use URLs including dates as a golden set
  – Assume that such pages appeared at the time included in the URL
  – E.g. http://foo.com/2004/05
  – Patterns: YYYYMM, YYYY/MM, YYYY-MM
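The evaluation procedure above can be sketched as follows. The date regex covers the three URL patterns listed; the scores and golden set below are toy values for illustration, not the paper's data:

```python
import re

# Matches a year 19xx/20xx followed by a month 01-12, optionally
# separated by '/' or '-' (patterns YYYYMM, YYYY/MM, YYYY-MM).
DATE_RE = re.compile(r'(19|20)\d{2}[/-]?(0[1-9]|1[0-2])')

def precision_recall(novelty_scores, truly_new, theta):
    """novelty_scores: dict url -> N(p); truly_new: golden set of new URLs."""
    judged = {u for u, n in novelty_scores.items() if n > theta}
    correct = judged & truly_new
    precision = len(correct) / len(judged) if judged else 0.0
    recall = len(correct) / len(truly_new) if truly_new else 0.0
    return precision, recall

# Toy data: two dated URLs form the golden set; one of them scores low.
scores = {'http://foo.com/2004/05': 0.9, 'http://bar.com/x': 0.8,
          'http://baz.com/200405': 0.2}
gold = {u for u in scores if DATE_RE.search(u)}
prec, rec = precision_recall(scores, gold, theta=0.5)
```

Here two URLs are judged novel, one correctly, so precision and recall are both 0.5.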
Experiment: Precision
• Precision jumps from the baseline when θ becomes positive, then gradually increases
• Positive novelty provides 80% to 90% precision
[Figure: precision and recall vs. minimum novelty threshold (0 to 1) for the 2003-07 snapshot, with damping factors delta = 0.0, 0.1, 0.2]
A Large-Scale Study of Link Spam Detection by Graph Algorithms
AIRWeb'07 Workshop, WWW2007
Hiroo Saito (University of Tokyo / JST ERATO)
Masashi Toyoda (University of Tokyo)
Masaru Kitsuregawa (University of Tokyo)
Kazuyuki Aihara (University of Tokyo / JST ERATO)
Outline
• Propose a link farm detection method using graph algorithms
  1. SCC decomposition
  2. Maximal clique enumeration
  3. Minimum cut
• Distribution of detected link farms in the Web graph structure
[Figure: around the largest SCC (CORE), large SCCs are link farms; link farms in CORE can be found as maximal cliques; link farms are expanded by min-cut. How many links are needed to cut them out?]
Dataset
• Japanese Web archive crawled in May 2004
  – 96 million pages, 4.5 billion links
  – 60% of pages in Japanese, 40% in other languages
• Site graph
  – Top of a site: a URL linked from 3 or more servers
  – A site is the set of URLs below the top URL
  – 5.9 million sites, 283 million links

Domains:
  Domain               Number      Ratio (%)
  .com                 2,711,588   46.2
  .jp                  1,353,842   23.1
  .net                   436,645    7.4
  .org                   211,983    3.6
  .de                    169,279    2.9
  .info                  144,483    2.5
  .nl, .kr, .us, etc.    841,610   14.3

Degree:
  max. indegree   61,006    avg. indegree   48
  max. outdegree  70,294    avg. outdegree  48
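The site-graph construction described above can be sketched roughly as follows. This is our reading of the slide's two-line definition; the helper name, URL list, and server sets are illustrative assumptions:

```python
def build_sites(urls, inlink_servers, min_servers=3):
    """Assign each URL to a site: a site top is a URL linked from
    `min_servers` or more distinct servers, and each URL belongs to
    the longest site-top prefix above it."""
    tops = sorted(
        (u for u in urls if len(inlink_servers.get(u, ())) >= min_servers),
        key=len, reverse=True)  # longest matching prefix wins
    site_of = {}
    for u in urls:
        for top in tops:
            if u.startswith(top):
                site_of[u] = top
                break
    return site_of

# Hypothetical URLs: 'http://x.jp/' and 'http://x.jp/a/' each have
# in-links from 3 distinct servers, so both become site tops.
urls = ['http://x.jp/', 'http://x.jp/a/', 'http://x.jp/a/b.html',
        'http://x.jp/c.html']
inlink_servers = {'http://x.jp/': {'s1', 's2', 's3'},
                  'http://x.jp/a/': {'s1', 's2', 's4'}}
site_of = build_sites(urls, inlink_servers)
```

Site-level links would then be the page-level links aggregated between the resulting site tops.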
SCC decomposition
• Size distribution follows the power law (1 ≤ n ≤ 100), with a long and thick tail
• Large SCCs (100 < n) are spam
  – 552 SCCs, 0.57M sites
  – 550 sample sites

Sampling results:
             spam   suspicious   non-spam
  #sites      527           23          0
  ratio (%)  95.8          4.2          0

[Figure: size distribution of SCCs on a log-log plot, number of SCCs vs. size of SCCs, from 1 to 10^6]
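The first seed-extraction step, SCC decomposition, can be sketched with Kosaraju's algorithm. This is a generic textbook implementation on a toy graph, not the authors' code:

```python
from collections import defaultdict

def sccs(nodes, edges):
    """Kosaraju's algorithm (iterative): DFS finish order on the graph,
    then collect components on the reversed graph.
    `nodes` must list every node; returns a list of SCCs as sets."""
    graph, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        rev[v].append(u)
    order, seen = [], set()
    for s in nodes:                      # pass 1: record finish order
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(graph[s]))]
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                order.append(node)
                stack.pop()
            elif nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, iter(graph[nxt])))
    comps, assigned = [], set()
    for s in reversed(order):            # pass 2: sweep reversed graph
        if s in assigned:
            continue
        comp, work = set(), [s]
        assigned.add(s)
        while work:
            u = work.pop()
            comp.add(u)
            for v in rev[u]:
                if v not in assigned:
                    assigned.add(v)
                    work.append(v)
        comps.append(comp)
    return comps

# Toy graph: two 2-cycles joined by a one-way link
comps = sccs(['a', 'b', 'c', 'd'],
             [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'd'), ('d', 'c')])
```

On the 5.9M-site graph, the paper's method keeps only the SCCs larger than 100 sites as spam seeds.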
Distribution of SCCs in the bow tie
• Bow-tie structure [Broder et al. 2000]
• Distribution of large SCCs
  – 450 / 552 (81%) SCCs in OUT
  – 385 / 450 (85%) SCCs directly connected to CORE
• CORE has many spam sites connecting to them
[Figure: bow-tie components around CORE, showing SCCs whose size is larger than 1,000; component shares: OUT 60%, CORE 30%, IN 1%, TENDRILS 2%, OTHERS 7%]
Maximal clique enumeration
• Use maximal cliques for extracting spam from CORE
  – Link farms tend to include cliques
• Maximal clique enumeration [Makino, Uno 2004]
  – Ignore nodes with high degree (80 < d)
    • Because the cost grows as O(max. degree^4)
  – Large cliques (40 < n) are link farms
    • 26,931 maximal cliques covering 8,346 sites (many duplicates)
    • 165 sample sites

Sampling result:
             spam   suspicious   non-spam
  #sites      157            8          0
  ratio (%)  95.2          4.8          0
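The second seed-extraction step can be sketched with Bron-Kerbosch maximal clique enumeration with pivoting. This is a generic implementation on a toy graph; the paper uses the output-sensitive algorithm of [Makino, Uno 2004], which scales much better on large graphs:

```python
def maximal_cliques(adj):
    """Bron-Kerbosch maximal clique enumeration with pivoting.
    adj: dict node -> set of neighbors (undirected, no self-loops).
    Yields each maximal clique exactly once, as a set."""
    def bk(r, p, x):
        if not p and not x:
            yield set(r)
            return
        # Pivot on the vertex with most neighbors in p to prune branches
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            yield from bk(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from bk(set(), set(adj), set())

# Toy graph: triangle a-b-c with a pendant edge c-d
adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
cliques = [frozenset(c) for c in maximal_cliques(adj)]
```

Dropping nodes with degree above 80 before enumeration, as the slide notes, keeps the degree-dependent cost bounded.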
Minimum cut
• How many spam sites are around the large SCCs and cliques?
• How many links are needed to cut off the spam sites?
• Apply max-flow / min-cut on the directed site graph
[Figure: a virtual source connected to the detected cliques (8,000 sites) and large SCCs (450,000 sites) near CORE, and a virtual sink connected to 210 white sites; the min-cut of 18,000 links separates 57,000 additional sites]

Sampling result:
              number   ratio (%)
  spam           459       94.3
  suspicious      27        5.5
  non-spam         1        0.2
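The seed-expansion step can be sketched with a generic Edmonds-Karp max-flow between a virtual source (attached to the spam seeds) and a virtual sink (attached to known white sites). Node names and capacities in the toy network are illustrative:

```python
from collections import deque, defaultdict

def max_flow(cap, source, sink):
    """Edmonds-Karp max-flow on a directed graph.
    cap: dict of dicts, cap[u][v] = capacity of edge u -> v.
    By max-flow/min-cut duality, the returned value equals the total
    capacity of a minimum cut separating source from sink."""
    residual = defaultdict(lambda: defaultdict(int))
    for u in cap:
        for v, c in cap[u].items():
            residual[u][v] += c
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = {source: None}
        q = deque([source])
        while q and sink not in parent:
            u = q.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if sink not in parent:
            return flow
        # Bottleneck capacity along the augmenting path
        v, bottleneck = sink, float('inf')
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck along the path
        v = sink
        while parent[v] is not None:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck

# Toy network: virtual source 's' feeds two seed sites, 't' is the
# virtual sink behind the white sites
cap = {'s': {'a': 3, 'b': 2}, 'a': {'t': 2}, 'b': {'t': 3}}
cut_value = max_flow(cap, 's', 't')
```

The sites left on the source side of the resulting min-cut are the expanded spam set; in the paper, cutting 18,000 links separates roughly 0.5M sites.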
Conclusions and future work
• An automatic link farm detection method
  – Based on graph algorithms
    • Seed extraction: SCC decomposition and maximal clique enumeration
    • Seed expansion: max-flow / min-cut
  – High precision (95% to 99%)
• Distribution of link farms in the Web graph structure
  – Large SCCs around CORE, maximal cliques in CORE
  – Only 18,000 links needed to cut off 0.5M spam sites
• Future work
  – Improving recall (small SCCs, large cliques in CORE)
  – Experiments on other datasets