Addressing Incompleteness and Noise in Evolving Web Snapshots
KJDB2007
Masashi Toyoda
IIS, University of Tokyo
Web as a Projection of the World
• The Web now reflects various events in the real and virtual worlds
• The evolution of past topics can be tracked by observing the Web
• Identifying and tracking new information is important for observing new trends
  – Sociology, marketing, and survey research
[Figure: topics such as war, tsunami, sports, and computer viruses appearing in online news, weblogs, and BBS]
Massive Periodic Crawling for Observing Trends on the Web
[Figure: at each time T1, T2, ..., TN, a crawler takes a snapshot of the WWW into the archive; successive snapshots are compared]
[Figure: volume of the Web archive for each crawl: 1999/08, 2000/08, 2001/10, 2002/02, 2002/08, 2002/12, 2003/02, 2003/07, 2004/01, 2004/05, 2005/07, 2006/06]
Observing Trends on the Web
• WebRelievo [Toyoda 2005]
  – Evolution of link structure
Issues of Observing Evolution
• Incompleteness of snapshots
  – Cannot crawl the entire Web
  – Time of creation/deletion is uncertain for many pages
• Spam and mirror sites
  – Increasing number of spam sites that deceive search engines (9% to 25% of sites)
  – Many mirror sites (22% to 29% of pages)
[Figure: volume of the Web archive per crawl, 1999/08 to 2006/06; an example of link spamming]
What's Really New on the Web? Identifying New Pages from a Series of Unstable Web Snapshots [WWW2006]
Masashi Toyoda and Masaru Kitsuregawa
IIS, University of Tokyo
Problems in Massive Periodic Crawling
• The whole Web cannot be crawled
  – The number of uncrawled pages overwhelms the number of crawled pages, even after crawling 1B pages [Eiron et al. 2004]
• Web sites may be temporarily unavailable
  – Server and network troubles
• The novelty of a page crawled for the first time remains uncertain
  – The page might have existed at the previous crawl
  – The "Last-Modified" time guarantees only that the page is older than that time
Our Contribution
• Propose a novelty measure for estimating the certainty that a newly crawled page is really new
  – New pages can be extracted from a series of unstable snapshots
• Evaluate the precision, recall, and miss rate of the novelty measure
• Apply the novelty measure to our Web time machine application
Novelty Measure: Old and Unknown Pages
[Figure: crawled pages W(t-1) at time t-1 and W(t) at time t; pages in W(t) split into old pages O(t), also seen at t-1, and unknown pages U(t), whose existence at t-1 is uncertain]
Novelty Measure
• If all in-links to p come from pages crawled in the last 2 snapshots (L2(t)), then N(p) ≈ 1 and p is new
[Figure: page p at time t with all in-links from pages in L2(t)]
Novelty Measure
• N(p) is discounted when the novelty of some in-link sources is unknown
[Figure: page p with four in-links, one of them from an unknown page q; N(p) ≈ 0.75]
Novelty Measure
• If some in-links come from pages in U(t), what is N(p)?
[Figure: page p with an in-link from an unknown page q in U(t); N(p) ≈ ?]
Novelty Measure
• Determine the novelty measure recursively
[Figure: page p with four in-links, three from L2(t) pages and one from q with N(q) ≈ 0.5, giving N(p) ≈ (3 + 0.5) / 4]
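The recursive computation walked through on these slides can be illustrated in code. This is a reconstruction from the worked examples only (the page names and graph are hypothetical, and the paper's damping factor δ is omitted, i.e. δ = 0):

```python
def novelty(page, in_links, crawled_both, memo=None):
    """Sketch of the recursive novelty measure implied by the slides.
    in_links: dict page -> list of in-link source pages at time t.
    crawled_both: L2(t), pages crawled at both t-1 and t."""
    if memo is None:
        memo = {}
    if page in memo:
        return memo[page]
    memo[page] = 0.0  # break cycles: unresolved pages contribute nothing
    sources = in_links.get(page, [])
    if not sources:
        return 0.0
    total = 0.0
    for q in sources:
        if q in crawled_both:
            total += 1.0  # source seen in both crawls: its link to p is certainly new
        else:
            total += novelty(q, in_links, crawled_both, memo)
    memo[page] = total / len(sources)
    return memo[page]

# Toy graph matching the slides' worked figures (page names are ours):
# p has 4 in-links, 3 from L2(t) pages and 1 from the unknown page q;
# q has 2 in-links, 1 from L2(t) and 1 from a page with no information.
in_links = {'p': ['a', 'b', 'c', 'q'], 'q': ['d', 'e']}
crawled_both = {'a', 'b', 'c', 'd'}
n_q = novelty('q', in_links, crawled_both)   # (1 + 0) / 2 = 0.5
n_p = novelty('p', in_links, crawled_both)   # (3 + 0.5) / 4 = 0.875
```

With the unknown page e contributing 0, the values reproduce the N(q) ≈ 0.5 and N(p) ≈ (3 + 0.5) / 4 shown in the figures.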
Definition of Novelty Measure
• δ: damping factor
  – The probability that there were links to p before t-1
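The formula itself did not survive in this transcript. A reconstruction consistent with the worked examples on the previous slides (the exact form and placement of δ in the WWW2006 paper may differ) is:

```latex
N(p) \;\approx\; (1-\delta)\,\frac{1}{|I(p)|}\sum_{q \in I(p)} n(q),
\qquad
n(q) =
\begin{cases}
1 & \text{if } q \in L_2(t)\\[2pt]
N(q) & \text{if } q \in U(t)
\end{cases}
```

where I(p) is the set of in-link sources of p; with δ = 0 this reproduces N(p) ≈ (3 + 0.5) / 4 from the example.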
Experiment: Data Set
• A massively crawled Japanese web archive
Time Period Crawled pages Links
1999 Jul to Aug 17M 120M
2000 Jun to Aug 17M 112M
2001 Oct 40M 331M
2002 Feb 45M 375M
2003 Feb 66M 1058M
2003 Jul 97M 1589M
2004 Jan 81M 3452M
2004 May 96M 4505M
Experiment: Precision
• Given a threshold θ, p is judged to be novel when θ < N(p)
  – Precision: #(correctly judged) / #(judged to be novel)
  – Recall: #(correctly judged) / #(all novel pages)
• Use URLs including dates as a golden set
  – Assume that such pages appeared at the time included in the URL
  – E.g. http://foo.com/2004/05
  – Patterns: YYYYMM, YYYY/MM, YYYY-MM
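The evaluation procedure above can be sketched as follows. The date regex covers the three URL patterns listed; the scores and golden set below are toy values for illustration, not the paper's data:

```python
import re

# Matches a year 19xx/20xx followed by a month 01-12, optionally
# separated by '/' or '-' (patterns YYYYMM, YYYY/MM, YYYY-MM).
DATE_RE = re.compile(r'(19|20)\d{2}[/-]?(0[1-9]|1[0-2])')

def precision_recall(novelty_scores, truly_new, theta):
    """novelty_scores: dict url -> N(p); truly_new: golden set of new URLs."""
    judged = {u for u, n in novelty_scores.items() if n > theta}
    correct = judged & truly_new
    precision = len(correct) / len(judged) if judged else 0.0
    recall = len(correct) / len(truly_new) if truly_new else 0.0
    return precision, recall

# Toy data: two dated URLs form the golden set; one of them scores low.
scores = {'http://foo.com/2004/05': 0.9, 'http://bar.com/x': 0.8,
          'http://baz.com/200405': 0.2}
gold = {u for u in scores if DATE_RE.search(u)}
prec, rec = precision_recall(scores, gold, theta=0.5)
```

Here two URLs are judged novel, one correctly, so precision and recall are both 0.5.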
Experiment: Precision
• Precision jumps from the baseline when θ becomes positive, then gradually increases
• Positive novelty provides 80% to 90% precision
[Figure: precision and recall vs. minimum novelty threshold (0 to 1) for the 2003-07 snapshot, with damping factors delta = 0.0, 0.1, 0.2]
A Large-Scale Study of Link Spam Detection by Graph Algorithms
AIRWeb'07 Workshop, WWW2007
Hiroo Saito (University of Tokyo / JST ERATO)
Masashi Toyoda (University of Tokyo)
Masaru Kitsuregawa (University of Tokyo)
Kazuyuki Aihara (University of Tokyo / JST ERATO)
Outline
• Propose a link farm detection method using graph algorithms
  1. SCC decomposition
  2. Maximal clique enumeration
  3. Minimum cut
• Distribution of detected link farms in the Web graph structure
[Figure: around the largest SCC (CORE), large SCCs are link farms; link farms in CORE can be found as maximal cliques; link farms are expanded by min-cut. How many links are needed to cut them out?]
Dataset
• Japanese Web archive crawled in May 2004
  – 96 million pages, 4.5 billion links
  – 60% of pages in Japanese, 40% in other languages
• Site graph
  – Top of a site: a URL linked from 3 or more servers
  – A site is the set of URLs below the top URL
  – 5.9 million sites, 283 million links

Domains:
  Domain               Number      Ratio (%)
  .com                 2,711,588   46.2
  .jp                  1,353,842   23.1
  .net                   436,645    7.4
  .org                   211,983    3.6
  .de                    169,279    2.9
  .info                  144,483    2.5
  .nl, .kr, .us, etc.    841,610   14.3

Degree:
  max. indegree   61,006    avg. indegree   48
  max. outdegree  70,294    avg. outdegree  48
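The site-graph construction described above can be sketched roughly as follows. This is our reading of the slide's two-line definition; the helper name, URL list, and server sets are illustrative assumptions:

```python
def build_sites(urls, inlink_servers, min_servers=3):
    """Assign each URL to a site: a site top is a URL linked from
    `min_servers` or more distinct servers, and each URL belongs to
    the longest site-top prefix above it."""
    tops = sorted(
        (u for u in urls if len(inlink_servers.get(u, ())) >= min_servers),
        key=len, reverse=True)  # longest matching prefix wins
    site_of = {}
    for u in urls:
        for top in tops:
            if u.startswith(top):
                site_of[u] = top
                break
    return site_of

# Hypothetical URLs: 'http://x.jp/' and 'http://x.jp/a/' each have
# in-links from 3 distinct servers, so both become site tops.
urls = ['http://x.jp/', 'http://x.jp/a/', 'http://x.jp/a/b.html',
        'http://x.jp/c.html']
inlink_servers = {'http://x.jp/': {'s1', 's2', 's3'},
                  'http://x.jp/a/': {'s1', 's2', 's4'}}
site_of = build_sites(urls, inlink_servers)
```

Site-level links would then be the page-level links aggregated between the resulting site tops.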
SCC decomposition
• Size distribution follows the power law (1 ≤ n ≤ 100), with a long and thick tail
• Large SCCs (100 < n) are spam
  – 552 SCCs, 0.57M sites
  – 550 sample sites

Sampling results:
             spam   suspicious   non-spam
  #sites      527           23          0
  ratio (%)  95.8          4.2          0

[Figure: size distribution of SCCs on a log-log plot, number of SCCs vs. size of SCCs, from 1 to 10^6]
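The first seed-extraction step, SCC decomposition, can be sketched with Kosaraju's algorithm. This is a generic textbook implementation on a toy graph, not the authors' code:

```python
from collections import defaultdict

def sccs(nodes, edges):
    """Kosaraju's algorithm (iterative): DFS finish order on the graph,
    then collect components on the reversed graph.
    `nodes` must list every node; returns a list of SCCs as sets."""
    graph, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        rev[v].append(u)
    order, seen = [], set()
    for s in nodes:                      # pass 1: record finish order
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(graph[s]))]
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                order.append(node)
                stack.pop()
            elif nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, iter(graph[nxt])))
    comps, assigned = [], set()
    for s in reversed(order):            # pass 2: sweep reversed graph
        if s in assigned:
            continue
        comp, work = set(), [s]
        assigned.add(s)
        while work:
            u = work.pop()
            comp.add(u)
            for v in rev[u]:
                if v not in assigned:
                    assigned.add(v)
                    work.append(v)
        comps.append(comp)
    return comps

# Toy graph: two 2-cycles joined by a one-way link
comps = sccs(['a', 'b', 'c', 'd'],
             [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'd'), ('d', 'c')])
```

On the 5.9M-site graph, the paper's method keeps only the SCCs larger than 100 sites as spam seeds.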
Distribution of SCCs in the bow tie
• Bow-tie structure [Broder et al. 2000]
• Distribution of large SCCs
  – 450 / 552 (81%) SCCs in OUT
  – 385 / 450 (85%) SCCs directly connected to CORE
• CORE has many spam sites connecting to them
[Figure: bow-tie components around CORE, showing SCCs whose size is larger than 1,000; component shares: OUT 60%, CORE 30%, IN 1%, TENDRILS 2%, OTHERS 7%]
Maximal clique enumeration
• Use maximal cliques for extracting spam from CORE
  – Link farms tend to include cliques
• Maximal clique enumeration [Makino, Uno 2004]
  – Ignore nodes with high degree (80 < d)
    • Because the cost grows as O(max. degree^4)
  – Large cliques (40 < n) are link farms
    • 26,931 maximal cliques covering 8,346 sites (many duplicates)
    • 165 sample sites

Sampling result:
             spam   suspicious   non-spam
  #sites      157            8          0
  ratio (%)  95.2          4.8          0
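The second seed-extraction step can be sketched with Bron-Kerbosch maximal clique enumeration with pivoting. This is a generic implementation on a toy graph; the paper uses the output-sensitive algorithm of [Makino, Uno 2004], which scales much better on large graphs:

```python
def maximal_cliques(adj):
    """Bron-Kerbosch maximal clique enumeration with pivoting.
    adj: dict node -> set of neighbors (undirected, no self-loops).
    Yields each maximal clique exactly once, as a set."""
    def bk(r, p, x):
        if not p and not x:
            yield set(r)
            return
        # Pivot on the vertex with most neighbors in p to prune branches
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            yield from bk(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from bk(set(), set(adj), set())

# Toy graph: triangle a-b-c with a pendant edge c-d
adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
cliques = [frozenset(c) for c in maximal_cliques(adj)]
```

Dropping nodes with degree above 80 before enumeration, as the slide notes, keeps the degree-dependent cost bounded.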
Minimum cut
• How many spam sites are around the large SCCs and cliques?
• How many links are needed to cut off the spam sites?
• Apply max-flow / min-cut on the directed site graph
[Figure: a virtual source connected to the detected cliques (8,000 sites) and large SCCs (450,000 sites) near CORE, and a virtual sink connected to 210 white sites; the min-cut of 18,000 links separates 57,000 additional sites]

Sampling result:
              number   ratio (%)
  spam           459       94.3
  suspicious      27        5.5
  non-spam         1        0.2
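The seed-expansion step can be sketched with a generic Edmonds-Karp max-flow between a virtual source (attached to the spam seeds) and a virtual sink (attached to known white sites). Node names and capacities in the toy network are illustrative:

```python
from collections import deque, defaultdict

def max_flow(cap, source, sink):
    """Edmonds-Karp max-flow on a directed graph.
    cap: dict of dicts, cap[u][v] = capacity of edge u -> v.
    By max-flow/min-cut duality, the returned value equals the total
    capacity of a minimum cut separating source from sink."""
    residual = defaultdict(lambda: defaultdict(int))
    for u in cap:
        for v, c in cap[u].items():
            residual[u][v] += c
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = {source: None}
        q = deque([source])
        while q and sink not in parent:
            u = q.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if sink not in parent:
            return flow
        # Bottleneck capacity along the augmenting path
        v, bottleneck = sink, float('inf')
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck along the path
        v = sink
        while parent[v] is not None:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck

# Toy network: virtual source 's' feeds two seed sites, 't' is the
# virtual sink behind the white sites
cap = {'s': {'a': 3, 'b': 2}, 'a': {'t': 2}, 'b': {'t': 3}}
cut_value = max_flow(cap, 's', 't')
```

The sites left on the source side of the resulting min-cut are the expanded spam set; in the paper, cutting 18,000 links separates roughly 0.5M sites.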
Conclusions and future work
• An automatic link farm detection method
  – Based on graph algorithms
    • Seed extraction: SCC decomposition and maximal clique enumeration
    • Seed expansion: max-flow / min-cut
  – High precision (95% to 99%)
• Distribution of link farms in the Web graph structure
  – Large SCCs around CORE, maximal cliques in CORE
  – Only 18,000 links needed to cut off 0.5M spam sites
• Future work
  – Improving recall (small SCCs, large cliques in CORE)
  – Experiments on other datasets