Addressing Incompleteness and Noise in Evolving Web Snapshots KJDB2007 Masashi Toyoda IIS, University of Tokyo

Page 1: Addressing Incompleteness and Noise in Evolving Web Snapshots

KJDB2007

Masashi Toyoda

IIS, University of Tokyo

Page 2: Web as a projection of the world

• The Web now reflects various events in the real and virtual world

• The evolution of past topics can be tracked by observing the Web

• Identifying and tracking new information is important for observing new trends
– Sociology, marketing, and survey research

[Figure: events (war, tsunami, sports, computer virus) reflected in online news, weblogs, and BBS]

Page 3: Massive Periodic Crawling for Observing Trends on the Web

[Figure: a crawler collects the WWW into an archive at times T1, T2, ..., TN; the archived snapshots are compared]

Page 4: Web Archive

[Figure: volume of the archive per snapshot: 1999/08, 2000/08, 2001/10, 2002/02, 2002/08, 2002/12, 2003/02, 2003/07, 2004/01, 2004/05, 2005/07, 2006/06]

Page 5: Observing Trends on the Web

• WebRelievo [Toyoda 2005]
– Evolution of link structure

Page 6: Issues of Observing Evolution

[Figure: volume of the archive per snapshot, 1999/08 to 2006/06]

Incompleteness of snapshots
・ Cannot crawl the entire Web
・ Time of creation/deletion is uncertain for many pages

Spam & mirror sites
・ Increasing number of spam sites that deceive search engines (9% to 25% of sites)
・ Many mirror sites (22% to 29% of pages)

[Figure: example of link spamming]

Page 7: What's Really New on the Web? Identifying New Pages from a Series of Unstable Web Snapshots [WWW2006]

Masashi Toyoda and Masaru Kitsuregawa

IIS, University of Tokyo

Page 8: Problems in Massive Periodic Crawling

• The whole Web cannot be crawled
– The number of uncrawled pages overwhelms the number of crawled pages even after crawling 1B pages [Eiron et al. 2004]

• Web sites may be temporarily unavailable
– Server and network troubles

The novelty of a page crawled for the first time remains uncertain
– The page might have existed at the previous crawl time
– The "Last-Modified" time guarantees only that the page is older than that time

Page 9: Our Contribution

• Propose a novelty measure for estimating the certainty that a newly crawled page is really new
– New pages can be extracted from a series of unstable snapshots

• Evaluate the precision, recall, and miss rate of the novelty measure

• Apply the novelty measure to our Web time machine application

Page 10: Novelty Measure

Old and Unknown Pages

[Figure: crawled pages W(t-1) at time t-1 and W(t) at time t; within W(t), the old pages O(t) and the pages of unknown novelty U(t), marked "?"]

Page 11: Novelty Measure

If all in-links come from pages crawled the last 2 times (L2(t)), then N(p) ≒ 1

[Figure: new page p at time t whose in-links all come from pages crawled at both t-1 and t, L2(t)]

Page 12: Novelty Measure

N(p) is discounted when the novelty of some in-links is unknown

[Figure: pages q and p across times t-1 and t; one in-link of the new page p has unknown novelty ("?"), giving N(p) ≒ 0.75]

Page 13: Novelty Measure

What if some in-links come from U(t)?

[Figure: pages q and p across times t-1 and t; q is in U(t), so its own novelty is unknown ("?") and N(p) ≒ ?]

Page 14: Novelty Measure

Determine the novelty measure recursively

[Figure: pages q and p across times t-1 and t; N(q) ≒ 0.5, and the in-link from q contributes 0.5, so N(p) ≒ (3 + 0.5) / 4]

Page 15: Definition of Novelty Measure

• δ: damping factor
– Probability that there were links to p before t-1
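
The recursion illustrated on pages 11 to 14 can be sketched as a fixed-point iteration. This is a minimal sketch, not the authors' implementation. It assumes that N(p) is the average over the in-links of p of 1 for a source in L2(t), N(q) for a source q in U(t), and 0 for any other source, and that the damping factor scales this average by (1 - δ). The function name `novelty`, the data layout, the (1 - δ) scaling, and the in-links chosen for q are all assumptions made for illustration.

```python
def novelty(in_links, l2, unknown, delta=0.0, iterations=50):
    """Iteratively estimate N(p) for every page in `unknown` (U(t)).

    in_links: dict mapping a page to the list of pages linking to it
    l2:       set of pages crawled at both t-1 and t (L2(t))
    unknown:  set of pages crawled at t whose novelty is unknown (U(t))
    delta:    damping factor (assumed to scale the average by 1 - delta)
    """
    scores = {p: 0.0 for p in unknown}      # start with no evidence of novelty
    for _ in range(iterations):
        updated = {}
        for p in unknown:
            sources = in_links.get(p, [])
            if not sources:
                updated[p] = 0.0            # no in-links: nothing to average
                continue
            total = 0.0
            for q in sources:
                if q in l2:
                    total += 1.0            # source crawled at both t-1 and t
                elif q in unknown:
                    total += scores[q]      # recurse through q's current estimate
                # any other source contributes 0
            updated[p] = (1.0 - delta) * total / len(sources)
        scores = updated
    return scores


# Page 14 example: p has three in-links from L2(t) and one from q in U(t).
# Assumption: q has one in-link from L2(t) and one from a page in neither set.
in_links = {"p": ["a", "b", "c", "q"], "q": ["a", "x"]}
scores = novelty(in_links, l2={"a", "b", "c"}, unknown={"p", "q"})
# scores["q"] is 0.5 and scores["p"] is (3 + 0.5) / 4 = 0.875
```

With `delta=0.0` the sketch reproduces the values shown on pages 12 and 14; a positive `delta` lowers every score proportionally.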

Page 16: Experiment: Data Set

• A massively crawled Japanese web archive

Time  Period      Crawled pages  Links
1999  Jul to Aug  17M            120M
2000  Jun to Aug  17M            112M
2001  Oct         40M            331M
2002  Feb         45M            375M
2003  Feb         66M            1058M
2003  Jul         97M            1589M
2004  Jan         81M            3452M
2004  May         96M            4505M

Page 17: Experiment: Precision

• Given a threshold θ, p is judged to be novel when θ < N(p)
– Precision: #(correctly judged) / #(judged to be novel)
– Recall: #(correctly judged) / #(all novel pages)

• Use URLs that include dates as a golden set
– Assume that they appeared at the time included in the URL
– E.g. http://foo.com/2004/05
– Patterns: YYYYMM, YYYY/MM, YYYY-MM
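
The evaluation above can be sketched in a few lines. This is a hedged illustration rather than the authors' evaluation code: the regular expression, the helper names `url_date` and `precision_recall`, and the reading of the third pattern as year-month are assumptions.

```python
import re

# Year and month embedded in a URL as YYYYMM, YYYY/MM, or YYYY-MM (assumed forms).
DATE_IN_URL = re.compile(r"(?<!\d)((?:19|20)\d{2})[/-]?(0[1-9]|1[0-2])(?!\d)")


def url_date(url):
    """Return (year, month) if the URL embeds a date, else None."""
    match = DATE_IN_URL.search(url)
    return (int(match.group(1)), int(match.group(2))) if match else None


def precision_recall(novelty, truly_novel, theta):
    """Judge p novel when theta < N(p), then score against the golden set.

    novelty:     dict mapping a page to its novelty measure N(p)
    truly_novel: set of pages the golden set marks as novel
    """
    judged = {p for p, n in novelty.items() if theta < n}
    correct = judged & truly_novel
    precision = len(correct) / len(judged) if judged else 0.0
    recall = len(correct) / len(truly_novel) if truly_novel else 0.0
    return precision, recall
```

In use, the golden set would hold the pages whose `url_date` falls in the interval between two crawls, and `theta` would be swept from 0 to 1 to trace the precision and recall curves.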

Page 18: Experiment: Precision

• Precision jumps from the baseline when θ becomes positive, then gradually increases
• Positive novelty provides 80% to 90% precision

[Figure: precision / recall (y-axis, 0 to 1) versus novelty measure minimum threshold (x-axis, 0 to 1); series: 2003-07 precision for delta = 0.2, 0.1, and 0.0]

Page 19: A Large-Scale Study of Link Spam Detection by Graph Algorithms

AIRWEB'07 Workshop, WWW2007

Hiroo Saito (University of Tokyo; JST, ERATO)

Masashi Toyoda (University of Tokyo)

Masaru Kitsuregawa (University of Tokyo)

Kazuyuki Aihara (University of Tokyo; JST, ERATO)

Page 20: Outline

• Propose a link farm detection method using graph algorithms
• Distribution of detected link farms in the Web graph structure

1. SCC decomposition: around the largest SCC (CORE), large SCCs are link farms

2. Maximal clique enumeration: link farms in CORE can be found as maximal cliques

3. Minimum cut: link farms are expanded by min-cut. How many links must be cut to separate them?

Page 21: Dataset

• Japanese Web archive crawled in May 2004
– 96 million pages, 4.5 billion links
– 60% of pages in Japanese, 40% in other languages

• Site graph
– Top of site: URL linked from 3 or more servers
– A site is a set of URLs below the top URL
– 5.9 million sites, 283 million links

Domains

Domain               Number     Ratio (%)
.com                 2,711,588  46.2
.jp                  1,353,842  23.1
.net                 436,645    7.4
.org                 211,983    3.6
.de                  169,279    2.9
.info                144,483    2.5
.nl, .kr, .us, etc.  841,610    14.3

Degree

Max. indegree   61,006
Avg. indegree   48
Max. outdegree  70,294
Avg. outdegree  48
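
The site definition on this page can be illustrated with a short sketch. It is an assumption-laden reading of the two rules, not the authors' procedure: it counts only links from other servers, identifies a server by the host part of the URL, and treats "below the top URL" as a path-prefix test. The names `site_tops` and `site_of` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_tops(page_links, min_servers=3):
    """Return URLs linked from at least `min_servers` distinct other servers.

    page_links: iterable of (source_url, destination_url) pairs
    """
    linking_servers = defaultdict(set)
    for src, dst in page_links:
        src_host, dst_host = urlsplit(src).netloc, urlsplit(dst).netloc
        if src_host != dst_host:            # assumption: ignore same-server links
            linking_servers[dst].add(src_host)
    return {url for url, hosts in linking_servers.items() if len(hosts) >= min_servers}


def site_of(url, tops):
    """Return the longest top URL that `url` lies below, or None."""
    def below(top):
        return url == top or url.startswith(top.rstrip("/") + "/")

    candidates = [top for top in tops if below(top)]
    return max(candidates, key=len) if candidates else None
```

Grouping every crawled URL by `site_of` and collapsing page links into links between groups would then give a site graph of the kind described above.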

Page 22: SCC decomposition

• Size distribution follows the power law (1 ≦ n ≦ 100) with a long and thick tail

• Large SCCs are spam (100 < n)
– 552 SCCs, 0.57M sites
– 550 sample sites

Sampling results

           spam  suspicious  non-spam
#sites     527   23          0
ratio (%)  95.8  4.2         0

[Figure: size distribution of SCCs; x-axis: size of SCCs (1.E+00 to 1.E+06), y-axis: number of SCCs (1.E+00 to 1.E+06)]
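
The seed extraction on this page can be sketched as follows. The slides do not say which SCC algorithm was used, so this sketch swaps in Kosaraju's algorithm, written iteratively so that large graphs do not hit Python's recursion limit. Treating the largest SCC as CORE and excluding it from the seeds is an assumption based on the outline; the function names are hypothetical.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Return the SCCs of a directed graph given as (u, v) pairs."""
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:                            # all neighbours explored
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def large_sccs(edges, larger_than=100):
    """Return SCCs with more than `larger_than` nodes, excluding the largest one."""
    components = sorted(strongly_connected_components(edges), key=len, reverse=True)
    return [c for c in components[1:] if len(c) > larger_than]
```

On the site graph, `large_sccs(edges)` would return the candidate link farms with 100 < n, leaving the largest component aside as CORE.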

Page 23: Distribution of SCCs in the bow tie

• Bow-tie structure [Broder et al. 2000]

• Distribution of large SCCs
– 450 / 552 (81%) SCCs in OUT
– 385 / 450 (85%) SCCs directly connected to CORE

• CORE has many spam sites connecting to them

[Figure: SCCs whose size is larger than 1,000, placed on the bow tie (IN, CORE, OUT)]

[Figure: bow-tie components: OUT 60%, CORE 30%, IN 1%, TENDRILS 2%, OTHERS 7%]
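
Placing nodes on the bow tie reduces to two reachability searches from CORE. The sketch below is a simplified illustration under assumptions: it takes the CORE node set as given, and it lumps tendrils and disconnected parts together as "OTHER" instead of separating them. The function names are hypothetical.

```python
from collections import defaultdict, deque


def reachable(adjacency, sources):
    """Return all nodes reachable from `sources` by breadth-first search."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(edges, core):
    """Split nodes into IN, OUT, and OTHER relative to the CORE node set."""
    core = set(core)
    forward, backward, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        forward[u].append(v)
        backward[v].append(u)
        nodes.update((u, v))
    out_part = reachable(forward, core) - core      # reachable from CORE
    in_part = reachable(backward, core) - core      # can reach CORE
    other = nodes - core - out_part - in_part       # tendrils and the rest
    return {"IN": in_part, "OUT": out_part, "OTHER": other}
```

An SCC can then be counted as "in OUT" when its nodes fall in the `OUT` set, and as directly connected to CORE when some edge joins a CORE node to one of its nodes.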

Page 24: Maximal clique enumeration

• Use maximal cliques for extracting spam from CORE
– Link farms tend to include cliques

• Maximal clique enumeration [Makino, Uno 2004]
– Ignore nodes with high degree (80 < d)
  • Because the cost is O(max. degree^4)
– Large cliques are link farms (40 < n)
  • 26,931 maximal cliques, 8,346 sites (many duplicates)
  • 165 sample sites

Sampling result

           spam  suspicious  non-spam
#sites     157   8           0
ratio (%)  95.2  4.8         0
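
The filtering steps on this page can be illustrated with a small enumerator. The slides cite the Makino and Uno algorithm; the sketch below swaps in the simpler Bron-Kerbosch algorithm with pivoting, so it shows the same inputs and outputs but not the cited method. It assumes an undirected adjacency structure (how directed site links are made undirected is left to the caller) with no self-loops, and the function name and parameters are hypothetical.

```python
def maximal_cliques(adjacency, max_degree=80, larger_than=40):
    """Enumerate maximal cliques with more than `larger_than` nodes.

    adjacency: dict mapping a node to the set of its neighbours (symmetric)
    Nodes whose degree exceeds `max_degree` are ignored, as on the slide.
    """
    keep = {v for v, nbrs in adjacency.items() if len(nbrs) <= max_degree}
    adj = {v: adjacency[v] & keep for v in keep}
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            if len(clique) > larger_than:
                cliques.append(sorted(clique))
            return
        # Pivot on the node covering the most candidates to prune branches.
        pivot = max(candidates | excluded, key=lambda v: len(adj[v] & candidates))
        for v in list(candidates - adj[pivot]):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(keep), set())
    return cliques
```

Because one site can belong to several maximal cliques, the union of the returned cliques is smaller than the sum of their sizes, which is consistent with the "many duplicates" remark above.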

Page 25: Minimum cut

• How many spam sites are there around large SCCs and cliques?
• How many links must be cut to separate the spam sites?

Apply max-flow / min-cut on the directed site graph

[Figure: max-flow / min-cut setup on the site graph with a virtual source, a virtual sink, the Cliques and SCCs seeds, and CORE; labels: 210 white sites, 8,000 sites, 450,000 sites, 57,000 sites, min-cut: 18,000]

Sampling result

            number  ratio (%)
spam        459     94.3
suspicious  27      5.5
non-spam    1       0.2

Page 26: Conclusions and future work

• An automatic link farm detection method
– Based on graph algorithms
  • Seed extraction: SCC and maximal clique
  • Seed expansion: max-flow / min-cut
– High precision (95% ~ 99%)

• Distribution of link farms in the Web graph structure
– Large SCCs around CORE, maximal cliques in CORE
– Only 18,000 links for cutting off 0.5M spam sites

• Future work
– Improving recall (small SCCs, large cliques in CORE)
– Experiments on other datasets