1 Our Web Part 0: Overview COMP630L Topics in DB Systems: Managing Web Data Fall, 2007 Dr Wilfred Ng

Page 1: Our Web

Part 0: Overview

COMP630L Topics in DB Systems: Managing Web Data
Fall, 2007

Dr Wilfred Ng

Page 2: Outline

Two important issues:
• Web Dynamics
• Search Engines

Web is related to
• Tim Berners-Lee?
• Bill Gates?
• Dik? Frederick? Wilfred?

(March 11, 1890 – June 30, 1974)

Page 3: Introduction

The Web: the largest collection of (linked) resources (cf. Memex machine in 1945, Xanadu in 1965, Internet in 1990)

Web search engines: locating and retrieving Web information:
• Crawler-based (Google, MSN Search, …)
• Human-powered (Yahoo directory, Open Directory)

Web is very dynamic:
• Dynamics of Web size
• Dynamics of Web pages
• Dynamics of Web link structure

Page 4: Introduction (cont'd)

Dynamics of Web size:
• Almost anyone can publish almost anything on the Web at almost zero cost
• Web size grows at an exponential rate

Challenge for search engines:
• Scalability to cover a large part of the Web

Page 5: Introduction (cont'd)

Dynamics of Web pages:
• Creation: new pages come into existence
  New information needs to be captured by search engines
• Updates: content changes on a page (minor? major?)
  Search engines should keep their local copies of pages fresh
• Deletion: existing pages can no longer be found
  Search engines should detect deletions to avoid broken links

Page 6: Introduction (cont'd)

Dynamics of Web link structure:
• Links are being established and removed constantly

Important for search engines:
• Use the link structure to rank search results
• E.g., authoritative hubs

Page 7: Introduction (cont'd)

Relationship between three dimensions
• Dynamics of Web size
• Dynamics of Web pages
• Dynamics of Web link structure

[Figure: the Web with one added page P (+1 page)]

Page 8: Preliminary

Search engine basic architecture:

[Figure: Web; Search Engine (Crawler, Indexer, Searcher); End Users]

Page 9: Dynamics of Web Size

Two categories of the Web:
• Indexable Web (shallow Web):
  Indexed by major engines
  More than four billion pages by late 2003 [Google]
  8 billion in 2004, 20 billion in 2005, ??? now [Google]
• Non-indexable Web (deep Web):
  Pages hidden behind search forms, or with authorization requirements, etc.
  At least 400 times larger than the indexable Web [Bergman00]

Page 10: Web Size Study

The Web is growing at an exponential rate

[Figure: Netcraft Web Server Survey Report (August 1995 – November 2004)]

Page 11: Search Engine Coverage Studies

Bharat and Broder [1997]:
• Generate random URLs from a search engine
• Check whether these pages were in other engines
• Test on four search engines:
  AltaVista, Excite, Infoseek, HotBot
• Estimated Web size: 200 million pages
• The overlap between engines was very small

Page 12: Search Engine Coverage Studies

Lawrence and Giles [1997]:
• Query-based sampling by scientists
• Test on six major search engines:
  AltaVista, Excite, Infoseek, HotBot, Lycos, and Northern Light
• Estimated Web size: 320 million pages
• Single engine coverage is limited: 34%
• Joint coverage increases significantly: 60%

Lawrence and Giles [1999]:
• Test on 11 search engines
• Estimated Web size: 320 million → 800 million
• Single engine coverage: 34% → 16%

Page 13: Search Engine Coverage Studies

Summary:

Study                       Web Size      Largest Engine         Joint Coverage
Bharat and Broder (1997)    200 million   AltaVista (50%)        80%
Lawrence and Giles (1997)   320 million   HotBot (34%)           60%
Lawrence and Giles (1999)   800 million   Northern Light (16%)   42%

Page 14: Impact on Search Engines – Scalable Architecture

Google [Brin and Page 98]:
• Data structure:
  Compact encoding and compression
• Distributed crawling system:
  Crawlers run in parallel
  Each crawler keeps hundreds of connections

Page 15: Impact on Search Engines – Metasearch Engines

Combine results of multiple engines to increase Web coverage

Metasearch engine:

[Figure: the metasearch engine forwards the query to Search Engine 1 … Search Engine n and combines their results into the final results]
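
The combining step can be sketched in a few lines of Python. This is only an illustration under assumptions: the slides do not say how results are merged, so the reciprocal-rank scoring, the `metasearch` function, and the two toy engines below are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Assumption: an "engine" is any callable mapping a query to a ranked URL list.
Engine = Callable[[str], List[str]]


def metasearch(query: str, engines: List[Engine], top_k: int = 10) -> List[str]:
    """Forward the query to every engine and merge the ranked lists.

    Scoring is an assumption, not taken from the slides: each URL earns
    1/rank from every engine that returns it.
    """
    scores: Dict[str, float] = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return ranked[:top_k]


# Two toy engines with partially overlapping coverage (hypothetical data).
def engine_1(query: str) -> List[str]:
    return ["a.example", "b.example", "c.example"]


def engine_2(query: str) -> List[str]:
    return ["b.example", "d.example"]


merged = metasearch("web dynamics", [engine_1, engine_2])
```

The merged list covers four distinct URLs although neither toy engine returns more than three, which is the coverage gain the slide refers to.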

Page 16: Impact on Search Engines – Special-purpose Search Engines

Not necessary to search the entire Web

Special-purpose search engines:
• Focus on restricted domains
• Use a focused crawler
  • Start with relevant seed pages
  • Score the extracted URLs according to relevance
  • Pick the URL with the highest score to crawl next

[Figure: pages P1 to P7 and the crawler's priority queue of URLs]
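
The focused-crawler loop (seed pages, scored URLs, highest score first) maps naturally onto a priority queue. The sketch below is a minimal illustration, assuming an in-memory link graph and precomputed relevance scores; `focused_crawl`, `links`, and `relevance` are hypothetical names, and a real crawler would fetch pages and score URLs on the fly.

```python
import heapq
from typing import Dict, List


def focused_crawl(seeds: List[str],
                  links: Dict[str, List[str]],
                  relevance: Dict[str, float],
                  max_pages: int) -> List[str]:
    """Crawl up to max_pages pages, always taking the most relevant URL next."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-relevance.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    queued = set(seeds)          # URLs already placed in the queue
    crawled: List[str] = []

    while frontier and len(crawled) < max_pages:
        _, url = heapq.heappop(frontier)
        crawled.append(url)      # stands in for downloading the page
        for out in links.get(url, []):
            if out not in queued:
                queued.add(out)
                heapq.heappush(frontier, (-relevance.get(out, 0.0), out))
    return crawled


# Toy graph and scores (hypothetical; not the values from the slide's figure).
links = {"P1": ["P2", "P3"], "P2": ["P4", "P5"], "P3": ["P6"], "P5": ["P7"]}
relevance = {"P1": 1.0, "P2": 0.8, "P3": 0.3, "P4": 0.6,
             "P5": 0.9, "P6": 0.1, "P7": 0.5}

order = focused_crawl(["P1"], links, relevance, max_pages=5)
```

With these scores the crawl visits P1, P2, P5, P4, P7: the low-scoring P3 stays in the queue and is never fetched within the five-page budget.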

Page 17: Dynamics of Web Pages – Characterize Updates

Two measures [Lim02]:
• A Web page: an ordered sequence of words

Distance Measure:
• The degree of change: [0, 1]
• d(A, B): [formula garbled in extraction; defined in terms of m and n]

Clusteredness Measure:
• How changes are spread out within a page: [0, 1]
• c(A, B, b): [formula garbled in extraction; defined in terms of m and b]

Changes are generally small and clustered
• An incremental update is more efficient for search engines

Page 18: Impact of Web Page Dynamics on Search Engines

A typical way to study page dynamics from a search engine perspective:

1. Develop a model for Web page changes

2. Propose update strategies to maximize freshness for search engines
   • Develop metrics to measure freshness

Page 19: Web Page Changing Model Studies – Poisson Process Model

Each page Pi is updated at an average rate λi

Poisson Process:
• X(t): the number of changes of page P in (0, t]
• Random variable X(s+t) – X(s) has a Poisson probability distribution, where λ is the average change rate of P:

\[ \Pr\{X(s+t) - X(s) = k\} = \frac{(\lambda t)^k}{k!} \, e^{-\lambda t}, \qquad k = 0, 1, 2, \ldots \]
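
As a small numeric illustration of the Poisson model, the sketch below evaluates the probability of exactly k changes in an interval of length t, and the probability of at least one change. The function names and the example rate (one change every 10 days on average) are assumptions chosen for illustration, not figures from the slides.

```python
import math


def poisson_pmf(k: int, rate: float, t: float) -> float:
    """Pr{X(s+t) - X(s) = k} for a Poisson process with change rate `rate`."""
    mean = rate * t
    return mean ** k * math.exp(-mean) / math.factorial(k)


def prob_changed(rate: float, t: float) -> float:
    """Probability that the page changes at least once within time t."""
    return 1.0 - poisson_pmf(0, rate, t)


# Hypothetical page that changes on average once every 10 days.
rate_per_day = 0.1
p_week = prob_changed(rate_per_day, 7.0)   # equals 1 - exp(-0.7)
```

Under this assumed rate, a copy crawled a week ago is stale with probability 1 − e^(−0.7), roughly one half, which is why the revisit schedule matters for freshness.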

Page 20: Poisson Process – Brewington and Cybenko Study

Combine the effects of page creation and updates into the Poisson Web model

(α, β)-currency:
• Characterizes how up-to-date a search engine is
• A page is β-current (β is a time unit)
• A search engine is (α, β)-current if Pr(P is β-current) ≥ α
• Re-indexing period: T = f(α, β, λ)

(0.95, 1 week)-currency:
• T = 18 days (800 million pages per day)

[Figure: timeline with the last observation at t0, the time t − β, the current time t ("Now"), and the next observation at t0 + T; β is the grace period and T the re-indexing period]
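
The dependence of the re-indexing period T on α, β and λ can be explored numerically for a single page. The sketch below is a simplification under stated assumptions, not the study's own computation: it reads "β-current" from the timeline as "no change between the last observation t0 and time t − β", assumes one page with a known Poisson rate λ, strict re-indexing every T, and a query time uniformly distributed within the cycle. Under those assumptions Pr(β-current) = β/T + (1 − e^(−λ(T−β)))/(λT) for T > β. The function names are hypothetical, and the sketch does not attempt to reproduce the 18-day figure, which depends on how change rates are distributed across pages.

```python
import math


def prob_beta_current(rate: float, period: float, beta: float) -> float:
    """Probability that a page is beta-current at a random time in the cycle.

    Assumes Poisson changes with the given rate and re-indexing every `period`.
    """
    if beta >= period or rate == 0.0:
        return 1.0   # the grace period covers the whole cycle, or no changes
    stale_window = period - beta
    return beta / period + (1.0 - math.exp(-rate * stale_window)) / (rate * period)


def max_reindex_period(rate: float, alpha: float, beta: float,
                       upper: float = 3650.0, tol: float = 1e-6) -> float:
    """Largest period T (up to `upper`) with Pr(beta-current) >= alpha.

    Uses bisection, since the probability decreases as T grows beyond beta.
    """
    if prob_beta_current(rate, upper, beta) >= alpha:
        return upper
    lo, hi = beta, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prob_beta_current(rate, mid, beta) >= alpha:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical page: one change every 10 days on average, grace period 7 days.
T = max_reindex_period(rate=0.1, alpha=0.95, beta=7.0)
```

For this hypothetical page the admissible period comes out somewhat longer than the 7-day grace period; faster-changing pages push T closer to β.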

Page 21: Impact of Web Page Dynamics on Search Engines – Summary

Study                     Creation   Updates   Deletion   Freshness Metric
Brewington and Cybenko    √          √         X          (α, β)-currency
Cho and Garcia-Molina     X          √         X          freshness, age
Edwards et al.            √          √         X          -
Ntoulas, Cho and Olston   √          √         √          -

Page 22: Dynamics of Web Link Structure – Web Link Structure Modeling

Web link structure [Broder et al. 00]

Four components:
• SCC (27.5%)
• IN (21.5%)
• OUT (21.5%)
• Tendrils and Tubes (21.5%)
• Others (8%)

[Figure: Web link structure showing IN, SCC, OUT, Tubes, Tendrils, and Disconnected components]

Page 23: Dynamics of Web Link Structure Study

Only one existing study: Ntoulas, Cho and Olston [04]

In one year:
• Only 24% of the initial links were still available
• 25% new links created every week
• Link structure is more dynamic than pages (8% new pages and 5% new content in the same year!)
• Search engines should update link-based ranking metrics frequently

Page 24: Link-based Ranking Metric – PageRank

PageRank [WWW98]: main ranking metric of Google

Definition:
• Page A has pages T1 … Tn (authoritative sites) pointing to it
• C(A): the number of links going out of page A
• d: damping factor in (0, 1)
• PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
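
The PageRank definition can be evaluated by repeated substitution until the values settle. The sketch below applies the formula exactly as written on the slide; the function name, the three-page toy graph, the fixed iteration count, and d = 0.85 are assumptions made for illustration. Pages without out-links simply pass on no rank here, which is a simplification.

```python
from typing import Dict, List


def pagerank(out_links: Dict[str, List[str]],
             d: float = 0.85,
             iterations: int = 100) -> Dict[str, float]:
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    pr = {page: 1.0 for page in pages}           # initial guess
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in out_links.items():
            if not targets:
                continue                         # no out-links: nothing to share
            share = d * pr[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Toy graph (hypothetical): A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph C receives links from both A and B and ends up with the highest rank, followed by A and then B.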

Page 25: Incremental Update on PageRank

Recomputing PageRank over the whole Web after every change is too expensive

Incrementally compute approximations to PageRank [Chien02]

Basic ideas:
• Construct a subgraph of the Web
• The subgraph contains a small neighborhood of the link changes
• Model the rest of the Web graph as a single node
• Compute PageRank on this subgraph

Page 26: Conclusions

The Web is dynamic in three dimensions
• Serious challenges to search engines

Search engines need to cope with high dynamics
• Scalable architecture, intelligent scheduling strategies, efficient update algorithms for ranking metrics, etc.

Interesting to database people:
• Data representation dynamics: XML
• User dynamics: Adaptive search
• Deep Web dynamics: searchable? how?
• You should study COMP630L well

References