Our Web
Part 0: Overview
COMP630L Topics in DB Systems: Managing Web Data, Fall 2007
Dr Wilfred Ng

Outline
Two important issues:
• Web Dynamics
• Search Engines
The Web is related to:
• Tim Berners-Lee?
• Bill Gates?
• Dik? Frederick? Wilfred?
(March 11, 1890 – June 30, 1974)

Introduction
The Web: the largest collection of (linked) resources (cf. the Memex machine in 1945, Xanadu in 1965, the Internet in 1990)
Web search engines locate and retrieve Web information:
• Crawler-based (Google, MSN Search, …)
• Human-powered (Yahoo directory, Open Directory)
The Web is very dynamic:
• Dynamics of Web size
• Dynamics of Web pages
• Dynamics of Web link structure

Introduction (cont'd)
Dynamics of Web size:
• Almost anyone can publish almost anything on the Web at almost zero cost
• Web size grows at an exponential rate
Challenge for search engines:
• Scalability to cover a large part of the Web

Introduction (cont'd)
Dynamics of Web pages:
• Creation: new pages come into existence. New information needs to be captured by search engines.
• Updates: content changes on a page (minor? major?). Search engines should keep their local copies of pages fresh.
• Deletion: existing pages can no longer be found. Search engines should detect deletions to avoid broken links.

Introduction (cont'd)
Dynamics of Web link structure:
• Links are being established and removed constantly
Important for search engines:
• Use the link structure to rank search results
• E.g., authoritative hubs

Introduction (cont'd)
Relationship between the three dimensions:
• Dynamics of Web size
• Dynamics of Web pages
• Dynamics of Web link structure
[Figure: the Web with a new page P added (+1 page)]

Preliminary
Search engine basic architecture:
[Figure: Web → Search Engine (Crawler, Indexer, Searcher) → End Users]
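
The three components in the architecture can be illustrated with a toy pipeline. This is only a minimal sketch: the in-memory `TOY_WEB` dictionary, the `.example` URLs, and the function names `crawl`, `build_index`, and `search` are assumptions made for illustration, not part of the course material, and a real engine would fetch pages over the network and rank its results.

```python
from collections import defaultdict

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("web dynamics and search engines", ["b.example", "c.example"]),
    "b.example": ("search engines rank pages", ["c.example"]),
    "c.example": ("deep web pages hide behind forms", []),
}


def crawl(seed, web):
    """Crawler: follow links from the seed and return {url: text}."""
    seen, frontier, pages = set(), [seed], {}
    while frontier:
        url = frontier.pop()
        if url in seen or url not in web:
            continue
        seen.add(url)
        text, links = web[url]
        pages[url] = text
        frontier.extend(links)
    return pages


def build_index(pages):
    """Indexer: build an inverted index mapping word -> set of URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.split():
            index[word].add(url)
    return index


def search(index, query):
    """Searcher: return the URLs that contain every query word."""
    words = query.split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index(crawl("a.example", TOY_WEB))
hits = search(index, "search engines")
```

Here `hits` holds the pages that contain both query words; ranking the hits (for example by link structure) is deliberately left out.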

Dynamics of Web Size
Two categories of the Web:
• Indexable Web (shallow Web): indexed by major engines
  - More than four billion pages by late 2003 [Google]
  - 8 billion in 2004, 20 billion in 2005, ??? now [Google]
• Non-indexable Web (deep Web): pages hidden behind search forms, or with authorization requirements, etc.
  - At least 400 times larger than the indexable Web [Bergman00]

Web Size Study
The Web is growing at an exponential rate
[Figure: Netcraft Web Server Survey Report (August 1995 – November 2004)]

Search Engine Coverage Studies
Bharat and Broder [1997]:
• Generate random URLs from a search engine
• Check whether these pages are in other engines
• Tested on four search engines: AltaVista, Excite, Infoseek, HotBot
• Estimated Web size: 200 million pages
• The overlap between engines was very small
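
The sample-and-check idea above can be sketched on synthetic data. This is a simplified illustration, not the study's actual procedure: it assumes we can draw uniform random URLs directly from each index, and the two integer sets `engine_a` and `engine_b` are hypothetical stand-ins for engine indexes. Because |A ∩ B| equals both (fraction of A found in B) × |A| and (fraction of B found in A) × |B|, the ratio of the two fractions estimates the relative size |A| / |B|.

```python
import random


def overlap_fraction(sample_from, other_index, sample_size, rng):
    """Estimate the fraction of `sample_from` that is also in `other_index`."""
    sample = rng.sample(sorted(sample_from), sample_size)
    return sum(url in other_index for url in sample) / sample_size


rng = random.Random(0)             # fixed seed so the sketch is repeatable
engine_a = set(range(0, 6000))     # hypothetical index A: 6000 pages
engine_b = set(range(4000, 7000))  # hypothetical index B: 3000 pages

frac_a_in_b = overlap_fraction(engine_a, engine_b, 500, rng)  # true value 1/3
frac_b_in_a = overlap_fraction(engine_b, engine_a, 500, rng)  # true value 2/3

# |A| / |B| is estimated by (fraction of B in A) / (fraction of A in B).
size_ratio = frac_b_in_a / frac_a_in_b
```

With these synthetic sets the true ratio is 2; the estimate varies with the sample.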

Search Engine Coverage Studies
Lawrence and Giles [1997]:
• Query-based sampling by scientists
• Tested on six major search engines: AltaVista, Excite, Infoseek, HotBot, Lycos, and Northern Light
• Estimated Web size: 320 million pages
• Single-engine coverage is limited: 34%
• Joint coverage increases significantly: 60%
Lawrence and Giles [1999]:
• Tested on 11 search engines
• Estimated Web size: 320 million → 800 million
• Single-engine coverage: 34% → 16%

Search Engine Coverage Studies
Summary:

| Study | Web Size | Largest Engine | Joint Coverage |
| --- | --- | --- | --- |
| Bharat and Broder (1997) | 200 million | AltaVista (50%) | 80% |
| Lawrence and Giles (1997) | 320 million | HotBot (34%) | 60% |
| Lawrence and Giles (1999) | 800 million | Northern Light (16%) | 42% |

Impact on Search Engines – Scalable Architecture
Google [Brin and Page 98]:
• Data structure: compact encoding and compression
• Distributed crawling system:
  - Crawlers run in parallel
  - Each crawler keeps hundreds of connections

Impact on Search Engines – Metasearch Engines
Combine results of multiple engines to increase Web coverage
Metasearch engine:
[Figure: a query goes to the metasearch engine, which forwards it to Search Engine 1 … Search Engine n (each with its own Crawler, Indexer, Searcher), collects their results, and returns the final results]
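
One simple way to combine results is round-robin interleaving with de-duplication. The sketch below is an assumption about how a merge step could look, not a description of any particular metasearch engine; `engine_1` and `engine_2` are hypothetical stand-ins that return fixed ranked lists.

```python
def merge_results(result_lists):
    """Interleave ranked lists round-robin, keeping each URL only once."""
    merged, seen = [], set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged


def metasearch(query, engines):
    """Send the query to every engine and merge the ranked results."""
    return merge_results([engine(query) for engine in engines])


# Hypothetical engines: each maps a query to a ranked list of URLs.
def engine_1(query):
    return ["a", "b", "c"]


def engine_2(query):
    return ["b", "d"]


final_results = metasearch("web dynamics", [engine_1, engine_2])
```

Interleaving treats all engines as equally trustworthy; a weighted scheme would be needed if some engines rank better than others.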

Impact on Search Engines – Special-purpose Search Engines
It is not necessary to search the entire Web
Special-purpose search engines:
• Focus on restricted domains
• Use a focused crawler:
  - Start with relevant seed pages
  - Score the extracted URLs according to relevance
  - Pick the URL with the highest score to crawl next
[Figure: pages P1 to P7 linked in a graph, with a priority queue of URLs waiting to be crawled]
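
The focused-crawling loop can be sketched with a priority queue. This is a minimal sketch under assumptions: the link graph `LINKS` and the relevance values in `SCORES` are invented for illustration (the page names merely echo the figure), and `fetch_links` and `relevance` are hypothetical callbacks standing in for real page fetching and a real relevance classifier.

```python
import heapq


def focused_crawl(seeds, fetch_links, relevance, max_pages):
    """Crawl best-first: always expand the most relevant queued URL."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-relevance(url), url) for url in seeds]
    heapq.heapify(frontier)
    queued = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in queued:
                queued.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return visited


# Hypothetical link graph and relevance scores.
LINKS = {"P1": ["P2", "P3"], "P2": ["P4", "P5"], "P3": ["P6", "P7"]}
SCORES = {"P1": 1.0, "P2": 0.4, "P3": 0.9, "P4": 0.2,
          "P5": 0.8, "P6": 0.7, "P7": 0.1}

crawl_order = focused_crawl(
    seeds=["P1"],
    fetch_links=lambda url: LINKS.get(url, []),
    relevance=lambda url: SCORES[url],
    max_pages=4,
)
```

With these scores the crawler visits P1, then P3, then P6, and only then P2, because P6 outranks P2 once P3 has been expanded.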

Dynamics of Web Pages – Characterize Updates
Two measures [Lim02]:
• A Web page: an ordered sequence of words
Distance measure d(A, B):
• The degree of change: [0, 1]
Clusteredness measure c(A, B, b):
• How changes are spread out within a page: [0, 1]
Changes are generally small and clustered
• An incremental update is more efficient for search engines
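
To make the idea of a change degree in [0, 1] concrete, here is a stand-in measure. It is explicitly not the [Lim02] formula, which is not reproduced here: it treats each page as an ordered sequence of words, as the slide does, and uses Python's `difflib.SequenceMatcher` similarity ratio, an assumption chosen only because it is normalized to [0, 1].

```python
from difflib import SequenceMatcher


def change_distance(old_text, new_text):
    """Return a change degree in [0, 1]: 0 = identical, 1 = nothing shared.

    Stand-in measure: 1 minus the SequenceMatcher ratio over word sequences.
    """
    old_words, new_words = old_text.split(), new_text.split()
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    return 1.0 - matcher.ratio()


small_change = change_distance("a b c d", "a b c e")
```

In this example three of the four words on each side match, so the ratio is 2·3/8 and the resulting distance is 0.25.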

Impact of Web Page Dynamics on Search Engines
A typical way to study page dynamics from a search engine perspective:
1. Develop a model for Web page changes
2. Propose update strategies to maximize freshness for search engines
• Develop metrics to measure freshness

Web Page Changing Model Studies – Poisson Process Model
Each page P_i is updated at an average rate λ_i
Poisson process:
• X(t): the number of changes of page P in (0, t]
• The random variable X(s+t) − X(s) has a Poisson probability distribution, for k = 0, 1, 2, …:

$$\Pr\{X(s+t) - X(s) = k\} = \frac{(\lambda t)^k}{k!}\, e^{-\lambda t}$$
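
The Poisson formula can be evaluated directly. The sketch below is illustrative only; the function names and the example rate of one change per 10 days are assumptions. The k = 0 case gives the probability that a page has not changed in an interval of length t, which is what a crawler cares about when deciding whether a local copy is still fresh.

```python
import math


def prob_k_changes(rate, t, k):
    """Pr{X(s+t) - X(s) = k} for a Poisson process with the given rate."""
    return (rate * t) ** k / math.factorial(k) * math.exp(-rate * t)


def prob_changed(rate, t):
    """Probability of at least one change in an interval of length t."""
    return 1.0 - prob_k_changes(rate, t, 0)


# Hypothetical page that changes on average once every 10 days.
rate_per_day = 0.1
stale_after_week = prob_changed(rate_per_day, 7.0)  # equals 1 - exp(-0.7)
```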

Poisson Process – Brewington and Cybenko Study
Combine the effects of page creation and updates into the Poisson Web model
(α, β)-currency:
• Characterizes how up-to-date a search engine is
• A page is β-current (β is a time unit)
• A search engine is (α, β)-current if Pr(P is β-current) ≥ α
• T = f(α, β, λ)
(0.95, 1 week)-currency:
• T = 18 days (800 million pages per day)
[Figure: timeline marking the last observation at t0, the time t − β, now at t, the grace period β, and the re-indexing period T ending at t0 + T]
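
The dependence T = f(α, β, λ) can be worked through for a simplified case. The sketch below rests on assumptions that go beyond the slide: every page changes at one common rate λ (the study itself works with a distribution of rates, so this will not reproduce the 18-day figure), a page counts as β-current if it did not change between its last observation and β time units before now, and "now" is uniformly distributed over the re-indexing period. Under those assumptions, for T ≥ β the probability of being β-current is α(T) = [β + (1 − e^{−λ(T − β)})/λ] / T, which decreases in T, so the largest admissible T can be found by bisection.

```python
import math


def prob_beta_current(rate, beta, period):
    """Probability that a page is beta-current under the stated assumptions."""
    if period <= beta or rate == 0:
        return 1.0
    unchanged = (1.0 - math.exp(-rate * (period - beta))) / rate
    return (beta + unchanged) / period


def max_reindex_period(rate, alpha, beta, upper=1e6, tol=1e-9):
    """Largest re-indexing period T with prob_beta_current(T) >= alpha."""
    lo, hi = beta, upper
    if prob_beta_current(rate, beta, hi) >= alpha:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prob_beta_current(rate, beta, mid) >= alpha:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical rate: one change per 10 days; (0.95, 1 week)-currency.
period_days = max_reindex_period(rate=0.1, alpha=0.95, beta=7.0)
```

Relaxing α or lengthening the grace period β both allow a longer re-indexing period, which is the trade-off the (α, β) pair is meant to expose.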

Impact of Web Page Dynamics on Search Engines – Summary

| Study | Creation | Updates | Deletion | Freshness Metric |
| --- | --- | --- | --- | --- |
| Brewington and Cybenko | √ | √ | X | (α, β)-currency |
| Cho and Garcia-Molina | X | √ | X | freshness, age |
| Edwards et al. | √ | √ | X | - |
| Ntoulas, Cho and Olston | √ | √ | √ | - |

Dynamics of Web Link Structure – Web Link Structure Modeling
Web link structure [Broder et al. 00]
Four components:
• SCC (27.5%)
• IN (21.5%)
• OUT (21.5%)
• Tendrils and Tubes (21.5%)
• Others (8%)
[Figure: IN, SCC, and OUT, with Tendrils, Tubes, and Disconnected components]

Dynamics of Web Link Structure Study
Only one existing study: Ntoulas, Cho and Olston [04]
In one year:
• Only 24% of the initial links were still available
• 25% new links were created every week
• Link structure is more dynamic than pages (8% new pages and 5% new content in the same year!)
• Search engines should update link-based ranking metrics frequently

Link-based Ranking Metric – PageRank
PageRank [WWW98]: main ranking metric of Google
Definition:
• Page A has pages T1 … Tn (authoritative sites) pointing to it
• C(A): the number of links going out of page A
• d: damping factor in (0, 1)
• PR(A) = (1 − d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
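
The definition can be applied iteratively until the values settle. The following is a minimal sketch, assuming a tiny hypothetical graph in which every page has at least one outgoing link; the damping value d = 0.85 and the fixed iteration count are illustrative choices (the slide only requires d in (0, 1)).

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `out_links` maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pages = set(out_links) | {t for targets in out_links.values() for t in targets}
    in_links = {page: [] for page in pages}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(pr[t] / len(out_links[t]) for t in in_links[page])
            for page in pages
        }
    return pr


# Hypothetical three-page graph: A -> B, A -> C, B -> C, C -> A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this graph C ends up ranked highest because it receives links from both A and B, while B only receives half of A's rank. With this form of the formula the ranks sum to the number of pages rather than to 1.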

Incremental Update on PageRank
Computations are too expensive
Incrementally compute approximations to PageRank [Chien02]
Basic ideas:
• Construct a subgraph of the Web
• Contain a small neighborhood of the link changes
• Model the rest of the Web graph as a single node
• Compute PageRank on this subgraph

Conclusions
The Web is dynamic in three dimensions
• Serious challenges to search engines
Search engines must cope with high dynamics
• Scalable architecture, intelligent scheduling strategies, efficient update algorithms for ranking metrics, etc.
Interesting to database people:
• Data representation dynamics: XML
• User dynamics: adaptive search
• Deep Web dynamics: searchable? how?
• You should study COMP630L well
References