WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin,...
![Page 1: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/1.jpg)
WebBase: Building a Web Warehouse
Hector Garcia-Molina, Stanford University
Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley
![Page 2: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/2.jpg)
2
The Web
• A universal information resource
  – Model weak, strong agreement
• How to exploit it?
[Figure: the web]
![Page 3: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/3.jpg)
3
WebBase
WEB PAGE
![Page 4: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/4.jpg)
4
WebBase Goals
• Manage very large collections of Web pages
  – Today: 1500 GB HTML, 200 M pages
• Enable large-scale Web-related research
• Locally provide a significant portion of the Web
• Efficient wide-area Web data distribution
![Page 5: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/5.jpg)
5
WebBase Architecture
![Page 6: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/6.jpg)
6
WebBase Remote Users
• Berkeley
• Columbia
• U. Washington
• Harvey Mudd
• Università degli Studi di Milano
• U. of Arizona
• California Digital Library
• Cornell
• U. of Houston
• Learning Lab Lower Saxony (L3S)
• France Telecom
• U. Texas
![Page 7: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/7.jpg)
7
Outline
• Technical Challenges
• WebBase Use
• The Future
![Page 8: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/8.jpg)
8
Challenges
• Scalability
  – crawling
  – archive distribution
  – index construction
  – storage
• Consistency
  – freshness
  – versions
• Dissemination
• Archiving
  – “units”
  – coordination
• IP Management
  – copy access
  – link access
  – access control
• Hidden Web
• Topic-Specific Collection Building
![Page 9: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/9.jpg)
9
What is a Crawler?
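[Figure: crawler loop. After init with the initial urls, the crawler repeats three steps: get next url, get page (from the web), extract urls. It maintains the to-visit urls, the visited urls, and the stored web pages.]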
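The loop in the figure can be sketched in a few lines. This is a minimal illustration, not the WebBase crawler: `crawl`, `get_page`, `extract_urls`, and the toy in-memory web are all hypothetical names introduced here so the example runs without a network.

```python
from collections import deque


def crawl(initial_urls, get_page, extract_urls, max_pages=1000):
    """Basic crawl loop: init, get next url, get page, extract urls."""
    to_visit = deque(initial_urls)   # "to visit urls"
    visited = set()                  # "visited urls"
    pages = {}                       # "web pages" collected so far
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # get next url
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)         # get page
        if page is None:             # unreachable page: skip it
            continue
        pages[url] = page
        for link in extract_urls(page):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages


# Hypothetical toy web: each "page" is simply its list of outgoing links.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": [],
    "d": ["a"],
}

pages = crawl(["a"], get_page=TOY_WEB.get, extract_urls=lambda links: links)
print(sorted(pages))
```

A FIFO queue gives a breadth-first visiting order; swapping in a priority queue would be one way to crawl "important" pages first.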
![Page 10: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/10.jpg)
10
Parallel Crawling
[Figure: several crawler processes (C) crawling the web in parallel]
![Page 11: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/11.jpg)
11
Independent Crawlers
[Figure: two independent crawlers (C) over the web; site 1 holds pages a, b, c, d, e and site 2 holds pages f, g, h, i]
![Page 12: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/12.jpg)
12
Partition: Firewall
[Figure: pages a to e (site 1) and f to i (site 2) divided between two crawlers (C) by a partition line]
Partitioning schemes:
• URL hash
• Site hash
• Hierarchical
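The first two partitioning schemes can be illustrated with a small sketch. The function names and the choice of MD5 are assumptions made for this example; the slides do not say how WebBase computes its hashes.

```python
import hashlib
from urllib.parse import urlsplit


def _bucket(key: str, n_crawlers: int) -> int:
    """Map a string to a crawler index with a stable hash (MD5 is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_crawlers


def url_hash_partition(url: str, n_crawlers: int) -> int:
    """URL hash: every URL is assigned on its own, so one site may be split."""
    return _bucket(url, n_crawlers)


def site_hash_partition(url: str, n_crawlers: int) -> int:
    """Site hash: all pages of one host go to the same crawler."""
    host = urlsplit(url).hostname or ""
    return _bucket(host, n_crawlers)


same_site = [
    site_hash_partition("http://example.com/a", 4),
    site_hash_partition("http://example.com/b/c", 4),
]
print(same_site)
```

With a site hash, links inside a site never cross the partition, which keeps most links local to one crawler.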
![Page 13: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/13.jpg)
13
Partition: Cross-Over
[Figure: two crawlers (C) with pages a to e (site 1) and f to i (site 2) separated by a partition line; cross-over case]
![Page 15: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/15.jpg)
15
Partition: Exchange
[Figure: two crawlers (C) with pages a to e (site 1) and f to i (site 2) separated by a partition line; exchange case]
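A small simulation can contrast the firewall and exchange modes. It rests on two assumptions, since the slides only show diagrams: in firewall mode a crawler drops links that leave its partition, and in exchange mode it forwards them to the crawler that owns the target. The link structure below is hypothetical; only the page names and the two sites come from the figures.

```python
from collections import deque

# Hypothetical links between the pages in the figures.
LINKS = {
    "a": ["b", "c"], "b": ["d"], "c": ["i"], "d": ["e"], "e": [],
    "f": ["g", "h"], "g": [], "h": ["d"], "i": [],
}
# Site-based partition: crawler 0 owns site 1 (a-e), crawler 1 owns site 2 (f-i).
OWNER = {p: 0 for p in "abcde"}
OWNER.update({p: 1 for p in "fghi"})


def crawl_partitioned(seeds, mode):
    """Run two crawlers round-robin; mode is 'firewall' or 'exchange'."""
    queues = {0: deque(), 1: deque()}
    for seed in seeds:
        queues[OWNER[seed]].append(seed)
    seen = {0: set(), 1: set()}
    downloaded = {0: [], 1: []}
    while queues[0] or queues[1]:
        for c in (0, 1):
            if not queues[c]:
                continue
            page = queues[c].popleft()
            if page in seen[c]:
                continue
            seen[c].add(page)
            downloaded[c].append(page)
            for link in LINKS[page]:
                if OWNER[link] == c:
                    queues[c].append(link)            # link stays in own partition
                elif mode == "exchange":
                    queues[OWNER[link]].append(link)  # hand URL to its owner
                # firewall: a link that crosses the partition is dropped
    return downloaded


fw = crawl_partitioned(["a", "f"], "firewall")
ex = crawl_partitioned(["a", "f"], "exchange")
print(sorted(fw[0] + fw[1]), sorted(ex[0] + ex[1]))
```

In this toy graph page i is reachable only through a cross-partition link, so the firewall run misses it while the exchange run covers every page without downloading any page twice.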
![Page 17: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/17.jpg)
17
Coverage vs Overlap
cross-over crawler; 5 random seeds per C-proc
![Page 18: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/18.jpg)
18
WebBase Parallel Crawling
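[Figure: WebBase parallel crawling. Each computer runs several crawl processes, each with its own site queues, all fetching from the web; a coordinator connects the computer to the other computers.]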
![Page 19: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/19.jpg)
19
WebBase Parallel Crawling
[Figure: pages/sec, CPU utilization, and sites being crawled, plotted against the number of processes (1 to 17). Left axis: 0 to 2500. Right axis: CPU utilization from 0% to 200% (2 CPUs).]
![Page 20: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/20.jpg)
20
Challenges
• Scalability
  – crawling
  – archive distribution
  – index construction
  – storage
• Consistency
  – freshness
  – versions
• Dissemination
• Archiving
  – “units”
  – coordination
• IP Management
  – copy access
  – link access
  – access control
• Hidden Web
• Topic-Specific Collection Building
[Slide markers: done, next]
![Page 21: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/21.jpg)
21
How to Refresh?
[Figure: pages a and b on the web, with copies of a and b in the web repository]
a changes daily
b changes once a week
can visit one page per week
• How should we visit pages?
  – a a a a a a a ...
  – b b b b b b b ...
  – a b a b a b a b ... [uniform]
  – a a a a a a b a a a ... [proportional]
  – ?
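One way to compare the candidate policies is to compute the expected freshness of the repository under each. The sketch below assumes a model the slide does not state: each page changes as a Poisson process and is revisited at evenly spaced intervals, in which case the time-averaged probability that the local copy is current is (1 - e^(-x)) / x, where x is the expected number of changes between two visits. The function and variable names are illustrative.

```python
import math


def avg_freshness(change_rate, visit_rate):
    """Time-averaged probability that the copy is current.

    Assumes Poisson changes (change_rate per week) and evenly spaced
    visits (visit_rate per week). A page that is never visited is
    treated as always stale in the long run.
    """
    if visit_rate == 0:
        return 0.0
    x = change_rate / visit_rate  # expected changes between two visits
    return (1 - math.exp(-x)) / x


CHANGE = {"a": 7.0, "b": 1.0}  # a changes daily, b once a week
BUDGET = 1.0                   # one page visit per week in total


def policy_freshness(share_a):
    """Average freshness of {a, b} when a gets share_a of the visit budget."""
    fa = avg_freshness(CHANGE["a"], BUDGET * share_a)
    fb = avg_freshness(CHANGE["b"], BUDGET * (1 - share_a))
    return (fa + fb) / 2


results = {
    "only a": policy_freshness(1.0),
    "only b": policy_freshness(0.0),
    "uniform": policy_freshness(0.5),
    "proportional": policy_freshness(7 / 8),
}
for name, value in results.items():
    print(f"{name:12s} {value:.3f}")
```

Under these assumptions the uniform policy beats the proportional one, and with so small a budget, visiting only the slowly changing page b does better still: visits to a are wasted because a is stale again almost immediately.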
![Page 22: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/22.jpg)
22
Using WebBase
• Fast Page Rank
• Complex Queries
![Page 23: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/23.jpg)
23
Structure of the Web
Color the nodes by their domain
red = stanford.edu
green = berkeley.edu
blue = mit.edu
![Page 24: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/24.jpg)
24
Structure of the Web
stanford.edu berkeley.edu
mit.edu
![Page 25: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/25.jpg)
25
Nested Block Structure of the Web
[Figure: link matrix with axes “from” and “to”; blocks labeled Berkeley and Stanford]
![Page 26: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/26.jpg)
26
Personalized Page Rank
[Figure: nodes a and b]
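For reference, personalized PageRank in its standard textbook form replaces the uniform random jump with a jump to a user-specific set of pages. The sketch below is a plain power iteration over a hypothetical four-page graph; it is not the fast algorithm the talk refers to, and all names are illustrative.

```python
def personalized_pagerank(links, teleport, damping=0.85, iters=100):
    """Power iteration; teleport maps node -> jump probability (must sum to 1)."""
    nodes = list(links)
    rank = {n: teleport.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        # Random jump: land on a node according to the teleport vector.
        nxt = {n: (1 - damping) * teleport.get(n, 0.0) for n in nodes}
        for n in nodes:
            out = links[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: redistribute its rank via the teleport vector.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport.get(m, 0.0)
        rank = nxt
    return rank


# Hypothetical graph: a cycle a -> b -> c -> a, plus d -> a.
GRAPH = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}

uniform = personalized_pagerank(GRAPH, {n: 0.25 for n in GRAPH})
for_a = personalized_pagerank(GRAPH, {"a": 1.0})
print(round(uniform["a"], 3), round(for_a["a"], 3))
```

Personalizing on page a raises the rank of a and of the pages it reaches, while page d, which has no in-links and no jump probability, drops to zero.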
![Page 27: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/27.jpg)
27
Complex Queries
Stanford WebBase Repository
• Text search
  – E.g., search for “SARS Symptoms”
• Bulk/Streaming access
  – Large-scale mining & indexing
  – E.g., compute PageRank, extract communities
• Complex queries
  – Declarative analysis interface
![Page 28: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/28.jpg)
Example of a Complex Query
Goal: find universities collaborating with Stanford on mobile networking
• Compute S = stanford.edu pages containing the phrase “Mobile networking”
• Rank pages in S by PageRank
• Compute R = set of all “.edu” domains pointed to by pages in S
• Rank domains in R by sum (incoming ranks)
• List top 10 domains in R
[Figure: within the entire Web, the stanford.edu mobile networking pages (S) point to the domains in R]
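The query can be sketched over a toy in-memory collection. Everything here is illustrative: the page data, the PageRank values, and the helper names are hypothetical, "sum (incoming ranks)" is read as the summed PageRank of the pages in S that link into a domain, and excluding stanford.edu itself from R is an added assumption.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain(url):
    """Last two labels of the host, e.g. 'cs.stanford.edu' -> 'stanford.edu'."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def collaborating_domains(pages, pagerank, phrase, home="stanford.edu", top_k=10):
    # S = pages in the home domain that contain the phrase.
    S = [u for u, p in pages.items()
         if domain(u) == home and phrase.lower() in p["text"].lower()]
    # R = .edu domains pointed to by pages in S, scored by summed PageRank.
    score = defaultdict(float)
    for u in S:
        targets = {domain(v) for v in pages[u]["links"]}  # count a domain once per page
        for d in targets:
            if d.endswith(".edu") and d != home:  # assumption: skip the home domain
                score[d] += pagerank.get(u, 0.0)
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical data.
PAGES = {
    "http://cs.stanford.edu/p1": {
        "text": "Mobile networking lab",
        "links": ["http://www.berkeley.edu/x", "http://www.mit.edu/y"],
    },
    "http://ee.stanford.edu/p2": {
        "text": "A mobile networking course",
        "links": ["http://www.berkeley.edu/z", "http://example.com/"],
    },
    "http://cs.stanford.edu/p3": {
        "text": "Databases",
        "links": ["http://www.mit.edu/q"],
    },
}
PAGERANK = {
    "http://cs.stanford.edu/p1": 0.5,
    "http://ee.stanford.edu/p2": 0.3,
    "http://cs.stanford.edu/p3": 0.2,
}

top = collaborating_domains(PAGES, PAGERANK, "mobile networking")
print(top)
```

The query mixes text search, link navigation, and ranking in one pipeline, which is the kind of workload a declarative interface over the repository is meant to express.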
![Page 29: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/29.jpg)
29
Supernodes
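[Figure: a Web graph over pages P1 to P5 is partitioned into supernodes {N1, N2, N3}. The supernode graph has nodes N1, N2, N3 and edges E1,2, E3,2, E1,3, E3,1. Lower-level graphs shown: IntraNode1 (P1, P2), SEdgePos1,3 (P2, P5), IntraNode3 (P4, P5), SEdgeNeg3,2 (P5, P3).]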
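The top level of this representation can be sketched as follows: given a partition of pages into supernodes, a superedge from Ni to Nj exists when some page in Ni links to some page in Nj, and links inside one supernode are kept separately. The page-level links and the partition below are hypothetical (the figure's actual links are not recoverable from the transcript), and the positive/negative superedge encodings are not modeled.

```python
def build_supernode_graph(page_links, partition):
    """partition maps page -> supernode id.

    Returns (super_edges, intra): the set of directed supernode pairs,
    and for each supernode the page-level links that stay inside it.
    """
    super_edges = set()
    intra = {}
    for src, targets in page_links.items():
        for dst in targets:
            ns, nd = partition[src], partition[dst]
            if ns == nd:
                intra.setdefault(ns, []).append((src, dst))
            else:
                super_edges.add((ns, nd))
    return super_edges, intra


# Hypothetical page graph and partition.
PAGE_LINKS = {
    "P1": ["P2", "P3"],
    "P2": ["P5"],
    "P3": [],
    "P4": ["P5", "P3", "P1"],
    "P5": [],
}
PARTITION = {"P1": "N1", "P2": "N1", "P3": "N2", "P4": "N3", "P5": "N3"}

super_edges, intra = build_supernode_graph(PAGE_LINKS, PARTITION)
print(sorted(super_edges))
print(intra)
```

The supernode graph is far smaller than the page graph, so it can stay in memory and direct a query to only the lower-level graphs it needs.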
![Page 30: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/30.jpg)
30
Growth of Supernode Graph
[Figure: size of supernode graph (MB, axis 20 to 100) versus number of pages (millions, axis 0 to 120). Annotated point: 82 MB at 115M pages (830 GB of raw HTML).]
![Page 31: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/31.jpg)
31
Query Execution Times
[Figure: time for navigation operation (secs, axis 0 to 600) for Query 1 through Query 6, comparing four representations: S-Node representation, relational DB, Connectivity Server, and files of adjacency lists]
![Page 32: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/32.jpg)
32
Query Optimization
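[Figure: two selection expressions over the page relation P; the formulas did not survive extraction. Recoverable fragments: a predicate on p.Depth with the constant 4 combined with p.URL LIKE a pattern ending in ".net/%", and a predicate on p.Depth with the constant 5 combined with p.URL and the constant 1000.]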
![Page 33: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/33.jpg)
33
Impact of cluster-based optimization
[Figure: query execution time (secs, axis 0 to 600) for sample queries Q1 through Q9, with no optimization versus optimization enabled]
• 35-million page dataset
• 600 million links
• 300 GB of HTML
• 40-45% reduction in query execution times
![Page 34: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/34.jpg)
34
Conclusion (So Far)
• Web is a universal information resource
• WebBase exploits this resource
• WebBase Challenges:
  – scalability, consistency, complex queries...
• The Future for WebBase (and clones)??
![Page 35: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/35.jpg)
35
Will WebBase Scale?
[Figure: indexable web content versus WebBase capacity over time, starting from today, with a pessimistic and an optimistic capacity curve]
![Page 36: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/36.jpg)
36
Pessimistic Scenario
• Specialized WebBases
  – sports
  – shopping
  – ...
[Figure: indexable web content versus WebBase capacity (pessimistic) over time, starting from today]
![Page 37: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/37.jpg)
37
Optimistic Scenario
• Web in a Box
  – web delivered in “CD” monthly
  – search engine handles updates
[Figure: indexable web content versus WebBase capacity (optimistic) over time, starting from today]
![Page 38: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/38.jpg)
38
Legal Issues?
• Is WebBase legal?
  – copies
  – links, deep linking
• International regulations
![Page 39: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/39.jpg)
39
Biasing Results
• How long will Google, Altavista, etc. resist “temptations”?
• Biasing Crawler
• Link and Content Spam
![Page 40: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/40.jpg)
40
Access Data
• WebBase does not capture access patterns
[Figure: the web and WebBase, joined by a “?”]
![Page 41: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/41.jpg)
41
Semantic Web?
• Will tags be generated?
• By whom?
• Agreement?
[Figure: the web, semantic tags, and WebBase, joined by a “?”]
![Page 42: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/42.jpg)
42
Future Technical Challenges
• Incremental Updates
• Query Optimization
• Crawling Deep Web
![Page 43: WebBase: Building a Web Warehouse Hector Garcia-Molina Stanford University Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh,](https://reader035.fdocuments.in/reader035/viewer/2022081512/5a4d1b4f7f8b9ab0599a6df2/html5/thumbnails/43.jpg)
43
Final Conclusion
• Many challenges ahead...
• Additional information:
  – Google: Stanford WebBase
WEB PAGE