WebGather Design and Implementation
Hongfei Yan, Network Group, CST, PKU, Dec. 15, 2000
Email: [email protected]
http://net.cs.pku.edu.cn/~yhf
Outline
Introduction to search engines
WebGather
Conclusion
Introduction: http://www.yahoo.com/
Introduction: http://sohu.com/
Introduction: http://sina.com.cn/
Introduction: http://www.google.com/
Introduction: http://e.pku.edu.cn/
Introduction: Search Engine Sizes (searchenginewatch, Nov. 8, 2000)
GG = Google, WT = WebTop.com, AV = AltaVista, FAST = FAST, NL = Northern Light, EX = Excite, INK = Inktomi, Go = Go (Infoseek)
Introduction: a new study by Inktomi and the NEC Research Institute, Inc., Feb. 2000
Number of indexable pages on the web: over 1 billion
Number of servers discovered: 6,409,521
Number of mirrors in servers discovered: 1,457,946
Number of sites (total servers minus mirrors): 4,951,247
Number of good sites (reachable over 10 day period): 4,217,324
Number of bad sites (unreachable): 733,923
Average web pages per good site: 1,000,000,000 / 4,217,324 ≈ 237.1
Introduction:
Inktomi Search Engine cluster
In the picture: 9 × 8 × 2 = 144
WebGather: Introduction
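The "Tianwang" (WebGather) Chinese-English search engine system was developed by the Network and Distributed Systems Laboratory of the Department of Computer Science at Peking University. It is a research result of the national Ninth Five-Year Plan key science and technology project "Chinese Encoding and Distributed Chinese-English Information Discovery", and it began formally providing web information navigation services to Internet users on CERNET on October 29, 1997. While in service, the Tianwang system has widely adopted users' comments and suggestions and has continuously improved its service quality; to date it has received more than 8 million visits. The new Tianwang search engine project group, established in early 2000 and funded by the National 973 Key Basic Research Development Program, carries on the fine tradition of the earlier development team. It will work on the key technologies of Chinese-English search engine systems in order to provide users with faster, more accurate, more comprehensive, and more up-to-date navigation of massive web information. We welcome users' comments and suggestions.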
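http://e.pku.edu.cn/ "Though we lack the paired wings of the bright phoenix, our hearts are linked as one, like the thread through the magic horn."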
WebGather: as of Dec. 1, 2000
2.5 million scale
Index of 2.5 million web pages
More than 200,000 web pages every day
Ten days to update all data
Three PCs
collect all the web pages in China
keep pace with the rapid growth of Chinese web information
WebGather: Design goals for a distributed web-crawling system for WebGather
238 × 40,000 = 9,520,000
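The arithmetic on this slide can be reproduced as a back-of-the-envelope estimate. The sketch below assumes, as a reading of the slides rather than something they state, that 238 is the rounded-up average number of pages per site from the Inktomi/NEC figures and that 40,000 is a site count. The function name and the use of the "more than 200,000 pages every day" figure as a lower-bound gather rate are illustrative choices.

```python
import math


def estimate_total_pages(pages_per_site: int, site_count: int) -> int:
    """Estimate the number of pages to gather as pages-per-site times sites."""
    return pages_per_site * site_count


# Average pages per good site, from the Inktomi/NEC figures quoted earlier.
average_pages = 1_000_000_000 / 4_217_324
pages_per_site = math.ceil(average_pages)  # assumption: the slide rounds up

# Assumption: 40,000 is the number of sites the system aims to cover.
target_pages = estimate_total_pages(pages_per_site, 40_000)

# Assumption: the current rate of about 200,000 pages per day stays constant.
days_for_full_pass = target_pages / 200_000

print(pages_per_site, target_pages, days_for_full_pass)
```

Under these assumptions a full pass at the current rate would take far longer than the ten-day update cycle, which is one way to read the motivation for a distributed crawler.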
WebGather 2.0: architecture
[Figure: WebGather 2.0 architecture. Components shown: WWW, Gatherer, Gather Database, Indexer, Retrieve Database, Retriever, Client, Client log database, User behavior]
WebGather 1.2: architecture of gather subsystem 1/4
[Figure: Main Control connected to Gather1, Gather2, …, GatherN]
WebGather 2.0: architecture of gather subsystem 1/4
WebGather: technologies in gather subsystem 1/4
Distributed system architecture
High availability
……
Load balance
Low bandwidth
Scalability
Re-configurability
……
Word segmentation (cut words)
Position relativity
Anchor text, link popularity
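One common way to get load balance across several gatherers is to partition URLs by a hash of their host name, so that each host is always crawled by the same gatherer. The slides do not say how WebGather assigns work, so the sketch below is only an illustration of that general technique; `assign_gatherer` and `partition` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def assign_gatherer(url: str, num_gatherers: int) -> int:
    """Map a URL to a gatherer index using a stable hash of its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_gatherers


def partition(urls, num_gatherers):
    """Split a list of URLs into one bucket per gatherer."""
    buckets = [[] for _ in range(num_gatherers)]
    for url in urls:
        buckets[assign_gatherer(url, num_gatherers)].append(url)
    return buckets


urls = [
    "http://e.pku.edu.cn/",
    "http://net.cs.pku.edu.cn/~yhf",
    "http://sina.com.cn/",
    "http://sohu.com/",
    "http://sohu.com/news",  # hypothetical path on a host already listed
]
buckets = partition(urls, 3)
```

Keeping a host on one gatherer makes per-host politeness and duplicate checks local to that gatherer. A plain modulo reassigns most hosts when the number of gatherers changes, so a re-configurable system would likely need something like consistent hashing instead.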
WebGather: architecture of indexer subsystem 2/4
[Figure: panels A and B. One listing shows webpage1 … webpageN, each with its features (feature1, feature2, …); the other shows feature1 … featureN, each with the web pages that contain it (webpage1, webpage2, …)]
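The indexer figure pairs web pages with their features and features with their web pages, which matches the usual forward index and inverted index. The following is a minimal sketch of turning one into the other, assuming each page has already been reduced to a list of feature strings. The `invert` function and the sample data are illustrative and not taken from WebGather.

```python
from collections import defaultdict


def invert(forward_index):
    """Build a feature -> pages mapping from a page -> features mapping."""
    inverted = defaultdict(list)
    for page, features in forward_index.items():
        # dict.fromkeys drops repeated features while keeping their order.
        for feature in dict.fromkeys(features):
            inverted[feature].append(page)
    return dict(inverted)


forward = {
    "webpage1": ["feature1", "feature2"],
    "webpage2": ["feature1", "feature2", "feature3"],
    "webpage3": ["feature2"],
}
inverted = invert(forward)
```

With the inverted form, answering a query for a feature is a single dictionary lookup rather than a scan over every page.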
WebGather: technologies in retriever subsystem 3/4
Traditional IR (VSM)
Query cache, hot click
Word segmentation (cut words)
Anchor text, link popularity
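The vector space model (VSM) named on this slide ranks documents by the similarity between a query vector and each document vector. The sketch below uses TF-IDF weights and cosine similarity, which is one standard formulation; the slides do not give WebGather's actual weighting, so the formula, function names, and sample documents are assumptions. Documents and queries are assumed to be already segmented into terms.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return TF-IDF vectors for each document plus the IDF table."""
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(terms).items()}
        for doc_id, terms in docs.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_terms, docs):
    """Return ids of matching documents, most similar first."""
    vectors, idf = tf_idf_vectors(docs)
    query = {t: tf * idf[t] for t, tf in Counter(query_terms).items() if t in idf}
    scores = {doc_id: cosine(query, vec) for doc_id, vec in vectors.items()}
    hits = [doc_id for doc_id, score in scores.items() if score > 0.0]
    return sorted(hits, key=lambda doc_id: scores[doc_id], reverse=True)


docs = {
    "d1": ["web", "search", "engine"],
    "d2": ["web", "crawler"],
    "d3": ["distributed", "system"],
}
results = rank(["web", "crawler"], docs)
```

A production retriever would compute these scores from the inverted index instead of scanning every document, and could combine them with the other signals listed on the slide.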
WebGather: technologies in user behavior subsystem 4/4
Link popularity
Replica popularity
User popularity
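Link popularity is commonly approximated by counting how many distinct pages link to a page. The slides do not define the measure WebGather uses, so the sketch below shows only this simple in-link count, with a hypothetical `link_popularity` function and a made-up link graph.

```python
from collections import Counter


def link_popularity(out_links):
    """Count distinct in-links per page, ignoring self-links."""
    counts = Counter()
    for source, targets in out_links.items():
        for target in set(targets):  # count each source page at most once
            if target != source:
                counts[target] += 1
    return counts


# Hypothetical link graph: page -> pages it links to.
graph = {
    "a": ["b", "c", "b"],
    "b": ["c"],
    "c": ["c", "a"],
}
popularity = link_popularity(graph)
```

Replica popularity and user popularity could be counted in the same way from mirror lists and client logs, assuming such data is available.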
Conclusion:
Search engines are becoming more and more important.
The Web is a good experimental object; we can do a lot of R&D on it.