“The Anatomy of a Large-Scale Hypertextual Web Search Engine”
Presented by
Ahmed Khaled Al-Shantout
ICS 542 - 072
Outline
• Paper Objective
• Introduction & History
• Related Work
• Design Goals
• Google Search Engine Features
• Google Architecture
• Results & Performance
• Conclusion
• References
Introduction & History
• Why the name Google? From googol = 10^100 – reflecting the very large scale of the web.
• The web is different from a classic search collection.
 – The web is vast and growing exponentially.
 – The web is heterogeneous – images, HTML, files, etc.
 – IR on small, well-controlled, homogeneous collections is much easier.
• Human-maintained lists can’t keep up.
 – Yahoo! is a human-maintained list and so is subjective, slow to improve, and expensive to build and maintain.
• Google makes use of the hypertext info to get better results.
Related Work
• WWWW (World Wide Web Worm) – 1994 – first web search engine - indexed about 110,000 web pages – handled about 1500 queries/day.
• In 1997, the top SE claimed to index about 2 million web documents - handled about 20 million queries/day.
• In 2000, it was expected to index more than a billion documents – with more than 100 million queries/day.
• Current S.E. problems
 – Subjective (human-maintained)
 – Automated S.E. return low-quality results
 – Advertisers can mislead automated S.E.
• Solution = Google
• Problem statement
• Must handle the problem in a very efficient way:
 – Storage requirements
 – Efficient processing by the indexing system
 – Handle a huge number of queries/second
 – Produce high-quality results
Design Goals
• Deliver results that have very high precision, even at the expense of recall.
 – Using hypertextual info – such as font size, links, titles, anchor text, etc. – can improve search quality.
 – In 1997, only 4 commercial S.E. were able to return themselves in the top ten results!
• Make search engine technology transparent, i.e. advertising shouldn’t bias results
• Bring search engine technology into academic environment in order to support novel research activities on large web data sets
• Make system easy to use for most people, e.g. users shouldn’t have to specify more than a couple of words
Google Search Engine Features
Two main features to increase result precision:
• Uses the link structure of the web (PageRank)
• Uses the text surrounding hyperlinks to improve accurate document retrieval

Other features include:
• Takes word proximity in documents into account
• Uses font size, word position, etc. to weight words
• Stores full raw HTML pages
PageRank
• What does it mean for a web page to have a high rank?
 – Many pages point to it, so it is an important one.
 – Some important pages, such as Yahoo!, point to it.
• PR(A) = (1 − d) + d [PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn)]
• d is called the damping factor – it helps prevent pages from misleading the system into a higher ranking.
• Page A has pages T1…Tn which point to A.
• C(T1) is the number of links going out of page T1.
• Not all links are treated the same.
• PageRank is calculated using a simple iterative algorithm.
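The iterative calculation above can be sketched in a few lines. This is a minimal illustration, not Google's implementation: the three-page link graph is an invented example, and d = 0.85 is the commonly used damping value.

```python
# Minimal iterative PageRank sketch following the slide's formula:
# PR(A) = (1 - d) + d * sum(PR(T)/C(T) for each page T linking to A).
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for out in links.values() for t in out}
    pr = {p: 1.0 for p in pages}          # start every page at rank 1
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Contribution PR(T)/C(T) from every page T that links to p.
            incoming = sum(pr[t] / len(links[t])
                           for t in links if p in links[t])
            new[p] = (1 - d) + d * incoming
        pr = new
    return pr

# Illustrative graph: A -> B, C;  B -> C;  C -> A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Note how C, which receives links from both A and B, ends up with the highest rank – "not all links are treated the same."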
Anchor Text
• The text of the link, e.g. the visible “Chapter1” in a hyperlink.
• Objective: to return non-textual objects like files, databases, images, etc., which cannot be indexed by a text-based S.E.
• Also, it returns pages that have not been crawled.
[Figure: a page (“Dr. Wasfi’s Page”) contains a link whose anchor text is “Chapter1”; the link target is a PDF file]
Google Architecture
• Most of Google was built using C and C++ for efficiency.
• Works on Solaris and Linux.
[Architecture diagram: URL Server → Crawler → Store Server → Repository; the Indexer reads the Repository and writes to the Barrels, Lexicon, Anchors, and Doc Index; the URL Resolver turns Anchors into Links; the Sorter re-sorts the Barrels; the Searcher answers queries using the Lexicon, Barrels, Doc Index, and PageRank]
Crawling the Web
• Fetches URLs and gathers web pages into the store.
• A challenging task – needs to be fast to keep information up to date.
• Has to interact with the outside world – web servers, name servers, etc.
• Uses distributed crawling to scale well.
• Each crawler keeps 300 open connections.
• Up to 100 web pages per second using 4 crawlers.
• A DNS cache improves performance.
• A connection can be in one of these states: looking up DNS, connecting to host, sending request, and receiving response.
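The two mechanisms in the bullets above – the per-connection state machine and the DNS cache – can be modeled compactly. This is an illustrative sketch, not Google's crawler; the stub resolver stands in for a real lookup such as `socket.gethostbyname`.

```python
from enum import Enum, auto

class CrawlState(Enum):
    # The four connection states listed on the slide.
    LOOKING_UP_DNS = auto()
    CONNECTING = auto()
    SENDING_REQUEST = auto()
    RECEIVING_RESPONSE = auto()

class DNSCache:
    """Memoize hostname lookups so repeated fetches skip the resolver."""
    def __init__(self, resolver):
        self._resolver = resolver   # e.g. socket.gethostbyname in real use
        self._cache = {}
        self.misses = 0
    def lookup(self, host):
        if host not in self._cache:
            self.misses += 1        # only the first lookup hits the resolver
            self._cache[host] = self._resolver(host)
        return self._cache[host]

# Usage with a stub resolver (a real crawler would pass a network resolver):
cache = DNSCache(lambda host: "10.0.0.1")
for _ in range(300):                # 300 open connections per crawler
    cache.lookup("example.com")
```

With 300 connections per crawler hitting overlapping hosts, memoizing DNS answers removes a round-trip from most fetches, which is why the cache matters for throughput.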
Major Data Structures
• Repository: contains the full HTML of every page, compressed using the zlib standard.
• Document Index: keeps information about each document, ordered by docID.
• Hit Lists: a hit list records the occurrences of a particular word in a particular document, including position, font, and capitalization information.
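The paper describes packing each plain hit into a compact fixed-width encoding (a capitalization bit, a few font-size bits, and the word position). The bit widths below follow that description; the sketch is illustrative, not Google's actual code.

```python
def pack_hit(capitalized, font_size, position):
    """Pack one plain hit into 16 bits:
    1 capitalization bit | 3 font-size bits | 12 position bits."""
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(hit):
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

# A capitalized word in relative font size 3 at word position 150:
hit = pack_hit(True, 3, 150)
```

Squeezing each occurrence into two bytes is what makes it feasible to store hit lists for every word in every document.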
Forward Index
• Stored in the barrels.
• Each barrel holds a range of wordIDs.
• If a document contains a word in that barrel’s range, the docID is recorded in that barrel.
[Barrel entry layout: docID | wordID=1, #hits=2, hit list | wordID=2, #hits=3, hit list | …]
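An in-memory model of one forward-index barrel might look like the following. This is a simplified illustration (hits reduced to word positions; docIDs and wordIDs invented), not the on-disk format.

```python
# One forward-index barrel, keyed by docID: a document appears here only if
# it contains at least one word from this barrel's wordID range, and its
# entry lists (wordID, hit list) pairs for those words.
forward_barrel = {
    17: [(1, [4, 9]), (2, [1, 5, 12])],   # wordID 1: 2 hits, wordID 2: 3 hits
    42: [(2, [7])],
}

def hits_for(docID, wordID):
    """Return the hit list for a word in a document, or [] if absent."""
    for wid, hits in forward_barrel.get(docID, []):
        if wid == wordID:
            return hits
    return []
```

Because entries are grouped by docID, the indexer can append a document's hits in one pass – but answering "which documents contain word w?" requires the inverted index described next.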
Inverted Index
• Same as the forward index, except that it has been processed by the sorter – sorted by wordID instead of docID.
[Posting-list layout: each wordID, with its document count, points to a list of entries – (docID, #hits, hit list), (docID, #hits, hit list), …]
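The sorter's job – turning the docID-ordered forward index into a wordID-ordered inverted index – can be sketched as below. The data reuses the invented example from the forward-index sketch; this is an illustration of the transformation, not the paper's external-sort implementation.

```python
# Forward index: docID -> [(wordID, hit list), ...]
forward = {17: [(1, [4, 9]), (2, [1, 5, 12])], 42: [(2, [7])]}

# Invert it: wordID -> [(docID, #hits, hit list), ...]
inverted = {}
for docID, postings in forward.items():
    for wordID, hits in postings:
        inverted.setdefault(wordID, []).append((docID, len(hits), hits))
for plist in inverted.values():
    plist.sort()    # keep each posting list in docID order
```

After this step, answering “which documents contain word w, and where?” is a single dictionary lookup.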
Lexicon
Results & Performance
• Quality of the results is the most important metric in search engines
• The authors claim that Google outperforms the major commercial search engines
• Example
System Performance
• Major operations are crawling, indexing and sorting
• It took roughly 9 days to download 26 million pages.
• Crawling ran at up to 48.5 pages/second.
• The indexer ran at 54 pages/second.
• The whole sorting operation takes 24 hours
Search Performance
• Search speed was not the most important issue in the design at that time.
• The response time for a query was between 1 and 10 seconds for all queries – mainly disk I/O time.
• There was no query caching and no subindices on common terms for optimization.
Conclusion & Future Work
• The main concern of Google’s design is to be a scalable web search engine.
• And to provide high-quality results, via:
– Page ranking
– Anchor text
Future Work
• Improve search efficiency:
 – Cache queries
 – Smart disk allocation
 – Subindices
• Handle updates – old and new pages.
• Add Boolean operators, negation, and stemming.
• Relevance feedback and clustering.
• User context.
• Result summarization.
• PageRank can be personalized by increasing the weight of a user’s home page or bookmarks.
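The "cache queries" item above is the easiest of these to picture. A minimal sketch using Python's standard memoization decorator – the `search` function here is a hypothetical stand-in for the real index lookup, not anything from the paper:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)            # memoize the most popular queries
def search(query):
    # Stand-in for the real (expensive, disk-bound) index lookup.
    # Returns a tuple so the result is hashable and cacheable.
    return tuple(sorted(query.split()))

search("web search engine")         # first call: computed (cache miss)
search("web search engine")         # repeat call: served from the cache
info = search.cache_info()          # hits/misses counters for inspection
```

Since query popularity is highly skewed, even a small cache like this would have absorbed much of the disk I/O dominating the 1–10 second response times reported above.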
References
• S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)