Search Engines
Mahesh Sharma (CE/10/1158), Computer Science Department
Today's Coverage
● Introduction
● Types of Search Engines
● Components of a Search Engine
● Semantics and Relevancy
● Search Engine Optimization
Introduction
• A web search engine is a software system designed to search for information on the World Wide Web. The search results are generally presented as a list, often referred to as search engine results pages (SERPs).
• Search engines look through their own databases of information to find what you are looking for…
Types of Search Engines
● Crawler-Powered Indexes – Guruji.com, Google.com
● Human-Powered Indexes – www.dmoz.org
● Hybrid Models – URLs submitted to a search engine?
● Semantic Indexes – Hakia.com
Copyleft (ɔ) 2009 Sudarsun Santhiappan 5
How does a Search Engine work?
[Figure: How Search Engines Work (Sherman 2003). Components shown: Your Browser, The Web (URL1, URL2, URL3, URL4), Crawler, Indexer, Search Engine, Database. Example query "Eggs?" with ranked matches: Eggs 90%, Eggo 81%, Ego 40%, Huh? 10%. Top result: "All About Eggs" by S. I. Am.]
Search Engine Internals
● Crawlers
● Indexers
● Searching
● Semantics
● Ranking
Crawlers
• A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program, which is also known as a "spider" or a "bot."
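The visit-read-follow loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `PAGES` is a hypothetical in-memory stand-in for the Web (a real crawler would fetch pages over HTTP and respect robots.txt), and `LinkExtractor` and `crawl` are names invented for this example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML body (assumption for illustration).
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, pages):
    """Breadth-first visit of every page reachable from seed; returns visit order."""
    seen, order, queue = {seed}, [], deque([seed])
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # page not available in this toy web
        order.append(url)  # a real crawler would hand the page to the indexer here
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("http://example.com/", PAGES)
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other.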
Indexers
• A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and the use of more storage space to maintain the extra copy of data.
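As a small illustration of this trade-off, the sketch below uses Python's built-in `sqlite3` module. The table name `postings` and index name `idx_postings_term` are assumptions made up for this example, and the exact wording of the query plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (term TEXT, doc_id INTEGER)")
conn.executemany(
    "INSERT INTO postings VALUES (?, ?)",
    [("time", 1), ("country", 1), ("time", 2), ("country", 2), ("dark", 2)],
)

# The index is an extra, separately stored structure keyed on `term`.
# Every later INSERT must update it as well (the write and storage cost).
conn.execute("CREATE INDEX idx_postings_term ON postings (term)")

# Ask SQLite how it would run the lookup; the last column describes the plan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT doc_id FROM postings WHERE term = ?", ("time",)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)

# The lookup itself: which documents contain the term "time"?
docs = sorted(
    row[0]
    for row in conn.execute(
        "SELECT doc_id FROM postings WHERE term = ?", ("time",)
    )
)
```

With the index in place, `plan_text` should mention `idx_postings_term`, meaning the lookup searches the index instead of scanning the whole table.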
Semantics
• Semantics is the study of meaning. It focuses on the relation between signifiers (words, phrases, signs, and symbols) and what they stand for, their denotation. In linguistics, it is the study of meaning used to understand human expression through language.
Inverted Indexes
How Inverted Files Are Created
● Periodically rebuilt, static otherwise.
● Documents are parsed to extract tokens. These are saved with the Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2
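The parsing step can be sketched as follows. The tokenizer here (lowercase, letters only) is an assumption chosen to reproduce the table for Doc 1 and Doc 2; real systems use more elaborate tokenization. The names `DOCS` and `parse` are hypothetical.

```python
import re

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def parse(docs):
    """Return (term, doc_id) pairs in document order, one pair per token."""
    pairs = []
    for doc_id, text in docs.items():
        # Lowercase the text and keep runs of letters as tokens.
        for token in re.findall(r"[a-z]+", text.lower()):
            pairs.append((token, doc_id))
    return pairs


pairs = parse(DOCS)
```

Each token is stored together with its Document ID, so duplicates such as ("the", 1) appear once per occurrence at this stage.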
How Inverted Files are Created
● After all documents have been parsed the inverted file is sorted alphabetically.
Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2
How Inverted Files are Created
● Multiple term entries for a single document are merged.
● Within-document term frequency information is compiled.
Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2
How Inverted Files are Created
● Finally, the file can be split into:
  – A Dictionary or Lexicon file
  – A Postings file
How Inverted Files are Created
Dictionary/Lexicon
Term      N docs  Tot Freq
a         1       1
aid       1       1
all       1       1
and       1       1
come      1       1
country   2       2
dark      1       1
for       1       1
good      1       1
in        1       1
is        1       1
it        1       1
manor     1       1
men       1       1
midnight  1       1
night     1       1
now       1       1
of        1       1
past      1       1
stormy    1       1
the       2       4
their     1       1
time      2       2
to        1       2
was       1       2

Postings
Doc #  Freq
2      1
1      1
1      1
2      1
1      1
1      1
2      1
2      1
1      1
1      1
2      1
1      1
2      1
2      1
1      1
2      1
2      1
1      1
1      1
2      1
2      1
1      2
2      2
1      1
1      1
2      1
1      2
2      2
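The sort, merge, and split steps can be sketched together as below, assuming the same two example documents and a simple letters-only tokenizer. The in-memory `dictionary` and `postings` structures are illustrative assumptions; a real engine would typically write them to separate files, with each dictionary entry pointing at its run of postings.

```python
import re
from collections import Counter, defaultdict

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}

# Parse: one (term, doc_id) pair per token.
pairs = [
    (term, doc_id)
    for doc_id, text in DOCS.items()
    for term in re.findall(r"[a-z]+", text.lower())
]

# Sort alphabetically by term, then by document ID.
pairs.sort()

# Merge duplicate (term, doc_id) entries into within-document frequencies.
merged = sorted(Counter(pairs).items())  # [((term, doc_id), freq), ...]

# Split: postings hold (doc_id, freq) per term;
# the dictionary holds (number of docs, total frequency) per term.
postings = defaultdict(list)
for (term, doc_id), freq in merged:
    postings[term].append((doc_id, freq))

dictionary = {
    term: (len(plist), sum(freq for _, freq in plist))
    for term, plist in postings.items()
}
```

For example, `dictionary["the"]` is `(2, 4)` and `postings["the"]` is `[(1, 2), (2, 2)]`, matching the rows for "the" in the tables.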
Inverted Index
• In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems, used on a large scale for example in search engines.
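To show why this mapping makes full-text search fast, here is a minimal sketch of a lookup over an inverted index. It assumes a plain boolean AND query and a letters-only tokenizer; `build_index` and `search` are hypothetical names, and ranking is left out entirely.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return sorted IDs of documents containing every query term (boolean AND)."""
    terms = re.findall(r"[a-z]+", query.lower())
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())  # intersect posting sets
    return sorted(result)


index = build_index(DOCS)
hits_both = search(index, "country time")
hits_doc2 = search(index, "dark country")
hits_none = search(index, "eggs")
```

A query touches only the posting sets of its own terms, so the cost depends on how often those terms occur and not on the total size of the collection.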
From a description of the FAST search engine, by Knut Risvik:
In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
Each row can handle 120 queries per second
Each column can handle 7M pages
To handle more queries, add another row.
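The row and column figures above lend themselves to a quick sizing calculation. The sketch below assumes one machine per grid cell (each row being a full copy of all partitions) and uses a made-up workload of 70M pages and 600 queries per second; only the 120 queries/second and 7M pages figures come from the slide.

```python
QPS_PER_ROW = 120             # from the slide: each row handles 120 queries/second
PAGES_PER_COLUMN = 7_000_000  # from the slide: each column handles 7M pages


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return (a + b - 1) // b


def grid_size(total_pages, peak_qps):
    """Return (rows, columns, machines) for a hypothetical workload."""
    columns = ceil_div(total_pages, PAGES_PER_COLUMN)  # partitions of the page data
    rows = ceil_div(peak_qps, QPS_PER_ROW)             # replicas of every partition
    return rows, columns, rows * columns


rows, columns, machines = grid_size(total_pages=70_000_000, peak_qps=600)
```

Under these assumptions the workload needs 5 rows and 10 columns, or 50 machines. Doubling the query load adds rows without changing the number of columns.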
PageRank
● Let A1, A2, …, An be the pages that point to page A. Let C(P) be the number of links out of page P, and let d be a damping factor between 0 and 1 (commonly set to 0.85). The PageRank (PR) of page A is defined as:
PR(A) = (1-d) + d ( PR(A1)/C(A1) + … + PR(An)/C(An) )
● PageRank is the principal eigenvector of the link matrix of the web.
● It can be computed as the fixpoint of the above equation.
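The fixpoint of PR(A) = (1-d) + d ( PR(A1)/C(A1) + … + PR(An)/C(An) ) can be approximated by repeated substitution, as in the sketch below. The three-page link graph, the starting value of 1.0, and the choice d = 0.85 are assumptions for illustration. The code also assumes every page has at least one outgoing link and that every link target appears as a key in `links`.

```python
def pagerank(links, d=0.85, tol=1e-9, max_iter=1000):
    """Iterate PR(A) = (1-d) + d * sum(PR(Ai)/C(Ai)) until the values settle.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}                     # arbitrary starting guess
    out_count = {p: len(links[p]) for p in pages}    # C(P)

    # inbound[p] lists the pages A1..An that point to p.
    inbound = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            inbound[dst].append(src)

    for _ in range(max_iter):
        new = {
            p: (1 - d) + d * sum(pr[q] / out_count[q] for q in inbound[p])
            for p in pages
        }
        delta = max(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if delta < tol:  # values stopped changing: approximate fixpoint reached
            break
    return pr


# Hypothetical three-page web: A -> B, C; B -> C; C -> A
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph, page C ends up with the highest rank because it receives links from both A and B, while B, linked only from A, ranks lowest.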
Search Engine Optimization