Search engines


Mahesh Sharma (CE/10/1158), Computer Science Department


Today's Coverage

● Introduction
● Types of Search Engines
● Components of a Search Engine
● Semantics and Relevancy
● Search Engine Optimization

Introduction

• A web search engine is a software system designed to search for information on the World Wide Web. The search results are generally presented as a list of results, often referred to as search engine results pages (SERPs).

• Search engines look through their own databases of information to find what you are looking for.


Types of Search Engines

● Crawler-Powered Indexes
  – Guruji.com, Google.com

● Human-Powered Indexes
  – www.dmoz.org

● Hybrid Models
  – Submitted URLs to a search engine?

● Semantic Indexes
  – Hakia.com

Copyleft (ɔ) 2009 Sudarsun Santhiappan


How Does a Search Engine Work?


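[Figure: How Search Engines Work (Sherman 2003). Components shown: The Web (URL1, URL2, URL3, URL4), Crawler, Indexer, Database, Search Engine, Your Browser. The browser asks "Eggs?" and the search engine scores candidate matches: Eggs 90%, Eggo 81%, Ego 40%, Huh? 10%. The returned result is "All About Eggs" by S. I. Am.]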


Search Engine Internals



● Crawlers
● Indexers
● Searching
● Semantics
● Ranking

Crawlers

• A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program, which is also known as a "spider" or a "bot."
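The visit-read-follow loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular search engine's crawler is built: the names `LinkExtractor`, `crawl`, and `FAKE_WEB` are hypothetical, and a small in-memory dictionary stands in for real HTTP requests so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting at seed_url.

    fetch(url) is supplied by the caller and returns the page's HTML,
    or None if the page cannot be read. Returns {url: html}.
    """
    pages = {}
    queue = deque([seed_url])
    seen = {seed_url}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html  # hand-off point to the indexer
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical three-page "web" used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">home</a>',
}

pages = crawl("http://example.com/", FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the crawler from revisiting pages when links form a cycle, as they do in this toy web. A real crawler would also need politeness delays and robots.txt handling, which are left out here.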

Indexers

• A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and the use of more storage space to maintain the extra copy of data.
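The trade-off above (faster reads, extra storage and writes) can be illustrated with a toy table. This is a minimal sketch using a Python dictionary as the index. The table contents and the names `scan`, `build_index`, and `lookup` are made up for illustration and do not correspond to any specific database engine.

```python
# Hypothetical table: each row has an id and a term.
ROWS = [
    {"id": 1, "term": "eggs"},
    {"id": 2, "term": "ego"},
    {"id": 3, "term": "eggs"},
]


def scan(rows, column, value):
    """Without an index: every row must be examined."""
    return [row["id"] for row in rows if row[column] == value]


def build_index(rows, column):
    """The extra copy of data: maps each column value to row positions."""
    index = {}
    for position, row in enumerate(rows):
        index.setdefault(row[column], []).append(position)
    return index


def lookup(rows, index, value):
    """With an index: jump straight to the matching rows."""
    return [rows[position]["id"] for position in index.get(value, [])]


term_index = build_index(ROWS, "term")
print(scan(ROWS, "term", "eggs"))        # examines all rows
print(lookup(ROWS, term_index, "eggs"))  # examines only matching rows
```

Both calls return the same row IDs. The cost is that `term_index` occupies additional memory and would have to be updated on every insert, update, or delete.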

Semantics

• Semantics is the study of meaning. It focuses on the relation between signifiers (words, phrases, signs, and symbols) and what they stand for, their denotation. It is applied to understanding human expression through language.


Inverted Indexes


How Inverted Files Are Created

● Periodically rebuilt, static otherwise.
● Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2


How Inverted Files are Created

● After all documents have been parsed the inverted file is sorted alphabetically.

Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2



How Inverted Files are Created

● Multiple term entries for a single document are merged.

● Within-document term frequency information is compiled.

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2



How Inverted Files are Created

● Finally, the file can be split into
  – A Dictionary or Lexicon file, and
  – A Postings file


How Inverted Files are Created

Dictionary/Lexicon (Term, N docs, Tot Freq) and Postings (Doc #, Freq). Each dictionary entry is shown next to its entries in the postings file.

Term      N docs  Tot Freq  Postings (Doc #, Freq)
a         1       1         (2, 1)
aid       1       1         (1, 1)
all       1       1         (1, 1)
and       1       1         (2, 1)
come      1       1         (1, 1)
country   2       2         (1, 1) (2, 1)
dark      1       1         (2, 1)
for       1       1         (1, 1)
good      1       1         (1, 1)
in        1       1         (2, 1)
is        1       1         (1, 1)
it        1       1         (2, 1)
manor     1       1         (2, 1)
men       1       1         (1, 1)
midnight  1       1         (2, 1)
night     1       1         (2, 1)
now       1       1         (1, 1)
of        1       1         (1, 1)
past      1       1         (2, 1)
stormy    1       1         (2, 1)
the       2       4         (1, 2) (2, 2)
their     1       1         (1, 1)
time      2       2         (1, 1) (2, 1)
to        1       2         (1, 2)
was       1       2         (2, 2)
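The four steps on the preceding slides (parse, sort, merge with frequencies, split) can be reproduced on the two example documents with a short Python sketch. The variable names, the regular-expression tokenizer, and the `offset` field that links a dictionary entry to its position in the postings list are assumptions made for illustration. The slides do not specify how the two files are linked.

```python
import re
from itertools import groupby

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs as tokens (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


# Step 1: parse documents into (term, doc_id) pairs.
pairs = [(term, doc_id) for doc_id, text in DOCS.items()
         for term in tokenize(text)]

# Step 2: sort alphabetically by term, then by document ID.
pairs.sort()

# Step 3: merge repeated (term, doc_id) entries and count their frequency.
merged = [(term, doc_id, len(list(group)))
          for (term, doc_id), group in groupby(pairs)]

# Step 4: split into a dictionary (one entry per term) and a postings list.
dictionary = {}
postings = []
for term, entries in groupby(merged, key=lambda entry: entry[0]):
    entries = list(entries)
    dictionary[term] = {
        "n_docs": len(entries),
        "tot_freq": sum(freq for _, _, freq in entries),
        "offset": len(postings),  # hypothetical pointer into postings
    }
    postings.extend((doc_id, freq) for _, doc_id, freq in entries)

entry = dictionary["the"]
print(entry["n_docs"], entry["tot_freq"])
print(postings[entry["offset"]:entry["offset"] + entry["n_docs"]])
```

For the term "the", the dictionary entry reports 2 documents and a total frequency of 4, and its slice of the postings list holds one (Doc #, Freq) pair per document, matching the tables above.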

Inverted Index

• In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems, used on a large scale for example in search engines.
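To show why this mapping makes full-text search fast, here is a small hedged sketch of a boolean AND query answered by intersecting posting sets instead of scanning documents. The function names `build_inverted_index` and `search`, the simple whitespace tokenizer, and the use of Python sets for postings are illustrative assumptions, not a description of any production system.

```python
from collections import defaultdict

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def build_inverted_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every query word (boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_inverted_index(DOCS)
print(search(index, "country time"))
print(search(index, "dark country"))
print(search(index, "eggs"))
```

The query cost depends on the length of the posting sets for the query words, not on the total size of the document collection. The extra work happens when a document is added and every one of its words has to be entered into the index.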


From a description of the FAST search engine, by Knut Risvik

In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.

Each row can handle 120 queries per second

Each column can handle 7M pages

To handle more queries, add another row.
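The sizing rule can be worked through as simple arithmetic. The sketch below uses the two figures from the slide (120 queries per second per row, 7M pages per column) and assumes one machine per grid cell and that adding a column adds page capacity in the same way a row adds query capacity. Both assumptions are inferences, not statements from the FAST description. The workload numbers and the function name `grid_size` are hypothetical.

```python
import math

QUERIES_PER_SECOND_PER_ROW = 120   # from the slide
PAGES_PER_COLUMN = 7_000_000       # from the slide


def grid_size(total_pages, peak_qps):
    """Return (rows, columns, machines) needed for a given workload.

    Assumes one machine per (row, column) cell, which is an inference.
    """
    columns = math.ceil(total_pages / PAGES_PER_COLUMN)
    rows = math.ceil(peak_qps / QUERIES_PER_SECOND_PER_ROW)
    return rows, columns, rows * columns


# Hypothetical workload: 70M pages, 600 queries per second at peak.
rows, columns, machines = grid_size(total_pages=70_000_000, peak_qps=600)
print(rows, columns, machines)
```

Under these assumptions the example workload needs 5 rows of 10 columns, or 50 machines. Raising the peak load to 720 queries per second would add one more row of 10 machines.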


PageRank

● Let A1, A2, …, An be the pages that point to page A. Let C(P) be the number of links out of page P. The PageRank (PR) of page A is defined as:

PR(A) = (1-d) + d ( PR(A1)/C(A1) + … + PR(An)/C(An) )

where d is a damping factor between 0 and 1.

● PageRank is the principal eigenvector of the link matrix of the web.

● Can be computed as the fixpoint of the above equation.
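A minimal sketch of the fixpoint computation: start every page at the same value and reapply the PR(A) equation until the values stop changing. The function name `pagerank`, the damping value d = 0.85, the fixed iteration count, and the three-page example graph are assumptions chosen for illustration.

```python
def pagerank(links, d=0.85, iterations=200):
    """Iterate PR(A) = (1-d) + d * sum(PR(Ai)/C(Ai)) toward its fixpoint.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # C(P): number of links out of page P.
    out_count = {p: len(links.get(p, [])) for p in pages}
    # For each page A, the pages A1..An that point to it.
    incoming = {p: [] for p in pages}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)

    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[q] / out_count[q] for q in incoming[p])
            for p in pages
        }
    return pr


# Hypothetical link graph: A -> B, C; B -> C; C -> A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Because d is less than 1, each pass shrinks the remaining error, so the loop converges regardless of the starting values. In this example page C, which receives links from both A and B, ends up with the highest rank.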


Search Engine Optimization
