Search engines

Search Engines, Mahesh Sharma (CE/10/1158), Computer Science Department

Description

Google and Yahoo case study, page ranking, search engine optimization, types of search engines, inverted files, how search engines work, web crawlers, doc files, queries, etc.

Transcript of Search engines

Page 1: Search engines

Search Engines

Mahesh Sharma (CE/10/1158), Computer Science Department

Page 2: Search engines

Today's Coverage

● Introduction
● Types of Search Engines
● Components of a Search Engine
● Semantics and Relevancy
● Search Engine Optimization

Page 3: Search engines

Introduction

• A web search engine is a software system designed to search for information on the World Wide Web. The results are generally presented as a list, often referred to as search engine results pages (SERPs).

• Search engines do not scan the live Web for each query. They look through their own databases of previously collected information to find what you are looking for.

Page 4: Search engines

Types of Search Engines

● Crawler-Powered Indexes
  – Guruji.com, Google.com

● Human-Powered Indexes
  – www.dmoz.org

● Hybrid Models
  – URLs submitted to a search engine?

● Semantic Indexes
  – Hakia.com

Page 5: Search engines

Copyleft (ɔ) 2009 Sudarsun Santhiappan 5

Page 6: Search engines

Page 7: Search engines

How does a Search Engine work?

Page 8: Search engines

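How Search Engines Work (Sherman 2003)

[Figure: the Crawler collects pages (URL1 to URL4) from the Web, the Indexer adds them to the Database, and the Search Engine answers a query from Your Browser ("Eggs?") with ranked matches: Eggs 90%, Eggo 81%, Ego 40%, Huh? 10%. Example page shown: "All About Eggs" by S. I. Am.]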

Page 9: Search engines

Search Engine Internals

Page 10: Search engines

Search Engine Internals

● Crawlers
● Indexers
● Searching
● Semantics
● Ranking

Page 11: Search engines

Crawlers

• A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program, which is also known as a "spider" or a "bot."
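The visit-read-follow loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (`FAKE_WEB`) rather than real HTTP, and the names `LinkExtractor` and `crawl` are made up for this sketch. A production crawler would also need politeness rules, robots.txt handling, and error handling.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the Web: URL -> HTML. No network access is used.
FAKE_WEB = {
    "http://example.test/": '<a href="http://example.test/a">A</a> '
                            '<a href="http://example.test/b">B</a>',
    "http://example.test/a": '<a href="http://example.test/b">B</a>',
    "http://example.test/b": '<a href="http://example.test/">home</a>',
}


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; returns {url: html} in visit order."""
    seen = {seed}
    queue = deque([seed])
    collected = {}
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page could not be fetched, skip it
            continue
        collected[url] = html     # this is what would be handed to the indexer
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return collected


pages = crawl("http://example.test/", FAKE_WEB.get)
print(list(pages))
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other.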

Page 12: Search engines

Indexers

• A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and the use of more storage space to maintain the extra copy of data.

Page 13: Search engines

Semantics

• Semantics is the study of meaning. It focuses on the relation between signifiers (words, phrases, signs, and symbols) and what they stand for, their denotation. In the study of language, semantics is used to understand human expression.

Page 14: Search engines

Inverted Indexes

Page 15: Search engines

How Inverted Files Are Created

● Periodically rebuilt, static otherwise.
● Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2
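The parsing step can be reproduced with a short Python sketch. The tokenizer here (lowercase, alphanumeric runs only) is an assumption chosen because it yields the same tokens as the slide; the names `DOCS` and `parse` are illustrative.

```python
import re

# The two example documents from the slide, keyed by Document ID.
DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def parse(docs):
    """Return (term, doc_id) pairs in the order the tokens are read."""
    pairs = []
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase the text, keep runs of letters/digits.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            pairs.append((token, doc_id))
    return pairs


pairs = parse(DOCS)
print(len(pairs), pairs[:4])
```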

Page 16: Search engines

How Inverted Files are Created

● After all documents have been parsed, the inverted file is sorted alphabetically.

(Input: the unsorted Term / Doc # list from the previous slide.)

Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2

Page 17: Search engines

How Inverted Files are Created

● Multiple term entries for a single document are merged.
● Within-document term frequency information is compiled.

(Input: the sorted Term / Doc # list from the previous slide.)

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2
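Sorting and merging can be sketched together in Python. The input below is a hand-picked subset of the (term, doc) pairs, not the full list, and using `collections.Counter` to do the merge is one convenient choice rather than the method the slides prescribe.

```python
from collections import Counter

# A subset of the parsed (term, doc_id) pairs, in reading order.
pairs = [("to", 1), ("come", 1), ("to", 1), ("the", 1), ("the", 1),
         ("the", 2), ("time", 2), ("the", 2), ("time", 1)]

# Counting identical (term, doc_id) pairs merges duplicates and gives the
# within-document term frequency in one pass.
counts = Counter(pairs)

# Sort alphabetically by term, then by document ID.
merged = sorted((term, doc_id, freq) for (term, doc_id), freq in counts.items())

for row in merged:
    print(row)
```

Each output row is one (Term, Doc #, Freq) line of the merged table.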

Page 18: Search engines

How Inverted Files are Created

● Finally, the file can be split into
  – A Dictionary or Lexicon file, and
  – A Postings file

Page 19: Search engines

How Inverted Files are Created

(Input: the merged Term / Doc # / Freq list from the previous step.)

Dictionary/Lexicon            Postings
Term      N docs  Tot Freq    Doc #  Freq
a         1       1           2      1
aid       1       1           1      1
all       1       1           1      1
and       1       1           2      1
come      1       1           1      1
country   2       2           1      1
                              2      1
dark      1       1           2      1
for       1       1           1      1
good      1       1           1      1
in        1       1           2      1
is        1       1           1      1
it        1       1           2      1
manor     1       1           2      1
men       1       1           1      1
midnight  1       1           2      1
night     1       1           2      1
now       1       1           1      1
of        1       1           1      1
past      1       1           2      1
stormy    1       1           2      1
the       2       4           1      2
                              2      2
their     1       1           1      1
time      2       2           1      1
                              2      1
to        1       2           1      2
was       1       2           2      2
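A minimal sketch of the split, in Python. It assumes the merged list is already sorted by term so that each term's postings are contiguous. The `offset` field, which records where a term's postings start, is an assumption added for illustration; the slides only show N docs and Tot Freq.

```python
# A subset of the merged (term, doc_id, freq) rows, sorted by term.
merged = [("country", 1, 1), ("country", 2, 1), ("dark", 2, 1),
          ("the", 1, 2), ("the", 2, 2)]


def split_inverted_file(rows):
    """Split sorted (term, doc_id, freq) rows into a dictionary and postings."""
    dictionary = {}   # term -> {"n_docs", "tot_freq", "offset"}
    postings = []     # flat list of (doc_id, freq)
    for term, doc_id, freq in rows:
        if term not in dictionary:
            # First row for this term: remember where its postings begin.
            dictionary[term] = {"n_docs": 0, "tot_freq": 0,
                                "offset": len(postings)}
        entry = dictionary[term]
        entry["n_docs"] += 1
        entry["tot_freq"] += freq
        postings.append((doc_id, freq))
    return dictionary, postings


dictionary, postings = split_inverted_file(merged)


def lookup(term):
    """Return the (doc_id, freq) postings for a term, or [] if unknown."""
    entry = dictionary.get(term)
    if entry is None:
        return []
    start = entry["offset"]
    return postings[start:start + entry["n_docs"]]


print(dictionary["the"], lookup("the"))
```

Keeping the dictionary small and separate means it can stay in memory, while the much larger postings list is read only for the terms a query actually uses.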

Page 20: Search engines

Inverted Index

• In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems, used on a large scale for example in search engines.
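To show why this mapping makes full-text search fast, here is a small Python sketch that builds an inverted index over the two example documents and answers AND queries by intersecting sets of document IDs. The function names and the simple tokenizer are assumptions made for illustration.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(DOCS)
print(search(index, "dark time"))
```

A query touches only the entries for its own terms, so its cost depends on the length of those posting lists, not on the total size of the collection.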

Page 21: Search engines

From a description of the FAST search engine by Knut Risvik

In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.

Each row can handle 120 queries per second.

Each column can handle 7M pages.

To handle more queries, add another row.
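The row and column figures above allow a rough sizing calculation. The Python sketch below assumes a simple grid with one machine per (row, column) cell, which is an interpretation of the slide rather than a documented detail of FAST; the workload numbers in the example are made up.

```python
# Figures quoted on the slide.
QUERIES_PER_ROW = 120          # queries per second one row can serve
PAGES_PER_COLUMN = 7_000_000   # pages one column can hold


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def grid_size(total_pages, peak_qps):
    """Return (rows, columns, machines) for an assumed rows x columns grid."""
    columns = ceil_div(total_pages, PAGES_PER_COLUMN)   # more pages -> columns
    rows = ceil_div(peak_qps, QUERIES_PER_ROW)          # more queries -> rows
    return rows, columns, rows * columns


# Hypothetical workload: 70 million pages at 600 queries per second.
rows, columns, machines = grid_size(70_000_000, 600)
print(rows, columns, machines)
```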

Page 22: Search engines

PageRank

● Let A1, A2, …, An be the pages that point to page A. Let C(P) be the number of links out of page P. The PageRank (PR) of page A is defined as:

PR(A) = (1-d) + d ( PR(A1)/C(A1) + … + PR(An)/C(An) )

where d is a damping factor between 0 and 1 (commonly set to 0.85).

● PageRank is the principal eigenvector of the normalized link matrix of the web.

● It can be computed as the fixpoint of the above equation.
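The fixpoint can be approached by applying the PageRank equation repeatedly until the values stop changing. The Python sketch below does this on a made-up three-page graph. The damping value d = 0.85, the starting value of 1.0 per page, and the fixed iteration count are assumptions of this sketch, and every page in the example has at least one outgoing link so that C(P) is never zero.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(Ai)/C(Ai)) over all pages.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pr = {page: 1.0 for page in pages}                  # initial guess
    out_count = {page: len(links.get(page, [])) for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # Sum PR(P)/C(P) over every page P that points to a.
            incoming = sum(pr[p] / out_count[p]
                           for p in pages if a in links.get(p, []))
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Page C ends up with the highest value because it receives links from both A and B, while B receives only half of A's vote.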

Page 23: Search engines

Search Engine Optimization