Web Search Engine


Page 1: Web Search Engine

Web Technology Presentation

How search works in SEARCH ENGINE

Page 2: Web Search Engine

Index

✔ Prelude

✔ How search engine works

✔ Search Engine Architecture

✔ Working

✔ Features (Web Crawling, Page Rank, etc.)

✔ Problems

✔ Future

✔ Conclusion

Page 3: Web Search Engine

Prelude

✔ A Search Engine is an information retrieval system designed to help find information stored on a computer system, such as on the

● World Wide Web,

● inside a corporate or proprietary network,

● or in a personal computer.

✔ The search engine allows one to ask for content meeting specific criteria (typically those containing a given word or phrase) and retrieves a list of items that match those criteria.

Page 4: Web Search Engine

How search engine works

✔ Search starts with the Web. It's made up of over 60 trillion individual pages and it's constantly growing.

✔ Search engine navigates the web by:

● Web Crawling

● Indexing

● Searching

Page 5: Web Search Engine

Architecture Overview

Page 6: Web Search Engine

Working...

✔ A URL server sends lists of URLs to be fetched to the crawlers.

✔ The web pages that are fetched are then sent to the store server.

✔ The store server then compresses and stores the web pages into a repository.

✔ Every web page has an associated ID number called a doc ID which is assigned whenever a new URL is parsed out of a web page.
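The store-server step above can be sketched roughly as follows. This is a minimal illustration, not the presented system's code: the `Repository` class and its method names are hypothetical, and `zlib` stands in for whatever compression the store server uses.

```python
import zlib


class Repository:
    """Toy store server (hypothetical): assigns doc IDs and keeps compressed pages."""

    def __init__(self):
        self.doc_ids = {}  # URL -> doc ID
        self.pages = {}    # doc ID -> compressed HTML bytes

    def doc_id_for(self, url):
        # A new doc ID is handed out the first time a URL is seen.
        if url not in self.doc_ids:
            self.doc_ids[url] = len(self.doc_ids)
        return self.doc_ids[url]

    def store(self, url, html):
        # Compress the fetched page and file it under its doc ID.
        doc_id = self.doc_id_for(url)
        self.pages[doc_id] = zlib.compress(html.encode("utf-8"))
        return doc_id

    def load(self, doc_id):
        # The indexer would call this to read a page back out.
        return zlib.decompress(self.pages[doc_id]).decode("utf-8")
```

Storing the same URL twice returns the same doc ID, which is what lets later stages refer to a page by number instead of by URL.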

Page 7: Web Search Engine

Working...(cont.)

✔ The indexing function is performed by the indexer and the sorter.

✔ The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them.

✔ The indexer distributes hits into a set of "barrels", creating a partially sorted forward index.

✔ It parses out all the links in every web page and stores important information about them in anchor files.

✔ The anchors file contains enough information to determine where each link points from and to, and the text of the link.

Page 8: Web Search Engine

Working...(cont.)

✔ The URL resolver reads the anchors file and converts relative URLs into absolute URLs and, in turn, into doc IDs.

✔ It puts the anchor text into the forward index, associated with the doc ID that the anchor points to. It also generates a database of links which are pairs of doc IDs. The links database is used to compute Page Ranks for all the documents.

✔ The sorter takes the barrels, which are sorted by doc ID, and re-sorts them by word ID to generate the inverted index.
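As a rough illustration of the URL resolver described on this slide, the sketch below turns relative links into absolute URLs, maps them to doc IDs, and builds the links database as pairs of doc IDs. The function name `resolve_anchors` and the tuple layout of the anchors are assumptions made for the example.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, doc_ids):
    """anchors: list of (source_url, href, anchor_text) tuples (assumed layout).

    doc_ids: dict mapping URL -> doc ID; extended in place for unseen URLs.
    Returns (links, anchor_index).
    """
    links = []         # pairs of (from doc ID, to doc ID)
    anchor_index = {}  # target doc ID -> list of anchor texts pointing at it
    for source_url, href, text in anchors:
        # Relative URL -> absolute URL, resolved against the linking page.
        target_url = urljoin(source_url, href)
        src = doc_ids.setdefault(source_url, len(doc_ids))
        dst = doc_ids.setdefault(target_url, len(doc_ids))
        links.append((src, dst))
        # Anchor text is filed under the page the link points to.
        anchor_index.setdefault(dst, []).append(text)
    return links, anchor_index
```

The `links` list is the kind of input a Page Rank computation needs, while `anchor_index` is what would be merged into the forward index.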

Page 9: Web Search Engine

Working...(cont.)

✔ The sorter also produces a list of word IDs and offsets into the inverted index. A program called Dump Lexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher.

✔ The searcher is run by a web server and uses the lexicon built by Dump Lexicon together with the inverted index and the Page Ranks to answer queries.

Page 10: Web Search Engine

Crawling the Web (Web Crawler)

✔ A web crawler or spider is used to create an index of web pages to be used by a web search engine.

✔ Web search is based on this index.

✔ The search engine has a fast distributed crawling system.

✔ A single URL server serves lists of URLs to a number of crawlers.

✔ Both the URL server and the crawlers are implemented in Python.

✔ Each crawler keeps roughly 300 connections open at once. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data.

✔ Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document.
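The per-crawler DNS cache can be sketched as below. This is an assumption-laden illustration: the `DnsCache` class, its `lookups` counter, and the injectable `resolver` argument are hypothetical names introduced so the idea can be exercised without network access.

```python
import socket
from urllib.parse import urlsplit


class DnsCache:
    """Hypothetical per-crawler cache: resolve each host name at most once."""

    def __init__(self, resolver=socket.gethostbyname):
        self.resolver = resolver  # function host -> IP address string
        self.cache = {}           # host -> IP address
        self.lookups = 0          # number of real DNS lookups performed

    def resolve(self, url):
        host = urlsplit(url).hostname
        if host not in self.cache:
            # Only the first URL on a host pays for a DNS lookup.
            self.lookups += 1
            self.cache[host] = self.resolver(host)
        return self.cache[host]
```

For example, resolving `http://example.com/a` and then `http://example.com/b` triggers one lookup, because both URLs share a host.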

Page 11: Web Search Engine

Crawling the Web

Page 12: Web Search Engine

Indexing the Web

✔ Parsing

◦ Any parser which is designed to run on the entire Web must handle a huge array of possible errors.

✔ Indexing Documents into Barrels

◦ After each document is parsed, it is encoded into a number of barrels. Every word is converted into a word ID by using an in-memory hash table -- the lexicon.

◦ Once the words are converted into word IDs, their occurrences in the current document are translated into hit lists and are written into the forward barrels.

✔ Sorting

◦ The sorter takes each of the forward barrels and sorts it by word ID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel.
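The parse, barrel, and sort steps on this slide can be sketched as follows. It is a simplified model under stated assumptions: a single barrel, whitespace tokenization, hits that record word position only, and hypothetical function names `build_forward_index` and `invert`.

```python
from collections import defaultdict


def build_forward_index(docs, lexicon):
    """docs: {doc_id: text}. lexicon: {word: word_id}, extended in place.

    Returns a list of (doc_id, word_id, positions) entries ordered by doc ID.
    """
    forward = []
    for doc_id in sorted(docs):
        hits = defaultdict(list)
        for position, word in enumerate(docs[doc_id].lower().split()):
            # The lexicon is an in-memory hash table from word to word ID.
            word_id = lexicon.setdefault(word, len(lexicon))
            hits[word_id].append(position)
        for word_id, positions in hits.items():
            forward.append((doc_id, word_id, positions))
    return forward


def invert(forward):
    """Re-sort forward entries by word ID to obtain {word_id: [(doc_id, positions)]}."""
    inverted = defaultdict(list)
    for doc_id, word_id, positions in sorted(forward, key=lambda h: (h[1], h[0])):
        inverted[word_id].append((doc_id, positions))
    return dict(inverted)
```

The only difference between the two indexes is the sort key: the forward index answers "which words are in this document", the inverted index answers "which documents contain this word".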

Page 13: Web Search Engine

Indexing the Web

Page 14: Web Search Engine

Web Searching

✔ Match the query terms with words in the index

✔ Sort documents by relevance

✔ Find files or records

✔ Open each one and read it

✔ Store each word in a searchable index

✔ Kinds of information

✔ Dates

✔ Languages

✔ Update processes
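Matching query terms against the index and sorting by relevance might look like the sketch below. The scoring rule (hit count multiplied by an optional Page Rank value) is an assumption chosen for brevity, not the ranking function of any real engine, and the `search` function and index layout are hypothetical.

```python
def search(query, inverted_index, page_rank=None):
    """inverted_index: {word: {doc_id: hit_count}} (assumed layout).

    Returns doc IDs containing every query term, best match first.
    """
    terms = query.lower().split()
    if not terms:
        return []
    postings = [inverted_index.get(term, {}) for term in terms]
    # Keep only documents that contain every query term.
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)
    page_rank = page_rank or {}

    def score(doc_id):
        # Assumed relevance: total hits, weighted by Page Rank when available.
        hits = sum(posting[doc_id] for posting in postings)
        return hits * page_rank.get(doc_id, 1.0)

    # Ties are broken by doc ID so the ordering is deterministic.
    return sorted(matches, key=lambda doc_id: (-score(doc_id), doc_id))
```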

Page 15: Web Search Engine

Web Searching

Page 16: Web Search Engine

Other System Features

✔ The search engine makes use of the link structure of the Web to calculate a quality ranking for each web page, called Page Rank.

◦ Page Rank is a trademark of Google. The Page Rank process has been patented.

✔ The search engine uses links to improve search results.

Page 17: Web Search Engine

Page Rank

✔ Page Rank is a link analysis algorithm which assigns a numerical weighting to each Web page, with the purpose of "measuring" relative importance.

◦ Based on the hyperlink map

◦ An excellent way to prioritize the results of web keyword searches

Page 18: Web Search Engine

Intuitive Justification

✔ Picture a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page.

◦ The probability that the random surfer visits a page is its Page Rank.

◦ The damping factor d governs this behavior: at each page the "random surfer" follows a link with probability d, and with probability 1 − d gets bored and requests another random page.

✔ A page can have a high Page Rank

◦ If there are many pages that point to it

◦ Or if there are some pages that point to it, and have a high Page Rank.
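The random-surfer idea can be sketched as a simple power iteration. This is a minimal illustration under assumptions: it uses the normalized variant in which ranks sum to 1, treats d as the probability of following a link (with 0.85 as a commonly cited default), and spreads the rank of pages without out-links evenly over all pages. The function name and input layout are hypothetical.

```python
def page_rank(links, d=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}; ranks sum to 1."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is shared evenly by all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        # With probability 1 - d the surfer gets bored and jumps anywhere.
        new_rank = {page: (1.0 - d) / n + d * dangling / n for page in pages}
        for page, targets in links.items():
            for target in targets:
                # With probability d the surfer follows one of the page's links.
                new_rank[target] += d * rank[page] / len(targets)
        rank = new_rank
    return rank
```

With `links = {"a": ["c"], "b": ["c"], "c": ["a"]}`, page `c` ends up ranked highest because two pages point to it, and `a` outranks `b` because its single in-link comes from the highly ranked `c`.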

Page 19: Web Search Engine

Anchor Text

● <A href="http://www.yahoo.com/">Yahoo!</A>

The text of a hyperlink (anchor text) is associated not only with the page that the link is on, but also with the page the link points to.

◦ anchors often provide more accurate descriptions of web pages than the pages themselves.

◦ anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases.
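Collecting anchor text for the page a link points to can be sketched with Python's standard `html.parser` module. The `AnchorExtractor` class is a hypothetical helper written for this example; it ignores nested or malformed anchors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorExtractor(HTMLParser):
    """Hypothetical helper: collects (target URL, anchor text) pairs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchors = []  # list of (target URL, anchor text)
        self._href = None  # target of the anchor currently being read
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href.strip())
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None
```

Feeding it the Yahoo! link from this slide yields the pair (`http://www.yahoo.com/`, `Yahoo!`), which an indexer could then file under the target page's doc ID.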

Page 20: Web Search Engine

Major Data Structures

✔ Big Files

virtual files that span multiple file systems and are addressable by 64-bit integers.

✔ Repository

contains the full HTML of every web page.

✔ Document Index

keeps information about each document.

✔ Lexicon

two parts – a list of the words and a hash table of pointers.

✔ Hit Lists

a list of occurrences of a particular word in a particular document including position, font, and capitalization information.

✔ Forward Index

stored in a number of barrels

✔ Inverted Index

consists of the same barrels as the forward index, except that they have been processed by the sorter.
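To make the hit-list entry concrete, the sketch below packs one hit (position, font, capitalization) into 16 bits. The layout used here, 1 capitalization bit, 3 font-size bits, and 12 position bits, is an assumption chosen for illustration, and `pack_hit` and `unpack_hit` are hypothetical names.

```python
def pack_hit(position, font_size, capitalized):
    """Pack one hit into a 16-bit integer (assumed layout: 1 + 3 + 12 bits)."""
    if position < 0:
        raise ValueError("position must be non-negative")
    if not 0 <= font_size <= 7:
        raise ValueError("font_size must fit in 3 bits")
    # Positions beyond the 12-bit range are clamped to the largest value.
    position = min(position, 4095)
    return (int(bool(capitalized)) << 15) | (font_size << 12) | position


def unpack_hit(hit):
    """Inverse of pack_hit: returns (position, font_size, capitalized)."""
    return hit & 0xFFF, (hit >> 12) & 0x7, bool(hit >> 15)
```

A compact fixed-width encoding like this matters because hit lists make up most of the space in both the forward and the inverted index.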

Page 21: Web Search Engine

Search is NOT a Panacea

✔ Search can’t find what’s not there

◦ The content is hugely important

✔ Information Architecture is vital

✔ Usable sites have good navigation and structure

Page 22: Web Search Engine

Problems

✔ High Quality Search

The biggest problem facing users of web search engines today is the quality of the results they get back.

✔ Scalable Architecture

Google is designed to scale. It must be efficient in both space and time.

Page 23: Web Search Engine

The Future

“The ultimate search engine would understand exactly what you mean and give back exactly what you want.”

- Larry Page

Page 24: Web Search Engine

Conclusion

✔ The primary goal is to provide high quality search results over a rapidly growing World Wide Web.

✔ Search engines employ a number of techniques to improve search quality, including Page Rank, anchor text, and proximity information.

✔ A search engine is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

Page 25: Web Search Engine

And that's how search works...

● THANK YOU

Presented by:

SAURABH GOEL

111438