Search Engine


Project Presentation

Subject: Professional Practises-3
Faculty: Mudassir Mahadik Sir

Project Details:
Group Members: Qasim Dadan, Rashid Shaikh, Saif Khan, Wahaj Shaikh

Project Topic: Search Engine and Web Crawler (Spider)

Search Engine Introduction

Search Engine Definition: A web search engine is a software system that is designed to search for information on the World Wide Web. The search results are generally presented in a line of results often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler.

Purpose of a Search Engine
Helping people find what they're looking for
Starts with an information need
Converts it to a query
Gets results

Materials available on these systems:
Web pages
Other formats
Deep Web

A search engine operates in the following order:

1. Web Crawling: A bot (crawler) visits web pages and collects them so they can be indexed into the database.

2. Indexing: The collected pages are stored in an index, and each result is assigned a rank (priority).

3. Searching: The process of looking up results in the search database by firing a query.

Working Diagram of Search Engine

Explanation of Working Diagram:

Indexing Process: The search engine analyzes the contents of each page to determine how it should be indexed (for example, words can be extracted from the titles, page content, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. A query from a user can be a single word. The index helps find information relating to the query as quickly as possible.

Searching Process: When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. The index is built from the information stored with the data and the method by which the information is indexed.
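As a minimal sketch of the indexing and searching steps just described, the following Python example builds an inverted index that maps each word to the pages containing it, then answers a keyword query by intersecting those page sets. The page texts, URLs, and simple word-splitting are illustrative assumptions, not part of the project itself.

```python
from collections import defaultdict

def build_index(pages):
    """Index each page by the words it contains (a simple inverted index)."""
    index = defaultdict(set)          # word -> set of page URLs
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = index.get(words[0], set()).copy()
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Example usage with two toy "pages"
pages = {
    "http://example.com/a": "Search engines crawl and index the web",
    "http://example.com/b": "A web crawler is also called a spider",
}
idx = build_index(pages)
print(search(idx, "web crawler"))    # {'http://example.com/b'}
```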

Crawling Process: A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. If the crawler is performing archiving of websites, it copies and saves the information as it goes.

Content, search functionality, user interface: search is mostly invisible; like an iceberg, 2/3 of it is below the water.

Web Crawling Introduction

A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or a Web scutter. Web search engines and some other sites use Web crawling or spidering software to update their web content or indexes of other sites' web content. Web crawlers can copy all the pages they visit for later processing by a search engine, which indexes the downloaded pages so the users can search much more efficiently. Crawlers can validate hyperlinks and HTML code. They can also be used for web scraping.
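A rough sketch of the crawl loop described above (seeds, frontier, hyperlink extraction) is given below in Python, using only the standard library. The seed URL, the page limit, and the regex-based link extraction are assumptions made for illustration; a real crawler would also obey robots.txt and politeness policies.

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seeds, max_pages=10):
    """Breadth-first crawl starting from the seed URLs."""
    frontier = deque(seeds)           # the crawl frontier: URLs still to visit
    visited = set()
    pages = {}                        # url -> downloaded HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                  # skip pages that fail to download
        pages[url] = html
        # Extract hyperlinks and add unseen ones to the frontier
        for link in re.findall(r'href="(.*?)"', html):
            absolute = urljoin(url, link)
            if absolute not in visited:
                frontier.append(absolute)
    return pages

# Example usage:
# pages = crawl(["http://example.com/"])
```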

Utilities of a Crawler

Web Crawling Definition: A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. (Wikipedia)

Web Crawling Utilities: Gather pages from the Web; support a search engine; perform data mining; and so on.

Objects Processed by a Crawler: Text, video, images, and so on.

Overview of Crawler

A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. If the crawler is performing archiving of websites, it copies and saves the information as it goes. The archives are usually stored in such a way that they can be viewed, read, and navigated as they were on the live web, but are preserved as snapshots. The large volume of the Web means the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change means that pages may already have been updated or even deleted by the time they are visited.
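Because the crawler can only download a limited number of pages in a given time, the crawl frontier is often kept as a priority queue rather than a plain list. The sketch below assumes a very simple scoring rule (shallower URLs are fetched first); the prioritization policy of a real crawler would be more sophisticated.

```python
import heapq

class PriorityFrontier:
    """Crawl frontier that hands back the most important URL first."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url):
        if url not in self._seen:
            self._seen.add(url)
            # Lower score = higher priority; here: fewer path segments first
            score = url.rstrip("/").count("/")
            heapq.heappush(self._heap, (score, url))

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __bool__(self):
        return bool(self._heap)

# Example usage: the site root is popped before the deeper page
frontier = PriorityFrontier()
frontier.push("http://example.com/a/b/c")
frontier.push("http://example.com/")
print(frontier.pop())   # 'http://example.com/'
```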

Thank You