Mini Google

8/4/2019 Mini Google


Mahender K



http://www.killerinfo.com/
http://www.gigablast.com/
http://www.wisenut.com/
http://www.metacrawler.com/index.html
http://www.teoma.com/
http://www.altavista.com/
http://www.alltheweb.com/
http://www.askjeeves.com/
http://www.google.com/



Tools for finding information on the Web

Problem: hidden databases, e.g. Times of India (i.e., databases of keywords hosted by the web site itself; these cannot be accessed by Yahoo, Google, etc.)

Search engine

A machine-constructed index (usually by keyword)

So many search engines, we need search engines to find them.



Search engines: key tools for ecommerce

Buyers and sellers must find each other

How do they work?

How much do they index?

Are they reliable?

How are hits ordered?

Can the order be changed?



Overall goal: Locate web documents containing a specified keyword.

Input: Keyword

Output: Set of links



Crawl the web, look at each page for the keyword. Follow each link to find more pages to search.

Problems

Non-terminating: walking in circles?

Inefficient: walk the web for every search?

Page interpretation: match HTML tags?



Walk the web once.

Build a database.

Problem: staleness

How often to walk the ever-changing web?

Approach

Periodic rebuilds of the database

Specialization

Accept limited staleness



Problem: How to ignore HTML tags?

Problem: How to capture words?

Problem: How to capture links?

Problem: How to capture images?

Idea

Use a parser (tokenizer)



Problem: How to ignore HTML tags?

Issue: Need to extract links

Problem: How to capture words?

Idea: Use a parser (tokenizer)

Parse1: HTML-page -> set-of-words

Parse2: HTML-page -> set-of-links



Problem: How to ignore HTML tags?

Issue: Need to extract links

Problem: How to capture words?

Idea #3: Use a parser (tokenizer)

Parse1: HTML-page -> set-of-words

Parse2: HTML-page -> set-of-links

Idea #4: Parse once

Parse: HTML-page -> set-of-words & set-of-links
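
The "parse once" idea can be sketched in a few lines of Python. This is only an illustration, not the project's tokenizer: it swaps in the standard library's html.parser module instead of a hand-written finite state machine, and the names PageParser and parse, as well as the sample page, are made up for the example.

```python
import re
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects the words and the links of one HTML page in a single pass."""

    def __init__(self):
        super().__init__()
        self.words = set()
        self.links = set()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            # Tags themselves are ignored, except that href values are kept.
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text between tags is split into lowercase alphanumeric words.
        if self._skip_depth == 0:
            found = re.findall(r"[A-Za-z0-9]+", data)
            self.words.update(word.lower() for word in found)


def parse(html_page):
    """Parse: HTML-page -> (set-of-words, set-of-links)."""
    parser = PageParser()
    parser.feed(html_page)
    parser.close()
    return parser.words, parser.links


# Hypothetical sample page.
page = (
    "<html><head><title>Cricket news</title></head><body>"
    "<p>India wins!</p>"
    '<a href="http://example.com/scores">Scores</a>'
    "</body></html>"
)
words, links = parse(page)
print(sorted(words))
print(sorted(links))
```

One pass over the page yields both sets, so the page never has to be read twice.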



1. Acquire the collection, i.e. all the documents

[Off-line process]

2. Create an inverted index

[Off-line process]

3. Match queries to documents

[On-line process, the actual retrieval]

4. Present the results to user

[On-line process: display, summarize, ...]



Spider

Crawls the web to find pages. Follows hyperlinks. Never stops

Indexer

Produces data structures for fast searching of all words in the pages (i.e., it updates the lexicon)

Retriever

Query interface

Database lookup to find hits

1 billion documents

1 TB RAM, many terabytes of disk

Ranking



Thousands of servers (WOW!)

Web site traffic grows over 20% per month

Spiders and indexes over 17 billion URLs

Supports many languages and is used in many countries

Over 283 million searches per day

Even we use it!



Start with an initial page P0. Find URLs on P0 and add them to a queue

When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat

Can be specialized (e.g. only look for email addresses)

Issues

Which page to look at next? Avoid overloading a site

How deep within a site do you go (depth search)?

How frequently to visit pages?
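
The queue-driven loop described above might look like the following minimal sketch. It assumes a tiny in-memory "web" (the WEB dictionary) instead of real HTTP requests, and the names crawl, fetch_links, index_page and max_pages are hypothetical. A set of already-seen URLs keeps the spider from walking in circles, and max_pages is one crude way to bound how far it goes; politeness toward a site and revisit frequency are not addressed here.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> URLs that page links to.
WEB = {
    "P0": ["P1", "P2"],
    "P1": ["P0", "P3"],  # links back to P0, so the graph has a cycle
    "P2": ["P3"],
    "P3": [],
}


def crawl(start, fetch_links, index_page, max_pages=100):
    """Breadth-first crawl from start, passing each page to index_page."""
    queue = deque([start])
    seen = {start}  # URLs already queued, to avoid visiting a page twice
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        # Find the URLs on this page and add the new ones to the queue.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # When done with the page, hand it to the indexing program.
        index_page(url)
        visited += 1


order = []
crawl("P0", lambda url: WEB.get(url, []), order.append)
print(order)  # ['P0', 'P1', 'P2', 'P3']
```

Replacing the deque with a priority queue would change "which page to look at next" without touching the rest of the loop.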


19/34

Arrangement of data (data structure) to permit fast searching

Which list is easier to search?

sow fox pig eel yak hen ant cat dog hog

ant cat dog eel fox hen hog pig sow yak

Sorting helps. Why?

Permits binary search. About log2(n) probes into the list

log2(1 billion) ~ 30

Permits interpolation search. About log2(log2(n)) probes on average, when the keys are roughly uniformly distributed

log2(log2(1 billion)) ~ 5
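
As a rough check of the probe counts, here is a small sketch of binary search over the sorted animal list that also counts its probes. The function name binary_search and the probe counter are illustrative choices, not part of the original slides.

```python
import math


def binary_search(sorted_items, target):
    """Return (index, probes); index is -1 when target is absent."""
    lo, hi = 0, len(sorted_items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1  # one comparison against the list per iteration
        if sorted_items[mid] == target:
            return mid, probes
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes


animals = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]

print(binary_search(animals, "pig"))  # (7, 2)
print(binary_search(animals, "yak"))  # (9, 4)

# The estimates quoted above for a list of one billion entries.
print(round(math.log2(1e9)))             # 30
print(round(math.log2(math.log2(1e9))))  # 5
```

With ten entries no search needs more than four probes, while a scan of the unsorted list may need ten.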


20/34

A file is a list of words by position

- First entry is the word in position 1 (first word)

- Entry 4562 is the word in position 4562 (4562nd word)

- Last entry is the last word

An inverted file is a list of positions by word!

FILE

The sentences above, read as a file of 45 words. Its lines start at positions (POS) 1, 10, 20, 30 and 36.

INVERTED FILE

a (1, 4, 40)

entry (11, 20, 31)

file (2, 38)

list (5, 41)

position (9, 16, 26)

positions (43)

word (14, 19, 24, 29, 35, 45)

words (7)

4562 (21, 27)
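
The inversion can be reproduced with a short sketch that treats the slide's own sentences as the file. The tokenization rule (lowercase runs of letters and digits) and the name invert are assumptions made for the example.

```python
import re
from collections import defaultdict

# The slide's own text, used as the "file".
TEXT = (
    "A file is a list of words by position. "
    "First entry is the word in position 1 (first word). "
    "Entry 4562 is the word in position 4562 (4562nd word). "
    "Last entry is the last word. "
    "An inverted file is a list of positions by word!"
)


def invert(text):
    """Map each word to the list of 1-based positions where it occurs."""
    index = defaultdict(list)
    tokens = re.findall(r"\w+", text.lower())
    for pos, word in enumerate(tokens, start=1):
        index[word].append(pos)
    return dict(index)


inverted = invert(TEXT)
for word in ("a", "entry", "file", "word", "4562"):
    print(word, inverted[word])
```

Under this tokenization the file has 45 words, and the word "positions" falls at position 43.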


21/34

LEXICON

WORD        NDOCS  PTR
jezebel     20
jezer       3
jezerit     1
jeziah      1
jeziel      1
jezliah     1
jezoar      1
jezrahliah  1
jezreel     39

WORD INDEX

DOCID  OCCUR  POS 1  POS 2  . . .
34     6      1      118    2087  3922  3981  5002
44     3      215    2291   3010
56     4      5      22     134   992
566    3      203    245    287
67     1      132
. . .
107    4      322    354    381   405
232    6      15     195    248   1897  1951  2192
677    1      481
713    3      42     312    802

jezebel occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .
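
A lexicon that points into a word index could be sketched as below. This is a toy under stated assumptions: only the three jezebel rows spelled out on the slide are used (so NDOCS is 3 here rather than 20), the jezer row is invented for illustration, and treating PTR as the offset of a word's first row is an assumption about the layout.

```python
from bisect import bisect_left

# Word index: one row per (word, document) pair, stored as
# (doc_id, positions). The occurrence count is len(positions).
WORD_INDEX = [
    (34, [1, 118, 2087, 3922, 3981, 5002]),  # jezebel, from the slide
    (44, [215, 2291, 3010]),                 # jezebel, from the slide
    (56, [5, 22, 134, 992]),                 # jezebel, from the slide
    (7, [10]),                               # jezer, made-up row
]

# Lexicon: sorted by word; each entry is (word, ndocs, ptr), where ptr is
# assumed to be the offset of the word's first row in WORD_INDEX.
LEXICON = [
    ("jezebel", 3, 0),
    ("jezer", 1, 3),
]
WORDS = [entry[0] for entry in LEXICON]


def lookup(word):
    """Binary-search the lexicon, then slice the word's rows out of the index."""
    i = bisect_left(WORDS, word)
    if i == len(WORDS) or WORDS[i] != word:
        return []
    _, ndocs, ptr = LEXICON[i]
    return WORD_INDEX[ptr:ptr + ndocs]


for doc_id, positions in lookup("jezebel"):
    print(f"jezebel occurs {len(positions)} times in document {doc_id}")
```

Keeping the lexicon sorted is what allows the binary search; the word index itself is only ever reached through PTR.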


22/34

Hits must be presented in some order

What order?

Relevance, popularity, reliability?

Some ranking methods

Presence of keywords in title of document

Closeness of keywords to start of document

Frequency of keyword in document

Link popularity (how many pages point to this one)
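
One way to combine the four signals listed above is a weighted sum, sketched here. The weights, the logarithmic damping of the link count, the document fields (title, body, inlinks) and the function names are all assumptions chosen for illustration; the slides do not prescribe a formula.

```python
import math


def score(doc, keyword, weights=(3.0, 2.0, 1.0, 0.5)):
    """Weighted sum of four ranking signals for one document."""
    w_title, w_early, w_freq, w_links = weights
    kw = keyword.lower()
    title_words = doc["title"].lower().split()
    body_words = doc["body"].lower().split()

    # Presence of the keyword in the title.
    in_title = 1.0 if kw in title_words else 0.0
    # Closeness of the first occurrence to the start of the document.
    if kw in body_words:
        early = 1.0 - body_words.index(kw) / len(body_words)
    else:
        early = 0.0
    # Frequency of the keyword in the document.
    freq = body_words.count(kw) / len(body_words) if body_words else 0.0
    # Link popularity, damped so that it does not swamp the other signals.
    popularity = math.log1p(doc["inlinks"])

    return (w_title * in_title + w_early * early
            + w_freq * freq + w_links * popularity)


def rank(docs, keyword):
    """Keep only documents containing the keyword, best score first."""
    kw = keyword.lower()
    hits = [d for d in docs
            if kw in (d["title"] + " " + d["body"]).lower().split()]
    return sorted(hits, key=lambda d: score(d, keyword), reverse=True)


# Hypothetical documents.
docs = [
    {"url": "a", "title": "Cricket scores",
     "body": "cricket results and more cricket", "inlinks": 2},
    {"url": "b", "title": "Daily news",
     "body": "politics business and cricket", "inlinks": 2},
    {"url": "c", "title": "Weather",
     "body": "rain expected today", "inlinks": 50},
]
print([d["url"] for d in rank(docs, "cricket")])  # ['a', 'b']
```

Document c never appears for this query despite its many inbound links, because ranking only orders the hits; it does not create them.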



Search engine for any website

Not for the entire web

Results can be confined to only one web site



http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2004091500081100.htm&date=2004/09/15/&prd=bl :: 4

http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2002102700140200.htm&date=2002/10/27/&prd=mag :: 7

..

http://www.hindu.com/2004/10/09/stories/2004100904051900.htm :: 23

http://www.hindu.com/2004/10/09/stories/2004100910970300.htm :: 3

..

.

http://www.hinduonnet.com/thehindu/gallery/0166/016606.htm :: 2

http://www.hinduonnet.com/thehindu/gallery/0048/004807.htm :: 1

..

India

ManMohan

Cricket

Bollywo

Sharukh

Sachin

.



Search Engine



Index

Crawl

Search



[Architecture diagram. Components: TheWeb, Spider, Parser, URLList, Indexer, Index, Query, ResultSet, FinalResult, ResultPage. Operations: crawl, parse, getNextUrl, addUrls, addPage, store, retrieve, Sort by Rank, makePage.]



[The same architecture diagram, annotated with the data structures and algorithms used: Queue, PriorityQueue, Hashtable, BinaryTree, LinkedList, MergeSort & InsertionSort, AVLTree, Finite State Machines.]



[Class diagram. Classes: PageElement, PageImg, PageHref, PageWord, Spider, WebSpider, Queue, PageLexer, HttpTokenizer, URLTextReader, CrawlerDriver, SearchDriver, DictionaryDriver, DictionaryInterface, ListDictionary, TreeDictionary, HashDictionary, Indexer, Index, Query. Operations: Crawl, Parse, addPage, Index, Save, Restore. Legend: Inheritance, Uses, Calls.]



Week 3

Tokenizer (using FSM)

Crawling - Rules

Breadth First Spider

Priority Based Spider

Indexing

Keywords, with their frequency of occurrence and their URLs

Persistence

Saving the index to disk

Simple Search

Sorting based on rank



Week 4

Set Data Structures

Allowing Boolean search (AND, OR)

Client and Server Architecture

Client developed using Swing

Multi-Threaded Server

Performance Analysis

Final Demo
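
The Boolean search item in the Week 4 list maps naturally onto set operations, as in this small sketch. The index contents and the names search_and and search_or are hypothetical; the keywords are borrowed from the sample index shown earlier, and the URL labels u1 to u4 are placeholders.

```python
# Hypothetical index: keyword -> set of URLs containing it.
INDEX = {
    "india": {"u1", "u2", "u3"},
    "cricket": {"u2", "u3", "u4"},
    "sachin": {"u3"},
}


def search_and(index, *keywords):
    """URLs that contain every keyword (set intersection)."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


def search_or(index, *keywords):
    """URLs that contain at least one keyword (set union)."""
    return set().union(*(index.get(k.lower(), set()) for k in keywords))


print(sorted(search_and(INDEX, "India", "Cricket")))  # ['u2', 'u3']
print(sorted(search_or(INDEX, "Sachin", "Cricket")))  # ['u2', 'u3', 'u4']
```

A keyword missing from the index contributes an empty set, so AND with an unknown word returns nothing while OR simply ignores it.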



Thinking about the future

How fast is the web growing?

How does Moore's Law help us?

Compute time

RAM space

Disk space

How can we make our algorithms and data structures more clever?

What new features will our customers want?

Targeted advertising

Site-specific search

