Mini Google
8/4/2019 Mini Google
Mahender K
-
http://www.killerinfo.com/
http://www.gigablast.com/
http://www.wisenut.com/
http://www.metacrawler.com/index.html
http://www.teoma.com/
http://www.altavista.com/
http://www.alltheweb.com/
http://www.askjeeves.com/
http://www.google.com/
-
Tools for finding information on the Web
Problem: hidden databases, e.g. Times of India (i.e., databases of keywords hosted by the web site itself. These cannot be accessed by Yahoo, Google etc.)
Search engine
A machine-constructed index (usually by keyword)
So many search engines, we need search engines to find them.
-
Search engines: key tools for ecommerce
Buyers and sellers must find each other
How do they work?
How much do they index?
Are they reliable?
How are hits ordered?
Can the order be changed?
-
Overall goal: Locate web documents containing a specified keyword.
Input: Keyword
Output: Set of links
-
Crawl the web, look at each page for the keyword. Follow each link to find more pages to search.
Problems
Non-terminating: walking in circles?
Inefficient: walk the web for every search?
Page interpretation: match HTML tags?
-
Walk the web once.
Build a database.
Problem: staleness
How often to walk the ever-changing web?
Approach
Periodic rebuilds of the database
Specialization
Accept limited staleness
-
Problem: How to ignore HTML tags?
Problem: How to capture words?
Problem: How to capture links?
Problem: How to capture images?
Idea: Use a parser (tokenizer)
-
Problem: How to ignore HTML tags?
Issue: Need to extract links
Problem: How to capture words?
Idea: Use a parser (tokenizer)
Parse1: HTML-page -> set-of-words
Parse2: HTML-page -> set-of-links
-
Problem: How to ignore HTML tags?
Issue: Need to extract links
Problem: How to capture words?
Idea #3: Use a parser (tokenizer)
Parse1: HTML-page -> set-of-words
Parse2: HTML-page -> set-of-links
Idea #4: Parse Once
Parse: HTML-page -> set-of-words & set-of-links
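A minimal sketch of the "parse once" idea, assuming Python and its standard-library html.parser in place of a hand-written tokenizer. The names PageParser and parse are hypothetical. One pass over the page fills both the set of words and the set of links, and the tags themselves never reach the word set.

```python
from html.parser import HTMLParser
import re


class PageParser(HTMLParser):
    """Collects the words and the links of a page in a single pass."""

    def __init__(self):
        super().__init__()
        self.words = set()
        self.links = set()
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            # Links are taken from the href attribute of anchor tags.
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text between tags: split into lowercase alphanumeric words.
        if self._skip_depth == 0:
            self.words.update(w.lower() for w in re.findall(r"[A-Za-z0-9]+", data))


def parse(html_page):
    """Parse: HTML-page -> (set-of-words, set-of-links)."""
    parser = PageParser()
    parser.feed(html_page)
    parser.close()
    return parser.words, parser.links


sample = ('<html><body><h1>Mini Google</h1><p>Search the '
          '<a href="http://www.google.com/">web</a></p>'
          '<script>var x = 1;</script></body></html>')
words, links = parse(sample)
print(sorted(words))  # ['google', 'mini', 'search', 'the', 'web']
print(links)          # {'http://www.google.com/'}
```

Capturing images would follow the same pattern (an extra branch for the img tag and its src attribute).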
-
1. Acquire the collection, i.e. all the documents
[Off-line process]
2. Create an inverted index
[Off-line process]
3. Match queries to documents
[On-line process, the actual retrieval]
4. Present the results to user
[On-line process: display, summarize, ...]
-
Spider
Crawls the web to find pages. Follows hyperlinks. Never stops
Indexer
Produces data structures for fast searching of all words in the pages (i.e., it updates the lexicon)
Retriever
Query interface
Database lookup to find hits
1 billion documents
1 TB RAM, many terabytes of disk
Ranking
-
Thousands of servers (WOW!)
Web site traffic grows over 20% per month
Spiders and indexes over 17 billion URLs
Supports many languages and is used in many countries
Over 283 million searches per day
Even we use it!
-
Start with an initial page P0. Find URLs on P0 and add them to a queue
When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat
Can be specialized (e.g. only look for email addresses)
Issues
Which page to look at next?
Avoid overloading a site
How deep within a site do you go (depth search)?
How frequently to visit pages?
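The queue-driven loop above can be sketched as follows. This is an illustration under stated assumptions, not the project's spider: TOY_WEB is a hypothetical in-memory link graph standing in for real HTTP fetches, and crawl, get_links and index_page are made-up names. A FIFO queue gives breadth-first order, and the seen set keeps the walk from going in circles.

```python
from collections import deque

# Hypothetical toy "web": page name -> names of the pages it links to.
TOY_WEB = {
    "P0": ["P1", "P2"],
    "P1": ["P0", "P3"],   # links back to P0: a cycle
    "P2": ["P3"],
    "P3": [],
}


def crawl(start, get_links, index_page, max_pages=100):
    """Breadth-first crawl from start; returns the pages in visit order."""
    queue = deque([start])
    seen = {start}            # pages already queued, so cycles terminate
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        for url in get_links(page):
            if url not in seen:
                seen.add(url)
                queue.append(url)
        index_page(page)      # done with the page: hand it to the indexer
        visited.append(page)
    return visited


indexed = []
order = crawl("P0", lambda p: TOY_WEB.get(p, []), indexed.append)
print(order)  # ['P0', 'P1', 'P2', 'P3']
```

Replacing the deque with a priority queue changes only the answer to "which page to look at next"; the rest of the loop stays the same. The max_pages cap is one simple way to bound a crawl that would otherwise never stop.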
-
Arrangement of data (data structure) to permit fast searching
Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
Sorting helps. Why?
Permits binary search. About log2(n) probes into the list
log2(1 billion) ~ 30
Permits interpolation search. About log2(log2(n)) probes on average, when the keys are evenly distributed
log2(log2(1 billion)) ~ 5
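A small sketch, assuming Python, that counts probes for both searches and checks the two estimates. The function names are hypothetical. The interpolation search shown here works on sorted integers and is at its best when the keys are evenly spaced, which is the favourable case the estimate refers to.

```python
import math


def binary_search_probes(sorted_list, target):
    """Return (index or None, number of probes) for a binary search."""
    lo, hi, probes = 0, len(sorted_list) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_list[mid] == target:
            return mid, probes
        if sorted_list[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, probes


def interpolation_search_probes(sorted_ints, target):
    """Same result shape, but each probe position is guessed from the key's value."""
    lo, hi, probes = 0, len(sorted_ints) - 1, 0
    while lo <= hi and sorted_ints[lo] <= target <= sorted_ints[hi]:
        probes += 1
        span = sorted_ints[hi] - sorted_ints[lo]
        pos = lo if span == 0 else lo + (target - sorted_ints[lo]) * (hi - lo) // span
        if sorted_ints[pos] == target:
            return pos, probes
        if sorted_ints[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return None, probes


animals = "ant cat dog eel fox hen hog pig sow yak".split()
print(binary_search_probes(animals, "hog"))         # (6, 4)

evens = list(range(0, 1_000_000, 2))                # evenly spaced keys
print(interpolation_search_probes(evens, 123456))   # (61728, 1)

print(round(math.log2(10**9), 1))                   # 29.9
print(round(math.log2(math.log2(10**9)), 1))        # 4.9
```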
-
A file is a list of words by position
- First entry is the word in position 1 (first word)
- Entry 4562 is the word in position 4562 (4562nd word)
- Last entry is the last word
An inverted file is a list of positions by word!
FILE (POS is the position of the first word on each line of the text above)
POS: 1, 10, 20, 30, 36
INVERTED FILE
a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (43)
word (14, 19, 24, 29, 35, 45)
words (7)
4562 (21, 27)
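A minimal sketch, assuming Python, of turning a file into an inverted file. It uses the five lines of this slide as the input text. The name invert is hypothetical, and the tokenization rule (runs of letters and digits, lowercased) is an assumption chosen so that "4562nd" counts as one word.

```python
import re
from collections import defaultdict

TEXT = """A file is a list of words by position
- First entry is the word in position 1 (first word)
- Entry 4562 is the word in position 4562 (4562nd word)
- Last entry is the last word
An inverted file is a list of positions by word!"""


def invert(text):
    """Map each lowercased word to the 1-based positions where it occurs."""
    inverted = defaultdict(list)
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    for position, word in enumerate(tokens, start=1):
        inverted[word.lower()].append(position)
    return dict(inverted)


inverted_file = invert(TEXT)
for word in ("a", "entry", "file", "word", "4562"):
    print(word, inverted_file[word])
# a [1, 4, 40]
# entry [11, 20, 31]
# file [2, 38]
# word [14, 19, 24, 29, 35, 45]
# 4562 [21, 27]
```

Unlike the listing on the slide, this sketch also indexes common words such as "is" and "the"; a real indexer might drop them.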
-
LEXICON
WORD        NDOCS  PTR
jezebel     20
jezer       3
jezerit     1
jeziah      1
jeziel      1
jezliah     1
jezoar      1
jezrahliah  1
jezreel     39

WORD INDEX
DOCID  OCCUR  POS 1  POS 2  . . .
34     6      1  118  2087  3922  3981  5002
44     3      215  2291  3010
56     4      5  22  134  992
. . .
566    3      203  245  287
67     1      132
. . .
107    4      322  354  381  405
232    6      15  195  248  1897  1951  2192
677    1      481
713    3      42  312  802

jezebel occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .
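The lexicon and word index can be sketched as a nested mapping. This is an illustration assuming Python; WORD_INDEX and lexicon_entry are hypothetical names. The data is limited to the three jezebel documents spelled out on the slide, so the derived NDOCS is 3 here rather than the 20 shown in the lexicon.

```python
# Postings for "jezebel" as listed on the slide: docid -> word positions.
# Only the three documents shown there are included (an assumption of this sketch).
WORD_INDEX = {
    "jezebel": {
        34: [1, 118, 2087, 3922, 3981, 5002],
        44: [215, 2291, 3010],
        56: [5, 22, 134, 992],
    },
}


def lexicon_entry(word_index, word):
    """Return (NDOCS, {docid: OCCUR}) for a word, derived from its postings."""
    postings = word_index.get(word, {})
    occurrences = {doc: len(positions) for doc, positions in postings.items()}
    return len(postings), occurrences


ndocs, occurrences = lexicon_entry(WORD_INDEX, "jezebel")
print(ndocs, occurrences)  # 3 {34: 6, 44: 3, 56: 4}
```

Storing positions rather than bare counts costs space, but it lets the retriever answer "how close to the start" and phrase-style questions without re-reading the document.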
-
Hits must be presented in some order
What order?
Relevance, popularity, reliability?
Some ranking methods
Presence of keywords in title of document
Closeness of keywords to start of document
Frequency of keyword in document
Link popularity (how many pages point to this one)
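One way the four ranking methods above could be combined into a single score, sketched in Python. Everything here is an assumption made for illustration: the Page fields, the function names, and especially the weights, which are arbitrary values and not taken from any real search engine.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    title: str
    words: list          # words of the document body, in order
    inbound_links: int   # how many pages point to this one


def score(page, keyword, weights=(3.0, 2.0, 1.0, 0.5)):
    """Combine the four signals; the weights are arbitrary illustration values."""
    w_title, w_early, w_freq, w_links = weights
    keyword = keyword.lower()
    words = [w.lower() for w in page.words]
    in_title = keyword in page.title.lower().split()
    frequency = words.count(keyword)
    # Closeness to the start: 1.0 for the first word, falling toward 0.
    closeness = 1.0 - words.index(keyword) / len(words) if frequency else 0.0
    return (w_title * in_title + w_early * closeness
            + w_freq * frequency + w_links * page.inbound_links)


def rank(pages, keyword):
    """Return the pages ordered from highest to lowest score."""
    return sorted(pages, key=lambda p: score(p, keyword), reverse=True)


a = Page("a", "Cricket news", ["cricket", "scores", "today"], 2)
b = Page("b", "Weather", ["rain", "stops", "cricket", "match"], 1)
print([p.url for p in rank([b, a], "cricket")])  # ['a', 'b']
print(score(a, "cricket"), score(b, "cricket"))  # 7.0 2.5
```

Changing the weights changes the order of the hits, which is one answer to "can the order be changed?".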
-
Search engine for any website
Not for the entire web
Results can be confined to only one web site
-
[Diagram: keywords, each pointing to a list of URL :: number entries]
Keywords: India, ManMohan, Cricket, Bollywo, Sharukh, Sachin, ...
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2004091500081100.htm&date=2004/09/15/&prd=bl :: 4
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2002102700140200.htm&date=2002/10/27/&prd=mag :: 7
..
http://www.hindu.com/2004/10/09/stories/2004100904051900.htm :: 23
http://www.hindu.com/2004/10/09/stories/2004100910970300.htm :: 3
..
http://www.hinduonnet.com/thehindu/gallery/0166/016606.htm :: 2
http://www.hinduonnet.com/thehindu/gallery/0048/004807.htm :: 1
..
-
Search Engine
-
Index
Crawl
Search
-
[Architecture diagram]
Components: TheWeb, Spider, Parser, URLList, Indexer, Index, Query, ResultSet, FinalResult, ResultPage
Operations: crawl, parse, getNextUrl, addUrls, addPage, store, retrieve, Sort by Rank, makePage
-
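[Architecture diagram, annotated with data structures and algorithms]
Components: TheWeb, Spider, Parser, URLList, Indexer, Index, Query, ResultSet, FinalResult, ResultPage
Operations: crawl, parse, getNextUrl, addUrls, addPage, store, retrieve, Sort by Rank, makePage
Data structures and algorithms: Queue, PriorityQueue, Hashtable, BinaryTree, LinkedList, MergeSort & InsertionSort, AVLTree, Finite State Machines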
-
[Class diagram]
Classes: PageElement, PageImg, PageHref, PageWord, PageLexer, HttpTokenizer, URLTextReader, Spider, WebSpider, Queue, CrawlerDriver, Indexer, Index, Query, SearchDriver, DictionaryInterface, TreeDictionary, ListDictionary, HashDictionary, DictionaryDriver
Labels: addPage, Index, Save, Restore, Crawl, Parse
Legend: Inheritance, Uses, Calls
-
Week 3
Tokenizer (using FSM)
Crawling - Rules
Breadth First Spider
Priority Based Spider
Indexing
Keywords with their occurrence frequencies and URLs
Persistence
Saving the Index to the Disk
Simple Search
Sorting based on Rank
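A tokenizer built as a finite state machine can be sketched with three states. This is an assumption-laden illustration in Python, not the project's tokenizer; the name tokenize and the state names are made up. It drops everything inside angle brackets and emits the words in between.

```python
def tokenize(html):
    """Return the words of html that lie outside tags, using a 3-state machine."""
    # TEXT: between words, WORD: inside a word, TAG: inside <...>
    state = "TEXT"
    current = []
    words = []
    for ch in html:
        if state == "TAG":
            if ch == ">":
                state = "TEXT"
        elif ch == "<":
            if state == "WORD":          # a tag ends the current word
                words.append("".join(current))
                current = []
            state = "TAG"
        elif ch.isalnum():
            current.append(ch)
            state = "WORD"
        else:                            # whitespace or punctuation ends a word
            if state == "WORD":
                words.append("".join(current))
                current = []
            state = "TEXT"
    if state == "WORD":                  # flush a word that runs to the end of input
        words.append("".join(current))
    return words


print(tokenize("<p>Mini <b>Google</b> rocks!</p>"))  # ['Mini', 'Google', 'rocks']
```

The sketch does not handle comments, character entities, or a ">" inside an attribute value; each of those would need further states.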
-
Week 4
Set Data Structures
Allowing Boolean Search (AND, OR)
Client and Server Architecture
Client developed using Swing
Multi-Threaded Server
Performance Analysis
Final Demo
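Boolean search over set data structures can be sketched as set intersection (AND) and set union (OR). The example assumes Python; INDEX, its keywords and the URL names u1 to u4 are hypothetical placeholders, not data from the project.

```python
# Hypothetical index: keyword -> set of URLs containing it.
INDEX = {
    "cricket": {"u1", "u2", "u3"},
    "india": {"u2", "u3", "u4"},
    "sachin": {"u3"},
}


def search_and(index, *keywords):
    """URLs that contain every keyword (set intersection)."""
    sets = [index.get(k, set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


def search_or(index, *keywords):
    """URLs that contain at least one keyword (set union)."""
    return set().union(*(index.get(k, set()) for k in keywords))


print(sorted(search_and(INDEX, "cricket", "india")))  # ['u2', 'u3']
print(sorted(search_or(INDEX, "sachin", "india")))    # ['u2', 'u3', 'u4']
```

An unknown keyword contributes an empty set, so it empties an AND query and leaves an OR query unchanged.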
-
Thinking about the future
How fast is the web growing?
How does Moore's Law help us?
Compute time
RAM space
Disk space
How can we make our algorithms and data structures more clever?
What new features will our customers want?
Targeted advertising
Site-specific search