Mini Google


  • 8/4/2019 Mini Google

    1/34

    Mahender K


    http://www.killerinfo.com/
    http://www.gigablast.com/
    http://www.wisenut.com/
    http://www.metacrawler.com/index.html
    http://www.teoma.com/
    http://www.altavista.com/
    http://www.alltheweb.com/
    http://www.askjeeves.com/
    http://www.google.com/

    Tools for finding information on the Web

    Problem: hidden databases, e.g. Times of India (i.e., databases of keywords hosted by the web site itself; these cannot be accessed by Yahoo, Google, etc.)

    Search engine

    A machine-constructed index (usually by keyword)

    There are so many search engines that we need search engines to find them.

    Search engines: key tools for ecommerce

    Buyers and sellers must find each other

    How do they work?

    How much do they index?

    Are they reliable?

    How are hits ordered?

    Can the order be changed?

    Overall goal: Locate web documents containing a specified keyword.

    Input: Keyword

    Output: Set of links

    Crawl the web, look at each page for the keyword. Follow each link to find more pages to search.

    Problems

    - Non-terminating: walking in circles?

    - Inefficient: walk the web for every search?

    - Page interpretation: match HTML tags?

    Walk the web once.

    Build a database.

    Problem: staleness. How often to walk the ever-changing web?

    Approach

    Periodic rebuilds of the database

    Specialization

    Accept limited staleness

    Problem: How to ignore HTML tags?

    Problem: How to capture words?

    Problem: How to capture links?

    Problem: How to capture images?

    Idea: Use a parser (tokenizer)

    Problem: How to ignore HTML tags?

    Issue: Need to extract links

    Problem: How to capture words?

    Idea #3: Use a parser (tokenizer)

    Parse1: HTML-page -> set-of-words

    Parse2: HTML-page -> set-of-links

    Idea #4: Parse once

    Parse: HTML-page -> set-of-words & set-of-links
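
    The "parse once" idea can be sketched with Python's standard html.parser module. This is only an illustration under stated assumptions, not the course's implementation: the names PageParser and parse are hypothetical, and a word is assumed to be any run of letters or digits outside script and style elements.

```python
import re
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects the words and the links of one HTML page in a single pass."""

    def __init__(self):
        super().__init__()
        self.words = set()
        self.links = set()
        self._skip = 0  # depth inside <script>/<style>, whose text is not page content

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            # Tags themselves are ignored; only the href value is kept.
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.update(w.lower() for w in re.findall(r"[A-Za-z0-9]+", data))


def parse(html):
    """HTML-page -> (set-of-words, set-of-links)."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser.words, parser.links


page = ('<html><head><title>Mini Google</title></head><body>'
        '<p>Search the <b>web</b></p>'
        '<a href="http://www.google.com/">Google</a></body></html>')
words, links = parse(page)
print(sorted(words), sorted(links))
```

    Returning both sets from one call means each page is read only once, which is the point of Idea #4.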

    1. Acquire the collection, i.e. all the documents

    [Off-line process]

    2. Create an inverted index

    [Off-line process]

    3. Match queries to documents

    [On-line process, the actual retrieval]

    4. Present the results to user

    [On-line process: display, summarize, ...]

    Spider

    Crawls the web to find pages. Follows hyperlinks. Never stops

    Indexer

    Produces data structures for fast searching of all words in the pages (i.e., it updates the lexicon)

    Retriever

    Query interface

    Database lookup to find hits

    1 billion documents

    1 TB RAM, many terabytes of disk

    Ranking

    Thousands of servers (WOW!)

    Web site traffic grows over 20% per month

    Spiders and indexes over 17 billion URLs

    Supports many languages and is used in many countries

    Over 283 million searches per day

    Even we use it!

    Start with an initial page P0. Find URLs on P0 and add them to a queue

    When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat

    Can be specialized (e.g. only look for email addresses)

    Issues

    Which page to look at next?

    Avoid overloading a site

    How deep within a site do you go (depth search)?

    How frequently to visit pages?
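
    The queue-driven loop described above can be sketched in Python as a breadth-first walk. Everything named here is an assumption made for illustration: crawl, fetch_links, max_pages and the four-page TOY_WEB stand in for real HTTP fetching and parsing. The set of already-seen URLs is what stops the spider from walking in circles.

```python
from collections import deque


def crawl(start, fetch_links, max_pages=100):
    """Breadth-first walk; returns the pages in the order they were visited."""
    queue = deque([start])
    seen = {start}   # every URL ever queued, so a cycle cannot re-queue a page
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real spider would pass the page to the indexer here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical four-page web containing a cycle (P2 links back to P0).
TOY_WEB = {
    "P0": ["P1", "P2"],
    "P1": ["P3"],
    "P2": ["P0", "P3"],
    "P3": [],
}

print(crawl("P0", lambda url: TOY_WEB.get(url, [])))
```

    The max_pages cap is one simple answer to "how deep do you go"; politeness delays and revisit schedules are left out of this sketch.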

    Arrangement of data (data structure) to permit fast searching

    Which list is easier to search?

    sow fox pig eel yak hen ant cat dog hog

    ant cat dog eel fox hen hog pig sow yak

    Sorting helps. Why?

    Permits binary search: about log2(n) probes into the list

    log2(1 billion) ~ 30

    Permits interpolation search: about log2(log2(n)) probes on average, for uniformly distributed keys

    log2(log2(1 billion)) ~ 5
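
    To make the probe counts concrete, here is a small Python sketch of binary search over the sorted animal list, counting how many entries it inspects. The function name binary_search and its return shape are assumptions made for this example.

```python
import math


def binary_search(sorted_items, target):
    """Returns (index or None, number of probes into the list)."""
    lo, hi, probes = 0, len(sorted_items) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_items[mid] == target:
            return mid, probes
        if sorted_items[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return None, probes


animals = "ant cat dog eel fox hen hog pig sow yak".split()
print(binary_search(animals, "pig"))            # found after probing fox, then pig
print(math.log2(1_000_000_000))                 # roughly 29.9
print(math.log2(math.log2(1_000_000_000)))      # roughly 4.9
```

    Each probe halves the remaining range, which is why a list of a billion sorted entries needs only about 30 probes.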

    A file is a list of words by position

    - First entry is the word in position 1 (first word)

    - Entry 4562 is the word in position 4562 (4562nd word)

    - Last entry is the last word

    An inverted file is a list of positions by word!

    [Figure: the text above shown as FILE, with word positions (POS) marked at 1, 10, 20, 30, 36, next to its INVERTED FILE]

    INVERTED FILE

    a (1, 4, 40)
    entry (11, 20, 31)
    file (2, 38)
    list (5, 41)
    position (9, 16, 26)
    positions (44)
    word (14, 19, 24, 29, 35, 45)
    words (7)
    4562 (21, 27)
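
    Building an inverted file from a file takes one pass over the words. The Python sketch below is illustrative only: the function name invert is hypothetical, words are assumed to be whitespace-separated, and positions are counted from 1 as on the slide.

```python
from collections import defaultdict


def invert(text):
    """Maps each word to the list of 1-based positions where it occurs."""
    inverted = defaultdict(list)
    for position, raw in enumerate(text.lower().split(), start=1):
        word = raw.strip("()!.,-")   # drop punctuation attached to the word
        if word:
            inverted[word].append(position)
    return dict(inverted)


inverted = invert("A file is a list of words by position")
for word in sorted(inverted):
    print(word, inverted[word])
```

    For this first sentence the result agrees with the slide: "a" at positions 1 and 4, "file" at 2, "list" at 5, "words" at 7 and "position" at 9.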

    LEXICON

    WORD        NDOCS  PTR
    jezebel     20
    jezer       3
    jezerit     1
    jeziah      1
    jeziel      1
    jezliah     1
    jezoar      1
    jezrahliah  1
    jezreel     39

    WORD INDEX

    DOCID  OCCUR  POS 1  POS 2  . . .
    107    4      322  354  381  405
    232    6      15  195  248  1897  1951  2192
    677    1      481
    713    3      42  312  802
    34     6      1  118  2087  3922  3981  5002
    44     3      215  2291  3010
    56     4      5  22  134  992
    566    3      203  245  287
    67     1      132
    . . .

    jezebel occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .
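
    The two-level layout (a lexicon entry per word, pointing to rows of DOCID, OCCUR and positions) can be sketched in Python with nested dictionaries. This is a minimal in-memory illustration, not the on-disk format behind the slide; build_index, lexicon_entry and the three short documents are made up for the example.

```python
from collections import defaultdict


def build_index(docs):
    """docs: {doc_id: text}. Returns {word: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def lexicon_entry(index, word):
    """Returns (NDOCS, rows), each row being (DOCID, OCCUR, positions)."""
    postings = index.get(word, {})
    rows = [(doc_id, len(pos), pos) for doc_id, pos in sorted(postings.items())]
    return len(rows), rows


# Hypothetical documents; the ids only echo the style of the slide.
docs = {
    34: "jezebel spoke and jezebel wrote",
    44: "ahab met jezebel",
    56: "naboth of jezreel",
}
index = build_index(docs)
print(lexicon_entry(index, "jezebel"))
```

    Here NDOCS is simply the number of rows, and OCCUR is the length of each position list, so neither needs to be stored separately in memory.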

    Hits must be presented in some order

    What order?

    Relevance, popularity, reliability?

    Some ranking methods

    Presence of keywords in title of document

    Closeness of keywords to start of document

    Frequency of keyword in document

    Link popularity (how many pages point to this one)
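
    One way to combine the four ranking signals listed above is a weighted sum. The Python sketch below is only an assumption about how such a score might look: the weights, the document fields (title, body, inlinks) and the function name score are all hypothetical, and a real engine would normalize link popularity rather than use the raw count.

```python
def score(doc, keyword, weights=(3.0, 2.0, 1.0, 0.5)):
    """Scores one document for one keyword; 0.0 if the body lacks the keyword."""
    w_title, w_early, w_freq, w_links = weights
    words = doc["body"].lower().split()
    keyword = keyword.lower()
    if keyword not in words:
        return 0.0
    in_title = keyword in doc["title"].lower().split()   # presence in title
    first = words.index(keyword)
    closeness = 1.0 - first / len(words)                 # 1.0 when the keyword opens the body
    frequency = words.count(keyword) / len(words)        # share of body words that match
    return (w_title * in_title + w_early * closeness
            + w_freq * frequency + w_links * doc["inlinks"])


docs = [
    {"url": "A", "title": "Cricket news",
     "body": "cricket scores and cricket results", "inlinks": 2},
    {"url": "B", "title": "Sports",
     "body": "football and then cricket", "inlinks": 0},
]
ranked = sorted(docs, key=lambda d: score(d, "cricket"), reverse=True)
print([d["url"] for d in ranked])
```

    Changing the weights changes the order of hits, which is one sense in which the order "can be changed".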

    Search engine for any website, not for the entire web

    Results can be confined to only one web site

    [Figure: keywords, each pointing to a list of "URL :: number" entries]

    Keywords: India, ManMohan, Cricket, Bollywo, Sharukh, Sachin, .

    http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2004091500081100.htm&date=2004/09/15/&prd=bl :: 4
    http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2002102700140200.htm&date=2002/10/27/&prd=mag :: 7
    ..
    http://www.hindu.com/2004/10/09/stories/2004100904051900.htm :: 23
    http://www.hindu.com/2004/10/09/stories/2004100910970300.htm :: 3
    ..
    http://www.hinduonnet.com/thehindu/gallery/0166/016606.htm :: 2
    http://www.hinduonnet.com/thehindu/gallery/0048/004807.htm :: 1
    ..
    Search Engine

    Index

    Crawl

    Search

    [Architecture diagram]

    Components: TheWeb, Spider, Parser, URLList, Indexer, Index, Query, ResultSet, FinalResult, ResultPage

    Operations: crawl, parse, getNextUrl, addUrls, addPage, store, retrieve, Sort by Rank, makePage

    Data structures: Queue, PriorityQueue, Hashtable, BinaryTree, LinkedList, MergeSort & InsertionSort, AVLTree, Finite State Machines

    [Class diagram]

    Classes: PageElement, PageImg, PageHref, PageWord, PageLexer, HttpTokenizer, URLTextReader, Spider, WebSpider, Queue, CrawlerDriver, SearchDriver, DictionaryDriver, Indexer, Index, Query, DictionaryInterface, ListDictionary, TreeDictionary, HashDictionary

    Operations: addPage, Index, Save, Restore, Crawl, Parse

    Relationships shown: Inheritance, Uses, Calls

    Week 3

    Tokenizer (using FSM)

    Crawling - Rules

    Breadth First Spider

    Priority Based Spider

    Indexing

    Keywords with their frequency of occurrence and their URLs

    Persistence

    Saving the Index to the Disk

    Simple Search

    Sorting based on Rank
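
    The first Week 3 item, a tokenizer built as a finite state machine, might look like the two-state Python sketch below. It is a deliberately small illustration and not the project's PageLexer: the state names and the function tokenize are assumptions, and comments, scripts and a ">" inside an attribute value are not handled.

```python
def tokenize(html):
    """Two-state machine: TEXT collects words, TAG skips everything up to '>'."""
    TEXT, TAG = "text", "tag"
    state, current, words = TEXT, [], []

    def flush():
        # Emit the word collected so far, if any.
        if current:
            words.append("".join(current).lower())
            current.clear()

    for ch in html:
        if state == TEXT:
            if ch == "<":
                flush()
                state = TAG          # a tag starts: stop collecting text
            elif ch.isalnum():
                current.append(ch)   # still inside a word
            else:
                flush()              # whitespace or punctuation ends the word
        elif ch == ">":              # state is TAG: ignore characters until the tag closes
            state = TEXT
    flush()
    return words


print(tokenize("<p>Walk the <b>web</b> once.</p>"))
```

    Each character causes at most one state change, so the tokenizer runs in a single left-to-right pass over the page.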

    Week 4

    Set Data Structures

    Allowing Boolean Search (AND, OR)

    Client and Server Architecture

    Client developed using Swing

    Multi-Threaded Server

    Performance Analysis

    Final Demo
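
    The link between set data structures and Boolean search in the Week 4 list can be sketched in Python: if the index maps each word to the set of pages containing it, AND becomes set intersection and OR becomes set union. The function name search, the query syntax (a single operator type per query) and the page numbers are assumptions made for this example.

```python
def search(index, query):
    """query: words joined by one operator type, e.g. 'india AND cricket'."""
    tokens = query.split()
    if len(tokens) % 2 == 0:   # empty query, or a dangling operator
        raise ValueError("expected: word [AND|OR word]...")
    terms = [t.lower() for t in tokens[0::2]]
    operators = {t.upper() for t in tokens[1::2]}
    if len(operators) > 1 or not operators <= {"AND", "OR"}:
        raise ValueError("use only AND or only OR in one query")
    postings = [index.get(term, set()) for term in terms]
    if operators == {"AND"}:
        return set.intersection(*postings)   # pages containing every word
    return set.union(*postings)              # pages containing any word


# Hypothetical index: word -> set of page ids.
index = {"india": {1, 2, 3}, "cricket": {2, 3, 4}, "sachin": {3}}
print(search(index, "india AND cricket"))
print(search(index, "india OR sachin"))
```

    Mixing AND and OR in one query would need operator precedence or parentheses, which this sketch leaves out.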

    Thinking about the future

    How fast is the web growing?

    How does Moore's Law help us?

    Compute time, RAM space, disk space

    How can we make our algorithms and data structures more clever?

    What new features will our customers want?

    Targeted advertising

    Site-specific search
