Information Organization - How Search Engines Work
7/30/2019 Information Organization - How Search Engines Work
SIMS 202
Information Organization and Retrieval
Prof. Marti Hearst and Prof. Ray Larson
UC Berkeley SIMS
Tues/Thurs 9:30-11:00am
Fall 2000
-
Last Time
Web Search
  Directories vs. search engines
  How web search differs from other search
    Type of data searched over
    Type of searches done
    Type of searchers doing search
  Web queries are short
    This probably means people are often using search engines to find starting points
    Once at a useful site, they must follow links or use site search
  Web search ranking combines many features
-
What about Ranking?
  Lots of variation here
    Pretty messy in many cases
    Details usually proprietary and fluctuating
  Combining subsets of:
    Term frequencies
    Term proximities
    Term position (title, top of page, etc.)
    Term characteristics (boldface, capitalized, etc.)
    Link analysis information
    Category information
    Popularity information
  Most use a variant of vector space ranking to combine these
  Here's how it might work:
    Make a vector of weights for each feature
    Multiply this by the counts for each feature
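
The "weights times counts" idea above can be sketched as a dot product. This is only an illustration, not any engine's actual formula: the feature names, the weight values, and the two example pages are all made up here.

```python
# Hypothetical feature weights; real engines keep theirs proprietary.
WEIGHTS = {
    "term_frequency": 1.0,
    "title_match": 3.0,
    "boldface": 0.5,
    "inlinks": 2.0,
}


def score(feature_counts):
    """Dot product of the weight vector and one page's feature counts."""
    return sum(WEIGHTS[name] * feature_counts.get(name, 0) for name in WEIGHTS)


def rank(pages):
    """Return page ids sorted by descending score."""
    return sorted(pages, key=lambda page_id: score(pages[page_id]), reverse=True)


# Made-up feature counts for two pages.
pages = {
    "a.html": {"term_frequency": 4, "title_match": 0, "boldface": 2, "inlinks": 1},
    "b.html": {"term_frequency": 1, "title_match": 1, "boldface": 0, "inlinks": 3},
}
ranking = rank(pages)
```

With these made-up numbers, the page with fewer term occurrences still ranks first because title and link features carry more weight.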
-
From description of the NorthernLight search engine, by Mark Krellenstein
http://www.infonortics.com/searchengines/sh00/krellenstein_files/frame.htm
-
High-Precision Ranking
  Proximity search can help get high-precision results if the query has more than one term
  Hearst 96 paper:
    Combine Boolean and passage-level proximity
    Shows significant improvements when retrieving the top 5, 10, 20, 30 documents
    Results reproduced by Mitra et al. 98
  Google uses something similar
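
One way to picture combining a Boolean AND with a proximity constraint is to require that all query terms occur within a short token window. The sketch below is an assumption-laden illustration, not the formulation from the Hearst 96 paper; the function names and the `max_span` parameter are hypothetical.

```python
def min_window(tokens, terms):
    """Length of the shortest token span containing every term, or None."""
    terms = set(terms)
    last_seen = {}  # term -> most recent position
    best = None
    for i, tok in enumerate(tokens):
        if tok in terms:
            last_seen[tok] = i
            if len(last_seen) == len(terms):
                # Shortest window ending at i starts at the oldest last-seen term.
                span = i - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


def matches(tokens, terms, max_span):
    """Boolean AND over the terms, plus a proximity limit on the window."""
    span = min_window(tokens, terms)
    return span is not None and span <= max_span


doc = "cheap cars and trucks for sale cheap used cars".split()
```

Here `matches(doc, ["used", "cars"], 3)` succeeds because the two terms are adjacent near the end, while `matches(doc, ["trucks", "used"], 3)` fails even though both terms are present.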
-
Boolean Formulations, Hearst 96
Results
-
Spam
  Email Spam:
    Undesired content
  Web Spam:
    Content is disguised as something it is not, in order to
      Be retrieved more often than it otherwise would
      Be retrieved in contexts that it otherwise would not be retrieved in
-
Web Spam
  What are the types of Web spam?
    Add extra terms to get a higher ranking
      Repeat "cars" thousands of times
    Add irrelevant terms to get more hits
      Put a dictionary in the comments field
      Put extra terms in the same color as the background of the web page
    Add irrelevant terms to get different types of hits
      Put "sex" in the title field in sites that are selling cars
    Add irrelevant links to boost your link analysis ranking
  There is a constant arms race between web search companies and spammers
-
Commercial Issues
  General internet search is often commercially driven
    Commercial sector sometimes hides things; harder to track than research
    On the other hand, most CTOs for search engine companies used to be researchers, and so help us out
    Commercial search engine information changes monthly
    Sometimes motivations are commercial rather than technical
      Goto.com uses payments to determine ranking order
      iwon.com gives out prizes
-
Web Search Architecture
-
Web Search Architecture
  Preprocessing
    Collection gathering phase
      Web crawling
    Collection indexing phase
  Online
    Query servers
    This part not talked about in the readings
-
From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
-
Standard Web Search Engine Architecture
[Diagram labels: crawl the web; check for duplicates, store the documents; create an inverted index; inverted index; search engine servers; user query; DocIds; show results to user]
-
More detailed architecture, from Brin & Page 98.
Only covers the preprocessing in detail, not the query serving.
-
Inverted Indexes for Web Search Engines
  Inverted indexes are still used, even though the web is so huge
  Some systems partition the indexes across different machines; each machine handles different parts of the data
  Other systems duplicate the data across many machines; queries are distributed among the machines
  Most do a combination of these
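
As a rough sketch of the partitioning option, the code below splits a tiny collection into shards, builds one inverted index per shard, and sends each query to every shard before merging the hits. The documents, the round-robin assignment, and the function names are assumptions for illustration; the slides do not say how any particular engine assigns pages to machines.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def partition(docs, n_shards):
    """Assign documents to shards round-robin; index each shard separately."""
    shards = [{} for _ in range(n_shards)]
    for i, doc_id in enumerate(sorted(docs)):
        shards[i % n_shards][doc_id] = docs[doc_id]
    return [build_index(shard) for shard in shards]


def search(shard_indexes, query):
    """Send the query to every shard and merge the per-shard AND results."""
    terms = query.lower().split()
    hits = set()
    for index in shard_indexes:
        postings = [index.get(t, set()) for t in terms]
        if postings:
            hits |= set.intersection(*postings)
    return sorted(hits)


# Made-up documents.
docs = {
    "d1": "web search engines",
    "d2": "web crawlers follow links",
    "d3": "search engines rank pages",
    "d4": "inverted index for web search",
}
shard_indexes = partition(docs, n_shards=2)
results = search(shard_indexes, "web search")
```

The duplication option would keep several copies of each shard index and spread incoming queries among the copies; that part is not shown here.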
-
From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
Each row can handle 120 queries per second
Each column can handle 7M pages
To handle more queries, add another row.
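
The row and column figures above suggest a simple sizing calculation. The sketch below assumes one machine per grid cell, and the collection size and query load passed in are hypothetical inputs, not numbers from the FAST description.

```python
import math

QUERIES_PER_ROW = 120          # queries per second per row (from the slide)
PAGES_PER_COLUMN = 7_000_000   # pages per column (from the slide)


def grid_size(total_pages, peak_qps):
    """Rows and columns needed for a given collection size and query load."""
    columns = math.ceil(total_pages / PAGES_PER_COLUMN)
    rows = math.ceil(peak_qps / QUERIES_PER_ROW)
    return rows, columns


# Hypothetical workload: 300M pages, 1,000 queries per second at peak.
rows, columns = grid_size(total_pages=300_000_000, peak_qps=1_000)
machines = rows * columns  # assumes one machine per (row, column) cell
```

Growing the collection adds columns; growing the query load adds rows.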
-
Cascading Allocation of CPUs
  A variation on this that produces a cost savings:
    Put high-quality/common pages on many machines
    Put lower-quality/less common pages on fewer machines
    Query goes to high-quality machines first
    If no hits found there, go to other machines
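
The query path of the cascade can be sketched as trying tiers in order and stopping at the first one that returns hits. The two tiny term-to-pages indexes and the tier names below are made up for illustration.

```python
def cascade_search(tiers, query_terms):
    """Query tiers in order; return hits from the first tier that has any."""
    for tier_name, index in tiers:
        postings = [index.get(term, set()) for term in query_terms]
        hits = set.intersection(*postings) if postings else set()
        if hits:
            return tier_name, sorted(hits)
    return None, []


# Hypothetical term -> page-id indexes for the two tiers.
high_quality = {"berkeley": {"p1"}, "sims": {"p1", "p2"}}
low_quality = {"berkeley": {"p9"}, "cha-cha": {"p7", "p9"}}
tiers = [("high", high_quality), ("low", low_quality)]
```

A query such as `["sims", "berkeley"]` is answered by the high-quality tier alone, so the less replicated machines are only touched for queries like `["cha-cha"]` that find nothing in the first tier.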
-
Web Crawlers
  How do the web search engines get all of the items they index?
  Main idea:
    Start with known sites
    Record information for these sites
    Follow the links from each site
    Record information found at new sites
    Repeat
-
Web Crawlers
  How do the web search engines get all of the items they index?
  More precisely:
    Put a set of known sites on a queue
    Repeat the following until the queue is empty:
      Take the first page off of the queue
      If this page has not yet been processed:
        Record the information found on this page
          Positions of words, links going out, etc.
        Add each link on the current page to the queue
        Record that this page has been processed
  In what order should the links be followed?
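
The queue-based procedure above can be sketched as follows. To keep the example self-contained, a small hypothetical link graph (`LINKS`) stands in for fetching and parsing real pages, and a `breadth_first` flag (also an assumption of this sketch) shows how the choice of which end of the queue to take from changes the visit order.

```python
from collections import deque

# Hypothetical link graph standing in for fetching and parsing real pages.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}


def crawl(seeds, links, breadth_first=True):
    """Visit pages reachable from the seeds; return them in visit order."""
    queue = deque(seeds)
    processed = set()
    order = []
    while queue:
        # FIFO gives breadth-first order; LIFO gives a depth-first style order.
        page = queue.popleft() if breadth_first else queue.pop()
        if page in processed:
            continue
        order.append(page)                 # "record the information found"
        queue.extend(links.get(page, []))  # add each outgoing link
        processed.add(page)
    return order
```

Taking pages from the front of the queue visits everything one link away from the seeds before anything two links away; taking from the back follows the most recently discovered link first.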
-
Page Visit Order
Animated examples of breadth-first vs depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
Structure to be traversed
-
Page Visit Order
Breadth-first search (must be in presentation mode to see this animation)
-
Page Visit Order
Depth-first search (must be in presentation mode to see this animation)
-
Depth-First Crawling (more complex graphs & sites)
[Figure: link graph of 13 pages spread across Site 1, Site 2, Site 3, Site 5, and Site 6]
Visit order:
Site  Page
1     1
1     2
1     4
1     6
1     3
1     5
3     1
5     1
6     1
5     2
2     1
2     2
2     3
-
Breadth-First Crawling (more complex graphs & sites)
[Figure: link graph of 13 pages spread across Site 1, Site 2, Site 3, Site 5, and Site 6]
Visit order:
Site  Page
1     1
2     1
1     2
1     6
1     3
2     2
2     3
1     4
3     1
1     5
5     1
5     2
6     1
-
Web Crawling Issues
  Keep out signs
    A file called robots.txt tells the crawler which directories are off limits
  Freshness
    Figure out which pages change often
    Recrawl these often
  Duplicates, virtual hosts, etc.
    Convert page contents with a hash function
    Compare new pages to the hash table
  Lots of problems
    Server unavailable
    Incorrect HTML
    Missing links
    Infinite loops
  Web crawling is difficult to do robustly!
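
Two of the issues above, keep-out rules and duplicate detection, can be sketched with Python's standard library. The robots rules, the user-agent name, and the example URLs are hypothetical; a real crawler would fetch each site's own file, and exact hashing as shown only catches byte-identical pages, not near-duplicates.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real crawler would fetch the file.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]

robots = RobotFileParser()
robots.parse(ROBOTS_LINES)


def allowed(url, agent="example-crawler"):
    """True if the parsed robots rules permit fetching this URL."""
    return robots.can_fetch(agent, url)


seen_hashes = set()


def is_duplicate(content):
    """Hash the page contents and compare against hashes seen so far."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```

A crawler would call `allowed(url)` before fetching and `is_duplicate(content)` after fetching, skipping storage and indexing when the content has already been seen under another URL or virtual host.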
-
Cha-Cha
  Cha-Cha searches an intranet
    Sites associated with an organization
  Instead of hand-edited categories
    Computes shortest path from the root for each hit
    Organizes search results according to which subdomain the pages are found in
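
The slides do not say how Cha-Cha computes its shortest paths. One standard way to get a shortest link path from a root page to every reachable page is breadth-first search, sketched below over a hypothetical link graph with made-up page names.

```python
from collections import deque


def shortest_paths(root, links):
    """Breadth-first search from the root; return one shortest path per page."""
    paths = {root: [root]}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in paths:  # first time reached = fewest links
                paths[target] = paths[page] + [target]
                queue.append(target)
    return paths


# Hypothetical intranet link graph.
links = {
    "www": ["sims", "lib"],
    "sims": ["courses"],
    "lib": ["courses", "hours"],
    "courses": [],
}
paths = shortest_paths("www", links)
```

Each hit can then be displayed under its path from the root, which groups results by where they sit in the organization's site structure.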
-
Cha-Cha Web Crawling Algorithm
  Start with a list of servers to crawl
    for UCB, simply start with www.berkeley.edu
  Restrict crawl to certain domain(s)
    *.berkeley.edu
  Obey No Robots standard
  Follow hyperlinks only
    do not read local filesystems
    links are placed on a queue
    traversal is breadth-first
  See first lecture or the technical papers for more information
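
The domain restriction and the "hyperlinks only" rule can be sketched as a filter applied to each link before it is queued. This is an illustration, not Cha-Cha's code: the function name is hypothetical, and treating the bare host berkeley.edu as in scope and accepting https as well as http are assumptions.

```python
from urllib.parse import urlparse

ALLOWED_SUFFIX = ".berkeley.edu"  # from the slide's *.berkeley.edu restriction


def in_scope(url):
    """True for http(s) links whose host falls under the allowed domain."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False  # skips file: links, i.e. local filesystems
    host = (parts.hostname or "").lower()
    # Assumption: the bare domain counts as in scope too.
    return host == ALLOWED_SUFFIX.lstrip(".") or host.endswith(ALLOWED_SUFFIX)
```

Matching on the suffix with its leading dot keeps look-alike hosts such as notberkeley.edu out of the crawl.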
-
Summary
  Web search differs from traditional IR systems
    Different kind of collection
    Different kinds of users/queries
    Different economic motivations
  Ranking combines many features in a difficult-to-specify manner
    Link analysis and proximity of terms seem especially important
    This is in contrast to the term-frequency orientation of standard search
      Why?
-
Summary (cont.)
  Web search engine architecture
    Similar in many ways to standard IR
    Indexes usually duplicated across machines to handle many queries quickly
  Web crawling
    Used to create the collection
    Can be guided by quality metrics
    Is very difficult to do robustly
-
Web Search Statistics
-
Information from searchenginewatch.com
Searches per Day
Info missing for fast.com, Excite, Northernlight, etc.
-
Information from searchenginewatch.com
Web Search Engine Visits
-
Information from searchenginewatch.com
Percentage of web users who visit the site shown
-
Information from searchenginewatch.com
Search Engine Size (July 2000)
-
Information from searchenginewatch.com
Does size matter?
You can't access many hits anyhow.
-
Information from searchenginewatch.com
Increasing numbers of indexed pages, self-reported
-
Information from searchenginewatch.com
Increasing numbers of indexed pages (more recent), self-reported
-
Information from searchenginewatch.com
Web Coverage
-
From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
-
Directory sizes