Indexing and Caching In
Uploaded by s-vivek-ramjee
-
7/31/2019 Indexing and Caching In
1/15
11MX56
INDEXING AND CACHING IN
SEARCH ENGINES
-
THE PROBLEM
Users want results the moment they submit a query.
What does the engine have to handle?
A staggering volume of over 4 billion web pages!
And over a million queries per minute!
The response time must be as close to immediate as possible.
The engine must consume as few resources as possible.
-
SOME SEARCH STATISTICS TO NOTE
63.7% of the queries are unique.
Approximately 34%, or only about one third, of the search queries submitted are repeated.
58% of users view only the first page of the search results (an average across the popular search giants Google, Yahoo and Ask).
No more than 12% of users browse through more than 3 pages.
-
THE SOLUTION
[Diagram: two query flows compared.
With a cache: QUERY -> CACHE HIT? -> Yes: RESULT; No: Web QUERY SERVER -> RESULT.
Without a cache: QUERY -> Web QUERY SERVER -> RESULT.]
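The cached flow in the diagram above can be sketched as a simple look-aside cache. This is a minimal illustration, not the design of any particular engine: the class name CachedSearch, the slow_backend function and the query normalisation step are all assumptions made for the example.

```python
from typing import Callable, Dict, List


class CachedSearch:
    """Answers a query from the cache on a hit, else asks the backend."""

    def __init__(self, backend: Callable[[str], List[str]]) -> None:
        self.backend = backend  # hypothetical stand-in for the web query server
        self.cache: Dict[str, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def search(self, query: str) -> List[str]:
        # Normalise case and whitespace so trivially different
        # spellings of the same query share one cache entry.
        key = " ".join(query.lower().split())
        if key in self.cache:  # cache hit: skip the server entirely
            self.hits += 1
            return self.cache[key]
        self.misses += 1  # cache miss: query the server, then remember the answer
        results = self.backend(key)
        self.cache[key] = results
        return results


def slow_backend(query: str) -> List[str]:
    # Placeholder: a real engine would consult its index here.
    return [f"result for {query!r}"]


engine = CachedSearch(slow_backend)
first = engine.search("What is it")
second = engine.search("what   is it")  # served from the cache
```

The second call never reaches the backend, which is the whole saving: the more often queries repeat, the more server work is avoided.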
-
WHAT DO WE CACHE/INDEX?
[Pie chart: search queries, split into Same and Different]
36% of all queries have been retrieved before.
The stats show that a large share of people are looking for the same thing when using a search engine.
-
VARIANTS
1. Direct Cache
2. Inverted Index/List
3. Two-Level
4. N-Level
-
VARIANTS
Direct Cache
Stores the top few results of queries that are searched frequently/recently.
Can be fetched even before the query is processed and tokenized.
Inverted Index
Stores links to the pages containing the tokens of all frequently/recently searched queries.
Can only be fetched after the query is processed and tokenized.
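The difference between the two variants shows up in where each lookup happens. The sketch below assumes a tiny hypothetical on-disk index (DISK_INDEX) and made-up cache contents: the direct cache is keyed by the raw query string and checked first, while the inverted-list cache is keyed by token and can only be consulted after tokenizing.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical on-disk index; in a real engine this lookup is the slow path.
DISK_INDEX: Dict[str, Set[int]] = {
    "this": {1}, "is": {1, 2, 3}, "what": {1, 2},
    "it": {1, 2, 3}, "a": {3}, "panther": {3},
}

# Direct cache: whole query string -> ranked result pages.
direct_cache: Dict[str, List[int]] = {"what is it": [2, 1]}
# Inverted-list cache: single token -> pages containing that token.
inverted_cache: Dict[str, Set[int]] = {"what": {1, 2}, "is": {1, 2, 3}}


def tokenize(query: str) -> List[str]:
    return query.lower().replace("?", " ").split()


def answer(query: str) -> Tuple[str, List[int]]:
    """Return (where the answer came from, matching page ids)."""
    # 1. Direct cache: usable before any query processing.
    if query in direct_cache:
        return "direct", direct_cache[query]
    # 2. Inverted-list cache: usable only once the query is tokenized.
    source = "inverted"
    postings: List[Set[int]] = []
    for token in tokenize(query):
        if token not in inverted_cache:
            # Miss: fall back to the slow index and cache the list.
            inverted_cache[token] = DISK_INDEX.get(token, set())
            source = "disk"
        postings.append(inverted_cache[token])
    pages = sorted(set.intersection(*postings)) if postings else []
    return source, pages
```

A direct-cache hit is cheaper but only helps exact repeats of a query; cached token lists help any query that shares terms with earlier ones.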
-
POLICIES
LRU (Least Recently Used)
Allocate a queue that can accommodate a certain number of result pages.
When the queue is full and a new page needs to be cached, the least RECENTLY used page is removed from the cache.
LFU (Least Frequently Used)
Allocate a rank-based list that can accommodate a certain number of result pages.
When the list is full and a new page needs to be cached, the least FREQUENTLY used page is removed from the cache.
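The two eviction policies can be sketched side by side. This is an illustrative implementation only: the class names, the use of OrderedDict for the LRU queue, the plain hit counter for LFU and the tie-breaking rule are assumptions, and both classes assume a capacity of at least 1.

```python
from collections import OrderedDict
from typing import Dict, Hashable, Optional


class LRUCache:
    """Evicts the entry that was used least recently."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: "OrderedDict[Hashable, object]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[object]:
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key: Hashable, value: object) -> None:
        if key in self.items:
            self.items.move_to_end(key)
        elif len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # drop the oldest entry
        self.items[key] = value


class LFUCache:
    """Evicts the entry that was used least often."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: Dict[Hashable, object] = {}
        self.counts: Dict[Hashable, int] = {}

    def get(self, key: Hashable) -> Optional[object]:
        if key not in self.items:
            return None
        self.counts[key] += 1
        return self.items[key]

    def put(self, key: Hashable, value: object) -> None:
        if key not in self.items and len(self.items) >= self.capacity:
            # Lowest use count loses; ties go to the earliest-inserted key.
            victim = min(self.counts, key=self.counts.get)
            del self.items[victim]
            del self.counts[victim]
        self.items[key] = value
        self.counts[key] = self.counts.get(key, 0) + 1


# Same access pattern, different victims.
lru, lfu = LRUCache(2), LFUCache(2)
for cache in (lru, lfu):
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")
    cache.get("a")
    cache.get("b")
    cache.put("c", 3)  # cache is full: one of "a" / "b" must go
```

After this sequence the LRU cache has dropped "a" (touched longest ago) while the LFU cache has dropped "b" (touched fewer times), which is the practical difference between the two policies.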
-
AN ADVANCED POLICY
Probability Driven Cache
o Users search in sessions, so the next query will probably be related to the previous query.
o Google appears to use this idea, as suggested by the related searches shown at the bottom of its result pages.
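One way to picture the session idea is to count which query tends to follow which, and use the most likely follow-up as a candidate to cache ahead of time. The sketch below is only an illustration of that idea under simple assumptions (first-order transitions, made-up session data, a hypothetical NextQueryPredictor class); it is not a description of how any real engine does it.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


class NextQueryPredictor:
    """Estimates P(next query | previous query) from observed sessions."""

    def __init__(self) -> None:
        self.follows: Dict[str, Counter] = defaultdict(Counter)

    def observe_session(self, queries: List[str]) -> None:
        # Count each consecutive (previous, next) pair in the session.
        for prev, nxt in zip(queries, queries[1:]):
            self.follows[prev][nxt] += 1

    def probability(self, prev: str, nxt: str) -> float:
        counts = self.follows.get(prev)
        if not counts:
            return 0.0
        return counts[nxt] / sum(counts.values())

    def most_likely_next(self, prev: str) -> Optional[str]:
        # The best candidate to prefetch into the cache, if any.
        counts = self.follows.get(prev)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


predictor = NextQueryPredictor()
predictor.observe_session(["panther", "panther habitat"])
predictor.observe_session(["panther", "panther habitat", "panther diet"])
predictor.observe_session(["panther", "black panther"])
candidate = predictor.most_likely_next("panther")
```

Here "panther habitat" followed "panther" in two of three sessions, so it is the entry a probability-driven cache would prefer to keep or prefetch.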
-
INDEXING
These are successive steps, not just types!
1. Forward Index
2. Inverted Index
-
FORWARD INDEX
Pages
Page 1: This is what it is.
Page 2: What is it?
Page 3: It is a panther.
Forward Index
1. This, is, what, it, is
2. What, is, it
3. It, is, a, panther
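Building the forward index for the three example pages takes only a tokenizer and a mapping from page to token list. A minimal sketch, assuming lowercase alphanumeric tokens; the function names are illustrative.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Lowercase the text and keep runs of letters/digits as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_forward_index(pages: Dict[int, str]) -> Dict[int, List[str]]:
    # Forward index: page id -> the tokens of that page, in order.
    return {page_id: tokenize(text) for page_id, text in pages.items()}


PAGES = {
    1: "This is what it is.",
    2: "what is it ?",
    3: "It is a panther.",
}

forward = build_forward_index(PAGES)
```

Token order and repeats are preserved here (page 1 keeps both occurrences of "is"), which is what later allows word order to be checked.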
-
INVERTED INDEX
Forward Index
1. This, is, what, it, is
2. What, is, it
3. It, is, a, panther
Inverted Index
This - 1
Is - 1, 2, 3
What - 1, 2
It - 1, 2, 3
A - 3
Panther - 3
A search term like "what is it?" gives pages 1 and 2 as the best results, since both contain all three words.
But the words occur in the same order in only one page, page 2, so it is ranked on top.
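The inversion step and the "what is it" example can be sketched as follows. The ranking rule used here (pages containing the query words in exact order come first) is a deliberately simple assumption to mirror the example, not a real ranking function; the function names are illustrative.

```python
from typing import Dict, List, Set

# Forward index of the three example pages.
FORWARD: Dict[int, List[str]] = {
    1: ["this", "is", "what", "it", "is"],
    2: ["what", "is", "it"],
    3: ["it", "is", "a", "panther"],
}


def build_inverted_index(forward: Dict[int, List[str]]) -> Dict[str, Set[int]]:
    # Inverted index: token -> set of pages containing it (no duplicates).
    inverted: Dict[str, Set[int]] = {}
    for page_id, tokens in forward.items():
        for token in tokens:
            inverted.setdefault(token, set()).add(page_id)
    return inverted


def contains_phrase(tokens: List[str], phrase: List[str]) -> bool:
    # True if the phrase appears as consecutive tokens, in order.
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def search(query: List[str], forward: Dict[int, List[str]],
           inverted: Dict[str, Set[int]]) -> List[int]:
    if not query:
        return []
    # Candidate pages must contain every query token.
    candidates = set.intersection(*(inverted.get(t, set()) for t in query))
    # Pages with the words in the query's order sort first, then by page id.
    return sorted(candidates,
                  key=lambda p: (not contains_phrase(forward[p], query), p))


inverted = build_inverted_index(FORWARD)
ranking = search(["what", "is", "it"], FORWARD, inverted)  # [2, 1]
```

Page 3 is excluded because it lacks "what"; page 1 matches all three words but not in order, so page 2 comes out on top.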
-
TROUBLES ENCOUNTERED
The indexed documents correspond to an older version of the
web pages.
The documents matched for a cached query correspond to an
older version of the index.
A periodic refresh has to be done to tackle the above troubles!
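One simple way to keep cached results from going stale is to give each entry a maximum age and re-query the backend once that age is exceeded. The sketch below refreshes lazily on lookup rather than on a background timer; the class name, the max_age_seconds parameter and the injectable clock are assumptions made for the example.

```python
import time
from typing import Callable, Dict, List, Tuple


class RefreshingCache:
    """Caches query results and refetches entries older than max_age_seconds."""

    def __init__(self, backend: Callable[[str], List[str]],
                 max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.backend = backend          # hypothetical stand-in for the index
        self.max_age = max_age_seconds
        self.clock = clock              # injectable so the example is testable
        self.entries: Dict[str, Tuple[float, List[str]]] = {}

    def search(self, query: str) -> List[str]:
        now = self.clock()
        entry = self.entries.get(query)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]             # still fresh: serve from cache
        results = self.backend(query)   # missing or stale: refresh
        self.entries[query] = (now, results)
        return results


# Demonstration with a fake clock and a backend whose answer changes.
fake_now = [0.0]
calls: List[str] = []


def changing_backend(query: str) -> List[str]:
    calls.append(query)
    return [f"version {len(calls)}"]


cache = RefreshingCache(changing_backend, max_age_seconds=60.0,
                        clock=lambda: fake_now[0])
at_start = cache.search("panther")
fake_now[0] = 30.0
still_cached = cache.search("panther")
fake_now[0] = 61.0
refreshed = cache.search("panther")
```

The shorter the maximum age, the fresher the results but the lower the hit rate, so the refresh interval is a trade-off between staleness and server load.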
-
IMPACT
[Figure: Direct Cache versus Inverted Index]
-
References
"Performance of Inverted List Caching", CIS Department, Brooklyn University, NY, USA
"A Refreshing Perspective of Search Engine Caching", Yahoo! Research, Barcelona, Spain
Some help from Wiki as usual
THANK YOU