Advanced Search Features
Dr. Susan Gauch
Pruning Search Results
If a query term has many postings:
  It is inefficient to add all postings to the accumulator and then sort the results.
  Just reading all postings from the inverted file does not scale when a word may occur in a billion documents.
So, process only the highest-weighted postings for a given query term.
How many to use? Several thousand, so that we have a chance of adding weights from multiple query terms for a given document.
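The pruning idea above can be sketched in a few lines of Python. This is only an illustration under assumptions not stated in the slides: postings are `(doc_id, rtf)` pairs already sorted by rtf (highest first), a posting's weight is `rtf * idf`, and the names `pruned_search`, `index`, `idf`, and `p` are hypothetical.

```python
from collections import defaultdict

def pruned_search(index, idf, query_terms, p, top_k=10):
    """Score documents using only the first p postings of each query term.

    index: term -> list of (doc_id, rtf), assumed sorted by rtf, highest first
    idf:   term -> inverse document frequency
    """
    accumulator = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, [])
        for doc_id, rtf in postings[:p]:  # the low-weight tail is never read
            accumulator[doc_id] += rtf * idf[term]
    # Highest score first; ties broken by doc_id for a stable order.
    ranked = sorted(accumulator.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]

# Toy data (invented for illustration only).
index = {
    "search": [("d1", 0.5), ("d2", 0.3), ("d3", 0.1)],
    "prune": [("d2", 0.4), ("d1", 0.2)],
}
idf = {"search": 1.0, "prune": 2.0}
results = pruned_search(index, idf, ["search", "prune"], p=2)
```

With `p=2`, document `d3` is never touched, while `d1` and `d2` still collect weight from both query terms, which is the reason for choosing p large enough that postings lists overlap.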
Pruning Search Results: Implementation
Must sort all postings for a given term by weight during indexing.
  Since all postings for a given term have the same idf, sort postings by rtf during indexing.
Can also affect incremental indexing.
Keep at most P postings for any given term, sorted in order by rtf.
  If only processing p postings per term (max) at query time, only keep P = p*4 in the inverted file.
Run experiments on P:
  How many postings do you need to process to get unchanged top results?
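As a rough sketch of the indexing step described above, the following Python sorts each term's postings by rtf and keeps at most P = p*4 of them. The function name `build_pruned_postings`, the `(doc_id, rtf)` tuple layout, and the sample data are assumptions made for illustration; sorting highest rtf first is inferred from the goal of processing the highest-weighted postings.

```python
def build_pruned_postings(raw_postings, p, factor=4):
    """Sort each term's postings by rtf, highest first, and keep at most P = p * factor."""
    max_kept = p * factor
    pruned = {}
    for term, postings in raw_postings.items():
        # idf is constant within a term, so ordering by rtf is ordering by weight.
        by_weight = sorted(postings, key=lambda posting: posting[1], reverse=True)
        pruned[term] = by_weight[:max_kept]
    return pruned

# Toy data (invented for illustration only).
raw = {
    "nation": [("d1", 0.10), ("d2", 0.40), ("d3", 0.25), ("d4", 0.05), ("d5", 0.30)],
}
pruned = build_pruned_postings(raw, p=1)  # P = 4, so the lowest-rtf posting is dropped
```

The factor of 4 comes from the slide; whether it is the right margin for a given collection is what the experiments on P are meant to determine.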
Pruning Search Results: Incremental Indexing
Puts a bound on possible growth of the postings file.
  Only ever storing P postings for a given term.
Makes adding to the postings slower.
  Must insert each new posting at the right location, by weight, in the term's list of postings.
Have a max of P postings per term.
  Can pre-allocate P posting records per term.
  Never have to move postings around.
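The bounded, weight-ordered insert described above might look like the following in-memory sketch. It assumes an ordinary Python list stands in for a term's pre-allocated block of P posting records; `insert_posting` and `max_postings` are hypothetical names, and a real inverted file would write into fixed-size records on disk rather than a list.

```python
import bisect

def insert_posting(postings, doc_id, rtf, max_postings):
    """Insert (doc_id, rtf) into a list sorted by rtf, highest first, capped at max_postings.

    Returns True if the posting was stored, False if it was too weak to keep.
    """
    if len(postings) >= max_postings and rtf <= postings[-1][1]:
        return False  # list is full and the new posting would rank last
    # bisect needs ascending keys, so search on the negated weights.
    keys = [-weight for _, weight in postings]
    position = bisect.bisect_right(keys, -rtf)
    postings.insert(position, (doc_id, rtf))
    if len(postings) > max_postings:
        postings.pop()  # evict the lowest-weight posting to stay within the bound
    return True

# Toy data (invented for illustration only).
postings = [("d2", 0.40), ("d5", 0.30), ("d3", 0.25)]
stored_mid = insert_posting(postings, "d7", 0.35, max_postings=4)
stored_weak = insert_posting(postings, "d8", 0.10, max_postings=4)
stored_top = insert_posting(postings, "d9", 0.50, max_postings=4)
```

The list never grows past `max_postings`, which mirrors the bound on the postings file; the cost is the extra search and shift on every insert.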
Bounded Accumulator
If you create a bounded-size accumulator:
  Want it to store the highest-weighted results.
Can achieve best results by adding the highest-weighted postings to the accumulator first.
  Then make minor adjustments by adding lower-weight postings.
This is achieved by processing query terms with the highest idf first.
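One possible reading of the bounded accumulator is sketched below. The slides do not say what happens once the accumulator is full; this sketch assumes that documents already present keep receiving weight while new documents are ignored. The names `bounded_search` and `max_accumulators`, and the `rtf * idf` weighting, are likewise assumptions.

```python
def bounded_search(index, idf, query_terms, max_accumulators, top_k=10):
    """Accumulate scores for at most max_accumulators documents, highest-idf terms first."""
    accumulator = {}
    ordered_terms = sorted(query_terms, key=lambda t: idf.get(t, 0.0), reverse=True)
    for term in ordered_terms:
        for doc_id, rtf in index.get(term, []):
            weight = rtf * idf[term]
            if doc_id in accumulator:
                accumulator[doc_id] += weight  # minor adjustment to an existing entry
            elif len(accumulator) < max_accumulators:
                accumulator[doc_id] = weight   # room left: admit a new document
            # Otherwise the accumulator is full and this contribution is dropped.
    ranked = sorted(accumulator.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]

# Toy data (invented for illustration only): "the" is common, "gauch" is rare.
index = {
    "the": [("d9", 0.9), ("d1", 0.2)],
    "gauch": [("d1", 0.6), ("d2", 0.5)],
}
idf = {"the": 0.1, "gauch": 3.0}
results = bounded_search(index, idf, ["the", "gauch"], max_accumulators=2)
```

Because the rare term is processed first, the two accumulator slots go to `d1` and `d2`; the common term then only nudges `d1`, and `d9` is never admitted.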
Wildcards
Usually not implemented in web search engines.
Wildcards at the end: Nation*
  Matches nation, nations, nationality, nationalization, …
Requires:
  Sorted dictionary (inefficient; could use a B+ tree instead of a hashtable)
Stemming:
  Map words to stems during indexing
  Store stems in dict file
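To illustrate why trailing wildcards need a sorted dictionary, here is a minimal sketch that uses a sorted Python list and binary search in place of a B+ tree. The function name `expand_trailing_wildcard`, the lowercasing of the pattern, and the sample terms are assumptions for illustration.

```python
import bisect

def expand_trailing_wildcard(sorted_terms, pattern):
    """Return all dictionary terms matching a trailing-wildcard pattern such as 'nation*'."""
    prefix = pattern.rstrip("*").lower()
    # All terms sharing the prefix are contiguous in sorted order,
    # so binary-search for the first candidate and scan forward.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

# Toy dictionary (invented for illustration only).
dictionary = sorted(
    ["nation", "national", "nationality", "nations", "native", "nature", "nat"]
)
matches = expand_trailing_wildcard(dictionary, "Nation*")
```

A hashtable cannot support this lookup because matching terms hash to unrelated buckets; any ordered structure, including a B+ tree, keeps them adjacent.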