Advanced Search Features

Dr. Susan Gauch

Pruning Search Results

If a query term has many postings:

It is inefficient to add all postings to the accumulator and then sort the results.

Just reading all postings from the inverted file is not scalable when a word may be in a billion documents.

So, process only the highest-weighted postings for a given query term.

How many to use? Several thousand, so that we have the chance of adding weights from multiple query terms for a given document.
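A minimal sketch of this query-time pruning, assuming each term's postings are already sorted by weight and that a posting's weight is rtf * idf. The names `pruned_search`, `index`, `idf`, and `p`, and the toy data, are illustrative and not from the slides.

```python
from collections import defaultdict

def pruned_search(index, idf, query_terms, p=3, top_k=5):
    """Score documents using only the first p postings of each query term.

    index: term -> list of (doc_id, rtf), sorted by rtf in descending order.
    idf:   term -> inverse document frequency.
    """
    accumulator = defaultdict(float)
    for term in query_terms:
        # Only the p highest-weighted postings are read for this term.
        for doc_id, rtf in index.get(term, [])[:p]:
            accumulator[doc_id] += rtf * idf.get(term, 0.0)
    ranked = sorted(accumulator.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

# Hypothetical toy index: postings are pre-sorted by rtf.
index = {
    "search": [("d1", 0.9), ("d2", 0.5), ("d3", 0.2), ("d4", 0.1)],
    "engine": [("d2", 0.8), ("d1", 0.3), ("d5", 0.1)],
}
idf = {"search": 1.0, "engine": 2.0}
results = pruned_search(index, idf, ["search", "engine"], p=2)
```

With p=2, only d1 and d2 are ever touched, and both receive weight from both query terms, which is the reason for keeping p large enough that documents can collect weights from several terms.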

Pruning Search Results: Implementation

Must sort all postings for a given term by weight during indexing.

Since all postings for a given term have the same idf, sort postings by rtf during indexing.

Can also affect incremental indexing:

Keep at most P postings for any given term, sorted in order by rtf.

If only processing p postings per term (max) at query time, only keep P = p*4 in the inverted file.

Run experiments on P.

How many postings do you need to process to get unchanged top results?
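The indexing step and the suggested experiment could be sketched as follows. This is only an illustration under assumptions: postings are (doc_id, rtf) pairs, scores are sums of rtf * idf, and the helper names `truncate_postings`, `top_docs`, and `smallest_stable_p` are hypothetical.

```python
def truncate_postings(postings, p, factor=4):
    """Sort one term's postings by rtf (descending) and keep at most P = p * factor."""
    P = p * factor
    return sorted(postings, key=lambda post: post[1], reverse=True)[:P]

def top_docs(index, idf, terms, p, k):
    """Top-k document ids when only the first p postings per term are processed."""
    scores = {}
    for term in terms:
        for doc_id, rtf in index[term][:p]:
            scores[doc_id] = scores.get(doc_id, 0.0) + rtf * idf[term]
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:k]]

def smallest_stable_p(index, idf, terms, k):
    """Smallest p such that every p' >= p gives the same top-k as all postings."""
    longest = max(len(index[t]) for t in terms)
    full = top_docs(index, idf, terms, longest, k)
    stable = longest
    for p in range(longest, 0, -1):
        if top_docs(index, idf, terms, p, k) != full:
            break
        stable = p
    return stable

# Hypothetical unsorted postings as they might arrive during indexing.
raw = {
    "search": [("d3", 0.2), ("d1", 0.9), ("d4", 0.1), ("d2", 0.5)],
    "engine": [("d1", 0.3), ("d5", 0.1), ("d2", 0.8)],
}
idf = {"search": 1.0, "engine": 1.0}
index = {term: truncate_postings(posts, p=2) for term, posts in raw.items()}
stable_p = smallest_stable_p(index, idf, ["search", "engine"], k=2)
```

On this toy data the top two results stop changing once two postings per term are processed. A real experiment would repeat this over many queries on the actual collection.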

Pruning Search Results: Incremental Indexing

Puts a bound on possible growth of the postings file.

Only ever storing P postings for a given term.

Makes adding to the postings slower.

Must insert each new posting in the right location, by weight, in the list of postings for the term.

Have a max of P postings per term.

Can pre-allocate P posting records per term.

Never have to move postings around.
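One possible in-memory sketch of such a bounded, weight-ordered postings list is shown below. The class `BoundedPostings` and its methods are hypothetical; an on-disk postings file would use fixed-size records, but the insertion logic would be similar.

```python
class BoundedPostings:
    """At most `capacity` postings for one term, kept sorted by rtf (descending)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.slots = [None] * capacity  # pre-allocated posting records
        self.count = 0

    def add(self, doc_id, rtf):
        """Insert a posting in weight order; return False if it was discarded."""
        if self.count == len(self.slots):
            if rtf <= self.slots[-1][1]:
                return False          # weaker than every stored posting
            i = self.count - 1        # the weakest posting will be overwritten
        else:
            i = self.count
            self.count += 1
        # Shift lower-weighted postings down one slot within the fixed block.
        while i > 0 and self.slots[i - 1][1] < rtf:
            self.slots[i] = self.slots[i - 1]
            i -= 1
        self.slots[i] = (doc_id, rtf)
        return True

    def postings(self):
        return self.slots[:self.count]

plist = BoundedPostings(capacity=3)
for doc_id, rtf in [("d1", 0.2), ("d2", 0.9), ("d3", 0.5), ("d4", 0.1), ("d5", 0.7)]:
    plist.add(doc_id, rtf)
```

Here d4 is rejected because the list is full and its weight is below the weakest stored posting, while d5 displaces d1. The block of P slots never grows, so other terms' records stay where they are.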

Bounded Accumulator

If you create a bounded-size accumulator, you want it to store the highest-weighted results.

Can achieve best results by adding the highest-weighted postings to the accumulator first, then making minor adjustments by adding lower-weight postings.

This is achieved by processing query terms with the highest idf first.
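A rough sketch of the idea, under two assumptions not stated in the slides: a posting's weight is rtf * idf, and once the accumulator is full, postings for documents not already in it are simply skipped. The function name `bounded_search` and the toy data are illustrative.

```python
def bounded_search(index, idf, query_terms, max_docs):
    """Accumulate scores for at most max_docs documents, highest-idf terms first."""
    accumulator = {}
    by_idf = sorted(query_terms, key=lambda t: idf.get(t, 0.0), reverse=True)
    for term in by_idf:
        for doc_id, rtf in index.get(term, []):
            weight = rtf * idf.get(term, 0.0)
            if doc_id in accumulator:
                accumulator[doc_id] += weight   # minor adjustment to a known doc
            elif len(accumulator) < max_docs:
                accumulator[doc_id] = weight    # room left: admit a new doc
            # Otherwise the accumulator is full and the new document is skipped.
    return sorted(accumulator.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical index: "the" is common (low idf), "gauch" is rare (high idf).
index = {
    "the": [("d1", 0.9), ("d2", 0.8), ("d3", 0.7)],
    "gauch": [("d4", 0.6), ("d2", 0.4)],
}
idf = {"the": 0.1, "gauch": 3.0}
ranked = bounded_search(index, idf, ["the", "gauch"], max_docs=2)
```

Because "gauch" is processed first, the two accumulator slots go to d4 and d2. The low-idf term "the" then only adjusts d2's score and cannot crowd out the documents that matter.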

Wildcards

Usually not implemented in web search engines.

Wildcards at the end: Nation*

Matches nation, nations, nationality, nationalization, …

Requires: Sorted dictionary (inefficient; could use B+

Tree instead of hashtable) Stemming:

Map words to stems during indexing Store stems in dict file
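For the trailing-wildcard case (Nation*), a sorted dictionary lets all matching terms be found as one contiguous run. The sketch below uses a sorted Python list and binary search as a stand-in for a B+ tree. The function `prefix_matches` and the sample terms are hypothetical, and lowercasing the query is an assumption about how terms were normalized at indexing time.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """Return all dictionary terms that start with prefix.

    sorted_terms must be in ascending order, so every match sits in one
    contiguous run starting at the first term >= prefix.
    """
    matches = []
    i = bisect_left(sorted_terms, prefix)
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches

# Hypothetical dictionary terms.
terms = sorted(["navy", "nations", "native", "nation", "nature",
                "nationality", "national"])
query = "Nation*"
hits = prefix_matches(terms, query.rstrip("*").lower())
```

A hashtable cannot answer this query without scanning every key, which is why an ordered structure is needed. The postings of all matched terms would then be merged as if the user had typed each term.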