Query Suggestions in the Absence of Query Logs Sumit Bhatia, Debapriyo Majumdar,Prasenjit Mitra...

Query Suggestions in the Absence of

Query Logs Sumit Bhatia, Debapriyo Majumdar,Prasenjit Mitra

SIGIR’11, July 24–28, 2011, Beijing, China

Introduction Query suggestions are more useful for difficult topics, for

which users have little knowledge to create meaningful queries

A meaningful query must infer the user’s query intent & information needs & must help user find the

relevant documents containing relevant information

Existing web search engines rely on query logs to make query suggestions which are not available for

desktop or enterprise search systems

Solution: a document centric probabilistic mechanism to generate query suggestions w/o using query logs which utilizes the document corpus to extract

phrases 2

Related Work Most of the previous works provide query expansion &

refinement rather than query suggestions

Comp(lete)Search Method:

Provides real time auto-completion of the last query term typed by the user

Requires user to type at least two characters of the last query term which is the most frequent term

SimSearch Method: Phrase index is searched to find phrases that contain

the user submitted partial query as a sub-phrase

Selected phrases are presented to the user in order of their occurrence frequency 3

Proposed QS Approach Based on the document centric probabilistic mechanism

Extracting phrases to create a database of phrases that can be used for completing partial user queries

from document corpus

Using N-grams of all order 1, 2, & 3, i.e., unigrams, bigrams, & trigrams from the document corpus

Use idea similar to skip-grams rather than N-grams

N-gram is the number of non stop-words

4

Query Suggestions At any given instant of time, after the user has entered k

characters, denoted Q1k , which can be

decomposed

Q1K = Qc + Qt (1)

where |Qc| 0, a set of words, & |Qt| {0, 1}, a (in)complete word

Given a partial query Q1k & a phrase pi P = {p1, p2, …, pn},

what is the probability P(pi | Q1k), i.e., the probability

that the user will type pi after typing Q1k?

5

Query Suggestions

6

Query Suggestions The proposed query suggestion is defined as

The probability of selecting a phrase given a partial word is

The importance of phrases is determined by occurrence frequencies in the document corpus

7

a vocabulary word that start with Qt a phrase that contains the word ci

Estimating Phrase-Query Correlation The contextual relationship between a phrase pi & a

user submitted query Qc using their joint occurrence

pi is the 2nd half of the complete query & Qc is the 1st half

Both P(Qc, pi) & P(pi) in the previous equation can be estimated using the corpus as follows:

where Dpi and DQc

represent the sets of documents that contain phrase pi and Qc, respectively 8

Experimental Results: Datasets Two datasets were used

TREC Consists of more than 200K news articles published in

Financial Times between years 1991–1994

Ubuntu: Consists of more than 100K discussion threads crawled from

ubuntuforums.org, 25 queries, & relevance judgments

9

Baselines Methods The proposed methods was compared with the following

two baseline methods

Similarity based phrase search (SimSearch)

• Indexed phrases which contain user queries as sub-phrasesare searched & ranked according to their occurrence frequencies

CompleteSearch (CompSearch)

• Offers real-time auto-completion of the last query term being typed by the user

• Also use frequency as the ranking criterion

10

Test Queries Generated 40 partial test queries, created from 20 non-

stop words, non-single keyword, randomly-chosen queries, for each dataset

Type-A Queries Queries were generated by retaining only the 1st keyword

from each of the 20 original queries

Type-B Queries

Queries were generated by retaining the 1st keyword of the query followed by the first randomly-chosen k

characters (2 ≤ k ≤ length of the remaining query string)

11

Test Queries(cont’d)

12

Evaluation For each test query, the top 10 suggestions generated by

SimSearch, CompSearch & the proposed Probabilistic method were collected & evaluated by 3 assessors

Evaluation was performed w/ the help from 12 volunteers who were colleagues not associated with the

project

For each query suggestion, each assessor assigned one rating among the four (given below) & major-vote is

used

13

Suggestions Created by Two Test Queries

14

Success Rate of Different Methods A query suggestion method is successful for a given

partial query if it is able to generate at least one meaningful suggestion for the partial query

15

Quality of Suggestions

16

Precision Values Achieved by Different QS

17

Effectiveness of Suggested Queries Query clarity score is used to measure the retrieval

performance of suggested queries

Clarity score of a query increases if we add terms that reduce query ambiguity & it decreases on

adding terms that make the query more ambiguous

Clarity score for a query q with respect to a collection of documents C is computed using KL-Divergence

where V is the vocabulary of the collection18

Clarity Scores Achieved by Different QS

19

Conclusions and Future Works Meaningful query suggestions can be made in the absence

of query logs with probabilistic approach using the occurrence of terms/phrases in a corpus of

documents

Future works A future goal is to ensure that the badly formed combination

of phrases are eliminated from the suggestions

Use of synonyms and synonymous phrases to enable the system to suggest alternatives also needs to be

explored

Systematic approach towards diversifying the suggested queries

Apply to a relatively larger scale

20

Query Suggestions in the Absence of Query Logs Sumit Bhatia, Debapriyo Majumdar,Prasenjit Mitra...

Documents

Transcript of Query Suggestions in the Absence of Query Logs Sumit Bhatia, Debapriyo Majumdar,Prasenjit Mitra...