Query expansion

Query Expansion: Cluster based using N Grams. UMA K L (201305514), SPANDAN VEGGALAM (201307674), MAHAVER CHOPRA (201101011), AKSHAT KANDELWAL (201001095). INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY-HYDERABAD


Transcript of Query expansion

Page 1: Query expansion

Query Expansion

Cluster based using N Grams

UMA K L (201305514)

SPANDAN VEGGALAM (201307674)

MAHAVER CHOPRA (201101011)

AKSHAT KANDELWAL (201001095)

INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY - HYDERABAD

Page 2: Query expansion

Query Expansion

• Key feature of a search engine

• In many cases it is difficult to determine the user's search intent

• Users do not always formulate their queries in the best way

• Query recommendation helps users formulate queries, to a certain extent

• Improves retrieval performance: the user selects, from the suggestions, an alternate query that is relevant to their intent

• Increases the recall of the information retrieval system

Page 3: Query expansion

Expanding Queries

The following techniques are used for expanding queries:

1. Spelling correction

2. Finding and searching with synonyms of the input query terms

3. Augmenting the query with additional terms

In our approach we focus on augmenting queries and searching with synonyms.

Page 4: Query expansion

Our Approach

Two Phases

1. Offline Phase

   1. Add synonyms to documents

   2. Cluster the documents, in order to group similar documents into a cluster

   3. Index and label the clusters

   Only nouns are indexed.

2. Online Phase

   1. Search for clusters as a phrase query

   2. Predict words for query augmentation

   3. Re-weight the query and suggest the top queries as query recommendations

   Only nouns are considered as augmented words.

Page 5: Query expansion

Our Approach – Offline Phase

Why Clustering?

1. Clustering widens the scope for suggesting queries from different contexts

2. Similar documents are clustered together, and the clusters are indexed

3. Search is performed on cluster index.

4. Relevant clusters are considered to find augmented terms

5. Top N query suggestions from each cluster are considered

Clustering Parameters Used

Algorithm: K-Means

Number of Clusters: 150
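
The clustering step can be illustrated with a small, self-contained sketch. This is not the CLUTO setup the slides refer to: it is a plain K-Means over term-frequency vectors with cosine similarity, and the names `tf_vector`, `cosine` and `kmeans`, the deterministic initialisation and the toy value of `k` are all assumptions made for illustration (the slides use 150 clusters).

```python
import math
from collections import Counter

def tf_vector(text, vocab):
    """Term-frequency vector of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iterations=10):
    """Assign each vector to one of k clusters.

    Assumption: the first k vectors seed the centroids, which keeps
    the sketch deterministic. Real tools use better initialisation.
    """
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid.
        assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

docs = [
    "cricket match score",
    "stock market shares",
    "cricket team score",
    "market shares fall",
]
vocab = sorted({w for d in docs for w in d.split()})
clusters = kmeans([tf_vector(d, vocab) for d in docs], k=2)
```

With these four toy documents the two cricket stories end up in one cluster and the two market stories in the other.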

Page 6: Query expansion

Our Approach – Offline Phase

Adding Synonyms

1. Allows the user to search with synonyms as well

2. Ideally, the system should accept synonyms and retrieve the same relevant documents

3. The top 5% of words from each document are considered, and synonyms are added for these words
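
A rough sketch of the synonym step is shown below. The slides do not say how the "top 5%" is ranked, so this sketch assumes ranking by term frequency over the distinct words of a document. The `SYNONYMS` table is a hypothetical stand-in for a WordNet lookup, and `add_synonyms` is an illustrative name, not part of the original system.

```python
import math
from collections import Counter

# Hypothetical synonym table standing in for a WordNet lookup.
SYNONYMS = {"car": ["automobile"], "price": ["cost"]}

def add_synonyms(text, synonyms, fraction=0.05):
    """Append synonyms of the most frequent words to a document.

    Assumption: "top" means most frequent, and the fraction is taken
    over distinct words, with at least one word always selected.
    """
    words = text.lower().split()
    counts = Counter(words)
    top_n = max(1, math.ceil(len(counts) * fraction))
    top_words = [w for w, _ in counts.most_common(top_n)]
    extra = [s for w in top_words for s in synonyms.get(w, [])]
    return words + extra

expanded = add_synonyms("car price car dealer car", SYNONYMS)
```

Here only "car" falls in the top 5%, so "automobile" is appended and a later search for either word can reach the document.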

Labeling Clusters

1. Clusters are tagged with their most relevant terms

2. A label contains a set of terms that distinguish the cluster from other clusters
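
The slides do not state how label terms are scored. One plausible way to find terms that are both relevant to a cluster and distinguish it from the others is a TF-IDF-style score computed at cluster level, sketched below. The function name `label_clusters` and the scoring formula are assumptions, and the noun-only filtering mentioned earlier is omitted.

```python
import math
from collections import Counter

def label_clusters(clusters, top_n=2):
    """Pick the top_n label terms for each cluster.

    clusters: dict mapping a cluster id to a list of document strings.
    Score (an assumption): term frequency inside the cluster times
    log(number of clusters / number of clusters containing the term),
    so a term shared by every cluster scores zero.
    """
    term_counts = {
        cid: Counter(w for doc in docs for w in doc.lower().split())
        for cid, docs in clusters.items()
    }
    cluster_freq = Counter(w for counts in term_counts.values() for w in counts)
    n = len(clusters)
    labels = {}
    for cid, counts in term_counts.items():
        scored = {w: tf * math.log(n / cluster_freq[w]) for w, tf in counts.items()}
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scored, key=lambda w: (-scored[w], w))
        labels[cid] = ranked[:top_n]
    return labels

labels = label_clusters({
    "sports": ["cricket match today", "cricket score today"],
    "business": ["market shares today", "market rally"],
})
```

In this toy input "today" appears in both clusters, so it never becomes a label, while "cricket" and "market" rank first for their clusters.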

Page 7: Query expansion

Our Approach – Online Phase

1. Retrieve the relevant clusters for the given input query

2. Select the top ‘N’ clusters

3. If the given query can be represented as an N-gram

   1. As the words are sequential and from the same document, the intent of the user is clear. The next word in the document can be suggested as an augment word

   2. Retrieve the next sequential word from the cluster, which is a set of documents

   3. Augment the query with these predicted words, retrieve the top queries and present them to the user as query recommendations
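
The next-word prediction in step 3 can be sketched with a simple count-based N-gram model. This is a minimal illustration, assuming whitespace tokenisation and ranking by raw follow-up counts, which for a fixed context gives the same order as the conditional probability P(next word | context). The names `build_ngram_model` and `suggest` are hypothetical.

```python
from collections import Counter, defaultdict

def build_ngram_model(documents, n=2):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for doc in documents:
        tokens = doc.lower().split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[context][tokens[i + n - 1]] += 1
    return model

def suggest(query, model, n=2, top_k=3):
    """Return the query augmented with its most frequent next words."""
    tokens = query.lower().split()
    # Use the last n-1 query words as the context.
    context = tuple(tokens[max(0, len(tokens) - (n - 1)):])
    ranked = sorted(model.get(context, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{query} {word}" for word, _ in ranked[:top_k]]

# Documents of one retrieved cluster (made-up sample text).
cluster_docs = [
    "prime minister visits calcutta",
    "prime minister addresses parliament",
    "prime minister visits delhi",
]
model = build_ngram_model(cluster_docs)
recommendations = suggest("prime minister", model)
```

For the sample cluster, "visits" follows "minister" twice and "addresses" once, so the recommendations are "prime minister visits" followed by "prime minister addresses".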

Page 8: Query expansion

Our Approach – Online Phase

4. Else, if the query terms are separated by some distance

   1. Predict the next word for each input term and add these words to a list

   2. Identify the tags of the clusters and add them to the list. Tags are words that give information about the input words taken together

   3. Here the user's intent is not clear, so tags that give the category/topic/context of the document are also considered for augmentation

   4. Augment the query with these predicted words, retrieve the top queries and present them to the user as query recommendations

5. If the given words are far from each other, it is very difficult to correlate them

   1. Sequential words cannot be used as augment words: the input words may come from different contexts, which makes it hard to retrieve relevant documents and thereby reduces recall

   2. Cluster tags are words that give some information about the input words taken together, and are considered as augmented terms
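
Steps 4 and 5 can be sketched as one candidate-gathering routine. This is an illustration under assumptions: `next_words` and `augment_candidates` are hypothetical names, and the `use_sequential` flag is an invented switch that models step 5, where only the cluster tags are used.

```python
from collections import Counter

def next_words(term, documents, top_k=2):
    """Most frequent words that directly follow `term` in the documents."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            if current == term:
                counts[following] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_k]]

def augment_candidates(query, documents, cluster_tags, use_sequential=True, top_k=2):
    """Collect augment words for a query whose terms are not adjacent.

    use_sequential=True  -> step 4: next words per term, then cluster tags.
    use_sequential=False -> step 5: cluster tags only.
    """
    terms = query.lower().split()
    candidates = []
    if use_sequential:
        for term in terms:
            candidates.extend(next_words(term, documents, top_k))
    candidates.extend(cluster_tags)
    # Drop duplicates and words already in the query, keeping order.
    seen = set(terms)
    unique = []
    for word in candidates:
        if word not in seen:
            seen.add(word)
            unique.append(word)
    return unique

docs = [
    "election results announced in bengal",
    "bengal election commission meets",
    "results announced today",
]
words = augment_candidates("election bengal", docs, ["politics", "vote"])
```

Each returned word would then be appended to the query to form a candidate recommendation, as in the N-gram case.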

Page 9: Query expansion

Architecture – Offline Phase

Page 10: Query expansion

Architecture – Online Phase

Page 11: Query expansion

Tools & Data Set Used

Tools

◦ WordNet: used to identify synonyms

◦ CLUTO: clusters the documents

◦ doc2mat: represents documents in matrix format

◦ Apache Lucene: used for indexing and querying

◦ Stanford NLP POS tagger: identifies the part of speech of each word

Data Set

◦ The data set consists of newspaper stories from The Telegraph (Calcutta)

◦ Stories are categorized into “FrontPage”, “Nation”, “Calcutta”, “Bengal”, “Foreign”, “Business”, “Sports”, “Opinion”, “Metro”

◦ The format of each story in the data set is <DocNo>*</DocNo><Text>*</Text>
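
Stories in this format can be read with a short regular-expression parser. The sketch below assumes each story is exactly one `<DocNo>` element followed by one `<Text>` element. The function name `parse_stories` and the sample document numbers are made up for illustration, and real files may carry extra wrapper markup.

```python
import re

# One <DocNo>...</DocNo> followed by one <Text>...</Text>;
# DOTALL lets the story text span several lines.
DOC_PATTERN = re.compile(r"<DocNo>(.*?)</DocNo>\s*<Text>(.*?)</Text>", re.DOTALL)

def parse_stories(raw):
    """Return a dict mapping document number to story text."""
    return {
        doc_no.strip(): text.strip()
        for doc_no, text in DOC_PATTERN.findall(raw)
    }

sample = (
    "<DocNo>1</DocNo><Text>First story.</Text>\n"
    "<DocNo>2</DocNo><Text>Second\nstory.</Text>"
)
stories = parse_stories(sample)
```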

Page 12: Query expansion

Evaluation

We ran the augmented queries over the data set to retrieve relevant documents and found a considerable increase in recall and precision values.

The bar diagram on this slide shows the change in precision and recall for randomly selected augmented queries formulated from 30 input queries.

Page 13: Query expansion

Evaluation

Open this file for evaluation results of all augmented queries for 30 input queries.
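
For reference, precision and recall for a single query can be computed as below. This is the standard set-based definition, not the authors' evaluation script. The function name `precision_recall` and the document IDs are illustrative only and are not results from the slides.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up document IDs: 2 of the 4 retrieved documents are relevant,
# and 2 of the 3 relevant documents were retrieved.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Comparing these two values for the original query and for its augmented version gives the per-query change that the evaluation reports.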

Page 14: Query expansion

Future Work

1. The approach can be extended to use query logs

◦ Query logs can serve as a knowledge base for suggesting queries, and also help in suggesting queries asynchronously

2. User preferences can be used to filter documents by relevance

3. Different mixed variants of Markov models can be used to achieve the best balance between accuracy and coverage, in terms of both data-centric (objective) and user-centric (subjective) evaluation metrics

4. Different N-gram variations can be tried to make the approach better suited to a real-time search engine

Page 15: Query expansion

Conclusion

1. We have explained our approach to query expansion, which uses clustering to extend suggestions across various contexts, and N-gram and Markov models to determine augment terms

2. We applied a sequential probabilistic model, as it is suitable for the task of online query recommendation

3. Achieved accuracy and coverage in terms of data

4. The time and memory complexity of our application was measured and found to be suitable for a real-time search engine