AUTOMATIC QUERY EXPANSION IN INFORMATION RETRIEVAL

Ryan Herbeck

Abstract

The effectiveness of information retrieval systems has long been hindered by mismatches between the terms that a system’s indexers assign to documents and the few, often imprecise terms that users put in their queries. To handle this aptly named vocabulary problem, methods such as interactive query refinement, relevance feedback, word sense disambiguation and search results clustering have been introduced. One of the most successful methods is automatic query expansion. This paper gives an overview of automatic query expansion and its applications, compares automatic query expansion to other methods used to handle the vocabulary problem, describes the automatic query expansion process, classifies the types of automatic query expansion techniques and discusses issues with using automatic query expansion in information retrieval systems.

1. Overview

According to the Encyclopedia of Database Systems, automatic query expansion (AQE) in information retrieval (IR) systems is “the process…which consists of selecting and adding terms to the user's query with the goal of minimizing query-document mismatch…” [4] In other words, AQE takes a user’s query against some data collection and enhances it with additional key terms in order to retrieve better search results.

Current information retrieval systems typically have a standard user interface consisting only of a single text box that takes in keywords. These keywords are then matched against the system’s collection index to find documents that contain keywords from the user’s query. Naturally, if the user includes multiple topic-specific keywords in the query, then the system will return higher quality results. However, typical user queries contain few keywords. Because of this, and because natural language is ambiguous, this basic system is susceptible to search errors and omissions of relevant information. [1]

The major issue concerning IR systems is known as the vocabulary problem. [2] It arises when the system’s indexers use different keywords than the users do. The problem is difficult to handle for two reasons: natural language’s synonymy (different words have the same or similar meanings) and polysemy (one word has multiple meanings). Synonymy, coupled with word inflections and verb conjugations, reduces the IR system’s recall, or its ability to retrieve all documents relevant to the user’s query. Polysemy may cause the IR system to retrieve irrelevant documents, which decreases the IR system’s precision, or its ability to retrieve only documents relevant to the user’s query. [1]
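To make the two measures concrete, the following minimal Python sketch computes precision and recall for a single query. The function name and the document identifiers are invented for illustration and do not come from the cited survey.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents the system returned.
    relevant:  identifiers of all documents that are actually relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# while the collection holds 6 relevant documents in total.
retrieved = ["d1", "d2", "d3", "d7"]
relevant = ["d1", "d2", "d3", "d4", "d5", "d6"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

In this made-up case the system misses half of the relevant documents (a recall problem, typical of synonymy) and returns one irrelevant document (a precision problem, typical of polysemy).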

Several solutions have been proposed to reduce the effects of the vocabulary problem. Such solutions include interactive query expansion, relevance feedback, word sense disambiguation, search results clustering and of course AQE. [1] The alternative solutions to AQE are discussed and compared with AQE in section 3.

AQE was suggested as a solution to the vocabulary problem as early as 1960. A variety of techniques were investigated, and early experiments were conducted on small-scale data collections. However, these experiments yielded inconclusive results regarding the effectiveness of AQE; in many of them, a gain in recall was accompanied by a loss in precision. [1]

In today’s world, the volume of data has increased substantially. However, the number of terms in a user’s query against data in an IR system has remained low; the most common queries are a mere one to three words in length. As a result of these two facts, the vocabulary problem has gotten much worse. The scarcity of query terms reduces an IR system’s synonymy handling, while the diversity and size of data collections increases the negative effects of polysemy. Thus, the need for and scope of AQE and other related solutions have increased. [1]

2. Applications of AQE

AQE has been commonly applied to the following areas: question answering, multimedia information retrieval, information filtering and cross-language information retrieval.

2.1 Question Answering

The goal of question answering is not to provide the user with documents that contain the answer, but to provide a single direct, concise response to the question asked. AQE assists with this by expanding the original question with related terms that are expected to be found in documents that contain the answers. These expansion terms are often found in FAQ data. [1]

2.2 Multimedia Information Retrieval

As the volume of digital media grows, the effective search of multimedia documents such as speeches, images and videos becomes more important. IR systems typically search the metadata of multimedia documents to retrieve relevant search results. Such metadata includes annotations, captions and HTML/XML descriptions. When such metadata is not available, IR systems often use some form of content analysis combined with AQE techniques. Some examples of this include searching transcriptions produced by an automatic speech recognition system and analyzing image colors, textures and shapes. [1]

2.3 Information Filtering

In information filtering, an IR system monitors a continuous stream of documents and selects only the ones that are relevant to the user’s queries. As the user’s needs change over time, so does the selection of relevant documents. This is applied to electronic news, blogs and e-mail, to name a few. [1]

2.4 Cross-Language Information Retrieval

As the name implies, this process involves the retrieval of documents written in a language different from that of the user’s query. Several translation issues arise with this process, however. The translation resources may cover the languages involved insufficiently, some terms may be untranslatable, and a term in the query’s language may translate ambiguously into the target language. AQE can assist with these translation errors by expanding the original query before and/or after it is translated into a different language. [1]

3. Related Techniques

The following are proposed alternative solutions to the vocabulary problem: interactive query expansion, relevance feedback, word sense disambiguation and search results clustering.

3.1 Interactive Query Expansion

Interactive query expansion (IQE) is similar to AQE, with one major difference: rather than select expansion terms on its own, the system suggests several query reformulations to the users, and the users decide which reformulation to use. Unlike AQE systems, IQE systems do not handle issues with feature selection and query reformulation because those steps are handled by the users. Thus, IQE systems have the potential to produce better search results than AQE systems, except that they require some expertise on the user’s part. One of the most popular IQE systems is Google Suggest. [1]

3.2 Relevance Feedback

With relevance feedback, the user first receives the results generated by the original query. The IR system then gathers information from the user about the relevance of the returned results and generates a new query from the gathered information. Relevance feedback systems attempt to make the new query more similar to the documents that were retrieved with the original query, whereas AQE systems attempt to make the new query more similar to what the user actually intends to search for. The data sources in relevance feedback systems may have more reliability than the data sources used by AQE systems, but the assessment of document relevance is placed in the hands of the user. [1]
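One classic way to generate the new query from user judgments is the Rocchio formula, which moves the query vector toward the documents judged relevant and away from those judged non-relevant. The paper does not name a specific method, so the sketch below is only an illustration; the function name, the term weights and the coefficient values (alpha, beta, gamma) are assumptions.

```python
from collections import Counter


def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style query update on sparse term-weight dictionaries.

    new_query = alpha * query
                + beta  * mean(relevant documents)
                - gamma * mean(non-relevant documents)
    Terms whose final weight is not positive are dropped.
    The coefficient defaults are assumed values, not taken from the paper.
    """
    new_query = Counter()
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for docs, signed_coeff in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                # Dividing by len(docs) turns the sum into a mean.
                new_query[term] += signed_coeff * weight / len(docs)
    return {term: w for term, w in new_query.items() if w > 0}


# Hypothetical feedback for the ambiguous query "jaguar":
# the user marks an animal page as relevant and a car page as non-relevant.
query = {"jaguar": 1.0}
relevant_docs = [{"jaguar": 1.0, "cat": 1.0}]
nonrelevant_docs = [{"jaguar": 1.0, "car": 1.0}]
new_query = rocchio(query, relevant_docs, nonrelevant_docs)
print(new_query)
```

The term "cat" enters the new query because it appears in a document the user judged relevant, while "car" ends with a negative weight and is dropped.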

3.3 Word Sense Disambiguation

Word sense disambiguation systems attempt to identify the meanings of the user’s query terms in the context of the query as a whole and then perform a search based on the user’s perceived intent. One approach is to represent the words by their dictionary definitions using some central dictionary database. Another approach is the use of WordNet synsets. WordNet is an online lexical database which groups similar words into synonym sets (synsets), gives general definitions of these synsets and records the semantic relations between synsets. These two approaches rely on predefined lists that are relatively short. A more feasible approach would be to find all contexts of a word and cluster them with similar contexts in order to find the user’s intended word uses. [1]

Word sense disambiguation, like AQE, attempts to determine the intent of the user’s query and performs an intent-based search. However, word sense disambiguation has limitations regarding computations and effectiveness. In addition, typical queries have too few words to disambiguate. [1]

3.4 Search Results Clustering

This solution to the vocabulary problem groups a query’s search results by topic. Search results clustering not only presents results by topic but also attempts to optimize the quality and relevance of the topic labels. The topic labels that are generated could be seen and used as query refinements, but they are generally broader than the original query and are intended to help the user browse through the search results. An example of a system that utilizes search results clustering is Clusty (http://clusty.com), a Web search engine. Unlike AQE, there is no actual query refinement performed, and control over the search results and any query reformulation is entirely in the hands of the user. [1]

4. How AQE Works

Figure 1: Automatic Query Expansion Process [1]

As seen in Figure 1, the process of AQE is divided into four main steps: data preprocessing, feature generation and ranking, feature selection and query reformulation.

4.1 Data Preprocessing

The first step in AQE is to take the data source that the user’s query is run against and reformat it for more effective subsequent processing. The following steps are typically performed on the data source:

1. Extract text from documents.
2. Extract words without punctuation and ignoring case.
3. Remove articles and prepositions.
4. Reduce word inflections and derivations.
5. Assign a weighted importance value to each word in the resulting word set.

Take the following HTML fragment as an example:

‘<b>Automatic query expansion</b> expands queries automatically.’

One example of an indexed representation of this fragment may look like this (assuming the weight is the number of times a word appears divided by the total number of words):

automat 0.33, queri 0.33, expan 0.16, expand 0.16.

Note that expan and expand are separate because expan is the root of a noun and expand is the root of a verb. So each document in the data source is represented as a set of weighted terms. [1]
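The five preprocessing steps can be sketched in Python as follows. This is a minimal illustration only: the stop list is a small assumed sample, and toy_stem is a hypothetical set of suffix rules written just to cover the words of the example fragment. It is not a real stemming algorithm such as Porter’s.

```python
import re
from collections import Counter

# Assumed stop list: a few English articles and prepositions.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "with", "by", "at"}

# Toy suffix rules, checked in order. They only cover this example
# and are NOT a real stemming algorithm.
SUFFIX_RULES = [("ically", ""), ("sion", ""), ("ies", "i"),
                ("ic", ""), ("y", "i"), ("s", "")]


def toy_stem(word):
    """Apply the first matching suffix rule, keeping at least 3 letters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


def index_document(html):
    """Return {stem: weight}, weight = occurrences / total number of words."""
    text = re.sub(r"<[^>]+>", " ", html)                 # 1. extract text
    words = re.findall(r"[a-z]+", text.lower())          # 2. drop punctuation, ignore case
    words = [w for w in words if w not in STOP_WORDS]    # 3. remove articles and prepositions
    stems = [toy_stem(w) for w in words]                 # 4. reduce inflections and derivations
    counts = Counter(stems)
    return {stem: n / len(stems) for stem, n in counts.items()}  # 5. assign weights


weights = index_document(
    "<b>Automatic query expansion</b> expands queries automatically."
)
for stem, weight in sorted(weights.items()):
    print(f"{stem} {weight:.2f}")
```

The fragment contains six words, so automat and queri each receive weight 2/6 and expan and expand each receive 1/6, which the example above shows truncated to 0.16.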

4.2 Feature Generation and Ranking

During this step, AQE takes the user’s original query and the transformed data source and generates a collection of potential terms to be added to the original query based on the relations between the query and the data source. The original query may be reformatted much like the data source was in the previous step to facilitate feature generation. The candidate features generated in this step will be ranked according to the system’s term ranking function. [1]
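As a rough illustration of this step, the sketch below generates candidate terms from documents that share terms with the query and ranks them with a simple co-occurrence count. The scoring rule is an assumption chosen for brevity; the paper does not prescribe a particular term ranking function, and real systems use more refined ones.

```python
from collections import Counter


def rank_candidates(query_terms, documents):
    """Rank non-query terms by how strongly they co-occur with the query.

    documents: lists of already preprocessed (stemmed) terms.
    Each candidate term gains, for every document it appears in, as many
    points as that document has distinct query terms (assumed scoring rule).
    """
    query = set(query_terms)
    scores = Counter()
    for doc in documents:
        terms = set(doc)
        overlap = len(terms & query)
        if overlap == 0:
            continue  # document is unrelated to the query
        for term in terms - query:
            scores[term] += overlap
    # Best score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical preprocessed collection.
documents = [
    ["queri", "expan", "retriev"],
    ["queri", "expan", "thesauru"],
    ["imag", "retriev"],
    ["queri", "log", "retriev"],
]
ranked = rank_candidates(["queri", "expan"], documents)
print(ranked)
```

Here "imag" is never proposed because its only document shares no term with the query.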

4.3 Feature Selection

Once the candidate features have been generated and ranked, the features with the highest ranks are selected. In this step, the candidate features are not evaluated any further and are simply selected based on rank. Only a limited number of expansion features are selected, both to allow for more rapid query processing and because research has shown that using all candidate features is not necessarily better than using only a few. Research also suggests that it is typical to select between ten and thirty features for expansion. One could also implement this step so that only terms within a specific rank range are selected. [1]
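Both selection policies described here, a fixed number of top-ranked features or a specific rank range, fit in a few lines. The function and parameter names below are hypothetical, and the ranked list is made-up input.

```python
def select_features(ranked, max_features=10, rank_range=None):
    """Pick expansion features purely by rank.

    ranked:       list of (term, score) pairs, best first.
    max_features: how many top-ranked terms to keep.
    rank_range:   optional (first, last) pair of 1-based, inclusive ranks;
                  when given, it takes precedence over max_features.
    """
    if rank_range is not None:
        first, last = rank_range
        return [term for term, _ in ranked[first - 1:last]]
    return [term for term, _ in ranked[:max_features]]


# Hypothetical output of the ranking step.
ranked = [("retriev", 3.0), ("thesauru", 2.0), ("log", 1.0), ("imag", 0.5)]
top_two = select_features(ranked, max_features=2)
middle = select_features(ranked, rank_range=(2, 3))
print(top_two, middle)
```

A default of ten features matches the lower end of the range quoted above; the best value in practice depends on the collection and the queries.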

4.4 Query Reformulation

The last step is to modify the original query by adding the selected candidate features to the original query and perform the search with the reformulated query. [1]
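A minimal sketch of reformulation is shown below. Giving expansion terms a lower weight than the original terms is an assumption made here so that the user’s own words keep the most influence; the value 0.5 and the function name are illustrative only.

```python
def reformulate(query_terms, expansion_terms, expansion_weight=0.5):
    """Build a weighted query from original and expansion terms.

    Original terms get weight 1.0. Expansion terms get expansion_weight
    (an assumed value) unless they already occur in the original query.
    """
    weighted = {term: 1.0 for term in query_terms}
    for term in expansion_terms:
        weighted.setdefault(term, expansion_weight)
    return weighted


expanded = reformulate(["queri", "expan"], ["retriev", "thesauru", "queri"])
print(expanded)
```

The resulting dictionary can then be handed to whatever matching function the IR system uses for its search.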

5. Classification of AQE Techniques

Figure 2: AQE Approach Classifications [1]

Figure 2 shows the five main classifications of AQE techniques: linguistic analysis, corpus-specific techniques, query-specific techniques, search log analysis and Web data. This section gives a brief overview of each technique and compares the effectiveness of the five classifications.

5.1 Linguistic Analysis

This approach focuses on the morphological, lexical, syntactic and semantic relationships between words in the query. This analysis is based on databases such as dictionaries, thesauri or WordNet. This approach is susceptible to word sense ambiguity. Some examples of this technique include using a stemming algorithm (reducing terms to root form), ontology browsing (paraphrasing the user’s query in context) and syntactic analysis (extracting relations between terms to find features that appear in related relations). [1]

5.2 Corpus-Specific Techniques

This technique utilizes a large structured set of texts (corpus). It analyzes the contents of a full database to find features that are used similarly. It will attempt to find correlations between term pairs at the document level or within paragraphs or sentences. This technique is data-driven and may not have a simple linguistic interpretation. [1]
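A document-level term correlation can be sketched with the Jaccard coefficient, which divides the number of documents containing both terms by the number containing either. Jaccard is one common choice and is used here as an assumption; the collection below is invented.

```python
from itertools import combinations


def jaccard_correlations(documents):
    """Return {(term_a, term_b): Jaccard coefficient} at document level.

    Pairs that never co-occur are omitted. Terms in each key are in
    alphabetical order.
    """
    # Map each term to the set of documents it appears in.
    postings = {}
    for doc_id, doc in enumerate(documents):
        for term in set(doc):
            postings.setdefault(term, set()).add(doc_id)

    correlations = {}
    for a, b in combinations(sorted(postings), 2):
        shared = len(postings[a] & postings[b])
        if shared:
            correlations[(a, b)] = shared / len(postings[a] | postings[b])
    return correlations


# Hypothetical preprocessed collection.
documents = [
    ["car", "engin", "wheel"],
    ["car", "engin"],
    ["car", "insur"],
    ["bank", "insur"],
]
corr = jaccard_correlations(documents)
for pair, value in sorted(corr.items(), key=lambda item: -item[1]):
    print(pair, round(value, 2))
```

Restricting the co-occurrence window to paragraphs or sentences only requires passing those smaller units in place of whole documents.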

5.3 Query-Specific Techniques

This technique utilizes local context provided by the user’s query. It makes use of the top-ranked documents retrieved by the user’s original query to generate candidate features. This technique can be more useful than corpus-specific techniques because corpus-specific techniques could retrieve candidate features that appear frequently in the document collection but are irrelevant to the user’s query. [1]
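The idea can be sketched as a two-pass procedure: rank documents with the original query, then count the terms in the top-ranked documents. The document scoring below (a plain count of query-term occurrences) and all names are simplifying assumptions.

```python
from collections import Counter


def local_feedback_terms(query_terms, documents, top_docs=2, top_terms=3):
    """Propose expansion terms from the top-ranked documents only."""
    query = set(query_terms)

    def score(doc):
        # Assumed first-pass ranking: number of query-term occurrences.
        return sum(1 for term in doc if term in query)

    best_docs = sorted(documents, key=score, reverse=True)[:top_docs]
    counts = Counter(
        term for doc in best_docs for term in doc if term not in query
    )
    return [term for term, _ in counts.most_common(top_terms)]


# Hypothetical collection in which "jaguar" is ambiguous.
documents = [
    ["jaguar", "cat", "habitat", "cat"],
    ["jaguar", "car", "dealer"],
    ["jaguar", "jaguar", "cat", "prey"],
    ["stock", "market"],
]
terms = local_feedback_terms(["jaguar", "cat"], documents)
print(terms)
```

Because only the two best-matching documents are inspected, terms such as "car" and "dealer" are never proposed, even though they appear in the collection next to a query term.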

5.4 Search Log Analysis

This technique mines users’ query logs for implicit query associations. Search logs contain past queries and the URLs of clicked pages. Because clicks indicate which results users found useful, this data encodes implicit relevance feedback rather than feedback derived from retrieval with the original query. This technique can extract candidate features from past queries, issued by the current or other users, that are related to the current query. It can also use top-ranked documents from past related queries or extract terms directly from past visited documents. [1]
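One simple way to find related past queries is to treat two queries as related when they led to a click on the same URL. The sketch below follows that assumption; the log format, the URLs and the function name are invented for illustration.

```python
from collections import Counter


def log_expansion_terms(query, click_log, top_terms=3):
    """Propose expansion terms from past queries that share a clicked URL.

    click_log: list of (past_query, clicked_url) pairs.
    """
    clicked = {url for past_query, url in click_log if past_query == query}
    query_terms = set(query.split())
    counts = Counter()
    for past_query, url in click_log:
        if past_query != query and url in clicked:
            counts.update(
                term for term in past_query.split() if term not in query_terms
            )
    return [term for term, _ in counts.most_common(top_terms)]


# Hypothetical click log.
click_log = [
    ("python tutorial", "example.org/py-intro"),
    ("learn python beginners", "example.org/py-intro"),
    ("python beginners course", "example.org/py-intro"),
    ("python snake care", "example.org/reptiles"),
]
terms = log_expansion_terms("python tutorial", click_log)
print(terms)
```

The query about snakes shares a term with the current query but not a clicked page, so it contributes nothing, which shows how click data can help with polysemy.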

5.5 Web Data

This technique utilizes anchor texts on web pages to generate candidate features. Anchor text is the visible, clickable text of a hyperlink. Most anchor texts are similar to real user queries, as anchor texts typically describe the contents of the linked document. However, some anchor texts, such as “click here” or those that consist of only one word, say little about the linked document. This technique can also utilize the complex net of Wikipedia documents and hyperlinks. [1]

5.6 Technique Comparison

Linguistic techniques are generally considered less effective than statistics-based techniques because of their reliance on near-exact word sense disambiguation. Query-specific techniques have better performance than corpus techniques because corpus techniques may produce features that might not be related to the user’s query. The effectiveness of search log and Web data techniques is uncertain as such techniques have not yet been thoroughly tested against other techniques on a standard test collection. [1]

6. Critical Issues with AQE

Three key issues arise when considering the use of AQE beyond small-scale applications: parameter setting, efficiency and usability.

6.1 Parameter Setting

AQE techniques rely on several parameters including the number of top-ranked documents to select from the original query, the number of expansion features to select, and several variables within term-ranking and weighting functions. These parameters vary depending on the size of the data source and the quality of the original query. One could use fixed values for these parameters, but they may not work well with all types of queries. [1]

6.2 Efficiency

If used on a larger scale, AQE will need to deliver real-time results to a large volume of users. This will require a balancing of performance time with quality search results. Good AQE is computationally expensive largely due to the execution time of the expanded query, which may contain tens to hundreds of terms, so any significant cut in performance time is likely to cause a drastic decrease in the retrieval of quality results. [1]

6.3 Usability

The implementation of AQE in IR systems is generally hidden from users. Because of this and the fact that AQE often adds synonyms to the original query, users may receive high-ranked documents that contain none of their original query terms. It is also possible for irrelevant documents to contain the original query terms in anchor texts. One solution for this issue is to show the user the features that were used to expand the query and allow the user to revise the expanded query as needed, making it more like IQE. In general, AQE is better suited for non-expert users who may not know synonyms or alternative meanings for their intended query. [1]

7. Conclusions

There is no perfect solution that handles every aspect of the vocabulary problem, but AQE compensates for users’ reluctance or inability to provide well-refined queries that reflect their needs. AQE has a variety of implementations and its overall efficiency is gradually increasing. Although it is near the end of its experimental stage, it is not yet ready to be implemented on large-scale IR systems such as Web search engines. [1]

References

[1] Carpineto, C. and Romano, G. 2012. A survey of automatic query expansion in information retrieval. ACM Comput. Surv. 44, 1, Article 1 (January 2012), 50 pages.

[2] Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T. 1987. The vocabulary problem in human-system communication. Comm. ACM 30, 11, 964-971.

[3] Mitra, M., Singhal, A., and Buckley, C. 1998. Improving automatic query expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 206-214.

[4] Vechtomova, O. 2009. Query expansion for information retrieval. In Encyclopedia of Database Systems, L. Liu and M. T. Özsu Eds., Springer, 2254-2257.
