Mining related queries from Web search engine query logs using an



With the overwhelming volume of information, the task of finding relevant information on a given topic on the Web is becoming increasingly difficult. Web search engines have hence become one of the most popular solutions available on the Web. However, it has never been easy for novice users to organize and represent their information needs using simple queries. Users have to keep modifying their input queries until they get the expected results. Therefore, it is often desirable for search engines to suggest related queries to users. Besides, by identifying those related queries, search engines can potentially perform optimizations on their systems, such as query expansion and file indexing. In this work we propose a method that suggests a list of related queries given an initial input query. The related queries are drawn from the query log of queries previously submitted by human users and are identified using an enhanced model of association rules. Users can utilize the suggested related queries to tune or redirect the search process. Our method not only discovers the related queries, but also ranks them according to the degree of their relatedness. Unlike many other rival techniques, it also performs reasonably well on less frequent input queries.

Introduction

With the advances in information technologies, the Web has become a huge information repository that covers almost all the topics in which a human user could be interested. However, with the overwhelming volume of information on the Web, the task of finding relevant information related to a specific topic is becoming increasingly difficult. Many advanced Web searching techniques have been developed to tackle this problem and are being used in commercial Web search engines such as Google and Yahoo.

In spite of the recent advances in Web search engine technologies, there are still many situations in which the user is presented with nonrelevant search results. One of the major reasons for this problem is that Web search engines often have difficulties in forming a concise yet precise representation of the user’s information need. Most Web search engine users are not well trained in organizing and formulating their input queries, which the search engine relies on to find the desired search results. Some studies on Web and peer-to-peer (P2P) queries have been conducted (Chau, Fang, & Yang, 2007; Kwok & Yang, 2004; Yang & Kwok, 2005). On one hand, this difficulty is due to the ambiguity that arises from the diversity of language itself; no dialog (discourse) context structure is available for search engines. On the other hand, users are often not clear about the exact terms that best represent their specific information needs. In the worst case, users are not even clear about what exactly their specific information need is. For example, in our study of the sample dataset, one frequently submitted query is “ ” (download) without specifying what exactly the users are seeking to download.

In order to overcome these problems, some Web search engines have implemented methods to suggest alternative queries to users. The purpose of these methods is to help users specify alternative related queries in their search process in order either to clarify their information needs or to rephrase their query formulation to retrieve more related search results. The techniques used in these proprietary commercial systems are usually confidential; however, we observe that the suggested queries returned from these search engines are rather similar in their terms. This may imply that those suggested queries are likely to be generated by using simple query expansion techniques. For instance, if the user searches for Madonna in the Yahoo! search engine, the following related queries are presented: madonna lyrics, madonna pictures, madonna confessions on a dance floor, madonna biography, and madonna university. However, as we can imagine, there are a good number of other queries related to Madonna that presumably do not have the term Madonna explicitly in their term vectors. Given this problem, the technique to retrieve semantically related queries (though probably dissimilar in their terms) is becoming an increasingly important research topic that attracts considerable attention.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 58(12):1871–1883, 2007

Mining Related Queries from Web Search Engine Query Logs Using an Improved Association Rule Mining Model

Xiaodong Shi and Christopher C. Yang
Department of Systems Engineering and Engineering Management, William M. W. Wong Engineering Building, The Chinese University of Hong Kong, Shatin, Hong Kong, People’s Republic of China. E-mail: [email protected]

Accepted January 4, 2007

© 2007 Wiley Periodicals, Inc. • Published online 3 August 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.20632


The objective of our work is to devise a method for automatically generating lists of related queries given an initial input query submitted to a search engine. We propose a model for discovering related queries for a given input query by mining search engine query logs using an improved version of association rules. We identify more features than simply the correlations between queries to measure their relatedness, namely the edit distance similarity between queries. Another significant contribution is that we propose a simple but effective and efficient user session segmentation algorithm, the quality of which is critical for discovering related queries using association rules. On the basis of our model, we can devise a very efficient system that can handle real-time query streams submitted to Web search engines. The automatic system can not only identify the related queries but also rank them according to their relatedness to the initial input query. Its computational complexity should be reasonable so that it can be implemented in a real search engine. The experimental evaluation results also indicate that our proposed model significantly outperforms other rival models.

Related Work

Query expansion is a popular technique in information retrieval to modify users’ queries in order to obtain better precision and recall. Under the bag-of-words model, if a relevant document does not contain the terms that are in the query, that document will not be retrieved. The purpose of query expansion is to reduce this query/document mismatch by expanding the original query using additional words or phrases with a similar meaning or some other statistical relation to the set of relevant documents, which can improve the effectiveness of ranked retrieval. For example, Buckley, Salton, Allan, and Singhal (1995) proposed a simple approach of expanding search engine queries. Related terms are extracted from the top documents that are returned in response to the original query using statistical heuristics, and the query is expanded using these extracted terms. The expanded query could be treated as an artificial reformulation of the original query, hence providing the user an alternative query suggestion, and may improve the retrieval performance. This approach has been shown to be effective on some collections (Buckley et al., 1995), but results on large collections of Web data have been mixed and sometimes even negative (Billerbeck & Zobel, 2003).

A good number of existing works have suggested methods for utilizing Web search engine query logs to mine related queries. Cui, Wen, Nie, and Ma (2002) proposed a method for finding the relatedness between queries and phrases of documents on the basis of query logs. They hypothesized that the click-through information available in search engine query logs represented evidence of relatedness between queries and documents chosen to be visited by users. This evidence is called cross-reference of documents. On the basis of this evidence, the authors establish relationships between queries and phrases that occur in the chosen documents. These relationships are then used to expand the initial query or to give query suggestions. This approach can also be used to cluster queries extracted from log files (Wen, Nie, & Zhang, 2001). Cross-reference of documents is combined with similarity functions based on query content, edit distance, and document hierarchy to find better clusters. These clusters are used in question answering systems to find similar queries. A common weakness of these approaches is that the click-through information has to be available in query logs; surprisingly, that is not the case for many search engines, including the one in our study. Besides, the computational requirements of these models are high, since they need to establish the relationships between queries and key terms in the chosen documents first before measuring the relatedness of queries using complex metrics.

Pu, Chuang, Shui-Lung, and Yang (2002) performed automatic classification on Chinese search engine query log data in order to classify queries into hierarchical query cluster structures that are characterized by their topics. They compared the performance of human categorization and machine categorization of queries into different subjects. Queries in the same cluster are mostly topically related, and hence clustering can be considered another approach to finding relatedness between queries. However, the automatic classification of queries relies heavily on high-quality training data, which often involve the efforts of human annotators. Besides, query log data are quite dynamic and search trends evolve quickly, making it difficult for automatic classification to perform well, given that training data cannot cover many up-to-date queries.

Huang, Chien, and Oyang (2003) presented another log-based approach, based on the observation that the relevant terms to suggest for a user input query are those that co-occur in similar query sessions from search engine logs, rather than in the retrieved documents. A correlation matrix of query terms was built, and three notable similarity estimation functions were applied to calculate the relevance between query terms, i.e., Jaccard’s measure, the dependence measure, and the cosine measure. Finally, the queries most relevant to the input query were identified according to the ranking of relevance between the input query and all other queries. The suggested terms in each interactive search step with the user could be organized according to their relevance to the entire query session, rather than to the most recent single query. They showed that their log-based method generated better results than document-based methods.

Chien and Immorlica (2005) attempted to address the problem by measuring the temporal correlation of queries. They inferred that two queries were related if their popularities behaved similarly over time, as reflected in query logs. They defined in their model a new measure of temporal correlation of two queries based on the correlation coefficient of their frequency functions. They also developed a method of efficiently finding the highest correlated queries for a given input query using reasonable space and time, making real-time implementation possible according to their claim. However, because they did not present any systematic experimental result that compares the performance of their model with that of others, the claimed superiority of their proposed model is questionable. In our comparative experimental evaluation, the temporal correlation model is used as a benchmark, and it is found that this model is not as effective as other rival models on our test dataset.
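As a rough illustration of the temporal-correlation idea, the sketch below computes the Pearson correlation coefficient between the daily frequency series of two queries. The function name `temporal_correlation` and the sample counts are hypothetical, and the exact measure defined by Chien and Immorlica (2005) may differ in detail from plain Pearson correlation.

```python
from math import sqrt
from typing import Sequence


def temporal_correlation(freq_a: Sequence[float], freq_b: Sequence[float]) -> float:
    """Pearson correlation of two equally long query-frequency series."""
    if len(freq_a) != len(freq_b) or len(freq_a) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(freq_a)
    mean_a = sum(freq_a) / n
    mean_b = sum(freq_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(freq_a, freq_b))
    var_a = sum((a - mean_a) ** 2 for a in freq_a)
    var_b = sum((b - mean_b) ** 2 for b in freq_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant series carries no temporal signal
    return cov / sqrt(var_a * var_b)


# Made-up daily submission counts for three queries over one week.
world_cup = [5, 8, 40, 90, 85, 30, 6]
football_scores = [2, 4, 22, 47, 41, 15, 3]
tax_forms = [30, 28, 27, 5, 4, 29, 31]

print(temporal_correlation(world_cup, football_scores))  # close to 1
print(temporal_correlation(world_cup, tax_forms))        # strongly negative
```

Queries whose popularity rises and falls together score near 1 regardless of how different their terms are, which is what lets this family of methods find related queries that share no words.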

Fonseca, Golgher, De Moura, and Ziviani (2003) segmented query sessions in search engine query logs into subsessions and then used association rules to extract related queries from those subsessions. Association rules are widely used to develop high-quality recommendation systems in e-commerce applications available on the Web (Agrawal, Imielinski, & Swami, 1993; Agrawal & Srikant, 1994). These applications take user sessions stored in system logs to obtain information about user behavior in order to recommend services and products. The same idea is applied to find related queries and provide suggestions to Web search engine users. They find previous search patterns that match the current query and use this information to suggest related queries that may be useful to users. However, generally speaking, their work is only exploratory and failed to examine the semantic relational patterns hidden in query logs. Nor did they suggest an “effective and efficient” method to extract query sessions from query logs accurately. Despite these shortcomings, their association rule mining model proved successful according to their experimental evaluation. This inspired us to develop an enhanced model based on association rules.
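To make the association-rule idea concrete, the following sketch computes the classical support and confidence of every rule "query a implies query b" over a pool of sessions, treating each session as a set of queries. The sample sessions and the function names `mine_rules` and `suggest` are hypothetical. The sketch follows the textbook definitions (Agrawal et al., 1993) and is not the exact procedure of Fonseca et al. (2003).

```python
from collections import Counter
from itertools import permutations
from typing import Dict, Iterable, List, Set, Tuple

Rule = Tuple[str, str]  # (antecedent query, consequent query)


def mine_rules(sessions: Iterable[Set[str]],
               min_support: float = 0.0) -> Dict[Rule, Tuple[float, float]]:
    """Return {(a, b): (support, confidence)} for all co-occurring query pairs."""
    sessions = list(sessions)
    total = len(sessions)
    single: Counter = Counter()
    pair: Counter = Counter()
    for s in sessions:
        single.update(s)
        pair.update(permutations(sorted(s), 2))  # ordered pairs with a != b
    rules = {}
    for (a, b), n_ab in pair.items():
        support = n_ab / total          # fraction of sessions containing both
        confidence = n_ab / single[a]   # fraction of a's sessions that contain b
        if support >= min_support:
            rules[(a, b)] = (support, confidence)
    return rules


def suggest(query: str, rules: Dict[Rule, Tuple[float, float]], k: int = 3) -> List[str]:
    """Rank consequents of `query` by confidence (ties broken alphabetically)."""
    scored = [(-conf, b) for (a, b), (_, conf) in rules.items() if a == query]
    return [b for _, b in sorted(scored)[:k]]


# Made-up sessions for illustration.
sessions = [
    {"madonna", "madonna lyrics"},
    {"madonna", "like a prayer"},
    {"madonna", "madonna lyrics", "like a prayer"},
    {"like a prayer"},
]
rules = mine_rules(sessions)
print(rules[("madonna lyrics", "madonna")])  # (0.5, 1.0)
print(suggest("madonna", rules))             # ['like a prayer', 'madonna lyrics']
```

Note that confidence is asymmetric: every session containing "madonna lyrics" also contains "madonna", but not the other way round.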

Later, in another work, Fonseca and associates (2005) slightly improved the segmentation algorithm that segments the query sessions into subsessions. They also applied the previously proposed model (Fonseca et al., 2003) to identify the concept taxonomies embedded in the search engine query logs. They calculated the relatedness between all queries using the association rule mining model and then built a query relation graph. The query relation graph was used for identifying concepts related to a given user input query. However, the underlying model was the same one previously proposed (Fonseca et al., 2003).

We see the shortcomings of the association rule mining model (Fonseca et al., 2005) and propose our system, which is based on an improved version of association rules. The major contributions of our work are the following: (1) we propose a simple but effective and efficient user session segmentation algorithm, the quality of which is critical for identifying related queries using statistical associations; (2) we propose a model for discovering related queries given an input query by mining search engine query logs, utilizing features including co-occurrence and edit distance similarity between queries. Our model can not only be implemented as a query recommendation system that suggests related queries to an input query, but also be employed further for query expansion.

System Architecture

Our system generally consists of three layers: (1) extracting user sessions from search query logs, (2) segmenting extracted user sessions into query transactions, and (3) mining related queries from query transactions using our proposed model. Figure 1 shows the blueprint of the architecture of our proposed system.

1. Extracting user sessions from query logs: This layer handles the original data of search engine query logs. It extracts user sessions from query logs by identifying the query records that belong to the same user, who is characterized by his/her unique Internet Protocol (IP) address. This assumption may be weakened if we consider the existence of publicly shared computers. However, our system can still counteract the possible negative effects caused by the confusion of user sessions, because related queries are identified statistically on the basis of extracted query transactions instead of user sessions. Since the technique for this layer is straightforward, we do not discuss it in detail here.

2. Segmenting user sessions into query transactions: This layer accepts the extracted user sessions as inputs and segments each of them into query transactions. The segmented query transactions are then pooled together. After this stage, the user identities and time-stamps of query transactions are no longer used. The details of this segmentation algorithm are presented in the next section.

3. Discovering related queries from query transactions: In this layer, an initial input query submitted by a certain user is passed to the system as input if it satisfies some criterion. Query transactions are then fetched from the pool, and relatedness is calculated between the input query and any other query that satisfies predefined constraints. The predefined criteria and constraints may include, but are not limited to, thresholds on the raw frequencies of queries. Having measured the relatedness between the input query and all other queries, the system applies a threshold and ranks the qualified queries according to their relatedness score. Finally the top k related queries are returned, where k is either predetermined by the system or specified by the user. We elaborate this model in more detail in the next section.

FIG. 1. System architecture of the related query mining system.
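The first layer is described only briefly above. A minimal sketch of it, assuming each log entry has already been parsed into a (query, IP address, time-stamp) record, is to group records by IP address and sort each group by time. The `QueryRecord` type, the function name, and the sample log are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, NamedTuple


class QueryRecord(NamedTuple):
    query: str
    ip: str
    timestamp: datetime


def extract_user_sessions(records: Iterable[QueryRecord]) -> Dict[str, List[QueryRecord]]:
    """Group query records by IP address and order each group by time-stamp."""
    sessions: Dict[str, List[QueryRecord]] = defaultdict(list)
    for rec in records:
        sessions[rec.ip].append(rec)
    for recs in sessions.values():
        recs.sort(key=lambda r: r.timestamp)
    return dict(sessions)


# Made-up log entries for illustration; they need not arrive in time order.
log = [
    QueryRecord("adobe photoshop 7", "10.0.0.1", datetime(2007, 1, 4, 9, 5)),
    QueryRecord("sars", "10.0.0.2", datetime(2007, 1, 4, 9, 1)),
    QueryRecord("adobe", "10.0.0.1", datetime(2007, 1, 4, 9, 0)),
]
sessions = extract_user_sessions(log)
for ip, recs in sessions.items():
    print(ip, [r.query for r in recs])
```

The output of this step is what the second layer would then split into query transactions.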

Besides the underlying model, our system differs from other comparable systems in that it can handle search engine query logs that record only minimal information. In principle, our system can operate on query log data that have only three pieces of information available, i.e., the query terms, the IP address of the user who submits that query, and the time-stamp when the query is submitted. This is the basic information that almost every search engine saves in its query logs. However, not all search engines save additional information, such as click-through uniform resource locators (URLs). Therefore, it is important for a general model of mining related queries to be compatible with such query log data. In addition, our system does not require any additional resources, such as a dictionary or thesaurus. This also makes our system independent of environmental configurations and reduces the computational requirements.

Mining Related Queries From Query Logs

As mentioned in the previous section, our proposed model consists of three stages: (1) extracting user sessions from search engine query logs; (2) segmenting the extracted user sessions into query transactions; and (3) given an input query, calculating the degree of its relatedness to all other queries by mining association rules from all the query transactions, ranking the results, and finally returning the most related queries to users. The first two stages can be merged into one, since they can be done simultaneously.

This section is organized as follows: first we present the definitions of some key terms; in the second subsection we introduce the Levenshtein distance similarity utilized in our algorithm and model; in the third subsection we describe the algorithm for segmenting user sessions into query transactions, the quality of which is critical for discovering related queries; in the fourth subsection we introduce the traditional technique for mining association rules; finally we propose our model, which is based on an enhanced version of association rules.

Definitions

Before introducing the details of our model, we give the definitions of a few terms that are used in the model descriptions.

Query record: A query record represents the submission of one single query from a user to the search engine at a certain time. Thus, a query record can be represented as a triplet $I_i = (q_i, ip_i, t_i)$, where $q_i$ is the submitted query consisting of terms, $ip_i$ is the IP address of the host from which the user issues the query,¹ and $t_i$ represents the time-stamp when the user submits that query.

¹ Here we use the host IP address as the identity of a unique user. We understand that in some scenarios host IP addresses fail to represent a unique user, e.g., public computers and computers assigned multiple IP addresses. However, we assume these cases are rare and negligible.

Query transaction: A query transaction is the search process (1) with the search interest focusing on the same topic or strongly related topics, (2) in a bounded and consecutive period, and (3) issued by the same user. A query transaction can be represented as a series of query records in temporal order, i.e., $T_j = \{I_{j1}, I_{j2}, \ldots, I_{jm}\} = \{(q_{j1}, ip_{j1}, t_{j1}), (q_{j2}, ip_{j2}, t_{j2}), \ldots, (q_{jm}, ip_{jm}, t_{jm})\}$, where $ip_{j1} = ip_{j2} = \cdots = ip_{jm}$ and $t_{j1} \le t_{j2} \le \cdots \le t_{jm}$. Note that the queries in different query records need not be different; a repeated query means that the user is continuing to use the same formulation of query terms.

User session: A user session contains all the query records that belong to the same user regardless of their time-stamps and therefore is a complete record of that user’s search history. A user session can also be represented as a series of query records in temporal order, i.e., $S_k = \{I_{k1}, I_{k2}, \ldots, I_{kn}\} = \{(q_{k1}, ip_{k1}, t_{k1}), (q_{k2}, ip_{k2}, t_{k2}), \ldots, (q_{kn}, ip_{kn}, t_{kn})\}$, where $ip_{k1} = ip_{k2} = \cdots = ip_{kn}$, $t_{k1} \le t_{k2} \le \cdots \le t_{kn}$, and $n \ge m$. A user session can usually be decomposed into multiple query transactions because search engine users often have multiple information needs at different times or shift their search interests from time to time. As we assume that any unique user can be identified by his/her IP address, we simply take a user session to consist of all query records that share the same IP address in the query log.

Given the definitions of query transaction and user session, we have the following constraints:

$$\forall i\ \exists j, k:\quad I_i \in T_j \subseteq S_k \tag{1}$$

$$\forall j, k:\quad T_j \neq \emptyset,\ S_k \neq \emptyset \tag{2}$$

$$\forall i \neq j,\ p \neq q:\quad T_i \cap T_j = \emptyset,\ S_p \cap S_q = \emptyset \tag{3}$$

$$\forall j\ \exists k:\quad T_j \subseteq S_k \tag{4}$$

Expression (1) means that every query record must belong to a single query transaction, and thus a single user session. Expressions (2) and (3) assert that all transactions and user sessions are nonempty and nonoverlapping. Expression (4) states that every transaction must be a subset of or equal to a single user session.

The difference between a query transaction and a user session is that queries in a query transaction represent a continuous search process in which the user is interested in a single topic or at least strongly related topics. This inspires us with the idea that queries in the same transaction should be related, since they represent the user’s formulations (and reformulations) of his/her same information need at that time. Even though some users may issue wrong formulations of a certain topic of interest for various reasons, a good statistical approach can eliminate most of such noise. Therefore the problem of discovering related queries is decomposed into two subtasks: (1) how to segment a user session $S_k$ into a series of query transactions effectively and efficiently, i.e., $\{T_{k1}, T_{k2}, \ldots, T_{kl}\}$ ($l \le m \le n$), and (2) how to measure statistically (and probably semantically) the associations between queries, to provide the best model of human users’ judgments of related queries during query transactions.
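Taken together, the constraints on query records, transactions, and user sessions say that the transactions of each user partition that user's session. The sketch below checks this property for a candidate segmentation. Representing query records by integer ids and the function name `is_valid_segmentation` are assumptions made for illustration only.

```python
from typing import Dict, List


def is_valid_segmentation(sessions: Dict[str, List[int]],
                          transactions: List[List[int]]) -> bool:
    """Check that transactions are nonempty, disjoint, each inside one session,
    and that together they cover every query record."""
    # Sessions and transactions must be nonempty.
    if any(not s for s in sessions.values()) or any(not t for t in transactions):
        return False
    # Sessions must not overlap: each record id is owned by exactly one session.
    owner: Dict[int, str] = {}
    for ip, recs in sessions.items():
        for r in recs:
            if r in owner:
                return False
            owner[r] = ip
    seen = set()
    for t in transactions:
        # Each transaction must lie within a single user session.
        ips = {owner.get(r) for r in t}
        if len(ips) != 1 or None in ips:
            return False
        # Transactions must not overlap (or repeat a record internally).
        if seen.intersection(t) or len(set(t)) != len(t):
            return False
        seen.update(t)
    # Every query record must belong to some transaction.
    return seen == set(owner)


# Made-up record ids: one user issued records 0..3, another issued record 4.
sessions = {"10.0.0.1": [0, 1, 2, 3], "10.0.0.2": [4]}
print(is_valid_segmentation(sessions, [[0, 1], [2, 3], [4]]))     # True
print(is_valid_segmentation(sessions, [[0, 1], [1, 2, 3], [4]]))  # False: overlap
print(is_valid_segmentation(sessions, [[0, 1, 4], [2, 3]]))       # False: spans two users
print(is_valid_segmentation(sessions, [[0, 1], [4]]))             # False: records 2, 3 uncovered
```

A check like this is cheap to run after segmentation and catches the most common bookkeeping errors before any association mining is attempted.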

Levenshtein Distance Similarity

A common approach to string matching is edit distance, which is widely applied in tasks including spell checking and plagiarism detection. It defines a set of edit operations, such as insertion or deletion of a word,² together with a cost for each operation. The distance between two text strings is then defined to be the sum of the costs in the cheapest chain of edit operations transforming one text string into the other. Having insertion, deletion, and substitution as operations, each at the cost of 1, yields the Levenshtein distance (Levenshtein, 1966). For example, in Figure 2(a), the Levenshtein distance between “ ” (SARS) and “ ” (full name of SARS) is 1, and it is 2 between “adobe photoshop 7” and “adobe” in Figure 2(b).
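The unit-cost distance described above can be computed with the standard dynamic-programming recurrence. The sketch below works on token sequences, so the same function serves word-level comparison of English queries and character-level comparison. The function names and the `by_char` flag are hypothetical.

```python
from typing import List, Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost edit distance (insert, delete, substitute) between token sequences."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,           # delete tok_a
                            curr[j - 1] + 1,       # insert tok_b
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def tokens(query: str, by_char: bool = False) -> List[str]:
    """Split a query into words, or into characters when by_char is set."""
    return list(query.replace(" ", "")) if by_char else query.split()


print(levenshtein(tokens("adobe photoshop 7"), tokens("adobe")))  # 2
print(levenshtein(tokens("kitten", by_char=True),
                  tokens("sitting", by_char=True)))               # 3
```

Keeping only two rows of the table keeps memory linear in the length of the shorter input, which matters little for short queries but costs nothing.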

Here we define a similarity measurement that uses the Levenshtein distance between queries for surface similarity estimation. Comparison between queries thus becomes an inexact string-matching problem. The Levenshtein distance is divided by the maximal number of words (or characters for Chinese queries) in the two queries, so that the value is normalized to the range [0, 1]. The similarity is the complement of the normalized Levenshtein distance:

$$\mathrm{similarity}_{\mathrm{Levenshtein}}(q_1, q_2) = 1 - \frac{\mathrm{Levenshtein\_distance}(q_1, q_2)}{\max(wn(q_1), wn(q_2))}$$

where $wn(\cdot)$ is the number of words in a query. The smaller the Levenshtein distance between queries and the larger the maximal number of words in the two queries, the more similar the two queries are. A pair of queries with $\mathrm{similarity}_{\mathrm{Levenshtein}} = 1$ are exactly the same. For example, for “adobe photoshop 7” and “adobe” the distance is 2 and the longer query has three words, so the similarity is 1 − 2/3 = 1/3.

The advantage of this similarity measurement is that it takes into account both the query terms and the word order. Besides, the Levenshtein distance similarity tends to favor those pairs of queries that have relatively more terms in them and closely match each other both in their terms and in the order of terms, thereby reducing false positive results on short queries.

However, since most search engine queries are relatively short and concise, the Levenshtein distance similarity often fails to identify related queries, other than those closely matching each other, whose term order or combination differs from that of the original queries as a result of user reformulation. This is why, to the best of our knowledge, it has seldom been applied on its own to the problem of discovering related search engine queries.

In our system, the Levenshtein distance similarity is implemented as a supplemental and nonpenalizing decaying factor in the model of discovering related queries. The model is based on a modified-confidence version of mining association rules, thus statistically identifying highly correlated queries and eliminating noise queries. The Levenshtein distance similarity is employed to promote, among the highly correlated queries, those that closely match each other, without significantly penalizing those that differ from each other. In a later subsection, we introduce the details of how to combine the two measures.

² In Chinese, generally speaking, the base semantic unit of language is the character instead of the word as in English. Although there are a few instances of Chinese words that cannot be decomposed into characters that have separate semantic meanings, we still use the character as the base unit, because (1) it is usually difficult and computationally expensive to segment very short and concise Chinese queries into exact words, and (2) the Levenshtein distance does take the order of characters in queries into consideration.

FIG. 2. Levenshtein distances between sample queries.

Segmenting User Sessions Into Query Transactions

Within a single query transaction, according to its definition, the user is focused on a single topic or strongly related topics and hence submits queries that are all directed toward that topic or those topics. These queries themselves are interrelated, since they represent the same user’s information need on a single topic or a single group of strongly related topics. If we can mine the statistical associations between these queries, related queries can then be identified. The segmentation of user sessions into query transactions is therefore critical for mining associations between related queries, and hence a good segmentation algorithm needs to be devised.

However, most previous work did not pay enough attention to this issue. Many studies simply implement an interval segmentation algorithm that segments the user sessions into statistically bounded sections. For example, Fonseca and coworkers (2003) define a query transaction (see footnote 3) by including all the queries submitted by the same user within a predefined time interval t, and use 10 minutes as the value of t. Later, Fonseca and colleagues (2005) improved this segmentation algorithm by utilizing the length of the interval between adjacent query submissions in the same user session: they define a query transaction to include all query submissions that are 10 minutes or less apart from each other.

Our proposed segmentation algorithm improves on the previous interval segmentation algorithm in two respects. First, our segmentation algorithm implements two types of interval lengths and one type of query transaction time window length. We define max_transaction_interval_length as the maximal interval length allowed between adjacent query records in the same query transaction. It is similar to the interval used in previous work; however, it no longer serves as the only factor that indicates the boundary of query transactions. In addition, we define another interval length, max_inactive_interval_length, as the maximal length of the period during which the user is allowed to be inactive. If the time between two adjacent query records in the same user session is larger than max_inactive_interval_length, then the later query record definitely indicates the start of a new query transaction. Besides the two types of interval lengths, we define max_transaction_time_window_length as the maximal length of the time window that a query transaction is allowed to span. This constraint bounds the period during which the user is focused on a single topic or strongly related topics. In our experimental setup we empirically set max_transaction_interval_length, max_inactive_interval_length, and max_transaction_time_window_length to 5 minutes, 24 hours, and 60 minutes, respectively.

Second, we also utilize the similarity between adjacent queries at the hypothesized boundary of query transactions in the same user session to determine whether they should belong to the same query transaction. We employ the Levenshtein distance similarity to measure the surface similarity between the two queries. If their Levenshtein distance similarity is above the threshold min_levenshtein_distance_similarity_for_related_queries, we simply treat the later query as a modified version of the former one and include them in the same query transaction; otherwise we separate them into two query transactions.

All of these rules are combined to formulate our segmentation algorithm. As mentioned, the computational complexity of this algorithm is maintained at the same level, i.e., O(n), as that of previous segmentation algorithms. It does not require additional resources such as thesauri or ontologies, nor does it need to be trained on any training set. Our experimental evaluation results show that our proposed segmentation algorithm improves significantly over the previous naive segmentation algorithms. Details of the user session segmentation algorithm are presented in Figure 3.
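The rules above can be combined in more than one way; the following single-pass sketch shows one plausible combination and is not a transcription of Figure 3. It assumes that a hypothesized boundary arises whenever the gap between adjacent records exceeds max_transaction_interval_length or the transaction would exceed max_transaction_time_window_length, and that the similarity test is applied only at such boundaries. The threshold value `MIN_SIMILARITY = 0.5` is hypothetical, and the `similarity` helper is a simplified word-level version of the Levenshtein distance similarity.

```python
from datetime import datetime, timedelta

MAX_TRANSACTION_INTERVAL = timedelta(minutes=5)   # max_transaction_interval_length
MAX_INACTIVE_INTERVAL = timedelta(hours=24)       # max_inactive_interval_length
MAX_TRANSACTION_WINDOW = timedelta(minutes=60)    # max_transaction_time_window_length
MIN_SIMILARITY = 0.5  # hypothetical threshold value, chosen for illustration only


def similarity(q1, q2):
    """Word-level Levenshtein distance similarity (simplified helper)."""
    a, b = q1.split(), q2.split()
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (wa != wb)))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def segment_session(records):
    """Split one user's time-ordered (timestamp, query) records into
    query transactions in a single pass over the records."""
    transactions, current = [], []
    for time, query in records:
        if current:
            prev_time, prev_query = current[-1]
            gap = time - prev_time
            span = time - current[0][0]
            if gap > MAX_INACTIVE_INTERVAL:
                same = False  # inactive for too long: always a new transaction
            elif gap <= MAX_TRANSACTION_INTERVAL and span <= MAX_TRANSACTION_WINDOW:
                same = True   # well inside the current transaction
            else:
                # Hypothesized boundary: keep the records together only if the
                # later query looks like a modified version of the former one.
                same = similarity(prev_query, query) > MIN_SIMILARITY
            if not same:
                transactions.append(current)
                current = []
        current.append((time, query))
    if current:
        transactions.append(current)
    return transactions
```

Each record is compared only with its predecessor and with the first record of the current transaction, so the number of comparisons grows linearly with the number of records in the session.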

Mining Association Rules

In this subsection, we briefly describe the concept and definition of mining association rules. We adhere mostly to the definitions and descriptions formalized in Agrawal and associates (1993) to review the needed concepts.

Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of binary attributes called items. Let $T$ be a database of transactions. Each transaction $t$ can be represented by a binary vector, with $t[k] = 1$ if $t$ bought the item $I_k$, and $t[k] = 0$ otherwise. Let $X$ be a set of some items in $I$. A transaction $t$ satisfies $X$ if $t[k] = 1$ for all items $I_k$ in $X$.

By an association rule we mean an implication $X \Rightarrow I_j$, where $X \subset I$ and $I_j \notin X$. We say that the association rule $X \Rightarrow I_j$ has a confidence factor of $c$ if $c\%$ of the transactions in $T'$ satisfy $\{I_j\}$, given that all transactions in $T'$ ($T' \subset T$) satisfy $X$. We will use the classical notation $X \Rightarrow I_j \mid c$ to specify that the association rule $X \Rightarrow I_j$ has a confidence factor of $c$. We also say that the association rule $X \Rightarrow I_j$ has a support factor of $s$ if $s\%$ of the transactions in $T$ satisfy both $X$ and $\{I_j\}$ at the same time. Note that support should not be confused with confidence: while confidence is a measure of the association rule's strength, support corresponds to its statistical significance.

Given the complete set of all transactions $T$, the problem of mining association rules is therefore to generate all association rules that satisfy the following two constraints:

1. Syntactic constraint: This kind of constraint involves restrictions on the items that can appear in an association rule. For example, we may be interested only in rules that have a specific item $I_x$ appearing in the consequent, or rules that have a specific item $I_y$ appearing in the antecedent. Sometimes we may also want to limit the size of $X$ in the antecedent.

2. Support constraint: This constraint requires the discovered association rules to have a support factor greater than a specified minimal support (called min_support). This constraint ensures that most noise (e.g., unusual purchasing behavior or corrupted data) in the transaction set can be statistically eliminated, along with the insignificant association rules, which are of no business value.
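To make the two factors concrete, the following minimal sketch computes the support and confidence of a rule over a toy transaction set. The market-basket data and the function names are hypothetical and serve only as an illustration of the definitions above.

```python
def support(transactions, items):
    """Percentage of all transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return 100.0 * hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Percentage of the transactions satisfying `antecedent` (the set X)
    that also contain the single item `consequent` (the item I_j)."""
    antecedent = set(antecedent)
    matching = [t for t in transactions if antecedent <= set(t)]
    if not matching:
        return 0.0
    hits = sum(1 for t in matching if consequent in t)
    return 100.0 * hits / len(matching)


# Hypothetical toy data: four purchase transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})    # 2 of 4 transactions
rule_confidence = confidence(baskets, {"bread"}, "milk")  # 2 of the 3 with bread
```

Here the rule bread ⇒ milk has a support factor of 50% (two of the four transactions contain both items) and a confidence factor of about 66.7% (two of the three transactions containing bread also contain milk).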

Discovering Related Queries

Mining related queries from query logs is intrinsically similar to mining association rules. Our proposed model is a modified-confidence version of the traditional approach of mining association rules. We restate the definitions of association rules for the problem of mining related queries from search engine query logs. Here, we define $Q = \{q_1, q_2, q_3, \ldots, q_n\}$ as the set of unique queries from the query log files and $T$ as the set of query transactions $t$. For each $t$ there is a binary vector $t[k]$ such that $t[k] = 1$ if query transaction $t$ contains a query record $I_i$ that searched for query $q_k$ (see footnote 4), and $t[k] = 0$ otherwise. Let $X$ be a set of some unique queries in $Q$. A transaction $t$ satisfies $X$ if $t[k] = 1$ for all queries $q_k$ in $X$.

3 In their work they use the notion "user session" to represent the idea of query transaction in our paper. Note that their "user session" is generally identical to our "query transaction" and they did not formally define the idea of user sessions as in our work.

4 Note that here there may be multiple query items in the same query transaction that searched for the same query. This usually happens when the user requests different page views of the results returned by the search engine, or when the user returns to a previously submitted query because he or she thinks that query retrieves the best results.

FIG. 3. Segmentation algorithm.

Here the association rule is redefined to mean an implication $X \Rightarrow q_j$, where $X \subset Q$ and $q_j \notin X$. Because we are interested only in finding related queries given an initial input query, the set $X$ contains only the initial input query $q_i$, i.e., $X = \{q_i\}$. Therefore the association rule in this problem becomes $q_i \Rightarrow q_j$, where $q_i \in Q$, $q_j \in Q$, and $q_i \neq q_j$. The problem of mining related queries is thus simplified to finding the statistical associations between the input query and any other query. It is certainly also possible to find the statistical associations among more than two queries, that is, $X \Rightarrow q_j$ where $|X| > 1$; however, that is beyond the scope of this paper.

The support factor of an association rule $q_i \Rightarrow q_j$ is defined as in the previous subsection: $q_i \Rightarrow q_j$ has a support factor of $s$ if $s\%$ of the transactions in $T$ satisfy both $\{q_i\}$ and $\{q_j\}$ at the same time, written $q_i \Rightarrow q_j \mid s$. However, we modify the definition of the confidence factor of the association rule $q_i \Rightarrow q_j$ so that it incorporates the Levenshtein distance similarity, and hence better models the relatedness of queries. We define the raw confidence factor of the association rule $q_i \Rightarrow q_j$ to be $rc$ if $rc\%$ of the transactions in $T'$ satisfy $\{q_j\}$, given that all transactions in $T'$ also satisfy $\{q_i\}$, written $q_i \Rightarrow q_j \mid rc$. We then combine the raw confidence factor with the Levenshtein distance similarity between $q_i$ and $q_j$. The final confidence factor of $q_i \Rightarrow q_j$ is calculated via the following formula:

$$(q_i \Rightarrow q_j \mid c) = (q_i \Rightarrow q_j \mid rc) \times e^{\mathrm{similarity}_{\mathrm{Levenshtein}}(q_i, q_j)}$$

The reason for introducing the new confidence factor incorporating the Levenshtein distance similarity is rather intuitive. The mining of association rules relies heavily on the statistical co-occurrence of queries across all the query transactions. However, because of the lack of intelligence in segmenting user sessions into query transactions, and hence the errors in that process, some loosely related but very frequent queries may co-occur in many of the query transactions. For example, in our empirical study, we found that the Chinese query meaning "music" often co-occurred with the Chinese query meaning "movie" in the Tianwang query log data. It might be true that the two queries are loosely related on some occasions, but a query expansion or recommendation system is usually not supposed to suggest one of them as a related query for the other, since their connection is far from conspicuous. In another scenario, the two Chinese queries meaning "download" and "movie downloads" are both considered to be related to the input query meaning "movie." However, "download" appears much more frequently in different query transactions than "movie downloads" in our Tianwang query logs; according to the traditional model of mining association rules, the confidence of "movie" ⇒ "download" must be greater than that of "movie" ⇒ "movie downloads." Yet "movie downloads" is obviously more closely related to "movie" than "download" is, since there may be many other things to download, such as music, software, video, or any other downloadable resources.

There is no mature technique that can tackle this problem completely. However, a good attempt is to utilize the surface similarity between queries as a nonpenalizing decaying factor. We supplement the traditional confidence factor with the Levenshtein distance similarity as a decaying factor measuring the surface similarity between queries. If two queries are very similar in their query terms, then the confidence of their association is promoted, in addition to the statistical association calculated using association rule mining; otherwise, the confidence of their association is not penalized significantly, since we also want to retrieve related queries regardless of their surface similarities. Using the base of the natural logarithm ensures that the promotion takes place only on those pairs of queries that are significantly similar; otherwise the promotion is not conspicuous, and hence has no significant negative effects.

Given the input query $q_i$, we presume that there is an association rule from $q_i$ to any other query $q_j$:

$$\forall j,\quad q_i \Rightarrow q_j,\quad q_i \in Q,\ q_j \in Q,\ q_i \neq q_j$$

We calculate with our model the support factor $q_i \Rightarrow q_j \mid s$ and the confidence factor $q_i \Rightarrow q_j \mid c$ of every association rule $q_i \Rightarrow q_j$. We first set the threshold value of min_support to filter out those queries whose associations are not statistically strong enough. Next, we rank the list of association rules according to their confidence factors. Finally, we select the top portion of the list and retrieve the set of queries most related to the input query $q_i$.
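The filtering and ranking procedure can be sketched as follows. The sketch assumes that the final confidence factor is the raw confidence factor multiplied by e raised to the Levenshtein distance similarity, that each transaction has already been reduced to a set of distinct queries, and that a simplified word-level similarity is adequate for illustration. The function names and the toy English data are hypothetical.

```python
import math
from collections import Counter


def similarity(q1, q2):
    """Word-level Levenshtein distance similarity (simplified helper)."""
    a, b = q1.split(), q2.split()
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (wa != wb)))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def rank_related_queries(transactions, input_query, min_support_pct, top_k):
    """Return up to `top_k` (query, final confidence) pairs, best first."""
    total = len(transactions)
    containing = [t for t in transactions if input_query in t]
    if not containing:
        return []
    co_counts = Counter(q for t in containing for q in t if q != input_query)
    ranked = []
    for candidate, count in co_counts.items():
        support_pct = 100.0 * count / total
        if support_pct < min_support_pct:
            continue  # association is not statistically strong enough
        raw_conf = 100.0 * count / len(containing)
        final_conf = raw_conf * math.exp(similarity(input_query, candidate))
        ranked.append((candidate, final_conf))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical toy log: "download" co-occurs with "movie" more often than
# "movie download" does, but it shares no term with the input query.
toy_transactions = (
    [{"movie", "download"}] * 5
    + [{"movie", "movie download"}] * 4
    + [{"movie"}]
    + [{"music", "download"}] * 10
)
suggestions = rank_related_queries(toy_transactions, "movie",
                                   min_support_pct=10.0, top_k=5)
```

In this toy log the raw confidence of "download" (50%) exceeds that of "movie download" (40%), but the similarity factor leaves the former at 50 × e^0 = 50 and raises the latter to 40 × e^0.5 ≈ 65.9, so "movie download" is ranked first.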

As shown in our experimental evaluation in a later section, our proposed model of modified-confidence association rule mining improves the performance of finding related queries by a considerable percentage. Besides the improvement in precision rates, our empirical study shows that our proposed model also improves the ranking of the retrieved related queries.

Query Log Data

The dataset of Web search engine query logs that we use in our experiments was obtained from the Tianwang Search Engine (see footnote 5). It covers a total period of 19 months between February 2002 and December 2003. Since the Tianwang Search Engine serves mainly Chinese users, particularly Chinese university students, around 80% of the received queries are Chinese queries or contain Chinese terms. Because data are missing or incomplete for some of the months and the monthly numbers of queries received fluctuate considerably, we selected for our experiment a consecutive period of 4 months, from March 2003 to June 2003, during which the data are complete and the monthly numbers of received queries are relatively stable. The total file size of the selected dataset amounts to 1.01 gigabytes. Figure 4 is a snapshot of one segment extracted from the Tianwang Search Engine query log.

As shown in Figure 4, each record in the query log contains only the query terms, the index of the requested result page, the host IP address, and the time stamp at which the query was submitted. Unlike the logs of some other search engines, it does not include information such as users' browser types or their click-through URLs, so many techniques that exploit features like user feedback or click-through data fail to work on it. This situation is rather typical of the many small to medium Web search engines that operate on local Web sites or internal networks. Therefore, it is important that our proposed model be able to work with such imperfect Web search engine systems.

Table 2 shows some statistics about the Tianwang query log data as well as the segmented query transactions. The ratio of Chinese queries to English queries is around 4:1, which is very typical among Chinese Web search engines. The total number of query transactions is counted on the query transactions generated by our proposed segmentation algorithm.


5 Tianwang Search Engine is a well-known Web search engine in China. It was first developed by a research group at Peking University, and therefore its users were mainly university students. It has since been commercialized. URL: http://www.tianwang.com


Since this query log spans only 4 months, most of the unique queries are very infrequent, occurring fewer times than the average.

Experimental Evaluation

In this section we present the experimental setup with which we conducted our experimental evaluation. We also present the experimental results, which show that our proposed model significantly outperformed competing models.

Experimental Setup

For our experiments, we randomly selected 100 test input queries from the list of all queries in the query log, sorted in descending order of frequency. Instead of a uniform random draw, however, the selection follows the frequency distribution of all queries; i.e., queries that occur frequently have higher probabilities of being selected, and vice versa. The process resembles a lottery drawing: every occurrence of a query is assigned an equal chance of being chosen, so highly frequent queries, which correspond to more occurrences, have a higher probability of winning the draw. This is reasonable because queries with relatively high frequencies are far fewer than queries with relatively low frequencies, and thus our selection process ensures that the selection is not biased toward less frequent queries because of their very large number.

In the selection of test input queries, we set a threshold on query frequency to exclude queries whose frequencies are below 50. Our past experience suggests that very infrequent queries are generally of little value for study, since they often include erroneous, faulty, irrational, or rarely seen variants of common queries, or queries on obscure topics that receive little attention. We empirically set the threshold to 50 because, in our investigation, a significant portion of the selected queries began to exhibit these undesirable features when we lowered the threshold below 50.
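The lottery-style selection with a frequency threshold can be sketched as below. The sketch assumes that queries are drawn one at a time without replacement, with a probability proportional to their frequency among the remaining candidates; the exact drawing procedure is not spelled out in the text, and the function name and the toy frequency table are hypothetical.

```python
import random


def select_test_queries(frequencies, sample_size, min_frequency, seed=None):
    """Draw `sample_size` distinct queries, each draw weighted by frequency.

    frequencies: dict mapping query string -> number of occurrences.
    Queries with fewer than `min_frequency` occurrences are excluded.
    """
    rng = random.Random(seed)
    eligible = {q: f for q, f in frequencies.items() if f >= min_frequency}
    if sample_size > len(eligible):
        raise ValueError("not enough eligible queries to sample from")
    queries = list(eligible)
    weights = [eligible[q] for q in queries]
    chosen = []
    for _ in range(sample_size):
        # Every remaining occurrence has an equal chance of winning the draw.
        i = rng.choices(range(len(queries)), weights=weights, k=1)[0]
        chosen.append(queries.pop(i))
        weights.pop(i)
    # List the selected queries in descending order of frequency.
    return sorted(chosen, key=lambda q: eligible[q], reverse=True)


# Hypothetical frequency table for illustration.
toy_frequencies = {"ftp": 29268, "mp3": 22232, "music": 3288, "avi": 63, "rare": 12}
selected = select_test_queries(toy_frequencies, sample_size=3,
                               min_frequency=50, seed=0)
```

With a threshold of 50, the query "rare" can never be selected, while "ftp" and "mp3" are far more likely to be drawn than "avi".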

The 100 selected test input queries are listed in Table 3. They are ranked in descending order of frequency and then separated into four columns, each containing 25 of the 100 queries and hence representing one of four frequency categories: most frequent, second most frequent, second most infrequent, and most infrequent test input queries. Our test set covers almost the entire frequency spectrum (from high to low), and the results of the experiments run on it are representative enough to reflect the true performance of comparable models in the real world.

FIG. 4. A snapshot of the query log file of Tianwang Web search engine.

TABLE 1. Format of the query log record of Tianwang Web search engine.

| Query | Page | IP address | Time |
|---|---|---|---|
| | 3 | 61.236.220.126 | Mar 1 00:01:27 |
| foxpro 6.0 | 1 | 218.16.43.5 | Mar 1 00:01:45 |
| foxpro 6.0 | 12 | 218.16.43.5 | Mar 1 00:01:54 |

TABLE 2. Some statistics of our query log data.

| Statistic | Value |
|---|---|
| Total no. of query records in the query log | 14,002,275 |
| Total no. of unique queries | 3,095,803 |
| Avg. frequency per unique query | 4.53 |
| Total no. of unique queries whose frequency is below avg. | 2,788,839 |
| Total no. of unique Chinese queries | 2,507,638 |
| Total no. of unique English queries | 588,165 |
| Percentage of unique Chinese queries | 81% |
| Percentage of unique English queries | 19% |
| Total no. of query records of Chinese queries | 10,939,744 |
| Total no. of query records of English queries | 3,062,531 |
| Avg. frequency of unique Chinese queries | 4.36 |
| Avg. frequency of unique English queries | 5.21 |
| Total no. of query transactions | 4,141,983 |
| Avg. no. of query records per transaction | 3.38 |
| Total no. of single-query transactions (a) | 2,541,401 |
| Avg. no. of query records per non-single-query transaction | 7.16 |

(a) Single-query transactions are those transactions in which only one distinct query appears. However, there may be multiple query items in the transaction that are submitted at different times but share the same query. Single-query transactions are useless for study since they do not include any co-occurrence of two distinct queries.


Given a test input query, our proposed model of mining related queries accepts it as input and generates as output a list of queries ranked according to their relatedness. If the model retrieves no related queries, then that test input query is omitted when measuring the overall retrieval performance. If the list is not empty, we select the top K queries as the suggested related queries. In our experiment, we set K = 1, 5, 10, 15, or 20; that is, we select the top 1, 5, 10, 15, or 20 retrieved queries as related queries. If the number of available queries in the list is less than K, we use all available queries instead.

We quantify the performance of retrieving related queries using average precision rates. Assume there are $N$ test input queries; every test input query $i$ retrieves a set of queries $R_i$ that are suggested as related queries, which can be empty. Among the retrieved queries, a subset $r_i$ consists of the correctly retrieved related queries. To measure the retrieval performance, we employ two kinds of precision rates: overall average precision and per query average precision. Overall average precision is defined as the total number of correctly retrieved related queries divided by the total number of retrieved queries over all test input queries:

$$P_{overall} = \frac{\sum_{i=0}^{N} |r_i|}{\sum_{i=0}^{N} |R_i|}$$

For per query average precision, a precision rate is first calculated for every test input query, defined as the number of correctly retrieved related queries divided by the number of retrieved queries for that test query; then the sum of the query precision rates is divided by the number of test input queries whose set of retrieved queries is not empty:

$$P_{query} = \frac{\sum_{i=0,\ R_i \neq \emptyset}^{N} \frac{|r_i|}{|R_i|}}{\sum_{i=0,\ R_i \neq \emptyset}^{N} 1}$$
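The two precision rates can be computed as in the following minimal sketch, in which each test input query contributes a pair (retrieved queries, correctly retrieved queries). The function names and the toy judgments are hypothetical.

```python
def overall_precision(results):
    """Total number of correct queries divided by total number of retrieved queries.

    results: list of (retrieved, correct) pairs, one per test input query,
    where `correct` is the subset of `retrieved` judged to be related.
    """
    total_retrieved = sum(len(retrieved) for retrieved, _ in results)
    total_correct = sum(len(correct) for _, correct in results)
    return total_correct / total_retrieved if total_retrieved else 0.0


def per_query_precision(results):
    """Mean of the per-query precision rates, skipping queries for which
    nothing was retrieved."""
    rates = [len(correct) / len(retrieved)
             for retrieved, correct in results if retrieved]
    return sum(rates) / len(rates) if rates else 0.0


# Hypothetical judgments for three test input queries.
toy_results = [
    (["a", "b", "c", "d"], ["a", "b"]),  # precision 0.5
    (["e"], ["e"]),                      # precision 1.0
    ([], []),                            # nothing retrieved: omitted
]
p_overall = overall_precision(toy_results)  # 3 correct out of 5 retrieved
p_query = per_query_precision(toy_results)  # (0.5 + 1.0) / 2
```

The example shows how the two measures can diverge: the overall rate is 0.6 because the first query contributes most of the retrieved items, whereas the per query rate is 0.75 because each non-empty result list is weighted equally.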

Experimental Results

In this subsection we describe our experimental results. Our analysis of retrieval performance is based on comparing the average precision rates obtained by running different competing models on the same test set, i.e., the 100 test input queries mentioned in the previous subsection, given that these models are all tuned to their best possible performance under the same set of constraints. Since our contributions to the technique of discovering related queries lie in the user-session segmentation algorithm and in the model of mining related queries from query transactions, which are two relatively independent components of the system, we compare different combinations of the segmentation algorithm and the model to demonstrate the effectiveness of both.

For the segmentation algorithm, we implemented two comparable algorithms: one adopting the approach described in Fonseca and associates (2005), which is a slightly improved version of the one described in Fonseca and associates (2003, p. 2), and the other adopting our proposed approach. We simply call the former the naive segmentation algorithm and the latter the proposed segmentation algorithm. For the naive segmentation algorithm, we set the transaction interval length to 10 minutes, which is claimed to be optimal (Fonseca et al., 2005). For our proposed segmentation algorithm, we tune the parameters of the algorithm and set max_transaction_interval_length = 5 minutes, max_inactive_interval_length = 24 hours, and max_transaction_time_window_length = 60 minutes.

TABLE 3. One hundred selected test input queries for experimental evaluation.

| Query (1–25) | Freq. | Query (26–50) | Freq. | Query (51–75) | Freq. | Query (76–100) | Freq. |
|---|---|---|---|---|---|---|---|
| | 75975 | | 2559 | | 872 | at89c51 | 183 |
| | 49583 | | 2260 | | 868 | | 167 |
| | 29445 | | 2195 | | 823 | | 140 |
| ftp | 29268 | ip | 2133 | | 793 | | 135 |
| | 28303 | | 2034 | web | 715 | itu-t | 133 |
| | 23975 | caj | 1972 | ps | 696 | | 130 |
| mp3 | 22232 | | 1919 | | 682 | fesco | 120 |
| ansys | 17346 | 2000 | 1840 | | 668 | | 112 |
| | 9145 | | 1801 | | 573 | | 111 |
| | 8048 | | 1763 | | 555 | .net framework | 104 |
| cs | 6114 | powerpoint | 1736 | | 470 | photo shop | 96 |
| | 5839 | opengl | 1572 | | 467 | | 88 |
| | 5552 | .net | 1566 | db2 | 462 | | 87 |
| | 4872 | www.163.com | 1513 | smtp | 370 | | 84 |
| | 4346 | rar | 1459 | | 342 | | 77 |
| | 3877 | firestorm.zip | 1452 | | 338 | | 75 |
| | 3492 | | 1310 | | 318 | | 74 |
| | 3474 | | 1138 | | 261 | xinhua | 73 |
| music | 3288 | | 1084 | lindows | 225 | | 69 |
| | 3285 | | 1040 | | 223 | avi | 63 |
| | 3168 | | 975 | | 218 | | 63 |
| | 2926 | | 954 | | 206 | | 58 |
| | 2816 | | 907 | | 203 | | 57 |
| rm | 2767 | | 879 | | 201 | | 52 |
| | 2626 | | 878 | | 195 | | 51 |

For the model of mining related queries, we compare the following three models:

1. Temporal correlation model (TCM): This approach adopts the model described in Chien and Immorlica (2005). The authors claimed that their proposed model was effective and efficient in capturing the semantic similarity or relatedness between queries; however, an extensive and systematic experimental evaluation of their model against rival models was absent from their paper. It is selected as the benchmark in our experiment.

2. Association rule mining model (ARM): The association rule mining model without Levenshtein distance similarity is selected as the second model in our experiment. The same support factor is utilized, but the confidence factor is computed without the Levenshtein distance similarity. This model is included in order to determine whether integrating Levenshtein distance similarity improves retrieval performance.

3. Association rule mining with Levenshtein distance similarity (ARM_LDS): This is our proposed approach, which combines association rules with Levenshtein distance similarity. Our proposed model differs from ARM in that it can recognize and promote rephrased versions of the original query, presuming that they are similar in their query terms. For both the ARM and ARM_LDS models, we exclude candidate related queries with frequencies less than 20, because such infrequent related queries are usually of no significant help to common users. We set the value of min_support to 0.0002%, which means that a candidate related query has to co-occur with the input query in at least approximately four query transactions out of a total of 1.6 million non-single-query transactions. This partially ensures that the co-occurrences of the two queries are statistically strong enough to support their deduced logical relatedness.
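The figure of roughly four co-occurrences follows from the counts in Table 2, as the short calculation below illustrates. It assumes that the number of non-single-query transactions is the total number of query transactions minus the number of single-query transactions; the variable names are hypothetical.

```python
import math

total_transactions = 4_141_983         # Table 2: total no. of query transactions
single_query_transactions = 2_541_401  # Table 2: total no. of single-query transactions
min_support_pct = 0.0002               # min_support, expressed as a percentage

# Transactions that contain at least two distinct queries (about 1.6 million).
non_single_query_transactions = total_transactions - single_query_transactions

# Number of transactions corresponding to the support threshold.
threshold_count = non_single_query_transactions * min_support_pct / 100

# Smallest whole number of co-occurrences that reaches the threshold.
min_cooccurrences = math.ceil(threshold_count)
```

The threshold corresponds to about 3.2 transactions, so a candidate query must co-occur with the input query in at least four transactions to be retained.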

Table 4 presents the precision rates of the retrieved related queries obtained by running different models on the test input queries. From a brief comparison of the precision rates of the different models, we found that the temporal correlation model (TCM), as the baseline model, performed worst at all levels, which is strongly negative evidence against the effectiveness claimed in Chien and Immorlica (2005). We believe the unsatisfactory performance of the temporal correlation model cannot be entirely attributed to the differences between the dataset used in Chien and Immorlica (2005) and ours. In Chien and Immorlica (2005) the authors gave quite a few examples to illustrate the effectiveness of the temporal correlation model on sample queries. However, the queries they gave were not representative enough to cover all types and frequencies in the real world; most of their sample queries have extremely high frequencies. Besides, in our empirical investigation we found that the temporal correlation model usually performed better on event-driven queries than on non-event-driven queries, because the temporal distributions of non-event-driven queries tend to be similar, and thus TCM fails to distinguish between them. These weaknesses of TCM are further exacerbated when it is applied to the query logs of small to medium Web search engines, such as our Tianwang query log, particularly because frequent queries in those query logs are not as diversified as in the logs of major Web search engines.

Next we compare the naive segmentation algorithm with our proposed segmentation algorithm. Regardless of which model of mining related queries it is combined with, our proposed segmentation algorithm always outperforms the naive segmentation algorithm significantly. For K = 20, our proposed segmentation algorithm improves the overall performance by approximately 9%–12% compared with the naive segmentation algorithm. Hence we conclude that our proposed segmentation algorithm is more effective in segmenting user sessions into query transactions, and hence in improving the retrieval performance of mining related queries.

We can also see from Table 4 that our proposed ARM_LDS model generally outperforms the rival ARM model. When combined with the naive segmentation algorithm, the ARM_LDS model steadily improves the overall performance by 3%–5% compared with the ARM model at different levels of K. When combined with our proposed segmentation algorithm, it improves the overall performance by approximately 3%, which is a considerable amount given that the absolute precision rate has already reached a high level. In conclusion, our proposed ARM_LDS model improves the retrieval performance by a considerable percentage and therefore proves effective in capturing the relatedness between queries.

TABLE 4. The precision rates of different combinations of segmentation algorithms and models of mining related queries from search engine query logs.

| Top K queries | TCM Poverall | TCM Pquery | Naive segmentation, ARM Poverall | Naive segmentation, ARM Pquery | Naive segmentation, ARM_LDS Poverall | Naive segmentation, ARM_LDS Pquery | Proposed segmentation, ARM Poverall | Proposed segmentation, ARM Pquery | Proposed segmentation, ARM_LDS Poverall | Proposed segmentation, ARM_LDS Pquery |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 56.65 | 56.65 | 91.86 | 91.86 | 94.65 | 94.65 | 95.35 | 95.35 | 97.65 | 97.65 |
| 5 | 60.47 | 62.32 | 85.60 | 86.76 | 89.73 | 90.64 | 90.88 | 91.47 | 93.64 | 93.73 |
| 10 | 54.88 | 56.05 | 81.11 | 83.86 | 85.44 | 87.29 | 88.45 | 89.73 | 90.59 | 91.56 |
| 15 | 50.63 | 52.29 | 75.76 | 80.32 | 80.88 | 82.05 | 86.05 | 88.13 | 89.88 | 91.04 |
| 20 | 44.32 | 46.76 | 71.66 | 77.63 | 76.29 | 78.88 | 83.29 | 85.00 | 88.44 | 90.04 |

Figure 5 is a graphical representation of the precision rates of the different models and combinations of segmentation algorithms and models of mining related queries. It illustrates more clearly the superiority of our proposed segmentation algorithm and our proposed model of association rule mining supplemented with Levenshtein distance similarity over rival segmentation algorithms and models of mining related queries.

FIG. 5. Graph representation of the precision rates of different combinations of segmentation algorithms and models of mining related queries from search engine query logs. Two panels, Overall Average Precision and Per Query Average Precision, each plotted against Top K Queries (K = 1, 5, 10, 15, 20) on a precision axis from 40 to 100. Series: ARM+Naïve_Segmentation, ARM_LDS+Naïve_Segmentation, ARM+Proposed_Segmentation, ARM_LDS+Proposed_Segmentation, and TCM.

Conclusion and Future Work

In this paper, we propose a method of automatically identifying related queries from Web search engine query logs using a modified version of the association rule mining model. This method segments the user sessions identified in query logs into query transactions that contain queries on the same topic or strongly related topics, then mines association rules of related queries using a combination of association rules and Levenshtein distance similarity. The advantage of our proposed model is that it exploits human users' implicit evaluation of the relatedness between queries, as reflected in their reformulations, while taking the similarities between query terms into consideration. According to the experimental results, our proposed method significantly outperforms rival models as well as the baseline model, with approximate gains of 17% and 44%, respectively, in precision rates. To conclude, our proposed method performs well enough to be applied in real Web search engines as a query recommendation system or for further query expansion.

In our future work, we plan to investigate further the application of the association rule mining model to mining related queries from query logs. We are interested in knowing whether the absence of queries is as important as the presence of queries in query transactions when they are used to measure the significance of mined association rules (Brin, Motwani, & Silverstein, 1997). We will study the effects of changing the costs of edit operations on our model, which may better capture the similarities between queries. In addition, a query relation graph can be built and graph partition techniques can be applied to identify the semantic clusters or "concept groups" in the graph, in order to develop further Web information retrieval applications. Finally, as our query log data consist mostly of Chinese queries (around 80%) and may not be representative of large commercial Web search engines, we plan to investigate whether our approach achieves the same performance on query log data from commercial search engines such as Yahoo!, Google, and MSN.

Acknowledgment

This project was supported by the Earmarked Grant for Research from the Hong Kong Research Grant Council, 4178/05E.

References

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM Sigmod International Conference Management of Data (SIGMOD’93) (pp. 207–216). Washington, DC, May.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94) (pp. 487–499). Santiago de Chile, Chile.

Baeza-Yates, R.A., & Ribeiro-Neto, B. (1999). Modern information retrieval (pp. 75–79). Reading, MA: Addison-Wesley.

Beeferman, D., & Berger, A. (2000). Agglomerative clustering of a search engine query log. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 407–416). Boston, MA.

Billerbeck, B., Scholer, F., Williams, H.E., & Zobel, J. (2003). Query expansion using associated queries. In Proceedings of the CIKM International Conference on Information and Knowledge Management (CIKM) (pp. 2–9). New Orleans.

Billerbeck, B., & Zobel, J. (2003). When query expansion fails. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’03) (pp. 387–388). Toronto.

Brin, S., Motwani, R., & Silverstein, C. (1997). Beyond market baskets: Generalizing association rules to correlations. In Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD’97). Tucson, AZ.

Buckley, C., Salton, G., Allan, J., & Singhal, A. (1995). Automatic query expansion using SMART: TREC-3 report. In Proceedings of the Third Text REtrieval Conference (TREC-3) (pp. 69–80).

Chau, M., Fang, X., & Yang, C.C. (2007). Web searching in Chinese: A study of a search engine in Hong Kong. Journal of the American Society for Information Science and Technology, 58(7), 1044–1054.

Chien, S., & Immorlica, N. (2005). Semantic similarity between search engine queries using temporal correlation. In Proceedings of the 14th International Conference on World Wide Web (WWW’05) (pp. 2–11). Chiba, Japan.

Cui, H., Wen, J.-R., Nie, J.-Y., & Ma, W.-Y. (2002). Probabilistic query expansion using query logs. In Proceedings of the 11th International Conference on the World Wide Web (WWW’02) (pp. 325–332). Honolulu, Hawaii.

Fonseca, B.M., Golgher, P.B., De Moura, E.S., & Ziviani, N. (2003, November). Using association rules to discover search engines’ related queries. In First Latin American Web Congress (LAWEB’03). Santiago, Chile.

Fonseca, B.M., Golgher, P., Possas, B., Ribeiro-Neto, B., & Ziviani, N. (2005). Concept-based interactive query expansion. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM’05) (pp. 696–703). Bremen, Germany.

Gilleland, M. Levenshtein distance: three flavors. Retrieved from http://www.merriampark.com/ld.htm.

Huang, C.-K., Chien, L.-F., & Oyang, Y.-J. (2003). Relevant term suggestion in interactive Web search based on contextual information in query session logs. Journal of the American Society for Information Science and Technology, 54(7), 638–649.

Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press.

Jones, R., Rey, B., Madani, O., & Greiner, W. (2006). Generating query substitutions. In Proceedings of the 15th International Conference on World Wide Web (WWW’06). Edinburgh, Scotland.

Kwok, S.H., & Yang, C.C. (2004). Searching the peer-to-peer networks: The community and their queries. Journal of the American Society for Information Science and Technology, Special Topic Issue on Research on Information Seeking, 55(9), 783–793.

Levenshtein, V.I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8), 707–710.

Pu, H.T., Chuang, S., & Yang, C. (2000). Subject categorization of query terms for exploring Web users’ search interests. Journal of the American Society for Information Science and Technology, 53, 617–630.

Wen, J., Nie, J., & Zhang, H. (2001). Clustering user queries of a search engine. In Proceedings of the 10th International World Wide Web Conference (W3C) (pp. 162–168).

Xu, J., & Croft, W.B. (2000). Improving the effectiveness of information retrieval with the local context analysis. ACM Transaction of Information Systems, 1(18), 79–112.

Yang, C.C., & Kwok, S.H. (2005). Changes of queries in gnutella peer-to-peer networks. Journal of Information Science, 31(2), 124–135.

Zhao, Q., Hoi, S., Liu, Tie-Yan, Bhowmick, S., Lyu, M.R., & Ma, Wei-Ying. (2006). Time-dependent semantic similarity measure of queries using historical click-through data. In Proceedings of the 15th International Conference on World Wide Web (WWW’06). Edinburgh, Scotland.