
Transcript of: Keyword Query Reformulation on Structured Data (net.pku.edu.cn/~cuibin/Papers/2012ICDE keyword.pdf)


Keyword Query Reformulation on Structured Data
Junjie Yao, Bin Cui, Liansheng Hua, Yuxin Huang

Department of Computer Science and Technology
Key Laboratory of High Confidence Software Technologies (Ministry of Education)

Peking University, P.R. China
{junjie.yao,bin.cui}@pku.edu.cn

{hualiansheng,dazzle1203}@gmail.com

Abstract— Textual web pages dominate web search engines nowadays. However, there is also a striking increase of structured data on the web. Efficient keyword query processing on structured data has attracted considerable attention, but effective query understanding has yet to be investigated. In this paper, we focus on the problem of keyword query reformulation in the structured data scenario. The reformulated queries provide alternative descriptions of the original input. They can better capture users' information needs and guide users to explore related items in the target structured data. We propose an automatic keyword query reformulation approach that exploits the structural semantics of the underlying structured data sources. The reformulation solution is decomposed into two stages, i.e., offline term relation extraction and online query generation. We first utilize a heterogeneous graph to model the words and items in the structured data, and design an enhanced random walk approach to extract relevant terms from the graph context. In the online query reformulation stage, we introduce an efficient probabilistic generation module to suggest substitutable reformulated queries. Extensive experiments are conducted on a real-life data set, and our approach yields promising results.

I. INTRODUCTION

Recent years have witnessed an increasing amount of structured data on the web, which has become a prevailing trend [1]. As reported in a study of large-scale online user search behavior [2], structured objects are an increasing information need in general web search today and account for more than half of user queries. Examples of structured data include e-commerce sites, knowledge databases and entity corpora in social media. Additionally, a lot of structured data is extracted from text pages and stored in various kinds of structured databases [3], [4]. Compared to flat text, structured data can include various associated attributes, and connections/relations between data items can also be easily represented; thus structured data can be used to better organize online content.

The query result on structured data generally contains information stored in separate tables. Relevant pieces of data that reside in different tables but are inter-connected and collectively relevant to the input query are expected to be returned. Traditionally, declarative languages like SQL or predefined query forms are used to query these databases. Due to the limitation and complexity of the aforementioned query methods, the uncontrolled keyword query is preferable to ordinary users in web search. Recently, there has been a lot of work on efficient and effective keyword search processing on structured data [5], [6], [7], [8]. Query suggestion is a promising alternative to improve the effectiveness of keyword search; however, it has yet to be investigated in the structured data scenario.

In reality, it is difficult for users to specify exact keywords to describe their requests in web search. This is mainly caused by natural language ambiguity or the semantic "mis-match" gap between query words and the underlying data. Therefore, understanding the query intent behind a simple query is a challenging task for effective search [9]. A query could be ineffective for many reasons, among which mis-specification is a common one. Let us consider the following use cases, in the setting of the bibliographic database DBLP 1:

∙ Similar words: two different words have the same meaning but different expressions. When a user searches for papers related to "XML", he may not know that authors also use "semi-structured" and "tree" to describe this similar concept. "probability" vs. "uncertainty" is another quasi-synonym pair.

∙ Related items: similar topics lie in other parts of the searched corpus. For example, when a user searches for R. Agrawal's work on "association rule", he may not know that there also exist related topics, e.g., "sequential pattern" and "frequent itemset". Other prominent researchers, such as Jiawei Han and Raghu Ramakrishnan, are also experts in this area.

Better understanding the user intent and providing substitutive queries are important requirements for improving the effectiveness of a search engine, and query suggestion has been widely applied in text retrieval and web search. After a user submits a query, typo corrections or more appropriate queries are suggested. By collecting query logs or learning user behavior, most existing methods exploit large query logs to obtain similar queries that support query rewriting or generation, e.g., [10], [11]. For areas short of query logs, some previous work has also tried statistical extraction methods over text corpora, e.g., [12], [13].

However, these approaches were initially proposed for flat text documents, and cannot effectively model the attribute facets and connections within structured data. First, we lack structured query logs in the structured data scenario. Second, there does not exist a suitable document definition over structured data.

1 http://dblp.uni-trier.de/


Fig. 1. Tables and References in Structured DBLP Data. (The figure shows three tables connected by foreign key references: Paper (P1: "Threshold query optimization for uncertain data", 2010; P2: "Probabilistic string similarity joins", 2010; P3: "Efficient rank based KNN query processing over uncertain data", 2010), Conf (C1: SIGMOD, C2: PODS, C3: ICDE) and Author (A1: Sunil Prabhakar, A2: Ke Yi, A3: Xuemin Lin), with Paper tuples written by Authors and published in Conferences.)

The ad hoc definition of results depending on the input query is a double-edged sword for structured data. It is difficult to enumerate all possible types of query results, so the predefined virtual document approach [14] is not feasible. Even if we could manually define virtual documents by selectively extracting several tuples, this kind of virtual document would be rather arbitrary and hard to extend.

In this paper, we investigate the problem of keyword query reformulation over structured data. Given an initial query, instead of query expansion or relevance feedback, we aim to suggest new but substitutive queries by exploiting the structural semantics of the underlying data sources. Compared to some recent returned-result analysis methods [15], [16], our approach provides more global indications and explores a broader scope. The intuition behind our approach lies in the ubiquitous cross-linked connections of structured data, which carry rich semantic information indicating word closeness and context. It is intuitive that the surrounding text or the relevant tuples/attributes of similar terms are also similar.

Figure 1 presents an example of the table structure in DBLP. Different tables hold fixed information and are connected via key references. In this setting, "probabilistic" and "uncertain" from the motivating example usually appear in the same conferences or are used by the same researchers, while they may not appear together simultaneously, say in a title. Authors connected by similar conferences, or using similar terms in their paper titles, could have similar research interests, even though they have never collaborated.

In this work, we model the structured data with a heterogeneous graph and propose an automatic approach for keyword query reformulation which consists of offline term relationship extraction and online query reformulation. In the offline stage, we resort to the rich structural context information to capture term relations. Specifically, we build a term relation graph, where "term" refers to a textual word or phrase in the database. For each term, we employ an improved random walk model over the graph to obtain a list of its most similar terms; we also compute the closeness of terms to measure their semantic distance in the graph. At run time, given a query composed of several keywords, we propose a probabilistic query generation model to find the top-k related queries, composed of terms similar to the input keywords. Compared to prior keyword query processing and result ranking methods, our work is complementary and focuses on query understanding by taking advantage of the structural information in the underlying data sources.

The contributions of this paper can be summarized as follows:

1) We propose a graph structure to model the semantic relationships among terms in the structured data, which covers not only tuple content but also interaction relations.

2) We present an improved Random Walk model to obtain term similarity over the graph scope.

3) We design a novel probabilistic query generation model that determines good substitutive queries for an input query. Under the generative framework, we further discuss efficient top-k reformulated query selection.

4) We conduct extensive experiments on real-life data to verify the effectiveness and efficiency of our methods.

The remainder of this paper is organized as follows. We first review related work, and then introduce the problem definition and solution overview in Section III. The next two sections provide details of our approach: Section IV describes the term relation extraction process, and the reformulated query generation is presented in Section V. We show the experimental study in Section VI, and finally conclude the paper.

II. RELATED WORK

In this section, we give a brief review of the literature in several categories:

Keyword Search over Structured Data: As shown in [3], there exist tens of millions of useful and query-able tables on the web. Keyword search over structured data has attracted a lot of attention in recent years [5], [6], [7], [17], [18], [19]. Result definition and query processing have been studied extensively in the literature [20], [21], [22]. [20] introduced a search algorithm on the graph to generate search results. [21] utilized a PageRank strategy to assess the relevance and authority of nodes in keyword search. [6] presented top-k result ranking in relational databases.

Though there is also some tentative work towards keyword query understanding [23], [15], [24], [16], these previous works either chose manually created transformation rules or focused on local result analysis. The global scope and the rich semantic structure information were usually ignored in previous work.

Node Similarity on Graph: A graph is an attractive model to describe the complex interactions and connections in many applications. Usually, existing keyword search methods model the structured data as a graph, where tuples in each table are connected by foreign key references, and together these tuples form a search result. There exists some work on applying graph algorithms to find related tuples [21], summarize the schema of a relational database [25] and cluster the graph [26].


How to measure similarity on the graph is a key issue for graph-based approaches, and different solutions have been proposed to address this problem, such as link analysis [11] and the random walk model [27].

Query Suggestion/Recommendation: A search operation is typically an interactive process, in which users iteratively submit their questions and get informed. Query suggestion is commonly used by current web search engines to help users reformulate their queries with less effort. Traditional query expansion or relevance feedback generally extends the initial query, and often causes query intent drift. As shown in web query logs, a user usually modifies a query to another closely related one, and around 50% of user sessions contain reformulation actions [10].

Inspired by this, automatic query reformulation (a.k.a. rewriting, substitution) has become an attractive approach for query suggestion [10]. In query reformulation, a totally new query is suggested, which is similar or related to the initial one. Most existing automatic query reformulation methods utilize large-scale query log mining techniques, e.g., [11], [10]. Statistical query generation is also discussed in [13], [12]. However, these works are not applicable to structured data, owing to the absence of keyword query logs and the dynamic linkage nature of structured data.

III. OVERVIEW

A. Problem Formulation

We first introduce the notations and concepts used in the paper. Specifically, the structured data in this paper refers to relational data, which consists of tables and references between tables. An example of structured DBLP data is shown in Figure 1.

Usually, a database can be viewed as a labeled graph, where tuples in different tables are treated as nodes, and foreign key references correspond to connections between nodes. Note that a graph constructed in this way has a regular structure because the schema restricts the node connections. The proposed solutions are also applicable to other kinds of schemas or even schemaless structured data, e.g., XML, RDF and graph data.

Definition 1 (Tuple Graph): A tuple graph is a graph G = {V, E}, where V is the set of nodes, and each node corresponds to one tuple; E is the set of edges, and each edge represents a foreign key reference.

In Figure 1, we can replace the inner schema graph with tuple nodes and transform it into a tuple graph.

Definition 2 (Keyword Query): An input keyword query Q consists of a set of distinct keywords, i.e., Q = [q_1, q_2, ..., q_n], where each keyword q_i (i = 1, ..., n) is a word or a topical phrase, depending on the tokenization/segmentation. Without loss of generality, we simply use term to represent a word or phrase in the following discussions.

Definition 3 (Query Result): Keyword query results are the substructures that connect the matching nodes (in the tuple graph). In a graph data model, a query result is defined as a subtree of the data graph from which no node or edge can be removed without losing connectivity or keyword matches [5], [20].

Be aware that our work does not focus on: 1) how to present a new result definition for keyword queries; 2) how to process a keyword query Q efficiently. Instead, our work addresses the problem of how to reformulate the original query and present substitutive queries to the users, which is complementary to these prior works.

Definition 4 (Reformulated Query): Given an input query Q = [q_1, q_2, ..., q_n], find k reformulated queries Q'_i (i = 1, ..., k) with the k largest score(Q, Q'_i), where score(Q, Q'_i) evaluates the substitution validity between Q and Q'_i.

The generated new query should be similar to the original input and also accord with the structural semantics of the underlying database. As a combination of individual terms, the straightforward similarity between queries could be measured by aggregating the similarities between the corresponding terms. However, it is not surprising that even if each term of the new query is similar to the corresponding one in the original query, the new query as a whole may not cover any meaningful result, because its terms do not occur together or are not semantically relevant to one another. score(Q, Q'_i) should therefore include both similarity and utility metrics, and the quality details of reformulated queries will be introduced in Section V.

Fig. 2. Flowchart of Reformulated Query Generation. (The figure shows the pipeline: an input query is tokenized; term similarity and term closeness are extracted from the TAT graph built over the structured database; the reformulated query generation module, depicted as an HMM-style chain of observations and states, outputs the reformulated queries.)

B. Proposed Approach

The flowchart of our approach is shown in Figure 2. In the off-line stage, we model the structured data in an enriched graph and then perform term relationship extraction, i.e., the similarity and closeness shown in the figure. The online stage accepts a submitted query and then fetches the corresponding terms' information. With an efficient probabilistic model to reduce the search space, we conduct query reformulation and present the top-k reformulated queries to end users.

The off-line stage uses an improved random walk model to measure the term similarity; details will be presented in Section IV.


Compared to previous graph similarity measurements [27], [11], the new model personalizes the preference used to guide the similarity computation on the graph. This kind of similarity is more suitable for similar term extraction in our complex heterogeneous graph, and its advantage will be demonstrated in the experimental study.

The online stage utilizes the pre-computed results, i.e., term similarity and closeness, to generate reformulated queries. To facilitate efficient processing, we employ a probabilistic model to reduce the search space. It is shown that this simple but effective approximation does not degrade the performance of query reformulation; additionally, the generative nature of this probabilistic model also shows an advantage in multi-metric fusion, especially in our reformulation objective functions.

IV. TERM RELATION EXTRACTION

In this section, we first propose a heterogeneous graph structure to model the structured data. Based on the graph model, we further discuss how to extract term relationships, i.e., similarity and closeness, by exploiting the structural semantics of the underlying data sources.

A. TAT Graph Representation

In order to model term relationships in structured data, we design a term augmented tuple graph (TAT graph), which is defined as follows.

Definition 5 (Term Augmented Tuple Graph): G = (V ∪ V_t, E ∪ E_t) is a term augmented tuple graph, where 1) each node v in V is a tuple node that represents one tuple; 2) each node v_t in V_t is a term node that represents one term; 3) each edge e in E connects two tuple nodes, representing a foreign key reference; 4) each edge e_t in E_t connects one tuple node v and one term node v_t, if v_t appears in the corresponding tuple v.

Fig. 3. Term Augmented Tuple Graph. (Ellipse nodes are tuple nodes, e.g., C1, C3, A1, A3, P1, P2, P3; rectangle nodes are term nodes, e.g., "uncertain", "probabilistic", "query", "similarity".)

In Figure 3, ellipse nodes are original tuples and rectangle nodes are term nodes extracted from the paper title field of the DBLP example shown in Figure 1. Authors are connected to their published papers. The papers are associated with the conferences/journals and the extracted terms. Connections between heterogeneous nodes represent the co-occurrence or key reference between them. To define term nodes in our graph, we would like to extract multiple terms from a field consisting of a long term sequence, e.g., a paper title. However, for some tuples whose terms all stand together for a certain semantic meaning, e.g., an author name or institute name, segmentation should not be applied, and these nodes are also searchable as simple term nodes. Term nodes with the same text extracted from different fields are considered different; we label them with field identifiers. That is to say, words from conference names are distinguished from words from paper titles by different field labels.

Different from the traditional tuple graph, the TAT graph semantically incorporates tuple and term relationships in a uniform way. We believe this is an intriguing character of the graph structure, which can model rich direct and indirect connections among structured data. In Figure 3, authors who do not collaborate actively but share similar research interests can also be connected through similar conferences/keywords. It is possible that keywords used by similar authors belong to related research areas, though they do not have to co-occur in the same paper titles.

In the TAT graph, the unified modeling of terms and tuples enables us to better represent and extract their similarity and closeness relations. Later in this section, we will also discuss the novel weighting method over the TAT graph.
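To make the graph model above concrete, here is a minimal sketch of how a TAT graph could be represented in code. It is only an illustration under our own assumptions (the class and method names such as TatGraph, addReference and addTermOccurrence are hypothetical, not from the paper); term nodes are qualified by their field, as described above.

```java
import java.util.*;

// A minimal, illustrative TAT graph: tuple nodes plus field-qualified term nodes,
// with undirected adjacency sets for reference edges and term-occurrence edges.
public class TatGraph {
    private final Map<String, Set<String>> adj = new HashMap<>();

    private void addNode(String id) {
        adj.computeIfAbsent(id, k -> new HashSet<>());
    }

    // Foreign key reference between two tuples, e.g., paper P1 written by author A1.
    public void addReference(String tupleA, String tupleB) {
        addNode(tupleA); addNode(tupleB);
        adj.get(tupleA).add(tupleB);
        adj.get(tupleB).add(tupleA);
    }

    // A term node is qualified by its field, so "title:query" differs from "conf:query".
    public void addTermOccurrence(String tupleId, String field, String term) {
        String termId = field + ":" + term.toLowerCase();
        addNode(tupleId); addNode(termId);
        adj.get(tupleId).add(termId);
        adj.get(termId).add(tupleId);
    }

    public Set<String> neighbors(String id) {
        return adj.getOrDefault(id, Collections.emptySet());
    }

    public static void main(String[] args) {
        TatGraph g = new TatGraph();
        g.addReference("P1", "A1");                 // paper P1 written by author A1
        g.addReference("P1", "C3");                 // paper P1 published in conference C3
        g.addTermOccurrence("P1", "title", "uncertain");
        g.addTermOccurrence("P2", "title", "probabilistic");
        System.out.println(g.neighbors("P1"));      // A1, C3, title:uncertain (order may vary)
    }
}
```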

B. Term Similarity in Linked Context

We first discuss how term similarity can be extracted from the aforementioned TAT graph structure. Compared to a flat representation, the graph structure is more capable of modeling relations between terms. In contrast, using "frequent co-occurrence" to model term similarity [15] ignores the rich structural semantics in relational databases.

Taking the DBLP data as an example, to measure the similarity between two authors, a natural way is to represent each author by the associated publications or conferences. However, it is not a trivial task to choose appropriate representation subsets. Too short a representation cannot provide sufficient information, while too long a representation is usually very sparse. Additionally, this kind of representation cannot capture indirect connections between the authors.

The graph structure we use here alleviates these problems. It can carry more global correlations between two nodes and also their multi-facet relations, i.e., different connection paths between them. For example, in Figure 1, the connection between two authors can be measured not only by shared conferences or co-authored papers, but also by shared keywords occurring in their paper titles, or even by indirectly similar keywords on the graph.

Random walk is considered an effective method to measure node proximity over a graph [25], [27]. In this subsection, we propose a novel term similarity measurement derived from the TAT graph. Different from traditional simple random walk methods, we design a novel contextual preference vector and weighting flow to bias the algorithm behavior.

1) Basic Random Walk Model: Random walk is an iterative process that updates node scores across the graph.


After several iterations, the random walk process will usually converge to a stable state. After that, the relative importance of each node is measured by its stable score.

Consider a general graph G = (V, E) with node set V and edge set E, and let A be its adjacency matrix. Let \vec{v} be a preference vector inducing a probability distribution over V. The random walk vector \vec{p} over all vertexes V is defined by the following equation:

\vec{p} = \lambda \cdot A \cdot \vec{p} + (1 - \lambda) \cdot \vec{v}    (1)

In the iteration process, a random walker starts from a node and chooses a path randomly among the available outgoing edges, according to the edge weights. With a restart probability, it goes back to a node based on the weighted distribution in \vec{v}. After several iterations, the node scores will converge; this is guaranteed in most cases [28].

If the preference vector \vec{v} is uniform over V, \vec{p} is referred to as a global random walk vector. For a non-uniform \vec{v}, the solution \vec{p} is treated as a personalized random walk vector [29]. By appropriately selecting \vec{v}, the rank vector \vec{p} can be made to bias certain nodes. In the most personalized case, when the i-th coordinate of \vec{v} is 1 and all others are 0, the random walk is referred to as an individual random walk [27].

To extract the nodes similar to a specific node t_0, we can assign a preference vector \vec{v} biased to t_0 and then run the random walk. The convergence score is a measurement of relative similarity, i.e.,

sim(t_0, t) = \vec{p}[t] \mid \vec{v}_0    (2)

\vec{v}_0 is the preference vector with t_0's corresponding weight non-zero and all others zero. The above equation means that, in an individual random walk model biased to t_0, if \vec{p}[t_1] > \vec{p}[t_2], then node t_1 is more similar to t_0 than t_2.

Different from the homogeneous graph case, here we only extract similar nodes belonging to the same class as the initial node. For example, given an author node, we expect to find similar author nodes, while ignoring similar keyword or conference nodes found during the random walk process.
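As an illustration of Equations 1 and 2, the sketch below runs the power iteration for a personalized random walk on a small adjacency structure. It is a simplified sketch under our own assumptions (dense arrays, a column-normalized transition matrix, a fixed damping factor lambda); the paper does not prescribe this particular implementation.

```java
import java.util.Arrays;

// Minimal personalized random walk (power iteration), illustrating
// p = lambda * A * p + (1 - lambda) * v  until convergence.
public class PersonalizedRandomWalk {

    // transition[i][j]: probability of stepping from node j to node i
    // (each column of the matrix sums to 1).
    public static double[] walk(double[][] transition, double[] preference,
                                double lambda, double eps, int maxIter) {
        int n = preference.length;
        double[] p = preference.clone();             // start from the preference distribution
        for (int iter = 0; iter < maxIter; iter++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) sum += transition[i][j] * p[j];
                next[i] = lambda * sum + (1.0 - lambda) * preference[i];
            }
            double diff = 0.0;
            for (int i = 0; i < n; i++) diff += Math.abs(next[i] - p[i]);
            p = next;
            if (diff < eps) break;                   // converged
        }
        return p;                                    // p[t] plays the role of sim(t0, t) when v is biased to t0
    }

    public static void main(String[] args) {
        // Tiny 3-node path graph 0 -- 1 -- 2, column-normalized.
        double[][] T = {
            {0.0, 0.5, 0.0},
            {1.0, 0.0, 1.0},
            {0.0, 0.5, 0.0}
        };
        double[] pref = {1.0, 0.0, 0.0};             // biased to node 0 (the query term t0)
        System.out.println(Arrays.toString(walk(T, pref, 0.85, 1e-9, 100)));
    }
}
```

Ranking the converged scores, restricted to nodes of the same class as the starting node, yields the similar-term list used in the later reformulation stage.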

2) Contextual Biased Preference: To extract nodes similar to a node t_0 by random walk, a straightforward solution is to bias towards t_0 by assigning t_0's corresponding element in \vec{v} the value 1 and 0 to all others. In an initial trial, we found that this approach is locally sensitive. For example, given an author in Figure 3, how can we find similar authors interested in the same conferences/research areas? An individual random walk starting from a specific author can only find some close collaborators. Since similar terms generally appear in similar contexts, we further model a contextual preference vector as an improvement of the random walk. Starting the random walk process from this author's primary conferences and research areas, we may encounter other valuable findings. Specifically, we utilize the surrounding tuples/terms as the context setting for a given term/tuple node. This context setting is used for node similarity extraction in our proposed approach.

Definition 6 (Context Node): For a term node, the context is the set of tuple nodes it belongs to. A tuple node's context is the affiliated term nodes and the tuple nodes it connects to. Nodes in this context are called context nodes.

There may exist several different fields among the context nodes. Here, a field means the type of a node, e.g., a tuple or an attribute in a relational table. We fetch some top related nodes from each field into the contextual vector. To better model the context information, we define two metrics, i.e., the field weight and the node weight within each field.

We first measure the field weight. Given a starting node t_0 with associated context nodes from m fields, we resort to cardinality statistics: we weight each field F_i (1 <= i <= m) by the fraction of its cardinality |F_i| and t_0's frequency freq(t_0). Second, we measure a context node v_c's weight with respect to the starting node t_0 as freq(v_c, t_0) · idf(v_c), where freq(v_c, t_0) measures the co-occurrence statistics with t_0, and idf(v_c) is the inverse of the global occurrence statistics of node v_c. Given the above two weight measurements, we can compute v_c's corresponding weight w(v_c) as the combination of the affiliated field weight and node weight, i.e., 1/|F_i| · freq(v_c, t_0) · idf(v_c). The reason we choose this combination is that the representativeness of the field and the node affects the behavior of the random walk and guides the quality of node similarity extraction.

Algorithm 1 Contextual Random Walk based Term Similarity Extraction

Input: starting node t_0 for similarity extraction;
TAT graph G with linkage information A;
damping parameter in random walk \lambda.

1: F ← {F_1, F_2, ..., F_i, ...}  // context fields of t_0
2: C ← {c_11, c_12, ..., c_ij, ...}  // context nodes of t_0
3: for c ∈ C do
4:   c_ij = 1/|F_i| · freq(c_ij, t_0) · idf(c_ij)
5: end for
6: \vec{v} ← [0, ..., c_ij, ...]  // preference vector
7: repeat
8:   \vec{p}_t = \lambda · A · \vec{p}_{t-1} + (1 − \lambda) · \vec{v}
9: until |\vec{p}_t − \vec{p}_{t-1}| < \epsilon or predefined iteration times reached
10: return score vector \vec{p}.

We illustrate our random walk based term similarity extraction process in Algorithm 1. First we select the context preference with regard to the starting node t_0 and build a preference vector \vec{v} for the random walk process, where 0 is set for all other non-context nodes in the graph. Then we run the random walk until it converges or reaches a predefined number of iterations. After the iteration, a list of the top similar nodes of t_0 is returned, i.e., the most similar terms extracted by our model.
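The sketch below shows one way the contextual preference vector of Algorithm 1 could be assembled before running the power iteration above. It only illustrates the weighting 1/|F_i| · freq(c, t_0) · idf(c); the precomputed helper maps and the final normalization step are our own assumptions, not details given in the paper.

```java
import java.util.*;

// Sketch: build the contextual preference vector of Algorithm 1 for a starting node t0.
// weight(c) = 1/|F_i| * freq(c, t0) * idf(c), then normalize to a probability distribution.
// The maps (field, fieldCardinality, freqWithT0, idf) are assumed to be precomputed offline.
public class ContextualPreference {

    public static double[] build(Map<String, Integer> nodeIndex,       // node id -> vector position
                                 List<String> contextNodes,            // context nodes of t0
                                 Map<String, String> field,            // context node -> its field
                                 Map<String, Integer> fieldCardinality,
                                 Map<String, Double> freqWithT0,       // co-occurrence with t0
                                 Map<String, Double> idf) {
        double[] pref = new double[nodeIndex.size()];
        double total = 0.0;
        for (String c : contextNodes) {
            double w = (1.0 / fieldCardinality.get(field.get(c)))
                     * freqWithT0.get(c) * idf.get(c);
            pref[nodeIndex.get(c)] = w;
            total += w;
        }
        if (total > 0) for (int i = 0; i < pref.length; i++) pref[i] /= total;
        return pref;  // non-context nodes keep weight 0; feed this into the random walk of Equation 1
    }

    public static void main(String[] args) {
        Map<String, Integer> idx = Map.of("P1", 0, "P3", 1, "A1", 2);
        double[] v = build(idx, List.of("P1", "P3"),
                Map.of("P1", "paper", "P3", "paper"),
                Map.of("paper", 3),
                Map.of("P1", 2.0, "P3", 1.0),
                Map.of("P1", 0.5, "P3", 0.8));
        System.out.println(Arrays.toString(v));
    }
}
```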

Figure 4 depicts the difference between the basic random walk and the contextual preference biased enhancement. In the basic model, we can only extract terms that highly co-occur with "uncertain", i.e., "similarity" and "query" in the figure. The results are similar to the frequent co-occurrence based similarity measurement, which cannot provide useful latent information. With the help of the contextual preference, we first identify some preference nodes, i.e., the two paper tuple nodes in the figure, and the new random walk model can go through the contextual part of the graph to find other relevant terms, e.g., "probabilistic" in the figure.

Fig. 4. Random Walk on the TAT Graph. The starting node is "uncertain". The basic random walk could only find the dot-lined "query" and "similarity". In our enhancement, we first grasp the contextual nodes P1 and P3, and from them we extract "probabilistic" as a more similar node.

C. Term Closeness Extraction

As introduced in Section III, term closeness is another important relation measurement, capturing the distance between term nodes. It is an indication of two terms' keyword search result coverage, which is an important metric for valid reformulated queries. Shorter paths between two terms indicate a closer relation and a good corpus coverage. At first glance, this looks similar to the previous similarity metric based on random walk. However, the random walk method combines the global graph structure to derive node similarity and mixes different paths between two nodes, which is inappropriate for node closeness.

A straightforward solution for term closeness is to fetch the exact search result and then count the returned items. However, as we have to construct the keyword search results on the fly, this is prohibitive, considering that every possible combination of candidate term pairs needs to be evaluated. For example, if we have an input query with three keywords, which is relatively short, and every query term has 10 candidate similar terms, then we need to compare the closeness of 10*10*10 term triples. A feasible approach is to summarize the target corpus by term pair coverage, and estimate the result size of each query. Compared to previous result estimation or database summary work [30], here we have to validate a much larger number of substitutable queries, and also maintain the frequency and length information of paths, which were somewhat ignored before.

In this part, we use an efficient path search method to compute the shortest paths between two term nodes, and choose the summation over all such paths as the closeness metric. The formulation is defined as:

clos(v_i, v_j) = \sum_{\tau : v_i \to v_j} \frac{1}{len(\tau)}    (3)

In the first stage, we search each node's shortest paths to others. Starting from a node v, we identify the directly connected nodes, which have distance 1. From these, we then expand to distance 2; distance i+1 nodes can easily be derived from distance i ones in each iteration. We keep the top nodes and prune the less frequent ones to guarantee extraction performance. In the second stage, we collect the nodes' connectivity information and compute the associated nodes' closeness based on the above equation.
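The sketch below illustrates this two-stage idea for a single source node: a bounded breadth-first expansion collects shortest-path lengths and counts, from which Equation 3 is then approximated. It is a simplification under our own assumptions (unweighted edges, only shortest paths counted, no frequency-based pruning), so it does not reproduce the paper's exact procedure.

```java
import java.util.*;

// Sketch: approximate the closeness of Equation 3 using only shortest paths.
// A bounded BFS from the source records, for every reachable node, the shortest distance
// and how many shortest paths reach it; clos(src, t) ~= pathCount(t) / dist(t).
public class TermCloseness {

    public static Map<String, Double> closenessFrom(Map<String, Set<String>> adj,
                                                    String src, int maxDepth) {
        Map<String, Integer> dist = new HashMap<>();
        Map<String, Long> paths = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(src, 0);
        paths.put(src, 1L);
        queue.add(src);
        while (!queue.isEmpty()) {
            String u = queue.poll();
            int d = dist.get(u);
            if (d >= maxDepth) continue;                 // bound the expansion
            for (String w : adj.getOrDefault(u, Set.of())) {
                if (!dist.containsKey(w)) {              // first time reached: record distance
                    dist.put(w, d + 1);
                    paths.put(w, paths.get(u));
                    queue.add(w);
                } else if (dist.get(w) == d + 1) {       // another shortest path to w
                    paths.put(w, paths.get(w) + paths.get(u));
                }
            }
        }
        Map<String, Double> clos = new HashMap<>();
        for (String t : dist.keySet()) {
            if (!t.equals(src)) clos.put(t, paths.get(t) / (double) dist.get(t));
        }
        return clos;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> adj = Map.of(
            "uncertain", Set.of("P1", "P3"),
            "P1", Set.of("uncertain", "probabilistic"),
            "P3", Set.of("uncertain", "probabilistic"),
            "probabilistic", Set.of("P1", "P3"));
        System.out.println(closenessFrom(adj, "uncertain", 3));
        // "probabilistic" has 2 shortest paths of length 2 -> closeness 1.0
    }
}
```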

TABLE I
EXTRACTED CLOSE TERMS

target term: "probabilistic"
ranked close terms: "generation", "document", "distribution", ..., "boundary", ...
ranked close conferences: VLDB, SIGMOD, AAAI, ..., ICDM, ...

Table I presents an example of the close terms extracted for the term "probabilistic", which is reasonably intuitive. From the table, we can see that it has close connections with "generation" and "distribution", as they appear together more often, while its connections with "strict" or "boundary" are loose. Regarding the close conferences, we find it has a closer relation with database and artificial intelligence conferences than with their counterparts in the data mining area. Interestingly, this finding is quite consistent with the results returned by the Google search engine in a simple keyword search test. We get more than 150 thousand results using the keywords "probabilistic" + "VLDB"/"SIGMOD"/"AAAI", while only around 50 thousand results with the keywords "probabilistic" + "ICDM".

V. REFORMULATED QUERY GENERATION

In this section, we will introduce the online processing of query reformulation over structured data. We first define the quality of reformulated queries, and then propose an efficient probabilistic model for query reformulation.

A. Quality of Reformulated Queries

Given an input query Q = [q_1, q_2, ..., q_m] and each query term's similar term extension list q_i ⇒ {t_i1, t_i2, ..., t_in}, merely integrating these similar term lists is not enough. Sometimes, even though sim(q_i, t_il) and sim(q_j, t_jk) are both very high, t_il and t_jk may have no connection in the structured data, i.e., they are meaningless when combined. Note that keyword search over a relational database generally returns a set of tuples, each of which is obtained either from a single relation or from multiple joined relations. We cannot arbitrarily combine top ranked terms together.

Therefore, we identify two requirements for query reformulation on structured data: 1) Cohesion: The generated query should be a "valid" one and cover several tuples in the targeted structured corpus. Random selection from different term extension lists cannot guarantee this requirement.


2) Similarity: The generated query should be similar to the original input in terms of some semantic concerns. These two measurements are represented by term closeness and term similarity, respectively.

The reformulation ability of a new query Q' with respect to Q can be measured by aggregating the similar term pairs in the two queries. The extraction processes for these relations were discussed in the prior section. We measure the reformulation probability of Q' given the input query Q as:

p(Q' \mid Q) = p(q'_1) p(q_1 \mid q'_1) \prod_{i=2}^{m} p(q'_i \mid q'_{i-1}) p(q_i \mid q'_i)
             = p(q'_1) sim(q_1, q'_1) \prod_{i=2}^{m} clos(q'_i, q'_{i-1}) sim(q_i, q'_i)    (4)

In Equation 4, the multiplication is sensitive to zero. For example, given an input query Q with m terms, most terms in a reformulated query Q' may be similar and close to those of Q, yet a single mismatch of two terms in Q' may cause Q' to fail, i.e., if clos(q'_i, q'_{i-1}) = 0 then Q''s score becomes 0. Here we use a global indication to smooth the scoring function of reformulated queries:

sim_{smo}(q'_i, q_i) = \lambda \times sim(q'_i, q_i) + (1 - \lambda) \times \sum_{k=1}^{m} sim(q'_k, q_k)    (5)

clos_{smo}(q'_{i-1}, q'_i) = \lambda \times clos(q'_{i-1}, q'_i) + (1 - \lambda) \times \sum_{k=2}^{m} clos(q'_{k-1}, q'_k)    (6)

This smoothing keeps the aggregated scores unchanged in order to maintain the probabilistic meaning of the parameters, and the smoothing effect is controlled by the parameter λ. The smoothing procedure helps alleviate the problem where a candidate query contains trivial words or irrelevant word pairs.
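As a concrete reading of Equations 4 to 6, the sketch below scores one candidate reformulation against the input query. It is a minimal illustration assuming the sim and clos values have already been looked up into arrays; the class and method names are ours, not the paper's.

```java
// Sketch: score a candidate reformulated query Q' against the input Q (Equations 4-6).
// sim[i]  = sim(q'_i, q_i) for i = 0..m-1
// clos[i] = clos(q'_{i-1}, q'_i) for i = 1..m-1 (clos[0] is unused)
// priorQ1 = p(q'_1), lambda = smoothing weight.
public class ReformulationScore {

    static double smooth(double value, double aggregate, double lambda) {
        return lambda * value + (1.0 - lambda) * aggregate;
    }

    public static double score(double priorQ1, double[] sim, double[] clos, double lambda) {
        int m = sim.length;
        double simSum = 0.0, closSum = 0.0;
        for (double s : sim) simSum += s;
        for (int i = 1; i < m; i++) closSum += clos[i];

        double p = priorQ1 * smooth(sim[0], simSum, lambda);    // p(q'_1) * sim_smo(q'_1, q_1)
        for (int i = 1; i < m; i++) {
            p *= smooth(clos[i], closSum, lambda)               // clos_smo(q'_{i-1}, q'_i)
               * smooth(sim[i], simSum, lambda);                // sim_smo(q'_i, q_i)
        }
        return p;
    }

    public static void main(String[] args) {
        double[] sim  = {0.8, 0.6, 0.7};
        double[] clos = {0.0, 0.5, 0.0};   // the 2nd and 3rd candidate terms never co-occur
        // Without smoothing (lambda = 1) the zero closeness kills the score;
        // with lambda = 0.8 the candidate keeps a small non-zero score.
        System.out.println(score(0.1, sim, clos, 1.0));
        System.out.println(score(0.1, sim, clos, 0.8));
    }
}
```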

B. Probabilistic Query Generation Model

The search space of candidate reformulated queries is O(n^m), given an input query with m terms and each term with a similar-term extension list of size n. Obviously, it is impossible to enumerate this huge search space for online query reformulation. Though we have introduced a feasible quality measurement, efficient query generation still requires an effective search space reduction.

We resort to a novel algorithm to find the top-k generated queries. It reduces the search space and also provides a probabilistic perspective on this query reformulation problem, which guarantees efficient reformulated query generation within an interactive response time. Here we utilize a Hidden Markov Model (HMM) [31] to profile query reformulation.

In general, an HMM [31] is composed of a set of hidden states S = {s_1, s_2, ..., s_N} and observation symbols O = {o_1, o_2, ..., o_M}. It is a generative model with multiple steps. At each step t, the model is in a specific hidden state q_t ∈ S and we observe an observation symbol e_t ∈ O. Having output a symbol, a transition is made to the next state.

The following three components describe an HMM's behavior:

∙ the state transition probabilities, A_{N,N} = {a_ij | 1 <= i, j <= N}, where a_ij = P(q_{t+1} = s_j | q_t = s_i);

∙ the observation probability distribution in each state, B_{N,M} = {b_ij | 1 <= i <= N, 1 <= j <= M}, where b_ij = P(e_t = o_j | q_t = s_i) is the probability of observing o_j in state s_i;

∙ the initial state distribution, π_N = {π_i | i ∈ [1, N]}, where π_i = P(q_1 = s_i) is the initial probability of state s_i.

In the query reformulation scenario, the initial input query Q = [q_1, q_2, ..., q_m] is considered as the observed symbols, and the hidden state set consists of all the corresponding similar terms, i.e., q_1's similar term list t_11, t_12, ..., t_1n, q_2's similar term list t_21, t_22, ..., t_2n, ..., and q_m's counterpart t_m1, t_m2, ..., t_mn. A reformulated query of length m is composed of m similar terms, where the term at position i is selected from q_i's similar term list. We describe one candidate reformulated query as q'_1, q'_2, ..., q'_m. This reformulated query corresponds to a hidden state sequence in the HMM.

By extending this model to allow an original term to remain in the new reformulated query, or to allow deletion of initial terms, we can also easily add void and original states to each q_i's corresponding similar term list. That is to say, we can also suggest new queries containing terms that existed in the input query.

The emission probability of each query term t_ij in L(q_i) (q_i's similar term list) with respect to q_i is measured by the similarity between t_ij and q_i discussed in Section IV-B. The transition probability between hidden states q'_i and q'_{i+1} depends on their closeness, i.e., clos(q'_i, q'_{i+1}). In this paper we use a first-order HMM, where the transition from step t−1 to t is memoryless and independent of the state at step t−2.

Given an input query, the distribution of the initial hidden state is derived from the statistics of the candidate states for the first observation:

\pi(t_{1j}) = \frac{freq(t_{1j})}{\sum_{l=1}^{n} freq(t_{1l})}, \quad 1 \le j \le n.    (7)

The probability of making a transition to q'_i when being in state q'_{i-1} depends on how often the terms q'_i and q'_{i-1} appear together, i.e., the closeness between these two terms:

A_{q'_{i-1}, q'_i} = clos(q'_{i-1}, q'_i).    (8)

The emission probability of a term t_{ij} with respect to the observed query term q_i is measured by its normalized similarity:

B_{t_{ij}, q_i} = \frac{sim(t_{ij}, q_i)}{\sum_{l=1}^{n} sim(q_i, t_{il})}.    (9)

In the HMM framework, good reformulated queries can now be determined as those hidden state sequences that have a high probability of being traversed while emitting the original query Q = [q_1, ..., q_m].
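To tie Equations 7 to 10 together, the sketch below assembles the HMM parameters (pi, A, B) from precomputed freq, sim and clos values for a single input query, and scores one candidate hidden state sequence. The array layout and names are our own assumptions; it only shows where each quantity enters the model.

```java
// Sketch: build HMM parameters for query reformulation (Equations 7-9).
// Candidates are indexed [position i][candidate j]: t_ij is the j-th similar term of q_i.
public class ReformulationHmm {
    final double[] pi;          // pi[j]  ~ freq(t_1j) / sum_l freq(t_1l)             (Eq. 7)
    final double[][][] trans;   // trans[i][j][j2] = clos(t_ij, t_{i+1,j2})           (Eq. 8)
    final double[][] emit;      // emit[i][j] ~ sim(t_ij, q_i) / sum_l sim(q_i, t_il) (Eq. 9)

    ReformulationHmm(double[][] freq, double[][] sim, double[][][] clos) {
        int m = sim.length;
        pi = normalize(freq[0]);                    // initial distribution from the first position
        emit = new double[m][];
        for (int i = 0; i < m; i++) emit[i] = normalize(sim[i]);
        trans = clos;                               // closeness used directly as transition weight
    }

    private static double[] normalize(double[] x) {
        double z = 0.0;
        for (double v : x) z += v;
        double[] out = new double[x.length];
        for (int j = 0; j < x.length; j++) out[j] = (z > 0) ? x[j] / z : 0.0;
        return out;
    }

    // Score of one candidate state sequence (Equation 10): pi * prod B * prod A.
    double score(int[] states) {
        double s = pi[states[0]] * emit[0][states[0]];
        for (int i = 1; i < states.length; i++) {
            s *= trans[i - 1][states[i - 1]][states[i]] * emit[i][states[i]];
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] freq = {{3.0, 1.0}, {0.0, 0.0}};        // only the first row feeds pi
        double[][] sim  = {{0.9, 0.4}, {0.7, 0.6}};        // sim(t_ij, q_i)
        double[][][] clos = {{{0.5, 0.2}, {0.1, 0.8}}};    // clos between positions 1 and 2
        ReformulationHmm hmm = new ReformulationHmm(freq, sim, clos);
        System.out.println(hmm.score(new int[]{0, 1}));    // candidate (t_11, t_22)
    }
}
```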

Given an input query, this model turns similar terms into reformulated queries. It can be regarded as a translation process. Users have an information need, which could be described by different queries in the hidden state space of the HMM. The initial input query is an explicit representation of this information need, while we would like to find some good hidden queries that are relevant to the initial query and cohesive in quality.

Under this modeling, we use the following metric to evaluate the quality of generated queries, and the top quality reformulated queries are returned.

๐‘(๐‘„โ€ฒโˆฃ๐‘„) = ๐œ‹(๐‘žโ€ฒ1)๐ต(๐‘žโ€ฒ1, ๐‘ž1)๐‘šโˆ

๐‘–=2

๐ด(๐‘žโ€ฒ๐‘–โˆ’1, ๐‘žโ€ฒ๐‘–)๐ต(๐‘žโ€ฒ๐‘–, ๐‘ž๐‘–)

= ๐œ‹(๐‘žโ€ฒ1)๐‘šโˆ

๐‘–=1

๐ต(๐‘žโ€ฒ๐‘–, ๐‘ž๐‘–)๐‘šโˆ

๐‘–=1

๐ด(๐‘žโ€ฒ๐‘–โˆ’1, ๐‘žโ€ฒ๐‘–) (10)

In the above equation, the first part is an aggregation of the similarity over the terms in the hidden state sequence, and the second part aggregates their closeness. The HMM naturally models the query generation process, incorporating both term closeness and similarity.

Algorithm 2 Extended Viterbi for Top-k Queries

Input: query Q = [q_1, q_2, ..., q_m];
term candidates QC = (q'_{11}, q'_{12}, ..., q'_{1n}, ..., q'_{ij}, ..., q'_{mn});
closeness and similarity relation matrices A and B.
L[p, i, *]: path list with no more than k paths at step p ending with hidden state i.
S[p, i, *]: score vector of the corresponding paths in L[p, i].

1: for p = 1, ..., m, each step in the HMM do
2:   for i = 1, ..., n do
3:     if p = 1 then
4:       S[p, i, 1] ← π(q'_{1i}) B(q_1, q'_{1i}); L[p, i, 1].append(i)
5:     else
6:       S[p, i, *] ← top-k scored: S[p−1, j, l] · A(q'_{p−1,j}, q'_{pi}) · B(q_p, q'_{pi})
7:       L[p, i, *] ← top-k scored: L[p−1, j, l].append(i)
8:     end if
9:   end for
10:  continue to the next step in the HMM
11: end for
12: return at step m, the top-k sequences from L[m, *, *]

C. Top-k Reformulated Query Generation

Given the initial query as an observed output sequence, and the extracted similarity/closeness relations, we need to find the top-k reformulated queries with the highest scores. In an HMM with known parameters and an observed output sequence, finding the hidden state sequence that is most likely to generate the observed output is a common problem. It can be solved by the Viterbi algorithm [31], an efficient dynamic programming algorithm.

As Viterbi is limited to the top-1 problem, here we present two extensions to find the top-k reformulated queries. One is a simple variant of the Viterbi algorithm, and the other is enhanced with an A* best-first pruning search strategy. We outline the algorithms in the following; the empirical comparison will be reported in the experiment section.

Algorithm 2 extends the stored intermediate state to the top-k best subsequences; keeping the best k subsequences ending at each state is enough to ensure that the top-k scored queries are found. The time cost of this algorithm is O(mn^2 k log k), that is, k log k times slower than the Viterbi algorithm, whose time complexity is O(mn^2) and space complexity is O(mn). A significant portion of the HMM state space can be pruned during the extraction: in the forward stage of the Viterbi algorithm, states with zero or low closeness to the previous state can be discarded. Equation 10 implies that the step-by-step computation can eliminate some irrelevant states.
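In the spirit of Algorithm 2, the following sketch keeps the k best partial sequences per hidden state at every step and applies the zero-closeness pruning mentioned above. The parameter layout (pi, emit, trans as dense arrays) and all names are our own; it is a simplified illustration rather than the paper's implementation.

```java
import java.util.*;

// Sketch: a top-k extension of the Viterbi algorithm in the spirit of Algorithm 2.
// At every step p and hidden state i it keeps the k best partial sequences ending in i.
public class TopKViterbi {

    record Candidate(double score, List<Integer> path) {}

    public static List<Candidate> topK(double[] pi, double[][] emit,
                                       double[][][] trans, int k) {
        int m = emit.length, n = pi.length;
        // best.get(i) = up to k best candidates ending in state i at the current step.
        List<List<Candidate>> best = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            best.add(new ArrayList<>(List.of(
                new Candidate(pi[i] * emit[0][i], List.of(i)))));
        }
        for (int p = 1; p < m; p++) {
            List<List<Candidate>> next = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                List<Candidate> merged = new ArrayList<>();
                for (int j = 0; j < n; j++) {
                    double step = trans[p - 1][j][i] * emit[p][i];
                    if (step == 0.0) continue;               // prune zero-closeness transitions
                    for (Candidate c : best.get(j)) {
                        List<Integer> path = new ArrayList<>(c.path());
                        path.add(i);
                        merged.add(new Candidate(c.score() * step, path));
                    }
                }
                merged.sort(Comparator.comparingDouble(Candidate::score).reversed());
                next.add(new ArrayList<>(merged.subList(0, Math.min(k, merged.size()))));
            }
            best = next;
        }
        List<Candidate> all = new ArrayList<>();
        for (List<Candidate> l : best) all.addAll(l);
        all.sort(Comparator.comparingDouble(Candidate::score).reversed());
        return all.subList(0, Math.min(k, all.size()));
    }

    public static void main(String[] args) {
        double[] pi = {0.6, 0.4};
        double[][] emit = {{0.9, 0.3}, {0.5, 0.7}};
        double[][][] trans = {{{0.2, 0.8}, {0.6, 0.4}}};
        for (Candidate c : topK(pi, emit, trans, 3)) {
            System.out.println(c.path() + " : " + c.score());
        }
    }
}
```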

Algorithm 3 A* Strategy for Top-k Queries under Viterbi

Input: query Q = [q_1, q_2, ..., q_m];
term candidates QC = (q'_{11}, q'_{12}, ..., q'_{1n}, ..., q'_{ij}, ..., q'_{mn});
closeness and similarity relation matrices A and B;
incomplete path set IP = [ ]; complete path set CP = [ ].

1: run the Viterbi algorithm along Q and QC for the top-1 query, obtaining S[p, i, 1]
2: h[p, i] ⇐ state score S[p, i, 1]
3: for i = 1, ..., n do
4:   L_i ← i
5:   L_i.h_score ← h[m, i]
6:   L_i.g_score ← 1
7:   IP.insert(L_i)
8: end for
9: while IP ≠ ∅ do
10:  L_temp ← max(IP), remove L_temp from IP
11:  for i = 1, ..., n do
12:    p ← m − L_temp.length
13:    j ← L_temp.last
14:    L_augment ← L_temp.append(i)
15:    L_augment.h_score ← h[p, i]
16:    L_augment.g_score ← L_temp.g_score · A(q'_{p−1,i}, q'_{p,j}) · B(q_p, q'_{p,j})
17:    if L_augment.length = m then
18:      reverse L_augment
19:      CP.insert(L_augment)
20:    else
21:      IP.insert(L_augment)
22:    end if
23:    if |CP| = k then
24:      break
25:    end if
26:  end for
27: end while
28: return CP

We proceed to present an improved query generation algorithm, sketched in Algorithm 3. In its first stage, the Viterbi algorithm runs. Then, utilizing the information memorized by Viterbi, an A* search strategy is performed to determine the top-k state sequences. The improvements of this new algorithm are: 1) it first utilizes the efficient Viterbi algorithm to collect state connection information; 2) the A* search strategy in the second stage prunes irrelevant states.

In this algorithm, the list CP contains complete paths from the head to the tail hidden state, while IP holds incomplete paths to be augmented. Both CP and IP only need to hold at most k paths. g_score preserves an incomplete path's score, while h_score is the optimal score it can still obtain in the remaining steps given its current condition. An incomplete path whose potential optimal score falls outside the top-k can be safely filtered out. Given the h_score previously calculated in the Viterbi stage, it is guaranteed that there exists one complete augmentation from the current partial path that reaches the score g · h.
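The fragment below illustrates only the pruning bound behind Algorithm 3: incomplete paths are ordered by the optimistic estimate g_score · h_score, and a partial path is discarded once that estimate cannot beat the current k-th best complete score. The types and names (PartialPath, gScore, hScore) are our own assumptions; the full backward expansion of Algorithm 3 is not reproduced here.

```java
import java.util.*;

// Sketch: the A*-style bound used in Algorithm 3. Partial paths are ranked best-first
// by their optimistic estimate g * h; once k complete paths are known, any partial path
// whose g * h cannot exceed the current k-th best complete score is pruned.
public class AStarPruning {

    record PartialPath(List<Integer> states, double gScore, double hScore) {
        double bound() { return gScore * hScore; }   // best score any completion can reach
    }

    public static List<PartialPath> prune(Collection<PartialPath> incomplete,
                                          double kthBestCompleteScore) {
        PriorityQueue<PartialPath> queue =
            new PriorityQueue<>(Comparator.comparingDouble(PartialPath::bound).reversed());
        queue.addAll(incomplete);

        List<PartialPath> survivors = new ArrayList<>();
        while (!queue.isEmpty()) {
            PartialPath p = queue.poll();
            if (p.bound() <= kthBestCompleteScore) break;  // everything left is dominated
            survivors.add(p);                              // still worth expanding
        }
        return survivors;
    }

    public static void main(String[] args) {
        List<PartialPath> ip = List.of(
            new PartialPath(List.of(2), 0.30, 0.50),   // bound 0.15
            new PartialPath(List.of(1), 0.10, 0.40),   // bound 0.04, pruned
            new PartialPath(List.of(0), 0.25, 0.60));  // bound 0.15
        System.out.println(prune(ip, 0.05).size());    // prints 2
    }
}
```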

VI. EXPERIMENT

We conduct an empirical study of the proposed query reformulation approach. Both the efficiency and the effectiveness of the proposed approach are reported.

The experiments are conducted on the bibliographic dataset DBLP. We extract tuples and terms from the DBLP data and create its graph representation. We use a recent dataset release with about 700 thousand authors, 1.3 million papers and 4.5 thousand conferences. We then bulk-load this dataset into a MySQL database and create an inverted index using Lucene 2. The code is written in Java and C#. The programs are run on a Windows server with 4 core processors and 12GB memory.

A. Effect on Term Similarity

One major contribution of our work is that we model term similarity in a linked context, which can effectively profile the characteristics of structured data. Specifically, we propose an enhanced random walk based similarity extraction method, which can facilitate "semantically" related term extraction. In this experiment, we first evaluate the effect of different approaches on term similarity. The difference between the basic random walk and our proposed method has been discussed in Figure 4. Here we further compare the proposed random walk based method with the co-occurrence similarity approach [15], which extracts similar terms by frequent co-occurrence counts.

Table II presents a similar topic extraction case, in which the target term "xml" is a popularly used word in paper titles. With the traditional frequent co-occurrence method, only some correlated words can be selected. For example, "document", "integration" and "indexing" are subareas of XML research. From them, we are able to understand specific topics given the input term. However, these co-occurring terms cannot indicate the other alternative meanings of the input term. With the help of the contextual random walk, we find some interesting similar or counterpart topics, such as "twig", "native", "keyword" and "html". Though there is still some noise in this kind of similar term extraction, we can see that the proposed

2 http://lucene.apache.org

TABLE II
SIMILAR TERMS EXTRACTED THROUGH DIFFERENT METHODS

target term: "xml"
frequent co-occurrence: "document", "relation", "structure", "index", "processing", "integration", "schema"
contextual Random Walk: "twig", "native", "dtd", "holistic", "keyword", "fragment", "html"

approach is better at providing alternative and intuitive topicsrelated but slightly diverse from original โ€œxmlโ€.

We could conclude that, the frequent co-occurrence methodis useful to specify usersโ€™ information need, but its suggestionis limited. The extracted method we proposed here couldextend usersโ€™ understanding and cognition of the underlyingdata sources.

In another case, we test author name as target term to findsimilar authors using different approaches. The existing co-occurrence methods can only find the collaborators of a certainauthor as the similar terms of it. For example, they can find thecollaborators of โ€œJiawei Hanโ€, such as โ€œPhilip S. Yuโ€ and โ€œJianPeiโ€. However, our approach can also find other well knownresearchers in Data Mining area, e.g., โ€œChristos Faloutsosโ€and โ€œRakesh Agrawalโ€, though they are not the collaboratorof โ€œJiawei Hanโ€. We find that, these indirect researchers areconnected through conference and shared research topics.

The above two case studies show the effectiveness of the proposed similarity measurement. In this work, we designed a heterogeneous TAT graph structure to model the structured data, which semantically integrates tuple and term relationships in a uniform way. Furthermore, the proposed random walk enhancement on the TAT graph takes contextual information into account, i.e., it exploits global correlations between two nodes as well as their multi-facet relations. Therefore, the proposed similarity measurement can better capture the semantic relations between terms in the structured data. Since no ground truth of term similarity is available, we leave the quantitative analysis to the next experimental evaluation, i.e., how the two similarity measurements affect the performance of query reformulation.

B. Effect on Query Reformulation

We next evaluate the performance of query reformulation on structured data. In this work, we design a probabilistic generation model to reformulate the input query and suggest new substitutive queries, which incorporates both the similarity and closeness metrics in an HMM model. We denote the proposed query reformulation method as TAT-based Reformulation in the following experiments.


We choose two baseline methods to compare with the proposed approach.

โˆ™ Rank-based reformulation: Given each termโ€™s similarterm list, we enumerate the possible combinations ofcorresponding terms, and return the queries with topsimilarity scores with original query. We use this baselineto demonstrate the contribution of probabilistic queryreformulation module in our approach.

โˆ™ Co-occurrence reformulation: We utilize the proposedquery reformulation framework, but replace the randomwalk based similarity with the co-occurrence based sim-ilarity [15]. This variant can be used to demonstrate thecontribution of new similarity metric in our solution.

The test query set contains 10 queries of various formats, consisting of topical words and author or conference names, such as “knn uncertain” and “Christian S. Jensen spatio-temporal”.

A ranked list of reformulated queries is returned by each method for every input query. Due to the absence of a gold standard set, we use Precision@N to measure the performance of the different algorithms, i.e., the percentage of the top-N answers returned that are relevant. Here, relevance refers to the similarity and semantic closeness of the reformulated queries with respect to the input query. In the experimental evaluation, we fetch the top-N reformulated queries with the highest similarity and ask three evaluators to judge the relevance between the original query and the reformulated queries.
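As a small illustration, Precision@N can be computed from the evaluators’ binary judgments as follows (a generic sketch, not the evaluation code used in the paper):

```python
def precision_at_n(judgments, n):
    """Precision@N: the fraction of the top-N suggestions judged relevant.
    judgments[i] is True if the i-th ranked suggestion was marked relevant."""
    return sum(judgments[:n]) / float(n)

# e.g. three of the top five suggestions judged relevant:
print(precision_at_n([True, False, True, True, False, True], 5))  # 0.6
```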

1) Quality of Reformulated Queries: We first compare the effectiveness of the different methods; the results are shown in Figure 5. We present the average precision of the test queries at rank positions 1, 3, 5, 7 and 10. The figure shows that our approach has advantages over the two baseline methods. The co-occurrence similarity based reformulation method cannot compete with the others. As it shares the same query reformulation model with our method, we conclude that the similar terms extracted by the frequent co-occurrence method may not be suitable for query reformulation.

As discussed previously, the co-occurrence method generally finds terms that co-occur frequently within a certain field of the structured data, so its top scored similar terms are only locally related. With the help of the contextually biased random walk, we can exploit context information to evaluate similarity over the underlying target corpora, and hence the reformulated queries generated from the corresponding similar-term lists of the query keywords have a higher probability of producing valid queries. Note that keyword query results are generally retrieved from joined tuples across multiple relations. In Section VI-A, we presented a case study on the similarity differences between the co-occurrence based method and our proposed method. The query reformulation results further verify the effectiveness of the TAT graph based similarity measurement.

Compared with the greedy-style Rank-based reformulation baseline, our proposed approach also performs better. The rank-based method simply combines similar terms from the corresponding lists and ignores the semantic correlations over the underlying structured data. Our probabilistic method takes structural information into consideration when evaluating the quality of a reformulated query. By combining similarity and closeness in a unified way, our approach yields more satisfactory performance.

Fig. 5. Query Generation Performance of Different Methods (average precision at rank positions 1, 3, 5, 7 and 10 for the TAT-based, Rank-based and Co-occurrence-based methods)

We next provide a showcase of reformulated queries. In Figure 6, we show a snapshot of the query reformulation demo interface. In the main column, traditional keyword search results are returned. In the right panel, several ranked reformulated queries are presented. The initial query “spatio temporal Christian S. Jensen” receives reformulated suggestions such as “moving object Ouri Wolfson” and “nearest neighbors Dimitris Papadias”, which cannot be generated by traditional query suggestion techniques, e.g., [15], [24].

Admittedly, there also exist some vague suggestions, as we exploit a larger scope of the structured data for query reformulation. However, comparing the reformulated queries with the returned paper list, we find that these query suggestions are novel and diverse, going beyond the returned papers and the initial input query. Such query suggestions can help users better understand and explore the data source.

Fig. 6. Reformulated Keyword Queries (demo interface snapshot for the input query “spatio temporal Christian S. Jensen”)

2) Time Efficiency: The query reformulation process requires a fast response time. In Section V, we discussed two algorithms in the probabilistic framework to support reformulated query generation.


Here we investigate their efficiency differences and deployment issues. In this experiment, we randomly sample 400 queries, varying the query length from 1 to 8. These queries are drawn from the following fields: author name, paper title and conference name.

Figure 7 shows the run time of the two algorithms. Algorithm 2 only uses k to prune the hidden state space and clearly fails to compete with the improved Algorithm 3. Algorithm 3 first uses a Viterbi pass to obtain reformulated query candidates and then a backward search to select the top scored ones.

Fig. 7. Time Cost of Query Generation Algorithms (process time in ms, log scale, against the size of the result set for the Extended Viterbi algorithm and the Viterbi + A* strategy)

The improved Algorithm 3 has two stages, i.e., a Viterbi based initialization and an A* backward search. Figure 8 illustrates the average response time of these two stages. The cost of both steps grows with the length of the query, and the Viterbi initialization is the more costly procedure. When the query length reaches 8, which can be considered a long query in real-life applications, the whole algorithm still yields satisfactory performance, taking less than 0.2 seconds.

Fig. 8. Time Cost of the New Top-k Query Generation Algorithm

3) Parameter Sensitivity: To show the robustness of the query generation model, we further test two parameters: the number of top reformulated queries returned and the size of the space of related hidden states in the HMM model. These two parameters determine the search space of our generation model. The following two experiments use the improved A* solution outlined in Algorithm 3.

Fig. 9. Time Cost with Different Returned Queries (process time of the A* stage in ms against the result size, with n=30 and m=6)

Figure 9 displays the average time cost as the number of recommended queries (k) grows, with the query length set to 6, which is a relatively long query since we only extract topical words here. The Viterbi step only generates the highest scored query and is therefore not affected by values of k larger than 1. We find that the time cost of the A* search stage grows linearly with k in Figure 9, which demonstrates scalability in terms of the result size.

Fig. 10. Time Cost with Varied Sizes of Candidate States (response time in ms against the size of the relevant term space, with k=30 and m=6)

Figure 10 presents the average response time when varying the number of hidden states in our proposed approach. In other words, in online interactive query reformulation, how many similar terms for each input term can we fetch while still ensuring a fast response? The result shows that our approach is very efficient, especially when the size of the similar-term list is less than 20. This is sufficient for most real interactive applications, such as Ajax or dialogue based query interfaces.

C. Effect on Reformulated Query Results

In the last experiment, we go a step further and test the keyword search results obtained with the reformulated queries. These results demonstrate how the reformulation affects search performance. The query set used for evaluation consists of keywords extracted from the titles of the 19 SIGMOD Best Papers since 1994, giving 19 queries in total. For each query, we obtain the top-10 reformulated queries using the different methods and conduct searches using these reformulated keywords.


We then examine the search results and design two objective metrics to evaluate the performance:

โˆ™ Result size: which is the average number of search resultsfor the top-10 reformulated keywords of 19 queries. Thelarger size means the approach tends to generate higherquality reformulated queries, vise versa.

โˆ™ Query distance: which is defined as the average termdistance of every corresponding term pairs between twoqueries, where term distance is the shortest path distanceof two terms in the TAT graph. In this experiment, wecalculate the average query distances between top-10reformulated queries and the original query. The distancereflects the diversity of the reformulated queries.

TABLE III
EFFECT ON QUERY REFORMULATION

               | TAT based | Rank based | Co-occurrence based
Result size    | 20.89     | 9.21       | 14.16
Query distance | 1.11      | 0.67       | 0.82

The results in Table III clearly show that our reformulation approach produces reformulated queries with not only better quality but also higher diversity. This experiment further verifies the effectiveness of our TAT-based approach, which better exploits the semantics of the underlying structured data.

VII. CONCLUSION

Information queries for text, multimedia and location are pervasive in current web search engines, and the availability of structured data continues to grow at an increasing speed. In this paper, we have addressed an interesting problem: how to automatically reformulate a user’s initial keyword query into similar or related ones, for better query understanding or recommendation.

Towards this goal, we proposed to model the structured data with a heterogeneous graph and designed effective approaches for keyword query reformulation. Specifically, we presented a random walk based similar term extraction method and utilized a probabilistic generation model for query reformulation. The experimental results show that our methods make a promising attempt in this direction.

Compared to traditional query suggestion on text documents, keyword query reformulation on structured data still has many open problems, which we will investigate in future work. For example, a quantitative measurement of the effectiveness of term relation extraction and query generation is ongoing work. With the collection of sizable query logs, the analysis of user interaction and feedback for this new kind of query reformulation is another interesting extension. We could also exploit the reformulated queries to support ad hoc faceted retrieval over structured data, which is more intuitive and user friendly.

ACKNOWLEDGMENT

This research was supported by grants from the Natural Science Foundation of China (No. 61073019 and 60933004) and HGJ Grant No. 2011ZX01042-001-001.

REFERENCES

[1] R. MacManus, Top 5 Web Trends of 2009: Structured Data, http://www.readwriteweb.com/.
[2] R. Kumar and A. Tomkins, “A characterization of online search behavior,” IEEE Data Eng. Bull., vol. 32, no. 2, pp. 3–11, 2009.
[3] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang, “WebTables: exploring the power of tables on the web,” Proc. VLDB Endow., vol. 1, no. 1, pp. 538–549, 2008.
[4] R. Gupta and S. Sarawagi, “Answering table augmentation queries from unstructured lists on the web,” PVLDB, vol. 2, no. 1, pp. 289–300, 2009.
[5] Y. Chen, W. Wang, Z. Liu, and X. Lin, “Keyword search on structured and semi-structured data,” in Proc. of SIGMOD, 2009, pp. 1005–1010.
[6] Y. Luo, X. Lin, W. Wang, and X. Zhou, “Spark: top-k keyword query in relational databases,” in Proc. of SIGMOD, 2007, pp. 115–126.
[7] S. Paparizos, A. Ntoulas, J. Shafer, and R. Agrawal, “Answering web queries using structured data sources,” in Proc. of SIGMOD, 2009, pp. 1127–1129.
[8] N. Sarkas, S. Paparizos, and P. Tsaparas, “Structured annotations of web queries,” in Proc. of SIGMOD, 2010, pp. 771–782.
[9] M. A. Hearst, Search User Interfaces. Cambridge Univ. Press, 2009, http://searchuserinterfaces.com/.
[10] R. Jones, B. Rey, O. Madani, and W. Greiner, “Generating query substitutions,” in Proc. of WWW, 2006, pp. 387–396.
[11] I. Antonellis, H. Garcia-Molina, and C. C. Chang, “Simrank++: query rewriting through link analysis of the click graph,” Proc. of VLDB, pp. 408–421, 2008.
[12] D. Kelly, K. Gyllstrom, and E. W. Bailey, “A comparison of query and term suggestion features for interactive searching,” in Proc. of SIGIR, 2009, pp. 371–378.
[13] E. Agichtein and L. Gravano, “Querying text databases for efficient information extraction,” in Proc. of ICDE, 2003, pp. 113–124.
[14] J.-H. Kang, M.-G. Kang, and C.-S. Kim, “Supporting databases in virtual documents,” in Proc. of ICADL, 2001.
[15] Y. Tao and J. X. Yu, “Finding frequent co-occurring terms in relational keyword search,” in Proc. of EDBT, 2009, pp. 839–850.
[16] N. Sarkas, N. Bansal, G. Das, and N. Koudas, “Measure-driven keyword-query expansion,” PVLDB, vol. 2, no. 1, pp. 121–132, 2009.
[17] Z. Liu, P. Sun, and Y. Chen, “Structured search result differentiation,” PVLDB, vol. 2, no. 1, pp. 313–324, 2009.
[18] G. Limaye, S. Sarawagi, and S. Chakrabarti, “Annotating and searching web tables using entities, types and relationships,” PVLDB, vol. 3, no. 1, pp. 1338–1347, 2010.
[19] J. Pound, I. F. Ilyas, and G. E. Weddell, “Expressive and flexible access to web-extracted data: a keyword-based structured query language,” in Proc. of SIGMOD, 2010, pp. 423–434.
[20] B. Aditya, G. Bhalotia, S. Chakrabarti, A. Hulgeri, C. Nakhe, P. Parag, and S. Sudarshan, “BANKS: browsing and keyword searching in relational databases,” in Proc. of VLDB, 2002, pp. 1083–1086.
[21] A. Balmin, V. Hristidis, and Y. Papakonstantinou, “ObjectRank: authority-based keyword search in databases,” in Proc. of VLDB, 2004, pp. 564–575.
[22] F. Liu, C. Yu, W. Meng, and A. Chowdhury, “Effective keyword search in relational databases,” in Proc. of SIGMOD, 2006, pp. 563–574.
[23] K. Q. Pu and X. Yu, “Keyword query cleaning,” PVLDB, vol. 1, no. 1, pp. 909–920, 2008.
[24] R. Varadarajan, V. Hristidis, and L. Raschid, “Explaining and reformulating authority flow queries,” in Proc. of ICDE, 2008, pp. 883–892.
[25] X. Yang, C. M. Procopiuc, and D. Srivastava, “Summarizing relational databases,” PVLDB, vol. 2, no. 1, pp. 634–645, 2009.
[26] Y. Zhou, H. Cheng, and J. X. Yu, “Graph clustering based on structural/attribute similarities,” PVLDB, vol. 2, no. 1, pp. 718–729, 2009.
[27] H. Tong, C. Faloutsos, and J. Pan, “Fast random walk with restart and its applications,” in Proc. of ICDM, 2006, pp. 613–622.
[28] R. Motwani and P. Raghavan, Randomized Algorithms. Cambridge Univ. Press, 2007.
[29] T. H. Haveliwala, “Topic-sensitive PageRank,” in Proc. of WWW, 2002, pp. 517–526.
[30] Q. H. Vu, B. C. Ooi, D. Papadias, and A. K. H. Tung, “A graph method for keyword-based selection of the top-k databases,” in Proc. of SIGMOD, 2008, pp. 915–926.
[31] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” in Proc. of the IEEE, 1989, pp. 257–286.