Keyword Query Routing

download Keyword Query Routing

of 13

description

IEEE Paper

Transcript of Keyword Query Routing

  • 5/21/2018 Keyword Query Routing

    1/13

    Keyword Query RoutingThanh Tran and Lei Zhang

    AbstractKeyword search is an intuitive paradigm for searching linked data sources on the web. We propose to route keywords only

    to relevant sources to reduce the high cost of processing keyword search queries over all sources. We propose a novel method for

    computing top-k routing plansbased on their potentials to contain results for a given keyword query. We employ a keyword-element

    relationshipsummary that compactly represents relationships between keywords and the data elements mentioning them. A multilevel

    scoring mechanismis proposed for computing the relevance of routing plans based on scores at the level of keywords, data elements,

    element sets, and subgraphs that connect these elements. Experiments carried out using 150 publicly available sources on the web

    showed that valid plans (precision@1 of 0.92) that are highly relevant (mean reciprocal rank of 0.89) can be computed in 1 second on

    average on a single PC. Further, we show routing greatly helps to improve the performance of keyword search, without compromising

    its result quality.

    Index TermsKeyword search, keyword query, keyword query routing, graph-structured data, RDF

    1 INTRODUCTION

    THE web is no longer only a collection of textualdocuments but also a web of interlinked data sources(e.g., Linked Data). One prominent project that largelycontributes to this development is Linking Open Data.Through this project, a large amount of legacy data havebeen transformed to RDF, linked with other sources, andpublished as Linked Data. Collectively, Linked Data com-prise hundreds of sources containing billions of RDF triples,which are connected by millions of links (see LOD Cloudillustration at http://linkeddata.org/). While differentkinds of links can be established, the ones frequently

    published are sameAs links, which denote that two RDFresources represent the same real-world object. A sample ofLinked Data on the web is illustrated in Fig. 1.

    It is difficult for the typical web users to exploit this webdata by means of structured queries using languages likeSQL or SPARQL. To this end, keyword search has proven tobe intuitive. As opposed to structured queries, no knowl-edge of the query language, the schema or the underlyingdata are needed.

    In database research, solutions have been proposed,which given a keyword query, retrieve the most relevantstructured results [1], [2], [3], [4], [5], or simply, select thesingle most relevant databases [6], [7]. However, these

    approaches are single-source solutions. They are notdirectly applicable to the web of Linked Data, whereresults are not bounded by a single source but mightencompass several Linked Data sources. As opposed to thesource selection problem [6], [7], which is focusing oncomputing the most relevant sources, the problem here is tocompute themost relevant combinations of sources. The goal isto produce routing plans, which can be used to compute

    results from multiple sources. To this end, we provide thefollowing contributions:

    . We propose to investigate the problem of keywordquery routing for keyword search over a largenumber of structured and Linked Data sources.Routing keywords only to relevant sources canreduce the high cost of searching for structuredresults that span multiple sources. To the best of ourknowledge, the work presented in this paperrepresents the first attempt to address this problem.

    . Existing work uses keyword relationships (KR)

    collected individually for single databases [6], [7].We represent relationships between keywords aswell as those between data elements. They areconstructed for the entire collection of linkedsources, and then grouped as elements of a compactsummary called the set-level keyword-element relation-ship graph (KERG). Summarizing relationships isessential for addressing the scalability requirementof the Linked Data web scenario.

    . IR-style ranking has been proposed to incorporaterelevance at the level of keywords [7]. To cope withthe increased keyword ambiguity in the web setting,we employ a multilevel relevance model, where

    elements to be considered are keywords, entitiesmentioning these keywords, corresponding sets ofentities, relationships between elements of the samelevel, and inter-relationships between elements ofdifferent levels.

    . We implemented the approach and evaluated it ina real-world setting using more than 150 publiclyavailable data sets. The results show the applic-ability of this approach: valid plans (precision@1 0.92) that are highly relevant to the user informa-tion need (mean reciprocal rank (RR) 0.86) canbe computed in 1 second on average using a

    commodity PC. Further, we show that whenrouting is applied to an existing keyword searchsystem to prune sources, substantial performancegain can be achieved.

    . The authors are with Institute AIFB, Karlsruhe Institute of Technology,76128 Karlsruhe, Germany. E-mail: {thanh.tran, l.zhang}@kit.edu.

    Manuscript received 26 Jan. 2012; revised 14 Aug. 2012; accepted 10 Dec.

    2012; published online 14 Jan. 2013.Recommended for acceptance by S. Sudarshan.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TKDE-2012-01-0065.Digital Object Identifier no. 10.1109/TKDE.2013.13.

    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 2, FEBRUARY 2014

  • 5/21/2018 Keyword Query Routing

    2/13

    Outline. Section 2 provides an overview of existing work.The problem and solution are sketched in Section 3. Thesummary model is presented in Section 4 while Section 5shows how it can be used to compute routing plans andSection 7 discusses how to rank them. Evaluation results areprovided in Section 8 before we conclude in Section 9.

    2 RELATEDWORK

    There are two directions of work: 1) keyword searchapproaches compute the most relevant structured resultsand 2) solutions for source selection compute the mostrelevant sources.

    2.1 Keyword Search

    Existing work can be categorized into two main categories:There areschema-based approaches implemented on top of

    off-the-shelf databases [8], [1], [2], [3], [9], [10]. A keywordquery is processed by mapping keywords to elements ofthe database (called keyword elements). Then, using theschema, valid join sequences are derived, which are thenemployed to join (connect) the computed keywordelements to form so-called candidate networks represent-ing possible results to the keyword query.

    Schema-agnostic approaches [11], [12], [13], [5] operatedirectly on the data. Structured results are computed byexploring the underlying data graph. The goal is to findstructures in the data called Steiner trees (Steiner graphs ingeneral), which connect keyword elements [13]. For thequery Stanford John Award for instance, a Steiner graphis the path between uni1and prize1in Fig. 1. Various kindsof algorithms have been proposed for the efficient explora-

    tion of keyword search results over data graphs, whichmight be very large. Examples are bidirectional search [11]and dynamic programming [5].

    Recently, a system called Kite extends schema-basedtechniques to find candidate networks in the multisourcesetting [4]. It employs schema matching techniques todiscover links between sources and uses structure discoverytechniques to find foreign-key joins across sources. Alsobased on precomputed links, Hermes [14] translates key-words to structured queries. However, experiments havebeen performed only for a small number of sources sofar. Kite explicitly considered only the setting where

    the number of databases that can be dealt with is up tothe tens [4].

    In our scenario, the search space drastically increases,and also, the number of potential results may increase

    exponentially with the number of sources and linksbetween them. Yet, most of the results may be not necessaryespecially when they are not relevant to the user. A solutionto keyword query routing can address these problems bypruning unpromising sources and enabling users to selectcombinations that more likely contain relevant results. Forthe routing problem, we do not need to compute results

    capturing specific elements at the data level, but can focuson the more coarse-grained level of sources.

    2.2 Database Selection

    More closely related to this work are existing solutions todatabase selection, where the goal is to identify the mostrelevant databases. The main idea is based on modelingdatabases using keyword relationships. A keyword relation-ship is a pair of keywords that can be connected via asequence of join operations. For instance, hStanford; Awardiis a keyword relationship as there is a path between uni1and prize1 in Fig. 1. A database is relevant if its keywordrelationship model covers all pairs of query keywords. M-KS [6] captures relationships using a matrix. Since M-KSconsiders only binary relationships between keywords, itincurs a large number of false positives for queries withmore than two keywords. This is the case when all querykeywords are pairwise related but there is no combined joinsequence which connects all of them.

    G-KS [7] addresses this problem by considering morecomplex relationships between keywords using a keywordrelationship graph (KRG). Each node in the graphcorresponds to a keyword. Each edge between two nodescorresponding to the keywords hki; kji indicates that thereexists at least two connected tuples ti $ tj that match kiand kj. Moreover, the distance between ti and tj aremarked on the edges.

    Compared to M-KS, G-KS computes more relevantsources due to these differences: G-KS adopts IR-styleranking to compute TF-IDF for keywords and for keywordrelationships. Further, it helps to reduce the number of falsepositives. It provides an additional level of filtering,validating connections between keywords based on com-plex relationships and distance information in the KRG.

    Both M-KS and G-KS assume that sources are indepen-dent and answers reside within one single source. The KRG,its keywords, relationships between keywords as well asscores are derived from and built for one single database.The solution we propose for modeling and scoring relation-ships is geared toward the entire Linked Data collection.Moreover, we make use of a summary, which instead ofcapturing relationships at the level of keywords (and dataelements), it operates at the level of element sets.

    3 OVERVIEW

    In this section, we discuss the data, define the problem, andthen briefly sketch the proposed solution.

    3.1 Web of DataWe use a graph-based data model to characterize individualdata sources. In that model, we distinguish between anelement-level data graph representing relationships between

    Fig. 1. Extract of the web data graph.

  • 5/21/2018 Keyword Query Routing

    3/13

    individual data elements, and a set-level data graph, whichcaptures information about group of elements.

    Definition 1 (Element-level Data Graph). An element-leveldata graph gN; E consists of

    . the set of nodes N, which is the disjoint union ofNE] NV, where the nodes NErepresent entities andthe nodesNV capture entities attribute values, and

    . the set of edgesE, subdivided by E ER] EA, whereER represents interentity relations, EA stands for

    entity-attribute assignments. We have en1; n2 2 ERiff n1; n22 NE and en1; n2 2 EA iff n12 NE andn22 NV.

    The set of attribute edges EAn fen; m 2 EAg isreferred to as the descriptionof the entityn.

    Note that this model resembles RDF data where entitiesstand for some RDF resources, data values stand for RDFliterals, and relations and attributes correspond to RDFtriples. While it is primarily used to model RDF LinkedData on the web, such a graph model is sufficientlygeneral to capture XML and relational data. For instance, atuple in a relational database can be modeled as an entity,and foreign key relationships can be represented as

    interentity relations.Definition 2 (Set-level Data Graph).A set-level data graph

    of an element-level graph gNE] NV; ER] EA is a tupleg0 N0; E0. Every node n0 2 N0 stands for a set of element-level entities Nn0 NE, i.e., there is mapping type :

    NE7! N0 that associates every element-level entity n2 NEwith a set-level element n0 2 N0. Every edge e0n0i; n

    0j 2 E

    0

    represents a relation between the two sets of element-levelentities n0i and n

    0j. We have E

    0 fe0n0i; n0

    jjeni; nj 2 ER;typeni n0i; typenj n

    0jg.

    This set-level graph essentially captures a part of the

    Linked Data schema on the web that are represented inRDFS, i.e., relations between classes. Often, a schema mightbe incomplete or simply does not exist for RDF data on theweb. In such a case, a pseudoschema can be obtained bycomputing a structural summary such as a dataguide [15].A set-level data graph can be derived from a given schemaor a generated pseudoschema. Thus, we assume a member-ship mapping type: NE 7! N0 exists and use n2n0 todenote that n belongs to the set n0. An example of the set-level graph is given in Fig. 2.

    We consider the search space as a set of Linked Datasources, forming a web of data.

    Definition 3 (Web Graph). The web of data is modeled as aweb graph WG;N; Ei ] E

    e , where G

    is the set of alldata graphs, N is the set of all nodes, Ei is the set of allinternal edges that connect elements within a particular

    source, and Ee is the set of all external edges, which establishlinks between elements of two different sources, i.e.,G fg1N

    1; E

    1; g2N

    2; E

    2, . . . ; gnN

    n; E

    ng;N

    Sn

    l1Nl ;

    E

    e feni; njjni 2 N

    i ; nj 2 N

    j ;N

    i 6 N

    j g, and E

    Snl1 E

    il

    [ Eel . When considering the nodes and edges only,we simply use WN; E. We use WG;N; E to distinguishtheelement-level web graphfrom theset-level web graphW0G0;N0; E0.

    3.2 Keyword Query Routing

    We aim to identify data sources that contain results to akeyword query. In the Linked Data scenario, results mightcombine data from several sources:

    Definition 4 (Keyword Query Result). A web graphWN; E contains a result for a query K fk1; k2; . . . ; k Kj jg

    if there is a subgraph also calledSteiner graphW

    S

    N

    S

    ; E

    S

    ,which for allki 2 K, contains akeyword elementnodeni2NS whose descriptionEAni matches ki, and there is a pathbetweenni andnj (ni $ nj) for all ni; nj2 NS.

    Typical for all keyword search approaches is thepragmatic assumption that users are only interested incompact results such that a threshold dmax can be used toconstrain the connections to be considered. The type ofSteiner graphs that is of particular interest is dmax-Steinergraphs WSNS; ES, where for allni; nj2 NS, paths betweenni andnj is of length dmax or less. This work also relies onthis assumption to constrain the size of the search space.

    Definition 5 (Keyword Routing Plan). Given the web graphW G;N; E and a keyword query K, the mapping :K ! 2G that associates a query with a set of data graphs iscalled a keyword routing plan RP. A plan RP isconsidered valid w.r.t. K when the union set of its datagraphs contains a result for K.

    The problem of keyword query routing is to find the top-k keyword routing plans based on their relevance to aquery. Arelevant planshould correspond to the informationneed as intended by the user.

    Table 1 provides an overview of all symbols.

    3.3 Multilevel Inter-Relationship Graph

    We illustrate the search space of keyword query routingusing amultilevel inter-relationship graph. At the lowest level,

    Fig. 2. Set-level web data graph.

    TABLE 1Notation

  • 5/21/2018 Keyword Query Routing

    4/13

    it models relationships between keywords just like a KRG

    in the G-KS approach [7]. In the upper levels, there areWN; E followed by W0N0; E0 and the source-levelweb graph, which contains sources as nodes.

    The inter-relationships between elements at differentlevels are illustrated in Fig. 3. A keyword is mentioned insome entity descriptions at the element level. Entities at theelement level are associated with a set-level element viatype. A set-level elementis containedin a source. There is anedge between two keywords iff two elements at the elementlevel mentioning these keywords are connected via a path.

    Fig. 3 represents a holistic view of the search space.Based on this view, we propose a ranking scheme that dealswith relevance at many levels. Further, Fig. 3 providesdifferent perspectives on the search space. It makes clearthat there are different views on the data representingmodels and summaries at different levels of granularity.Based on this representation of the search space, we nowdiscuss how existing work on keyword search and databaseselection can be extended to solve the problem of keywordquery routing.

    3.4 Basic Approaches

    Existing work on keyword search [11], [12], [13] relies on anelement-level model (i.e., data graphs) to compute keywordquery results. Elements mentioning keywords are retrieved

    from this model and paths between them are explored tocompute Steiner graphs. To deal with the keyword routingproblem, elements can be stored along with the sources theyare contained in so that this information can be retrieved toderive the routing plans from the computed keyword queryresults. Thus, existing keyword search solutions naturallyapply to this problem. However, the data graph and thenumber of keyword elements are possibly very large in ourscenario, and thus, exploring all paths between them in thedata graphs is expensive.

    KRG [7] captures relationships at the keyword level. Asopposed to keyword search solutions, relationships cap-

    tured by a KRG are not direct edges between tuples butstand for paths between keywords. For database selection,KRG relationships are retrieved for all pairs of querykeywords to construct a subgraph. Based on these keyword

    relationships alone, it is not possible to guarantee that sucha subgraph is also a Steiner graph (i.e., to guarantee that thedatabase is relevant). To address this, subgraphs arevalidated by finding those that contain Steiner graphs. Thisis a filtering step, which makes use of information in theKRG as well as additional information about which key-words are contained in which tuples in the database. It is

    similar to the exploration of Steiner graph in keywordsearch, where the goal is to ensure that not only keywordsbut also tuples mentioning them are connected. Thus, thisapproach also relies on information at the element level.However, since KRG focuses on database selection, it onlyneeds to know whether two keywords are connected bysome join sequences or not. This information is stored asrelationships in the KRG and can be retrieved directly. Forkeyword search, paths between data elements have to beretrieved and explored. Retrieving and exploring paths thatmight be composed of several edges are clearly moreexpensive than retrieving relationships between keywords(that capture some paths between elements mentioningthese keywords, while the actual paths are not needed).

    For extending this work to query routing, we need toconstruct a multisource KRG for the entire collection ofsources. Such a KRG models both relationships that arewithin and those that link between sources. Keywordrelationships are stored together with the elements theycontained in, and are associated with source information.In particular, we construct an element-level keyword-elementrelationship graph(E-KERG) where constituents are relation-ships eK hnKi ; nKj i and every element nK k;n;gcaptures besides the graph g, both the keyword k andthe elementn in which it was mentioned (thus, we use theterm keyword-element). Using E-KERG, one can retrieveconnections between ki and kj, validate whether thecorresponding entities ni and nj are connected, andfinally extract source information to construct keywordrouting plans.

    While extending the KRG like this can solve the keywordquery routing problem, it has one essential flaw. Just likekeyword search approaches, the employed model capturesrelationships at the level of elements. Since a model has tobe constructed for the entire collection of sources, thenumber of relationships might be very large. For reducingthe complexity of the search space, we will now propose a

    more compact model.

    4 SET-LEVELSUMMARY

    The summary we derive is not at the level of elements butset of elements.

    4.1 Set-Level KERG

    Definition 6 (Set-level KERG).GivenWG;N; E andW0G0;N0; E0, a set-level keyword-element relationship graphisa tuple W0K N

    0K; E

    0K. Every keyword-element node n

    0K2

    N0K is a tuple k; n0; g where n0 2 N0 is the corresponding

    set-level node it represents,g2 Gis the graph it is associatedwith, and k is a keyword that matches some descriptions

    EAn of the entities n2 n0. There is a keyword-elementrelationship (KER) e0K hn

    0Ki

    ; n0Kj i 2 E0K representing a

    Fig. 3. Multilevel Inter-relationship graph.

  • 5/21/2018 Keyword Query Routing

    5/13

    relationship between the keyword-element n0Ki ki; n0i; gi

    andn0Kj kj; n0

    j; gj, iff there is anni2 n0i that mentions ki,

    annj2 n0j that mentionskj, andni $ nj. Admax-KERG is aKERG, which captures only paths between ni and nj withlength dmax or less (ni $dmax nj).

    Intuitively, a dmax-KERG represents all paths betweenkeywords that are connected over a maximum distancedmax. This is to capture all dmax-Steiner graphs that exist inthe data. However, it does not capture paths exhibited by

    data elements but paths between set of elements. Asdiscussed in the following example, certain relationshipsat the level of elements are combined and summarized toone set-level relationship.

    Example 1.A KERG for our running example withdmax 1is illustrated in Fig. 4. For instance, there is a keyword-element node John; P erson;DBP edia. Note that therelationship hJohn; P erson; DBP edia; Award; P rize;DBPediai actually stands for the element-level connec-tions hJohn; per3; DBP edia; Award; prize1; DBP ediai,and hJohn; per4; DBP edia; Award, prize2; DBP ediaibecause per3 and per4 mention J ohn; prize1 and prize2mention Award, per3, per42P erson, prize1, prize22Prize, there is a path between per3 and prize1, and apath between per4 and prize2 (see web data graph inFig. 1). This example illustrates that element-levelrelationships, which share the same pair of terms (Johnand Award), classes (Person and Prize), and sources(DBPedia and DBPedia) can be summarized to onesingle set-level relationship.

    For constructing KERG, we extract keywords andrelationships from the data. Then, based on the elementsand sets of elements in which they occur, we createkeyword-element relationships. Precomputing relation-ships (i.e., paths) between data elements are typically

    performed for keyword search to improve online perfor-mance. These relationships are stored in specialized indexesand retrieved at the time of keyword query processing toaccelerate the search for Steiner graphs [12]. For databaseselection, relationships between keywords are also precom-puted [6]. This work neither considers relationshipsbetween keywords nor relationships between data elementsbut between keyword-elements that collectively representthe keywords and the data elements in which they occur.They are represented as keyword-element relationships.Further, different from previous works [6], [7], not onlyrelationships within a particular source but also relation-ships between sources are taken into account. We now

    discuss the methods for computing these relationships andfor constructing KERG.

    Relationships within Data Sources. For computing rela-tionships within individual sources, we model paths

    between nodes in a source gNE] NV; E as jNEj j N Ejsymmetric Boolean matrices MEgd w.r.t. variousdistances d. MEg1, illustrated as

    MEg1

    n1 n2 nnn1 0 1 1n2 1 0 0

    ..

    .

    ..

    .

    nn 1 0 0

    8>>>>>>>>>:

    9>>>>>=

    >>>>>;;

    represents direct paths between all entity pairs. An entry rdijin this matrix is 1, iff there is a path between ni and nj of

    length d, where ni and nj denote the entities at ith row

    and jth column in MEgd; otherwise, rdij i s 0. For

    1< d dmax, the matrix MEgd is constructed iteratively

    using the formula MEgd rdijnn M

    Egd 1 M

    Eg1,

    which represents paths between entities with length no

    greater than d.

    Relationships between Data Sources. For computing rela-tionships between sources of the web graphWG;N; Ei; Ee,

    we model external links e2 Ee between every data source

    pair hgiNEi] NVi ; Ei; gjNEj] NVj ; Eji as a jNEi j j N Ej j

    Boolean matrix MEegi;gj. In MEegi ;gj

    , an entry is 1 iff

    eni; nj 2 Ee, where ni and nj denote the entities at ith

    row andjth column inMEegi;gj; otherwise, it is 0. Then, we

    compute all relationships using the following matrix:

    Theorem 1.All paths between nodes of the graph giand nodes of

    the graphgjwith maximum lengthdare captured by a jNEi j

    jNEj j Boolean matrix MEgi ;gj

    d. We have MEgi;gj0 1

    andME

    gi ;gj

    1 MEegi;gj

    . For d >1,

    MEgi;gjd X

    di ;dj2PIN0;di1djd

    MEgi di MEegi;gj

    MEgj dj;

    1

    whereP IN0 denotes 2-elements permutations of the set IN0.

    Proof Sketch.Every path ngi $ngj consists of three parts,namely the subpath pi ngi $n

    gi

    that is completelycovered by gi, the subpath pj ngj $ngj covered by gj,and the external edge pij ngi $n

    gj

    , which connects piwith pj. The paths pij represent the shortest relation-ships between elements of gi and gj with d 1. Allpaths between elements of gi and gj have distance ddi 1 dj, where di and dj denote the length ofpi andpj, respectively. The matrix multiplication M

    Egi

    di MEegi;gj M

    Egj

    dj gives all the paths with distanceddi 1 dj. The matrix addition over all possiblepermutations of di; dj, where di 1 dj d gives allthe paths, which have distance between 1 and d. tu

    Constructing Keyword-Element Relationships. First, allterms are extracted from the data. For each pair of termshki; kji, if there exist two nodesni andnj whereki 2 EAniand kj2 EAnj and ni$ nj, or when ki and kj are

    mentioned in the description of the same element(ni nj), we create an element-level KER hnKi ; nKj i. Twoelement-level KERs hnKi1 ki; ni1 ; gi; nKj1 kj; nj1 ; gji andhnKi2 ki; ni2 ; gi; nKj2 kj; nj2 ; gji are grouped to one single

    Fig. 4. A Set-level 1-KERG.

  • 5/21/2018 Keyword Query Routing

    6/13

    set-level KER hn0Ki ki; n0i; gi; n0Kj kj; n0j; gji iff ni1 ; ni2 2n0iand nj1 ; nj1 2n

    0j. For computing all set-level KERs, this

    summarization is iteratively performed for all possiblepairs of element-level KERs.

    5 COMPUTINGROUTINGPLANS

    Routing plans are computed by searching for Steiner graphsin the summary. Since edges in the summary stand forpaths between elements, subgraphs of the summarycapturing Steiner graphs can be defined as follows:

    Definition 7 (Routing Graph). Given a keyword query K

    fk1; k2; . . . ; k Kj jg and a KERG W0KN

    0K; E

    0K, a routing graph

    W0SK N0S

    K ; E0SK is a subgraph of W

    0K that satisfies the

    following properties:

    . For each query keyword k2 K, there is a noden0Ki k; n

    0; g 2 N0SK .. It is a complete graph, i.e., for every pair of nodes

    n0Ki ; n0Kj

    2 N0SK there is an edge hn0Ki

    ; n0Kj i 2 E0SK.

    Thus, we call Steiner graphs that can be found in thesummary routing graphs. Similar to a routing plan, arouting graph contains a set of data sources. Moreover, arouting graph contains information that enables the user to

    assess whether it is relevant: a plan is relevant only if thenodes mentioning the keywords and relationships betweenthem correspond to the intended information need. Thisadditional information will be used in the evaluation toassess the effectiveness of ranking.

    As discussed, unlike keyword search, which requiresexpensive exploration of edges between elements (at theelement-level), relationships between keywords capturingpossibly complex paths can be directly retrieved andemployed for routing. Basically, the computation can bedivided into three stages: 1) computation of routing graphs,2) aggregation of routing graphs, and 3) ranking queryrouting plans. First, for each query keyword pair, corre-sponding relationships are retrieved from KERG. They arejoined according to a plan (as illustrated for three key-wordski; kj; kl in Fig. 5, they act as join variables) to obtainrouting graphs. Note in Fig. 5 that for obtaining completegraphs, it is necessary to join on all three keywords. Whileother standard join implementations can be used, weemploy hash join in our experiment. Because the order ofjoins does not have an effect on the top-k procedure, weemploy in the end (discussed in Section 7), pairs ofkeywords are chosen randomly and iteratively until allquery keywords are covered.

    The procedure for computing routing plans is de-

    scribed in Algorithm 1. Given a query K and the summaryW0K, the algorithm computes a set of routing plans RP.For this, it first determines the join plan J P. Based on thisplan, KERG relationships are retrieved for every keyword

    pair, and joined with the intermediate result table T.This table contains candidate routing graphs, includingthe scores of their constituent elements and theircombined score. When the join plan is worked off, thecombined score is computed for every tuple in T, i.e.,for every routing graph W0SK. Routing graphs, whichrepresent the same set of sources, are aggregated into

    one single result. This is because we want to output onlythose plans that capture unique combination of sources.The score of a plan is computed from the scores of itsrouting graphs according to a ranking function that wewill discuss in the next section.

    Example 2.For the query K fStanford; John; Awardg, weretrieve the keyword-element relationships

    hStanford; University; F reebase;

    John; P erson; F reebasei;

    hStanford; University; F reebase;

    John; P erson; DBLPi;and hStanford; University; F reebase, John; P erson;DBPediai for the pair of keywords hStanford; Johniand proceed in the same way for the other pairshJohn; Awardi and hStanford; Awardi. We join theserelationships in the orderhStanford; Johni fflJohn hJohn,Awardi fflStanfordhStanford; Awardi fflAward hStanford,Awardi. One resulting join sequence is: 1) first

    hStanford; University; F reebase;

    John; P erson; DBP ediai fflJohn

    hJohn; P erson; DBP edia; Award;P rize; DBP ediai

    (i.e., join these two keyword-element relationships onthe keyword-element retrieved for John, which isJohn; P erson; DBPedia), 2) then join this intermediate

    Fig. 5. Join plan for three keywords.

  • 5/21/2018 Keyword Query Routing

    7/13

    result with hStanford; University; F reebase; Award;P rize; DBP ediai o n t he e le me nt r et ri ev ed f orStanford, and 3) finally, join with

    hStanford; University; F reebase;

    Award; P rize; DBP ediai

    on the element retrieved forAward. The result of this join

    sequence is the path

    hStanford; University; F reebase;

    John; P erson;DBP edia; Award; P rize; DBP ediai;

    representing the one and only routing graph of thisexample. Thus, no aggregation needs to be performedand the only routing plan derived from this is fFreebase,DBPediag.

    6 SUMMARY ANDALGORITHMANALYSIS

    In this section, we analyze completeness, soundness, andthe complexity of using KERG.

    6.1 Completeness

    Given there are keyword query results, KERG can be usedto compute all routing graphs:

    Theorem 2. Given K and the graph W, there exists a routinggraphW0SK ifWcontains a result for K.

    Proof Sketch. A web graph WN; E contains a result ifthere is a Steiner graph WSNS; ES in Wwhere for allquery keywords ki, there is a corresponding keyword

    element ni 2 NS

    , and ni $ nj for all ni; nj 2 NS

    . Recallthat a routing graph is derived from the set-level KERGW0KN

    0K; E

    0K. By Definition 7, a routing graph is

    essentially a Steiner graph at the summary level. Thus,we have to show that W0K contains a Steiner graph ifWdoes, i.e., for all keyword elements ni2 NS and allpairs of connected elementsni; nj2 NS s.t. andni $ nj,there are corresponding elements n0Ki ; n

    0Kj

    2 N0K, andn0Ki $n

    0Kj

    . These correspondences follow from Defini-tion 6 stating that for all pairs of keywords hki; kji iffthere is an ni 2 n0i that mentions ki, an nj2 n

    0j that

    mentions kj, and ni $ nj then there are corresponding

    elements n0

    Ki ki; n0

    i; gi and n0

    Kj kj; n0

    j; gj, andrelationship e0K hn0Ki

    ; n0Kj i 2 E0K. tu

    6.2 Soundness

    While completeness is guaranteed, the use of summarieshowever sacrifices soundness, i.e., there are cases where thecomputed graphs are not necessarily valid:

    Theorem 3. The existence of a routing graph W0SK does notguarantee that there exists a result forK.

    Proof Sketch. Given the query K, a routing graphW0SK N

    0SK ; E

    0SK and W, which represents the union set of

    sources captured byW0SK, W0SK is valid if a Steiner graph

    WS can be constructed from W. This is the case if Wcontains keyword elements for all ki 2 K, and theseelements are pairwise connected. We just showed that ifkeyword elements and paths between them exist at the

    element-level, then corresponding elements and relation-

    ships can also be found in W0 and the summary W0K,

    respectively. We now show that this is not true vice

    versa: for a set of three keywords ki; kj; kl or more, the

    existence of the keyword-element relationshipshn0Ki ; n0Kj

    i

    andhn0Kj ; n0Kl

    i inW0K, where some elements captured by

    n0Ki ; n0Kj

    , andn0Kl mentionki; kj, andkl, respectively (i.e.,

    n0Ki ; n0Kj , and n0Kl are connected keyword elements at the

    summary level), does not imply the existence of the

    paths hni; nji and hnj; nli in W, where ni; nj, and nlmention ki; kj, and kl, respectively (i.e., ni; nj; nl are

    connected keyword elements at the element level).Assuming an element- and set-level graph g and g0

    like in our running example and

    k1 John;k2 Stanford; k3 Music; k12 EAper3; k1

    2 EAper4; k22 EAuni1; k32 EAprize2;

    andper3$ uni1,per4$ prize2, andper3,per42 P erson,

    uni12 University, prize22 Prize. Then, the KERG re-lationships derived from that include hJohn, Person,

    DBPedia, Stanford, University, Freebasei a n d

    hJohn, Person, DBPedia; Music, Prize, DBPediai.

    Thus, it can be derived from KERG that there exists a

    relationship between the elements representing the

    keywords John, Stanford, and Music while in fact at

    the element level, the Johnper3 that is connected with

    Stanford is not the same as the Johnper4 that is

    connected with Music. tu

    6.3 ComplexityFor computing routing plans, elements retrieved from

    KERG have to be joined. In the worst case, a join operation

    requiresjinputij jinputjjtime and space. Inputs are joined

    for all possible pairs of keywords. This results in complexity

    OinputCK;2max , where inputmax here refers to the largest

    number of relationships that can be obtained for a keyword

    pair,CK; 2 is the number of 2-element combinations of the

    set K, and CK; 2 1 is the number of joins. However, in

    practice, inputs can be sorted such that efficient implemen-

    tations like merge join can be applied. As a result, a join

    requires only jinputij jinputjj such that overall complexityisOinputmax CK; 2.

    While the number of joins is same for all kinds of

    summary models (when using the same proposed proce-

    dure), the size ofinputmax varies. Clearly, the more coarse

    grained the grouping, i.e., the higher the number of

    elements aggregated to one group at the summary level,

    the smaller will be inputmax. Compared to E-KERG, KERG

    is potentially smaller because it summaries relationships at

    the element level (i.e., E-KERG relationships), which share

    the same pair of terms, classes, and sources. In the extreme

    case where every data element mentions only distinctterms, KERG and E-KERG are equal in size. In practice,

    elements in the same set have many terms in common such

    that KERG can yield substantial performance gain.

  • 5/21/2018 Keyword Query Routing

    8/13

    7 RANKING ROUTINGPLANS

    To produce relevant plans, which yield results correspond-ing to the user information need, we propose a multilevelIR-style ranking scheme.

    Keyword-element Score. Analogous to the notions of termfrequencytf and inverse document frequency idfused inIR, we define the frequency tfki; n0K for a term ki w.r.t. a

    keyword-element noden0Kk; n

    0

    ; gas the number of entitiesn2 n0, whose descriptions contain ki, i.e.,

    tfki; n0Kk; n

    0; g fnj n 2n0; ki2 EAngj j 2

    and

    idfki ln

    N0Kn0K j n0K 2 N0K; tfki; n0K> 0

    ; 3

    where the numerator denotes the number of keyword-element nodes in KERG and the denominator stands forthe number of nodes n0K that mention ki. While tfki; n

    0K

    can be regarded as a measure for the descriptiveness ofkiw.r.t. the elementsn0K,idfkistands for the distinctivenessofki. Intuitively, a highertfki; n0Kmeans that more likely,n0K is aboutki, while a higheridfkireflects thatki is a lesscommon term. Just like in IR, we combine these twometrics to evaluate the relevance of a routing graph noden0K w.r.t. ki using

    S

    ki; n0K

    tf

    ki; n

    0K

    idfki: 4

    Keyword-element Relationship Score. In the same IR fashion,we propose metrics to measure the descriptiveness anddistinctiveness of connected term pairs. Analogous tokeyword-element scores, we define the frequencytfhki; kji; e0K; d, which denotes the number of keyword-element relationships eK hnKi ki; n

    0i; gi; nKj kj; n

    0j; gji at

    distance d, which have been summarized to one KERGrelationship e0K hn

    0Ki

    ; n0Kj i; and idfhki; kji is computedfrom the total number of KERG relationships and theones where tfhki; kji; e0K; d> 0. In this case, a highertfhki; kji; e0K; d means that more likely, e

    0K is about the

    keyword relationship hki; kji, while a higher idfhki; kjimeans that the keyword relationship hki; kjiis less frequentin the data. We combine them to obtain the score

    S

    hki; kji; e0K

    X

    0id

    d tfhki; kji; e0K; d idfhki; kji;

    5

    where d 1d1 is a monotonically decreasing functionrecognizing that when a path connecting two terms has agreater length, its structural relevance for these termsshould be smaller.

    Ranking Keyword Routing Plan. Incorporating the rele-vance score of constituent keyword-elements and keyword-element relationships, the ranking of a routing graphW0SK N

    0SK ; E

    0SK w.r.t. a query K is computed as a combination

    of scores

    S

    K; W0SKN

    0SK ; E

    0SK

    X

    e0Kn0Ki

    ki;n0i;gi;n0Kj

    kj;n0j;gj2E0SK

    S

    hki; kji; e0K

    S

    ki; n0Ki

    S

    kj; n

    0Kj

    :

    Finally, the rank of a keyword routing plan RP iscomputed as the aggregation

    SK; RP X

    W0SK

    2T O P K RP

    S

    K; W0SK

    ; 6

    where TOP KRP returns the k-best ranked routinggraphs inRP. Using onlyk-best ones enables us to control

    the quality and also, not to associate a high rank to thoseplans, which can be derived from a large number but lessrelevant graphs.

    While the proposed scheme also leverages TF-IDF likethe ones used by M-KS and its extension G-KS, it addressesa different ranking problem, i.e., ranking combinations ofsources instead of single sources. It does not consider everysource independently. Node and relationship scores arederived from statistics computed for the entire collection ofsources. Moreover, this scheme is more fine-granular,considering several levels of the search space. As opposedto G-KS, not only the scores of nodes and relationships at

    the keyword level but also their contributions to the score ofkeyword-element relationships and routing graphs at theset level are incorporated.

    For estimating the score of a KRG node w.r.t. k, G-KSconsiders all tuples in the database that mention k. Thus,tuples have an effect on the final score even when they donot participate in the final answer. The same problemapplies to KRG relationship scores. The score of a KERGelement (relationship) w.r.t. k is computed not from allelements (tuples) mentioning k but only those thatparticipate in the set-level element (relationship).

    G-KS prunes invalid term relationships by consideringthe underlying paths and their distance to reduce falsepositives. However, this additional pruning step does nothave an effect on the scores. This pruning step is similar tothe computation of routing graph in our approach. In fact,pruning is a by-product of this computation because duringthis process, keyword-element relationships are succes-sively retrieved and ruled out if they cannot contribute tothe final routing graphs. As opposed to G-KS, scores ofelements and relationships which have been pruned do nothave an effect on the final score. Also, only the scores ofsome top routing graphs are incorporated into the overallscore of the plan.

    Top-k Plans. As demonstrated, we have reduced the

    problem of computing plans to the standard problem ofprocessing join queries, for which there exist differentkinds of optimization. Now, after the scores have beendefined, we show that for instance, top-k algorithms suchas rank-join [2] can be applied to output the k best resultsand to terminate early. Rank join takes a query plan with afixed join order such as the one used in the previoussection. In fact, the order of join operators can be chosenrandomly as with rank join, the order of joining inputs(their entries) is not dependent on the order of operatorsbut mainly determined by the pulling strategy, whichbasically prefers input entries that more likely lead to top-k

    results (those with high scores). Thus, rank join requiresinputs that are sorted on their score, and assumes amonotonic score aggregation function for computing thescore of the joined result. As a rank join input, we use

  • 5/21/2018 Keyword Query Routing

    9/13

    KERG relationships retrieved for a particular keywordpair. Hence, the number of inputs equals the number of allpossible combinations of two keywords (for a query with nkeyword, this is n!=2n 2!). Since both the scoreaggregation functions in (6) and (7) are monotonic w.r.t.their element scores, the input entries can be indexed withand sorted by Shki; kji; e0K Ski; n

    0Ki

    S kj; n0Kj . In-stead of using the standard join operator in Algorithm 1,we can now apply rank join on these sorted inputs tocompute top-k results.

    8 EVALUATION

    We implemented the proposed approach in Java using JDK1.6 on top of MySQL 5.1. The experiments were conductedon a commodity PC with 2.5-GHz Intel Core, 4 GB of RAM,and 500-GB HDD S-ATA II 7,200 rpm, running onWindows 7.

    8.1 Data

    The data used for the experiments are drawn from data setsprepared for the Billion Triple Challenge1 (BTC). BTC datawere crawled from major Semantic Webs websites during

    February/March 2009. BTC data were split into chunks of10M statements each. All the chunks, additional informa-tion, and statistics are made available at http://vmlion25.deri.ie/index.html.

    The data we used for the experiment are the chunk thatcan be found at http://vmlion25.deri.ie/btc-2009-small.nq.gz. The raw ungzipped file is 2.2 GB. This chunk ofdata contains a large number of data sources. Some ofthem contain less than 3K RDF triples. Excluding thesesmall sources from the experiments resulted in a final dataset that has about 10M RDF triples contained in 154 differentsources.

    Based on the number of RDF triples they contain, thesesources can be classified into six categories. Table 2 showsstatistics for each category and some example sources.

    8.2 Data Preprocessing

    Index Size and Building Time. During the index buildingprocess, we counted the number of keyword relationships,i.e., all pairs of keywords that are connected over amaximum distance dmax. This is to resemble the M-KSmodel [6], which captures all binary relationships betweenkeywords. As discussed, E-KERG extends G-KS [7] to thekeyword routing scenario. We counted the number ofelement-level keyword-element relationships (E-KERs) tocapture this baseline. Finally, we consider the number ofrelationships in KERG (KERs). These numbers were

    counted for the entire data and separately for everysubcategory. At variousdmax, Fig. 6a illustrates the numberof KRs versus E-KERs versus KERs for the entire data.Figs. 6b, 6c, and 6d show the number of KERs, the storagesize required for the corresponding KERG indexes, and thetime for building these indexes for data sets of differentcategories. Concrete numbers are provided in Table 3 forthe entire data.

    Expectedly, the number of KRs, E-KERs, and KERsincreased with dmax. The increases were sharp whendmax> 1. Compared to E-KERs, the increases for KRs andKERs were moderate. As illustrated in Fig. 6a, the increase

    of KERs was comparable to KRs. Table 3 more preciselyshows that for dmax 4, the number of KERs wasonly about 20 percent higher than KRs while the numberof E-KERs was more than three times higher. Thus, theseresults suggest that summarizing at the level of sets cansubstantially reduce the number of relationships.

    TRAN AND ZHANG: KEYWORD QUERY ROUTING 371

    TABLE 2Data Set Statistics

    Fig. 6. Element-level and Set-level KERs for entire data (a), KERG statistics for data sets of different categories (a-c).

    1. http://challenge.semanticweb.org/.

    TABLE 3KERG Statistics for Entire Data

  • 5/21/2018 Keyword Query Routing

    10/13

    Fig. 6b shows that also for every category of data sets, thenumber of KERs increased with dmax. This increase isreflected in the size of the index (Fig. 6c). For the entire dataas well as for every category, the index size also increasedwith dmax. Likewise, the higher dmax, the higher was theamount of time needed for building the index (Fig. 6d).These results are not surprising because at a greater dmax,more relationships have to be taken into account.

    However, we noted that these results were not strictlydependent on the data size. That is, the number of KERs,the size of the summary as well as building times did notdirectly correlate with the number of triples contained in

    the data sets. There were cases where relatively small datasets resulted in large KERGs. In Fig. 6d, for instance, we cansee that times for categories of larger size were higher thanthose of smaller size. However, while category 2 was morethan 150 percent larger in size, the difference in indexbuilding time to category 3 was less than 5 percent. InFigs. 6b and 6c, we can see that at dmax 4, the number ofKERs and the index size of category 2 were even smallerthan those of category 3.

    We found out that the dominant factor, which substan-tially determined index size and, thus, index building times,was the structural density of the data. In the experiments,

    densely structured data graphs resulted in higher buildingtimes than sparse graphs. Here, density refers to thedistribution of edges within data sources and links betweendata sources. Category 3 for instance is relatively dense,containing data sets of smaller size that, however, exhibitlarge amounts of links to other sources, and contain severalnodes that are well connected, i.e., reach hundreds of othernodes within dmax 4.

    8.3 Queries

    Our main goals of the evaluation are to verify the validityand measure the relevance of the generated keywordrouting plans. For a plan to be valid, the underlying query

    should produce answers. Further, interesting queries inthis setting are those which combine results from differentsources. We asked researchers who were familiar with theBTC data set to provide keyword queries that return

    meaningful results, along with descriptions of the in-tended information needs. In total, we have 30 keywordqueries, each of them involves more than two datasources. One example submitted by participants isRudi AIFB ISWC2008, and the associated description isFind the relationships between professor Rudi Studer,the AIFB Institute and the ISWC2008 conference. Thedata sources containing partial answers to this query are

    uni-karlsruhe.de and semanticweb.org. The keywords ofthe first 20 queries are shown in Table 4.

    8.4 Quality of Routing Plans

    This part of the experiment investigates the quality of therouting plans. The goal is to find out whether they producesome results and how well they match the informationneeds, i.e., we want to assess both the validity and therelevance of routing plans. Recall that while KERG is builtto capture all valid plans, the scoring mechanism isdesigned to focus on relevant plans. We now investigatehow appropriately KERG summarizes the search space andhow correctly the computed scores capture relevance.

    Validity. Given a keyword query, letRPk be the set oftop-krouting plans returned by our approach and letRP

    denote the set of valid keyword routing plans. To find outwhether a plan is valid, we looked into the data to see if itproduces at least one result. Considering only the k bestplans, we used the standard IR measure precision at k tocompute the percentage of query plans that are valid out ofkplans, i.e.,

    P@kjRP \ RPkj

    jRPkj :

    Fig. 9a shows P@k at k 1 up to 10, using theconfigurations dmax 0; 1; 2; 3; 4. These values representthe average computed for all 30 queries. When only thesingle best plan was considered (P@1), precision up to95 percent was achieved. Clearly,dmax had a positive effecton the quality of the produced plans. More valid plans werecomputed a higher dmax. When dmax> 1, precision wasalways higher than 50 percent at all values for k. However,we noted that precision did not increase substantially whendmax> 3, i.e.,dmax 4 improveddmax 3 by only 7 percentatk 1 and showed the same performance at k > 6.

    Fig. 7 shows P@k obtained for different categories ofqueries atdmax 1; 2; 3; 4. The goal of this illustration is to

    breakdown the results into queries of different complexities.Clearly, P@k increased withdmax. Moreover, we can seethat P@k correlated with query complexity. P@k valuesobtained for queries with a larger number of keywordswere with only few exceptions, lower than queries with asmaller number of keywords. For instance, queries withonly two keywords were processed with precision of100 percent when dmax> 2. By contrast, queries with fivekeywords achieved high precision only whendmax 4.

    We noted that whendmax 1; 2, P@k values were clearlydifferent for queries with different number of keywords.However, at greater dmax, differences between query typeswere small. Also, P@k increased sharply at greater dmax(and

    smaller k). For instance, at dmax 4, P@10 0.36, P@3 0.62, and P@1 1 for queries containing five keywords, andthe difference in precision compared to queries containingfour keywords is only 5 percent on average.

    TABLE 4Queries Employed in the Experiments

  • 5/21/2018 Keyword Query Routing

    11/13

    To sum up, dmax had a positive effect, both the numberof keywords and the number of presented results k had anegative effect; using a larger value for dmax minimizes theeffect of query complexity and substantially improvesprecision, especially at a small k; dmax 3 seems to besufficiently large as no much improvement was achievedwith dmax> 3. With dmax 3, P@1 = 0.92 on average forall queries.

    Relevance. To determine whether a valid plan alsoproduces relevant results, we look at the descriptions ofinformation needs provided by participants. Note that boththe generated routing plans and the underlying routinggraphs capture information needs, albeit at different levelsof granularity. To work with a more precise representation,we used the routing graphs. To assess the relevance of arouting plan, we selected its routing graph that has thehighest score and compared it with the description of theinformation need. When there is match, we assume thatcorrect results matching that need can be obtained usinga keyword search system and the routing plan is thusrelevant.

    A more direct way to measure relevance is to assesswhether relevant plans actually contain relevant queryresults. However, this result-level relevance not onlydepends on routing but also on the (ranking performedby the) underlying keyword search system. We verify thisusing the keyword search system and ranking schemepresented in [16]. The relevant plans determined asdiscussed (mostly, there is only one) are used as inputs tothis system, which employ them for pruning sources. Withpruning, this system generates less results. However, wemanually verified that more than 99 percent of thesemissed results are not relevant. Further, we observe that

    when using all the top-2 or top-3 plans for pruning,6 percent or none of the results in the top-50 list are missed,respectively. Thus, we conclude that the plan(s) manuallyassessed as relevant may contain nonrelevant results butpreserve all relevant results, and only the top-3 computedplans are needed to produce all the top-50 results.

    As a measure for the relevance, we use the standard IRmetricreciprocal rank(RR) defined asRR 1rRP, whererRP

    is the rank of the planRP 2 RPk determined as relevant.If the relevant plan is not in RPk, RR is simply 0. TheMean RR (MRR) is defined as

    MRR

    1

    jQjX

    Q

    i1 RR:

    Thus, MRR measures whether a relevant plan is in thetop-klist and also, whether it is ranked high.

    MRR values for queries of different categories at variousdmaxare shown in Fig. 9b. Clearly, also MRR were higher forqueries with smaller number of keywords. When thenumber of keyword is 2 for instance, MRR was 0.6 fordmax 1, up to 1 for dmax> 2. This means that for thesequeries, relevant plans were found and were also rank first.Also for queries with three or more keywords, results werereasonable. Average MRR for all queries was 0.86 for dmax

    3 and 0.96 for dmax 4. All these values were obtained bylooking for the relevant plan only in the list of top-5 results.Using k 1; 2; . . . ; 10 did not yield substantial differences.For instance, MRR computed for the top-5 plans is only1 percent worse than for the top-10 plans.

    It seems that using dmax 3 also yielded highly relevantplans: MRR 0:86means in this case that for all queries, arelevant corresponding plan could always be found andwas ranked among the top 2.

    Baseline. As pointed out in Section 7, the idea proposedfor G-KS can be implemented by computing scores of KERGnodes and relationships based on all elements in thedatabase (instead of considering only elements that parti-cipate in the set-level nodes and relationships captured byKERG). Further, scores of all nodes and relationships thathave been retrieved for all query keyword pairs were takeninto account (instead of considering only those that actuallyparticipate in the top-k routing graphs). Analogously, weadopt this ranking to compute the scores for E-KERGresults (routing plans for E-KERG are computed using thesame procedure as proposed for KERG). For dmax 3,which is the best configuration in our experiment, P@1 forKERG and E-KERG, respectively, was only 0.38 and 0.35(compared to 0.92) and MRR was only 0.32 and 0.43(compared to 0.86) on average. These results suggest that

    considering scores not only at the keyword level but alsotheir actual contributions to set-level elements and routinggraphs that participate in the final answers can largelyimprove quality.

    8.5 Performance of Plan Computation

    Fig. 10a shows the average response times for computingrouting plans. It depicts the performance for queries withdifferent number of keywords jKj and also illustrates theeffect of dmax. Expectedly, more time was needed as thenumber of query keywords increased. The increase wassteady for dmax 3; 4, and less so for dmax 1; 2. Theprevious experiment showed that a higher value for dmaxresulted in higher quality plans, but also larger indexes.The results of this experiment suggest that a larger numberof keywords required more time to processespecially onlarge indexes. There is, thus, a tradeoff between result

    Fig. 7.P@k for different categories of queries, at dmax 1; 2; 3; 4 and k 1; 2; . . . ; 10 (a-d).

  • 5/21/2018 Keyword Query Routing

    12/13

    quality and performance. Higher quality results wereachieved at the cost of a larger index, which resultedin performance.

    Overall, queries with two keywords were processed inless than 5 ms. All queries were processed within 90 mswhen dmax 2. In the previous experiment, dmax 3resulted in high-quality plans. This configuration showedalso relatively good performance, required an affordableamount of time of 1 s on average. We now discuss results

    shown in Fig. 8 to provide a more detailed analysis at thelevel of query types and a breakdown into costs for retrievaland join.

    Performance of Different Query Types. All 2-keywordqueries (Q1, Q3, Q17, and Q18) were processed within5 ms. This type of queries were processed efficiently even atlarge dmax. However, queries with more keywords couldnot be handled efficiently at dmax 3. At dmax 4 forinstance, queries with more than two keywords neededseveral seconds up to one minute. Thus, while this settingproduced results of highest quality, it is not reallyaffordable in a typical web scenario demanding highresponsiveness. More applicable is dmax 3, which pro-duced results in one second, while not compromising toomuch on quality.

    Besides the number of keywords, we found out thatpopularity was another factor. It seems that more timewas need when query keywords have a large number ofmentions, resulting in a large number of relationships thathave to be processed. This was the case for Q10. All thethree keywordsSemantic,W eb, andP ublicationin Q10 arevery popular, and thus, every keyword pair of this queryproduced a large number of matching relationships. Incontrast, the keywords Studer and AIFB in Q9 haverelatively few mentions. The time needed for Q9 is about

    one order of magnitude less than the time for Q10.Breakdown into Retrieval and Join. Processing relationships

    means to retrieve data from the index and to join them. Abreakdown of the time into the parts attributable to these

    two tasks are provided in Fig. 8. Costs for both retrieval andjoin increased with the number of keywords and dmax.Clearly, the reason for fast response time on 2-keywordqueries is that they did not require join processing. Retrievalcost seemed to be relatively low. In all settings and for allqueries, data were retrieved within 20 ms. The time neededfor join largely increased with dmax. For dmax> 2, joinprocessing made up more than 90 percent of the cost.

    Baseline. We compared results obtained for KERG withresults obtained for E-KERG. We have already discussed inSection 3 that E-KERG is a baseline that extends the G-KS[7] approach to the query routing problem. Resultsobtained for different configurations of dmax suggest thatKERG outperformed E-KERG by several order of magni-tudes. For dmax 3 for instance (shown in Fig. 10b),50 percent of the queries with two keywords or moreneeded more than 6 mins (timeout) when using E-KERG,while with KERG, maximum time was only several secondsand average time was 1 s.

    8.6 Keyword Search with/without Routing

    Keyword query routing can be employed when the subjectof interest is not necessarily results but sources that matchsome information needs. After selection, these sources canbe preprocessed (cleaned) for improving the quality/efficiency of keyword search as well as used for otherpurposes. In the context of keyword search, it enables anew paradigm where instead of considering all, onlyrelevant combinations of sources are considered. We studythis effect of routing on keyword search in this experi-ment. We used a fast keyword search system that isoptimized for time performance using materialized in-dexes of paths and subgraphs [16]. It finds answers forkeyword queries using all sources (KS). This system doesnot compute all but uses several (ranking) mechanisms toprune some answers. We verify that the maximumnumber of answers was 50, i.e., it produces top-50 results.

    Fig. 9. Validity of the plans measured using P@k and relevancemeasured usingM RR. Fig. 10. (a) Average times for KERG, (b) times for E-KERG atdmax 3.

    Fig. 8. Times for retrieval and join for different query types, at dmax 1; 2; 3; 4 (a-d).

  • 5/21/2018 Keyword Query Routing

    13/13

    We analyze the cost of this by comparing KSs perfor-mance with the same system, which however, runs undertwo other configurations. 1) In the first configuration, KSprunes all sources but the ones captured by the routingplan computed using KERG that has been determined asrelevant (KERG-R); thus, while KS delivers all results,KERG-R returns only the ones matching the informationneeded. Then, we verified that a maximum number of top-3 plans is needed to compute all top-50 results; accord-

    ingly, in the second configuration, KS uses sourcescaptured by the top-3 plans (KERG-A). The resultsproduced by KERG-A and KS are the same.

    Fig. 11 shows the performance results of these systemsover the 30 queries. The times for KERG-A and KERG-Rinclude both the time for routing and for computingkeyword search results. We noted that w.r.t. queries withfive keywords, KS could process only 1, while the other fivequeries hit time-out limit of 100 s. Not counting these fivequeries, KERG-A is 15 times faster while KERG-R is 20 timesfaster than KS on average. Thus, routing has a large positiveeffect on the performance of keyword search. Further, theresults suggest that keyword search without routing is

    especially problematic when the number of keywords islarge. Hence, routing can be seen as a promising alternativeparadigm especially for cases, where the information needis well described and available as a large amount of texts.

    9 CONCLUSION

    We have presented a solution to the novel problem ofkeyword query routing. Based on modeling the searchspace as a multilevel inter-relationship graph, we pro-posed a summary model that groups keyword andelement relationships at the level of sets, and developed

    a multilevel ranking scheme to incorporate relevance atdifferent dimensions. The experiments showed that thesummary model compactly preserves relevant information.In combination with the proposed ranking, valid plans(precision@1 0:92) that are highly relevant (mean reci-procal rank 0:86) could be computed in 1 s on average.Further, we show that when routing is applied to anexisting keyword search system to prune sources, sub-stantial performance gain can be achieved.

    REFERENCES[1] V. Hristidis, L. Gravano, and Y. Papakonstantinou, Efficient

    IR-Style Keyword Search over Relational Databases, Proc. 29thIntl Conf. Very Large Data Bases (VLDB), pp. 850-861, 2003.

    [2] F. Liu, C.T. Yu, W. Meng, and A. Chowdhury, Effective KeywordSearch in Relational Databases, Proc. ACM SIGMOD Conf.,pp. 563-574, 2006.

    [3] Y. Luo, X. Lin, W. Wang, and X. Zhou, Spark: Top-K KeywordQuery in Relational Databases, Proc. ACM SIGMOD Conf.,pp. 115-126, 2007.

    [4] M. Sayyadian, H. LeKhac, A. Doan, and L. Gravano, EfficientKeyword Search Across Heterogeneous Relational Databases,Proc. IEEE 23rd Intl Conf. Data Eng. (ICDE), pp. 346-355, 2007.

    [5] B. Ding, J.X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin, FindingTop-K Min-Cost Connected Trees in Databases, Proc. IEEE 23rdIntl Conf. Data Eng. (ICDE), pp. 836-845, 2007.

    [6] B. Yu, G. Li, K.R. Sollins, and A.K.H. Tung, Effective Keyword-

    Based Selection of Relational Databases, Proc. ACM SIGMODConf.,pp. 139-150, 2007.

    [7] Q.H. Vu, B.C. Ooi, D. Papadias, and A.K.H. Tung, A GraphMethod for Keyword-Based Selection of the Top-K Databases,Proc. ACM SIGMOD Conf., pp. 915-926, 2008.

    [8] V. Hristidis and Y. Papakonstantinou, Discover: Keyword Searchin Relational Databases,Proc. 28th Intl Conf. Very Large Data Bases(VLDB),pp. 670-681, 2002.

    [9] L. Qin, J.X. Yu, and L. Chang, Keyword Search in Databases: ThePower of RDBMS,Proc. ACM SIGMOD Conf., pp. 681-694, 2009.

    [10] G. Li, S. Ji, C. Li, and J. Feng, Efficient Type-Ahead Search onRelational Data: A Tastier Approach, Proc. ACM SIGMOD Conf.,pp. 695-706, 2009.

    [11] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, andH. Karambelkar, Bidirectional Expansion for Keyword Search onGraph Databases, Proc. 31st Intl Conf. Very Large Data Bases(VLDB),pp. 505-516, 2005.

    [12] H. He, H. Wang, J. Yang, and P.S. Yu, Blinks: Ranked KeywordSearches on Graphs, Proc. ACM SIGMOD Conf., pp. 305-316,2007.

    [13] G. Li, B.C. Ooi, J. Feng, J. Wang, and L. Zhou, Ease: An Effective3-in-1 Keyword Search Method for Unstructured, Semi-Structuredand Structured Data, Proc. ACM SIGMOD Conf., pp. 903-914,2008.

    [14] T. Tran, H. Wang, and P. Haase, Hermes: Data Web Search on aPay-as-You-Go Integration Infrastructure,J. Web Semantics,vol. 7,no. 3, pp. 189-203, 2009.

    [15] R. Goldman and J. Widom, DataGuides: Enabling QueryFormulation and Optimization in Semistructured Databases,Proc. 23rd Intl Conf. Very Large Data Bases (VLDB), pp. 436-445,1997.

    [16] G. Ladwig and T. Tran, Index Structures and Top-K JoinAlgorithms for Native Keyword Search Databases, Proc. 20thACM Intl Conf. Information and Knowledge Management (CIKM),pp. 1505-1514, 2011.

    Thanh Tran was a consultant and a softwareengineer at IBM and Capgemini, and served asa visiting assistant professor at Stanford Uni-versity. He has been mainly working on thetopics of semantic data management, keywordsearch on semantic data, and semantic search.His interdisciplinary work is published in morethan 40 top-level journals and conferenceproceedings and earned the second prize atthe Billion Triple Challenge 2008 and the best

    Linked Data paper at ESWC 2011.

    Lei Zhang received the bachelors degree incomputer science from Tongji University, China.He graduated from Karlsruhe Institute of Tech-nology (KIT), Germany, with a diplomas degreein Informatics. He is currently working toward thePhD degree at the AIFB institute of KIT, wherehe is working on the project XLike. He has beeninvolved in many research projects on semanticsearch.

    . For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

    Fig. 11. Performance for KS with/without routing.