
TAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs

Abstract

Existing work for processing keyword searches on graph data focuses on the efficiency of result generation. However, being oblivious to user search intention, a query result may contain multiple instances of the user's search target, and multiple query results may contain information for the same instance of the search target. With this misalignment between query results and search targets, a ranking function is unable to effectively rank the instances of search targets. In this paper we propose the concept of target-aware query results driven by inferred user search intention. We leverage Information Theory and develop a general probability model to infer search targets by analyzing return specifier, modifier and relatedness relationships, and the information gain of query keywords. Then we propose two important properties for a target-aware result: atomicity and intactness. We develop techniques to efficiently generate target-aware results. Experimental evaluation shows the effectiveness and efficiency of our approach.

Keywords: Database semantics, Metadata, Semi-structured Data and XML, Query, Conceptual Modeling

1. Introduction

Keyword search provides a simple and user-friendly mechanism for information search, and has become increasingly popular for accessing structured or semi-structured data represented as a knowledge graph. A knowledge graph contains entity nodes of different types and edges representing relationships, i.e., attributes, among entities. While an increasing amount of investigation has been made in this important area, most existing work concentrates on efficiency rather than search quality [11] and may fail to deliver high-quality results. Let us look at a simple example.

Example 1.1: [Motivation] Consider the knowledge graph in Figure 1, which contains information about movie, actor and company entities and their relationships. Suppose a user issues a keyword query Q1 "comedy, actor" on this knowledge graph. Likely the user is looking for actors starring in comedy movies.

The majority of existing work [4, 24, 16, 27, 17, 14, 20, 23, 33, 5, 19] generates minimal trees or graphs that contain all query keywords and cannot be further reduced, to ensure the query results are "specific", such as the two results shown in Figure 2. While this result definition has the spirit of the minimal Steiner tree concept and presents a mathematically elegant model, it is oblivious to the entity and relationship semantics embedded in the data and in the user query. In these systems, if an actor has

Preprint submitted to Elsevier December 1, 2016


starred in multiple comedy movies, then each movie induces a result containing this actor. In other words, each result corresponds to a relationship between an actor and a movie. This is not desirable because a user searching for actor entities may need to browse many movies starred in by the same actor before seeing another actor, which can be time-consuming and frustrating. Furthermore, a user is typically interested in top-k results, expecting to see the top-k actor entities who star in comedy movies, properly ranked. However, the top-k results returned by these systems represent k relationships, typically involving far fewer than k actors. Even worse, these systems are unable to rank actors meaningfully (e.g. rank an actor based on the ranks of all the comedy movies that s/he stars in) since the movies starred in by the same actor are spread across multiple results. In other words, even if the ranking function itself is perfect, such systems can only rank the relationships between movies and actors but are unable to rank actors for this query.

There are also studies that define results to be subtrees or subgraphs that contain all query keywords but are not necessarily "minimal" [26, 30, 22, 31, 29]. They return the results shown in Figure 3 for this query. Basically, each comedy movie along with all its featured actors is returned as a result. As we can see, although the user is searching for information about the entity actor, the results are about the entity movie. Due to this misalignment, the information of multiple actors is packed into one result, and at the same time, the information of one actor is spread across multiple results. To obtain the information about distinct actors along with the movies that each actor stars in, a user has to perform manual extraction and transformation. When a user requests top-k results, they expect the top-k comedy actors. However, even if these systems had a perfect ranking scheme, they would return the top-k comedy movies, which are unlikely to contain exactly k different actors, and these actors are by no means the top-ranked ones.

Indeed, unlike processing structured queries, where a query has no ambiguity and therefore efficiency is the key, search quality is the foremost challenge for processing keyword queries. Rather than focusing on the efficiency of generating results as most existing work does, in this work we study the semantics of how to define results that can capture users' search intention, and then the generation of search-intention-aware results.

Intuitively, data contains information about real-world entities and relationships, and the search intention of a user is a subset of entities and/or their relationships in the data, referred to as the search target, which are constrained by modifiers in the query. In structured queries, such as SQL queries, search targets are expressed in the SELECT clause, and modifiers are expressed in the WHERE clause. In XQuery, search targets are expressed in the RETURN clause, and modifiers are expressed in the rest of the query. For information search purposes, queries expressed in keywords also contain these two types of information. However, due to the lack of structure, these two types of information are not explicitly labeled in the query, but have to be inferred intelligently. For instance, query Q1 "comedy, actor" involves two entities, movie and actor, and likely the user's search intention is to find actor entities that have a relationship with at least one comedy movie entity. In this case, actor is the search target, and comedy is a modifier. Ideally, each query result shall correspond to one instance of the user's search target together with all the supporting information. Only then it


is possible for downstream ranking functions to rank results effectively: ranking of results is aligned with ranking of search target instances, and having all supporting information of a search target instance in the same result gives the ranking function comprehensive signals for a fair relevance judgement. For Q1, each result shall contain one actor starring in comedy movies together with all comedy movies starred in by him/her, as illustrated in Figure 4, and the ranking of results shall reflect the ranking of the actors (probably based on the comedy movies they star in). On the other hand, results that correspond to actor-movie relationships (Figure 2) or movie entities (Figure 3) do not match user intention, and also make it impossible for downstream ranking functions to properly rank actors.

The first challenge is how to automatically identify the user's search intention. Although we could define a special query syntax and ask users to explicitly specify search targets and modifiers, as existing work does [9, 10], most users are unwilling to make such an extra effort. Indeed, the importance of automatically detecting user search intention is widely recognized, as reflected in studies in the information retrieval field [43]. However, existing work addresses this problem for searching documents, not structured data, and relies on query logs and clickthrough streams. Meanwhile, metadata in structured data provides valuable information for inferring user search intention, which has barely been utilized. We study the open problem of inferring user search intention for keyword search on graph data by leveraging the data itself. This is especially challenging for heterogeneous data with a large number of entities and diverse information. Our proposed techniques can be used to bootstrap the system when user query logs and clickthrough streams are not yet available, and can be integrated with those user-interaction-based methods for search intention inference.

The second challenge is, given user search intention, how to define high-quality results. We identify two important properties that a good query result should satisfy in order to support effective ranking. On one hand, a result should be atomic, which means each query result should consist of exactly one instance of a search target. In Example 1.1, the search target is actor, thus each result should contain exactly one instance of actor. Atomicity ensures that the ranking of results is aligned with the ranking of search target instances. On the other hand, a good result should be intact, containing all query-related information of the search target instance, so that the ranking function can consider comprehensive signals to make a fair ranking. In Example 1.1, for each actor, the information of all comedy movies that s/he stars in should be included in the result in order to derive a fair ranking of actors based on those comedy movies. We refer to results that are atomic and intact on the search target as target-aware results.

The third challenge is how to efficiently construct target-aware query results. This is not trivial because there can be millions or more instances of search targets in a knowledge graph. A brute-force approach that enumerates each possible atomic and intact sub-graph is too time-consuming to be feasible.

The contributions of this paper include:

1. We study an open problem: how to define results for keyword search on knowledge graphs, driven by inferred search intention.

2. We leverage Information Theory and propose a probability model to infer search targets by analyzing return specifiers, modifiers and modifying relationships, and a query's specific conditional entropy (Section 3).

3. We propose atomicity and intactness as two important properties and define target-aware results.

4. We prove important properties of target-aware results that reduce result generation to a tractable problem (Section 4).

5. We develop efficient algorithms to generate target-aware query results (Section 5).

6. Experimental evaluation verifies the effectiveness and efficiency of our approach (Section 6).

2. Data and Query Model

Data Model. In this paper, we model data as a labeled undirected graph G. We use V(G) to denote the set of vertices in G, and E(G) for the set of edges. For each vertex v ∈ V(G), l(v) denotes its label, which is a tag name or text value.
Vertex/Edge Types. As shown in [45], in a knowledge graph each vertex v has a type, denoted as t(v). The type can be Entity, such as an actor, Attribute, such as the name of an actor, or Value, such as Will Ferrell. Types of edges are represented by their end nodes.

Although in [45] each vertex's type is given, when it is unknown we propose a simple strategy to infer vertex types. Intuitively, a leaf vertex in a knowledge graph is likely to represent a value, such as vertex 1 Will Ferrell in Figure 1. A vertex adjacent to a leaf is likely to represent an attribute, such as vertex 9 name. Other vertices are considered entities, such as vertex 17 actor. However, such a scheme is not reliable for heterogeneous data. For instance, if a name vertex does not have value information, then this inference will conclude that the name vertex is a value.
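This leaf-based heuristic can be sketched as follows. The sketch is an illustration of the rule stated above, not the paper's implementation; the adjacency-set representation and the vertex ids other than 1, 9 and 17 are our own assumptions.

```python
def infer_vertex_types(adj):
    """Heuristic type inference sketch: a leaf vertex is a Value, a vertex
    adjacent to a leaf is an Attribute, and every other vertex is an Entity.
    adj maps each vertex id to the set of its neighbors (undirected graph)."""
    leaves = {v for v, nbrs in adj.items() if len(nbrs) == 1}
    types = {}
    for v, nbrs in adj.items():
        if v in leaves:
            types[v] = "value"
        elif nbrs & leaves:
            types[v] = "attribute"
        else:
            types[v] = "entity"
    return types

# Fragment of Figure 1: actor vertex 17 with name (9) and type (10)
# attribute vertices, each holding a value leaf (ids 1 and 2 illustrative).
adj = {17: {9, 10}, 9: {17, 1}, 10: {17, 2}, 1: {9}, 2: {10}}
```

Running the sketch on this fragment classifies vertex 17 as an entity, vertices 9 and 10 as attributes, and the leaves as values, matching the intuition above.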

To address this problem, we consider structural summaries for more reliable inference. We build a structural summary IG for a knowledge graph G using existing work [28], where the vertices in G which share the same set of labeled incoming paths of length up to k are represented by one vertex in IG. For example, for the data in Figure 1, its structural summary has a name and a type node connected to an actor node; a title, genre and year node connected to a movie node; and a name node connected to a company node. Also, there is a dummy node labeled value to represent attribute values, connected to the nodes name, type, title, genre and year. We use a lower-case letter to denote a vertex in G and an upper-case letter for one in IG. Furthermore, in this paper we discuss vertices in the structural summary by default, such as entity actor and attribute name, or actor and name in short. We use Ins(V) to denote the set of all vertices in G that are mapped to V in IG. A vertex v ∈ Ins(V) is called a data vertex of V, such as actor data vertices, or actors in short when there is no ambiguity. Then we apply the type inference discussed earlier on the structural summary. For the structural summary of Figure 1, actor, movie, and company are entity nodes; name, type, title, genre and year are attribute nodes; other nodes are value nodes. Finally, we consider a data vertex to have the same type as its corresponding vertex in the structural summary. How to make accurate inferences on vertex types has not been much studied


in the literature, except that [35] proposed heuristics to handle XML trees. We leave the development of more sophisticated type inference techniques to future work.
Vertex Association Relationship. The next question is to identify the relationships between vertices. We define the association relationship between entity, attribute, and value vertices as follows.

Definition 2.1: [Association Relationship] In a knowledge graph, a value data vertex and an attribute data vertex have an association relationship if and only if they are connected by an edge. A value or an attribute data vertex v has an association relationship with an entity data vertex v′ if and only if there is no other entity vertex v′′ s.t. v′′ is on the shortest path between v and v′. A vertex is considered to have an association relationship with itself.

For a data vertex v, we use [v]AR to denote the set of all data vertices that v has association relationships with. Also, by replacing a data vertex v with its corresponding vertex in the structural summary, we obtain association relationships of vertices in the structural summary. The association relationship is reflexive and symmetric.
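One way to compute [v]AR for a value or attribute vertex is a BFS that never expands past entity vertices, under our reading of Definition 2.1 (a sketch, not the authors' algorithm; whether an alternative equally short path through an entity should disqualify an association is left open by the definition, and this sketch does not check for one).

```python
from collections import deque

def associated_vertices(adj, types, v):
    """Sketch of [v]_AR for a value or attribute data vertex v: a direct edge
    gives a value-attribute association, and a BFS that stops at entity
    vertices collects the entities reachable from v without passing through
    another entity on the way (one reading of Definition 2.1)."""
    assoc = {v}  # association is reflexive
    for w in adj[v]:  # value <-> attribute requires a direct edge
        if {types[v], types[w]} == {"value", "attribute"}:
            assoc.add(w)
    seen, queue = {v}, deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                if types[w] == "entity":
                    assoc.add(w)  # associated; do not expand beyond it
                else:
                    queue.append(w)
    return assoc
```

For a chain value -> attribute -> actor -> movie, the value is associated with the attribute and the actor, but not with the movie, since the actor entity lies between them.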

Keyword Queries and Keyword Matches. A keyword query consists of a sequence of keywords, Q = {k1, k2, ..., kn}, where ki is an input query keyword. For instance, query Jeremy Piven comedy can be segmented into Jeremy Piven, comedy before being fed into our system. A data vertex v is a keyword match if and only if it contains at least one query keyword, i.e. Q ∩ l(v) ≠ ∅. We name the corresponding vertex in the structural summary a keyword match as well. Existing techniques for keyword query cleaning, synonym and acronym replacement, query rewriting and query segmentation, such as [41, 47], can be integrated with our proposed techniques; they are not the focus of this paper.
Query Results. A query result for Q is a connected sub-graph RG of the knowledge graph G that contains all query keywords, i.e. ∀k ∈ Q, ∃v ∈ RG such that k ∈ l(v). We discuss next what properties such a sub-graph should satisfy to be a target-aware result.
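The keyword-coverage part of this result definition can be stated as a one-line check (a minimal sketch; the label and vertex-set representations are assumptions, and connectivity of the subgraph is taken as verified separately).

```python
def is_query_result(Q, result_vertices, labels):
    """A connected subgraph RG is a query result iff every query keyword is
    contained in the label of some vertex of RG. Q is a list of keywords,
    result_vertices the subgraph's vertex ids, labels a map id -> l(v)."""
    return all(any(k.lower() in labels[v].lower() for v in result_vertices)
               for k in Q)
```

For Q1 "comedy, actor", a subgraph covering a comedy vertex and an actor vertex passes the check, while one missing the comedy vertex does not.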

3. Inferring Search Targets

Before presenting how to define query results, we first discuss how to infer user search intention. User search intention can be described as: return some information about a search target, which is either an entity (referred to as a target entity) or a relationship of multiple target entities, constrained by modifiers. For example, the search intention of query Q1 "comedy, actor" is likely to be: return information about actors who star in comedy movies. Here actor is the search target and the only target entity, and comedy is a modifier.

One challenge is to determine the target entities, denoted as TE. The likelihood of an entity V being a target entity in TE given a user query Q can be formally represented as the following conditional probability:

P(V ∈ TE | Q)    (1)


However, this probability is hard to compute directly. Instead of computing the exact value, we are only interested in relative ranking: among all entities in the structural summary, which ones are more likely to be the target entities for a given query Q. Towards that, we transform the probability using Bayes' Theorem.

P(V ∈ TE | Q) = [P(Q | V ∈ TE) × P(V ∈ TE)] / P(Q) ∝ P(Q | V ∈ TE)    (2)

Without a query log, we assume that all entities are equally likely to be a target entity. That is, P(V ∈ TE) is the same for all entities. The likelihood of observing a user query, P(Q), does not differ across entities. Thus we only discuss how to calculate P(Q | V ∈ TE), the likelihood of observing Q given candidate V as a target entity.

3.1. Return Specifiers and Modifiers

To compute P(Q | V ∈ TE), we need to better understand the query. There are two types of information in a query: return specifiers and modifiers. The return specifiers R denote the type of information that shall be returned, such as that specified in the select clause of an SQL query or the return clause of an XQuery query. The modifiers M in a query constrain the returned information, like the where clause in SQL or XQuery queries.

While lacking structure, a keyword query also contains these two semantic components: return specifiers and modifiers. We have R ∪ M = Q and R ∩ M = ∅. We use M{S} to denote that the set of keywords S is modifiers and R{S} to denote that the set of keywords S is return specifiers.

For Q1 in Example 1.1, "comedy, actor", a user likely wants to retrieve actors who star in comedy movies. The query keyword actor serves as a return specifier, and query keyword comedy is a modifier that constrains the returned actors to be those starring in comedy movies. That is, we have M{comedy} and R{actor}. Consider another query Q2 "S.W.A.T., year". The user likely searches for the year information of the movie "S.W.A.T.". Hence we have M{S.W.A.T.} and R{year}.

The semantic interpretation of a user query Q can now be described using a triple (V ∈ TE, M, R): for data vertices of target entity V that satisfy modifier M, output their values of return specifiers R, where R and M compose the query Q.

For instance, a possible interpretation of Q2 "S.W.A.T., year" is: for target entity movie (V ∈ TE), if a data vertex is modified by S.W.A.T. (M), then return its year information (R).

However, the challenges are: we do not know which query keywords serve as return specifiers R and which as modifiers M, due to the absence of indicators like select, return and where in structured queries; and we do not know which entities are target entities. We take the values of V, M and R to be the ones that maximize the likelihood of observing the query Q with V as target entity: P(Q | V ∈ TE). That is,


P(Q | V ∈ TE) = max_{V,M,R} [ P(R{R} | V ∈ TE, M{M}) × P(M{M} | V ∈ TE) ]
s.t. R ∪ M = Q ∧ R ∩ M = ∅    (3)

where P(M{M} | V ∈ TE) is the probability of observing M as modifiers given target entity V, and P(R{R} | V ∈ TE, M{M}) is the probability of observing R as return specifiers given target entity V and modifiers M. We discuss how to compute each value in the following two subsections.
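Since keyword queries are short, the maximization in Equation 3 can be sketched as a brute-force enumeration of every R/M split for every candidate entity. In this sketch, p_mod and p_ret are assumed callbacks standing in for the two probabilities the paper derives next, and the toy probability values are purely illustrative.

```python
from itertools import combinations

def best_interpretation(Q, entities, p_mod, p_ret):
    """Enumerate all (V, M, R) with R ∪ M = Q and R ∩ M = ∅, returning the
    triple maximizing P(R{R} | V ∈ TE, M{M}) × P(M{M} | V ∈ TE) (Eq. 3)."""
    best_score, best_triple = 0.0, None
    for V in entities:
        for r in range(len(Q) + 1):
            for R in combinations(Q, r):
                M = tuple(k for k in Q if k not in R)
                score = p_ret(R, V, M) * p_mod(M, V)
                if score > best_score:
                    best_score, best_triple = score, (V, M, R)
    return best_score, best_triple

# Toy stand-in probabilities for Q1 "comedy, actor" (illustrative values):
p_mod = lambda M, V: 0.9 if (M, V) == (("comedy",), "actor") else 0.1
p_ret = lambda R, V, M: 0.8 if (R, V) == (("actor",), "actor") else 0.1
```

With these stand-ins, the enumeration picks V = actor, M = {comedy}, R = {actor}, matching the interpretation discussed for Q1.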

3.2. Reasoning about Modifiers

Now we discuss how to compute P(M{M} | V ∈ TE): given V as target entity, the likelihood of observing a subquery M as modifiers. By applying the multiplication formula for probabilities, we have

P(M{M} | V ∈ TE) = P(M{k1} | V ∈ TE) × P(M{k2} | M{k1}, V ∈ TE) × ... × P(M{kn} | M{k1, ..., kn−1}, V ∈ TE)    (4)

where M = {k1, k2, ..., kn}.

The question is how to compute the general form of each component in Equation 4:

P(M{ki} | M{M′}, V ∈ TE), where M′ ⊆ M. Note that M′ can be empty. This is the likelihood of observing ki as a modifier for target entity V, which is already modified by M′.
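The chain rule of Equation 4 can be sketched as a running product, where p_component is an assumed callback computing P(M{ki} | M{M′}, V ∈ TE) (derived below via Equation 6).

```python
def p_modifiers(M, V, p_component):
    """Chain rule of Equation 4: multiply the per-keyword conditionals, each
    conditioned on the modifiers already applied. p_component(k, M_prime, V)
    stands in for P(M{k} | M{M'}, V in TE)."""
    prob, applied = 1.0, []
    for k in M:
        prob *= p_component(k, tuple(applied), V)
        applied.append(k)
    return prob
```

An empty modifier set yields probability 1, and any component evaluating to 0 (e.g. an unrelated or redundant keyword) zeroes out the whole product, which is exactly the behavior exploited in Example 3.2 below.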

To calculate it, we break it down into two events. (i) The event that keyword ki is related to entity V, denoted as Rel(ki, V), which makes it possible for ki to constrain V. For example, keyword Page Number is unrelated to entity movie. (ii) The event that, given V as a target entity modified by M′, ki can further constrain V, denoted FC(ki, V). Note that sometimes, although keyword ki is related to an entity V (event (i) happens), it may not be able to further constrain V (event (ii) does not happen). For instance, keyword actor is related to movie, but it cannot constrain movie (i.e. select a proper subset of movies) as every movie has actors. Formally, we have

P(M{ki} | M{M′}, V ∈ TE)
= P(Rel(ki, V), FC(ki, V) | M{M′}, V ∈ TE)
= P(FC(ki, V) | Rel(ki, V), M{M′}, V ∈ TE) × P(Rel(ki, V) | M{M′}, V ∈ TE)    (5)

where Rel(ki, V) = {ki is related to entity V} and FC(ki, V) = {ki further constrains entity V}.

In fact, the event that keyword ki is related to entity V, Rel(ki, V), is independent of the query, and thus independent of the modifiers M′ and of V being a target entity, so P(Rel(ki, V) | M{M′}, V ∈ TE) = P(Rel(ki, V)).


If ki is unrelated to V (i.e. Rel(ki, V) = 0), meaning that ki cannot constrain V, Equation 5 evaluates to 0. Otherwise (i.e. Rel(ki, V) = 1), we can simplify Equation 5 to P(FC(ki, V) | M{M′}, V ∈ TE) × P(Rel(ki, V)).

The intuition behind computing P(FC(ki, V) | M{M′}, V ∈ TE) is that a reasonable user would not issue a query with redundant or useless keywords. Using the concept of Information Gain from Information Theory, the probability of a user using ki (which is related to V) to modify target entity V, which is already modified by M′, can be measured by the information gain of using ki to constrain the target entity data vertices beyond modifiers M′. So Equation 5 becomes

P(M{ki} | M{M′}, V ∈ TE) =
    { 0,                                           if Rel(ki, V) = 0
    { P(Rel(ki, V)) × IG(M{M′}, V ∈ TE | M{ki}),   otherwise    (6)

Now we discuss how to compute the information gain and the relatedness in turn.
Information Gain. Recall that the information gain IG(Y | X) is the change in the entropy of Y after observing X: IG(Y | X) = H(Y) − H(Y | X), where H(Y) is the entropy of Y, and H(Y | X) is the entropy of Y conditioned on X. The conditional entropy quantifies the amount of information needed to describe the outcome of a variable Y given that the value of another random variable X is known.

In our setting, variable Y stands for target entity V modified by M′, X stands for modifier ki, and Y | X stands for target entity V modified by M′ and ki. We have

IG(M {M ′}, V ∈ TE|M {ki})= H(M {M ′}, V ∈ TE)−H(M {M ′}, V ∈ TE|M {ki}) (7)

Let us look at some examples to explain the intuition and the computation, starting with the simple case of a single-modifier query Q = {k}. In this case, we have M′ = ∅ and ki = k in Equation 4. To simplify the discussion, for now we assume Rel(k, V) = 1 and P(Rel(k, V)) = 1 for any k and V in the examples. We will discuss how to compute them later in this section.

Example 3.1: Consider a query:

• Q3: "leading actor"

Suppose movie is the target entity. We compute the likelihood of a user using "leading actor" to modify movie as target entity: P(M{leading actor} | movie ∈ TE) = 1 × IG(movie ∈ TE | M{leading actor}) = H(movie) − H(movie | M{leading actor}). Suppose every movie data vertex has a leading actor; then the set of all movie data vertices is the same as the set of all movie data vertices modified by leading actor. Therefore, the entropies of these two sets are the same, and P(M{leading actor} | movie ∈ TE) = 0.

On the other hand, suppose actor is the target entity. The likelihood of a user using "leading actor" to modify actor is P(M{leading actor} | actor ∈ TE) = H(actor) − H(actor | M{leading actor}). This probability is unlikely to be 0 since the set of actors is different from the set of leading actors.

The above analysis shows that, with leading actor as a modifier, actor is more likely than movie to be the target entity of this query. Indeed, intuitively, the query "find actors who are leading actors" sounds more reasonable than "find movies that have leading actors", as the latter would return all the movies in the data.
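The gain computation in this example can be sketched under one plausible instantiation, which is our assumption rather than the paper's stated estimator: treat the distribution over matching target instances as uniform, so that H(S) = log2 |S| and the information gain is the drop in entropy when the new modifier is applied.

```python
import math

def information_gain(before, after):
    """IG under a uniform-distribution assumption: H(S) = log2 |S|.
    before: target data vertices matching modifiers M'; after: the subset
    that also matches the new modifier k_i. Empty sets yield zero gain."""
    if not before or not after:
        return 0.0
    return math.log2(len(before)) - math.log2(len(after))

# If every movie has a leading actor, the modifier does not shrink the set
# and the gain is 0, as in Example 3.1; a modifier that narrows four
# candidates down to one yields a positive gain.
movies = {"m1", "m2", "m3", "m4"}
```

Under this instantiation, information_gain(movies, movies) is 0, reproducing the conclusion that leading actor cannot modify movie, while a set-shrinking modifier produces a strictly positive gain.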

Now let us look at an example query with multiple keywords.

Example 3.2: Consider query Q4 as shown below:

• Q4: "Jeremy Piven, comedy"

Suppose actor is a target entity. We compute the likelihood of a user using "Jeremy Piven, comedy" to modify actor as target entity using Equation 4: P(M{Jeremy Piven, comedy} | actor ∈ TE) = P(M{comedy} | actor ∈ TE, M{Jeremy Piven}) × P(M{Jeremy Piven} | actor ∈ TE).

First, P(M{comedy} | actor ∈ TE, M{Jeremy Piven}) = H(actor ∈ TE, M{Jeremy Piven}) − H(actor ∈ TE, M{Jeremy Piven} | M{comedy}). Suppose the actor Jeremy Piven starred in at least one comedy movie, that is, the actor data vertex modified by "Jeremy Piven" is also modified by "comedy". Therefore, these two entropies are the same. Then we have P(M{comedy} | actor ∈ TE, M{Jeremy Piven}) = 0. As we can see, although each of the keywords Jeremy Piven and comedy can individually modify actor data vertices, after applying Jeremy Piven, comedy does not bring any additional information gain. This example shows the importance of considering the inter-relationships among the query keywords. Thus P(M{Jeremy Piven, comedy} | actor ∈ TE) = 0, indicating that with Jeremy Piven and comedy as modifiers, the query is unlikely to search for actor.

On the other hand, suppose movie is a target entity; the likelihood of having both Jeremy Piven and comedy as modifiers is significant (detailed computation omitted for space reasons). This indicates that the query semantics is more likely to be "find comedy movies that are starred in by Jeremy Piven" than "find actors whose name is Jeremy Piven and who starred in comedy movies".

In our implementation, we calculate the actual values of the entropies and find the most likely query semantics using Equation 3. Also note that, in Equation 4, multiple keywords in M may be grouped together to form a single modifier. We group keywords together if they are matched by one vertex or they have an association relationship, such as leading actor in Q3. We can also use existing work on query segmentation [41, 47] to group keywords, which is not the focus of this paper.

The remaining question is how to define and compute the relatedness Rel(ki, V) and P(Rel(ki, V)).
Relatedness. We now discuss when a data vertex is considered to be related to another data vertex. Intuitively, in Q4 Jeremy Piven is considered to be related to an actor vertex whose attribute name has value Jeremy Piven. But it is not that simple. Is Jeremy Piven related to a movie? Jeremy Piven is associated with


actor as an attribute value, but not associated with any movie (Definition 2.1). But it does have a relationship with movie: some movies are starred in by Jeremy Piven. So the answer should be "Yes". Typically, users search for entities not only based on their attribute values, but also based on the entities that they have relationships with (e.g. find movies in which a particular actor stars).

Thus we define the relatedness relationship to be more general than the association relationship. Existing work has studied the semantic relationships between vertices in XML trees considering the path connecting them [13, 32]. We extend the Interconnection Relation for XML trees [13] to graphs. Since there can be many paths connecting two data vertices v and v′ in a graph, we define the relatedness relationship of two data vertices on the shortest paths, which represent the most direct relationship between them. If there are no entity data vertices with the same label on such a path, then we consider v′ to be related to v. In the above example, comedy vertex 36 is related to actor data vertex 17. On the other hand, vertex 21 is not related to vertex 36, since the shortest path between them has two nodes with label movie.

Furthermore, even though two vertices are related, the likelihood of a user to usea keyword ki to modify a related vertex V , P (Rel(ki, V )), decreases when their dis-tance increases. We measure such likelihood by the average distance of the relatednessrelationship between keyword match of ki and data vertices of V . Formally, we have

Definition 3.1:[Relatedness] A data vertex v′ is a Related Vertex of an entity data vertexv if there exists a shortest path P between v′ and v where there is no two vertices on thepath sharing the same label and type, i.e. @v1, v2 ∈ P , t(v1) = t(v2) ∧ l(v1) = l(v2).We use P to represent the Relatedness Relationship between v and v′, and use Mod(v)to denote the set of all related vertices of v.

We use Rel(k, V ) to denote the Relatedness of keyword k to entity V . We haveRel(k, V ) = 1 if there exists vertex v as a data vertex of V and vertex v′ matches k,and v′ ∈Mod(v), otherwise Rel(k, V ) = 0.

The likelihood of using related keyword k to modify V is defined based on the av-erage of distances of the relatedness relationship between the data vertices Dist(v′, v):

P (Rel(k, V )) = e−AVGv′ matched by k,v∈Ins(V )Dist(v′,v) (8)

Given relatedness relationship, we now define a relevant entity vertex for a queryas the data vertex related to keyword matches of every query keyword modifier.

Definition 3.2:[Relevant Entity Data Vertex] An entity data vertex v ∈ V (G) is aRelevant Entity Data Vertex if and only if for ∀k ∈ M where M is the modifiers ofquery Q, there is a keyword match of k in the data: v′, k ∈ l(v′), such that v′ is arelated vertex of v (v′ ∈ Mod(v)). We group relevant entity data vertices by theircorresponding vertex in the structural summary, denoted RIV (Q). We use RTI(G) toindicate all query relevant entity data vertices of target entity V in a graph G, denotedas RTI(G), RTI(G) = {v|v ∈ V (G) ∧ v ∈ RIV (Q) ∧ V ∈ TE}.

For example, RIMovie(Q2) is the set of all movie data vertices that are related toS.W.A.T.: {vertex with id 24}. RIactor(Q1) is the set of all actor data verticesrelated to comedy: {vertex with id 17, 18}.

10

Page 11: TAR: Constructing Target-Aware Results for Keyword Search ...ychen/publications/tar-tech-report.pdfTAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs Abstract

3.3. Reasoning about Return Specifiers

By definition, return specifiers shall be entity name and/or attribute names of atarget entity whose value is unknown to the user. Thus, the likelihood of observingsome return specifiers R given a target entity V and modifiers M , P (R{R}|V ∈TE,M {M}) = 0, if some keyword in R does not match the name nor an attributename associated with V or if the user query contains attribute values that are associatedwith R. Otherwise, we consider P (R{R}|V ∈ TE,M {M}) = 1.

Example 3.3: Consider Q1 ‘‘comedy, actor’’ on data in Fig 1. The likeli-hood of observing actor as return specifier given entity actor as a target entity andcomedy as modifier, P (R{actor}|actor ∈ TE,M {comedy}) = 1 since actoris self-associated and keyword match of comedy is not associated with any actordata vertex. On the other hand, the likelihood of observing actor as return specifiergiven entity movie as a target entity and comedy as modifier, P (R{actor}|movie ∈TE,M {comedy}) = 0, since actor is not associated with movie.

Now let us consider another query ‘‘title, S.W.A.T., year’’. The like-lihood of observing return specifier title given movie as a target entity and S.W.A.T.,year as modifier, P (R{title}|movie ∈ TE,M {S.W.A.T., year}) = 0. This is be-cause although attribute title is associated with entity movie, keyword match ofS.W.A.T. in M is associated with title data vertex. Indeed, the user alreadyknows S.W.A.T. as title thus title is not a return specifier.

The above computation assumes the keyword matches of all return specifiers to beassociated target entity vertices. This is consistent with closed world assumption: allpossible attributes associated with an entity shall reside in the data set. However, it istoo strict under open world assumption. For instance, consider a query ‘‘S.W.A.T,year, runtime’’ on a data set which has movies with information about yearbut not runtime. Despite the incomplete information, movie entity still shall beconsidered as target entity based on the match of year (and S.W.A.T). To addressthis problem, we consider every return specifier to be equally important and use thepercentage of the return specifiers that are matched to an entity’s association as thevalue for P (R{R}|V ∈ TE,M {M}). In the presence of query log, we may identifythat the match of some return specifiers carry more confidence and derive a function tocompute P (R{R}|V ∈ TE,M {M}).

Another question is that a user query may not have return specifier. Indeed, un-like a structured query which must have both return specifiers and modifiers, a key-word query may leave return specifiers unspecified, such as Q4 ’’Jeremy Piven,comedy’’, where all keywords are modifiers. In this case, P (R{R}|V ∈ TE,M {M})is trivially 1.

P (R{R}|V ∈ TE,M {M}) = |R′||R|

(9)

where R′ = {k|k ∈ R, k matches V ′ ∈ [V ]AR, @k′ ∈M, s.t. k′matches a valuenode u, Ins−1([u]AR)

⋂V ′}. If R = ∅ then P (R{R}|V ∈M {M}) = 1.

11

Page 12: TAR: Constructing Target-Aware Results for Keyword Search ...ychen/publications/tar-tech-report.pdfTAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs Abstract

3.4. Target Entity and Search Target Definition

Now we define Target Entity as the ones that maximize the likelihood of observingquery Q (Equation 2).

Definition 3.3:[Target Entity] For a query Q, an entity V in the structural summary isa Target Entity if

P (V ∈ TE|Q) ∝MaxVi∈IG [P (Vi ∈ TE|Q)] (10)

where P (V ∈ TE|Q) is computed by using Equation 1 to Equation 9. In Defini-tion 3.3, we only consider the most probable entity as target entity. A system may alsochoose to consider more possible target entities to boost recall and use the probabilityfor ranking.

Our technique is general to handle heterogenous data, which is discussed with realuser queries in Section 6.2, such as Q2 and Q3 of DBPedia dataset.

A user’s search intention could be a target entity or relationship between multipletarget entities. Also for heterogeneous data, we may infer multiple target entities usingDefinition 3.3 such as Q1 of IMDB dataset in Section 6. To support these cases, wedefine Search Target of a user query as shown below:

Definition 3.4:[Search Target] Consider a query Q and a set of inferred target entitiesS. The Search Target of Q is either an entity in S or the relationship of a subset of S (ifthere is more than one entity in S).

For Q1, one target entity is inferred: actor, which is the search target. The in-ferred query semantics would be: “find actors that star comedy movies”. Another query‘‘Comedy, Ferrell’’, two target entities are inferred: movie and actor. Be-sides the two target entities, the relationship between them is also considered as asearch target. In the latter case, the query interpretation is “Find the relationship be-tween a movie with genre comedy and an actor whose name contains Ferrell”. In amore heterogeneous dataset than Figure 1, comedy might also be matched by compa-nies’ name besides movies’ genre. In that case, based on Definition 3.3, actor, movieand company may all be target entities. Then the search target includes not only thethree target entities, but also the relationship between them. TAR will infer additionalquery interpretations, such as “Find the movies that are produced by a company namedcomedy and are acted by an actor whose name contains Ferrell”.

4. Defining Target-Aware Results

Based on the concept of target entity, we propose two properties that good queryresults shall satisfy: Atomicity and Intactness (Section 4.1) and then define Target-Aware Result for keyword search on graphs (Section 4.2).

4.1. Atomicity and Intactness

A good query result should be atomic and intact.

12

Page 13: TAR: Constructing Target-Aware Results for Keyword Search ...ychen/publications/tar-tech-report.pdfTAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs Abstract

Definition 4.1:[Atomicity] Given a query Q, a result RG is atomic if it has one instanceof Q’s search target.

Atomicity ensures that ranking on results reflects ranking on search target instances.For instance, for Q1 in Example 1.1 ‘‘comedy, actor’’, actor is considered as the only target entity and therefore the searchtarget of the query. Each result shown in Figure 4 contains exactly one actor, andtherefore it is atomic. Ranking on such results provides ranking on actors.

Definition 4.2:[Intactness] Given a query Q and the inferred target entities, a resultRG is intact if every relevant target entity data vertex in RG has all its related verticesmatching M of Q in RG.

Intactness ensures that all signals of the search targets related to the query are avail-able for fair ranking. Again consider result (a) of Q1 shown in Figure 4. It containsvertex 17, an instance of search target actor that is relevant to Q. We have all matchesto the modifier comedy, vertex 36 and 38, in the result. Thus this result is intact.Ranking on such results can consider the ranking of all comedy movies that he stars.

For presentation purpose, when there are many matches of the same query key-word, one system implementation can show one match of each query keyword and useexpansion links, which upon click can display other matches of the same query key-word. In this way, each displayed result is small and easy to inspect. This is similar asthe expansion links described in the literature [35, 36].

4.2. Target-Aware Result

Now we formally define Target-Aware Results. As discussed earlier we would liketo define a result that is both atomic and intact so that ranking on results provide fairranking on search target instances.

However, when user’s search intention is relationship, it may not be possible tohave a result that satisfies both properties, as shown in the following example.

Example 4.1: Suppose a user inputs keyword query Q5 to search for relationshipbetween an actor named Ferrell and comedy movies that he stars:

• Q5: “Comedy, Ferrell”

Based on Definition 3.3, we find actor and movie are target entities, as not allactors named Ferrell have starred a comedy movie, and not all comedy moviesfeature an actor named Ferrell. The user may search relationship between the twokeywords. Consider a result that contains actor Will Ferrell. For the result to beintact, it should contain the information of both comedymovies that Will Ferrellstars. But in this case, there are two instances of the relationship between target entityactor and target entity movie in the result, violating atomicity.

Since we may not always be able to satisfy atomicity and intactness when user’ssearch target is relationship, we need to make a design choice based on user’s pref-erence. As shown in user study presented in Section 6.2, while both properties areconsidered important by users, intactness is chosen by more users.

We define Target-Aware Result as below:

13

Page 14: TAR: Constructing Target-Aware Results for Keyword Search ...ychen/publications/tar-tech-report.pdfTAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs Abstract

Definition 4.3:[Target-Aware Result] Consider query Q on data graph G, a subgraphof G: RG, is a Target-Aware Result if the following conditions hold:

• Every keyword k ∈ Q has a match in RG.

• If there is only one target entity data vertex in RG, then RG should be atomicand intact; otherwise RG should be intact.

• There does not exist a subgraph of RG that satisfies the above two conditions.

A result for Q5 is shown in Figure 5, which contains the relationships between anactor named Ferrell and all his comedy movies.

Studies show that 70% of all user issued queries are entity queries, queries with asingle entity type as search target [39, 18, 46]. Therefore our proposal that a search en-gine should generate results that satisfy atomicity and intactness for entity queries haslarge applications and significance in search engine design. For queries whose searchtargets are relationships, it is impossible to guarantee the satisfaction both atomicityand intactness. It is the designers choice as for which property to guarantee in thesearch engine design, based on application needs and/or user preference. We intro-duced both concepts as they are important to consider when designing a search enginefor graph data.

5. Algorithms

In this section, we present algorithms to efficiently generate target-aware results.In Section 5.1, we discuss properties of target-aware result. In Section 5.2, we discusstwo major indexes we will use in our algorithms. Section 5.3 presents algorithms tofind all target entities and their relevant data vertices. We discuss the algorithms ofgenerating target-aware results in Section 5.4.

5.1. Properties

A straightforward way to generate target-aware results based on Definition 4.3would enumerate all possible combinations of target entity data vertices to form atomicand intact graphs. However, such a naive method suffers extremely low performance.In this section, we give some important properties of target-aware results, which reducetarget-aware result construction to a tractable problem.

Lemma 5.1: Let RG be a target-aware result, and RTI(RG) be the set of all Q rele-vant target entity data vertices in a result graph RG(Definition 3.2). For any nonemptysubset S ⊂ RTI(RG), ∃v ∈ RTI(RG) − S s.t. ∃v′ ∈ S, v is on the relatednessrelationship path between v′ and a modifier of v′.

Proof. If |RTI(RG)| = 1, then true. Consider |RTI(RG)| ≥ 2. Suppose not, then itmeans if we remove all data vertices in RTI(RG) − S and their exclusive modifiersfrom RG, we will have a strict sub-graph of RG which is also intact. Also if RG isatomic the new sub-graph is atomic too. Contradiction. Proved.

14

Page 15: TAR: Constructing Target-Aware Results for Keyword Search ...ychen/publications/tar-tech-report.pdfTAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs Abstract

Theorem 5.2: If RG is a target-aware result and |RTI(RG)| ≥ 2, then for any twodata vertices v and v′ ∈ RTI(RG), there is an ordered sequence of data vertices(v1, v2, ..., vk) s.t. vi ∈ RTI(R), v1 = v and vk = v′, and vi+1 is on the relatednessrelationship path between vi and a vi’s modifier, 1 ≤ i ≤ k − 1.

Proof. Suppose not, then we first find the maximum set of data vertices J = (v1, v2, ..., vl)with l ≤ |RTI(RG)| s.t. vi ∈ RTI(RG), and for v and vi the above theoremis true. Since |RTI(RG)| ≥ 2, it is obvious to see that |J | >= 1 by applyingLemma 5.1. Now assume in Lemma 5.1, S = J . We know that S 6= RTI(RG)because v′ ∈ RTI(RG) − S. By applying Lemma 5.1, we know that there must bea data vertex v′′ ∈ RTI(RG) − S s.t. ∃vi ∈ J ∧ v′′ is on a relatedness relationshippath between vi and a related vertex of vi. So, J ∪ {v′′} is a bigger set. Contradiction.Proved.

We use G[X] to denote an induced graph of G on X .

Corollary 5.3: Let GD denote a directed graph of a graph G, s.t. V (GD) = RTI(G)and (v, v′) ∈ E(GD) if v is on a relatedness relationship path between v′ and a relatedvertex of v′. RG is a target-aware result iff the following two conditions hold: 1)GD[RTI(RG)] is a strongly connected component. 2) @v ∈ RTI(G) − RTI(RG)s.t. ∃v′ ∈ RTI(RG) and (v, v′) ∈ E(GD).

Proof. Based on Theorem 5.2, we know that GD[RTI(RG)] is a strongly connectedsub-graph. Suppose GD[RTI(RG)] is not a strongly connected component, then thereis a v ∈ RTI(G) − RTI(RG) s.t. ∃v′ ∈ RTI(RG) ∧ (v, v′) ∈ E(GD). SinceRG is a target-aware result, according to Definition 4.2 v should be in V (RG) too, acontradiction. So GD[RTI(RG)] is a strongly connected component of GD. Also thisshows that @v ∈ RTI(G) − RTI(RG) s.t. ∃v′ ∈ RTI(RG) ∧ (v, v′) ∈ E(GD),otherwise RG will not be a strongly connected component.

Corollary 5.3 reduces the problem of constructing target-aware results to findingstrongly connected component of data graph G, which is tractable.

Corollary 5.4: For any v ∈ RTI(G), it is contained in at most one target-aware result.

Based on Corollary 5.3, it is easy to prove this true. Corollary 5.4 shows thatany v ∈ RTI(G) can exist in at most one target-aware result. This corresponds tothe intactness, which avoids to have information of the same search target instance inmultiple results.

We will use these properties in Section 5 to construct all target-aware results.

5.2. Index

We use two indexes: Keyword Index and Path Index. Keyword index is an indexwhich given an keyword k returns all data vertices v s.t. k ∈ l(v). We can easily findthe corresponding vertices in the structural summary that match query keywords.

Path index is an index which given an entity data vertex v returns a set of shortestpaths between v and other entity data vertices, P = {P1, P2, ..., Pr}. Each Pi ∈ P

15

Page 16: TAR: Constructing Target-Aware Results for Keyword Search ...ychen/publications/tar-tech-report.pdfTAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs Abstract

is of the form: (v, v1, ..., vq) and for each vj ∈ Pi, t(vj) = Entity. Also based onDefinition 3.1, we don’t record the paths that contain two vertices with the same label.When the index is big, we may use existing techniques [20] for graph partition.

With path index, given a related vertex v we can quickly find all entity data verticesv′ that it relates to. To do this, we first find the entity data vertex v′′ ∈ [v]AR, then wefind v′′ in path index and retrieve all the paths starting with v′′. Finally, every entitydata vertex in a retrieved path is related to v. Similarly, with an entity data vertex v′ wecan look it up in the path index and all keyword matched vertices on a path are relatedvertices of v.

5.3. Finding Target EntitiesBased on Definition 3.3, we design algorithms to find target entities V ∈ IG and

all v ∈ RIV (Q). We first invoke Algorithm 1 to find the best labelling of keywords interms of return specifier and modifier for each entity V . The inputs to Algorithm 1 in-clude R and M which stands for the given return specifiers and modifiers respectively,data graph G, structural summary IG, and the two indexes discussed in Section 5.2.The output is P (R,M |V ∈ TE), which is product of P (M {M}|V ∈ TE) andP (R{R}|V ∈ TE,M {M}). We define two functions in Algorithm 1 to calculatethese two values, respectively.

In function Computing P (M {M}|V ∈ TE) of Algorithm 1, RI1 and RI2refer to the set of search target instances that are modified by M ′, and that are modifiedby M ′ and k, in Equation 7, respectively. RI1 is initialized as the set of instanceof search target V . Recall that Ins(V ) retrieves data vertices that correspond to avertex V in the structure summary, and Ins−1(v) returns the vertex in the structuralsummary that corresponds to a data vertex v. Then we iterate over each keyword k inthe set of modifiers M to compute P (M {M}|V ∈ TE) (line 2-8). In line 3 - line4, we find node v′ that matches query keyword k, and node v, search target instancevertex that is modified by k. We update P (Rel(k, V )) and compute RI2. Then wecompute P (M {k}|M {M ′}, V ∈ TE) and P (M {M}|V ∈ TE) accordingly. Afterthe processing of keyword k, we update M ′, the set of query keywords processed sofar, and update RI1.

As an example, consider a query Jeremy, comedy on data in Figure 1. Suppose V ismovie. In the initialization, M ′ = ∅, and RI1 is the set of movie data vertices, that is,{v22, v23, v24, v25}. In the first iteration, k is Jeremy. After process line 3-4, we haveRI2 to be the set of movie data vertices that are modified by Jeremy, that is, {v22, v24}.After the computation in line 5 and 6, we update M ′ to be {Jeremy}, and RI1 to be thecurrent value of RI2. In the second iteration, k is comedy. After process line 3-4, wehave RI2 to be the movie data vertices that are modified both Jeremy and comedy, thatis, {v22}.

Now let us look at function Computing P (R{R}|V ∈ TE,M {M}) of Algo-rithm 1. First, we prune the elements in R which have at least one vertex in the datagraph with values match a query modifier (k ∈ M ). Then we check if each keywordk in R has occurrence in the data graph and is associated with target entity V . If so,we include k in R′. Finally, we compute P (R{R}|V ∈ TE,M {M}) based on Equa-tion 9. Let us look at an example. Consider the query title, S.W.A.T, year and data inFigure 1, as described in Section 3.3. To compute P (R{R}|V ∈ TE,M {M}), where

16

Page 17: TAR: Constructing Target-Aware Results for Keyword Search ...ychen/publications/tar-tech-report.pdfTAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs Abstract

Algorithm 1 Computing P (R,M |V ∈ TE)

Input: M , R, G, IG, PathIdx, KeyIdxOutput: P (R,M |V ∈ TE)

Computing P (M {M}|V ∈ TE)

1: M ′ ← ∅, RI1 ← Ins(V )2: for each k ∈M do3: for each v ∈ RI1 s.t. ∃v′ ∈ G, k ∈ l(v′) and v′ ∈Mod(v) do4: Add v to RI2 and update P (Rel(k, V )) by Equation 85: P (M {k}|M {M ′}, V ∈ TE) = P (Rel(k, V ))(H(RI1)−H(RI2)) per Equa-

tion 6 and 76: P (M {M}|V ∈ TE)× = P (M {k}|M {M ′}, V ∈ TE) per Equation 47: Add k to M ′

8: RI1 = RI2

Computing P (R{R}|V ∈ TE,M {M})1: RN ← ∅2: for each k ∈M do3: for each value data vertex u s.t. k ∈ l(u) do4: for an entity or attribute data vertex v′ ∈ [u]AR do5: if Ins−1(v′) ∈ R then6: Remove Ins−1(v′) from R7: break8: for each k ∈ R do9: for V ′ that matches k, V ′ ∈ [V ]AR do

10: Add k to R′

11: P (R{R}|V ∈ TE,M {M}) = |R′||R|

return P (M {M}|V ∈ TE) ∗ P (R{R}|V ∈ TE,M {M})

17

Page 18: TAR: Constructing Target-Aware Results for Keyword Search ...ychen/publications/tar-tech-report.pdfTAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs Abstract

Algorithm 2 Results GenerationInput: RTI(G), PathIdxOutput: TopK TAR results

Building Dependency Digraph1: for each v ∈ RTI(G) do2: for each path rooted at v in PathIdx, denoted as (v, v1, ..., vq) do3: for j = q to 1 do4: if vj ∈ [v′]AR s.t. v′ is a related vertex of vj then5: break;6: if j ≥ 1 then7: for t = 1 to j do8: if vt ∈ RTI(G) then9: Add (vt, v)to E(GD)

10: return GD

Constructing TAR Results1: NMMR, Results← ∅2: Assign each v ∈ V (GD) a strongly connected component id scv3: for each v ∈ V (GD) do4: if scv /∈ NMMR then5: for each v′ s.t. (v′, v) ∈ E(GD) do6: if scv′ 6= scv then7: Add scv to NMMR; break8: for each component id sc s.t. sc /∈ NMMR do9: Add sc to Results

10: Return Results.

R ={title, year}, M ={S.W.A.T.}, and V ={movie}. First, we remove the nodes inR that is associated with a match of query modifier. Ee process the keyword in M :{S.W.A.T.}. We find the value data vertex that matches this keyword v39. Then wefind the attribute and entity vertices that v39 is associated and their counterpart in thestructure summary: {title, movie}. We remove them from R if applicable. In this case,we have updated R to be {year}. Since every keyword in R has occurrence in the data,R′ = R.

For Algorithm 1, the complexity for first function is O(|M | ∗ |GV |2) in the worstcase. The complexity of the second function is O(|M | ∗GV ).

Then we compute P (Q|V ∈ TE) for each entity V and select out target entitiesbased on Definition 3.3. Finally, we get all search targets using Definition 3.4. Wecollect all relevant entity data vertices of target entities in RTI(G) and use them toconstruct target-aware results in Section 5.4

5.4. Constructing Target-Aware Results

Instead of enumerating each possible target-aware results which is intractable, weutilize the properties proved in Section 5.1 to efficiently construct target-aware results.

18

Page 19: TAR: Constructing Target-Aware Results for Keyword Search ...ychen/publications/tar-tech-report.pdfTAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs Abstract

Based on Lemma 5.1, Corollary 5.3, Corollary 5.4, We designed Algorithm 2 togenerate target-aware results.

Function Building Dependency Digraph builds the directed graph GD

discussed in Corollary 5.3. First, from line 1 to line 5, for each path rooted at v itfinds the last entity data vertex vj s.t. vj has an association relationship with a relatedvertex of v. If there is a vt ∈ RTI(G) s.t. vt is between v and vj , then according toCorollary 5.3 there should be an edge from vt to v in the directed graph GD. Conse-quently, line 6 to line 9 add such edges.

Function Constructing TAR Results first finds all strongly connected com-ponents in GD, each with a unique identifier (line 2). We use the classic path-basedstrongly connected component detection algorithm in [44] here. NMMR is used torecord the id of a strongly connected component that does not contribute to target-awareresults. After that, it checks for each strongly component if ∃v ∈ RTI(G)−RTI(R)s.t. ∃v′ ∈ RTI(R) and (v, v′) ∈ E(GD). Only those strongly connected componentsthat do not have incoming edge from a data vertex outside are target-aware results (line3 - 7). From line 8 - 9, each target-aware result will be collected. Finally, all target-aware results will be returned in line 10.

For Example 4.1, Algorithm 2 first builds a directed graph for the data graph, whichhas a fragment shown in Figure 6. There is an edge from the actor data vertex to theother two movie data vertices because the actor data vertex contains a keyword matchFerrell, which modifies the movie data vertices. Other edges are created in thesame way. Then it identifies all strongly connected components. As shown in Figure 6,entity data vertices within the same strongly connected component are surrounded bya dashed rectangle. It shows two strongly connected components. Then based onCorollary 5.3, it excludes the strongly connected component on the right from beingtarget-aware result because there are two entity data vertices outside pointing to it.Consequently, only the left strongly connected component forms a target-aware result.

The worst case complexity of Building Dependency Digraph is |GV | ∗|GE |. The worst case complexity of Constructing TAR Results is |GV | +|GE |, dominated by the complexity of finding strongly connected component.

5.5. Ranking Function

Now we discuss the ranking function used in TAR. Recall that some query key-words serve as return specifiers, and the others serve as modifiers. Only the matchesto modifiers have impact on the ranking functions, and are referred as matches in thissection. We consider three factors.

First, the more matches to a query keyword, the more likely that the result is rele-vant. For instance, for Q1 “comedy, actor”, an actor that stars many comedy movies ismore likely to be relevant than an actor that starts only a few comedy movies.

However, not all matches to query modifiers are equally important. The attributevalues where the matches occur matter. For instance, consider Q2 “S.W.A.T., year”,where keyword S.W.A.T. is a modifier. A movie that has keyword S.W.A.T. in the titleattribute is more likely to be relevant than a movie that has S.W.A.T. in the trivia at-tribute, or even the review attribute (only). The length of the attribute values where thequery keyword appears is a good indicator to detect such cases. Intuitively a word that

19

Page 20: TAR: Constructing Target-Aware Results for Keyword Search ...ychen/publications/tar-tech-report.pdfTAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs Abstract

appears in a long attribute value (such as trivia or review) carries less weight than aword that appears in a short attribute value (such as title).

Furthermore, proximity of keyword matches indicate highly relevant results. Toincorporate this, existing work [4, 20, 30, 24, 16] use the size of the query result asa ranking factor. However, we do not want to penalize a large result which has manymatches to query modifiers. Thus, we use the average distance from matches of querymodifiers to the target entity instance in a result to measure proximity. There can bemany matches to a query keyword, we take the one that is closest to the target entityinstance.

Putting these together, the ranking function RS of the result graph G is presented byEquation 11. Here AvgDG stands for the average distance from target entity instanceto each match of query modifier G. KSet is the set of query keywords. MSet(k)is the set of matches in graph G that matches the query keyword k in KSet. Sincethere can be several matches in MSet(k) for query keyword k, we take the sum ofeach match’s weight, but cap it the sum to be 1, so that matches to k will not dominatematches to other query keywords. The weight of a match m is computed as the inverseof the ωGm, the average number of words in the attribute values in the data where thematch m appears. For instance, if in the data graph, on average the number of wordsin the title attribute value is 5, then the weight of a match to title value is 1/5. On theother hand, if average the number of words in the trivia attribute value is 100, then theweight of a match to trivia value is 1/100. DGk is the smallest edge distance fromtarget instance to a match to query keyword k in G.

RS(G) =

∑k∈KSet WGk

AvgDG=

∑k∈KSet min{1,

∑m∈MSet(k)

1ωGm}

1+∑

k∈KSet DGk

|KSet|

(11)

6. Experiments

In this section we present experimental evaluation of our system TAR that con-structs target-aware results for keyword search on graphs, in terms of both quality andefficiency.

6.1. Experimental Setup

Setting. We implemented TAR in Microsoft Visual C++ and used Oracle BerkeleyDB1 for keyword index and path index. Efficiency experiments were performed ona 3.1GHz Intel Core i5-2400 machine with 16GB memory running Windows 7. Allexperiments are tested with hot cache where indices are in memory.Data. We have performed experiments on three datasets: IMDB, DBLP, and DBPedia.IMDB is an international movie dataset of 2.0 GB with 3.9M data vertices.2 DBLP isa dataset recording bibliography information in computer science of size 1.5GB with

1http://oracle.com/technology/products/berkeley-db/index.html2http://www.imdb.com/

20

Page 21: TAR: Constructing Target-Aware Results for Keyword Search ...ychen/publications/tar-tech-report.pdfTAR: Constructing Target-Aware Results for Keyword Search on Knowledge Graphs Abstract

Table 1: User Study Keyword Queries

DBLP IMDB DBPedia

Q1 Divesh, Jignesh, Ja-gadish Steven Spielberg Sports game, Company

Q2 Semantic mining,Author Brosnan, Bond Apple, Word processor

Q3 Entity resolution,Health Thriller movie, Actor Dance, Album

Q4 Data mining, Con-ference

Dreamworks, Com-edy animation Google, City

Q5 Health informatics,Journal

Documentary movie,Directory Konami

Q6 Optimization book,Author Will Ferrell, Rating Rock music, Artist

Q7Krishnamurthy,Parametric QueryOptimization

Marvel, Iron Man Hip hop, United States

Q82007 Beijing, Dataintegration, Confer-ence

Superman movie,Company Lady Gaga

Q9 Information re-trieval database Bruce Willis, John NBA, Point guard

Q10 Data mining algo-rithm, 2006 Christmas, Family Germany, Small for-

ward

3.1M data vertices.3 For DBPedia dataset, we use its dump data in tabular form.4 Weselected out 60 entity types of totally 1.8M data vertices with size of 7GB from theoriginal data set, which is feasible for our experiment settings and also comparablewith the data sizes used in the state-of-art work on processing keyword queries ongraphs, which was up to 2 GB [48, 40]. To select these 60 entity types, we invited10 users to pick a few entity types they are interested in as seeds and then recursivelycrawled entity types referenced by seed entity types and combined them together. Akeyword index and a path index are built for each data set. The size of indices of DBLPdataset is 0.1GB and 2.6GB respectively; 0.13GB and 3.6GB for IMDB dataset, while0.78GB and 0.98GB for DBPedia dataset.Queries. We provide 10 queries for each dataset, with diverse characteristics. First,the length of these 30 queries varies from 1 to 5. Second, query keywords chosen havevarious number of keyword matches. These 30 queries are tested for search quality

³ http://www.informatik.uni-trier.de/~ley/db/
⁴ http://wiki.dbpedia.org/DBpediaAsTables?v=hyz


evaluation by 30 Amazon Mechanical Turk (MTurk, for short) workers. Table 1 shows all 30 queries.

Comparison Systems. We compare TAR with BANKS [4] and EASE [30]. We use BANKS as a representative of systems that define query results as minimal trees, because according to [11] BANKS has surprisingly high accuracy even compared with more recent search engines, and it is the only search engine that can support the complete IMDB data set. We use EASE as a representative of systems that define results as graphs and do not restrict a result to be the minimal graph containing the query keywords. We implemented BANKS and EASE in C++ based on the respective papers.

6.2. Search Quality Evaluation

We performed a user study on Amazon MTurk using 30 test queries; each query is evaluated by 30 real-world users. For each user study query, we provide a few top-ranked results returned by each system and mix the results generated by different systems in a random order. Then we ask users to mark each query result as relevant or not.

Search Quality. Since users commonly request top-k results rather than an enumeration of all results, we use top-k precision (in short, precision) as the major metric to evaluate the result quality of all three systems. Following common practice, we do not evaluate recall on search results, since the number of results is too large for it to be feasible to obtain user feedback to determine the ground truth. As will be shown shortly, we do evaluate both precision and recall for search target inference and for modifier and return specifier labelling. We compute the top-k precision of each system on a query for each user. Then we compute the average top-k precision of each system on each query among all users, following [11]. To achieve reliable evaluation results, we pick a reasonable value, 3, for k, such that MTurk users are not overwhelmed by the number of results presented at a time. Figure 7(a) to Figure 7(c) show the precision of each system on each query of the three data sets. The average precision of a system over all queries on each dataset is shown on the figures. Since EASE cannot handle the complete IMDB dataset, we use the partial IMDB dataset provided by the authors of EASE that it can successfully handle. For that dataset, Q1, Q2, and Q6 do not have query results and are therefore marked as N/A in Figure 7(b).
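As a concrete sketch, the per-user, per-query averaging just described can be computed as follows (the 0/1 relevance judgments and the function names are illustrative, not from the paper):

```python
def top_k_precision(judgments, k=3):
    """Fraction of the top-k results that one user marked relevant.
    judgments is a ranked list of 0/1 relevance marks."""
    top = judgments[:k]
    return sum(top) / len(top) if top else 0.0

def average_precision_over_users(per_user_judgments, k=3):
    """Mean top-k precision of one system on one query across all users."""
    scores = [top_k_precision(j, k) for j in per_user_judgments]
    return sum(scores) / len(scores)
```

With k = 3, as in the study, each user contributes one precision value per query, and the per-query score is the mean over the 30 users.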

As we can see, TAR, the only system aiming at constructing target-aware results, shows significant improvements over BANKS and EASE in average search quality. TAR has the highest score for 25 out of all 30 queries. This empirically verifies the advantages of TAR over existing systems, as explained in Section 1.

The experiments also show that TAR works well on heterogeneous data. For example, for Q3 of the DBPedia dataset, “Dance, Album”, Dance is matched by song and artwork data vertices, while Album is matched by album and software data vertices. TAR is able to generate results for several different interpretations of the query semantics, such as “Find albums with dance genre”, “Find the album named dance”, and “Find albums with some music named dance”, as shown in Figure 8. In Figure 8(a), there is one Album data vertex, and all songs of that album whose names or genres are matched by Dance are organized in the same result. In Figure 8(b), Dance is directly matched to the name and genre of one Album data vertex. To see why, suppose song is a target entity; then based on Equation 9, both Dance and Album must be labeled as modifiers.


However, since most Dance-related songs are modified by Album, Equation 3 is close to 0. On the other hand, considering Album as the target entity, we may label Album as a return specifier and Dance as a modifier, and obtain a much higher value for Equation 3. Thus TAR infers album as the target entity. In contrast, BANKS and EASE do not construct results based on search targets: they may return an album with one music piece named dance but not all of them, or a company with software products named dance and album. TAR's results are favored by users, as shown in Figure 7(c).

As another example, for Q2 of the DBPedia dataset, “Apple, Word processor”, Apple is matched by song, software, company and even artist data vertices, while Word processor is matched by software. Suppose company is a target entity; then based on Equation 9, all keywords are labeled as modifiers. However, after the keyword Apple, Word processor cannot bring any further information gain, i.e., Equation 3 is close to 0. Intuitively, Apple produces word processors. In contrast, software gets a much larger value for Equation 3 and is inferred as the target entity. TAR is able to generate results for several different interpretations of the query semantics, such as “find the word processor software named Apple” or “find word processor software produced by Apple Inc.”.
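The inference step in both examples reduces to picking the candidate target type with the highest score. A minimal sketch, assuming the caller supplies the scoring function (e.g., the information-gain value of Equation 3, whose details are given earlier in the paper; the values below are purely illustrative):

```python
def infer_target(candidates, score):
    """Return the candidate target entity type whose score is highest.
    `score` maps an entity type to its Equation-3-style value."""
    return max(candidates, key=score)
```

For Q2 of DBPedia, the scores of company and software would differ sharply, and `infer_target` would pick software.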

Compared with BANKS, EASE does not enforce that a result contain all query keywords, and may receive unfavorable ratings from users, e.g., on queries Q3-Q10 of the DBLP dataset and Q1, Q2, Q3, Q8 of the DBPedia dataset. For example, for Q3 of the DBLP dataset, “Entity resolution, Health”, some results returned by EASE do not contain the keyword Health. Also, BANKS generally has a higher precision than EASE on IMDB queries, because EASE only supports a partial IMDB data set (such as movie names and user scores) based on the restrictions described in its paper. It cannot return relevant results for queries related to actors, directors or production companies, such as Q1, Q2 and Q6 of the IMDB dataset.

On the other hand, EASE returns all vertices in an r-radius subgraph and thus tends to provide more comprehensive information in a result. Thus it obtains higher user scores than BANKS on other queries, such as Q1, Q2 of the DBLP dataset and Q1, Q4-Q7, Q9-Q10 of the DBPedia dataset. For example, for Q9, “NBA, Point Guard”, EASE includes league information with each basketball player, and the generated results end up being intact, which obtains a higher score than BANKS.

Next we analyze the queries for which TAR does not generate the highest-scored results. For DBLP queries, TAR obtains its lowest precision value on Q1, “Divesh, Jignesh, Jagadish”. While it correctly infers paper and article as the search target, it does not recognize that papers and articles are both publications, due to the lack of an ontology. It considers that an article with title “Divesh” may modify a paper written by “Divesh, Jignesh, Jagadish”, which is not desirable. Similarly, for DBPedia Q7, “Hip Hop, United States”, although TAR correctly infers artist as the search target, it considers a basketball player data vertex with united states in its values as a modifier of an artist if they are connected by the same city. We plan to incorporate ontology and domain knowledge into vertex and edge weights to better capture the modifying relationship.

For IMDB Q10, “Christmas, Family”, the user intention is to use family to constrain the genre of a movie. TAR attaches a company named “ABC Family” to a movie data vertex related to “Christmas, Family” to make it intact, which is considered unfavorable by users. While the query interpretation is reasonable, and the approach that TAR takes improves the diversity and recall of the results, information such as click-through logs or user profiles could be leveraged to achieve personalization and enhanced ranking.

Search Target Inference. We have evaluated whether the search targets inferred by TAR are correct. We asked users to write down what they think is the meaning of a query as an English sentence in the format “Find X that Y”. Then we extract the noun(s) from the X part as the intended search targets. For example, consider Q10 in IMDB, “Christmas, Family”: a user wrote down the interpretation as “Find Christmas movies that are for family to watch”. We extract movie as the ground truth of the user's search target.
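The ground-truth extraction from user interpretations can be sketched as a simple pattern match (a simplification: the paper extracts the nouns from the X part, which would additionally require part-of-speech tagging):

```python
import re

def extract_target_phrase(sentence):
    """Split a 'Find X that Y' interpretation and return the X part,
    whose nouns serve as the ground-truth search target."""
    m = re.match(r"find\s+(.+?)\s+that\s+", sentence, re.IGNORECASE)
    return m.group(1) if m else None
```

Applied to the example above, the X part is “Christmas movies”, from which the noun movie is taken as the ground truth.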

Table 2: Precision & Recall of Search Target Inference

(Precision, Recall)
      DBLP          IMDB           DBPedia
Q1    70%, 70%      53%, 100%      80%, 80%
Q2    78%, 64%      67%, 67%       67%, 67%
Q3    63%, 27%      87%, 87%       87%, 87%
Q4    83%, 83%      97%, 97%       93%, 93%
Q5    83%, 83%      93%, 93%       40%, 83%
Q6    80%, 70%      57%, 57%       87%, 87%
Q7    20%, 20%      83%, 83%       47%, 47%
Q8    87%, 87%      80%, 80%       33%, 93%
Q9    83%, 70%      53%, 53%       93%, 93%
Q10   73%, 33%      83%, 83%       83%, 83%

Table 3: Precision & Recall of Modifier/Return Specifier Labelling

(Precision, Recall)
      DBLP           IMDB           DBPedia
Q1    100%, 100%     83%, 83%       86%, 86%
Q2    81%, 81%       83%, 83%       92%, 100%
Q3    83%, 100%      75%, 75%       100%, 100%
Q4    100%, 100%     100%, 100%     100%, 92%
Q5    100%, 100%     100%, 100%     100%, 100%
Q6    75%, 75%       100%, 75%      100%, 100%
Q7    83%, 100%      67%, 67%       100%, 100%
Q8    100%, 100%     100%, 100%     100%, 100%
Q9    75%, 100%      100%, 100%     75%, 100%
Q10   100%, 100%     100%, 100%     75%, 100%

The average precision of TAR on inferring search targets is 72.8% and the average recall is 74.0%. TAR makes high-quality inferences for most queries. There are a few queries on which TAR does not perform well. For Q3 and Q10 of the DBLP dataset, we observe that the recall value is significantly smaller than the precision. This happens when TAR infers only one entity as the search target but users put down many. For example, for DBLP Q3, “Entity resolution, Health”, most users choose both paper and article as the search target, but TAR infers only paper. This is because paper and article may have slightly different values of Equation 3.3, and TAR only returns the entity with the highest value. A similar problem exists for Q7. This can be improved by leveraging ontology information and grouping both papers and articles under publication as the search target.

For Q1 of the IMDB dataset and Q5, Q8 of the DBPedia dataset, we observe that precision is low. This is because TAR infers multiple entities as search targets but most users choose only one. For example, for Q8 of the DBPedia dataset, artist, song, album and person have the same value of Definition 3.3 and are all considered target entities. However, most users' choices are personalized, and only one entity is chosen. To solve this problem, we would need to incorporate personalized search into TAR, which is left to future work.

Modifier and Return Specifier Labelling. We evaluated whether TAR's labelling of keywords as modifiers and return specifiers is correct. After a user fills out “Find X that Y”, we extract the query keywords in X as return specifiers and the keywords in Y as modifiers, which are used as the ground truth for keyword labelling of this user. Then we calculate the average precision and recall among all users, as shown in Table 3. The average precision is 91% and the average recall is 94%, which demonstrates TAR's high accuracy.

Atomicity and Intactness. We asked users whether they think atomicity and intactness are important in defining results. To make this understandable to MTurk users, we use X to denote the search targets supplied by the user when they write an English sentence describing the query meaning. We describe atomicity as “Each search result shall contain at least one X”, and intactness as “Each result shall contain all related information of X”. For the IMDB data set, 76% of users chose intactness while 35% chose atomicity. For the DBLP data set, 70% chose intactness while 42% chose atomicity. For the DBPedia dataset, 72% chose intactness while 34% chose atomicity. Although both atomicity and intactness are important, intactness is more crucial according to the user study, which supports our definition of target-aware results in Section 4.1 for the case when both cannot be satisfied simultaneously.
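The per-user precision and recall in Tables 2 and 3 are standard set-based measures; for the labelling evaluation they can be computed over (keyword, label) pairs, as in this sketch (the data shapes are assumptions for illustration):

```python
def precision_recall(predicted, truth):
    """Set-based precision and recall of predicted items against
    one user's ground truth (e.g., sets of (keyword, label) pairs)."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)                      # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall
```

The reported numbers are then averages of these per-user values over all 30 users.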

6.3. Efficiency Experiments

To test the efficiency of our approach, we record the processing times for generating the top-3 results for TAR, BANKS and EASE, and present the speed-up ratio of TAR over the other two on a log scale in Figure 9. The speed-up ratio is calculated, for each query, as the elapsed time of EASE or BANKS divided by that of TAR.

Besides the user study queries, we tested the efficiency of TAR on 75 randomly generated queries. To generate these random queries, for each data set we first generate the list of all keywords that appear in the dataset, sorted by the number of occurrences. Then we randomly select n keywords from the list to form a query, where n is a random integer between 1 and 5. For random DBLP queries, the average speed-up ratio of TAR over BANKS and EASE is 118.1 and 9.3 respectively. For IMDB, it is 23.4 and 0.69. For DBPedia, it is 1.69 and 0.13 respectively.
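The random-query generation described above can be sketched as follows (assuming the per-dataset keyword list is given; the function name and defaults are illustrative):

```python
import random

def random_queries(keywords, num_queries=25, max_len=5, seed=0):
    """Build random queries by sampling n distinct keywords per query,
    where n is uniform in 1..max_len, mirroring the setup in the text."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [rng.sample(keywords, rng.randint(1, max_len))
            for _ in range(num_queries)]
```

With 25 queries per dataset, three datasets yield the 75 random queries used in the experiment.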


TAR outperforms BANKS on most queries. After identifying target entities, TAR constructs results based on search targets, while BANKS constructs results starting from all keyword matches in the data graph. For example, for Q2 on the DBLP dataset, “Semantic mining, Author”, BANKS first explores all data vertices directly connected to a keyword match of “Semantic mining” (and “Author”). Then it moves on to explore all data vertices connected to those keyword matches, expanding its search space until a path that connects matches of both query keywords is identified. Such a strategy is inefficient. EASE uses advanced indexes and achieves higher efficiency than BANKS. TAR starts by inferring author as the search target, and then retrieves modifiers matching “Semantic mining” for each relevant author data vertex through the keyword index and path index, which means a smaller search space and faster access.

When there are a large number of query-relevant target entity data vertices, especially when multiple entities together with their relationships are inferred as the search target, TAR is slower than BANKS and EASE. For example, for Q1 of the IMDB data set, “Steven Spielberg”, many entities in the data, such as movies, companies, and actors, have this name in their data vertices, so they are all considered target entities. It turns out there are more than 2,000 query-relevant data vertices across all those target entities. TAR needs to process each of them, including their modifiers and relationships, and rank them, which can be time-consuming. In contrast, for Q4 of the IMDB dataset, “Dreamworks, Comedy animation”, TAR infers that movie is the only target entity, with only 49 query-relevant data vertices, and achieves high efficiency.

For randomly generated queries, we can see that the speedup decreases on the IMDB and DBPedia datasets. This is because randomly generated queries have a much weaker indication of a specific target entity than real user-issued queries. For such queries, relationships are usually inferred as search targets. With many different entities considered as target entities, TAR is less efficient. For example, for a random query “Vol, Philip, Rose” on the DBPedia dataset, TAR infers music, artist and song as target entities and finds more than 3,000 relevant data vertices, which takes much longer to process than real user study queries.

We also test the scalability of TAR in terms of k when generating top-k results. TAR's average speed-up ratios over the other two systems for the 10 user queries of each dataset, on a log scale with respect to k, are presented in Figure 10. EASE's performance does not change much over k, which is consistent with [30]. The speed-up ratio of TAR over BANKS goes up noticeably as k increases, because BANKS takes much longer to construct more results.


6.4. Evaluation Using INEX Benchmark

Table 4: Unhandled Queries and their Reasons

Index   INEX Query Keywords                                Unhandled reason
Q01     Yimou Zhang, 2009, 2010                            Contains "or" relationship in query
Q12     true story, drug, +addiction, -dealer              Contains "-" (negation) relationship in query
Q14     romance movies by Richard Gere or George Clooney   Contains "or" relationship in query
Q21     Movies Klaus Kinski actor movies good rating       Subjective query beyond keyword match

Table 5: XPath & Description of INEX Queries

Index   Keywords / XPath / Description

Q02   Dogme, movie
      //movie[about(.//keywords, Dogme) OR about(.//additional_details, Dogme)]
      Find movies that were shot under the Dogme movement

Q04   Avatar, James Francis Cameron
      //movie[about(.//title, Avatar) AND about(.//director, James Francis Cameron)]
      I am searching for information about the movie "Avatar" directed by James Francis Cameron

Q05   around the world in eighty days
      //movie[about(.//title, around the world in eighty days)]
      Find the movie "around the world in eighty days"

Q06   Tom Hanks, Ryan
      //movie[about(.//actor, Tom Hanks) AND about(.//title, Ryan)]
      Find the movie containing "Ryan" in its title and acted by Tom Hanks

Q07   Scarlett Johansson, John Slattery
      //movie[about(.//actor, Scarlett Johansson) AND about(.//actor, John Slattery)]
      Find the movie that Scarlett Johansson and John Slattery co-starred in

Q08   titanic, jack, rose
      //movie[about(.//title, titanic) AND about(.//plot, jack rose)]
      I am searching for the version of the movie "Titanic" in which the two major characters are called Jack and Rose respectively

Q09   hua mulan, animation
      //movie[about(.//title, hua mulan) AND about(.//genre, animation)]
      I am looking for information about the animation movie telling the story of the heroine Hua Mulan in ancient China

Q10   director, fearless, jet li
      //movie[about(.//title, fearless) AND about(.//actor, Jet Li)]//director
      Who is the director of the movie "Fearless" starring Jet Li

Q11   ancient Rome, era
      //movie[about(.//plot, ancient rome era)]
      Find the movies about the era of ancient Rome

Q15   may the force be with you
      //movie[about(.//quote, may the force be with you)]
      Movies where the phrase "may the force be with you" is cited

Q16   stan lee, actor
      //movie[about(.//actor, Stan Lee)]
      Movies where Stan Lee appears as an actor

Q17   brad pitt, producer
      //movie[about(.//producer, Brad Pitt)]
      Movies where Brad Pitt is a producer

Q18   Stanley Kubrick, movie, director
      //person[about(., stanley kubrick)]//direct
      List of all the movies directed by Stanley Kubrick

Q19   Heath Ledger, actor, movie
      //person[about(., heath ledger)]//act//movie//title
      List of all the movies in which Heath Ledger has played

Q20   Jean Pierre Jeunet, Marc Caro, movie
      //movie[about(.//director, "Jean Pierre Jeunet") AND about(.//director, "Marc Caro")]
      List of movies directed by both Jean Pierre Jeunet and Marc Caro

Q22   Ingmar Bergman, biography
      //person[about(., Ingmar Bergman)]//biography
      I want to know more about Ingmar Bergman

Q23   Woody Allen, Scarlett Johansson, Comedy
      //movie[about(.//genre, comedy) AND about(.//director, Woody Allen) AND about(.//actor, Scarlett Johansson)]
      I am looking for the comedy movie directed by Woody Allen and acted by Scarlett Johansson

Q24   Shirley Temple
      //person[about(.//name, Shirley Temple)]
      Find the information about Shirley Temple

Q25   tom hanks, steven spielberg
      //movie[about(., tom hanks steven spielberg)]
      Movies where both Tom Hanks and Steven Spielberg worked together

Q26   quentin tarantino, thriller
      //movie[about(.//genre, thriller) AND about(.//*, Quentin Tarantino)]
      Movies of genre thriller with Quentin Tarantino

Q27   tom cruise, movie
      //movie[about(., Tom Cruise)]
      Movies where Tom Cruise works as actor, producer or director

Q28   Clint Mansell, composer
      //person[about(., clint mansell)]//compose
      I want the list of movies in which Clint Mansell was the composer

Table 6: MAgP of Query Results

Index   TAR     BANKS   EASE
Q02     0.635   0.598   0.006
Q04     0.878   0.331   0.001
Q05     0.492   0.090   0
Q06     0.058   0.015   0.001
Q07     0.018   0       0.002
Q08     0.019   0       0.002
Q09     0.903   0       0.009
Q10     0.371   0.001   0.001
Q11     0.044   0       0.005
Q15     0.997   0       0
Q16     0.771   0.022   0
Q17     0.281   0.016   0
Q18     0.701   0.028   0
Q19     0.558   0.011   0.001
Q20     0.218   0.069   0.071
Q22     0.084   0       0.023
Q23     0.237   0       0.003
Q24     0.668   0       0.004
Q25     0.328   0       0.002
Q26     0.143   0       0.002
Q27     0.173   0.008   0
Q28     0.873   0       0.008

MAgP    0.430   0.054   0.006


Table 7: Search Target Inference

Index   INEX       Inferred by TAR
Q02     movie      movie
Q04     movie      movie
Q05     movie      movie
Q06     movie      movie
Q07     movie      movie
Q08     movie      actor
Q09     movie      movie
Q10     director   director
Q11     movie      movie
Q15     movie      movie
Q16     movie      actor
Q17     movie      movie
Q18     movie      movie
Q19     movie      movie
Q20     movie      movie
Q22     person     actor
Q23     movie      movie
Q24     person     actor
Q25     movie      movie
Q26     movie      movie
Q27     movie      movie
Q28     movie      movie

INEX benchmark. We also evaluate the proposed system using the INEX 2010 benchmark. The dataset is the IMDB collection (2010-2011) [1] of 1.4GB with 3.7M data vertices, represented as a collection of XML files. Each XML file records the information of a movie or a person, who can be an actor, director, or producer. To construct a knowledge data graph, we build links between the same entities that appear in different files, using exact string matches between the cast elements of movie files and the name elements of person files, and between the title elements of movie files and the filmography elements of person files. The same construction is used for all comparison systems.
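The exact-string-match linking rule can be sketched as follows (the dictionary shapes standing in for the parsed movie and person files are assumptions for illustration):

```python
def link_entities(movies, persons):
    """Link movie cast names to person records by exact string match,
    the same rule used to build the benchmark knowledge graph."""
    by_name = {p["name"]: p for p in persons}  # index persons by exact name
    links = []
    for movie in movies:
        for cast_name in movie["cast"]:
            if cast_name in by_name:           # exact match only
                links.append((movie["title"], cast_name))
    return links
```

The strictness of this exact match is the source of the missed links discussed later for query Q07.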

The benchmark contains 28 queries. Among them, two queries (Q03 and Q13) do not have ground truth in the benchmark. Four queries (Q01, Q12, Q14, and Q21) cannot be handled by our system, since they contain the advanced syntax “or” and “not”, or they contain subjective keywords like “good rating”, as shown in Table 4. The current implementation of TAR only handles the “and” semantics of keywords in a query, and generates results based on exact matches of query keywords. We evaluate the TAR system on the remaining 22 queries. The queries, their English descriptions and the corresponding XPath expressions provided by INEX are shown in Table 5.

The ground truth of all queries is provided by INEX; it is contributed by the teams that participated in the competition and collected through a technique known as pooling [42].

Mean average generalized precision (MAgP) is used by INEX to measure system performance; it is defined using generalized precision and generalized recall. The generalized precision gP[r] is defined as the sum of the document F1-measure scores up to (and including) document rank r, divided by the rank r, as shown in Equation 12, where d_i is a retrieved document (i.e., a query result):

gP[r] = \frac{\sum_{i=1}^{r} F_1(d_i)}{r}    (12)

The generalized recall gR[r] is defined as the number of relevant documents retrieved up to (and including) document rank r, divided by the total number of relevant documents, as in Equation 13. Here IsRel(d_i) = 1 if the document at rank i contains highlighted relevant text, IsRel(d_i) = 0 otherwise, and N_{rel} is the total number of documents with relevance for a given topic (query):

gR[r] = \frac{\sum_{i=1}^{r} IsRel(d_i)}{N_{rel}}    (13)

Let L be the ranked list of returned results. The average generalized precision AgP for a topic, with r ranging from 1 to |L|, is calculated by Equation 14 [25]:

AgP = \sum_{r=1}^{|L|} gP[r] \cdot gR[r]    (14)
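Equations 12-14 can be computed in a single pass over a ranked list; a sketch (the per-result F1 scores, relevance flags, and per-topic relevant count are the inputs; names are illustrative):

```python
def agp(f1_scores, is_rel, n_rel):
    """Average generalized precision of one ranked list (Equations 12-14)."""
    total, cum_f1, cum_rel = 0.0, 0.0, 0
    for r, (f1, rel) in enumerate(zip(f1_scores, is_rel), start=1):
        cum_f1 += f1
        cum_rel += rel
        gp = cum_f1 / r            # Equation 12
        gr = cum_rel / n_rel       # Equation 13
        total += gp * gr           # summand of Equation 14
    return total

def magp(topics):
    """Mean AgP over all topics; each topic is (f1_scores, is_rel, n_rel)."""
    return sum(agp(*t) for t in topics) / len(topics)
```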

The mean average generalized precision (MAgP) is simply the mean of the average generalized precision (AgP) scores over all topics (queries).

Search Quality Evaluation. We collected the top 1000 results of each system, converted the results into the required format for evaluation, and then ran the assessment program provided by INEX to compute the MAgP scores.

The AgP value of each system (TAR and the comparison systems BANKS and EASE) on each query can be found in Table 6. As we can see, TAR, which aims at constructing target-aware results, shows a significant improvement over the two comparison systems. It has the highest AgP score on every test query. Note that none of the systems represents an end-to-end solution for processing keyword queries on graph data; the contributions in this paper, as well as those in the papers describing BANKS and EASE, can be combined with other techniques for enhanced performance. Next we analyze the performance.

While TAR performs reasonably on most queries and has a good MAgP score, there are several aspects that need improvement. First, there are several issues when TAR constructs a graph of entities, attributes and relationships from the collection of XML files in the INEX benchmark. One problem is that TAR does not properly handle the “MovieLink” element in the INEX XML files, as illustrated by its performance on queries Q06 and Q07. For Q06, “Tom Hanks, Ryan”, users intend to find the movies that contain “Ryan” in the title and are acted by Tom Hanks. TAR returns some results like the graph shown in Figure 11. Since TAR considers the “MovieLink” element in the XML files an attribute of a movie entity, it considers a movie that contains “Tom Hanks” in its MovieLink attribute as relevant. In fact, “MovieLink” links two different movies using xlink. We plan to update TAR to properly handle xlinks, and to build relationships between two movie entities accordingly when constructing the knowledge graph, so that the relatedness relationship between two movies can be handled using the techniques presented in Definition 3.1.

Another problem is that the XML files do not always use xlinks to link the same entity when it appears in multiple files. While TAR constructs such links based on exact string matches of person names or movie titles, the data may contain approximate matches. For instance, consider query Q07, “Scarlett Johansson, John Slattery”, with the intention of finding the movie that Scarlett Johansson and John Slattery co-starred in. One ground truth result is the movie “Saturday Night Live” (1975) {Scarlett Johansson/Björk (#32.18)}, which does not link to actor “John Slattery” directly. On the other hand, in the XML file about “John Slattery”, his filmography element contains “Saturday Night Live”, showing his relationship with the movie. However, the movie title there does not carry the annotation “{Scarlett Johansson/Björk (#32.18)}”. Using exact string matching, TAR was not able to identify that these two movie instances are the same entity, and was thus unable to correctly link the actors and return this movie as a result.

Second, with exact matching, TAR cannot handle variations of words. For example, consider query Q05, “around the world in eighty days”. We notice that some of the ground truth results provided by INEX are the movies named “around the world in 80 days”. The keyword match strategy of TAR does not consider aliases of keywords, such as “eighty” and “80”. Another example is query Q20, “Jean Pierre Jeunet, Marc Caro, movie”, with the intention of finding the list of movies directed by both Jean Pierre Jeunet and Marc Caro. 9 out of the 14 ground truth results are not returned by TAR. One reason is that several results, such as the movie “L'évasion (1978)”, contain the keyword “Jean-Pierre Jeunet”, with an additional hyphen “-” compared to the query keyword. TAR does not consider this a match, but the ground truth does. To handle such queries, we should consider not only the roots of words, but also synonyms and variant forms of the same concept (such as removal of symbols) when processing keyword matches.
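A simple normalization step of the kind suggested above can be sketched as follows (the alias table is a stand-in for a real synonym resource, not part of TAR):

```python
import re

NUMBER_ALIASES = {"eighty": "80"}  # assumption: a small illustrative alias table

def normalize(term):
    """Lower-case, split on whitespace and hyphens, and rewrite known aliases,
    so 'Jean-Pierre' matches 'Jean Pierre' and 'eighty' matches '80'."""
    tokens = re.split(r"[\s\-]+", term.lower())
    return " ".join(NUMBER_ALIASES.get(t, t) for t in tokens)

def loose_match(a, b):
    """Match two terms after normalization instead of by exact string equality."""
    return normalize(a) == normalize(b)
```

Both Q05 and Q20 failure cases above would pass under this looser matching.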

Third, TAR is not able to properly handle queries that carry semantics beyond exact keyword matches. Consider query Q11, “ancient Rome, era”, which intends to find movies about the era of ancient Rome. Most of the ground truth results, such as the movies “Il conquistatore di Corinto (1961)”, “Great Generals of the Ancient World: Julius Caesar (2000) (V)”, “Legions of Rome: The Gallic Wars (2001) (V)”, and “Imperium: Augustus (2003) (TV)”, do not contain the query keyword “era”. Although these results do not explicitly contain the word “era”, these movies are indeed about the Rome era. However, using “and” semantics and exact matches, TAR can only return 27 out of the 178 ground truth results. If we omit the keyword “era” and use “ancient Rome” as the query, TAR returns 144 out of the 178 ground truth results. To handle queries like Q11, we need to consider the semantic relationships between query keywords and support “or” semantics for queries. In Q11, the intention of the query can be captured by “ancient Rome”, which is highly correlated with the keyword era and is more specific than “era”. So a movie that is about ancient Rome is most likely relevant even without the presence of the keyword “era”. Another observation is that movies about the “Rome era” are likely to have the genre “history”. If such correlations could be incorporated, the precision could be improved to compensate for the lack of “era” matches.

Another example of such a semantic query problem is Q08, “titanic, jack, rose”. It has some ground truth results related to the word “Titan” instead of the keyword “Titanic”. “Titan” comes from the novella by Morgan Robertson called “Wreck of the Titan”, published before the real-life “Titanic” disaster. The story features a fictional ocean liner, the “Titan”, which sinks in the North Atlantic after striking an iceberg [2]. Due to the uncanny similarities between the fictional and real-life versions, users are expected to consider movies related to Titan as relevant. However, based on exact keyword matches, TAR cannot match “Titan” in the inverted index and thus misses the corresponding results. To handle such queries, we need to consider terms related to the query keywords based on a knowledge base.

Furthermore, the ranking of results by TAR differs from the ranking in the ground truth. For instance, for Q06, “Tom Hanks, Ryan”, users intend to find the movies that contain Ryan in the title and are acted by Tom Hanks. However, since the keyword query itself does not specify that Ryan should be in the movie title, the top-ranked movie in the ground truth is not highly ranked by TAR. On the other hand, TAR returns results like the graph shown in Figure 11. Since this result has three matches to the modifier “Ryan” among the movie's actors, showing its high relevance to “Ryan”, it is ranked top by TAR. However, the user does not mean to search for actors named “Ryan”. One possible strategy is to consider the popularity of movies in the ranking to boost well-known movies, though a general ranking solution beyond movie data is an open question.

In addition to these problems of TAR, we also find mismatches between the English and XPath descriptions of queries and the ground truth results in the INEX benchmark. The rationale of some queries' ground truth is unclear, such as for Q08, Q22, and Q26. For example, for Q26, “quentin tarantino, thriller”, the user intention is to find movies of genre thriller with Quentin Tarantino. The top-ranked result in the ground truth is a movie in which “quentin tarantino” is not an actor, director, writer, or producer, but is merely mentioned in the “trivia” element as “Selected by Quentin Tarantino for the First Quentin Tarantino Film Fest in Austin, Texas, 1996.” The result ranked second in the ground truth contains no occurrence of “quentin tarantino” at all. These seem to contradict the description of this query. In contrast, TAR returns movies where “quentin tarantino” is an actor, director, or writer. For Q22, “Ingmar Bergman, biography”, the intention is to find the biography of “Ingmar Bergman”. However, many instances in the ground truth are actors who have worked with “Ingmar Bergman”. TAR does not return such actors.

On the other hand, for some queries, the results returned by TAR match the XPath and English descriptions of the query, but are not included in the ground-truth results. For instance, query Q19 “Heath Ledger, actor, movie” has the intention of listing all the movies in which Heath Ledger has played. The top result returned by TAR


is the movie “The Dark Knight (2008)”, in which Heath Ledger stars. While it matches the XPath and English descriptions of the query, it is not included in the ground truth. Another example is query Q27 “Tom Cruise, movie”, which intends to find the movies where Tom Cruise works as an actor, producer, or director. One of the top results returned by TAR is the movie “2000 Blockbuster Entertainment Awards (2000) (TV)”, in which Tom Cruise stars. However, it is not in the ground-truth results.

Other systems, BANKS and EASE, also suffer from the problems discussed above, and have additional issues. Without search target inference, the entity types of the top results of BANKS often mismatch the user intention. For instance, Q18 “Stanley Kubrick, movie, director” looks for the list of all the movies directed by Stanley Kubrick. However, the top ten results of BANKS are actors who have connections with the director Stanley Kubrick, not movies.

Furthermore, the ranking function of BANKS is not effective in this evaluation. Inspired by PageRank in Google [8], it uses the in-degree of a node in the data graph as the node weight, as described in [7]: a higher prestige is given to a node with many incoming edges. The edge weight is set to 1 for undirected graphs, as is the case in this experiment. The score of a result is the combination (either sum or product) of node weights and edge weights. Let us look at the example of Q17 “brad pitt, producer”, with the intention of finding the movies where Brad Pitt is a producer. BANKS returns the following three top results: the actors “Antony Audenshaw”, “William Blinn”, and “Kevin (I) O'Connor”, all of whom are related to the query keywords. These nodes all have a large number of incoming edges: 486, 366, and 210, respectively, since the actors appear in many TV shows or series. However, they are not relevant results.
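
A toy model of this BANKS-style scoring, where node prestige is the in-degree in the data graph, edge weight is 1, and a result's score here simply sums node weights (a simplification for illustration, not the exact BANKS formula):

```python
def in_degree(edges, node):
    """Node prestige: number of incoming edges."""
    return sum(1 for (_, dst) in edges if dst == node)

def result_score(edges, result_nodes):
    """Score of a result: sum of node prestiges (edge weight = 1)."""
    return sum(in_degree(edges, n) for n in result_nodes)

# An actor appearing in many TV shows accumulates incoming edges, so a
# result centered on that actor outscores a result centered on a movie,
# regardless of which one the user actually wants.
edges = [("show1", "actor"), ("show2", "actor"), ("show3", "actor"),
         ("company", "movie")]
print(result_score(edges, ["actor"]), result_score(edges, ["movie"]))  # 3 1
```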

One key issue for EASE is that the authors' implementation considers each query keyword as a unit and cannot take segmented queries with phrases as input [30]. For instance, consider Q15 “may the force be with you”, whose intention is to find the movies where the phrase “may the force be with you” is cited. The top-ranked result by EASE is the movie “Body Snatchers (1993)”. Each query keyword occurs many times in this result, under different contexts, but the whole phrase never appears together, so the result is not very relevant. This also interferes with the ranking function. Consider Q05 “around the world in eighty days”, which intends to find the movie “Around the World in Eighty Days”. The top-ranked result by EASE is the movie “Deadwood (2004)”, since one of its actors, named “Taylor Toole”, is related to each individual query keyword, some of which occur many times in the result. Meanwhile, a ground-truth result, such as the movie “Around the World in Eighty Days (1956)”, has a relatively low frequency of keyword occurrences, ranking 584th among the top 1000 results returned. It is possible to extend EASE to handle segmented queries to improve performance.

Search Target Inference. We also evaluate whether the search targets inferred by TAR are correct with respect to the XPath and English descriptions provided by the INEX benchmark, as shown in Table 7.

We observe that TAR makes a wrong inference for only one query: Q08 “titanic, jack, rose”. The search target should be movie, while TAR considers actor to be the most likely target entity. Specifically, the score for actor as the target entity (Equation 10) is 0.07, and for movie it is 0.04. TAR correctly identifies that there is no explicit return


specifier in this query, and all query keywords are considered as modifiers. For the hypothesis that the user searches for movies about “Titanic”, since there are only a few movies about “Titanic” and many persons with names containing the keywords “Jack” or “Rose”, adding the keywords “Jack” and “Rose” does not bring much information gain. Thus TAR considers this hypothesis unlikely. On the other hand, for the hypothesis that the user searches for persons named “Jack” and “Rose”, the keyword “Titanic” brings a large information gain to differentiate among those persons. Thus TAR considers this hypothesis very likely. For Q22 and Q24, the search target is person. According to the data, a person can be an actor, a director, or a producer. Among these three types, TAR ranks actor as having the highest probability to be the target entity, followed by director and producer, since there are more actors related to the query keywords than the other types.

7. Related Work

Keyword Search on Knowledge Graphs. Existing work on keyword search on graphs defines a query result to be a tree or a graph. Tree-based query results are typically defined as variants of Group Steiner Trees in a data graph, where each result contains all query keywords, such as [4, 24, 16, 27, 17, 14, 20]. Graph-based query results, such as [26, 22, 12], are defined as sub-graphs containing all query keywords. To make results specific, the majority of the work [4, 24, 16, 27, 17, 14, 20, 23, 33, 5, 19] requires each result to be minimal, meaning that removing any node makes the result invalid. There is also work on processing keyword queries on relational databases [4, 6, 21, 48]. Most of this work starts by identifying the relations whose data instances have matches to the query keywords, and then generates SQL queries using the schema such that those relations can be joined and collectively contain all query keywords. In contrast, [6] focuses on applications where no data instances can be accessed a priori, and exploits inter-dependencies among query keywords (such as their positions in the query) and external knowledge for query generation.

Common factors in IR ranking are adopted for ranking result trees or graphs, such as term frequency, inverse document frequency, proximity of keyword matches, and PageRank. The focus of the existing work is on efficiency. For instance, [24] uses bidirectional graph exploration, Ding et al. [16] use dynamic programming, STAR [27] achieves pseudo-polynomial time complexity, Golenberg et al. [17] generate results with polynomial delay by enumerating results by tree height, and Dalvi et al. [14] reduce I/O cost for keyword search on external-memory graphs. Most existing work adopts the ‘AND’ semantics (conjunctive queries); some also support the ‘OR’ semantics [21].

Existing work focuses on the efficiency of generating results or top-k results, as also noted in a recent study [11]. In contrast, in this work we focus on understanding the user search target and defining query results that are atomic and intact with respect to the inferred search target. The ranking function that we propose is also inspired by commonly used IR ranking, but we define it using search target instances and their modifiers.

Search Intention Inference. It has long been recognized that understanding a user's search target is important in IR research, and much work has attempted to solve the problem [43]. Existing work on automatically inferring user search targets relies on query logs and clickthrough streams, and considers documents as data. Our work addresses the


problem of inferring user search targets on structured data represented as graphs, and the case when no query log or clickthrough data is available.

There are a few works that handle the same problem setting as ours. A demo paper [34] proposes the idea of search targets and the properties of atomicity and intactness based on the concept of return nodes developed in [35, 36]. However, they define the concepts at the schema level and cannot handle heterogeneous data well, whereas we use a probabilistic framework with information gain to infer search targets. Moreover, they only work for tree-structured data and cannot handle graphs. Ranking is not considered in these works. Other work that considers user search targets when searching graph data either requires users to select the interpretation of keyword queries at run time [48] or requires users to write down search targets following a special syntax [9, 10], which poses additional workload on users.

Due to the challenges of processing keyword queries, there are studies on exploratory search that use an interactive session to find the user intent [38, 15]. The techniques proposed in this paper are orthogonal to exploratory search and can be integrated into those systems. For instance, an exploratory search system may output the results generated by TAR and cluster them by inferred search intentions. Based on user feedback, the system can then perform filtering and aggregation to further refine the results.

8. Conclusions

In this paper we propose the concept of target-aware query results driven by inferred user search intention. A user query has a search target, which can be an entity or a relationship. We develop techniques to infer search targets by analyzing return nodes, relatedness relationships, and the information gain between keyword matches and entity data vertices in the data graph. We then propose that an ideal result should be atomic, containing exactly one instance of the search target, so that result ranking is aligned with search-target-instance ranking. An ideal result should also be intact, containing all query-related evidence of the search target, so that the downstream ranking function has comprehensive signals for fair ranking. We then develop techniques to efficiently generate target-aware results. Experimental evaluation shows the effectiveness and efficiency of our approach. The focus of this paper is to define and generate semantic query results based on the concept of search target. To further improve search quality, we will investigate comprehensive ranking functions and incorporate existing query segmentation techniques and ontology information. Furthermore, we will study how to efficiently handle large-scale data. For instance, we can build indexes for large graphs by extending the existing work on multi-layered graph indexes [20]. Another direction is to develop algorithms for top-k query processing.

9. Acknowledgements

This work is partially supported by NSF CAREER Award IIS-0845647, Google Cloud Service, and the Leir Charitable Foundations.


References

[1] INEX dataset. http://inex.mmci.uni-saarland.de/protected/dc/2010-datacentric-documents-beautified.tar.gz.

[2] Wiki of “Titan”. https://en.wikipedia.org/wiki/Futility,_or_the_Wreck_of_the_Titan#Similarities_to_the_Titanic.

[3] CIDR 2009, Fourth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2009, Online Proceedings. www.cidrdb.org, 2009.

[4] B. Aditya and G. Bhalotia. BANKS: Browsing and Keyword Searching in Relational Databases. In VLDB, 2002.

[5] Z. Bao, T. W. Ling, B. Chen, and J. Lu. Effective XML Keyword Search with Relevance Oriented Ranking. In ICDE, 2009.

[6] S. Bergamaschi, E. Domnori, F. Guerra, R. T. Lado, and Y. Velegrakis. Keyword search over relational databases: a metadata approach. In T. K. Sellis, R. J. Miller, A. Kementsietsidis, and Y. Velegrakis, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011, pages 565–576. ACM, 2011.

[7] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti, and S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. In ICDE, 2002.

[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1-7):107–117, 1998.

[9] T. Cheng and K. C.-C. Chang. Entity Search Engine: Towards Agile Best-Effort Information Integration over the Web. In CIDR, pages 108–113.

[10] T. Cheng, X. Yan, and K. C.-C. Chang. EntityRank: Searching Entities Directly and Holistically. In VLDB, pages 387–398.

[11] J. Coffman and A. C. Weaver. A Framework for Evaluating Database Keyword Search Strategies. In CIKM, 2010.

[12] J. Coffman and A. C. Weaver. Structured Data Retrieval using Cover Density Ranking. In Manolescu et al. [37], page 1.

[13] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSEarch: A Semantic Search Engine for XML. In VLDB, 2003.

[14] B. B. Dalvi, M. Kshirsagar, and S. Sudarshan. Keyword Search on External Memory Data Graphs. Proc. VLDB Endow., (1), Aug. 2008.

[15] E. Demidova, X. Zhou, and W. Nejdl. IQP: Incremental query construction, a probabilistic approach. In F. Li, M. M. Moro, S. Ghandeharizadeh, J. R. Haritsa, G. Weikum, M. J. Carey, F. Casati, E. Y. Chang, I. Manolescu, S. Mehrotra, U. Dayal, and V. J. Tsotras, editors, Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA, pages 349–352. IEEE Computer Society, 2010.

[16] B. Ding, J. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding Top-k Min-Cost Connected Trees in Databases. In ICDE, 2007.

[17] K. Golenberg, B. Kimelfeld, and Y. Sagiv. Keyword Proximity Search in Complex Data Graphs. In SIGMOD, 2008.

[18] J. Guo, G. Xu, X. Cheng, and H. Li. Named entity recognition in query. In J. Allan, J. A. Aslam, M. Sanderson, C. Zhai, and J. Zobel, editors, Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23, 2009, pages 267–274. ACM, 2009.

[19] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked Keyword Search over XML Documents. In SIGMOD, 2003.

[20] H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: Ranked Keyword Searches on Graphs. In SIGMOD, 2007.

[21] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-Style Keyword Search over Relational Databases. In VLDB, pages 850–861, 2003.

[22] V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword Proximity Search on XML Graphs. In ICDE, 2003.

[23] J. Zhou, Z. Bao, and W. Wang. Fast SLCA and ELCA Computation for XML Keyword Queries Based on Set Intersection. In ICDE, 2012.

[24] V. Kacholia and S. Pandit. Bidirectional Expansion for Keyword Search on Graph Databases. In VLDB, 2005.

[25] J. Kamps, J. Pehcevski, G. Kazai, M. Lalmas, and S. Robertson. INEX 2007 evaluation measures. In Focused Access to XML Documents, 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2007, Dagstuhl Castle, Germany, December 17-19, 2007. Selected Papers, pages 24–33, 2007.

[26] M. Kargar and A. An. Keyword Search in Graphs: Finding r-Cliques. Proc. VLDB Endow., 2011.

[27] G. Kasneci, M. Ramanath, M. Sozio, F. M. Suchanek, and G. Weikum. STAR: Steiner-Tree Approximation in Relationship Graphs. In ICDE, pages 868–879.

[28] R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting local similarity for indexing paths in graph-structured data. In ICDE, 2002.

[29] L. Kong, R. Gilleron, and A. Lemay. Retrieving Meaningful Relaxed Tightest Fragments for XML Keyword Search. In EDBT, 2009.

[30] G. Li and B. C. Ooi. EASE: An Effective 3-in-1 Keyword Search Method for Unstructured, Semi-Structured and Structured Data. In SIGMOD, 2008.

[31] J. Li, C. Liu, R. Zhou, and W. Wang. Suggestion of Promising Result Types for XML Keyword Search. In Manolescu et al. [37], pages 561–572.

[32] Y. Li, C. Yu, and H. V. Jagadish. Schema-Free XQuery. In VLDB, pages 72–83, 2004.

[33] R.-R. Lin, Y.-H. Chang, and K.-M. Chao. Improving the Performance of Identifying Contributors for XML Keyword Search. SIGMOD Record, 40(1):5–10, 2011.

[34] Z. Liu, Y. Cai, and Y. Chen. TargetSearch: A Ranking Friendly XML Keyword Search Engine. In ICDE, pages 1101–1104, 2010.

[35] Z. Liu and Y. Chen. Identifying Meaningful Return Information for XML Keyword Search. In SIGMOD, pages 329–340, 2007.

[36] Z. Liu and Y. Chen. Return Specification Inference and Result Clustering for Keyword Search on XML. ACM Trans. Database Syst., 35(2), 2010.

[37] I. Manolescu, S. Spaccapietra, J. Teubner, M. Kitsuregawa, A. Leger, F. Naumann, A. Ailamaki, and F. Ozcan, editors. EDBT 2010, 13th International Conference on Extending Database Technology, Lausanne, Switzerland, March 22-26, 2010, Proceedings, volume 426 of ACM International Conference Proceeding Series. ACM, 2010.

[38] Y. Mass, M. Ramanath, Y. Sagiv, and G. Weikum. IQ: the case for iterative querying for knowledge. In CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 9-12, 2011, Online Proceedings, pages 38–44. www.cidrdb.org, 2011.

[39] A. Nandi and H. V. Jagadish. Qunits: queried units in database search. In CIDR [3].

[40] Y. Pan and Y. Wu. ROU: Advanced Keyword Search on Graphs. In CIKM, 2013.

[41] K. Q. Pu and X. Yu. Keyword Query Cleaning. Proc. VLDB Endow., 1(1):909–920, Aug. 2008.

[42] S. E. Robertson. Evaluation in information retrieval. In Lectures on Information Retrieval, Third European Summer-School, ESSIR 2000, Varenna, Italy, September 11-15, 2000, Revised Lectures, pages 81–92, 2000.

[43] E. Sadikov, J. Madhavan, L. Wang, and A. Halevy. Clustering Query Refinements by User Intent. In WWW, 2010.

[44] R. Tarjan. Depth-First Search and Linear Graph Algorithms. In Switching and Automata Theory, 1971.

[45] M. Yang, B. Ding, S. Chaudhuri, and K. Chakrabarti. Finding Patterns in a Knowledge Base using Keywords to Compose Table Answers. In VLDB, 2015.

[46] X. Yin and S. Shah. Building taxonomy of web search intents for name entity queries. In M. Rappa, P. Jones, J. Freire, and S. Chakrabarti, editors, Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, pages 1001–1010. ACM, 2010.

[47] X. Yu and H. Shi. Query Segmentation Using Conditional Random Fields. In KEYS, pages 21–26.

[48] Z. Zeng, Z. Bao, T. W. Ling, and M.-L. Lee. iSearch: An Interpretation Based Framework for Keyword Search in Relational Databases. In KEYS, 2012.


[Figure omitted: the movie knowledge graph of the running example, with movie, actor, and company entities such as “Old School”, “S.W.A.T”, “Will Ferrell”, and “DreamWorks”, together with their attribute nodes and relationship edges.]

Figure 1: Movie Knowledge Graph

Figure 2: Minimal Sub-tree or Sub-graph Results of Q1


Figure 3: Sub-tree or Sub-graph Results of Q1

Figure 4: Search-intention-aware Results of Q1

Figure 5: Target-Aware Result of Q5

Figure 6: Algorithm 2 on Example 4.1


[Figure omitted: bar charts of Top-K Precision (0–1) for TAR, EASE, and BANKS over queries Q1–Q10, with per-system averages, on (a) DBLP, (b) IMDB, and (c) DBPedia.]

Figure 7: Precision of Query Results

(a) Result 1 (b) Result 2

Figure 8: TAR Results for Q3 of DBPedia


[Figure omitted: running time of TAR, BANKS, and EASE over queries Q1–Q10 on (a) DBLP, (b) IMDB, and (c) DBPedia.]

Figure 9: Running Time of Various Queries

[Figure omitted: ln(speedup) of TAR relative to BANKS and EASE for top-k processing with k = 3, 5, 10, 15, 20 on (a) DBLP, (b) IMDB, and (c) DBPedia.]

Figure 10: Speedup Ratio of Various Top-k


[Figure omitted: a result graph containing the movies “One Life to Live (1968)” and “Saturday Night Live: The Best of Tom Hanks (2004) (TV)”, connected by a movie link, with persons named “Ryan, Mary (IV)” and “Ryan, Amber” as actors and “Marsini, Ryan” as editor.]

Figure 11: One Result for Q06 “Tom Hanks, Ryan”
