Wi presentation

Keyword-driven SPARQL Query Generation

Leveraging Background Knowledge

Authors: Saeedeh Shekarpour, Sören Auer, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, Sebastian Hellmann, Claus Stadler

AKSW group

Universität Leipzig

WI-IAT conference

Outline

• Motivation

• Entity recognition phase

• SPARQL query generation phase

• Evaluation

• Conclusion and future work

AKSW group - Universität Leipzig, 24 August 2011

Querying web of documents


Text retrieval


Web of Data


Motivations

Difficulties of SPARQL

• Knowledge about the underlying ontology structure.

• Proficiency in formulating formal queries.

Keyword paradigm

• Successful experience of keyword-based search in document retrieval

• Satisfactory research results about the usability of this paradigm


Bird's-eye view of the envisioned search approach


Overview of the proposed method


Outline

• Motivation

• Entity recognition phase

• SPARQL query generation phase

• Evaluation

• Conclusion and future work


Mapping keywords to IRIs

• The goal is recognition of entities.

• Mapping is based on string similarity.

• This similarity is applied on all types of entities (i.e., classes, properties and instances).

• As a result, for each keyword, we retrieve a list of IRI candidates called anchor points.
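The mapping step can be sketched as follows. This is only an illustration, not the authors' implementation: the label index, the 0.7 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions, since the slides do not name the string similarity function.

```python
from difflib import SequenceMatcher

# Toy label index: IRI -> (label, entity type). Hypothetical data for illustration.
LABELS = {
    "http://dbpedia.org/ontology/Island": ("island", "class"),
    "http://dbpedia.org/resource/Germany": ("Germany", "instance"),
    "http://dbpedia.org/resource/Germanium": ("Germanium", "instance"),
    "http://dbpedia.org/ontology/country": ("country", "property"),
}

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in measure, an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def anchor_points(keyword, threshold=0.7):
    """Return (IRI, similarity) pairs whose label is similar enough to the keyword.

    Classes, properties and instances are all searched in the same way.
    """
    scored = [(iri, similarity(label, keyword)) for iri, (label, _type) in LABELS.items()]
    return [(iri, score) for iri, score in scored if score >= threshold]

for iri, score in anchor_points("Germany"):
    print(iri, round(score, 2))
```

A real system would query an index over the knowledge base labels instead of scanning a dictionary.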


Ranking and Selecting Anchor Points

• Ranking is based on the specificity degree.

• The specificity degree is defined in terms of string similarity and connectivity degree.

• The string similarity score measures the similarity of the label of $u \in AP_{K_i}$ to the keyword $K_i$.

• The connectivity degree $CD(u)$ of each $u \in AP_{K_i}$ is computed by counting how often $u$ occurs in the triples of the knowledge base.
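As a rough illustration of the connectivity degree, the sketch below counts term occurrences over a small in-memory list of triples. The triples and the function name `connectivity_degrees` are hypothetical; counting every position (subject, predicate, object) is one reading of "how often u occurs in the triples".

```python
from collections import Counter

# Toy knowledge base of (subject, predicate, object) triples; hypothetical data.
TRIPLES = [
    ("dbr:Sylt", "rdf:type", "dbo:Island"),
    ("dbr:Sylt", "dbp:country", "dbr:Germany"),
    ("dbr:Vilm", "rdf:type", "dbo:Island"),
    ("dbr:Vilm", "dbp:country", "dbr:Germany"),
    ("dbr:Mainau", "rdf:type", "dbo:Island"),
]

def connectivity_degrees(triples):
    """Count the occurrences of each term in any position of any triple."""
    counts = Counter()
    for s, p, o in triples:
        counts.update((s, p, o))
    return counts

cd = connectivity_degrees(TRIPLES)
print(cd["dbo:Island"], cd["dbr:Germany"])
```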


Ranking and Selecting Anchor Points

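• The specificity degree combines the string similarity score $\sigma$ and the connectivity degree $CD$:

$S(u) = \sigma(u_{label}, K_i) \;\square\; \log(CD(u))$

where $\square$ stands for the operator joining the two terms, which is not legible in the source.

• The anchor points of each keyword are sorted by specificity degree.

• The top-$n$ IRIs are selected from each sorted anchor-point list.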
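The sorting and selection steps might look like the sketch below. The scores are made-up values, the function name `select_top_n` is hypothetical, and ranking in descending order (higher specificity first) is an assumption.

```python
def select_top_n(scored_candidates, n):
    """Sort (IRI, specificity) pairs by descending specificity and keep the first n IRIs."""
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return [iri for iri, _score in ranked[:n]]

# Hypothetical anchor points for the keyword "Germany" with made-up specificity scores.
candidates = [
    ("dbr:Germanium", 0.41),
    ("dbr:Germany", 0.93),
    ("dbr:Germany_national_football_team", 0.58),
]

print(select_top_n(candidates, n=2))
```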


Outline

• Motivation

• Entity recognition phase

• SPARQL query generation phase

• Evaluation

• Conclusion and future work


Graph pattern template

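• H is a set of placeholders and V is a set of variable identifiers, disjoint from each other and from $C \cup I \cup P$ (the sets of classes, instances and properties).

• A graph pattern template is defined as:

$GPT = \{(s, p, o) \mid s \in (V \cup H) \wedge p \in (V \cup H) \wedge o \in (V \cup H)\}$

• After replacing the placeholders in a graph pattern template with the detected IRIs, we obtain a graph pattern whose triple patterns have the form $(V \cup I) \times (V \cup P) \times (V \cup C \cup I)$.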
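A minimal sketch of placeholder replacement, assuming templates are stored as lists of string triples in which variables start with `?` and placeholders are plain names. The function name `instantiate` and this representation are assumptions made for illustration; the template used is CI.P7 from the categorization table.

```python
def instantiate(template, bindings):
    """Replace placeholders (the keys of `bindings`) with IRIs.

    Variables such as "?s1" and the keyword "a" are not in `bindings`
    and are therefore left untouched.
    """
    return [tuple(bindings.get(term, term) for term in triple) for triple in template]

# Template CI.P7: (?s1, a, c)(?s1, ?p1, o1), with placeholders c and o1.
ci_p7 = [("?s1", "a", "c"), ("?s1", "?p1", "o1")]

pattern = instantiate(ci_p7, {"c": "dbo:Island", "o1": "dbr:Germany"})
print(pattern)
```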


Categorization of all graph pattern templates

Category                  Pattern   Pattern schema

Instance-Property (IP)    IP.P1     (s, p, ?o)
                          IP.P2     (?s, p, o)
                          IP.P3     (?s1, ?p1, o1)(?s1, p2, ?o2)
                          IP.P4     (?s1, ?p1, o1)(?o2, p2, ?s1)
                          IP.P5     (s1, ?p1, ?o1)(?s2, p2, ?o1)
                          IP.P6     (s1, ?p1, ?o1)(?o1, p2, ?o2)

Class-Instance (CI)       CI.P7     (?s1, a, c)(?s1, ?p1, o1)
                          CI.P8     (?s1, a, c)(s2, ?p1, ?s1)

Instance-Instance (II)    II.P9     (s, ?p, o)
                          II.P10    (s, ?p1, ?x)(?x, ?p2, o)
                          II.P11    (s1, ?p1, ?x)(s2, ?p2, ?x)
                          II.P12    (?s, ?p1, o1)(?s, ?p2, o2)

Class-Property (CP)       CP.P13    (?s, a, c)(?s, p, ?o)
                          CP.P14    (?s, a, c)(?x, p, ?s)

Property-Property (PP)    PP.P15    (?s, p1, ?x)(?x, p2, ?o)
                          PP.P16    (?s1, p1, ?o)(?s2, p2, ?o)
                          PP.P17    (?s, p1, ?o1)(?s, p2, ?o2)


Appropriate identified graph pattern templates

Category                  Pattern   Pattern schema

Instance-Property (IP)    IP.P1     (s, p, ?o)
                          IP.P4     (?s1, ?p1, o1)(?o2, p2, ?s1)
                          IP.P6     (s1, ?p1, ?o1)(?o1, p2, ?o2)

Class-Instance (CI)       CI.P7     (?s1, a, c)(?s1, ?p1, o1)
                          CI.P8     (?s1, a, c)(s2, ?p1, ?s1)

Instance-Instance (II)    II.P9     (s, ?p, o)
                          II.P10    (s, ?p1, ?x)(?x, ?p2, o)

Class-Property (CP)       CP.P14    (?s, a, c)(?x, p, ?s)

Property-Property (PP)    -         -
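One way to use this table when generating queries is a lookup from the types of the two identified IRIs to the applicable template names. The sketch below copies the template names from the table; the dictionary layout and the function name `templates_for` are assumptions, not the authors' data structure.

```python
# Template names per category, copied from the table of appropriate templates.
# Keys are alphabetically sorted pairs of entity types.
APPROPRIATE_TEMPLATES = {
    ("instance", "property"): ["IP.P1", "IP.P4", "IP.P6"],
    ("class", "instance"): ["CI.P7", "CI.P8"],
    ("instance", "instance"): ["II.P9", "II.P10"],
    ("class", "property"): ["CP.P14"],
    ("property", "property"): [],
}

def templates_for(type_a, type_b):
    """Return the template names for a pair of entity types, in either order."""
    key = tuple(sorted((type_a, type_b)))
    return APPROPRIATE_TEMPLATES.get(key, [])

print(templates_for("instance", "class"))
```

Sorting the pair makes the lookup independent of the order in which the keywords were entered.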


Query generation algorithm


Example

Consider two keywords: "Germany" and "island".

User intention: the list of Germany's islands.

After applying the mapping and ranking functions to the user keywords, we obtain two identified IRIs:

1. http://dbpedia.org/ontology/Island, of type class

2. http://dbpedia.org/resource/Germany, of type instance

The possible graph pattern templates for these two IRIs are:

1. (?island, a, dbo:Island), (?island, ?p, dbr:Germany)

2. (?island, a, dbo:Island), (dbr:Germany, ?p, ?island)


Example

SPARQL queries are:

SELECT * WHERE { ?island a dbo:Island . ?island ?p dbr:Germany . }

SELECT * WHERE { ?island a dbo:Island . dbr:Germany ?p ?island . }

Some desired answers to be retrieved are:

dbr:Rettbergsaue a dbo:Island .

dbr:Rettbergsaue dbp:country dbr:Germany .

dbr:Sylt a dbo:Island .

dbr:Sylt dbp:country dbr:Germany .

dbr:Vilm a dbo:Island .

dbr:Vilm dbp:country dbr:Germany .

dbr:Mainau a dbo:Island .

dbr:Mainau dbp:country dbr:Germany .
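The step from instantiated templates to query strings can be sketched as below, using the two Class-Instance templates (CI.P7 and CI.P8) from this example. The helper names `to_sparql` and `class_instance_queries` are hypothetical, and prefix declarations are omitted as on the slide.

```python
def to_sparql(triples):
    """Serialize a list of triple patterns as a SELECT * query."""
    body = " ".join(f"{s} {p} {o} ." for s, p, o in triples)
    return f"SELECT * WHERE {{ {body} }}"

def class_instance_queries(cls, inst):
    """Build one query per Class-Instance template for a class IRI and an instance IRI."""
    p7 = [("?s1", "a", cls), ("?s1", "?p1", inst)]  # instance in object position
    p8 = [("?s1", "a", cls), (inst, "?p1", "?s1")]  # instance in subject position
    return [to_sparql(p7), to_sparql(p8)]

for query in class_instance_queries("dbo:Island", "dbr:Germany"):
    print(query)
```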


Online interface

http://lod-query.aksw.org/

Outline

• Motivation

• Entity recognition phase

• SPARQL query generation phase

• Evaluation

• Conclusion and future work


Accuracy metrics

• The user’s intention in keyword-based search is ambiguous.

• Judging the correctness of the retrieved answers is a challenging task.

• Example: given the keywords "France" and "President", the following RDF graphs (i.e., answers) are presented to the user:

1. Nicolas_Sarkozy nationality France .

Nicolas_Sarkozy a President .

2. Felix_Faure birthplace France .

Felix_Faure a President .

3. Yasser_Arafat deathplace France .

Yasser_Arafat a President .

...


Accuracy metrics

• Besides distinguishing between answers related to different interpretations, we differentiate between pure answers (just containing preferred terms) and those which contain some impurity.

• In fact, the correctness of an answer is not a bivalent value.

• We investigate two questions:

1) For how many of the keyword queries do the templates yield answers at all with respect to the original intention?

2) If answers are returned, how correct are they?


Accuracy metrics

• Correctness rate (CR). For an individual answer, we define the correctness rate as the fraction of correct (preferred) RDF terms occurring in it.

• Average CR (ACR). For the set of answers to a query q, we define the average correctness rate as the arithmetic mean of the CRs of its individual answers.

• Fuzzy precision (FP). This metric measures the overall correctness of the answers corresponding to a set of keyword queries.
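The three metrics can be sketched in a few lines. CR and ACR follow the definitions above; computing FP as the mean of the per-query ACR values is an assumption, since the slide text does not state the formula. The answers and the set of preferred terms are hypothetical and only loosely based on the France/President example.

```python
def correctness_rate(answer_terms, preferred_terms):
    """CR: fraction of the RDF terms in one answer that are preferred (correct)."""
    return sum(term in preferred_terms for term in answer_terms) / len(answer_terms)

def average_cr(answers, preferred_terms):
    """ACR: arithmetic mean of the CRs of the answers to one query."""
    return sum(correctness_rate(a, preferred_terms) for a in answers) / len(answers)

def fuzzy_precision(acr_per_query):
    """FP, assumed here to be the mean ACR over a set of keyword queries."""
    return sum(acr_per_query) / len(acr_per_query)

# Hypothetical judgment: every term except "deathplace" counts as preferred.
preferred = {"Nicolas_Sarkozy", "Yasser_Arafat", "nationality", "France", "President"}
answers = [
    ["Nicolas_Sarkozy", "nationality", "France", "President"],
    ["Yasser_Arafat", "deathplace", "France", "President"],
]

acr = average_cr(answers, preferred)
print(acr)
```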


Accuracy metrics

• We also measured recall as the fraction of keyword queries for which answers were found:

Recall = (number of keyword queries with at least one answer) / (total number of keyword queries)


Accuracy of each categorized graph pattern


Categorization based on the kind of information sought:

1. Finding special characteristics of an instance - IP.P1, IP.P4, IP.P6

2. Finding similar instances - CI.P7, CI.P8, CP.P14

3. Finding associations between instances - II.P9, II.P10


Samples of keywords and results


Accuracy results for different categories

Category                          Recall   Fuzzy precision   F-score

Similar instances                 0.700    0.735             0.717

Characteristics of an instance    0.625    0.700             0.660

Associations between instances    0.500    0.710             0.580

General accuracy                  0.625    0.724             0.670
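The slides do not say how the F-score column is computed. Assuming it is the usual harmonic mean of recall and fuzzy precision, the first row can be checked with the short sketch below; the function name `f_score` is hypothetical.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (assumed definition of the F-score)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# "Similar instances" row: recall 0.700, fuzzy precision 0.735.
print(round(f_score(0.700, 0.735), 3))
```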


Outline

• Motivation

• Entity recognition phase

• SPARQL query generation phase

• Evaluation

• Conclusion and future work


Conclusion and future work

• Analysis of graph patterns for limiting search space.

• We did not separate ontology level and knowledge base level for generating graph patterns.

We aim to:

1. Allow a larger number of keywords.

2. Make more extensive use of linguistic features and techniques.

3. Enable users to refine obtained queries and to add additional constraints.

4. Apply this work on large-scale datasets of Data Web.


Thank you for your attention.

Thanks to my colleagues from the AKSW research group.

Any questions?

