Runtime Query Rewriting Techniques

Project No: FP7-318338

Project Acronym: Optique

Project Title: Scalable End-user Access to Big Data

Instrument: Integrated Project

Scheme: Information & Communication Technologies

Deliverable D6.3
Runtime Query Rewriting Techniques

Due date of deliverable: (T0+36)

Actual submission date: November 16, 2015

Start date of the project: 1st November 2012

Duration: 48 months

Lead contractor for this deliverable: FUB

Dissemination level: PU – Public

Final version


Executive Summary: Runtime Query Rewriting Techniques

This document summarises deliverable D6.3 of project FP7-318338 (Optique), an Integrated Project supported by the 7th Framework Programme of the EC. Full information on this project, including the contents of this deliverable, is available online at http://www.optique-project.eu/.

More specifically, the present deliverable describes the activities carried out and the results obtained in Task 6.2 of Optique. This task is concerned with the techniques for rewriting, at runtime, end-user queries into queries over the data sources, by relying on two separate but closely interacting phases: the first one rewrites the end-user query taking into account the ontology axioms, while the second one rewrites the resulting query into a query over the data sources by exploiting the mapping layer.

From the foundational point of view, we have considered the requirements coming from the Optique use cases for increased expressive power in the ontology and query languages. We have developed novel rewriting techniques to deal with rules in the ontology, and with queries that can express a restricted form of recursion by means of (nested) regular paths. We have then tackled the need to integrate various data sources, and have addressed the fundamental problem of relating different identifiers for the same real-world entity in different data sources.

From the implementation point of view, we have developed several improvements regarding the integration of the Ontop system in the Optique platform, and the support for standard languages and libraries. Moreover, to better support the prototyping activities within the project, we have made various improvements to the Ontop interfaces.

List of Authors

Elena Botoeva (FUB)
Diego Calvanese (FUB)
Benjamin Cogrel (FUB)
Elem Güzel Kalaycı (FUB)
Sarah Komla-Ebri (FUB)
Davide Lanti (FUB)
Martin Rezk (FUB)
Guohui Xiao (FUB)
Alessandro Artale (FUB)
Enrico Franconi (FUB)
Werner Nutt (FUB)
Sergio Tessaris (FUB)


Contents

1 Introduction 4

2 Foundational Results on Runtime Query Rewriting 6
  2.1 Rules and Ontology Based Data Access 6
      2.1.1 Experiments 7
  2.2 Ontology-based Integration of Cross-linked Datasets 9
      2.2.1 Representation of DB Links 9
      2.2.2 Rewritability 10
      2.2.3 Optimization 10
      2.2.4 Experiments 11
  2.3 A ‘Historical Case’ of Ontology-Based Data Access 12
      2.3.1 Knowledge Representation and Data Management in EPNet 13
      2.3.2 User Interface 14
  2.4 Nested Regular Path Queries in Description Logics 15

3 Implementation Development 17
  3.1 Integration with the Optique Platform 17
      3.1.1 Support for the Mapping Language 17
      3.1.2 Support for SPARQL Functions 19
      3.1.3 Support for Datatypes in OBDA 23
      3.1.4 Support for multiple references to the same entity 27
      3.1.5 Support for Column-Oriented Databases 28
  3.2 Interfaces of the Ontop Query Rewriting System 29
      3.2.1 Protégé Plugin 29
      3.2.2 Sesame SPARQL Protocol HTTP service 29
      3.2.3 Ontop Java API 29
      3.2.4 Command-Line Interface 30
  3.3 Releases 30
      3.3.1 Ontop Bundles on SourceForge 30
      3.3.2 Released Versions 31

Bibliography 31

A Rules and Ontology Based Data Access (RR 2014 Paper) 34

B Ontology-based Integration of Cross-linked Datasets (ISWC 2015 Paper) 51

C A ‘Historical Case’ of Ontology-Based Data Access (DH 2015 Paper) 68

D Nested Regular Path Queries in Description Logics (KR 2014 Paper) 78


Chapter 1

Introduction

We start by recalling the main ideas behind runtime query rewriting in Ontology-Based Data Access, which is being developed in WP6. Ontology-Based Data Access (OBDA) is a paradigm for accessing data stored in legacy sources using Semantic Web technologies. In OBDA, users access the data through a conceptual layer, which provides a convenient query vocabulary abstracting from specific aspects related to the data sources. This conceptual layer is typically expressed as an RDF(S) or OWL ontology, and it is connected to the underlying relational databases using R2RML mappings. When the ontology is queried in SPARQL, the OBDA system exploits the mappings to retrieve elements from the data sources and constructs the answers expected by the user.

Different approaches for query processing in OBDA have been proposed. In the Optique project, we focus on the virtual approach, which avoids materializing triples retrieved through mappings and answers the SPARQL queries by rewriting them into SQL queries over the database. The rewriting consists of two separate but closely interacting phases: the first one rewrites the end-user query taking into account the axioms in the ontology, while the second one rewrites the resulting query into a query over the data sources by exploiting the mapping layer. The objective is, on the one hand, to produce queries over the data sources that are optimized towards an efficient execution, and, on the other hand, to keep the rewriting process as simple and as efficient as possible. For both rewriting steps, the developed techniques take into account data dependencies that hold in the system or that are derived from the mappings.
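To make the virtual approach concrete, the following minimal sketch simulates how a single SPARQL triple pattern is answered by unfolding it, via a mapping, into SQL over a legacy source. All names here (the wellbore table, the URI template, the unfolded SQL) are invented for illustration and are not taken from Ontop's actual schema or API:

```python
import sqlite3

# Toy relational source standing in for a legacy database
# (schema and data are illustrative, not from the project).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wellbore (id TEXT PRIMARY KEY, name TEXT);
INSERT INTO wellbore VALUES ('w1', 'A'), ('w2', 'B');
""")

# An R2RML-style mapping in spirit: each row of `wellbore` yields an
# instance of :Wellbore whose URI is built from the primary key.
uri_template = "http://example.org/wellbore/{id}"

# Virtual approach: the triple pattern  ?v rdf:type :Wellbore  is never
# answered over materialized triples; the mapping unfolds it into SQL
# over the source, and URIs are built from the query result.
unfolded_sql = "SELECT id FROM wellbore"
answers = [uri_template.format(id=r[0]) for r in conn.execute(unfolded_sql)]
print(answers)
```

No triples are ever stored: the answers exist only as the result of the unfolded SQL query, which is the essence of the virtual approach.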

In Task 6.2, the aim is to develop techniques for rewriting, at runtime, end-user queries into queries over the data sources. In this deliverable, we describe the main contributions obtained during Year 3 of the project in the context of Task 6.2 towards the above objectives, which we overview in the remaining part of this chapter.

Overview of Contributions

The contributions we have provided are of two main kinds:

1. foundational results related to the runtime query rewriting techniques, which improve the state-of-the-art in the area.

2. implementation development of the Ontop system.

As for Item 1, we have obtained foundational results on runtime query rewriting in several settings, on which we report in Chapter 2. These results extend the expressivity of the OBDA specifications as required by the Optique use cases.

• In Section 2.1, we present the results on the integration of rules into the OBDA setting. The rules can be used to express conjunctions and recursion, which are difficult (or even impossible) to express in OWL 2 QL ontologies. We show that, with some natural restrictions, we can develop rewriting techniques for query answering in the presence of rules.


• In Section 2.2, we study the problem of ontology-based data integration (OBDI) of cross-linked datasets. In particular, we address the setting where the same entities might have different identifiers in different data sources, and these identifiers are related to each other by means of the standard OWL construct owl:sameAs. We provide a novel rewriting technique that extends queries with suitable join conditions, so that owl:sameAs can be treated as an ordinary object property. In addition, we develop optimization techniques, which are essential to allow the approach to scale with the number of owl:sameAs statements and of data sources.

• In Section 2.3, we describe a use case of OBDI in a historical domain, carried out in an ongoing collaboration with the EU Project EPNet – Production and Distribution of Food during the Roman Empire: Economic and Political Dynamics.

• In Section 2.4, we consider expressive variants of query languages based on nested two-way regular path queries, and we study the problem of answering such queries over description logic (DL) ontologies. For lightweight DLs, and more generally Horn DLs, we present novel query rewriting techniques, while for more expressive DLs we show how to reduce the problem to answering queries without nesting. In all cases we provide a characterization of the computational complexity of the problem.

As for Item 2, we report on the implementation activity for Ontop during Year 3 of the Optique project.

• In Section 3.1, we describe the changes made for (i) improving the integration of Ontop in the Optique platform and (ii) extending the support for standard languages and libraries. More precisely:

– We have extended the support of the R2RML mapping language by allowing the use of the CONCAT and REPLACE SQL operators in the source part of mappings, and accepting literal templates in the target query (Section 3.1.1).

– We have implemented most of the standard SPARQL functions (Section 3.1.2).

– We have improved the support for datatypes (Section 3.1.3).

– We have implemented the support for multiple references to the same entity (Section 3.1.4).

– We have added two column-oriented relational databases (MonetDB and SAP HANA) to our set of supported databases (Section 3.1.5).

• In Section 3.2, we present the improvements made to the Ontop interfaces, namely the Protégé plugin, the Sesame SPARQL Protocol service, the Java API, and the command-line interface. These interfaces can be used standalone, without the Optique platform, and are useful for prototyping.

• Finally, in Section 3.3, we describe the publication of Ontop bundles on SourceForge, and report on the latest stable releases.


Chapter 2

Foundational Results on Runtime Query Rewriting

In this chapter, we describe some foundational results on runtime query rewriting in OBDA that have been obtained in Optique in the context of WP6. These results have been published in prestigious international venues, and the corresponding publications are included as appendices in this report. Below, we provide a brief overview of the obtained results in terms of short summaries of the publications, and refer to the publications in the appendix for a comprehensive treatment of the presented results.

• Guohui Xiao, Martin Rezk, Mariano Rodriguez-Muro, and Diego Calvanese: Rules and Ontology Based Data Access. In Proc. of the 8th International Conference on Web Reasoning and Rule Systems (RR), 2014.

The main results of this publication (which is included as Appendix A) are summarized in Section 2.1.

• Diego Calvanese, Martin Giese, Dag Hovland, and Martin Rezk: Ontology-based Integration of Cross-linked Datasets. In Proc. of the 14th International Semantic Web Conference (ISWC), 2015.

The main results of this publication (which is included as Appendix B) are summarized in Section 2.2.

• Diego Calvanese, Alessandro Mosca, Jose Remesal, Martin Rezk, and Guillem Rull: A ‘Historical Case’ of Ontology-Based Data Access. In Proc. of the Digital Heritage International Congress (DH), 2015.

The main results of this publication (which is included as Appendix C) are summarized in Section 2.3.

• Meghyn Bienvenu, Diego Calvanese, Magdalena Ortiz, and Mantas Simkus: Nested Regular Path Queries in Description Logics. In Proc. of the 14th Int. Conference on the Principles of Knowledge Representation and Reasoning (KR), 2014.

The main results of this publication (which is included as Appendix D) are summarized in Section 2.4.

2.1 Rules and Ontology Based Data Access

In OBDA, an ontology defines a high-level global vocabulary in terms of which user queries are formulated, and such vocabulary is mapped to (typically relational) databases. The typical ontology language of choice in OBDA is OWL 2 QL. Extending this paradigm with rules, e.g., expressed in the SWRL or RIF standards, boosts the expressivity of the model and the reasoning ability.

The following two examples, which are inspired by the Statoil use cases, motivate the integration of rules and OBDA.


Example 2.1.1 The following rule in Datalog syntax cannot be expressed in OWL 2 QL, since it requires role chains, which are beyond OWL 2 QL.

:intervalPermeability(?i, ?perm) ← :extractedFrom(?c, ?i), :hasCoreSample(?c, ?s),
                                   :hasPermeabilityMeasurement(?s, ?p), :valueInStandardUnit(?p, ?perm).

The next rules model that the role :isAncestorUnitOf is the transitive closure of :isParentUnitOf. This requires linear recursion, which is also not available in OWL 2 QL. While SPARQL 1.1 provides path expressions that can be used to express the transitive closure of a property, using them may be cumbersome and error-prone, especially in the presence of path constraints.

:isAncestorUnitOf(?x, ?y)← :isParentUnitOf(?x, ?y).

:isAncestorUnitOf(?x, ?y)← :isParentUnitOf(?x, ?z), :isAncestorUnitOf(?z, ?y).

Computational complexity results show that, unless we limit the form of the allowed rules, on-the-fly query answering by rewriting into SQL Select-Project-Join (SPJ) queries is not possible [8, 2]. However, as target language for query rewriting, typically only a fragment of the expressive power of SQL99 has been considered, namely unions of SPJ SQL queries. We propose here to go beyond this expressive power, and we advocate the use of SQL99's Common Table Expressions (CTEs) to obtain a form of linear recursion in the rewriting target language. In this way, we can deal with recursive rules at the level of the ontology, and can reuse existing query rewriting optimizations developed for OBDA to provide efficient query rewriting into SQL99. The languages that we target are those that are used most extensively in the context of OBDA for Semantic Web applications, i.e., RIF and SWRL as rule languages, SPARQL 1.0 as query language, and R2RML as the mapping language from relational databases to RDF.
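The linear recursion in the :isAncestorUnitOf rules above maps directly onto a SQL99 recursive CTE. The sketch below shows the encoding; the table and data are invented for illustration (a real rewriting targets the mapped source schema), and the query is executed here on SQLite, which supports WITH RECURSIVE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent_unit (parent TEXT, child TEXT);
INSERT INTO parent_unit VALUES ('u1','u2'), ('u2','u3'), ('u3','u4');
""")

# Linear-recursive CTE computing the transitive closure of parent_unit,
# mirroring the two Datalog rules for :isAncestorUnitOf.
rows = conn.execute("""
WITH RECURSIVE ancestor(x, y) AS (
    SELECT parent, child FROM parent_unit       -- base rule
    UNION
    SELECT p.parent, a.y                        -- linear recursive rule
    FROM parent_unit p JOIN ancestor a ON p.child = a.x
)
SELECT x, y FROM ancestor ORDER BY x, y
""").fetchall()
print(rows)  # all six ancestor pairs of the chain u1 -> u2 -> u3 -> u4
```

The recursive member references the CTE exactly once, which is precisely the "linear recursion" shape that the rewriting targets.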

In this paper, we have provided translations from SWRL, R2RML, and SPARQL into relational algebra extended with a fixed-point operator that can be expressed in SQL99 CTEs. We show how to extend existing OBDA optimisation techniques, which have been proven effective in the OWL 2 QL setting, to this new context. In particular, we show that so-called T-mappings for recursive programs exist, and how to construct them.

We also implemented the proposed technique in Ontop, making it, to the best of our knowledge, the first system to support all the following W3C recommendations: OWL 2 QL, R2RML, SPARQL, and SWRL. In Figure 2.1, we depict the new architecture that modifies and extends our previous OBDA approach, by replacing OWL 2 QL with SWRL. First, during loading time, we translate the SWRL rules and the R2RML mappings into a Datalog program. This set of rules is then optimized as described. This process is query independent and is performed only once when Ontop starts. Then the system translates the SPARQL query provided by the user into another Datalog program. None of these Datalog programs is meant to be executed. They are only a formal and succinct representation of the rules, the mappings, and the query, in a single language. Given these two Datalog programs, we unfold the query with respect to the rules and the mappings using SLD resolution. Once the unfolding is ready, we obtain a program whose vocabulary is contained in the vocabulary of the data source, and that can therefore be translated to SQL. The technique is able to deal with all aspects of the translation, including URI and RDF literal construction, RDF typing, and SQL optimization. However, the current implementation supports only a restricted form of queries involving recursion: SPARQL queries with recursion must consist of a single triple involving the recursive predicate.

2.1.1 Experiments

To evaluate the performance and scalability of Ontop with SWRL ontologies, we adapted the NPD benchmark. The NPD benchmark [13] is based on the Norwegian Petroleum Directorate Fact Pages (http://www.npd.no/en/), which


[Figure 2.1 (diagram) shows the processing pipeline: the SWRL rules and the mappings M are compiled into T-mappings and a Datalog program Π1; a SPARQL query q is translated into a Datalog program q′, unfolded against Π1, and finally translated into an SQL query q′′ evaluated over the data D in the DB.]

Figure 2.1: SWRL and Query processing in the Ontop system

Table 2.1: Evaluation of Ontop on NPD benchmark (time in seconds)

                      Load   q1    q2    q3    q4    q5    q6     q7    q8    q9    q10   q11    q12    r1
NPD        Ontop      16.6   0.1   0.09  0.03  0.2   0.02  1.7    0.1   0.07  5.6   0.1   1.4    2.8    0.25
           Stardog    -      2.06  0.65  0.29  1.26  0.20  0.34   1.54  0.70  0.06  0.07  0.11   0.15   -
NPD (x2)   Ontop      17.1   0.12  0.13  0.10  0.25  0.02  3.0    0.2   0.2   5.7   0.3   6.7    8.3    27.8
           Stardog    -      5.60  1.23  0.85  1.89  0.39  2.29   2.41  1.47  0.34  0.36  1.78   1.52   -
NPD (x10)  Ontop      16.7   0.2   0.3   0.17  0.67  0.05  18.08  0.74  0.35  6.91  0.55  162.3  455.4  237.6
           Stardog    -      8.89  1.43  1.17  2.04  0.51  4.12   5.84  5.30  0.42  0.72  3.03   3.86   -

contains information regarding the petroleum activities on the Norwegian continental shelf. We used PostgreSQL as the underlying relational database system. The hardware consisted of an HP ProLiant server with 24 Intel Xeon X5690 CPUs (144 cores @3.47GHz), 106GB of RAM, and a 1TB 15K RPM HD. The OS is Ubuntu 12.04 LTS.

The original benchmark comes with an OWL ontology (http://sws.ifi.uio.no/project/npd-v2/). In order to test our techniques, we translated a fragment of this ontology into SWRL rules by (i) converting the OWL axioms into rules whenever possible; and (ii) manually adding linear recursive rules. The resulting SWRL ontology contains 343 concepts, 142 object properties, 238 data properties, 1428 non-recursive SWRL rules, and 1 recursive rule. The R2RML file includes 1190 mappings. The NPD query set contains 12 queries obtained by interviewing users of the NPD data.

We compared Ontop with the only other system (to the best of our knowledge) offering SWRL reasoning over on-disk RDF/OWL storage: Stardog 2.1.3. Stardog (http://stardog.com/) is a commercial RDF database developed by Clark&Parsia that supports SPARQL 1.1 queries and OWL 2/SWRL reasoning. Since Stardog is a triple store, we needed to materialize the virtual RDF graph exposed by the mappings and the database using Ontop. In order to test the scalability w.r.t. the growth of the database, we used the data generator described in [13] and produced several databases, the largest being approximately 10 times bigger than the original NPD database. The materialization of NPD (x2) produced 8,485,491 RDF triples, and the materialization of NPD (x10) produced 60,803,757 RDF triples. The loading of the last set of triples into Stardog took around one hour.

The results of the evaluation are shown in Table 2.1. For queries q1 to q12, we only used the non-recursive rules and compared the performance with Stardog. For the second group (r1), we included recursive rules, which can only be handled by Ontop. The experiments show that the performance obtained with Ontop is comparable with that of Stardog, and for most queries Ontop is faster.


D1:              D2:                      D3:               D4:
id1  Name        id2  Name  Well         id3  AName        id4  LName
a1   'A'         b1   null  1            c3   'U1'         9    'Z1'
a2   'B'         b2   'C'   2            c4   'U2'         8    'Z2'
a3   'H'         b6   'B'   3            c5   'U6'         7    'Z3'

Figure 2.2: Wellbore datasets D1, D2, D3, and company dataset D4

2.2 Ontology-based Integration of Cross-linked Datasets

One of the main objectives in the Optique project is data integration, and we have tackled the problem of answering SPARQL queries over virtually integrated databases. We assume that the entity resolution problem has already been solved, and that explicit information is available about which records in the different databases refer to the same real-world entity.

To the best of our knowledge, there has been no attempt to extend the standard Ontology-Based Data Access (OBDA) setting to take into account these DB links for SPARQL query answering. In principle, links over database identifiers could be represented in terms of OWL owl:sameAs statements, which is the standard approach in semantic technologies for connecting entity identifiers; however, this approach brings a number of issues.

In our work we formally treat several fundamental questions in this context: how links over database identifiers can be represented in terms of owl:sameAs statements, how to recover rewritability of SPARQL into SQL (lost because of owl:sameAs statements), and how our solution can be made to scale up to large enterprise datasets such as those at Statoil.

2.2.1 Representation of DB Links

Traditional relational data integration techniques use extract, transform, load (ETL) processes to address the problem of integrating records from different DBs representing the same real-world object. These techniques usually choose a single representation of the entity, merge the information available in all data sources, and then answer queries on the merged data. However, this approach of physically merging the data is not possible in the Optique use cases, nor in many real-world scenarios where one has no complete control over the data sources, so that they cannot be modified, and where the data cannot be moved due to freshness, privacy, or legal issues.

An alternative that can be pursued in OBDA is to make use of mappings to virtually merge the data, by consistently generating only one URI per real-world entity. Unfortunately, also this approach is not viable in general: 1. it does not scale well for several datasets, since it requires a central authority for defining URI schemas, which may have to be revised along with all mappings whenever a new source is added; and 2. it is crucial for the efficiency of OBDA that URIs be generated from the primary keys of the data sources, which will typically differ from source to source.

The approach we propose here is based on the natural idea of representing the links between database records resulting from entity resolution in the form of linking tables, which are binary tables in dedicated data sources that simply maintain the information about pairs of records representing the same entity. These linking tables are created in a custom database without modifying the original datasets. The standard way to represent equality at the ontology level is through the OWL built-in predicate owl:sameAs. To generate owl:sameAs statements from the linking tables, we map the linking tables to the OWL predicate owl:sameAs using standard R2RML mappings. Having owl:sameAs statements in the ontology creates a number of problems for our query translation techniques, since owl:sameAs is not in the OWL 2 QL profile. Thus, allowing the unrestricted use of owl:sameAs results in the loss of first-order rewritability.

To illustrate all these issues we provide the following simplified example:


Example 2.2.1 Suppose we have three datasets (from now on D1, D2, D3) with wellbore information, and a dataset D4 with information about companies and licenses, as illustrated in Figure 2.2. The wellbores in D1, D2, D3 are linked, but companies in D4 are not linked with the other datasets. These four data sources are integrated virtually by topping them with an ontology. The ontology contains the concept Wellbore and the properties hasName, hasAlternativeName, and hasLicense.

The terms Wellbore and hasName are defined using D1 and D2. The property hasAlternativeName is defined using D3. The property hasLicense is defined over the isolated dataset D4. We assume that mappings for wellbores from Di use URI templates urii. In addition, we know that the wellbores are cross-linked between datasets as follows: wellbores a1, a2 in D1 are equal to b2, b1 in D2 and c3, c4 in D3, respectively. In addition, a3 is equal to c5. These links are represented at the ontology level by owl:sameAs statements of the form owl:sameAs(uri1(a1), uri2(b2)), owl:sameAs(uri2(b2), uri3(c3)), etc.

Consider now a user looking for all the wellbores and their names. According to the SPARQL entailment regime, the system should return all 12 combinations of equivalent ids and names ((uri1(a1), A), (uri2(b2), A), (uri3(c3), A), (uri1(a2), B), (uri2(b1), B), etc.), since all these tuples are entailed by the ontology and the data. Note that no wellbores from D4 are returned.

The first issue in the context of OBDA is how to translate the user query into a query over the databases. Recall that owl:sameAs is not included in OWL 2 QL, thus it is not handled by the current query translation and optimization techniques. If we solve the first issue by applying suitable constraints, we get into a second issue: how to minimize the negative impact on the query execution time when reasoning over cross-linked datasets. In the next sections, we tackle these issues in turn.

2.2.2 Rewritability

owl:sameAs prevents first-order rewritability because (just as equality) it is a transitive property. We present here an approach, based on partial materialization of inference, that in principle allows us to exploit a relational engine for query answering in the presence of owl:sameAs statements. This approach, however, is not feasible in practice, and we will show next how to develop it into a practical solution. Assume for simplicity that we have a materialized set AS of owl:sameAs facts. Our approach is based on the simple observation that we can expand the set AS into the set A∗S obtained from AS by closing it under reflexivity, symmetry, and transitivity. That is, given the statements owl:sameAs(uri1(a1), uri2(b2)) and owl:sameAs(uri2(b2), uri3(c3)) in AS, A∗S contains also owl:sameAs(uri1(a1), uri1(a1)), owl:sameAs(uri1(a1), uri3(c3)), owl:sameAs(uri2(b2), uri1(a1)), etc. Observe that we do not expand here also the triples representing the DB data. That is, if we have that P(uri1(a1)), we do not “generate” P(uri2(b2)). Instead, we rewrite the input SPARQL query to guarantee completeness of query answering. Then, given a SPARQL query Q over our OBDA system with a set AS of owl:sameAs facts, we generate a new SPARQL query Q′ over an OBDA system where we use A∗S instead of AS. This allows us to ignore the implicit equality axioms that are part of the owl:sameAs semantics. Clearly, owl:sameAs without its implicit semantics is just one more object property. Thus, we consider owl:sameAs as an object property (without built-in semantics) and recover rewritability.
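The expansion of AS into A∗S can be sketched with plain set operations; the two facts below follow the running example, with the URIs abbreviated to strings for readability:

```python
# owl:sameAs facts AS from the running example (notation as in the text).
AS = {("uri1(a1)", "uri2(b2)"), ("uri2(b2)", "uri3(c3)")}

def expand(pairs):
    """Close a set of pairs under symmetry, transitivity, and reflexivity."""
    closure = set(pairs)
    closure |= {(y, x) for (x, y) in closure}            # symmetry
    while True:                                          # transitivity (fixpoint)
        new = {(x, w) for (x, y) in closure
                      for (z, w) in closure if y == z}
        if new <= closure:
            break
        closure |= new
    closure |= {(x, x) for p in closure for x in p}      # reflexivity
    return closure

A_star = expand(AS)
print(len(A_star))  # 9: all ordered pairs over the class {a1, b2, c3}
```

The growth is already visible at this scale: two asserted facts expand to nine pairs, which is why materializing A∗S does not scale and motivates the optimizations of Section 2.2.3.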

Intuitively, to answer the query in our running example:

SELECT ?v ?y WHERE { ?v rdf:type :Wellbore . ?v :hasName ?y .}

we generate the query depicted in Figure 2.3. The formal results are provided in Appendix B.

2.2.3 Optimization

The approach presented in the previous section is theoretical, and cannot be effectively applied in practice: it assumes that the links are given in the form of owl:sameAs statements, whereas in practice, in a cross-linked setting, they will be given as tables (storing the results of the entity resolution process); and it requires pre-computing a large number of triples (namely A∗S) and materializing them into the ontology. Since


SELECT ?v ?y WHERE {
  { { ?v rdf:type :Wellbore . }
    UNION
    { ?v owl:sameAs [ rdf:type :Wellbore ] . } }
  { { ?v :hasName ?y . }
    UNION
    { ?v owl:sameAs _:z1 . _:z1 :hasName ?y . } }
}

Figure 2.3: Rewritten query where owl:sameAs is considered as an ordinary object property

these triples are not stored in the database, they cannot be efficiently retrieved using SQL. This negatively impacts the performance of query execution. To tackle these problems, we perform the following tasks:

1. expose, using mapping assertions that are optimization-friendly (this implies building the URIs in a particular way), the information in the linking tables as owl:sameAs statements (AS);

2. extend the mappings so as to encode also transitivity and symmetry (but not reflexivity, for performance reasons), and hence expose the symmetric-transitive closure A+S of AS;

3. modify the query-rewriting algorithm so as to return sound and complete answers over the (virtual) ontology extended with A+S.

Example 2.2.2 (Mappings and Query) To generate the owl:sameAs statements from the linking tables in Figure 2.4, we extend our set of mappings M with the following mappings (fragment):

uri1({id1}) owl:sameAs uri2({id2}) <- SELECT * FROM L_{1,2}

uri2({id2}) owl:sameAs uri3({id3}) <- SELECT * FROM L_{2,3}

Observe that this also implies that to populate the concept Wellbore with elements from D1, the mappings in M will have to use the URI template uri1.

The optimization-friendly way of building URIs, together with the fact that we do not encode reflexivity into the mappings, allows existing optimization techniques to remove any performance overhead caused by owl:sameAs statements whenever such statements are not needed to answer the queries.
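Steps 1 and 2 can be illustrated on the running example. The sketch below is hypothetical code (not Ontop's): it derives the owl:sameAs pairs from the linking tables of Figure 2.4 and closes them under symmetry and transitivity, deliberately omitting reflexive pairs, mirroring the optimization:

```python
# Table contents follow Figure 2.4; uri() stands in for the templates uri1..uri3.
L12 = [("a1", "b2"), ("a2", "b1")]
L23 = [("b1", "c4"), ("b2", "c3")]

def uri(ds, ident):
    return f"uri{ds}:{ident}"

A_S = {(uri(1, x), uri(2, y)) for x, y in L12} | \
      {(uri(2, x), uri(3, y)) for x, y in L23}

def sym_trans_closure(pairs):
    """Symmetric-transitive closure, skipping reflexive pairs."""
    closure = set(pairs) | {(b, a) for a, b in pairs}
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

A_plus = sym_trans_closure(A_S)
assert (uri(1, "a1"), uri(3, "c3")) in A_plus   # a1 ~ c3 via b2
assert all(a != b for a, b in A_plus)           # no reflexive pairs
```

In the actual system these pairs are of course never materialized; the same closure is encoded virtually in the mapping layer.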

Details and formal results are provided in Appendix B.

2.2.4 Experiments

We evaluated the performance of queries over cross-linked datasets at Statoil and over artificial data.

L1,2            L2,3            L1,3

id1   id2       id2   id3       id1   id3
a1    b2        b1    c4        a1    c3
a2    b1        b2    c3        a2    c4
                                a3    c5

Figure 2.4: Linking Tables


We integrated EPDS and the NPD fact pages at Statoil by extending the existing ontology and the set of mappings, and creating the linking tables. We ran 22 queries covering real information needs of end-users over this integrated OBDA setting. Since EPDS is a production server, and in addition the OBDA setting is too complex to isolate the different features of this approach, we also created a controlled OBDA environment on our own server to perform a careful study of our technique. In addition, we exported the triples of this controlled environment and loaded them into the commercial triple store Stardog4 (v3.0.1).

To perform the controlled experiments, we set up an OBDA cross-linked environment based on the Wisconsin Benchmark. To mimic a federated scenario, we created a single database with 10 tables: 4 Wisconsin tables, representing different datasets, and 6 linking tables. Each Wisconsin table contains 100M rows; the tables occupied ca. 100GB of disk space, exposing more than 1.8 billion triples.

The following experiments evaluate the overhead of equality reasoning when answering SPARQL queries. The variables we considered are:

1. number of SPARQL joins (1-4);
2. number and type of properties (0-4, data/object);
3. number of linked datasets (2-3);
4. selectivity of the query (returning 0.001%, 0.01%, or 0.1% of the dataset);
5. number of equal objects between datasets (10%, 30%, 60%).

In total we ran 1332 queries. The results confirm that reasoning over OBDA-based integrated data has a high cost, but this cost is not prohibitive. The execution times at Statoil range from 3.2 seconds to 12.8 minutes, with a mean of 53 seconds and a median of 8.6 seconds. The most complex query had 15 triple patterns, using object and data properties coming from both data sources.

In the controlled environment, in the 2-linked-datasets scenario with 120M equal objects (60%), even in the worst case most of the queries run in ≈5 min. The query that performs the worst in this setting (4 joins, 2 data properties, 2 object properties, 0.001% selectivity) returns 480,000 results and takes ≈13 min. When we move to the 3-linked-datasets scenario, most of the slowest executions take around 15 min. In this case, the worst query takes around 1.5 hours and returns 1,620,000 results. One can see that the number of linked datasets is the variable with the highest impact on query performance. The second is the number of object properties, since their translation is more complex than the one for data properties. The third is the selectivity. It is worth noticing that these results measure an almost pathological case, taking the system to its very limit. In practice, it is unlikely that 60% of all the objects of an integrated dataset will be equal and belong to the same type (for instance, wellbores). Recall that if they are not in the same type, our optimization removes the unnecessary SQL subqueries. For instance, in the integration of EPDS and NPD there are fewer than 10,000 equal wellbores, while there are millions of objects of different categories. Moreover, even 1.5 hours is a reasonable time: recall that Statoil users required weeks to get an answer for this sort of query.

Because of the optimizations we use, with 2 datasets 30 out of 48 queries (and with 3 datasets 52 out of 100) get optimized and executed in a few milliseconds.

The default semantics that Stardog gives to owl:sameAs is not compliant with the official OWL semantics, since "Stardog designates one canonical individual for each owl:sameAs equivalence set"; however, one can force Stardog to consider all the URIs in the equivalence set. Our experiments show that Stardog does not behave according to the claimed semantics.

2.3 A ‘Historical Case’ of Ontology-Based Data Access

Semantic technologies are valuable tools for historical research, especially when dealing with an immense amount of quantifiable data in interchangeable formats. They are used in situations like making explicit the semantics contained in historical sources, formalising them, and linking them. In this approach we rely on the Ontology-Based Data Access (OBDA) paradigm, where the different datasets are virtually

4http://stardog.com


integrated by a conceptual layer (an ontology). Specifically, we focus on investigating the mechanisms and characteristics of the food production and commercial trade system during the Roman Empire.

A considerable number of public initiatives and projects have been funded to address the issue of building historical and cultural data using semantic technologies and making them public on the web. These projects have the problem that they cannot be easily understood by non-experts. They have the goal of supporting the design and development of computer applications using data structures, integrated datasets, vocabularies, and ontologies, but the concept names used are often not self-explanatory. Also, the initiatives that represent implementations of the envisioned applications tend to define concepts in an intentionally very abstract way, in order to be useful for any domain in the digital humanities field.

The OBDA implementation introduced in the paper, by means of state-of-the-art technologies and principles coming from the research area of Knowledge Representation, is meant to support scholars in experimentally verifying theoretical hypotheses, and in formulating new ones. It allows scholars to execute a simple query that does not require any specific knowledge about the underlying data sources, and to get all the available information coming from both datasets. The datasets involved are EPNet, containing information about amphoras and some (potentially incomplete) information about geo-coordinates, and the Pleiades dataset, containing more complete geo-coordinate information but no information about amphoras.

By means of the OBDA technology, extensive amounts of the EPNet data are to be connected and subsequently interpreted at a variety of levels that will give new insights into the complexity of the Roman Empire exchange relations. Moving beyond the limitations of a traditional relational database is essential for the generation of new knowledge, and for the specification of values and parameters that will be manipulated in the simulation experiments. The OBDA technology helped us deal in an efficient and sound way with data access, integration, and consistency issues. OBDA allows accessing data in a way that is conceptually sound with respect to the domain knowledge (the ontology and the EPNet CRM), as presented in the next subsection.

2.3.1 Knowledge Representation and Data Management in EPNet

The EPNet Conceptual Reference Model (CRM) has been specified for an unambiguous representation of epigraphic information and domain-expert knowledge about Roman Empire Latin inscriptions. This representation presents the way the data are understood by scholars, how they are connected, and what their coverage is with respect to the literature of reference and current research practices in the history of the Roman Empire. The CRM has been formally specified in Object Role Modeling (ORM2), a conceptual modeling language, using NORMA, a data modeling tool for ORM2. The CRM has been defined according to state-of-the-art formal ontological models and standards for representing the structure of cultural heritage objects and the relationships between them. The main section of the model is a specialisation/extension of the CIDOC Conceptual Reference Model, the most dominant ontology in cultural heritage, which increases the interoperability of the CRM and of the whole EPNet dataset. The CRM has been structurally organised into distinct interrelated subsections, relying on existing standards for recording and publishing information on the Semantic Web. The standards used have been chosen according to the different aim of each subsection. The five main sections of the EPNet CRM are:

Main for the representation of the main domain entities, properties and mutual relationships.

Time for a conceptual arrangement, driven by the experts of the given research domain, of the differentmodalities used to denote interval periods, dates and punctual instants of time.

Space for information concerning space and geographical localisation of the entities in Main CRM.

Documental for the representation of the bibliographic information documenting the entities of interest.

Upper Typing for the collection of all the taxonomic structures characterising the entities in the Main CRM.

The CRM model is formally correct and consistent, and it is comprehensive enough to host all the information and knowledge elicited from the domain experts.


In EPNet, we use a relational database management system (RDBMS) to store our data, so we must provide a relational specification that complements our CRM. An RDBMS structures data in the form of tables, so a relational specification has to indicate which tables form the database and what their attributes are.

The Pleiades dataset, whose data content has been integrated into the project dataset, (i) increases the coverage of the data provided to the final users with respect to the domain of interest (completeness), and (ii) complements the characterisation of the geographical entities already present in the initial dataset (accuracy).

If a location is present in both Pleiades and the project DB but is missing some attributes in the latter (e.g., the place has no geo-coordinates), the system is able to identify the missing attributes, fetch their associated values, and augment the entry in EPNet with them, thus increasing the overall accuracy and completeness of the stored data.

The ontology, written in a formal language whose expressivity stays within the OWL 2 QL profile, modifies and extends the vocabulary of the database schema by reintroducing part of the domain-specific terminology extracted with the support of the domain experts. The ontology captures the domain knowledge by taking into consideration, at the same time, the available data and the user requirements in terms of data accessibility and usage.

The characteristic trait of the EPNet ontology, and of the domain knowledge encoded in the EPNet CRM, is that of being 'functional to research', instead of having data structures suitable for a generic audience. The integration of EPNet and Pleiades starts in the ontology, where concepts cover information contained in both datasets.

To cluster all the information about places in both datasets into a single well-defined concept, we use mappings. The ontology together with the mappings and the database exposes a virtual RDF graph, which can be queried using SPARQL. To answer queries in the virtual approach by exploiting the information given by the ontology, Ontop relies on query rewriting.

Users do not need to know the particular codes of the amphoras, nor do they need to manually integrate the information coming from EPNet and Pleiades. If a place is not in the EPNet dataset, we completely rely on the data from Pleiades (name(s), geo-coordinates, and kind of settlement). If the place is in EPNet (with the comparison done by name), then we keep the existing EPNet data and add the kind of settlement (which is not in EPNet). Moreover, if the existing data is incomplete (e.g., missing coordinates), we fill it in with Pleiades data.
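The completion rule just described can be sketched as follows; the field names and the dictionary-based representation are illustrative assumptions, not the actual EPNet schema:

```python
# Sketch of the precedence rule: EPNet data wins; Pleiades fills the gaps
# (places missing entirely, missing attributes such as coordinates, and
# the kind of settlement, which EPNet does not record at all).

def merge_place(epnet, pleiades):
    """Combine two attribute dicts describing the same place name."""
    if epnet is None:                 # place only in Pleiades: take it wholesale
        return dict(pleiades)
    merged = dict(pleiades)           # start from Pleiades ...
    # ... and overwrite with every EPNet attribute that is actually present
    merged.update({k: v for k, v in epnet.items() if v is not None})
    return merged

epnet = {"name": "Tarraco", "coords": None}                 # incomplete EPNet entry
pleiades = {"name": "Tarraco", "coords": (41.1, 1.25), "settlement": "city"}
m = merge_place(epnet, pleiades)
assert m["coords"] == (41.1, 1.25) and m["settlement"] == "city"
```

In the real system this precedence is expressed declaratively in the mapping layer rather than in imperative code.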

In OBDA, inconsistencies arise when the data in the sources together with the mappings violate the constraints imposed by the ontology, and it is of interest to check whether such violations occur. Notice that disjointness, stating that the intersection between two classes or between two properties should be empty, can be expressed in the OWL 2 QL profile of OWL 2, while functionality of properties, stating that no individual can be related to more than one element through a functional property, cannot. However, both types of constraints can be checked by Ontop by posing suitable queries over the ontology, and checking whether the answers to such queries are non-empty. Being able to apply data consistency checks over the project data is of particular interest in such a context, considering that the data are usually collected by non-experts and manually entered into a DB system without the support of any specific data-entry interface.
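As a minimal illustration of such a check, the sketch below detects functionality violations over an in-memory set of (subject, object) facts standing in for the virtual RDF graph; in the real system this would be a query posed through Ontop, not Python code:

```python
# A functional property admits at most one object per subject; a violation
# is witnessed by a subject related to two distinct objects, i.e., by a
# non-empty answer to the corresponding check query.

def functionality_violations(facts):
    """Return the subjects related to more than one object via the property."""
    seen = {}
    bad = set()
    for s, o in facts:
        if s in seen and seen[s] != o:
            bad.add(s)
        seen.setdefault(s, o)
    return bad

# illustrative facts for a hypothetical functional property :hasType
facts = [(":amph1", "Dressel1"), (":amph2", "Dressel20"), (":amph1", "Dressel2")]
assert functionality_violations(facts) == {":amph1"}   # non-empty = inconsistency
```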

2.3.2 User Interface

OBDA supports EPNet in facing the main challenge of providing users with a semantically transparent platform, ready to acquire and be complemented with new data from different sources (domain-related historical datasets managed by research labs or promoted publicly). A preliminary user interface for testing the OBDA functionalities in EPNet is available online. It provides users with a text area where they can write SPARQL queries using the vocabulary of the ontology. After executing the query, the interface shows the SQL query that was sent to the underlying RDBMS, and the result of the query in tabular form. See Appendix D for further details of the interface.


2.4 Nested Regular Path Queries in Description Logics

There has been great interest recently in mechanisms for querying data that go beyond the traditional select-project-join fragment of SQL (or SPARQL), and allow for flexibly navigating the data, while still taking into account complex domain knowledge represented in Description Logics (DLs) [7, 9, 19]. In DLs, instance data stored in the ABox is constituted by ground facts over unary and binary predicates (concepts and roles, respectively), and hence resembles data stored in graph databases [14, 3]. There is a crucial difference, however, between answering queries over graph databases and over DL ABoxes. In the former, the data are assumed to be complete, hence query answering amounts to the standard database task of query evaluation. In the latter, instead, it is typically assumed that the data are incomplete, but additional domain knowledge is provided by the DL TBox, and can be profitably exploited for query answering. Hence query answering amounts to the more complex task of computing certain answers, i.e., those answers that are obtained from all databases that both contain the explicit facts in the ABox and satisfy the TBox constraints. This difference has driven research in different directions.

Regular Path Queries. In databases, expressive query languages for querying graph-structured data have been studied, based on the requirement of relating objects by flexibly navigating the data. The main querying mechanism that has been considered for this purpose is that of one-way and two-way regular path queries (RPQs and 2RPQs) [15, 11], which are queries returning pairs of objects related by a path whose sequence of edge labels belongs to a regular language over the (binary) database relations and their inverses. Conjunctive 2RPQs (C2RPQs) [10] are a significant extension of such queries that add to the navigational ability the possibility of expressing arbitrary selections, projections, and joins over objects related by 2RPQs, in line with conjunctive queries (CQs) over relational databases. Two-way RPQs are present in the property paths of SPARQL 1.1 [20], the new standard RDF query language, and in the XML query language XPath [5]. An additional construct that is present in XPath, and that can be used to express sophisticated conditions along navigation paths, is the possibility of using existential test operators, also known as nesting. When an existential test 〈E〉 is used in a 2RPQ E′, there will be objects along the main navigation path for E′ that match positions of E′ where 〈E〉 appears; such objects are required to be the origin of a path conforming to the (nested) 2RPQ E. It is important to notice that existential tests in general cannot be captured even by C2RPQs, e.g., when tests appear within a transitive closure of an RPQ. Hence, adding nesting effectively increases the expressive power of 2RPQs and of C2RPQs.
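To make the RPQ mechanism concrete, here is a minimal illustrative sketch of plain RPQ evaluation over a complete graph database (no TBox, no inverse roles, no nesting), computed as reachability in the product of the graph with a finite automaton for the path expression:

```python
from collections import deque

def rpq(edges, nfa, start, final):
    """Pairs (x, y) connected by a path whose label word the NFA accepts.

    edges: set of (node, label, node) triples
    nfa:   dict mapping (state, label) -> set of successor states
    """
    answers = set()
    nodes = {n for s, _, o in edges for n in (s, o)}
    for x in nodes:
        seen = {(x, start)}
        queue = deque(seen)
        while queue:            # BFS over the graph x automaton product
            node, state = queue.popleft()
            if state in final:
                answers.add((x, node))
            for s, lbl, o in edges:
                if s == node:
                    for nxt in nfa.get((state, lbl), ()):
                        if (o, nxt) not in seen:
                            seen.add((o, nxt))
                            queue.append((o, nxt))
    return answers

# advisor* : a single state that is initial and final, looping on "advisor"
edges = {("ann", "advisor", "bob"), ("bob", "advisor", "carl")}
nfa = {(0, "advisor"): {0}}
assert ("ann", "carl") in rpq(edges, nfa, 0, {0})
```

Under a TBox one would instead compute certain answers, which is precisely what makes the DL setting harder than this database-style evaluation.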

Query Answering in Description Logics. In the DL community, query answering has been investigated extensively for a wide range of DLs, with much of the work devoted to CQs. With regard to the complexity of query answering, attention has been paid on the one hand to combined complexity, i.e., the complexity measured considering as input both the query and the DL knowledge base (constituted by the TBox and the ABox), and on the other hand to data complexity, i.e., when only the ABox is considered as input. For expressive DLs that extend ALC, CQ answering is typically coNP-complete in data complexity [23], and 2Exp-complete in combined complexity [19, 22, 17]. For lightweight DLs, instead, CQ answering is in AC0 in data complexity for the DL-Lite family [8], and P-complete for the EL family [21]. For both logics, the combined complexity is dominated by the NP-completeness of CQ evaluation over plain relational databases. In the context of DLs, there has also been some work on (2)RPQs and C(2)RPQs. For the very expressive DLs ZIQ, ZOQ, and ZOI, where regular expressions over roles are present also in the DL, a 2Exp upper bound has been shown via techniques based on alternating automata over infinite trees [12]. For the Horn fragments of SHOIQ and SROIQ, P-completeness in data complexity and Exp/2Exp-completeness in combined complexity are known [24]. For lightweight DLs, tight bounds for answering 2RPQs and C2RPQs have only very recently been established in [6]: for (C)(2)RPQs, data complexity is NL-complete in DL-Lite and DL-LiteR, and P-complete in EL and ELH. For all of these logics, combined complexity is P-complete for (2)RPQs and PSpace-complete for C(2)RPQs.


                                    2RPQ               C2RPQ               N2RPQ / CN2RPQ
                                    data    combined   data    combined    data    combined

Graph DBs & RDFS                    NL-c    NL-c       NL-c    NP-c        NL-c    P-c / NP-c
DL-Lite                             NL-c    P-c        NL-c    PSpace-c    NL-c    Exp-c
Horn DLs (e.g., EL, Horn-SHIQ)      P-c     P-c        P-c     PSpace-c    P-c     Exp-c
Expressive DLs (e.g., ALC, SHIQ)    coNP-h  Exp-c      coNP-h  2Exp-c      coNP-h  Exp-c / 2Exp-c

Table 2.2: Complexity of query answering. The 'c' indicates completeness, the 'h' hardness. New results are marked in bold. For existing results, refer to [6, 25, 4, 12, 24] and references therein.

Adding Nesting to 2RPQs and C2RPQs. Motivated by the expressive power of nesting in XPath and SPARQL, in our work we significantly advance these latter lines of research on query answering in DLs, and study the impact of adding nesting to 2RPQs and C2RPQs. We refer to Appendix D for the formal definition of the nested variants of 2RPQs and C2RPQs, and illustrate their expressive power by means of an example.

Example 2.4.1 We consider an ABox of advisor relationships of PhD holders. We assume an advisor relation between nodes representing academics. There are also nodes for theses, universities, research topics, and countries, related in the natural way via the roles wrote, submitted, topic, and location. We give two queries over this ABox.

q1(x, y) = (advisor · 〈wrote · topic · Physics?〉)∗(x, y)

Query q1 is a nested 2RPQ that retrieves pairs of a person x and an academic ancestor y of x such that all people on the path from x to y (including y itself) wrote a thesis in Physics. The nesting is indicated by 〈p〉, where p indicates the nested path that is required to be present along the main path at the node that matches the occurrence of 〈p〉.

q2(x, y, z) = advisor−(x, z), advisor∗(x, w),
              advisor− · 〈wrote · 〈topic · DBs?〉 · submitted · location · {usa}?〉(y, z),
              (advisor · 〈wrote · 〈topic · Logic?〉 · submitted · location · EU?〉)∗(y, w)

Query q2 is a nested C2RPQ that looks for triples of individuals x, y, z such that x and y have both supervised z, who wrote a thesis on Databases and submitted this thesis to a university in the USA. Moreover, x and y have a common ancestor w, and all people on the path from y to w, including w, must have written a thesis in Logic and must have submitted this thesis to a university in an EU country.

Achieved Results. We establish tight5 complexity bounds in data and combined complexity for a variety of DLs, ranging from lightweight DLs of the DL-Lite and EL families up to the highly expressive ones of the SH and Z families. Our results are summarized in Table 2.2. For DLs containing at least ELI, we are able to encode away nesting, thus showing that the worst-case complexity of query answering is not affected by this construct. Instead, for lightweight DLs (starting already from DL-Lite!), we show that adding nesting to 2RPQs leads to a surprising jump in combined complexity, from P-complete to Exp-complete. We then develop a sophisticated rewriting-based technique that builds on (but significantly extends) the one proposed in [6], which we use to prove that the problem remains in NL for DL-Lite. We thus show that adding nesting to (C)2RPQs does not affect the worst-case data complexity of query answering for lightweight DLs.

5With one exception, see Table 2.2.


Chapter 3

Implementation Development

In this chapter, we present the major changes that have been implemented in Ontop, directly driven by the Statoil and Siemens use cases, to tackle a variety of issues that came up during the experiments.

In Section 3.1, we describe the adaptations made to Ontop to better integrate it into the Optique platform, and to improve the support for standard languages and libraries.

In Section 3.2, we discuss the improvements to the interfaces of Ontop, namely the Protégé plugin, the Sesame SPARQL Protocol service, the Java API, and the command line interface. Finally, in Section 3.3, we describe the publication of Ontop bundles on SourceForge, and report on the latest stable releases of Ontop.

3.1 Integration with the Optique Platform

The Optique consortium agreed to adopt the following languages and libraries for the Optique platform:

• The mapping language should be W3C R2RML.

• SQL dialects of major database systems should be supported as R2RML source queries.

• End-user queries should be expressed in SPARQL.

During the third year of the Optique project, a significant part of the development has been dedicated to improving the support for the aforementioned standards and libraries. More specifically:

• We have extended the SQL support in the source query by allowing CONCAT and REPLACE, and added support for literal templates in the target query (Section 3.1.1).

• We have implemented most of the SPARQL functions (Section 3.1.2).

• We have improved the support of datatypes (Section 3.1.3).

• We have implemented the support for multiple references to the same entity (Section 3.1.4).

• We have added support for column-oriented databases, so that the Optique platform can access more kinds of databases (Section 3.1.5).

3.1.1 Support for the Mapping Language

We have continued extending the support for the mapping language (both in R2RML syntax and Ontop native syntax) in Ontop. Specifically, we have extended (1) the SQL support in the source query by allowing CONCAT and REPLACE, and (2) the support for literal templates in the target query.


Improved Support for the Mapping Language

Mapping assertions contain the correspondence between the predicates and classes of the ontology and the appropriate SQL queries over the database. In Ontop we have recently added support for the SQL functions CONCAT and REPLACE. Recognizing these functions allows parsing them accordingly in the SQL query, converting them into Datalog form (Ontop's internal representation format), and processing them for later optimizations. CONCAT and REPLACE are now supported both in the selection and in the projection part of the SQL query. If the function is contained in the projection part of the query, the Datalog head will contain only the alias name of the function, while the body will contain an equality between the alias and the corresponding function.

Different syntaxes are provided for the major supported databases (Oracle, DB2, H2, Postgres, MySQL, and SQL Server), allowing the conversion in the final generated SQL.

• CONCAT is handled by adding a Concat expression for the Datalog conversion. Postgres, H2, Oracle, and DB2 are supported using the special symbol ||, while for SQL Server we use the + symbol and for MySQL the CONCAT(string, string, ...) function.

Starting from a representation in an Ontop mapping of a CONCAT function:

target :trace{e} :TcontainsE :event{e} .

source select 'paper' || Submission.paper as e from Submission

The Datalog translation will have the form:

http://odbs.org#Event(URI("http://odbs.org#event{}",CONCAT("paper",t4)))

:- Submission(t1,t2,t3,t4), IS_NOT_NULL(t4)

http://odbs.org#TcontainsE(

URI("http://odbs.org#trace{}",CONCAT("paper",t4)),

URI("http://odbs.org#event{}",CONCAT("paper",t4)))

:- Submission(t1,t2,t3,t4), IS_NOT_NULL(t4)

http://odbs.org#Trace(URI("http://odbs.org#trace{}",CONCAT("paper",t4)))

:- Submission(t1,t2,t3,t4), IS_NOT_NULL(t4)

• REPLACE is handled with a new Replace expression for the Datalog conversion. We do not support the use of flags. DB2, MySQL, and SQL Server do not support the use of regular expressions in replace.

– H2: REGEXP_REPLACE(inputString, regexString, replacementString). Replaces each substring that matches a regular expression.

– Postgres: regexp_replace(string text, pattern text, replacement text).

– Oracle: REGEXP_REPLACE extends the functionality of the REPLACE function by letting you search a string for a regular expression pattern.

– DB2, SQL Server, and MySQL: REPLACE(sourceString, searchString, replaceString). Replaces all occurrences of searchString in sourceString with replaceString. They do not provide support for regular expressions.

From an Ontop mapping of the form:

target :{u} :hasValue {val} .

source SELECT REGEXP_REPLACE(uri, ’ ’, ’%20’) AS u, val FROM TABLE1

We will have a Datalog translation of the form:


http://it.unibz.krdb/hasValue(

URI("http://it.unibz.krdb/{}",REPLACE(t1," ","%20")),

http://www.w3.org/2000/01/rdf-schema#Literal(t2))

:- TABLE1(t1,t2), IS_NOT_NULL(t2), IS_NOT_NULL(t1)

The construction of a CONCAT expression for the internal representation is also useful to support literal templates, as presented later.
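The dialect differences for CONCAT can be sketched as a simple dispatch. This is illustrative code, not Ontop's actual SQL generator; it assumes the usual dialect conventions (|| for Postgres/H2/Oracle/DB2, + for SQL Server, CONCAT(...) for MySQL):

```python
# Hypothetical sketch: render an internal CONCAT expression into
# dialect-specific SQL concatenation syntax.

def sql_concat(dialect, args):
    if dialect in ("postgres", "h2", "oracle", "db2"):
        return " || ".join(args)
    if dialect == "sqlserver":
        return " + ".join(args)
    if dialect == "mysql":
        return f"CONCAT({', '.join(args)})"
    raise ValueError(f"unsupported dialect: {dialect}")

assert sql_concat("postgres", ["'paper'", "Submission.paper"]) \
       == "'paper' || Submission.paper"
assert sql_concat("mysql", ["a", "b", "c"]) == "CONCAT(a, b, c)"
```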

Improved Support for Target language in the Mapping Language

The R2RML standard specifies that a string template is a format string that can be used to build strings from multiple components. It can reference column names by enclosing them in curly braces ({ and }).

Respecting the syntax rules required by the standard, we added support for literal templates, allowing the conversion into our Ontop mapping representation and, internally, into the CONCAT expression of the Datalog format.
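The template-to-CONCAT conversion can be sketched as follows; the helper below is hypothetical, not Ontop's parser. Fixed text becomes SQL string literals and {COLUMN} references become column names:

```python
import re

def template_to_concat(template):
    """Split an R2RML string template into CONCAT arguments."""
    parts = []
    # each match is (fixed text, optional {COLUMN} reference)
    for fixed, column in re.findall(r"([^{]*)(?:\{([^}]*)\})?", template):
        if fixed:
            parts.append(f"'{fixed}'")   # literal fragment
        if column:
            parts.append(column)         # column reference
    return parts

assert template_to_concat("address:{ADDRESS} city:{CITY} {COUNTRY}") == \
    ["'address:'", "ADDRESS", "' city:'", "CITY", "' '", "COUNTRY"]
```

The resulting argument list is exactly what the dialect-specific CONCAT rendering described in Section 3.1.1 consumes.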

The conversion into our internal mapping is straightforward. From an extract of an R2RML mapping of the form:

rr:predicateObjectMap [ a rr:PredicateObjectMap ;

rr:objectMap [ a rr:TermMap , rr:ObjectMap ;

rr:language "en-us" ;

rr:template "address:{ADDRESS} city:{CITY} {COUNTRY}" ;

rr:termType rr:Literal

] ;

rr:predicate <http://www.w3.org/2000/01/rdf-schema#label>

] ;

The corresponding Ontop native mapping will be:

rdfs:label "address:{ADDRESS} city:{CITY} {COUNTRY}"@en-us ;

It will later be transformed into the Datalog format and, if requested by a SPARQL query, converted into SQL based on the underlying database.

Only literal templates are supported, for rdfs:Literal and xsd:string; other datatypes should be expressed using a single column.

3.1.2 Support for SPARQL Functions

Functions in queries let us find out more from our input data, and create new information from it. Functions are very useful for performing many tasks, such as checking certain conditions, string manipulation, mathematical operations, hashing, and so on. In SPARQL 1.0, most of the functions are merely "test functions", which answer a particular question about a value [16]. They are either boolean functions checking whether certain conditions hold, or functions that return the datatype or language tag of a value. In SPARQL 1.1, the collection of SPARQL functions is extended with many useful string manipulation, numeric, hash, program logic, node type check, and conversion functions.

In order to provide more extensive SPARQL compliance of Ontop, we have implemented most of the SPARQL functions in the reporting year. The implementation involved extending the classes for Datalog translation and SQL generation. Ontop supports many open source and commercial relational databases. Since RDBMSs do not completely follow the SQL standard, the SQL implementations of the various RDBMSs are incompatible with each other. In particular, the implementation of string and hash functions and the date and time syntax vary from vendor to vendor. Therefore, in the implementation phase we had to consider the differences between the SQL dialects employed by the different RDBMSs. Another problem in implementing SPARQL functions is the mismatch between SPARQL and SQL functions: SPARQL functions are not one-to-one


correspondent to SQL functions. For example there is no correspondent SQL function to SPARQL functionSTRENDS. So the SQL translation implementation of STRENDS function requires the usage of SQL CHARINDEX

and LENGTH functions together.Because of these reasons we have implemented the SPARQL functions with respect to SQL dialect which

has beeen adopted by each relational database and also tolerant to differences and incompatibilities betweenSPARQL and SQL functions. The supported SPARQL functions are briefly described below. For moredetails, please check the dedicated Wiki page1.
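To illustrate this mismatch, the sketch below shows how a rewriter could emit a dialect-specific boolean SQL condition for STRENDS. The helper strends_sql and the chosen dialect translations are illustrative assumptions, not Ontop's actual code generator.

```python
# Hypothetical sketch of dialect-specific SQL generation for SPARQL
# STRENDS(arg1, arg2); not Ontop's actual code generator.
def strends_sql(arg1: str, arg2: str, dialect: str) -> str:
    """Return a boolean SQL condition testing whether arg1 ends with arg2."""
    if dialect == "mssql":
        # MSSQL: compare the last LEN(arg2) characters of arg1 with arg2
        return f"RIGHT({arg1}, LEN({arg2})) = {arg2}"
    if dialect == "oracle":
        # Oracle: a negative SUBSTR start position counts from the end
        return f"SUBSTR({arg1}, -LENGTH({arg2})) = {arg2}"
    # SQL-99 style fallback using SUBSTRING ... FROM
    return (f"SUBSTRING({arg1} FROM CHAR_LENGTH({arg1}) - "
            f"CHAR_LENGTH({arg2}) + 1) = {arg2}")

print(strends_sql("t.name", "'xyz'", "oracle"))
# → SUBSTR(t.name, -LENGTH('xyz')) = 'xyz'
```

The same dispatch-on-dialect pattern applies to the other string functions discussed below.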

String Functions

xsd:integer STRLEN(string literal str)

string literal SUBSTR(string literal source, xsd:integer startingLoc)

string literal SUBSTR(string literal source, xsd:integer startingLoc, xsd:integer length)

string literal UCASE(string literal str)

string literal LCASE(string literal str)

literal STRBEFORE(string literal arg1, string literal arg2)

literal STRAFTER(string literal arg1, string literal arg2)

simple literal ENCODE_FOR_URI(string literal ltrl)

Most of the string functions which we have implemented perform frequently used string operations such as searching for substrings (SUBSTR, STRBEFORE, STRAFTER), getting the length of strings (STRLEN), and converting upper-case strings to lower case (LCASE) and vice versa (UCASE). In addition, we have implemented the ENCODE_FOR_URI function, which is very useful and worth a closer look in the following Example 3.1.1.

Example 3.1.1
# RDF Graph:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

@prefix data: <http://example.com/ns/data#> .

data:item rdfs:label "http://www.example.com/xyz/func&color=blue" .

# SPARQL Query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?test WHERE

{

?x rdfs:label ?label .

BIND (ENCODE_FOR_URI(?label) AS ?test)

}

#Output:

---------------------------------------------------------------

"http%3A%2F%2Fwww%2Eexample%2Ecom%2Fxyz%2Ffunc%26color%3Dblue"

As can be seen from the output, the ENCODE_FOR_URI function encodes all the punctuation characters in the URI. This kind of encoding is very useful for passing a URI or a SPARQL query as a parameter to a web service such as a SPARQL endpoint.
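The same kind of percent-encoding can be reproduced with Python's standard library. Note that, per RFC 3986, urllib leaves unreserved characters such as '.' unescaped, so this is a close but not character-for-character match to the output above:

```python
from urllib.parse import quote

# Percent-encode a value before embedding it in a URL query string.
# With safe="" even '/' and ':' are escaped; unreserved characters
# ('-', '.', '_', '~') stay as-is per RFC 3986.
label = "http://www.example.com/xyz/func&color=blue"
print(quote(label, safe=""))
# → http%3A%2F%2Fwww.example.com%2Fxyz%2Ffunc%26color%3Dblue
```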

1 https://github.com/ontop/ontop/wiki/OntopSPARQLFunctions


Boolean Functions on Strings

xsd:boolean STRENDS(string literal arg1, string literal arg2)

xsd:boolean STRSTARTS(string literal arg1, string literal arg2)

xsd:boolean CONTAINS(string literal arg1, string literal arg2)

Boolean string functions return true or false to questions about strings, such as the existence or the location of a substring in a string. The STRENDS function checks whether the string in the first parameter ends with the string in the second parameter, and the STRSTARTS function checks whether the string in the first parameter starts with the string in the second parameter. The CONTAINS function checks whether the string in the second parameter occurs anywhere in the first parameter.

One challenge of implementing SPARQL string functions in Ontop is to support them in different relational databases with their own dialects. For example, the SPARQL CONTAINS function has many different corresponding implementations2 in the SQL dialects, such as LOCATE in DB2, POSITION in PostgreSQL, INSTR in Oracle, CHARINDEX in MSSQL, and so on. A detailed list of the differences in string functions between SQL dialects is shown in Table 3.1.

Functions on RDF Terms

simple literal STRUUID()

The STRUUID function returns a string containing a freshly generated UUID (without the urn:uuid: prefix).

Numeric Functions

numeric ABS (numeric term)

numeric ROUND (numeric term)

numeric CEIL (numeric term)

numeric FLOOR (numeric term)

xsd:double RAND ( )

The ABS function returns the absolute value of the numeric parameter. The ROUND function rounds the numeric parameter to the closest integer. The CEIL function rounds the numeric parameter up and the FLOOR function rounds it down to the closest integer. Finally, the RAND function returns a random double-precision number between 0 and 1. An example query using the numeric functions is shown in Example 3.1.2.

Example 3.1.2
# RDF Graph:

@prefix d: <http://example.com/ns/data#> .

@prefix dm: <http://example.com/ns/demo#> .

d:item2 dm:num 1.3 .

d:item3 dm:num 1.8 .

d:item4 dm:num -2.2 .

d:item5 dm:num -2.7 .

# SPARQL Query:

PREFIX dm: <http://example.com/ns/demo#>

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?num ?abs ?round ?ceil ?floor WHERE

{

?s dm:num ?num .

BIND (abs(?num) AS ?abs )

2 https://en.wikibooks.org/wiki/SQL_Dialects_Reference/Functions_and_expressions/String_functions


Table 3.1: Differences Between SQL Dialects on Implementation of String Functions

(An entry "SQL99" indicates that the dialect uses the standard SQL-99 function; several functions in one cell are combined in the translation.)

STRLEN():          LENGTH() (SQL99/H2); CHAR_LENGTH() (MySQL); LEN() (MSSQL Server); SQL99 (PostgreSQL, Oracle, DB2, Teiid, HSQLDB)

SUBSTR():          SUBSTR() (SQL99/H2); SUBSTRING(... FROM ... FOR) (PostgreSQL); SQL99 (MySQL, MSSQL Server, Oracle, DB2, Teiid, HSQLDB)

UCASE()/LCASE():   UPPER()/LOWER() (SQL99/H2); UCASE()/LCASE() (Teiid); SQL99 (MySQL, MSSQL Server, PostgreSQL, Oracle, DB2, HSQLDB)

STRBEFORE():       LEFT()+CHARINDEX() (SQL99/H2, MSSQL Server); LEFT()+INSTR() (MySQL); POSITION()+SUBSTR() (PostgreSQL); INSTR()+LEFT() (Oracle); LOCATE()+LEFT() (DB2, Teiid); INSTR() (HSQLDB)

STRAFTER():        SUBSTRING()+CHARINDEX()+LENGTH() (SQL99/H2); SUBSTRING()+LOCATE()+LENGTH() (MySQL, Teiid); SUBSTRING()+CHARINDEX()+LEN() (MSSQL Server); SUBSTRING()+POSITION()+LENGTH() (PostgreSQL); SUBSTR()+INSTR()+LENGTH() (Oracle); SUBSTR()+LOCATE()+LENGTH() (DB2, HSQLDB)

ENCODE_FOR_URI():  REPLACE() (SQL99/H2); SQL99 (all other dialects)

STRENDS():         CHARINDEX()+LENGTH() (SQL99/H2); CHAR_LENGTH()+INSTR() (MySQL); RIGHT()+LEN() (MSSQL Server); LENGTH()+POSITION() (PostgreSQL); SUBSTR()+LENGTH() (Oracle); SQL99 (DB2); RIGHT()+CHAR_LENGTH() (Teiid, HSQLDB)

STRSTARTS():       RIGHT()+LENGTH() (SQL99/H2); CHAR_LENGTH()+RIGHT() (MySQL); LEFT()+LEN() (MSSQL Server); LEFT()+LENGTH() (PostgreSQL, DB2); SUBSTR()+LENGTH() (Oracle); SUBSTRING()+CHAR_LENGTH() (Teiid, HSQLDB)

CONTAINS():        SQL99 (SQL99/H2); INSTR() (MySQL, Oracle, HSQLDB); CHARINDEX() (MSSQL Server); POSITION() (PostgreSQL); LOCATE() (DB2, Teiid)


Table 3.2: Differences Between SQL Dialects on Implementation of Hash Functions

RDBMS          Hash Functions
H2             SHA256
Oracle         MD5 and SHA1, if DBMS_CRYPTO is enabled
MySQL          MD5 and SHA1
MSSQL Server   MD5, SHA256, SHA1, SHA512
PostgreSQL     MD5
DB2            none
Teiid          none
HSQLDB         none

BIND (round(?num) AS ?round )

BIND (ceil(?num) AS ?ceil )

BIND (floor(?num) AS ?floor )

}

# SPARQL Query results:

| num  | abs | round             | ceil              | floor             |
------------------------------------------------------------------------
  1.3    1.3   "1"^^xsd:decimal    "2"^^xsd:decimal    "1"^^xsd:decimal
  1.8    1.8   "2"^^xsd:decimal    "2"^^xsd:decimal    "1"^^xsd:decimal
 -2.2    2.2   "-2"^^xsd:decimal   "-2"^^xsd:decimal   "-3"^^xsd:decimal
 -2.7    2.7   "-3"^^xsd:decimal   "-2"^^xsd:decimal   "-3"^^xsd:decimal

Since the numeric functions are uniform across SQL dialects, their implementations are quite straightforward and do not require adaptation to the individual dialects.
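The results of Example 3.1.2 can be cross-checked with Python's standard numeric functions (a quick illustration; note that SPARQL's ROUND rounds halves toward positive infinity, while Python's round() uses round-half-to-even, which agrees on these inputs):

```python
import math

# Cross-check the abs/round/ceil/floor results of Example 3.1.2.
for num in (1.3, 1.8, -2.2, -2.7):
    print(num, abs(num), round(num), math.ceil(num), math.floor(num))
```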

Hash Functions

simple literal MD5 (xsd:string arg)

simple literal SHA1 (xsd:string arg)

simple literal SHA256 (xsd:string arg)

simple literal SHA512 (xsd:string arg)

SPARQL hash functions convert a string to a hexadecimal representation of a bit string that can serve as an encoded signature for the input string [16]. Most RDBMSs natively support at most two hash algorithms (some support none), so the hash algorithms which we have implemented vary from DB engine to DB engine (see Table 3.2).
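The expected hexadecimal output of these functions can be illustrated with Python's hashlib (shown here only to demonstrate the digest format; Ontop delegates hashing to the database functions listed in Table 3.2):

```python
import hashlib

# SPARQL-style hash functions: the hex digest of the UTF-8 encoded
# input string, as in MD5("abc") or SHA256("abc") in SPARQL 1.1.
s = "abc".encode("utf-8")
print(hashlib.md5(s).hexdigest())     # 900150983cd24fb0d6963f7d28e17f72
print(hashlib.sha256(s).hexdigest())  # ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```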

3.1.3 Support for Datatypes in OBDA

Ontop supports datatyping in SPARQL query answering. Datatypes in the ontology and in the mappings should agree to ensure correct results during SPARQL query execution. SPARQL queries should be executed using the correct datatypes, particularly when constants are used in the query, for example inside a filter. This is important so that the system can process the input datatypes and produce a semantically correct output. This year the number of supported datatypes has been extended and the behaviour is now closer to the SPARQL 1.1 standard.

The reasoner could previously handle seven primitive datatypes, namely rdfs:Literal, xsd:string, xsd:integer, xsd:double, xsd:decimal, xsd:dateTime, and xsd:boolean. Ontop has later been extended to support new datatypes both inside and outside the OWL 2 QL standard.


OWL 2 QL Datatypes

• rdfs:Literal. Every literal in RDF is an instance of rdfs:Literal; every XML Schema datatype value is also an instance of rdfs:Literal.

• xsd:decimal. It represents a subset of the real numbers, namely those that can be represented by decimal numerals. Precision is not reflected in this value space.

• xsd:integer. It is derived from decimal by fixing the value of fractionDigits to be 0 and disallowing the trailing decimal point.

• xsd:nonNegativeInteger. It is derived from integer by setting the value of minInclusive to be 0.

• xsd:string. The string datatype represents character strings in XML.

• xsd:dateTime. This datatype describes instants identified by the combination of a date and a time, in the format [-]CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm].

• xsd:dateTimeStamp. A specific date and time. Its only difference from dateTime is that the time zone expression is required at the end of the value: [-]CCYY-MM-DDThh:mm:ssZ | [-]CCYY-MM-DDThh:mm:ss(+|-)hh:mm.

Supported datatypes beyond OWL 2 QL Standard

• xsd:double. It represents an IEEE double-precision 64-bit floating-point number, a subset of the real numbers.

• xsd:float. It represents an IEEE single-precision 32-bit floating-point number.

• xsd:nonPositiveInteger. It is derived from integer by setting the value of maxInclusive to be 0.

• xsd:positiveInteger. It is derived from integer by setting the value of minInclusive to be 1.

• xsd:negativeInteger. It is derived from nonPositiveInteger by setting the value of maxInclusive to be -1.

• xsd:long. It is derived from integer by setting the value of maxInclusive to be 9223372036854775807 and minInclusive to be -9223372036854775808. The base type is integer.

• xsd:int. It is derived by setting the value of maxInclusive to be 2147483647 and minInclusive to be -2147483648. The base type is long.

• xsd:unsignedInt. It is derived from xsd:unsignedLong by setting the value of maxInclusive to be 4294967295. xsd:unsignedLong is a subset of nonNegativeInteger.

• xsd:boolean. It represents the values of two-valued logic, i.e., logical yes/no values. The valid values for xsd:boolean are true, false, 0, and 1.

Other supported datatypes outside OWL standard

• xsd:gYear. It represents Gregorian calendar years. Month, day, and time have to be absent; the timezone is optional.

• xsd:time. It represents instants of time that recur at the same point in each calendar day. The date is required to be absent.

• xsd:date. It represents top-open intervals of exactly one day in length on the timelines of dateTime. The time has to be absent, while the timezone is optional.


Table 3.3: Datatype constants

Data Type   Valid Writing
LITERAL     "Via Roma 33"
            "Via Roma 33"^^rdfs:Literal
STRING      "Via Roma 33"^^xsd:string
INTEGER     123456789
            "123456789"^^xsd:integer
            +123456789
            "+123456789"^^xsd:integer
            -123456789
            "-123456789"^^xsd:integer
DECIMAL     1234.5678
            "1234.5678"^^xsd:decimal
            +1234.5678
            "+1234.5678"^^xsd:decimal
            -1234.5678
            "-1234.5678"^^xsd:decimal
DOUBLE      1.2345678e+03
            "1.2345678e+03"^^xsd:double
            +1.2345678e+03
            "+1.2345678e+03"^^xsd:double
            -1.2345678e+03
            "-1.2345678e+03"^^xsd:double
DATE-TIME   "2012-02-06T10:48:12Z"^^xsd:dateTime
            "2012-02-06T10:48:12.3Z"^^xsd:dateTime
BOOLEAN     true
            TRUE
            1
            "true"^^xsd:boolean
            "1"^^xsd:boolean

In order to maintain cross-compatibility between the different libraries in the system, some restrictions have been placed on the way constants are used and written in SPARQL queries. rdfs:Literal is different from all other datatypes (in particular it is different from xsd:string). When a value is written inside quotes but no datatype information is given, we assume the value has been provided as rdfs:Literal. Table 3.3 shows the correct writing of constants in SPARQL queries. The first column lists the datatypes that are supported by Ontop; the second column shows several options for writing the constant values.
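These constant forms can be recognised along the following lines; the function parse_constant is an illustrative sketch, not Ontop's actual parser:

```python
import re

# Minimal recogniser for the constant forms of Table 3.3 (an illustrative
# sketch, not Ontop's actual parser).
def parse_constant(tok: str):
    """Return (lexical value, datatype) for a SPARQL constant token."""
    m = re.fullmatch(r'"(.*)"\^\^(\S+)', tok)
    if m:  # typed literal, e.g. "123456789"^^xsd:integer
        return m.group(1), m.group(2)
    m = re.fullmatch(r'"(.*)"', tok)
    if m:  # quoted value without a datatype: assumed rdfs:Literal
        return m.group(1), "rdfs:Literal"
    return tok, None  # bare value such as 123456789 or true

print(parse_constant('"123456789"^^xsd:integer'))  # ('123456789', 'xsd:integer')
```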

Datatypes inside Ontop are converted several times into the necessary formats. The ontology API uses the OWL 2 format, the mappings and SPARQL query answering use the XML Schema format, and Ontop's internal format is called COL_TYPE. Moreover, we need to handle the Java JDBC SQL type format and the database-specific SQL type formats.

Translation inside Ontop occurs from the OWL 2 datatype, XML Schema, and Java JDBC SQL formats to the Ontop type format, and from the Java JDBC type format to the database SQL type format. Cross-compatibility is a major issue for datatypes, since every database supports datatypes in a different way, sometimes in contradiction with the standard. The datatypes have been tested for the major supported databases (MySQL, PostgreSQL, Oracle, SQL Server, H2, DB2). Table 3.4 shows the cross-compatibility of datatypes between the different databases, Ontop, and the XML Schema types.
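Following Table 3.4, one step of this translation chain can be pictured as a simple lookup from JDBC SQL type names to XML Schema datatypes; the dictionary and function below are an illustrative sketch, not Ontop's internal COL_TYPE machinery:

```python
# Illustrative lookup from JDBC SQL type names to XML Schema datatypes,
# in the spirit of Table 3.4 (a sketch, not Ontop's internal COL_TYPE code).
JDBC_TO_XSD = {
    "VARCHAR": "xsd:string",
    "INTEGER": "xsd:integer",
    "DECIMAL": "xsd:decimal",
    "DOUBLE": "xsd:double",
    "TIMESTAMP": "xsd:dateTime",
    "BOOLEAN": "xsd:boolean",
}

def xsd_type(jdbc_name: str) -> str:
    # Unknown SQL types fall back to the top datatype rdfs:Literal
    return JDBC_TO_XSD.get(jdbc_name.upper(), "rdfs:Literal")

print(xsd_type("timestamp"))  # xsd:dateTime
```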

Some limitations in the support are still present. We consider both datatypes given in the ontology and datatypes provided by the mappings. If both are present, they have to match to ensure a unique result at query execution. If no information about the datatype is given in the ontology, the same property can have different datatypes in the different mappings in which it is used. When no information


Table 3.4: Datatypes compatibility

DataType   H2                MySQL             PostgreSQL        Oracle        DB2               MsSQL             Ontop              XmlSchemaType
LITERAL    varchar(n)        varchar(n)        varchar(n)        varchar2(n)   varchar(n)        varchar(n)        COL_TYPE.LITERAL   rdfs:Literal
STRING     varchar(n)        varchar(n)        varchar(n)        varchar2(n)   varchar(n)        nvarchar(n)       COL_TYPE.STRING    xsd:string
INTEGER    integer           integer           integer           number(10,0)  integer           integer           COL_TYPE.INTEGER   xsd:integer
DECIMAL    decimal           decimal           decimal           number(x,y)   decimal           decimal           COL_TYPE.DECIMAL   xsd:decimal
DOUBLE     double precision  double precision  double precision  float(24)     double precision  double precision  COL_TYPE.DOUBLE    xsd:double
DATE-TIME  timestamp         timestamp         timestamp         date          timestamp         datetime          COL_TYPE.DATETIME  xsd:dateTime
BOOLEAN    boolean           boolean           boolean           boolean       —                 bit               COL_TYPE.BOOLEAN   xsd:boolean


about the datatype is given in the ontology or in the mapping, we consider the type from the database. The same happens for particular properties that are not defined in the ontology (for example rdfs:label). When we query datatypes in SPARQL we do not support subtypes: if we ask, for example, for xsd:integer, only values typed as xsd:integer will be returned. If we also want among the results the values typed as xsd:positiveInteger, we have to ask for them explicitly in the query.

3.1.4 Support for multiple references to the same entity

In this section, we present the implementation of the techniques for “ontology-based integration of cross-linked datasets” described in Section 2.2.

Ontology-based integration of cross-linked datasets allows answering SPARQL queries and performing consistency checking over virtually integrated databases. Ontop supports this feature using linking tables to connect the different databases. The linking tables solve the problem of having the same entity distributed over different data sources under different identifiers.

These linking tables are binary tables that maintain pairs of records representing the same entity. The mappings, which contain the relation between the symbols in the ontology (classes and properties) and the SQL views over the data, are extended to use the built-in owl:sameAs property to represent the link between the datasets. For the system to work we need a different URI for each of the datasets involved; this is usually the case because the URI is generated from the primary keys of the data sources, which typically differ from source to source.

The use of owl:sameAs with linking tables works as follows:

1. The client provides linking tables containing the references to the same real-world entity for the involved databases.

2. The OBDA mappings are extended using the owl:sameAs property to relate classes and properties in the ontology to the SQL view of the related linking table. The mappings take transitivity and symmetry into account; transitivity is handled in the SQL query by creating a join between two or more linking tables.

3. Information about the properties and classes involved with owl:sameAs is stored in a map for later reference. The prefix of the URIs is used to distinguish the different owl:sameAs links.

4. The SPARQL query is converted into the Datalog format; the owl:sameAs property is added to the properties and classes present in the map.

5. From the Datalog we generate the SQL query that retrieves the information from the involved databases.
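Step 4 can be sketched as follows; the string-based rule representation and the helper names are illustrative assumptions, not Ontop's internal Datalog data structures:

```python
# Sketch of step 4: expanding a query atom with an owl:sameAs alternative
# (illustrative pseudo-Datalog strings, not Ontop's internal representation).
SAMEAS_PROPS = {"hasName"}  # properties recorded in the owl:sameAs map

def expand(pred, subj, obj):
    rules = [f"ans1({subj},{obj}) :- {pred}({subj},{obj})"]
    if pred in SAMEAS_PROPS:
        # add a variant reaching the property through a sameAs step
        rules.append(
            f"ans1({subj},{obj}) :- sameAs({subj},{subj}_1), "
            f"{pred}({subj}_1,{obj})"
        )
    return rules

for r in expand("hasName", "x", "y"):
    print(r)
```

The resulting union of rules corresponds to the two ans1 clauses shown in Example 3.1.6 below.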

Suppose we have two different databases. Both contain names of wellbores in Sweden, but the first one presents the names of the wellbores using Finnish names and the other one using Spanish names. Ontop does not support federation (this could be obtained using, for example, a system such as Teiid), so we create two simple tables, a Finland table (Table 3.5a) and a Spanish table (Table 3.5b), with a linking table SameAs (Table 3.5c) to connect the two. The linking table is created between the two different databases to record the correspondence between two equivalent wellbores.

(a) Finland table
id  name
1   Aleksi
2   Eljas

(b) Spanish table
id   name
991  Amerigo
992  Luis
993  Sagrada Familia

(c) Linking Table SameAs
id  idspain  idfinland
0   991      1

The mappings are extended to contain owl:sameAs properties. They are created using different URIs to represent the original database. To the standard mapping we add one with the property owl:sameAs, as presented in Example 3.1.3.


Example 3.1.3
target :spain-{idspain} owl:sameAs :finland-{idfinland} .

source select idspain, idfinland from sameas

The inverse of each owl:sameAs present in the mappings is also added. A special class MappingSameAs in Ontop gets the URIs from the mappings that refer to owl:sameAs and stores them in two maps: one for object properties, and another for data properties and classes that have a URI involved in an owl:sameAs. These maps are used during SPARQL query evaluation to search for the properties and classes that have equivalent data. A SPARQL query such as the one in Example 3.1.4 is then translated, before query answering, to take the new mappings into account.

Example 3.1.4
PREFIX : <http://ontop.inf.unibz.it/test/wellbore#>

SELECT ?x ?y

WHERE {

?x :hasName ?y .

}

During the conversion from SPARQL algebra to Datalog, the internal translation of Ontop before SQL generation, we add the information about owl:sameAs as shown in Example 3.1.5. The Datalog format will appear as a union between a property and owl:sameAs, as in Example 3.1.6.

Example 3.1.5
PREFIX : <http://ontop.inf.unibz.it/test/wellbore#>

SELECT ?x ?y

WHERE {

?x owl:sameAs ?x_1 .

?x_1 :hasName ?y .

}

Example 3.1.6
ans1(x,y) :- http://www.w3.org/2002/07/owl#sameAs(x,anon-0x_1),

http://ontop.inf.unibz.it/test/wellbore#hasName(anon-0x_1,y)

ans1(x,y) :- http://ontop.inf.unibz.it/test/wellbore#hasName(x,y)

The generated SQL query searches the databases connected by the linking tables, returning the Finnish names which are equivalent to Spanish names and the other way around. When searching for the Finland name of wellbore 1, we return both the Finnish name Aleksi and the Spanish name Amerigo, since they actually correspond to the same wellbore.
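This behaviour can be simulated with the sample tables above; the in-memory SQLite script below is an illustrative stand-in for the SQL that Ontop would generate over the two real databases:

```python
import sqlite3

# Simulating the linking-table lookup of Tables 3.5a-c with in-memory SQLite
# (an illustrative stand-in for the SQL Ontop generates over two databases).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE finland(id INTEGER, name TEXT);
CREATE TABLE spain(id INTEGER, name TEXT);
CREATE TABLE sameas(id INTEGER, idspain INTEGER, idfinland INTEGER);
INSERT INTO finland VALUES (1,'Aleksi'),(2,'Eljas');
INSERT INTO spain VALUES (991,'Amerigo'),(992,'Luis'),(993,'Sagrada Familia');
INSERT INTO sameas VALUES (0,991,1);
""")
# All names of wellbore finland-1: its own name plus the Spanish
# equivalent reached through the SameAs linking table.
rows = con.execute("""
    SELECT f.name FROM finland f WHERE f.id = 1
    UNION
    SELECT s.name FROM spain s
      JOIN sameas l ON s.id = l.idspain
     WHERE l.idfinland = 1
""").fetchall()
print(sorted(r[0] for r in rows))  # ['Aleksi', 'Amerigo']
```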

3.1.5 Support for Column-Oriented Databases

In the past few years column-oriented relational database systems have attracted a great deal of attention: they have gained considerable popularity, and there has been a noteworthy amount of recent work about them. Column-oriented RDBMSs store each database table column separately, as opposed to traditional row-oriented RDBMSs, which store rows one after the other as a sequence of records. On the row-oriented side, storing data in rows is efficient for insert and update operations; on the column-oriented side, storing data in columns is more I/O-efficient for read-only queries. Other often-cited advantages of column-store RDBMSs are high performance on data compression, on aggregation queries, and on analytical workloads such as those used in data warehouses [1]. Since the Optique project aims at interoperability with various data sources and at efficient execution of aggregations [18], supporting column-oriented RDBMSs is a natural goal.

So far Ontop supports many row-oriented relational databases, including the major commercial relational databases (DB2, Oracle, and MS SQL Server) and the most popular open-source databases (PostgreSQL, MySQL, H2, and HSQLDB). In order to extend the support of Ontop to different kinds of databases that are compatible with JDBC interfaces, we recently started to provide support for MonetDB, an open-source column-oriented RDBMS, and SAP HANA, a cloud-based, in-memory, column-oriented RDBMS. Thanks to the modularity of Ontop, once we support these databases they can be accessed from the Optique platform without any additional configuration.

Users can simply establish a connection between Ontop and MonetDB via the JDBC driver, in the same way as for the row-oriented databases supported by Ontop. However, users have to create a secure connection tunnel to be able to establish a connection between Ontop and the SAP HANA cloud database. The Ontop Wiki page3 explains how to establish this secure connection.

3.2 Interfaces of the Ontop Query Rewriting System

Besides being the core query transformation component of Optique, Ontop is also available via other interfaces. The Java API turns Ontop into a standard Java library that can be embedded into any Java-based system. The Protégé plugin, the Sesame extension, and the command line interface of Ontop can be used as lightweight interfaces for quickly editing mappings and ontologies, and for evaluating SPARQL queries. During the third year of the Optique project, we have made improvements to these interfaces.

3.2.1 Protégé Plugin

Protégé is a free, open-source ontology editor. It is widely used by a strong community of academic, government, and corporate users. Protégé is based on Java and is extensible via a plugin architecture. The Ontop plugin extends Protégé with an OBDA mapping editor and a SPARQL query testing environment.

Before v1.15, the Ontop plugin was developed for Protégé v4. One issue with Protégé v4 is its incompatibility with Java 8; hence, the Ontop plugin was also incompatible with Java 8. To address this issue, in Ontop v1.15 we have upgraded the Ontop plugin to support Protégé v5 and therefore Java 8.

3.2.2 Sesame SPARQL Protocol HTTP service

The Sesame framework provides two Java Servlet-based Web applications for (i) exposing Sesame repositories as SPARQL protocol HTTP services and (ii) configuring and querying these services. These applications are respectively called Sesame Server and Sesame Workbench. The Ontop project delivers an extended version of Sesame Workbench including an HTML form that lets the users provide the necessary information (such as the ontology and mapping files) for setting up an Ontop Sesame repository.

During the reported year, we have improved this integration by validating the configuration before creating a Sesame repository. This prevents the creation of invalid repositories that can neither be reconfigured afterwards nor easily deleted, a problem that users had reported as significantly counter-intuitive. This improvement was implemented in Ontop v1.14.

3.2.3 Ontop Java API

To allow developers to build their systems using Ontop as a Java library, Ontop implements two widely used Java APIs, which are also available as Maven artifacts:

• Sesame4 is a de-facto standard framework for processing RDF data. Ontop implements the Sesame Storage And Inference Layer (SAIL) API, supporting inferencing and querying over relational databases. The Sesame API implementation is used for the integration with the Optique platform.

• OWL API5 is a reference implementation for creating, manipulating, and serializing OWL ontologies. We extended the OWLReasoner interface to support SPARQL query answering. The OWL API implementation is used for the Ontop Protégé plugin.

3 https://github.com/ontop/ontop/wiki/ObdalibPluginJDBC
4 http://rdf4j.org/
5 https://github.com/owlcs/owlapi/wiki


During the third year of Optique, we have updated the Sesame API and OWL API support to leverage Java 7 features such as AutoCloseable and try-with-resources. The following code snippet shows how to use the new Java-7 style of the Ontop OWL API.

try (QuestOWL reasoner = factory.createReasoner(owlOntology, new SimpleConfiguration());
     QuestOWLConnection conn = reasoner.getConnection();
     QuestOWLStatement st = conn.createStatement();
     QuestOWLResultSet rs = st.executeTuple(sparqlQuery)) {
    while (rs.nextRow()) {
        // work with the result set
    }
}
// The resources (Reasoner, Connection, Statement, and ResultSet) are released
// automatically; no need for a "finally" block.

To demonstrate the usage of these APIs, we have created a dedicated GitHub repository with examples6.

3.2.4 Command-Line Interface

The command line interface (CLI) of Ontop exposes the core functionality and several utilities. It is an easy way to get the system quickly set up, to test it for correct execution, and to query or materialize data as needed. It is also useful for teaching purposes.

Before v1.15, the Ontop CLI consisted of several scripts using inconsistent parameters, and the related Java classes were spread over different modules in the code base. To make the CLI more accessible, we have refactored the related Java classes by putting them into a new module and combining the scripts into a shell script (ontop for *nix) and a bat file (ontop.bat for Windows). We also added several utilities for manipulating mapping files in R2RML and in the Ontop native syntax. More details can be found on a dedicated Wiki page7.

3.3 Releases

Ontop is released under the Apache License and the source code is on GitHub8. As in previous versions, Ontop has been continuously published as a Java library on the central Maven repository. The ready-to-use binary packages have been hosted on SourceForge since v1.15, having moved from an internal server. We keep releasing new stable versions with bug fixes and new features every three to four months.

3.3.1 Ontop Bundles on SourceForge

The releases of the ready-to-use bundles of Ontop (e.g., the command line tools, the Protégé plugin, and the Sesame workbench bundle) were previously hosted on a server of the Free University of Bozen-Bolzano. To make these files more accessible, since the release of v1.15 we have moved the release channel to SourceForge9. The hosting services of SourceForge are more reliable, and downloading can be faster thanks to the SourceForge mirrors. In addition, SourceForge provides statistics about the downloads by end users. The log showed that version 1.15 has been downloaded more than 1000 times (see Figure 3.1).

6 https://github.com/ontop/ontop-api-examples
7 https://github.com/ontop/ontop/wiki/OntopCLI
8 https://github.com/ontop/ontop/
9 https://sourceforge.net/projects/ontop4obda/


Figure 3.1: Download Statistics of Ontop v1.15 (as of 22 October, 2015)

3.3.2 Released Versions

In the third year of Optique, we have released three stable versions of Ontop. We provide here a brief summary of the new features of each release. A complete change log is available on a dedicated wiki page10.

Version 1.14, released on 04/11/2014. This version featured better support for datatypes (cf. Section 3.1.3) and stronger validation for the Sesame Workbench (cf. Section 3.2.2).

Version 1.15, released on 13/05/2015. This version upgraded the Protégé plugin to Protégé 5 (cf. Section 3.2.1). It features a new command line interface (cf. Section 3.2.4), and support for SQL concat and replace in the mappings (cf. Section 3.1.1).

Version 1.16, released on 14/10/2015. This version featured support for SPARQL functions (cf. Section 3.1.2), and provided support for the column-oriented RDBMSs MonetDB and SAP HANA (cf. Section 3.1.5).

10 https://github.com/ontop/ontop/wiki/OntopReleases


Bibliography

[1] Daniel J. Abadi, Peter A. Boncz, and Stavros Harizopoulos. Column-oriented database systems. Proc. of the VLDB Endowment, 2(2):1664–1665, August 2009.

[2] Alessandro Artale, Diego Calvanese, Roman Kontchakov, and Michael Zakharyaschev. The DL-Lite family and relations. J. of Artificial Intelligence Research, 36:1–69, 2009.

[3] Pablo Barceló, Leonid Libkin, Anthony Widjaja Lin, and Peter T. Wood. Expressive languages for path queries over graph-structured data. ACM Trans. on Database Systems, 37(4):31, 2012.

[4] Pablo Barceló Baeza. Querying graph databases. In Proc. of the 32nd ACM SIGACT SIGMOD SIGAI Symp. on Principles of Database Systems (PODS), pages 175–188, 2013.

[5] Anders Berglund et al. XML Path Language (XPath) 2.0 (Second Edition). W3C Recommendation, World Wide Web Consortium, December 2010. Available at http://www.w3.org/TR/xpath20.

[6] Meghyn Bienvenu, Magdalena Ortiz, and Mantas Simkus. Conjunctive regular path queries in lightweight description logics. In Proc. of the 23rd Int. Joint Conf. on Artificial Intelligence (IJCAI), 2013.

[7] Alexander Borgida and Ronald J. Brachman. Conceptual modeling with description logics. In Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter F. Patel-Schneider, editors, The Description Logic Handbook: Theory, Implementation and Applications, chapter 10, pages 349–372. Cambridge University Press, 2003.

[8] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. of Automated Reasoning, 39(3):385–429, 2007.

[9] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. Conjunctive query containment and answering under description logics constraints. ACM Trans. on Computational Logic, 9(3):22.1–22.31, 2008.

[10] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. Containment of conjunctive regular path queries with inverse. In Proc. of the 7th Int. Conf. on the Principles of Knowledge Representation and Reasoning (KR), pages 176–185, 2000.

[11] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. Reasoning on regular path queries. SIGMOD Record, 32(4):83–92, 2003.

[12] Diego Calvanese, Thomas Eiter, and Magdalena Ortiz. Regular path queries in expressive description logics with nominals. In Proc. of the 21st Int. Joint Conf. on Artificial Intelligence (IJCAI), pages 714–720, 2009.

[13] Diego Calvanese, Davide Lanti, Martin Rezk, Mindaugas Slusnys, and Guohui Xiao. A scalable benchmark for OBDA systems: Preliminary report. In Proc. of the 3rd Int. Workshop on OWL Reasoner

32

Page 33: Runtime Query Rewriting Techniques - Optique | Scalable End-user … · 2015-11-19 · Runtime Query Rewriting Techniques This document summarises deliverable D6.3 of project FP7-318338

Optique Deliverable D6.3 Runtime Query Rewriting Techniques

Evaluation (ORE), volume 1207 of CEUR Electronic Workshop Proceedings, http://ceur-ws.org/,pages 36–43, 2014.

[14] Mariano P. Consens and Alberto O. Mendelzon. GraphLog: a visual formalism for real life recursion. In Proc. of the 9th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS), pages 404–416, 1990.

[15] I. F. Cruz, A. O. Mendelzon, and P. T. Wood. A graphical query language supporting recursion. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 323–330, 1987.

[16] Bob DuCharme. Learning SPARQL. O’Reilly Media, Inc., 2011.

[17] Thomas Eiter, Carsten Lutz, Magdalena Ortiz, and Mantas Simkus. Query answering in description logics with transitive roles. In Proc. of the 21st Int. Joint Conf. on Artificial Intelligence (IJCAI), pages 759–764, 2009.

[18] Martin Giese, Diego Calvanese, Peter Haase, Ian Horrocks, Yannis Ioannidis, Herald Kllapi, Manolis Koubarakis, Maurizio Lenzerini, Ralf Möller, Mariano Rodriguez-Muro, Özgür Özcep, Riccardo Rosati, Rudolf Schlatte, Michael Schmidt, Ahmet Soylu, and Arild Waaler. Scalable end-user access to big data. In Rajendra Akerkar, editor, Big Data Computing. CRC Press, 2013.

[19] Birte Glimm, Carsten Lutz, Ian Horrocks, and Ulrike Sattler. Conjunctive query answering for the description logic SHIQ. J. of Artificial Intelligence Research, 31:157–204, 2008.

[20] Steve Harris and Andy Seaborne. SPARQL 1.1 Query Language. W3C Recommendation, World Wide Web Consortium, March 2013. Available at http://www.w3.org/TR/sparql11-query.

[21] Adila Krisnadhi and Carsten Lutz. Data complexity in the EL family of description logics. In Proc. of the 14th Int. Conf. on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR), pages 333–347, 2007.

[22] Carsten Lutz. The complexity of conjunctive query answering in expressive description logics. In Proc. of the 4th Int. Joint Conf. on Automated Reasoning (IJCAR), volume 5195 of Lecture Notes in Artificial Intelligence, pages 179–193. Springer, 2008.

[23] Magdalena Ortiz, Diego Calvanese, and Thomas Eiter. Data complexity of query answering in expressive description logics via tableaux. J. of Automated Reasoning, 41(1):61–98, 2008.

[24] Magdalena Ortiz, Sebastian Rudolph, and Mantas Simkus. Query answering in the Horn fragments of the description logics SHOIQ and SROIQ. In Proc. of the 22nd Int. Joint Conf. on Artificial Intelligence (IJCAI), pages 1039–1044, 2011.

[25] Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. nSPARQL: A navigational language for RDF. J. of Web Semantics, 8(4):255–270, 2010.


Appendix A

Rules and Ontology Based Data Access

This appendix reports the paper:

Guohui Xiao, Martin Rezk, Mariano Rodriguez-Muro, and Diego Calvanese: Rules and Ontology Based Data Access. In Proc. of the 8th International Conference on Web Reasoning and Rule Systems (RR), 2014.


Rules and Ontology Based Data Access

Guohui Xiao¹, Martin Rezk¹, Mariano Rodríguez-Muro², and Diego Calvanese¹

¹ Faculty of Computer Science, Free University of Bozen-Bolzano, Italy
² IBM Watson Research Center, USA

Abstract. In OBDA an ontology defines a high-level global vocabulary for user queries, and such vocabulary is mapped to (typically relational) databases. Extending this paradigm with rules, e.g., expressed in SWRL or RIF, boosts the expressivity of the model and the reasoning ability to take into account features such as recursion and n-ary predicates. We consider evaluation of SPARQL queries under rules with linear recursion, which in principle is carried out by a 2-phase translation to SQL: (1) the SPARQL query, together with the RIF/SWRL rules and the mappings, is translated to a Datalog program, possibly with linear recursion; (2) the Datalog program is converted to SQL by using recursive common table expressions. Since a naive implementation of this translation generates inefficient SQL code, we propose several optimisations to make the approach scalable. We implement and evaluate the techniques presented here in the Ontop system. To the best of our knowledge, this results in the first system supporting all of the following W3C standards: the OWL 2 QL ontology language, R2RML mappings, SWRL rules with linear recursion, and SPARQL queries. The preliminary but encouraging experimental results on the NPD benchmark show that our approach is scalable, provided optimisations are applied.

1 Introduction

In Ontology Based Data Access (OBDA) [5], the objective is to access data through a conceptual layer. Usually, this conceptual layer is expressed in the form of an OWL or RDFS ontology, and the data is stored in relational databases. The terms in the conceptual layer are mapped to the data layer using so-called global-as-view (GAV) mappings, associating to each element of the conceptual layer a (possibly complex) query over the data sources. GAV mappings have been described as Datalog rules in the literature [17] and formalized in the R2RML W3C standard [8]. Independently of the mapping language, these rules entail a virtual RDF graph that uses the ontology vocabulary. This virtual graph can then be queried using an RDF query language such as SPARQL.

There are several approaches for query answering in the context of OBDA, and a number of techniques have been proposed [17,16,13,21,9,3]. One such technique, and the focus of this paper, is query answering by query rewriting: that is, answering the queries posed by the user (e.g., SPARQL queries) by translating them into queries over the database (e.g., SQL). This kind of technique has several desirable features; notably, since all data remains in the original source there is no redundancy, the system immediately reflects any changes in the data, well-known optimizations for relational databases can be used, etc. It has been shown that through this technique one can obtain performance comparable, and sometimes superior, to other approaches when the ontology language is restricted to OWL 2 QL [22]. While the OWL 2 QL specification (which subsumes RDFS in expressive power) offers a good balance between expressivity and performance, there are many scenarios where this expressive power is not enough.

As a motivating example, and to illustrate the main concepts in this paper, suppose we have a (virtual) RDF graph over a database with information about direct flights between locations and their respective cost. Suppose we have a flight relation in the database, and we want to find all the possible (direct and non-direct) routes between two locations such that the total cost is less than 100 Euros. This problem is a particular instance of the well-known reachability problem, where we need to compute the transitive closure of the flight relation while respecting the constraint on the flight cost. While SPARQL 1.1 provides path expressions that can be used to express the transitive closure of a property, using them may be cumbersome and prone to errors, especially in the presence of path constraints such as the cost in our example.

Computational complexity results show that unless we limit the form of the allowed rules, on-the-fly query answering by rewriting into SQL Select-Project-Join (SPJ) queries is not possible [6,2]. However, as the target language for query rewriting, typically only a fragment of the expressive power of SQL99 has been considered, namely unions of SPJ SQL queries. We propose here to go beyond this expressive power, and we advocate the use of SQL99’s Common Table Expressions (CTEs) to obtain a form of linear recursion in the rewriting target language. In this way, we can deal with recursive rules at the level of the ontology, and can reuse existing query rewriting optimisations developed for OBDA to provide efficient query rewriting into SQL99. The languages that we target are those that are used most extensively in the context of OBDA for Semantic Web applications, i.e., RIF and SWRL as rule languages, SPARQL 1.0 as query language, and R2RML as relational-database-to-RDF mapping language.

The contributions of this paper can be summarized as follows: (i) We provide translations from SWRL, R2RML, and SPARQL into relational algebra extended with a fixed-point operator that can be expressed in SQL99’s Common Table Expressions (CTEs); (ii) we show how to extend existing OBDA optimisation techniques that have been proven effective in the OWL 2 QL setting to this new context; in particular, we show that so-called T-mappings for recursive programs exist and how to construct them; (iii) we provide an implementation of these techniques in the open source OBDA system Ontop, making it the first system of its kind to support all the following W3C recommendations: OWL 2 QL, R2RML, SPARQL, and SWRL; (iv) we provide a preliminary evaluation of the techniques using an extension of the NPD benchmark (a recently developed OWL 2 QL benchmark) with rules, and show that the proposed solution competes with and sometimes outperforms existing triple stores.

2 Preliminaries

2.1 RDF

The Resource Description Framework (RDF) is a standard model for data interchange on the Web [15]. The language of RDF contains the following pairwise disjoint and countably infinite sets of symbols: I for IRIs, L for RDF literals, and B for blank nodes. RDF terms are elements of the set T = I ∪ B ∪ L. An RDF knowledge base (also called RDF graph) is a collection of triples of the form (s, p, o), where s ∈ I ∪ B, p ∈ I, and o ∈ T. A triple (s, p, o) intuitively expresses that s and o are related by p; when p is the special role rdf:type, the triple (s, rdf:type, o) means that s is an instance of o.

It is sometimes convenient to define conversions between RDF graphs and sets of (Datalog) facts. Thus, given an RDF graph G, the corresponding set of Datalog facts is:

A(G) = {o(s) | (s, rdf:type, o) ∈ G} ∪ {p(s, o) | (s, p, o) ∈ G, p ≠ rdf:type}

And given a set A of facts, the corresponding RDF graph is:

G(A) = {(s, rdf:type, o) | o(s) ∈ A} ∪ {(s, p, o) | p(s, o) ∈ A}

Note that G(A) discards the facts that are not unary or binary.
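As an illustration, the two conversions A(G) and G(A) can be sketched in Python, representing triples as tuples and facts as (predicate, argument-tuple) pairs. This encoding is ours, chosen for the example, and not part of the paper:

```python
# Sketch of the A(G) and G(A) conversions from Section 2.1.
# Triples are (s, p, o) tuples; facts are (predicate, args) pairs.

RDF_TYPE = "rdf:type"

def facts_of_graph(graph):
    """A(G): a unary fact o(s) for each rdf:type triple, a binary fact p(s, o) otherwise."""
    return {(o, (s,)) for (s, p, o) in graph if p == RDF_TYPE} | \
           {(p, (s, o)) for (s, p, o) in graph if p != RDF_TYPE}

def graph_of_facts(facts):
    """G(A): back to triples; facts that are not unary or binary are discarded."""
    return {(args[0], RDF_TYPE, pred) for (pred, args) in facts if len(args) == 1} | \
           {(args[0], pred, args[1]) for (pred, args) in facts if len(args) == 2}

g = {(":AF22", ":flightFrom", ":Bolzano"), (":AF22", RDF_TYPE, ":Flight")}
a = facts_of_graph(g)
assert a == {(":flightFrom", (":AF22", ":Bolzano")), (":Flight", (":AF22",))}
assert graph_of_facts(a) == g  # round trip holds on unary/binary facts
```

The round trip G(A(G)) = G holds exactly because A(G) only ever produces unary and binary facts.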

2.2 SPARQL

SPARQL is the standard RDF query language. For formal purposes we will use the algebraic syntax of SPARQL, similar to the one in [18] and defined in the standard.¹ However, to ease the understanding, we will often use graph patterns (the usual SPARQL syntax) in the examples. The SPARQL language that we consider shares with RDF the sets of symbols: constants, blank nodes, IRIs, and literals. In addition, it adds a countably infinite set V of variables. The SPARQL algebra is constituted by the following graph pattern operators (written using prefix notation): BGP (basic graph pattern), Join, LeftJoin, Filter, and Union. A basic graph pattern is a statement of the form BGP(s, p, o). In the standard, a BGP can contain several triples, but since we include here the join operator, it suffices to view BGPs as the result of a Join of their constituent triple patterns. Observe that the only difference between blank nodes and variables in BGPs is that the former do not occur in solutions. So, to ease the presentation, we assume that BGPs contain no blank nodes. Algebra operators can be nested freely. Each of these operators returns the result of the sub-query it describes.

Definition 1 (SPARQL Query). A SPARQL query is a pair (V, P), where V is a set of variables, and P is a SPARQL algebra expression in which all variables of V occur.

We will often omit V when it is understood from the context. A substitution θ is a partial function θ : V ↦ T. The domain of θ, denoted dom(θ), is the subset of V where θ is defined. Here we write substitutions using postfix notation. When a query (V, P) is evaluated, the result is a set of substitutions whose domain is contained in V. For space reasons, we omit the semantics of SPARQL, and refer to [11] for the specification of how to compute the answer of a query Q over an RDF graph G, which we denote by ⟦Q⟧_G.

Example 1 (Flights, continued). Consider the flight example in the introduction. The low cost flights from Bolzano can be retrieved by the query:

¹ http://www.w3.org/TR/rdf-sparql-query/#sparqlAlgebra


Select ?x Where {
  ?x :tripPlanFrom :Bolzano .
  ?x :tripPlanTo ?y .
  ?x :tripPlanPrice ?z .
  Filter(?z < 100)
}

The corresponding SPARQL algebra expression is as follows:

Filter(?z < 100)(
  Join(BGP(?x :tripPlanFrom :Bolzano .)
       Join(BGP(?x :tripPlanTo ?y .) BGP(?x :tripPlanPrice ?z .))))

2.3 Rules: RIF and SWRL

We describe now two important rule languages, SWRL and RIF, and the semantics of their combination with RDF graphs.

The Semantic Web Rule Language (SWRL) is a widely used Semantic Web language combining a DL ontology component with rules.² Notice that the SWRL language allows only for the use of unary and binary predicates. SWRL is implemented in many systems, such as Pellet, Stardog, and HermiT.

The Rule Interchange Format (RIF) is a W3C recommendation [12] defining a language for expressing rules. The standard RIF dialects are Core, BLD, and PRD. RIF-Core provides “safe” positive Datalog with built-ins; RIF-BLD (Basic Logic Dialect) is positive Horn logic, with equality and built-ins; RIF-PRD (Production Rules Dialect) adds a notion of forward-chaining rules, where a rule fires and then performs some action. In this paper we focus on RIF-Core [4], which is equivalent to Datalog without negation, but supports an F-Logic style frame-like syntax: s[p1 → o1, p2 → o2, . . . , pn → on] is a shorthand for the conjunction of atoms

⋀_{pi = rdf:type} oi(s) ∧ ⋀_{pi ≠ rdf:type} pi(s, oi).

Observe that the RIF language allows for additional n-ary Datalog predicates besides the unary concept names and binary role names from the RDF vocabulary. In this paper, we make the restriction that variables cannot be used in place of the predicates. For instance, neither s[?p → o] nor s[rdf:type → ?o] is allowed.

For the sake of simplicity, in the following we will use Datalog notation, where (SWRL or RIF) rules are simply written as

l0 :- l1, . . . , lm

where each li is an atom. Therefore we refer to a set of rules as Datalog rules. Recall that a Datalog program Π that does not contain negation has a unique minimal model, which can be computed via repeated exhaustive application of its rules in a bottom-up fashion [14]. We denote such a model by MM(Π).
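The bottom-up computation of MM(Π) can be sketched as follows. The encoding of rules and atoms is illustrative, not taken from any system: a rule is a (head, body) pair, an atom is a (predicate, terms) pair, and strings starting with `?` are variables.

```python
# Naive bottom-up evaluation of a positive Datalog program, computing MM(Π):
# apply all rules exhaustively until no new fact is derived.

def unify(atom, fact, subst):
    """Extend substitution subst so that atom matches fact, or return None."""
    pred, args = atom
    fpred, fargs = fact
    if pred != fpred or len(args) != len(fargs):
        return None
    s = dict(subst)
    for t, v in zip(args, fargs):
        if t.startswith("?"):
            if s.setdefault(t, v) != v:
                return None
        elif t != v:
            return None
    return s

def minimal_model(rules, facts):
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            substs = [{}]
            for atom in body:  # match each body atom against the current model
                substs = [s2 for s in substs for f in model
                          for s2 in [unify(atom, f, s)] if s2 is not None]
            for s in substs:
                pred, args = head
                new = (pred, tuple(s.get(t, t) for t in args))
                if new not in model:
                    model.add(new)
                    changed = True
    return model

# Transitive closure of an edge relation as a small recursive program.
facts = {("edge", ("a", "b")), ("edge", ("b", "c"))}
rules = [
    (("path", ("?x", "?y")), [("edge", ("?x", "?y"))]),
    (("path", ("?x", "?z")), [("edge", ("?x", "?y")), ("path", ("?y", "?z"))]),
]
assert ("path", ("a", "c")) in minimal_model(rules, facts)
```

The fixpoint terminates because positive Datalog derives facts only over the constants already present in the program.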

An RDF-rule combination is a pair (G,Π), where G is an RDF graph and Π is a setof Datalog rules.

Definition 2. The RDF graph induced by an RDF-rule combination (G,Π) is defined as G(MM(A(G) ∪ Π)).

² http://www.w3.org/Submission/SWRL/


Example 2 (Example 1, continued). The following rules model the predicates plan, representing the transitive closure of flights, including the total price of the trip, and tripPlanFrom/To/Price, which project plan into triples:

plan(from, to, price, plan_url) :- flightFrom(fid, from), flightTo(fid, to),
                                   flightPrice(fid, price),
                                   plan_url = CONCAT("http://flight/", fid)

plan(from, to, price, plan_url) :- plan(from, to1, price1, plan_url1),
                                   flightFrom(fid, to1), flightTo(fid, to),
                                   flightPrice(fid, price2),
                                   price = price1 + price2,
                                   plan_url = CONCAT(plan_url1, "/", fid)

tripPlanFrom(plan_url, from) :- plan(from, to, price, plan_url)
tripPlanTo(plan_url, to) :- plan(from, to, price, plan_url)
tripPlanPrice(plan_url, price) :- plan(from, to, price, plan_url)

Observe that rules not only boost the modelling capabilities by adding recursion, but also allow for a more elegant and succinct representation of the domain using n-ary predicates.

2.4 SPARQL and Rules: Entailment Regime

The RIF entailment regime specifies how RIF entailment can be used to redefine the evaluation of basic graph patterns. The evaluation of complex clauses is computed by combining already computed solutions in the usual way. Therefore, in this section we can restrict the attention to queries that consist of a single BGP.

The semantics provided in [10] is defined in terms of pairs of RIF and RDF interpretations. These models are then used to define satisfiability and entailment in the usual way. Combined entailment extends both entailment in RIF and entailment in RDF. To ease the presentation, we will present a simplified version of this semantics based on Datalog models seen as RDF graphs (cf. Section 2.1).

Definition 3. Let Q be a BGP, G an RDF graph, and Π a set of rules. The evaluation of Q over G and Π, denoted ⟦Q⟧_{G,Π}, is defined as the evaluation of Q over the induced RDF graph of (G,Π), that is

⟦Q⟧_{G,Π} = ⟦Q⟧_{G(MM(A(G)∪Π))}.

Example 3 (Example 2, continued). Suppose we have the following triples in our RDF graph:

AF22 :flightFrom :Bolzano .
AF22 :flightTo :Milano .
AF22 :flightPrice 45 .
AF23 :flightFrom :Milano .
AF23 :flightTo :Dublin .
AF23 :flightPrice 45 .

It is easy to see that these triples, together with the rules in Example 2, “extend” the original RDF graph with the following triples:

http://flight/AF22/AF23 :tripPlanFrom :Bolzano .
http://flight/AF22/AF23 :tripPlanTo :Dublin .
http://flight/AF22/AF23 :tripPlanPrice 90 .


[Figure: a tree with a root triple map node linked via rr:logicalTable to exactly one logical table node, via rr:subjectMap to exactly one subject map node (optionally carrying rr:class IRIs), and via rr:predicateObjectMap to one or more predicate object map nodes, each with predicate map and object map children.]

Fig. 1. A well formed R2RML mapping node

This implies that we get the following two substitutions by evaluating the query in Example 1: {x ↦ http://flight/AF22}, {x ↦ http://flight/AF22/AF23}.

2.5 R2RML: Mapping Databases to RDF

R2RML is a W3C standard [8] defining a language for mapping relational databases into RDF data. Such mappings expose the relational data as RDF triples, using a structure and vocabulary chosen by the mapping author.

An R2RML mapping is expressed as an RDF graph (in Turtle syntax), where a well-formed mapping consists of one or more trees called triple maps, with a structure as shown in Figure 1. Each tree has a root node, called the triple map node, which is linked to exactly one logical table node, one subject map node, and one or more predicate object map nodes. Intuitively, each triple map states how to construct a set of triples (subject, predicate, object) using the information contained in the logical table (specified as an SQL query).

The R2RML syntax is rather verbose; therefore, due to space limitations, in this paper we represent triple maps using standard Datalog rules of the form:

predicate(subject, object) :- body
concept(subject) :- body

where body is a conjunction of atoms that refers to the database relations, possibly making use of auxiliary relations representing the SQL query of the mapping, when the semantics of such a query cannot be captured in Datalog. For the formal translation from R2RML to Datalog, we refer to [23].

Example 4 (Example 2, continued). We present the R2RML rules mapping a relational database to the relations flightFrom and flightPrice. Recall that the relations tripPlanFrom, tripPlanTo, etc. are defined by rules. Suppose we have a table flight in the database, with attributes id, departure, arrival, segment, and cost. Then the mappings are as follows:

flightFrom(id, departure) :- flight(id, departure, arrival, segment, cost)
flightPrice(id, cost) :- flight(id, departure, arrival, segment, cost)

Next we define the RDF graph induced by a set of mapping rules and a database.


Definition 4 (Virtual RDF Graph via R2RML Mapping). Let M be a set of R2RML mappings (represented in Datalog), and I a relational database instance. Then the virtual RDF graph M(I) is defined as the RDF graph corresponding to the minimal model of M ∪ I, i.e., M(I) = G(MM(M ∪ I)).

3 Answering SPARQL over Rules and Virtual RDF

In this section we describe how we translate SPARQL queries over a rule-enriched vocabulary into SQL. The translation consists of two steps: (i) translation of the SPARQL query and the RIF rules into a recursive Datalog program, and (ii) generation of an SQL query (with CTEs) from the Datalog program.

3.1 SPARQL to Recursive Datalog

The translation we present here extends the one described in [18,19,23], where the authors define a translation function τ from SPARQL queries to non-recursive Datalog programs with stratified negation. Due to space limitations, we do not provide the details of the translation τ, but illustrate it with an example, and refer to [18,19] for its correctness. Note that in this paper we only consider BGPs corresponding to atoms in SWRL or RIF rules; in other words, triple patterns like (t1, ?x, t2) or (t, rdf:type, ?x) are disallowed (cf. the restrictions on RIF in Section 2.3).

Example 5. Consider the query (V, P) in Example 1, for which we report below the algebra expression, in which we have labeled each sub-expression Pi of P.

Filter (?z < 100)( # P1

Join( # P2

BGP(?x :tripPlanFrom :Bolzano .) # P3

Join( # P4

BGP(?x :tripPlanTo ?y .) # P5

BGP(?x :tripPlanPrice ?z .)))) # P6

The Datalog translation contains one predicate ansi representing each algebra sub-expression Pi. The Datalog program τ(V, P) for this query is as follows:

ans1(x) :- ans2(x, y, z), Filter(z < 100)
ans2(x, y, z) :- ans3(x), ans4(x, y, z)
ans3(x) :- tripPlanFrom(x, :Bolzano)
ans4(x, y, z) :- ans5(x, y), ans6(x, z)
ans5(x, y) :- tripPlanTo(x, y)
ans6(x, z) :- tripPlanPrice(x, z)

The overall translation of a SPARQL query and a set of rules to recursive Datalog is defined as follows.

Definition 5. Let Q = (V, P) be a SPARQL query and Π a set of rules. We define the translation of Q and Π to Datalog as the Datalog program µ(Q,Π) = Π ∪ τ(V, P).


Observe that while τ(V, P ) is not recursive, Π and therefore µ(Q,Π) might be so.

Proposition 1. Let (G,Π) be an RDF-rule combination, Q a SPARQL query, and θ a solution mapping. Then

θ ∈ ⟦Q⟧_{G,Π} if and only if µ(Q,Π) ∪ A(G) |= ansQ(θ)

where ansQ is the predicate in µ(Q,Π) corresponding to Q.

Proof. Let Q = (V, P). By definition, τ(V, P) is a stratified Datalog program and we assume that it has a stratification (S0, . . . , Sn). As Π is a positive program, clearly (Π ∪ A(G), S0, . . . , Sn) is a stratification of µ(Q,Π) ∪ A(G). Then the following statements are equivalent:

– θ ∈ ⟦Q⟧_{G,Π} = ⟦Q⟧_{G(MM(A(G)∪Π))}

– (by the correctness of the translation τ(V, P) [19])
  ansQ(θ) ∈ MM(τ(V, P) ∪ MM(A(G) ∪ Π))

– (by the definition of GL-reduction)
  ansQ(θ) ∈ MM(τ(V, P)^{MM(A(G)∪Π)})

– (by the definition of the model of a stratified Datalog program)
  ansQ(θ) ∈ MM(τ(V, P) ∪ A(G) ∪ Π) = MM(µ(Q,Π) ∪ A(G))

– µ(Q,Π) ∪ A(G) |= ansQ(θ)   □

Considering that R2RML mappings are represented in Datalog, we immediately obtain the following result.

Corollary 1. Let Q = (V, P) be a SPARQL query, Π a set of rules, M a set of R2RML mappings, I a database instance, and θ a solution mapping. Then

θ ∈ ⟦Q⟧_{M(I),Π} if and only if µ(Q,Π) ∪ M ∪ I |= ansQ(θ)

3.2 Recursive Datalog to Recursive SQL

In this section, we show how to translate to SQL the recursive Datalog program obtained in the previous step. Note that the translation from Datalog to SQL presented in [23] does not consider recursive rules. Here we extend that translation to handle recursion by making use of SQL common table expressions (CTEs). We assume that the extensional database (EDB) predicates (i.e., those not appearing in the heads of Datalog rules) are stored in the database.

Fixpoint Operator, Linear Recursive Queries, and CTEs. We recall first how relational algebra can be extended with a fixpoint operator [1]. Consider an equation of the form R = f(R), where f(R) is a relational algebra expression over R. A least fixpoint of the equation above, denoted LFP(R = f(R)), is a relation S such that

– S = f(S)
– if R is any relation such that R = f(R), then S ⊆ R.


In order to ensure the existence of a fixpoint (and hence of the least fixpoint), f must be monotone.

Consider now the equation R = f(R1, . . . , Rn, R), making use of a relational algebra expression over R and other relations. Such an equation is a linear recursive expression if R occurs exactly once in the expression f. Then, we can split f into a non-recursive part f1 not mentioning R, and a recursive part f2, i.e., f(R1, . . . , Rn, R) = f1(R1, . . . , Rn) ∪ f2(R1, . . . , Rn, R). In this case, LFP(R = f(R1, . . . , Rn, R)) can be expressed using a common table expression (CTE) of SQL99:

WITH RECURSIVE R AS (
    [block for base case f1]
  UNION
    [block for recursive case f2]
)

We remark that CTEs are already implemented in most of the commercial databases, e.g., Postgres, MS SQL Server, Oracle, DB2, H2, and HSQL.

Example 6. Suppose we have the database relation Flight (f, for short), with attributes id, source, destination, and cost (i, s, d, c, for short), and we want to compute all the possible routes such that the total cost is less than 100. To express this relation plan we can use the following equation with a least fixpoint operator:

plan = LFP(f* = π_{var1}(f) ∪ π_{var2}(ρ_{count}(σ_{fil}(f ⋈ f*))))

where

var1 = f.s, f.d, f.c        count = (f.c + f*.c)/c
var2 = f*.s, f.d, c         fil = f.c + f*.c < 100, f.s = f*.d

It can be expressed as the following CTE:

WITH RECURSIVE plan AS (
    SELECT f.s, f.d, f.c FROM f
  UNION
    SELECT plan.s, f.d, f.c + plan.c AS c
    FROM f, plan
    WHERE f.c + plan.c < 100 AND f.s = plan.d
)
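Assuming an SQL engine with recursive CTE support, the CTE above can be run essentially verbatim. The following sketch uses SQLite (whose WITH RECURSIVE follows the SQL99 scheme) with the flight data of Example 3; table and column names mirror the example and are not from any Ontop artifact:

```python
import sqlite3

# Run the recursive CTE of Example 6 on an in-memory SQLite database.
# Table f(s, d, c) plays the role of the Flight relation.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE f (s TEXT, d TEXT, c INT)")
con.executemany("INSERT INTO f VALUES (?, ?, ?)",
                [("Bolzano", "Milano", 45), ("Milano", "Dublin", 45)])

rows = con.execute("""
    WITH RECURSIVE plan(s, d, c) AS (
        SELECT f.s, f.d, f.c FROM f
        UNION
        SELECT plan.s, f.d, f.c + plan.c
        FROM f, plan
        WHERE f.c + plan.c < 100 AND f.s = plan.d
    )
    SELECT s, d, c FROM plan ORDER BY c, s
""").fetchall()

# The two direct legs plus the composed Bolzano -> Dublin trip (total cost 90);
# trips costing 100 or more are pruned by the WHERE clause.
assert rows == [("Bolzano", "Milano", 45), ("Milano", "Dublin", 45),
                ("Bolzano", "Dublin", 90)]
```

Note that UNION (rather than UNION ALL) performs duplicate elimination, which also guarantees termination here since the cost bound keeps the set of derivable rows finite.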

Now we proceed to explain how to translate a recursive program into a relational algebra expression with fixpoints.

The dependency graph of a Datalog program is a directed graph representing the relation between the predicate names in the program. The set of nodes consists of the relation symbols in the program. There is an arc from a node a to a node b if and only if a appears in the body of a rule in whose head b appears. A program is recursive if there is a cycle in the dependency graph.

We say that a Datalog program Π is SQL99 compatible if (i) there are no cycles in the dependency graph of Π apart from self-loops; and (ii) the recursive predicates are restricted to linear recursive ones. For the sake of simplicity, and to ease the presentation, we assume that non-recursive rules have the form

p(x) :- atom1(x1), . . . , atomn(xn), cond(z) (1)

where each atomi is a relational atom, and cond(z) is a conjunction of atoms over built-in predicates. Moreover, for each recursive predicate p, there is a pair of rules defining p of the form

p(x) :- atom0(x0), p(y), cond1(z1)
p(x) :- atom1(x1), . . . , atomn(xn), cond2(z2)    (2)

In addition, we assume that equalities between variables in the rules are made explicit in atoms of the form x = y. Thus, there are no repeated variables in non-equality atoms.
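Condition (i) of SQL99 compatibility can be checked mechanically on the dependency graph. The following sketch uses our own minimal encoding of rules as (head predicate, list of body predicates) pairs; it accepts self-loops but rejects any longer cycle, such as mutual recursion:

```python
# Check condition (i) of SQL99 compatibility: every cycle in the
# dependency graph must be a self-loop.

def dependency_graph(rules):
    """Arc a -> b iff predicate a occurs in the body of a rule whose head is b."""
    arcs = set()
    for head, body in rules:
        arcs |= {(b, head) for b in body}
    return arcs

def only_self_loop_cycles(rules):
    graph = {}
    for a, b in dependency_graph(rules):
        if a != b:                      # self-loops are explicitly allowed
            graph.setdefault(a, set()).add(b)
    # DFS over the reduced graph: any remaining cycle violates condition (i)
    visited, on_stack = set(), set()
    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True             # found a non-self-loop cycle
        on_stack.discard(node)
        return False
    return not any(dfs(n) for n in list(graph) if n not in visited)

# plan (Example 2) is recursive only through a self-loop: compatible
rules = [("plan", ["flightFrom", "flightTo", "flightPrice"]),
         ("plan", ["plan", "flightFrom", "flightTo", "flightPrice"]),
         ("tripPlanFrom", ["plan"])]
assert only_self_loop_cycles(rules)
# mutual recursion p -> q -> p is not expressible as a single linear CTE
assert not only_self_loop_cycles([("p", ["q"]), ("q", ["p"])])
```

Condition (ii), linearity, is a per-rule syntactic check: the recursive predicate may occur at most once in the body of each of its defining rules.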

The intuition behind the next definition is that for each predicate p of arity n in the program, we create a relational algebra expression RA(p).

Definition 6. Let p be a predicate and Π a set of Datalog rules.

– If p is an extensional predicate, then

RA(p(x)) = p

– If p is a non-recursive intensional predicate, let Πp be the set of rules in Π defining p. For each such rule r, which is of the form (1), let

RA(r) = σ_{cond}(RA(atom1(x1)) ⋈ · · · ⋈ RA(atomn(xn)))

where cond is the condition corresponding to the conjunction of atoms cond(z) in (1). Then

RA(p(x)) = ⋃_{r ∈ Πp} RA(r)

– If p is a recursive intensional predicate defined by a pair of rules of the form (2), then

RA(p(x)) = LFP(p = σ_{cond1}(RA(atom0(x0)) ⋈ p) ∪
               σ_{cond2}(RA(atom1(x1)) ⋈ · · · ⋈ RA(atomn(xn))))

where again cond1 and cond2 are the conditions corresponding to the conjunctions of atoms cond1(z1) and cond2(z2) in the two rules defining p.

The next proposition shows that if the rule component of an RDF-rule combination is SQL99 compatible, then the Datalog transformation of the combination is also SQL99 compatible.

Proposition 2. Let (G,Π) be an RDF-rule combination, Q = (V, P) a SPARQL query, and M a set of R2RML mappings. If Π is SQL99 compatible, then µ(Q,Π) ∪ M is also SQL99 compatible.

Proof (Sketch). This can be easily verified by checking the layered structure of µ(Q,Π) ∪ M = τ(V, P) ∪ Π ∪ M. Observe that (1) τ(V, P) and M are non-recursive Datalog programs, and (2) there is no arc from the head predicates of Π (resp. τ(V, P)) to the body predicates of M (resp. Π) in the dependency graph of µ(Q,Π) ∪ M. Therefore no additional cycles are introduced in the dependency graph beyond those already present in Π (if any).

Page 45: Runtime Query Rewriting Techniques - Optique | Scalable End-user … · 2015-11-19 · Runtime Query Rewriting Techniques This document summarises deliverable D6.3 of project FP7-318338

4 Embedding Entailments in Mapping Rules

A first naive implementation of the technique described above was unable to generate SQL queries, due to the blowup caused by the processing of mappings, i.e., by unfolding the predicates defined in the heads of mapping rules with the corresponding queries in the bodies. Thus, it was necessary to optimize the rules and the mappings before unfolding the query. In this section, we present two optimizations based on the key observation that we can optimize the rules together with the R2RML mappings independently of the SPARQL queries.

1. For the recursive rules, we can pre-compute the relational algebra expressions (i.e., the recursive common table expressions (CTEs) in SQL).

2. For the non-recursive rules, we introduce a method to embed entailments into the mapping rules.

4.1 Recursion Elimination

The presence of recursion in the Datalog representation of the SPARQL query gives rise to a number of issues, e.g., efficiently unfolding the program using SLD resolution, and managing the different types of variables. For this reason, before generating SQL for the whole set of rules, (i) we pre-compute CTEs for the recursive predicates, making use of the expressions in relational algebra extended with fixpoints provided in Definition 6, and (ii) we eliminate the recursive rules and replace the recursive predicates by fresh predicates; these fresh predicates are defined by the cached CTEs.
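The pre-computed CTE can be pictured with SQL99's WITH RECURSIVE. The sketch below is an illustration under our own assumptions, not the system's actual SQL: table and column names (edge, src, dst) are invented, and SQLite stands in for the target database.

```python
# Sketch of the CTE pre-computation idea: a recursive predicate p is expressed
# once as a recursive CTE, and the rewritten query then refers only to p.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
conn.executemany("INSERT INTO edge VALUES (?, ?)", [(1, 2), (2, 3), (3, 4)])

rows = conn.execute("""
    WITH RECURSIVE p(src, dst) AS (
        SELECT src, dst FROM edge                                -- base rule
        UNION
        SELECT e.src, p.dst FROM edge e JOIN p ON e.dst = p.src  -- recursive rule
    )
    SELECT src, dst FROM p ORDER BY src, dst
""").fetchall()
# rows is the transitive closure of edge
```

Replacing the recursive predicate by a fresh predicate defined by such a cached CTE leaves a non-recursive program, which the standard unfolding machinery can then handle.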

4.2 T-Mappings for SWRL Rules

We now introduce an extension of the notion of T-mappings [20] to cope with SWRL rules. T-mappings have been introduced in the context of OBDA systems in which queries posed over an OWL 2 QL ontology that is mapped to a relational database are rewritten in terms of the database relations only. They allow one to embed in the mapping assertions entailments over the data that are caused by the ontology axioms, and thus to obtain a more efficient Datalog program. In our setting, T-mappings extend the set of mappings to embed entailments caused by (recursive) rules into the mapping rules. Formally:

Definition 7 (SWRL T-Mappings). Let M be a set of mappings, I a database instance, and Π a set of SWRL rules. A T-mapping for Π w.r.t. M is a set MΠ of mappings such that: (i) every triple entailed by M(I) is also entailed by MΠ(I); and (ii) every fact entailed by Π ∪ A(M(I)) is also entailed by MΠ(I).

A T-mapping for SWRL rules, unlike OWL 2 QL-based T-mappings, can be constructed iteratively, using the existing mappings to generate new mappings that take into account the implications of the rules³. In Algorithm 1, we describe the construction process, which is similar to the classical semi-naive evaluation of Datalog programs.

³ Recall that SWRL allows only for unary and binary predicates.


Algorithm 1: T-Mapping(Π, M)
  Input: a set Π of (SWRL) rules; a set M of R2RML mappings
  Output: T-mapping MΠ of M w.r.t. Π
  ΔM ← M;  MΠ ← ∅;
  while ΔM ≠ ∅ do
      MΠ ← MΠ ∪ ΔM;  ΔM′ ← ∅;
      foreach mapping m ∈ ΔM do
          foreach rule r ∈ Π do
              if m and r resolve then
                  ΔM′ ← ΔM′ ∪ res(m, r);    ▹ res(m, r) is a set of mappings
      ΔM ← ΔM′;
  return MΠ
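A toy rendering of Algorithm 1 may clarify the loop structure. In this sketch of ours, a mapping is simplified to a (concept, SQL) pair and rules to unary inclusions B(x) :- A(x), so that res(m, r) just re-targets the SQL source; the real algorithm resolves arbitrary SWRL rule bodies, and all names below are invented.

```python
# Toy sketch of the semi-naive T-mapping construction of Algorithm 1.

def t_mapping(rules, mappings):
    """rules: set of (body, head) pairs; mappings: set of (concept, sql)."""
    delta, result = set(mappings), set()
    while delta:
        result |= delta                      # MΠ ← MΠ ∪ ΔM
        new = set()
        for concept, sql in delta:
            for body, head in rules:
                if body == concept:          # m and r resolve
                    m2 = (head, sql)         # res(m, r)
                    if m2 not in result:
                        new.add(m2)
        delta = new                          # ΔM ← ΔM′
    return result

rules = {("Wellbore", "NamedEntity"), ("NamedEntity", "Thing")}
mappings = {("Wellbore", "SELECT id FROM wellbore")}
tm = t_mapping(rules, mappings)
# tm also maps NamedEntity and Thing to the same SQL source
```

As in semi-naive evaluation, each iteration resolves only the mappings produced in the previous round, so the loop terminates once no new mapping is derived.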

Theorem 1. Let M be a set of mappings and Π a set of SWRL rules. Then there always exists a T-mapping for Π w.r.t. M.

Proof (sketch). Let I be a database instance. The Datalog program M ∪ Π ∪ I does not contain negation; therefore, it has a unique minimal model, which can be computed via repeated exhaustive application of the rules in it in a bottom-up fashion.

Since the rules in M act as links between the atoms in Π and the ground facts in I, it is clear that if a predicate A in Π does not depend on an intensional predicate in M, every rule containing A can be removed without affecting the minimal model of M ∪ Π ∪ I. Thus, we can safely assume that every predicate in Π depends on an intensional predicate in M. Hence, since every predicate in a rule is defined (directly or indirectly) by mappings in M, we can always replace the predicate by its definition, leaving in this way rules whose bodies use only database relations. □

5 Implementation

The techniques presented here are implemented in the Ontop⁴ system. Ontop is an open-source project released under the Apache License, developed at the Free University of Bozen-Bolzano, and part of the core of the EU project Optique⁵. Ontop is available as a plugin for Protégé 4, as a SPARQL endpoint, and as OWLAPI and Sesame libraries. To the best of our knowledge, Ontop is the first system supporting all the following W3C recommendations: OWL 2 QL, R2RML, SPARQL, and SWRL⁶. Support for RIF and integration of SWRL and OWL 2 QL ontologies will be implemented in the near future.

In Figure 2, we depict the new architecture that modifies and extends our previous OBDA approach by replacing OWL 2 QL with SWRL. First, during loading time, we translate the SWRL rules and the R2RML mappings into a Datalog program. This set of

⁴ http://ontop.inf.unibz.it
⁵ http://www.optique-project.eu/
⁶ SWRL is a W3C submission, but not a W3C recommendation yet.


Fig. 2. SWRL and Query processing in the Ontop system
(The figure depicts the pipeline: at loading time, the SWRL rules and the mappings M are combined by the T-mapping step into a Datalog program Π1; at query time, a SPARQL query q is translated into a Datalog program q′, unfolded with respect to Π1, and translated into an SQL query q′′ that is evaluated over the data D in the DB.)

rules is then optimised as described in Section 4. This process is query independent and is performed only once when Ontop starts. Then the system translates the SPARQL query provided by the user into another Datalog program. None of these Datalog programs is meant to be executed; they are only a formal and succinct representation of the rules, the mappings, and the query in a single language. Given these two Datalog programs, we unfold the query with respect to the rules and the mappings using SLD resolution. Once the unfolding is ready, we obtain a program whose vocabulary is contained in the vocabulary of the data source, and that therefore can be translated to SQL. The technique is able to deal with all aspects of the translation, including URI and RDF literal construction, RDF typing, and SQL optimisation. However, the current implementation supports only a restricted form of queries involving recursion: SPARQL queries with recursion must consist of a single triple involving the recursive predicate. This preliminary implementation is meant to test performance and scalability.
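The unfolding step can be sketched for the propositional case, leaving out the variable unification that the real SLD-based procedure performs: each intensional predicate is replaced by all of its definitions until only database predicates remain, yielding a union of conjunctive queries. Predicate and table names below are invented for the example.

```python
# Sketch: unfold an atom w.r.t. non-recursive rules down to database predicates.

def unfold(atom, rules, db_predicates):
    """Return the set of bodies (tuples of DB atoms) equivalent to `atom`."""
    if atom in db_predicates:
        return {(atom,)}
    expansions = set()
    for head, body in rules:             # rules: (head, tuple-of-body-atoms)
        if head == atom:
            partial = [()]
            for b in body:               # expand each body atom in turn
                partial = [p + u for p in partial
                           for u in unfold(b, rules, db_predicates)]
            expansions.update(partial)
    return expansions

rules = [("ans", ("Wellbore", "hasName")),
         ("Wellbore", ("T_wellbore",)),
         ("hasName", ("T_wellbore",)),
         ("hasName", ("T_names",))]
db = {"T_wellbore", "T_names"}
ucq = unfold("ans", rules, db)
# ucq is a union of 2 conjunctive queries over database tables only
```

Each tuple in the result corresponds to one SELECT block of the final SQL UNION, which is why multiple mappings per predicate multiply the number of unions.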

6 Evaluation

To evaluate the performance and scalability of Ontop with SWRL ontologies, we adapted the NPD benchmark. The NPD benchmark [7] is based on the Norwegian Petroleum Directorate⁷ Fact Pages, which contain information regarding the petroleum activities on the Norwegian continental shelf. We used PostgreSQL as the underlying relational database system. The hardware consisted of an HP ProLiant server with 24 Intel Xeon X5690 CPUs (144 cores @ 3.47 GHz), 106 GB of RAM, and a 1 TB 15K RPM HD. The OS is Ubuntu 12.04 LTS.

The original benchmark comes with an OWL ontology⁸. In order to test our techniques, we translated a fragment of this ontology into SWRL rules by (i) converting the OWL axioms into rules whenever possible; and (ii) manually adding linear recursive rules. The resulting SWRL ontology contains 343 concepts, 142 object properties, 238 data properties, 1428 non-recursive SWRL rules, and 1 recursive rule. The R2RML file

⁷ http://www.npd.no/en/
⁸ http://sws.ifi.uio.no/project/npd-v2/


Table 1. Evaluation of Ontop on NPD benchmark

            System    Load   q1    q2    q3    q4    q5    q6     q7    q8    q9    q10   q11    q12    r1
NPD         Ontop     16.6   0.1   0.09  0.03  0.2   0.02  1.7    0.1   0.07  5.6   0.1   1.4    2.8    0.25
            Stardog   -      2.06  0.65  0.29  1.26  0.20  0.34   1.54  0.70  0.06  0.07  0.11   0.15   -
NPD (×2)    Ontop     17.1   0.12  0.13  0.10  0.25  0.02  3.0    0.2   0.2   5.7   0.3   6.7    8.3    27.8
            Stardog   -      5.60  1.23  0.85  1.89  0.39  2.29   2.41  1.47  0.34  0.36  1.78   1.52   -
NPD (×10)   Ontop     16.7   0.2   0.3   0.17  0.67  0.05  18.08  0.74  0.35  6.91  0.55  162.3  455.4  237.6
            Stardog   -      8.89  1.43  1.17  2.04  0.51  4.12   5.84  5.30  0.42  0.72  3.03   3.86   -

includes 1190 mappings. The NPD query set contains 12 queries obtained by interviewing users of the NPD data.

We compared Ontop with the only other system (to the best of our knowledge) offering SWRL reasoning over on-disk RDF/OWL storage: Stardog 2.1.3. Stardog⁹ is a commercial RDF database developed by Clark & Parsia that supports SPARQL 1.1 queries and OWL 2/SWRL reasoning. Since Stardog is a triple store, we needed to materialize the virtual RDF graph exposed by the mappings and the database using Ontop. In order to test the scalability w.r.t. the growth of the database, we used the data generator described in [7] and produced several databases, the largest being approximately 10 times bigger than the original NPD database. The materialization of NPD (×2) produced 8,485,491 RDF triples and the materialization of NPD (×10) produced 60,803,757 RDF triples. The loading of the last set of triples took around one hour.

The results of the evaluation (in seconds) are shown in Table 1. For queries q1 to q12, we only used the non-recursive rules and compared the performance with Stardog. For the second group (r1), we included the recursive rules, which can only be handled by Ontop.

Discussion. The experiments show that the performance obtained with Ontop is comparable with that of Stardog, and on most queries Ontop is faster. There are 4 queries where Ontop performs poorly compared to Stardog. Due to space limitations, we analyze only one of these 4; however, the reason is the same in each case. Consider query 12, which is a complex query that produces an SQL query with 48 unions. The explosion in the size of the query is produced by the interaction of long hierarchies below the concepts used in the query with the multiple mappings for each of these concepts. For instance, npdv:Wellbore has 24 subclasses, and npdv:name has 27 mappings defining it. Naively, just a join between these two would generate a union of 24 × 27 = 648 SQL queries. Ontop manages to optimize this down to 48 unions, but more work needs to be done to get better performance. This problem is not a consequence of the presence of rules, but lies in the very nature of the OBDA approach, and is one of the main issues to be studied in the future. For Stardog, the situation is slightly easier, as it works on the RDF triples directly and does not need to consider mappings.

⁹ http://stardog.com/


7 Conclusion

In this paper we have studied the problem of SPARQL query answering in OBDA in the presence of rule-based ontologies. We tackle the problem by rewriting the SPARQL queries into recursive SQL. To this end, we provided a translation from SWRL rules into relational algebra extended with fixed-point operators, which can be expressed in SQL99's Common Table Expressions (CTEs). We extended the existing T-mapping optimisation technique in OBDA, proved that for every non-recursive SWRL program there is a T-mapping, and showed how to construct it. The techniques presented in this paper were implemented in the system Ontop. We evaluated its scalability and compared its performance with the commercial triple store Stardog. The results show that most of the SQL queries produced by Ontop are of high quality, allowing fast query answering even in the presence of big data sets and complex queries.

Acknowledgement. This paper is supported by the EU under the large-scale integrating project (IP) Optique (Scalable End-user Access to Big Data), grant agreement n. FP7-318338. We thank Hector Perez-Urbina for his support in evaluating Stardog.

References

1. Aho, A.V., Ullman, J.D.: The universality of data retrieval languages. In: Proc. of the 6th ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages (POPL 1979). pp. 110–120 (1979)

2. Artale, A., Calvanese, D., Kontchakov, R., Zakharyaschev, M.: The DL-Lite family and relations. J. of Artificial Intelligence Research 36, 1–69 (2009)

3. Bienvenu, M., Ortiz, M., Simkus, M., Xiao, G.: Tractable queries for lightweight description logics. In: Proc. of the 23rd Int. Joint Conf. on Artificial Intelligence (IJCAI 2013). IJCAI/AAAI (2013)

4. Boley, H., Kifer, M.: A guide to the basic logic dialect for rule interchange on the Web. IEEE Trans. on Knowledge and Data Engineering 22(11), 1593–1608 (2010)

5. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rodríguez-Muro, M., Rosati, R.: Ontologies and databases: The DL-Lite approach. In: Tessaris, S., Franconi, E. (eds.) Reasoning Web. Semantic Technologies for Information Systems – 5th Int. Summer School Tutorial Lectures (RW 2009), Lecture Notes in Computer Science, vol. 5689, pp. 255–356. Springer (2009)

6. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. of Automated Reasoning 39(3), 385–429 (2007)

7. Calvanese, D., Lanti, D., Rezk, M., Slusnys, M., Xiao, G.: Data generation for OBDA systems benchmarks. In: Proc. of the 3rd OWL Reasoner Evaluation Workshop (ORE 2014). CEUR-WS.org (2014)

8. Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF mapping language. W3C Recommendation, World Wide Web Consortium (Sep 2012), available at http://www.w3.org/TR/r2rml/

9. Eiter, T., Ortiz, M., Simkus, M., Tran, T.K., Xiao, G.: Query rewriting for Horn-SHIQ plus rules. In: Proc. of the 26th AAAI Conf. on Artificial Intelligence (AAAI 2012). AAAI Press (2012)


10. Glimm, B., Ogbuji, C.: SPARQL 1.1 Entailment Regimes. W3C Recommendation, World Wide Web Consortium (Mar 2013), available at http://www.w3.org/TR/sparql11-entailment/

11. Harris, S., Seaborne, A.: SPARQL 1.1 Query Language. W3C Recommendation, World Wide Web Consortium (Mar 2013), available at http://www.w3.org/TR/sparql11-query

12. Kifer, M., Boley, H.: RIF Overview (Second Edition). W3C Working Group Note 5 February 2013, World Wide Web Consortium (2013), available at http://www.w3.org/TR/2013/NOTE-rif-overview-20130205/

13. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The combined approach to ontology-based data access. In: Proc. of the 22nd Int. Joint Conf. on Artificial Intelligence (IJCAI 2011). pp. 2656–2661 (2011)

14. Lloyd, J.W.: Foundations of Logic Programming (Second, Extended Edition). Springer, Berlin, Heidelberg (1987)

15. Manola, F., Miller, E.: RDF Primer. W3C Recommendation, World Wide Web Consortium (Feb 2004), available at http://www.w3.org/TR/rdf-primer-20040210/

16. Perez-Urbina, H., Motik, B., Horrocks, I.: Tractable query answering and rewriting under description logic constraints. J. of Applied Logic 8(2), 186–209 (2010)

17. Poggi, A., Lembo, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Rosati, R.: Linking data to ontologies. J. on Data Semantics X, 133–173 (2008)

18. Polleres, A.: From SPARQL to rules (and back). In: Proc. of the 16th Int. World Wide Web Conf. (WWW 2007). pp. 787–796 (2007)

19. Polleres, A., Wallner, J.P.: On the relation between SPARQL 1.1 and Answer Set Programming. J. of Applied Non-Classical Logics 23(1–2), 159–212 (2013)

20. Rodríguez-Muro, M., Calvanese, D.: Dependencies: Making ontology based data access work in practice. In: Proc. of the 5th Alberto Mendelzon Int. Workshop on Foundations of Data Management (AMW 2011). CEUR Electronic Workshop Proceedings, http://ceur-ws.org/, vol. 749 (2011)

21. Rodríguez-Muro, M., Calvanese, D.: High performance query answering over DL-Lite ontologies. In: Proc. of the 13th Int. Conf. on the Principles of Knowledge Representation and Reasoning (KR 2012). pp. 308–318 (2012)

22. Rodríguez-Muro, M., Kontchakov, R., Zakharyaschev, M.: Ontology-based data access: Ontop of databases. In: Proc. of the 12th Int. Semantic Web Conf. (ISWC 2013). Lecture Notes in Computer Science, vol. 8218, pp. 558–573. Springer (2013)

23. Rodríguez-Muro, M., Rezk, M.: Efficient SPARQL-to-SQL with R2RML mappings. Tech. rep., Free University of Bozen-Bolzano (Jan 2014), available at http://www.inf.unibz.it/~mrezk/pdf/sparql-sql.pdf


Appendix B

Ontology-based Integration of Cross-linked Datasets

This appendix reports the paper:

Diego Calvanese, Martin Giese, Dag Hovland and Martin Rezk: Ontology-based Integration of Cross-linked Datasets. In Proc. of the 14th International Semantic Web Conference (ISWC), 2015.



Ontology-based Integration of Cross-linked Datasets

Diego Calvanese1, Martin Giese2, Dag Hovland2, and Martin Rezk1

¹ Free University of Bozen-Bolzano, Italy
² University of Oslo, Norway

Abstract. In this paper we tackle the problem of answering SPARQL queries over virtually integrated databases. We assume that the entity resolution problem has already been solved and explicit information is available about which records in the different databases refer to the same real world entity. Surprisingly, to the best of our knowledge, there has been no attempt to extend the standard Ontology-Based Data Access (OBDA) setting to take into account these DB links for SPARQL query-answering and consistency checking. This is partly because the OWL built-in owl:sameAs property, the most natural representation of links between data sets, is not included in OWL 2 QL, the de facto ontology language for OBDA. We formally treat several fundamental questions in this context: how links over database identifiers can be represented in terms of owl:sameAs statements, how to recover rewritability of SPARQL into SQL (lost because of owl:sameAs statements), and how to check consistency. Moreover, we investigate how our solution can be made to scale up to large enterprise datasets. We have implemented the approach, and carried out an extensive set of experiments showing its scalability.

1 Introduction

Since the mid 2000s, Ontology-Based Data Access (OBDA) [10,17,16] has become a popular approach for virtual data integration [7]. In (virtual) OBDA, a conceptual layer is given in the form of (the intensional part of) an ontology (usually in OWL 2 QL) that defines a shared vocabulary, models the domain, hides the structure of the data sources, and can enrich incomplete data with background knowledge. The ontology is connected to the data sources through a declarative specification given in terms of mappings [5] that relate symbols in the ontology (classes and properties) to (SQL) views over data. The ontology and mappings together expose a virtual RDF graph, which can be queried using SPARQL queries, that are then translated into SQL queries over the data sources. In this setting, users no longer need an understanding of the data sources, the relation between them, or the encoding of the data.

One aspect of OBDA for data integration is less well studied, however, namely the fact that in many cases, complementary information about the same entity is distributed over several data sources, and this entity is represented using different identifiers. The first important issue that comes up is that of entity resolution, which requires to understand which records actually represent the same real world entity. We do not deal with this problem here, and assume that this information is already available.

Traditional relational data integration techniques use extract, transform, load (ETL) processes to address this problem [7]. These techniques usually choose a single representation of the entity, merge the information available in all data sources, and then answer

Page 53: Runtime Query Rewriting Techniques - Optique | Scalable End-user … · 2015-11-19 · Runtime Query Rewriting Techniques This document summarises deliverable D6.3 of project FP7-318338


queries on the merged data. However, this approach of physically merging the data is not possible in many real world scenarios where one has no complete control over the data sources, so that they cannot be modified, and where the data cannot be moved due to freshness, privacy, or legal issues (see, e.g., Section 3).

An alternative that can be pursued in OBDA is to make use of mappings to virtually merge the data, by consistently generating only one URI per real world entity. Unfortunately, also this approach is not viable in general: 1. it does not scale well for several datasets, since it requires a central authority for defining URI schemas, which may have to be revised along with all mappings whenever a new source is added, and 2. it is crucial for the efficiency of OBDA that URIs be generated from the primary keys of the data sources, which will typically differ from source to source.

The approach we propose in this paper is based on the natural idea of representing the links between database records resulting from entity resolution in the form of linking tables, which are binary tables in dedicated data sources that simply maintain the information about pairs of records representing the same entity. This brings about several problems that need to be addressed: 1. links over database identifiers should be represented in terms of OWL owl:sameAs statements, which is the standard approach in semantic technologies for connecting entity identifiers; 2. the presence of owl:sameAs statements, which are inherently transitive, breaks rewritability of SPARQL queries into SQL queries over the sources, and one needs to understand whether rewritability can be recovered by imposing suitable restrictions on the linking mechanism; 3. a similar problem arises for checking consistency of the data sources with respect to the ontology, which is traditionally addressed through query answering; 4. since performance can be prohibitively affected by the presence of owl:sameAs, it becomes one of the key issues to address, so as to make the proposed approach scalable over large enterprise datasets.

In this paper we tackle the above issues in the setting where we are given an OWL 2 QL ontology that is mapped to a set of data sources, which are then extended with linking tables. Specifically, we provide the following contributions:

– We propose a mapping-based framework that carefully constructs owl:sameAs statements virtually from the linking tables, and deals with transitivity and symmetry, in such a way that performance is not compromised.

– We define a suitable set of restrictions on the linking mechanisms that ensures rewritability of SPARQL query answering, despite the presence of owl:sameAs statements.

– We develop a sound and complete SPARQL query translation technique, and show how to apply it also for consistency checking.

– We show how to optimize the translation so as to critically reduce the size of the produced SQL query.

– To empirically demonstrate scalability of our solution, we carry out an extensive set of experiments, both over a real enterprise cross-linked data set from the oil&gas industry, and in a controlled environment; this demonstrates the feasibility of our approach.

The structure of the paper is as follows: Sect. 2 briefly introduces the necessary background needed to understand this paper, and Sect. 3 describes our enterprise scenario. Sect. 4 provides a sound and complete SPARQL query translation technique



for cross-linked datasets. Sect. 5 presents the main contribution of the paper, showing how to construct an OBDA setting over cross-linked datasets, and Sect. 6 presents our optimization technique. Sect. 7 presents an extensive experimental evaluation. Sect. 8 surveys related work, and Sect. 9 concludes the paper.

2 Preliminaries

Ontology Based Data Access. In the traditional OBDA setting (T, M, D), the three main components are a set T of OWL 2 QL [14] axioms (called the TBox), a relational database D, and a set M of mappings. The OWL 2 QL profile of OWL 2 guarantees that queries formulated over T can be rewritten into SQL [2]. The mappings allow one to define how classes and properties in T should be populated with objects constructed from the data retrieved from D by means of SQL queries. Each mapping has one of the forms:

    Class(subject) ← sql_class        Property(subject, object) ← sql_prop,

where sql_class and sql_prop are a unary and a binary SQL query over D, respectively. For both types of mappings we also use the equivalent notation (s p o) ← sql. Subjects and objects in RDF triples are resources (individuals or values) represented by URIs or literals. They are generated using templates in the mappings. For example, the URI template for the subject can take the form <http://www.statoil.com/{id}>, where {id} is an attribute in some DB table, and it generates the URI <http://www.statoil.com/25> when {id} is instantiated as "25". From M and D, one can derive a (virtual) RDF graph G_M,D, obtained by applying all the mappings. Any RDF graph can be seen as a set of logical assertions. Thus, the TBox together with G_M,D constitutes an ontology O = (T, G_M,D).
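The URI-template mechanism can be sketched as follows; in this illustration of ours, the rows stand in for the answers of the mapping's SQL query, and the class name :Wellbore is invented.

```python
# Sketch of mapping-based triple generation with URI templates.

def uri(template, row):
    """Instantiate a URI template such as http://www.statoil.com/{id}."""
    return "<" + template.format(**row) + ">"

def apply_class_mapping(rows, template, cls):
    """Class(subject) <- sql : emit one rdf:type triple per SQL answer row."""
    return {(uri(template, r), "rdf:type", cls) for r in rows}

rows = [{"id": "25"}, {"id": "26"}]   # pretend result of the mapping's SQL query
triples = apply_class_mapping(rows, "http://www.statoil.com/{id}", ":Wellbore")
# triples contains (<http://www.statoil.com/25>, rdf:type, :Wellbore), ...
```

The virtual graph G_M,D is conceptually the union of such triple sets over all mappings in M, though in OBDA it is never materialised.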

To handle ontology-based integration of cross-linked datasets, we extend here the traditional OBDA setting with a fourth component AS containing a set of statements of the form owl:sameAs(o1, o2). Thus, in this paper, an OBDA setting is a tuple (T, M, D, AS), and its corresponding ontology is the tuple O = (T, G_M,D ∪ AS). Unless stated differently, in the following we work with OBDA settings of this form.

Semantics: To interpret ontologies, we use the standard notions of first order interpretation, model, and satisfaction. That is, O |= A(v) iff for every model I of O, we have that I |= A(v). Intuitively, adding an ontology T on top of an RDF graph G extends G with extra triples inferred by T. Formally, the RDF graph (virtually) exposed by the OBDA setting (T, M, D, AS) is G_(T,M,D,AS) = {A(v) | (T, G_M,D ∪ AS) |= A(v)}.

SPARQL. SPARQL is a W3C standard language designed to query RDF graphs. Its vocabulary contains four pairwise disjoint and countably infinite sets of symbols: I for IRIs, B for blank nodes, L for RDF literals, and V for variables. The elements of T = I ∪ B ∪ L are called RDF terms. A triple pattern is an element of (T ∪ V) × (I ∪ V) × (T ∪ V). A basic graph pattern (BGP) is a finite set of triple patterns. Finally, a graph pattern, Q, is an expression defined by the grammar

    Q ::= BGP | FILTER(P, F) | UNION(P1, P2) | JOIN(P1, P2) | OPT(P1, P2, F),

where F is a filter expression. More details can be found in [4].

A SPARQL query (Q, V) is a graph pattern Q with a set of variables V which specifies the answer variables—the set of variables in Q whose values we are interested



in. The values to variables are given by solution mappings, which are partial maps s : V → T with (possibly empty) domain dom(s). Here, following [11,17], we use the set-based semantics for SPARQL (rather than the bag-based one, as in the specification).

The SPARQL algebra operators are used to evaluate the different fragments of the SPARQL query. Given an RDF graph G, the answer to a graph pattern Q over G is the set ⟦Q⟧_G of solution mappings defined by induction using the SPARQL algebra operators and starting from the base case: triple patterns. Due to space limitations, and since the entailment regime only modifies the SPARQL semantics for triple patterns, here we only show the definition for this basic case. We provide the complete definition in our technical report [4].

For a triple pattern B, ⟦B⟧_G = {s : var(B) → T | s(B) ⊆ G}, where s(B) is the result of substituting each variable u in B by s(u). This semantics is known as simple entailment. Given a set V of variables, the answer to (Q, V) over G is the restriction ⟦Q⟧_G|_V of the solution mappings in ⟦Q⟧_G to the variables in V.
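Under simple entailment, evaluating a triple pattern thus amounts to matching it against the triples of G. A minimal sketch of ours, over a toy graph with invented terms:

```python
# Sketch: simple-entailment evaluation of a single triple pattern.
# A solution mapping s over var(B) is an answer iff s(B) is a triple of G.

def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def eval_triple_pattern(pattern, graph):
    answers = []
    for triple in graph:
        s = {}
        # bind variables (consistently, via setdefault) and check constants
        if all(s.setdefault(p, t) == t if is_var(p) else p == t
               for p, t in zip(pattern, triple)):
            answers.append(dict(s))
    return answers

G = {(":a1", ":hasName", '"A"'), (":a2", ":hasName", '"B"')}
sols = eval_triple_pattern(("?w", ":hasName", "?n"), G)
# two solution mappings: {?w: :a1, ?n: "A"} and {?w: :a2, ?n: "B"}
```

A BGP answer would then be obtained by joining the solution sets of its triple patterns on shared variables.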

SPARQL Entailment Regime. We present now the standard W3C semantics for SPARQL queries over OWL 2 ontologies under different entailment regimes. We use here the entailment regimes only to reason about individuals and, unlike [10], we do not allow for variables in triple patterns ranging over class and property names. We leave the problem of extending our results to handle also this case for future work, but we do not expect this to present any major challenge.

We work with TBoxes expressed in the OWL 2 QL profile, which however may contain also owl:sameAs statements. Therefore, we consider two Direct Semantics entailment regimes for SPARQL queries, which differ in how they interpret owl:sameAs: the DL entailment regime (which defines |=DL) interprets owl:sameAs internally, implicitly adding to the ontology O the axioms to handle equality, i.e., transitivity, symmetry, and reflexivity. Instead, the QL entailment regime (which defines |=QL) interprets owl:sameAs as a standard object property, hence does not assign to it any special semantics.

Observe that a basic property of logical equality is that if a and b are equal, everything that holds for a should hold also for b, and vice versa. In the context of SPARQL, informally it means that, given the answer ⟦B⟧_{T,G∪AS} to a triple pattern B, if the answer contains the solution mapping s : v ↦ o and T |= owl:sameAs(o, o′), then ⟦B⟧_{T,G∪AS} must also contain a solution mapping s′ that coincides with s except that s′ : v ↦ o′. Formally, the answer ⟦B⟧^R_{T,G∪AS} to a BGP B over an ontology O under entailment regime R is defined as follows:

    ⟦B⟧^R_O = {s : var(B) → T | O |=R s(B)}.

Starting from ⟦B⟧^R_O and applying the SPARQL operators in Q, we compute the set ⟦Q⟧^R_O of solution mappings.
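The practical effect of the DL regime's treatment of owl:sameAs can be sketched by post-expanding answers with equivalence classes of equal terms. This is our own simplification of the semantics above, and the URIs are invented.

```python
# Sketch: expand solution mappings with all owl:sameAs-equal alternatives.
# Equality is closed under symmetry and transitivity via equivalence classes.
from itertools import product

def same_as_classes(pairs):
    """Map each term to its equivalence class under the given sameAs pairs."""
    cls = {}
    for a, b in pairs:
        merged = cls.setdefault(a, {a}) | cls.setdefault(b, {b})
        for x in merged:
            cls[x] = merged
    return cls

def expand(solutions, pairs):
    cls = same_as_classes(pairs)
    out = []
    for s in solutions:
        vars_, vals = zip(*s.items())
        for combo in product(*(sorted(cls.get(v, {v})) for v in vals)):
            out.append(dict(zip(vars_, combo)))
    return out

same = [("uri1:a1", "uri2:b2"), ("uri2:b2", "uri3:c3")]
sols = expand([{"?w": "uri1:a1"}], same)
# three answers: one per URI equal to a1's
```

Note how a chain of two sameAs links already triples the answer for a single variable; this multiplicative blow-up is precisely why the paper treats performance in the presence of owl:sameAs as a central issue.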

3 Use Case and Motivating Example

In this section we briefly describe the real-world scenario we have examined at Statoil, and we illustrate the challenges it presents for OBDA with an example.

At Statoil, users access several databases on a daily basis; among them are the Exploration and Production Data Store (EPDS), the Norwegian Petroleum Directorate



D1:              D2:                   D3:              D4:
id1 | Name       id2 | Name | Well     id3 | AName      id4 | LName
a1  | 'A'        b1  | null | 1        c3  | 'U1'       9   | 'Z1'
a2  | 'B'        b2  | 'C'  | 2        c4  | 'U2'       8   | 'Z2'
a3  | 'H'        b6  | 'B'  | 3        c5  | 'U6'       7   | 'Z3'

Fig. 1. Wellbore datasets D1, D2, D3, and company dataset D4

(NPD) FactPages, and several OpenWorks databases. EPDS is a large Statoil-internal legacy SQL (Oracle 10g) database comprising over 1500 tables (some of them with up to 10 million tuples), 1600 views, and 700 Gb of data. The NPD FactPages¹ is a dataset provided by the Norwegian government, and it contains information regarding the petroleum activities on the Norwegian continental shelf. OpenWorks databases contain project data produced by geoscientists at Statoil. The information in these databases overlaps, and often they refer to the same entities (companies, wells, licenses) with different identifiers. In this use case the entity resolution problem has been solved, since the links between records are available.

The users at Statoil need to query (and get an answer in reasonable time) the information about these objects without worrying about the particular identifier used in each database. Thus, we assume that the SPARQL queries provided by the users will not contain owl:sameAs statements. The equality between identifiers should be handled internally by the OBDA system. To illustrate this we provide the following simplified example:

Example 1. Suppose we have three datasets (from now on D1, D2, D3) with wellbore^2 information, and a dataset D4 with information about companies and licenses, as illustrated in Figure 1. The wellbores in D1, D2, D3 are linked, but companies in D4 are not linked with the other datasets. These four data sources are integrated virtually by topping them with an ontology. The ontology contains the concept Wellbore and the properties hasName, hasAlternativeName, and hasLicense.

The terms Wellbore and hasName are defined using D1 and D2. The property hasAlternativeName is defined using D3. The property hasLicense is defined over the isolated dataset D4. We assume that mappings for wellbores from D_i use URI templates uri_i. In addition, we know that the wellbores are cross-linked between datasets as follows: wellbores a1, a2 in D1 are equal to b2, b1 in D2 and c3, c4 in D3, respectively. In addition, a3 is equal to c5. These links are represented at the ontology level by owl:sameAs statements of the form owl:sameAs(uri_1(a1), uri_2(b2)), owl:sameAs(uri_2(b2), uri_3(c3)), etc.

Consider now a user looking for all the wellbores and their names. According to the SPARQL entailment regime, the system should return all 12 combinations of equivalent ids and names ((uri_1(a1), A), (uri_2(b2), A), (uri_3(c3), A), (uri_1(a2), B), (uri_2(b1), B), etc.), since all these tuples are entailed by the ontology and the data (cf. Section 2). Note that no wellbores from D4 are returned. □

1 http://factpages.npd.no/
2 A wellbore is a hole drilled for the purpose of exploration or extraction of natural resources.


Diego Calvanese, Martin Giese, Dag Hovland, and Martin Rezk

The first issue in the context of OBDA is how to translate the user query into a query over the databases. Recall that owl:sameAs is not included in OWL 2 QL, thus it is not handled by the current query translation and optimization techniques. If we solve the first issue by applying suitable constraints, we run into a second issue: how to minimize the negative impact on query execution time when reasoning over cross-linked datasets. A third issue is how to check, for instance, whether hasName is a functional property considering the linked entities. A fourth issue is how to handle the multiplicity of equivalent answers required by the standard. For instance, in our example, it could in principle be enough to pick individuals with template uri_1 as class representatives, and return only those triples. In the next sections we tackle all these issues in turn.

4 Handling owl:sameAs by SPARQL query rewriting

In this section we present the theoretical foundations for query answering over ontology-based integrated datasets. We also discuss how to perform consistency checking using this approach. We assume for now that the links are given in the form of owl:sameAs statements, and address later, in Section 5, the proper OBDA scenario, where links are given not between URIs, but between database records. Recall that owl:sameAs is not in the OWL 2 QL profile; moreover, by adding the unrestricted use of owl:sameAs we lose first-order rewritability [1], since one can encode reachability in undirected graphs. This implies that, if we allow for the unrestricted use of owl:sameAs, we cannot offer a sound and complete translation of SPARQL queries into SQL.^3

We present here an approach, based on partial materialization of inference, that in principle allows us to exploit a relational engine for query answering in the presence of owl:sameAs statements. This approach, however, is not feasible in practice, and we will show in Section 5 how to develop it into a practical solution. Our approach is based on the simple observation that we can expand the set A_S of owl:sameAs facts into the set A_S^* obtained from A_S by closing it under reflexivity, symmetry, and transitivity. Unlike other approaches based on (partial) materialization [9], we do not also expand the data triples (specifically, those in G_{M,D}), but instead rewrite the input SPARQL query to guarantee completeness of query answering. We assume that user queries in general will not contain owl:sameAs statements, and therefore, for simplicity of presentation, we do not consider here the case where they are present in the input. However, our approach can easily be extended to deal also with owl:sameAs statements in user queries. Given a SPARQL query (Q, V) over (T, G ∪ A_S), we generate a new SPARQL query (ϕ(Q), V) over (T, G ∪ A_S^*) that returns the same answers as (Q, V) over (T, G ∪ A_S). This approach is very similar to the singularisation technique in [13]. The translation ϕ(·) is defined as follows.
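As a concrete illustration of the expansion step, the closure of A_S under reflexivity, symmetry, and transitivity can be sketched as follows (a naive fixpoint computation; variable names are ours, not the paper's):

```python
def sameas_closure(pairs):
    """Compute A_S*: close a set of owl:sameAs facts (pairs of URIs)
    under symmetry, reflexivity, and transitivity."""
    closed = set(pairs)
    closed |= {(b, a) for a, b in closed}            # symmetry
    nodes = {x for pair in closed for x in pair}
    closed |= {(x, x) for x in nodes}                # reflexivity
    while True:                                      # transitivity (naive fixpoint)
        new = {(a, c) for a, b in closed for b2, c in closed if b == b2}
        if new <= closed:
            return closed
        closed |= new
```

Even this toy version makes the practical problem visible: the closure grows quadratically in the size of each equivalence class, which is why Section 5 avoids materializing it.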

Definition 1. Given a query (Q, V), the query (ϕ(Q), V) is obtained by replacing every triple pattern t in Q with ϕ(t), where:^4

3 Using the linear recursion mechanism of SQL-99, a translation would be possible, but with a severe performance penalty for evaluating queries involving transitive closure.

4 Recall that terms of the form _:x are blank nodes that, when occurring in a query, correspond to existential variables.


– ϕ({?v :P ?w}) = { ?v owl:sameAs _:a . _:a :P _:b . _:b owl:sameAs ?w . }

– ϕ({?v rdf:type :C}) = { ?v owl:sameAs _:a . _:a rdf:type :C . }
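The translation of Definition 1 is mechanical; a possible sketch in Python (function and variable names are our own, not the system's API) that rewrites a single triple pattern into the corresponding BGP with fresh blank nodes:

```python
import itertools

_fresh = itertools.count()  # counter for fresh blank-node labels

def phi(pattern):
    """Definition 1: route the subject (and, for properties, the object)
    of a triple pattern through fresh blank nodes linked by owl:sameAs."""
    s, p, o = pattern
    if p == "rdf:type":
        a = f"_:b{next(_fresh)}"
        return f"{s} owl:sameAs {a} . {a} rdf:type {o} ."
    a, b = f"_:b{next(_fresh)}", f"_:b{next(_fresh)}"
    return f"{s} owl:sameAs {a} . {a} {p} {b} . {b} owl:sameAs {o} ."
```

Note that this rewriting is only complete over A_S^*, since the reflexive pairs are what make the owl:sameAs hops vacuous for unlinked individuals.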

The following proposition states that answering SPARQL queries over a TBox T under the DL entailment regime can be reduced to answering SPARQL queries under the QL entailment regime (where owl:sameAs has no built-in semantics).

Proposition 1. Given an OBDA setting (T, M, D, A_S) and a query (Q, V), we have that

⟦Q⟧^DL_{T, G_{M,D} ∪ A_S}|_V = ⟦ϕ(Q)⟧^QL_{T, G_{M,D} ∪ A_S^*}|_V.

Consistency Check: Ontology languages, such as OWL 2 QL, allow for the specification of constraints on the data. If the data exposed by the database through the mappings does not satisfy these constraints, then we say that the ontology is inconsistent with respect to the mappings and the data. OBDA allows one to check two types of constraints: (i) functionality of properties (although it cannot be expressed in OWL 2 QL), which imposes that an individual is connected to at most one element; (ii) disjointness of classes/properties, which cannot have (pairs of) individuals in common. In OBDA, consistency checking can be reduced to query answering [3]. This does not hold anymore, in general, when considering cross-linked datasets (where the UNA does not hold). For instance, suppose we want to check whether the property :hasName in Example 1 is functional. Clearly, without considering equality between datasets the property is functional; however, when we integrate the datasets, it is not anymore, since the graph contains (uri_1(a1) :hasName 'A'), (uri_2(b2) :hasName 'C'), and (uri_1(a1) owl:sameAs uri_2(b2)). This implies that the wellbore uri_1(a1) has two names. Using the translation above we can extend the results in [3] for checking violations of class disjointness and of functionality of data and object properties, to account for owl:sameAs statements. For disjointness and functionality of data properties this is accomplished straightforwardly by the translation. Instead, for functionality of object properties, we need to modify the query used in [3] and explicitly incorporate the negation of owl:sameAs. For instance, to check whether functionality of the object property :isRelatedTo might be violated, we can check whether the following query returns a non-empty answer over (T, G ∪ A_S^*):

SELECT ?x ?y1 ?y2 WHERE {
  ?x :isRelatedTo ?y1 . ?x :isRelatedTo ?y2 .
  FILTER(?y1 != ?y2 && NOT EXISTS { ?y1 owl:sameAs ?y2 })
}

If the answer is non-empty, the returned elements might witness the violation of functionality. Notice that, because of the Open World Assumption (OWA), if two elements are not known to be equal, in general we cannot infer that they are not equal, and hence functionality might still hold in some models. We refer to [4] for more details.
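The same check can be phrased outside SPARQL. Below is a small Python sketch (illustrative names; it assumes the set of owl:sameAs pairs passed in is already closed under symmetry) of detecting potential functionality violations:

```python
def potential_violations(triples, prop, sameas_closed):
    """Pairs of distinct, not-known-equal objects attached to one subject
    through `prop` -- under the OWA these only *might* violate
    functionality, exactly as the SPARQL check above."""
    objects = {}
    for s, p, o in triples:
        if p == prop:
            objects.setdefault(s, set()).add(o)
    return {(s, o1, o2)
            for s, objs in objects.items()
            for o1 in objs for o2 in objs
            if o1 != o2 and (o1, o2) not in sameas_closed}
```

Adding the owl:sameAs fact for the two objects makes the candidate violation disappear, mirroring the NOT EXISTS filter in the query.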

5 Handling Cross-Linked Datasets in Practice

We now deal with the proper case of querying cross-linked datasets, where we are given: (a) an OWL 2 QL TBox, (b) a collection of datasets, (c) a set of mappings, and (d) a set of linking tables^5 stating equality between records in different datasets that represent the same entity.

5 Note that these tables could be available virtually, and hence retrieved through queries.


Fig. 2. Linking tables for the wellbores category

For simplicity, we can think of each dataset as corresponding to a different data source, but datasets could be decoupled from the actual physical data sources. In general, in different datasets, the same identifiers might be used to denote different objects, and the same objects might be denoted by different identifiers. Moreover, each dataset may contain data records belonging to different pairwise disjoint categories C_1, ..., C_m, for example wellbores or company names. A category corresponds to a set of records that can be mapped to individuals in the ontology belonging to the same TBox class (different from owl:Thing), and that could, in principle, be joined. For instance, cats and men belong to the same class mammal, but a cat can never be joined with a man; hence cats and men constitute two different categories. We assume that in addition to the datasets D_1, ..., D_n, for each category C there is a database D^C containing the linking tables for the records in C. Specifically, we denote a linking table for datasets D_i, D_j and category C with L^C_ij(id_i, id_j). A tuple (r_1, r_2) in L^C_ij means that the record r_1 in D_i represents the same object as the record r_2 in D_j. Notice that we do not assume that there is a linking table for each pair of datasets D_i, D_j for each category C. The concepts above are illustrated in Figure 2. Our aim is to efficiently answer user SPARQL queries in this setting.

The approach presented in the previous section is theoretical, and cannot be effectively applied in practice because: (1) it assumes that the links are given in the form of owl:sameAs statements, whereas in practice, in a cross-linked setting, they will be given as tables (with the results of the entity resolution process); and (2) it requires pre-computing a large number of triples (namely A_S^*) and materializing them into the ontology. Since these triples are not stored in the database, they cannot be efficiently retrieved using SQL. This negatively impacts the performance of query execution.

To tackle these problems, in this section we show how to: (a) expose, using mapping assertions that are optimization-friendly, the information in the tables expressing equality between DB records, as a set A_S of owl:sameAs statements; (b) extend the mappings so as to encode also transitivity and symmetry (but not reflexivity), and hence expose the symmetric transitive closure A_S^+ of A_S; (c) modify the query-rewriting algorithm (cf. Definition 1) so as to return sound and complete answers over the (virtual) ontology extended with A_S^+. We now detail the above steps.

(a) Generating A_S: We now present a set of constraints on the structure of the linking tables that are fully compatible with real-world requirements, and that allow us to process queries efficiently, as we will show below:

1. All the information about which objects of category C are linked in datasets D_i and D_j is contained in L^C_ij. Formally: if there are tables L^C_ij, L^C_ik, and L^C_kj, then L^C_ij contains all the tuples in π_{id_i,id_j}(L^C_ik ⋈ L^C_kj), when evaluated over D^C.


L_12           L_23           L_13
id1  id2       id2  id3       id1  id3
a1   b2        b1   c4        a1   c3
a2   b1        b2   c3        a2   c4
                              a3   c5

Fig. 3. Linking Tables

2. Linking tables cannot state equality between different elements in the same dataset.^6 Formally: there is no join of the form L^C_ik ⋈ ··· ⋈ L^C_ni such that (o, o′), with o ≠ o′, occurs in π_{L^C_ik.id_i, L^C_ni.id_i}(L^C_ik ⋈ ··· ⋈ L^C_ni), when evaluated over D^C.

Example 2 (Categories). Consider Example 1. Here we consider only wellbores, therefore we have a single category C_wb with three linking tables L^Cwb_12, L^Cwb_23, and L^Cwb_13, as shown in Figure 3. From the constraints above we know that π_{id1,id3}(L^Cwb_12 ⋈ L^Cwb_23) is contained in L^Cwb_13, when both are evaluated over D^Cwb. □

A key factor that affects the performance of the overall OBDA system is the form of the mappings, which includes the structure of the URI templates used to generate the URIs. Here, we discuss how the part of the mappings (including URI templates) that deals with linking tables should be designed, so that this approach scales up. The SPARQL-to-SQL translation must add all the SQL queries defining owl:sameAs. However, as shown in Section 6, we exploit our URI design to (intuitively) remove as many owl:sameAs SQL definitions as possible before query execution.

We propose here to use a different URI template uri_{C,D} for each pair constituted by a category C and a dataset D.^7 Observe that this design decision is quite natural, since objects belonging to different categories should not join, even if in some dataset they are identified in the same way. For example, wellbore no. 25 should not be confused with the employee whose id is 25.

Next we generate the set of equalities A_S by extending the set of mappings M, using a different URI template for each pair (category C, dataset D). More precisely, to generate A_S out of the categories C_1, ..., C_n, M is extended with mappings as follows. For each category C and each linking table L^C_ij, we extend M with:

    uri_{C,Di}({id_i}) owl:sameAs uri_{C,Dj}({id_j}) ← SELECT * FROM L^C_ij    (1)

When the category C is clear from the context, we write uri_i to denote uri_{C,Di}.

Example 3 (Mappings). To generate the owl:sameAs statements from the tables in Example 2, we extend our set of mappings M with the following mappings (fragment):

    uri_1({id1}) owl:sameAs uri_2({id2}) ← SELECT * FROM L^C_12
    uri_2({id2}) owl:sameAs uri_3({id3}) ← SELECT * FROM L^C_23

6 Observe that this amounts to making the Unique Name Assumption for the objects retrieved by the mappings from one dataset.

7 In the special case where the datasets can all be mapped to use common URIs, there is no need for linking tables or any of the techniques presented in this paper. We address the more general case, where this does not hold.


Observe that this also implies that, to populate the concept Wellbore with elements from D1, the mappings in M will have to use the URI template uri_1. □

Considering that the same URIs in different triples of the virtual RDF graph can be generated from different mapping assertions, we observe that the form of the templates in the mappings related to linking tables will also affect those in the remaining mapping assertions in the OBDA system.

(b) Approximating A_S^+: To be able to rewrite SPARQL queries into SQL without adding A_S^* as facts in the ontology (relying only on the databases), we embed the owl:sameAs axioms, together with the axioms for symmetry and transitivity, into the mappings; that is, we extend the notion of T-mappings [16] (T stands for terminology). Intuitively, T-mappings embed the consequences of an OWL 2 QL ontology into the mappings. This allows us to drop the implicit axioms for symmetry and transitivity from the TBox T.

For each category C and each sequence of non-empty tables L^C_{i1,i2}, L^C_{i2,i3}, ..., L^C_{in-1,in}, if L^C_{i1,in} does not exist, we include the following transitivity mappings in M:

    t_1({id_1}) owl:sameAs t_n({id_n}) ← SELECT * FROM L^C_{i1,i2} ⋈ ··· ⋈ L^C_{in-1,in}    (2)

and for each of the owl:sameAs mappings described in (1) and (2) we include the following symmetry mappings in M:

    t_j({id_j}) owl:sameAs t_i({id_i}) ← SELECT * FROM sql_ij    (3)

We call the resulting set of mappings M_S.
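To make steps (a) and (b) concrete, here is a minimal Python sketch generating the mapping assertions (1)-(3) for one category. Everything in it is illustrative: the template naming scheme, the table dictionary, and the `JOIN ... USING` SQL are our assumptions, not the system's actual mapping syntax.

```python
def sameas_mappings(category, linking_tables):
    """Emit owl:sameAs mapping assertions for one category:
    (1) one per linking table, (2) a transitivity mapping for each
    two-step chain whose end-to-end table is absent, (3) a symmetric
    mapping for every mapping produced so far.
    `linking_tables` maps (i, j) pairs to table names, e.g. {(1, 2): 'L12'}."""
    maps = []
    for (i, j), tbl in linking_tables.items():                         # (1)
        maps.append((f"uri_{category}_{i}({{id{i}}}) owl:sameAs "
                     f"uri_{category}_{j}({{id{j}}})",
                     f"SELECT * FROM {tbl}"))
    for (i, k1), t1 in list(linking_tables.items()):                   # (2)
        for (k2, j), t2 in list(linking_tables.items()):
            if k1 == k2 and i != j and (i, j) not in linking_tables:
                maps.append((f"uri_{category}_{i}({{id{i}}}) owl:sameAs "
                             f"uri_{category}_{j}({{id{j}}})",
                             f"SELECT * FROM {t1} JOIN {t2} USING (id{k1})"))
    maps += [(f"{h.split(' owl:sameAs ')[1]} owl:sameAs "               # (3)
              f"{h.split(' owl:sameAs ')[0]}", sql) for h, sql in maps]
    return maps
```

With the two tables of Example 3, this yields two direct mappings, one transitive mapping for the missing (1,3) table, and their three symmetric counterparts; reflexivity is deliberately never generated.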

(c) Rewriting the query Q: Encoding reflexivity would be extremely detrimental to performance, not only because of the large number of extra mappings we would have to consider, but also because it would render the optimizations explained in the next sections ineffective. Intuitively, the reason is that while symmetry and transitivity affect only elements which are linked to other datasets, reflexivity affects all the objects in the OBDA setting. Thus, we would not be able to distinguish, during the query transformation process, which classes and properties actually deal with linked objects (and should be rewritten) and which ones do not. Therefore, we modify the query-rewriting technique to preserve soundness and completeness with respect to the DL entailment regime while evaluating the query under the QL entailment regime over (T, M_S, D).

We modify the query translation as follows:

Definition 2 ((ϕ(Q), V)). Given a query (Q, V), the query (ϕ(Q), V) is obtained by replacing every triple pattern t in Q with ϕ(t), where ϕ({?v :P ?w}) is shown in Fig. 4 (A) and ϕ({?v rdf:type :C}) is shown in Fig. 4 (B).

Intuitively, following our running example, the first BGP in Fig. 4 (A) gets all triples such as (uri_1(a1), :hasName, A) that do not need equality reasoning. The second BGP gets triples such as (uri_1(a1), :hasName, C) that require owl:sameAs(uri_1(a1), uri_2(b2)). The last two BGPs are used only for object properties, and they tackle the cases where equality reasoning is needed for the object (?w).

Recall that we do not allow owl:sameAs in the user query language; therefore the user will not be able to query ?x owl:sameAs ?x. In principle, we could also move transitivity and symmetry to the query, but this would not reduce the size of the SQL rewriting.

Theorem 1. Given an OBDA setting (T, A_S, M, D) and a query (Q, V), we have that

⟦Q⟧^DL_{T, G_{M,D} ∪ A_S}|_V = ⟦ϕ(Q)⟧^QL_{T, G_{M_S,D}}|_V.


(A)
{ ?v :P ?w . }
UNION { ?v owl:sameAs _:z1 . _:z1 :P ?w . }
UNION { ?v :P _:z2 . _:z2 owl:sameAs ?w . }
UNION { ?v owl:sameAs _:a . _:a :P _:b . _:b owl:sameAs ?w . }

(B)
{ ?v rdf:type :C . }
UNION { ?v owl:sameAs [ rdf:type :C ] . }

Fig. 4. SPARQL translation to handle owl:sameAs without Reflexivity
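The rewriting of Fig. 4 can be generated programmatically; a sketch follows (our own naming, not the system's API), producing the UNION of BGPs for a single triple pattern:

```python
def phi_no_reflexivity(pattern):
    """Definition 2 / Fig. 4: rewrite a triple pattern into a UNION of
    BGPs covering equality on the subject and/or the object, without
    encoding reflexivity."""
    v, p, w = pattern
    if p == "rdf:type":
        return (f"{{ {v} rdf:type {w} . }} UNION "
                f"{{ {v} owl:sameAs [ rdf:type {w} ] . }}")
    return (f"{{ {v} {p} {w} . }}"
            f" UNION {{ {v} owl:sameAs _:z1 . _:z1 {p} {w} . }}"
            f" UNION {{ {v} {p} _:z2 . _:z2 owl:sameAs {w} . }}"
            f" UNION {{ {v} owl:sameAs _:a . _:a {p} _:b . "
            f"_:b owl:sameAs {w} . }}")
```

An object property pattern thus expands into four branches, a class pattern into two; Section 6 shows how most of these branches are pruned before SQL generation.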

6 Optimization

The technique presented in Section 5 can cause excessive overhead on the query size, and therefore on the query execution time, since it has to extend every triple pattern with owl:sameAs statements. In this section we show how to remove the owl:sameAs statements that do not contribute to the answer. For instance, in our running example the property hasLicense is defined over the companies in D4, which are not linked with the other three databases. Thus, the owl:sameAs statements should not contribute to "populate" this property.

To translate SPARQL to SQL, both in the literature [17] and in our implementation, the SPARQL algebra tree is encoded as a logic program. Intuitively, each SPARQL operator is represented by a rule in the program, as illustrated in Example 4. The translation algorithm employs a well-known process in Logic Programming called partial evaluation [12]. Intuitively, the partial evaluation of a SPARQL query Q (represented as a logic program) is another query Q′ that represents the partial execution of Q. This process iterates over the structure of the query and specializes it, going from the highly abstract query to the concrete SQL query over the database. It starts by replacing the atoms that correspond to leaves in the algebra tree (triple patterns) with the union of all their definitions in the mappings, and then it iterates over the remaining atoms, trying to replace them by their definitions. This procedure is done without executing any SQL query over the databases.

We detect and remove owl:sameAs statements that do not contribute to the answer using this procedure. It is critical to notice that this optimization can be performed because we intentionally added two constraints: (i) we disallow mappings modeling reflexivity; and (ii) we force unique URI templates for each pair category/dataset. We illustrate this optimization in the following example.

Example 4 (Companies). Consider the query asking for the list of companies and licenses shown in Figure 5 (A). This query is translated into the query (fragment) shown in Figure 5 (B). Since we know that only wellbores are linked across the different datasets, it is clear that there is no need for owl:sameAs statements (nor unions) in this query. In the following, we show how the system partially evaluates the query to remove such a pointless union. The translated query is represented as the following program encoding the SPARQL algebra tree:

(1) answer(v,w) ← union(v,w)
(2) union(v,w) ← bgp1(v,w)
(3) bgp1(v,w) ← hasLicense(v,w)
(4) union(v,w) ← bgp2(v,w)
(5) bgp2(v,w) ← owl:sameAs(v,x), hasLicense(x,w)


(A)
SELECT * WHERE { ?v :hasLicense ?w . }

(B)
SELECT * WHERE {
  { ?v :hasLicense ?w . }
  UNION { ?v owl:sameAs [ :hasLicense ?w ] . }
}

Fig. 5. Optimizable Queries

The next step is to replace the leaves of the SPARQL tree (the triple patterns owl:sameAs and hasLicense) with their definitions (fragment, without including transitivity and symmetry):

(6) hasLicense(uri4(v),uri4(w)) ← sql(v,w)
(7) owl:sameAs(uri1(v),uri2(x)) ← T12(v,x)
(8) owl:sameAs(uri2(v),uri3(x)) ← T23(v,x)
(9) owl:sameAs(uri1(v),uri3(x)) ← T13(v,x)

Thus, the system tries to replace hasLicense(x,w) in (5) by its definition in (6), and analogously with owl:sameAs (the atom in (5) by the union of (7)-(9)). Using partial evaluation, the system will try to unify the head of (6) with hasLicense in (5). The result is:

(5') bgp2(v, uri4(w)) ← owl:sameAs(v, uri4(x)), sql(x,w)

In the next step, the algorithm will try to unify the owl:sameAs atom in (5′) with the head of at least one of the rules (7), (8), (9) (if all matched, it would add the union of the three). Given that the URI template (represented as a function) uri4 does not occur in any of these rules, the whole branch is removed. The resulting program is:

(1) answer(v,w) ← union(v,w)
(2) union(v,w) ← bgp1(v,w)
(4) bgp1(v,w) ← hasLicense(v,w)
(5) hasLicense(uri4(v),uri4(w)) ← sql(v,w)

This query, without owl:sameAs overhead, is now ready to be translated into SQL. □

This process will also take care of eliminating unnecessary SQL queries used to define owl:sameAs. For instance, if the user queries for wellbores, it will remove all the SQL queries used for linking company names. This is why we require a unique URI template for each pair category/dataset.
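The pruning decision can be pictured as follows. This is a deliberately simplified Python sketch: the real system prunes branches through unification of URI-template function symbols during partial evaluation, which we reduce here to a template-membership test; all names are ours.

```python
def prune_branches(branches, sameas_templates):
    """Keep a UNION branch only if it needs no equality reasoning, or if
    its URI template occurs in some owl:sameAs definition; otherwise the
    whole branch (and its SQL subquery) is dropped before execution."""
    return [body for body, template in branches
            if "owl:sameAs" not in body or template in sameas_templates]
```

For Example 4, the branch routing :hasLicense through owl:sameAs uses the template uri4, which appears in no owl:sameAs definition, so only the direct branch survives.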

7 Experiments

In this section we present a set of experiments evaluating the performance of queries over cross-linked datasets. We integrated EPDS and the NPD FactPages at Statoil, extending the existing ontology and the set of mappings, and creating the linking tables. We ran 22 queries covering real information needs of end-users over this integrated OBDA setting. Since EPDS is a production server with confidential data whose load changes constantly, and since, in addition, the OBDA setting is too complex to isolate the different features of our approach, we also created a controlled OBDA environment on our own server to perform a careful study of our technique. In addition, we exported the triples of this controlled environment and loaded them into the commercial triple store Stardog^8 (v3.0.1).

8 http://stardog.com


To perform the controlled experiments, we set up an OBDA cross-linked environment based on the Wisconsin Benchmark [6].^9 The Wisconsin Benchmark was designed for the systematic evaluation of database performance with respect to different query characteristics. It comes with a schema that is designed so that one can quickly understand the structure of each table and the distribution of each attribute value. This allows easy construction of queries that isolate the features that need to be tested. The schema can be used to instantiate multiple tables. These tables, which we call "Wisconsin tables", contain 16 attributes and a primary key.

Observe that Ontop does not perform SQL federation; therefore it usually relies on systems such as Teiid^10 or EXAREME [19] (a.k.a. ADP) to integrate multiple databases. These systems expose to Ontop a set of tables coming from the different databases. Thus, to mimic this scenario we created a single database with 10 tables: 4 Wisconsin tables, representing different datasets, and 6 linking tables. Each Wisconsin table contains 100M rows; the 6 tables occupied ca. 100 GB of disk space, exposing more than 1.8B triples.

The following experiments evaluate the overhead of equality reasoning when answering SPARQL queries. The variables we considered are: (i) number of SPARQL joins (1-4); (ii) number and type of properties (0-4, data/object); (iii) number of linked datasets (2-3); (iv) selectivity of the query (0.001%, 0.01%, 0.1%); (v) number of equal objects between datasets (10%, 30%, 60%). In total we ran 1332 queries. The SPARQL queries have the following template:

SELECT * WHERE {
  ?x rdf:type :Class_i .                                      // i = 1..4
  ?x :DataProperty_{j-1} ?y1 . ?x :DataProperty_j ?y2 .       // j = 0..4
  ?x :ObjectProperty_{k-1} ?z1 . ?x :ObjectProperty_k ?z2 .   // k = 0..4
  FILTER( ?y < k% )
}

where a 0 or negative subindex means that the property is not present in the query. When we evaluated 2 datasets, we included equalities between elements of the classes A1 and A2. When we evaluated 3 datasets, the equality was between A1, A2, and A4. The class A3 and the properties S3 and R3 are isolated. We group the queries into 9 groups: (G1) no properties (c), (G2) 1 data property, 0 object properties (1d), (G3) 0 data properties, 1 object property (1o), ..., (G9) 2 data properties, 2 object properties (2d2o).

The average start-up time is ≈5 seconds. Observe that SPARQL engines based on materialization can take hours to start up with OWL DL ontologies [10]. The results are summarized in Figure 6. We show the worst execution time in each group, including the time that it takes to fetch the results.

Discussion: The results confirm that reasoning over OBDA-based integrated data has a high cost, but this cost is not prohibitive. The execution times at Statoil range from 3.2 seconds to 12.8 minutes, with mean 53 s and median 8.6 s. An overview of the execution times is shown in Fig. 7. The most complex query had 15 triple patterns, using object and data properties coming from both data sources.

In the controlled environment, in the 2 linked-datasets scenario with 120M equal objects (60%), even in the worst case most of the queries run in ≈5 min. The query that performs the worst in this setting (4 joins, 2 data properties, 2 object properties,

9 All the material to reproduce the experiments can be found online: https://github.com/ontop/ontop-examples/tree/master/iswc-crosslinked

10 http://teiid.jboss.org


Fig. 6. Worst Execution Time including fetching time - 2 linked-DS (left) and 3 linked-DS (right). [Bar charts: worst execution time in seconds for groups G1-G9, for 60%, 30%, and 10% equality.]

Fig. 7. Overview of query execution times for tests on EPDS at Statoil. [Histogram: number of queries per execution-time bucket: 0-10 s, 10-30 s, 30-60 s, 1-5 min, 5-20 min.]

105 selectivity) returns 480,000 results and takes ≈13 min. When we move to the 3 linked-datasets scenario, most executions (again, worst time in every group) take around 15 min. In this case, the worst query in G9 takes around 1.5 hours and returns 1,620,000 results. One can see that the number of linked datasets is the variable with the highest impact on query performance. The second variable is the number of object properties, since their translation is more complex than the one for data properties. The third variable is the selectivity. It is worth noticing that these results measure an almost pathological case, taking the system to its very limit. In practice, it is unlikely that 60% of all the objects of a 300M integrated dataset will be equal and belong to the same category. Recall that if they are not in the same category, the optimization presented in Section 6 removes the unnecessary SQL subqueries. For instance, in the integration of EPDS and NPD there are fewer than 10,000 equal wellbores, and there are millions of objects of different categories. Moreover, even 1.5 hours is a reasonable time: recall that Statoil users previously required weeks to get an answer for this sort of query.

Because of the partial-evaluation-based optimizations proposed in Section 6, with 2 datasets 30 out of 48 queries (52 out of 100 with 3 datasets) get optimized and executed in a few milliseconds. These queries are the ones that join elements in A1, A2, A4 (3 datasets) with A3, S3, and R3 elements. Since there is no equality between these elements, neither through owl:sameAs nor through standard equality, the SPARQL translation produces an empty SQL query; no SQL query gets executed, and 0 answers are returned automatically.

To load the data into Stardog we used Ontop to materialize the triples. The materialization took 11 hours, and it took another 4 hours to load the triples into Stardog. The default semantics that Stardog gives to owl:sameAs is not compliant with the official OWL semantics, since "Stardog designates one canonical individual for each owl:sameAs equivalence set"; however, one can force Stardog to consider all the URIs in the equivalence set. Our experiments show that Stardog does not behave according to the claimed semantics. Details can be found in [4].


8 Related Work

The treatment of owl:sameAs in reasoning and query evaluation has received considerable interest in recent years. After all, many data sources in the Linked Open Data (LOD) cloud provide owl:sameAs links to equivalent URIs, so it would be desirable to use them. Surprisingly, to the best of our knowledge, there has been no attempt to extend OBDA to take owl:sameAs into account. Next we discuss several approaches that handle owl:sameAs through rewriting.

Balloon Fusion [18] is a line of work that attempts to make use of owl:sameAs information in the LOD cloud for query answering. The approach is similar to ours in that it is based on rewriting a query to take into account equality inferences before executing it. The treatment of owl:sameAs is semantically very incomplete, however, since the rewriting only applies to URIs stated explicitly in the query. No equality reasoning is applied to the variables in the query, which is a main point of our work.
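The incompleteness can be seen in a small sketch of URI-only rewriting. This is our own illustration of the general idea, not Balloon Fusion's code, and the URIs and the co-reference table are invented: only URIs named in the query are expanded with their equivalence sets, while variables (and hence join-level equality reasoning) are untouched:

```python
def expand_uris(query_uris, coreference):
    """Expand each URI mentioned explicitly in a query with the set of its
    known co-referent URIs. Variables are never rewritten, which is why
    this style of rewriting is incomplete for equality reasoning."""
    return {u: coreference.get(u, set()) | {u} for u in query_uris}

# Hypothetical co-reference information (owl:sameAs links)
coref = {"ex:npd_well1": {"ex:epds_well1"}}

expanded = expand_uris(["ex:npd_well1", "ex:unlinked"], coref)
print(expanded)
```

A query variable that should range over an equivalence class gets no such expansion, so answers requiring equality between two variable bindings are missed.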

The question of equality handling becomes quite different in nature in the context of a single data store that is already in triple format. Equality can then be handled essentially by rewriting equal URIs to one common representative. E.g., [15] report on doing this for an in-memory triple store, while simultaneously saturating the data with respect to a set of forward-chaining inference rules. Observe that in many scenarios (such as the Statoil scenario discussed here) this approach is not possible, both because the data would have to be moved from the original source, and because of the amount of data that would have to be loaded into memory. In a query-rewriting OBDA setting, this corresponds to the idea of making sure that mappings map equivalent entities from several sources to the same URI, which is often not practical or even impossible.
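The canonical-representative idea can be sketched with a standard union-find structure over the owl:sameAs graph. This is a generic illustration, not the code of [15], and the URIs and triple are invented:

```python
class UnionFind:
    """Union-find over URIs; each owl:sameAs equivalence set collapses
    to one canonical representative."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # deterministic canonical choice: lexicographically smallest URI
            if rb < ra:
                ra, rb = rb, ra
            self.parent[rb] = ra

uf = UnionFind()
for a, b in [("ex:a", "ex:b"), ("ex:b", "ex:c")]:  # owl:sameAs assertions
    uf.union(a, b)

# Rewriting the data: every subject is replaced by its representative.
triples = [("ex:c", "ex:name", '"well-1"')]
canonical = [(uf.find(s), p, o) for s, p, o in triples]
print(canonical)  # [('ex:a', 'ex:name', '"well-1"')]
```

After canonicalisation, ordinary query evaluation needs no further equality reasoning, but the rewritten data must live in a store one controls, which is exactly what the OBDA scenarios above rule out.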

Our approach is only valid when the links between records really mean semantic identity. When the links are uncertain, query answering requires the use of probabilistic database methods, as discussed, e.g., in [8] for a limited type of queries. Extending these methods to handle arbitrary SPARQL-style queries is not trivial.

9 Conclusions

In this paper we showed how to represent links over databases as owl:sameAs statements, and we proposed a mapping-based framework that carefully constructs owl:sameAs statements to minimize the performance impact of equality reasoning. To recover rewritability of SPARQL into SQL, we imposed a suitable set of restrictions on the linking mechanisms; these restrictions are fully compatible with real-world requirements and, together with the owl:sameAs mappings, make the SPARQL-to-SQL translation possible. We showed how to answer SPARQL queries over cross-linked datasets using query transformation, and how to optimize the translation to improve the performance of the produced SQL query. To empirically support these claims, we provided an extensive set of experiments over real enterprise data, and also in a controlled environment.

Acknowledgement. This paper is supported by the EU under the large-scale integrating project (IP) Optique (Scalable End-user Access to Big Data), grant agreement n. FP7-318338.


Diego Calvanese, Martin Giese, Dag Hovland, and Martin Rezk

References

1. A. Artale, D. Calvanese, R. Kontchakov, and M. Zakharyaschev. The DL-Lite family and relations. J. of Artificial Intelligence Research, 36:1–69, 2009.

2. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. of Automated Reasoning, 39(3):385–429, 2007.

3. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. of Automated Reasoning, 39(3):385–429, 2007.

4. D. Calvanese, M. Giese, D. Hovland, and M. Rezk. Ontology-based integration of cross-linked datasets. http://www.inf.unibz.it/~mrezk/pdf/techRep-ISWC15.pdf, 2015. [Online; accessed 30-April-2015].

5. S. Das, S. Sundara, and R. Cyganiak. R2RML: RDB to RDF mapping language. W3C Recommendation, W3C, Sept. 2012. Available at http://www.w3.org/TR/r2rml/.

6. D. J. DeWitt. The Wisconsin benchmark: Past, present, and future. In J. Gray, editor, The Benchmark Handbook. Morgan Kaufmann, 1992.

7. A. Doan, A. Y. Halevy, and Z. G. Ives. Principles of Data Integration. Morgan Kaufmann, 2012.

8. E. Ioannou, W. Nejdl, C. Niederee, and Y. Velegrakis. On-the-fly entity-aware query processing in the presence of linkage. PVLDB, 3(1):429–438, 2010.

9. R. Kontchakov, C. Lutz, D. Toman, F. Wolter, and M. Zakharyaschev. The combined approach to ontology-based data access. In Proc. of IJCAI 2011, pages 2656–2661, 2011.

10. R. Kontchakov, M. Rezk, M. Rodriguez-Muro, G. Xiao, and M. Zakharyaschev. Answering SPARQL queries over databases under OWL 2 QL entailment regime. In Proc. of ISWC 2014, volume 8796 of LNCS, pages 552–567. Springer, 2014.

11. R. Kontchakov, M. Rezk, M. Rodriguez-Muro, G. Xiao, and M. Zakharyaschev. Answering SPARQL queries over databases under OWL 2 QL entailment regime. In Proc. of ISWC 2014, volume 8796 of LNCS, pages 552–567. Springer, 2014.

12. J. W. Lloyd. Foundations of Logic Programming. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2nd edition, 1993.

13. B. Marnette. Generalized schema-mappings: From termination to tractability. In PODS '09, pages 13–22, New York, NY, USA, 2009. ACM.

14. B. Motik, B. Cuenca Grau, I. Horrocks, Z. Wu, A. Fokoue, and C. Lutz. OWL 2 Web Ontology Language profiles (second edition). W3C Recommendation, W3C, Dec. 2012. Available at http://www.w3.org/TR/owl2-profiles/.

15. B. Motik, Y. Nenov, R. E. F. Piro, and I. Horrocks. Handling owl:sameAs via rewriting. In B. Bonet and S. Koenig, editors, Proc. 29th AAAI, pages 231–237. AAAI Press, 2015.

16. M. Rodriguez-Muro, R. Kontchakov, and M. Zakharyaschev. Ontology-based data access: Ontop of databases. In Proc. of ISWC 2013, volume 8218 of LNCS, pages 558–573. Springer, 2013.

17. M. Rodriguez-Muro and M. Rezk. Efficient SPARQL-to-SQL with R2RML mappings. J. of Web Semantics, 2015. To appear.

18. K. Schlegel, F. Stegmaier, S. Bayerl, M. Granitzer, and H. Kosch. Balloon Fusion: SPARQL rewriting based on unified co-reference information. In Proc. of the 30th Int. Conf. on Data Engineering Workshops (ICDE 2014), pages 254–259. IEEE, 2014.

19. M. M. Tsangaris, G. Kakaletris, H. Kllapi, G. Papanikos, F. Pentaris, P. Polydoras, E. Sitaridi, V. Stoumpos, and Y. E. Ioannidis. Dataflow processing and optimization on grid and cloud infrastructures. IEEE Bull. on Data Engineering, 32(1):67–74, 2009.


Appendix C

A ‘Historical Case’ of Ontology-Based Data Access

This appendix reports the paper:

Diego Calvanese, Alessandro Mosca, Jose Remesal, Martin Rezk and Guillem Rull: A ‘Historical Case’ of Ontology-Based Data Access. In Proc. of Digital Heritage International Congress (DH), 2015.



A ‘Historical Case’ of Ontology-Based Data Access

Diego Calvanese∗, Alessandro Mosca†, Jose Remesal‡, Martin Rezk∗, and Guillem Rull‡

∗KRDB Research Centre, Free University of Bozen-Bolzano, Italy {calvanese,mrezk}@inf.unibz.it
†SIRIS Lab, Research Division of SIRIS Academic, Spain [email protected]

‡CEIPAC, University of Barcelona, Spain {remesal,grull}@ceipac.ub.edu

Abstract—Historical research has steadily been adopting semantic technologies to tackle several recent problems in the field, such as making explicit the semantics contained in the historical sources, formalising them and linking them. Over the last decades, in social sciences and humanities an immense amount of new quantifiable data have been accumulated and made available in interchangeable formats, opening up new possibilities for solving old questions and posing new ones. This paper introduces a web-based platform to ease the access of scholars to historical and cultural data distributed across different data sources. The approach relies on the Ontology-Based Data Access (OBDA) paradigm, where the different datasets are virtually integrated by a conceptual layer (an ontology). This work focuses on investigating the mechanisms and characteristics of the food production and commercial trade system during the Roman Empire.

Keywords—e-Culture, History of the Roman Empire economics, Ontology-Based Data Access, Cultural Data Integration, Linked Open Data, Web-based query/answering system

I. INTRODUCTION

Historical research has steadily been adopting semantic technologies [1], [2], [3] to tackle several recent problems in the field, such as making explicit the semantics contained in the historical sources, formalising them and linking them [4]. Historians, especially in Digital Humanities, are starting to use historical sources to aggregate information about history. Moreover, the recent advances in computing and computational tools (from machine learning, to applied mathematical statistics, text mining and topic-modelling algorithms, and semantic technologies) make it feasible to meaningfully manipulate, manage, and analyse these datasets. An outcome of this is that over the last decades, an immense amount of new quantifiable data have been accumulated, and made available in interchangeable formats, from social sciences to economics, opening up new possibilities for solving old questions and posing new ones [5].

Since a sustainable maturity in the development of Semantic Web and Linked Open Data technologies has been reached (think, e.g., of data exchange protocols, standardised knowledge representation languages, and common data formats1), a considerable number of public initiatives and projects have been funded to address the issue of building historical and cultural data, and making it public through the web. Among others, the following are worth mentioning here, since they represent pioneering efforts in the application of semantic technologies toward the development of e-culture

1W3C Standards, see http://www.w3.org/

portals providing multimedia access to distributed collections of cultural heritage objects: EUROPEANA2, ARIADNE3, CULTURESAMPO4, STITCH @ CATCH5, MultimediaN N9C6, CHIP7, EAGLE8, CIDOC CRM9, GETTY Vocabularies10, ICONCLASS11, EPIDOC12.

These projects can be characterised by one of the following two goals: (i) to explicitly expose data structures, integrated datasets, vocabularies, and ontologies to support further initiatives in the design and development of computer applications in the Digital Heritage area; and (ii) to represent implementations of the envisioned applications.

A shortcoming of the existing models developed by the projects in the first category is that they cannot be directly understood by non-experts, since (i) the concept names are often not self-explanatory (for instance, the concept name for ‘Information Carrier’ is ‘E84’ in CIDOC CRM); and (ii) the concepts are intentionally defined at a very abstract level in order to be useful for any domain in the digital humanities field (for instance, E75: ‘Conceptual Object Appellation’).

The emphasis of EPNet is on providing historians with computational tools to compare, aggregate, measure, geo-localise, and search data about Latin and Greek inscriptions on amphoras for food transportation. This approach relies on the Ontology-Based Data Access (OBDA) paradigm, where the different datasets are virtually integrated by a conceptual layer (an ontology).

Example 1.1: Suppose the user needs all the amphoras produced in ‘La Corregidora’, together with their geo-coordinates. The EPNet dataset contains information about amphoras and some (potentially incomplete) information about geo-coordinates. On the other hand, the Pleiades dataset (http://pleiades.stoa.org) contains more complete geo-coordinates information but has no information about amphoras. There are hundreds of types of amphoras, such as Dressel 1, Dressel 2-4, and Leptiminus 1, each of them represented in EPNet by an alphanumeric code, such as “DR1C-BTIR”. Thus creating a query for this simple information need is not only extremely

2http://www.europeana.eu3http://www.ariadne-infrastructure.eu4http://www.kulttuurisampo.fi5http://www.cs.vu.nl/STITCH6http://e-culture.multimedian.nl7http://chip.win.tue.nl8http://http://eagle-network.eu9http://www.cidoc-crm.org10http://www.getty.edu/research/tools/vocabularies11http://www.iconclass.nl12http://epidoc.sourceforge.net


complex, but also requires the user to know the DB encoding of each type and the schemas in the data sources, and to manually merge the information obtained from each of them. Ideally, the user should be able to execute a single simple query that does not require any specific knowledge about the underlying data sources, and get all the available information coming from both datasets.

Differently from providing access to virtual museums or digitalised collections, the OBDA implementation introduced in this paper, by means of state-of-the-art technologies and principles coming from the research area of Knowledge Representation [6], is meant to support scholars in experimentally verifying theoretical hypotheses, and in formulating new ones. Specifically, this paper provides the following contributions:

• The introduction of the EPNet Conceptual Reference Model.

• The specification of the relational schema which drove the deployment of the EPNet dataset.

• The EPNet ontology and mappings, and the ways they are used in the implemented OBDA system.

• The web-based implementation of the EPNet query/answering system.

The paper is organised as follows: Section II gives a brief introduction to the EPNet project. Section III concisely describes the different artefacts that we developed to build our data management system relying on ontologies. Section IV is devoted to the introduction, by means of examples, of the OBDA framework we implemented, explaining how this solution deals with data access, integration, and consistency issues. A preliminary, testing-oriented interface is hyperlinked in the same section. Section V concludes the paper.

II. THE HISTORICAL CONTEXT

The Roman Empire trade system is generally considered to be the first complex European trade network. It formed an integrated system of interactions and interdependencies between the Mediterranean basin and northern Europe. Over the last couple of centuries, scholars have developed a variety of theories to explain the organisation of the Roman Empire trade system. The majority of them continue to be speculative and difficult to falsify [7], [8].

EPNet aims at setting up an innovative framework to investigate the mechanisms and characteristics of the commercial trade system during the Roman Empire. The main objective of EPNet is to create an interdisciplinary experimental laboratory (the project team includes specialists from Social Sciences and Humanities, and from Physical and Computer Sciences) for the exploration, validation and falsification of existing theories, and for the formulation of new ones. This approach is made possible by (i) a large dataset of existing empirical data about Roman amphorae and their associated epigraphy that has been created during the last two decades (see, e.g., Fig. 1 and Fig. 2), and (ii) the front-line theoretical research done by historians on the political and economic aspects of the Roman trade system.

The economy of the Roman Empire: an ongoing debate. A crucial aspect of any society is the production, supply and re-distribution of food. This topic has long been, and still remains, one of the open problems for sustainable decision policies in a world-scale perspective. The food distribution during the Roman Empire is commonly associated with the control of the army. It is argued that the emperor and his circle managed the relationship between food and army in order to supervise and control the whole Roman territory and to strengthen and maintain their own political power. Two approaches are particularly evident in the current debate over scales and modalities of the Roman economic system: (i) the Roman Empire trade system as a specific model not connected with modern global economies, and (ii) the Roman Empire trade system as a sort of predecessor of modern global economies, perfectly explainable through modern economic theories. Assuming or not an analogy between past and present, or vice-versa, the scientific debate has focused mostly on the influence of the capital of the Empire (Rome) in the control and management of long-distance trade, rather than on analysing the role played by the periphery and regional distribution.

Fig. 1. Titulus pictus in ‘delta’ position over a Dressel 20 amphora

Roman archaeology provides us with an incredible source of data and information about economic productions and transactions around modern Europe and the Mediterranean basin (see Fig. 1). However, a scientific study of the mechanisms that have characterised these economic and political links is still missing. The main reason is the lack of formal approaches and methods in historical research. Specialists in history often do not even consider the possibility that their research can be scientifically supported and expressed using formal languages (codified using non-ambiguous languages capable of generating models that can be executed by analytical or computational methods). However, ancient societies provide a great opportunity to evaluate diachronic real-world data with a virtual laboratory in which formal models can be built and different hypotheses and theories about the past explored (see [9]).

In this context, semantic-based technologies for data management, such as OBDA, can account for discrete data in addition to qualitative influences and interpretations, so as to answer broader questions about motives and patterns in the historical record. In particular, OBDA enables scholars to retrieve information stored in the EPNet dataset in a domain-centred and scholar-friendly way, thus supporting the identification of patterns and trends in this information and the discovery of relationships between disparate pieces of it.

OBDA supports EPNet in facing the main challenge of providing users with: (i) a running technology for accessing data in a way that is conceptually sound with their own domain knowledge (see the EPNet Conceptual Reference Model and the ontology introduced in the next section); (ii) a semantically-transparent platform, ready to acquire and be complemented with new data from different sources (domain-related historical datasets managed by research labs or promoted publicly); (iii) a theoretically grounded mechanism to homogenise information stored in different formats and according to different conceptualisations (alternative representations of periods of time, for instance, or locations stored differently according to their ancient or modern name).

Fig. 2. The result of a query over the stamp ACIRGI in the EPNet dataset

By means of the OBDA technology, extensive amounts of the EPNet data (see, e.g., Fig. 2) are to be connected and subsequently interpreted at a variety of levels that will give new insights into the complexity of the Roman Empire exchange relations. Moving beyond the limitations of a traditional relational DB is essential for the generation of new knowledge, and for the specification of values and parameters that will be manipulated in the simulation experiments.

III. KNOWLEDGE REPRESENTATION AND DATA MANAGEMENT IN EPNET

In this section, we present the EPNet Conceptual Reference Model (CRM), the derived (logical) data model, and the EPNet ontology. The last part of this section is devoted to the introduction of the ‘Pleiades’ dataset, whose data content has been integrated in the project dataset, thus (i) increasing the coverage of the data provided to the final users with respect to the domain of interest (completeness), and (ii) complementing the characterisation of the geographical entities already present in the initial dataset (accuracy).

The Conceptual Reference Model (CRM). The specification of the EPNet CRM for the representation of epigraphic information and domain expert knowledge about Roman Empire Latin inscriptions was meant to unambiguously represent the way the data are understood by scholars, how they are connected, and what their coverage is with respect to the literature of reference and current research practices in the history of the Roman Empire. The CRM has been formally specified in the conceptual modelling language called ‘Object Role Modelling’ (ORM2), and by means of NORMA, a data modelling tool for ORM213. Nonetheless, the CRM has been defined according to the state-of-the-art formal ontological models and standards for representing the structure of cultural heritage objects and the relationships between them. In particular, in order to increase the interoperability of the CRM, and of the whole EPNet dataset, with other similar initiatives and data sources, the main section of the model results in a specialisation/extension of the well-known CIDOC Conceptual Reference Model, the most dominant ontology in cultural heritage.

For the sake of model maintenance, and according to the specific nature of the involved information, the CRM has been structurally organised into distinct interrelated sub-sections. Moreover, according to the different aim of each sub-section, we again relied on existing standards for recording and publishing information on the Semantic Web, such as FaBiO (the FRBR-aligned Bibliographic Ontology, http://vocab.ox.ac.uk/fabio) for the bibliographic references documenting the entities in the CRM. The following are the five main sections of the EPNet CRM:

13NORMA is an open source plug-in to Microsoft Visual Studio .NET, freely downloadable from http://www.ormfoundation.org/.

Main deals with the representation of the main domain entities (e.g., inscriptions, amphoric types, associated epigraphic information), their properties (e.g., finding place, letter dimensions, archaeometric characterisation), and mutual relationships (see Fig. 3).

Time offers a conceptual arrangement, driven by the experts, of the different modalities used to denote interval periods, dates, and punctual instants of time, w.r.t. the given research domain. As explained in Section IV-A, the different formats the domain experts use for temporal information have been homogenised in the implemented OBDA system, in order to maintain the epistemological flexibility they provide in looking for specific data, while keeping the possibility to interchange between them and translate one into another (e.g., to move from the string ‘Trajan Government’ to the corresponding numerical time-span ‘98-117’).
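The translation between the two temporal formats can be sketched as a simple lookup. The table and function names below are illustrative assumptions, not the actual EPNet vocabulary or code; the reign dates themselves (Trajan 98-117 CE, Hadrian 117-138 CE) are historical:

```python
# Hypothetical lookup table from named periods to numerical time-spans (CE).
REIGN_SPANS = {
    "Trajan Government": (98, 117),
    "Hadrian Government": (117, 138),
}

def to_time_span(period_name):
    """Translate a named period into its numerical time-span."""
    return REIGN_SPANS[period_name]

def overlaps(period_name, year_from, year_to):
    """Check whether a named period intersects a queried numeric interval,
    letting users mix the two temporal formats in one query."""
    start, end = to_time_span(period_name)
    return start <= year_to and year_from <= end

print(to_time_span("Trajan Government"))       # (98, 117)
print(overlaps("Trajan Government", 100, 110))  # True
```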

Space is meant to deal with information concerning space and geographical localisation of the entities in the Main CRM. A heterogeneous set of entities in the Main CRM brings a characterisation in terms of space, from the finding activities involved in the discovery of an artefact, to the relative position of an inscription with respect to other stylistic and structural elements of an amphora. The Space section has been, for this reason, divided into two distinct subsections: (i) a ‘carrier-centred’ one, used to represent the spatial relationships between the structural and the epigraphic components of an amphora (e.g., the relative position of an inscription with respect to the amphora handles) and, (ii) a geographic one, which provides the elements for the representation of the location of a carrier finding, its production and potting, the function of this location (e.g., civil settlement, legionary camp, fort) and the latitude and longitude coordinates identifying it on a map. The geographic part of the model, complemented by information coming from different sources (see subsection III-A), offers the possibility to geo-localise the domain entities, as well as to make a distinction, and a semantically sound mapping, between historical (e.g., Roman provinces) and contemporary places.

Documental is devoted to the representation of the bibliographic information documenting the entities of interest (e.g., conference and workshop papers, books, web portals and digital encyclopedias).

Upper Typing is simply a collection of all the taxonomic structures characterising the entities in the Main CRM. Having all the taxonomies collected in a single place makes their management and successive extension a lot easier, also for scholars with no technical background.

The CRM model made of the five sections introduced above, besides being formally correct and consistent, is comprehensive enough to host all the information and knowledge elicited from the domain experts, and represents a definitive improvement in quality and granularity w.r.t. the previously adopted informal data structure descriptions we faced at the beginning of EPNet.

Fig. 3. A fragment of the EPNet CRM, where Inscriptions are related with the activities Producing, Potting, and Finding. Stamps are inscriptions characterised, among others, by their Relief, Shape type, and Reading Direction. The model, written in ORM2, also shows that inscriptions are directly connected with ‘simplified’ and ‘full’ transcriptions, bringing information about their translation into contemporary languages, and their conservation status, respectively. The pink coloured symbols indicate cardinality constraints that have been superimposed on the schema, while the arrow stands for the usual is-a relation.

The EPNet dataset. While the CRM represents the knowledge of the domain, it does not specify how to store the actual data. Data storage greatly depends on the underlying technology, i.e., different technologies store data in different ways, which results in a specification that is tied to the particular technology being used. Since the knowledge of the domain is independent of any particular technology, it is a common practice to specify data storage separately from it.

In EPNet, we use a relational database management system (RDBMS) to store our data, so we must provide a relational specification that complements our CRM. An RDBMS structures data in the form of tables (a.k.a. relations), so a relational specification has to indicate which tables form the database and which are their attributes (a.k.a. columns). It is important to note that the data currently available in the project does not cover the entirety of the domain knowledge represented in the CRM, but rather a subset of it. Consequently, our efforts on providing a relational specification have so far focused on this specific subset of the domain. Due to space reasons, only a small fragment of this relational specification is shown in Fig. 4. Tables are depicted as boxes, with their name at the top (e.g., inscription) and the list of attributes following (e.g., id, carrier). Each attribute consists of a name and a data type (e.g., id INT(11), which indicates that the identifier of an inscription is an integer number). In particular, notice the tables informationcarrier, amphoratype, and amphtyping, which we will be using in the examples in the following section: informationcarrier stores data about amphorae, such as an identification number and a reference to both its producing and finding activities (detailed data about these activities is stored in separate tables); amphoratype records the information of each kind of amphora; and amphtyping links amphora identifiers with the corresponding type identifier(s) (there could be more than one if the exact type of an amphora could not be identified but was narrowed down to a small set of possible types instead). Relationships between tables are depicted in the specification as lines connecting them (see, for example, the lines connecting informationcarrier, amphtyping, and amphoratype).
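As a rough sketch of how these three tables interact, the following SQLite snippet recreates a minimal version of them and joins an amphora with its candidate types. Column names beyond id, carrier and amphoratype, and all data values, are assumptions for illustration; the real EPNet schema is richer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE informationcarrier (
    id INT PRIMARY KEY           -- amphora identifier
);
CREATE TABLE amphoratype (
    id   INT PRIMARY KEY,
    code TEXT                    -- e.g. an alphanumeric type code
);
CREATE TABLE amphtyping (        -- links amphorae to one or more types
    carrier     INT REFERENCES informationcarrier(id),
    amphoratype INT REFERENCES amphoratype(id)
);
INSERT INTO informationcarrier VALUES (1);
INSERT INTO amphoratype VALUES (10, 'DR1C-BTIR'), (11, 'DR20');
-- amphora 1 narrowed down to two possible types
INSERT INTO amphtyping VALUES (1, 10), (1, 11);
""")

rows = conn.execute("""
    SELECT c.id, t.code
    FROM informationcarrier c
    JOIN amphtyping a  ON a.carrier = c.id
    JOIN amphoratype t ON t.id = a.amphoratype
    ORDER BY t.code
""").fetchall()
print(rows)  # [(1, 'DR1C-BTIR'), (1, 'DR20')]
```

The many-to-many amphtyping table is what allows an amphora whose type could not be pinned down to carry several candidate types at once.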

The EPNet ontology. In order to support the user with the possibility of accessing data through a domain-centred conceptual layer and terminology, the relational specification introduced in the previous section has been encoded into an ontology. The resulting ontology, written in a formal language whose expressivity stays within the OWL 2 QL profile14, modifies and extends (by means of suitable concept hierarchies, see Example 4.2) the vocabulary of the database schema by re-introducing part of the domain-specific terminology extracted with the support of the domain experts. The ontology captures the domain knowledge by taking into consideration, at the same time, the available data and the user requirements in terms of data accessibility and usage.

In the majority of the current projects in cultural heritage and humanities dealing with semantic technologies, the conceptualisation of the domain is expected to expose data structures suitable for a generic audience (from tourists visiting a museum or searching on the Web for their favourite piece of art, to public administrations willing to open up their cultural resources and historic properties). Instead, the EPNet ontology has been specified in collaboration with experts on the history of the Roman economy, with the main aim of: (i) supporting them in measuring aggregate changes over decades and centuries, (ii) trying out historical hypotheses across the time-scale of centuries, and (iii) systematically collecting information to question standard narratives [10]. The characteristic trait of the EPNet ontology, and of the domain knowledge encoded in the EPNet CRM, is that of being ‘functional to research’.

The EPNet ontology contains axioms that provide formal definitions for the concepts and (binary) relations the experts make use of in conceptually classifying the entities of their research domain. As an example15, consider the following axioms:

14http://www.w3.org/TR/owl2-profiles/
15A more comprehensive picture of the ontology can be found at http://136.243.8.213/obdasystem/, where a simple user interface has been implemented with the only aim of testing the system and its basic query functionalities.


Fig. 4. A fragment of the relational specification of our database

Fig. 5. EPNet system and Pleiades (figure components: Federation Engine, Ontology, Mappings, and the Data Sources, namely our DB and Pleiades)

:Stamp rdfs:subClassOf :Inscription .
:TitulusPictus rdfs:subClassOf :Inscription .
:Amphora rdfs:subClassOf :InfCarrier .
:carriedBy rdfs:domain :Inscription .
:carriedBy rdfs:range :InfCarrier .
:producedAt rdfs:domain :InfCarrier .
:producedAt rdfs:range :TimeSpan .
:hasName rdf:type owl:DatatypeProperty .

They say that the concepts :Stamp and :TitulusPictus are both subconcepts of :Inscription (see also Fig. 3), while :Amphora is a specialisation of :InfCarrier. The :carriedBy relation links inscriptions with their informational carrier and, similarly, the domain and range of the :producedAt relation are specified to be the :InfCarrier concept and the :TimeSpan in which the existence of the carrier is historically attested. The last axiom characterises :hasName as a datatype property, i.e., a property whose range is a specific datatype (:String in this case).

In addition, in order to expose the user to a domain-oriented vocabulary, special axioms have been added to the ontology. For instance, the following axiom introduces a new relation in the ontology by saying that engravedOn generalises the carriedBy relation between inscriptions and their informational carriers:

:carriedBy rdfs:subPropertyOf :engravedOn .

Notice that the expressivity of OWL 2 QL allows for the specification, among others, of disjointness constraints between concepts, thereby supporting data consistency checking that can be automatically performed by means of traditional reasoning technologies (see Section IV-B). Being able to apply data consistency checks over the project data is of particular interest in such a context, considering that the data are usually collected by non-experts and manually entered into a DB system without the support of any specific data entry interface.

A. Pleiades

Pleiades16 is an open-access digital gazetteer for ancient history. It provides stable Uniform Resource Identifier (URI) representations for tens of thousands of geographic entities. Built on the Classical Atlas Project (1988–2000), which produced the ‘Barrington Atlas of the Greek and Roman World’ [11], Pleiades is co-organised by the Institute for the Study of the Ancient World (NYU) and the Ancient World Mapping Center (UNC Chapel Hill). Pleiades is beginning to expand beyond its classical Greco-Roman roots and is establishing lines of interoperability with a number of other web-based resources treating the geographical, textual, visual and physical culture of antiquity. The Pleiades dataset has been selected to complement the EPNet dataset. In particular, it provides a set of geographic entities that strictly subsumes those present in the project dataset (e.g., specific municipalities and Roman provinces are present in EPNet only if they are a finding, producing, or potting place). The integration with Pleiades supports EPNet in tracing trade routes and economic connections across the Roman Empire more precisely and over a satisfactory picture of the past anthropic environment. If a location is present in both Pleiades and the project DB but is missing some attributes in the latter (e.g., the place has no geo-coordinates), the system is able to identify the missing attributes, fetch their associated values, and use them to augment the entry in EPNet, thus increasing the overall accuracy and completeness of the stored data.

IV. OBDA IN EPNET

Since the mid 2000s, Ontology-Based Data Access (OBDA) has become a popular approach to tackle the problems mentioned in Section II. An overall architecture of the EPNet OBDA setting is shown in Fig. 5. In OBDA, a conceptual layer is given in the form of an ontology that defines a shared vocabulary, models the domain, hides the structure of the data sources, and can enrich incomplete data with background knowledge. In our setting, the ontology is the one presented in Section III. Then, queries are posed over this high-level conceptual view, and the users no longer need an understanding of the data sources, the relations between them, or the encoding of the data. Queries are translated by the OBDA system into queries over the data sources.

16 http://pleiades.stoa.org

The ontology is connected to the data sources through a declarative specification given in terms of mappings that relate symbols in the ontology (classes and properties) to (SQL) views over the data. Intuitively, the mappings expose the data in the database as Resource Description Framework (RDF) triples. RDF is a World Wide Web Consortium (W3C) specification for data interchange on the Web. This standard is based upon the idea of making statements about resources in the form of subject–predicate–object expressions. These expressions are known as triples in RDF terminology. Examples of such triples are

http://epnet-url.org/1 rdf:type :Amphora
http://epnet-url.org/1 :producedIn http://epnet-url.org/place/5

respectively stating that the element represented by the URI http://epnet-url.org/1 is an amphora, and that it was produced in the place represented by the URI http://epnet-url.org/place/5.

Intuitively, each of the mapping assertions that generates these triples (in OBDA) consists of a source, which is an SQL query retrieving values from the database, and a target, defining RDF triples with values from the source.

(subject predicate object)   ←−   SQL statement
      target triple                 source query

Subjects and objects in RDF triples are resources (individuals or values) represented by URIs or literals. They are generated using templates in the mappings. For instance, the URI template :Amphora-{ic_id}, where ic_id is an attribute in some DB table, generates the URI :Amphora-1 when ic_id is instantiated to ’1’. In addition, the colon symbol ’:’ represents the default URI prefix, in this example http://epnet-url.org, hence the generated URI is actually http://epnet-url.org/Amphora-1. Let us illustrate this with a further example.
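As an illustration, the template-expansion step can be sketched in a few lines of Python. The helper name expand_uri_template and the row format are ours, for illustration only; they are not Ontop's actual API.

```python
# Minimal sketch of URI-template instantiation as used in OBDA mappings.
# The template :Amphora-{ic_id} and the default prefix come from the text;
# expand_uri_template is a hypothetical helper.

DEFAULT_PREFIX = "http://epnet-url.org/"  # bound to the ':' prefix

def expand_uri_template(template, row):
    """Replace each {attr} placeholder with the attribute's value from a
    database row, and expand a leading ':' into the default prefix."""
    uri = template
    for attr, value in row.items():
        uri = uri.replace("{" + attr + "}", str(value))
    if uri.startswith(":"):
        uri = DEFAULT_PREFIX + uri[1:]
    return uri

print(expand_uri_template(":Amphora-{ic_id}", {"ic_id": 1}))
# http://epnet-url.org/Amphora-1
```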

Example 4.1 (EPNet Mappings): The following mapping populates the class :Dressel1 (which is a subclass of :Amphora):

:Amphora-{ic_id} rdf:type :Dressel1 ←−
    SELECT ic.id AS ic_id, t.code AS t_code
    FROM InformationCarrier ic
    JOIN AmphTyping amt ON amt.carrier = ic.id
    JOIN AmphoraType t ON t.code = amt.type
    WHERE amt.type = 'DR1'

Observe that this is a rather complex SQL query that joins information from three different tables. This complexity is hidden from the users by the simple concept :Dressel1.

The ontology together with the mappings and the database exposes a virtual RDF graph, which can be queried using SPARQL, the standard query language in the Semantic Web community.

Example 4.2: Assume that the users need all the amphoras produced in “La Corregidora”, together with the geo-coordinates of that place. This can be translated to the following SPARQL query using the vocabulary from the ontology.

SELECT * WHERE {
  ?x rdf:type :Amphora .
  ?x :producedIn ?pl .
  ?pl rdf:type :Place .
  ?pl :hasName "La Corregidora" .
  ?pl :hasLatitude ?lat .
  ?pl :hasLongitude ?long
}

Observe that users do not need to know the particular codes of the amphoras, nor do they need to manually integrate the information coming from EPNet and Pleiades.

There are several OBDA systems in both academia and industry [12], [13], [14], [15]. We work with Ontop [12], [16], [17], [18], a mature open-source system, which is currently being used in a number of projects. Ontop allows the users to materialize virtual RDF graphs, generating RDF triples that can be used with RDF triplestores; alternatively, the graphs can be kept virtual and accessed only during query execution. The virtual approach avoids the cost of materialization and can profit from more than 30 years of maturity of relational systems (efficient query answering, security, robust transaction support, etc.). To answer queries in the virtual approach by exploiting the information given by the ontology, Ontop relies on query rewriting. To illustrate this, let us come back to Example 4.2. When the user queries the class :Amphora, Ontop uses the ontology to infer that all the elements that belong to one of its subclasses (e.g., :Dressel1) also belong to the class :Amphora. Intuitively, Ontop rewrites the query in Example 4.2 creating a union for each subclass of :Amphora:

SELECT * WHERE {
  { ?x rdf:type :Amphora .
    ?x :producedIn ?pl .
    ?pl rdf:type :Place .
    ?pl :hasName "La Corregidora" .
    ?pl :hasLatitude ?lat .
    ?pl :hasLongitude ?long .
  } UNION {
    ?x rdf:type :Dressel1 .
    ?x :producedIn ?pl .
    ?pl rdf:type :Place .
    ?pl :hasName "La Corregidora" .
    ?pl :hasLatitude ?lat .
    ?pl :hasLongitude ?long .
  } UNION {
    ?x rdf:type :Leptiminus1 .
    · · ·
  }
}
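Ontop's actual rewriting machinery is considerably more sophisticated, but the subclass-expansion idea behind the union above can be sketched in plain Python. The dictionary encoding of the class hierarchy is our own simplification; only the class names are taken from the example.

```python
# Sketch of the subclass-expansion step of query rewriting: a class atom
# is expanded into one disjunct per (transitively) entailed subclass.

subclass_of = {                      # direct subclass axioms, as in the text
    ":Dressel1": ":Amphora",
    ":Leptiminus1": ":Amphora",
    ":Amphora": ":InfCarrier",
}

def subclasses(cls):
    """All classes entailed to be subsumed by cls (including cls itself)."""
    result = {cls}
    changed = True
    while changed:
        changed = False
        for sub, sup in subclass_of.items():
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result

def rewrite_class_atom(var, cls):
    """Rewrite ?var rdf:type cls into a union of atoms, one per subclass."""
    return [f"?{var} rdf:type {c}" for c in sorted(subclasses(cls))]

for disjunct in rewrite_class_atom("x", ":Amphora"):
    print(disjunct)   # one disjunct each for :Amphora, :Dressel1, :Leptiminus1
```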

Ontop is available as a Protégé 4 plugin, a SPARQL endpoint through Sesame Workbench, and a Java library supporting the OWL API and the Sesame API.

A. EPNet Data Integration

Ontop allows for virtual data integration. In this approach the data remain in the sources and are accessed at query time. Ontop does not modify the underlying databases, which is a requisite in this use case, nor does it require complex extract-transform-load (ETL) processes. The classes and properties in the ontology cluster different fragments of the databases into a homogenised, well-defined set of triples.


Ontop does not integrate the databases at the SQL level. For that, it relies on a standard federation engine such as Teiid17 or Exareme [19]. Any of these engines will expose a set of schemas containing the tables from each of the data sources. Ontop does the semantic integration and homogenisation over these federated databases. Here we will discuss the integration of EPNet and Pleiades focusing on space and time periods. The integration starts in the ontology, where concepts cover information contained in both datasets. The :Place concept, for instance, is characterised in the ontology by having a given function (e.g., :ProductionPlace, :CivilSettlement, :LegionaryCamp); it is linked through the :hasLatitude and :hasLongitude relations to its geo-coordinates, and it :fallsWithin or :isContainedIn other known places. Then the information from both datasets gets connected through properties. In our running example we find:

• :producedIn, connecting amphoras in EPNet and places in EPNet and Pleiades, and

• :hasLatitude, connecting places in EPNet and Pleiades with latitude coordinates in both datasets.

Space. Both EPNet and Pleiades have information regarding places, settlements, geo-coordinates, etc. However, Pleiades is more complete space-wise; moreover, it contains a kind of settlement that is missing in EPNet. If a place is not in the EPNet dataset, we completely rely on the data from Pleiades (name(s), geo-coordinates, and kind of settlement). If the place is in EPNet (with the comparison done by name), then we keep the existing EPNet data and add the kind of settlement (which is not in EPNet). Moreover, if the existing data are incomplete (e.g., missing coordinates), we fill the gaps with Pleiades data.
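This completion policy amounts to a simple record merge, which can be sketched as follows. The record format, field names, and coordinate values are ours, for illustration only.

```python
# Sketch of the EPNet/Pleiades completion policy for places: keep EPNet
# values where present, and fill missing attributes (None) from Pleiades.

def merge_place(epnet, pleiades):
    """Merge a Pleiades place record into an EPNet one, EPNet taking priority."""
    if epnet is None:                  # place unknown to EPNet:
        return dict(pleiades)          # rely entirely on Pleiades
    merged = dict(pleiades)
    merged.update({k: v for k, v in epnet.items() if v is not None})
    return merged

# Made-up records: the EPNet entry is missing its coordinates.
epnet_rec = {"name": "La Corregidora", "lat": None, "long": None}
pleiades_rec = {"name": "La Corregidora", "lat": 37.6, "long": -5.4,
                "settlement_kind": "villa"}

print(merge_place(epnet_rec, pleiades_rec))
```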

To cluster all the information about places in both datasets into a single well-defined concept :Place we use mappings. Here we present a simplified version of the mappings for the sake of presentation:

Pleiades:

pleiades:{path} rdf:type :Place ←−
    SELECT pp.path AS path
    FROM pleiades.places pp
    JOIN pleiades.names pn ON pn.pid = pp.id

EPNet:

:Place-{gl_id} rdf:type :Place ←−
    SELECT gl.id AS gl_id
    FROM GeographicLocation gl

Observe that URIs here also encode provenance information, namely “pleiades” and the colon (the EPNet default URI). This can help the user to assess where the information is coming from.

Time. Regarding the time periods, EPNet and Pleiades specify time periods using lists of integer pairs, for instance [(98, 117), (130, 140)] to state that an object or a place existed either in the period 98 AD – 117 AD, or in 130 AD – 140 AD. Besides these numeric values, users are often interested in using governments as time periods. For example, instead of using 98 AD – 117 AD, they prefer to use the term “Trajan Government”. To achieve this, we add a mapping defining the term “Trajan Government” as follows:

17 teiid.jboss.org/

:Amphora-{ic_id} :producedAt :Trajan-Government ←−
    SELECT ic.id AS ic_id
    FROM 〈complex join〉
    WHERE startYear <= 117 AND endYear >= 98

Now the user can query the amphoras in production during this period using either of these two equivalent queries:

SELECT * WHERE {
  ?x rdf:type :Amphora .
  ?x :producedAt :Trajan-Government .
}

and

SELECT * WHERE {
  ?x rdf:type :Amphora .
  ?x :producedAt ?y .
  ?y rdf:type :YearSpan .
  ?y :startsAt ?s .
  ?y :endsAt ?e .
  FILTER (?s <= 117 && ?e >= 98)
}

Neither of these formats for time follows the standard formats (e.g., xsd:gYear, xsd:dateTime, xsd:duration). However, adding them would simply require the small effort of adding a few mappings.
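The WHERE clause of the Trajan-Government mapping encodes the standard interval-overlap test: a span (start, end) intersects the period 98–117 AD exactly when start <= 117 and end >= 98. A minimal sketch, assuming the EPNet-style list-of-intervals representation described above (function names are ours):

```python
# Sketch of the interval-overlap condition used in the :Trajan-Government
# mapping: (start, end) overlaps (p_start, p_end) iff
# start <= p_end and end >= p_start.

TRAJAN = (98, 117)  # 98 AD - 117 AD, as in the example

def overlaps(span, period):
    start, end = span
    p_start, p_end = period
    return start <= p_end and end >= p_start

def attested_under(spans, period):
    """True if any of the object's attested time spans overlaps the period."""
    return any(overlaps(s, period) for s in spans)

print(attested_under([(98, 117), (130, 140)], TRAJAN))   # True
print(attested_under([(130, 140)], TRAJAN))              # False
```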

B. Ontop Data Consistency

A logic-based ontology language such as OWL allows ontologies to be specified as logical theories; this implies that it is possible to constrain the relationships between concepts, properties, and data. In OBDA, inconsistencies arise when the data in the sources together with the mappings violate the constraints imposed by the ontology, and it is of interest to check whether such violations occur. The following are some important types of constraints:

• Disjointness, stating that the intersection between two classes or between two properties should be empty. For instance, the classes :MilitaryCamp and :CivilSettlement must not have elements in common.

• Functionality of properties, stating that no individual can be related to more than one element through a functional property. For instance, the property :hasShape is functional since every amphora must have a unique shape.

Notice that disjointness can be expressed in the OWL 2 QL profile of OWL 2, while functionality cannot. However, both types of constraints can be checked by Ontop by posing suitable queries over the ontology and checking whether the answer to such queries is non-empty.
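Both checks reduce to evaluating a "violation query" and testing whether its answer is non-empty. A toy sketch over a hand-made set of triples (illustrative only; the data and helper names are ours, not Ontop's API):

```python
# Sketch of constraint checking by query evaluation: a constraint is
# violated iff the corresponding violation query has a non-empty answer.
# Triples are (subject, predicate, object); the data below is made up.
from collections import defaultdict

triples = {
    (":camp1", "rdf:type", ":MilitaryCamp"),
    (":camp1", "rdf:type", ":CivilSettlement"),   # violates disjointness
    (":amph1", ":hasShape", ":shapeA"),
    (":amph1", ":hasShape", ":shapeB"),           # violates functionality
}

def disjointness_violations(cls1, cls2):
    """Individuals asserted to belong to two disjoint classes."""
    members = lambda c: {s for (s, p, o) in triples
                         if p == "rdf:type" and o == c}
    return members(cls1) & members(cls2)

def functionality_violations(prop):
    """Individuals related to more than one object via a functional property."""
    values = defaultdict(set)
    for (s, p, o) in triples:
        if p == prop:
            values[s].add(o)
    return {s for s, objs in values.items() if len(objs) > 1}

print(disjointness_violations(":MilitaryCamp", ":CivilSettlement"))  # {':camp1'}
print(functionality_violations(":hasShape"))                         # {':amph1'}
```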

C. User Interface

Fig. 6. Screenshot of a user's query in the OBDA web interface

Fig. 7. SQL query that is actually executed on the EPNet dataset

A preliminary user interface for testing the OBDA functionalities in EPNet is available online18. It provides users with a text area in which to write SPARQL queries (e.g., the query in Example 4.2) using the vocabulary of the ontology discussed in Section III (for the convenience of the user, a summary of the ontology is provided by the interface; see Fig. 6). Following SPARQL syntax, users need to begin their queries with a prefix declaration, which in our case is:

PREFIX :    <http://136.243.8.213/obdasystem#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

18 http://136.243.8.213/obdasystem

After executing the query, the interface shows the SQL query that was sent to the underlying RDBMS (Fig. 7), and the result of the query in tabular form (Fig. 8).

V. CONCLUDING REMARKS AND FUTURE WORK

This paper presents the design and implementation of the OBDA approach in the context of the EPNet project. The OBDA technology helped us to deal in an efficient and sound way with data access, integration, and consistency issues. The integration with a greater number of available datasets, from different scholars and research initiatives, has already been planned. EPNet will also explore the application of text mining techniques to automatically extract information from the epigraphic corpus (e.g., person names, professions, places), thus going beyond the ‘syntactical’ descriptions of the conservation status of the inscriptions themselves, and fruitfully complementing the information already present in the project dataset.

Fig. 8. Results of the user's query execution.

Acknowledgments. We thank the reviewers for their valuable comments. The work is partially funded by the ERC Advanced Grant n. ERC-2013-ADG 340828 and by the EU under the large-scale integrating project (IP) Optique (Scalable End-user Access to Big Data), grant n. FP7-318338.

REFERENCES

[1] P. Hitzler, M. Krötzsch, and S. Rudolph, Foundations of Semantic Web Technologies. Chapman & Hall/CRC, 2009.

[2] N. Shadbolt, T. Berners-Lee, and W. Hall, “The semantic web revisited,” IEEE Intelligent Systems, vol. 21, no. 3, pp. 96–101, 2006.

[3] J. Domingue, D. Fensel, and J. A. Hendler, Handbook of Semantic Web Technologies, vols. 1–2. Springer, 2011.

[4] A. Meroño-Peñuela, A. Ashkpour, M. van Erp, K. Mandemakers, L. Breure, A. Scharnhorst, S. Schlobach, and F. van Harmelen, “Semantic technologies for historical research: A survey,” Semantic Web J., pp. 1–26, 2014.

[5] P. Raghavan, “It's time to scale the science in the social sciences,” Big Data & Society, vol. 1, no. 1, 2014.

[6] F. van Harmelen, V. Lifschitz, and B. Porter, Handbook of Knowledge Representation. Elsevier, 2007.

[7] P. Garnsey and C. Whittaker, Trade and Famine in Classical Antiquity, ser. Supplementary volume – Cambridge Philological Society. Cambridge University Press, 1983.

[8] E. Cascio and D. Rathbone, Production and Public Powers in Classical Antiquity, ser. Supplementary volume – Cambridge Philological Society. Cambridge Philological Society, 2000.

[9] J. M. Epstein, “Why model?” J. of Artificial Societies and Social Simulation, vol. 11, no. 4, 2008.

[10] J. Guldi and D. Armitage, The History Manifesto. Cambridge University Press, 2014.

[11] R. J. A. Talbert, Ed., Barrington Atlas of the Greek and Roman World. Princeton University Press, 2000.

[12] M. Rodriguez-Muro and M. Rezk, “Efficient SPARQL-to-SQL with R2RML mappings,” Journal of Web Semantics, 2015.

[13] B. Bishop, A. Kiryakov, D. Ognyanoff, I. Peikov, Z. Tashev, and R. Velkov, “OWLIM: A family of scalable semantic repositories,” Semantic Web J., vol. 2, no. 1, pp. 33–42, 2011.

[14] J. F. Sequeda, M. Arenas, and D. P. Miranker, “OBDA: query rewriting or materialization? In practice, both!” in Proc. of the 13th Int. Semantic Web Conf. (ISWC), vol. 8796. Springer, 2014, pp. 535–551.

[15] C. Civili, M. Console, G. De Giacomo, D. Lembo, M. Lenzerini, L. Lepore, R. Mancini, A. Poggi, R. Rosati, M. Ruzzi, V. Santarelli, and D. F. Savo, “MASTRO STUDIO: managing ontology-based data access applications,” Proc. of the VLDB Endowment, vol. 6, no. 12, pp. 1314–1317, 2013.

[16] G. Xiao, M. Rezk, M. Rodriguez-Muro, and D. Calvanese, “Rules and ontology based data access,” in Proc. of the 8th Int. Conf. on Web Reasoning and Rule Systems (RR), ser. Lecture Notes in Computer Science, vol. 8741. Springer, 2014, pp. 157–172.

[17] R. Kontchakov, M. Rezk, M. Rodriguez-Muro, G. Xiao, and M. Zakharyaschev, “Answering SPARQL queries over databases under OWL 2 QL entailment regime,” in Proc. of the 13th Int. Semantic Web Conf. (ISWC), ser. Lecture Notes in Computer Science, vol. 8796. Springer, 2014, pp. 552–567.

[18] D. Calvanese, M. Giese, D. Hovland, and M. Rezk, “Ontology-based integration of cross-linked datasets,” in Proc. of the 14th Int. Semantic Web Conf. (ISWC). Springer, 2015.

[19] H. Kllapi, P. Sakkos, A. Delis, D. Gunopulos, and Y. E. Ioannidis, “Elastic processing of analytical query workloads on IaaS clouds,” arXiv.org e-Print archive, CoRR Technical Report abs/1501.01070, 2015. [Online]. Available: http://arxiv.org/abs/1501.01070


Appendix D

Nested Regular Path Queries in Description Logics

This appendix reports the paper:

Meghyn Bienvenu, Diego Calvanese, Magdalena Ortiz, and Mantas Simkus: Nested Regular Path Queries in Description Logics. In Proc. of the 14th Int. Conference on the Principles of Knowledge Representation and Reasoning (KR), 2014.



Nested Regular Path Queries in Description Logics

Meghyn Bienvenu
Lab. de Recherche en Informatique
CNRS & Univ. Paris Sud, France

Diego Calvanese
KRDB Research Centre
Free Univ. of Bozen-Bolzano, Italy

Magdalena Ortiz, Mantas Simkus
Institute of Information Systems
Vienna Univ. of Technology, Austria

Abstract

Two-way regular path queries (2RPQs) have received increased attention recently due to their ability to relate pairs of objects by flexibly navigating graph-structured data. They are present in property paths in SPARQL 1.1, the new standard RDF query language, and in the XML query language XPath. In line with XPath, we consider the extension of 2RPQs with nesting, which allows one to require that objects along a path satisfy complex conditions, in turn expressed through (nested) 2RPQs. We study the computational complexity of answering nested 2RPQs and conjunctions thereof (CN2RPQs) in the presence of domain knowledge expressed in description logics (DLs). We establish tight complexity bounds in data and combined complexity for a variety of DLs, ranging from lightweight DLs (DL-Lite, EL) up to highly expressive ones. Interestingly, we are able to show that adding nesting to (C)2RPQs does not affect worst-case data complexity of query answering for any of the considered DLs. However, in the case of lightweight DLs, adding nesting to 2RPQs leads to a surprising jump in combined complexity, from P-complete to EXP-complete.

1 Introduction

Both in knowledge representation and in databases, there has been great interest recently in expressive mechanisms for querying data, while taking into account complex domain knowledge (Calvanese, De Giacomo, and Lenzerini 2008; Glimm et al. 2008). Description Logics (DLs) (Baader et al. 2003), which on the one hand underlie the W3C standard Web Ontology Language (OWL), and on the other hand are able to capture at the intensional level conceptual modeling formalisms like UML and ER, are considered particularly well suited for representing a domain of interest (Borgida and Brachman 2003). In DLs, instance data, stored in a so-called ABox, is constituted by ground facts over unary and binary predicates (concepts and roles, respectively), and hence resembles data stored in graph databases (Consens and Mendelzon 1990; Barceló et al. 2012). There is a crucial difference, however, between answering queries over graph databases and over DL ABoxes. In the former, the data is assumed to be complete, hence query answering amounts to the standard database task of query evaluation. In the latter, it is typically assumed that the data is incomplete and additional domain knowledge is provided by the DL ontology (or TBox). Hence query answering amounts to the more complex task of computing certain answers, i.e., those answers that are obtained from all databases that both contain the explicit facts in the ABox and satisfy the TBox constraints. This difference has driven research in different directions.

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In databases, expressive query languages for querying graph-structured data have been studied, which are based on the requirement of relating objects by flexibly navigating the data. The main querying mechanism that has been considered for this purpose is that of one-way and two-way regular path queries (RPQs and 2RPQs) (Cruz, Mendelzon, and Wood 1987; Calvanese et al. 2003), which are queries returning pairs of objects related by a path whose sequence of edge labels belongs to a regular language over the (binary) database relations and their inverses. Conjunctive 2RPQs (C2RPQs) (Calvanese et al. 2000) are a significant extension of such queries that add to the navigational ability the possibility of expressing arbitrary selections, projections, and joins over objects related by 2RPQs, in line with conjunctive queries (CQs) over relational databases. Two-way RPQs are present in the property paths in SPARQL 1.1 (Harris and Seaborne 2013), the new standard RDF query language, and in the XML query language XPath (Berglund and others 2010). An additional construct that is present in XPath is the possibility of using existential test operators, also known as nesting, to express sophisticated conditions along navigation paths. When an existential test 〈E〉 is used in a 2RPQ E′, there will be objects along the main navigation path for E′ that match positions of E′ where 〈E〉 appears; such objects are required to be the origin of a path conforming to the (nested) 2RPQ E. It is important to notice that existential tests in general cannot be captured even by C2RPQs, e.g., when tests appear within a transitive closure of an RPQ. Hence, adding nesting effectively increases the expressive power of 2RPQs and of C2RPQs.

In the DL community, query answering has been investigated extensively for a wide range of DLs, with much of the work devoted to CQs. With regards to the complexity of query answering, attention has been paid on the one hand to combined complexity, i.e., the complexity measured considering as input both the query and the DL knowledge base (constituted by TBox and ABox), and on the other hand to data complexity, i.e., when only the ABox is considered as input. For expressive DLs that extend ALC, CQ answering is typically coNP-complete in data complexity (Ortiz, Calvanese, and Eiter 2008), and 2EXP-complete in combined complexity (Glimm et al. 2008; Lutz 2008; Eiter et al. 2009). For lightweight DLs, instead, CQ answering is in AC0 in data complexity for DL-Lite (Calvanese et al. 2007), and P-complete for EL (Krisnadhi and Lutz 2007). For both logics, the combined complexity is dominated by the NP-completeness of CQ evaluation over plain relational databases. There has also been some work on (2)RPQs and C(2)RPQs. For the very expressive DLs ZIQ, ZOQ, and ZOI, where regular expressions over roles are present also in the DL, a 2EXP upper bound has been shown via techniques based on alternating automata over infinite trees (Calvanese, Eiter, and Ortiz 2009). For the Horn fragments of SHOIQ and SROIQ, P-completeness in data complexity and EXP/2EXP-completeness in combined complexity are known (Ortiz, Rudolph, and Simkus 2011). For lightweight DLs, tight bounds for answering 2RPQs and C2RPQs have only very recently been established by Bienvenu, Ortiz, and Simkus (2013): for (C)(2)RPQs, data complexity is NL-complete in DL-Lite and DL-LiteR, and P-complete in EL and ELH. For all of these logics, combined complexity is P-complete for (2)RPQs and PSPACE-complete for C(2)RPQs.

Motivated by the expressive power of nesting in XPath and SPARQL, in this paper we significantly advance these latter lines of research on query answering in DLs, and study the impact of adding nesting to 2RPQs and C2RPQs. We establish tight complexity bounds in data and combined complexity for a variety of DLs, ranging from lightweight DLs of the DL-Lite and EL families up to the highly expressive ones of the SH and Z families. Our results are summarized in Table 1. For DLs containing at least ELI, we are able to encode away nesting, thus showing that the worst-case complexity of query answering is not affected by this construct. Instead, for lightweight DLs (starting already from DL-Lite!), we show that adding nesting to 2RPQs leads to a surprising jump in combined complexity, from P-complete to EXP-complete. We then develop a sophisticated rewriting-based technique that builds on (but significantly extends) the one proposed by Bienvenu, Ortiz, and Simkus (2013), which we use to prove that the problem remains in NL for DL-Lite. We thus show that adding nesting to (C)2RPQs does not affect worst-case data complexity of query answering for lightweight DLs.

For lack of space, some proofs have been relegated to the appendix of the long version (Bienvenu et al. 2014).

2 Preliminaries

We briefly recall the syntax and semantics of description logics (DLs). As usual, we assume countably infinite, mutually disjoint sets NC, NR, and NI of concept names, role names, and individuals. We typically use A for concept names, p for role names, and a, b for individuals. An inverse role takes the form p− where p ∈ NR. We let N±R = NR ∪ {p− | p ∈ NR} and denote by r elements of N±R.

A DL knowledge base (KB) consists of a TBox and an ABox, whose forms depend on the DL in question. In the DL ELHI⊥, a TBox is defined as a set of (positive) role inclusions of the form r ⊑ r′ and negative role inclusions of the form r ⊓ r′ ⊑ ⊥ with r, r′ ∈ N±R, and concept inclusions of the form C ⊑ D, where C and D are complex concepts formed according to the following syntax:1

C ::= ⊤ | ⊥ | A | ∃r.C | C ⊓ C

with A ∈ NC and r ∈ N±R.

Some of our results refer specifically to the lightweight DLs that we define next. ELHI is the fragment of ELHI⊥ that has no ⊥. ELH and ELI are obtained by additionally disallowing inverse roles and role inclusions, respectively. DL-LiteR is also a fragment of ELHI⊥, in which concept inclusions can only take the forms B1 ⊑ B2 and B1 ⊓ B2 ⊑ ⊥, for Bi a concept name or a concept of the form ∃r.⊤ with r ∈ N±R. DL-Lite is the fragment of DL-LiteR that disallows (positive and negative) role inclusions.

An ABox is a set of assertions of the form C(a) or r(a, b), where C is a complex concept, r ∈ N±R, and a, b ∈ NI. We use Ind(A) to refer to the set of individuals in A.

Semantics. The semantics of DL KBs is based upon interpretations, which take the form I = (∆I, ·I), where ∆I is a non-empty set and ·I maps each a ∈ NI to aI ∈ ∆I, each A ∈ NC to AI ⊆ ∆I, and each p ∈ NR to pI ⊆ ∆I × ∆I.2 The function ·I can be straightforwardly extended to complex concepts and roles. In the case of ELHI⊥, this is done as follows: ⊤I = ∆I, ⊥I = ∅, (p−)I = {(c, d) | (d, c) ∈ pI}, (∃r.C)I = {c | ∃d : (c, d) ∈ rI, d ∈ CI}, and (C ⊓ D)I = CI ∩ DI. An interpretation I satisfies an inclusion G ⊑ H if GI ⊆ HI, and it satisfies an assertion C(a) (resp., r(a, b)) if aI ∈ CI (resp., (aI, bI) ∈ rI). A model of a KB (T, A) is an interpretation I which satisfies all inclusions in T and all assertions in A.

Complexity. In addition to P and (co)NP, our results refer to the complexity classes NL (non-deterministic logarithmic space), PSPACE (polynomial space), and (2)EXP ((double) exponential time), cf. (Papadimitriou 1993).

3 Nested Regular Path Queries

We now introduce our query languages. In RPQs, nested RPQs and their extensions, atoms are given by (nested) regular expressions whose symbols are roles. The set Roles of roles contains N±R, and all test roles of the forms {a}? and A? with a ∈ NI and A ∈ NC. They are interpreted as ({a}?)I = {(aI, aI)}, and (A?)I = {(o, o) | o ∈ AI}.

Definition 3.1. A nested regular expression (NRE), denoted by E, is constructed according to the following syntax:

E ::= σ | E · E | E ∪ E | E∗ | 〈E〉

where σ ∈ Roles.

1 We slightly generalize the usual ELHI⊥ by allowing for negative role inclusions.

2 Note that we do not make the unique name assumption (UNA), but all of our results continue to hold if the UNA is adopted.


                                     2RPQ               C2RPQ              N2RPQ / CN2RPQ
                                     data    combined   data    combined   data    combined
Graph DBs & RDFS                     NL-c    NL-c       NL-c    NP-c       NL-c    P-c / NP-c
DL-Lite                              NL-c    P-c        NL-c    PSPACE-c   NL-c    EXP-c
Horn DLs (e.g., EL, Horn-SHIQ)       P-c     P-c        P-c     PSPACE-c   P-c     EXP-c
Expressive DLs (e.g., ALC, SHIQ)     coNP-h  EXP-c      coNP-h  2EXP-c     coNP-h  EXP-c / 2EXP-c

Table 1: Complexity of query answering. The ‘c’ indicates completeness, the ‘h’ hardness. New results are marked in bold. For existing results, refer to (Bienvenu, Ortiz, and Simkus 2013; Pérez, Arenas, and Gutierrez 2010; Barceló Baeza 2013; Calvanese, Eiter, and Ortiz 2009; Ortiz, Rudolph, and Simkus 2011) and references therein.

We assume a countably infinite set NV of variables (disjoint from NC, NR, and NI). Each t ∈ NV ∪ NI is a term. An atom is either a concept atom of the form A(t), with A ∈ NC and t a term, or a role atom of the form E(t, t′), with E an NRE and t, t′ two (possibly equal) terms.

A nested two-way regular path query (N2RPQ) q(x, y) is an atom of the form E(x, y), where E is an NRE and x, y are two distinct variables. A conjunctive N2RPQ (CN2RPQ) q(~x) with answer variables ~x has the form ∃~y.ϕ, where ϕ is a conjunction of atoms whose variables are among ~x ∪ ~y.

A (plain) regular expression (RE) is an NRE that does not have subexpressions of the form 〈E〉. Two-way regular path queries (2RPQs) and conjunctive 2RPQs (C2RPQs) are defined analogously to N2RPQs and CN2RPQs, but allowing only plain REs in atoms.

Given an interpretation I, the semantics of an NRE E is defined by induction on its structure:

(E1 · E2)I = E1I ◦ E2I,
(E1 ∪ E2)I = E1I ∪ E2I,
(E1∗)I = (E1I)∗,
〈E〉I = {(o, o) | there is o′ ∈ ∆I s.t. (o, o′) ∈ EI}.

A match for a CN2RPQ q(~x) = ∃~y.ϕ in an interpretation I is a mapping π from the terms in ϕ to ∆I such that (i) π(a) = aI for every individual a of ϕ, (ii) π(x) ∈ AI for every concept atom A(x) of ϕ, and (iii) (π(x), π(y)) ∈ EI for every role atom E(x, y) of ϕ. Let ans(q, I) = {π(~x) | π is a match for q in I}. A tuple ~a of individuals with the same arity as ~x is called a certain answer to q over a KB 〈T ,A〉 if (~a)I ∈ ans(q, I) for every model I of 〈T ,A〉. We use ans(q, 〈T ,A〉) to denote the set of all certain answers to q over 〈T ,A〉. In what follows, by query answering we mean the problem of deciding whether ~a ∈ ans(q, 〈T ,A〉).

Example 3.1. We consider an ABox of advisor relationships of PhD holders³. We assume an advisor relation between nodes representing academics. There are also nodes for theses, universities, research topics, and countries, related in the natural way via the roles wrote, subm(itted), topic, and loc(ation). We give two queries over this ABox.

q1(x, y) = (advisor · 〈wrote · topic · Physics?〉)∗ (x, y)

³Our examples are inspired by the Mathematics Genealogy Project (http://genealogy.math.ndsu.nodak.edu/).

Query q1 is an N2RPQ that retrieves pairs of a person x and an academic ancestor y of x such that all people on the path from x to y (including y itself) wrote a thesis in Physics.

q2(x, y, z) = advisor−(x, z), advisor∗(x, w),
              advisor− · 〈wrote · 〈topic · DBs?〉 · subm · loc · {usa}?〉(y, z),
              (advisor · 〈wrote · 〈topic · Logic?〉 · subm · loc · EU?〉)∗(y, w)

Query q2 is a CN2RPQ that looks for triples of individuals x, y, z such that x and y have both supervised z, who wrote a thesis on Databases and who submitted this thesis to a university in the USA. Moreover, x and y have a common ancestor w, and all people on the path from x to w, including w, must have written a thesis in Logic and must have submitted this thesis to a university in an EU country.
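The relational semantics above can be evaluated directly over a finite interpretation by structural recursion. The following is a minimal illustrative sketch (the tuple-based NRE representation, the helper `eval_nre`, and the toy advisor graph are ours, not part of the paper):

```python
def eval_nre(expr, dom, conc, role):
    """Evaluate a nested regular expression over a finite interpretation.
    expr: nested tuples ('role', r), ('test', A), ('seq', e1, e2),
    ('alt', e1, e2), ('star', e), ('nest', e).
    conc: concept name -> set of objects; role: role name -> set of pairs."""
    kind = expr[0]
    if kind == "role":                       # sigma in Roles (a role name)
        return set(role.get(expr[1], set()))
    if kind == "test":                       # A? test role: identity on A^I
        return {(o, o) for o in conc.get(expr[1], set())}
    if kind == "seq":                        # E1 . E2: relational composition
        r1 = eval_nre(expr[1], dom, conc, role)
        r2 = eval_nre(expr[2], dom, conc, role)
        return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}
    if kind == "alt":                        # E1 U E2
        return eval_nre(expr[1], dom, conc, role) | eval_nre(expr[2], dom, conc, role)
    if kind == "star":                       # E*: reflexive-transitive closure
        rel = {(o, o) for o in dom} | eval_nre(expr[1], dom, conc, role)
        changed = True
        while changed:
            new = {(x, z) for (x, y1) in rel for (y2, z) in rel if y1 == y2}
            changed = not new <= rel
            rel |= new
        return rel
    if kind == "nest":                       # <E>: test for an outgoing E-path
        return {(o, o) for (o, _) in eval_nre(expr[1], dom, conc, role)}
    raise ValueError(kind)

# A toy advisor graph in the spirit of Example 3.1 (all names illustrative).
dom = {"ada", "bob", "t1", "t2", "physics"}
role = {"advisor": {("ada", "bob")},
        "wrote": {("ada", "t1"), ("bob", "t2")},
        "topic": {("t1", "physics"), ("t2", "physics")}}
conc = {"Physics": {"physics"}}

# q1 = (advisor . <wrote . topic . Physics?>)*
q1 = ("star", ("seq", ("role", "advisor"),
               ("nest", ("seq", ("role", "wrote"),
                         ("seq", ("role", "topic"), ("test", "Physics"))))))
print(("ada", "bob") in eval_nre(q1, dom, conc, role))  # True
```

The fixpoint loop for E∗ and the composition for E1 · E2 mirror the equations above; in a practical engine one would compute these relations on demand rather than materializing them.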

It will often be more convenient to deal with an automata-based representation of (C)N2RPQs, which we provide next.

Definition 3.2. A nested NFA (n-NFA) has the form (A, s0, F0), where A is an indexed set {α1, . . . , αn} and each αl ∈ A is an automaton of the form (S, s, δ, F ), where S is a set of states, s ∈ S is the initial state, F ⊆ S is the set of final states, and

δ ⊆ S × (Roles ∪ {〈j1, . . . , jk〉 | l < ji ≤ n for i ∈ {1, . . . , k}}) × S.

We assume that the sets of states of the automata in A are pairwise disjoint, and we require that {s0} ∪ F0 are states of a single automaton in A. If in each transition (s, 〈j1, . . . , jk〉, s′) of each automaton in A we have k = 1, then the n-NFA is called reduced.

When notationally convenient, we denote an n-NFA (A, s0, F0) by As0,F0. Moreover, we use Si, δi, and Fi to refer to the states, transition relation, and final states of αi.

Definition 3.3. Given an interpretation I, we define AIs0,F0 inductively as follows. Let αl be the (unique) automaton in A such that {s0} ∪ F0 ⊆ Sl. Then (o, o′) ∈ AIs0,F0 if there is a sequence s0 o0 s1 o1 · · · sk ok, for k ≥ 0, such that o0 = o, ok = o′, sk ∈ F0, and for each i ∈ {1, . . . , k} there is a transition (si−1, σi, si) ∈ δl such that either
– σi ∈ Roles and (oi−1, oi) ∈ σiI, or
– σi = 〈j1, . . . , jp〉, oi = oi−1, and, for every m ∈ {1, . . . , p}, there exists o′m ∈ ∆I with (oi, o′m) ∈ AIs′,F′, where s′ and F′ are the initial and final states of αjm, respectively.


Note that an n-NFA As0,F0 such that there are no transitions of the form (s, 〈j1, . . . , jk〉, s′) in the unique αl with {s0} ∪ F0 ⊆ Sl is equivalent to a standard NFA.

For every NRE E, one can construct in polynomial time an n-NFA As0,F0 such that EI = AIs0,F0 for every interpretation I. This is an almost immediate consequence of the correspondence between regular expressions and finite state automata. Moreover, any n-NFA can be transformed into an equivalent reduced n-NFA by introducing linearly many additional states. In the following, unless stated otherwise, we assume that all n-NFAs are reduced.

4 Upper Bounds via Reductions

In this section, we derive some upper bounds on the complexity of answering (C)N2RPQs in different DLs, by means of reductions to other problems. For simplicity, we assume in the rest of this section that query atoms do not employ test roles of the form {a}?. This is without loss of generality, since each symbol {a}? can be replaced by Aa? for a fresh concept name Aa, by adding the ABox assertion Aa(a).

We start by showing that answering CN2RPQs can be polynomially reduced to answering non-nested C2RPQs using TBox axioms that employ inverses, conjunction on the left, and qualified existential restrictions.

Proposition 4.1. For each CN2RPQ q, one can compute in polynomial time an ELI TBox T ′ and a C2RPQ q′ such that ans(q, 〈T ,A〉) = ans(q′, 〈T ∪ T ′,A〉) for every KB 〈T ,A〉.

Proof. Let q be an arbitrary CN2RPQ whose role atoms are given by n-NFAs, that is, they take the form As0,F0(x, y). For each atom As0,F0(x, y) in q and each αi ∈ A, we use a fresh concept name As for each state s ∈ Si, and define a TBox Tαi that contains:

• ⊤ ⊑ Af for each f ∈ Fi,
• ∃r.As′ ⊑ As for each (s, r, s′) ∈ δi with r ∈ N±R,
• As′ ⊓ A ⊑ As for each (s, A?, s′) ∈ δi with A ∈ NC, and
• As′ ⊓ Asj ⊑ As for each (s, 〈j〉, s′) ∈ δi, with sj the initial state of αj.

We denote by TA the union of all Tαi with αi ∈ A, and define T ′ as the union of TA for all atoms As0,F0(x, y) ∈ q. To obtain the query q′, we replace each atom As0,F0(x, y) by the atom α′i(x, y), where αi is the unique automaton in A with {s0} ∪ F0 ⊆ Si, and α′i is obtained from αi by replacing each transition of the form (s, 〈j〉, s′) ∈ δi with (s, Asj?, s′), for sj the initial state of αj. Note that each α′i is a standard NFA. We show in the appendix that ans(q, 〈T ,A〉) = ans(q′, 〈T ∪ T ′,A〉) for every KB 〈T ,A〉.
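The TBox Tαi of this proof can be generated mechanically from the transition relation. A sketch, with axioms kept as readable strings (the string encoding of axioms and the label format are ours):

```python
def tbox_for_automaton(name, finals, trans):
    """Generate the axioms of T_alpha for one automaton of an n-NFA,
    following the four axiom shapes in the proof of Proposition 4.1.
    trans: list of (s, label, s2) with label ('role', r), ('test', A),
    or ('nest', j, s_j), where s_j is the initial state of alpha_j."""
    A = lambda s: f"A_{name}_{s}"          # fresh concept name per state
    axioms = [f"TOP <= {A(f)}" for f in finals]
    for s, label, s2 in trans:
        if label[0] == "role":             # (s, r, s'), r in N_R^+-
            axioms.append(f"EXISTS {label[1]}.{A(s2)} <= {A(s)}")
        elif label[0] == "test":           # (s, A?, s')
            axioms.append(f"{A(s2)} AND {label[1]} <= {A(s)}")
        elif label[0] == "nest":           # (s, <j>, s'), s_j initial in alpha_j
            j, sj = label[1], label[2]
            axioms.append(f"{A(s2)} AND A_{j}_{sj} <= {A(s)}")
    return axioms

# A one-state automaton (state 0 initial and final) with a role loop and a
# nested test referring to automaton "1" with initial state 0 (illustrative).
ax = tbox_for_automaton("0", {0},
                        [(0, ("role", "advisor"), 0),
                         (0, ("nest", "1", 0), 0)])
for a in ax:
    print(a)
```

Running the reduction atom by atom and taking the union of the generated axiom sets yields the TBox T ′ of the proposition.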

It follows that in every DL that contains ELI, answering CN2RPQs is no harder than answering plain C2RPQs. From existing upper bounds for C2RPQs (Calvanese, Eiter, and Ortiz 2009; Ortiz, Rudolph, and Simkus 2011), we obtain:

Corollary 4.2. Answering CN2RPQs is:

• in 2EXP in combined complexity for all DLs contained in SHIQ, SHOI, ZIQ, or ZOI;

• in EXP in combined complexity and in P in data complexity for all DLs contained in Horn-SHOIQ.

We point out that the 2EXP upper bound for expressive DLs can also be inferred, without using the reduction above, from the existing results for answering C2RPQs in ZIQ and ZOI (Calvanese, Eiter, and Ortiz 2009).⁴ Indeed, these DLs support regular role expressions as concept constructors, and a nested expression 〈E〉 in a query can be replaced by a concept ∃E.⊤ (or by a fresh concept name AE if the axiom ∃E.⊤ ⊑ AE is added to the TBox). Hence, in ZIQ and ZOI, nested expressions provide no additional expressiveness, and CN2RPQs and C2RPQs coincide.

The construction used in Proposition 4.1 also allows us to reduce the evaluation of an N2RPQ to standard reasoning in any DL that contains ELI.

Proposition 4.3. For every N2RPQ q and every pair of individuals a, b, one can compute in polynomial time an ELI TBox T ′ and a pair of assertions Ab(b) and As(a) such that (a, b) ∈ ans(q, 〈T ,A〉) iff 〈T ∪ T ′, A ∪ {Ab(b)}〉 |= As(a), for every DL KB 〈T ,A〉.

From this and existing upper bounds for instance checking in DLs, we easily obtain:

Corollary 4.4. Answering N2RPQs is in EXP in combined complexity for every DL that contains ELI and is contained in SHIQ, SHOI, ZIQ, or ZOI.

5 Lower Bounds

The upper bounds we have stated in Section 4 are quite general, and in most cases worst-case optimal.

The 2EXP upper bound stated in the first item of Corollary 4.2 is optimal already for C2RPQs and ALC. Indeed, the 2EXP-hardness proof for conjunctive queries in SH by Eiter et al. (2009) can be adapted to use an ALC TBox and a C2RPQ. Also the EXP bounds in Corollaries 4.2 and 4.4 are optimal for all DLs that contain ELI, because standard reasoning tasks like satisfiability checking are already EXP-hard in this logic (Baader, Brandt, and Lutz 2008). For the same reasons, the P bound for data complexity in Corollary 4.2 is tight for EL and its extensions (Calvanese et al. 2006).

However, for the lightweight DLs DL-LiteR and EL, the best combined complexity lower bounds we have are NL (resp., P) for N2RPQs and PSPACE for CN2RPQs, inherited from the lower bounds for (C)NRPQs (Bienvenu, Ortiz, and Simkus 2013). This leaves a significant gap with respect to the EXP upper bounds in Corollaries 4.2 and 4.4.

We show next that these upper bounds are tight. This is one of the core technical results of this paper, and probably the most surprising one: already evaluating a single N2RPQ in the presence of a DL-Lite or EL TBox is EXP-hard.

Theorem 5.1. In DL-Lite and EL, N2RPQ answering is EXP-hard in combined complexity.

Proof. We provide a reduction from the word problem for Alternating Turing Machines (ATMs) with polynomially bounded space, which is known to be EXP-hard (Chandra, Kozen, and Stockmeyer 1981). An ATM is given as a tuple M = (Σ, S∃, S∀, δ, sinit, sacc, srej), where Σ is an alphabet, S∃ is a set of existential states, S∀ is a set of universal states, δ ⊆ (S∃ ∪ S∀) × (Σ ∪ {b}) × (S∃ ∪ S∀) × (Σ ∪ {b}) × {−1, 0, +1} is a transition relation, b is the blank symbol, and sinit, sacc, srej ∈ S∃ are the initial state, the acceptance state, and the rejection state, respectively.

⁴For queries that do not contain inverse roles, that is, (1-way) CRPQs, the same applies to ZOQ and its sublogics.

Consider a word w ∈ Σ∗. We can w.l.o.g. assume that Σ = {0, 1}, that M uses only |w| tape cells, and that |w| ≥ 1. Let m = |w| and, for each 1 ≤ i ≤ m, let w(i) denote the ith symbol of w. Let S = S∃ ∪ S∀. We make the following further assumptions:

(i) The initial state is not a final state: sinit ∉ {sacc, srej}.
(ii) Before entering a state sacc or srej, M writes b in all m tape cells.
(iii) There exist functions δ1, δ2 : S × (Σ ∪ {b}) → S × (Σ ∪ {b}) × {−1, 0, +1} such that {δ1(s, σ), δ2(s, σ)} = {(s′, σ′, d) | (s, σ, s′, σ′, d) ∈ δ} for every s ∈ S \ {sacc, srej} and σ ∈ Σ ∪ {b}. In other words, non-final states of M give rise to exactly two successor configurations, described by the functions δ1 and δ2.

Note that the machine M can be modified in polynomial time to ensure (i)–(iii) while preserving the acceptance of w.

We next show how to construct in polynomial time a DL-Lite KB K = (T ,A) and a query q such that M accepts w iff a ∈ ans(q,K) (we return to EL later). The high-level idea underlying the reduction is to use the KB to enforce a tree that contains all possible computations of M on w. The query q selects a computation in this tree and verifies that it corresponds to a proper, error-free, accepting run.

Generating the tree of transitions. First we construct K, which enforces a tree whose edges correspond to the possible transitions of M. More precisely, each edge encodes a transition together with the resulting position of the read/write head of M, and indicates whether the transition is given by δ1 or δ2. This is implemented using role names rp,t,i, where p ∈ {1, 2}, t ∈ δ, and 0 ≤ i ≤ m + 1. To mark the nodes that correspond to the initial (resp., a final) configuration of M, we employ the concept name Ainit (resp., Afinal), and we use A∃ and A∀ to store the transition type.

We let A = {Ainit(a), A∃(a)}, and then we initiate the construction of the tree by including in T the axiom

Ainit ⊑ ∃rp,(sinit,σ,s′,σ′,d),1+d   (1)

for each σ ∈ Σ ∪ {b} and p ∈ {1, 2} such that δp(sinit, σ) = (s′, σ′, d). To generate further transitions, T contains

∃r−p,(s,σ,s′,σ′,d),i ⊑ ∃rp′,(s′,σ∗,s′′,σ′′,d′),i+d′   (2)

for each (s, σ, s′, σ′, d) ∈ δ, 1 ≤ i ≤ m, σ∗ ∈ Σ ∪ {b}, and p, p′ ∈ {1, 2} such that δp′(s′, σ∗) = (s′′, σ′′, d′). Note that a transition t′ = (s′, σ∗, s′′, σ′′, d′) ∈ δ can follow t = (s, σ, s′, σ′, d) ∈ δ only if σ∗ is the symbol written on tape cell i, for i the position of the read/write head after executing t. This is not guaranteed by (2). Instead, we “overestimate” the possible successive transitions, and use the query q to select the paths that correspond to a proper computation.

We complete the definition of T by adding inclusions to label the nodes according to the type of state resulting from a transition. For each 1 ≤ i ≤ m, p ∈ {1, 2}, and transition (s, σ, s′, σ′, d) ∈ δ, we have the axiom

∃r−p,(s,σ,s′,σ′,d),i ⊑ AQ, where

- AQ = Afinal if s′ ∈ {sacc, srej},
- AQ = A∃ if s′ ∈ S∃ \ {sacc, srej}, and
- AQ = A∀ if s′ ∈ S∀ \ {sacc, srej}.

We turn to the construction of the query q, for which we employ the n-NFA representation. We construct an n-NFA αq = (A, s, F ), where A has m + 1 automata {α0, . . . , αm}. Intuitively, the automaton α0 is responsible for traversing the tree representing candidate computation paths. At nodes corresponding to the end of a computation path, α0 launches α1, . . . , αm, which “travel” back to the root of the tree and test for the absence of errors along the way. We start by defining the tests α1, . . . , αm. Afterwards we define α0, which selects a set of paths that correspond to a full computation and launches these tests at the end of each path.

Testing the correctness of a computation path. For each 1 ≤ l ≤ m, the automaton αl = (Sl, sl, δl, Fl) is built as follows. We let Sl = {σl | σ ∈ Σ} ∪ {bl} ∪ {s′l}. That is, Sl contains a copy of Σ ∪ {b} plus the additional state s′l. We define the initial state as sl = bl and let Fl = {s′l}. Finally, the transition relation δl contains the following tuples:

(T1) (σl, r−p,(s,σ,s′,σ′,d),i, σl) for all 1 ≤ i ≤ m, p ∈ {1, 2}, all transitions (s, σ, s′, σ′, d) ∈ δ, and each σl ∈ Sl \ {s′l} with l ≠ i − d;

(T2) (σ′l, r−p,(s,σ,s′,σ′,d),i, σl) for all 1 ≤ i ≤ m, s ∈ S, and p ∈ {1, 2} with δp(s, σ) = (s′, σ′, d) and l = i − d;

(T3) (σl, Ainit?, s′l) for σ = w(l).

The working of αl can be explained as follows. Each state σl ∈ Sl \ {s′l} corresponds to one of the symbols that may be written in position l of the tape during a run of M. When αl is launched at some node in a computation tree induced by K, it attempts to travel up to the root node, and the only reason it may fail is that a wrong symbol is written in position l at some point in the computation path. Recall that in each final configuration of M, all symbols are set to the blank symbol, and thus the initial state of αl is bl.

Consider a word w′ ∈ Roles∗ of the form

r−pk,tk,ik · · · r−p1,t1,i1 · Ainit?   (3)

that describes a path from some node in the tree induced by K up to the root node a. We claim that w′ is accepted by every αl (1 ≤ l ≤ m) just in the case that t1, . . . , tk is a correct sequence of transitions. To see why, first suppose that every αl accepts w′, and let (pos0, st0, tape0) be the tuple with pos0 = 1, st0 = sinit, and tape0 containing, for each 1 ≤ l ≤ m, the symbol σl corresponding to the state of αl when reading Ainit?. Clearly, due to (T3), the tuple (pos0, st0, tape0) describes the initial configuration of M on input w. For 1 ≤ j ≤ k, if tj = (s, σ, s′, σ′, d), then we define (posj, stj, tapej) as follows: posj = ij, stj = s′, and tapej contains, for each 1 ≤ i ≤ m, the state of αi when reading r−pj,tj,ij. A simple inductive argument

Page 84: Runtime Query Rewriting Techniques - Optique | Scalable End-user … · 2015-11-19 · Runtime Query Rewriting Techniques This document summarises deliverable D6.3 of project FP7-318338

shows that for every 1 ≤ j ≤ k, the tuple (posj, stj, tapej) describes the configuration of M after applying the transitions t1, . . . , tj from the initial configuration. Indeed, let us assume that (posj−1, stj−1, tapej−1) correctly describes the configuration after executing t1, . . . , tj−1, and that tj = (s, σ, s′, σ′, d). After executing tj, the read/write head is in position posj−1 + d and the state is s′. Since the only way to enforce an r−pj,tj,ij-edge is via axioms (1) and (2), we must have posj = posj−1 + d and stj = s′. It remains to show that tapej describes the tape contents after executing tj. Consider some position 1 ≤ l ≤ m. There are two cases:

1. l ≠ ij − d. In this case, we know that the symbol in position l is not modified by executing tj. We have to show that σl ∈ tapej−1 implies σl ∈ tapej. This follows from the construction of αl: when reading r−pj,tj,ij, it must employ a transition from (T1).

2. l = ij − d. In this case, after executing tj, we must have σ′ in position l. We have to show that σl ∈ tapej−1 implies σ′l ∈ tapej. This again follows from the construction of αl: when reading r−pj,tj,ij, there is only one possible transition available in (T2), namely (σ′l, r−pj,tj,ij, σl).

Conversely, it is easy to see that any word of the form (3) that appears in the tree induced by K and represents a correct computation path will be accepted by all of the αl.

Selecting a proper computation. It remains to define α0, which selects a subtree corresponding to a full candidate computation of M, and then launches the tests defined above at the end of each path. We let α0 = (S0, s0, δ0, F0), where S0 = {s↓, cL, cR, s↑, sl, stest, sf}, s0 = s↓, F0 = {sf}, and δ0 is defined next.

The automaton operates in two main modes: moving down the tree away from the root and moving back up towards the root. Depending on the type of the state of M, in state s↓ the automaton either selects a child node to process next, or chooses to launch the test automata. If the tests are successful, it switches to moving up. To this end, δ0 has the following transitions:

(s↓, A∃?, cL), (s↓, A∃?, cR), (s↓, A∀?, cL),
(s↓, Afinal?, stest), and (stest, 〈1, . . . , m〉, s↑).

The transitions that implement a step down or up are:

- (cL, r1,t,i, s↓) for every 1 ≤ i ≤ m and t ∈ δ,
- (cR, r2,t,i, s↓) for every 1 ≤ i ≤ m and t ∈ δ,
- (s↑, r−1,t,i, sl) for every 1 ≤ i ≤ m and t ∈ δ, and
- (s↑, r−2,t,i, s↑) for every 1 ≤ i ≤ m and t ∈ δ.

After making a step up from the state s↑ via an r−1,t,i-edge, the automaton enters the state sl. Depending on the encountered state of M, the automaton decides either to verify the existence of a computation tree for the alternative transition, to keep moving up, or to accept the word. This is implemented using the following transitions of δ0:

(sl, A∀?, cR), (sl, A∃?, s↑), and (sl, Ainit?, sf ).

[Figure 1: Example ABox in the proof of Theorem 5.2, for the rules ϕ1 : g → g, ϕ2 : v1 ∧ v2 → g, ϕ3 : → v2; the nodes a1 and a2 are connected by pg-, pv1-, pv2-, t-, and s-edges as described in the construction.]

To conclude the definition of αq = (A, s, F ), we set s = s↓ and F = {sf}. Note that αq has a constant number of states, so it can be converted into an equivalent NRE Eq in polynomial time. The desired query is q(x, y) = Eq(x, y).

The above DL-Lite TBox T can be easily rephrased in EL. Indeed, we simply take a fresh concept name Ap,t,i for each role rp,t,i, and replace every axiom C ⊑ ∃rp,t,i by C ⊑ ∃rp,t,i.Ap,t,i and every axiom ∃r−p,t,i ⊑ C by Ap,t,i ⊑ C.

The above lower bound for answering N2RPQs hinges on the support for existential concepts on the right-hand side of inclusions. If they are disallowed, then one can find a polynomial-time algorithm (Perez, Arenas, and Gutierrez 2010). However, it was open until now whether the polynomial-time upper bound is optimal. We next prove P-hardness of the problem, already for plain graph databases.

Theorem 5.2. Given as input an N2RPQ q, a finite interpretation I, and a pair (o, o′) ∈ ∆I × ∆I, it is P-hard to check whether (o, o′) ∈ ans(q, I).

Proof. To simplify the presentation, we prove the lower bound for a slight reformulation of the problem. In particular, we show P-hardness of deciding ~c ∈ ans(q, 〈∅,A〉), where q is an N2RPQ and A is an ABox with assertions only of the form A(a) or r(a, b), where A ∈ NC and r ∈ NR.

We provide a logspace reduction from the classical P-complete problem of checking entailment in propositional definite Horn theories. Assume a set T = {ϕ1, . . . , ϕn} of definite clauses over a set of propositional variables V, where each ϕi is represented as a rule v1 ∧ . . . ∧ vm → vm+1.
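For reference, the source problem of this reduction is decidable by the usual polynomial-time forward-chaining procedure; the code below is a standard sketch of that procedure (not part of the reduction itself), using the rules from Figure 1:

```python
def horn_entails(rules, goal):
    """Decide T |= goal for definite Horn clauses, given as (body, head)
    pairs with body a list of variables. Runs in polynomial time by
    forward chaining: saturate the set of derivable variables to a fixpoint."""
    derived = set()
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in derived and all(v in derived for v in body):
                derived.add(head)
                changed = True
    return goal in derived

# Rules from Figure 1: phi1: g -> g, phi2: v1 /\ v2 -> g, phi3: -> v2
rules = [(["g"], "g"), (["v1", "v2"], "g"), ([], "v2")]
print(horn_entails(rules, "v2"))  # True
print(horn_entails(rules, "g"))   # False: v1 is never derivable
```

The reduction below encodes exactly this derivability question as N2RPQ evaluation over an ABox.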

Given a variable g ∈ V, we define an ABox A, an N2RPQ q, and a tuple (a1, a2) such that T |= g iff (a1, a2) ∈ ans(q, 〈∅,A〉). We may assume w.l.o.g. that ϕ1 = g → g. We define the desired ABox as A = A1 ∪ A2, using the role names s, t, and pv, where v ∈ V. The ABox A1 simply encodes T and contains, for every ϕi = v1 ∧ . . . ∧ vm → vm+1, the following assertions:

pvm+1(eim+1, eim), . . . , pv1(ei1, ei0), s(ei0, f).

The ABox A2 links variables in rule bodies with their occurrences in rule heads. For every pair of rules ϕi = v1 ∧ . . . ∧ vm → vm+1 and ϕj = w1 ∧ . . . ∧ wn → wn+1, and each 1 ≤ l ≤ m with vl = wn+1, it contains the assertion t(eil−1, ejn+1). See Figure 1 for an example.

The existence of a proof tree for g, which can be limited to depth |V |, is expressed using the query q(x, y) = E|V |(x, y), with E1, E2, . . . , E|V | defined inductively:

E1 = ⋃v∈V (pv · t · pv) · s

Ei = ⋃v∈V (pv · t · pv) · (〈Ei−1〉 · ⋃v∈V pv)∗ · s   (i > 1)

Finally, we let a1 = e11 and a2 = f.

6 Concrete Approach for Horn DLs

Our complexity results so far leave a gap for the data complexity of the DL-Lite family: we inherit NL-hardness from plain RPQs, but we only have the P upper bound stemming from Proposition 4.1. In this section, we close this gap by providing an NL upper bound.

This section has an additional goal. We recall that the upper bounds in Corollaries 4.2 and 4.4 rely on reductions to answering (C)2RPQs in extensions of ELI, like Horn-SHOIQ, ZIQ, and ZOI. Unfortunately, known algorithms for C2RPQ answering in these logics use automata-theoretic techniques that are best-case exponential and not considered suitable for implementation. Hence, we want to provide a direct algorithm that may serve as a basis for practicable techniques. To this end, we take an existing algorithm for answering C2RPQs in ELH and DL-LiteR due to Bienvenu et al. (2013) and show how it can be extended to handle CN2RPQs and ELHI⊥ KBs.

For presenting the algorithm in this section, it will be useful to first recall the canonical model property of ELHI⊥.

Canonical Models

We say that an ELHI⊥ TBox T is in normal form if all of its concept inclusions are of one of the following forms:

A ⊑ ⊥   A ⊑ ∃r.B   ⊤ ⊑ A   B1 ⊓ B2 ⊑ A   ∃r.B ⊑ A

with A, B, B1, B2 ∈ NC and r ∈ N±R.

By introducing fresh concept names to stand for complex concepts, every TBox T can be transformed in polynomial time into a TBox T ′ in normal form that is a model-conservative extension of T. Hence, in what follows, we assume that ELHI⊥ TBoxes are in normal form.
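This normalization can be implemented with the standard structural transformation, naming complex subconcepts by fresh names and emitting, for each polarity, the direction of the definition that is needed. The concept representation, axiom tuples, and fresh-name scheme below are our own illustration for ⊓ and ∃ only (role inclusions and ⊤/⊥ are left untouched):

```python
import itertools

counter = itertools.count()

def normalize(lhs, rhs):
    """Normalize one inclusion lhs <= rhs into axioms of the shapes
    B1 AND B2 <= A ('and<='), EXISTS r.B <= A ('exists<='), and
    A <= EXISTS r.B ('<=exists'). Concepts are strings (concept names)
    or tuples ('and', C, D) / ('exists', r, C)."""
    out = []

    def fresh():
        return f"X{next(counter)}"

    def on_left(c):
        # Return a name N such that c <= N follows from the emitted axioms.
        if isinstance(c, str):
            return c
        a = fresh()
        if c[0] == "and":
            out.append(("and<=", on_left(c[1]), on_left(c[2]), a))
        else:  # ('exists', r, d)
            out.append(("exists<=", c[1], on_left(c[2]), a))
        return a

    def on_right(c):
        # Return a name N such that N <= c follows from the emitted axioms.
        if isinstance(c, str):
            return c
        a = fresh()
        if c[0] == "and":
            out.append(("and<=", a, a, on_right(c[1])))
            out.append(("and<=", a, a, on_right(c[2])))
        else:  # ('exists', r, d)
            out.append(("<=exists", a, c[1], on_right(c[2])))
        return a

    l, r = on_left(lhs), on_right(rhs)
    out.append(("and<=", l, l, r))   # l <= r, written as l AND l <= r
    return out

# Normalize: EXISTS r.(B1 AND B2) <= EXISTS s.A
ax = normalize(("exists", "r", ("and", "B1", "B2")), ("exists", "s", "A"))
```

Each complex subconcept contributes one fresh name and one axiom, so the output is linear in the size of the input inclusion.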

The domain of the canonical model IT,A of a consistent KB 〈T ,A〉 consists of all sequences a r1C1 . . . rnCn (n ≥ 0) such that:

• a ∈ Ind(A) and ri ∈ N±R for each 1 ≤ i ≤ n;
• each Ci is a finite conjunction of concept names;
• if n ≥ 1, then T ,A |= (∃r1.C1)(a);
• for 1 ≤ i < n, T |= Ci ⊑ ∃ri+1.Ci+1.

For an o ∈ ∆IT,A \ Ind(A), we use tail(o) to denote its final conjunction Cn. The interpretation IT,A is then defined as follows:

aIT,A = a for all a ∈ Ind(A)

AIT,A = {a ∈ Ind(A) | T ,A |= A(a)}
      ∪ {o ∈ ∆IT,A \ Ind(A) | T |= tail(o) ⊑ A}

pIT,A = {(a, b) | p(a, b) ∈ A}
      ∪ {(o1, o2) | o2 = o1rC and T |= r ⊑ p}
      ∪ {(o2, o1) | o2 = o1rC and T |= r ⊑ p−}

Observe that IT,A is composed of a core part containing the individuals from A and an anonymous part consisting of (possibly infinite) trees rooted at the ABox individuals. We use IT,A|o to denote the restriction of IT,A to domain elements having o as a prefix.

It is well known that the canonical model IT,A of a consistent ELHI⊥ KB 〈T ,A〉 can be homomorphically embedded into any model of 〈T ,A〉. Since CN2RPQs are preserved under homomorphisms, we have:

Lemma 6.1. For every consistent ELHI⊥ KB 〈T ,A〉, CN2RPQ q, and tuple ~a of individuals: ~a ∈ ans(q, 〈T ,A〉) if and only if ~a ∈ ans(q, IT,A).

Computing Jump and Final Transitions

A crucial component of our algorithm is to compute relevant partial paths in a subtree IT,A|o rooted at an object o in the anonymous part of IT,A. Importantly, we also need to remember which parts of the nested automata that have been partially navigated below o still need to be continued. This will allow us to ‘forget’ the tree below o.

In what follows, it will be convenient to use runs to talk about the semantics of n-NFAs.

Definition 6.1. Let I be an interpretation, and let (A, s0, F0) be an n-NFA. A partial run for A on I is a finite node-labelled tree (T, ℓ) such that every node is labelled with an element from ∆I × (⋃i Si) and, for each non-leaf node v having label ℓ(v) = (o, s) with s ∈ Si, one of the following holds:

• v has a unique child v′ with ℓ(v′) = (o′, s′), and there exists (s, σ, s′) ∈ δi such that σ ∈ Roles and (o, o′) ∈ σI;
• v has exactly two children v′ and v′′ with ℓ(v′) = (o, s′) and ℓ(v′′) = (o, s′′), with s′′ the initial state of αj, and there exists a transition (s, 〈j〉, s′) ∈ δi.

If T has root labelled (o1, s1) and a leaf node labelled (o2, s2) with s1, s2 states of the same αi, then (T, ℓ) is called an (o1, s1, o2, s2)-run, and it is full if every leaf label (o′, s′) ≠ (o2, s2) is such that s′ ∈ Fk for some k.

Full runs provide an alternative characterization of the semantics of n-NFAs in Definition 3.3.

Fact 6.2. For every interpretation I, (o1, o2) ∈ (As1,{s2})I if and only if there is a full (o1, s1, o2, s2)-run for A in I.

We use partial runs to characterize when an n-NFA A can be partially navigated inside a tree IT,A|o whose root satisfies some conjunction of concepts C. Intuitively, JumpTrans(A, T ) stores pairs s1, s2 of states of some α ∈ A such that a path from s1 to s2 exists, while FinalTrans(A, T ) stores states s1 for which a path to some final state exists, no matter where that final state is reached. Both JumpTrans(A, T ) and FinalTrans(A, T ) additionally store a set Γ of states s of other automata nested in α, for which a path from s to a final state remains to be found.

Definition 6.2. Let T be an ELHI⊥ TBox in normal form and (A, s0, F0) an n-NFA. The set JumpTrans(A, T ) consists of tuples (C, s1, s2, Γ), where C is either ⊤ or a conjunction of concept names from T , s1 and s2 are states from some αi ∈ A, and Γ ⊆ ⋃j>i Sj. A tuple (C, s1, s2, Γ) belongs to JumpTrans(A, T ) if there exists a partial run (T, ℓ) of A in the canonical model of 〈T , {C(a)}〉 that satisfies the following conditions:

Page 86: Runtime Query Rewriting Techniques - Optique | Scalable End-user … · 2015-11-19 · Runtime Query Rewriting Techniques This document summarises deliverable D6.3 of project FP7-318338

• the root of T is labelled (a, s1);
• there is a leaf node v with ℓ(v) = (a, s2);
• for every leaf node v with ℓ(v) = (o, s) ≠ (a, s2), either s ∈ Fj for some j > i, or o = a and s ∈ Γ.

The set FinalTrans(A, T ) contains all tuples (C, s1, F, Γ) such that there is a partial run (T, ℓ) of A in the canonical model of 〈T , {C(a)}〉 that satisfies the following conditions:

• the root of T is labelled (a, s1);
• there is a leaf node v with ℓ(v) = (o, sf ) and sf ∈ F;
• for every leaf node v with ℓ(v) = (o, s), either s is a final state in some αk, or o = a and s ∈ Γ.

Proposition 6.3. It can be decided in exponential time whether a tuple belongs to JumpTrans(A, T ) or FinalTrans(A, T ).

Proof idea. We first show how to use TBox reasoning to decide whether (C, s1, s2, Γ) ∈ JumpTrans(A, T ). For every αj ∈ A, we introduce a fresh concept name As for each state s ∈ Sj. Intuitively, As expresses that there is an outgoing path that starts in s and reaches a final state. If {s1, s2} ⊆ Si, then we add the following inclusions to T :

• ⊤ ⊑ As for every s ∈ Fj with j > i;
• ∃r.As′ ⊑ As whenever (s, r, s′) ∈ δi with r ∈ N±R;
• As′ ⊓ B ⊑ As whenever (s, B?, s′) ∈ δi;
• As′ ⊓ As′′ ⊑ As whenever (s, 〈j〉, s′) ∈ δi and s′′ is the initial state of αj.

Let T ′ be the resulting TBox. In the long version, we show that (C, s1, s2, Γ) ∈ JumpTrans(A, T ) iff

T ′ |= (C ⊓ As2 ⊓ ⨅s∈Γ As) ⊑ As1.

To decide whether (C, s1, F, Γ) ∈ FinalTrans(A, T ), we must also include in T ′ the following inclusions:

• ⊤ ⊑ As for every s ∈ F.

We then show that (C, s1, F, Γ) ∈ FinalTrans(A, T ) iff

T ′ |= (C ⊓ ⨅s∈Γ As) ⊑ As1.

To conclude the proof, we simply note that both problems can be decided in single-exponential time, as TBox reasoning in ELHI⊥ is known to be EXP-complete.

Query Rewriting

The core idea of our query answering algorithm is to rewrite a given CN2RPQ q into a set of queries Q such that the answers to q and the union of the answers to all q′ ∈ Q coincide. However, for evaluating each q′ ∈ Q, we only need to consider mappings from the variables to the individuals in the core of IT,A. Roughly, a rewriting step makes some assumptions about the query variables that are mapped deepest into the anonymous part and, using the structure of the canonical model, generates a query whose variables are matched one level closer to the core. Note that, even when we assume that no variables are mapped below some element o in IT,A, the satisfaction of the regular paths may require going below o and back up in different ways. This is handled using jump and final transitions. The query rewriting algorithm is an adaptation of the algorithm for C2RPQs in (Bienvenu, Ortiz, and Simkus 2013), to which the reader may refer for more detailed explanations and examples.

The query rewriting algorithm is presented in Figure 2. In the algorithm, we use atoms of the form 〈As,F 〉(x), which are semantically equivalent to As,F (x, z) for a variable z not occurring anywhere in the query. This alternative notation spares us additional variables and makes the complexity arguments simpler. To slightly simplify the notation, we may write As,s′ instead of As,{s′}.

The following proposition states the correctness of the rewriting procedure. Its proof follows the ideas outlined above and can be found in the appendix of the long version. Slightly abusing notation, we also use Rewrite(q, T ) to denote the set of all queries that can be obtained by an execution of the rewriting algorithm on q and T.

Proposition 6.4. Let 〈T ,A〉 be an ELHI⊥ KB and q(~x) a CN2RPQ. Then ~a ∈ ans(q, 〈T ,A〉) iff there exist q′ ∈ Rewrite(q, T ) and a match π for q′ in IT,A such that π(~x) = ~a and π(y) ∈ Ind(A) for every variable y in q′.

We note that the query rewriting does not introduce fresh terms. Moreover, it employs an at most quadratic number of linearly sized n-NFAs, obtained from the n-NFAs of the input query. Thus, the size of each q′ ∈ Rewrite(q, T ) is polynomial in the size of q and T. Given that all the checks employed in Figure 2 can be done in exponential time (see Proposition 6.3), we obtain the following.

Proposition 6.5. The set Rewrite(q, T ) can be computed in exponential time in the size of q and T.

Query Evaluation

In Figure 3, we present an algorithm EvalAtom for evaluating N2RPQs. The idea is similar to the standard nondeterministic algorithm for deciding reachability: we guess a sequence (c0, s0)(c1, s1) · · · (cm, sm) of individual–state pairs, keeping only two successive elements in memory at any time. Every element (ci+1, si+1) must be reachable from the preceding element (ci, si) by a single normal, jump, or final transition. Moreover, in order to use a jump or final transition, we must ensure that its associated conditions are satisfied. To decide whether the current individual belongs to C, we can employ standard reasoning algorithms, but to determine whether an outgoing path exists for one of the states in Γ, we must make a recursive call to EvalAtom. Importantly, these recursive calls involve “lower” automata, so the depth of recursion is bounded by the number of automata in the N2RPQ (and is thus independent of A). It follows that the whole procedure can be implemented in nondeterministic logarithmic space in |A| if we discount the concept and role membership tests. By exploiting known complexity results for instance checking in DL-LiteR and ELHI⊥, we obtain:

Proposition 6.6. EvalAtom is a sound and complete procedure for N2RPQ evaluation over satisfiable ELHI⊥ KBs. It can be implemented so as to run in nondeterministic logarithmic space (resp., polynomial time) in the size of the ABox for DL-LiteR (resp., ELHI⊥) KBs.
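The nondeterministic guessing of individual–state pairs can be simulated deterministically by a breadth-first search over such pairs. The following is a much-simplified sketch of that core loop (our own code, not the paper's Figure 3): it handles only ordinary role and test transitions over the ABox, abstracting away jump and final transitions and the reasoning oracles:

```python
from collections import deque

def eval_atom(start_ind, start_state, finals, delta, abox_roles, concepts):
    """Reachability over pairs (individual, state), in the spirit of EvalAtom.
    delta: set of (s, label, s2) with label ('role', r) or ('test', A);
    abox_roles: role name -> set of pairs (a, b); concepts: name -> set."""
    seen = {(start_ind, start_state)}
    queue = deque(seen)
    while queue:
        c, s = queue.popleft()
        if s in finals:                      # a final state was reached
            return True
        for s1, label, s2 in delta:
            if s1 != s:
                continue
            if label[0] == "test":           # A? loop: stay at c if c in A
                nexts = {c} if c in concepts.get(label[1], set()) else set()
            else:                            # ordinary role step in the ABox
                nexts = {b for (a, b) in abox_roles.get(label[1], set()) if a == c}
            for c2 in nexts:
                if (c2, s2) not in seen:
                    seen.add((c2, s2))
                    queue.append((c2, s2))
    return False

# Illustrative data: advisor chain ada -> bob -> carol, carol is a Prof.
roles = {"advisor": {("ada", "bob"), ("bob", "carol")}}
conc = {"Prof": {"carol"}}
delta = {(0, ("role", "advisor"), 0), (0, ("test", "Prof"), 1)}
print(eval_atom("ada", 0, {1}, delta, roles, conc))  # True
```

The BFS visits at most |Ind(A)| · |S| pairs; the nondeterministic variant described above keeps only the current pair in memory, which is what yields the NL bound.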


PROCEDURE Rewrite

Input: CN2RPQ q, ELHI⊥ TBox T in normal form

1. Choose either to output q or to continue.
2. Choose a non-empty set Leaf ⊆ vars(q) and y ∈ Leaf. Rename all variables in Leaf to y.
3. Choose a conjunction C of concept names from T such that T |= C ⊑ B whenever B(y) is an atom of q. Drop all such atoms from q.
4. For each atom at ∈ q of the form 〈A_{s0,F}〉(t) or A_{s0,F}(t, t′) with y ∈ {t, t′}:
   (a) let α_i ∈ A be the automaton containing s0, F
   (b) choose a sequence s1, . . . , s_{n−1} of distinct states from S_i and some s_n ∈ F
   (c) replace at by the atoms A_{s0,s1}(t, y), A_{s1,s2}(y, y), . . . , A_{s_{n−2},s_{n−1}}(y, y), and
       • A_{s_{n−1},s_n}(y, t′) if at = A_{s0,F}(t, t′), or
       • 〈A_{s_{n−1},s_n}〉(y) if at = 〈A_{s0,F}〉(y).
5. For each atom at_j of the form A_{s_j,s_{j+1}}(y, y) or 〈A_{s_j,s_{j+1}}〉(y) in q, either do nothing, or:
   • choose some (C, s_j, s_{j+1}, Γ) ∈ JumpTrans(A, T) if at_j = A_{s_j,s_{j+1}}(y, y),
   • choose some (C, s_j, {s_{j+1}}, Γ) ∈ FinalTrans(A, T) if at_j = 〈A_{s_j,s_{j+1}}〉(y),
   and replace at_j by {〈A_{u,F_k}〉(y) | u ∈ Γ ∩ S_k}.
6. Choose a conjunction D of concept names from T and r, r1, r2 ∈ N±_R such that:
   (a) T |= D ⊑ ∃r.C, T |= r ⊑ r1, and T |= r ⊑ r2.
   (b) For each atom A_{u,U}(y, t) of q with u ∈ S_i, there exists v ∈ S_i such that (u, r1⁻, v) ∈ δ_i.
   (c) For each atom A_{u,U}(t, y) of q with u ∈ S_i, there exist v ∈ S_i and v′ ∈ U with (v, r2, v′) ∈ δ_i.
   (d) For each atom 〈A_{u,U}〉(y) of q with u ∈ S_i, there exists v ∈ S_i such that (u, r1⁻, v) ∈ δ_i.
   For atoms A_{u,U}(y, y), both (b) and (c) apply.
7. Replace
   • each atom A_{u,U}(y, t) with t ≠ y by A_{v,U}(y, t),
   • each atom A_{u,U}(t, y) with t ≠ y by A_{u,v}(t, y),
   • each atom A_{u,U}(y, y) by A_{v,v′}(y, y), and
   • each atom 〈A_{u,U}〉(y) by 〈A_{v,U}〉(y),
   with v, v′ as in Step 6.
8. Add A(y) to q for each A ∈ D and return to Step 1.

Figure 2: Query rewriting procedure Rewrite.
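To make the nondeterminism in Step 4 concrete, the sketch below (an illustration over assumed toy encodings, not part of the procedure itself) enumerates the candidate state sequences of Step 4(b) for a small automaton; each yielded tuple corresponds to one way of splitting an automaton atom into a chain of single-step atoms around y.

```python
from itertools import permutations

def step4_sequences(states, s0, finals):
    """Enumerate the choices of Step 4(b): a sequence s1, ..., s_{n-1} of
    distinct states from S_i followed by some s_n in F.  Each result
    (s0, s1, ..., s_n) induces the chain of atoms
    A_{s0,s1}(t, y), A_{s1,s2}(y, y), ..., A_{s_{n-1},s_n}(y, t')."""
    for n in range(1, len(states) + 2):  # n - 1 distinct middle states
        for middle in permutations(sorted(states), n - 1):
            for sn in sorted(finals):
                yield (s0,) + middle + (sn,)

# Automaton with states {0, 1, 2}, start state 0, final states {2}.
seqs = list(step4_sequences({0, 1, 2}, 0, {2}))
# The shortest choice is the direct split (0, 2), i.e. a single atom A_{0,2}.
```

Since the middle states are drawn without repetition, the number of choices is finite, which is what keeps the rewriting set Rewrite(q, T) finite as well.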

We present in Figure 4 the complete procedure EvalQuery for deciding CN2RPQ entailment.

Theorem 6.7. EvalQuery is a sound and complete procedure for deciding CN2RPQ entailment over ELHI⊥ KBs. In the case of DL-LiteR KBs, it runs in non-deterministic logarithmic space in the size of the ABox.

Proof idea. Soundness, completeness, and termination of

PROCEDURE EvalAtom

Input: n-NFA (A, s0, F0), ELHI⊥ KB K = 〈T, A〉 in normal form, (a, b) ∈ Ind(A) × (Ind(A) ∪ {anon})

1. Let i be such that s0 ∈ S_i, and set max = |A| × |S_i| + 1.
2. Initialize current = (a, s0) and count = 0.
3. While count < max and current ≠ (b, s_f) for s_f ∈ F0:
   (a) Let current = (c, s).
   (b) Guess a pair (d, s′) ∈ (Ind(A) ∪ {anon}) × S_i such that one of the following holds:
       i. d ∈ Ind(A) and there exists (s, σ, s′) ∈ δ_i with σ ∈ Roles such that (c, d) ∈ σ^{I_{T,A}};
       ii. d = c and JumpTrans(A, T) contains a tuple (C, s, s′, Γ) such that c ∈ C^{I_{T,A}} and, for every j > i and every u ∈ Γ ∩ S_j, EvalAtom((A, u, F_j), K, (c, anon)) = yes;
       iii. d = anon, s′ ∈ F0, and FinalTrans(A, T) contains a tuple (C, s, F0, Γ) such that c ∈ C^{I_{T,A}} and, for every j > i and every u ∈ Γ ∩ S_j, EvalAtom((A, u, F_j), K, (c, anon)) = yes.
   (c) Set current = (d, s′) and increment count.
4. If current = (d, s_f) for some s_f ∈ F0, and either b = d or b = anon, return yes. Else return no.

Figure 3: N2RPQ evaluation procedure EvalAtom.
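The bounded-counter structure of EvalAtom can be mimicked by random guessing in place of nondeterminism. The following is a hedged illustration only (toy `abox`/`delta` encodings assumed; jump and final transitions and the anonymous part are omitted), not the paper's implementation:

```python
import random

def eval_atom_guess(abox, delta, a, b, s0, finals, attempts=2000, seed=0):
    """Randomized stand-in for EvalAtom's guessing: repeatedly guess a walk
    of configurations (individual, state), storing only `current` and a
    step counter bounded by max = |Ind(A)| * |S_i| + 1, as in Step 1.
    The bound is sound by pigeonhole: a longer successful walk must
    revisit a configuration, so a shorter witness also exists."""
    rng = random.Random(seed)
    inds = {x for (x, _, _) in abox} | {y for (_, _, y) in abox}
    states = {s for (s, _, _) in delta} | {s for (_, _, s) in delta}
    bound = len(inds) * len(states) + 1
    for _ in range(attempts):
        current, count = (a, s0), 0
        while count < bound and not (current[0] == b and current[1] in finals):
            c, s = current
            # Successors via normal transitions only.
            succ = [(d, s2) for (s1, role, s2) in delta if s1 == s
                    for (x, r, d) in abox if x == c and r == role]
            if not succ:
                break
            current = rng.choice(succ)
            count += 1
        if current[0] == b and current[1] in finals:
            return True
    return False

abox = [("a", "r", "b"), ("b", "r", "c")]
delta = [(0, "r", 1), (1, "r", 2)]
print(eval_atom_guess(abox, delta, "a", "c", 0, {2}))  # True
```

Unlike the real procedure, a bounded number of random attempts can of course miss a witness on larger inputs; the point is only to show the constant-size memory (`current`, `count`) that underlies the NL upper bound.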

PROCEDURE EvalQuery

Input: Boolean CN2RPQ q, ELHI⊥ KB K = 〈T, A〉 in normal form

1. Test whether K is satisfiable; output yes if not.
2. Set Q = Rewrite(q, T). Replace all atoms in Q of types C(a), R(a, b) by equivalent atoms of type A_{s0,F0}(a, b).
3. Guess some q′ ∈ Q and an assignment ~a of individuals to the quantified variables ~v in q′.
   • Let q′′ be obtained by substituting ~a for ~v.
   • For every atom A_{s0,F0}(a, b) in q′′, check if EvalAtom((A, s0, F0), K, (a, b)) = yes.
   • If all checks succeed, return yes.
4. Return no.

Figure 4: CN2RPQ entailment procedure EvalQuery.
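For intuition, Steps 3-4 of EvalQuery can be simulated deterministically by exhaustive enumeration instead of guessing. In this hedged sketch the atom and query encodings are hypothetical, and `check_atom` stands in for EvalAtom:

```python
from itertools import product

def eval_query(rewritings, individuals, check_atom):
    """Deterministic sketch of Steps 3-4 of EvalQuery: enumerate every
    rewritten query q' in Q and every assignment of individuals to its
    variables, and accept iff some choice makes all atoms true.
    Atoms are tuples (s0, F0, t1, t2); terms starting with '?' are
    variables, everything else is an individual."""
    for query in rewritings:
        variables = sorted({t for (_, _, t1, t2) in query for t in (t1, t2)
                            if isinstance(t, str) and t.startswith("?")})
        for choice in product(individuals, repeat=len(variables)):
            sigma = dict(zip(variables, choice))
            ground = [(s0, F0, sigma.get(t1, t1), sigma.get(t2, t2))
                      for (s0, F0, t1, t2) in query]
            if all(check_atom(*atom) for atom in ground):
                return True
    return False

# Toy use: one rewritten query with a single atom A_{0,{1}}("a", ?x);
# check_atom consults a fixed binary relation instead of calling EvalAtom.
F = frozenset({1})
facts = {("a", "b")}
check = lambda s0, F0, x, y: (x, y) in facts
print(eval_query([[(0, F, "a", "?x")]], ["a", "b"], check))  # True
```

The enumeration is exponential in the number of variables, which is fine for data complexity: as noted in the proof below Theorem 6.7, the rewriting step is ABox-independent, so the query side is treated as a constant.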

EvalQuery follow easily from the corresponding properties of the component procedures Rewrite and EvalAtom (Propositions 6.4, 6.5, and 6.6). In DL-LiteR, KB satisfiability is known to be NL-complete in data complexity. Since the rewriting step is ABox-independent, the size of queries in Q can be treated as a constant. It follows that the query q′ and assignment ~a guessed in Step 3 can be stored in logarithmic space in |A|. By Proposition 6.6, each call to EvalAtom runs in non-deterministic logarithmic space.

Corollary 6.8. CN2RPQ entailment over DL-LiteR knowledge bases is NL-complete in data complexity.

7 Conclusions and Future Work

We have studied the extension of (C)2RPQs with a nesting construct inspired by XPath, and have characterized the data and combined complexity of answering nested 2RPQs and C2RPQs for a wide range of DLs. The only complexity bound we leave open is whether the coNP lower bound in data complexity for expressive DLs is tight; indeed, the automata-theoretic approach used to obtain optimal bounds in combined complexity for these logics does not seem to provide the right tool for tight bounds in data complexity.

In light of the surprising jump from P to EXP in the combined complexity of answering N2RPQs in lightweight DLs, a relevant research problem is to identify classes of N2RPQs that exhibit better computational properties. We are also interested in exploring whether the techniques developed in Section 6 can be extended to deal with additional query constructs, such as existential "loop-tests" or forms of role-value maps. Finally, containment of N2RPQs has been studied very recently (Reutter 2013), but only for plain graph databases, so it would be interesting to investigate containment also in the presence of DL constraints.

Acknowledgments. This work has been partially supported by ANR project PAGODA (ANR-12-JS02-007-01), by the EU IP Project FP7-318338 Scalable End-user Access to Big Data (Optique), by the FWF projects T515-N23 and P25518-N23, and by the WWTF project ICT12-015.

References

Baader, F.; Calvanese, D.; McGuinness, D.; Nardi, D.; and Patel-Schneider, P. F., eds. 2003. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press.

Baader, F.; Brandt, S.; and Lutz, C. 2008. Pushing the EL envelope further. In Proc. of OWLED.

Barcelo Baeza, P. 2013. Querying graph databases. In Proc. of PODS, 175–188.

Barcelo, P.; Libkin, L.; Lin, A. W.; and Wood, P. T. 2012. Expressive languages for path queries over graph-structured data. ACM TODS 37(4):31.

Berglund, A., et al. 2010. XML Path Language (XPath) 2.0 (Second Edition). W3C Recommendation. Available at http://www.w3.org/TR/xpath20.

Bienvenu, M.; Calvanese, D.; Ortiz, M.; and Simkus, M. 2014. Nested regular path queries in description logics. CoRR Technical Report arXiv:1402.7122. Available at http://arxiv.org/abs/1402.7122.

Bienvenu, M.; Ortiz, M.; and Simkus, M. 2013. Conjunctive regular path queries in lightweight description logics. In Proc. of IJCAI.

Borgida, A., and Brachman, R. J. 2003. Conceptual modeling with description logics. In Baader et al. (2003), chapter 10, 349–372.

Calvanese, D.; De Giacomo, G.; Lenzerini, M.; and Vardi, M. Y. 2000. Containment of conjunctive regular path queries with inverse. In Proc. of KR, 176–185.

Calvanese, D.; De Giacomo, G.; Lenzerini, M.; and Vardi, M. Y. 2003. Reasoning on regular path queries. SIGMOD Record 32(4):83–92.

Calvanese, D.; De Giacomo, G.; Lembo, D.; Lenzerini, M.; and Rosati, R. 2006. Data complexity of query answering in description logics. In Proc. of KR, 260–270.

Calvanese, D.; De Giacomo, G.; Lembo, D.; Lenzerini, M.; and Rosati, R. 2007. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. JAR 39(3):385–429.

Calvanese, D.; De Giacomo, G.; and Lenzerini, M. 2008. Conjunctive query containment and answering under description logics constraints. ACM TOCL 9(3):22.1–22.31.

Calvanese, D.; Eiter, T.; and Ortiz, M. 2009. Regular path queries in expressive description logics with nominals. In Proc. of IJCAI, 714–720.

Chandra, A. K.; Kozen, D.; and Stockmeyer, L. J. 1981. Alternation. JACM 28(1):114–133.

Consens, M. P., and Mendelzon, A. O. 1990. GraphLog: a visual formalism for real life recursion. In Proc. of PODS, 404–416.

Cruz, I. F.; Mendelzon, A. O.; and Wood, P. T. 1987. A graphical query language supporting recursion. In Proc. of ACM SIGMOD, 323–330.

Eiter, T.; Lutz, C.; Ortiz, M.; and Simkus, M. 2009. Query answering in description logics with transitive roles. In Proc. of IJCAI, 759–764.

Glimm, B.; Lutz, C.; Horrocks, I.; and Sattler, U. 2008. Conjunctive query answering for the description logic SHIQ. JAIR 31:157–204.

Harris, S., and Seaborne, A. 2013. SPARQL 1.1 Query Language. W3C Recommendation, World Wide Web Consortium. Available at http://www.w3.org/TR/sparql11-query.

Hustadt, U.; Motik, B.; and Sattler, U. 2005. Data complexity of reasoning in very expressive description logics. In Proc. of IJCAI, 466–471.

Krisnadhi, A., and Lutz, C. 2007. Data complexity in the EL family of description logics. In Proc. of LPAR, 333–347.

Lutz, C. 2008. The complexity of conjunctive query answering in expressive description logics. In Proc. of IJCAR, volume 5195 of LNAI, 179–193. Springer.

Ortiz, M.; Calvanese, D.; and Eiter, T. 2008. Data complexity of query answering in expressive description logics via tableaux. JAR 41(1):61–98.

Ortiz, M.; Rudolph, S.; and Simkus, M. 2011. Query answering in the Horn fragments of the description logics SHOIQ and SROIQ. In Proc. of IJCAI, 1039–1044.

Papadimitriou, C. H. 1993. Computational Complexity. Addison Wesley.

Perez, J.; Arenas, M.; and Gutierrez, C. 2010. nSPARQL: A navigational language for RDF. J. of Web Semantics 8(4):255–270.

Reutter, J. L. 2013. Containment of nested regular expressions. CoRR Technical Report arXiv:1304.2637. Available at http://arxiv.org/abs/1304.2637.