
EU-IST Integrated Project (IP) IST-2003-506826 SEKT

SEKT: Semantically Enabled Knowledge Technologies

Inconsistent Ontology Diagnosis: Evaluation

Stefan Schlobach∗, Zhisheng Huang∗, and Ronald Cornet#

(∗Vrije Universiteit Amsterdam, #AMC, Amsterdam)

Abstract. EU-IST Integrated Project (IP) IST-2003-506826 SEKT, Deliverable D3.6.2 (WP3.6). In this document we evaluate the framework for inconsistent ontology diagnosis and repair introduced in Sekt Deliverable D3.6.1, where we defined a number of new non-standard reasoning services to explain inconsistencies. The evaluation is done in two ways: first, we study the effectiveness of our proposal in a qualitative way with some practical examples. Secondly, in a quantitative and statistical analysis, we try to get a better understanding of the computational properties of the debugging problem and our algorithms for solving it.

Keyword list: ontology management, inconsistency diagnosis, ontology reasoning

Copyright © 2006 Department of Artificial Intelligence, Vrije Universiteit Amsterdam

Document Id.: SEKT/2005/D3.6.2/v1.0
Project: SEKT EU-IST-2003-506826
Date: January 26, 2006
Distribution: internal


SEKT Consortium

This document is part of a research project partially funded by the IST Programme of the Commission of the European Communities as project number IST-2003-506826.

British Telecommunications plc., Orion 5/12, Adastral Park, Ipswich IP5 3RE, UK. Tel: +44 1473 609583, Fax: +44 1473 609832. Contact person: John Davies. E-mail: [email protected]

Empolis GmbH, Europaallee 10, 67657 Kaiserslautern, Germany. Tel: +49 631 303 5540, Fax: +49 631 303 5507. Contact person: Ralph Traphoner. E-mail: [email protected]

Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia. Tel: +386 1 4773 778, Fax: +386 1 4251 038. Contact person: Marko Grobelnik. E-mail: [email protected]

University of Karlsruhe, Institute AIFB, Englerstr. 28, D-76128 Karlsruhe, Germany. Tel: +49 721 608 6592, Fax: +49 721 608 6580. Contact person: York Sure. E-mail: [email protected]

University of Sheffield, Department of Computer Science, Regent Court, 211 Portobello St., Sheffield S1 4DP, UK. Tel: +44 114 222 1891, Fax: +44 114 222 1810. Contact person: Hamish Cunningham. E-mail: [email protected]

University of Innsbruck, Institute of Computer Science, Technikerstraße 13, 6020 Innsbruck, Austria. Tel: +43 512 507 6475, Fax: +43 512 507 9872. Contact person: Jos de Bruijn. E-mail: [email protected]

Intelligent Software Components S.A., Pedro de Valdivia, 10, 28006 Madrid, Spain. Tel: +34 913 349 797, Fax: +49 34 913 349 799. Contact person: Richard Benjamins. E-mail: [email protected]

Kea-pro GmbH, Tal, 6464 Springen, Switzerland. Tel: +41 41 879 00, Fax: +41 41 879 00 13. Contact person: Tom Bosser. E-mail: [email protected]

Ontoprise GmbH, Amalienbadstr. 36, 76227 Karlsruhe, Germany. Tel: +49 721 50980912, Fax: +49 721 50980911. Contact person: Hans-Peter Schnurr. E-mail: [email protected]

Sirma AI EAD, Ontotext Lab, 135 Tsarigradsko Shose, Sofia 1784, Bulgaria. Tel: +359 2 9768 303, Fax: +359 2 9768 311. Contact person: Atanas Kiryakov. E-mail: [email protected]

Vrije Universiteit Amsterdam (VUA), Department of Computer Sciences, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands. Tel: +31 20 444 7731, Fax: +31 84 221 4294. Contact person: Frank van Harmelen. E-mail: [email protected]

Universitat Autonoma de Barcelona, Edifici B, Campus de la UAB, 08193 Bellaterra (Cerdanyola del Valles), Barcelona, Spain. Tel: +34 93 581 22 35, Fax: +34 93 581 29 88. Contact person: Pompeu Casanovas Romeu. E-mail: [email protected]


Executive Summary

In this document we evaluate our logical framework for debugging, which we introduced in Sekt Deliverable 3.6.1. This evaluation is done in two ways: first, we study the effectiveness of our proposal in a qualitative way with some practical examples. Secondly, in a quantitative and statistical analysis, we try to get a better understanding of the computational properties of the debugging problem and our algorithms for solving it.

Instead of providing a wide and shallow evaluation of all types of debugging for all sorts of languages, we have opted for a very detailed analysis of a particular class of ontologies, namely the terminological part, and here we even focus on unfoldable ALC TBoxes. These restrictions have grown out of our own applications, and we have most experience with these cases. We believe that the lessons we learn can be extended to other cases.

In the qualitative part we discuss two simple incoherent terminologies to explain the functionality and particular application scenarios of our debugging framework. This framework was then applied to two incoherent terminologies which were used at the Academic Medical Center, Amsterdam, for the admission of patients to Intensive Care units. This evaluation is described in Chapter 4.

For the statistical part we conducted three sets of experiments in order to evaluate the debugging problem and our algorithms and tools. First, we applied our two debuggers DION and MUPSter on a set of real-world terminologies collected from our applications and the WWW. Secondly, we translated an existing test-set for Description Logic satisfiability into an incoherence problem, and finally, we created our own benchmark.

The results are mixed: although debugging is useful in practice, we cannot guarantee that our tools will calculate debugs in reasonable time. The most important criteria are the size and complexity of the definitions and the number of modeling errors.

This deliverable partly overlaps with the previous Sekt Deliverable 3.6.1 in order to make it self-contained. This applies to Chapter 3, which is mostly copied from the previous Deliverable, and to Chapter 2, which contains mostly known material.


Contents

1 Introduction

2 Debugging Inconsistent Terminologies
  2.1 Logical errors in Description Logic terminologies
  2.2 Framework for debugging and diagnosis
      2.2.1 Model-based Diagnosis
      2.2.2 Debugging
      2.2.3 Heuristics

3 Algorithms for debugging and diagnosis
  3.1 Two algorithms for debugging
      3.1.1 A top-down approach to explanation
      3.1.2 An Informed Bottom-up Approach to Explanation
  3.2 Calculating terminological diagnoses

4 A qualitative evaluation of the framework
  4.1 Two motivating examples
  4.2 Applying the framework to real-life terminologies
  4.3 Qualitative evaluation

5 Principles of a quantitative evaluation
  5.1 Evaluating algorithms for debugging and diagnosis
  5.2 Three types of benchmark experiments
      5.2.1 Evaluation with real-world terminologies
      5.2.2 Benchmarking with (adapted) existing test-sets
      5.2.3 Benchmarking with purpose-built test-sets

6 Experimental evaluation
  6.1 Experiments to evaluate algorithms for debugging
      6.1.1 Experiments with existing ontologies
      6.1.2 Experiments with existing benchmarks
      6.1.3 Experiments with purpose-built benchmark
  6.2 Experiments to evaluate terminological diagnosis

7 Concluding remarks


Chapter 1

Introduction

Ontologies play a crucial role in the Semantic Web (SW), as they allow "intelligent agents" to share information in a semantically unambiguous way, and to reuse domain knowledge (possibly created by external sources). However, this makes SW technology highly dependent on the quality, and, in particular, on the correctness of the applied ontology. Two general strategies for quality assurance are predominant, one based on developing more and more sophisticated ontology modeling tools, the second one based on logical reasoning. In this paper we will focus on the latter. With the advent of expressive ontology languages such as OWL and its close relation to Description Logics (DL), non-trivial implicit information, such as the is-a hierarchy of classes, can often be made explicit by logical reasoners. More crucially, however, state-of-the-art DL reasoners can efficiently detect incoherences even in very large ontologies. The practical problem remains what to do in case an ontology has been detected to be incoherent.

Our work was motivated initially by the development of the DICE¹ terminology. DICE implements frame-based definitions of diagnostic information for the unambiguous and unified classification of patients in Intensive Care medicine. The representation of DICE is currently being migrated to an expressive Description Logic (henceforth DL) to facilitate logical inferences. Figure 1.1 shows an extract of the DICE terminology. In [4] the authors describe the migration process in more detail. The resulting DL terminology (usually called a "TBox") contains axioms such as the following, where classes (like BODYPART) are translated as concepts, and slots (like REGION) as roles:

Brain ⊑ CNS ⊓ ∃systempart.NervousSystem ⊓ BodyPart ⊓ ∃region.HeadAndNeck ⊓ ∀region.HeadAndNeck

CNS ⊑ NervousSystem

Developing a coherent terminology is a time-consuming and error-prone process. DICE defines more than 2400 concepts and uses 45 relations.

¹ DICE stands for "Diagnoses for Intensive Care Evaluation". The development of the DICE terminology has been supported by the NICE foundation.


CLASS   SUPERCLASS      SLOT          SLOT-VALUE
BRAIN   BODYPART        REGION        HEAD AND NECK
        CNS             SYSTEM PART   NERVOUS SYSTEM
CNS     NERVOUSSYSTEM

Figure 1.1: An extract from the DICE terminology (frame-based).

To illustrate some of the problems, take the definition of a "brain", which is incorrectly specified, among others, as a "CNS" (central nervous system) and a "body-part" located in the head. This definition is contradictory, as nervous systems and body parts are declared disjoint in DICE. Fortunately, current Description Logic reasoners, such as RACER [12] or FaCT [13], can detect this type of inconsistency, and the knowledge engineer can identify the cause of the problem. Unfortunately, many other concepts are defined based on the erroneous definition of "brain", forcing each of them to be erroneous as well. In practice, DL reasoners provide lists of hundreds of unsatisfiable concepts for the DICE TBox and the debugging remains a jigsaw to be solved by human experts, with little additional explanation to support this process.

There are two main ways to deal with inconsistent ontologies. One is to simply avoid the inconsistency and to apply a non-standard reasoning method to obtain meaningful answers. In [14, 15], a framework for reasoning with inconsistent ontologies, in which pre-defined selection functions are used to deal with concept relevance, is presented. The notion of "concept relevance" can be used for reasoning with inconsistent ontologies.

An alternative approach to deal with logical contradictions is to resolve logical modeling errors whenever a logical problem is encountered. In this document, we will focus on the evaluation of this debugging process. In Sekt Deliverable 3.6.1 we introduced a framework for debugging and diagnosis, more precisely the notions of minimal unsatisfiability-preserving sub-TBoxes (abbreviated MUPS) and minimal incoherence-preserving sub-TBoxes (MIPS) as the smallest subsets of axioms of an incoherent terminology preserving unsatisfiability of a particular concept, respectively of at least one unsatisfiable concept.

An orthogonal view on inconsistent ontologies is based on traditional model-based diagnosis (MBD), which has been studied over many years in the AI community [22]. Here the aim is to find minimal fixes, i.e. minimal subsets of an ontology that need to be repaired or removed to render an ontology logically correct, and therefore usable again. In Reiter's terminology, MIPS and MUPS would be minimal conflict sets, which implies that, technically, calculating diagnoses depends on debugging. Therefore, we focus on the evaluation of debugging in this document.

There are basically two approaches for debugging, a bottom-up method using the support of an external reasoner, and a top-down implementation of a specialised algorithm. In this paper we describe one approach of each kind, the former based on the systematic enumeration of terminologies of increasing size based on selection functions on axioms,


the latter on Boolean minimisation of labels in a labelled tableau calculus.

Both methods have been implemented as prototypes. The prototype for the informed bottom-up approach is called DION, which stands for a Debugger of Inconsistent ONtologies; the prototype of the specialised top-down method is called MUPSter. In SEKT deliverable 3.6.1 we discussed some implementation issues of both these systems, and provided a basic introduction on how to use the systems for debugging. What is missing is a detailed evaluation of the quality of the proposed framework, and this paper attempts to close this gap.

This is done in two ways: first, we study the effectiveness of our proposal in a qualitative way with some practical examples. Secondly, in a quantitative and statistical analysis, we try to get a better understanding of the computational properties of the debugging problem and our algorithms for solving it.

In the qualitative part we discuss two simple incoherent terminologies to explain the functionality and particular application scenarios of our debugging framework. This framework was then applied to two incoherent terminologies which were used at the Academic Medical Center, Amsterdam, for the admission of patients to Intensive Care units. This evaluation is described in Chapter 4.

For the statistical part we conducted three sets of experiments in order to evaluate the debugging problem and our algorithms and tools. First, we applied our two debuggers DION and MUPSter on a set of real-world terminologies collected from our applications and the WWW. Secondly, we translated an existing test-set for Description Logic satisfiability into an incoherence problem, and finally, we created our own benchmark.

The results are mixed: although debugging is useful in practice, we cannot guarantee that our tools will calculate debugs in reasonable time. The most important criteria are the size and complexity of the definitions and the number of modeling errors.

The research underlying this report was driven by practical needs, as MUPSter was implemented to debug the DICE ontology. This means that the top-down method to calculate MIPS and MUPS is implemented for unfoldable ALC TBoxes only. This implies that this report focuses solely on debugging of ALC terminologies, and evaluates the methods under these constraints. An extension to full ontologies in more expressive languages is left for future research.

Overview  This paper is organized as follows. We will first introduce relevant background to the problem of debugging and the approach we have developed in Chapter 2. Furthermore we will remind the reader about the implementation of our methods in Chapter 3. This chapter repeats known results from Sekt Deliverable 3.6.1. The qualitative study of our methods is described in Chapter 4. The largest part of this report is the quantitative evaluation of the algorithms for debugging and diagnosis. Here, we first describe our evaluation method with three different types of evaluation in Chapter 5 and then the results in Chapter 6. We end this report with some concluding remarks.


Chapter 2

Debugging Inconsistent Terminologies

This chapter deals with debugging and diagnosis of inconsistent Description Logic ontologies. Description Logics are a family of well-studied set-description languages which have been in use for over two decades to formalize knowledge. They have a well-defined model-theoretic semantics, which allows for the automation of a number of reasoning services.

2.1 Logical errors in Description Logic terminologies

We shall not give a formal introduction into Description Logics here, but point to the second chapter of the DL handbook [1] for an excellent introduction. Briefly, in DL concepts will be interpreted as subsets of a domain, and roles as binary relations. In a terminological component T (called TBox) the interpretations of concepts can be restricted to the models of T. Let, throughout the paper, T = {Ax1, ..., Axn} be a set of (terminological) axioms, where Axi is of the form Ci ⊑ Di for each 1 ≤ i ≤ n and arbitrary concepts Ci and Di. A TBox is called unfoldable if the left-hand sides of the axioms (the defined concepts) are atomic, and if the right-hand sides (the definitions) contain no direct or indirect reference to the defined concept [19].

Let U be a finite set, called the universe. A mapping I, which interprets DL concepts as subsets of U, is a model of a terminological axiom C ⊑ D if, and only if, C^I ⊆ D^I. A model for a TBox T is an interpretation which is a model for all axioms in T. Based on this semantics a TBox can be checked for incoherence, i.e., whether there are unsatisfiable concepts: concepts which are necessarily interpreted as the empty set in all models of the TBox. More formally:

1. A concept A is unsatisfiable w.r.t. a terminology T if, and only if, A^I = ∅ for all models I of T.

2. A TBox T is incoherent if there is a concept name in T which is unsatisfiable.


ax1: A1 ⊑ ¬A ⊓ A2 ⊓ A3
ax2: A2 ⊑ A ⊓ A4
ax3: A3 ⊑ A4 ⊓ A5
ax4: A4 ⊑ ∀s.B ⊓ C
ax5: A5 ⊑ ∃s.¬B
ax6: A6 ⊑ A1 ⊔ ∃r.(A3 ⊓ ¬C ⊓ A4)
ax7: A7 ⊑ A4 ⊓ ∃s.¬B

Table 2.1: A small (incoherent) TBox T1, where A, B and C are atomic concepts, A1, ..., A7 defined concept names, and r and s atomic roles.

Conceptually, these cases are simple modeling errors because we assume that a knowledge modeler would not specify an impossible concept in a complex way.

Table 2.1 demonstrates this principle. Consider the (incoherent) TBox T1, where A, B and C are atomic and A1, ..., A7 defined concept names, and r and s are atomic roles. Satisfiability testing of the TBox by a DL reasoner returns a set of unsatisfiable concept names {A1, A3, A6, A7}. Although this is still of manageable size, it hides crucial information, e.g., that unsatisfiability of A1 depends, among others, on unsatisfiability of A3, which is in turn unsatisfiable because of the contradictions between A4 and A5. We will use this example later in this paper to explain our debugging methods.

In this chapter we study ways of explaining incoherence and unsatisfiability in DL terminologies, and inconsistency of ontologies, by way of debugging and diagnosis.

2.2 Framework for debugging and diagnosis

In Sekt Deliverable 3.6.1 we introduced a theory of debugging and diagnosis and linked it to description logic-based systems, in which case a diagnosis is a smallest set of axioms that needs to be removed or corrected to render a specific concept or all concepts satisfiable. Orthogonally, we considered the smallest sets of axioms that still contain the logical contradictions.

In this section we describe both methods based on the use of minimal conflict sets for explaining unsatisfiability and incoherence in DL-based ontologies. We will not elaborate on algorithms to construct such conflict sets; the interested reader is referred to Chapter 3, and to [24], [25] and [26].

We propose to simplify a TBox T in order to reduce the available information to the root of the incoherence. More concretely, we first exclude axioms which are irrelevant to the incoherence and then provide simplified definitions highlighting the exact position of a contradiction within the axioms of this reduced TBox. We will call the former "axiom pinpointing", the latter "concept pinpointing". In this section we will formally introduce axiom and concept pinpointing for a general TBox without restrictions on the underlying description logic.


In our analogy to diagnosis, we consider the ontology to be the system, where the axioms are the components of the system. Satisfiability of a concept is taken as a measurement, where the system description states that all concepts are satisfiable.

First, we will define minimal conflict sets w.r.t. satisfiability of a concept. Next, we will define minimal conflict sets w.r.t. coherence of the ontology as a whole. Thereafter we describe a generalization method as a means of providing focus on those parts of axioms that lead to unsatisfiability of concepts.

In some situations, ontologies can contain a large number of unsatisfiable concepts. This can occur for example when ontologies are the result of a merging process of separately developed ontologies, or when closure axioms (i.e. disjointness statements and universal restrictions) are added to ontologies. Unsatisfiability propagates, i.e. one unsatisfiable concept may cause many other concepts to become unsatisfiable as well. As it is often not clear to a modeler what concepts are the root cause of unsatisfiability, we also describe a number of heuristics that help to indicate reasonable starting points for debugging an ontology.

2.2.1 Model-based Diagnosis

The literature on model-based diagnosis is manifold, but we focus on the seminal work of Reiter [22], and [11], which corrects a small bug in Reiter's original algorithm. We refer the interested reader to a good overview in [3]. A more formal description can also be found in [24].

Reiter introduced a diagnosis [22] of a system as the smallest set of components from that system with the following property: the assumption that each of these components is faulty (together with the assumption that all other components are behaving correctly) is consistent with the system description and observation. For example, a simple electrical circuit can be defined, consisting of a number of adders. Based on the description of the system and some input values, one can calculate the output of the system. If the observed output is different from the expected output, at least one of the components must be faulty, and diagnoses determine which components could have caused the error.

To apply this definition to a description logic ontology, the system is the ontology, and the components of the system are the axioms. The concepts and roles in a concept definition are regarded as input values, and the defined concepts as output values.

If we look at the example ontology from Table 2.1, the system description states that it is coherent (i.e. all concepts are satisfiable), but the observation is that A1, A3, A6, and A7 are unsatisfiable.

Reiter provides a generic method to calculate diagnoses on the basis of conflict sets and their minimal hitting sets. A conflict set is a set of components that, when assumed to be fault free, lead to an inconsistency between the system description and observations. A conflict set is minimal if and only if no proper subset of it is a conflict set. The minimal


conflict sets (w.r.t. coherence) for the system in Table 2.1 are {ax1, ax2}, {ax3, ax4, ax5}, and {ax4, ax7}.

A hitting set H for a collection of sets C is a set that contains at least one element of each of the sets in C. Formally: H ⊆ ⋃_{S∈C} S such that H ∩ S ≠ ∅ for each S ∈ C. A hitting set is minimal if and only if no proper subset of it is a hitting set. Given the conflict sets above, the minimal hitting sets are: {ax1, ax3, ax7}, {ax1, ax4}, {ax1, ax5, ax7}, {ax2, ax3, ax7}, {ax2, ax4}, and {ax2, ax5, ax7}.

Reiter shows that the set of diagnoses actually corresponds to the collection of minimal hitting sets for the minimal conflict sets. Hence, the minimal hitting sets given above determine the diagnoses for the system w.r.t. coherence.
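This correspondence is easy to check mechanically. The following Python sketch (not part of the DION or MUPSter implementations) brute-forces the minimal hitting sets of the three minimal conflict sets of T1 given above; being exponential in the number of axioms it is only suitable for such small examples, but it reproduces the six minimal hitting sets listed above.

from itertools import combinations

def minimal_hitting_sets(conflict_sets):
    # A hitting set H contains at least one element of every conflict set;
    # it is minimal if no proper subset of it is itself a hitting set.
    universe = sorted(set().union(*conflict_sets))
    minimal = []
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            cand = set(candidate)
            if all(cand & s for s in conflict_sets) and \
               not any(h < cand for h in minimal):
                minimal.append(cand)
    return minimal

conflicts = [{"ax1", "ax2"}, {"ax3", "ax4", "ax5"}, {"ax4", "ax7"}]
for h in minimal_hitting_sets(conflicts):
    print(sorted(h))    # the six diagnoses of T1 w.r.t. coherence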

In [7] diagnosis is extended by providing a method for computing the probabilities of failure of various components based on given measurements. Especially in cases where there are many diagnoses, additional observations (measurements) need to be made in order to determine the actually failing components. The method provided can also determine what observation has the highest discriminating power, i.e. needs to be performed to maximally reduce the number of diagnoses.

2.2.2 Debugging

As previously mentioned, the theory of diagnosis is built on minimal conflict sets. But in the application of diagnosis to erroneous ontologies, these minimal conflict sets play a role of their own, as they are the prime tools for debugging, i.e. for the identification of potential errors. For different kinds of logical contradictions we introduce several different notions based on conflict sets: the MUPS for unsatisfiability of a concept, the MIPS for incoherence of a terminology, and a number of heuristics.

Minimal unsatisfiability-preserving sub-TBoxes (MUPS)  In [25] we introduced the notion of Minimal Unsatisfiability Preserving Sub-TBoxes (MUPS) to denote minimal conflict sets. Unsatisfiability-preserving sub-TBoxes of a TBox T and an unsatisfiable concept A are subsets of T in which A is unsatisfiable. In general there are several of these sub-TBoxes and we select the minimal ones, i.e., those containing only axioms that are necessary to preserve unsatisfiability. A TBox T′ ⊆ T is a MUPS of T and A if A is unsatisfiable in T′, and A is satisfiable in every sub-TBox T″ ⊂ T′. We will abbreviate the set of MUPS of T and A by mups(T, A). The MUPS for our example TBox T1 and its unsatisfiable concepts are:

mups(T1, A1): {{ax1, ax2}, {ax1, ax3, ax4, ax5}}
mups(T1, A3): {{ax3, ax4, ax5}}
mups(T1, A6): {{ax1, ax2, ax4, ax6}, {ax1, ax3, ax4, ax5, ax6}}
mups(T1, A7): {{ax4, ax7}}
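The definition itself translates directly into a (very naive) procedure: enumerate sub-TBoxes of increasing size and keep the minimal ones in which the concept is unsatisfiable. The Python sketch below illustrates this; the reasoner call is abstracted by a hypothetical is_satisfiable(concept, axioms) callable (not an actual RACER or FaCT interface), and the sketch only illustrates the definition, not the algorithms of Chapter 3.

from itertools import combinations

def naive_mups(tbox, concept, is_satisfiable):
    # tbox: a collection of axiom identifiers; is_satisfiable stands in for
    # a DL reasoner and must return True iff `concept` is satisfiable w.r.t.
    # the given axioms.  Because subsets are visited by increasing size, the
    # first unsatisfiability-preserving subsets found are the minimal ones.
    found = []
    for size in range(1, len(tbox) + 1):
        for subset in combinations(sorted(tbox), size):
            sub = set(subset)
            if not is_satisfiable(concept, sub) and \
               not any(m < sub for m in found):
                found.append(sub)
    return found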


It can be easily proven that each MUPS(T, A) is a minimal conflict set w.r.t. satisfiability of concept A in TBox T.

As explained in Section 2.2.1, a diagnosis is a minimal hitting set for the collection of conflict sets. Hence, from the MUPS, we can also calculate the diagnoses for satisfiability of concept A in TBox T, which we will denote ∆T,A.

∆T,A1: {{ax1}, {ax2, ax3}, {ax2, ax4}, {ax2, ax5}}
∆T,A3: {{ax3}, {ax4}, {ax5}}
∆T,A6: {{ax1}, {ax4}, {ax6}, {ax2, ax3}, {ax2, ax5}}
∆T,A7: {{ax4}, {ax7}}

Minimal incoherence-preserving sub-TBoxes (MIPS)  MUPS are useful for relating sets of axioms to unsatisfiability of specific concepts, but they can also be used to calculate MIPS, which relate sets of axioms to incoherence of a TBox (i.e. unsatisfiability of any concept in a TBox).

In [25] we introduced Minimal Incoherence Preserving Sub-TBoxes (MIPS) as the smallest subsets of an original TBox preserving unsatisfiability of at least one atomic concept. The set of MIPS for a TBox T is abbreviated with mips(T). For T1 we get 3 MIPS:

mips(T1) = {{ax1, ax2}, {ax3, ax4, ax5}, {ax4, ax7}}

Analogous to MUPS, each MIPS(T) is a minimal conflict set w.r.t. coherence of TBox T. Hence, from mips(T), a diagnosis for coherence of T can be calculated, which we denote as ∆T. From these definitions, we can determine the diagnoses for coherence of T1:

∆T1 = {{ax1, ax4}, {ax2, ax4}, {ax1, ax3, ax7}, {ax2, ax3, ax7}, {ax1, ax5, ax7}, {ax2, ax5, ax7}}

Generalized MIPS  The use of MUPS and MIPS provides a focus on potentially incorrect axioms by reducing the number of axioms, while retaining unsatisfiability of a specific concept or incoherence of the ontology as a whole. A further step towards pinpointing is to look into the axioms, aiming at determining the concept expressions within axioms that lead to unsatisfiability.

We will describe a procedural approach to generalization of MIPS. This approach is comparable to those used in structural subsumption algorithms. The first step is to represent the axioms in the MIPS in conjunctive normal form, i.e. A ⊑ X1 ⊓ X2 ⊓ ... ⊓ Xn, where X1 ... Xn are concept names or concept expressions that do not contain conjunctions. The next step is to minimize the number of conjuncted concept expressions in the axioms, while retaining incoherence. The resulting axioms provide generalizations of the concepts from the MIPS, which we will call GMIPS. For the example T1 we get these


generalized axioms:

GMIPS {ax1, ax2}: { A′1 ⊑ ¬A′ ⊓ A′2,  A′2 ⊑ A′ }
GMIPS {ax3, ax4, ax5}: { A′3 ⊑ A′4 ⊓ A′5,  A′4 ⊑ ∀s.B′,  A′5 ⊑ ∃s.¬B′ }
GMIPS {ax4, ax7}: { A′4 ⊑ ∀s.B′,  A′7 ⊑ A′4 ⊓ ∃s.¬B′ }

A′1, A′2 and A′4 provide generalizations of the definitions of A1, A2 and A4, respectively. The definitions of A3 and A5 could not be further generalized without losing incoherence.
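One possible rendering of this generalization step in code is sketched below: conjuncts are greedily dropped from the axioms of a MIPS as long as the remaining axioms stay incoherent. The coherence check is abstracted by a hypothetical is_coherent(axioms) callable standing in for a DL reasoner, and the greedy order means the result is locally, not necessarily globally, minimal.

def generalize_mips(mips_axioms, is_coherent):
    # mips_axioms maps a defined concept name to the list of top-level
    # conjuncts of its (normal-form) definition, as described above;
    # is_coherent(axioms) stands in for a DL reasoner call on such a mapping.
    axioms = {name: list(conjs) for name, conjs in mips_axioms.items()}
    for name in axioms:
        i = 0
        while i < len(axioms[name]):
            trial = dict(axioms)
            trial[name] = axioms[name][:i] + axioms[name][i + 1:]
            if not is_coherent(trial):   # still incoherent: drop the conjunct
                axioms[name] = trial[name]
            else:
                i += 1                   # conjunct is needed for the clash
    return axioms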

2.2.3 Heuristics

Now that we have introduced the basic notions w.r.t. pinpointing, we will also provide a number of heuristics. These heuristics aim at indicating those axioms that are likely to be involved in larger numbers of unsatisfiable concepts. As was described earlier, unsatisfiability of a concept propagates to subsumees of that concept, and to concepts that use the unsatisfiable concept as the value of an existential restriction. In the example TBox T1, unsatisfiability of A3 is one of the two causes for unsatisfiability of A1 (as A1 is subsumed by A3). But it is also a cause for unsatisfiability of A6: firstly, because it renders A1 unsatisfiable, secondly because it renders the conjunction A3 ⊓ ¬C ⊓ A4 unsatisfiable, which in turn leads to unsatisfiability of ∃r.(A3 ⊓ ¬C ⊓ A4).

Hence, solving unsatisfiability of A3 might also lead to satisfiability of A1 and A6. In this example, A1 will actually remain unsatisfiable due to the conflicting definitions of A1 and A2.

MIPS-weights  Every MIPS is a subset of one or more MUPS. One indication of the effect of propagation is the number of MUPS that a MIPS is a subset of. The higher this number, the more likely it is that the axioms in the MIPS are the cause of unsatisfiability for more concepts.

We define the MIPS-weight as the number of MUPS of which a MIPS is a subset.

In the example ontology T1 we found six MUPS and three MIPS. The MIPS {ax1, ax2} is equivalent to one of the MUPS for A1, {ax1, ax2}, and a proper subset of a MUPS for A6, {ax1, ax2, ax4, ax6}. Hence, the weight of MIPS {ax1, ax2} is two. In the same way we can calculate the weights for the other MIPS: the weight of {ax3, ax4, ax5} is three, the weight of {ax4, ax7} is one.

Intuitively, this suggests that the combination of the axioms {ax3, ax4, ax5} may be the cause of more than one unsatisfiable concept, whereas {ax4, ax7} only leads to unsatisfiability of one concept, A7.
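Since MUPS and MIPS are just sets of axiom names, the weights of T1 can be recomputed directly from the sets given earlier; the short Python sketch below (using the MUPS and MIPS listed above as plain data) reproduces the weights two, three and one.

def mips_weight(mips, all_mups):
    # MIPS-weight: the number of MUPS of which the MIPS is a subset.
    return sum(1 for m in all_mups if mips <= m)

mups_T1 = [frozenset(s) for s in (
    {"ax1", "ax2"}, {"ax1", "ax3", "ax4", "ax5"},          # for A1
    {"ax3", "ax4", "ax5"},                                  # for A3
    {"ax1", "ax2", "ax4", "ax6"},                           # for A6
    {"ax1", "ax3", "ax4", "ax5", "ax6"},                    # for A6
    {"ax4", "ax7"},                                         # for A7
)]
mips_T1 = [frozenset(s) for s in
           ({"ax1", "ax2"}, {"ax3", "ax4", "ax5"}, {"ax4", "ax7"})]
for m in mips_T1:
    print(sorted(m), mips_weight(m, mups_T1))               # weights 2, 3, 1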


Cores  MIPS-weights provide an intuition of which combinations of axioms lead to unsatisfiability. Alternatively, one can focus on the occurrence of the individual axioms in MIPS, in order to predict the likelihood that an individual axiom is erroneous.

We define cores as sets of axioms occurring in several MIPS. The more MIPS such a core belongs to, the more likely its axioms will be the cause of contradictions. A non-empty intersection of n different MIPS in mips(T) (with n ≥ 1) is called a MIPS-core of arity n (or simply n-ary core) for T.

For our example TBox T1 we find one 2-ary core, ax4. The other axioms in the MIPS are 1-ary cores.
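Cores of a given arity can likewise be computed by intersecting the MIPS; the following small sketch, again treating the MIPS of T1 as plain data, reproduces the single 2-ary core ax4.

from itertools import combinations

def cores(mips_list, arity):
    # MIPS-cores of the given arity: non-empty intersections of `arity`
    # different MIPS.
    result = set()
    for group in combinations(mips_list, arity):
        common = frozenset.intersection(*group)
        if common:
            result.add(common)
    return result

mips_T1 = [frozenset(s) for s in
           ({"ax1", "ax2"}, {"ax3", "ax4", "ax5"}, {"ax4", "ax7"})]
print(cores(mips_T1, 2))      # {frozenset({'ax4'})}: the only 2-ary core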

Pinpoints  Pinpoints were introduced in [26] as a computationally attractive alternative to diagnoses. As calculation of diagnoses is a so-called NP-complete problem (i.e. most likely not solvable in polynomial time), we use the cores described above to construct a pinpoint.

Pinpoints are constructed as follows. Take a core {ax} of size 1 with maximal arity. Then, remove from mips(T) all MIPS containing {ax}. Repeat these steps until there are no MIPS left. The selected cores form a pinpoint for the ontology.

For our example TBox T1 with mips(T1) = {{ax1, ax2}, {ax3, ax4, ax5}, {ax4, ax7}} we first take the 2-ary core ax4. Removing the MIPS containing this axiom leaves the MIPS {ax1, ax2}. Hence, two pinpoints can be defined: {ax4, ax1} and {ax4, ax2}.
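The greedy construction is easy to express in code. The sketch below picks, in each round, an axiom occurring in the most remaining MIPS and discards all MIPS containing it; ties (here between ax1 and ax2) are broken arbitrarily, which is why T1 admits two pinpoints.

from collections import Counter

def pinpoint(mips_list):
    # Greedy pinpoint construction: pick an axiom occurring in the largest
    # number of remaining MIPS, drop every MIPS containing it, repeat.
    remaining = [set(m) for m in mips_list]
    chosen = []
    while remaining:
        counts = Counter(ax for m in remaining for ax in m)
        axiom, _ = counts.most_common(1)[0]
        chosen.append(axiom)
        remaining = [m for m in remaining if axiom not in m]
    return chosen

mips_T1 = [{"ax1", "ax2"}, {"ax3", "ax4", "ax5"}, {"ax4", "ax7"}]
print(pinpoint(mips_T1))      # e.g. ['ax4', 'ax1'] or ['ax4', 'ax2']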

A note on terminology  In the remainder of this report we will often refer to the calculation of MIPS and MUPS as the process of debugging, as opposed to the diagnosis process, which refers to the calculation of diagnoses in the proper sense of the word.


Chapter 3

Algorithms for debugging and diagnosis

3.1 Two algorithms for debugging

We present two general approaches to calculate explanations: a top-down method, which reduces the reasoning into smaller parts in order to explain a subproblem with reduced complexity, and an informed bottom-up approach, which enumerates possible solutions in a clever way. Both approaches will be presented for terminological reasoning only, but can, in principle, easily be extended to full ontology debugging.¹

3.1.1 A top-down approach to explanation

In order to calculate minimal incoherence-preserving sub-terminologies (MIPS) we first calculate the minimal unsatisfiability-preserving sub-terminologies (MUPS) for each unsatisfiable concept. This is done in a top-down way: we calculate the set of axioms needed to maintain a logical contradiction by expanding a logical tableau with labels. This method is efficient, as it requires a single logical calculation per unsatisfiable concept. On the other hand, it is based on a variation of a specialised logical algorithm, and only works for Description Logics for which such a specialised algorithm exists and is implemented. At the moment, such a purpose-built method only exists for the DL ALC, and a restricted type of TBoxes, namely unfoldable ones.

Debugging unfoldable ALC-TBoxes  Practical experience has shown that applying our methods on a simplified version of DICE can already provide valuable debugging information. We will therefore only provide algorithms for unfoldable ALC-TBoxes [19] as

¹ The extension of the bottom-up method is trivial, as we only have to define a new selection function and systematic enumeration. For the top-down approach things are a bit more complicated, as we now have to analyse forests rather than trees. However, this seems to be a technical rather than a conceptual problem.


(⊓): if (a : C1 ⊓ C2)label ∈ B, but not both (a : C1)label ∈ B and (a : C2)label ∈ B,
     then B′ := B ∪ {(a : C1)label, (a : C2)label}.

(⊔): if (a : C1 ⊔ C2)label ∈ B, but neither (a : C1)label ∈ B nor (a : C2)label ∈ B,
     then B′ := B ∪ {(a : C1)label} and B″ := B ∪ {(a : C2)label}.

(Ax): if (a : A)label ∈ B and (A ⊑ C) ∈ T,
     then B′ := B ∪ {(a : C)label∪{A⊑C}}.

(∃): if (a : ∃Ri.C)label ∈ B, Ri ∈ NR, all other rules have been applied on all formulas over a, and {(a : ∀Ri.C1)label1, ..., (a : ∀Ri.Cn)labeln} ⊆ B is the set of universal formulas for a w.r.t. Ri in B,
     then B′ := {(b : C)label, (b : C1)label1∪label, ..., (b : Cn)labeln∪label}, where b is a new individual name not occurring in B.

Figure 3.1: Tableau rules for ALC-satisfiability w.r.t. a TBox T (with labels).

this significantly improves both the computational properties and the readability of the algorithm.

The calculation of MIPS depends on the MUPS only, and we will provide an algorithm to calculate these minimal unsatisfiability-preserving sub-TBoxes based on Boolean minimisation of the terminological axioms needed to close a standard tableau ([1] Chapter 2).

Usually, unsatisfiability of a concept is detected with a fully saturated tableau (expanded with rules similar to those in Figure 3.1) where all branches contain a contradiction (or close, as we say). The information which axioms are relevant for the closure is contained in a simple label which is added to each formula in a branch. A labelled formula has the form (a : C)x, where a is an individual name, C a concept and x a set of axioms, which we will refer to as a label. A labelled branch is a set of labelled formulas and a tableau is a set of labelled branches. A formula can occur with different labels on the same branch. A branch is closed if it contains a clash, i.e. if there is at least one pair of formulas with contradictory atoms on the same individual. The notions of open branch and closed and open tableau are defined as usual and do not depend on the labels. We will always assume that any formula is in negation normal form (nnf) and newly created formulas are immediately transformed. We usually omit the prefix "labelled".

To calculate a minimal unsatisfiability-preserving TBox for a concept name A w.r.t. an unfoldable TBox T we construct a tableau from a branch B initially containing only (a : A)∅ (for a new individual name a) by applying the rules in Figure 3.1 as long as possible. The rules are standard ALC-tableau rules with lazy unfolding, and have to be read as follows: assume that there is a tableau T = {B, B1, ..., Bn} with n+1 branches. Application of one of the rules on B yields the tableau T′ := {B′, B1, ..., Bn} for the (⊓), (∃) and (Ax)-rule, and T″ := {B′, B″, B1, ..., Bn} in case of the (⊔)-rule.

Once no more rules can be applied, we know which atoms are needed to close a saturated branch and can construct a minimisation function for A and T according to the rules in Figure 3.2. A propositional formula φ is called a minimisation function for A and T if A is unsatisfiable in every subset of T containing the axioms which are true in


if rule = (⊓) has been applied to (a : C1 ⊓ C2)label and B′ is the new branch,
    return min_function(a, B′, T);
if rule = (⊔) has been applied to (a : C1 ⊔ C2)label and B′ and B″ are the new branches,
    return min_function(a, B′, T) ∧ min_function(a, B″, T);
if rule = (∃) has been applied to (a : ∃R.C)label, B′ is the new branch and b the new variable,
    return min_function(a, B′, T) ∨ min_function(b, B′, T);
if rule = (Ax) has been applied and B′ is the new branch,
    return min_function(a, B′, T);
if no further rule can be applied,
    return ⋁_{(a:A)x ∈ B, (a:¬A)y ∈ B} ( ⋀_{ax∈x} ax ∧ ⋀_{ax∈y} ax );

Figure 3.2: min_function(a, B, T): minimisation function for the MUPS problem.

an assignment making φ true. In our case axioms are used as propositional variables in φ. As we can identify unsatisfiability of A w.r.t. a set S of axioms with a closed tableau using only the axioms in S for unfolding, branching on a disjunctive rule implies that we need to join the functions of the appropriate sub-branches conjunctively. If an existential rule has been applied, the new branch B′ might not necessarily be closed on formulas for both individuals. Assume that B′ closes on the individual a but not on b. In this case min_function(b, B′, T) = ⊥, which means that the related disjunct does not influence the calculation of the minimal incoherent TBox.

Based on the minimisation function min_function(a, {(a : A)∅}, T) (let us call it φ), which we calculated using the rules in Figure 3.2, we can now calculate the MUPS for A w.r.t. T. The idea is to use prime implicants of φ. A prime implicant ax1 ∧ ... ∧ axn is the smallest conjunction of literals² implying φ [21]. As φ is a minimisation function, every implicant of φ must be a minimisation function as well, and therefore also the prime implicant. But this implies that the concept A must be unsatisfiable w.r.t. the set of axioms {ax1, ..., axn}. As ax1 ∧ ... ∧ axn is the smallest implicant, we also know that {ax1, ..., axn} must be minimal, i.e. a MUPS.
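As a small worked example (following the rules in Figures 3.1 and 3.2 on the TBox of Table 2.1), consider A7 in T1: unfolding ax7 gives (a : A4 ⊓ ∃s.¬B)ax7, unfolding ax4 adds (a : ∀s.B){ax4,ax7}, and the (∃)-rule then creates an individual b with (b : ¬B)ax7 and (b : B){ax4,ax7}. The only clash is on b, so the minimisation function is φ = ax4 ∧ ax7, whose single prime implicant yields the MUPS {ax4, ax7}, in agreement with mups(T1, A7) from Chapter 2.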

From MUPS we can easily calculate MIPS, but we need an additional operation on sets of TBoxes, called subset-reduction. Let M = {T1, ..., Tm} be a set of TBoxes. The subset-reduction of M is the smallest subset sr(M) ⊆ M such that for all T ∈ M there is a set T′ ∈ sr(M) such that T′ ⊆ T.

Let T be an incoherent TBox with unsatisfiable concepts ∆T. A simple algorithm for the calculation of MIPS for T is then defined through the following equation: mips(T) = sr(⋃_{A∈∆T} mups(T, A)). Checking elements of mips(T) for cores of maximal arity requires exponentially many checks in the size of mips(T). In practice, we therefore apply a bottom-up method searching for maximal cores of increasing size, stopping once the arity of the cores is smaller than 2.
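The subset-reduction sr(M) and the MIPS equation above can be mirrored in a few lines of Python. The sketch below keeps the ⊆-minimal elements of a collection of axiom sets and, applied to the six MUPS of T1 from Chapter 2, returns exactly the three MIPS.

def subset_reduction(tbox_sets):
    # sr(M): keep only the subset-minimal elements of M.
    sets = [frozenset(t) for t in tbox_sets]
    return {t for t in sets if not any(other < t for other in sets)}

mups_T1 = [frozenset(s) for s in (
    {"ax1", "ax2"}, {"ax1", "ax3", "ax4", "ax5"},
    {"ax3", "ax4", "ax5"},
    {"ax1", "ax2", "ax4", "ax6"}, {"ax1", "ax3", "ax4", "ax5", "ax6"},
    {"ax4", "ax7"},
)]
print(subset_reduction(mups_T1))   # the three MIPS of T1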

² Note that in our case all literals are non-negated axioms.


3.1.2 An Informed Bottom-up Approach to Explanation

In this section we propose an informed bottom-up approach to calculate MUPS with the support of an external DL reasoner, like RACER. The main advantage of this approach is that it can deal with any DL-based ontology that is supported by an external reasoner. Currently there exist several well-known DL reasoners, like RACER and FaCT++. These external DL reasoners have proved to be very reliable and stable, and they already support various DL-based ontology languages, including OWL. Thus, by the bottom-up approach we can obtain an OWL debugger almost for free, although the price paid is performance.

Given an unsatisfiable concept c and a formula set (i.e., an ontology) T, we can calculate the MUPS of c by selecting a minimal subset Σ of T in which c is unsatisfiable. We use a selection procedure similar to the one used in the system PION for reasoning with inconsistent ontologies [14]. In the PION approach, a selection function is designed to extend the selected subset by checking which axioms are relevant to the currently selected subset, which starts initially with a query. Although an approach based on this kind of relevance-extension procedure may not give us the complete solution set of MUPS/MIPS, it is good enough to provide an efficient approach for debugging inconsistent ontologies. The evaluation of this informed bottom-up approach is reported in this deliverable (D3.6.2).

Selection Function and Relevance Measure  Given an ontology (i.e., a formula set) Σ and a query φ, a selection function s is one which returns a subset of Σ at each step k > 0. Let L be the ontology language, which is denoted as a formula set. We give the general definition of selection functions as follows:

Definition 3.1.1 (Selection Functions)  A selection function s is a mapping s : P(L) × L × N → P(L) such that s(Σ, φ, k) ⊆ Σ.

In this approach, we extend the definition of the selection function so that it starts from a concept c instead of from a query (i.e., a formula φ). As we have discussed above, selection functions are usually defined by a relevance relation between a formula and a formula set. We will use a relevance relation as the informed message to guide the search strategy for MUPS.

Definition 3.1.2 (Direct Relevance Relation)  A direct relevance relation R is a set of formula pairs, i.e., R ⊆ L × L.

Definition 3.1.3 (Direct Relevance Relation between a Formula and a Formula Set)  Given a direct relevance relation R, we can extend it to a relation R⁺ on a formula and a formula set, i.e., R⁺ ⊆ L × P(L), as follows:

⟨φ, Σ⟩ ∈ R⁺ iff there exists a formula ψ ∈ Σ such that ⟨φ, ψ⟩ ∈ R.


We have implemented a prototype of the informed bottom-up approach. The prototype is called DION, which stands for a Debugger of Inconsistent ONtologies. DION uses the DIG data format as its internal data representation format. Therefore, in the following, we define a direct relevance relation which is based on the ontology language DIG.

In DIG, concept axioms have only the following three forms: impliesc(C1, C2), equalc(C1, C2), and disjoint(C1, ..., Cn), which correspond to the concept implication statement, the concept equivalence statement, and the concept disjointness statement, respectively.

Given a formula φ, we use C(φ) to denote the set of concept names that appear in the formula φ.

Definition 3.1.4 (Direct concept-relevance)  An axiom φ is directly concept-relevant to a formula ψ, written SynConRel(φ, ψ), iff
(i) C1 ∈ C(ψ) if the formula φ has the form impliesc(C1, C2),
(ii) C1 ∈ C(ψ) or C2 ∈ C(ψ) if the formula φ has the form equalc(C1, C2),
(iii) C1 ∈ C(ψ) or ... or Cn ∈ C(ψ) if the formula φ has the form disjoint(C1, ..., Cn).

Definition 3.1.5 (Direct concept-relevance to a set)  A formula φ is concept-relevant to a formula set Σ iff there exists a formula ψ ∈ Σ such that φ and ψ are directly concept-relevant.

For a terminology T and a concept c, we can define a selection function s in terms of the direct concept relevance as follows:

Definition 3.1.6 (Selection function on concept relevance)
(i) s(T, c, 0) = {ψ | ψ ∈ T and ψ is directly concept-relevant to c};
(ii) s(T, c, k) = {ψ | ψ ∈ T and ψ is directly concept-relevant to s(T, c, k − 1)} for k > 0.

In order to do so, we extend the definition of direct concept relevance so that we can say that an axiom ψ is directly concept-relevant to a concept c, i.e., SynConRel(ψ, c). It is easy to see that this does not change the definition of the direct relevance relation.
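The following Python sketch illustrates Definitions 3.1.4-3.1.6 on DIG-style concept axioms. The tuple encoding of axioms (("impliesc", C1, C2) and so on) and the helper names are illustrative assumptions, not DION's actual data structures or the DIG XML syntax.

def concept_names(formula):
    # C(φ): the concept names occurring in a formula.  Formulas are encoded
    # here as plain tuples, e.g. ("impliesc", "Brain", "CNS") or
    # ("disjoint", "NervousSystem", "BodyPart"); a bare string stands for a
    # concept used as the starting point of the selection.
    if isinstance(formula, str):
        return {formula}
    return set(formula[1:])

def directly_relevant(axiom, target):
    # SynConRel of Definition 3.1.4, extended to concepts as in the text.
    if axiom[0] == "impliesc":               # only the left-hand side counts
        return axiom[1] in concept_names(target)
    return bool(set(axiom[1:]) & concept_names(target))   # equalc / disjoint

def select(tbox, concept, k):
    # s(T, c, k) of Definition 3.1.6, computed iteratively.
    selected = {ax for ax in tbox if directly_relevant(ax, concept)}
    for _ in range(k):
        selected = {ax for ax in tbox
                    if any(directly_relevant(ax, psi) for psi in selected)}
    return selected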

Algorithms  We use an informed bottom-up approach to obtain MUPS. In logics and computer science, an increment-reduction strategy is usually used to find minimal inconsistent sets [8]. Under this approach, the algorithm first finds a set of inconsistent sets, then removes the redundant axioms from those sets. Similarly, a heuristic procedure for finding MUPS consists of the following three stages:


Algorithm 3.1: Algorithm for mups(T, c)

k := 0
mups(T, c) := ∅
repeat
    k := k + 1
until c unsatisfiable in s(T, c, k)
Σ := s(T, c, k) − s(T, c, k − 1)
for all Σ′ ∈ P(Σ) do
    for all φ ∈ Σ \ Σ′ do
        if Σ′ ∪ {φ} ∉ mups(T, c) then
            Σ″ := s(T, c, k − 1) ∪ Σ′
            if c satisfiable in Σ″ and c unsatisfiable in Σ″ ∪ {φ} then
                mups(T, c) := mups(T, c) ∪ {Σ″ ∪ {φ}}
            end if
        end if
    end for
end for
mups(T, c) := MinimalityChecking(mups(T, c))
return mups(T, c)

• Heuristic Extension: Using a relevance-based selection function to find two subsets Σ and S such that a concept c is satisfiable in S and unsatisfiable in S ∪ Σ.

• Enumeration Processing: Enumerating subsets Σ′ of Σ to obtain sets S ∪ Σ′ in which the concept c is unsatisfiable. We call those sets c-unsatisfiable sets.

• Minimality Checking: Removing redundant axioms from those c-unsatisfiable sets to get MUPSs.

The following is an algorithm for MUPSs. The algorithm first finds two subsets Σ and S of T. Compared with T, the set Σ is relatively small. The algorithm then tries to exhaust the power set of Σ to get c-unsatisfiable sets. Finally, by the minimality checking it obtains the MUPSs. We define the minimality checking as a sub-procedure, as shown in Algorithm 3.2.

The complexity of Algorithm 3.1 is exponential in |Σ|. Although Σ is much smaller than T, this is still problematic in the implementation. One improvement is to do pruning. We can check the subsets of Σ with increasing cardinality, i.e., we always pick the subsets with smaller cardinality first (the power set of Σ is sorted). First, set the cardinality n = 1, i.e., pick just one axiom φ in Σ, and check whether c is unsatisfiable in {φ} ∪ s(T, c, k − 1). If 'yes', then ignore any superset S such that {φ} ⊂ S. After all subsets with cardinality n have been checked, increase n by 1. Moreover, we can do the checking and pruning while the power set is being built.
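The cardinality-ordered pruning can be sketched as follows; the is_unsatisfiable(concept, axioms) callable is again a placeholder for a call to an external DL reasoner, and the function only illustrates the pruning idea, not Algorithm 3.3 itself.

from itertools import combinations

def unsat_sets_with_pruning(sigma, base, concept, is_unsatisfiable):
    # sigma: the set Σ of newly selected axioms; base: s(T, c, k-1).
    # Subsets of Σ are checked in order of increasing cardinality, and
    # supersets of a subset that already makes `concept` unsatisfiable
    # (together with `base`) are skipped.
    hits = []
    for n in range(1, len(sigma) + 1):
        for subset in combinations(sorted(sigma), n):
            s = set(subset)
            if any(h <= s for h in hits):        # pruned
                continue
            if is_unsatisfiable(concept, set(base) | s):
                hits.append(s)
    return [set(base) | s for s in hits]          # c-unsatisfiable sets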


Algorithm 3.2: Algorithm for the minimality checking on mups(T, c)

for all Σ ∈ mups(T, c) do
    Σ′ := Σ
    for all φ ∈ Σ′ do
        if c unsatisfiable in Σ′ − {φ} then
            Σ′ := Σ′ − {φ}
        end if
    end for
    mups(T, c) := mups(T, c) \ {Σ} ∪ {Σ′}
end for
return mups(T, c)

That leads to Algorithm 3.3, in which we use the set S to record the c-satisfiable subsets.

Proposition 3.1.1 (Soundness of the MUPS algorithms)  The algorithms for MUPS above are sound, i.e., they always return MUPSs.

PROOF. It is easy to see that the concept c is always unsatisfiable in any element S of the set mups(T, c); otherwise S would never have been added to the set. The minimality condition is achieved by the minimality-checking procedure. Therefore, the algorithms for MUPSs are sound. □

Take our running example. To calculate mups(T1, A1), the algorithm first gets the set Σ = {ax2} = {ax1, ax2} − {ax1}. Thus, mups(T1, A1) = {{ax1, ax2}}. We can see that the algorithm cannot find that S1 = {ax1, ax3, ax4, ax5} ∈ mups(T1, A1). However, this does not change the resulting MIPS, because we have a subset {ax3, ax4, ax5} of S1, which will be in mups(T1, A3). To calculate mups(T1, A6), the algorithm first gets the set Σ = {ax2, ax5} = {ax1, ax2, ax3, ax4, ax5, ax6} − {ax1, ax3, ax4, ax6}. Thus, mups(T1, A6) = {{ax1, ax3, ax4, ax5, ax6}}. Again, the algorithm cannot find that {ax1, ax2, ax4, ax6} ∈ mups(T1, A6). However, this does not affect the MIPS, because {ax1, ax2} ∈ mups(T1, A1). Therefore, the algorithms for MUPSs proposed above are sound, but not complete. As we have argued above, this informed bottom-up approach is efficient for inconsistent ontology diagnosis.

Sometimes it is useful to find just a single MUPS by using the relevance relation without referring to a selection function. Algorithm 3.4 uses the increment-reduction strategy to find a single MUPS for an unsatisfiable concept, without a selection function. The algorithm first finds a subset of the ontology in which the concept is unsatisfiable, then removes the redundant axioms from the subset.


Algorithm 3.3: Algorithm for mups(T, c) with pruning

k := 0
mups(T, c) := ∅
repeat
    k := k + 1
until c unsatisfiable in s(T, c, k)
Σ := s(T, c, k) − s(T, c, k − 1)
S := {s(T, c, k − 1)}
for all φ ∈ Σ do
    for all S′ ∈ S do
        if c satisfiable in S′ ∪ {φ} and S′ ∪ {φ} ∉ S then
            S := S ∪ {S′ ∪ {φ}}
        end if
        if c unsatisfiable in S′ ∪ {φ} and S′ ∪ {φ} ∉ mups(T, c) then
            mups(T, c) := mups(T, c) ∪ {S′ ∪ {φ}}
        end if
    end for
end for
mups(T, c) := MinimalityChecking(mups(T, c))
return mups(T, c)

We can obtain MUPSs for all unsatisfiable concepts. Based on those MUPSs, we can further calculate MIPS, cores, and pinpoints, using the standard algorithms which have been discussed in the previous chapter. These data can be used by knowledge workers to repair the ontology to avoid unsatisfiable concepts [25, 26].

3.2 Calculating terminological diagnoses

Terminological diagnosis, as defined in [24], is an instance of Reiter's diagnosis from first principles. Therefore, we can use Reiter's algorithms to calculate terminological diagnoses. What is required is a method to produce conflict sets, and we will discuss three different options for this. Let us first recall the basic methodology from [24]. Here, we simplify the technical details in order to make the presentation slightly more intuitive.

Given an incoherent terminology T, a conflict set is any incoherent subset of T.³

Then, Reiter showed that a set ∆ ⊆ T is a diagnosis for an incoherent terminology iff ∆ is a minimal set such that T \ ∆ is not a conflict set for T.

The basic idea to calculate diagnoses from conflict sets is based on minimal hitting

³ A related notion can be defined for unsatisfiability of concepts w.r.t. a terminology. For readability reasons, we restrict ourselves to diagnosis of incoherence in terminologies.


Algorithm 3.4: Finding a MUPS for T in which a concept c is unsatisfiable

  Σ := ∅
  repeat
    for all φ1 ∈ T \ Σ do
      if SynConcRel(φ1, c) or there is a φ2 ∈ Σ such that SynConRel(φ1, φ2) then
        Σ := Σ ∪ {φ1}
      end if
    end for
  until c is unsatisfiable in Σ
  for all φ ∈ Σ do
    if c is unsatisfiable in Σ − {φ} then
      Σ := Σ − {φ}
    end if
  end for
  return Σ
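To make the increment-reduction strategy concrete, here is a minimal Python sketch of the same idea. It assumes two oracles that are not defined here: unsatisfiable(concept, axioms), wrapping a DL reasoner, and relevant(axiom, item), standing in for the syntactic relevance tests SynConcRel/SynConRel; both names are hypothetical.

    def single_mups(tbox, concept, unsatisfiable, relevant):
        """Increment-reduction sketch: grow a relevant subset of the TBox
        (given as a set of axioms) until the concept is unsatisfiable in it,
        then drop redundant axioms."""
        sigma = set()
        # Increment phase: add axioms relevant to the concept or to axioms in sigma.
        while not unsatisfiable(concept, sigma):
            new = {phi for phi in tbox - sigma
                   if relevant(phi, concept) or any(relevant(phi, psi) for psi in sigma)}
            if not new:
                return None   # safeguard: the concept never becomes unsatisfiable
            sigma |= new
        # Reduction phase: remove every axiom not needed for the unsatisfiability.
        for phi in list(sigma):
            if unsatisfiable(concept, sigma - {phi}):
                sigma.discard(phi)
        return sigma

The safeguard returning None is an addition for the case that the concept is satisfiable even in the whole TBox; Algorithm 3.4 presupposes that it is not.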

Suppose C is a collection of sets. A hitting set for C is a set H ⊆ ⋃_{S∈C} S such that H ∩ S ≠ ∅ for each S ∈ C. A hitting set is minimal for C iff no proper subset of it is a hitting set for C.

This gives the basis of Reiter's approach to calculating diagnoses, via the following theorem, which is a direct consequence of Corollary 4.5 in [22].

Theorem 3.2.1 A set ∆ ⊆ T is a diagnosis for an incoherent terminology T iff ∆ is a minimal hitting set for the collection of conflict sets for T.

To calculate minimal hitting sets Reiter introduces hitting set trees (HS-trees). For a collection C of sets, an HS-tree T is the smallest edge-labeled and node-labeled tree such that the root is labeled by ✓ if C is empty; otherwise it is labeled with any set in C. For each node n in T, let H(n) be the set of edge labels on the path in T from the root to n. The label for n is any set S ∈ C such that S ∩ H(n) = ∅, if such a set exists. If n is labeled by a set S, then for each σ ∈ S, n has a successor nσ, joined to n by an edge labeled by σ. For any node labeled by ✓, H(n), i.e. the set of labels on its path from the root, is a hitting set for C.
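The following Python sketch illustrates the construction in its simplest form: a breadth-first HS-tree without Reiter's label-reuse and pruning optimisations, and with the conflict sets given up front instead of being generated on demand by a reasoner. The function name and representation are illustrative only.

    from collections import deque

    def minimal_hitting_sets(conflict_sets):
        """Breadth-first HS-tree exploration; returns all minimal hitting sets."""
        conflict_sets = [frozenset(s) for s in conflict_sets]
        results = []
        queue = deque([frozenset()])          # H(n) of the nodes still to expand
        seen = {frozenset()}
        while queue:
            path = queue.popleft()
            # label: any conflict set not yet hit by the current path
            label = next((s for s in conflict_sets if not (s & path)), None)
            if label is None:
                # the path hits every conflict set; keep it only if it is minimal
                if not any(r <= path for r in results):
                    results.append(path)
                continue
            for sigma in label:               # one successor per element of the label
                child = path | {sigma}
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return results

    # The collection of Figure 3.3 (axiom ax_i written as the number i):
    C = [{1, 2, 3, 4, 5, 6}, {3, 4, 5}, {1, 2, 4, 6}, {1, 2}, {4, 7}]
    print(minimal_hitting_sets(C))            # e.g. {1, 4}, {2, 4}, {1, 3, 7}, ...

Breadth-first exploration guarantees that smaller hitting sets are found before their supersets, so the minimality test against the results found so far suffices. For a terminological diagnosis problem, the fixed list of conflict sets would instead be replaced by calls to a DL reasoner.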

Figure 3.3 shows an HS-tree T for the collection C = {{1, 2, 3, 4, 5, 6}, {3, 4, 5}, {1, 2, 4, 6}, {1, 2}, {4, 7}} of sets. T is created breadth-first, starting with the root node n0 labeled with {1, 2, 3, 4, 5, 6}. For diagnostic problems the sets in the collection are conflict sets, which are created on demand. In our case, conflict sets for a terminological diagnosis problem can be calculated by a standard DL engine (by definition each incoherent subset of T is a conflict set).

These calls are computationally expensive, which means that we have to minimize them. In Figure 3.3, the nodes whose labels were created by calls to the prover are boxed. T reuses already calculated and smallest possible labels, and is pruned in a variety of ways, which are defined in detail in [22]. For example, node n0 is relabeled with a subset {3, 4, 5} of its label. We denote by 1× that element 1 is deleted; note that no successor has to be created for this element. Node n6 has been automatically labeled with {4, 7}, because its path H(n6) = {2, 3} has an empty intersection with this already existing conflict set in the tree.

Three ways of implementing diagnosis The generality of Reiter's algorithm has the advantage of giving some leeway for particular methodological choices. We implemented three ways of calculating conflict sets.

1. Use an optimized DL reasoner to return a conflict set in each step of the creation of the HS-tree. The only way to get conflict sets for an incoherent TBox T is to return T itself, i.e. the maximal conflict set.

2. Use an adapted DL reasoner to return small conflict sets, which it can derive from the clashes in a tableau proof.

3. Use a specialized method to return minimal conflict sets, e.g., using the algorithms of [25].

Diagnosis with maximal conflict sets The most general way to calculate terminological diagnoses based on hitting sets is to use one of the state-of-the-art optimized DL reasoners. The advantage is obvious: the expressiveness of the diagnosis is only restricted by the expressiveness of the DL reasoning implemented in the reasoner. We use RACER, which allows us to diagnose incoherent terminologies up to SHIQ without restriction on the structure of the TBox. The algorithm to use RACER is simple: if T is incoherent, return T, otherwise return ∅. As RACER is highly optimized we can expect to get the maximal conflict sets efficiently.
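Expressed as a conflict-set oracle that could be plugged into the hitting-set sketch above, this naive strategy is just a few lines (the is_coherent callable wrapping the reasoner is an assumption of the sketch):

    def maximal_conflict_set(tbox, is_coherent):
        """Naive conflict-set oracle: the whole TBox if it is incoherent, ∅ otherwise."""
        return set() if is_coherent(tbox) else set(tbox)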

The disadvantage of this naive approach is that the conflict sets are huge, and even with reuse of node labels and pruning, the HS-tree quickly becomes too large to handle. Take the incoherent TBox T∗, where the related HS-tree already has 380 nodes and needs 67 calls to RACER. We will see that the price we pay for the gain in expressiveness is too high, and that smaller conflict sets are required.

Diagnosis with small conflict sets The disadvantage of using a DL reasoner as a black box is that it does not provide any information on which components contribute to the incoherence; technically, this means which axioms contribute to the closure of the tableau. To show that already a straightforward collection of clash-enforcing axioms can dramatically improve the efficiency of diagnosis, we implemented a simple tableau calculus for unfoldable ALC TBoxes. This reasoner returns an unordered, and not necessarily minimal, list of axioms which are (indirectly) responsible for the clashes in the tableau. The basic idea is to label each formula with a set of axioms, which are added to a formula in the tableau whenever they are used to "unfold" a defined concept. This algorithm is not optimized, but it returns small conflict sets, and the sizes of the HS-trees decrease dramatically. Figure 3.3 shows the hitting tree for the incoherence problem for T∗ where small conflict sets have been collected from tableau proofs. Compared to the previous method, only 14 nodes were created and 11 calls to the DL reasoner were necessary.

Figure 3.3: HS-Tree with small conflict sets

Diagnosis with minimal conflict sets Previously, we recalled the notion of minimal unsatisfiability (and incoherence) preserving sub-terminologies, MUPS and MIPS, which were introduced in [25] for the debugging of terminologies.

The MUPSs of an incoherent terminology T and an unsatisfiable concept A are the minimal conflict sets of this unsatisfiability problem. It is easily checked that each MUPS in {{ax1, ax2}, {ax1, ax3, ax4, ax5}} for A1 and T∗ is indeed a minimal conflict set. This time, only 12 nodes were created. Based on the MUPS, it is straightforward to calculate MIPS, which are the minimal conflict sets for the incoherence problem.

Proposition 3.2.1 The MIPS of an incoherent terminology T are the minimal conflict sets for this incoherence problem.

Figure 3.4 shows the hitting tree for the incoherence problem of T∗ where minimal conflict sets have been calculated as MIPS.4

4 An axiom axi is represented by the number i.


Figure 3.4: HS-Tree with minimal conflict sets


Chapter 4

A qualitative evaluation of the framework

4.1 Two motivating examples

To demonstrate the use of our debugging approach, we first take a well-described ontology. We use the Pizza ontology from the Protege OWL tutorial1. Next, we will discuss the application of our approach to an extended version of the Pizza ontology, which has a larger number of unsatisfiable concepts.

Pizza ontology This Pizza ontology purposely contains two unsatisfiable concepts, IceCream and CheeseyVegetableTopping. Whereas the causes of the unsatisfiability are realistic, this ontology is a special case because the unsatisfiable concepts are fully unrelated, and the unsatisfiability does not propagate to other concepts.

mups(Pizza, IceCream) = { IceCream ⊑ DomainConcept ⊓ ∃ hasTopping.FruitTopping, disjoint(IceCream, Pizza), role hasTopping: domain Pizza }

mups(Pizza, CheeseyVegetableTopping) = { CheeseyVegetableTopping ⊑ CheeseTopping ⊓ VegetableTopping, disjoint(CheeseTopping, VegetableTopping) }

As the MUPS for both concepts contain exactly one set of axioms, the diagnoses for unsatisfiability of the concepts are the individual axioms:

∆Pizza,IceCream = { {IceCream ⊑ DomainConcept ⊓ ∃ hasTopping.FruitTopping}, {disjoint(IceCream, Pizza)}, {role hasTopping: domain Pizza} }

∆Pizza,CheeseyVegetableTopping = { {CheeseyVegetableTopping ⊑ CheeseTopping ⊓ VegetableTopping}, {disjoint(CheeseTopping, VegetableTopping)} }

It can be easily determined that

1 http://www.co-ode.org/ontologies/pizza/2005/05/16/


mips(Pizza) = { {IceCream ⊑ DomainConcept ⊓ ∃ hasTopping.FruitTopping, disjoint(IceCream, Pizza), role hasTopping: domain Pizza}, {CheeseyVegetableTopping ⊑ CheeseTopping ⊓ VegetableTopping, disjoint(CheeseTopping, VegetableTopping)} }

and

∆Pizza = { {IceCream ⊑ DomainConcept ⊓ ∃ hasTopping.FruitTopping, CheeseyVegetableTopping ⊑ CheeseTopping ⊓ VegetableTopping}, {IceCream ⊑ DomainConcept ⊓ ∃ hasTopping.FruitTopping, disjoint(CheeseTopping, VegetableTopping)}, {disjoint(IceCream, Pizza), CheeseyVegetableTopping ⊑ CheeseTopping ⊓ VegetableTopping}, {disjoint(IceCream, Pizza), disjoint(CheeseTopping, VegetableTopping)}, {role hasTopping: domain Pizza, CheeseyVegetableTopping ⊑ CheeseTopping ⊓ VegetableTopping}, {role hasTopping: domain Pizza, disjoint(CheeseTopping, VegetableTopping)} }
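These six diagnoses are exactly the minimal hitting sets of the two MIPS. Reusing the hypothetical minimal_hitting_sets sketch from Chapter 3, with illustrative string labels for the five axioms involved, this can be reproduced as follows:

    ICE_DEF  = "IceCream ⊑ DomainConcept ⊓ ∃hasTopping.FruitTopping"
    ICE_DISJ = "disjoint(IceCream, Pizza)"
    DOMAIN   = "role hasTopping: domain Pizza"
    CVT_DEF  = "CheeseyVegetableTopping ⊑ CheeseTopping ⊓ VegetableTopping"
    CT_VT    = "disjoint(CheeseTopping, VegetableTopping)"

    mips_pizza = [{ICE_DEF, ICE_DISJ, DOMAIN}, {CVT_DEF, CT_VT}]

    # One axiom from each MIPS: 3 x 2 = 6 minimal hitting sets = 6 diagnoses.
    diagnoses = minimal_hitting_sets(mips_pizza)
    assert len(diagnoses) == 6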

For the generalized MIPS, only the definition of IceCream can be generalized; the other definitions remain unchanged.

gmips(Pizza) = { {IceCream′ ⊑ ∃ hasTopping.FruitTopping′, disjoint(IceCream′, Pizza′), role hasTopping: domain Pizza′}, {CheeseyVegetableTopping′ ⊑ CheeseTopping′ ⊓ VegetableTopping′, disjoint(CheeseTopping′, VegetableTopping′)} }

As all MIPS are equivalent to MUPS, they all have a MIPS-weight of 1.

As there are no axioms that occur in both of the MIPS, there are only MIPS-cores of arity 1. As a result, any diagnosis is also a pinpoint.

This example shows that for this ontology the MUPS are a useful reduction of the ontology as a whole to small sets of axioms. The other measures are of limited interest, due to the isolation of the unsatisfiable concepts and the absence of propagation of unsatisfiability.

Extended Pizza ontology Suppose that within the Pizza ontology also five subsumees of IceCream were defined, IceCream1 . . . IceCream5, and three subsumees of Pizza, CheeseyVegetablePizza1 . . . CheeseyVegetablePizza3, the latter defined as CheeseyVegetablePizzai ⊑ Pizza ⊓ ∃ hasTopping.CheeseTopping ⊓ ∀ hasTopping.VegetableTopping.

Now the resulting ontology, which we will call Pizza′, has 10 unsatisfiable concepts: IceCream, CheeseyVegetableTopping, plus the eight newly defined concepts.

Then the following MUPS are added:

mups(Pizza′, IceCreami) = { IceCreami ⊑ IceCream, IceCream ⊑ DomainConcept ⊓ ∃ hasTopping.FruitTopping, disjoint(IceCream, Pizza), role hasTopping: domain Pizza }

mups(Pizza′, CheeseyVegetablePizzai) = { CheeseyVegetablePizzai ⊑ Pizza ⊓ ∃ hasTopping.CheeseTopping ⊓ ∀ hasTopping.VegetableTopping, disjoint(CheeseTopping, VegetableTopping) }

Additional MIPS can now be calculated for the extended ontology. The mips for the new ontology is:


mips(Pizza′) = {
{IceCream ⊑ DomainConcept ⊓ ∃ hasTopping.FruitTopping, disjoint(IceCream, Pizza), role hasTopping: domain Pizza},
{CheeseyVegetableTopping ⊑ CheeseTopping ⊓ VegetableTopping, disjoint(CheeseTopping, VegetableTopping)},
{CheeseyVegetablePizza1 ⊑ Pizza ⊓ ∃ hasTopping.CheeseTopping ⊓ ∀ hasTopping.VegetableTopping, disjoint(CheeseTopping, VegetableTopping)},
{CheeseyVegetablePizza2 ⊑ Pizza ⊓ ∃ hasTopping.CheeseTopping ⊓ ∀ hasTopping.VegetableTopping, disjoint(CheeseTopping, VegetableTopping)},
{CheeseyVegetablePizza3 ⊑ Pizza ⊓ ∃ hasTopping.CheeseTopping ⊓ ∀ hasTopping.VegetableTopping, disjoint(CheeseTopping, VegetableTopping)} }

For the sake of brevity, we will not present the diagnoses and the generalized mips for this example.

The MIPS-weight for {IceCream ⊑ DomainConcept ⊓ ∃ hasTopping.FruitTopping, disjoint(IceCream, Pizza), role hasTopping: domain Pizza} is now 6. The other MIPS have a weight of 1.

The core with the highest arity is now the axiom "disjoint(CheeseTopping, VegetableTopping)", which has arity 4.

This core is used as the starting point for constructing a pinpoint. We can combine this core with an axiom from the first MIPS. A pinpoint for this ontology now is: {disjoint(CheeseTopping, VegetableTopping), disjoint(IceCream, Pizza)}.

In this example, we see that MIPS-weights and the cores indicate axioms that lead tounsatisfiability of more concepts.
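A hedged sketch of how such weights and a pinpoint can be derived from the MIPS is given below. It approximates the core/pinpoint procedure of Section 2 by a greedy choice of the axiom with the highest arity; all names are illustrative and it is not the deliverable's implementation.

    from collections import Counter

    def core_arities(mips):
        """Arity of each (singleton) core: in how many MIPS does an axiom occur?"""
        return Counter(ax for m in mips for ax in m)

    def greedy_pinpoint(mips):
        """Greedily pick an axiom of maximal arity, drop all MIPS it hits, repeat."""
        remaining, pinpoint = [set(m) for m in mips], []
        while remaining:
            axiom, _ = core_arities(remaining).most_common(1)[0]
            pinpoint.append(axiom)
            remaining = [m for m in remaining if axiom not in m]
        return pinpoint

Applied to mips(Pizza′) above, disjoint(CheeseTopping, VegetableTopping) has arity 4 and is chosen first; one axiom of the remaining IceCream MIPS is then added, which can yield the pinpoint {disjoint(CheeseTopping, VegetableTopping), disjoint(IceCream, Pizza)}.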

4.2 Applying the framework to real-life terminologies

In the previous section we presented the results of the approach on ontologies that were specifically designed to demonstrate unsatisfiability. The outcomes demonstrated that the cause(s) of unsatisfiability were clearly pinpointed.

We now present results from the implementation and application of the approach to the real-world medical ontologies DICE [6] and FMA [23].

The current implementation of our approach can handle unfoldable TBoxes in ALC [24]. An unfoldable TBox implies that all definitions are simple (defining only atomic concepts), unique (only one definition exists for each atomic concept), and acyclic (meaning the definition of a concept has no reference to the definiendum, either directly or indirectly). The language ALC is a description logic whose allowed constructors are the ones mentioned in Section 2.

Our implementation partly uses RACER to perform reasoning, and partly implements algorithms to calculate mups, mips, gmips, and the heuristics. These algorithms are implemented in Java, so they can run on the platforms that are supported by RACER (currently available for 32bit versions of Windows, Linux or Mac OS X, for Sun and other branded UNIX workstations with 32bit or 64bit, as well as 64bit Linux environments).

DICE The DICE knowledge base2, which is under development at the Academic Medical Center in Amsterdam, contains about 2500 concepts. Each concept is described in both Dutch and English by one preferred term and any number of synonyms for each language. In addition to about 1500 reasons for admission, DICE contains concepts regarding anatomy, etiology and morphology.

DICE originally has a frame-based representation and was migrated to DL in order to be able to perform auditing w.r.t. incorrect definitions and missed classifications. In the DL-based representation many closure axioms are used in order to be able to find incorrect definitions. These closure axioms include disjointness of sibling concepts and universal restrictions. As a result of the migration process, various concepts become unsatisfiable.

A recent version of DICE was classified, resulting in 65 unsatisfiable concepts. For one concept, the calculation of MUPS failed in the current implementation, for as yet unknown reasons. For the remaining 64 concepts, a total of 175 MUPS were found.

142 MIPS were found, with the following distribution of MIPS-weights:
MIPS of weight 1: 121
MIPS of weight 2: 10
MIPS of weight 3: 0
MIPS of weight 4: 11

The pinpoint of this ontology consisted of the following five axioms, which form the largest cores, according to the procedure defined in Section 2.

Core of arity 60: Disjointness of children of "Act"
Core of arity 56: Disjointness of children of "Dysfunction/Abnormality"
Core of arity 15: Disjointness of children of "System"
Core of arity 7 (43 in the full mips): "Heart valve operations"
Core of arity 4: Disjointness of children of "Toxical substance"

The arity of 7 for "Heart valve operations" is the arity in the remaining mips after removal of all MIPS containing the disjointness statements mentioned above. The arity of this core in the full mips is 43.

Based on these results one can determine where to start the debugging process. Either the disjointness statements mentioned can be verified, or one can further analyze the definition of, and references to, the concept "Heart valve operations".

We will discuss the concept "Heart valve operations", also in order to demonstrate the use of generalized MIPS.

The concept is defined as:

2 Development of DICE is supported by the National Intensive Care Evaluation (NICE) foundation.


HeartValveOperations ⊑ HeartProcedures ⊓
  ∃ HasSystemInvolvement.CirculatorySystem ⊓
  ∀ HasSystemInvolvement.CirculatorySystem ⊓
  ∃ LocalizedIn.HeartValveStructure ⊓
  ∀ InvolvesDysfunction.(Thrombosis ⊔ Insufficiency ⊔ Stenosis) ⊓
  ∃ InvolvesAct.Replacement ⊓
  ∃ InvolvesAct.Resection ⊓
  ∃ InvolvesAct.Repair ⊓
  ∃ InvolvesAct.Excision ⊓
  ∃ InvolvesAct.Inspection ⊓
  ∀ InvolvesAct.(Replacement ⊔ Resection ⊔ Repair ⊔ Excision ⊔ Inspection)

As mentioned earlier, the DL-based representation is generated from a frame-based representation [5]. In this migration process, it was decided that universal restrictions were added by default, as is shown by the ∀ constructors in the definition.

Inspection of this statement reveals incorrect semantics: five values for the InvolvesAct role are required, whereas generally these operations involve only one of these acts. After correcting this modeling error, which resulted from an incorrect assumption in the migration process, we can reapply our approach. The resulting ontology still has 62 unsatisfiable concepts, resulting in 111 MIPS.

Now the pinpoint only contains disjointness axioms. However, the definition of Heart valve operations is still a core with an arity of 12, so we continue to focus on that concept. One MIPS contains "Valve commissurotomy" (the definition of which contains ∃ InvolvesAct.Incision) and the disjointness of Incision and Replacement, Resection, Repair, Excision, and Inspection. This indicates that the definition of Heart valve operations should be adjusted to include Incision in the disjunction of the ∀ InvolvesAct restriction.

In this way, we can iterate the process of making changes to the ontology and determining mups and mips. Alternatively, we could have focused on specific unsatisfiable concepts, using the mups.

FMA To test how our approach can be applied to other ontologies, we have used the Foundational Model of Anatomy (FMA)3. FMA, developed by the University of Washington, provides about 69000 concept definitions, describing anatomical structures, shapes, and other entities, such as coordinates (left, right, etc.). The FMA Knowledge Base, which is implemented as a frame-based model in Protege4, has been migrated to DL.

3 http://sig.biostr.washington.edu/projects/fm/
4 http://protege.stanford.edu/

Due to its large size we were not able to classify the full FMA ontology with RACER. We hence limited the case study to "Organs", which comprises a convenient subset that is representative of the FMA. Of the 3826 concept definitions, 181 were found to be unsatisfiable. Interestingly, this resulted in the single pinpoint "Organ". This could be explained by the definition of Organ:

Organ ⊑ AnatomicalStructure ⊓
  ∃ RegionalPartOf.OrganSystem ⊓
  ∀ RegionalPartOf.OrganSystem ⊓
  ∃ PartOf.OrganSystem ⊓
  ∀ PartOf.OrganSystem

In FMA, the unsatisfiable concepts were defined as part of some organ, for example

Periodontium ⊑ SkeletalLigament ⊓
  ∃ RegionalPartOf.Tooth ⊓
  ∀ RegionalPartOf.Tooth ⊓
  ∃ PartOf.Tooth ⊓
  ∀ PartOf.Tooth ⊓
  ∃ SystemicPartOf.Tooth ⊓
  ∀ SystemicPartOf.Tooth

Periodontium and Tooth are subsumed by Organ, and according to the definition of Organ, Tooth should be an OrganSystem. Hence, it would be more correct to specify that an Organ is also an allowed value for the (Regional)PartOf role, i.e. to define Organ as follows:

Organ ⊑ AnatomicalStructure ⊓
  ∃ RegionalPartOf.(Organ ⊔ OrganSystem) ⊓
  ∀ RegionalPartOf.(Organ ⊔ OrganSystem) ⊓
  ∃ PartOf.(Organ ⊔ OrganSystem) ⊓
  ∀ PartOf.(Organ ⊔ OrganSystem)

This example clearly shows how one axiom can lead to unsatisfiability of a large number of concepts. The pinpoint properly detects this single axiom.

4.3 Qualitative evaluation

The implementation of our approach provides a useful contribution to the debugging process. The heuristics (especially the pinpoint) provide a good starting point for debugging of the axioms. MUPS and generalized MIPS provide understandable explanations for the causes of unsatisfiability. In our experiments, performing the calculations for the MUPS, MIPS and heuristics was a matter of minutes (on a 2.4 GHz PC with 1 GB memory).


The current implementation provides text-based output. This requires a modeler to browse the output and find, for example, the MUPS that are related to a MIPS. Apart from further optimizing the algorithms and supporting more expressive logics, the various steps in the debugging process can be further integrated to provide more support.

Debugging of ontologies is receiving increasing attention, driven by research in the area of the Semantic Web on the one hand, and the need for robust ontologies on the other hand. We will first look into truth maintenance systems, a research field in artificial intelligence that has not yet been discussed in this paper. These systems provide, at least theoretically, a possibility for explaining reasoning. Next we will describe two implementations of other approaches to explanation, Swoop/Pellet and the OWL-debugger plug-in for Protege. These implementations differ in the approach followed. The Pellet reasoner, described later in this section, uses a "glass-box" approach that offers a debugging mode in which explanation is provided as a result of the reasoning process. The Protege OWL-debugger, like our approach, uses a black-box approach. It looks for common modeling errors as explanations for inconsistency, as described in the last section of this chapter.

Truth Maintenance Systems Truth maintenance systems (TMSs) [9] are designed to keep track of inferences made in knowledge bases. For example, the implication P ⇒ Q might have been used to add Q to a knowledge base that contains P. One approach is a justification-based truth maintenance system (JTMS). In such a system, each sentence in the knowledge base is provided with a justification consisting of the set of sentences from which it was inferred. In the example above, Q is provided with the justification {P, P ⇒ Q}.

TMSs generally serve three distinct purposes: handling retraction of (incorrect) information, speeding up analysis of multiple hypothetical situations, and providing a mechanism for generating explanations. In the example above, the justification provided with Q is also an explanation for why Q holds. Should P be retracted (for example because it was found that P does not hold), then from the justifications it can be determined that Q should also be retracted, as P provided a justification for Q. Similar to our diagnoses, these explanations should be minimal, i.e. no proper subset of an explanation should also be an explanation.
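The following toy Python sketch illustrates the justification mechanism described above. It is only an illustration of the idea (and assumes acyclic justifications); it does not describe any existing TMS implementation.

    class SimpleJTMS:
        """Toy justification-based TMS: sentences with justification sets."""

        def __init__(self):
            self.justifications = {}        # sentence -> list of justification sets
            self.premises = set()

        def add_premise(self, p):
            self.premises.add(p)

        def add_derived(self, q, justification):
            # justification: set of sentences from which q was inferred
            self.justifications.setdefault(q, []).append(set(justification))

        def holds(self, s):
            # NB: assumes acyclic justifications
            if s in self.premises:
                return True
            return any(all(self.holds(x) for x in j)
                       for j in self.justifications.get(s, []))

        def retract(self, p):
            self.premises.discard(p)        # sentences justified only via p stop holding

    tms = SimpleJTMS()
    tms.add_premise("P")
    tms.add_premise("P => Q")
    tms.add_derived("Q", {"P", "P => Q"})   # Q justified by {P, P => Q}
    print(tms.holds("Q"))                   # True
    tms.retract("P")
    print(tms.holds("Q"))                   # False: Q's only justification is lost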

As discussed in [2], a problem with implementing truth maintenance systems in DL reasoners is the fact that the efficiency of the highly optimized tableaux-based reasoning algorithms may suffer, as is also experienced in Pellet. Therefore, a black-box approach providing post-hoc explanation is a reasonable alternative to truth maintenance.

Explanation in Pellet One of the most elaborate implementations of debugging known to the authors is the Pellet reasoner, combined with the SWOOP web ontology editor5, which is described in [20, 17]. It provides a combination of black-box and glass-box approaches.

5 http://www.mindswap.org/2005/debugging/

Glass-box techniques are used to support two forms of debugging of unsatisfiable concepts: presenting the root cause of the contradiction and determining the relevant axioms in the ontology that are responsible for the clash (the so-called minimal sets of support).

The black-box methods focus on detecting dependencies between unsatisfiable classes. Two types of unsatisfiable classes are recognized: root classes and derived classes. A root class is an unsatisfiable class in which a clash or contradiction found in the class definition (axioms) does not depend on the unsatisfiability of another class in the ontology. A derived class is an unsatisfiable class in which a clash or contradiction found in the class definition depends on the unsatisfiability of another class.

The advantage of the Pellet/Swoop environment is that debugging clues are presented in an integrated way. A drawback of the glass-box techniques is that they make debugging dependent on a specific reasoner. A comparison with the results of debugging using Swoop and Pellet is planned.

Explanation by the Protege OWL-debugger Another example of debugging support is the OWL-debugger6, a plug-in for Protege [28, 27]. This debugger provides a black-box approach and is heuristic: it is based on experience, from which a set of commonly made mistakes has been identified. Being based on heuristics, the debugger is not complete, but it provides explanations for the majority of cases of unsatisfiability.

The process performed by the OWL-debugger resembles that of the Swoop/Pellet approach. First, an unsatisfiable core is identified, consisting of the smallest set of conditions of a concept that render that concept unsatisfiable. A number of rules is then applied to the unsatisfiable core in order to determine the debugging super conditions (DSC), which include all concepts that are involved in the unsatisfiable core, for example all superconcepts. A most general conflicting class set is generated from the DSC, which is used to produce an explanation for unsatisfiability.

The OWL-debugger also presents explanations in an integrated way, as part of the Protege ontology modeling environment. As it is a fully black-box implementation, it is independent of the reasoner being used.

6 http://www.co-ode.org/community/debugging.php


Chapter 5

Principles of a quantitative evaluation

In Chapter 4 we presented a qualitative discussion of the framework for debugging and diagnosis, which was primarily based on our own experience with incoherent terminologies in a medical domain. In principle, these results were encouraging, as we claim that the MUPS, MIPS, diagnoses and the like do indeed help. What remains to be investigated is whether we have effective algorithms to calculate those debugging operators, in which cases a particular method is more appropriate than another, and how far we can indeed stretch our method in practice. In the next two chapters we attempt to answer these questions through some practical experiments, in which we run our tools MUPSter and DION against a number of benchmarks. First, however, let us discuss which research questions we want to answer based on such a benchmark, and which basic requirements this benchmark needs to fulfil.

5.1 Evaluating algorithms for debugging and diagnosis

Even though our experience showed that our approach to debugging and diagnosis is useful in practice, we regularly encountered computational problems in day-to-day application. In this report we want to get a better understanding of the challenge, the state of the art of our tools, and the differences between the methods. Basically, there are a number of questions:

• Can debugging be performed efficiently? More concretely, given an incoherent terminology, can our tools support a practitioner effectively?

• What makes an incoherent terminology difficult to debug? We will see that the answer to the previous question is not necessarily positive. Even worse, we know that for most Description Logics our problem is exponentially hard1, which means that we will never be able to guarantee termination in realistic time for any arbitrary input. There might be particular classes of terminologies that are more difficult than others, and it is important to study these difficulties to improve the methods.

1 Debugging is at least as hard as checking satisfiability, which is itself PSPACE for most DLs (see, e.g., Chapter 3 of [1]).

• Which are the most appropriate calculation methods? Not only will there be classes that are more difficult than others, there will also be classes of TBoxes for which one method is more appropriate than another. Answering this question might allow the user to choose the appropriate implementation according to his/her needs.

Based on these research questions, there are a number of criteria for a good test-set. Most importantly, a test-set should be prototypical, so that it represents a larger class of realistic problems. Also, it has to be systematic, so that the influence of particular properties of classes of TBoxes can be evaluated. Finally, a test-set should be statistically viable, i.e. the results of an experiment for a particular class of TBoxes should indeed have some significance w.r.t. the properties it is meant to evaluate.

Moreover, there are some criteria that are inherent to our evaluation, most importantly that we are restricted by the evaluated tools to incoherent, unfoldable ALC TBoxes.

5.2 Three types of benchmark experiments

In order to fulfil the criteria described above, and to be able to address the research questions, we propose (and conducted) three different types of benchmark experiments: first, an evaluation of the methods with real-world terminologies; secondly, an evaluation using an adapted benchmark set from the DL literature; and finally, some experiments with our own purpose-built data-set.

5.2.1 Evaluation with real-world terminologies

The first approach to the evaluation of debugging methods is to consider a number of publicly available Description Logic terminologies.

We split our test terminologies into three groups, ordered by how they were built. As examples of terminologies created through migration we consider an older version of the anatomy fragment of DICE (abbreviated DICE-A), with 534 axioms and 76 unsatisfiable concepts, and a previous version of the full DICE (abbreviated DICE). The incoherence of DICE-A has two distinct causes: first, this is a snapshot of the terminology in its creation process, i.e. it contains real modeling errors. Moreover, the high number of contradictions is specific to migration as a result of stringent semantic assumptions. MGED and Geo are variants of ontologies which are incoherent because they have disjointness statements artificially added for semantic enrichment (as suggested in [26]). MGED provides standard terms for the annotation of micro-array experiments to enable structured queries on those experiments; Geo is an ontology of geography made available by the Teknowledge Corporation. The third category contains the merged terminologies of SUMO and CYC, two well-known upper ontologies. As they are topic-related, and as CYC provides disjointness statements, there is a high number of unsatisfiable concepts.

            #ax    #unsat  #mips  |mips|  length of mD
  DICE-A    534    76      16     3       3
  DICE      4995   27      55     4       -
  MGED      406    72      38     4       3
  Geo       417    11      22     2.6     8
  S&C       6382   923     -      -       -
  MadC      69     1       -      -       1

Table 5.1: Real-world terminologies

We constructed simplified ALC versions for all five terminologies. Without loss of unsatisfiability, we removed, for example, numerical constraints, role hierarchies and instance information. All these terminologies, however, were non-cyclic and could be transformed into an unfoldable format.

For the last example, the MadC ontology, this is not the case. This ontology was constructed to illustrate language features of Description Logics, and we use it to illustrate that Reiter's generic method works for expressive formalisms where both other methods fail. MadC is incoherent, with the unsatisfiable concept MadCow.

Benchmarking with real-life ontologies is obviously a most natural way of evaluating the quality of debugging algorithms. On the other hand, there is only a limited number of realistic terminologies available that are incoherent. This has several reasons: first, published ontologies usually have undergone a careful modeling process, and it should be expected that logical modeling errors have already been eliminated (without the help of automatic methods). Secondly, current ontologies often still use quite inexpressive languages, and, for example, avoid inconsistencies by not stating disjointness of classes. The consequence is that the set of testing examples is limited, and it becomes difficult to make a systematic evaluation of our algorithms, as the bias of the data is simply too dominant.

As an alternative it is common to use systematically created test-sets for benchmarking, and such sets also exist for evaluating Description Logic reasoning.


5.2.2 Benchmarking with (adapted) existing test-sets

The issue of benchmarking Description Logic systems has been addressed exhaustively in the literature over the last 10 years, mostly using adaptations of modal logic test-sets [18]. In most of these studies the purpose was to evaluate the runtime of DL reasoners with respect to the complexity of the language used (mostly the modal depths). As they were often planned to include comparisons with SAT- [10] or resolution-based systems [16], most of the resulting test-sets are for concept satisfiability.

Unfortunately, we are bound by the expressiveness of our tools for the benchmarking, which essentially means restricting the experiments to test-sets with unfoldable ALC terminologies. Finding a good benchmark for unfoldable terminologies is difficult, though, as existing extensions to ontologies usually go beyond these requirements.

For this reason, we adapted an existing test-set for ALC concept satisfiability by translating each unsatisfiable concept into an incoherent terminology. For this purpose, we used the satisfiability tests from the well-known DL benchmark from the 1998 system comparison. We took 9 sets of unsatisfiable concepts usually denoted by DL98 = {k_branch_p, k_d4_p, k_dum_p, k_grz_p, k_lin_p, k_path_p, k_ph_p, k_poly_p, k_t4p_p}. The test procedure works as follows: each set contains 21 concepts with exponentially increasing computational difficulty. The measure of the speed of a DL system was then the highest concept in the list that could still be solved within 100 seconds. These test-sets were built in the early days of the latest generation of Description Logic tools, which did not have as many optimisations then as they have now. Nowadays, these test-sets are outdated, as they are too easily solved by all existing DL systems. For our purpose, however, they are still relevant, as the implemented top-down method works (in the current version) without optimisations.2

We translated each of the unsatisfiable concepts in the test-sets into one unfoldable incoherent ALC terminology in the following way: let C be an unsatisfiable concept; we build an initial terminology AC ⊑ C. Then each sub-concept S that is in the scope of an odd number of negations is (recursively) replaced by an atom AS, and an axiom AS ⊑ S is added to the terminology, where AS and AC are new names not occurring in the concept. The resulting TBox is then incoherent. Let us illustrate the method with an example rather than give a formal definition. Suppose we have an unsatisfiable concept C = ∃r.(A ⊔ B) ⊓ ∀r.(¬A ⊓ ¬B). We first build a terminology T = {AC ⊑ C}, and replace the outermost subformulas of C by atoms A1 and A2. T is now {AC ⊑ A1 ⊓ A2, A1 ⊑ ∃r.(A ⊔ B), A2 ⊑ ∀r.(¬A ⊓ ¬B)}. Next we transform the definitions of A1 and A2, which leads to the TBox {AC ⊑ A1 ⊓ A2, A1 ⊑ ∃r.A3, A2 ⊑ ∀r.A4, A3 ⊑ A ⊔ B, A4 ⊑ ¬A ⊓ ¬B}, and so on. The resulting terminology for C = ∃r.(A ⊔ B) ⊓ ∀r.(¬A ⊓ ¬B) is then:

{AC ⊑ A1 ⊓ A2,
 A1 ⊑ ∃r.A3,
 A2 ⊑ ∀r.A4,
 A3 ⊑ A5 ⊔ A6,
 A4 ⊑ A7 ⊓ A8,
 A5 ⊑ A,
 A6 ⊑ B,
 A7 ⊑ ¬A,
 A8 ⊑ ¬B}

2 There is an experimental implementation which trades completeness of the method for a number of optimisations. In this report we restrict our attention to the more robust, and complete, standard implementation of MUPSter.
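A small Python sketch of this translation is given below. It uses a simple tuple encoding of ALC concepts and introduces a fresh definition for every sub-concept (as in the example above); the representation and function names are illustrative only, and the generated names will not coincide exactly with those in the example.

    from itertools import count

    _fresh = (f"A{i}" for i in count(1))

    def define(concept, tbox):
        """Introduce a fresh name for `concept`, add its defining axiom to tbox,
        and return the name.  Concepts are encoded as: an atom "A",
        ("not", atom), ("and"/"or", c1, c2), or ("exists"/"forall", role, c)."""
        name = next(_fresh)
        if isinstance(concept, str):                     # atom
            tbox.append((name, concept))
        elif concept[0] == "not":                        # atomic negation assumed
            tbox.append((name, ("not", concept[1])))
        elif concept[0] in ("and", "or"):
            op, c1, c2 = concept
            tbox.append((name, (op, define(c1, tbox), define(c2, tbox))))
        else:                                            # "exists" or "forall"
            op, role, c = concept
            tbox.append((name, (op, role, define(c, tbox))))
        return name

    # C = ∃r.(A ⊔ B) ⊓ ∀r.(¬A ⊓ ¬B), the concept used in the example above
    C = ("and", ("exists", "r", ("or", "A", "B")),
                ("forall", "r", ("and", ("not", "A"), ("not", "B"))))
    tbox = []
    root = define(C, tbox)   # pairs (defined name, unfolded right-hand side)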

For each of the sets of unsatisfiable concepts in DL98 we created incoherent terminologies with the above-mentioned method for the first three concepts. The resulting terminologies are called k_branch_p_tbox1, k_branch_p_tbox2, k_branch_p_tbox3, and similarly for all other sets in DL98.

An interesting property of this new test-set is that the data-sets are heavily engineered to make reasoning hard, and to punish non-optimised reasoning. We will see that this has drastic effects on the run-times of the different algorithms. On the other hand, this reasoning-centered view does not really coincide with the average structure of the realistic terminologies found in practice. Mostly, these terminologies are relatively flat in structure, but large, and the definitions often depend strongly on other definitions. To account for this, we decided to build our own benchmark for evaluating algorithms for debugging.

5.2.3 Benchmarking with purpose-built test-sets

In order to build a test-set for benchmarking our algorithms and implementations for debugging and diagnosis, several basic requirements have to be fulfilled: first, the resulting TBoxes have to be unfoldable and in ALC, and they have to be incoherent. Also, as discussed at the beginning of this chapter, the benchmark should be systematically constructed, so that classes of problems can be studied and general statements about properties of TBoxes can be made.

These basic requirements leave a number of choices for creating such a test-set: e.g. creating an incoherent terminology could be achieved through systematic construction of logical contradictions, or through random choice of operators and names. The first choice was made in the previously mentioned DL98 benchmark set, whereas in our test-set we opted for the second option. This way we hope to get a stronger similarity with realistic terminologies, while still retaining some control over parts of the structure of the TBoxes.

Building such a test-set for debugging and diagnosis of incoherent terminologies is difficult, because there is a plethora of parameters that could influence the complexity of reasoning and the difficulty of debugging. In our case we decided to fix a number of parameters and vary others.

Fixed Parameters The fixed parameters are those parameters which will not change in any setup of the experimental analysis, i.e. for all terminologies we consider in the experiments, the same properties hold. The most important fixed parameters are the constant term-depth of the concept definitions, the fixed ratio between modal and Boolean operators, the ratio between defined and primitive concepts, and the restriction to atomic negation.

• Constant term-depth. The term-depth of a concept is the maximal number of operators that an atomic concept occurs in the scope of. More formally, the term-depth td(C) of an ALC concept C is defined recursively as follows:

  td(C) = 0                         if C is an atom
  td(C) = max(td(C1), td(C2))       if C = C1 ⊔ C2
  td(C) = max(td(C1), td(C2))       if C = C1 ⊓ C2
  td(C) = td(C1) + 1                if C = ∃r.C1
  td(C) = td(C1) + 1                if C = ∀r.C1
  td(C) = td(C1) + 1                if C = ¬C1

  We say that a concept has constant term-depth if td(C1) = td(C2) for all sub-concepts C1 ⊓ C2 and C1 ⊔ C2. Intuitively, this implies that all paths from the root of the term representation of a concept to its leaves have the same length. (A small code sketch of this computation follows the list of parameters below.)

  In logical terms this is not really a restriction, as for every ALC concept there is a logically equivalent concept with constant term-depth.

• Ratio between modal and Boolean operators.3 When creating a concept definition we decided to create Boolean and modal operators with the same probability. This corresponds more or less to our experience with existing ontologies: though they often contain more modal quantifiers than Boolean operators, the latter are not binary, which is the way we create our concepts.

• Ratio between primitive and defined concept names. In an unfoldable TBox a defined concept name is one that occurs on the left-hand side of an axiom. Semantically, the interpretation of the defined concepts can be derived from the base interpretation of the primitive concepts. For this test-set we decided to have the same number of primitive and defined concept names. Evaluating the influence of this parameter might be the subject of further experiments but was out of the scope of this research.

• Restriction to atomic negation. It is well known that it is possible to translate each ALC concept into an equivalent one in Negation Normal Form (NNF) in linear time, i.e. a concept where negations only occur atomically. This means that this technical restriction does not really have an influence on the benchmarking; it should make no significant difference whether debugging and diagnosis is performed on TBoxes in NNF or not.

3 By modal operators we understand the Box- and Diamond-like DL operators ∃ and ∀.
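The term-depth recursion announced above is easy to compute over the tuple encoding used in the earlier translation sketch; the following lines are again only an illustrative sketch:

    def term_depth(concept):
        """Recursive term-depth td(C) for tuple-encoded ALC concepts."""
        if isinstance(concept, str):                 # atom
            return 0
        op = concept[0]
        if op in ("and", "or"):
            return max(term_depth(concept[1]), term_depth(concept[2]))
        if op == "not":
            return term_depth(concept[1]) + 1
        return term_depth(concept[2]) + 1            # "exists"/"forall": (op, role, filler)

    # ∃r.(A ⊔ B) ⊓ ∀r.(¬A ⊓ ¬B) has term-depth 2 (atom A sits below ¬ below ∀r)
    C = ("and", ("exists", "r", ("or", "A", "B")),
                ("forall", "r", ("and", ("not", "A"), ("not", "B"))))
    print(term_depth(C))   # 2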

Variable Parameters The variable parameters are those parameters we will do the experimental evaluation over, i.e. they will influence properties of the resulting TBoxes which we believe can have a significant influence on the practical complexity. For the current experiments we have four such variable parameters: the size of the TBox, the concept length, the ratio of disjunctions versus conjunctions, and the chance of an atom being negated or not.

• Size of the TBox. We vary the size of the TBox from 10 to 200 axioms. This number corresponds to the number of defined concepts.

• Concept length. As previously mentioned, we restrict ourselves to TBoxes with constant term-depth. In our experiments we consider concept definitions of term-depth 3 to 7. Intuitively, this means that each concept-name occurs in the scope of 3 to 7 operators.

• Ratio of disjunction versus conjunction. As discussed before, when creating a concept there is a 50% chance that the operator is Boolean. We then have to determine whether this operator will be a disjunction or a conjunction. In our experiments we vary this ratio from 10% to 30% up to 50%.4

• Ratio of negated versus non-negated atoms. Once the full term-depth is reached in our creation process, an atomic concept or its negation is created at random. In the experiments we vary the probability of an atom being negated from 30% to 70%.

Building a set of benchmark TBoxes Finally, we have to describe how to build an unfoldable ALC terminology. Suppose we have fixed the variable parameters TBox size to #t and concept size to #s. To create the i-th axiom (i.e. to define the i-th concept-name) we create a concept of term-depth #s from all the primitive concept-names and the remaining concept-names that were not defined in the first i−1 axioms. The construction of the concept is then simply based on random choice with the given probability distribution.
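A possible reading of this construction is sketched below, reusing the tuple encoding from the earlier sketches. All names and parameters are illustrative; in particular, to keep the generated TBox acyclic and unfoldable, the sketch only allows concept-names defined in later axioms to occur on a right-hand side, and it uses a single role r.

    import random

    def random_concept(depth, name_pool, p_or, p_neg):
        """Randomly build a concept of term-depth `depth` over the given names."""
        if depth == 0:
            atom = random.choice(name_pool)
            return ("not", atom) if random.random() < p_neg else atom
        if random.random() < 0.5:                      # Boolean operator (binary)
            op = "or" if random.random() < p_or else "and"
            return (op, random_concept(depth - 1, name_pool, p_or, p_neg),
                        random_concept(depth - 1, name_pool, p_or, p_neg))
        op = random.choice(("exists", "forall"))       # modal operator over role r
        return (op, "r", random_concept(depth - 1, name_pool, p_or, p_neg))

    def random_tbox(n_axioms, depth, p_or, p_neg):
        """Unfoldable TBox with as many primitive as defined concept names.
        Coherence is not guaranteed and has to be checked with a DL reasoner."""
        primitives = [f"P{i}" for i in range(n_axioms)]
        defined = [f"D{i}" for i in range(n_axioms)]
        tbox = []
        for i, d in enumerate(defined):
            pool = primitives + defined[i + 1:]        # only not-yet-defined names
            tbox.append((d, random_concept(depth, pool, p_or, p_neg)))
        return tbox

    # e.g. a TBox with 10 axioms, term-depth 3, 10% disjunctions, 30% negated atoms
    example = random_tbox(10, 3, 0.1, 0.3)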

Given this method, we constructed 1611 incoherent terminologies. Note that the method does not automatically deliver incoherent TBoxes; on the contrary, the overall rate of incoherence given this method is less than 30%. To retain only incoherent TBoxes we therefore simply ran an optimised DL reasoner to delete all coherent TBoxes from the test-set.

It should be noticed that the choice of parameters greatly influences the ratio of satisfiable versus unsatisfiable terminologies, particularly the disjunction/conjunction ratio. More concretely, when a disjunction was chosen in 50% of the cases, only 5% of the resulting TBoxes were incoherent, as opposed to almost 50% of the TBoxes when the likelihood was just 10%. The consequence of this is easy to see: if we created a fixed number of TBoxes for testing per parameter value, we would have only 10% of incoherent TBoxes in our test-set with a high disjunction ratio. The simplest solution to this problem was to account for the satisfiability/unsatisfiability ratio and to create an accordingly larger number of TBoxes in the first place. In this way we believe to have created a test-set of TBoxes where each class of TBoxes corresponding to a particular parameter choice is as likely to occur in the test-set as any other. Both the test-set and the generator will be made available on our website.

4 We originally considered a similar variation of percentages for the ratio between the choice of existential versus universal quantifier, but this seemed to give no new insights.


Chapter 6

Experimental evaluation

In the previous chapter we described the three different test-sets we use for evaluating the run-times: first, a set of 6 real-life ontologies we collected from the Internet; secondly, an extension of an existing benchmark for the evaluation of Description Logic systems; and finally, a purpose-built benchmark consisting of 1611 systematically constructed incoherent terminologies. Interestingly, the results vary a lot, and it is worth looking into some details to get a better understanding of the pros and cons of each of the algorithms.

6.1 Experiments to evaluate algorithms for debugging

The main notions for debugging are the minimal unsatisfiability and incoherence preserving sub-terminologies, the MUPS (of a TBox and an unsatisfiable concept) and MIPS (of a TBox). Calculating MIPS can be done quite easily from MUPS. However, there are conceptually very different ways of calculating MUPS, and we have implemented two different strategies: a top-down method, which uses a specialised algorithm based on DL tableaux, and a bottom-up method, which uses heuristics to guide a brute-force enumeration method.

The theoretical complexity of the problem and the algorithms for unfoldable ALC TBoxes is known. First, finding MIPS is PSPACE-complete: obviously, finding MIPS is at least as hard as checking satisfiability, and there is a simple brute-force way of calculating all the MIPS using only polynomial space in terms of the size of the TBox (trial and error). DION implements a variant of such an algorithm, whereas MUPSter's algorithm needs exponential space w.r.t. the size of the TBox.1

While MUPSter is the older method, the second one has some significant advantages: because it uses existing purpose-built DL systems for the reasoning part via a DIG interface, its expressiveness only depends on the underlying prover. At the moment, this means that DION can debug arbitrary ontologies in very expressive Description Logics, such as SHIQ and the like. MUPSter, on the other hand, is restricted to unfoldable ALC TBoxes, and is therefore severely less expressive than DION. On the other hand, DION has a seemingly heavy disadvantage in that its heuristics make it theoretically incomplete. Here, we will try to find out whether there is a price for expressiveness or completeness, and whether we can identify cases in which one of the algorithms might be preferable to the other.

1 This is because finding prime implicants is NP-complete w.r.t. a formula, which might need exponential space in relation to the TBox.

              Top-down: MUPSter       Bottom-up: DION         RacerPro
              time for all MIPS       time for all MIPS       time for coherence check
  DICE-A      12 sec                  timed out               4.3 sec
  DICE        54 sec                  32 sec                  16.62 sec
  MGED        5 sec                   timed out               3.38 sec
  Geo         20 sec                  11 sec                  1.82 sec
  Sumo&Cyc    timed out               timed out               4.7 sec
  MadCow      not applicable          0.66 sec                0.22 sec

Table 6.1: Comparing top-down and bottom-up methods

6.1.1 Experiments with existing ontologies

The first set of experiments was performed on a dual AMD Athlon MP 2800+ with 2 GB memory. Both MUPSter and DION were given the six terminologies described in Section 5.2.1. Both were timed out if they had not calculated the set of MIPS within one hour.

Results Table 6.1 summarises the run-times of the two systems on the 6 terminologies in columns 1 and 2. As a reference we also give in column 3 the runtime RacerPro 1.8.1 needs to find the unsatisfiable concepts. For the MadCow terminology, MUPSter could not be applied, as this is not an unfoldable terminology.

Whereas MUPSter manages to calculate all the MIPS within the time limit for all but the Sumo&Cyc terminology, DION also fails in two more cases, the anatomy part of DICE and the MGED TBox. The results of the experiments show a scalability problem for both algorithms with really big terminologies, as both methods failed to calculate MIPS in the case of Sumo&Cyc, and DION also needs more than one hour for two more terminologies. On the other hand, the run-times of DION are faster for the two examples where it finds a solution, even though MUPSter can solve two more problems in the given time.

Finally, it has to be noted, as a proof of concept more than anything else, that DION is indeed capable of calculating MIPS for the more complex MadCow terminology.

Analysis The first conclusion from the experiments can easily be drawn: debugging is computationally difficult, and with the current software it is likely that calculating MUPS and MIPS will in some cases fail, no matter which algorithm is used. On a more optimistic note, it is important to observe that solutions can be found for fairly complex terminologies, such as DICE or MGED, even though this cannot be guaranteed in general.

This remark is in line with reasoning in Description Logics in general, where the worst-case complexity also suggests that reasoning is in principle intractable. But despite this very high worst-case computational complexity, there are very promising results in practical complexity (both for classical reasoning and for debugging), which are based on powerful optimisations.

Out of the four cases where our tools find solutions, two interesting, and antagonistic, pieces of information stick out: the shorter runtime of DION as compared to MUPSter in the two cases where DION finds a solution, and the apparent unrelatedness of the run-times of DION and RacerPro. As DION uses RacerPro as its underlying reasoning engine, a correlation could have been expected. However, even though there is almost no difference in the time to check satisfiability for DICE and DICE-A, DION fails to find MIPS for the latter. Similarly, the relatively small run-time of RacerPro on Sumo&Cyc would suggest that DION should be able to find MIPS, which it does not.

One way of explaining this behaviour is to consider the number of unsatisfiable concepts in the 6 ontologies. There might be a correlation here, as both DICE-A and MGED have a very high number of unsatisfiable concepts. As DION implements the calculation of MIPS as a series of calculations of MUPS, it might simply need too many communication steps over the DIG interface to solve the MIPS problem in time.

However, it remains to be investigated in more detail whether there are other criteria that could be responsible for these relatively surprising results. For this purpose we conducted some experiments on two sets of systematically built benchmarks.

6.1.2 Experiments with existing benchmarks

The analytic value of the previous set of experiments is relatively limited, as the results could be severely influenced by some particularity of the chosen ontologies. Therefore, we conducted a second set of experiments based on the benchmark set from the DL systems competition at DL98.

The hope of running experiments against a more systematically built set of incoherent TBoxes was to get a better understanding of the influence of specific properties of terminologies on the computational properties of the respective algorithms.


Figure 6.1: Runtime of DION on DL98 testset

To this avail we run bothMUPSter and DION against a set of 27 TBoxes, where thetranslation from the DL98 test-bench of the first three concepts were considered. Theresults forMUPSter are summarised in Figure 6.2, those of DION in Figure 6.1

The graphs show the runtime per problem up to the time-out of 100 seconds. By construction of the problems, the runtime can be expected to increase with the size of the description of each problem.

We kept the relatively short time-out of 100 seconds from the original DL experiments. This has to do with the nature of the experiments: remember that the DL benchmark was created in such a way that it exposes the weaknesses of an algorithm exponentially, so adding more time will not change anything substantially. DION might solve one more class, and MUPSter might get past the first two levels in some cases, but the general picture will be the same.

Results The results of the experiments with the adapted DL benchmark set are almost diametric to the results described in the previous section, as they show a high failure rate of the MUPSter tool on the more difficult cases, but very good computational behaviour of DION.

More concretely, it can be noticed that MUPSter fails in all of the 9 different classes to calculate even the third problem (out of a possible 21), whereas DION fails to terminate the debugging process in only a single case (k branch p). However, in 6 out of 9 possible cases, MUPSter manages to find debugs for the most simple case. These results are in line with the computational difficulty of the problems.

Figure 6.2: Runtime (in sec) of MUPSter on DL98 testset

Analysis The results show clearly that MUPSter fails to debug terminologies that are based on complex unsatisfiability problems, such as the ones used in the DL98 evaluation. These problems have been purpose-built to point to computational difficulties in satisfiability checking, i.e. they explicitly exploit structural properties of formulas and provers. More concretely, experience showed that the DL systems of the 90s failed in these cases because they used naive tableau algorithms, and the complexity of the test-set forced their calculations to be “lost in the search space”.

This allows the conclusion that basic optimisations can significantly improve the performance of MUPSter, because techniques such as lemmatising and caching can avoid the mentioned computational traps. The price in this case is a loss of theoretical correctness: the terminologies returned by our algorithm might not be minimal any more. We doubt whether this is a problem in practice, as minimality is a nice, but not strictly required, feature of the MIPS.

Figure 6.3: Runtime (in seconds) of DION depending on the number of (unrelated) axioms added to k path p

As DION uses optimised reasoners via its DIG interface, which can nowadays solve problems like those of the DL98 test-set within milliseconds, it solved the problems easily. This also has to do with the structure of the test-set: as we created the TBoxes from concepts in a deterministic way, each axiom is related to the part of the concept it was created from. But this syntactic structure corresponds one-to-one to DION's search strategy: in DION we basically start with the definition of an unsatisfiable concept, and include the next axioms in our test procedure on the basis of a syntactic relation (occurrence of the defined name). Each calculation of MIPS therefore mirrors the construction of the TBox from the unsatisfiable concept.
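To make this search strategy concrete, the following Java fragment is a minimal, hypothetical sketch of such a relevance-based expansion loop; the Axiom record, the unsatisfiable() stub and all other names are our own illustrative assumptions rather than DION's actual code, and a real implementation would query an external reasoner such as RacerPro through the DIG interface instead of the placeholder.

import java.util.*;

public class RelevanceExpansion {
    // An axiom is a defined concept name plus the concept names used in its definition.
    record Axiom(String definedName, Set<String> usedNames) {}

    // Placeholder for a call to an external DL reasoner (e.g. RacerPro via DIG):
    // true iff the given subset of axioms already makes 'concept' unsatisfiable.
    static boolean unsatisfiable(String concept, Set<Axiom> subset) {
        return false; // stub only
    }

    // Grow a subset of the TBox around an unsatisfiable concept by syntactic relevance:
    // in each round, add the axioms whose defined name occurs in the current subset.
    static Set<Axiom> expand(String unsatConcept, List<Axiom> tbox) {
        Set<Axiom> current = new LinkedHashSet<>();
        Set<String> names = new HashSet<>(Set.of(unsatConcept));
        boolean changed = true;
        while (changed && !unsatisfiable(unsatConcept, current)) {
            changed = false;
            for (Axiom ax : tbox)
                if (!current.contains(ax) && names.contains(ax.definedName())) {
                    current.add(ax);              // syntactically relevant axiom
                    names.addAll(ax.usedNames()); // its definition may pull in further names
                    changed = true;
                }
        }
        return current; // an unsatisfiability-preserving, syntactically connected subset
    }
}

Because the DL98 TBoxes were generated deterministically from single concepts, such an expansion order retraces their construction almost exactly, which is one plausible reason why DION handles them so easily.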

To counter this effect, and in order to create a more realistic test-set, we decided to add a number of irrelevant axioms to the test-set. We arbitrarily chose the first TBox k path p and added 100, 200, 400 and 800 random axioms.2 Figure 6.3 summarises the run-times of these experiments.

Although the problem is in principle exponential in the size of the TBox, DION shows linear behaviour in this experiment. This also corresponds to the increasing run-time RacerPro needs for checking coherence. This has a relatively simple explanation: it shows that the strategy underlying DION controls the inherent complexity of the search process in a nice way. But it also poses a question with regard to our first set of experiments with the real-world ontologies. Remember that we noted there that the failure of DION to calculate any MIPS for two ontologies was independent of the computational complexity of finding unsatisfiable concepts (as shown by RacerPro).

One reason for this discrepancy could be that the created TBoxes have a very particular structure, as they usually have only a single unsatisfiable concept. A more detailed study of the original test-set is needed to be able to properly assess the reasons for those differences.

2 We took arbitrary coherent sets of axioms from our purpose-built test-set.


6.1.3 Experiments with purpose-built benchmark

The final set of experiments to evaluate our debugging algorithms was conducted on our own purpose-built benchmark described in Section 5.2.3. Here the focus was not on creating difficult reasoning problems, but on identifying features of terminologies that would make debugging hard or easy. Furthermore, we intended our constructed TBoxes to be as close as possible to the structures of realistic terminologies.

In these experiments we constructed classes of TBoxes according to some varying parameters, and compared the runtimes of our tools across these classes. This allows us to gain more insight into the influence of structural peculiarities of TBoxes on the time needed for debugging.

The parameters we varied were the ratio of negated versus unnegated atoms, the number of disjunctions versus conjunctions, and the size of the concepts and of the TBox.
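Purely as an illustration of how such a parameterised class of TBoxes could be generated (the actual construction is the one described in Section 5.2.3 and may well differ; the flat axiom shape and all names below are our own assumptions), a hypothetical Java generator varying exactly these parameters might look as follows.

import java.util.*;

public class RandomTBox {
    // One random axiom "Ci := L1 op L2 op ... op Lk": each literal is a (possibly negated)
    // atom, and each Boolean operator is a disjunction with probability pDisj.
    static String randomAxiom(int i, int conceptSize, double pNeg, double pDisj,
                              int numAtoms, Random rnd) {
        StringBuilder sb = new StringBuilder("C" + i + " := ");
        for (int j = 0; j < conceptSize; j++) {
            if (j > 0) sb.append(rnd.nextDouble() < pDisj ? " OR " : " AND ");
            if (rnd.nextDouble() < pNeg) sb.append("NOT ");
            sb.append("A").append(rnd.nextInt(numAtoms));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int tboxSize = 20; // number of axioms; varied between 10 and 200 in our experiments
        for (int i = 0; i < tboxSize; i++)
            System.out.println(randomAxiom(i, 5, 0.4, 0.3, 10, rnd));
    }
}

TBoxes generated in this way are only incoherent by chance; in the experiments below only the incoherent ones are considered.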

We will first summarise the results, before trying to provide some explanations and a critical discussion of the sometimes unexpected results.

Results For the 1611 TBoxes tested in these experiments we checked the run-times of the three tools: DION and MUPSter for debugging, and RacerPro for consistency checking. Again, the experiments were performed on a dual AMD Athlon MP 2800+ with 2 GB memory. The overall average run-times in seconds are summarised in the following table:

                   MUPSter   DION   RacerPro
  average in sec   2.05      1.68   1.35

These numbers show DION as the clear overall winner over MUPSter, which is in line with the results of the previous experiments. It is worth noticing that the overall run-times of MUPSter contain a call to RacerPro for a list of unsatisfiable concepts. Since this external run-time takes on average about 70% of MUPSter's runtime, we expect a strong correlation between RacerPro and MUPSter for those cases where MUPSter takes reasonably short time. Any difference in the run-time behaviour of these two tools therefore points to an additional source of complexity in MUPSter's core algorithm.

On the other hand, DION's algorithm is completely based on calls to an external reasoner, in our experiments RacerPro. Therefore, a correlation between the run-times of these two tools can also be expected.

The results of our experiments to identify computational properties of particular classes of TBoxes are summarised in a number of figures, which we will discuss in the following. In each of the following figures we show the average run-time in seconds for each tool given a particular instantiation of values for the varying features.

Figure 6.4: Average runtime (in seconds) depending on the ratio negative/positive atoms (in %)

Figure 6.4 shows the runtime (in seconds) of the three tools given a varying ratio of 30 to 70% of negated versus non-negated atoms in the axioms of the TBox. There is an interesting discrepancy between the run-times of RacerPro and MUPSter, on the one side, and DION on the other, as the former peak at a ratio of 40%, whereas the latter has its highest value in the class of TBoxes with a probability of 30% that the atoms are negated.

This is a surprising result, and it is worth recalling the construction of the test-set. It is easy to see that TBoxes constructed with a negation ratio of 40 or 50% are more likely to be incoherent; this has also been shown in some of our secondary experiments. However, as we only consider incoherent TBoxes, we would not expect the difference between coherent and incoherent TBoxes to influence the run-time behaviour. Moreover, the run-times of Racer do not support the view that reasoning for the class of TBoxes with 30% negated atoms might be harder than for the other cases.

A similar result can be seen in the study of the ratio between disjunctions and conjunctions. Again, there is an almost equivalent behaviour of MUPSter and RacerPro, and a diverging pattern for DION. Here, the difference between a lower and a higher number of disjunctions has a significant influence on the runtimes of DION, much more than for Racer and MUPSter.

Much more in line with expectation is the following experiment, where we vary the size of the concepts in the definitions. Figure 6.6 summarises the results, where we compare the run-times of our three tools on classes of TBoxes with concept-length 3 to 7.

In the case of varying concept-size, the results show a clear correlation between the three methods. There is a slight dip in computational difficulty for concepts of size 4, followed by an increasing run-time for all tools. Interestingly, the curve for DION is the steepest, and for concepts of size 7, DION already takes more time than MUPSter.

Figure 6.5: Average runtime depending on the disj/conj ratio

Figure 6.6: Runtime (in seconds) depending on the size of the concepts

The most natural results can be seen in the experiments where the size of the TBox is varied. Here, we take TBoxes of different length into account, varying from 10 to 100 in steps of 10, and compare them with a single value for TBoxes of size 200. The general line is an almost linear increase in run-time for all three tools up to 90 concepts, followed by a slight decrease towards longer TBoxes.

Figure 6.7: Runtime (in seconds) depending on the size of the TBox

Analysis Unfortunately, none of the initial experiments seems to indicate a clear explanation for the run-time behaviour of particular classes of TBoxes. Even worse, almost every experiment shows some very odd results.

In the two experiments where we vary the disjunction and negation ratios, DION's run-times are not in line with either MUPSter or RacerPro. There are some possible explanations: either these results correspond to some (unknown) properties of the DION algorithm or implementation, or to some peculiarities in our test-data. More experiments (with a more detailed analysis of the benchmark set) are needed before we can explain this behaviour.

More conclusive are the next two experiments, in which a clear tendency can be seen that debugging becomes more difficult with increasing TBox and concept-size. From these two experiments, another conclusion might be drawn. Figure 6.6 shows an increase in the runtime of DION which is much steeper than those of the other two tools. This has its reason in the DION algorithm, where the size of the search tree is determined by the length of the concepts in the definitions (because of the choice of selection function). The experiment described in Figure 6.7 suggests that the increase in complexity with the size of the TBox is controlled in a better way by the heuristics.

All the previously described experiments depend on features of the TBoxes that are known prior to debugging. This could be useful if we want to decide automatically which tool to use for which terminology. A second class of features are those that are inherent to debugging, i.e., which can only be determined by running a debugger on the data-sets. One such feature of our TBoxes that might be relevant to better understand the runtime behaviour of our three tools is the number of MIPS of an incoherent terminology.

Figure 6.8: Runtime (in seconds) depending on the number of MIPS

Figure 6.8 shows the runtimes of the three tools with respect to the number of MIPS of the incoherent terminologies. The results are highly interesting, as they differ strongly from the previous ones, and show a very distinct behaviour for each of the three tools. Whereas RacerPro's runtime remains almost constant for an increasing number of MIPS, and MUPSter shows an almost linear increase (up to 11 MIPS), DION exhibits clearly exponential behaviour.

For the values above 10, the numbers have to be read with care, as there are too few examples to be statistically valid (only 6 in total). Nevertheless the data is interesting, as it clearly shows DION's difficulty in dealing with incoherent TBoxes with large numbers of MIPS.

  #MUPS   Name of TBox         Run-time DION   Run-time MUPSter   Incoherence (RacerPro)
  13      tbox60 7 1 3 7 v1    74.59 sec       26.40 sec          1.64 sec
  11      tbox40 7 1 3 6 v1    46.07 sec       4.76 sec           1.38 sec
  11      tbox80 7 1 5 5 v1    38.33 sec       2.58 sec           1.97 sec
  10      tbox40 6 1 3 6 v1    15.65 sec       1.84 sec           1.20 sec
  10      tbox80 5 1 4 7 v1    10.18 sec       2.67 sec           1.30 sec
  10      tbox90 7 1 6 7 v1    46.67 sec       3.79 sec           2.10 sec

How can we explain the increasing run-time of the debuggers, particularly of DION, for an increasing number of MIPS? Both MUPSter and DION implement the same method to calculate MIPS from MUPS. As an increasing number of MIPS is directly related to an increasing number of MUPS, the difference can be found here: remember that MUPSter calculates all MUPS at the same time from a fully expanded tableau. This means that calculating several MUPS is not more difficult than calculating fewer MUPS, as the initial time-consuming act is the expansion of the tableau. However, the number of MUPS is directly related to the breadth of the search-tree of DION, which adds one exponential factor for each additional branch in the tree.

Remember that the number of MIPS is also correlated with the number of “modelling errors”, in the sense that it is more likely that there are more MIPS if there are more erroneous axioms. This could explain the odd behaviour of DION for the class of TBoxes with the smallest ratio (10%) of disjunctions described in Figure 6.5. For concepts with fewer branches, there will be more possible contradictions than for concepts with more disjunctions. This implies that there will be more MIPS in the class of TBoxes with a 10% probability that a Boolean operator is a disjunction than in the classes where the probability is 30% or 50%, which in turn explains the run-time behaviour of DION.

As the latest experiments show, studying the properties of the results of the debugging process seems to be a fruitful line of work, which could lead to an explanation of the run-time behaviour of DION on the realistic terminologies studied in Section 6.1.1.

There are a number of open questions that we have to leave for future research:

• What is the influence of other properties of TBoxes, e.g. the number of MUPS or the average size of MIPS and MUPS?

• What is the relation between the a-posteriori features (#MIPS etc.) and the data-set? Can we explain the peculiarities of some results by secondary properties of the benchmark?

In this report we focused on a-priori properties of TBoxes, i.e. properties that can be determined before diagnosis and debugging, because one of our initial research goals was to determine the best tool for a particular class of incoherent terminologies.

Answering the research questions

Let us end this section by answering the research questions we presented in Section 5.1. More concretely, there are three questions regarding the computational properties of debugging, our tools, and particular subclasses of incoherent terminologies, which we can now answer on the basis of the experiments of the previous sections.

Can debugging be performed efficiently? Realistic ontologies can be debugged in some, but not all, cases. Overall, MUPSter shows better performance on our real-world examples than DION, which probably has to do with the large number of MIPS. On the other hand, the other two benchmarks show a more fine-grained picture: given our own benchmark we can conclude that both methods scale quite well with the overall size of the terminology and even the average length of the concepts. Finally, the DL benchmark, where complex reasoning is required, shows that optimised reasoners can be used for efficient (if incomplete) debugging, but also that non-optimised techniques (such as the one employed by MUPSter, which is naive w.r.t. the logical complexity of the reasoning) necessarily fail. It is an open question whether there are any complete approaches that might scale up in this case.

Diagnosis is a much harder problem, as it requires the calculation of all minimal conflict sets and, on top of this, an NP-complete minimisation procedure.

What makes an incoherent terminology difficult to debug? The larger an incoherent terminology, the more difficult it becomes. It seems that the increase in run-time is linear with increasing TBox size, but exponential in the average size of the concepts. On the other hand, the results of our experiments with our purpose-built benchmark regarding the influence of the ratio of negated versus non-negated atoms and of disjunctions versus conjunctions do not show a coherent influence on the run-time, and point to differences between the two computational approaches rather than to a general difficulty of debugging an incoherent terminology. Finally, the runtime w.r.t. the number of MUPS gives a clear indication that the complexity increases with an increase in the number of modelling errors.

Which are the most appropriate methods to calculate? There are two critical cases: first, the complexity of the standard reasoning may be so high that MUPSter's unoptimised tableau calculus fails. Then, DION is a good choice, as it uses an optimised DL reasoner. In the second case, there are multiple errors, which might hamper DION's efficiency. In this case, MUPSter can often be more efficient, as it uses a single procedure which is independent of the number of errors. In all other cases, our results seem to indicate that both methods perform comparably.

6.2 Experiments to evaluate terminological diagnosis

With a number of experiments we studied the feasibility of diagnosis. We implemented the three techniques described in the previous section in Java,3 and applied them to the terminologies introduced in Section 5.2.1.

Table 6.2 summarizes the quantitative results of diagnosis on the 6 incoherent terminologies introduced above. All experiments were performed on a Pentium III, 1.3 GHz, RedHat Linux.

3 Implementations and test sets will be made publicly available.

            Maximal CS          Small CS            Minimal CS
            #D/hr   timeD1      #D/hr   timeD1      #D/hr   timeD1
  DICE-A    0       -           4       1622 s      27      151 s
  MGED      0       -           10      40 s        58      31 s
  Geo       -       -           8       114 s       115     62 s
  S&C       0       -           0       -           0       -
  WINE      6       37 s        /       /           /       /
  MadC      4       12 s        /       /           /       /

Table 6.2: Comparing Diagnosis with different types of conflict sets

The first 4 columns summarize information about the terminologies: the number of axioms, unsatisfiable concepts and MIPS, as well as the average size of the MIPS. Column 5 gives the length of the smallest diagnosis, Column 6 the time RACER needs to check the terminology for incoherence. The diagnostic results are split into three pairs: the first two, columns 7 and 8 (labeled Maximal CS), give

• the number of diagnoses calculated per hour (#D/hr), and

• the time to calculate the first diagnosis (timeD1)

based on maximal conflict sets. Similarly, columns 9 and 10 give these numbers for small, and columns 11 and 12 for minimal conflict sets (calculated using MIPS). The runtime in column 12 contains the calculation of unsatisfiability using RACER, the calculation of the MIPS, and, finally, that of the hitting sets.

Analysis The most significant result is the almost complete failure to calculate diagnoses using the naive maximal hitting set approach. Only for the toy examples of the WINE and MadC terminologies are any diagnoses found. The reason for this is the length of the minimal diagnosis and the number of axioms. As all but one axiom belong to the maximal conflict sets, there are (#ax-1) branches at the first level, and (#ax-1)*(#ax-2) branches at level 2. To find a diagnosis of length 3, a branch of depth 3 has to be explored, which means, e.g., for DICE-A a total number of 100 million branches. Only small ontologies with small diagnoses can be debugged in this most general way.
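As a back-of-the-envelope check of these branch counts (with n the number of axioms of the terminology; we do not repeat the exact size of DICE-A here), the HS-tree over maximal conflict sets grows as

\[
(n-1) \ \text{branches at level 1}, \qquad
(n-1)(n-2) \ \text{at level 2}, \qquad
(n-1)(n-2)(n-3) \approx n^{3} \ \text{at level 3},
\]

and the quoted figure of about 10^8 depth-3 branches is reached as soon as (n-1)(n-2)(n-3) = 10^8, i.e. for n of roughly 465 axioms. This is why only small ontologies with short diagnoses survive the naive approach.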

Things look better for the other methods. Both detect diagnoses of size up to 8 for large terminologies such as DICE-A or GEO. Again, all depends on the size of the diagnoses and the number of axioms. Still, the results were unsatisfactory: there was not a single algorithm that determined any diagnoses for the merged SUMO and CYC ontology, and only once did the algorithm terminate within an hour, namely when checking DICE-A using minimal conflict sets.

For this latter method (based on minimal conflict sets) the computational difficulty lies in the fact that constructing minimal hitting sets from MIPS corresponds to calculating prime implicants for a propositional formula, and is thus an NP-complete problem. Although our prototypical implementation uses some optimization, it is not efficient enough to build and search very large HS-Trees. The intermediate implementation based on small conflict sets could be made significantly faster by using an optimized reasoner to return conflict sets more efficiently. From manual inspection we believe that the size of the conflict sets (and thus the size of the HS-Tree) would not be much smaller, but the time to find the small conflict sets could be significantly lower. In both cases, however, the theoretical (and practical) complexity is very high.
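To illustrate the hitting-set step itself, the following deliberately naive Java sketch (our own illustration, not the prototype implementation; all names are assumptions) searches candidate axiom sets in order of increasing size and returns the smallest diagnoses, i.e. the smallest sets that hit every conflict set. Its worst case is exponential in the number of axioms occurring in the MIPS, which is exactly the complexity barrier discussed above.

import java.util.*;

public class HittingSets {
    // true iff 'candidate' contains at least one axiom of every conflict set
    static boolean hits(Set<String> candidate, List<Set<String>> conflictSets) {
        for (Set<String> cs : conflictSets)
            if (Collections.disjoint(candidate, cs)) return false;
        return true;
    }

    // Breadth-first search over subsets of the axioms occurring in the given MIPS,
    // returning the hitting sets of minimum cardinality (each is a smallest diagnosis).
    static List<Set<String>> smallestDiagnoses(List<Set<String>> mips) {
        List<String> axioms = new ArrayList<>(new TreeSet<>(
                mips.stream().flatMap(Set::stream).toList()));
        List<Set<String>> frontier = List.of(Set.of());
        while (!frontier.isEmpty()) {
            List<Set<String>> found = frontier.stream().filter(c -> hits(c, mips)).toList();
            if (!found.isEmpty()) return found;
            List<Set<String>> next = new ArrayList<>();
            for (Set<String> c : frontier)
                for (String ax : axioms)
                    if (!c.contains(ax)) {
                        Set<String> bigger = new HashSet<>(c);
                        bigger.add(ax);
                        if (!next.contains(bigger)) next.add(bigger); // avoid duplicates
                    }
            frontier = next; // explore candidates one axiom larger than before
        }
        return List.of(); // no hitting set exists (e.g. an empty conflict set was given)
    }

    public static void main(String[] args) {
        // Toy MIPS: removing axiom "b" repairs the first two, removing "d" the third.
        List<Set<String>> mips = List.of(Set.of("a", "b"), Set.of("b", "c"), Set.of("d"));
        System.out.println(smallestDiagnoses(mips)); // the single smallest diagnosis {b, d}
    }
}

An HS-Tree prunes this search considerably, but the underlying minimisation remains NP-complete, which matches the behaviour observed above.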

A word of caution It should be mentioned that the run-times of diagnosis and debugging cannot be compared, as the experiments were performed on different computers. Also, the evaluation of diagnosis has to be studied with care. Given that the algorithms based on HS-Trees calculate diagnoses in order of increasing size, the given numbers can be slightly misleading: remember that timeD1 denotes the time needed to calculate the first diagnosis, and #D/hr the number of diagnoses calculated in one hour. The evaluation of diagnosis as presented here is meant as a comparison of the three different methods (based on small, minimal and maximal conflict sets), and not as a comparison with debugging. This has to be left for future research.


Chapter 7

Concluding remarks

In previous work we presented a framework for debugging and diagnosis, which was published as Sekt Deliverable D3.6.1. This research was continued for this report, in which we evaluate the previously defined methods in some detail.

This is done in two ways: first, we study the effectiveness of our proposal in a qualitative way with some practical examples. Secondly, in a quantitative and statistical analysis, we try to get a better understanding of the computational properties of the debugging problem and our algorithms for solving it.

In the qualitative part we discuss two simple incoherent terminologies to explain the functionality and particular application scenarios of our debugging framework. This framework was then applied to two incoherent terminologies which were used at the Academic Medical Center, Amsterdam, for the admission of patients to Intensive Care units. This evaluation is described in Chapter 4.

For the statistical part we conducted three sets of experiments in order to evaluate the debugging problem and our algorithms and tools. First, we applied our two debuggers DION and MUPSter to a set of real-world terminologies collected from our applications and the WWW. Secondly, we translated an existing test-set for Description Logic satisfiability into an incoherence problem, and finally, we created our own benchmark.

The results are mixed: in general we can conclude that our proposed framework for debugging is useful in practice. Our experience showed that a number of heuristics and additional measures, such as the Generalised MIPS, the weight of MIPS or the pinpoints, are crucial in practical applications.

In terms of computational behaviour our results are slightly less positive: for each of the described methods there are relatively simple cases where debugging fails to terminate in reasonable time. For example, the DION tool, which performs slightly better on average, fails to compute MIPS for a number of real-world terminologies, where MUPSter comes up with a solution within minutes. On the other hand, the more complex the reasoning gets, the worse MUPSter's performance becomes.


Our experiments gave us some better understanding of the properties of incoherent terminologies that influence the debugging time, most notably the complexity of the definitions and the number of logical modeling errors. There are, however, still several open questions, mostly related to the combination of properties. This means that an estimation of the run-time of the debuggers on the basis of the structure of the incoherent terminology is still very difficult.

It has to be mentioned that we restrict our attention purely to the evaluation of logical debugging methods. Nowadays, there is much research on conceptual approaches to debugging, which are out of the scope of this paper. Also, our evaluation has some severe technical restrictions, most importantly the restriction to unfoldable ALC terminologies.

In future research the methods for debugging have to be extended so that they can deal with general ontologies. But the general properties we analysed in this paper will remain the same: debugging will be possible, but computationally difficult. An interesting result is the comparison of the two methods, top-down versus bottom-up, which both have their merits in different cases. This might be an argument to extend the MUPSter approach to more expressive languages and non-restricted ontologies.


[28] Hai Wang, Matthew Horridge, Alan Rector, Nick Drummond, and Julian Seiden-berg. A heuristic approach to explain the inconsistency in OWL ontologies. In N. F.Noy, editor,Proceeding of the Protege Conference, Madrid, pages 65–68. StanfordKSL, 2005.