
Enabling Technologies for Interoperability

Ubbo Visser, Heiner Stuckenschmidt, Holger Wache, Thomas Vögele

TZI, Center for Computing Technologies, University of Bremen

D-28215 Bremen, Germany
{visser|heiner|wache|vogele}@informatik.uni-bremen.de

Abstract

We present a new approach that aims to minimize the numerous problems standing in the way of fully interoperable GIS. We discuss these heterogeneity problems and argue that they must be solved to achieve interoperability. The problems are addressed on three levels: the syntactic, structural, and semantic level. In addition, we identify the requirements for an approach that performs semantic translation for interoperability and introduce a uniform description of contexts. Furthermore, we discuss a conceptual architecture, Buster (Bremen University Semantic Translation for Enhanced Retrieval), which can provide intelligent information integration based on a re-classification of information entities into a new context. Lastly, we illustrate our approach by sketching a real-life scenario.

Introduction

Over the last few years, much work has been conducted on the research topic of fully interoperable GIS. Vckovski (Vckovski, 1998), for example, gives an overview of the problems regarding data integration and geographical information systems. Furthermore, the proceedings of the 2nd International Conference on Interoperating Geographic Information Systems (Interop99) (Vckovski et al., 1999) contain numerous contributions on this research topic (e.g. (Wiederhold, 1999), (Landgraf, 1999)). In addition, recent studies in areas such as data warehousing (Wiener et al., 1996) and information integration (Galhardas et al., 1998) have also addressed interoperability problems.

GISs share the need to store and process large amounts of diverse data, which is often geographically distributed. Most GISs use specific data models and databases for this purpose. This implies that making new data available to the system requires the data to be transferred into the system's specific data format, a process which is very time-consuming and tedious. Data acquisition, automatic or semi-automatic, often makes large-scale investment in technical infrastructure and/or manpower inevitable. These obstacles are part of the motivation behind the concept of information integration, which applies here because existing information can be accessed by remote systems in order to supplement their own data basis.

Successful information integration offers several obvious advantages:

• Quality improvement of data due to the availability of larger and more complete data sets.

• Improvement of existing analyses and the application of new analyses.

• Cost reduction resulting from the multiple use of existing information sources.

• Avoidance of redundant data and of the conflicts that can arise from redundancy.

However, before we can establish efficient information integration, difficulties arising from organizational issues, questions of competence, and many other technical problems have to be solved. Firstly, a suitable information source must be located which contains the data needed for a given task. Once the information source has been found, access to the data contained therein has to be provided, on both a technical and an informational level. In short, information integration not only needs to provide full accessibility to the data; it also requires that the accessed data can be interpreted by the remote system. While the problem of providing access to information has largely been solved by the invention of large-scale computer networks, the problem of processing and interpreting retrieved information remains an important research topic. This paper addresses three of the problems mentioned above:

• finding suitable information sources,

• enabling a remote system to process the accessed data, and

• helping the remote system to interpret the accessed data as intended by its source.

In addressing these questions we will explore technologies which enable systems to interoperate, always bearing in mind the special needs of GIS.


Levels of Integration

Our modern information society requires complete access to all available information. The opening of information systems towards integrated access, which has been encouraged in order to satisfy this demand, creates new challenges for many areas of computer science. In this paper, we distinguish different integration tasks that need to be solved in order to achieve complete integrated access to information:

Syntactic Integration: Many standards have evolved that can be used to integrate different information sources. Besides classical database interfaces such as ODBC, web-oriented standards such as HTML and XML are gaining importance.

Structural Integration: The first problem that goes beyond the purely syntactic level is the integration of heterogeneous structures. This problem is normally solved by mediator systems defining mapping rules between different information structures.

Semantic Integration: In the following, we use the terms semantic integration and semantic translation, respectively, to denote the resolution of semantic conflicts that make a one-to-one mapping between concepts or terms impossible.

Our approach provides an overall solution to the problem of information integration, taking into account all three levels of integration and combining several technologies, including standard markup languages, mediator systems, ontologies, and a knowledge-based classifier.

Enabling Technologies

In order to overcome the obstacles mentioned earlier, it is not sufficient to solve the heterogeneity problems separately. It is important to note that these problems can only be solved by a system that takes all three levels of integration into account. In the following subsections we give a short introduction to what we mean by problems concerning syntactic, structural, and semantic integration.

Syntactic Integration

The typical task of syntactic data integration is to specify the information source on a syntactic level. This means that different data type problems can be solved (e.g. short int vs. int and/or long). This first data abstraction is used to re-structure the information source.

The standard technology for overcoming problems on this level is the wrapper. Wrappers hide the internal data structure model of a source and transform its contents into a uniform data structure model.
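To make this concrete, here is a minimal Python sketch (our own illustration, not the actual implementation) of a wrapper that hides a source-specific format behind a uniform record interface; the class name and the semicolon-delimited sample source are hypothetical:

    import csv
    import io
    from typing import Dict, Iterator

    class CSVWrapper:
        """Hypothetical wrapper for delimited text sources. It hides
        the file layout (delimiter, header line) and yields records
        in a uniform data structure model: plain dicts."""

        def __init__(self, text: str, delimiter: str = ","):
            self._text = text
            self._delimiter = delimiter

        def records(self) -> Iterator[Dict[str, str]]:
            reader = csv.DictReader(io.StringIO(self._text),
                                    delimiter=self._delimiter)
            for row in reader:
                yield dict(row)

    # A mediator (next subsection) would consume any wrapper through
    # the same records() interface, regardless of the source format:
    source = "parcel_id;land_use\n42;forest\n43;pasture"
    for rec in CSVWrapper(source, delimiter=";").records():
        print(rec)  # {'parcel_id': '42', 'land_use': 'forest'} ...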

Structural Integration

The task of structural data integration is to re-format the data structures into a new, homogeneous data structure. This can be done with the help of a formalism that is able to construct one specific information source out of numerous other information sources. This is a classical middleware task, which can be performed with CORBA (OMG, 1992) on a low level or with rule-based mediators (Wiederhold, 1992) on a higher level.

Mediators provide flexible integration of several information systems such as database management systems, GIS, or the World Wide Web. A mediator combines, integrates, and abstracts the information provided by the sources. Normally, the sources are encapsulated by wrappers.

Over the last few years, numerous mediators have been developed. A popular example is the rule-driven TSIMMIS mediator (Chawathe et al., 1994), (Papakonstantinou et al., 1996). The rules in the mediator describe how information from the sources can be mapped to the integrated view. In simple cases, a rule mediator converts the information of the sources into information on the integrated view. The mediator uses the rules to split the query, which is formulated with respect to the integrated view, into several sub-queries, one for each source, and combines the results according to a query plan.
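As a rough illustration of this rule idea (a Python sketch in the spirit of rule-driven mediation, not the actual TSIMMIS syntax; both sources and all field names are hypothetical), the following mediator splits a query against the integrated view into one sub-query per source and joins the partial results:

    from typing import Callable, Dict, Iterable, List

    def cadastre_source(region: str) -> Iterable[dict]:
        # stand-in for a wrapped source holding parcel geometry
        return [{"parcel_id": "42", "region": region, "area_ha": 1.5}]

    def landuse_source(region: str) -> Iterable[dict]:
        # stand-in for a wrapped source holding land-use attributes
        return [{"parcel_id": "42", "land_use": "forest"}]

    # Each rule names the source function that can answer a
    # sub-query for part of the integrated view.
    RULES: List[Callable[[str], Iterable[dict]]] = [
        cadastre_source,  # contributes area_ha
        landuse_source,   # contributes land_use
    ]

    def mediate(region: str) -> List[dict]:
        """Split the query into sub-queries and join the partial
        results on parcel_id (a trivial 'query plan')."""
        merged: Dict[str, dict] = {}
        for source in RULES:
            for row in source(region):
                merged.setdefault(row["parcel_id"], {}).update(row)
        return list(merged.values())

    print(mediate("Bremen"))
    # [{'parcel_id': '42', 'region': 'Bremen', 'area_ha': 1.5,
    #   'land_use': 'forest'}]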

A mediator has to solve the same problems that are discussed in the federated database research area, i.e. structural heterogeneity (schematic heterogeneity) and semantic heterogeneity (data heterogeneity) (Kim and Seo, 1991), (Naiman and Ouksel, 1995), (Kim et al., 1995). Structural heterogeneity means that different information systems store their data in different structures. Semantic heterogeneity concerns the content and semantics of an information item. In rule-based mediators, rules are mainly designed to reconcile structural heterogeneity, whereas discovering semantic heterogeneity problems and reconciling them plays a subordinate role. But for the reconciliation of semantic heterogeneity problems, the semantic level must also be considered. Contexts are one possibility for describing the semantic level. A context contains "meta data relating to its meaning, properties (such as its source, quality, and precision), and organization" (Kashyap and Sheth, 1997). A value has to be considered in its context and may be transformed into another context (so-called context transformation).

Semantic Integration

The semantic integration process is by far the most complicated and presents a real challenge. As with database integration, semantic heterogeneities are the main problems that have to be solved within spatial data integration (Vckovski, 1998). Other authors from the GIS community call this problem inconsistencies (Shepherd, 1991). Worboys and Deen (Worboys and Deen, 1991) have identified two types of semantic heterogeneity in distributed geographic databases:


• Generic semantic heterogeneity: heterogeneity resulting from field- and object-based databases.

• Contextual semantic heterogeneity: heterogeneity based on different meanings of concepts and schemes.

Generic semantic heterogeneity stems from the different concepts of space or data models being used. In this paper, we focus on contextual semantic heterogeneity, which is based on different semantics of the local schemata.

In order to discover semantic heterogeneities, a formal representation is needed. Lately, standardized WWW markup languages such as XML and RDF have been developed by the W3C community for this purpose (W3C, 1998), (W3C, 1999). We will describe the value of these languages for the semantic description of concepts and also argue that more sophisticated approaches are needed to overcome the semantic heterogeneity problem.

Ontologies have been identified as useful for the integration/interoperation process (Visser et al., 2000). The advantages and disadvantages of this technology will be discussed in a separate subsection.

Ontologies can be used to describe information sources. However, how does the actual integration process work? This will be briefly discussed in the following subsections. We call this process semantic mapping.

XML/RDF and semantic modeling: XML and RDF have been developed for the semantic description of information sources.

XML – Exchanging Information: In order to overcome the purely visualization-oriented annotation provided e.g. by HTML, XML was proposed as an extensible language allowing the user to define his own tags in order to indicate the type of their content. It follows that the main benefit of XML actually lies in the opportunity to exchange data in a structured way. Recently, this idea has been emphasized by the introduction of XML schemata, which can be seen as a definition language for data structures. In the following paragraphs we sketch the idea behind XML and describe XML schema definitions and their potential use for data exchange.

The General Idea: A data object is said to be an XML document if it follows the guidelines for well-formed XML documents provided by the W3C community. The specification provides a formal grammar used in well-formed documents. In addition to the general grammar, the user can impose further grammatical constraints on the structure of a document using a document type definition (DTD). An XML document is valid if it has an associated type definition and complies with the grammatical constraints of that definition. A DTD specifies the elements that can be used in an XML document. In the document, an element is delimited by a start and an end tag; it has a type and may have a set of attribute specifications, each consisting of a name and a value.

The additional constraints in a DTD refer to the logical structure of the document; this especially includes which nesting of tags inside the information body is allowed and/or required. Further restrictions that can be expressed in a DTD concern the types of the attributes and the default values to be used when no attribute value is provided.
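As a small, self-contained illustration (our own Python example; the parcel vocabulary is invented), the following distinguishes well-formedness from validity. Parsing succeeds only for well-formed documents; checking validity against a DTD would additionally require a validating parser, which the standard library does not provide:

    import xml.etree.ElementTree as ET

    # A record annotated with user-defined tags.
    doc = """
    <parcel id="42">
      <land_use>forest</land_use>
      <area unit="ha">1.5</area>
    </parcel>
    """

    # Well-formedness check: parsing succeeds only if tags nest
    # properly and every start tag has a matching end tag.
    root = ET.fromstring(doc)
    print(root.tag, root.attrib)          # parcel {'id': '42'}
    print(root.find("area").get("unit"))  # ha

    # A corresponding DTD (validity, not just well-formedness)
    # would constrain nesting and attributes, for example:
    #   <!ELEMENT parcel (land_use, area)>
    #   <!ATTLIST parcel id CDATA #REQUIRED>
    #   <!ELEMENT land_use (#PCDATA)>
    #   <!ELEMENT area (#PCDATA)>
    #   <!ATTLIST area unit CDATA "ha">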

Schema Definitions and Mappings: An XML schema is itself an XML document, defining the valid structure of other XML documents in the spirit of a DTD. The elements used in a schema definition are of the type 'element' and have attributes that define the restrictions already mentioned above. The information in such an element is a list of further element definitions that have to be nested inside the defined element.

Furthermore, XML schemata have some additional features that are very useful for defining data structures, such as:

• Support for basic data types.

• Constraints on attributes, such as occurrence constraints.

• Sophisticated structures, such as type definitions derived by extending or restricting other types.

• A name-space mechanism allowing the combination of different schemata.

We will not discuss these features at length. However, it should be mentioned that these additional features make it possible to encode rather complex data structures. This enables us to map the data models of applications whose information we want to share with others onto an XML schema. From this point, we can encode our information in terms of an XML document and make it (together with the schema, which is also an XML document) available over the internet.

This procedure has great potential for the actual exchange of data. However, a user must commit to our data model in order to make use of the information. We must point out that an XML schema defines the structure of data while providing no information about its content or its potential use for others. It therefore lacks an important advantage of meta-information.

We have argued that XML is designed to provide an interchange format for weakly structured data by defining the underlying data model in a schema and by using annotations from the schema in order to clarify the role of single statements. Two things about this claim are important from the information-sharing point of view:

• XML is purely syntactic/structural in nature.

• XML describes data on the object level.


Consequently, we have to find other approaches if we want to describe information on the meta level and define its meaning. In order to fill this gap, the RDF standard has been proposed as a data model for representing meta-data about web pages and their content using an XML syntax.

RDF – A Standard Format: The basic model underlying RDF is very simple: every piece of information about a resource, which may be a web page or an XML element, is expressed in terms of a triple (resource, property, value).

Thereby, the property is a two-placed relation that connects a resource to a certain value of that property. This value can be of a simple data type or be a resource. Additionally, the value can be replaced by a variable representing a resource that is further described by nested triples making assertions about the properties of the resource represented by the variable. Furthermore, RDF allows multiple values for a single property. For this purpose, the model contains three built-in data types called collections, namely unordered lists (bag), ordered lists (seq), and sets of alternatives (alt), providing a kind of aggregation mechanism.

A further requirement arising from the nature of the web is the need to avoid name clashes that might occur when referring to different web sites that use different RDF models to annotate meta-data. RDF defines name-spaces for this purpose. Name-spaces are defined by referring to a URL that provides the names and binding it to a source id that is then used to annotate each name in an RDF specification, defining the origin of that particular name: source id:name.
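To illustrate the triple model and the name-space prefixing (a minimal Python sketch of the data model only; the URLs and property names are invented, and real applications would use dedicated RDF tooling):

    # RDF statements as (resource, property, value) triples; names
    # are qualified via a source id bound to a URL to avoid clashes.
    NAMESPACES = {"geo": "http://example.org/geo-schema#"}

    def q(source_id: str, name: str) -> str:
        """Qualify a name, i.e. resolve source_id:name to a full URI."""
        return NAMESPACES[source_id] + name

    triples = [
        ("http://example.org/map42", q("geo", "theme"), "land use"),
        ("http://example.org/map42", q("geo", "scale"), "1:25000"),
        # Multiple values for one property; RDF would bundle these
        # in a container (bag, seq, or alt) for aggregation.
        ("http://example.org/map42", q("geo", "source"), "ATKIS"),
        ("http://example.org/map42", q("geo", "source"), "CORINE"),
    ]

    for resource, prop, value in triples:
        print(resource, prop, value)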

A standard syntax has been developed to express RDF statements, making it possible to identify the statements as meta-data and thereby providing a low-level language for expressing the intended meaning of information in a machine-processable way.

RDF/S – A Basic Vocabulary: The very simple model underlying ordinary RDF descriptions leaves a lot of freedom for describing meta-data in arbitrary ways. However, if people want to share this information, there has to be agreement on a standard core vocabulary, in terms of modeling primitives, that should be used to describe meta-data. RDF schemas (RDF/S) attempt to provide such a standard vocabulary.

A closer look at the modeling components reveals that RDF/S actually borrows from frame systems, well known from the area of knowledge representation. RDF/S provides notions of concepts (class), slots (property), inheritance (SubClassOf, SubPropertyOf) and range restrictions (constraint properties). Unfortunately, no well-defined semantics exist for these modeling primitives in the current state. Further, parts such as the reification mechanism are not well defined even on an informal level. Lastly, there is no reasoning support available, not even for property inheritance.

Semantic modeling: After introducing the W3C standards for information exchange and meta-data annotation, we have to investigate their usefulness for information integration with reference to the three levels of integration (see the section on levels of integration). Firstly, as noted above, XML is only concerned with the issue of syntactic integration. XML defines structures as well, but there are no sophisticated mechanisms for mapping different structures. Secondly, RDF is designed to provide some information on the semantic level by enabling us to include meta-information in the description of a web page. As mentioned in the last section, RDF in its current state fails to really provide semantic descriptions; rather, it provides a common syntax and a basic vocabulary that can be used when describing this meta-data. Fortunately, the designers of RDF are aware that there is a strong need for an additional 'logical level' which defines clear semantics for RDF expressions and provides a basis for integration mechanisms.

Our conclusion about the current web standards is that using XML, and especially XML schemata, is a suitable way of exchanging data with a well-defined syntax and structure. Furthermore, simple RDF provides a uniform syntax for exchanging meta-information in a machine-readable format. However, in their current state neither XML nor RDF provides sufficient support for the integration of heterogeneous structures or different meanings of terms. There is a need for semantic modeling and for reasoning about structure and meaning. Promising candidates for semantic modeling approaches can be found in the area of knowledge representation as well as in the distributed databases community. We will discuss some of these approaches in the following section.

Ontologies: Recently, the use of formal ontologies to support information systems has been discussed (Guarino, 1998), (Bishr and Kuhn, 1999). The term 'ontology' has been used in many ways and across different communities (Guarino and Giaretta, 1995). If we want to motivate the use of ontologies for information integration, we have to define what we mean when we refer to ontologies. In the following sections, we introduce ontologies as an explication of some shared vocabulary or conceptualization of a specific subject matter. Further, we describe the way an ontology explicates concepts and their properties, and finally argue for the benefit of this explication in many typical application scenarios.

Shared Vocabularies and Conceptualizations: In general, each person has an individual view of the world and the things he or she has to deal with every day. However, there is a common basis of understanding in terms of the language we use to communicate with each other. Terms from natural language can therefore be assumed to be a shared vocabulary relying on a (mostly) common understanding of certain concepts with very little variety. This common understanding relies on a specific idea of how the world is organized. We often call these ideas a conceptualization of the world. Such conceptualizations provide a terminology that can be used for communication between people.

The example of our natural language demonstrates that a conceptualization cannot be universally valid, but rather holds for a limited number of persons committed to it. This fact is reflected in the existence of different languages, which differ more (English and Japanese) or less (German and Dutch). The confusion becomes worse when we consider terminologies developed for special scientific or economic areas. In these cases, we often find situations where one term refers to different phenomena; the use of the term 'ontology' in philosophy and in computer science serves as an example. The consequence of this confusion is a separation into different groups that share a terminology and its conceptualization. These groups are called information communities.

The main problem with the use of a shared terminology according to a specific conceptualization of the world is that much information remains implicit. When a mathematician talks about a binomial coefficient, he is referring to a wider scope than just the formula itself. He will possibly also consider its interpretation (the number of subsets of a certain size) and its potential uses (e.g. estimating the chance of winning a lottery).

Ontologies set out to overcome this problem of implicit and hidden knowledge by making the conceptualization of a domain (e.g. mathematics) explicit. This corresponds to one of the definitions of the term ontology most popular in computer science (Gruber, 1993):

An ontology is an explicit specification of a conceptualization.

An ontology is used to make the assumptions about the meaning of a term available. It can also be viewed as an explication of the context a term is normally used in. Lenat (Lenat, 1998), for example, describes context in terms of twelve independent dimensions that have to be known in order to understand a piece of knowledge completely. He also demonstrates how these dimensions can be explicated using the 'Cyc' ontology.

Specification of Context Knowledge: There are many different ways in which an ontology may explicate a conceptualization and the corresponding context knowledge. The possibilities range from a purely informal natural-language description of a term, corresponding to a glossary, up to a strictly formal approach with the expressive power of full first-order predicate logic or even beyond (e.g. Ontolingua (Gruber, 1991)). Jasper and Uschold (Jasper and Uschold, 1999) distinguish two dimensions along which the mechanisms for the specification of context knowledge by an ontology can be compared:

Level of Formality: The specification of a conceptualization and its implicit context knowledge can be done at different levels of formality. As already mentioned above, a glossary of terms can also be seen as an ontology, despite its purely informal character. A first step towards more formality is to describe a structure to be used for the description. A good example of this approach is the standard web annotation language XML (see the section on XML): a DTD is an ontology describing the terminology of a web page on a low level of formality. Unfortunately, the rather informal character of XML encourages its misuse. While the hierarchy of an XML specification was originally designed to describe layout, it can also be exploited to represent sub-type hierarchies (van Harmelen and Fensel, 1999), which may lead to confusion. This problem can be solved by assigning formal semantics to the structures used for the description of the ontology. An example of this is the conceptual modeling language CML (Schreiber et al., 1994). CML offers primitives for describing a domain which can be given a formal semantics in terms of first-order logic (Aben, 1993). However, this formalization is only available for the structural part of a specification; assertions about terms and the description of dynamic knowledge are not formalized, which leaves total freedom for the description. On the other hand, there are specification languages which are completely formal. A prominent example is the Knowledge Interchange Format (KIF) (Genesereth and Fikes, 1992), which was designed to enable different knowledge-based systems to exchange knowledge. KIF has been used as a basis for the Ontolingua language (Gruber, 1991), supplying formal semantics to that language as well.

Extent of Explication: The other comparison criterion is the extent of explication that is reached by the ontology. This criterion is strongly connected with the expressive power of the specification language used. We already mentioned DTDs, which are mainly a simple hierarchy of terms. We can generalize this by saying that the least expressive specification of an ontology consists of an organization of terms in a network using two-placed relations. This idea goes back to the use of semantic networks in the seventies. Many extensions of this basic idea have been proposed. One of the most influential was the use of roles that could be filled by entities of a certain type (Brachman, 1977). This kind of value restriction can still be found in recent approaches. RDF schema descriptions (Brickley and Guha, 2000), which might become a new standard for the semantic description of web pages, are an example of this: an RDF schema contains class definitions with associated properties that can be restricted by so-called constraint properties. However, default values and value-range descriptions are not expressive enough to cover all possible conceptualizations. More expressive power can be provided by allowing classes to be specified by logical formulas. These formulas can be restricted to a decidable subset of first-order logic; this is the approach of description logics (Borgida and Patel-Schneider, 1994). Nevertheless, there are also approaches that allow for even more expressive descriptions. In Ontolingua, for example, classes can be defined by arbitrary KIF expressions. Beyond the expressiveness of full first-order predicate logic, there are also special-purpose languages with extended expressiveness covering the specific needs of their application area; examples are specification languages for knowledge-based systems, which often include variants of dynamic logic to describe system dynamics.

Applications: Ontologies are useful for many different applications, which can be classified into several areas. Each of these areas has different requirements on the level of formality and the extent of explication provided by the ontology. We will briefly review common application areas, namely the support of communication processes, the specification of systems and information entities, and the interoperability of computer systems.

Information communities are useful because they ease communication and cooperation among their members through a shared terminology with well-defined meaning. On the other hand, the formation of information communities makes communication between members of different information communities very difficult, generally because they do not agree on a common conceptualization. Although they may use the shared vocabulary of natural language, most of the vocabulary used in their information communities is highly specialized and not shared with other communities. This situation demands an explication and explanation of the use of terminology. Informal ontologies with a large extent of explication are a good choice to overcome these problems. While definitions have always played an important role in scientific literature, conceptual models of certain domains are rather new. Nowadays, systems analysis and related fields like software engineering rely on conceptual modeling to communicate the structure and details of a problem domain, as well as the proposed solution, between domain experts and engineers. Prominent examples of ontologies used for communication are entity-relationship diagrams and object-oriented modeling languages such as UML.

ER diagrams as well as UML are not only used for communication; they also serve as building plans for data and systems, guiding the process of building (engineering) the system. The use of ontologies for the description of information and systems has many benefits. The ontology can be used to identify requirements as well as inconsistencies in a chosen design. Further, it can help to acquire or search for available information. Once a system component has been implemented, its specification can be used for maintenance and extension purposes. Another very challenging application of ontology-based specification is the reuse of existing software. In this case, the specifying ontology serves as a basis for deciding whether an existing component matches the requirements of a given task.

Depending on the purpose of the specification, ontologies of different formal strength and expressiveness are to be utilized. While the communication of design decisions and the acquisition of additional information normally benefit from rather informal and expressive ontology representations (often graphical), the directed search for information needs a rather strict specification with a limited vocabulary to limit the computational effort. At the moment, the support of semi-automatic software reuse seems to be one of the most challenging applications of ontologies, because it requires expressive ontologies with a high level of formal strength.

The considerations discussed so far might provoke the impression that the benefits of ontologies are limited to systems analysis and design. However, an important application area of ontologies is the integration of existing systems. The ability to exchange information at run time, also known as interoperability, is a valid and important topic. The attempt to provide interoperability suffers from problems similar to those associated with communication amongst different information communities. The important difference is that the actors are not people, able to perform abstraction and common-sense reasoning about the meaning of terms, but machines. In order to enable machines to understand each other, we have to explicate the context of each system on a much higher level of formality. Ontologies are often used as inter-linguas to provide interoperability: they serve as a common format for data interchange. Each system that wants to inter-operate with other systems has to transfer its data into this common framework. Interoperability is achieved by explicitly considering contextual knowledge in the translation process.

Semantic Mapper: To appropriately support the integration of heterogeneous information sources, an explicit description of the semantics (i.e. an ontology) of each source is required. In principle, there are three ways in which ontologies can be applied:

• a centralized approach, where each source is related to one common domain ontology;

• a decentralized approach, where every source is related to its own ontology; or

• a hybrid approach, where every source is related to its own ontology, but the vocabulary of these ontologies stems from a common domain ontology.

A common domain ontology describes the semantics of the domain in the SIMS mediator (Arens et al., 1996). In the global domain model of such approaches, all terms of a domain are arranged in a complex structure. Each information source is related to the terms of the global ontology (e.g. with articulation axioms (Collet et al., 1991)). However, the scalability of such a fixed and static common domain model is low (Mitra et al., 1999), because the kind of information sources which can be integrated in the future is limited.

In OBSERVER (Mena et al., 1996) and SKC (Mitra et al., 1999), it is assumed that a predefined ontology exists for each information source. Consequently, new information sources can easily be added and removed. However, the comparison of the heterogeneous ontologies leads to many homonym and synonym problems, because each ontology uses its own vocabulary. In SKC (Mitra et al., 1999), the ontology of each source is described by graphs. Graph transformation rules are used to transport information from one ontology into another (Mitra et al., 2000). These rules can only solve the schematic heterogeneities between the ontologies.

MESA (Wache et al., 1999) uses the third, hybrid approach. Each source is related to its own source ontology. In order to make the source ontologies comparable, a common global vocabulary is used, organized in a common domain ontology. This hybrid approach provides the greatest flexibility, because new sources can easily be integrated and, in contrast to the decentralized approach, the source ontologies remain comparable.
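The following Python sketch illustrates the hybrid wiring (our own toy illustration, not the MESA implementation): each source ontology defines its concepts only in terms of a shared global vocabulary, which keeps sources independently extensible yet comparable:

    # Common domain ontology: the shared global vocabulary.
    GLOBAL_VOCABULARY = {"vegetation", "water", "built_up", "agriculture"}

    # Each source ontology defines its own concepts, but only with
    # terms drawn from the global vocabulary.
    SOURCE_A = {"forest": {"vegetation"},
                "cropland": {"vegetation", "agriculture"}}
    SOURCE_B = {"Wald": {"vegetation"},
                "Acker": {"vegetation", "agriculture"}}

    def admissible(ontology: dict) -> bool:
        """A source ontology is admissible if all defining terms
        stem from the common domain ontology."""
        return all(terms <= GLOBAL_VOCABULARY
                   for terms in ontology.values())

    def corresponds(concept_a: str, concept_b: str) -> bool:
        """Concepts from different sources correspond if their
        definitions over the shared vocabulary coincide."""
        return SOURCE_A[concept_a] == SOURCE_B[concept_b]

    assert admissible(SOURCE_A) and admissible(SOURCE_B)
    print(corresponds("forest", "Wald"))    # True
    print(corresponds("cropland", "Wald"))  # False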

In the next section we will describe how ontologies can help to solve heterogeneity problems.

BUSTER - An Approach for Comprehensive Interoperability

In the previous sections we described the methods needed to achieve syntactic, structural, and semantic interoperability. In this section, we propose the Buster approach (Bremen University Semantic Translator for Enhanced Retrieval), which provides a comprehensive solution for reconciling all heterogeneity problems.

During an acquisition phase, all information needed to provide a network of integrated information sources is acquired. This includes the acquisition of a Comprehensive Source Description (CSD) for each source, together with the Integration Knowledge (IK), which describes how information can be transformed from one source to another.

In the query phase, a user or an application (e.g. a GIS) formulates a query against an integrated view of the sources. Several specialized components in the query phase use the acquired information, i.e. the CSDs and IK, to select the desired data from several information sources and to transform it to the structure and the context of the query.

All software components in both phases are associated with three levels: the syntactic, the structural, and the semantic level. The components on each level deal with the corresponding heterogeneity problems. The components in the query phase are responsible for solving these heterogeneity problems, whereas the components in the acquisition phase use the CSDs of the sources to provide the specific knowledge for the corresponding component in the query phase. A mediator, for example, which is associated with the structural level, is responsible for the reconciliation of the structural heterogeneity problems. The mediator is configured by a set of rules that describe the structural transformation of data from one source to another. These rules are acquired in the acquisition phase with the help of the rule generator.

An important characteristic of the Buster architecture is the semantic level, where two different types of tools exist for solving the semantic heterogeneity problems. This reflects the focus of the Buster system: providing a solution for this type of problem. Furthermore, the need for two types of tools shows that the reconciliation of semantic problems is very difficult and must be supported by a hybrid architecture in which different components are combined.

In the following sections we describe the two phases and their components in detail.

Query Phase

In the query phase, a user submits a query request to one or more data sources in the network of integrated data sources. In this phase, several components on different levels interact (see Fig. 1).

On the syntactic level, wrappers are used to establish a communication channel to the data source(s) that is independent of specific file formats and system implementations. Each generic wrapper covers a specific file or data format. For example, generic wrappers may exist for ODBC data sources, XML data files, or specific GIS formats. Still, these generic wrappers have to be configured for the specific requirements of a data source.

The mediator on the structural level uses the information obtained from the wrappers and "combines, integrates and abstracts" it (Wiederhold, 1992). In the Buster approach, we use generic mediators which are configured by transformation rules (query definition rules, QDRs). These rules describe, in a declarative style, how the data from several sources can be integrated and transformed to the data structure of the original source.

On the semantic level, we use two different tools specialized for solving the semantic heterogeneity problems. Both tools are responsible for context transformation, i.e. transforming data from a source context to a goal context. There are several ways in which context transformation can be applied. In Buster we consider functional context transformation and context transformation by re-classification (Stuckenschmidt and Wache, 2000).

Figure 1: The query phase of the BUSTER architecture

In the functional context transformation, the conversion of data is done by applying predefined functions. A function is declaratively represented in Context Transformation Rules (CTRs), which describe from which source context to which goal context data can be transformed by the application of which function. The context transformation rules are invoked by the CTR-Engine. The functional context transformation can be used, for example, to transform area measures in hectares into area measures in acres, or one coordinate system into another. All context transformation rules can be described with the help of mathematical functions.
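A sketch of what such a rule and a minimal CTR-Engine lookup might look like (our own Python illustration; the context names are hypothetical, and the conversion uses the standard factor of about 2.471054 acres per hectare):

    from typing import Callable, Dict, Tuple

    # A context transformation rule (CTR) maps a (source-context,
    # goal-context) pair to a mathematical conversion function.
    CTRS: Dict[Tuple[str, str], Callable[[float], float]] = {
        ("area:hectare", "area:acre"): lambda v: v * 2.471054,
        ("area:acre", "area:hectare"): lambda v: v / 2.471054,
    }

    def transform(value: float, source_ctx: str, goal_ctx: str) -> float:
        """Apply the CTR for the given context pair; if none exists,
        the engine falls back to re-classification or rejects."""
        try:
            return CTRS[(source_ctx, goal_ctx)](value)
        except KeyError:
            raise LookupError(f"no CTR from {source_ctx} to {goal_ctx}")

    print(transform(10.0, "area:hectare", "area:acre"))  # about 24.71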

In addition to the functional context transformation, Buster also allows the classification of data into another context. This is used to automatically map the concepts of one data source to the concepts of another data source. To be more precise, the context description (i.e. the ontological description of the data) is re-classified. The source-context description to which the data is annotated is obtained from the CSD, completed with the data information, and related to goal-context descriptions. After the context re-classification, the data is simply replaced with the data annotated with the related goal context. Context re-classification together with data replacement is useful for the transformation of catalog terms, e.g. exchanging a term from the source catalog for a term from the goal catalog.

A Query Example: We demonstrate the query phase and the interaction of the components with a real-world example. The scenario features a typical user, for example an environmental engineer in a public administration, who is involved in some kind of urban planning process. The basis for his work is a set of digital maps and a GIS to view, evaluate, and manipulate these maps.

In our example, the engineer uses a set of ATKIS maps in an ArcView (ESRI, 1994) environment. ATKIS stands for "Amtliches Topographisch-Kartographisches Informationssystem", i.e. the official German information system for maps and topographical information (AdV, 1998). Among other things, the ATKIS data source offers detailed information on land-use types in urban and rural areas of Germany.

The ATKIS data sets are generated and maintained by a working group of several public agencies on the federal and state level. The complexity of the task of keeping all data sets up to date, and the underlying administrative structure, cause a certain delay in the production and delivery of new, updated maps. Consequently, the engineer in our application example is likely to work with ATKIS maps that are not quite up to date and show discrepancies with respect to features observable in reality.

The engineer needs tools to compare his potentially inconsistent base data with more recent representations of reality in order to identify potential problem areas. In our example, the CORINE land cover database (EEA, 1999) provides satellite images. From 1985 to 1990, the European Commission carried out the CORINE Programme (Co-ordination of Information on the Environment). The results are essentially of three types, which correspond to the three aims of the programme: (a) an information system on the state of the environment in the European Community has been created (the CORINE system); it is composed of a series of databases describing the environment in the European Community, as well as databases with background information. (b) Nomenclatures and methodologies were developed for carrying out the programme, which are now used as the reference in the areas concerned at the Community level. (c) A systematic effort was made to coordinate activities with all the bodies involved in the production of environmental information, especially at the international level. As a result of this activity, and indeed of the whole programme, several groups of international scientists have been working together towards agreed targets; they now share a pool of expertise on various themes of environmental information.

The technologies of syntactic, structural, and semantic integration described above can be applied to facilitate this task.

The following is a step-by-step example of what a typical user interaction with the system in the query phase could look like:

1. The user starts the query from within his native GIS tool (here: ATKIS maps in ArcView). He defines the parameters of the query, such as the properties and formats of the originating system, the specified area of interest (bounding rectangle, coordinate system, etc.), and information about the requested attribute data (here: "land use"). Then he submits the query to the network of integrated data sources.

2. The query is matched against the central network database, and a decision is made about which of the participating data sources a) cover the area of interest and b) hold information on the attribute "land use". A list of all compatible data sources is created and sent back to the user. From this list, the user selects one or more data sources and re-submits the query to the system. In our example, the engineer selects a set of CORINE land-cover satellite images.

3. The system consults the central database and retrieves the basic information needed to access the selected data source(s). This includes information about technical, syntactical, and structural details, as well as the rules needed for the access and exchange of data from these sources.

4. The information is used to select and configure suitable wrappers from a repository of generic wrappers. Once the wrappers are properly installed, a suitable mediator is selected from a repository of generic mediators. Among other things, the mediator rules describe the fields that hold the requested information (here: the fields holding land-use information). With the help of wrappers and mediators, a direct connection to the selected data source(s) can be established, and individual instances of data can be accessed.

5. For the context transformation from the source context into the query context, the mediator queries the CTR-Engine. For example, the CTR-Engine transforms area measures in hectares into area measures in acres. If the CTR-Engine cannot transform the context because no appropriate CTRs exist, it queries the re-classifier for a context mapping. In our example, the re-classifier is used to re-classify the CORINE land-use attributes of all polygons in the selected area of interest to make them consistent with the ATKIS classification scheme (a sketch of this re-classification follows after this list). If no context transformation can be performed, the mediator rejects the data.

6. The result of the whole process is a new map of the selected area that shows CORINE data re-classified to the ATKIS framework. The engineer in our example can overlay the original ATKIS set of maps with the new map. He can then apply regular GIS tools to decide immediately which areas of the ATKIS maps are inconsistent with the CORINE satellite images and consequently need to be updated.
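As a toy illustration of the re-classification in step 5 (our own Python sketch; the CORINE- and ATKIS-like class labels below are illustrative stand-ins, not the official catalog entries, and a real re-classifier reasons over ontological context descriptions rather than a fixed lookup table):

    # Catalog-term mapping from the source context (CORINE-like
    # land-cover classes) to the goal context (ATKIS-like classes).
    CORINE_TO_ATKIS = {
        "discontinuous urban fabric": "residential area",
        "coniferous forest": "forest",
        "pastures": "grassland",
    }

    def reclassify(polygons: list) -> list:
        out = []
        for poly in polygons:
            goal = CORINE_TO_ATKIS.get(poly["land_use"])
            if goal is None:
                continue  # no mapping found: the mediator rejects the data
            out.append({**poly, "land_use": goal})
        return out

    print(reclassify([{"id": 1, "land_use": "pastures"}]))
    # [{'id': 1, 'land_use': 'grassland'}]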

Data Acquisition Phase

Before the first query can be submitted, the required knowledge, namely the Comprehensive Source Description (CSD) and the Integration Knowledge (IK), has to be acquired. The first step of the data acquisition phase consists of gathering information about the data source that is to be integrated (Fig. 2). This information is stored in a source-specific database, the Comprehensive Source Description (CSD). A CSD has to be created for each data source that participates in a network of integrated data sources.

Figure 2: The data acquisition phase of the BUSTER architecture

The Comprehensive Source Description: Each CSD consists of meta-data that describe technical and administrative details of the data source, as well as its structural and syntactic schema and annotations. In addition, the CSD comprises a source ontology, i.e. a detailed and computer-readable description of the concepts stored in the data source. The CSD is attached to the respective data source. It should be available in a highly interchangeable format (for example XML) that allows easy data exchange over computer networks.

Setting up a CSD is the task of the domain specialist responsible for the creation and maintenance of the specific data source. This tedious task can be supported by specialized tools that use repositories of pre-existing general ontologies and terminologies. These tools examine existing CSDs of other, similar sources and generate hypotheses for similar parts of the new CSD. The domain specialist must verify, and possibly modify, these hypotheses and add them to the CSD of the new source. With these acquisition tools the creation of new CSDs can be simplified (Wache et al., 1999).
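To make the idea of a CSD more tangible, here is a hypothetical fragment (our own Python illustration of what such a description might contain; the element names and URLs are invented, not a Buster specification):

    import xml.etree.ElementTree as ET

    # Hypothetical CSD: technical and administrative meta-data, the
    # structural schema, and a pointer to the source ontology.
    csd = ET.fromstring("""
    <csd source="corine-land-cover">
      <technical format="shapefile" access="ftp://example.org/clc"/>
      <administrative maintainer="EEA" updated="1999"/>
      <schema>
        <field name="code" type="string"/>
        <field name="geometry" type="polygon"/>
      </schema>
      <ontology href="http://example.org/clc-ontology"/>
    </csd>
    """)

    for field in csd.iter("field"):
        print(field.get("name"), field.get("type"))
    # code string
    # geometry polygon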

The Integration Knowledge: In the second step of the data acquisition phase, the data source is added to the network of integrated data sources. In order for the new data source to be able to exchange data with the other data sources in the network, Integration Knowledge (IK) must be acquired. The IK is stored in a centralized database that is part of the network of integrated data sources.

The IK consists of several separate parts which provide specific knowledge for the components in the query phase. For example, the rule generator examines several CSDs and creates rules for the mediator (Wache et al., 1999). The wrapper configurator uses the information about the sources in order to adapt generic wrappers to the heterogeneous sources.

Creating the IK is the task of the person responsible for operating and maintaining the network of integrated data sources. Due to the complexity of the IK needed for the integration of multiple heterogeneous data sources, and the unavoidable semantic ambiguities, it may not be possible to accomplish this task automatically. However, the acquisition of the IK can be supported by semi-automatic tools. In general, such acquisition tools use the information stored in the CSDs to pre-define parts of the IK and propose them to the human operator, who makes the final decision about whether to accept, edit, or reject them.

Summary

In order to make GIS interoperable, several problems have to be solved. We argued that these problems can be divided into three levels of integration: the syntactic, structural, and semantic level. In our opinion, it is crucial to note that the problem of interoperable GIS can only be solved if solutions (modules) on all three levels of integration work together. We believe that it is not possible to solve the heterogeneity problems separately.

The Buster approach uses different components for different tasks on different levels and provides a conceptual solution for these problems. The components can be any existing systems. We use wrappers for the syntactic level, mediators for the structural level, and both context transformation rule engines (CTR-Engines) and classifiers (mappers) for the semantic level. CORBA is used as low-level middleware for the communication between the components.

At the moment, a few wrappers are available (e.g. ODBC and XML wrappers); a wrapper for shapefiles will be available soon. We are currently developing the mediator and the CTR-Engine, and we use FaCT (Fast Classification of Terminologies) (Horrocks, 1999) as the reasoner for our prototype system. Buster is a first attempt to solve the heterogeneity problems mentioned in this paper; however, a lot of work remains to be done in various areas.

References

[Aben, 1993] Aben, M. (1993). Formally specifying reusable knowledge model components. Knowledge Acquisition Journal, 5:119-141.

[AdV, 1998] AdV (1998). Amtliches Topographisch-Kartographisches Informationssystem ATKIS. Landesvermessungsamt NRW, Bonn.

[Arens et al., 1996] Arens, Y., Hsu, C.-N., and Knoblock, C. A. (1996). Query processing in the SIMS information mediator. In Advanced Planning Technology, California, USA. AAAI Press.

[Bergamashi et al., 1999] Bergamashi, Castano, Vincini, and Beneventano (1999). Intelligent techniques for the extraction and integration of heterogeneous information. In Workshop on Intelligent Information Integration, IJCAI 99, Stockholm, Sweden.

[Bishr and Kuhn, 1999] Bishr, Y. and Kuhn, W. (1999). The Role of Ontology in Modelling Geospatial Features, volume 5 of IFGI prints. Institut für Geoinformatik, Universität Münster, Münster.

[Borgida and Patel-Schneider, 1994] Borgida, A. and Patel-Schneider, P. (1994). A semantics and complete algorithm for subsumption in the CLASSIC description logic. JAIR, 1:277-308.

[Brachman, 1977] Brachman, R. (1977). What's in a concept: Structural foundations for semantic nets. International Journal of Man-Machine Studies, 9:127-152.

[Brickley and Guha, 2000] Brickley, D. and Guha, R. (2000). Resource Description Framework (RDF) schema specification 1.0. Technical Report PR-rdf-schema, W3C. http://www.w3.org/TR/2000/CR-rdf-schema-20000327/.

[Chawathe et al., 1994] Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J., and Widom, J. (1994). The TSIMMIS project: Integration of heterogeneous information sources. In Proceedings of the IPSJ Conference, pages 7-18.

[Collet et al., 1991] Collet, C., Huhns, M. N., and Shen, W.-M. (1991). Resource integration using a large knowledge base in Carnot. IEEE Computer, 24(12):55-62.

[EEA, 1999] EEA (1997-1999). CORINE land cover. Technical guide, European Environmental Agency, ETC/LC, European Topic Centre on Land Cover.

[ESRI, 1994] ESRI (1994). Introducing ArcView. Environmental Systems Research Institute (ESRI), Redlands, CA, USA.

[Galhardas et al., 1998] Galhardas, H., Simon, E., and Tomasic, A. (1998). A framework for classifying environmental metadata. In AAAI Workshop on AI and Information Integration, Madison, WI.

[Genesereth and Fikes, 1992] Genesereth, M. and Fikes, R. (1992). Knowledge Interchange Format version 3.0 reference manual. Report of the Knowledge Systems Laboratory KSL 91-1, Stanford University.

[Gruber, 1991] Gruber, T. (1991). Ontolingua: A mechanism to support portable ontologies. KSL Report KSL-91-66, Stanford University.

[Gruber, 1993] Gruber, T. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2).

[Guarino, 1998] Guarino, N. (1998). Formal ontology and information systems. In Guarino, N., editor, FOIS 98, Trento, Italy. IOS Press.


[Guarino and Giaretta, 1995] Guarino, N. and Giaretta, P. (1995). Ontologies and knowledge bases: Towards a terminological clarification. In Mars, N., editor, Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing, pages 25-32, Amsterdam.

[Horrocks, 1999] Horrocks, I. (1999). FaCT and iFaCT. In (Lambrix et al., 1999), pages 133-135.

[Jasper and Uschold, 1999] Jasper, R. and Uschold, M. (1999). A framework for understanding and classifying ontology applications. In Proceedings of the 12th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop. University of Calgary/Stanford University.

[Kashyap and Sheth, 1997] Kashyap, V. and Sheth, A. (1997). Cooperative Information Systems: Current Trends and Directions, chapter Semantic Heterogeneity in Global Information Systems: The Role of Metadata, Context and Ontologies. Academic Press.

[Kim et al., 1995] Kim, W., Choi, I., Gala, S., and Scheevel, M. (1995). Modern Database Systems: The Object Model, Interoperability, and Beyond, chapter On Resolving Schematic Heterogeneity in Multidatabase Systems, pages 521-550. ACM Press / Addison-Wesley Publishing Company.

[Kim and Seo, 1991] Kim, W. and Seo, J. (1991). Classifying schematic and data heterogeneity in multidatabase systems. IEEE Computer, 24(12):12-18.

[Lambrix et al., 1999] Lambrix, P., Borgida, A., Lenzerini, M., Möller, R., and Patel-Schneider, P., editors (1999). Proceedings of the International Workshop on Description Logics (DL'99).

[Landgraf, 1999] Landgraf, G. (1999). Evolution of EO/GIS interoperability towards an integrated application infrastructure. In Vckovski, A., editor, Interop99, volume 1580 of Lecture Notes in Computer Science, Zurich, Switzerland. Springer.

[Lenat, 1998] Lenat, D. (1998). The dimensions of context space. Available on the web site of the Cycorp Corporation (http://www.cyc.com/publications).

[Maguire et al., 1991] Maguire, D. J., Goodchild, M. F., and Rhind, D. W., editors (1991). Geographical Information Systems: Principles and Applications. Longman, London, UK.

[Mena et al., 1996] Mena, E., Kashyap, V., Illarramendi, A., and Sheth, A. (1996). Managing multiple information sources through ontologies: Relationship between vocabulary heterogeneity and loss of information. In Baader, F., Buchheit, M., Jeusfeld, M. A., and Nutt, W., editors, Proceedings of the 3rd Workshop Knowledge Representation Meets Databases (KRDB '96).

[Mitra et al., 1999] Mitra, P., Wiederhold, G., and Jannink, J. (1999). Semi-automatic integration of knowledge sources. In Fusion '99, Sunnyvale, CA.

[Mitra et al., 2000] Mitra, P., Wiederhold, G., and Kersten, M. (2000). A graph-oriented model for articulation of ontology interdependencies. In Proc. Extending Database Technologies, EDBT 2000, Lecture Notes in Computer Science, Konstanz, Germany. Springer Verlag.

[Naiman and Ouksel, 1995] Naiman, C. F. and Ouksel, A. M. (1995). A classification of semantic conflicts in heterogeneous database systems. Journal of Organizational Computing, pages 167-193.

[OMG, 1992] OMG (1992). The Common Object Request Broker: Architecture and specification. OMG Document 91.12.1, The Object Management Group. Revision 1.1.92.

[Papakonstantinou et al., 1996] Papakonstantinou, Y., Garcia-Molina, H., and Ullman, J. (1996). MedMaker: A mediation system based on declarative specifications. In International Conference on Data Engineering, pages 132-141, New Orleans.

[Schreiber et al., 1994] Schreiber, A., Wielinga, B., Akkermans, H., Van de Velde, W., and Anjewierden, A. (1994). CML: The CommonKADS conceptual modelling language. In Steels, L., et al., editors, A Future for Knowledge Acquisition, Proc. 8th European Knowledge Acquisition Workshop (EKAW 94), number 867 in Lecture Notes in Artificial Intelligence. Springer.

[Shepherd, 1991] Shepherd, I. D. H. (1991). Information integration in GIS. In Maguire, D. J., Goodchild, M. F., and Rhind, D. W., editors, Geographical Information Systems: Principles and Applications. Longman, London, UK.

[Stuckenschmidt and Wache, 2000] Stuckenschmidt, H. and Wache, H. (2000). Context modelling and transformation for semantic interoperability. In Knowledge Representation Meets Databases (KRDB 2000). To appear.

[van Harmelen and Fensel, 1999] van Harmelen, F. and Fensel, D. (1999). Practical knowledge representation for the web. In Fensel, D., editor, Proceedings of the IJCAI'99 Workshop on Intelligent Information Integration.

[Vckovski, 1998] Vckovski, A. (1998). Interoperable and Distributed Processing in GIS. Taylor & Francis, London.

[Vckovski et al., 1999] Vckovski, A., Brassel, K., and Schek, H.-J., editors (1999). Proceedings of the 2nd International Conference on Interoperating Geographic Information Systems, volume 1580 of Lecture Notes in Computer Science, Zurich. Springer.

[Visser et al., 2000] Visser, U., Stuckenschmidt, H., Schuster, G., and Vögele, T. (2000). Ontologies for geographic information processing. Computers & Geosciences. Submitted.

[W3C, 1998] W3C (1998). Extensible Markup Language (XML) 1.0. W3C Recommendation.

[W3C, 1999] W3C (1999). Resource Description Framework (RDF) schema specification. W3C Proposed Recommendation.


[Wache et al., 1999] Wache, H., Scholz, T., Stieghahn, H., and König-Ries, B. (1999). An integration method for the specification of rule-oriented mediators. In Kambayashi, Y. and Takakura, H., editors, Proceedings of the International Symposium on Database Applications in Non-Traditional Environments (DANTE'99), pages 109-112, Kyoto, Japan.

[Wiederhold, 1992] Wiederhold, G. (1992). Mediators in the architecture of future information systems. IEEE Computer, 25(3):38-49.

[Wiederhold, 1999] Wiederhold, G. (1999). Mediation to deal with heterogeneous data sources. In Vckovski, A., editor, Interop99, volume 1580 of Lecture Notes in Computer Science, Zurich, Switzerland. Springer.

[Wiener et al., 1996] Wiener, J., Gupta, H., Labio, W., Zhuge, Y., Garcia-Molina, H., and Widom, J. (1996). WHIPS: A system prototype for warehouse view maintenance. In Workshop on Materialized Views, pages 26-33, Montreal, Canada.

[Worboys and Deen, 1991] Worboys, M. F. and Deen, S. M. (1991). Semantic heterogeneity in distributed geographical databases. SIGMOD Record, 20(4).
