
QUERY PROCESSING FOR HETEROGENEOUS DATA INTEGRATION

USING ONTOLOGIES

BY

HUIYONG XIAO

B.S., Huazhong University of Science and Technology, 1999

M.S., Tsinghua University, China, 2002

THESIS

Submitted as partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

in the Graduate College of the University of Illinois at Chicago, 2006

Chicago, Illinois

Copyright by

Huiyong Xiao

2006

To my parents.


ACKNOWLEDGMENTS

First of all, I would like to thank my advisor, Professor Isabel Cruz, without whose patient

guidance and persistent support, I could not have finished my doctoral research. She gave me

both the motivations for starting new research topics and the freedom of working on my research

interests. Not only has she taught me the systematic research methodologies and ethics, but

also she has given numerous suggestions to improve my English writing skills. All that I learnt

from her definitely will benefit me in my future career.

I would also like to thank all the members of the committee for my preliminary examination and thesis defense, including Professors Kevin Chang, Ajay Kshemkalyani, Bing Liu, Peter Nelson, Aris Ouksel, and Clement Yu. They have given me valuable feedback and advice on my thesis research.

I feel very fortunate to have Kalyan Ayloo, William Sunna, Paul Varkey, Nalin Makar,

Feihong Hsu, Ryan Aviles, Fang Fang, and Amira Rahal as my colleagues, who have made my

life in UIC much easier.

I owe special thanks to my wife, with whose support I have been able to save a large amount

of time for my thesis work.

HYX


TABLE OF CONTENTS

CHAPTER

1 INTRODUCTION
  1.1 Problem Description
  1.2 Ontology Based Data Integration
  1.3 Our Solutions
    1.3.1 Central Data Integration
    1.3.2 Peer-to-Peer Data Integration
  1.4 Contributions

2 CENTRAL DATA INTEGRATION
  2.1 Introduction
    2.1.1 Problem Description
    2.1.2 Semantic XML Data Integration
    2.1.3 Contributions
  2.2 Related Work
    2.2.1 Semantic Integration
    2.2.2 Query Languages
    2.2.3 Query Rewriting
  2.3 Framework
  2.4 Integrating Structure and Semantics
    2.4.1 Local XML Schemas and Local RDFS Ontologies
    2.4.2 The Global RDFS Ontology
    2.4.3 Data Integration Semantics
  2.5 Query Processing
    2.5.1 Query Languages
    2.5.2 Certain Answers and Query Containment
    2.5.3 Query Rewriting
  2.6 Summary

3 HYBRID PEER-TO-PEER DATA INTEGRATION
  3.1 Introduction
  3.2 Related Work
  3.3 The PEPSINT Architecture
  3.4 Mapping Process
    3.4.1 Mapping Local RDF Schemas to the Global Ontology
    3.4.2 Mapping Local XML Schemas to the Global Ontology
  3.5 Query Processing
    3.5.1 Assumptions
    3.5.2 Query Answering in Data Integration Mode
    3.5.3 Query Answering in Hybrid P2P Mode
  3.6 Summary

4 PURE PEER-TO-PEER DATA INTEGRATION
  4.1 Introduction
  4.2 Related Work
  4.3 System Overview
    4.3.1 The Layered Peer Architecture
    4.3.2 An Illustrative Example
    4.3.3 RDF Metadata Representation
    4.3.4 P2P Mapping and Query Answering
  4.4 P2P Mappings
    4.4.1 RDFMS Meta-Ontology
    4.4.2 P2P Mapping Language – PML
  4.5 P2P Query Processing
    4.5.1 Query Languages
    4.5.2 Query Rewriting
  4.6 Summary

5 DATA INTEROPERABILITY IN THE SEMANTIC DESKTOP
  5.1 Introduction
  5.2 Related Work
  5.3 The Layered Multi-Ontology Framework
  5.4 System Architecture
  5.5 Semantic Data Organization
    5.5.1 Annotation
    5.5.2 Association
    5.5.3 Representation
  5.6 Semantic Data Navigation
  5.7 Personal Information Applications
    5.7.1 Motivation
    5.7.2 MVC-based PIA Development
    5.7.3 Implementation
  5.8 Services-based Desktop Interoperation
  5.9 Semantic Query Processing
    5.9.1 Query Processing in a PIA
    5.9.2 A2A Query Processing
  5.10 Summary

6 GEOSPATIAL DATA MANAGEMENT IN E-GOVERNMENT
  6.1 Introduction
  6.2 Related Work
    6.2.1 Ontology Alignment
    6.2.2 Query Processing
  6.3 Data Heterogeneities
  6.4 Architecture
    6.4.1 Schema Transformation and Ontology Mapping
    6.4.2 Query Processing
  6.5 Ontology Mapping
    6.5.1 Schema Transformation
    6.5.2 Ontology Alignment
      6.5.2.1 Mapping Types
      6.5.2.2 Deduction Process
      6.5.2.3 Mapping Representation
  6.6 Query Processing
    6.6.1 Query Languages
    6.6.2 Query Rewriting and Answering
      6.6.2.1 Query Expansion
      6.6.2.2 Query Mapping
      6.6.2.3 Rewriting Constants
    6.6.3 Discussion
  6.7 Summary

7 CONCLUSIONS

CITED LITERATURE

VITA

LIST OF TABLES

TABLE

I INFERENCE RULES FOR SEMANTIC RELATIONS

II MAPPINGS BETWEEN XML SOURCE SCHEMA S1 AND THE LOCAL ONTOLOGY R1

III MAPPING TABLE BETWEEN THE GLOBAL ONTOLOGY AND LOCAL ONTOLOGIES

IV MAPPING TABLE BETWEEN THE GLOBAL ONTOLOGY AND XML SOURCE SCHEMAS

V RESOURCE-RESOURCE ASSOCIATIONS

VI RDF PROPERTIES FOR THE REPRESENTATION OF ASSOCIATIONS

VII SEMANTIC HETEROGENEITY RESULTED FROM DIFFERENT ENCODINGS OF LAND USE DATA

VIII ELEMENT-LEVEL SCHEMA TRANSFORMATION

IX MAPPINGS BETWEEN XML SOURCE SCHEMA D1 AND LOCAL ONTOLOGY O1

LIST OF FIGURES

FIGURE

1 Two XML sources with heterogeneous schemas.

2 A central architecture for XML data integration.

3 Local ontologies generated from XML source schemas.

4 A conceptual view on local sources.

5 The hybrid peer-to-peer architecture of PEPSINT.

6 Mediation for peer-to-peer query rewriting.

7 Thesaurus-based schema mapping process.

8 Two XML sources with structural heterogeneities.

9 The ontology-based framework for the integration of XML sources.

10 Local ontologies R1 and R2 transformed from XML source schemas S1 and S2.

11 The global ontology G that results from merging R1 and R2.

12 The global database of G.

13 The retrieved database on S1 w.r.t. S2 and that on S2 w.r.t. S1.

14 The GLRewriting algorithm.

15 A part of XML data integration setting.

16 The LLRewriting algorithm.

17 An example of heterogeneous XML and RDF data sources.

18 The PEPSINT architecture.

19 RDF schemas transformed from local XML source schemas.

20 The global ontology and its mapping table.

21 The layered peer architecture.

22 A motivating example for P2P data integration.

23 Local RDFS ontologies.

24 The meta-ontology of RDFMS.

25 An example of P2P mappings represented in RDFMS.

26 The P2PRewriting algorithm.

27 An example of files in a PI space.

28 An ontology-based framework of a PIM system.

29 The architecture of MOSE.

30 An example of an email message.

31 Data organization in the application, domain, and resource layers. All ontologies are represented in RDFS. Two application ontologies for PIAs, i.e., picture management and publication management, are constructed. Below them are four ontologies for the domains of Email, Talk, Publication, and Photo, respectively. At the bottom, the resource-file and resource-resource associations are represented as triples or in a graph.

32 The browser for PIM.

33 The PIA designer.

34 Desktop services composition and execution.

35 The ADRewriting algorithm.

36 An example of XML schematic heterogeneity.

37 Local XML land use data sources.

38 The ontology-based architecture.

39 An example of local RDFS ontologies.

40 An example of mapping between two land use taxonomies. The labels over the edges represent mapping types, followed (in parentheses) by the deduction rule(s) that can be applied, if any.

41 A fragment of ontology mappings represented in RDFS.

42 The QueryRewriting algorithm.

43 The QueryExpand algorithm.

44 The ConstantMapping algorithm.

LIST OF ABBREVIATIONS

GaV Global-as-View

LaV Local-as-View

GLaV Global-Local-as-View

RDF Resource Description Framework

RDFS RDF Schema

RDFMS RDF Mapping Schema

OWL Web Ontology Language

DAML+OIL DARPA Agent Markup Language and Ontology Inference Layer

RQL RDF Query Language

RDQL RDF Data Query Language

c-RQL Conjunctive RQL

c-XQuery Conjunctive XQuery

P2P Peer-to-Peer

PML P2P Mapping Language

PEPSINT Peer-to-Peer Semantic Integration Framework

MOSE Multiple Ontology based Semantic Desktop

PIM Personal Information Management

PIA Personal Information Application

SUMMARY

Data integration provides the ability to manipulate data transparently across multiple dis-

tributed data sources. We have studied comprehensively several scenarios where the need for

heterogeneous data integration occurs, including centralized integration of XML data sources,

hybrid peer-to-peer integration of XML and RDF data sources, pure peer-to-peer XML and

RDF data integration and interoperability, personal information management within and across

desktops, and geospatial data integration for e-Government.

There are different kinds of heterogeneity: syntactic heterogeneity, which is caused by different languages used for modeling the different sources; schematic heterogeneity, which results from different structures of source schemas; and semantic heterogeneity, which arises when different sources contain instances with different meanings or interpretations.

The key notion of the emerging Semantic Web is that of an ontology, which is a formal

and explicit specification of a shared conceptualization. The use of ontologies can benefit data

integration tasks in a variety of ways, including metadata representation, global conceptualiza-

tion, support for high-level queries, declarative mediation, and mapping support. As the main

contribution of this thesis, we focus on the role of ontologies in data integration and propose

a series of ontology-based approaches to resolve the heterogeneities so as to achieve data inter-

operability. In this thesis, we report our achievements on ontology-based heterogeneous data

integration, and discuss the fundamental issues, including metadata representation, mapping

process, and query processing, in our approaches to different applications of data integration.


CHAPTER 1

INTRODUCTION

1.1 Problem Description

Data integration provides the ability to manipulate data transparently across multiple data

sources. It is relevant to a number of applications including enterprise information integra-

tion, medical information management, geographical information systems, and e-Commerce

applications. Based on the architecture, there are two different kinds of systems: central

data integration systems (3; 7; 29; 39; 81; 109) and peer-to-peer data integration systems

(6; 11; 15; 40; 59; 86). A central data integration system usually has a global schema, which

provides the user with a uniform interface to access information stored in the data sources.

In contrast, in a peer-to-peer data integration system, there are no global points of control

on the data sources (or peers). Instead, any peer can accept user queries for the information

distributed in the whole system.

The two most important approaches for building a data integration system are Global-as-View (GaV) and Local-as-View (LaV) (109; 70). In the GaV approach, every entity in the global schema is associated with a view over the local source schemas. Query processing is therefore simple (a query over the global schema is answered by unfolding the view definitions), but the evolution of the local source schemas is not easily supported. In contrast, the LaV approach permits changes to source schemas without affecting the global schema, since the local schemas are defined as views over the global schema, but query processing can be complex.
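As an illustration of this contrast, the following is a minimal sketch, not the implementation used in this thesis. It assumes a single global relation Publication(title, author) and two sources; the names `SOURCE_1`, `SOURCE_2`, `gav_publication`, `authors_of`, and `LAV_DESCRIPTIONS`, as well as the sample tuples, are hypothetical.

```python
# Hypothetical source data: each source exposes its own local relation.
SOURCE_1 = [("b1", "a1"), ("b2", "a1"), ("b2", "a3")]  # (booktitle, author name)
SOURCE_2 = [("a2", "b3")]                              # (writer fullname, article title)


def gav_publication():
    """GaV: the global relation Publication(title, author) is a view over the sources."""
    from_s1 = {(title, author) for title, author in SOURCE_1}
    from_s2 = {(title, author) for author, title in SOURCE_2}
    return from_s1 | from_s2


def authors_of(title):
    """A global query is answered by unfolding the GaV view definition."""
    return sorted(author for t, author in gav_publication() if t == title)


# LaV: each source is instead described as a view over the global relation.
# Adding a source only adds one description and leaves the global schema
# untouched, but answering a query requires rewriting it using these views.
LAV_DESCRIPTIONS = {
    "SOURCE_1": "S1(title, author) :- Publication(title, author)",
    "SOURCE_2": "S2(author, title) :- Publication(title, author)",
}
```

In the GaV half, a change to the layout of `SOURCE_2` forces a change to `gav_publication`; in the LaV half, it would only change the description of that one source.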

Data sources can be heterogeneous in syntax, schema, or semantics, thus making data

interoperation a difficult task (16). Syntactic heterogeneity is caused by the use of different

models or languages. Schematic heterogeneity results from structural differences. Semantic

heterogeneity is caused by different meanings or interpretations of data in various contexts. To

achieve data interoperability, the issues posed by data heterogeneity need to be eliminated.

The advent of XML has created a syntactic platform for Web data standardization and exchange. However, schematic data heterogeneity may persist, depending on the XML schemas used (e.g., nesting hierarchies). Likewise, semantic heterogeneity may persist even if both syntactic and schematic heterogeneities do not occur (e.g., naming concepts differently). In this thesis, we are concerned with bridging all three kinds of heterogeneity (syntactic, schematic, and semantic) across different sources.

We call semantic data integration the process of using a conceptual representation of the

data and of their relationships to eliminate possible heterogeneities. At the heart of seman-

tic data integration is the concept of ontology, which is an explicit specification of a shared

conceptualization (55; 54).

Ontologies were developed by the Artificial Intelligence community to facilitate knowledge sharing and reuse (56). Carrying semantics for particular domains, ontologies are largely used for representing domain knowledge. A common use of ontologies is data standardization and conceptualization via a formal machine-understandable ontology language. For example, the global schema in a data integration system may be an ontology, which then acts as a mediator for reconciling the heterogeneities between different sources. As an example of the use of ontologies in peer-to-peer data integration, we can produce for each source schema a local ontology, which is made accessible to other peers so as to support semantic mappings between different local ontologies.

1.2 Ontology Based Data Integration

An ontology is a formal, explicit specification of a shared conceptualization (55). In this

definition, “conceptualization” refers to an abstract model of some domain knowledge in the

world that identifies that domain’s relevant concepts. “Shared” indicates that an ontology

captures consensual knowledge, that is, it is accepted by a group. “Explicit” means that the

type of concepts in an ontology and the constraints on these concepts are explicitly defined.

Finally, “formal” means that the ontology should be machine understandable.

Typical “real-world” ontologies include taxonomies on the Web (e.g., Yahoo! categories), catalogs for on-line shopping (e.g., Amazon.com’s product catalog), and domain-specific standard terminology (e.g., UMLS, http://www.nlm.nih.gov/research/umls/, and Gene Ontology, http://www.geneontology.org). As an online lexicon database, WordNet (http://www.cogsci.princeton.edu/~wn/) is widely used for discovery of semantic relationships between concepts.

Existing ontology languages include:


XML Schema. Strictly speaking, XML Schema is a semantic markup language for Web data. The database-compatible data types supported by XML Schema provide a way to specify a hierarchical model (http://www.w3.org/TR/xmlschema-2). However, there are no explicit constructs for defining classes and properties in XML Schema; therefore, ambiguities may arise when mapping an XML-based data model to a semantic model.

RDF and RDFS. RDF (Resource Description Framework) is a data model developed by the W3C for describing Web resources (http://www.w3.org/TR/rdf-primer). RDF allows for the specification of the semantics of data in a standardized, interoperable manner. In RDF, a pair of resources (nodes) connected by a property (edge) forms a statement: (resource, property, value). RDFS (RDF Schema, http://www.w3.org/TR/rdf-schema) is a language for describing vocabularies of RDF data in terms of primitives such as rdfs:Class, rdf:Property, rdfs:domain, and rdfs:range. In other words, RDFS is used to define the semantic relationships between properties and resources.

DAML+OIL. DAML+OIL (DARPA Agent Markup Language and Ontology Inference Layer) is a full-fledged Web-based ontology language developed on top of RDFS (http://www.w3.org/TR/daml+oil-reference). It features an XML-based syntax and a layered architecture. DAML+OIL provides modeling primitives commonly used in frame-based approaches to ontology engineering, and formal semantics and reasoning support found in description logic approaches. It also integrates XML Schema data types for semantic interoperability in XML.

OWL. OWL (Web Ontology Language) is a semantic markup language for publishing and sharing ontologies on the Web. It is developed as a vocabulary extension of RDF and is derived from DAML+OIL (http://www.w3.org/TR/owl-ref).

Other ontology languages include SHOE (Simple HTML Ontology Extensions, http://www.cs.umd.edu/projects/plus/shoe), XOL (Ontology Exchange Language, http://www.ai.sri.com/pkarp/xol/), and UML (Unified Modeling Language, http://www.uml.org/).

Among all these ontology languages, we are most interested in XML Schema and RDFS for their particular roles in data integration and the “Semantic Web” (42). More specifically, XML Schema and RDFS share an XML-based syntax, and both can be used for data modeling and ontology representation. However, each has its own particular features: XML data has document structure, given by the nesting of elements in an individual XML document, whereas RDF data has domain structure, formed by the concepts and the relationships between concepts (40; 59).
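The RDF statement model and the RDFS primitives described above can be illustrated with a small sketch that stores statements as (resource, property, value) tuples and applies the standard RDFS entailment for rdfs:domain and rdfs:range. This is an illustration only: the vocabulary (`Book`, `Author`, `writtenBy`), the instance identifiers, and the function name `infer_types` are hypothetical and do not come from the thesis.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"

# An RDFS vocabulary: two classes and one property relating them.
schema = {
    ("Book", RDF_TYPE, "rdfs:Class"),
    ("Author", RDF_TYPE, "rdfs:Class"),
    ("writtenBy", RDF_TYPE, "rdf:Property"),
    ("writtenBy", RDFS_DOMAIN, "Book"),
    ("writtenBy", RDFS_RANGE, "Author"),
}

# RDF data: a single statement (resource, property, value).
data = {("b2", "writtenBy", "a1")}


def infer_types(schema, data):
    """Apply RDFS domain/range entailment: using a property types its subject and value."""
    inferred = set(data)
    for subject, prop, value in data:
        for s, p, cls in schema:
            if s == prop and p == RDFS_DOMAIN:
                inferred.add((subject, RDF_TYPE, cls))
            if s == prop and p == RDFS_RANGE:
                inferred.add((value, RDF_TYPE, cls))
    return inferred
```

Note that the statements carry no nesting: the domain structure is expressed entirely by classes and the properties that connect them.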

Ontologies have been extensively used in data integration systems because they provide an

explicit and machine-understandable conceptualization of a domain. They have been used in

one of the three following ways (111):

Single ontology approach. All source schemas are directly related to a shared global ontol-

ogy that provides a uniform interface to the user (38). However, this approach requires

that all sources have nearly the same view on a domain, with the same level of granularity.

A typical example of a system using this approach is SIMS (7).

Multiple ontology approach. Each data source is described by its own (local) ontology sep-

arately. Instead of using a common ontology, local ontologies are mapped to each other.

For this purpose, an additional representation formalism is necessary for defining the

inter-ontology mappings. The OBSERVER system (81) is an example of this approach.

Hybrid ontology approach. A combination of the two preceding approaches is used. First,

a local ontology is built for each source schema, which, however, is not mapped to other

local ontologies, but to a global shared ontology. New sources can be easily added with

no need for modifying existing mappings. Our layered framework (38) is an example of

this approach.

The single and hybrid approaches are appropriate for building central data integration systems, the former being more appropriate for GaV systems and the latter for LaV systems. A hybrid peer-to-peer system, where a global ontology exists in a “super-peer”, can also use the hybrid ontology approach (40). The multiple ontology approach is best suited to constructing pure peer-to-peer data integration systems, where there are no super-peers.

We identify the following five uses of ontologies in data integration:

Metadata Representation. Metadata (i.e., source schemas) in each data source can be ex-

plicitly represented by a local ontology, using a single language.

Global Conceptualization. The global ontology provides a conceptual view over the schemat-

ically heterogeneous source schemas.

Support for High-level Queries. Given a high-level view of the sources, as provided by a

global ontology, the user can formulate a query without specific knowledge of the different

data sources. The query is then rewritten into queries over the sources, based on the

semantic mappings between the global and local ontologies.

Declarative Mediation. Query processing in a hybrid peer-to-peer system uses the global

ontology as a declarative mediator for query rewriting between peers.

Mapping Support. A thesaurus, formalized in terms of an ontology, can be used for the

mapping process to facilitate its automation.

In the following section we discuss five case studies, which correspond to the above five uses.

The first three case studies are in the context of centralized data integration systems, while the

last two are in the context of peer-to-peer data integration systems. We base our discussion on

our previous work (38; 39; 40; 113; 116).


1.3 Our Solutions

1.3.1 Central Data Integration

In this section, we will describe three case studies of ontologies in the context of central

data integration. To make the issues concrete, we use a running example involving two XML

sources and demonstrate how to enable semantic interoperation between them.

Example 1.1 Figure 1 displays two XML schemas (S1 and S2) and their respective documents

(D1 and D2), which are represented as trees. The two XML documents conform to different

schemas but represent data with similar semantics. In particular, both schemas represent a

many-to-many relationship between two concepts: book and author in S1 (equivalently denoted

by article and writer in S2). However, structurally speaking, they are different: S1 (book-

centric schema) has the author element nested under the book element, whereas S2 (author-

centric schema) has the article element nested under the writer element.

Figure 1. Two XML sources with heterogeneous schemas. (Panels: XML schema S1 and XML document D1, "books.xml"; XML schema S2 and XML document D2, "writers.xml".)

Semantically equivalent data elements, such as the authors of publication “b2”, can be reached using different XML path patterns, respectively for schema S1 and schema S2:

/books/book[@booktitle="b2"]/author/@name

and

/writers/writer[article/@title="b2"]/@fullname

where the contents in the square brackets specify the constraints for the search patterns.
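The two path patterns above can be tried out with a short sketch that uses Python's standard `xml.etree.ElementTree` module. The instance documents below are hypothetical data that merely follow the shapes of S1 and S2; the function names are also assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance documents following the shapes of S1 and S2.
D1 = ET.fromstring("""
<books>
  <book booktitle="b1"><author name="a1"/></book>
  <book booktitle="b2"><author name="a1"/><author name="a3"/></book>
</books>""")

D2 = ET.fromstring("""
<writers>
  <writer fullname="a1"><article title="b1"/><article title="b2"/></writer>
  <writer fullname="a3"><article title="b2"/></writer>
</writers>""")


def authors_in_s1(root, title):
    # Corresponds to /books/book[@booktitle=title]/author/@name
    authors = root.findall(f"book[@booktitle='{title}']/author")
    return sorted(a.get("name") for a in authors)


def authors_in_s2(root, title):
    # Corresponds to /writers/writer[article/@title=title]/@fullname
    # ElementTree's XPath subset has no nested-path predicate, so the
    # condition on the nested article element is checked in Python.
    return sorted(
        w.get("fullname")
        for w in root.findall("writer")
        if any(a.get("title") == title for a in w.findall("article"))
    )
```

Both functions return the same authors for the same publication, although they navigate structurally different documents.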

The example demonstrates that multiple XML schemas (or structures) can exist for a single

conceptual model. In comparison, the schema or ontology languages (e.g., RDFS, DAML+OIL,

and OWL) that operate on the conceptual level are structurally flat so that the user can

formulate a query from a conceptual perspective without considering the structure of the

source (3; 29; 111; 39).

Figure 2. A central architecture for XML data integration. (Components shown: local XML sources 1 to n, local RDF ontologies 1 to n, the RDF-based global ontology, the mapping table, ontology integration, and the query translator; arrows distinguish queries in the data-integration direction from queries in the peer-to-peer direction.)

Figure 2 shows the architecture of a system that interoperates among schematically heterogeneous data sources (39). The following three case studies examine in detail the principles embodied in this architecture.

Case Study 1 - Metadata Representation

As a first step for bridging across the heterogeneities of diverse local sources, a local ontology

must be generated from each source database schema (e.g., relational, XML, or RDF). A local

ontology is a conceptualization of the elements and relationships between elements in each

source schema. To facilitate interoperation, those ontologies should be expressed using the

same model. Furthermore, for the sake of correct query processing, the structure of source

schemas and the integrity constraints (e.g., relational foreign keys) expressed on the schemas

should be preserved in the local ontology. We choose RDFS to represent each local ontology.

In our approach, ontology generation from source schemas is accomplished by model-based

schema transformation (38). In particular, the following approaches are taken for the relational

and XML schema transformation:

Relational Schema. Relations are converted into RDF classes and attributes into RDF prop-

erties, which are attached to the class corresponding to the relation to which the attributes

belong. Foreign key dependencies between two relations are represented by two properties

(corresponding to the two relations) sharing the same value in the target local ontology.

XML Schema. Complex-type elements are converted into RDF classes and simple-type el-

ements and attributes are converted into RDF properties. This transformation process


encodes the mapping information between each concept in the local RDF ontology and

the path to the corresponding element in the XML source. Nesting relationships between

XML elements are represented using a meta-property rdfx:contained; rdfx stands for the

namespace where contained is defined. This meta-property enables the RDF representa-

tion of the XML nesting structure, by connecting two RDF classes representing the two

nesting XML elements.
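The XML Schema transformation just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the transformation algorithm of (38): the schema is given as a nested dictionary rather than an XSD file, only attributes (not simple-type elements) are turned into properties, class names are obtained by capitalizing element names, and rdfx:contained is assumed to point from the containing class to the nested class. The function name `xml_schema_to_rdfs` is hypothetical.

```python
def xml_schema_to_rdfs(element, parent_class=None, parent_path=""):
    """Return (triples, paths) for a nested element description.

    element: {"name": str, "attrs": [str, ...], "children": [element, ...]}
    triples: RDFS statements of the local ontology.
    paths:   mapping from each concept to the path of its XML counterpart.
    """
    path = f"{parent_path}/{element['name']}"
    cls = element["name"].capitalize()
    triples = [(cls, "rdf:type", "rdfs:Class")]
    paths = {cls: path}
    if parent_class is not None:
        # Assumed direction: the containing class points to the nested class.
        triples.append((parent_class, "rdfx:contained", cls))
    for attr in element.get("attrs", []):
        triples.append((attr, "rdf:type", "rdf:Property"))
        triples.append((attr, "rdfs:domain", cls))
        paths[attr] = f"{path}/@{attr}"
    for child in element.get("children", []):
        child_triples, child_paths = xml_schema_to_rdfs(child, cls, path)
        triples.extend(child_triples)
        paths.update(child_paths)
    return triples, paths


# The book-centric schema S1 of Example 1.1, written as a nested dictionary.
S1 = {"name": "books", "children": [
    {"name": "book", "attrs": ["booktitle"], "children": [
        {"name": "author", "attrs": ["name"]}]}]}

triples, paths = xml_schema_to_rdfs(S1)
```

The `paths` dictionary plays the role of the mapping information between each concept in the local ontology and the corresponding path in the XML source.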

Example 1.2 Following Example 1.1, Figure 3 shows the local RDF ontologies S′1 and S′2,

which are generated respectively from the XML source schemas S1 and S2.

[Figure 3: the local RDFS ontology S′1 (classes Books, Book, and Author; properties booktitle and name) and the local RDFS ontology S′2 (classes Writers, Writer, and Article; properties fullname and title). Edges are labeled rdfx:contained and rdfs:domain.]

Figure 3. Local ontologies generated from XML source schemas.

Case Study 2 - Global Conceptualization

To make the integration system accessible through the uniform interface of the global on-

tology, semantic mappings are established between the global ontology and the local ontologies.

In our approach, this mapping process is accomplished during the construction of the global on-

tology, which is generated by merging the local ontologies, for example, using a GaV approach.

We consider that each local ontology is merged into the global ontology, the target ontology.

The process of ontology merging consists of several operations:

• Copying a class and/or its properties: classes and properties that do not exist in the

target ontology are copied into it.

• Class Merging: conceptually equivalent classes in the local and target ontologies are

combined into one class in the target ontology.

• Property Merging: conceptually equivalent properties of a class in the local and target

ontologies are combined into one property in the target ontology.

• Relationship Merging: conceptually equivalent relationships from one class c1 to another

class c2 in the local and target ontologies are combined into a single relationship in the

target ontology (i.e., an RDF property having c1 as its domain and c2 as its range).

• Class Generalization: related classes in the local and target ontologies can be generalized

into a superclass. The superclass can be obtained by searching an existing knowledge

domain (e.g., the DAML Ontology Library 1) or reasoning over a thesaurus.

1 http://www.daml.org/ontologies/

We note that along with the above operations, semantic correspondences are established.

For example, for each element pL in a local ontology, if there exists a semantically equivalent

element pG in the global ontology, the two elements will be merged and a correspondence

between pL and pG will be generated.
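
The copy and merge operations, together with the correspondences they produce, can be illustrated with a small sketch. It rests on several assumptions: an ontology is reduced to a dictionary from class names to property names, the equivalences between local and target names are supplied by hand (the text does not say how they are detected), and the function name merge_into is hypothetical. Relationship merging and class generalization are left out.

```python
def merge_into(target, local, equivalent, source_name):
    """Merge `local` into `target` in place and return the correspondences.

    target, local: dict mapping a class name to a set of property names.
    equivalent:    dict mapping a local name to its equivalent target name
                   (assumed to be given, e.g. by a domain expert).
    """
    correspondences = []
    for cls, props in local.items():
        # Class merging if an equivalent class is known, otherwise copying.
        g_cls = equivalent.get(cls, cls)
        target.setdefault(g_cls, set())
        correspondences.append((f"{source_name}:{cls}", g_cls))
        for prop in sorted(props):
            # Property merging if an equivalent property is known, otherwise copying.
            g_prop = equivalent.get(prop, prop)
            target[g_cls].add(g_prop)
            correspondences.append((f"{source_name}:{prop}", g_prop))
    return correspondences


# Local ontologies of Example 1.2, reduced to classes and their properties.
s1 = {"Book": {"booktitle"}, "Author": {"name"}}
s2 = {"Article": {"title"}, "Writer": {"fullname"}}

global_ontology = {}
c1 = merge_into(global_ontology, s1, {"booktitle": "title"}, "S1")
c2 = merge_into(global_ontology, s2,
                {"Article": "Book", "Writer": "Author", "fullname": "name"}, "S2")
```

After both calls the target holds the classes Book and Author with the properties title and name, and the two correspondence lists record which local element was mapped to which global element.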

[Figure 4: the global RDF ontology (classes Books, Book, Authors, Author, Publications, and Person; properties title and name) shown with the local RDF ontologies S′1 (Books, Book, Author; booktitle, name) and S′2 (Writers, Writer, Article; fullname, title). Edges are labeled rdfx:contained, rdfs:domain, and rdfs:subClassOf; correspondence edges link the local ontologies to the global ontology.]

Figure 4. A conceptual view on local sources.

Example 1.3 Figure 4 shows the global RDF ontology generated by merging the local ontologies

S′1 and S′2 of Example 1.2. Note that the classes (properties) represented in grey are merged

classes (properties), and the classes Book and Author are also extended, with Publication and

Person being their superclasses, respectively.

Case Study 3 - Support for High-level Queries

Given a conceptual view of available information sources, the user may pose a query in

terms of the global ontology. We say the query is a high-level query if its formulation does not

require awareness of particular source schemas. The query is then reformulated by a rewriting

algorithm into a subquery for each source. The subqueries over sources are subject to the

structure of source schemas, and may be expressed in a different language from that of the

high-level query. An inference mechanism may be needed in the query rewriting, for example,

when a concept involved in the query has super-concepts or sub-concepts.

In addition to handling high-level queries on the global ontology, a bidirectional query

translation algorithm is also supported (39) (see Figure 2). In this case, we can translate a

query posed against an XML source to an equivalent query against any other XML source.

Example 1.4 Suppose the user asks the query “Find the persons who have written publication

b2.” This query will be expressed in an RDF query language such as RDQL. 1 First, Person

has sub-concept Author, which corresponds to two different concepts (Author and Writer) in two

different RDF local databases. Therefore the initial query will be rewritten as two sub-queries to

those databases. In turn, those queries may be further rewritten using an XML query language

incorporating the path expressions of Example 1.1 (unless the data was materialized under the

RDF local ontologies). Using the bidirectional query translation mechanism, a query involving

the concepts Book and Author in one source will be translated into a query involving Article and

Writer in another data source, by using the correspondences established by the global ontology.

1http://www.hpl.hp.com/semweb/rdql.htm
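
The inference step of Example 1.4 can be sketched as follows. The sketch only shows how the concepts of a high-level query are expanded to their sub-concepts and then mapped to local concepts; it does not produce RDQL or XML query text. The dictionaries and the function names expand and rewrite are hypothetical, and the subclass and correspondence data are taken from Examples 1.3 and 1.4.

```python
# Sub-concept relation of the global ontology (Example 1.3).
SUBCLASSES = {"Person": ["Author"], "Publication": ["Book"]}

# Correspondences between global concepts and local concepts, per source.
CORRESPONDENCES = {
    "Author": {"S1": "Author", "S2": "Writer"},
    "Book": {"S1": "Book", "S2": "Article"},
}


def expand(concept):
    """Return the concept followed by all of its (transitive) sub-concepts."""
    result = [concept]
    for sub in SUBCLASSES.get(concept, []):
        result.extend(expand(sub))
    return result


def rewrite(concepts, source):
    """Map each global concept of a query to the local concepts of `source`."""
    rewritten = {}
    for concept in concepts:
        rewritten[concept] = [
            CORRESPONDENCES[c][source]
            for c in expand(concept)
            if source in CORRESPONDENCES.get(c, {})
        ]
    return rewritten


# "Find the persons who have written publication b2" mentions two global concepts.
query_concepts = ["Person", "Publication"]
sub_query_s1 = rewrite(query_concepts, "S1")
sub_query_s2 = rewrite(query_concepts, "S2")
```

Person has no direct counterpart in either source, so the rewriting succeeds only because the sub-concept Author is considered; this is the situation in which the text says an inference mechanism is needed.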

[Figure 5: architecture diagram showing the super-peer with the global RDF ontology; peers 1, ..., i, ..., n, each with a local XML or RDF schema, a mapping table, and (for XML sources) an XML-to-RDF wrapper; and the queries Q1 and Q2 with their rewritings Q11′, Q1i′, Q1n′ and Q2i′, Q2n′. The legend distinguishes query processing in data-integration fashion, query processing in hybrid P2P fashion, and the mapping process.]

Figure 5. The hybrid peer-to-peer architecture of PEPSINT.

1.3.2 Peer-to-Peer Data Integration

We consider again the two XML sources of Figure 1. However, this time they are connected

in a peer-to-peer architecture. We consider a hybrid peer-to-peer architecture with two types of

peers: super-peers containing the global RDF ontology, and peers each containing a data source

and an ontology. Each peer represents an autonomous information system and connects to a

super-peer via semantic mappings. Peer-to-peer data integration systems or frameworks include

LRM (Local Relational Model) (15), Hyperion (6), Piazza (59), PeerDB (86), SEWASIE (11),

and PEPSINT (40).

Case Study 4 - Declarative Mediation

The PEPSINT system is a hybrid peer-to-peer system whose architecture is shown in Fig-

ure 5. PEPSINT uses a GaV approach. The global ontology in a super-peer serves two functions:

(1) It provides the user with a uniform high-level view of the data sources in the distributed

peers, and (2) it serves as a mediator for query translation from one peer to another. The

former function is similar to the one described in Case Study 3. The latter function is discussed

in detail here.

The user can pose a query against the local XML or RDF data source in any peer. Locally,

the query will be executed on the local source to get a local answer. Meanwhile, the source

query is rewritten into a target query over every connected peer. The query rewriting utilizes

the global ontology, and the composition of mappings from the original peer to the super-peer

with mappings from the super-peer to the target peers. By executing the target query, each

peer returns an answer to the original peer, called the remote answer. The local and remote

answers are integrated and returned to the user at the site of the originating peer.

Example 1.5 Consider two XML sources, one in peer p1 and the other in peer p2, and a global

ontology expressed in RDF in a super-peer. As shown in Figure 6, the global ontology consists

of a class Publication and two sub-classes Paper and Book. The Publication class is mapped to

the publication element of the XML source in p1, while the class Book corresponds to book of

the XML source in p2. An XML query Q1 on p1 involving publication will be rewritten to a

target query Q2 on p2 involving book. The XML fragments inside the dashed-line boxes

are integrated and returned as answers.

[Figure 6: the global RDF ontology in the super-peer (class Publication with the subclasses Paper and Book, linked by rdfs:subClassOf); the XML source in peer p1 (a publications element containing a publication with title "b1", author a1, and ISBN 1234567890), queried by Q1: List all publications; and the XML source in peer p2 (a books element containing a book with a booktitle attribute, author a2, and price $23.00), queried by Q2.]

Figure 6. Mediation for peer-to-peer query rewriting.

Case Study 5 - Mapping Support

A thesaurus can be used for data integration to facilitate the automation of the schema mapping

process (99; 38). In particular, it can help discover the semantic relationships between

concepts in different schemas or ontologies. WordNet is an example of such a thesaurus. It

consists of a network of terms and their semantic relations (e.g., synonym, hypernym, and

hyponym). A term may have multiple senses, each being a synset.

A thesaurus-based schema matching approach has been devised for peer-to-peer data inte-

gration (113); this approach consists of the following three steps (as illustrated in Figure 7):

1. Path Exploration. Among the semantic relations between synsets in WordNet, we

choose those of synonymy, hyponymy/hypernymy (i.e., more specific/more general), and related-

to, when enumerating the paths between two arbitrary concepts from different local ontologies

in peers. As shown in Figure 7, six paths are found from Quantity to Number.

2. Path Selection. When multiple paths are found between two concepts, we choose the

optimal path, which corresponds to the most likely semantic relation between the two concepts.

For this purpose, semantic similarities (i.e., the number above each path in the figure) are

calculated for all the paths. The calculation assigns a weight to each kind of semantic

relation (e.g., 1 for synonymy and 0.8 for hypernymy) and then takes the average of the

weights of the relations along the path. The path with the highest similarity is then chosen as the optimal

path. If there is more than one such path, then the user’s intervention is needed.

[Figure 7: the three steps applied to the concepts Quantity and Number in WordNet. Relation weights: SYN (Synonym) 1, HYPER (Hypernym) 0.8, HYPO (Hyponym) 0.8, REL (Related-to) 0.5. Path exploration finds six paths, involving the terms Amount, Total, Definite Quantity, Product, Constant, and Sum, with similarities 1, 0.9, 0.8, 0.8, 0.8, and 0.8. Path selection keeps the path Quantity SYN Amount SYN Number, and semantic derivation yields Quantity SYN Number.]

Figure 7. Thesaurus-based schema mapping process.

3. Semantic Derivation. The last step is to derive the (direct) semantic relationship,

Sem, between the two concepts by reasoning on the semantic relations along the optimal path

p between them. More specifically, Sem(p) = Sem(pn) is computed based on the following

recursive algorithm, where pn = (r1, r2, ..., rn), and ri (1≤i≤n) are the edges (semantic relations)

along p.

Sem(pn) = Sem(pn−1) ∧ Sem(rn), if n > 1; (1.1)

Sem(pn) = ≈, ⊇, ⊆, or ∼, if n = 1. (1.2)

In the above formulas, the symbols ≈, ⊇, ⊆, and ∼, respectively stand for the semantic

relation of synonymy, hypernymy, hyponymy, and related-to. The operation ∧ obeys the rules

that are shown in Table I. Specifically, the first row and the first column are the operands, and

the cells at the intersection of each pair of operands contain the results of the operation ∧ on

both operands. A question mark indicates that human intervention is needed.

TABLE I

INFERENCE RULES FOR SEMANTIC RELATIONS.

∧   ≈   ⊇   ⊆   ∼
≈   ≈   ⊇   ⊆   ∼
⊇   ⊇   ⊇   ?   ∼
⊆   ⊆   ?   ⊆   ∼
∼   ∼   ∼   ∼   ∼
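
Path selection and semantic derivation can be sketched together. The weights are those listed in Figure 7 and the combination rules are those of Table I; everything else is an assumption made for illustration: the function names, the example paths (the exact six paths of Figure 7 are not reproduced here), and the choice to propagate "?" once it appears, which Table I does not cover. The recursion of Equations 1.1 and 1.2 is written as an equivalent left-to-right loop.

```python
import math

# Relation weights from Figure 7.
WEIGHTS = {"SYN": 1.0, "HYPER": 0.8, "HYPO": 0.8, "REL": 0.5}

# Table I: ROWS[left][index of right operand]; "?" means human intervention.
RELATIONS = ("SYN", "HYPER", "HYPO", "REL")
ROWS = {
    "SYN":   ("SYN", "HYPER", "HYPO", "REL"),
    "HYPER": ("HYPER", "HYPER", "?", "REL"),
    "HYPO":  ("HYPO", "?", "HYPO", "REL"),
    "REL":   ("REL", "REL", "REL", "REL"),
}


def similarity(path):
    """Average weight of the relations along a path."""
    return sum(WEIGHTS[r] for r in path) / len(path)


def select_optimal(paths):
    """Return the unique path of highest similarity, or None on a tie."""
    best = max(similarity(p) for p in paths)
    winners = [p for p in paths if math.isclose(similarity(p), best)]
    return winners[0] if len(winners) == 1 else None  # None: ask the user


def combine(left, right):
    """The operation of Table I on two relations."""
    if left == "?":
        return "?"  # assumption: an undetermined result stays undetermined
    return ROWS[left][RELATIONS.index(right)]


def sem(path):
    """Sem(p_n) = Sem(p_{n-1}) combined with r_n, with Sem(p_1) = r_1."""
    result = path[0]
    for relation in path[1:]:
        result = combine(result, relation)
    return result


# Illustrative candidate paths between two concepts.
candidates = [["SYN", "SYN"], ["SYN", "HYPO"], ["HYPO", "HYPER"], ["HYPO", "HYPO"]]
optimal = select_optimal(candidates)
derived = sem(optimal)
```

With these candidates the path of two synonym edges has similarity 1 and is selected, and the derived relation between its endpoints is synonymy. A path such as HYPO followed by HYPER derives "?", which corresponds to the question mark cells of Table I.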

1.4 Contributions

This thesis is focused on the reconciliation of heterogeneities among distributed data sources

to achieve data integration and interoperability. Semantic Web technologies, centered on the use

of ontologies, are extensively used in our approaches. We discuss the fundamental issues, including

metadata representation, the mapping process, and query processing, for each of the data

integration scenarios described in the above case studies. In particular,

we make the following contributions in this thesis:

• In Chapter 2, we propose an ontology-based approach to the integration of heterogeneous

XML sources. The global ontology takes into account both the XML nesting structure

and the domain structure, which are expressed in RDFS, so as to enable semantic inter-

operation between the XML sources. This integration process is lossless with respect to

the nesting structure of the XML sources, so that XML structural queries can be cor-

rectly rewritten. We refine the concepts of certain answers and of query containment,

in two cases of query processing: global-to-local query rewriting and local-to-local query

rewriting. A query rewriting algorithm that guarantees equivalence is provided for each

case of query rewriting.

• To achieve interoperability across heterogeneous data sources with schemas, we propose

an ontology-based framework, PEPSINT, built on a hybrid P2P architecture, as presented

in Chapter 3. The global RDF ontology is constructed using the GaV approach in the

super-peer. It behaves not only as a central control point over the peers but also as a

mediator for query translation from peer to peer. We provide a set of query rewriting

algorithms that can be used to propagate a user’s query across the heterogeneous XML

or RDF data sources in PEPSINT. The integration of the answer structures is considered

in query processing.

• In our framework for pure P2P data integration, we propose a mapping language, namely

the P2P Mapping Language (PML), to express the semantic mappings among local on-

tologies that represent the local schemas. A meta-ontology called RDF Mapping Schema

(RDFMS) is used as a physical representation of the mappings. We also discuss the

process of P2P query answering across the individual peers, by considering the individual

architecture of each peer. We also define the semantics of PML based on first-order logic

(FOL), which enables the use of the mappings for query processing in the system. We

propose a P2P query rewriting algorithm to process conjunctive RQL (c-RQL) queries

across the P2P network. We discuss these issues in detail in Chapter 4.

• Within the Semantic Desktop vision, we propose a layered framework for personal infor-

mation management (PIM) in desktops, in which multiple ontologies playing a variety

of roles are employed. As elaborated in Chapter 5, this layered architecture enables the

organization, navigation, and manipulation of desktop data in a semantically rich way,

and provides certain advantages (e.g., flexibility and extensibility) over the use of a single

domain model. We particularly present the idea of 3D navigation, which is a combination

of the vertical, horizontal and temporal navigation in the personal information space. We

introduce the idea of personal information application (PIA). We also discuss the development

of PIAs in an MVC-based designer, namely the PIA designer. Two different ways

of inter-desktop information sharing and data integration are presented, i.e., by means of

PIA-based desktop services and by means of PIA-to-PIA (A2A) query processing.

• In the scenario of heterogeneous geospatial data integration, we propose an ontology align-

ment algorithm based on a set of deduction rules, which can be performed automatically

when certain pre-conditions are established. We also propose a sound query rewriting

algorithm based on the bidirectionality and composition of the mappings. The algorithm

can compute a contained rewriting of a query in both cases. Query containment ensures

that all the answers retrieved by executing the rewriting are a subset of the answer to the

original query, thus guaranteeing precise query answering across distributed data sources.

We present our work on geospatial data integration in Chapter 6.

In the following chapters, we describe each of the five approaches mentioned above using the

following structure. Each chapter starts with an introduction to the problem to be addressed.

After reviewing the previous work related to that problem, we present the architecture of our

approach. Then, we focus on metadata representation, ontology mapping, and

ontology-based query processing. Finally, each chapter ends with a summary of our approach

and of future work that addresses open research challenges. We conclude the entire thesis in

Chapter 7.

CHAPTER 2

CENTRAL DATA INTEGRATION

2.1 Introduction

2.1.1 Problem Description

Data integration is the problem of combining data residing at different sources, and provid-

ing the user with a unified view of these data (70). It is relevant to a number of applications

including data warehousing, enterprise information integration, geographic information sys-

tems, and e-commerce applications. Data integration systems are usually characterized by

an architecture based on a global schema, which provides a reconciled and integrated view of

the underlying sources. These systems are called central data integration systems, and a large

number of such systems have been proposed (3; 7; 25; 29; 69; 81; 93; 105; 109).

There are two key issues in central data integration, namely system modeling and query

processing. For modeling the relation between the sources and the global schema, two basic

approaches have been proposed (24; 70; 109). The first approach, called Global-as-View (GaV),

expresses the global schema in terms of the data sources. The second approach, called Local-

as-View (LaV), requires the global schema to be specified independently from the sources, and

the relationships between the global schema and the sources are established by defining every

source as a view over the global schema.

Query processing in central data integration may require a query reformulation step: the

query over the global schema has to be reformulated in terms of a set of queries over the

sources. In the GaV approach, every entity in the global schema is associated with a view

over the source local schema, therefore query processing in this case uses a simple “unfolding”

strategy (70). In contrast, query processing in LaV can be complex, since the local sources may

contain incomplete information. In this sense, query processing in LaV, called view-based query

processing (1; 27; 58), is similar to query answering with incomplete information (110). It can

also be the case that two data sources communicate in a peer-to-peer (P2P) way either through

the global schema or directly. Data exchange or query processing may occur in this case,

which requires data translation or query rewriting when heterogeneities are present between

the communicating sources (38; 68; 81; 93; 97).

The heterogeneities between distributed data sources can be classified as syntactic, schematic,

and semantic heterogeneities (16). Syntactic heterogeneity is caused by the use of different mod-

els or languages (e.g., relational and XML). Schematic heterogeneity results from the different

data organizations (e.g., aggregation or generalization hierarchies). Semantic heterogeneity is

caused by different meanings or interpretations of data. All these heterogeneities have to be

resolved, to achieve the goal of integration or interoperation. In this thesis, we consider the

semantic integration of XML data and data exchange between heterogeneous XML sources,

using ontologies.

XML documents that represent data with similar semantics may conform to different schemas.

Therefore, a user must construct queries in accordance with the structure of each XML document,

even to retrieve fragments of information that have the same meaning. This fact makes

the formulation of queries over heterogeneous XML sources a nontrivial burden to the user.

Furthermore, this shortcoming of XML impedes the interoperation between XML sources since

the reformulation of XML queries from one source to another has to eliminate the structural

differences of the queries while presenting the same semantics. Let us illustrate this problem

using a running example.

Example 2.1 Figure 8 shows two XML schemas (S1 and S2) with their instances (i.e., XML

documents D1 and D2), which are represented as trees. It is obvious that S1 and S2 both

represent a many-to-many relationship between two concepts: book and author (equivalently

denoted article and writer in S2). However, structurally speaking, they are different: S1,

which is a book-centric schema, has the author element nested under the book element, whereas

S2, which is an author-centric schema, has the article element nested under the writer

element. Suppose our query target is “Find all the authors of the publication b2.” The XML

path expressions that are used to define the search patterns in the two schema trees can be

respectively written as /books/book[booktitle.text()="b2"]/author/name and /writers

/writer[article/title.text()="b2"]/fullname, where the contents in the square brackets

specify the constraints for the search patterns. We notice that although the above two search

patterns refer to semantically equivalent concepts, they follow two distinct XML paths.
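
The two search patterns of Example 2.1 can be tried out with Python's standard xml.etree.ElementTree module. The two documents below are hypothetical instances that follow the structure of S1 and S2; they are not the documents D1 and D2 of Figure 8. ElementTree supports only a subset of XPath, so the first pattern is written with its child-text predicate and the second pattern, whose predicate spans a nested path, is evaluated with an explicit filter.

```python
import xml.etree.ElementTree as ET

# Hypothetical book-centric document following schema S1.
DOC_S1 = """
<books>
  <book><booktitle>b1</booktitle><author><name>a1</name></author></book>
  <book><booktitle>b2</booktitle>
    <author><name>a2</name></author>
    <author><name>a3</name></author>
  </book>
</books>
"""

# Hypothetical author-centric document following schema S2.
DOC_S2 = """
<writers>
  <writer><fullname>a1</fullname><article><title>b1</title></article></writer>
  <writer><fullname>a2</fullname><article><title>b2</title></article></writer>
  <writer><fullname>a3</fullname><article><title>b2</title></article></writer>
</writers>
"""

books = ET.fromstring(DOC_S1.strip())
writers = ET.fromstring(DOC_S2.strip())

# S1: /books/book[booktitle = "b2"]/author/name
authors_s1 = [n.text for n in books.findall("./book[booktitle='b2']/author/name")]

# S2: /writers/writer[article/title = "b2"]/fullname
authors_s2 = [
    w.findtext("fullname")
    for w in writers.findall("./writer")
    if any(t.text == "b2" for t in w.findall("./article/title"))
]
```

Both lists contain the same two names even though the first query descends from book to author and the second descends from writer to article, which is the structural mismatch that the rest of the chapter sets out to hide from the user.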

[Figure 8: tree representations of XML schema S1 (books / book / booktitle and author / name) with its instance document D1 ("books.xml"), and of XML schema S2 (writers / writer / fullname and article / title) with its instance document D2 ("writers.xml").]

Figure 8. Two XML sources with structural heterogeneities.

2.1.2 Semantic XML Data Integration

The structural diversity of conceptually equivalent XML schemas leads to the fact that XML

queries over different schemas may represent the same semantics even though they are formulated

using two different alphabets and structures. In comparison, the schema languages used

for conceptual modeling are structurally flat so that the user can formulate a determined con-

ceptual query without worrying about the structure of the source. RDF Schema (RDFS) (77),

DAML+OIL, and OWL are examples of languages used to create ontologies, which represent a

shared, formal conceptualization of the domain of knowledge (55). There are currently many

attempts to use conceptual schemas (or ontologies) (3; 5; 38) or conceptual queries (29; 31) to

overcome the problem of structural heterogeneities among XML sources.

In this chapter, we propose an ontology-based approach for the integration of XML sources.

We use the GaV approach to model the mappings between the source schemas and the global

ontology, which is, therefore, an integrated view of the source schemas. The global ontology

is expressed in terms of RDFS, which is at the core of several ontology languages (e.g., OWL

and DAML+OIL). In order to facilitate the mappings between the XML source schemas and

the global RDFS ontology, their syntactic disparity needs to be reconciled. To this end, we

first transform the heterogeneous XML sources into local RDFS ontologies (defined using the

RDFS space (22)), which are then merged into the global ontology. This transformation process

encodes the mapping information between each concept in the local ontology and the corre-

sponding element in the XML source. The ontology merging process can be semi-automatically

performed (e.g., by using the PROMPT algorithm (88)). In addition to the global ontology,

the merging process also produces a mapping table, which contains the mapping information

between concepts in the global ontology and concepts in the local ontologies. In our approach,

we can translate a query posed against the global ontology into subqueries over the sources.

We can also translate a query posed against an XML source to an equivalent query against any

other XML source. We call the query rewriting in the first case global-to-local query rewriting

and that in the second case local-to-local query rewriting. Given that we choose a GaV ap-

proach, the global ontology is a view over the local ontologies, therefore the process of mapping

a query over the global ontology to queries over the local ontologies is straightforward.

2.1.3 Contributions

We make the following contributions in this chapter:

• We propose an ontology-based approach to the integration of heterogeneous XML sources.

The global ontology takes into account both the XML nesting structure and the domain

structure, which are expressed in RDFS, so as to enable semantic interoperation between

the XML sources. This integration process is lossless with respect to the nesting structure

of the XML sources, so that XML structural queries can be correctly rewritten.

• We extend the RDFS space by defining additional metadata, which enables the encoding

of the nesting structure of the XML Schema in the RDF schema. We convert each of the

XML source schemas into a local RDFS ontology while preserving their structure, so that

they share a uniform representation with the global ontology.

• Finally, we refine the concepts of certain answers and of query containment, in two query-

ing modes: global-to-local query rewriting and local-to-local query rewriting. Further-

more, a query rewriting algorithm that guarantees equivalence is provided for each case

of query rewriting.

The rest of this chapter is organized as follows. Section 2.2 describes related work. Section

2.3 describes the framework for the integration of XML sources. Data integration and query

processing, which are the two key points in our approach, are discussed respectively in Sections

2.4 and 2.5. We draw conclusions and discuss future work in Section 2.6.

2.2 Related Work

There are a number of approaches addressing the problem of data integration or interoper-

ation among XML sources. We classify those approaches into three categories, depending on

their main focus, namely semantic integration, query languages, and query rewriting.

2.2.1 Semantic Integration

High-level Mediator Amann et al. propose an ontology-based approach to the integration of

heterogeneous XML Web resources in the C-Web project (3; 5). The proposed approach

is very similar to our approach except for the following differences. The first difference

is that they use a local-as-view (LaV) approach (24) with a hypothetical global ontology

that may be incomplete. The second difference is that they do not retain the XML

documents’ structures in their conceptual mediator so they cannot deal with the reverse

query translation (from the XML sources to the mediator). Our previous work involved

a layered approach for the interoperation of heterogeneous web sources, but the nesting

structure associated with XML was lost in the mapping from XML data to RDF data

(38).

Direct Translation Klein proposes a procedure to transform XML data directly into RDF

data by annotating the XML documents via external RDFS specifications (66). The

procedure makes the data in XML documents available for the Semantic Web. However,

since the proposed approach does not consider the document structure of XML sources,

it cannot propagate queries from one XML source to another XML source.

Semantics Encoding The Yin/Yang Web approach proposed by Patel-Schneider and Simeon

addresses the problem of incorporating the XML and RDF paradigms (94). They develop

an integrated model for XML and RDF by integrating the semantics and inferencing rules

of RDF into XML, so that XML querying can benefit from their RDF reasoner. But the

Yin/Yang Web approach does not solve the problem of query answering across heteroge-

neous sources, that is, sources with different syntax or data models. It also cannot process

higher-level queries such as RDQL. Lakshmanan and Sadri also propose an infrastructure

for interoperating over XML data sources by semantically marking up the information

contents of data sources using application-specific common vocabularies (68). However,

the proposed approach relies on the availability of an application-specific standard ontol-

ogy that serves as the global schema. This global schema contains information necessary

for interoperation, such as key and cardinality information for predicates. This approach

has the same problem as the Yin/Yang Web approach, that is, higher-level queries cannot

be translated downward into XML queries.

2.2.2 Query Languages

CXQuery is a new XML query language proposed by Chen and Revesz, which borrows

features from both SQL and other XML query languages (31). It overcomes the limitations of

the XQuery language by allowing the user to define views, explicitly specify the schema of the

query answers, and query through multiple XML documents. However, CXQuery does not solve

the issue of structural heterogeneities among XML sources. The user has to be familiar with

the document structure of each XML source to formulate queries. Heuser et al. also present a

new language (CXPath) based on XPath for querying XML sources at the conceptual level (29).

CXPath is used to write queries over a conceptual schema that abstracts the semantic content

of several XML sources. However, they do not consider the situation of query translation from

the XML sources to the global conceptual schema.

2.2.3 Query Rewriting

Query rewriting is often a key issue for both mediator-based integration systems and peer-

to-peer systems. The Clio approach, which provides an example for the former case, mainly

addresses schema mapping and data transformation between nested schemas and/or relational

databases (97). It focuses on how to take advantage of schema semantics to generate the

consistent translations from source to target by considering the constraints and structure of the

target schema. It uses queries to express the mappings from the data to the target schema. The

Piazza system is a peer-to-peer system that aims to solve the problem of data interoperation

between XML and RDF (59). The system achieves its interoperation in a low-level (syntactic)

way, i.e., through the interoperation of XML and the XML serialization of RDF, whereas we

aim to achieve the same objective at the semantic level. For example, our approach supports

a conceptual view of XML sources (to facilitate the formulation of queries) and allows for

conceptual queries (e.g., RDF queries).

[Figure 9: three layers. Top: the global RDFS ontology G with the mapping table M. Middle: the local RDFS ontologies R1, R2, ..., Rn. Bottom: the local XML sources S1, S2, ..., Sn. Arrows denote ontology integration, global-to-local query rewriting, and local-to-local query rewriting.]

Figure 9. The ontology-based framework for the integration of XML sources.

2.3 Framework

In this section, we present the framework for the integration of XML data sources and

in particular we describe the integration of XML source schemas and query processing in the

integrated system.

As shown in Figure 9, we generate for each local XML source a local RDFS ontology, which

represents the source schema. These local RDFS ontologies are then merged into the global

RDFS ontology, which provides an overview of all the local ontologies and a mediation between

each pair of XML sources. In this merging process, a mapping table is also produced to contain

all the mappings, which are correspondences between the global ontology and local ontologies.

The ontology-based XML data integration framework I can be formalized as a quadruple

〈G,S, µ,M〉, where

• G is the global ontology expressed in RDFS over the alphabet AG. The alphabet comprises

the names of the classes and properties of G.

• S is the XML source schema expressed in a language LS over the alphabet AS , which

comprises the XML element names in S.

• µ is a schema transformation function, which generates a local RDFS ontology R for S,

such that R encodes the nesting structure specified by S.

• M is the mapping table consisting of a set of mappings between the global ontology G and

a set of n XML sources Si, where i ∈ [1..n]. Each entry in M is of the form (g, s1, ..., sn),


where g ∈ AG and si ∈ ASi ∪ {ε} for i ∈ [1..n]. Note that ε is used when a source schema

has no element corresponding to an element of G.
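As an illustration of the mapping table M, the following minimal sketch represents each entry (g, s1, ..., sn) as a small record. The names `MappingEntry`, `EPSILON`, and `project` are hypothetical and introduced only for this sketch. The rows are borrowed from the running example (Table IV), and ε is modeled as `None`.

```python
from dataclasses import dataclass
from typing import Optional

EPSILON = None  # stands for ε: no corresponding element in this source schema


@dataclass(frozen=True)
class MappingEntry:
    """One row (g, s_1, ..., s_n) of the mapping table M (hypothetical representation)."""
    g: str                              # class or property name of the global ontology G
    sources: tuple[Optional[str], ...]  # one XPath (or EPSILON) per source schema S_i


# A few rows of the running example with two sources, S1 and S2.
M = [
    MappingEntry("Books", ("/books", EPSILON)),
    MappingEntry("Book", ("/books/book", "/writers/writer/article")),
    MappingEntry("Authors", (EPSILON, "/writers")),
]


def project(table, i):
    """Return the pairs (g, s_i) for source index i, skipping ε entries."""
    return {(e.g, e.sources[i]) for e in table if e.sources[i] is not EPSILON}


pairs_s1 = project(M, 0)
pairs_s2 = project(M, 1)
```

The projection corresponds to restricting M to the global ontology and a single source, which is the operation that query rewriting relies on.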

The first task of the framework is the integration of the distributed and heterogeneous XML

sources. Here, we are mainly concerned with the issue of schematic heterogeneity, that is, with

the different schema structures among the sources. The process of data integration contains

two steps: schema transformation and ontology merging.

In the first step, we use a local RDFS ontology to represent each XML source schema so as to

achieve a uniform representation for the next step. In other words, the schema transformation

function µ takes as input the source schema S, and the output is the local ontology R. The key

operation in this schema transformation is the preservation of the nesting structure of S. To this

end, we have to extend the RDFS space since it does not have a property to encode the nesting

structure between elements. In particular, we add a new RDF property, contained, in the

namespace “http://www.example.org/rdf-extension#” (abbreviated as rdfx). The RDF/XML

syntax for this property is described below.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:rdfx="http://www.example.org/rdf-extension#">
  <rdf:Property rdf:about="http://www.example.org/rdf-extension#contained">
    <rdfs:isDefinedBy rdf:resource="http://www.example.org/rdf-extension#"/>
    <rdfs:label>contained</rdfs:label>
    <rdfs:comment>The containment between two classes.</rdfs:comment>
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
    <rdfs:domain rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
  </rdf:Property>
</rdf:RDF>

The second step is the merging (or integration) of all local ontologies, which generates the

global ontology as well as the mapping table. The merging is performed based on the semantics

of classes and properties from each of the local ontologies. In particular, the classes or properties

that have the same (equivalent) or similar semantics are merged into a class or a property of the

global ontology. Then, each of these correspondences is recorded as an entry in the mapping

table. Different kinds of mappings can be established between two schemas or ontologies (116).

For this chapter, however, we consider only the equivalence type of mapping. We also do

not consider the different degrees to which two concepts may be equivalent. For instance, we

simply take book and article as equivalent concepts, although we could further refine such

equivalence. Additional domain-related knowledge (e.g., inheritance) may be considered. We

discuss these issues in more detail in Section 2.4.

It is worth mentioning that the global ontology in our system has two roles: (1) It provides

the user with access to the data with a uniform query interface to facilitate the formulation of

a query on all the XML sources; (2) It serves as the mediation mechanism for accessing the

distributed data through any of the XML sources.

Our framework handles user queries using a query rewriting strategy. More specifically,

query processing in our framework may occur in the following two directions, as shown in

Figure 9:


Global-to-local query rewriting. When the user poses a query q on the global ontology,

the system rewrites q into the union q′ of subqueries, one for each XML source. The

subqueries are then executed over the XML sources to get the answers, which are then

integrated (by using union) to produce the answer to q.

Local-to-local query rewriting. Given a query q posed on a local source, its answers then

include not only those retrieved from the local source, but also those from all the other

sources in the system. For the purpose of getting answers from the other sources, it

requires that q be rewritten (through the global ontology) into a union q′ of queries, one

on each of the other sources. Query rewriting in this direction is performed similarly to

that in peer-to-peer systems (101).

Query rewriting in both directions is based on the mapping information contained in the

mapping table. Each entry contains an element (RDF class or property) of the global ontology

and its corresponding elements in the local source schemas. Given that query rewriting is from a

query over one alphabet to that over another alphabet, the mapping table provides a convenient

way to find the mapping between alphabets, in both rewriting directions. In addition, the

query languages used to formulate the queries have to be taken into consideration, since they

may have different expressiveness. We consider a subset of XQuery (19), called conjunctive

XQuery (c-XQuery), for queries over the XML sources and a subset of RDQL (62), namely

conjunctive RDQL (c-RDQL), for queries over the global RDFS ontology. We discuss in detail

query processing and related issues in Section 2.5.


2.4 Integrating Structure and Semantics

2.4.1 Local XML Schemas and Local RDFS Ontologies

To integrate heterogeneous XML data sources, we first transform the local XML schema into

a local RDFS ontology while preserving the XML document structure. By document structure,

we mean the structural relationship of objects specified in data-centric documents (21) by a

schema language (such as DTD, XML Schema, or RelaxNG1). In this chapter, we only focus on

the nesting structure (i.e., hierarchy). Other structural properties include order. A consequence

of not including order in our framework is that we cannot consider a query that involves the

order of the subelements of an element. However, this kind of query is of little interest in a

framework where we are mostly concerned with the semantics of the data.

Elements and attributes are the two basic building blocks of XML documents. Elements

can be defined as simple types, which cannot have element content and cannot carry attributes,

or complex types, which allow elements in their content and/or contain attributes. On the other

hand, all attribute declarations must reference simple types since attributes cannot contain other

elements or other attributes. From the perspective of XML Schema, these nesting relationships

are defined in terms of datatypes (simple or complex). An XML schema can be formalized as

an edge-labeled tree, namely an XML schema tree, as depicted in Figure 8. We overlook the

distinction between XML elements and attributes by considering both of them as vertices in

the XML schema tree.

1http://relaxng.sourceforge.net


Definition 2.1 An XML schema $S$ over alphabet $A_S$ is an edge-labeled tree $S = (V, E, \lambda)$, where $V$ is a set of vertices, $E = \{(v_i, v_j) \mid v_i, v_j \in V\}$ is a set of edges, and $\lambda$ is a labeling function $\lambda : E \mapsto A_S$.

Before we discuss schema transformation, let us look at the formalization of ontologies.

Both the global ontology and local ontologies are actually RDF schemas defined in the RDFS

space, which is extended with the RDF property “rdfx:contained”. An RDF schema can be

formalized as a labeled graph, called RDF schema graph, as defined in Definition 2.2. We do

not elaborate on the data types of RDF properties and assume that they are all of type literal.

Also, we do not take into account the notion of namespace in the definition of both XML and

RDF schemas.

Definition 2.2 An RDF schema graph $R$ over alphabet $A_R$ is a directed labeled graph $R = (V, E, \lambda)$, where $V$ is a set of labeled vertices consisting of classes $C$, properties $P$, and data types $L$, $E = \{(v_i, v_j) \mid v_i, v_j \in V\}$ is a set of labeled edges, and $\lambda$ is a labeling function $\lambda : V \cup E \mapsto A_R$, such that

• $\forall v \in P$, we have $domain(v) \in C$, $range(v) \in C \cup L$, $\lambda((v, domain(v)))$ = “rdfs:domain”, and $\lambda((v, range(v)))$ = “rdfs:range”;

• $\forall e = (v_i, v_j) \in E$, we have $\lambda(e)$ = “rdfs:subClassOf” (or “rdfx:contained”) if $v_i, v_j \in C$, or $\lambda(e)$ = “rdfs:subPropertyOf” if $v_i, v_j \in P$.

Now we are able to define the schema transformation function $\mu$. Formally speaking, the schema transformation function $\mu$ is a function $\mu : S \mapsto R$, where $S = (V_S, E_S, \lambda_S)$, $R = (V_R, E_R, \lambda_R)$, and $V_R = C \cup P$, such that $\forall e_{ij} = (v_i, v_j) \in E_S$, we have $\mu(v_j) \in V_R$, $\lambda_R(\mu(v_j)) = \lambda_S(e_{ij})$, and furthermore:

(1) if $\exists (v_j, v_k) \in E_S$, then $\mu(v_j) \in C$, $(\mu(v_j), \mu(v_i)) \in E_R$, and $\lambda_R((\mu(v_j), \mu(v_i)))$ = “rdfx:contained”;

(2) if $\nexists (v_j, v_k) \in E_S$, then $\mu(v_j) \in P$, $(\mu(v_j), \mu(v_i)) \in E_R$, and $\lambda_R((\mu(v_j), \mu(v_i)))$ = “rdfs:domain”.

The transformations thus defined fall into two categories:

Element-level transformation The element-level transformation converts from XML complex-

type elements to RDF classes and from XML simple-type elements to properties. For

example, for S1 in Example 2.1, we define the RDF classes Books, Book, and Author,

while taking booktitle and name as RDF properties of Book and Author, respectively,

as depicted in the resulting local RDFS ontology of Figure 10.

Structure-level transformation The structure-level transformation encodes the nesting struc-

ture of the XML schema into the local RDFS ontology. In particular, the nesting may

occur between two complex-type elements or between a complex-type element and its

child (simple) element. Following the element-level transformation, the nesting struc-

ture in the former case corresponds to a class-to-class relationship between two RDFS

classes, which are connected by the property rdfx:contained. The first item that defines

µ formalizes this case. In the latter case, the XML nesting structure corresponds to the

class-to-literal relationship in the local ontology, with the class and the literal connected

by the corresponding property. The second item that defines µ formalizes this case.
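The two transformation levels can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the schema tree is given as (parent, child, label) edges with an artificial document root, the function name `mu` and the triple representation are hypothetical, and class names are capitalized purely to match the naming used in Table II.

```python
def mu(edges, root):
    """Sketch of the schema transformation: XML schema tree -> local RDFS ontology.

    edges: list of (parent, child, label) triples of the XML schema tree.
    root:  id of the artificial document root (it has no incoming edge).
    Returns (classes, properties, triples).
    """
    children = {}   # vertex -> list of child vertices
    label_of = {}   # vertex -> label of its incoming edge
    for parent, child, label in edges:
        children.setdefault(parent, []).append(child)
        label_of[child] = label

    def name(v):
        # Assumption: capitalize class names as in Table II; properties keep their label.
        return label_of[v].capitalize() if v in children else label_of[v]

    classes, properties, triples = set(), set(), set()
    for parent, child, _ in edges:
        if child in children:
            # Item (1): an element with subelements becomes a class,
            # linked to its parent class by rdfx:contained.
            classes.add(name(child))
            if parent != root:
                triples.add((name(child), "rdfx:contained", name(parent)))
        else:
            # Item (2): a leaf element becomes a property whose domain is the parent class.
            properties.add(name(child))
            if parent != root:
                triples.add((name(child), "rdfs:domain", name(parent)))
    return classes, properties, triples


# Schema S1 of the running example: /books/book/{booktitle, author/name}.
s1_edges = [
    (0, 1, "books"),
    (1, 2, "book"),
    (2, 3, "booktitle"),
    (2, 4, "author"),
    (4, 5, "name"),
]
classes, properties, triples = mu(s1_edges, root=0)
```

For S1 the sketch yields the classes Books, Book, and Author and the properties booktitle and name, in line with the element-level transformation described above.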


By applying the schema transformation function to the two XML schemas in Figure 8, we

can get the resulting local ontologies as shown in Figure 10. We see that rdfx:contained

enables the representation of the nesting relationship. Specifically, by following the edges

of rdfx:contained from Books to Author in R1, we actually get the corresponding path

/books/book/author in S1. In terms of the alphabets, the schema transformation function

specifies a mapping between the alphabet of the source schema and that of the local ontology.

Table II lists the mapping between the XML schema S1 and the local RDFS ontology R1. For

simplicity, we use XPath to specify the XML elements. Also, the properties in the mapping

table are in the form of an RDF expression c.p, where c is the class associated with p.

TABLE II

MAPPINGS BETWEEN XML SOURCE SCHEMA S1 AND THE LOCAL ONTOLOGY R1

XPath expressions in S1        RDF expressions in R1
/books                         Books
/books/book                    Book
/books/book/booktitle          Book.booktitle
/books/book/author             Author
/books/book/author/name        Author.name

Figure 10. Local ontologies R1 and R2 transformed from XML source schemas S1 and S2.

2.4.2 The Global RDFS Ontology

Now that the source schemas are represented by local RDFS ontologies, we are able to merge

them to construct the global RDFS ontology. In other words, the process of ontology merging

takes as input the multiple local ontologies and returns a merged ontology as the output (108).

Ontology merging and ontology alignment, which require the mapping of ontologies, are

widely pursued research topics. Readers are referred to a thorough survey of the state of the

art of ontology mapping (64). In this chapter we do not intend to introduce a new technique for

ontology merging. Instead, we utilize existing techniques to generate the integrated ontology

from the local ontologies. In particular, we use an approach (such as PROMPT (88)) that

provides the following functionalities:

• Merging of classes: Multiple conceptually equivalent classes of the local ontologies are

combined into one class in the global ontology.


• Merging of properties: Multiple conceptually equivalent properties of the equivalent classes

in the local ontologies are combined as one property of the combined class in the global

ontology.

• Merging relationships between classes: Given two conceptually equivalent relationships,

e.g., p1 from a class c1 to another class c′1 and p2 from c2 to c′2, we combine p1 and p2

into one relationship p between the combined class c (of c1 and c2) and c′ (of c′1 and c′2).

• Copying a class or a property: If there does not exist a conceptually equivalent class or

property for a class c (or a property p of c), we simply copy c (or p, as a property of the

target class of c) into the global ontology.

• Generalizing semantically related classes into a superclass: The superclass can be obtained

by searching an existing knowledge domain (e.g., the DAML Ontology Library) or reason-

ing over a thesaurus such as WordNet.1 For example, we can find in the semantic network

of terms (consisting of terms and their semantic relations) that two classes (Author and

Writer) have the same hypernym (Person), which is then taken as a superclass of both

classes.

Figure 11 shows the global ontology that results from merging the two local RDF ontologies

of Figure 10. The greyed classes and properties are merged classes and properties from the

original ontologies. For instance, Book in R1 and Article in R2 are merged into Book, whereas booktitle in R1 and title in R2 are merged into title. The classes Book and Author are also respectively extended with the superclasses Publication and Person.

1 http://wordnet.princeton.edu

Besides the global ontology, the process of ontology merging also yields as an output the

mapping table that contains the mappings between the local RDFS ontologies and the global

RDFS ontology. In general, if a class, property, or relationship between classes p in the global

ontology is the result of merging pi and pj from different local ontologies, then a tuple of the

form (p, pi, pj) is generated. If a class or property p in the global ontology is only copied from pi

in a local ontology, then a tuple (p, pi) is produced. For instance, for the property Book.title (in

the global ontology), which is merged from Book.booktitle in R1 and Article.title in R2,

we generate a tuple in the mapping table: (Book.title, Book.booktitle, Article.title).

Table III lists all the mappings in our example.

Figure 11. The global ontology G that results from merging R1 and R2.

TABLE III

MAPPING TABLE BETWEEN THE GLOBAL ONTOLOGY AND LOCAL ONTOLOGIES

RDF expressions in the     RDF expressions in R1     RDF expressions in R2
global ontology
Books                      Books                     -
Book                       Book                      Article
Book.title                 Book.booktitle            Article.title
Authors                    -                         Writers
Author                     Author                    Writer
Author.name                Author.name               Writer.fullname

Now that we have the one-to-one mappings M1 between the XML source schemas and their local ontologies and the one-to-one mappings M2 between the local ontologies and the global ontology, we can compose M1 and M2 to get the mappings M between the source schemas and the global ontology. Table IV shows the results.

TABLE IV

MAPPING TABLE BETWEEN THE GLOBAL ONTOLOGY AND XML SOURCE SCHEMAS

RDF expressions in the     XPath expressions in S1      XPath expressions in S2
global ontology
Books                      /books                       -
Book                       /books/book                  /writers/writer/article
Book.title                 /books/book/booktitle        /writers/writer/article/title
Authors                    -                            /writers
Author                     /books/book/author           /writers/writer
Author.name                /books/book/author/name      /writers/writer/fullname
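The composition of M1 and M2 can be sketched as a simple dictionary lookup. The sketch below covers only source S1, using the rows of Table II for M1 and the R1 column of Table III for M2. The function name `compose` and the use of `None` for the "-" entries are assumptions made for this illustration.

```python
# M1 for source S1 (Table II): local RDF expression in R1 -> XPath expression in S1.
m1_s1 = {
    "Books": "/books",
    "Book": "/books/book",
    "Book.booktitle": "/books/book/booktitle",
    "Author": "/books/book/author",
    "Author.name": "/books/book/author/name",
}

# M2 restricted to R1 (Table III): global RDF expression -> RDF expression in R1.
# None stands for "-", i.e. no corresponding element in R1.
m2_r1 = {
    "Books": "Books",
    "Book": "Book",
    "Book.title": "Book.booktitle",
    "Authors": None,
    "Author": "Author",
    "Author.name": "Author.name",
}


def compose(m2, m1):
    """Compose global->local (m2) with local->XPath (m1) into global->XPath."""
    return {g: (m1.get(local) if local is not None else None)
            for g, local in m2.items()}


m_s1 = compose(m2_r1, m1_s1)
```

The resulting dictionary corresponds to the S1 column of Table IV. The S2 column would be obtained in the same way from the mappings between S2 and R2.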


2.4.3 Data Integration Semantics

In this subsection, we discuss the semantics of the data integration in our proposed frame-

work including the semantics of the XML (local) databases, the mapping table, and the RDFS

(global) database. The discussion of the syntax and semantics of queries is postponed until

Section 2.5. In what follows, we refer to a fixed, finite set Γ of constants, which is shared by all

data sources. We also refer to a finite set U of URIs.

There are two types of databases in the framework, i.e., the local XML databases and the

global RDF database. An XML database is an XML instance tree, and an RDF database is an

RDF instance graph.

Definition 2.3 (XML instance tree) Given an XML schema $S = (V_S, E_S, \lambda_S)$, an instance of $S$ is an XML instance tree $G = (V_G, E_G, \tau, \lambda_G)$, where $V_G$ is a set of vertices, $E_G$ is a set of edges, and

(1) $\tau$ is a typing function $\tau : V_G \mapsto V_S$, such that (a) $\forall v \in V_G$, $\tau(v) \in V_S$, and (b) $\forall (v_i, v_j) \in E_G$, $(\tau(v_i), \tau(v_j)) \in E_S$.

(2) $\lambda_G$ is a labeling function, such that (a) $\forall v \in V_G$, $\lambda_G(v) \in \Gamma \cup \{\varepsilon\}$, and (b) $\forall (v_i, v_j) \in E_G$, $\lambda_G((v_i, v_j)) = \lambda_S((\tau(v_i), \tau(v_j)))$.
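Conditions (1b) and (2b) of Definition 2.3 can be checked mechanically. The following sketch assumes a dictionary-based encoding of schema edges, instance edges, and the typing function. The function name `conforms` and the vertex identifiers are hypothetical, and vertex labels (condition (2a)) are not modeled.

```python
def conforms(schema_edges, instance_edges, tau):
    """Check conditions (1b) and (2b) of the XML instance tree definition.

    schema_edges:   {(v_i, v_j): label} for the schema tree
    instance_edges: {(u_i, u_j): label} for the instance tree
    tau:            typing function as a dict, instance vertex -> schema vertex
    """
    for (ui, uj), label in instance_edges.items():
        typed_edge = (tau[ui], tau[uj])
        if typed_edge not in schema_edges:        # (1b): the typed edge must exist in the schema
            return False
        if schema_edges[typed_edge] != label:     # (2b): the edge labels must agree
            return False
    return True


# Fragment of schema S1: /books/book/booktitle.
schema = {
    ("root", "books"): "books",
    ("books", "book"): "book",
    ("book", "booktitle"): "booktitle",
}
tau = {"n0": "root", "n1": "books", "n2": "book", "n3": "book",
       "n4": "booktitle", "n5": "booktitle"}
good_instance = {
    ("n0", "n1"): "books",
    ("n1", "n2"): "book",
    ("n1", "n3"): "book",
    ("n2", "n4"): "booktitle",
    ("n3", "n5"): "booktitle",
}
# A booktitle placed directly under books violates the nesting of the schema.
bad_instance = {("n0", "n1"): "books", ("n1", "n4"): "booktitle"}

ok = conforms(schema, good_instance, tau)
not_ok = conforms(schema, bad_instance, tau)
```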

Definition 2.4 (RDF instance graph) Given an RDF schema $S = (V_S, E_S, \lambda_S)$, where $V_S = C \cup P$, an instance of $S$ is an RDF instance graph $G = (V_G, E_G, \tau, \lambda_G)$, where $V_G$ is a set of vertices, $E_G$ is a set of edges, $\lambda_G$ is a labeling function $\lambda_G : V_G \cup E_G \mapsto A_S \cup U \cup \Gamma$, and $\tau$ is a typing function $\tau : V_G \cup E_G \mapsto V_S \cup$ {“rdf:Property”} $\cup$ {“rdfs:literal”}, such that $\forall e = (v_i, v_j) \in E_G$, we have

(1) if $\tau(e)$ = “rdf:Property”, then $\lambda_G(e)$ = “rdfx:contained” or “rdfs:subClassOf”, $\lambda_G(v_i), \lambda_G(v_j) \in U$, $\tau(v_i), \tau(v_j) \in C$, and $(\tau(v_i), \tau(v_j)) \in E_S$;

(2) if $\tau(e) \in P$, then $\lambda_G(e) = \lambda_S(\tau(e))$, $\lambda_G(v_i) \in U$, $\tau(v_i) \in C$, $\lambda_S((\tau(e), \tau(v_i)))$ = “rdfs:domain”, $\lambda_S((\tau(e), \tau(v_j)))$ = “rdfs:range”, and

– $\lambda_G(v_j) \in U$, when $\tau(v_j) \in C$;

– $\lambda_G(v_j) \in \Gamma$, when $\tau(v_j)$ = “rdfs:literal”.

The semantics of the mappings depends on the assumptions adopted. In the view-based

approach, there are three assumptions for the inter-schema mappings, namely soundness, completeness, and exactness (70). In particular, given a database $D$, a set of view definitions $V$ over $D$, and view extensions $E$ of $V$, we say the views $V$ are sound if $V^D \supseteq E$, complete if $V^D \subseteq E$, and exact if $V^D = E$. It is common to use the soundness assumption for view-based

data integration (70). Given that our framework adopts a GaV approach, it is natural to as-

sume an exact semantics, that is, the sources are complete with respect to the global database.

However, these assumptions must be defined differently in our framework, where mappings are

represented by element correspondences in the mapping table.
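For the view-based setting, the three assumptions reduce to set comparisons between the result of the view definitions over the database and the view extensions. A minimal sketch, in which the function name `classify` and the sample tuples are hypothetical:

```python
def classify(view_over_db: set, extension: set) -> str:
    """Classify a view by comparing V^D (view_over_db) with its extension E."""
    if view_over_db == extension:
        return "exact"      # V^D = E
    if view_over_db >= extension:
        return "sound"      # V^D ⊇ E: every stored tuple satisfies the view definition
    if view_over_db <= extension:
        return "complete"   # V^D ⊆ E: every tuple of the view is stored
    return "neither"


v_d = {("a1", "b1"), ("a2", "b2")}
exact_case = classify(v_d, {("a1", "b1"), ("a2", "b2")})
sound_case = classify(v_d, {("a1", "b1")})
complete_case = classify(v_d, v_d | {("a3", "b2")})
```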

Given an entry $t_i = (g_i, s_{i,1}, \ldots, s_{i,n})$ in the mapping table $M(G, S_1, \ldots, S_n)$, where $g_i \in G$ and $s_{i,j} \in S_j$ ($1 \le j \le n$), the semantics of the mappings can be captured by the concept of valuation. Given the global database $B$ of $G$ and local databases $D_j$ of $S_j$ ($1 \le j \le n$), a valuation of $t_i$ is a function $\sigma$, which maps $t_i$ to a tuple $(v_i, v_{i,1}, \ldots, v_{i,n})$, where $v_i \in B$ and $v_{i,j} \in D_j$ ($1 \le j \le n$), such that $\tau_B(v_i) = g_i$ and $\tau_{D_j}(v_{i,j}) = s_{i,j}$ for $j \in [1..n]$. Under the exact assumption, the semantics of the mapping table $M = \{t_1, \ldots, t_m\}$ is captured by a conjunction of all the equalities (between the valuation of each global element and the union of the valuations of its mapped local elements), that is:

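$\bigwedge_{1 \le i \le m} [\sigma(g_i) = \sigma(s_{i,1}) \cup \ldots \cup \sigma(s_{i,n})]$, such that for $1 \le k, l \le m$,

(1) $(g_k, g_l) \in E_G \Leftrightarrow (\sigma(g_k), \sigma(g_l)) \in E_B$, and

(2) $(s_{k,j}, s_{l,j}) \in E_{S_j} \Leftrightarrow (\sigma(s_{k,j}), \sigma(s_{l,j})) \in E_{D_j}$, for each $j \in [1..n]$.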

The definition of the semantics of sound (or complete) mappings is the same as the above

definition, except for the substitution of = by ⊇ (or ⊆). For simplicity, we abbreviate the

preceding assertion to σ(G) = σ(S1) ∪ ... ∪ σ(Sn). The global database B is then any database

such that σ(G) = σ(S1) ∪ ... ∪ σ(Sn) holds for the local databases D1, ...,Dn. Figure 12 shows

the global database (instances) for the data sources of Example 2.1.

2.5 Query Processing


RDQL (RDF Data Query Language) uses an SQL-like syntax. More specifically, the Select clause identifies the variables to be returned to the application. The From clause specifies the RDF model using a URI. The Where clause specifies the graph pattern as a list of triple patterns. The And clause specifies the Boolean expressions. Finally, the Using clause provides a way to shorten the length of the URIs. By overlooking the notion of namespace (i.e., URI) and the And clause, we get a conjunctive RDQL (c-RDQL) expression, which can be expressed in a conjunctive formula:

$ans(\vec{X}) \text{ :- } p_1(\vec{X}_1), \ldots, p_n(\vec{X}_n).$

where $\vec{X}_i = (x_i, x'_i)$ and $p_i$ is an RDF property of $x_i$ having the value $x'_i$.

Figure 12. The global database of G.

XQuery is a typed functional language that has an FLWR (i.e., For, Let, Where, Return)

syntax. For simplification, we assume that the XML query posed by the user is formulated

only in the form of FLWR expressions (19). In other words, we do not consider nesting FLWR

expressions, although they are allowed in XQuery. In particular, a conjunctive XQuery (c-

XQuery) is of the form:

$ans(\vec{X}) \text{ :- } p_1(\vec{X}_1), \ldots, p_n(\vec{X}_n).$

where $\vec{X}_i = (x_i, x'_i)$ and $p_i$ is an XPath $/e_1/\ldots/e_n$ connecting $x_i$ to $x'_i$. That is, each predicate represents an expression $x_i/e_1/\ldots/e_n/x'_i$, where $e_i$ ($1 \le i \le n$) is an edge label along the path from $x_i$ to $x'_i$.

In both query definitions, $ans(\vec{X})$ is the head of the query, denoted $head_q$, and the remaining part is the body of the query, denoted $body_q$. We say that the query is safe if $\vec{X} \subseteq \vec{X}_1 \cup \ldots \cup \vec{X}_n$.
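The safety condition is easy to check once a conjunctive query is represented as a head tuple and a list of binary predicates. The encoding below and the function name `is_safe` are assumptions made for this sketch. The sample query is the c-RDQL query ans(x, y) :- name(u, x), title(v, y), contained(u, v).

```python
def is_safe(head_vars, body):
    """A query is safe if every head variable also occurs in the body.

    head_vars: tuple of variable names in ans(...)
    body:      list of (predicate_name, (x_i, x_i_prime)) pairs
    """
    body_vars = {var for _, args in body for var in args}
    return set(head_vars) <= body_vars


body = [("name", ("u", "x")), ("title", ("v", "y")), ("contained", ("u", "v"))]
safe = is_safe(("x", "y"), body)
unsafe = is_safe(("x", "z"), body)   # z does not occur in the body
```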

The answer $q^D$ to a query $q$ over a database $D$ is the result of evaluating $q$ over $D$. The query evaluation is based on the concept of valuation and depends on the data model and the query language used. Informally, a valuation $\rho$ over the variables $var(q)$ of a query $q$ is a total function from $var(q)$ to constants (or URIs for RDF queries) in the domain $\Gamma$ of the database over which $q$ is evaluated (2). The answer is then defined as follows:

• In the XML model: given a c-XQuery $q$ of the form $ans(\vec{X}) \text{ :- } p_1(\vec{X}_1), \ldots, p_n(\vec{X}_n)$ over an XML instance tree $D$, we have

$q^D = \{\rho(\vec{X}) \mid \rho \text{ is a valuation over } var(q) \text{ and } p_i = (\rho(x_i), \rho(x'_i)) \text{ is a fact in } D, \text{ for each } \vec{X}_i = (x_i, x'_i), \text{ where } i \in [1..n]\}$.

• In the RDF model: given a c-RDQL query $q$ of the form $ans(\vec{X}) \text{ :- } p_1(\vec{X}_1), \ldots, p_n(\vec{X}_n)$ over an RDF instance graph $D$, we have

$q^D = \{\rho(\vec{X}) \mid \rho \text{ is a valuation over } var(q) \text{ and } p_i \text{ is a path connecting } \rho(x_i) \text{ and } \rho(x'_i) \text{ in } D, \text{ for each } \vec{X}_i = (x_i, x'_i), \text{ where } i \in [1..n]\}$.


Example 2.2 Consider two queries q1 and q2. In particular, q1 is expressed over the global

ontology G in c-RDQL, to retrieve all the (Author, Book) pairs. The c-XQuery query q2 is

issued on local XML source S1, to retrieve all (Author, Book) pairs.

q1: ans(x, y) :- name(u, x), title(v, y), contained(u, v).

q2: ans(x, y) :- /name(u, x), /booktitle(v, y), /author(v, u).

By evaluating q1 over the global database B (shown in Figure 12) and q2 over D1 (shown in

Figure 8), we obtain the following answer sets to both queries.

$q_1^{B} = \{(a1, b1), (a2, b2), (a3, b2), (w1, t1), (w2, t2), (w3, t2)\}$,

$q_2^{D_1} = \{(a1, b1), (a2, b2), (a3, b2)\}$.
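The evaluation of q2 over D1 can be reproduced with a naive backtracking evaluator for conjunctive queries. This is a sketch only: the database is encoded as labeled binary facts, the node identifiers (book1, au1, ...) and the exact shape of D1 are assumptions chosen to be consistent with the answer set above, and leaf values are attached directly as edge targets.

```python
def evaluate(head, body, facts):
    """Naive evaluation of ans(head) :- body over labeled binary facts.

    head:  tuple of variable names
    body:  list of (label, (x, x_prime)) predicates
    facts: set of (label, source, target) triples
    """
    answers = set()

    def extend(i, rho):
        if i == len(body):
            answers.add(tuple(rho[v] for v in head))
            return
        label, (a, b) = body[i]
        for fact_label, s, t in facts:
            if fact_label != label:
                continue
            if a == b and s != t:
                continue
            # Reject facts that contradict the bindings made so far.
            if rho.get(a, s) != s or rho.get(b, t) != t:
                continue
            extend(i + 1, {**rho, a: s, b: t})

    extend(0, {})
    return answers


# Hypothetical encoding of D1: one book with one author, one book with two authors.
d1 = {
    ("booktitle", "book1", "b1"), ("author", "book1", "au1"), ("name", "au1", "a1"),
    ("booktitle", "book2", "b2"), ("author", "book2", "au2"), ("name", "au2", "a2"),
    ("author", "book2", "au3"), ("name", "au3", "a3"),
}

# q2: ans(x, y) :- /name(u, x), /booktitle(v, y), /author(v, u).
q2_body = [("name", ("u", "x")), ("booktitle", ("v", "y")), ("author", ("v", "u"))]
q2_answers = evaluate(("x", "y"), q2_body, d1)
```

Under this encoding the evaluator returns the three (author, book) pairs listed for q2.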

We finally assume that all the concepts in the local ontologies are mapped to the concepts

in the global ontology during the ontology integration process. That is, the mappings are

total, one-to-one mappings from the local RDF ontologies to the global ontology. However, it

is possible that some concept c or property p in the global ontology gets mapped to a local

ontology but not to another local ontology. This may lead to null values when a query involves

c or p. However, we do not consider this case in our discussion.

2.5.2 Certain Answers and Query Containment

The concept of certain answers has been introduced in view-based query processing to

represent the results of answering a global query (the query over the global schema) using view

extensions (1). In our framework, where the mappings are correspondences between elements

of the global ontology and elements of the source schemas, the concept of certain answers is redefined. We call the query posed on the global ontology a global query, and the query posed over a local data source a local query. As previously discussed, these two queries are processed in two different directions, i.e., the global-to-local direction and the local-to-local direction. The certain answers to a global query are called global certain answers, while those to a local query are called local certain answers.

Figure 13. The retrieved database on S1 w.r.t. S2 and that on S2 w.r.t. S1.

Before we discuss the formalism for these two types of certain answers, we revisit the concept

of global database, from which we retrieve the global certain answers, and we introduce the

concept of retrieved database, where the local certain answers are computed.

Consider the local data sources $D_1, \ldots, D_n$ and the mapping table $M(G, S_1, \ldots, S_n)$ between the global ontology $G$ and local source schemas $S_1, \ldots, S_n$. The global database $B$ is such that $\sigma(G) = \bigcup_{1 \le i \le n} \sigma(S_i)$ holds on $D_1, \ldots, D_n$. Likewise, the retrieved database $B_k$ on a local source $S_k$ w.r.t. all the other local sources is the one satisfying $\sigma(S_k) = \bigcup_{1 \le i \le n,\, i \ne k} \sigma(S_i)$, whereas the retrieved database $B_{k,l}$ on $S_k$ w.r.t. a particular local source $S_l$ is the one satisfying $\sigma(S_k) = \sigma(S_l)$ (refer to Section 2.4 for the semantics of $\sigma$). Figure 13 shows an example of the retrieved database on $S_1$ w.r.t. $S_2$ (on the left side) and the one on $S_2$ w.r.t. $S_1$ (on the right side), for $S_1$ and $S_2$ as presented in Figure 8.

Based on the concept of global database and that of retrieved database, we formally define

both types of certain answers next.

Definition 2.5 (Certain answers) Let $G$ be the global ontology of $n$ XML source schemas $S_1, \ldots, S_n$ with databases $D_1, \ldots, D_n$, respectively, $M$ be the mapping table, $q$ be a global query posed over $G$, and $q_k$ be a local query on $S_k$. The global certain answers to $q$ with respect to $D_1, \ldots, D_n$ based on $M$ are the results of evaluating $q$ over the global database $B$, denoted $cert_M(q) = q^B$. The local certain answers to $q_k$ with respect to $D_1, \ldots, D_{k-1}, D_{k+1}, \ldots, D_n$ based on $M$ are computed by evaluating $q_k$ over the retrieved database $B_k$ on $S_k$, denoted $cert_{M,k}(q_k) = q_k^{B_k}$.

While the global certain answers constitute the answer to a global query, the answer to

a local query qk contains both the local certain answers and those retrieved from the local

database $D_k$, that is, $ans(q_k) = cert_{M,k}(q_k) \cup q_k^{D_k}$.

Query containment is a fundamental problem in database research. In general, query containment checks whether the answers to one query are always contained in the answers to another. This problem has been studied

in the following three cases.

The first case is query containment in a single database $D$, over which the two queries are posed, that is, $D_1 = D_2 = D$. Given a single database schema $S$ over which $q_1$ and $q_2$ are posed, we say $q_1$ is contained in $q_2$, denoted $q_1 \subseteq q_2$, if they have the same output schema and $q_1^D \subseteq q_2^D$ for every database $D$ of $S$. The two queries $q_1$ and $q_2$ are said to be equivalent, denoted $q_1 \equiv q_2$, if $q_1^D \subseteq q_2^D$ and $q_2^D \subseteq q_1^D$ (2).

The second case is query containment in data integration systems, where both queries are

posed over the global database. The data sources are usually homogeneous in the sense that the

same syntax is used. Given that the sources are expressed as views over the global database,

two queries are said to be equivalent relative to the same set of data sources, if for any source

databases they have the same set of certain answers. The query containment problem in this

case is called relative query containment (82).

The third case is also in homogeneous data integration systems, where data sources are

defined as views of the global schema, but the two queries are formulated in terms of different

alphabets. In particular, there are two kinds of queries, i.e., the queries $q^\Sigma$ over the alphabet $\Sigma$ of the global schema and the queries $q^V$ over the alphabet $V$ of the view definitions. The query containment in this case is called view-based containment and is discussed for different situations such as containment between $q_1^\Sigma$ and $q_2^\Sigma$, between $q_1^\Sigma$ and $q_2^V$, between $q_1^V$ and $q_2^\Sigma$, and between $q_1^V$ and $q_2^V$ (28).

In our case, we are interested in two kinds of containment, specifically the containment

between a global query q and a union of local queries q1, ..., qn, and the containment between two

local queries $q_k$ and $q_l$. The first kind of containment, which we call global query containment, is the same as the containment between $q_1^\Sigma$ and $q_2^V$. The second kind, in contrast, differs from the containment between $q_1^V$ and $q_2^V$, in the sense that $q_k$ and $q_l$ refer to different alphabets, whereas $q_1^V$ and $q_2^V$ are expressed over the same alphabet. We call the containment between $q_k$ and $q_l$ P2P query containment, because of its likeness to query processing in a P2P system. Next we give the formal definitions for these two containments in our framework.

Definition 2.6 (Global query containment) Let G be the global ontology over n XML source

schemas S1, ...,Sn, M be the mapping table, q be a global query posed over G, and q′ be a union

of local queries $q_1, \ldots, q_n$ over $S_1, \ldots, S_n$, respectively. We say $q$ is globally contained in $q'$, denoted $q \subseteq_M q'$, if for any databases $D_1, \ldots, D_n$, we have $cert_M(q) \subseteq q_1^{D_1} \cup \ldots \cup q_n^{D_n}$. We say $q$ and $q'$ are globally equivalent, denoted $q \equiv_M q'$, if $q \subseteq_M q'$ and $q \supseteq_M q'$.

Definition 2.7 (P2P query containment) Let G be the global ontology over n XML source

schemas S1, ...,Sn, M be the mapping table, qi be a local query posed over Si, and qj be a local

query over $S_j$. We say $q_i$ is P2P contained in $q_j$, denoted $q_i \subseteq_M q_j$, if for any databases $D_1, \ldots, D_n$, we have $cert_{M,i}(q_i) \cup q_i^{D_i} \subseteq cert_{M,j}(q_j) \cup q_j^{D_j}$. We say $q_i$ and $q_j$ are P2P equivalent, denoted $q_i \equiv_M q_j$, if $q_i \subseteq_M q_j$ and $q_i \supseteq_M q_j$.

Example 2.3 Consider the following three queries q, q1, and q2 respectively on the global on-

tology G, local XML source S1, and local XML source S2. Also consider the mapping table M

shown in Table IV.

q: ans(x, y) :- name(u, x), title(v, y), contained(u, v).

q1: ans(x, y) :- /name(u, x), /booktitle(v, y), /author(v, u).

q2: ans(x, y) :- /fullname(u, x), /title(v, y), /article(u, v).

By executing $q$ on the global database $B$, $q_1$ on $D_1$ and on the retrieved database $B_1$, and $q_2$ on $D_2$ and on the retrieved database $B_2$, we obtain the following answers to the three queries.

$cert_M(q) = q^{B}$: {(a1, b1), (a2, b2), (a3, b2), (w1, t1), (w2, t2), (w3, t2)}

$q_1^{D_1}$: {(a1, b1), (a2, b2), (a3, b2)}

$cert_{M,1}(q_1) = q_1^{B_1}$: {(w1, t1), (w2, t2), (w3, t2)}

$q_2^{D_2}$: {(w1, t1), (w2, t2), (w3, t2)}

$cert_{M,2}(q_2) = q_2^{B_2}$: {(a1, b1), (a2, b2), (a3, b2)}

Therefore, by Definition 2.6 and Definition 2.7, we have $q \equiv_M (q_1 \cup q_2)$ and $q_1 \equiv_M q_2$.
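The equalities behind this example can be verified with plain set operations over the answer sets listed above. Note that the sketch only checks this particular database instance, whereas the two definitions quantify over all databases. The variable names are hypothetical.

```python
# Answer sets of Example 2.3, copied from the text.
cert_q = {("a1", "b1"), ("a2", "b2"), ("a3", "b2"),
          ("w1", "t1"), ("w2", "t2"), ("w3", "t2")}   # cert_M(q) = q^B
q1_d1 = {("a1", "b1"), ("a2", "b2"), ("a3", "b2")}     # q1 over D1
cert_1 = {("w1", "t1"), ("w2", "t2"), ("w3", "t2")}    # cert_{M,1}(q1) = q1 over B1
q2_d2 = {("w1", "t1"), ("w2", "t2"), ("w3", "t2")}     # q2 over D2
cert_2 = {("a1", "b1"), ("a2", "b2"), ("a3", "b2")}    # cert_{M,2}(q2) = q2 over B2

# Global equivalence on this instance: cert_M(q) equals the union of the local answers.
globally_equal = cert_q == (q1_d1 | q2_d2)

# P2P equivalence on this instance: full answers to q1 and q2 coincide.
ans_q1 = cert_1 | q1_d1
ans_q2 = cert_2 | q2_d2
p2p_equal = ans_q1 == ans_q2
```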


In a data integration system where the sources are described as views over the global schema,

query processing is called view-based query processing, which has two approaches, i.e., view-based

query answering and view-based query rewriting (27; 58). Likewise, there are two approaches to

answering a query in our framework, where mappings are expressed by correspondences. The

first approach utilizes the notion of (global or local) certain answers, as previously discussed.

The alternative approach is by query rewriting. Specifically, to answer a global (or local)

query q, the query is rewritten into a union of the queries over all the sources, using the

mappings. The integration of the answers retrieved from each source constitutes the answer to

q.

As mentioned before, there are two directions of query processing in our framework. We

expect that query rewriting in both directions is equivalent, in the sense that the rewriting

is globally (or P2P) equivalent to the original query. We present next two query rewriting

algorithms, i.e., GLRewriting for global-to-local query rewriting and LLRewriting for local-

to-local rewriting, which will ensure the equivalence of the rewritten queries.

Algorithm GLRewriting

Input: 1. q1 over the global ontology G: ans($\vec{X}$) :- p1($\vec{X}_1$), ..., pm($\vec{X}_m$);
       2. M between the global ontology G and local XML schemas S1, ..., Sn.

Output: q2: Union of the c-XQueries over S1, ..., Sn.

1  q2 = null;
2  For i = 1 to n do
3    headq = headq1; bodyq = null;
4    For j = 1 to m do
5      (c1, c2) = name of the class/property bound to (x1, x2), for $\vec{X}_j$ = (x1, x2);
6      Search M to find (d1, d2) such that {(c1, d1), (c2, d2)} ⊆ π_{G,Si}(M);
7      If a path p exists from d1 to d2 in Si then
8        add p(x1, x2) to bodyq;
9      Else if a path p exists from d2 to d1 in Si then
10       add p(x2, x1) to bodyq;
11     Else add p(x, x1) and p′(x, x2) to bodyq, where x is a new variable bound to
         the lowest ancestor d of d1 and d2, and p (p′) is the path from d to d1 (d2);
12   q2 = q2 ∪ q;

Figure 14. The GLRewriting algorithm.

We see that the algorithm GLRewriting adopts a strategy similar to the “unfolding”

strategy used by query processing in a GaV-based relational data integration system (70).

However, instead of substituting the predicates in a query q with the corresponding views, the

substitution of predicates in GLRewriting is guided by the correspondences in the mapping

table M, as stated in Lines 5 to 11. The calculation of the class or property (Line 5) bound

to different variables in q1 is as follows. For each predicate p(x1, x2): (1) if p is a property

connecting two classes c1 and c2, we say that x1 is bound to c1 and that x2 is bound to c2; (2)

if p connects a class c to a value (or literal) v, we say that x1 is bound to c and that x2 is bound

to p. Also, we note that the algorithm uses the relational algebra projection operator π (Line

6).

Example 2.4 Given a global query

q : ans(x, y) :- name(u, x), title(v, y), contained(u, v).

we use GLRewriting to rewrite q into a union of subqueries, each on a local XML source

(refer to the mapping table M of Table IV). For illustration, we only look at the rewriting of q

into a subquery q1 over the local source S1.

In particular, Line 5 computes the bound classes or properties of the variables (u, v, x, y) as
(Author, Book, Author.name, Book.title). By looking into M, we find the corresponding
element sequence of (Author, Book, Author.name, Book.title) in S1 to be (/books/book/author,
/books/book, /books/book/author/name, /books/book/booktitle). From Lines 7 to 11,
we compute the predicates in the body of q1 as follows.

q1: ans(x, y) :- /name(u, x), /booktitle(v, y), /author(v, u).

Note that for the predicate contained(u, v) in q, we generate in q1 a predicate /author(v, u),
where the order of the two variables is switched. This results from the computation performed
by Lines 9 and 10. In particular, u and v are respectively bound to Author and Book, which
respectively correspond to XML paths /books/book/author and /books/book. From S1, we
find that /author is the path from v to u, not the path from u to v.

Example 2.5 We give one more example to illustrate query rewriting when Line 11 is used.
Consider the following setting (Figure 15), where a local XML schema S1 (on the right side)
is mapped to the global RDFS ontology G (on the left side), as indicated by the dashed lines.
The two classes Advisor and Student are respectively instantiated with the name of faculty
and the name of advisee, that is, the mapping table contains two correspondences:

(Advisor, /faculty/f_name)

(Student, /faculty/advisee/a_name).

[Figure 15 shows the global RDFS ontology G, with the classes Advisor and Student connected
by the property advises, and the local XML schema S1, with the root element faculty, its
children f_name and advisee, and the element a_name nested under advisee.]

Figure 15. A part of XML data integration setting.

Now we consider rewriting a global c-RDQL query q: ans(x, y) :- advises(x, y). into a local
c-XQuery query q′ over S1. It is apparent that x and y are bound to Advisor and Student,
thus corresponding to /faculty/f_name and /faculty/advisee/a_name, respectively. Because
/faculty/f_name and /faculty/advisee/a_name share the same ancestor /faculty,
by using Line 11 we add two predicates /f_name(u, x) and /advisee/a_name(u, y) to the body
of q′, generating the following local c-XQuery query q′:

ans(x, y) :- /f_name(u, x), /advisee/a_name(u, y).
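The three cases of Lines 7 to 11 of GLRewriting can be illustrated on plain path strings. The sketch below is a simplification under stated assumptions: XML elements are identified by absolute child-axis paths, "a path exists from d1 to d2" is read as "d1 is a proper ancestor of d2", and the helper names (`steps`, `path_between`, `lowest_common_ancestor`, `rewrite_predicate`) are hypothetical. It does not handle the degenerate case d1 = d2.

```python
from typing import List, Optional, Tuple

Predicate = Tuple[str, str, str]  # (relative path, first variable, second variable)


def steps(path: str) -> List[str]:
    """Split an absolute path such as '/books/book' into its steps."""
    return [s for s in path.split("/") if s]


def path_between(ancestor: str, descendant: str) -> Optional[str]:
    """Relative path from ancestor down to descendant, or None if not nested."""
    a, d = steps(ancestor), steps(descendant)
    if len(d) > len(a) and d[:len(a)] == a:
        return "/" + "/".join(d[len(a):])
    return None


def lowest_common_ancestor(p1: str, p2: str) -> str:
    """Longest common prefix of the two paths, step by step."""
    common = []
    for s1, s2 in zip(steps(p1), steps(p2)):
        if s1 != s2:
            break
        common.append(s1)
    return "/" + "/".join(common)


def rewrite_predicate(d1: str, d2: str, x1: str, x2: str,
                      fresh: str) -> List[Predicate]:
    """Mirror Lines 7-11: d1, d2 are the elements bound to x1, x2."""
    p = path_between(d1, d2)
    if p is not None:                      # Lines 7-8
        return [(p, x1, x2)]
    p = path_between(d2, d1)
    if p is not None:                      # Lines 9-10: variables are swapped
        return [(p, x2, x1)]
    d = lowest_common_ancestor(d1, d2)     # Line 11: go through a new variable
    return [(path_between(d, d1), fresh, x1), (path_between(d, d2), fresh, x2)]


# contained(u, v) from Example 2.4: u is bound to Author, v to Book.
print(rewrite_predicate("/books/book/author", "/books/book", "u", "v", "w"))
# advises(x, y) from Example 2.5, with the new variable u.
print(rewrite_predicate("/faculty/f_name", "/faculty/advisee/a_name", "x", "y", "u"))
```

With the inputs of Examples 2.4 and 2.5, the function yields the predicates /author(v, u) and the pair /f_name(u, x), /advisee/a_name(u, y), respectively.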

Algorithm LLRewriting

Input: 1. q1 over a local XML schema S1: ans($\vec{X}$) :- p1($\vec{X}_1$), ..., pm($\vec{X}_m$);
       2. M between the global ontology G and local XML schemas S1, ..., Sn.

Output: q: A query over local XML schema S2.

1  headq = ans($\vec{X}$); bodyq = null;
2  For j = 1 to m do
3    (c1, c2) = name of the element bound to (x1, x2), for $\vec{X}_j$ = (x1, x2);
4    Search M to find (d1, d2) such that {(c1, d1), (c2, d2)} ⊆ π_{S1,S2}(M);
5    If a path p exists from d1 to d2 in S2 then
6      add p(x1, x2) to bodyq;
7    Else if a path p exists from d2 to d1 in S2 then
8      add p(x2, x1) to bodyq;
9    Else add p(x, x1) and p′(x, x2) to bodyq, where x is a new variable bound to
       the lowest ancestor d of d1 and d2, and p (p′) is the path from d to d1 (d2);

Figure 16. The LLRewriting algorithm.

Algorithm LLRewriting differs from GLRewriting only in finding the elements bound to

the variables (Line 3) and in finding the corresponding elements from the mapping table (Line

4). Unlike in global-to-local rewriting, the result of using LLRewriting is a single c-XQuery.

Taking into account the definitions of global and P2P query containment, we prove below

that the algorithms GLRewriting and LLRewriting yield equivalent queries.

Theorem 2.1 Given a global query q over the global ontology G, its rewriting q′ as computed

by GLRewriting is globally equivalent to q, that is, q ≡M q′.

Proof sketch. To prove q ≡M q′, where q′ = q1 ∪ ... ∪ qn, we will check whether
$\mathit{cert}_M(q) = q_1^{D_1} \cup \dots \cup q_n^{D_n}$, given the mapping table M(G, S1, ..., Sn).
Taking into account the semantics of M, given any sequence u of values from the global
database B, which makes bodyq true, we can always have a sequence v of values from
D1, ..., Dn, since σ(G) = σ(S1) ∪ ... ∪ σ(Sn). By GLRewriting, the sequence v is exactly the
one that makes bodyqi true, where i ∈ [1..n]. Therefore, we have
$q^B \subseteq q_1^{D_1} \cup \dots \cup q_n^{D_n}$. Similarly, we can show that
$q^B \supseteq q_1^{D_1} \cup \dots \cup q_n^{D_n}$. By the definition of certain answers, we conclude
that $\mathit{cert}_M(q) = q_1^{D_1} \cup \dots \cup q_n^{D_n}$. □

Similarly, we have:

Theorem 2.2 Given a local query q1 over a local XML source S1, its rewriting q2 over the

local XML source S2 computed by LLRewriting is P2P equivalent to q1, that is, q1 ≡M q2.

We discuss here an interesting property, namely reversibility, of the local-to-local query

rewriting. Informally, consider a local query q1, which is rewritten into another local query

q2. If q2 can be rewritten back to a query q′1 (on the same source as q1) such that q1 ≡ q′1,

we say q′1 is a reverse query of q1. In the case that q2 and q′1 are computed using the same

rewriting algorithm, we say that the algorithm is reversible, if every query that is rewritable by

the algorithm has a reverse rewriting.

More generally, we consider a P2P data integration system with a cyclic path of P2P map-

pings, informally annotated as p1,M12, p2, ...,M(n−1)(n), pn,Mn1, p1, and an equivalent query

rewriting algorithm translating a query q1 (over p1) along this path until it comes back to

p1 with the resulting query q′1. In the spirit of equivalent query rewriting, we expect that

it is the case that q1 ≡ q′1, and furthermore, (q1 ≡M q2), ..., (qn ≡M q′1) ⇒ q1 ≡ q′1 and

q1 ≡ q′1 ⇒ (q1 ≡M q2), ..., (qn ≡M q′1). In other words, we expect that there exists a logical

relationship between P2P query containment/equivalence and a reversible rewriting algorithm.

2.6 Summary

XML and its schema languages do not express semantics but rather the document structure,

such as information about nesting. Therefore, semantically equivalent documents often present

different document structures when they originate from different applications. In this thesis,

we provide an ontology-based framework that aims to make XML documents interoperate at

the semantic level while retaining their nesting structure. The framework consists of two key

aspects: data integration and query processing.

For data integration, a global RDFS ontology is generated by merging the local RDFS

ontologies that are generated from each of the XML documents. At the same time, the mappings

between the global ontology and local XML schemas are manually established. We extend RDFS

by defining additional metadata that can encode the nesting structure of an XML document.

For query processing, we propose two query rewriting algorithms: one algorithm translates an

RDF query (posed on the global ontology) to an XML query; the other algorithm translates an

XML query (posed on one of the individual XML data sources) to another XML query (posed

on a different XML data source). In doing so, we discuss the problem of query containment
for two query languages, namely conjunctive RDQL (c-RDQL) and conjunctive XQuery
(c-XQuery). We show that the rewriting produced by each algorithm is equivalent to the
original query, in terms of global and P2P query equivalence, respectively.

In the future, we will extend query processing in our framework, by taking into account

other data models, such as relational and RDF data sources. We will further study query

containment in the case of more expressive query languages, e.g., the complete RDQL and

XQuery. The concept of reversibility of query rewriting, especially in P2P data integration

systems, is also a direction for future research.

CHAPTER 3

HYBRID PEER-TO-PEER DATA INTEGRATION

3.1 Introduction

The Semantic Web has been proposed to add semantics to web content and to enable

interoperability among heterogeneous data sources. Both Extensible Markup Language (XML)

and Resource Description Framework (RDF) can be used to represent information on the Web.

However, there exists a wide gap between the two languages, since RDF data has domain

structure (the concepts and the relationships between concepts) while XML data has document

structure (the hierarchy of elements) (59).

An example is shown in Figure 17, in which the RDF schema R explicitly specifies two

concepts, Book and Publisher, as well as the publishedBy relationship. Figure 17 also shows

two XML schemas S1 and S2. Each of these XML schemas contains two concepts: book

and author (equivalently denoted by article and writer in S2). Conceptually, these two

XML schemas are quite similar. Structurally speaking, however, they are very different: S1

(book-centric schema) has the author element nested under the book element, whereas S2

(author-centric schema) has the article element nested under the writer element.

[Figure 17 shows three local sources: the XML schema S1 (books/book*/author*, with the
attribute @booktitle on book and @name on author) with a sample document D1 ("books.xml");
the XML schema S2 (writers/writer*/article*, with the attribute @fullname on writer and
@title on article) with a sample document D2 ("writers.xml"); and the RDF schema R (classes
Book and Publisher related by publishedBy, with the literal-valued properties booktitle and
ISBN of Book and name of Publisher), defined in the namespace http://examples.org/local#,
with sample RDF data.]

Figure 17. An example of heterogeneous XML and RDF data sources.

Furthermore, the wide diversity of possible XML schemas for a single conceptual model also
results in wide diversity for the XML queries. For instance, a user who wants to “List all the
publications” from two data sources corresponding to S1 and S2 may write the XML path
expressions, respectively, as /books/book/@booktitle and /writers/writer/article/@title.
We notice that although the two XML path expressions refer to semantically equivalent
concepts, they follow two distinct XML paths. In contrast, schemas defined on the conceptual level
(known as conceptual schemas or ontologies) are flat in document structure, and therefore the
user can formulate a query without considering the structure of the source (we refer to such
queries as conceptual queries). RDF Schema (RDFS), DAML+OIL, and OWL are examples of
languages used to create conceptual schemas.
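The structural difference between the two path expressions can be made concrete with a short sketch using Python's standard `xml.etree.ElementTree` module. The two sample documents are assumptions made for illustration (they follow the schemas S1 and S2, but are not the documents D1 and D2 of the figure), and the function names are hypothetical. Because ElementTree's path support does not select attributes directly, the attribute is read with `get` after locating the element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents conforming to S1 (book-centric)
# and S2 (author-centric).
BOOKS_XML = """<books>
  <book booktitle="b1"><author name="a1"/></book>
  <book booktitle="b2"><author name="a2"/><author name="a3"/></book>
</books>"""

WRITERS_XML = """<writers>
  <writer fullname="a1"><article title="b1"/></writer>
  <writer fullname="a2"><article title="b2"/></writer>
  <writer fullname="a3"><article title="b2"/></writer>
</writers>"""


def titles_book_centric(xml_text: str) -> list:
    """Evaluate /books/book/@booktitle."""
    root = ET.fromstring(xml_text)  # root element is <books>
    return sorted({b.get("booktitle") for b in root.findall("./book")})


def titles_author_centric(xml_text: str) -> list:
    """Evaluate /writers/writer/article/@title."""
    root = ET.fromstring(xml_text)  # root element is <writers>
    return sorted({a.get("title") for a in root.findall("./writer/article")})


print(titles_book_centric(BOOKS_XML))
print(titles_author_centric(WRITERS_XML))
```

Both functions answer the same conceptual question, yet neither can be reused on the other document, which is the heterogeneity that a conceptual query avoids.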

There are currently several attempts to use conceptual schemas (3; 5; 38; 39) and conceptual

queries (29; 31) to overcome the problem of structural heterogeneities among XML sources. In

this chapter, we propose a framework called PEPSINT (PEer-to-Peer Semantic INTegration

framework) to semantically integrate heterogeneous XML and RDF data sources in a P2P

environment. We discuss the architecture of PEPSINT, and present a solution for semantic

integration and query processing in the P2P heterogeneous environment. In brief, we make the

following contributions in this chapter:

• We propose a P2P schema-based data management framework, PEPSINT, built on a

hybrid P2P architecture, in which the global RDF ontology (constructed using the global-

as-view approach (70)) in the super peer behaves not only as a central control point over

the peers but also as a mediator for query translation from peer to peer.

• For the purpose of semantic integration, we propose an approach that preserves the do-

main structure of RDF and the document structure of XML. Specifically, the semantic

integration of XML and RDF data sources is implemented at the schema level (through

the schema matching process) and at the instance level (through the query answering

process).

• We also provide a set of query rewriting algorithms that can propagate a user’s query

across the heterogeneous XML or RDF data sources in PEPSINT. In our framework,

mappings connect the peer to the super peer, thus making query processing within the

network transparent to a user in any peer.

The rest of this chapter is organized as follows. Section 3.2 gives a review of related work.

In Section 3.3 we describe the architecture of PEPSINT and its main components. Section 3.4

discusses schema-based integration of RDF sources and (structurally dissimilar) XML sources.

Query processing in PEPSINT is covered in Section 3.5. Finally, we draw conclusions and

discuss future work in Section 3.6.

3.2 Related Work

The research community has, to date, produced several P2P data management systems that

aim to enable interoperability among distributed heterogeneous data sources.

The Edutella project (85) provides an RDF-based metadata infrastructure for P2P net-

works based on the JXTA framework (52). In Edutella, connections between peers are encoded

into a network topology known as the Edutella super-peer topology, which is similar to the

hybrid architecture used in PEPSINT. A Datalog-based query exchange language called RDF-

QEL is proposed to serve as a common query interchange format. Thus a wrapper translates

local query languages such as SQL and XPath into RDF-QEL. Edutella does not support XML

sources directly, though the RDF data sources may be serialized in XML format.

PeerDB (86) is an agent-based P2P data management system where each peer holds a

relational database. The metadata for relations that are sharable with other peers is specified

in a local export dictionary. Unlike PEPSINT, there are no established mappings between peers.

Thus, query reformulation between peers in PeerDB is assisted by agents through a
relation-matching strategy; this is a process of matching the metadata between relations in different
peers. XML and RDF data are not considered in the current implementation of PeerDB.

SEWASIE (11) is another agent-based P2P system that aims to integrate Information

Nodes (SINodes), where each node acts as an autonomous mediator-based system. It contains

two types of agents: query agents that are responsible for query processing and answering;

and brokering agents (peers) that handle the mappings between nodes. Each brokering agent

directly controls at least one SINode and handles the creation and maintenance of semantic

relationships among concepts from different information nodes in the system. SEWASIE does

not currently support RDF data sources.

Hyperion (6) proposes an architecture for a P2P data management system for relational

databases (one stored at each peer). Similarly to PEPSINT, mapping tables and mapping

expressions (mapping tables that allow variables) are used to store connections between local

schemas in peers. A query manager uses the mapping tables and mapping expressions to

rewrite a query posed in terms of the local schema; the rewriting process produces a query that

is run over the schema of acquainted peers. Unlike PEPSINT, only relational data sources and

relational queries are supported by Hyperion.

The Piazza system (59) is a P2P data management system that, like PEPSINT, supports

interoperation of both XML and RDF data sources. Furthermore, both systems preserve doc-

ument structure of XML sources during interoperation of these sources. The differences from

PEPSINT are: (1) Piazza is based on the pure P2P architecture in which peers are connected

directly, whereas PEPSINT is built on top of a hybrid architecture with a super peer containing

the global ontology. This is a tradeoff between efficiency and autonomy (11). (2) Piazza uses a

(declarative) XQuery-based mapping language for mediating between nodes, whereas PEPSINT

utilizes mapping tables to store schema correspondences, which we believe results in easier con-

struction and maintenance of mappings. (3) The Piazza system achieves its interoperability in

a low-level (syntactic) way, i.e., through the interoperability of XML and the XML serialization

of RDF. For this reason, the user has to write an RDF query in terms of an XQuery. The

query rewriting in Piazza is based on pattern matching between an XQuery expression and the

mappings. In contrast, PEPSINT supports RDF queries at the conceptual level (RDQL), as

well as XQuery. Query translation is realized by a collection of query rewriting algorithms.

3.3 The PEPSINT Architecture

There are two types of P2P architectures (84): the pure P2P architecture, in which no

central point of control exists and peers are autonomous but can communicate directly with

each other; and the hybrid P2P architecture that contains at least one central point of control.

The global control point(s) maintain either network control or the references to the remaining

peers. Based on the hybrid P2P architecture, PEPSINT contains two types of peers: the super

peer, containing the global RDF ontology, and the peers, containing local schemas and local

data sources. Each peer represents an autonomous information system and connects with the

super peer by establishing P2P mappings. As shown in Figure 18, the PEPSINT architecture

has four main components.

XML to RDF wrapper. Since XML is characterized by having a hierarchical document

structure while RDF has a flat document structure, it is hard for the user to directly map a

local XML schema to the global RDF ontology. To solve this problem, an XML to RDF wrapper

is used to transform the XML schema into a local RDF schema, which is then mapped to the

global ontology. This is a process that conceptualizes the XML elements into RDF concepts

while keeping their nesting information (by using a specialized RDF property).

Local XML and RDF schemas. The local XML and RDF schemas residing in peers
contain both data and metadata. For the purpose of semantic integration, we represent a local
RDF schema as a labeled digraph (from now on referred to as RDF schema graph). The domain
structure is explicitly represented by labeled vertices (concepts) and labeled arcs (relationships
between concepts). Likewise, a local XML schema is represented as a labeled tree (from now on
referred to as XML schema tree) that specifies nesting relationships between labeled vertices
(elements).

[Figure 18 depicts the super peer, which holds the global RDF ontology, and the peers
1, ..., i, ..., n, each of which holds a local XML or RDF schema and a mapping table; peers
with XML sources also contain an XML to RDF wrapper. Three kinds of arrows distinguish the
mapping process, query processing in data-integration fashion, and query processing in
hybrid P2P fashion.]

Figure 18. The PEPSINT architecture.

Global RDF ontology. The global RDF ontology in the super peer is a virtual mediated

schema integrated from distributed local RDF schemas (using the global-as-view approach (70)).

In PEPSINT, the global ontology has two roles: (1) It provides the user with a uniform and

complete view of data sources in the distributed peers; and (2) it serves as a mediator for

query translation from one peer to other peers. The global RDF ontology is a fairly simple

ontology—it does not contain high-level axioms, such as those available to DAML+OIL or

OWL.

Mapping table. A mapping table stores mappings between local schemas and the global

ontology. We use XML path expressions to represent the elements contained in an XML schema,

and RDF path expressions to represent the concepts and relationships in an RDF schema.

The operation of PEPSINT can be divided into two phases: mapping (or design) phase and

query (or runtime) phase, as respectively indicated by the hollow arrowed lines and the solid

and dashed arrowed lines in Figure 18. To realize semantic integration of XML and RDF data

sources, domain structure and document structure must be preserved in both phases.

1. Mapping phase. Whenever a new peer joins the PEPSINT network, the peer gets

registered and indexed in the super peer by establishing mappings from its local schema to the

global ontology. The mappings are established through a process of schema matching 1 and

stored in the mapping table of the peer. During the process of schema matching, the global

ontology is extended by integration of the local schemas. As previously mentioned, the domain

structure and document structure of local schemas are encoded in the mappings.

2. Query phase. PEPSINT provides two query processing modes. (1) In the data-

integration mode, the user poses a query (source query) on the global ontology in the super

peer, which is then reformulated into multiple subqueries (target queries) over the XML and

RDF sources in the peers (one subquery for each source). By executing the target queries and

integrating their results, the system returns an answer to the user at the site of the super peer.

1 Schema matching is a basic problem in many database application domains, and currently it must
be performed manually. A taxonomy covering most of the existing approaches to schema matching has
been devised (99).

(2) In the hybrid P2P mode, the user can pose a source query on the local XML or RDF source

in some peer. Locally, the query will be executed on the local source to get a local answer.

Meanwhile, the source query is reformulated into a target query over every other peer through

transitive mappings (compositions of mappings from the original peer to the super peer and

mappings from the super peer to the other target peers). By executing the target query, each

peer returns an answer to the original peer, called the remote answer. The local and remote

answers are integrated and returned to the user at the site of the originating peer.
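The transitive mappings used in the hybrid P2P mode can be pictured as a composition over the rows of the mapping table. The sketch below is an illustration under assumptions: it copies the S1 and S2 columns of the mapping table in Figure 20 into a dictionary (cells marked "–" are simply absent), and the function name `compose` is hypothetical.

```python
from typing import Dict

# Global path expression -> path expression in each XML source.
# Values copied from the mapping table of Figure 20.
MAPPING: Dict[str, Dict[str, str]] = {
    "Books": {"S1": "/books"},
    "Book": {"S1": "/books/book", "S2": "/writers/writer/article"},
    "Book.title": {"S1": "/books/book/@booktitle",
                   "S2": "/writers/writer/article/@title"},
    "Authors": {"S2": "/writers"},
    "Author": {"S1": "/books/book/author", "S2": "/writers/writer"},
    "Author.name": {"S1": "/books/book/author/@name",
                    "S2": "/writers/writer/@fullname"},
}


def compose(mapping: Dict[str, Dict[str, str]],
            source: str, target: str) -> Dict[str, str]:
    """Compose peer-to-super-peer and super-peer-to-peer correspondences.

    A source path is related to a target path when both are mapped to the
    same element of the global ontology.
    """
    composed = {}
    for paths in mapping.values():
        if source in paths and target in paths:
            composed[paths[source]] = paths[target]
    return composed


for src_path, tgt_path in compose(MAPPING, "S1", "S2").items():
    print(src_path, "->", tgt_path)
```

Elements without a counterpart in the target peer, such as /books, drop out of the composition, which is one way a remote answer can be incomplete.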

Query translation is achieved by using the mappings in conjunction with a collection of

query rewriting algorithms. We discuss the mapping and query phases in greater detail in

Section 3.4 and Section 3.5, respectively. Running examples based on the schemas in Figure 17

will be used for illustration.

3.4 Mapping Process

In PEPSINT, the data sources residing at the peers may be either XML data modeled by

an XML schema language (e.g., XML Schema) or else RDF data whose classes and properties

are described using RDF Schema (RDFS). As previously mentioned, mappings between local

schemas and the global ontology are established by the schema matching process during the

registration of a peer to the super peer. The key operation in this process is the preservation

of the domain structure of RDF sources and the document structure of the XML sources.

3.4.1 Mapping Local RDF Schemas to the Global Ontology

Schema matching takes the global RDF ontology G (in the super peer) and a local RDF

schema R (in the peer) as the inputs and returns a set of mappings M between the elements of

G and the elements of R as the output. Meanwhile, the global ontology is updated by merging

or adding metadata from the local RDF schema.

Elements in an RDF schema include concepts and roles (also known as classes and properties

in RDFS terminology). When matching the local RDF schema with the global RDF ontology,

for each element pL in the local RDF schema, if there already exists in the global ontology

a semantically equivalent element pG, the two elements will be merged and a correspondence

such as (pL, pG) will be generated. Otherwise, the element pL will be copied into the global

ontology as pG, and a correspondence (pL, pG) will be generated as well. We define a group

of operations on the ontology to implement schema matching between two RDF schemas, e.g.,

merging of classes, merging of properties, merging of relationships between classes, and copying

a class and/or its properties. A concrete example is given in our previous work (39).
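The merge-or-copy rule described above can be sketched in a few lines. This is a simplified illustration, not the set of ontology operations defined in (39): schema elements are reduced to path-expression strings, and the semantic equivalences, which are established manually, are passed in as a dictionary. The function name `match_schema` and its parameters are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def match_schema(global_elems: Set[str], local_elems: List[str],
                 equivalent: Dict[str, str]) -> List[Tuple[str, str]]:
    """Match a local RDF schema against the global ontology.

    `equivalent` maps a local element to the global element that a user
    judged semantically equivalent. `global_elems` is updated in place.
    Returns the correspondences (pL, pG).
    """
    correspondences = []
    for p_local in local_elems:
        p_global = equivalent.get(p_local)
        if p_global is None or p_global not in global_elems:
            # No equivalent element in G: copy the local element into G.
            p_global = p_local
            global_elems.add(p_global)
        # Otherwise the two elements are merged; G itself is unchanged.
        correspondences.append((p_local, p_global))
    return correspondences


G = {"Book", "Book.title"}
R = ["Book", "Book.booktitle", "Book.ISBN", "Publisher"]
M = match_schema(G, R, {"Book": "Book", "Book.booktitle": "Book.title"})
print(M)
print(sorted(G))
```

In this run Book.booktitle is merged into the existing Book.title, while Book.ISBN and Publisher are copied into the global ontology.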

3.4.2 Mapping Local XML Schemas to the Global Ontology

By transforming the participating local XML schema into a local RDF schema, we can

convert the problem of matching an XML schema with the global ontology into the problem of

matching an RDF schema with the global ontology, which is discussed in Section 3.4.1.

[Figure 19 shows two local RDFS ontologies. R1 contains the classes Books, Book, and Author,
linked by rdfx:contained, with the literal-valued properties booktitle and name. R2 contains
the classes Writers, Writer, and Article, linked by rdfx:contained, with the literal-valued
properties fullname and title.]

Figure 19. RDF schemas transformed from local XML source schemas.

[Graph of the global ontology G: the classes Books, Book, Authors, Author, and Publisher,
their rdfx:contained relationships, the relationship between Book and Publisher, and the
literal-valued properties title, ISBN, and name.]

RDF path expressions   RDF path expressions   XML path expressions        XML path expressions
in G                   in R                   in S1                       in S2
Books                  –                      /books                      –
Book                   Book                   /books/book                 /writers/writer/article
Book.title             Book.booktitle         /books/book/@booktitle      /writers/writer/article/@title
Book.ISBN              Book.ISBN              –                           –
Book.publishedBy       Book.publishedBy       –                           –
Publisher              Publisher              –                           –
Publisher.name         Publisher.name         –                           –
Authors                –                      –                           /writers
Author                 –                      /books/book/author          /writers/writer
Author.name            –                      /books/book/author/@name    /writers/writer/@fullname

Figure 20. The global ontology and its mapping table.

The schema transformation is carried out by the XML to RDF wrapper. The XML to RDF

wrapper converts XML attributes and simple elements to RDF properties; it converts XML

complex elements to RDF classes. The wrapper also encodes the element-attribute relationship

and the element-subelement relationship in XML schema respectively as the class-to-literal

relationship and the class-to-class relationship in the resulting RDF schema.

We choose to define a new, specialized RDF property rdfx:contained (the prefix rdfx stands

for the new name space “http://pepsint.org/rdfx#”) to explicitly denote nesting relation-

ships. In particular, given that two XML elements ei (parent element) and ej (child element)

are respectively converted into two RDF classes, ci and cj , the property rdfx:contained of ci is

then generated to connect ci to cj . Figure 19 shows the resulting local RDF schemas R1 and R2

that are respectively converted from the two XML schemas S1 and S2 shown in Figure 17. Fi-

nally, the global ontology G integrated from S1, S2 and R (in Figure 17) and its mapping table

are shown in Figure 20. The grayed concepts or roles are the ones merged from local sources.

We notice that both the rdfx:contained property in G and the mappings in the mapping table

encode the document structure of XML sources, so that either of them can be exploited for

tracking XML document structure in future query translations.
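The transformation performed by the XML to RDF wrapper can be sketched as a recursive walk over the XML schema tree. This is a minimal sketch under assumptions: the schema tree is encoded as nested tuples (a hypothetical encoding), class names are obtained by capitalizing element names, and every child element is treated as a complex element, so simple elements (which the wrapper turns into properties) are not covered.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical encoding of schema S1: (element name, attributes, children).
S1 = ("books", [], [("book", ["booktitle"], [("author", ["name"], [])])])


def wrap(element, triples: Optional[List[Triple]] = None) -> List[Triple]:
    """Convert an XML schema tree into RDF schema statements."""
    name, attributes, children = element
    cls = name.capitalize()  # complex element -> RDF class
    triples = [] if triples is None else triples
    for attr in attributes:
        # element-attribute relationship -> class-to-literal property
        triples.append((cls, attr, "Literal"))
    for child in children:
        # element-subelement relationship -> rdfx:contained between classes
        triples.append((cls, "rdfx:contained", child[0].capitalize()))
        wrap(child, triples)
    return triples


for triple in wrap(S1):
    print(triple)
```

For S1 the walk produces the classes Books, Book, and Author, two rdfx:contained statements, and the properties booktitle and name, which agrees with the ontology R1 described for Figure 19.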


3.5.1 Assumptions

For the simplicity of discussion, we make the following assumptions.

1. We assume the mappings from a local schema to the global ontology are total, one-to-one

mappings. On the other hand, the mappings from the global ontology to the whole set of local

schemas are total but not one-to-one mappings, since a concept in the global ontology might

be merged from multiple concepts of different local schemas (as a result of schema matching).

The mappings from the global ontology to a single local schema are one-to-one but they may

be partial mappings, which means a query run at a local source may result in an incomplete

answer.

2. We also assume that XML queries conform to a subset of XQuery (19), which we call

PXQuery (Partial XQuery) in this chapter. PXQuery consists of a non-nested FLWR expression

that includes four clauses: for, let, where, and return; the where clause may only contain

comparison operators. Other limitations of PXQuery include: (1) Only a single XML document

is involved in the query; (2) No new XML fragments are introduced in the query; (3) The path

expressions contained in the clauses only use child axes; (4) No type declarations, functions,

order clauses, and predicate filters are used.

3. To represent RDF queries, we use RDQL, which uses an SQL-like syntax (62). RDQL

consists of the following clauses: SELECT, FROM, WHERE, AND, and USING. We assume only com-

parison operators are used in the AND clause of the RDQL query. The FROM and USING clauses

are not the focus of our attention since they are not involved in query translation.

For the sake of convenience, we associate a PXQuery query Q with a triple
$(V_R^Q, V_W^Q, C^Q)$, where $V_R^Q$ and $V_W^Q$ are the two sets that respectively contain all XML path
expressions in the return clause and in the where clause, and $C^Q$ contains the constraints whose
items are in the form of v R c, where $v \in V_W^Q$, c stands for a constant, and R is a comparison
operator (e.g., =, <, >, ≤, ≥, and ≠). Likewise, we also associate an RDQL query Q with a
triple $(P_S^Q, P_W^Q, C^Q)$, where $P_S^Q$ and $P_W^Q$ respectively contain all RDF path expressions in
the SELECT clause and in the WHERE clause, and $C^Q$ contains the constraints whose items are in
the form of p R c, where $p \in P_W^Q$, c stands for a constant, and R is a comparison operator.

3.5.2 Query Answering in Data Integration Mode

Query answering in data integration mode includes the following steps. We use a running

example for illustration.

1. Analyzing the source RDQL query to convert it from a string to a triple
$Q_{in} : (P_S^{Q_{in}}, P_W^{Q_{in}}, C^{Q_{in}})$. In order to get the RDF path expressions in
$P_S^{Q_{in}}$ and $P_W^{Q_{in}}$, we have to match the triple patterns (specified in the WHERE clause)
with the RDF graph corresponding to the local RDF schema. $C^{Q_{in}}$ contains all the constraints
specified in both the triple patterns of the WHERE clause and the AND clause. Because of space
limitations, we omit the detailed process of pattern matching in this chapter.

Example 3.1 To “find the publications written by a1”, the user poses the following query over
the global ontology (the prefix go stands for the name space "http://examples.org/global#",
where the global ontology is defined).

SELECT ?title
WHERE (?book, <go:title>, ?title),
      (?book, <rdfx:contained>, ?author),
      (?author, <go:name>, ?name)
AND (?name eq "a1")

The resulting $Q_{in}$ elements are:

$P_S^{Q_{in}}$ = {Book.title}
$P_W^{Q_{in}}$ = {Book, Book.title, Author, Author.name}
$C^{Q_{in}}$ = {(Author.name, eq, "a1")}

2. Rewriting the source query into target subqueries over the RDF or XML sources, by applying the query rewriting algorithm RDQL2RDQL or RDQL2PXQuery (once for each source), which utilizes the mapping information stored in the mapping table of Figure 20. The output \(Q_{out}\) of a query rewriting algorithm is a triple of the form \((P^S_{Q_{out}}, P^W_{Q_{out}}, C_{Q_{out}})\) for an RDF source or \((V^R_{Q_{out}}, V^W_{Q_{out}}, C_{Q_{out}})\) for an XML source. From \(Q_{out}\), we can compose the target query that is executable over the local source. Below is the result of this step for Example 3.1.

For the local RDF source R:

\(P^S_{Q_{out}}\) = {Book.booktitle}, \(P^W_{Q_{out}}\) = {Book, Book.booktitle}, \(C_{Q_{out}}\) = {}.

The target RDF query is:

SELECT ?booktitle
WHERE (?book, <lo:booktitle>, ?booktitle)

For the local XML source S1:

\(V^R_{Q_{out}}\) = {/books/book/@booktitle}, \(V^W_{Q_{out}}\) = {/books/book, /books/book/@booktitle, /books/book/author, /books/book/author/@name}, \(C_{Q_{out}}\) = {(/books/book/author/@name, =, "a1")}.

The target XML query is:

for $book in doc("books.xml")/books/book
where $book/author/@name = "a1"
return $book/@booktitle


For the local XML source S2:

\(V^R_{Q_{out}}\) = {/writers/writer/article/@title}, \(V^W_{Q_{out}}\) = {/writers/writer/article, /writers/writer/article/@title, /writers/writer, /writers/writer/@fullname}, \(C_{Q_{out}}\) = {(/writers/writer/@fullname, =, "a1")}.

The target XML query is:

for $writer in doc("writers.xml")/writers/writer
where $writer/@fullname = "a1"
return $writer/article/@title

3. Building an answer to the source query (on the global ontology G) by assembling

the fragment results returned from local sources. We need to not only union the fragments

(returned from different sources) while removing identical records, but also join the records

based on some common key attribute. In addition, null values will be filled into the records

that just partially cover queried attributes. The result of an RDQL query is a table containing

URIs or string constants corresponding to the path expressions in the SELECT clause. For

example, the answer to the query of Example 3.1 is a table containing a single tuple ("b1"),

which is the union of results from S1 and S2. The record ("b3") returned from R is filtered


out since the target query over R loses the query constraints in query rewriting, caused by the

partial mappings from G to R (i.e., R has no correspondence for the class Author in G).
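The assembly step described above (union with duplicate removal, join on a common key, and null filling) can be sketched as follows. This is a minimal illustration under stated assumptions, not the PEPSINT implementation: the function name `assemble`, the record format (Python dicts), and the attributes isbn, title, and year are all hypothetical. The sketch does not model the filtering of records from sources whose target query lost its constraints.

```python
def assemble(fragments, key, attributes):
    """Merge per-source record lists into one answer table.

    fragments  -- list of record lists, one list per source; a record is a dict
    key        -- attribute used to join records that describe the same entity
    attributes -- queried attributes, in output column order
    """
    merged = {}    # key value -> combined record
    keyless = []   # records without a key value; they cannot be joined
    for records in fragments:
        for record in records:
            k = record.get(key)
            if k is None:
                keyless.append(dict(record))
                continue
            # Join: values seen first win, later sources only fill gaps.
            target = merged.setdefault(k, {})
            for name, value in record.items():
                if target.get(name) is None:
                    target[name] = value

    rows, seen = [], set()
    for record in list(merged.values()) + keyless:
        # Null filling: attributes that no source covered become None.
        row = tuple(record.get(a) for a in attributes)
        if row not in seen:   # union with duplicate removal
            seen.add(row)
            rows.append(row)
    return rows


# Hypothetical fragments returned by two sources.
s1 = [{"isbn": "i1", "title": "b1"}]
s2 = [{"isbn": "i1", "year": "2004"},
      {"isbn": "i2", "title": "b2"},
      {"isbn": "i1", "title": "b1"}]
table = assemble([s1, s2], key="isbn", attributes=["isbn", "title", "year"])
```

Here the two partial records for "i1" are joined into one tuple, the repeated record is absorbed, and the record for "i2" receives a null for the attribute that no source supplied.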

3.5.3 Query Answering in Hybrid P2P Mode

We only focus on the case of translating a source query in PXQuery from a peer to all the

other peers, since the translation of a source RDQL query is similar to what is done in data

integration mode (except for the transitive mappings). Query answering in hybrid P2P mode

includes the following steps.

1. Analyzing the source PXQuery query to convert it from a string to a triple \(Q_{in} : (V^R_{Q_{in}}, V^W_{Q_{in}}, C_{Q_{in}})\).

Example 3.2 To “list all the publications”, the user poses the query below (over the local source S1). The resulting \(Q_{in}\) components are listed after the query.

for $book in doc("books.xml")/books/book
return $book

\(V^R_{Q_{in}}\) = {/books/book}, \(V^W_{Q_{in}}\) = {}, \(C_{Q_{in}}\) = {}

2. Rewriting the source query into a target query over each of the other connected RDF or XML sources, by utilizing the query rewriting algorithm PXQuery2RDQL or PXQuery2PXQuery (once for each source) and the transitive mappings between the original data source and the target data source. The output of the query rewriting algorithm is a triple \(Q_{out} : (V^R_{Q_{out}}, V^W_{Q_{out}}, C_{Q_{out}})\) for a target XML data source or \((P^S_{Q_{out}}, P^W_{Q_{out}}, C_{Q_{out}})\) for a target RDF data source.

An XML query must take into account the document structure of the XML source. The answer to an XML query is returned as a set of subtrees, each of which is rooted at one of the queried nodes (i.e., vertices in \(V^R_Q\)). For instance, the answer to the XML query in Example 3.2 is the subtree rooted at book in S1 (see Figure 17). Therefore, the query rewriting algorithm also outputs a tree T whose children are the resulting subtrees of the answer. The result of this step for Example 3.2 is shown below.

For the local RDF source R:

\(P^S_{Q_{out}}\) = {Book}, \(P^W_{Q_{out}}\) = {}, \(C_{Q_{out}}\) = {}.

The target RDF query is:

SELECT ?book, ?title
WHERE (?book, <lo:booktitle>, ?title)

[Figure: the tree T over the local RDF source R. Node labels: Book, Publisher, ISBN, publishedBy, booktitle, name, Literal.]


For the local XML source S2:

\(V^R_{Q_{out}}\) = {/writers/writer/article}, \(V^W_{Q_{out}}\) = {}, \(C_{Q_{out}}\) = {}.

The target XML query is:

for $writer in doc("writers.xml")/writers/writer
for $article in $writer/article
return
  <book booktitle="{$article/@title}">
    <author name="{$writer/@fullname}"/>
  </book>

[Figure: the tree T over the local XML source S2. Node labels: writers, writer *, article *, @title, @fullname.]

3. Building an answer to the source query (against the original data source) by computing

the union of the local answer (returned from the original queried peer) and the remote answers

(returned from remote peers). The remote answers are constructed differently for queries that target RDF sources and for queries that target XML sources. For an RDF source, because RDQL cannot represent document structure, the remote answer is built by organizing (based on the structure specified by T) the instances returned from executing the target RDQL query. For an XML source, the remote answer is formed by simply executing the target PXQuery query, which already represents the same structure as specified by T. For Example 3.2, the final answer to the source query is shown below, where the three resulting lines come from the local sources S1, S2, and R, respectively.

<book booktitle="b1"> <author name="a1"/> </book>
<book booktitle="b2"> <author name="a2"/> <author name="a3"/> </book>
<book booktitle="b4"> </book>

3.6 Summary

In this chapter, we propose a P2P schema-based data management framework called PEPSINT.

This framework aims to semantically integrate distributed heterogeneous XML and RDF data

sources. We discuss the construction of the architecture, maintenance of mappings, and query

processing in PEPSINT. In particular, semantic integration is implemented at schema-level

through the schema matching process and at instance-level through the query answering process.

A key aspect in these two processes is the preservation of domain and document structure, which

is realized by extending the RDF metadata space and providing a set of query rewriting al-

gorithms. Because of this preservation, the user query can be correctly propagated across the

heterogeneous XML and RDF data sources in PEPSINT, so that information access within the

network is transparent to the user.


As for future work, we will: (1) Develop a proof of correctness for the query process. (2)

Design and implement a semantic web application (e.g., for bibliographic data exchange) in

PEPSINT to validate and evaluate the system. (3) Do a performance comparison of PEPSINT

with other P2P data management systems.

CHAPTER 4

PURE PEER-TO-PEER DATA INTEGRATION

4.1 Introduction

Research on peer-to-peer (P2P) computing techniques is flourishing with a number of pro-

posals on the related open issues such as robustness in dynamic P2P systems, reliability of

participants (peers), network performance, data coordination, and semantic issues (83). Among

these issues, data interoperability is fundamental, especially in the case of fine-grained (e.g.,

content-based) searches in a P2P network of data sources. Thus, P2P data management (or

integration) systems (PDMS) arise by combining schema-based data integration with a P2P

infrastructure (15; 59). In addition, the use of ontologies has been recognized as an effective

approach to promote the interoperability among distributed sources, by resolving their data

heterogeneities at a semantic level (39; 64; 87; 111). These two research trends lead to the

emergence of ontology-based P2P data management systems (OPDMS).

P2P ontology mapping and query processing are two important issues in an OPDMS. While

ontologies are used in local sources as a uniform conceptual metadata representation, which re-

solves the syntactic heterogeneity among sources in different peers, schematic (or structural) and

semantic heterogeneity may still exist. Therefore, ontology mappings are established between

peers to provide a common understanding of their data sources (64). Based on such ontology

mappings, a variety of data management tasks, such as data integration, query processing, and


data exchange, can be performed within the whole OPDMS. In this chapter, we propose a

framework for OPDMS and discuss the issue of query processing in this framework. In partic-

ular, we propose a P2P query rewriting algorithm that takes into account integrity constraints

specified on local data sources.

In our work, local RDFS (http://www.w3.org/TR/rdf-schema/) ontologies are used to uniformly represent heterogeneous source schemas. To represent the semantic mappings among these metadata (ontologies), we propose a mapping language, namely the P2P Mapping Language (PML), which uses a meta-ontology called RDF Mapping Schema (RDFMS). We also discuss the process of P2P query answering in a layered framework, which we propose for managing each peer. In spite of its simplicity in comparison with some mapping languages (e.g., the Semantic Bridging Ontology used in MAFRA (74)), PML is expressive enough to represent most types of ontology mappings, including the equivalent, broader (more generalized), narrower (more specialized), union, and intersection mappings. Furthermore, PML is extensible to define complex (e.g., many-to-many) mappings and new mapping types (e.g., a sibling mapping based on two broader mappings), due to the extensibility of RDFMS, which is defined on top of RDFS. We define a first order logic (FOL) semantics for PML, as well as for queries, which lays a unified foundation for query rewriting. We consider a particular class of queries on RDFS ontologies, namely conjunctive RQL (c-RQL) queries, and propose a P2P query rewriting algorithm.

The rest of the chapter is organized as follows. In Section 4.2 we describe existing related

work. Section 4.3 gives an overview of our approach. In Section 4.4, we discuss in detail the

P2P mapping language PML, as well as the meta-ontology RDFMS, which is used for mapping

representation. The algorithm for P2P query answering, specifically for P2P query rewriting

based on the P2P mappings, is given in Section 4.5. Finally, Section 4.6 concludes and discusses

future work.

4.2 Related Work

Semantic data integration using conceptual models, such as E-R models and ontologies, has

been widely investigated in the literature (70; 87; 111). Many P2P data management systems

(PDMSs) have been recently proposed, such as the LRM model (15) and Piazza (59). Our

framework as proposed in this chapter is closer to Piazza, which deals with the integration of

XML data and XML serialization of RDF data from different peers. Piazza uses an XQuery-

based mapping language to represent schema mappings. Query answering is realized by pattern

matching between the tree representing the XQuery and the tree representing the mappings.

Examples of OPDMS include the SWAP architecture (46), and based on it, the Bibster

system (57). Our ontology-based query rewriting algorithm in OPDMS is similar to the com-

puteWTA algorithm proposed by Calvanese et al. (26) for query reformulation, both assuming

consistent ontology mappings. However, unlike in computeWTA, we allow partial ontology

mappings, i.e., it is not necessary to map all the atoms in the query to be rewritten. This

assumption is practically meaningful since the user’s burden in mapping two peers can be thus

reduced.


The representation of ontology mappings should facilitate the use of mappings for data

management tasks, including data exchange and query processing. Issues related to ontology

mapping have been studied widely (64; 87). For example, Lehti et al. propose an OWL-based

model particularly for XML data integration (69). For representing RDF schema mappings,

Omelayenko has proposed the use of a meta-ontology, RDFT (90). However, it is unclear how

execution specific constraint information and data transformation dimension are attached to

the bridges. Context OWL (C-OWL) (20) and the Semantic Bridging Ontology (SBO) (74) are

two similar ontology mapping languages, with the former based on an extended OWL syntax

and semantics and the latter represented in DAML+OIL. Both languages define a set of bridge

rules with an explicit semantics. However, the utilization of such rules for query processing

remains an open issue.

In the case that mappings are defined as (relational) views, query processing is often referred to in the literature as view-based query answering or rewriting (58). However, few view-based query processing algorithms address the issue of query rewriting over ontologies, which usually allow for more expressive constraint specifications than most schema languages do.

4.3 System Overview

4.3.1 The Layered Peer Architecture

In a P2P data management system, a peer manages its local data source as in a traditional

database system. In addition, a peer also has to possess the ability of communicating with

the other peers by providing and consuming services. To this end, we propose for each peer


a layered architecture (as shown in Figure 21), by which distributed peers form a pure P2P

network.

[Figure 21 placeholder. Each peer (Peer 1, Peer 2, Peer 3) comprises four layers. Application Layer: GUI (User). Service Layer: Query Module, Mapping Module. Representation Layer: Local Ontology & Mappings (RDFS & RDFMS). Syntax Layer: RDF/XML, XML/RDB Wrapper, Local Data Source.]

Figure 21. The layered peer architecture.

The peer architecture consists of four layers, in which each upper layer achieves its func-

tionality based on the lower ones. In particular, the syntax layer provides a uniform syntax

(RDF/XML) for serializing the local ontology and its instances. A wrapper is used to convert

the local source schemas and data into such local ontologies. The representation layer contains

the local ontology in RDFS and its mappings in RDFMS. The service layer implements schema

mapping and query processing, which are two main services that a peer can provide to the

network. The application layer contains a GUI (Graphic User Interface) for the user to initiate

query requests. The adoption of a layered peer architecture simplifies the resolution of peer-to-


peer heterogeneities into level-to-level dependencies, thus facilitating the data interoperation

by making the layers more maintainable and reusable.

4.3.2 An Illustrative Example

[Figure 22 placeholder. The original figure shows the three peers.
Peer p1 (XML): the DTD p.dtd (department; faculty * with name and pub; publication * with id, title, and type) and the document p.xml (CS Department; faculty "M. Case" with pub "p01 p02"; faculty "J. Adams" with pub "p01"; publications with ids "p01" and "p02", types "book" and "conference", and titles "t1" and "t2").
Peer p2 (RDB): the relations proceedings(pid, year, title) with tuples (001, 2000, t1) and (002, 2000, t3); author(aid, affiliation, name) with tuples (001, UC, H. Luis), (002, UC, M. Case), and (003, UIC, J. Adams); and author_proc(aid, pid) with tuples (001, 001), (002, 001), (003, 001), and (001, 002).
Peer p3 (RDF): the schema f.rdfs (class Faculty with the Literal-valued properties firstname, lastname, book, and conf) and the document f.rdf (a Faculty instance with the literals "Luis", "H.", "t1", "t3", and "t4").]

Figure 22. A motivating example for P2P data integration.

As shown in Figure 22, the three autonomous peers p1, p2, and p3 contain three data sources,

which are heterogeneous in both syntax and schemata. In particular, Peer p1 contains the infor-

mation of faculty and publications in XML (p.xml) and DTD (p.dtd). The publication element

(pub) that is defined in IDREFS refers to one or more publication IDs (id). Such referential

constraints define inclusion dependencies as in relational databases. Peer p2 is a relational database containing conference proceedings. The attributes aid and pid in author_proc are foreign keys referring to author.aid and proceedings.pid, respectively, defining inclusion dependencies. Peer p3 contains an RDF document (f.rdf) with its RDF schema (f.rdfs) defined in RDFS. In comparison to XML data, we say that RDF data is flat because there are no nesting structures or order constraints among the classes and properties.

In addition to syntactic heterogeneity, a notable structural difference among these data

sources is that the semantically equivalent terms are formulated in different forms. That is, the

two types of publications (book and proceedings) are designed as values (instances) of an attribute in p1, as relation names in p2, and as property names in p3.

4.3.3 RDF Metadata Representation

The source schemas specify metadata about different data sources, in terms of elements

and attributes in XML schemas, relations and attributes in relational schemas, and classes

and properties in RDFS. A heterogeneous P2P integration system should provide a uniform

metadata representation to facilitate the P2P mapping process. For this purpose, wrappers are

used to transform heterogeneous schemas into the uniform representation (17; 40; 66).

In our approach, we choose to use RDFS to represent local metadata as a local ontology.

The following description summarizes the method of model-based schema transformation in our

previous work (40). For transformation from relational to RDFS, we represent relations as RDF

classes and attributes as RDF properties. For transformation from XML to RDF, we convert

complex-type elements into RDF classes and simple-type elements (with no subelements but

character contents) and attributes into RDF properties. The target RDF schema must also include the XML or relational referential constraints, which must be preserved for correct query translation between different data sources. We represent each such constraint by two RDF properties (corresponding to the two attributes involved in the referential constraint) sharing the same value. Figure 23 shows the results of schema translation of the three sources in the example of Figure 22. Notice that the nesting relationship between two XML elements is preserved by a new RDF property rdfx:contained, where rdfx is the new namespace (39).

[Figure 23 placeholder. Local ontology O1 in Peer p1: classes Department, Faculty, and Publication; properties name, pub, id, title, and type; rdfx:contained edges for nesting. Local ontology O2 in Peer p2: classes Author (aid, name, affiliation), Proceedings (pid, year, title), and Author_proc (aid_1, pid_1). Local ontology O3 in Peer p3: class Faculty (firstname, lastname, book, conf). Property ranges are Literal. Legend: Class, property, rdfs:domain, rdfs:range.]

Figure 23. Local RDFS ontologies.
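The relational part of this transformation (relations become RDF classes, attributes become RDF properties) can be sketched in a few lines of Python. This is an illustration only, not the wrapper of (40): the function name, the triple encoding as plain tuples, and the rule of appending _1, _2, ... when a property name is already taken are assumptions, the last one chosen so that the output reproduces the property names aid_1 and pid_1 of Figure 23. Referential constraints are not encoded here.

```python
RDF_TYPE = "rdf:type"
RDFS_CLASS = "rdfs:Class"
RDF_PROPERTY = "rdf:Property"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"
RDFS_LITERAL = "rdfs:Literal"


def relational_to_rdfs(schema):
    """Turn {relation: [attribute, ...]} into a list of RDFS triples."""
    triples = []
    used = set()   # property names already taken by an earlier relation
    for relation, attributes in schema.items():
        cls = relation.capitalize()               # relation -> RDF class
        triples.append((cls, RDF_TYPE, RDFS_CLASS))
        for attribute in attributes:
            # Assumed naming rule: add a numeric suffix on a name clash.
            name, n = attribute, 0
            while name in used:
                n += 1
                name = f"{attribute}_{n}"
            used.add(name)
            triples.append((name, RDF_TYPE, RDF_PROPERTY))   # attribute -> property
            triples.append((name, RDFS_DOMAIN, cls))
            triples.append((name, RDFS_RANGE, RDFS_LITERAL))
    return triples


# Relational schema of Peer p2 as listed in Figure 22.
p2_schema = {
    "author": ["aid", "affiliation", "name"],
    "proceedings": ["pid", "year", "title"],
    "author_proc": ["aid", "pid"],
}
triples = relational_to_rdfs(p2_schema)
```

For the schema of Peer p2 the sketch yields the classes Author, Proceedings, and Author_proc, with the foreign-key attributes of author_proc renamed to aid_1 and pid_1.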

4.3.4 P2P Mapping and Query Answering

In our framework, the P2P inter-schema mappings result from a process of matching the two

participating source schemas (99). In our previous work (113), we proposed a thesaurus-based RDF schema matching algorithm that utilizes WordNet (see footnote 1). In our approach, an inter-schema

mapping specifies correspondences between RDF classes or properties from two different source

schemas. The different types of mappings (e.g., equivalent, broader, or narrower) are determined

according to the comparison of the semantics of the mapped classes or properties. The mapping

information is stored in terms of instances of an RDF meta-ontology RDFMS (RDF Mapping

Schema), using in addition a mapping language, PML (P2P Mapping Language).

The process of P2P query answering includes three aspects: query execution, query rewrit-

ing, and answer integration. The user poses a query on a peer, which is first executed on that

peer. Meanwhile, the query is also forwarded to each of the linked peers, where the query is

rewritten into a new query that is executed locally and propagated further. Finally, answers

from every peer are returned to the host peer, where they are integrated to produce the answer.

For the purpose of query answering, we use a first-order relation based method to interpret

the inter-schema mappings. Actually, in our approach, both the mappings and heterogeneous

queries are interpreted by a set of first-order relations, so as to provide a unified environment

for query rewriting.
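The three aspects of P2P query answering (execution, rewriting, and answer integration) can be sketched as a breadth-first propagation over the peer graph. This is a minimal sketch under stated assumptions rather than the framework's actual protocol: `execute` and `rewrite` are hypothetical callbacks, each peer is assumed to answer at most once (the visited set), and answer integration is reduced to a duplicate-free union.

```python
from collections import deque


def answer_query(host, query, links, execute, rewrite):
    """Propagate `query` from `host` through the peer graph and collect answers.

    links   -- dict: peer -> list of directly linked peers
    execute -- callable(peer, query) -> list of answer records
    rewrite -- callable(query, source_peer, target_peer) -> rewritten query,
               or None when no rewriting exists
    """
    answers = list(execute(host, query))   # query execution on the host peer
    visited = {host}                        # assumption: each peer answers once
    frontier = deque([(host, query)])
    while frontier:
        peer, q = frontier.popleft()
        for neighbour in links.get(peer, []):
            if neighbour in visited:
                continue
            q2 = rewrite(q, peer, neighbour)   # query rewriting for the neighbour
            if q2 is None:
                continue                       # may still be reached via another peer
            visited.add(neighbour)
            answers.extend(execute(neighbour, q2))
            frontier.append((neighbour, q2))   # propagate further
    # Answer integration, reduced here to a duplicate-free union.
    unique = []
    for a in answers:
        if a not in unique:
            unique.append(a)
    return unique


# Toy network with made-up answers per peer.
links = {"p1": ["p2"], "p2": ["p3", "p1"], "p3": []}
data = {"p1": ["t1"], "p2": ["t1", "t3"], "p3": ["t4"]}
result = answer_query(
    "p1", "Q", links,
    execute=lambda peer, q: data[peer],
    rewrite=lambda q, source, target: f"{q}@{target}",
)
```

A real integration step would also have to reconcile structure and null values, as discussed for the data integration mode in Chapter 3.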

4.4 P2P Mappings

In this section, we discuss the representation of P2P semantic mappings using an RDF-based

meta-ontology, namely RDFMS, which, even if incomplete, is expressive enough to specify most commonly used mapping types. We also describe a mapping language PML, which uses an FOL semantics and serves as an interface for the user to define and manipulate the mappings.

Footnote 1: http://www.cogsci.princeton.edu/~wn/

[Figure 24 placeholder. Namespace: http://example.org/rdfms, prefix: rdfms. The class Map has the subclasses (rdfs:subClassOf) EquivalentMap, BroaderMap, NarrowerMap, UnionMap, and IntersectionMap, and the properties leftElement, rightElement, and constrainedBy (range Literal). Legend: Class, property, rdfs:domain, rdfs:range, rdfs:subClassOf.]

Figure 24. The meta-ontology of RDFMS.

4.4.1 RDFMS Meta-Ontology

As shown in Figure 24, RDFMS provides one-to-one mappings such as equivalent (repre-

sented by EquivalentMap), broader (BroaderMap), and narrower (NarrowerMap). Regarding

the case of one-to-many mappings, RDFMS defines UnionMap and IntersectionMap for the two types of logic combinations (or for UnionMap, and for IntersectionMap) of the elements on the multiple-element side. All these types of mappings are defined as classes inheriting from a common

class Map, which has three general properties that are also inherited by its subclasses. The

leftElement and rightElement properties are used to connect the mapped elements.

In order to represent the mapping expression (99) that a P2P mapping may carry, the

property constrainedBy is defined, whose data type is specified as Literal. An example of


the use of this property is &c1 (see Figure 25), which is used to confine the retrieval of the

instances from Peer p1 since Faculty is mapped to Author using NarrowerMap. Following the

example in Figure 23, we obtain the P2P mappings among the three local RDF schemas, as

shown in Figure 25. Note that every P2P inter-schema mapping is an instance of the RDFMS

meta-ontology.

[Figure 25 placeholder. The figure connects classes and properties of the local ontologies O1 (Peer p1), O2 (Peer p2), and O3 (Peer p3) of Figure 23 through the mapping instances NM-1, EM-1, EM-2, EM-3, BM-2, and IM-1, using rdfms:leftElement, rdfms:rightElement, and rdfms:constrainedBy edges; the constraints &c1, &c2, and &c3 are attached to mappings. Abbreviations: BM: BroaderMap; EM: EquivalentMap; IM: UnionMap; &c1 = "Author.affiliation = 'UC'"; &c2 = "Publication.type = 'conference'"; &c3 = "Author.affiliation = 'UIC'". Legend: Class, property, rdfs:domain, rdfms:leftElement, rdfms:rightElement, rdfms:constrainedBy.]

Figure 25. An example of P2P mappings represented in RDFMS.

4.4.2 P2P Mapping Language – PML

We define a set of mapping atoms for defining different types of P2P semantic mappings,

according to the structure of the RDFMS meta-ontology. Listed below are mapping atoms and

their corresponding RDFMS representation.

• EM(c1, c2): there exists an instance m of EquivalentMap, such that c1 = m.leftElement

and c2 = m.rightElement.

• BM(c1, c2): there exists an instance m of BroaderMap, such that c1 = m.leftElement and c2 = m.rightElement.

• NM(c1, c2): there exists an instance m of NarrowerMap, such that c1 = m.leftElement and c2 = m.rightElement.


• UM(c1, c2): there exists an instance m of UnionMap, such that c1 = m.leftElement

and c2 = {x|x = m.rightElement}, or c1 = {x|x = m.rightElement} and c2 =

m.leftElement.

• IM(c1, c2): there exists an instance m of IntersectionMap, such that c1 = m.leftElement

and c2 = {x|x = m.rightElement}, or c1 = {x|x = m.rightElement} and c2 =

m.leftElement.

• CON(m, e): given an instance m of Map or its subclasses, we have e = m.constrainedBy.

We note that c1 and c2 in EM, BM, and NM correspond to RDFS classes or properties, whereas

c1 and c2 in UM and IM can correspond to a set of classes or properties, to which the logic connectors ∨ (or) and ∧ (and) are applied, respectively.

Assuming a finite set of class names C and a finite set of property names P, we define a FOL (first order logic) semantics for the mapping atoms, in terms of the following two predicates:

• C_EXT(c, x), where the resource x is in the proper extent (i.e., is a direct instance) of class c.

• P_EXT(x, p, y), where (x, y) is in the proper extent (i.e., is a direct instance) of property p.

In our definition, a P2P mapping is allowed to connect not only two classes or two properties but also a class and a property. The interpretation ∆ of each P2P mapping atom varies according to the type of objects that are mapped, as given below.

• ∆EM(c1, c2) implies:

∀x C_EXT(c1, x) ↔ C_EXT(c2, x), if c1, c2 ∈ C;

∀x1∀x2∀y P_EXT(x1, c1, y) ↔ P_EXT(x2, c2, y), if c1, c2 ∈ P;

∀x∀y C_EXT(c1, y) ↔ P_EXT(x, c2, y), if c1 ∈ C, c2 ∈ P.

• ∆BM(c1, c2) implies:

∀x C_EXT(c1, x) → C_EXT(c2, x), if c1, c2 ∈ C;

∀x1∀x2∀y P_EXT(x1, c1, y) → P_EXT(x2, c2, y), if c1, c2 ∈ P;

∀x∀y C_EXT(c1, y) → P_EXT(x, c2, y), if c1 ∈ C, c2 ∈ P.

• ∆NM(c1, c2) implies:

∀x C_EXT(c1, x) ← C_EXT(c2, x), if c1, c2 ∈ C;

∀x1∀x2∀y P_EXT(x1, c1, y) ← P_EXT(x2, c2, y), if c1, c2 ∈ P;

∀x∀y C_EXT(c1, y) ← P_EXT(x, c2, y), if c1 ∈ C, c2 ∈ P.

• ∆UM(c1, c2) implies:

∨i(∆EM(c1, ai)), where ai ∈ c2, if c1 ∈ C ∪ P, c2 ⊆ C ∪ P;

∨i(∆EM(ai, c2)), where ai ∈ c1, if c1 ⊆ C ∪ P, c2 ∈ C ∪ P.

• ∆IM(c1, c2) implies:

∧i(∆EM(c1, ai)), where ai ∈ c2, if c1 ∈ C ∪ P, c2 ⊆ C ∪ P;

∧i(∆EM(ai, c2)), where ai ∈ c1, if c1 ⊆ C ∪ P, c2 ∈ C ∪ P.

The following is the interpretation ∆M1,2 for the mappings M1,2 between p1 and p2 in Figure 25.

∀x1∀x2∀y P_EXT(x1, name, y) ↔ P_EXT(x2, name, y),

∀x1∀x2∀y′ P_EXT(x1, title, y′) ↔ P_EXT(x2, title, y′),

∀x C_EXT(Faculty, x) → C_EXT(Author, x),

∀x C_EXT(Publication, x) ← C_EXT(Proceedings, x)

The FOL interpretation for ontology mappings enables standard reasoning on mappings as

well as the definition of more complex P2P mappings. For example, we can define a sibling

mapping SM such that SM(c1, c2) ⇔ NM(c1, c3) ∧ NM(c2, c3). Another example is the defini-

tion of a many-to-many mapping by composing two UnionMaps. Furthermore, an example for

reasoning on mappings can be such as ∆BM(c1, c2) ∧ ∆NM(c1, c2) ⇔ ∆EM(c1, c2). However,

reasoning on mappings is not the focus of this chapter. Instead, we concentrate on how to

use mappings for the purpose of query processing, specifically on query rewriting.
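For the class-to-class case, the interpretations above reduce to set relations between finite extents, which makes the reasoning example easy to check mechanically. The following sketch is only an illustration under that restriction: the function names are hypothetical, extents are modeled as Python frozensets of resource identifiers, and property mappings are not covered.

```python
from itertools import combinations


def holds_bm(ext1, ext2):
    """Class case of ∆BM(c1, c2): every instance of c1 is an instance of c2."""
    return ext1 <= ext2


def holds_nm(ext1, ext2):
    """Class case of ∆NM(c1, c2): every instance of c2 is an instance of c1."""
    return ext1 >= ext2


def holds_em(ext1, ext2):
    """Class case of ∆EM(c1, c2): both classes have the same instances."""
    return ext1 == ext2


def subsets(universe):
    """All subsets of a small universe, as frozensets."""
    items = list(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def equivalence_holds(universe):
    """Check ∆BM ∧ ∆NM ⇔ ∆EM for every pair of extents over `universe`."""
    all_sets = subsets(universe)
    return all(
        (holds_bm(a, b) and holds_nm(a, b)) == holds_em(a, b)
        for a in all_sets
        for b in all_sets
    )


checked = equivalence_holds({"r1", "r2", "r3"})
```

The exhaustive check over a three-element universe is of course not a proof; it only confirms that the equivalence behaves as mutual set inclusion on small cases.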


4.5 P2P Query Processing


Since the metadata of every source schema is expressed as a local ontology in RDFS, we

may be able to interpret a local query over the source schema in terms of a conjunctive query,

namely a conjunctive RQL query (c-RQL) (32), over the local ontology. A c-RQL query Q is of the form

ans(x) :– R1(x1), ..., Rn(xn).

where Ri is either C_EXT or P_EXT for i ∈ [1..n], and x ⊆ x1 ∪ ... ∪ xn. As usual, the ans part is called the head of the query, denoted headQ, and the rest is called the body of the query, denoted bodyQ. In this chapter, we consider only the class of local queries that can be expressed in c-RQL. The following gives two examples of translating local XPath (47) and relational queries into c-RQL queries; we omit the detailed procedure for lack of space.

Consider an XPath query /department/faculty[@name="M. Case"] as posed over p.xml in p1. The result of this query is the XML document tree (referred to as the answer tree) rooted at the first faculty element (see Figure 22). By considering the answer structure and semantics of the query (for correct query rewriting), we can interpret the XPath query as follows. Note that all the elements and attributes involved in the answer tree and in the predicates (of an XPath query) are covered in the resulting c-RQL query.

ans(x, y, z) :– P_EXT(x, name, y), P_EXT(x, pub, z),
y = "M. Case".

As another example, consider a relational conjunctive query posed on Peer p2 to “find all the publications written by authors from UIC”, as shown below.

ans(y) :– proceedings(x, y, z), author_proc(u, x),
author(u, v, w), w = "UIC".

The following is the first-order relation based interpretation for the preceding relational conjunctive query.

ans(y) :– P_EXT(x1, pid, x), P_EXT(x1, title, y),
P_EXT(x2, aid_1, u), P_EXT(x2, pid_1, x),
P_EXT(x3, aid, u), P_EXT(x3, affiliation, w),
w = "UIC"
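One way to read the relational example is as a mechanical translation: each relational atom receives a fresh tuple variable, and each column that the query actually uses becomes one P_EXT atom. The sketch below follows that reading; it is an assumption about the omitted procedure, not the procedure itself. The function name `to_crql`, the atom encoding as tuples, the rule for deciding which columns are "used", and the column order given in `schema` (chosen to match the argument positions of the example query) are all hypothetical.

```python
from collections import Counter


def to_crql(head, body, schema, constants):
    """Translate a relational conjunctive query into P_EXT atoms.

    head      -- list of head variables
    body      -- list of (relation, [variable, ...]) atoms
    schema    -- dict: relation -> list of property names, in column order
    constants -- dict: variable -> constant it is compared with (equality only)
    """
    # Count in how many positions each variable occurs across the body.
    occurrences = Counter(v for _, variables in body for v in variables)
    atoms = []
    for i, (relation, variables) in enumerate(body, start=1):
        tuple_var = f"x{i}"   # fresh variable standing for one tuple (resource)
        for prop, var in zip(schema[relation], variables):
            # Keep a column only if the query refers to its variable elsewhere.
            used = var in head or var in constants or occurrences[var] > 1
            if used:
                atoms.append(("P_EXT", tuple_var, prop, var))
    return atoms


# Assumed column order, matching the argument positions in the example query.
schema = {
    "proceedings": ["pid", "title", "year"],
    "author_proc": ["aid_1", "pid_1"],
    "author": ["aid", "name", "affiliation"],
}
body = [
    ("proceedings", ["x", "y", "z"]),
    ("author_proc", ["u", "x"]),
    ("author", ["u", "v", "w"]),
]
atoms = to_crql(head=["y"], body=body, schema=schema, constants={"w": "UIC"})
```

Under these assumptions the sketch produces the six P_EXT atoms of the interpretation above; the equality w = "UIC" would be carried over unchanged.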


The P2P query answering in our framework is a process of propagating a local query (ini-

tiated from a host peer p1) to every connected peer along the links. As previously mentioned,

this process includes three aspects: query execution, query rewriting, and answer integration.

Query rewriting can be seen as a function Q2 = f(Q1,M), where Q1 is the local query, M is the

set of P2P mappings, and Q2 is the resulting remote query. Based on the uniform first order

logic interpretation for both P2P mappings and user queries, the computation of f is realized

by the algorithm P2PRewriting as sketched below.


Algorithm P2PRewriting (Q, M)
Input: a conjunctive query Q over ontology O1; the mappings M between O1 and O2.
Output: a conjunctive query Q′ over O2.
headQ′ = headQ; bodyQ′ = null;
Let ∆Q be the corresponding c-RQL of Q;
Expand ∆Q into Q∗ using the constraints over O1;
Let φ be bodyQ∗;
For each R(x) of φ
    For each ψ ∈ M
        Let R′(x′) be the result of applying ψ on R(x);
        Add R′(x′) into bodyQ′ using a conjunction;
Let G be a query graph of φ and G′ be of bodyQ′;
For each connected subgraph H ⊆ G
    Find the corresponding subgraph H′ of H in G′;
    If H′ is not connected then
        Expand H′ using the constraints on O2 into a connected graph H′′;
        If H′′ exists then add into bodyQ′ all R′i that contribute to the expansion of H′;
        Else output null;
Output Q′;

Figure 26. The P2PRewriting algorithm.

The rest of our discussion elaborates on this algorithm by giving a concrete example. Suppose that the user poses a query Q over Peer p1 (in a P2P network as shown in Figure 23) to “list all papers written by H. Luis”, which is formulated as follows:

//publication[//faculty[contains(@pub, @id) and @name="H. Luis"]]

The first step of rewriting Q is the interpretation of Q as ∆Q. As previously mentioned, the interpretation of an XPath query has to consider its answer structure. In this example, the answer to Q covers the XML node publication and its children id, title, and type, according to the schema structure in p1 (see Figure 22). Based on the local RDFS ontology of Peer p1, ∆Q is computed as follows.

ans(x, y, z) :– P_EXT(p, id, x), P_EXT(p, title, y),
P_EXT(p, type, z), P_EXT(q, pub, x),
P_EXT(q, name, "H. Luis")

The expansion of ∆Q uses the classic chase algorithm that “chases” a tableau query with dependencies on a relational database (2). The following shows the result Q∗ of expanding ∆Q using the constraints on the ontology O1, and its rewriting Q′ resulting from the application of the mapping constraints M1,2. We note that the application of a mapping ϕ → ψ to a query predicate ϕ follows standard logical implication: ϕ, ϕ → ψ ⇒ ψ.

Q∗ : ans(x, y, z) :– P_EXT(p, id, x), P_EXT(p, title, y),
P_EXT(p, type, z), C_EXT(Publication, p),
P_EXT(q, pub, x), P_EXT(q, name, "H. Luis"),
C_EXT(Faculty, q)

Q′ : ans(y) :– P_EXT(p, title, y), C_EXT(Proceedings, p),
P_EXT(q, name, "H. Luis")

The query graph of a query is constructed by adding a node for each atom in the query and

adding an edge between two nodes if their corresponding atoms contain the same variable. In

the last step, the algorithm finds that the query graph of Q∗ is connected, whereas that of

Q′ is not. Hence, Q′ has to be expanded (also using the chase algorithm) according

to the constraints on O2, resulting in the following final rewriting Q′ of Q:

ans(y) :– P_EXT(p, title, y), C_EXT(Proceedings, p),
          P_EXT(q, name, "H. Luis"),
          P_EXT(p, pid, y2), P_EXT(q, aid, y1),
          P_EXT(x, aid_1, y1), P_EXT(x, pid_1, y2)
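The connectivity test on query graphs can be sketched in Python as follows. This is a minimal illustration, assuming that each atom is reduced to the set of variables it contains (constants such as title or "H. Luis" are dropped); the function name is_connected is hypothetical and not part of the P2PRewriting algorithm as stated.

```python
from itertools import combinations


def is_connected(atoms):
    """atoms: list of sets of variable names, one set per query atom."""
    if not atoms:
        return True
    # Two atoms are adjacent when they share at least one variable.
    adjacency = {i: set() for i in range(len(atoms))}
    for i, j in combinations(range(len(atoms)), 2):
        if atoms[i] & atoms[j]:
            adjacency[i].add(j)
            adjacency[j].add(i)
    # Depth-first search starting from the first atom.
    seen, stack = {0}, [0]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return len(seen) == len(atoms)


# Variable sets of the atoms in Q*, in Q' before expansion, and in the final Q'.
q_star = [{"p", "x"}, {"p", "y"}, {"p", "z"}, {"p"}, {"q", "x"}, {"q"}, {"q"}]
q_prime = [{"p", "y"}, {"p"}, {"q"}]
q_prime_expanded = q_prime + [{"p", "y2"}, {"q", "y1"}, {"x", "y1"}, {"x", "y2"}]
```

In this encoding, the atom over q in the unexpanded Q′ shares no variable with the atoms over p, which is why the expansion over O2 is needed; the four added atoms link p and q through x, y1, and y2.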

It is straightforward to obtain the corresponding relational conjunctive query of Q′, which

is then executed over the RDB in Peer p2 to retrieve a local answer from p2. Similarly, we can

rewrite Q′ to a query Q′′ over O3 and get a local answer from p3. The (global) answer of Q,

after the integration of all local answers, is as follows, where the null values are caused by the

fact that the P2P mappings are partial (i.e., not all atoms referred to by the query are mapped):

<publication id="" title="t1" type=""/>



We did not describe all the details because of space limitations.

In order to retrieve correct data from the P2P network, it is required that the remote query

Q2 rewritten from a local query Q1 be equivalent to Q1. The query rewriting satisfying this

condition is called equivalent query rewriting (58), which is defined for homogeneous relational

data integration. Query equivalence in terms of answer equivalence has also been defined (58).

Such equivalence, however, will have a different (less strict) meaning in the context of a hetero-

geneous P2P network. Informally, we say that two answers (to two different queries on different

data sources) are equivalent if they are structurally and semantically equivalent. However, such

equivalence does not entail identical answers. Although not proved in this chapter, our P2P

query rewriting guarantees semantic equivalence, which is based on the concept of reversibility

(39). To achieve semantic equivalence the following is needed: the correctness of source schema

representation in RDFS, a valid P2P ontology mapping, and the preservation of the answer

structures.

4.6 Summary

In this chapter, we describe an ontology-based approach to the data interoperability problem

in a heterogeneous P2P network. RDF techniques are used in our framework, through the use

of the RDFS local ontologies for metadata representation and the use of the RDFMS meta-

ontology for inter-schema mapping representation. Our contributions include a definition of the

syntax of PML (based on the RDFMS meta-ontology), a definition for its semantics in terms of

first-order relations, and a query answering algorithm that considers constraints in local data

sources.

For future work, we will further study the following aspects: (1) Due to the locality of P2P

systems, mappings between different pairs of peers may be designated by different people. This

can result in inconsistency between different inter-schema mappings. In addition, given two

large-size source schemas to be mapped, the user may hope some inferencing can be performed

to derive new mappings from existing mappings automatically. In fact, the problem of map-

ping consistency and that of mapping inference are essentially the same in the case where the

inferencing involves multiple sets of inter-schema mappings (6). (2) In a P2P network, peers

are designed as autonomous nodes, and any peer can accept user queries. In such settings, an

established inter-schema mapping, say from Peer p1 to Peer p2, may be used both for query

rewriting from p1 to p2 and for that from p2 to p1. Given that the inter-schema mappings

are directional and a uniform query rewriting algorithm is deployed in the P2P system, the

utilization of a single inter-schema mapping for query rewriting in different directions has to

be treated differently. This arises because of the problem of bidirectionality of P2P mappings

(59).

CHAPTER 5

DATA INTEROPERABILITY IN THE SEMANTIC DESKTOP

5.1 Introduction

In 1945, Vannevar Bush put forward the first vision of a personal information management

(PIM) system, Memex, by pointing out that the human mind “operates by associations”, and

we should “learn from it” in building Memex (23). Hypertext systems (see the survey of

Conklin (33)), which flourished in the 1980s, reinforced this vision and yielded the current World

Wide Web, in a broader scope. Recently, with the Semantic Web vision (13), a number of PIM

systems associated with that vision, hence called Semantic Desktop, have been proposed. By

summarizing these proposals and taking into account the characteristics of personal information

(PI), we propose the following principles that a PIM system should follow:

Semantic data organization. Almost all existing approaches are trying to go beyond the

hierarchical directory model. The critical factors of semantic data organization include ade-

quate annotations, explicit semantics, meaningful associations, and a uniform representation. A

semantic-rich data organization has several advantages. First, the annotations and associations

(as the superimposed information over the coarse data (75)) form the context of the PI, thus

making the data more easily understandable. Second, the superimposed information also allows

for a finer and more flexible manipulation (e.g., browsing and querying) of the data. Third,

an explicit formal semantics for the data can facilitate reasoning on the data and deriving new

knowledge. Finally, the uniform representation can support the integration of data that may

be heterogeneous.

PI Space (C:\)
    papers
        WISE03-1.pdf
        WISE03-submission.pdf
        WISE03-camera.pdf
        JoDS05.pdf
    photos
        WISE
            myself.jpeg
            talk.jpeg
            with sam.jpeg
    talks
        WISE03.ppt
        IDEAS04.ppt
        AP2PC04.ppt
        Super-invited.ppt
    emails
        Final submission of WISE.eml
        Meeting on Monday.eml
        WISE photos.eml
        Register for WISE.eml

Figure 27. An example of files in a PI space.

Flexible data manipulation. A PIM system can provide integration, exchange, navigation,

and query processing of the stored personal information. The framework of PIM, including the

data model, query language, and user interface, should provide multiple ways to manipulate data

in a powerful and flexible manner. Furthermore, a PIM system should possess the capability for

seamless communication (or interoperability) with external sources (possibly in another PIM

system), e.g., in a peer-to-peer (P2P) way (103).

Rich visualization. Multiple visualizations can help the user in understanding data. Instead

of providing separate views of the data as most traditional applications do, a PIM system should

support data visualization from different perspectives, to offer a comprehensive view. Examples

include association-centric visualization (98) and time-centric visualization (51; 50).

Example 5.1 Figure 27 presents a fragment of a PI space, which consists of four directories

of files on the hard drive C:\. The papers directory contains four papers in PDF format,

photos\WISE contains three pictures taken at the WISE ’03 conference, talks contains four

PowerPoint files that are respectively the slides of four talks, and emails contains four saved

email messages. Even if the concrete contents of all these files are unknown, we can tell from

their names (or the names of their respective directories) that several of them appear to be related

to one another. Unfortunately, their storage in different and possibly unrelated directories does

not show such inter-relationships, thus resulting in possible difficulties in locating the wanted

information. Some keyword-based searching techniques, e.g., offered by the Google Desktop

Search,1 can retrieve all files that are relevant to WISE. However, without further inspection

of the contents of each file, the user may not be able to discover certain associations between

them, e.g., that file JoDS05.pdf is an extended journal paper of WISE03-camera.pdf.

From this example, we can see that the lack of semantic associations among the stored data

could be a handicap for data and knowledge discovery. In this chapter, we focus on issues of

semantic data organization and management in PIM, by taking the following approach:

1) We propose a layered framework for PIM, in which multiple ontologies playing a variety

of roles are employed. Specifically, the resource layer stores all the PI resources (using URIs),

metadata of the PI, and all kinds of associations using RDF. The domain layer contains the

ontologies specific to various domains that are used to structure the data and categorize the

resources. The application layer, built on top of the domain layer, is where the user constructs

different application ontologies for different purposes of data usage. This layered architecture

1 http://desktop.google.com

enables: i) a semantics-rich environment for personal information management; ii) a flexible and

reusable system, by decoupling the domain and application ontologies, so that the construction

of application ontologies for different applications can reuse the underlying domain ontologies.

We argue that this provides certain advantages over the use of a single domain model for all

the PI (e.g., (44)).

2) We discuss in detail how to utilize superimposed information for semantic organization,

focusing on the construction of resource-file and resource-resource associations. We also present

the idea of 3D navigation, which is a combination of the vertical, horizontal and temporal

navigation in the PI space. The idea is inspired by some existing PIM systems including

MyLifeBits (51) and Placeless Documents (45), and is demonstrated in a browser.

3) We describe in detail the architecture of our semantic desktop system, named MOSE

(Multiple Ontology based Semantic DEsktop) and the challenges that we are addressing in the

course of its implementation.

4) In our framework, the basic unit for the user to manage the Semantic Desktop is the per-

sonal information application (PIA). Each PIA aims to accomplish or assist a specific task (e.g.,

bibliography management, paper composition, and trip planning). The PIAs can be standalone,

with their own application ontology, user interface, and workflows. Meanwhile, they can com-

municate with each other as if in a P2P network, by means of the connections (mappings)

established between their application ontologies. In this sense, different PIAs interoperate at a

semantic level. We discuss the development of personal information applications (PIAs) and,

building on it, inter-desktop information sharing and data integration by means of PIA-based desktop

services. We also describe query processing in our framework in two cases: within a single PIA

or between two PIAs, in a P2P query processing mode.

The rest of the chapter is structured as follows. In Section 5.3, we describe the layered

framework and its main components. The semantic organization of the PI (including the

concepts of annotation, association, and representation) is discussed in Section 5.5. Section 5.6

and Section 5.9 focus on two main ways of data manipulation, namely, navigation and query

processing. Finally, we conclude in Section 5.10.

5.2 Related Work

The term semantic desktop was first coined by Decker and Frank, who also stated the

need for a “networked semantic desktop” that is enabled by several key emerging technologies

including: the Semantic Web, P2P computing, and online social networking (41). The state-

of-the-art of semantic desktop has been comprehensively summarized by Sauermann (104).

Among the existing approaches to PIM in desktops, the Gnowsis project aims at a semantic

desktop environment that supports P2P data management based on desktop services (103).

Similarly to MOSE, Gnowsis uses ontologies for expressing semantic associations and RDF

for data modeling. SEMEX is another personal data integration framework that uses a fine-

grained annotation based on schemas, similar to our ontology-based framework (44). However,

a single domain model is provided as the unified interface for all data access. MyLifeBits (51),

Haystack (98), and Placeless Documents (45) are three PIM systems that support annotations

and collections. The concept of collection is essentially the same as the conceptualization (using

ontologies) of resources in our framework.

Existing interfaces provide a workspace for the end user to develop applications. Such ap-

plications have their own data model, data presentation, and control logic. Of such interfaces,

Haystack’s end user interface is the closest to the PIA designer presented in this chapter (98).

Both use channels as units of content; however, the PIA designer supports parameterized chan-

nels in its MVC-based application development environment, which enable the specification of

the business logic (i.e., the controller) of an application. Furthermore, the PIA designer provides

a way to compose distributed desktop services that are defined and implemented based on PIAs.

Other interfaces for personal data management are based on Wikis and include SemperWiki

(91) and WikSAR (8). However, they resemble a hypertext composer (or content manager)

providing the user with a means to put pieces of information together as a Wiki page.

5.3 The Layered Multi-Ontology Framework

Our framework follows the principle of superimposed information, i.e., data or metadata

“placed over” existing information sources (75). This concept seems particularly useful for

the organization, access, interconnection, and reuse of the information elements. We propose

for PIM a layered ontology-based framework, as shown in Figure 28, with the following data

components:

Personal information space. The personal information space may contain structured data

(e.g., relational), semi-structured data (e.g., XML), or unstructured data. Unstructured data

can be textual or non-textual (as in video, audio, or picture files). Furthermore, textual files

can be classified as simple-content or complex-content. More specifically, simple-content files

have no references to other files. Typical examples include people contacts and Bibtex entries.

[Figure 28 (diagram). Labels: PI Space (textual files: simple-content, e.g., Contacts, Bibtex; complex-content, e.g., Papers, Reports, Emails, Slides; nontextual files: Video, Audio, Pictures; relational database, XML); Resource Layer (File Metadata, Resource-file index, Resource repository (RDF)); Domain Layer (Domain Ontology 1, 2, ..., m); Application Layer (Application Ontology 1, 2, ..., n); PIM 1, PIM 2, and PIM 3, each with an Application Layer; Association.]

Figure 28. An ontology-based framework of a PIM system.

In contrast, complex-content files have a flexible scheme of presentation, and may contain

references to other files, e.g., by means of citations or hypertext links (33). For example, a

paper in the PI space may cite another paper (existing in the PI space or an external space),

which, in turn, could cite other papers.

File description. We annotate each file using a file description (or metadata) consisting of

a set of properties of the file. Each item in the file description is a property-value pair. The

file description is the first-level (direct) annotation for the individual files, and has the same

scheme (structure) for the same type of files. For example, the following fragment contains a

typical description of a JPEG file.

Dimensions: 3072 × 2048 pixels

Device make: Canon

Color space: RGB

Focal Length: 75

......

Domain ontologies. A number of ontologies are published on the Web. Examples of such

ontology libraries include DAML Ontology Library,1 the Semantic Web Ontologies,2 and the

Protege OWL ontologies.3 The ontologies in these libraries are typically designed and organized

for different domains such as Conference, Person, Photo, and Email. In our framework, the

domain ontology layer is designed to be loosely-coupled with the other layers, to enable the

insertion and removal of ontologies as “plug-ins”.

Resource-file index and RDF repository. One of the roles of domain ontologies is to pro-

vide the basis for data classification. In order to establish the connections between the files and

the concepts in the domain ontologies, we treat each file as a resource, which is then classified

as an instance of one or more concepts. The resource-file index is a local database storing these

1 http://www.daml.org/ontologies/

2 http://www.schemaweb.info

3 http://protege.stanford.edu/plugins/owl/owl-library/

connections between resources and files. Furthermore, the various types of associations among

resources (as instances of association of concepts in the domain ontologies) are stored in an

RDF repository. The resource-file index and the RDF repository are both in the resource layer,

providing resource instances for the domain ontologies in the domain layer above.

Application ontology. Above the domain layer is the application layer, which contains the

ontologies for different applications. The domain ontologies, as an intermediate layer between

the applications and the data, are meant to enhance the reusability and flexibility of the frame-

work. More specifically, the application ontologies are defined as views of the domain ontologies,

which can be reused for the construction of different application ontologies. In our framework,

each personal information application (PIA) is associated with an application ontology, has

access to relevant data, and is functionally independent of other applications. It may be infea-

sible to have a single ontology to cover various applications, e.g., for trip planning and paper

writing. Instead, as many PIAs as needed can be designed in one or more PIM systems, where

the PIAs can interoperate (e.g., through P2P query processing) for the purpose of integrating

relevant information. This issue is elaborated on in Section 5.9.

Besides the data components described above, a PIM system also needs some functional

components to perform all kinds of data and metadata processing, to make the framework

work as a whole. Such components include an indexer (for establishing and managing the

indexes of the files), a wrapper (for identifying and extracting resources from the files), and an

ontology designer (for importing and editing an ontology). Because of space limitations, we do

not elaborate further on these components.

5.4 System Architecture

[Figure 29 (diagram). Labels: File system; Wrapper library (for PDF, PPT, DOC, etc.); Annotator; Indexer; Classifier; Data and Metadata Repositories (File description, R-F index, Ontology and resource repository); Semantic Desktop Server (Application APIs, Jena API, Ontology matcher, Query processor); User Interfaces (PIA Browser, PIA Designer, Resource Browser, Ontology designer); edges labeled files, text, <property,value>, resources, R-F associations, and triples; legend: Data flow, Control flow.]

Figure 29. The architecture of MOSE.

Figure 29 presents the architecture of MOSE (Multiple Ontology based Semantic DEsktop).

The following describes the primary components of the framework.

Our framework goes beyond the hierarchical directory based organization by means of two

types of ontologies: domain ontologies and application ontologies. The former represent the

conceptualization of different domains, thus providing a foundation for personal data classifi-

cation. The latter are designed to serve as the data model underlying personal information

applications (PIAs), which are developed by the end user. More details of how these ontologies

cooperate to enable a semantically powerful data manipulation in the semantic desktop are

given in Section 5.7.

File wrappers. The semantic organization is mainly based on a series of analysis and processing

steps on text documents in the personal information space. That is, we do not consider the

non-textual features of a file, although such features may facilitate data annotation (18). A

file wrapper is used to retrieve text from various types of files, such as PDF, PPT, and DOC.

The other functionality of file wrappers is to obtain from the file system the system-defined

properties of a file, e.g., its MIME type, size, and date.

Annotator. The annotator is responsible for creating and enhancing the annotation (or meta-

data) of a file. It is fed with the results of file wrappers, including the retrieved text and its

standard properties, based on which it associates the file with property-value pairs. Most

current data annotators need input from users, although sometimes part of the annotations can

be obtained from the file content. In practice, a semi-automatic annotator is often provided,

such as the “easy” annotation mechanism of MyLifeBits (51). In MOSE, the annotations are

stored in a database, called file description.

Classifier. The classifier is one of the most important components for the semantic organiza-

tion in the framework. Given a file and its file description, the classifier provides the following

operations: (1) Identification of the file as a resource with a unique URI (Uniform Resource

Identifier); (2) Examination of the file content to explore the resources that are contained or

referred to by the file; (3) Population of domain ontologies with all discovered resources; (4) De-

termination of the associations between resources, called resource-resource (R-R) associations.

These resources and their associations are maintained in a resource repository.

Indexer. After being classified, a file is indexed in terms of the resources discovered in itself

(e.g., the names of the authors in a publication). Such resource-file indices are stored in a

repository, called R-F index, for future use in query answering. There are three types of

R-F indices (also called R-F associations): identification, containment, and reference, which are

obtained by the first and second operations of the classifier. Given a query of keywords posed

by the user, the query processor of MOSE can first locate the corresponding resources and then

find the files that are identified as, containing, or referring to such resources, by means of the

R-F index.
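The lookup described above can be sketched with a small in-memory index in Python. This is only an illustration under stated assumptions: the class RFIndex, its methods, and the urn:example: resource identifiers are hypothetical and do not describe the actual MOSE implementation; the file names are taken from Figure 27.

```python
from collections import defaultdict

# The three kinds of resource-file associations named in the text.
KINDS = ("identification", "containment", "reference")


class RFIndex:
    """Hypothetical in-memory resource-file index."""

    def __init__(self):
        # resource URI -> set of (association kind, file path)
        self._by_resource = defaultdict(set)

    def add(self, resource, kind, path):
        if kind not in KINDS:
            raise ValueError(f"unknown association kind: {kind}")
        self._by_resource[resource].add((kind, path))

    def files_for(self, resource, kinds=KINDS):
        """Files identified as, containing, or referring to the resource."""
        entries = self._by_resource.get(resource, ())
        return sorted(path for kind, path in entries if kind in kinds)


index = RFIndex()
index.add("urn:example:wise03-paper", "identification", "papers/WISE03-camera.pdf")
index.add("urn:example:wise03-paper", "reference", "papers/JoDS05.pdf")
index.add("urn:example:author-xiao", "containment", "papers/WISE03-camera.pdf")

all_files = index.files_for("urn:example:wise03-paper")
citing_files = index.files_for("urn:example:wise03-paper", kinds=("reference",))
```

Keeping the association kind in each entry lets a query restrict the result, for example to files that merely refer to a resource rather than being identified as it.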

Ontology designer and matcher. At the center of the framework of MOSE are the multiple

application and domain ontologies stored in the ontology repository. We provide an ontology

designer for the management of concepts and roles of individual ontologies, and an ontology

matcher for the maintenance of inter-ontology relationships (i.e., ontology mappings). Con-

sidering that most semantic desktop end users may lack the knowledge of particular ontology

languages (e.g., RDFS or OWL), the ontology designer should hide the details of such languages

but enable users to work with the conceptualization of their domains of interest. In addition,

to improve the precision of an automatic ontology mapping process, the ontology matcher may

be able to combine different ontology matching strategies (37; 64).

5.5 Semantic Data Organization

The layered architecture of our PIM framework described previously enables the reusability

and the organization of semantically rich data for PIM. In this section, we discuss in detail

the mechanisms that our framework uses to support the semantic organization of the PI space,

including those for semantic annotation, association, and representation.

5.5.1 Annotation

Given that the data in the PI space is the base information, all the other data components

in our framework are actually superimposed information over this base. The most fundamen-

tal function of the superimposed information is to provide semantic annotations of the base

information to enable powerful and accurate data access. We discuss the following two aspects:

File description. It is especially important to provide the searcher with a detailed description

of the nontextual files. When performing a keyword-based search, the searcher matches the

submitted keywords (e.g., “Canon”) or key-value pairs (e.g., “Maker:Canon”) with the property-

value pairs of the file description, to find the right files requested by the user. Even for textual

files, taking into account such metadata will improve the effectiveness of full-text searching.
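As a rough illustration of this matching, the following Python sketch treats a file description as a dictionary of property-value pairs and accepts either a bare keyword or a "property:value" term. The function name matches and the exact matching rules (case-insensitive, substring match for keywords, exact match for property-value terms) are assumptions, not the searcher's actual behavior.

```python
def matches(description, term):
    """description: dict of property -> value; term: 'keyword' or 'property:value'."""
    if ":" in term:
        prop, value = (part.strip().lower() for part in term.split(":", 1))
        return any(
            key.lower() == prop and str(val).lower() == value
            for key, val in description.items()
        )
    keyword = term.strip().lower()
    return any(keyword in str(val).lower() for val in description.values())


# The JPEG file description used as an example earlier in this chapter.
jpeg_description = {
    "Dimensions": "3072 × 2048 pixels",
    "Device make": "Canon",
    "Color space": "RGB",
    "Focal Length": "75",
}

by_keyword = matches(jpeg_description, "Canon")
by_pair = matches(jpeg_description, "Device make:Canon")
```

With exact property names, a term such as "Maker:Canon" would not match the property "Device make"; a practical searcher would need some normalization of property names, which this sketch leaves out.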

Domain ontologies. Given that a file is identified as a resource, we are able to annotate the

file using a domain ontology, by associating the resource with a concept of an ontology. The

domain ontology provides not only a context for understanding the data, but also semantic clues

for precise data retrieval. For example, the user can query the PI using a query language

for RDF instead of using keywords. We note that a file can be an instance of more than one

concept, according to different classification criteria.

5.5.2 Association

In our framework, semantic associations are used to relate all the data (base information)

and metadata (superimposed information). There are two classes of associations: the resource-

file associations that are actually the resource-file indexes and the resource-resource associations

that are instances of the domain ontologies and are stored in the RDF repository.

Resource-file associations. In addition to the ontological resources that are used to identify

(through data classification) the files, a (textual) file may contain and refer to a number of

resources. Therefore, the resource-file associations can be one of the following: identification,

containment, and reference.

Example 5.2 Suppose that the user has saved an email message, which is an announcement of

a seminar, as shown in Figure 30. First, the email message can be classified as an instance of

the concept Email, provided that the concept exists in some domain ontology. Then, the system

can generate for the concept SeminarAnnouncement and its properties a new instance (i.e.,

resource), which is associated with the saved email by the relationship containment. Finally, a

reference association can be established between the resource http://www.tliap.nus.edu.sg/ (e.g.,

of the concept WebsiteAddress) and the email message.

Figure 30. An example of an email message.

The process of setting up the resource-file associations is that of recognizing resources

from the file description and/or the file content and then mapping them to the ontological

concepts. The user may determine the degree to which the resources should be extracted from

a file and its description. For instance, in the previous example, the user can further create

resources for the title and abstract of the seminar, and for the biography of the presenter. It is

expected that this process (as well as the process of discovering resource-resource associations,

as discussed later) can be maximally automated, to reduce the user’s burden. For this purpose,

we may utilize the following methods:

• Keyword extraction. From the text of a file, keywords can be extracted based on a

thesaurus or be highlighted manually by the user. Each keyword can be considered a

resource contained by the file. The matching of the resources with the concepts in the

domain ontologies can be guided by a thesaurus such as WordNet.1

1 http://wordnet.princeton.edu

• Hyperlink analysis. For the textual files that include hyperlinks to classified resources

(e.g., a citation of a paper or a link to a webpage), we create for each hyperlink a reference-

type resource-file association, as well as a resource-resource association between the re-

ferring resource and the referred one.

• Natural language processing. We can utilize known techniques (e.g., (12)) to parse

each sentence of a text or its summary obtained by means of text summarization (76).

For each resulting triple 〈subject, predicate, object〉, we try to match it with the patterns

〈s, p, o〉 in the domain ontologies, where p is a property of the concept s and has a value

of type o. If such a pattern exists, a resource-resource association of type property and of

the form 〈subject, predicate, object〉 is generated.

• History. As the framework proceeds with such classification and cognition, more and

more knowledge about this process can be accumulated and reused by a new process.

Resource-resource associations. We borrow from the Object Oriented Design (OOD) tech-

niques the following four types of relationships between objects: instantiation (i.e., member-

ship), property, aggregation (i.e., whole/part), and generalization (i.e., inheritance). These four

relationships, which are used in object models, are adopted to describe the associations among

concepts as well as resources in our framework. Note that “property” refers to a pattern as

identified, for example, using natural language processing techniques, which corresponds to a

user-defined property. For example, writes can be a property of the concept Author, connecting

Author to the concept Book. Table V summarizes the resource-resource associations in our

framework.

TABLE V

RESOURCE-RESOURCE ASSOCIATIONS.

Resource-resource associations | Intra-domain | Inter-domain | Intra-application | Inter-application | Domain-application
aggregation | √ √ √
property | √ √ √
instantiation | √ √ √
generalization | √ √ √
ontology mapping | √ √ √

By using the previously described techniques, we can discover the resources and their asso-

ciations implied in the PI, and classify them into the domain ontologies, thus populating the

ontologies. In the example of Figure 30, it is possible to extract a pattern 〈Singapore, implements,

ITS〉, which could then be classified as an instance of an ontological pattern such as 〈Organization,

implements,System〉, where Organization and System are two concepts, and implements is a prop-

erty. Note that the user is allowed to choose the granularity of this knowledge (resource and

associations) discovery process, ranging from only taking the whole file as a single resource to

analyzing the detailed contents of the file.
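The classification of an extracted triple against an ontological pattern can be sketched as follows. This Python fragment is a simplified illustration: the tables INSTANCE_OF and PATTERNS and the function classify_triple are hypothetical, and they assume that the subject and object have already been recognized as instances of concepts.

```python
# Hypothetical knowledge assumed for illustration: which concept each
# recognized resource belongs to, and which <s, p, o> patterns the
# domain ontologies define.
INSTANCE_OF = {"Singapore": "Organization", "ITS": "System"}
PATTERNS = {("Organization", "implements", "System")}


def classify_triple(triple):
    """Return the ontological pattern that the extracted triple instantiates, or None."""
    subject, predicate, obj = triple
    pattern = (INSTANCE_OF.get(subject), predicate, INSTANCE_OF.get(obj))
    return pattern if pattern in PATTERNS else None


matched = classify_triple(("Singapore", "implements", "ITS"))
unmatched = classify_triple(("Singapore", "visits", "ITS"))
```

Only a triple whose subject concept, predicate, and object concept all agree with a known pattern yields a resource-resource association of type property; anything else is left unclassified.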

In addition, ontology mappings may be established as correspondences that connect

concepts in different domain and application ontologies. Currently, we consider equivalence as

the only semantics for the mapping between two concepts, although richer semantics of the

mappings could be considered (64).

5.5.3 Representation

In our framework, all information, including file descriptions, the resources in the repository,

and the resource-file indexes, are represented in the Resource Description Framework (RDF),1

a W3C proposed standard. For the schema of these data (i.e., the application and domain

ontologies), we use the vocabulary language for RDF, RDF Schema (RDFS).2 The RDF model

is a semantic network, where the nodes denote the resources and the edges are properties that

represent the relations between resources. The network can also be seen as a set of statements

(triples) in the form of (subject, predicate, object). RDFS is used to define the vocabulary

(in terms of classes and properties) of the RDF data, such as rdfs:Class, rdf:Property, and

rdf:type. Table VI summarizes the RDFS vocabularies that are used to represent different types

of associations.

The use of RDF as the data model and RDFS as the ontology language in our framework

is motivated by the nature of the RDF as a Web resources description mechanism and the

fact that the PI is represented as a set of interrelated resources. In contrast, XML is not

chosen because it cannot represent semantic associations (42). Certainly, OWL (Web Ontology

Language), as built on top of RDFS, is more expressive for ontology representation. However,

the use of a slightly extended version of RDFS is adequate for representing resource-file and

resource-resource associations.

1 http://www.w3.org/RDF/

2 http://www.w3.org/TR/rdf-schema

TABLE VI

RDF PROPERTIES FOR THE REPRESENTATION OF ASSOCIATIONS.

| Relationship | RDF property | Comments |
|---|---|---|
| aggregation | rdfx:contained | rdfx is the abbreviation of the namespace where the property contained is defined. For example, <#a, rdfx:contained, #b> means that a contained b. |
| property | User-defined properties | For example, <#wise03talk, presentedBy, #xiao> means that wise03talk is connected to xiao by the association presentedBy. |
| instantiation | rdf:type | For example, <#xiao, rdf:type, #Person> means that the resource xiao is an instance of the concept Person. |
| generalization | rdfs:subClassOf | rdfs:subPropertyOf is used for property generalization. |

The extension to RDFS is as follows: we define in a namespace (abbreviated using the prefix

rdfx) a new RDF property, contained, which is used to represent the aggregation relationship.

For the representation of the instantiation and generalization relationships, we use rdf:type and

rdfs:subClassOf, respectively. The property relationship is represented naturally by an RDF

property defined in the user-defined namespace. Figure 31 gives a concrete example of an RDF

representation.
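As a minimal sketch, the four kinds of associations in Table VI can be written as plain (subject, predicate, object) tuples in Python. The first three triples are the examples given in Table VI; the rdfs:subClassOf triple and the helper name `objects` are assumptions made for illustration, and no RDF library is used.

```python
# Each association is a (subject, predicate, object) triple.
triples = {
    ("#a", "rdfx:contained", "#b"),                    # aggregation
    ("#wise03talk", "presentedBy", "#xiao"),           # user-defined property
    ("#xiao", "rdf:type", "#Person"),                  # instantiation
    ("#ConferenceTalk", "rdfs:subClassOf", "#Talk"),   # generalization (assumed example)
}


def objects(subject, predicate):
    """Return every object o such that (subject, predicate, o) is stated."""
    return sorted(o for (s, p, o) in triples if s == subject and p == predicate)


print(objects("#xiao", "rdf:type"))
```

Keeping the statements in a set reflects the view of the RDF model as a set of triples; a graph view can be derived by treating subjects and objects as nodes and predicates as edge labels.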

5.6 Semantic Data Navigation

It is critical for a Semantic Desktop to provide the user with the capability to access the

stored data in a variety of ways. The user may want to browse the information by means of the

flexible and intelligent navigation in the information space, including the base and superimposed

information. The user may also desire that certain query facilities (e.g., keyword-based searching

or certain query languages) be provided by the framework. In this section, we discuss the

navigation in the data space of a Semantic Desktop. Query processing is discussed in the next

section.

The semantic data organization in our framework enables the navigation in the PI space,

making use of useful hints (e.g., the context of a concept being browsed) so as to facilitate the

user’s understanding of data. More specifically, by taking into account the layered architec-

ture, the semantic navigation in our framework can be performed in three directions: (1) In

vertical navigation, the user follows a path across layers. Two cases are possible for this way

of navigation: top-down from the application ontologies to the stored files and bottom-up from

the stored files to the application ontologies. (2) In horizontal navigation, the user follows links

of concepts (or resources) within one layer. Typically, there are three cases of horizontal nav-

igation, corresponding to each layer: application-to-application navigation, domain-to-domain

navigation, and file-to-file navigation. (3) In temporal navigation, the user can navigate by

following references in chronological order, each being a resource for the same real world object

with a time stamp associated with it. For example, the user may want to look at different

versions of a research paper.
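The three directions of navigation can be sketched over a small labeled graph. This is only an illustration under stated assumptions: the `Node` type, the sample edges, the resource name `wise03paperdraft`, and the time stamps are hypothetical, and the layer names follow the application, domain, and resource layers of the framework.

```python
from collections import namedtuple

# layer is one of "application", "domain", or "resource"
Node = namedtuple("Node", ["name", "layer"])

paper = Node("Paper", "application")
inproceedings = Node("InProceedings", "domain")
article = Node("Article", "domain")
camera = Node("wise03papercamera", "resource")

# Hypothetical associations, treated as undirected for browsing.
edges = [
    (paper, inproceedings),     # mapping between layers
    (inproceedings, camera),    # instantiation
    (inproceedings, article),   # association within the domain layer
]


def neighbors(node):
    """Yield every node that shares an edge with the given node."""
    for a, b in edges:
        if a == node:
            yield b
        elif b == node:
            yield a


def vertical(node):
    """Neighbors in a different layer (top-down or bottom-up)."""
    return [n for n in neighbors(node) if n.layer != node.layer]


def horizontal(node):
    """Neighbors within the same layer."""
    return [n for n in neighbors(node) if n.layer == node.layer]


def temporal(versions):
    """Order (time stamp, resource name) pairs of one real-world object chronologically."""
    return [name for _, name in sorted(versions)]


print(vertical(inproceedings))
print(horizontal(inproceedings))
print(temporal([("2003-10-15", "wise03papercamera"), ("2003-09-01", "wise03paperdraft")]))
```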

All the base and superimposed information in the framework forms a directed graph, where

the vertices are the resources in the ontologies and the files stored in the PI space, and the

edges are the associations between the resources and files. We say that the three directions of

navigation together provide a three-dimensional (3D) navigation mechanism, which

can facilitate the construction of a browser. For instance, suppose the user is browsing a specific

application ontology in a visualized browser. When the user clicks on the node of a concept in the

ontology, the browser can then choose to display the instances of the concept thus selected (by

vertical navigation), the context of the concept in the domain (also by vertical navigation), and

the associated concepts in other application ontologies (by horizontal navigation). Compared

to the traditional navigation approach that is based on hierarchical directories, 3D navigation

is based on semantic associations, similarly to those that humans establish between concepts.

[Figure 31 (diagram). Application layer: Ontology for attending a conference, Ontology for picture management. Domain layer: Email Ontology, Talk Ontology, Publication Ontology, Photo Ontology. Resource layer: the resource-file index and the RDF repository, whose triples are listed below. Edge types: mapping, rdfs:subClassOf, user-defined property.]

Resource-file index:

<"c:\emails\WISE photos.eml", rdfx:identification, #wisephotomsg>

<"c:\emails\WISE photos.eml", rdfx:contains, #wise03photomyself>

<"c:\emails\WISE photos.eml", rdfx:reference, #wise03conf>

<"c:\photos\myself.jpeg", rdfx:identification, #wise03photomyself>

<"c:\talks\WISE03.ppt", rdfx:identification, #wise03talk>

<"c:\talks\WISE03.ppt", rdfx:reference, #wise03talk>

<"c:\papers\WISE03-camera.pdf", rdfx:identification, #wise03papercamera>

<"c:\papers\JoDS05.pdf", rdfx:identification, #jods05>

RDF repository:

<#wisephotomsg, rdf:type, #Email>

<#wise03photomyself, rdf:type, #Photo>

<#wise03conf, rdf:type, #Conference>

<#wise03talk, rdf:type, #ConferenceTalk>

<#wise03papercamera, rdf:type, #InProceedings>

<#jods05, rdf:type, #Article>

<#cruz, rdf:type, #Person>

<#xiao, rdf:type, #Person>

Figure 31. Data organization in the application, domain, and resource layers. All ontologies are represented in RDFS. Two application ontologies for PIAs, i.e., picture management and publication management, are constructed. Below them are four ontologies for the domains of Email, Talk, Publication, and Photo, respectively. At the bottom, the resource-file and resource-resource associations are represented as triples or in a graph.

[Figure 32 (screenshot). The browser shows the application ontology and the Email domain ontology, together with the properties of the current resource: 1. title: WISE photos; 2. attached: wise03photomyself; 3. sentOn: 12/30/2003; 4. sentBy: cruz; 5. receivedBy: xiao.]

Figure 32. The browser for PIM.

Example 5.3 Consider the scenario shown in Figure 31. The spirit of 3D navigation is demonstrated

in the browser of Figure 32. The current resource (file) that the user is browsing is an email

message (i.e., wisephotomsg) with some attached photos that were taken at WISE ’03.

The concepts that this resource belongs to are highlighted (in white) so as to show the contexts

to which they belong. All associated resources are categorized and shown on the right tabbed

pane, which provides guidance for the user in navigating the PI space. The bottom-right pane

shows the timeline of different versions of the current resource (if they exist) or all the resources

belonging to the same concept as the current resource.

5.7 Personal Information Applications

5.7.1 Motivation

Example 5.4 A PhD student majoring in Chemistry has collected quite a few publications

related to her research area and is now compiling a literature survey. The publications are

stored as PDF files in different directories. For the literature survey she looks at a group of

selected papers. For each of those papers, she would like to read some of the interesting papers

that are referenced in that paper, which have already been downloaded and stored in the local

desktop. To locate those papers, she can browse the directory hierarchy, use the search capacity

provided by the operating system (if she can remember the file names), or use desktop search

tools, such as the Google Desktop Search1 or MSN Desktop Search.2

1 http://desktop.google.com

2 http://toolbar.msn.com

As the literature survey progresses, the student becomes tired of switching between windows,

and wants to develop a bibliography management system such that the above mentioned func-

tionalities are integrated in a single interface. However, she finds it challenging to implement

such a system, which requires several components, including a database to store and retrieve

the citation relationships between pairs of publications. She asks for the help of a friend majoring

in Computer Science, who develops for her such a standalone application in Java. Now, the

student is able to browse through her publications and the citation network easily. However, she

would like to share that application with her advisor and with the other students in the project

but is not able to do that. Furthermore, she would have liked to be able to access the publications

that the other group members have discovered and stored, but cannot do that either.

When all the papers have been discovered and interrelated she would finally like to integrate

the bibliography application with an application for paper composition. The paper composition

application would gather several pieces of information such as related literature, experimental

results, and comments/corrections from the advisor. However, she discovers that the two appli-

cations do not interoperate and she has to manually “import” the information that is gathered

by the bibliography application into the paper composition application.

There are several key considerations in the design of a PIA development tool. First, end

users may either lack programming skills or be reluctant to write such programs in

the context of organizing the information in their desktop. Therefore, the PIA development

environment, if provided, should hide the programming details from the user. The second con-

sideration has to do with the flexibility and expressiveness of the designer. Even though we do

not expect to invent another programming language, there are some fundamental functionali-

ties that we need to make available, such as data access, data presentation, and business logic.

Finally, there is the need to share the information related to the same application (or task)

between two end users, as well as the need to reuse and to interoperate among existing PIAs.

Based on its semantic data organization, MOSE provides a semantic tool for end users

to develop PIAs: the PIA designer. In this chapter, we describe how we exploit the MVC

(Model-View-Controller) methodology (67) for personal information application development in the

PIA designer, which addresses the issues illustrated above. In particular, we discuss how PIAs

can be formalized as desktop services and how such services can facilitate data interoperation

and integration across semantic desktops.

5.7.2 MVC-based PIA Development

The resource explorer allows for the “global” exploration of the resources and ontologies

in a desktop. However, views need to be tailorable for the users’ diverse tasks, as we see in

Example 5.4. To this end, MOSE provides a tool, the PIA designer, whose main objective is

its flexibility.

Each PIA can work in a standalone mode, with its own application ontology, user interface,

and work flows, aiming at a specific task (e.g., bibliography management, paper composing,

or trip planning). Meanwhile, different PIAs can communicate with each other as in a P2P

network, by means of the connections (mappings) established between their application on-

tologies. In MOSE, a PIA can present two modes: development mode and execution mode.

The interfaces corresponding to these two modes are respectively the PIA designer (for the

development mode) and the PIA browser (for the execution mode), which can switch from one

to the other at any time.

The development of a PIA uses the MVC (Model-View-Controller) methodology. In par-

ticular, in the development of a PIA, the “Model” can be an application ontology that has

been composed as a view over domain ontologies; the “View” consists of one or more compo-

nents that present data in different forms such as graph, text, and list; the “Controller”, which

is the business logic of the PIA, is a set of “if-then” rules, which enable the interaction and

synchronization between different data components. The data associated with components to

be displayed are retrieved from the repositories of ontologies and instances by queries named

parameterized channels.

The specifications of a PIA, as defined by the user by means of the PIA designer, including

the model, view, and business logic, can be serialized in XML. It is called the PIA definition.

Now, the user can run a PIA in the PIA browser, which interprets and executes the PIA in

either an “online” mode (by directly switching from the designer to the browser) or an offline

mode (by loading from the PIA’s permanent serialization). The separation of the declarative

specifications from the interpretative execution greatly benefits the communication between

semantic desktops in terms of PIA interoperation, as we will see in the following sections.

5.7.3 Implementation

We have implemented a prototype of the PIA designer in Java, as shown in Figure 33.

We describe next the three stages of application development, which follow the three basic

elements of an application (model, view, and controller).

Figure 33. The PIA designer.

Modeling. In the first stage, the user loads the application ontology from the ontology

repository, which represents the model underlying the PIA to be designed; it will be graphically

shown in the Data Model pane. The application ontology is mapped to the domain ontologies,

under which the resources representing personal information are classified. Actually, the appli-

cation ontology is constructed as a view over the domain ontologies in a “global as view” (GaV)

approach (70). This mapping process should not require the users’ programming expertise, but

only their awareness of the task and their knowledge of the domain.

Visualization. The second stage involves the design of the layout of the PIA, with one

or more visual components, each of which can be associated with a stream of data for its

presentation. The user drags the desired visual components from the Visual Component pane to

the PIA Browser Workspace pane. Examples of such components include TextPane, List, Table,

Graph, and File. The associated data can be resources, strings, files, or any other instances

of the ontologies; they are retrieved by queries, called channels (introduced in (98)), on the

application ontology. Some components, such as Button, Label, TextInput, and MessageBox, are

used to facilitate the interaction between the user and the PIA browser. A special component

called Services is used for desktop service composition, as discussed in Section 5.8.

Controller and parameterized channels. In the final stage, the controller (or business

logic) of a PIA is specified so as to realize rich interactions between the data and their views,

and to synchronize several visualizations. These controllers manage all possible updates of the

model and handle the events from the user interface, using “if-then” rules (more sophisticated

controls will be considered in future work) of the following form:

if Component_1.event_1(x_1) and ... and Component_n.event_n(x_n)

then Component_1.action_1(y_1); ...; Component_m.action_m(y_m);

endif

where $x_i$, $i \in [1..n]$, are parameters passed from the events, and $y_i$, $i \in [1..m]$, are the channels

that result in the actions. It often happens that the response of a component to some event needs

to take $x_i$ as a parameter to execute $y_i$, especially when updating data that depends on

$x_i$ in a visual component. For this purpose, we introduce the concept of parameterized channels,

which are channels whose contents are determined by parameters at runtime. In

MOSE, where channels are queries over ontologies, the parameter of a channel can be bound to

a variable or a constant in the query. By means of parameterized channels, an event started from

a component can pass any values to another component, thus enabling interactions between

different components.

Example 5.5 As shown in Figure 33, at the top left corner, the user loads the application

ontology (for publications), to develop a PIA for bibliography management. The application’s

user interface uses a Graph for displaying the citation network of papers, a TextPane for the

paper’s details, a List for the paper’s authors, and a TextPane for the author’s details.

To associate data with their proper visualization, the user defines the following channels,

in the syntax of RDQL (RDF Data Query Language), which has an SQL-like grammar (62).

Each channel is a string, which can then be fed into an RDQL interpreter (e.g.,

provided by the Jena API) for execution.

1. ch_1(): “SELECT ?a, ?b WHERE (?a, cites, ?b)”

2. ch_2(x): “SELECT ?a, ?b, ?c, ?d WHERE (” + x + “, title, ?a), (” + x + “, writtenBy, ?b), (” + x + “, year, ?c), (” + x + “, citedAs, ?d)”

3. ch_3(x): “SELECT ?a WHERE (” + x + “, writtenBy, ?b), (?b, name, ?a)”

4. ch_4(x): “SELECT ?a, ?b WHERE (” + x + “, institute, ?a), (” + x + “, email, ?b)”

As an example of a parameterized channel, the second query, ch_2(x), returns the title, author,

year, and citation entry of a publication, which is bound to parameter x.

The data computed by executing a channel will present different forms depending on what

visual component is used to visualize this data. For example, a Graph shows the data represented

as a graph, where nodes are resources and edges are their associations. To construct such a

graph, the nodes representing the same resource will be merged into a single one.

The following rules are defined to specify the controller, in which the first rule has no pre-

conditions, thus being triggered at the very beginning of the PIA’s run.

1. PaperGraph.update(ch_1())

2. if PaperGraph.isSelected(x) then PaperDetail.update(ch_2(x))

3. if PaperGraph.isSelected(x) then AuthorList.update(ch_3(x))

4. if AuthorList.isSelected(x) then AuthorDetail.update(ch_4(x))
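The channels and rules of Example 5.5 can be sketched in Python as follows. This is a hedged illustration, not the Java prototype: the `Component` class, the `start` and `select` functions, and the rule table are hypothetical names, and a component only records the query string it is bound to instead of executing it with an RDQL interpreter.

```python
def ch_1():
    return "SELECT ?a, ?b WHERE (?a, cites, ?b)"


def ch_2(x):
    return (f"SELECT ?a, ?b, ?c, ?d WHERE ({x}, title, ?a), ({x}, writtenBy, ?b), "
            f"({x}, year, ?c), ({x}, citedAs, ?d)")


def ch_3(x):
    return f"SELECT ?a WHERE ({x}, writtenBy, ?b), (?b, name, ?a)"


def ch_4(x):
    return f"SELECT ?a, ?b WHERE ({x}, institute, ?a), ({x}, email, ?b)"


class Component:
    """A visual component; update() would normally run the query and render the result."""

    def __init__(self, name):
        self.name = name
        self.query = None  # the channel string most recently bound to this component

    def update(self, query):
        self.query = query


paper_graph = Component("PaperGraph")
paper_detail = Component("PaperDetail")
author_list = Component("AuthorList")
author_detail = Component("AuthorDetail")

# Rules 2 to 4: "if source.isSelected(x) then target.update(channel(x))".
rules = [
    (paper_graph, paper_detail, ch_2),
    (paper_graph, author_list, ch_3),
    (author_list, author_detail, ch_4),
]


def start():
    """Rule 1 has no precondition, so it fires when the PIA starts."""
    paper_graph.update(ch_1())


def select(source, x):
    """Fire every rule whose event is a selection of x in the source component."""
    for rule_source, target, channel in rules:
        if rule_source is source:
            target.update(channel(x))


start()
select(paper_graph, "#jods05")
print(paper_detail.query)
print(author_list.query)
```

Selecting a paper in the graph rebinds both the detail pane and the author list, which is how a single event synchronizes several views through parameterized channels.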

5.8 Services-based Desktop Interoperation

As mentioned before, two PIAs can communicate in a P2P fashion based on the application

ontology mappings established between them. It is required in this case that the two PIAs are

designed for a similar task, for which they have their application ontologies partially or fully

overlapping. We say that this way of PIA interoperation (or integration) is on the semantic

level and is oriented to data models. Previous work has discussed such P2P ontology-based

query processing (26; 114; 115). In this section we discuss another type of PIA interoperation,

which is realized by means of desktop services, thus called service-oriented interoperation.

The notion of desktop service was first introduced into the vision of semantic desktop by

the Gnowsis system (103). However, to the best of our knowledge, there has been no definition or

formalization of desktop services. Next we give our own definition of what constitutes a desktop

service in terms of parameterized channels, and describe how this service-based mechanism

facilitates the data interoperation and integration in our semantic desktop vision. We assume

PIA-based desktop services in our discussion, and use both terms, PIA and desktop service,

interchangeably.

In general, a service (e.g., Web service1) must have its interface (i.e., input and output)

defined, while keeping the implementation of its operation hidden from the service consumer.

Intuitively, a PIA in MOSE consists of a set of visual components bound to parameterized

channels. In this sense, we can see a channel as the minimal unit of service, taking the para-

meters as input and its resulting data as output. Starting from this point, we are able to give

a definition of service based on the definition of parameterized channel.

Formally speaking, a parameterized channel q is a triple 〈M, I, O〉, where M is the under-

lying model (i.e., application ontology), I is a set of parameters (i.e., input), O is the set of

tuples resulting from the execution of the channel (i.e., output). A desktop service s is a 5-tuple

〈Q, I, O, V, C〉, where

• $Q = \{q_1, \ldots, q_m, s_1, \ldots, s_n\}$ is a set of channels $q_1, \ldots, q_m$ or services $s_1, \ldots, s_n$, where $m \geq 0$, $n \geq 0$, $m + n \geq 1$, and $s_i \neq s$, $i \in [1..n]$;

• $I \subseteq I_1 \cup \ldots \cup I_m \cup I'_1 \cup \ldots \cup I'_n$ is the input, where $I_i$ is the input of $q_i$, $i \in [1..m]$, and $I'_i$ is the input of $s_i$, $i \in [1..n]$.

• $O \subseteq O_1 \cup \ldots \cup O_m \cup O'_1 \cup \ldots \cup O'_n$ is the output, where $O_i$ is the output of $q_i$, $i \in [1..m]$, and $O'_i$ is the output of $s_i$, $i \in [1..n]$.

• $V = \{v_1, \ldots, v_l\}$ is the set of visual components, with $v_i$ being the component of $o_i$, where $o_i \in O$, $i \in [1..l]$.

• $C = \{c_1, \ldots, c_k\}$ is a set of rules representing the control flows among the components.

1 http://www.w3.org/2002/ws/

[Figure 34 (diagram). Panel (a): remote execution of services. Panel (b): local execution of services. Each panel shows four semantic desktops (SemDesk 1 to SemDesk 4) with application ontologies AO-1 to AO-4, services PIA-1 to PIA-4, and the PIA browser. Legend: request, response, PIA definition, PIA implementation.]

Figure 34. Desktop services composition and execution.

The above recursive definition, based on the units of channels, allows for a flexible composition

of desktop services. Besides its self-defined channels $q_i$, $i \in [1..m]$, a PIA can reuse any

services $s_i$, $i \in [1..n]$, and embed them in itself, by establishing which channel outputs $o_j$ of $s_i$ are

shown in which view $v_j$, $j \in [1..l]$. Then, the controller $C$ consisting of if-then rules is used to

specify the composition (control and data flows) among these channels or services in the PIA.
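The recursive definition can be mirrored by two small Python data classes. This is a sketch under assumptions: the class and field names, the ontology name `AO-1`, and the sample queries are hypothetical, the output O is left implicit as whatever executing a query returns, and views and rules are kept as plain strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple, Union


@dataclass
class Channel:
    """A parameterized channel <M, I, O>; O is the result of executing the query."""
    model: str                 # M: the underlying application ontology
    inputs: Tuple[str, ...]    # I: the parameter names
    query: Callable[..., str]  # builds the query string from the parameters


@dataclass
class DesktopService:
    """A desktop service <Q, I, O, V, C> built from channels and other services."""
    name: str
    members: List[Union[Channel, "DesktopService"]]  # Q
    views: List[str] = field(default_factory=list)   # V
    rules: List[str] = field(default_factory=list)   # C

    def __post_init__(self):
        # Enforces m + n >= 1: a service needs at least one channel or service.
        if not self.members:
            raise ValueError("a service needs at least one channel or service")

    def input_names(self) -> Set[str]:
        """Union of the members' inputs, i.e. the largest I the definition allows."""
        names: Set[str] = set()
        for member in self.members:
            if isinstance(member, Channel):
                names.update(member.inputs)
            else:
                names.update(member.input_names())
        return names


citations = Channel("AO-1", (), lambda: "SELECT ?a, ?b WHERE (?a, cites, ?b)")
details = Channel("AO-1", ("x",), lambda x: f"SELECT ?a WHERE ({x}, title, ?a)")
bibliography = DesktopService("bibliography", [citations, details], views=["Graph", "TextPane"])
composition = DesktopService("composition", [bibliography])  # reuses an existing service
print(composition.input_names())
```

Because a service is constructed from already existing members, it cannot list itself among them at construction time, which matches the requirement that no $s_i$ equals $s$.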

Because of space limitations, we do not elaborate on the different types of service composition

(e.g., “sequential” and “parallel” flows) (96). Instead, we describe next how the service-oriented

inter-desktop communication is implemented, by means of service composition and execution,

in the two cases that are depicted in Figure 34.

The first case, as shown in Figure 34(a), is called remote execution of desktop services. In the

example, there are four services (PIA-1 to PIA-4), with their respective application ontologies

(AO-1 to AO-4). Suppose that PIA-4 is the starting point of the service execution, where the

user interacts with the PIA browser. All requests for both the data and the execution of other

services (defined and implemented in other desktops, but composed by the current service) are

driven by events from such interactions. Whenever a nested remote service (e.g., PIA-2 or PIA-

3) is triggered by the current service, a request for execution will be sent to the remote desktop

(e.g., SemDesk 2 or SemDesk 3), where the remote service will be executed. As a response to

the request, the remote service returns its execution results to the current service.

While the first case is similar to what happens with Web services, the second case of desktop

service execution (called local execution, as shown in Figure 34(b)) is quite different. In partic-

ular, whenever a service nested in the current service is activated, it will be locally interpreted

and executed by the PIA browser in the current desktop. However, the local execution of a

remote service (e.g., PIA-2) needs permission to access relevant data (e.g., AO-2) from a remote

desktop. If permission is granted, the data is then duplicated in the local desktop via a secure data transfer.

We note that the essential difference between the two cases of desktop service execution is

related to a tradeoff between control permission and data access. This flexibility is important

in a semantic desktop setting. Depending on their available resources, some desktops may be

reluctant to take a heavy workload while some others may be concerned with the privacy of

their data. Therefore, a desktop (when acting as a server) can choose whether to contribute

its computing power or share its data.

5.9 Semantic Query Processing

Unlike navigation, which is an interactive process, query processing is performed without

further intervention from the user. To retrieve relevant data from the PI space, the user’s

request may be posed as a sequence of keywords or as a query formulated in a certain query

language.

The keyword-based search matches the input keywords and the vector of words in the

candidate documents, calculates the similarity for each of the matches, and returns to the

user the results after ranking them (102). The results of a search are usually evaluated using

statistical criteria such as precision, recall, or a combination of them. The shortcoming of

keyword-based search is that the semantic associations between relevant data are not considered.

In contrast, query languages can provide a semantically richer access interface, thus facilitating

data retrieval and improving the accuracy of the answers. However, a query is usually

evaluated by exact matching between the query and the data, which lowers the recall of the

answers: relevant data that does not match exactly is not retrieved.
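The keyword-based ranking and the precision and recall criteria mentioned above can be sketched as follows. Cosine similarity over raw word counts is an assumption chosen for illustration and is not necessarily the measure used in (102); the document names and contents are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(keywords, documents):
    """Rank documents by similarity to the keywords; drop documents with no match."""
    query = Counter(w.lower() for w in keywords)
    scored = [(cosine(query, Counter(text.lower().split())), name)
              for name, text in documents.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


def precision_recall(returned, relevant):
    """Precision and recall of a result list against the set of relevant documents."""
    hits = len(set(returned) & set(relevant))
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


docs = {
    "WISE photos.eml": "photos from the WISE conference",
    "JoDS05.pdf": "ontology based data integration",
    "WISE03.ppt": "WISE conference talk on data integration",
}
results = search(["WISE", "photos"], docs)
print(results)
print(precision_recall(results, {"WISE photos.eml"}))
```

In this toy run the talk is returned only because it shares the word WISE with the query, which illustrates how keyword matching ignores the semantic associations between resources.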

Since the two approaches complement each other, it is desirable to provide both of them.

In this section, however, we mainly focus on query processing in our framework. We choose to

express the queries in RDQL (62); they can query both the resources and their associations.

We discuss how to process a query submitted by the user in two cases: within a PIA and across

different PIAs.

5.9.1 Query Processing in a PIA

In our framework, the user query is formulated in RDQL (RDF Data Query Language),

which uses an SQL-like syntax (62). To reduce the user’s burden, a graphical interface can be used

to facilitate query formulation. For simplicity, we use a subset of RDQL that we

call conjunctive RDQL (c-RDQL), which can be expressed as a conjunctive formula: $ans(\vec{X})$ :-

$p_1(\vec{X}_1), \ldots, p_n(\vec{X}_n)$, where $\vec{X}_i = (x_i, x'_i)$ and $p_i$ is an RDF property of $x_i$ having the value $x'_i$.

In our framework, an application ontology is constructed over one or more domain ontologies,

and the files in the PI space are formalized as instances of the concepts in the domain ontologies.

If we consider the application ontology as the global ontology (since the user query is posed

on it), the whole system can be seen as a GaV data integration system (70). Therefore query

processing in a single PIA is performed as in a GaV system. In particular, when the user

poses a query (in RDQL) over the application ontology, the RDQL query is then rewritten into

a new RDQL query in terms of the domain ontologies, based on the mappings between the

global ontology and domain ontologies. By executing the rewritten query on the corresponding

domain ontologies, resources (files) that match the query are then returned as answers to the

query.

There are a number of algorithms for query rewriting in relational or XML data integration

systems (58). In a GaV-based integration system, query processing is performed using an

“unfolding” strategy (70). More specifically, for rewriting a query (e.g., a conjunctive query) that is

posed on the global schema or ontology, we simply substitute the predicates in the body of the

query with the corresponding view definitions. In our framework, where the mappings between

the application ontology and the domain ontologies are expressed as RDF class or property

correspondences, the algorithm for query rewriting is similar to this strategy.

By assuming that there are no integrity constraints over the application ontologies and the

user queries are formulated in c-RDQL, we give the formal description of our query rewriting

algorithm in a single PIA, which we call ADRewriting (for rewriting from Application

ontologies to Domain ontologies), as follows. We note that we do not consider the namespaces

of ontologies for simplicity of the description.

Example 5.6 Suppose the user wants to list all conference papers with their authors and journal

version, using the query q1: ans(x, y, z) :- writtenBy(x, y), extendedVersion(x, z), which

is posed on the application ontology of publication management. For the variables (x, y, z),

we get the classes that they refer to as (Paper, Person, Journal), as indicated by Line 3. By

looking into M, we find the corresponding class sequence as (Publication:InProceedings, Publica-

tion:Person, Publication:Article), where the names before the colons are domain ontology names.

From Lines 5 to 10, we compute the predicates in the body of q2 as follows.

Algorithm ADRewriting

Input: 1. $q_1$ over the application ontology $G$: $ans(\vec{X})$ :- $p_1(\vec{X}_1), \ldots, p_m(\vec{X}_m)$;
       2. $M$: the mapping table between $G$ and domain ontologies $S_1, \ldots, S_n$.

Output: $q_2$: a c-RDQL query over $S_1, \ldots, S_n$.

1  $head_{q_2} = ans(\vec{X})$; $body_{q_2} = null$;
2  For $i = 1$ to $m$ do
3      $(c_1, c_2)$ = names of the classes referred to by $(x_1, x_2)$, for $\vec{X}_i = (x_1, x_2)$;
4      Search $M$ to find $(d_1, d_2)$ such that $\{(c_1, d_1), (c_2, d_2)\}$ are two class correspondences in $M$;
5      Traverse $S_1, \ldots, S_n$ by following all kinds of associations, to find the vertices $v_1, \ldots, v_k$ connecting $d_1$ to $d_2$ (each $v_j$ is given a fresh variable $x'_j$);
6      If $k = 0$ then add $p(x_1, x_2)$ (or $p(x_2, x_1)$) to $body_{q_2}$, if there exists $p$ connecting $d_1$ to $d_2$ (or $d_2$ to $d_1$);
7      Else for $j = 1$ to $k - 1$ do
8              Add $p(x'_j, x'_{j+1})$ (or $p(x'_{j+1}, x'_j)$) to $body_{q_2}$, if $p$ is not a mapping and connects $v_j$ to $v_{j+1}$ (or $v_{j+1}$ to $v_j$);
9          Add $p(x_1, x'_1)$ (or $p(x'_1, x_1)$) to $body_{q_2}$, if $p$ is not a mapping and connects $d_1$ to $v_1$ (or $v_1$ to $d_1$);
10         Add $p(x'_k, x_2)$ (or $p(x_2, x'_k)$) to $body_{q_2}$, if $p$ is not a mapping and connects $v_k$ to $d_2$ (or $d_2$ to $v_k$);
11 $q_2 = head_{q_2}$ :- $body_{q_2}$;

Figure 35. The ADRewriting algorithm.

q2: ans(x, y, z) :- editor(x, y), extends(z, x)

By executing q2 over the RDF repository as shown in Figure 31, we get the answer:

{(#wise03papercamera, #xiao, #jods05), (#wise03papercamera, #cruz, #jods05)}.
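The rewriting and evaluation in Example 5.6 can be reproduced with a short Python sketch. It is a simplification and not the ADRewriting algorithm itself: it assumes direct property correspondences (with an optional argument swap) instead of traversing the domain ontologies, all query terms are variables, and the three property triples are inferred from the answer stated in the example.

```python
# Assumed correspondences: application property -> (domain property, swap arguments?)
MAPPING = {
    "writtenBy": ("editor", False),
    "extendedVersion": ("extends", True),
}

# Property triples assumed to be in the RDF repository (inferred from the stated answer).
TRIPLES = [
    ("#wise03papercamera", "editor", "#xiao"),
    ("#wise03papercamera", "editor", "#cruz"),
    ("#jods05", "extends", "#wise03papercamera"),
]


def unfold(body):
    """Replace each application-level atom by its domain-level counterpart."""
    rewritten = []
    for prop, (a, b) in body:
        target, swap = MAPPING[prop]
        rewritten.append((target, (b, a) if swap else (a, b)))
    return rewritten


def evaluate(head, body, triples):
    """Evaluate a conjunctive query by extending variable bindings atom by atom."""
    bindings = [{}]
    for prop, (a, b) in body:
        extended = []
        for env in bindings:
            for s, p, o in triples:
                if p != prop or (a == b and s != o):
                    continue
                # A variable that is already bound must agree with the triple.
                if env.get(a, s) == s and env.get(b, o) == o:
                    extended.append({**env, a: s, b: o})
        bindings = extended
    return {tuple(env[v] for v in head) for env in bindings}


q1 = [("writtenBy", ("x", "y")), ("extendedVersion", ("x", "z"))]
q2 = unfold(q1)
answers = evaluate(("x", "y", "z"), q2, TRIPLES)
print(q2)
print(sorted(answers))
```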

5.9.2 A2A Query Processing

Application to application (A2A) query processing occurs when an application is attempting

to retrieve relevant data from another semantically related application, to answer a query. If

the PIAs are considered as connected peers (i.e., service providers for certain data access), the

A2A query processing is similar to that in peer-to-peer (P2P) systems (40; 59). Whether the

PIAs exist in a single desktop or are physically distributed makes no difference to the A2A

query processing.

A2A query processing consists of two steps of query rewriting. First, we rewrite the original

query q, which is posed on the application ontology G1, to a query q′ on the other application

ontology G2, according to the mappings between G1 and G2. Then, q′ is rewritten to a query

q′′ on the domain ontologies, to which G2 is mapped. Answers are obtained by executing

q′′ on the RDF repository. The second query rewriting is exactly the one described by the

algorithm ADRewriting, whereas the first rewriting is slightly different from ADRewriting.

In particular, unlike the total mapping from an application ontology to the domain ontologies,

some of the concepts in G1 may not be mapped to those in G2. Therefore, the answers returned

by q′′ may contain null values or Skolem functions for the unmapped concepts or properties.
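The two rewriting steps can be sketched as follows, under strong simplifying assumptions: a query is a list of (predicate, subject variable, object variable) atoms, each mapping is a plain renaming of predicates, and variables that are bound only by unmapped atoms are reported as nulls. All predicate names and both mapping tables are hypothetical, and Skolem functions are not modeled.

```python
# Hypothetical mappings; every name below is an illustrative assumption.
G1_TO_G2 = {"author": "writer", "title": "name"}  # partial: "venue" is unmapped
G2_TO_DOMAIN = {"writer": "dom:creator", "name": "dom:title"}  # total on G2


def rewrite(query, mapping):
    """Rename each atom's predicate via the mapping; return (kept atoms, dropped atoms)."""
    kept, dropped = [], []
    for pred, subj, obj in query:
        if pred in mapping:
            kept.append((mapping[pred], subj, obj))
        else:
            dropped.append((pred, subj, obj))
    return kept, dropped


def a2a_rewrite(query):
    """Two-step rewriting q -> q' -> q''; also list variables that can only be null."""
    q_prime, dropped = rewrite(query, G1_TO_G2)
    q_double_prime, _ = rewrite(q_prime, G2_TO_DOMAIN)
    bound = {v for _, s, o in q_double_prime for v in (s, o)}
    null_vars = sorted({v for _, s, o in dropped for v in (s, o)} - bound)
    return q_double_prime, null_vars


q = [("author", "p", "a"), ("title", "p", "t"), ("venue", "p", "v")]
q_dd, nulls = a2a_rewrite(q)
```

Here the atom over venue has no counterpart in G2, so the variable v cannot be bound by q′′ and would be returned as a null value.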

The A2A mappings can be derived by composing the mappings between G1 and the do-

main ontologies, inter-domain mappings, and those between G2 and the domain ontologies. To

evaluate both query rewriting processes, we need to check the equivalence (or containment)

between a query and its rewriting. A correct query rewriting is the one that is equivalent to

(or maximally contained in) the query. These two issues (reasoning on mappings (14; 87) and

reasoning on queries (28; 82)) have been extensively studied and are beyond the scope of this

chapter.

5.10 Summary

In this chapter, we present our design of a PIM system. We propose a layered ontology-

based framework, which aims to provide a semantics-rich environment for personal information

organization and manipulation. The multiple ontologies existing in different layers of the archi-

tecture explicitly support the data semantics. Furthermore, the decoupling of the domain layer

and the application layer enhances the flexibility and reusability of the framework. Specif-

ically, we discuss in detail the semantic-enriched data organization, including the use of file

descriptions and domain ontologies as annotations, and the construction of resource-file and

resource-resource associations. We also introduce the idea of 3D navigation, which is used in a

desktop browser.

We also describe an MVC-based approach for personal information application (PIA)

development in MOSE. Based on that, we formalize the concept of desktop service, building

on the notion of parameterized channel. Furthermore, we discuss how desktop services can

facilitate data interoperation and integration across distributed semantic desktops.

Finally, we discuss query processing in our framework in two cases: within a single personal

information application, PIA, and between two PIAs, using application to application (A2A)

communication. A formal query rewriting algorithm is presented for the single PIA case.

In the future, we will continue the study and implementation of our framework. Much of

the success of PIM systems depends on the successful automation of the different

mechanisms that are needed. In particular, we will look further into the automation of the

conceptualization of full-text files and that of matching resources to ontological concepts. Also,

we will elaborate on the idea of 3D navigation both by studying a model for temporal navigation

and by carrying out user studies. The study of A2A communication, including data exchange,

collaboration, and query processing will also be continued. While RDQL queries are expressive,

they may not be suitable for most users. We are therefore exploring visual queries that can

express a class of RDQL queries “appropriate” for the semantic desktop.

While the envisioned semantic desktop can be seen as a miniature of the prospective semantic

web, it has its particular features as well as challenges, such as automatic classification of

personal information into ontologies, context-aware information search, and flexible tools for

data manipulation and application development. In the future, we will work along the following

two directions: (1) We would like to make the outlined functionality accessible to most end users,

for example, by allowing natural language specifications to automatically formulate channels. In

this context, the previous work on conversion of natural language questions to formal queries is

of great interest (72). (2) We will also work on mechanisms for defining, publishing, discovering,

and composing desktop services so as to extend their current capabilities. Our goal is to provide

a semantic platform, where Web services and desktop services can be semantically integrated

in a seamless way, so as to achieve data integration and application interoperability across

semantic desktops.

CHAPTER 6

GEOSPATIAL DATA MANAGEMENT IN E-GOVERNMENT

6.1 Introduction

It is the objective of eGovernment to increase the cooperation among government orga-

nizations so as to enable effective overall assistance to citizens. To this end, achieving data

interoperability is a major objective (73). The advent of XML on one hand, and the emergence

of metadata standards on the other hand, will play an important role in achieving syntactic,

schematic (or structural), and semantic interoperability (16). However, years of autonomous

and uncoordinated development of classification schemes by government organizations pose

enormous challenges in achieving cooperation in areas such as land use planning, healthcare,

transportation, and social services.

In this chapter, our focus is on data interoperability of distributed geospatial data. To

illustrate the reach of our approach we focus on examples that are derived from land use

applications. The heterogeneity of data in such applications is extreme in that each county

and each municipality may have a different model for its databases (resulting in schematic

heterogeneity) and/or a different classification scheme for its land use data (resulting in

semantic heterogeneity). We have worked with the Wisconsin local government within the

scope of WLIS (Wisconsin Land Information System). In the state of Wisconsin, there are

hundreds of different land use data schemas and classifications associated with the land use

data sources of the different counties or municipalities, henceforth called local data sources,

therefore hindering the cooperation among the local governments to achieve comprehensive

land use planning across the borders of the different jurisdictions (112).

We propose an ontology-based approach to enable integration and interoperability of the

local data sources, which reconciles both schematic and semantic heterogeneities. An ontology

is a formal, explicit specification of a shared conceptualization; it can be either an axiomatized

set of concepts and relationship types or a taxonomy of entities (55). We call the first kind

of ontologies schema-like ontologies, since they can be associated with various constraints and

are allowed to have instances. In comparison, ontologies of the second kind usually include

only the subconcept (or subclass) relationships between two entities, and are called taxonomy-like

ontologies in our discussion. In our approach, both ontologies co-exist.

The ontologies that we use to represent the structure of the local data sources, which we

call local ontologies, belong to the first type and can be obtained from the source schemas

through a schema transformation process. The second type of ontologies are the land use

ontologies, which are part of the local ontologies; they represent land use taxonomies, which

are used to classify land parcels in the local data sources according to their usage (for example,

agricultural, commercial, or residential). In addition, our approach uses a global ontology that

models the domain structure of the integration task and acts as a mediator among the local

sources. Similarly to the local ontologies, the global ontology contains a global land use ontology

that describes the land usage domain.

The key to our approach lies in establishing mappings between the concepts of the global

ontology and the concepts of the local ontologies. The process of establishing such mappings is

called alignment. When such mappings have been established, we say that the two ontologies are

aligned or matched. Using those mappings, a single query can then be expressed in terms of the

concepts of the global ontology (or of a local ontology) and be automatically rewritten and posed

against the other ontologies. Therefore, a single query can retrieve data from heterogeneous data

sources, thus allowing for land use planning across as many jurisdictions as needed, provided

that the corresponding ontologies have been aligned.

Whereas the local data sources are expressed in XML with a DTD schema, we express all

ontologies using RDF,¹ which is at the core of several ontology languages such as OWL² and

DAML+OIL.³ XML Schema (often seen as an enhanced DTD-like language with a well-defined

data typing system) is a semantic markup language for web data. The database-compatible

types that are supported by XML Schema provide a way to model data organized hierarchically.

However, there are no explicit constructs for defining classes, properties, and relationships

between classes in XML Schema; therefore, ambiguities may arise when determining the

relationship between two XML elements. For instance, the relationship (ownedBy) between an element

LandParcel and its child element Owner is implicitly indicated by their nesting relationship.

¹ http://www.w3.org/RDF/

² http://www.w3.org/2004/OWL/

³ http://www.daml.org/

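[Figure 36 panels: a) land-centric schema, in which LandParcel (with Land_id) nests Owner (with Owner_id); b) owner-centric schema, in which Owner (with Owner_id) nests LandParcel (with Land_id); c) conceptual schema, in which LandParcel and Owner are related by ownedBy.]

Figure 36. An example of XML schematic heterogeneity.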

The example in Figure 36 illustrates the fact that two different XML data schemas can

represent the same conceptualization, namely a many-to-many relationship between land parcels

and their owners, thus resulting in schematic heterogeneity. Specifically, the land-centric schema

has a LandParcel element containing the Owner child element. In the owner-centric schema,

the LandParcel element is nested as a child element under Owner. In contrast, the conceptual

schema in the same picture explicitly represents the underlying semantics of both XML schemas.

Because there is no nesting involved, we say that the conceptual schema is structurally flat.

Using that conceptual schema facilitates both the alignment and querying processes, as no

consideration needs to be given to the structure of the source (4; 39).
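To make the point about schematic heterogeneity concrete, the sketch below reads two toy documents, one land-centric and one owner-centric, and extracts the same flat set of ownedBy pairs from both. The element names follow Figure 36; the Root wrapper element, the identifiers, and the function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Two toy documents (assumed data) following the land-centric and owner-centric schemas.
LAND_CENTRIC = """
<Root>
  <LandParcel><Land_id>L1</Land_id><Owner><Owner_id>O1</Owner_id></Owner></LandParcel>
  <LandParcel><Land_id>L2</Land_id><Owner><Owner_id>O1</Owner_id></Owner></LandParcel>
</Root>
"""

OWNER_CENTRIC = """
<Root>
  <Owner><Owner_id>O1</Owner_id>
    <LandParcel><Land_id>L1</Land_id></LandParcel>
    <LandParcel><Land_id>L2</Land_id></LandParcel>
  </Owner>
</Root>
"""


def owned_by_from_land_centric(xml_text):
    """Collect (land id, owner id) pairs when Owner is nested under LandParcel."""
    root = ET.fromstring(xml_text.strip())
    return {
        (parcel.findtext("Land_id"), owner.findtext("Owner_id"))
        for parcel in root.iter("LandParcel")
        for owner in parcel.findall("Owner")
    }


def owned_by_from_owner_centric(xml_text):
    """Collect (land id, owner id) pairs when LandParcel is nested under Owner."""
    root = ET.fromstring(xml_text.strip())
    return {
        (parcel.findtext("Land_id"), owner.findtext("Owner_id"))
        for owner in root.iter("Owner")
        for parcel in owner.findall("LandParcel")
    }


land_pairs = owned_by_from_land_centric(LAND_CENTRIC)
owner_pairs = owned_by_from_owner_centric(OWNER_CENTRIC)
```

Both functions need to know the nesting direction of their source, whereas a consumer of the resulting pairs does not; this is the sense in which the conceptual schema is structurally flat.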

In our approach, the global ontology provides an integrated view of the source schemas as

well as a uniform query interface. Mappings consisting of class or property correspondences

are then established between each local ontology and the global ontology. Given that both

the global and local ontologies contain two components (i.e., the schema and the taxonomy),

the mappings between the global ontology and a local ontology are accordingly subdivided into

two components: the schema-level mappings between the schemas of both ontologies and the

instance-level mappings between the land use taxonomies of both ontologies. Such mappings

can be respectively used to reconcile schematic and semantic heterogeneities.

Query processing in our approach involves query rewriting and can be performed in two

ways: global-to-local query processing and local-to-local query processing, based on the double

role played by the global ontology. Using its first role, we rewrite a query posed on the global

ontology into subqueries over the local sources—the global ontology acts as a uniform query

interface of the integration system. Using its second role, we translate a query posed on an XML

geospatial source to a query on any other XML geospatial source, taking the global ontology

as a mediator for the query rewriting and hence for the interoperation between local sources.

In addition to the ontology-based architecture for data integration and interoperation, we

make the following contributions in this chapter:

• We focus on the alignment process of the local land use ontology with the global land use

ontology and propose an ontology alignment algorithm based on a set of deduction rules,

which can be performed automatically (that is, without the intervention of users) when

certain pre-conditions are established.

• We propose a sound query rewriting algorithm based on the bidirectionality and compo-

sition of the mappings. The mappings are stored in a file using the RDF/XML syntax,

called the agreement file. The rewriting algorithm is used for both global-to-local and

local-to-local query processing. The algorithm can compute a contained rewriting of a

query in both cases. Query containment ensures that the answers retrieved by executing

the rewriting are a subset of the answers to the original query, thus guaranteeing

precise query answering across distributed data sources (70).
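The containment property can be illustrated with a small check, assuming that queries are modeled as Python predicates over parcel records. Containment is a statement about all databases, so testing on sample instances can only refute it, never prove it; the queries, records, and function names below are hypothetical.

```python
def answers(query, database):
    """Return the ids of the records selected by a query (a predicate on records)."""
    return {rec["id"] for rec in database if query(rec)}


def refutes_containment(rewriting, original, databases):
    """Return the first sample database where the rewriting yields an answer
    that the original query does not, or None if no such database is found."""
    for db in databases:
        if not answers(rewriting, db) <= answers(original, db):
            return db
    return None


# Assumed toy queries: the original asks for all agricultural parcels,
# whereas the rewriting only reaches parcels coded as cropland.
def original(rec):
    return rec["use"] in {"cropland", "pasture"}


def rewriting(rec):
    return rec["use"] == "cropland"


samples = [
    [{"id": 1, "use": "cropland"}, {"id": 2, "use": "pasture"}],
    [{"id": 3, "use": "residential"}],
]
counterexample = refutes_containment(rewriting, original, samples)
```

In this example the rewriting is contained in the original query but not equivalent to it, since it misses the pasture parcel in the first sample.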

The rest of the chapter is organized as follows. We summarize related work in Section 6.2.

The data heterogeneity issues that are associated with WLIS are presented in Section 6.3, and

the architecture of our approach in Section 6.4. In Section 6.5, we discuss ontology alignment

and focus on an automatic algorithm. The two cases of query processing are illustrated in

Section 6.6, where we also describe in detail the query rewriting algorithm. We conclude in

Section 6.7.

6.2 Related Work

Data integration and interoperability are critical for the implementation of eGovernment,

especially when access to distributed data is needed, such as in GIS (53), cross-jurisdictional

criminal investigation (78), coastal management (71), electronic elections (35), or air quality

management (61). In this chapter, we look at the semantic integration of heterogeneous geospa-

tial data using conceptual data models (70; 87; 111). In this scope, we discuss the issues of

ontology alignment and query processing.

6.2.1 Ontology Alignment

The work on ontology alignment considers related work on database schema matching, but

takes into consideration characteristics of ontologies (64; 87; 99; 106). Existing schema or

ontology matching techniques can be classified into three classes:

Element-level At the element level, matching can use various similarity measures based, for

example, on names of elements or their textual descriptions. A normalized numerical value

will be calculated for each of the matching candidates, and the best one is selected (10;

30; 92).

Structure-level The structure-level information that can be used by the matching process

includes the graph or taxonomy underlying the schema or ontology. Graphs are used

as contextual information to map pairs of elements and the taxonomy can provide the

matching process with more semantics thus contributing to semantic-level matching. For

example, the AnchorPrompt algorithm compares the structures of the graphs that repre-

sent the ontologies or schemas and determines their similarity (89). Another example is

the idea of similarity flooding, which uses a hybrid matching algorithm that propagates

similarity through the graph (80). An example of semantic-level matching determines

the similarity of two concepts based on the similarities of their ancestors (100). In our

approach, we consider the semantic similarity of the concepts’ children, instead.

Instance-level Instance-level matching uses the actual contents (or instances) of the schema

or ontology elements. Examples include: 1) GLUE that employs machine-learning tech-

niques to determine mappings, particularly by using multiple learners that exploit the

information contained in the conceptual instances and in the taxonomic structure of the

ontologies, and then uses a probabilistic model to combine results of different learners

(43), 2) HICAL that exploits the data instances that overlap in two taxonomies to infer

mappings (63), and 3) the NLP-based method suggested by Fossati et al., where only the

instance documents associated with the nodes of ontologies are taken into account (48).
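As an illustration of element-level matching, the following sketch computes a normalized name similarity for each candidate and selects the best one. It uses SequenceMatcher from the Python standard library as a stand-in similarity measure; this is not the measure used by any of the cited systems, and the candidate names are assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Normalized similarity in [0, 1] between two names, ignoring case and underscores."""
    def normalize(s):
        return s.lower().replace("_", "")

    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(source_name, target_names):
    """Return the (target, score) pair with the highest similarity score."""
    scored = ((t, name_similarity(source_name, t)) for t in target_names)
    return max(scored, key=lambda pair: pair[1])


match, score = best_match("LAND_USE", ["area", "landUse", "jurisName"])
```

Name similarity alone cannot relate, for example, lu1 to a land use property, which is one reason structure-level and instance-level information is also exploited.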

Other ontology or schema mapping tools, which combine some of the above mentioned

methods, include Chimaera (79), COMA++ (9), MAFRA (74), Clio (60), and PROMPT (88).

Regarding the two types of ontologies in WLIS, namely the schema-like ontologies (which

can have instances) and the taxonomy-like ontologies, the matching methods differ because

of the different characteristics of the two types of ontologies. Specifically, the former

type contains various user-defined properties whereas the latter type usually includes only the

sub-concept (or subclass) relationships between two entities. Therefore, the matching of the

latter type can benefit from most structure-level methods mentioned above.

6.2.2 Query Processing

When mappings are defined as (relational) views, query processing is often referred to in the

literature as view-based query answering or rewriting (58). However, few view-based query

processing algorithms address the issue of query rewriting over ontologies. Ontologies allow

for a more expressive specification of constraints than most schema

languages do, thus raising issues that have been investigated by artificial intelligence research,

including deductive reasoning, ontology integration, knowledge discovery, and query approxi-

mation (107).

We divide ontology-based query processing techniques into two categories according to the

architecture being considered: a centralized architecture that uses a mediator and a peer-to-peer

architecture:

Centralized architecture We distinguish two types of query processing: GaV (global-as-

view) and LaV (local-as-view). In the first type, the ontology acts as a global schema. In

a system that exemplifies this kind of approach (95), queries are expressed using terms

from the vocabulary of the global description logic ontology. Query rewriting uses a

global-as-view (GaV) approach by translating the global query into an equivalent calculus

expression, which references only the objects available in the source databases. Instead,

in our approach we use RQL, a semantic-rich query language (65).

Within the second type (LaV), Amann et al. propose a mediator architecture for the

querying and integration of XML data and introduce a new mapping language to express

the mappings between the global schema of the mediator and the XML resources, which

are defined as local views over the global schema. A query rewriting algorithm is proposed

that translates user queries according to existing source descriptions in XPath (4).

Peer-to-peer architecture Unlike the previous two approaches, SWAP handles queries in

a peer-to-peer (P2P) setting (46). The queries that are posed by the user on a local

node range from simple conjunction to recursion formulated in an RQL-related query

language. The local node rewrites the query into subqueries and distributes them to the

other peers, which rewrite the queries in a similar fashion; the answers are then retrieved

and gathered. In this chapter we use a similar query answering approach, when

queries are posed on a local ontology and executed in a local-to-local fashion. In P2P

systems, a GLaV (global-local-as-view) approach (70) is commonly used, which can also

be applied to centralized architectures (49).

Our ontology-based query rewriting algorithm is similar to the computeWTA algorithm

proposed by Calvanese et al. for query reformulation (26) as both assume consistent ontology

mappings. However, unlike in computeWTA, we allow for the transformation of the values that

are contained in the query based on the instance-level ontology mappings. In this way, we can

address semantic heterogeneity.
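The transformation of query values can be sketched as follows, assuming that a query condition is a (property, operator, constant) triple and that the agreement for a local source has a schema-level part (property names) and an instance-level part (land use values). The global names landUseType and Agriculture, and the choice of 8110 as the corresponding local code, are assumptions made for the example.

```python
# Assumed agreement for one local source: schema-level and instance-level parts.
AGREEMENT_MADISON = {
    "properties": {"landUseType": "Land_use", "area": "AREA"},
    "values": {"landUseType": {"Agriculture": "8110"}},
}


def rewrite_condition(condition, agreement):
    """Rewrite a (property, operator, constant) condition from global to local terms."""
    prop, op, const = condition
    local_prop = agreement["properties"][prop]
    # Translate the constant only when an instance-level mapping exists for it.
    local_const = agreement["values"].get(prop, {}).get(const, const)
    return (local_prop, op, local_const)


land_use_condition = rewrite_condition(("landUseType", "=", "Agriculture"), AGREEMENT_MADISON)
area_condition = rewrite_condition(("area", ">", 1000), AGREEMENT_MADISON)
```

The first condition needs both a property renaming and a value translation, whereas the second only needs the renaming because areas are not encoded by a taxonomy.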

Another approach considers constraint-based query processing in the Clio system for data

integration (117). It focuses mainly on schema mapping and data transformation between

nested schemas and/or relational databases by taking advantage of the schema semantics to

generate consistent translations from source to target by considering the constraints and

structure of the target schema. In their approach, mappings are expressed using queries, whereas

in our approach mappings and queries exist independently.

6.3 Data Heterogeneities

Our application domain focuses on the Wisconsin Land Information System (WLIS) project,

which implements a distributed web-based system with heterogeneous data residing on local

and state servers (36).

As an example, Figure 37 shows two fragments of land parcel data, including their DTD

(on the left-hand side) and an XML fragment (on the right-hand side), which respectively exist

in the local systems of Eau Claire County and of Madison County. As we can observe, even

though the local XML sources present different structures and naming conventions, they share

a common domain with closely related meanings (or semantics), thus being ideal candidates for

an integration system.

The previous examples display syntactic homogeneity in that they both use XML but have

different structures, therefore displaying schematic heterogeneity. They may also encode their

<?xml encoding="ISO-8859-1"?>                     <LandUse>
<!ELEMENT LandUse (LandParcel)>                     <LandParcel>
<!ELEMENT LandParcel (AREA, BROAD, LU1,               <AREA>1704995.587470</AREA>
    LU2, LU3, ..., JurisType, JurisName)>             <BROAD>A</BROAD>
<!ELEMENT AREA (#PCDATA)>                             <LU1>AF</LU1>
<!ELEMENT BROAD (#PCDATA)>                            ......
<!ELEMENT LU1 (#PCDATA)>                              <JurisType>County</JurisType>
......                                                <JurisName>EauClaire</JurisName>
<!ELEMENT JurisType (#PCDATA)>                      </LandParcel>
<!ELEMENT JurisName (#PCDATA)>                      ......
                                                  </LandUse>

a) Local XML data source S1 of EauClaire County.

<?xml encoding="ISO-8859-1"?>                     <LandUse>
<!ELEMENT LandUse (LandParcel)>                     <LandParcel>
<!ELEMENT LandParcel (AREA, LAND_USE,                 <AREA>1007908.5</AREA>
    PARCEL_ID, ..., JurisType, JurisName)>            <LAND_USE>9100</LAND_USE>
<!ELEMENT AREA (#PCDATA)>                             <PARCEL_ID>246710</PARCEL_ID>
<!ELEMENT LAND_USE (#PCDATA)>                         ......
<!ELEMENT PARCEL_ID (#PCDATA)>                        <JurisType>County</JurisType>
......                                                <JurisName>Madison</JurisName>
<!ELEMENT JurisType (#PCDATA)>                      </LandParcel>
<!ELEMENT JurisName (#PCDATA)>                      ......
                                                  </LandUse>

b) Local XML data source S2 of Madison County.

Figure 37. Local XML land use data sources.

instances or values in different ways, thus displaying semantic heterogeneity, in the sense that

the same values may represent different meanings and that different values may have the same

meaning (111). Our discussion elaborates further on both kinds of heterogeneities. In the

example shown in Figure 37, we see that the two source schemas overlap on most elements and

TABLE VII

SEMANTIC HETEROGENEITY RESULTING FROM DIFFERENT ENCODINGS OF LAND USE DATA.

Local Source              Element Name   Land Use Type Value   Description
Dane County RPC           Lucode         91                    Cropland Pasture
Racine County (SEWRPC)    Tag            811                   Cropland
                                         815                   Pasture and Other Agriculture
Eau Claire County         Lu1            AA                    General Agriculture
City of Madison           Land use       8110                  Farms

both have the same nesting depth. However, the elements of the land use codes are represented

differently in the two schemas: the schema S1 uses four elements broad, lu1, lu2, and lu3,

whereas S2 only uses a single element, namely land use. Furthermore, the values of such land

use codes (in the XML instances) are encoded in different ways, i.e., characters for S1 and

numbers for S2.

Land use codes in WLIS stand for land use types (or categories) and include, for example,

agriculture, commerce, industry, institutions and residences. Besides using different names in

different local source schemas, such land use codes use different classification schemes, thus

resulting in semantic heterogeneities across the local source schemas. This is illustrated by

Table VII, where there are four element names (Lucode, Tag, Lu1 and Land use) from four dif-

ferent classification schemas. The descriptions in the table show that different values represent

closely related land use types.
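One way to picture the reconciliation of these encodings is a lookup from each local (source, element, value) triple to a concept of a global land use taxonomy. In the sketch below the keys and descriptions are the rows of Table VII, while the single global concept Agriculture and the function name to_global are assumptions made for illustration.

```python
# Rows of Table VII: (local source, element name, value) -> description.
TABLE_VII = {
    ("Dane County RPC", "Lucode", "91"): "Cropland Pasture",
    ("Racine County (SEWRPC)", "Tag", "811"): "Cropland",
    ("Racine County (SEWRPC)", "Tag", "815"): "Pasture and Other Agriculture",
    ("Eau Claire County", "Lu1", "AA"): "General Agriculture",
    ("City of Madison", "Land use", "8110"): "Farms",
}

# Assumption: every row above is aligned with one hypothetical global concept.
GLOBAL_CONCEPT = {key: "Agriculture" for key in TABLE_VII}


def to_global(source, element, value):
    """Return the global land use concept for a local code, or None if it is unmapped."""
    return GLOBAL_CONCEPT.get((source, element, value))


eau_claire = to_global("Eau Claire County", "Lu1", "AA")
unmapped = to_global("City of Madison", "Land use", "0000")
```

A realistic alignment would map the rows to different concepts at different levels of the global taxonomy rather than to one shared concept.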

In order to integrate distributed heterogeneous local geospatial data, as in WLIS, it

is necessary to overcome data heterogeneity, which originates from having different state and

federal agencies involved in acquiring and storing geospatial data. The ontology-based solutions

to the data heterogeneity problem use either a single ontology approach, a multiple ontology

approach, or a hybrid approach.

In a single ontology approach such as SIMS (7) all information sources are directly related

to a shared global ontology. This approach requires that all sources provide nearly the same

view on a domain. In a multiple ontology approach such as OBSERVER (81) each informa-

tion source is described by its own ontology. It needs an additional representation formalism

defining the inter-ontology mapping between each pair of separate ontologies. Instead, we use a

hybrid approach. In our approach, a local ontology is generated for each local XML source that

represents its schema. In addition, a global ontology is defined to act as an integrated view and

a uniform access interface of the distributed data sources. Every local ontology is mapped to

this global ontology, by establishing the correspondences of their elements and attributes, which

results in an “agreement” on the local names. In addition to this schema level reconciliation,

it is also necessary to have a global land use taxonomy, to which the local land use taxonomies

are mapped, so as to achieve a common understanding of the semantics of the land use codes

used in local sources. All ontologies are represented using RDF and RDFS.

6.4 Architecture

In this section, we discuss the architecture (as shown in Figure 38) of our ontology-based

approach for heterogeneous geospatial data integration. We focus on ontology alignment and

[Figure 38 labels: User Query Interface; Global Ontology G; Agreement M1, ..., Agreement Mn; Agreement Maker; and, for each local source i = 1, ..., n, Local Land-Use Ontology Oi, Local Ontology Instance Ii, Local DTD Di, Local Land-Use Hierarchy Hi, and Local XML Source Si. Legend: ontology mapping, local transformation, local query processing, global query processing.]

Figure 38. The ontology-based architecture.

briefly discuss query processing, leaving most of the latter subject to be presented in Section

6.6.

6.4.1 Schema Transformation and Ontology Mapping

The ontology-based data integration process contains two steps: schema transformation and

ontology alignment. In the first step, for each local source, we transform the local DTD schema

into a local RDFS ontology, the XML instances under this schema into instances of the local

ontology, and the XML taxonomy of land use categories into an equivalent RDFS taxonomy

(as part of the local ontology). In the second step, we map a local RDFS ontology and the

local land use taxonomy to the global ontology and its land use taxonomy, respectively. The

mappings are then stored in an agreement file to be used for query processing.

The global ontology in our system has two roles: (1) It provides the user with access to the

data with a uniform query interface to facilitate the formulation of a query on all the XML

sources; (2) It serves as the mediation mechanism for accessing the distributed data through

any of the XML sources.

6.4.2 Query Processing

Depending on the particular role of the global ontology in the architecture, we distinguish

the following two query processing cases:

Global-to-local query processing The query is posed on the global ontology, which acts as a

uniform interface to access the distributed data sources. The global query is rewritten into

multiple subqueries over individual local ontologies in local systems, where the subqueries

are executed. The answers to these subqueries are then returned to the global interface

and integrated to form the answer to the global query.

Local-to-local query processing As an autonomous system, the local system can accept

queries from the user and answer them by forwarding the queries to other local data

sources through the global ontology. This case of query answering is similar to that in

peer-to-peer systems, in the sense that it can propagate the query to one or more peers,

or simply propagate the query to all of them (46).
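The global-to-local case can be sketched as a fan-out over the local sources followed by a union of the answers. The sketch assumes that a query is a single (property, value) condition and that each source is described by a property-name map and a list of records; the helper make_source, the local name JURIS_TYPE, and the record ids are hypothetical.

```python
def make_source(name, prop_map, records):
    """Build a toy local source (assumed structure) that uses its own property names."""

    def rewrite(query):
        prop, value = query
        return (prop_map[prop], value)

    def execute(subquery):
        prop, value = subquery
        return [(name, rec["id"]) for rec in records if rec.get(prop) == value]

    return {"rewrite": rewrite, "execute": execute}


def global_to_local(global_query, sources):
    """Rewrite the global query for each source, run the subqueries, and merge the answers."""
    merged = []
    for source in sources:
        subquery = source["rewrite"](global_query)
        merged.extend(source["execute"](subquery))
    return merged


s1 = make_source("EauClaire", {"jurisType": "JurisType"},
                 [{"id": 1, "JurisType": "County"}])
s2 = make_source("Madison", {"jurisType": "JURIS_TYPE"},
                 [{"id": 7, "JURIS_TYPE": "County"}, {"id": 8, "JURIS_TYPE": "City"}])
result = global_to_local(("jurisType", "County"), [s1, s2])
```

Local-to-local processing could reuse the same pieces by first translating a local condition to global terms and then applying the fan-out to the other sources.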

The agreement files are the basis for both cases of query processing. The rewriting of

a query from one ontology to another needs to refer to the relevant agreement file to find

the corresponding concepts to the ones being queried. However, the mappings between a local

ontology and the global ontology and those between the local land use taxonomy and the global

land use taxonomy are used differently for query rewriting. We choose to use RQL (RDF Query

Language (65)) to express queries on ontologies. We discuss query processing in more detail in

Section 6.6.

6.5 Ontology Mapping

6.5.1 Schema Transformation

The first step of the integration of XML geospatial data sources is the transformation from

the XML source schema and data to an RDFS ontology and to RDF data. Due to the document

structure of XML, we may need to extend the RDFS vocabulary so as to be able to encode the

structure, which would otherwise be lost. In this chapter, we focus on the nesting structure,

while ignoring other implicit information such as the order information.

When the nesting structure represents a type hierarchy of elements (e.g., the land use taxon-

omy), the RDF property rdfs:subClassOf will be adequate to model such information, where

XML elements are represented by RDFS classes. However, it is common that nesting represents

an implicit relationship between two XML elements (e.g., the ownedBy relationship between an

element Owner and its child element LandParcel). In this case, in order to preserve the nesting

structure in the local ontology, we introduce a new RDF property, namely contained (which is

defined in the namespace rdfx). That is, while still representing the two XML elements using

RDFS classes, we use contained to connect the child-element class to the parent-element class.

Below we show the RDF/XML syntax for the contained property.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:rdfx="http://www.example.org/rdf-extension#">
  <rdf:Property rdf:about="http://www.example.org/rdf-extension#contained">
    <rdfs:isDefinedBy rdf:resource="http://www.example.org/rdf-extension#"/>
    <rdfs:label>contained</rdfs:label>
    <rdfs:comment>The containment between two classes.</rdfs:comment>
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
    <rdfs:domain rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
  </rdf:Property>
</rdf:RDF>

Elements and attributes are the two basic building blocks of a DTD. There are two types

of elements: simple types that cannot have child elements or carry attributes, and complex types that

can have child elements and/or carry attributes. On the other hand, all attribute declarations

must reference simple types since attributes cannot contain other elements or other attributes.

A well-formed XML document contains the hierarchical structure of elements and attributes

with the following kinds of relationships: 1) element and attribute relationship, where only

complex-type elements can carry attributes and attributes can only be of simple types, and 2)

element and sub-element relationship, where only complex-type elements can allow elements as

their children, but child elements can be either simple types or complex types.

Taking into account XML elements, attributes and their relationships, the transformation

from XML to RDF can further include element-level transformation and structure-level trans-

formation, as follows:

Element-level transformation Element-level transformation defines the basic classes and

properties of the local RDFS ontology according to the transformation correspondences

shown in Table VIII, with the structural relationships between the elements not being

considered for the time being. No new RDF metadata need be defined here because

rdfs:Class and rdf:Property are sufficient to express classes and properties. For

instance, to transform the DTD of S1 in Figure 37, we define two classes: LandUse and

LandParcel for the elements with the same name. The other elements become properties

of LandParcel, because they are simple-type subelements.

TABLE VIII

ELEMENT-LEVEL SCHEMA TRANSFORMATION

XML Schema concepts      RDF Schema concepts
Attribute                Property
Simple-type element      Property
Complex-type element     Class

Structure-level transformation Structure-level transformation encodes the nesting struc-

ture of the XML schema into the local RDFS ontology (39). In particular, the nesting

Figure 39. An example of local RDFS ontologies: a) local RDFS ontology O1 for local source S1; b) local RDFS ontology O2 for local source S2.

may occur between two complex-type elements or between a complex-type element and its

child (as a simple-typed element). Following the element-level transformation, the nest-

ing structure in the former case corresponds to a class-to-class relationship between two

RDFS classes, which are connected by the property rdfx:contained. In the latter case,

the XML nesting structure corresponds to the class-to-literal relationship in the local on-

tology, with the class and the literal connected by the corresponding property. Table IX

lists the correspondences between the XML elements and the classes or properties in the

local RDFS ontology.
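The two transformation levels described above can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory description of the source schema (a complex-type element is a dictionary with optional "attributes" and "children" entries, and a simple-type element is None); the function name to_rdfs, the output layout, and the (container, contained) direction of rdfx:contained pairs are assumptions made for illustration only. Name normalization (e.g., AREA to area, as in Table IX) is omitted.

```python
from typing import Dict, List, Optional, Tuple


def to_rdfs(name: str, element: Optional[dict], parent: Optional[str] = None,
            out: Optional[Dict[str, List]] = None) -> Dict[str, List]:
    """Collect classes, properties, and containment pairs from a nested schema."""
    if out is None:
        out = {"classes": [], "properties": [], "contained": []}
    if element is None:
        # Simple-type element: becomes a property of its parent class.
        out["properties"].append((parent, name))
        return out
    # Complex-type element: becomes a class.
    out["classes"].append(name)
    if parent is not None:
        # Nesting between two complex-type elements: rdfx:contained.
        out["contained"].append((parent, name))
    for attr in element.get("attributes", []):
        # Attributes become properties of the class.
        out["properties"].append((name, attr))
    for child_name, child in element.get("children", {}).items():
        to_rdfs(child_name, child, name, out)
    return out


# A fragment of the schema of S1, restricted to elements listed in Table IX.
schema = {"children": {"LandParcel": {"children": {"AREA": None, "BROAD": None}}}}
ontology = to_rdfs("LandUse", schema)
```

In this sketch, LandUse and LandParcel end up as classes connected by a containment pair, while AREA and BROAD become properties of LandParcel.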

As an example, Figure 39 shows the local ontologies (represented as graphs where nodes are

classes and edges are properties) transformed from the XML schemas in Figure 37. The land


TABLE IX

MAPPINGS BETWEEN XML SOURCE SCHEMA D1 AND LOCAL ONTOLOGY O1

XPath expressions in D1              RDF expressions in O1
/LandUse                             LandUse
/LandUse/LandParcel                  LandParcel
/LandUse/LandParcel/AREA             LandParcel.area
/LandUse/LandParcel/BROAD            LandParcel.broad
/LandUse/LandParcel/LU1              LandParcel.lu1
...                                  ...
/LandUse/LandParcel/JurisType        LandParcel.jurisType
/LandUse/LandParcel/JurisName        LandParcel.jurisName

use taxonomies are transformed into a hierarchy of classes and incorporated as part of the local

ontologies, rooted at LandUseTag and LandUseType, respectively.

6.5.2 Ontology Alignment

The ontology alignment process takes as input a local ontology obtained using the previously

described transformation, and the global ontology. It then produces as output an agreement

containing the class and property correspondences between the two ontologies.

When performing the alignment, we must consider two cases, which correspond to the

schema (non-taxonomy) and taxonomy components of the global ontology and the local ontologies (see Figure 39):

1) the schema-level mapping between the schema parts of two ontologies, where a concept (or a

role) of one ontology is mapped to a concept (or a role) of another ontology, and 2) the instance-


level mapping, where two corresponding concepts use two different classification schemes for

their instances, e.g., land use codes with different underlying taxonomies in WLIS.

Ontology alignment is in general a challenging task, with its degree of difficulty depending

on the types of ontologies being considered (106). For instance, the mapping between two

taxonomies consisting of only subClassOf relationships (i.e., the instance-level mapping in our

setting) is believed to be much simpler than the one between two non-taxonomies containing

various properties and relationships (i.e., the schema-level mapping). In this chapter, we primarily

discuss the mapping between two (land use) taxonomies, and propose an automatic alignment

algorithm that can deduce new mappings from existing ones based on certain rules.

6.5.2.1 Mapping Types

Figure 40 shows a fragment of two concrete land use taxonomies: the one on the left hand

side is from the local ontology O1 in Eau Claire County (as depicted in Figure 39), and the one

on the right hand side is from the global ontology G.

The two taxonomies are rooted at LandUseTag and LandUseCode, respectively. A

node in each taxonomy represents a class of land use, where the label contains its description

and the code (in parentheses). The dashed lines are class correspondences, established

based on the semantics of the classes, which represent the mappings. We classify ontology

mappings along the following dimensions:

Semantic relationship Considering a set-theoretic semantics, the mapping between two classes

A and B (seen as two sets of instances) can be classified into five categories: superclass,

subclass, equivalent, approximate (or overlapping), and disjoint, respectively, A ⊇ B,

Figure 40. An example of mapping between two land use taxonomies: a) land use taxonomy in local ontology O1; b) land use taxonomy in global ontology G. The labels over the edges represent mapping types, followed (in parentheses) by the deduction rule(s) that can be applied, if any.

A ⊆ B, A = B, (A ∩ B ≠ ∅) ∧ (A − B ≠ ∅) ∧ (B − A ≠ ∅), and A ∩ B = ∅. We will not

consider the category approximate in our query rewriting algorithm for reasons that will

be apparent later (but we will discuss in Section 6.6 ways in which this category can be

incorporated in query answering). We will also not consider disjoint mappings, as they

are not useful for query answering.

Cardinality Class correspondences are established pairwise between two ontologies (producing

one-to-one mappings). However, it is possible that a class from one ontology is mapped

to multiple classes from the other ontology, in a one-to-many mapping, and that multiple

classes are mapped to a single class, in a many-to-one mapping. To express such mappings

we consider the union of the classes to which a single class (in the other ontology) maps.

For example, given two mappings A = B and A = C, we have that A = B∪C. This issue

will be further discussed in the query processing section.

Coverage We distinguish two types of mappings: fully covered and partially covered. Let

C and C′ be two classes to be mapped, such that C1, ..., Cm are subclasses of C, and

C′1, ..., C′n are subclasses of C′. We say that C (resp. C′) is fully covered if for each

child Ci ∈ {C1, ..., Cm} (resp. for each child C′j ∈ {C′1, ..., C′n}) there is a non-empty

subset of {C′1, ..., C′n} to which Ci is mapped (resp. there is a non-empty subset of

{C1, ..., Cm} to which C′j is mapped).
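The semantic relationship and coverage notions above can be made concrete with a short Python sketch, given here only as an illustration. It assumes that the instance sets of two classes are available as Python sets and that the established child mappings are given as a dictionary; the names classify and is_fully_covered are hypothetical and do not refer to any component of our system.

```python
from typing import Dict, Hashable, Iterable, Set


def classify(a: Set[Hashable], b: Set[Hashable]) -> str:
    """Set-theoretic relationship between the instance sets of classes A and B."""
    if a == b:
        return "equivalent"    # A = B
    if a >= b:
        return "superclass"    # A ⊇ B (and A ≠ B)
    if a <= b:
        return "subclass"      # A ⊆ B (and A ≠ B)
    if a & b:
        return "approximate"   # overlap, with leftovers on both sides
    return "disjoint"          # A ∩ B = ∅


def is_fully_covered(children: Iterable[str],
                     mapped_to: Dict[str, Set[str]]) -> bool:
    """A class is fully covered if each child maps to a non-empty set of classes."""
    return all(mapped_to.get(child) for child in children)


# Class 115 is mapped to RT and RM, whereas class 190 is left unmapped.
covered = is_fully_covered(["115", "190"], {"115": {"RT", "RM"}})
```

Because equivalence is tested first, the superclass and subclass branches only report strict containment, which keeps the five categories mutually exclusive in this sketch.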


6.5.2.2 Deduction Process

In our approach, the ontology mapping process is performed semi-automatically, that is,

partly established manually by the user and partly obtained automatically using an inference

process based on deduction.

This semi-automatic ontology mapping process follows two principles: (1) The deduction of

the mapping between two nodes (one from each of the two taxonomies being mapped) is determined by

the mappings between their child nodes. In other words, the mapping between two ontologies

is performed in a level-wise fashion, driven by the inference rules that are defined based on

the mapping semantics. (2) User intervention is needed in two cases: when the mapping

between two nodes has insufficient information to determine its type (for example, when some

of the children of one node have not been mapped) or when there is conflicting information (for

example, that a node is both a superset and a subset of the corresponding node).

We make the complete-partition assumption: for any class C in the taxonomy, its subclasses

C1, ..., Cn together form a complete partition of the class, that is, C = C1 ∪ ... ∪ Cn. For

instance, in the global taxonomy depicted in Figure 40, the two child classes Pasture(91) and

Other(99) of the Agricultural(9) class form a complete partition of Agricultural(9), since

Other(99) includes all agricultural lands that are not used for pasture.

We consider the following deduction rules:

Definition 6.1 (Deduction rules) Let C and C′ be two fully covered classes, and C1, ..., Cm

and C′1, ..., C′n be the subclasses of C and C′, respectively. Then, the mapping between C and

C′ can be obtained according to the following rules:

1) C = C′, if for each Ci ∈ C, Ci is mapped to some k-element subset C′′ of {C′1, ..., C′n}

(1 ≤ k ≤ n), such that Ci = ⋃_{l=1}^{k} C′′l.

2) C ⊆ C′, if for each Ci ∈ C, Ci is mapped to some k-element subset C′′ of {C′1, ..., C′n}

(1 ≤ k ≤ n), such that Ci = ⋃_{l=1}^{k} C′′l or Ci ⊆ ⋃_{l=1}^{k} C′′l.

3) C ⊇ C′, if for each Ci ∈ C, Ci is mapped to some k-element subset C′′ of {C′1, ..., C′n}

(1 ≤ k ≤ n), such that Ci = ⋃_{l=1}^{k} C′′l or Ci ⊇ ⋃_{l=1}^{k} C′′l.

The deduction rules in Definition 6.1 can be proved to be sound and complete by an induc-

tion on the set-theoretic semantics of each rule, under the complete-partition assumption and

the assumption that the user-defined mappings are semantically correct.

User intervention is needed in all other cases. The above rules assume a full mapping

between C and C ′. However, they still hold for the case of a partial mapping, provided that

we define the following supplemental rule: 4) Suppose that a class C is partially covered by C ′,

and that S is the subset of subclasses of C that are not mapped to any children of C ′. Then, we

create a temporary and empty subclass ⊥ of C ′, and add a superclass mapping from each class

in S to ⊥.
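The following Python sketch shows one way the deduction rules, including the supplemental rule 4, could be applied to a single pair of classes. It is an illustration under stated assumptions rather than our implementation: the function name deduce is hypothetical, each child Ci of C is summarized by the relationship between Ci and the union of the children of C′ to which it is mapped, an unmapped child is encoded as None, and the caller is assumed to have checked that C′ is fully covered.

```python
from typing import Optional, Sequence

EQ, SUB, SUP = "=", "⊆", "⊇"


def deduce(child_relations: Sequence[Optional[str]]) -> Optional[str]:
    """Deduce the mapping between C and C' from the mappings of the children of C.

    Returns "=", "⊆", "⊇", or None when user intervention is needed.
    """
    # Rule 4: an unmapped child is a superclass of the temporary empty class ⊥.
    relations = [SUP if r is None else r for r in child_relations]
    if not relations:
        return None                       # nothing to deduce from
    if all(r == EQ for r in relations):
        return EQ                         # rule 1
    if all(r in (EQ, SUB) for r in relations):
        return SUB                        # rule 2
    if all(r in (EQ, SUP) for r in relations):
        return SUP                        # rule 3
    return None                           # conflicting information
```

For example, deduce(["=", "⊆"]) yields "⊆" by rule 2, whereas deduce(["⊆", "⊇"]) yields None, which corresponds to the case where the user has to resolve the conflict manually.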

In Figure 40, the symbols and numbers (in between parentheses) over the dashed lines (i.e.,

the class correspondences) indicate the mapping type and the adopted inference rule(s), respectively.

For example, “⊆ (2, 4)” over the mapping between the class Residential(R) and

Residential(1) means that Residential(R) is a subclass of Residential(1), which is computed

by rules 2 and 4. The application of rule 4 is due to the fact that SeasonalResidential(190)

is unmapped, thus making Residential(1) partially covered.


We note that the inference on ontology mappings, as discussed above, occurs between two

ontologies (one being a local ontology and the other being the global ontology). In our system,

there is another case that also requires reasoning. We will discuss this issue in more detail in

Section 6.6.

6.5.2.3 Mapping Representation

As the result of matching two ontologies, an agreement file is created to store the ontology

mappings. There are a number of methods proposed to represent ontology mappings in different

situations, e.g., using axioms or using a meta-ontology (74; 115). In our system, it is natural

to use RDFS as the language for mapping representation, provided that all ontologies are

represented by RDF and RDFS. Actually, as we will see in the next section, storing ontology

mappings in RDF has certain advantages for query processing.

Given the three types of ontology mapping semantics, and the feature of multiple inheritance

of RDFS classes, it is sufficient to use the RDF property rdfs:subClassOf to represent all types

of ontology mappings. As for the non-taxonomical parts of two ontologies, different kinds of

mappings can also be established between two properties, namely superproperty, subproperty,

or equivalent mappings. Similarly, we can use the RDF property rdfs:subPropertyOf to

represent these property mappings. Figure 41 shows an example of the mappings between the

global ontology G and the local ontology O1 (also shown in Figure 39). The graph shows a

fragment of the mappings (indicated by the dashed lines) between the schema components of

both ontologies and the text shows a fragment of the corresponding mapping representation in

RDFS.

[Figure 41, graph part: a) local ontology O1 for local source S1; b) global ontology G; the dashed lines between them indicate the mappings.]

<!DOCTYPE rdf:RDF [
  <!ENTITY G "urn:ontologies-advis-lab:global-ontology#">
  <!ENTITY O1 "urn:ontologies-advis-lab:local-ontology-1#">
  <!ENTITY O2 "urn:ontologies-advis-lab:local-ontology-2#"> ]>

<rdfs:Class rdf:about="&G;LandParcel">
  <rdfs:subClassOf rdf:resource="&O1;LandParcel"/>
</rdfs:Class>
<rdfs:Class rdf:about="&O1;LandParcel">
  <rdfs:subClassOf rdf:resource="&G;LandParcel"/>
  <rdfs:subClassOf rdf:resource="&G;Land"/>
</rdfs:Class>
......
<rdfs:Class rdf:about="&O1;RT">
  <rdfs:subClassOf rdf:resource="&G;115"/>
</rdfs:Class>
<rdfs:Class rdf:about="&O1;RM">
  <rdfs:subClassOf rdf:resource="&G;115"/>
</rdfs:Class>
......

Figure 41. A fragment of ontology mappings represented in RDFS.




In WLIS, users can pose queries either on the global ontology as a global query, or over any

of the integrated local sources as a local query. A typical query such as “Where are all the

multiple family land parcels in Wisconsin?” would be relatively straightforward when using a

single local source, whose schema and taxonomy are familiar to the user, but much more difficult

when posed over a large set of local data sources, as the users would have to know the schema

and taxonomy of each data source and manually rewrite their queries for each of them.

In this section, we describe how such queries can be automatically rewritten by

our integrated system, using the agreement files that are generated by the alignment process.

Among the many query languages for RDF data access, we use RQL (RDF Query Language),

which is a typed language following a functional approach (32). RQL is defined by a set of basic

queries and select-from-where (sfw) filters, which can be used to express meta-schema, schema

and data queries. The sfw filters contain generalized path expressions and can be nested to

form more complex queries. For example, the above query, if posed over the global ontology,

can be expressed by the following RQL query:

SELECT a, b, c

FROM {$x}xyCoordinates{a}, {$x}bounding{b}, {$x}jurisName{c},

{$x}state{d}, {$x}luCode{e}

WHERE d = "Wisconsin" and e = "115"


This query is in the form of an sfw filter, consisting of the SELECT, FROM, and WHERE clauses.

The SELECT clause defines a projection over the variables of interest. In the FROM clause, we

use basic schema path expressions composed of the property name (e.g., bounding) and data

variables (e.g., $x) or class variables (e.g., a). The condition in the WHERE clause filters the

answers. We focus on a particular subset of RQL, namely conjunctive RQL (c-RQL), which is

of the following form:

ans(x) :– R1(x1), ..., Rn(xn).

where x ⊆ x1 ∪ ... ∪ xn are variables or constants, and Ri(xi) (i ∈ [1..n]) stands for a class

predicate C(x) or a property predicate P (x, y). As usual, ans(x) is the head of the query,

denoted headQ, and R1(x1), ..., Rn(xn) is the body of the query, denoted bodyQ. Most RQL

queries can be expressed in c-RQL. For instance, the RQL query on multiple family land parcels

can be expressed in c-RQL as follows:

ans(a, b, c) :– xyCoordinates(x, a), bounding(x, b), jurisName(x, c),

state(x, "Wisconsin"), luCode(x, "115")

where for each path expression in the RQL query we use a corresponding predicate (e.g.,

xyCoordinates(x, a) for {$x}xyCoordinates{a}).

6.6.2 Query Rewriting and Answering

The query processing across the whole system can be performed in two directions, i.e.,

global-to-local and local-to-local. We propose a query rewriting algorithm, QueryRewriting,

which can be used in both cases. Query rewriting can be seen as a function Q′ = f(Q,M),

Algorithm QueryRewriting (Q, M)
Input: a conjunctive query Q over ontology O; the mappings M between ontologies O and O′.
Output: a union 𝒬 of conjunctive queries Q′ over O′.
1  headQ′ = headQ; bodyQ′ = null;
2  Q∗ = QueryExpand(Q, Σ), where Σ is the set of constraints over O;
3  Let φ be bodyQ∗;
4  Let M1 be the part of schema-level mappings in M;
5  For each R(x) of φ
6    For each ψ ∈ M1
7      Let R′(x′) be the result of applying ψ on R(x);
8      bodyQ′ = R′(x′) ∧ bodyQ′;
9  Q′ = QueryExpand(Q′, Σ′), where Σ′ is the set of constraints over O′;
10 Let M2 be the part of instance-level mappings in M;
11 𝒬 = ConstantMapping(Q′, M2);
12 Return 𝒬;

Figure 42. The QueryRewriting algorithm.

where Q is the query to be rewritten, called source query, M is the set of ontology mappings,

and Q′ is the resulting query, called target query. The algorithm is shown in Figure 42.

In the global-to-local case, the source query Q is a query posed on the global ontology G,

M is the set of mappings from G to every local ontology O1, ..., On, and the target query Q′ is

the union of multiple subqueries over O1, ..., On. In the local-to-local case, Q is a local query

posed on a local ontology Oi (i ∈ [1..n]), M is the set of mappings from Oi to one or more local

ontologies Oj (j ∈ [1..n] and j ≠ i), and Q′ is the union of multiple subqueries over all Oj. In

the latter case, M is, in fact, a set of compositions of the mappings from Oi to G with those

from G to Oj .


In the rest of this section, we describe in detail the four main steps of the QueryRewriting

algorithm: 1) expanding the source query using the source ontology constraints, 2) rewriting the

expanded source query into an intermediate target query using the schema-level mappings, 3)

expanding the intermediate target query using the target ontology constraints, and 4) mapping

the constants of the expanded intermediate target query to obtain the final target query using

instance-level mappings.

6.6.2.1 Query Expansion

In the above description of the QueryRewriting algorithm, we notice that both the source

query Q and the intermediate target query Q′ are expanded using the ontology constraints,

respectively in Line 2 and Line 9. This query expansion process, as described by the QueryEx-

pand function of Figure 43, uses the strategy of applying the ontology constraints to “chase”

the query, similarly to the chase algorithm that is used in relational databases to compute

dependency implications or optimize queries (2). In relational databases, a database constraint

can be represented as a tgd (tuple generating dependency) of the form ∀x (ϕ(x) → ∃y ψ(x, y)),

where ϕ and ψ are conjunctions of atoms. In an ontology setting, we consider three kinds of

constraints, namely, subclass, subproperty, and typing constraints, all of which can be repre-

sented as a tgd. Specifically, a tgd ∀x C1(x) → C2(x) corresponds to a subclass constraint

C1 ⊆ C2; a tgd ∀x∀y P1(x, y) → P2(x, y) corresponds to a subproperty constraint P1 ⊆ P2; and

a tgd ∀x∀y P (x, y) → A(x) (resp. ∀x∀y P (x, y) → B(y)) corresponds to a typing constraint

that the instances of x (resp. y) are of type A (resp. B).

Algorithm QueryExpand (Q, Σ)
Input: a conjunctive query Q over ontology O; the constraints Σ over O.
Output: The query Q after the expansion.
1 Repeat
2   Let φ be bodyQ;
3   Let ψ : R1(x) → R2(x) be any dependency in Σ;
4   If there exists a homomorphism h from R1(x) to φ, but not from R1(x) ∧ R2(x) to φ, then
5     Extend h to a new homomorphism h′ from R1(x) ∧ R2(x) to φ;
6     Add h′(R2(x)) into bodyQ;
7   Else exit repeat;
8 End repeat

Figure 43. The QueryExpand algorithm.

Similarly to the chase algorithm, QueryExpand is a non-deterministic process that termi-

nates provided that the dependencies are acyclic and the applications of dependencies do not

introduce new variables into the query. Furthermore, it has been proved that the resulting

query Q′ = QueryExpand (Q, Σ) is equivalent to Q, denoted Q ≡ Q′, meaning that the answers

to both queries are the same over all the ontology instances that satisfy the constraints (2).
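To make the expansion step more concrete, the following Python sketch applies the three kinds of constraints to the body of a conjunctive query until no new atom can be added. It is a simplified illustration, not the implementation used in our system: atoms are assumed to be (predicate, arguments) pairs, the function name query_expand and its parameters are hypothetical, and the loop applies all applicable dependencies in each pass instead of choosing one dependency at a time as in Figure 43. Since no new variables are introduced, only finitely many atoms can be added and the loop terminates.

```python
from typing import Iterable, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]
Pair = Tuple[str, str]


def query_expand(body: Iterable[Atom],
                 subclass: Iterable[Pair] = (),
                 subproperty: Iterable[Pair] = (),
                 domain: Iterable[Pair] = (),
                 rng: Iterable[Pair] = ()) -> List[Atom]:
    """Chase the query body with subclass, subproperty, and typing constraints."""
    subclass, subproperty = list(subclass), list(subproperty)
    domain, rng = list(domain), list(rng)
    body = list(body)
    changed = True
    while changed:
        changed = False
        for pred, args in list(body):
            implied: List[Atom] = []
            if len(args) == 1:   # class predicate C(x)
                implied += [(c2, args) for c1, c2 in subclass if c1 == pred]
            elif len(args) == 2:  # property predicate P(x, y)
                implied += [(p2, args) for p1, p2 in subproperty if p1 == pred]
                implied += [(a, (args[0],)) for p, a in domain if p == pred]
                implied += [(b, (args[1],)) for p, b in rng if p == pred]
            for atom in implied:
                if atom not in body:  # add only atoms that are not yet in the body
                    body.append(atom)
                    changed = True
    return body


# The query on multiple family land parcels over the global ontology G.
q_body = [("xyCoordinates", ("x", "a")), ("bounding", ("x", "b")),
          ("jurisName", ("x", "c")), ("state", ("x", "Wisconsin")),
          ("luCode", ("x", "115"))]
typing = [(p, "LandParcel") for p, _ in q_body]   # typing constraints on x
expanded = query_expand(q_body, subclass=[("LandParcel", "Land")], domain=typing)
```

In this sketch, the expansion appends LandParcel(x) and then Land(x) to the five original atoms.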

As an example, let us take the preceding query on multiple family land parcels, and denote

it using Q. As specified on the global ontology G, all the properties (e.g., xyCoordinates)

referred to in Q belong to the class LandParcel, thus leading to the corresponding typing constraints.

Such constraints can be represented by a tgd of the form ∀x∀y P (x, y) → A(x) (e.g.,

∀x∀y xyCoordinates(x, y) → LandParcel(x)). By applying them to Q, we obtain the following

expansion of Q:



ans(a, b, c) :– xyCoordinates(x, a), bounding(x, b), jurisName(x, c),

state(x, "Wisconsin"), luCode(x, "115"), LandParcel(x)

Furthermore, given that the LandParcel class is a subclass of Land in G, the corresponding

tgd (e.g., ∀x LandParcel(x) → Land(x)) of such constraint is still applicable to the above

query. The final resulting expansion Q∗ of Q is as follows:


ans(a, b, c) :– xyCoordinates(x, a), bounding(x, b), jurisName(x, c),

state(x, "Wisconsin"), luCode(x, "115"), LandParcel(x),

Land(x)

6.6.2.2 Query Mapping

The key to query rewriting lies in Lines 4 to 7 of the QueryRewriting algorithm, which maps

the expanded source query Q∗ to a new query Q′ over the target ontology, based on the set

of schema-level mappings in M . Similarly to the ontology constraints used by QueryExpand,

ontology mappings can be treated as constraints specified over the source and the target ontologies.

Therefore, we express ontology mappings as tgds, which, however, are used for query

mapping in a way that is different from the use of ontology constraints for query expansion.

More specifically, a tgd ψ : ∀x R1(x) → R2(x) means that any instance that satisfies the

predicate R1(x) also satisfies the predicate R2(x).

In the following discussion, we consider two ontologies O1 and O2. If ψ represents an ontology

constraint R1 ⊆ R2, where R1, R2 ∈ O1, the instances that make ψ hold are all the instances

of O1. Therefore, the expansion of a query on O1 involving R1 and using ψ, as specified by


the QueryExpand function, does not, in fact, expand the answer to the query, even though R2

may have a larger set of instances than R1. In comparison, in the case where ψ stands for an

ontology mapping R1 ⊆ R2, where R1 ∈ O1 and R2 ∈ O2, ψ actually specifies a constraint for

the data interoperation (exchange) from O1 to O2. In this case, the instances in O1 and in O2

are usually different. This means that the instances satisfying R1(x) may not exist in O2 so

as to satisfy R2(x), thus demanding a potential data transfer from O1 to O2. In this sense,

ψ should be applied to the rewriting of a query referring to R2 to a query referring to R1, not

in the opposite direction. In other words, the query rewriting from O1 to O2 should use the

mapping constraints whose dependency implication is from O2 to O1.

In particular, the application of a dependency ψ : R2(x) → R1(x) to a query Q, as Line 7

of QueryRewriting indicates, is performed by taking the converse ψ′ of ψ (i.e., R1(x) → R2(x)),

followed by the operations specified in Lines 4 and 5 of QueryExpand. The resulting R′(x′) (in

Line 8 of QueryRewriting) is then h′(R2(x)) as in Line 6 of QueryExpand. The following shows

the result of mapping Q∗ (the expanded source query) to a query Q′ on the local ontology O1

according to the mapping M as presented in Figure 41:

ans(a, b, c) :– xyCoordinates(x, a), boundingBox(x, b), jurisName(x, c),

state(x, "Wisconsin"), lu1(x, "115"), LandParcel(x)

6.6.2.3 Rewriting Constants

Both the QueryExpand function and the query mapping process are performed at the schema

level. In comparison, the rewriting of the constants that are referred to in the query is based

Algorithm ConstantMapping (Q, M)
Input: a conjunctive query Q over ontology O′ with constants c1, ..., cn from O; the instance-level mappings M between ontologies O and O′.
Output: a union 𝒬 of conjunctive queries Q′ with constants from O′.
1  𝒬 = ∅;
2  c = (c1, ..., cn);
3  For each ci, with i ∈ [1..n]
4    Ai = {};
5    Let C be the class standing for ci;
6    For each C ⊇ C′ or C = C′ in M
7      Ai = Ai ∪ {c′}, where c′ is the constant represented by C′;
8    If there is no C ⊇ C′ or C = C′ in M then
9      Ai = {ci};
10 For each c′ ∈ A1 × ... × An
11   Q′ = Q;
12   Substitute c in Q′ with c′;
13   𝒬 = 𝒬 ∪ Q′;

Figure 44. The ConstantMapping algorithm.

on the instance-level mappings between two ontologies, particularly the mappings between two

land use taxonomies. We describe next the instance rewriting process of Figure 44.

Following the previous example, we have c = {"Wisconsin", "115"}. From the mapping

between G and O1, as shown in Figure 40, we have that RT ⊆ 115 and RM ⊆ 115, therefore

A1 = {"Wisconsin"} and A2 = {"RT", "RM"}, according to Lines 3 to 9. Now that we have two

vectors of constants (c′ in the algorithm): {"Wisconsin", "RT"} and {"Wisconsin", "RM"}, we

can obtain the following union of queries Q according to Lines 10 to 13 of the ConstantMapping

function:



ans(a, b, c) :– xyCoordinates(x, a), boundingBox(x, b), jurisName(x, c),

state(x, "Wisconsin"), lu1(x, "RT"), LandParcel(x)

∪


ans(a, b, c) :– xyCoordinates(x, a), boundingBox(x, b), jurisName(x, c),

state(x, "Wisconsin"), lu1(x, "RM"), LandParcel(x)
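The constant rewriting step can be sketched in Python as follows. This is a minimal illustration under several assumptions: atoms are (predicate, arguments) pairs, variable names and constants do not collide, and the instance-level mappings are given as a hypothetical dictionary from each source constant to the target constants whose classes it subsumes or equals. The function name constant_mapping is likewise chosen for illustration.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Atom = Tuple[str, Tuple[str, ...]]


def constant_mapping(body: Sequence[Atom], constants: Sequence[str],
                     mappings: Dict[str, List[str]]) -> List[List[Atom]]:
    """Return one rewritten query body per combination of mapped constants."""
    # A_i: the target constants for c_i, or c_i itself if it has no mapping.
    alternatives = [mappings.get(c) or [c] for c in constants]
    union = []
    for combo in product(*alternatives):       # each c' in A_1 x ... x A_n
        subst = dict(zip(constants, combo))
        union.append([(pred, tuple(subst.get(a, a) for a in args))
                      for pred, args in body])
    return union


# Body of the intermediate target query over O1 (abridged to three atoms).
q_prime = [("state", ("x", "Wisconsin")), ("lu1", ("x", "115")),
           ("LandParcel", ("x",))]
rewritten = constant_mapping(q_prime, ["Wisconsin", "115"], {"115": ["RT", "RM"]})
```

Here the constant "Wisconsin" has no mapping and is kept unchanged, so the result contains two query bodies, one using "RT" and one using "RM".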

6.6.3 Discussion

In this section we discuss several considerations that are related to our query processing

strategy and discuss some alternatives to our choices. First, we have assumed that the schema-

level mapping M between two ontologies is a full, certain mapping relative to the query to be

rewritten. More specifically, all relation atoms (including classes and properties) in the body

of the query should have been mapped to some atom in the other ontology, with the mapping

type being ⊇ or ≡. In the case that nulls are not allowed in the queried atoms, this assumption

is necessary so as to get complete answers to the query. If nulls are allowed, the queried atoms

do not have to be fully mapped, since we can add null values at appropriate positions of the

answer returned by the target query.

Under the assumption of full mappings, we can prove the soundness of the QueryRewriting

algorithm, that is, that the rewriting (i.e., target query) 𝒬 it computes is contained in the source query

Q, denoted 𝒬 ⊆ Q, as sketched by the following: Let Q∗ be the expanded source query, Q′

be the intermediate target query, and Q′′ be the expanded intermediate target query. Given that

Q ≡ Q∗, Q′ ≡ Q′′, and Q′′ ≡ 𝒬 (2), it suffices to prove that Q′ ⊆ Q∗. Suppose that t is an

instance in the answer to Q′, i.e., t ∈ Q′(O), where O is the local ontology instance. Then t


makes every predicate R(x) in bodyQ′ true. According to Lines 5 to 8 of the QueryRewriting

algorithm, we have that every predicate S(x) in bodyQ∗ is also made true by t. This means

that t ∈ Q∗(G), where G is the global ontology instance, therefore Q′ ⊆ Q∗. We note that we

obtain a contained rewriting, instead of a maximally contained rewriting (70). This is actually

due to our preference for high precision rather than for high recall, which we discuss below.

Second, there are two important steps involved in the local-to-local query rewriting: query

conversion and mapping composition. The query conversion deals with the conversion of a

query (e.g., in XPath) native to the local system to a query (in c-RQL) on the local ontology.

However, c-RQL can only represent a particular class of XML queries that have the same

expressive power as c-RQL. We give below an example of query conversion.

Consider an XPath query /LandUse/LandParcel[BROAD="A"] as posed over O1 to retrieve

all the lands used for agriculture. The result of this query is the set of XML document trees rooted

at the LandParcel element (see Figure 37). By considering the answer structure and semantics

of the query, we convert the XPath query into the following c-RQL query.

ans(x1, x2, x3, ...) :– area(x, x1), ..., jurisType(x, x2), jurisName(x, x3),

broad(x, x4), lu1(x, x5), lu2(x, x6), lu3(x, x7),

x4 = "A".

We note that all the elements and/or attributes involved in the XML answer tree and in

the predicates of the XPath query are covered in the c-RQL query.

The mapping composition is necessary to obtain the mappings M between two local source

schemas, based on which the QueryRewriting algorithm is performed. Similarly to the ontology


alignment, the composition depends on a set of inference rules, which derive a new mapping

from two or more existing mappings. The difference is that, while a mapping in the deduction

of the ontology alignment is between the same pair of ontologies, the two ontology mappings

involved in the composition are between two pairs of ontologies (O1, O2) and (O2, O3), with

a common intermediate ontology, O2. Let R1, R2, and R3 be three classes (or properties)

respectively from the ontologies O1, O2, and O3. Then, given the mapping from R1 to R2 and

that from R2 to R3, a new mapping R1 to R3 can be derived according to the following rules:

1) R1 = R3, if R1 = R2 and R2 = R3;

2) R1 ⊆ R3, if R1 ⊆ R2 and R2 ⊆ R3, or R1 = R2 and R2 ⊆ R3, or R1 ⊆ R2 and R2 = R3;

3) R1 ⊇ R3, if R1 ⊇ R2 and R2 ⊇ R3, or R1 = R2 and R2 ⊇ R3, or R1 ⊇ R2 and R2 = R3.
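The three composition rules can be summarized in a few lines of Python. The sketch below is illustrative only; the function name compose and the encoding of mapping types as the strings "=", "⊆", and "⊇" are assumptions.

```python
from typing import Optional

TYPES = ("=", "⊆", "⊇")


def compose(r12: str, r23: str) -> Optional[str]:
    """Compose the mapping R1-R2 with the mapping R2-R3 into a mapping R1-R3."""
    if r12 not in TYPES or r23 not in TYPES:
        raise ValueError("unsupported mapping type")
    if r12 == "=":
        return r23      # an equivalence preserves the other mapping type
    if r23 == "=":
        return r12
    if r12 == r23:
        return r12      # ⊆ with ⊆, or ⊇ with ⊇
    return None         # ⊆ with ⊇ (or vice versa): no mapping can be derived
```

For instance, compose("=", "⊆") returns "⊆" (rule 2), whereas compose("⊆", "⊇") returns None because none of the three rules applies.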

The last issue we discuss concerns the trade-off between the precision and recall of the query

processing. Currently, we do not consider the mapping type approximate (see Section 6.5.2).

Furthermore, the query rewriting algorithm only uses mappings that guarantee the correctness

of the query. For instance, given a mapping A ⊆ B and a query Q : {x|A(x)}, our query

rewriting algorithm would not rewrite Q to {x|B(x)}, unless the mapping was A ⊇ B or

A ≡ B. This ensures that we will not return to the user instances that do not belong to A.

But we may miss some instances of B that are also instances of A and should be included in

the answer to Q, thus lowering recall.

An alternative is to allow the approximate semantic relationship and to assign a score in

[0, 1] to every mapping based on the similarity of the mapped classes or properties. Thus,


query rewriting can calculate an estimated precision of the target query. The consideration of

the approximate type can increase the automation of the ontology alignment process. In par-

ticular, we can integrate existing similarity-based ontology (or schema) matching methods (as

described in Section 6.2) in our alignment algorithm, so that the user does not have to interact

with the alignment process to disambiguate those mappings that cannot be inferred by the deduction

rules. Instead, the disambiguation can be performed by the similarity-based method,

and a similarity score can be assigned to the mapping. We have partially implemented this idea

in our ontology alignment interface (using the criterion matching by definition). Using this

approach, while the answer to a query may contain false positives, the recall will increase.

In practice, different scenarios impose different requirements on the mappings. For exam-

ple, an eCommerce application involving purchase orders requires a very precise and complete

translation of a query, whereas a search engine usually does not require an exact transforma-

tion (34).

6.7 Summary

In this chapter, we focused on data integration and interoperability across distributed

geospatial data sources. To illustrate the impact of our approach in a local eGovernment

setting, we showed practical examples that are derived from land use applications in the

Wisconsin Land Information System (WLIS) project. The data heterogeneities in such applications

include schematic heterogeneities, resulting from the fact that each county and each

municipality may have a different schema for its data, and semantic heterogeneities, resulting from

the different classification schemes for the land use data.

We propose an ontology-based approach to achieve the integration and interoperability of the

distributed geospatial data sources by solving both schematic and semantic heterogeneities. In

particular, we use a local ontology to represent both the local source schema and the taxonomy

used to encode instances—land use codes in our application—based on a schema transforma-

tion process. Similarly, the global ontology that models the eGovernment application domain

consists of two components: the global schema and the global land use taxonomy.

To achieve data interoperability, two different kinds of mappings are established between

the global ontology and each local ontology: schema mappings between the schema of both

ontologies and instance mappings between the (land use) taxonomies of both ontologies. The

schematic and semantic heterogeneities are reconciled by these two kinds of mappings. We base

the ontology mapping process on a deduction procedure.

We have discussed two modes of query processing in our system, global-to-local and

local-to-local (or peer-to-peer). The former mode is used to accomplish the data integration task, by

rewriting a global query (on the global ontology) into the union of subqueries on the multiple

local ontologies. The latter mode enables peer-to-peer interoperation between any pair of

sources, by means of local-to-local query rewriting. Query rewriting in both modes uses the

previously established mappings. While global-to-local query rewriting uses the mappings from

the global ontology to all local ontologies, local-to-local rewriting is based on the composition

of the mappings from the local ontology of the source database to the global ontology, followed

by the mappings from the global ontology to the local ontology of the target database. We

propose a c-RQL (conjunctive RQL) query rewriting algorithm, such that the resulting target

query is contained in the source query, thus providing sound answers to the source query.
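The two query processing modes can be sketched in Python as follows. This is a deliberately simplified illustration, not the c-RQL rewriting algorithm: mappings are reduced to term-to-term dictionaries (that is, only equivalence mappings are modeled), and the names `compose`, `rewrite`, `global_to_local`, and the county and land use terms are hypothetical.

```python
from typing import Dict, List, Optional

# Simplifying assumption: a mapping is a dictionary from terms of one
# ontology to equivalent terms of another ontology.
Mapping = Dict[str, str]


def compose(source_to_global: Mapping, global_to_target: Mapping) -> Mapping:
    """Local-to-local mapping obtained by going through the global ontology."""
    return {
        s: global_to_target[g]
        for s, g in source_to_global.items()
        if g in global_to_target
    }


def rewrite(query_terms: List[str], mapping: Mapping) -> Optional[List[str]]:
    """Rewrite a conjunctive query term by term; None if any term is unmapped."""
    if any(term not in mapping for term in query_terms):
        return None
    return [mapping[term] for term in query_terms]


def global_to_local(
    query_terms: List[str], global_to_locals: Dict[str, Mapping]
) -> Dict[str, List[str]]:
    """One subquery per local source; the global answer is the union of their answers."""
    subqueries = {}
    for source, mapping in global_to_locals.items():
        rewritten = rewrite(query_terms, mapping)
        if rewritten is not None:
            subqueries[source] = rewritten
    return subqueries


if __name__ == "__main__":
    global_to_a = {"LandUse": "lu_code", "Residential": "R1"}
    global_to_b = {"LandUse": "use_type", "Residential": "RES"}
    a_to_global = {local: glob for glob, local in global_to_a.items()}

    # Global-to-local: a global query becomes one subquery per source.
    print(global_to_local(["LandUse", "Residential"],
                          {"county_a": global_to_a, "county_b": global_to_b}))

    # Local-to-local: compose county_a -> global with global -> county_b.
    a_to_b = compose(a_to_global, global_to_b)
    print(rewrite(["lu_code", "R1"], a_to_b))
```

Routing local-to-local rewriting through the global ontology means that each source maintains mappings only to the global ontology, rather than to every other source.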

Future work will focus on the following two topics: 1) Ontology alignment, and in particular

the deduction-based method. Currently, we make some assumptions on the topology of the

ontologies. Without such assumptions, we may need to consider the combination of our bottom-

up deduction process with top-down reasoning on mappings (e.g., (100)). 2) We will further

extend our query rewriting algorithm, so that it can take into account “approximate” mappings.

In this case, the precision and recall of query answering will depend on the similarity of the

underlying mappings, thus making the ability to determine mapping similarities a critical task.

CHAPTER 7

CONCLUSIONS

Data heterogeneity is the primary handicap in achieving data interoperability among dis-

tributed data sources. To build an integrated system, either in a centralized architecture or in

a peer-to-peer architecture, we have to resolve the different levels of heterogeneities, including

syntactic, schematic, and semantic heterogeneities. The focus of this thesis is then on the ap-

plication of Semantic Web technologies, centered on ontologies, to data interoperability so as

to achieve semantic data integration. We have proposed an ontology-based approach for both

central and peer-to-peer data integration. In doing so, we discuss the three fundamental issues,

including metadata representation, mapping process, and query processing.

Our work, as presented through five scenarios in this thesis, can be summarized as

follows:

1. In our ontology-based framework for centralized integration of XML data sources, we

consider the problem that semantically equivalent XML documents can present different

document structures, caused by the lack of explicit semantics in XML. The ontology-

based approach enables the interoperation of XML documents at the semantic level while

retaining their nesting structure. A global RDFS ontology is generated by merging the

local RDFS ontologies that are generated from each of the XML documents. By means of

the mappings established between the global ontology and local XML schemas, we are able

to process queries in two modes: from the global ontology to local sources and from a local

source to another one. For both modes, we propose a query rewriting algorithm, which

is shown to be an equivalent rewriting algorithm. In doing so, we discuss the problem

of query containment for two query languages, namely conjunctive RDQL (c-RDQL) and

conjunctive XQuery (c-XQuery).

2. We have proposed a hybrid peer-to-peer framework, PEPSINT, for the integration of XML

and RDF data sources. We discuss the construction of the architecture, maintenance of

mappings, and query processing in PEPSINT. The data integration is implemented at

the schema-level through the schema matching process and at instance-level through the

query answering process. A key aspect in both processes is the preservation of the domain

and document structure, which enables both the integration of source schemas and that

of answers from different local queries that may have different structures. Furthermore,

user queries can be correctly propagated across the network of heterogeneous XML and

RDF data sources, so that information access within the network is transparent to the

user.

3. An ontology-based approach has been proposed to solve the data interoperability problem

in a heterogeneous pure P2P network. RDF and related techniques are used extensively

in our approach, including RDFS local ontologies for metadata representation

and the RDFMS meta-ontology for inter-schema mapping representation.

Based on the RDFMS meta-ontology, we introduce a P2P mapping language,

namely PML, which is used to express the mappings based on its first-order logic seman-

tics. The P2P query answering in the system considers constraints defined over local data

sources.

4. The data interoperability problem exists in the management of personal information

within and across desktops. We propose a layered multi-ontology based framework,

called MOSE, which aims to provide a semantics-rich environment for personal infor-

mation organization and manipulation. In particular, we focus on the semantic-enriched

data organization, including data annotation using domain ontologies, data association

by means of a network consisting of the various ontologies and their instances, and data

representation. We also propose an MVC-based approach for personal information ap-

plication (PIA) development using the PIA designer. Based on that, we have formalized

the notion of desktop service, based on the concept of parameterized channel. The data

interoperability in MOSE can be realized in two ways, one by means of desktop services

and the other by means of query processing across desktops. We discuss two cases of

query processing: within a single PIA and between two PIAs.

5. Finally, we illustrate the impact of our ontology-based approach to the data interoper-

ability problem in a local eGovernment setting. We propose an ontology-based approach

to achieve the integration and interoperability of the distributed geospatial data sources

by solving both schematic and semantic heterogeneities. Both kinds of heterogeneities

are reconciled by two different kinds of mappings that are established between the global

ontology and each local ontology: schema mappings between the schema of both ontolo-

gies and instance mappings between the (land use) taxonomies of both ontologies. We

propose a c-RQL (conjunctive RQL) query rewriting algorithm, for two cases of query

processing in the system.

CITED LITERATURE

1. Serge Abiteboul and Oliver M. Duschka. Complexity of Answering Queries Using Materialized Views. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 1998), pages 254–263, 1998.

2. Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.

3. Bernd Amann, Catriel Beeri, Irini Fundulaki, and Michel Scholl. Ontology-Based Integration of XML Web Resources. In Proceedings of the 1st International Semantic Web Conference (ISWC 2002), pages 117–131, 2002.

4. Bernd Amann, Catriel Beeri, Irini Fundulaki, and Michel Scholl. Querying XML Sources Using an Ontology-Based Mediator. In Proc. of the Confederated International Conferences DOA, CoopIS and ODBASE, LNCS 2519, Springer, 2002.

5. Bernd Amann, Irini Fundulaki, Michel Scholl, Catriel Beeri, and Anne-Marie Vercoustre. Mapping XML Fragments to Community Web Ontologies. In Proceedings of the 4th International Workshop on the Web and Databases (WebDB 2001), pages 97–102, 2001.

6. Marcelo Arenas, Vasiliki Kantere, Anastasios Kementsietsidis, Iluju Kiringa, Renee J. Miller, and John Mylopoulos. The Hyperion Project: From Data Integration to Data Coordination. SIGMOD Record, 32(3):53–58, 2003.

7. Yigal Arens, Craig A. Knoblock, and Chunnan Hsu. Query Processing in the SIMS Information Mediator. In The AAAI Press, May 1996.

8. David Aumueller and Soren Auer. Towards a Semantic Wiki Experience – Desktop Integration and Interactivity in WikSAR. In Proc. of the 1st ISWC Workshop on The Semantic Desktop, 2005.

9. David Aumueller, Hong Hai Do, Sabine Massmann, and Erhard Rahm. Schema and Ontology Matching with COMA++. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 906–908, 2005.

10. Sonia Bergamaschi, Silvana Castano, and Maurizio Vincini. Semantic Integration of Semistructured and Structured Data Sources. SIGMOD Record, 28(1):54–59, 1999.

11. Sonia Bergamaschi, Francesco Guerra, and Maurizio Vincini. A Peer-to-Peer Information System for the Semantic Web. In Proceedings of the International Workshop on Agents and Peer-to-Peer Computing (AP2PC 2003), July 2003.

12. Matthew Berland and Eugene Charniak. Finding Parts in Very Large Corpora. In ACL, 1999.

13. Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web. Scientific American, May 2001.

14. Philip A. Bernstein. Applying Model Management to Classical Meta Data Problems. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR 2003), 2003.

15. Philip A. Bernstein, Fausto Giunchiglia, Anastasios Kementsietsidis, John Mylopoulos, Luciano Serafini, and Ilya Zaihrayeu. Data Management for Peer-to-Peer Computing: A Vision. In WebDB 2002, pages 89–94, 2002.

16. Yaser A. Bishr. Overcoming the semantic and other barriers to GIS interoperability. International Journal of Geographical Information Science, 12(4):229–314, 1998.

17. Christian Bizer. D2R MAP - A Database to RDF Mapping Language. In Proceedings of the 12th International World Wide Web Conference (WWW 2003), 2003.

18. Stephan Bloehdorn, Kosmas Petridis, Carsten Saathoff, Nikos Simou, Vassilis Tzouvaras, Yannis S. Avrithis, Siegfried Handschuh, Ioannis Kompatsiaris, Steffen Staab, and Michael G. Strintzis. Semantic Annotation of Images and Videos for Multimedia Analysis. In ESWC 2005, pages 592–607, 2005.

19. Scott Boag, Don Chamberlin, Mary F. Fernandez, Jonathan Robie, Daniela Florescu, and Jerome Simeon. XQuery 1.0: An XML Query Language. http://www.w3.org/TR/xquery, W3C Working Draft.

20. Paolo Bouquet, Fausto Giunchiglia, Frank van Harmelen, Luciano Serafini, and Heiner Stuckenschmidt. C-OWL: Contextualizing Ontologies. In Proc. of ISWC 2003, pages 164–179, 2003.

21. Ronald Bourret. XML and Databases. http://www.rpbourret.com/xml/XMLAndDatabases.htm, 2004.

22. Dan Brickley, R.V. Guha, and Brian McBride. RDF Vocabulary Description Language 1.0: RDF Schema. http://www.w3.org/TR/rdf-schema, February 2004.

23. Vannevar Bush. As We May Think. The Atlantic Monthly, 176(1):101–108, 1945.

24. Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. On the Expressive Power of Data Integration Systems. In Proceedings of the 21st International Conference on Conceptual Modeling (ER 2002), pages 338–350, 2002.

25. Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Paolo Naggar, and Fabio Vernacotola. IBIS: Semantic Data Integration at Work. In Proceedings of the 15th Conference on Advanced Information Systems Engineering (CAiSE 2003), pages 79–94, 2003.

26. Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. What to Ask to a Peer: Ontology-based Query Reformulation. In Proceedings of the 9th International Conference on Principles of Knowledge Representation and Reasoning (KR 2004), pages 469–478, 2004.

27. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. View-Based Query Processing and Constraint Satisfaction. In The 15th Annual IEEE Symposium on Logic in Computer Science (LICS 2000), pages 361–371, 2000.

28. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. View-based Query Containment. In Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003), pages 56–67, 2003.

29. Sandro Daniel Camillo, Carlos A. Heuser, and Ronaldo dos Santos Mello. Querying Heterogeneous XML Sources through a Conceptual Schema. In Proceedings of the 22nd International Conference on Conceptual Modeling (ER 2003), pages 186–199, 2003.

30. Silvana Castano, Valeria De Antonellis, and Sabrina De Capitani di Vimercati. Global Viewing of Heterogeneous Data Sources. IEEE Transactions on Knowledge and Data Engineering, 13(2):277–297, 2001.

31. Yi Chen and Peter Revesz. CXQuery: A Novel XML Query Language. In Proceedings of International Conference on Advances in Infrastructure for Electronic Business, Science, and Medicine on the Internet (SSGRR 2002w), 2002.

32. Vassilis Christophides, Gregory Karvounarakis, I. Koffina, Giorgos Kokkinidis, Aimilia Magkanaraki, Dimitris Plexousakis, G. Serfiotis, and Val Tannen. The ICS-FORTH SWIM: A Powerful Semantic Web Integration Middleware. In SWDB 2003, pages 381–393, 2003.

33. Jeff Conklin. Hypertext: An Introduction and Survey. IEEE Computer, 20(9):17–41, 1987.

34. Valerie Cross. Uncertainty in the Automation of Ontology Matching. In Proc. of the 4th International Symposium on Uncertainty Modeling and Analysis (ISUMA), pages 135–140, 2003.

35. Isabel F. Cruz, Afsheen Rajendran, and William Sunna. XML Database Integration for Visualizing US Election Results. In Proc. of the National Conference on Digital Government Research (dg.o), pages 403–406, 2002.

36. Isabel F. Cruz, Afsheen Rajendran, William Sunna, and Nancy Wiegand. Handling Semantic Heterogeneities using Declarative Agreements. In Proc. of ACM GIS 10th International Symposium on Advances in Geographic Information Systems, pages 168–174, 2002.

37. Isabel F. Cruz, William Sunna, and Anjli Chaudhry. Semi-Automatic Ontology Alignment for Geospatial Data Integration. In Proc. of the 3rd Int. Conf. on GIScience, pages 51–66, 2004.

38. Isabel F. Cruz and Huiyong Xiao. Using a Layered Approach for Interoperability on the Semantic Web. In Proceedings of the 4th International Conference on Web Information Systems Engineering (WISE 2003), pages 221–232, Rome, Italy, December 2003.

39. Isabel F. Cruz, Huiyong Xiao, and Feihong Hsu. An Ontology-based Framework for Semantic Interoperability between XML Sources. In Proceedings of the 8th International Database Engineering & Applications Symposium (IDEAS 2004), pages 217–226, July 2004.

40. Isabel F. Cruz, Huiyong Xiao, and Feihong Hsu. Peer-to-Peer Semantic Integration of XML and RDF Data Sources. In The 3rd International Workshop on Agents and Peer-to-Peer Computing (AP2PC 2004), July 2004.

41. Stefan Decker and Martin Frank. The Social Semantic Desktop. In Proc. of the WWW Workshop Application Design, Development and Implementation Issues in the Semantic Web, 2004.

42. Stefan Decker, Sergey Melnik, Frank van Harmelen, Dieter Fensel, Michel C. A. Klein, Jeen Broekstra, Michael Erdmann, and Ian Horrocks. The Semantic Web: The Roles of XML and RDF. IEEE Internet Computing, 4(5):63–74, 2000.

43. AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Y. Halevy. Learning to Map between Ontologies on the Semantic Web. In Proc. of the 11th International World Wide Web Conference (WWW), pages 662–673, 2002.

44. Xin Dong and Alon Y. Halevy. A Platform for Personal Information Management and Integration. In CIDR, pages 119–130, 2005.

45. Paul Dourish, W. Keith Edwards, Anthony LaMarca, John Lamping, Karin Petersen, Michael Salisbury, Douglas B. Terry, and James Thornton. Extending Document Management Systems with User-specific Active Properties. ACM Transactions on Information Systems, 18(2):140–170, 2000.

46. Marc Ehrig, Christoph Tempich, Jeen Broekstra, Frank van Harmelen, Marta Sabou, Ronny Siebes, Steffen Staab, and Heiner Stuckenschmidt. SWAP - Ontology-based Knowledge Management with Peer-to-Peer Technology. In Proc. of WOW 2003, 2003.

47. Mary Fernandez, Ashok Malhotra, Jonathan Marsh, Marton Nagy, and Norman Walsh. XQuery 1.0 and XPath 2.0 Data Model. http://www.w3.org/TR/xpath-datamodel, W3C Working Draft, October 2004.

48. Davide Fossati, Gabriele Ghidoni, Barbara Di Eugenio, Isabel F. Cruz, Huiyong Xiao, and Rajen Subba. The Problem of Ontology Alignment on the Web: a First Report. In Proc. of the 2nd Web as Corpus Workshop (associated with the 11th Conference of the European Chapter of the ACL), pages 51–58, 2006.

49. Enrico Franconi, Gabriel M. Kuper, Andrei Lopatenko, and Ilya Zaihrayeu. A Distributed Algorithm for Robust Data Sharing and Updates in P2P Database Networks. In Current Trends in Database Technology - EDBT 2004 Workshops, LNCS 3268, Springer, pages 446–455, 2004.

50. Eric Freeman and David Gelernter. Lifestreams: A Storage Model for Personal Data. SIGMOD Record, 25(1):80–86, 1996.

51. Jim Gemmell, Gordon Bell, Roger Lueder, Steven M. Drucker, and Curtis Wong. MyLifeBits: Fulfilling the Memex Vision. In ACM Multimedia, pages 235–238, 2002.

52. Li Gong. JXTA: A Network Programming Environment. IEEE Internet Computing, 5(3):88–95, May 2001.

53. Michael Goodchild. Spatially Enabled E-Government. 8th International Seminar on GIS (Keynote Talk), Seoul, Korea, November 2003. http://www.csiss.org/aboutus/presentations/files/goodchild seoul nov03.pdf.

54. Thomas R. Gruber and Gregory R. Olsen. An Ontology for Engineering Mathematics. In Proceedings of the 4th International Conference on Principles of Knowledge Representation and Reasoning (KR 1994), pages 258–269, 1994.

55. Tom R. Gruber. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5(2):199–220, 1993.

56. Nicola Guarino. Formal Ontology and Information Systems. In Proceedings of the 1st International Conference on Formal Ontologies in Information Systems (FOIS 1998), pages 3–15, 1998.

57. Peter Haase, Jeen Broekstra, Marc Ehrig, Maarten Menken, Peter Mika, Mariusz Olko, Michal Plechawski, Pawel Pyszlak, Bjorn Schnizler, Ronny Siebes, Steffen Staab, and Christoph Tempich. Bibster - A Semantics-Based Bibliographic Peer-to-Peer System. In Proc. of ISWC 2004, pages 122–136, 2004.

58. Alon Y. Halevy. Answering Queries Using Views: A Survey. VLDB Journal, 10(4):270–294, 2001.

59. Alon Y. Halevy, Zachary G. Ives, Peter Mork, and Igor Tatarinov. Piazza: Data Management Infrastructure for Semantic Web Applications. In Proceedings of the 12th International World Wide Web Conference (WWW 2003), pages 556–567, 2003.

60. Mauricio A. Hernandez, Renee J. Miller, and Laura M. Haas. Clio: A Semi-Automatic Tool For Schema Mapping (demo). In Proc. of the ACM SIGMOD International Conference on Management of Data, page 607, 2001.

61. Eduard H. Hovy and Stefan Falke. Automating the Integration of Heterogeneous Databases. In Proc. of the 2004 National Conference on Digital Government Research (dg.o), 2004.

62. HP Labs. RDQL - RDF Data Query Language. http://www.hpl.hp.com/semweb/rdql.htm, 2005.

63. Ryutaro Ichise, Hideaki Takeda, and Shinichi Honiden. Rule Induction for Concept Hierarchy Alignment. In Proc. of the Workshop on Ontologies and Information Sharing at the 17th International Joint Conference on Artificial Intelligence (IJCAI), 2001.

64. Yannis Kalfoglou and Marco Schorlemmer. Ontology Mapping: the State of the Art. The Knowledge Engineering Review, 18(1):1–31, 2003.

65. Gregory Karvounarakis, Sofia Alexaki, Vassilis Christophides, Dimitris Plexousakis, and Michel Scholl. RQL: a declarative query language for RDF. In Proceedings of the 11th International World Wide Web Conference (WWW 2002), pages 592–603, 2002.

66. Michel C. A. Klein. Interpreting XML Documents via an RDF Schema Ontology. In Proceedings of the 13th International Workshop on Database and Expert Systems Applications (DEXA 2002), pages 889–894, 2002.

67. Glenn E. Krasner and Stephen T. Pope. A Cookbook for Using the Model-View-Controller User Interface Paradigm in Smalltalk-80. Journal of Object-Oriented Programming, 1(3):26–49, August/September 1988.

68. Laks V. S. Lakshmanan and Fereidoon Sadri. Interoperability on XML Data. In Proceedings of the 2nd International Semantic Web Conference (ISWC 2003), pages 146–163, 2003.

69. Patrick Lehti and Peter Fankhauser. XML Data Integration with OWL: Experiences and Challenges. In 2004 Symposium on Applications and the Internet (SAINT 2004), pages 160–170, 2004.

70. Maurizio Lenzerini. Data Integration: A Theoretical Perspective. In Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2002), pages 233–246, Madison, Wisconsin, June 2002. ACM.

71. Ron Li, Keith W. Bedford, C. K. Shum, Xutong Niu, Feng Zhou, Vasilia Velissariou, J. Raul Ramirez, and Aidong Zhang. Integration of Multidimensional Geospatial Information for Coastal Management and Decision-Making. In Proc. of the 2005 National Conference on Digital Government Research (dg.o), page 231, 2005.

72. Yunyao Li, Huahai Yang, and H. V. Jagadish. NaLIX: an Interactive Natural Language Interface for Querying XML. In SIGMOD 2005 (Poster).

73. Hugo Lueders. Interoperability and Open Standards for eGovernment Services. http://xml.coverpages.org/Comptia-ISC-OpenStandards.pdf, January 2005.

74. Alexander Maedche, Boris Motik, Nuno Silva, and Raphael Volz. MAFRA - A MApping FRAmework for Distributed Ontologies. In Proc. of EKAW 2002, pages 235–250, 2002.

75. David Maier and Lois M. L. Delcambre. Superimposed Information for the Internet. In WebDB, pages 1–9, 1999.

76. Inderjeet Mani. Recent Developments in Text Summarization. In CIKM, pages 529–531, 2001.

77. Frank Manola, Eric Miller, and Brian McBride. RDF Primer. http://www.w3.org/TR/rdf-primer, February 2004.

78. Byron Marshall, Siddharth Kaza, Jennifer Jie Xu, Homa Atabakhsh, Tim Petersen, Chuck Violette, and Hsinchun Chen. Cross-Jurisdictional Activity Networks to Support Criminal Investigations. In Proc. of the 2004 National Conference on Digital Government Research (dg.o), 2004.

79. Deborah L. McGuinness, Richard Fikes, James Rice, and Steve Wilder. An Environment for Merging and Testing Large Ontologies. In Proc. of the 7th International Conference on Principles of Knowledge Representation and Reasoning (KR), pages 483–493, 2000.

80. Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching. In Proc. of the International Conference on Data Engineering (ICDE), pages 117–128, 2002.

81. Eduardo Mena, Vipul Kashyap, Amit P. Sheth, and Arantza Illarramendi. OBSERVER: An Approach for Query Processing in Global Information Systems based on Interoperation across Pre-existing Ontologies. In Proceedings of the 1st IFCIS International Conference on Cooperative Information Systems (CoopIS 1996), pages 14–25, 1996.

82. Todd D. Millstein, Alon Y. Halevy, and Marc Friedman. Query Containment for Data Integration Systems. Journal of Computer and System Sciences, 66(1):20–39, 2003.

83. Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, and Zhichen Xu. Peer-to-Peer Computing. Technical Report HPL-2002-57, HP Laboratories Palo Alto, 2002.

84. Gianluca Moro, Aris M. Ouksel, and Claudio Sartori. Agents and Peer-to-Peer Computing: A Promising Combination of Paradigms. In Proceedings of the 1st International Workshop of Agents and Peer-to-Peer Computing (AP2PC 2002), pages 1–14, 2002.

85. Wolfgang Nejdl, Boris Wolf, Changtao Qu, Stefan Decker, Michael Sintek, Ambjorn Naeve, Mikael Nilsson, Matthias Palmer, and Tore Risch. EDUTELLA: A P2P Networking Infrastructure Based on RDF. In Proceedings of the 11th International World Wide Web Conference (WWW 2002), 2002.

86. Wee Siong Ng, Beng Chin Ooi, Kian Lee Tan, and Aoying Zhou. PeerDB: A P2P-based System for Distributed Data Sharing. In Proceedings of the 19th International Conference on Data Engineering (ICDE 2003), pages 633–644, 2003.

87. Natalya Fridman Noy. Semantic Integration: A Survey Of Ontology-Based Approaches. SIGMOD Record, 33(4):65–70, 2004.

88. Natalya Fridman Noy and Mark A. Musen. PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment. In Proceedings of the 17th National Conference on Artificial Intelligence and 12th Conference on Innovative Applications of Artificial Intelligence (AAAI/IAAI 2000), pages 450–455, 2000.

89. Natalya Fridman Noy and Mark A. Musen. Anchor-PROMPT: Using Non-local Context for Semantic Matching. In Proc. of the Workshop on Ontologies and Information Sharing at the 17th International Joint Conference on Artificial Intelligence (IJCAI), 2001.

90. Borys Omelayenko. RDFT: A Mapping Meta-Ontology for Web Service Integration. In Knowledge Transformation for the Semantic Web 2003, pages 137–153, 2003.

91. Eyal Oren. SemperWiki: a Semantic Personal Wiki. In Proc. of the 1st ISWC Workshop on The Semantic Desktop, 2005.

92. Luigi Palopoli, Domenico Sacca, and Domenico Ursino. An Automatic Technique for Detecting Type Conflicts in Database Schemes. In Proc. of the 7th International Conference on Information and Knowledge Management (CIKM), pages 306–313, 1998.

93. Yannis Papakonstantinou, Hector Garcia-Molina, and Jennifer Widom. Object Exchange Across Heterogeneous Information Sources. In Proceedings of the 11th International Conference on Data Engineering (ICDE 1995), pages 251–260, 1995.

94. Peter F. Patel-Schneider and Jerome Simeon. The Yin/Yang web: XML syntax and RDF semantics. In Proceedings of the 11th International World Wide Web Conference (WWW 2002), pages 443–453, July 2002.

95. Martin Peim, Enrico Franconi, Norman W. Paton, and Carole A. Goble. Query Processing with Description Logic Ontologies Over Object-Wrapped Databases. In Proc. of the 14th International Conference on Scientific and Statistical Database Management (SSDBM), pages 27–36, 2002.

96. Chris Peltz. Web Services Orchestration and Choreography. Computer, 36(10):46–52, 2003.

97. Lucian Popa, Yannis Velegrakis, Renee J. Miller, Mauricio A. Hernandez, and Ronald Fagin. Translating Web Data. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002), pages 598–609, 2002.

98. Dennis Quan, David Huynh, and David R. Karger. Haystack: A Platform for Authoring End User Semantic Web Applications. In ISWC, pages 738–753, 2003.

99. Erhard Rahm and Philip A. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 10(4):334–350, 2001.

100. M. Andrea Rodríguez and Max J. Egenhofer. Determining Semantic Similarity among Entity Classes from Different Ontologies. IEEE Transactions on Knowledge and Data Engineering, 15(2):442–456, 2003.

101. Ozgur D. Sahin, Abhishek Gupta, Divyakant Agrawal, and Amr El Abbadi. Query Processing Over Peer-To-Peer Data Sharing Systems. Technical Report CSD-2002-28, University of California at Santa Barbara, 2002.

102. Gerard Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.

103. Leo Sauermann. The Gnowsis Semantic Desktop for Information Integration. In The 3rd Conference on Professional Knowledge Management, pages 39–42, 2005.

104. Leo Sauermann, Ansgar Bernardi, and Andreas Dengel. Overview and Outlook on the Semantic Desktop. In Proc. of the 1st ISWC Workshop on The Semantic Desktop, 2005.

105. Leon A. Shklar, Amit P. Sheth, Vipul Kashyap, and Kshitij Shah. InfoHarness: Use of Automatically Generated Metadata for Search and Retrieval of Heterogeneous Information. In Proceedings of the 7th Conference on Advanced Information Systems Engineering (CAiSE 1995), pages 217–230, 1995.

106. Pavel Shvaiko and Jerome Euzenat. A Survey of Schema-Based Matching Approaches. Journal of Data Semantics, 4:146–171, 2005.

107. Heiner Stuckenschmidt. Query Processing on the Semantic Web. Kunstliche Intelligenz (KI), 17(3):22–, 2003.

108. Gerd Stumme and Alexander Maedche. Ontology Merging for Federated Ontologies for the Semantic Web. In Proceedings of the International Workshop on Foundations of Models for Information Integration (FMII 2001), pages 16–18, 2001.

109. Jeffrey D. Ullman. Information Integration Using Logical Views. In Proceedings of the 6th International Conference on Database Theory (ICDT 1997), pages 19–40, 1997.

110. Ron van der Meyden. Logical Approaches to Incomplete Information: A Survey. In Logics for Databases and Information Systems, pages 307–356, 1998.

111. Holger Wache, Thomas Vogele, Ubbo Visser, Heiner Stuckenschmidt, G. Schuster, H. Neumann, and S. Hubner. Ontology-Based Integration of Information - A Survey of Existing Approaches. In Proceedings of the IJCAI-01 Workshop on Ontologies and Information Sharing, 2001.

112. Nancy Wiegand, Dan Patterson, Naijun Zhou, Steve Ventura, and Isabel F. Cruz. Querying Heterogeneous Land Use Data: Problems and Potential. In National Conference for Digital Government Research (dg.o), pages 115–121, 2002.

113. Huiyong Xiao and Isabel F. Cruz. RDF-based Metadata Management in Peer-to-Peer Systems. In The 2nd IST Workshop on Metadata Management in Grid and P2P System (MMGPS 2004), 2004.

114. Huiyong Xiao and Isabel F. Cruz. Integrating and Exchanging XML Data Using Ontologies. LNCS Journal on Data Semantics, Springer Verlag, 2006. (To appear).

115. Huiyong Xiao and Isabel F. Cruz. Ontology-based Query Rewriting in Peer-to-Peer Networks. In Proc. of the 2nd International Conference on Knowledge Engineering and Decision Support, 2006.

116. Huiyong Xiao, Isabel F. Cruz, and Feihong Hsu. Semantic Mappings for the Integration of XML and RDF Sources. In Workshop on Information Integration on the Web (IIWeb 2004), August 2004.

117. Cong Yu and Lucian Popa. Constraint-Based XML Query Rewriting For Data Integration. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 371–382.

VITA

NAME: Huiyong Xiao

EDUCATION:

Ph.D., Computer Science, University of Illinois at Chicago, Chicago, Illinois, 2006.

M.S., Computer Science, Tsinghua University, Beijing, China, 2002.

B.S., Computer Science, Huazhong University of Sci. and Tech., Wuhan, China, 1999.

PUBLICATIONS:

1. Huiyong Xiao and Isabel F. Cruz. Integrating and Exchanging XML Data using Ontolo-

gies. Journal of Data Semantics, 2006 (To appear).

2. Huiyong Xiao and Isabel F. Cruz. Ontology-based Query Rewriting in Peer-to-Peer Net-

works. In Proceedings of The 2nd International Conference on Knowledge Engineering

and Decision Support, pages 11-18, May, 2006.

3. Davide Fossati, Gabriele Ghidoni, Barbara Di Eugenio, Isabel F. Cruz, Huiyong Xiao, and

Rajen Subba. The Problem of Ontology Alignment on the Web: a First Report. In Pro-

ceedings of The 2nd Web as Corpus Workshop (in conjunction with the 11th Conference

of the European Chapter of the ACL), pages 51-58, April, 2006.

4. Isabel F. Cruz and Huiyong Xiao. The Role of Ontologies in Data Integration. Journal

of Engineering Intelligent Systems, 13(4):245-252, December, 2005.

5. Huiyong Xiao and Isabel F. Cruz. A Multi-Ontology Approach for Personal Information

Management. In Proceedings of The 1st Workshop on Semantic Desktop (in conjunction

with the 4th International Conference of Semantic Web), pages 19-33, November, 2005.

6. Huiyong Xiao and Isabel F. Cruz. RDF-based Metadata Management in Peer-to-Peer

Systems. The 2nd IST Workshop on Metadata Management in Grid and P2P System

(MMGPS), December, 2004.

7. Huiyong Xiao, Isabel F. Cruz, and Feihong Hsu. Semantic Mappings for the Integration

of XML and RDF Sources. Proceedings of VLDB Workshop on Information Integration

on the Web (IIWeb), pages 40-45, August, 2004.

8. Isabel F. Cruz, Huiyong Xiao, and Feihong Hsu. Peer-to-Peer Semantic Integration of

XML and RDF Data Sources. The 3rd International Workshop on Agents and Peer-to-

Peer Computing (AP2PC), July, 2004. LNCS 3601, Springer 2005.

9. Isabel F. Cruz, Huiyong Xiao, and Feihong Hsu. An Ontology-based Framework for

Semantic Interoperability between XML Sources. In Proceedings of the 8th International

Database Engineering and Applications Symposium (IDEAS), pages 217-226, July, 2004.

IEEE Computer Society 2004.

10. Isabel F. Cruz and Huiyong Xiao. Using a Layered Approach for Interoperability on the

Semantic Web. In Proceedings of the 4th International Conference on Web Information

Systems Engineering (WISE), pages 221-232, December, 2003. IEEE Computer Society

2003.
