Improving XML schema matching performance using Prüfer sequences


Transcript of Improving XML schema matching performance using Prüfer sequences

Page 1: Improving XML schema matching performance using Prüfer sequences

Data & Knowledge Engineering 68 (2009) 728–747

Contents lists available at ScienceDirect

Data & Knowledge Engineering

journal homepage: www.elsevier.com/locate/datak

Improving XML schema matching performance using Prüfer sequences q

Alsayed Algergawy *, Eike Schallehn, Gunter Saake

Department of Computer Science, Otto-von-Guericke University, 39016 Magdeburg, Germany

Article info

Article history: Received 30 June 2008; Received in revised form 8 January 2009; Accepted 11 January 2009; Available online 22 January 2009

Keywords: XML schema matching; Prüfer sequences; Structural matching; Matching performance

0169-023X/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2009.01.001

q This work is a revised, extended version of the papers presented in [2,3].
* Corresponding author. Tel.: +49 391 67 11603; fax: +49 391 67 12020.

E-mail addresses: [email protected] (A. Algergawy), [email protected] (E. Schallehn), [email protected] (G. Saake).

Abstract

Schema matching is a critical step for discovering semantic correspondences among elements in many data-sharing applications. Most existing schema matching algorithms produce scores between schema elements, resulting in the discovery of only simple matches. Such results solve the problem only partially. Identifying and discovering complex matches is considered one of the biggest obstacles towards completely solving the schema matching problem. Another obstacle is the scalability of matching algorithms to large numbers of schemas and to large-scale schemas. To tackle these challenges, in this paper we propose a new XML schema matching framework based on the use of Prüfer encoding. In particular, we develop and implement the XPrüM system, which consists mainly of two parts: schema preparation and schema matching. First, we parse XML schemas and represent them internally as schema trees. Prüfer sequences are constructed for each schema tree and employed to construct a sequence representation of schemas. We capture schema tree semantic information in Label Prüfer Sequences (LPS) and schema tree structural information in Number Prüfer Sequences (NPS). Then, we develop a new structural matching algorithm exploiting both LPS and NPS. To cope with complex matching discovery, we introduce the concept of compatible nodes to identify semantic correspondences across complex elements first; the matching process is then refined to identify correspondences among simple elements inside each pair of compatible nodes. Our experimental results demonstrate the performance benefits of the XPrüM system.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Schema matching is the task of identifying semantic correspondences among elements across different schemas. It plays a central role in many data application scenarios [36]: in data integration to identify and characterize inter-schema relationships across multiple (heterogeneous) schemas; in data warehousing to map data sources to a warehouse schema; in E-business to help map messages between different XML formats; in the Semantic Web to establish semantic correspondences between concepts of different ontologies [28]; in data migration to migrate legacy data from multiple sources into a new one [18]; and in XML data clustering to determine semantic similarities between XML data [39].

At the core of most of these data application scenarios, the eXtensible Markup Language (XML) has emerged as a standard for information representation, analysis, and exchange on the Web. Since XML provides data description features that are similar to those of advanced data models, XML is today supported by several database management systems, either as a native data model or on top of a conventional data model. As a result, XML databases on the Web are proliferating, and efforts to develop good information integration technology for the growing number of XML data sources have become




Page 2: Improving XML schema matching performance using Prüfer sequences


vital. Identifying semantic correspondences among heterogeneous data sources is the biggest obstacle to developing such an integrated schema. The process of identifying these correspondences across XML schemas is called XML schema matching [12].

As a result, a myriad of matching algorithms have been proposed and many systems for automatic schema matching have been developed [41,23]. However, most of these systems, such as Cupid [36], Similarity Flooding (SF) [38], COMA/COMA++ [11,12], LSD [13], BTreeMatch [20], OntoBuilder [25], S-Match [26], and SPICY [7], produce scores between schema elements, which results in discovering only simple (one-to-one) matchings. Such results solve the schema matching problem only partially. In order to completely solve the problem, a matching system should discover complex matchings as well as simple ones. Little work has addressed the problem of discovering complex matchings [10,29,30], because finding complex matches is considerably harder than discovering simple ones.

Additionally, existing schema matching systems rely heavily either on rule-based approaches or on learning-based approaches. Rule-based systems [36,11,12,26] represent schemas in a common data model, such as schema trees or schema graphs. Then, they apply their algorithms to the common data model, which in turn requires traversing schema trees (or schema graphs) many times. By contrast, learning-based systems [13,16] need much pre-match effort to train their learners. As a consequence, especially for large-scale schemas and in dynamic environments, matching efficiency declines radically. More recent schema matching systems have been developed in an attempt to improve matching efficiency [44,43]. However, they consider only simple matching. Therefore, discovering complex matchings while keeping schema matching scalable against both large numbers of schemas and large-scale schemas remains a real challenge.

Motivated by the above challenges and by the fact that the most prominent feature of an XML schema is its hierarchical structure, in this paper we propose a novel approach for matching XML schemas. In particular, we develop and implement the XPrüM system, which mainly consists of two parts: schema preparation and schema matching. Schemas are first parsed and represented internally using rooted ordered labeled trees, called schema trees. Then, we construct a Prüfer sequence for each schema tree. Prüfer sequences establish a one-to-one correspondence between schema trees and sequences [40]. We capture schema tree semantic information in the Label Prüfer Sequence (LPS) and schema tree structural information in the Number Prüfer Sequence (NPS). LPS is exploited by a linguistic matcher to compute terminological similarities among all schema elements, including both atomic and complex elements. Then, we first apply our new structural algorithm only to complex elements, exploiting NPS and the previously computed linguistic similarity. With the help of these similarity values we can identify what we call compatible nodes across different schemas. Finally, a matching refinement phase is carried out to discover corresponding atomic elements inside each compatible element pair. A series of experiments is conducted in order to evaluate our proposed system employing real data. Both performance aspects (effectiveness and efficiency) have been used as measures of matching performance.

We claim that our proposed system offers several desirable features. First, representing schema trees as Prüfer sequences permits traversing each schema tree only once, which improves and accelerates the matching process. Second, matching complex elements first reduces the search space complexity and prunes many false positive matching candidates. Third, refining matches between atomic elements helps discover complex matchings. Fourth, the XPrüM system is almost automatic and does not need any auxiliary information sources such as dictionaries or ontologies. Finally, our system is hybrid in nature and is flexible both in adding new matchers and in modifying existing ones.

Thus, to summarize, the main contributions of our work are:

• We have developed and implemented a new XML schema matching system called XPrüM, which is based on the sequence representation of the schemas to be matched.

• We introduced a new structural matching algorithm based on Prüfer encoding.
• The proposed system can identify and discover complex matches by introducing the concept of compatible nodes.
• By discovering compatible elements first, we reduce the search space complexity, which makes the system scalable against both large numbers of schemas and large-scale schemas.
• We have carried out several experiments using real-world schemas with different characteristics to validate the practical applicability of our approach, both in terms of matching effectiveness and matching efficiency.

The remainder of this paper is organized as follows. Section 2 introduces the basic definitions used in the paper. Section 3 presents an overview of the proposed approach. Section 4 is devoted to a detailed discussion of the various matching algorithms and their implementations. Section 5 presents experimental results. Related work is discussed in Section 6. Concluding remarks and our proposed future work are reported in Section 7.

2. Preliminaries

An XML schema is the description of a type of XML document, typically expressed in terms of constraints on the structure and content of that type [1]. Several languages have been developed to express XML schemas; the most commonly used is the XML Schema Definition (XSD), or XML Schema. Throughout this paper, unless clearly specified otherwise, we use the term "schema" to refer to an XML schema (XSD).

Page 3: Improving XML schema matching performance using Prüfer sequences

Fig. 1. Computer science department schemas.


An XML schema is a set of schema components. In general, there are 13 kinds of components in all, falling into three groups.1 In this paper, we consider only the first two categories. (1) Primary components, which may have names, such as type definitions, or must have names, such as attribute and element declarations. (2) Secondary components, which must have names, such as attribute group definitions and model group definitions. The remaining category, helper components such as wildcards and attribute uses, is left for ongoing work.

An XML schema is a graph. It can be represented as a tree by dealing with nesting and repetition problems using a set of predefined transformation rules similar to those in [34]. Hence, XML schemas can be modeled as rooted, ordered, labeled trees, which are defined as follows.

Definition 1. An ordered labeled tree T is a 4-tuple T = (N_T, E_T, Lab_NT, φ), where:

• N_T = {n_root, n_2, ..., n_n} is a finite set of nodes, each uniquely identified by an object identifier (OID), where n_root is the tree root node, which has no parent. Each node represents a schema component, such as an element/attribute declaration.

• E_T = {(n_i, n_j) | n_i, n_j ∈ N_T} is a finite set of edges, where n_i is the parent of n_j. Each edge represents the relationship between two nodes.

• Lab_NT is a finite set of node labels. These labels are strings describing the properties of the node, such as name, data type, and cardinality.

• φ : N_T → Lab_NT is a labeling function assigning a label from Lab_NT to every node.

Node order is determined by a post-order traversal of the tree. In a post-order traversal, a node n_i is visited and assigned its post-order number after its children are recursively traversed from left to right.

We categorize tree nodes into atomic nodes, which have no outgoing edges and represent leaf nodes, and complex nodes, which are the internal nodes of the schema tree. Each node is associated with a set of labels representing features of the corresponding schema component, such as name and type/data type. Every node of the tree, except the root, has exactly one parent.
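The post-order numbering used throughout the paper can be sketched as follows; the Node class and the example tree are illustrative stand-ins, not the paper's schema tree implementation.

```python
class Node:
    """A schema tree node with a label and ordered children."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []
        self.postorder = None  # assigned during traversal

def assign_postorder(root):
    """Visit children left to right, then the node itself, assigning
    consecutive post-order numbers starting at 1; the root gets the
    largest number, leaves are numbered before their parents."""
    counter = 0
    def visit(node):
        nonlocal counter
        for child in node.children:
            visit(child)
        counter += 1
        node.postorder = counter
    visit(root)
    return root

# A tiny schema tree: root "Dept" with children "Staff" (one leaf) and "Courses".
tree = Node("Dept", [Node("Staff", [Node("Name")]), Node("Courses")])
assign_postorder(tree)
print([(n.label, n.postorder)
       for n in (tree.children[0].children[0], tree.children[0],
                 tree.children[1], tree)])
```

Here the leaf Name receives number 1, its parent Staff number 2, the leaf Courses number 3, and the root Dept the maximum number 4.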

Example 1 (A schema matching example). To describe the operations of our approach, we use the example found in [16]. It describes two XML schemas that represent the organization of universities in different countries and has been widely used in the literature. Fig. 1a and b show the schema trees of the two XML schemas.

Definition 2. Match is a function that takes two XML schemas as input and produces a mapping as output, employing different similarity measure functions called matchers.

We distinguish between two types of matching, which are defined as follows. Let ST1 and ST2 be two schema trees having n and m elements, respectively; the two types are:

• Simple matching: For each node n_ST1 ∈ ST1, find the most semantically similar node n_ST2 ∈ ST2. This problem is referred to as one-to-one matching. For example, the correspondence between the two Professor elements in the two schema trees shown in Fig. 1 represents a simple matching.

• Complex matching: For each node (or set of nodes) ∈ ST1, find the most semantically similar set of nodes ∈ ST2. The correspondence between the Name element of ST1 shown in Fig. 1a and the {FirstName, LastName} elements of ST2 shown in Fig. 1b represents a complex matching.

1 http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/#components.

Page 4: Improving XML schema matching performance using Prüfer sequences

Fig. 2. Prüfer sequence construction.


In this paper, we assume that simple matches exist and should be discovered among either simple nodes or complex nodes, while complex matches are discovered only among simple nodes. The motivation behind this assumption is that a complex node comprises simple and/or complex nodes. We observe that it rarely occurs that a complex node in one schema corresponds to more than one complex node in another schema. However, it is common for one (or more) simple nodes in one schema to correspond to one or more simple nodes in another schema.

Definition 3. A similarity measure is a function that takes two nodes' properties, such as node name and node data type, and produces a similarity value between 0 and 1. It is written sim(n_1, n_2), the similarity between two nodes n_1 and n_2 from the source schema and the target schema, respectively.

The similarity between two elements (nodes) depends on the method used to compute it. In general, the similarity value ranges between 0 and 1, where 0 means strong dissimilarity and 1 means strong similarity. Instead of applying matching algorithms directly to schema trees, we represent schema trees as sequences using the Prüfer encoding method [40]. The idea of Prüfer's method is to find a one-to-one correspondence between the set of schema trees and a set of Prüfer sequences.

Definition 4. A Prüfer sequence of length n − 2, for n ≥ 2, is any sequence of integers between 1 and n, with repetitions allowed.

Example 2 [47]. The set of Prüfer sequences of length 2 (n = 4) is {(1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4), (4,1), (4,2), (4,3), (4,4)}. In total, there are 4^(4−2) = 16 Prüfer sequences of length 2.

Given a schema tree with nodes labeled 1, 2, 3, ..., n, the Prüfer encoding method outputs a unique Prüfer sequence of length n − 2. It starts with an empty sequence. If the tree has more than two nodes, the algorithm finds the leaf with the lowest label and appends the label of that leaf's parent to the sequence. Then, the leaf with the lowest label is deleted from the tree. This operation is repeated n − 2 times, until only two nodes remain in the tree. The algorithm thus deletes n − 2 nodes, so the resulting sequence has length n − 2. Fig. 2 illustrates the Prüfer sequence (Pr.Seq) construction for a schema tree whose nodes are randomly labeled. As shown in Fig. 2, since regular Prüfer sequences include only the information of parent nodes, they cannot represent the leaf nodes. In order to incorporate them, a modified version of the regular Prüfer sequence is exploited in the next section.
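The classical Prüfer encoding just described can be sketched as follows; representing the tree by a child-to-parent map is our own illustrative choice.

```python
def prufer_sequence(parent):
    """Classical Prüfer encoding. `parent` maps each non-root node label
    to its parent's label; labels are 1..n. Repeatedly remove the leaf
    with the smallest label and record its parent, n - 2 times."""
    n = len(parent) + 1
    parent = dict(parent)          # work on a copy
    children = {}                  # number of remaining children per node
    for p in parent.values():
        children[p] = children.get(p, 0) + 1
    seq = []
    for _ in range(n - 2):
        # leaves are non-root nodes that are no longer a parent of anything
        leaf = min(v for v in parent if children.get(v, 0) == 0)
        p = parent.pop(leaf)
        seq.append(p)
        children[p] -= 1
    return seq

# Path 1-2-3-4 rooted at 4 (parents: 1->2, 2->3, 3->4) yields (2, 3).
print(prufer_sequence({1: 2, 2: 3, 3: 4}))
```

As a sanity check, a star with center 1 and leaves 2, 3, 4 yields the sequence (1, 1), one of the 16 sequences listed in Example 2.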

3. The XPrüM system

In this section, we describe the core parts of the XPrüM system. As shown in Fig. 3, the system has two main parts: schema preparation and schema matching. First, schemas are parsed using a SAX parser2 and represented internally as schema trees. Then, using the Prüfer encoding method, we extract both label sequences and number sequences. Finally, the schema matching part discovers the set of matches between the two schemas utilizing both sequences.

2 http://www.saxproject.org.

Page 5: Improving XML schema matching performance using Prüfer sequences

Fig. 3. Matching process phases.


3.1. Schema preparation

This paper considers only XML schema matching. However, our approach is a generic framework, i.e. it has the ability to identify semantic correspondences between different models from different domains. In order to make the matching process more generic, schemas should be represented internally by a common representation. This uniform representation reduces the complexity of subsequent algorithms, which do not have to cope with different schema representations. We use ordered labeled trees as the internal model and call the output of this step the schema tree (ST). As shown in Fig. 3, two XML schemas are parsed using the SAX parser and are represented internally as ST1 and ST2.

Unlike existing rule-based matching systems that rely on schema trees (or schema graphs) to apply their matching algorithms, in the following subsection we present how to extend schema tree representations into sequences using the Prüfer sequence method.

3.1.1. Prüfer sequences construction

We now describe the tree sequence representation method, which provides a bijection between ordered, labeled trees and sequences. This representation is inspired by classical Prüfer sequences [40] and particularly by the consolidated Prüfer sequence (CPS) proposed in [45].

The CPS of a schema tree ST consists of two sequences, the Number Prüfer Sequence (NPS) and the Label Prüfer Sequence (LPS). They are constructed by performing a post-order traversal that tags each node in the schema tree with a unique traversal number. NPS is then constructed iteratively by removing the node with the smallest traversal number and appending its parent's node number to the partial sequence constructed so far. LPS is constructed similarly, but by taking the node labels of the deleted nodes instead of their parent node numbers. NPS and LPS convey different but complementary information: NPS, constructed from unique post-order traversal numbers, gives tree structure information, while LPS gives tree semantic information. The CPS representation thus provides a bijection between ordered, labeled trees and sequences. Therefore, CPS = (NPS, LPS) uniquely represents a rooted, ordered, labeled tree, where each entry in the CPS corresponds to an edge in the schema tree. For more details see [45].
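The NPS/LPS construction just described can be sketched as follows; the nested (label, children) tree encoding is our own illustrative choice, not the paper's data structure. Since post-order numbers are assigned in traversal order, removing nodes in increasing traversal number is the same as listing them in post-order.

```python
def consolidated_prufer(tree):
    """Compute CPS = (NPS, LPS) of a rooted ordered labeled tree given as
    a nested (label, [children]) pair. Nodes are removed in increasing
    post-order number, so NPS/LPS are the parent numbers / node labels
    listed in post-order; the root's parent number is None (printed as
    '-' in the paper's tables)."""
    postnum, order, counter = {}, [], 0
    def number(node):
        nonlocal counter
        _, children = node
        for c in children:
            number(c)
        counter += 1
        postnum[id(node)] = counter
        order.append(node)
    number(tree)
    parent_of = {}
    def link(node):
        _, children = node
        for c in children:
            parent_of[id(c)] = id(node)
            link(c)
    link(tree)
    nps = [postnum.get(parent_of.get(id(n))) for n in order]
    lps = [n[0] for n in order]
    return nps, lps

# Small tree: Dept with children Staff (containing Name) and Courses.
tree = ("Dept", [("Staff", [("Name", [])]), ("Courses", [])])
nps, lps = consolidated_prufer(tree)
print(nps)  # [2, 4, 4, None]
print(lps)  # ['Name', 'Staff', 'Courses', 'Dept']
```

Each (NPS, LPS) pair at the same index is one CPS entry, i.e. the edge from the node to its parent, matching the layout of Table 1.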

Example 3. Consider the schema trees ST1 and ST2 shown in Fig. 4, where each node is associated with its OID and its post-order number. Table 1 illustrates the CPS for ST1 and ST2. For example, the CPS of ST1 can be written as NPS(ST1) = (11 11 5 5 8 8 8 10 10 11 –) and LPS(ST1).name = (UnderGrad Courses, Grad Courses, Name, Degree, Assistant Professor, Associate Professor, Professor, Faculty, Staff, People, CS Dept US).

This example shows that using CPS to represent schema trees as sequences has the advantage of capturing both the semantic and the structural information of the schema tree, including atomic nodes. The following section formally presents CPS properties.

3.1.2. CPS properties

In the following, we list the structural properties behind the CPS representation of schema trees. Given a CPS = (NPS, LPS) constructed from a schema tree ST, we classify these properties into:

• Unary properties: Let n_i be a node having post-order number k:
1. atomic node: n_i is an atomic node iff k ∉ NPS;
2. complex node: n_i is a complex node iff k ∈ NPS;

Page 6: Improving XML schema matching performance using Prüfer sequences

Table 1Schema tree nodes properties.

Schema tree ST1:

NPS | OID | name                | type/data type
11  | n2  | UnderGrad Courses   | element/string
11  | n3  | Grad Courses        | element/string
5   | n7  | Name                | element/string
5   | n8  | Degree              | element/string
8   | n6  | Assistant Professor | element/–
8   | n9  | Associate Professor | element/string
8   | n10 | Professor           | element/string
10  | n5  | faculty             | element/–
10  | n11 | Staff               | element/string
11  | n4  | People              | element/–
–   | n1  | CS Dept US          | element/–

Schema tree ST2:

NPS | OID | name           | type/data type
11  | n2  | Courses        | element/string
5   | n6  | FirstName      | element/string
5   | n7  | LastName       | element/string
5   | n8  | Education      | element/string
8   | n5  | Lecturer       | element/–
8   | n9  | SeniorLecturer | element/string
8   | n10 | Professor      | element/string
10  | n4  | AcademicStaff  | element/–
10  | n11 | TechnicalStaff | element/string
11  | n3  | Staff          | element/–
–   | n1  | CS Dept Aust   | element/–

Fig. 4. Node OIDs and corresponding post-order numbers.


3. root node: n_i is the root node (n_root) iff k = max(NPS), where max is a function that returns the maximum number in NPS.

• Binary properties:
1. edge relationship: Let CPS_i = (k_i, LPS_i) ∈ CPS of a schema tree ST be an entry. This entry represents an edge from the node whose post-order number is k_i to the node n_i = LPS_i.OID. This property captures both the child and the parent relationship: the node n_i = LPS_i.OID is an immediate child of the node whose post-order number is k_i.
2. sibling relationship: Let CPS_i = (k_i, LPS_i) and CPS_j = (k_j, LPS_j) be two entries ∈ CPS of a schema tree ST. The two nodes n_i = LPS_i.OID and n_j = LPS_j.OID are sibling nodes iff k_i = k_j.
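These properties can be checked directly on the sequences; the helper functions below are an illustrative sketch over a small NPS, not part of the XPrüM implementation.

```python
def is_atomic(k, nps):
    """Unary property 1: a node with post-order number k is atomic (a
    leaf) iff k never appears as a parent number in NPS."""
    return k not in nps

def is_complex(k, nps):
    """Unary property 2: k appears in NPS iff the node has children."""
    return k in nps

def root_number(nps):
    """Unary property 3: the root's post-order number is max(NPS)."""
    return max(n for n in nps if n is not None)

def siblings(cps_entries):
    """Binary property 2: CPS entries sharing the same parent number k
    denote sibling nodes; group OIDs (here, labels) by k."""
    groups = {}
    for k, oid in cps_entries:
        groups.setdefault(k, []).append(oid)
    return groups

# NPS of a small tree: Name under Staff (number 2), Staff and Courses
# under the root (number 4); the root's own entry carries None.
nps = [2, 4, 4, None]
entries = list(zip(nps, ["Name", "Staff", "Courses", "Dept"]))
print(is_atomic(1, nps), is_complex(2, nps), root_number(nps))
print(siblings(entries)[4])  # children of the root: ['Staff', 'Courses']
```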

3.2. Schema matching

The proposed matching algorithms operate on the sequential representation of schema trees and discover semantic correspondences between them. Generally speaking, the process of schema matching is performed, as shown in Fig. 3, in three phases: linguistic matching, compatible nodes identification, and matching refinement.

Recent empirical analysis shows that there is no single dominant element matcher that performs best regardless of the data model and application domain [17]. This is because schema matching is an intricate problem, mainly for the following reasons:

• Representation problems: Databases are engineered by different people, resulting in different possible representation models and different possible names and structures.

• Semantic problems: The semantics of the elements involved cannot be inferred from only a few information sources [15].
• Computational cost problems: When deciding whether an element n_i of schema tree ST1 matches an element n′_i of schema tree ST2, one must typically examine all other elements of ST2 to make sure that there is no other element that matches n_i better than n′_i.

Page 7: Improving XML schema matching performance using Prüfer sequences


As a result, we should exploit different kinds of matchers. In our approach, we utilize two schema-based matchers: the linguistic matcher and the structural matcher.

First, a degree of linguistic similarity is automatically computed for all element pairs in the linguistic matching phase. Once the degree of linguistic similarity has been computed, the second phase identifies compatible nodes. In this phase, we apply our structural matcher only to complex nodes; then we combine the linguistic and structural similarity measures for complex nodes and select compatible nodes among them. Finally, a matching refinement phase is carried out to discover corresponding atomic nodes inside each compatible node pair. The set of compatible nodes and the semantically corresponding atomic nodes constitute the matching result.

In the following section, we describe the matcher algorithms. Without loss of generality, let the numbers of nodes in ST1 and ST2 be n and m, respectively.

4. Matcher algorithms

To identify semantic correspondences between two schema trees, the similarity between their elements should be computed. To this end, we propose the following three phases: (1) linguistic matching, (2) compatible nodes identification, and (3) matching refinement.

4.1. Linguistic matching

The aim of this phase is to obtain an initial similarity value between the nodes of the two schema trees based on the similarity of their labels. To this end, we make use of two basic schema-based matchers: the name matcher and type/data type compatibility.

4.1.1. Name matcher

In the absence of data instances, node names are a necessary but not sufficient source of information for matching. For the sake of simplicity, and towards building a more efficient framework, we do not make use of any external dictionary or ontology. To compute the name similarity between two strings s1 and s2, we first break each string into sets of tokens T1 and T2 using a customizable tokenizer that splits on punctuation, upper case, special symbols, and digits, e.g. UnderGradCourses → {Under, Grad, Courses}. We then determine the name similarity between the two sets of name tokens T1 and T2 as the average best similarity of each token with a token in the other set. It is computed as follows:

Nsim(T1, T2) = ( Σ_{t1 ∈ T1} [ max_{t2 ∈ T2} sim(t1, t2) ] + Σ_{t2 ∈ T2} [ max_{t1 ∈ T1} sim(t1, t2) ] ) / ( |T1| + |T2| ).   (1)
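As an illustration, the tokenizer and Eq. (1) can be sketched as follows; the regular expressions and the exact-match stand-in for sim(t1, t2) are our assumptions, since the paper describes the tokenizer only informally.

```python
import re

def tokenize(name):
    """Split an element name on punctuation, digits, special symbols,
    and lower-to-upper case transitions, e.g.
    'UnderGradCourses' -> ['Under', 'Grad', 'Courses']."""
    spaced = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', name)
    return [t for t in re.split(r'[^A-Za-z]+', spaced) if t]

def name_similarity(T1, T2, sim):
    """Eq. (1): the average, over both token sets, of each token's best
    similarity with a token from the other set. `sim` is any token-level
    similarity returning values in [0, 1]."""
    if not T1 or not T2:
        return 0.0
    best1 = sum(max(sim(t1, t2) for t2 in T2) for t1 in T1)
    best2 = sum(max(sim(t1, t2) for t1 in T1) for t2 in T2)
    return (best1 + best2) / (len(T1) + len(T2))

# Exact token equality as a stand-in for sim(t1, t2).
exact = lambda a, b: 1.0 if a == b else 0.0
print(tokenize("UnderGradCourses"))                 # ['Under', 'Grad', 'Courses']
print(name_similarity(tokenize("GradCourses"),
                      tokenize("Courses"), exact))  # (1 + 1) / (2 + 1)
```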

To measure the string similarity between a pair of tokens, sim(t1, t2), we use the following three string similarity measures, which may be useful in the generation of an initial mapping. The first one is

sim_edit(t1, t2) = ( max(|t1|, |t2|) − editDistance(t1, t2) ) / max(|t1|, |t2|),   (2)

where editDistance(t1, t2) is the minimum number of character insertion and deletion operations needed to transform one string into the other. The edit distance is used to determine how similar or different the two strings are. The second measure is based on the trigrams shared by the two strings:

sim_tri(t1, t2) = 2 · |tri(t1) ∩ tri(t2)| / ( |tri(t1)| + |tri(t2)| ),   (3)

where tri(t1) is the set of trigrams in t1. The trigram measure is used for efficient approximate string matching. The third similarity measure is based on the Jaro–Winkler distance. The Jaro–Winkler distance metric is designed for, and best suited to, short strings such as person names. The Jaro similarity measure is given by

sim_jaro(t1, t2) = (1/3) · ( M/|t1| + M/|t2| + (M − t)/M ),   (4)

where M is the number of matching characters and t is the number of transpositions. The name similarity between two schema tree nodes is computed as the combination (weighted sum) of the above three similarity values, according to the procedure given in Algorithm 1.
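A sketch of the three token measures and their weighted combination (the inner step of Algorithm 1). The clamping at zero in sim_edit and the equal weights in combine_name are our assumptions; the paper leaves both open.

```python
def edit_distance(t1, t2):
    """Minimum number of character insertions and deletions transforming
    t1 into t2 (the paper's definition): |t1| + |t2| - 2 * LCS(t1, t2)."""
    m, n = len(t1), len(t2)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            lcs[i + 1][j + 1] = (lcs[i][j] + 1 if t1[i] == t2[j]
                                 else max(lcs[i][j + 1], lcs[i + 1][j]))
    return m + n - 2 * lcs[m][n]

def sim_edit(t1, t2):
    """Eq. (2); clamped at 0 since insert/delete distance can exceed the
    longer length for disjoint strings (clamping is our choice)."""
    longer = max(len(t1), len(t2))
    if longer == 0:
        return 1.0
    return max(0.0, (longer - edit_distance(t1, t2)) / longer)

def sim_tri(t1, t2):
    """Eq. (3): Dice coefficient over the sets of trigrams."""
    tri = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    a, b = tri(t1), tri(t2)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def sim_jaro(t1, t2):
    """Eq. (4): Jaro similarity with the usual matching window."""
    if t1 == t2:
        return 1.0
    if not t1 or not t2:
        return 0.0
    window = max(len(t1), len(t2)) // 2 - 1
    match1, match2 = [False] * len(t1), [False] * len(t2)
    m = 0
    for i, c in enumerate(t1):
        for j in range(max(0, i - window), min(len(t2), i + window + 1)):
            if not match2[j] and t2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    s1 = [c for i, c in enumerate(t1) if match1[i]]
    s2 = [c for j, c in enumerate(t2) if match2[j]]
    t = sum(a != b for a, b in zip(s1, s2)) / 2  # transpositions
    return (m / len(t1) + m / len(t2) + (m - t) / m) / 3

def combine_name(s1, s2, s3, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three measures (equal weights assumed)."""
    return weights[0] * s1 + weights[1] * s2 + weights[2] * s3

print(round(sim_jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```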

Example 4. Applying Algorithm 1 to the CPSs illustrated in Table 1, we get an 11 × 11 Nsim matrix. Of course, these initial candidates contain many false positive matches. For example, ST1.Staff initially corresponds to ST2.Staff with a name similarity value of 1.0.

Page 8: Improving XML schema matching performance using Prüfer sequences


Algorithm 1: Name matching algorithm

Require: LPS_ST1 and LPS_ST2
Ensure: Name similarity matrix Nsim
1: Nsim[][] ← 0
2: for i = 1 to n do
3:   s1 ← LPS_ST1[i].name
4:   for j = 1 to m do
5:     s2 ← LPS_ST2[j].name
6:     sim1 ← sim_edit(s1, s2)
7:     sim2 ← sim_tri(s1, s2)
8:     sim3 ← sim_jaro(s1, s2)
9:     Nsim[i][j] ← combine_name(sim1, sim2, sim3)
10:   end for
11: end for

Table 2Data type compatibility table.

data type dt1 | data type dt2 | Comp. Coeff.(dt1, dt2)
string        | string        | 1.0
string        | decimal       | 0.2
decimal       | float         | 0.8
float         | float         | 1.0
float         | integer       | 0.8
integer       | short         | 0.8
...           | ...           | ...

4.1.2. Data type compatibility

To enhance the matching result and to prune some of the false positive candidates, we propose to exploit the type/data type of nodes. We make use of the built-in XML data type hierarchy³ in order to compute data type compatibility coefficients. Based on the XML Schema data type hierarchy, we build a data type compatibility table like the one used in [36]. Table 2 illustrates that elements having the same data type, or belonging to the same data type category, are likely to correspond, and their compatibility coefficients (Comp. Coeff.) are high. For elements having the same data type, the compatibility coefficient is set to 1.0, while the compatibility coefficient of elements having different data types but belonging to the same category (such as integer and short) is set to 0.8.

After computing the data type compatibility coefficients (see Algorithm 2), we can adjust the name similarity values. The result of the above process is a linguistic similarity matrix Lsim.
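The table lookup and the adjustment of Algorithm 2 might look as follows in Python. The COMPAT entries are a small excerpt modeled on Table 2, and the weights in combine_ling are illustrative assumptions; neither is prescribed by the paper.

```python
# Hypothetical excerpt of Table 2; the full table follows the XML Schema
# built-in data type hierarchy and is an assumption here.
COMPAT = {
    ("string", "decimal"): 0.2,
    ("decimal", "float"): 0.8,
    ("float", "integer"): 0.8,
    ("integer", "short"): 0.8,
}

def comp_coeff(dt1, dt2):
    # identical types are fully compatible; otherwise consult the
    # (symmetric) compatibility table, defaulting to 0.0
    if dt1 == dt2:
        return 1.0
    return COMPAT.get((dt1, dt2), COMPAT.get((dt2, dt1), 0.0))

def combine_ling(nsim, tsim, w_name=0.7, w_type=0.3):
    # weighted sum (Algorithm 2, line 8); the weights are illustrative
    return w_name * nsim + w_type * tsim
```

Elements whose names match but whose types are incompatible (e.g. a string vs. a boolean) thus receive a lower linguistic similarity, pruning false positive candidates.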

Algorithm 2: Data type compatibility algorithm

Require: LPS_ST1 & LPS_ST2 & Nsim
Ensure: Linguistic similarity matrix Lsim
1: Tsim[][] ← 0;
2: Lsim[][] ← 0;
3: for i = 1 to n do
4:   dt1 ← LPS_ST1[i].datatype;
5:   for j = 1 to m do
6:     dt2 ← LPS_ST2[j].datatype;
7:     Tsim[i][j] ← compCoeff(dt1, dt2);
8:     Lsim[i][j] ← combine_ling(Nsim[i][j], Tsim[i][j]);
9:   end for
10: end for

4.2. Compatible nodes identification

In this phase, we aim to identify and discover semantic correspondences across complex nodes. Similar complex nodes are called compatible nodes. Discovering compatible nodes reduces the search space complexity and helps to identify complex matchings. To this end, we develop a new structural matching algorithm based on the Prüfer sequence representation of schema trees. In the following subsections, we describe our structural matching algorithm in general, and then we show how to identify compatible nodes.

3 http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/.

4.2.1. Structural matching

The matching algorithms described above consider only the label information and ignore the structural information. There may be multiple match candidates which differ in structure but have the same label. The structural matching algorithm prunes these false positive candidates by considering the structural information represented in the NPS sequence.

Our structural matcher is motivated by the fact that the most prominent feature of an XML schema is its hierarchical structure. The matcher is based on the node context, which is reflected by the node's ancestors and descendants. The descendants of an element include both its immediate children and the leaves of the subtrees rooted at the element. The immediate children reflect its basic structure, while the leaves reflect the element's content. In this paper, we consider three kinds of node context, depending on the node's position in the schema tree.

• The child context of a node ni is defined as the set of its immediate children, including attributes and subelements. The child context of an atomic node is the empty set.

• The leaf context of a node ni is defined as the set of leaf nodes of the subtrees rooted at node ni. The leaf context of an atomic node is the empty set, too.

• The ancestor context of a node ni is defined as the path extending from the root node to the node ni. The ancestor context of the root node is the empty path.

Example 5. Considering the schema tree ST1 of Example 3, child_context(n1) = {n2, n3, n4}, child_context(n2) = ∅, ancestor_context(n1) = NULL, ancestor_context(n5) = n1/n4/n5, and leaf_context(n5) = {n7, n8, n9, n10}.

The context of a node is the combination of its ancestor, child, and leaf contexts. Two nodes are structurally similar if they have similar contexts. To measure the structural similarity between two nodes, we compute the similarity of their child, ancestor, and leaf contexts. Before getting into the details of how to measure the structural similarity, we build a connection between the node context and the CPS sequence. To build such a connection, we make use of the CPS properties described in Section 3.1.2 as follows:

• The child context: Using the edge relationship property, we can identify the immediate children of a complex node and their number. The number of immediate children of a complex node is obtained by counting the occurrences of its post-order traversal number in the NPS sequence, and then we can identify these children. For example, consider the schema tree ST1 of Example 3: the post-order number of node n1 is 11, and this number occurs three times, which means that n1 has three immediate children {n2, n3, n4}. Since the post-order number 6 does not appear in NPS(ST1), the node n9 is an atomic node (atomic node property) and its child context set is empty.

• The leaf context: Exploiting the CPS property asserting that the post-order numbers of atomic nodes do not appear in the NPS sequence, and also taking the child context into account, we can recursively obtain the leaf context of a given node. For example, consider the schema tree ST1 of Example 3: the nodes {n2, n3, n7, n8, n9, n10, n11} form the leaf node set. Node n6 has two children, n7 and n8, which are leaves; they form the leaf context set of node n6.

• The ancestor context: For a complex node, we obtain the ancestor context by scanning the NPS sequence from the beginning to the end (conceptually from left to right) and identifying the numbers that are larger than the post-order number of the node, until the first occurrence of the root node. While scanning from left to right, we ignore nodes whose post-order numbers are smaller than the post-order numbers of already scanned nodes. For an atomic (leaf) node, the ancestor context is the ancestor context of its parent, united with the parent node itself. For example, consider the schema tree ST1 of Example 3: the ancestor context of node n5 (non-atomic node) is the path n1/n4/n5, while the ancestor context of node n9 (atomic node) is the path n1/n4/n5/n9.
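The three extractions above can be sketched directly over the NPS sequence. The sketch below assumes a particular encoding (an assumption, since the paper's internal layout is not shown here): nps[i-1] holds the post-order number of the parent of the node with post-order number i, for i = 1..n−1, and the root carries number n. The example tree at the bottom is a small hypothetical one, not ST1.

```python
def child_count(nps, p):
    # number of immediate children of the node with post-order number p;
    # 0 means p is an atomic (leaf) node (atomic node property)
    return nps.count(p)

def is_atomic(nps, p):
    return p not in nps

def ancestor_context(nps, i, root):
    # follow parent pointers up to the root; returns root/.../parent(i)
    path, cur = [], i
    while cur != root:
        cur = nps[cur - 1]
        path.append(cur)
    return path[::-1]

def leaf_context(nps, p, root):
    # leaves of the subtree rooted at p: atomic nodes whose ancestor
    # path passes through p
    return [i for i in range(1, root)
            if is_atomic(nps, i) and p in ancestor_context(nps, i, root)]

# a small hypothetical tree, post-order numbered:
#   5 (root) -> 1 (leaf), 4;  4 -> 2 (leaf), 3 (leaf)
nps = [5, 4, 4, 5]
```

Here child_count(nps, 5) is 2 because the root's number occurs twice in the sequence, mirroring the counting argument used for n1 above.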

4.2.2. Structural context similarity algorithm

The structural node context defined above relies on the notions of path and set. In order to compare two ancestor contexts, we essentially compare their corresponding paths. On the other hand, in order to compare two child contexts and/or leaf contexts, we need to compare their corresponding sets.

Although path comparison has been widely used in XML query answering frameworks, it relies on strong matching following two crisp constraints: the node constraint and the edge constraint. Under such constraints, paths that are semantically similar may be considered unmatched, or paths that are not semantically similar may be considered matched. Hence, these constraints should be relaxed. Several relaxation approaches have been proposed for approximate answering of queries [4]. Examples of such relaxations are allowing paths to match even when nodes are not embedded in the same manner or in the same order, and allowing two elements within each path to be matched even if they are not identical but their linguistic similarity exceeds a fixed threshold [8]. In the following, we describe how to compute the three structural context measures:



1. Child context algorithm. Computation of the child context similarity between two nodes is depicted in Algorithm 3. We first extract the child context set for each node. Second, we get the linguistic similarity between each pair of children in the two sets. Third, we select the matching pairs with maximum similarity values. Finally, we take the average of the best similarity values.

2. Leaf context algorithm. Before getting into the details of computing the leaf context similarity, we first introduce the notions of the gap between two nodes and the gap vector of a node.

Definition 5. The gap between two nodes ni and nj in a schema tree ST is defined as the difference between their post-order numbers.

Definition 6. The gaps between a complex node and the nodes of its leaf context set in a schema tree ST form a vector called the gap vector.

Example 6. Considering ST1 of Example 3, the gap vector (gapvec) of node n8 is gapvec(n8) = {5, 4, 2, 1}.

Algorithm 3: Child context measure function

Require: n_ST1 & n_ST2
Ensure: child context similarity child_sim
1: if n_ST1 or n_ST2 is an atomic node then
2:   return 0;
3: else
4:   sum ← 0;
5:   child_set1 ← extractChildNodeSet(n_ST1);
6:   child_set2 ← extractChildNodeSet(n_ST2);
7:   for i = 1 to child_set1.size do
8:     ni ← child_set1[i];
9:     max ← 0;
10:    for j = 1 to child_set2.size do
11:      nj ← child_set2[j];
12:      sim ← linguistic(ni, nj);
13:      if sim ≥ max then
14:        max ← sim;
15:      end if
16:    end for
17:    sum ← sum + max;
18:  end for
19:  child_sim ← average(sum);
20:  return child_sim;
21: end if

To compute the leaf context similarity between two nodes, we compare their leaf context sets. First, we extract the leaf context set for each node. Second, we determine the gap vector for each node. Third, we apply the cosine measure (CM) to the two vectors (see Algorithm 4).

Algorithm 4: Leaf context measure function

Require: n_ST1 & n_ST2
Ensure: leaf context similarity leaf_sim
1: if n_ST1 or n_ST2 is an atomic node then
2:   return 0;
3: else
4:   gapvec1 ← nodeGapVector(n_ST1);
5:   gapvec2 ← nodeGapVector(n_ST2);
6:   leaf_sim ← CM(gapvec1, gapvec2);
7: end if
8: return leaf_sim;

Example 7. Considering the two schema trees shown in Example 3, the leaf context set of ST1.n1 is leaf_set(n1(11)) = {n2(1), n3(2), n7(3), n8(4), n9(6), n10(7), n11(9)} and the leaf context set of ST2.n1 is leaf_set(n1(11)) = {n2(1), n6(2), n7(3), n8(4), n9(6), n10(7), n11(9)}. The gap vector of ST1.n1 is v1 = gapvec(ST1.n1) = {10, 9, 8, 7, 5, 4, 2} and the gap vector of ST2.n1 is v2 = gapvec(ST2.n1) = {10, 9, 8, 7, 5, 4, 2}. The cosine measure CM of the two vectors gives CM(v1, v2) = 1.0. Then, the leaf context similarity between the nodes ST1.n1 and ST2.n1 is 1.0. Similarly, the leaf context similarity between ST1.n5 and ST2.n4 is 0.82, and between ST1.n4 and ST2.n3 it is 0.98.
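The cosine measure used in Algorithm 4 can be sketched as follows. Padding unequal-length gap vectors with zeros is our assumption; the paper does not state how vectors of different lengths are compared.

```python
import math

def cm(v1, v2):
    # cosine measure between two gap vectors
    n = max(len(v1), len(v2))
    a = list(v1) + [0.0] * (n - len(v1))   # zero-pad (assumption)
    b = list(v2) + [0.0] * (n - len(v2))
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

v1 = [10, 9, 8, 7, 5, 4, 2]   # gapvec(ST1.n1) from Example 7
v2 = [10, 9, 8, 7, 5, 4, 2]   # gapvec(ST2.n1)
```

For the identical vectors of Example 7, cm(v1, v2) evaluates to 1.0, matching the leaf context similarity reported there.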

3. Ancestor context similarity. The ancestor context similarity captures the similarity between two nodes based on their ancestor contexts. As mentioned before, the ancestor context for a given node ni is the path from the root node to ni. To compute the ancestor similarity between two nodes ni and nj, we first extract each ancestor context from the CPS sequence, say path P_ST1 for ni and path P_ST2 for nj. Second, we compare the two paths utilizing three of the four scores established in [9] and reused in [8]:

• LCS(P_ST1, P_ST2): This score is used to ensure that path P_ST1 includes most of the nodes of P_ST2 in the right order. To this end, a classical dynamic programming algorithm is employed to compute a longest common subsequence (LCS) between the two paths, represented as node sequences P_ST1[n1 ... nk] and P_ST2[n1 ... nk′]. Finding a longest common subsequence is a well-defined problem in the literature [5]. The process computes the LCS lengths for all possible prefix combinations of P_ST1 and P_ST2, storing them in a matrix. The recursive equation, Eq. (5), defines the matrix entries, where lsim(ni, nj) is a function that returns the linguistic similarity between the two nodes ni and nj, and th is a predefined threshold:

LCSM[i, j] =
  0                                   if i = 0 or j = 0,
  LCSM[i−1, j−1] + 1                  if lsim(ni, nj) ≥ th,
  max(LCSM[i−1, j], LCSM[i, j−1])     if lsim(ni, nj) < th.        (5)

The bottom-right corner entry LCSM[k, k′] contains the overall LCS length. Then, we normalize this score to [0, 1] by the length of path P_ST1. The normalized score is given by:

LCSn(P_ST1, P_ST2) = |LCS(P_ST1, P_ST2)| / |P_ST1|.  (6)

• GAPS(P_ST1, P_ST2): This measure is used to ensure that the occurrences of the P_ST1 nodes in P_ST2 are close to each other. For this, another version of the LCS algorithm is exploited in order to capture the LCS alignment with minimum gaps. We propose to normalize it by the length of the common subsequence added to the gaps value, so as to ensure that the total score will be less than 1. Eq. (7) presents the normalized measure:

GAPS(P_ST1, P_ST2) = gaps / (gaps + LCS(P_ST1, P_ST2)).  (7)

• LD(P_ST1, P_ST2): Finally, in order to give higher values to source paths whose lengths are similar to target paths, we compute the length difference LD between P_ST2 and LCS(P_ST1, P_ST2), normalized by the length of P_ST2. Thus, the final factor that evaluates the length difference can be computed as:

LD(P_ST1, P_ST2) = (|P_ST2| − LCS(P_ST1, P_ST2)) / |P_ST2|.  (8)

These scores are combined to compute the similarity psim between the two paths P_ST1 and P_ST2 as follows:

psim(P_ST1, P_ST2) = LCSn(P_ST1, P_ST2) − c · GAPS(P_ST1, P_ST2) − d · LD(P_ST1, P_ST2),  (9)

where c and d are positive parameters ranging from 0 to 1 that represent the relative importance of each factor.
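Eqs. (5), (6), (8) and (9) can be sketched as below. This is a simplified sketch: the gaps value of Eq. (7) comes from a separate minimum-gap LCS variant not reproduced here, so it is taken as an input, and lsim, th, c, and d are caller-supplied placeholders.

```python
def lcs_len(p1, p2, lsim, th):
    # Eq. (5): LCS over node paths with threshold-based node matching
    k1, k2 = len(p1), len(p2)
    M = [[0] * (k2 + 1) for _ in range(k1 + 1)]
    for i in range(1, k1 + 1):
        for j in range(1, k2 + 1):
            if lsim(p1[i - 1], p2[j - 1]) >= th:
                M[i][j] = M[i - 1][j - 1] + 1
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])
    return M[k1][k2]

def psim(p1, p2, lsim, th=0.5, c=0.5, d=0.5, gaps=0):
    # gaps is supplied by the caller (Eq. (7)'s alignment is omitted here)
    lcs = lcs_len(p1, p2, lsim, th)
    lcs_n = lcs / len(p1)                                       # Eq. (6)
    gaps_score = gaps / (gaps + lcs) if (gaps + lcs) else 0.0   # Eq. (7)
    ld = (len(p2) - lcs) / len(p2)                              # Eq. (8)
    return lcs_n - c * gaps_score - d * ld                      # Eq. (9)
```

With an exact-equality lsim, two identical paths yield psim = 1.0, since LCSn is 1 and both penalty terms vanish.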

4.2.3. Discovering compatible nodes

Before detailing the process to identify and discover compatible nodes, we define what compatible nodes are.

Definition 7. Let ni ∈ ST1 and nj ∈ ST2 be two complex nodes. If the computed similarity (linguistic and structural) exceeds a predefined threshold th, sim(ni, nj) ≥ th, then the two nodes are compatible nodes.

Unlike state-of-the-art approaches, we first apply our structural algorithm only to complex nodes to compute structural similarities between them, assuming that no match between atomic and complex nodes exists. Then, we combine linguistic similarity and structural similarity for complex nodes using a weighted sum aggregation. Due to the uncertainty inherent in schema matching, the best matching can actually be an unsuccessful choice [24]. To overcome this shortcoming, matching candidates are ranked up to the top-3 for each node. Then, we select the matching candidates that exceed a threshold, which equals the smallest similarity value of a true positive candidate. The resulting matching set constitutes the set of compatible node pairs.

Example 8. Consider the two schema trees shown in Fig. 1 and their CPSs illustrated in Table 1. Table 3 presents the top-3 ranking, where a check mark (✓) in the status column denotes a true positive match, whereas an empty cell stands for a false positive match. Given a threshold value of 0.524, the set of compatible node pairs is {(ST1.n1, ST2.n1), (ST1.n4, ST2.n3), (ST1.n5, ST2.n4), (ST1.n6, ST2.n5)}.

Table 3
Similarity values top-3 ranking for each node.

ST1 | ST2 | Similarity value | Status
n1  | n1  | 0.624            | ✓
    | n5  | 0.4950           |
    | n4  | 0.4837           |
n4  | n3  | 0.5816           | ✓
    | n5  | 0.457            |
    | n1  | 0.4406           |
n5  | n4  | 0.6145           | ✓
    | n3  | 0.500            |
    | n1  | 0.4863           |
n6  | n5  | 0.524            | ✓
    | n3  | 0.514            |
    | n4  | 0.482            |

4.3. Matching refinement

By identifying compatible nodes and the category set for each node, we have obtained the top-level matchings (complex nodes). In the following, we continue with the bottom-level matchings (atomic nodes). For these nodes we have already computed their linguistic similarity, and now we have to compute their structural similarity. In this phase, we need not run the structural similarity algorithm on all simple nodes: compatible nodes (similar complex nodes) are likely to bear similar simple nodes. Following this line of thinking, we apply the structural algorithms only to the simple nodes inside every compatible node pair. To this end, we first give a definition of the compatible node category.

Definition 8. A category of a given compatible node ni is a set of nodes including:

• all immediate atomic children of ni,
• all non-compatible (complex) nodes which are immediate children of ni, together with their atomic children.

Example 9. Considering the two schema trees and their compatible nodes presented in Example 8, Table 4 illustrates these compatible nodes and the associated category for each one.

In general, atomic nodes have neither a child context nor a leaf context. Therefore, to compute the structural similarity for atomic nodes, we only compare nodes in each compatible category pair using the ancestor context algorithm presented in Section 4.2. For example, the category ST1.C1 = {n2, n3} is only compared to its compatible category ST2.C1 = {n2}. First, we extract the ancestor context for each node. Let P2 represent the ancestor context of ST1.n2 and P2′ represent the ancestor context of ST2.n2. Then, the structural similarity between the two nodes is given by

ssim(ST1.n2, ST2.n2) = psim(P2, P2′),  (10)

where psim(P2, P2′) is computed using Eq. (9). Then we combine both linguistic and structural similarities using a weighted sum function and select the best candidate(s) based on a predefined threshold.

By this mechanism we gain two main advantages. First, we reduce the search space complexity for atomic nodes. Second, many false positive candidates are pruned. Furthermore, we can easily discover complex matchings.

Table 4
Category set for each compatible node.

ST1                              ST2
Comp. node | Category set        Comp. node | Category set
n1         | C1 = {n2, n3}       n1         | C1 = {n2}
n4         | C2 = {n11}          n3         | C2 = {n11}
n5         | C3 = {n9, n10}      n4         | C3 = {n9, n10}
n6         | C4 = {n7, n8}       n5         | C4 = {n6, n7, n8}



4.3.1. Discovering complex matchings

XPrüM identifies element-level matchings between either atomic or complex nodes. This solves the schema matching problem only partially. To fully solve the problem, we must cope with complex matchings.

Definition 9. If one or more nodes in a category Ci from the source schema correspond to two or more nodes in a compatible category Cj from the target schema, the resulting match is a complex match.

Example 10. From our running example, the two categories ST1.C1 and ST2.C1 are compatible (see Table 4). Applying the matching algorithms to their nodes, we obtain the complex match: ST2.C1.n2 matches (ST1.C1.n2, ST1.C1.n3). Indeed, the Courses element in the second schema ST2 is the union of the two elements UnderGrdCourses and GradCourse of the first schema ST1. Moreover, the two categories ST1.C4 and ST2.C4 are compatible (see Table 4). Applying the matching algorithms to their nodes, we obtain the complex match: ST1.C4.n7 matches (ST2.C4.n6, ST2.C4.n7). The name element (ST1.n7) is the concatenation of the two elements FirstName and LastName from the second schema.

4.4. Complexity analysis

The worst-case time complexity of the XPrüM system can be expressed as a function of the number of nodes and the number of input schemas. Let n be the average schema size and S be the number of input schemas. Following the same process as in [21], we derive the overall time complexity of our system as follows:

• Prüfer sequences construction: The schemas to be matched are first traversed in post-order and represented as CPSs, with a time complexity of O(nS).

• Linguistic matching phase: This phase requires a comparison between all schema nodes, with a time complexity of O(n²S²).

• Compatible nodes identification: Intuitively, the number of complex nodes is smaller than the number of atomic nodes in an XML schema tree. Assume this number is given by c = n/k, where the integer k expresses the ratio of the total number of nodes to the number of complex nodes. The compatible nodes identification phase needs to compare only complex nodes, with a time complexity of O(c²) (≪ O(n²)).

• Matching refinement: In this phase, we compare atomic nodes inside a category only with atomic nodes inside the corresponding compatible category. Let the number of compatible nodes be c′ (≤ c), and let each category contain n′ (≪ n) atomic nodes. This phase is performed with a time complexity of O(c′·n′²).

In light of these observations, the overall worst-case time complexity of the XPrüM system is O(n²S²). This is due to the dominating complexity of linguistic matching. However, the system shows additional improvements, especially in the other phases. The following section experimentally confirms this complexity.

5. Experimental evaluation

The algorithms described above have been implemented in Java. We ran all our experiments on a 2.4 GHz Intel Core 2 processor with 2 GB RAM running Windows XP. In the remainder of this section, we describe the data sets used for the evaluation, the criteria used to measure matching performance, the experimental results, and a comparison with other matching systems.

5.1. Data sets

We experimented with the data sets shown in Table 5. These data sets were obtained from http://www.cs.toronto.edu/db/clio/testSchemas.html, http://sbml.org/Documents/Specifications/XML_Schemas, http://www.xcbl.com, and http://www.oagi.org. We chose these data sets because they capture different characteristics in the number of nodes (schema size) and their depth (the level of node nesting), and they represent different application domains. We utilized two different data sets depending on the measured performance criterion: matching effectiveness or matching efficiency. The first set, Table 5 Part (A), is used to evaluate matching effectiveness; its schemas from the TPC_H and bioinformatics domains do not contain complex matches, while the schemas from the other two domains contain complex matches. The data sets described in Table 5, Part (B), are used to validate matching efficiency.

Table 5
Data set details.

Part (A) effectiveness:
Domain                    | TPC_H | Bibliography | Auction | Bioinformatics
Number of nodes (S1/S2)   | 43/17 | 22/28        | 38/37   | 101/69
Average number of nodes   | 30    | 25           | 38      | 85
Max. depth (S1/S2)        | 3/6   | 6/7          | 3/4     | 6/6

Part (B) efficiency:
Domain     | Number of schemas/nodes | Schema size (KB)
University | 44/550                  | <1
XCBL       | 570/3500                | <10
OAGIS      | 4000/36,000             | <100
OAGIS      | 100/65,000              | >100

Fig. 5. Matching quality measures for XPrüM, COMA++, and SF systems.

5.2. Measures for match performance

The XPrüM system considers both aspects: matching effectiveness and matching efficiency. Therefore, we have carried out two sets of experiments. The first set demonstrates the effectiveness of our matching system, while the second one investigates the system's efficiency. To measure the effectiveness of the matching result, we use the measures commonly used in the literature, including precision, recall, and F-measure. The response time, as a function of the number of schemas and the number of nodes, is used to measure matching efficiency.

Precision P can be computed as P = |B| / (|B| + |C|),⁴ which relates the shared set between the real mappings and the identified mappings (|B|) to the total identified mappings (|B| + |C|). Recall R is computed as R = |B| / (|B| + |A|), which relates the shared set between the real mappings and the identified mappings to the total real mappings (|B| + |A|). However, neither precision nor recall alone can accurately assess the match quality. Hence, it is necessary to consider a trade-off between them. The F-measure can be used for this trade-off, computed as F-measure = 2 · P · R / (P + R).
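The three quality measures can be sketched over sets of correspondences. This is a straightforward rendering of the standard definitions, not code from the paper; the pair encoding of correspondences is an assumption.

```python
def match_quality(real, found):
    # real/found: sets of correspondences, e.g. ('ST1.n4', 'ST2.n3') pairs
    b = len(real & found)          # true positives (B)
    c = len(found - real)          # false positives (C)
    a = len(real - found)          # false negatives (A)
    p = b / (b + c) if b + c else 0.0              # precision
    r = b / (b + a) if b + a else 0.0              # recall
    f = 2 * p * r / (p + r) if p + r else 0.0      # F-measure
    return p, r, f
```

For instance, a matcher that finds three candidates of which two are real, out of four real mappings, scores a precision of 2/3 and a recall of 1/2.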

5.3. Experimental results

In this section, we analyze the performance of the XPrüM system in terms of matching quality and matching efficiency.

5.3.1. Matching quality

The data set of Part (A) in Table 5 is fed to the XPrüM system two schemas at a time. For each domain, we performed two experiments: from S1 to S2 and from S2 to S1. The matcher algorithms discover candidate matchings that exceed a predefined threshold. Then, these candidates are ranked for each element up to the top-3 (if found). Finally, the matching quality measures are computed. We also computed the matching quality of both the COMA++ system and the Similarity Flooding SF (RONDO) system and compared them to our system. The results are summarized in Fig. 5.

The results show that XPrüM achieves high matching quality in terms of precision, recall, and F-measure across all four domains, ranging from 80% to 100%. Compared to COMA++, which is mostly independent of the match direction (from S1 to S2 or from S2 to S1), our system, like SF, depends on the match direction. Fig. 5 illustrates that the matching quality measures for COMA++ on the TPC_H, Auction, and Bioinformatics domains are the same from S1 to S2 (Fig. 5a) and from S2 to S1 (Fig. 5b). However, this is not true for the bibliography domain. The reason is that schemas from the bibliography domain contain more complex matches, which are harder to discover. As shown in Fig. 5, our system, which is able to cope with complex matches, achieves higher precision, recall, and F-measure than both COMA++ and SF on the bibliography domain. The best matching results for XPrüM are achieved on the Auction domain, which includes fewer semantic and structural heterogeneities. We wish to remark that our system can identify all matchings, including complex ones, whereas both COMA++ and SF identify only element-level matchings, with an F-measure of 94%.

4 A is the set of false negative matches, B the set of true positive matches, and C the set of false positive matches.

Individual matcher effectiveness. For each domain, we performed a set of experiments to study the effect of the individual matchers on the overall matching quality. To this end, we considered the following combinations of matchers: (1) the name matcher alone, (2) the name matcher with the data type compatibility, and (3) the name matcher with the data type compatibility and the structural context matcher (i.e., the complete XPrüM system). We use precision as the matching quality measure. Fig. 6 shows the matching quality for these scenarios.

Fig. 6 clearly shows the effect of each individual matcher on the total matching quality. The name matcher alone has very low accuracy on the first domain (10%), because the two schemas S1 and S2 present names with many slight variations and the name matcher utilizes very simple string similarity functions. Somewhat better accuracy is achieved for the other two domains (21.5% on the bibliography domain and 26% on the auction domain). Using the data type compatibility matcher with the name matcher provides only a modest improvement in matching accuracy (between 4% and 10%). In contrast, the best matching results of the matcher combinations are achieved by adding the structural context matcher, which improves matching precision by approximately 64%. Finally, we chose precision as the measure of matching quality in this scenario, since precision quantifies the effort needed to remove false positive candidates.

Matching quality for complex matchings. As stated before, we selected the test data sets to reflect different features and characteristics. The TPC_H domain does not contain any complex matching and was suitable for element-level matchings. The bibliography domain contains 4 complex matchings out of 20 total matchings. When matching from S1 to S2, XPrüM could correctly identify 2 out of 4, producing a precision of 50%, while when matching from S2 to S1, the system could correctly identify 3 out of 4, producing a precision of 75%. The third domain, the Auction domain, contains 5 complex matchings out of 32 total matchings. XPrüM could identify all complex matchings in both matching directions, i.e., from S1 to S2 and from S2 to S1.

5.3.2. Matching efficiency

We measure the response time of our matching system as a function of the number of schemas and nodes, using the data set of Part (B) in Table 5. For schemas whose sizes are less than 10 KB, the matching response time is reported as a function of the number of schemas; otherwise, the response time is measured as a function of the number of nodes. The results are summarized in Fig. 7.

Fig. 7 shows that XPrüM scales well across all three domains. The system could identify and discover correspondences across 44 schemas with 550 nodes from the university domain in 0.4 s, while the approach needs 1.8 s to match 570 schemas with approximately 3500 nodes from the XCBL domain. This demonstrates that XPrüM is scalable to a large number of schemas. To demonstrate the scalability of the system with large-scale schemas, we carried out two further sets of experiments. First, we considered the OAGIS domain, which contains schemas whose sizes range between 10 KB and 100 KB. Fig. 7c shows that the system needs 26 s to match 4000 schemas containing 36,000 nodes. Then, in the same domain, we considered 100 schemas whose sizes are larger than 100 KB. XPrüM requires more than 1000 s to match 65,000 nodes, as shown in Fig. 7d.

Effect of individual matchers on matching efficiency. In this subsection, we discuss the effect of the individual matcher combinations on matching efficiency. To this end, we performed a set of experiments using the OAGIS domain, with schema sizes ranging between 10 KB and 100 KB, for the following scenarios: (1) the name matcher alone, (2) the name matcher with the data type compatibility, and (3) the name matcher with the data type compatibility and the structural context matcher (i.e., the complete XPrüM system). Fig. 8 shows the matching response times for these scenarios.

The results show that the name matcher needs less time than the other combinations, as expected. This matcher takes 16 s to match 36,000 nodes. Adding the data type compatibility matcher increases the response time to 23 s, with an associated matching quality improvement ranging between 4% and 10%. It is interesting that adding the structural context matcher requires only 3 s more to perform the matching process, with an associated matching quality improvement of approximately 63%.

Fig. 6. Matching precision for different combinations of matchers.

Fig. 7. Performance analysis of the XPrüM system with real-world schemas (response time vs. number of schemas for the University and XCBL domains, and vs. number of nodes ×1000 for the OAGIS domains).

Fig. 8. Matching response time for different combinations of matchers (name matcher; name + data type matchers; complete algorithm).

5.3.3. Matching quality/matching efficiency cost ratio

In order to evaluate the benefits of the structural context matcher, we compute the ratio between the matching quality improvement and the matching efficiency cost. This ratio can be used to evaluate different combinations of matchers and is denoted by η_matcher. The data type matcher is evaluated as follows:

η_datatype = MQI / MEC = 10 / 30 = 0.33,  (11)



where MQI, the matching quality improvement, is the incremental gain in matching quality due to adding the data type matcher, and MEC, the matching efficiency cost, is the percentage increase in response time due to adding the matcher (i.e., 30 = (23 − 16)/23 × 100). For the structural matcher, the ratio is computed as follows:

η_structural = MQI / MEC = 63 / 11 = 5.7.  (12)

Eqs. (11) and (12) show the relative performance benefits of our new structural matcher. Although the matcher achieves a 63% improvement in matching quality, it incurs a matching efficiency cost of only 11%.
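The ratio can be reproduced from the reported timings. Note that using the unrounded MEC gives a slightly smaller value for the structural matcher than Eq. (12), which rounds MEC to 11% first.

```python
def matcher_benefit(mqi, t_without, t_with):
    # eta = MQI / MEC, where MEC is the percentage increase in
    # response time caused by adding the matcher
    mec = (t_with - t_without) / t_with * 100
    return mqi / mec

# data type matcher: quality +10%, response time 16 s -> 23 s (Eq. 11)
eta_datatype = matcher_benefit(10, 16, 23)     # ~0.33
# structural matcher: quality +63%, response time 23 s -> 26 s
eta_structural = matcher_benefit(63, 23, 26)   # ~5.5 (Eq. 12 reports 5.7
                                               # after rounding MEC to 11%)
```

Either way, the structural matcher's benefit-to-cost ratio is more than an order of magnitude higher than that of the data type matcher.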

6. Related work

Since we have developed a conceptual connection between the XML schema matching problem and Prüfer sequences, we relate our work to the state of the art in two respects: the problem and the methodologies. In terms of the problem, we present the differences between our work and existing schema matching systems. In terms of methodologies, we discuss the usage of Prüfer sequences.

6.1. Schema matching

Research on the schema matching problem is gaining momentum [36,38,25,26,7,43,19,22,37]. Identifying semantic correspondences between schema components is one of the core operations in XML database integration, in E-business, in the Semantic Web, etc. Consequently, many XML schema matching algorithms have been proposed and many matching systems have been developed. These systems can be broadly classified into rule-based systems and learning-based systems.

Rule-based systems. Most of these systems start with transforming schemas into schema trees, such as Cupid [36], BtreeMatch [20] and PORSCHE [43], or schema graphs, such as COMA/COMA++ [11,12], SF [38], OntoBuilder [25] and S-Match [26]. The SF system [38] transforms the original schemas into directed labeled graphs using an import filter. The representation of the graphs is based on the Open Information Model (OIM) specification. The SF system uses two different types of nodes: oval nodes denote the identifiers of nodes, and rectangle nodes denote string values. COMA/COMA++ [11,12] represents schemas by rooted directed acyclic graphs, where schema elements are represented by graph nodes connected by directed links of different types. PORSCHE [43] models XML data as rooted labeled trees in which schema tree nodes bear two kinds of information: the node label, used in linguistic similarity computation, and the node number, used in calculating the node's context. Our system also models XML schemas as rooted ordered labeled trees. In order to efficiently extract schema tree properties, we extend schema tree representations to sequence representations utilizing the Prüfer encoding method [40,45].

In the rule-based systems, matching algorithms are applied to schema trees/graphs. Most of them are hybrid approaches which utilize both linguistic and structural matching. To enhance linguistic matchers, these systems employ auxiliary information sources, such as WordNet [38], external dictionaries [11,12,36], or a synonym table [43], which impacts matching performance. Their structural matchers also have a great impact on performance, because most of them need to traverse schema trees/graphs many times. The COMA/COMA++ system exploits different kinds of node properties and different kinds of matching algorithms. The system utilizes simple properties such as names and data types, structural properties such as TypeName (data types + names), Children (child elements) and Leaves (leaf elements), and auxiliary information such as synonym tables and previous mappings. Furthermore, the system has simple, hybrid and reuse-oriented matchers. The system uses a bottom-up hierarchical approach: if leaf nodes are similar, then parent nodes may also be similar. The Similarity Flooding (SF) system uses an operator, named SFJoin, to produce mappings between two schema graph elements. This operator is implemented based on a fixpoint computation. The matcher is based on the assumption that whenever two elements in the two graphs are found to be similar, the similarity of their adjacent elements increases. After a number of iterations, the initial similarity of any two nodes propagates through the graphs. The algorithm terminates after a fixpoint has been reached. PORSCHE first clusters the nodes based on linguistic label similarity, making use of a domain-specific user-defined abbreviation table and a manually defined domain-specific synonym table. This makes the system restricted to a single domain. Then, it applies a tree mining technique using the calculated node ranks. The system uses a top-down hierarchical approach: given the similarity between parent nodes, it may be possible for their descendants to be semantically similar.
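The fixpoint idea behind SF can be illustrated with a deliberately simplified sketch. This is not the authors' implementation: real Similarity Flooding uses a propagation graph with per-edge coefficients and bidirectional propagation, whereas the toy function below only propagates downward along matching edges; the function name and data are illustrative only.

```python
from itertools import product

def similarity_flooding(nodes_a, nodes_b, edges_a, edges_b, init_sim, rounds=20):
    """Simplified fixpoint computation in the spirit of Similarity Flooding:
    the similarity of a node pair (a, b) is repeatedly increased by the
    similarity of adjacent pairs, then normalized, until values stabilize."""
    sim = {(a, b): init_sim.get((a, b), 0.0) for a, b in product(nodes_a, nodes_b)}
    for _ in range(rounds):
        nxt = {}
        for (a, b), s in sim.items():
            # (a2, b2) is adjacent to (a, b) if a->a2 in schema A and b->b2 in B.
            inflow = sum(sim[(a2, b2)]
                         for a2 in edges_a.get(a, ())
                         for b2 in edges_b.get(b, ()))
            nxt[(a, b)] = s + inflow
        top = max(nxt.values()) or 1.0   # normalize so the best pair is 1.0
        sim = {pair: value / top for pair, value in nxt.items()}
    return sim

# Toy example: two tiny schema trees, roots r and s with two children each.
sim = similarity_flooding(
    ['r', 'x', 'y'], ['s', 'u', 'v'],
    {'r': ['x', 'y']}, {'s': ['u', 'v']},
    {('r', 's'): 1.0, ('x', 'u'): 0.5, ('y', 'v'): 0.5})
print(max(sim, key=sim.get))  # the root pair keeps the highest similarity
```

The key property the sketch preserves is the one described above: similarity flows between adjacent pairs, so a well-matched root reinforces its children's candidate matches across iterations.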

Our system is also a hybrid approach which utilizes both linguistic and structural matchers. However, our system is almost automatic and independent of any auxiliary sources. Like PORSCHE, our approach is a top-down hierarchical approach. However, we introduced the concept of compatible nodes in order to cope with complex matchings.

Learning-based systems. Learning-based systems use both schema-based and instance-based information. These systems depend largely on a training phase, using unsupervised learning in SemInt [35], machine learning in LSD [14] and GLUE [16], and neural network-based partial least squares in [31]. The advantage of these approaches is that they can empirically learn the similarities among data based on their instance values. The disadvantage is that instance data is generally available in very large quantities; hence, training is very expensive in terms of computation, which affects schema matching performance. Moreover, these systems only deal with simple matchings. An extension to the LSD system, called iMap [10], is used to discover complex matchings, but it lacks matching performance.


The SemInt [35] system starts by building attribute signatures, and then uses an unsupervised learning algorithm to classify the attributes within the database. The Self-Organizing Map algorithm serves as the clustering algorithm to classify attributes into different categories. The output of this classifier is used as training data for a neural net. The back-propagation learning algorithm is used to train a network to recognize input patterns and give degrees of similarity. The system exploits 20 different matchers: 15 metadata-based matchers and 5 data-based matchers. It uses the properties of one database to train a network. Then, the trained network uses information extracted from the other database as input to obtain the similarity between each attribute of the second database and each category in the first one.

Matching cardinality. As mentioned in Section 2, two kinds of matchings exist: (1) simple matching and (2) complex matching. To date, many schema matching systems confine themselves to simple matching [36,14,38,11,26]. Little work has been done on discovering complex matchings, such as iMAP [10], DCM [29], and INDIGO [30]. iMAP discovers complex matchings by searching the space of all possible combinations. To search effectively, it employs a set of searchers, and to improve accuracy, it exploits a variety of domain knowledge. DCM casts schema matching as correlation mining, exploiting co-occurrence patterns across query interfaces over the Web to discover complex matchings. Although it achieves good matching quality, it does not provide an efficient, scalable solution.

Matching scalability. Scalability, in general, is a desirable property of a system or a process, which indicates its ability either to handle growing amounts of work in a graceful manner or to be readily enlarged [6]. Because of the extreme growth of XML data on the Web and its dynamic change, scalability of a schema matching system has become essential. Unfortunately, most existing matching systems have been tested using small-scale schemas, and little work has been done to deal with large-scale schemas, such as [44,43].

6.2. Prüfer sequences

A Prüfer sequence of a labeled tree is a unique sequence associated with the tree. The sequence of a tree with n nodes has length n − 2 and can be generated by an iterative algorithm [40]. In the XPrüM system we construct Prüfer sequences (both LPS and NPS) of length n following the approach found in [45]. Much work connecting XML schemas and Prüfer sequences has been established, such as the PRIX [42], performance-oriented sequencing [46], and LCS-TRIM [45] systems, to index and query XML documents, and the FiST system and its extension pFiST [32,33] to filter XML documents. A survey on efficiently managing large XML data using the sequence-based technique and other techniques can be found in [27].
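For concreteness, the classic iterative construction from [40] can be sketched as follows. Note that this produces the textbook sequence of length n − 2 for a free labeled tree; the length-n LPS/NPS variant used in XPrüM follows [45] and is not reproduced here.

```python
def prufer_sequence(n, edges):
    """Classic Prüfer encoding: repeatedly remove the leaf with the
    smallest label and record its neighbor, until two nodes remain.
    Nodes are labeled 1..n; returns a sequence of length n - 2."""
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seq = []
    for _ in range(n - 2):
        leaf = min(v for v, nb in adj.items() if len(nb) == 1)
        neighbor = next(iter(adj[leaf]))
        seq.append(neighbor)
        adj[neighbor].discard(leaf)
        del adj[leaf]
    return seq

# A star with center 2: every removal records node 2.
print(prufer_sequence(4, [(1, 2), (2, 3), (2, 4)]))  # [2, 2]
```

The bijection between labeled trees on n nodes and sequences of length n − 2 is what makes such encodings attractive for schema matching: structural comparison of trees reduces to comparison of sequences.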

7. Conclusions

With the emergence of XML as a standard for information representation, analysis, and exchange on the Web, the development of automatic techniques for XML schema matching will be crucial to their success. In this paper, we have addressed an intricate problem associated with the XML schema matching problem: discovering complex matchings while considering matching scalability. To tackle this, we have proposed and implemented the XPrüM system, a hybrid matching algorithm which automatically discovers semantic correspondences between elements of XML schemas. The system starts by transforming schemas into schema trees and then constructs a consolidated Prüfer sequence, which establishes a one-to-one correspondence between schema trees and sequences. We capture schema tree semantic information in Label Prüfer Sequences and schema tree structural information in Number Prüfer Sequences.

Experimental results have shown that XPrüM scales well in terms of both a large number of schemas and large-scale schemas. Moreover, it preserves matching quality considering both simple and complex matching. We have introduced and measured the matching quality improvement/matching efficiency cost ratio to validate our new structural matcher.

XPrüM includes other features: it is almost automatic; it does not make use of any external dictionary or ontology; moreover, it is independent of the data model and application domain of the matched schemas. In our future work, we intend to improve and propose matching algorithms for atomic elements. Our ongoing work includes applying the sequence-based matching approach to other applications and domains, such as XML data clustering, image matching and Web service discovery.

Acknowledgement

We are grateful to Zohra Bellahsene and Marco Mesiti for their discussions and their technical support. We also thank Stefanie Quade for improving the quality of the paper.

References

[1] S. Abiteboul, D. Suciu, P. Buneman, Data on the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann, USA, 2000.
[2] A. Algergawy, E. Schallehn, G. Saake, A Prüfer sequence-based approach for schema matching, in: BalticDB&IS 2008, Estonia, 2008.
[3] A. Algergawy, E. Schallehn, G. Saake, A sequence-based ontology matching approach, in: 18th European Conference on Artificial Intelligence Workshop, Greece, 2008.
[4] S. Amer-Yahia, S. Cho, D. Srivastava, Tree pattern relaxation, in: EDBT'02, 2002, pp. 89–102.
[5] L. Bergroth, H. Hakonen, T. Raita, A survey of longest common subsequence algorithms, SPIRE (2004) 39–48.
[6] A.B. Bondi, Characteristics of scalability and their impact on performance, in: Second International Workshop on Software and Performance, Canada, 2000, pp. 195–203.
[7] A. Bonifati, G. Mecca, A. Pappalardo, S. Raunich, G. Summa, Schema mapping verification: the spicy way, in: EDBT 2008, France, 2008, pp. 85–96.
[8] A. Boukottaya, C. Vanoirbeek, Schema matching for transforming structured documents, in: DocEng'05, 2005, pp. 101–110.
[9] D. Carmel, N. Efraty, G.M. Landau, Y.S. Maarek, Y. Mass, An extension of the vector space model for querying XML documents via XML fragments, SIGIR Forum 36 (2) (2002).
[10] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, P. Domingos, iMAP: discovering complex semantic matches between database schemas, in: SIGMOD Conference 2004, 2004, pp. 383–394.
[11] H.H. Do, E. Rahm, COMA—a system for flexible combination of schema matching approaches, in: VLDB 2002, 2002, pp. 610–621.
[12] H.H. Do, E. Rahm, Matching large schemas: approaches and evaluation, Information Systems 32 (6) (2007) 857–885.
[13] A. Doan, Learning to map between structured representations of data, Ph.D. Thesis, Washington University, 2002.
[14] A. Doan, P. Domingos, A. Halevy, Reconciling schemas of disparate data sources: a machine-learning approach, in: SIGMOD, May 2001, pp. 509–520.
[15] A. Doan, A. Halevy, Semantic integration research in the database community: a brief survey, AAAI AI Magazine 25 (1) (2005) 83–94 (Special Issue on Semantic Integration).
[16] A. Doan, J. Madhavan, P. Domingos, A. Halevy, Ontology matching: a machine learning approach, Handbook on Ontologies, International Handbooks on Information Systems, 2004.
[17] C. Domshlak, A. Gal, H. Roitman, Rank aggregation for automatic schema matching, IEEE Transactions on Knowledge and Data Engineering 19 (4) (2007) 538–553.
[18] C. Drumm, M. Schmitt, H.-H. Do, E. Rahm, QuickMig—automatic schema matching for data migration projects, in: Proceedings of the ACM CIKM 2007, Portugal, 2007.
[19] F. Duchateau, Z. Bellahsene, R. Coletta, A flexible approach for planning schema matching algorithms, in: OTM Conferences (1) 2008, Mexico, 2008, pp. 249–264.
[20] F. Duchateau, Z. Bellahsene, M. Roche, An indexing structure for automatic schema matching, in: SMDB Workshop, Turkey, 2007.
[21] M. Ehrig, S. Staab, QOM—quick ontology mapping, in: International Semantic Web Conference, 2004, pp. 683–697.
[22] H. Elmeleegy, M. Ouzzani, A.K. Elmagarmid, Usage-based schema matching, in: ICDE 2008, Mexico, 2008, pp. 20–29.
[23] J. Euzenat et al., State of the art on ontology alignment, in: Part of Research Project Funded by the IST Program, Project number IST-2004-507482, Knowledge Web Consortium, 2004.
[24] A. Gal, Managing uncertainty in schema matching with top-k schema mappings, Journal on Data Semantics 6 (2006) 90–114.
[25] A. Gal, A. Tavor, A. Trombetta, D. Montesi, A framework for modeling and evaluating automatic semantic reconciliation, VLDB Journal 14 (1) (2005) 50–67.
[26] F. Giunchiglia, M. Yatskevich, P. Shvaiko, Semantic matching: algorithms and implementation, Journal on Data Semantics 9 (2007) 1–38.
[27] G. Gou, R. Chirkova, Efficiently querying large XML data repositories: a survey, IEEE Transactions on Knowledge and Data Engineering 19 (10) (2007) 1381–1403.
[28] Y. Hao, Y. Zhang, Web services discovery based on schema matching, in: ACSC 2007, Australia, 2007, pp. 107–113.
[29] B. He, K.C.-C. Chang, Automatic complex schema matching across web query interfaces: a correlation mining approach, ACM Transactions on Database Systems 31 (1) (2006) 346–395.
[30] Y.B. Idrissi, J. Vachon, A context-based approach for the discovery of complex matches between database sources, in: DEXA 2007, LNCS, vol. 4653, 2007, pp. 864–873.
[31] B. Jeong, D. Lee, H. Cho, J. Lee, A novel method for measuring semantic similarity for XML schema matching, Expert Systems with Applications 34 (2008) 1651–1658.
[32] J. Kwon, P. Rao, B. Moon, S. Lee, FiST: scalable XML document filtering by sequencing twig patterns, in: Proceedings of the 31st VLDB Conference, 2005, pp. 217–228.
[33] J. Kwon, P. Rao, B. Moon, S. Lee, Value-based predicate filtering of XML documents, Data and Knowledge Engineering 67 (1) (2008) 51–73.
[34] M.L. Lee, L.H. Yang, W. Hsu, X. Yang, XClust: clustering XML schemas for effective integration, in: CIKM'02, 2002, pp. 63–74.
[35] W. Li, C. Clifton, SemInt: a tool for identifying attribute correspondences in heterogeneous databases using neural networks, Data and Knowledge Engineering 33 (2000) 49–84.
[36] J. Madhavan, P.A. Bernstein, E. Rahm, Generic schema matching with Cupid, in: VLDB 2001, Roma, Italy, 2001, pp. 49–58.
[37] R. McCann, W. Shen, A. Doan, Matching schemas in online communities: a web 2.0 approach, in: ICDE 2008, Mexico, 2008, pp. 110–119.
[38] S. Melnik, H. Garcia-Molina, E. Rahm, Similarity flooding: a versatile graph matching algorithm and its application to schema matching, in: Proceedings of the 18th International Conference on Data Engineering (ICDE'02), 2002.
[39] R. Nayak, Fast and effective clustering of XML data using structural information, Knowledge and Information Systems 14 (2) (2008) 197–215.
[40] H. Prüfer, Neuer Beweis eines Satzes über Permutationen, Archiv für Mathematik und Physik 27 (1918) 142–144.
[41] E. Rahm, P.A. Bernstein, A survey of approaches to automatic schema matching, VLDB Journal 10 (4) (2001) 334–350.
[42] P. Rao, B. Moon, PRIX: indexing and querying XML using Prüfer sequences, in: Proceedings of the 20th International Conference on Data Engineering, 2004, pp. 288–299.
[43] K. Saleem, Z. Bellahsene, E. Hunt, PORSCHE: performance oriented schema mediation, Information Systems 33 (7–8) (2008) 637–657.
[44] M. Smiljanic, XML schema matching: balancing efficiency and effectiveness by means of clustering, Ph.D. Thesis, Twente University, 2006.
[45] S. Tatikonda, S. Parthasarathy, M. Goyder, LCS-TRIM: dynamic programming meets XML indexing and querying, in: VLDB'07, 2007, pp. 63–74.
[46] H. Wang, X. Meng, On the sequencing of tree structures for XML indexing, in: ICDE 2005, Japan, 2005, pp. 372–383.
[47] B.Y. Wu, K.-M. Chao, Spanning Trees and Optimization Problems, Taylor & Francis Group, USA, 2004.

Alsayed Algergawy is a Ph.D. student in the database research group at the University of Magdeburg, Germany. He received the M.Sc. degree in Electrical Engineering (Computer Engineering) from the University of Tanta, Egypt in 2004. Since October 2006 he has been working on his Ph.D. project at the University of Magdeburg. His current research interests include data integration, schema matching and XML data clustering.


Eike Schallehn is a scientific assistant in the database research group at the University of Magdeburg, Germany. From 1996 to 1999 he worked for companies providing object-oriented database solutions and services in the field of object-oriented software. Since 1999 he has been doing research in the field of information fusion and database integration, focusing on query processing in heterogeneous environments, especially similarity-based operations, and finished his Ph.D. in 2004. His current research interests include schema integration and query processing in distributed and heterogeneous databases, as well as self-tuning and autonomous database technology.

Gunter Saake is a full professor for the area "Databases and Information Systems" at the University of Magdeburg, Germany. He received the Ph.D. degree in Computer Science from the Technical University of Braunschweig, Germany, in 1988. From 1988 to 1989 he was a visiting scientist at the IBM Heidelberg Scientific Center, where he joined the Advanced Information Management project. His current research interests include schema and data integration, tailor-made data management for restricted devices, product line technology for database languages, conceptual modelling and infrastructures for digital engineering.