Probabilistic Message Passing in Peer Data Management Systems∗

Philippe Cudré-Mauroux

School of Computer and Communication Sciences

EPFL – Switzerland

Karl Aberer

School of Computer and Communication Sciences

EPFL – Switzerland

Andras Feher

Fachbereich Informatik
T.U. Darmstadt

Germany

Abstract

Until recently, most data integration techniques involved central components, e.g., global schemas, to enable transparent access to heterogeneous databases. Today, however, with the democratization of tools facilitating knowledge elicitation in machine-processable formats, one cannot rely on global, centralized schemas anymore as knowledge creation and consumption are getting more and more dynamic and decentralized. Peer Data Management Systems (PDMS) provide an answer to this problem by eliminating the central semantic component and considering instead compositions of local, pairwise mappings to propagate queries from one database to the others.

PDMS approaches proposed so far make the implicit assumption that all mappings used in this way are correct. This obviously cannot be taken for granted in typical PDMS settings where mappings can be created (semi-) automatically by independent parties. In this work, we propose a totally decentralized, efficient message passing scheme to automatically detect erroneous mappings in PDMS. Our scheme is based on a probabilistic model where we take advantage of transitive closures of mapping operations to confront local beliefs on the correctness of a mapping with evidence gathered around the network. We show that our scheme can be efficiently embedded in any PDMS and provide a preliminary evaluation of our techniques on sets of both automatically-generated and real-world schemas.

1 Introduction

Data integration techniques have traditionally revolved around global schemas to query heterogeneous databases

∗The work presented in this paper was supported (in part) by the National Competence Center in Research on Mobile Information and Communication Systems (NCCR-MICS), a center supported by the Swiss National Science Foundation under grant number 5005-67322, and was (partly) carried out in the framework of the EPFL Center for Global Computing, supported by the Swiss National Funding Agency OFES as part of the European project Evergrow No 001935.

in a transparent way: popular techniques such as LAV (Local-As-View) or GAV (Global-As-View) drew considerable attention as they permit deterministic reformulations of queries over various data sources. These centralized approaches, however, require the definition of an integrated schema supposedly subsuming all local schemas. This requirement is particularly stringent in highly heterogeneous, dynamic and decentralized environments such as the Web or Peer-to-Peer overlay networks. In these settings, largely autonomous and heterogeneous data sources coexist without any central coordination. Keeping a global schema up-to-date, or defining a schema general enough to encompass all information sources, is thus unmanageable in practice in this context. This evolution motivated the research community to imagine new paradigms for handling heterogeneous data in decentralized environments. In this paper, we focus on the Peer Data Management Systems (PDMS) approach, which recently drew considerable attention in this context [4].

PDMS take advantage of local, pairwise schema mappings between pairs of databases to propagate queries posed against a local schema to the rest of the network in an iterative and cooperative manner. No global semantic coordination is needed, as peers only need to define local mappings to a small set of related databases to be part of the global network of databases. Once a database sends a query to its immediate neighbors through local mapping links, its neighbors (after processing the query) in turn propagate the query to their own neighbors, and so on and so forth until the query reaches all (or a predefined number of) databases. The way the query spreads around the network mimics the way messages are routed in a Peer-to-Peer (P2P) system, thus the appellation Peer Data Management Systems.

The vast majority of PDMS approaches, however, propagate queries without concern for the validity or quality of the mappings. This obviously represents a severe limitation, as one cannot expect any level of consistency or quality from PDMS mappings, for several reasons: First, remember that PDMS target large-scale, decentralized and heterogeneous environments where autonomous parties have full control over schema designs. As a result, we can expect irreconcilable differences in conceptualizations (e.g., epistemic or metaphysical differences on a given modeling problem, see also [18]) among the databases. Also, the limited expressivity of the mappings, usually defined as queries or using an ontology definition language like OWL, precludes the creation of correct mappings in many situations (e.g., mapping an attribute onto a relation). Finally, given the vibrant activity on (semi-) automatic alignment techniques (see [11] for a recent overview), we can expect some (most?) of the mappings to be generated automatically in large-scale settings, with all the evident quality problems this entails.

1.1 Our Contribution

In the following, we propose a probabilistic technique to determine the quality of schema mappings in PDMS settings in a totally automated way and without any form of central coordination. As peers do not always share common data, mapping errors are typically difficult to discover from a local perspective in PDMS. Our methods are based on the analysis of cycles and parallel paths in the graph of schema mappings: after detecting mapping inconsistencies by comparing transitive closures of mapping operations, we build a global probabilistic inference model spanning the entire PDMS. We show how to construct this model in a totally decentralized way, involving local information only. We describe a decentralized and efficient method to derive mapping quality measures from our model. Our approach is based on a decentralized version of iterative sum-product message passing techniques (loopy belief propagation). We show how to embed our approach in PDMS with a very modest communication overhead by piggybacking on normal query processing operations. Finally, we present a preliminary evaluation of our technique applied on both generated and real-world data.

1.2 An Introductory Example

Before delving into technicalities, we start with a high-level, introductory example of our approach. Let us consider the simple PDMS network depicted in Figure 1. This network is composed of four databases p1, . . . , p4. All databases store a collection of XML documents related to pieces of art, but structured according to four different schemas (one per database). Each database supports XQuery as query language. Various pairwise XQuery schema mappings have been created (both manually and automatically) to link the databases; Figure 2 below shows an example of a mapping between two schemas and how one can take advantage of this mapping to resolve a query posed against two different schemas.

[Figure 1. A simple directed PDMS network of four peers p1, . . . , p4 and five schema mappings, here depicted for the attribute Creator. The attribute-level correspondences read: /art/creator -> /Creator (p1 to p2), /Creator -> /Author/DisplayName (p2 to p3), /Author/DisplayName -> /Painting/Painter (p3 to p4), /Painting/Painter -> /art/creator and /Painting/CreatedOn -> /art/creatDate (p4 to p1), and /Creator -> /Painting/CreatedOn (p2 to p4).]

[Figure 2. An example of schema mapping (here between peers p2 and p3) expressed as an XQuery. The figure shows a query Q1 posed against the local Photoshop schema, a sample Photoshop_Image document and a sample WinFSImage document, the mapping T12 translating WinFS documents into the Photoshop schema, and the rewritten query Q2:

Q1 = <GUID>$p/GUID</GUID>
     FOR $p IN /Photoshop_Image
     WHERE $p/Creator LIKE "%Robi%"

T12 = <Photoshop_Image>
          <GUID>$fs/GUID</GUID>
          <Creator>$fs/Author/DisplayName</Creator>
      </Photoshop_Image>
      FOR $fs IN /WinFSImage

Q2 = <GUID>$p/GUID</GUID>
     FOR $p IN T12
     WHERE $p/Creator LIKE "%Robi%"]

Let us suppose that a user in p2 wishes to retrieve the names of all artists having created a piece of work related to some river. The user could locally pose an XQuery like the following:

q1 = FOR $c IN distinct-values(ArtDatabank//Creator)
     WHERE $c/..//Item LIKE "%river%"
     RETURN <myArtist> $c </myArtist>

This query basically boils down to a projection on the attribute Creator, op1 = πCreator, and a selection on the title, op2 = σItem=%river%. The user issues the query and patiently awaits answers, both from his local database and from the rest of the network.

In a standard PDMS, the query would be forwarded through both outgoing mappings of p2, generating a fair proportion of false positives, as one of these two mappings (the one between p2 and p4) is incorrect for the attribute Creator (the mapping erroneously maps Creator in p2 onto CreatedOn in p4, see Figure 1). Luckily for our user, the PDMS he is using implements our message passing techniques. Without any prior information on the mappings, the system detects inconsistencies for the mappings on Creator by analyzing the cycles p1 → p2 → p4 → p1 and p1 → p2 → p3 → p4 → p1, as well as the parallel paths p2 → p4 and p2 → p3 → p4 in the mapping network. In a decentralized process, the PDMS constructs a probabilistic network and determines that the semantics of the attribute Creator will most likely be preserved by all mappings, except by the mapping between p2 and p4, which is more likely to be faulty. Thus, this specific query will be routed through mapping p2 → p3, and then iteratively to p4 and p1. In the end, the user will retrieve all artist names as specified, without any false positives, since the mapping p2 → p4 was ignored in the query resolution process.

2 Problem Definition

We model PDMS as collections of peers; each individual peer p represents a database storing data according to a distinct structured schema. Note that a peer could in practice represent a (potentially large) cluster of databases all adhering to the same schema. As we wish to present an approach as generic as possible, we do not make any assumption on the exact data model used by the databases in the following, but illustrate some of our claims with examples in XML and RDF. We only require the databases to store information with respect to some concepts we call attributes a (e.g., attributes in a relational schema, elements or attributes in XML, and classes or properties in RDF). Additionally, we suppose that each peer can be identified and contacted by a unique identifier (e.g., an IP address or a peer ID in a P2P network).

Peers are connected to one another through (un)directed edges representing (un)directed pairwise schema mappings. A schema mapping m allows a query posed against a local schema to be evaluated against another schema. This operation is uni- or bi-directional depending on the language used to express the mappings. Again, we do not make assumptions on that language, except that it should make it possible to connect pairs of semantically similar attributes. Note that this language may for example allow for syntactic transformations (e.g., transforming a date from one format to another) or complex mappings (e.g., mapping an attribute to part of another attribute). Finally, remember that our fundamental assumption is that some mappings might be incorrect, i.e., they might map an attribute from one database to a semantically irrelevant attribute in another database.

We consider queries posed locally against the schema of a given peer. Queries are composed of generic selection / projection operations op on attributes. For each attribute ai appearing in the query, the system (or the expert user) defines a semantic threshold θai under which the query should not be propagated any further: as the query gets propagated throughout the network of peer databases, each intermediate peer checks the probability P(ai = correct) of each attribute in the query being semantically preserved by a mapping operation. The query is forwarded through a mapping link to another peer database only if all attributes are preserved, that is, if P(ai = correct) > θai for all ai for the given mapping. Note that other per-hop forwarding behaviors could easily be implemented with our techniques (see [3]), but we stick to the given scheme for simplicity.
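To make this per-hop rule concrete, here is a minimal sketch in Python; the probability estimates and threshold values are illustrative placeholders, as in a running system they would come from the inference process of Section 4:

def should_forward(query_attributes, p_correct, theta):
    # Forward the query through a mapping only if every attribute
    # appearing in the query is preserved with sufficient probability.
    return all(p_correct.get(a, 0.0) > theta.get(a, 0.5)
               for a in query_attributes)

# Example: a query projecting on Creator and selecting on Item.
p_correct = {"Creator": 0.3, "Item": 0.8}  # hypothetical estimates
theta = {"Creator": 0.5, "Item": 0.5}      # per-attribute thresholds
print(should_forward(["Creator", "Item"], p_correct, theta))  # False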

Given this setting, our goal is to provide probabilistic guarantees on the correctness of a mapping operation, i.e., to determine P(ai = correct). As with any processing in a PDMS, we wish our methods to operate without any global coordination, in a purely decentralized manner. Also, we would like our methods to be totally automated, as precise as possible, and fast enough to be applied on large schemas or ontologies.

3 Modeling PDMS as Factor-Graphs

In the following, we take advantage of query messages being forwarded from one peer to another to detect inconsistencies in the network of mappings. We represent individual mappings and network information as related random variables in a probabilistic graphical model. We will then efficiently evaluate marginal probabilities, i.e., mapping quality, using these models.

3.1 A Quick Reminder on Factor-Graphs and Message Passing Schemes

We give below a brief overview of message passing techniques. For a more in-depth coverage, we refer the interested reader to one of the many overviews of this domain, such as [12]. Probabilistic graphical models are a marriage between probability theory and graph theory. In many situations, one can deal with a complicated global problem by viewing it as a factorization of several local functions, each depending on a subset of the variables appearing in the global problem. As an example, suppose that a global function g(x1, x2, x3, x4) factors into a product of two local functions fA and fB: g(x1, x2, x3, x4) = fA(x1, x2) fB(x2, x3, x4). This factorization can be represented in graphical form by the factor-graph depicted in Figure 3, where variables (circles) are linked to their respective factors (black squares). Often, one is interested in computing a marginal of this global function, e.g.,

$$g_2(x_2) = \sum_{x_1} \sum_{x_3} \sum_{x_4} g(x_1, x_2, x_3, x_4) = \sum_{\sim\{x_2\}} g(x_1, x_2, x_3, x_4)$$

where we introduce the summary operator $\sum_{\sim\{x_i\}}$ to sum over all variables but $x_i$. Such marginals can be derived in an efficient way by a series of sum-product operations on the local functions, such as:

$$g_2(x_2) = \Big( \sum_{x_1} f_A(x_1, x_2) \Big) \Big( \sum_{x_3} \sum_{x_4} f_B(x_2, x_3, x_4) \Big).$$

[Figure 3. A simple factor-graph of four variables x1, . . . , x4 and two factors fA, fB, with messages µfA→x2(x2) and µfB→x2(x2) sent to x2.]

Interestingly, the above computation can be seen as the product of two messages µfA→x2(x2) and µfB→x2(x2) sent respectively by fA and fB to x2 (see Figure 3). The sum-product algorithm exploits this observation to compute all marginal functions of a factor-graph in a concurrent and efficient manner. Message passing algorithms traditionally compute marginals by sending two messages, one in each direction, for every edge in the factor-graph:

variable x to local factor f:

$$\mu_{x \to f}(x) = \prod_{h \in n(x) \setminus \{f\}} \mu_{h \to x}(x)$$

local factor f to variable x:

$$\mu_{f \to x}(x) = \sum_{\sim\{x\}} \Big( f(X) \prod_{y \in n(f) \setminus \{x\}} \mu_{y \to f}(y) \Big)$$

where n(·) stands for the neighbors of a variable / function node in the graph.

These computations are known to be exact for cycle-free factor-graphs; in contrast, applications of the sum-product algorithm in a factor-graph with cycles only result in approximate computations of the marginals [15]. However, some of the most exciting applications of the sum-product algorithm (e.g., decoding of turbo or LDPC codes) arise precisely in such situations. We show below that this is also the case for factor-graphs modeling Peer Data Management Systems. Finally, note that Belief Propagation as introduced by Judea Pearl [20] is actually a special case of the standard sum-product message passing algorithm.
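To make the factorization concrete, the following self-contained Python sketch computes the marginal g2(x2) both by brute force and as the product of the two messages µfA→x2 and µfB→x2; the local functions fA and fB are arbitrary illustrative choices, not taken from the paper:

import itertools

def fA(x1, x2):
    return 1.0 if x1 == x2 else 0.5  # arbitrary local function

def fB(x2, x3, x4):
    return 1.0 + x2 + 0.5 * x3 * x4  # arbitrary local function

def g(x1, x2, x3, x4):
    return fA(x1, x2) * fB(x2, x3, x4)  # the factored global function

# Marginal by brute force: sum over all variables but x2.
g2_brute = {x2: sum(g(x1, x2, x3, x4)
                    for x1, x3, x4 in itertools.product([0, 1], repeat=3))
            for x2 in [0, 1]}

# Marginal as the product of the two messages sent to x2.
mu_fA = {x2: sum(fA(x1, x2) for x1 in [0, 1]) for x2 in [0, 1]}
mu_fB = {x2: sum(fB(x2, x3, x4)
                 for x3, x4 in itertools.product([0, 1], repeat=2))
         for x2 in [0, 1]}
g2_msg = {x2: mu_fA[x2] * mu_fB[x2] for x2 in [0, 1]}

assert g2_brute == g2_msg  # both computations agree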

3.2 On Factor-Graphs in Undirected PDMS

In the following, we explain how one can model a network of mappings as a factor-graph. These factor-graphs will in turn be used in Section 4 to derive quality measures for the various mappings in the network.

3.2.1 Cyclic Mappings

Semantic overlay network topologies are not generated at random. On the contrary, they are constructed by (computerized or human) agents aiming at interconnecting partially overlapping information sources. We can expect very high clustering coefficients in these networks, since similar sources will tend to bond together and create clusters of sources. As an example, a study of an online network of related biological schemas (in the SRS system, http://www.lionbioscience.com) shows an exponential degree distribution and an unusually high clustering coefficient of 0.54 (as of May 2005). Consequently, we can expect semantic schema graphs to exhibit scale-free properties and an unusually high number of loops [7].

Let us assume we have detected a cycle of mappings m0, m1, . . . , mn−1 connecting n peers p0, p1, . . . , pn−1, p0 in a circle. Cycles of mappings can easily be discovered by the peers in the PDMS network, either by proactively flooding their neighborhood with probe messages with a certain Time-To-Live (TTL) or by examining the traces of routed queries in the network. We take advantage of transitive closures of mapping operations in the cycle to compare a query q posed against the schema of p0 to the corresponding query q′ forwarded through all n mappings along the cycle: q′ = mn−1(mn−2(. . . m0(q) . . .)). q and q′ can be compared on an equal basis since they are both expressed in terms of the schema of p0. In an ideal world, q′ ≡ q, since the transformed query q′ is the result of n identity mappings applied to the original query q. In a distributed setting, however, this might not always be the case, both because of the lack of expressiveness of the mappings and because mappings can be created in (semi-) automatic ways.

When comparing an attribute ai in an operation opq(ai) appearing in the original query q to the attribute aj from the corresponding operation op′(aj) in the transformed query q′, three subcases may occur in practice:

aj = ai: this occurs when the attribute, after having been transformed n times through the mappings, still maps to the original attribute when returning to the semantic domain of p0. Since this indicates a high level of semantic agreement along the cycle for this particular attribute, we say that this represents positive feedback f+ on the mappings constituting the cycle.

aj ≠ ai: this occurs when the attribute, after having been transformed n times through the mappings, maps to a different attribute when returning to the semantic domain of p0. As this indicates some disagreement on the semantics of ai along the cycle of mappings, we say that this represents negative feedback f− on the mappings constituting the cycle.

aj = ⊥: this occurs when some intermediary schema does not have a representation for the attribute in question, i.e., cannot map the attribute onto one of its own attributes. This does not give us any additional (feedback) information on the level of semantic agreement along the cycle, but can still represent valuable information in other contexts, for example when analyzing query forwarding on a syntactic level (see also [3]). In the current case, we consider that the probability of the correctness of a mapping link drops to zero for a specific attribute if the mapping does not provide any correspondence for that attribute.

We focus here on single-attribute operations for simplicity, but our results can be extended to multi-attribute operations as well.
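To make the three cases concrete, a minimal Python sketch of the comparison step follows; encoding the missing case (⊥) as None and the function name are our own illustrative choices:

def feedback(a_original, a_returned):
    if a_returned is None:
        return "neutral"   # some schema could not express the attribute
    if a_returned == a_original:
        return "positive"  # semantics preserved along the cycle
    return "negative"      # semantic disagreement along the cycle

print(feedback("Creator", "Creator"))    # positive
print(feedback("Creator", "CreatedOn"))  # negative
print(feedback("Creator", None))         # neutral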

Also to be taken into account is the fact that a series of erroneous mappings on ai can accidentally compensate their respective errors and actually create a correct composite mapping mn−1 ◦ mn−2 ◦ . . . ◦ m0 in the end. Assuming a probability ∆ of two or more mapping errors being compensated along a cycle in this way, we can determine the conditional probability of a cycle producing positive feedback f+↻ given the correctness of its constituting mappings m0, . . . , mn−1:

$$P(f^{+}_{\circlearrowright} \mid m_0, \ldots, m_{n-1}) = \begin{cases} 1 & \text{if all mappings are correct} \\ 0 & \text{if exactly one mapping is incorrect} \\ \Delta & \text{if two or more mappings are incorrect} \end{cases}$$
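As a minimal illustration, this feedback factor can be written as a function of the (hidden) correctness states of the cycle's mappings; the Python encoding below is our own sketch, with DELTA standing for ∆:

DELTA = 0.1  # probability that two or more errors compensate each other

def p_positive_feedback(mappings_correct):
    # P(f+ | m0, ..., mn-1), given one boolean per mapping in the cycle.
    incorrect = sum(1 for m in mappings_correct if not m)
    if incorrect == 0:
        return 1.0    # all mappings correct: feedback must be positive
    if incorrect == 1:
        return 0.0    # a single error cannot be compensated
    return DELTA      # two or more errors may compensate each other

print(p_positive_feedback([True, True, True]))    # 1.0
print(p_positive_feedback([True, False, True]))   # 0.0
print(p_positive_feedback([False, False, True]))  # 0.1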

This conditional probability function allows us to create a factor-graph from a network of interconnected mappings. We create a global factor-graph as follows:

for all mapping m in PDMS
    add m.factor to global factor-graph;
    add m.variable to m.factor;
for all mapping cycle c in PDMS
    add c.feedback.factor to global factor-graph;
    add c.feedback.variable to c.feedback.factor;
    for all mapping m in mapping cycle c
        link c.feedback.factor to m.variable;

Figure 4 illustrates the derivation of a factor-graph from a simple semantic network of four peers p1, . . . , p4 (left-hand side of Figure 4). The peers are interconnected through five mappings m12, m23, m34, m41 and m24. One may attempt to obtain feedback from three different mapping cycles in this network:

f1↻ : m12 − m23 − m34 − m41
f2↻ : m12 − m24 − m41
f3↻ : m23 − m34 − m24.

[Figure 4. Modeling an undirected network of mappings as a factor-graph: four peers p1, . . . , p4 interconnected by five mappings (left) and the corresponding factor-graph with mapping variables m12, m23, m34, m41, m24 and feedback variables f1, f2, f3 (right).]

The right-hand side of Figure 4 depicts the resulting factor-graph, containing from top to bottom: five one-variable factors for the prior probability functions on the mappings, five mapping variables mij, three factors linking feedback variables to mapping variables through conditional probability functions (defined as explained above), and finally three feedback variables fk. Note that feedback variables are usually not independent: two feedback variables are correlated as soon as the two mapping cycles they represent have at least one mapping in common (e.g., in Figure 4, where all three feedbacks are correlated).
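For concreteness, the topology of this factor-graph can be captured in a few lines; the Python data structures below are our own illustrative encoding of Figure 4, not the paper's implementation:

mappings = ["m12", "m23", "m34", "m41", "m24"]

cycles = {  # one feedback factor per detected mapping cycle
    "f1": ["m12", "m23", "m34", "m41"],
    "f2": ["m12", "m24", "m41"],
    "f3": ["m23", "m34", "m24"],
}

# Adjacency of the factor-graph: every mapping variable is linked to its
# prior factor and to the feedback factors of the cycles it belongs to.
neighbors = {m: ["prior_" + m] for m in mappings}
for f, ms in cycles.items():
    for m in ms:
        neighbors[m].append(f)

print(neighbors["m12"])  # ['prior_m12', 'f1', 'f2']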

3.3 On Factor-Graphs in Directed PDMS

One may derive similar factor-graphs in directed PDMS networks, focusing this time on directed mapping cycles and parallel mapping paths. Factors from directed mapping cycles m0 → m1 → . . . → mn−1 are defined in exactly the same way as explained above for the undirected case.

Parallel mapping paths occur when two different series of mappings m′ and m′′ share the same source and destination, e.g., m′0 → m′1 → . . . → m′n′−1 and m′′0 → m′′1 → . . . → m′′n′′−1, with m′0 and m′′0 departing from the same peer and m′n′−1 and m′′n′′−1 arriving at the same peer. Those two parallel paths would be considered as forming an undirected cycle in an undirected network, but cannot be considered as such here due to the restriction on the direction of the mapping operations in a network of directed mappings.

If a query q is forwarded through both parallel paths, the destination peer can compare both q′ = m′n′−1(. . . (m′0(q))) and q′′ = m′′n′′−1(. . . (m′′0(q))). As for the undirected cycle presented above, three cases can occur when comparing an operation op(a′i) of q′ to its corresponding operation op(a′′j) in q′′: positive feedback (a′i ≡ a′′j), negative feedback (a′i ≠ a′′j) or neutral feedback (a′i = ⊥ and/or a′′j = ⊥).

[Figure 5. Modeling a directed network of mappings as a factor-graph: four peers p1, . . . , p4 interconnected by six directed mappings m12, m21, m23, m34, m41, m24 (left) and the corresponding factor-graph with feedback factors and variables f1, . . . , f5 (right).]

The conditional probability function for receiving positive feedback f+⇒ through parallel paths, knowing the correctness of the sets of mappings {m′} = {m′0, . . . , m′n′−1} and {m′′} = {m′′0, . . . , m′′n′′−1}, is:

$$P(f^{+}_{\Rightarrow} \mid \{m'\}, \{m''\}) = \begin{cases} 1 & \text{if all mappings are correct} \\ 0 & \text{if exactly one mapping is incorrect} \\ \Delta & \text{if two or more mappings are incorrect} \end{cases}$$

Figure 5 shows an example of a directed mapping network with four peers and six mappings. Feedback from two directed cycles and three pairs of parallel paths might be gathered from the network:

f1↻ : m12 → m23 → m34 → m41
f2↻ : m12 → m24 → m41
f3⇒ : m21 ‖ m24 → m41
f4⇒ : m24 ‖ m23 → m34
f5⇒ : m21 ‖ m23 → m34 → m41.

As for the undirected case, the right-hand side of Figure 5 represents the factor-graph derived from the directed mapping network of the left-hand side.

Since undirected mapping networks and directed mapping networks result in structurally similar factor-graphs in the end, we treat them on the same basis in the following. We only include two versions of our derivations when some noticeable difference between the undirected and the directed case surfaces.

4 Embedded Message Passing

So far, we have developed a graphical probabilistic model capturing the relations between mappings and network feedback in a PDMS. To take advantage of these models, one would have to gather all information pertaining to all mappings, cycles and parallel paths in a system. However, adopting this centralized approach makes no sense in our context, as PDMS were precisely invented to avoid such centralization. Instead, we devise below a method to embed message passing into the normal operations of a Peer Data Management System. Thus, we are able to get globally consistent mapping quality measures in a scalable, decentralized and efficient manner while respecting the autonomy of the peers.

Looking back at the factor-graphs introduced in Sections 3.2 and 3.3, we make two observations: i) some (but not all) nodes appearing in the factor-graphs can be mapped back onto the original PDMS graph, and ii) the factor-graphs contain cycles.

4.1 On Feedback Variables in PDMS Factor-Graphs

Going through one of the figures representing a PDMS factor-graph from top to bottom, one may identify four different kinds of nodes: factors for the prior probability functions on the mappings, variable nodes for the correctness of the mappings, factors for the probability functions linking mapping and feedback variables, and finally variable nodes for the feedback information. Going one step further, one can make a distinction between nodes representing local information, i.e., mapping factors and mapping variables, and nodes pertaining to global information, i.e., feedback factors and feedback variables.

Mapping back local information nodes onto the PDMS is easy, as only the nodes from which a mapping is departing need to store information about that mapping (see the per-hop routing behavior in Section 2). Luckily, we can also map the other nodes rather easily, as they either contain global but static information (the density function in feedback factors) or information gathered around the local neighborhood of a node (∆, observed values for fi↻ and fj⇒, see the preceding section). Hence, each peer p only needs to store a fraction of the global factor-graph, selected as follows:

for all outgoing mapping m
    add m.factor to local factor-graph;
    add m.variable to m.factor;
    for all feedbackMessage f containing m
        add f.factor to m.variable;
        if f.isPositive
            add f.variable(+) to f.factor;
        else if f.isNegative
            add f.variable(-) to f.factor;
        for all mapping m' in feedback except m
            add virtual peer m'.peer to f.factor;

[Figure 6. Creating a local factor-graph in the PDMS (here for peer p1): p1 keeps the prior factor fam12 and the variable for its outgoing mapping m12, the feedback factors fa1, fa2 with their feedback variables f1, f2, and virtual peer nodes p2, p3, p4 standing for the remote mappings involved in the corresponding cycles.]

where feedbackMessage stands for all cycle or parallel path feedback messages received from neighboring peers (resulting from probes flooded within a certain TTL throughout the neighborhood or from analyzing standard forwarded queries). Figure 6 shows how p1 from Figure 5 would store its local factor-graph.

Note that, depending on the PDMS, one can choose between two levels of granularity for storing factor-graphs and computing the related probabilistic values: coarse granularity, where peers only store one factor-graph per mapping and derive only one global value for the correctness of the mapping, and fine granularity, where peers store one instance of the local factor-graph per attribute in the mapping and derive one probabilistic quality value per attribute. We suppose we are in the latter situation but show derivations for only one attribute in the following. Values for other attributes can be derived in a similar fashion.

4.2 On Cycles in PDMS Factor-Graphs

Cycles appear in PDMS factor-graphs as soon as two mappings belong to the same two cycles or parallel paths in the PDMS. See for example the PDMS in Figure 4, where m12 and m41 both appear in the cycles p1 − p2 − p4 − p1 and p1 − p2 − p3 − p4 − p1, hence creating a cycle m12 − factor(f1) − m41 − factor(f2) − m12 in the factor-graph. As mentioned above, the results of the sum-product algorithm operating in a factor-graph with cycles cannot (in general) be interpreted as exact function summaries.

One well-known approach to circumvent this problem is to transform the factor-graph by regrouping nodes (clustering or stretching transformations) to produce a factor tree. In our case, this would result in regrouping all mappings having more than one cycle or parallel path in common; this is obviously inapplicable in practice, as it would imply introducing central components in the PDMS to regroup (potentially large) sets of independent peers (see also Section 7 for a discussion on this topic). Instead, we rely on iterative, decentralized message passing schedules (see below) to estimate marginal functions in a concurrent and efficient way. We show in Section 5 that those evaluations are sufficiently accurate to make sensible decisions on the mappings in practice.

4.3 Embedded Message Passing Schedules

Given its local factor-graph and messages received from its neighborhood, a peer can locally update its belief on the mappings by reformulating the sum-product algorithm (Section 3.1) as follows:

local message from factor fa_j to mapping variable m_i:

$$\mu_{fa_j \to m_i}(m_i) = \sum_{\sim\{m_i\}} \Big( fa_j(X) \prod_{p_k \in n(fa_j)} \mu_{p_k \to fa_j}(p_k) \prod_{m_l \in n(fa_j) \setminus \{m_i\}} \mu_{m_l \to fa_j}(m_l) \Big)$$

local message from mapping m_i to factor fa_j ∈ n(m_i):

$$\mu_{m_i \to fa_j}(m_i) = \prod_{fa \in n(m_i) \setminus \{fa_j\}} \mu_{fa \to m_i}(m_i)$$

remote message for factor fa_k from peer p_0 to peer p_j ∈ n(fa_k):

$$\mu_{p_0 \to fa_k}(m_i) = \prod_{fa \in n(m_i) \setminus \{fa_k\}} \mu_{fa \to m_i}(m_i)$$

posterior correctness of local mapping m_i:

$$P(m_i \mid \{F\}) = \alpha \prod_{fa \in n(m_i)} \mu_{fa \to m_i}(m_i)$$

where α is a normalizing factor ensuring that the probabilities of all events sum to one (i.e., making sure that P(m_i = correct) + P(m_i = incorrect) = 1).
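As a small numeric illustration of the last formula, the posterior of a mapping variable is the normalized product of the messages arriving from its neighboring factors; the message values in this Python sketch are hypothetical:

import math

def posterior(messages):
    # messages: (mu(correct), mu(incorrect)) pairs from neighboring factors.
    p_c = math.prod(m[0] for m in messages)
    p_i = math.prod(m[1] for m in messages)
    alpha = 1.0 / (p_c + p_i)  # normalization so both outcomes sum to one
    return alpha * p_c, alpha * p_i

prior = (0.5, 0.5)    # maximum-entropy prior factor
from_f1 = (0.9, 0.1)  # message reflecting positive cycle feedback
from_f2 = (0.3, 0.7)  # message reflecting negative cycle feedback
print(posterior([prior, from_f1, from_f2]))  # approx. (0.79, 0.21)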

In cycle-free PDMS factor-graphs (i.e., trees), exact messages can be propagated from mapping variables to the rest of the network in at most two iterations (due to the specific topology of our factor-graph). Thus, all inference results will be exact in two iterations.

For the more general case of PDMS factor-graphs with cycles, the computation cannot start as is, since every peer has to wait for messages from the other peers. We resolve this problem in a standard manner by considering that all peers virtually received a unit message (i.e., a message representing the unit function) from all other peers appearing in their local factor-graphs prior to starting the algorithm. From there on, peers derive probabilities on the correctness of their local mappings and send messages to other peers as described above. We show in Section 5 that, for PDMS factor-graphs with cycles, the algorithm converges to very good approximations of the exact values obtained by a standard global inference process. Peers can decide to send messages according to different schedules depending on the PDMS; we detail below two possible schedules with quite different performance in terms of communication overhead and convergence speed.

4.3.1 Periodic Message Passing Schedule

In highly dynamic environments where databases, schemas and schema mappings are constantly evolving, appearing or disappearing, peers might wish to act proactively in order to get results on the correctness of their mappings in a timely fashion. In a Periodic Message Passing Schedule, peers send remote messages to all peers p_i appearing in their local factor-graph every time period τ. This corresponds to a new round of the iterative sum-product algorithm. This periodic schedule induces some communication overhead (a maximum of $\sum_{c_i} (l_{c_i} - 1)$ messages per peer every τ, where the c_i represent all mapping cycles passing through the peer and l_{c_i} the length of those cycles) but guarantees that our methods converge within a given time-frame dependent on the topology of the network (see also Section 5). Note that τ should be chosen according to the network churn in order to guarantee convergence in highly dynamic networks. Its exact value may range from a couple of seconds to weeks or months depending on the exact situation.

4.3.2 Lazy Message Passing Schedule

A very nice property of the iterative message passing algorithm is that it is tolerant to delayed or lost messages. Hence, we do not actually require any kind of synchronization for the message passing schedule; peers can decide to send remote messages whenever they want without endangering the global convergence of the algorithm (the algorithm will still converge to the same point, simply more slowly; see the next section). We may thus take advantage of this property to totally eliminate any communication overhead (i.e., number of additional messages sent) induced by our method by piggybacking on query messages. The idea is as follows: every time a query message is sent from one peer to another through a mapping link m_i, we append to this query message all messages µ(m_i) pertaining to the mapping being used. In this case, the convergence speed of our algorithm is proportional to the query load of the system. This may be the ideal schedule for query-intensive or relatively static systems.
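A minimal sketch of the piggybacking idea follows; the function and structure names (forward_query, pending_messages) are our own illustrative choices:

def forward_query(query, mapping_id, pending_messages, send):
    # Append to the query message all pending sum-product messages
    # pertaining to the mapping link being used, then ship both together.
    payload = {
        "query": query,
        "mapping": mapping_id,
        "bp_messages": pending_messages.pop(mapping_id, []),
    }
    send(payload)

pending = {"m23": [("f3", (0.4, 0.6))]}  # hypothetical queued message
forward_query("q1", "m23", pending, send=print)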

4.4 Prior Belief Updates

Our computations always take into account the mapping factors (the top layer of a PDMS factor-graph). These factors represent any local, prior knowledge the peers might possess on their mappings. For example, if the mappings were carefully checked and validated by a domain expert, the peer might want to set all prior probabilities on the correctness of the mappings to one, to ensure that these mappings will always be treated as correct.

In most cases, however, the peers initially have only a vague idea (e.g., the presupposed quality of the alignment technique used to create the mappings) of the priors related to their surrounding mappings. As the network of mappings evolves and time passes, however, the peers start to accumulate various posterior probabilities on the correctness of their mappings thanks to the iterative message passing techniques described above. Actually, the peers get new posterior probabilities on the correctness of the mappings as long as the network of mappings continues to evolve (e.g., as mappings get created, modified or deleted). Thus, peers can decide to modify their prior beliefs by taking into account the accumulated evidence in order to get more accurate results in the future. This corresponds to learning parameters in a probabilistic graphical model when some of the observations are missing. Several techniques might be applied to this type of problem (e.g., Monte Carlo methods, Gaussian approximations). We propose in the following a simple Expectation-Maximization [8] process, which looks as follows:

- Initialize the prior probability on the correctness of the mapping, taking into account any prior information on the mapping. If no information is available for a given mapping, start with P(m = correct) = P(m = incorrect) = 0.5 (maximum entropy principle).

- Gather posterior evidence P_k(m = correct | {F_k}) on the correctness of the mapping thanks to cycle analyses and message passing techniques. Treat these values as new observations for every change of the local factor-graphs (i.e., new feedback information, or a new, modified or lost cycle or parallel path).

- After each change of the local factor-graph, update the prior belief on the correctness of the mapping m given the previous evidence P_i(m = correct | {F_i}) in the following way:

$$P(m = correct) = \frac{1}{k} \sum_{i=1}^{k} P_i(m = correct \mid \{F_i\})$$

Hence, we can make the prior values slowly converge to a local maximum likelihood, reflecting the fact that more and more evidence is being gathered about the mappings as the mapping network evolves.
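The update rule is thus a simple running average over the posterior values collected so far; a minimal sketch with hypothetical evidence values:

def updated_prior(posteriors):
    # posteriors: the values P_i(m = correct | {F_i}) for i = 1..k.
    return sum(posteriors) / len(posteriors)

evidences = [0.5, 0.7, 0.65, 0.6]  # hypothetical posterior evidence
print(updated_prior(evidences))    # 0.6125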

4.5 Introductory Example Revisited

Let us now come back to our introductory example and describe in more detail what happened. Imagine that the network of databases was just created and that the peers have no prior information on their mappings. By sending probe queries with TTL ≥ 4 through its two mapping links, p2 detects two cycles and one parallel path, and gets all related feedback information. For the attribute Creator:

f1+↻ : m12 → m23 → m34 → m41
f2−↻ : m12 → m24 → m41
f3−⇒ : m24 ‖ m23 → m34

p2 constructs a local factor-graph based on this information and starts sending remote messages and calculating posterior probabilities on its mappings according to the schedule in place in the PDMS. ∆, the probability that two or more mapping errors get compensated along a cycle, is here approximated as 1/10: if we consider that the schema of p2 contains eleven attributes, and that mapping errors map to a randomly chosen attribute (but obviously not the correct one), the probability of the last mapping error compensating any previous error is 1/10, thus explaining our choice. After a handful of iterations, the posterior probabilities on the correctness of p2's mappings towards p3 and p4 converge to 0.59 and 0.3 respectively. The second mapping has here been successfully detected as faulty for the given attribute, and will thus not be used to forward query q1 (θi = 0.5). The query will however reach all other databases by being correctly forwarded through p2 → p3, p3 → p4 and finally p4 → p1. As the PDMS network evolves, p2 will update its prior probabilities on the mappings toward p3 and p4 to 0.55 and 0.4 respectively, to reflect the knowledge gathered on the mappings so far.
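As a quick sanity check of this choice of ∆ (our own illustration, assuming errors pick uniformly among the wrong attributes):

n_attributes = 11
delta = 1 / (n_attributes - 1)  # chance the last error lands on the original attribute
print(delta)  # 0.1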

5 Performance Evaluation

We present below a series of results pertaining to the performance of our approach. We proceed in two phases: we first report on sets of simulations to highlight some of the specificities of our approach, before presenting results obtained on a set of real-world schemas.

5.1 Performance Analyses

5.1.1 Convergence

As previously mentioned, our approach is exact for cycle-free PDMS factor-graphs. For PDMS factor-graphs with cycles, our embedded message passing scheme usually converges to approximate results within about ten iterations. Figure 7 below illustrates a typical convergence process for the example PDMS factor-graph of Figure 4, for schemas of about ten attributes (i.e., ∆ set to 0.1), prior beliefs at 0.7 and cycle feedback as follows: f1+, f2−, f3−.

[Figure 7. Convergence of the iterative message passing algorithm (example graph, priors at 0.7).]

Note that the accuracy does not suffer from longer cycles: Figure 9 shows the relative error between our iterative, decentralized scheme and a global inference process. The relative error is computed for our example graph (∆ = 0.1, priors at 0.8, feedback f1+, f2−, f3−, 10 iterations), successively adding new peers as depicted in Figure 8. The relative error is bigger for very short cycles but never reaches 6% (these results are quite typical of iterative message passing schemes).

[Figure 8. Adding nodes iteratively to increase the cycle length: new intermediate peers pi (with mappings m1i and mi2) are inserted between p1 and p2 in the example network.]

5.1.2 Cycle Length Impact

As cycles in the PDMS get bigger, so does the number of variables appearing in the feedback factors. Thus, shorter cycles provide more precise evidence on the correctness (or incorrectness) of a mapping than longer cycles, due to the inherent uncertainty pertaining to each mapping variable. Figure 10 demonstrates this claim for simple cyclic PDMS networks of two to twenty nodes (priors at 0.5, positive feedback, 2 iterations [cycle-free factor-graph]). The results are quite naturally influenced by the value of ∆, but in the end, cycles of more than ten mappings always end up providing very little evidence on the correctness of the mappings, even for bigger schemas for which compensating errors are infrequent (e.g., Figure 10 for ∆ = 0.01).

[Figure 9. Relative error of the iterative message passing algorithm for various cycle lengths (10 iterations, example graph, priors at 0.8).]

Scale-free networks are known to have a number of large loops growing exponentially with the size of the loops considered [7]. Generally speaking, this would imply exponentially growing PDMS factor-graphs for large-scale networks as well. However, as we have just seen, the impact of cycles on the posterior probabilities diminishes rapidly with the size of the cycles.

Peers can locally make a sensible decision on the tradeoff between the maximal length of the cycles they consider and the precision of the results they can get, as follows: they start by considering the direct vicinity of their mappings only and send probes with low TTL values. As they gradually increase the TTL of their probes to detect longer cycles, they monitor the impact of the newly discovered cycles on the posteriors of their mappings. Peers can stop this expansion process as soon as the changes induced by the new cycles become insignificant. At this stage, peers are guaranteed to have discovered the most pertinent cycles (or parallel paths) for their mappings anyway. In practice, the threshold on the cycle length might vary, but it always remains low (i.e., five to ten) for dense graphs, even in very large-scale environments.
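A sketch of this local stopping rule, where probe and infer_posterior are hypothetical stand-ins for the probing and inference machinery:

def expand_until_stable(probe, infer_posterior, eps=0.01, max_ttl=10):
    prev, cycles = None, []
    for ttl in range(2, max_ttl + 1):
        cycles += probe(ttl)            # new cycles discovered at this TTL
        post = infer_posterior(cycles)  # posterior given all cycles so far
        if prev is not None and abs(post - prev) < eps:
            return ttl, post            # new cycles became insignificant
        prev = post
    return max_ttl, prev

# Demo with stubs: each TTL level reveals at most one more cycle; the
# posterior stabilizes once the longer cycles stop mattering.
demo_cycles = {2: ["c1"], 3: ["c2"], 4: ["c3"]}
posteriors = {1: 0.5, 2: 0.62, 3: 0.61}
print(expand_until_stable(lambda ttl: demo_cycles.get(ttl, []),
                          lambda cs: posteriors[len(cs)]))  # (5, 0.61)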

5.1.3 Fault-Tolerance

As mentioned earlier for the lazy message passing schedule, our scheme does not require peers to be synchronized to send their messages. To simulate this property, we randomly discard messages during the iterative message passing schedule and observe the resulting effects. Figure 11 shows the results if we consider, for every message to be sent, a probability P(send) of actually sending it (example network, ∆ = 0.1, priors at 0.8, feedback f1+, f2−, f3−). We observe that our method always converges, even in cases where 90% of the messages get discarded, and that the number of iterations required for our algorithm to converge grows linearly with the rate of discarded messages.

[Figure 10. Impact of the cycle length on the posterior probability, here for a simple positive cycle graph with a varying number of links and three values of ∆.]

5.2 Applying Message Passing on Real-World Schemas

We have developed a tool to test our methods on real-world schemas and mappings. This tool can import OWL schemas (serialized in RDF/XML) and simple RDF mappings (following the format introduced in [18]). It can also create new mappings using the simple alignment techniques described in [10]. Once schemas and alignments have been defined, the tool creates peers, automatically detects cycles, transforms the setting into a PDMS factor-graph and starts sending messages in order to get posterior quality values on the mappings as described above.

We report on preliminary experiments based on a set of standard schemas from the EON Ontology Alignment Contest (http://co4.inrialpes.fr/align/Contest/). The setting is as follows: we first import various ontologies related to bibliographic data: the so-called reference ontology (101), a similar ontology translated into French (221), the Bibtex ontology from M.I.T., the Bibtex ontology from UMBC, and two other bibliographic ontologies from INRIA and Karlsruhe. Each of these ontologies is composed of about thirty concepts (attributes). We then run our automatic alignment techniques to create mappings between this set of schemas. We launch the iterative message passing algorithm on the resulting PDMS factor-graph and try to detect erroneous mappings automatically. Finally, we ask a human expert to manually judge the quality of the results.

[Figure 11. Robustness against faulty links (lost messages).]

Figure 12 shows the precision of our approach for 396 generated mappings w.r.t. the threshold θ used to determine whether a mapping is correct or not (priors at 0.5, ∆ = 0.1). Here, the precision is defined as the number of mappings correctly detected as erroneous over the total number of mappings detected as erroneous. The results are particularly appealing for low θ values, where our method predicts erroneous mappings with 80% or more accuracy. As θ grows, however, more and more mixed or uncertain results are considered, quite naturally lowering the precision. We observe a phase transition at θ = 0.6. At this point, 50% of the 86 erroneous mappings have been automatically discovered by the algorithm, and increasing θ beyond this will not lead to significantly better results. Still, even for high threshold values, our method is significantly better than random guessing.

These preliminary results are quite encouraging, considering the fact that no prior information was provided on the mappings, and that only one complete round of the algorithm could be completed on this static network (i.e., no update on prior beliefs).

6 Related Work

P2P data management techniques recently drew considerable attention (see [1] or [4] for an introduction). In [6], Bernstein et al. formulated requirements for P2P databases in terms of local mappings and semi-automatic solutions for data integration. Edutella [16], based on a super-peer topology, was one of the early systems deployed. Piazza [21, 22] is a PDMS providing efficient query reformulation algorithms for relational and semi-structured data models. The Hyperion [5] project relies on rules to propagate data in a PDMS network according to mapping tables. PeerDB [17] examines the problem of sharing relational data in P2P networks from an information retrieval perspective. None of these works directly addresses the problem of finding or handling faulty mapping links.

[Figure 12. Precision results of the message passing approach with a varying threshold θ.]

We can draw a parallel between our method and global schema alignment methods (see [11] for an overview). Similarity Flooding [14] is a graph matching technique based on a fixpoint computation. The Glue system [9] learns classifiers for classes from instances in order to evaluate the joint probability distributions of the instances. Corpus-based Schema Matching [13] relies on statistics about elements and their relationships to infer constraints on the schema mappings. None of these approaches takes advantage of inconsistencies detected in the mapping network, nor do they use probabilistic inference networks.

Our approach is quite naturally based on some of our previous ideas [2, 3] on analyzing the network of mappings. Our previous work, however, was not concerned with parallel paths, prior beliefs, efficient message passing schemes, belief propagation or probabilistic networks; it was computationally much more expensive and, above all, ignored all interdependencies among the mappings and cycles, thus resulting in comparatively poorer results. As an example, applying our former heuristics to the introductory example graph would result in disqualifying all three mappings on the left (while only one is erroneous), whereas by considering all correlations between the mappings and cycles, our current approach yields much better results (correct inference for all five mappings).

7 Conclusions and Future Work

As distributed database systems move from static, controlled environments to highly dynamic, decentralized settings, we are convinced that correctly handling uncertainty and erroneous information will become a key challenge for improving the overall quality of query answering schemes. The vast majority of approaches today are centered around local and deductive methods, which seem quite inappropriate for maximizing the performance of systems that operate without any form of central coordination. Contrary to these approaches, we consider an abductive, non-monotonic reasoning scheme which reacts to observations or inconsistencies by globally propagating beliefs in a decentralized way. Our approach is computationally efficient as it is solely based on sum-product operations. Also, we have demonstrated its usefulness by evaluating its performance on a set of real-world schemas.

We are currently testing our heuristics on larger automatically-generated PDMS settings. We are particularly interested in understanding the relation between exact inference and our approximate results in those environments. We are also analyzing the computational overhead and scalability properties of other inference techniques (e.g., generalized belief propagation [23], or techniques constructing a junction tree in a distributed way [19]). Finally, as global interoperability is always challenged by the dynamics of the mapping network, we plan to analyze the tradeoff between the effort required to maintain the probabilistic network in a coherent state and the potential gain in terms of relevance of results.

References

[1] K. Aberer and P. Cudré-Mauroux. Semantic Overlay Networks. In International Conference on Very Large Data Bases (VLDB), 2004.

[2] K. Aberer, P. Cudré-Mauroux, and M. Hauswirth. Start making sense: The Chatty Web approach for global semantic agreements. Journal of Web Semantics, 1(1), 2003.

[3] K. Aberer, P. Cudré-Mauroux, and M. Hauswirth. The Chatty Web: Emergent Semantics Through Gossiping. In International World Wide Web Conference (WWW), 2003.

[4] K. Aberer (ed.). Special issue on peer to peer data management. ACM SIGMOD Record, 32(3), 2003.

[5] M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R. J. Miller, and J. Mylopoulos. The Hyperion Project: From Data Integration to Data Coordination. ACM SIGMOD Record, 32(3), 2003.

[6] P. A. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini, and I. Zaihrayeu. Data management for peer-to-peer computing: A vision. In Workshop on the Web and Databases (WebDB), 2002.

[7] G. Bianconi and M. Marsili. Loops of any size and Hamilton cycles in random scale-free networks. In cond-mat/0502552 v2, 2005.

[8] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1977.

[9] A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Learning to map between ontologies on the semantic web. In International World Wide Web Conference (WWW), 2002.

[10] J. Euzenat. An API for ontology alignment. In International Semantic Web Conference (ISWC), 2004.

[11] J. Euzenat et al. State of the art on current alignment techniques. In KnowledgeWeb Deliverable 2.2.3, http://knowledgeweb.semanticweb.org.

[12] F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 2001.

[13] J. Madhavan, P. A. Bernstein, A. Doan, and A. Y. Halevy. Corpus-based schema matching. In International Conference on Data Engineering (ICDE), 2005.

[14] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. In International Conference on Data Engineering (ICDE), 2002.

[15] K. M. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence (UAI), 1999.

[16] W. Nejdl, B. Wolf, C. Qu, S. Decker, M. Sintek, A. Naeve, M. Nilsson, M. Palmer, and T. Risch. EDUTELLA: a P2P networking infrastructure based on RDF. In International World Wide Web Conference (WWW), 2002.

[17] B. C. Ooi, Y. Shu, and K.-L. Tan. Relational Data Sharing in Peer-based Data Management Systems. ACM SIGMOD Record, 32(3), 2003.

[18] P. Bouquet et al. Specification of a common framework for characterizing alignment. In KnowledgeWeb Deliverable 2.2.1, http://knowledgeweb.semanticweb.org.

[19] M. Paskin and C. Guestrin. A robust architecture for distributed inference in sensor networks. In Intel Research Technical Report IRB-TR-03-039, 2004.

[20] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[21] I. Tatarinov and A. Halevy. Efficient Query Reformulation in Peer-Data Management Systems. In SIGMOD Conference, 2004.

[22] I. Tatarinov, Z. Ives, J. Madhavan, A. Halevy, D. Suciu, N. Dalvi, X. Dong, Y. Kadiyska, G. Miklau, and P. Mork. The Piazza Peer Data Management Project. ACM SIGMOD Record, 32(3), 2003.

[23] J. Yedidia, W. Freeman, and Y. Weiss. Generalized belief propagation. Advances in Neural Information Processing Systems (NIPS), 13, 2000.