
arXiv:1402.0282v1 [cs.DB] 3 Feb 2014

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. X, X 2014

Principled Graph Matching Algorithms for Integrating Multiple Data Sources

    Duo Zhang, Benjamin I. P. Rubinstein, and Jim Gemmell

Abstract: This paper explores combinatorial optimization for problems of max-weight graph matching on multi-partite graphs, which arise in integrating multiple data sources. Entity resolution, the data integration problem of performing noisy joins on structured data, typically proceeds by first hashing each record into zero or more blocks, scoring pairs of records that are co-blocked for similarity, and then matching pairs of sufficient similarity. In the most common case of matching two sources, it is often desirable for the final matching to be one-to-one (a record may be matched with at most one other); members of the database and statistical record linkage communities accomplish such matchings in the final stage by weighted bipartite graph matching on similarity scores. Such matchings are intuitively appealing: they leverage a natural global property of many real-world entity stores, that of being nearly deduped, and are known to provide significant improvements to precision and recall. Unfortunately, unlike the bipartite case, exact max-weight matching on multi-partite graphs is known to be NP-hard. Our two-fold algorithmic contributions approximate multi-partite max-weight matching: our first algorithm borrows optimization techniques common to Bayesian probabilistic inference; our second is a greedy approximation algorithm. In addition to a theoretical guarantee on the latter, we present comparisons on a real-world ER problem from Bing significantly larger than typically found in the literature, on publication data, and on a series of synthetic problems. Our results quantify significant improvements due to exploiting multiple sources, which are made possible by global one-to-one constraints linking otherwise independent matching sub-problems. We also discover that our algorithms are complementary: one is much more robust under noise, while the other is simple to implement and very fast to run.

Index Terms: Data integration, weighted graph matching, message-passing algorithms

    1 INTRODUCTION

It has long been recognized, and explicitly discussed recently [13], that many real-world entity stores are naturally free of duplicates. Were they to have replicate entities, crowd-sourced sites such as Wikipedia would have edits applied to one copy and not others. Sites that rely on ratings, such as Netflix and Yelp, would suffer diluted recommendation value by having ratings split over multiple instantiations of the same product or business page. And customers of online retailers such as Amazon would miss out on lower prices, options on new/used condition, or shipping arrangements offered by sellers surfaced on duplicate pages. Many publishers have natural incentives that drive them to deduplicate, or maintain uniqueness, in their databases.

Dating back to the late 1980s in the statistical record linkage community, and more recently in database research [13], [14], [16], [17], [28], [32], a number of entity resolution (ER) systems have successfully employed one-to-one graph matching to leverage this natural lack of duplicates. Initially the benefit of this approach was taken for granted, but preliminary recent work [13] has quantified significant improvements to precision and recall due to this approach. The reasons are intuitively clear: data noise or deficiencies of the scoring function can lead to poor scores, which can negatively affect ER accuracy; graph matching, however, imposes a global one-to-one constraint which effectively smoothes away local noise. This kind of bipartite one-to-one matching for combining a pair of data sources is both well known to many in the community and widely applicable throughout data integration, as it can be used to augment numerous existing systems (e.g., [3], [5], [8], [10]-[12], [18], [20], [21], [30], [31], [35], [37], [38]) as a generic stage following data preparation, blocking, and scoring.

Another common principle of data integration is improved user utility from fusing multiple sources of data. But while many papers have circled the problem, very little is known about extending one-to-one graph matching to practical multi-partite matching for ER. In the theory community exact max-weight matching is known to be NP-hard [9], and in the statistics community expectation maximization has recently been applied to approximate the solution successfully for ER, but inefficiently, with exponential computational requirements [28].

Research performed by the authors while at Microsoft Research. D. Zhang is with Twitter, Inc., USA. B. Rubinstein is with the University of Melbourne, Australia. J. Gemmell is with Trov, USA. E-mail: [email protected], [email protected], [email protected]

In this paper we propose principled approaches for approximating multi-partite weighted graph matching. Our first approach is based on message-passing algorithms typically used for inference on probabilistic graphical models, but used here for combinatorial optimization. Through a series of non-trivial approximations we derive an approach more efficient than the leading statistical record linkage work of Sadinle et al. [28]. Our second approach extends the well-known greedy approximation to bipartite max-weight matching to the multi-partite case. While less sophisticated than our message-passing algorithm, the greedy approach enjoys an easily-implementable design and a worst-case 2-approximation competitive ratio (cf. Theorem 4.1).

Fig. 1. Multi-source ER is central to (left) Bing's, (middle) Google's, and (right) Yahoo!'s movies verticals for implementing entity actions such as buy, rent, watch, buy ticket, and surfacing meta-data, showtimes, and reviews. In each case data on the same movie is integrated from multiple noisy sources.

The ability to leverage one-to-one constraints when performing multi-source data integration is of great economic value: systems studied here have made significant impact in production use, for example within several Bing verticals and the Xbox TV service, driving critical customer-facing features (cf. e.g., Figure 1). We demonstrate on data taken from these real-world services that our approaches enjoy significantly improved precision/recall over the state-of-the-art unconstrained approach, and that the addition of sources yields further improvements due to the global constraints.

This experimental study is of atypical value owing to its unusually large scale: compared to the largest of the four datasets used in the recent well-cited evaluation study of Köpcke et al. [22], our problem is three orders of magnitude larger.¹ We conduct a second experimental comparison on a smaller publication dataset to demonstrate generality. Finally, we explore the robustness of our approaches to varying degrees of edge-weight noise via synthetic data.

In summary, our main contributions are:

1) A principled factor-graph message-passing algorithm for generic, globally-constrained multi-source data integration;

2) An efficient greedy approach that comes with a sharp, worst-case guarantee (cf. Theorem 4.1);

3) A counterexample to the common misconception that sequential bipartite matching is a sufficient approach to joint multi-partite matching (cf. Example 3.3);

4) Experimental comparisons on a very large real-world dataset demonstrating that generic one-to-one matching leverages naturally-deduplicated data to significantly improve precision/recall;

5) Validation that our new approaches can appropriately leverage information from multiple sources to improve ER precision and recall further, supported by a second smaller real-world experiment on publication data; and

6) A synthetic-data comparison under which our message-passing approach enjoys superior robustness to noise, over the faster greedy approach.

1. While their largest problem contains 1.7 × 10^8 pairs, ours measures in at 1.6 × 10^11 between just two of our sources. For data sources measuring in the low thousands, as in their other benchmark problems and as is typical in many papers, purely crowd-sourcing ER systems such as [23] could be used for mere tens of dollars.

2 RELATED WORK

Numerous past works have studied entity resolution [3], [18], [20], [21], [31], [38], statistical record linkage [8], [12], [35], [37], and deduplication [5], [10], [11], [30]. There has been little previous work investigating one-to-one bipartite resolutions, and almost no results on constrained multi-source resolution.

Some works have looked at constraints in general [1], [6], [32], but they do not focus on the one-to-one constraint. Su et al. do rely on the rarity of duplicates in a given web source [33], but use it only to generate negative examples. Guo et al. studied a record linkage approach based on uniqueness constraints on entity attribute values. In many domains, however, this constraint does not always hold: for example, in the movie domain, each movie entity can have multiple actor names as its attribute. Jaro produced early work [17] on linking census data records using a linear program formulation that enforces a global one-to-one constraint; however, no justification or evaluation of the formulation is offered, and the method is only defined for resolution over two sources. The same holds for more recent work in databases, such as conflating web tables without duplicates [16] and the notion of exclusivity constraints [14]. We are the first to test and quantify the use of such a global constraint on entities and apply it in a systematic way, building on earlier bipartite work [13].

A more recent study [28] examines the record linkage problem in the multiple-sources setting. The authors use a probabilistic framework to estimate the probability that a pair of entities is a match. However, the complexity of their algorithm is $O(B_m n^m)$, where $m$ is the number of sources, $n$ is the number of instances, and $B_m$ is the $m$-th Bell number, which is exponential in $m$. Such prohibitive computational complexity prevents the principled approach from being practical. The optimization approach in our message-passing algorithm has far better complexity, and works well on real entity resolution problems with millions of entities; at the same time our approach is also developed in a principled fashion. Our message-passing algorithm is the first principled, tractable approach to constrained multi-source ER.
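The super-exponential growth of the Bell numbers can be checked directly; the sketch below is an illustration of why an $O(B_m n^m)$ cost is prohibitive, not part of the algorithm of [28], and computes $B_m$ via the standard Bell-triangle recurrence:

```python
def bell(m: int) -> int:
    """Compute the m-th Bell number B_m with the Bell triangle recurrence."""
    row = [1]                      # first row of the Bell triangle: [B_0]
    for _ in range(m):
        nxt = [row[-1]]            # each row starts with the previous row's last entry
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[0]

print([bell(m) for m in range(1, 8)])  # [1, 2, 5, 15, 52, 203, 877]
```

Already at $m = 10$ sources, $B_m = 115975$, so the $B_m$ factor alone dwarfs any per-pair scoring cost.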

In the bipartite setting, maximum-weight matching is now well understood, with exact algorithms able to solve the problem in polynomial time [25]. When it comes to multi-partite maximum-weight matching, the problem becomes extremely hard: in particular, Crama and Spieksma [9] proved that a special case of the tripartite matching problem is NP-hard, implying that the general multi-partite max-weight matching problem is itself NP-hard. The algorithms presented in this paper are both approximations: one greedy approach that is fast to run, and one principled approach that empirically yields higher total weights and improved precision and recall in the presence of noise. Our greedy approach is endowed with a competitive ratio generalizing the well-known guarantee for the bipartite case.

Previously, Bayesian machine learning methods have been applied to entity resolution [4], [10], [27], [36]; however, our goal here is not to perform inference with these methods but rather optimization. Along these lines, in recent years message-passing algorithms have been used for the maximum-weight matching problem in bipartite graphs [2]. There the authors proved that the max-product algorithm converges to the desired optimum point in finding the maximum-weight matching for a bipartite graph, even in the presence of loops. In our study, we have designed a message-passing algorithm targeting the maximum-weight matching problem in multi-partite graphs; in the special case of bipartite maximum-weight matching, our algorithm works as effectively as exact methods. Another recent work [29] studied the weighted-matching problem in general graphs, whose problem definition differs from our own; it shows that max-product converges to the correct answer if the linear programming relaxation of the weighted-matching problem is tight. Compared to this work, we pay special attention to the application to multi-source entity resolution, tune the general message-passing algorithms specifically for our model, and perform large-scale data integration experiments on real data.

3 THE MPEM PROBLEM

We begin by formalizing the generic data integration problem on multiple sources. Let $D_1, D_2, \ldots, D_m$ be $m$ databases, each of which contains representations of a finite number of entities along with a special null entity $\emptyset$. The database sizes need not be equal.

Definition 3.1: Given $D_1, D_2, \ldots, D_m$, the Multi-Partite Entity Resolution problem is to identify an unknown target relation or resolution $R \subseteq D_1 \times D_2 \times \cdots \times D_m$, given some information about $R$, such as pairwise scores between entities, examples of matching and non-matching pairs of entities, etc.

For example, for databases $D_1 = \{e_1, \emptyset\}$ and $D_2 = \{e'_1, e'_2, \emptyset\}$, the possible mappings are $D_1 \times D_2 = \{\langle e_1, e'_1\rangle, \langle e_1, e'_2\rangle, \langle e_1, \emptyset\rangle, \langle \emptyset, e'_1\rangle, \langle \emptyset, e'_2\rangle, \langle \emptyset, \emptyset\rangle\}$. Here, $\emptyset$ appearing in the $i$th component of a mapping $r$ means that $r$ does not involve any entity from source $i$. A resolution $R$, which represents the true matchings among entities, is some specific subset of all possible mappings among entities in the $m$ databases.

A global one-to-one constraint on multi-partite entity resolution asserts the target $R$ is pairwise one-to-one: for each non-null entity $x_i \in D_i$, for all sources $j \neq i$ there exists at most one entity $x_j \in D_j$ such that together $x_i, x_j$ are involved in one tuple in $R$.

Definition 3.2: A Multi-Partite Entity Matching (MPEM) problem is a multi-partite entity resolution problem with the global one-to-one constraint.

We previously showed experimentally [13] that leveraging the one-to-one constraint when performing entity resolution across two sources yields significantly improved precision and recall, a fact long well known anecdotally in the databases and statistical record linkage communities. For example, if we know The Lord of the Rings I in IMDB matches with The Lord of the Rings I in Netflix, then the one-to-one property precludes the possibility of this IMDB movie resolving with The Lord of the Rings II in Netflix.

    The present paper focuses on the MPEM problemand methods for exploiting the global constraint whenresolving across multiple sources. In particular, weare interested in global methods which perform entityresolution across all sources simultaneously. In theexperimental section, we will show that matchingmultiple sources together achieves superior resolutionas compared to matching two sources individually(equivalently pairwise matching of multiple sources).

    3.1 The Duplicate-Free Assumption

As discussed in Section 1, many real-world data sources are naturally de-duplicated due to various socio-economic incentives. There we argued, by citing significant online data sources such as Wikipedia, Amazon, Netflix, and Yelp, that:

    Crowd-sourced site duplicates will diverge withedits applied to one copy and not the other;

    Sites relying on ratings suffer from duplicates asrecommendations are made on too little data; and

Attributes will be split by duplicates, for example alternate prices or shipping options of products.

Indeed, in our past work we quantified that the data used in this paper, coming (raw and untouched) from major online movie sources, are largely duplicate-free, with estimated duplicate levels of 0.1%. Our publication dataset is similar in terms of duplicates.

Finally, consider matching and merging multiple sources. Rather than taking a union of all sources and then performing matching by deduplication on the union, by instead (1) deduplicating each source and then (2) matching, we may focus on the heterogeneous data characteristics of the sources individually. Recognizing the importance of data variability aids matching accuracy when scaling to many sources [24].

    3.2 Why MPEM Requires Global Methods

One may naively attempt one of two natural approaches to the MPEM problem: (1) threshold the overall scores to produce many-to-many resolutions; or (2) resolve the sources one-to-one in a sequential manner by iteratively performing two-source entity resolution. The disadvantages of the first have been explored previously [13]. The order of sequential bipartite matching may adversely affect overall results: if we first resolve two poor-quality sources, then the poor resolution may propagate to degrade subsequent resolutions. We could undo earlier matchings in possibly cascading fashion, but this would essentially reduce to global multi-partite matching.

    The next example demonstrates the inferiority ofsequential matching.

Example 3.3: Take a weighted tri-partite graph with nodes $a_1, b_1, c_1$ (source 1), $a_2, b_2, c_2$ (source 2), and $a_3, b_3, c_3$ (source 3). The true matching is $\{\{a_1, a_2, a_3\}, \{b_1, b_2, b_3\}, \{c_1, c_2, c_3\}\}$. Each pair of sources is completely connected, with weights: 0.6 for all pairs between sources 1 and 2, except 0.5 between $\{a_1, a_2\}$ and 1 between $\{c_1, c_2\}$; 1 for records truly matching between source 1 or 2 and source 3; and 0.1 for records not truly matching between source 1 or 2 and source 3. When bipartite matching between 1 and 2 and then with 3 (preserving transitivity in the final matching), sequential matching achieves weight 6.4 while global matching achieves 8.1.

Sequential matching can produce very inferior overall matchings due to being locked into poor subsequent decisions, i.e., local optima. This is compounded by poor-quality data. In contrast, global matching looks ahead. As a result, we focus on global approaches.
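The weights 6.4 and 8.1 of Example 3.3 can be verified by brute-force enumeration. The following sketch is illustrative only: the entity names and weights are as in the example, and the helper `w_3` encodes the 1 / 0.1 weights between source 1 or 2 and source 3 by first-letter agreement, an implementation convenience not in the original.

```python
from itertools import permutations

S1, S2, S3 = ["a1", "b1", "c1"], ["a2", "b2", "c2"], ["a3", "b3", "c3"]

def w12(x, y):                      # weights between sources 1 and 2
    if (x, y) == ("a1", "a2"): return 0.5
    if (x, y) == ("c1", "c2"): return 1.0
    return 0.6

def w_3(x, y):                      # weights between source 1 or 2, and source 3
    return 1.0 if x[0] == y[0] else 0.1   # 1 for truly matching records, else 0.1

def triple_weight(x, y, z):
    return w12(x, y) + w_3(x, z) + w_3(y, z)

# Global: best joint assignment of triples (permutations of sources 2 and 3).
global_best = max(
    sum(triple_weight(x, y, z) for x, y, z in zip(S1, p2, p3))
    for p2 in permutations(S2) for p3 in permutations(S3)
)

# Sequential: max-weight bipartite matching of 1-2 first, then attach source 3.
m12 = max(permutations(S2), key=lambda p: sum(w12(x, y) for x, y in zip(S1, p)))
seq = sum(w12(x, y) for x, y in zip(S1, m12)) + max(
    sum(w_3(x, z) + w_3(y, z) for (x, y), z in zip(zip(S1, m12), p3))
    for p3 in permutations(S3)
)
print(round(seq, 2), round(global_best, 2))   # 6.4 8.1
```

The sequential stage locks in the pairs $\{a_1, b_2\}$ and $\{b_1, a_2\}$ (weight 1.2 versus 1.1 for the true pairs), after which no assignment of source 3 can recover the lost weight.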

    4 ALGORITHMS

We now develop two strategies for solving MPEM problems. The first is to approximately maximize the total weight of the matching by co-opting optimization algorithms popular in Bayesian inference. The second is to iteratively select the best match for each entity locally while not violating the one-to-one constraint, yielding a simple greedy approach.
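To make the second strategy concrete, here is a minimal sketch of one natural greedy scheme: scan candidate pairs in decreasing-weight order and accept a pair whenever merging the two clusters involved still leaves at most one entity per source. This illustrates the flavour of the idea only; the paper's own greedy algorithm and its 2-approximation analysis (cf. Theorem 4.1) may differ in details.

```python
# Hypothetical sketch of a greedy multi-partite matcher. Edges are
# (weight, (source, entity), (source, entity)) tuples; clusters merge only
# when the union still has at most one entity per source (one-to-one).
def greedy_match(edges):
    cluster = {}                                   # entity -> frozenset of co-matched entities
    for w, a, b in sorted(edges, reverse=True):    # decreasing weight
        ca = cluster.get(a, frozenset([a]))
        cb = cluster.get(b, frozenset([b]))
        if ca == cb:
            continue                               # already matched together
        if {s for s, _ in ca} & {s for s, _ in cb}:
            continue                               # merge would violate one-to-one
        merged = ca | cb
        for e in merged:
            cluster[e] = merged
    return cluster
```

The merge step preserves transitivity automatically: accepted pairs grow into cluster tuples, and a pair touching an already-represented source is simply skipped.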

    4.1 Message-Passing Algorithm

As discussed in Section 2, exact multi-partite maximum-weight matching is NP-hard. Therefore we develop an approximate yet principled maximum-weight strategy for MPEM problems. Without loss of generality, we use tri-partite matching from time to time for illustrative purposes.

We make use of the max-sum algorithm that drives maximum a posteriori (MAP) probabilistic inference on Bayesian graphical models [19]. Bipartite factor graphs model random variables and a joint likelihood on them, via variable nodes and factor nodes respectively. Factor nodes correspond to a factorization of the unnormalized joint likelihood, and are connected to variable nodes iff the corresponding function depends on the corresponding variable. Max-sum maximizes the sum of these functions (i.e., MAP inference in log space), and decomposes via dynamic programming to an iterative message-passing algorithm on the graph.

    In this paper we formulate the MPEM problem asa maximization of a sum of functions over variables,naturally reducing to max-sum over a graph.

    4.1.1 Factor Graph Model for the MPEM Problem

The MPEM problem can be represented by a graphical model, shown in part in Figure 2 for tri-partite graph matching. In this model, $C_{ijk}$ represents a variable node, where $i, j, k$ are three entities from three different sources. We use $x_{ijk}$ to denote the variable associated with graph node $C_{ijk}$. This boolean-valued variable selects whether the entities are matched or not: if $x_{ijk}$ is equal to 1, all three entities are matched; otherwise, at least two of these three entities are not matched. Since we will not always find a matching entity from every source, we use $\emptyset$ to indicate the case of no matching entity in a particular source. For example, if variable $x_{i,j,\emptyset}$ is equal to 1, then $i$ and $j$ are matched without a matching entity from the third source.

    Fig. 2. Graphical model for the MPEM problem.

Figure 3 shows all the factor nodes connected with a variable node $C_{ijk}$. There are two types of factor nodes: (1) a similarity factor node $S_{ijk}$, which represents the total weight from matching $i, j, k$; and (2) three constraint factor nodes $E^1_i$, $E^2_j$, and $E^3_k$, which represent one-to-one constraints on entity $i$ from source 1, entity $j$ from source 2, and entity $k$ from source 3, respectively. The similarity factor node connects only to a single variable node, but the constraint factor nodes connect with multiple variable nodes.

The concrete definitions of the factor nodes are shown in Figure 4. The definition of the similarity function is easy: if a variable node is selected, the function's value is the total weight between all pairs of entities within the variable node; otherwise, the similarity function is 0. Each constraint $E$ is defined on a plane of variables which relate to one specific entity from a specific source. For example, as shown in Figure 2, the factor $E^1_i$ is defined on all the variables related to entity $i$ in source 1. It evaluates to 0 iff the sum of the values of all the variables in the plane is equal to 1, which means exactly one variable can be selected as a matching from the plane; otherwise, the function evaluates to negative infinity, penalizing violation of the global one-to-one constraint. Thus these factor nodes serve as hard constraints enforcing a global one-to-one matching. Note that there is no factor $E^1_\emptyset$ defined on the variables related to the entity $\emptyset$ from a source, because we allow the $\emptyset$ entity to match with multiple different entities from other sources.

Fig. 3. A variable node's neighboring factor nodes: $C_{ijk}$ connects to the similarity factor $S_{ijk}$ and to the constraint factors $E^1_i$, $E^2_j$, and $E^3_k$.

$$S_{ijk}(x_{ijk}) = \begin{cases} s_{ij} + s_{jk} + s_{ik}, & x_{ijk} = 1 \\ 0, & \text{otherwise} \end{cases}$$

$$E^1_i(x_{i11}, \ldots, x_{i1\emptyset}, \ldots, x_{ijk}, \ldots, x_{i\emptyset\emptyset}) = \begin{cases} 0, & x_{i11} + \cdots + x_{i1\emptyset} + \cdots + x_{ijk} + \cdots + x_{i\emptyset\emptyset} = 1 \\ -\infty, & \text{otherwise} \end{cases}$$

$$E^2_j(x_{1j1}, \ldots, x_{1j\emptyset}, \ldots, x_{ijk}, \ldots, x_{\emptyset j\emptyset}) = \begin{cases} 0, & x_{1j1} + \cdots + x_{1j\emptyset} + \cdots + x_{ijk} + \cdots + x_{\emptyset j\emptyset} = 1 \\ -\infty, & \text{otherwise} \end{cases}$$

$$E^3_k(x_{11k}, \ldots, x_{1\emptyset k}, \ldots, x_{ijk}, \ldots, x_{\emptyset\emptyset k}) = \begin{cases} 0, & x_{11k} + \cdots + x_{1\emptyset k} + \cdots + x_{ijk} + \cdots + x_{\emptyset\emptyset k} = 1 \\ -\infty, & \text{otherwise} \end{cases}$$

Fig. 4. Definitions of the factor nodes' potential functions displayed in Figure 3.

Assuming all functions defined in Figure 4 are in log space, their sum corresponds to the objective function

$$f = \sum_{i,j,k} S_{ijk}(x_{ijk}) + \sum_{s=1}^{3} \sum_{t=1}^{N_s} E^s_t, \qquad (1)$$

where $N_s$ is the number of entities in source $s$. Maximizing this sum is equivalent to one-to-one max-weight matching, since the constraint factors are barrier functions. The optimizing assignment is the configuration of all variable nodes, each either 0 or 1, which maximizes the objective function as desired.
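As a concrete reading of Eq. (1), the sketch below (illustrative names only; the null entity is represented as `None`) scores a set of selected triples, returning negative infinity whenever any non-null entity is covered by a number of triples other than one, exactly as the barrier constraint factors prescribe:

```python
import math

NULL = None  # stands in for the null entity

def objective(selected, sim, D1, D2, D3):
    """Eq. (1) for tri-partite matching: selected is a set of triples (i, j, k);
    sim maps entity pairs to similarity scores s_ij (null pairs score 0)."""
    s = lambda a, b: 0.0 if NULL in (a, b) else sim.get((a, b), sim.get((b, a), 0.0))
    total = sum(s(i, j) + s(j, k) + s(i, k) for i, j, k in selected)
    # constraint factors E^s_t: each non-null entity must appear in exactly one triple
    for pos, D in enumerate((D1, D2, D3)):
        for e in D:
            if sum(1 for t in selected if t[pos] == e) != 1:
                return -math.inf  # barrier: one-to-one constraint violated
    return total
```

For instance, with one entity per source and scores $s_{a_1 a_2} = 0.5$, $s_{a_1 a_3} = s_{a_2 a_3} = 1$, the full triple scores 2.5, while a selection leaving $a_3$ uncovered is rejected with $-\infty$.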

    4.1.2 Optimization Approach

The general max-sum algorithm iteratively passes messages in a factor graph by Eq. (2) and Eq. (3) below. In the equations, $n(C)$ represents the neighbor factor nodes of a variable node $C$, and $n(f)$ represents the neighbor variable nodes of a factor node $f$. $\mu_{f \to C}(x)$ defines messages passing from a factor node $f$ to a variable node $C$, while $\mu_{C \to f}(x)$ defines messages in the reverse direction. Each message is a function of the corresponding variable $x$. For example, if $x$ is a binary variable, there are in total two messages passing in one direction: $\mu_{f \to C}(0)$ and $\mu_{f \to C}(1)$.

$$\mu_{f \to C}(x) = \max_{x_1, \ldots, x_n} \left[ f(x, x_1, \ldots, x_n) + \sum_{C_i \in n(f) \setminus \{C\}} \mu_{C_i \to f}(x_i) \right] \qquad (2)$$

$$\mu_{C \to f}(x) = \sum_{h \in n(C) \setminus \{f\}} \mu_{h \to C}(x) \qquad (3)$$

As it stands the solution is computationally expensive. Suppose there are $m$ sources, each with an average of $n$ entities; then in each iteration there are in total $4mn^m$ messages to be updated (only messages between variable nodes and constraint factor nodes need to be updated): there are a total of $n^m$ variable nodes, each with $m$ constraint factor nodes connected to it, and there are 4 messages between each such factor node and the variable node (2 messages in each direction), so there are in total $4m$ messages for each variable node.

Affinity Propagation. Instead of calculating this many messages, we employ the idea of Affinity Propagation [15]. The basic intuition is: instead of passing two messages in each direction, either $\mu_{f \to C}(x)$ or $\mu_{C \to f}(x)$, we only need to pass the difference between these two messages, i.e., $\mu_{f \to C}(1) - \mu_{f \to C}(0)$ and $\mu_{C \to f}(1) - \mu_{C \to f}(0)$; at the end of the max-sum algorithm what we really need to know is which configuration of a variable $x$ (0 or 1) yields a larger result.

The concrete calculation (using tri-partite matching to illustrate) is as follows. Let

$$\rho^1_{ijk} = \mu_{C_{ijk} \to E^1_i}(1) - \mu_{C_{ijk} \to E^1_i}(0)$$

$$\alpha^1_{ijk} = \mu_{E^1_i \to C_{ijk}}(1) - \mu_{E^1_i \to C_{ijk}}(0) \, .$$

Since

$$\mu_{C_{ijk} \to E^1_i}(0) = \mu_{E^2_j \to C_{ijk}}(0) + \mu_{E^3_k \to C_{ijk}}(0) + \mu_{S_{ijk} \to C_{ijk}}(0)$$

$$\mu_{C_{ijk} \to E^1_i}(1) = \mu_{E^2_j \to C_{ijk}}(1) + \mu_{E^3_k \to C_{ijk}}(1) + \mu_{S_{ijk} \to C_{ijk}}(1) \, ,$$

the difference $\rho^1_{ijk}$ between these two messages can be calculated by:

$$\rho^1_{ijk} = \alpha^2_{ijk} + \alpha^3_{ijk} + S_{ijk}(1) \, . \qquad (4)$$

In the other direction, since

$$\mu_{E^1_i \to C_{ijk}}(1) = \max_{x_{i11}, \ldots, x_{i\emptyset\emptyset} \setminus \{x_{ijk}\}} \left[ E^1_i(x_{ijk} = 1, x_{i11}, \ldots, x_{i\emptyset\emptyset}) + \sum_{a_2 a_3 \neq jk} \mu_{C_{i a_2 a_3} \to E^1_i}(x_{i a_2 a_3}) \right],$$

in order to obtain the max value for the message, all the variables except $x_{ijk}$ in $E^1_i$ should be zero because of the constraint function. Therefore

$$\mu_{E^1_i \to C_{ijk}}(1) = \sum_{a_2 a_3 \neq jk} \mu_{C_{i a_2 a_3} \to E^1_i}(0) \, .$$

Similarly, since

$$\mu_{E^1_i \to C_{ijk}}(0) = \max_{x_{i11}, \ldots, x_{i\emptyset\emptyset} \setminus \{x_{ijk}\}} \left[ E^1_i(x_{ijk} = 0, x_{i11}, \ldots, x_{i\emptyset\emptyset}) + \sum_{a_2 a_3 \neq jk} \mu_{C_{i a_2 a_3} \to E^1_i}(x_{i a_2 a_3}) \right],$$

to obtain the max message, exactly one of the variables except $x_{ijk}$ in $E^1_i$ should be 1, i.e.,

$$\mu_{E^1_i \to C_{ijk}}(0) = \max_{b_2 b_3 \neq jk} \left[ \mu_{C_{i b_2 b_3} \to E^1_i}(1) + \sum_{a_2 a_3 \neq jk, \; a_2 a_3 \neq b_2 b_3} \mu_{C_{i a_2 a_3} \to E^1_i}(0) \right].$$

Therefore, subtracting $\mu_{E^1_i \to C_{ijk}}(0)$ from $\mu_{E^1_i \to C_{ijk}}(1)$, we get the update formula for $\alpha^1_{ijk}$:

$$\alpha^1_{ijk} = \min_{b_2 b_3 \neq jk} \left( -\rho^1_{i b_2 b_3} \right). \qquad (5)$$

If $i = \emptyset$, since there is no constraint function $E^1_\emptyset$, both $\rho^1_{\emptyset jk}$ and $\alpha^1_{\emptyset jk}$ are equal to 0. The update rules of Eq.s (4) and (5) generalize easily to $m$-partite matching:

$$\rho^i_{x_1 x_2 \ldots x_m} = \sum_{j \neq i} \alpha^j_{x_1 x_2 \ldots x_m} + S_{x_1 x_2 \ldots x_m}(1) \qquad (6)$$

$$\alpha^i_{x_1 x_2 \ldots x_m} = \min_{\ldots b_{i-1} b_{i+1} \ldots \; \neq \; \ldots x_{i-1} x_{i+1} \ldots} \left( -\rho^i_{\ldots b_{i-1} x_i b_{i+1} \ldots} \right) \qquad (7)$$

Achieving exponentially fewer messages. By message passing with $\rho$ and $\alpha$, instead of $\mu$, the total number of message updates per iteration is reduced by half, down to $2mn^m$, which includes $mn^m$ each of $\rho$ and $\alpha$ messages. This number is still exponential in $m$. However, if we look at Eq. (5) for $\alpha$ messages, the formula can actually be interpreted as Eq. (8) below, where $j^* k^*$ are the minimizers of Eq. (5). This shows that, for a fixed entity $i$ in source 1, $\alpha^1_{ijk}$ takes only two values over all the different combinations of $j$ and $k$, and only the minimizer combination $j^* k^*$ obtains the different value, which is the second minimum (we call the second minimum the exception value for $\alpha^1_{ijk}$, and the minimum the normal value). In other words, there are actually only $2mn$ of the $\alpha$ messages instead of $mn^m$.

$$\alpha^1_{ijk} = \begin{cases} -\rho^1_{i j^* k^*}, & \text{for all } jk \neq j^* k^* \\ \operatorname{second\,min}_{b_2 b_3} \left( -\rho^1_{i b_2 b_3} \right), & \text{for } jk = j^* k^* \end{cases} \qquad (8)$$

Using this observation, we replace the $\rho$ messages in Eq. (5) with $\alpha$ messages using Eq. (4), to yield:

$$\alpha^1_{ijk} = -\max_{b_2 b_3 \neq jk} \left[ \alpha^2_{i b_2 b_3} + \alpha^3_{i b_2 b_3} + S_{i b_2 b_3}(1) \right]. \qquad (9)$$

We can similarly derive update functions for the other sources. As shown in the following sections, the final resolution depends only on $\alpha$ messages; therefore we calculate only $\alpha$ messages and keep updating them iteratively until convergence, or until a maximum number of iterations is reached.
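As an illustration of the simplified update, the sketch below implements Eq. (9) for a tri-partite problem without null entities (a simplifying assumption), exploiting the Eq. (8) observation that, per source-1 entity, the message takes only a normal value and one exception value:

```python
import numpy as np

def alpha1_update(alpha2, alpha3, S):
    """Eq. (9): alpha1[i, j, k] = -max over (b2, b3) != (j, k) of
    (alpha2 + alpha3 + S)[i, b2, b3]; all arrays have shape (n1, n2, n3)."""
    M = alpha2 + alpha3 + S               # rho^1 of Eq. (4)
    out = np.empty_like(M)
    for i in range(M.shape[0]):
        flat = M[i].ravel()
        top = int(flat.argmax())
        second = np.delete(flat, top).max()
        out[i] = -flat[top]               # normal value, cf. Eq. (8)
        out[i].ravel()[top] = -second     # exception value at the maximizer
    return out
```

Only the maximum and the second maximum per entity are needed, which is precisely what shrinks the distinct $\alpha$ values per iteration from $mn^m$ to effectively $2mn$.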

    4.1.3 Stepwise Approximation

In Eq. (9) we need to find the optimizer, which in the general case costs at least $O(m^2 n^{m-1})$ to compute an $\alpha$ message, because we need to consider all the combinations of $b_2$ and $b_3$ in the similarity function $S_{i b_2 b_3}$. To reduce this complexity, we use a stepwise approach (operating like Gibbs sampling) to find the optimizer. The basic idea is: suppose we want to compute the optimum combination $j^* k^*$ in Eq. (9); we first fix $k_1$ and find the optimum $j_1$ which optimizes the right-hand side of Eq. (9) (in the general case, we fix candidate entities in all sources except one). Also, since $k_1$ is fixed, we need not consider the similarity $s_{i k_1}$ between $i$ and $k_1$ in $S_{i b_2 k_1}(1)$, as shown in Eq. (10).

$$j_1 = \operatorname*{argmax}_{b_2} \left[ \alpha^2_{i b_2 k_1} + \alpha^3_{i b_2 k_1} + S_{i b_2 k_1}(1) \right] = \operatorname*{argmax}_{b_2} \left[ \alpha^2_{i b_2 k_1} + \alpha^3_{i b_2 k_1} + s_{i b_2} + s_{b_2 k_1} \right] \qquad (10)$$

From Eq. (10), we see that this step requires only $O(mn)$ computation, where $m$ is the number of sources ($m = 3$ in the current example), because the similarities among the fixed candidate entities do not affect the argmax. After we compute $j_1$, we fix it and compute the optimizer $k_2$ for the third source. We continue this stepwise update until the optimum value no longer changes. Since the optimum value always increases, we are sure that this stepwise computation will converge. In the general case, the computation is similar, and the complexity is now reduced to $O(mnT)$, where $T$ is the number of steps.

In practice, we want the starting combination to be good enough so that the number of steps needed to converge is small. So, we always select those entities which have high similarities with entity i as the initial combination. Also, to avoid getting stuck in a local optimum, we begin from several different starting points to find the global optimum. We select the top two most similar entities to entity i from each source, and then choose L combinations among them as the starting points. Therefore, the complexity of finding the optimum becomes O(mnTL). The second optimum is chosen among the outputs of each of the T steps within the L starting points, requiring another O(TL) computation.
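The stepwise search above is a coordinate-ascent procedure. A minimal sketch follows, using only pairwise similarities and omitting the message terms μ for brevity; `stepwise_optimize`, `pairwise_total`, and the frozenset-keyed `sim` dictionary are illustrative names and data layouts, not from the paper:

```python
import itertools

def pairwise_total(combo, sim):
    """Sum of pairwise similarities over all entities in the combination."""
    return sum(sim.get(frozenset(p), 0.0) for p in itertools.combinations(combo, 2))

def stepwise_optimize(anchor, candidates, sim, num_starts=2):
    """Coordinate-ascent search for the best candidate combination.

    anchor     -- the entity i being resolved (a hashable id)
    candidates -- list of candidate lists, one per other source
    sim        -- dict mapping frozenset({a, b}) -> similarity score
    """
    # Start from the top-`num_starts` most similar candidates in each source.
    starts = itertools.product(*[
        sorted(c, key=lambda e: sim.get(frozenset({anchor, e}), 0.0),
               reverse=True)[:num_starts]
        for c in candidates
    ])
    best_combo, best_val = None, float("-inf")
    for start in starts:
        combo = list(start)
        val = pairwise_total([anchor] + combo, sim)
        improved = True
        while improved:                      # iterate until no coordinate improves
            improved = False
            for s, cands in enumerate(candidates):
                for e in cands:              # try every candidate for source s
                    trial = combo[:s] + [e] + combo[s + 1:]
                    tv = pairwise_total([anchor] + trial, sim)
                    if tv > val:
                        combo, val, improved = trial, tv, True
        if val > best_val:
            best_combo, best_val = combo, val
    return best_combo, best_val
```

As in the paper, each coordinate update only re-examines one source's candidates, and restarting from several high-similarity combinations guards against local optima.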

4.1.4 Final Selection

We iteratively update all messages round by round until only a small portion (e.g., less than 1%) of them are changing. In the final selection step, the general max-sum algorithm will assign the value of each variable by the following formula:

    x*_C = argmax_x Σ_{h ∈ n(C)} μ_{h→C}(x)

The difference between the two configurations of x_{C_{i_1,...,i_m}} is:

    Q(x_C) = Σ_{h ∈ n(C)} μ_{h→C}(1) − Σ_{h ∈ n(C)} μ_{h→C}(0)
           = Σ_{j=1}^{m} μ^j_{i_1,...,i_m} + S_{i_1,...,i_m}(1)

This means the final selection only depends on messages. Therefore, in our algorithm the configuration of the values is equal to

    x_{i_1,...,i_m} = 1, if Σ_{j=1}^{m} μ^j_{i_1,...,i_m} + S_{i_1,...,i_m}(1) ≥ 0
                      0, otherwise                                                (11)

While we employ heuristics to improve the efficiency of this step, we omit the details as they are technical and follow the general approach laid out above.
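The decision rule in Eq. (11) reduces to a sign test on the summed incoming messages plus the clique similarity. A direct sketch (the `messages` list and `clique_sim` score are assumed inputs, with the message terms computed as in the preceding subsections):

```python
def select_clique(messages, clique_sim):
    """Eq. (11): accept a candidate clique iff the summed incoming
    messages plus its similarity score are non-negative."""
    return 1 if sum(messages) + clique_sim >= 0 else 0
```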

    4.1.5 Complexity Analysis

Suppose there are m sources, each having n entities. The general max-sum algorithm in Eqs. (2) and (3) needs to compute at least O(m n^m) messages per iteration. In addition to the complexity of Eq. (2), each message also requires at least O(m n^{m-1}) complexity. The cost is significant when n is on the order of 10^6 and m > 2.

The complexity of our algorithm can be decomposed into two parts. During message passing, for each iteration we update O(nm) messages. For each message, we first need to find the top-2 candidates in m − 1 sources, which requires O((m − 1)n) time. Then, we find the optimum starting from L combinations within the candidates, which requires O(mnTL) time. The sorting of the candidate array requires O(mn log mn) time.

A favorable property of message-passing algorithms is that the computation of messages can be parallelized. We leave this for future work.


IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. X, X 2014

    4.2 Greedy Algorithm

The greedy approach to bipartite max-weight matching is simple and efficient, so we explore its natural extension to multi-partite graphs.

Algorithm. The algorithm first sorts all pairs of entities by their scores, discarding those falling below threshold. It then steps through the remaining pairs in order. When a pair <x_i, y_j> is encountered with both entities unmatched so far, the pair is matched in the resolution: they form a new clique {x_i, y_j}. If either of x_i, y_j is already matched with other entities in a clique, we examine every entity in the two cliques (a single entity is regarded as a singleton clique). If all the entities derive from different sources, then the two cliques are merged into a larger clique, with the resulting merged clique included in the resolution. Otherwise, if any two entities from the two cliques are from the same source, the current pair x_i and y_j is disregarded, as the merged clique would violate the one-to-one constraint.

Complexity analysis. The time complexity of the sorting step is O(n^2 m^2 (log n + log m)), where m is the number of sources and n is the average number of entities in each source. The time complexity of the selection step is O(n^2 m^3), as we must examine O(n^2 m^2) edges, with each examination involving O(m) time to check whether adding the edge would violate the one-to-one constraint. In practice the greedy approach is extremely fast.
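The clique-merging procedure above can be sketched as follows. This is an illustrative implementation, not the paper's production code; the `(source, entity)` tagging and function name are assumptions:

```python
def greedy_multipartite_match(pairs, threshold=0.0):
    """Greedy multi-partite matching sketch.

    pairs -- iterable of (score, (src_a, ent_a), (src_b, ent_b)) tuples,
             where each entity is tagged with its source id.
    Returns a list of cliques (sets of (source, entity) tags), each
    containing at most one entity per source.
    """
    clique_of = {}                       # entity tag -> its clique (a set)
    for score, a, b in sorted(pairs, reverse=True):
        if score < threshold:
            break                        # remaining pairs score even lower
        ca = clique_of.get(a, {a})       # singleton clique if unmatched
        cb = clique_of.get(b, {b})
        if ca is cb:
            continue                     # already in the same clique
        sources_a = {src for src, _ in ca}
        sources_b = {src for src, _ in cb}
        if sources_a & sources_b:
            continue                     # merge would duplicate a source
        merged = ca | cb                 # one-to-one constraint preserved
        for ent in merged:
            clique_of[ent] = merged
    seen, cliques = set(), []            # collect the distinct cliques
    for c in clique_of.values():
        if id(c) not in seen:
            seen.add(id(c))
            cliques.append(c)
    return cliques
```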

Approximation guarantee. In addition to being fast and easy to implement, the bipartite greedy algorithm is well known to be a 2-approximation of exact max-weight matching on bipartite graphs. We generalize this competitive ratio to our new multi-partite greedy algorithm via duality.

Theorem 4.1: On any weighted m-partite graph, the total weight of the greedy max-weight multi-partite graph matching is at least half that of the max-weight matching.

Proof: The primal program for multi-partite exact max-weight matching is below. Note we have pairs of sources indexed by (u, v), with corresponding inter-source edge set E_uv connecting nodes of D_u and D_v.

    p* = max_α  Σ_{uv} Σ_{(ij)∈E_uv} s_ij α_ij                    (12)
    s.t.  Σ_{j:(ij)∈E_uv} α_ij ≤ 1,   ∀ u, v, i ∈ D_u
          α ≥ 0

The primal variables select the edges of the matching, the objective function measures the total selected weight, and the constraints enforce the one-to-one property and that selections can only be made or not (and not be negative; in sum, ensuring the optimal solution rests on valid Boolean values).

The Lagrangian function corresponds to the following, where we introduce dual variables per constraint, so as to bring the constraints into the objective, thereby forming an unconstrained optimization:

    L(α, η, λ) = Σ_{uv} Σ_{(ij)∈E_uv} s_ij α_ij
               + Σ_{uv} Σ_{i∈D_u} η_uvi (1 − Σ_{j:(ij)∈E_uv} α_ij)
               + Σ_{uv} Σ_{(ij)∈E_uv} λ_ij α_ij .

Inequality constraints require non-negative duals so that constraint violations are penalized appropriately.

The Lagrangian dual function then corresponds to maximizing the Lagrangian over the primal variables:

    g(η, λ) = sup_α L(α, η, λ)
            = Σ_{uv} Σ_{i∈D_u} η_uvi + sup_α Σ_{uv} Σ_{(ij)∈E_uv} (s_ij + λ_ij − η_uvi − η_uvj) α_ij
            = Σ_{uv} Σ_{i∈D_u} η_uvi,  if s_ij + λ_ij − η_uvi − η_uvj ≤ 0, ∀ uv, (ij) ∈ E_uv
              ∞,                       otherwise

Next we form the dual LP, which minimizes the dual function subject to the non-negativity constraints on the dual variables. This is the dual LP to our original primal. Dropping the superfluous dual variables λ:

    d* = min_η  Σ_{uv} Σ_{i∈D_u} η_uvi                            (13)
    s.t.  s_ij ≤ η_uvi + η_uvj,   ∀ uv, (ij) ∈ E_uv
          η ≥ 0

Finally, we may note that the greedily matched weights form a feasible solution for the dual LP. Moreover, the value of the dual LP under this feasible solution is twice the greedy total weight. By weak duality, this feasible solution bounds the optimal primal value, proving the result.

The same argument as used in the bipartite case demonstrates that this bound is in fact sharp: there exist pathological weighted graphs for which greedy achieves only half the max weight. In real-world matching, graphs are sparse and their weights are relatively well behaved, so in practice greedy usually achieves much better than this worst-case approximation, as demonstrated in our experiments.

5 EXPERIMENTAL METHODOLOGY

We present our experimental results on both real-world and synthetic data. Our movie dataset aims to demonstrate the performance of our algorithms in a real, very large-scale application; the publication dataset adds further support to our conclusions on additional real-world data; while the synthetic data is used to stress-test multi-partite matching in more difficult settings.


ZHANG ET AL.: PRINCIPLED GRAPH MATCHING ALGORITHMS FOR INTEGRATING MULTIPLE DATA SOURCES

TABLE 1
Experimental movies data

    Source   Entities   Source    Entities   Source   Entities
    AMG      305,743    Flixster  140,881    IMDB     526,570
    iTunes   12,571     MSN       104,385    Netflix  75,521

    5.1 Movie Datasets

For our main real-world data, we obtained movie meta-data from six popular online movie sources used in the Bing movie vertical (cf. Figure 1) via a combination of site crawling and API access. All sources have a large number of entities: as shown in Table 1, the smallest has over 12,000, while the largest has over half a million. As noted in Section 1, while we only present results on one real-world dataset, the data is three orders of magnitude larger than typical real-world benchmarks in the literature and is from a real production system [22].

Every movie obtained had a unique ID within each source and a title attribute. However, other attributes were not universal. Some sources tended to have more metadata than others. For example, IMDB generally has more metadata than Netflix, with more alternative titles, alternative runtimes, and a more comprehensive cast list. No smaller source was a strict subset of a larger one. For example, there are thousands of movies on Netflix that cannot be found on AMG, even though AMG is much larger. We base similarity measurement on the following feature scores:

Title: an exact match yields a perfect score. Otherwise, the fraction of normalized words that are in common (slightly discounted). The best score among all available titles is used.

Release year: the absolute difference in release years, up to a maximum of 30.

Runtime: the absolute difference in runtime, up to a maximum of 60 minutes.

Cast: a count of the number of matching cast members, up to a maximum of five. Matching only part of a name yields a fractional score.

Directors: names are compared as for cast. However, the number of matches is divided by the length of the shorter director list.

Although the feature-level scores could be improved (e.g., by weighting title words by TF-IDF, by performing inexact word matches, or by understanding that the omission of "part 4" may not matter while the omission of "bonus material" is significant), our focus is on entity matching using given scores, and these functions are adequate for obtaining high accuracy, as will be seen later.
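Two of the feature scores above can be sketched directly. The exact normalization and discount factor are not specified here, so the 0.9 discount and the max-set denominator in `title_score` are assumptions for illustration:

```python
def title_score(title_a, title_b):
    """Illustrative title feature: 1.0 on exact match, otherwise the
    (slightly discounted) fraction of normalized words in common."""
    norm = lambda t: t.lower().split()
    a, b = norm(title_a), norm(title_b)
    if a == b:
        return 1.0
    common = len(set(a) & set(b))
    return 0.9 * common / max(len(set(a)), len(set(b)))

def year_score(y_a, y_b, cap=30):
    """Release-year feature: absolute difference, capped at 30.
    Note this is a distance (lower means more similar)."""
    return min(abs(y_a - y_b), cap)
```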

After scoring features, we used regularized logistic regression to learn weights to combine feature scores into a single score for each entity pair. In order to train the logistic regression model and also evaluate our entity matching algorithms, we gathered human-labeled

TABLE 2
Experimental publications data

    Source    DBLP    ACM    Scholar
    Entities  2,615   2,290  60,292

truth sets of truly matching movies. For each pair of movie sources, we randomly sampled hundreds of movies from one source and then asked human judges to find matching movies in the other source. If a match exists, the two movies are labeled as a matching pair in the truth set. Otherwise, the movie is labeled as non-matching with the other source, and any algorithm that assigns it a match from the other source will incur a false positive. We also employ the standard technique of blocking [21] in order to avoid scoring all possible pairs of movies. In the movie datasets, two movies belong to the same block if they share at least one normalized non-stopword in their titles.

    5.2 Publications Datasets

We follow the same experimental protocol on a second real-world publications dataset. The data collects publication records from three sources, as detailed in Table 2. Each record has the title, authors, venue and year for one publication.

    5.3 Performance Metrics on Real-World Data

We evaluate the performance of both the Greedy and the Message Passing algorithms via precision and recall. For any pair of sources, suppose R is the output matching from our algorithms. Let R+ and R− denote the matchings and non-matchings in our truth data set for these two sources, respectively. Then, we calculate the following standard statistics:

    TP = |R ∩ R+|
    FN = |R+ \ R|
    FP = |R ∩ R−|
       + |{(x, y) ∈ R : ∃ z ≠ y, (x, z) ∈ R+}|
       + |{(x, y) ∈ R : ∃ z ≠ x, (z, y) ∈ R+}|

The precision is then calculated as TP/(TP + FP), and recall as TP/(TP + FN). By varying the threshold we may produce a sequence of precision-recall pairs, producing a PR curve.
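These statistics can be computed directly from sets of labeled pairs. A minimal sketch (the set-based representation of R, R+, and R− is an assumption for illustration):

```python
def precision_recall(R, R_plus, R_minus):
    """Precision/recall for a pairwise matching, per the statistics above.

    R       -- set of (x, y) pairs output by the matcher
    R_plus  -- set of pairs labeled as true matches
    R_minus -- set of pairs labeled as non-matches
    """
    tp = len(R & R_plus)
    fn = len(R_plus - R)
    fp = len(R & R_minus)
    # pairs that match x (or y) to the wrong partner also count as FP
    fp += sum(1 for (x, y) in R if any(z != y for (a, z) in R_plus if a == x))
    fp += sum(1 for (x, y) in R if any(z != x for (z, b) in R_plus if b == y))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```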

    5.4 Synthetic Data Generation

To thoroughly compare our two approaches on general MPEM problems, we design and carry out several experiments on synthetic data with different difficulty levels. This data is carefully produced by Algorithm 1 to simulate the general entity matching problem.


Algorithm 1 Synthetic Data Generation

Require: Generate n entities in m sources; k features per entity
 1: for i = 1 to n do
 2:   for the current entity, generate k values F_1, F_2, ..., F_k ~ Unif[0, 1] as true features;
 3:   for j = 1 to m do
 4:     for each source, generate feature values for the current entity as follows:
 5:     for t = 1 to k do
 6:       each feature value f_t in f_{j:i} is sampled from N(F_t, σ^2)
 7: done
 8: Compute scores between entities across sources:
 9: for i_1 = 1 to m − 1 do
10:   for i_2 = i_1 + 1 to m do
11:     for j_1 = 1 to n do
12:       for j_2 = 1 to n do
13:         score entities j_1, j_2 from sources i_1, i_2 using values f_{i_1:j_1}, f_{i_2:j_2}
14: done

We randomly generate features of entities instead of randomly generating scores of entity pairs. This corresponds to the real world, where each source may add noise to the true meta-data about an entity. In the synthetic case, we start with given true data and add random noise. An important variable in the algorithm is the variance σ^2. As σ increases, even the same entity may have very different feature values in different sources, so the entity matching problem gets harder. In our experiments, we vary the value of σ through 0.02, 0.04, 0.06, 0.08, 0.1 and 0.2 to create data sets with different difficulty levels.

We set the number of entities n = 100, the number of sources m = 3, and the number of features k = 5. We use the normalized inner product of two entities' feature vectors as their similarity score.
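Algorithm 1 can be sketched as follows: true features drawn uniformly, per-source Gaussian noise, and cosine (normalized inner product) scores across all source pairs. The function name, seeding, and data layout are illustrative choices:

```python
import random

def generate_synthetic(n, m, k, sigma, seed=0):
    """Sketch of Algorithm 1: per-entity true features plus per-source
    Gaussian noise, then normalized-inner-product similarity scores."""
    rng = random.Random(seed)
    true_feats = [[rng.random() for _ in range(k)] for _ in range(n)]
    # sources[j][i] is the noisy feature vector of entity i in source j
    sources = [[[rng.gauss(f, sigma) for f in true_feats[i]]
                for i in range(n)] for j in range(m)]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    scores = {}
    for i1 in range(m - 1):               # all ordered source pairs
        for i2 in range(i1 + 1, m):
            for j1 in range(n):
                for j2 in range(n):
                    scores[(i1, j1, i2, j2)] = cosine(sources[i1][j1],
                                                      sources[i2][j2])
    return sources, scores
```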

    5.5 Performance Metrics on Synthetic Data

Since all the entities and sources are symmetric, we evaluate the performance of different approaches on the whole data set. For a data set with m sources and n entities, there are a total of nm(m − 1)/2 true matchings. If the algorithm outputs T entity matchings, out of which C entity pairs are correct, then the precision and recall are computed as:

    Precision = C / T,    Recall = 2C / (nm(m − 1))

As another measure we also use F-1, the harmonic mean of precision and recall, defined as 2 · precision · recall / (precision + recall).

    5.6 Unconstrained ManyMany Baseline

As a baseline approach, we employ a generic unconstrained ER approach, denoted MANYMANY, that is common to many previous works [5], [7], [20], [21], [26], [30], [31], [34], [38]. Given a score function f : D_i × D_j → R on sources i, j and a tunable threshold θ ∈ R, MANYMANY simply resolves all tuples with score exceeding the threshold. Compared to our message passing and greedy algorithms described in Section 4, MANYMANY allows an entity in one data source to be matched with multiple entities in another data source. Because of this property, it is also important to notice that adding more data sources will not help improve the resolution results of MANYMANY on any pair of sources, i.e., the resolution results on a pair of sources using MANYMANY are invariant to any other sources being matched.
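The baseline is a one-liner, which is exactly what makes its lack of a one-to-one constraint visible; a sketch with an assumed callable `score` function:

```python
def many_many(score, D_i, D_j, theta):
    """The unconstrained MANYMANY baseline: resolve every cross-source
    pair whose score exceeds the threshold; no one-to-one constraint,
    so one entity may match several entities in the other source."""
    return {(x, y) for x in D_i for y in D_j if score(x, y) > theta}
```

In the usage below, entity 'B' is matched to two entities at once, which the one-to-one constrained algorithms would forbid.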

    6 RESULTS

    We now present results of our experiments.

    6.1 Results on Movie Data

The real-world movie dataset contains millions of movie entities, on which both the message-passing and greedy approaches are sufficiently scalable to operate. This section's experiments are designed primarily to answer the following:

1) Whether multi-partite matching can achieve better precision and recall compared with simple bipartite matching, i.e., whether additional sources improve statistical performance;

2) Whether the one-to-one global constraint improves the resolution results on real movie data;

3) Whether the message-passing-based approach achieves higher total weight than the heuristic greedy approach; and

4) How the two approaches perform on multi-partite matching of real-world movie data.

    6.1.1 Multipartite vs. Bipartite

We examine the effectiveness of multi-partite matching by adding sources to the basic bipartite matching problem. Specifically, we use two movie sources (msnmovie and flixster in our experiments) as the target source pair, and perform bipartite matching on these two sources to obtain baseline performance. We then add another movie source (IMDB) and perform multi-partite matching on these three sources, comparing the matching results on the target two sources against the baseline. Finally, we perform multi-partite matching on all six movie sources and record the results on the target two sources.

Seven groups of results are shown together in Figure 5. We use diamonds to plot the results from the greedy approach, dots for the message-passing approach, and plus symbols for MANYMANY. The results for different numbers of sources are recorded in different colors. The PR curve is generated by using


[Figure 5: PR curves (Precision vs. Recall) titled "Effect of Additional Movie Sources on Matchings", with series MP, Greedy, and MM on 2 sources, MP and Greedy on 3 sources, and MP and Greedy on 6 sources.]

Fig. 5. Performance of resolution on increasingly many movie sources. The algorithms enjoy comparably high performance which improves with additional sources.

different thresholds on scores. Specifically, given a threshold, we regard all the entity pairs which have similarity below the threshold as having 0 similarity and as never being matched. Then, we perform entity resolution based on the new similarity matrix and plot the precision and recall values as a point on the PR curve. We range the threshold from 0.51 through 0.96, incrementing by 0.03.

From the results, we can first see that when comparing the three approaches on two sources, both message passing and greedy are much better than MANYMANY. This demonstrates the effectiveness of adding the global one-to-one constraint in entity resolution on real movie data. It is also important to notice that without the one-to-one constraint, MANYMANY is not affected at all when new sources are added, while for the approaches with the global one-to-one constraint, the resolution improves significantly with an increasing number of sources, particularly going from 2 to 3. This further shows the importance of having the one-to-one constraint in multi-partite entity resolution: it facilitates appropriate linking of the individual pairwise matching problems.

In addition, for both the message-passing and greedy approaches, the PR curve using three sources for matching is higher than using only two. This means that with the additional source IMDB, which has the largest number of entities among all the sources, the resolution of msnmovie and flixster is much improved. The matching results on all six sources are also higher than using only two sources, but only slightly higher than on three sources. This implies that the other three movie sources do not provide much more information beyond IMDB, likely because IMDB is a very broad source for movies and is already very informative, with good quality metadata to help match msnmovie and flixster.

For any number of sources, the precision and recall of greedy is very similar to that of message passing, and very close to 1.0. This may be due to our feature scores and similarity functions already being quite good for movie data. In order to determine whether the two approaches are fundamentally similar, or whether greedy was benefiting from this data set having low feature noise, we designed the synthetic experiment described in the following sections. Since the experiments on real movie data have already shown the advantage of these two approaches over MANYMANY, we do not compare MANYMANY in the synthetic experiment.

    6.1.2 Total weight

Next we compare the total weight of the matchings output by the message-passing and greedy approaches. The total weight of a matching is calculated as the sum of similarity scores over all matched pairs. For example, if a movie x from IMDB matches movie y in msnmovie and movie z in netflix, then the total weight is sim(x, y) + sim(x, z) + sim(y, z). Since the message-passing approach goes to additional effort to better approximate the maximum-weight matching, we expect its total weight to be higher than greedy's.

In Figure 6, we show the comparative results when performing multi-partite matching on two, three, and six sources, where the threshold is set to 0.51. For two sources, since exact maximum-weight bipartite matching is tractable, we compute the true maximum weight and display it as MaxWeight. For comparison, we use the weight of the message-passing approach as the reference, and plot the values of the greedy and max-weight approaches relative to this reference. From the figure, we can see that on all the datasets, the message-passing approach gets higher total weight than the greedy approach. Also, as the number of sources increases, the difference between the total weights of the message-passing and greedy approaches gets larger. For example, on six sources the total weight of the message-passing approach is more than 10% higher than greedy's. In addition, in the two-source comparison we can see that the message-passing approach (total weight: 76,404) gets almost the same total weight as the maximum-weight matching (total weight: 76,409), while the greedy approach's total weight is a little lower (76,141). This suggests that our message-passing approach approximates the maximum-weight matching well. On the other hand, the result also implies that the movie matching data is far from pathological, since the total weight for greedy is very close to the maximum-weight matching on two sources, in stark contrast with the 2-approximation worst case.


[Figure 6: bar chart titled "Effect of Additional Movie Sources", showing total weight relative to MP (threshold 0.5) for Greedy, MP, and MaxWeight over 2, 3, and 6 sources.]

Fig. 6. Comparing weights, with true max weight for 2 sources. To compare across varying numbers of sources, we plot weights relative to Message Passing.

[Figure 7: PR curves titled "Matching Comparison on Synthetic Data", with MP and Greedy series for σ = 0.02, 0.04, 0.06, 0.08, 0.1, 0.2.]

Fig. 7. Precision-Recall comparison of Message Passing and Greedy matching on synthetic data with varying amounts of score noise.

[Figure 8: F1 score vs. threshold, titled "Matching Comparison on Synthetic Data", with MP and Greedy series for σ = 0.02, 0.04, 0.06, 0.08, 0.1, 0.2.]

Fig. 8. F1 scores of Message Passing and Greedy matching on synthetic data with varying amounts of score noise.

[Figure 9: PR curves for MP and Greedy on 2 and 3 sources and MM on 2 sources, on the publication data.]

Fig. 9. Performance of resolution on increasingly many publication sources.

    6.2 Results on Publication Data

We further explore our approaches with the publication dataset of Section 5.2, for which results are shown in Figure 9. We compare matching DBLP vs. Scholar with matching DBLP vs. ACM vs. Scholar, evaluating the bipartite and tripartite matchings against truth labels on DBLP-Scholar. The results lead to conclusions similar to those drawn from the movie data: (1) globally-constrained matching with an additional source improves the accuracy of matching DBLP-Scholar alone; and (2) the one-to-one constrained approaches perform better than the unconstrained MANYMANY approach.

    6.3 Results on Synthetic Data

As discussed in Section 5, we created synthetic datasets with different levels of feature noise by varying the parameter σ. Our primary aim with the synthetic data is to see whether greedy is truly competitive with message passing, or whether it only achieves similar precision/recall performance in the case of low feature noise on the relatively well-curated movie data. Our experiments are on 3 sources with 100 entities each. We use precision, recall and F-1 score as the evaluation metrics to compare the two approaches.

Figure 7 shows the PR curves of the two approaches on the different data sets. Here, σ = 0.02 represents the easiest dataset (i.e., generated with minimal noise) and σ = 0.2 represents the noisiest dataset (i.e., generated with significant noise). We use a different color to present the results for each data quality, and different point markers for the different approaches. From the figure, we see that for the easiest dataset (σ = 0.02), both greedy and message passing work exceptionally well, reminiscent of the movie matching problem. When the dataset gets noisier, both approaches experience degraded results, but the decrease for the message-passing approach is far smaller than for the greedy approach. As shown in the figure, for datasets with σ = 0.04, 0.06, 0.08, 0.1, message passing operates far better than greedy. Finally, when the data reaches a very noisy level (σ = 0.2), both approaches perform equally poorly.

In Figure 8, we show the F-1 values of the two approaches at different thresholds. We can see very clearly the contrast between the two approaches as the data becomes increasingly noisy. Here colors again denote varying noise levels, while the line type denotes message passing (solid) or greedy (dashed). For example, the gap between the solid and dashed lines is very small at the top of the figure, where the data is relatively clean. However, as σ increases the gap too increases, reaching an apex on the green


[Figure 10: (a) number of message changes and (b) total weight, each plotted against the number of iterations (0 to 50), for thresholds 0.3 through 0.8.]

Fig. 10. Convergence test: (a) change of messages; (b) change of total weight.

curves (σ = 0.06). Later, as σ increases even more, the gap becomes smaller but still exists. At last, when σ = 0.2, the gap becomes very small, which means the two approaches perform almost the same under severe noise. This is to be expected: no method can perform well under extreme noise.

In sum, from the experimental results on the synthetic data, we can conclude that for the datasets with the least feature noise, both the greedy and message-passing approaches perform very well, while for the noisiest datasets, both approaches perform poorly. When the feature noise is between these two extremes, message passing is much more robust than greedy.

    6.4 Convergence and Scalability

The complexity of our message-passing approach depends on the number of iterations the algorithm needs to converge. In this section, we examine the convergence of the algorithm and empirically compare the efficiency of the message-passing approach with the greedy approach.

In the convergence experiment, we use 6 data sources, each having 1k entities, making 6k messages in total. We change the threshold (as in Sec. 6.1.1) from 0.3 to 0.8 in increments of 0.1 to filter the candidate matchings. A higher threshold leaves fewer edges in the weighted graph, so the total weight of the final matching is lower and fewer iterations are needed to converge. In Figure 10(a), we show how many iterations the message-passing approach needs to converge with different thresholds, and Figure 10(b) shows the total weight after each iteration of message passing. Both of these graphs show that with 6 sources and 1,000 entities, the message-passing approach converges quickly. With all the different thresholds, the approach converges within 50 iterations. This indicates that the factor T in the complexity analysis for the message-passing approach is much smaller than the total number of entities n.

We also conducted an experiment to empirically compare the efficiency of the message-passing approach and the greedy approach. There is no doubt that the message-passing approach is much slower

[Figure 11: total runtime in seconds for MP and Greedy, (a) as the number of entities per source increases from 200 to 1,000, and (b) as the number of sources increases from 3 to 7.]

Fig. 11. Running time comparison: (a) increase entities; (b) increase sources.

than the greedy approach, but the purpose of the experiment is to see how well the message-passing approach scales as the number of entities and the number of sources increase. The experiment was performed on a computer with an Intel Core i5 CPU and 4GB memory. Figure 11(a) shows the time cost comparison between the two approaches when we fix the number of sources at 6 and increase the number of entities per source from 200 to 1,000, and in Figure 11(b) we fix the number of entities per source at 600 and increase the number of sources from 3 to 7. We can see that the greedy approach scales well. While the time cost of the message-passing approach increases much faster, the time complexity of the algorithm is still acceptable for solving real problems. In practice, it takes 1 to 2 hours for message passing to find the matching on the real, large-scale movie dataset, which has 6 sources, some containing around half a million movie entities.

    7 CONCLUSIONS

In this paper, we have studied the multi-partite matching problem for integration across multiple data sources. We have proposed a sophisticated factor-graph message-passing algorithm and a greedy approach for solving the problem in the presence of one-to-one constraints, motivated by real-world socio-economic properties that drive data sources to be naturally deduplicated. We provided a competitive-ratio analysis of the latter approach, and conducted comparisons of the message-passing and greedy approaches on a very large real-world Bing movie dataset, a smaller publications dataset, and synthetic data. Our experimental results demonstrate that with additional sources, the precision and recall of entity resolution improve; that leveraging the global constraint improves resolution; and that message passing, while slower to run, is much more robust to noisy data than the greedy approach.

For future work, implementing a parallelized version of the message-passing approach is an interesting direction to follow. Another open area is the formation of theoretical connections between the surrogate objective of weight maximization and the end-goal of high precision and recall.



    Duo Zhang is a software engineer on the Ads team at Twitter. Before joining Twitter, he received his Ph.D. from the University of Illinois at Urbana-Champaign. He was a research intern at Microsoft, IBM, and Facebook during his Ph.D. program. Dr. Zhang has published numerous research papers in text mining, information retrieval, databases, and social networking. He has also served on PCs and as a reviewer at major computer science conferences and journals including SIGKDD, ACL, and TIST.

    Benjamin I. P. Rubinstein is Senior Lecturer in CIS at the University of Melbourne, Australia, and holds a PhD from UC Berkeley. He actively researches in statistical machine learning, databases, and security & privacy. Rubinstein has served on PCs and organised workshops at major conferences in these areas including ICML, SIGMOD, and CCS. Previously he has worked in the research divisions of Microsoft, Google, Yahoo!, and Intel (all in the US), and at IBM Research Australia. Most notably, as a Researcher at MSR Silicon Valley, he helped ship production systems for entity resolution in Bing and the Xbox 360.

    Jim Gemmell is CTO of startup Trov and holds PhD and M.Math degrees. Dr. Gemmell is a world leader in the field of life-logging, and is the author of the popular book Your Life, Uploaded. He has numerous publications in a wide range of areas including life-logging, multimedia, networking, video-conferencing, and databases. Dr. Gemmell was previously Senior Researcher at Microsoft Research, where he made leading contributions to major products including Bing, Xbox 360, and MyLifeBits.
