Lecture Notes in Computer Science: tw.rpi.edu/media/2013/10/07/e293/SEM.doc


SEM+: Discover “same as” Links among the Entities on

the Web of Data

Jin Guang Zheng, Linyun Fu, Xiaogang Ma, Peter Fox

Tetherless World Constellation, Computer Science Department,
Rensselaer Polytechnic Institute, 110 8th St, Troy, New York, USA

{zhengj3, ful2, max7, foxp}@rpi.edu

Abstract. The amount of entity data published on the Linked Open Data Cloud grows rapidly as Semantic Web technologies become more and more popular. The interlinking between these entity data enables data integration and identity recognition, which are crucial in building interesting applications. One way to interlink entity data on the Linked Open Data Cloud is using “same as” links. In this paper, we present SEM+ -- a Semantic Similarity based Entity Matching tool for detecting entities that refer to the same individual or concept at both the schema level and the instance level, and on a large scale. In SEM+, we designed the Information Entropy based Weighted Similarity Model to compute a semantic similarity measure between entity data to suggest possible links. We also adopted a blocking approach that groups possibly matching entities into one block and therefore reduces the computation space. We performed extensive evaluations on SEM+ to study its effectiveness and scalability using various benchmarks and real world data. The results show that our solution is effective and scalable.

Keywords: ontology matching, instance matching, owl:sameAs, entity resolution, Linked Data

1 Introduction

Due to the ever-growing popularity of Semantic Web technologies and large community efforts such as the Linking Open Data (LOD) project [1] and Open Government Data (OGD) projects1, more and more structured datasets are available on the Web. These datasets describe billions of concepts and individuals across different domains. Most of the time, entities on the Web of Data are interlinked via various types of links, as depicted in the LOD cloud diagram2. These interlinks between entities are fundamental for providing powerful semantic functionalities such as semantic integration. One important type of link is the “same as” link. Its importance in Linked Data integration is well recognized [2, 3, 5, 6]. The process of creating such “same as” links between entities is actively pursued on the LOD. However,

1 http://www.data.gov/
2 http://richard.cyganiak.de/2007/10/lod/imagemap.html


given the amount of data and schema published on LOD and the amount of data being added to the LOD cloud every day, the created “same as” network cannot be considered complete. Moreover, the data on the LOD cloud evolve over time. Thus a tool is needed to automatically identify individuals or concepts on the heterogeneous Web of Data and match entities that describe the same individual or concept [4, 5, 6, 18].

In the ontology matching community, many impressive state-of-the-art tools consider ontology matching and instance matching as different problems because the target entities are different: ontology matching focuses on schema level or T-Box entities [18,19,20], while instance matching focuses on instance level or A-Box entities [14,33]. However, since both schema data and instance data in an ontology are entities on the Web of Data and are described by sets of triples, in this paper we consider ontology matching and instance matching as similar problems. We therefore aim to solve the Entity Matching Problem: detecting entities that refer to the same real-world individual or concept regardless of the differences among their descriptions, and creating “same as” links between these entities. Given the current state of the LOD cloud, we outline the following challenges faced by tools that try to perform entity matching on the Web of Data [5,6,30,31]:

1. Entity data are typically described using different vocabularies or schemas.
2. The descriptions of entities can be structured in different ways across different ontologies while describing the same information. For example, “_:Boston rdfs:type _:type. _:type rdfs:label ‘City’ ” is the same as “_:Boston _:category ‘City’ ”.
3. Many descriptions of an entity are not necessary to differentiate it from another entity; more often, these descriptions are additional information about the same individual or concept that the entity refers to.
4. The amount of linked open data on the Web is already in the order of billions of triples and is still increasing. It is important that solutions to the entity matching problem scale to this increasing amount of Web data and, at the same time, achieve sufficient effectiveness (precision and recall).

In this paper, we present SEM+ -- a novel entity matching tool that addresses the above challenges. SEM+ implements a novel semantic similarity computation model called the Information Entropy and Weighted Similarity Model (IEWS Model) to suggest similarity measures between entities. Based on the similarity measures, SEM+ creates “same as” links among the entities. In SEM+, we also implemented a new prefix-based blocking algorithm, which groups possibly matching pairs into one block. This blocking algorithm reduces the number of entity pairs that are needed for similarity computation. Since both the blocking algorithm and part of the similarity measure computation in the IEWS Model depend on same-language lexical similarity, it is assumed that the same language is used to describe matching entities. In other words, the current implementation of SEM+ does not aim to support cross-lingual entity matching. Also, SEM+ only computes possible matches based on the available information that describes the matching entities.

To summarize, the main contributions of this paper are:

We present a new entity matching tool that can perform both instance level matching and schema level matching with high precision and recall.

Page 3: Lecture Notes in Computer Science:tw.rpi.edu/media/2013/10/07/e293/SEM.doc · Web viewTetherless World Constellation, Computer Science Department, Rensselaer Polytechnic Institute110

We present a novel semantic similarity computation model to compute similarity measures among the entities on the Web of Data.

We present a prefix-based blocking algorithm which reduces the number of pairwise matching computations and allows a trade-off between efficiency and effectiveness.

The remainder of this paper is organized as follows: Section 2 discusses some of the existing work in the literature and compares it with our system. Section 3 presents the SEM+ system. Experimental evaluation is discussed in Section 4. Discussion and conclusion are presented in Section 5.

2 Related Work

Our approach is related to research in both instance and schema level matching conducted by the database and Semantic Web communities. In the database community, instance matching is also known as entity resolution, record linkage [7], deduplication [8], or reference reconciliation [9]. Tools such as TAILOR [10], BigMatch [11], MOMA [12], and Swoosh [13] have been developed. These entity resolution tools follow the single-global-threshold paradigm: they compute match suggestion measures in either a supervised or unsupervised manner and then compare the measures with a threshold to determine the match. Chaudhuri [26] proposed the compact set criterion and the sparse neighborhood criterion to enable more accurate characterization of duplicated records. Compared to these systems, our proposed work is a semantic similarity driven matching system. In the Semantic Web community, algorithms have been developed to compute similarity between instance data on the Web of Data, such as those presented in [14,15,16]. Volz et al. [15] used user-configured information as a guide to compute similarity measures for possible matching suggestions. Compared to this approach, our approach doesn’t require user configuration. SLINT [14] applied different similarity computation methods to different types of values, such as date similarity for date values, integer similarity for integer values, etc., then combined these similarities to get a final similarity value for possible matching. Compared to SLINT, our approach takes both common and distinguishing descriptions into consideration while ignoring descriptions that are not necessary for differentiating one entity from another. Rong et al. [16] extracted literal information from the entities and represented this information as vectors. They then used the vector space model and other machine learning techniques to compute a similarity score.
Compared to Rong et al.’s algorithm, our approach considers not only the literal information but also the structural information. Schema level ontology matching is a more studied field in the Semantic Web community. There are many impressive state-of-the-art systems developed for the purpose of performing schema level ontology matching [30]. In this section, we focus on reviewing some of the similarity based ontology matching tools developed in recent years, such as [17,18,21,22,23,25]. Among these systems, ASMOV [18] computes the similarity between two concepts from different ontologies by computing their lexical similarities and structural similarities. Duan et al. [17] used Jaccard


Similarity and “edit distance” similarity as two measures in similarity computation. FCA-Merge [21] and T-Tree [22] compute subclass similarity, superclass similarity, lexical similarity, and concept instance similarities to suggest mappings. Compared to FCA-Merge and T-Tree, RiMOM [23] takes more information into the similarity computation, such as taxonomy structure, entity names, etc. AgreementMaker [25] first uses the TF*IDF model to compute the cosine similarity between two concepts. It then computes descendant similarities and sibling similarities using the cosine similarities. Compared to these systems, SEM+ considers the information describing entities to have different weights, and uses a machine learning approach and an information entropy approach to compute these weights.

In terms of scalability, to the best of our knowledge, not many existing matchers have investigated the problem [31,14], since most existing matching systems have been designed with a focus on effectiveness, i.e. to boost precision and recall, and have been applied to small datasets. SLINT [14], Rong et al.’s [16] system and LogMap [31] are three systems that have algorithms to reduce the computation space. All these systems treat the literal descriptions of entities as bags of words and use the inverted index [27] technique to create blocks. Compared to these computation reduction techniques, our blocking algorithm further reduces the entity-wise computation using prefix blocks.

3 SEM+

In this section, we discuss the detailed implementation of SEM+. We start by defining the problem we intend to solve. Given two sets of entities from the Web of Data, namely E and E’, with entities e ∈ E and e’ ∈ E’, each entity is described by a set of triples δ in the form (subject, property, object). We then compute a similarity score s between e and e’. The score s is in the range 0 to 1, where 0 indicates that e and e’ are dissimilar and 1 indicates that e and e’ represent the same object. Based on this similarity score, we select possible entity matches.
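As a sketch of this problem setup, matching can be viewed as thresholding the pairwise scores. The function names and the 0.8 cut-off below are illustrative assumptions, not part of SEM+:

```python
def select_matches(E, E_prime, sim, threshold=0.8):
    """Pair up entities whose similarity score s in [0, 1] clears a threshold.

    sim(e, e2) is any scoring function returning a value in [0, 1];
    the 0.8 cut-off is an arbitrary example value.
    """
    return [(e, e2, s)
            for e in E for e2 in E_prime
            if (s := sim(e, e2)) >= threshold]

# With an exact-match scorer, only identical entities pair up:
same = lambda a, b: 1.0 if a == b else 0.0
matches = select_matches(["x", "y"], ["x", "z"], same)  # → [("x", "x", 1.0)]
```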

3.1 Overview

SEM+ consists of two major components: 1. Prefix-blocking groups entities that are likely to be similar to each other into one block, and dissimilar entities into different blocks, based on the literal descriptions of the entities such as rdfs:label, rdfs:comment, etc. 2. The IEWS Model takes two or more entities from the same block and computes the semantic similarity between these entities. The IEWS Model consists of three sub-components. A Property Weight Learning component learns the importance of the properties that are used to describe the features of the given entities and assigns a weight to each property. An Information Entropy computation component computes the information entropy of the common descriptions of two entities. A Triple-Wise Similarity computation component computes the similarity between two triples. An overview of SEM+ is depicted in Figure 1.


Fig. 1. System overview of SEM+: the input is a set of entities to be matched and the output is the entity matches

3.2 Blocking Algorithm

Given two large sets of entities, pairwise similarity computation becomes expensive. Therefore, reducing the number of entity pairs for which similarity scores are to be computed is important. In SEM+, we adopted a blocking algorithm to reduce the computation space. The goal of blocking is to group similar entities into the same blocks and dissimilar entities into different blocks as fast as possible, with blocks as small as possible. Potentially similar entities should be contained in the blocks as completely as possible. More careful (thus more expensive) similarity computation is then performed within each block to determine the exact similarity scores. In essence, the function of blocking is to divide entities into blocks of restricted size and thus reduce the number of entity pairs for exact similarity score computation. The problem then lies in how to find good indicators for potentially similar entities without using sophisticated formulae such as SimF(e, e’), see Equation 7, in the IEWS Model.

For many entities, parts of the descriptions are presented as plain literals. These literal descriptions play an important role in describing the entities. Assuming that each entity is eventually linked to a set of words that describe certain properties of the entity, we can leverage this information to perform keyword based indexing to improve the performance in terms of computation speed. In SEM+, we propose to compute the document frequency (the number of entities a word belongs to) of the words appearing in the literal descriptions (LDs) or non-URL descriptions, such as labels and comments, and then to compare only the prefixes of entities. Here the prefixes are a certain number of words having the least document frequency. When implementing the entity blocking approach, the words in each entity e are extracted and an inverted index is built to record the list of entities for each word w, where each list has size lw. Then we filter the inverted index by removing the list for w if lw > lb, where lb is the blocking parameter. The remaining words are the prefixes. Since we only choose less frequent words to be prefixes, the approach also follows the intuition that entities sharing some rare descriptions are much more likely to be


the same than those that do not, because these rare descriptions are usually key features of the entities. Below is an example of the blocking.

Example 1. Consider the following four entities and their corresponding LDs.

w = {A, B, C, E, K, L}    x = {C, D, E, L}
y = {B, K, E, L}          z = {A, B, L}

If lb = 2, then the prefixes and corresponding blocks are

A : {w, z}    C : {w, x}    D : {x}    K : {w, y}

We can see this approach ignores the frequent words such as B, E and L, which

appeared in more than lb = 2 entities. It also treats entities with different numbers of features equally, based on the rare-description-sharing criterion. Note that the blocks may contain an overlap of some of their entities. Since the final similarity computation SimF(e, e’) will only be applied to entities within the same block, we reduce the number of SimF(e, e’) computations from 6 to 3 in this example. For each entity pair from different blocks, a similarity score of 0 is assigned; in this example, similarity score 0 is assigned to the entity pair x and y.
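The blocking scheme above can be sketched in a few lines; this is an illustrative reimplementation, not the SEM+ source:

```python
from collections import defaultdict

def prefix_blocks(entities, lb):
    """Prefix-based blocking as described above (an illustrative sketch).

    entities: dict mapping entity id -> set of words from its literal
    descriptions. Words occurring in more than lb entities are dropped;
    each surviving rare word is a prefix, and its posting list is a block.
    """
    index = defaultdict(list)          # inverted index: word -> entity ids
    for eid, words in entities.items():
        for w in words:
            index[w].append(eid)
    # keep only rare words: posting lists no longer than the blocking parameter
    return {w: ids for w, ids in index.items() if len(ids) <= lb}

# Example 1 from the text:
entities = {"w": {"A", "B", "C", "E", "K", "L"}, "x": {"C", "D", "E", "L"},
            "y": {"B", "K", "E", "L"}, "z": {"A", "B", "L"}}
blocks = prefix_blocks(entities, lb=2)
# blocks maps A -> {w, z}, C -> {w, x}, D -> {x}, K -> {w, y}
```

The frequent words B, E and L each occur in more than two entities, so their posting lists are filtered out and never become blocks.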

This blocking stage makes it possible to trade off efficiency against effectiveness in the matching computation. Greater blocking parameter values result in more, and bigger, blocks, which do not speed up the matching process dramatically but preserve more similar pairs in the blocks. Smaller parameter values speed up the similarity computation a lot, but discard more similar pairs from the blocks.

3.3 Information Entropy and Weighted Similarity Model

3.3.1 Triple-wise Similarity Computation

The entities on the Web of Data are described by a set of triples δ. Each triple describes one of the properties of the entity. Therefore, computing the similarity between two entities is the same as computing the similarity between the triples that describe the entities. In this section, we discuss how SEM+ performs triple-wise similarity computation, or pv similarity (Simpv) computation.

One of the challenges when computing pv similarity is that data describing the same information about an entity are sometimes structured differently. For example, “_:Boston rdfs:type _:t1. _:t1 rdfs:label ‘City’ ” is the same as “_:Boston _:category ‘City’ ”. To solve this problem, SEM+ first checks whether the property parts of the pvs are the same, or whether there exists a property mapping between the properties of the pvs, and then uses Equation 1 to compute the similarity score. In the given example, a mapping between rdf:type and _:category must be established before the similarity computation. This property mapping is itself a sub-problem of ontology schema matching. In the case of ontology matching, property mappings between OWL, SKOS, and RDFS are pre-assigned. In the case of instance matching, we dereference the properties’ URLs to obtain the ontologies that describe the properties, and thereafter perform the property mapping.

(1) Simpv(pv, pv’) =
        Siml(v, v’)        if v and v’ are both literals
        SimF(v, v’)        if v and v’ are both URLs
        Siml(desc(v), v’)  if v is a URL and v’ is a literal (and symmetrically), where desc(v) is the literal description obtained by dereferencing v


In this equation, pv ∈ δ and pv’ ∈ δ’, v is the value part of a pv pair and p is the property part. The formula checks whether the value parts of both pv pairs are URLs or literals. Note that a URL means the value points to another resource description. If both values are literals, SEM+ computes the pv similarity using Lin’s similarity (Siml) [24]. If both values are URLs, SEM+ computes the pv similarity recursively using Equation 7. In some cases, recursively traversing URLs until there are no more URLs to follow and then computing the similarity is costly, so in SEM+ we only traverse URLs to a depth of 3. In the case where one value is a literal and the other is a URL, SEM+ first extracts the resource description of the resource pointed to by the URL, then extracts the literal contents of the other value, and finally uses Siml to compute the similarity. As a result of this process we get a vector of pv similarities that represents the similarity between entities e and e’.
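The case analysis above can be sketched as a dispatch function. The helper names (lin_similarity for Siml, entity_similarity for the recursive SimF, fetch_description for URL dereferencing) are assumptions, not SEM+ internals:

```python
# A sketch of the Equation 1 dispatch; the helpers are passed in so the
# sketch stays self-contained.
MAX_DEPTH = 3  # SEM+ only follows URLs to a depth of 3

def is_url(value):
    return isinstance(value, str) and value.startswith(("http://", "https://"))

def sim_pv(pv, pv2, lin_similarity, entity_similarity, fetch_description,
           depth=0):
    """Similarity of two (property, value) pairs whose properties are
    already known to match (directly or via a property mapping)."""
    v, v2 = pv[1], pv2[1]
    if is_url(v) and is_url(v2) and depth < MAX_DEPTH:
        # both values point to resources: recurse (Equation 7 in the paper)
        return entity_similarity(v, v2, depth + 1)
    if is_url(v):
        v = fetch_description(v)    # mixed case: use the resource's literals
    if is_url(v2):
        v2 = fetch_description(v2)
    return lin_similarity(v, v2)    # literal vs. literal: Lin's similarity
```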

Applying the triple-wise similarity computation algorithm and the concept of Jaccard similarity, one can compute the similarity between two entities e and e’ as:

(2) Sim(e, e’) = Σ Simpv / ( Σ Simpv + α(|PV1| − Σ Simpv) + β(|PV2| − Σ Simpv) )

where |PV1| is the number of pvs in entity e, |PV2| is the number of pvs in entity e’, and α and β are coefficients weighting the contribution of the unique descriptions of e and e’ to the similarity measure.
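Equation 2 is straightforward to compute once the pv similarities are known; a minimal sketch:

```python
# A sketch of Equation 2; the pv similarities are assumed precomputed.
def entity_similarity(pv_sims, n_pv1, n_pv2, alpha=1.0, beta=1.0):
    """pv_sims: triple-wise similarities of the matched pv pairs;
    n_pv1, n_pv2: |PV1| and |PV2|, the pv counts of e and e';
    alpha, beta: coefficients penalizing each entity's unmatched description."""
    s = sum(pv_sims)
    denom = s + alpha * (n_pv1 - s) + beta * (n_pv2 - s)
    return s / denom if denom else 0.0
```

When every pv pair matches perfectly (all similarities 1 and no unmatched pvs), the score is 1; unmatched descriptions inflate the denominator and pull the score down.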

3.3.2 Property Weight Learning

On the Web of Data, not all descriptions of an entity necessarily differentiate it from other entities. Also, intuitively, people are likely to identify things by the key features they notice in an object. For example, if an object has roots, a trunk, branches, and leaves, it is most likely a tree, even though a tree has more properties than the four mentioned. Similarly, if we have identified the important properties which play a significant role in describing the entities, and have assigned appropriate weights to these properties, then we can ignore non-significant descriptions, and the similarity score computed is closer to a human’s intuition. In this section, we discuss how SEM+ identifies important properties, i.e., learns property weights. This weight learning is done by reducing the Property Weight Learning (PWL) problem to a Binary Classification Problem (BCP).

In machine learning, when solving binary classification problems, we are typically given a training set T = {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi ∈ Rd and y is drawn from a set of classification labels {-1, +1}. We then want to find an optimal separating hyperplane W·Ф(x) + b = 0 that separates the xs correctly. However, in our case, we are given a training set T = {(δ1, δ1’, m1), (δ2, δ2’, m2), ..., (δn, δn’, mn)}, where δi and δi’ are two sets of triples describing the entities ei and ei’, and mi indicates whether or not an alignment should be generated between ei and ei’. Therefore, in order to utilize machine learning methods to solve the PWL problem, we need to map it to a classification problem. In other words, we need to define the corresponding W, Ф(x), and y in the PWL problem:


W: a vector of weights for all properties that are used to describe all entities to be compared.

Ф(x): a vector of property-based similarities between entities e and e’. During the Simpv computation process, we obtain a vector of pv similarities between the entities. We then take the average value of the pv similarities if the pvs share the same property, so we get only one pv similarity for every property p used to describe the entities. For any property that is not used to describe entities e and e’, we assign a 0. Therefore, the size of Ф(x) is the same as that of W.

y: for any given entity pair, 1 indicates there is an alignment between entities e and e’, and -1 indicates an alignment should not be established between e and e’.

Using the above definitions, we map the PWL problem to a BCP. We then use a Support Vector Machine to train a classifier on the training set and thereby obtain the weight vector for the properties. If the learned weight for a particular property is 0 or extremely small compared to the other property weights, we ignore the triples that use that property in the final similarity computation, which improves the speed of similarity computation by reducing the number of triple-wise comparisons.
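The reduction can be sketched with any linear classifier. The paper trains an SVM; below, a stdlib-only perceptron stands in for it, and the training pairs are invented for illustration:

```python
# A stdlib-only sketch of the PWL-to-BCP reduction; a perceptron stands in
# for the SVM the paper uses, and the sample data are illustrative.
def learn_property_weights(samples, epochs=100, lr=0.1):
    """samples: list of (phi, y), where phi is the vector of per-property
    pv similarities for an entity pair (the Phi(x) above) and y is +1 for
    a matching pair, -1 for a non-matching pair. Returns (W, b)."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for phi, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, phi)) + b
            if y * score <= 0:  # misclassified: nudge the hyperplane
                w = [wi + lr * y * xi for wi, xi in zip(w, phi)]
                b += lr * y
    return w, b

# Two matching pairs (high per-property similarities) and two non-matching:
samples = [([0.9, 0.8], +1), ([0.8, 0.9], +1),
           ([0.2, 0.1], -1), ([0.1, 0.3], -1)]
W, b = learn_property_weights(samples)
```

Properties whose learned weight ends up at or near zero would then be dropped from the final similarity computation, as described above.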

3.3.3 Information Entropy

In information theory, Shannon suggests that the amount of information can be quantified as information entropy. Information entropy is a quantified measure of the uncertainty of information content [28]. To compute the information entropy present in the descriptions of entities, we consider each property that describes an entity as a variable X, and the possible values of the property as possible outcomes xs. Let us consider the following example. Given a triple that describes the gender of an unknown person (_:unknown _:gender ?g), there are two possible guesses: male or female. The probability of the gender being male or female depends on the distribution of the population the unknown person comes from. If the population is evenly split, then the probabilities of the unknown person being male or female are both 1/2, and the information entropy of the property _:gender is 1 bit. Following is a formal definition for computing the information entropy of the properties that describe the entities, based on Shannon’s entropy [28]: Given a property X with possible values {x1, x2, x3, ..., xn} and probability P(xi) of obtaining each value xi, the information entropy of X, denoted H(X), is:

(3) H(X) = − Σi=1..n P(xi) logb(P(xi))

where b is the base of the logarithm. However, entities on the Web of Data are typically described by more than one property, so we use the chain rule from probability theory to compute the joint information entropy:

(4) H(X1, X2, ..., Xn) = Σi=1..n H(Xi | Xi−1, Xi−2, ..., X1)


The chain rule suggests that the complexity of computing the joint entropy increases exponentially as more and more variables are introduced. Given the complexity of the Web of Data, computing the accurate joint information entropy becomes infeasible. To overcome this problem, we compute an approximate entropy by selecting only the properties that have the top information entropies:

(5) H(X1, X2, ..., Xn) ≈ H(X1, X2, ..., Xk), where H(X1) > ... > H(Xk) > ... > H(Xn)

where Xk is the property that has the kth highest information entropy. Depending on the data, k can vary.
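Equations 3 and 5 can be sketched as follows. Note that summing the k largest single-property entropies additionally treats those properties as independent, which is a simplifying assumption this sketch makes on top of the chain rule; the property values are illustrative:

```python
import math
from collections import Counter

def entropy(values, base=2):
    """Shannon entropy H(X) of one property's observed values (Equation 3)."""
    counts, n = Counter(values), len(values)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

def approx_joint_entropy(property_values, k):
    """Equation 5: keep only the k highest single-property entropies.
    Summing them assumes the properties are independent, a simplification
    of this sketch rather than the chain rule of Equation 4."""
    entropies = sorted((entropy(v) for v in property_values), reverse=True)
    return sum(entropies[:k])

# An evenly split binary property carries 1 bit; a constant one carries 0:
h_gender = entropy(["male", "female"])   # 1.0
h_const = entropy(["tree", "tree"])      # 0.0
```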

3.3.4 Final Similarity Computation

In the previous sections, we discussed how to learn a weight for each property. By applying the learned weights to the pv similarity computation, Equation 1, we get:

(6) Simwpv(pv, pv’) = Wp* Simpv(pv, pv’)

where Wp is the weight learned for property p. One important characteristic of this formula is that if Wp = 0, any triple whose property is p will be discarded in the final similarity computation. Finally, we put everything together to get the final similarity computation formula:

(7) SimF(e, e’) = H(P) · Σ Simwpv / ( Σ Simwpv + α(ΣWp1 − Σ Simwpv) + β(ΣWp2 − Σ Simwpv) )

In this equation, Σ Simwpv is the sum of the weighted pv similarities, (ΣWp1 − Σ Simwpv) and (ΣWp2 − Σ Simwpv) are the weighted unique descriptions of e and e’ respectively, H(P) is the information entropy of the common descriptions, and P is the set of properties contributing to Σ Simwpv.

Using Equation 7, SEM+ computes the final similarity score and, based on the similarity scores, suggests possible matches.
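A minimal sketch of Equation 7, assuming the weighted pv similarities, total property weights, and H(P) have already been computed:

```python
# A sketch of Equation 7; all inputs are assumed precomputed upstream.
def sim_f(weighted_pv_sims, sum_wp1, sum_wp2, h_common, alpha=1.0, beta=1.0):
    """weighted_pv_sims: the weighted pv similarities of Equation 6;
    sum_wp1, sum_wp2: the total property weights of e and e';
    h_common: H(P), the entropy of the common descriptions."""
    s = sum(weighted_pv_sims)
    denom = s + alpha * (sum_wp1 - s) + beta * (sum_wp2 - s)
    return h_common * (s / denom) if denom else 0.0
```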

4 Experiment

A prototype of SEM+ is implemented in Java using existing frameworks such as Lucene3 and Jena4. Using this prototype, we conducted two experiments: 1. Study the accuracy of SEM+ in both ontology schema matching and instance matching. During this experiment, we set the blocking parameter lb = full to ensure we cover all possible comparisons, which gives a better understanding of the accuracy performance of SEM+. SEM+ is configured to perform one-to-one matching. We also set α and β to 1 in the final similarity computation. In this experiment, we also analyze the effectiveness of the proposed Property Weight Learning component and Information Entropy component in terms of improving the accuracy of SEM+. 2. Study how well our blocking algorithm reduces the computation space, and the cost of this reduction in terms of recall. In this experiment, we set lb to different values in order to study the effect of blocking. The experiments are carried out on a PC with 8 Intel Xeon processors of speed 2.40 GHz and 32 GB memory. Each processor has a 12M cache.

3 http://lucene.apache.org/core/
4 http://jena.apache.org/

4.1 Accuracy Evaluation

The goal of entity matching is to discover and generate alignments among entities that refer to the same real-world individual or concept. In some cases, entity matchers either match instances that do not refer to the same individual or concept, or fail to generate a match where two instances actually refer to the same individual or concept. Therefore, to evaluate the accuracy of entity matching, it is necessary to find the number of incorrect matches generated and correct matches missed. This is done by comparing the result generated by the entity matcher with a standard result. We can then use this information to compute precision, recall, and F1 values. The formulas to compute these values are given as follows:

p = |M ∩ S| / |M|    r = |M ∩ S| / |S|    F1 = 2pr / (p + r)

In these formulas, M indicates the set of alignments discovered by the matcher; S indicates the set of standard alignments.
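These formulas translate directly to code; a sketch over alignment sets:

```python
def precision_recall_f1(matched, standard):
    """Precision, recall, and F1 over sets of alignments, per the formulas
    above: M is the discovered set, S the standard (reference) set."""
    m, s = set(matched), set(standard)
    tp = len(m & s)
    p = tp / len(m) if m else 0.0
    r = tp / len(s) if s else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```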

4.1.1 Ontology Matching

In the first part of this experiment, we evaluated the accuracy of SEM+ on ontology schema matching. We chose two datasets from the OAEI 2012 campaign: the conference dataset5 and the library dataset6. The main reasons we chose these two datasets are: 1. These datasets seem to be difficult, since all tested matchers obtain a relatively low F1 value; most systems obtain an F1 value of less than 0.75. 2. We want to test SEM+ on both a small dataset (conference) and a large dataset (library). As mentioned before, the current implementation is not aimed at cross-lingual matching; therefore, some of the OAEI datasets are not suitable for SEM+. The results of SEM+ are compared against all systems submitted to the campaign.

In the conference ontology matching experiment, there are 16 ontologies, which gives 120 possible test cases. Since the accuracy evaluation of the systems submitted to the conference track covers only 21 test cases, we can use the remaining 99 test cases to construct a training set to learn property weights. In this experiment, we compared SEM+ with 20 other systems. The F1 values of all systems that performed matching on the 21 conference dataset test cases are presented in Figure 2. As we can see in Figure 2, SEM+ outperforms all systems on this dataset with precision 0.82, recall 0.82, and F1 value 0.82, which is 0.07 better than the F1 value of the second-place system.

5 http://oaei.ontologymatching.org/2012/conference/index.html
6 http://web.informatik.uni-mannheim.de/oaei-library/2012/

Figure 2. Comparing F1 values of the systems using Conference Ontology

Figure 3. Comparing F1 values of the systems using Library Ontology

The second dataset we used for evaluation is the library ontology dataset, which contains 2839 possible exact matches. Before running test cases on this dataset, we manually constructed a training dataset containing 100 matching pairs and 100 non-matching pairs. Since some matching pairs in the training dataset also appear in the reference alignment, we discarded these pairs when computing the precision and recall of SEM+ to avoid data snooping. The result of SEM+ is therefore computed on 2739 possible exact matches. This should not be a major problem when comparing with other systems, since the 100 matching pairs are only about 3 percent of the possible matches. In this experiment, we compared SEM+ with 17 other systems. The F1 values of all systems on the library test cases are presented in Figure 3. As Figure 3 shows, SEM+ outperforms all systems on this dataset with precision 0.99, recall 0.65, and F1 value 0.785, which is 0.065 higher than the F1 value of the second-place system.

4.1.2 Instance Matching

In the second part of this experiment, we evaluated the accuracy of SEM+ on instance matching. We chose two publicly available datasets from the OAEI campaign: the IIMB dataset7 and the NYTimes to DBpedia dataset8. There is no particular reason for picking these two, other than that, to the best of our knowledge, few datasets are available for instance matching tests.

Table 1. IIMB dataset experiment result

Systems       Precision  Recall  F1 value
SEM+          0.94       0.94    0.94
LogMap        0.94       0.91    0.93
SemSIM        0.87       0.87    0.87
SBUEI         0.87       0.85    0.86
LogMap Lite   0.84       0.83    0.83

In the IIMB dataset experiment, there are 120 possible test cases, each with a different number of possible alignments. Before evaluating SEM+, we used the provided sandbox dataset as a training set to obtain property weights. The final results are provided in Table 1. In this experiment, we compared SEM+ with a different set of systems, because not all ontology matching systems are suitable for instance matching; only a few systems were submitted to this track. Again, SEM+ outperforms all other systems with an F1 value of 0.94, slightly better than LogMap, which has an F-measure of 0.93. SEM+ outperforms only slightly in this experiment because with an F1 value as high as 0.93, it is very difficult to do better. One thing worth mentioning is that SEM+ preprocessed this dataset by removing nonsense literals such as "aa2kd", "8aAf", etc., using Gibberish-Detector9. Even after preprocessing, SEM+ was unable to remove all nonsense literals in the test cases, which undermined its performance.
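The actual preprocessing relies on the trained character-level Markov model from Gibberish-Detector; the heuristic below is only an illustrative stand-in (function name, threshold, and sample literals are our own) showing the kind of filtering involved:

```python
def looks_like_gibberish(token, max_len=6):
    """Crude illustrative stand-in for a gibberish detector: flag short
    tokens that mix digits and letters, e.g. "aa2kd" or "8aAf".
    (SEM+ actually uses rrenaud's Gibberish-Detector, which scores
    character-bigram likelihoods learned from English text.)"""
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    return has_digit and has_alpha and len(token) <= max_len

literals = ["aa2kd", "8aAf", "New York", "Rensselaer"]
clean = [s for s in literals if not looks_like_gibberish(s)]
# keeps only the readable literals "New York" and "Rensselaer"
```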

Table 2. Statistics of OAEI 2011 NYTimes and DBpedia data

              NYTimes Data               DBpedia Data
              # of entity  # of triple   # of entity  # of triple
People        4979         103496        4977         818381
Organization  3044         59561         1965         591223
Location      1920         172141        1920         2905998
Combined      9943         335198        8862         4315602

7 http://www.instancematching.org/oaei/
8 http://oaei.ontologymatching.org/2011/instance/index.html
9 https://github.com/rrenaud/Gibberish-Detector

The second dataset we evaluated for instance matching is the NYTimes to DBpedia dataset. One reason this dataset is interesting is that it is real-world data collected from NYTimes and DBpedia. Table 2 presents its statistics. Again, before running test cases on this dataset, we manually constructed a training dataset, containing 50 matching pairs and 50 non-matching pairs for each category. The final results of this experiment are provided in Table 3. On both the People and Organization datasets, SEM+ outperforms all systems. On the Location dataset, SEM+ achieves an F1 value of 0.91, which is 0.01 less than Zhishi.Links [29].

Table 3. NYTimes to DBpedia dataset experiment result

              AgreementMaker     SERIMI             Zhishi.Links       SEM+
Dataset       Pre   F1    Rec    Pre   F1    Rec    Pre   F1    Rec    Pre   F1    Rec
People        0.98  0.88  0.80   0.94  0.96  0.94   0.97  0.97  0.97   0.99  0.99  0.99
Organization  0.84  0.74  0.67   0.89  0.92  0.87   0.90  0.91  0.93   0.95  0.95  0.95
Location      0.79  0.69  0.61   0.69  0.83  0.67   0.92  0.92  0.91   0.91  0.91  0.91

4.1.3 Impact of Weight Learning and Information Entropy

In this part of the evaluation, we study how much our weight learning component and our information entropy component improve the accuracy of SEM+. To study this, we use the conference ontology dataset and purposely enable only the weight learning component, only the information entropy component, or both, to observe their effects during the matching process. Table 4 presents the results. Here, Triple-Wise means only the triple-wise similarity computation and Equation 1 are applied; WL means the weight learning component is enabled; IE means the information entropy component is enabled. From Table 4, we can see that enabling either the weight learning component or the information entropy component improves the precision and recall of the system, which indicates that both components are helpful in the entity matching process.

Table 4. Effects of different components in the IEWS model

Enabled Component        Precision  Recall  F1 value
Triple-Wise              0.71       0.71    0.71
Triple-Wise + WL         0.79       0.79    0.79
Triple-Wise + IE         0.76       0.76    0.76
Triple-Wise + WL + IE    0.82       0.82    0.82
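The role of the two components can be sketched as a weighted aggregation of per-property similarities (a sketch under our own naming, not a reproduction of the paper's Equation 1 or its exact entropy formulation):

```python
def weighted_similarity(prop_sims, weights):
    """Aggregate per-property similarity scores into one entity-level
    score as a normalized weighted average.

    prop_sims -- {property: similarity in [0, 1]} from triple-wise matching
    weights   -- {property: weight}; in SEM+ these weights come from the
                 learning component (WL) and information entropy (IE), so
                 distinctive properties such as a person's name count more
                 than a widely shared property such as rdf:type.
    """
    total = sum(weights.get(p, 1.0) for p in prop_sims)
    if total == 0:
        return 0.0
    return sum(sim * weights.get(p, 1.0)
               for p, sim in prop_sims.items()) / total

sims = {"name": 0.9, "type": 1.0, "city": 0.2}
# Hypothetical weights: "name" is highly distinctive, "type" barely is.
w = {"name": 5.0, "type": 0.5, "city": 1.0}
score = weighted_similarity(sims, w)   # dominated by the name similarity
```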


4.2 Impact of Blocking

We evaluate the blocking algorithm using the OAEI NYTimes to DBpedia instance matching dataset. The question we ask is how well our blocking algorithm performs in terms of grouping matching entities into the same block. Using this dataset, we evaluated the effect of our blocking algorithm by setting lb = 2, 10, 50, 100. The reduced numbers of comparison computations are presented in Table 5 and the recalls (Rec) are presented in Table 6. Recalls are computed as Rec = M'/M, where M' is the number of matching entity pairs found in the same block and M is the total number of matching entity pairs in the gold-standard file.

Table 5. Number of comparison computations for different lb ("full" means all pairs are compared)

lb    Peop.     Org.     Loc.     Comb.
2     5257      2571     2613     8779
10    39998     26747    21310    75388
50    259776    177396   203442   567443
100   591918    362984   480197   1197011
full  24780483  5981460  3686400  88114866

Table 6. Recalls for different block sizes, with the number of correct pairs found in the same block in parentheses

lb    Peop.         Org.          Loc.          Comb.
2     0.65 (3243)   0.61 (1195)   0.64 (1241)   0.57 (4999)
10    0.92 (4596)   0.895 (1745)  0.9 (1725)    0.87 (7687)
50    0.995 (4951)  0.97 (1892)   0.954 (1827)  0.97 (8624)
100   0.996 (4958)  0.97 (1894)   0.96 (1838)   0.98 (8682)
full  1.0 (4977)    1.0 (1949)    1.0 (1916)    1.0 (8842)

From these two tables, we can see that the blocking algorithm enables trade-offs between computation time and the number of wrongly discarded pairs. For example, on the OAEI People dataset with lb = 50, we need only 259,776 comparison computations to achieve a 99.5% recall rate, compared to 4979 × 4977 = 24,780,483 comparisons to achieve 100% recall, which is about 95 times faster.
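The trade-off above can be illustrated with a simplified candidate-blocking sketch (illustrative only, not SEM+'s exact key construction; here lb caps how many candidate targets each source entity keeps, so larger lb means more comparisons but fewer true matches discarded):

```python
from collections import defaultdict

def block_candidates(sources, targets, lb):
    """Simplified blocking: index target entities by label tokens, then
    give each source entity at most `lb` candidates sharing a token.
    Only these candidate pairs are passed to the expensive similarity
    computation, instead of all |sources| x |targets| pairs."""
    index = defaultdict(list)
    for tid, label in targets.items():
        for tok in label.lower().split():
            index[tok].append(tid)

    candidates = {}
    for sid, label in sources.items():
        seen = []
        for tok in label.lower().split():
            for tid in index.get(tok, []):
                if tid not in seen:
                    seen.append(tid)
        candidates[sid] = seen[:lb]          # cap candidate list at lb
    return candidates

sources = {"s1": "New York City", "s2": "Troy"}
targets = {"t1": "New York", "t2": "York Minster", "t3": "Troy NY"}
cands = block_candidates(sources, targets, lb=1)
# With lb=1 only the first shared-token candidate survives per entity;
# raising lb recovers more true matches at the cost of more comparisons.
```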

5 Discussion and conclusion

In this paper, we presented SEM+, an automatic entity matching tool, with a detailed discussion of its blocking algorithm and the IEWS model for similarity computation. We also presented an extensive evaluation of SEM+. The evaluation studies showed that SEM+ outperforms most existing systems in terms of accuracy. In terms of scalability, they showed that our blocking algorithm can reduce the number of entity-pair similarity computations while maintaining high recall, and that it allows a trade-off between efficiency and effectiveness by changing the prefix size. Although we showed that our algorithm can accurately discover "same as" links between entities, there are still challenges to overcome in future implementations:

1. In some cases, a property mapping between the properties that describe the same information in two entities cannot be established. For example, _:A :address "1 Washington St. New York, New York" describes the same information as _:B :street "1 Washington St.", _:B :city "New York", _:B :state "New York".
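The mismatch can be sketched with two toy records (property names taken from the example above): no property name aligns one-to-one, yet concatenating the finer-grained values recovers the correspondence. This is only one possible direction, not SEM+'s current behavior:

```python
a = {"address": "1 Washington St. New York, New York"}
b = {"street": "1 Washington St.", "city": "New York", "state": "New York"}

# No property name in `a` matches one in `b`, so a one-to-one property
# mapping cannot be established, even though the entities agree.
assert not set(a) & set(b)

# Sketch: compare a's value against the concatenation of b's
# finer-grained values, using token-set (Jaccard) overlap.
combined = " ".join(b[p] for p in ("street", "city", "state"))
tokens_a = set(a["address"].replace(",", "").lower().split())
tokens_b = set(combined.lower().split())
overlap = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
# overlap is 1.0 here: the two representations carry the same tokens.
```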

2. Cross-lingual entity matching: to enable SEM+ to perform cross-lingual entity matching, we need to investigate cross-lingual string comparison and indexing algorithms.

3. String similarity plays an important role in entity matching; a sophisticated string similarity computation can improve entity matching, especially in specialized domains. We need to investigate how domain-specific string similarity computation can help entity matching.

4. The current implementation of SEM+ assumes a one-to-one mapping; the challenge is then to determine whether a mapping is really one-to-one or one-to-many.

References

1. Bizer, C., Heath, T., Berners-Lee, T., Linked Data – The Story So Far. International Journal on Semantic Web and Information Systems, 2009

2. Halpin, H., Hayes, P., McCusker, J., McGuinness, D., Thompson, H., When owl:sameAs isn't the Same: An Analysis of Identity in Linked Data. In Proc. of ISWC, 2010

3. Li, D., Shinavier, J., Shangguan, Z., McGuinness, D., SameAs Networks and Beyond: Analyzing Deployment Status and Implications of owl:sameAs in Linked Data. In Proc. of ISWC, 2010

4. Nikolov, A., Uren, V., Motta, E., Roeck, A., Handling instance coreferencing in the KnoFuss architecture. In Proc. of ISWC, 2008

5. Ferrara, A., Lorusso, D., Montanelli, S., Varese, G., Towards a Benchmark for Instance Matching. In Ontology Matching (OM 2008), volume 431 of CEUR Workshop Proceedings, CEUR-WS.org, 2008

6. Castano, S., Ferrara, A., Lorusso, D., Montanelli, S., On the Ontology Instance Matching Problem. In 19th International Conference on Database and Expert Systems Applications, 2008

7. Newcombe, H., Kennedy, J., Record linkage: making maximum use of the discriminating power of identifying information. Commun. ACM, 5(11):563–566, 1962

8. Sarawagi, S., Bhamidipaty, A., Interactive deduplication using active learning. In Proc. of KDD, pages 269–278, 2002

9. Dong, X., Halevy, A., Madhavan, J., Reference reconciliation in complex information spaces. In Proc. of SIGMOD, pages 865–876, 2005

10. Elfeky, M., Elmagarmid, A., Verykios, V., TAILOR: A record linkage toolbox. In Proc. of SIGMOD, pages 85–96, 2005

11. Yancey, W., BigMatch: A program for extracting probable matches from a large file for record linkage. Statistical Research Report Series RRC2002/01, U.S. Bureau of the Census, 2002

12. Thor, A., Rahm, E., MOMA – a mapping-based object matching system. In Proc. of CIDR, pages 247–258, 2007

13. Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S., Widom, J., Swoosh: a generic approach to entity resolution. VLDB Journal, 18(1):255–276, 2009

14. Nguyen, K., Ichise, R., Le, B., SLINT: A Schema-Independent Linked Data Interlinking System. In Ontology Matching (OM 2012), 2012

15. Volz, J., Bizer, C., Gaedke, M., Kobilarov, G., Discovering and Maintaining Links on the Web of Data. In Proc. of ISWC, pages 650–665, 2009

16. Rong, S., Niu, X., Xiang, E., Wang, H., Yang, Q., Yu, Y., A Machine Learning Approach for Instance Matching Based on Similarity Metrics. In Proc. of ISWC, 2012

17. Duan, S., Fokoue, A., Srinivas, K., Byrne, B., A Clustering-based Approach to Ontology Alignment. In Proc. of ISWC, 2012

18. Jean-Mary, Y., Shironoshita, E., Kabuka, M., Ontology matching with semantic verification. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 2009

19. Cross, V., Hu, X., Using Semantic Similarity in Ontology Alignment. In Ontology Matching (OM 2011), ISWC Workshop Proceedings, 2011

20. Massmann, S., Engmann, D., Rahm, E., COMA++: results for the ontology alignment contest OAEI 2006. In International Workshop on Ontology Matching, 2006

21. Stumme, G., Maedche, A., FCA-Merge: bottom-up merging of ontologies. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), pages 225–230, 2001

22. Euzenat, J., Brief overview of T-tree: the TROPES taxonomy building tool. In 4th ASIS SIG/CR Workshop on Classification Research, Columbus (OH, US), pages 69–87, 1994

23. Tang, J., Li, J., Liang, B., Huang, X., Li, Y., Wang, K., Using Bayesian decision for ontology mapping. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, pages 243–262, 2006

24. Lin, D., An Information-Theoretic Definition of Similarity. In Proc. of the 15th International Conference on Machine Learning (ICML), pages 296–304, 1998

25. Cruz, I. F., Antonelli, F. P., Stroe, C., AgreementMaker: Efficient matching for large real-world schemas and ontologies. PVLDB, 2(2):1586–1589, 2009

26. Chaudhuri, S., Ganti, V., Motwani, R., Robust identification of fuzzy duplicates. In Proc. of ICDE, pages 865–876, 2005

27. Baeza-Yates, R., Ribeiro-Neto, B., Modern Information Retrieval. ACM Press, Addison-Wesley, 1999

28. Shannon, C., A Mathematical Theory of Communication. Bell System Technical Journal, 27(3):379–423, 1948

29. Niu, X., Rong, S., Zhang, Y., Wang, H., Zhishi.links results for OAEI 2011. In Ontology Matching (OM 2011), ISWC Workshop Proceedings, 2011

30. Shvaiko, P., Euzenat, J., Ontology matching: state of the art and future challenges. IEEE Transactions on Knowledge and Data Engineering, 2013

31. Jiménez-Ruiz, E., Cuenca Grau, B., LogMap: Logic-based and Scalable Ontology Matching. In Proc. of ISWC, 2011