Post on 22-Mar-2016
Blank Node Matching and RDF/S Comparison Functions
Yannis Tzitzikas, Christina Lantzaki and Dimitris Zeginis
Institute of Computer Science, FORTH-ICS, and Computer Science Department, University of Crete, GREECE
ISWC2012, Boston, Nov. 2012
In two slides (1/2)
• Several RDF/S Knowledge Bases rely heavily on blank nodes (bnodes)
• Bnodes are convenient for representing complex attributes, or resources whose identity is unknown but whose attributes (either literals or associations with other resources) are known.
• We show how to exploit blank node anonymity in order to reduce the delta size when comparing RDF/S Knowledge Bases.
• We approach the problem as an optimization problem:
– Find the mapping that yields the delta of minimum size
[Figure: two example graphs G1 and G2. In each, Chris hasAddress a blank node (_:ad1 in G1, _:ad2 in G2) with street "Arlington St", no 77 and city Boston. G2 also shows a node Jim with a hasAddress edge.]
Blank node prevalence*
– Opencalais.com: 44.9%
– hi5.com foaf: 87.5%
* [On blank nodes, ISWC 2011]
FORTH-ICS, ISWC 2012
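In two slides (2/2)
Time complexity of bnode mapping
• KBs with no directly connected bnodes
– Optimal mapping: O(n^3)
• All KBs (general case)
– Optimal mapping: NP-Hard
– Approximately optimal mapping (Hungarian-based): O(n^3), deviation from optimal in [0, 7.2]
– Approximately optimal mapping (Signature-based): O(n log n), deviation from optimal in [1, 7.2]
Deviation from optimal: $\mathit{deviation} = \frac{|\Delta_x| - |\Delta_{opt}|}{|\Delta_{opt}|}$
Mapping of 150,000 blank nodes: ~11 sec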
Outline
• Motivation
• RDF Knowledge Bases with Blank Nodes
• On finding the Optimal Bnode Mapping
– Delta and Bnode Name Tuning
– The Optimization Problem
– Polynomially-solved Cases
• Approximate Bnode Matching Algorithms
– Hungarian Bnode Matching Algorithm
– A Fast Signature-based Algorithm
• Experimental Evaluation
• Discussing Semantics and Inference Rules
• Related Work
• Concluding Remarks
Motivation
• The world evolves, and world models (e.g. KBs expressed in RDF/S) evolve as well.
• The result of the comparison of two KBs is called a Delta.
• Deltas can be useful for
– aiding humans in understanding the evolution of knowledge
– reducing the amount of data that needs to be exchanged and managed over the network in order to build synchronization, versioning and replication services
• The inability to match bnodes increases the delta size and does not help in detecting the changes between subsequent versions of a KB. However, a large percentage of the nodes of existing RDF KBs are blank nodes
– Opencalais.com: 44.9% bnodes, hi5.com foaf: 87.5% bnodes
RDF Knowledge Bases with Blank Nodes
Def: Equivalence. Two RDF graphs G1 and G2 are equivalent if there is a bijection M between the sets of nodes of the two graphs (N1 and N2), such that:
– M(uri) = uri for each uri ∈ U1 ∩ N1
– M(lit) = lit for each lit ∈ L1
– M maps bnodes to bnodes
– The triple (s, p, o) is in G1 if and only if the triple (M(s), p, M(o)) ∈ G2
[Figure: the bijection M between N1 and N2 is the identity function on URIs and on literals; the mapping between the blank nodes is the unknown part ("?").]
Graph notation: N: nodes, B: blank nodes, L: literals, U: URIs
RDF Knowledge Bases with Blank Nodes (Cont)
Def: Edit Distance over Nodes given a Bijection. Let o1 and o2 be two nodes of G1 and G2, and suppose a bijection between the nodes of these graphs, i.e. a function h : N1 → N2. We define the edit distance between o1 and o2 over h, denoted by dist_h(o1, o2), as the number of additions or deletions of triples which are required for making the "direct neighborhoods" of o1 and o2 the same. Formally,
$dist_h(o_1, o_2) = |\{(o_1, p, a) \in G_1 \mid (o_2, p, h(a)) \notin G_2\}| + |\{(a, p, o_1) \in G_1 \mid (h(a), p, o_2) \notin G_2\}| + |\{(o_2, p, a) \in G_2 \mid (o_1, p, h^{-1}(a)) \notin G_1\}| + |\{(a, p, o_2) \in G_2 \mid (h^{-1}(a), p, o_1) \notin G_1\}|$
Theorem: RDF Graph Equivalence. $G_1 \equiv_h G_2 \Leftrightarrow dist_h(o, h(o)) = 0$ for each $o \in N_1$
[Figure: example graphs K1 (nodes o1 to o4) and K2 (nodes o5 to o8), with edges labelled p.]
h = {(o1 → o7), (o2 → o6), (o3 → o5), (o4 → o8)}
dist_h(o2, o6) = 4
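The edit distance definition can be read as four counting loops, one per term. The following is a minimal sketch, assuming graphs are Python sets of (subject, predicate, object) tuples and that any node missing from the dictionary h maps to itself (as URIs and literals do). The function name edit_distance and the representation are illustrative assumptions, not the authors' implementation.

```python
def edit_distance(o1, o2, g1, g2, h):
    """Count the triple additions/deletions needed to make the direct
    neighborhoods of o1 (in g1) and o2 (in g2) the same under mapping h."""
    h_inv = {v: k for k, v in h.items()}
    fwd = lambda n: h.get(n, n)      # assumption: unmapped nodes are fixed points
    bwd = lambda n: h_inv.get(n, n)
    d = 0
    for s, p, o in g1:
        if s == o1 and (o2, p, fwd(o)) not in g2:   # outgoing triples of o1
            d += 1
        if o == o1 and (fwd(s), p, o2) not in g2:   # incoming triples of o1
            d += 1
    for s, p, o in g2:
        if s == o2 and (o1, p, bwd(o)) not in g1:   # outgoing triples of o2
            d += 1
        if o == o2 and (bwd(s), p, o1) not in g1:   # incoming triples of o2
            d += 1
    return d


G1 = {("_:1", "name", "Joe")}
G2 = {("_:2", "name", "Joe"), ("_:2", "lives", "UK")}
print(edit_distance("_:1", "_:2", G1, G2, {"_:1": "_:2"}))  # prints 1
```

The single unit of distance comes from the triple (_:2, lives, UK), which has no counterpart in G1.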
Deltas and Bnode Mappings
• For the case where the Knowledge Bases are not necessarily equivalent, we would like to find the bnode mapping that reduces the delta size
• Delta
– we use the differential function Δe. The computed delta consists of triple additions and triple deletions
$\Delta_e(G_1 \rightarrow G_2) = \{Add(t) \mid t \in G_2 - G_1\} \cup \{Del(t) \mid t \in G_1 - G_2\}$
Note: No rename operation is needed and hence no particular execution order
• Consider the following example: G1 = {(_:1, name, Joe)}, G2 = {(_:2, name, Joe), (_:2, lives, UK)}
– Δe without bnode matching: Delete(_:1, name, Joe), Add(_:2, name, Joe), Add(_:2, lives, UK); |Δe| = 3
– Δe with bnode matching: Add(_:2, lives, UK); |Δe| = 1
– After Bnode Name Tuning: Add(_:1, lives, UK)
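The Joe example can be replayed with plain set operations. This is a minimal sketch, assuming graphs are Python sets of triples; the helper names delta and rename_bnodes are hypothetical, and the renaming {_:2 → _:1} is supplied by hand to stand in for the bnode matching step.

```python
def delta(g1, g2):
    """Delta_e(g1 -> g2): additions for g2 - g1, deletions for g1 - g2."""
    adds = {("Add", t) for t in g2 - g1}
    dels = {("Del", t) for t in g1 - g2}
    return adds | dels


def rename_bnodes(g, names):
    """Bnode name tuning: rewrite bnode labels of g according to `names`."""
    return {(names.get(s, s), p, names.get(o, o)) for s, p, o in g}


G1 = {("_:1", "name", "Joe")}
G2 = {("_:2", "name", "Joe"), ("_:2", "lives", "UK")}

plain = delta(G1, G2)                                   # no bnode matching
tuned = delta(G1, rename_bnodes(G2, {"_:2": "_:1"}))    # _:2 matched to _:1
print(len(plain), len(tuned))  # prints: 3 1
```

Because the delta contains only additions and deletions, the operations can be applied in any order, which is the point of the note about rename operations.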
On Finding the Optimal Mapping
• Our objective is to find the bijection M (between bnodes) that minimizes the delta size
– it concerns the mapping of the blank nodes of the subsets B1 and B2
– the bijection M a priori contains the mappings of all the URIs (U1, U2) and literals (L1, L2) as identity functions
• The number of possible bijections M is exponential
– $|J| = n_2 \cdot (n_2 - 1) \cdots (n_2 - n_1 + 1)$, if |B1| = n1, |B2| = n2, |B1| < |B2|
• The cost of a bijection M (which is actually the part of the delta that concerns bnodes)
– $Cost(M) = \sum_{b_1 \in B_1} dist_M(b_1, M(b_1))$
Problem Statement: Given two Knowledge Bases, find the bijection (or bijections) that minimizes the cost. $M_{sol} = \arg\min_{M \in J} Cost(M)$
Theorem: Hardness of Optimality. Finding the optimal bijection is NP-Hard.
Proof: by reduction from the subgraph isomorphism problem (NP-Complete)
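For very small graphs the problem statement can be solved by exhaustive search, which also makes the exponential size of J tangible. The sketch below assumes |B1| ≤ |B2| and graphs given as sets of triples; dist and optimal_mapping are illustrative names, and the tiny graphs are made up for the example.

```python
from itertools import permutations
from math import perm


def dist(o1, o2, g1, g2, h):
    """Edit distance of o1 and o2 over mapping h (unmapped nodes are fixed)."""
    inv = {v: k for k, v in h.items()}
    d = sum((s == o1 and (o2, p, h.get(o, o)) not in g2)
            + (o == o1 and (h.get(s, s), p, o2) not in g2) for s, p, o in g1)
    d += sum((s == o2 and (o1, p, inv.get(o, o)) not in g1)
             + (o == o2 and (inv.get(s, s), p, o1) not in g1) for s, p, o in g2)
    return d


def optimal_mapping(g1, g2, b1, b2):
    """Try every injective mapping B1 -> B2 and keep the cheapest one."""
    b1, b2 = sorted(b1), sorted(b2)
    best, best_cost = None, float("inf")
    for image in permutations(b2, len(b1)):
        m = dict(zip(b1, image))
        cost = sum(dist(b, m[b], g1, g2, m) for b in b1)   # Cost(M)
        if cost < best_cost:
            best, best_cost = m, cost
    return best, best_cost


G1 = {("_:a", "name", "Joe"), ("_:b", "name", "Ann")}
G2 = {("_:x", "name", "Ann"), ("_:y", "name", "Joe"), ("_:y", "lives", "UK")}
mapping, cost = optimal_mapping(G1, G2, {"_:a", "_:b"}, {"_:x", "_:y"})
print(perm(2, 2), mapping, cost)  # 2 candidate mappings; the best has cost 1
```

The number of candidates is math.perm(n2, n1), matching the product n2 · (n2 − 1) · … · (n2 − n1 + 1), so this approach is only usable for a handful of bnodes.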
Polynomially-solved cases: Not directly connected bnodes
Key observation: If there are no directly connected bnodes, then the edit distance between a pair of bnodes is independent of the other pairs
Consequence
• The optimization problem can be solved using the Hungarian Algorithm [J. Munkres, 1957]
– The elements of B1 play the role of workers
– The elements of B2 play the role of jobs
– The edit distances of the pairs in B1 × B2 play the role of the costs
• All the possible combinations can be checked with only |B1| * |B2| (i.e. n^2, assuming n = |B1| = |B2|) edit distance computations
Theorem
Finding the optimal bijection is a polynomial task if there are no directly connected blank nodes.
• The Hungarian-based method has cubic time complexity O(n^3) and quadratic main memory complexity.
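The worker/job analogy can be sketched as follows: build the |B1| × |B2| matrix of edit distances, then solve the assignment problem. The solver below is a textbook O(n^3) potentials-based formulation of the Hungarian method, not the authors' code; it assumes |B1| ≤ |B2| and that no bnode has a bnode neighbor, so each pair distance is simply the symmetric difference of the two direct neighborhoods. All function names are illustrative.

```python
def neighborhood(b, g):
    """Direct neighborhood of b as a set of (direction, predicate, node)."""
    return ({("out", p, o) for s, p, o in g if s == b}
            | {("in", p, s) for s, p, o in g if o == b})


def cost_matrix(g1, g2, b1, b2):
    """Edit distance of every pair in B1 x B2 (no bnode-to-bnode edges)."""
    return [[len(neighborhood(x, g1) ^ neighborhood(y, g2)) for y in b2]
            for x in b1]


def hungarian(cost):
    """Minimum-cost assignment; returns, per row, the chosen column index."""
    n, m = len(cost), len(cost[0])           # requires n <= m
    INF = float("inf")
    u, v = [0] * (n + 1), [0] * (m + 1)      # row / column potentials
    p, way = [0] * (m + 1), [0] * (m + 1)    # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [INF] * (m + 1), [False] * (m + 1)
        while True:                          # grow an augmenting path
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):           # update potentials
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                            # flip matches along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    rows = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            rows[p[j] - 1] = j - 1
    return rows


G1 = {("_:a", "name", "Joe"), ("_:b", "name", "Ann")}
G2 = {("_:x", "name", "Ann"), ("_:y", "name", "Joe"), ("_:y", "lives", "UK")}
B1, B2 = ["_:a", "_:b"], ["_:x", "_:y"]
costs = cost_matrix(G1, G2, B1, B2)          # [[2, 1], [0, 3]]
match = hungarian(costs)                     # [1, 0]
print({B1[i]: B2[j] for i, j in enumerate(match)})
```

The cost matrix is what gives the method its quadratic memory footprint, which matters for the larger datasets in the evaluation.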
The Hungarian-based Algorithm (1/2)
• It is a variation of the optimal Hungarian algorithm that provides an approximate solution, because an assumption is needed about how directly connected blank nodes are treated when computing dist_h
Two possible assumptions:
• All connected bnodes are considered different
• All connected bnodes are considered the same
It again makes only |B1| * |B2| (n^2) edit distance computations, and its complexity remains at the same level, O(n^3)
The Hungarian-based Algorithm (2/2)
[Figure: graphs G1 (bnodes _:1 to _:5) and G2 (bnodes _:6 to _:10) with the same structure: edges hasAgenda, brother, friend, name and sname, and literals Jim, Chris, Zeginis, John and Tom.]
dist_h(_:1, _:6) = ? It depends on the mappings of bnodes _:3, _:4, _:8, _:9
Assume all the connected bnodes are considered:
• the same: dist_h(_:1, _:6) = 0, which exploits the similarity of their predicates. This assumption is used for the experiments
• different: dist_h(_:1, _:6) = 4, which does not take common predicates into account
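The two assumptions can be contrasted with a small sketch in which neighboring bnodes are replaced either by one shared placeholder ("the same") or by a graph-specific label that can never match ("different"). The edges used here (brother, friend) are hypothetical and only loosely modelled on the figure, whose exact edges are not recoverable from the slide text; the function names are illustrative as well.

```python
from collections import Counter


def neighborhood(b, g, bnodes, same, tag):
    """Bag of (direction, predicate, neighbor) entries around bnode b."""
    bag = Counter()
    for s, p, o in g:
        for direction, other, hit in (("out", o, s == b), ("in", s, o == b)):
            if hit:
                if other in bnodes:
                    # "same": one shared placeholder; "different": a label
                    # tagged with the graph, so it never matches across graphs
                    other = "_:any" if same else (tag, other)
                bag[(direction, p, other)] += 1
    return bag


def approx_distance(b1, b2, g1, g2, bnodes1, bnodes2, same=True):
    n1 = neighborhood(b1, g1, bnodes1, same, "G1")
    n2 = neighborhood(b2, g2, bnodes2, same, "G2")
    return sum(((n1 - n2) + (n2 - n1)).values())   # bag symmetric difference


# Hypothetical edges between directly connected bnodes
G1 = {("_:1", "brother", "_:3"), ("_:1", "friend", "_:4")}
G2 = {("_:6", "brother", "_:8"), ("_:6", "friend", "_:9")}
BN1, BN2 = {"_:1", "_:3", "_:4"}, {"_:6", "_:8", "_:9"}

print(approx_distance("_:1", "_:6", G1, G2, BN1, BN2, same=True))   # prints 0
print(approx_distance("_:1", "_:6", G1, G2, BN1, BN2, same=False))  # prints 4
```

Under either assumption each pair distance no longer depends on the rest of the mapping, which is what lets the Hungarian algorithm be applied, at the price of optimality.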
The Signature-based Algorithm (1/2)
It consists of two steps
1. Signature Construction Phase: for each bnode a signature (string) is constructed based on the direct neighborhood of the bnode
2. Mapping Construction Phase: the two bags of signatures are matched. Each signature matching corresponds to a mapping of a pair of blank nodes
Example of Signature Construction:
[Figure: G1 contains Christina hasAddress _:1 (street "Oxford St", no 14, city London) and Yannis hasAddress _:2 (street "Broadway", no 445, city "New York"). G2 contains Christina hasAddress _:3 (street "Oxford St", no 14, city London) and Yannis hasAddress _:4 (street "Michigan A", no 132, city Chicago). All four bnodes have rdf:type Address.]
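The slides do not spell out the signature format, so the following is only one possible construction, assumed for illustration: each triple around the bnode becomes a token, neighboring bnodes are replaced by a placeholder, and the sorted tokens are concatenated. With the address example from the figure, _:1 and _:3 receive identical signatures while _:2 and _:4 do not.

```python
def signature(b, g, bnodes):
    """Hypothetical signature: sorted, joined tokens of b's direct neighborhood."""
    tokens = []
    for s, p, o in g:
        if s == b:   # outgoing triple; bnode objects become a placeholder
            tokens.append("out|%s|%s" % (p, "_:" if o in bnodes else o))
        if o == b:   # incoming triple; bnode subjects become a placeholder
            tokens.append("in|%s|%s" % (p, "_:" if s in bnodes else s))
    return ";".join(sorted(tokens))


def address(person, bnode, street, no, city):
    """Triples of one address record, as drawn in the figure."""
    return {(person, "hasAddress", bnode), (bnode, "rdf:type", "Address"),
            (bnode, "street", street), (bnode, "no", no), (bnode, "city", city)}


G1 = (address("Christina", "_:1", "Oxford St", "14", "London")
      | address("Yannis", "_:2", "Broadway", "445", "New York"))
G2 = (address("Christina", "_:3", "Oxford St", "14", "London")
      | address("Yannis", "_:4", "Michigan A", "132", "Chicago"))

sigs1 = {b: signature(b, G1, {"_:1", "_:2"}) for b in ("_:1", "_:2")}
sigs2 = {b: signature(b, G2, {"_:3", "_:4"}) for b in ("_:3", "_:4")}
print(sigs1["_:1"] == sigs2["_:3"], sigs1["_:2"] == sigs2["_:4"])  # True False
```

Sorting the tokens makes the signature independent of triple order, so equal neighborhoods always yield equal strings.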
The Signature-based Algorithm (2/2)
[Figure: pipeline with the stages Signature Construction Phase, Lexicographical sorting and Mapping Construction.]
• The mapping is exported in two passes
• For both passes we start from the smaller list, say BS1, and for each bs1 in that list we perform a lookup in the second list BS2, using binary search (logarithmic complexity)
• First pass (exact match) exports only the exact matches
• Second pass (closest match) is applied over the remaining part of BS1 and BS2, and matches each element of BS1 to the lexicographically closest element
Note: we perform the closest matches after finishing with the exact matches in order to avoid the situation where an approximate match prevents an exact match at a later step.
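The two passes can be sketched with Python's bisect module. This is a minimal illustration, not the authors' implementation: it assumes sigs1 is the smaller dictionary (bnode → signature), and it takes "lexicographically closest" to mean the sorted neighbor sharing the longer common prefix, which is an assumption. The list.pop calls are linear, so this sketch does not preserve the O(n log n) bound.

```python
import bisect
from os.path import commonprefix


def match_signatures(sigs1, sigs2):
    """Map bnodes of sigs1 to bnodes of sigs2: exact pass, then closest pass."""
    bs2 = sorted((sig, b) for b, sig in sigs2.items())   # lexicographic sort
    mapping, remaining = {}, []
    # First pass: export only the exact matches
    for b1, sig in sorted(sigs1.items(), key=lambda kv: kv[1]):
        i = bisect.bisect_left(bs2, (sig,))              # binary search
        if i < len(bs2) and bs2[i][0] == sig:
            mapping[b1] = bs2.pop(i)[1]
        else:
            remaining.append((sig, b1))
    # Second pass: closest match over what is left of both lists
    for sig, b1 in remaining:
        if not bs2:
            break
        i = bisect.bisect_left(bs2, (sig,))
        near = [j for j in (i - 1, i) if 0 <= j < len(bs2)]
        best = max(near, key=lambda j: len(commonprefix([sig, bs2[j][0]])))
        mapping[b1] = bs2.pop(best)[1]
    return mapping


sigs1 = {"_:1": "city=London|no=14|street=Oxford St",
         "_:2": "city=New York|no=445|street=Broadway"}
sigs2 = {"_:3": "city=London|no=14|street=Oxford St",
         "_:4": "city=Chicago|no=132|street=Michigan A"}
result = match_signatures(sigs1, sigs2)
print(result)  # _:1 is matched exactly to _:3, then _:2 falls back to _:4
```

Running the exact pass first guarantees that _:3 is still available when _:1 is looked up, which is the situation the note on the slide describes.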
Experimental Evaluation
• Over real datasets
– Available in the LOD cloud
– Two versions from each dataset
• Over synthetic datasets
– A synthetic generator was implemented
– Built over the UBA generator [Y. Guo et al., ISWC '04]
– Extended to support control over the number of blank nodes and the blank node properties
• Evaluation Aspects
– Delta reduction potential
– Equivalence detection potential
– Time efficiency
– Deviation from optimal delta
Experiments were conducted using the Sesame RDF/S Repository (main memory model) on a PC with an Intel Core i3 at 2.2 GHz and 3.8 GB RAM, running Ubuntu 11.10.
Experimental Evaluation: Real Datasets
Datasets: Swedish Open Cultural Heritage* and Italian Museums*
None of the datasets contains directly connected blank nodes
• The Hungarian always finds the optimal solution
Delta Size
• The proposed algorithms give a much smaller delta (12.7 to 7,924 times smaller) than without blank node matching
• The Signature gave a delta 0.34 times (34%) bigger than the Hungarian
Mapping Time
• The Hungarian requires more time (from 15 to 624 times more) than the Signature
• The Signature needs less than one second for mapping 6390 blank nodes
* The datasets were downloaded from CKAN
Experimental Evaluation: Synthetic Datasets 1
Synthetic Generation 1
• A set of 9 datasets, from KB0 to KB8, was generated
– all of them contain the same number of blank nodes (240)
– they gradually contain more complex blank node structures
Two rounds of experiments
1. Delta reduction potential: Compare each dataset with another version
2. Equivalence detection potential: Compare each dataset with itself
Experimental Evaluation: Synthetic Datasets 1
Delta Reduction Potential
[Figure: "Delta Reduction Potential". x-axis: pair of datasets (0 to 8); y-axis: delta size percentage in log scale; series: OPTIMAL, NO BNODE MATCHING, HUNGARIAN, SIGNATURE.]
Delta size is given as the ratio of |Δe(KB → KB')| to a size measure over KB and KB'
Without bnode matching the delta size ranges from 95% to 143%
• The algorithms provide a much smaller delta than without blank node matching
• The Hungarian achieves the optimal delta for most of the pairs
• The Hungarian yields from 0 to 3 times smaller deltas than the Signature
Experimental Evaluation: Synthetic Datasets 1
Equivalence detection potential
– Both the proposed algorithms detected equivalence for the first five Knowledge Bases
Time Efficiency
– The Signature gives two orders of magnitude lower mapping times than the Hungarian
[Figure: "Mapping Time". x-axis: pair of datasets (0 to 8); y-axis: mapping time (ms) in log scale; series: HUNGARIAN, SIGNATURE.]
Experimental Evaluation: Synthetic Datasets 2
Synthetic Generation 2
• A set of 7 bigger datasets, from KB0 to KB6, was generated, containing from 2,400 to 153,600 blank nodes
[Figure: "Mapping Time". x-axis: |BNodes| (2,400; 4,800; 9,600; 19,200; 38,400; 76,800; 153,600); y-axis: mapping time (ms), 0 to 12000; series: Signature Construction, Signature Mapping.]
Note: The Hungarian Algorithm could not be applied even to the third pair of datasets due to its high requirements in main memory space
The mapping time for the Signature was only 10.5 seconds for the seventh pair of Knowledge Bases
Measuring the approximation
Deviation from optimal delta
• Investigate how the bnode structures impact the deviation from the optimal delta
$\mathit{deviation} = \frac{|\Delta_x| - |\Delta_{opt}|}{|\Delta_{opt}|}$
[Figure: "Deviation from Optimal, Non Equivalent". x-axis: b_density, the percentage of bnodes in the direct neighborhood (0 to 0.4); y-axis: deviation dx; series: HUNG ordered, HUNG reversed, SIGN ordered, SIGN reversed.]
Hungarian deviation: 0% - 7.2%
Signature deviation: 1% - 7.2%
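As a worked instance of the deviation measure, the sketch below assumes it is the relative excess of an algorithm's delta size over the optimal delta size, i.e. (|Δx| − |Δopt|) / |Δopt|. The function name and the delta sizes are made up for illustration; they are not measurements from the experiments.

```python
def deviation(delta_x_size, delta_opt_size):
    """Relative excess of an algorithm's delta over the optimal delta."""
    return (delta_x_size - delta_opt_size) / delta_opt_size


# Hypothetical sizes: an optimal delta of 1000 triples and an
# approximate delta of 1072 triples give a deviation of 7.2%.
print(round(100 * deviation(1072, 1000), 1))  # prints 7.2
```

A deviation of 0 means the approximate algorithm found a delta as small as the optimal one.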
Discussion: Semantics and Inference Rules
• Apart from the explicitly specified triples of a KB, other triples can be inferred based on the RDF/S semantics, or other custom inference rules.
• To apply our method, the only difference is that the graphs should first be completed with the inferred triples.
• It follows that if the semantics is based on a set of inference rules yielding a finite closure, then the graph is finite and thus our method can be applied.
– E.g. Minimal RDFS semantics, ter Horst's pD* semantics and others
• Note:
– It is worth mentioning that the optimal bnode mapping over the completed graphs may be different from the optimal mapping when considering the explicit graphs.
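Completing a graph before comparison can be illustrated with a fixed-point computation over two standard RDFS entailment rules: transitivity of rdfs:subClassOf and propagation of rdf:type along rdfs:subClassOf. This is a minimal sketch under that restricted rule set, not the full RDFS semantics; the function name and the sample triples are illustrative.

```python
TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"


def closure(g):
    """Apply the two rules until no new triple appears (finite closure)."""
    closed = set(g)
    while True:
        new = set()
        for a, p, b in closed:
            if p != SUBCLASS:
                continue
            for c, q, d in closed:
                if q == SUBCLASS and c == b:   # a sub b, b sub d => a sub d
                    new.add((a, SUBCLASS, d))
                if q == TYPE and d == a:       # c type a, a sub b => c type b
                    new.add((c, TYPE, b))
        if new <= closed:
            return closed
        closed |= new


G = {("_:1", TYPE, "Student"),
     ("Student", SUBCLASS, "Person"),
     ("Person", SUBCLASS, "Agent")}
completed = closure(G)
print(len(completed))  # prints 6
```

The bnode _:1 gains two inferred rdf:type triples, so its direct neighborhood (and therefore its edit distances and signature) changes once the graph is completed.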
Related Work
• Past works that focus on detecting only isomorphism
– Jena
• Past works that focus on finding deltas
– RDF Sync: no effort is dedicated to finding a blank node mapping
– PromptDiff: employs heuristic matchers, but does not treat blank nodes
– OntoView: no blank node matching is offered
– CWM: requires the blank nodes to have term labels
– SemVersion: creates and assigns unique identifiers for the blank nodes
– RDF Molecules (SSWS 2008): an O(n^2) blank node mapping is offered, but it requires the blank nodes to be part of a uniquely identified triple
• They do not try to find a mapping that reduces the delta size
• Works for constructing RDF/S mappings are not directly related since they map the named entities of the two KBs, and thus they take into account lexical similarities, something that is not possible with bnodes.
Concluding Remarks
• We have shown how to exploit blank node anonymity in order to reduce the delta size when comparing RDF/S Knowledge Bases
• We proved that finding the optimal mapping is NP-Hard in the general case (polynomial if there are no directly connected blank nodes)
• We presented polynomial approximate algorithms for the general case (a Hungarian-based and a Signature-based)
• In real datasets with no directly connected blank nodes
– Signature Alg.: two orders of magnitude faster than the Hungarian Alg. (1 second for datasets with 6,390 blank nodes), with 34% bigger deltas than the Hungarian Alg.
• In synthetic datasets with directly connected blank nodes
– Hungarian Alg. yielded from 0 to 3 times smaller deltas than the Signature Alg.
– The Signature Algorithm was 18 to 57 times faster
• The algorithms provide a delta 12.7 to 7,294 times smaller than without blank node matching
• The Signature Algorithm requires only 10.5 seconds to match 153,600 blank nodes!
Possible Future Research
Several issues are interesting for further research
• Investigation of other special cases where the optimal blank node mapping can be found polynomially
– Directly connected blank nodes that form graphs of bounded tree width
• Comparative evaluation of various (probabilistic) signature construction methods and greedy approximation algorithms
Thank you for your attention
Work done in the context of SCIDIP-ES, APARSEN and i-Marine
Web system available at: http://www.ics.forth.gr/isl/BNodeDelta