Blank Node Matching and RDF/S Comparison Functions

Yannis Tzitzikas, Christina Lantzaki and Dimitris Zeginis

Institute of Computer Science, FORTH-ICS, and Computer Science Department, University of Crete, GREECE

ISWC 2012, Boston, Nov. 2012

In two slides (1/2)

• Several RDF/S Knowledge Bases rely heavily on blank nodes
  – Bnodes are convenient for representing complex attributes or resources whose identity is unknown but whose attributes (either literals or associations with other resources) are known.
• We show how to exploit blank node anonymity in order to reduce the delta size when comparing RDF/S Knowledge Bases.
• We approach the problem as an optimization problem:
  – Find the mapping that gives the delta of minimum size

[Figure: resources linked through hasAddress to blank nodes that carry street, no and city properties; values shown: Arlington St, 77, Boston]

Blank node prevalence*
• Opencalais.com: 44.9%
• hi5.com foaf: 87.5%

* [On blank nodes, ISWC 2011]

FORTH-ICS, ISWC 2012

In two slides (2/2)

• All KBs (general case): finding the optimal mapping is NP-Hard; an approximately optimal mapping is computed
• KBs with no directly connected bnodes: the optimal mapping is found
• Time complexity: O(n log n)
• Deviation from optimal: [0, 7.2] and [1, 7.2]
• Mapping of 150,000 blank nodes: ~11 sec

Outline

• Motivation
• RDF Knowledge Bases with Blank Nodes
• On finding the Optimal Bnode Mapping
  – Delta and Bnode Name Tuning
  – The Optimization Problem
  – Polynomially-solved Cases
• Approximate Bnode Matching Algorithms
  – Hungarian Bnode Matching Algorithm
  – A Fast Signature-based Algorithm
• Experimental Evaluation
• Discussing Semantics and Inference Rules
• Related Work
• Concluding Remarks

Motivation

• The world evolves, and world models (e.g. KBs expressed in RDF/S) evolve as well.
• The result of the comparison of two KBs is called a Delta.
• Deltas can be useful for
  – aiding humans to understand the evolution of knowledge
  – reducing the amount of data that needs to be exchanged and managed over the network in order to build synchronization, versioning and replication services
• The inability to match bnodes increases the delta size and does not assist in detecting the changes between subsequent versions of a KB. However, a large percentage of the nodes of existing RDF KBs are blank nodes
  – Opencalais.com: 44.9% bnodes, hi5.com foaf: 87.5% bnodes

RDF Knowledge Bases with Blank Nodes

Def: Equivalence. Two RDF graphs G1 and G2 are equivalent if there is a bijection M between the sets of nodes of the two graphs (N1 and N2), such that:

– M(uri) = uri for each uri ∈ U1 ∩ N1
– M(lit) = lit for each lit ∈ L1
– M maps bnodes to bnodes
– The triple (s, p, o) is in G1 if and only if the triple (M(s), p, M(o)) ∈ G2

[Figure: the bijection M; on URIs and literals it is the identity function]

Graph notation: N: nodes, B: blank nodes, L: literals, U: URIs

RDF Knowledge Bases with Blank Nodes (Cont)

Def: Edit Distance over Nodes given a Bijection. Let o1 and o2 be two nodes of G1 and G2, and suppose a bijection between the nodes of these graphs, i.e. a function h : N1 → N2. We define the edit distance between o1 and o2 over h, denoted by dist_h(o1, o2), as the number of additions or deletions of triples which are required for making the "direct neighborhoods" of o1 and o2 the same. Formally,

dist_h(o1, o2) = |{(o1, p, a) ∈ G1 | (o2, p, h(a)) ∉ G2}|
               + |{(a, p, o1) ∈ G1 | (h(a), p, o2) ∉ G2}|
               + |{(o2, p, a) ∈ G2 | (o1, p, h⁻¹(a)) ∉ G1}|
               + |{(a, p, o2) ∈ G2 | (h⁻¹(a), p, o1) ∉ G1}|

Theorem: RDF Graph Equivalence. G1 ≡h G2 ⇔ dist_h(o, h(o)) = 0 for each o ∈ N1

[Figure: two Knowledge Bases K1 and K2 with nodes o1 to o8]

h = {(o1 → o7), (o2 → o6), (o3 → o5), (o4 → o8)}

dist_h(o2, o6) = 4
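The definition above can be turned into a few lines of code. The following is a minimal sketch, assuming graphs are Python sets of (subject, predicate, object) tuples and that h is given as a dictionary over bnodes only (URIs and literals map to themselves). The two small graphs are hypothetical and are not the K1 and K2 of the figure.

```python
def edit_distance(g1, g2, h, o1, o2):
    """dist_h(o1, o2): number of triple additions/deletions needed to make
    the direct neighborhoods of o1 (in g1) and o2 (in g2) the same."""
    h_inv = {v: k for k, v in h.items()}
    fwd = lambda x: h.get(x, x)        # identity on URIs and literals
    back = lambda x: h_inv.get(x, x)
    d = 0
    for s, p, o in g1:
        if s == o1 and (o2, p, fwd(o)) not in g2:
            d += 1
        if o == o1 and (fwd(s), p, o2) not in g2:
            d += 1
    for s, p, o in g2:
        if s == o2 and (o1, p, back(o)) not in g1:
            d += 1
        if o == o2 and (back(s), p, o1) not in g1:
            d += 1
    return d

# Hypothetical example graphs (names are illustrative only).
g1 = {("_:a", "name", "Ann"), ("_:a", "knows", "_:b"), ("_:b", "name", "Bob")}
g2 = {("_:x", "name", "Ann"), ("_:x", "knows", "_:y"), ("_:y", "name", "Robert")}
h = {"_:a": "_:x", "_:b": "_:y"}

print(edit_distance(g1, g2, h, "_:a", "_:x"))  # neighborhoods agree under h
print(edit_distance(g1, g2, h, "_:b", "_:y"))  # the two name triples differ
```

In line with the equivalence theorem, the two graphs would be equivalent under h only if every such distance were zero.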

Deltas and Bnode Mappings

• For the case where the Knowledge Bases are not necessarily equivalent, we would like to find the bnode mapping that reduces the delta size
• Delta
  – we use the differential function Δe. The computed delta consists of triple additions and triple deletions

Δe(G1 → G2) = {Add(t) | t ∈ G2 − G1} ∪ {Del(t) | t ∈ G1 − G2}

Note: No rename operation is needed and hence no particular execution order

• Consider the following example: G1 = {(_:1, name, Joe)}, G2 = {(_:2, name, Joe), (_:2, lives, UK)}

Δe without bnode matching:
  Delete(_:1, name, Joe)
  Add(_:2, lives, UK)
  Add(_:2, name, Joe)
  |Δe| = 3

Δe with bnode matching (Bnode Name Tuning):
  Add(_:2, lives, UK), or Add(_:1, lives, UK) when the bnode name _:1 is the one retained
  |Δe| = 1
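The example can be reproduced with plain set operations. This is a minimal sketch, assuming graphs are Python sets of triples; the helper name rename_bnodes is hypothetical and simply applies a given bnode mapping to G1 before the delta is computed.

```python
def delta(g1, g2):
    """Δe(G1 → G2): triples to add plus triples to delete."""
    adds = {("Add", t) for t in g2 - g1}
    dels = {("Del", t) for t in g1 - g2}
    return adds | dels

def rename_bnodes(graph, mapping):
    """Rename bnodes according to mapping; other nodes are left untouched."""
    return {(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in graph}

g1 = {("_:1", "name", "Joe")}
g2 = {("_:2", "name", "Joe"), ("_:2", "lives", "UK")}

without_matching = delta(g1, g2)
with_matching = delta(rename_bnodes(g1, {"_:1": "_:2"}), g2)

print(len(without_matching))  # one deletion and two additions
print(with_matching)          # only the addition of (_:2, lives, UK)
```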

On Finding the Optimal Mapping

• Our objective is to find the bijection M (between bnodes) that minimizes the delta size
  – it concerns the mapping of the blank nodes of the subsets B1 and B2
  – the bijection M a priori contains the mappings of all the URIs (U1, U2) and literals (L1, L2) as identity functions
• The number of possible bijections M is exponential
  – |J| = n2 * (n2 − 1) * … * (n2 − n1 + 1), if |B1| = n1, |B2| = n2, |B1| < |B2|
• The cost of a bijection M (which is actually the part of the delta that concerns bnodes)
  – Cost(M) = ∑_{b1 ∈ B1} dist_M(b1, M(b1))

Problem Statement: Given two Knowledge Bases, find the bijection (or bijections) that minimizes the cost. Msol = arg min_{M ∈ J} Cost(M)

Theorem: Hardness of Optimality. Finding the optimal bijection is NP-Hard.

Proof: reduction from the subgraph isomorphism problem (NP-Complete)
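To make the size of the search space concrete, the sketch below enumerates every candidate mapping for tiny bnode sets and picks the cheapest one by exhaustive search. It is only an illustration of the problem statement, not a proposed algorithm: the function names and the toy cost function are hypothetical, and the cost argument stands in for Cost(M).

```python
from itertools import permutations
from math import perm

def candidate_mappings(b1, b2):
    """Yield every injective mapping of b1 into b2 (needs len(b1) <= len(b2))."""
    b1 = list(b1)
    for image in permutations(b2, len(b1)):
        yield dict(zip(b1, image))

def optimal_mapping(b1, b2, cost):
    """Exhaustive search: arg min over all candidate mappings of cost(M)."""
    return min(candidate_mappings(b1, b2), key=cost)

b1 = ["_:1", "_:2"]
b2 = ["_:a", "_:b", "_:c"]

# |J| = n2 * (n2 - 1) * ... * (n2 - n1 + 1), which math.perm computes directly.
print(len(list(candidate_mappings(b1, b2))), perm(len(b2), len(b1)))

# Already for 20 bnodes on each side the count is 20!, far beyond enumeration.
print(perm(20, 20))

# Toy cost: pretend only one particular mapping has cost 0.
target = {"_:1": "_:c", "_:2": "_:a"}
best = optimal_mapping(b1, b2, lambda m: 0 if m == target else 1)
print(best)
```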


Polynomially-solved cases: Not directly connected bnodes

Key observation: If there are no directly connected bnodes, then the edit distance between a pair of bnodes is independent of the other pairs

Consequence
• The optimization problem can be solved using the Hungarian Algorithm [J. Munkres, 1957]
  – The elements of B1 play the role of workers
  – The elements of B2 play the role of jobs
  – The edit distances of the pairs in B1 × B2 play the role of the costs
• All the possible combinations can be checked with only |B1| * |B2| (or else n², assuming n = |B1| = |B2|) edit distance computations

Theorem: Finding the optimal bijection is a polynomial task if there are no directly connected blank nodes.

• The Hungarian-based method has cubic time complexity O(n³) and quadratic main memory complexity.
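The worker/job formulation can be sketched as follows. This is a compact textbook O(n³) implementation of the Hungarian method using row and column potentials, not the authors' code; the cost matrix is hypothetical, with cost[i][j] standing for the edit distance between the i-th bnode of B1 and the j-th bnode of B2.

```python
def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.
    Returns (total_cost, assignment) where assignment[i] is the column
    (job / bnode of B2) assigned to row i (worker / bnode of B1)."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)          # row potentials
    v = [0] * (m + 1)          # column potentials
    p = [0] * (m + 1)          # p[j]: row matched to column j (1-based, 0 = none)
    way = [0] * (m + 1)        # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                      # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return sum(cost[i][assignment[i]] for i in range(n)), assignment

# Hypothetical edit distances between three bnodes of B1 and three of B2.
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
total, assignment = hungarian(cost)
print(total, assignment)
```

The full n × n cost matrix is what gives the quadratic memory footprint mentioned above. In practice a library routine such as SciPy's linear_sum_assignment solves the same assignment problem.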


The Hungarian-based Algorithm (1/2)

• It is a variation of the optimal Hungarian algorithm that provides an approximate solution, because an assumption is needed about how directly connected blank nodes are treated when computing dist_h

Two possible assumptions:
• All connected bnodes are considered different
• All connected bnodes are considered the same

It again makes only |B1| * |B2| (n²) edit distance computations and its complexity remains at the same level (O(n³))

The Hungarian-based Algorithm (2/2)

[Figure: two graphs with persons Chris Zeginis, John and Tom; properties hasAgenda, brother, friend, name and sname; several of the nodes are directly connected bnodes]

dist_h(_:1, _:6) = ? It depends on the mappings of bnodes _:3, _:4, _:8, _:9

Assume all the connected bnodes are considered:
• the same: dist_h(_:1, _:6) = 0. This exploits the similarity of their predicates. This assumption is used for the experiments
• different: dist_h(_:1, _:6) = 4. This does not take common predicates into account
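The effect of the two assumptions can be sketched in code. The exact graph of the figure is not recoverable here, so the example below uses a hypothetical pair of graphs chosen so that the two distances agree with the values on the slide. The function names are illustrative, and the sketch assumes that the bnode names of the two graphs are disjoint.

```python
from collections import Counter

def neighborhood(graph, b, bnodes, same):
    """Multiset describing the triples that touch bnode b.
    Neighboring bnodes collapse to '_' when they are considered the same,
    and keep their own (graph-specific) name when considered different."""
    def norm(x):
        if x not in bnodes:
            return x
        return "_" if same else ("bnode", x)
    desc = Counter()
    for s, p, o in graph:
        if s == b:
            desc[("out", p, norm(o))] += 1
        if o == b:
            desc[("in", p, norm(s))] += 1
    return desc

def approx_dist(g1, g2, b1, b2, bnodes, same):
    """Approximate edit distance: size of the multiset symmetric difference."""
    n1 = neighborhood(g1, b1, bnodes, same)
    n2 = neighborhood(g2, b2, bnodes, same)
    return sum(((n1 - n2) + (n2 - n1)).values())

# Hypothetical graphs: _:1 and _:6 are each linked to two other bnodes.
g1 = {("_:1", "brother", "_:3"), ("_:1", "friend", "_:4")}
g2 = {("_:6", "brother", "_:8"), ("_:6", "friend", "_:9")}
bnodes = {"_:1", "_:3", "_:4", "_:6", "_:8", "_:9"}

print(approx_dist(g1, g2, "_:1", "_:6", bnodes, same=True))   # predicates line up
print(approx_dist(g1, g2, "_:1", "_:6", bnodes, same=False))  # no bnode edge matches
```

Under either assumption the distance of a pair no longer depends on how the other bnodes are mapped, which is what allows the cost matrix to be filled independently.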


The Signature-based Algorithm (1/2)

It consists of two steps
1. Signature Construction Phase: for each bnode a signature (string) is constructed based on the direct neighborhood of the bnode
2. Mapping Construction Phase: the two bags of signatures are matched. Each signature matching corresponds to a mapping of a pair of blank nodes

Example of Signature Construction:

[Figure: two graphs in which Christina and Yannis each have, via hasAddress, a blank node of rdf:type Address with street, no and city properties. Christina's address is Oxford St 14, London in both graphs; Yannis's address is Broadway 445, New York in one graph and Michigan A 132, Chicago in the other]

The Signature-based Algorithm (2/2)

Mapping Construction Phase (the signatures produced by the Signature Construction Phase are first sorted lexicographically)

• The mapping is exported in two passes
• For both passes we start from the smaller list, say BS1, and for each bs1 in that list we perform a lookup in the second list BS2, using binary search (logarithmic complexity)
• First pass (exact match): exports only the exact matches
• Second pass (closest match): is applied over the remaining parts of BS1 and BS2 and matches each element of BS1 to the lexicographically closest element of BS2

Note: we perform the closest matches after finishing with the exact matches in order to avoid the situation where an approximate match prevents an exact match at a later step.
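The two phases can be sketched as follows. The slides do not spell out how a signature string is built or how "closest" is measured, so both are assumptions here: the signature is a sorted concatenation of the bnode's incoming and outgoing edges (neighboring bnodes written as "_"), and the closest element is whichever sorted neighbor shares the longer common prefix. The address data loosely follows the figure of the previous slide. For brevity the sketch removes matched entries with list.pop, which is linear per removal, so it does not reproduce the O(n log n) bound.

```python
from bisect import bisect_left

def signature(graph, bnode, bnodes):
    """Assumed signature: sorted edge descriptors of the direct neighborhood."""
    parts = []
    for s, p, o in graph:
        if s == bnode:
            parts.append(f">{p}={'_' if o in bnodes else o}")
        if o == bnode:
            parts.append(f"<{p}={'_' if s in bnodes else s}")
    return "|".join(sorted(parts))

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def match_signatures(sigs1, sigs2):
    """sigs1, sigs2: dict bnode -> signature. Returns dict bnode1 -> bnode2."""
    if len(sigs1) > len(sigs2):            # always iterate over the smaller list
        return {v: k for k, v in match_signatures(sigs2, sigs1).items()}
    small = sorted((sig, b) for b, sig in sigs1.items())
    large = sorted((sig, b) for b, sig in sigs2.items())
    mapping, rest = {}, []
    for sig, b in small:                   # first pass: exact matches only
        i = bisect_left(large, (sig,))
        if i < len(large) and large[i][0] == sig:
            mapping[b] = large.pop(i)[1]
        else:
            rest.append((sig, b))
    for sig, b in rest:                    # second pass: closest remaining match
        i = bisect_left(large, (sig,))
        cands = [j for j in (i - 1, i) if 0 <= j < len(large)]
        best = max(cands, key=lambda j: common_prefix_len(sig, large[j][0]))
        mapping[b] = large.pop(best)[1]
    return mapping

def address(person, b, street, no, city):
    return {(person, "hasAddress", b), (b, "rdf:type", "Address"),
            (b, "street", street), (b, "no", no), (b, "city", city)}

g1 = address("Christina", "_:a1", "Oxford St", "14", "London") | \
     address("Yannis", "_:a2", "Broadway", "445", "New York")
g2 = address("Christina", "_:b1", "Oxford St", "14", "London") | \
     address("Yannis", "_:b2", "Michigan A", "132", "Chicago")
bn1, bn2 = {"_:a1", "_:a2"}, {"_:b1", "_:b2"}

sigs1 = {b: signature(g1, b, bn1) for b in bn1}
sigs2 = {b: signature(g2, b, bn2) for b in bn2}
print(match_signatures(sigs1, sigs2))
```

Here Christina's address bnodes match exactly in the first pass, and Yannis's changed address is paired in the second pass.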


Experimental Evaluation

• Over real datasets
  – Available in the LOD cloud
  – Two versions from each dataset
• Over synthetic datasets
  – A synthetic generator was implemented
  – Built over the UBA generator [Y. Guo et al., ISWC '04]
  – Extended to support control over the number of blank nodes and the blank node properties
• Evaluation Aspects
  – Delta reduction potential
  – Equivalence detection potential
  – Time efficiency
  – Deviation from optimal delta

Experiments were conducted using the Sesame RDF/S Repository (main memory model) on a PC with an Intel Core i3 at 2.2 GHz and 3.8 GB RAM, running Ubuntu 11.10.

Experimental Evaluation: Real Datasets

Datasets: Swedish Open Cultural Heritage*, Italian Museums*

None of the datasets contains directly connected blank nodes
• The Hungarian always finds the optimal solution

Delta Size
• The proposed algorithms give a much smaller (12.7 to 7,924 times reduced) delta than without blank node matching
• The Signature gave a delta 0.34 times (34%) bigger than the Hungarian

Mapping Time
• The Hungarian requires 15 to 624 times more time than the Signature
• The Signature needs less than one second for mapping 6,390 blank nodes

* The datasets were downloaded from CKAN

Experimental Evaluation: Synthetic Datasets 1

Synthetic Generation 1
• A set of 9 datasets, from KB0 to KB8, was generated
  – all of them contain the same number of blank nodes (240)
  – they gradually contain more complex blank node structures

Two rounds of experiments
1. Delta reduction potential: Compare each dataset with another version
2. Equivalence detection potential: Compare each dataset with itself

Experimental Evaluation: Synthetic Datasets 1

Delta Reduction Potential

[Figure: Delta Reduction Potential. x-axis: pair of datasets (0 to 8); y-axis: delta size (0.1 to 1000); series: OPTIMAL, NO BNODE MATCHING, HUNGARIAN, SIGNATURE]

Delta size is given as |Δe(KB → KB')| relative to the sizes of KB and KB'. Without bnode matching the delta size ranges from 95% to 143%.

• The algorithms provide a much smaller delta than without blank node matching
• The Hungarian achieves the optimal delta for most of the pairs
• The Hungarian yields from 0 to 3 times smaller deltas than the Signature

Experimental Evaluation: Synthetic Datasets 1

Equivalence detection potential
  – Both the proposed algorithms detected equivalence for the first five Knowledge Bases

Time Efficiency
  – The Signature gives two orders of magnitude lower mapping times than the Hungarian

[Figure: Mapping Time. x-axis: pair of datasets (0 to 8); y-axis: mapping time (10 to 10000); series: HUNGARIAN, SIGNATURE]

Experimental Evaluation: Synthetic Datasets 2

Synthetic Generation 2
• A set of 7 bigger datasets, from KB0 to KB6, was generated, containing from 2,400 to 153,600 blank nodes

[Figure: Mapping Time. x-axis: |BNodes| (2,400; 4,800; 9,600; 19,200; 38,400; 76,800; 153,600); y-axis: mapping time (0 to 12000); series: Signature Construction, Signature Mapping]

Note: The Hungarian Algorithm could not be applied even to the third pair of datasets due to its high requirements in main memory space

The mapping time for the Signature was only 10.5 seconds for the seventh pair of Knowledge Bases

Measuring the approximation

Deviation from optimal delta
• Investigate how the bnode structures impact the deviation from the optimal delta
• deviation: compares the size of the delta produced by algorithm x, |Δx|, with the size of the optimal delta, |Δopt|
• b_density: the percentage of bnodes in the direct neighborhood

Hungarian deviation: 0% - 7.2%

Signature deviation: 1% - 7.2%

[Figure: Deviation from Optimal, Non Equivalent. x-axis: b_density (0, 0.1, 0.15, 0.2, 0.25, 0.32, 0.4); y-axis: deviation; series: HUNG ordered, HUNG reversed, SIGN ordered, SIGN reversed]

Discussion: Semantics and Inference Rules

• Apart from the explicitly specified triples of a KB, other triples can be inferred based on the RDF/S semantics, or other custom inference rules.
• To apply our method, the only difference is that the graphs should be completed according to the inferred triples.
• It follows that if the semantics is based on a set of inference rules yielding a finite closure, then the graph is finite and thus our method can be applied.
  – E.g. Minimal RDFS semantics, ter Horst's pD* semantics and others
• Note:
  – It is worth mentioning that the optimal bnode mapping over the complete graphs may be different from the optimal mapping when considering the explicit graphs.

Related Work

• Past works focus on detecting only isomorphism
  – Jena
• Past works focusing on finding a delta
  – RDF Sync: no effort is dedicated to finding a blank node mapping
  – PromptDiff: employs heuristic matchers, but does not treat blank nodes
  – OntoView: no blank node matching is offered
  – CWM: requires the blank nodes to have term labels
  – SemVersion: creates and assigns unique identifiers for the blank nodes
  – RDF Molecules (SSWS 2008): an O(n²) blank node mapping is offered, but it requires the blank nodes to be part of a uniquely identified triple
• They do not try to find a mapping that reduces the delta size
• Works for constructing RDF/S mappings are not directly related since they map the named entities of the two KBs, and thus they take into account lexical similarities, something that is not possible with bnodes.

Concluding Remarks

• We have shown how to exploit blank node anonymity in order to reduce the delta size when comparing RDF/S Knowledge Bases
• We proved that finding the optimal mapping is NP-Hard in the general case (polynomial if there are no directly connected blank nodes)
• We presented polynomial approximate algorithms for the general case (a Hungarian-based and a Signature-based)
• In real datasets with no directly connected blank nodes
  – Signature Alg.: two orders of magnitude faster than the Hungarian Alg. (1 second for datasets with 6,390 blank nodes), with 34% bigger deltas than the Hungarian Alg.
• In synthetic datasets with directly connected blank nodes
  – Hungarian Alg. yielded from 0 to 3 times smaller deltas than the Signature Alg.
  – The Signature Algorithm was 18 to 57 times faster
• The algorithms provide a delta 12.7 to 7,294 times smaller than without blank node matching
• The Signature Algorithm requires only 10.5 seconds to match 153,600 blank nodes!

Possible Future Research

Several issues are interesting for further research
• Investigation of other special cases where the optimal blank node mapping can be found polynomially
  – Directly connected blank nodes that form graphs of bounded tree
• Comparative evaluation of various (probabilistic) signature construction methods and greedy approximation algorithms

Thank you for your attention

Work done in the context of SCIDIP-ES, APARSEN and i-Marine

Web system available at: http://www.ics.forth.gr/isl/BNodeDelta
