Text Based Similarity Metrics and Delta for Semantic Web Graphs
Krishnamurthy Koduvayur Viswanathan
Monday, June 28, 2010
Contributions
• Define text-based similarity metrics that characterize the relationship between semantic web graphs
• Evaluate the similarity metrics for three specific cases of similarity that we defined
• Generate a delta between pairs of SW graphs that may be two versions of the same graph
• Prototyped the techniques in a new system called Similis
Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion
Motivation: Near Duplicate Detection for the SW?
Goals
• Explore the different ways in which two SW graphs may be similar to each other
• In particular, evaluate the specific use case of versioning relations between SW graphs
• Additionally, develop techniques to generate a delta between versions
Comparison with near duplicate text document detection
• In a text document:
– Order of the content is important
– The meaning of the text is not a part of the problem, just the textual encoding of the meaning
• For a SWD, the order is not deterministic, i.e. equivalent SWDs may have different statement orderings
• Non-deterministic blank node identifiers
Semantic Web Document (SWD)
• RDF representation of a Semantic Web Graph
– Document based serialization of a SW graph on the web (ontology or data-file)
– Document based serialization of the result of a SPARQL query on a triple-store
– Document based serialization of structured metadata extracted from an HTML page using RDFa
Semantic Web Graph Similarity
• The archive of the Swoogle search engine (Ding et al. 2004) shows several examples of how ontologies and RDF documents evolve over time
• Kinds of similarity between two SW graphs:
– Same classes and properties used; differ only in literal content
– Different only in base-URIs of entities used
– Different versions of the same semantic web graph
Similarity in Classes and Properties
• Two semantic web graphs that differ only in the literal content
Different in Literal Content
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "Eric Miller" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#mailbox> "mailto:[email protected]" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Dr" .

<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "John Doe" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#mailbox> "mailto:[email protected]" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Mr" .
Different only in base-URI
<http://www.w3.org/2001/sw/WebOnt/guide-src/wine#ItalianRegion> .
_:g103 <http://www.w3.org/2002/07/owl#onProperty> <http://www.w3.org/2001/sw/WebOnt/guide-src/wine#locatedIn> .
_:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> _:g103 .
_:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .
_:g105 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Restriction> .
_:g105 <http://www.w3.org/2002/07/owl#hasValue> <http://www.w3.org/2001/sw/WebOnt/guide-src/wine#Dry> .
_:g105 <http://www.w3.org/2002/07/owl#onProperty> <http://www.w3.org/2001/sw/WebOnt/guide-src/wine#hasSugar> .

<http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#ItalianRegion> .
_:g103 <http://www.w3.org/2002/07/owl#onProperty> <http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#locatedIn> .
_:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> _:g103 .
_:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .
_:g105 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Restriction> .
_:g105 <http://www.w3.org/2002/07/owl#hasValue> <http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#Dry> .
_:g105 <http://www.w3.org/2002/07/owl#onProperty> <http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#hasSugar> .
Versioning Relationship
• Two semantic web documents have a versioning relationship if they are variants of the same semantic web graph.
• Variants are created due to the dynamic nature of the web, i.e. content keeps getting modified
– Minor changes: spelling corrections, punctuation, etc.
– Major changes: affect the semantic content
Problem Definition
• Problem 1: Given a collection of semantic web graphs in the form of RDF documents, characterize the similarity between pairs into one or more of the three cases:
– Same classes and properties used, but differ only in the literal content
– Differ only in the base-URI used
– Are different versions of the same graph, i.e. have a versioning relationship
Problem Definition
• Problem 2: Generate a delta between pairs that have been identified as having a versioning relationship between them.
Approach
Input: Corpus of SWDs
Convert to n-triples format
Convert to canonical form
Generate Reduced Forms
Compute Text-Based Similarity Metrics
Build feature-vectors for each pair
Characterize similarity between pairs
Identify versions
Generate delta between versions
Convert to n-triples
Convert to Canonical Form
• Comparison methods may be affected by blank node identifiers and statement ordering
• Canonicalization assigns consistent IDs to blank nodes and orders the statements lexicographically.
• Transforms two semantically equivalent graphs into the same canonical representation
Based on: Carroll, J. J. 2003. Signing RDF graphs. In 2nd ISWC, volume 2870 of LNCS, 5–15. Springer.
Convert to Canonical Form
Input graph:
<person:John> <a:livesIn> _:x .
_:x <a:IsPartOf> "USA" .
<person:John> <a:likes> "cheese" .
_:x <a:hasCapital> _:y .

Blank nodes replaced by "~" and statements sorted (original identifiers kept as comments):
"~" <a:hasCapital> "~" . # _:x _:y
"~" <a:IsPartOf> "USA" . # _:x
<person:John> <a:likes> "cheese" .
<person:John> <a:livesIn> "~" . # _:x

BNode Table
Old Blank Node Identifier    New Blank Node Identifier
_:y                          _:g1
_:x                          _:g2

Canonical form:
_:g2 <a:hasCapital> _:g1 .
_:g2 <a:IsPartOf> "USA" .
<person:John> <a:likes> "cheese" .
<person:John> <a:livesIn> _:g2 .
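A minimal Python sketch of the one-pass idea illustrated above. It assumes N-Triples input with one statement per line and no literal that contains text resembling a blank node identifier. The function name `canonicalize`, the plain code-point sort, and the labelling order (first appearance, left to right, in the sorted masked statements) are assumptions made for illustration, so the generated labels can differ from the BNode table on the slide. It is not the Similis implementation, nor the full algorithm from Carroll (2003).

```python
import re

# Hypothetical pattern for blank node identifiers such as _:x or _:g103.
BNODE = re.compile(r"_:[A-Za-z0-9]+")


def canonicalize(ntriples: str) -> list[str]:
    """Return statements in a canonical order with consistent blank node IDs."""
    lines = [ln.strip() for ln in ntriples.splitlines() if ln.strip()]

    # Step 1: mask every blank node with "~", remembering the original line.
    masked = [(BNODE.sub('"~"', ln), ln) for ln in lines]

    # Step 2: sort by the masked text, so the order ignores blank node names.
    masked.sort(key=lambda pair: pair[0])

    # Step 3: two masked statements with blank nodes that look identical are
    # non-distinctive; a unique canonical form cannot be produced in one pass.
    keys = [m for m, original in masked if BNODE.search(original)]
    if len(keys) != len(set(keys)):
        raise ValueError("non-distinctive triples: no unique canonical form")

    # Step 4: assign new identifiers in order of first appearance.
    table: dict[str, str] = {}
    for _, original in masked:
        for old in BNODE.findall(original):
            if old not in table:
                table[old] = f"_:g{len(table) + 1}"

    # Step 5: rewrite each statement with the new identifiers.
    return [BNODE.sub(lambda m: table[m.group(0)], original)
            for _, original in masked]
```

Under these assumptions, two serializations of the same graph that differ only in statement order and blank node names map to the same list, while a graph with non-distinctive triples raises an error instead of producing a form.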
Limitation of the Algorithm: Non-Distinctive Triples
• The algorithm can only deal with graphs that do not have non-distinctive triples
• Non-distinctive triples: the triples in the graph that cannot be uniquely identified when all the blank nodes are treated as equal
Graphs with Non-Distinctive Triples
• For a group of n non-distinctive triples, there are n! ways of renaming the blank nodes
• For graphs with non-distinctive triples, a single unique canonical form does not exist
• To compare two graphs, compare each of the possible canonical forms for both graphs
• Number of comparisons: O(m!n!)
• Similis throws an exception when it finds a graph with multiple forms
Graphs with Non-Distinctive Triples
• Only a small percentage of SW graphs (13%) did not have a unique canonical form (in a sample of 1200 randomly collected SW documents)
Generating Reduced Forms
• The canonical form of each SW graph is broken down into a number of reduced forms
• These reduced forms are used to characterize the relationship between pairs of SW graphs
• The following is the anatomy of a triple:
Entity URI    <http://www.w3.org/2001/sw/guide-src/wine#hasSugar>
Base URI      <http://www.w3.org/2001/sw/guide-src/wine>
Local Name    <hasSugar>
Only-Literals Reduced Form
• Contains only the literals from the original n-triples file.
• Lets us compare only the textual content within a graph, separated from the rest of the graph
No-Literals Reduced Form
• All the literals from the canonical form are replaced by an empty string
• Lets us compare only the classes and properties used, regardless of literal content
Local-Name Reduced Form
• The base-URI of every node in the canonical form is replaced by an empty string
• Lets us compare only the local names of the classes and properties used
Local-Name-No-Literal Reduced Form
• All the literals, and the base-URI of every node, are replaced by an empty string
• Lets us compare the non-literal content of two SW graphs
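The four reduced forms can be sketched as small text transformations over canonical N-Triples statements. This is an illustrative Python sketch, not the Similis code: the function names are hypothetical, the base URI is assumed to end at the last `#` or `/`, a removed literal is assumed to leave an empty quoted string behind, and literals are assumed not to contain angle brackets.

```python
import re

# Quoted literal, allowing backslash escapes inside the quotes.
LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')
# URI reference enclosed in angle brackets.
URI = re.compile(r"<([^<>]*)>")


def local_name(uri: str) -> str:
    """Strip the base URI: keep what follows the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[cut + 1:]


def only_literals(triples: list[str]) -> list[str]:
    """Only-Literals form: just the literal values."""
    return [lit for t in triples for lit in LITERAL.findall(t)]


def no_literals(triples: list[str]) -> list[str]:
    """No-Literals form: every literal replaced by an empty string."""
    return [LITERAL.sub('""', t) for t in triples]


def local_names(triples: list[str]) -> list[str]:
    """Local-Name form: every URI reduced to its local name."""
    return [URI.sub(lambda m: f"<{local_name(m.group(1))}>", t)
            for t in triples]


def local_names_no_literals(triples: list[str]) -> list[str]:
    """Local-Name-No-Literal form: both reductions combined."""
    return local_names(no_literals(triples))
```

Literals are blanked before URIs are shortened in the combined form so that the URI pattern never runs over literal text.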
Similarity/Distance Metrics Used
• Cosine Similarity between SWD vectors
• Jaccard and Containment Metrics
• Hamming Distance between Simhash fingerprints
Computation of Pairwise Metrics
• Compute cosine similarity between the canonical and local-name forms of each pair in the collection
– If cosine similarity < 0.7, remove pair from further consideration
– Else, compute all other metrics for all the forms (5 forms * 3 metrics = 15 specific metrics)
• Total of 17 metrics computed (the 2 cosine similarities plus the 15 form-specific metrics)
Cosine Similarity Between Term Vectors
• Each SWD containing terms $T_j = \{t_1, t_2, \ldots, t_n\}$ is treated as a vector $V_j = (\gamma_1 t_1, \gamma_2 t_2, \ldots, \gamma_n t_n)$, where each $\gamma_i$ is the weight associated with term $t_i$
• Non-blank, non-literal nodes are used as features, and Term Frequency (TF) is used as weight
• Two vectors for each SWD: one uses full entity URIs as features, other uses local-name of terms
• Indicates similarity in classes and properties
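As an illustration of the bullets above, the following Python sketch computes cosine similarity between two term-frequency vectors. The function name `cosine_similarity` and the choice to pass each document as a plain list of terms (full entity URIs or local names) are assumptions for this sketch; extracting the terms from an SWD is left out.

```python
import math
from collections import Counter


def cosine_similarity(terms_a: list[str], terms_b: list[str]) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    va, vb = Counter(terms_a), Counter(terms_b)

    # Dot product over the terms the two documents share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())

    # Euclidean norms of the two TF vectors.
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))

    # An empty document has no direction; treat it as dissimilar.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Calling the function once with full entity URIs and once with local names gives the two cosine values described above.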
SW Document Vectors
Full entity URIs as features:
Term                                             Freq
<http://purl.org/dc/elements/1.1/title>          2
<http://purl.org/dc/elements/1.1/creator>        1
<http://purl.org/dc/elements/1.1/contributor>    1
<http://put-off.org>                             1

Local names as features:
Term             Freq
<title>          2
<creator>        2
<contributor>    1
<put-off.org>    1
Jaccard and Containment
• Computed for all forms (five) for a candidate pair of SW graphs (5 * 2 = 10 metrics)
• Construct sets of character 4-grams for each document
• 4-grams are computed by running a four character-wide window over the text representation
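A short Python sketch of the two set-based metrics over character 4-grams. The slides do not spell out the containment formula, so this assumes the usual resemblance and containment definitions: Jaccard is |A ∩ B| / |A ∪ B| and the containment of A in B is |A ∩ B| / |A|. The function names are hypothetical.

```python
def char_ngrams(text: str, n: int = 4) -> set[str]:
    """Slide an n-character window over the text and collect the n-grams."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(text_a: str, text_b: str, n: int = 4) -> float:
    """Size of the n-gram intersection divided by the size of the union."""
    a, b = char_ngrams(text_a, n), char_ngrams(text_b, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(text_a: str, text_b: str, n: int = 4) -> float:
    """Fraction of the n-grams of text_a that also occur in text_b."""
    a, b = char_ngrams(text_a, n), char_ngrams(text_b, n)
    return len(a & b) / len(a) if a else 0.0
```

Unlike Jaccard, containment is asymmetric, which is useful when one document is largely a subset of the other.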
Hamming Distance between Simhash Fingerprints
• Simhash fingerprints of similar documents differ in a small number of bit positions
• Tokenize documents into character 3-grams
• Compute simhash fingerprint for each document in pair (we implemented 128 bit fingerprints)
• Find Hamming Distance between the fingerprints
• Computed for all forms (five) for a candidate pair of SW graphs (5 * 1 = 5 metrics)
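The steps above can be sketched in Python as follows. This is a generic simhash over character 3-grams, not the Similis implementation: using MD5 as the 128-bit token hash and counting every 3-gram occurrence with weight 1 are assumptions made only to keep the sketch self-contained.

```python
import hashlib

BITS = 128  # fingerprint width; MD5 digests are exactly 128 bits


def simhash(text: str, n: int = 3) -> int:
    """Compute a 128-bit simhash fingerprint from character n-grams."""
    counts = [0] * BITS
    for i in range(len(text) - n + 1):
        token = text[i:i + n]
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1

    # A fingerprint bit is set when the positive votes win.
    fingerprint = 0
    for bit in range(BITS):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Identical texts always yield a distance of 0, and the distance can never exceed the fingerprint width.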
Classification
[Figure: the similarity metrics computed for each candidate pair are assembled into a feature vector (FV), which is passed to three classifiers]
• Naïve Bayes classifier: similarity in classes and properties
• Naïve Bayes/SVM classifier: difference only in base-URI
• SVM classifier: versioning relationship
[Figure: example feature vector used for determining versioning relationship]
Computing Delta Between Two Versions
[Figure: pairs identified by the SVM classifier as having a versioning relationship (Version1, Version2) are passed to the delta computation]
• Subtractive delta: Version1 except Version2
• Additive delta: Version2 except Version1
Raw Delta
• Statement-by-statement comparison between canonical forms of the two SWDs
• Only local names of entities are compared
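Read as set operations, the raw delta is a pair of set differences. The Python sketch below assumes each version is already a list of canonical statements reduced to local names; the function name `raw_delta` is hypothetical.

```python
def raw_delta(version1: list[str], version2: list[str]) -> tuple[list[str], list[str]]:
    """Return (subtractive, additive) deltas between two versions.

    subtractive: statements in version1 but not in version2
    additive:    statements in version2 but not in version1
    """
    s1, s2 = set(version1), set(version2)
    return sorted(s1 - s2), sorted(s2 - s1)
```

For example, if the second version drops one statement and introduces another, the dropped statement appears in the subtractive delta and the new one in the additive delta.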
Delta After Deductive Closure
[Figure: for each of SWGv1 and SWGv2, compute the deductive closure and canonicalize the result; then generate the raw delta between the two canonical forms]
Delta After Deductive Closure
• If O is a set of propositions, $p \in O$ and $p \models q$, then $q \in O$
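To make the closure property concrete, here is a toy Python sketch that forward-chains to a fixed point. It assumes only two RDFS-style rules (transitivity of `rdfs:subClassOf` and type inheritance along it) and represents triples as tuples of strings. The rule set actually used by Similis is not stated in the slides, so this is purely illustrative.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

Triple = tuple[str, str, str]


def deductive_closure(triples: set[Triple]) -> set[Triple]:
    """Add entailed triples until no rule produces anything new."""
    closed = set(triples)
    while True:
        new: set[Triple] = set()
        for s, p, o in closed:
            if p != SUBCLASS:
                continue
            for s2, p2, o2 in closed:
                # (s subClassOf o), (o subClassOf o2) => (s subClassOf o2)
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # (s2 type s), (s subClassOf o) => (s2 type o)
                if p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))
        if new <= closed:
            return closed
        closed |= new
```

Computing the delta on the closed graphs means that statements which are merely implied in one version and explicit in the other do not show up as differences.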
Delta at Concept Level
• Works only for ontologies
• Groups of class/property definitions are serialized into individual graphs
• Corresponding graphs in the two versions are compared to each other
Concept Level Delta: example
Detecting Class Renaming
[Figure: class renaming example involving the class Sauterne]
Detecting Class Renaming
Input: Local names of entities in both diffs
Generate 3-gram sets for each entity
Compute 3-gram overlap between sets in additive and subtractive deltas
If overlap > 0.7, add (oldname, newname) to candidate set
Replace oldname in subtractive delta by newname
Check for presence of all modified statements in additive delta
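The steps above can be sketched in Python as follows. Several details are assumptions: "overlap" is taken to be the Jaccard coefficient of the two 3-gram sets, statements are tuples of local names, and a rename is accepted only if every rewritten statement appears in the additive delta. The function names and the `Sauternes` name in the usage note are hypothetical.

```python
Statement = tuple[str, str, str]


def trigrams(name: str) -> set[str]:
    """Character 3-grams of a local name."""
    return {name[i:i + 3] for i in range(len(name) - 2)}


def overlap(a: str, b: str) -> float:
    """Assumed overlap measure: Jaccard coefficient of the 3-gram sets."""
    ga, gb = trigrams(a), trigrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def detect_renames(subtractive: list[Statement], additive: list[Statement],
                   threshold: float = 0.7) -> list[tuple[str, str]]:
    """Return (oldname, newname) pairs that explain removed statements."""
    old_names = {term for st in subtractive for term in st}
    new_names = {term for st in additive for term in st}
    added = set(additive)
    renames = []
    for old in sorted(old_names - new_names):
        for new in sorted(new_names - old_names):
            if overlap(old, new) <= threshold:
                continue
            # Replace oldname by newname in the removed statements ...
            rewritten = [tuple(new if term == old else term for term in st)
                         for st in subtractive if old in st]
            # ... and accept only if all of them were added back.
            if all(st in added for st in rewritten):
                renames.append((old, new))
    return renames
```

With a removed statement `("Sauterne", "subClassOf", "LateHarvest")` and an added statement `("Sauternes", "subClassOf", "LateHarvest")`, the two names share six of seven distinct 3-grams, so the pair passes the 0.7 threshold and the rewritten statement is found in the additive delta.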
Data-set: Using Swoogle’s SW Wayback machine
• Swoogle caches multiple snapshots for each indexed semantic web document
• Labeling for versions: We extract such snapshots from Swoogle’s cache and label these pairs as versions
Evaluation: Pairs that Differ in Literal Content
• Features used for classification:
– LocalNameCosineSim
– CosineSim
– LocalNameNoLiteralJaccard
– LocalNameNoLiteralSimhash
• Training set from Swoogle archive included 806 positive pairs, and 806 negative pairs
Evaluation: Pairs that Differ in Literal Content
• Results of 10-fold stratified cross validation using a Naïve Bayes classifier:
Evaluation: Pairs that Differ in Literal Content
• Results of using a SVM with all of the features, instead of manually selecting features:
• Attribute relevance ranking:
Evaluation: Pairs that Differ in Base-URI
• Features for classification:
– CosineSim
– LocalNameCosineSim
– LocalNameNoLiteralJaccard
– LocalNameNoLiteralContainment
– OnlyLiteralContainment
– OnlyLiteralJaccard
• Training set contained 100 positive examples, and 100 negative examples
Evaluation: Pairs that Differ in Base-URI
• 10-fold cross validation using Naïve Bayes:
• 10-fold cross validation (SVM linear-kernel)
Evaluation: Pairs with a Versioning Relationship
• 124 training instances from Swoogle data-set
• Filtered highly dynamic pairs from consideration
Evaluation: Pairs with a Versioning Relationship
• Test dataset: 160 instances (50% positive, 50% negative)
• Classification results using SVM (linear kernel)
Correctness of Delta Computation
• For any two versions of a SW graph, it holds that $\Delta_x(K \rightarrow K')\,K \equiv K'$, i.e. applying the delta to K yields K'
• We check this condition programmatically for each delta generated
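A hedged Python sketch of such a check, assuming a delta is a pair of statement sets (subtractive, additive) and that applying it means removing the first set and adding the second. The function names are hypothetical, and plain set equality stands in for graph equivalence here.

```python
def apply_delta(k: set[str], subtractive: set[str], additive: set[str]) -> set[str]:
    """Apply a delta to version K: drop removed statements, add new ones."""
    return (k - subtractive) | additive


def delta_is_correct(k: set[str], k_prime: set[str],
                     subtractive: set[str], additive: set[str]) -> bool:
    """Check that the delta, applied to K, reproduces K'."""
    return apply_delta(k, subtractive, additive) == k_prime
```

A delta built as (K minus K', K' minus K) always passes this check, so a failure points at a bug in the delta generation rather than in the data.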
Conclusion
• Define text-based similarity metrics that characterize the relationship between semantic web graphs
• Evaluate the similarity metrics for three specific cases of similarity that we defined
• Generate deltas between pairs of SW graphs that may be two versions of the same graph
• Prototyped the techniques in a new system called Similis
Future Directions
• Scalability
• Content of Delta Generated
• Standard Ontologies to:
– Describe delta
– Describe the relationship between a pair of SW graphs
• Detecting direction of change between two versions