Text Based Similarity Metrics and Delta for Semantic Web Graphs

Krishnamurthy Koduvayur Viswanathan
Monday, June 28, 2010


Contributions

• Define text-based similarity metrics that characterize the relationship between semantic web graphs

• Evaluate the similarity metrics for three specific cases of similarity that we defined

• Generate a delta between pairs of SW graphs that may be two versions of the same graph

• Prototype the techniques in a new system called Similis

Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion


Motivation: Near Duplicate Detection for the SW?


Goals

• Explore the different ways in which two SW graphs may be similar to each other

• In particular, evaluate the specific use case of versioning relations between SW graphs

• Additionally, develop techniques to generate a delta between versions


Comparison with near duplicate text document detection

• In a text document:
– Order of the content is important
– The meaning of the text is not a part of the problem, just the textual encoding of the meaning

• For a Semantic Web Document (SWD), the order is not deterministic, i.e., equivalent SWDs may have different statement orderings

• Blank node identifiers are also non-deterministic


Semantic Web Document (SWD)

• RDF representation of a Semantic Web Graph
– Document based serialization of a SW graph on the web (ontology or data-file)
– Document based serialization of the result of a SPARQL query on a triple-store
– Document based serialization of structured metadata extracted from an HTML page using RDFa


Semantic Web Graph Similarity

• The archive or the Swoogle search engine (Ding et al. 2004) shows several examples of how ontologies and RDF documents evolve over time

• Kinds of similarity between two SW graphs:
– Same classes and properties used; differ only in literal content
– Differ only in the base-URIs of entities used
– Different versions of the same semantic web graph


Similarity in Classes and Properties

• Two semantic web graphs that differ only in the literal content


Different in Literal Content

<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "Eric Miller" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#mailbox> "mailto:[email protected]" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Dr" .

<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "John Doe" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#mailbox> "mailto:[email protected]" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Mr" .


Different only in base-URI


Different only in base-URI

<http://www.w3.org/2001/sw/WebOnt/guide-src/wine#ItalianRegion> .
_:g103 <http://www.w3.org/2002/07/owl#onProperty> <http://www.w3.org/2001/sw/WebOnt/guide-src/wine#locatedIn> .
_:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> _:g103 .
_:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .
_:g105 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Restriction> .
_:g105 <http://www.w3.org/2002/07/owl#hasValue> <http://www.w3.org/2001/sw/WebOnt/guide-src/wine#Dry> .
_:g105 <http://www.w3.org/2002/07/owl#onProperty> <http://www.w3.org/2001/sw/WebOnt/guide-src/wine#hasSugar> .

<http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#ItalianRegion> .
_:g103 <http://www.w3.org/2002/07/owl#onProperty> <http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#locatedIn> .
_:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> _:g103 .
_:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> .
_:g105 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Restriction> .
_:g105 <http://www.w3.org/2002/07/owl#hasValue> <http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#Dry> .
_:g105 <http://www.w3.org/2002/07/owl#onProperty> <http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#hasSugar> .


Versioning Relationship

• Two semantic web documents have a versioning relationship if they are variants of the same semantic web graph.

• Variants are created due to the dynamic nature of the web, i.e., content keeps getting modified
– Minor changes: spelling corrections, punctuation, etc.
– Major changes: affect the semantic content


Problem Definition

• Problem 1: Given a collection of semantic web graphs in the form of RDF documents, characterize the similarity between pairs into one or more of the three cases:
– Same classes and properties used, but differ only in the literal content
– Differ only in the base-URI used
– Are different versions of the same graph, i.e., have a versioning relationship


Problem Definition

• Problem 2: Generate a delta between pairs that have been identified as having a versioning relationship between them.


Approach

Input: Corpus of SWDs

Convert to n-triples format

Convert to canonical form

Generate Reduced Forms

Compute Text-Based Similarity Metrics

Build feature-vectors for each pair

Characterize similarity between pairs

Identify versions

Generate delta between versions


Convert to n-triples


Convert to Canonical Form

• Comparison methods may be affected by blank node identifiers and statement ordering

• Canonicalization assigns consistent IDs to blank nodes and orders the statements lexicographically.

• Transforms two semantically equivalent graphs into the same canonical representation

Based on: Carroll, J. J. 2003. Signing RDF graphs. In 2nd ISWC, volume 2870 of LNCS, 5–15. Springer.

Convert to Canonical Form

Input graph:

<person:John> <a:livesIn> _:x .
_:x <a:IsPartOf> "USA" .
<person:John> <a:likes> "cheese" .
_:x <a:hasCapital> _:y .

Blank nodes replaced by "~" and statements sorted (original identifiers kept as comments):

"~" <a:hasCapital> "~" . # _:x _:y
"~" <a:IsPartOf> "USA" . # _:x
<person:John> <a:likes> "cheese" .
<person:John> <a:livesIn> "~" . # _:x

BNode Table:

Old Blank Node Identifier    New Blank Node Identifier
_:y                          _:g1
_:x                          _:g2

Canonical form:

_:g2 <a:hasCapital> _:g1 .
_:g2 <a:IsPartOf> "USA" .
<person:John> <a:likes> "cheese" .
<person:John> <a:livesIn> _:g2 .
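The canonicalization steps on this slide (mask blank nodes, sort, relabel, sort again) can be sketched in Python. This is a minimal illustration, not the Similis implementation: the function name `canonicalize`, the regular expression for blank node labels, and the `_:gN` numbering order are assumptions, so the new labels it assigns may differ from those in the BNode table above.

```python
import re
from collections import Counter

# Assumption: blank node labels look like _:name. A real N-Triples parser
# would also have to ignore matches that occur inside literals.
BNODE = re.compile(r"_:[A-Za-z0-9]+")

def canonicalize(triples):
    """Return sorted N-Triples lines with consistently renamed blank nodes."""
    # Step 1: mask every blank node so that ordering ignores the labels.
    masked = [(BNODE.sub("~", t), t) for t in triples]
    # Step 2: refuse graphs with non-distinctive triples, i.e. triples
    # that become identical once their blank nodes are masked.
    counts = Counter(m for m, t in masked if BNODE.search(t))
    if any(c > 1 for c in counts.values()):
        raise ValueError("graph has non-distinctive triples")
    # Step 3: walk the triples in masked order and number the blank nodes
    # in order of first appearance.
    mapping = {}
    for _, original in sorted(masked):
        for old in BNODE.findall(original):
            mapping.setdefault(old, f"_:g{len(mapping) + 1}")
    # Step 4: rewrite the labels and sort the final statements.
    return sorted(BNODE.sub(lambda m: mapping[m.group()], t) for t in triples)
```

Two serializations of the same graph that differ only in statement order and blank node labels should yield the same list, which is what makes a plain text comparison of canonical forms meaningful.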


Limitation of the Algorithm: Non-Distinctive Triples

• The algorithm can only deal with graphs that do not have non-distinctive triples

• Non Distinctive Triples: The triples in the graph that cannot be uniquely identified when all the blank nodes are treated as equal


Graphs with Non-Distinctive Triples

• For a group of n non-distinct triples, there are n! ways of renaming the blank nodes

• For graphs with non-distinctive triples, a single unique canonical form does not exist

• To compare two graphs, compare each of the possible canonical forms for both graphs

• Number of comparisons: O(m!n!)

• Similis throws an exception when it finds a graph with multiple forms


Graphs with Non-Distinctive Triples

• Only a small percentage of SW graphs (13%) did not have a unique canonical form (measured on 1200 randomly collected SW documents)


Generating Reduced Forms

• The canonical form of each SW graph is broken down into a number of reduced forms

• These reduced forms are used to characterize the relationship between pairs of SW graphs

• The following is the anatomy of a triple:

Entity URI <http://www.w3.org/2001/sw/guide-src/wine#hasSugar>

Base URI <http://www.w3.org/2001/sw/guide-src/wine>

Local Name <hasSugar>


Only-Literals Reduced Form

• Contains only the literals from the original n-triples file.

• Lets us compare only the textual content within a graph, separated from the rest of the graph


No-Literals Reduced Form

• All the literals from the canonical form are replaced by an empty string

• Lets us compare only the classes and properties used, regardless of literal content


Local-Name Reduced Form

• The base-URI of every node in the canonical form is replaced by an empty string

• Lets us compare only the local names of the classes and properties used


Local-Name-No-Literal Reduced Form

• All the literals and the base-URI of every node are replaced by an empty string

• Lets us compare the non-literal content of two SW graphs
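The four reduced forms can be sketched with regular expressions over the canonical n-triples text. This is a rough illustration under stated assumptions: the function names and dictionary keys are hypothetical, literals are assumed to be plain double-quoted strings (language tags and datatypes are ignored), the local name is taken to be whatever follows the last "#" or "/", and an "empty string" literal is rendered as `""`.

```python
import re

# Assumption: literals are double-quoted strings and URIs sit in <...>.
LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')
URI = re.compile(r"<([^<>]*)>")

def local_name(match):
    """Keep only the part of a URI after the last '#' or '/'."""
    return "<" + re.split(r"[#/]", match.group(1))[-1] + ">"

def reduced_forms(canonical_lines):
    """Build the four reduced forms from canonical N-Triples lines."""
    text = "\n".join(canonical_lines)
    no_literals = LITERAL.sub('""', text)
    return {
        "only_literals": "\n".join(LITERAL.findall(text)),
        "no_literals": no_literals,
        "local_name": URI.sub(local_name, text),
        "local_name_no_literal": URI.sub(local_name, no_literals),
    }
```

Together with the canonical form itself, these give the five forms on which the metrics are computed.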


Similarity/Distance Metrics Used

• Cosine Similarity between SWD vectors

• Jaccard and Containment Metrics

• Hamming Distance between Simhash fingerprints


Computation of Pairwise Metrics

• Compute cosine similarity between the canonical and local-name forms of each pair in the collection
– If cosine similarity < 0.7, remove pair from further consideration
– Else, compute all other metrics for all the forms (5 forms * 3 metrics = 15 specific metrics)

• Total of 17 metrics computed (2 cosine similarities + 15 form-specific metrics)


Cosine Similarity Between Term Vectors

• Each SWD containing terms T_j = {t_1, t_2, …, t_n} is treated as a vector V_j = (γ_1 t_1, γ_2 t_2, …, γ_n t_n), where each γ_i is the weight associated with term t_i

• Non-blank, non-literal nodes are used as features, and Term Frequency (TF) is used as weight

• Two vectors for each SWD: one uses full entity URIs as features, other uses local-name of terms

• Indicates similarity in classes and properties
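As a concrete reading of the bullets above, the sketch below builds a term-frequency vector from the URIs in an n-triples text and computes the cosine similarity of two such vectors. The function names are hypothetical, and extracting terms with a regular expression (rather than a full RDF parser) is a simplifying assumption.

```python
import math
import re
from collections import Counter

URI = re.compile(r"<([^<>]*)>")

def term_vector(ntriples_text, local_names=False):
    """Term-frequency vector over URI nodes (blank nodes and literals skipped)."""
    terms = URI.findall(ntriples_text)
    if local_names:
        # Assumption: the local name follows the last '#' or '/'.
        terms = [re.split(r"[#/]", t)[-1] for t in terms]
    return Counter(terms)

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0
```

Two graphs that differ only in base-URI share no full-URI terms, so their full-URI cosine similarity is 0 while their local-name cosine similarity is close to 1; this is why both vectors are kept for each SWD.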


SW Document Vectors

Term Freq

<http://purl.org/dc/elements/1.1/title> 2

<http://purl.org/dc/elements/1.1/creator> 1

<http://purl.org/dc/elements/1.1/contributor> 1

<http://put-off.org> 1

Term Freq

<title> 2

<creator> 2

<contributor> 1

<put-off.org> 1


Jaccard and Containment

• Computed for all forms (five) for a candidate pair of SW graphs (5 * 2 = 10 metrics)

• Construct sets of character 4-grams for each document

• 4-grams are computed by running a four character-wide window over the text representation
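The slides do not spell out the formulas, so the sketch below assumes the standard set-based definitions: Jaccard is |A ∩ B| / |A ∪ B| and the containment of A in B is |A ∩ B| / |A|, where A and B are the sets of character 4-grams of the two documents. The function names are hypothetical.

```python
def char_ngrams(text, n=4):
    """Set of character n-grams from a sliding window of width n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b, n=4):
    """Size of the n-gram intersection divided by the size of the union."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0

def containment(a, b, n=4):
    """Fraction of the n-grams of a that also occur in b."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a) if grams_a else 1.0
```

Containment is asymmetric: a document that is wholly included in a larger one has containment 1.0 in one direction and a lower value in the other, while Jaccard treats both documents alike.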


Hamming Distance between Simhash Fingerprints

• Simhash fingerprints of similar documents differ in a small number of bit positions

• Tokenize documents into character 3-grams

• Compute simhash fingerprint for each document in pair (we implemented 128 bit fingerprints)

• Find Hamming Distance between the fingerprints

• Computed for all forms (five) for a candidate pair of SW graphs (5 * 1 = 5 metrics)
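A minimal simhash sketch following the steps above. The slides do not say which hash function Similis uses for the 3-grams; MD5 is used here only because it conveniently produces 128 bits, and that choice, like the function names, is an assumption.

```python
import hashlib

BITS = 128  # fingerprint width; matches the 128-bit MD5 digest used below

def simhash(text, n=3):
    """128-bit simhash fingerprint over character n-grams."""
    weights = [0] * BITS
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # Assumption: MD5 serves only as a convenient 128-bit hash.
        h = int.from_bytes(hashlib.md5(gram.encode("utf-8")).digest(), "big")
        for bit in range(BITS):
            # Each n-gram occurrence votes +1 or -1 on every bit position.
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set where the positive votes win.
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)

def hamming_distance(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")
```

Identical documents always have distance 0; the distance between two different documents lies between 0 and 128, and smaller values suggest more shared 3-grams.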


Classification

Similarity metrics computed for each candidate pair are assembled into a feature vector (FV), which is passed to a classifier for each similarity case:

• Naïve Bayes classifier: similarity in classes and properties

• Naïve Bayes/SVM classifier: difference only in base-URI

• SVM classifier: versioning relationship

Example feature vector used for determining versioning relationship


Computing Delta Between Two Versions

Pairs (Version1, Version2) identified by the SVM classifier as having a versioning relationship are passed to delta generation:

• Subtractive Delta = Version1 Except Version2

• Additive Delta = Version2 Except Version1

• Delta = Subtractive Delta together with Additive Delta
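Reading "Except" as set difference over statements, the two halves of the delta can be sketched as follows. The function name and the dictionary keys are hypothetical, and the inputs are assumed to be lists of canonical n-triples statements.

```python
def compute_delta(version1, version2):
    """Statement-level delta between two versions of a graph."""
    v1, v2 = set(version1), set(version2)
    return {
        "subtractive": sorted(v1 - v2),  # Version1 Except Version2
        "additive": sorted(v2 - v1),     # Version2 Except Version1
    }
```

Statements shared by both versions appear in neither half, so two identical versions produce an empty delta.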


Raw Delta

• Statement-by-statement comparison between canonical forms of the two SWDs

• Only local names of entities are compared


Delta After Deductive Closure

• SWGv1 → Compute deductive closure → Canonicalize

• SWGv2 → Compute deductive closure → Canonicalize

• Both canonical forms → Generate Raw Delta


Delta After Deductive Closure

• If O is a set of propositions, p ∈ O and p ⊨ q, then q ∈ O
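To illustrate why the delta is taken after closure, the sketch below computes a closure under a single inference rule, transitivity of subClassOf, by iterating to a fixed point. This is only an illustration: the slides do not state which reasoner or rule set Similis uses, and the tuple representation of triples and the function name are assumptions.

```python
def subclass_closure(triples):
    """Close a set of (subject, predicate, object) triples under
    transitivity of subClassOf (the only rule assumed here)."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(closed):
            for (c, p2, d) in list(closed):
                inferred = (a, "subClassOf", d)
                if p1 == p2 == "subClassOf" and b == c and inferred not in closed:
                    closed.add(inferred)
                    changed = True
    return closed
```

If one version states an entailed triple explicitly and the other leaves it implicit, the raw delta reports a difference, whereas the delta between the two closures is empty.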


Delta at Concept Level


Delta at Concept Level

• Works only for ontologies

• Groups of class/property definitions are serialized into individual graphs

• Corresponding graphs in the two versions are compared to each other


Concept Level Delta: example


Detecting Class Renaming

[Figure: class renaming example using the class Sauterne]

Detecting Class Renaming

Input: Local names of entities in both diffs

Generate 3-gram sets for each entity

Compute 3-gram overlap between sets in additive and subtractive deltas

If overlap > 0.7, add (oldname, newname) to candidate set

Replace oldname in subtractive delta by newname

Check for presence of all modified statements in additive delta
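The renaming steps above can be sketched as follows. The slides do not define "3-gram overlap", so Jaccard over the 3-gram sets is assumed here; the function names are hypothetical, and statements are assumed to be in local-name form with entities written as `<name>`.

```python
import re

def trigrams(name):
    """Set of character 3-grams of an entity's local name."""
    return {name[i:i + 3] for i in range(len(name) - 2)}

def overlap(a, b):
    """Assumed overlap measure: Jaccard over the two 3-gram sets."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

def entity_names(statements):
    """Local names of all <...> entities in the statements."""
    return set(re.findall(r"<([^<>]*)>", " ".join(statements)))

def detect_renames(subtractive, additive, threshold=0.7):
    """Return (oldname, newname) pairs that account for the deltas."""
    old_names = entity_names(subtractive) - entity_names(additive)
    new_names = entity_names(additive) - entity_names(subtractive)
    renames = []
    for old in sorted(old_names):
        for new in sorted(new_names):
            if overlap(old, new) <= threshold:
                continue
            # Replace oldname by newname in the subtractive delta ...
            rewritten = [s.replace(f"<{old}>", f"<{new}>")
                         for s in subtractive if f"<{old}>" in s]
            # ... and check that every modified statement was added back.
            if rewritten and all(s in additive for s in rewritten):
                renames.append((old, new))
    return renames
```

For a hypothetical rename of Sauterne to Sauternes, the 3-gram sets share 6 of 7 distinct 3-grams (about 0.86), which passes the 0.7 threshold; the pair is reported only if every removed Sauterne statement reappears in the additive delta with the new name.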


Detecting Class Renaming


Data-set: Using Swoogle’s SW Wayback machine

• Swoogle caches multiple snapshots for each indexed semantic web document

• Labeling for versions: We extract such snapshots from Swoogle’s cache and label these pairs as versions


Evaluation: Pairs that Differ in Literal Content

• Features used for classification:
– LocalNameCosineSim
– CosineSim
– LocalNameNoLiteralJaccard
– LocalNameNoLiteralSimhash

• Training set from Swoogle archive included 806 positive pairs, and 806 negative pairs


Evaluation: Pairs that Differ in Literal Content

• Results of 10-fold stratified cross validation using a Naïve Bayes classifier:


Evaluation: Pairs that Differ in Literal Content

• Results of using a SVM with all of the features, instead of manually selecting features:

• Attribute relevance ranking:


Evaluation: Pairs that Differ in Base-URI

• Features for classification:
– CosineSim
– LocalNameCosineSim
– LocalNameNoLiteralJaccard
– LocalNameNoLiteralContainment
– OnlyLiteralContainment
– OnlyLiteralJaccard

• Training set contained 100 positive examples, and 100 negative examples


Evaluation: Pairs that Differ in Base-URI

• 10-fold cross validation using Naïve Bayes:

• 10-fold cross validation (SVM linear-kernel)


Evaluation: Pairs with a Versioning Relationship

• 124 training instances from Swoogle data-set

• Filtered highly dynamic pairs from consideration


Evaluation: Pairs with a Versioning Relationship

• Test dataset: 160 instances (50% positive, 50% negative)

• Classification results using SVM (linear kernel)


Correctness of Delta Computation

• For any two versions of a SW graph, it holds that Δx(K → K’)K ≡ K’

• We check this condition programmatically for each delta generated
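One way to read the condition above is that applying the delta to the old version K must reproduce the new version K’. A minimal sketch of such a check, assuming both versions are sets of canonical statements and the delta consists of a subtractive and an additive part (the function names are hypothetical):

```python
def apply_delta(old, subtractive, additive):
    """Apply a delta to the old version: drop removed statements, add new ones."""
    return (set(old) - set(subtractive)) | set(additive)

def delta_is_correct(old, new, subtractive, additive):
    """True if applying the delta to the old version yields the new version."""
    return apply_delta(old, subtractive, additive) == set(new)
```

A delta built from the two set differences passes this check by construction, so a failure would point to a bug in canonicalization or in later delta processing rather than in the set arithmetic.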


Conclusion

• Define text-based similarity metrics that characterize the relationship between semantic web graphs

• Evaluate the similarity metrics for three specific cases of similarity that we defined

• Generate deltas between pairs of SW graphs that may be two versions of the same graph

• Prototype the techniques in a new system called Similis


Future Directions

• Scalability

• Content of Delta Generated

• Standard Ontologies to:
– Describe delta
– Describe the relationship between a pair of SW graphs

• Detecting direction of change between two versions
