Property Matching and Query Expansion on Linked Data Using Kullback-Leibler Divergence
Property Matching and Query Expansion on Linked Data Using Kullback-Leibler Divergence
Sean Golliher, Nathan Fortier, Logan Perreault
December 12, 2013
1 / 25
Property Matching Problem
Databases with different properties:
2 / 25
def: Query Expansion
Query expansion (QE) is the process of reformulating a seed query to improve retrieval performance in information retrieval operations.
3 / 25
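As a toy illustration of the definition above (not from the paper), the sketch below expands a seed query using a hypothetical synonym table; the SYNONYMS mapping and expand_query are illustrative names only:

```python
# Minimal query-expansion sketch: augment a seed query with related terms.
# The synonym table is a hypothetical stand-in for a real thesaurus.
SYNONYMS = {
    "actor": ["performer", "cast member"],
    "film": ["movie"],
}

def expand_query(seed_query):
    """Return the seed terms plus any known related terms."""
    expanded = []
    for term in seed_query.split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("film actor"))
# ['film', 'movie', 'actor', 'performer', 'cast member']
```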
Societal Cloud
4 / 25
Cloud Diagram (TRIZ Problem Solving)
5 / 25
Cloud Diagram Broken
6 / 25
Property Matching Problem
How do we find all actors in both databases?
Don’t want to manually inspect all databases
Can we use SPARQL query language to infer across all datasets?
SELECT ?p WHERE { ?s ?p ?o }
Can only match total sizes of returned triple sets
7 / 25
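The query above returns every predicate used in a dataset. A rough sketch of the idea in Python (with hypothetical triples, not real SPARQL) shows why the returned predicate sets alone are hard to align:

```python
# Toy stand-in for SELECT ?p WHERE { ?s ?p ?o }: list the distinct
# predicates of each dataset. The triples below are hypothetical examples.
db1 = [("person1", "actedIn", "film1"), ("person1", "name", "Alice")]
db2 = [("person2", "starredIn", "film2"), ("person2", "label", "Bob")]

def predicates(triples):
    """Return the set of predicates appearing in (s, p, o) triples."""
    return {p for (_s, p, _o) in triples}

print(sorted(predicates(db1)))  # ['actedIn', 'name']
print(sorted(predicates(db2)))  # ['label', 'starredIn']
# Nothing in the names alone says that actedIn and starredIn match.
```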
Original Bayesian Approach
Problems with Bayesian Approach
Had to create, and track, a large vocabulary for training
Smoothing issues with very sparse text
Underflow issues: small confidence values
Complexity of the likelihood was growing: n different features in feature set X and c classes, plus tunable parameters
8 / 25
KL-Divergence
Original paper from 1951 entitled “On Information and Sufficiency”
Also referred to as "relative entropy"
A system gains entropy when it moves to a state with more possible arrangements; for example, a liquid becoming a gas.
Used in a 2003 paper on text categorization: "Using KL-Distance for Text Categorization"
Elegant and efficient method for plagiarism detection
9 / 25
KL-Divergence
Measure of divergence of information between two distributions:
D(P \parallel Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}
Not symmetric
10 / 25
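The formula above transcribes almost directly into code. A minimal sketch with two made-up distributions, checking the asymmetry noted on the slide:

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum over x of P(x) * log(P(x) / Q(x))."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.9, "b": 0.1}

print(kl_divergence(P, Q))  # ~0.511
print(kl_divergence(Q, P))  # ~0.368 -- D(P||Q) != D(Q||P): not symmetric
```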
KL-Divergence Example
11 / 25
KL-Divergence Example
Table: Generic Vocabularies Generated by Fixing on Predicates

d1         d2         d3
---------  ---------  ---------
subject1   subject3   subject1
object1    object4    object1
object2    object2    subject2
subject2   subject4   object3
object3    object3    object3
ex: D(d_1 \parallel d_2) = \frac{1}{5} \log \frac{1/5}{\varepsilon} + \frac{1}{5} \log \frac{1/5}{\varepsilon} + \cdots + \frac{2}{5} \log \frac{2/5}{1/4}
tf(subject1) is 1/5 in d1 and 0 in d2; an ε value is substituted for the 0 for now
12 / 25
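The ε trick on this slide can be sketched directly: build term-frequency distributions over the shared vocabulary, substituting a small ε for absent terms. The documents below follow d1 and d2 as reconstructed in the table, and the ε value is an arbitrary choice:

```python
import math

EPSILON = 1e-6  # arbitrary small value for terms absent from a document

def tf_distribution(doc, vocab):
    """Term frequencies over vocab, with EPSILON for terms not in doc."""
    return {t: doc.count(t) / len(doc) if t in doc else EPSILON
            for t in vocab}

d1 = ["subject1", "object1", "object2", "subject2", "object3"]
d2 = ["subject3", "object4", "object2", "subject4", "object3"]
vocab = set(d1) | set(d2)

p, q = tf_distribution(d1, vocab), tf_distribution(d2, vocab)
print(p["subject1"], q["subject1"])  # 0.2 1e-06: tf 1/5 in d1, epsilon in d2

divergence = sum(p[t] * math.log(p[t] / q[t]) for t in vocab)
print(divergence)  # large, since d1 and d2 share only two terms
```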
Algorithm Description
13 / 25
Formal Problem Statement
Given:
Two databases DB1 and DB2
A predicate p1 ∈ DB1
An object type S1 where some triple "s p1 o" exists in DB1,
where s ∈ S1
Find a predicate p2 ∈ DB2 where p2 is equivalent to p1
14 / 25
High Level Description
Create a document d1 containing labels of all objects linked by p1
Find an object type S2 ∈ DB2 where S1 is equivalent to S2
For each predicate p2 used by S2, create a document d2 containing labels of all objects linked by p2
Remove stop words and language tags from d1 and d2
For each document compute the normalized KL-Divergence, KLD*(d1, d2)
Return the predicate corresponding to the document with the lowest KL-Divergence
15 / 25
Algorithm 1 FindPredicate(DB1,DB2, p1,S1)
Create document d1 containing labels of all objects linked by p1
Find an object type S2 ∈ DB2 where S1 is equivalent to S2
for each predicate p2 used by S2 do
    Create document d2 containing labels of all objects linked by p2
end for
Remove stop words and language tags from d1 and d2
min ← 1
for each predicate pi used by S2 do
    k ← KLD*(d1, di)
    if k < min then
        min ← k
        pmap ← pi
    end if
end for
return pmap
16 / 25
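Algorithm 1's selection loop can be sketched compactly in Python, assuming the documents have already been built and cleaned. The scorer passed in below is a crude stand-in for illustration, not the paper's KLD* measure:

```python
def find_predicate(d1, candidate_docs, kld_norm):
    """Pick the DB2 predicate whose document minimizes KLD*(d1, d_i).

    d1             -- labels of objects linked by p1 in DB1
    candidate_docs -- {predicate p_i in DB2: document d_i of linked labels}
    kld_norm       -- function computing the normalized divergence KLD*
    """
    best_p, best_k = None, 1.0  # the slides initialize min to 1
    for p_i, d_i in candidate_docs.items():
        k = kld_norm(d1, d_i)
        if k < best_k:
            best_p, best_k = p_i, k
    return best_p

# Crude stand-in scorer: fraction of d1's labels missing from d_i.
def toy_score(d1, d_i):
    return sum(label not in d_i for label in d1) / len(d1)

d1 = ["Alice", "Bob"]
candidates = {"starredIn": ["Alice", "Bob", "Carol"], "bornIn": ["London"]}
print(find_predicate(d1, candidates, toy_score))  # starredIn
```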
Computing KL-Divergence
KL-Divergence is computed as
KLD(d_i, d_j) = \sum_{k \in V} \left( P(t_k, d_i) - P(t_k, d_j) \right) \times \log \frac{P(t_k, d_i)}{P(t_k, d_j)} \quad (1)
Where
P(t_k, d_i) = \frac{tf(t_k, d_i)}{\sum_{x \in d_i} tf(t_x, d_i)} \quad (2)
If tk does not occur in di then P(tk , di )← ε
KL-Divergence is then normalized as follows:
KLD^*(d_i, d_j) = \frac{KLD(d_i, d_j)}{KLD(d_i, 0)} \quad (3)
17 / 25
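Equations (1)-(3) translate almost line-for-line into code. In this sketch the ε value is an arbitrary choice, and the empty document plays the role of the 0 in equation (3):

```python
import math

EPSILON = 1e-6  # small value used when a term does not occur (eq. 2)

def prob(term, doc):
    """P(t_k, d_i): relative term frequency, or EPSILON if absent."""
    return doc.count(term) / len(doc) if term in doc else EPSILON

def kld(d_i, d_j, vocab):
    """Equation (1): sum of (P_i - P_j) * log(P_i / P_j) over the vocabulary."""
    return sum((prob(t, d_i) - prob(t, d_j))
               * math.log(prob(t, d_i) / prob(t, d_j)) for t in vocab)

def kld_norm(d_i, d_j, vocab):
    """Equation (3): normalize by the divergence from the empty document."""
    return kld(d_i, d_j, vocab) / kld(d_i, [], vocab)

d1 = ["subject1", "object1", "object2", "subject2", "object3"]
d2 = ["subject3", "object4", "object2", "subject4", "object3"]
vocab = set(d1) | set(d2)

print(kld(d1, d1, vocab))       # 0.0: identical documents do not diverge
print(kld_norm(d1, d2, vocab))  # positive for the dissimilar d2
```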
Algorithm 2 tf (tk , di )
tf ← 0
for each term tx in di do
    if sim(tk, tx) > τ then
        tf ← tf + 1
    end if
end for
return tf
18 / 25
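Algorithm 2 counts soft matches rather than exact occurrences. A sketch using difflib's ratio as a stand-in for the similarity function sim (not specified on the slide), with an arbitrary threshold τ:

```python
from difflib import SequenceMatcher

TAU = 0.8  # similarity threshold tau; an arbitrary choice for this sketch

def sim(a, b):
    """String similarity in [0, 1]; difflib stands in for the paper's sim."""
    return SequenceMatcher(None, a, b).ratio()

def tf(t_k, d_i, tau=TAU):
    """Algorithm 2: count terms of d_i whose similarity to t_k exceeds tau."""
    return sum(1 for t_x in d_i if sim(t_k, t_x) > tau)

print(tf("color", ["colour", "color", "tint"]))  # 2: 'colour' and 'color' match
```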
Experimental Results
19 / 25
Questions?
25 / 25