Property Matching and Query Expansion on Linked Data Using Kullback-Leibler Divergence
Property Matching and Query Expansion on Linked Data Using Kullback-Leibler Divergence
Sean Golliher, Nathan Fortier, Logan Perreault
December 12, 2013
1 / 25
Property Matching Problem
Databases with different properties:
2 / 25
def: Query Expansion
Query expansion (QE) is the process of reformulating a seed query to improve retrieval performance in information retrieval operations.
3 / 25
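As a toy illustration of the definition above (not from the paper), the sketch below expands a seed query using a hypothetical synonym table; the SYNONYMS mapping and expand_query are illustrative names only:

```python
# Minimal query-expansion sketch: augment a seed query with related terms.
# The synonym table is a hypothetical stand-in for a real thesaurus.
SYNONYMS = {
    "actor": ["performer", "cast member"],
    "film": ["movie"],
}

def expand_query(seed_query):
    """Return the seed terms plus any known related terms."""
    expanded = []
    for term in seed_query.split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("film actor"))
# ['film', 'movie', 'actor', 'performer', 'cast member']
```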
Societal Cloud
4 / 25
Cloud Diagram (TRIZ Problem Solving)
5 / 25
Cloud Diagram Broken
6 / 25
Property Matching Problem
How do we find all actors in both databases?
Don’t want to manually inspect all databases
Can we use SPARQL query language to infer across all datasets?
SELECT ?p WHERE { ?s ?p ?o }
Can only match total sizes of returned triple sets
7 / 25
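The query above returns every predicate used in a dataset. A rough sketch of the idea in Python (with hypothetical triples, not real SPARQL) shows why the returned predicate sets alone are hard to align:

```python
# Toy stand-in for SELECT ?p WHERE { ?s ?p ?o }: list the distinct
# predicates of each dataset. The triples below are hypothetical examples.
db1 = [("person1", "actedIn", "film1"), ("person1", "name", "Alice")]
db2 = [("person2", "starredIn", "film2"), ("person2", "label", "Bob")]

def predicates(triples):
    """Return the set of predicates appearing in (s, p, o) triples."""
    return {p for (_s, p, _o) in triples}

print(sorted(predicates(db1)))  # ['actedIn', 'name']
print(sorted(predicates(db2)))  # ['label', 'starredIn']
# Nothing in the names alone says that actedIn and starredIn match.
```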
Original Bayesian Approach
Problems with Bayesian Approach
Had to create, and track, a large vocabulary for training
Smoothing issues with very sparse text
Underflow issues: small confidence values
Complexity of the likelihood was growing: n different features in feature set X and c classes, plus tunable parameters
8 / 25
KL-Divergence
Original paper from 1951 entitled “On Information and Sufficiency”
Also referred to as "relative entropy"
A system gains entropy when it moves to a state with more possible arrangements; for example, a liquid becoming a gas.
Used in a 2003 paper on text categorization: "Using KL-Distance for Text Categorization"
Elegant and efficient method for plagiarism detection
9 / 25
KL-Divergence
Measure of divergence of information between two distributions:
D(P \parallel Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}
Not symmetric
10 / 25
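The formula above transcribes almost directly into code. A minimal sketch with two made-up distributions, checking the asymmetry noted on the slide:

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum over x of P(x) * log(P(x) / Q(x))."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.9, "b": 0.1}

print(kl_divergence(P, Q))  # ~0.511
print(kl_divergence(Q, P))  # ~0.368 -- D(P||Q) != D(Q||P): not symmetric
```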
KL-Divergence Example
11 / 25
KL-Divergence Example
Table: Generic Vocabularies Generated by Fixing on Predicates

d1         d2         d3
---------  ---------  ---------
subject1   subject3   subject1
object1    object4    object1
object2    object2    subject2
subject2   subject4   object3
object3    object3    object3
ex: D(d_1 \parallel d_2) = \frac{1}{5} \log \frac{1/5}{\varepsilon} + \frac{1}{5} \log \frac{1/5}{\varepsilon} + \cdots + \frac{2}{5} \log \frac{2/5}{1/4}
tf(subject1) is 1/5 in d1 and 0 in d2; an ε value is substituted for the 0 for now
12 / 25
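The ε trick on this slide can be sketched directly: build term-frequency distributions over the shared vocabulary, substituting a small ε for absent terms. The documents below follow d1 and d2 as reconstructed in the table, and the ε value is an arbitrary choice:

```python
import math

EPSILON = 1e-6  # arbitrary small value for terms absent from a document

def tf_distribution(doc, vocab):
    """Term frequencies over vocab, with EPSILON for terms not in doc."""
    return {t: doc.count(t) / len(doc) if t in doc else EPSILON
            for t in vocab}

d1 = ["subject1", "object1", "object2", "subject2", "object3"]
d2 = ["subject3", "object4", "object2", "subject4", "object3"]
vocab = set(d1) | set(d2)

p, q = tf_distribution(d1, vocab), tf_distribution(d2, vocab)
print(p["subject1"], q["subject1"])  # 0.2 1e-06: tf 1/5 in d1, epsilon in d2

divergence = sum(p[t] * math.log(p[t] / q[t]) for t in vocab)
print(divergence)  # large, since d1 and d2 share only two terms
```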
Algorithm Description
13 / 25
Formal Problem Statement
Given:
Two databases DB1 and DB2
A predicate p1 ∈ DB1
An object type S1 where some triple "s p1 o" exists in DB1,
where s ∈ S1
Find a predicate p2 ∈ DB2 where p2 is equivalent to p1
14 / 25
High Level Description
Create a document d1 containing labels of all objects linked by p1
Find an object type S2 ∈ DB2 where S1 is equivalent to S2
For each predicate p2 used by S2, create a document d2 containing labels of all objects linked by p2
Remove stop words and language tags from d1 and d2
For each document compute the normalized KL-Divergence, KLD*(d1, d2)
Return the predicate corresponding to the document with the lowest KL-Divergence
15 / 25
Algorithm 1 FindPredicate(DB1,DB2, p1,S1)
Create document d1 containing labels of all objects linked by p1
Find an object type S2 ∈ DB2 where S1 is equivalent to S2
for each predicate p2 used by S2 do
    Create document d2 containing labels of all objects linked by p2
end for
Remove stop words and language tags from d1 and d2
min ← 1
for each predicate pi used by S2 do
    k ← KLD*(d1, di)
    if k < min then
        min ← k
        pmap ← pi
    end if
end for
return pmap
16 / 25
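Algorithm 1's selection loop can be sketched compactly in Python, assuming the documents have already been built and cleaned. The scorer passed in below is a crude stand-in for illustration, not the paper's KLD* measure:

```python
def find_predicate(d1, candidate_docs, kld_norm):
    """Pick the DB2 predicate whose document minimizes KLD*(d1, d_i).

    d1             -- labels of objects linked by p1 in DB1
    candidate_docs -- {predicate p_i in DB2: document d_i of linked labels}
    kld_norm       -- function computing the normalized divergence KLD*
    """
    best_p, best_k = None, 1.0  # the slides initialize min to 1
    for p_i, d_i in candidate_docs.items():
        k = kld_norm(d1, d_i)
        if k < best_k:
            best_p, best_k = p_i, k
    return best_p

# Crude stand-in scorer: fraction of d1's labels missing from d_i.
def toy_score(d1, d_i):
    return sum(label not in d_i for label in d1) / len(d1)

d1 = ["Alice", "Bob"]
candidates = {"starredIn": ["Alice", "Bob", "Carol"], "bornIn": ["London"]}
print(find_predicate(d1, candidates, toy_score))  # starredIn
```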
Computing KL-Divergence
KL-Divergence is computed as
KLD(d_i, d_j) = \sum_{k \in V} \left( P(t_k, d_i) - P(t_k, d_j) \right) \times \log \frac{P(t_k, d_i)}{P(t_k, d_j)} \quad (1)
Where
P(t_k, d_i) = \frac{tf(t_k, d_i)}{\sum_{x \in d_i} tf(t_x, d_i)} \quad (2)
If tk does not occur in di then P(tk , di )← ε
KL-Divergence is then normalized as follows:
KLD^*(d_i, d_j) = \frac{KLD(d_i, d_j)}{KLD(d_i, 0)} \quad (3)
17 / 25
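Equations (1)-(3) translate almost line-for-line into code. In this sketch the ε value is an arbitrary choice, and the empty document plays the role of the 0 in equation (3):

```python
import math

EPSILON = 1e-6  # small value used when a term does not occur (eq. 2)

def prob(term, doc):
    """P(t_k, d_i): relative term frequency, or EPSILON if absent."""
    return doc.count(term) / len(doc) if term in doc else EPSILON

def kld(d_i, d_j, vocab):
    """Equation (1): sum of (P_i - P_j) * log(P_i / P_j) over the vocabulary."""
    return sum((prob(t, d_i) - prob(t, d_j))
               * math.log(prob(t, d_i) / prob(t, d_j)) for t in vocab)

def kld_norm(d_i, d_j, vocab):
    """Equation (3): normalize by the divergence from the empty document."""
    return kld(d_i, d_j, vocab) / kld(d_i, [], vocab)

d1 = ["subject1", "object1", "object2", "subject2", "object3"]
d2 = ["subject3", "object4", "object2", "subject4", "object3"]
vocab = set(d1) | set(d2)

print(kld(d1, d1, vocab))       # 0.0: identical documents do not diverge
print(kld_norm(d1, d2, vocab))  # positive for the dissimilar d2
```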
Algorithm 2 tf (tk , di )
tf ← 0
for each term tx in di do
    if sim(tk, tx) > τ then
        tf ← tf + 1
    end if
end for
return tf
18 / 25
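Algorithm 2 counts soft matches rather than exact occurrences. A sketch using difflib's ratio as a stand-in for the similarity function sim (not specified on the slide), with an arbitrary threshold τ:

```python
from difflib import SequenceMatcher

TAU = 0.8  # similarity threshold tau; an arbitrary choice for this sketch

def sim(a, b):
    """String similarity in [0, 1]; difflib stands in for the paper's sim."""
    return SequenceMatcher(None, a, b).ratio()

def tf(t_k, d_i, tau=TAU):
    """Algorithm 2: count terms of d_i whose similarity to t_k exceeds tau."""
    return sum(1 for t_x in d_i if sim(t_k, t_x) > tau)

print(tf("color", ["colour", "color", "tint"]))  # 2: 'colour' and 'color' match
```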
Experimental Results
19 / 25
Questions?
25 / 25