Keyword Search in RDF Databasescgi.di.uoa.gr/.../files/dissertation_msc_presentation.pdfMSc...

Keyword Search in RDF Databases

Charalampos S. Nikolaoucharnik@di.uoa.gr

Department of Informatics & TelecommunicationsUniversity of Athens

MSc Dissertation PresentationApril 15, 2011

Outline

Background

User Requirements

Our Approach

Implementation

Experimental Evaluation

Other Directions to Keyword Search

Conclusions & Future Work

Background

Keyword Search on (Semi-)Structured data

Data modelGraph-based data model

I Schema-aware

I Schema-agnostic (comeswith specialized indexes)

Exploration Algorithms

I Backward Expansion

I Bidirectional Expansion

I Variants of the above

Structure of the Answer

I Trees

I Graphs

I Nodes/Entities

I (Multi-)Relations

Ranking/Scoring of Answers

I TF/IDF and other morecomplex measures from IR

I Node/Edge weights

I Page Rank like,Hub/Authority nodes

I Path length

I Number of nodes/edges

I Schema-aware

I Trees

I Graphs

I Nodes/Entities

I (Multi-)Relations

I Node/Edge weights

I Path length

I Schema-aware

I Trees

I Graphs

I Nodes/Entities

I (Multi-)Relations

I Node/Edge weights

I Path length

I Schema-aware

I Trees

I Graphs

I Nodes/Entities

I (Multi-)Relations

I Node/Edge weights

I Path length

I Schema-aware X

I Variants of the above X

I Trees

I Graphs

I Nodes/Entities X

I (Multi-)Relations

I TF/IDF and other morecomplex measures from IRX

I Node/Edge weights X

I Page Rank like,Hub/Authority nodes X

I Path length X

I Number of nodes/edges X

User Requirements

Requirements

Use CaseHistorians interested in the evolution of Biotechnology andRenewable Energy (see Papyrus Project)

1. Search for information about certain concepts, facts, events

2. Search may involve temporal restrictions

3. Source of information may contain indefinite temporalinformation

Requirements

1. Search for information about certain concepts, facts, eventse.g., stem cells and public opinion

Requirements

2. Search may involve temporal restrictionse.g., evolution of biotechnology in 20th century

Requirements

3. Source of information may contain indefinite temporalinformatione.g., “around 7000 BC biotechnology involved brewing beer,fermenting wine and baking bread with help of yeast”

Our Approach

Data Model

I Extension of the temporal RDF data model with indefinitetime intervals for the validity of a triple: (s,p,o)[i ]

I Validity of a resource r : (r ,subclass,Resource)[i ]

I Indefinite time intervals (set I ): (s1,s2,e1,e2), wheres1,s2,e1,e2 are natural numbers

s1 s2 e1 e2

I Indefinite time point: (s,e,s,e)

Data Model

s1 s2 e1 e2

Data Model

s1 s2 e1 e2

Query Language

I Keyword-based language extended with temporal constraintsexpressed in a controlled natural language

I Temporal constraints (set TR):I Allen’s temporal relations: before, after, meets, met by,overlaps, overlapped by, starts, started by, finishes,finished by, during, contains

I Other lexical forms (formal/informal) used in speech andwriting

I Time intervals: [s1− s2,e3− e4], with each si (ei ) being ofthe form yyyy/mm/dd

DefinitionA query is a pair (KL,φ). The expression KL is a list of keywordsand φ is a finite conjunction of formulas of the form R c , whereR ∈ TR and c ∈ I

Query Language

Example

DatabaseBerlin has been a city since some point in the 12th century

(ex:city, rdf:type, rdfs:Class).

(ex:city, rdfs:label, "City").

(ex:berlin, foaf:name, "Berlin").

(ex:berlin, rdf:type, ex:city)[1100 - 1190, 2011].

Example

Information needs (1)

I would like information about cities during the period 1200 – 1210

city during [1200, 1210]

Example

AnswerThere is such a city named Berlin

Example

AnswerThere is possibly a city named Berlin

Example

The ideal answerThe answer to both questions should include the individual“Berlin” and the class “City”, but in the second case these twoentities should have lower score to convey the notion of possibility

Data Structures

Data graph

The underlying RDF graph

Schema graph

A summary of the data graph, containing data-driven schemainformation

Query graph

Super graph of schema graph containing elements from the RDFgraph

Data Structures

Data graph

Schema graph

Query graph

Data Structures

Data graph

Schema graph

Query graph

Data graph

Subject Predicate Object

pro1 type Projectpro2 type Projectpro1 name Papyruspub1 type Publicationpub1 author res1pub1 author res2pub1 year 2010pub2 type Publicationres1 type Researcherres2 type Researcheruniv1 name DI&Tres2 name Y. Ioannidisres1 name M. Koubarakisres1 worksAt univ1univ1 type Universityuniv2 type University

University subclass AgentAgent subclass Resourcepub1 hasProject pro1

subclass

subclas

subclasssubclass

hasProject

author

resres

Project

Researcher

University

Publication

Person Resource

author

ManolisKoubarakis

YannisIoannidis

namePapyrus name

typeyear

Figure: a) RDF triples b) Data graph

Schema and Query graphs

Project Researcher

Publication Person

University

Resource

subclass

subclas

subclass

subcla

worksAt

hasProject au

2010 Resource

Researcher

Person

University

hasProject

author

subcla

worksAt

subclass

subclas

Project

Publication

ManolisKoubarakis

Figure: a) schema graph

b) query graph for koubarakis publications

during 2010

Schema and Query graphs

Project Researcher

Publication Person

University

Resource

subclass

subclas

subclass

subcla

worksAt

hasProject au

2010 Resource

Researcher

Person

University

hasProject

author

subcla

worksAt

subclass

subclas

Project

Publication

ManolisKoubarakis

Figure: a) schema graph b) query graph for koubarakis publications

during 2010

Keyword Search Algorithm

Query processing in 4 phases:

1. Keyword Interpretation (KI): Interpret keywords as RDF graphelements (keyword elements) and construct the query graph

2. Graph Exploration (GE): Explore the query graph startingfrom keyword elements for finding top-k subgraphs

3. Query Mapping (QM): Map top-k subgraphs to SPARQLqueries and evaluate them

4. Entity Transformation (ET): Transform and rank entitiesappearing in subgraphs and results from query evaluation toentities

Running query

koubarakis publications during 2010

Running query

Keyword Interpretationkoubarakis publications during 2010

Project Researcher

Publication Person

University

Resourcesubclass

subclas

subclass

subcla

worksAt

hasProject au

Keyword Interpretationkoubarakis publications during 2010

2010 Resource

Researcher

Person

University

hasProject

author

subcla

worksAt

subclass

subclas

Project

Publication

ManolisKoubarakis

Graph Explorationkoubarakis publications during 2010

2010 Resource

Researcher

Person

University

hasProject

author

subcla

worksAt

subclass

subclas

Project

Publication

ManolisKoubarakis

Graph Explorationkoubarakis publications during 2010

2010 Resource

Researcher

Person

University

hasProject

author

subclass

worksAt

subclass

subclas

Project

Publicat ion

ManolisKoubarakis

Query Mappingkoubarakis publications during 2010

2010 Resource

Researcher

Person

University

hasProject

author

subclass

worksAt

subclass

subclas

Project

Publicat ion

ManolisKoubarakis

Researcherauth

Publicat ion

ManolisKoubarakis

Researcherauth

Publicat ion

ManolisKoubarakis

Conjunctive query

name(x,”Koubarakis”)∧type(x,Researcher)∧author(y,x)∧year(y,”2010”)∧type(y,Publication)

SPARQL Query

SELECT ?x ?y ?xl ?yl

WHERE {

?x rdf:type ex:Researcher .

?x foaf:name "Manolis Koubarakis" .

?y ex:author ?x .

?x rdf:type ex:Publication .

?x ex:year "2010" .

?x rdfs:label ?xl .

?y rdfs:label ?yl .

Table: SPARQL result

?x ?y ?xl ?ylres1 pub1 “Manolis Koubarakis”

Entity Transformationkoubarakis publications during 2010

Table: Answer

Rank Entity1 “Manolis Koubarakis”2 Publication3 pub14 Researcher

How ranking is performed?

Entity Transformationkoubarakis publications during 2010

Table: Answer

Rank Entity1 “Manolis Koubarakis”2 Publication3 pub14 Researcher

How ranking is performed?

Scoring FunctionsScoring of a graph (cont’d)

For any cost function cf : V ∪E → [0,1]:

CG = ∑pi∈P

∑n∈pi

cf (n)

Path length

cf (n)≡ cp(n) = 1/diamG , for any element n

Keyword Matching

cf (n)≡ ckw (n) =

{ε > 0 , if 1− sim(n) = 01− sim(n) , otherwise

CG = ∑pi∈P

∑n∈pi

cf (n)

Path length

Keyword Matching

cf (n)≡ ckw (n) =

CG = ∑pi∈P

∑n∈pi

cf (n)

Path length

Keyword Matching

cf (n)≡ ckw (n) =

Scoring FunctionsScoring of a graph

Popularity

cf (n)≡ cpop(n) =

1− |nagg ||V | , if n ∈ VC

1− |ninc ||V | , if n ∈ VE

1− |nagg ||E | , if n ∈ LR

Combine

cf (n)≡ ccomb(n) = cp(n)∗ ckw (n)∗ cpop(n)

Scoring FunctionsScoring of a graph

Popularity

cf (n)≡ cpop(n) =

1− |nagg ||V | , if n ∈ VC

1− |ninc ||V | , if n ∈ VE

1− |nagg ||E | , if n ∈ LR

Combine

cf (n)≡ ccomb(n) = cp(n)∗ ckw (n)∗ cpop(n)

Scoring FunctionsScoring of an entity

Entity Cost

Represents the weighted average of the cost of an entity over thesubgraphs it appears

Cost(e,S) =Ccf (e)∗minSGCost

|SGS(e)| ∑SGi∈SGS (e)

Ccf (SGi )

Final Entity Cost

Represents the average cost of an entity derived both during graphexploration and query mapping

Cost(e) =

Cost(e,SGE)+Cost(e,QE)

2 , if e ∈ SGE ∩QECost(e,SGE ) , if e ∈ SGE and e /∈ QECost(e,QE ) , if e ∈ QE and e /∈ SGE

Scoring FunctionsScoring of an entity

Entity Cost

Represents the weighted average of the cost of an entity over thesubgraphs it appears

Cost(e,S) =Ccf (e)∗minSGCost

|SGS(e)| ∑SGi∈SGS (e)

Ccf (SGi )

Final Entity Cost

Represents the average cost of an entity derived both during graphexploration and query mapping

Cost(e) =

Cost(e,SGE)+Cost(e,QE)

2 , if e ∈ SGE ∩QECost(e,SGE ) , if e ∈ SGE and e /∈ QECost(e,QE ) , if e ∈ QE and e /∈ SGE

Implementation

System Architecture

Figure: The architecture of the keyword querying system

System Implementation

RDF Store:

Sesame

Keyword index and full-text search: LuceneSail

Query Processor: Java implementation

RDF Store: Sesame

Keyword index and full-text search:

LuceneSail

RDF Store: Sesame

Query Processor:

Java implementation

RDF Store: Sesame

Experimental Evaluation

Experimental Setup

Machine

I CPU: Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00 GHz, L26144 KB

I Main Memory: 4 GB, 1033 MHz

I Hard Disk: 750 GB, 7200 rpm, 8 MB Buffer

Datasets

I History Ontology (HO)

I Semantic Web Dog Food Ontology (SWDF)

I Digital Bibliography & Library Project Ontology (DBLP)

Evaluated Dimensions

I Efficiency: Load Scalability, Query Answering Performance

I Effectiveness: Precision/Recall, F-measure, NDCG

Experimental Setup

Machine

Datasets

Experimental Setup

Machine

Datasets

Dataset Characteristics

Triples Classes Properties Instances Avg. Inst./ClassHO 8,327 367 727 1,405 4

SWDF 88,996 96 369 8,580 89DBLP 55,364,046 12 20 3,609,294 300,775

HO: High schema complexity and small number of instances

SWDF: Medium schema complexity and number of instances

DBLP: Low schema complexity and high number of instances

EfficiencyLoad/Service Scalability

# users ≡ # sessions ≡ # queries

1 2 4 8 16 32 64 128 256

#users

HOSWDF

1 2 4 8 16 32

#users

Figure: Load scalability for a) HO/SWDF and b) DBLP

EfficiencyIndex Performance (cont’d)

0 10 20 30 40 50 60

#triples (millions)

LucenInfInference

LuceneNative

Figure: Index construction (DBLP)

Native Triple indexingon disk

Lucene Native+Keywordindexing

Inference Native+Schemagraph inference

LuceneInf Lucene+Schemagraph inference

EfficiencyIndex Performance

HISTORY DOGFOOD DBLP

Dataset

Native Lucene Inference LuceneInf

HISTORY DOGFOOD

HISTORY DOGFOOD DBLP

Dataset

Figure: a) Index construction time, b) Load times of schema graph index

EfficiencyScoring Function Performance (cont’d)

Figure: Tasks performance: HO

EfficiencyScoring Function Performance (cont’d)

Figure: Tasks performance: SWDF

EfficiencyScoring Function Performance

Conclusion

Scoring functions have similar computational characteristics

Figure: Tasks performance: DBLP

EfficiencyScoring Function Performance

Conclusion

Scoring functions have similar computational characteristics

EfficiencyQuery Performance (cont’d)

Figure: Tasks performance: HO

Figure: Tasks performance: SWDF

EfficiencyQuery Performance

Table: Tasks performance: Overall contribution

KI GE QM ETHO 0.23 0.60 0.13 0.02

SWDF 0.35 0.52 0.11 0.01DBLP 0.90 0.05 0.04 0.01

Conclusions

HO: QP dominated by graph exploration

SWDF: QP rather fairly distributed among querying tasks

DBLP: QP dominated by keyword indexing

EfficiencyQuery Performance

Table: Tasks performance: Overall contribution

KI GE QM ETHO 0.23 0.60 0.13 0.02

SWDF 0.35 0.52 0.11 0.01DBLP 0.90 0.05 0.04 0.01

Conclusions

HO: QP dominated by graph exploration

SWDF: QP rather fairly distributed among querying tasks

DBLP: QP dominated by keyword indexing

EffectivenessMeasurement methodology (cont’d)

Precision (or how succinct is the answer)

P =#(relevant items retrieved)

#(retrieved items)

Recall (or how much did it cover the question)

R =#(relevant items retrieved)

#(relevant items)

F1-measure (or how much it is to the point)

F1 =2PR

#(retrieved items)

#(relevant items)

F1 =2PR

#(retrieved items)

#(relevant items)

F1 =2PR

EffectivenessMeasurement methodology

NDCG (or how good was the ranking)

NDCGk =r1 + ∑

rilog2(i)

IDCGk, k = 15

Judgement

I 5 history experts

I 20 keyword queries (2-3 keywords each)

I Relevance judgements for relevant entities

EffectivenessMeasurement methodology

NDCG (or how good was the ranking)

NDCGk =r1 + ∑

rilog2(i)

IDCGk, k = 15

Judgement

I 5 history experts

I 20 keyword queries (2-3 keywords each)

I Relevance judgements for relevant entities

Effectiveness

Figure: Effectiveness evaluation results (HO)

Other Directions to KeywordSearch

Browsing

I Rich interfaces for result exploration, discovery of hiddeninter-connections, and query refinement

I Zero-effort web publishing

Result Snippets

Generation of small passage descriptions for quick judgement ofresults

Result Clustering

Clustering of results according to different interpretations of thesemantics of the query

Query Cleaning

I Semantic linkage and spelling corrections of database-relevantquery keywords

I Segmentation of nearby query keywords so that each segmentcorresponds to a high quality data term

Conclusions & Future Work

Conclusions

I Design & implementation of a keyword-based system on RDFgraphs with indefinite temporal information

I Evaluation results:

rather scalable, modest performance,modest effectiveness

I Keyword interpretation and graph exploration call forimprovement

Conclusions

I Evaluation results: rather scalable

, modest performance,modest effectiveness

Conclusions

I Evaluation results: rather scalable, modest performance

,modest effectiveness

Conclusions

I Evaluation results: rather scalable, modest performance,modest effectiveness

Conclusions

I Evaluation results: rather scalable, modest performance,modest effectiveness

Future Work

I Sesame/LuceneSail substitution by BigOWLIM

I Keyword interpretation and graph exploration improvement

I Challenge: Integration and querying of semi-structured dataand linked-data, stored in different formats (RDF, XML) anddata sources (ontologies, knowledge bases, databases)

I Schema-agnostic or hybrid data model

Future Work

This is the End. . .

References

Check my dissertation

Keyword Search in RDF Databasescgi.di.uoa.gr/.../files/dissertation_msc_presentation.pdfMSc...

Documents

Transcript of Keyword Search in RDF Databasescgi.di.uoa.gr/.../files/dissertation_msc_presentation.pdfMSc...

RDF: Alternative Fuel tailor-made from Waste · RDF: Alternative Fuel tailor-made from Waste •What is RDF? •Why to develop RDF-routes? •What is the potential of RDF for the

Keyword Research: Google Keyword Planner

RDF Resource Description Framework - Informatica - EN...RDF Resource Description Framework Manuel Fiorelli fiorelli@info.uniroma2.it Important dates for RDF 1999 –RDF is adopted

RDF Data Sources (keyword textbox) Search RDF Data Sources: LOM Binding Schemas (keyword textbox) Search XML Bindings: Dublin Core Application Profiles.

Query Processing on RDF data - Swinburne Research Bank · 2017. 2. 1. · Query Processing on RDF data Hai Huang B.E. (HFUT, China) M.E. (IIM of CAS, China) A dissertation submitted

Hybrid Keyword Search across Peer-to-Peer Federated Data PhD Dissertation Defense Florida State University Jungkee (Jake) Kim.

RDF Data Model and Query Languagestessaris/docs/RDF-query.pdf · Introduction RDF Semantics Querying RDF Building Blocks RDF Abstract Syntax RDF Vocabulary Basic Concepts RDF: language

Practical RDF Chapter 10. Querying RDF: RDF as Data Shelley Powers, O’Reilly SNU IDB Lab. Hyewon Lim.

Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data

RDF-Anwendungen: Photo RDF

Effectively Interpreting Keyword Queries on RDF Databases with a ...

Agent Semantic Communications Service (ASCS) Teknowledge · familiar keyword search – Agents index RDF/OWL triples ... – Boost query performance by index and query optimizations

RDF Constraint Checking using RDF Data Descriptions (RDD)

The Semantic Web — RDF, RDF Schema, and OWL …Mitchell W. Smith — The Semantic Web – RDF, RDF Schema and OWL (Part 2 of 2) Page RDF Introduction “The Semantic Web is an extension

Metadata for RDF Statements: The RDF-star Approach

Graphically Querying RDF Using RDF-GL

Keyword Search over RDF Using Document-Centric …Keyword Search over RDF Using Document-Centric Information Retrieval Systems Giorgos Kadilierakis1,2, Pavlos Fafalios1(B), Panagiotis

An Introduction to RDF and the Jena RDF API

Make Headlines Pop! - Picaboo YearbooksLETTER ART NEW! Keyword: “chocolate” NEW! Keyword: “clouds” NEW! Keyword: “chaos” NEW! Keyword: “coffee” NEW! Keyword: “candlelight”

Top-k Exploration of Query Candidates for Efﬁcient Keyword ... · Top-k Exploration of Query Candidates for Efﬁcient Keyword Search on Graph-Shaped (RDF) Data⁄ Thanh Tran 1,