Keyword Search in RDF Databasescgi.di.uoa.gr/.../files/dissertation_msc_presentation.pdfMSc...

Post on 12-Jul-2020

1 views 0 download

Transcript of Keyword Search in RDF Databasescgi.di.uoa.gr/.../files/dissertation_msc_presentation.pdfMSc...

Keyword Search in RDF Databases

Charalampos S. Nikolaoucharnik@di.uoa.gr

Department of Informatics & TelecommunicationsUniversity of Athens

MSc Dissertation PresentationApril 15, 2011

Outline

Background

User Requirements

Our Approach

Implementation

Experimental Evaluation

Other Directions to Keyword Search

Conclusions & Future Work

Background

Keyword Search on (Semi-)Structured data

Data modelGraph-based data model

I Schema-aware

X

I Schema-agnostic (comeswith specialized indexes)

Exploration Algorithms

I Backward Expansion

I Bidirectional Expansion

I Variants of the above

X

Structure of the Answer

I Trees

I Graphs

I Nodes/Entities

X

I (Multi-)Relations

Ranking/Scoring of Answers

I TF/IDF and other morecomplex measures from IR

X

I Node/Edge weights

X

I Page Rank like,Hub/Authority nodes

X

I Path length

X

I Number of nodes/edges

X

Keyword Search on (Semi-)Structured data

Data modelGraph-based data model

I Schema-aware

X

I Schema-agnostic (comeswith specialized indexes)

Exploration Algorithms

I Backward Expansion

I Bidirectional Expansion

I Variants of the above

X

Structure of the Answer

I Trees

I Graphs

I Nodes/Entities

X

I (Multi-)Relations

Ranking/Scoring of Answers

I TF/IDF and other morecomplex measures from IR

X

I Node/Edge weights

X

I Page Rank like,Hub/Authority nodes

X

I Path length

X

I Number of nodes/edges

X

Keyword Search on (Semi-)Structured data

Data modelGraph-based data model

I Schema-aware

X

I Schema-agnostic (comeswith specialized indexes)

Exploration Algorithms

I Backward Expansion

I Bidirectional Expansion

I Variants of the above

X

Structure of the Answer

I Trees

I Graphs

I Nodes/Entities

X

I (Multi-)Relations

Ranking/Scoring of Answers

I TF/IDF and other morecomplex measures from IR

X

I Node/Edge weights

X

I Page Rank like,Hub/Authority nodes

X

I Path length

X

I Number of nodes/edges

X

Keyword Search on (Semi-)Structured data

Data modelGraph-based data model

I Schema-aware

X

I Schema-agnostic (comeswith specialized indexes)

Exploration Algorithms

I Backward Expansion

I Bidirectional Expansion

I Variants of the above

X

Structure of the Answer

I Trees

I Graphs

I Nodes/Entities

X

I (Multi-)Relations

Ranking/Scoring of Answers

I TF/IDF and other morecomplex measures from IR

X

I Node/Edge weights

X

I Page Rank like,Hub/Authority nodes

X

I Path length

X

I Number of nodes/edges

X

Keyword Search on (Semi-)Structured data

Data modelGraph-based data model

I Schema-aware X

I Schema-agnostic (comeswith specialized indexes)

Exploration Algorithms

I Backward Expansion

I Bidirectional Expansion

I Variants of the above X

Structure of the Answer

I Trees

I Graphs

I Nodes/Entities X

I (Multi-)Relations

Ranking/Scoring of Answers

I TF/IDF and other morecomplex measures from IRX

I Node/Edge weights X

I Page Rank like,Hub/Authority nodes X

I Path length X

I Number of nodes/edges X

User Requirements

Requirements

Use CaseHistorians interested in the evolution of Biotechnology andRenewable Energy (see Papyrus Project)

1. Search for information about certain concepts, facts, events

2. Search may involve temporal restrictions

3. Source of information may contain indefinite temporalinformation

Requirements

Use CaseHistorians interested in the evolution of Biotechnology andRenewable Energy (see Papyrus Project)

1. Search for information about certain concepts, facts, eventse.g., stem cells and public opinion

2. Search may involve temporal restrictions

3. Source of information may contain indefinite temporalinformation

Requirements

Use CaseHistorians interested in the evolution of Biotechnology andRenewable Energy (see Papyrus Project)

1. Search for information about certain concepts, facts, events

2. Search may involve temporal restrictionse.g., evolution of biotechnology in 20th century

3. Source of information may contain indefinite temporalinformation

Requirements

Use CaseHistorians interested in the evolution of Biotechnology andRenewable Energy (see Papyrus Project)

1. Search for information about certain concepts, facts, events

2. Search may involve temporal restrictions

3. Source of information may contain indefinite temporalinformatione.g., “around 7000 BC biotechnology involved brewing beer,fermenting wine and baking bread with help of yeast”

Our Approach

Data Model

I Extension of the temporal RDF data model with indefinitetime intervals for the validity of a triple: (s,p,o)[i ]

I Validity of a resource r : (r ,subclass,Resource)[i ]

I Indefinite time intervals (set I ): (s1,s2,e1,e2), wheres1,s2,e1,e2 are natural numbers

s1 s2 e1 e2

I Indefinite time point: (s,e,s,e)

s e

Data Model

I Extension of the temporal RDF data model with indefinitetime intervals for the validity of a triple: (s,p,o)[i ]

I Validity of a resource r : (r ,subclass,Resource)[i ]

I Indefinite time intervals (set I ): (s1,s2,e1,e2), wheres1,s2,e1,e2 are natural numbers

s1 s2 e1 e2

I Indefinite time point: (s,e,s,e)

s e

Data Model

I Extension of the temporal RDF data model with indefinitetime intervals for the validity of a triple: (s,p,o)[i ]

I Validity of a resource r : (r ,subclass,Resource)[i ]

I Indefinite time intervals (set I ): (s1,s2,e1,e2), wheres1,s2,e1,e2 are natural numbers

s1 s2 e1 e2

I Indefinite time point: (s,e,s,e)

s e

Query Language

I Keyword-based language extended with temporal constraintsexpressed in a controlled natural language

I Temporal constraints (set TR):I Allen’s temporal relations: before, after, meets, met by,overlaps, overlapped by, starts, started by, finishes,finished by, during, contains

I Other lexical forms (formal/informal) used in speech andwriting

I Time intervals: [s1− s2,e3− e4], with each si (ei ) being ofthe form yyyy/mm/dd

DefinitionA query is a pair (KL,φ). The expression KL is a list of keywordsand φ is a finite conjunction of formulas of the form R c , whereR ∈ TR and c ∈ I

Query Language

I Keyword-based language extended with temporal constraintsexpressed in a controlled natural language

I Temporal constraints (set TR):I Allen’s temporal relations: before, after, meets, met by,overlaps, overlapped by, starts, started by, finishes,finished by, during, contains

I Other lexical forms (formal/informal) used in speech andwriting

I Time intervals: [s1− s2,e3− e4], with each si (ei ) being ofthe form yyyy/mm/dd

DefinitionA query is a pair (KL,φ). The expression KL is a list of keywordsand φ is a finite conjunction of formulas of the form R c , whereR ∈ TR and c ∈ I

Query Language

I Keyword-based language extended with temporal constraintsexpressed in a controlled natural language

I Temporal constraints (set TR):I Allen’s temporal relations: before, after, meets, met by,overlaps, overlapped by, starts, started by, finishes,finished by, during, contains

I Other lexical forms (formal/informal) used in speech andwriting

I Time intervals: [s1− s2,e3− e4], with each si (ei ) being ofthe form yyyy/mm/dd

DefinitionA query is a pair (KL,φ). The expression KL is a list of keywordsand φ is a finite conjunction of formulas of the form R c , whereR ∈ TR and c ∈ I

Example

DatabaseBerlin has been a city since some point in the 12th century

(ex:city, rdf:type, rdfs:Class).

(ex:city, rdfs:label, "City").

(ex:berlin, foaf:name, "Berlin").

(ex:berlin, rdf:type, ex:city)[1100 - 1190, 2011].

Example

DatabaseBerlin has been a city since some point in the 12th century

(ex:city, rdf:type, rdfs:Class).

(ex:city, rdfs:label, "City").

(ex:berlin, foaf:name, "Berlin").

(ex:berlin, rdf:type, ex:city)[1100 - 1190, 2011].

Information needs (1)

I would like information about cities during the period 1200 – 1210

city during [1200, 1210]

Example

DatabaseBerlin has been a city since some point in the 12th century

(ex:city, rdf:type, rdfs:Class).

(ex:city, rdfs:label, "City").

(ex:berlin, foaf:name, "Berlin").

(ex:berlin, rdf:type, ex:city)[1100 - 1190, 2011].

Information needs (1)

I would like information about cities during the period 1200 – 1210

city during [1200, 1210]

AnswerThere is such a city named Berlin

Example

DatabaseBerlin has been a city since some point in the 12th century

(ex:city, rdf:type, rdfs:Class).

(ex:city, rdfs:label, "City").

(ex:berlin, foaf:name, "Berlin").

(ex:berlin, rdf:type, ex:city)[1100 - 1190, 2011].

Information needs (2)

I would like information about cities during the period 1100 – 1110

city during [1100, 1110]

Example

DatabaseBerlin has been a city since some point in the 12th century

(ex:city, rdf:type, rdfs:Class).

(ex:city, rdfs:label, "City").

(ex:berlin, foaf:name, "Berlin").

(ex:berlin, rdf:type, ex:city)[1100 - 1190, 2011].

Information needs (2)

I would like information about cities during the period 1100 – 1110

city during [1100, 1110]

AnswerThere is possibly a city named Berlin

Example

DatabaseBerlin has been a city since some point in the 12th century

(ex:city, rdf:type, rdfs:Class).

(ex:city, rdfs:label, "City").

(ex:berlin, foaf:name, "Berlin").

(ex:berlin, rdf:type, ex:city)[1100 - 1190, 2011].

The ideal answerThe answer to both questions should include the individual“Berlin” and the class “City”, but in the second case these twoentities should have lower score to convey the notion of possibility

Data Structures

Data graph

The underlying RDF graph

Schema graph

A summary of the data graph, containing data-driven schemainformation

Query graph

Super graph of schema graph containing elements from the RDFgraph

Data Structures

Data graph

The underlying RDF graph

Schema graph

A summary of the data graph, containing data-driven schemainformation

Query graph

Super graph of schema graph containing elements from the RDFgraph

Data Structures

Data graph

The underlying RDF graph

Schema graph

A summary of the data graph, containing data-driven schemainformation

Query graph

Super graph of schema graph containing elements from the RDFgraph

Data graph

Subject Predicate Object

pro1 type Projectpro2 type Projectpro1 name Papyruspub1 type Publicationpub1 author res1pub1 author res2pub1 year 2010pub2 type Publicationres1 type Researcherres2 type Researcheruniv1 name DI&Tres2 name Y. Ioannidisres1 name M. Koubarakisres1 worksAt univ1univ1 type Universityuniv2 type University

University subclass AgentAgent subclass Resourcepub1 hasProject pro1

subclass

subclas

s

subclasssubclass

works

At

hasProject

author

pro1

pro2

pub1

pub2

resres

2

res3

univ1

univ2

Project

Researcher

University

Publication

Person Resource

Agent

author

type

type

type

type

type

type

type

ManolisKoubarakis

YannisIoannidis

name

namePapyrus name

DI&T

name

2010

typeyear

type

1

Figure: a) RDF triples b) Data graph

Schema and Query graphs

Project Researcher

Publication Person

Agent

University

Resource

subclass

subclas

s

subclass

subcla

ss

worksAt

hasProject au

thor

2010 Resource

Agent

Researcher

Person

University

hasProject

author

subcla

ss

worksAt

year

name

subclass

subclass

subclas

s

Project

Publication

ManolisKoubarakis

Figure: a) schema graph

b) query graph for koubarakis publications

during 2010

Schema and Query graphs

Project Researcher

Publication Person

Agent

University

Resource

subclass

subclas

s

subclass

subcla

ss

worksAt

hasProject au

thor

2010 Resource

Agent

Researcher

Person

University

hasProject

author

subcla

ss

worksAt

year

name

subclass

subclass

subclas

s

Project

Publication

ManolisKoubarakis

Figure: a) schema graph b) query graph for koubarakis publications

during 2010

Keyword Search Algorithm

Query processing in 4 phases:

1. Keyword Interpretation (KI): Interpret keywords as RDF graphelements (keyword elements) and construct the query graph

2. Graph Exploration (GE): Explore the query graph startingfrom keyword elements for finding top-k subgraphs

3. Query Mapping (QM): Map top-k subgraphs to SPARQLqueries and evaluate them

4. Entity Transformation (ET): Transform and rank entitiesappearing in subgraphs and results from query evaluation toentities

Running query

koubarakis publications during 2010

Keyword Search Algorithm

Query processing in 4 phases:

1. Keyword Interpretation (KI): Interpret keywords as RDF graphelements (keyword elements) and construct the query graph

2. Graph Exploration (GE): Explore the query graph startingfrom keyword elements for finding top-k subgraphs

3. Query Mapping (QM): Map top-k subgraphs to SPARQLqueries and evaluate them

4. Entity Transformation (ET): Transform and rank entitiesappearing in subgraphs and results from query evaluation toentities

Running query

koubarakis publications during 2010

Keyword Search Algorithm

Query processing in 4 phases:

1. Keyword Interpretation (KI): Interpret keywords as RDF graphelements (keyword elements) and construct the query graph

2. Graph Exploration (GE): Explore the query graph startingfrom keyword elements for finding top-k subgraphs

3. Query Mapping (QM): Map top-k subgraphs to SPARQLqueries and evaluate them

4. Entity Transformation (ET): Transform and rank entitiesappearing in subgraphs and results from query evaluation toentities

Running query

koubarakis publications during 2010

Keyword Search Algorithm

Query processing in 4 phases:

1. Keyword Interpretation (KI): Interpret keywords as RDF graphelements (keyword elements) and construct the query graph

2. Graph Exploration (GE): Explore the query graph startingfrom keyword elements for finding top-k subgraphs

3. Query Mapping (QM): Map top-k subgraphs to SPARQLqueries and evaluate them

4. Entity Transformation (ET): Transform and rank entitiesappearing in subgraphs and results from query evaluation toentities

Running query

koubarakis publications during 2010

Keyword Search Algorithm

Query processing in 4 phases:

1. Keyword Interpretation (KI): Interpret keywords as RDF graphelements (keyword elements) and construct the query graph

2. Graph Exploration (GE): Explore the query graph startingfrom keyword elements for finding top-k subgraphs

3. Query Mapping (QM): Map top-k subgraphs to SPARQLqueries and evaluate them

4. Entity Transformation (ET): Transform and rank entitiesappearing in subgraphs and results from query evaluation toentities

Running query

koubarakis publications during 2010

Keyword Interpretationkoubarakis publications during 2010

Project Researcher

Publication Person

Agent

University

Resourcesubclass

subclas

s

subclass

subcla

ss

worksAt

hasProject au

thor

Keyword Interpretationkoubarakis publications during 2010

2010 Resource

Agent

Researcher

Person

University

hasProject

author

subcla

ss

worksAt

year

name

subclass

subclass

subclas

s

Project

Publication

ManolisKoubarakis

Graph Explorationkoubarakis publications during 2010

Graph Explorationkoubarakis publications during 2010

2010 Resource

Agent

Researcher

Person

University

hasProject

author

subcla

ss

worksAt

year

name

subclass

subclass

subclas

s

Project

Publication

ManolisKoubarakis

Graph Explorationkoubarakis publications during 2010

2010 Resource

Agent

Researcher

Person

University

hasProject

author

subclass

worksAt

year

name

subcl

ass

subclass

subclas

s

Project

Publicat ion

ManolisKoubarakis

Query Mappingkoubarakis publications during 2010

Query Mappingkoubarakis publications during 2010

2010 Resource

Agent

Researcher

Person

University

hasProject

author

subclass

worksAt

year

name

subcl

ass

subclass

subclas

s

Project

Publicat ion

ManolisKoubarakis

Query Mappingkoubarakis publications during 2010

2010

Researcherauth

or

year

name

Publicat ion

ManolisKoubarakis

Query Mappingkoubarakis publications during 2010

2010

Researcherauth

or

year

name

Publicat ion

ManolisKoubarakis

Conjunctive query

name(x,”Koubarakis”)∧type(x,Researcher)∧author(y,x)∧year(y,”2010”)∧type(y,Publication)

Query Mappingkoubarakis publications during 2010

SPARQL Query

SELECT ?x ?y ?xl ?yl

WHERE {

?x rdf:type ex:Researcher .

?x foaf:name "Manolis Koubarakis" .

?y ex:author ?x .

?x rdf:type ex:Publication .

?x ex:year "2010" .

?x rdfs:label ?xl .

?y rdfs:label ?yl .

}

Query Mappingkoubarakis publications during 2010

Table: SPARQL result

?x ?y ?xl ?ylres1 pub1 “Manolis Koubarakis”

Entity Transformationkoubarakis publications during 2010

Table: Answer

Rank Entity1 “Manolis Koubarakis”2 Publication3 pub14 Researcher

How ranking is performed?

Entity Transformationkoubarakis publications during 2010

Table: Answer

Rank Entity1 “Manolis Koubarakis”2 Publication3 pub14 Researcher

How ranking is performed?

Scoring FunctionsScoring of a graph (cont’d)

For any cost function cf : V ∪E → [0,1]:

CG = ∑pi∈P

∑n∈pi

cf (n)

Path length

cf (n)≡ cp(n) = 1/diamG , for any element n

Keyword Matching

cf (n)≡ ckw (n) =

{ε > 0 , if 1− sim(n) = 01− sim(n) , otherwise

Scoring FunctionsScoring of a graph (cont’d)

For any cost function cf : V ∪E → [0,1]:

CG = ∑pi∈P

∑n∈pi

cf (n)

Path length

cf (n)≡ cp(n) = 1/diamG , for any element n

Keyword Matching

cf (n)≡ ckw (n) =

{ε > 0 , if 1− sim(n) = 01− sim(n) , otherwise

Scoring FunctionsScoring of a graph (cont’d)

For any cost function cf : V ∪E → [0,1]:

CG = ∑pi∈P

∑n∈pi

cf (n)

Path length

cf (n)≡ cp(n) = 1/diamG , for any element n

Keyword Matching

cf (n)≡ ckw (n) =

{ε > 0 , if 1− sim(n) = 01− sim(n) , otherwise

Scoring FunctionsScoring of a graph

Popularity

cf (n)≡ cpop(n) =

1− |nagg ||V | , if n ∈ VC

1− |ninc ||V | , if n ∈ VE

1− |nagg ||E | , if n ∈ LR

Combine

cf (n)≡ ccomb(n) = cp(n)∗ ckw (n)∗ cpop(n)

Scoring FunctionsScoring of a graph

Popularity

cf (n)≡ cpop(n) =

1− |nagg ||V | , if n ∈ VC

1− |ninc ||V | , if n ∈ VE

1− |nagg ||E | , if n ∈ LR

Combine

cf (n)≡ ccomb(n) = cp(n)∗ ckw (n)∗ cpop(n)

Scoring FunctionsScoring of an entity

Entity Cost

Represents the weighted average of the cost of an entity over thesubgraphs it appears

Cost(e,S) =Ccf (e)∗minSGCost

|SGS(e)| ∑SGi∈SGS (e)

1

Ccf (SGi )

Final Entity Cost

Represents the average cost of an entity derived both during graphexploration and query mapping

Cost(e) =

Cost(e,SGE)+Cost(e,QE)

2 , if e ∈ SGE ∩QECost(e,SGE ) , if e ∈ SGE and e /∈ QECost(e,QE ) , if e ∈ QE and e /∈ SGE

Scoring FunctionsScoring of an entity

Entity Cost

Represents the weighted average of the cost of an entity over thesubgraphs it appears

Cost(e,S) =Ccf (e)∗minSGCost

|SGS(e)| ∑SGi∈SGS (e)

1

Ccf (SGi )

Final Entity Cost

Represents the average cost of an entity derived both during graphexploration and query mapping

Cost(e) =

Cost(e,SGE)+Cost(e,QE)

2 , if e ∈ SGE ∩QECost(e,SGE ) , if e ∈ SGE and e /∈ QECost(e,QE ) , if e ∈ QE and e /∈ SGE

Implementation

System Architecture

Figure: The architecture of the keyword querying system

System Implementation

RDF Store:

Sesame

Keyword index and full-text search: LuceneSail

Query Processor: Java implementation

System Implementation

RDF Store: Sesame

Keyword index and full-text search: LuceneSail

Query Processor: Java implementation

System Implementation

RDF Store: Sesame

Keyword index and full-text search:

LuceneSail

Query Processor: Java implementation

System Implementation

RDF Store: Sesame

Keyword index and full-text search: LuceneSail

Query Processor: Java implementation

System Implementation

RDF Store: Sesame

Keyword index and full-text search: LuceneSail

Query Processor:

Java implementation

System Implementation

RDF Store: Sesame

Keyword index and full-text search: LuceneSail

Query Processor: Java implementation

Experimental Evaluation

Experimental Setup

Machine

I CPU: Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00 GHz, L26144 KB

I Main Memory: 4 GB, 1033 MHz

I Hard Disk: 750 GB, 7200 rpm, 8 MB Buffer

Datasets

I History Ontology (HO)

I Semantic Web Dog Food Ontology (SWDF)

I Digital Bibliography & Library Project Ontology (DBLP)

Evaluated Dimensions

I Efficiency: Load Scalability, Query Answering Performance

I Effectiveness: Precision/Recall, F-measure, NDCG

Experimental Setup

Machine

I CPU: Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00 GHz, L26144 KB

I Main Memory: 4 GB, 1033 MHz

I Hard Disk: 750 GB, 7200 rpm, 8 MB Buffer

Datasets

I History Ontology (HO)

I Semantic Web Dog Food Ontology (SWDF)

I Digital Bibliography & Library Project Ontology (DBLP)

Evaluated Dimensions

I Efficiency: Load Scalability, Query Answering Performance

I Effectiveness: Precision/Recall, F-measure, NDCG

Experimental Setup

Machine

I CPU: Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00 GHz, L26144 KB

I Main Memory: 4 GB, 1033 MHz

I Hard Disk: 750 GB, 7200 rpm, 8 MB Buffer

Datasets

I History Ontology (HO)

I Semantic Web Dog Food Ontology (SWDF)

I Digital Bibliography & Library Project Ontology (DBLP)

Evaluated Dimensions

I Efficiency: Load Scalability, Query Answering Performance

I Effectiveness: Precision/Recall, F-measure, NDCG

Dataset Characteristics

Triples Classes Properties Instances Avg. Inst./ClassHO 8,327 367 727 1,405 4

SWDF 88,996 96 369 8,580 89DBLP 55,364,046 12 20 3,609,294 300,775

HO: High schema complexity and small number of instances

SWDF: Medium schema complexity and number of instances

DBLP: Low schema complexity and high number of instances

EfficiencyLoad/Service Scalability

# users ≡ # sessions ≡ # queries

10

100

1000

10000

1 2 4 8 16 32 64 128 256

#sec

onds

(lo

g sc

ale)

#users

HOSWDF

0

5000

10000

15000

20000

25000

30000

1 2 4 8 16 32

#sec

onds

(lo

g sc

ale)

#users

Figure: Load scalability for a) HO/SWDF and b) DBLP

EfficiencyIndex Performance (cont’d)

0

5

10

15

20

25

30

0 10 20 30 40 50 60

#sec

onds

(th

ousa

nds)

#triples (millions)

LucenInfInference

LuceneNative

Figure: Index construction (DBLP)

Native Triple indexingon disk

Lucene Native+Keywordindexing

Inference Native+Schemagraph inference

LuceneInf Lucene+Schemagraph inference

EfficiencyIndex Performance

0

5000

10000

15000

20000

25000

30000

HISTORY DOGFOOD DBLP

#sec

onds

Dataset

Native Lucene Inference LuceneInf

0

1

2

3

4

5

6

7

8

9

HISTORY DOGFOOD

100

200

300

400

500

600

700

800

900

1000

1100

HISTORY DOGFOOD DBLP

#ms

Dataset

Figure: a) Index construction time, b) Load times of schema graph index

EfficiencyScoring Function Performance (cont’d)

Figure: Tasks performance: HO

EfficiencyScoring Function Performance (cont’d)

Figure: Tasks performance: SWDF

EfficiencyScoring Function Performance

Conclusion

Scoring functions have similar computational characteristics

Figure: Tasks performance: DBLP

EfficiencyScoring Function Performance

Conclusion

Scoring functions have similar computational characteristics

Figure: Tasks performance: DBLP

EfficiencyQuery Performance (cont’d)

Figure: Tasks performance: HO

EfficiencyQuery Performance (cont’d)

Figure: Tasks performance: SWDF

EfficiencyQuery Performance (cont’d)

Figure: Tasks performance: DBLP

EfficiencyQuery Performance

Table: Tasks performance: Overall contribution

KI GE QM ETHO 0.23 0.60 0.13 0.02

SWDF 0.35 0.52 0.11 0.01DBLP 0.90 0.05 0.04 0.01

Conclusions

HO: QP dominated by graph exploration

SWDF: QP rather fairly distributed among querying tasks

DBLP: QP dominated by keyword indexing

EfficiencyQuery Performance

Table: Tasks performance: Overall contribution

KI GE QM ETHO 0.23 0.60 0.13 0.02

SWDF 0.35 0.52 0.11 0.01DBLP 0.90 0.05 0.04 0.01

Conclusions

HO: QP dominated by graph exploration

SWDF: QP rather fairly distributed among querying tasks

DBLP: QP dominated by keyword indexing

EffectivenessMeasurement methodology (cont’d)

Precision (or how succinct is the answer)

P =#(relevant items retrieved)

#(retrieved items)

Recall (or how much did it cover the question)

R =#(relevant items retrieved)

#(relevant items)

F1-measure (or how much it is to the point)

F1 =2PR

P +R

EffectivenessMeasurement methodology (cont’d)

Precision (or how succinct is the answer)

P =#(relevant items retrieved)

#(retrieved items)

Recall (or how much did it cover the question)

R =#(relevant items retrieved)

#(relevant items)

F1-measure (or how much it is to the point)

F1 =2PR

P +R

EffectivenessMeasurement methodology (cont’d)

Precision (or how succinct is the answer)

P =#(relevant items retrieved)

#(retrieved items)

Recall (or how much did it cover the question)

R =#(relevant items retrieved)

#(relevant items)

F1-measure (or how much it is to the point)

F1 =2PR

P +R

EffectivenessMeasurement methodology

NDCG (or how good was the ranking)

NDCGk =r1 + ∑

ki=2

rilog2(i)

IDCGk, k = 15

Judgement

I 5 history experts

I 20 keyword queries (2-3 keywords each)

I Relevance judgements for relevant entities

EffectivenessMeasurement methodology

NDCG (or how good was the ranking)

NDCGk =r1 + ∑

ki=2

rilog2(i)

IDCGk, k = 15

Judgement

I 5 history experts

I 20 keyword queries (2-3 keywords each)

I Relevance judgements for relevant entities

Effectiveness

Figure: Effectiveness evaluation results (HO)

Other Directions to KeywordSearch

Other Directions to Keyword Search

Browsing

I Rich interfaces for result exploration, discovery of hiddeninter-connections, and query refinement

I Zero-effort web publishing

Other Directions to Keyword Search

Result Snippets

Generation of small passage descriptions for quick judgement ofresults

Other Directions to Keyword Search

Result Clustering

Clustering of results according to different interpretations of thesemantics of the query

Other Directions to Keyword Search

Query Cleaning

I Semantic linkage and spelling corrections of database-relevantquery keywords

I Segmentation of nearby query keywords so that each segmentcorresponds to a high quality data term

Conclusions & Future Work

Conclusions

I Design & implementation of a keyword-based system on RDFgraphs with indefinite temporal information

I Evaluation results:

rather scalable, modest performance,modest effectiveness

I Keyword interpretation and graph exploration call forimprovement

Conclusions

I Design & implementation of a keyword-based system on RDFgraphs with indefinite temporal information

I Evaluation results: rather scalable

, modest performance,modest effectiveness

I Keyword interpretation and graph exploration call forimprovement

Conclusions

I Design & implementation of a keyword-based system on RDFgraphs with indefinite temporal information

I Evaluation results: rather scalable, modest performance

,modest effectiveness

I Keyword interpretation and graph exploration call forimprovement

Conclusions

I Design & implementation of a keyword-based system on RDFgraphs with indefinite temporal information

I Evaluation results: rather scalable, modest performance,modest effectiveness

I Keyword interpretation and graph exploration call forimprovement

Conclusions

I Design & implementation of a keyword-based system on RDFgraphs with indefinite temporal information

I Evaluation results: rather scalable, modest performance,modest effectiveness

I Keyword interpretation and graph exploration call forimprovement

Future Work

I Sesame/LuceneSail substitution by BigOWLIM

I Keyword interpretation and graph exploration improvement

I Challenge: Integration and querying of semi-structured dataand linked-data, stored in different formats (RDF, XML) anddata sources (ontologies, knowledge bases, databases)

I Schema-agnostic or hybrid data model

Future Work

I Sesame/LuceneSail substitution by BigOWLIM

I Keyword interpretation and graph exploration improvement

I Challenge: Integration and querying of semi-structured dataand linked-data, stored in different formats (RDF, XML) anddata sources (ontologies, knowledge bases, databases)

I Schema-agnostic or hybrid data model

Future Work

I Sesame/LuceneSail substitution by BigOWLIM

I Keyword interpretation and graph exploration improvement

I Challenge: Integration and querying of semi-structured dataand linked-data, stored in different formats (RDF, XML) anddata sources (ontologies, knowledge bases, databases)

I Schema-agnostic or hybrid data model

Future Work

I Sesame/LuceneSail substitution by BigOWLIM

I Keyword interpretation and graph exploration improvement

I Challenge: Integration and querying of semi-structured dataand linked-data, stored in different formats (RDF, XML) anddata sources (ontologies, knowledge bases, databases)

I Schema-agnostic or hybrid data model

This is the End. . .

References

Check my dissertation