SWAT4LS 2014 SLIDE by Yamamoto

A SPARQL Endpoint Profiler for Efficient Question Answering Systems Yasunori Yamamoto Database Center for Life Science (DBCLS)


Page 1: SWAT4LS 2014 SLIDE by Yamamoto

A SPARQL Endpoint Profiler for Efficient Question Answering Systems

Yasunori Yamamoto

Database Center for Life Science (DBCLS)

Page 2: SWAT4LS 2014 SLIDE by Yamamoto

Various SPARQL Endpoints

Page 3: SWAT4LS 2014 SLIDE by Yamamoto

Can you get exactly the data you need from them, no more and no less?

SPARQL: SPARQL Protocol and RDF Query Language

Page 4: SWAT4LS 2014 SLIDE by Yamamoto

A SPARQL editor you might often see…

Almost nothing here…

dean beyett

Use the force

Page 5: SWAT4LS 2014 SLIDE by Yamamoto

Lower Columbia College (LCC)

We’d like to make this happen.

Page 6: SWAT4LS 2014 SLIDE by Yamamoto

Question Answering

Page 7: SWAT4LS 2014 SLIDE by Yamamoto

SELECT ?t1
WHERE {
  ?t1 [:isa] [genes] .
  ?t2 [:isa] [alzheimer disease] .
  ?t1 [associated with] ?t2 .
}

“What genes are associated with Alzheimer disease?”

[genes] = a [gene] class

[alzheimer disease] = instance of a [disease] class

Out of this study’s focus

Mapping to existing vocabularies

PubDictionary / identifiers.org

Page 8: SWAT4LS 2014 SLIDE by Yamamoto

Example: BIO2RDF

[gene] Class = bio2rdf:omim_vocabulary:Gene

[disease] Class = bio2rdf:umls_vocabulary:Resource

[disease] Class = bio2rdf:omim_vocabulary:Phenotype

A QA system wants to know if there is any path between them.

Page 9: SWAT4LS 2014 SLIDE by Yamamoto

Naive method

[Diagram: a QA system connected directly to four SPARQL endpoints; labels "?", "A", and "gene/omim/umls?".]

Finding endpoints (EPs) that can compose a path between a [gene] class and [alzheimer disease].

Slow and inefficient

Page 10: SWAT4LS 2014 SLIDE by Yamamoto

Our approach

[Diagram: a QA system, four SPARQL endpoints, and a metadata store.]

Metadata of each endpoint is gathered in advance.

Page 11: SWAT4LS 2014 SLIDE by Yamamoto

[Diagram: the QA system, four SPARQL endpoints, and the metadata gathered for them; labels "?" and "A".]

Effective and efficient

Page 12: SWAT4LS 2014 SLIDE by Yamamoto

Ideal way

[Diagram: a QA system and four SPARQL endpoints, each accompanied by its own metadata.]

Each endpoint provides its metadata

Page 13: SWAT4LS 2014 SLIDE by Yamamoto

VoID: Vocabulary of Interlinked Datasets

SD: SPARQL 1.1 Service Description

Yes, they’re fine!

Page 14: SWAT4LS 2014 SLIDE by Yamamoto

But, we cannot get…

Data about relationships between classes

Gene: Cap9, APCS, PRNP, ...

?

Phenotype: EEC SYNDROME 1, TOXEMIA OF PREGNANCY, DIABETES, ...

Page 15: SWAT4LS 2014 SLIDE by Yamamoto

SPARQL Builder Metadata

Extensions to VoID / SD

Page 16: SWAT4LS 2014 SLIDE by Yamamoto

Obtain all the relationships

Predicate first vs. Class first

Page 17: SWAT4LS 2014 SLIDE by Yamamoto

Criterion    Average    Min  Max     Median  Total
Classes      1,976.29   1    50,303  5.0     509,884
Properties   33.48      1    885     21.0    9,608

http://stats.lod2.eu/stats

Page 18: SWAT4LS 2014 SLIDE by Yamamoto

Obtain metadata using SPARQL

A locally declared class

any resource that appears as an object of rdf:type in the dataset.

:sycamore rdf:type :Tree .

A locally undeclared class (LUC) instance

any resource that does not appear as a subject of rdf:type in the dataset.
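The two definitions above can be illustrated on a tiny in-memory dataset. This is only a sketch: the triples (beyond the slide's :sycamore example), the helper names, and the convention that literals are strings starting with a double quote are all assumptions made for illustration, not part of the profiler itself.

```python
RDF_TYPE = "rdf:type"


def is_literal(term):
    # Assumption: literals are written Turtle-style, starting with a double quote.
    return term.startswith('"')


def declared_classes(triples):
    """Resources that appear as an object of rdf:type."""
    return {o for _, p, o in triples if p == RDF_TYPE}


def luc_instances(triples):
    """Resources that never appear as a subject of rdf:type.

    Only subjects and non-literal objects of non-rdf:type triples are
    considered, so class IRIs themselves are not reported.
    """
    typed = {s for s, p, _ in triples if p == RDF_TYPE}
    resources = {s for s, _, _ in triples}
    resources |= {o for _, p, o in triples
                  if p != RDF_TYPE and not is_literal(o)}
    return resources - typed


# Hypothetical example data; only the first triple comes from the slide.
triples = [
    (":sycamore", "rdf:type", ":Tree"),
    (":sycamore", ":growsIn", ":park"),
    (":sycamore", "rdfs:label", '"Sycamore"'),
]

classes = declared_classes(triples)
lucs = luc_instances(triples)
```

Here :Tree is a locally declared class, while :park has no rdf:type statement and so counts as an LUC instance.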

Page 19: SWAT4LS 2014 SLIDE by Yamamoto

Obtain metadata using SPARQL

                     Object
Subject    Class A  Class B  LUC  Literal
Class A    0        76       23   50
Class B
LUC

SELECT ?p (COUNT(?p) AS ?rc)
WHERE {
  ?f ?p ?t .
  ?f a :Class_A .
  ?t a :Class_B .
}
GROUP BY ?p

[Diagram: ?f (a :Class_A) is linked by ?p to ?t (a :Class_B).]
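To make the class-pair query concrete, the sketch below emulates it over an in-memory list of triples instead of a live endpoint. The data and the function name `count_predicates` are hypothetical; the logic simply mirrors the pattern "?f ?p ?t . ?f a :Class_A . ?t a :Class_B" grouped by ?p.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def count_predicates(triples, class_a, class_b):
    """Count, per predicate, the triples whose subject is an instance of
    class_a and whose object is an instance of class_b."""
    # Map each resource to the set of classes it is typed with.
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    counts = Counter()
    for s, p, o in triples:
        if class_a in types.get(s, ()) and class_b in types.get(o, ()):
            counts[p] += 1
    return counts


# Hypothetical example data.
triples = [
    (":g1", "rdf:type", ":Class_A"),
    (":g2", "rdf:type", ":Class_A"),
    (":d1", "rdf:type", ":Class_B"),
    (":g1", ":relatedTo", ":d1"),
    (":g2", ":relatedTo", ":d1"),
    (":g1", ":label", '"gene one"'),
]

pair_counts = count_predicates(triples, ":Class_A", ":Class_B")
```

Each call fills one cell of the subject-class by object-class matrix, which is why this form has to be repeated for every pair of classes.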

Page 20: SWAT4LS 2014 SLIDE by Yamamoto

Improvement

The pairwise query must be issued N² times (N: number of classes).

Even a dataset with many classes has only a few relationships.

→ inefficient

SELECT ?p ?c (COUNT(?p) AS ?pc)
WHERE {
  ?f a :Class_A ;
     ?p ?t ;
     !rdf:type ?t .
  ?t a ?c .
}
GROUP BY ?p ?c

→ N times

[Diagram: ?f (a :Class_A) is linked by ?p to ?t (a ?c).]
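The class-first idea can be sketched in the same in-memory style: fix the subject class once and let the object class vary, so one pass per class replaces one pass per class pair. This is a simplified emulation, not the SPARQL semantics in full: it skips rdf:type triples outright to approximate the "!rdf:type" condition, and the data and function names are hypothetical.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def relations_from_class(triples, class_a):
    """For instances of class_a, count (predicate, object class) pairs."""
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    counts = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue  # approximates the !rdf:type condition
        if class_a in types.get(s, ()):
            for obj_class in types.get(o, ()):
                counts[(p, obj_class)] += 1
    return counts


def queries_needed(n_classes):
    """Number of queries for the pairwise and the class-first strategies."""
    return n_classes * n_classes, n_classes


# Hypothetical example data.
triples = [
    (":g1", "rdf:type", ":Class_A"),
    (":g2", "rdf:type", ":Class_A"),
    (":d1", "rdf:type", ":Class_B"),
    (":g1", ":relatedTo", ":d1"),
    (":g2", ":relatedTo", ":d1"),
]

from_a = relations_from_class(triples, ":Class_A")
pairwise, class_first = queries_needed(5)
```

With the median of 5 classes reported for the surveyed datasets, the pairwise strategy needs 25 queries and the class-first strategy 5; the gap grows quadratically with the number of classes.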

Page 21: SWAT4LS 2014 SLIDE by Yamamoto

Gathering and providing metadata

http://tm.dbcls.jp/tdp/

Page 22: SWAT4LS 2014 SLIDE by Yamamoto

Searching example

[gene] Class = bio2rdf:omim_vocabulary:Gene

[disease] Class = bio2rdf:umls_vocabulary:Resource

[disease] → [gene]

OR

[gene] → [disease]
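Once class-to-class relationships have been gathered as metadata, searching for a path between two classes no longer needs to touch the endpoints. A minimal sketch, assuming the metadata is available as (subject class, predicate, object class) tuples; the "ex:" names and the breadth-first search are illustrative assumptions, not the actual search used by the system.

```python
from collections import deque


def find_path(relations, start, goal):
    """Breadth-first search over class relationships.

    Returns a list of (subject class, predicate, object class) steps
    leading from start to goal, or None if no path exists.
    """
    graph = {}
    for subj_cls, pred, obj_cls in relations:
        graph.setdefault(subj_cls, []).append((pred, obj_cls))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        cls, path = queue.popleft()
        if cls == goal:
            return path
        for pred, nxt in graph.get(cls, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(cls, pred, nxt)]))
    return None


# Hypothetical class-relationship metadata.
relations = [
    ("ex:Gene", "ex:refersTo", "ex:Phenotype"),
    ("ex:Phenotype", "ex:xref", "ex:Disease"),
]

# Try both directions, as in the searching example.
gene_to_disease = find_path(relations, "ex:Gene", "ex:Disease")
disease_to_gene = find_path(relations, "ex:Disease", "ex:Gene")
```

In this toy metadata a path exists only from the gene class to the disease class, so the search in the opposite direction returns None.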

Page 23: SWAT4LS 2014 SLIDE by Yamamoto

From [disease]

[Diagram: [disease] with predicates rdf:type and void:inDataset; object nodes Literal, rdfs:Resource, and dcterms:Dataset.]

From [gene]

Count    Predicate                            Object class
608220   bio2rdf-omim:x-gi                    bio2rdf-gi:Resource
98639    bio2rdf-omim:article                 bio2rdf-pubmed:Resource
95443    bio2rdf-omim:refers-to               bio2rdf-omim:Resource
81438    bio2rdf-omim:vocabulary:refers-to    bio2rdf-omim:Gene
28032    bio2rdf-omim:vocabulary:gene-symbol  bio2rdf-hgnc:Resource

Page 24: SWAT4LS 2014 SLIDE by Yamamoto
Page 25: SWAT4LS 2014 SLIDE by Yamamoto

Things to do

To improve interoperability
  Integrating vocabularies with other related parties.

To improve efficiency
  Calling for data providers to provide metadata.

To realize a QA system using metadata
  Discussing with LODQA developers.

Page 26: SWAT4LS 2014 SLIDE by Yamamoto

Thank you!