RDF Triple Stores Nipun Bhatia Department of Computer Science. Stanford University.

19
RDF Triple Stores Nipun Bhatia Department of Computer Science. Stanford University

Transcript of RDF Triple Stores Nipun Bhatia Department of Computer Science. Stanford University.

RDF Triple Stores

Nipun BhatiaDepartment of Computer Science. Stanford University

Contents Introduction Different Architectures

• Implications An Example : Jena SDB Evaluations

• Evaluations using LUBM/DBPedia Open Research Issues Which RDF Store to choose for a particular application? Possible system diagram for Phenotype Annonations.

Introduction What is an RDF store?

A system to provide a mechanism for persistent storage and access of RDF graphs.

Potential Applications areas:

Plenty! Backend for Protege, BioPortal, Phenotype Annotations.

Different Architectures Based on their implementation, can be divided into 3

broad categories : In-memory, Native, Non-native Non-memory.

In – Memory : RDF Graph is stored as triples in main –memory. Eg. Storing an RDF graph using Jena API/ Sesame API.

Native : Persistent storage systems with their own implementation of databases. Eg. Sesame Native, Virtuoso, AllegroGraph, Oracle 11g.

Non-Native Non-Memory : Persistent storage systems set-up to run on third party DBs. Eg. Jena SDB.

Implications Scalability Different query languages supported to varying degrees.

• Sesame – SeRQL, Oracle 11g – Own query language. Different level of inferencing.

• Sesame supports RDFS inference, AllegroGraph – RDFS++,

Oracle 11g – RDFS++, OWL Prime Lack of interoperability and portability.

• More pronounced in Native stores.

Jena SDB SDB basically is a Java Loader. Multiple stores supported: MySQL, PostgreSQL, Oracle,

DB2. Takes incoming triples and breaks them down into

components ready for the database. Multiple layouts Integration with the Joseki server. SPARQL supported.

(Non) Interest Declaration: I was previously an intern at HP Labs with the Jena team

Evaluations Third party evaluations for Sesame, Jena SDB, Virtuoso Oracle 11g company evaluations Methodology

• LUBM – Lehigh University BenchMark• DBPedia• Multiple Queries• Load Times

Evaluations DB Pedia – Database of structured information extracted

from Wikipedia. Information about places, persons, music albums and films[2]

LUBM – Synthetically generated RDF data containing universities, departments, students etc.[1]

Dataset size:• DataSet1: 15,472,624 triples; 2.1 GB• DataSet 2: LUBM 50 – 2.75 Million & LUBM 1000 – 55.09

Million• 3 Queries

Loading Time-DataSet1

Results – Query 1 Simple select query – 2 variables

Query 2 Unconstrained Select Query – only predicate was

specified.

Query 3 Complex Query – Uses filter

Oracle 11g – DataSet 2Ontology (size) RDFS OWL Prime

Triples Time Triples Time

LUBM – 50(6.8 Million) 2.75 M 12.14 min 3.05 M 8.01 min

LUBM – 1000(133.6 M)

55.09M 7h 19m 65.25M 7h 12m

Observations Native Stores perform better than systems using third

party stores.• Optimizations are possible

Each of the systems uses different database layouts. • Virtuoso – OGPS,POGS,PSOG,SOPG• SDB – SPO,GSPO

Hashing on SDB is very bad.

Open Research Issues Inferencing[4]

• Present common implementations:• Make a number of small queries to propagate the effects of rule firing.• Each of these queries creates an interaction with the database.• Not very efficient

• Approaches• Snapshot the contents of the database-backed model into RAM for the

duration of processing by the inference engine.• Performing inferencing in-stream.

• Precompute the inference closure of ontology and analyze the in-coming data-streams, add triples to it based on your inference closure.

• Assumes rigid seperation of the RDF Data(A-box) and the Ontology data(T-box)

• Even this maynot work for very large ontologies – BioMedical Ontologies

Open Research Issues Query Optimization

• Third party stores undo’s any optimization done at the API level.

• Better performance of native stores points to that direction.• Some work in optimizing SPARQL queries for in-memory

story.

Which RDF store to choose for an app? Frequency of loads that the application would perform. Single scaling factor and linear load times. Level of inferencing. Support for which query language. W3C

recommendations. Special system needs. Eg. Allegograph needs 64 bit

processor.

Phenotype Annotations

Set of Ontologies required for Phenotype

Annotationseg. PATO, Fly etc.

jJena Model

SDB

MySQL/ Virtuoso

Phenotype Annotations

Jena API

Jena API

Jena API

Inferencing

Jena API

j

Jena API

Jena Model

SDB

References [1] http://esw.w3.org/topic/RdfStoreBenchmarking [2] http://www4.wiwiss.fu-berlin.de/benchmarks-200801/ [3] Kurt Rohloff et al.: An Evaluation of Triple-Store

Technologies for Large Data Stores. Comparing Sesame, Jena and AllegroGraph. 2007

[4]N Bhatia, A Seaborne – ‘Ingestion pipeline for RDF’