Linked Data Access Goes Mobile: Context Aware Authorization for Graph Stores
CumulusRDF Linked Data Management on Nested Key-Value Stores
description
Transcript of CumulusRDF Linked Data Management on Nested Key-Value Stores
KIT – University of the State of Baden-Württemberg andNational Large-scale Research Center of the Helmholtz Association
Institute of Applied Informatics and Formal Description Methods (AIFB)
www.kit.edu
CumulusRDFLinked Data Management on Nested Key-Value StoresGünter Ladwig, Andreas HarthWorkshop on Scalable Semantic Web Knowledge Base Systems (SSWS2011)
Institute of Applied Informatics and Formal Description Methods (AIFB)2 October 24th, 2011
Contents
IntroductionLinked DataApache CassandraCumulusRDF
Storage LayoutsStorage ModelHierarchical LayoutFlat Layout
EvaluationConclusion
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)3 October 24th, 2011
INTRODUCTION
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)4 October 24th, 2011
Linked Data Management
RDF data accessible via HTTP lookupsMany datasets cover descriptions of millions of entitiesPublishers often use full-fledged triple stores
Complex query processing capabilities not necessary for Linked Data lookups
Trend towards specialized data management systems tailored for specific use casesDistributed key-value stores
Simple (often nested) data modelNo (expensive) joinsHigh availability and scalability
We investigate applicability of key-value stores for managing and publishing Linked Data
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)5 October 24th, 2011
Linked Data Lookups
Dereferencing URI t should return RDF graph describing tExact content is only lightly specified
Common practice (e.g. DBpedia) is to returnall triples with the given URI as subject andsome triples with the given URI as object
Other optionsOnly triples with the given URI as subjectConcise Bounded Descriptions
SSWS 2011, Bonn
User Agent
Server
GET
RDF
http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5#artist
http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5.rdf
Institute of Applied Informatics and Formal Description Methods (AIFB)6 October 24th, 2011
Triple Patterns
A triple pattern is an RDF triple that may contain variables instead of RDF terms in any position?s dbpprop:birthPlace dbpedia:Karlsruhe .or?s foaf:name ?o .
Linked Data Lookup on t translates into two triple patterns lookups
(t ? ?)(? ? t)
At least three indexes to cover all possibletriple patterns (with prefix lookups)
SSWS 2011, Bonn
Patterns Index? ? ? Anys ? ? SPO? p ? POS? ? o OSPs p ? SPO? p o POSs ? o OSPs p o Any
Institute of Applied Informatics and Formal Description Methods (AIFB)7 October 24th, 2011
Apache Cassandra
Open source data management systemDistributed key-value store (DHT-based)
Nested key-value data modelSchema-less
DecentralizedEvery node in the cluster has the same roleNo single point of failure
ElasticThroughput increases linearly as machines are added with no downtime
Fault-tolerantData can be replicated
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)8 October 24th, 2011
CumulusRDF
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)9 October 24th, 2011
CumulusRDF Functionality
Distributed deployment to enable scale (more data and also more clients) by adding more machines (via Cassandra)
Geographical replication (via Cassandra)
Write-optimized indices with eventual consistency (via Cassandra)
Triple pattern lookups (via CumulusRDF index structures)
Linked Data Lookups (via CumulusRDF index structures)
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)10 October 24th, 2011
STORAGE LAYOUTS
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)11 October 24th, 2011 SSWS 2011, Bonn
Nested Key-Value Storage Model
Column-only { row-key : { column : value } }
Super columns { row-key : { supercolumn : { column : value } } }
ro
c00 v00
c01 v01
... ...
r1
c10 v10
c11 v11
... ...
...
Row
Columns
Column key Column value
r2
sc00
sc01
c000 v000
c010 v010
... ...
r3
sc00
sc01
c000 v000
c010 v010
... ...
Super column key
Institute of Applied Informatics and Formal Description Methods (AIFB)12 October 24th, 2011
Nested Key-Value Storage Model
Secondary indexes map column values to rows{ value : row-key }
Cassandra limitationsEntire rows always stored on a single nodeNo range queries on row keysColumns are stored in specified order and allow for range queries
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)13 October 24th, 2011
Hierarchical Layout
Uses super columnsRDF terms occupy row, supercolumn and column positions
Value is emptyThree indexes SPO, POS, OSP cover all possible triple patternExample: SPO index
SPO: { s : { p : { o : - } } }
SSWS 2011, Bonn
dbp:Jaws
foaf:name
rdf:type
“Jaws” -
dbp:Film -dbp:Work -
Row key Super column key Column key Value
Institute of Applied Informatics and Formal Description Methods (AIFB)14 October 24th, 2011
Flat Layout
Uses columns onlyRange queries on column keys allow prefix lookups
Concatenate second & third position to form column keySPO { s : { po : - } }po is the concatenation of predicate and objectFor (sp?) we perform a prefix lookup on p in row with key s
SSWS 2011, Bonn
dbp:Jawsfoaf:name “Jaws” -rdf:type dbp:Film -rdf:type dbp:Work -
Row key Column key Value
Institute of Applied Informatics and Formal Description Methods (AIFB)15 October 24th, 2011
POS Index
RDF data is skewed: many triples may share the same predicate (rdf:type is a prime example)
p as row key will result in a very uneven distributionCassandra cannot split rows among several nodes
We take advantage of Cassandra’s secondary indexesUse po as row key
{ po : { s : - } }Smaller rows, better distributionNo range queries on rows key: no prefix lookup!
In each row we add a special column ‘p’ which has p as its value{ po : { ‘p’ : p } }
Secondary index on column ‘p’ allows retrieval of all po row keys for a given p
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)16 October 24th, 2011
EVALUATION
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)17 October 24th, 2011
System: 4 node cluster on virtualizedinfrastructure
2 CPUs, 4GB RAM, 80GB disk per nodeDataset: DBpedia 3.6 subset
120M triples (all w/o multilingual labels)Triple pattern queries
1M sampled S, SP, SPO, SO, and O patterns from datasetOutput: all matching triples
Linked Data lookup queries2M resource lookups from DBpedia logs (1.2M unique)Output: all triples with URI as subject and 10k triples with URI as object
Evaluation
SSWS 2011, Bonn
10k all
C0
Clients
CumulusRDF
C1 C2 C3
C0-C3:Cassandranodes
Institute of Applied Informatics and Formal Description Methods (AIFB)18 October 24th, 2011
Results – Storage Layout
SSWS 2011, Bonn
Index Node 1 Node 2 Node 3 Node 4 Std. Dev. Max. Row
SPO Hier 4.41 4.40 4.41 4.41 0.01 0.0002SPO Flat 4.36 4.36 4.36 4.36 0.00 0.0004OSP Hier 5.86 6.00 5.75 6.96 0.56 1.16OSP Flat 5.66 5.77 5.54 6.61 0.49 0.96POS Hier 4.43 3.68 4.69 1.08 1.65 2.40POS Sec 7.35 7.43 7.38 8.05 0.33 0.56
Values in GBSPO Flat: { s : { po : - } }, OSPPOS Sec: { po : { ‘p’ : p } }SPO Hier: { s : { p : { o : - } } }, OSP, POS
Institute of Applied Informatics and Formal Description Methods (AIFB)19 October 24th, 2011
Results – Pattern Lookups
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)20 October 24th, 2011
Results – Pattern Lookups
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)21 October 24th, 2011
Results – Linked Data Lookups
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)22 October 24th, 2011
Conclusion
We evaluated two index schemes for RDF on nested key-value stores to support Linked Data lookups
Flat indexing gives best overall resultsOutput format impacts performance (N-Triples v RDF/XML)
Apache Cassandra is a viable alternative to full-fledged triple stores for Linked Data lookupsFuture work
Automatic generation and maintenance of dataset statisticsEvaluate insert and update performance
Get CumulusRDF at http://code.google.com/p/cumulusrdf
SSWS 2011, Bonn
Institute of Applied Informatics and Formal Description Methods (AIFB)23 October 24th, 2011
Proxy Mode
Via host header~200 lookups/sec. TransferCluster with 4 machines
SSWS 2011, Bonn