Query Processing Using Structure Index for RDF Data on the Web
KIT – the cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe (TH)
Thanh Tran and Günter Ladwig
Institute AIFB, Karlsruhe Institute of Technology
[email protected], [email protected]
Agenda
- Problem Introduction
- Approach
  - Structure Index for RDF Data
  - Structure-based Partitioning
  - Structure-aware Query Processing
- Evaluation
- Conclusion
RDF data
[Figure: example RDF data graph with resource vertices 0-9 and the values KIT and MIT, connected by edges labelled AuthorOf, Supervises, WorksAt and Name]
- Consists of triples <s,p,o>
- Triples form a graph, where vertices denote resources and their values, connected by directed labelled edges representing properties (i.e., relations and attributes)
- URIs are used as labels of edges and vertices representing resources
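As a rough illustration of the triple model, the following Python sketch stores a few triples and derives the labelled, directed graph they form. The triples are read off the example tables and match lists that appear in these slides; vertex 5 is left out because its edges are not recoverable from the text. The names `TRIPLES` and `build_graph` are made up for this sketch.

```python
from collections import defaultdict

# Triples <s, p, o> taken from the example tables in the slides.
# Assumption: vertex 5 is omitted because its edges are not listed anywhere.
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]


def build_graph(triples):
    """Return the outgoing adjacency: subject -> list of (property, object)."""
    out_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((p, o))
    return out_edges


graph = build_graph(TRIPLES)
for subject in sorted(graph):
    print(subject, graph[subject])
```

Each subject becomes a vertex with one labelled outgoing edge per triple; values such as KIT only ever appear as edge targets.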
Conjunctive Queries
- Important fragment of widely used languages (SQL, SPARQL)
- Consisting of triple patterns p(s,o) where p is a predicate and s and o are variables or constants
- Distinguished variables, e.g. x, vs. undistinguished variables
- Triple patterns constitute a query graph
[Figure: example query graph over the variables x, y, z, u and the constant KIT, with edges labelled AuthorOf, Supervises, WorksAt (twice) and Name]
Conjunctive Query Answering
[Figure: example RDF data graph with resource vertices 0-9 and the values KIT and MIT, connected by edges labelled AuthorOf, Supervises, WorksAt and Name]
- Graph pattern matching problem: a match of a query q on a graph G is a mapping h from the variables of q to vertices of G such that the substitution of variables in the graph representation of q would yield a subgraph of G
- A match h is a homomorphism from the "query graph" to the data graph
- Query answering is based on two basic operations: data loading and join
[Figure: example query graph over the variables x, y, z, u and the constant KIT, with edges labelled AuthorOf, Supervises, WorksAt (twice) and Name]
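To make the matching definition concrete, here is a small Python sketch that answers a conjunctive query by loading the triples for each pattern and joining them on shared variables. The query is an assumption: it is reconstructed from the example match h'1 shown on the "Data Graph Matching" slide, since the query figure itself is not recoverable. The helper names (`bind`, `answer`) are hypothetical, and the nested-loop join is only the simplest possible strategy, not the one used by any of the systems named in this deck.

```python
# Triples from the example tables; vertex 5 is omitted (its edges are not listed).
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]

# Assumed query: Supervises(x,y), AuthorOf(y,z), WorksAt(x,u), WorksAt(y,u), Name(u,KIT).
# Terms starting with "?" are variables, everything else is a constant.
QUERY = [
    ("?x", "Supervises", "?y"), ("?y", "AuthorOf", "?z"),
    ("?x", "WorksAt", "?u"), ("?y", "WorksAt", "?u"), ("?u", "Name", "KIT"),
]


def bind(term, value, binding):
    """Extend the binding so that term maps to value; return None on conflict."""
    if term.startswith("?"):
        if term in binding:
            return binding if binding[term] == value else None
        return {**binding, term: value}
    return binding if term == value else None


def answer(query, triples):
    """Return all mappings h from query variables to vertices (the matches)."""
    bindings = [{}]
    for s, p, o in query:
        extended = []
        for binding in bindings:
            for ts, tp, to in triples:      # "data loading" for this pattern
                if tp != p:
                    continue
                with_s = bind(s, ts, binding)   # "join" on shared variables
                if with_s is None:
                    continue
                with_so = bind(o, to, with_s)
                if with_so is not None:
                    extended.append(with_so)
        bindings = extended
    return bindings


matches = answer(QUERY, TRIPLES)
print(matches)
```

With x as the distinguished variable, the answer is the set of values that x takes across all matches.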
State-of-the-art
Data Partitioning
- Vertical partitioning (SW-Store)
Indexing
- Sextuple indexing (Hexastore)
- Materialization and indexing of entire join paths (GRIN)
Index Implementation
- B+ tree
- Inverted index (Semplore)
- Index compression (RDF-3X)
Query processing
- Sorted merge join based on vertical partitioning and indexing (SW-Store)
- Join order optimization based on dynamic programming (RDF-3X)
A combination of different concepts makes up the state-of-the-art!
Large Volume of RDF Data on the Web
- ~10 billion RDF triples (2009)
- Interlinked by ~10 million mappings (2009)
- Besides linked data, there are standalone ontologies, RDFa, etc.
Semi-structured RDF data on the Web
[Figure: the example RDF data graph (vertices 0-9, KIT, MIT) shown together with a schema graph whose classes Publication, Institute, Post Doc, PhD Student and String are connected by edges labelled AuthorOf, Supervises, WorksAt and Name]
- An RDF graph often contains both data and schema information
- Resources are linked to a class (an rdfs:Class) via rdf:type
- Schema information is often incomplete, especially for Web data and RDFa data
- Consequence: RDF data might be schema-less, semi-structured data
Overview of Our Approach
Problems
• Management of possibly semi-structured RDF data on the Web
• Scalability and efficiency of RDF Web data query processing
Contributions
• Parameterized structure index for RDF data
• Structure-based partitioning (SP)
• Structure-aware query processing
Benefits
• Reduction of unions & joins as well as IO cost
Structure Index for RDF data on the Web
- The structure index is a graph
- It is a structural description that is more fine-grained than a schema
- It consists of classes (extensions) and relations between them
- Resources in an extension exhibit the same structure, i.e., they cannot be distinguished by their outgoing (forward bisimilarity) and incoming (backward bisimilarity) "edge trees"
- Bisimulation is parameterized by two sets of edge labels
[Figure: the example RDF data graph next to its structure index, an index graph whose vertices are the extensions below, connected by edges labelled AuthorOf, Supervises, WorksAt and Name]
B1: 3,7
B2: 0,1
B3: 8,9
B4: 2,4,6
B5: KIT,MIT
B6: 5
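One way to see how such extensions can arise is a naive partition refinement: start with all vertices in one block and keep splitting blocks until vertices in the same block agree on the labels and target blocks of their outgoing edges and on the labels and source blocks of their incoming edges. The Python sketch below does this under explicit assumptions: it is a generic signature-based refinement, not necessarily the authors' construction; `fwd_labels` and `bwd_labels` are hypothetical parameter names for the two label sets; and vertex 5 is omitted because its edges are not listed in the text, so B6 does not appear in the result.

```python
# Triples from the example tables; vertex 5 is omitted (its edges are not listed).
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
ALL_LABELS = {"AuthorOf", "WorksAt", "Supervises", "Name"}


def structure_index(triples, fwd_labels, bwd_labels):
    """Partition vertices by forward/backward bisimilarity (naive refinement).

    Only outgoing edges labelled in fwd_labels and incoming edges labelled in
    bwd_labels are considered, which is how the two label sets act as parameters.
    """
    vertices = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    block = {v: 0 for v in vertices}          # start with a single block
    while True:
        signature = {}
        for v in vertices:
            out_sig = frozenset((p, block[o]) for s, p, o in triples
                                if s == v and p in fwd_labels)
            in_sig = frozenset((p, block[s]) for s, p, o in triples
                               if o == v and p in bwd_labels)
            # Keeping the old block id in the signature means blocks only split.
            signature[v] = (block[v], out_sig, in_sig)
        ids = {}
        refined = {v: ids.setdefault(signature[v], len(ids)) for v in vertices}
        if len(ids) == len(set(block.values())):
            break                              # no block was split: stable
        block = refined
    extensions = {}
    for v, b in block.items():
        extensions.setdefault(b, set()).add(v)
    return {frozenset(members) for members in extensions.values()}


extensions = structure_index(TRIPLES, ALL_LABELS, ALL_LABELS)
print(sorted(sorted(e) for e in extensions))
```

Passing smaller label sets yields a coarser index; with both sets empty, every vertex ends up in a single extension.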
Structure-based Partitioning
- Whether a graph vertex instantiates a variable of a query depends on its structure, so vertices are physically grouped based on structural similarity
- Apply the grouping captured by the structure index to the physical organization
  - Create a physical group for every vertex of the structure index
  - Triples are in the same group when their subjects belong to the same extension
- Triples of an SP table not only satisfy the property of a triple pattern but also provide some structural guarantee, e.g., that they match the entire query structure
[Figure: index graph with the extensions B1: 3,7; B2: 0,1; B3: 8,9; B4: 2,4,6; B5: KIT,MIT; B6: 5, connected by edges labelled AuthorOf, Supervises, WorksAt and Name]
SP B4 table
Sub  Property  Obj
2    AuthorOf  0
4    AuthorOf  0
6    AuthorOf  1
2    WorksAt   8
4    WorksAt   8
6    WorksAt   9

VP AuthorOf table
Sub  Obj
2    0
4    0
6    1
3    0
7    1
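The difference between the two layouts can be sketched in a few lines of Python: vertical partitioning keys each table by property, while structure-based partitioning keys it by the extension of the triple's subject. The function names and the in-memory dictionaries are assumptions made for illustration; the extensions are the ones listed on the slide (B6 is omitted because the edges of vertex 5 are not given).

```python
from collections import defaultdict

# Triples from the example tables; vertex 5 is omitted (its edges are not listed).
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
# Extensions as listed on the slide.
EXTENSIONS = {
    "B1": {"3", "7"}, "B2": {"0", "1"}, "B3": {"8", "9"},
    "B4": {"2", "4", "6"}, "B5": {"KIT", "MIT"},
}


def vertical_partitioning(triples):
    """VP: one (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return tables


def structure_based_partitioning(triples, extensions):
    """SP: one (subject, property, object) table per extension of the subject."""
    extension_of = {v: name for name, members in extensions.items() for v in members}
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[extension_of[s]].append((s, p, o))
    return tables


vp = vertical_partitioning(TRIPLES)
sp = structure_based_partitioning(TRIPLES, EXTENSIONS)
print(vp["AuthorOf"])
print(sp["B4"])
```

Note that the VP AuthorOf table mixes subjects from B1 and B4, whereas the SP B4 table holds only subjects that share the same structure.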
Structure-aware Query Processing
Proposition 1: A mapping of q into G exists only if one also exists into the associated index graph G'. The extensions that match the nodes in q contain all data graph matches.
Two-step query processing
- Index graph: find extensions Ei matching q
- Data graph: combine the data elements retrieved for Ei
Index Graph Matching
[Figure: index graph with extensions B1-B6 and edges labelled AuthorOf, Supervises, WorksAt and Name, shown with two matches of the query graph (variables x, y, z, u and the constant KIT)]
- Retrieve index graph edges matching query edges (triple patterns)
- Join index graph edges along query edges
h1 = {B1, B2, B3, B4, B5}
h2 = {B2, B3, B4, B5, B6}
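The two steps above (retrieve matching index edges, then join them along query edges) can be sketched in Python as follows. Two things are assumptions here: the index graph edge list is reconstructed from the matches h1 and h2 and from the example tables, and the query is reconstructed from the example match h'1 on the "Data Graph Matching" slide. The constant KIT is mapped to the extension that contains it, B5. All names are hypothetical.

```python
# Assumed index graph edges (extension, label, extension), reconstructed from
# h1, h2 and the example tables rather than read directly from the figure.
INDEX_EDGES = [
    ("B1", "Supervises", "B4"), ("B6", "Supervises", "B4"),
    ("B1", "AuthorOf", "B2"), ("B4", "AuthorOf", "B2"),
    ("B1", "WorksAt", "B3"), ("B4", "WorksAt", "B3"), ("B6", "WorksAt", "B3"),
    ("B3", "Name", "B5"),
]
EXTENSION_OF_CONSTANT = {"KIT": "B5"}   # KIT is a member of B5

# Assumed query: Supervises(x,y), AuthorOf(y,z), WorksAt(x,u), WorksAt(y,u), Name(u,KIT).
QUERY = [
    ("?x", "Supervises", "?y"), ("?y", "AuthorOf", "?z"),
    ("?x", "WorksAt", "?u"), ("?y", "WorksAt", "?u"), ("?u", "Name", "KIT"),
]


def bind(term, value, binding):
    """Extend the binding so that term maps to value; return None on conflict."""
    if term.startswith("?"):
        if term in binding:
            return binding if binding[term] == value else None
        return {**binding, term: value}
    return binding if term == value else None


def match_index_graph(query, index_edges, extension_of_constant):
    """Map query variables to extensions by joining index edges along query edges."""
    bindings = [{}]
    for s, p, o in query:
        # Constants are replaced by the extension they belong to.
        s = extension_of_constant.get(s, s)
        o = extension_of_constant.get(o, o)
        extended = []
        for binding in bindings:
            for es, ep, eo in index_edges:
                if ep != p:
                    continue
                with_s = bind(s, es, binding)
                if with_s is None:
                    continue
                with_so = bind(o, eo, with_s)
                if with_so is not None:
                    extended.append(with_so)
        bindings = extended
    return bindings


matches = match_index_graph(QUERY, INDEX_EDGES, EXTENSION_OF_CONSTANT)
for m in matches:
    print(sorted(set(m.values()) | {"B5"}))
```

Because the index graph is usually much smaller than the data graph, this step is cheap relative to matching on the data itself.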
Query Pruning
Proposition 2: If a query is tree-shaped and consists only of undistinguished variables (besides the root), matches on the structure index contain all and only data graph matches.
- Data elements contained in the extensions matching the query root node represent all and only the final query answers
- Given such queries, no further processing is needed
- Given more general queries, tree-shaped query parts can be pruned away
Query Pruning
[Figure: query graph (variables x, y, z, u and the constant KIT) next to the index graph with extensions B1-B6 and edges labelled AuthorOf, Supervises, WorksAt and Name]
h1 = {B1, B2, B3, B4, B5}
- Elements in extensions are known to satisfy the query structure
- Elements in B4 are already known to be authors of some z
- No further data processing is needed for this part
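A simplified reading of this pruning step is to repeatedly drop triple patterns that hang off the query as leaves ending in an undistinguished variable. The Python sketch below does only that, under stated assumptions: x is taken to be the sole distinguished variable, the query is the one reconstructed from the example match h'1, constants are never pruned, and `prune` is a hypothetical helper rather than the authors' procedure.

```python
# Assumed query: Supervises(x,y), AuthorOf(y,z), WorksAt(x,u), WorksAt(y,u), Name(u,KIT).
QUERY = [
    ("?x", "Supervises", "?y"), ("?y", "AuthorOf", "?z"),
    ("?x", "WorksAt", "?u"), ("?y", "WorksAt", "?u"), ("?u", "Name", "KIT"),
]
DISTINGUISHED = {"?x"}   # assumption: only x is returned to the user


def find_prunable(query, distinguished):
    """Return (pattern, variable) for a leaf pattern that can be dropped, or None."""
    for pattern in query:
        s, _, o = pattern
        if s == o:
            continue                      # self-loops are not tree edges
        for term in (s, o):
            if not term.startswith("?") or term in distinguished:
                continue                  # keep constants and distinguished variables
            uses = sum(1 for qs, _, qo in query if term in (qs, qo))
            if uses == 1:                 # the variable is a leaf of the query graph
                return pattern, term
    return None


def prune(query, distinguished):
    """Drop tree-shaped parts made of undistinguished variables, leaf by leaf."""
    remaining = list(query)
    removed = []
    while True:
        found = find_prunable(remaining, distinguished)
        if found is None:
            return remaining, removed
        pattern, variable = found
        remaining.remove(pattern)
        removed.append(variable)


pruned_query, removed_nodes = prune(QUERY, DISTINGUISHED)
print(pruned_query)
print(removed_nodes)
```

In this example only the AuthorOf(y, z) pattern is removed, which mirrors the statement that elements in B4 are already known to be authors of some z.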
Data Graph Matching
[Figure: index graph with extensions B1-B6 and edges labelled AuthorOf, Supervises, WorksAt and Name; the triples retrieved for the matching extensions are listed below]
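Triples with subjects in B1: 3 WorksAt 8; 7 WorksAt 9; 3 Supervises 2; 3 Supervises 4; 7 Supervises 6; ...
Triples with subjects in B3: 8 Name KIT; 9 Name MIT
Triples with subjects in B4: 2 WorksAt 8; 4 WorksAt 8; 6 WorksAt 9; ...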
- Retrieve triples from matching extensions & join along query edges
- Match class processing: group index graph matches to match classes to avoid processing matches that partially overlap
h'1 = {3 WorksAt 8, 3 Supervises 2, 2 WorksAt 8, 8 Name KIT}
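The data-level step can be sketched in Python as a filtered load followed by a join: for each remaining triple pattern, only triples whose subject lies in the extension matched by the pattern's subject variable are loaded, and the loaded triples are then joined along query edges. The index match H1 and the pruned query (without AuthorOf(y, z)) are assumptions carried over from the example; the helper names are hypothetical, and match class processing is not covered.

```python
# Triples from the example tables; vertex 5 is omitted (its edges are not listed).
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
EXTENSIONS = {
    "B1": {"3", "7"}, "B2": {"0", "1"}, "B3": {"8", "9"},
    "B4": {"2", "4", "6"}, "B5": {"KIT", "MIT"},
}
# Assumed inputs: the pruned query and the index graph match h1 restricted to it.
PRUNED_QUERY = [
    ("?x", "Supervises", "?y"), ("?x", "WorksAt", "?u"),
    ("?y", "WorksAt", "?u"), ("?u", "Name", "KIT"),
]
H1 = {"?x": "B1", "?y": "B4", "?u": "B3"}


def load(pattern, index_match):
    """Load only triples from the extensions matched on the index graph.

    Simplification: the subject of every pattern is assumed to be a variable.
    """
    s, p, o = pattern
    subjects = EXTENSIONS[index_match[s]]
    rows = [t for t in TRIPLES if t[1] == p and t[0] in subjects]
    if o.startswith("?"):
        objects = EXTENSIONS[index_match[o]]
        return [t for t in rows if t[2] in objects]
    return [t for t in rows if t[2] == o]       # constant object


def join(query, index_match):
    """Join the loaded triples along query edges (nested-loop join)."""
    bindings = [{}]
    for pattern in query:
        s, _, o = pattern
        extended = []
        for binding in bindings:
            for ts, _, to in load(pattern, index_match):
                if binding.get(s, ts) != ts:
                    continue                     # subject variable already bound differently
                if o.startswith("?") and binding.get(o, to) != to:
                    continue                     # object variable already bound differently
                new_binding = {**binding, s: ts}
                if o.startswith("?"):
                    new_binding[o] = to
                extended.append(new_binding)
        bindings = extended
    return bindings


results = join(PRUNED_QUERY, H1)
print(results)
```

The binding x=3, y=2, u=8 corresponds to the triples listed in h'1.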
Evaluation
DBLP and several synthetic datasets created using the Lehigh University Benchmark (LUBM)
30 queries categorized into five classes
Path query (QLUBM6):
SELECT ?x ?y takesCourse (x, y) teacherOf (z, y) type (z, FullProfessor)

Entity query (QLUBM9):
SELECT ?x ?m emailAddress (x, fp@edu) res.Interest (x, research24) telephone (x, xxx-xxx-xxxx)

Single-atom query (QDBLP1):
SELECT ?x type (x, Person)

Graph-shaped query (QLUBM15):
SELECT ?x ?a teacherOf (FullProfessor5, y) takesCourse (x, y) publicationAuthor (b, x) name (b, Publication7) memberOf (x, z) memberOf (a, z) advisor (x, a) telephone (a, xxx-xxx-xxxx)

Star query (QDBLP12):
SELECT ?x, ?n type (x, Person) name (x, n) editor (y, x) author (z, x) cites (u, z)
Evaluation – Performance
[Figure, left: total time in ms on DBLP for SP vs. VP, per query q1-q15 and as mean, on a logarithmic scale]
[Figure, right: time of separate steps in ms (idx match, load (VP-SP), join (VP-SP)) and number of pruned query nodes, per query q1-q15, on a logarithmic scale]
- Compare our work (SP) against vertical partitioning (VP) [Abadi et al.]
  - Total query processing times
  - Times of the individual steps involved
- Slightly slower w.r.t. simple queries (1-3)
- SP is 8-9 times faster w.r.t. complex queries (4-15)
- With more complex queries, the overhead incurred by answer space matching can be outweighed by the accumulated gain for load and join
Conclusions
- Structure index that can deal with general graph-structured RDF data on the Web
- Structure index can be leveraged for dealing with semi-structured data on the Web
- Structure index can be used for RDF data partitioning & query processing, allowing complex queries to be processed many times faster
- Future work
  - Adopt existing concepts in XML data management for structure index optimization & updates
  - Query optimization for structure-aware query processing
Thank you for your attention!
Structure Index for RDF Data on the Web
Duc Thanh Tran, AIFB Institute, KIT
E-Mail: [email protected]
http://sites.google.com/site/kimducthanh
State-of-the-art
Data Partitioning
- Big table (old versions of Oracle, Jena, Sesame)
- Property tables (Jena)
- Vertical partitioning (SW-Store)
Indexing
- Multiple indexing (YARS)
- Sextuple indexing (Hexastore)
- Materialization and indexing of entire join paths (GRIN)
Index Implementation
- B+ tree
- Inverted index (Semplore)
- Index compression (RDF-3X)
Query processing
- Sorted merge join based on vertical partitioning and indexing (SW-Store)
- Join order optimization based on dynamic programming (RDF-3X)
A combination of different concepts makes up the state-of-the-art!
Overview of Our Approach
Problems
• Management of possibly semi-structured RDF data on the Web
• Scalability and efficiency of RDF Web data query processing
Contributions
• Parameterized structure index for RDF data
• Structure-based partitioning (SP): triples with the same structure are grouped
• Structure-aware query processing
  • Use the structure index to focus on data that satisfy the overall query structure
  • Then retrieve data from the corresponding structure-based partitioned tables
Benefits
• Targets data partitioning & query processing, i.e., complementary to other concepts
• Reduction of unions & joins as well as IO cost
Evaluation – Scalability
[Figure: two charts of processing times in ms over datasets of increasing size (DBLP, LUBM1, LUBM5, LUBM10, LUBM20, LUBM50); legend entries include idx match, load (VPQP-SQP), join (VPQP-SQP), OSQP and SQP]
- Measured the average query performance for LUBM with varying size
  - Times increase with the size of the data
  - The gain for load and join increases in larger proportion than the overhead incurred for index match
- Match performance is determined by the size of the index graph
  - Its size depends on the structure but not on the size of the data graph
  - Match time does not necessarily increase when the data becomes larger
- The positive effect of data filtering (IO reduction) and query pruning (load and join) correlates with the data size