Continuously Updating Query Results over Real-Time Linked Data
A Main Memory Index Structure to Query Linked Data
-
Upload
olaf-hartig -
Category
Documents
-
view
2.078 -
download
0
description
Transcript of A Main Memory Index Structure to Query Linked Data
A Main Memory Index Structureto Query Linked Data
Olaf Hartighttp://olafhartig.de/foaf.rdf#olaf
Frank Huber
Database and Information Systems Research GroupHumboldt-Universität zu Berlin
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 2
0 5 10 15 20 25 30
0 5 10 15 20 25 30
0 20 40 60 80
0 20 40 60 80
The Issue
(Query No. 36)
(Query No. 37)
(Query No. 38)
(Query No. 39)
(Query No. 40)
hit rate query execution time(in seconds)
number of query results
ContactInfoPhillipe
UnsetPropsPhillipe
2ndDegree1Phillipe
2ndDegree2Phillipe
IncomingPhillipe
0 0,2 0,4 0,6 0,8 1
0 0,2 0,4 0,6 0,8 1
no reuse given order
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 3
0 5 10 15 20 25 30
0 5 10 15 20 25 30
0 20 40 60 80
0 20 40 60 80
The Issue
(Query No. 36)
(Query No. 37)
(Query No. 38)
(Query No. 39)
(Query No. 40)
hit rate query execution time(in seconds)
number of query results
Descriptor objectsin the query-local
dataset afterquery execution:
172
533
ContactInfoPhillipe
UnsetPropsPhillipe
2ndDegree1Phillipe
2ndDegree2Phillipe
IncomingPhillipe
0 0,2 0,4 0,6 0,8 1
0 0,2 0,4 0,6 0,8 1
no reuse given order
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 4
query-localdataset
What data structure do we use tophysically represent the query-local dataset?
Logical representation of Linked Data from the Web
Physical representation of Linked Data from the Web
?
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 5
Outline
1. Requirements + Existing Work
2. Data Structures
3. Evaluation
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 6
Requirements
● (Consecutively) build and use ad hoc collections ofmany small sets of RDF triples
● Four main operations:● Find … matching triples for a triple pattern in all
descriptor objects● Add, Remove, Replace … descriptor objects
● Support of concurrent access (i.e. isolation)
● Non-relevant properties:● Querying descriptor objects individually is not necessary● No need to write data back to the Web● ACID properties not required for complete queries
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 7
Requirements
● (Consecutively) build and use ad hoc collections ofmany small sets of RDF triples
● Four main operations:● Find … matching triples for a triple pattern in all
descriptor objects● Add, Remove, Replace … descriptor objects
● Support of concurrent access (i.e. isolation)
● Non-relevant properties:● Querying descriptor objects individually is not necessary● No need to write data back to the Web● ACID properties not required for complete queries
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 8
Existing Work
● Disk based storage solutions for RDF data● Unsuitable due to very costly I/O operations
● Main memory based data structures in the literature● Focus on a large, single set of RDF triples● Optimized for complete graph pattern queries or path queries
● Main memory based data structures in RDF frameworks● Focus on Jena, ARQ and NG4J● Inefficient (see evaluation)
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 9
Outline
1. Requirements + Existing Work
2. Data Structures
3. Evaluation
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 10
Hash-Based Index for RDF Data
Logical representation
Physical representation
PS
SP
O
SOPO
Dict
● Dictionary:● Two-way mapping between RDF
terms and numerical identifiers
● 6 hash tables:● Each hash table contains
all ID-encoded triples● Efficient support for all types of triple patterns
*Similar to Harth and Decker, 2005
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 11
Hash-Based Index for RDF Data
Logical representation
Physical representation
PS
SP
O
SOPO
Dict
● Dictionary:● Two-way mapping between RDF
terms and numerical identifiers
● 6 hash tables:● Each hash table contains
all ID-encoded triples● Efficient support for all types of triple patterns
*Similar to Harth and Decker, 2005
tid = ( id
s,id
p,id
o )
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 12
Hash-Based Index for RDF Data
Logical representation
Physical representation
PS
SP
O
SOPO
Dict
● Dictionary:● Two-way mapping between RDF
terms and numerical identifiers
● 6 hash tables:● Each hash table contains
all ID-encoded triples● Efficient support for all types of triple patterns
*Similar to Harth and Decker, 2005
Findknows
http://bob.name
?acq
tid = ( id
s,id
p,id
o )
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 13
query-localdataset
Individual Indexing
Logical representation
Physical representation
PS
SP
O
SOPO
Dict
● Idea: Index each descriptor objectseparately
● Implementation of the four operations:● Add, Remove, and Replace are straightforward● Find requires iterating over all indexes
PS
SP
O
SOPO
PS
SP
O
SOPO
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 14
query-localdataset
Individual Indexing
Logical representation
Physical representation
PS
SP
O
SOPO
Dict
● Idea: Index each descriptor objectseparately
● Implementation of the four operations:● Add, Remove, and Replace are straightforward● Find requires iterating over all indexes
PS
SP
O
SOPO
PS
SP
O
SOPO
Findknows
http://bob.name
?acq
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 15
query-localdataset
Combined Indexing
Logical representation
Physical representation
● Idea: Use a single indexfor all descriptor objects● src – maps each triple to a set
of descriptor object IDs
PS
SP
O
SOPO
Dict
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 16
query-localdataset
Combined Indexing
Logical representation
Physical representation
● Idea: Use a single indexfor all descriptor objects● src – maps each triple to a set
of descriptor object IDs
PS
SP
O
SOPO
Dict
tid = ( id
s,id
p,id
o )
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 17
query-localdataset
Combined Indexing
Logical representation
Physical representation
● Idea: Use a single indexfor all descriptor objects● src – maps each triple to a set
of descriptor object IDs
PS
SP
O
SOPO
Dict
tid = ( id
s,id
p,id
o )
+ src( t
id ) = { , }
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 18
query-localdataset
Quad Indexing
Logical representation
Physical representation
● Idea: Use a single quad indexfor all descriptor objects● quad = ID-encoded triple
+ descriptor object ID
PS
SP
O
SOPO
Dict
q = ( (ids,id
p,id
o) , )
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 19
Outline
1. Requirements + Existing Work
2. Data Structures
3. Evaluation
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 20
Experiment Setup
Does this affect the overall execution time forlink traversal based query executions ?
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 21
Experiment Setup
Does this affect the overall execution time forlink traversal based query executions ?
● Simulation of the Web of Data● Linked Data server publishes BSBM dataset (scal. factor: 50)● Adjusted BSBM queries link to the simulation server
● Experiment:● Sequence of 200 query mixes● Reuse of the query-local dataset for the whole sequence● IndIR, CombIR, and QuadIR (as presented), engine: SQUIN● NamedGraphSetImpl (NG4J/Jena), engine: SemWeb Client
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 22
Execution Time
0 20 40 60 80 100 120 140 160 180 2000
10
20
30
40
50
60
70
80
NG4J (SWClLib)IndIR, m=4CombIR, m=12CombQuadIR, m=12
query mix
exe
cutio
n tim
e in
se
cond
s
0 40 80 1201602000
500
1000
1500
2000
2500
query mix
ove
rall
n um
ber
of
desc
r.o
bje
cts
in t
he q
uerie
d da
tase
t
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 23
Execution Time
0 20 40 60 80 100 120 140 160 180 2000
10
20
30
40
50
60
70
80
NG4J (SWClLib)IndIR, m=4CombIR, m=12CombQuadIR, m=12
query mix
exe
cutio
n tim
e in
se
cond
s
0 40 80 1201602000
500
1000
1500
2000
2500
query mix
ove
rall
n um
ber
of
desc
r.o
bje
cts
in t
he q
uerie
d da
tase
t
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 24
Summary
● Three hash index based data structures:● Individually indexing● Combined indexing● Quad indexing
● Findings:● A single index improves query performance significantly● Smaller load times with quads
● Also for other use cases of ad hoc storing of Linked Data● Consecutively retrieved from remote sources● Used for immediate local processing
Olaf Hartig - How Caching Improves Efficiency and Result Completeness for Querying Linked Data 25
Backup Slides
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 26
query-localdataset
Combined Indexing
Logical representation
Physical representation
● Idea: Use a single indexfor all descriptor objects● src – maps each triple to a set
of descriptor object IDs
● Implementation of the four operations requires:● status – maps each descriptor object ID to a status
( BeingIndexed, Indexed, ToBeRemoved, BeingRemoved )
● Find reports a triple t only if: ∃d ∈ src(t) : status(d) = Indexed
PS
SP
O
SOPO
Dict
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 27
Implementation
● Available as Free Software (http://squin.org)
● Hash tables with n = 2m buckets
● Hash functions:
● hS( i
s, i
p, i
o ) = i
s & bitmask[m]
● hSP
( is, i
p, i
o ) = ( i
s • i
p ) & bitmask[m]
● etc.
● Which m ?* *see paper● m = 4 for the individual indexes● m = 12 for the combined index● m = 12 for the quad index
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 28
Experiment Setup
● Comparison without link traversal based query execution● Compared data structures:
● Our implementation of IndIR, CombIR, CombQuadIR● NamedGraphSetImpl in NG4J (Jena)● DatasetGraphMap in ARQ (Jena)
● Berlin SPARQL Benchmark (BSBM)● BSBM datasets partitioned into query-local datasets● BSBM (v2.0) query mixes executed over these datasets
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 29
Required Memory
BSBM scaling factor
number of descriptor
objects
overall numberof triples
50 2,599 22,616
100 4,178 40,133
150 5,756 57,524
200 7,329 75,062
250 9,873 97,613
300 11,455 115,217
350 13,954 137,567
500 18,687 190,502
Est
imat
ed r
equ
ire d
mem
ory
in M
B
0 100 200 300 400 5000
20
40
60
80
100
120ARQNG4JIndIR (m=4)CombIR (m=12)CombQuad (m=12)
pc (BSBM scaling factor)
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 30
Execution Time
BSBM scaling factor
number of descriptor
objects
overall numberof triples
50 2,599 22,616
100 4,178 40,133
150 5,756 57,524
200 7,329 75,062
250 9,873 97,613
300 11,455 115,217
350 13,954 137,567
500 18,687 190,502
aver
age
exec
uti
on
tim
e pe
r qu
e ry
mix
in s
eco n
ds
0 100 200 300 400 5000
500
1000
1500
2000
2500ARQNG4JIndIR (m=4)CombIR (m=12)CombQuad (m=12)
pc (BSBM scaling factor)
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 31
Load Time
BSBM scaling factor
number of descriptor
objects
overall numberof triples
50 2,599 22,616
100 4,178 40,133
150 5,756 57,524
200 7,329 75,062
250 9,873 97,613
300 11,455 115,217
350 13,954 137,567
500 18,687 190,502
aver
age
crea
tio
n t
ime
in s
eco n
ds
0 100 200 300 400 5000
1
2
3
4
5
6
7
8
9
10CombIR (m=12)IndIR (m=4)CombQuad (m=12)
pc (BSBM scaling factor)
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 32
Execution Time
0 20 40 60 80 100 120 140 160 180 2000
10
20
30
40
50
60
70
80
NG4J (SWClLib)IndIR, m=4CombIR, m=12CombQuadIR, m=12
query mix
exe
cutio
n tim
e in
se
cond
s
0 5 10 15 20 250
10
20
30
40
50
60
CombIR, m=12CombQuadIR, m=12
query mix
Olaf Hartig - A Main Memory Index Structure to Query Linked Data 33
These slides have been created byOlaf Hartig
http://olafhartig.de
This work is licensed under aCreative Commons Attribution-Share Alike 3.0 License
(http://creativecommons.org/licenses/by-sa/3.0/)