Compact Representation of Large RDF Data Sets for Publishing and Exchange
Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez
The Motivation
Large RDF data sets are published in syntaxes oriented mainly to representing documents: RDF/XML, N3, Turtle, JSON, etc.
The view is document-centric rather than data-centric:
Redundancy
No structure (chunks)
Lack of metadata
Sequentiality of the information
Use?
Examples: Billion Triple 2010 (~3200M triples, 318 gzipped chunks, ~27GB); Uniprot (~845M triples, 12 gzipped chunks, ~23GB)
Real World example: Billion Triple 2010
Where is the metadata? Who published this? Do I have all the data?
[Figure: publication, exchange and basic operations with plain RDF; the data set travels as 318 gzipped RDF chunks.]
Needs
The aims of the format are:
Clean publication
Metadata
Compactness
Efficient exchange
RDF compression
Basic data operations
HDT Overview
HDT
Logical decomposition of RDF, philosophy of publication and exchange, compact RDF representation
based on 3 main components: Header, Dictionary and Triples
Header
Metadata about the RDF collection, answering the wh-questions (what, who, where, how, etc.):
Source and provider information
Publication data
Data set statistics
Other information
Information required to retrieve and process the represented data: location(s), format(s), encoding(s), etc.
Header use
[Figure: header use in exchange, contrasting plain RDF ([318] chunks) with HDT (the Header, then Dictionary & Triples).]
Header in Practice
http://purl.org/HDT/hdt#
VoID, Dublin Core, etc.
SCOVO, SDMX, hdt
hdt
SWP
DBPedia example
DBPedia Header
(Basic) HDT statistics
out-degree, $\deg^{-}(s)$: the number of triples of G in which s occurs as subject. Maximum and mean over G: $\deg^{-}(G)$, $\overline{\deg^{-}}(G)$
partial out-degree, $\deg^{--}(s,p)$: the number of triples of G in which s occurs as subject and p as predicate. Maximum and mean over G: $\deg^{--}(G)$, $\overline{\deg^{--}}(G)$
labeled out-degree, $\deg_{L}^{-}(s)$: the number of different predicates (labels) of G with which s is related as a subject in a triple of G. Maximum and mean over G: $\deg_{L}^{-}(G)$, $\overline{\deg_{L}^{-}}(G)$
subject-object ratio, $\alpha_{s-o}$: the proportion of common subjects and objects in the graph G, $\alpha_{s-o} = \frac{|S_G \cap O_G|}{|S_G \cup O_G|}$
Symmetrically, in-degrees: $\deg^{+}(o)$, $\deg^{+}(G)$, etc.
DBPedia example
out-degree(page1) = 4, partial out-degree(page1, #label) = 2, labeled out-degree(page1) = 3
out-degree(page2) = 2, labeled out-degree(page2) = 2
in-degree(page3) = 2, partial in-degree(page3, #broader) = 2, labeled in-degree(page3) = 1
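The degree definitions and the example values above can be checked with a short sketch. The example figure is not part of this transcript, so the triples below are hypothetical: they were chosen only so that they reproduce the listed values, and names such as cat1 and #subject are invented for illustration.

```python
# Hypothetical triples (NOT from the slides), chosen to reproduce the
# degree values listed for page1, page2 and page3.
TRIPLES = [
    ("page1", "#label", '"a"'),
    ("page1", "#label", '"b"'),
    ("page1", "#broader", "page3"),
    ("page1", "#subject", "cat1"),
    ("page2", "#label", '"c"'),
    ("page2", "#broader", "page3"),
    ("page3", "#label", '"d"'),
]

SUBJECT, OBJECT = 0, 2  # tuple position: out-degrees use 0, in-degrees use 2


def degree(triples, node, pos=SUBJECT):
    """Number of triples in which node occurs at the given position."""
    return sum(1 for t in triples if t[pos] == node)


def partial_degree(triples, node, predicate, pos=SUBJECT):
    """Number of triples with node at the given position and this predicate."""
    return sum(1 for t in triples if t[pos] == node and t[1] == predicate)


def labeled_degree(triples, node, pos=SUBJECT):
    """Number of different predicates related to node at the given position."""
    return len({t[1] for t in triples if t[pos] == node})


def max_and_mean_out_degree(triples):
    """Maximum and mean out-degree over all subjects of the graph."""
    degrees = [degree(triples, s) for s in {t[0] for t in triples}]
    return max(degrees), sum(degrees) / len(degrees)


def subject_object_ratio(triples):
    """|S ∩ O| / |S ∪ O| for the subject set S and the object set O."""
    subjects = {t[0] for t in triples}
    objects = {t[2] for t in triples}
    return len(subjects & objects) / len(subjects | objects)
```

The in-degree variants reuse the same functions with pos=OBJECT, mirroring the "symmetrically" remark in the definitions.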
Dictionary
In general terms, a data dictionary is a centralized repository of information about data.
Currently, in RDF formats: namespaces and prefixes
Currently, in Triple Stores: assigns a unique ID to each element in the data set
Dictionary in Practice
Subset distinction:
(1) Common subject-objects
(2) Non-common subjects
(3) Non-common objects
(4) Predicates
The dictionary is a list of strings that follows the mapping of the four subsets, in order from (1) to (4). A reserved character is appended to the end of each string and of each vocabulary to delimit their size.
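A minimal sketch of this layout follows, under several assumptions that the slides do not confirm: the reserved character is taken to be NUL, each subset is sorted lexicographically, and IDs start at 1 with the common subject-objects taking the lowest IDs in both the subject and the object role. The sample triples and the ex: names are hypothetical.

```python
SEP = "\x00"  # assumed reserved character; the slides do not name it

# Hypothetical sample triples (subject, predicate, object).
SAMPLE = [
    ("ex:a", "ex:p", "ex:b"),
    ("ex:b", "ex:q", "ex:c"),
    ("ex:a", "ex:q", '"lit"'),
]


def build_sections(triples):
    """Return the four subsets, in order (1) to (4), each sorted."""
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    objects = {o for _, _, o in triples}
    return [
        sorted(subjects & objects),  # (1) common subject-objects
        sorted(subjects - objects),  # (2) non-common subjects
        sorted(objects - subjects),  # (3) non-common objects
        sorted(predicates),          # (4) predicates
    ]


def serialize(sections):
    """Concatenate the strings; SEP ends each string and each vocabulary."""
    parts = []
    for section in sections:
        parts.extend(term + SEP for term in section)
        parts.append(SEP)  # closes the vocabulary
    return "".join(parts)


def assign_ids(sections):
    """Assumed ID scheme: common terms share IDs 1..n in both roles."""
    shared, only_subjects, only_objects, predicates = sections
    subject_ids = {t: i for i, t in enumerate(shared + only_subjects, start=1)}
    object_ids = {t: i for i, t in enumerate(shared + only_objects, start=1)}
    predicate_ids = {t: i for i, t in enumerate(predicates, start=1)}
    return subject_ids, predicate_ids, object_ids
```

With this scheme a term that is both subject and object (ex:b in the sample) gets the same ID in both roles, which is what lets subject-object joins compare IDs directly.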
Dictionary in Practice. Header configuration
Triples
Contains the structure of the data after the ID replacement.
[Figure: the ID-based triples encoded as a Predicates stream and an Objects stream.]
Compact Triples
Plain triples (IDs): 1 2 6 . 1 3 2 . 2 1 3 . 2 2 4 . 2 2 5 . 2 4 1 . 3 3 2 .
Subject grouping (adjacency lists): 1 2 6; 3 2 . 2 1 3; 2 4, 5; 4 1 . 3 3 2 .
Splitting into two streams (0 closes each list):
Predicates: 2 3 0 1 2 4 0 3 0 (subject 1: 2 3 0; subject 2: 1 2 4 0; subject 3: 3 0)
Objects: 6 0 2 0 3 0 4 5 0 1 0 2 0
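The grouping and splitting steps can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation; it assumes subject IDs are consecutive from 1, so the subject is implicit in the position of its predicate list.

```python
from itertools import groupby

# ID triples from the Compact Triples slide.
TRIPLES = [(1, 2, 6), (1, 3, 2), (2, 1, 3), (2, 2, 4), (2, 2, 5), (2, 4, 1), (3, 3, 2)]


def compact_triples(triples):
    """Return the (predicates, objects) streams; 0 closes each list."""
    predicates, objects = [], []
    ordered = sorted(triples)  # subject, then predicate, then object order
    for _, by_subject in groupby(ordered, key=lambda t: t[0]):
        for pred, by_pred in groupby(by_subject, key=lambda t: t[1]):
            predicates.append(pred)
            objects.extend(o for _, _, o in by_pred)
            objects.append(0)  # end of the object list of this (s, p) pair
        predicates.append(0)  # end of the predicate list of this subject
    return predicates, objects
```

For the slide's triples this yields the Predicates stream 2 3 0 1 2 4 0 3 0 and the Objects stream 6 0 2 0 3 0 4 5 0 1 0 2 0.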
Bitmap Triples
Bitsequence-based reorganization of Compact Triples:
Predicates: 2 3 0 1 2 4 0 3 0
Objects: 6 0 2 0 3 0 4 5 0 1 0 2 0
Each stream is split into a sequence of IDs (Sp, So) and a bitsequence (Bp, Bo) that marks the list boundaries:
Sp: 2 3 1 2 4 3
So: 6 2 3 4 5 1 2
[Figure: bitsequences Bp and Bo for subjects 1, 2 and 3.]
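A small sketch of the reorganization is shown below. The bit values of the original figure are not recoverable from this transcript, so the convention used here is an assumption: a 1 bit marks the last element of each list. The slides may use a different convention, for example marking the first element.

```python
# Compact Triples streams from the slide (0 closes each list).
PREDICATES = [2, 3, 0, 1, 2, 4, 0, 3, 0]
OBJECTS = [6, 0, 2, 0, 3, 0, 4, 5, 0, 1, 0, 2, 0]


def to_bitmap(stream):
    """Split a 0-terminated stream into (sequence, bitsequence).

    Assumed convention: bit 1 marks the last element of a list.
    Lists are assumed to be non-empty, as they are in Compact Triples.
    """
    sequence, bits = [], []
    for value in stream:
        if value == 0:
            bits[-1] = 1  # the previous ID closed its list
        else:
            sequence.append(value)
            bits.append(0)
    return sequence, bits


SP, BP = to_bitmap(PREDICATES)
SO, BO = to_bitmap(OBJECTS)
```

Under this convention the separators disappear from the ID sequences and each bitsequence has exactly one 1 bit per list, which is what rank and select operate on.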
HDT Operations Over Bitmap Triples
The Bitmap Triples representation allows an on-demand loading strategy that takes advantage of the structure indexed in Bp and Bo, which is accessible through fast rank/select operations.
For a sequence S of length n drawn from the alphabet $\Sigma = \{0,1\}$:
$\mathrm{rank}_a(S,i)$ counts the occurrences of a symbol $a \in \{0,1\}$ in $S[1,i]$.
$\mathrm{select}_a(S,i)$ finds the position of the i-th occurrence of symbol $a \in \{0,1\}$ in S.
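The two operations can be illustrated with a naive sketch that scans the bitsequence in linear time. It only shows the semantics; practical implementations rely on succinct index structures to answer both queries much faster. Positions are 1-based to match the definitions, and returning 0 for a missing occurrence is a choice made here, not something the slides specify.

```python
def rank(bits, a, i):
    """Count the occurrences of symbol a in bits[1..i] (1-based, inclusive)."""
    return sum(1 for b in bits[:i] if b == a)


def select(bits, a, i):
    """Return the 1-based position of the i-th occurrence of a, or 0 if none."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == a:
            seen += 1
            if seen == i:
                return pos
    return 0


# Illustrative bitsequence (an assumption, not taken from the slides).
B = [0, 1, 0, 0, 1, 1]
```

On B, rank(B, 1, 4) counts a single 1 among the first four bits, and select(B, 1, 2) locates the second 1 at position 5; the two operations are inverses in the sense that rank(B, 1, select(B, 1, k)) equals k whenever the k-th 1 exists.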
Algorithm 1. Check&Find operation for a triple (s,p,o).
The distribution of the lists ensures an average cost in $O(\deg_{L}^{-}(G) + \deg^{--}(G))$.
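The algorithm itself was shown as a figure and is missing from this transcript, so the following is only a sketch of how such a lookup could work, not the authors' Algorithm 1. It assumes the S-P-O order, 0-based array indices, and a 1 bit marking the last element of each list; the Bp and Bo values are derived under that assumption from the example streams.

```python
# Bitmap Triples for the running example (bit values are an assumption).
SP = [2, 3, 1, 2, 4, 3]
BP = [0, 1, 0, 0, 1, 1]
SO = [6, 2, 3, 4, 5, 1, 2]
BO = [1, 1, 1, 0, 1, 1, 1]


def select1(bits, k):
    """0-based index of the k-th 1 bit; -1 for k == 0; None if it is absent."""
    if k == 0:
        return -1
    seen = 0
    for idx, b in enumerate(bits):
        seen += b
        if b and seen == k:
            return idx
    return None


def list_bounds(bits, k):
    """Inclusive 0-based bounds of the k-th list (k >= 1), or None."""
    if k < 1:
        return None
    end = select1(bits, k)
    if end is None:
        return None
    return select1(bits, k - 1) + 1, end


def check_and_find(s, p, o):
    """Return the index of o in SO if the triple (s, p, o) exists, else None."""
    bounds = list_bounds(BP, s)  # predicate list of subject s
    if bounds is None:
        return None
    for i in range(bounds[0], bounds[1] + 1):
        if SP[i] == p:
            # The pair at index i of SP owns the (i + 1)-th object list.
            lo, hi = list_bounds(BO, i + 1)
            for j in range(lo, hi + 1):
                if SO[j] == o:
                    return j
            return None
    return None
```

The two scans touch one predicate list and one object list, which is consistent with a cost bounded by the labeled out-degree plus the partial out-degree.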
Triples in Practice. Header configuration
HDT Bitmap Triples Compression (for exchange)
Text compression: gzip, bz2, PPM
Specific compression: Huffman (S), RRR (B)
HDT-Plain HDT-Compress
HDT Bitmap Triples Compression Results
[Figure: compression results; one panel is labeled Uniprot.]
HDT And SPARQL
SPARQL can make use of some interesting features in HDT:
Subject-object JOIN resolution can profit from the common naming in the dictionary, as the shared elements are located correctly and quickly among the top IDs.
Algorithm 1 can answer basic SPARQL ASK queries for the patterns (s,p,o), (s,?p,?o) and (s,p,?o).
Algorithm 1 can answer basic SPARQL CONSTRUCT queries for simple WHERE patterns (s,p,o), (s,?p,?o) and (s,p,?o). The result is an RDF HDT graph.
Note: The S-P-O Adjacency List (AL) order is assumed. Algorithm 1 and the supported patterns vary for the alternative representations S-O-P AL, P-S-O AL, P-O-S AL, O-P-S AL and O-S-P AL.
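The three patterns can be sketched over the Bitmap Triples of the running example. This is a hedged illustration rather than the authors' method: it decodes the lists eagerly instead of using rank/select, it assumes the S-P-O order, and it assumes that a 1 bit closes each list. The function names match and ask are hypothetical.

```python
# Bitmap Triples for the running example (bit values are an assumption).
SP = [2, 3, 1, 2, 4, 3]
BP = [0, 1, 0, 0, 1, 1]
SO = [6, 2, 3, 4, 5, 1, 2]
BO = [1, 1, 1, 0, 1, 1, 1]


def lists(sequence, bits):
    """Cut a sequence into lists; a 1 bit closes a list (assumed convention)."""
    out, current = [], []
    for value, bit in zip(sequence, bits):
        current.append(value)
        if bit:
            out.append(current)
            current = []
    return out


def match(s, p=None, o=None):
    """Triples matching (s,p,o), (s,p,?o) or (s,?p,?o); None is a variable."""
    pred_lists = lists(SP, BP)  # one predicate list per subject
    obj_lists = lists(SO, BO)   # one object list per (subject, predicate) pair
    if not 1 <= s <= len(pred_lists):
        return []
    first_pair = sum(len(pl) for pl in pred_lists[:s - 1])
    result = []
    for offset, pred in enumerate(pred_lists[s - 1]):
        if p is not None and pred != p:
            continue
        for obj in obj_lists[first_pair + offset]:
            if o is None or obj == o:
                result.append((s, pred, obj))
    return result


def ask(s, p=None, o=None):
    """ASK semantics: true if at least one triple matches the pattern."""
    return bool(match(s, p, o))
```

The list returned by match plays the role of a CONSTRUCT result for the same pattern; turning it back into an HDT graph would require re-encoding the triples, which this sketch leaves out.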
Conclusions
RDF publication and exchange at large scale are seriously compromised by the scalability drawbacks of current RDF formats: lack of structure, lack of metadata information and lack of native operations over the data.
HDT addresses these problems (producer, consumer):
Header, Dictionary, and Triples
Practical implementation of Triples:
Compact (HDT-Plain)
Compress (HDT-Compress, outperforms universal compressors)
Check&Find (indexed access)
Future Work
Optimize prototype
Open-source (soon at http://hdt.dcc.uchile.cl/)
RDF native storage
Dynamic structures on secondary memory
Solve SPARQL joins
Multi-Index (size tradeoff)
Extensions: N-Quads, SPARQL endpoints
W3C Member Submission
Thanks for your attention.
Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez
http://www.rdfhdt.com [email protected]