Compact Representation of Large RDF Data Sets for Publishing and Exchange

Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez

The Motivation

Large RDF data sets

Syntaxes oriented mainly to representing documents: RDF/XML, N3, Turtle, JSON, etc.

Document-centric view versus data-centric view

Redundancy

No structure (chunks)

Lack of metadata

Sequentiality of the information

Use?

Examples: Billion Triple 2010 (~3200M triples, 318 gzipped chunks, ~27GB); Uniprot (~845M triples, 12 gzipped chunks, ~23GB)

Real World example: Billion Triple 2010

Where is the metadata? Who published this? Do I have all the data?

[Diagram: publication, exchange and basic operations over the data set, distributed as 318 gzipped RDF chunks]

Needs

The aims of the format are:

Clean publication

Metadata

Compactness

Efficient exchange

RDF compression

Basic data operations

HDT Overview

HDT

Logical decomposition of RDF, philosophy of publication and exchange, compact RDF representation

based on 3 main components: Header, Dictionary and Triples

Header

Metadata information about the RDF collection: Wh-questions (what, who, where, how, etc.)

Source and provider information; publication data; data set statistics; other information

Information required to retrieve and process the represented data: location(s), format(s), encoding(s), etc.

Header use

[Diagram: exchanging the data set as 318 RDF chunks versus as HDT, where the Header is separate from the Dictionary & Triples]

Header in Practice

http://purl.org/HDT/hdt#

Vocabularies: VoID, Dublin Core, etc.; SCOVO, SDMX, hdt; SWP

DBpedia example

DBpedia Header

(Basic) HDT statistics

Out-degree, $\deg^{-}(s)$: the number of triples of G in which s occurs as subject. Maximum and mean over G: $\deg^{-}(G)$, $\overline{\deg}^{-}(G)$.

Partial out-degree, $\deg^{--}(s, p)$: the number of triples of G in which s occurs as subject and p as predicate. Maximum and mean over G: $\deg^{--}(G)$, $\overline{\deg}^{--}(G)$.

Labeled out-degree, $\deg L^{-}(s)$: the number of different predicates (labels) of G with which s is related as a subject in a triple of G. Maximum and mean over G: $\deg L^{-}(G)$, $\overline{\deg L}^{-}(G)$.

Subject-object ratio, $\alpha_{s-o}$: the proportion of common subjects and objects in the graph G, $\alpha_{s-o} = |S_G \cap O_G| / |S_G \cup O_G|$.

Symmetrically, in-degrees: $\deg^{+}(o)$, $\deg^{+}(G)$, etc.
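
The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy triples are assumptions (the triples are hypothetical, chosen so that the resulting degrees agree with the numbers quoted in the DBpedia example), not part of the HDT prototype.

```python
from collections import Counter

def degree_stats(triples, node_index):
    """Degree statistics for one role: node_index=0 gives out-degrees
    (subjects), node_index=2 gives in-degrees (objects)."""
    triples = set(triples)  # a graph is a set, so duplicates do not count
    degree = Counter(t[node_index] for t in triples)
    partial = Counter((t[node_index], t[1]) for t in triples)
    # Each key of `partial` is a distinct (node, predicate) pair, so counting
    # keys per node gives the number of different predicates per node.
    labeled = Counter(node for node, _ in partial)
    return degree, partial, labeled

def max_and_mean(counter):
    """Maximum and mean of a degree table over the whole graph."""
    values = list(counter.values())
    return max(values), sum(values) / len(values)

def subject_object_ratio(triples):
    """|S ∩ O| / |S ∪ O| for the subjects S and objects O of the graph."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    return len(subjects & objects) / len(subjects | objects)

# Hypothetical triples (not taken from the slides).
TRIPLES = [
    ("page1", "#label", '"A"'),
    ("page1", "#label", '"B"'),
    ("page1", "#type", "#Page"),
    ("page1", "#broader", "page3"),
    ("page2", "#label", '"C"'),
    ("page2", "#broader", "page3"),
]

out_deg, out_partial, out_labeled = degree_stats(TRIPLES, 0)
in_deg, in_partial, in_labeled = degree_stats(TRIPLES, 2)
```

With these triples, `out_deg["page1"]` is 4, `out_partial[("page1", "#label")]` is 2 and `out_labeled["page1"]` is 3.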

DBpedia example

out-degree(page1) = 4; partial out-degree(page1, #label) = 2; labeled out-degree(page1) = 3

out-degree(page2) = 2; labeled out-degree(page2) = 2

in-degree(page3) = 2; partial in-degree(page3, #broader) = 2; labeled in-degree(page3) = 1

Dictionary

In general terms, a data dictionary is a centralized repository of information about data.

Currently, in RDF formats: namespaces and prefixes

Currently, in Triple Stores: assigns a unique ID to each element in the data set

Dictionary in Practice

Subset distinction:

(1) Common subject-objects

(2) The non-common subjects

(3) The non-common objects

(4) Predicates

List of strings matching the mapping of the four subsets, in order from (1) to (4). A reserved character is appended to the end of each string and each vocabulary to delimit their size.
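
A minimal sketch of this organization follows. Several details are assumptions made for illustration rather than facts stated on the slide: terms are sorted lexicographically inside each subset, IDs start at 1, common subject-objects take the lowest IDs in both the subject and the object mapping, and the reserved character is `\x00`. The function names are hypothetical.

```python
def build_dictionary(triples):
    """Split the terms of a graph into the four subsets and assign IDs."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    predicates = {p for _, p, _ in triples}

    sections = [
        sorted(subjects & objects),  # (1) common subject-objects
        sorted(subjects - objects),  # (2) non-common subjects
        sorted(objects - subjects),  # (3) non-common objects
        sorted(predicates),          # (4) predicates
    ]
    shared, only_s, only_o, preds = sections

    # Assumption: shared terms get the same (lowest) ID in both roles.
    subject_ids = {t: i for i, t in enumerate(shared + only_s, start=1)}
    object_ids = {t: i for i, t in enumerate(shared + only_o, start=1)}
    predicate_ids = {t: i for i, t in enumerate(preds, start=1)}
    return sections, subject_ids, object_ids, predicate_ids

RESERVED = "\x00"  # assumed reserved character

def serialize(sections, reserved=RESERVED):
    """Concatenate the four vocabularies; the reserved character closes
    every string and, once more, every vocabulary."""
    parts = []
    for section in sections:
        parts.extend(term + reserved for term in section)
        parts.append(reserved)
    return "".join(parts)
```

For the two triples `("a", "p", "b")` and `("b", "q", "c")`, the term `b` is the only common subject-object, so it receives ID 1 as a subject and as an object.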

Dictionary in Practice. Header configuration

Triples

Contains the structure of the data after the ID replacement.

Compact Triples

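Plain triples (after ID replacement):

1 2 6 . 1 3 2 . 2 1 3 . 2 2 4 . 2 2 5 . 2 4 1 . 3 3 2 .

Subject grouping (adjacency lists):

1 2 6; 3 2 . 2 1 3; 2 4, 5; 4 1 . 3 3 2 .

Splitting into two streams, with a 0 ending each list:

Predicates: 2 3 0 1 2 4 0 3 0

Objects: 6 0 2 0 3 0 4 5 0 1 0 2 0

In the Predicates stream, the first list (2 3) belongs to subject 1, the second (1 2 4) to subject 2 and the third (3) to subject 3.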
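
The grouping and splitting steps can be reproduced with a short Python sketch. The function name is hypothetical, and the sketch assumes subject IDs are consecutive from 1 so that subjects can stay implicit in the position of each predicate list.

```python
def compact_triples(id_triples):
    """Turn ID triples into the Predicates and Objects streams,
    in S-P-O order, with 0 marking the end of every list."""
    triples = sorted(set(id_triples))
    predicates, objects = [], []
    prev_s = prev_p = None
    for s, p, o in triples:
        if s != prev_s:
            if prev_s is not None:
                predicates.append(0)  # close the previous subject's predicates
                objects.append(0)     # close the previous (s, p) object list
            predicates.append(p)
        elif p != prev_p:
            objects.append(0)         # close the previous (s, p) object list
            predicates.append(p)
        objects.append(o)
        prev_s, prev_p = s, p
    if triples:
        predicates.append(0)
        objects.append(0)
    return predicates, objects

# The example triples from the slide.
EXAMPLE = [(1, 2, 6), (1, 3, 2), (2, 1, 3), (2, 2, 4),
           (2, 2, 5), (2, 4, 1), (3, 3, 2)]
pred_stream, obj_stream = compact_triples(EXAMPLE)
```

Running it on the slide's triples yields the same two streams as shown above.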

Bitmap Triples

Bitsequence-based reorganization

Compact Triples streams:

Predicates: 2 3 0 1 2 4 0 3 0

Objects: 6 0 2 0 3 0 4 5 0 1 0 2 0

Bitmap Triples: each stream is split into a sequence of IDs (Sp, So) and a bitsequence (Bp, Bo) with one bit per stream position, set to 1 where the stream holds a list-ending 0:

Sp: 2 3 1 2 4 3

Bp: 0 0 1 0 0 0 1 0 1

So: 6 2 3 4 5 1 2

Bo: 0 1 0 1 0 1 0 0 1 0 1 0 1
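
One reading of the example above is that Sp and So keep only the IDs of each Compact Triples stream, while Bp and Bo hold one bit per stream position, set to 1 at each list-ending 0. Under that assumption the reorganization is a one-pass split. The function name is hypothetical.

```python
def to_bitmap(stream):
    """Split a Compact Triples stream into (ID sequence, bitsequence).
    The bitsequence has a 1 wherever the stream holds a list-ending 0."""
    ids = [x for x in stream if x != 0]
    bits = [1 if x == 0 else 0 for x in stream]
    return ids, bits

# Streams from the slide example.
Sp, Bp = to_bitmap([2, 3, 0, 1, 2, 4, 0, 3, 0])
So, Bo = to_bitmap([6, 0, 2, 0, 3, 0, 4, 5, 0, 1, 0, 2, 0])
```

The number of 1-bits in Bp equals the number of subjects, and the number of 1-bits in Bo equals the number of (subject, predicate) pairs, that is, the length of Sp.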

HDT Operations Over Bitmap Triples

The Bitmap Triples representation allows an on-demand loading strategy that takes advantage of the structure indexed in Bp and Bo, accessible through fast rank/select operations.

For a sequence S of length n drawn from an alphabet Σ = {0,1}:

rank_a(S, i) counts the occurrences of a symbol a ∈ {0,1} in S[1,i].

select_a(S, i) finds the i-th occurrence of symbol a ∈ {0,1} in S.
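
The two operations can be written naively as below, mainly to pin down the 1-based indexing used in the definitions. This sketch runs in linear time; practical implementations rely on succinct auxiliary structures to make both operations fast, which is not shown here. Returning 0 when the i-th occurrence does not exist is a convention chosen for this sketch.

```python
def rank(bits, a, i):
    """Number of occurrences of symbol a in bits[1..i] (1-based, inclusive)."""
    return sum(1 for b in bits[:i] if b == a)

def select(bits, a, i):
    """1-based position of the i-th occurrence of a in bits, or 0 if absent."""
    count = 0
    for pos, b in enumerate(bits, start=1):
        if b == a:
            count += 1
            if count == i:
                return pos
    return 0

# Bp from the Bitmap Triples example.
Bp = [0, 0, 1, 0, 0, 0, 1, 0, 1]
```

For example, `select(Bp, 1, 2)` returns 7, the position where the second predicate list ends, and `rank(Bp, 1, 7)` returns 2, the number of lists closed up to that position.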

Algorithm 1. Check&Find operation for a triple (s,p,o).

The distribution of lists ensures an average cost in $O(\deg L^{-}(G) + \deg^{--}(G))$.
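
Algorithm 1 itself is not reproduced in this transcript. The following is one possible sketch of a Check&Find-style lookup, not the authors' algorithm. It assumes the S-P-O order and a layout in which Bp and Bo carry one bit per Compact Triples stream position, with a 1 at each list end. All function names are hypothetical, and `select1` is a naive linear scan.

```python
def select1(bits, i):
    """1-based position of the i-th 1-bit; select1(bits, 0) is defined as 0."""
    if i == 0:
        return 0
    count = 0
    for pos, b in enumerate(bits, start=1):
        count += b
        if b == 1 and count == i:
            return pos
    raise IndexError("fewer than i one-bits")

def list_range(bits, k):
    """0-based half-open range, inside the ID sequence, of the k-th list."""
    start = select1(bits, k - 1) + 1  # first stream position of list k
    end = select1(bits, k) - 1        # last stream position of list k
    # k - 1 terminators precede the list, so shift stream positions by that.
    return start - k, end - (k - 1)

def check_and_find(Sp, Bp, So, Bo, s, p=None, o=None):
    """Triples matching (s,p,o), (s,p,?o) or (s,?p,?o); None is a variable."""
    if s < 1 or s > sum(Bp):
        return []
    results = []
    lo, hi = list_range(Bp, s)
    for j in range(lo, hi):                   # scan the predicates of s
        if p is not None and Sp[j] != p:
            continue
        o_lo, o_hi = list_range(Bo, j + 1)    # object list of the pair (s, Sp[j])
        for k in range(o_lo, o_hi):           # scan its objects
            if o is None or So[k] == o:
                results.append((s, Sp[j], So[k]))
    return results

# Bitmap Triples from the slide example.
Sp = [2, 3, 1, 2, 4, 3]
Bp = [0, 0, 1, 0, 0, 0, 1, 0, 1]
So = [6, 2, 3, 4, 5, 1, 2]
Bo = [0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
```

When p and o are bound, the lookup scans the predicate list of s and then a single object list, which is in line with the cost bound quoted above.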

Triples in Practice. Header configuration

HDT Bitmap Triples Compression (for exchange)

Text compression: gzip, bz2, PPM

Specific compression: Huffman (S), RRR (B)

HDT-Plain, HDT-Compress

HDT Bitmap Triples Compression Results

Uniprot

HDT And SPARQL

SPARQL can make use of some interesting features in HDT:

Subject-object JOIN resolution can profit from the common naming in the dictionary, as the elements are correctly and quickly located in the top IDs.

Algorithm 1 can answer basic SPARQL ASK queries for the patterns (s,p,o), (s,?p,?o) and (s,p,?o).

Algorithm 1 can answer basic SPARQL CONSTRUCT queries for simple WHERE patterns (s,p,o), (s,?p,?o) and (s,p,?o). The result is an RDF HDT graph.

Note: The S-P-O Adjacency List (AL) order is assumed. Algorithm 1 and the patterns it answers vary for the alternative representations S-O-P AL, P-S-O AL, P-O-S AL, O-P-S AL and O-S-P AL.

Conclusions

RDF publication and exchange at large scale are seriously compromised by the scalability drawbacks of current RDF formats: lack of structure, metadata information and native operations over the data

HDT addresses these problems (producer, consumer): Header, Dictionary, and Triples

Triples practical implementation: Compact (HDT-Plain); Compress (HDT-Compress, outperforms universal compressors); Check&Find (indexed access)

Future Work

Optimize prototype

Open-source (soon at http://hdt.dcc.uchile.cl/)

RDF native storage

Dynamic structures on secondary memory

Solve SPARQL joins

Multi-index (size tradeoff)

Extensions

N-Quads

SPARQL endpoints



W3C Member Submission

Thanks for your attention.

Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez

http://www.rdfhdt.com
