Compact Representation of Large RDF Data Sets for Publishing and Exchange

Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez


ISWC 2010 presentation. Mo

Transcript of Compact Representation of Large RDF Data Sets for Publishing and Exchange

Page 1: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez


Page 2: Compact Representation of Large RDF Data Sets for Publishing and Exchange

The Motivation

Large RDF data sets. Examples: Billion Triple 2010 (~3200M triples, 318 gzipped chunks, ~27 GB); Uniprot (~845M triples, 12 gzipped chunks, ~23 GB).

Syntaxes oriented mainly to represent documents: RDF/XML, N3, Turtle, JSON, etc. Document-centric vs. data-centric view.

Redundancy.

No structure (chunks).

Lack of metadata.

Sequentiality of the information.

Use?

Page 3: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Real World example: Billion Triple 2010

Where is the metadata? Who published this? Do I have all the data?

[Diagram: PUBLICATION, EXCHANGE and basic operations over RDF distributed as 318 gzip chunks.]

Page 4: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Needs

The aims of the format are: clean publication, metadata, compactness, efficient exchange, RDF compression, and basic data operations.

Page 5: Compact Representation of Large RDF Data Sets for Publishing and Exchange

HDT Overview

HDT: a logical decomposition of RDF, a philosophy of publication and exchange, and a compact RDF representation, based on three main components: Header, Dictionary and Triples.

Page 6: Compact Representation of Large RDF Data Sets for Publishing and Exchange

HDT Overview


Page 7: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Header

Metadata information about the RDF collection, answering the wh-questions (what, who, where, how, etc.): source and provider information, publication data, data set statistics, other information.

Information required to retrieve and process the represented data: location(s), format(s), encoding(s), etc.

Page 8: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Header use

[Diagram: RDF and HDT exchange flows, with boxes labelled Header and Dictionary & Triples and the [318] chunk count.]

Page 9: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Header in Practice

http://purl.org/HDT/hdt#

VoID, Dublin Core, etc.

SCOVO, SDMX, hdt

hdt

SWP
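As a rough illustration of the kind of Header these vocabularies allow, the sketch below emits a few VoID and Dublin Core statements as N-Triples lines. The data set URI, title and publisher are hypothetical, the statistics come from a toy triple list, and no terms from the hdt vocabulary are used because the transcript does not list them.

```python
VOID = "http://rdfs.org/ns/void#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def header_lines(dataset, title, publisher, triples):
    """Return N-Triples lines describing `dataset` (all inputs are illustrative)."""
    stats = {
        "triples": len(triples),
        "distinctSubjects": len({s for s, _, _ in triples}),
        "properties": len({p for _, p, _ in triples}),
        "distinctObjects": len({o for _, _, o in triples}),
    }
    lines = [
        f"<{dataset}> <{RDF_TYPE}> <{VOID}Dataset> .",
        # Assumes the title contains no quotes or characters needing escapes.
        f'<{dataset}> <{DCT}title> "{title}" .',
        f"<{dataset}> <{DCT}publisher> <{publisher}> .",
    ]
    for name, value in stats.items():
        lines.append(f'<{dataset}> <{VOID}{name}> "{value}"^^<{XSD_INTEGER}> .')
    return lines


# Hypothetical toy data, not taken from the slides.
TOY = [("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:knows", "ex:c")]
HEADER = header_lines(
    "http://example.org/dataset",
    "Toy data set",
    "http://example.org/provider",
    TOY,
)
```

Printing `HEADER` line by line gives a small, self-describing block that a consumer could read before deciding whether to fetch the rest of the data.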

Page 10: Compact Representation of Large RDF Data Sets for Publishing and Exchange

DBpedia example

Page 11: Compact Representation of Large RDF Data Sets for Publishing and Exchange

DBpedia Header

Page 12: Compact Representation of Large RDF Data Sets for Publishing and Exchange

DBpedia Header

Page 13: Compact Representation of Large RDF Data Sets for Publishing and Exchange

(Basic) HDT statistics

out-degree, $\deg^{-}(s)$: the number of triples of G in which s occurs as subject. $\deg^{-}(G)$ and $\overline{\deg^{-}}(G)$: the maximum and mean out-degree of G.

partial out-degree, $\deg^{--}(s, p)$: the number of triples of G in which s occurs as subject and p as predicate. $\deg^{--}(G)$ and $\overline{\deg^{--}}(G)$: the maximum and mean over G.

labeled out-degree, $\deg_{L}^{-}(s)$: the number of different predicates (labels) of G with which s is related as a subject in a triple of G. $\deg_{L}^{-}(G)$ and $\overline{\deg_{L}^{-}}(G)$: the maximum and mean over G.

subject-object ratio, $\alpha_{s\text{-}o}$: the proportion of common subjects and objects in the graph G, $\alpha_{s\text{-}o} = |S_G \cap O_G| \,/\, |S_G \cup O_G|$.

Symmetrically, in-degrees: $\deg^{+}(o)$, $\deg^{+}(G)$, etc.

Page 14: Compact Representation of Large RDF Data Sets for Publishing and Exchange

DBpedia example

out-degree(page1) = 4, partial out-degree(page1, #label) = 2, labeled out-degree(page1) = 3

out-degree(page2) = 2, labeled out-degree(page2) = 2

in-degree(page3) = 2, partial in-degree(page3, #broader) = 2, labeled in-degree(page3) = 1
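The degree statistics can be sketched in a few lines of Python. The graph behind the slide's figure is not in the transcript, so `TRIPLES` below is a hypothetical toy graph chosen only so that it reproduces the values listed for page1, page2 and page3. The function and variable names are illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical toy graph (subject, predicate, object); not the slide's data.
TRIPLES = [
    ("page1", "#label", "A"),
    ("page1", "#label", "B"),
    ("page1", "#broader", "page3"),
    ("page1", "#type", "Concept"),
    ("page2", "#broader", "page3"),
    ("page2", "#label", "C"),
]


def degree_stats(triples, node_pos):
    """node_pos=0 gives out-degrees (node as subject), node_pos=2 in-degrees."""
    degree = Counter()            # deg(node)
    partial = Counter()           # partial degree, keyed by (node, predicate)
    labels = defaultdict(set)     # distinct predicates seen with each node
    for triple in triples:
        node, pred = triple[node_pos], triple[1]
        degree[node] += 1
        partial[(node, pred)] += 1
        labels[node].add(pred)
    labeled = {node: len(preds) for node, preds in labels.items()}
    return degree, partial, labeled


def subject_object_ratio(triples):
    """|S ∩ O| / |S ∪ O| for the subject set S and object set O."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    return len(subjects & objects) / len(subjects | objects)


OUT_DEG, OUT_PARTIAL, OUT_LABELED = degree_stats(TRIPLES, 0)
IN_DEG, IN_PARTIAL, IN_LABELED = degree_stats(TRIPLES, 2)
MAX_OUT = max(OUT_DEG.values())
MEAN_OUT = sum(OUT_DEG.values()) / len(OUT_DEG)
```

In this toy graph no term is both a subject and an object, so the subject-object ratio is 0; a graph with many shared terms would push it towards 1.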

Page 15: Compact Representation of Large RDF Data Sets for Publishing and Exchange

HDT


Page 16: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Dictionary


In general terms, a data dictionary is a centralized repository of information about data.

Currently, in RDF formats: namespaces and prefixes

Currently, in Triple Stores: assigns a unique ID to each element in the data set


Page 17: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Dictionary in Practice

Subset distinction: (1) common subject-objects, (2) non-common subjects, (3) non-common objects, (4) predicates.

A list of strings matching the mapping of the four subsets, in order from (1) to (4). A reserved character is appended to the end of each string and of each vocabulary to delimit their size.
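A minimal sketch of this dictionary layout follows. Several details are assumptions rather than facts from the slides: terms are sorted lexicographically inside each subset, IDs start at 1 with the shared subject-objects first, subjects and objects use separate ID ranges after the shared block, and the reserved character is NUL. The toy triples are hypothetical.

```python
def build_dictionary(triples):
    """Split the terms into the four subsets named on the slide."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    shared = sorted(subjects & objects)              # (1) common subject-objects
    only_subjects = sorted(subjects - objects)       # (2) non-common subjects
    only_objects = sorted(objects - subjects)        # (3) non-common objects
    predicates = sorted({p for _, p, _ in triples})  # (4) predicates
    return shared, only_subjects, only_objects, predicates


def assign_ids(sections):
    """Assumed ID scheme: shared terms first, then role-specific terms, from 1."""
    shared, only_subjects, only_objects, predicates = sections
    subject_ids = {t: i for i, t in enumerate(shared + only_subjects, start=1)}
    object_ids = {t: i for i, t in enumerate(shared + only_objects, start=1)}
    predicate_ids = {t: i for i, t in enumerate(predicates, start=1)}
    return subject_ids, predicate_ids, object_ids


def serialize(sections, reserved="\x00"):
    """Terminate every string and every subset with the reserved character."""
    return "".join(
        "".join(term + reserved for term in section) + reserved
        for section in sections
    )


def encode(triples, ids):
    """Replace each term by its ID and sort the resulting ID triples."""
    subject_ids, predicate_ids, object_ids = ids
    return sorted(
        (subject_ids[s], predicate_ids[p], object_ids[o]) for s, p, o in triples
    )


# Hypothetical toy data, not taken from the slides.
TOY = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:a", "ex:name", '"Ann"'),
]
SECTIONS = build_dictionary(TOY)
IDS = assign_ids(SECTIONS)
ID_TRIPLES = encode(TOY, IDS)
```

Because `ex:b` is both a subject and an object, it receives the same ID in both roles, which is what lets a subject-object join compare IDs directly.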

Page 18: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Dictionary in Practice


Page 19: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Dictionary in Practice. Header configuration


Page 20: Compact Representation of Large RDF Data Sets for Publishing and Exchange

HDT


Page 21: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Triples


Contains the structure of the data after the ID replacement.

Page 22: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Compact Triples

Plain triples (subject predicate object, as IDs): 1 2 6 . 1 3 2 . 2 1 3 . 2 2 4 . 2 2 5 . 2 4 1 . 3 3 2 .

Subject grouping (adjacency lists): 1 2 6; 3 2 . 2 1 3; 2 4, 5; 4 1 . 3 3 2 .

Splitting into two streams, with a 0 closing each list:

Predicates: 2 3 0 1 2 4 0 3 0 (subject 1: 2 3 0; subject 2: 1 2 4 0; subject 3: 3 0)

Objects: 6 0 2 0 3 0 4 5 0 1 0 2 0
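The grouping and splitting steps can be sketched as follows, using the seven ID triples from the slide. The function name is illustrative, and the sketch assumes subject IDs are consecutive from 1 with at least one triple each, so that subjects can stay implicit.

```python
from itertools import groupby

# The ID triples shown on the slide.
ID_TRIPLES = [
    (1, 2, 6), (1, 3, 2),
    (2, 1, 3), (2, 2, 4), (2, 2, 5), (2, 4, 1),
    (3, 3, 2),
]


def compact_triples(id_triples):
    """Return the predicate and object streams, each list closed by a 0."""
    predicates, objects = [], []
    by_subject = groupby(sorted(id_triples), key=lambda t: t[0])
    for _, subject_group in by_subject:
        for pred, pred_group in groupby(subject_group, key=lambda t: t[1]):
            predicates.append(pred)
            objects.extend(o for _, _, o in pred_group)
            objects.append(0)      # end of the object list of this (s, p) pair
        predicates.append(0)       # end of the predicate list of this subject
    return predicates, objects


PREDICATES, OBJECTS = compact_triples(ID_TRIPLES)
```

The two streams match the ones on the slide: `PREDICATES` is `2 3 0 1 2 4 0 3 0` and `OBJECTS` is `6 0 2 0 3 0 4 5 0 1 0 2 0`.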

Page 23: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Bitmap Triples

Bitsequence-based reorganization of the Compact Triples streams:

Predicates: 2 3 0 1 2 4 0 3 0 (subject 1, subject 2, subject 3)

Objects: 6 0 2 0 3 0 4 5 0 1 0 2 0

The 0 separators are moved out of the streams into bit sequences:

Sp: 2 3 1 2 4 3, with bit sequence Bp

So: 6 2 3 4 5 1 2, with bit sequence Bo

[Diagram: step-by-step construction of Sp, Bp, So and Bo.]
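A small sketch of this reorganization follows. The bit convention is an assumption: here a 1-bit marks the last element of each list, which is one of the two natural choices (the other marks the first element), and the transcript does not preserve which one the slide uses. The sketch also assumes no list is empty.

```python
# Compact Triples streams from the slide.
PREDICATES = [2, 3, 0, 1, 2, 4, 0, 3, 0]
OBJECTS = [6, 0, 2, 0, 3, 0, 4, 5, 0, 1, 0, 2, 0]


def to_bitmap(stream):
    """Split a 0-terminated stream into (S, B).

    S keeps the non-zero IDs in order. B has one bit per ID, set to 1 on the
    last ID of each list (assumed convention). Empty lists are not supported.
    """
    sequence, bits = [], []
    for value in stream:
        if value == 0:
            bits[-1] = 1           # the previous ID closes its list
        else:
            sequence.append(value)
            bits.append(0)
    return sequence, bits


SP, BP = to_bitmap(PREDICATES)
SO, BO = to_bitmap(OBJECTS)
```

With this convention `SP` and `SO` are the ID sequences without separators, and the number of 1-bits in `BP` equals the number of subjects.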

Page 24: Compact Representation of Large RDF Data Sets for Publishing and Exchange

HDT Operations Over Bitmaps Triples

The Bitmap Triples representation allows an on-demand loading strategy that takes advantage of the structure indexed in Bp and Bo, accessible by fast rank/select operations.

For a sequence S of length n drawn from the alphabet Σ = {0,1}:

rank_a(S, i) counts the occurrences of a symbol a ∈ {0,1} in S[1, i].

select_a(S, i) finds the i-th occurrence of symbol a ∈ {0,1} in S.
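The two operations can be written naively as linear scans, which is enough to see what they return. Succinct data structures answer them much faster with small auxiliary indexes; that machinery is not shown here. Positions are 1-based to match the definitions, and returning 0 from `select` when there is no such occurrence is a choice made for this sketch.

```python
def rank(bits, a, i):
    """Number of occurrences of symbol a in bits[1..i] (1-based, inclusive)."""
    return sum(1 for bit in bits[:i] if bit == a)


def select(bits, a, i):
    """1-based position of the i-th occurrence of a, or 0 if there is none."""
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == a:
            seen += 1
            if seen == i:
                return pos
    return 0


# Example bit sequence (illustrative).
B = [0, 1, 0, 0, 1, 1]
```

For this `B`, `rank(B, 1, 5)` is 2 and `select(B, 1, 2)` is 5: the two operations are inverses in the sense that `rank(B, 1, select(B, 1, k)) == k` whenever the k-th 1-bit exists.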

Page 25: Compact Representation of Large RDF Data Sets for Publishing and Exchange

HDT Operations Over Bitmaps Triples

Algorithm 1. Check&Find operation for a triple (s,p,o).

The distribution of lists ensures an average cost in $O(\deg_{L}^{-}(G) + \deg^{--}(G))$.
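The body of Algorithm 1 is not part of the transcript, so the following is only a sketch in its spirit, not the authors' algorithm. It assumes S-P-O order and a bitmap convention in which a 1-bit marks the last element of each list. It locates the predicate list of s with select, scans it for p, then scans the matching object list for o, which is where the two degree terms in the cost come from. The `select1` helper is a naive linear scan.

```python
# Bitmap Triples for the example triples, under the assumed convention that
# a 1-bit marks the last element of each list.
SP = [2, 3, 1, 2, 4, 3]
BP = [0, 1, 0, 0, 1, 1]
SO = [6, 2, 3, 4, 5, 1, 2]
BO = [1, 1, 1, 0, 1, 1, 1]


def select1(bits, i):
    """1-based position of the i-th 1-bit; 0 if i <= 0 or no such bit exists."""
    if i <= 0:
        return 0
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        seen += bit
        if bit and seen == i:
            return pos
    return 0


def list_bounds(bits, k):
    """0-based half-open range of the k-th list (k >= 1), or None if absent."""
    end = select1(bits, k)
    if end == 0:
        return None
    return select1(bits, k - 1), end


def check_find(s, p, o):
    """True if the ID triple (s, p, o) is present."""
    pred_range = list_bounds(BP, s)
    if pred_range is None:
        return False
    for j in range(*pred_range):            # scan the predicate list of s
        if SP[j] == p:
            # The pair at 0-based index j owns the (j + 1)-th object list.
            obj_range = list_bounds(BO, j + 1)
            if obj_range is None:
                return False
            return o in SO[obj_range[0]:obj_range[1]]
    return False
```

For example, `check_find(2, 2, 5)` walks the predicate list `1 2 4` of subject 2, finds predicate 2, and then looks for 5 in the object list `4 5`.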

Page 26: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Triples in Practice. Header configuration


Page 27: Compact Representation of Large RDF Data Sets for Publishing and Exchange

HDT Bitmap Triples Compression (for exchange)

Text compression: gzip, bz2, PPM

Specific compression: Huffman (S), RRR (B)

HDT-Plain, HDT-Compress
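To illustrate the Huffman step named for the S sequences, here is a textbook Huffman coder applied to the predicate sequence Sp = 2 3 1 2 4 3. It is a generic sketch, not the HDT prototype's encoder; the function names are illustrative, and RRR for the bit sequences is not shown.

```python
import heapq
from collections import Counter

SP = [2, 3, 1, 2, 4, 3]


def huffman_code(symbols):
    """Build a prefix-free code {symbol: bit string} from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, code):
    return "".join(code[sym] for sym in symbols)


def decode(bits, code):
    inverse = {c: sym for sym, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:     # safe because the code is prefix-free
            out.append(inverse[current])
            current = ""
    return out


CODE = huffman_code(SP)
ENCODED = encode(SP, CODE)
```

On such a tiny input the gain is small; the technique pays off when a few IDs, typically predicates, dominate the sequence.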

Page 28: Compact Representation of Large RDF Data Sets for Publishing and Exchange

HDT Bitmap Triples Compression Results


Page 29: Compact Representation of Large RDF Data Sets for Publishing and Exchange

HDT Bitmap Triples Compression Results


Uniprot

Page 30: Compact Representation of Large RDF Data Sets for Publishing and Exchange

HDT And SPARQL

SPARQL can make use of some interesting features in HDT:

Subject-object JOIN resolution can profit from the common naming in the dictionary, as the elements are correctly and quickly localized in the top IDs.

Algorithm 1 can answer basic SPARQL ASK queries for the patterns (s,p,o), (s,?p,?o) and (s,p,?o).

Algorithm 1 can answer basic SPARQL CONSTRUCT queries for simple WHERE patterns (s,p,o), (s,?p,?o) and (s,p,?o). The result is an RDF HDT graph.

Note: the S-P-O adjacency list (AL) order is assumed. Algorithm 1 and the supported patterns vary for the alternative representations S-O-P AL, P-S-O AL, P-O-S AL, O-P-S AL and O-S-P AL.
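The three supported patterns all fix the subject, which is why an S-P-O adjacency list can answer them directly. The sketch below uses plain Python dictionaries instead of Bitmap Triples to show only the pattern logic; `match` and `ask` are illustrative names, not SPARQL engine APIs, and `None` stands for an unbound variable.

```python
ID_TRIPLES = [
    (1, 2, 6), (1, 3, 2),
    (2, 1, 3), (2, 2, 4), (2, 2, 5), (2, 4, 1),
    (3, 3, 2),
]


def build_spo_index(id_triples):
    """subject -> predicate -> set of objects (S-P-O adjacency lists)."""
    index = {}
    for s, p, o in id_triples:
        index.setdefault(s, {}).setdefault(p, set()).add(o)
    return index


def match(index, s, p=None, o=None):
    """Triples matching (s,p,o), (s,p,?o) or (s,?p,?o); a CONSTRUCT-like result."""
    by_pred = index.get(s, {})
    if p is None:
        if o is not None:
            raise ValueError("pattern (s,?p,o) is outside the S-P-O patterns")
        return [(s, q, x) for q in sorted(by_pred) for x in sorted(by_pred[q])]
    objs = by_pred.get(p, set())
    if o is None:
        return [(s, p, x) for x in sorted(objs)]
    return [(s, p, o)] if o in objs else []


def ask(index, s, p=None, o=None):
    """ASK-like answer: does at least one triple match the pattern?"""
    return bool(match(index, s, p, o))


INDEX = build_spo_index(ID_TRIPLES)
```

A pattern such as (?s,p,o) would need a different ordering, for example P-O-S, which is the point of the note about alternative representations.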

Page 31: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Conclusions

RDF publication and exchange at large scale are seriously compromised by the scalability drawbacks of current RDF formats: lack of structure, metadata information and native operations over the data.

HDT addresses these problems (producer, consumer): Header, Dictionary, and Triples.

Triples practice implementation: Compact (HDT-Plain), Compress (HDT-Compress, outperforms universal compressors), Check&Find (indexed access).

Page 32: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Future Work

Optimize prototype: open-source (soon at http://hdt.dcc.uchile.cl/)

RDF native storage: dynamic structures on secondary memory, solve SPARQL joins, multi-index (size tradeoff)

Extensions: N-Quads, SPARQL endpoints

Page 33: Compact Representation of Large RDF Data Sets for Publishing and Exchange

W3C Member Submission


Page 34: Compact Representation of Large RDF Data Sets for Publishing and Exchange

Thanks for your attention.

Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez

http://www.rdfhdt.com