Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data
Medha Atre1, Vineet Chaoji2, Mohammed J. Zaki1, and James A. Hendler1
1 Dept. of Computer Science, Rensselaer Polytechnic Institute, Troy NY, USA
2 Yahoo! Labs, Bangalore, India
April 29, 2010
WWW 2010, Raleigh NC, USA
Overview
• Introduction
• Challenges
• Motivation
• BitMat structure
– Construction & operations
• Query processing algorithm
• Experimental evaluation
• Future roadmap
Introduction
• RDF (Resource Description Framework) for representing any information
– Triple form: [<subject> <predicate> <object>]
– Depicted as a directed edge
• RDF graphs of hundreds of millions of triples to a few billion triples are common nowadays
– DBPedia (103 million triples)
– UniProt (845 million triples)
– US Census (1 billion triples)
– Bio2RDF (2.3 billion triples)
– Data.gov (5+ billion triples)
[Figure: a triple drawn as a directed edge, Subject --Predicate--> Object]
Challenges – Storing RDF Data
• RDF graphs of more than a billion triples (400 GB+ on-disk size).
• Traditional DB based efforts
– Jena-TDB (custom indexes and storage)
– C-store
– MonetDB (open-source DB system)
• Exploit RDF data characteristics on top of DB storage
– Vertical partitioning: create separate predicate table for each predicate.
• Compression based techniques
– MonetDB and RDF-3X
Challenges – Querying RDF Data
• Limited main memory compared to disk space
• Large intermediate join tables
• Scans over a large percentage of indexes
– Even with aggressive indexing + compression, e.g., Hexastore, RDF-3X
• Optimizations
– Selectivity estimation in case of multiple level joins, left deep join tree
– Sideways (parallel) information passing for several merge-joins
– Semi-joins: reduce the database for a given join query
Motivation for this work
• SPARQL join queries can be broadly classified into 3 types:
1) Queries having highly selective triple patterns, e.g., (?s :residesIn USA)(?s :hasSSN “123-45-6789”)
• Existing techniques handle these queries very efficiently
2) Queries with low-selectivity triple patterns but highly selective results, e.g., (?s :residesIn China)(?s :citizenOf India)
3) Queries with low-selectivity triple patterns and low-selectivity results, e.g., (?s :residesIn USA)(?s :hasSSN ?y)
• Such queries involving multi-level joins can lead to large intermediate results
Our Contribution
• A compressed data structure, BitMat, to store the RDF data
• A join query algorithm which operates directly on the compressed data:
– No intermediate join tables; instead, use a 2-phase query algorithm
• First phase: prune the candidate RDF triples
• Second phase: stitch the final results directly from the pruned triples
– Can guarantee memory requirements at the beginning of the query
– Online/streaming result generation
BitMat Construction
• Conceptually construct a bit-cube of subject (S), predicate (P), object (O) dimensions
• Mapping dictionary:
– Vs: set of subjects, Vp: set of predicates, Vo: set of objects, Vso = Vs ∩ Vo
– Common subject and object URIs mapped to the same integer IDs 1 to |Vso|
– Subject-only URIs mapped to integer IDs |Vso|+1 to |Vs|
[Figure: bit-cube with S, P, and O dimensions; the S axis runs from 1 through Vso to Vs, and the O axis from 1 through Vso to Vo]
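The mapping dictionary can be illustrated with a minimal sketch. The function name `build_dictionary`, the alphabetical ordering of IDs, and the numbering of object-only values from |Vso|+1 to |Vo| are assumptions made for illustration; the slides only state the rules for common and subject-only URIs. The sample triples are the movie example used elsewhere in this deck.

```python
def build_dictionary(triples):
    """Map URIs/literals to integer IDs; shared subject/object values get one ID."""
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    objects = {o for _, _, o in triples}

    # Vso: values that occur as both subject and object take IDs 1..|Vso|.
    common = sorted(subjects & objects)
    s_ids = {value: i for i, value in enumerate(common, start=1)}
    o_ids = dict(s_ids)

    # Subject-only values continue from |Vso|+1 up to |Vs|.
    for value in sorted(subjects - objects):
        s_ids[value] = len(s_ids) + 1
    # Assumption: object-only values are numbered the same way, up to |Vo|.
    for value in sorted(objects - subjects):
        o_ids[value] = len(o_ids) + 1

    p_ids = {value: i for i, value in enumerate(sorted(predicates), start=1)}
    return s_ids, p_ids, o_ids


TRIPLES = [
    (":the_matrix", ":releasedIn", "1999"),
    (":the_thirteenth_floor", ":releasedIn", "1999"),
    (":the_matrix", ":similar_to", ":the_matrix_reloaded"),
    (":the_thirteenth_floor", ":similar_to", ":the_matrix"),
    (":the_matrix", "rdf:type", ":movie"),
    (":the_thirteenth_floor", "rdf:type", ":movie"),
]
s_ids, p_ids, o_ids = build_dictionary(TRIPLES)
```

Because `:the_matrix` appears as both subject and object, it receives the same ID in the S and O dimensions, which is what lets a subject-object join compare bit positions directly.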
BitMat Construction (continued)
• Slice along P dimension and store S-O and O-S BitMats
• Apply gap-encoding to each row of the BitMat before storing it
• Storage: 2|Vp| + |Vs| + |Vo| BitMats
• Additionally store condensed representation of rows and columns and number of triples in each of the 4 types of BitMats
[Figure: bit-cube for the sample triples, with S-dimension labels SO1, SO2; O-dimension labels SO1, SO2, O3, O4; P-dimension labels P1, P2, P3]
Sample triples:
Subject                  Predicate      Object
:the_matrix              :releasedIn    “1999”
:the_thirteenth_floor    :releasedIn    “1999”
:the_matrix              :similar_to    :the_matrix_reloaded
:the_thirteenth_floor    :similar_to    :the_matrix
:the_matrix              rdf:type       :movie
:the_thirteenth_floor    rdf:type       :movie
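Slicing and gap-encoding can be sketched as follows. This is an illustration under stated assumptions, not the authors' storage format: rows are plain Python lists, `gap_encode` uses one common form of gap (run-length) encoding that records the first bit followed by run lengths, and `ID_TRIPLES` uses hypothetical integer IDs for the movie example.

```python
def so_bitmat(id_triples, predicate, n_subjects, n_objects):
    """Slice the bit-cube at one predicate: rows are subjects, columns are objects."""
    rows = [[0] * n_objects for _ in range(n_subjects)]
    for s, p, o in id_triples:
        if p == predicate:
            rows[s - 1][o - 1] = 1  # IDs start at 1
    return rows


def gap_encode(bits):
    """Encode a bit row as (first_bit, [lengths of alternating runs])."""
    if not bits:
        return (0, [])
    runs, current, length = [], bits[0], 0
    for b in bits:
        if b == current:
            length += 1
        else:
            runs.append(length)
            current, length = b, 1
    runs.append(length)
    return (bits[0], runs)


def gap_decode(first_bit, runs):
    """Invert gap_encode."""
    bits, bit = [], first_bit
    for n in runs:
        bits.extend([bit] * n)
        bit ^= 1  # runs alternate between 0s and 1s
    return bits


# Hypothetical IDs: 2 subjects, 3 predicates, 4 objects.
ID_TRIPLES = [(1, 1, 2), (2, 1, 2), (1, 2, 4), (2, 2, 1), (1, 3, 3), (2, 3, 3)]
similar_to = so_bitmat(ID_TRIPLES, 2, n_subjects=2, n_objects=4)
encoded_rows = [gap_encode(row) for row in similar_to]
```

Sparse rows, which dominate in RDF data, collapse to a handful of run lengths, so the encoded form is much smaller than the raw bit row.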
Operations on BitMat
• Join algorithm uses two basic operations: fold & unfold
• fold(BitMat, dimension) returns bitArray
– Folds the input BitMat by retaining the given dimension
• unfold(BitMat, MaskBitArray, dimension)
– Unfolds MaskBitArray on the BitMat in the given dimension
• Fold & unfold operate by doing bitwise AND/OR operations on gap compressed bit-vectors
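A minimal sketch of fold and unfold on an uncompressed BitMat may help. The assumptions here are that a BitMat is a list of bit rows, that `dimension` is either "row" or "col", and that unfold clears bits in place; the slides state that the real operations work directly on gap-compressed bit-vectors, which this sketch does not attempt.

```python
def fold(bitmat, dimension):
    """Project the BitMat onto one dimension: bit i is 1 if row/column i has any set bit."""
    if dimension == "row":
        return [int(any(row)) for row in bitmat]
    return [int(any(col)) for col in zip(*bitmat)]


def unfold(bitmat, mask, dimension):
    """Clear, in place, every row/column of the BitMat whose mask bit is 0."""
    for i, row in enumerate(bitmat):
        for j in range(len(row)):
            keep = mask[i] if dimension == "row" else mask[j]
            row[j] &= keep


bm = [[0, 0, 0, 1],
      [1, 0, 0, 0]]
rows_before = fold(bm, "row")
cols_before = fold(bm, "col")
unfold(bm, [1, 0, 0, 0], "col")  # keep only column 1
rows_after = fold(bm, "row")
```

Fold corresponds to a bitwise OR across the discarded dimension, and unfold to a bitwise AND of each row (or column) with the mask.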
Query Processing Algorithm
• Build a constraint graph
• E.g., the query (?m rdf:type :movie)(?n rdf:type :movie)(?m :similar_to ?n) has the following constraint graph:
[Figure: constraint graph. Gjvar has join-variable nodes ?m and ?n. Gtp has triple-pattern nodes (?m rdf:type :movie), (?m :similar_to ?n), (?n rdf:type :movie), with edges labeled SS and SO]
• Each triple pattern has a BitMat containing only triples matching that triple pattern
• Propagate the constraints on join variable bindings imposed by each triple pattern
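The constraint graph for the example query can be sketched as below. The function `constraint_graph` and its edge representation are hypothetical; in particular, labeling a Gtp edge with the unordered pair of positions (S before P before O) is an assumption chosen so that the output matches the SS and SO labels in the figure.

```python
from itertools import combinations


def constraint_graph(patterns):
    """Return (join variables, G_jvar edges, G_tp edges) for a list of triple patterns."""
    # occurrences[var] = [(pattern index, 'S' | 'P' | 'O'), ...]
    occurrences = {}
    for i, pattern in enumerate(patterns):
        for position, term in zip("SPO", pattern):
            if term.startswith("?"):
                occurrences.setdefault(term, []).append((i, position))

    # A join variable occurs in more than one triple pattern.
    jvars = {v for v, occ in occurrences.items() if len({i for i, _ in occ}) > 1}

    # G_jvar: connect join variables that appear together in a triple pattern.
    jvar_edges = set()
    for pattern in patterns:
        present = sorted({t for t in pattern if t in jvars})
        jvar_edges.update(combinations(present, 2))

    # G_tp: connect triple patterns that share a join variable.
    tp_edges = set()
    for v in jvars:
        for (i, a), (j, b) in combinations(occurrences[v], 2):
            if i != j:
                label = "".join(sorted((a, b), key="SPO".index))
                tp_edges.add((i, j, v, label))
    return jvars, jvar_edges, tp_edges


PATTERNS = [
    ("?m", "rdf:type", ":movie"),
    ("?m", ":similar_to", "?n"),
    ("?n", "rdf:type", ":movie"),
]
jvars, jvar_edges, tp_edges = constraint_graph(PATTERNS)
```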
Phase 1 -- Pruning phase
1. Embed a tree on the subgraph Gjvar
2. Walk over this tree from root to leaves and back in BFS order
3. At each node in the tree over Gjvar, collect all the variable bindings from the BitMats of the triple patterns containing that variable (fold operation)
4. Do a bitwise AND of all folded bit-arrays obtained
5. Relay back the results of bitwise AND on the BitMats (unfold operation)
• Simple optimizations:
– Tree root selection: select as the root of the tree over Gjvar the join variable whose triple patterns have the fewest triples in their BitMats
– Early stopping: if at any point the result of the bitwise AND of folded bit-arrays is all zeros, stop, since the query has no results
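The pruning steps can be sketched with Python sets standing in for bit-arrays: collecting a variable's values plays the role of fold, set intersection plays the role of bitwise AND, and filtering plays the role of unfold. The function `prune`, its arguments, and the sample data are illustrative assumptions; the sketch also assumes each variable occurs at most once per triple pattern.

```python
from collections import deque


def prune(patterns, candidates, jvar_edges, root):
    """Return pruned candidate triples per pattern, or None if the query has no results."""
    candidates = [set(c) for c in candidates]  # work on a copy

    # BFS over G_jvar from the root gives the root-to-leaves visiting order.
    neighbours = {}
    for a, b in jvar_edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    order, queue, seen = [], deque([root]), {root}
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in sorted(neighbours.get(v, ())):
            if w not in seen:
                seen.add(w)
                queue.append(w)

    # Walk from root to leaves, then back towards the root.
    for var in order + order[-2::-1]:
        users = [(i, tp.index(var)) for i, tp in enumerate(patterns) if var in tp]
        # "fold": bindings of var in each pattern; "AND": their intersection.
        mask = set.intersection(*[{t[pos] for t in candidates[i]} for i, pos in users])
        if not mask:
            return None  # early stopping
        # "unfold": drop triples whose binding for var did not survive.
        for i, pos in users:
            candidates[i] = {t for t in candidates[i] if t[pos] in mask}
    return candidates


PATTERNS = [("?m", "rdf:type", ":movie"), ("?m", ":similar_to", "?n"), ("?n", "rdf:type", ":movie")]
MOVIES = {(":the_matrix", "rdf:type", ":movie"), (":the_thirteenth_floor", "rdf:type", ":movie")}
SIMILAR = {(":the_matrix", ":similar_to", ":the_matrix_reloaded"),
           (":the_thirteenth_floor", ":similar_to", ":the_matrix")}
pruned = prune(PATTERNS, [MOVIES, SIMILAR, MOVIES], {("?m", "?n")}, root="?m")
```

On this data the forward pass over ?n removes `:the_matrix_reloaded` (it is not typed as a movie), and the return pass over ?m then removes `:the_matrix` from the first pattern, leaving one candidate triple per pattern.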
Pruning phase
[Figure: animation of the pruning phase over the BitMats of (?m rdf:type :movie), (?m :similar_to ?n), and (?n rdf:type :movie); fold and unfold steps are applied for join variable ?m, then ?n, then ?m again]
In the reverse traversal, while propagating the effect of the join over “?n”, the fold of the 2nd BitMat yields the same bit-array as the mask bit-array of ?m before; hence there is no need to do fold/unfold again on the first BitMat.
Phase 2 -- Final result generation
• Resembles a multi-way join
• Start with the triple pattern with least number of triples left in its BitMat
• Generate bindings for variables in that triple pattern
• Next, select another triple pattern which shares a join variable with any of the previously selected triple patterns
• Check if it can generate the same bindings for the shared join variable and generate bindings for its other variables
• Continue in this way; at the end of a round, when all triple patterns have been processed and all variables have consistent bindings, output the result
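The result-generation steps can be sketched as a backtracking generator over the candidate triples left after pruning. This is a simplified illustration over sets of triples rather than BitMats; `generate_results`, `extend_bindings`, and the sample data are hypothetical names and values.

```python
def variables(pattern):
    return {t for t in pattern if t.startswith("?")}


def extend_bindings(pattern, triple, bindings):
    """Return bindings extended by this triple, or None if it conflicts."""
    new = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def generate_results(patterns, candidates):
    """Yield one dict of variable bindings per result, without join tables."""
    # Start with the pattern that has the fewest triples, then keep picking
    # a pattern that shares a variable with the ones already selected.
    remaining = sorted(range(len(patterns)), key=lambda i: len(candidates[i]))
    order, bound = [], set()
    while remaining:
        nxt = next((i for i in remaining if bound & variables(patterns[i])), remaining[0])
        remaining.remove(nxt)
        order.append(nxt)
        bound |= variables(patterns[nxt])

    def search(k, bindings):
        if k == len(order):
            yield bindings  # all patterns agree on every variable
            return
        i = order[k]
        for triple in candidates[i]:
            new = extend_bindings(patterns[i], triple, bindings)
            if new is not None:
                yield from search(k + 1, new)

    yield from search(0, {})


PATTERNS = [("?m", "rdf:type", ":movie"), ("?m", ":similar_to", "?n"), ("?n", "rdf:type", ":movie")]
MOVIES = {(":the_matrix", "rdf:type", ":movie"), (":the_thirteenth_floor", "rdf:type", ":movie")}
SIMILAR = {(":the_matrix", ":similar_to", ":the_matrix_reloaded"),
           (":the_thirteenth_floor", ":similar_to", ":the_matrix")}
results = list(generate_results(PATTERNS, [MOVIES, SIMILAR, MOVIES]))
```

Because results are yielded one at a time, memory use depends on the depth of the search (one binding set per triple pattern) rather than on the number of results, which mirrors the streaming behaviour described on the contribution slide.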
Final result generation
[Figure: animation of result generation over the three BitMats; a Var/Val table fills in values for ?a, ?b, ?c (such as :s1, :o2, :t3), and a row with consistent bindings is marked “Output this result”]
Sample query:
?a rdf:type :Person
?a :worksFor ?b
?c :departmentOf ?b
Evaluation setup
• Competing RDF stores:
– MonetDB
– RDF-3X
• Datasets:
– UniProt: Protein dataset with ~845M triples, ~147M subjects, 95 predicates, and ~128M objects
– LUBM: Synthetic university dataset with ~1.33B triples, ~217M subjects, 18 predicates, and ~161M objects
• Queries:
– UniProt: Queries published by UniProt dataset owners and RDF-3X
– LUBM: Queries published by OpenRDF
• Development environment:
– Dell Optiplex 755 PC, 3.0 GHz Intel E6850 Core 2 Duo Processor, 4 GB memory
– 7 GB swap space on 7200 rpm 1 TB disk
– 64 bit 2.6.28-15 Linux kernel (Ubuntu 9.04 distribution)
Results
• For queries with low-selectivity triple patterns, BitMat outperformed MonetDB and RDF-3X by 2-3 orders of magnitude
• For highly selective triple patterns, RDF-3X gave superior performance, especially for queries where sideways-information-passing (SIP) could benefit
• BitMat’s shortcomings in case of highly selective queries:
– The 2-phase query processing can create additional overheads for highly selective queries
– No cache memory optimization
– No memory mapping of disk files
UniProt, 845 million triples (time in sec)
                          Q1(4)        Q2(7)        Q3(8)        Q4(4)        Q5(3)        Q6(7)        Q7(2)       Q8(12)
Cold cache
BitMat                  451.365      269.526      173.324        9.396        78.35         1.34         9.33        13.06
MonetDB                  548.21      303.213      124.356         9.63        97.28        11.28         9.91        15.93
RDF-3X                  Aborted      525.125       224.58         1.38        4.636        0.902        0.892        1.353
Warm cache
BitMat                  440.868      263.071      168.673        8.305       77.442        0.448         8.36        10.87
MonetDB                  495.64       267.53      113.818        0.584        96.02        0.822        0.861        0.362
RDF-3X                  Aborted      487.182       226.05        0.077        1.008       0.0064        0.003         0.03
#Results            160,198,689   90,981,843   50,192,929            0      179,316            0            0           19
#Initial triples     92,965,468   73,618,481   78,840,372   16,626,073   60,260,006   15,408,126   16,625,901   53,677,336
More results in the paper
LUBM, 1.33 billion triples (time in sec)
                      Q1 (Circ)    Q2 (Star)    Q3 (Circ)    Q4 (Star)    Q5 (Star)           Q6
Cold cache
BitMat                    51.21         2.71         6.56         2.45        0.503         3.81
MonetDB                  548.21        27.17       455.23        34.12        18.89         14.6
RDF-3X                  Aborted       34.868     2324.753        0.588        0.425        1.129
Warm cache
BitMat                    48.57         2.11         1.94        0.686         0.27         2.85
MonetDB                   96.65         6.56       398.46        3.209        0.566        0.542
RDF-3X                  Aborted       29.033     2028.685       0.0024       0.0029       0.1814
#Results                  2,528   10,799,863            0           10           10          125
#Initial triples    165,397,764  224,805,759  219,416,877  438,912,513    3,000,966    9,100,649
Comparison of index storage space
           BitMat*     RDF-3X    MonetDB    Raw triples (uncompressed)
UniProt    51.2 GB     42 GB     16 GB      205 GB
LUBM       68.8 GB     70 GB     25 GB      451 GB
* Including LZ77 compressed dictionary mapping
Future Roadmap
• Does not allow a subset of variables to be specified by the SELECT clause
• Does not have the ability to process other classes of SPARQL queries, e.g., OPTIONAL, UNION, FILTER, etc.
• S-P or P-O dimensional joins not handled– Rare in assertional RDF data
• Cannot perform addition/deletion/update of triples
• Incorporate lazy-loading of BitMats to avoid overheads for highly selective queries
Thank you!