SHARD Triple-Store - Information Services and...
![Page 1: SHARD Triple-Store - Information Services and Technologyrohloff/papers/2010/SHARD_Rohloff_Kurt_HadoopWorld_2010.pdf · SHARD Triple-Store: Tools for Web-Scale SemWeb Kurt Rohloff](https://reader030.fdocuments.in/reader030/viewer/2022041122/5d14cc9a88c993e8108b8214/html5/thumbnails/1.jpg)
SHARD Triple-Store:
Tools for Web-Scale SemWeb
Kurt Rohloff
@avometric
Many thanks to:
Mike Dean, Ian Emmons, Gail Mitchell,
Doug Reid, Rick Schantz from BBN
Hanspeter Pfister from Harvard SEAS
Phil Zeyliger from Cloudera
Prakash Manghwani
Semantic Web / Graph Data
• Vision from Tim Berners-Lee at W3C.
• Create a web of data
– Support use by intelligent agents.
– Data described using ontologies.
– Data represented as digraphs.
– “Web 3.0.”
• Emerging commercially
– Use by NYTimes, BBC, Pharma, …
– Numerous startups.
– Oracle, MySQL have SemWeb support.
• Government use…
Object Graph Example
[Figure: object graph. Nodes: BBN, elmer, Cambridge, Massachusetts, US, “BBN Technologies”, “Tad Elmer”, and the classes Company, Organization, Person, City, State, Country. Edge labels: rdf:type, rdfs:subClassOf, name, president, headquarters, locatedIn.]
SemWeb Layer Cake
Knowledge
Storage
Querying
Reasoning
W3C Resource Description Framework (RDF)
[Figure: a statement drawn as an edge from subject to object, labeled with the predicate.]
• RDF graph is made up of individual statements.
• Subject and predicate are Uniform Resource Identifiers (URIs).
• You can also make statements about statements (e.g. timestamp, confidence).
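As a rough illustration of the points above, the sketch below holds statements as (subject, predicate, object) tuples and attaches a confidence to one of them using the RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`). URIs are shortened to bare names, and the `Triple` class, the `reify` helper, the `stmt1` identifier, the `confidence` predicate and its value are all assumptions made for this example.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Statements about BBN, with URIs shortened to bare names for readability.
graph: List[Triple] = [
    Triple("BBN", "rdf:type", "Company"),
    Triple("BBN", "name", '"BBN Technologies"'),
    Triple("BBN", "headquarters", "Cambridge"),
    Triple("BBN", "president", "elmer"),
]


def reify(statement_id: str, t: Triple) -> List[Triple]:
    """Describe triple t as a resource so other statements can refer to it."""
    return [
        Triple(statement_id, "rdf:type", "rdf:Statement"),
        Triple(statement_id, "rdf:subject", t.subject),
        Triple(statement_id, "rdf:predicate", t.predicate),
        Triple(statement_id, "rdf:object", t.obj),
    ]


# A statement about a statement: "stmt1", "confidence" and "0.9" are
# hypothetical names and values chosen only for illustration.
graph += reify("stmt1", graph[3])
graph.append(Triple("stmt1", "confidence", '"0.9"'))

for triple in graph:
    print(triple.subject, triple.predicate, triple.obj)
```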
RDF/XML
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://example.org/business-ont#">
  <Company rdf:ID="BBN">
    <name>BBN Technologies</name>
    <headquarters rdf:resource="http://www.state.ma.us/cities#Cambridge"/>
    <president rdf:resource="http://www.bbn.com/management#elmer"/>
  </Company>
</rdf:RDF>
[Figure: the same statements as a graph: BBN name “BBN Technologies”; BBN headquarters Cambridge; BBN president elmer.]
SPARQL Query
All people who own a car made in Detroit:
SELECT ?person
WHERE {
  ?person :owns ?car .
  ?car a :Car .
  ?car :madeIn :Detroit .
}
[Figure: the query drawn as a graph pattern: ?person owns ?car; ?car a Car; ?car madeIn Detroit.]
Answering Queries
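[Figure: the query pattern (?person owns ?car; ?car a Car; ?car madeIn Detroit) shown next to a data graph with nodes Kurt, car0, Ford, Detroit, Cambridge, City and Car, and edge labels owns, madeBy, madeIn, livesIn and a.]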
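Answering a query of this kind amounts to finding variable bindings under which every clause of the pattern is a statement in the data. The sketch below does this with a naive in-memory backtracking search over the example graph used in this deck. It only illustrates the idea and is not how any particular triple-store is implemented; the names `DATA`, `QUERY`, `unify` and `solve` are assumptions made for this example.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# The example data graph, one tuple per edge.
DATA: List[Triple] = [
    ("Kurt", "owns", "car0"),
    ("Kurt", "livesIn", "Cambridge"),
    ("car0", "a", "Car"),
    ("car0", "madeBy", "Ford"),
    ("car0", "madeIn", "Detroit"),
    ("Cambridge", "a", "City"),
    ("Detroit", "a", "City"),
]

# The query pattern; terms starting with "?" are variables.
QUERY: List[Triple] = [
    ("?person", "owns", "?car"),
    ("?car", "a", "Car"),
    ("?car", "madeIn", "Detroit"),
]


def unify(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend binding so that pattern equals triple, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against its existing value.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def solve(clauses: List[Triple], binding: Optional[Binding] = None) -> Iterator[Binding]:
    """Yield every binding that satisfies all clauses against DATA."""
    binding = {} if binding is None else binding
    if not clauses:
        yield binding
        return
    for triple in DATA:
        extended = unify(clauses[0], triple, binding)
        if extended is not None:
            yield from solve(clauses[1:], extended)


people = sorted({b["?person"] for b in solve(QUERY)})
print(people)  # ['Kurt']
```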
Sample of Triple-Stores
• Parliament by BBN (from DARPA DAML).
• OWLIM by Ontotext (several versions).
• AllegroGraph from Franz.
• MySQL and Oracle Solutions.
• LarKC by DERI Galway.
• Mulgara.
• Hive- and Pig-based experimental triple-stores.
• Etc…
Triple-Store Design Considerations
• Scalable – web-scale?
• High Assurance.
• Cost Effective – commodity hardware?
• Modular inferred data separation.
• Robustness.
• Considerations as endless as applications.
Map-Reduce Triple-Store Proof of Concept
SHARD Triple-Store Built on Hadoop
Prioritized goals:
• Commodity hardware, ONLY.
• Web scalable.
• Robust.
More Specifically
• Cloud-based triple-store on HDFS.
– Method calls at client.
– Processing in cloud.
– Move results to local machine.
• Massively scalable.
• SPARQL queries.
• Basic inferencing.
Data Persistence Advice from SHARD
• Down to “bare metal” in HDFS for efficiency.
– No Berkeley DB, no C-stores, …. Nothing.
• Simple data storage as flat files.
– Lists of (predicate, object) pairs for every subject by line.
– Ex: Kurt owns car0 livesIn Cambridge
• Simple often really is better…
HDFS Graph Storage
[Figure: data graph with nodes Kurt, car0, Ford, Detroit, Cambridge, City and Car, and edge labels owns, madeBy, madeIn, livesIn and a.]
Graphs saved as flat-file in HDFS:
Kurt owns car0 livesIn Cambridge
car0 a Car madeBy Ford madeIn Detroit
Cambridge a City
Detroit a City
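The flat-file layout above can be read back into triples with very little code. The sketch below assumes whitespace-separated tokens with no spaces inside a term (so a literal such as “BBN Technologies” would need some escaping that the slides do not describe). The names `parse_line` and `format_line` are hypothetical helpers, not part of SHARD.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def parse_line(line: str) -> List[Triple]:
    """Split one subject line into (subject, predicate, object) triples."""
    tokens = line.split()
    # A valid line is a subject followed by one or more (predicate, object) pairs.
    if len(tokens) < 3 or len(tokens) % 2 == 0:
        raise ValueError(f"expected subject plus (predicate, object) pairs: {line!r}")
    subject, rest = tokens[0], tokens[1:]
    return [(subject, rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]


def format_line(subject: str, triples: List[Triple]) -> str:
    """Inverse of parse_line for all triples that share the given subject."""
    pairs = [f"{p} {o}" for s, p, o in triples if s == subject]
    return " ".join([subject] + pairs)


flat_file = """\
Kurt owns car0 livesIn Cambridge
car0 a Car madeBy Ford madeIn Detroit
Cambridge a City
Detroit a City
"""

triples = [t for line in flat_file.splitlines() for t in parse_line(line)]
print(len(triples))  # 7
print(format_line("car0", triples))  # car0 a Car madeBy Ford madeIn Detroit
```

Keeping every statement about a subject on one line means a line-oriented job can see all of that subject's (predicate, object) pairs without a lookup elsewhere.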
Query Processing
• BBN-developed query processor.
– Starting integration with “standard” interfaces (Jena, Sesame).
• SHARD supports “most” of SPARQL.
– Like most commercial triple-stores.
• Large performance improvements possible with improved query reordering.
Iterative Query Response Construction
[Figure: the source data is a list of subject lines, each holding (predicate, object) pairs. The query pattern (?person owns ?car; ?car a Car; ?car madeIn Detroit) is applied one clause at a time: the first clause (?person owns ?car) produces the 1st clause results, and adding the second clause (?car a Car) produces the 2nd clause results.]
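One way to picture this clause-by-clause construction is as a chain of joins: match the first clause against the data, then repeatedly join the partial results with the matches of the next clause, grouping both sides by their shared variables in the way a reduce step groups by key. The sketch below runs in memory and is only an illustration of that idea, not SHARD's actual Hadoop jobs. The second owner (Ann, car1, Tokyo) is made-up data added so that the intermediate results visibly shrink, and all function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

TRIPLES: List[Triple] = [
    ("Kurt", "owns", "car0"), ("Kurt", "livesIn", "Cambridge"),
    ("car0", "a", "Car"), ("car0", "madeBy", "Ford"), ("car0", "madeIn", "Detroit"),
    ("Cambridge", "a", "City"), ("Detroit", "a", "City"),
    # Hypothetical extra data, not from the slides:
    ("Ann", "owns", "car1"), ("car1", "a", "Car"), ("car1", "madeIn", "Tokyo"),
]

QUERY: List[Triple] = [
    ("?person", "owns", "?car"),
    ("?car", "a", "Car"),
    ("?car", "madeIn", "Detroit"),
]


def match_clause(clause: Triple, triples: List[Triple]) -> List[Binding]:
    """Map-like step: one binding per triple that satisfies the clause."""
    results = []
    for triple in triples:
        binding: Binding = {}
        for term, value in zip(clause, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


def join(left: List[Binding], right: List[Binding]) -> List[Binding]:
    """Reduce-like step: group both sides by shared variables, then merge."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    groups = defaultdict(lambda: ([], []))
    for b in left:
        groups[tuple(b[v] for v in shared)][0].append(b)
    for b in right:
        groups[tuple(b[v] for v in shared)][1].append(b)
    return [{**l, **r} for ls, rs in groups.values() for l in ls for r in rs]


def answer(query: List[Triple], triples: List[Triple]) -> Tuple[List[Binding], List[int]]:
    """Return final bindings plus the result count after each clause."""
    results = match_clause(query[0], triples)
    sizes = [len(results)]
    for clause in query[1:]:
        results = join(results, match_clause(clause, triples))
        sizes.append(len(results))
    return results, sizes


bindings, sizes = answer(QUERY, TRIPLES)
print(sizes)     # [2, 2, 1]
print(bindings)  # [{'?person': 'Kurt', '?car': 'car0'}]
```

Because each step only needs the previous step's results and the source data, the order of clauses changes how large the intermediate results get, which is where query reordering can pay off.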
Test Data
• Deployed code on Amazon EC2 cloud.
– 19 XL nodes.
• LUBM dataset with 6000 universities.
– Approximately 800 million edges in graph.
• In general, performed comparably to “industrial” monolithic triple-stores.
SHARD Open-Source Release
• BSD license.
• Check:
– My webpage
– SourceForge (SHARD-3store)
More info?
• Tim Berners-Lee’s seminal Scientific American article.
• W3C for “recommended” standards.
• Jena and Sesame frameworks.
• SemWebCentral for other open-source.
• Please come up and talk with me for more info!
Performance Comparison
• SHARD proof of concept, for 6000 universities (approx. 800 million triples):
– Query 1: 404 sec. (approx. 0.1 hr.)
– Query 9: 740 sec. (approx. 0.2 hr.)
– Query 14: 118 sec. (approx. 0.03 hr.)
• Sesame+DAMLDB:
– Query 1: approx. 0.1 hr.
– Query 9: approx. 1 hr.
– Query 14: approx. 1 hr.
• Jena+DAMLDB for 550 million triples:
– Query 1: approx. 0.001 hr.
– Query 9: approx. 1 hr.
– Query 14: approx. 5 hr.