Managing Semantic Graphs with Stardog 4*
Pavel Klinov
Senior Research Engineer, Complexible Inc.
* Based on Evren Sirin’s talk “Taming Big Data Variety with Semantic Graph Databases” at Smart Data 2015
Overview
Graphs, semantic graphs, and data variety
Semantic graphs and data integration
• RDF as unified data model
• Virtual graphs
A little on Stardog (RDF database)
About Complexible
Leading semtech provider since 2006 (aka Clark & Parsia)
• software (Pellet, Stardog)
• W3C participation
Released Stardog 1.0 in 2012 (current version 4.0.1)
Raising Round A
http://complexible.com
Data variety is the real challenge
Based on Paradigm4 survey of more than 100 data scientists
http://www.paradigm4.com/infographic2014/
Data Variety
Syntax: formats
Structure: schemas
https://www.flickr.com/photos/designmilk/8552219138
In complex enterprises with lots of data variety, most
analytic challenges can be reduced to data integration
Data integration challenge
RDB RDB RDB Data lakes:
How to query this as a single integrated data source?
Unified Data Model
Global coherent view over heterogeneous data
flexible and extensible
at the right level of abstraction
enabling automated processing and analysis
• querying
• constraint validation
• reasoning (making implicit knowledge explicit)
Why graphs?
Generic data representation model
Utilize connectedness of the data
Flexible and extensible
Easy to compose and connect
Increasing number of graph database offerings
(Neo4j, Titan,…)
Why not graphs?
No standards for syntax, semantics, or queries
RDF, briefly
RDF addresses this standardization gap for graphs
RDF data is a set of triples (edges)
<emp:John, emp:worksFor, emp:Google>
Originally developed to publish and link data on the Web,
thus Linked Data
But it can serve as a general graph data model
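The idea that RDF data is a set of triples, each one a labeled edge, can be sketched in a few lines of Python. This is only an illustration of the data model, not how any RDF store is implemented: the `match` helper and the `emp:Mary` edge are hypothetical, and prefixed names are kept as plain strings rather than expanded IRIs.

```python
# Each RDF triple is one directed, labeled edge: (subject, predicate, object).
graph = {
    ("emp:John", "emp:worksFor", "emp:Google"),  # the triple from the slide
    ("emp:Mary", "emp:worksFor", "emp:Google"),  # hypothetical extra edge
}

# Adding the same triple again changes nothing: a graph is a *set* of triples.
graph.add(("emp:John", "emp:worksFor", "emp:Google"))


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# "Who works for Google?" expressed as a single triple pattern.
employees = [s for s, _, _ in match(graph, p="emp:worksFor", o="emp:Google")]
print(employees)
```

A SPARQL basic graph pattern is essentially a conjunction of such triple patterns with shared variables.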
RDF graphs are semantic graphs
RDF graphs are graphs with meaning
• explicit references to terms and their definitions
• definitions have formal semantics
Important for creating unified data models
• thus supporting data integration
Important for declaratively describing complex
information processing tasks
RDF serialization
http://www.w3.org/TR/rdf11-primer/
    BASE <http://example.org/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX schema: <http://schema.org/>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX wd: <http://www.wikidata.org/entity/>

    <bob#me>
        a foaf:Person ;
        foaf:knows <alice#me> ;
        schema:birthDate "1990-07-04"^^xsd:date ;
        foaf:topic_interest wd:Q12418 .

    wd:Q12418
        dcterms:title "Mona Lisa" ;
        dcterms:creator <http://dbpedia.org/resource/Leonardo_da_Vinci> .

    <http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619>
        dcterms:subject wd:Q12418 .
SPARQL query
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX schema: <http://schema.org/>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX dbpedia: <http://dbpedia.org/resource/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT ?person ?title
    WHERE {
        ?person a foaf:Person ;
            schema:birthDate ?birthDate ;
            foaf:topic_interest ?interest .
        ?interest dcterms:title ?title ;
            dcterms:creator dbpedia:Leonardo_da_Vinci .
        FILTER (?birthDate < "1991-01-01"^^xsd:date)
    }
Schema (aka ontology)
[Figure, built up over several slides. Classes: Person, Agent, Organization. Individuals: Bob, ACME. Properties: worksFor, hasEmployee. Edge labels: rdfs:subClassOf, rdfs:range, owl:inverseOf, rdf:type.]
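How such a schema lets a reasoner add edges nobody asserted can be sketched with a small forward-chaining loop. The transcript preserves only the labels of the schema figure, so the edge endpoints below are assumptions: Person and Organization are taken to be subclasses of Agent, worksFor is taken to have range Organization and to be the inverse of hasEmployee, and Bob (a Person) is taken to work for ACME. The `materialize` function is a hypothetical helper covering just three rules, not a full RDFS or OWL reasoner and not Stardog's implementation.

```python
SUBCLASS, RANGE, INVERSE, TYPE = (
    "rdfs:subClassOf", "rdfs:range", "owl:inverseOf", "rdf:type",
)

# Asserted triples (edge endpoints are assumed, see the lead-in).
asserted = {
    ("Person", SUBCLASS, "Agent"),
    ("Organization", SUBCLASS, "Agent"),
    ("worksFor", RANGE, "Organization"),
    ("worksFor", INVERSE, "hasEmployee"),
    ("Bob", TYPE, "Person"),
    ("Bob", "worksFor", "ACME"),
}


def materialize(triples):
    """Apply three inference rules repeatedly until nothing new appears."""
    graph = set(triples)
    while True:
        new = set()
        for s, p, o in graph:
            for s2, p2, o2 in graph:
                # x rdf:type C, C rdfs:subClassOf D  =>  x rdf:type D
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, TYPE, o2))
                # x P y, P rdfs:range C  =>  y rdf:type C
                if p2 == RANGE and p == s2:
                    new.add((o, TYPE, o2))
                # x P y, P owl:inverseOf Q  =>  y Q x (in both directions)
                if p2 == INVERSE and p == s2:
                    new.add((o, o2, s))
                if p2 == INVERSE and p == o2:
                    new.add((o, s2, s))
        if new <= graph:  # fixpoint reached
            return graph
        graph |= new


inferred = materialize(asserted) - asserted
for triple in sorted(inferred):
    print(triple)
```

Under these assumptions the loop derives four edges: Bob is an Agent, ACME has employee Bob, ACME is an Organization, and ACME is an Agent.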
Semantic models in RDF are:
Interoperable: no vendor lock-in
Actionable: run queries against them
Expressive: describe arbitrary (hyper) graphs
Flexible: adapt to changing data, new data, etc.
Reusable: by different apps in other domains
Viewing RDBs as RDF graphs
Take this:
And view it as something like:
http://www.w3.org/TR/rdb2rdf-ucr/
R2RML: mapping from RDB to RDF
R2RML is a standard for mapping RDB sources to RDF
Mapping is conceptual; vendors can:
• extract, transform, load as RDF
• query on the fly (virtual graphs)
Direct and customizable mappings
Virtual graphs in Stardog
1. Register: name, properties, mappings
2. Use in queries
    SELECT * {
        GRAPH <virtual://dept> {
            ?person a emp:Employee ;
                emp:department ?department .
        }
        ?department foaf:organization <urn:engineering> .
    }
Customizable mapping example
    emp:{"empno"} a emp:Employee ;
        emp:name "{\"ename\"}" ;
        emp:role emp:{ROLE} ;
        emp:department dept:{"deptno"} ;
        sm:map [
            sm:query """
                SELECT \"empno\", \"ename\", \"deptno\", (CASE \"job\"
                    WHEN 'CLERK' THEN 'general-office'
                    WHEN 'NIGHTGUARD' THEN 'security'
                    WHEN 'ENGINEER' THEN 'engineering'
                END) AS ROLE FROM \"EMP\"
            """ ;
        ] .
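What the mapping above does to each row can be approximated in plain Python: every row of the SQL result becomes a handful of triples, with the CASE expression translating job codes into role names. This is a rough sketch and not Stardog's mapping engine. The sample rows are invented, `ROLE_BY_JOB` and `row_to_triples` are hypothetical names, and skipping the role triple when the CASE yields NULL follows the R2RML convention for NULL values, which is assumed to carry over here.

```python
# Hypothetical rows standing in for the EMP table (values are made up).
rows = [
    {"empno": 7369, "ename": "SMITH", "job": "CLERK", "deptno": 20},
    {"empno": 7499, "ename": "ALLEN", "job": "ENGINEER", "deptno": 30},
    {"empno": 7521, "ename": "WARD", "job": "SALESMAN", "deptno": 30},
]

# Same job-to-role translation as the CASE expression in the mapping's SQL.
ROLE_BY_JOB = {
    "CLERK": "general-office",
    "NIGHTGUARD": "security",
    "ENGINEER": "engineering",
}


def row_to_triples(row):
    """Fill the triple templates of the mapping with values from one row."""
    subject = "emp:" + str(row["empno"])
    triples = [
        (subject, "rdf:type", "emp:Employee"),
        (subject, "emp:name", '"' + row["ename"] + '"'),
        (subject, "emp:department", "dept:" + str(row["deptno"])),
    ]
    # A CASE without ELSE yields NULL for unlisted jobs; assume no triple then.
    role = ROLE_BY_JOB.get(row["job"])
    if role is not None:
        triples.append((subject, "emp:role", "emp:" + role))
    return triples


triples = [t for row in rows for t in row_to_triples(row)]
for t in triples:
    print(t)
```

A virtual graph evaluates this kind of translation on the fly at query time, whereas an extract-transform-load pipeline would run it once and store the resulting triples.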
Reasoning with virtual graphs
Get results which
• do not exist in the data lakes
• but follow given the domain models and mappings
Turn your data lakes into deductive databases…
… without them noticing!
Reasoning with virtual graphs: example
Goal: query for all publications across both databases

Article database
    Author    Article
    John      http://nature.com/123

Publisher database
    Publisher    Name
    Springer     http://springer.com/LCNS

[Figure, built up over several slides: both databases viewed as one graph. Nodes: John, nature:123, Article, Springer, springer:lncs, Publication. Edge labels: authors, publishes, rdf:type, rdfs:subClassOf, rdfs:range.]
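One way to answer the "all publications" query without copying or materializing anything is to rewrite the query using the domain model and then evaluate the rewritten patterns against each source. The sketch below illustrates that idea only. The edge endpoints are assumptions read off the figure labels (Article is a subclass of Publication, publishes has range Publication), the two sets stand in for the triples each mapping would expose, and `rewrite_type_query` and `instances_of` are hypothetical helpers that handle a single level of subclassing.

```python
# Triples each source would expose through its mapping (assumed from the figure).
article_db = {
    ("John", "authors", "nature:123"),
    ("nature:123", "rdf:type", "Article"),
}
publisher_db = {
    ("Springer", "publishes", "springer:lncs"),
}

# Domain model (assumed edge endpoints).
SUBCLASSES = {"Publication": ["Article"]}  # Article rdfs:subClassOf Publication
RANGES = {"Publication": ["publishes"]}    # publishes rdfs:range Publication


def rewrite_type_query(cls):
    """Expand '?x rdf:type cls' into patterns that need no reasoning."""
    patterns = [("type", cls)]
    patterns += [("type", sub) for sub in SUBCLASSES.get(cls, [])]
    patterns += [("range", prop) for prop in RANGES.get(cls, [])]
    return patterns


def instances_of(cls, sources):
    """Evaluate the rewritten patterns over every source and union the results."""
    found = set()
    for kind, name in rewrite_type_query(cls):
        for source in sources:
            for s, p, o in source:
                if kind == "type" and p == "rdf:type" and o == name:
                    found.add(s)  # explicitly typed instance
                elif kind == "range" and p == name:
                    found.add(o)  # the object of the property is an instance
    return sorted(found)


publications = instances_of("Publication", [article_db, publisher_db])
print(publications)
```

Neither source states that anything is a Publication, yet both nature:123 and springer:lncs are returned, which is the sense in which results "do not exist in the data lakes but follow given the domain models and mappings".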
Stardog: Semantic Graph Database
The leading RDF database
Pure Java: any JVM language, full REST bindings
Client-server, embedded, middleware modes
Rich feature set
Supports property graphs (Tinkerpop)
ACID Transactions, High Availability, Hot backup/restore, JMX server monitoring, Access & Audit logging, RBAC security model, LDAP integration, SPARQL 1.1 queries, OWL 2 Reasoning, Proof trees, Integrity constraints, Full-text search, Geospatial support, Virtual graphs, Provenance support
Single-node Scalability
Scale up to 50B triples on modest hardware
● 32 cores, 256 GB RAM, 2 x 7200RPM HDDs, < $10K cost
Load rates up to 500k triples/second
● That’s 100M triples in 3 min, 1B in 30 min, and 20B in 20 hours
Best-of-breed query answering performance
● Query 100M triples with a throughput of 3M+ queries/hour, 1B at
500k queries/hour, and 10B at 20k queries/hour (BSBM, 64 clients)
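The load times quoted above can be turned into average rates with simple arithmetic. The snippet only divides the triple counts by the quoted wall-clock times; it assumes the quoted times are rounded and makes no claim about how the benchmarks were run.

```python
# Triples loaded -> wall-clock seconds, as quoted on the slide.
loads = {
    100_000_000: 3 * 60,        # 100M triples in 3 min
    1_000_000_000: 30 * 60,     # 1B triples in 30 min
    20_000_000_000: 20 * 3600,  # 20B triples in 20 hours
}

# Average load rate implied by each figure, in triples per second.
implied_rate = {n: n / seconds for n, seconds in loads.items()}

for n, rate in implied_rate.items():
    print(f"{n:>14,} triples: about {rate:,.0f} triples/second")
```

The 100M and 1B figures both work out to roughly 556k triples/second, in line with the quoted peak of about 500k once rounding of the times is allowed for. The 20B figure averages roughly 278k triples/second, so "up to 500k" is best read as a peak rate rather than a sustained one at that scale.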
Stardog for Big Data (coming 2016)
HDFS-backed storage
Horizontal partitioning of data
Advanced query planner and optimization
Parallel query execution with async messaging
Questions?
@klinovp, [email protected]
http://complexible.com, http://stardog.com