Managing Semantic Graphs with Stardog 4


Pavel Klinov

Senior Research Engineer Complexible Inc

Based on Evren Sirin’s talk “Taming Big Data Variety with Semantic Graph Databases” at Smart Data 2015

Overview

Graphs, semantic graphs, and data variety

Semantic graphs and data integration

• RDF as unified data model

• Virtual graphs

A little on Stardog (RDF database)

About Complexible

Leading semtech provider since 2006 (aka Clark & Parsia)

• software (Pellet, Stardog)

• W3C participation

Released Stardog 1.0 in 2012 (current version 4.0.1)

Raising Round A

http://complexible.com

Big Data Vs

Volume

Velocity

Variety

Veracity

Volatility

Value

Data variety is the real challenge

Based on Paradigm4 survey of more than 100 data scientists

http://www.paradigm4.com/infographic2014/

Data Variety

Syntax: formats

Structure: schemas

https://www.flickr.com/photos/designmilk/8552219138

In complex enterprises with lots of data variety, most analytic challenges can be reduced to data integration

Data integration space

[Figure: integrated data plotted against integration effort, with data lakes, data warehouses, and a sweet spot marked]

Data integration challenge

[Figure: several RDBs and data lakes as separate sources]

How to query this as a single integrated data source?

Unified Data Model

Global coherent view over heterogeneous data

flexible and extensible

at the right level of abstraction

enabling automated processing and analysis

• querying

• constraint validation

• reasoning (making implicit knowledge explicit)

Graphs are everywhere

Knowledge Graph

Open Graph

Linked Open Data

Why graphs?

Generic data representation model

Utilize connectedness of the data

Flexible and extensible

Easy to compose and connect

Increasing number of graph database offerings

(Neo4j, Titan,…)

Why not graphs?

No standards for syntax, semantics, or queries

RDF, briefly

RDF addresses this standardization gap for graphs

RDF data is a set of triples (edges)

<emp:John, emp:worksFor, emp:Google>

Originally developed to publish and link data on the Web

thus Linked Data

But it can serve as a general graph data model
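
To make the "set of triples" idea concrete, here is a minimal sketch that models a graph as a Python set of (subject, predicate, object) tuples. Only the emp:John triple comes from the slide; emp:Mary, emp:knows, and the match helper are hypothetical names added for illustration, and this is not how an RDF database stores data.

```python
# An RDF graph modeled as a set of (subject, predicate, object) tuples.
# Only the first triple is from the slide; the others are made up.
graph = {
    ("emp:John", "emp:worksFor", "emp:Google"),
    ("emp:Mary", "emp:worksFor", "emp:Google"),
    ("emp:John", "emp:knows", "emp:Mary"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Adding an existing triple changes nothing: a graph is a set of edges.
graph.add(("emp:John", "emp:worksFor", "emp:Google"))

# Who works for emp:Google?
employees = [s for s, _, _ in match(graph, p="emp:worksFor", o="emp:Google")]
print(employees)
```

Because every edge has the same three-part shape, merging two graphs is just a set union, which is one reason the model composes easily.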

Abstract Graph

http://www.w3.org/TR/rdf11-primer/

RDF Graph

http://www.w3.org/TR/rdf11-primer/

RDF graphs are semantic graphs

RDF graphs are graphs with meaning

• explicit references to terms and their definitions

• definitions have formal semantics

Important for creating unified data models

• thus supporting data integration

Important for declaratively describing complex information processing tasks

RDF serialization

http://www.w3.org/TR/rdf11-primer/

BASE <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX schema: <http://schema.org/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX wd: <http://www.wikidata.org/entity/>

<bob#me>
    a foaf:Person ;
    foaf:knows <alice#me> ;
    schema:birthDate "1990-07-04"^^xsd:date ;
    foaf:topic_interest wd:Q12418 .

wd:Q12418
    dcterms:title "Mona Lisa" ;
    dcterms:creator <http://dbpedia.org/resource/Leonardo_da_Vinci> .

<http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619>
    dcterms:subject wd:Q12418 .

SPARQL query

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX schema: <http://schema.org/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbpedia: <http://dbpedia.org/resource/>

SELECT ?person ?title
WHERE {
    ?person a foaf:Person ;
        schema:birthDate ?birthDate ;
        foaf:topic_interest ?interest .
    ?interest dcterms:title ?title ;
        dcterms:creator dbpedia:Leonardo_da_Vinci .
    FILTER (?birthDate < "1991-01-01"^^xsd:date)
}

Schema (aka ontology)

[Figure, built up over several slides: classes Agent, Person, and Organization linked by rdfs:subClassOf; properties worksFor and hasEmployee linked by owl:inverseOf, with an rdfs:range edge; individuals Bob and ACME with rdf:type, worksFor, and hasEmployee edges]
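
The build-up of the schema figure suggests that some edges are asserted and others follow from the schema. The sketch below is one possible reading, not a transcription of the slide: it assumes Person and Organization are subclasses of Agent, that worksFor has range Organization, and that only "Bob is a Person" and "Bob worksFor ACME" are asserted. The infer function is a hypothetical naive forward-chaining loop over three rules and is not meant to reflect how Stardog reasons.

```python
# Assumed reading of the figure; edge directions are a guess.
schema = {
    ("Person", "rdfs:subClassOf", "Agent"),
    ("Organization", "rdfs:subClassOf", "Agent"),
    ("worksFor", "owl:inverseOf", "hasEmployee"),
    ("worksFor", "rdfs:range", "Organization"),
}
# Assumed asserted facts.
data = {
    ("Bob", "rdf:type", "Person"),
    ("Bob", "worksFor", "ACME"),
}


def infer(schema, data):
    """Apply subclass, range, and inverse rules until nothing new appears."""
    facts = set(data)
    while True:
        new = set()
        for s, p, o in facts:
            for a, rel, b in schema:
                if rel == "rdfs:subClassOf" and p == "rdf:type" and o == a:
                    new.add((s, "rdf:type", b))      # instance of a is also a b
                elif rel == "rdfs:range" and p == a:
                    new.add((o, "rdf:type", b))      # object of p is a b
                elif rel == "owl:inverseOf":
                    if p == a:
                        new.add((o, b, s))           # flip the edge
                    elif p == b:
                        new.add((o, a, s))
        if new <= facts:
            return facts
        facts |= new


inferred = infer(schema, data) - data
for triple in sorted(inferred):
    print(triple)
```

Under these assumptions the two asserted facts yield four implicit ones, including ACME hasEmployee Bob and ACME being an Organization.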

Semantic models in RDF are:

Interoperable: no vendor lock-in

Actionable: run queries against them

Expressive: describe arbitrary (hyper) graphs

Flexible: adapt to changing data, new data, etc.

Reusable: by different apps in other domains

Viewing RDBs as RDF graphs

Take this:

And view it as something like:

http://www.w3.org/TR/rdb2rdf-ucr/

R2RML: mapping from RDB to RDF

R2RML is a standard for mapping RDB sources to RDF

Mapping is conceptual, vendors can:

• extract, transform, load as RDF

• query on the fly (virtual graphs)

Direct and customizable mappings
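
As a rough illustration of the "extract, transform, load as RDF" option with a direct mapping, the sketch below turns each row into one subject and each column into one predicate. The IRI scheme, the direct_map helper, and the sample rows are assumptions for illustration; a real R2RML processor follows the W3C specifications and handles keys, datatypes, and foreign keys more carefully.

```python
# Simplified direct-style mapping: one subject per row, one predicate
# per column. IRI scheme and sample rows are made up for illustration.
RDF_TYPE = "rdf:type"


def direct_map(table, key, rows, base="http://example.org/"):
    """Yield (subject, predicate, object) triples for each row."""
    for row in rows:
        subject = f"{base}{table}/{key}={row[key]}"
        yield (subject, RDF_TYPE, f"{base}{table}")
        for column, value in row.items():
            if value is not None:  # a NULL cell produces no triple
                yield (subject, f"{base}{table}#{column}", value)


emp_rows = [
    {"empno": 7369, "ename": "SMITH", "deptno": 20},
    {"empno": 7499, "ename": "ALLEN", "deptno": None},
]

triples = list(direct_map("EMP", "empno", emp_rows))
for t in triples:
    print(t)
```

A virtual graph would apply the same mapping in the other direction, translating graph queries into SQL instead of materializing the triples.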

Virtual graphs in Stardog

1. Register: name, properties, mappings

2. Use in queries

SELECT * {
    GRAPH <virtual://dept> {
        ?person a emp:Employee ;
            emp:department ?department .
    }
    ?department foaf:organization <urn:engineering> .
}

Customizable mapping example

emp:{"empno"} a emp:Employee ;
    emp:name "{\"ename\"}" ;
    emp:role emp:{ROLE} ;
    emp:department dept:{"deptno"} ;
    sm:map [
        sm:query """
            SELECT \"empno\", \"ename\", \"deptno\", (CASE \"job\"
                WHEN 'CLERK' THEN 'general-office'
                WHEN 'NIGHTGUARD' THEN 'security'
                WHEN 'ENGINEER' THEN 'engineering'
            END) AS ROLE FROM \"EMP\"
            """ ;
    ] .
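
To show what the mapping above produces for a single row, here is a small Python sketch that instantiates the same template by hand. The sample row and the map_employee helper are hypothetical, and skipping the emp:role triple when the CASE expression yields NULL is an assumption in line with how R2RML treats NULL values.

```python
# Job-to-role translation copied from the CASE expression in the mapping.
ROLE_BY_JOB = {
    "CLERK": "general-office",
    "NIGHTGUARD": "security",
    "ENGINEER": "engineering",
}


def map_employee(row):
    """Instantiate the mapping template for one EMP row."""
    subject = f"emp:{row['empno']}"
    triples = [
        (subject, "rdf:type", "emp:Employee"),
        (subject, "emp:name", row["ename"]),
        (subject, "emp:department", f"dept:{row['deptno']}"),
    ]
    role = ROLE_BY_JOB.get(row["job"])  # CASE without ELSE gives NULL
    if role is not None:                # assumed: NULL produces no triple
        triples.append((subject, "emp:role", f"emp:{role}"))
    return triples


# Made-up sample row.
row = {"empno": 7369, "ename": "SMITH", "job": "CLERK", "deptno": 20}
for t in map_employee(row):
    print(t)
```

The point of a customizable mapping is visible in the role triple: the source stores a job code, while the graph exposes a term from the unified domain model.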

Data integration with unified domain model and R2RML

Reasoning with virtual graphs

Get results which

• do not exist in the data lakes

• but follow given the domain models and mappings

Turn your data lakes into deductive databases…

… without them noticing!

Reasoning with virtual graphs: example

Article database

Author    Article
John      http://nature.com/123

Publisher database

Publisher    Name
Springer     http://springer.com/LCNS

[Figure, built up over several slides: the rows viewed as a graph. John authors nature:123, which has rdf:type Article; Springer publishes springer:lncs. The domain model adds Publication with rdfs:subClassOf and rdfs:range edges, and the final build adds further rdf:type edges]

Goal: query for all publications across both databases
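
One way to reach this goal without copying or modifying either database is to expand the query using the domain model. The sketch below assumes, based on the edge labels in the figure, that Article is a subclass of Publication and that publishes has range Publication. The instances_of helper is hypothetical, handles only one level of subclasses, and illustrates the general idea only; it makes no claim about Stardog's internals.

```python
# Triples each source is assumed to expose through its mapping.
article_db = {
    ("John", "authors", "nature:123"),
    ("nature:123", "rdf:type", "Article"),
}
publisher_db = {
    ("Springer", "publishes", "springer:lncs"),
}

# Assumed domain model read off the figure.
SUBCLASSES = {"Publication": ["Article"]}  # Article subClassOf Publication
RANGES = {"Publication": ["publishes"]}    # publishes range Publication


def instances_of(cls, sources):
    """Answer '?x rdf:type cls' by expanding it into a union of patterns."""
    classes = [cls] + SUBCLASSES.get(cls, [])
    properties = RANGES.get(cls, [])
    found = set()
    for source in sources:
        for s, p, o in source:
            if p == "rdf:type" and o in classes:
                found.add(s)   # typed directly or via a subclass
            elif p in properties:
                found.add(o)   # typed via the property's range
    return sorted(found)


publications = instances_of("Publication", [article_db, publisher_db])
print(publications)
```

Neither source contains a triple saying that something is a Publication, yet both results follow from the model, which is the "deductive database" effect described earlier.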

Stardog: Semantic Graph Database

The leading RDF database

Pure Java: any JVM language, full REST bindings

Client-server, embedded, middleware modes

Rich feature set

Supports property graphs (Tinkerpop)

ACID Transactions, High Availability, Hot backup/restore, JMX server monitoring, Access & Audit logging, RBAC security model, LDAP integration, SPARQL 1.1 queries, OWL 2 Reasoning, Proof trees, Integrity constraints, Full-text search, Geospatial support, Virtual graphs, Provenance support

Single-node Scalability

Scale up to 50B triples on modest hardware

● 32 cores, 256 GB RAM, 2 x 7200RPM HDDs, < $10K cost

Load rates up to 500k triples/second

● That’s 100M triples in 3 min, 1B in 30 min, and 20B in 20 hours

Best-of-breed query answering performance

● Query 100M triples with a throughput of 3M+ queries/hour, 1B at 500k queries/hour, and 10B at 20k queries/hour (BSBM, 64 clients)
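
The load figures above can be sanity-checked with a little arithmetic. The sketch assumes a constant rate, which real bulk loads will not have, and the function names are made up for this example.

```python
PEAK_RATE = 500_000  # triples per second, the "up to" figure quoted above


def minutes_at_peak(triples, rate=PEAK_RATE):
    """Minutes needed to load `triples` at a constant rate."""
    return triples / rate / 60


def average_rate(triples, hours):
    """Average triples per second implied by a quoted load time."""
    return triples / (hours * 3600)


print(round(minutes_at_peak(100_000_000), 1))    # 100M triples
print(round(minutes_at_peak(1_000_000_000), 1))  # 1B triples
print(round(average_rate(20_000_000_000, 20)))   # 20B triples in 20 hours
```

At the peak rate, 100M triples take a little over 3 minutes and 1B about 33 minutes, close to the quoted numbers. The 20B figure implies an average of roughly 278k triples/second, so the peak rate is apparently not sustained at that scale.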

Stardog for Big Data (coming 2016)

HDFS-backed storage

Horizontal partitioning of data

Advanced query planner and optimization

Parallel query execution with async messaging

Questions?

@klinovp, [email protected]

http://complexible.com, http://stardog.com