Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf...


Scalable RDF Data Management

& SPARQL Query Processing

Martin Theobald1, Katja Hose2, Ralf Schenkel3

1 University of Antwerp, Belgium · 2 Aalborg University, Denmark · 3 University of Passau, Germany

Outline of this Tutorial

• Part I–RDF in Centralized Relational Databases

• Part II–RDF in Distributed Settings

• Part III–Managing Uncertain RDF Data

Outline for Part I
• Part I.1: Foundations
  – Introduction to RDF and Linked Open Data
  – A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook

bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)

Information Extraction

YAGO/DBpedia et al.

>120 M facts for YAGO2 (mostly from Wikipedia infoboxes & categories)

http://www.mpi-inf.mpg.de/yago-naga/

YAGO2 Knowledge Base

[Figure: excerpt of the YAGO2 knowledge graph around the entity Max_Planck: bornOn Apr 23, 1858; diedOn Oct 4, 1947; bornIn Kiel; hasWon Nobel Prize; FatherOf Erwin_Planck (diedOn Oct 23, 1944); means edges to the names “Max Planck” and “Max Karl Ernst Ludwig Planck”. The class hierarchy connects Physicist, Biologist, Scientist, Politician, Person, City, State, Country, Location, and Organization via subclass edges; instanceOf edges link Max_Planck to Physicist, Angela Merkel (means “Angela Dorothea Merkel”, citizenOf Germany) to Politician, Kiel to City, Schleswig-Holstein to State, Germany to Country, and the Max_Planck Society to Organization; locatedIn edges connect Kiel to Schleswig-Holstein and Schleswig-Holstein to Germany.]

3 M entities, 120 M facts, 100 relations, 200k classes; accuracy ~95%

Why care about scalability? Rapid growth of available semantic data:
• More than 30 billion triples in more than 200 sources across the LOD cloud
• DBpedia: 3.4 million entities, 1 billion triples
• As of Sept. 2011: >5 million owl:sameAs links between DBpedia/YAGO/Freebase

Sources: linkeddata.org, wikipedia.org

… and still growing
• Billion Triple Challenge 2008: 1B triples
• Billion Triple Challenge 2010: 3B triples (http://km.aifb.kit.edu/projects/btc-2010/)
• Billion Triple Challenge 2011: 2B triples (http://km.aifb.kit.edu/projects/btc-2011/)
• War stories from http://www.w3.org/wiki/LargeTripleStores:
  – BigOWLIM: 12B triples in Jun 2009
  – Garlik 4Store: 15B triples in Oct 2009
  – OpenLink Virtuoso: 15.4B+ triples
  – AllegroGraph: 1+ trillion triples

Queries can be complex, too

SELECT DISTINCT ?a ?b ?lat ?long WHERE {
  ?a dbpedia:spouse ?b .
  ?a dbpedia:wikilink dbpediares:actor .
  ?b dbpedia:wikilink dbpediares:actor .
  ?a dbpedia:placeOfBirth ?c .
  ?b dbpedia:placeOfBirth ?c .
  ?c owl:sameAs ?c2 .
  ?c2 pos:lat ?lat .
  ?c2 pos:long ?long . }

Q7 on BTC2008 in [Neumann & Weikum, 2009]

What effects does the financial crisis have on migration rates in the US?

Is there a significant increase of serious weather conditions in Europe over the past 20 years?

Which glutamic-acid proteases are inhibitors of HIV?

Question Answering (QA) Systems

• KB of curated, structured data: 10 trillion (!) facts, 50k algorithms
• KB from Wikipedia and user edits: 600 million facts, 25 million entities

IBM Watson: Deep Question Answering

99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain

This town is known as "Sin City" & its downtown is "Glitter Gulch"

William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel

As of 2010, this is the only former Yugoslav republic in the EU

YAGO

knowledge back-ends

question classification & decomposition

www.ibm.com/innovation/us/watson/index.htm

D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.

SPARQL 1.0 / 1.1
• Query language for RDF suggested by the W3C.
• 3 ways to interpret RDF data:
  – Instances of logical predicates (“facts”)
  – Graphs (subjects/objects as nodes, predicates as directed and labeled edges)
  – Relations (either multiple binary relations or a single, large ternary relation)
• SPARQL main building block:
  – select-project-join combination of relational triple patterns
→ equivalent to graph isomorphism queries over a potentially very large RDF graph

SPARQL – Example

Example query:Find all actors from Ontario (that are in the knowledge base)

[Figure: example knowledge graph. Entities Albert_Einstein, Otto_Hahn, Jim_Carrey, and Mike_Myers are connected via isA edges to classes (vegetarian, physicist, chemist, scientist, actor), via bornIn edges to places (Ulm, Frankfurt, Newmarket, Scarborough), and the places via locatedIn edges to Germany, Ontario, Canada, and Europe.]

SPARQL – Example

Example query:Find all actors from Ontario (that are in the knowledge base)

[Figure: the matching portion of the graph: Jim_Carrey (bornIn Newmarket) and Mike_Myers (bornIn Scarborough), both isA actor, with both birthplaces locatedIn Ontario.]

Find subgraphs of this form (where ?person and ?loc are variables; actor and Ontario are constants):
?person –isA→ actor
?person –bornIn→ ?loc
?loc –locatedIn→ Ontario

SELECT ?person WHERE { ?person isA actor. ?person bornIn ?loc . ?loc locatedIn Ontario . }

SPARQL 1.0 – More Features

• Eliminate duplicates in the results:
SELECT DISTINCT ?c WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn ?c}

• Return results in some order, with an optional LIMIT n clause:
SELECT ?person WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn Ontario} ORDER BY DESC(?person)

• Optional matches and filters on bound variables:
SELECT ?person WHERE {?person isA actor. OPTIONAL{?person bornIn ?loc}. FILTER (!BOUND(?loc))}

• More operators: ASK, DESCRIBE, CONSTRUCT

See: http://www.w3.org/TR/rdf-sparql-query/

SPARQL 1.1 Extensions of the W3C

• Aggregations (COUNT, AVG, …) and grouping
• Subqueries in the WHERE clause
• Safe negation: FILTER NOT EXISTS {?x …}
  – Syntactic sugar for OPTIONAL {?x …} FILTER(!BOUND(?x))
• Expressions in the SELECT clause: SELECT (?a+?b) AS ?sum
• Label constraints on paths: ?x foaf:knows/foaf:knows/foaf:name ?name
• More functions and operators …

RDF+SPARQL: Centralized Engines

• BigOWLIM (now ontotext.com)

• OpenLink Virtuoso • OntoBroker (now semafora-systems.com)

• Apache Jena (different main-memory/relational backends)

• Sesame (now openRDF.org)

• SW-Store, Hexastore, 3Store, RDF-3X (no reasoning)

System deployments with >10^11 triples (see http://esw.w3.org/LargeTripleStores)

SPARQL: Extensions from Research (1)

More complex graph patterns:
• Transitive paths [Anyanwu et al., WWW’07]:

SELECT ?p, ?c WHERE {
  ?p isA scientist .
  ?p ??r ?c .
  ?c isA Country .
  ?c locatedIn Europe .
  PathFilter(cost(??r) < 5) .
  PathFilter(containsAny(??r, ?t)) .
  ?t isA City . }

• Regular expressions [Kasneci et al., ICDE’08]:
SELECT ?p, ?c WHERE {
  ?p isA ?s . ?s isA scientist .
  ?p (bornIn | livesIn | citizenOf) locatedIn* Europe . }

Meanwhile mostly covered by the SPARQL 1.1 query proposal.

SPARQL: Extensions from Research (2)

Queries over federated RDF sources:
• Determine the distribution of triple patterns as part of the query (for example in Jena ARQ)
• Automatically route triple predicates to useful sources
→ Potentially requires mapping of identifiers from different sources

SPARQL 1.1 explicitly supports federation of sources:
http://www.w3.org/TR/sparql11-federated-query/

Ranking is Essential!

• Queries often have a huge number of results:
  – “scientists from Canada”
  – “publications in databases”
  – “actors from the U.S.”
• Queries may have no matches at all:
  – “Laboratoire d'informatique de Paris 6”
  – “most beautiful railway stations”
• Ranking is an integral part of search
• Huge number of app-specific ranking methods: paper/citation count, impact, salary, …
• Need for generic ranking of (1) entities and (2) facts

Extending Entities with Keywords

Remember: entities occur in facts & in documents
⇒ Associate entities with the terms in those documents, with keywords in URIs, literals, … (the context of the entity)

Example context terms: chancellor, Germany, scientist, election, Stuttgart21, Guido Westerwelle, France, Nicolas Sarkozy

Extensions: Keywords

• Consider witnesses/sources (provenance meta-facts)
• Allow text predicates with each triple pattern (à la XQuery Full-Text)

Problem: not everything is triplified!

European composers who have won the Oscar, whose music appeared in dramatic western scenes, and who also wrote classical pieces?

SELECT ?p WHERE {
  ?p instanceOf Composer .
  ?p bornIn ?t .
  ?t inCountry ?c .
  ?c locatedIn Europe .
  ?p hasWon ?a .
  ?a Name AcademyAward .
  ?p contributedTo ?movie [western, gunfight, duel, sunset] .
  ?p composed ?music [classical, orchestra, cantata, opera] . }

Semantics:
• triples match the structural predicates
• witnesses match the text predicates

SELECT ?r, ?a WHERE {
  ?r instOf researcher [“computer science“] .
  ?a workedOn ?x [“Manhattan project“] .
  ?r hasAdvisor ?a . }

SELECT ?r, ?a WHERE {
  ?r ?p1 ?o1 [“computer science“] .
  ?a ?p2 ?o2 [“Manhattan project“] .
  ?r ?p3 ?a . }

Extensions: Keywords

• Consider witnesses/sources (provenance meta-facts)
• Allow text predicates with each triple pattern (à la XQuery Full-Text)

Problem: not everything is triplified!

Proximity of keywords or phrases boosts expressiveness

French politicians married to Italian singers?
SELECT ?p1, ?p2 WHERE {
  ?p1 instanceOf ?c1 [France, politics] .
  ?p2 instanceOf ?c2 [Italy, singer] .
  ?p1 marriedTo ?p2 . }

CS researchers whose advisors worked on the Manhattan project?

Extensions: Keywords

CLEF/INEX 2012-13 Linked Data Track

<topic id="2012374" category="Politics">
  <jeopardy_clue>Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor?</jeopardy_clue>
  <keyword_title>German politicians successor other stepped down before actual term name ancestor</keyword_title>
  <sparql_ft>
    SELECT ?s ?s1 WHERE {
      ?s rdf:type &lt;http://dbpedia.org/class/yago/GermanPoliticians&gt; .
      ?s1 &lt;http://dbpedia.org/property/successor&gt; ?s .
      FILTER FTContains (?s, "stepped down early") . }
  </sparql_ft>
</topic>

https://inex.mmci.uni-saarland.de/tracks/lod/

Problem: not everything is triplified!

Extensions: Keywords / Multiple Languages

<question id="4" answertype="resource" aggregation="false" onlydbo="true">
  <string lang="en">Which river does the Brooklyn Bridge cross?</string>
  <string lang="de">Welchen Fluss überspannt die Brooklyn Bridge?</string>
  <string lang="es">¿Por qué río cruza la Brooklyn Bridge?</string>
  <string lang="it">Quale fiume attraversa il ponte di Brooklyn?</string>
  <string lang="fr">Quelle cours d'eau est traversé par le pont de Brooklyn?</string>
  <string lang="nl">Welke rivier overspant de Brooklyn Bridge?</string>
  <keywords lang="en">river, cross, Brooklyn Bridge</keywords>
  <keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords>
  <keywords lang="es">río, cruza, Brooklyn Bridge</keywords>
  <keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords>
  <keywords lang="fr">cours d'eau, pont de Brooklyn</keywords>
  <keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords>
  <query>
    PREFIX dbo: &lt;http://dbpedia.org/ontology/&gt;
    PREFIX res: &lt;http://dbpedia.org/resource/&gt;
    SELECT DISTINCT ?uri WHERE { res:Brooklyn_Bridge dbo:crosses ?uri . }
  </query>
</question>

http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/

Multilingual Question Answering over Linked Data (QALD-3), CLEF 2011-13

Problem: not everything is triplified!

What Makes a Fact “Good”?

Confidence: Prefer results that are likely correct
• accuracy of info extraction
• trust in sources (authenticity, authority)
Example:
bornIn(Jim Gray, San Francisco) from “Jim Gray was born in San Francisco” (en.wikipedia.org)
livesIn(Michael Jackson, Tibet) from “Fans believe Jacko hides in Tibet” (www.michaeljacksonsightings.com)

Informativeness: Prefer results with salient facts
Statistical estimation from: frequency in answer, frequency on Web, frequency in query log
Example:
q: Einstein isa ? → Einstein isa scientist vs. Einstein isa vegetarian
q: ?x isa vegetarian → Einstein isa vegetarian vs. Whocares isa vegetarian

Conciseness: Prefer results that are tightly connected
• size of answer graph, weight of Steiner tree
Example facts: Einstein won NobelPrize; Bohr won NobelPrize; Einstein isa vegetarian; Cruise isa vegetarian; Cruise born 1962; Bohr died 1962

Diversity: Prefer variety of facts
Example: E won …, E discovered …, E played … (rather than E won …, E won …, E won …)

How Can We Implement This?

• Confidence (results likely correct): empirical accuracy of information extraction; PageRank-style estimate of trust; combined into: max { accuracy(f,s) * trust(s) | s ∈ witnesses(f) }
• Informativeness (salient facts): Statistical Language Models [Zhai et al., Elbassuoni et al.]
• Conciseness (tightly connected results): graph algorithms (BANKS, STAR, …) [S. Chakrabarti et al., G. Kasneci et al., …]
• Diversity (variety of facts): PageRank-style entity/fact ranking [V. Hristidis et al., S. Chakrabarti, …], IR models (tf*idf, …) [K. Chang et al., …], or Statistical Language Models [de Rijke et al.]
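The confidence combination (pick, for each fact, the best witness by extraction accuracy times source trust) can be sketched in a few lines. The facts, sources, and scores below are made-up illustrations, not real extractor output:

```python
def confidence(fact, witnesses, accuracy, trust):
    """conf(f) = max { accuracy(f, s) * trust(s) | s in witnesses(f) }"""
    return max(accuracy[(fact, s)] * trust[s] for s in witnesses[fact])

# Hypothetical scores: a well-extracted fact from a trusted source
# vs. a dubious fact from an untrusted fan site.
witnesses = {
    "bornIn(Jim_Gray,San_Francisco)": ["en.wikipedia.org"],
    "livesIn(Michael_Jackson,Tibet)": ["michaeljacksonsightings.com"],
}
accuracy = {
    ("bornIn(Jim_Gray,San_Francisco)", "en.wikipedia.org"): 0.9,
    ("livesIn(Michael_Jackson,Tibet)", "michaeljacksonsightings.com"): 0.7,
}
trust = {"en.wikipedia.org": 0.9, "michaeljacksonsightings.com": 0.1}

print(round(confidence("bornIn(Jim_Gray,San_Francisco)", witnesses, accuracy, trust), 2))  # 0.81
print(round(confidence("livesIn(Michael_Jackson,Tibet)", witnesses, accuracy, trust), 2))  # 0.07
```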

Outline for Part I
• Part I.1: Foundations
  – Introduction to RDF
  – A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook

RDF in Rowstores

• Rowstore: a general relational database, storing relations (incl. facts) as complete rows (MySQL, PostgreSQL, Oracle, DB2, SQLServer, …)
• General principles:
  – store all triples in one giant three-attribute table (subject, predicate, object)
  – convert SPARQL to equivalent SQL
  – the database will do the rest
• Strings are often mapped to unique integer IDs
• Used by many triple stores, including 3Store, Jena, HexaStore, RDF-3X, …

Simple extension to quadruples (with a graph id): (graph, subject, predicate, object) – we consider only triples for simplicity!

Example: Single Triple Table

ex:Katja ex:teaches ex:Databases ;
  ex:works_for ex:MPI_Informatics ;
  ex:PhD_from ex:TU_Ilmenau .
ex:Martin ex:teaches ex:Databases ;
  ex:works_for ex:MPI_Informatics ;
  ex:PhD_from ex:Saarland_University .
ex:Ralf ex:teaches ex:Information_Retrieval ;
  ex:PhD_from ex:Saarland_University ;
  ex:works_for ex:Saarland_University , ex:MPI_Informatics .

subject   | predicate    | object
ex:Katja  | ex:teaches   | ex:Databases
ex:Katja  | ex:works_for | ex:MPI_Informatics
ex:Katja  | ex:PhD_from  | ex:TU_Ilmenau
ex:Martin | ex:teaches   | ex:Databases
ex:Martin | ex:works_for | ex:MPI_Informatics
ex:Martin | ex:PhD_from  | ex:Saarland_University
ex:Ralf   | ex:teaches   | ex:Information_Retrieval
ex:Ralf   | ex:PhD_from  | ex:Saarland_University
ex:Ralf   | ex:works_for | ex:Saarland_University
ex:Ralf   | ex:works_for | ex:MPI_Informatics

Conversion of SPARQL to SQL

General approach to translate SPARQL into SQL:
(1) Each triple pattern is translated into a (self-)join over the triple table
(2) Shared variables create JOIN conditions
(3) Constants create WHERE conditions
(4) FILTER conditions create WHERE conditions
(5) OPTIONAL clauses create OUTER JOINs
(6) UNION clauses create UNION expressions

Example: Conversion to SQL Query

SELECT ?a ?b ?t WHERE {
  ?a works_for ?u . ?b works_for ?u . ?a phd_from ?u .
  OPTIONAL { ?a teaches ?t }
  FILTER (regex(?u, “Saar”)) }

Step 1 – one (self-)join per triple pattern:
SELECT …
FROM Triples P1, Triples P2, Triples P3

Step 2 – constants create WHERE conditions:
… WHERE P1.predicate=“works_for” AND P2.predicate=“works_for” AND P3.predicate=“phd_from”

Step 3 – shared variables create JOIN conditions:
… AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object

Step 4 – the FILTER condition becomes another WHERE condition:
… AND REGEXP_LIKE(P1.object, “Saar”)

Step 5 – projection of the selected variables:
SELECT P1.subject AS A, P2.subject AS B …

Step 6 – the OPTIONAL clause becomes a LEFT OUTER JOIN:
SELECT R1.A, R1.B, R2.T
FROM ( SELECT P1.subject AS A, P2.subject AS B
       FROM Triples P1, Triples P2, Triples P3
       WHERE P1.predicate=“works_for” AND P2.predicate=“works_for”
         AND P3.predicate=“phd_from”
         AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object
         AND REGEXP_LIKE(P1.object, “Saar”) ) R1
LEFT OUTER JOIN
     ( SELECT P4.subject AS A, P4.object AS T
       FROM Triples P4
       WHERE P4.predicate=“teaches” ) R2
ON (R1.A=R2.A)

[Figure: the corresponding operator tree: scans P1, P2, P3 joined on ?u and ?a; the filter regex(?u, “Saar”); a left outer join with the P4 (teaches) scan on ?a; and a final projection.]
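The final SQL of such a translation runs on any relational backend. Below is a minimal, self-contained sketch using SQLite with the example triples from the earlier slide; note that SQL LIKE stands in for the regex FILTER, since SQLite has no built-in REGEXP function:

```python
import sqlite3

# A toy rowstore: one (subject, predicate, object) table, plus the
# example query translated to SQL by hand following the rules above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Triples (subject TEXT, predicate TEXT, object TEXT)")
con.executemany("INSERT INTO Triples VALUES (?, ?, ?)", [
    ("Katja", "teaches", "Databases"),
    ("Katja", "works_for", "MPI_Informatics"),
    ("Katja", "phd_from", "TU_Ilmenau"),
    ("Martin", "teaches", "Databases"),
    ("Martin", "works_for", "MPI_Informatics"),
    ("Martin", "phd_from", "Saarland_University"),
    ("Ralf", "teaches", "Information_Retrieval"),
    ("Ralf", "phd_from", "Saarland_University"),
    ("Ralf", "works_for", "Saarland_University"),
    ("Ralf", "works_for", "MPI_Informatics"),
])

sql = """
SELECT R1.A, R1.B, R2.T
FROM (SELECT P1.subject AS A, P2.subject AS B
      FROM Triples P1, Triples P2, Triples P3
      WHERE P1.predicate = 'works_for'        -- constants -> WHERE
        AND P2.predicate = 'works_for'
        AND P3.predicate = 'phd_from'
        AND P1.object = P2.object             -- shared ?u -> JOIN
        AND P1.subject = P3.subject           -- shared ?a -> JOIN
        AND P1.object = P3.object
        AND P1.object LIKE '%Saar%') R1       -- FILTER (LIKE for regex)
LEFT OUTER JOIN                               -- OPTIONAL
     (SELECT P4.subject AS A, P4.object AS T
      FROM Triples P4
      WHERE P4.predicate = 'teaches') R2
ON R1.A = R2.A
"""
rows = con.execute(sql).fetchall()
print(rows)   # [('Ralf', 'Ralf', 'Information_Retrieval')]
```

Only Ralf both works for and holds a PhD from a university matching “Saar”, so he is the single binding for both ?a and ?b.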

Is that all? Well, no.

• Which indexes should be built (to support efficient evaluation of triple patterns)?
• How can we reduce storage space?
• How can we find the best execution plan?

Existing databases need modifications:
• flexible, extensible, generic storage is not needed here
• they cannot deal well with multiple self-joins of a single table
• they often generate bad execution plans

Dictionary for Strings

Map all strings to unique integers (e.g., via hashing):
• Regular size (4-8 bytes), much easier to handle
• The dictionary is usually small and can be kept in main memory

<http://example.de/Katja>  → 194760
<http://example.de/Martin> → 679375
<http://example.de/Ralf>   → 4634

But: this may break the original lexicographic sorting order
→ RANGE conditions (not in SPARQL) are difficult!
→ FILTER conditions may be more expensive!
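A minimal sketch of such a dictionary; here IDs are handed out sequentially rather than by hashing, which is enough to show the encode/decode round trip (and why lexicographic string order is lost):

```python
class Dictionary:
    """Toy string-to-ID dictionary: every URI or literal maps to a
    small integer; the reverse map decodes query results."""
    def __init__(self):
        self.str2id, self.id2str = {}, []

    def encode(self, s):
        if s not in self.str2id:          # assign the next free ID
            self.str2id[s] = len(self.id2str)
            self.id2str.append(s)
        return self.str2id[s]

    def decode(self, i):
        return self.id2str[i]

d = Dictionary()
triple = ("ex:Katja", "ex:teaches", "ex:Databases")
ids = tuple(d.encode(s) for s in triple)
print(ids)                                        # (0, 1, 2)
print(tuple(d.decode(i) for i in ids) == triple)  # True
# Note: integer order does not follow string order, so range and
# FILTER conditions over the original strings become harder.
```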

Indexes for Commonly Used Triple Patterns

Patterns with a single variable are frequent. Example: Albert_Einstein invented ?x
⇒ Build a clustered index over (S,P,O); it can also be used for patterns like Albert_Einstein ?p ?x

Build similar clustered indexes for all six permutations (3 x 2 x 1 = 6):
• SPO, POS, OSP to cover all possible triple patterns
• SOP, OPS, PSO to have all sort orders for patterns with two variables

Example: all triples in (s,p,o) order, stored in a B+ tree for easy access:
(16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456) …
1. Look up the IDs for the constants: Albert_Einstein → 16, invented → 24
2. Look up the known prefix in the index: (16,24,0)
3. Read results while the prefix matches: (16,24,567), (16,24,876) – they come already sorted!

The triple table is no longer needed; all triples are contained in each index.
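The prefix lookup in steps 2-3 can be sketched with a sorted in-memory list standing in for the B+ tree (a toy model of the idea, not RDF-3X's actual index code):

```python
from bisect import bisect_left

def build_index(triples, order):
    """One clustered permutation index: triples reordered (e.g. (0,1,2)
    for SPO, (1,2,0) for POS) and sorted lexicographically."""
    return sorted(tuple(t[i] for i in order) for t in triples)

def prefix_scan(index, prefix):
    """Binary-search the first triple matching `prefix`, then scan
    sequentially while the prefix still matches (results come sorted)."""
    pos = bisect_left(index, prefix)
    out = []
    while pos < len(index) and index[pos][:len(prefix)] == prefix:
        out.append(index[pos])
        pos += 1
    return out

triples = [(16, 19, 5356), (16, 24, 567), (16, 24, 876),
           (27, 19, 643), (27, 48, 10486), (50, 10, 10456)]
spo = build_index(triples, (0, 1, 2))
# Albert_Einstein (id 16) invented (id 24) ?x  ->  scan prefix (16, 24)
print(prefix_scan(spo, (16, 24)))   # [(16, 24, 567), (16, 24, 876)]
```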

Why Sort Order Matters for Joins

When both inputs are sorted by the join attribute, use a Merge Join:
• sequentially scan both inputs
• immediately join matching triples
• skip over parts without matches
• allows pipelining

When the inputs are unsorted (or sorted by the wrong attribute), use a Hash Join:
• build a hash table from one input
• scan the other input, probe the hash table
• needs to touch every input triple
• breaks pipelining

[Figure: two triple lists sorted by subject combined by a merge join (MJ), vs. one sorted and one unsorted list combined by a hash join (HJ).]

In general, Merge Joins are preferable: small memory footprint, pipelining.
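A merge join over two inputs sorted by their join key can be sketched as follows (joining on the first tuple component, and handling groups of equal keys on both sides):

```python
def merge_join(left_input, right_input, key=lambda t: t[0]):
    """Merge join two inputs sorted by the join attribute: scan both
    sequentially, pair up groups of equal keys, skip everything else."""
    out, i, j = [], 0, 0
    while i < len(left_input) and j < len(right_input):
        kl, kr = key(left_input[i]), key(right_input[j])
        if kl < kr:
            i += 1                      # skip left tuples without a match
        elif kl > kr:
            j += 1                      # skip right tuples without a match
        else:
            i2 = i                      # join the two groups of equal keys
            while i2 < len(left_input) and key(left_input[i2]) == kl:
                j2 = j
                while j2 < len(right_input) and key(right_input[j2]) == kl:
                    out.append((left_input[i2], right_input[j2]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out

# Join two sorted triple lists on their subject IDs.
left_input = [(16, 19, 5356), (16, 24, 567), (27, 19, 643)]
right_input = [(16, 33, 46578), (24, 16, 1353), (27, 18, 133)]
print(len(merge_join(left_input, right_input)))   # 3
```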

RDF-3X: Even More Indexes!

SPARQL 1.0 considers duplicates (unless removed with DISTINCT), but does not (yet) support aggregates/counting
⇒ often queries with many duplicates, like SELECT ?x WHERE { ?x ?y Germany }
to retrieve entities related to Germany (but counts may be important in the application!)
⇒ this materializes many identical intermediate results

Solution: even more redundancy!
• Pre-compute aggregated indexes SP, SO, PO, PS, OP, OS, S, P, O
  Example: SO contains, for each pair (s,o), the number of triples with subject s and object o
• Do not materialize identical bindings, but keep counts
  Example: ?x=Albert_Einstein:4; ?x=Angela_Merkel:10
• 15 indexes overall (all SPO permutations + their unique subsets)
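The idea behind an aggregated index is just counting: duplicate bindings collapse into (key, count) pairs instead of materialized copies. A sketch with made-up triples:

```python
from collections import Counter

triples = [
    ("Albert_Einstein", "bornIn", "Germany"),
    ("Albert_Einstein", "citizenOf", "Germany"),
    ("Angela_Merkel", "citizenOf", "Germany"),
    ("Angela_Merkel", "chancellorOf", "Germany"),
    ("Angela_Merkel", "bornIn", "Hamburg"),
]

# Aggregated SO index: for each (subject, object) pair, the number of
# triples connecting them -- duplicates become counts, not copies.
so_index = Counter((s, o) for s, p, o in triples)

# Bindings for  SELECT ?x WHERE { ?x ?y Germany }  kept with counts:
bindings = {s: n for (s, o), n in so_index.items() if o == "Germany"}
print(so_index[("Angela_Merkel", "Germany")])  # 2
print(bindings)  # {'Albert_Einstein': 2, 'Angela_Merkel': 2}
```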

RDF-3X: Compression Scheme for Triples

• Compress sequences of triples in lexicographic order (v1, v2, v3); for SPO: v1=S, v2=P, v3=O
• Step 1: compute per-attribute deltas:
  (16,19,5356) (16,24,567) (16,24,676) (27,19,643) (27,48,10486) (50,10,10456)
  → (16,19,5356) (0,5,-4789) (0,0,109) (11,-5,-33) (0,29,9843) (23,-38,-30)
• Step 2: variable-byte encoding for each delta triple (1-13 bytes):
  gap bit | header (7 bits) | delta of value 1 (0-4 bytes) | delta of value 2 (0-4 bytes) | delta of value 3 (0-4 bytes)
  – When gap=1, the delta of value 3 is included in the header; all other deltas are 0
  – Otherwise, the header contains the length of the encoding for each of the three deltas (5*5*5=125 combinations stored in 7 bits)

Many variants exist; this one is designed for triples.
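The delta step plus byte-level encoding can be sketched with an ordinary zig-zag varint in place of RDF-3X's exact 7-bit-header layout (a simplification, but the compression behavior is similar):

```python
def zigzag(n):
    """Map signed deltas to unsigned ints: 0,-1,1,-2,... -> 0,1,2,3,..."""
    return (n << 1) ^ (n >> 63)

def unzigzag(u):
    return (u >> 1) ^ -(u & 1)

def varint(u):
    """7 payload bits per byte, high bit set while more bytes follow."""
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        out.append(b | (0x80 if u else 0))
        if not u:
            return bytes(out)

def encode(triples):
    prev, out = (0, 0, 0), bytearray()
    for t in triples:
        for p, v in zip(prev, t):        # per-attribute deltas
            out += varint(zigzag(v - p))
        prev = t
    return bytes(out)

def decode(buf, n):
    triples, prev, pos = [], [0, 0, 0], 0
    for _ in range(n):
        for i in range(3):
            shift = u = 0
            while True:                  # read one varint
                b = buf[pos]; pos += 1
                u |= (b & 0x7F) << shift
                shift += 7
                if not b & 0x80:
                    break
            prev[i] += unzigzag(u)
        triples.append(tuple(prev))
    return triples

data = [(16, 19, 5356), (16, 24, 567), (16, 24, 676),
        (27, 19, 643), (27, 48, 10486), (50, 10, 10456)]
enc = encode(data)
print(len(enc), "bytes instead of", len(data) * 3 * 4)
```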

Compression Effectiveness vs. Efficiency

• Byte-level encoding is almost as effective as bit-level encoding techniques (Gamma, Golomb, Rice, etc.), but much faster (~10x) to decompress
• Example for the Barton dataset [Neumann & Weikum: VLDB’10]:
  – Raw data: 51 million triples, 7GB uncompressed (as N-Triples)
  – All 6 main indexes: 1.1GB in size, 3.2s decompression with byte-level encoding
• Optionally: additional compression with LZ77 – 2x more compact, but much slower to decompress
• Compression is always applied at the page level

Back to the Example Query

SELECT ?a ?b ?t WHERE {
  ?a works_for ?u . ?b works_for ?u . ?a phd_from ?u .
  OPTIONAL { ?a teaches ?t }
  FILTER (regex(?u, “Saar”)) }

Which of the two plans is better? How many intermediate results?

[Figure: plan 1 – merge joins over the scans POS(works_for, ?u, ?a), POS(phd_from, ?u, ?a), and PSO(works_for, ?u, ?b), followed by the filter regex(?u, “Saar”), an outer join with POS(teaches, ?a, ?t), and a projection; annotated with cardinalities 1000, 1000, 100, 50, 5, and 250.]

[Figure: plan 2 – a merge join and a hash join over POS(works_for, ?u, ?a), POS(works_for, ?u, ?b), and PSO(phd_from, ?a, ?u), plus filter, outer join, and projection; annotated with cardinalities 1000, 1000, 100, 50, 2500, 250, and 250.]

⇒ Core ingredients of a good query optimizer are selectivity estimators for triple patterns (index scans) and joins.

RDF-3X: Selectivity Estimation

How many results will a triple pattern have?
Standard databases use per-attribute histograms and assume independence of the attributes – too simplistic and inexact here.

RDF-3X instead:
• uses the aggregated indexes for exact counts per triple pattern,
• keeps additional join statistics for triple blocks (pages), and
• assumes independence between triple patterns, but additionally precomputes exact statistics for frequent paths in the data.

Handling Updates

What should we do when our data changes? (SPARQL 1.1 has updates!)
Assumptions:
• Queries are far more frequent than updates
• Updates are mostly insertions, hardly any deletions
• Different applications may update concurrently
Solution: differential indexing

RDF-3X: Differential Updates

[Figure: staging architecture for updates in RDF-3X – each application (A, B) inserts triples into its own workspace, kept in main memory with on-demand indexes built at query time; upon completion of A or B, the workspace is merged into the main indexes.]

Deletions:
• Insert the same tuple again with a “deleted” flag
• Modify the scan/join operators to merge the differential indexes with the main index

Outline for Part I
• Part I.1: Foundations
  – Introduction to RDF
  – A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook

Principles

Observations and assumptions:
• Not too many different predicates
• Triple patterns usually have a fixed predicate
• Need to access all triples with one predicate

Design consequence:
• Use one two-attribute (subject, object) table for each predicate

Example: Columnstores

ex:Katja ex:teaches ex:Databases ;
  ex:works_for ex:MPI_Informatics ;
  ex:PhD_from ex:TU_Ilmenau .
ex:Martin ex:teaches ex:Databases ;
  ex:works_for ex:MPI_Informatics ;
  ex:PhD_from ex:Saarland_University .
ex:Ralf ex:teaches ex:Information_Retrieval ;
  ex:PhD_from ex:Saarland_University ;
  ex:works_for ex:Saarland_University , ex:MPI_Informatics .

PhD_from:   subject   | object
            ex:Katja  | ex:TU_Ilmenau
            ex:Martin | ex:Saarland_University
            ex:Ralf   | ex:Saarland_University

works_for:  subject   | object
            ex:Katja  | ex:MPI_Informatics
            ex:Martin | ex:MPI_Informatics
            ex:Ralf   | ex:Saarland_University
            ex:Ralf   | ex:MPI_Informatics

teaches:    subject   | object
            ex:Katja  | ex:Databases
            ex:Martin | ex:Databases
            ex:Ralf   | ex:Information_Retrieval

Simplified Example: Query Conversion

SELECT ?a ?b WHERE { ?a works_for ?u . ?b works_for ?u . ?a phd_from ?u . }

SELECT W1.subject AS A, W2.subject AS B
FROM works_for W1, works_for W2, phd_from P3
WHERE W1.object=W2.object AND W1.subject=P3.subject AND W1.object=P3.object

So far, this is yet another relational representation of RDF. So, what is a columnstore?

Columnstores and RDF

Columnstores store all columns of a table separately. The PhD_from table becomes two columns:
PhD_from:subject – ex:Katja, ex:Martin, ex:Ralf
PhD_from:object – ex:TU_Ilmenau, ex:Saarland_University, ex:Saarland_University

Advantages:
• Fast if only the subject or only the object is accessed, not both
• Allows for a very compact representation
Problems:
• Need to recombine columns if subject and object are accessed
• Inefficient for triple patterns with a predicate variable

Compression in Columnstores

General ideas:
• Store each subject only once
• Use the same order of subjects for all columns, inserting NULL values when necessary
• Apply additional compression to get rid of the NULL values

subject:   ex:Katja, ex:Martin, ex:Ralf, ex:Ralf
PhD_from:  ex:TU_Ilmenau, ex:Saarland_University, ex:Saarland_University, NULL
works_for: ex:MPI_Informatics, ex:MPI_Informatics, ex:Saarland_University, ex:MPI_Informatics
teaches:   ex:Databases, ex:Databases, ex:Information_Retrieval, NULL

Compressed:
PhD_from: bit[1110] → ex:TU_Ilmenau, ex:Saarland_University, ex:Saarland_University
teaches:  range[1-3] → ex:Databases, ex:Databases, ex:Information_Retrieval
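The bit-vector variant can be sketched as a presence bitmap plus the dense list of non-NULL values; `lookup` recovers the value for a row by counting the set bits before it (a toy model of the idea, not a real columnstore format):

```python
def compress(column):
    """Split a column into a presence bitmap and its non-NULL values."""
    bits = [v is not None for v in column]
    return bits, [v for v in column if v is not None]

def lookup(bits, values, row):
    """Recover the value at `row`: rank = number of set bits before it."""
    if not bits[row]:
        return None
    return values[sum(bits[:row])]

phd_from = ["ex:TU_Ilmenau", "ex:Saarland_University",
            "ex:Saarland_University", None]
bits, values = compress(phd_from)
print(bits)                       # [True, True, True, False]
print(lookup(bits, values, 2))    # ex:Saarland_University
print(lookup(bits, values, 3))    # None
```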

Outline for Part I
• Part I.1: Foundations
  – Introduction to RDF
  – A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook

Property Tables

Group entities with similar predicates into a relational table (for example using RDF types or a clustering algorithm).

ex:Katja ex:teaches ex:Databases ;
  ex:works_for ex:MPI_Informatics ;
  ex:PhD_from ex:TU_Ilmenau .
ex:Martin ex:teaches ex:Databases ;
  ex:works_for ex:MPI_Informatics ;
  ex:PhD_from ex:Saarland_University .
ex:Ralf ex:teaches ex:Information_Retrieval ;
  ex:PhD_from ex:Saarland_University ;
  ex:works_for ex:Saarland_University , ex:MPI_Informatics .

Property table:
subject   | teaches      | PhD_from
ex:Katja  | ex:Databases | ex:TU_Ilmenau
ex:Martin | ex:Databases | ex:Saarland_University
ex:Ralf   | ex:IR        | ex:Saarland_University
ex:Axel   | NULL         | ex:TU_Vienna

“Leftover triples” remain in a triple table:
subject   | predicate    | object
ex:Katja  | ex:works_for | ex:MPI_Informatics
ex:Martin | ex:works_for | ex:MPI_Informatics
ex:Ralf   | ex:works_for | ex:Saarland_University
ex:Ralf   | ex:works_for | ex:MPI_Informatics

Property Tables: Pros and Cons

Advantages:
• More in the spirit of existing relational systems
• Saves many self-joins over triple tables etc.
Disadvantages:
• Potentially many NULL values
• Multi-valued attributes are problematic
• The query mapping depends on the schema
• Schema changes are very expensive

Even More Systems…

• Store RDF data as a sparse matrix with bit-vector compression [BitMat, Hendler et al.: ISWC’09]
• Convert RDF into XML and use XML methods (XPath, XQuery, …)
• Store RDF data in graph databases and perform bi-simulation [Fletcher et al.: ESWC’12], or employ specialized graph index structures [gStore, Zou et al.: PVLDB’11]
• And many more … see our list of readings.

Which Technique is Best?

• Performance depends a lot on precomputation, optimization, implementation, fine-tuning …
• Comparative results on BTC 2008 (from [Neumann & Weikum, 2009]):
[Figure: runtime comparison of RDF-3X, RDF-3X (2008), COLSTORE, and ROWSTORE on BTC 2008.]

Challenges and Opportunities

• SPARQL with different entailment regimes
• New SPARQL 1.1 features (grouping, aggregation, updates)
• User-oriented ranking of query results
  – Efficient top-k operators
  – Effective scoring methods for structured queries
• What are the limits of a centralized RDF engine?
• Dealing with uncertain RDF data: what is the most likely query answer?
  – Triples with probabilities → probabilistic databases

Outline of this Tutorial

• Part I–RDF in Centralized Relational Databases

• Part II–RDF in Distributed Settings

• Part III–Managing Uncertain RDF Data

Outline for Part II
• Part II.1: Search Engines for the Semantic Web
• Part II.2: Mediator-based and Federated Architectures

Semantic Web Search Engines

• Querying RDF data collections started by adapting existing search engines to RDF data:
  – Crawling for .rdf files and for HTML documents with embedded RDF content (see: RDFa microformat)
  – Indexing & search based on keywords extracted from entity and property names
  – Usually generate a virtual document for an entity (string literals and human-readable names)
• Swoogle [Ding et al., CIKM’04] (University of Maryland)
• Falcons [Cheng et al., WWW’08] (Nanjing University)

Outline for Part II
• Part II.1: Search Engines for the Semantic Web
• Part II.2: Mediator-based and Federated Architectures

Classification of Distributed Approaches

[Figure: taxonomy of approaches for querying distributed and potentially heterogeneous (RDF) data sources. One branch covers materialization-based approaches (data warehousing), on shared-memory architectures (message passing, RMI, etc.; e.g., Trinity (MSR)) and on shared-nothing architectures (MapReduce/Hadoop: Shard, Jena-HBase [Abadi et al. PVLDB’11]; also Partout, 4Store, Eagre). The other branch covers virtually materialized approaches: Peer-2-Peer systems (GridVine, RDFPeers), federated systems (YARS2), and mediator-based systems (DARQ, FedX).]

How to Integrate Data Sources?

• Ship and integrate data from different sources to the client.
• Three common approaches:
  – Query-driven (single mediator)
  – Database federations (exported schemas)
  – Warehousing (fully integrated & centrally managed)

[Figure: a client facing multiple independent RDF sources.]

Query-Driven Approach

[Figure: a SPARQL client sends a query to a mediator, which forwards subqueries through wrappers to several RDF sources and integrates the individual query results.]

List of SPARQL endpoints: http://www.w3.org/wiki/SparqlEndpoints
DBpedia: http://dbpedia.org/sparql
YAGO: https://d5gate.ag5.mpi-sb.mpg.de/webyagospo/Browser

Advantages of Query-Driven Integration

• No need to copy data: no or little own storage costs, no need to purchase data
• Potentially more up-to-date data
• The mediator holds a catalog (statistics, etc.) and may optimize queries
• Only a generic query interface is needed at the sources (SPARQL endpoints)
• May be less draining on the sources
• Sources are often even unaware of their participation

Federation-based Approach

[Figure: a SPARQL client sends its query to a federated schema, which routes it to sources 1 … n; each source exposes an exported schema on top of its local schema and returns its results.]

Advantages of Federation-Based Integration

• Very similar to query-driven integration, except
  – that the sources know that they are part of a federation,
  – and that they export their local schemas into a federated schema.
• Intermediate step toward full integration of the data in a single “warehouse”.

Warehousing Architecture

[Figure: the RDF sources feed an integration (ETL) layer that loads a central warehouse with associated metadata; SPARQL clients run query & analysis against the warehouse.]

Integrated LOD index: http://lod2.openlinksw.com/sparql

Advantages of Warehousing

• Perform Extract-Transform-Load (ETL) processes with periodic updates over the sources
• High query performance
• Local processing at the sources remains unaffected
• Can operate even when sources are offline
• Can query data that is no longer stored at the sources
• More detailed statistics and metadata available at the warehouse
  – Modify, summarize (store aggregates), analyse
  – Add historical information, provenance, timestamps, etc.

Classification of Distributed Approaches

Approaches for querying distributed and potentially heterogeneous (RDF) data sources

Materialization-based approaches(data-warehousing)

Virtually materialized approaches

Peer-2-Peer

Federated systems

MapReduce/Hadoop

Shared-memoryarchitectures

(Message-Passing, RMI, etc.)

Shared-nothingarchitectures

Mediator-based systems

Shard, Jena-HBase [Abadi et al., PVLDB’11]

Trinity (MSR)

DARQ, FedX, YARS2

Gridvine, RDFPeers

Partout, 4Store, Eagre

DARQ [Quilitz & Leser, Humboldt University Berlin, ISWC’08]

• Classical mediator-based architecture connecting a given SPARQL endpoint to other endpoints via a combination of wrappers and service descriptions.

• Service descriptions– RDF data descriptions– Statistical information– Binding constraints

• Query optimizer based on rewriting rules and cost estimations for physical join operators.

FedX [fluid Operations & MPI-INF: ISWC’11]

• Online query optimization over federations of SPARQL endpoints.

• Cost estimates based on result sizes of SPARQL ASK queries.
• “Bound nested-loop joins” by grouping sets of variable bindings into SPARQL UNION queries (instead of using FILTER conditions):

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
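The bound-join idea can be sketched in a few lines of Python; the query template and variable-renaming scheme below are illustrative assumptions, not FedX's actual implementation:

```python
# Illustrative sketch of FedX-style "bound joins": instead of sending one
# remote query per binding of ?id (nested-loop join), group the bindings
# into a single SPARQL UNION query.

def bound_join_query(id_bindings):
    """Build ONE remote SPARQL query for a whole batch of ?id bindings."""
    branches = []
    for i, id_value in enumerate(id_bindings):
        # Each UNION branch inlines one binding; renamed variables let the
        # mediator map result rows back to the originating binding.
        branches.append(
            "{ ?keggDrug_%d bio2rdf:xRef %s . "
            "?keggDrug_%d purl:title ?title_%d . }" % (i, id_value, i, i))
    return "SELECT * WHERE { " + " UNION ".join(branches) + " }"

query = bound_join_query(['"50-00-0"', '"64-17-5"'])
```

With n bindings this sends a single remote request instead of n, which is the main source of the speed-ups over naive nested-loop evaluation.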

Partout [Galárraga, Hose, Schenkel: PVLDB’13]

• Materialization-based, distributed & workload-aware SPARQL engine.

• Distribution helps to scale out query processing via parallel join executions.

• Triple fragments are distributed over hosts H1…Hn by
  – (1) maximizing query locality, and
  – (2) balancing the hosts’ workload.

• H1…Hn run local RDF-3x instances with
  – (1) local S,P,O statistics kept by RDF-3x, and
  – (2) global (cached) statistics.

Global query workload (aka. “query log”) Global query graph

Partout Example Query Plan

• H1, H2, H3 hold triples for ?s rdf:type db:city
• H1 has triples for ?s db:located db:Germany
• H2 has triples for ?s db:name ?name

More Distributed RDF Engines

• Shard TripleStore (Hadoop + Hash-Partitioning)

• RDFPeers (P2P/Chord architecture) [Cai et al., WWW’04]

• Gridvine (P2P/Chord architecture) [Aberer et al., VLDB’07]

• YARS2 (federated architecture) [Decker et al., ISWC’07]

• Jena-HBase (Hadoop & HBase) [Khadilkar et al., ISWC’12]

• SW-Store (Hadoop/RDF-3x) [Abadi et al., PVLDB’11]

• 4Store (materialized, shared-nothing) [Harris et al., SSWS’09]

• Eagre (materialized, shared-nothing) [HKUST & HP Labs, ICDE’13]

• Trinity (materialized, shared-memory, message passing) [MSR, SIGMOD’13]

more in Zoi’s tutorial in the afternoon…

Outline of this Tutorial

• Part I–RDF in Centralized Relational Databases

• Part II–RDF in Distributed Settings

• Part III–Managing Uncertain RDF Data

Outline for Part III• Part III.1: Motivation

– What is uncertain data, and where does it come from?

• Part III.2: Possible Worlds & Beyond• Part III.3: Probabilistic Database Engines

– Stanford Trio Project– MystiQ @ U Washington

• Part III.4: Managing Uncertain RDF Data– URDF @ Max Planck Institute

What is “Uncertain” Data?

“Certain” Data                          “Uncertain” Data

Temperature is 74.634589 F              Sensor reported 75 ± 0.5 F
Bob works for Yahoo                     Bob works for either Yahoo or Microsoft
Mary sighted a Finch                    Mary sighted either a Finch (60%) or a Sparrow (40%)
It always rains in Galway               There is an 89% chance of rain in Galway tomorrow
Yahoo stock will be at 100 in a month   Yahoo stock will be between 60 and 120 in a month
John’s age is 23                        John’s age is in [20,30]

… And Why Does It Arise?

“Certain” Data                          “Uncertain” Data

Temperature is 74.634589 F              Sensor reported 75 ± 0.5 F
Bob works for Yahoo                     Bob works for either Yahoo or Microsoft
Mary sighted a Finch                    Mary sighted either a Finch (60%) or a Sparrow (40%)
It always rains in Galway               There is an 89% chance of rain in Galway tomorrow
Yahoo stock will be at 100 in a month   Yahoo stock will be between 60 and 120 in a month
John’s age is 23                        John’s age is in [20,30]

Precision of devices

Lack of exact information(alternatives and missing values)

Uncertainty about futureevents

Anonymization

Applications: Deduplication

Name: John Doe

J. Doe? 80% match

Applications: Information Integration

name,hPhone,oPhone,hAddr,oAddr

name,phone,address

Combined View

at the schema level: “schema integration”

at the instance level: “record linkage”

Applications: Information Extraction (I)

Restaurant       Zip

Hard Rock Cafe   94111 ∥ 94133 ∥ 94109

Applications: Information Extraction (II)

What is Uncertain Data and Why Does It Arise?

Subj. Pred. Obj.Galway type City

locatedIn Ireland

hasPopulation 75,414

areaCode 091

namedAfter Gaillimh_River

… … …

bornOn(Jeff, 09/22/42)gradFrom(Jeff, Columbia)hasAdvisor(Jeff, Arthur)hasAdvisor(Surajit, Jeff)knownFor(Jeff, Theory)

type(Jeff, Author)[0.9]

author(Jeff, Drag_Book)[0.8]

author(Jeff,Cind_Book)[0.6]

worksAt(Jeff, Bell_Labs)[0.7]

type(Jeff, CEO)[0.4]

Applications: Information Extraction (III)

YAGO/DBpedia et al.

New fact candidates

>120 M facts for YAGO2 (mostly from Wikipedia infoboxes)

100’s M additional facts from Wikipedia text

How do current database management systems (DBMS) handle uncertainty?

They don’t

• Clean: turn into data that DBMSs can handle

What Do (Most) Applications Do?

(1) Loss of information (2) Errors compound and propagate insidiously

Observer   Bird-1

Mary       Finch: 80% ∥ Sparrow: 20%
Susan      Dove: 70% ∥ Sparrow: 30%
Jane       Hummingbird: 65% ∥ Sparrow: 35%

After “cleaning”:

Observer   Bird-1
Mary       Finch
Susan      Dove
Jane       Hummingbird

Outline for Part III• Part III.1: Motivation

– What is uncertain data, and where does it come from?

• Part III.2: Possible Worlds & Beyond• Part III.3: Probabilistic Database Engines

– Stanford Trio Project– MystiQ @ U Washington

• Part III.4: Managing Uncertain RDF Data– URDF @ Max Planck Institute

Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch: Probabilistic Databases

(Synthesis Lectures on Data Management)Morgan & Claypool Publishers, 2012

Databases Today are Deterministic

• An item either is in the database or it is not.

• A tuple either is in the query answer or it is not.

• This applies to all variety of data models:– Relational, E/R, hierarchical, XML, …

What is a Probabilistic Database ?

• “A tuple belongs to the database” is a probabilistic event.

• “A tuple is an answer to the query” is a probabilistic event.

• Can be extended to all possible kinds of data models; we consider only

probabilistic relational data.

Sample Spaces & Venn Diagrams

Sample Space

• Sample space Ω: the set of all possible events that can be observed; Pr(Ω) = 1.
• A random variable χt assigns a probability to an event s.t. 0 ≤ Pr(χt) ≤ 1.
• As a convention, we will use tuple identifiers in place of random variables to denote probabilistic events.

“Tuple t1 is in the database.”

“Tuple t2 is an answer to a query.”

Possible Worlds Semantics

Relational schema:   Employee(ID:int, name:varchar(55), dob:datetime, salary:int)

Attribute domains:   int, varchar(55), datetime
# values:            2^32, 2^440, 2^64

# of possible tuples:              2^32 × 2^440 × 2^64 × 2^32
# of possible relation instances:  2^(2^32 × 2^440 × 2^64 × 2^32)

Database schema:   Employee(. . .), Projects(. . .), Groups(. . .), WorksFor(. . .)

# of possible database instances: N (= big, but finite)

The Definition

Given a finite set of all possible database instances:

INST = {I1, I2, I3, . . ., IN}

Definition: A probabilistic database Ip is a probability distribution Pr : INST → [0,1] on INST s.t. Σi=1,…,N Pr(Ii) = 1

Definition: A possible world is an instance I ∈ INST s.t. Pr(I) > 0
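This definition can be encoded directly in a few lines of Python, reusing the probabilities from the example that follows (I1..I4 stand for concrete database instances):

```python
# A toy probabilistic database: a probability distribution over the
# finite set of instances I1..I4.

Ip = {"I1": 1/3, "I2": 1/12, "I3": 1/2, "I4": 1/12}

# A probability distribution on INST must sum to 1 ...
assert abs(sum(Ip.values()) - 1.0) < 1e-9

# ... and the possible worlds are exactly the instances with Pr(I) > 0.
possible_worlds = {I for I, p in Ip.items() if p > 0}
```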

Example

Customer Address Product

John Seattle Gizmo

John Seattle Camera

Sue Denver Gizmo

Pr(I1) = 1/3

Customer Address Product

John Boston Gadget

Sue Denver Gizmo

Customer Address Product

John Seattle Gizmo

John Seattle Camera

Sue Seattle Camera

Customer Address Product

John Boston Gadget

Sue Seattle Camera

Pr(I2) = 1/12

Pr(I3) = 1/2Pr(I4) = 1/12

Possible worlds = {I1, I2, I3, I4}

Ip =

Tuples as Events

One tuple t  ⇒  event “t ∈ I”

Two tuples t1, t2  ⇒  event “t1 ∈ I Λ t2 ∈ I”

Pr(t) = ΣI: t∈I Pr(I)                     (marginal probability of t)

Pr(t1 Λ t2) = ΣI: t1∈I Λ t2∈I Pr(I)       (marginal probability of t1 Λ t2)

Tuple Correlations

Disjoint-AND (D Λ):      Pr(t1 Λ t2) = 0
Negatively correlated:   Pr(t1 Λ t2) < Pr(t1) Pr(t2)
Positively correlated:   Pr(t1 Λ t2) > Pr(t1) Pr(t2)
Identical (=):           Pr(t1 Λ t2) = Pr(t1) = Pr(t2)
Independent-AND (I Λ):   Pr(t1 Λ t2) = Pr(t1) Pr(t2)
Independent-OR (I V):    Pr(t1 V t2) = 1 − (1−Pr(t1))(1−Pr(t2))
Disjoint-OR (D V):       Pr(t1 V t2) = Pr(t1) + Pr(t2)
NOT (⌐):                 Pr(⌐t1) = 1 − Pr(t1)

Example with Correlations

Customer Address Product

John Seattle Gizmo

John Seattle Camera

Sue Denver Gizmo

Pr(I1) = 1/3

Customer Address Product

John Boston Gadget

Sue Denver Gizmo

Customer Address Product

John Seattle Gizmo

John Seattle Camera

Sue Seattle Camera

Customer Address Product

John Boston Gadget

Sue Seattle Camera

Pr(I2) = 1/12

Pr(I3) = 1/2Pr(I4) = 1/12

=

N

P

D

D

Ip =

Special case: tuple-independent probabilistic database

pr : TUP → (0,1]   (no restrictions w.r.t. other tuples)

Pr(I) = Πt∈I pr(t) × Πt∉I (1 − pr(t))

TUP = {t1, t2, …, tM} = all tuples,   INST = P(TUP),   N = 2^M

… back to the Venn Diagram (I)Sample Space

If t1 and t2 are independent (per assumption!):

4 possible worlds = 4 subsets of events

“Tuple t1 is in the database.”

“Tuple t2 is in the database.”

Pr(“Tuple t1 is in the database and tuple t2 is in the database”) := Pr(t1) x Pr(t2) = pr(t1) x pr(t2)

… back to the Venn Diagram (II)Sample Space

If t1 and t2 are disjoint (per assumption!):

3 possible worlds = 3 subsets of events

“Tuple t1 is in the database.”

“Tuple t2 is in the database.”

Pr(“Tuple t1 is in the database and tuple t2 is in the database”) := 0

Tuple Prob. Possible Worlds

Name City pr

John Seattle p1 = 0.8

Sue Boston p2 = 0.6

Fred Boston p3 = 0.9

Ip =   (Assumption: tuples are independent!)

I1 = { }                                                  (1−p1)(1−p2)(1−p3)
I2 = { (John, Seattle) }                                  p1(1−p2)(1−p3)
I3 = { (Sue, Boston) }                                    (1−p1)p2(1−p3)
I4 = { (Fred, Boston) }                                   (1−p1)(1−p2)p3
I5 = { (John, Seattle), (Sue, Boston) }                   p1p2(1−p3)
I6 = { (John, Seattle), (Fred, Boston) }                  p1(1−p2)p3
I7 = { (Sue, Boston), (Fred, Boston) }                    (1−p1)p2p3
I8 = { (John, Seattle), (Sue, Boston), (Fred, Boston) }   p1p2p3

Σ = 1
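The eight worlds can be generated mechanically from the tuple-independent formula Pr(I) = Π_{t∈I} pr(t) × Π_{t∉I} (1 − pr(t)); a minimal Python sketch:

```python
# Generate all possible worlds of a tuple-independent probabilistic
# database with three tuples (p1 = 0.8, p2 = 0.6, p3 = 0.9).
from itertools import product

pr = {"John": 0.8, "Sue": 0.6, "Fred": 0.9}

worlds = {}
for bits in product([True, False], repeat=len(pr)):
    world = frozenset(t for t, b in zip(pr, bits) if b)
    p = 1.0
    for t, pt in pr.items():
        p *= pt if t in world else 1 - pt    # Pr(I) = Π pr(t) × Π (1−pr(t))
    worlds[world] = p

assert len(worlds) == 8                        # 2^M possible worlds
assert abs(sum(worlds.values()) - 1.0) < 1e-9  # Σ = 1
# A tuple's marginal is recovered by summing over the worlds containing it:
assert abs(sum(p for w, p in worlds.items() if "Sue" in w) - 0.6) < 1e-9
```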

Tuple Prob. Query Evaluation

Name City pr

John Seattle p1

Sue Boston p2

Fred Boston p3

Customer Product Date pr

John Gizmo . . . q1

John Gadget . . . q2

John Gadget . . . q3

Sue Camera . . . q4

Sue Gadget . . . q5

Sue Gadget . . . q6

Fred Gadget . . . q7

SELECT DISTINCT x.city
FROM Personp x, Purchasep y
WHERE x.Name = y.Customer AND y.Product = ‘Gadget’

Marginals:

Tuple     Probability
Seattle   p1 (1 − (1−q2)(1−q3))
Boston    1 − (1 − p2(1 − (1−q5)(1−q6))) × (1 − p3 q7)

Summary of Data Model

Possible Worlds Semantics

• Very powerful model:
  – Can capture any tuple correlations.
• Needs a separate representation formalism (“just tables” are generally not enough):
  Boolean event expressions to capture complex tuple dependencies: “provenance”, “lineage”, “views”, etc.
• But: query evaluation may be very expensive.
  – Need to find good cases, otherwise must approximate.

Outline for Part III• Part III.1: Motivation

– What is uncertain data, and where does it come from?

• Part III.2: Possible Worlds & Beyond• Part III.3: Probabilistic Database Engines

– Stanford Trio Project– MystiQ @ U Washington

• Part III.4: Managing Uncertain RDF Data– URDF @ Max Planck Institute

Trio’s Data Model

1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidence values4. Lineage

Uncertainty-Lineage Databases (ULDBs)

[Widom et al.: 2008]

Trio’s Data Model

1. Alternatives: uncertainty about value

Saw (witness, color, car)

Amy red, Honda ∥ red, Toyota ∥ orange, Mazda

Three possibleinstances

Six possibleinstances

Trio’s Data Model

1. Alternatives2. ‘?’ (Maybe): uncertainty about presence

?

Saw (witness, color, car)

Amy red, Honda ∥ red, Toyota ∥ orange, Mazda

Betty blue, Acura

Trio’s Data Model

• 1. Alternatives• 2. ‘?’ (Maybe) Annotations• 3. Confidences: weighted uncertainty

Six possible instances, each with a probability

?

Saw (witness, color, car)

Amy red, Honda 0.5 ∥ red, Toyota 0.3 ∥ orange, Mazda 0.2

Betty blue, Acura 0.6

So Far: Model is Not Closed

Saw (witness, car)

Cathy

Honda ∥ Mazda

Drives (person, car)

Jimmy, Toyota ∥ Jimmy, Mazda

Billy, Honda ∥ Frank, Honda

Hank, Honda

Suspects

Jimmy

Billy ∥ Frank

Hank

Suspects = πperson(Saw ⋈ Drives)

Cannot correctly capture the possible instances in the result!

Example with Lineage

ID

Saw (witness, car)

11

Cathy

Honda ∥ Mazda

ID

Drives (person, car)

21

Jimmy, Toyota ∥ Jimmy, Mazda

22

Billy, Honda ∥ Frank, Honda

23

Hank, Honda

ID

Suspects

31

Jimmy

32

Billy ∥ Frank

33

Hank

Suspects = πperson(Saw ⋈ Drives)

???

λ(31) = (11,2) Λ (21,2)
λ(32,1) = (11,1) Λ (22,1);  λ(32,2) = (11,1) Λ (22,2)
λ(33) = (11,1) Λ 23

Example with Lineage

ID Saw (witness, car)

11

Cathy

Honda ∥ Mazda

ID Drives (person, car)

21

Jimmy, Toyota ∥ Jimmy, Mazda

22

Billy, Honda ∥ Frank, Honda

23

Hank, Honda

ID Suspects

31

Jimmy

32

Billy ∥ Frank

33

Hank

Suspects = πperson(Saw ⋈ Drives)

???

λ(31) = (11,2) Λ (21,2)
λ(32,1) = (11,1) Λ (22,1);  λ(32,2) = (11,1) Λ (22,2)
λ(33) = (11,1) Λ 23

Correctly captures possible instances inthe result (7)

Operational Semantics

Closure: up-arrow always exists

Completeness: any (finite) set of possible instances can be represented

D

I1, I2, …, In J1, J2, …, Jm

D′

possibleinstances

Q on eachinstance

rep. ofinstances

directimplementation

Summary on Trio’s Data Model

1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidence values4. Lineage

Uncertainty-Lineage Databases (ULDBs)

Theorem: ULDBs are closed and complete.

Formally studied properties like minimization, equivalence, approximation and membership based on lineage.

[Benjelloun, Widom, et al.: VLDB J. 08]

MYSTIQ: Query Complexity

• Data complexity of a query Q: compute Q(Ip) for a probabilistic database Ip
  – Extensional query evaluation: works for “safe” query plans with PTIME data complexity
  – Intensional query evaluation: works for any plan but has #P-complete data complexity in the general case
• Assume independent tuples in Ip
• Compute marginal probabilities for the tuples in Q
• Boolean event expressions for intensional query evaluation

Extensional Query Evaluation

Relational op’s compute probabilities [Fuhr&Roellke:1997, Dalvi&Suciu:2004]:

σ (select):        (v, p)  →  (v, p)
× (join):          (v1, p1), (v2, p2)  →  (v1 v2, p1p2)
Π (project/dedup): (v, p1), (v, p2), …  →  (v, 1−(1−p1)(1−p2)…)   [or: p1 + p2 + … for disjoint tuples]
− (difference):    (v, p1), (v, p2)  →  (v, p1(1−p2))

Data complexity: PTIME

SELECT DISTINCT x.City
FROM Personp x, Purchasep y
WHERE x.Name = y.Customer AND y.Product = ‘Gadget’

Plan 1 (project after join):
  (Jon, Sea, p1) × (Jon, q1), (Jon, q2), (Jon, q3)
    → (Jon, Sea, p1q1), (Jon, Sea, p1q2), (Jon, Sea, p1q3)
  Π → (Sea, 1−(1−p1q1)(1−p1q2)(1−p1q3))                        Wrong!

Plan 2 (project before join):
  Π over (Jon, q1), (Jon, q2), (Jon, q3) → (Jon, 1−(1−q1)(1−q2)(1−q3))
  × (Jon, Sea, p1)
    → (Sea, p1(1−(1−q1)(1−q2)(1−q3)))                          Correct!

The result depends on the plan! → “Safe Plans” [Dalvi&Suciu:2004]
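The discrepancy between the two plans can be verified numerically; in the sketch below p1 and q1..q3 are arbitrary probabilities for one person tuple and three purchase tuples:

```python
# Check that the two plans really differ, and that the "safe" plan
# matches brute-force possible-worlds evaluation.
from itertools import product

p1, q = 0.8, [0.5, 0.6, 0.7]

# Unsafe plan: project AFTER the join, wrongly treating the three joined
# tuples (which all share the same person tuple) as independent.
wrong = 1.0
for qi in q:
    wrong *= 1 - p1 * qi
wrong = 1 - wrong

# Safe plan: project the purchases first, then join with the person.
correct = p1 * (1 - (1 - q[0]) * (1 - q[1]) * (1 - q[2]))

# Ground truth: the city is an answer iff the person tuple is in the
# world AND at least one of the three purchases is.
truth = 0.0
for bits in product([True, False], repeat=4):
    w = p1 if bits[0] else 1 - p1
    for qi, b in zip(q, bits[1:]):
        w *= qi if b else 1 - qi
    if bits[0] and any(bits[1:]):
        truth += w

assert abs(correct - truth) < 1e-9   # the safe plan is exact
assert wrong > correct               # the unsafe plan overestimates here
```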

Query Complexity

Sometimes there exists a correct extensional plan, but consider the following:

Qbad :- R(x), S(x,y), T(y)

The data complexity of Qbad is #P-complete [Dalvi&Suciu:2004]

NP = class of problems of the form “is there a witness?”
#P = class of problems of the form “how many witnesses?”
(we will come back to this…)

Intensional Database [Fuhr&Roellke:1997]

Intensional probabilistic database J: each tuple t has an event attribute t.E

Atomic event ids:    e1, e2, e3, …
Probabilities:       p1, p2, p3, … ∈ [0,1]
Event expressions:   Λ, V, ⌐   (e.g., e3 Λ (e5 V ⌐e2))

Probability of Boolean Expressions

E = X1X3 v X1X4 v X2X5 v X2X6

Sampling: Randomly make each variable true with the following probabilities

Pr(X1) = p1, Pr(X2) = p2, . . . . . , Pr(X6) = p6

What is Pr(E) ???

Answer: Re-group cleverly into a “read-once” formula:  E = X1(X3 v X4) v X2(X5 v X6)

Pr(E) = 1 − (1 − p1(1−(1−p3)(1−p4))) (1 − p2(1−(1−p5)(1−p6)))

(Needed for query evaluation!)
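The closed form can be verified against brute-force enumeration of all 2^6 variable assignments; probabilities in the sketch below are chosen arbitrarily:

```python
# Verify the read-once closed form for Pr(E) against brute force.
from itertools import product

p = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]          # Pr(X1) .. Pr(X6)

def E(x):
    # E = X1X3 v X1X4 v X2X5 v X2X6
    return (x[0] and x[2]) or (x[0] and x[3]) or \
           (x[1] and x[4]) or (x[1] and x[5])

brute = 0.0
for bits in product([True, False], repeat=6):
    w = 1.0
    for pi, b in zip(p, bits):
        w *= pi if b else 1 - pi
    if E(bits):
        brute += w

p1, p2, p3, p4, p5, p6 = p
# Closed form from the regrouping E = X1(X3 v X4) v X2(X5 v X6):
closed = 1 - (1 - p1*(1-(1-p3)*(1-p4))) * (1 - p2*(1-(1-p5)*(1-p6)))

assert abs(brute - closed) < 1e-9
```

The regrouping works here because every variable occurs only once in the rewritten formula and all variables are independent; for general formulas no such read-once form exists, which is exactly where the #P-hardness bites.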

Complexity Issues

Theorem [Valiant:1979]For a Boolean expression E, computing Pr(E) is #P-complete

NP = class of problems of the form “is there a witness?”   (SAT)
#P = class of problems of the form “how many witnesses?”   (#SAT)

The decision problem for 2CNF is in PTIME.
The counting problem for 2CNF is #P-complete.

MYSTIQ: [Re, Suciu: VLDB’04]

Probabilistic Query Evaluation on Top of a Deterministic Database Engine

Deterministic Database

SQL Query ProbabilisticQuery Engine

(Top-k) Answers

1. Sampling

2. Extensional joins

3. Indexes

Outline for Part III• Part III.1: Motivation

– What is uncertain data, and where does it come from?

• Part III.2: Possible Worlds & Beyond• Part III.3: Probabilistic Database Engines

– Stanford Trio Project– MystiQ @ U Washington

• Part III.4: Uncertain RDF Data– URDF Project @ Max Planck Institute

Uncertain RDF (URDF) Data Model

• Extensional Layer (information extraction & integration)
  – High-confidence facts: existing knowledge base (“ground truth”)
  – New fact candidates: extracted facts with confidence values
  – Integration of different knowledge sources: ontology merging or explicit Linked Data (owl:sameAs, owl:equivProp.)
  ⇒ Large “Probabilistic Database” of RDF facts
• Intensional Layer (query-time inference)
  – Soft rules: deductive grounding & lineage (Datalog/SLD resolution)
  – Hard rules: consistency constraints (more general FOL rules)
  – Propositional & probabilistic consistency reasoning

Soft Rules vs. Hard Rules

(Soft) Deduction Rules vs. (Hard) Consistency Constraints

• People may live in more than one place
  livesIn(x,y) Λ marriedTo(x,z) ⇒ livesIn(z,y)   [0.8]
  livesIn(x,y) Λ hasChild(x,z) ⇒ livesIn(z,y)    [0.5]
• People are not born in different places/on different dates
  bornIn(x,y) Λ bornIn(x,z) ⇒ y=z
  bornOn(x,y) Λ bornOn(x,z) ⇒ y=z
• People are not married to more than one person (at the same time, in most countries?)
  marriedTo(x,y,t1) Λ marriedTo(x,z,t2) Λ y≠z ⇒ disjoint(t1,t2)

Soft Rules vs. Hard Rules

(Soft) Deduction Rules vs. (Hard) Consistency Constraints

• Soft rules ⇒ deductive database: Datalog, core of SQL & relational algebra, RDF/S, OWL2-RL, etc.
  livesIn(x,y) Λ marriedTo(x,z) ⇒ livesIn(z,y)   [0.8]
  livesIn(x,y) Λ hasChild(x,z) ⇒ livesIn(z,y)    [0.5]
• Hard rules ⇒ more general FOL constraints: Datalog with constraints, X-tuples in prob. DBs, owl:FunctionalProperty, etc.
  bornIn(x,y) Λ bornIn(x,z) ⇒ y=z
  bornOn(x,y) Λ bornOn(x,z) ⇒ y=z
  marriedTo(x,y,t1) Λ marriedTo(x,z,t2) Λ y≠z ⇒ disjoint(t1,t2)

URDF: Running Example

Rules:
  hasAdvisor(x,y) Λ worksAt(y,z) ⇒ graduatedFrom(x,z)   [0.4]
  graduatedFrom(x,y) Λ graduatedFrom(x,z) ⇒ y=z

Jeff

Stanford

University

type[1.0]

Surajit

Princeton

David

Computer Scientist

worksAt[0.9]

type[1.0]

type[1.0]

type[1.0]type[1.0]

graduatedFrom[0.6]

graduatedFrom[0.7]

graduatedFrom[0.9]

hasAdvisor[0.8]hasAdvisor[0.7]

KB: Base Facts

Derived Facts:
  gradFr(Surajit, Stanford)   graduatedFrom[?]
  gradFr(David, Stanford)     graduatedFrom[?]

Basic Types of Inference

• MAP Inference
  – Find the most likely assignment to query variables y under a given evidence x.
  – Compute: arg maxy P(y | x)   (NP-hard for MaxSAT)

• Marginal/Success Probabilities
  – Probability that query y is true in a random world under a given evidence x.
  – Compute: ∑y P(y | x)   (#P-hard already for conjunctive queries)

General Route: Grounding & MaxSAT Solving

Query graduatedFrom(x, y)

CNF:
  (⌐graduatedFrom(Surajit, Stanford) V ⌐graduatedFrom(Surajit, Princeton))   [1000]
  Λ (⌐graduatedFrom(David, Stanford) V ⌐graduatedFrom(David, Princeton))     [1000]
  Λ (⌐hasAdvisor(Surajit, Jeff) V ⌐worksAt(Jeff, Stanford) V graduatedFrom(Surajit, Stanford))   [0.4]
  Λ (⌐hasAdvisor(David, Jeff) V ⌐worksAt(Jeff, Stanford) V graduatedFrom(David, Stanford))       [0.4]
  Λ worksAt(Jeff, Stanford) [0.9] Λ hasAdvisor(Surajit, Jeff) [0.8] Λ hasAdvisor(David, Jeff) [0.7]
  Λ graduatedFrom(Surajit, Princeton) [0.7] Λ graduatedFrom(Surajit, Stanford) [0.6] Λ graduatedFrom(David, Princeton) [0.9]

1) Grounding
   – Consider only facts (and rules) which are relevant for answering the query
2) Propositional formula in CNF, consisting of
   – Grounded hard & soft rules
   – Weighted base facts
3) Propositional Reasoning
   – Find a truth assignment to the facts such that the total weight of the satisfied clauses is maximized

MAP inference: compute “most likely” possible world

[Theobald,Sozio,Suchanek,Nakashole: VLDS‘12]

Find: arg maxy P(y | x); resolves to a variant of MaxSAT for propositional formulas

URDF: MaxSAT Solving with Soft & Hard Rules

Special case: Horn clauses as soft rules & mutex constraints as hard rules

S (mutex constraints):
  { graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) }
  { graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) }

C (weighted Horn clauses, CNF):
  (⌐hasAdvisor(Surajit, Jeff) V ⌐worksAt(Jeff, Stanford) V graduatedFrom(Surajit, Stanford))   [0.4]
  (⌐hasAdvisor(David, Jeff) V ⌐worksAt(Jeff, Stanford) V graduatedFrom(David, Stanford))       [0.4]
  worksAt(Jeff, Stanford) [0.9], hasAdvisor(Surajit, Jeff) [0.8], hasAdvisor(David, Jeff) [0.7],
  graduatedFrom(Surajit, Princeton) [0.7], graduatedFrom(Surajit, Stanford) [0.6], graduatedFrom(David, Princeton) [0.9]

MaxSAT Alg.:

Compute W0 = Σclauses C w(C) · P(C is satisfied);
For each hard constraint St {
  For each fact f in St {
    Compute Wf+t = Σclauses C w(C) · P(C is sat. | f = true);
  }
  Compute WS−t = Σclauses C w(C) · P(C is sat. | St = false);
  Choose the truth assignment to the facts f in St that maximizes Wf+t, WS−t;
  Remove satisfied clauses C;  t++;
}

• Runtime: O(|S||C|)
• Approximation guarantee of 1/2
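A generic greedy weighted-MaxSAT pass can be sketched as follows; this is illustrative only and is not the exact URDF procedure (which additionally conditions on the mutex hard constraints and carries the 1/2 approximation guarantee). The clause encoding and the variable names w/a/g are assumptions made up for the sketch:

```python
# A generic greedy weighted-MaxSAT sketch.

def greedy_maxsat(variables, clauses):
    """clauses: list of (weight, {(var, polarity), ...}).
    Assign each variable in turn to the value that satisfies more clause
    weight among the still-unsatisfied clauses."""
    assignment = {}
    open_clauses = list(clauses)
    for v in variables:
        def weight_if(value):
            return sum(w for w, lits in open_clauses if (v, value) in lits)
        assignment[v] = weight_if(True) >= weight_if(False)
        # Drop clauses that this assignment already satisfies.
        open_clauses = [(w, lits) for w, lits in open_clauses
                        if (v, assignment[v]) not in lits]
    return assignment

# Toy instance loosely mirroring the example: w = worksAt, a = hasAdvisor,
# g = graduatedFrom, with one Horn-style soft rule (⌐a V ⌐w V g) [0.4].
clauses = [(0.9, {("w", True)}),
           (0.4, {("a", False), ("w", False), ("g", True)}),
           (0.6, {("g", True)}),
           (0.8, {("a", True)})]
assignment = greedy_maxsat(["w", "a", "g"], clauses)
```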

Deductive Grounding with Lineage (SLD Resolution/Datalog)

\/

/\

graduatedFrom(Surajit, Princeton) [0.7]
hasAdvisor(Surajit, Jeff) [0.8]
worksAt(Jeff, Stanford) [0.9]
graduatedFrom(Surajit, Stanford) [0.6]

Query graduatedFrom(Surajit, y)

C D

A B

Q1 = A Λ ⌐(B V (CΛD))        Q2 = ⌐A Λ (B V (CΛD))

Q1: graduatedFrom(Surajit, Princeton)        Q2: graduatedFrom(Surajit, Stanford)

Rules:
  hasAdvisor(x,y) Λ worksAt(y,z) ⇒ graduatedFrom(x,z)   [0.4]
  graduatedFrom(x,y) Λ graduatedFrom(x,z) ⇒ y=z

Base Facts:
  graduatedFrom(Surajit, Princeton) [0.7]
  graduatedFrom(Surajit, Stanford) [0.6]
  graduatedFrom(David, Princeton) [0.9]
  hasAdvisor(Surajit, Jeff) [0.8]
  hasAdvisor(David, Jeff) [0.7]
  worksAt(Jeff, Stanford) [0.9]
  type(Princeton, University) [1.0]
  type(Stanford, University) [1.0]
  type(Jeff, Computer_Scientist) [1.0]
  type(Surajit, Computer_Scientist) [1.0]
  type(David, Computer_Scientist) [1.0]

Lineage & Possible Worlds

1) Deductive Grounding
   – Dependency graph of the query
   – Trace lineage of individual query answers
2) Lineage DAG (not in CNF), consisting of
   – Grounded hard & soft rules
   – Weighted base facts
   – Plus: the entire derivation history!
3) Probabilistic Inference
   Compute marginals:
   P(Q): aggregate probabilities of all possible worlds where the lineage of the query evaluates to “true”
   P(Q|H): drop the “impossible worlds”

\/

/\

graduatedFrom(Surajit, Princeton) [0.7]
hasAdvisor(Surajit, Jeff) [0.8]
worksAt(Jeff, Stanford) [0.9]
graduatedFrom(Surajit, Stanford) [0.6]

Query graduatedFrom(Surajit, y)

P(CΛD) = 0.8 × 0.9 = 0.72
P(B V (CΛD)) = 1 − (1−0.72) × (1−0.6) = 0.888
P(Q1) = 0.7 × (1−0.888) = 0.078        P(Q2) = (1−0.7) × 0.888 = 0.266

C D

A B

Q1 = A Λ ⌐(B V (CΛD))        Q2 = ⌐A Λ (B V (CΛD))

Q1: graduatedFrom(Surajit, Princeton)        Q2: graduatedFrom(Surajit, Stanford)

[Das Sarma, Theobald, Widom: ICDE‘08; Dylla, Miliaraki, Theobald: CIKM‘11]

Possible Worlds Semantics

A:0.7   B:0.6   C:0.8   D:0.9        Q2: ⌐A Λ (B V (CΛD))

P(W)

1 1 1 1 0 0.7x0.6x0.8x0.9 = 0.3024

1 1 1 0 0 0.7x0.6x0.8x0.1 = 0.0336

1 1 0 1 0 … = 0.0756

1 1 0 0 0 … = 0.0084

1 0 1 1 0 … = 0.2016

1 0 1 0 0 … = 0.0224

1 0 0 1 0 … = 0.0504

1 0 0 0 0 … = 0.0056

0 1 1 1 1 0.3x0.6x0.8x0.9 = 0.1296

0 1 1 0 1 0.3x0.6x0.8x0.1 = 0.0144

0 1 0 1 1 0.3x0.6x0.2x0.9 = 0.0324

0 1 0 0 1 0.3x0.6x0.2x0.1 = 0.0036

0 0 1 1 1 0.3x0.4x0.8x0.9 = 0.0864

0 0 1 0 0 … = 0.0096

0 0 0 1 0 … = 0.0216

0 0 0 0 0 … = 0.0024

Σ P(W) = 1.0

P(Q1) = 0.0784        P(Q1|H) = 0.0784 / 0.412 = 0.1903
P(Q2) = 0.2664        P(Q2|H) = 0.2664 / 0.412 = 0.6466
P(H) = 0.412

Hard rule H: ⌐A V ⌐(B V (CΛD))
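The marginals can be re-derived by brute force over the 16 worlds; the sketch below assumes the lineage formulas Q1 = A Λ ⌐(B V (CΛD)) and Q2 = ⌐A Λ (B V (CΛD)) reconstructed from the numbers on the slide, with A = gradFrom(Surajit, Princeton) [0.7], B = gradFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9]:

```python
# Brute-force check of the marginals P(Q1) and P(Q2) in the table above.
from itertools import product

pa, pb, pc, pd = 0.7, 0.6, 0.8, 0.9

pq1 = pq2 = 0.0
for a, b, c, d in product([True, False], repeat=4):
    w = ((pa if a else 1-pa) * (pb if b else 1-pb) *
         (pc if c else 1-pc) * (pd if d else 1-pd))
    derived = b or (c and d)       # lineage of gradFrom(Surajit, Stanford)
    if a and not derived:
        pq1 += w                   # worlds where Q1 holds
    if not a and derived:
        pq2 += w                   # worlds where Q2 holds

assert abs(pq1 - 0.0784) < 1e-4    # matches P(Q1) in the table
assert abs(pq2 - 0.2664) < 1e-4    # matches P(Q2) in the table
```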

More Probabilistic Approaches

• Propositional
  – Stochastic MaxSAT solvers: MaxWalkSAT (MAP inference)
  – URDF: constrained weighted MaxSAT solver for soft & hard rules
• Lineage & possible worlds (tuple-independent database)
  – Exact probabilistic inference: junction trees, variable elimination
  – Approximate inference: decision diagrams/Shannon expansions, sampling
• Combining first-order logic & probabilistic graphical models
  – Markov Logic Networks* [Richardson & Domingos: Machine Learning 2006]
  – Factor graphs [FactorIE, McCallum et al.: NIPS 2008]
  – Variety of MCMC sampling techniques for probabilistic inference (e.g., Gibbs sampling, MC-SAT, etc.)
  *Alchemy – Open-Source AI: http://alchemy.cs.washington.edu/

Experiments

• URDF: SLD grounding & MaxSAT solving
  (|C| = # literals in soft rules, |S| = # literals in hard rules)
• URDF MaxSAT vs. Markov Logic (MAP inference & MC-SAT)
• YAGO knowledge base: 2 million entities, 20 million facts
• Basic query answering: SLD grounding & MaxSAT solving of 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …)
• Asymptotic runtime checks: runtime comparisons for synthetic rule expansions

• System components:
  – Flash Player client
  – Tomcat server (JRE)
  – Relational backend (JDBC)
  – Remote Method Invocation & object serialization (BlazeDS)

UViz: URDF Visualization Frontend[Meiser, Dylla, Theobald: CIKM’11 Demo]

UViz: URDF Visualization Frontend

Demo! http://urdf.mpi-inf.mpg.de

[Meiser, Dylla, Theobald: CIKM’11 Demo]

Recommended Readings

PART I
• SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008, http://www.w3.org/TR/2008/REC-rdf-sparql-query/
• SPARQL 1.1 Query Language, W3C Working Draft, 21 March 2013, http://www.w3.org/TR/sparql11-query/
• SPARQL 1.1 Federated Query, W3C Working Draft, 21 March 2013, http://www.w3.org/TR/sparql11-federated-query/
• Kemafor Anyanwu, Angela Maduko, Amit P. Sheth: SPARQ2L: towards support for subgraph extraction queries in RDF databases. WWW Conference, 2007
• Krisztian Balog, Edgar Meij, Maarten de Rijke: Entity Search: Building Bridges between Two Worlds. WWW, 2010
• Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002
• Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: searching entities directly and holistically. VLDB, 2007
• Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, Gerhard Weikum: Language-model-based ranking for queries on RDF-graphs. CIKM, 2009
• Vagelis Hristidis, Heasoo Hwang, Yannis Papakonstantinou: Authority-based keyword search in databases. ACM Transactions on Database Systems 33(1), 2008
• Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum: STAR: Steiner-Tree Approximation in Relationship Graphs. ICDE, 2009
• Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum: NAGA: Searching and Ranking Knowledge. ICDE, 2008
• Thomas Neumann, Gerhard Weikum: Scalable join processing on very large RDF graphs. SIGMOD Conference, 2009
• Thomas Neumann, Gerhard Weikum: The RDF-3X engine for scalable management of RDF data. VLDB Journal 19(1), 2010
• François Picalausa, Yongming Luo, George H. L. Fletcher, Jan Hidders, Stijn Vansummeren: A Structural Approach to Indexing Triples. ESWC, 2012
• Nicoleta Preda, Gjergji Kasneci, Fabian M. Suchanek, Thomas Neumann, Wenjun Yuan, Gerhard Weikum: Active knowledge: dynamically enriching RDF knowledge bases by Web Services. SIGMOD Conference, 2010
• ChengXiang Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008
• Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao: gStore: Answering SPARQL Queries via Subgraph Matching. PVLDB 4(8), 2011

PART II
• Min Cai, Martin R. Frank: RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network. WWW, 2004
• Gong Cheng, Weiyi Ge, Yuzhong Qu: Falcons: searching and browsing entities on the semantic web. WWW, 2008
• Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal C Doshi, Joel Sachs: Swoogle: A Search and Metadata Engine for the Semantic Web. CIKM, 2004
• Luis Galárraga, Katja Hose, Ralf Schenkel: Partout: A Distributed Engine for Efficient RDF Processing. To appear in PVLDB, 2013
• Steve Harris, Nick Lamb, Nigel Shadbolt: 4store: The Design and Implementation of a Clustered RDF Store. SSWS, 2009
• Jiewen Huang, Daniel J. Abadi, Kun Ren: Scalable SPARQL Querying of Large RDF Graphs. PVLDB 4(11), 2011
• Vaibhav Khadilkar, Murat Kantarcioglu, Bhavani Thuraisingham, Paolo Castagna: Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. ISWC, 2012
• Bastian Quilitz, Ulf Leser: Querying Distributed RDF Data Sources with SPARQL. ISWC, 2008
• Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, Michael Schmidt: FedX: A Federation Layer for Distributed Query Processing on Linked Open Data. ESWC, 2011
• Bin Shao, Haixun Wang, Yatao Li: Trinity: A Distributed Graph Engine on a Memory Cloud. To appear in SIGMOD, 2013
• Xiaofei Zhang, Lei Chen, Yongxin Tong, Min Wang: EAGRE: Towards Scalable I/O Efficient SPARQL Query Evaluation on the Cloud. ICDE, 2013

PART III
• Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with uncertainty and lineage. VLDB J. 17(2), 2008
• Jihad Boulos, Nilesh N. Dalvi, Bhushan Mandhani, Shobhit Mathur, Christopher Ré, Dan Suciu: MYSTIQ: a system for finding more answers by using probabilities. SIGMOD Conference, 2005
• Nilesh N. Dalvi, Dan Suciu: Efficient Query Evaluation on Probabilistic Databases. VLDB, 2004
• Maximilian Dylla, Iris Miliaraki, Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE, 2013
• Norbert Fuhr, Thomas Rölleke: A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. ACM Trans. Inf. Syst. 15(1), 1997
• Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive reasoning in uncertain RDF knowledge bases. CIKM, 2011
• Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS, 2012
• Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch: Probabilistic Databases (Synthesis Lectures on Data Management), Morgan & Claypool Publishers, 2012