Web Data Management with RDF
Web Data Management in RDF Age
M. Tamer Ozsu
University of Waterloo
David R. Cheriton School of Computer Science
PKU/2014-08-28
Acknowledgements
This presentation draws upon collaborative research and discussions with the following colleagues (in alphabetical order):
Gunes Aluc, University of Waterloo
Khuzaima Daudjee, University of Waterloo
Olaf Hartig, University of Waterloo
Lei Chen, Hong Kong University of Science & Technology
Lei Zou, Peking University
Web Data Management
A long term research interest in the DB community
[Figure: four items dated 2000, 2004, 2011, and 2011]
Interest Due to Properties of Web Data
Lack of a schema
  Data is at best “semi-structured”
  Missing data, additional attributes, similar data but not identical
Volatility
  Changes frequently
  May conform to one schema now, but not later
Scale
  Does it make sense to talk about a schema for Web?
  How do you capture “everything”?
Querying difficulty
  What is the user language?
  What are the primitives?
  Aren't search engines or metasearch engines sufficient?
More Recent Approaches to Web Querying
Fusion Tables
  Users contribute data in spreadsheet, CSV, KML format
  Possible joins between multiple data sets
  Extensive visualization
XML
  Data exchange language
  Primarily tree based structure

  <list title="MOVIES">
    <film>
      <title>The Shining</title>
      <director>Stanley Kubrick</director>
      <actor>Jack Nicholson</actor>
    </film>
    <film>
      <title>Spartacus</title>
      <director>Stanley Kubrick</director>
    </film>
    <film>
      <title>The Passenger</title>
      <actor>Jack Nicholson</actor>
    </film>
    ...
  </list>

  root
    film
      title: “The Shining”
      director: “Stanley Kubrick”
      actor: “Jack Nicholson”
    film
      ...
    film
      title: “The Passenger”
      actor: “Jack Nicholson”
RDF (Resource Description Framework) & SPARQL
  W3C recommendation
  Simple, self-descriptive model
  Building block of semantic web & Linked Open Data (LOD)
RDF and Semantic Web
RDF is a language for the conceptual modeling of information about resources (web resources in our context)
A building block of semantic web
  Facilitates exchange of information
  Search engine results can be more focused and structured
  Facilitates data integration (mashups)
Machine understandable
  Understand the information on the web and the interrelationships among them
RDF Uses
Yago and DBpedia extract facts from Wikipedia & represent as RDF → structural queries
Communities build RDF data
  E.g., biologists: Bio2RDF and Uniprot RDF
Web data integration
  Linked Open Data Cloud
...
RDF Data Volumes . . .
. . . are growing – and fast
Linked data cloud currently consists of 325 datasets with >25B triples
Size almost doubling every year
[Figure: Linking Open Data cloud diagram as of March 2009: 89 datasets]
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
[Figure: Linking Open Data cloud diagram as of September 2010: 203 datasets. Legend: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content]
[Figure: Linking Open Data cloud diagram as of September 2011: 295 datasets, 25B triples. Legend: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content]
April ’14: 1091 datasets, ??? triples
Max Schmachtenberg, Christian Bizer, and Heiko Paulheim: Adoption of Linked Data Best Practices in Different Topical Domains. In Proc. ISWC, 2014.
Closer Look
Globally Distributed Network of Data
Three Approaches
Data warehousing
  Consolidate data in a repository and query it
SPARQL federation
  Leverage query services provided by data publishers
Live Linked Data querying
  Navigate through LOD by looking up URIs at query execution time
Outline
1 LOD and RDF Introduction
2 Data Warehousing Approach
  Relational Approaches
  Graph-Based Approaches
3 SPARQL Federation Approach
  Distributed RDF Processing
  SPARQL Endpoint Federation
4 Live Querying Approach
  Traversal-based approaches
  Index-based approaches
  Hybrid approaches
5 Conclusions
Traditional Hypertext-based Web Access
IMDb   WorldBook
Data exposed to the Web via HTML
Linked Data Publishing Principles
IMDb   WorldBook
(http://...linkedmdb.../Shining, releaseDate, 23 May 1980)
(http://...linkedmdb.../Shining, filmLocation, http://cia.../UK)
(http://...linkedmdb.../29704, actedIn, http://...linkedmdb.../Shining)
...
(http://cia.../UK, hasPopulation, 63230000)
...
Data model: RDF
Global identifier: URI
Access mechanism: HTTP
Connection: data links
RDF Introduction
Everything is a uniquely named resource
Prefixes can be used to shorten the names
Properties of resources can be defined
Relationships with other resources can be defined
Resource descriptions can be contributed by different people/groups and can be located anywhere in the web
  Integrated web “database”

http://data.linkedmdb.org/resource/actor/JN29704
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
y:TS2014:title “The Shining”
y:TS2014:releaseDate “1980-05-23”
JN29704:movieActor
y:TS2014
RDF Data Model
Triple: Subject, Predicate (Property), Object (s, p, o)
  Subject: the entity that is described (URI or blank node)
  Predicate: a feature of the entity (URI)
  Object: value of the feature (URI, blank node or literal)
(s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L)
  U: set of URIs
  B: set of blank nodes
  L: set of literals
Set of RDF triples is called an RDF graph

Subject                       Predicate           Object
http://...imdb.../film/2014   rdfs:label          “The Shining”
http://...imdb.../film/2014   movie:releaseDate   “1980-05-23”
http://...imdb.../29704       movie:actor_name    “Jack Nicholson”
. . .                         . . .               . . .
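The membership condition (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L) can be checked mechanically. The following is a minimal sketch, not part of the original slides: the classes URI, BlankNode and Literal and the function is_valid_triple are hypothetical names introduced only for illustration, and the expanded film URI assumes the mdb prefix used elsewhere in the talk.

```python
from dataclasses import dataclass


# Hypothetical term types mirroring the sets U, B and L.
@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str


def is_valid_triple(s, p, o) -> bool:
    """Check (s, p, o) against (U ∪ B) × U × (U ∪ B ∪ L)."""
    return (
        isinstance(s, (URI, BlankNode))
        and isinstance(p, URI)
        and isinstance(o, (URI, BlankNode, Literal))
    )


film = URI("http://data.linkedmdb.org/resource/film/2014")
label = URI("http://www.w3.org/2000/01/rdf-schema#label")

# An RDF graph is a set of triples.
graph = {(film, label, Literal("The Shining"))}

ok = all(is_valid_triple(*t) for t in graph)
# A literal in subject position violates the model.
bad = is_valid_triple(Literal("The Shining"), label, film)
print(ok, bad)
```

Modeling the graph as a Python set matches the slide's definition: duplicates collapse and triple order carries no meaning.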
RDF Example Instance
Prefixes:
  mdb=http://data.linkedmdb.org/resource/
  geo=http://sws.geonames.org/
  bm=http://wifo5-03.informatik.uni-mannheim.de/bookmashup/
  lexvo=http://lexvo.org/id/
  wp=http://en.wikipedia.org/wiki/

Subject               Predicate                    Object
mdb:film/2014         rdfs:label                   “The Shining”
mdb:film/2014         movie:initial_release_date   “1980-05-23”
mdb:film/2014         movie:director               mdb:director/8476
mdb:film/2014         movie:actor                  mdb:actor/29704
mdb:film/2014         movie:actor                  mdb:actor/30013
mdb:film/2014         movie:music_contributor      mdb:music_contributor/4110
mdb:film/2014         foaf:based_near              geo:2635167
mdb:film/2014         movie:relatedBook            bm:books/0743424425
mdb:film/2014         movie:language               lexvo:iso639-3/eng
mdb:director/8476     movie:director_name          “Stanley Kubrick”
mdb:film/2685         movie:director               mdb:director/8476
mdb:film/2685         rdfs:label                   “A Clockwork Orange”
mdb:film/424          movie:director               mdb:director/8476
mdb:film/424          rdfs:label                   “Spartacus”
mdb:actor/29704       movie:actor_name             “Jack Nicholson”
mdb:film/1267         movie:actor                  mdb:actor/29704
mdb:film/1267         rdfs:label                   “The Last Tycoon”
mdb:film/3418         movie:actor                  mdb:actor/29704
mdb:film/3418         rdfs:label                   “The Passenger”
geo:2635167           gn:name                      “United Kingdom”
geo:2635167           gn:population                62348447
geo:2635167           gn:wikipediaArticle          wp:United_Kingdom
bm:books/0743424425   dc:creator                   bm:persons/Stephen+King
bm:books/0743424425   rev:rating                   4.7
bm:books/0743424425   scom:hasOffer                bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng    rdfs:label                   “English”
lexvo:iso639-3/eng    lvont:usedIn                 lexvo:iso3166/CA
lexvo:iso639-3/eng    lvont:usesScript             lexvo:script/Latn

Subjects and predicates are URIs; objects are URIs or literals.
RDF Graph
[Figure: the example instance drawn as a graph. Subjects and objects are nodes (e.g., mdb:film/2014, mdb:director/8476, mdb:actor/29704, bm:books/0743424425, geo:2635167, “The Shining”, “Stanley Kubrick”); predicates (e.g., rdfs:label, movie:director, movie:actor, movie:relatedBook, rev:rating, foaf:based_near) label the directed edges.]
Linked Data Model [Hartig, 2012]
Web Document
Given a countably infinite set 𝒟 (documents), a Web of Linked Data is a tuple W = (D, adoc, data) where:
- D ⊆ 𝒟,
- adoc is a partial mapping from URIs to D, and
- data is a total mapping from D to finite sets of RDF triples.
Web of Linked Data
A Web of Linked Data W = (D, adoc, data) contains a data link from document d ∈ D to document d′ ∈ D if there exists a URI u such that:
- u is mentioned in an RDF triple t ∈ data(d), and
- d′ = adoc(u).
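The data-link definition can be made concrete with a toy Web of Linked Data. This is only a sketch under stated assumptions: the document names, the example.org URIs and the helper data_links are invented for illustration, and literals are represented as plain strings for brevity.

```python
# Toy Web of Linked Data W = (D, adoc, data); every name below is made up.
D = {"d_film", "d_uk"}

# adoc: partial mapping from URIs to documents.
# URIs that are absent cannot be dereferenced.
adoc = {
    "http://example.org/Shining": "d_film",
    "http://example.org/UK": "d_uk",
}

# data: total mapping from documents to finite sets of RDF triples.
data = {
    "d_film": {
        ("http://example.org/Shining",
         "http://example.org/filmLocation",
         "http://example.org/UK"),
    },
    "d_uk": {
        ("http://example.org/UK",
         "http://example.org/hasPopulation",
         "63230000"),
    },
}


def data_links(D, adoc, data):
    """Return all pairs (d, d2) where a URI mentioned in data(d) maps to d2."""
    links = set()
    for d in D:
        for triple in data[d]:
            for term in triple:
                target = adoc.get(term)  # None when adoc is undefined for term
                if target is not None:
                    links.add((d, target))
    return links


links = data_links(D, adoc, data)
print(sorted(links))
```

Note that the definition does not exclude d′ = d, so a document that mentions its own URI links to itself.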
RDF Query Model: SPARQL
Query Model: SPARQL Protocol and RDF Query Language
Given U (set of URIs), L (set of literals), and V (set of variables), a SPARQL expression is defined recursively:
  an atomic triple pattern, which is an element of (U ∪ V) × (U ∪ V) × (U ∪ V ∪ L)
    ?x rdfs:label “The Shining”
  P FILTER R, where P is a graph pattern expression and R is a built-in SPARQL condition (i.e., analogous to a SQL predicate)
    ?x rev:rating ?p FILTER(?p > 3.0)
  P1 AND/OPT/UNION P2, where P1 and P2 are graph pattern expressions
Example:
  SELECT ?name
  WHERE {
    ?m rdfs:label ?name . ?m movie:director ?d .
    ?d movie:director_name "Stanley Kubrick" .
    ?m movie:relatedBook ?b . ?b rev:rating ?r .
    FILTER(?r > 4.0)
  }
SPARQL Queries
  SELECT ?name
  WHERE {
    ?m rdfs:label ?name . ?m movie:director ?d .
    ?d movie:director_name "Stanley Kubrick" .
    ?m movie:relatedBook ?b . ?b rev:rating ?r .
    FILTER(?r > 4.0)
  }

Query graph:
  ?m --rdfs:label--> ?name
  ?m --movie:director--> ?d
  ?d --movie:director_name--> “Stanley Kubrick”
  ?m --movie:relatedBook--> ?b
  ?b --rev:rating--> ?r
  FILTER(?r > 4.0)
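To see how a conjunction of triple patterns plus a FILTER selects bindings, here is a small illustrative evaluator. It is a naive nested-loop sketch, not how a SPARQL engine is actually implemented; the helper names (is_var, match, evaluate) are hypothetical, and the data is a subset of the example instance, assuming the related book is identified as bm:books/0743424425.

```python
TRIPLES = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/424", "rdfs:label", "Spartacus"),
    ("mdb:film/424", "movie:director", "mdb:director/8476"),
    ("bm:books/0743424425", "rev:rating", 4.7),
]


def is_var(term):
    """Variables are strings starting with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if p_term in new and new[p_term] != t_term:
                return None  # conflicts with an earlier binding
            new[p_term] = t_term
        elif p_term != t_term:
            return None  # constant does not match
    return new


def evaluate(patterns, triples):
    """Evaluate a conjunction (AND) of triple patterns."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for b in bindings:
            for t in triples:
                b2 = match(pattern, t, b)
                if b2 is not None:
                    extended.append(b2)
        bindings = extended
    return bindings


query = [
    ("?m", "rdfs:label", "?name"),
    ("?m", "movie:director", "?d"),
    ("?d", "movie:director_name", "Stanley Kubrick"),
    ("?m", "movie:relatedBook", "?b"),
    ("?b", "rev:rating", "?r"),
]

# FILTER(?r > 4.0) applied to the solutions, then projection on ?name.
names = [b["?name"] for b in evaluate(query, TRIPLES) if b["?r"] > 4.0]
print(names)
```

Spartacus shares the director but has no related book in this subset, so it drops out at the fourth pattern.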
Outline
1 LOD and RDF Introduction
2 Data Warehousing Approach
  Relational Approaches
  Graph-Based Approaches
3 SPARQL Federation Approach
  Distributed RDF Processing
  SPARQL Endpoint Federation
4 Live Querying Approach
  Traversal-based approaches
  Index-based approaches
  Hybrid approaches
5 Conclusions
Naïve Triple Store Design

  SELECT ?name
  WHERE {
    ?m rdfs:label ?name . ?m movie:director ?d .
    ?d movie:director_name "Stanley Kubrick" .
    ?m movie:relatedBook ?b . ?b rev:rating ?r .
    FILTER(?r > 4.0)
  }

Triple table T:
Subject               Property                     Object
mdb:film/2014         rdfs:label                   “The Shining”
mdb:film/2014         movie:initial_release_date   “1980-05-23”
mdb:film/2014         movie:director               mdb:director/8476
mdb:film/2014         movie:actor                  mdb:actor/29704
mdb:film/2014         movie:actor                  mdb:actor/30013
mdb:film/2014         movie:music_contributor      mdb:music_contributor/4110
mdb:film/2014         foaf:based_near              geo:2635167
mdb:film/2014         movie:relatedBook            bm:books/0743424425
mdb:film/2014         movie:language               lexvo:iso639-3/eng
mdb:director/8476     movie:director_name          “Stanley Kubrick”
mdb:film/2685         movie:director               mdb:director/8476
mdb:film/2685         rdfs:label                   “A Clockwork Orange”
mdb:film/424          movie:director               mdb:director/8476
mdb:film/424          rdfs:label                   “Spartacus”
mdb:actor/29704       movie:actor_name             “Jack Nicholson”
mdb:film/1267         movie:actor                  mdb:actor/29704
mdb:film/1267         rdfs:label                   “The Last Tycoon”
mdb:film/3418         movie:actor                  mdb:actor/29704
mdb:film/3418         rdfs:label                   “The Passenger”
geo:2635167           gn:name                      “United Kingdom”
geo:2635167           gn:population                62348447
geo:2635167           gn:wikipediaArticle          wp:United_Kingdom
bm:books/0743424425   dc:creator                   bm:persons/Stephen+King
bm:books/0743424425   rev:rating                   4.7
bm:books/0743424425   scom:hasOffer                bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng    rdfs:label                   “English”
lexvo:iso639-3/eng    lvont:usedIn                 lexvo:iso3166/CA
lexvo:iso639-3/eng    lvont:usesScript             lexvo:script/Latn

Equivalent SQL over T(s, p, o):

  SELECT T1.o
  FROM T AS T1, T AS T2, T AS T3, T AS T4, T AS T5
  WHERE T1.p = 'rdfs:label'
    AND T2.p = 'movie:relatedBook'
    AND T3.p = 'movie:director'
    AND T4.p = 'rev:rating'
    AND T5.p = 'movie:director_name'
    AND T1.s = T2.s
    AND T1.s = T3.s
    AND T2.o = T4.s
    AND T3.o = T5.s
    AND T4.o > 4.0
    AND T5.o = 'Stanley Kubrick'

Easy to implement but too many self-joins!
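The self-join formulation can be tried out directly with an in-memory SQLite database. This is a sketch under assumptions, not the setup used in the talk: the table name T and columns s, p, o follow the slide, only a subset of the example triples is loaded, the related book is assumed to be identified as bm:books/0743424425, and the object column is left untyped so it can hold both strings and numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One three-column table holds every triple; o is untyped on purpose.
conn.execute("CREATE TABLE T (s TEXT, p TEXT, o)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [
        ("mdb:film/2014", "rdfs:label", "The Shining"),
        ("mdb:film/2014", "movie:director", "mdb:director/8476"),
        ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
        ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
        ("mdb:film/2685", "movie:director", "mdb:director/8476"),
        ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
        ("bm:books/0743424425", "rev:rating", 4.7),
    ],
)

# Five triple patterns become five aliases of the same table.
sql = """
SELECT T1.o
FROM T AS T1, T AS T2, T AS T3, T AS T4, T AS T5
WHERE T1.p = 'rdfs:label'
  AND T2.p = 'movie:relatedBook'
  AND T3.p = 'movie:director'
  AND T4.p = 'rev:rating'
  AND T5.p = 'movie:director_name'
  AND T1.s = T2.s
  AND T1.s = T3.s
  AND T2.o = T4.s
  AND T3.o = T5.s
  AND T4.o > 4.0
  AND T5.o = 'Stanley Kubrick'
"""
rows = conn.execute(sql).fetchall()
print(rows)
```

Every additional triple pattern in the SPARQL query adds one more alias of T, which is the source of the "too many self-joins" problem.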
Exhaustive Indexing
RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss et al., 2008]
Strings are mapped to ids using a mapping table
Triples are indexed in a clustered B+ tree in lexicographic order
Create indexes for permutations of the three columns: SPO, SOP, PSO, POS, OPS, OSP

Original triple table
Subject             Property                     Object
mdb:film/2014       rdfs:label                   “The Shining”
mdb:film/2014       movie:initial_release_date   “1980-05-23”
mdb:director/8476   movie:director_name          “Stanley Kubrick”
mdb:film/2685       movie:director               mdb:director/8476

Encoded triple table
Subject   Property   Object
0         1          2
0         3          4
5         6          7
8         9          5

Mapping table
ID   Value
0    mdb:film/2014
1    rdfs:label
2    “The Shining”
3    movie:initial_release_date
4    “1980-05-23”
5    mdb:director/8476
6    movie:director_name
7    “Stanley Kubrick”
8    mdb:film/2685
9    movie:director
Exhaustive Indexing
[Figure: the encoded triple table (Subject, Property, Object) stored in a B+ tree]
Easy querying through mapping table
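The encoding and the six permutation indexes can be sketched in a few lines. This is an illustration only, not the RDF-3X or Hexastore implementation: sorted Python lists stand in for the clustered B+ trees, ids are assigned in order of first appearance, and the helper prefix_scan is a hypothetical name.

```python
from bisect import bisect_left, bisect_right
from itertools import permutations

triples = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:initial_release_date", "1980-05-23"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
]

# Mapping table: each distinct string gets the next free integer id.
ids = {}
for t in triples:
    for term in t:
        ids.setdefault(term, len(ids))
encoded = [tuple(ids[term] for term in t) for t in triples]

# One sorted copy of the encoded triples per column permutation.
indexes = {}
for order in permutations("SPO"):
    pos = ["SPO".index(c) for c in order]
    indexes["".join(order)] = sorted(tuple(t[i] for i in pos) for t in encoded)


def prefix_scan(index_name, prefix):
    """Range query: all entries of an index that start with the id prefix."""
    index = indexes[index_name]
    lo = bisect_left(index, prefix)
    # Pad the prefix with +infinity to find the end of the matching range.
    hi = bisect_right(index, prefix + (float("inf"),) * (3 - len(prefix)))
    return index[lo:hi]


# Pattern (mdb:film/2014, ?p, ?o): bound subject, so use the SPO index.
film_facts = prefix_scan("SPO", (ids["mdb:film/2014"],))
# Pattern (?s, movie:director, mdb:director/8476): use the POS index.
directed = prefix_scan("POS", (ids["movie:director"], ids["mdb:director/8476"]))
print(encoded, film_facts, directed)
```

Whatever positions of a triple pattern are bound, some permutation has them as a prefix, so each pattern reduces to one contiguous range.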
Exhaustive Indexing: Query Execution
Each triple pattern can be answered by a range query
Joins between triple patterns computed using merge join
Join order is easy due to extensive indexing

Subject   Property   Object
0         1          2
0         3          4
5         6          7
8         9          5
...       ...        ...

ID   Value
0    mdb:film/2014
1    rdfs:label
2    “The Shining”
3    movie:initial_release_date
4    “1980-05-23”
5    mdb:director/8476
6    movie:director_name
7    “Stanley Kubrick”
8    mdb:film/2685
9    movie:director

Advantages
- Eliminates some of the joins: they become range queries
- Merge join is easy and fast
Disadvantages
- Space usage
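Because every index delivers its range in sorted order, two triple patterns that share a variable can be combined with a merge join. The sketch below is a generic textbook merge join, not code from any of the cited systems; the input lists and their ids are hypothetical stand-ins for two range scans over a PSO index.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are both sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side ...
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # ... and pair it with every left row that has the same key.
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out


# Hypothetical (subject id, object id) ranges, already sorted by subject:
labels = [(0, 2), (8, 10), (13, 14)]   # ?m rdfs:label ?name
directors = [(0, 5), (8, 5)]           # ?m movie:director ?d

joined = merge_join(labels, directors)  # rows of (?m, ?name, ?d)
print(joined)
```

Each input is scanned once and no sorting step is needed, which is why the slide calls merge join easy and fast in this setting.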
Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea et al., 2013]
Clustered property table: group together the properties that tend to occur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type of property into one property table

Triple table
Subject            Property            Object
mdb:film/2014      rdfs:label          “The Shining”
mdb:film/2014      movie:director      mdb:director/8476
mdb:film/2685      movie:director      mdb:director/8476
mdb:film/2685      rdfs:label          “A Clockwork Orange”
mdb:actor/29704    movie:actor_name    “Jack Nicholson”
...                ...                 ...

Property table (films)
Subject          rdfs:label              movie:director
mdb:film/2014    “The Shining”           mdb:director/8476
mdb:film/2685    “A Clockwork Orange”    mdb:director/8476

Property table (actors)
Subject            movie:actor_name
mdb:actor/29704    “Jack Nicholson”

Advantages
  Fewer joins
  If the data is structured, we have a relational system, similar to normalized relations
Disadvantages
  Potentially a lot of NULLs
  Clustering is not trivial
  Multi-valued properties are complicated
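The pivot from triples to a property table can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `build_property_table` and the in-memory dictionaries are hypothetical and do not reflect how Jena or DB2-RDF store data; the triples are the ones from the example table.

```python
from collections import defaultdict

# Toy triples taken from the example table (prefixed names kept as plain strings).
TRIPLES = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:actor/29704", "movie:actor_name", "Jack Nicholson"),
]


def build_property_table(triples, properties):
    """Pivot triples into one row per subject with one column per property.

    Subjects that have none of the requested properties are skipped. A
    missing property stays None (the NULL case). A multi-valued property
    would overwrite its earlier value here, which is the complication
    listed under Disadvantages.
    """
    rows = defaultdict(lambda: {p: None for p in properties})
    for s, p, o in triples:
        if p in properties:
            rows[s][p] = o
    return dict(rows)


FILM_COLUMNS = ["rdfs:label", "movie:director"]
film_table = build_property_table(TRIPLES, FILM_COLUMNS)

# A film with a label but no director produces a NULL in the director column.
sparse_table = build_property_table(
    TRIPLES + [("mdb:film/424", "rdfs:label", "Spartacus")], FILM_COLUMNS
)
```

Both properties of a film now sit in one row, so a query asking for label and director of the same film needs no join; the price is the `None` entry in `sparse_table`.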
Binary Tables
Grouping by properties: for each property, build a two-column table containing both subject and object, ordered by subject [Abadi et al., 2007, 2009]
Also called vertically partitioned tables
n two-column tables (n is the number of unique properties in the data)

Triple table
Subject            Property            Object
mdb:film/2014      rdfs:label          “The Shining”
mdb:film/2014      movie:director      mdb:director/8476
mdb:film/2685      movie:director      mdb:director/8476
mdb:film/2685      rdfs:label          “A Clockwork Orange”
mdb:actor/29704    movie:actor_name    “Jack Nicholson”
...                ...                 ...

movie:director
Subject          Object
mdb:film/2014    mdb:director/8476
mdb:film/2685    mdb:director/8476

rdfs:label
Subject          Object
mdb:film/2014    “The Shining”
mdb:film/2685    “A Clockwork Orange”

movie:actor_name
Subject            Object
mdb:actor/29704    “Jack Nicholson”

Advantages
  Supports multi-valued properties
  No NULLs
  No clustering
  Read only needed attributes (i.e., less I/O)
  Good performance for subject-subject joins
Disadvantages
  Not useful for subject-object joins
  Expensive inserts
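Why subject-subject joins are cheap on binary tables can be seen in a small sketch. It assumes in-memory Python lists in place of a column store, and the helper names `vertical_partition` and `merge_join_on_subject` are hypothetical. Because every table is sorted by subject, the join is a single linear merge.

```python
from collections import defaultdict

TRIPLES = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:actor/29704", "movie:actor_name", "Jack Nicholson"),
]


def vertical_partition(triples):
    """Return {property: [(subject, object), ...]}, each list sorted by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return {p: sorted(rows) for p, rows in tables.items()}


def merge_join_on_subject(left, right):
    """Subject-subject merge join of two subject-sorted binary tables."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        ls, rs = left[i][0], right[j][0]
        if ls < rs:
            i += 1
        elif ls > rs:
            j += 1
        else:
            # Find the run of right rows sharing this subject, then pair it
            # with every left row for the same subject (multi-valued case).
            j_end = j
            while j_end < len(right) and right[j_end][0] == ls:
                j_end += 1
            while i < len(left) and left[i][0] == ls:
                out.extend((ls, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


tables = vertical_partition(TRIPLES)
joined = merge_join_on_subject(tables["rdfs:label"], tables["movie:director"])
```

A subject-object join (for example, film to director name) cannot reuse this merge, because one side would have to be sorted by object; that is the first disadvantage listed above.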
Graph-based Approach
Answering SPARQL query ≡ subgraph matching
gStore [Zou et al., 2011, 2014]
[Figure: the SPARQL query graph (variables ?m, ?d, ?name, ?b, ?r; edges movie:director, rdfs:label, movie:relatedBook, movie:director_name, rev:rating; literal “Stanley Kubrick”; FILTER(?r > 4.0)) is matched as a subgraph of the example RDF data graph.]
Advantages
  Maintains the graph structure
  Full set of queries can be handled
Disadvantages
  Graph pattern matching is expensive
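The equivalence between answering a basic graph pattern and subgraph matching can be illustrated with a naive backtracking matcher. This is a sketch, not gStore's algorithm: the function `match_bgp` is hypothetical, FILTER is left out, and the `movie:director_name` triple is an assumed direction for that edge.

```python
def match_bgp(patterns, triples, binding=None):
    """Yield every variable binding under which all patterns occur in triples.

    Terms starting with '?' are variables; everything else is a constant.
    Each pattern is tried against every triple, which is why unindexed
    graph pattern matching is expensive.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match_bgp(rest, triples, new)


DATA = [
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),  # assumed edge direction
    ("mdb:film/3418", "rdfs:label", "The Passenger"),
]

QUERY = [
    ("?m", "movie:director", "?d"),
    ("?d", "movie:director_name", "Stanley Kubrick"),
    ("?m", "rdfs:label", "?name"),
]

names = sorted(b["?name"] for b in match_bgp(QUERY, DATA))
```

Each result is one embedding of the query graph in the data graph; "The Passenger" is not returned because its film vertex has no matching movie:director edge.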
gStore
General approach:
Work directly on the RDF graph and the SPARQL query graph
Use a signature-based encoding of each entity and class vertex to speed up matching
Filter-and-evaluate
  Use a filter that may return false positives (but no false negatives) to prune nodes and obtain a set of candidates; then do a more detailed evaluation on those
Use an index (VS*-tree) over the data signature graph (it has a light maintenance load) for efficient pruning
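A much simplified sketch of signature-based filtering follows. It is an assumption-laden illustration, not gStore's actual encoding: the 16-bit width, the SHA-1 based `bit_for` hash, and the choice to hash only edge labels are all hypothetical. What it does show is the property the filter relies on: a data vertex that really has all the query's edges can never be rejected, while unrelated vertices may occasionally slip through and must be checked in the evaluate step.

```python
import hashlib

SIGNATURE_BITS = 16  # hypothetical width; chosen small for readability


def bit_for(label):
    """Map one adjacent edge label to a single bit position (deterministic)."""
    digest = hashlib.sha1(label.encode("utf-8")).digest()
    return 1 << (digest[0] % SIGNATURE_BITS)


def vertex_signature(adjacent_labels):
    """OR together one bit per adjacent edge label."""
    sig = 0
    for label in adjacent_labels:
        sig |= bit_for(label)
    return sig


def may_match(query_sig, data_sig):
    """Filter step: every bit set for the query vertex must be set in the data vertex."""
    return (query_sig & data_sig) == query_sig


query_sig = vertex_signature(["movie:director", "rdfs:label"])
film_sig = vertex_signature(["movie:director", "rdfs:label", "movie:relatedBook"])
```

Here `may_match(query_sig, film_sig)` is guaranteed to be true because the film's edge labels are a superset of the query vertex's. The converse does not hold: two different labels can hash to the same bit, which is where false positives come from.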
1. Encode Q and G to Get Signature Graphs
[Figure: query signature graph Q* (vertex signatures 1000 0000, 0100 0000, 0000 0100; edge signatures 00010, 10000) and data signature graph G* (eleven vertices with 8-bit signatures, connected by edges with 5-bit signatures).]
2. Filter-and-Evaluate
[Figure: the query signature graph Q* and the data signature graph G*, as on the previous slide.]
Find matches of Q* over the signature graph G*
Verify each match in the RDF graph G
How to Generate Candidate List
Two-step process:
1. For each node of Q*, get the list of nodes in G* that include that node.
2. Do a multi-way join to get the candidate list.
Alternatives:
Sequential scan of G*
  Both steps are inefficient
Use S-trees
  Height-balanced tree over signatures
  Run an inclusion query for each node of Q* and get lists of nodes in G* that include that node.
    Given query signature q and a set of data signatures S, find all data signatures s_i ∈ S where q & s_i = q
  Does not support the second step: expensive
VS-tree (and VS*-tree)
  Multi-resolution summary graph based on S-tree
  Supports both steps efficiently
  Grouping by vertices
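The inclusion test q & s_i = q is a plain bitwise check. The sketch below runs it as the sequential-scan alternative over the eleven 8-bit vertex signatures that appear in the example figures; the helper names `parse`, `show`, and `inclusion_query` are hypothetical, and no mapping from signatures to vertex identifiers is assumed.

```python
def parse(bits):
    """Turn a signature written as '1000 0001' into an integer bitmask."""
    return int(bits.replace(" ", ""), 2)


def show(sig):
    """Render a bitmask as an 8-character bit string."""
    return format(sig, "08b")


# The eleven data-vertex signatures from the example data signature graph.
DATA_SIGNATURES = [parse(b) for b in (
    "0000 1000", "0000 0100", "0000 0010", "0010 1000", "0100 0001",
    "1000 0001", "0000 1001", "0100 1000", "1001 1000", "0001 0100",
    "0001 0001",
)]


def inclusion_query(q, signatures):
    """Sequential scan: keep every s_i whose bits include all bits of q."""
    return [s for s in signatures if (q & s) == q]


# One inclusion query per query-vertex signature (step 1 of the process).
candidates = {
    q: [show(s) for s in inclusion_query(parse(q), DATA_SIGNATURES)]
    for q in ("1000 0000", "0100 0000", "0000 0100")
}
```

Each query signature keeps two of the eleven data signatures. Step 2 then has to join the three candidate lists along the query edges, which is the part a plain S-tree does not help with.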
S-tree Solution
[Figure: S-tree over the eleven vertex signatures of G* (leaf entries 001 to 011; root signature 1111 1111; tree levels labelled G1, G2, G3). One inclusion query per query vertex returns the candidate lists (002, 011), (003, 008), and (004, 009), which must then be joined.]
Possibly large join space!
VS-tree
[Figure: the S-tree extended into a VS-tree. Nodes on the same level (G1, G2, G3) are connected by super edges, each labelled with a 5-bit edge signature.]
Pruning with VS-Tree
[Figure: pruning with the VS-tree. The query signature graph is matched against the summary graphs level by level (G1, G2, G3), so that only a few leaf candidates (002, 011, 003, 008, 004, 009) remain to be joined.]
Reduced join space!
Adaptivity to Workload
Web applications that are supported by RDF data management systems are far more varied than conventional relational applications
Data that are being handled are far more heterogeneous
SPARQL is far more flexible in how triple patterns (i.e., the atomic query unit) can be combined
An experiment [Aluc et al., 2014]

                                               RDF-3X   VOS (6.1)   VOS (7.1)   MonetDB   4Store
% queries for which tested system is fastest   20.9     0.0         22.6        56.5      0.0
Total workload execution time (hours)          27.1     20.9        20.8        38.6      72.2
Mean (per query) execution time (seconds)      7.8      6.0         6.0         11.1      20.7

Summary of Experiments
  No single system is the sole winner across all queries
  No single system is the sole loser across all queries, either
  There can be 2 to 5 orders of magnitude difference in performance (i.e., query execution time) between the best and the worst system for a given query
  The winner in one query may time out in another
  Performance difference widens as dataset size increases
Group-by-Query Approach
[Figure: three example subgraphs centred on the vertex Tamer, built from the edges hasPost, taggedIn, likes, favourites, and worksAt (to UWaterloo) and the vertices Olaf and Bob, illustrating grouping by query.]
Challenges
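Group-by-query clusters (a) do not have fixed size, (b) contain same set of attributes
1. Workload time analysis
2. Updating the physical layout
3. Partial indexing
[Figure: items labelled Type-A, Type-B, or Type-C, each marked as robust or adaptable.]
[Figure: a storage system with a cache whose hash function adapts between times t1 and tk, evicting entries as it changes; an index sits between the storage system and the SPARQL query engine.]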
chameleon-db
Prototype system [Aluc et al., 2013]
35,000 lines of code in C++ and growing
[Figure: chameleon-db architecture, with the components Query Engine (Plan Generation, Evaluation), Structural Index, Vertex Index, Spill Index, Cluster Index, Storage System, and Storage Advisor.]
Some Open Problems
Scalability of the solutions to very large datasets
Maintenance of auxiliary data structures in dynamic environments
Adaptive systems to handle varying and time-changing workloads
Uncertain RDF data processing
Keyword search over RDF data
Query processing over incomplete RDF data
Outline
1 LOD and RDF Introduction
2 Data Warehousing Approach
  Relational Approaches
  Graph-Based Approaches
3 SPARQL Federation Approach
  Distributed RDF Processing
  SPARQL Endpoint Federation
4 Live Querying Approach
  Traversal-based approaches
  Index-based approaches
  Hybrid approaches
5 Conclusions
Remember the Environment
Distributed environment
  Some of the data sites can process SPARQL queries: SPARQL endpoints
  Not all data sites can process queries
Alternatives
  Data re-distribution + query decomposition
  SPARQL federation: just process at SPARQL endpoints
  Live querying (see next section)
Distributed RDF Processing [Kaoudi and Manolescu, 2014]
Data partitioning approaches
RDF data warehouse is partitioned and distributed
  RDF data D = {D1, ..., Dn}
  Allocate each Di to a site
Partitioning alternatives
  Table-based (e.g., [Husain et al., 2011])
  Graph-based (e.g., [Huang et al., 2011; Zhang et al., 2013])
  Unit-based (e.g., [Gurajada et al., 2014; Lee and Liu, 2013])
SPARQL query decomposed Q = {Q1, ..., Qk}
  Distributed execution of {Q1, ..., Qk} over {D1, ..., Dn}
High performance
Great for parallelizing centralized RDF data
May not be possible to re-partition and re-allocate Web data (i.e., LOD)
Distributed RDF Processing – 2
Data summary-based approaches
Build summaries (index) for the distributed RDF datasets (e.g., [Atre et al., 2010; Prasser et al., 2012])
SPARQL query Q = {Q1, ..., Qk}
  Distributed execution of {Q1, ..., Qk} using the data summary
No data re-partitioning and re-allocation
Have to scan the data at each site
Index over distributed data with maintenance concerns
SPARQL Endpoint Federation
Consider only the SPARQL endpoints for query execution
  No data re-partitioning/re-distribution
Consider D = D1 ∪ D2 ∪ ... ∪ Dn; Di: SPARQL endpoint
Alternatives
  SPARQL query decomposed Q = {Q1, ..., Qk} and executed over {D1, ..., Dn}: DARQ, FedX [Schwarte et al., 2011], SPLENDID [Gorlitz and Staab, 2011], ANAPSID [Acosta et al., 2011]
  Partial query evaluation: Distributed gStore [Peng et al., 2014]
Partial evaluation
  Given function f(s, d) and part of its input s, perform the part of f's computation that depends only on s to get f'(d)
  Compute f'(d) when d becomes available
  Applied to, e.g., XML [Buneman et al., 2006]
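Partial evaluation in general can be illustrated with a tiny example. The sketch below is generic and is not the Distributed gStore procedure: it takes a triple pattern as the known input s and the triples as the late input d, and the names `match` and `specialize` are hypothetical. The work that depends only on s (finding the constant positions) is done once, leaving a residual function f'(d).

```python
def match(pattern, triples):
    """f(s, d): the general function; re-inspects the pattern for every triple."""
    return [t for t in triples
            if all(p.startswith("?") or p == v for p, v in zip(pattern, t))]


def specialize(pattern):
    """Perform the part of the computation that depends only on s (the pattern).

    Returns the residual function f'(d), which only compares the constant
    positions that were identified once, up front.
    """
    constants = [(i, term) for i, term in enumerate(pattern)
                 if not term.startswith("?")]

    def residual(triples):
        return [t for t in triples if all(t[i] == c for i, c in constants)]

    return residual


PATTERN = ("?m", "movie:director", "?d")
f_prime = specialize(PATTERN)  # computed before any data is available

DATA = [
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
]
result = f_prime(DATA)  # computed once d arrives
```

The residual function gives the same answer as the general one, `match(PATTERN, DATA)`, but can be prepared ahead of time and reused for every batch of data that arrives later.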
Distributed SPARQL Using Partial Query Evaluation
Two steps:
1. Evaluate a query at each site to find local matches
   Query is the function and each Di is the known input
   Inner match or local partial match
2. Assemble the partial matches to get final result
   Crossing match
   Centralized assembly
   Distributed assembly
[Figure: a data graph partitioned into fragments D1 to D4, with a crossing match that spans several fragments.]
Some Open Problems
Handling data at non-SPARQL endpoint sites
Modification to SPARQL endpoints (for partial query evaluation)
Heterogeneous use of vocabularies (use of ontologies)
Live Query Processing
Not all data resides at SPARQL endpoints
Freshness of access to data is important
Potentially countably infinite data sources
Live querying
  On-line execution
  Only rely on linked data principles
Alternatives
  Traversal-based approaches
  Index-based approaches
  Hybrid approaches
SPARQL Query Semantics in Live Querying
Full-web semantics
  Scope of evaluating a SPARQL expression is all Linked Data
  Query result completeness cannot be guaranteed by any (terminating) execution
Reachability-based query semantics
  Query consists of a SPARQL expression, a set of seed URIs S, and a reachability condition c
  Scope: all data along paths of data links that satisfy the condition
  Computationally feasible
Traversal Approaches
Discover relevant URIs recursively by traversing (specific) data links at query execution runtime [Hartig, 2013; Ladwig and Tran, 2011]
Implements reachability-based query semantics
  Start from a set of seed URIs
  Recursively follow and discover new URIs
Important issue is selection of seed URIs
Retrieved data serves to discover new URIs and to construct result
Advantages
  Easy to implement
  No data structure to maintain
Disadvantages
  Possibilities for parallelized data retrieval are limited
  Repeated data retrieval introduces significant query latency
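The traversal loop can be sketched as a breadth-first walk from the seed URIs. Everything here is a stand-in: the dictionary `WEB` simulates dereferencing a URI (no HTTP is performed), the `ex:` URIs are invented, and `reachable` plays the role of the reachability condition c.

```python
from collections import deque

# Hypothetical stand-in for the Web: URI -> triples obtained by dereferencing it.
WEB = {
    "ex:a": [("ex:a", "ex:knows", "ex:b"), ("ex:a", "ex:name", "A")],
    "ex:b": [("ex:b", "ex:knows", "ex:c"), ("ex:b", "ex:name", "B")],
    "ex:c": [("ex:c", "ex:name", "C")],
    "ex:d": [("ex:d", "ex:name", "D")],  # never linked from the seed
}


def looks_like_uri(term):
    """Toy test: in this example only prefixed names contain a colon."""
    return ":" in term


def traverse(seeds, reachable, lookup=WEB.get):
    """Collect all triples reachable from the seed URIs.

    reachable(triple, uri) decides whether a URI mentioned in a retrieved
    triple should be followed (the reachability condition).
    """
    queue, seen, data = deque(seeds), set(seeds), []
    while queue:
        uri = queue.popleft()
        for triple in lookup(uri) or []:
            data.append(triple)
            for term in (triple[0], triple[2]):
                if looks_like_uri(term) and term not in seen and reachable(triple, term):
                    seen.add(term)
                    queue.append(term)
    return data


# Follow only ex:knows links, then answer "all names" over the retrieved data.
retrieved = traverse(["ex:a"], lambda triple, uri: triple[1] == "ex:knows")
names = [o for s, p, o in retrieved if p == "ex:name"]
```

The lookups happen one after another because each one may reveal the next URI to fetch, which is the limited parallelism noted above. The choice of seed matters as well: `ex:d` is never discovered from `ex:a`.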
Traversal Optimization
Dynamic query execution [Hartig and Ozsu, 2014]
[Figure: dynamic query execution, with components labeled "lookup queue", "Data Retrieval", and "Output"]
Prioritization of URIs – a number of alternatives
Non-adaptive
Adaptive, local processing aware
Adaptive, local processing agnostic
Intermediate solution driven
Solution-aware graph-based
Hybrid graph-based
Purely graph-based
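To make the idea of a prioritized lookup queue concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not one of the prioritization approaches named above: the class `LookupQueue`, its method `add_or_bump`, and the scoring rule (one point per link seen so far pointing to a URI, a simple graph-based signal) are hypothetical. The queue is adaptive in the sense that a pending URI's priority changes as more data is retrieved.

```python
import heapq
from itertools import count


class LookupQueue:
    """Priority queue of URIs awaiting lookup (highest score first)."""

    def __init__(self):
        self._heap = []
        self._score = {}      # current score of each pending URI
        self._tie = count()   # insertion order breaks ties

    def add_or_bump(self, uri, delta=1):
        """Queue a URI, or raise its score if it is already pending."""
        self._score[uri] = self._score.get(uri, 0) + delta
        # Lazy update: push a fresh entry; stale ones are skipped in pop().
        heapq.heappush(self._heap, (-self._score[uri], next(self._tie), uri))

    def pop(self):
        """Return the pending URI with the highest current score."""
        while self._heap:
            neg_score, _, uri = heapq.heappop(self._heap)
            if self._score.get(uri) == -neg_score:
                del self._score[uri]
                return uri
        raise IndexError("lookup queue is empty")

    def __len__(self):
        return len(self._score)


queue = LookupQueue()
for discovered in ["a", "b", "c", "b"]:  # "b" is linked to twice
    queue.add_or_bump(discovered)
order = [queue.pop() for _ in range(len(queue))]
```

A caller would still need to remember which URIs have already been looked up, since this queue only tracks pending ones. A non-adaptive variant would fix each URI's priority at insertion time and never bump it.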
Index Approaches
Use pre-populated index to determine relevant URIs (and to avoid as many irrelevant ones as possible)
Different index keys possible; e.g., triple patterns [Umbrich et al., 2011]
Index entries are sets of URIs
Indexed URIs may appear multiple times (i.e., associated with multiple index keys)
Each URI in such an entry may be paired with a cardinality (utilized for source ranking)
Key: tp    Entry: {uri_1, uri_2, ..., uri_n}
GET uri_i
Advantages
Data retrieval can be fully parallelized
Reduces the impact of data retrieval on query execution time
Disadvantages
Querying can only start after index construction
Depends on what has been selected for the index
Freshness may be an issue
Index maintenance
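The index-based approach can be sketched as two steps: source selection against the index, then parallel retrieval. The Python below is a hedged illustration only. The `INDEX` contents, the function names, and the choice to rank by summed cardinality are assumptions made for this example, and `fetch` is a caller-supplied stand-in for the `GET uri_i` step rather than a real HTTP client.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pre-populated index: triple pattern -> {URI: cardinality}.
# None in a pattern stands for a variable.
INDEX = {
    (None, "name", None): {"http://ex.org/a": 3, "http://ex.org/b": 1},
    (None, "knows", None): {"http://ex.org/a": 2, "http://ex.org/c": 5},
}


def select_sources(patterns, index):
    """Rank the URIs relevant to any pattern by their summed cardinality."""
    totals = {}
    for tp in patterns:
        for uri, card in index.get(tp, {}).items():
            totals[uri] = totals.get(uri, 0) + card
    # Highest total first; ties are broken by URI for a stable order.
    return sorted(totals, key=lambda u: (-totals[u], u))


def retrieve_all(uris, fetch, workers=4):
    """Issue all lookups concurrently; results keep the order of `uris`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(uris, pool.map(fetch, uris)))


sources = select_sources([(None, "name", None), (None, "knows", None)], INDEX)
documents = retrieve_all(sources, fetch=lambda uri: f"data from {uri}")
```

Because every relevant URI is known before retrieval starts, the lookups are independent and can run in parallel. A URI that is missing from the index is never fetched, which reflects the dependence on what was selected for the index.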
Hybrid Approach
Perform a traversal-based execution using a prioritized list of URIs to look up [Ladwig and Tran, 2010]
Initial seed from the pre-populated index
Non-seed URIs are ranked by a function based on information in the index
Newly discovered URIs that are not in the index are ranked according to the number of referring documents
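The two ranking rules above can be combined into a single scoring function. The sketch below is a simplified assumption, not the ranking function of Ladwig and Tran: `index_scores` and `referrers` are hypothetical inputs, and putting index-derived scores and referrer counts on the same scale is a choice made only to keep the example short.

```python
def rank(uri, index_scores, referrers):
    """Score a candidate URI for the prioritized lookup list.

    index_scores: URI -> score derived from the pre-populated index
    referrers:    URI -> set of already retrieved documents that link to it
    """
    if uri in index_scores:
        return index_scores[uri]
    # Not in the index: fall back to the number of referring documents.
    return len(referrers.get(uri, ()))


def next_lookup(candidates, index_scores, referrers):
    """Pick the highest-ranked candidate; ties go to the smaller URI string."""
    return min(candidates, key=lambda u: (-rank(u, index_scores, referrers), u))


index_scores = {"u1": 4.0, "u2": 0.5}
referrers = {"u3": {"d1", "d2"}, "u4": {"d1"}}
first = next_lookup({"u1", "u2", "u3", "u4"}, index_scores, referrers)
```

In a full system the referrer sets would grow as documents are retrieved, so the ranking of not-yet-visited URIs would shift during execution.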
Some Open Problems
Optimize queries by using statistics collected during earlier query executions
Heterogeneous use of vocabularies (use of ontologies)
Combine live querying with SPARQL federation to leverage SPARQL endpoint functionality
Outline
1 LOD and RDF Introduction
2 Data Warehousing Approach
  Relational Approaches
  Graph-Based Approaches
3 SPARQL Federation Approach
  Distributed RDF Processing
  SPARQL Endpoint Federation
4 Live Querying Approach
  Traversal-based approaches
  Index-based approaches
  Hybrid approaches
5 Conclusions
Conclusions
RDF and Linked Open Data seem to have considerable promise for Web data management
More work needs to be done
Query semantics
Adaptive system design
Optimizations, both in data warehousing and distributed environments
Live querying requires significant thought to reduce latency
What I did not talk about:
Not much on general distributed/parallel processing
Not much on SPARQL semantics
Nothing about RDFS – no schema stuff
Nothing about entailment regimes > 0 ⇒ no reasoning
Thank you!
References
Abadi, D. J., Marcus, A., Madden, S., and Hollenbach, K. (2009). SW-Store: a vertically partitioned DBMS for semantic web data management. VLDB J., 18(2):385–406.
Abadi, D. J., Marcus, A., Madden, S. R., and Hollenbach, K. (2007). Scalable semantic web data management using vertical partitioning. In Proc. 33rd Int. Conf. on Very Large Data Bases, pages 411–422.
Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. (1997). The Lorel query language for semistructured data. Int. J. Digit. Libr., 1(1):68–88.
Acosta, M., Vidal, M.-E., Lampo, T., Castillo, J., and Ruckhaus, E. (2011). ANAPSID: an adaptive query processing engine for SPARQL endpoints. In Proc. 10th Int. Semantic Web Conf., pages 18–34.
Aluc, G., Hartig, O., Ozsu, M. T., and Daudjee, K. (2014). Diversified stress testing of RDF data management systems. In Proc. 13th Int. Semantic Web Conf. Forthcoming.
Aluc, G., Ozsu, M. T., Daudjee, K., and Hartig, O. (2013). chameleon-db: a workload-aware robust RDF data management system. Technical Report CS-2013-10, University of Waterloo.
Arocena, G. and Mendelzon, A. (1998). WebOQL: Restructuring documents, databases and webs. In Proc. 14th Int. Conf. on Data Engineering, pages 24–33.
Atre, M., Chaoji, V., Zaki, M. J., and Hendler, J. A. (2010). Matrix “bit” loaded: A scalable lightweight join query processor for RDF data. In Proc. 19th Int. World Wide Web Conf., pages 41–50.
Bornea, M. A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P., Udrea, O., and Bhattacharjee, B. (2013). Building an efficient RDF store over a relational database. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 121–132.
Buneman, P., Cong, G., Fan, W., and Kementsietsidis, A. (2006). Using partial evaluation in distributed query evaluation. In Proc. 32nd Int. Conf. on Very Large Data Bases, pages 211–222.
Buneman, P., Davidson, S., Hillebrand, G. G., and Suciu, D. (1996). A query language and optimization techniques for unstructured data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 505–516.
Fernandez, M., Florescu, D., and Levy, A. (1997). A query language for a web-site management system. ACM SIGMOD Rec., 26(3):4–11.
Gorlitz, O. and Staab, S. (2011). SPLENDID: SPARQL endpoint federation exploiting VOID descriptions. In Proc. 2nd Int. Workshop on Consuming Linked Data.
Gurajada, S., Seufert, S., Miliaraki, I., and Theobald, M. (2014). TriAD: A distributed shared-nothing RDF engine based on asynchronous message passing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 289–300.
Hartig, O. (2012). SPARQL for a web of linked data: Semantics and computability. In Proc. 9th Extended Semantic Web Conf., pages 8–23.
Hartig, O. (2013). SQUIN: a traversal based query execution system for the web of linked data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1081–1084.
Hartig, O. and Ozsu, M. T. (2014). Optimizing response time of traversal-based query optimization. In preparation.
Huang, J., Abadi, D. J., and Ren, K. (2011). Scalable SPARQL querying of large RDF graphs. Proc. VLDB Endowment, 4(11):1123–1134.
Husain, M. F., McGlothlin, J., Masud, M. M., Khan, L. R., and Thuraisingham, B. (2011). Heuristics-based query processing for large RDF graphs using cloud computing. IEEE Trans. Knowl. and Data Eng., 23(9):1312–1327.
Kaoudi, Z. and Manolescu, I. (2014). RDF in the clouds: A survey. VLDB J. Forthcoming.
Konopnicki, D. and Shmueli, O. (1995). W3QS: A query system for the World Wide Web. In Proc. 21st Int. Conf. on Very Large Data Bases, pages 54–65.
Ladwig, G. and Tran, T. (2010). Linked data query processing strategies. In Proc. 9th Int. Semantic Web Conf., pages 453–469.
Ladwig, G. and Tran, T. (2011). SIHJoin: Querying remote and local linked data. In Proc. 8th Extended Semantic Web Conf., pages 139–153.
Lakshmanan, L. V. S., Sadri, F., and Subramanian, I. N. (1996). A declarative language for querying and restructuring the Web. In Proc. 6th Int. Workshop on Research Issues on Data Eng., pages 12–21.
Lee, K. and Liu, L. (2013). Scaling queries over big RDF graphs with semantic hash partitioning. Proc. VLDB Endowment, 6(14):1894–1905.
Mendelzon, A. O., Mihaila, G. A., and Milo, T. (1997). Querying the World Wide Web. Int. J. Digit. Libr., 1(1):54–67.
Neumann, T. and Weikum, G. (2008). RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endowment, 1(1):647–659.
Neumann, T. and Weikum, G. (2009). The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91–113.
Papakonstantinou, Y., Garcia-Molina, H., and Widom, J. (1995). Object exchange across heterogeneous information sources. In Proc. 11th Int. Conf. on Data Engineering, pages 251–260.
Peng, P., Zou, L., Ozsu, M. T., Chen, L., and Zhao, D. (2014). Processing SPARQL queries over linked data – a distributed graph-based approach. Submitted for publication.
Prasser, F., Kemper, A., and Kuhn, K. A. (2012). Efficient distributed query processing for autonomous RDF databases. In Proc. 15th Int. Conf. on Extending Database Technology, pages 372–383.
Schwarte, A., Haase, P., Hose, K., Schenkel, R., and Schmidt, M. (2011). FedX: A federation layer for distributed query processing on linked open data. In Proc. 8th Extended Semantic Web Conf., pages 481–486.
Umbrich, J., Hose, K., Karnstedt, M., Harth, A., and Polleres, A. (2011). Comparing data summaries for processing live queries over linked data. World Wide Web J., 14(5-6):495–544.
Weiss, C., Karras, P., and Bernstein, A. (2008). Hexastore: sextuple indexing for semantic web data management. Proc. VLDB Endowment, 1(1):1008–1019.
Wilkinson, K. (2006). Jena property table implementation. Technical Report HPL-2006-140, HP Laboratories Palo Alto.
Zhang, X., Chen, L., Tong, Y., and Wang, M. (2013). EAGRE: Towards scalable I/O efficient SPARQL query evaluation on the cloud. In Proc. 29th Int. Conf. on Data Engineering, pages 565–576.
Zou, L., Mo, J., Chen, L., Ozsu, M. T., and Zhao, D. (2011). gStore: answering SPARQL queries via subgraph matching. Proc. VLDB Endowment, 4(8):482–493.
Zou, L., Ozsu, M. T., Chen, L., Shen, X., Huang, R., and Zhao, D. (2014). gStore: A graph-based SPARQL query engine. VLDB J., 23(4):565–590.