![Page 1: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/1.jpg)
LOD is all about evolution
Querying and Managing evolving Linked Open Data
Javier D. Fernández
11TH SEPTEMBER 2017
Drift-a-LOD’17
Special thanks to Axel Polleres for his input
![Page 2: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/2.jpg)
About me:
since 2015 @WU, Inst. for Information Business
Research interest: Semantic Web, Open Data, Big (Semantic) Data Management, Databases, Data Compression, Privacy and Security
https://www.wu.ac.at/en/infobiz/team/fernandez/
Madrid, Valladolid, Santiago, Rome, Vienna
Óscar Corcho, Pablo de la Fuente, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Maurizio Lenzerini, Axel Polleres
![Page 3: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/3.jpg)
Monitoring Evolution and Archiving
Archiving the Web of Data
Representing and querying evolving semantic data
Open Data evolution
General agenda
images: zurb.com
![Page 4: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/4.jpg)
Monitoring evolution is relevant
ARCHIVING LINKED AND OPEN DATA
Why evolution matters (Creationists: please ignore this slide…)
![Page 5: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/5.jpg)
Changes tell us “something”
Uncertain information
Validity of the information
Evolution matters
![Page 6: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/6.jpg)
Web archives: Common Crawl, Internet Memory, Internet Archive, …
Preservation matters
![Page 7: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/7.jpg)
The Memento protocol
Time-based access matters
But…
Follow your nose (HTTP content negotiation with datetime)
RFC 7089
Batch discovery (list of URIs of Mementos of the Original Resource)
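A minimal sketch of the "follow your nose" step: RFC 7089 datetime negotiation means sending an Accept-Datetime header to a TimeGate. The TimeGate URI below is hypothetical and the request is only built, not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def accept_datetime(when: datetime) -> str:
    """Format a datetime as the HTTP-date used in the Accept-Datetime header."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def timegate_request(timegate_uri: str, when: datetime) -> Request:
    """Build (but do not send) a datetime-negotiation request for a TimeGate."""
    return Request(timegate_uri, headers={"Accept-Datetime": accept_datetime(when)})


# Hypothetical TimeGate URI, used only for illustration.
wanted = datetime(2017, 9, 11, tzinfo=timezone.utc)
req = timegate_request("http://example.org/timegate/http://example.org/resource", wanted)
header_value = accept_datetime(wanted)
```

A Memento-aware server would answer such a request by pointing to the Memento closest to the requested datetime.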
![Page 8: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/8.jpg)
Poor granularity (“some” snapshots)
Aggregated data, only, rather than raw data access
(e.g. in Google trends)
What is the right query language?
basic retrieval features (get version at timestamp t)
when did a certain information disappear?
when was it changed?
structured queries?
Scalability problems
Challenges (Web archives)
Is it easier/better for RDF/Linked Data?
![Page 9: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/9.jpg)
Archiving the Web of Data
![Page 10: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/10.jpg)
Linked Data is evolving
[Figure: data sources plotted by number of sources (10^0 to 10^6) against update rate (second, minute, hour, day, week, month, year). Labelled sources: BTC, Dynamic LD Observatory, LOD Laundromat, Internet of Things, Virtual/Augmented Reality, live.]
Source: Andreas Harth, Stream Reasoning in Mixed Reality Applications, Stream Reasoning Workshop 2015
![Page 11: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/11.jpg)
One of the first (and last?) LOD archives: The Dynamic Linked Data Observatory (evolving Linked Data since 2012)
Weekly dumps of crawl snapshots...
Granularity? Queries? Crawl failures?
![Page 12: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/12.jpg)
Linked Data Archives: the missing link in the RDF evolution
Most Semantic Web/Linked Data tools are focused on this “static view” but do not consider versioning/evolution
Sindice, SWSE, Swoogle, LOD Cache, LOD-Laundromat… so far, no versions!
![Page 13: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/13.jpg)
RDF Archiving. Example
RDF Graph V1
ex:C1 ex:hasProfessor ex:P1 .
ex:S1 ex:study ex:C1 .
ex:S2 ex:study ex:C1 .
RDF Graph V2
ex:C1 ex:hasProfessor ex:P1 .
ex:S1 ex:study ex:C1 .
ex:S2 ex:study ex:C1 .
ex:S3 ex:study ex:C1 .
RDF Graph V3
ex:C1 ex:hasProfessor ex:P1 .
ex:C1 ex:hasProfessor ex:P2 .
ex:C1 ex:hasProfessor ex:S2 .
ex:S1 ex:study ex:C1 .
ex:S3 ex:study ex:C1 .
[Figure: the three versions drawn as graphs over the nodes ex:C1, ex:P1, ex:P2, ex:S1, ex:S2 and ex:S3]
![Page 14: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/14.jpg)
How can we represent archives of continuously evolving linked datasets?
How can we minimize the redundant information of archives? (e.g. duplicates in snapshots)
How can we improve completeness of archiving?
How can emerging retrieval demands in archiving be satisfied?
e.g. time-traversing and traceability? Avoiding bottlenecks?
How can certain time-specific queries over archives be answered?
Can we re-use existing technologies (e.g. SPARQL or temporal extensions)?
What is the right query language for such queries?
e.g. knowing if a dataset has changed, and how, in a certain time period?
Research challenges on evolving structured interlinked data
![Page 15: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/15.jpg)
…in the last few years:
Managing the Evolution and Preservation of the Data Web (FP7)
Preserving Linked Data (FP7)
Research projects
Archives
Tools
Benchmarking
one of the fundamental problems in the Web of Data
BEAR
RDF evolution at Scale
v-RDFCSA
![Page 16: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/16.jpg)
Representing and querying evolving semantic data
![Page 17: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/17.jpg)
How we can get archives of RDF data
The cold-start problem
![Page 18: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/18.jpg)
Some services that publish or are mapped to RDF change regularly, but we don’t know the frequency upfront!
Some services mapped to RDF announce/archive their changes already, so they already keep an archive…
Pull changes (crawl) vs. push changes (notify)
Data Monitor Framework
[Diagram labels: cron, scheduler, crawl schedule, politeness queue, downloader, crawl data (YYYY/MM/DD/HH/domain) with content and metadata, URI type, links, adaptive scheduler]
Adaptive scheduler:
• check if URL was crawled
• compare content with previous crawl(s)
• adapt schedule
Jürgen Umbrich, Nina Mrzelj, Axel Polleres. Towards capturing and preserving changes on the Web of Data. DIACHRON WS 2015, pp. 50-65.
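The three adaptive-scheduler steps can be sketched as follows. The halve-on-change / double-on-no-change policy and the interval bounds are assumptions made for illustration, not the policy of the cited framework.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlState:
    interval_hours: float               # current revisit interval
    last_digest: Optional[str] = None   # hash of the previously crawled content


def adapt_schedule(state: CrawlState, content: bytes,
                   min_hours: float = 1.0, max_hours: float = 720.0) -> bool:
    """Update the state after a crawl; return True if a change was detected."""
    digest = hashlib.sha256(content).hexdigest()
    first_crawl = state.last_digest is None           # step 1: was the URL crawled before?
    changed = (not first_crawl) and digest != state.last_digest  # step 2: compare content
    if changed:                                       # step 3: adapt the schedule
        state.interval_hours = max(min_hours, state.interval_hours / 2)
    elif not first_crawl:
        state.interval_hours = min(max_hours, state.interval_hours * 2)
    state.last_digest = digest
    return changed
```

Comparing digests rather than full content keeps the per-URL state small; a real monitor would probably also want to ignore irrelevant differences such as serialization order.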
![Page 19: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/19.jpg)
Retrieve historical versions of a DBpedia resource
What was the version of “Donald Trump” on dd/mm/yyyy?
Re-apply DBpedia mappings on the Wikipedia revision history
DBpedia Wayback Machine
http://data.wu.ac.at/wayback
On-demand “archive”
![Page 20: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/20.jpg)
How can one represent revisions while respecting DBpedia?
a) quads <dbpediaSubject> <pred> <obj> <Revision> .
b) proprietary triples <ownSubject/Revision> <pred> <obj> .
Operations?
Get revisions meta-data for one resource (by revisionID or timestamp)
Get “materialised” versions of a resource (by revisionID or timestamp)
Get difference between two revisions
DBpedia Wayback Machine
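A small sketch of the three operations over representation (a), quads whose fourth element is the revision. The resource, predicates, revision IDs and dates are all made up for illustration; this is not the DBpedia Wayback Machine API.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical revision metadata for one resource: (revision_id, date), sorted by date.
REVISIONS = [(101, date(2016, 1, 1)), (102, date(2016, 11, 9)), (103, date(2017, 1, 20))]

# Option (a): quads <subject> <pred> <obj> <revision>.
QUADS = [
    ("ex:R", "ex:label", "R", 101),
    ("ex:R", "ex:status", "draft", 101),
    ("ex:R", "ex:label", "R", 102),
    ("ex:R", "ex:status", "final", 102),
]


def revision_at(when):
    """Return the ID of the latest revision made on or before `when`, if any."""
    dates = [d for _, d in REVISIONS]
    i = bisect_right(dates, when)
    return REVISIONS[i - 1][0] if i else None


def materialise(subject, revision_id):
    """Return the (pred, obj) pairs of a resource in one revision."""
    return {(p, o) for s, p, o, r in QUADS if s == subject and r == revision_id}


def diff(subject, rev_a, rev_b):
    """Return (added, deleted) pairs when going from rev_a to rev_b."""
    a, b = materialise(subject, rev_a), materialise(subject, rev_b)
    return b - a, a - b
```

Lookup by timestamp reduces to lookup by revision ID once the revision metadata is sorted by date.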
![Page 21: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/21.jpg)
More complex operations/queries? Open challenge
a) On-demand? Query rewriting, similar to RDB2RDF
b) Batch: Fetch the desired information, then store and query it
DBpedia Wayback Machine
![Page 22: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/22.jpg)
We are (obviously) not the only ones looking into this…
However: Only one version per “irregular” DBpedia dump
![Page 23: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/23.jpg)
Lodlaundromat.org: a central repository of LD
Problems?
Still you need to access/query 650K datasets
Of course the solution is not complete, but “a good approximation”
LOD Laundromat
![Page 24: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/24.jpg)
LOD-a-lot: Low cost archiving of LOD
[Diagram: Linked Open Data → LOD Laundromat (Dataset 1 … Dataset 650K, each as zipped N-Triples, plus a SPARQL endpoint for metadata) → LOD-a-lot, 28B triples]
Kudos: Javier D. Fernandez, Wouter Beek, Miguel A. Martínez-Prieto, Mario Arias, Ruben Verborgh
![Page 25: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/25.jpg)
Disk size:
HDT: 304 GB
HDT-FoQ (additional indexes): 133 GB
Memory footprint (to query):
15.7 GB of RAM (3% of the size)
144 seconds loading time
8 cores (2.6 GHz), RAM 32 GB, SATA HDD on Ubuntu 14.04.5 LTS
LDF page resolution in milliseconds.
LOD-a-lot (some numbers)
305€
(LOD-a-lot creation took 64 h & 170GB RAM. HDT-FoQ took 8 h & 250GB RAM)
![Page 26: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/26.jpg)
LOD-a-lot
https://datahub.io/dataset/lod-a-lot
http://purl.org/HDT/lod-a-lot
![Page 27: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/27.jpg)
Archiving
We plan to have quarterly releases
Query resolution at Web scale
Evaluation and Benchmarking
No excuse
RDF metrics and analytics
LOD-a-lot (some use cases)
subjects predicates objects
![Page 28: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/28.jpg)
ACKs LOD-a-lot
![Page 29: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/29.jpg)
The archiving problem
Now, how can we efficiently archive and perform time-based retrieval queries of a dataset?
![Page 30: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/30.jpg)
RDF Archiving. Archiving policies
a) Independent Copies/Snapshots (IC): every version is stored in full
V1:
ex:C1 ex:hasProfessor ex:P1 .
ex:S1 ex:study ex:C1 .
ex:S2 ex:study ex:C1 .
V2:
ex:C1 ex:hasProfessor ex:P1 .
ex:S1 ex:study ex:C1 .
ex:S3 ex:study ex:C1 .
V3:
ex:C1 ex:hasProfessor ex:P2 .
ex:C1 ex:hasProfessor ex:S2 .
ex:S1 ex:study ex:C1 .
ex:S3 ex:study ex:C1 .
b) Change-based approach (CB): the first version plus the changes between consecutive versions
V1:
ex:C1 ex:hasProfessor ex:P1 .
ex:S1 ex:study ex:C1 .
ex:S2 ex:study ex:C1 .
V1 → V2, added:
ex:S3 ex:study ex:C1 .
V1 → V2, deleted:
ex:S2 ex:study ex:C1 .
V2 → V3, added:
ex:C1 ex:hasProfessor ex:P2 .
ex:C1 ex:hasProfessor ex:S2 .
V2 → V3, deleted:
ex:C1 ex:hasProfessor ex:P1 .
c) Timestamp-based approach (TB): each triple is annotated with the versions in which it holds
V1,2,3:
ex:C1 ex:hasProfessor ex:P1 [V1,V2].
ex:C1 ex:hasProfessor ex:P2 [V3].
ex:C1 ex:hasProfessor ex:S2 [V3].
ex:S1 ex:study ex:C1 [V1,V2,V3].
ex:S2 ex:study ex:C1 [V1].
ex:S3 ex:study ex:C1 [V2,V3].
d) Hybrid approaches
Each policy is accessed through a RETRIEVAL MEDIATOR.
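The IC, CB and TB policies can be sketched on the slide's running example. This is an illustrative in-memory model with hypothetical function names, not how any of the evaluated systems store data.

```python
P1 = ("ex:C1", "ex:hasProfessor", "ex:P1")
P2 = ("ex:C1", "ex:hasProfessor", "ex:P2")
PS2 = ("ex:C1", "ex:hasProfessor", "ex:S2")
S1 = ("ex:S1", "ex:study", "ex:C1")
S2 = ("ex:S2", "ex:study", "ex:C1")
S3 = ("ex:S3", "ex:study", "ex:C1")

# a) IC: one full snapshot per version.
ic = [{P1, S1, S2}, {P1, S1, S3}, {P2, PS2, S1, S3}]


def to_cb(snapshots):
    """b) CB: first snapshot plus (added, deleted) sets per consecutive pair."""
    deltas = [(cur - prev, prev - cur) for prev, cur in zip(snapshots, snapshots[1:])]
    return snapshots[0], deltas


def to_tb(snapshots):
    """c) TB: each triple mapped to the list of versions in which it holds."""
    tb = {}
    for version, snapshot in enumerate(snapshots, start=1):
        for triple in snapshot:
            tb.setdefault(triple, []).append(version)
    return tb


def materialize_cb(base, deltas, version):
    """Rebuild a version by replaying the first `version - 1` deltas."""
    graph = set(base)
    for added, deleted in deltas[: version - 1]:
        graph = (graph - deleted) | added
    return graph


def materialize_tb(tb, version):
    """Rebuild a version by filtering triples on their version annotation."""
    return {triple for triple, versions in tb.items() if version in versions}
```

The trade-off is visible even at this scale: IC answers a materialization directly but repeats ex:S1 three times, while CB stores each change once but has to replay deltas.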
![Page 32: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/32.jpg)
Blueprint on benchmarking archives of semantic data
How can one define the corpus?
How can one design benchmark queries? Which queries?
BEAR: concrete basic benchmark
Data: Crawl from Linked Data Observatory
Basic queries: Materialize, get Version…
Initial evaluation on archiving policies
BEAR: Benchmarking the Efficiency of RDF Archives
https://aic.ai.wu.ac.at/qadlod/bear.html
![Page 33: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/33.jpg)
![Page 34: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/34.jpg)
Benchmarking: Define the corpus
Number of versions / size
Data dynamicity
Version change ratio
Version data growth
Data static core
Total triples (version-oblivious)
Others
RDF vocabulary
Per version / evolution
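A sketch of how these corpus features could be computed when each version is a set of triples. The formulas are assumptions in the spirit of the feature names above (change ratio as changed triples over the union of both versions, growth as a size ratio), not definitions quoted from the slides.

```python
def change_ratio(vi, vj):
    """Share of triples added or deleted between two versions."""
    union = vi | vj
    return len(vi ^ vj) / len(union) if union else 0.0


def data_growth(vi, vj):
    """Size of version vj relative to version vi."""
    return len(vj) / len(vi)


def static_core(versions):
    """Triples present in every version."""
    return set.intersection(*versions)


def version_oblivious(versions):
    """All distinct triples, ignoring the versions in which they occur."""
    return set.union(*versions)


# Running example (V1, V2, V3) with abbreviated triple names.
V1 = {"C1 prof P1", "S1 study C1", "S2 study C1"}
V2 = {"C1 prof P1", "S1 study C1", "S3 study C1"}
V3 = {"C1 prof P2", "C1 prof S2", "S1 study C1", "S3 study C1"}
```

On this example only one triple (S1 study C1) survives in the static core, while the version-oblivious set holds six distinct triples.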
![Page 35: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/35.jpg)
Benchmarking: Define the queries
Structured query languages managing time.
Temporal databases (T-Quel, TSQL2)
Overlapping, meeting, before, equal, during, finish
RDF/Linked Data
SPARQL extensions
T-SPARQL, SPARQL-ST
AnQL
DIACHRON Query Language
SPARQL with specific constructors such as DATASET (similar to a named graph), VERSION, or CHANGES
![Page 36: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/36.jpg)
Design of benchmark queries
Archive-driven Cardinality + Selectivity (disregard versions)
Version-driven Cardinality + Selectivity + dynamicity
Basic temporal retrieval features of queries
Mat (Q, Vi): version materialization
Diff (Q, Vi,Vj): delta materialization
Version(Q): results of Q annotated with the version
Join(Q1,Vi, Q2,Vj)
Change(Q): returns the versions Vi in which Diff(Q, Vi, Vi-1) ≠ ∅
Benchmarking: Define the queries
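The basic retrieval features can be sketched over a timestamp-annotated archive, taking Q to be a single triple pattern with None as wildcard. Function names and the in-memory model are illustrative assumptions; Join is omitted for brevity.

```python
def matches(pattern, triple):
    """True if the triple fits the pattern (None matches anything)."""
    return all(p is None or p == t for p, t in zip(pattern, triple))


def mat(archive, q, v):
    """Mat(Q, Vi): results of Q in version v."""
    return {t for t, versions in archive.items() if v in versions and matches(q, t)}


def diff(archive, q, vi, vj):
    """Diff(Q, Vi, Vj): (added, deleted) results between two versions."""
    a, b = mat(archive, q, vi), mat(archive, q, vj)
    return b - a, a - b


def ver(archive, q):
    """Version(Q): results of Q annotated with their versions."""
    return {t: sorted(versions) for t, versions in archive.items() if matches(q, t)}


def change(archive, q, n_versions):
    """Change(Q): versions whose results differ from the previous version."""
    return [v for v in range(2, n_versions + 1) if any(diff(archive, q, v - 1, v))]


# Timestamp-annotated running example: triple -> versions in which it holds.
ARCHIVE = {
    ("ex:C1", "ex:hasProfessor", "ex:P1"): {1, 2},
    ("ex:C1", "ex:hasProfessor", "ex:P2"): {3},
    ("ex:C1", "ex:hasProfessor", "ex:S2"): {3},
    ("ex:S1", "ex:study", "ex:C1"): {1, 2, 3},
    ("ex:S2", "ex:study", "ex:C1"): {1},
    ("ex:S3", "ex:study", "ex:C1"): {2, 3},
}
```

For instance, `change(ARCHIVE, (None, "ex:study", "ex:C1"), 3)` reports only version 2, since the set of students changes between V1 and V2 but not between V2 and V3.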
![Page 37: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/37.jpg)
Benchmarking: Define the queries
Instantiation of archive queries in AnQL [1]
Operations: Mat(Q,V), Diff(Q,V1,V2), Ver(Q), Join(Q1,Vi,Q2,Vj), Change(Q)
Mat(Q,V):
SELECT * WHERE { Q :[v] }
[1] Antoine Zimmermann, Nuno Lopes, Axel Polleres, and Umberto Straccia. A general framework for representing, reasoning and querying with annotated Semantic Web data. Journal of Web Semantics (JWS), 12:72-95, March 2012.
![Page 38: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/38.jpg)
Instantiation of archive queries in AnQL [1]
Diff(Q,V1,V2):
SELECT * WHERE {
{ { {Q :[v1]} MINUS {Q :[v2]} } BIND (v1 AS ?V ) }
UNION
{ { {Q :[v2]} MINUS {Q :[v1]} } BIND (v2 AS ?V ) }
}
![Page 39: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/39.jpg)
Instantiation of archive queries in AnQL [1]
Ver(Q):
SELECT * WHERE { Q :?V }
![Page 40: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/40.jpg)
Instantiation of archive queries in AnQL [1]
Join(Q1,v1,Q2,v2):
SELECT * WHERE { {Q1 :[v1]} {Q2 :[v2]} }
![Page 41: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/41.jpg)
Instantiation of archive queries in AnQL [1]
Change(Q):
SELECT ?V1 ?V2 WHERE
{ {{Q :?V1 } MINUS {Q :?V2}} UNION
{{Q :?V2 } MINUS {Q :?V1}}
FILTER( abs(?V1-?V2) = 1 ) }
Open question remains:
What is the right query syntax for archive queries?
![Page 42: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/42.jpg)
![Page 43: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/43.jpg)
Queries and systems
We implemented and evaluated archiving systems on Jena-TDB and HDT, based on the IC, CB and TB policies.
Serve as an initial baseline to compare archiving systems
More info: https://aic.ai.wu.ac.at/qadlod/bear.html
BEAR: Benchmarking the Efficiency of RDF Archiving
![Page 44: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/44.jpg)
BEAR datasets
![Page 45: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/45.jpg)
![Page 46: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/46.jpg)
![Page 47: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/47.jpg)
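Time-based access. Queries
Triple pattern queries
Queries with a “similar” number of results in all versions.
ε-stable query: over the N versions, the largest and smallest result cardinalities stay within a factor (1 ± ε) of the mean cardinality:
$$\max_{i \in N} \mathrm{CARD}(Q, V_i) \le (1 + \varepsilon)\,\frac{\sum_{i \in N} \mathrm{CARD}(Q, V_i)}{N}$$
$$\min_{i \in N} \mathrm{CARD}(Q, V_i) \ge (1 - \varepsilon)\,\frac{\sum_{i \in N} \mathrm{CARD}(Q, V_i)}{N}$$
Example: ε = 0.5 (±50% around the mean)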
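A minimal check of the ε-stable condition, assuming the per-version result cardinalities of a query are already known; the lower bound is taken as (1 − ε) times the mean. The function name is illustrative.

```python
def is_epsilon_stable(cardinalities, eps):
    """True if max and min cardinality stay within (1 ± eps) of the mean."""
    mean = sum(cardinalities) / len(cardinalities)
    return (max(cardinalities) <= (1 + eps) * mean
            and min(cardinalities) >= (1 - eps) * mean)
```

With ε = 0.5, cardinalities of 10, 12 and 8 (mean 10) pass, while 1, 1 and 10 (mean 4) fail because the maximum exceeds 6.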
![Page 48: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/48.jpg)
Time-based access. Queries
Materialize (s,?,? ; version)
Hybrid approach
Size per policy: IC 48 GB, CB 28 GB, HB4 34 GB, HB8 31 GB, HB16 29 GB
![Page 49: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/49.jpg)
Time-based access. Queries
diff(?,?,o ; version 0 ; version t)
Hybrid approach
Size per policy: IC 48 GB, CB 28 GB, HB4 34 GB, HB8 31 GB, HB16 29 GB
![Page 50: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/50.jpg)
RDFCSA: RDF index based on a Compressed Suffix Array
v-RDFCSA[2] is designed as a lightweight TB approach
Version information encoding
Any triple can be identified by the position of its subject within SA
Let N be the number of different versions and n the number of version-oblivious triples
Two alternative encoding strategies
tpv: N bitsequences (one per version); position i encodes whether triple i appears in that version
vpt: n bitsequences (one per triple); position i encodes whether version i includes that triple
Self-Indexing RDF Archives: v-RDFCSA
tpv (one bitsequence per version; positions are triples 1 to 5):
Bv1: 0 1 1 0 1
Bv2: 0 1 0 1 0
Bv3: 1 0 0 0 1
vpt (the same matrix read by columns, one bitsequence per triple; positions are versions 1 to 3):
Bt1: 0 0 1
Bt2: 1 1 0
Bt3: 1 0 0
Bt4: 0 1 0
Bt5: 1 0 1
[2] Ana Cerdeira-Pena, Antonio Fariña, Javier D. Fernández, and Miguel A. Martínez-Prieto. Self-Indexing RDF Archives. Data Compression Conference (DCC), 2016.
Performs more than one order of magnitude faster than Jena-TDB
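The two encodings can be sketched with plain Python lists standing in for bitsequences. The version membership is read off the slide's example matrix; a real v-RDFCSA would use compact bitsequence structures, so this only illustrates the layout.

```python
# Version -> IDs of the triples it contains (from the example matrix).
MEMBERSHIP = {1: {2, 3, 5}, 2: {2, 4}, 3: {1, 5}}
N_TRIPLES = 5


def tpv(membership, n_triples):
    """One bit list per version; position i says whether triple i is present."""
    return {v: [1 if t in triples else 0 for t in range(1, n_triples + 1)]
            for v, triples in membership.items()}


def vpt(membership, n_triples):
    """One bit list per triple; position i says whether version i contains it."""
    versions = sorted(membership)
    return {t: [1 if t in membership[v] else 0 for v in versions]
            for t in range(1, n_triples + 1)}
```

Checking whether one triple holds in one version is a single bit lookup in either layout; they differ in which scans are cheap: tpv keeps a version's triples together, vpt keeps a triple's versions together.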
![Page 51: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/51.jpg)
How is Open Data evolving?
Open Data evolution
![Page 52: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/52.jpg)
Periodically monitoring a list of Open Data Portals
90 CKAN powered Open Data Portals
Quality assessment
Evolution tracking
Meta data
Data
OPEN DATA PORTAL WATCH… a first step.
http://data.wu.ac.at/portalwatch/
Jürgen Umbrich, Sebastian Neumaier
![Page 53: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/53.jpg)
ECDA: Evolving CSV Data Analyzer
![Page 54: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/54.jpg)
• Analysis of 726 datasets
• Mean of 18 versions per file
• Mean of 430 rows and 5.4 columns
• Increasing nature (x1.85 number of rows)
• Small value modifications (0.85 Jaccard)
• Mostly string types (80% of 8-25 characters)
ECDA: Evolving CSV Data Analyzer
![Page 55: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/55.jpg)
• Analysis of 726 datasets
• Entities (Babelfy)
• On average there are around 0.07 entities per cell
• Entities are static in the header (a mean of 3)
• 1/3 of the entities change across time
• The number of entities decreases slightly over time
ECDA: Evolving CSV Data Analyzer
![Page 56: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/56.jpg)
Archiving and querying evolving semantic Web data
Finally, many open questions still remain!
Objective: Research question
Representation of archives: minimize the redundant information; respect the original modeling and provenance information (e.g. LOD-a-lot)
Query language: design a query language satisfying these requirements for evolving interlinked data; our BEAR operations are meant to be an extensible starting point
Indexing: index archives at large scale; keep up with the evolution rate (streaming vs. archiving) to process the queries efficiently
Analysis/Optimization: use evolution patterns to optimize representations and queries; querying archives of structured and non-structured sources? E.g. Open Data!
Application: LOD-a-lot is a good example, but modularity can be improved; any low-cost but functional archiving at LOD scale can be a major milestone for the community
![Page 57: LOD is all about evolution - Centrum Wiskunde & Informatica › drift-a-lod › slidesDrift-a-LOD2017 › c... · 2018-02-21 · LOD is all about evolution Querying and Managing evolving](https://reader034.fdocuments.in/reader034/viewer/2022042310/5ed814a9cba89e334c673581/html5/thumbnails/57.jpg)
Thank you!
“The measure of intelligence is the ability to change”
Albert Einstein