A Decentralized Approach to Dissemination, Retrieval, and Archiving of Data


A Decentralized Approach to Dissemination, Retrieval, and Archiving of Data

Tobias Kuhn

http://www.tkuhn.org

@txkuhn

Department of Computer Science, VU University Amsterdam

Open Science for an Open Society Workshop, 2016 Conference on Complex Systems

Amsterdam, Netherlands, 20 September 2016

Increasing Importance of Scientific Data

https://www.google.com/trends/explore#q=%22data%20science%22

Tobias Kuhn, Department of Computer Science, VU University Amsterdam Decentralized Data Publishing 2 / 15

Scientific Data as Supplemental Material

...

http://www.nature.com/ni/journal/v16/n10/full/ni.3267.html#supplementary-information


Scientific Data in Open Repositories


We Need Better Data Publishing!

Published data should be:

• Verifiable (Is this really the data I am looking for?)

• Immutable (Can I be sure that it hasn’t been modified?)

• Permanent (Will it be available in 1, 5, 20 years from now?)

• Reliable (Can it be efficiently retrieved whenever needed?)

• Granular (Can I refer to individual data entries?)

• Semantic (Can it be automatically interpreted?)

• Linked (Does it use established identifiers and ontologies?)

• Trustworthy (Can I trust the source?)


Requirement: Automated Low-Level Operations

We need automated low-level operations to publish and retrieve data entries and datasets:

publish <dataset-identifier>

get <dataset-identifier>

(like HTTP POST/GET but verifiable, immutable, permanent, reliable, ...)

Approach: Linked Data + Cryptography + Decentralization


Nanopublications: Linked Data Containers for Provenance-Aware Semantic Publishing

assertion

provenance

publication info

nanopublication

http://nanopub.org / @nanopub_org
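The three-part structure above can be pictured as a small data container. The following is only a minimal sketch: the class name `Nanopublication`, the `is_complete` helper, and all `example.org` URIs are hypothetical illustrations, not part of any nanopublication library.

```python
from dataclasses import dataclass

# A triple is a (subject, predicate, object) tuple of URI strings.
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class Nanopublication:
    """Hypothetical container mirroring the three parts named on the slide."""
    uri: str
    assertion: tuple[Triple, ...]         # the claim or data entry itself
    provenance: tuple[Triple, ...]        # where the assertion came from
    publication_info: tuple[Triple, ...]  # metadata about the nanopublication

    def is_complete(self) -> bool:
        # In this sketch, all three parts must be non-empty.
        return all((self.assertion, self.provenance, self.publication_info))


# Illustrative content only; every example.org URI is made up.
example = Nanopublication(
    uri="http://example.org/np1",
    assertion=(
        ("http://example.org/malaria",
         "http://example.org/isTransmittedBy",
         "http://example.org/mosquito"),
    ),
    provenance=(
        ("http://example.org/np1#assertion",
         "http://www.w3.org/ns/prov#wasDerivedFrom",
         "http://example.org/study42"),
    ),
    publication_info=(
        ("http://example.org/np1",
         "http://purl.org/dc/terms/creator",
         "http://example.org/alice"),
    ),
)
```

Making the container frozen reflects the idea that a published nanopublication is not edited in place.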


Trusty URIs: Cryptographic Hash Values for Verifiable and Immutable Web Identifiers

Nanopublications with Trusty URIs are ...

Verifiable + Immutable + Permanent

http://example.org/r1.RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70.trig

http://trustyuri.net/

Kuhn, Dumontier. Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data. ESWC 2014.
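To illustrate why a hash inside the identifier makes content verifiable and immutable, here is a simplified sketch. It assumes SHA-256 with unpadded base64url encoding, which yields the 43 characters seen after the `RA` prefix. Unlike real Trusty URIs for RDF, it hashes the raw bytes without normalizing the graph, so it will not reproduce published hash values. The function names are hypothetical.

```python
import base64
import hashlib


def artifact_code(content: bytes) -> str:
    """Return 'RA' plus the unpadded base64url SHA-256 digest of the content.

    Simplification: hashes raw bytes; no RDF normalization is performed.
    """
    digest = hashlib.sha256(content).digest()            # 32 bytes
    encoded = base64.urlsafe_b64encode(digest).decode()  # 44 chars with '=' padding
    return "RA" + encoded.rstrip("=")                    # 43 chars after 'RA'


def make_uri(base: str, content: bytes) -> str:
    """Append the artifact code to a base URI, separated by a dot."""
    return f"{base}.{artifact_code(content)}"


def verify(uri: str, content: bytes) -> bool:
    """Check that the content still matches the hash embedded in the URI."""
    return uri.endswith(artifact_code(content))


data = b"<http://example.org/a> <http://example.org/b> <http://example.org/c> ."
uri = make_uri("http://example.org/r1", data)
```

Any change to `data`, even a single byte, produces a different code, so `verify` fails for modified content. That is the sense in which the identifier pins down exactly one artifact.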


Decentralized and Reliable Publishing with a Nanopublication Server Network

Nanopublications with Trusty URIs

Publication

Retrieval

Propagation / Archiving

http://npmonitor.inn.ac
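The interplay of publication, propagation, and retrieval can be sketched with a toy in-memory model. This is not the actual server protocol: the `Server` class, `propagate`, and `retrieve` are hypothetical, and a plain hex SHA-256 stands in for the Trusty URI hash. The point is that a client can accept content from any server, because the hash in the identifier lets it reject corrupted copies.

```python
import hashlib


def content_id(content: bytes) -> str:
    # Stand-in for a Trusty URI hash: hex SHA-256 of the raw bytes.
    return hashlib.sha256(content).hexdigest()


class Server:
    """Hypothetical in-memory server holding nanopublications by hash."""

    def __init__(self, name: str):
        self.name = name
        self.store: dict[str, bytes] = {}
        self.online = True

    def publish(self, content: bytes) -> str:
        key = content_id(content)
        self.store[key] = content
        return key

    def get(self, key: str):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        return self.store.get(key)


def propagate(source: Server, others: list[Server]) -> None:
    """Copy every entry to the other servers (archiving by replication)."""
    for server in others:
        server.store.update(source.store)


def retrieve(key: str, servers: list[Server]) -> bytes:
    """Try servers in turn; accept only content whose hash matches the key."""
    for server in servers:
        try:
            content = server.get(key)
        except ConnectionError:
            continue  # unreachable server: try the next one
        if content is not None and content_id(content) == key:
            return content
    raise LookupError(f"no server returned valid content for {key}")


a, b, c = Server("a"), Server("b"), Server("c")
key = a.publish(b"example nanopublication")
propagate(a, [b, c])
a.online = False            # the original server disappears
b.store[key] = b"tampered"  # one replica is corrupted
result = retrieve(key, [a, b, c])
```

Here retrieval still succeeds from server `c` even though the publishing server is gone and one replica has been altered.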


Defining Datasets with Nanopublication Indexes (which are themselves Nanopublications)

[Figure: nanopublication indexes, panels (a) to (f), connected by three kinds of relations: appends, has sub-index, has element]
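How a dataset could be resolved from indexes linked by the appends, has sub-index, and has element relations can be sketched as follows. The `Index` class, the `resolve` function, and the identifiers `part1`, `part2`, `sub`, `top`, and `np1` to `np4` are hypothetical; the sketch assumes that "appends" means an index extends an earlier one and thereby includes its content.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Index:
    """Hypothetical index; in the approach it is itself a nanopublication."""
    uri: str
    elements: list[str] = field(default_factory=list)     # 'has element'
    sub_indexes: list[str] = field(default_factory=list)  # 'has sub-index'
    appends: Optional[str] = None                         # 'appends' an earlier index


def resolve(uri: str, indexes: dict[str, Index],
            seen: Optional[set[str]] = None) -> set[str]:
    """Collect all element URIs reachable from the given index."""
    seen = set() if seen is None else seen
    if uri in seen:          # guard against visiting an index twice
        return set()
    seen.add(uri)
    index = indexes[uri]
    result = set(index.elements)
    if index.appends is not None:
        result |= resolve(index.appends, indexes, seen)
    for sub in index.sub_indexes:
        result |= resolve(sub, indexes, seen)
    return result


indexes = {
    "part1": Index("part1", elements=["np1", "np2"]),
    "part2": Index("part2", elements=["np3"], appends="part1"),
    "sub": Index("sub", elements=["np4"]),
    "top": Index("top", sub_indexes=["part2", "sub"]),
}
dataset = resolve("top", indexes)
```

Because each index would carry its own hash-based identifier, referring to `top` would fix the entire dataset, including everything reachable through it.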


Nanopublication Server Network is Efficient and Scalable

Our servers can deliver nanopublications about 100 times faster than when a triple store is used (and need far fewer resources):

[Figure: response time in seconds (axis ticks at 0.1, 1, 10, 100) plotted against time from start of test in seconds (0 to 300) and against number of clients accessing the service in parallel (0 to 100), comparing a Virtuoso triple store with SPARQL endpoint and a nanopublication server]


Nanopublication Datasets


Reliable Low-Level Publish/Retrieve Operations!

Operation to publish data:

$ np publish nanopubs.trig

156026 nanopubs published at http://np.inn.ac/

which can also be used to publish dataset definitions (indexes):

$ np publish index.trig

157 nanopubs published at http://np.inn.ac/

Operation to retrieve data entries:

$ np get http://np.inn.ac/RA7Kmmugi8OuCirfe5WKchnJhC3FuhQDi6M4O8mgR0CqE

and to retrieve entire datasets:

$ np get -c http://np.inn.ac/RAY_lQruuagCYtAcKAPptkY7EpITwZeUilGHsWGm9ZWNI

https://github.com/Nanopublication/nanopub-java
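Before passing an identifier to a retrieval command, a client could check that it has the expected shape. The sketch below is an assumption-laden illustration, not part of nanopub-java: it treats an artifact code as `RA` followed by 43 base64url characters at the end of the URI, and the names `ARTIFACT_CODE` and `extract_artifact_code` are hypothetical.

```python
import re

# 'RA' followed by 43 base64url characters, at the end of the URI.
ARTIFACT_CODE = re.compile(r"RA[A-Za-z0-9_-]{43}$")


def extract_artifact_code(uri: str):
    """Return the trailing artifact code, or None if the URI has none."""
    match = ARTIFACT_CODE.search(uri)
    return match.group(0) if match else None


entry = "http://np.inn.ac/RA7Kmmugi8OuCirfe5WKchnJhC3FuhQDi6M4O8mgR0CqE"
code = extract_artifact_code(entry)
```

A shape check like this only catches truncated or mistyped identifiers. Verifying the content itself still requires recomputing the hash after retrieval.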


Future Work

• Improve server protocol

• Develop services on top of the server network

• Establish best practices for versioning, retractions, reviews, etc.

• Connect it all to the scientific publishing workflow


Thank you for your attention!

Questions?

Further information:

• Paper on the approach: https://peerj.com/articles/cs-78/

• Nanopublications: http://nanopub.org

• Trusty URIs: http://trustyuri.net

• Nanopublication Server Network: http://npmonitor.inn.ac


Multi-Layer Architecture

applications (analyze/use data)

advanced services (query/analyze data)

core services (find data)

decentralized server network (provide data)

