Linked Open GeoData Management in the Cloud
K. Kritikos, Y. Roussakis (ICS-FORTH); D. Kotzinos (ICS-FORTH & TEI of Serres)
Cloud Computing
Better (faster, more reliable, etc.) infrastructure – IaaS
Development infrastructure – PaaS
Software infrastructure – SaaS
Cloud Computing
Data as a Service (DaaS)
• Publication
• Querying
• Updating
Linked (Open) Data as a Service
• Publishing Linked Data
  – URI construction
  – Conceptual model
  – Storage as RDF files or SPARQL endpoints
• Querying Linked Data
  – SPARQL
  – GeoSPARQL
• Updating Linked Data
  – SPARUL
  – Synchronization with original sources
Problem Introduction (I)
• InGeoCloudS FP7 Pilot B Project (www.ingeoclouds.eu)
• Geophysical data from different sources and in different formats (Excel, XML, relational, nothing …)
• Borehole and groundwater analysis
  – Boreholes located in Mygdonia/Thriasio in Greece and across the whole country in Denmark and France, together with their features (data static over time)
  – Chemical analyses of groundwater sampled from these boreholes (data updated over time)
• Earthquake events and features
• Landslides
Data granularity
• Data refer to different levels of granularity, e.g. susceptibility maps refer to a country-wide area while earthquakes or boreholes are point-level data
• Data might need to be aggregated; such aggregation is based on the spatial dimension, i.e. points contained within a polygon
• Spatial aggregation alone might not be enough, since phenomena outside the area of concern may still affect it
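The spatial aggregation mentioned above (points contained within a polygon) can be sketched in plain Python with a ray-casting test. This is only an illustration: the square area, the event coordinates, the magnitude values and the function names are assumptions made for the example, not project data or code.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Event = Tuple[Point, float]  # (location, observed value such as a magnitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: cast a horizontal ray from the point and count the
    polygon edges it crosses; an odd count means the point is inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def aggregate_in_area(events: List[Event],
                      polygon: List[Point]) -> Tuple[int, Optional[float]]:
    """Return the count and mean value of the events located inside the polygon."""
    values = [value for (location, value) in events
              if point_in_polygon(location, polygon)]
    mean = sum(values) / len(values) if values else None
    return len(values), mean


# Hypothetical area of concern (a square) and hypothetical point events
area = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
earthquakes = [((2.0, 3.0), 4.0), ((5.0, 5.0), 5.0), ((12.0, 1.0), 6.0)]
count, mean_magnitude = aggregate_in_area(earthquakes, area)
```

The third event lies just outside the polygon and is dropped, which is exactly the case where purely spatial aggregation may miss a phenomenon that still affects the area.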
Earthquakes:
• How far back in time should we go?
• What information should be kept / would be relevant?
• How should we query the repository to get the relevant information?
Landslides:
• Which area is affected, and how much?
• How does this change over time?
• Is the earthquake effect cumulative, or does it fade over time?
Problem Introduction (II)
• Data/metadata standards
  – The INSPIRE standard proposes a generic conceptual schema for scientific data + models for 34 spatial data themes
• Dealing with geospatial data & maintaining schemas/ontologies becomes difficult
  – Challenge is to exploit semantic heterogeneity
• Need to offer a seamless & transparent LOD-as-a-Service (LODaaS) way to manage LOD
  – Lack of tools for mapping, transforming & synchronizing geo-spatial LD
  – Generic LOD management, independent of the way LOD are stored
Points of interest
• GeoData get bigger and more important
  – Used in a variety of applications in different fields
• Size & high demand impose considerable requirements on infrastructure storage size & compute power
• Need to be reused and linked with other data sets
  – Go beyond the current Web paradigm of isolated data silos
• Current geo-spatial open data management work does not offer this
  – Cloud-based approaches:
    • Do not provide geo-spatial support
    • Some do not fully support SPARQL or offer SPARQL endpoints
  – Centralized approaches offer geo-spatial support but:
    • Do not enable automatic mapping between relational and RDF data
    • Perform worse in general (with the exception of Strabon w.r.t. geo-spatial query support)
Proposed Solution (I)
• A specific set of LODaaS services for geo-spatial LOD publishing, integration & querying
• The cloud offers scalability & elasticity of computation, 24/7 availability & multiple data storage and integration offerings
• Our cloud-based service-oriented system:
  – Exhibits good LOD management performance
  – Exposes a LOD management service that abstracts away RDF store peculiarities & provides a generic way for LOD access and management
Proposed Solution (II)
• A particular solution is adopted for mapping geo-spatial data in different formats to RDF data
• The latter conform to extensible conceptual models that accurately capture thematic areas and are integrated via GeoScientific Observation Model
  – This allows posing queries across providers and thematic fields
• Our solution is part of the system developed in the InGeoCloudS project, which exploits cloud capabilities & LD technology to integrate & store heterogeneous geo-spatial data sets from different thematic fields, and to host & execute applications that exploit these data sets
Architecture (I)
• System is scalable and elastic by exploiting cloud facilities
• An extensive application pool can be built on top that exploits the offered services to perform various added-value and highly demanding tasks:
  – LO GeoData visualization, discovery & composition of data sets, LO GeoData analytics
  – System could be extended to host such applications & offer various (geo-spatial) LO GeoData processing services and pre-built applications
Architecture (II)
• Distributor:
  – Distributes generic queries equally & collects the results
  – Sends non-generic queries to the instances holding the appropriate data
  – Assigns new data to the scaling layer that is least loaded w.r.t. storage space
  – Exploits the CPU monitoring & elasticity facilities of Amazon
• Scaling Layer: comprises one or more LOD management components; data are replicated across these components to enhance reliability & enable layer-based load balancing
• LOD Management Component: comprises LOD Management Service (LMS) instance & Virtuoso server for storage
• LMS: provides methods for data providers to manage LOD & for other users to query & export the LOD stored
• Virtuoso: underlying RDF triple store also allowing the mapping & synchronization between relational and RDF data
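The Distributor's routing and placement rules can be sketched as follows. This is a minimal illustration, not the project's implementation: `ScalingLayer`, `Distributor` and their fields are hypothetical names, and "equal distribution" of generic queries is read here as round-robin load balancing, which is only one possible interpretation of the slide.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import List, Optional, Set


@dataclass
class ScalingLayer:
    """Hypothetical stand-in for one scaling layer of the architecture."""
    name: str
    used_storage_gb: float
    graphs: Set[str] = field(default_factory=set)


class Distributor:
    def __init__(self, layers: List[ScalingLayer]) -> None:
        self.layers = layers
        self._round_robin = cycle(layers)

    def route_query(self, graph: Optional[str] = None) -> ScalingLayer:
        """Generic queries (no graph given) are spread evenly over the layers;
        graph-specific queries go to a layer that stores that graph."""
        if graph is None:
            return next(self._round_robin)
        for layer in self.layers:
            if graph in layer.graphs:
                return layer
        raise KeyError(f"no layer stores graph {graph!r}")

    def place_data(self, graph: str, size_gb: float) -> ScalingLayer:
        """Assign new data to the layer with the least used storage space."""
        target = min(self.layers, key=lambda layer: layer.used_storage_gb)
        target.graphs.add(graph)
        target.used_storage_gb += size_gb
        return target


# Hypothetical deployment with two scaling layers
layers = [
    ScalingLayer("A", 40.0, {"boreholes"}),
    ScalingLayer("B", 10.0, {"earthquakes"}),
]
distributor = Distributor(layers)
placed = distributor.place_data("landslides", 5.0)
generic_targets = [distributor.route_query().name for _ in range(4)]
specific_target = distributor.route_query("landslides").name
```

Replication inside a layer and the collection of results are left out to keep the sketch short.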
General Query Evaluation Behavior
[Figure: response time vs. time passed, with the involvement of the 2nd instance marked]
LOD Integration & Publishing (I)
• Extension of the high-level CIDOC-CRM conceptual model
• New model is called Geo-Scientific Spatial Observation Model (GSOM) & is expressed in RDF/S
• It captures all information coming from different fields & countries + links data across different providers
• INSPIRE was not exploited as it did not cover all requirements:
  – Capturing of scientific events
  – Complicated and cumbersome for information integration
  – In some cases, does not cover all the information required by the data providers in particular thematic fields
• GSOM-to-INSPIRE mapping specification to enable exporting INSPIRE-compliant data
LOD Integration & Publishing (II)
• Two alternatives for publishing LOD:
1. Create and import RDF-based descriptions of data sets via a particular LMS method
  – Data update process must be controlled by performing SPARUL updates via a particular LMS method
  – Data provider is responsible for keeping relational & RDF data synchronized
    • Perfect synchronization may not even be required, as it may incur costs -> the second alternative becomes preferable
LOD Integration & Publishing (III)
2. Data provider publishes the relational data of his/her data sets + provides a mapping file in R2RML to enable the synchronization of relational to RDF data (by executing an LMS method)
  – System takes care of this synchronization
  – Relational storage as used for many years + additional RDF storage for the data, with automatic one-way synchronization between the two
  – Provider should have a good knowledge of GSOM & RDF
LOD Integration & Publishing (IV)
• R2RML:
  – W3C Recommendation since 2012
  – Can specify customized mappings between RDB & RDF data
  – An R2RML specification is just an RDF graph in Turtle
  – No specific implementation is imposed
• Virtuoso supports R2RML by processing the R2RML specification & creating the respective RDB2RDF triggers (used for creating/updating RDF data from relational ones)
  – An RDF view or a physical RDF graph can be created, with the second option leading to far better performance
Borehole Relational Model (RDB)

BoreholeName | Sample ID | WaterLevel
B1           | XYZ       | Level1

[Figure: R2RML mapping from the borehole relational model (RDB) to GSOM, used for publication and synchronization. GSOM classes shown: S16.Borehole, E26.Physical_Feature, S15.Acquifer_ConceptIntake, S13.Sample, S2.SampleTaking, E41.Appelation (Borehole_Name), E42.Identifier (Sample_ID), E54.Dimension (Waterlevel). Properties shown: P1F.is_identified_by, P43F.has_dimension, P121F.overlaps_with, O4.sampled_from, O5.removed. URI identification: http://orgURL/SampleID/XYZ]
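The idea behind the mapping can be sketched in Python as a function that turns one relational row into RDF-style triples. This is not R2RML itself and not the project's mapping: only the sample URI pattern (http://orgURL/SampleID/XYZ) and the GSOM term names come from the slide, while the `GSOM_NS` namespace, the borehole URI pattern and the way the properties connect the nodes are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

BASE = "http://orgURL"                 # placeholder host used on the slide
GSOM_NS = "http://example.org/gsom/"   # hypothetical namespace for GSOM terms
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def term(local_name: str) -> str:
    """Build a URI reference for a GSOM class or property (assumed namespace)."""
    return "<" + GSOM_NS + local_name + ">"


def literal(value: str) -> str:
    """Wrap a column value as a plain literal."""
    return '"' + value + '"'


def row_to_triples(row: Dict[str, str]) -> List[Triple]:
    """Map one row of the borehole table to triples.

    The sample URI follows the pattern shown on the slide; the borehole URI
    pattern and the wiring of the properties are assumptions.
    """
    sample = "<" + BASE + "/SampleID/" + row["Sample_ID"] + ">"
    borehole = "<" + BASE + "/Borehole/" + row["Borehole_Name"] + ">"  # assumed
    return [
        (borehole, RDF_TYPE, term("S16.Borehole")),
        (borehole, term("P1F.is_identified_by"), literal(row["Borehole_Name"])),
        (borehole, term("P43F.has_dimension"), literal(row["Waterlevel"])),
        (sample, RDF_TYPE, term("S13.Sample")),
        (sample, term("P1F.is_identified_by"), literal(row["Sample_ID"])),
    ]


# The example row from the borehole table
triples = row_to_triples(
    {"Borehole_Name": "B1", "Sample_ID": "XYZ", "Waterlevel": "Level1"}
)
for subject, predicate, obj in triples:
    print(subject, predicate, obj, ".")
```

In the system described here, the equivalent rules would be written once in R2RML and handed to Virtuoso, which keeps the RDF side synchronized through triggers.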
LOD Management Service (I)
• REST-based service with an API exposing all the management functionality needed by geo-spatial LOD users
  – Abstracts away from peculiarities of RDF triple stores
  – Enables simple & intuitive use of a specific set of LOD management methods
  – Programmatic or form-based access to methods
  – Production of query results in different forms, such as WKT, GML & KML
  – Importing/exporting capabilities in different formats (RDF/XML, N-Triples, Turtle)
LOD Management Service (II)
• The provided methods are:
  – meta_query(SPARQL string, timeout (opt.), row limit (opt.)): user-requested format (e.g., JSON)
  – meta_update(SPARUL string, baseURI, timeout (opt.), row limit (opt.))
  – meta_addMappings(R2RML string, graphURI) -> initiates the mapping procedure
  – meta_export(graphURI, subjURI, predURI, objURI, internal): user-requested format -> last parameter indicates whether the result will be inline in the response
  – meta_import(url, graphUri, format, blocking): ImportStatus -> RDF data are imported by downloading them via the provided URL or inline in the user request + method can be blocking or non-blocking
  – import_status(importID): ImportStatus -> after a non-blocking import request, the user can check the status of his/her import by passing the importID field returned by meta_import to this method
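A client call to one of these methods could be prepared as sketched below. The slides do not give the service address or the wire-level parameter names, so `LMS_BASE_URL` and the query parameter names (`query`, `timeout`, `limit`) are assumptions; only the method name `meta_query` and its optional arguments come from the list above. The sketch builds the request URL without sending it.

```python
from typing import Dict, Optional
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base URL; the real service address is not given in the slides.
LMS_BASE_URL = "http://lms.example.org/api"


def build_meta_query_url(sparql: str,
                         timeout: Optional[int] = None,
                         row_limit: Optional[int] = None) -> str:
    """Build a GET URL for a meta_query call, leaving out unset optional
    arguments. The parameter names used here are assumed."""
    params: Dict[str, object] = {"query": sparql}
    if timeout is not None:
        params["timeout"] = timeout
    if row_limit is not None:
        params["limit"] = row_limit
    return LMS_BASE_URL + "/meta_query?" + urlencode(params)


url = build_meta_query_url("SELECT ?s WHERE { ?s ?p ?o }", row_limit=10)

# Decode the query string again to check what would be sent
sent = parse_qs(urlsplit(url).query)
print(url)
print(sent)
```

Because optional arguments are simply omitted, the same helper covers calls with and without a timeout or row limit.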
LOD Management Service (III)
• Each method accessible via specific URL + produces meaningful exception messages (e.g., in case user input is wrong)
• User-friendly HTML Documentation produced via Enunciate
• Implementation exploited Sesame RDF Data Management API, Virtuoso’s JDBC Driver & Jersey
Open Issues (I)
• Model:
  – Extend it to capture other thematic fields
  – Data published in our system could fulfill all requirements to be 5-star LOD if the respective owners decide to do so
• Data mapping:
  – Cloud-based Virtuoso version supports native relational DB for RDB2RDF synchronization
    • Trade-off between LOD management completeness & cost
  – Mapping tools are needed to allow visual editing of R2RML without requiring data providers to have good knowledge of RDF
  – Research issue: support bi-directional RDB2RDF mappings
Open Issues (II)
• Geo-spatial query support:
  – Virtuoso does not support GeoSPARQL
  – Virtuoso has limited geo-spatial query support, only in commercial versions
    • 2D geometries + limited set of topological relation operators
  – Additional support is needed in terms of geometry dimensionality + feature aggregation operators
• Could extend Virtuoso via frameworks, such as uSeekM, which provide adequate geo-spatial support along with the capability of evaluating GeoSPARQL queries
  – Such solutions require processing all stored RDF data to create geo-spatial indices, as well as deploying another DB -> do not fit well with automatic geo-spatial LOD management
  – Could resolve the problem by: (a) re-indexing at infrequent time intervals, or (b) creating specialized triggers that start re-indexing only when RDF data are updated
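The two re-indexing options can also be combined, as in the sketch below: an update trigger only marks the index as stale, and a rebuild happens only when the index is stale and a minimum interval has passed. `ReindexScheduler` and its fields are hypothetical names, and the actual index rebuild is replaced by a counter; this is not how uSeekM or Virtuoso work internally.

```python
from typing import List, Optional


class ReindexScheduler:
    """Decides when to rebuild a geo-spatial index (hypothetical helper)."""

    def __init__(self, min_interval_s: float) -> None:
        self.min_interval_s = min_interval_s
        self.dirty = False                        # RDF data changed since last rebuild
        self.last_reindex_s: Optional[float] = None
        self.rebuild_count = 0

    def on_rdf_update(self) -> None:
        """Called by an update trigger: only records that the index is stale."""
        self.dirty = True

    def maybe_reindex(self, now_s: float) -> bool:
        """Rebuild only if data changed and the minimum interval has elapsed."""
        if not self.dirty:
            return False
        too_soon = (self.last_reindex_s is not None
                    and now_s - self.last_reindex_s < self.min_interval_s)
        if too_soon:
            return False
        self.rebuild_count += 1                   # placeholder for the real rebuild
        self.last_reindex_s = now_s
        self.dirty = False
        return True


scheduler = ReindexScheduler(min_interval_s=3600)
results: List[bool] = []
results.append(scheduler.maybe_reindex(0))        # no updates yet
scheduler.on_rdf_update()
results.append(scheduler.maybe_reindex(10))       # stale and never indexed
scheduler.on_rdf_update()
results.append(scheduler.maybe_reindex(20))       # stale, but too soon
results.append(scheduler.maybe_reindex(3700))     # stale and interval elapsed
```

The minimum interval bounds the re-indexing cost under frequent updates, while the stale flag avoids rebuilding when nothing has changed.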
Open Issues (III)
• Quality & provenance:
  – Original input data sets may not have the appropriate quality -> resulting RDF data can have the same or a lower quality level
  – Proposed infrastructure must be extended with quality-resolving procedures & methods (e.g., data cleansing methods for correcting the data exploited)
  – Provenance information can ensure the correct updating of LD + assist the LD reasoning process by deriving additional facts
  – Thus, provenance information should be exploited by our system, especially since such exploitation is not enabled by most LOD management systems
Conclusions
• Proposed a scalable, geo-spatial LOD-as-a-Service management system deployed on the Amazon cloud
  – Distributes query load + scales up/down when CPU utilization crosses specific thresholds
  – Exposes a REST-based service with LOD management methods
  – Provides two different ways of publishing open geo-spatial data sets
• Advance the level of geo-spatial support in two directions:
  – Realize the GSOM-to-INSPIRE mapping to enable producing INSPIRE-compliant data
  – Extend Virtuoso with geo-spatial indexing & query systems to enable the efficient processing of rich & expressive geo-spatial queries, expressed in either SPARQL or GeoSPARQL
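The threshold-based scaling behaviour summarized above can be illustrated with a small decision function. In the system itself this is delegated to Amazon's CPU monitoring and elasticity facilities; the function name, the threshold values (80% and 20%) and the instance limits below are assumptions chosen for the example.

```python
def scaling_decision(cpu_percent: float, instances: int,
                     upper: float = 80.0, lower: float = 20.0,
                     min_instances: int = 1, max_instances: int = 4) -> int:
    """Return the new instance count for the observed average CPU utilization.

    Scale up by one instance above the upper threshold, scale down by one
    below the lower threshold, and stay within the instance limits.
    """
    if cpu_percent > upper and instances < max_instances:
        return instances + 1
    if cpu_percent < lower and instances > min_instances:
        return instances - 1
    return instances


# Hypothetical utilization samples applied to a single starting instance
instances = 1
history = []
for cpu in [90.0, 85.0, 50.0, 10.0, 5.0, 5.0]:
    instances = scaling_decision(cpu, instances)
    history.append(instances)
```

Keeping a gap between the two thresholds avoids adding and removing instances in quick succession when the load hovers around a single value.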