Linked data integration_framework

LINKED DATA INTEGRATION FRAMEWORK

An expressive mapping language for translating data from the various vocabularies used on the Web into a consistent, local target vocabulary [Schultz et al., 2011]

CHALLENGES

Vocabulary heterogeneity – a wide range of different RDF vocabularies is used to represent data about the same type of entity.

URI aliases – the same real-world entity is identified with different URIs within different data sources.

SOLUTION

Have all data describing one class of entities represented using the same vocabulary

Have all triples describing the same entity have the same subject URI

TARGET

Vocabulary mapping = translate data to a single target vocabulary

Identity resolution = replace URI aliases with a single target URI on the client’s side (based on user-provided matching heuristics) while keeping track of data provenance (using the Named Graphs data model)
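The two targets can be illustrated with a minimal Python sketch. It is not LDIF's implementation (LDIF delegates these steps to R2R and Silk); it assumes quads are plain tuples, and every URI, the `PROPERTY_MAP` table and the `SAME_AS` pairs are made-up illustration data. The alias groups are merged with a small union-find structure, which is one possible choice, not necessarily the one LDIF uses.

```python
# Quads are (subject, predicate, object, graph); the graph name carries provenance.
# All URIs and tables below are hypothetical illustration data.

PROPERTY_MAP = {  # source vocabulary property -> target vocabulary property
    "http://example.org/vocabA/name": "http://example.org/target/label",
    "http://example.org/vocabB/title": "http://example.org/target/label",
}

SAME_AS = [  # pairs of URIs judged to identify the same real-world entity
    ("http://example.org/a/Berlin", "http://example.org/b/berlin"),
]


def map_vocabulary(quads, property_map):
    """Rewrite predicates into the target vocabulary; unknown predicates are kept."""
    return [(s, property_map.get(p, p), o, g) for s, p, o, g in quads]


def build_canonical(pairs):
    """Group URI aliases with union-find and map each alias to one target URI."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumption: the lexicographically smaller URI becomes the target URI.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep
    return {uri: find(uri) for uri in list(parent)}


def resolve_identities(quads, canonical):
    """Replace URI aliases in subject and object position; graphs stay untouched."""
    return [(canonical.get(s, s), p, canonical.get(o, o), g) for s, p, o, g in quads]


quads = [
    ("http://example.org/a/Berlin", "http://example.org/vocabA/name",
     "Berlin", "http://example.org/graph/a"),
    ("http://example.org/b/berlin", "http://example.org/vocabB/title",
     "Berlin", "http://example.org/graph/b"),
]
integrated = resolve_identities(map_vocabulary(quads, PROPERTY_MAP),
                                build_canonical(SAME_AS))
```

After both steps the two quads share one subject URI and one predicate, while each keeps its original graph name, so the source of every statement remains traceable.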

INTEGRATION PIPELINE STEPS

1. COLLECT DATA: replicate data sets locally via file download, crawling and SPARQL;

2. MAP TO SCHEMA: an expressive mapping language translates data from the various vocabularies used on the Web into a consistent, local target vocabulary;

3. RESOLVE IDENTITIES: identity resolution component – replace URI aliases;

4. OUTPUT: integrated data in a single file + provenance tracking (Named Graphs data model).
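The output step can be pictured as writing quads to a single N-Quads file, where the fourth term of every line names the graph that carries the provenance. The sketch below is an illustration only, not LDIF's writer: the `Literal` wrapper and both function names are hypothetical, only plain string literals are handled, and IRIs are not validated.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """Hypothetical wrapper marking a term as a plain string literal."""
    value: str


def format_term(term):
    """Serialize a literal with N-Quads string escapes, or an IRI in angle brackets."""
    if isinstance(term, Literal):
        escaped = (term.value.replace("\\", "\\\\")
                             .replace('"', '\\"')
                             .replace("\n", "\\n")
                             .replace("\r", "\\r"))
        return f'"{escaped}"'
    return f"<{term}>"


def to_nquads(quads):
    """Return one N-Quads line per (subject, predicate, object, graph) tuple."""
    return "".join(
        f"{format_term(s)} {format_term(p)} {format_term(o)} {format_term(g)} .\n"
        for s, p, o, g in quads
    )


document = to_nquads([
    ("http://example.org/a/Berlin", "http://example.org/target/label",
     Literal("Berlin"), "http://example.org/graph/a"),
])
```

Dropping the fourth term from each line would give N-Triples output, at the cost of losing the provenance information.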

ARCHITECTURE

Figure: steps of the data integration process that are currently supported by LDIF.

COMPONENTS

SCHEDULER

Triggers pending data import jobs or integration jobs;

Is configured with an XML document;

Updates the representation of external sources in the local cache;

Has the following elements:

properties: path to a Java properties file holding configuration parameters;

dataSources: directory containing the data source configurations;

importJobs: the import job configurations;

integrationJob: the integration job configuration;

dumpLocation: directory where local dumps are cached.

Supports relative and absolute paths


<scheduler xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
  <properties>scheduler.properties</properties>
  <dataSources>datasources</dataSources>
  <importJobs>importJobs</importJobs>
  <integrationJob>integration-config.xml</integrationJob>
  <dumpLocation>dumps</dumpLocation>
</scheduler>
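To make the structure of this configuration concrete, the following Python sketch reads a scheduler document with the standard library and resolves relative paths against a base directory. It is not part of LDIF (which is a JVM application); the function name `read_scheduler_config` and the base directory `/opt/ldif` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

LDIF_NS = "http://www4.wiwiss.fu-berlin.de/ldif/"
ELEMENTS = ("properties", "dataSources", "importJobs", "integrationJob", "dumpLocation")

SCHEDULER_XML = """<scheduler xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
  <properties>scheduler.properties</properties>
  <dataSources>datasources</dataSources>
  <importJobs>importJobs</importJobs>
  <integrationJob>integration-config.xml</integrationJob>
  <dumpLocation>dumps</dumpLocation>
</scheduler>"""


def read_scheduler_config(xml_text, base_dir):
    """Return {element name: path}; relative paths are resolved against base_dir."""
    root = ET.fromstring(xml_text)
    config = {}
    for name in ELEMENTS:
        # Elements live in the LDIF default namespace, so the lookup is qualified.
        element = root.find(f"{{{LDIF_NS}}}{name}")
        if element is None or not (element.text or "").strip():
            continue  # element missing or empty: leave it out of the result
        path = PurePosixPath(element.text.strip())
        config[name] = path if path.is_absolute() else PurePosixPath(base_dir) / path
    return config


config = read_scheduler_config(SCHEDULER_XML, "/opt/ldif")
```

Absolute paths in the document are returned unchanged, which mirrors the statement that both relative and absolute paths are supported.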


DATA IMPORT

Replicates data sets locally;

Different types of import jobs generate provenance metadata, which is tracked throughout the integration process;

Managed by a scheduler that can be configured to refresh the local cache for each source (e.g. hourly, daily).


Elements:

internalId: unique ID used to internally track the import job and its files (i.e. data and provenance);

dataSource: reference to a data source, stating from which source this job imports data;

One kind of import job (exactly one per configuration: quad, triple, crawl or SPARQL);

refreshSchedule: how often the import is refreshed (e.g. hourly, daily).
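How a scheduler might use the refreshSchedule element can be sketched as a simple due check. This is an assumption about the behaviour, not LDIF code: the function name `is_refresh_due` is hypothetical, and only the two schedule values mentioned in this deck (hourly, daily) are modelled.

```python
from datetime import datetime, timedelta

# Only the schedule values named in this deck; other values are not modelled here.
REFRESH_INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
}


def is_refresh_due(schedule, last_import, now):
    """Return True if an import job with this schedule should run again."""
    if schedule not in REFRESH_INTERVALS:
        raise ValueError(f"unsupported refresh schedule: {schedule!r}")
    if last_import is None:
        return True  # the source has never been imported
    return now - last_import >= REFRESH_INTERVALS[schedule]


now = datetime(2012, 2, 6, 12, 0)
last_import = datetime(2012, 2, 6, 10, 30)
hourly_due = is_refresh_due("hourly", last_import, now)
daily_due = is_refresh_due("daily", last_import, now)
```

With the last import 90 minutes in the past, the hourly job is due again while the daily job is not.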


Mechanisms to import external data:

Quad Import Job – imports N-Quads dumps;

Triple Import Job – imports RDF/N-Triples dumps;

Crawl Import Job – imports by dereferencing URIs as RDF data, using the LDSpider Web Crawling Framework;

SPARQL Import Job – imports by querying a SPARQL endpoint.

TRIPLE/QUAD DUMP IMPORT

Downloads a file containing the data set;

Difference between triple and quad dumps: LDIF generates a provenance graph for a triple dump import, whereas for a quad dump import it uses the graphs given in the dump as provenance graphs.
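The difference between the two dump types can be shown in a few lines of Python. This is a sketch under assumptions: triples and quads are plain tuples, and the graph URI pattern built from the import job's internalId is invented for illustration, since the deck does not show how LDIF names its generated graphs.

```python
def quads_from_triple_dump(triples, internal_id):
    """Put every triple into one generated provenance graph for this import job."""
    # Hypothetical naming scheme; LDIF's real graph URIs may look different.
    graph = f"http://example.org/provenance/{internal_id}"
    return [(s, p, o, graph) for s, p, o in triples]


def quads_from_quad_dump(quads):
    """Keep the graphs given in the dump; they serve as provenance graphs."""
    return list(quads)


def provenance_graphs(quads):
    """Collect the distinct graph names that appear in a set of quads."""
    return {g for _, _, _, g in quads}


triple_import = quads_from_triple_dump(
    [("http://example.org/s1", "http://example.org/p", "v1"),
     ("http://example.org/s2", "http://example.org/p", "v2")],
    internal_id="job.1",
)
quad_import = quads_from_quad_dump(
    [("http://example.org/s1", "http://example.org/p", "v1", "http://example.org/g1"),
     ("http://example.org/s2", "http://example.org/p", "v2", "http://example.org/g2")],
)
```

The triple import ends up with a single generated graph, while the quad import keeps the two graphs that were already present in the dump.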

CRAWLER IMPORT

Data sets that can be accessed only via dereferenceable URIs are good candidates for a crawler;

Each crawled URI is put into a separate named graph for provenance tracking.
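The per-URI grouping can be sketched as follows. The crawl itself (done by LDSpider in LDIF) is not reproduced; the input is assumed to be a dictionary from each dereferenced URI to the triples found there, and using the crawled URI itself as the graph name is an assumption made for this example.

```python
def quads_from_crawl(crawled):
    """Turn {crawled URI: [triples]} into quads, one named graph per crawled URI."""
    quads = []
    for source_uri, triples in crawled.items():
        for s, p, o in triples:
            # Assumption: the graph is named after the URI that was dereferenced.
            quads.append((s, p, o, source_uri))
    return quads


crawl_result = quads_from_crawl({
    "http://example.org/doc/berlin": [
        ("http://example.org/a/Berlin", "http://example.org/vocabA/name", "Berlin"),
    ],
    "http://example.org/doc/paris": [
        ("http://example.org/a/Paris", "http://example.org/vocabA/name", "Paris"),
    ],
})
crawl_graphs = {g for _, _, _, g in crawl_result}
```

Each statement can then be traced back to the exact document it was retrieved from.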

SPARQL IMPORT

The relevant data to be queried can be further specified in the configuration file for a SPARQL import job;

Data from each SPARQL import job is tracked by its own named graph.


INTEGRATION RUNTIME ENVIRONMENT

Manages the data flow between the various stages/modules, the caching of intermediate results and the execution of the different modules for each stage.

Mechanisms: data input, transformation, data output, and runtime environments.


Mechanisms:

Data Input: input data is expected to be represented as Named Graphs and stored in N-Quads format, accessible locally;

Transformation: LDIF provides transformation modules for vocabulary mapping and identity resolution:

R2R Data Translation

Silk Identity Resolution – Silk Link Discovery Framework

Data Output: the supported writers are the N-Quads Writer and the N-Triples Writer;

Runtime Environments: depending on the size of the data set and the available computing resources:

Single machine / In-memory – keeps all intermediate results in memory (fast, but limited scalability);

Single machine / RDF Store – uses a Jena TDB RDF store to hold intermediate results; the RDF store and the runtime environment communicate through SPARQL queries (allows processing data sets that do not fit in memory, but is slower);

Cluster / Hadoop – parallelizes the work across multiple machines using Hadoop.
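The trade-off between the three runtime environments can be expressed as a small decision function. This is a rule of thumb written for illustration, not something LDIF does automatically: the function name, the `memory_overhead` factor and its default value are all assumptions.

```python
from enum import Enum


class Runtime(Enum):
    IN_MEMORY = "single machine / in-memory"
    RDF_STORE = "single machine / RDF store"
    HADOOP = "cluster / Hadoop"


def choose_runtime(dataset_bytes, available_memory_bytes,
                   cluster_available=False, memory_overhead=3.0):
    """Pick a runtime environment from data set size and available resources.

    memory_overhead is a made-up factor estimating how much larger the
    in-memory representation is than the serialized data set.
    """
    if dataset_bytes * memory_overhead <= available_memory_bytes:
        return Runtime.IN_MEMORY   # fastest option while everything fits in memory
    if cluster_available:
        return Runtime.HADOOP      # spread the work over multiple machines
    return Runtime.RDF_STORE       # slower, but not bound by main memory


GIB = 1024 ** 3
small = choose_runtime(1 * GIB, 8 * GIB)
large = choose_runtime(10 * GIB, 8 * GIB)
large_with_cluster = choose_runtime(10 * GIB, 8 * GIB, cluster_available=True)
```

In practice the choice also depends on factors this sketch ignores, such as how the mapping and linking rules affect intermediate result sizes.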

FURTHER STEPS

• Data Quality Evaluation and Data Fusion Module: should allow data to be filtered according to different data quality assessment policies and provide for fusing Web data according to different conflict resolution methods;

• Flexible integration workflow: make the workflow and its configuration more flexible, so that additional modules covering other data integration aspects are easier to include.

REFERENCES

• Andreas Schultz, Andrea Matteini, Robert Isele, Christian Bizer, Christian Becker (2012) “LDIF – Linked Data Integration Framework”. Available online: http://www4.wiwiss.fu-berlin.de/bizer/ldif/, retrieved 06.02.2012 (this link is no longer active; try http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/news/ldif03released.html)

• Andreas Schultz, Andrea Matteini, Robert Isele, Christian Bizer, Christian Becker (2011) “LDIF - Linked Data Integration Framework”. 2nd International Workshop on Consuming Linked Data, Bonn, Germany, October 2011.
