Sieve - Data Quality and Fusion - LWDM2012

Sieve: Linked Data Quality Assessment and Fusion. Pablo N. Mendes, Hannes Mühleisen, Christian Bizer. With contributions from: Andreas Schultz, Andrea Matteini, Christian Becker, Robert Isele

description

Presentation at the LWDM workshop at EDBT 2012. The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources may provide conflicting values for a single real-world object. In order for Linked Data applications to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome. One of these challenges is to rate and to integrate data based on their quality. However, quality is a very subjective matter, and finding a canonical judgement that is suitable for each and every task is not feasible. To simplify the task of consuming high-quality data, we present Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods. Sieve is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion. We demonstrate Sieve in a data integration scenario importing data from the English and Portuguese versions of DBpedia, and discuss how we increase completeness, conciseness and consistency through the use of our framework.

Transcript of Sieve - Data Quality and Fusion - LWDM2012

Page 1: Sieve - Data Quality and Fusion - LWDM2012

Sieve: Linked Data Quality Assessment and Fusion

Pablo N. Mendes

Hannes Mühleisen

Christian Bizer

With contributions from:

Andreas Schultz, Andrea Matteini, Christian Becker, Robert Isele

Page 2: Sieve - Data Quality and Fusion - LWDM2012

“A sieve, or sifter, separates wanted elements from unwanted material using a woven screen such as a mesh or net.” Source: http://en.wikipedia.org/wiki/Sieve

“sieve”

Page 3: Sieve - Data Quality and Fusion - LWDM2012

What is Linked Data?

• Raw data (RDF)
• Accessible on the Web
• Data can link to other data sources
• Benefits: Ease of access and re-use; enables discovery

[Figure: data sources A, B, C, D and E, each containing several "Thing" resources, connected to one another by data links]

Page 4: Sieve - Data Quality and Fusion - LWDM2012

Linking Open Data Cloud

http://lod-cloud.net

Page 5: Sieve - Data Quality and Fusion - LWDM2012

Linked Data Challenges

• Data providers have different intentions, experience/knowledge
  • Data may be inaccurate, outdated, spam etc.
• Data sources that overlap in content may use...
  • ... different RDF schemata
  • ... different identifiers for the same real-world entity
  • ... conflicting values for properties
• Integrating public datasets with internal databases poses the same problems

Page 6: Sieve - Data Quality and Fusion - LWDM2012

An Architecture for Linked Data Applications

Page 7: Sieve - Data Quality and Fusion - LWDM2012

LDIF – Linked Data Integration Framework

• Open source (Apache License, Version 2.0)

• Collaboration between Freie Universität Berlin and mes|semantics

1. Collect data: Managed download and update
2. Translate data into a single target vocabulary
3. Resolve identifier aliases into local target URIs
4. Assess quality, filter bad results, resolve conflicts
5. Output

Page 8: Sieve - Data Quality and Fusion - LWDM2012

LDIF Pipeline: 1. Collect data, 2. Translate data, 3. Resolve identities, 4. Filter and fuse, 5. Output

Step 1, Collect data

Supported data sources:
• RDF dumps (various formats)
• SPARQL Endpoints
• Crawling Linked Data

Page 9: Sieve - Data Quality and Fusion - LWDM2012

LDIF Pipeline: 1. Collect data, 2. Translate data, 3. Resolve identities, 4. Filter and fuse, 5. Output

Step 2, Translate data (R2R)

Data sources use a wide range of different RDF vocabularies

[Figure: example vocabulary terms dbpedia-owl:City, schema:Place, fb:location.citytown and local:City]

• Mappings expressed in RDF (Turtle)
• Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y)
• Complex mappings with SPARQL expressivity
• Transformation functions

Page 10: Sieve - Data Quality and Fusion - LWDM2012

LDIF Pipeline: 1. Collect data, 2. Translate data, 3. Resolve identities, 4. Filter and fuse, 5. Output

Step 3, Resolve identities (Silk)

Data sources use different identifiers for the same entity

[Figure: candidates for "Berlin" are Berlin, Germany; Berlin, CT; Berlin, MD; Berlin, NJ; Berlin, MA. Result: Berlin = Berlin, Germany]

• Profiles expressed in XML
• Supports various comparators and transformations
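To make the idea of comparators and transformations concrete, here is a minimal Python sketch of the Berlin example. It is not Silk's actual profile format or API: the function names, the `country` field, and the 0.9 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(label: str) -> str:
    """Transformation (hypothetical): keep the part before the first comma, lower-cased."""
    return label.split(",")[0].strip().lower()


def similarity(a: str, b: str) -> float:
    """Comparator (hypothetical): string similarity in [0, 1] on normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def same_entity(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Link two records when their labels are similar AND their countries agree."""
    return similarity(a["label"], b["label"]) >= threshold and a["country"] == b["country"]


# Illustrative records; the "country" field is an assumption of this sketch.
berlin = {"label": "Berlin", "country": "DE"}
candidates = [
    {"label": "Berlin, Germany", "country": "DE"},
    {"label": "Berlin, CT", "country": "US"},
    {"label": "Berlin, NJ", "country": "US"},
]

matches = [c["label"] for c in candidates if same_entity(berlin, c)]
print(matches)
```

The label comparison alone cannot separate the candidates, since all of them normalize to "berlin"; the second comparison on country is what singles out Berlin, Germany. This is why a linkage rule typically combines several comparators.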

Page 11: Sieve - Data Quality and Fusion - LWDM2012

LDIF Pipeline: 1. Collect data, 2. Translate data, 3. Resolve identities, 4. Filter and fuse, 5. Output

Step 4, Filter and fuse (Sieve)

Sources provide different values for the same property

[Figure: sources report Total Area as 891.85 km² and 891.82 km²; using quality, the fused output shows Total Area 891.85 km²]

• Profiles expressed in XML
• Supports various scoring and fusion functions

Page 12: Sieve - Data Quality and Fusion - LWDM2012

LDIF Pipeline: 1. Collect data, 2. Translate data, 3. Resolve identities, 4. Filter and fuse, 5. Output

Step 5, Output

• Output options:
  • N-Quads
  • N-Triples
  • SPARQL Update Stream
• Provenance tracking using Named Graphs

Page 13: Sieve - Data Quality and Fusion - LWDM2012

An Architecture for Linked Data Applications

Data Quality and Fusion Module

Page 14: Sieve - Data Quality and Fusion - LWDM2012

Data Fusion

“fusing multiple records representing the same real-world object into a single, consistent, and clean representation” (Bleiholder & Naumann, 2008)

Page 15: Sieve - Data Quality and Fusion - LWDM2012

Conflict resolution strategies

• Independent of quality assessment metrics
  • Pick most frequent (democratic voting)
  • Average, max, min, concatenation
  • Within interval
• Based on task-specific quality assessment
  • Keep highest scored
  • Keep all that pass a threshold
  • Trust some sources over others
  • Weighted voting
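The quality-independent strategies listed above can be sketched in a few lines of Python. The function names are hypothetical and are not Sieve's fusion functions; "within interval" is interpreted here as keeping only values inside a plausible range, which is an assumption. The last input value is an invented outlier added for illustration.

```python
from collections import Counter
from statistics import mean


def most_frequent(values):
    """Democratic voting: keep the value reported most often."""
    return Counter(values).most_common(1)[0][0]


def average(values):
    """Replace conflicting numeric values by their arithmetic mean."""
    return mean(values)


def within_interval(values, low, high):
    """Keep only values inside [low, high] (assumed meaning of 'within interval')."""
    return [v for v in values if low <= v <= high]


# Total-area values in km2; 8918.5 is a made-up outlier for this sketch.
areas = [891.85, 891.82, 891.85, 8918.5]

voted = most_frequent(areas)
plausible = within_interval(areas, 800, 1000)

print(voted)                    # value chosen by voting
print(plausible)                # values that survive the interval filter
print(max(areas), min(areas))   # max / min strategies
print(average(plausible))       # mean of the plausible values
```

None of these functions looks at where a value came from, which is what makes them independent of quality assessment: they can be applied even when no quality metadata is available.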

Page 16: Sieve - Data Quality and Fusion - LWDM2012

Data Fusion

• Input:
  • (Potentially) conflicting data
  • Quality metadata describing input
• Execution:
  • Use existing or custom FusionFunctions
• Output:
  • Clean data, according to user’s definition of clean
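As an illustration of the input/execution/output flow above, the following Python sketch fuses conflicting values using quality metadata. The function names, the source labels (`en`, `pt`, `other`) and the scores are assumptions for the example; Sieve's real FusionFunctions are configured differently.

```python
from collections import defaultdict


def keep_highest_scored(values, scores):
    """values: list of (value, source); scores: source -> quality score in [0, 1]."""
    return max(values, key=lambda pair: scores[pair[1]])[0]


def pass_threshold(values, scores, threshold):
    """Keep every value whose source scores at least `threshold`."""
    return [value for value, source in values if scores[source] >= threshold]


def weighted_vote(values, scores):
    """Each source votes for its value with a weight equal to its quality score."""
    totals = defaultdict(float)
    for value, source in values:
        totals[value] += scores[source]
    return max(totals, key=totals.get)


# Input: conflicting data plus quality metadata (all numbers are illustrative).
values = [(891.85, "en"), (891.82, "pt"), (891.82, "other")]
scores = {"en": 0.9, "pt": 0.4, "other": 0.3}

best = keep_highest_scored(values, scores)
kept = pass_threshold(values, scores, 0.35)
voted = weighted_vote(values, scores)
print(best, kept, voted)
```

Note that plain majority voting would pick 891.82 here (two sources against one), while weighted voting picks 891.85 because the single source reporting it carries more weight. The user's choice of function is what defines "clean".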

Page 17: Sieve - Data Quality and Fusion - LWDM2012

Configuration: Data Fusion

Page 18: Sieve - Data Quality and Fusion - LWDM2012

Sieve: Quality Assessment

• Quality as “fitness for use”:
  • Subjective:
    • good for me might not be enough for you
  • Task dependent:
    • temperature: planning a weekend vs biology experiment
  • Multidimensional:
    • even correct data may be outdated or not available
• Requires task-specific quality assessment.

Page 19: Sieve - Data Quality and Fusion - LWDM2012

Data Quality - Conceptual Framework

Dimensions:

Accuracy

Consistency

Objectivity

Timeliness

Validity

Believability

Completeness

Understandability

Relevancy

Reputation

Verifiability

Amount of Data

Interpretability

Rep. Conciseness

Rep. Consistency

Availability

Response Time

Security

Page 20: Sieve - Data Quality and Fusion - LWDM2012

Configuration: Quality Assessment

• Quality Assessment Metrics are composed of:
  • ScoringFunction (generically applicable to given data types)
  • Quality Indicator as input (adaptable to use case)
  • Output: a score in the interval [0, 1]

Describes input within a quality dimension, according to a user’s definition of quality
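A scoring function of this kind can be sketched as follows, assuming a "last updated" date as the quality indicator and recency as the quality dimension. The name `time_closeness`, its parameters and the linear decay are assumptions of this sketch, not a description of Sieve's built-in functions.

```python
from datetime import date


def time_closeness(last_updated: date, today: date, time_span_days: int) -> float:
    """Map the age of a record to a score in [0, 1].

    A record updated today scores 1.0; the score decays linearly and reaches
    0.0 once the record is `time_span_days` old or older.
    """
    age_days = (today - last_updated).days
    return max(0.0, min(1.0, 1.0 - age_days / time_span_days))


today = date(2012, 3, 31)  # arbitrary reference date for the example
fresh = time_closeness(date(2012, 3, 31), today, 60)
month_old = time_closeness(date(2012, 3, 1), today, 60)
stale = time_closeness(date(2011, 1, 1), today, 60)
print(fresh, month_old, stale)
```

The function itself is generic (it works for any date indicator), while the choice of indicator and of `time_span_days` adapts it to a use case: a weather application might use a span of hours, a city gazetteer a span of years.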

Page 21: Sieve - Data Quality and Fusion - LWDM2012

Configuration: Quality Assessment

Page 22: Sieve - Data Quality and Fusion - LWDM2012

More about Sieve

• Software: Open Source, Apache V2
• Scoring Functions and Fusion Functions can be extended
  • Scala/Java interface, methods score/fuse and fromXML
• Quality scores can be stored and shared with other applications
• Website: http://sieve.wbsg.de
  • Documentation, examples, downloads, support

Page 23: Sieve - Data Quality and Fusion - LWDM2012

Use Case

[Figure: use-case diagram. Labels: Multiple data sources, Conflicting values, Quality indicators, User config, Voilà! Annotations: Heterogeneous, Complementary, Multidimensional, Task-dependent, Conflict Resolution Strategies]

Page 24: Sieve - Data Quality and Fusion - LWDM2012

Evaluating Quality of Data Integration

• Completeness
  • How many cities did we find?
  • How many of the properties did we fill with values?
• Conciseness
  • How much redundancy is there in the object identifiers?
  • How much redundancy is there in the property values?
• Consistency
  • How many conflicting values are there?
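One possible way to turn these questions into numbers is sketched below in Python. The three formulas are simplified assumptions chosen for illustration and are not necessarily the definitions used in the evaluation; the entity names and most values are made up.

```python
def completeness(data, properties):
    """Share of (entity, property) slots that hold at least one value."""
    slots = [(entity, prop) for entity in data for prop in properties]
    filled = sum(1 for entity, prop in slots if data[entity].get(prop))
    return filled / len(slots)


def conciseness(data):
    """Unique property values divided by all property values (1.0 = no redundancy)."""
    triples = [
        (entity, prop, value)
        for entity, props in data.items()
        for prop, values in props.items()
        for value in values
    ]
    return len(set(triples)) / len(triples)


def consistency(data):
    """Share of filled slots whose values all agree (1.0 = no conflicts)."""
    filled = [values for props in data.values() for values in props.values() if values]
    return sum(1 for values in filled if len(set(values)) == 1) / len(filled)


# Toy integrated dataset before fusion; entities and numbers are illustrative only.
properties = ["areaTotal", "populationTotal"]
before = {
    "city:A": {"areaTotal": [891.85, 891.82], "populationTotal": [1000, 1000]},
    "city:B": {"areaTotal": [1521.0], "populationTotal": []},
}

print(completeness(before, properties))
print(conciseness(before))
print(consistency(before))
```

In this toy data, one of four slots is empty, one of five values is a redundant duplicate, and one of three filled slots holds conflicting values. Fusion aims to move all three measures toward 1.0. The sketch assumes a non-empty dataset; empty input would divide by zero.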

Page 25: Sieve - Data Quality and Fusion - LWDM2012

Results

Generated data that is more complete, concise and consistent than in the original sources

Page 26: Sieve - Data Quality and Fusion - LWDM2012

Linked Data Application Architecture

My view on this data space can also be shared and reused.

We can “pay as we go”

Page 27: Sieve - Data Quality and Fusion - LWDM2012

• Twitter: @pablomendes

• E-mail: [email protected]

• Website: http://sieve.wbsg.de

• Google Group: http://bit.ly/ldifgroup

THANK YOU!

Supported in part by:
• Vulcan Inc. as part of its Project Halo
• EU FP7 projects:
  • LOD2 - Creating Knowledge out of Interlinked Data
  • PlanetData - A European Network of Excellence on Large-Scale Data Management