Crowdsourcing-enabled Linked Data management architecture
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu
Institute of Applied Informatics and Formal Description Methods (AIFB)
A semantically enabled architecture for crowdsourced Linked Data management
Elena Simperl (1), Maribel Acosta (1), Barry Norton (2)
(1) Institute AIFB, Karlsruhe Institute of Technology, Germany
(2) Ontotext AD, Bulgaria
2 07.06.2012
Background: What is Linked Data?
CrowdSearch 2012 - A semantically enabled architecture for crowdsourced Linked Data management
Linked Data: a set of best practices for publishing and connecting structured data on the Web.
URIs to identify entities and concepts in the world
HTTP to access and retrieve resources and descriptions of these resources
RDF as a generic graph-based data model to structure and link data
Taken together Linked Data is said to form a ‘cloud’ of shared references and vocabularies. Query language: SPARQL.
http://linkeddata.org/faq
Background: Why Linked Data?
Google, Yahoo, Bing & schema.org: enhanced search
Data.gov & public sector information: more transparency and accountability in governance
BBC & media: added value of content through interlinking
Outline
1. Motivation
2. Our Approach
3. Extensions to VoID and SPARQL
4. Crowdsourced query processing tasks
5. Advantages
6. Challenges
1. Motivation
"Retrieve the German labels of commercial airports located in Baden-Württemberg, ordered by how human-readable the airport description given in the comment is."
This query cannot be optimally answered automatically:
Incorrect/missing classification of entities (e.g. classification as airports instead of commercial airports).
Missing information in data sets (e.g. German labels).
Subjective operations (e.g. comparing pictures or natural-language comments) cannot be performed well automatically.
User Query: Give me the German names of all commercial airports in Baden-Württemberg, ordered by their most informative description.
1. Motivation
"Retrieve the German labels of commercial airports located in Baden-Württemberg, ordered by how human-readable the airport description given in the comment is."
In order to answer the query as intended:
Classification of airports as commercial airports.
Identity resolution of places (Baden-Württemberg).
Translation of the labels of the airports.
Ordering of the comments by a subjective comparison.
1. Motivation
"Retrieve the German labels of commercial airports located in Baden-Württemberg, ordered by how human-readable the airport description given in the comment is."
SPARQL Query:
SELECT ?label
WHERE {
  ?x a metar:CommercialHubAirport ;
     rdfs:label ?label ;
     rdfs:comment ?comment .
  ?x geonames:parentFeature ?z .
  ?z owl:sameAs <http://dbpedia.org/resource/Baden-Wuerttemberg> .
  FILTER (LANG(?label) = "de")
}
ORDER BY CROWD(?comment, "Better description of %x")
Crowdsourced parts of the query, as numbered on the slide:
1 Classification
2 Identity Resolution
3 Missing Information
4 Ordering
1. Motivation: Our Aim
A SPARQL query engine that processes queries by seamlessly combining automatic query processing and crowdsourcing.
[Figure: architecture overview. Components shown: SPARQL query engine (query parsing, query optimization, query execution), crowdsourced query processing (task design, UI generation), mediator, and four wrappers. Arrows labelled Query and Results.]
2. Our Approach
Parser
Decomposes the input query.
Selects the data sets that should be accessed to produce answers.
Rewrites the query into the internal structures.
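The three parser steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `TriplePattern` type, the `DATASET_PREFIXES` map and both function names are hypothetical, and the WHERE clause is assumed to be a simplified string of `s p o .` statements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriplePattern:
    """Internal structure for one triple pattern of the query."""
    subject: str
    predicate: str
    obj: str


# Hypothetical mapping from vocabulary prefix to the data set that serves it.
DATASET_PREFIXES = {"metar": "METAR", "geonames": "Geonames"}


def decompose(where_clause):
    """Split a simplified WHERE body ('s p o . s p o .') into triple patterns."""
    patterns = []
    for chunk in where_clause.split(" . "):
        parts = chunk.strip(" .").split()
        if len(parts) == 3:
            patterns.append(TriplePattern(*parts))
    return patterns


def select_datasets(patterns):
    """Pick the data sets whose prefix occurs in a predicate or object."""
    selected = set()
    for p in patterns:
        for term in (p.predicate, p.obj):
            prefix = term.split(":", 1)[0]
            if prefix in DATASET_PREFIXES:
                selected.add(DATASET_PREFIXES[prefix])
    return selected


where = ("?x a metar:CommercialHubAirport . "
         "?x rdfs:label ?label . "
         "?x geonames:parentFeature ?z .")
patterns = decompose(where)
datasets = select_datasets(patterns)
print(patterns)
print(datasets)
```

A real parser would work on the full SPARQL grammar; the string splitting here only stands in for that step.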
2. Our Approach
Optimizer
Uses database statistics and crowdsourcing statistics: estimated time to completion and other information about the performance (quality, cost) of the crowd.
Traditional database optimization techniques are implemented.
Determines which parts of the query should be solved by human input, based on the VoID and SPARQL extensions.
Generates logical and physical plans.
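One way to picture the optimizer's decision is a rule that sends a triple pattern to the crowd only when its class is declared crowdsourceable and the crowd statistics look acceptable. The sketch below is an assumption for illustration: `CrowdStats`, `assign_operators`, the budget and quality thresholds, and the numbers in the example are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class CrowdStats:
    est_minutes: float   # estimated time to completion of one task
    est_cost: float      # estimated cost of one task
    est_quality: float   # expected fraction of correct answers, 0..1


def assign_operators(patterns, crowd_classes, stats, max_cost, min_quality):
    """Label each (subject, predicate, object) pattern 'crowd' or 'auto'."""
    plan = []
    spent = 0.0
    for s, p, o in patterns:
        # Only type assertions on declared crowd classes are candidates.
        crowdsourceable = p == "a" and o in crowd_classes
        affordable = spent + stats.est_cost <= max_cost
        if crowdsourceable and affordable and stats.est_quality >= min_quality:
            plan.append(((s, p, o), "crowd"))
            spent += stats.est_cost
        else:
            plan.append(((s, p, o), "auto"))
    return plan


patterns = [
    ("?x", "a", "metar:CommercialHubAirport"),
    ("?x", "rdfs:label", "?label"),
]
stats = CrowdStats(est_minutes=10.0, est_cost=0.05, est_quality=0.9)
plan = assign_operators(patterns, {"metar:CommercialHubAirport"}, stats,
                        max_cost=1.0, min_quality=0.8)
no_budget_plan = assign_operators(patterns, {"metar:CommercialHubAirport"},
                                  stats, max_cost=0.0, min_quality=0.8)
print(plan)
```

With no budget, every pattern falls back to automatic processing, which is the behaviour one would expect from a cost-aware plan.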
2. Our Approach
Executor
Implements physical operators.
Invokes crowdsourcing component:
Creates tasks.
Generates UI.
Infers facts automatically.
Executes query against Linked Data: computational tasks.
Incorporates results from the human input.
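The executor's role of mixing computational and human results can be sketched as below. This is a simplified assumption, not the architecture's actual operators: joins are reduced to set intersection over a single variable, `execute` and its callbacks are hypothetical names, and the `ex:` identifiers are placeholders rather than real data.

```python
def execute(plan, run_auto, ask_crowd):
    """Run each plan step and intersect the bindings it returns.

    plan: list of (pattern, operator) pairs; operator is 'auto' or 'crowd'.
    run_auto / ask_crowd: callables mapping a pattern to a set of bindings.
    """
    results = None
    for pattern, operator in plan:
        source = ask_crowd if operator == "crowd" else run_auto
        bindings = set(source(pattern))
        results = bindings if results is None else results & bindings
    return results or set()


classify = ("?x", "a", "metar:CommercialHubAirport")
located = ("?x", "geonames:parentFeature", "?z")

# Placeholder answers standing in for a Linked Data endpoint and for workers.
auto_answers = {located: {"ex:airport1", "ex:airport2", "ex:airport3"}}
crowd_answers = {classify: {"ex:airport1", "ex:airport3", "ex:airport4"}}

plan = [(classify, "crowd"), (located, "auto")]
matches = execute(plan, auto_answers.get, crowd_answers.get)
print(sorted(matches))
```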
3. Extensions to VoID and SPARQL
VoID (Vocabulary of Interlinked Datasets) is the RDF-based schema for describing data sets.
Common VoID terms: void:Dataset, void:inDataset, void:Linkset, void:linkPredicate, void:target.
VoID extensions:
Automatic interlinking of datasets
CrowdClass
CrowdProperty
3. Extensions to VoID and SPARQL
Automatic interlinking of data sets
Example - Specification of Data Sets:
:METAR rdf:type void:Dataset .
:Geonames rdf:type void:Dataset .
:METAR2Geonames rdf:type void:Linkset ;
    void:linkPredicate owl:sameAs ;
    void:target :METAR ;
    void:target :Geonames .
[Figure: the METAR and Geonames data sets, connected by owl:sameAs links.]
3. Extensions to VoID and SPARQL
CrowdClass
- Specifies which entities of a data set could be crowdsourced.
- All subclasses of the crowdClass are also defined (implicitly) as crowdsourced entities.
Example:
metar:Airport void:inDataset :METAR .
metar:CommercialHubAirport void:inDataset :METAR ;
    rdfs:subClassOf metar:Airport .
metar:Airport rdf:type void:crowdClass .
metar:CommercialHubAirport rdf:type void:crowdClass .
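The implicit rule for subclasses can be made concrete with a small fixed-point computation. This is a sketch under assumptions: the function name `crowd_classes` and the dictionary representation of the subclass hierarchy are hypothetical, chosen only to show how declaring metar:Airport would already cover metar:CommercialHubAirport.

```python
def crowd_classes(declared, subclass_of):
    """Return declared crowd classes plus all direct and indirect subclasses.

    declared: set of class names typed as crowdClass.
    subclass_of: dict mapping a class to the set of its direct superclasses.
    """
    result = set(declared)
    changed = True
    while changed:
        changed = False
        for cls, supers in subclass_of.items():
            # A class becomes crowdsourced once any superclass is.
            if cls not in result and supers & result:
                result.add(cls)
                changed = True
    return result


subclass_of = {"metar:CommercialHubAirport": {"metar:Airport"}}
implied = crowd_classes({"metar:Airport"}, subclass_of)
print(sorted(implied))
```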
3. Extensions to VoID and SPARQL
RDF data can be queried using the language SPARQL.
Common SPARQL operators: join, union, optional, filter, order by.
Properties related to general ontology languages such as OWL are treated as extensions of SPARQL operators, and are modeled in our architecture as tasks.
4. Tasks
Formal, declarative description of the data and tasks using SPARQL patterns, as a basis for the automatic design of HITs (Human Intelligence Tasks).
Identity resolution
Missing information
Ontological classification
Ordering (new operator)
4.1. Ontological Classification
It is not always possible to automatically infer classification from the properties.
Example: Retrieve the names (labels) of METAR stations that correspond to commercial airports.
SELECT ?label
WHERE {
  ?station a metar:CommercialHubAirport ;
           rdfs:label ?label .
}
Input: { ?station a metar:Station ; rdfs:label ?label ; wgs84:lat ?lat ; wgs84:long ?long }
Output: { ?station a ?type . ?type rdfs:subClassOf metar:Station }
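The input and output patterns suggest how a classification HIT could be derived mechanically: each binding of the input pattern becomes a question, and each answer becomes a triple matching the output pattern. The sketch below assumes a dictionary-based task description; `build_classification_hit`, `answer_to_triple`, the question wording and the `ex:station1` placeholder are all hypothetical.

```python
def build_classification_hit(station, candidate_types):
    """Turn one binding of the input pattern into a multiple-choice task."""
    return {
        "question": (
            f"Which type best describes the station '{station['label']}' "
            f"(lat {station['lat']}, long {station['long']})?"
        ),
        "options": sorted(candidate_types),
        "subject": station["uri"],
    }


def answer_to_triple(hit, chosen_type):
    """Express a worker's answer as a triple matching the output pattern."""
    if chosen_type not in hit["options"]:
        raise ValueError(f"unknown type: {chosen_type}")
    return (hit["subject"], "rdf:type", chosen_type)


# Placeholder binding; the candidate types are assumed subclasses of metar:Station.
station = {"uri": "ex:station1", "label": "Example station", "lat": 0.0, "long": 0.0}
hit = build_classification_hit(
    station, {"metar:CommercialHubAirport", "metar:Airport"}
)
triple = answer_to_triple(hit, "metar:CommercialHubAirport")
print(hit["question"])
print(triple)
```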
4.2. Ordering
Orderings defined via less straightforward built-ins, for instance the ordering of pictorial representations of entities.
SPARQL extension: ORDER BY CROWD
Example: Retrieve all airports and their pictures, with the pictures ordered by how representative each image is of the given airport.
SELECT ?airport ?picture
WHERE {
  ?airport a metar:Airport ;
           foaf:depiction ?picture .
}
ORDER BY CROWD(?picture, "Most representative image for %airport")
Input: { ?airport foaf:depiction ?x, ?y }
Output: { { (?x ?y) a rdf:List } UNION { (?y ?x) a rdf:List } }
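Since each ordering task compares a pair and returns which element comes first, an ORDER BY CROWD operator could be approximated by a comparison sort whose comparator asks workers. The sketch below is an assumption, not the authors' operator: `crowd_sort` and `ask_crowd` are hypothetical names, the crowd is simulated by hidden scores, and the picture identifiers are placeholders. Real worker answers may be inconsistent, which a comparison sort does not handle by itself.

```python
from functools import cmp_to_key


def crowd_sort(items, ask_crowd):
    """Sort items by pairwise judgements.

    ask_crowd(x, y) returns True if x should precede y. Each pair is asked
    at most once; the answer is stored for both directions.
    """
    cache = {}

    def compare(x, y):
        if x == y:
            return 0
        if (x, y) not in cache:
            better = ask_crowd(x, y)
            cache[(x, y)] = better
            cache[(y, x)] = not better
        return -1 if cache[(x, y)] else 1

    return sorted(items, key=cmp_to_key(compare))


# Simulated crowd: a hidden score stands in for worker judgements.
scores = {"pic_a": 3, "pic_b": 1, "pic_c": 2}
ranked = crowd_sort(["pic_b", "pic_c", "pic_a"],
                    lambda x, y: scores[x] > scores[y])
print(ranked)
```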
4.3. Computational tasks expressed as SPARQL queries
Transitive relations inferred automatically, without requiring human intervention.
Implementation of restrictions in SPIN.
Identity Resolution:
CONSTRUCT { ?a owl:sameAs ?c . }
WHERE { ?a owl:sameAs ?b . ?b owl:sameAs ?c . }
Classification:
CONSTRUCT { ?a a ?b . ?b rdfs:subClassOf ?c . }
WHERE { ?a rdfs:subClassOf ?c . ?b rdfs:subClassOf ?b1 . ?b1 rdfs:subClassOf ?c . }
Ordering:
CONSTRUCT { (?a ?b) a rdf:List . }
WHERE { (?a ?x) a rdf:List . (?x ?b) a rdf:List . }
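The identity-resolution rule above derives new owl:sameAs links from existing ones, and applying it repeatedly until nothing new appears yields the transitive closure. The sketch below mirrors that rule in plain Python over pairs; the function name `same_as_closure` and the `ex:` identifiers are hypothetical, and only transitivity is covered (not symmetry), as in the CONSTRUCT query.

```python
def same_as_closure(pairs):
    """Apply (a sameAs b, b sameAs c) => (a sameAs c) until a fixed point."""
    closure = set(pairs)
    while True:
        derived = {
            (a, c)
            for (a, b) in closure
            for (b2, c) in closure
            if b == b2
        }
        if derived <= closure:
            return closure
        closure |= derived


links = {("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:c", "ex:d")}
closure = same_as_closure(links)
print(sorted(closure))
```

Every link inferred this way is one fewer question that has to be sent to workers.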
5. Advantages
A declarative description of the data makes it possible to decompose the query.
Automatic generation of UIs.
On-the-fly generation of human tasks and adjustment of the task design.
Automatic consistency check of results by reasoning against a validating ontology.
6. Challenges
Appropriate level of granularity in HIT design for specific SPARQL constructs.
Caching: naively, we can materialise HIT results into datasets.
How to deal with partial coverage and dynamic datasets.
Optimal user interfaces for graph-like content.
Pricing and workers’ assignment.
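The naive caching idea from the list above can be sketched as a store of triples that is consulted before any new task is issued. This is an illustrative assumption: `HitCache`, its `classify` method and the `ex:station1` placeholder are hypothetical, and the sketch does not address partial coverage or dynamic datasets.

```python
class HitCache:
    """Materialise answered HITs as triples so a question is asked only once."""

    def __init__(self):
        self.triples = set()
        self.crowd_calls = 0

    def classify(self, subject, ask_crowd):
        # Reuse a materialised answer if one exists.
        for s, p, o in self.triples:
            if s == subject and p == "rdf:type":
                return o
        # Otherwise ask the crowd and store the answer as a triple.
        self.crowd_calls += 1
        answer = ask_crowd(subject)
        self.triples.add((subject, "rdf:type", answer))
        return answer


cache = HitCache()
first = cache.classify("ex:station1", lambda s: "metar:Airport")
second = cache.classify("ex:station1", lambda s: "metar:Station")
print(first, second, cache.crowd_calls)
```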
QUESTIONS