+Semantic Interpretation of User Queries for Question Answering on Interlinked Data
Saeedeh Shekarpour
Supervisor: Prof. Dr. Sören Auer
EIS research group - Bonn University
7 January 2015
+Search engines can answer queries which match certain templates
+Search engines still lack the ability to answer more complex queries
+Evolution of Web
Web of Documents → Semantic Web → Web of Data
+RDF model
RDF is a standard for describing Web resources.
The RDF data model expresses statements about Web resources in the form of subject-predicate-object triples.
The statement “Jack knows Alice” is represented as:
(Jack) —knows→ (Alice)
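The triple model can be illustrated in code; a minimal sketch, with triples as plain tuples and illustrative `ex:`/`foaf:` prefixes:

```python
# The "Jack knows Alice" statement as a (subject, predicate, object) tuple.
# Prefixes ex: and foaf: are illustrative, not from the original slides.
triples = [("ex:Jack", "foaf:knows", "ex:Alice")]

def objects(triples, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects(triples, "ex:Jack", "foaf:knows"))  # ['ex:Alice']
```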
+The growth of Linked Open Data
May 2007: 12 datasets
August 2014: 570 datasets, more than 74 billion triples
+How to retrieve data from Linked Data?
Linked Data characteristics:
• Wide range of topical domains
• Variety in vocabularies
• Interlinked data
SPARQL queries:
• Knowledge about the ontology
• Proficiency in formulating formal queries
• Explicit and unambiguous semantics
Text queries (either keyword or natural language):
• Simple retrieval approach
• Implicit and ambiguous semantics
• Popular
+Comparison of search approaches
[Figure: search approaches positioned along two axes — data-semantic unaware vs. data-semantic aware, and keyword-based query vs. natural language query — comparing Information Retrieval systems, Question Answering systems, and our approach, SINA]
+Objective: transformation from textual query to formal query
Which television shows were created by Walt Disney?
SELECT * WHERE
{ ?v0 a dbo:TelevisionShow.
?v0 dbo:creator dbr:Walt_Disney. }
+Test bed datasets
One single dataset: DBpedia.
Three interlinked datasets from the life-science domain:
1. Drugbank: contains information about drugs, drug target (i.e. protein) information, interactions and enzymes.
2. Diseasome: contains information about diseases and genes associated with these diseases.
3. Sider: contains information about drugs and their side effects.
+The addressed challenges
Challenges
Query Segmentation
Resource Disambiguation
Query Expansion
Formal Query Construction
Data Fusion on Linked Data
+Query Segmentation
Definition: query segmentation is the process of identifying the right segments of data items that occur in the keyword queries.
Input query: What are the side effects of drugs used for Tuberculosis?
Sequence of keywords: (side, effect, drug, Tuberculosis)
Two segmentations:
side effect | drug | Tuberculosis
side effect drug | Tuberculosis
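Enumerating the candidate segmentations of a keyword sequence can be sketched as follows: each of the n−1 adjacent keyword pairs is either merged into one segment or split, giving 2^(n−1) candidates (the function name is illustrative, not from the paper):

```python
from itertools import product

def segmentations(keywords):
    """Enumerate all contiguous segmentations of a keyword sequence."""
    results = []
    # Each boolean decides whether to cut before the corresponding keyword.
    for cuts in product([False, True], repeat=len(keywords) - 1):
        segs, current = [], [keywords[0]]
        for kw, cut in zip(keywords[1:], cuts):
            if cut:
                segs.append(" ".join(current))
                current = [kw]
            else:
                current.append(kw)
        segs.append(" ".join(current))
        results.append(segs)
    return results

# The example from the slide: 2^3 = 8 candidate segmentations.
print(segmentations(["side", "effect", "drug", "Tuberculosis"]))
```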
+Resource Disambiguation
Definition: resource disambiguation is the process of recognizing the suitable resources in the underlying knowledge base.
Input query: What are the side effects of drugs used for Tuberculosis?
Ambiguous resources:
• diseasome:Tuberculosis
• sider:Tuberculosis
Input query: Who produced films starring Natalie Portman?
Ambiguous resources:
• dbpedia/ontology/film
• dbpedia/property/film
+Concurrent approach
Query Segmentation
Resource Disambiguation
+Modeling using hidden Markov model
Query Segmentation & Resource Disambiguation
[Figure: HMM with hidden states 1–9 plus an Unknown Entity state, a Start state, and the observations Keyword 1 through Keyword 4]
+Bootstrapping the model parameters
1. The emission probability is defined based on the similarity of the label of each state with a segment; this similarity is computed using string similarity and Jaccard similarity.
2. Semantic relatedness is the basis for the transition probability and the initial probability. Intuitively, it is based on two values: distance and connectivity degree. We transform these two values into hub and authority values using a weighted HITS algorithm.
3. The HITS algorithm is a link-analysis algorithm that was originally developed for ranking Web pages. It assigns a hub and an authority value to each Web page.
4. The initial probability and the transition probability are defined as a uniform distribution over the hub and authority values.
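The hub/authority computation in step 2 can be sketched as a generic weighted HITS iteration. This is not the paper's exact formulation — how distance and connectivity degree are encoded as edge weights is not shown here, and the graph below is illustrative:

```python
def weighted_hits(graph, iterations=50):
    """Weighted HITS on a directed graph given as {node: {neighbour: weight}}."""
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: weighted sum of hub scores of nodes pointing at n.
        auth = {n: sum(w * hub[u] for u, nbrs in graph.items()
                       for v, w in nbrs.items() if v == n) for n in nodes}
        norm = sum(auth.values()) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub: weighted sum of authority scores of nodes n points at.
        hub = {n: sum(w * auth[v] for v, w in graph.get(n, {}).items())
               for n in nodes}
        norm = sum(hub.values()) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth
```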
+Evaluation of bootstrapping
The accuracy of the bootstrapped transition probability using different distribution functions, i.e., Normal, Zipfian and uniform distributions.
+Output of the model after running the Viterbi algorithm
Sequence of keywords: (television show creat Walt Disney)
Ranked paths (probability, resources):
0.0023    dbo:TelevisionShow  dbo:creator  dbr:Walt_Disney
0.0014    dbo:TelevisionShow  dbo:creator  dbr:Category:Walt_Disney
0.000589  dbr:TelevisionShow  dbo:creator  dbr:Walt_Disney
0.000353  dbr:TelevisionShow  dbo:creator  dbr:Category:Walt_Disney
0.0000376 dbp:television  dbp:show  dbo:creator  dbr:Category:Walt_Disney
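The decoding step can be sketched with the classic Viterbi algorithm. Note that the slide shows several ranked paths, which would require a k-best variant; the sketch below finds only the single most probable path, and all probabilities in the example are toy values, not the bootstrapped parameters:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, state path) of the most probable hidden path."""
    # Initialise with the first observation.
    V = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), [s])
          for s in states}]
    for obs in observations[1:]:
        layer = {}
        for s in states:
            # Best predecessor for state s at this step.
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev].get(s, 0.0) *
                 emit_p[s].get(obs, 0.0), V[-1][prev][1] + [s])
                for prev in states)
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())
```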
+Query Expansion
Definition: query expansion is a way of reformulating the input query in order to overcome the vocabulary mismatch problem.
Input query: Wife of Barack Obama
Reformulated query: Spouse of Barack Obama
+
Analysis of linguistic features vs. semantic features
A method for automatic query expansion
+Linguistic features
WordNet is a popular data source for expansion.
Linguistic features extracted from WordNet are:
1. Synonyms: words having a similar meaning to the input keyword.
2. Hyponyms: words representing a specialization of the input keyword.
3. Hypernyms: words representing a generalization of the input keyword.
+Semantic features from Linked Data
1. SameAs: deriving resources using owl:sameAs.
2. SeeAlso: deriving resources using rdfs:seeAlso.
3. Equivalence class/property: deriving classes or properties using owl:equivalentClass and owl:equivalentProperty.
4. Super class/property: deriving all super classes/properties of the input resource by following the rdfs:subClassOf or rdfs:subPropertyOf property.
5. Sub class/property: deriving resources by following rdfs:subClassOf or rdfs:subPropertyOf property paths ending with the input resource.
6. Broader concepts: deriving concepts using the SKOS vocabulary properties skos:broader and skos:broadMatch.
7. Narrower concepts: deriving concepts using skos:narrower and skos:narrowMatch.
8. Related concepts: deriving concepts using skos:closeMatch, skos:mappingRelation and skos:exactMatch.
+Exemplary expansion graph of the word movie
[Figure: expansion graph connecting movie to film, motion picture show, home movie, production, video and telefilm]
+Objective of experiment
How effectively do linguistic and semantic features perform?
How well does a linear weighted combination of features perform?
+Benchmark creation
We created a benchmark extracted from QALD1 and QALD2.
The benchmark contains all keywords having the vocabulary mismatch problem and their corresponding matches.
+Accuracy results of the prediction function based on linguistic and semantic features
Features   | Weighting Mechanism              | Precision | Recall | F-score
Linguistic | SVM                              | 0.730     | 0.650  | 0.620
Semantic   | SVM                              | 0.680     | 0.630  | 0.600
Linguistic | Decision Tree / Information Gain | 0.588     | 0.579  | 0.568
Semantic   | Decision Tree / Information Gain | 0.755     | 0.684  | 0.661
+Statistics over the number of derived words and matches
Feature              | #derived words | #matches
synonym              | 503            | 23
hyponym              | 2703           | 10
hypernym             | 657            | 14
sameAs               | 2332           | 12
seeAlso              | 49             | 2
equivalence          | 2              | 0
super class/property | 267            | 4
sub class/property   | 2166           | 4
+Automatic query expansion
Pipeline: input query → data extraction and preparation (from an external data source) → heuristic method → reformulated query
+Expansion set for each segment
The expansion set contains:
• the original segment
• the lemmatized segment
• words derived from WordNet: synonyms, hyponyms, hypernyms
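Assembling such an expansion set can be sketched as follows. WordNet access is mocked here with a small hand-made dictionary (hypothetical entries); in practice a WordNet library would supply the synonyms, hyponyms and hypernyms:

```python
# Hypothetical stand-in for WordNet lookups.
WORDNET = {
    "wife": {"synonyms": ["spouse"], "hyponyms": ["first lady"],
             "hypernyms": ["woman"]},
}

def expansion_set(segment, lemma=None):
    """Original segment + lemma + WordNet-derived words."""
    lemma = lemma or segment
    words = {segment, lemma}
    entry = WORDNET.get(lemma, {})
    for feature in ("synonyms", "hyponyms", "hypernyms"):
        words.update(entry.get(feature, []))
    return words

print(sorted(expansion_set("wife")))
```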
+Reformulating the query using a hidden Markov model
Input query: wife of Barack Obama
[Figure: HMM over the segments "Barack", "Obama", "Barack Obama", "wife" and expansion words such as "spouse", "first lady", "woman", starting from a Start state]
+Triple-based co-occurrence
In a given triple t = (s, p, o), two words w1 and w2 are co-occurring if they appear in the labels (rdfs:label) of at least two resources.
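One way to read this definition in code — the interpretation that w1 and w2 must both be found, spread over at least two distinct resources of the triple, is an assumption, as is the sample label data:

```python
def cooccur(triple, labels, w1, w2):
    """Do w1 and w2 co-occur in the labels of the triple's resources?"""
    has_w1 = {r for r in triple if w1 in labels.get(r, "").lower().split()}
    has_w2 = {r for r in triple if w2 in labels.get(r, "").lower().split()}
    # Both words present, covering at least two distinct resources.
    return bool(has_w1) and bool(has_w2) and len(has_w1 | has_w2) >= 2

# Illustrative labels for the resources of one triple.
labels = {"dbo:spouse": "spouse", "dbr:Michelle_Obama": "Michelle Obama"}
t = ("dbr:Barack_Obama", "dbo:spouse", "dbr:Michelle_Obama")
print(cooccur(t, labels, "spouse", "michelle"))  # True
```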
+Goals of evaluation
How effective is our method at correctly reformulating queries which have the vocabulary mismatch problem?
How robust is the method for queries which do not have the vocabulary mismatch problem?
Sample of our benchmark:
Query                  | Mismatch word | Match word | #derived words
Movies with Tom Cruise | movie         | film       | 77
Altitude of Everest    | altitude      | elevation  | 16
Soccer clubs in Spain  | -             | -          | 19
Employees of Google    | -             | -          | 10
+Rank (R) and Cumulative rank (CR) for the test queries
+Formal Query Construction
Definition: once the resources are detected, a connected subgraph of the knowledge base graph, called the query graph, has to be determined which fully covers the set of mapped resources.
Disambiguated resources: sider:sideEffect, diseasome:possibleDrug, diseasome:1154
SPARQL query:
SELECT ?v3 WHERE {
  diseasome:1154 diseasome:possibleDrug ?v1 .
  ?v1 owl:sameAs ?v2 .
  ?v2 sider:sideEffect ?v3 . }
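For chain-shaped queries like this one, the construction can be sketched by connecting the ordered resources with fresh variables. This is a simplification, not the paper's actual algorithm: the resource ordering is assumed to be given, and the helper name is illustrative:

```python
def chain_to_sparql(start, properties, projection="*"):
    """Build a SPARQL query string from a chain of resources,
    inserting fresh variables ?v1, ?v2, ... between them."""
    lines, subject = [], start
    for i, prop in enumerate(properties, 1):
        lines.append(f"{subject} {prop} ?v{i} .")
        subject = f"?v{i}"
    return f"SELECT {projection} WHERE {{\n  " + "\n  ".join(lines) + "\n}"

# Reproduces the shape of the query above.
print(chain_to_sparql("diseasome:1154",
                      ["diseasome:possibleDrug", "owl:sameAs",
                       "sider:sideEffect"]))
```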
+Data Fusion on Linked Data
The answer to a question may be spread among different datasets employing heterogeneous schemas.
Constructing a federated query requires exploiting links between the different datasets on the schema and instance levels.
+Two different approaches
Template-based query construction
Forward chaining based query construction
Federated Query Construction
+Forward chaining based query construction
Query: What are the side effects of drugs used for Tuberculosis?
1. Set of resources:
• diseasome:1154 (type instance)
• diseasome:possibleDrug (type property)
• sider:sideEffect (type property)
2. Incomplete query graphs, one template per property:
Template 1: diseasome:1154 —possibleDrug→ ?v0
Template 2: ?v1 —sideEffect→ ?v2
3. Query graph: the templates are joined into one connected graph.
+Evaluation
Goals of the experiment:
1. Performance of the disambiguation method, measured using Mean Reciprocal Rank (MRR).
2. Performance of the forward chaining query construction method, measured using precision and recall.
Benchmarks:
1. 25 queries on the 3 interlinked life-science datasets.
2. QALD1 and QALD3 benchmarks over DBpedia.
3. QALD2 was used for bootstrapping.
+Runtime
Parallelization over three components:
1. Segment validation
2. Resource retrieval
3. Query construction
+SINA architecture
[Figure: client-server architecture. A client sends keywords and receives the query result. Server-side components: Query Preprocessing (Stanford CoreNLP), Query Expansion, Segment Validation, Resource Retrieval, Disambiguation, Query Construction and Representation, exchanging intermediate artifacts (reformulated query, valid segments, mapped resources, tuples of resources, SPARQL queries) and accessing the underlying interlinked knowledge bases via the OWL API and an HTTP client.]
+Demo
+Conclusion
We researched and addressed a number of challenges.
The results of the evaluation confirm the feasibility and high accuracy of the approach, for instance:
1. Query segmentation and resource disambiguation achieved an MRR between 86% and 96%.
2. Query construction achieved a precision of 32% on the DBpedia QALD3 benchmark and 95% on the life-science datasets.
We learnt that:
1. In Linked Data, the structure as well as the topology of the data can be leveraged for inference and heuristics.
2. Using structure and topology, without any deep text analysis, Linked Data can enhance the power of question answering.
+Future work
This was the first step in a long agenda.
We plan to:
1. use supervised learning to enhance the parameters of the model.
2. extend our benchmark to enable further evaluations.
3. employ more interlinked datasets to investigate the challenges of scalability.
4. extend different aspects of each approach.
We are going to target new challenges, e.g. query cleaning.
We will continue this work with students joining us through the Marie Curie ITN network.
+Questions?