Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach
André Freitas and Edward Curry
Insight Centre for Data Analytics
International Conference on Intelligent User Interfaces
Haifa, 2014
Talking to your (Big) Data
Motivation
Shift in the Database Landscape
Heterogeneous, complex and large-scale databases.
Very-large and dynamic “schemas”.
circa 2000: 10s-100s attributes
circa 2014: 1,000s-1,000,000s attributes
Databases for a Complex World
How do you query data in this scenario?
Vocabulary Problem for Databases
Query: Who is the daughter of Bill Clinton married to?
Semantic approximation
Semantic Gap
Possible representations = Commonsense Knowledge
Semantics for a Complex World
Formal World Real World
Distributional Semantics
Query Approach
Does it work?
Addressing the Vocabulary Problem for Databases (with Distributional Semantics)
Gaelic: direction
Solution (Video)
More Complex Queries (Video)
Treo Answers Jeopardy Queries (Video)
http://bit.ly/1hWcch9
Evaluation
102 natural language queries (Test Collection: QALD 2011).
Avg. query execution time: 1.52 s (simple queries) – 8.53 s (all queries).
Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and 9,434,677 instances
Comparative Evaluation
Query Approach
Distributional Semantics
“Words occurring in similar (linguistic) contexts are semantically related.”
If we can equate meaning with context, we can simply record the contexts in which a word occurs in a collection of texts (a corpus).
This can then be used as a surrogate of its semantic representation.
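As a minimal sketch of "recording the contexts in which a word occurs", the snippet below counts, for every word in a tiny corpus, the words that appear within a fixed window around it. The corpus, the window size and the function name `context_vectors` are assumptions made for illustration; the talk itself does not prescribe this particular context definition.

```python
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    vectors[word][tokens[j]] += 1
    return vectors


# Toy corpus, invented for illustration only.
corpus = [
    "her daughter is their only child",
    "his spouse and their child live abroad",
]
vectors = context_vectors(corpus)
print(vectors["child"].most_common(3))
```

Each `Counter` is a sparse vector whose dimensions are context words; in a real setting the raw counts would typically be reweighted before use.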
Distributional Semantic Model
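(Figure: a vector space with context dimensions c1, c2, ..., cn, in which the words child, husband and spouse are plotted as vectors. Each coordinate, e.g. 0.7 or 0.5, is a function of the number of times the word occurs in that context.)
Commonsense is here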
Semantic Relatedness
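(Figure: the vector space with dimensions c1, c2, ..., cn; θ is the angle between word vectors such as child, husband and spouse.)
Works as a semantic ranking function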
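One common way to turn the angle θ into a relatedness score is the cosine between two word vectors. The sketch below assumes sparse vectors stored as dictionaries and uses made-up weights; the names `cosine` and `rank_by_relatedness` are hypothetical, and the measure actually used in the system may differ.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is related to nothing
    return dot / (norm_u * norm_v)


def rank_by_relatedness(query_vec, candidates):
    """Return candidate names sorted from most to least related to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical weights over context dimensions c1, c2, c3.
daughter = {"c1": 0.7, "c2": 0.5}
candidates = {
    "child": {"c1": 0.6, "c2": 0.6},
    "almaMater": {"c2": 0.1, "c3": 0.9},
}
ranking = rank_by_relatedness(daughter, candidates)
print(ranking)
```

Sorting candidates by this score is what lets relatedness act as a ranking function rather than a yes/no match.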
Approach Overview
(Figure: architecture diagram with the following components.)
Query → Query Analysis → Query Features → Query Planner → Query Plan
Core semantic approximation & composition operations
Ƭ-Space
Database
Distributional semantics
Commonsense knowledge
Large-scale unstructured data
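Approach Overview
(Figure: the architecture diagram, instantiated.)
Query → Query Analysis → Query Features → Query Planner → Query Plan
Core semantic approximation & composition operations
Ƭ-Space
Database: RDF Data
Distributional semantics: Explicit Semantic Analysis (ESA)
Commonsense knowledge
Large-scale unstructured data: Wikipedia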
Ƭ-Space
(Figure: Ƭ-Space diagram with labels e, p and r.)
Core Operations
(Figure: diagram labeled Query.)
Search and Composition Operations
Instance search
- Proper nouns
- String similarity + node cardinality
Class (unary predicate) search
- Nouns, adjectives and adverbs
- String similarity + distributional semantic relatedness
Property (binary predicate) search
- Nouns, adjectives, verbs and adverbs
- Distributional semantic relatedness
Navigation
Extensional expansion
- Expands the instances associated with a class.
Operator application
- Aggregations, conditionals, ordering, position
Disjunction & Conjunction
Disambiguation dialog (instance, predicate)
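The instance search above combines string similarity with node cardinality. The sketch below shows one possible way to blend the two signals: `difflib.SequenceMatcher` for the string part and a log-scaled count for the cardinality part, mixed with a weight `alpha`. The weighting scheme, the value of `alpha`, the cardinalities and the function names are all assumptions for illustration, not the scoring used in the talk.

```python
import math
from difflib import SequenceMatcher


def instance_score(term, label, cardinality, max_cardinality, alpha=0.8):
    """Blend string similarity with a log-scaled node cardinality (assumed weighting)."""
    string_sim = SequenceMatcher(None, term.lower(), label.lower()).ratio()
    if max_cardinality > 0:
        popularity = math.log1p(cardinality) / math.log1p(max_cardinality)
    else:
        popularity = 0.0
    return alpha * string_sim + (1 - alpha) * popularity


def instance_search(term, nodes):
    """Rank node labels for a query term; `nodes` maps label -> cardinality."""
    max_card = max(nodes.values())
    return sorted(
        nodes,
        key=lambda label: instance_score(term, label, nodes[label], max_card),
        reverse=True,
    )


# Hypothetical labels and cardinalities (number of triples touching each node).
nodes = {"Bill Clinton": 900, "Clinton, Iowa": 120, "Bill Gates": 800}
ranked = instance_search("Bill Clinton", nodes)
print(ranked[0])
```

Using cardinality as a tie-breaker favors well-connected nodes when several labels look alike.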
Core Principles
Minimize the impact of ambiguity, vagueness and synonymy.
Address the simplest matchings first (heuristics).
Semantic Relatedness as a primitive operation.
Distributional semantics as commonsense knowledge.
Question Analysis
Transform natural language queries into triple patterns
“Who is the daughter of Bill Clinton married to?”
Query Features:
Bill Clinton (INSTANCE), daughter (PREDICATE), married to (PREDICATE)
PODS
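To make the mapping from a question to query features concrete, here is a deliberately crude sketch. It treats capitalized tokens (other than the first word) as parts of an INSTANCE, drops a small stopword list, and groups the remaining adjacent tokens into PREDICATE features. This heuristic, the stopword list and the name `query_features` are assumptions; the actual analysis relies on linguistic processing that this toy does not reproduce.

```python
import re

# Small, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"who", "is", "the", "of", "a", "an", "what", "which"}


def query_features(question):
    """Return [(text, kind)] with the instance first, then predicates in query order."""
    tokens = re.findall(r"[A-Za-z]+", question)
    instances, predicates = [], []
    current, current_kind = [], None

    def flush():
        nonlocal current, current_kind
        if current:
            target = instances if current_kind == "INSTANCE" else predicates
            target.append((" ".join(current), current_kind))
        current, current_kind = [], None

    for position, token in enumerate(tokens):
        if token.lower() in STOPWORDS:
            flush()  # stopwords separate features
            continue
        kind = "INSTANCE" if token[0].isupper() and position > 0 else "PREDICATE"
        if kind != current_kind:
            flush()
            current_kind = kind
        current.append(token)
    flush()
    return instances + predicates


features = query_features("Who is the daughter of Bill Clinton married to?")
print(features)
```

The instance is placed first because it is the starting point from which the predicates are resolved.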
Query Plan
Map query features into a query plan.
A query plan contains a sequence of core operations.
Query Features: (INSTANCE) (PREDICATE) (PREDICATE)
Query Plan:
(1) INSTANCE SEARCH (Bill Clinton)
(2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)
(3) e1 <- NAVIGATE (Bill Clinton, p1)
(4) p2 <- SEARCH PREDICATE (e1, married to)
(5) e2 <- NAVIGATE (e1, p2)
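The five-step plan can be sketched as a loop that alternates predicate search and navigation, starting from the instance. Everything below is a toy: the graph is a hand-built dictionary, `SEM_REL` is a lookup table standing in for a distributional relatedness function, and the score for "married to" is hypothetical. Function names are illustrative, not the system's API.

```python
# Toy linked-data graph: subject -> predicate -> list of objects.
GRAPH = {
    ":Bill_Clinton": {
        ":child": [":Chelsea_Clinton"],
        ":religion": [":Baptists"],
        ":almaMater": [":Yale_Law_School"],
    },
    ":Chelsea_Clinton": {":spouse": [":Mark_Mezvinsky"]},
}

# Stand-in for a distributional relatedness function.
SEM_REL = {
    ("daughter", ":child"): 0.054,
    ("daughter", ":almaMater"): 0.001,
    ("married to", ":spouse"): 0.05,  # hypothetical value
}


def instance_search(term):
    """Naive label match; a real system would rank candidates."""
    candidate = ":" + term.replace(" ", "_")
    return candidate if candidate in GRAPH else None


def search_predicate(entity, term):
    """Pick the predicate of `entity` most related to `term`, if any is related at all."""
    predicates = GRAPH.get(entity, {})
    best = max(predicates, key=lambda p: SEM_REL.get((term, p), 0.0), default=None)
    if best is None or SEM_REL.get((term, best), 0.0) <= 0.0:
        return None
    return best


def navigate(entity, predicate):
    """Follow `predicate` from `entity` to its objects."""
    return GRAPH[entity][predicate]


def run_plan(instance_term, predicate_terms):
    """Steps (1)-(5): instance search, then search-predicate / navigate per term."""
    pivots = [instance_search(instance_term)]
    for term in predicate_terms:
        next_pivots = []
        for pivot in pivots:
            predicate = search_predicate(pivot, term)
            if predicate is not None:
                next_pivots.extend(navigate(pivot, predicate))
        pivots = next_pivots
    return pivots


answer = run_plan("Bill Clinton", ["daughter", "married to"])
print(answer)
```

Each navigation result becomes the pivot for the next predicate search, which is why the plan is a sequence rather than a single lookup.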
Instance Search
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton
Predicate Search
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton (PIVOT ENTITY)
Associated triples:
:Bill_Clinton :child :Chelsea_Clinton
:Bill_Clinton :religion :Baptists
:Bill_Clinton :almaMater :Yale_Law_School
...
Which properties are semantically related to ‘daughter’?
sem_rel(daughter, child) = 0.054
sem_rel(daughter, religion) = 0.004
sem_rel(daughter, alma mater) = 0.001
Navigate
Query: Bill Clinton daughter married to
Linked Data:
:Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)
Predicate Search
Query: Bill Clinton daughter married to
Linked Data:
:Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)
:Chelsea_Clinton :spouse :Mark_Mezvinsky
Results
Conclusions
The compositional-distributional model supports a schema-agnostic natural language query mechanism over a large-schema (open domain) database.
Comprehensive and accurate semantic matching
- Avg. recall=0.81, MAP=0.62, MRR=0.49
Medium-high expressivity
- 80% of queries answered
Interactive query execution time
- Avg. 1.52 s (simple queries) – 8.53 s (all queries) / query
Better recall and query coverage compared to baselines with equivalent precision
Low adaptation effort for new datasets
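For readers unfamiliar with the reported measures, the sketch below gives the textbook per-query definitions of recall, average precision (whose mean over queries is MAP) and reciprocal rank (whose mean is MRR). The slides do not spell out how the evaluation computed them, so treat this as the standard formulation under that assumption; the example ranking is invented.

```python
def recall(ranked, relevant):
    """Fraction of the relevant items that appear anywhere in the ranking."""
    return len(set(ranked) & relevant) / len(relevant) if relevant else 0.0


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# Invented ranking for one query; averaging over all queries gives MAP and MRR.
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
print(recall(ranked, relevant), average_precision(ranked, relevant), reciprocal_rank(ranked, relevant))
```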