COnnected KEywordsrcis2015.hua.gr/pdf/48.pdf · Introduction Data Model and Language Cooperative...
Transcript of COnnected KEywordsrcis2015.hua.gr/pdf/48.pdf · Introduction Data Model and Language Cooperative...
Introduction Data Model and Language Cooperative Querying Approach Experimentations
COnnected KEywords
Gregory Smits, Olivier Pivert and Virginie Thion
IRISA, Campus de Beaulieu, 35042 Rennes, France{gregory.smits, olivier.pivert, virginie.thion}@irisa.fr
RCIS, May 13-15 2015, Athens, Greece
G. Smits et al. COnnected KEywords 1/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Giving Access to Structured Data
Domain expert
DATABASE
0..N
0..N1..1
1..1
1..N
RELATIONAL MODEL
SQL> select title, genre from movies where year > 2014;
SQL QUERY LANGUAGE
G. Smits et al. COnnected KEywords 2/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Giving Access to Structured Data
Domain expert
DATABASE
0..N
0..N1..1
1..1
1..N
RELATIONAL MODEL
SQL> select title, genre from movies where year > 2014;
SQL QUERY LANGUAGE GRAPH MODEL
KEYWORD QUERYkw1 kw2 kw3 … answer
G. Smits et al. COnnected KEywords 2/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Motivations
moviesR
mov_castR
mov_prodR
mov_compR
movi_cinRmov_edR
mov_dirR
characters
actors
composers
producers
directors
editors
cinematographers
nameA
nameA
nameAdeath_dateA
birth_dateA
titleA
production_yearA
genreA
languageA
θ
θ
θθ
θθθ
θ
θ
θ
θ
θ
θ
μ
μ
μ
μ
μ
μ
nameA
μ
nameA
μ
nameA
μ
nameA
μ
R
R
R
R
R
R
RbudgetA
μ
μ
μμ
titlesD
!
yearD
!
languagesD
!
lengthA
μ
genresD
!
budgetsD
lengthsD
!
!
namesD
!
namesD
!
namesD
!
namesD
!
namesD
!
namesD
!
namesD
!
datesD
!
datesD
!
GRAPH MODEL
GRAPH MODEL: lexical mapping
Domain expert
title of movie with Clint Eastwood answer
GRAPH MODEL: query interpretation
DATABASE
movie_idtitleproduction_yeargenrebudget
movies (455,178)
character_idname
characters (2,338,885)
actor_idnamebirth_datedeath_date
actors (1,740,146)
composer_idname
composers (68,552)
producer_idname
producers (205,192)
director_idname
directors (129,906)
editor_idname
editors (79,561)
cinematographer_idname
cinematographers (92,873)movie_idcinematographer_id
mov_cin (306,299)
movie_ideditor_id
mov_ed (245,614)
movie_iddirector_id
mov_dir (436,391)
movie_idproducer_id
mov_prod (535,246)
movie_idcomposer_id
mov_comp (219,466)
movie_idcharacter_idactor_id
mov_cast (3,057,045)
RELATIONAL MODEL
attribute name table namename of an actor
title of the movies in which acts Clint E.
SQL> select title from movies M inner join producer P on M.idP = P.id and P. name =‘Clint E.’;
More expressive query interface:I keywords and connectives,
I constrained but rich query language,(SPJ queries, aggregates, some nested queries, ...)‘movies Clint E. Gene H.’‘genre of movies entitled Scarface’‘movies genre is western released in 2014’
‘actors appearing in Dogma’
I tool to help domain experts:I query their corporate databases,I manage the COKE language.
G. Smits et al. COnnected KEywords 3/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Outline
1 Introduction1.1 Problem1.2 Motivating example
2 Data Model and Language2.1 Schema Graph Model2.2 Query Graph Model and Searchable Vocabulary2.3 a Constrained Keyword-based Query Language
3 Cooperative Querying Approach3.1 KW Query Definition3.2 KW Query Interpretation3.3 Translation into SQL
4 Experimentations
G. Smits et al. COnnected KEywords 4/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Translation of the Relational Schema into a Graph Schema
I capture the data semantics,
I have a more manageable model,
I make it possible to apply graph-based algorithms.
movie_idtitleproduction_yeargenrebudget
movies (455,178)
character_idname
characters (2,338,885)
actor_idnamebirth_datedeath_date
actors (1,740,146)
composer_idname
composers (68,552)
producer_idname
producers (205,192)
director_idname
directors (129,906)
editor_idname
editors (79,561)
cinematographer_idname
cinematographers (92,873)movie_idcinematographer_id
mov_cin (306,299)
movie_ideditor_id
mov_ed (245,614)
movie_iddirector_id
mov_dir (436,391)
movie_idproducer_id
mov_prod (535,246)
movie_idcomposer_id
mov_comp (219,466)
movie_idcharacter_idactor_id
mov_cast (3,057,045)
moviesR
mov_castR
mov_prodR
mov_compR
movi_cinRmov_edR
mov_dirR
characters
actors
composers
producers
directors
editors
cinematographers
nameA
nameA
nameAdeath_dateA
birth_dateA
titleA
production_yearA
genreA
languageA
θ
θθ
θ
θθθ
θ
θ
θ
θ
θ
θ
μ
μ
μ
μ
μ
μ
nameA
μ
nameA
μ
nameA
μ
nameA
μ
R
R
R
R
R
R
RbudgetA
μ
μ
μμ
titlesD
!
yearD
!
languagesD
!
lengthA
μ
genresD
!
budgetsD
lengthsD
!
!
namesD
!
namesD
!
namesD
!
namesD
!
namesD
!
namesD
!
namesD
!
datesD
!
datesD
!
I tables, attributes and definition domains represented by nodes,
I primary to foreign key constraints form links between table nodes,
I domain nodes are linked to their attribute nodes themselves linked to theirtable nodes.
G. Smits et al. COnnected KEywords 5/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Query Graph Model: mapping between the model and SQL
Directed graph representing the elements that may appear in auser query:
I table, attribute, domain and function nodes,
I projection, predicate, selection, join and transformation edges,
I subquery edges.
Definition
A query is covered by the COKE system if it forms a connectedsubgraph of the query graph.
G. Smits et al. COnnected KEywords 6/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Query Graph Model: mapping between the model and SQL
‘name of producers, title and productionyear of movies with Michael Keaton in therole of Batman’
moviesR
mov_prodR
producers
titleA
production_yearA
nameAR
π
π ⨝⨝ π
mov_castR
⨝
charactersR
⨝
actorsR
⨝
nameA
σ
nameA
σ
batman=
Michael Keaton=
D
D
SQL> SELECT title, production_year, P.name
FROM producer P join movies M on M.pid = P.id
join mov_cast MC on MC.mid = M.id
join characters C on C.id = MC.cid
join actors A on A.id = MC.aid
WHERE a.name = ’Keaton’
and C.name = ’batman’;
‘number of movies produced by theproducer of the movie entitled Dogma’
moviesR
mov_prodR
producerstitleA
nameARπ
⨝⨝ π
ηin
producersR
nameA
πmoviesR
mov_prodR
titleA
π ⨝ ⨝
Dogma
=
countF
f
SQL> SELECT count(*)
FROM movies M join mov_prod MP on M.id = MP.mid
join producers P on P.id = MP.pid
WHERE P.name = ( SELECT P.name
FROM producers P2 join mov_prod MP2 on
P2.id = MP2.id
join movies M2 on M2.id = MP2.mid
WHERE M2.title = ’Dogma’);
G. Smits et al. COnnected KEywords 7/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Searchable Vocabulary: mapping keywords to elements ofthe query graph
Definition
A keyword is a word or a group of words attached to a searchableelement of the query graph:
I table, attribute, domain and function nodes,
I projection, predicate, join, selection and transformation edges,
I subqueries edges,
I and some paths (predicate + selection, joins, joins + selection).
G. Smits et al. COnnected KEywords 8/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Searchable Vocabulary: mapping keywords to elements ofthe query graph
mov_prodR
mov_cast
R
moviesR
producers
titleA
production_yearA
nameAR
charactersR
actorsR
nameA
nameA
…D
R:a producerR:a movie
A:title
A:production year
θ:is
R:an actor
R a character
:byhas produced
:appe
aring
in:in
whic
h play
s
:playing in
:in which acts
A:name
A:name
!:of
:in the role of:played by
…D
θ:is!:of
σ:whose
σθ:entitled
…D
θ:is
σθ:released in
…D
θ:is
…D
θ:isσ:whose
!:of
σ:whose
!:of
σθ:named
σθ:named
σθ:named
⨝*
⨝*
⨝*
⨝*⨝*
⨝*
⨝ *
⨝ *
σ:whose
!:of A:name
σ:whose
Extract of the labelled query graph
G. Smits et al. COnnected KEywords 9/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
COKE: a Constrained Keyword-based Query Language
A COKE query is a keyword-based expression of aselection-projection-join SQL query:
I with a restricted vocabulary,I and a constrained syntax (projection part + selection part).
Backus Naur form grammar:COKE query ::= {Projection Statement}* {Selection Statement}*Projection Statement ::= F f [ef ] Ak
l eπ Rk | F f [ef ] Rk | [Attributes [eπ ]] Rk | AttributesAttributes := Ak
l {, Aki }∗ [and Ak
j ]
Selection Statement ::= {e./|p./∗} Rk eσ Ak
l eθ Dkl | A
kl eπ Rk eθ Dk
l | {e./|p./
∗} Rk pσθ Dk
l | Akl eθ
Dkl |R
k pσθ Dkl | p
σθ Dkl | p
./∗σθ Dkl | R
k Dkl |D
kl | e
ηop Dkl .
G. Smits et al. COnnected KEywords 10/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
The COKE Syntax Illustrated
COKE query = a projection part + a selection part
Projection statement:I basic form: “movies”, “title year movies”, “number movies”, etc.,
I more expressive syntax: “title, length and genre of movies”, “number ofmovies”, etc.
Selection statement (joins, restrictions, some nested queries):I basic form: “Clint E. Gene H.”, “actor Clint E.”, “producer Clint E. actor Gene
H.”, etc.,
I more expressive syntax:. “by a producer whose name is Clint E.”, “produced byClint E. with Gene H.”, “with an actor of Dogma”, etc.
G. Smits et al. COnnected KEywords 11/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
COKE: a Cooperative Querying Approach
A constrained keyword-based query language with:
1. a restricted vocabulary,
2. a limited syntax,
3. and semantic expectations (connected subgraph of the query graph).
An interactive query construction and interpretation process to:
I help users express their information needs with coveredqueries,
I interactively determine the exact meaning of the queries.
G. Smits et al. COnnected KEywords 12/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Query Definition at the Lexical Level
Autocompletion and locked queryfield to ensure lexical coverage:
I index of the whole vocabulary.
produced by Clint E answer
Clint Eastwood Clint Eastlake Clint Eastman Clint Edwards Clint Ellison Clint Elmore Clint Elvy Clint Eruera
Definition
The lexical analyzer (made easy by autocompletion) produces mappingsbetween keywords and elements of the query graph.
G. Smits et al. COnnected KEywords 13/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Query Interpretation at the Syntactic Level
Parsing to identify valid projection and selection statements
Definition
A syntactic analysis is a mapping from adjacent keywords toprojection/selection graph patterns.
COKE query ::= {Projection Statement}* {Selection Statement}*Projection Statement ::= F f [ef ] Ak
l eπ Rk | F f [ef ] Rk | [Attributes [eπ ]] Rk | AttributesAttributes := Ak
l {, Aki }∗ [and Ak
j ]
Selection Statement ::= {e./|p./∗} Rk eσ Ak
l eθ Dkl | A
kl eπ Rk eθ Dk
l | {e./|p./
∗} Rk pσθ Dk
l | Akl eθ
Dkl |R
k pσθ Dkl | p
σθ Dkl | p
./∗σθ Dkl | R
k Dkl |D
kl | e
ηop Dkl .
Example:
〈 (‘title production year of movie’, {Akl }+ eπ Rk ) (‘with an actor named Mickael Keaton’, e./ R pσθ D) (‘in
the role of Batman’, p./∗σθDk
l )〉.
G. Smits et al. COnnected KEywords 14/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Interactive Syntactic Disambiguation
Some syntactic patterns may be ambiguous or uncovered:I ask for a confirmation of the right interpretation,
I expansion of ambiguous patterns.
I show syntactically uncovered parts of the query.
budget genre name birth date entitled avatar language in 2009 answer
Information to be displayed Satisfying the criteria- budget and genre of a movie- name and birth date of an actor - name of a producer- name of a director- name of a character- name of a composer- name of a cinematographer- name of an editor
- entitled avatar- production year of a movie is 2009- birth date of an actor is 2009- death date of an actor is 2009- name of a character is 2009
What do you mean by …- language
G. Smits et al. COnnected KEywords 15/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Discovering the Meaning of the Query
Definition
A query is semantically consistent if its interpretation forms aconnected subgraph of the query graph.
In case of isolated statements (e.g. ‘title of movies name Clint E.’),re-establish the semantic consistency:
I identify the subject (relation node) of each statement,I projection statement → the concerned relation,I selection statement → the relation on which a restriction is
applied (selection edge).
I determine meaningful joins,I link an isolated subject node to a previously activated subject,I favor the shortest join path between two subjects.
G. Smits et al. COnnected KEywords 16/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Discovering the Meaning of the Query (example)
“title genre and production year of movies actor named Gene H.producer Clint E.”
Subject relation nodes: {movies, actors, producers}.
I projection statement,P1 “title, genre and
production year of movies”
I selection statements,S1 “actor named Gene H.”
S2 “producer Clint E.”
mov_prodR
mov_castR
moviesR
producers
titleA
genreA
nameAR
actorsR
nameA
R:a producer
R:a movieA:title
A:genre
R:an actor
Clint EaswoodD
σθ:named
production_yearA
A:production year!:
of
Gene HackmanDP1 S1
S2
G. Smits et al. COnnected KEywords 17/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Translation to SQL
Depth-first traversal of the connected subgraph:
I starting with projection statements,
I translation rules for each graph pattern,
I incremental filling of the SELECT , FROM and WHERE clauses.
SQL> SELECT movie.title, movies.genre, movie.production year
FROM movies, mov cast, actors, mov prod, producers
WHERE movies.idM = mov cast.idM and mov cast.idA = actors.idA and
actors.name = ’Gene Hackman’ and movies.idM = mov prod.idM and
mov prod.idP = producers.idP and producers.name = ’Clint Eastwood’;
Translation rules:I A π R→ append A to
SELECT and R to FROM,
I R1 ./ R2 append a naturaljoin between R1 and R2 toFROM,
I A θ D append AθD toWHERE .
G. Smits et al. COnnected KEywords 18/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Experimentations
Implementation on top of the MusicBrainz DB:
I manual definition of the query graph and vocabulary.
Expressivity assessment:I translation of the QALD corpus (100 NL queries):
I 81 queries successfully translated into COKE queries,I 8 concern data not available in the DB version of MusicBrainz,I 11 uncovered (6 Boolean queries and 5 relying on indirect knowledge).
A first attempt of precision assessment (precision 100%):
I translation by a “COKE” queries expert,
I using unambiguous statements, no join deduction,
I basic queries (80/81 queries contain 1 projection and 1 selection).
Efficiency assessment:
I NL approach: avg. 31.6s, COKE: avg. 0.65s.
G. Smits et al. COnnected KEywords 19/20
Introduction Data Model and Language Cooperative Querying Approach Experimentations
Conclusion and Perspectives
A keyword-based query approach taking connectives into account:
I constrained keyword-based language,
I interactive query definition and disambiguation strategy.
A tool to help domain experts query their corporate DBs
Perspectives:
I make COKE queries closer to natural language,
I automatic acquisition of the vocabulary,
I handle more complex queries (aggregates, groups),
I real users feedbacks in a CRM applicative context.
G. Smits et al. COnnected KEywords 20/20