
Introduction Data Model and Language Cooperative Querying Approach Experimentations

COnnected KEywords

Grégory Smits, Olivier Pivert and Virginie Thion

IRISA, Campus de Beaulieu, 35042 Rennes, France
{gregory.smits, olivier.pivert, virginie.thion}@irisa.fr

RCIS, May 13-15 2015, Athens, Greece

G. Smits et al. COnnected KEywords 1/20


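Giving Access to Structured Data

[Figure: a domain expert accesses a DATABASE in two ways: through the RELATIONAL MODEL (entity-relationship diagram with 0..N, 1..1 and 1..N cardinalities) using the SQL QUERY LANGUAGE, or through the GRAPH MODEL using a KEYWORD QUERY.]

SQL> select title, genre from movies where year > 2014;

KEYWORD QUERY: kw1 kw2 kw3 … answer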


Motivations

[Figure: a domain expert submits the keyword query ‘title of movie with Clint Eastwood’ and gets an answer. The figure labels two steps on the GRAPH MODEL, lexical mapping and query interpretation, above the DATABASE. The graph model has relation nodes (R): movies, mov_cast, mov_prod, mov_comp, mov_cin, mov_ed, mov_dir, characters, actors, composers, producers, directors, editors and cinematographers; attribute nodes (A) such as title, production_year, genre, language, budget, length, name, birth_date and death_date; and domain nodes (D) such as titles, year, languages, genres, budgets, lengths, names and dates. Its edges are labelled θ and μ.]

RELATIONAL MODEL

movies (455,178): movie_id, title, production_year, genre, budget
characters (2,338,885): character_id, name
actors (1,740,146): actor_id, name, birth_date, death_date
composers (68,552): composer_id, name
producers (205,192): producer_id, name
directors (129,906): director_id, name
editors (79,561): editor_id, name
cinematographers (92,873): cinematographer_id, name
mov_cin (306,299): movie_id, cinematographer_id
mov_ed (245,614): movie_id, editor_id
mov_dir (436,391): movie_id, director_id
mov_prod (535,246): movie_id, producer_id
mov_comp (219,466): movie_id, composer_id
mov_cast (3,057,045): movie_id, character_id, actor_id

Lexical mapping labels: attribute name, table name, name of an actor.

‘title of the movies in which acts Clint E.’

SQL> select title from movies M inner join producer P on M.idP = P.id and P.name = 'Clint E.';

More expressive query interface:
- keywords and connectives,
- constrained but rich query language (SPJ queries, aggregates, some nested queries, ...):
    ‘movies Clint E. Gene H.’
    ‘genre of movies entitled Scarface’
    ‘movies genre is western released in 2014’
    ‘actors appearing in Dogma’
- tool to help domain experts:
    - query their corporate databases,
    - manage the COKE language.


Outline

1 Introduction
    1.1 Problem
    1.2 Motivating example

2 Data Model and Language
    2.1 Schema Graph Model
    2.2 Query Graph Model and Searchable Vocabulary
    2.3 A Constrained Keyword-based Query Language

3 Cooperative Querying Approach
    3.1 KW Query Definition
    3.2 KW Query Interpretation
    3.3 Translation into SQL

4 Experimentations


Translation of the Relational Schema into a Graph Schema

- capture the data semantics,
- have a more manageable model,
- make it possible to apply graph-based algorithms.

[Figure: the relational schema of the movie database (tables movies, characters, actors, composers, producers, directors, editors and cinematographers, and the tables mov_cast, mov_prod, mov_comp, mov_cin, mov_ed and mov_dir that link them to movies) shown next to the schema graph derived from it. The graph has relation nodes (R), attribute nodes (A) such as title, production_year, genre, language, budget, length, name, birth_date and death_date, and domain nodes (D) such as titles, year, languages, genres, budgets, lengths, names and dates. Its edges are labelled θ and μ.]

- tables, attributes and definition domains are represented by nodes,
- primary-to-foreign-key constraints form links between table nodes,
- domain nodes are linked to their attribute nodes, which are themselves linked to their table nodes.
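
The three construction rules above can be sketched as follows, in Python, on a small extract of the movie schema. This is only an illustration: the dictionary layout, the `(kind, name)` node encoding and the function name `build_schema_graph` are assumptions, not the authors' implementation, and the extract keeps a single domain node per attribute.

```python
from collections import defaultdict

# Hypothetical extract of the relational schema: table -> {attribute: domain}.
TABLES = {
    "movies": {"title": "titles", "production_year": "years"},
    "actors": {"name": "names", "birth_date": "dates"},
    "mov_cast": {},
}
# Foreign keys as (referencing table, referenced table) pairs.
FOREIGN_KEYS = [("mov_cast", "movies"), ("mov_cast", "actors")]


def build_schema_graph(tables, foreign_keys):
    """Return an undirected adjacency map over typed nodes.

    Nodes are (kind, name) pairs with kind in {"R", "A", "D"}:
    relation (table), attribute and domain nodes.
    """
    graph = defaultdict(set)

    def link(u, v):
        graph[u].add(v)
        graph[v].add(u)

    for table, attributes in tables.items():
        graph[("R", table)]  # register the table node even without attributes
        for attribute, domain in attributes.items():
            attr_node = ("A", f"{table}.{attribute}")
            link(("R", table), attr_node)    # attribute node linked to its table node
            link(attr_node, ("D", domain))   # domain node linked to its attribute node
    for source, target in foreign_keys:
        link(("R", source), ("R", target))   # key constraint links two table nodes
    return graph


graph = build_schema_graph(TABLES, FOREIGN_KEYS)
print(sorted(graph[("R", "mov_cast")]))
```

Keeping the graph as a plain adjacency map makes it directly usable by standard graph algorithms such as breadth-first search.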


Query Graph Model: mapping between the model and SQL

Directed graph representing the elements that may appear in a user query:
- table, attribute, domain and function nodes,
- projection, predicate, selection, join and transformation edges,
- subquery edges.

Definition

A query is covered by the COKE system if it forms a connected subgraph of the query graph.
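
The coverage definition above amounts to a connectivity test. The sketch below checks whether a set of selected elements induces a connected subgraph. The edge list is a hypothetical extract of a query graph, edges are treated as undirected for the connectivity test, and `is_covered` is an illustrative name rather than part of COKE.

```python
from collections import deque

# Hypothetical extract of a query graph, as undirected edges between element names.
QUERY_GRAPH_EDGES = [
    ("movies", "movies.title"),
    ("movies", "mov_cast"),
    ("mov_cast", "actors"),
    ("actors", "actors.name"),
    ("movies", "mov_prod"),
    ("mov_prod", "producers"),
    ("producers", "producers.name"),
]


def is_covered(selected_nodes, edges):
    """True if the selected nodes induce a connected subgraph of the query graph."""
    selected = set(selected_nodes)
    if not selected:
        return False
    known = {node for edge in edges for node in edge}
    if not selected <= known:
        return False  # an element outside the query graph cannot be covered
    adjacency = {node: set() for node in selected}
    for u, v in edges:
        if u in selected and v in selected:
            adjacency[u].add(v)
            adjacency[v].add(u)
    start = next(iter(selected))
    seen, queue = {start}, deque([start])
    while queue:  # breadth-first traversal restricted to the selected nodes
        for neighbour in adjacency[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == selected


connected = {"movies.title", "movies", "mov_cast", "actors", "actors.name"}
isolated = {"movies.title", "movies", "actors.name"}
print(is_covered(connected, QUERY_GRAPH_EDGES))  # True
print(is_covered(isolated, QUERY_GRAPH_EDGES))   # False
```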


Query Graph Model: mapping between the model and SQL

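‘name of producers, title and production year of movies with Michael Keaton in the role of Batman’

[Figure: query subgraph. Projection edges (π) select the attribute nodes title and production_year of movies and name of producers. Join edges (⨝) link movies, mov_prod and producers, and mov_cast is connected to characters and actors. Selection edges (σ) with the predicate = compare the name of characters to the domain value batman and the name of actors to the domain value Michael Keaton.]

SQL> SELECT M.title, M.production_year, P.name
     FROM producers P join mov_prod MP on MP.pid = P.id
          join movies M on M.id = MP.mid
          join mov_cast MC on MC.mid = M.id
          join characters C on C.id = MC.cid
          join actors A on A.id = MC.aid
     WHERE A.name = 'Michael Keaton'
           and C.name = 'batman';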

‘number of movies produced by the producer of the movie entitled Dogma’

[Figure: query subgraph with a function node count (F) applied to movies, join edges (⨝) between movies, mov_prod and producers, and a subquery edge (η in) from the name of producers to a nested subgraph that projects the name of the producers of the movie whose title = Dogma.]

SQL> SELECT count(*)
     FROM movies M join mov_prod MP on M.id = MP.mid
          join producers P on P.id = MP.pid
     WHERE P.name IN ( SELECT P2.name
                       FROM producers P2 join mov_prod MP2 on P2.id = MP2.pid
                            join movies M2 on M2.id = MP2.mid
                       WHERE M2.title = 'Dogma');


Searchable Vocabulary: mapping keywords to elements of the query graph

Definition

A keyword is a word or a group of words attached to a searchable element of the query graph:
- table, attribute, domain and function nodes,
- projection, predicate, join, selection and transformation edges,
- subquery edges,
- and some paths (predicate + selection, joins, joins + selection).


Searchable Vocabulary: mapping keywords to elements of the query graph

[Figure: extract of the labelled query graph over the relation nodes movies, mov_prod, producers, mov_cast, characters and actors and the attribute nodes title, production_year and name. Keywords attached to its elements include: R: ‘a producer’, ‘a movie’, ‘an actor’, ‘a character’; A: ‘title’, ‘production year’, ‘name’; θ: ‘is’; σ: ‘whose’; σθ: ‘entitled’, ‘released in’, ‘named’; the edge keyword ‘of’; and, on joins and join paths (⨝*): ‘by’, ‘has produced’, ‘appearing in’, ‘in which plays’, ‘playing in’, ‘in which acts’, ‘in the role of’, ‘played by’.]


COKE: a Constrained Keyword-based Query Language

A COKE query is a keyword-based expression of a selection-projection-join SQL query:
- with a restricted vocabulary,
- and a constrained syntax (projection part + selection part).

Backus-Naur form grammar:

COKE query ::= {Projection Statement}* {Selection Statement}*

Projection Statement ::= F^f [e_f] A^k_l e_π R^k
                       | F^f [e_f] R^k
                       | [Attributes [e_π]] R^k
                       | Attributes

Attributes ::= A^k_l {, A^k_i}* [and A^k_j]

Selection Statement ::= {e_⨝ | p_{⨝*}} R^k e_σ A^k_l e_θ D^k_l
                      | A^k_l e_π R^k e_θ D^k_l
                      | {e_⨝ | p_{⨝*}} R^k p_{σθ} D^k_l
                      | A^k_l e_θ D^k_l
                      | R^k p_{σθ} D^k_l
                      | p_{σθ} D^k_l
                      | p_{⨝*σθ} D^k_l
                      | R^k D^k_l
                      | D^k_l
                      | e_{η_op} D^k_l .


The COKE Syntax Illustrated

COKE query = a projection part + a selection part

Projection statement:
- basic form: “movies”, “title year movies”, “number movies”, etc.,
- more expressive syntax: “title, length and genre of movies”, “number of movies”, etc.

Selection statement (joins, restrictions, some nested queries):
- basic form: “Clint E. Gene H.”, “actor Clint E.”, “producer Clint E. actor Gene H.”, etc.,
- more expressive syntax: “by a producer whose name is Clint E.”, “produced by Clint E. with Gene H.”, “with an actor of Dogma”, etc.


COKE: a Cooperative Querying Approach

A constrained keyword-based query language with:

1. a restricted vocabulary,

2. a limited syntax,

3. and semantic expectations (connected subgraph of the query graph).

An interactive query construction and interpretation process to:
- help users express their information needs with covered queries,
- interactively determine the exact meaning of the queries.


Query Definition at the Lexical Level

Autocompletion and a locked query field to ensure lexical coverage:
- index of the whole vocabulary.

[Figure: query field containing ‘produced by Clint E’, next to an ‘answer’ button, with the autocompletion suggestions Clint Eastwood, Clint Eastlake, Clint Eastman, Clint Edwards, Clint Ellison, Clint Elmore, Clint Elvy and Clint Eruera.]

Definition

The lexical analyzer (made easy by autocompletion) produces mappings between keywords and elements of the query graph.
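
A minimal sketch of how such an autocompletion index and lexical mapping might work, assuming the vocabulary fits in a sorted in-memory list searched by prefix with `bisect`. The vocabulary entries, the element encoding and the function names `complete` and `lexical_mapping` are hypothetical. The slides do not describe the actual index structure.

```python
import bisect

# Hypothetical vocabulary: keyword -> element of the query graph it is attached to.
VOCABULARY = {
    "clint eastlake": ("D", "names"),
    "clint eastwood": ("D", "names"),
    "clint edwards": ("D", "names"),
    "produced by": ("join path", "movies-mov_prod-producers"),
    "title": ("A", "movies.title"),
}
INDEX = sorted(VOCABULARY)  # sorted keywords allow binary search on prefixes


def complete(prefix, index=INDEX, limit=8):
    """Return up to `limit` keywords of the index that start with `prefix`."""
    prefix = prefix.lower()
    start = bisect.bisect_left(index, prefix)
    matches = []
    for keyword in index[start:]:
        if not keyword.startswith(prefix) or len(matches) == limit:
            break
        matches.append(keyword)
    return matches


def lexical_mapping(keywords, vocabulary=VOCABULARY):
    """Map each keyword to its query-graph element; unknown keywords raise KeyError."""
    return [(keyword, vocabulary[keyword.lower()]) for keyword in keywords]


print(complete("Clint E"))
print(lexical_mapping(["produced by", "Clint Eastwood"]))
```

Because the query field only accepts suggestions drawn from the index, every keyword reaching `lexical_mapping` is known to the vocabulary, which is what makes the lexical step straightforward.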


Query Interpretation at the Syntactic Level

Parsing to identify valid projection and selection statements

Definition

A syntactic analysis is a mapping from adjacent keywords to projection/selection graph patterns.

COKE query ::= {Projection Statement}* {Selection Statement}*

Projection Statement ::= F^f [e_f] A^k_l e_π R^k
                       | F^f [e_f] R^k
                       | [Attributes [e_π]] R^k
                       | Attributes

Attributes ::= A^k_l {, A^k_i}* [and A^k_j]

Selection Statement ::= {e_⨝ | p_{⨝*}} R^k e_σ A^k_l e_θ D^k_l
                      | A^k_l e_π R^k e_θ D^k_l
                      | {e_⨝ | p_{⨝*}} R^k p_{σθ} D^k_l
                      | A^k_l e_θ D^k_l
                      | R^k p_{σθ} D^k_l
                      | p_{σθ} D^k_l
                      | p_{⨝*σθ} D^k_l
                      | R^k D^k_l
                      | D^k_l
                      | e_{η_op} D^k_l .

Example:

〈 (‘title production year of movie’, {A^k_l}+ e_π R^k), (‘with an actor named Michael Keaton’, e_⨝ R p_{σθ} D), (‘in the role of Batman’, p_{⨝*σθ} D^k_l) 〉.


Interactive Syntactic Disambiguation

Some syntactic patterns may be ambiguous or uncovered:
- ask for a confirmation of the right interpretation,
- expand ambiguous patterns,
- show syntactically uncovered parts of the query.

[Figure: query field containing ‘budget genre name birth date entitled avatar language in 2009’, next to an ‘answer’ button, with the disambiguation panels below.]

Information to be displayed:
- budget and genre of a movie
- name and birth date of an actor
- name of a producer
- name of a director
- name of a character
- name of a composer
- name of a cinematographer
- name of an editor

Satisfying the criteria:
- entitled avatar
- production year of a movie is 2009
- birth date of an actor is 2009
- death date of an actor is 2009
- name of a character is 2009

What do you mean by …
- language


Discovering the Meaning of the Query

Definition

A query is semantically consistent if its interpretation forms a connected subgraph of the query graph.

In case of isolated statements (e.g. ‘title of movies name Clint E.’), re-establish the semantic consistency:
- identify the subject (relation node) of each statement:
    - projection statement → the concerned relation,
    - selection statement → the relation on which a restriction is applied (selection edge);
- determine meaningful joins:
    - link an isolated subject node to a previously activated subject,
    - favor the shortest join path between two subjects.
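
The join deduction step above can be illustrated with a breadth-first search, which returns a shortest path in an unweighted graph. The sketch below links each new subject to the closest previously activated one. The join map is a hypothetical extract of the movie schema, and the function names and the tie-breaking rule (first shortest path found) are assumptions.

```python
from collections import deque

# Hypothetical join edges between relation nodes (primary/foreign key links).
JOINS = {
    "movies": ["mov_cast", "mov_prod", "mov_dir"],
    "mov_cast": ["movies", "actors", "characters"],
    "mov_prod": ["movies", "producers"],
    "mov_dir": ["movies", "directors"],
    "actors": ["mov_cast"],
    "characters": ["mov_cast"],
    "producers": ["mov_prod"],
    "directors": ["mov_dir"],
}


def shortest_join_path(source, target, joins=JOINS):
    """Breadth-first search: one shortest list of relations from source to target."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            path = []
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbour in joins.get(current, []):
            if neighbour not in parents:
                parents[neighbour] = current
                queue.append(neighbour)
    return None  # the two subjects cannot be joined


def connect_subjects(subjects, joins=JOINS):
    """Link each subject to the closest previously activated subject."""
    activated, paths = [subjects[0]], []
    for subject in subjects[1:]:
        candidates = [shortest_join_path(prev, subject, joins) for prev in activated]
        candidates = [path for path in candidates if path is not None]
        if not candidates:
            raise ValueError(f"no join path reaches {subject!r}")
        paths.append(min(candidates, key=len))
        activated.append(subject)
    return paths


print(connect_subjects(["movies", "actors", "producers"]))
```

With the subjects movies, actors and producers, the sketch joins actors through mov_cast and producers through mov_prod, both anchored on movies.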


Discovering the Meaning of the Query (example)

“title genre and production year of movies actor named Gene H. producer Clint E.”

Subject relation nodes: {movies, actors, producers}.

- projection statement:
    P1 “title, genre and production year of movies”
- selection statements:
    S1 “actor named Gene H.”
    S2 “producer Clint E.”

[Figure: extract of the labelled query graph with the parts activated by each statement. P1 covers movies and its attribute nodes title, genre and production_year; S1 covers actors, its name attribute and the domain value Gene Hackman; S2 covers producers, its name attribute and the domain value Clint Eastwood. The relation nodes mov_cast and mov_prod lie between movies and, respectively, actors and producers.]


Translation to SQL

Depth-first traversal of the connected subgraph:

- starting with projection statements,
- translation rules for each graph pattern,
- incremental filling of the SELECT, FROM and WHERE clauses.

SQL> SELECT movies.title, movies.genre, movies.production_year
     FROM movies, mov_cast, actors, mov_prod, producers
     WHERE movies.idM = mov_cast.idM and mov_cast.idA = actors.idA and
           actors.name = 'Gene Hackman' and movies.idM = mov_prod.idM and
           mov_prod.idP = producers.idP and producers.name = 'Clint Eastwood';

Translation rules:
- A π R → append A to SELECT and R to FROM,
- R1 ⨝ R2 → append a natural join between R1 and R2 to FROM,
- A θ D → append A θ D to WHERE.
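
The three rules above can be sketched as a small clause-filling routine. The tuple encoding of graph patterns and the name `translate` are assumptions made for illustration. Following the form of the SQL example on this slide, joins are emitted as equality conditions on a shared key column rather than with NATURAL JOIN syntax.

```python
def translate(patterns):
    """Fill the SELECT, FROM and WHERE clauses from a list of graph patterns.

    Each pattern is a tuple (hypothetical encoding):
      ("project", attribute, relation)     for  A π R
      ("join", relation1, relation2, key)  for  R1 ⨝ R2 on a shared key column
      ("compare", attribute, op, value)    for  A θ D
    """
    select, from_, where = [], [], []

    def add_relation(relation):
        if relation not in from_:  # each relation appears once in FROM
            from_.append(relation)

    for pattern in patterns:
        kind = pattern[0]
        if kind == "project":
            _, attribute, relation = pattern
            select.append(f"{relation}.{attribute}")
            add_relation(relation)
        elif kind == "join":
            _, left, right, key = pattern
            add_relation(left)
            add_relation(right)
            where.append(f"{left}.{key} = {right}.{key}")
        elif kind == "compare":
            _, attribute, op, value = pattern
            escaped = value.replace("'", "''")  # double single quotes in SQL literals
            where.append(f"{attribute} {op} '{escaped}'")
        else:
            raise ValueError(f"unknown pattern {kind!r}")
    sql = f"SELECT {', '.join(select)} FROM {', '.join(from_)}"
    if where:
        sql += f" WHERE {' and '.join(where)}"
    return sql + ";"


patterns = [
    ("project", "title", "movies"),
    ("join", "movies", "mov_cast", "idM"),
    ("join", "mov_cast", "actors", "idA"),
    ("compare", "actors.name", "=", "Gene Hackman"),
]
print(translate(patterns))
```

The sketch builds the query as a string for readability. Code that passes user-provided values to a database would normally use parameterized queries instead of inlined literals.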


Experimentations

Implementation on top of the MusicBrainz DB:
- manual definition of the query graph and vocabulary.

Expressivity assessment:
- translation of the QALD corpus (100 NL queries):
    - 81 queries successfully translated into COKE queries,
    - 8 concern data not available in the DB version of MusicBrainz,
    - 11 uncovered (6 Boolean queries and 5 relying on indirect knowledge).

A first attempt at precision assessment (precision 100%):
- translation by a COKE query expert,
- using unambiguous statements, no join deduction,
- basic queries (80/81 queries contain 1 projection and 1 selection).

Efficiency assessment:
- NL approach: avg. 31.6 s, COKE: avg. 0.65 s.


Conclusion and Perspectives

A keyword-based query approach taking connectives into account:
- constrained keyword-based language,
- interactive query definition and disambiguation strategy.

A tool to help domain experts query their corporate DBs.

Perspectives:
- make COKE queries closer to natural language,
- automatic acquisition of the vocabulary,
- handle more complex queries (aggregates, groups),
- feedback from real users in a CRM application context.
