How hard is this Query? Measuring the Semantic Complexity of Schema-agnostic Queries

André Freitas, Juliano Efson Sales, Siegfried Handschuh, Edward Curry

IWCS, London 2015

Outline

• Motivation

• Query Semantic Complexity & Entropy

• Entropy Measures

• Validation & Analysis

• Conclusions

Motivation

Shift in the Database Landscape

Very-large and dynamic “schemas”.

Before 2000: 10s-100s of attributes

Circa 2015: 1,000s-1,000,000s of attributes

(Brodie & Liu, 2010)

Databases for a Complex World

How do you query data in this scenario?

Schema-agnosticism

Who is the daughter of Bill Clinton?

Bill Clinton

Chelsea Clinton

Schema-agnostic queries

Query approaches over structured databases that allow users to satisfy complex information needs without understanding the representation (schema) of the database.

Semantic Parsing

Vocabulary Problem for Databases

Query: Who is the daughter of Bill Clinton married to?

Quantify the Semantic Gap

Possible representations

Core Questions

• Can we measure the semantic complexity of a query-DB mapping?

• What defines an “easy” or a “hard” query?

• Which are the best estimators?

Semantic Complexity & Entropy

Configuration space of semantic matchings

Quantify the Query-DB semantic gap

Not all queries are born equal!

Semantic Complexity & Entropy

• Structural/conceptual complexity

• Level of ambiguity/indeterminacy/vagueness

• Terminological gap

• Novelty

Semantic Configuration Space

mΣ(Q, DB)

Semantic Entropy Measures

[Figure: entropy measures along the query-DB mapping: Hsyntax, Hstruct, Hterm, Hmatching]

In the scope of this work

• Entropy -> Entropy estimator, approximation.

Syntactic Entropy (Hsyntax)

• The syntactic entropy of a query is defined by the possible syntactic configurations in which a query can be interpreted under the database syntax.

• Estimate the uncertainty of the translation of the query into the DB categories (IDB(Q)).

• Is a function of the probability of the syntactic interpretation of a query.
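One way to read the last bullet is as a Shannon entropy over the candidate interpretations of a query. The following is a minimal sketch under that assumption; the function name and the example probabilities are illustrative and do not come from the talk.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy (in bits) of a discrete distribution.

    Assumption: `probabilities` holds the probability of each candidate
    syntactic interpretation of one query and sums to 1.
    """
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 are skipped (limit of p * log p as p -> 0 is 0).
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical numbers: a single possible parse vs. four equally likely parses.
unambiguous = shannon_entropy([1.0])
ambiguous = shannon_entropy([0.25, 0.25, 0.25, 0.25])
```

Under this reading, a query with one possible interpretation has zero entropy, and the value grows as probability mass spreads over more interpretations.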

Structural Entropy (Hstruct)

• The structural entropy defines the complexity of a database based on the possible facts that can be encoded under its schema.

• Pollard & Biermann, A measure of semantic complexity for natural language systems (2000).

Terminological Entropy (Hterm)

• The terminological entropy focuses on estimating the amount of ambiguity, synonymy and vagueness of the query or database terms.

• Translational Entropy (Htrans) as an estimator.

• Melamed, Measuring semantic entropy (1997).

• Translation probability based on parallel corpora.
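As an illustration of the translational estimator, the sketch below computes the entropy of the translation distribution of a single term from alignment counts. It assumes the probabilities are relative frequencies of aligned translations in a parallel corpus; the function name, the placeholder target words and the counts are hypothetical.

```python
import math

def translational_entropy(translation_counts):
    """Entropy (bits) of the translation distribution of one source term.

    Assumption: `translation_counts` maps each target-language word to the
    number of times it was aligned with the source term in a parallel corpus.
    """
    total = sum(translation_counts.values())
    if total <= 0:
        raise ValueError("need at least one aligned translation")
    entropy = 0.0
    for count in translation_counts.values():
        if count > 0:
            p = count / total  # relative-frequency estimate of P(t | s)
            entropy -= p * math.log2(p)
    return entropy

# Made-up alignment counts with placeholder target words, for illustration only.
consistent_term = {"t1": 8}
scattered_term = {"t1": 2, "t2": 2, "t3": 2, "t4": 2}
```

A term that is always translated the same way scores 0 bits, while a term whose translations are spread evenly over four target words scores 2 bits, which is the sense in which the measure tracks ambiguity and vagueness.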

Matching Entropy (Hmatching)

• Consists of measures which describe the uncertainty involved in the query-data matching/alignment between query terms and dataset entities.

• Provides an estimate based on the set of potential alignments.

• Distributional entropy (Hdist): Estimator based on distributional semantic models.
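The slides do not give a formula for Hdist. One possible reading, sketched below, normalises the relatedness scores of the candidate alignments of a query term into a distribution and takes its entropy. The function name and the scores are assumptions made for illustration only.

```python
import math

def matching_entropy(similarity_scores):
    """Entropy (bits) over the candidate alignments of one query term.

    Assumption: `similarity_scores` maps candidate database entities to
    relatedness scores (e.g. from a distributional semantic model).
    Candidates with non-positive scores are ignored; the rest are
    normalised into a probability distribution.
    """
    positive = [s for s in similarity_scores.values() if s > 0]
    total = sum(positive)
    if total == 0:
        raise ValueError("no candidate has a positive score")
    return sum(-(s / total) * math.log2(s / total) for s in positive)

# Hypothetical scores for one query term against two DBpedia properties.
clear_match = {"dbpedia:spouse": 0.9, "dbpedia:children": 0.1}
unclear_match = {"dbpedia:spouse": 0.5, "dbpedia:children": 0.5}
```

With these toy scores the evenly split case yields 1 bit and the dominated case yields less, so a lower value suggests a more confident alignment.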

Query Features as Complexity Estimators

• Query features (reference to data model/query operator categories).

– Contains instance reference (named entities)

– Contains class reference

– Contains complex class reference

– Contains property

– Contains value

– Yes/No question

– Contains operator

Validation & Analysis

Experimental Set-up

• Question Answering over Linked Data Test Collection (Unger et al. 2011).

• QALD 2011 & 2012.

• 150 natural language queries over DBpedia (RDF).

Dataset (DBpedia + YAGO classes):

– 45,768 properties

– 288,316 classes

– 9,434,677 instances

– 128,071,259 triples

Query Analysis Example

Experimental Set-up

• Linear regression between each entropy measure and the f-measure of the participating QA systems.

• 4 QA systems:

– QALD 2011: PowerAqua, Freya (κ = 0.501, 95% confidence interval, ‘moderate’ agreement).

– QALD 2012: QAKis, MHE (κ = 0.236, 95% confidence interval, ‘fair’ agreement).
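The slides report κ for each pair of systems without naming the statistic. Assuming it is Cohen's kappa computed over per-query outcomes of the two systems, a minimal sketch looks like this; the labels below are toy values, not QALD results.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: each list holds one categorical label per query, e.g.
    1 if the system answered the query correctly and 0 otherwise.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the marginal label frequencies of each rater.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Toy per-query outcomes for two hypothetical systems.
system_a = [1, 1, 0, 0]
system_b = [1, 0, 0, 0]
kappa = cohen_kappa(system_a, system_b)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why two systems that both fail on most queries do not automatically score as highly consistent.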

1st Analysis

• Linear regression model.

• Hsyntax, Hterm (Htrans), Hmatching (Hdist) and Hstruct
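For readers who want to reproduce this kind of analysis, the sketch below fits a one-variable least-squares line and reports the Pearson correlation. It is a generic illustration: the function name is hypothetical and the data points are invented, not the values behind the talk's results.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Returns (slope, intercept, pearson_r).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, pearson_r

# Invented data: per-query entropy estimate vs. f-measure of a QA system.
entropy_values = [0.0, 1.0, 2.0, 3.0]
f_measures = [0.9, 0.7, 0.5, 0.3]
slope, intercept, pearson_r = fit_line(entropy_values, f_measures)
```

A negative slope, as in this toy data, corresponds to the "(-)" annotation used in the slides: higher entropy goes with lower f-measure.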

1st Analysis

• Higher correlation:

– Hsyntax (-)

– Hterm (Htrans) (-)

– Hmatching (Hdist) (-)

• Lower correlation:

– Hstruct

2nd Analysis

• Query features (reference to data model/query operator categories).

– Contains instance reference (named entities)

– Contains class reference

– Contains complex class reference

– Contains property

– Contains value

– Yes/No question

– Contains operator

2nd Analysis

• Linear regression model.

2nd Analysis

• Higher correlation:

– References to instances (+)

– Presence of operators (-)

– Presence of complex classes (complex nominals) (-)

3rd Analysis

• Classification of the query-DB terminological gap for each data model category.

3rd Analysis

Lower terminological gap

Higher terminological gap

Query Classification

• % of unanswered questions:

– Syntactic complexity (Hsyntax): 51.7%

– Vocabulary gap (Hmatching, Hterm): 68.9%

– No reference to instance (named entity) (Hstruct, Hterm): 20.6%

Limitations

• Validation of the regression model in a different test collection.

• Distributional entropy needs a more principled definition.

Minimizing Semantic Entropy

Reflections on the Design of Schema-agnostic Query Mechanisms

Or ....

Minimizing the Semantic Entropy for the Semantic Matching

Definition of a semantic pivot: first query term to be resolved in the database.

Maximizes the reduction of the semantic configuration space (Hstruct, Hmatch).

Semantic Pivots (Hstruct, Hmatch)

• Who is the daughter of Bill Clinton married to?

[Figure: the query aligned to dbpedia:spouse, dbpedia:children and :Bill_Clinton, annotated with the counts 437; 100,184; 62,781 and > 4,580,000]

Definition of a semantic pivot: first query term to be resolved in the database.

Less prone to more complex synonymic expressions and abstraction-level differences (Hterm, Hmatch).

Semantic Pivots

• Proper nouns tend to have a high percentage of string overlap across synonymic expressions.

William Jefferson Clinton

Bill Clinton

William J. Clinton

T. E. Lawrence

Thomas Edward Lawrence

Lawrence of Arabia
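The string-overlap claim can be checked with a simple token-level Jaccard score over the name variants above. The choice of Jaccard similarity and the function name are assumptions made for illustration; the slides do not specify an overlap measure.

```python
def token_overlap(a, b):
    """Jaccard overlap between the lower-cased word tokens of two strings.

    Trailing or leading full stops and commas are stripped, so "J."
    becomes "j".
    """
    tokens_a = {t.strip(".,").lower() for t in a.split()}
    tokens_b = {t.strip(".,").lower() for t in b.split()}
    tokens_a.discard("")
    tokens_b.discard("")
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)

# Name variants from the slide share at least one token ...
full_vs_initial = token_overlap("William Jefferson Clinton", "William J. Clinton")
short_vs_full = token_overlap("Bill Clinton", "William Jefferson Clinton")
# ... whereas a common-noun paraphrase pair may share none.
noun_pair = token_overlap("daughter", "children")
```

Every pair of proper-noun variants listed here keeps a non-zero overlap through the shared surname, while the common-noun pair has none, which is why string matching is a comparatively reliable first step for named entities.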

Who is the daughter of Bill Clinton married to?

Definition of a semantic pivot: first query term to be resolved in the database.

Less prone to more complex synonymic expressions and abstraction-level differences (Hterm, Hmatch).

proper nouns >> nouns >> complex nominals >> adjectives, verbs.
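The ordering above can be turned into a simple pivot-selection heuristic. This is a sketch, assuming the query terms have already been tagged with one of the categories in the ranking; the category labels, the function name and the hand-assigned tags are hypothetical.

```python
# Lower number = resolved earlier. Adjectives and verbs share the last rank,
# following the ordering on the slide.
PIVOT_PRIORITY = {
    "proper_noun": 0,
    "noun": 1,
    "complex_nominal": 2,
    "adjective": 3,
    "verb": 3,
}

def choose_pivot(tagged_terms):
    """Pick the semantic pivot from (term, category) pairs.

    Ties are broken by query order, because min() returns the first
    minimal element.
    """
    if not tagged_terms:
        raise ValueError("query has no terms")
    for term, category in tagged_terms:
        if category not in PIVOT_PRIORITY:
            raise ValueError(f"unknown category for {term!r}: {category}")
    return min(tagged_terms, key=lambda pair: PIVOT_PRIORITY[pair[1]])[0]

# Hand-tagged terms of "Who is the daughter of Bill Clinton married to?"
query_terms = [
    ("daughter", "noun"),
    ("Bill Clinton", "proper_noun"),
    ("married", "verb"),
]
pivot = choose_pivot(query_terms)
```

For the running example this selects "Bill Clinton" as the first term to resolve, leaving "daughter" and "married" to be matched within the neighbourhood of that entity.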

Semantic Matching

• Hsyntax is a strong estimator of query complexity.

• Hmatching can be used as an estimator for the quality of the predicate alignment.

• Hterm can be used as a heuristic for matching complexity.

Conclusions

• Both entropy (Hsyntax, Hterm, Hmatching) and query features (instances, complex classes, operators) can be used as estimators for query semantic complexity.

• This can be incorporated as heuristics into schema-agnostic query planning approaches (or approximate semantic parsing) to maximize semantic matching probabilities.

• Need for the construction of better semantic entropy estimators.
