CS848: Topics in Databases: Information Integration Topics covered Databases QL Query containment...
Transcript of CS848: Topics in Databases: Information Integration Topics covered Databases QL Query containment...
CS848: Topics in Databases: Information Integration
Topics covered
Databases
QL
Query containment
An evaluation of QL
CS848: Topics in Databases: Information Integration
A simple case of information integration
Subsystem3
Subsystem2
Subsystem1
SQL
Global Schema
A single table: T
Open, Scan
SQL Server
A “local as view” (LAV) integration schema: T ´ Q2. User submits Q1. Query optimizer must determine if a scan of T suffices. True iff Q1 is equivalent to Q2.
CS848: Topics in Databases: Information Integration
In the beginning …
Infinite countable sets of each of the following kinds of symbols:
C = {C1, C2, … } (primitive concepts)
A = {A1, A2, …} [ {B1, B2, …} (attributes)
R = {R1, R2, … } (roles)
Conventions: Attributes (resp. primitive concepts and roles) correspond to words in lower case or to positive integers (resp. words in upper case and words in mixed case).
CS848: Topics in Databases: Information Integration
For a particular database I
h, (¢)Ii
where is a countable possibly infinite domain, and where for each symbol
(C)I µ
(A)I : !
(R)I µ ( £ )
CS848: Topics in Databases: Information Integration
Partial databases
e
e : {C1, … , Cn}
e2e1
A
e2e1
R
e 2
e 2 (Ci)I
(A)I(e1) = e2
(e1, e2) 2 (R)I
e2e1 e1 e2
e2e1 e1 e2
CS848: Topics in Databases: Information Integration
Relational databases
“John”“Mary”
3333
name ageEMP
e1 : {EMP}
e2 : {EMP}
33
“Mary”
“John”name
name
age
CS848: Topics in Databases: Information Integration
Relational databases (cont’d)
{e1, e2} µ (EMP)I
(name)I(e1) = “John”
(age)I(e1) = (age)I(e2) = 33
{e1, e2, “John”, 33, “Mary”} µ
e1 e2
“John” ? 33
e1 : {EMP}
e2 : {EMP}
33
“Mary”
“John”name
name
age
CS848: Topics in Databases: Information Integration
Dialects of QL
(expressiveness)
(semantics)Conjunctive QL
withbag semantics†
Positive QL First order QLConjunctive QL
First order QLwith
bag semantics
Positive QLwith
bag semantics‡
†[Khizder et al., 1999], ‡[Lui et al., 2002]
CS848: Topics in Databases: Information Integration
Conjunctive QL
Q ::= D as A (quantification)
| A1 = A2.R (unnest)
| A1.Pf1 = A2.Pf2 (selection)
| elim A1, … , An Q (projection)
| true (null tuple)
| from Q1, Q2 (natural join)
| ( Q )
D ::= THING | C (basic description)
Pf ::= id | A.Pf (path function)
CS848: Topics in Databases: Information Integration
Well formed queries: (Q)
(D as A)´ {A}
(A1 = A2.R) ´ {A1, A2}
(A1.Pf1 = A2.Pf2) ´ {A1, A2}
(elim A1, … , An Q) ´ {A1, … , An}
(true) ´ ;
(from Q1, Q2) ´ (Q1) [ (Q2)
Require {A1, … , An} µ (Q) for projection operators.
CS848: Topics in Databases: Information Integration
Tuples and bags
A (duplicate) tuple t with attribute bindings for attributes {A1, … , An} over a database I = h, (¢)Ii has the general form
hA1 : e1, … , An : en , cnt : ii,
where {e1, … , en} µ , “cnt” is a distinct symbol not used as an attribute, and i a positive integer.
A set of duplicate tuples that contain the same attribute bindings is called a bag.
CS848: Topics in Databases: Information Integration
Operations on tuples
(t) ´ set of attributes occurring in t
t@cnt ´ integer i such that “cnt : i” occurs in t
t@A ´ element e 2 such that “A : e” occurs in t; defined only when A 2 (t)
t[{A1, … , An}] ´ {A1 : t@A1, … , An : t@An}; defined only when {A1, … , An} µ (t)
[t] ´ t[(t)]
CS848: Topics in Databases: Information Integration
SemanticsThe meaning of a query Q, denoted «Q¬, is a function that maps databases to bags. The behavior of this function on a particular database I = h, (¢)Ii is defined as follows.
«THING as A¬(I) ´ {hA : e, cnt : 1i : e 2 }
«C as A¬(I) ´ {hA : e, cnt : 1i : e 2 (C)I}
«A1 = A2.R¬(I) ´ {hA1 : e1, A2 : e2, cnt : 1i : (e2, e1) 2 RI}
«A1.Pf1 = A2.Pf2¬(I)´ {hA1 : e1, A2 : e2, cnt : 1i : (Pf1)I(e1) = (Pf2)I(e2)}
where
(id)I ´ {(e, e) : e 2 }
(A.Pf )I ´ {(e1, e2) : (Pf )I((A)I(e1)) = e2}
CS848: Topics in Databases: Information Integration
Semantics (cont’d)
«elim A1, … , An Q ¬(I) ´ ; , if not well formed; otherwise {hA1 : t@A1, … , An : t@An , cnt : 1i : t 2 «Q¬(I)}
«true¬(I) ´ {hcnt : 1i}
«from Q1, Q2¬(I) ´ {t : (t) = (Q1) [ (Q2) Æ 9 t1 2 «Q1¬(I), t2 2 «Q2¬(I) : t@cnt = t1@cnt £ t2@cnt Æ t[(t1)] = [t1] Æ t[(t2)] = [t2]}
CS848: Topics in Databases: Information Integration
Syntactic sugar
A1. .An.id ´ A1. .An
select distinct A1, … , An Q ´ elim A1, … , An Q
select * Q ´ Q
Q1 where Q2 ´ from Q1, Q2
Q1 and Q2 ´ from Q1, Q2
from ´ true
from Q1, Q2, … , Qn ´ from (from Q1, Q2, …) Qn
CS848: Topics in Databases: Information Integration
Examples
The names of employees who have the same age as another employee with a given name.
select distinct :p, namefrom EMP as e, ( select distinct :p, e1
from EMP as e1, EMP as e2where e1.age = e2.age and e2.name = :p )
where e.name = nameand e.id = e1.id
CS848: Topics in Databases: Information Integration
Method calls (more syntactic sugar)
A1.Pf1.C(A2.Pf2, … , An-1.Pfn-1) = An.Pfn
select distinct A1, … , An
´ from C as A where A.1 = A1.Pf1 and … and A.n = An.Pfn
A1.Pf1.C(A2.Pf2, … , An-1.Pfn-1) as An
´ A1.Pf1.C(A2.Pf2, … , An-1.Pfn-1) = An.id
CS848: Topics in Databases: Information Integration
Examples (cont’d)
select distinct namefrom EMP as e, ( select distinct e1
from EMP as e1, EMP as e2where e2.age.+(e2.age) = e1.age )
where e.name = nameand e.id = e1.id
The names of employees who have an age double that of another employee.
CS848: Topics in Databases: Information Integration
Conjunctive datalog (more syntactic sugar)
C(A1, … , An)
select distinct A1, … , An
´ from C as A where A.1 = A1.id and … and A.n = An.id
(A1, … , Am) :- Q1, … , Qn.
´ select distinct A1, … , Am from Q1, … , Qn
CS848: Topics in Databases: Information Integration
Positive QL
Q ::= empty A1, … , An (empty set)
| Q1 union Q2 (union)
(empty A1, …, An) ´ {A1, …, An}
(Q1 union all Q2) ´ (Q1)
Require (Q1) = (Q2) in union operations.
CS848: Topics in Databases: Information Integration
Semantics
«empty A1, … , An¬(I) ´ ;
«Q1 union Q2¬(I)
´ {t : t@cnt = 1 Æ (t) = (Q1) Æ (t) = (Q2) Æ (
( 9 t1 2 «Q1¬(I) : [t] = [t1] Æ :9 t2 2 «Q2¬(I) : [t] = [t2] ) Ç ( 9 t2 2 «Q2¬(I) : [t] = [t2] Æ :9 t1 2 «Q1¬(I) : [t] = [t1] ) Ç ( 9 t1 2 «Q1¬(I), t2 2 «Q2¬(I) : [t] = [t1] Æ [t] = [t2] ) )}
CS848: Topics in Databases: Information Integration
First order QL
Q ::= Q1 minus Q2 (difference)
(Q1 minus Q2) ´ (Q1)
Require (Q1) = (Q2) in difference operations.
CS848: Topics in Databases: Information Integration
Semantics
«Q1 minus Q2¬(I)
´ {t : t@cnt = 1 Æ (t) = (Q1) Æ (t) = (Q2) Æ ( 9t1 2 «Q1¬(I) : [t] = [t1] ) Æ ( :9t2 2 «Q2¬(I) : [t] = [t2] )}
CS848: Topics in Databases: Information Integration
QL with duplicates
Q ::= select A1, … , An Q (duplicate preserving projection)
| Q1 union all Q2 (bag union)
| Q1 minus all Q2 (bag difference)
CS848: Topics in Databases: Information Integration
Well formed queries (cont’d)
(select A1, … , An Q) ´ {A1, … , An}
(Q1 union all Q2 ) ´ (Q1)
(Q1 minus all Q2) ´ (Q1)
Require (Q1) = (Q2) in bag union and bag difference operations, and that {A1, … , An} µ (Q) in (duplicate preserving) projection operations.
CS848: Topics in Databases: Information Integration
Semantics
«select A1, … , An Q ¬(I)
´ ; , if not well formed and representable†; otherwise
{hA1 : t1@A1, … , An : t1@An , cnt : ni : t1 2
«Q¬(I)
Æ n = (t2@cnt) } t2 2 «Q¬(I) :
t2[{A1, … , An}] = t1 [{A1, … , An}]
†The selection operation is representable on database I iff, for everyt1 2 «Q¬(I),
|{t2 2 «Q¬(I) : t2[{A1, … , An}] = t1 [{A1, … , An}]}|
is finite.
CS848: Topics in Databases: Information Integration
Example
A duplicate preserving projection operation that is not representable in any database with an infinite domain.
select e1 from THING as e1, THING as e2
Observation: All well-formed duplicate preserving projection operations on databases with finite domains are representable.
CS848: Topics in Databases: Information Integration
Semantics (cont’d) «Q1 union all Q2¬(I)
´ ; , if not well formed; otherwise
{t 2 «Q1¬(I) : :9 t2 2 «Q2¬(I) : [t] = [t2]}
[ {t 2 «Q2¬(I) : :9 t1 2 «Q1¬(I) : [t] = [t1]}
[ {t : 9 t1 2 «Q1¬(I), t2 2 «Q2¬(I) : [t] = [t1]} Æ [t] =
[t2]
Æ t@cnt = t1@cnt + t2@cnt} «Q1 minus all Q2¬(I)
´ ; , if not well formed; otherwise
{t 2 «Q1¬(I) : :9 t2 2 «Q2¬(I) : [t] = [t2]}
[ {t : 9 t1 2 «Q1¬(I), t2 2 «Q2¬(I) : [t] = [t1]} Æ [t] = [t2]
Æ t@cnt = t1@cnt t2@cnt
Æ t1@cnt t2@cnt }
CS848: Topics in Databases: Information Integration
Summary
at, =, elim,true, from,
select
at, =, elim,true, from,
empty, union
at, =, elim,true, from,
empty, union,minus
at, =, elim,true, from
at, =, elim,true, from,
select,empty, union all,
minus all
at, =, elim,true, from,
select,empty, union all
(conjunctive)
(bag semantics)
(set semantics)
(positive) (first order)
CS848: Topics in Databases: Information Integration
Query contexts
An expression Q[] in the language QL enriched by an additional terminal symbol [] is called a query context.
For a query Q1 2 QL, the expression Q1[Q2] denotes the syntactical substitution of Q2 for [].
Q2 is compatible with Q1 if Q1[Q2] 2 QL. For example, Q2 is compatible with Q1 in the following.
Q2: EMP as e where e.name = :p
Q1: select distinct :p, d from DEPT as d, [] where d = e.dept
CS848: Topics in Databases: Information Integration
The query equivalence problem
Q1 is equivalent to Q2 for database I, written I ² (Q1 ´ Q2 ), if «Q1¬(I) = «Q2¬(I).
A query equivalence dependency E has the form (Q1 ´ Q2). E = (Q1 ´ Q2) is an axiom if, for any database I, I ² (Q1 ´ Q2 ).
A query equivalence problem for a given set of query equivalence dependencies is to determine if a given member of the set is an axiom.
CS848: Topics in Databases: Information Integration
Some axioms
Question: Is it true that any E with the following form is an axiom?
(elim A1, … , Am Q1)[elim B1, … , Bn Q2] ´ elim A1, … , Am Q1 [Q2]
Answer: No. However, any such E is an axiom if any attribute in
(Q2) – {B1, … , Bn})
does not occur in query context
(elim A1, … , Am Q1 []).
CS848: Topics in Databases: Information Integration
Excluding variable reuse in QL
Q has an occurrence of variable reuse if there is a query context Q1[] and a query of the form
elim A1, … , An Q2
or of the form
select A1, … , An Q2
such that Q = Q1[Q2] and there exists A in ((Q2) – {A1, … , An}) thatalso occurs in Q1[].
Observation: For any Q1, there exists an equivalent class of query Q2 that has no occurrence of variable reuse.
CS848: Topics in Databases: Information Integration
The query containment problem
Q1 is contained in Q2 for database I, written I ² (Q1 v Q2), if, for any tuple t1 in «Q1¬(I), there exists t2 in «Q2¬(I) such that [t1] = [t2] and t1@cnt t2@cnt.
A query containment dependency C has the form (Q1 v Q2). C = (Q1 v Q2) is an axiom if, for any database I, I ² (Q1 v Q2).
A query containment problem for a given set of query containment dependencies is to determine if a given member of the set is an axiom.
CS848: Topics in Databases: Information Integration
Equivalence and containment
Observation: Equivalence reduces to containment.
Q1 ´ Q2 iff Q1 v Q2 and Q2 v Q1
Observation: Containment reduces to equivalence in first order QL.
Q1 v Q2 iff (Q1 minus all Q2) ´ empty (Q1)
CS848: Topics in Databases: Information Integration
Some complexity results
Theorem: The query equivalence and containment problems for conjunctive QL is NP-complete.†
†Chandra, A. K. and P. M. Merlin. Optimal implementation of conjunctive queries in relational databases. Proc. Ninth Annual ACM Symposium on the Theory of Computing, pp. 77–90, 1977.
CS848: Topics in Databases: Information Integration
A decision procedure
Theorem: The following procedure decides if C = (Q1 v Q2) is anaxiom for conjunctive QL.†
1. Freeze the body of Q1 by creating a partial database consisting of individuals that include its variables.
2. If the tuple
hA1 : A1 , … , An : An , cnt : 1i occurs in «Q2¬(I), where (Q1) = {A1, … , An}, then return true;
otherwise return false.‡
†Derived from [Ullman, 1999].‡Use forced semantics for selection operations.
CS848: Topics in Databases: Information Integration
Obtaining a partial database from Q
A1.A2. .Am = B1.B2. .Bn
A1
A2 Am…
B1
B2 Bn…
THING as A A
C as A A : {C}
A1 = A2.R
A2
RA1
CS848: Topics in Databases: Information Integration
Derived partial databases (cont’d)
w : L u : L1
A
v : L2
A
w : L u : L1
A
v : L2
A
u : L1 v : L3
A
x : L4
Aw : L2
u : L1 v : L3
A
x : L4
Aw : L2
CS848: Topics in Databases: Information Integration
Deriving partial databases (cont’d)
n1 : L1 n2 : L2 n1 : L1 [ L2 n2 : L1 [ L2
n1 : L1 n2 : L2 n3 : L3
n1 : L1 n2 : L2 n3 : L3
CS848: Topics in Databases: Information Integration
Evaluating selections on partial databases
Note that selection conditions can navigate missing attribute values. In such cases, assume a forced semantics. In particular, two nodes n1 and n2 satisfy a selection condition iff the condition has the form
n1.Pf1.Pf = n2.Pf2.Pf
where (Pf1)I(n1) and (Pf2)I(n2) are defined and lead to nodes connected by an equality arc.
CS848: Topics in Databases: Information Integration
Some complexity results (cont’d)
Theorem: The query equivalence problem for conjunctive QL with bag semantics is NP-complete.
Observation: The complexity of the query containment problem for conjunctive QL with bag semantics remains open at this time.
Example:† In conjunctive QL with bag semantics, the query containment dependency Q1 v Q2 is an axiom, where Q1 and Q2 have the respective definitions
select x, z select x, z from P as x, R as z from P as x, R as z where x = u.Q and z = v.Q where y = u.Q and y = v.Q
† [Chaudhuri and Vardi, 1993]
CS848: Topics in Databases: Information Integration
The query membership problem
A database schema, denoted T, consists of a finite set {C1, … , Cn} of query containment dependencies.
C is an axiom relative to database schema T = {C1, … , Cn}, writtenT ² C, if, for any database I, I ² C if I ² Ci for each i.
A query membership problem for a given set of query containment dependencies is to determine if a given member of the set is an axiom relative to a given database schema also consisting of members of the set.
CS848: Topics in Databases: Information Integration
More complexity results
Theorem: The query membership problem for conjunctive QL is undecidable.
Theorem: The query membership problem for first order QL is equivalent to the query containment problem for first order QL.
Proof: Assignment.
CS848: Topics in Databases: Information Integration
Evaluating QL
Defining database schema
Expressing access plans
Fine grained APIs: record addresses
Protocols
Safety
Binding patterns
Adequacy: SQL, OQL, XQuery
CS848: Topics in Databases: Information Integration
Modeling generalization taxonomies
Consider a simple object-oriented schema language consisting of sentencesof the following form.†
class C {A1 : ref C1, … , Am : ref Cm} [isa C1 , … , Cn];
Assignment: Encode a fixed collection of such sentences as a databaseschema in conjunctive QL. Your encoding should be as compact as possibleand should enable the following questions to be expressed as querycontainment dependencies over your schema.
1. Is C a defined class?
2. Is attribute A defined on class C?
3. Can an object reside in both class C1 and class C2?†Assume that any object in a database was created with respect to a singleclass.
CS848: Topics in Databases: Information Integration
Modeling pipelined query access plans
(syntax) (defn of (¢))
(parameter) Q ::= (PARAM as A) {A}(index scan) | (from C as A, A.1 = B1, … , A.n = Bn) {A}(nested loops) | (from Q1, Q2) (Q1) [ (Q2)(noop) | (select A1, … , An Q) (Q) Å {A1, … , An}(record field access) | (A1 = A2.B) {A1}(comparison) | (A1 = A2) ;(catenation) | (Q1 union all Q2) (Q1) Å (Q2)(cut) | (elim A1, … , An Q) ;
| …
Require:1. ((Q2) – (Q2)) µ (Q1) for nested loops, and2. (Q) = (Q) for top-level queries.
CS848: Topics in Databases: Information Integration
Alternative semantics
Require richer models theories for
1. sort operations, and
2. named cuts.