  • date post

    19-Dec-2015
Page 1: 1 Oblivious Querying of Data with Irregular Structure.

1

Oblivious Querying of Data with Irregular Structure

Page 2: 1 Oblivious Querying of Data with Irregular Structure.

2

Based on Several Works

• Queries with Incomplete Answers
  – Yaron Kanza, Werner Nutt, Shuky Sagiv

• Flexible Queries
  – Yaron Kanza, Shuky Sagiv

• SQL4X
  – Sara Cohen, Yaron Kanza, Shuky Sagiv

• Computing Full Disjunctions
  – Yaron Kanza, Shuky Sagiv

Page 3: 1 Oblivious Querying of Data with Irregular Structure.

3

Agenda

Why is it difficult to query semistructured data?

Queries with incomplete answers (QwIA)

Flexible queries (FQ)

Oblivious querying = QwIA + FQ

Using QwIA and FQ for information integration

Page 4: 1 Oblivious Querying of Data with Irregular Structure.

4

Agenda

Why is it difficult to query semistructured data?

Queries with incomplete answers (QwIA)

Flexible queries (FQ)

Oblivious querying = QwIA + FQ

Using QwIA and FQ for information integration

Page 5: 1 Oblivious Querying of Data with Irregular Structure.

5

The Semistructured Data Model

• Data is described as a rooted labeled directed graph

• Nodes represent objects

• Edges represent relationships between objects

• Atomic values are attached to atomic nodes
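
The model above can be sketched in a few lines of Python. This is only an illustration: the `Graph` class, its method names, and the node ids are assumptions made for the example and do not come from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """A rooted, labeled, directed graph (hypothetical helper class)."""
    root: int
    edges: set = field(default_factory=set)     # (source, label, target) triples
    values: dict = field(default_factory=dict)  # atomic node -> atomic value

    def add_edge(self, source, label, target, value=None):
        """Add a labeled edge; optionally attach an atomic value to the target."""
        self.edges.add((source, label, target))
        if value is not None:
            self.values[target] = value

    def children(self, node):
        """Return the sorted (label, target) pairs of a node's outgoing edges."""
        return sorted((l, t) for (s, l, t) in self.edges if s == node)


# A small fragment loosely modeled on a movie database; ids are illustrative.
db = Graph(root=1)
db.add_edge(1, "Movie", 11)
db.add_edge(11, "Title", 23, value="Star Wars")
db.add_edge(11, "Year", 24, value="1977")
db.add_edge(11, "Actor", 21)
db.add_edge(21, "Name", 30, value="Mark Hamill")

print(db.children(11))
```

Storing edges as (source, label, target) triples keeps the later checks (edge constraints, reachability) down to simple set membership tests.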

Page 6: 1 Oblivious Querying of Data with Irregular Structure.

6

A Movie Database Example

[Figure: a rooted labeled graph with root node 1 (Movie Database). Edge labels: Movie, Film, T.V. Series, Actor, Title, Name, Year. Atomic values: Star Wars, Dune, Twin Peaks, Léon, Magnolia, 1977, 1984, Kyle MacLachlan, Natalie Portman, Harrison Ford, Mark Hamill.]

Page 7: 1 Oblivious Querying of Data with Irregular Structure.

7

XML that Encodes the Semistructured Data

    <?xml version="1.0"?>
    <MDB>
      <Movie>
        <Title>Star Wars</Title>
        <Year>1977</Year>
        <Actor>
          <Name>Mark Hamill</Name>
        </Actor>
        <Actor>
          <Name>Harrison Ford</Name>
        </Actor>
      </Movie>
      …
    </MDB>

Page 8: 1 Oblivious Querying of Data with Irregular Structure.

8

Consider a Query that Requests Movies, Actors that Acted in the Movies and the Movies’ Year of Release

[Figure: the movie database graph.]

What Should be the Form of the Query?

Page 9: 1 Oblivious Querying of Data with Irregular Structure.

9

Incomplete Data

[Figure: the movie database graph, with two movies highlighted.]

The movie has a year attribute

The year of the movie is missing

Page 10: 1 Oblivious Querying of Data with Irregular Structure.

10

Variations in Structure

[Figure: the movie database graph, highlighting two representations of the same relationship.]

Movie below actor (nodes 14 and 29)

Actor below movie (nodes 11 and 21)

Page 11: 1 Oblivious Querying of Data with Irregular Structure.

11

Ontology Variations

[Figure: the movie database graph, highlighting an edge with a movie label and an edge with a film label.]

Dealing with ontology variations is beyond the scope of this talk

Page 12: 1 Oblivious Querying of Data with Irregular Structure.

12

Irregular Data

• Data is incomplete
  – Missing values of attributes in objects

• Data has structural variations
  – Relationships between objects are represented differently in different parts of the database

• Data has ontology variations
  – Different labels are used to describe objects of the same type

Page 13: 1 Oblivious Querying of Data with Irregular Structure.

13

Irregular data does not conform to a strict schema

Queries over irregular data should not be rigid patterns

The schema cannot guide a user in formulating a query

Page 14: 1 Oblivious Querying of Data with Irregular Structure.

14

In Which Cases is it Difficult to Formulate Queries over Semistructured Data?

• The description of the schema is large (e.g., a DTD of XML)
  – It is difficult to use the schema when formulating queries

• Data is contributed by many users in a variety of designs
  – The query should deal with different structures of data

• The structure of the database is changed frequently
  – Queries should be rewritten frequently

Page 15: 1 Oblivious Querying of Data with Irregular Structure.

15

Can Regular Expressions Help in Querying Irregular Data?

• In many cases, regular expressions can be used to query irregular data

• Yet, regular expressions are
  – Not efficient: it is difficult to evaluate regular expressions
  – Not intuitive: it is difficult for a naïve user to formulate regular expressions

Page 16: 1 Oblivious Querying of Data with Irregular Structure.

16

More on Using Regular Expressions

• When querying irregular data, the size of the regular expression could be exponential in the number of labels in the database
  – For n types of objects, there are $n!$ possible hierarchies
  – For an object with n attributes, there are $2^n$ subsets of missing attributes
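
To get a feeling for how quickly these two counts grow, the following small Python sketch evaluates them for a few values of n. The function name `blowup` is made up for this illustration.

```python
from math import factorial


def blowup(n):
    """Return (orderings of n object types, subsets of n attributes)."""
    return factorial(n), 2 ** n


for n in (2, 4, 8):
    hierarchies, subsets = blowup(n)
    print(f"n={n}: {hierarchies} hierarchies, {subsets} attribute subsets")
```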

Page 17: 1 Oblivious Querying of Data with Irregular Structure.

17

Agenda

Why is it difficult to query semistructured data?

Queries with incomplete answers (QwIA)

Flexible queries (FQ)

Oblivious querying = QwIA + FQ

Using QwIA and FQ for information integration

Page 18: 1 Oblivious Querying of Data with Irregular Structure.

18

Queries with Incomplete Answers

• We have developed queries that deal with incomplete data in a novel way and return incomplete answers

• The queries return maximal answers rather than complete answers

• Different query semantics admit different levels of incompleteness

Page 19: 1 Oblivious Querying of Data with Irregular Structure.

19

Queries with Incomplete Answers

Queries with complete answers

Queries with AND Semantics

Queries with Weak Semantics

Queries with OR Semantics

Increasing level of incompleteness

Page 20: 1 Oblivious Querying of Data with Irregular Structure.

20

Queries and Matchings

• The queries are labeled rooted directed graphs

• Query nodes are variables

• Matchings are assignments of database objects to the query variables according to
  – the constraints specified in the query, and
  – the semantics of the query

Page 21: 1 Oblivious Querying of Data with Irregular Structure.

21

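Constraints on Complete Matchings

• Root Constraint:
  – Satisfied if the query root is mapped to the db root

• Edge Constraint:
  – Satisfied if a query edge with label l is mapped to a database edge with label l

[Figure: the query root r is mapped to the database root 1; a query edge from x to y with label l is mapped to the database edge from node 12 to node 25 with label l.]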
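
The two constraints can be checked mechanically. Below is a minimal Python sketch, assuming that queries and databases are given as sets of (source, label, target) triples and that a matching is a dict from query variables to database node ids (or `None` for null). The function name and the example data are assumptions for illustration only.

```python
def is_complete_matching(query_root, query_edges, db_root, db_edges, mapping):
    """Check the root constraint and all edge constraints, with no nulls."""
    variables = {query_root}
    for (x, _, y) in query_edges:
        variables.update((x, y))
    # In a complete matching every query variable is mapped to a non-null node.
    if any(mapping.get(v) is None for v in variables):
        return False
    # Root constraint: the query root is mapped to the database root.
    if mapping[query_root] != db_root:
        return False
    # Edge constraints: each query edge is mapped to a db edge with the same label.
    return all((mapping[x], label, mapping[y]) in db_edges
               for (x, label, y) in query_edges)


# Illustrative data (node ids are made up for the example).
q_edges = {("r", "Movie", "x"), ("x", "Actor", "y")}
db_edges = {(1, "Movie", 11), (11, "Actor", 21), (11, "Title", 23)}

print(is_complete_matching("r", q_edges, 1, db_edges, {"r": 1, "x": 11, "y": 21}))
print(is_complete_matching("r", q_edges, 1, db_edges, {"r": 1, "x": 11, "y": 23}))
```

The second call fails because node 23 is reached by a Title edge, not an Actor edge, so the edge constraint on (x, Actor, y) is violated.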

Page 22: 1 Oblivious Querying of Data with Irregular Structure.

22

A Complete Matching

[Figure: a query with root r and variables x, y, z, u, v (edge labels Movie, Director, Producer, Uncredited Actor, Name, Date of birth) matched against a movie database, using database nodes 1, 11, 12, 27, 32, 35.]

All the nodes are mapped to non-null values

The root constraint and all the edge constraints are satisfied

Page 23: 1 Oblivious Querying of Data with Irregular Structure.

23

Consider the case where Node 35 is removed from the database

[Figure: the query and the movie database without node 35 (Date of birth, 14 May 1944).]

No Complete Matching Exists!

Page 24: 1 Oblivious Querying of Data with Irregular Structure.

24

Not Every Partial Assignment is an Incomplete Matching

[Figure: the query and the movie database, with the partial assignment r → 1, v → 31 and x, y, z, u mapped to NULL.]

This is not a matching, since the sequence of labels from the database root to Node 31 is different from any sequence of labels that starts at the query root and ends in variable v

Page 25: 1 Oblivious Querying of Data with Irregular Structure.

25

The Reachability Constraint on Partial Matchings

• A query node v that is mapped to a database object o satisfies the reachability constraint if there is a path from the query root to v, such that all edge constraints along this path are satisfied

[Figure: a query with root r, variables x, y, z, w, v and edge labels l1 to l6, shown next to a database fragment rooted at node 1, illustrating paths from r to v along which the edge constraints are satisfied.]

Page 26: 1 Oblivious Querying of Data with Irregular Structure.

26

“And” Matchings

• A partial matching is an AND matching if
  – The root constraint is satisfied
  – The reachability constraint is satisfied by every query node that is mapped to a database node
  – If a query node is mapped to a database node, all the incoming edge constraints are satisfied

[Figure: a query fragment with root r and variables x, y, z, using the edge labels Director, Actor and Producer.]
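
The three conditions translate directly into a check over a partial mapping. The following Python sketch assumes edges are (source, label, target) triples and that null is represented by `None`; all function names and the example data are hypothetical.

```python
def satisfied(edge, mapping, db_edges):
    """Edge constraint: the query edge is mapped to a db edge with the same label."""
    x, label, y = edge
    return (mapping.get(x), label, mapping.get(y)) in db_edges


def reachable_vars(query_root, query_edges, mapping, db_edges):
    """Variables reachable from the query root along satisfied edge constraints."""
    seen, frontier = {query_root}, [query_root]
    while frontier:
        x = frontier.pop()
        for edge in query_edges:
            if edge[0] == x and edge[2] not in seen and satisfied(edge, mapping, db_edges):
                seen.add(edge[2])
                frontier.append(edge[2])
    return seen


def is_and_matching(query_root, query_edges, db_root, db_edges, mapping):
    if mapping.get(query_root) != db_root:          # root constraint
        return False
    mapped = {v for v, o in mapping.items() if o is not None}
    if not mapped <= reachable_vars(query_root, query_edges, mapping, db_edges):
        return False                                # reachability constraint
    # Every incoming edge of a mapped node must be satisfied.
    return all(satisfied(e, mapping, db_edges) for e in query_edges if e[2] in mapped)


# Illustrative data: movie 13 has a title but no year.
q_edges = [("r", "Movie", "x"), ("x", "Title", "y"), ("x", "Year", "z")]
db_edges = {(1, "Movie", 13), (13, "Title", 33), (1, "Movie", 11), (11, "Year", 24)}

print(is_and_matching("r", q_edges, 1, db_edges, {"r": 1, "x": 13, "y": 33, "z": None}))
print(is_and_matching("r", q_edges, 1, db_edges, {"r": 1, "x": 13, "y": 33, "z": 24}))
```

In the second call, z is mapped to the year of a different movie, so z is not reachable through satisfied edges and the mapping is rejected.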

Page 27: 1 Oblivious Querying of Data with Irregular Structure.

27

An AND Matching

[Figure: the query matched against the movie database, using database nodes 1, 11, 12, 27, 32; variable u is mapped to NULL.]

Page 28: 1 Oblivious Querying of Data with Irregular Structure.

28

Suppose that we remove the edges that are labeled with Uncredited Actor

[Figure: the query and the movie database without the Uncredited Actor edges.]

In an AND matching, Node z must be null!

Page 29: 1 Oblivious Querying of Data with Irregular Structure.

29

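Weak Satisfaction of Edge Constraints

• Edge Constraint:
  – Is weakly satisfied if it is either
    • Satisfied (as defined earlier), or
    • One (or more) of its nodes is mapped to a null value

[Figure: four cases for a query edge from x to y with label l: the edge constraint is satisfied (x and y mapped to nodes 12 and 25, joined by an edge labeled l); one of the two variables is mapped to null (two cases); both variables are mapped to null.]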

Page 30: 1 Oblivious Querying of Data with Irregular Structure.

30

Weak Matchings

• A partial matching is a weak matching if
  – The root constraint is satisfied
  – The reachability constraint is satisfied by every query node that is mapped to a database node
  – Every edge constraint is weakly satisfied
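
A weak matching can be tested in the same style. This Python sketch assumes edges are (source, label, target) triples and null is `None`; the helper names and the tiny database are invented for the example.

```python
def satisfied(edge, mapping, db_edges):
    """Edge constraint: the query edge is mapped to a db edge with the same label."""
    x, label, y = edge
    return (mapping.get(x), label, mapping.get(y)) in db_edges


def weakly_satisfied(edge, mapping, db_edges):
    """Satisfied, or at least one endpoint of the edge is mapped to null."""
    x, _, y = edge
    return (mapping.get(x) is None or mapping.get(y) is None
            or satisfied(edge, mapping, db_edges))


def reachable_vars(query_root, query_edges, mapping, db_edges):
    """Variables reachable from the query root along satisfied edge constraints."""
    seen, frontier = {query_root}, [query_root]
    while frontier:
        x = frontier.pop()
        for edge in query_edges:
            if edge[0] == x and edge[2] not in seen and satisfied(edge, mapping, db_edges):
                seen.add(edge[2])
                frontier.append(edge[2])
    return seen


def is_weak_matching(query_root, query_edges, db_root, db_edges, mapping):
    if mapping.get(query_root) != db_root:          # root constraint
        return False
    mapped = {v for v, o in mapping.items() if o is not None}
    if not mapped <= reachable_vars(query_root, query_edges, mapping, db_edges):
        return False                                # reachability constraint
    return all(weakly_satisfied(e, mapping, db_edges) for e in query_edges)


# A dag query: z has two incoming edges, one from x and one from y.
q_edges = [("r", "A", "x"), ("r", "B", "y"), ("x", "C", "z"), ("y", "D", "z")]
db_edges = {(1, "A", 2), (2, "C", 3), (1, "B", 4)}

print(is_weak_matching("r", q_edges, 1, db_edges, {"r": 1, "x": 2, "y": None, "z": 3}))
print(is_weak_matching("r", q_edges, 1, db_edges, {"r": 1, "x": 2, "y": 4, "z": 3}))
```

The first mapping is accepted because the edge from y to z has a null endpoint; the second is rejected because both endpoints of that edge are non-null while the database has no D edge from node 4 to node 3.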

Page 31: 1 Oblivious Querying of Data with Irregular Structure.

31

A Weak Matching

[Figure: the query matched against the movie database, using database nodes 1, 11, 27, 32; variables u and y are mapped to NULL. The edges that are weakly satisfied are highlighted.]

Page 32: 1 Oblivious Querying of Data with Irregular Structure.

32

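[Figure: the four ways in which the constraint of a query edge from x to y can be weakly satisfied: (1) the constraint is satisfied; (2) x is mapped and y is null; (3) both x and y are null; (4) x is null and y is mapped.]

In a weak matching, all four options are permitted

In an AND matching, only the first three options are permitted (a mapped node must have all of its incoming edge constraints satisfied)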

Page 33: 1 Oblivious Querying of Data with Irregular Structure.

33

Consider the case where edges labeled with Producer are removed

[Figure: the query and the movie database without the Producer edges.]

In a weak matching, Node z must be null!

Page 34: 1 Oblivious Querying of Data with Irregular Structure.

34

“OR” Matchings

• A partial matching is an OR matching if
  – The root constraint is satisfied
  – The reachability constraint is satisfied by every query node that is mapped to a database node

Page 35: 1 Oblivious Querying of Data with Irregular Structure.

35

An OR Matching

[Figure: the query matched against the movie database, using database nodes 1, 11, 27, 32; variables u and y are mapped to NULL. One edge is marked as not weakly satisfied.]

Page 36: 1 Oblivious Querying of Data with Irregular Structure.

36

Increasing Level of Incompleteness

• A complete matching is an AND matching

• An AND matching is a weak matching

• A weak matching is an OR matching
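
The containments above suggest a simple classifier that reports the strictest semantics a given partial mapping satisfies. The Python sketch below is an illustration under the assumption that edges are (source, label, target) triples and null is `None`; `classify` and the example data are not from the talk.

```python
def satisfied(edge, mapping, db_edges):
    x, label, y = edge
    return (mapping.get(x), label, mapping.get(y)) in db_edges


def reachable_vars(query_root, query_edges, mapping, db_edges):
    """Variables reachable from the query root along satisfied edge constraints."""
    seen, frontier = {query_root}, [query_root]
    while frontier:
        x = frontier.pop()
        for edge in query_edges:
            if edge[0] == x and edge[2] not in seen and satisfied(edge, mapping, db_edges):
                seen.add(edge[2])
                frontier.append(edge[2])
    return seen


def classify(query_root, query_edges, db_root, db_edges, mapping):
    """Return 'complete', 'AND', 'weak', 'OR', or None (not a matching at all)."""
    if mapping.get(query_root) != db_root:
        return None
    mapped = {v for v, o in mapping.items() if o is not None}
    if not mapped <= reachable_vars(query_root, query_edges, mapping, db_edges):
        return None
    sat = {e for e in query_edges if satisfied(e, mapping, db_edges)}
    variables = {query_root} | {e[0] for e in query_edges} | {e[2] for e in query_edges}
    if variables <= mapped and sat == set(query_edges):
        return "complete"
    if all(e in sat for e in query_edges if e[2] in mapped):
        return "AND"
    if all(e in sat or e[0] not in mapped or e[2] not in mapped for e in query_edges):
        return "weak"
    return "OR"


q_edges = [("r", "A", "x"), ("r", "B", "y"), ("x", "C", "z"), ("y", "D", "z")]
db_edges = {(1, "A", 2), (2, "C", 3), (1, "B", 4)}

print(classify("r", q_edges, 1, db_edges, {"r": 1, "x": 2, "y": 4, "z": None}))
print(classify("r", q_edges, 1, db_edges, {"r": 1, "x": 2, "y": None, "z": 3}))
print(classify("r", q_edges, 1, db_edges, {"r": 1, "x": 2, "y": 4, "z": 3}))
```

Because the tests are tried from strictest to loosest, a mapping labeled "AND" is also a weak matching and an OR matching, mirroring the hierarchy on this slide.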

Page 37: 1 Oblivious Querying of Data with Irregular Structure.

37

Maximal Matchings

• Matchings are represented as tuples of oid’s and null values

• A tuple t1 subsumes a tuple t2 if t1 is the result of replacing some null values in t2 by non-null values:

    t1 = (1, 5, 2, null)
    t2 = (1, null, 2, null)

• A matching is maximal if no other matching subsumes it

• A query result consists of maximal matchings only
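
Subsumption and the filtering of non-maximal matchings can be sketched as follows in Python, with null represented by `None`. The function names are chosen for the example, and the quadratic filter is a naive illustration rather than an efficient algorithm.

```python
def subsumes(t1, t2):
    """t1 subsumes t2 if t1 results from replacing some nulls of t2 by non-nulls."""
    return (t1 != t2 and len(t1) == len(t2)
            and all(b is None or a == b for a, b in zip(t1, t2)))


def maximal(matchings):
    """Keep only the matchings that no other matching subsumes."""
    return [t for t in matchings if not any(subsumes(o, t) for o in matchings)]


t1 = (1, 5, 2, None)
t2 = (1, None, 2, None)
t3 = (1, None, 3, 4)

print(subsumes(t1, t2))       # t1 fills in the second position of t2
print(maximal([t1, t2, t3]))  # t2 is dropped because t1 subsumes it
```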

Page 38: 1 Oblivious Querying of Data with Irregular Structure.

38

On the Complexity of Computing Queries with Incomplete Answers

• The size of the result can be exponential in the size of the input (database and query)
  – Note that the same is true when joining relations: the size of the result can be exponential in the size of the input (database and query)

• Instead of using data complexity (where the runtime depends only on the size of the database), we use input-output complexity

Page 39: 1 Oblivious Querying of Data with Irregular Structure.

39

Input-Output Complexity

In input-output complexity, the time complexity is a function of the size of the query, the size of the database, and the size of the result.

Page 40: 1 Oblivious Querying of Data with Irregular Structure.

40

The Motivation for Using I/O Complexity

• Measuring the time complexity with respect to the size of the input does not distinguish between the following two cases:
  – An algorithm that does an exponential amount of work simply because the size of the output is exponential in the size of the input
  – An algorithm that does an exponential amount of work even when the query result is small
    • Either the algorithm is naïve (e.g., it unnecessarily computes subsumed matchings) or the problem is hard

Page 41: 1 Oblivious Querying of Data with Irregular Structure.

41

I/O Complexity of Query Evaluation (lower bounds are for non-emptiness)

Query / Semantics   Path Query   Tree Query   DAG Query     Cyclic Query
Complete            PTIME        PTIME        NP-Complete   NP-Complete
AND                 PTIME        PTIME        PTIME         NP-Complete
Weak                PTIME        PTIME        PTIME         PTIME
OR                  PTIME        PTIME        PTIME         PTIME

Recent Results (PODS’03)

Page 42: 1 Oblivious Querying of Data with Irregular Structure.

42

Filter Constraints

• Constraints that filter the results (i.e., the maximal matchings)

• There are
  – Weak filter constraints (the constraint is satisfied if a variable in the constraint is null)
  – Strong filter constraints (all variables must be non-null for satisfaction)

• Existence constraint: !x is true if x is not null
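
The difference between weak and strong filter constraints only shows up when a variable is null. A minimal Python sketch, assuming a matching is a dict from variables to values with `None` for null; `weak_filter`, `strong_filter` and `exists` are hypothetical helper names.

```python
def strong_filter(predicate, variables):
    """Satisfied only if all variables are non-null and the predicate holds."""
    def check(matching):
        values = [matching.get(v) for v in variables]
        return all(val is not None for val in values) and predicate(*values)
    return check


def weak_filter(predicate, variables):
    """Satisfied if some variable is null, or if the predicate holds."""
    def check(matching):
        values = [matching.get(v) for v in variables]
        return any(val is None for val in values) or predicate(*values)
    return check


def exists(variable):
    """Existence constraint !x: true if x is not null."""
    return lambda matching: matching.get(variable) is not None


# An inequality constraint x != y, in its weak and strong forms.
weak_neq = weak_filter(lambda a, b: a != b, ["x", "y"])
strong_neq = strong_filter(lambda a, b: a != b, ["x", "y"])

partial = {"x": 21, "y": None}
print(weak_neq(partial), strong_neq(partial), exists("y")(partial))
```

For the partial matching in the example, the weak form accepts (y is null) while the strong form and the existence constraint on y both reject.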

Page 43: 1 Oblivious Querying of Data with Irregular Structure.

43

I/O Complexity of Query Evaluation with Existence Constraints (lower bounds are for non-emptiness)

Query / Semantics   Path Query   Tree Query   DAG Query     Cyclic Query
Complete            PTIME        PTIME        NP-Complete   NP-Complete
AND                 PTIME        PTIME        NP-Complete   NP-Complete
Weak                PTIME        PTIME        NP-Complete   NP-Complete
OR                  PTIME        PTIME        NP-Complete   NP-Complete

Page 44: 1 Oblivious Querying of Data with Irregular Structure.

44

I/O Complexity of Query Evaluation with Weak Equality/Inequality Constraints (lower bounds are for non-emptiness)

Query / Semantics   Path Query   Tree Query    DAG Query     Cyclic Query
Strong              PTIME        NP-Complete   NP-Complete   NP-Complete
AND                 PTIME        NP-Complete   NP-Complete   NP-Complete
Weak                PTIME        NP-Complete   NP-Complete   NP-Complete
OR                  PTIME        NP-Complete   NP-Complete   NP-Complete

Page 45: 1 Oblivious Querying of Data with Irregular Structure.

45

Query Containment

• Query containment for queries with incomplete answers is defined differently from query containment for queries with complete answers

• Q1 ⊆ Q2 if for every database D, every matching of Q1 w.r.t. D is subsumed by a matching of Q2 w.r.t. D

• Query containment (query equivalence) is useful for the development of optimization techniques

Page 46: 1 Oblivious Querying of Data with Irregular Structure.

46

Containment in AND Semantics

• Homomorphism between the query graphs is necessary and sufficient for containment

• Deciding whether one query is contained in another is NP-Complete

[Figure: two example queries, Q1 (variables r, x, y, z, u, v) and Q2 (variables r, p, q, u, v), over edge labels l1 to l4, with a homomorphism between them.]

Page 47: 1 Oblivious Querying of Data with Irregular Structure.

47

Containment in OR Semantics

• The following is a necessary and sufficient condition for query containment in OR semantics

• For every spanning tree T1 of the contained query, there is a spanning tree T2 of the containing query, such that there is a homomorphism from T2 to T1

• Deciding containment
  – is in $\Pi_2^P$
  – is NP-Complete if the containee is a tree
  – is polynomial if the container is a tree

Page 48: 1 Oblivious Querying of Data with Irregular Structure.

48

Containment in Weak Semantics

• Similar to containment in OR Semantics, with the following difference

• Instead of checking homomorphism between spanning trees, we check homomorphism between graph fragments
  – A graph fragment is a restriction of the query to a subset of the variables that includes the query root such that every node in the fragment is reachable from the root

Page 49: 1 Oblivious Querying of Data with Irregular Structure.

49

Agenda

Why is it difficult to query semistructured data?

Queries with incomplete answers (QwIA)

Flexible queries (FQ)

Oblivious querying = QwIA + FQ

Using QwIA and FQ for information integration

Page 50: 1 Oblivious Querying of Data with Irregular Structure.

50

Flexible Queries

• To deal with structural variations in the data, we have developed flexible queries

Page 51: 1 Oblivious Querying of Data with Irregular Structure.

51

Flexible Queries

Rigid Queries

Semiflexible Queries

Flexible Queries

Increasing level of flexibility

Page 52: 1 Oblivious Querying of Data with Irregular Structure.

52

Example

A query that finds all pairs of actors that acted in the same movie

[Figure: a query over the Movie Database with root r and variables x, y, z, using two Actor edges and two Movie edges.]

However, if in the database, actors are descendants of movies, the query has to be reformulated

Instead, we propose new ways of matching queries to databases

Page 53: 1 Oblivious Querying of Data with Irregular Structure.

53

Rigid matchings and complete matchings are the same

Returning rigid matchings is the usual semantics for queries (e.g., XQuery, Lorel, XML-QL, etc.)

Page 54: 1 Oblivious Querying of Data with Irregular Structure.

54

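Constraints on Rigid Matchings

• Root Constraint:
  – Satisfied if the query root is mapped to the db root

• Edge Constraint:
  – Satisfied if a query edge with label l is mapped to a database edge with label l

[Figure: the query root r is mapped to the database root 1; a query edge from x to y with label l is mapped to the database edge from node 12 to node 25 with label l.]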

Page 55: 1 Oblivious Querying of Data with Irregular Structure.

55

[Figure: the movie database graph and a path query with root r, an Actor edge to x and a Movie edge to y.]

A Rigid Matching: the query is mapped to database nodes 1, 14, 29

This is not a Rigid Matching: the query is mapped to database nodes 1, 25, 12

Page 56: 1 Oblivious Querying of Data with Irregular Structure.

56

A Semiflexible Matching

• The query root is mapped to the db root

• A query node with an incoming label l is mapped to a db node with an incoming label l

• The image of every query path is embedded in some database path

• SCC is mapped to SCC

[Figure: the query root r is mapped to the DB root 1; a query edge from x to y with label l, with x and y mapped to database nodes 9 and 11.]

Page 57: 1 Oblivious Querying of Data with Irregular Structure.

57

A Semiflexible Matching

• The query root is mapped to the db root

• A query node with an incoming label l is mapped to a db node with an incoming label l

• The image of every query path is embedded in some database path

• SCC is mapped to SCC

The last two conditions cannot be verified locally, i.e., by considering one query edge at a time

[Figure: the query root r is mapped to the DB root 1; a query edge from x to y with label l, with x and y mapped to database nodes 9 and 11.]

Page 58: 1 Oblivious Querying of Data with Irregular Structure.

58

The Semiflexible Matchings

[Figure: the movie database graph and the path query with root r, an Actor edge to x and a Movie edge to y. The semiflexible matchings use the database nodes (1, 25, 12), (1, 14, 29), (1, 22, 11) and (1, 21, 11).]

We get all the actor-movie pairs

Page 59: 1 Oblivious Querying of Data with Irregular Structure.

59

[Figure: two path queries over the variables r, x, y, both using one Actor edge and one Movie edge; the two edges appear in opposite order in the two queries.]

Under semiflexible semantics, these two queries are equivalent

The user does not have to know if movies are above or below actors in the database

Page 60: 1 Oblivious Querying of Data with Irregular Structure.

60

Another Example of a Semiflexible Matching

[Figure: the movie database graph and a dag query with root r and variables x, y, z over Actor and Movie edges; the matchings shown use database nodes 1, 11, 21, 22.]

We get pairs of actors that acted in the same movie

Impossible to get this pair by means of a rigid matching, since the query is a dag and the db is a tree

Page 61: 1 Oblivious Querying of Data with Irregular Structure.

61

A Flexible Matching

• The query root is mapped to the db root

• A query node with an incoming label l is mapped to a db node with an incoming label l

• An edge is mapped to two nodes on one path

• Notice that a path in the query is not necessarily mapped to a path in the db

[Figure: the query root r is mapped to the DB root 1; a query edge from x to y with label l, with x and y mapped to database nodes 9 and 11.]
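
Unlike the semiflexible conditions, the flexible conditions can be tested one query edge at a time. The Python sketch below assumes edges are (source, label, target) triples and that every variable is mapped; the function names and the example database are assumptions made for illustration.

```python
def reaches(db_edges, a, b):
    """True if the database has a directed path from a to b.

    Assumption of this sketch: a node trivially reaches itself."""
    seen, stack = {a}, [a]
    while stack:
        node = stack.pop()
        if node == b:
            return True
        for (s, _, t) in db_edges:
            if s == node and t not in seen:
                seen.add(t)
                stack.append(t)
    return False


def is_flexible_matching(query_root, query_edges, db_root, db_edges, mapping):
    if mapping.get(query_root) != db_root:
        return False
    for (x, label, y) in query_edges:
        ox, oy = mapping.get(x), mapping.get(y)
        if ox is None or oy is None:
            return False
        # The image of y must have an incoming db edge with the same label.
        if not any(t == oy and l == label for (_, l, t) in db_edges):
            return False
        # The two images must lie on one path; the direction is unimportant.
        if not (reaches(db_edges, ox, oy) or reaches(db_edges, oy, ox)):
            return False
    return True


# The database stores movies below actors; the query asks for actors below movies.
db_edges = {(1, "Actor", 14), (14, "Movie", 29), (1, "Actor", 15)}
q_edges = [("r", "Movie", "x"), ("x", "Actor", "y")]

print(is_flexible_matching("r", q_edges, 1, db_edges, {"r": 1, "x": 29, "y": 14}))
print(is_flexible_matching("r", q_edges, 1, db_edges, {"r": 1, "x": 29, "y": 15}))
```

The first mapping is accepted even though the query and the database order Movie and Actor differently; the second is rejected because nodes 29 and 15 do not lie on a common path.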

Page 62: 1 Oblivious Querying of Data with Irregular Structure.

62

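An Example of a Flexible Query

[Figure: a query graph with root r; each variable is listed below with its incoming edge label.]

• x (Director): a director

• y (Name): the director name

• z (Movie): a movie of the director

• v (Title): the title of the movie

• u (Actor): an actor in the movie

• w (Name): the name of the actor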

Page 63: 1 Oblivious Querying of Data with Irregular Structure.

63

[Figure: the flexible query with variables r, x, y, z, u, v, w matched against the movie database, using database nodes 1, 12, 25, 26, 29, 33, 34.]

A query edge is mapped to two db nodes on one path

This flexible matching is neither a rigid matching nor a semiflexible matching

Page 64: 1 Oblivious Querying of Data with Irregular Structure.

64

Why are semiflexible matchings preferred sometimes to flexible matchings?

[Figure: a query with root r and variables x, y (edge labels Movie and Producer) matched against the movie database, using database nodes 1, 27, 11.]

In this flexible matching, a producer is given with a movie that he directed but did not produce

Page 65: 1 Oblivious Querying of Data with Irregular Structure.

65

[Figure: the same query (edge labels Movie and Producer) and the movie database, with database nodes 1, 99 and 11 highlighted.]

In semiflexible semantics, the problem is solved since the image of a query path is embedded in a database path

Page 66: 1 Oblivious Querying of Data with Irregular Structure.

66

Differences Between the Semiflexible and Flexible Semantics

• On a technical level, in flexible matchings
  – Query paths are not necessarily embedded in database paths
  – SCCs are not necessarily mapped to SCCs

• On a conceptual level, in the semiflexible semantics, nodes are “semantically related” if they are on the same path, and hence
  – Query paths are embedded in database paths

• In the flexible semantics, this condition is relaxed:
  – Query edges are embedded in database paths

Page 67: 1 Oblivious Querying of Data with Irregular Structure.

67

Increasing Level of Flexibility

• A rigid matching is a semiflexible matching

• A semiflexible matching is a flexible matching

Page 68: 1 Oblivious Querying of Data with Irregular Structure.

68

Verifying that Mappings are Semiflexible Matchings

• Is a given mapping of query nodes to database nodes a semiflexible matching?
  – Not as simple as for rigid matchings (no local test, i.e., need to consider paths rather than edges)

• In a dag query, the number of paths may be exponential
  – Yet, verifying is in polynomial time

• In a cyclic query, the number of paths may be infinite
  – Yet, verifying is in exponential time

Page 69: 1 Oblivious Querying of Data with Irregular Structure.

69

Verifying that a Mapping is a Semiflexible Matching

Query / Database   Path Query   Tree Query   DAG Query   Cyclic Query
Path Database      PTIME        PTIME        PTIME       No matchings
Tree Database      PTIME        PTIME        PTIME       No matchings
DAG Database       PTIME        PTIME        PTIME       No matchings
Cyclic Database    PTIME        PTIME        coNP        coNP

Page 70: 1 Oblivious Querying of Data with Irregular Structure.

70

Input-Output Complexity of Query Evaluation for the Semiflexible Semantics

• The next slide summarizes results about the input-output complexity
  – Polynomial for a dag query and a tree database (or simpler cases)
    • Rather difficult to prove, even when the query is a tree, since there is no local test for verifying that mappings are semiflexible matchings
  – Exponential lower bounds for other cases

Page 71: 1 Oblivious Querying of Data with Irregular Structure.

71

I/O Complexity for SF Semantics (lower bounds are for non-emptiness)

Query / Database   Path Query    Tree Query    DAG Query         Cyclic Query
Path Database      PTIME         PTIME         PTIME             Result is empty
Tree Database      PTIME         PTIME         PTIME             Result is empty
DAG Database       NP-Complete   NP-Complete   NP-Complete       Result is empty
Cyclic Database    NP-Complete   NP-Complete   NP-Hard (in P2)   NP-Hard (in P2)

Data Complexity is Polynomial in all Cases

Page 72: 1 Oblivious Querying of Data with Irregular Structure.

72

Query Evaluation for the Flexible Semantics

• The database is replaced with a relationship graph, which is a graph such that
– The nodes are the nodes of the database
– Two nodes are connected by an edge if there is a path between them in the database (the direction of the path is unimportant)

• The query is evaluated under rigid semantics w.r.t. the relationship graph
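
A minimal Python sketch of building the relationship graph is given here. It assumes that "a path between them" means a directed path in either direction, and it ignores edge labels. The function name `relationship_graph` and the toy database are hypothetical and serve only as an illustration.

```python
def relationship_graph(nodes, edges):
    """Return the set of unordered node pairs {a, b} such that the database
    has a directed path from a to b or from b to a (assumed reading)."""
    successors = {n: set() for n in nodes}
    for source, target in edges:
        successors[source].add(target)

    related = set()
    for start in nodes:
        # Depth-first search collecting every node reachable from start.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in successors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        for other in seen:
            if other != start:
                related.add(frozenset((start, other)))
    return related


# Hypothetical toy database: a small tree rooted at node 1.
toy_nodes = [1, 2, 3, 4, 5]
toy_edges = [(1, 2), (1, 3), (2, 4), (3, 5)]
toy_graph = relationship_graph(toy_nodes, toy_edges)

print(frozenset((1, 4)) in toy_graph)  # True: path 1 -> 2 -> 4
print(frozenset((2, 3)) in toy_graph)  # False: siblings, no path either way
```

A rigid matching over this graph then only needs to check, for each query edge, that the two image nodes form a pair in the returned set.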

Page 73: 1 Oblivious Querying of Data with Irregular Structure.

73

I/O Complexity of Query Evaluation for the Flexible Semantics

• Results follow from a reduction to query evaluation under the rigid semantics

• Tree query
– Input-Output complexity is polynomial

• DAG query
– Testing for non-emptiness is NP-Complete

Page 74: 1 Oblivious Querying of Data with Irregular Structure.

74

Query Containment

• Q1 ⊆ Q2 if, for every database D, the set of matchings of Q1 w.r.t. D is contained in the set of matchings of Q2 w.r.t. D

• We assume that
– Both queries have the same set of variables

Page 75: 1 Oblivious Querying of Data with Irregular Structure.

75

Complexity of Query Containment

• Under the semiflexible semantics, Q1 ⊆ Q2 iff the identity mapping from the variables of Q2 to the variables of Q1 is a semiflexible matching of Q2 w.r.t. Q1

• Thus, containment is
– in coNP when Q1 is a cyclic graph and Q2 is either a DAG or a cyclic graph
– in polynomial time in all other cases

• Under the flexible semantics, query containment is always in polynomial time

Page 76: 1 Oblivious Querying of Data with Irregular Structure.

76

Database Equivalence

• D1 and D2 are equivalent if, for every query Q, the set of matchings of Q w.r.t. D1 is equal to the set of matchings of Q w.r.t. D2

• Both databases must have the same set of objects and the same root

Page 77: 1 Oblivious Querying of Data with Irregular Structure.

77

Complexity of Database Equivalence

• For the semiflexible semantics, deciding equivalence of databases is
– in polynomial time if both databases are DAGs
– in coNP if one of the databases has cycles

• For the flexible semantics, deciding equivalence of databases is polynomial in all cases

Page 78: 1 Oblivious Querying of Data with Irregular Structure.

78

Database Transformation

[Figure: two versions of a small movie database rooted at node 1 (MDB), with nodes 2, 3, 4, 6, and 8, Actor and Movie edges, and the values Hook, Star Wars, Dustin Hoffman, Harrison Ford, and Mark Hamill. The first version is a DAG; the second is a tree.]

The databases are equivalent under both the flexible and semiflexible semantics

A DAG has become a TREE!

Page 79: 1 Oblivious Querying of Data with Irregular Structure.

79

Transforming a Database into a Tree

• Reasons for transforming a database into an equivalent tree database:
– Evaluation of queries over a tree database is more efficient
– In a graphical user interface, it is easier to represent trees than DAGs or cyclic graphs
– Storing the data in a serial form (e.g., XML) requires no references

Page 80: 1 Oblivious Querying of Data with Irregular Structure.

80

Transformation into a Tree

• There are algorithms for
– Testing if a database can be transformed into an equivalent tree database, and
– Performing the transformation

• For the semiflexible semantics
– The algorithms are polynomial

• For the flexible semantics
– The algorithms are exponential

Page 81: 1 Oblivious Querying of Data with Irregular Structure.

81

Implementing Flexible Queries

• Flexible queries were implemented in SQL4X

• In an SQL4X query, relations and XML documents are queried simultaneously

• A query result can be either a relation or an XML document

Page 82: 1 Oblivious Querying of Data with Irregular Structure.

82

An SQL4X Query

QUERY AS RELATION
SELECT text(y) as director, text(v) as title
FROM x Director of 'MDB.xml', y Name of x,
     z Movie of x, v Title of z

[Figure: the query graph. The root r has a Director edge to x; x has a Name edge to y and a Movie edge to z; z has a Title edge to v.]

A query under the Flexible Semantics

Page 83: 1 Oblivious Querying of Data with Irregular Structure.

83

An SQL4X Query

QUERY AS RELATION
SELECT text(y) as director, text(v) as title
FROM x Director of 'MDB.xml', y Name of x,
     z Movie of x, v Title of z
WHERE text(v) = 'Star Wars'

[Figure: the query graph over the variables r, x, y, z, and v, with Director, Name, Movie, and Title edges.]

A query under the Flexible Semantics

Constraints can be added

Page 84: 1 Oblivious Querying of Data with Irregular Structure.

84

An SQL4X Query

QUERY AS RELATION
SELECT text(x) as director, text(v) as title, Budget
FROM x Director of 'MDB.xml', y Name of x,
     z Movie of x, v Title of z, FilmBudgets
WHERE text(v) = FilmBudgets.Title

[Figure: the query graph over the variables r, x, y, z, and v, with Director, Name, Movie, and Title edges.]

A query under the Flexible Semantics

Relations and XML Documents can be queried simultaneously

FilmBudgets: a relation with data about film budgets, with columns Title and Budget

Page 85: 1 Oblivious Querying of Data with Irregular Structure.

85

Agenda

Why is it difficult to query semistructured data?

Queries with incomplete answers (QwIA)

Flexible queries (FQ)

Oblivious querying = QwIA + FQ

Using QwIA and FQ for information integration

Page 86: 1 Oblivious Querying of Data with Irregular Structure.

86

Combining the Paradigms

• In oblivious querying:
– The user does not have to know where data is incomplete
– The user does not have to know the exact structure of the data

• The paradigm of flexible queries and the paradigm of queries with incomplete answers should be combined

Page 87: 1 Oblivious Querying of Data with Irregular Structure.

87

Flexible Queries with Incomplete Answers

• A flexible query w.r.t. a database is actually a rigid query w.r.t. the relationship graph

• Evaluating a query under AND-semantics (or weak semantics, or OR-semantics) w.r.t. the relationship graph yields a flexible query that returns maximal answers rather than complete answers

Page 88: 1 Oblivious Querying of Data with Irregular Structure.

88

[Figure: the movie database graph (root node 1, with Movie, Actor, Director, Producer, Name, Title, Year, and Date of birth edges) shown next to a query graph over the variables r, x, y, z, u, v, and w. Highlighted in the database: Actor node 25 with its Name node 33 (Dustin Hoffman).]

Consider the case where Node 25 and Node 33 are removed

Page 89: 1 Oblivious Querying of Data with Irregular Structure.

89

[Figure: the movie database graph after the removal of Node 25 and Node 33, shown next to the query graph over the variables r, x, y, z, u, v, and w. The database nodes highlighted as images of query variables are 1, 29, 12, 34, and 26; u and w are mapped to NULL.]

A flexible matching which is also an incomplete (maximal) matching

Page 90: 1 Oblivious Querying of Data with Irregular Structure.

90

Agenda

Why is it difficult to query semistructured data?

Queries with incomplete answers (QwIA)

Flexible queries (FQ)

Oblivious querying = QwIA + FQ

Using QwIA and FQ for information integration

Page 91: 1 Oblivious Querying of Data with Irregular Structure.

91

Full Disjunction

• Intuitively, the full disjunction of a given set of relations is the join of these relations that does not discard dangling tuples

• Dangling tuples are padded with nulls

• Only maximal tuples are retained in the full disjunction (as in the case of QwIA)

Page 92: 1 Oblivious Querying of Data with Irregular Structure.

92

Movies
m-id | title | year | language
1 | Zelig | 1983 | English
2 | Antz | 1998 | English
3 | Armageddon | 1998 | English
4 | Fantasia | 1940 | English

Actors
a-id | name | date-of-birth
1 | Woody Allen | 1/12/1935
2 | Bruce Willis | 19/3/1955
3 | Julia Roberts | 28/10/1967

Acted-in
a-id | m-id | role
1 | 1 | Zelig
1 | 2 | Z
2 | 3 | Harry

Actors-that-Directed
a-id | m-id
1 | 1

The Full Disjunction of the Given Relations
m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | 1 | Woody Allen | 1/12/1935 | Zelig
2 | Antz | 1998 | English | 1 | Woody Allen | 1/12/1935 | Z
3 | Armageddon | 1998 | English | 2 | Bruce Willis | 19/3/1955 | Harry
4 | Fantasia | 1940 | English | null | null | null | null
null | null | null | null | 3 | Julia Roberts | 28/10/1967 | null

Page 93: 1 Oblivious Querying of Data with Irregular Structure.

93

The Full Disjunction of the Given Relations
m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | 1 | Woody Allen | 1/12/1935 | Zelig
2 | Antz | 1998 | English | 1 | Woody Allen | 1/12/1935 | Z
3 | Armageddon | 1998 | English | 2 | Bruce Willis | 19/3/1955 | Harry
4 | Fantasia | 1940 | English | null | null | null | null
null | null | null | null | 3 | Julia Roberts | 28/10/1967 | null

Movies
m-id | title | year | language
1 | Zelig | 1983 | English
2 | Antz | 1998 | English
3 | Armageddon | 1998 | English
4 | Fantasia | 1940 | English

The full disjunction does not include subsumed tuples

m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | null | null | null | null

This tuple will not be in the full disjunction

Page 94: 1 Oblivious Querying of Data with Irregular Structure.

94

Movies
m-id | title | year | language
1 | Zelig | 1983 | English
2 | Antz | 1998 | English
3 | Armageddon | 1998 | English
4 | Fantasia | 1940 | English

Actors
a-id | name | date-of-birth
1 | Woody Allen | 1/12/1935
2 | Bruce Willis | 19/3/1955
3 | Julia Roberts | 28/10/1967

Acted-in
a-id | m-id | role
1 | 1 | Zelig
1 | 2 | Z
2 | 3 | Harry

Actors-that-Directed
a-id | m-id
1 | 1

The Full Disjunction of the Given Relations
m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | 1 | Woody Allen | 1/12/1935 | Zelig
2 | Antz | 1998 | English | 1 | Woody Allen | 1/12/1935 | Z
3 | Armageddon | 1998 | English | 2 | Bruce Willis | 19/3/1955 | Harry
4 | Fantasia | 1940 | English | null | null | null | null
null | null | null | null | 3 | Julia Roberts | 28/10/1967 | null

The full disjunction does not include tuples that are based on Cartesian Product rather than join

m-id | title | year | language | a-id | name | date-of-birth | role
4 | Fantasia | 1940 | English | 3 | Julia Roberts | 28/10/1967 | null

Page 95: 1 Oblivious Querying of Data with Irregular Structure.

95

In the Full Disjunction of a Given Set of Relations:

Every tuple of the input is a part of at least one tuple of the output

Tuples are joined as in a natural join, padded with null values

The result includes only “maximal connected portions”
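
The three properties above can be illustrated with a brute-force Python sketch on the Movies/Actors example. This is only an illustration under stated assumptions, not the polynomial algorithm mentioned in this talk: it enumerates every combination of tuples (at most one per relation), so it is exponential in the number of relations. The function names, the use of `None` for null, and the dictionary representation of tuples are assumptions made for the example.

```python
from itertools import combinations, product


def connected(tuples):
    """True if the tuples form one component when linked by shared attributes."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(tuples)):
            if j not in seen and tuples[i].keys() & tuples[j].keys():
                seen.add(j)
                stack.append(j)
    return len(seen) == len(tuples)


def join_consistent(tuples):
    """True if every pair of tuples agrees on all attributes they share."""
    return all(a[k] == b[k]
               for a, b in combinations(tuples, 2)
               for k in a.keys() & b.keys())


def full_disjunction(relations):
    attrs = sorted({k for rows in relations.values() for row in rows for k in row})
    candidates = set()
    names = list(relations)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            for combo in product(*(relations[n] for n in subset)):
                if connected(combo) and join_consistent(combo):
                    merged = {}
                    for t in combo:
                        merged.update(t)
                    # Pad missing attributes with None (null).
                    candidates.add(tuple(merged.get(a) for a in attrs))

    def subsumed(small, big):
        return small != big and all(s is None or s == b for s, b in zip(small, big))

    maximal = [t for t in candidates if not any(subsumed(t, o) for o in candidates)]
    return [dict(zip(attrs, t)) for t in maximal]


relations = {
    "Movies": [
        {"m-id": 1, "title": "Zelig", "year": 1983, "language": "English"},
        {"m-id": 2, "title": "Antz", "year": 1998, "language": "English"},
        {"m-id": 3, "title": "Armageddon", "year": 1998, "language": "English"},
        {"m-id": 4, "title": "Fantasia", "year": 1940, "language": "English"},
    ],
    "Actors": [
        {"a-id": 1, "name": "Woody Allen", "date-of-birth": "1/12/1935"},
        {"a-id": 2, "name": "Bruce Willis", "date-of-birth": "19/3/1955"},
        {"a-id": 3, "name": "Julia Roberts", "date-of-birth": "28/10/1967"},
    ],
    "Acted-in": [
        {"a-id": 1, "m-id": 1, "role": "Zelig"},
        {"a-id": 1, "m-id": 2, "role": "Z"},
        {"a-id": 2, "m-id": 3, "role": "Harry"},
    ],
    "Actors-that-Directed": [{"a-id": 1, "m-id": 1}],
}

result = full_disjunction(relations)
print(len(result))  # 5
```

The Fantasia and Julia Roberts tuples stay separate because Movies and Actors share no attribute, so pairing them would be a Cartesian product rather than a join.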

Page 96: 1 Oblivious Querying of Data with Irregular Structure.

96

Motivation for Full Disjunctions

• Full disjunctions were proposed by Galindo-Legaria as an alternative to outerjoins [SIGMOD’94]

• Rajaraman and Ullman suggested using full disjunctions for information integration [PODS’96]

Page 97: 1 Oblivious Querying of Data with Irregular Structure.

97

Computing Full Disjunctions for γ-acyclic Relation Schemas

• Rajaraman and Ullman have shown how to evaluate the full disjunction by a sequence of natural outerjoins when the relation schemas are γ-acyclic

• Hence, the full disjunction can be computed in polynomial time, under input-output complexity, when the relation schemas are γ-acyclic

Page 98: 1 Oblivious Querying of Data with Irregular Structure.

98

Weak Semantics Generalizes Full Disjunctions

• Relations can be converted into a semistructured database

• The full disjunction can be expressed as the union of several queries that are evaluated under weak semantics

We have developed an algorithm that uses this generalization to compute full disjunctions in polynomial time under I/O complexity, even when the relation schemas are cyclic

Page 99: 1 Oblivious Querying of Data with Irregular Structure.

99

Generalizing Full Disjunctions

• In a full disjunction, tuples are joined according to equality constraints as in a natural join (or equi-join)

• We can generalize full disjunctions to support constraints that are not merely equality among attributes

Page 100: 1 Oblivious Querying of Data with Irregular Structure.

100

Example

Movies (m-id, title, year, language, location)

Actors (a-id, name, date-of-birth)

Acted-in (a-id, m-id, role)

Actors-that-Directed (a-id, m-id)

Historical-Events (name, date, description)

Historical-Sites (Country, State, City, Site)

The date of the historical event is a date in the year when the movie was released

The filming location is near the historical site

Page 101: 1 Oblivious Querying of Data with Irregular Structure.

101

Another Way of Generalizing Full Disjunctions: Use OR-Semantics

• OR-semantics is used rather than weak semantics when tuples are joined

• This relaxes the requirement that every pair of tuples should be join consistent

• Instead, a tuple of the full disjunction is only required to be generated by database tuples that form a connected subgraph, but need not be pairwise join consistent

Page 102: 1 Oblivious Querying of Data with Irregular Structure.

102

Example

Employees (e-id, ename, city, dept-no)

Departments (dept-no, dname, building)

Located-in (building, city, street)

Employee: (007, James Bond, London, 6)

Department: (6, MI-6, 10)

Located-in: (10, Liverpool, King)

The Full Disjunction
e-id | ename | city | dept-no | dept-no | dname | building | building | city | street
007 | James Bond | London | 6 | 6 | MI-6 | 10 | null | null | null
null | null | null | null | 6 | MI-6 | 10 | 10 | Liverpool | King

Page 103: 1 Oblivious Querying of Data with Irregular Structure.

103

Example

Employees (e-id, ename, city, dept-no)

Departments (dept-no, dname, building)

Located-in (building, city, street)

Employee: (007, James Bond, London, 6)

Department: (6, MI-6, 10)

Located-in: (10, Liverpool, King)

The Full Disjunction under OR-Semantics
e-id | ename | city | dept-no | dept-no | dname | building | building | city | street
007 | James Bond | London | 6 | 6 | MI-6 | 10 | 10 | Liverpool | King
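
The difference between the two examples can be sketched in Python as two checks on a set of tuples. This is a sketch under assumptions: the function names are hypothetical, and the connectivity test assumes that two tuples are linked when they share at least one attribute and agree on all shared attributes.

```python
from itertools import combinations


def pairwise_join_consistent(tuples):
    """Ordinary full disjunction: every pair must agree on shared attributes."""
    return all(a[k] == b[k]
               for a, b in combinations(tuples, 2)
               for k in a.keys() & b.keys())


def agree(a, b):
    """True if a and b share an attribute and agree on every shared attribute."""
    shared = a.keys() & b.keys()
    return bool(shared) and all(a[k] == b[k] for k in shared)


def connected_by_joins(tuples):
    """OR-semantics style check (assumed reading): the tuples form a connected
    graph whose edges link tuples that agree with each other."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(tuples)):
            if j not in seen and agree(tuples[i], tuples[j]):
                seen.add(j)
                stack.append(j)
    return len(seen) == len(tuples)


employee = {"e-id": "007", "ename": "James Bond", "city": "London", "dept-no": 6}
department = {"dept-no": 6, "dname": "MI-6", "building": 10}
located_in = {"building": 10, "city": "Liverpool", "street": "King"}
bond = [employee, department, located_in]

print(pairwise_join_consistent(bond))  # False: London and Liverpool disagree on city
print(connected_by_joins(bond))        # True: Employee-Department-Located-in chain
```

Under the first check the three tuples cannot be merged into one output tuple, which gives the two-row result; under the second they can, which gives the single row shown under OR-semantics.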

Page 104: 1 Oblivious Querying of Data with Irregular Structure.

104

Information Integration from Heterogeneous Sources

[Figure: three data sources; a query over each source produces a relation, and the three relations are combined into an integrated relation.]

Page 105: 1 Oblivious Querying of Data with Irregular Structure.

105

[Figure: three data sources; a query over each source produces a relation, and the three relations are combined into an integrated relation.]

We use queries that combine flexible semantics and weak semantics:

- The queries are insensitive to changes in the data
- Easy to formulate the query

Page 106: 1 Oblivious Querying of Data with Irregular Structure.

106

[Figure: three data sources; a query over each source produces a relation, and the three relations are combined into an integrated relation.]

The integration of the relations is done with a full disjunction of the computed relations

Page 107: 1 Oblivious Querying of Data with Irregular Structure.

107

Conclusion

• Flexible and semiflexible queries facilitate easy and intuitive querying of semistructured databases
– Querying the database even when the user is oblivious to the structure of the database
– Queries are insensitive to variations in the structure of the database

Page 108: 1 Oblivious Querying of Data with Irregular Structure.

108

Conclusion (continued)

• Queries in AND semantics, OR semantics or weak semantics facilitate easy and intuitive querying of incomplete databases
– Querying the database even when the user is oblivious to missing data
– Queries return maximal answers rather than complete answers

Page 109: 1 Oblivious Querying of Data with Irregular Structure.

109

Conclusion (continued)

• The two paradigms of flexible queries and queries with maximal answers can be combined

• The combination of the paradigms can facilitate integration of information from heterogeneous sources

Page 110: 1 Oblivious Querying of Data with Irregular Structure.

110

Conclusion (continued)

• Full disjunctions can be computed using queries in weak semantics

• Full disjunctions can be generalized so that relations are joined using constraints that are not merely equality constraints

Page 111: 1 Oblivious Querying of Data with Irregular Structure.

111

Thank You

Questions?