Efficient Detection of Empty Result Queries

32nd International Conference on Very Large Data Bases, September 12-15, 2006, Seoul, Korea


Gang Luo IBM T.J. Watson Research Center

[email protected]


Empty Result Problem

• Query returns an empty result set

• User gets lost about where to look next

• Frequently encountered in interactive exploration of massive data sets

• Our contribution: method for quickly detecting empty result sets


Example Percentages of Empty Result Queries

• In a Customer Relationship Management (CRM) application developed by IBM
– 18.07% (3,396 empty result queries in 18,793 queries)

• In a real estate application developed by IBM
– 5.75%

• In a digital library application [JCM+00]
– 10.53%

• In a bioinformatics application [RCP+98]
– 38%


Empty Result Queries May Not Finish Execution Quickly

• Consider a query joining two relations
– Query execution time is longer than the join time, whether or not the query result set is empty

• Even if a query finishes in a few seconds in a lightly loaded RDBMS, it may last longer than one minute in a heavily loaded RDBMS


Outline

• Limitations of previous approaches

• Fast detection method for empty result queries

• Some experiments


Existing Solutions to the Empty Result Problem

• Explain what leads to the empty result set

• Automatically generalize the query so that the generalized query will return some answers


Limitations of Existing Solutions

• Require domain specific knowledge

• Only apply to a restricted form of queries

• Require an excessive amount of time

• Give too many reasons why the result set is empty

• Users cannot reuse each other’s query results


Our Solution

• Only consider a read-only environment

• From previous queries’ execution, remember the query parts that lead to empty result sets

• When a new query Q comes, match it with the remembered query parts. If such a match exists, report that Q will return an empty result set without executing Q

• Exploits special properties of empty result sets and is thus often more powerful than the traditional materialized view method


Definitions

• Empty result propagating operator: An operator whose output is empty if any input is empty

• Empty result propagating query: A query whose query plan only contains empty result propagating operators (our focus)

• Query part: A sub-tree of a query plan

• Atomic query part: An ordered pair (relation names RN, selection condition SC)
– Corresponds to a relational algebra formula: first take the Cartesian product of all relations in RN, then apply SC
– SC is a conjunction of primitive terms, where each primitive term is a comparison


Definitions – Cont.

• Cover: Atomic query part P1=(RN1, SC1) covers atomic query part P2=(RN2, SC2) if

– RN1 ⊆ RN2

– Whenever SC2 is true, SC1 is true

• Property: Suppose atomic query part P1 covers atomic query part P2. For a given database, if the output of P1 is empty, the output of P2 is also empty.
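The cover relation can be illustrated with a minimal Python sketch. It is not the coverage checking algorithm from the paper: it assumes a hypothetical representation in which SC is a conjunction of per-attribute intervals plus equi-join terms, and all class and function names (`Interval`, `AtomicQueryPart`, `covers`) are made up for illustration. The check is conservative: it returns True only when every term of SC1 is implied by SC2.

```python
from dataclasses import dataclass, field
from math import inf
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Interval:
    """Range condition on one attribute; open bounds by default."""
    lo: float = -inf
    hi: float = inf
    lo_open: bool = True
    hi_open: bool = True

    def contains(self, other: "Interval") -> bool:
        """True if every value allowed by `other` is allowed by `self`."""
        lower_ok = self.lo < other.lo or (
            self.lo == other.lo and (other.lo_open or not self.lo_open))
        upper_ok = other.hi < self.hi or (
            self.hi == other.hi and (other.hi_open or not self.hi_open))
        return lower_ok and upper_ok


def point(value: float) -> Interval:
    """Equality comparison, modeled as the closed interval [value, value]."""
    return Interval(value, value, lo_open=False, hi_open=False)


@dataclass
class AtomicQueryPart:
    relations: FrozenSet[str]                               # RN
    ranges: Dict[str, Interval] = field(default_factory=dict)  # SC: range terms
    joins: FrozenSet[FrozenSet[str]] = frozenset()          # SC: equi-join terms


def covers(p1: AtomicQueryPart, p2: AtomicQueryPart) -> bool:
    """Sufficient test that P1 covers P2: RN1 is a subset of RN2 and SC2 implies SC1."""
    if not p1.relations <= p2.relations:
        return False
    if not p1.joins <= p2.joins:        # every join term of SC1 appears in SC2
        return False
    unbounded = Interval()              # attribute not constrained by SC2
    return all(rng.contains(p2.ranges.get(attr, unbounded))
               for attr, rng in p1.ranges.items())


# P1: 0 < A.a < 200 over relation A
p1 = AtomicQueryPart(frozenset({"A"}), {"A.a": Interval(0, 200)})
# P2: 50 < A.a < 100 and B.e = 50 and A.c = B.d over relations A, B
p2 = AtomicQueryPart(
    frozenset({"A", "B"}),
    {"A.a": Interval(50, 100), "B.e": point(50)},
    frozenset({frozenset({"A.c", "B.d"})}),
)

print(covers(p1, p2))  # True
print(covers(p2, p1))  # False
```

Here P1 is the more general part: if P1 returns no rows on a given database, P2 cannot return any either.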


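Given an Empty Result Query

• Find the lowest-level query part P whose output is empty

[Figure: query plan annotated with output cardinalities in brackets. A (table-scan) [40000] is filtered by 50<A.a<100 ∨ A.b=200 [200] and hashed [200]; B (index-scan with B.e<40 ∨ B.e=50) [5000] is hashed [5000]; both feed a hash join on A.c=B.d [0]. The hash join output is sorted [0]. C (table-scan) [20000] is filtered by C.f<300 [1000] and sorted [1000]. The two sorted inputs feed a sort-merge join on B.g=C.h [0]. The lowest-level part with empty output is the hash join on A.c=B.d.]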
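A minimal sketch of this search, assuming the executed plan is available as a tree whose nodes carry their observed output cardinality. The `PlanNode` class and `lowest_empty_part` function are hypothetical names, and the example tree is one reading of the plan shown on this slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlanNode:
    op: str                      # operator description
    rows: int                    # observed output cardinality
    children: List["PlanNode"] = field(default_factory=list)


def lowest_empty_part(node: PlanNode) -> Optional[PlanNode]:
    """Return a deepest node whose output is empty, or None if `node` is non-empty."""
    if node.rows != 0:
        return None
    for child in node.children:
        found = lowest_empty_part(child)
        if found is not None:
            return found         # an empty descendant is lower than `node`
    return node                  # empty output, but all inputs are non-empty


scan_a = PlanNode("A (table-scan)", 40000)
select_a = PlanNode("select 50<A.a<100 or A.b=200", 200, [scan_a])
scan_b = PlanNode("B (index-scan) B.e<40 or B.e=50", 5000)
hash_join = PlanNode("hash join A.c=B.d", 0, [
    PlanNode("hash", 200, [select_a]),
    PlanNode("hash", 5000, [scan_b]),
])
scan_c = PlanNode("C (table-scan)", 20000)
select_c = PlanNode("select C.f<300", 1000, [scan_c])
plan = PlanNode("sort-merge join B.g=C.h", 0, [
    PlanNode("sort", 0, [hash_join]),
    PlanNode("sort", 1000, [select_c]),
])

part = lowest_empty_part(plan)
print(part.op)  # hash join A.c=B.d
```

The sort and the sort-merge join above the hash join are also empty, but only because emptiness propagates upward; the hash join is where the empty result originates.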


Transforming P into a Simplified Query Part Ps

• Drop all operators (e.g., projection, hash, sort) that have no influence on the emptiness of the output

• Replace each physical join operator with a logical join operator

• Replace each index-scan operator with a table-scan operator followed by a selection operator, where the selection condition is the index-scan condition


Transforming P into a Simplified Query Part Ps – Cont.

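• Corresponding relational algebra formula
– $\sigma_{50<A.a<100 \,\lor\, A.b=200}(A) \bowtie_{A.c=B.d} \sigma_{B.e<40 \,\lor\, B.e=50}(B)$

[Figure: simplified plan tree. Table scans of A and B, each followed by its selection (50<A.a<100 ∨ A.b=200 on A, B.e<40 ∨ B.e=50 on B), feed the logical join on A.c=B.d.]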


Breaking Ps into Atomic Query Parts

• Get all selection conditions in the selection/join operators

• Rewrite the conjunction of these selection conditions into disjunctive normal form (DNF)
– Negations on numeric or string attributes are removed using complementary operators
– Interval-based comparison is treated as a single primitive term

• Generate a set of atomic query parts (RN, SC)
– RN: input relations of all table-scan operators in Ps
– SC: a term in the DNF


Breaking Ps into Atomic Query Parts – Cont.

• Property: The following three assertions are equivalent to each other:
– The output of the query part P is empty
– The output of the simplified query part Ps is empty
– The output of each generated atomic query part is empty

$\sigma_{50<A.a<100}(A) \bowtie_{A.c=B.d} \sigma_{B.e<40}(B)$

$\sigma_{A.b=200}(A) \bowtie_{A.c=B.d} \sigma_{B.e<40}(B)$

$\sigma_{50<A.a<100}(A) \bowtie_{A.c=B.d} \sigma_{B.e=50}(B)$

$\sigma_{A.b=200}(A) \bowtie_{A.c=B.d} \sigma_{B.e=50}(B)$
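The expansion into atomic query parts can be sketched in a few lines of Python. This sketch assumes the selection conditions have already been collected as a conjunction of clauses, each clause being a disjunction of negation-free primitive terms; arbitrary nesting of AND/OR is not handled. The function name `atomic_query_parts` and the string encoding of terms are illustrative choices, not taken from the paper.

```python
from itertools import product
from typing import FrozenSet, Iterable, List, Sequence, Tuple

Part = Tuple[FrozenSet[str], FrozenSet[str]]  # (RN, SC)


def atomic_query_parts(relations: Iterable[str],
                       clauses: Sequence[Sequence[str]]) -> List[Part]:
    """Distribute AND over OR: pick one primitive term from each clause.

    Each pick is one term of the DNF, that is, the SC of one atomic query part.
    """
    rn = frozenset(relations)
    return [(rn, frozenset(term)) for term in product(*clauses)]


# Conditions of the simplified query part Ps from the running example.
clauses = [
    ["50<A.a<100", "A.b=200"],   # selection on A
    ["B.e<40", "B.e=50"],        # selection on B (from the index scan)
    ["A.c=B.d"],                 # join condition
]

parts = atomic_query_parts(["A", "B"], clauses)
for rn, sc in parts:
    print(sorted(rn), " AND ".join(sorted(sc)))
```

The two-way disjunctions on A and B multiply out to four atomic query parts, matching the four formulas listed on this slide.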


Storing the Generated Atomic Query Parts

• For each generated atomic query part Pa

– Insert Pa into a collection Caqp of atomic query parts

– Remove from Caqp all previously stored atomic query parts that are covered by Pa

• See paper for details of the coverage checking algorithm


When Getting a New Query Q

• Break Q into a set of atomic query parts

• For each such atomic query part Pa, check whether some atomic query part Ai in Caqp covers Pa

• If such an Ai exists for each Pa, report that Q will return an empty result set without executing Q
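The storing and matching steps can be sketched together. As a stand-in for the paper's coverage checking algorithm, this sketch uses a purely syntactic test: a stored part covers a new part if its relation names and its primitive terms are both subsets of the new part's. That test is sound but weaker than real implication checking (it would not recognize that B.e<40 implies B.e<100, for example). The names `EmptyResultCache`, `remember`, and `is_known_empty` are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Part = Tuple[FrozenSet[str], FrozenSet[str]]  # (RN, SC as a set of primitive terms)


def part(relations: Iterable[str], terms: Iterable[str]) -> Part:
    return frozenset(relations), frozenset(terms)


def covers(p1: Part, p2: Part) -> bool:
    """Syntactic sufficient test: RN1 subset of RN2, and every term of SC1 occurs in SC2."""
    return p1[0] <= p2[0] and p1[1] <= p2[1]


class EmptyResultCache:
    """Collection Caqp of atomic query parts known to return no rows.

    Only valid while the database is read-only, as assumed in the talk.
    """

    def __init__(self) -> None:
        self.caqp: Set[Part] = set()

    def remember(self, new_part: Part) -> None:
        # Drop stored parts that the new part covers, then insert the new part.
        self.caqp = {s for s in self.caqp if not covers(new_part, s)}
        self.caqp.add(new_part)

    def is_known_empty(self, query_parts: List[Part]) -> bool:
        # Q is reported empty only if every atomic part of Q is covered.
        return bool(query_parts) and all(
            any(covers(stored, pa) for stored in self.caqp)
            for pa in query_parts)


cache = EmptyResultCache()
cache.remember(part(["A", "B"], ["A.b=200", "A.c=B.d", "B.e=50"]))

q1 = [part(["A", "B", "C"],
           ["A.b=200", "A.c=B.d", "B.e=50", "B.g=C.h", "C.f<300"])]
q2 = [part(["A", "B"], ["A.b=200", "A.c=B.d", "B.e<40"])]

print(cache.is_known_empty(q1))  # True
print(cache.is_known_empty(q2))  # False

# A more general empty part replaces the more specific one it covers.
cache.remember(part(["A", "B"], ["A.c=B.d", "B.e=50"]))
print(len(cache.caqp))  # 1
```

A failed check is not an error: it only means Q has to be executed normally, after which any newly observed empty parts can be remembered.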


Setup

• Testing environment
– PostgreSQL 7.3.4
– Windows XP OS
– Dell Inspiron 8500 PC with one 2.2GHz CPU, 512MB memory, one 40GB disk

• TPC-R benchmark

• See paper for detection probability analysis


Overhead Experiment

• Query Q1: Find the information about certain parts that were sold on certain days

select * from orders o, lineitem l
where o.orderkey=l.orderkey and
(o.orderdate=d1 or … or o.orderdate=de) and
(l.partkey=p1 or … or l.partkey=pf);


Overhead Experiment – Cont.

• Query Q2: Find the information about certain parts that were sold to certain customers on certain days

select * from orders o, lineitem l, customer c
where o.orderkey=l.orderkey and o.custkey=c.custkey and
(o.orderdate=d1 or … or o.orderdate=de) and
(l.partkey=p1 or … or l.partkey=pf) and
(c.nationkey=n1 or … or c.nationkey=ng);


Overhead Experiment – Cont.

• The overhead of our method increases with both query complexity and the number of atomic query parts stored in Caqp

• The overhead is higher when the check fails than when it succeeds

[Figure: overhead (seconds, axis from 0 to 0.008) versus number of atomic query parts in Caqp (1000, 2000, 3000). Four curves: Q1, check succeeds; Q1, check fails; Q2, check succeeds; Q2, check fails.]


Overhead Experiment – Cont.

• The overhead of our method is trivial compared to query execution time

[Figure: execution time or overhead (seconds, logarithmic axis from 0.001 to 1000) versus database size (1, 2, 3 GB). Four curves: execute Q1; check Q1; execute Q2; check Q2.]


Summary

• Provide a fast detection method for empty result queries
– Low overhead
– High detection probability once enough information has been accumulated


Open Issues

• In the presence of updates, correctly preserve as much stored information as possible

• A hybrid method that can combine the advantages of both our method and the existing solutions

• More aggressive storage saving techniques
