Download - Semantic Query Optimization

Transcript
Page 1: Semantic Query Optimization

Semantic Query Optimization

Page 2: Semantic Query Optimization

2

Outline

• Semantic Query Optimization• Soft Constraints • Query Optimization via Soft Constraints• Selectivity Estimation via Soft Constraints

Page 3: Semantic Query Optimization

3

Semantic Query Optimization

Use integrity constraints associated with a database to rewrite a query into a form that may be evaluated more efficiently

Some Techniques:

• Join Elimination• Predicate Elimination• Join Introduction• Predicate Introduction• Detecting an Empty Answer Set

Page 4: Semantic Query Optimization

4

Commercial implementations of SQO

Early Experiences:

• Could not spend too much time on optimization• Few integrity constraints are ever defined• Association with deductive databases

Few (if any!)

Page 5: Semantic Query Optimization

5

Join elimination: exampleselect p_name, p_retailprice, s_name, s_address from tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplierwhere p_partkey = ps_partkey and s_suppkey = ps_suppkey and

ps_partkey = l_partkey and ps_suppkey = l_suppkey;

RI constraints: part-partsupp (on partkey) supplier-partsupp (on partkey)partsupp-lineitem (on partkey and suppkey)

select p_name, p_retailprice, s_name, s_address from tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplierwhere p_partkey = l_partkey and s_suppkey = l_suppkey;

Page 6: Semantic Query Optimization

6

Algorithm for join elimination1. Derive column transitivity classes from the

join predicates in the query2. Divide the relations in the query that are

related through RI constraints into removable and non-removable

3. Eliminate all removable relations from the query

4. Add is not null predicate to foreign key columns of all tables whose RI parents were removed

Page 7: Semantic Query Optimization

7

Algorithm for join elimination: example

C.C

PS.S

O.C

S.S PS.S

O.CC.C

S.S PS.S

O.C

Page 8: Semantic Query Optimization

9

Performance results for join elimination

0

100

200

300

400

500

600

J1 J2 J3 J4 J5 J6 J7 J8 J9 J10

OriginalOptimized

Page 9: Semantic Query Optimization

10

Predicate Introduction: Exampleselect sum(l_extendedprice * l_discount) as revenuefrom tpcd.lineitemwhere shipdate >date('1994-01-01');

select sum(l_extendedprice * l_discount) as revenuefrom tpcd.lineitemwhere shipdate >date('1994-01-01') and receiptdate >= date('1994-01-01');

Check constraint: receiptdate >= shipdateClustered Index on receiptdate

Page 10: Semantic Query Optimization

11

Algorithm for Predicate IntroductionN - set of predicates derivable from the query and check constraints

• If N is inconsistent, stop.• Else, for each predicate A op B in N, add it to the

query if:• A or B is a join column• B is a major column of an index• no other index on B’s table can be used in the plan

for the original query

Page 11: Semantic Query Optimization

13

Queriesselect 100.00 * sum

(casewhen p_type like 'PROMO%'then l_extendedprice * (1 - l_discount)else 0

end)/ sum(l_extendedprice * (1 - l_discount)) as promo_revenue

from tpcd.lineitem, tpcd.partwhere l_partkey = p_partkey and

l_shipdate >= date('1998-09-01') andl_shipdate < date('1998-09-01') + 1 month;

Given the check constraint l_receiptdate >= l_shipdate we may add a new predicate to the query:

l_receiptdate >= date(‘1998-09-01’)

Page 12: Semantic Query Optimization

14

Performance Results for Index Introduction

0102030405060708090

100

P1 P2 P3 P4 P5

OriginalOptimized

Page 13: Semantic Query Optimization

15

The Culprit

DATA READS INDEX READS

Physical Logical Physical Logical

CPU COST (S) ESTIMATED # OFQUALIFYING TUPLES

Original Query 21607 22439 12 26 21.9 20839“Optimized Query” 10680 286516 2687 288326 55.9 12618

New query plan uses an index, but the original table scan is still better!

Why did this happen:• incorrect estimate of the filter factor• underestimation of the CPU cost of locking index pages

Page 14: Semantic Query Optimization

16

Soft Constraints

Page 15: Semantic Query Optimization

17

Soft Constraints

Traditional (“hard”) integrity constraints are defined to prevent incorrect updates. A soft constraint is a statement that is true about the current state of the database, but does not verify updates. In fact, a soft constraint can be invalidated by an update.

Page 16: Semantic Query Optimization

18

Soft Constraints (cont.)• Absolute soft constraints – no violation in

the current state of the databaseAbsolute soft constraints can be used for optimization in exactly

the same way traditional constraints are.

• Statistical soft constraints – can have some (small) degree of violation

Statistical soft constraints can be used for improved selectivity estimation

Page 17: Semantic Query Optimization

19

Implementation of Soft Constraints

In Oracle the standard integrity constraints are marked with a rely option, so that they are not verified on updates.

In DB2 soft constraints are called informational constraints.

Page 18: Semantic Query Optimization

20

Informational Check ConstraintExample 1: Create an employee table where a minimum

salary of $25,000 is guaranteed by the application

CREATE TABLE emp(empno INTEGER NOT NULL PRIMARY KEY,

name VARCHAR(20), firstname VARCHAR(20), salary INTEGER CONSTRAINT minsalary CHECK (salary >= 25000) NOT ENFORCED ENABLE QUERY OPTIMIZATION);

Page 19: Semantic Query Optimization

21

Enforcing Validation

Example 2: Alter the employee table to start enforcing the minimum wage of $25,000 using DB2. DB2 will also verify existing data right away.

ALTER TABLE emp ALTER CONSTRAINT minsalary ENFORCED

Page 20: Semantic Query Optimization

22

Informational RI ConstraintExample 3: Create a department table where the

application ensures the existence of departments to which the employees belong.

CREATE TABLE dept(deptno INTEGER NOT NULL PRIMARY KEY,

deptName VARCHAR(20), budget INTEGER);

ALTER TABLE emp ADD COLUMN dept INTEGER NOT NULL CONSTRAINT dept_exist REFERENCES dept NOT ENFORCED ENABLE QUERY OPTIMIZATION);

Page 21: Semantic Query Optimization

23

Query Optimization via Empty Joins

Page 22: Semantic Query Optimization

24

Example

select Modelfrom Tickets T, Registration Rwhere T.RegNum = R.RegNum and T.date > “1990-01-01”

and R.Model LIKE “BMW Z3%”

select Modelfrom Tickets T, Registration Rwhere T.RegNum = R.RegNum and T.date > “1997-01-01”

and R.Model LIKE “BMW Z3%”

First BMW Z3 series cars were made in 1997.

Page 23: Semantic Query Optimization

25

Matrix representation of empty joins

A B 3 1 3

6 7 8

0 0 1 1 0 0 0 0 1

1 2 3 6 7 8

A,B(R S)

Page 24: Semantic Query Optimization

26

Staircase data structure

1

1

1

1

1

1

1

X

Y

0

1 0 0 0 0

10

00

00

( x ,y )r r

( x ,y )1 1

( x , y )

0

Page 25: Semantic Query Optimization

27

Properties of the algorithm

• Time Complexity O(nm) requires a single scan of the sorted data

• Space Complexity O(min(n,m)) only two rows of the matrix need be kept in memory

• Scalable with respect to:• number of tuples in the join result• number of discovered empty rectangles• size of the domain of one of the attributes

Page 26: Semantic Query Optimization

28

How many empty rectangles are there?

05000

1000015000200002500030000350004000045000

Test 1 Test 2 Test 3 Test 4

Number ofdiscovered emptyrectanglesNumber of tuplesin the join

Tests done on 4 pairs of attributes with numerical domain present in typical joins in a real-world workload of a health insurance company.

Page 27: Semantic Query Optimization

29

How big are the rectangles?

0

50

100

Test 1 Test 2 Test 3 Test 4

The sizes of the 5 largest rectangles as % of the size of the matrix

5th 4th 3rd 2nd 1st largest

Page 28: Semantic Query Optimization

30

Query rewrite: simple caseselect …from R, S,...where R.C=S.C and 60<R.A<80 and 20<S.B<80 and...

select …from R, S,...where R.C=S.C and 60<R.A<80 and 20<S.B<60 and...

Page 29: Semantic Query Optimization

31

Query rewrite: complex case

select …from R, S,...where R.C=S.C and 60<R.A<80 and 20<S.B<80 and...

select …from R, S,...where R.C=S.C and

(… and …) or(… and …) or(… and …) or...

Page 30: Semantic Query Optimization

32

Experiment I: Size of the Overlap

0102030405060708090

100

Reduction of theSize of the Table(%)Reduction ofExecution Time (%)

Page 31: Semantic Query Optimization

33

Experiment 2: Type of Overlap

-20

0

20

40

60

80

100

Reduction ofExecutionTime (%)

Page 32: Semantic Query Optimization

34

Experiment 3: Number of Empty Joins Used in Rewrite

0

20

40

60

80

100

Reduction ofExecutionTime (%)

Page 33: Semantic Query Optimization

35

How much do the rectangles overlap with queries?

0

5

10

15

20

25

Q 1 Q 2 Q 3 Q 4 Q 5

% of rectanglesoverlapping withqueries

Page 34: Semantic Query Optimization

36

Query optimization experiments

0

10

20

30

40

50

60

70

80

Q 1 Q 2 Q 3 Q 4 Q 5

% improvementin execution time

• real-world workload of 26 queries• 5 of the queries “qualified” for the rewrite • only simple rewrites were considered• all rewrites led to improved performance

Page 35: Semantic Query Optimization

37

Query Cardinality Estimate via Empty Joins

Page 36: Semantic Query Optimization

38

Query Cardinality Estimate via Empty Joins (SIEQE)

• Cardinality estimates crucial for designing good query evaluation plans

• Uniform data distribution (UDA): standard assumption in database systems

• Histograms effective in single dimensions: too expensive to build and maintain otherwise

Page 37: Semantic Query Optimization

39

The Strategy

Q1

Q2

• With UDA, the “density”: 1 tuple/sq unit• Empty joins cover 20% of the area• Adjusted density: 1.25 tuples/sq unit

Cardinality UDA SIEQE

Q1 100 62

Q2 100 125

Page 38: Semantic Query Optimization

40

Experiments

Number of queries for which the error is less than a given limit

Page 39: Semantic Query Optimization

42

Discovery of Check Constraints and Their Application in DB2

We discover two types of (rules) check constraints:• correlations between attributes over ordered domains• partitioning of attributes

Page 40: Semantic Query Optimization

43

Correlations between attributes over ordered domains

Rules have the form: Y = bX + a + [emin, emax]

Algorithm

for all tables in the databasefor all comparable variable pairs (X and Y) in the

table apply OLS estimation to get the function of

the form: Y = a + bX

calculate the max and min error (or residual) emax and emin

endforendfor

Page 41: Semantic Query Optimization

44

Partitioning

Rules have the form: If X = a, then Y [emin, emax]

Algorithm

for all tables in the databasefor any qualifying variable pair (X and Y) in the table

calculate partitions by using GROUP BY X statementsfind the max and min value of Y for each partition

endforendfor

Page 42: Semantic Query Optimization

45

Experiments in TPC-H

Rules discovered through partitioning:

If L_LINESTATUS=F, then L_SHIPDATE=(01/04/1992, 06/17/1995), m = 0.50If L_LINESTATUS=O, then L_SHIPDATE=(06/19/1995, 12/25/1998), m = 0.50

TPC-H contains the following check constraint:

L_RECEIPTDATE > L_SHIPDATE

Our algorithm discovered the following rule:

L_RECEIPTDATE = L_SHIPDATE + (1, 30), m = 0.0114.

Page 43: Semantic Query Optimization

46

Applications

• DBA Wizard• Semantic Query Optimization• Improved Filter Factor Estimates

Page 44: Semantic Query Optimization

47

Example

ARRIVAL DATE <= ‘1999-06-15’ AND DEPARTURE_DATE >= ‘1999-06-15’

The filter factor estimate for the query would be:

ff = ff1 * ff2

Consider a query issued against a hotel database, that requests the number of guests staying in the hotel on a given date.

If ‘1999-06-15’ was approximately midway in the date ranges, we would estimate a quarter of all the guests that came in over the number of years would be in the answer of the query!

Page 45: Semantic Query Optimization

48

Example (cont.)Assume that the following check constraint was discovered:

DEPARTURE_DATE >= ARRIVAL_DATE + (1 DAY, 5 DAYS)

The original condition in the query predicate can then be changed to:

ARRIVAL_DATE <= ‘1999-06-15’ AND ARRIVAL_DATE >= ‘1999-06-18’or

ARRIVAL_DATE BETWEEN ‘1999-06-15’ AND ‘1999-06-18’

The filter factor is now estimated to:

ff = (ff1 + ff2 –1)

Page 46: Semantic Query Optimization

49

Other Research on the Use of Soft Constraints in Query Optimization

Page 47: Semantic Query Optimization

50

Query-driven Approach

• Built multidimensional histograms based on query results (Microsoft)

• Improve cardinality estimates by looking at the intermediate query results (IBM)

Both techniques generate statistical soft constraints

Page 48: Semantic Query Optimization

51

Data-driven Approach

• Lots of methods using Bayesian networks to infer statistical soft constraint

• Lots of methods to discover functional dependencies in data (absolute soft constraints)

• Most recently, BHUNT and CORDS use sampling to discover soft constraints (IBM)

Page 49: Semantic Query Optimization

52

References• Q. Cheng, J. Gryz, F. Koo, T. Y. Cliff Leung, L. Liu, X. Qian, B.

Schiefer: Implementation of Two Semantic Query Optimization Techniques in DB2 Universal Database. VLDB 1999.

• J. Edmonds, J. Gryz, D. Liang, R. Miller: Mining for Empty Rectangles in Large Data Sets. ICDT 2001.

• J. Gryz, B. Schiefer, J. Zheng, C. Zuzarte: Discovery and Application of Check Constraints in DB2. ICDE 2001.

• P. Godfrey, J. Gryz, C. Zuzarte: Exploiting Constraint-Like Data Characterizations in Query Optimization. SIGMOD 2001.

• J. Gryz, D. Liang: Query Optimization via Empty Joins. DEXA 2002.• J. Gryz, D. Liang: Query Cardinality Estimation via Data Mining. IIS

2004.