Semantic Query Optimization Techniques

November 16, 2005

By : Mladen Kovacevic

Background

• 1980s: it was recognized that semantic information stored in databases as integrity constraints could be used for query optimization

• semantic: “of or relating to meaning or the study of meaning” (http://wordnet.princeton.edu)

• integrity: preserving data consistency when changes are made to the database

• no extensive implementation existed as of 1999

Introduction

• Query optimization is a key factor in improving query execution time in relational database systems.

• Query execution can be improved by:

– Analyzing integrity information and rewriting queries to exploit it (JE & PI)

– Avoiding expensive sorting costs (Order Optimization)

– Exploiting uniqueness: knowing that rows will be unique avoids extra sorts (EU)

Presentation Overview

• Semantic Query Optimization techniques

– Join Elimination (JE)

– Predicate Introduction (PI)

– Order Optimization (OO)

– Exploiting Uniqueness (EU)

Some Motivation

• Two SQO techniques are described, as demonstrated in DB2 UDB:

– Predicate Introduction

– Join Elimination

• Reasons:

– rewriting queries by hand showed that these two provided consistent optimization

– practical to implement

– extendible to other DBMSs

• Data sets used: TPC-D and APB-1 OLAP benchmarks

• Only REFERENTIAL INTEGRITY constraints and CHECK constraints were used!

Semantic Query Optimization (SQO) Techniques

• Join Elimination: Some joins need NOT be evaluated since the result may be known a priori (more on this later)

• Join Introduction: Adding a join can help if the added relation is small compared to the original relations and highly selective.

• Predicate Elimination: If a predicate is known to be always true, it can be eliminated from the query (e.g., a DISTINCT clause on a primary key: uniqueness exploitation!)

• Predicate Introduction: New predicates on indexed attributes can result in a much faster access plan.

• Detecting the Empty Answer Set: If the query predicates are inconsistent with the integrity constraints, the query has no answer.
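The empty-answer-set check in the last bullet can be illustrated with a small sketch. It assumes a simplified model in which every CHECK constraint and every query predicate is a closed numeric range on a single column; the `Interval` class, the function name, and the `l_discount` constraint are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed numeric range [low, high] allowed for one column."""
    low: float
    high: float

    def intersect(self, other: "Interval") -> "Interval":
        return Interval(max(self.low, other.low), min(self.high, other.high))

    def is_empty(self) -> bool:
        return self.low > self.high


def answer_is_provably_empty(check_constraints, query_predicates):
    """True if some column's query range cannot overlap its CHECK range."""
    for column, wanted in query_predicates.items():
        allowed = check_constraints.get(column)
        if allowed is not None and allowed.intersect(wanted).is_empty():
            return True
    return False


# Hypothetical constraint: CHECK (l_discount BETWEEN 0.0 AND 0.10)
constraints = {"l_discount": Interval(0.0, 0.10)}
contradictory = {"l_discount": Interval(0.5, 1.0)}   # WHERE l_discount >= 0.5
satisfiable = {"l_discount": Interval(0.05, 0.07)}

print(answer_is_provably_empty(constraints, contradictory))
print(answer_is_provably_empty(constraints, satisfiable))
```

When the check succeeds, the query can be answered with an empty result without touching the table at all.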

Why are SQO implementations not used?

• Deductive Databases: In many cases SQO techniques were designed for deductive databases, and so did not appear useful in a relational database context.

• CPU & I/O speeds similar: When the techniques were being developed, CPU and I/O speeds were not as dramatically different as they are now.

– (the savings in I/O were not worth the added CPU time)

• Lack of Integrity Constraints: It was thought that many integrity constraints are needed for SQO to be useful.

Two-stage Optimizer

• Examples of SQO techniques were always designed for a two-stage optimizer

– Stage 1: logically equivalent queries are created (DB2’s query rewrite optimization)

– Stage 2: plans are generated for all of these queries, and the one with the lowest estimated cost is chosen (DB2’s query plan optimization)

• Plan choices include join order, join methods, join site in a distributed database, method for accessing an input table, etc.

Join Elimination

• Simple: Eliminate a relation when the join is over tables related through a referential integrity constraint and the primary key table is referenced only in the join

VIEW DEFINITION

CREATE VIEW Supplier_Info (n, a, c) as
SELECT s_name, s_address, n_name
FROM tpcd.supplier, tpcd.nation
WHERE s_nationkey = n_nationkey

QUERY

SELECT n, a
FROM Supplier_Info

Join Elimination (con’t)

• Query can be rewritten internally as:

SELECT s_name, s_address
FROM tpcd.supplier
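To see why the rewrite is safe, the sketch below rebuilds a tiny version of the schema in an in-memory SQLite database and compares the view query with the rewritten query. It is only an illustration under stated assumptions: the `tpcd.` schema prefix is dropped, the sample rows are invented, and `s_nationkey` is declared NOT NULL with a foreign key to `nation`, so every supplier row joins to exactly one nation row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE nation (
        n_nationkey INTEGER PRIMARY KEY,
        n_name      TEXT NOT NULL
    );
    CREATE TABLE supplier (
        s_suppkey   INTEGER PRIMARY KEY,
        s_name      TEXT NOT NULL,
        s_address   TEXT NOT NULL,
        s_nationkey INTEGER NOT NULL REFERENCES nation (n_nationkey)
    );
    CREATE VIEW supplier_info (n, a, c) AS
        SELECT s_name, s_address, n_name
        FROM supplier, nation
        WHERE s_nationkey = n_nationkey;
    -- Invented sample data
    INSERT INTO nation VALUES (1, 'CANADA'), (2, 'FRANCE');
    INSERT INTO supplier VALUES
        (10, 'Supplier#10', '1 Main St', 1),
        (11, 'Supplier#11', '2 Rue A', 2),
        (12, 'Supplier#12', '3 King St', 1);
""")

# Query through the view: joins supplier with nation.
via_view = sorted(conn.execute("SELECT n, a FROM supplier_info").fetchall())

# Rewritten query: nation is not needed because no nation column is selected
# and the foreign key guarantees exactly one matching nation row per supplier.
rewritten = sorted(conn.execute("SELECT s_name, s_address FROM supplier").fetchall())

print(via_view == rewritten)
```

If `s_nationkey` were nullable, suppliers with a NULL nation key would be dropped by the join but kept by the rewritten query, so the equivalence in this sketch depends on the NOT NULL assumption.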

Why do such a simple rewrite?

• User may not have access to the supplier table, and/or may only know about the view.

• Sometimes GUI managers create these “dumb” queries, so they need to be optimized.

• Non-programmers often write queries, and may not even think about this.

• An algorithm for generic redundant join removal is provided in the paper.

Example – Join Elimination

SELECT p_name, p_retailprice, s_name, s_address
FROM tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier
WHERE p_partkey = ps_partkey and s_suppkey = ps_suppkey and
      ps_partkey = l_partkey and ps_suppkey = l_suppkey and
      l_shipdate between '1994-01-01' and '1996-06-30' and
      l_discount >= 0.1
GROUP BY p_name, p_retailprice, s_name, s_address
ORDER BY p_name, s_name

[Schema diagram: PART (PARTKEY), SUPPLIER (SUPPKEY), PARTSUPP (PARTKEY, SUPPKEY), and LINEITEM (PARTKEY, SUPPKEY), connected by 1-many relationships.]

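Example : Join Elimination

• Any immediate improvements that can be seen here?

p_partkey = ps_partkey and s_suppkey = ps_suppkey and
ps_partkey = l_partkey and ps_suppkey = l_suppkey

• Columns equated by the join predicates:

P_PARTKEY, PS_PARTKEY, L_PARTKEY

S_SUPPKEY, PS_SUPPKEY, L_SUPPKEY

• By transitivity:

P_PARTKEY = PS_PARTKEY and PS_PARTKEY = L_PARTKEY, so P_PARTKEY = L_PARTKEY

S_SUPPKEY = PS_SUPPKEY and PS_SUPPKEY = L_SUPPKEY, so S_SUPPKEY = L_SUPPKEY

• With the derived predicates, PARTSUPP only links the other tables through its referential integrity constraints, so the join with PARTSUPP can be eliminated.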
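The transitivity step can be sketched as a computation of column equivalence classes. This is a minimal illustration using a union-find structure, not the redundant join removal algorithm from the paper; the function names are hypothetical.

```python
from itertools import combinations


def equivalence_classes(equalities):
    """Group columns that are equated, directly or transitively."""
    parent = {}

    def find(col):
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]  # path halving
            col = parent[col]
        return col

    for left, right in equalities:
        parent[find(left)] = find(right)

    classes = {}
    for col in list(parent):
        classes.setdefault(find(col), set()).add(col)
    return list(classes.values())


def implied_equalities(equalities):
    """Equalities that follow by transitivity but are not written in the query."""
    given = {frozenset(pair) for pair in equalities}
    implied = set()
    for cls in equivalence_classes(equalities):
        for a, b in combinations(sorted(cls), 2):
            if frozenset((a, b)) not in given:
                implied.add((a, b))
    return sorted(implied)


join_predicates = [
    ("p_partkey", "ps_partkey"),
    ("s_suppkey", "ps_suppkey"),
    ("ps_partkey", "l_partkey"),
    ("ps_suppkey", "l_suppkey"),
]
derived = implied_equalities(join_predicates)
print(derived)
```

For the four join predicates of the example, the only new equalities are the two that connect LINEITEM directly to PART and SUPPLIER, which is what makes PARTSUPP a candidate for elimination.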

Results

• 100 MB database size

• Execution time: 58.5 sec -> 38.25 sec (35% improvement)

• I/O cost: 4631 -> 1498 page reads (67% improvement)

[Figure: Join Elimination Optimizing Query 1, Pages Read. Original: 4631 pages; Optimized: 1498 pages.]

[Figure: Join Elimination Optimizing Query 1, Execution Time. Original: 58.5 seconds; Optimized: 38.25 seconds.]

Results – OLAP Environment

• In OLAP (online analytical processing) servers using a star schema (one fact table with several dimension tables), improvements ranged from 2% to 96%.

– In these cases, much of the improvement came from CPU cost instead of I/O, because the dimension tables were small enough to fit into memory...

[Figure: Join Elimination Optimized Query in OLAP Environment. Execution time (seconds) by query name, I1 to I10.]

Predicate Introduction

• Techniques discussed:

– Index Introduction: add a new predicate on an attribute if an index exists on that attribute.

– Assumption: index retrieval is better than a table scan. Is this always good?

– Scan Reduction: reduce the number of tuples that qualify for a join.

– Problem: not very common; it is unlikely that there will be any check constraints or predicates with inequalities on join columns

• Detecting an empty query answer set (not shown, as query execution time is essentially 0)

Example - Predicate Introduction

SELECT sum(l_extendedprice * l_discount) as revenue
FROM tpcd.lineitem
WHERE l_shipdate >= date('1994-01-01') and
      l_shipdate < date('1994-01-01') + 1 year and
      l_discount between .06 - 0.01 and .06 + 0.01 and
      l_quantity < 24;

Check Constraint : l_shipdate <= l_receiptdate

Index : l_receiptdate

• Maintaining semantics, we can add:

– l_receiptdate >= date('1994-01-01')

Example - Predicate Introduction (con’t)

SELECT sum(l_extendedprice * l_discount) as revenue
FROM tpcd.lineitem
WHERE l_shipdate >= date('1994-01-01') and
      l_shipdate < date('1994-01-01') + 1 year and
      l_receiptdate >= date('1994-01-01') and
      l_discount between .06 - 0.01 and .06 + 0.01 and
      l_quantity < 24;

Check Constraint : l_shipdate <= l_receiptdate

Index : l_receiptdate

• The added predicate l_receiptdate >= date('1994-01-01') follows from l_shipdate >= date('1994-01-01') and the check constraint, so the query semantics are maintained.

• Why would we want to do this? So that the optimizer can choose a plan that uses the index. Is this always good?

• NO! What if most of the rows in the table need to be returned? We should use a table scan instead.
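The claim that the introduced predicate preserves the result can be checked on a toy table. The sketch below uses an in-memory SQLite database with invented rows; the `tpcd.` prefix is dropped, dates are stored as ISO strings (which compare correctly as text), and the discount bounds are written as the literals 0.05 and 0.07 rather than as arithmetic expressions. It demonstrates only the equivalence, not any performance effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lineitem (
        l_extendedprice REAL    NOT NULL,
        l_discount      REAL    NOT NULL,
        l_quantity      INTEGER NOT NULL,
        l_shipdate      TEXT    NOT NULL,
        l_receiptdate   TEXT    NOT NULL,
        CHECK (l_shipdate <= l_receiptdate)
    );
    CREATE INDEX idx_receiptdate ON lineitem (l_receiptdate);
    -- Invented sample rows, all satisfying the CHECK constraint
    INSERT INTO lineitem VALUES
        (1000.0, 0.06, 10, '1994-03-01', '1994-03-10'),
        (2000.0, 0.05, 20, '1994-12-30', '1995-01-05'),
        (3000.0, 0.06, 30, '1994-06-01', '1994-06-02'),
        (4000.0, 0.07,  5, '1993-12-31', '1994-01-02'),
        (5000.0, 0.06, 12, '1995-01-01', '1995-01-03');
""")

ORIGINAL = """
    SELECT SUM(l_extendedprice * l_discount) FROM lineitem
    WHERE l_shipdate >= '1994-01-01' AND l_shipdate < '1995-01-01'
      AND l_discount BETWEEN 0.05 AND 0.07 AND l_quantity < 24
"""
# Predicate implied by l_shipdate >= '1994-01-01' and the CHECK constraint.
INTRODUCED = ORIGINAL + " AND l_receiptdate >= '1994-01-01'"

revenue_original = conn.execute(ORIGINAL).fetchone()[0]
revenue_introduced = conn.execute(INTRODUCED).fetchone()[0]
print(revenue_original == revenue_introduced)
```

Because the CHECK constraint rejects any row with a receipt date before its ship date, no row that passes the ship date filter can fail the introduced predicate.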

Predicate Introduction - Algorithm

• Input: the set of all check constraints defined for a database and the set of all predicates in the query

• Output: the set of all non-redundant formulas derivable from the input set. Formulas from this set can then be added to the query, but only a few are potentially useful.

• The goal in the paper was to choose additions that would guarantee improvement.

• Conditions in the paper: a conservative approach that introduces only predicates that will lead the plan optimizer to use an index. It insists that only one index is available for the query predicates.
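A much-reduced sketch of the input/output contract described above follows. It assumes check constraints of the single form `a <= b` between two columns and query predicates that compare a column with a constant; the paper's algorithm handles a more general class of formulas, and the function names and tuple encodings here are hypothetical.

```python
def derive_predicates(check_constraints, query_predicates):
    """Derive new column-vs-constant predicates.

    check_constraints: list of (a, b), each meaning CHECK (a <= b).
    query_predicates:  list of (column, op, constant) with op in >=, >, <=, <.
    """
    derived = []
    for low_col, high_col in check_constraints:
        for column, op, constant in query_predicates:
            if column == low_col and op in (">=", ">"):
                # a >= c and a <= b  implies  b >= c (same for >)
                derived.append((high_col, op, constant))
            elif column == high_col and op in ("<=", "<"):
                # b <= c and a <= b  implies  a <= c (same for <)
                derived.append((low_col, op, constant))
    existing = set(query_predicates)
    return [p for p in derived if p not in existing]


def useful_predicates(derived, indexed_leading_columns):
    """Conservative filter: keep only predicates on a leading index column."""
    return [p for p in derived if p[0] in indexed_leading_columns]


constraints = [("l_shipdate", "l_receiptdate")]
predicates = [
    ("l_shipdate", ">=", "1994-01-01"),
    ("l_shipdate", "<", "1995-01-01"),
    ("l_quantity", "<", 24),
]
derived = derive_predicates(constraints, predicates)
chosen = useful_predicates(derived, {"l_receiptdate"})
print(chosen)
```

The second function mirrors the conservative condition on the slide: a derived predicate is kept only if it can steer the optimizer towards an index.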

Predicate Introduction - Results

[Figure: Predicate Introduction Estimated Costs. Estimated cost (internal units), Original vs. Optimized, for queries P1 to P5.]

Predicate Introduction - Results

[Figure: Predicate Introduction Execution Times. Execution time (seconds), Original vs. Optimized, for queries P1 to P5. Callout on the chart: “Why?”]

Why Longer Execution for P3/P5?

• P2 and P3 are the same except for the following:

P2:
SELECT ...
FROM ...
WHERE l_shipdate >= date ('1998-09-01') and
      l_shipdate < date ('1998-09-01') + 1 month

P3:
SELECT ...
FROM ...
WHERE l_shipdate >= date ('1995-09-01') and
      l_shipdate < date ('1995-09-01') + 1 month

• The data shows that 2% of the tuples fall in the range for P2, while 48% of the tuples fall in the range for P3. BOTH plans will choose an index scan! For P3 the qualifying set is so large that a table scan is better.

• Causes:

1. The cost model underestimates the cost of locking/unlocking index pages

2. The estimated number of tuples goes down because of the reduction factor problem (the reduction factor of the newly added predicate is multiplied in)
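The reduction factor problem in item 2 is plain arithmetic. The sketch below assumes an optimizer that treats predicates as independent and multiplies their selectivities; the table size and both selectivities are hypothetical numbers chosen for illustration, not measurements from the paper.

```python
def estimated_rows(table_rows, selectivities):
    """Cardinality estimate under the independence assumption."""
    estimate = float(table_rows)
    for selectivity in selectivities:
        estimate *= selectivity
    return estimate


TABLE_ROWS = 600_000      # hypothetical table size
SEL_SHIPDATE = 0.10       # hypothetical selectivity of the original predicate
SEL_RECEIPTDATE = 0.50    # hypothetical selectivity of the introduced predicate

before = estimated_rows(TABLE_ROWS, [SEL_SHIPDATE])
after = estimated_rows(TABLE_ROWS, [SEL_SHIPDATE, SEL_RECEIPTDATE])

print(before, after)
```

The introduced predicate is implied by the original one, so the true number of qualifying rows is unchanged, yet the estimate is cut by the new predicate's factor. A plan that looks cheap under the smaller estimate can therefore be slower in practice.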

Adjustments for Reduction Factor Problem

• Add a new predicate only when it contains a major (leading) column of an index and a scan of that index is sufficient to answer the query (thus, no table scan necessary)

• Original Index: <receiptdate, discount, quantity, extendedprice>

• New Index: <receiptdate, discount, quantity, extendedprice, shipdate, partkey, suppkey, orderkey>

[Figure: Predicate Introduction Execution Times. Execution time (seconds), Original vs. Optimized, for queries P3 and P5.]

Order Optimization Techniques

• Access plan strategies exploit the physical orderings provided either by indexes or by sorting

• GOAL: optimize the sorting strategy

• Techniques

– Pushing down sorts in joins

– Minimizing the number of sorting columns

– Detecting when sorting can be avoided because of predicates, keys or indexes

• Order Optimization: detecting when indexes provide an interesting order, so that sorting can be either avoided or used as sparingly as possible.

• Interesting Orders: when a join produces rows in sorted order as a side effect, that order can be taken advantage of later (by another join, ORDER BY, GROUP BY, or DISTINCT)

Fundamental Operators

• Order optimization requires the following operations

– Reduce Order

– Test Order

– Cover Order

– Homogenize Order
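The slide only names the four operations. As a rough illustration of the first two, the sketch below assumes that Reduce Order drops ordering columns that are bound to a constant or functionally determined by earlier columns, and that Test Order compares the reduced forms by prefix. This is a simplified reading rather than the algorithms from the cited paper: it ignores column equivalence classes and sort direction, applies functional dependencies in a single pass, and uses hypothetical function and column names.

```python
def reduce_order(order, constants=frozenset(), fds=()):
    """Remove ordering columns that cannot change the row order.

    order:     column names, most significant first.
    constants: columns bound to a constant by a predicate (col = literal).
    fds:       (determinant_columns, dependent_column) pairs.
    """
    known = set(constants)  # columns whose value is fixed by the prefix so far
    reduced = []
    for col in order:
        if col in known:
            continue  # constant, duplicate, or already determined
        if any(dep == col and set(det) <= known for det, dep in fds):
            known.add(col)  # determined by earlier columns: redundant
            continue
        reduced.append(col)
        known.add(col)
    return reduced


def order_satisfies(interesting_order, order_property, constants=frozenset(), fds=()):
    """Test Order sketch: is the wanted order a prefix of the available one?"""
    wanted = reduce_order(interesting_order, constants, fds)
    have = reduce_order(order_property, constants, fds)
    return have[:len(wanted)] == wanted


# Hypothetical columns: a functionally determines b.
print(reduce_order(["a", "b", "c"], fds=[(("a",), "b")]))
# ORDER BY x, y with predicate x = 10, rows already ordered on y.
print(order_satisfies(["x", "y"], ["y"], constants={"x"}))
# Rows ordered on (a, b) are not necessarily ordered on b alone.
print(order_satisfies(["b"], ["a", "b"]))
```

When the test succeeds, the optimizer can skip the sort for that ORDER BY, GROUP BY, or DISTINCT.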

Order Optimization Results

Exploiting Uniqueness

• Checking to see if a query contains unnecessary DISTINCT clauses

– How does this make improvements?

• Removing duplicates is performed by SORTING, a costly operation.

• An example is removing the DISTINCT keyword from a query when it is applied to the primary key itself (since primary keys are, by definition, distinct)
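The primary key case can be checked directly. The sketch below uses an in-memory SQLite database with a hypothetical three-row `part` table; it shows that DISTINCT changes nothing when the primary key is among the selected columns, while it does matter for a non-key column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (
        p_partkey INTEGER PRIMARY KEY,
        p_name    TEXT NOT NULL
    );
    -- Invented rows: two parts share the same name
    INSERT INTO part VALUES (1, 'bolt'), (2, 'nut'), (3, 'bolt');
""")

# The primary key is selected, so every result row is already unique.
with_distinct = sorted(conn.execute("SELECT DISTINCT p_partkey, p_name FROM part"))
without_distinct = sorted(conn.execute("SELECT p_partkey, p_name FROM part"))

# Without the key, DISTINCT really does remove a duplicate.
names_distinct = sorted(conn.execute("SELECT DISTINCT p_name FROM part"))
names_plain = sorted(conn.execute("SELECT p_name FROM part"))

print(with_distinct == without_distinct)
print(len(names_distinct), len(names_plain))
```

An optimizer that recognizes the first situation can drop the DISTINCT and with it the duplicate-removal sort.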

How to exploit uniqueness?

• Using knowledge about:

– Keys

– Table Constraints

– Query Predicates

• Uniqueness cannot always be tested efficiently, so we look for a sufficient condition instead.

Summary

• Important outcome: experimental evidence showing that SQO can provide an effective enhancement to traditional query optimization.

– Join Elimination: geared towards OLAP environments (where it is very useful)

   – Independent of the existence of complex integrity constraints; the semantic reasoning uses only referential integrity constraints

   – Easy to implement and execute

– Predicate Introduction: guaranteeing improvements is more difficult and needs rather severe restrictions (which limits the applicability of this approach)

– Order Optimization: functional dependencies and table information are used to create a “smart” access plan that avoids or optimizes sort operations.

– Exploiting Uniqueness: uniqueness is powerful when it reduces the number of expensive sorts. Discovering sound ways of exploiting it is quite tricky and case-specific.

References

• Qi Cheng, Jarek Gryz, Fred Koo, et al.: Implementation of Two Semantic Query Optimization Techniques in DB2 Universal Database. Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

• David E. Simmen, Eugene J. Shekita, Timothy Malkemus: Fundamental Techniques for Order Optimization. SIGMOD Conference 1996: 57-67

• G. N. Paulley, Per-Åke Larson: Exploiting Uniqueness in Query Optimization. ICDE 1994: 68-79

The End.