Semantic Query Optimization Techniques
November 16, 2005
By : Mladen Kovacevic
Background
• 1980s: semantic information stored in databases as integrity constraints could be used for query optimization
• semantic: “of or relating to meaning or the study of meaning”(http://wordnet.princeton.edu)
• integrity : preserving data consistency when changes are made to the database.
• no extensive implementation exists today (as of 1999)
Introduction
• Query optimization is a key factor in improving a relational database system’s query execution time.
• Query execution can be improved by:
– Analyzing integrity information and rewriting queries to exploit it (JE & PI)
– Avoiding expensive sorting costs (Order Optimization)
– Exploiting uniqueness: knowing that rows will be unique avoids extra sorts (EU)
Presentation Overview
• Semantic Query Optimization techniques
– Join Elimination (JE)
– Predicate Introduction (PI)
– Order Optimization (OO)
– Exploiting Uniqueness (EU)
Some Motivation
• Describing two techniques in SQO, demonstrated in DB2 UDB.
– Predicate Introduction
– Join Elimination
• Reasons:
– rewriting queries by hand showed that these two provided consistent optimization.
– practical to implement
– extendible to other DBMSs.
• Data sets used : TPC-D and APB-1 OLAP benchmarks
• Only REFERENTIAL INTEGRITY constraints and CHECK CONSTRAINTS used!
Semantic Query Optimization (SQO) Techniques
• Join Elimination: Some joins need NOT be evaluated since the result may be known a priori (more on this later)
• Join Introduction: Adding a join can help if the added relation is small compared to the original relations and highly selective.
• Predicate Elimination : If a predicate is known to be always true, it can be eliminated from the query (e.g. a DISTINCT clause on a Primary Key – Uniqueness exploitation!)
• Predicate Introduction: New predicates on indexed attributes can result in a much faster access plan.
• Detecting the Empty Answer Set : If the query predicates are inconsistent with the integrity constraints, the query has no answer.
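The empty-answer-set check can be illustrated with a minimal Python sketch. It assumes that check constraints and query predicates have already been reduced to closed ranges per column; the `Interval` class, the function name, and the `discount` constraint are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [low, high] of values allowed for one column."""
    low: float = -math.inf
    high: float = math.inf

    def intersect(self, other: "Interval") -> "Interval":
        return Interval(max(self.low, other.low), min(self.high, other.high))

    def is_empty(self) -> bool:
        return self.low > self.high


def answer_is_provably_empty(constraints, predicates):
    """True if, for some column, the constraint and predicate ranges are disjoint."""
    for column, wanted in predicates.items():
        allowed = constraints.get(column, Interval())  # unconstrained by default
        if allowed.intersect(wanted).is_empty():
            return True
    return False


# Hypothetical check constraint: discount BETWEEN 0.0 AND 0.1
constraints = {"discount": Interval(0.0, 0.1)}
contradictory = {"discount": Interval(low=0.5)}   # WHERE discount >= 0.5
satisfiable = {"discount": Interval(low=0.05)}    # WHERE discount >= 0.05
```

With these inputs the first query can be answered without touching the table, while the second must still be executed.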
Why SQO implementations not used?
• Deductive Databases : In many cases SQO techniques were designed for deductive databases, thus not appearing to be useful in a relational database context.
• CPU & I/O Speeds similar : When the techniques were being developed, CPU & I/O speeds were not as dramatically different
– (savings in I/O not worth the CPU time added)
• Lack of Integrity Constraints : It was thought that many integrity constraints are needed for SQO to be useful
Two-stage Optimizer
• Examples of SQO techniques were always designed for a two-stage optimizer
– Stage 1 : logically equivalent queries created (DB2’s query rewrite optimization)
– Stage 2 : generate plans of all these queries, choosing the one with lowest estimated cost (DB2’s query plan optimization)
• Plan choices include join order, join methods, join site in a distributed database, method for accessing input table, etc.
Join Elimination
• Simple : eliminate a relation where the join is over tables related through a referential integrity constraint, and the primary key table is referenced only in the join
VIEW DEFINITION
CREATE VIEW Supplier_Info (n, a, c) AS
SELECT s_name, s_address, n_name
FROM tpcd.supplier, tpcd.nation
WHERE s_nationkey = n_nationkey
QUERY
SELECT n, a
FROM Supplier_Info
Join Elimination (con’t)
• Query can be rewritten internally as:
SELECT s_name, s_address
FROM tpcd.supplier
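As a rough check of why this rewrite is safe, the following Python sketch builds a tiny in-memory SQLite database (an assumption for illustration; the slides use DB2 UDB and the `tpcd` schema, which is dropped here) and compares the query over the view with the rewritten query that skips the join. The sample rows are invented. The equivalence relies on `s_nationkey` being a NOT NULL foreign key, so every supplier matches exactly one nation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
conn.executescript("""
CREATE TABLE nation (
    n_nationkey INTEGER PRIMARY KEY,
    n_name      TEXT NOT NULL
);
CREATE TABLE supplier (
    s_suppkey   INTEGER PRIMARY KEY,
    s_name      TEXT NOT NULL,
    s_address   TEXT NOT NULL,
    s_nationkey INTEGER NOT NULL REFERENCES nation (n_nationkey)
);
CREATE VIEW Supplier_Info (n, a, c) AS
    SELECT s_name, s_address, n_name
    FROM supplier, nation
    WHERE s_nationkey = n_nationkey;
-- Invented sample data
INSERT INTO nation VALUES (1, 'CANADA'), (2, 'CROATIA');
INSERT INTO supplier VALUES
    (10, 'Acme', '1 Main St', 1),
    (11, 'Bolt', '2 Side St', 2),
    (12, 'Cog',  '3 High St', 1);
""")

# Query through the view: the join with nation is evaluated.
via_view = conn.execute(
    "SELECT n, a FROM Supplier_Info ORDER BY n"
).fetchall()

# Rewritten query: nation is only referenced in the join, so it is dropped.
rewritten = conn.execute(
    "SELECT s_name, s_address FROM supplier ORDER BY s_name"
).fetchall()
```

Both result lists contain the same three supplier rows. If the foreign key were nullable or not enforced, the join could filter out suppliers and the rewrite would no longer be valid.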
Why do such a simple rewrite?
• User may not have access to the supplier table, and/or may only know about the view.
• Sometimes GUI managers create these “dumb” queries, so they need to be optimized
• Non-programmers often write queries, and may not even think about this.
• Algorithm for generic redundant join removal provided in paper.
Example – Join Elimination
SELECT p_name, p_retailprice, s_name, s_address
FROM tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier
WHERE p_partkey = ps_partkey and s_suppkey = ps_suppkey
  and ps_partkey = l_partkey and ps_suppkey = l_suppkey
  and l_shipdate between '1994-01-01' and '1996-06-30'
  and l_discount >= 0.1
GROUP BY p_name, p_retailprice, s_name, s_address
ORDER BY p_name, s_name
[Schema diagram: PART (PARTKEY), SUPPLIER (SUPPKEY), PARTSUPP (PARTKEY, SUPPKEY), LINEITEM (PARTKEY, SUPPKEY), connected by 1-to-many relationships]
Example : Join Elimination
• Any immediate improvements that can be seen here?
p_partkey = ps_partkey and s_suppkey = ps_suppkey and ps_partkey = l_partkey and ps_suppkey = l_suppkey
• Columns equated by the join predicates:
– P_PARTKEY, PS_PARTKEY, L_PARTKEY
– S_SUPPKEY, PS_SUPPKEY, L_SUPPKEY
• By transitivity:
– P_PARTKEY = PS_PARTKEY and PS_PARTKEY = L_PARTKEY give P_PARTKEY = L_PARTKEY
– S_SUPPKEY = PS_SUPPKEY and PS_SUPPKEY = L_SUPPKEY give S_SUPPKEY = L_SUPPKEY
• PARTSUPP is then referenced only in joins, making it a candidate for elimination
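The transitivity step can be sketched in Python by grouping columns into equivalence classes with a small union-find structure. This is only an illustration of the reasoning; the function names are hypothetical and this is not the redundant-join removal algorithm from the paper.

```python
def equivalence_classes(equalities):
    """Group columns that are (transitively) equated by join predicates."""
    parent = {}

    def find(col):
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]  # path halving
            col = parent[col]
        return col

    for left, right in equalities:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    classes = {}
    for col in list(parent):
        classes.setdefault(find(col), set()).add(col)
    return list(classes.values())


def implied(equalities, left, right):
    """True if left = right follows from the given equality predicates."""
    return any(left in cls and right in cls
               for cls in equivalence_classes(equalities))


# Join predicates from the example query
join_predicates = [
    ("p_partkey", "ps_partkey"),
    ("s_suppkey", "ps_suppkey"),
    ("ps_partkey", "l_partkey"),
    ("ps_suppkey", "l_suppkey"),
]
```

Here `implied(join_predicates, "p_partkey", "l_partkey")` and `implied(join_predicates, "s_suppkey", "l_suppkey")` both hold, so PART and SUPPLIER can be joined to LINEITEM directly.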
Results
• 100 MB db size
• Execution Time : 58.5 sec -> 38.25 sec (35 % improvement)
• I/O Cost : 4631 -> 1498 page reads (67 % improvement)
[Figure: Join Elimination Optimizing Query 1, Pages Read. Original: 4631, Optimized: 1498.]
[Figure: Join Elimination Optimizing Query 1, Execution Time (seconds). Original: 58.5, Optimized: 38.25.]
Results – OLAP Environment
• In OLAP (online analytical processing) servers using a star schema (one fact table, with several dimension tables), improvements ranged from 2 % to 96 %.
– In these cases, much of the improvement came from CPU cost instead of I/O, because the dimension tables were small enough to fit into memory...
[Figure: Join Elimination Optimized Query in OLAP Environment. Execution time (seconds) by query name, I1 to I10.]
Predicate Introduction
• Techniques discussed :
– Index Introduction : add a new predicate on an attribute if an index exists on that attribute.
– Assumption : index retrieval is better than a table scan. Is this always good?
– Scan Reduction : reduce the number of tuples that qualify for a join.
– Problem : not very common; it is unlikely that there will be any check constraints or predicates with inequalities on join columns
• Detecting an empty query answer set (results not shown, as query execution time is essentially 0)
Example - Predicate Introduction
SELECT sum(l_extendedprice * l_discount) as revenue
FROM tpcd.lineitem
WHERE l_shipdate >= date('1994-01-01') and
l_shipdate < date('1994-01-01') + 1 year and
l_discount between .06 - 0.01 and
.06 + 0.01 and l_quantity < 24;
Check Constraint : l_shipdate <= l_receiptdate
Index : l_receiptdate
• Maintaining semantics, we can add :
– l_receiptdate >= date('1994-01-01')
Example - Predicate Introduction
SELECT sum(l_extendedprice * l_discount) as revenue
FROM tpcd.lineitem
WHERE l_shipdate >= date('1994-01-01') and
l_shipdate < date('1994-01-01') + 1 year and
l_receiptdate >= date('1994-01-01') and
l_discount between .06 - 0.01 and
.06 + 0.01 and l_quantity < 24;
Check Constraint : l_shipdate <= l_receiptdate
Index : l_receiptdate
• Maintaining semantics, we can add :
– l_receiptdate >= date('1994-01-01')
• Why would we want to do this? So that the optimizer chooses a plan using the index. Is this always good?
• NO! What if most of the rows in the table need to be returned? We should use a tablescan instead.
Predicate Introduction - Algorithm
• Input : set of all check constraints defined for a database and the set of all predicates in the query
• Output: set of all non-redundant formulas derivable from the input set. This answer set can then be added to the query, but only a few are potentially useful.
• The goal in the paper was to choose additions that would guarantee improvement.
• Conditions in paper: Conservative approach of introducing predicates that will have the plan optimizer use an index. Insist on only one index available with the query predicate.
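The derivation step can be sketched in Python for the simplest case: a check constraint of the form `col1 <= col2` combined with a range predicate on one of the two columns. This is a simplified illustration, not the paper's algorithm. The `ColumnOrder` and `Bound` types are hypothetical, the upper date bound is written out as an assumed constant, and the only usefulness filter applied is that the target column is indexed.

```python
from typing import NamedTuple


class ColumnOrder(NamedTuple):
    """Check constraint of the form: smaller <= larger (two columns)."""
    smaller: str
    larger: str


class Bound(NamedTuple):
    """Query predicate of the form: column op constant."""
    column: str
    op: str        # one of '>=', '>', '<=', '<'
    constant: str


def derive_predicates(checks, predicates, indexed_columns):
    """Derive new bounds by transitivity, keeping only those on indexed columns."""
    derived = []
    for check in checks:
        for pred in predicates:
            if pred.column == check.smaller and pred.op in (">=", ">"):
                # smaller >= c and smaller <= larger  =>  larger >= c
                candidate = Bound(check.larger, pred.op, pred.constant)
            elif pred.column == check.larger and pred.op in ("<=", "<"):
                # larger <= c and smaller <= larger  =>  smaller <= c
                candidate = Bound(check.smaller, pred.op, pred.constant)
            else:
                continue
            if (candidate.column in indexed_columns
                    and candidate not in predicates
                    and candidate not in derived):
                derived.append(candidate)
    return derived


checks = [ColumnOrder("l_shipdate", "l_receiptdate")]
predicates = [
    Bound("l_shipdate", ">=", "1994-01-01"),
    Bound("l_shipdate", "<", "1995-01-01"),   # assumed value of date + 1 year
    Bound("l_quantity", "<", "24"),
]
new_predicates = derive_predicates(checks, predicates, {"l_receiptdate"})
```

For the example query this yields the single predicate `l_receiptdate >= '1994-01-01'`. The upper bound on `l_shipdate` says nothing about `l_receiptdate`, so nothing is derived from it.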
Predicate Introduction - Results
[Figure: Predicate Introduction Estimated Costs. Estimated cost (internal units) for queries P1 to P5, Original vs. Optimized.]
[Figure: Predicate Introduction Execution Times. Execution time (seconds) for queries P1 to P5, Original vs. Optimized.]
Why?
Why Longer Execution for P3/P5?
• P2 and P3 are the same except for the following
P2 :
SELECT ...
FROM ...
WHERE l_shipdate >= date ('1998-09-01') and l_shipdate < date ('1998-09-01') + 1 month
P3 :
SELECT ...
FROM ...
WHERE l_shipdate >= date ('1995-09-01') and l_shipdate < date ('1995-09-01') + 1 month
• The data in the table differs : for P2 only 2 % of the tuples fall in the range, while for P3 48 % do. BOTH plans will choose an index scan! P3 selects so many tuples that a tablescan is better in this case.
• Reasons :
1. Cost model underestimates cost of locking/unlocking index pages
2. Estimated number of tuples goes down because of the reduction factor problem (the reduction factor of the newly added predicate is multiplied in)
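The reduction factor problem can be made concrete with a small arithmetic sketch in Python. It assumes the common textbook model in which the optimizer treats predicates as independent and multiplies their reduction factors; the table size and the factor for the introduced predicate are invented for illustration, and only the 48 % figure comes from the slide.

```python
from math import prod


def estimated_rows(table_rows, reduction_factors):
    """Cardinality estimate under the independence assumption."""
    return table_rows * prod(reduction_factors)


table_rows = 600_000            # invented table size
shipdate_factor = 0.48          # P3: 48 % of tuples fall in the range
receiptdate_factor = 0.5        # invented factor for the introduced predicate

original = estimated_rows(table_rows, [shipdate_factor])
with_introduced = estimated_rows(table_rows, [shipdate_factor, receiptdate_factor])
```

The introduced predicate is implied by the existing ones, so the true result size is unchanged. The estimate, however, drops from roughly 288,000 to roughly 144,000 rows, which makes the index scan look cheaper than it really is.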
Adjustments for Reduction Factor Problem
• Add new predicate only when it contains a major column of an index and a scan of that index is sufficient to answer the query (thus, no table scan necessary)
• Original Index : <receiptdate, discount, quantity, extendedprice>
• New Index : <receiptdate, discount, quantity, extendedprice, shipdate, partkey, suppkey, orderkey>
[Figure: Predicate Introduction Execution Times. Execution time (seconds) for queries P3 and P5, Original vs. Optimized.]
Order Optimization Techniques
• Access plan strategies exploit the physical orderings provided either by indexes or sorting
• GOAL: optimize the sorting strategy
• Techniques
– Pushing down sorts in joins
– Minimizing the number of sorting columns
– Detecting when sorting can be avoided because of predicates, keys or indexes
• Order Optimization : detecting when indexes provide an interesting order, so that sorting can be either avoided or used as sparingly as possible.
• Interesting Orders : when a join produces rows in sorted order as a side effect, that order can be taken advantage of later (by another join, ORDER BY, GROUP BY, DISTINCT)
Fundamental Operators
• Order optimization requires the following operations
– Reduce Order
– Test Order
– Cover Order
– Homogenize Order
Order Optimization Results
Exploiting Uniqueness
• Checking to see if a query contains unnecessary DISTINCT clauses
– How does this make improvements?
• Removing duplicates is performed by SORTING, a costly operation.
• An example is removing the DISTINCT keyword from a query if it is applied to the primary key itself (since primary keys are, by definition, distinct)
How to exploit uniqueness?
• Using knowledge about:
– Keys
– Table Constraints
– Query Predicates
• Uniqueness cannot always be tested efficiently, so we look for a sufficient condition.
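One simple sufficient condition can be sketched in Python for a single-table query: DISTINCT is redundant if every column of some candidate key is either in the select list or bound to a constant by an equality predicate. This is an illustrative test under those assumptions, not the full method of the cited paper; the function name is hypothetical and the columns follow the PARTSUPP table used earlier.

```python
def distinct_is_redundant(select_columns, candidate_keys, constant_columns=frozenset()):
    """Sufficient (not necessary) test that a single-table result has no duplicates."""
    known = set(select_columns) | set(constant_columns)
    return any(set(key) <= known for key in candidate_keys)


partsupp_keys = [("ps_partkey", "ps_suppkey")]

# SELECT DISTINCT ps_partkey, ps_suppkey, ps_availqty FROM partsupp
full_key = distinct_is_redundant(
    ["ps_partkey", "ps_suppkey", "ps_availqty"], partsupp_keys)

# SELECT DISTINCT ps_suppkey, ps_availqty FROM partsupp
partial_key = distinct_is_redundant(
    ["ps_suppkey", "ps_availqty"], partsupp_keys)

# SELECT DISTINCT ps_suppkey FROM partsupp WHERE ps_partkey = 42
key_via_predicate = distinct_is_redundant(
    ["ps_suppkey"], partsupp_keys, constant_columns={"ps_partkey"})
```

A `False` result only means the test could not prove uniqueness; the query may still return distinct rows, in which case the sort is simply kept.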
Summary
• Important Outcome : experimental evidence showing SQO can provide effective enhancement to traditional query optimization.
– Join Elimination : geared towards the OLAP environment (where it is very useful)
  – Independent of the existence of complex integrity constraints; semantic reasoning uses referential integrity constraints
  – Easy to implement and execute
– Predicate Introduction : guaranteeing improvements is more difficult, requiring rather severe restrictions (which limits the applicability of this approach)
– Order Optimization : utilizing functional dependencies and table information to create a “smart” access plan that avoids or optimizes sort operations.
– Exploiting Uniqueness : uniqueness is powerful when it reduces the number of expensive sorts. Discovering reliable ways of exploiting this technique is quite tricky and specific.
References
• Qi Cheng, Jarek Gryz, Fred Koo, et al: Implementation of Two Semantic Query Optimization Techniques in DB2 Universal Database. Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.
• David E. Simmen, Eugene J. Shekita, Timothy Malkemus: Fundamental Techniques for Order Optimization. SIGMOD Conference 1996: 57-67
• G. N. Paulley, Per-Åke Larson: Exploiting Uniqueness in Query Optimization. ICDE 1994: 68-79
The End.