
CS263

Query Optimisation

LECTURE PLAN

Motivation for Query Optimisation

Phases of Query Processing

Query Trees

RA Transformation Rules

Heuristic Processing Strategies

Cost Estimation for RA Operations

Motivation for Query Optimisation

List all the managers that work in the sales department.

SELECT *
FROM emp, dept
WHERE emp.deptno = dept.deptno
AND emp.job = 'Manager'
AND dept.name = 'Sales';

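There are at least three alternative ways of representing this query as a Relational Algebra expression:

$\sigma_{(\text{job} = \text{'Manager'}) \land (\text{name} = \text{'Sales'}) \land (\text{emp.deptno} = \text{dept.deptno})}(\text{EMP} \times \text{DEPT})$

$\sigma_{(\text{job} = \text{'Manager'}) \land (\text{name} = \text{'Sales'})}(\text{EMP} \bowtie_{\text{emp.deptno} = \text{dept.deptno}} \text{DEPT})$

$(\sigma_{\text{job} = \text{'Manager'}}(\text{EMP})) \bowtie_{\text{emp.deptno} = \text{dept.deptno}} (\sigma_{\text{name} = \text{'Sales'}}(\text{DEPT}))$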



Metrics:

1000 tuples in the EMP relation

50 tuples in the DEPT relation

50 employees are Managers (one per department)

5 separate Sales departments (across the country)

Cost of processing the first alternative (selection over the Cartesian product of EMP and DEPT):

Cartesian product of EMP and DEPT: (1000 + 50) record I/Os to read the relations + (1000 * 50) record I/Os to create an intermediate relation to store the result

Selection on result of Cartesian product: (1000 * 50) record I/Os to read each tuple and compare it against the predicate

Total cost of the query: (1000 + 50) + 2*(1000 * 50) = 101,050 record I/Os.

Cost of processing the second alternative (selection over the join of EMP and DEPT), using the same metrics:

Join of EMP and DEPT over deptno: (1000 + 50) record I/Os to read the relations + (1000) record I/Os to create an intermediate relation to store the join result

Selection on result of Join: (1000) record I/Os to read each tuple and compare it against the predicate

Total cost of the query: (1000 + 50) + 2*(1000) = 3,050 record I/Os.


Cost of processing the third alternative (selections on EMP and DEPT, followed by the join):

Select ‘Managers’ in EMP: (1000) record I/Os to read the relation + (50) record I/Os to create an intermediate relation to store the select result

Select ‘Sales’ in DEPT: (50) record I/Os to read the relation + (5) record I/Os to create an intermediate relation to store the select result

Join of previous two selections over deptno: (50 + 5) record I/Os to read the relations

Total cost of the query: 1000 + 2*(50) + 5 + (50 + 5) = 1,160 record I/Os.
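The three totals can be reproduced with a short calculation. The sketch below only restates the arithmetic from the slides; the constant and function names are illustrative and not part of the original material.

```python
# Record-I/O cost of the three alternatives, using the metrics given above.
EMP_TUPLES = 1000   # tuples in EMP
DEPT_TUPLES = 50    # tuples in DEPT
MANAGERS = 50       # EMP tuples with job = 'Manager'
SALES_DEPTS = 5     # DEPT tuples with name = 'Sales'


def cost_product_then_select():
    read_inputs = EMP_TUPLES + DEPT_TUPLES
    write_product = EMP_TUPLES * DEPT_TUPLES   # store the intermediate relation
    read_product = EMP_TUPLES * DEPT_TUPLES    # re-read it for the selection
    return read_inputs + write_product + read_product


def cost_join_then_select():
    read_inputs = EMP_TUPLES + DEPT_TUPLES
    write_join = EMP_TUPLES                    # one joined tuple per employee
    read_join = EMP_TUPLES                     # re-read it for the selection
    return read_inputs + write_join + read_join


def cost_select_then_join():
    select_emp = EMP_TUPLES + MANAGERS         # read EMP, write the managers
    select_dept = DEPT_TUPLES + SALES_DEPTS    # read DEPT, write the sales departments
    read_for_join = MANAGERS + SALES_DEPTS     # read both intermediate relations
    return select_emp + select_dept + read_for_join


for name, cost in [
    ("product then select", cost_product_then_select()),
    ("join then select", cost_join_then_select()),
    ("select then join", cost_select_then_join()),
]:
    print(f"{name}: {cost} record I/Os")
```

The first alternative costs roughly 87 times as much as the third (101,050 against 1,160), which is the motivation for optimising the query before executing it.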

Phases of Query Processing

Query Processing Stage - 1

Cast the query into internal form

This involves the conversion of the original (SQL) query into some internal representation more suitable for machine manipulation.

The internal representation typically chosen is either some kind of ‘abstract syntax tree’, or a relational algebra ‘query tree’.

Relational Algebra Query Trees

A Relational Algebra query can be represented as a ‘query tree’. For example the query to list all the managers that work in the sales department could be described as one of the following:


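Root: $\sigma_{(\text{job} = \text{'Manager'}) \land (\text{name} = \text{'Sales'}) \land (\text{emp.deptno} = \text{dept.deptno})}$

Intermediate operations: X (Cartesian product of EMP and DEPT)

Leaves: EMP, DEPT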


The same query tree with the selection split into two separate operations:

Root and intermediate operations: $\sigma_{(\text{job} = \text{'Manager'}) \land (\text{name} = \text{'Sales'})}$ and $\sigma_{\text{emp.deptno} = \text{dept.deptno}}$, applied above X (Cartesian product of EMP and DEPT)

Leaves: EMP, DEPT



Alternative ‘query tree’ for the query to list all the managers that work in the sales department:

Intermediate operations: $\bowtie_{\text{emp.deptno} = \text{dept.deptno}}$ (join of EMP and DEPT)

Leaves: EMP, DEPT



Alternative ‘query tree’ for the query to list all the managers that work in the sales department:

Leaves: EMP, DEPT
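A query tree can be held in memory as a small recursive structure. The following is a minimal sketch, assuming a hypothetical `Node` class with an operation name, an argument (relation name or predicate text) and up to two children; it is not the representation used by any particular DBMS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    op: str                          # "relation", "select", "product" or "join"
    arg: str = ""                    # relation name or predicate text
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaves(node):
    """Return the base relations of the tree, left to right."""
    if node.op == "relation":
        return [node.arg]
    found = []
    for child in (node.left, node.right):
        if child is not None:
            found.extend(leaves(child))
    return found


def to_text(node):
    """Render the tree as a one-line algebra expression."""
    if node.op == "relation":
        return node.arg
    if node.op == "select":
        inner = to_text(node.left)
        if not inner.startswith("("):
            inner = f"({inner})"
        return f"select[{node.arg}]{inner}"
    if node.op == "product":
        return f"({to_text(node.left)} X {to_text(node.right)})"
    if node.op == "join":
        return f"({to_text(node.left)} join[{node.arg}] {to_text(node.right)})"
    raise ValueError(f"unknown operation: {node.op}")


emp, dept = Node("relation", "EMP"), Node("relation", "DEPT")

# Selection over the Cartesian product (root = select, leaves = EMP, DEPT).
product_tree = Node(
    "select",
    "job = 'Manager' AND name = 'Sales' AND emp.deptno = dept.deptno",
    Node("product", "", emp, dept),
)

# Selections applied to the leaves, then joined.
pushed_tree = Node(
    "join",
    "emp.deptno = dept.deptno",
    Node("select", "job = 'Manager'", emp),
    Node("select", "name = 'Sales'", dept),
)

print(to_text(product_tree))
print(to_text(pushed_tree))
```

Both trees have the same leaves and describe the same query; only the intermediate operations and the root differ.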


Query Processing Stage - 2

Convert to canonical form

Find a more ‘efficient’ representation of the query by converting the internal representation into some equivalent (canonical) form through the application of a set of well-defined ‘transformation rules’.

Which transformation rules are applied, and in what order, is generally determined by the heuristic processing strategies of the particular DBMS.

Transformation Rules for RA Operations

1. Conjunctive selection operations can cascade into individual selection operations (and vice versa).

Sometimes referred to as cascade of selection.

$\sigma_{p \land q \land r}(R) = \sigma_p(\sigma_q(\sigma_r(R)))$

Example:

$\sigma_{\text{deptno} = 10 \land \text{sal} > 1000}(\text{Emp}) = \sigma_{\text{deptno} = 10}(\sigma_{\text{sal} > 1000}(\text{Emp}))$

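2. Commutativity of selection

$\sigma_p(\sigma_q(R)) = \sigma_q(\sigma_p(R))$

Example:

$\sigma_{\text{sal} > 1000}(\sigma_{\text{deptno} = 10}(\text{Emp})) = \sigma_{\text{deptno} = 10}(\sigma_{\text{sal} > 1000}(\text{Emp}))$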
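Rules 1 and 2 can be checked on a small relation. The sketch below assumes a made-up `EMP` relation held as a list of dictionaries and a hypothetical `select` helper; the employee rows are illustrative only.

```python
# A tiny, made-up Emp relation used only to illustrate the two rules.
EMP = [
    {"name": "Adams", "deptno": 10, "sal": 1500},
    {"name": "Baker", "deptno": 10, "sal": 900},
    {"name": "Clark", "deptno": 20, "sal": 2000},
]


def select(pred, rel):
    """Selection: keep the rows of rel that satisfy pred."""
    return [row for row in rel if pred(row)]


def in_dept_10(row):
    return row["deptno"] == 10


def high_sal(row):
    return row["sal"] > 1000


# Rule 1: one conjunctive selection equals a cascade of single selections.
conjunctive = select(lambda row: in_dept_10(row) and high_sal(row), EMP)
cascaded = select(in_dept_10, select(high_sal, EMP))

# Rule 2: the order of the two selections does not matter.
swapped = select(high_sal, select(in_dept_10, EMP))

print(conjunctive == cascaded == swapped)
```

All three forms return the single row for Adams; they differ only in how many rows the inner selection passes to the outer one.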


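3. In a sequence of projection operations, only the last in the sequence is required.

$\pi_L \pi_M \ldots \pi_N(R) = \pi_L(R)$

where $L \subseteq M \subseteq \ldots \subseteq N$

Example:

$\pi_{\text{deptno}}(\pi_{\text{deptno}, \text{name}}(\text{Dept})) = \pi_{\text{deptno}}(\text{Dept})$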


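4. Commutativity of selection and projection.

$\pi_{A_1, \ldots, A_m}(\sigma_p(R)) = \sigma_p(\pi_{A_1, \ldots, A_m}(R))$

where $p \in \{A_1, A_2, \ldots, A_m\}$

Selection predicate (p) is only made up of projected attributes.

Example:

$\pi_{\text{name}, \text{job}}(\sigma_{\text{name} = \text{'Smith'}}(\text{Emp})) = \sigma_{\text{name} = \text{'Smith'}}(\pi_{\text{name}, \text{job}}(\text{Emp}))$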

5. Commutativity of theta-join (and Cartesian product).

$R \bowtie_p S = S \bowtie_p R$

$R \times S = S \times R$

Example:

$\text{EMP} \bowtie_{\text{emp.deptno} = \text{dept.deptno}} \text{DEPT} = \text{DEPT} \bowtie_{\text{emp.deptno} = \text{dept.deptno}} \text{EMP}$

NOTE: Theta-join is a generalisation of both the equi-join and natural-join.

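6. Commutativity of selection and theta-join (or Cartesian product).

$(\sigma_p(R)) \bowtie_r S = \sigma_p(R \bowtie_r S)$

where $p \in \{A_1, A_2, \ldots, A_m\}$, the attributes of R

Selection predicate (p) is only made up of attributes of R, one of the relations being joined.

Example:

$(\sigma_{\text{emp.deptno} = 10}(\text{EMP})) \bowtie_{\text{emp.deptno} = \text{dept.deptno}} \text{DEPT} = \sigma_{\text{emp.deptno} = 10}(\text{EMP} \bowtie_{\text{emp.deptno} = \text{dept.deptno}} \text{DEPT})$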
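Rule 6 is what allows a selection to be moved below a join. The sketch below checks it on two made-up relations; the rows, the attribute naming scheme (`emp.` and `dept.` prefixes) and the helpers `select` and `theta_join` are assumptions made for illustration.

```python
# Made-up relations; attribute names carry their relation prefix so that
# joined rows can hold both deptno attributes without clashing.
EMP = [
    {"emp.name": "Adams", "emp.deptno": 10},
    {"emp.name": "Baker", "emp.deptno": 20},
    {"emp.name": "Clark", "emp.deptno": 10},
]
DEPT = [
    {"dept.deptno": 10, "dept.name": "Sales"},
    {"dept.deptno": 20, "dept.name": "Research"},
]


def select(pred, rel):
    return [row for row in rel if pred(row)]


def theta_join(pred, left, right):
    """Nested-loop theta-join: merge every pair of rows that satisfies pred."""
    return [{**a, **b} for a in left for b in right if pred(a, b)]


def same_dept(e, d):
    return e["emp.deptno"] == d["dept.deptno"]


def in_dept_10(row):
    # Refers to an attribute of EMP only, so the rule applies.
    return row["emp.deptno"] == 10


select_first = theta_join(same_dept, select(in_dept_10, EMP), DEPT)
join_first = select(in_dept_10, theta_join(same_dept, EMP, DEPT))

print(select_first == join_first)
print(len(select(in_dept_10, EMP)), "rows enter the join instead of", len(EMP))
```

The results are identical, but selecting first feeds fewer rows into the join, which is why the heuristics favour the left-hand form.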

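7. Commutativity of projection and theta-join (or Cartesian product).

$\pi_L(R \bowtie_r S) = (\pi_{L_1}(R)) \bowtie_r (\pi_{L_2}(S))$

Project attributes $L = L_1 \cup L_2$, where $L_1$ are attributes of R, and $L_2$ are attributes of S. L will also contain the join attributes.

Example:

$\pi_{\text{job}, \text{location}, \text{deptno}}(\text{EMP} \bowtie_{\text{emp.deptno} = \text{dept.deptno}} \text{DEPT}) = (\pi_{\text{job}, \text{deptno}}(\text{EMP})) \bowtie_{\text{emp.deptno} = \text{dept.deptno}} (\pi_{\text{location}, \text{deptno}}(\text{DEPT}))$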

8. Commutativity of union and intersection (but not set difference).

$R \cup S = S \cup R$

$R \cap S = S \cap R$



9. Commutativity of selection and set operations (union, intersection, and set difference).

Union

$\sigma_p(R \cup S) = \sigma_p(S) \cup \sigma_p(R)$

Intersection

$\sigma_p(R \cap S) = \sigma_p(S) \cap \sigma_p(R)$

Set Difference

$\sigma_p(R - S) = \sigma_p(R) - \sigma_p(S)$
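The behaviour of selection over the set operations can be checked with Python sets. The relations `R` and `S` and the predicate `p` below are made up for illustration. Because set difference is not commutative, the operands keep their order on both sides of the equation.

```python
# Made-up union-compatible relations, held as sets of tuples.
R = {(1, "a"), (2, "b"), (3, "c")}
S = {(2, "b"), (4, "d")}


def select(pred, rel):
    return {row for row in rel if pred(row)}


def p(row):
    return row[0] > 1


union_ok = select(p, R | S) == select(p, S) | select(p, R)
intersection_ok = select(p, R & S) == select(p, S) & select(p, R)
difference_ok = select(p, R - S) == select(p, R) - select(p, S)

# Swapping the operands of the difference gives a different relation.
swapped_difference = select(p, S) - select(p, R)

print(union_ok, intersection_ok, difference_ok)
print(select(p, R - S), swapped_difference)
```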

10. Commutativity of projection and union

$\pi_L(R \cup S) = \pi_L(S) \cup \pi_L(R)$


11. Associativity of natural join (and Cartesian product)

Natural Join

$(R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)$

Cartesian Product

$(R \times S) \times T = R \times (S \times T)$



12. Associativity of union and intersection (but not set difference)

Union

$(R \cup S) \cup T = S \cup (R \cup T)$

Intersection

$(R \cap S) \cap T = S \cap (R \cap T)$

Heuristic Processing Strategies

Perform selection operations as early as possible

Translate a Cartesian product and subsequent selection (whose predicate represents a join condition) into a join operation.

Use associativity of binary operations to ensure that the most restrictive selection operations are executed first

Perform projections as early as possible.

Compute common expressions once
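The first two heuristics can be sketched as a small rewrite. The function below assumes, hypothetically, that the WHERE clause has already been split into conjuncts, each tagged with the relations it refers to. It pushes single-relation conjuncts down to their relation and turns conjuncts that span both relations into a join condition. It handles exactly two relations and returns a text expression, so it illustrates the idea only and is not a real optimiser.

```python
def apply_heuristics(conjuncts, left, right):
    """Rewrite select[all conjuncts](left X right) using two heuristics.

    conjuncts: list of (predicate_text, set_of_relation_names) pairs.
    """
    # Heuristic 1: perform selections as early as possible.
    left_preds = [p for p, rels in conjuncts if rels == {left}]
    right_preds = [p for p, rels in conjuncts if rels == {right}]
    # Heuristic 2: a selection spanning both relations becomes a join condition.
    join_preds = [p for p, rels in conjuncts if rels == {left, right}]

    def leaf(name, preds):
        if not preds:
            return name
        return f"select[{' AND '.join(preds)}]({name})"

    left_expr, right_expr = leaf(left, left_preds), leaf(right, right_preds)
    if join_preds:
        return f"({left_expr} join[{' AND '.join(join_preds)}] {right_expr})"
    return f"({left_expr} X {right_expr})"


where_clause = [
    ("emp.deptno = dept.deptno", {"EMP", "DEPT"}),
    ("emp.job = 'Manager'", {"EMP"}),
    ("dept.name = 'Sales'", {"DEPT"}),
]

optimised = apply_heuristics(where_clause, "EMP", "DEPT")
print(optimised)
```

For the example query this produces the third of the alternative expressions: both selections are applied to the base relations and the Cartesian product is replaced by a join over deptno.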

Heuristic Processing - Example

[Figure: sequence of query trees over the leaves EMP and DEPT, starting from the Cartesian product (X) with the selections (job = ‘Manager’) and (name = ‘Sales’), and ending with the optimised canonical query.]


Query Processing Stage - 3

Choose candidate low-level procedures

Consider the (optimised canonical) query as a series of low-level operations (join, restrict, etc.).

For each of these operations generate alternative execution strategies and calculate the cost of such strategies on the basis of statistical information held about the database tables (files).


Query Processing Stage - 4

Generate query plans and choose the cheapest

Construct a set of ‘candidate’ Query Execution Plans (QEPs).

Each QEP is constructed by selecting a candidate implementation procedure for each operation in the canonical query and then combining them to form a string of associated operations.

Each QEP will have an (estimated) cost associated with it – the sum of the cost of each of its operations.

Choose the QEP with the least cost.
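Constructing the candidate QEPs and choosing the cheapest can be sketched as an exhaustive enumeration. The operations, procedure names and cost figures in `OPERATION_COSTS` are hypothetical values chosen for illustration. The sketch also assumes, as the slide does, that the cost of a plan is simply the sum of the costs of its operations.

```python
from itertools import product

# Hypothetical estimated costs (in record I/Os) of each candidate
# implementation procedure for each operation of the canonical query.
OPERATION_COSTS = {
    "select EMP": {"linear search": 1000, "secondary index": 120},
    "select DEPT": {"linear search": 50, "hash key": 5},
    "join": {"block nested loop": 300, "hash join": 110},
}


def enumerate_plans(operation_costs):
    """Yield (procedures, total_cost) for every combination of procedures."""
    operations = list(operation_costs)
    per_operation = (operation_costs[op].items() for op in operations)
    for choice in product(*per_operation):
        procedures = {op: name for op, (name, _) in zip(operations, choice)}
        total = sum(cost for _, cost in choice)
        yield procedures, total


def cheapest_plan(operation_costs):
    return min(enumerate_plans(operation_costs), key=lambda plan: plan[1])


best_procedures, best_cost = cheapest_plan(OPERATION_COSTS)
print(best_procedures, best_cost)
```

With two candidates for each of three operations there are eight plans. The number of plans grows multiplicatively with the number of operations, which is one reason an optimiser may cost only some of the alternatives.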

Cost-Based Optimisation (stages 3 & 4)

A good declarative query optimiser does not rely solely on heuristic processing strategies.

It chooses the QEP with the lowest estimated cost.

After heuristic rules are applied to a query, there still remain a number of alternative ways to execute it.

The Query Optimiser estimates the cost of executing each one (or at least a number) of these alternatives, and selects the cheapest one.

Costs associated with query execution

Secondary storage access costs: searching for data blocks on disk, reading data blocks from disk, writing data blocks to disk

Storage costs: cost of storing intermediate (temp) files

Computation costs: cost of CPU usage

Main memory usage costs: cost of buffering data

Communication costs: cost of moving data across

Database statistics used in cost estimation

Information held on each relation:

number of tuples

number of blocks

blocking factor

primary access method

primary access attributes

secondary indexes

secondary indexing attributes

number of levels for each index

number of distinct values of each attribute

Physical Data Structures – File Types

Heap (Sequential, Unordered): no key columns; queries, other than appends, scan every page; rows are appended at the end; duplicate rows are allowed

Ordered: physically sorted data file with no index

Hash (Random, Direct): data is located based on the (calculated) value of a hash field (key)

Indexed Sequential (ISAM): sorted data file with a primary index

B+Tree: dynamic multilevel index; reuses deleted space on associated data pages

Strategies for implementing the RESTRICT operation

Different access strategies are available, dependent upon the structure of the file in which the relation is stored, and whether the predicate attribute(s) have been indexed/hashed. Each uses a different cost algorithm (which refers to specific database statistics).

Linear Search (Heap)

Binary Search (Ordered)

Equality on Hash Key

Equality condition on primary key

Inequality condition on primary key

Equality condition on secondary index

Inequality condition on secondary B+Tree index

If the selection predicate is a composite (AND & OR) then there are additional cost considerations!
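As an illustration of how a cost algorithm draws on the database statistics, the sketch below estimates block accesses for the first two strategies using the usual textbook approximations: a linear search reads every block (about half of them on average when it can stop at a unique key match), and a binary search on an ordered file reads about log2 of the number of blocks. The blocking factor of 10 is an assumed value, and the function names are illustrative.

```python
import math


def number_of_blocks(n_tuples, blocking_factor):
    """Blocks needed to store n_tuples at blocking_factor tuples per block."""
    return math.ceil(n_tuples / blocking_factor)


def linear_search_cost(n_blocks, unique_key_equality=False):
    """Heap file: scan all blocks, or about half if one key match ends the scan."""
    if unique_key_equality:
        return math.ceil(n_blocks / 2)
    return n_blocks


def binary_search_cost(n_blocks):
    """Ordered file: block accesses needed to locate the first matching block."""
    if n_blocks <= 1:
        return 1
    return math.ceil(math.log2(n_blocks))


# EMP has 1000 tuples; a blocking factor of 10 is assumed for illustration.
emp_blocks = number_of_blocks(1000, 10)

print(emp_blocks)
print(linear_search_cost(emp_blocks))
print(linear_search_cost(emp_blocks, unique_key_equality=True))
print(binary_search_cost(emp_blocks))
```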

Strategies for implementing the JOIN operation

Different access strategies are available, dependent upon the structure of the files in which the relations to be joined are stored, and whether the join attributes have been indexed/hashed. Each uses its own cost algorithm (which refers to specific database statistics).

Block nested loop join

Indexed nested loop join

Sort-merge join

Hash join
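Two of these strategies can be sketched in memory to show that they compute the same result in different ways. The version below is a tuple-at-a-time nested loop rather than a block nested loop, and the hash join builds its table on the second relation. The relations and the attribute names (`emp_deptno`, `dept_deptno`) are made up for illustration.

```python
from collections import defaultdict


def nested_loop_join(r, s, r_key, s_key):
    """Compare every row of r with every row of s: len(r) * len(s) comparisons."""
    return [{**a, **b} for a in r for b in s if a[r_key] == b[s_key]]


def hash_join(r, s, r_key, s_key):
    """Build a hash table on s, then probe it once per row of r."""
    buckets = defaultdict(list)
    for b in s:
        buckets[b[s_key]].append(b)
    return [{**a, **b} for a in r for b in buckets.get(a[r_key], [])]


# Made-up relations for illustration.
EMP = [
    {"ename": "Adams", "emp_deptno": 10},
    {"ename": "Baker", "emp_deptno": 20},
    {"ename": "Clark", "emp_deptno": 10},
    {"ename": "Davis", "emp_deptno": 30},
]
DEPT = [
    {"dept_deptno": 10, "dname": "Sales"},
    {"dept_deptno": 20, "dname": "Research"},
]

by_loop = nested_loop_join(EMP, DEPT, "emp_deptno", "dept_deptno")
by_hash = hash_join(EMP, DEPT, "emp_deptno", "dept_deptno")

print(by_loop == by_hash)
print([(row["ename"], row["dname"]) for row in by_hash])
```

The nested loop does work proportional to the product of the relation sizes, whereas the hash join reads each relation once; this difference is what the per-strategy cost algorithms estimate from the database statistics.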

Query Optimisation Summary

The aims of query processing are to transform a query written in a high-level language (SQL), into a correct and efficient execution strategy expressed in a low-level language (Relational Algebra), and to execute the strategy to retrieve the required data.

There are many equivalent transformations of the same high-level query; the DBMS has to choose the one that minimises resource usage.

There are two main techniques for query optimisation. The first uses heuristic rules that order the operations in a query. The second compares different execution strategies for those operations, based on their relative costs, and selects the least resource intensive (cheapest) ones.