CSCI 453 -- Query Processing

QUERY PROCESSING & OPTIMIZATION

Dr. Awad Khalil
Computer Science Department
AUC


Content

Why Query Optimization?
Optimization Procedure
Syntactic and semantic checking
Casting the query into some internal representation
Heuristic optimization
Semantic Query Optimization
Systematic Optimization
Selection of the Cheapest Plan
Code Generation for the Access Plan


Why Query Optimization

When users submit queries to a DBMS, they expect a response that is not only correct and consistent but also timely; that is, produced in an acceptable period of time.

Queries can be written in a number of different ways, many of them being inefficient. Therefore, the DBMS should take a query and, before it is run, prepare a version that can be executed efficiently, a process that is known as Query Optimization.

In practice, a DBMS is concerned with improving the execution strategy rather than finding the most efficient version, which is effectively impossible for most queries.

The larger a database becomes, the greater the need for a query optimizer.

It is one of the strengths of relational databases that query optimization can be done automatically by a software optimizer included within the DBMS software.


Optimization Procedure

1. Syntactic and semantic checking.
2. Casting the query into some internal representation.
3. Heuristic optimization (converting to a more efficient form).
4. Semantic query optimization.
5. Systematic optimization.
6. Selection of the cheapest plan.
7. Code generation for the access plan.


1- Syntactic and semantic checking

A query is expressed in a high-level language (SQL) and is parsed to check whether it is syntactically correct (i.e. obeys the rules of SQL grammar).

The query is then validated to see whether it is semantically correct (i.e. verified to see that the attributes, tables, views and other objects actually exist in the database).
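The semantic check can be pictured as a lookup against the system catalog. The sketch below is only an illustration: `CATALOG` and `semantic_errors` are hypothetical names, and the catalog holds just the S and SP tables used in the next section. A real DBMS catalog also records views, types and privileges.

```python
# Hypothetical catalog: table name -> set of attribute names.
CATALOG = {
    "S": {"S#", "Sname", "Status", "City"},
    "SP": {"S#", "P#", "Qty"},
}

def semantic_errors(tables, attributes, catalog=CATALOG):
    """Return a list of names used by the query that the catalog does not know."""
    errors = [f"unknown table: {t}" for t in tables if t not in catalog]
    # Attributes are checked against every known table the query references.
    known = set().union(*(catalog[t] for t in tables if t in catalog))
    errors += [f"unknown attribute: {a}" for a in attributes if a not in known]
    return errors

print(semantic_errors(["S", "SP"], ["Sname", "P#"]))  # []
print(semantic_errors(["S"], ["Colour"]))             # ['unknown attribute: Colour']
```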


2- Casting the query into some internal representation

The query is converted into a form more suitable for machine manipulation. In relational data models, the form is based on relational algebra or relational calculus. Relational algebra is commonly used and is usually manipulated in the form of operator graphs.

S(S#, Sname, Status, City)
SP(S#, P#, Qty)

Query: Get names of suppliers who supply part P2.

((S Join SP) Where P#=‘P2’)[Sname]
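One way to picture an operator graph is as a tree of operator nodes. The following is a minimal sketch, not the representation any particular DBMS uses; the class names `Table`, `Join`, `Select`, `Project` and the helper `show` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Table:
    name: str

@dataclass(frozen=True)
class Join:
    left: "Node"
    right: "Node"

@dataclass(frozen=True)
class Select:
    child: "Node"
    condition: str

@dataclass(frozen=True)
class Project:
    child: "Node"
    attributes: tuple

Node = Union[Table, Join, Select, Project]

def show(node):
    """Render an operator graph back into the algebra notation used above."""
    if isinstance(node, Table):
        return node.name
    if isinstance(node, Join):
        return f"({show(node.left)} Join {show(node.right)})"
    if isinstance(node, Select):
        return f"({show(node.child)} Where {node.condition})"
    return f"{show(node.child)}[{', '.join(node.attributes)}]"

# The supplier query: project Sname over a selection over a join.
query = Project(Select(Join(Table("S"), Table("SP")), "P#='P2'"), ("Sname",))
print(show(query))  # ((S Join SP) Where P#='P2')[Sname]
```

Because the nodes are immutable values, an optimizer can build a transformed tree and compare it with the original without side effects.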


3- Heuristic optimization (Converting to a more efficient form)

Heuristic optimization depends on the syntax and not the semantics of the database.

It is based only on the general qualities of the relational algebra expressions and involves substituting relational algebra expressions with more efficient expressions using equivalence-preserving transformation rules.

Objectives:
A- Minimize disk input/output and processing by reducing the size of intermediate tables in a query.
B- Directly reduce the amount of computation involved by rewriting the expression in a simpler form.
C- Reduce the amount of computation involved by rewriting the expression in a form that, although more complex, is less computationally demanding.


A- Rules that tend to reduce the size of intermediate tables

1. Perform selection as early as possible (especially before join)

STUDENT (Std#, Sname)
COURSE (Course#, Cname, Instructor)
REGISTRATION (Std#, Course#, Date)
GRADE (Std#, Course#, Grade)

Query: List the names of students registered in the database course.

((STUDENT Join (REGISTRATION Join COURSE)) Where Cname = ‘Database’) [Sname]


Perform selection as early as possible (especially before join)

The selection operator can be pushed as far down the operator graph as possible.

At intermediate nodes, the operators are pushed down the appropriate branches.


Perform selection as early as possible (especially before join)

The selection operator can be passed down again, so that it is applied to COURSE before any join:

(STUDENT Join (REGISTRATION Join (COURSE Where Cname = ‘Database’))) [Sname]
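The effect of pushing the selection below the joins can be checked on a small in-memory example. This is a sketch under stated assumptions: the rows are made up, the Instructor and Date attributes are omitted for brevity, and `natural_join`, `select` and `project` are illustrative helpers, not DBMS routines.

```python
def natural_join(r, s):
    """Natural join of two lists of dicts on their shared attribute names."""
    out = []
    for a in r:
        for b in s:
            common = a.keys() & b.keys()
            if all(a[k] == b[k] for k in common):
                out.append({**a, **b})
    return out

def select(r, pred):
    return [t for t in r if pred(t)]

def project(r, attrs):
    # Relational projection removes duplicate rows.
    return sorted({tuple(t[a] for a in attrs) for t in r})

# Made-up sample data for illustration only.
STUDENT = [{"Std#": 1, "Sname": "Mona"}, {"Std#": 2, "Sname": "Omar"}]
COURSE = [{"Course#": "C1", "Cname": "Database"},
          {"Course#": "C2", "Cname": "Networks"}]
REGISTRATION = [{"Std#": 1, "Course#": "C1"}, {"Std#": 1, "Course#": "C2"},
                {"Std#": 2, "Course#": "C2"}]

is_db = lambda t: t["Cname"] == "Database"

# Selection last: the full three-way join is built first.
late_join = natural_join(STUDENT, natural_join(REGISTRATION, COURSE))
late = project(select(late_join, is_db), ["Sname"])

# Selection first: COURSE is filtered before any join.
early_join = natural_join(STUDENT, natural_join(REGISTRATION, select(COURSE, is_db)))
early = project(early_join, ["Sname"])

print(late == early, len(late_join), len(early_join))  # True 3 1
```

Both forms return the same names, but the early selection joins one intermediate row instead of three.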



2. Perform projection as early as possible

Under certain conditions projection may be commuted with a join. When a projection is preceded by a join, it is possible to push the projection down before the join, but the pushed-down projection acquires additional attributes (those needed by the join); therefore the original projection must still be performed after the join. Unless the cardinalities of the intermediate relations are reduced, the usefulness of pushing a projection before a join is questionable.


Perform projection as early as possible

(STUDENT Join (((REGISTRATION Join (COURSE Where Cname = ‘Database’)) [Std#, Course#]))) [Sname]

The projection, Project(Std#, Course#), should be pushed down the tree!


Perform projection as early as possible



3. Perform select before project

If the selection condition involves only attributes that appear in the projection list, then the two operations can be commuted, e.g.:

((GRADE [Std#, Course#]) Where Std# = 123)

can be commuted to:

(GRADE Where Std# = 123) [Std#, Course#]


B- Rules that tend to directly reduce the amount of computation involved – making expression simpler

1. Combine a cascade of selections into one selection.

Query: Get the full details of courses with course number CSCI-453 where the instructor is Khalil.

((COURSE Where Instructor = ‘Khalil’) Where Course# = ‘CSCI-453’)

Can be converted to:

COURSE Where Instructor = ‘Khalil’ AND Course# = ‘CSCI-453’
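A short sketch of why the combined form is cheaper: the cascade makes two passes and materializes an intermediate table, while the conjunction makes one pass. The rows are invented for illustration, and the list comprehensions stand in for table scans.

```python
# Made-up COURSE rows for illustration.
COURSE = [
    {"Course#": "CSCI-453", "Cname": "Database", "Instructor": "Khalil"},
    {"Course#": "CSCI-453", "Cname": "Database", "Instructor": "Other"},
    {"Course#": "CSCI-101", "Cname": "Intro", "Instructor": "Khalil"},
]

by_instructor = lambda t: t["Instructor"] == "Khalil"
by_course = lambda t: t["Course#"] == "CSCI-453"

# Cascade: two passes, with an intermediate table between them.
intermediate = [t for t in COURSE if by_instructor(t)]
cascade = [t for t in intermediate if by_course(t)]

# Combined: one pass, no intermediate table.
combined = [t for t in COURSE if by_instructor(t) and by_course(t)]

print(cascade == combined)  # True
```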


B- Rules that tend to directly reduce the amount of computation involved – making expression simpler

2. Combine a cascade of projections into one projection

((COURSE [Cname, Instructor]) [Cname])

Can be converted to:

COURSE [Cname]


C- Rules that tend to directly reduce the amount of computation involved – rewriting in a less computationally demanding form

Where P OR (Q AND R)

Can be rewritten as:

Where (P OR Q) AND (P OR R)
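The rewrite is the distributive law of Boolean algebra, so the two conditions always select the same tuples. A brute-force check over every truth assignment illustrates this; the function names `original` and `rewritten` are chosen here for illustration.

```python
from itertools import product

def original(p, q, r):
    return p or (q and r)

def rewritten(p, q, r):
    return (p or q) and (p or r)

# Check all eight truth assignments of P, Q and R.
equivalent = all(
    original(p, q, r) == rewritten(p, q, r)
    for p, q, r in product([False, True], repeat=3)
)
print(equivalent)  # True
```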


4- Semantic Query Optimization

Semantic query optimization transformations use constraints on the database schema to modify queries.

Example: Consider the Join of the two tables:

SP (S#, P#, Qty) and P (P#, Pname, Color, Weight, City)

If SP.P# is a foreign key (with no nulls allowed) and is matched to the primary key P.P#, then (SP Join P) [S#] can be transformed to SP [S#].
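The role of the foreign-key constraint can be seen on a small example. The rows below are made up and carry only the attributes needed; `join_on_part` and `project_supplier` are illustrative helpers. The second check shows that the rewrite stops being safe once a dangling P# value violates the constraint.

```python
# Made-up rows; every SP["P#"] value exists in P, as the foreign key requires.
P = [{"P#": "P1", "Pname": "Nut"}, {"P#": "P2", "Pname": "Bolt"}]
SP = [{"S#": "S1", "P#": "P1"}, {"S#": "S1", "P#": "P2"}, {"S#": "S2", "P#": "P2"}]

def join_on_part(sp, p):
    """Join SP and P on P#, the only attribute they share."""
    return [{**a, **b} for a in sp for b in p if a["P#"] == b["P#"]]

def project_supplier(rows):
    """Projection on S#, with duplicates removed."""
    return sorted({t["S#"] for t in rows})

with_join = project_supplier(join_on_part(SP, P))
without_join = project_supplier(SP)
print(with_join == without_join)  # True

# If the constraint is violated (a dangling P#), the rewrite changes the answer.
SP_bad = SP + [{"S#": "S3", "P#": "P9"}]
print(project_supplier(join_on_part(SP_bad, P)) == project_supplier(SP_bad))  # False
```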


5- Systematic Optimization

Having reorganized the query, the Query Optimizer must then consider how to retrieve the information physically from the database.

The Query Optimizer generates a query plan (execution strategy, access plan, or execution plan) using access routines (access aids or low level implementation procedures) for the various operations.

In optimizing at this level, an accurate cost estimate for each execution strategy must be calculated. Cost estimation is a time-consuming task.


5- Systematic Optimization (Cont’d)

Statistical Information:

Systematic optimizers may make use of the following information:

Number of tuples.
Number of blocks used to store these tuples.
Number of distinct data values.
Percent of total number of relevant database blocks used by the relation.
Ordering of tuples in the blocks.
Blocking factor for each file.
Existence and type of indexes.
Number of levels of each index.
Number of blocks for packed relations.
Physical clustering of records.


5- Systematic Optimization (Cont’d)

Types of cost functions:

1- Accesses to secondary storage costs:
Cost of searching for, reading and writing data blocks that reside on secondary storage. Temporary, intermediate files may also need to be accessed; this represents a significant problem, as there must also be an accurate estimate of the size of intermediate results to calculate the number of I/Os required. The access cost is the number of blocks that must be brought into main memory for reading and the number of blocks that must be written out to secondary storage.

2- Computation Costs:
Cost of performing in-memory operations in the data buffers during query execution. Operations include:
Searching for records.
Sorting records.
Merging records for a join.
Performing computations on field values.

3- Communication Costs:
Cost of shipping the query and results from the database site to the terminal where the query originated.


5- Systematic Optimization (Cont’d)

Goals:

For large databases, the main emphasis is on reducing access costs to secondary storage, i.e., the number of block transfers between disk and memory.

Small databases, in which most data can be stored in memory, focus on minimizing computation.

In the case of distributed databases, communication costs must also be minimized.

It is difficult to include all cost components in a weighted cost function; therefore, most cost functions consider a single factor only, or possibly a combination of one factor estimating I/O and one factor estimating the use of CPU.


5- Systematic Optimization (Cont’d)

Use of Costs:

t(R) - the number of tuples in relation R.
b(R) - the number of blocks needed to store the relation R, if R is packed.
bf(R) - the number of tuples per block, also called the blocking factor of R. If R is packed, then b(R) = t(R) / bf(R).
n(A, R) - the number of distinct values of attribute A in relation R. This can be used to approximate the number of tuples that have a particular value.
s(A=c, R) - the selection size: the average number of records that will satisfy an equality selection condition A = c. If we assume that the values of A are uniformly distributed in R, then s(A=c, R) = t(R) / n(A, R).



Example:

Consider the following database:

STUDENT (Stuid, Stuname, Major, Credits)
ENROLL (Course#, Stuid, Grade)

Thus, to estimate the number of students in the university with a CS major:

- If there are 10,000 students, then t(STUDENT) = 10,000
- If there are 25 possible major subjects, then n(MAJOR, STUDENT) = 25
- Then, we can estimate the number of CS majors as: s(MAJOR=‘CS’, STUDENT) = t(STUDENT)/n(MAJOR, STUDENT) = 10,000/25 = 400
- Note that if A is a primary key, then n(A, R) = t(R) and the selection size is 1.
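The estimates above reduce to two small functions. This is a sketch only: the function names are illustrative, rounding b(R) up to a whole block is an added assumption, and the blocking factor of 40 is a hypothetical figure not given in the example.

```python
import math

def blocks(t, bf):
    """b(R) for a packed relation: t(R) / bf(R), rounded up to whole blocks (assumption)."""
    return math.ceil(t / bf)

def selection_size(t, n):
    """s(A=c, R) = t(R) / n(A, R), assuming uniformly distributed values of A."""
    return t / n

t_student = 10_000   # from the example
n_major = 25         # from the example

print(selection_size(t_student, n_major))    # 400.0
print(selection_size(t_student, t_student))  # 1.0, the primary-key case
print(blocks(t_student, 40))                 # 250, with a hypothetical bf of 40
```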



Processing Joins:

In examining the systematic cost of evaluating a typical query, say a Join, most processors focus on the accesses to secondary storage. This cost involves not only the effort of retrieving the tables that are input to a query and writing out the final result, but also the costs of writing and reading any intermediate tables. This is especially important when considering the size of intermediate results in a complex query.



To calculate the size of a Join, say between R and S, of sizes t(R) and t(S) respectively, we first need to estimate the number of tuples of R that will match tuples of S on the corresponding attributes. There are several distinct possibilities to consider:

1. If there are no common attributes, the Join becomes a Product, and the number of tuples in the result is t(R) * t(S).

2. If the set of common attributes is a key for one relation, then the number of tuples in the Join can be no larger than the number of tuples in the other relation, e.g., if the common attributes are a key for R then the size of the Join is less than or equal to t(S).
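The two cases can be written as simple estimators. In this sketch the function names are illustrative, the ENROLL size of 50,000 is a made-up figure, and treating Stuid as a key of STUDENT is an assumption; the 10,000 students come from the earlier example.

```python
def product_size(t_r, t_s):
    """Case 1: no common attributes, so the Join is a Product."""
    return t_r * t_s

def key_join_upper_bound(t_other):
    """Case 2: the common attributes are a key of one relation,
    so the result is bounded by the size of the other relation."""
    return t_other

t_student, t_enroll = 10_000, 50_000  # t_enroll is hypothetical

print(product_size(t_student, t_enroll))  # 500000000
print(key_join_upper_bound(t_enroll))     # 50000
```

The gap between the two figures shows why knowing that the join attribute is a key matters so much for estimating intermediate result sizes.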



Methods of Processing Joins:

Nested loops using blocks.
Sort-merge.
Using an index or hash key.



Nested loops using blocks:

Assuming that both R and S are packed relations having b(R) and b(S) blocks respectively, then if we have two buffers, we can bring the first block of R into the first buffer and bring each block of S in turn into the second buffer, comparing each tuple of the R block with each tuple of the S block before switching in the next S block. When we have finished all the S blocks, we bring the next R block into the first buffer, and so on. The algorithm can be shown as:

For each block of R
    For each block of S
        For each tuple in the R block
            For each tuple in the S block
                If the tuples satisfy the condition then add to join
            End
        End
    End
End
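The algorithm above translates almost line for line into runnable code. The sketch below also counts block reads; with two buffers each block of R is read once and every block of S is read once per block of R, giving b(R) + b(R) * b(S) reads, a count that follows from the loop structure. The data and the name `block_nested_loop_join` are made up for illustration.

```python
def block_nested_loop_join(r_blocks, s_blocks, condition):
    """Join two relations stored as lists of blocks (each block is a list of tuples).

    Returns the joined tuples and the number of block reads performed.
    """
    result, block_reads = [], 0
    for r_block in r_blocks:          # first buffer holds one block of R
        block_reads += 1
        for s_block in s_blocks:      # second buffer holds one block of S
            block_reads += 1
            for r in r_block:
                for s in s_block:
                    if condition(r, s):
                        result.append(r + s)
    return result, block_reads

# Made-up data: R tuples are (Stuid, Stuname), S tuples are (Course#, Stuid).
R = [[(1, "Mona"), (2, "Omar")], [(3, "Sara")]]
S = [[("C1", 1), ("C2", 1)], [("C1", 3), ("C2", 9)]]

joined, reads = block_nested_loop_join(R, S, lambda r, s: r[0] == s[1])
print(joined)
print(reads)  # 6, i.e. b(R) + b(R) * b(S) = 2 + 2 * 2
```

Since the count is asymmetric in R and S, placing the relation with fewer blocks in the outer loop gives the smaller total.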



Cost Functions for JOIN using Nested Loops:

Refer to the textbook “Fundamentals of Database Systems”, Elmasri:

Third Edition: pages 618 – 621
Fourth Edition: pages 527 – 529



Sort-Merge Join:

In the previous method, the assumption has been made that the tuples in the tables are not sorted in any particular way. If both files are sorted on the attribute to be joined, then another access method, Sort-Merge Join, is preferable (see references).
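A minimal sketch of the merge phase, assuming both inputs are already sorted on the join attribute and fit in memory as lists of tuples. The name `merge_join` and the sample rows are illustrative; a disk-based implementation would advance through blocks instead of list indexes.

```python
def merge_join(r, s, r_key, s_key):
    """Merge-join two lists that are already sorted on their join keys."""
    result, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        a, b = r_key(r[i]), s_key(s[j])
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            # Pair r[i] with the whole run of S tuples sharing this key.
            k = j
            while k < len(s) and s_key(s[k]) == a:
                result.append(r[i] + s[k])
                k += 1
            # j stays at the start of the run so that a following R tuple
            # with the same key is paired with the run as well.
            i += 1
    return result

# Made-up rows, both sorted on Stuid.
STUDENT = [(1, "Mona"), (2, "Omar"), (3, "Sara")]
ENROLL = [("C1", 1), ("C2", 1), ("C1", 3)]

print(merge_join(STUDENT, ENROLL, lambda t: t[0], lambda t: t[1]))
```

When the join key is unique in at least one input, each input is scanned only once, which is why sorted files make this method attractive.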


5- Systematic Optimization (Cont’d)

Using an Index or Hash Key:

If one of the files, S, has an index on the common attribute A, or if A is a hash key, then each tuple of R would be retrieved in the usual way and the index or hashing algorithm would be used to find all the matching records of S.

For example, to find STUDENT Join ENROLL, representing S and R respectively, we access each ENROLL record in sequence and then use the index on Stuid to find each matching STUDENT record. The overall cost depends on the type of index as follows:

If A is the primary key of S and we have a primary index on S, then the access cost is the cost of accessing all blocks of R plus the cost of reading the index and accessing one record of S for each of the tuples in R:

b(R) + (t(R) * (L(indexname) + 1))

where L(indexname) is the number of levels in a multilevel index, which is equivalent to the average number of index accesses to find an entry.

If the index is a clustering index (a file with a clustering index is one in which data records are physically ordered on a non-key field that does not have a distinct value for each record) with selection size s(A=c, S), the cost is:

b(R) + (t(R) * (L(indexname) + s(A=c, S)/bf(S)))

If we have a hash function (in a hash file, the hash field of a record is subjected to a hashing function which gives the block address that contains that record) instead of an index, the cost is:

b(R) + (t(R) * h)

where h is the average number of accesses to get a block from its hash key value.

If the index is a secondary index (a secondary index is an ordered file of data values that point to entities in a data file containing this value in one of its fields), we get:

b(R) + (t(R) * (L(indexname) + s(A=c, S)))
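The four cost formulas can be compared side by side once they are written as functions. The function and parameter names below are illustrative, and every figure in the usage lines is hypothetical, chosen only to exercise the formulas.

```python
def primary_index_cost(b_r, t_r, levels):
    """b(R) + t(R) * (L + 1): one index descent plus one record of S per R tuple."""
    return b_r + t_r * (levels + 1)

def clustering_index_cost(b_r, t_r, levels, s, bf_s):
    """b(R) + t(R) * (L + s/bf(S)): matching S records are stored contiguously."""
    return b_r + t_r * (levels + s / bf_s)

def hash_cost(b_r, t_r, h):
    """b(R) + t(R) * h: h block accesses per hash lookup."""
    return b_r + t_r * h

def secondary_index_cost(b_r, t_r, levels, s):
    """b(R) + t(R) * (L + s): each matching S record may sit in a different block."""
    return b_r + t_r * (levels + s)

# Hypothetical figures for R = ENROLL.
b_enroll, t_enroll = 500, 50_000

print(primary_index_cost(b_enroll, t_enroll, levels=2))  # 150500
print(hash_cost(b_enroll, t_enroll, h=1))                # 50500
```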


6- Selection of the Cheapest Plan

Having generated several access plans, the systematic optimizer would then choose the cheapest.
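Once each candidate plan has an estimated cost, the choice is a minimum over those estimates. The plan names and costs below are hypothetical and serve only to show the selection step.

```python
# Hypothetical estimated costs (in block accesses) for three candidate plans.
plan_costs = {
    "nested loops": 125_500,
    "primary index": 150_500,
    "hash key": 50_500,
}

# Pick the plan whose estimated cost is smallest.
cheapest = min(plan_costs, key=plan_costs.get)
print(cheapest)  # hash key
```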


7- Code Generation for the Access Plan

The cheapest plan is then coded for execution at the appropriate time.


Thank you
