Query Optimization and Indexes

Introduction

Relational Databases

DB2/400

Introduction

Overview

Research Problem

Literature

Overview

IBM has a DBMS called DB2.

Query optimization:
– Can significantly improve database performance.
– If the right indexes are available, queries can generally be implemented using better-performing algorithms.

Research Problem

Due to the complexity of the query optimizer, customers are often baffled about indexes and query response time. Users need advice on which indexes would offer the best performance.

Management asks:

How can an intelligent index manager be created?

Related question:

How can queries be evaluated to determine a set of indexes which may minimize the total cost of database transactions, thus optimizing performance?

Literature

The index selection problem (ISP) for secondary indexes is a well-known optimization problem.

This problem is known to be NP-complete.

Relational Databases

Background

Query Optimization

Indexes

Table Access and Joins

Background

The relational model was introduced by Codd.

It is based on relations (tables), tuples (records), and attributes (fields).

Queries can be represented using relational algebra, relational calculus, or as a graph.

SQL is a high-level query language.

Query Optimization

A query optimizer takes the query and creates a procedural sequence of implementation steps known as an access plan.
– This plan is created after analyzing various alternatives to arrive at the best choice.
– Table indexes are critical in creating efficient access plans.

Objectives of query optimization:
– Minimize response time
– Minimize usage of system resources

Query Optimization Objectives

Hardware objectives include minimizing:
– CPU costs
– Communications costs (in a network)
– I/O costs for accessing secondary storage
– Cost of using main memory and secondary storage

Software objectives include minimizing:
– Cost of query optimization (should be small compared to execution cost)

Query Optimization Process

[Figure: components of the query optimization process: Rewriter, Planner, Algebraic Space, Method-Structure Space, Cost Model, Size-Distribution Estimator]


Rewriter
– creates an internal query representation.
– applies transformations to streamline query evaluation.


Planner
– maps the transformed query
• into various sequences of operations (algebraic space)
• which can be implemented (method structure space)
• with a known cost (cost model & size distribution),
  creating candidate access plans.
– computes the cost of each candidate plan and chooses the cheapest one.
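
The planner's last step, costing each candidate plan and keeping the cheapest, can be sketched in a few lines. This is only an illustration of cost-based selection, not DB2's implementation; the plan names and cost figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePlan:
    name: str        # hypothetical label for a candidate access plan
    io_cost: float   # estimated I/O cost, arbitrary units (assumed)
    cpu_cost: float  # estimated CPU cost, arbitrary units (assumed)

    def total_cost(self) -> float:
        return self.io_cost + self.cpu_cost


def choose_cheapest(plans):
    """Return the candidate plan with the lowest estimated total cost."""
    return min(plans, key=CandidatePlan.total_cost)


candidates = [
    CandidatePlan("table scan", io_cost=500.0, cpu_cost=50.0),
    CandidatePlan("index scan", io_cost=12.0, cpu_cost=3.0),
    CandidatePlan("index scan + sort", io_cost=12.0, cpu_cost=20.0),
]
best = choose_cheapest(candidates)
print(best.name)
```

A real optimizer derives the cost figures from its cost model and size estimates rather than taking them as constants.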

Indexes

B-Tree Indexes

Clustered/Non-clustered Indexes

Hash Indexes

Introduction

Memory access is on the order of nanoseconds.
– For example, 40 ns (0.00000004 sec.)

Disk access is on the order of milliseconds.
– For example, 25 ms (0.025 sec.), i.e., 40 random I/Os per sec.
• 0.016 sec. seek time (move the disk arm to the proper cylinder)
• 0.008 sec. rotational latency (rotate the platter into position)
• 0.001 sec. transfer time (to read/write data)
• 0.025 sec. total

The dollar cost of disk is cheap compared to RAM.
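
The arithmetic behind these figures can be checked with a short sketch. The timings are the example values from the slide; the variable names are illustrative, and integer microseconds are used to avoid floating-point noise.

```python
# Example disk timing components from the text, in microseconds.
SEEK_US = 16_000
ROTATIONAL_LATENCY_US = 8_000
TRANSFER_US = 1_000
MEMORY_ACCESS_NS = 40  # example memory access time from the text

disk_access_us = SEEK_US + ROTATIONAL_LATENCY_US + TRANSFER_US
random_ios_per_sec = 1_000_000 // disk_access_us
# How many memory accesses fit in the time of one random disk access.
disk_to_memory_ratio = (disk_access_us * 1_000) // MEMORY_ACCESS_NS

print(disk_access_us, random_ios_per_sec, disk_to_memory_ratio)
```

With these example numbers, one random disk access costs as much time as several hundred thousand memory accesses, which is why avoiding disk I/O dominates index design.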

B-Tree Indexes

Composition:
– Root level node (one at the first level)
– Directory (or index) nodes (usually one or two levels)
– Leaf level nodes (the bottom level, pointing to records)

Directory node: ... 221 np 346 np 398 ...

Leaf nodes: … 278 rid 305 rid 346 rid … and … 377 rid 411 rid 449 rid …

np = node pointer to disk page

rid = relative row ID to actual record


Significantly, due to the frequency of access, the upper levels of a B-tree index may remain in memory.

For example, to access a record in a million-record database, assuming a fanout of 256:
– The B-tree index has 3 levels: CEIL(log_256 1,000,000) = 3
– The B-tree index is probed 3 times, reading 3 pages into RAM
– Only 1 disk I/O may be required, to read in the disk page holding the leaf node pointing to the desired record.
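
The level count above can be reproduced without floating-point logarithms. This is a minimal sketch; the function name is illustrative, and the assumption that all non-leaf levels stay cached is the slide's best case, not a guarantee.

```python
def btree_levels(num_records: int, fanout: int) -> int:
    """Smallest number of levels L such that fanout**L >= num_records."""
    levels, capacity = 1, fanout
    while capacity < num_records:
        capacity *= fanout
        levels += 1
    return levels


levels_needed = btree_levels(1_000_000, 256)
# Best case assumed in the text: every level above the leaves is already in RAM.
cached_levels = levels_needed - 1
disk_reads = levels_needed - cached_levels
print(levels_needed, disk_reads)
```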


A B-tree index on a one-million-record database has 34 nodes above the leaf nodes.
– Assume:
• key values 4 bytes long (an integer)
• node pointers & row IDs 4 bytes long
• fill factor 70%
• disk page header 48 bytes
• disk page size 2 KB
– so (2048-48) * 0.70 / (4+4) = 175 entries per disk page
• CEIL(1,000,000/175) = 5715 disk pages for leaf nodes
• CEIL(5715/175) = 33 disk pages for directory nodes
• CEIL(33/175) = 1 disk page for the parent node (the root)
– giving 33 + 1 = 34 nodes above the leaf level
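
The page counts above follow from repeated ceiling division, which the sketch below reproduces with the slide's assumptions. The function and variable names are illustrative only.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def entries_per_page(page_size, header, fill_pct, key_bytes, pointer_bytes):
    """Entries that fit in one index page at the given fill factor (percent)."""
    usable = (page_size - header) * fill_pct // 100
    return usable // (key_bytes + pointer_bytes)


def pages_per_level(num_records, per_page):
    """Page counts from the leaf level up to the single root page."""
    counts = [ceil_div(num_records, per_page)]
    while counts[-1] > 1:
        counts.append(ceil_div(counts[-1], per_page))
    return counts


per_page = entries_per_page(2048, 48, 70, 4, 4)
page_counts = pages_per_level(1_000_000, per_page)
nodes_above_leaves = sum(page_counts[1:])
print(per_page, page_counts, nodes_above_leaves)
```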

Clustered/Non-clustered Indexes

In a clustered index the records are stored in the same order as the key.

This provides superior performance, for example:
– Database of 10 million customers in 200 cities
– Customer records of 100 bytes
• so 20 data records per disk page: (2048-48)/100
• so 500,000 disk pages for the database: 10,000,000/20
– A mailing for a specific city averages 50,000 customers
– If there is a clustered index on city, only 2500 disk pages of data (50,000/20) need to be scanned, taking about 1 min. at 40 I/Os per sec.: 2500/(40*60)
– If the index is non-clustered, we could assume a scan of 50,000 disk pages of data (one page per customer record), taking about 20 min.: 50,000/(40*60)
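
The comparison can be worked through as a short calculation. It uses the slide's figures and the earlier example rate of 40 random I/Os per second; treating the non-clustered case as one page read per matching record is the slide's simplifying assumption.

```python
PAGE_SIZE, PAGE_HEADER, RECORD_BYTES = 2048, 48, 100
TOTAL_CUSTOMERS = 10_000_000
CUSTOMERS_PER_CITY = 50_000
IOS_PER_SEC = 40  # example random I/O rate used elsewhere in the text

records_per_page = (PAGE_SIZE - PAGE_HEADER) // RECORD_BYTES
total_pages = TOTAL_CUSTOMERS // records_per_page

# Clustered: matching records sit next to each other on disk.
clustered_pages = CUSTOMERS_PER_CITY // records_per_page
# Non-clustered: assume each matching record costs its own page read.
nonclustered_pages = CUSTOMERS_PER_CITY


def scan_minutes(pages: int) -> float:
    return pages / (IOS_PER_SEC * 60)


clustered_min = scan_minutes(clustered_pages)
nonclustered_min = scan_minutes(nonclustered_pages)
print(round(clustered_min, 2), round(nonclustered_min, 2))
```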

Hash Indexes

No disk file of keys is needed; ideally, only a single I/O is required to read a record from disk.

The record key is the input value to some hash function, whose output becomes the key to a disk page or relative record position.

Collisions are a well-known problem, where 2 records have the same hash value.
– Ideally, rehash to the same disk page to minimize I/O.

Generally, the size of a hash table cannot be increased.
– Dynamic schemes such as extensible hashing and linear hashing allow the hash table to grow.
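
A toy static hash index makes the collision case concrete. This is a sketch under stated assumptions: each in-memory bucket stands in for one disk page, and the modulo hash function and sample keys are hypothetical, not taken from any particular DBMS.

```python
class HashIndex:
    """Toy static hash index: each bucket stands in for one disk page."""

    def __init__(self, num_pages: int):
        self.pages = [[] for _ in range(num_pages)]

    def page_no(self, key: int) -> int:
        # Hypothetical hash function: key modulo the number of pages.
        return key % len(self.pages)

    def insert(self, key: int, record: str) -> None:
        # Colliding keys are kept on the same page, so a lookup stays one "I/O".
        self.pages[self.page_no(key)].append((key, record))

    def lookup(self, key: int):
        for stored_key, record in self.pages[self.page_no(key)]:
            if stored_key == key:
                return record
        return None


index = HashIndex(num_pages=8)
index.insert(56001, "Ann")
index.insert(56009, "Bob")  # 56009 and 56001 both hash to page 1: a collision
print(index.page_no(56001), index.page_no(56009), index.lookup(56009))
```

Because the number of pages is fixed at construction, this structure also shows why a static hash table cannot simply grow: changing `num_pages` changes where every key lands.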

Table Access and Joins

No Indexes

Simple Indexes

Composite Indexes

Multiple Indexes

Nested Loop Joins

Sort-Merge Joins

Introduction

When a database system receives a query, it compiles the query.
– This includes syntax checking and query optimization.
– The result of the compilation step is an access plan (a series of steps specific to the computer and database) which will execute the query at run time.

Query optimizer:
– Minimizes CPU time & number of I/O requests.
– Main memory usage, although important, may be limited to certain established levels.

No Indexes

A table scan is an access step where all the rows in a table are sequentially searched.
– The data collected is restricted by the WHERE clause of an SQL statement.

Access plan (DB2/MVS):
– ACCESSTYPE column
• Letter R, for a table scan
– PREFETCH column
• Blank, if random I/O
• Letter S, if multi-block I/O (called sequential prefetch)

Simple/Composite Indexes

A matching index scan:
– is implemented in DB2/MVS using a B-tree index.
– is based on a single index.
– retrieves rows from a table based on the condition(s) specified in the WHERE clause of the SELECT statement.

Simple Indexes

The access plan for a SELECT statement may use indexes to limit the number of rows searched in a database. For example:

If ZipIdx exists on ZipCode in the Customers table, the query

SELECT * FROM Customers WHERE ZipCode = 56001

is implemented in DB2/MVS using a matching index scan.

The plan for this query takes one step and is designated by the following DB2/MVS access plan:

ACCESSTYPE = I for an index scan

ACCESSNAME = ZipIdx

MATCHCOLS = 1
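
The same idea can be tried out locally. The sketch below swaps in SQLite (via Python's standard `sqlite3` module) for DB2/MVS, so the plan output is SQLite's `EXPLAIN QUERY PLAN` text rather than the DB2 plan columns shown above; the table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (Name TEXT, Address TEXT, ZipCode INTEGER)")
conn.execute("CREATE INDEX ZipIdx ON Customers (ZipCode)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [("Ann", "1 Main St", 56001), ("Bob", "2 Oak Ave", 55401)],  # sample rows (assumed)
)

# Ask SQLite how it would execute the query; the last column holds the plan text.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Customers WHERE ZipCode = 56001"
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan_rows)
print(plan_text)

matches = conn.execute(
    "SELECT Name FROM Customers WHERE ZipCode = 56001"
).fetchall()
```

The exact wording of the plan text varies between SQLite versions, but it names the index it searches, which plays the same role as ACCESSNAME in the DB2 plan.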

Composite Indexes

A matching index scan retrieves records from a table when the column components of the index can be matched to the predicates of the WHERE clause. For example:

If MailIdx exists on ZipCode + IncomeLevel + MaritalStatus, the query

SELECT Name, Address FROM Customers
WHERE ZipCode = 56001 AND IncomeLevel = 10

is implemented in DB2/MVS using a matching index scan.

The DB2 access plan will have:

ACCESSTYPE = I for an index scan

ACCESSNAME = MailIdx

MATCHCOLS = 2
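
Why MATCHCOLS is 2 here can be illustrated with a simplified rule: count the leading index columns that have an equality predicate, stopping at the first column that does not. This sketch covers equality predicates only and is not DB2's full matching logic; the function name is illustrative.

```python
def matching_columns(index_columns, equality_predicates):
    """Count leading index columns that appear in an equality predicate."""
    count = 0
    for column in index_columns:
        if column not in equality_predicates:
            break  # a gap in the leading columns ends the match
        count += 1
    return count


mail_idx = ["ZipCode", "IncomeLevel", "MaritalStatus"]
matched = matching_columns(mail_idx, {"ZipCode", "IncomeLevel"})
# Predicates that skip the leading column cannot be matched against the index.
unmatched = matching_columns(mail_idx, {"IncomeLevel", "MaritalStatus"})
print(matched, unmatched)
```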

Multiple Indexes

Multiple index access is used when different indexes exist on the predicates in the WHERE clause. For example:

SELECT Name, Address FROM Customers
WHERE ZipCode = 56001
AND (IncomeLevel = 10 OR MaritalStatus = 'M')

DB2 access plan, assuming ZipIdx, IncomeIdx, and MaritalIdx:

TNAME      ACCESSTYPE  MATCHCOLS  ACCESSNAME  PREFETCH  MIXOPSEQ
Customers  M           0                      L         0
Customers  MX          1          IncomeIdx   S         1
Customers  MX          1          MaritalIdx  S         2
Customers  MU          0                                3
Customers  MX          1          ZipIdx      S         4
Customers  MI          0                                5
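
The MIXOPSEQ steps amount to set operations on row-ID lists: each MX step scans one index, MU takes a union, and MI takes an intersection. The sketch below mirrors that order with Python sets; the row IDs are invented for illustration.

```python
# Hypothetical row-ID (RID) lists returned by each single-index scan (MX steps).
income_rids = {3, 7, 12, 20}    # IncomeLevel = 10, via IncomeIdx
marital_rids = {7, 15, 20, 31}  # MaritalStatus = 'M', via MaritalIdx
zip_rids = {7, 12, 15, 40}      # ZipCode = 56001, via ZipIdx

# MU: union of the two OR-ed predicates.
either_rids = income_rids | marital_rids
# MI: intersection with the AND-ed ZipCode predicate.
final_rids = sorted(either_rids & zip_rids)
print(final_rids)
```

Only the rows in the final RID list need to be fetched from the table, which is the point of combining the indexes before touching the data pages.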

Overview

Common methods to join tables:
– Nested Loop
– Sort Merge
– Hash

In DB2, multi-step access plans for join processing make use of columns called:
– PLANNO (1, 2, …) to indicate which step
– METHOD (1=Nested Loop, 2=Sort Merge, 4=Hybrid Join) to indicate the join technique chosen by the query optimizer
– TABNO (such as 1 or 2) to indicate whether we are extracting rows from the first or second table of the join
– SORTN_JOIN (Y or N) to indicate whether a sort is required (for example, for the sort merge join)


A join of two tables occurs in two steps
– where one table becomes the outer table
– and the other becomes the inner table.

During join processing
– records from the outer table are presented one-by-one to the inner table
– and the inner table is searched for records matching the one presented to it.

Nested Loop Joins

Process:
– Records of the outer table are retrieved (or presented) using a table scan (or indexing, if possible).
• Only candidate records are retrieved.
• Candidate records satisfy local predicates.
– For each retrieved row from the outer table:
• the inner table is searched for qualifying records
• & the results are merged into a new record in an output table.

SELECT Customers.CustomerID, Name, Street, City, State, Zip, TotalSales
FROM Customers, Sales
WHERE State = 'MN'
AND Customers.CustomerID = Sales.CustomerID
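
The process above can be sketched over in-memory rows. This is a minimal illustration of the nested loop technique with made-up sample data, not the DB2 implementation; the inner table is scanned in full for every qualifying outer row, which is the unindexed case.

```python
customers = [  # outer table (sample rows, assumed)
    {"CustomerID": 1, "Name": "Ann", "State": "MN"},
    {"CustomerID": 2, "Name": "Bob", "State": "WI"},
    {"CustomerID": 3, "Name": "Cy", "State": "MN"},
]
sales = [  # inner table (sample rows, assumed)
    {"CustomerID": 1, "TotalSales": 250},
    {"CustomerID": 2, "TotalSales": 75},
    {"CustomerID": 3, "TotalSales": 410},
]


def nested_loop_join(outer, inner, local_predicate, join_key):
    """Join outer and inner rows on join_key, filtering outer rows first."""
    output = []
    for outer_row in outer:
        if not local_predicate(outer_row):
            continue  # only candidate records are presented to the inner table
        for inner_row in inner:
            if inner_row[join_key] == outer_row[join_key]:
                output.append({**outer_row, **inner_row})
    return output


joined = nested_loop_join(
    customers, sales, lambda row: row["State"] == "MN", "CustomerID"
)
print([(row["Name"], row["TotalSales"]) for row in joined])
```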


Appropriate:
– When the outer table has only a few records (after applying predicates)
– Where the inner table is small or has an index usable to access qualifying records

Drawbacks when the inner and outer tables are not indexed or clustered on the same values:
– The inner table (including indexes) may be scanned repetitively to find matching records.
– The outer table is processed inefficiently for records with the same value in the join columns.

Sort-Merge Joins

Algorithm where the 2 tables are scanned only once:
– For each table, a temporary table is created of qualifying candidate records.
– Each temporary table is sorted on the same columns based on the joining predicates (conditions).
– Finally, the two temporary tables are merged and joined (as an inner and outer table) into a third table.

If either of the original two tables has an index on the selection and joining predicates, it may be possible to skip the creation of the corresponding temporary table.
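
The sort and merge phases can be sketched as follows. The sorted copies play the role of the temporary tables; the sample rows and function name are illustrative, and filtering by selection predicates is left out to keep the merge logic visible.

```python
def sort_merge_join(left, right, key):
    """Equi-join two lists of dict rows by sorting both on key, then merging."""
    left_sorted = sorted(left, key=lambda row: row[key])    # temporary table 1
    right_sorted = sorted(right, key=lambda row: row[key])  # temporary table 2
    output, i, j = [], 0, 0
    while i < len(left_sorted) and j < len(right_sorted):
        left_key, right_key = left_sorted[i][key], right_sorted[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows sharing this key value.
            j_end = j
            while j_end < len(right_sorted) and right_sorted[j_end][key] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left_sorted) and left_sorted[i][key] == left_key:
                for right_row in right_sorted[j:j_end]:
                    output.append({**left_sorted[i], **right_row})
                i += 1
            j = j_end
    return output


customer_rows = [{"CustomerID": 3, "Name": "Cy"}, {"CustomerID": 1, "Name": "Ann"}]
sale_rows = [
    {"CustomerID": 1, "TotalSales": 100},
    {"CustomerID": 3, "TotalSales": 410},
    {"CustomerID": 1, "TotalSales": 150},
    {"CustomerID": 2, "TotalSales": 75},
]
merged = sort_merge_join(customer_rows, sale_rows, "CustomerID")
print([(row["Name"], row["TotalSales"]) for row in merged])
```

If an input already arrives in key order, for example through an index, the corresponding `sorted` call is the step that could be skipped.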

DB2/400

Components

Data Management Methods

The Optimizer

The Database Monitor

Research Problem/Objectives

Proposed Scope

Proposed Solution

Summary

Components

Query component
– Query optimizer
• Cost based

Data management methods
– Access paths
– Access methods

Data Management Methods

Access Paths
– Sequential, also called arrival sequence
• accesses data in physical order
– Keyed sequential, uses indexes
• accesses data in the order of the index

Access Methods
– Dataspace scan: similar to a table scan; parallel version too
– Key selection: requires index; parallel version too
– Key positioning: requires index; parallel version too
– Index only: requires index; SMP required
– Index-from-index: requires index; SMP required
– Hashing: SMP required

The Optimizer

Query optimization is a tradeoff between
– the time to determine an optimal implementation
– the time to actually execute the query

The query optimizer
– selects the most efficient access method at query run-time
• identifies alternatives
• estimates current costs
– optimizes joins and grouping operations


The access cost is modeled for:
– reading records without an index (a dataspace scan)
– reading records with an index (key selection or key positioning)*
– creating a temporary index on the relevant data
– creating a temporary index on another index (index-from-index)
– using the hashing method or a query sort routine

*Each index is evaluated, unless a time limit is reached first. Indexes are examined in LIFO order.


The cost of a particular method is the sum of:
– the start-up cost
– the cost associated with the optimization mode**
– the cost of creating any indexes
– the cost for the expected number of page faults to read the data and the cost to process the expected number of rows

**The optimization mode is given by a parameter, which defines the minimization goal as either the time to retrieve the first buffer of data or the time to retrieve all the rows that would be selected.

The Database Monitor

The database monitor collects statistics and data about query implementation and performance during query optimization and execution.

For example, SQL can extract data showing:
– Queries implemented as table scans
– Queries taking the most time
– Queries stopped by the query governor
– All queries executed and data on each query, such as
• Table names
• Number of rows in a table and the number of rows selected
• Estimated & actual execution time
• Indexes advised & fields

Research Problem/Objectives

RESEARCH PROBLEM

Improve query performance by determining an appropriate set of indexes.

RESEARCH OBJECTIVES

– Determine the information required & available.
– Determine the formulas/algorithms needed to evaluate the data.
– Determine the user interface (including capabilities) of a user-friendly intelligent index advisor.

Proposed Scope

Create a prototype intelligent index manager.
– Experiment with some basic queries.

Evaluate simple queries
– no joins & no GROUP BY or ORDER BY clauses in SQL.

Summary

IBM’s DB2/400 has several components:
– The query component, including the query optimizer, which:
• chooses a minimal-cost method to implement queries
• validates or creates an access plan, a control structure containing the implementation data.
– The data management methods:
• algorithms for retrieving data through an access path using a table access method.

The database monitor:
– collects statistics and implementation details about queries.
– includes recommendations on creating indexes:
• on a per-query basis
• based on a limited subset of the query optimizer rules.


An intelligent index manager would:
– need data on query implementation and execution from the database monitor.
• The database monitor would need to be enhanced to save additional data from the query optimizer.
• Supplemental data may be needed, such as
  – How often the records change in queried tables
  – User specifications
– need algorithms, formulas & rules based on
• The major ones used by the query optimizer.
• Additional criteria, such as projected index maintenance costs & rules based on the user specifications.


An intelligent index manager would also:
– evaluate & judge if new indexes are needed; and if so, what indexes would provide the best overall system performance.
• Decisions would be made on the makeup of the indexes: which tables and which fields.
• Decisions would be based on all the queries within its scope.
– advise users, recommending specific indexes.

Further research is needed to resolve questions concerning exact implementation details:
– specific data requirements, algorithms and formulas
– an appropriate user interface

Mark Schoennauer

Thank you