Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function...

93
postgrespro.ru Adapve query opmizaon in PostgreSQL Oleg Ivanov Postgres Professional

Transcript of Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function...

Page 1: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

postgrespro.ru

Adaptive query optimizationin PostgreSQL

Oleg IvanovPostgres Professional

Page 2: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

2

● Problem statement– Query optimization– Correlated clauses issue

● Machine learning– Gradient K Nearest Neighbours method

● Adaptive query optimization– Theory– Implementation

● Experimental evaluation

Outline

Page 3: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

3

Query optimization

Page 4: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

4

SeqScan

users u1

IndexScan

messages m

MergeJoin

IndexScan

users u2

NestedLoopJoin

SELECT *FROM users AS u1, messages AS m, users AS u2WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;

Query plan

Page 5: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

5

IndexScan

users u1

SeqScan

messages m

NestedLoopJoin

IndexScan

users u2

NestedLoopJoin

SELECT *FROM users AS u1, messages AS m, users AS u2WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;

Query plan

Page 6: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

6

SeqScan

users u1

SeqScan

messages m

HashJoin

SeqScan

users u2

HashJoin

SELECT *FROM users AS u1, messages AS m, users AS u2WHERE u1.id = m.sender_id AND m.receiver_id = u2.id;

Query plan

Page 7: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

7

EXPLAIN SELECT *FROM users AS u1, messages AS m, users AS u2WHERE u1.id = m.sender_id AND m.receiver_id = u2.id; QUERY PLAN ----------------------------------------------------------------------------------- Hash Join (cost=540.00..439429.44 rows=10003825 width=27) Hash Cond: (m.receiver_id = u2.id) -> Hash Join (cost=270.00..301606.84 rows=10003825 width=23) Hash Cond: (m.sender_id = u1.id) -> Seq Scan on messages m (cost=0.00..163784.25 rows=10003825 width=19) -> Hash (cost=145.00..145.00 rows=10000 width=4) -> Seq Scan on users u1 (cost=0.00..145.00 rows=10000 width=4) -> Hash (cost=145.00..145.00 rows=10000 width=4) -> Seq Scan on users u2 (cost=0.00..145.00 rows=10000 width=4)(9 rows)

Query plan

Page 8: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

8

EXPLAIN SELECT *FROM users AS u1, messages AS m, users AS u2WHERE u1.id = m.sender_id AND m.receiver_id = u2.id; QUERY PLAN ----------------------------------------------------------------------------------- Hash Join (cost=540.00..439429.44 rows=10003825 width=27) Hash Cond: (m.receiver_id = u2.id) -> Hash Join (cost=270.00..301606.84 rows=10003825 width=23) Hash Cond: (m.sender_id = u1.id) -> Seq Scan on messages m (cost=0.00..163784.25 rows=10003825 width=19) -> Hash (cost=145.00..145.00 rows=10000 width=4) -> Seq Scan on users u1 (cost=0.00..145.00 rows=10000 width=4) -> Hash (cost=145.00..145.00 rows=10000 width=4) -> Seq Scan on users u2 (cost=0.00..145.00 rows=10000 width=4)(9 rows)

Plan node execution cost Plan node cardinality

Plan execution cost

Query plan

Page 9: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

9

Cost-based query optimization

System R

1. A function which determines the plan's cost2. Minimizing the function value over all possible plans for the query

How does PostgreSQL optimize queries?

Page 10: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

10

cs seq_page_cost 1.0

cr random_page_cost 4.0

ct cpu_Tuple_cost 0.01

ci cpu_Index_tuple_cost 0.005

co cpu_Operator_cost 0.0025

Cost=ns cs+nr cr+nt c t+nic i+no co

PostgreSQL cost model

Page 11: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

11

0 – 45 – 9

10 - 14 15 - 19

20 - 24 25 - 29

30 - 34 35 - 39

40 - 44 45 - 49

50 - 54 55 - 59

60 - 64 65 - 69

70 - 74 75 - 79

80 - 84 85 - 89

90 - 94 95 – 99

100 or more

0

2000

4000

6000

8000

10000

12000

14000

Age

SELECT * FROM usersWHERE age < 25;

Cardinality

Cardinality estimation

Page 12: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

12

Dynamic programming over subsets

● System R● Time complexity: 3n

● Memory consumption: 2n

● Always finds the cheapest plan

Page 13: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

13

Genetic algorithm

● PostgreSQL● Common and flexible method● Can be stopped on every iteration● No guarantees

Page 14: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

14

Dynamic programming

or

Genetic algorithm

Optimization methodSeqScan

users

HashJoin

HashJoin

SeqScan

messages

SeqScan

pictures

Cost = 439429

MergeJoin

SeqScan

messages

SeqScan

pictures

Cost = 304528

PostgreSQL query optimization

Plan's costestimation

Cardinalityestimation

Cost model

Page 15: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

15

Correlated clauses issue

Page 16: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

16

Query clauses

Information aboutstored data

PostgreSQL state

Cardinalityestimation Cost model

Page 17: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

17Dataset:

The TPC Benchmark™DS (TPC-DS)http://www.tpc.org/tpcds/

Page 18: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

18Dataset:

The TPC Benchmark™DS (TPC-DS)http://www.tpc.org/tpcds/

Error: 300 times Error: 4 times

Page 19: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

19

Query clauses

Information aboutstored data

PostgreSQL state

Cardinalityestimation Cost model

How good are query optimizers, really?V. Leis, A. Gubichev, A. Mirchev, P. Boncz,

A. Kemper, and T. Neumann,Proc. VLDB, Nov. 2015

Page 20: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

20

0 – 45 – 9

10 - 14 15 - 19

20 - 24 25 - 29

30 - 34 35 - 39

40 - 44 45 - 49

50 - 54 55 - 59

60 - 64 65 - 69

70 - 74 75 - 79

80 - 84 85 - 89

90 - 94 95 – 99

100 or more

0

2000

4000

6000

8000

10000

12000

14000

Age

Selectivity≃0.3Cardinality=N tuples⋅Selectivity

SELECT * FROM usersWHERE age < 25;

Page 21: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

21

SELECT * FROM usersWHERE age < 25 AND city = 'Ottawa';

Selectivityage=13

Selectivitycity=17

Selectivityage ,city=?

Only selectivities of individual clauses(i.e. marginal selectivities)

are known

Page 22: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

22

age < 25

city = 'Ottawa'

1/3

1/7 1/21

Page 23: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

23

SELECT * FROM usersWHERE age < 25 AND city = 'Ottawa';

Selectivityage ,city=Selectivityage⋅Selectivitycity

With the only following exception Selectivity25<ageANDage<57=Selectivity25<age<57

Only selectivities of individual clausesare known

The clauses are considered to be independent:

Page 24: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

24

SELECT * FROM usersWHERE age < 12 AND married = true;

married = trueage < 12

Page 25: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

25

SELECT * FROM usersWHERE age < 12 AND married = false;

age < 12

married = false

Page 26: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

26

SELECT * FROM usersWHERE age > 25 AND married = trueAND position = 'CTO';

age > 25position = 'CTO'

married = true

Page 27: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

27

Multidimensional histograms

Pros:● Solve the problem● Have theoretical guarantees

Contras:● Dimensionality curse● Require memory● Require time for building or updating● Not clear which of all possible

column subsets are needed● Correlation tests are slow

Page 28: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

28Histograms Cost models

ResultSQL query

Query executionQuery optimization

Adaptive query optimization: idea

A priori cost estimation

Page 29: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

29Histograms Cost models

Feedback ResultSQL query

Query executionQuery optimization

Execution statistics

A priori cost estimation

Adaptive query optimization: idea

Page 30: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

30Histograms Cost models

Feedback ResultSQL query

Query executionQuery optimization

Execution statistics

A priori cost estimation

Adaptive query optimization: idea

Page 31: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

31Histograms Cost models

Feedback ResultSQL query

Query executionQuery optimization

Execution statistics

A priori cost estimation

Adaptive query optimization: idea

Page 32: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

32

Machine learning

Page 33: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

33

Feature 1

Feature 2

Feature 3

Hidden variables

Features

Train set Test set

ObjectsMachine learning

Page 34: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

34

Learning procedure

For our problem the learning workflow is iterative:

● On the planning stage the model has to predict hidden variables for a number of objects

● After the execution stage some of these objects are appended to the train set and the model can learn on them

Page 35: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

35

Learning procedure

Query 1

After-execution stage:Object 1 – Variable 1Object 3 – Variable 3

Planning stage:Object 1 - ?Object 2 - ?Object 3 - ?

Train set:Empty

Page 36: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

36

Learning procedure

Query 2

After-execution stage:Object 4 – Variable 4

Planning stage:Object 4 - ?Object 1 - ?Object 5 - ?

Train set:Object 1 – Variable 1Object 3 – Variable 3

Page 37: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

37

Learning procedure

Query 3

After-execution stage:...

Planning stage:...

Train set:Object 1 – Variable 1Object 3 – Variable 3Object 4 – Variable 4

Page 38: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

38

K Nearest Neighbours method

25 47 55 32 22 45 28

50 120 100 80 30 90 ?

Age

Salary

Page 39: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

39

25 47 55 32 22 45 28

50 120 100 80 30 90 ?

Age

Salary

K Nearest Neighbours method

Page 40: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

40

Gradient approach to kNN

Goal: not to store the whole train set

Idea: to use a fixed number of virtual objectsthat provide the best possible prediction quality

Page 41: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

41

27 47 28

53 103 ?

2/3

1/3

1/3 5/3

Gradient approach to kNN

Age

Salary

Page 42: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

42

Gradient approach to kNN

Math for learning:

Loss function+

Stochastic gradient descent to optimize it+

K Nearest Neighbours method

Page 43: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

43

Gradient approach to kNN

13 1220

17

10

15

Page 44: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

44

Gradient approach to kNN

13 12

10

20

17

K = 3

15

Page 45: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

45

Gradient approach to kNN

13 12

10

15

20

17

K = 3

Page 46: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

46

Gradient approach to kNN

13 12

9

13

18

17

K = 3

Page 47: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

47

Adaptive query optimizationTheory

Page 48: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

48

SeqScan

users u1

IndexScan

messages m

MergeJoin

users u2

NestedLoopJoin

The object is a node with its subtree

u2.id = messages.receiver_id

u1.id = messages.sender_id

IndexScan

u2.married = trueu2.age < 25

Page 49: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

49

Node cardinality is105 tuples!Clauses list

users.id = messages.receiver_idANDusers.married = trueANDusers.age < 25

PostgreSQL estimator

Histograms

Page 50: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

50

Node cardinality is1017 tuples!Clauses list

users.id = messages.receiver_idANDusers.married = constANDusers.age < const

Machine learning

Clause selectivities● 0.0001● 0.73● 0.23

Relations to join● users● messages

*equal clauses must be handled specially: transform into clause with an equalence class as an argument

Page 51: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

51

users.id = messages.receiver_id

users.married = const

users.age < const

0.0001

0.73

0.23

Node cardinality

Features(clause types)

Hidden variable ?

Object is a plan node

Machine learning problem statement

Base relations● users● messages

Each set of base relations and clause types induce its own machine learning problem (feature subspace hash)!

Page 52: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

52

NestedLoopJoin

HashJoin IndexScan

SeqScan SeqScan

users messages

friends

SELECT * FROM users, messages, friendsWHERE users.age > 25 AND users.id > 1000AND users.id = messages.sender_idAND users.id = friends.first_idAND messages.receiver_id = friends.second_id;

eclass_2users.age > CONSTeclass_1 > CONST

Marginal selectivities0.0001

0.780.97

Clauses

Hash Sort clauses

Feature subspace hash993059063

Feature vector(0.0001, 0.78, 0.97)

users.id = friends.first_idmessages.receiver_id = friends.second_id

users.id = messages.sender_id

users.age > 25users.id > 1000

eclass_1users.idmessages.sender_idfriends.first_id

eclass_2messages.receiver_idfriends.second_id

Base relationsusersmessagesfriends

Page 53: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

53

Query optimization Query executionQuery parsing

Cardinality estimation

Machine learning data

Query execution statistics

LearningMachine learning

Workflow

Page 54: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

54

● Will it converge?

● How fast will it converge?

● What guarantees on obtained plans orregressor do we have?

Yes, in the finite number of steps

Predictions are correct for all executed paths

Don't know (in practice in a few steps)

With perfect cost model obtained plans are not worse

Theoretical properties

Page 55: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

55

Adaptive query optimizationImplementation

Page 56: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

56

Source code

Current code for vanilla PostgreSQL (extension + patch):https://github.com/tigvarts/aqo

Open Source, PostgreSQL license

Without extension equivalent to standard PostgreSQL optimizer

Needs to be in the shared preload libraries

Page 57: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

57

Planning stage (prediction):set_baserel_size_estimatesget_parameterized_baserel_sizeset_joinrel_size_estimatesget_parameterized_joinrel_size

After-execution stage:ExecutorEnd – learningExplainOnePlan – visualization

Other:planner_hook – prepare to the planning stageExecutorStart – setting the flags for statistics collectioncopy_generic_path_info – transmit Path information to Plan node

Hooks

Page 58: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

58

Control

For some queries we don't need AQO.

So we need a mechanism to disable AQO learning or usage for some queries.

Page 59: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

59

Query types

Query type is the set of queries, which differ only in their constants.

Query type:SELECT * FROM users WHERE age > const AND city = const;

Queries:SELECT * FROM users WHERE age > 18 AND city = 'Ottawa';SELECT * FROM users WHERE age > 65 AND city = 'New York';…

Page 60: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

60

AQO tables

aqo_queries● Query_hash● Learn AQO● Use AQO● Feature_space

Settings

aqo_query_texts● Query_hash● Query_text

For user

aqo_data● Feature_space● Feature_subspace● Features● Target

Machine learning

Page 61: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

61

Control

The users don't want to configure AQO query settings manually.

So we need a mechanism to determine automatically whether the query needs AQO.

It is called auto tuning.

Page 62: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

62

AQO tables

aqo_queries● Query_hash● Learn AQO● Use AQO● Feature_space● Auto_tuning

Settings

aqo_query_texts● Query_hash● Query_text

For user

aqo_query_stat● Query_hash● Planning time● Execution time● Cardinality error● Number of executions

For auto tuning

aqo_data● Feature_space● Feature_subspace● Features● Target

Machine learning

Page 63: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

63

Control

GUC: aqo.mode

● Disabled: disabled for all query types● Forced: enabled for all query types● Controlled: use manual settings for known query types, ignore others● Intelligent: use manual settings for known query types,

tries to tune others automatically

Page 64: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

64

Query optimization Query executionQuery parsing

Cardinality estimation

Machine learning data

Query execution statistics

LearningMachine learning

Workflow

Page 65: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

65

Query optimization Query executionQuery parsing

Cardinality estimation

Query execution statistics

Learning

Fair workflow

Query settings definition Auto tuning

Page 66: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

66

Experimental evaluation

Page 67: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

67

Strongly Correlated Columns (StrongCor)

Cardinality estimation using neural networksH. Liu, M. Xu, Z. Yu, V. Corvinelli, and C. Zuzarte,

CASCON’15, 2015, IBM Corp.

Page 68: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

68

Cardinality estimation error

Page 69: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

69

Cardinality estimation error

Page 70: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

70

Cardinality estimation error

Page 71: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

71

TPC-H slow

TPC-H fast

0 10 20 30 40 50 60

+1.3%

Performance improvement

AdaptiveOriginal

Page 72: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

72

TPC-H slow

TPС-H fast

0 5000 10000 15000 20000

-4.4%

AdaptiveOriginal

Performance improvement

Page 73: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

73

TPC-DS very slow

TPC-DS slow

TPC-DS normal

TPC-DS fast

TPC-DS very fast

0 2 4 6 8 10 12 14 16 18

+12%

AdaptiveOriginal

Performance improvement

Page 74: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

74

TPC-DS very slow

TPC-DS slow

TPC-DS normal

TPC-DS fast

TPC-DS very fast

0 20 40 60 80 100 120 140

+24%

AdaptiveOriginal

Performance improvement

Page 75: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

75

TPC-DS very slow

TPC-DS slow

TPC-DS normal

TPC-DS fast

TPC-DS very fast

0 50 100 150 200 250 300

+41%

AdaptiveOriginal

Performance improvement

Page 76: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

76

TPC-DS very slow

TPC-DS slow

TPC-DS normal

TPC-DS fast

TPC-DS very fast

0200

400600

8001000

12001400

16001800

+285%

AdaptiveOriginal

Performance improvement

Page 77: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

77

TPC-DS very slow

TPC-DS slow

TPC-DS normal

TPC-DS fast

TPC-DS very fast

0 5000 10000 15000 20000

Adaptive

+115%

Original

Performance improvement

Page 78: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

78

Maximum acceleration

Page 79: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

790 1 2 3 4 5 6 7 8 9

# iter

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

1.4E

xecu

tion

tim

e,

s

TPC-DS 18

Learning progress

Page 80: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

800 1 2 3 4 5 6 7 8 9

# iter

100

101

102

103

104

Exe

cuti

on t

ime, s

TPC-DS 69

Learning progress

Page 81: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

810 1 2 3 4 5 6 7 8 9

# iter

0.7

0.8

0.9

1.0

1.1

1.2

1.3Exe

cuti

on

tim

e, s

TPC-DS 26

Learning progress

Page 82: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

82

SELECT MIN(k.keyword) AS movie_keyword, MIN(n.name) AS actor_name, MIN(t.title) AS hero_movieFROM cast_info AS ci, keyword AS k, movie_keyword AS mk, name AS n, title AS tWHERE k.keyword in ('superhero', 'sequel', 'second-part', 'marvel-comics', 'based-on-comic', 'tv-special', 'fight', 'violence') AND n.name LIKE '%Downey%Robert%' AND t.production_year > 2000 AND k.id = mk.keyword_id AND t.id = mk.movie_id AND t.id = ci.movie_id AND ci.movie_id = mk.movie_id AND n.id = ci.person_id;

Example: complicated queryHow good are query optimizers, really?

V. Leis, A. Gubichev, A. Mirchev, P. Boncz,A. Kemper, and T. Neumann,

Proc. VLDB, Nov. 2015Join Order Benchmark

Page 83: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

83

QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=15028.15..15028.16 rows=1 width=96) (actual time=9059.369..9059.369 rows=1 loops=1) -> Nested Loop (cost=8.21..15028.14 rows=1 width=48) (actual time=218.507..9058.827 rows=88 loops=1) -> Nested Loop (cost=7.78..12650.75 rows=5082 width=37) (actual time=0.700..3343.348 rows=785477 loops=1) Join Filter: (t.id = ci.movie_id) -> Nested Loop (cost=7.21..12370.11 rows=148 width=41) (actual time=0.682..423.503 rows=14165 loops=1) -> Nested Loop (cost=6.78..12235.17 rows=270 width=20) (actual time=0.661..159.090 rows=35548 loops=1) -> Seq Scan on keyword k (cost=0.00..3632.40 rows=8 width=20) (actual time=0.125..28.679 rows=8 loops=1) Filter: (keyword = ANY ('{superhero,sequel,second-part,marvel-comics,based-on-comic,tv-special,fight,violence}'::text[])) Rows Removed by Filter: 134162 -> Bitmap Heap Scan on movie_keyword mk (cost=6.78..1072.32 rows=303 width=8) (actual time=0.949..15.307 rows=4444 loops=8) Recheck Cond: (keyword_id = k.id) Heap Blocks: exact=23488 -> Bitmap Index Scan on keyword_id_movie_keyword (cost=0.00..6.71 rows=303 width=0) (actual time=0.563..0.563 rows=4444 loops=8) Index Cond: (keyword_id = k.id) -> Index Scan using title_pkey on title t (cost=0.43..0.49 rows=1 width=21) (actual time=0.007..0.007 rows=0 loops=35548) Index Cond: (id = mk.movie_id) Filter: (production_year > 2000) Rows Removed by Filter: 1 -> Index Scan using movie_id_cast_info on cast_info ci (cost=0.56..1.47 rows=34 width=8) (actual time=0.010..0.190 rows=55 loops=14165) Index Cond: (movie_id = mk.movie_id) -> Index Scan using name_pkey on name n (cost=0.43..0.46 rows=1 width=19) (actual time=0.007..0.007 rows=0 loops=785477) Index Cond: (id = ci.person_id) Filter: (name ~~ '%Downey%Robert%'::text) Rows Removed by Filter: 1 Planning time: 36.697 ms Execution time: 9059.593 ms(26 rows)

Bad cardinality estimates

Page 84: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

84

QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=88572.37..88572.38 rows=1 width=96) (actual time=16070.666..16070.666 rows=1 loops=1) -> Nested Loop (cost=8.21..88571.71 rows=88 width=48) (actual time=367.618..16070.206 rows=88 loops=1) Join Filter: (mk.movie_id = t.id) -> Nested Loop (cost=7.78..88571.24 rows=1 width=39) (actual time=367.595..16068.698 rows=112 loops=1) -> Nested Loop (cost=7.35..84247.78 rows=9242 width=28) (actual time=0.659..5438.176 rows=1564305 loops=1) -> Nested Loop (cost=6.78..12235.17 rows=35548 width=20) (actual time=0.642..154.833 rows=35548 loops=1) -> Seq Scan on keyword k (cost=0.00..3632.40 rows=8 width=20) (actual time=0.121..28.424 rows=8 loops=1) Filter: (keyword = ANY ('{superhero,sequel,second-part,marvel-comics,based-on-comic,tv-special,fight,violence}'::text[])) Rows Removed by Filter: 134162 -> Bitmap Heap Scan on movie_keyword mk (cost=6.78..1072.32 rows=303 width=8) (actual time=0.925..14.836 rows=4444 loops=8) Recheck Cond: (keyword_id = k.id) Heap Blocks: exact=23488 -> Bitmap Index Scan on keyword_id_movie_keyword (cost=0.00..6.71 rows=303 width=0) (actual time=0.543..0.543 rows=4444 loops=8) Index Cond: (keyword_id = k.id) -> Index Scan using movie_id_cast_info on cast_info ci (cost=0.56..1.47 rows=55 width=8) (actual time=0.008..0.139 rows=44 loops=35548) Index Cond: (movie_id = mk.movie_id) -> Index Scan using name_pkey on name n (cost=0.43..0.46 rows=1 width=19) (actual time=0.007..0.007 rows=0 loops=1564305) Index Cond: (id = ci.person_id) Filter: (name ~~ '%Downey%Robert%'::text) Rows Removed by Filter: 1 -> Index Scan using title_pkey on title t (cost=0.43..0.45 rows=1 width=21) (actual time=0.012..0.012 rows=1 loops=112) Index Cond: (id = ci.movie_id) Filter: (production_year > 2000) Rows Removed by Filter: 0 Planning time: 11.166 ms Execution time: 16070.850 ms(26 rows)

Using previous statistics to refine estimates

Page 85: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

85

QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=169291.60..169291.61 rows=1 width=96) (actual time=3851.890..3851.890 rows=1 loops=1) -> Hash Join (cost=107718.88..169290.94 rows=88 width=48) (actual time=845.627..3851.419 rows=88 loops=1) Hash Cond: (ci.person_id = n.id) -> Nested Loop (cost=7.78..58633.52 rows=785477 width=37) (actual time=0.700..3011.292 rows=785477 loops=1) Join Filter: (t.id = ci.movie_id) -> Nested Loop (cost=7.21..30001.33 rows=14165 width=41) (actual time=0.682..415.470 rows=14165 loops=1) -> Nested Loop (cost=6.78..12235.17 rows=35548 width=20) (actual time=0.663..154.262 rows=35548 loops=1) -> Seq Scan on keyword k (cost=0.00..3632.40 rows=8 width=20) (actual time=0.126..28.971 rows=8 loops=1) Filter: (keyword = ANY ('{superhero,sequel,second-part,marvel-comics,based-on-comic,tv-special,fight,violence}'::text[])) Rows Removed by Filter: 134162 -> Bitmap Heap Scan on movie_keyword mk (cost=6.78..1072.32 rows=303 width=8) (actual time=0.980..14.743 rows=4444 loops=8) Recheck Cond: (keyword_id = k.id) Heap Blocks: exact=23488 -> Bitmap Index Scan on keyword_id_movie_keyword (cost=0.00..6.71 rows=303 width=0) (actual time=0.579..0.579 rows=4444 loops=8) Index Cond: (keyword_id = k.id) -> Index Scan using title_pkey on title t (cost=0.43..0.49 rows=1 width=21) (actual time=0.007..0.007 rows=0 loops=35548) Index Cond: (id = mk.movie_id) Filter: (production_year > 2000) Rows Removed by Filter: 1 -> Index Scan using movie_id_cast_info on cast_info ci (cost=0.56..1.47 rows=44 width=8) (actual time=0.009..0.170 rows=55 loops=14165) Index Cond: (movie_id = mk.movie_id) -> Hash (cost=107705.93..107705.93 rows=414 width=19) (actual time=756.785..756.785 rows=2 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on name n (cost=0.00..107705.93 rows=414 width=19) (actual time=77.074..756.771 rows=2 loops=1) Filter: (name ~~ '%Downey%Robert%'::text) Rows Removed by Filter: 4167489 Planning time: 13.462 ms Execution time: 3852.110 ms(28 rows)

Page 86: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

86

QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=109869.39..109869.40 rows=1 width=96) (actual time=783.543..783.543 rows=1 loops=1) -> Nested Loop (cost=1.85..109868.73 rows=88 width=48) (actual time=93.291..783.412 rows=88 loops=1) -> Nested Loop (cost=1.43..109849.92 rows=41 width=36) (actual time=92.005..767.675 rows=5202 loops=1) -> Nested Loop (cost=0.99..109833.44 rows=9 width=40) (actual time=91.982..763.536 rows=306 loops=1) -> Nested Loop (cost=0.56..109825.54 rows=17 width=19) (actual time=91.907..758.112 rows=486 loops=1) -> Seq Scan on name n (cost=0.00..107705.93 rows=2 width=19) (actual time=73.336..732.998 rows=2 loops=1) Filter: (name ~~ '%Downey%Robert%'::text) Rows Removed by Filter: 4167489 -> Index Scan using person_id_cast_info on cast_info ci (cost=0.56..1054.82 rows=499 width=8) (actual time=12.418..12.503 rows=243 loops=2) Index Cond: (person_id = n.id) -> Index Scan using title_pkey on title t (cost=0.43..0.45 rows=1 width=21) (actual time=0.011..0.011 rows=1 loops=486) Index Cond: (id = ci.movie_id) Filter: (production_year > 2000) Rows Removed by Filter: 0 -> Index Scan using movie_id_movie_keyword on movie_keyword mk (cost=0.43..1.36 rows=47 width=8) (actual time=0.007..0.010 rows=17 loops=306) Index Cond: (movie_id = t.id) -> Index Scan using keyword_pkey on keyword k (cost=0.42..0.45 rows=1 width=20) (actual time=0.003..0.003 rows=0 loops=5202) Index Cond: (id = mk.keyword_id) Filter: (keyword = ANY ('{superhero,sequel,second-part,marvel-comics,based-on-comic,tv-special,fight,violence}'::text[])) Rows Removed by Filter: 1 Planning time: 13.981 ms Execution time: 783.723 ms(22 rows)

Page 87: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

87

QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=110747.71..110747.72 rows=1 width=96) (actual time=770.231..770.232 rows=1 loops=1) -> Nested Loop (cost=1.85..110747.05 rows=88 width=48) (actual time=78.828..770.093 rows=88 loops=1) Join Filter: (mk.movie_id = t.id) -> Nested Loop (cost=1.42..110694.70 rows=112 width=39) (actual time=75.208..769.518 rows=112 loops=1) -> Nested Loop (cost=1.00..110660.75 rows=74 width=27) (actual time=74.639..743.329 rows=10066 loops=1) -> Nested Loop (cost=0.56..109820.42 rows=486 width=19) (actual time=74.589..736.659 rows=486 loops=1) -> Seq Scan on name n (cost=0.00..107705.93 rows=2 width=19) (actual time=74.543..736.376 rows=2 loops=1) Filter: (name ~~ '%Downey%Robert%'::text) Rows Removed by Filter: 4167489 -> Index Scan using person_id_cast_info on cast_info ci (cost=0.56..1054.82 rows=243 width=8) (actual time=0.027..0.094 rows=243 loops=2) Index Cond: (person_id = n.id) -> Index Scan using movie_id_movie_keyword on movie_keyword mk (cost=0.43..1.26 rows=47 width=8) (actual time=0.006..0.010 rows=21 loops=486) Index Cond: (movie_id = ci.movie_id) -> Index Scan using keyword_pkey on keyword k (cost=0.42..0.45 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=10066) Index Cond: (id = mk.keyword_id) Filter: (keyword = ANY ('{superhero,sequel,second-part,marvel-comics,based-on-comic,tv-special,fight,violence}'::text[])) Rows Removed by Filter: 1 -> Index Scan using title_pkey on title t (cost=0.43..0.45 rows=1 width=21) (actual time=0.005..0.005 rows=1 loops=112) Index Cond: (id = ci.movie_id) Filter: (production_year > 2000) Rows Removed by Filter: 0 Planning time: 14.306 ms Execution time: 770.452 ms(23 rows)

Page 88: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

88

QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=112898.09..112898.10 rows=1 width=96) (actual time=754.243..754.243 rows=1 loops=1) -> Nested Loop (cost=1.85..112897.43 rows=88 width=48) (actual time=75.655..754.117 rows=88 loops=1) -> Nested Loop (cost=1.43..110510.89 rows=5202 width=36) (actual time=74.710..739.330 rows=5202 loops=1) Join Filter: (t.id = mk.movie_id) -> Nested Loop (cost=0.99..110046.39 rows=306 width=40) (actual time=74.694..735.745 rows=306 loops=1) -> Nested Loop (cost=0.56..109820.42 rows=486 width=19) (actual time=74.648..732.892 rows=486 loops=1) -> Seq Scan on name n (cost=0.00..107705.93 rows=2 width=19) (actual time=74.602..732.599 rows=2 loops=1) Filter: (name ~~ '%Downey%Robert%'::text) Rows Removed by Filter: 4167489 -> Index Scan using person_id_cast_info on cast_info ci (cost=0.56..1054.82 rows=243 width=8) (actual time=0.027..0.097 rows=243 loops=2) Index Cond: (person_id = n.id) -> Index Scan using title_pkey on title t (cost=0.43..0.45 rows=1 width=21) (actual time=0.005..0.006 rows=1 loops=486) Index Cond: (id = ci.movie_id) Filter: (production_year > 2000) Rows Removed by Filter: 0 -> Index Scan using movie_id_movie_keyword on movie_keyword mk (cost=0.43..1.26 rows=21 width=8) (actual time=0.004..0.008 rows=17 loops=306) Index Cond: (movie_id = ci.movie_id) -> Index Scan using keyword_pkey on keyword k (cost=0.42..0.45 rows=1 width=20) (actual time=0.003..0.003 rows=0 loops=5202) Index Cond: (id = mk.keyword_id) Filter: (keyword = ANY ('{superhero,sequel,second-part,marvel-comics,based-on-comic,tv-special,fight,violence}'::text[])) Rows Removed by Filter: 1 Planning time: 15.498 ms Execution time: 754.449 ms(23 rows)

Convergence to the plan with good estimates

Page 89: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

89

Conclusion

Page 90: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

90

Uses stored statistics to refine cardinality estimates

Works for clauses of the same structure (i. e. age < some_const)

Works when data distribution doesn't change rapidly

Is suitable for the static workload (i. e. the finite number of query templates or clause structures)

Is useful for complicated queries of the same structure with slow plan caused by bad cardinality estimates (OLAP)

Adaptive Query Optimization

Page 91: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

91

Further work

Histograms Cost models

A priori cost estimation

Feedback ResultSQL query

Query executionQuery optimization

Execution statistics

Page 92: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

92

Questions

Contacts:● [email protected]● +7 (916) 377-55-63

Page 93: Adaptive query optimization in PostgreSQL...Cost-based query optimization System R 1. A function which determines the plan's cost 2. Minimizing the function value over all possible

postgrespro.ru

Postgres Professionalhttp://postgrespro.ru/

+7 495 150 06 91

[email protected]