Efficient Query Optimization for Distributed Join in Database Federation

Efficient Query Optimization for Distributed Join in Database Federation A Master’s Thesis Proposal by Di Wang Advisor: Prof. Murali Mani Dec 4, 2008


Transcript of Efficient Query Optimization for Distributed Join in Database Federation

Page 1: Efficient Query Optimization for Distributed Join in Database Federation

Efficient Query Optimization for Distributed Join

in Database Federation

A Master’s Thesis Proposal by

Di Wang 

Advisor: Prof. Murali Mani

Dec 4, 2008

Page 2: Efficient Query Optimization for Distributed Join in Database Federation

Outline

Introduction – Query Optimization in Database Federations

Architecture and Problem Definition

Proposed Work

Schedule

Page 3: Efficient Query Optimization for Distributed Join in Database Federation

Introduction: Need for data integration
◦ Various systems -> full picture
◦ Mergers -> access both resources with a common interface
◦ Business partners -> combine data

Multiple Access Methods

Multiple Data Schemas

Page 4: Efficient Query Optimization for Distributed Join in Database Federation

Introduction to Database Federation

Database Federation is one approach to data integration
◦ Key performance advantage: efficiently combine data from multiple sources in a single statement
◦ The data sources are federated under a unified middleware layer, called the mediator.

Page 5: Efficient Query Optimization for Distributed Join in Database Federation

Key Components of Database Federation

Query Rewriter

Cost-Based Optimizer

Research issues:
• containment algorithms for conjunctive queries
• schema mapping
• capability-based optimization

Cost-based optimization: closely related to the optimization techniques developed for distributed database systems

Page 6: Efficient Query Optimization for Distributed Join in Database Federation

The problem

Page 7: Efficient Query Optimization for Distributed Join in Database Federation

Things that make us unhappy

Optimizer
◦ Estimated conditions: available buffer sizes of sites; CPU utilization of sites; network traffic …
◦ Statistics: physical designs …

Candidate plans over sites M1, M2, M3:
Plan 1: SortMerge on M1 ( NestLoop on M1 ( M1.R1, M2.R2 ), M3.R3 )
Plan 2: SortMerge on M2 ( NestLoop on M2 ( M1.R1, M2.R2 ), M3.R3 )
Plan 3: HashJoin on M3 ( SortMerge on M1 ( M1.R1, M2.R2 ), M3.R3 )

Assume: B(R1) < B(R2) < B(R3), B(R1 join R2) < B(R3)

Run | CPU utilization (M1 / M2 / M3) | Available buffer (M1 / M2 / M3) | Chosen plan | Optimal plan
1   | 25% / 25% / 25%                | B(R1) / - / -                   | Plan 1      | Plan 1
2   | 75% / 10% / 25%                | > B(R1) / > B(R1) / -           | Plan 1      | Plan 2
3   | 50% / 50% / 15%                | > / - / >                       | Plan 1      | Plan 3

Need to take run-time conditions into account at optimization time.

Page 8: Efficient Query Optimization for Distributed Join in Database Federation

Existing Solution - Parametric Query Optimization

Y. E. Ioannidis, et al. Parametric Query Optimization. VLDB 1992.

Key idea: identify several execution plans, each of which is optimal for a subset of all possible values of the run-time parameters.

E.g. two parameters:
◦ Buffer size B = [2, 151]
◦ Kind of index I = {no_index, clustered_Btree, non_clustered_BTree}

P, the set of possible vectors of parameter values, is the cross product P = B × I, so |P| = 150 * 3 = 450.

The optimization problem: $\forall p \in P$, find the plan $s_0$ in the plan space $S$ that satisfies the condition $c(s_0, p) = \min_{s \in S} c(s, p)$, where $c(\cdot)$ is the cost function, which also depends on the static parameters.
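The parameter space above is small enough to enumerate directly. The following is a minimal sketch of that idea, assuming a made-up plan space `PLANS` and a toy cost function `cost`; neither comes from the slides or from the cited paper, and the numbers are purely illustrative. It maps every parameter vector p in P = B × I to the plan with the lowest cost.

```python
from itertools import product

# Parameter space from the slide: buffer size B = [2, 151] and index kind I.
BUFFER_SIZES = range(2, 152)  # 150 values
INDEX_KINDS = ("no_index", "clustered_Btree", "non_clustered_BTree")

# Hypothetical plan space S (assumption, not from the source).
PLANS = ("nested_loop", "sort_merge", "index_join")


def cost(plan, buffer_size, index_kind):
    """Toy cost function c(s, p) (assumption): illustrative numbers only."""
    if plan == "nested_loop":
        return 10_000 / buffer_size
    if plan == "sort_merge":
        return 400 if buffer_size >= 20 else 900
    # index_join is unusable without an index
    if index_kind == "no_index":
        return float("inf")
    return 150 if index_kind == "clustered_Btree" else 600


def parametric_optimize():
    """Return a table mapping each p = (b, i) in B x I to its cheapest plan."""
    return {
        (b, i): min(PLANS, key=lambda s: cost(s, b, i))
        for b, i in product(BUFFER_SIZES, INDEX_KINDS)
    }


table = parametric_optimize()
print(len(table))  # one entry per parameter vector
```

At run time the actual parameter values would simply be looked up in `table`. Exhaustive enumeration like this only works because |P| is tiny here; it is not the randomized exploration used in the cited work.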

Page 9: Efficient Query Optimization for Distributed Join in Database Federation

Existing Solution - Parametric Query Optimization (Cont.)

Efficient exploration algorithm – Randomized Algorithm

Justification for using parametric query optimization
[Figure: relative cost versus buffer size]

Problems with applying it in distributed databases:
• Site selection + algebraic transformation + physical method selection
• Many more combinations of run-time parameters

Page 10: Efficient Query Optimization for Distributed Join in Database Federation

Existing Solution – Two-Phase Algorithm

W. Hong, et al. Optimization of Parallel Query Execution Plans in XPRS. PDIS, 1991.

Developed for a parallel database based on a shared-memory multiprocessor

Phase 1: find the optimal sequential plan assuming the entire buffer pool is available

Phase 2: find the optimal parallelization of the optimal sequential plan, considering run-time available buffer size & # of free processors
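The split between the two phases can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the plan names and costs in `SEQUENTIAL_PLANS`, the `buffer_per_worker` parameter, and the ideal-speed-up model in `phase2`. It is not the XPRS algorithm itself, only the shape of "fix the plan first, then fit it to run-time resources".

```python
# Hypothetical sequential plans: name -> cost with the entire buffer pool available.
SEQUENTIAL_PLANS = {"left_deep": 120.0, "bushy": 100.0, "right_deep": 140.0}


def phase1(plans):
    """Phase 1: pick the cheapest sequential plan, assuming the full buffer pool."""
    return min(plans, key=plans.get)


def phase2(seq_cost, free_processors, available_buffer, buffer_per_worker):
    """Phase 2: choose a degree of parallelism for the already-fixed plan.

    Toy model (assumption): ideal speed-up, capped by the number of free
    processors and by how many workers fit into the available buffer.
    """
    max_by_buffer = max(1, available_buffer // buffer_per_worker)
    degree = max(1, min(free_processors, max_by_buffer))
    return degree, seq_cost / degree


best = phase1(SEQUENTIAL_PLANS)
degree, parallel_cost = phase2(SEQUENTIAL_PLANS[best], free_processors=4,
                               available_buffer=64, buffer_per_worker=32)
print(best, degree, parallel_cost)
```

Only the single plan returned by `phase1` reaches `phase2`, which is why the second phase never revisits the join order.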

Page 11: Efficient Query Optimization for Distributed Join in Database Federation

Existing Solution – Two-Phase Algorithm (Cont.)

Benefits:
◦ Phase 1 has the same plan space as a System-R-style algorithm, but only one plan is explored in Phase 2
◦ Can deal with parameters that are unknown at compile time

Problems in applying it to database federations:
◦ Communication cost is not considered
◦ Exhaustive search in Phase 2 is still expensive for a large number of data sources

Page 12: Efficient Query Optimization for Distributed Join in Database Federation

Proposed Work

Page 13: Efficient Query Optimization for Distributed Join in Database Federation

Important Observation
◦ Many national-scale or global-scale data federations are built on networks that consist of both broad LAN paths and narrow long-haul paths.
◦ Many highly integrated systems have to access data through a large number of databases that belong to multiple different organizations.

Page 14: Efficient Query Optimization for Distributed Join in Database Federation

Cluster-and-Conquer
◦ Consider all data resources in the database federation as a set of several clusters of sites
◦ Design two layers of mediators to schedule the query plan cooperatively: Global Mediator + Cluster Mediator

[Figure: Global Mediator with Clusters 1, 2 and 4 to 13]

Page 15: Efficient Query Optimization for Distributed Join in Database Federation

Architecture

• System-R-style algorithm
• Runs at compile time
• Considers all tables as being stored in a clustered fashion
• Decides inter-cluster operations

• Schedules the optimal plan found by the optimizer in a distributed and parallelized way
• Assigns each sub-plan to the corresponding cluster

• Considers run-time conditions and static physical designs
• Finds an intra-cluster optimal plan
• Every cluster mediator functions independently and potentially in parallel

Page 16: Efficient Query Optimization for Distributed Join in Database Federation

Cost Model and Optimization Goal

Cost Model

Optimization Goal
◦ To find the distributed join schedule plan with minimum cost.

Page 17: Efficient Query Optimization for Distributed Join in Database Federation

Problem Definition

Run-time parameters:
◦ Available buffer size
◦ CPU utilization

Parallelism:
◦ Partitioned parallelism
◦ Pipelined parallelism

Reasons: partitioning the input data is often not feasible; in bushy plans it is common to have two operations that do not use each other’s output

Independent parallelism

Page 18: Efficient Query Optimization for Distributed Join in Database Federation

Optimization Algorithm

E.g.
SELECT *
FROM S1.t1, S2.t2, S5.t7, S1.t2, S6.t5, S2.t3
WHERE S1.t1.CustomerID = S2.t2.CustomerID
  AND S2.t2.SupplierID = S5.t7.SupplierID
  AND S5.t7.ItemID = S6.t5.ItemID
  AND S6.t5.Country = S1.t2.Country
  AND S1.t2.Year = S2.t3.Year

Global Mediator

Clustered view

Physical design info: B(R), T(R), V(R.attr), ……

Rule 1: only determine inter-cluster operations

Rule 2: plans that join two relations in distinct clusters are eliminated
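One way to picture the global mediator's clustered view of the example query is to separate its join predicates into those that stay inside one cluster and those that cross clusters. The sketch below does only that. The site-to-cluster assignment in `CLUSTER_OF_SITE` is hypothetical (the slide does not state one), and the helper names are invented for illustration; this is not the proposal's optimization algorithm.

```python
# Join predicates of the example query as (left table, right table) pairs.
JOINS = [
    ("S1.t1", "S2.t2"),  # CustomerID
    ("S2.t2", "S5.t7"),  # SupplierID
    ("S5.t7", "S6.t5"),  # ItemID
    ("S6.t5", "S1.t2"),  # Country
    ("S1.t2", "S2.t3"),  # Year
]

# Hypothetical assignment of sites to clusters (assumption, not from the slide).
CLUSTER_OF_SITE = {"S1": "C1", "S2": "C1", "S5": "C2", "S6": "C2"}


def cluster_of(table):
    """Return the cluster of a table named like 'S1.t1' (site.table)."""
    site = table.split(".")[0]
    return CLUSTER_OF_SITE[site]


def split_joins(joins):
    """Separate joins into per-cluster (intra) joins and inter-cluster joins."""
    intra, inter = {}, []
    for left, right in joins:
        left_cluster, right_cluster = cluster_of(left), cluster_of(right)
        if left_cluster == right_cluster:
            intra.setdefault(left_cluster, []).append((left, right))
        else:
            inter.append((left, right))
    return intra, inter


intra, inter = split_joins(JOINS)
print(intra)
print(inter)
```

Under this assumed clustering, the joins in `inter` are the ones a global mediator would have to place, while each list in `intra` could be handed to the mediator of that cluster as a sub-plan.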

Page 19: Efficient Query Optimization for Distributed Join in Database Federation

Optimization Algorithm (Cont.)

Cluster 1 Mediator

Sub-plan

Search space:
• Algebraic transformation
• Physical method selection – Available_buffer
• Site selection – CPU_utility (fine-grained operator scheduling)

Run-time conditions: Available_buffer(S1), CPU_utility(S1), ……

Physical design info: B(R), T(R), V(R.attr), ……
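The two run-time-driven choices in the search space can be sketched as simple selection functions. The thresholds in `choose_join_method` are a simplified textbook rule of thumb chosen for illustration, and picking the least-loaded site in `choose_site` is likewise an assumption; the slides name the inputs (Available_buffer, CPU_utility, B(R)) but do not give the decision rules.

```python
def choose_join_method(b_left, b_right, available_buffer):
    """Pick a join method from input sizes B(R) and the available buffer (in blocks).

    Simplified rule of thumb (assumption, not the proposal's cost model).
    """
    if min(b_left, b_right) < available_buffer:
        return "HashJoin"   # smaller input fits in memory: one pass
    if b_left + b_right <= available_buffer ** 2:
        return "SortMerge"  # a two-pass sort-based join is feasible
    return "NestLoop"       # fall back to block nested loops


def choose_site(cpu_utility):
    """Pick the site with the lowest current CPU utility (assumed policy)."""
    return min(cpu_utility, key=cpu_utility.get)


method = choose_join_method(b_left=500, b_right=2000, available_buffer=100)
site = choose_site({"S1": 0.75, "S2": 0.10})
print(method, site)
```

Because both functions read only local, current values, each cluster mediator could evaluate them independently of the others.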

Page 20: Efficient Query Optimization for Distributed Join in Database Federation

Theoretical Analysis
◦ In the global mediator
◦ In the cluster mediators
◦ Comparison with related work

Page 21: Efficient Query Optimization for Distributed Join in Database Federation

Experiment Design

Page 22: Efficient Query Optimization for Distributed Join in Database Federation

That is what I want to do for my Master’s thesis …

Thanks