Efficient Query Optimization for Distributed Join in Database Federation

Post on 23-Feb-2016

49 views 0 download

Tags:

description

Efficient Query Optimization for Distributed Join in Database Federation. A Master’s Thesis Proposal by Di Wang Advisor: Prof. Murali Mani Dec 4, 2008. Outline. Introduction – Query Optimization in Database Federations Architecture and Problem Definition Proposed Work Schedule. - PowerPoint PPT Presentation

Transcript of Efficient Query Optimization for Distributed Join in Database Federation

Efficient Query Optimization for Distributed Join

in Database Federation

A Master’s Thesis Proposalby

Di Wang 

Advisor: Prof. Murali Mani

Dec 4, 2008

OutlineIntroduction – Query Optimization

in Database Federations

Architecture and Problem Definition

Proposed Work

Schedule

Introduction: Need for data integration ◦Various systems -> full picture◦Mergers -> access both resources with a

common interface◦Business partners -> combine data

Multiple Access MethodsMultiple Data Schemas

Introduction to Database Federation

Database Federation is one approach to data integration◦Key performance advantage: efficiently

combine data from multiple sources in a single statement

◦The data sources are federated into a unified middleware, called mediator.

Key Components of Database Federation

Query Rewriter

Cost-Based Optimizer

Query

. . . . . .

Research Issues: •containment algorithms for conjunctive queries,• schema mapping, •capability-based optimization

Cost-based optimization --Closely related to the optimization techniques developed for the distributed database systems

The problem

Things that make us unhappySortMerge on M1

NestLoop on M1M3.R3

M1.R1 M2.R2

Optimizer

M1

M2

M3

Estimated Condition: Available buffer sizes of sites; CPU utility of sites; Network traffics …Statistics: physical designs …

SortMerge on M2

NestLoop on M2M3.R3

M1.R1 M2.R2

Plan 1 Plan 2 HashJoin on M3

SortMerge on M1M3.R3

M1.R1 M2.R2

Plan 3

Run CPU Utility Available Buffer Chosen Plan

Optimal PlanM1 M2 M3 M1 M2 M3

1 25%

25% 25%

B(R1) - - Plan 1 Plan 1

2 75%

10% 25%

> B(R1) > B(R1)

- Plan 1 Plan 2

3 50%

50% 15%

> - > Plan 1 Plan3Need to take run-time conditions into account at optimization time.

Assume: B(R1) < B(R2) < B(R3), B(R1 join R2) < B(R3)

Existing Solution - Parametric Query Optimization Y. E. Ioannidis, et al. Parametric Query Optimization. VLDB

1992. Key idea: To identify several execution plans, each one of

which is optimal for a subset of ALL possible values of the run-time parameters

E.g. Two parameters: Buffer size B = [2, 151]Kind of indexes I = {no_index, clustered_Btree, non_clustered_BTree}

P – possible vectors of values of parameters P = cross product B × I|P| = 150*3 = 450

The optimization problem: p P , to find the plan s0 in that plan space S that satisfies the condition:

is static parameters, c( ) is the cost function

Existing Solution - Parametric Query Optimization (Cont.)

Efficient exploration algorithm – Randomized Algorithm

Justification for using parametric query optimizationRelative cost

Buffer size

Problems of the implementation in distributed database• Site selection + algebraic transformation + physical method selection• Much more combinations of run-time parameters

Existing Solution – Two-Phase Algorithm

W. Hong, et al. Optimization of Parallel Query Execution Plans in XPRS. PDIS,1991.

Developed for a parallel database based on a share-memory multiprocessor

Phase 1: find the optimal sequential plan assuming the entire buffer pool is available

Phase 2: find the optimal parallelization of the optimal sequential plan, considering run-time available buffer size & # of free processors

Benefits:

◦ Phase 1 has the same plan space as a System-R-style algorithm, but only one plan is explored in Phase 2

◦ Capability of dealing with compile-time unknown parameters

Problems for applying in database federations:◦ Communication cost was not considered◦ Exhaustive search in phase 2 is still expensive

for large scale of data sources

Existing Solution – Two-Phase Algorithm(Cont.)

Proposed Work

Important Observation many national-scale or global-scale data federations are

built on the networks which consist of both broad, LAN paths and narrow, long-haul paths.

many highly-integrated systems have to access data through a great deal of databases that belong to multiple different organizations.

Cluster-and-Conquer consider all data resources in the database federation

as a set of several clusters of sites

design two layers of mediators to schedule the query plan cooperatively:◦ Global Mediator + Cluster Mediator

Cluster 2Cluster 1

Cluster 8

Cluster 4

Cluster 5 Cluster 6Cluster 7

Cluster9

Cluster 11

Cluster10Cluster12

Cluster13Global

Mediator

Architecture•System-R style algorithm•performs at compiling time •considers all the tables as being stored in the clustered fashion• decide inter-cluster operations

•schedules the optimal plan found by the optimizer in a distributed and parallelized way •assigns each sub-plan to the corresponding cluster

•Consider run-time conditions & static physical designs•Find a intra-cluster optimal plan•Every cluster mediator functions independently and potentially in parallel

Cost Model and Optimization Goal

Cost Model

Optimization Goal◦to find the distributed join schedule

plan with minimum cost.

Problem DefinitionRun-time parameters:

◦Available buffer size◦CPU utilization

Parallelism:◦ Partitioned parallelism◦ Pipelined parallelism

Reasons: input data partition is not often feasible ;in bushy plans it is common to have two operations that do not each other’s output

Independent parallelism

Optimization Algorithm

E.g. SELECT * FROM S1.t1, S2.t2, S5.t7, S1.t2, S6.t5, S2.t3 WHERE S1.t1.CustomerID = S2. t2. CustomerID AND S2.t2. SupplierID = S5.t7.SupplierID AND S5.t7.ItemID = S6.t5. ItemID AND S6.t5.Country = S1.t2.Country AND

S1.t2.Year = S2.t3.Year

Global Mediator

Clustered view

Physical design info:B(R), T(R), V(R.attr), ……

Rule 1: only determine inter-cluster operations

Rule 2: plans that join two relations in distinct clusters are eliminated

Optimization Algorithm (Cont.)

Cluster

1Mediato

r

Sub-plan

Search space:•Algebraic transform

•Physical method selection – Available_buffer

•Site selection – CPU_utility (fine grain operator scheduling)

Run-time conditions:Available_buffer(S1), CPU_utility(S1), ……

Physical design info:B(R), T(R), V(R.attr), ……

Theoretical AnalysisIn global mediator

In cluster mediator

Compare to related works

Experiment Design

That is what I want to do for my Master

Thesis …

Thanks