Efficient Query Optimization for Distributed Join in Database Federation

A Master’s Thesis Proposal by

Di Wang

Advisor: Prof. Murali Mani

Dec 4, 2008

Outline

Introduction – Query Optimization in Database Federations

Architecture and Problem Definition

Proposed Work

Schedule

Introduction: Need for data integration
◦ Various systems -> full picture
◦ Mergers -> access both resources with a common interface
◦ Business partners -> combine data

Multiple Access Methods
Multiple Data Schemas

Introduction to Database Federation

Database Federation is one approach to data integration
◦ Key performance advantage: efficiently combine data from multiple sources in a single statement
◦ The data sources are federated into a unified middleware, called the mediator.

Key Components of Database Federation

Query Rewriter

Cost-Based Optimizer

. . . . . .

Research Issues:
• containment algorithms for conjunctive queries
• schema mapping
• capability-based optimization

Cost-based optimization: closely related to the optimization techniques developed for distributed database systems

The problem

Things that make us unhappy

Optimizer inputs:
Estimated conditions: available buffer sizes of sites; CPU utilization of sites; network traffic …
Statistics: physical designs …

Plan 1:
SortMerge on M1
    NestLoop on M1
        M1.R1
        M2.R2
    M3.R3

Plan 2:
SortMerge on M2
    NestLoop on M2
        M1.R1
        M2.R2
    M3.R3

Plan 3:
HashJoin on M3
    SortMerge on M1
        M1.R1
        M2.R2
    M3.R3

Run   CPU Utilization (M1 M2 M3)   Available Buffer (M1 M2 M3)   Chosen Plan   Optimal Plan
1     25% 25%                      B(R1) - -                     Plan 1        Plan 1
2     10% 25%                      > B(R1) > B(R1) -             Plan 1        Plan 2
3     50% 15%                      > - >                         Plan 1        Plan 3

Need to take run-time conditions into account at optimization time.

Assume: B(R1) < B(R2) < B(R3), B(R1 join R2) < B(R3)

Existing Solution - Parametric Query Optimization

Y. E. Ioannidis, et al. Parametric Query Optimization. VLDB 1992.

Key idea: identify several execution plans, each of which is optimal for a subset of all possible values of the run-time parameters.

E.g. two parameters:
Buffer size B = [2, 151]
Kind of indexes I = {no_index, clustered_Btree, non_clustered_BTree}

P – possible vectors of parameter values
P = cross product B × I
|P| = 150 * 3 = 450

The optimization problem: for every p ∈ P, find the plan s0 in the plan space S that has the minimum cost at p, where c( ) is the cost function; the cost also depends on the static parameters.
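The per-parameter optimization can be illustrated with a minimal Python sketch. The parameter domains B and I come from the example on this slide; the plan names, the relation sizes, and the toy cost function `cost` are hypothetical assumptions made only for illustration and are not the cost model of the cited paper. The sketch enumerates P exhaustively, whereas the paper relies on a more efficient exploration strategy.

```python
from itertools import product

# Run-time parameter domains from the slide example.
BUFFER_SIZES = range(2, 152)  # B = [2, 151], i.e. 150 values
INDEX_KINDS = ("no_index", "clustered_Btree", "non_clustered_BTree")

# Hypothetical plan space S (assumption, for illustration only).
PLANS = ("nested_loop", "sort_merge", "index_join")


def cost(plan, p, pages_r=100, pages_s=1000, probes=800):
    """Toy cost function c(s, p) in page I/Os (assumed, not from the paper)."""
    buffer_size, index_kind = p
    if plan == "nested_loop":
        # Block nested loop: the smaller input is read in chunks of (buffer - 1) pages.
        chunks = -(-pages_r // (buffer_size - 1))  # ceiling division
        return pages_r + chunks * pages_s
    if plan == "sort_merge":
        return 3 * (pages_r + pages_s)
    # index_join is only usable when some index exists.
    if index_kind == "no_index":
        return float("inf")
    per_probe = 3 if index_kind == "clustered_Btree" else 5
    return pages_r + probes * per_probe


def parametric_optimize(plans, parameter_space):
    """For every parameter vector p, keep the plan s0 with minimum c(s0, p)."""
    return {p: min(plans, key=lambda s: cost(s, p)) for p in parameter_space}


P = list(product(BUFFER_SIZES, INDEX_KINDS))  # P = B x I
optimal = parametric_optimize(PLANS, P)
```

At run time the actual parameter vector is looked up in `optimal`, so no re-optimization is needed once the real buffer size and index kind are known.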

Existing Solution - Parametric Query Optimization (Cont.)

Efficient exploration algorithm – Randomized Algorithm

Justification for using parametric query optimization
[Figure: relative cost versus buffer size]

Problems of the implementation in distributed databases
• Site selection + algebraic transformation + physical method selection
• Many more combinations of run-time parameters

Existing Solution – Two-Phase Algorithm

W. Hong, et al. Optimization of Parallel Query Execution Plans in XPRS. PDIS, 1991.

Developed for a parallel database based on a shared-memory multiprocessor

Phase 1: find the optimal sequential plan assuming the entire buffer pool is available

Phase 2: find the optimal parallelization of the optimal sequential plan, considering run-time available buffer size & # of free processors
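The two phases can be sketched roughly as follows. This is a toy model under stated assumptions, not the XPRS implementation: the `SequentialPlan` fields, the linear speed-up, and the rule that each parallel instance needs `min_buffer` pages are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequentialPlan:
    name: str
    io_pages: int    # estimated cost when the entire buffer pool is available
    min_buffer: int  # buffer pages one running instance needs (assumption)


def phase1(plans):
    """Compile time: pick the cheapest sequential plan, assuming the full buffer pool."""
    return min(plans, key=lambda plan: plan.io_pages)


def phase2(plan, available_buffer, free_processors):
    """Run time: choose a degree of parallelism for the single plan from phase 1."""
    by_memory = available_buffer // plan.min_buffer
    degree = max(1, min(free_processors, by_memory))
    # Idealized linear speed-up; a real optimizer would use a finer cost model.
    return degree, plan.io_pages / degree


candidates = [
    SequentialPlan("hash_join_plan", io_pages=3300, min_buffer=40),
    SequentialPlan("nested_loop_plan", io_pages=5000, min_buffer=10),
]
best = phase1(candidates)
degree, estimated_cost = phase2(best, available_buffer=100, free_processors=4)
```

Only the plan returned by `phase1` is examined in `phase2`, which is what keeps the second phase cheap compared with re-optimizing the whole plan space at run time.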

Benefits:

◦ Phase 1 has the same plan space as a System-R-style algorithm, but only one plan is explored in Phase 2

◦ Capability of dealing with compile-time unknown parameters

Problems for applying in database federations:
◦ Communication cost was not considered
◦ Exhaustive search in phase 2 is still expensive for a large number of data sources

Existing Solution – Two-Phase Algorithm (Cont.)

Proposed Work

Important Observations
◦ Many national-scale or global-scale data federations are built on networks that consist of both broad LAN paths and narrow long-haul paths.
◦ Many highly integrated systems have to access data through a large number of databases that belong to multiple different organizations.

Cluster-and-Conquer
◦ Consider all data resources in the database federation as a set of several clusters of sites
◦ Design two layers of mediators to schedule the query plan cooperatively:
    ◦ Global Mediator + Cluster Mediator

[Figure: clusters of sites, labeled Cluster 1 to Cluster 13, and the Global Mediator]

Architecture

• System-R style algorithm
• Runs at compile time
• Considers all the tables as being stored in a clustered fashion
• Decides inter-cluster operations

• Schedules the optimal plan found by the optimizer in a distributed and parallelized way
• Assigns each sub-plan to the corresponding cluster

• Considers run-time conditions and static physical designs
• Finds an intra-cluster optimal plan
• Every cluster mediator functions independently and potentially in parallel

Cost Model and Optimization Goal

Cost Model

Optimization Goal
◦ To find the distributed join schedule plan with minimum cost.

Problem Definition

Run-time parameters:
◦ Available buffer size
◦ CPU utilization

Parallelism:
◦ Partitioned parallelism
◦ Pipelined parallelism

Reasons: partitioning the input data is often not feasible; in bushy plans it is common to have two operations that do not use each other’s output

Independent parallelism

Optimization Algorithm

E.g.
SELECT *
FROM S1.t1, S2.t2, S5.t7, S1.t2, S6.t5, S2.t3
WHERE S1.t1.CustomerID = S2.t2.CustomerID
  AND S2.t2.SupplierID = S5.t7.SupplierID
  AND S5.t7.ItemID = S6.t5.ItemID
  AND S6.t5.Country = S1.t2.Country
  AND S1.t2.Year = S2.t3.Year

Global Mediator

Clustered view

Physical design info: B(R), T(R), V(R.attr), …

Rule 1: only determine inter-cluster operations

Rule 2: plans that join two relations in distinct clusters are eliminated
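One way to picture the division of labor between the two mediator layers is to split the join predicates of the example query into intra-cluster and inter-cluster groups. The sketch below is only an illustration: the slides do not say which sites belong to which cluster, so the `SITE_TO_CLUSTER` mapping (S1 and S2 in one cluster, S5 and S6 in another) and the helper names are hypothetical.

```python
# Join predicates of the example query: (left table, right table, join attribute).
JOIN_PREDICATES = [
    ("S1.t1", "S2.t2", "CustomerID"),
    ("S2.t2", "S5.t7", "SupplierID"),
    ("S5.t7", "S6.t5", "ItemID"),
    ("S6.t5", "S1.t2", "Country"),
    ("S1.t2", "S2.t3", "Year"),
]

# Hypothetical site-to-cluster assignment (assumption, not given in the slides).
SITE_TO_CLUSTER = {"S1": "C1", "S2": "C1", "S5": "C2", "S6": "C2"}


def cluster_of(table):
    """Map a qualified table name such as 'S1.t1' to the cluster of its site."""
    site = table.split(".")[0]
    return SITE_TO_CLUSTER[site]


def split_predicates(predicates):
    """Separate joins inside one cluster from joins that span two clusters."""
    intra, inter = {}, []
    for left, right, attr in predicates:
        left_cluster, right_cluster = cluster_of(left), cluster_of(right)
        if left_cluster == right_cluster:
            intra.setdefault(left_cluster, []).append((left, right, attr))
        else:
            inter.append((left, right, attr))
    return intra, inter


intra_cluster, inter_cluster = split_predicates(JOIN_PREDICATES)
```

Under this assumed mapping, the global mediator would only need to order the two joins in `inter_cluster`, while each list in `intra_cluster` would be handed to the corresponding cluster mediator.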

Optimization Algorithm (Cont.)

Cluster 1 Mediator

Sub-plan

Search space:
• Algebraic transform
• Physical method selection – Available_buffer
• Site selection – CPU_utility (fine-grain operator scheduling)

Run-time conditions: Available_buffer(S1), CPU_utility(S1), …

Physical design info: B(R), T(R), V(R.attr), …
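A cluster mediator's choice of site and physical join method could look roughly like the sketch below. It is a simplification under stated assumptions: the I/O estimates are common textbook approximations rather than the cost model of this proposal, and the greedy order (least-loaded site first, then cheapest method for that site's buffer) is a hypothetical policy. The function names are illustrative.

```python
import math

METHODS = ("NestLoop", "SortMerge", "HashJoin")


def join_cost(method, b_r, b_s, buffer_pages):
    """Approximate I/O cost in pages; inf when the method does not fit in the buffer."""
    small, large = min(b_r, b_s), max(b_r, b_s)
    if method == "NestLoop":
        if buffer_pages < 2:
            return math.inf
        # Block nested loop: smaller input read in chunks of (buffer - 1) pages.
        chunks = math.ceil(small / (buffer_pages - 1))
        return small + chunks * large
    if method == "SortMerge":
        # Two-pass sort-merge join needs roughly B(R) + B(S) <= M^2.
        return 3 * (b_r + b_s) if b_r + b_s <= buffer_pages ** 2 else math.inf
    if method == "HashJoin":
        # Two-pass hash join needs roughly min(B(R), B(S)) <= M^2.
        return 3 * (b_r + b_s) if small <= buffer_pages ** 2 else math.inf
    raise ValueError(f"unknown join method: {method}")


def choose_site_and_method(b_r, b_s, sites):
    """sites maps a site name to (Available_buffer, CPU_utility)."""
    # Site selection: the least-loaded site by CPU_utility (assumed policy).
    site = min(sites, key=lambda name: sites[name][1])
    available_buffer = sites[site][0]
    # Physical method selection: cheapest method for that site's available buffer.
    method = min(METHODS, key=lambda m: join_cost(m, b_r, b_s, available_buffer))
    return site, method, join_cost(method, b_r, b_s, available_buffer)


choice = choose_site_and_method(100, 1000, {"S1": (200, 0.5), "S2": (20, 0.1)})
```

With these numbers the less loaded site S2 is chosen even though its buffer is small, which pushes the method choice away from the nested loop join; a joint search over site and method might pick differently.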

Theoretical Analysis
◦ In the global mediator
◦ In the cluster mediator

Comparison with related work

Experiment Design

That is what I want to do for my Master’s thesis …

Thanks
