Efficient Query Optimization for Distributed Join in Database Federation
Efficient Query Optimization for Distributed Join
in Database Federation
A Master’s Thesis Proposal by
Di Wang
Advisor: Prof. Murali Mani
Dec 4, 2008
Outline
Introduction – Query Optimization in Database Federations
Architecture and Problem Definition
Proposed Work
Schedule
Introduction: Need for data integration
◦ Various systems -> full picture
◦ Mergers -> access both resources with a common interface
◦ Business partners -> combine data
Multiple Access Methods
Multiple Data Schemas
Introduction to Database Federation
Database Federation is one approach to data integration
◦ Key performance advantage: efficiently combine data from multiple sources in a single statement
◦ The data sources are federated into a unified middleware, called the mediator.
Key Components of Database Federation
Query Rewriter
Cost-Based Optimizer
Research Issues:
• Containment algorithms for conjunctive queries
• Schema mapping
• Capability-based optimization
Cost-based optimization: closely related to the optimization techniques developed for distributed database systems
The problem
Things that make us unhappy

Optimizer inputs
Estimated conditions: available buffer sizes of sites; CPU utilization of sites; network traffic …
Statistics: physical designs …
Sites: M1, M2, M3

Candidate plans
Plan 1: SortMerge on M1 ( NestLoop on M1 ( M1.R1, M2.R2 ), M3.R3 )
Plan 2: SortMerge on M2 ( NestLoop on M2 ( M1.R1, M2.R2 ), M3.R3 )
Plan 3: HashJoin on M3 ( SortMerge on M1 ( M1.R1, M2.R2 ), M3.R3 )

Run   CPU utilization      Available buffer              Chosen plan   Optimal plan
      M1    M2    M3       M1        M2        M3
1     25%   25%   25%      B(R1)     -         -         Plan 1        Plan 1
2     75%   10%   25%      > B(R1)   > B(R1)   -         Plan 1        Plan 2
3     50%   50%   15%      >         -         >         Plan 1        Plan 3

Need to take run-time conditions into account at optimization time.
Assume: B(R1) < B(R2) < B(R3), B(R1 join R2) < B(R3)
Existing Solution - Parametric Query Optimization
Y. E. Ioannidis, et al. Parametric Query Optimization. VLDB 1992.
Key idea: identify several execution plans, each of which is optimal for a subset of ALL possible values of the run-time parameters
E.g. two parameters:
◦ Buffer size B = [2, 151]
◦ Kind of indexes I = {no_index, clustered_Btree, non_clustered_BTree}
P – possible vectors of values of parameters
P = cross product B × I
|P| = 150 * 3 = 450
The optimization problem: for every $p \in P$, find the plan $s_0$ in the plan space $S$ that satisfies the condition
$c(s_0, p) = \min_{s \in S} c(s, p)$,
where $c(\cdot)$ is the cost function, evaluated with the static parameters held fixed.
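The parameter space and the per-parameter optimum can be sketched in a few lines of Python. This is only an illustration: the three plan names and the cost formulas in `cost` are made up for the example, and the search is a plain exhaustive enumeration rather than the randomized exploration used in the cited paper.

```python
from itertools import product

# Run-time parameters from the example: B = [2, 151] and three index kinds.
buffer_sizes = range(2, 152)
index_kinds = ["no_index", "clustered_Btree", "non_clustered_BTree"]

# P is the cross product B x I, so |P| = 150 * 3 = 450.
P = list(product(buffer_sizes, index_kinds))

# Hypothetical plan space S (names are assumptions, not from the slides).
S = ["nested_loop", "sort_merge", "index_join"]

def cost(plan, p):
    """Made-up cost function c(s, p); real costs would come from a cost model."""
    buf, index = p
    if plan == "nested_loop":
        return 1000 + 20000 / buf
    if plan == "sort_merge":
        return 3000 if buf >= 30 else 9000
    if plan == "index_join":
        per_index = {
            "no_index": float("inf"),
            "clustered_Btree": 800,
            "non_clustered_BTree": 2500,
        }
        return per_index[index]
    raise ValueError(plan)

def parametric_optimize(plans, params):
    """For every p, pick the plan s0 with c(s0, p) = min over s of c(s, p)."""
    return {p: min(plans, key=lambda s: cost(s, p)) for p in params}

best = parametric_optimize(S, P)

# Group parameter vectors by the plan that is optimal for them.
regions = {}
for p, plan in best.items():
    regions.setdefault(plan, []).append(p)
```

At run time the optimizer would only need to look up the actual parameter vector in `best`, which is the point of doing the work ahead of time.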
Existing Solution - Parametric Query Optimization (Cont.)
Efficient exploration algorithm – Randomized Algorithm
Justification for using parametric query optimization
[Figure: relative cost vs. buffer size]
Problems of implementing it in a distributed database
• Site selection + algebraic transformation + physical method selection
• Many more combinations of run-time parameters
Existing Solution – Two-Phase Algorithm
W. Hong, et al. Optimization of Parallel Query Execution Plans in XPRS. PDIS,1991.
Developed for a parallel database based on a shared-memory multiprocessor
Phase 1: find the optimal sequential plan assuming the entire buffer pool is available
Phase 2: find the optimal parallelization of the optimal sequential plan, considering run-time available buffer size & # of free processors
Benefits:
◦ Phase 1 has the same plan space as a System-R-style algorithm, but only one plan is explored in Phase 2
◦ Capability of dealing with compile-time unknown parameters
Problems for applying in database federations:
◦ Communication cost was not considered
◦ Exhaustive search in phase 2 is still expensive for a large number of data sources
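The two phases can be sketched as follows. This is a toy model, not the XPRS implementation: the plan dictionaries, the `buffer_per_worker` field, and the `startup_overhead` term are all assumptions made for illustration.

```python
def phase1(plans):
    """Phase 1: pick the cheapest sequential plan, assuming the whole buffer pool is free."""
    return min(plans, key=lambda plan: plan["seq_cost"])

def phase2(plan, free_processors, available_buffer, startup_overhead=5.0):
    """Phase 2: search the degrees of parallelism for the single plan from phase 1.

    Hypothetical model: each worker needs plan["buffer_per_worker"] buffer pages,
    and each extra worker adds a fixed startup overhead.
    """
    best_degree = 1
    best_time = plan["seq_cost"] + startup_overhead
    for degree in range(2, free_processors + 1):
        if degree * plan["buffer_per_worker"] > available_buffer:
            break  # not enough run-time buffer for this many workers
        time = plan["seq_cost"] / degree + startup_overhead * degree
        if time < best_time:
            best_degree, best_time = degree, time
    return best_degree, best_time

# Made-up candidate plans.
plans = [
    {"name": "A", "seq_cost": 400, "buffer_per_worker": 10},
    {"name": "B", "seq_cost": 600, "buffer_per_worker": 5},
]
chosen = phase1(plans)
degree, estimated_time = phase2(chosen, free_processors=8, available_buffer=40)
```

Only `chosen` is parallelized, which is why phase 2 stays cheap on one machine. The sketch has no term for communication cost, which is the gap noted above for database federations.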
Existing Solution – Two-Phase Algorithm (Cont.)
Proposed Work
Important Observation
• Many national-scale or global-scale data federations are built on networks that consist of both broad LAN paths and narrow long-haul paths.
• Many highly integrated systems have to access data through a great many databases that belong to multiple different organizations.
Cluster-and-Conquer
• Consider all data resources in the database federation as a set of several clusters of sites
• Design two layers of mediators to schedule the query plan cooperatively:
◦ Global Mediator + Cluster Mediator
[Figure: sites grouped into clusters (Cluster 1, Cluster 2, Cluster 4 through Cluster 13), with the Global Mediator]
Architecture
• System-R style algorithm
• Runs at compile time
• Considers all the tables as being stored in a clustered fashion
• Decides inter-cluster operations

• Schedules the optimal plan found by the optimizer in a distributed and parallelized way
• Assigns each sub-plan to the corresponding cluster

• Considers run-time conditions and static physical designs
• Finds an intra-cluster optimal plan
• Every cluster mediator functions independently and potentially in parallel
Cost Model and Optimization Goal
Cost Model
Optimization Goal
◦ To find the distributed join schedule plan with minimum cost.
Problem Definition
Run-time parameters:
◦ Available buffer size
◦ CPU utilization
Parallelism:
◦ Partitioned parallelism
◦ Pipelined parallelism
Reasons: partitioning the input data is often not feasible; in bushy plans it is common to have two operations that do not use each other’s output
Independent parallelism
Optimization Algorithm
E.g.
SELECT *
FROM S1.t1, S2.t2, S5.t7, S1.t2, S6.t5, S2.t3
WHERE S1.t1.CustomerID = S2.t2.CustomerID
  AND S2.t2.SupplierID = S5.t7.SupplierID
  AND S5.t7.ItemID = S6.t5.ItemID
  AND S6.t5.Country = S1.t2.Country
  AND S1.t2.Year = S2.t3.Year
Global Mediator
Clustered view
Physical design info: B(R), T(R), V(R.attr), ……
Rule 1: only determine inter-cluster operations
Rule 2: plans that join two relations in distinct clusters are eliminated
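One way to read Rule 1 is that the join predicates of the example query are split into intra-cluster joins, left to the cluster mediators, and inter-cluster joins, decided by the global mediator. The sketch below shows that split. The site-to-cluster assignment in `SITE_CLUSTER` is hypothetical, since the slides do not say which sites share a cluster, and the helper names are assumptions.

```python
# Hypothetical clustering of the sites used by the example query.
SITE_CLUSTER = {"S1": "C1", "S2": "C1", "S5": "C2", "S6": "C2"}

# Join predicates of the example query: (left relation, right relation, attribute).
JOINS = [
    ("S1.t1", "S2.t2", "CustomerID"),
    ("S2.t2", "S5.t7", "SupplierID"),
    ("S5.t7", "S6.t5", "ItemID"),
    ("S6.t5", "S1.t2", "Country"),
    ("S1.t2", "S2.t3", "Year"),
]

def cluster_of(relation):
    """Map a relation such as 'S1.t1' to the cluster of its site."""
    site = relation.split(".")[0]
    return SITE_CLUSTER[site]

def split_joins(joins):
    """Separate joins inside one cluster from joins that cross clusters."""
    intra, inter = {}, []
    for left, right, attr in joins:
        left_cluster, right_cluster = cluster_of(left), cluster_of(right)
        if left_cluster == right_cluster:
            intra.setdefault(left_cluster, []).append((left, right, attr))
        else:
            inter.append((left, right, attr))
    return intra, inter

intra_cluster, inter_cluster = split_joins(JOINS)
```

Under this assumed clustering, the CustomerID and Year joins stay inside C1, the ItemID join stays inside C2, and the SupplierID and Country joins are the inter-cluster operations.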
Optimization Algorithm (Cont.)
Cluster 1 Mediator
Sub-plan
Search space:
• Algebraic transform
• Physical method selection – Available_buffer
• Site selection – CPU_utility (fine-grained operator scheduling)
Run-time conditions: Available_buffer(S1), CPU_utility(S1), ……
Physical design info: B(R), T(R), V(R.attr), ……
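A minimal sketch of how a cluster mediator might use these inputs: the available buffer drives the choice of join method, and CPU utilization drives site selection. The thresholds are common textbook approximations for join memory requirements, not the proposal's cost model, and the function names are assumptions.

```python
def pick_join_method(b_r, b_s, available_buffer):
    """Choose a join method from the block counts B(R), B(S) and the buffer size in blocks.

    Assumed rules of thumb: a one-pass join needs the smaller input to fit in
    memory, a two-pass hash join needs it to fit in roughly buffer^2 blocks,
    and a block nested-loop join works with any buffer size.
    """
    smaller = min(b_r, b_s)
    if smaller <= available_buffer - 1:
        return "one_pass_hash_join"
    if smaller <= available_buffer ** 2:
        return "two_pass_hash_join"
    return "block_nested_loop"

def pick_site(cpu_utilization):
    """Choose the site with the lowest current CPU utilization."""
    return min(cpu_utilization, key=cpu_utilization.get)

# Made-up run-time conditions for three sites in one cluster.
cpu = {"S1": 0.75, "S2": 0.10, "S3": 0.25}
method = pick_join_method(b_r=1000, b_s=5000, available_buffer=50)
site = pick_site(cpu)
```

Because both inputs are run-time values, the same sub-plan can get a different method or site on each execution.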
Theoretical Analysis
In global mediator
In cluster mediator
Compare to related works
Experiment Design
That is what I want to do for my Master’s Thesis …
Thanks