
An Adaptive Distributed Query Processing Grid Service
F. Porto, V.F.V. da Silva, M.L. Dutra, B. Schulze

Proc. VLDB Workshop on Data Management in Grids, VLDB, LNCS 3836, Trondheim, Norway, 2-3 September 2005

Course: Data Grids
Prof.: Jean-Marc Pierson, Lionel Brunie

Presentation date: 01/02/2006
Student: Sammarco Aniello

Slide N.:2

PLAN

1- INTRODUCTION
2- ABSTRACT DB
3- ARCHITECTURE
4- QUERY PROCESSING
5- Grid Greedy Node (G2N) algorithm
6- Query Execution Engine Framework
7- INITIAL RESULT
8- CONCLUSION

OBJECTIVES

SOLUTIONS

RESULT

Slide N.:3

PROJECT CoDIMS (Configurable Data Integration Middleware)

It is a distributed grid service for the evaluation of scientific queries. The design of CoDIMS-G focused on conceiving efficient and adaptable query evaluation strategies for the grid environment.

TESTBED: It supports the pre-processing stage of a scientific visualization application (SVA) at the National Laboratory of Scientific Computing (LNCC), Brazil.


Slide N.:4

PROJECT CoDIMS-G

OBJECTIVES

(1) Dynamic scheduling and allocation of query execution engine modules onto grid nodes

(2) Adaptability of query execution to variations in environment conditions

(3) Support for special scientific operations

Slide N.:5

PROJECT CoDIMS-G

SOLUTIONS

Using the processing power available in a grid environment may substantially reduce the time needed for pre-processing virtual particle trajectories.

(1) A new node scheduling algorithm that selects grid nodes for parallel evaluation

(2) An extension of the Eddy operator

Slide N.:6

PROJECT CoDIMS-G

RESULT

Reduction of the scheduling time

Slide N.:7

PROJECT CoDIMS-G

FOCUS ON ADAPTIVE PROBLEM

To adapt the execution of an application to the changing conditions of selected grid nodes. The problem in this context is to identify points where execution may be interrupted in a node and restarted in other nodes.

Slide N.:8

ABSTRACT DB

The Geometry relation stores data associated with polyhedron geometry:

Geometry (id, time-instant, polyhedron<point>, velocity<point-velocity>);

The Particle relation holds the initial particle position:

Particle (part-id, time-instant, point)

The Resulting-vector user program computes a resulting speed vector at a specific position of the flow path:

Resulting-vector (position, polyhedron<point>, velocity<point-velocity>): velocity

The Trajectory Computing Program (TCP) computes a VP's subsequent position:

TCP (particle-id, position, velocity): new-position

The velocity relation corresponds to the velocity vectors for each time instant.
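As a rough illustration, the relations and program signatures above could be mirrored in Python as follows. This is only a sketch: the slides give the signatures but not the computations, so the bodies of `resulting_vector` (a plain average of the vertex velocities) and `tcp` (one explicit Euler step with a hypothetical `dt` parameter) are assumptions, not the programs used in CoDIMS-G.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Geometry:
    """Geometry (id, time-instant, polyhedron<point>, velocity<point-velocity>)."""
    id: int
    time_instant: int
    polyhedron: List[Point]   # vertices of the polyhedron
    velocity: List[Point]     # one velocity vector per vertex


@dataclass
class Particle:
    """Particle (part-id, time-instant, point)."""
    part_id: int
    time_instant: int
    point: Point


def resulting_vector(position: Point, polyhedron: List[Point],
                     velocity: List[Point]) -> Point:
    # ASSUMPTION: average the vertex velocities; the real program may
    # interpolate according to the position inside the polyhedron.
    n = len(velocity)
    return tuple(sum(v[i] for v in velocity) / n for i in range(3))


def tcp(particle_id: int, position: Point, velocity: Point,
        dt: float = 1.0) -> Point:
    # ASSUMPTION: a single explicit Euler step; dt is a hypothetical parameter.
    return tuple(p + v * dt for p, v in zip(position, velocity))
```

A particle's next position is then obtained by feeding the output of `resulting_vector` into `tcp`, which is the dependency the query plan has to respect.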

Slide N.:9

ARCHITECTURE OF CoDIMS-G

Client Interface: User requests are forwarded to the Control component.

Control Component: The essence of the CoDIMS environment. It stores, manages, validates and verifies an instance configuration, and sends user requests to the query processing system.

Parser Component: The Parser transforms the user requests into a query graph (QG) representation.

Query Optimizer (QO): Receives the graph and generates a physical distributed query execution plan (DQEP) using a cost model based on data and program statistics stored in the Metadata Manager (MM).

Scheduler Component (SC): The optimizer calls the Scheduler, which indicates the set of interesting nodes to be allocated for the parallelized operator. The scheduler and the optimizer cooperate to generate an initial distributed parallel query execution plan (DQEP).

Query Engines (QE 1, QE 2, ..., QE n): A QE is the component where actual query execution takes place. Instances of QE are instantiated on the scheduled grid nodes. Each QE receives a fragment of the DQEP and is responsible for controlling its execution.

Query Execution Manager (QEM): The QEM is responsible for deploying the query execution engine (QE) services at the nodes specified in the DQEP and managing their life cycle during query execution. The QEM manages the QEs' real-time performance.

Slide N.:10

DISTRIBUTED QUERY PROCESSING

We express a query as a query graph QG, defined as a partially ordered set of operators $QG = \{\Omega, \prec\}$, where $\Omega$ is a set of algebraic operators and $\prec$ is a set of dependency relations: if $w_1 \prec w_2$, with $w_1, w_2 \in \Omega$ and $w_1 \neq w_2$, then $w_2$ succeeds $w_1$ in a bottom-up navigation of the DQEP, and not $(w_2 \prec w_1)$.

The optimization algorithm explores the search space of valid plans, in accordance with data dependency restrictions. It considers all valid execution orders of expensive operators allowed by the QG edges.
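To make the idea of "valid execution orders" concrete, here is a minimal sketch that enumerates every ordering of a set of operators that respects a set of dependency pairs (w1, w2), meaning w1 must run before w2. The function name and the operator names in the usage note are hypothetical, and the brute-force enumeration is chosen for clarity only; it is not the optimizer's actual search procedure.

```python
from itertools import permutations


def valid_orders(operators, precedes):
    """Return all orderings of `operators` consistent with `precedes`.

    `precedes` is a collection of pairs (w1, w2): w1 must appear before w2.
    Brute force over all permutations, so only suitable for a handful of
    operators.
    """
    orders = []
    for perm in permutations(operators):
        position = {op: i for i, op in enumerate(perm)}
        if all(position[w1] < position[w2] for w1, w2 in precedes):
            orders.append(perm)
    return orders
```

For example, with operators `["scan", "apply_a", "apply_b"]` and dependencies `{("scan", "apply_a"), ("scan", "apply_b")}`, the two expensive operators can run in either order after the scan, so two valid orders are returned.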


Slide N.:11

DISTRIBUTED QUERY PROCESSING

ALTERNATIVES

(a) no parallelization
(b) scheduling according to the G2N (Grid Greedy Node) algorithm
(c) adoption of the same parallelization strategy used by the previous operator in the query execution plan

For each computed query execution plan, a cost is associated, using a parallel pipeline cost function. The DQEP presenting the lowest cost is selected for execution.

Slide N.:12

DISTRIBUTED QUERY PROCESSING

WHY

This strategy guarantees that costly programs only get invoked when all predicates have been evaluated, potentially reducing the number of tuples to be processed by them.

Slide N.:13

IMPLEMENTATION

Grid Greedy Node (G2N) algorithm

G2N (throughput(tp1, tp2, …, tpn), number-tasks): result
    nodelist := descending order(throughput);
    result := result ∪ {nodelist(1)};
    cost(1) := number-tasks * nodelist(1);
    current-cost := cost(1);
    While (nodes in the list and add-new-node)
        total-cost := current-cost;
        new-node := next-node in nodelist;
        While (current-cost <= total-cost)
            move tuples from lowest node in result to new-node;
            Update costs of nodes and total-cost;
            If current-cost > total-cost
                If we could move at least 1 tuple to the new-node
                    result := result ∪ {new-node}
                else
                    add-new-node := false;
                    Stop loop;
        endwhile
    endwhile
    output result;

The G2N algorithm receives a set of available nodes with their corresponding average throughput (tp1, tp2, …, tpn), measured in tuples per second, and the total estimated number of tasks (T) to be evaluated.

The algorithm sorts the list of available grid nodes in decreasing order of their average throughput values. It then allocates all T tuples to the fastest node.

The main loop moves tuples from the already allocated nodes to a new grid node. It continues while this produces a new evaluation estimate that reduces the query elapsed time, until the current elapsed time becomes higher than the last one computed. If no tuple could be moved to the new node, the algorithm stops and outputs the grid nodes accepted so far.

OUTPUT: Feeds the Query Optimizer for the initial query execution plan and for the re-scheduling of allocated nodes in the face of variations in estimated values.
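The greedy idea of G2N can be sketched in Python as below. This is an interpretation, not the authors' implementation, and it rests on several assumptions: the cost of a node is taken as its allocated tuples divided by its throughput (since throughput is given in tuples per second), the elapsed-time estimate is the finishing time of the slowest-finishing node, tuples are moved one at a time, and the donor is the accepted node that would finish last. The names `g2n` and `makespan` are hypothetical.

```python
def makespan(alloc, throughputs):
    """Estimated elapsed time: the node that finishes last sets the total."""
    return max(alloc[n] / throughputs[n] for n in alloc)


def g2n(throughputs, num_tasks):
    """Greedy node selection; returns {node: tuples allocated}.

    `throughputs` maps node name -> tuples per second (assumed positive).
    """
    # Sort nodes from fastest to slowest and give everything to the fastest.
    nodes = sorted(throughputs, key=throughputs.get, reverse=True)
    alloc = {nodes[0]: num_tasks}
    elapsed = makespan(alloc, throughputs)

    for new_node in nodes[1:]:
        trial = dict(alloc)
        trial[new_node] = 0
        best, moved = elapsed, 0
        while True:
            # Donor: the already accepted node with the latest finishing time.
            donor = max(alloc, key=lambda n: trial[n] / throughputs[n])
            if trial[donor] == 0:
                break
            trial[donor] -= 1
            trial[new_node] += 1
            estimate = makespan(trial, throughputs)
            if estimate > best:
                # The move made things worse: undo it and stop moving tuples.
                trial[donor] += 1
                trial[new_node] -= 1
                break
            best, moved = estimate, moved + 1
        if moved == 0:
            break  # the new node could not take a single tuple: stop here
        alloc, elapsed = trial, best

    # Drop nodes that ended up with no tuples.
    return {n: t for n, t in alloc.items() if t > 0}
```

For instance, `g2n({"a": 4.0, "b": 2.0, "c": 0.001}, 6)` keeps nodes `a` and `b` with 4 and 2 tuples respectively, and rejects `c`, because moving even one tuple to such a slow node would increase the estimated elapsed time.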

Slide N.:14

ADAPTIVE QUERY EXECUTION - QEEF

Query Execution Engines (QEE) support the execution of traditional queries.

QEEF (Query Execution Engine Framework): an extensible QEE, adaptable to new execution models, that implements each execution model as a combination of execution modules.

Slide N.:15

ADAPTIVE QUERY EXECUTION - QEEF

SIMULATION

[Figure: query execution plan diagram composed of Eddy, SPLIT, MERGE, SEND and RECEIVE operators]

Slide N.:16

ADAPTIVE QUERY EXECUTION - QEEF

ANALYSIS ON BLOCK SIZE

Block size is an important tool for building adaptivity into the system. Eddy modifies a remote node's block size in the following scenarios:

1- Timeout (estimated time)
2- Eddy performs a local adaptation (checking current throughput values)
3- Variations in the scheduled nodes
4- When 2/3 of the tuples have been evaluated:
   - dataflow reduced
   - Eddy recomputes the number of scheduled nodes
   - increase in the number of tuples in each node

Slide N.:17

SCIENTIFIC APPLICATIONS

The QEEF framework has been extended with:
- execution of user programs (Apply operator strategy)
- spatial and temporal hash joins (implementing the iterator interface)
- loop control over a query execution plan fragment (repetitively evaluated)

Slide N.:18

SCIENTIFIC APPLICATIONS

INITIAL RESULT

The project configuration:
- Java 1.4.2 and Globus 3.2.1
- 20 Pentium IV 1.7 GHz processors with 256 MB of RAM, running Linux 2.4.20-31.9

We considered: an instance with 1000 particles, executing 25 iterations for each particle.

We then increased the number of nodes: from 1 node to 25 nodes.

Results: demonstrated a gain of up to 11 times with 20 machines, with respect to a centralized execution (with 2.7 tuples per second).

Problem: the block size update strategy proved to be very useful.

Slide N.:19

CONCLUSION

CoDIMS-G is an adaptive distributed query processing grid service.

The proposed query execution strategy extends the eddy adaptive query execution model for the grid environment, considering the variations in the run-time conditions of grid nodes.

Slide N.:20

Paul Horn, Senior Vice President, IBM Research:

“The information-technology industry loves to prove the impossible possible”

Thank you!