14-18 March 2004 EDBT'04 : Service-Based Distributed Query Processing for the Grid (M N Alpdemir)


Page 1: Title, places, people, funding, projects. Manchester.


context

1. High-level data access and integration services are needed if applications that have data with complex structure and complex semantics are to benefit from the Grid.

2. Standards for data access are emerging, and middleware products that are reference implementations of such standards are already available.

3. Distributed query processing technology is one approach to delivering (1.) given the availability of (2.).


OGSA-DQP goals

1. To benefit from homogeneous access to heterogeneous data sources [OGSA-DAI].

2. To benefit from Grid abstractions for on-demand, transparent allocation of resources required for a task [OGSA/OGSI/GT3].

3. To provide transparent, implicit parallelism and distribution. [Polar*]

4. To orchestrate the composition of data retrieval and analysis services using query mechanisms.

5. To expose this orchestration capability as a Grid data service.


OGSA-DQP innovations

• OGSA-DQP dynamically allocates evaluators to do work on behalf of the mediator.
– All available nodes can be allocated for query evaluation (not just the nodes with data sources).

• A distributed query execution plan is resourced on the fly.
– This allows for runtime circumstances to be taken into account when the optimiser decides how to partition and schedule.

• The query plan is the outcome of optimising a declarative service orchestration expressed as a query.

• OGSA-DQP uses a parallel physical algebra: most mediator-based query processors do not.
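
To make the idea of a parallel physical algebra more concrete, here is a minimal Python sketch of an exchange-style operator that routes tuples to evaluators by hashing a key, so that tuples with equal keys meet on the same evaluator. The function name `exchange`, the evaluator labels and the sample rows are hypothetical and are not taken from the OGSA-DQP code base; the slides do not say which partitioning policy the real exchange operator uses.

```python
import zlib
from collections import defaultdict


def exchange(tuples, key, evaluators):
    """Route each tuple to one evaluator by hashing its key value.

    Hypothetical stand-in for an exchange operator: a stable hash
    (CRC32) is used so the routing is the same on every run.
    """
    buckets = defaultdict(list)
    for row in tuples:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        target = evaluators[digest % len(evaluators)]
        buckets[target].append(row)
    return dict(buckets)


# Illustrative rows only; the evaluator names are made up.
rows = [{"ORF": "YBL061C"}, {"ORF": "YBL061C"}, {"ORF": "YAL001C"}]
buckets = exchange(rows, "ORF", ["gqes-1", "gqes-2"])
```

Because routing depends only on the key, a downstream join can run independently on each evaluator, which is the property a partitioned parallel join relies on.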


OGSA-DQP provides two grid services

Exposes to clients:

• Grid Distributed Query Services (GDQSs) that:
– interact with clients;
– find and retrieve service descriptions;
– parse, compile, partition and schedule the query execution over a union of distributed data sources;
– coordinate the GQESs into executing the plan.

• The query plan is an orchestration of GQESs.

Coordinates transparently:

• Grid Query Evaluation Services (GQESs) that:
– implement the physical query algebra;
– implement the query execution model and semantics;
– run a partition of a query execution plan generated by a GDQS;
– interact with other GQESs/GDSs/WSs but not with clients.


Brief tour: an illustration

[Figure: OGSA-DQP architecture diagram. Recoverable labels: Client; GDQS; GDS 1; GDS 2; Web Services; resource list; WSDL; DB Schema (twice); OQL Query; Polar* Query Optimiser Engine (OQL Parser, Logical Optimiser, Physical Optimiser, Partitioner, Scheduler); GDS Query Request Doc.; plan operators print, exchange, hash join, scan; partitions P1, P2, P3; Distributed Query Execution Engine (GQES 1, GQES 2, GQES 3); sub-plan; data blocks; sub-query; operation call.]

<?xml version="1.0" encoding="UTF-8"?>
<GDQDataSourceList xmlns="http://dqp.ogsadai.org.uk/schema/gdqs">
  <importedDataSource>
    <GDSFactoryHandle>http://phoebus.cs.man.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>
    <GDSFactoryHandle>http://rpc676.cs.man.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>
    <GDSFactoryHandle>http://mygrid.ncl.cs.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>
  </importedDataSource>
  <importedService>
    <wsdlURL>http://phoebus.cs.man.ac.uk:9090/axis/services/EntropyAnalyserService?WSDL</wsdlURL>
  </importedService>
</GDQDataSourceList>
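
As a hedged illustration of how a client-side tool might read a resource list like the one above, the following Python sketch extracts the data service factory handles and the Web service WSDL URLs with the standard library. The variable names are hypothetical; only the element names, the namespace and the URLs come from the slide.

```python
import xml.etree.ElementTree as ET

# Copy of the resource list shown on the slide (bytes, so the
# encoding declaration is honoured by the parser).
RESOURCE_LIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<GDQDataSourceList xmlns="http://dqp.ogsadai.org.uk/schema/gdqs">
  <importedDataSource>
    <GDSFactoryHandle>http://phoebus.cs.man.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>
    <GDSFactoryHandle>http://rpc676.cs.man.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>
    <GDSFactoryHandle>http://mygrid.ncl.cs.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>
  </importedDataSource>
  <importedService>
    <wsdlURL>http://phoebus.cs.man.ac.uk:9090/axis/services/EntropyAnalyserService?WSDL</wsdlURL>
  </importedService>
</GDQDataSourceList>"""

# The document uses a default namespace, so lookups need a prefix map.
NS = {"g": "http://dqp.ogsadai.org.uk/schema/gdqs"}

root = ET.fromstring(RESOURCE_LIST)
factory_handles = [e.text.strip() for e in root.findall(".//g:GDSFactoryHandle", NS)]
wsdl_urls = [e.text.strip() for e in root.findall(".//g:wsdlURL", NS)]
```

The two lists correspond to the "Select Data Sources" and "Select Web Services" inputs of the demonstration; how the GDQS itself consumes the document is not shown in the slides.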

<?xml version="1.0" encoding="UTF-8"?>
<databaseSchema xmlns="">
  <logicalSchema>
    <table name="goterm">
      <column fullName="goterm_id" length="32" name="id">
        <sqlTypeName>varchar</sqlTypeName>
        <sqlJavaTypeID>12</sqlJavaTypeID>
      </column>
      <column fullName="goterm_type" length="55" name="type">
        <sqlTypeName>varchar</sqlTypeName>
        <sqlJavaTypeID>12</sqlJavaTypeID>
      </column>
      <column fullName="goterm_name" length="255" name="name">
        <sqlTypeName>varchar</sqlTypeName>
        <sqlJavaTypeID>12</sqlJavaTypeID>
      </column>
      <primaryKey>
        <columnFullName>id</columnFullName>
      </primaryKey>
    </table>
  </logicalSchema>
  <physicalSchema>
    <hostMachine>130.88.192.230</hostMachine>
    <database join_buffer_size="131072" max_join_size="4294967295">
      <physTable avgRowLength="67" dataLength="766784" indexLength="126976" name="goterm" rowFormat="Dynamic" rows="11369"/>
    </database>
  </physicalSchema>
  <GDSFHandle>http://phoebus.cs.man.ac.uk:9090/ogsa/services/ogsadai/GridDataServiceFactory</GDSFHandle>
</databaseSchema>
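
The schema document above carries both logical metadata (columns) and physical statistics (row counts, average row length). As a rough sketch of how an optimiser could turn those statistics into a size estimate, the Python below parses a trimmed copy of the schema and multiplies `rows` by `avgRowLength`. The function `table_stats` and the estimate itself are illustrative assumptions; the slides do not describe the cost model that the Polar* optimiser actually uses. The empty `xmlns=""` attribute, the `sqlJavaTypeID` elements, the host and the handle are left out of the copy for brevity.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the schema document from the slide.
SCHEMA = b"""<databaseSchema>
  <logicalSchema>
    <table name="goterm">
      <column fullName="goterm_id" length="32" name="id">
        <sqlTypeName>varchar</sqlTypeName>
      </column>
      <column fullName="goterm_type" length="55" name="type">
        <sqlTypeName>varchar</sqlTypeName>
      </column>
      <column fullName="goterm_name" length="255" name="name">
        <sqlTypeName>varchar</sqlTypeName>
      </column>
    </table>
  </logicalSchema>
  <physicalSchema>
    <database join_buffer_size="131072" max_join_size="4294967295">
      <physTable avgRowLength="67" dataLength="766784" indexLength="126976"
                 name="goterm" rowFormat="Dynamic" rows="11369"/>
    </database>
  </physicalSchema>
</databaseSchema>"""


def table_stats(schema_xml):
    """Collect column names and physical statistics per table (hypothetical helper)."""
    root = ET.fromstring(schema_xml)
    stats = {}
    for table in root.iter("table"):
        columns = [col.get("name") for col in table.iter("column")]
        stats[table.get("name")] = {"columns": columns}
    for phys in root.iter("physTable"):
        entry = stats.setdefault(phys.get("name"), {})
        entry["rows"] = int(phys.get("rows"))
        entry["avg_row_bytes"] = int(phys.get("avgRowLength"))
    return stats


goterm = table_stats(SCHEMA)["goterm"]
# Crude size estimate for a full scan: cardinality times average row length.
estimated_bytes = goterm["rows"] * goterm["avg_row_bytes"]
```

The estimate (761723 bytes) is close to the reported `dataLength` of 766784, which suggests the two statistics are consistent for this table.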

<?xml version="1.0" encoding="UTF-8"?>
<Partitions>
  <Partition>
    <evaluatorURI>http://130.88.198.195:9090/ogsa/services/ogsadai/dqp/GridQueryEvaluationFactory/hash-11025450-1076603541049</evaluatorURI>
    <Operator operatorID="0" operatorType="TABLE_SCAN">
      <tupleType>
        <type>goterm</type>
        <name>goterm.OID</name>
        <type>string</type>
        <name>goterm.id</name>
        <type>string</type>
        <name>goterm.type</name>
        <type>string</type>
        <name>goterm.name</name>
      </tupleType>
      <TABLE_SCAN>
        <dataResourceName> goterms </dataResourceName>
        <GDSHandle> http://130.88.192.230:9090/ogsa/services/ogsadai/GridDataServiceFactory/hash-31056514-1076603576481</GDSHandle>
        <tableName> goterms </tableName>
        <predicateExpr>
          <predicate>
            <comparativeOperator>LIKE</comparativeOperator>
            <leftOperand name=" goterm.id" type="13"/>
            <rightOperand name=" GO:0000%" type="16"/>
          </predicate>
        </predicateExpr>
      </TABLE_SCAN>
    </Operator> . . .
  </Partition> . . .
</Partitions>
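
The partition fragment above is elided on the slide, so the sketch below works on a complete but trimmed `TABLE_SCAN` operator instead. It shows one plausible way an evaluator could turn the scan and its predicate into an SQL sub-query for the data service. The function `scan_to_sql`, the choice of `SELECT *`, and the stripping of the `goterm.` qualifier are assumptions made for illustration; the slides do not show the query text that a GQES really sends.

```python
import xml.etree.ElementTree as ET

# Trimmed TABLE_SCAN operator based on the slide; values copied as shown.
OPERATOR = b"""<Operator operatorID="0" operatorType="TABLE_SCAN">
  <TABLE_SCAN>
    <tableName> goterms </tableName>
    <predicateExpr>
      <predicate>
        <comparativeOperator>LIKE</comparativeOperator>
        <leftOperand name=" goterm.id" type="13"/>
        <rightOperand name=" GO:0000%" type="16"/>
      </predicate>
    </predicateExpr>
  </TABLE_SCAN>
</Operator>"""


def scan_to_sql(operator_xml):
    """Render a single-predicate TABLE_SCAN as SQL text (hypothetical helper)."""
    operator = ET.fromstring(operator_xml)
    scan = operator.find("TABLE_SCAN")
    table = scan.findtext("tableName").strip()
    predicate = scan.find("predicateExpr/predicate")
    comparator = predicate.findtext("comparativeOperator").strip()
    # Assumption: drop the tuple-type qualifier and keep the bare column name.
    column = predicate.find("leftOperand").get("name").strip().split(".")[-1]
    # Escape single quotes; real code should prefer bound parameters.
    literal = predicate.find("rightOperand").get("name").strip().replace("'", "''")
    return f"SELECT * FROM {table} WHERE {column} {comparator} '{literal}'"


sql = scan_to_sql(OPERATOR)
```

Pushing the predicate into the sub-query keeps filtering at the data source, so fewer tuples have to cross the network to the evaluator.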


The Demonstration: Configuring the DQP

Select DQP Factory

Select Data Sources

Select Web Services

Import Metadata


The Demonstration: Example Query

• Given two DBMSs and one analysis tool (e.g., a WS):
– Goterm: the GO (Gene Ontology) database, running as a remote MySQL DB,
– proteinSequence: yeast protein sequences,
– EntropyAnalyser (information content analyser);

• We can obtain the information content of protein sequences of a certain kind specified by certain gene ontology terms:

select p.ORF, go.id, calculateEntropy(p.sequence)
from p in protein_sequences, go in goterms, pg in protein_goterms
where go.id=pg.GOTermIdentifier and p.ORF=pg.ORF
  and p.ORF like "YBL06%" and go.id like "GO:0000%";

• Then, OGSA-DQP acts as an enactor of a declarative orchestration of services on the Grid:

[Figure: query execution plan. Recoverable labels: Partition boundaries; Parallelized on nodes 1 & 2.]
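
To show what the example query computes, here is a single-process Python sketch that evaluates the same selections and joins over a few in-memory rows and then calls an entropy function on each matching sequence. Everything concrete in it is an assumption for illustration: the rows are invented toy data, `calculate_entropy` uses Shannon entropy as a stand-in for the EntropyAnalyser service (the slides only call it an information content analyser), and the `like "X%"` predicates are approximated with prefix tests. OGSA-DQP itself would run such a plan as partitions across several evaluators.

```python
import math
from collections import Counter, defaultdict


def calculate_entropy(sequence):
    """Shannon entropy in bits per symbol (assumed stand-in for EntropyAnalyser)."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def hash_join(build, probe, build_key, probe_key):
    """Classic hash join: index the build input, then probe it row by row."""
    index = defaultdict(list)
    for row in build:
        index[row[build_key]].append(row)
    for row in probe:
        for match in index.get(row[probe_key], []):
            yield {**match, **row}


# Invented toy rows; not data from the demonstration.
protein_sequences = [
    {"ORF": "YBL061C", "sequence": "AABB"},
    {"ORF": "YAL001C", "sequence": "MKVL"},
]
goterms = [{"id": "GO:0000001"}, {"id": "GO:0005575"}]
protein_goterms = [
    {"ORF": "YBL061C", "GOTermIdentifier": "GO:0000001"},
    {"ORF": "YBL061C", "GOTermIdentifier": "GO:0005575"},
    {"ORF": "YAL001C", "GOTermIdentifier": "GO:0000001"},
]

# Scans with the selections pushed down (prefix tests approximate LIKE "X%").
p_scan = [p for p in protein_sequences if p["ORF"].startswith("YBL06")]
go_scan = [g for g in goterms if g["id"].startswith("GO:0000")]

# go.id = pg.GOTermIdentifier, then p.ORF = pg.ORF.
go_pg = hash_join(go_scan, protein_goterms, "id", "GOTermIdentifier")
joined = hash_join(p_scan, go_pg, "ORF", "ORF")

# Projection, including the call to the analysis function.
results = [(r["ORF"], r["id"], calculate_entropy(r["sequence"])) for r in joined]
```

The final list comprehension plays the role of the operation call to the Web service: the analysis step is just another operator applied to each joined tuple.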


where to find out more: software

OGSA-DQP
Grid middleware to query distributed data sources
www.ogsadai.org.uk/dqp

OGSA-DAI
Grid middleware to interface with data(bases)
www.ogsadai.org.uk/

Globus Toolkit
Open-source implementation of OGSA/OGSI
www.globustoolkit.org/