HPDC12 Seattle Structured Data and the Grid Access and Integration Prof. Malcolm Atkinson Director


Page 1

HPDC12

Seattle

Structured Data and the Grid

Access and Integration

Prof. Malcolm Atkinson

Director

www.nesc.ac.uk

23rd June 2003

Page 2

Outline

What is e-Science?
Structured Data at its Foundation
Key Uses of Distributed Data Resources
Data-intensive Challenges
Data Access & Integration
DAIS-WG
OGSA-DAI: Progress and Dreams
Unanswered Architectural Questions

Page 3

Foundation for e-Science

sensor nets

Shared data archives

computers

software

colleagues

instruments

Grid

e-Science methodologies will rapidly transform science, engineering, medicine and business

Driven by exponential growth (×1000/decade)

Enabling and requiring a whole-system approach
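The ×1000/decade figure can be turned into a yearly rate with a short calculation. The sketch below assumes the growth is a smooth exponential over the decade; the variable names are illustrative and not from the talk.

```python
import math

DECADE_FACTOR = 1000  # from the slide: x1000 per decade

# Assumption: constant exponential growth, so the yearly factor is the 10th root.
annual_factor = DECADE_FACTOR ** (1 / 10)
doubling_time_years = math.log(2) / math.log(annual_factor)

print(f"annual growth factor: {annual_factor:.3f}")
print(f"doubling time: {doubling_time_years:.2f} years")
```

Under that assumption, ×1000 per decade is very close to doubling every year.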

Page 4

Three-way Alliance

Computing Science: Systems, Notations & Formal Foundation → Process & Trust

Theory: Models & Simulations → Shared Data

Experiment: Advanced Data Collection → Shared Data

Multi-national, Multi-discipline, Computer-enabled Consortia, Cultures & Societies

Requires Much Engineering, Much Innovation

Changes Culture, New Mores, New Behaviours

New Opportunities, New Results, New Rewards

Page 5

Database-mediated Communication

Simulation Communities

Experimentation Communities

Analysis & Theory Communities

Data

knowledge

Data

Carries knowledge

Carries knowledge

Discoveries

Curated & Shared Databases

Page 6

global in-flight engine diagnostics

in-flight data

airline

maintenance centre

ground station

global network, e.g. SITA

internet, e-mail, pager

DS&S Engine Health Center

data centre

Distributed Aircraft Maintenance Environment: Universities of Leeds, Oxford, Sheffield & York

100,000 engines
2-5 Gbytes/flight
5 flights/day
= 2.5 petabytes/day
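The daily volume quoted on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 petabyte = 1,000,000 gigabytes) and takes the slide's figures at face value; the function name is illustrative.

```python
# Assumption: decimal units, 1 petabyte = 1,000,000 gigabytes.
GB_PER_PB = 1_000_000

def daily_volume_pb(engines: int, gb_per_flight: float, flights_per_day: int) -> float:
    """Return the fleet-wide data volume in petabytes per day."""
    return engines * gb_per_flight * flights_per_day / GB_PER_PB

low = daily_volume_pb(100_000, 2, 5)   # lower end of the 2-5 Gbytes/flight range
high = daily_volume_pb(100_000, 5, 5)  # upper end of the range

print(f"{low} to {high} petabytes/day")
```

Under these assumptions the quoted 2.5 petabytes/day corresponds to the upper end of the per-flight range; the lower end gives 1 petabyte/day.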

Page 7

Database Growth

PDB Content Growth

Bases 39,856,567,747

Page 8

Distributed Structured Data

Key to Integration of Scientific Methods
Key to Large-scale Collaboration
Growing Number of Growing Data Resources
  Independently managed
  Geographically distributed
Key to Discovery
  Extracting nuggets from multiple sources
  Combining them using sophisticated models
  Analysis on scales required by statistics

Repeated Processes

and Decisions!

Page 9

Tera → Peta Bytes

                     Terabyte                  Petabyte
RAM time to move     15 minutes                2 months
1 Gb WAN move time   10 hours ($1000)          14 months ($1 million)
Disk Cost            7 disks = $5000 (SCSI)    6800 Disks + 490 units + 32 racks = $7 million
Disk Power           100 Watts                 100 Kilowatts
Disk Weight          5.6 Kg                    33 Tonnes
Disk Footprint       Inside machine            60 m²

May 2003 Approximately Correct

See also: Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24
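The WAN move times on this slide imply an effective throughput well below the nominal 1 Gb/s. A minimal sketch of that arithmetic follows, assuming decimal units (1 terabyte = 10^12 bytes) and a 30-day month; the function names are illustrative and not from the talk.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month

def ideal_transfer_seconds(num_bytes: float, link_bits_per_s: float) -> float:
    """Time to move num_bytes over a link running at its full nominal rate."""
    return num_bytes * 8 / link_bits_per_s

def effective_mbps(num_bytes: float, seconds: float) -> float:
    """Average throughput in megabits per second implied by a move time."""
    return num_bytes * 8 / seconds / 1e6

TB, PB = 1e12, 1e15  # assumption: decimal terabyte and petabyte
LINK = 1e9           # nominal 1 Gb/s WAN

tb_ideal_hours = ideal_transfer_seconds(TB, LINK) / SECONDS_PER_HOUR
pb_ideal_months = ideal_transfer_seconds(PB, LINK) / SECONDS_PER_MONTH
tb_rate = effective_mbps(TB, 10 * SECONDS_PER_HOUR)   # slide: 10 hours
pb_rate = effective_mbps(PB, 14 * SECONDS_PER_MONTH)  # slide: 14 months

print(f"1 TB at line rate: {tb_ideal_hours:.1f} hours; slide implies {tb_rate:.0f} Mb/s")
print(f"1 PB at line rate: {pb_ideal_months:.1f} months; slide implies {pb_rate:.0f} Mb/s")
```

Both figures work out to roughly 220 Mb/s, so the terabyte and petabyte move times are consistent with each other under these assumptions.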

Page 10

Mohammed & Mountains

Petabytes of Data cannot be moved
  It stays where it is produced or curated
  Hospitals, observatories, European Bioinformatics Institute, …
Distributed collaborating communities
  Expertise in curation, simulation & analysis
  Can’t collocate data in a few places
Distributed & diverse data collections
  Discovery depends on insights
  Unpredictable sophisticated application code
  Tested by combining data from many sources

Page 11

Dynamically Move computation to the data

Assumption: code size << data size
Code needs to be well behaved
Develop the database philosophy for this?
  Queries are dynamically re-organised & bound
Develop the storage architecture for this?
  Compute closer to disk? System on a Chip using free space in the on-disk controller
  Data Cutter a step in this direction
Develop the sensor & simulation architectures for this?
Safe hosting of arbitrary computation
  Proof-carrying code for data and compute intensive tasks + robust hosting environments
Provision combined storage & compute resources
Decomposition of applications
  To ship behaviour-bounded sub-computations to data
Co-scheduling & co-optimisation
  Data & Code (movement), Code execution
Recovery and compensation

Dave Patterson, Seattle, SIGMOD 98
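The assumption "code size << data size" can be illustrated by comparing the bytes that would cross the network under each placement. This is a minimal sketch, not a scheduler from the talk: `Task`, `bytes_moved` and `choose_placement` are hypothetical names, and network transfer volume is taken as the only cost.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    code_bytes: int    # size of the code to ship
    data_bytes: int    # size of the data set it reads
    result_bytes: int  # size of the result returned to the client

def bytes_moved(task: Task, run_at_data: bool) -> int:
    """Bytes crossing the network for one placement of the task."""
    if run_at_data:
        # Ship the code to the data, then ship only the result back.
        return task.code_bytes + task.result_bytes
    # Pull the whole data set to the client; code and result stay local.
    return task.data_bytes

def choose_placement(task: Task) -> str:
    """Pick the placement that moves fewer bytes (ties favour the client)."""
    if bytes_moved(task, run_at_data=True) < bytes_moved(task, run_at_data=False):
        return "data"
    return "client"

# Illustrative sizes only: 1 MB of code, 1 TB of data, 10 MB of results.
scan = Task(code_bytes=10**6, data_bytes=10**12, result_bytes=10**7)
print(choose_placement(scan))
```

A real decision would also have to weigh the other points on this slide, such as whether the code is well behaved, whether it can be hosted safely, and how storage and compute are co-scheduled.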

Page 12

Page 13

First steps towards a generic framework for integrating data access and computation

Using the grid to take specific classes of computation nearer to the data

Kit of parts for building tailored access and integration applications

Page 14

DAIS-WG

Specification of Grid Data Services

Chairs
  Norman Paton, Manchester University
  Dave Pearson, Oracle

Current Spec. Draft Authors
  Mario Antonioletti
  Malcolm Atkinson
  Neil P Chue Hong
  Amy Krause
  Susan Malaika
  Gavin McCance
  Simon Laws
  James Magowan
  Norman W Paton
  Greg Riccardi

Page 15

Conceptual Model

External Universe

External data resource manager

External data resource

External data set

DBMS

DB

ResultSet

Page 16

Conceptual Model

DAI Service Classes

Data resource manager

Data resource

Data set

DBMS

DB

ResultSet

Data activity session

Data request
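One way to read the two conceptual-model slides is as a containment hierarchy: a data resource manager (for example a DBMS) holds data resources (for example a DB), and a data request carried out within a data activity session yields a data set (for example a ResultSet). The sketch below is an interpretation under that assumption; the class and field names are hypothetical and are not taken from the DAIS-WG specification.

```python
from dataclasses import dataclass, field

@dataclass
class DataSet:
    """Result of a request, e.g. a ResultSet."""
    rows: list

@dataclass
class DataResource:
    """A managed collection of data, e.g. a DB."""
    name: str
    tables: dict = field(default_factory=dict)  # table name -> list of rows

@dataclass
class DataResourceManager:
    """The system that manages resources, e.g. a DBMS."""
    resources: dict = field(default_factory=dict)  # resource name -> DataResource

    def add(self, resource: DataResource) -> None:
        self.resources[resource.name] = resource

@dataclass
class DataRequest:
    """A request against one table of one resource (hypothetical shape)."""
    resource: str
    table: str

@dataclass
class DataActivitySession:
    """Context in which requests are carried out against a manager."""
    manager: DataResourceManager
    history: list = field(default_factory=list)

    def perform(self, request: DataRequest) -> DataSet:
        resource = self.manager.resources[request.resource]
        self.history.append(request)  # the session remembers what was asked
        return DataSet(rows=list(resource.tables[request.table]))

# Illustrative data only.
dbms = DataResourceManager()
dbms.add(DataResource("engines", {"flights": [("G-ABCD", 3.2)]}))
session = DataActivitySession(dbms)
result = session.perform(DataRequest("engines", "flights"))
print(result.rows)
```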

Page 17

Oxford

Glasgow

Cardiff

Southampton

London

Belfast

Daresbury Lab

RAL

OGSA-DAI Partners

EPCC & NeSC

Newcastle

IBM USA

IBM Hursley

Oracle

Manchester

Cambridge

Hinxton

$5 million, 20 months, started February 2002

Additional 24 months, starts October 2003

Page 18

OGSA

Infrastructure Architecture

OGSI: Interface to Grid Infrastructure

Data Intensive Applications for X-ology Research

Compute, Data & Storage Resources

Distributed

Simulation, Analysis & Integration Technology for X-ology

Data Intensive X-ology Researchers

Virtual Integration Architecture

Generic Virtual Data Access and Integration Layer

Structured Data Integration

Structured Data Access

Structured Data: Relational, XML, Semi-structured

Transformation

Registry

Job Submission

Data Transport

Resource Usage

Banking

Brokering

Workflow

Authorisation

Page 19

Data Access & Integration Services

1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc.
3b. GDS interacts with database
3c. Results of query returned to client as XML

Interaction types: SOAP/HTTP, service creation, API interactions

Components: Client, Registry, Factory, Grid Data Service, XML / Relational database
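The numbered steps on this slide can be sketched as three cooperating objects. This is an in-process illustration only: the real services communicate over SOAP/HTTP, and the class names, method names and the use of `sqlite3` as the database are assumptions made here, not the OGSA-DAI API.

```python
import sqlite3
from xml.etree import ElementTree as ET

class GridDataService:
    """Hypothetical stand-in for a GDS bound to one database."""
    def __init__(self, connection: sqlite3.Connection):
        self._connection = connection

    def perform(self, sql: str) -> str:
        cursor = self._connection.execute(sql)        # 3b. GDS interacts with database
        columns = [d[0] for d in cursor.description]
        root = ET.Element("results")
        for row in cursor:
            row_el = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_el, name).text = str(value)
        return ET.tostring(root, encoding="unicode")  # 3c. results returned as XML

class Factory:
    """Hypothetical factory that creates a GDS on request."""
    def __init__(self, connection: sqlite3.Connection):
        self._connection = connection

    def create_service(self) -> GridDataService:
        return GridDataService(self._connection)      # 2b. create, 2c. return handle

class Registry:
    """Hypothetical registry mapping a topic to a factory handle."""
    def __init__(self):
        self._factories = {}

    def register(self, topic: str, factory: Factory) -> None:
        self._factories[topic] = factory

    def find(self, topic: str) -> Factory:
        return self._factories[topic]                 # 1b. respond with factory handle

# Set-up: an in-memory database standing in for the remote data resource.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE x (id INTEGER, name TEXT)")
db.execute("INSERT INTO x VALUES (1, 'alpha')")
registry = Registry()
registry.register("x", Factory(db))

factory = registry.find("x")                          # 1a. ask registry about "x"
gds = factory.create_service()                        # 2a. ask factory for access
xml_text = gds.perform("SELECT id, name FROM x")      # 3a. client queries GDS with SQL
print(xml_text)
```

Keeping the registry, factory and service as separate objects mirrors the slide's separation of discovery, service creation and query, even though everything runs in one process here.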

Page 20

Future DAI Services

1a. Request to Registry for sources of data about “x” & “y”
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration from resources Sx and Sy
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits sequence of scripts, each of which has a set of queries to GDS with XPath, SQL, etc.
3b. Client tells analyst
3c. Sequences of result sets returned to analyst as formatted binary described in a standard XML notation

Interaction types: SOAP/HTTP, service creation, API interactions

Components: Client, Analyst, Data Registry, Data Access & Integration master, GDS1, GDS2, GDS3, GDTS1, GDTS2, Sx, Sy, XML database, Relational database

Other labels: “scientific” Application coding scientific insights; Problem Solving Environment; Semantic Meta data; Application Code

Page 21

What Architecture will Enable Data & Computation Integration?

  Common Conceptual Models
  Common Planning & Optimisation
  Common Enactment of Workflows
  Common Debugging
  …

What Fundamental CS is needed?
  Trustworthy code & Trustworthy evaluators
  Decomposition and Recomposition of Applications
  …

Is there an evolutionary path?

Page 22

www.nesc.ac.uk

www.ogsadai.org.uk

Page 23

Scientific Data

Opportunities
  Global Production of Published Data
  Volume Diversity
  Combination Analysis Discovery

Challenges
  Data Huggers
  Meagre metadata
  Ease of Use
  Optimised integration
  Dependability

Opportunities
  Specialised Indexing
  New Data Organisation
  New Algorithms
  Varied Replication
  Shared Annotation
  Intensive Data & Computation

Challenges
  Fundamental Principles
  Approximate Matching
  Multi-scale optimisation
  Autonomous Change
  Legacy structures
  Scale and Longevity
  Privacy and Mobility