The Challenge of Data Integration

Data + Grid = Discovery?

Prof. Malcolm Atkinson, Director

www.nesc.ac.uk

22nd January 2003

Overview

Essentials of e-Science
  Collaboration
  Resource Sharing
  Data Sharing
  Mutual Dependence

Essentials of the Grid
  Distributed Virtual Machine?

Essentials of Data Sharing
  Database Research did it?
  New Challenges
  Data Access & Integration Building Bricks

Band Wagon v Research Opportunity
  Thresholds, Visions and Questions

UK e-Science

e-Science and the Grid

‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’

‘e-Science will change the dynamic of the way science is undertaken.’

John Taylor, Director General of Research Councils, Office of Science and Technology

From presentation by Tony Hey

UK e-Science Investment

[Map labels: Cambridge, Newcastle, Edinburgh, Oxford, Glasgow, Manchester, Cardiff, Southampton, London, Belfast, Daresbury Lab, RAL, Hinxton; National e-Science Centre; HPC(x)]

Projects: > 60 started, > 30 proposed, + EU Projects

UK e-Science Programme (2), 2003 - 2005

[Organisation chart labels]
DG Research Councils
E-Science Steering Committee
Grid TAG
Director
  Director’s Management Role
  Director’s Awareness and Co-ordination Role

Generic Challenges: EPSRC (£15m), DTI (£15m)
Industrial Collaboration (£40m)
Academic Application Support Programme: Research Councils (£74m), DTI (£5m)
  PPARC (£26m), BBSRC (£8m), MRC (£8m), NERC (£7m), ESRC (£3m), EPSRC (£17m), CLRC (£5m)
£80m Collaborative projects

Collaboration

Growing
  Hard Problems, Multi-disciplinary, Expense

Sharing
  Ideas
  Thought processes and Stimuli
  Effort
  Resources

Requires
  Communication
  Common understanding & Framework
  Mechanisms for sharing fairly
  Organisation and Infrastructure

Scientists have done this for Centuries

Interdependence

Science has relied on experiment and theory
Simulation, Data Mining, Analysis

Theory - Greece, 400 BC
Experiment - Italy, 1,500 AD
Simulation - Europe, 1,980 AD

For problems which are:
- too large/small
- too fast/slow
- too complex
- too expensive, unethical, ...
- Testing Understanding

Interdependence

[Diagram labels: Theory, Experiment, Computing, Models, Data (twice)]

Database Growth

PDB protein structures

Globus Toolkit® History

[Chart: downloads per month from ftp.globus.org, 1997 to 2002; vertical axis 0 to 30000]

Chart annotations:
DARPA, NSF, and DOE begin funding Grid work
NASA begins funding Grid work, DOE adds support
The Grid: Blueprint for a New Computing Infrastructure published
GT 1.0.0 Released
Early Application Successes Reported
NSF & European Commission Initiate Many New Grid Projects
Anatomy of the Grid Paper Released
Significant Commercial Interest in Grids
Physiology of the Grid Paper Released
GT 2.0 Released

Does not include downloads from: NMI, UK eScience, EU Datagrid, IBM, Platform, etc.

Encompassing Vision

data archives

sensor nets

computers

software

colleagues

instruments

People & Industry

Global Grid Forum
  GGF2   260    Jul 01
  GGF3   220    Oct 01
  GGF4   400    Feb 02
  GGF5   900    Jul 02
  GGF6   450    Oct 02
  GGF7   >1000  Mar 03

UK All Hands
  AHM’02   350   Sep 02

GlobusWorld
  1   450   Jan 03

IBM This week
  “IBM DRIVES GRID COMPUTING FOR COMMERCIAL BUSINESS WITH TEN NEW GRID OFFERINGS”
  Targets: Financial, Life Sciences, Automotive & Aerospace, Governments
  Partners: Platform, DataSynapse, Avaki, Entropia, United Devices

IBM last 20 months
  Leaders of OGSI
  Development teams
  Grid Jamboree
  GGF

[Chart: GGF1 to GGF5; vertical axis 0 to 900]

High-Altitude Views

A Rallying Cry
  Meeting a Hard Challenge requires Many Minds
  Operating & Maintaining Infrastructure requires Many Hands & Many Companies

Another Stab at Distributed Computing
  Hard Challenge: Intellectually and Practically Important
  Dependable Ubiquity over Heterogeneity & Fallibility

An Ambitious Virtual Machine
  Consistent large scale computational environments

A Global Operating System
  Collective Resources, Common Management

An Architectural View

Grid Plumbing & Security Infrastructure

Scheduling Accounting Authorisation

Monitoring Diagnosis Logging

Application

Data & Compute Resources

Operations Teams

Distributed Providers

Application Users

Common Application Platform for Group of Applications

Application & Platform Developers

Open Grid Services Infrastructure

Confluence of Web Services & Grid

Consistent Interface Description
  Based on WSDL 1.2 proposal
  Extend Properties
  Separate Binding from Interface
  Function Composition & Inheritance

Exploit WS* Investment

Grid Features
  Security
  Life-Time Management
  Service (state) Information via Data Elements
  Discovery
  Grouping
  Notification

OGSI Version 1 Proposal at GGF7 (March 03)
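
Three of the Grid features listed above (life-time management, service state exposed as data elements, and notification) can be illustrated with a small in-process sketch. This is not the OGSI interface: the class `GridServiceSketch` and all of its method names are hypothetical, chosen only to show how the three ideas fit together, and the soft-state lifetime model is an assumption.

```python
import time
from typing import Any, Callable, Dict, List


class GridServiceSketch:
    """Hypothetical stand-in for a stateful service; not the OGSI definitions."""

    def __init__(self, lifetime_s: float, now: Callable[[], float] = time.time):
        self._now = now
        # Life-time management: the service expires unless a client extends it.
        self.termination_time = now() + lifetime_s
        # Service state is exposed as named data elements.
        self._service_data: Dict[str, Any] = {}
        self._subscribers: Dict[str, List[Callable[[str, Any], None]]] = {}

    def is_alive(self) -> bool:
        return self._now() < self.termination_time

    def extend_lifetime(self, extra_s: float) -> float:
        """Push the termination time out to at least now + extra_s."""
        self.termination_time = max(self.termination_time, self._now() + extra_s)
        return self.termination_time

    def subscribe(self, name: str, callback: Callable[[str, Any], None]) -> None:
        """Notification: register interest in changes to one data element."""
        self._subscribers.setdefault(name, []).append(callback)

    def set_service_data(self, name: str, value: Any) -> None:
        self._service_data[name] = value
        for callback in self._subscribers.get(name, []):
            callback(name, value)

    def find_service_data(self, name: str) -> Any:
        return self._service_data.get(name)


if __name__ == "__main__":
    service = GridServiceSketch(lifetime_s=60)
    service.subscribe("status", lambda name, value: print(name, "->", value))
    service.set_service_data("status", "ready")
    print("alive:", service.is_alive())
```

Passing the clock in as `now` is only a convenience so that expiry can be exercised without waiting.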

Open Grid Services Architecture

Ubiquitous Building Blocks
  Using OGSI Platform
  Open & Extensible
  Encourage Refactoring Experiments

Initially
  The Globus 2 model
  Except State Information now distributed

Example New Features
  Global Name Mapping Service
  Replication and Caching Service
  Data Access & Integration
  Metering, Logging, Authorisation, Charging, …

Grid Challenge

Balancing “Direct” Access to the “Platforms” with Abstraction & Virtualisation

Developers often have exploitable application knowledge

Automation necessary & helpful
  Interface matching, operation validation, …
  Optimisation at many scales

There isn’t enough effort to develop Languages & Abstractions

Data Integration

Data Resource 1

Data Resource 2

Scientist with Idea

1) Find Data

2) Extract Data

3) Transform Data

4) Combine Data

5) Interpret Data
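
The five steps above can be sketched as a small pipeline. Everything here is illustrative: the two in-memory data resources, their field names, and the function names are assumptions made for the example, not part of any real data service.

```python
from typing import Dict, List

# Two hypothetical data resources with different schemas and units.
CATALOGUE = {
    "resource1": {
        "topic": "protein",
        "rows": [{"id": "P1", "mass_da": 15000.0}, {"id": "P2", "mass_da": 42000.0}],
    },
    "resource2": {
        "topic": "protein",
        "rows": [{"protein": "P1", "organism": "human"},
                 {"protein": "P3", "organism": "mouse"}],
    },
}


def find_data(topic: str) -> List[str]:
    """1) Find Data: names of the resources that cover the topic."""
    return sorted(name for name, res in CATALOGUE.items() if res["topic"] == topic)


def extract_data(name: str) -> List[Dict]:
    """2) Extract Data: copy the rows out of one resource."""
    return [dict(row) for row in CATALOGUE[name]["rows"]]


def transform_data(rows: List[Dict]) -> List[Dict]:
    """3) Transform Data: use a common key ('id') and a common unit (kDa)."""
    out = []
    for row in rows:
        new = {"id": row.get("id", row.get("protein"))}
        if "mass_da" in row:
            new["mass_kda"] = row["mass_da"] / 1000.0
        if "organism" in row:
            new["organism"] = row["organism"]
        out.append(new)
    return out


def combine_data(left: List[Dict], right: List[Dict]) -> List[Dict]:
    """4) Combine Data: inner join on 'id'."""
    by_id = {row["id"]: row for row in right}
    return [{**row, **by_id[row["id"]]} for row in left if row["id"] in by_id]


def interpret_data(rows: List[Dict]) -> List[str]:
    """5) Interpret Data: here, just a readable summary per record."""
    return [f"{r['id']} ({r['organism']}): {r['mass_kda']} kDa" for r in rows]


def run(topic: str) -> List[str]:
    names = find_data(topic)
    tables = [transform_data(extract_data(name)) for name in names]
    return interpret_data(combine_data(tables[0], tables[1]))


if __name__ == "__main__":
    print(run("protein"))  # ['P1 (human): 15.0 kDa']
```

Only P1 survives the join because it is the only identifier present in both resources. In practice the transform and combine steps are where most of the integration effort goes.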

Wellcome Trust: Cardiovascular Functional Genomics

Glasgow Edinburgh

Leicester

Oxford

London

Netherlands

Shared data

Public curated data

OGSA-DAI Partners

[Map labels: Oxford, Glasgow, Cardiff, Southampton, London, Belfast, Daresbury Lab, RAL, Cambridge, Hinxton, EPCC & NeSC, Newcastle, Manchester, IBM Hursley, IBM USA, Oracle]

Partners: EPCC & NeSC, IBM UK, IBM USA, Manchester e-SC, Newcastle e-SC, Oracle

£3 million, 18 months, started February 2002

DAI Key Services

GridDataService             GDS   Access to data & DB operations
GridDataServiceFactory      GDSF  Makes GDS & GDSF
GridDataServiceRegistry     GDSR  Discovery of GDS(F) & Data
GridDataTranslationService  GDTS  Translates or Transforms Data
GridDataTransportDepot      GDTD  Data transport with persistence

Integrated Structured Data Transport
Relational & XML models supported
Role-based Authorisation
Binary structured files (later)

DAI Architecture

Grid Infrastructure

Scheduling Accounting

Monitoring Diagnosis

Data Intensive Applications for Science X

Compute, Data & Storage Resources

Distributed

Authorisation

Data Access Services

Data Integration Services

Structured Data

Simulation, Analysis & Integration Technology for Science X

Data Intensive X Scientists

Data Integration Architecture

GridFTP Naming Caching

Generic Virtual Data Access and Integration Technology

1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML

[Diagram labels: Client, Registry, Factory, Grid Data Service, XML / Relational database; arrow types: SOAP/HTTP, service creation, API interactions]
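
The Registry, Factory and Grid Data Service interaction above can be sketched in a single process. This is a minimal illustration, not the OGSA-DAI API: the class and method names are hypothetical, plain Python calls stand in for the SOAP/HTTP messages, and an in-memory SQLite database with made-up rows stands in for the relational database.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import Dict, Sequence


class GridDataService:
    """Hypothetical GDS: manages access to one database."""

    def __init__(self, connection: sqlite3.Connection):
        self._conn = connection

    def perform(self, sql: str, params: Sequence = ()) -> str:
        cursor = self._conn.execute(sql, params)         # 3b. GDS interacts with database
        columns = [d[0] for d in cursor.description]
        root = ET.Element("results")
        for row in cursor.fetchall():
            row_el = ET.SubElement(root, "row")
            for column, value in zip(columns, row):
                ET.SubElement(row_el, column).text = str(value)
        return ET.tostring(root, encoding="unicode")     # 3c. results returned as XML


class Factory:
    """Hypothetical factory: creates a GDS per request."""

    def __init__(self, connection: sqlite3.Connection):
        self._conn = connection

    def create_service(self) -> GridDataService:
        return GridDataService(self._conn)               # 2b and 2c


class Registry:
    """Hypothetical registry: maps a topic to a factory."""

    def __init__(self) -> None:
        self._factories: Dict[str, Factory] = {}

    def register(self, topic: str, factory: Factory) -> None:
        self._factories[topic] = factory

    def find(self, topic: str) -> Factory:
        return self._factories[topic]                    # 1a and 1b


def demo() -> str:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE genes (name TEXT, chromosome INTEGER)")
    conn.executemany("INSERT INTO genes VALUES (?, ?)",
                     [("BRCA1", 17), ("TP53", 17), ("CFTR", 7)])
    registry = Registry()
    registry.register("genes", Factory(conn))

    factory = registry.find("genes")                     # 1a, 1b
    gds = factory.create_service()                       # 2a, 2b, 2c
    return gds.perform(                                  # 3a. client queries GDS with SQL
        "SELECT name FROM genes WHERE chromosome = ? ORDER BY name", (17,))


if __name__ == "__main__":
    print(demo())
```

The client never touches the database connection directly; it only holds the handle returned by the factory, which is the point of the indirection in the steps above.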

1a. Request to Registry for sources of data about “x” & “y”
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration to databases
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits set of queries to GDS with XPath, SQL, etc
3b. Tell consumer
3c. Results of queries returned to consumer as XML or binary

[Diagram labels: Client, Registry, Factory, five GDS instances, two XML / Relational databases, Consumer; arrow types: SOAP/HTTP, service creation, API interactions]
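
A sketch of the multi-source variant follows, again in one process and under stated assumptions: the class names, the `submit` method, and the merge-by-gene integration step are all hypothetical. The consumer is modelled as a callback that receives the merged result, which is one possible reading of steps 3b and 3c; the registry lookup is omitted for brevity.

```python
import sqlite3
from typing import Callable, Dict, List, Sequence, Tuple


class GridDataService:
    """Hypothetical GDS wrapping one database connection."""

    def __init__(self, connection: sqlite3.Connection):
        self._conn = connection

    def query(self, sql: str) -> List[Tuple]:
        return self._conn.execute(sql).fetchall()


class ServiceNetwork:
    """Hypothetical network of GDS instances created by the factory (2b)."""

    def __init__(self, services: Dict[str, GridDataService]):
        self._services = services

    def submit(self, queries: Dict[str, str],
               consumer: Callable[[Dict[str, Dict[str, str]]], None]) -> None:
        # 3a. the client submits a set of queries, one per source.
        merged: Dict[str, Dict[str, str]] = {}
        for source, sql in queries.items():
            for gene, value in self._services[source].query(sql):
                merged.setdefault(gene, {})[source] = value
        # 3b, 3c. results go to the consumer, not back to the client.
        consumer(merged)


class Factory:
    def __init__(self, databases: Dict[str, sqlite3.Connection]):
        self._databases = databases

    def create_network(self, sources: Sequence[str]) -> ServiceNetwork:
        return ServiceNetwork({s: GridDataService(self._databases[s]) for s in sources})


class Consumer:
    def __init__(self) -> None:
        self.received: List[Dict[str, Dict[str, str]]] = []

    def receive(self, results: Dict[str, Dict[str, str]]) -> None:
        self.received.append(results)


def make_db(table: str, rows: List[Tuple[str, str]]) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} (gene TEXT, value TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return conn


def demo() -> List[Dict[str, Dict[str, str]]]:
    databases = {
        "x": make_db("x_data", [("BRCA1", "high"), ("TP53", "low")]),
        "y": make_db("y_data", [("BRCA1", "chr17")]),
    }
    network = Factory(databases).create_network(["x", "y"])   # 2a, 2b, 2c
    consumer = Consumer()
    network.submit({"x": "SELECT gene, value FROM x_data",
                    "y": "SELECT gene, value FROM y_data"},
                   consumer.receive)
    return consumer.received


if __name__ == "__main__":
    print(demo())
```

Separating the client that issues the queries from the consumer that receives the results means bulk data need not pass through the requesting party.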

Biomedical (or ANY) Data

Opportunities
  Global Production of Published Data
  Volume, Diversity
  Combination, Analysis, Discovery

Challenges
  Data Huggers
  Meagre metadata
  Ease of Use
  Automated, optimised integration
  Traceability, Dependability

Opportunities
  Specialised Indexing
  Structurally varied replication
  Consistent Structured Universe of Discourse
  Data & Computation Integration

Challenges
  Approximate Matching
  Multi-scale optimisation

Bad habits / industrial structures

Safety and Multi-scale optimisation

Data Integration Challenges

High-Level Languages
  Describing the Data Extraction Recipes
  Describing the Sources & Components

Metadata that drives automation & validation

Mobility
  Code & Data

Integrating Existing DB technology
  Moving the DBMS to the Grid context

New Optimisation Challenges
  Data & Computation & Storage & Movement

Shared Distributed Annotation Systems
  How to Reference
  Provenance & Acknowledgement

Challenges

A Programming & Development Model
Dependability at this Scale
Foundations for Trust
Raising the Level of Automation
Supporting New Forms of Collaboration
Data
