1 The Challenge of Data Integration Data + Grid = Discovery? Prof. Malcolm Atkinson Director 22 nd...


1

The Challenge of Data Integration

Data + Grid = Discovery?

Prof. Malcolm Atkinson, Director

www.nesc.ac.uk

22nd January 2003

2

Overview

Essentials of e-Science

Collaboration, Resource Sharing, Data Sharing, Mutual Dependence

Essentials of the Grid

Distributed Virtual Machine?

Essentials of Data Sharing

Database Research did it?

New Challenges

Data Access & Integration Building Bricks

Band Wagon v Research Opportunity

Thresholds, Visions and Questions

3

5

UK e-Science

e-Science and the Grid

‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’

‘e-Science will change the dynamic of the way science is undertaken.’

John Taylor, Director General of Research Councils, Office of Science and Technology

From presentation by Tony Hey

6

Cambridge

Newcastle

Edinburgh

Oxford

Glasgow

Manchester

Cardiff

Southampton

London

Belfast

Daresbury Lab

RAL

Hinxton

UK e-Science Investment

National e-Science Centre

HPC(x)

Projects: > 60 started, > 30 proposed, + EU Projects

7

£80m Collaborative projects

e-Science Steering Committee

DG Research Councils

Director

Director’s Management Role

Director’s Awareness and Co-ordination Role

Generic Challenges EPSRC (£15m), DTI (£15m)

Industrial Collaboration (£40m)

Academic Application Support Programme

Research Councils (£74m), DTI (£5m)

PPARC (£26m) BBSRC (£8m) MRC (£8m) NERC (£7m) ESRC (£3m) EPSRC (£17m) CLRC (£5m)

Grid TAG

UK e-Science Programme (2) 2003 - 2005

8

9

Collaboration Growing

Hard Problems, Multi-disciplinary, Expense

Sharing Ideas Thought processes and Stimuli Effort Resources

Requires Communication Common understanding & Framework Mechanisms for sharing fairly Organisation and Infrastructure

Scientists have done this for Centuries

12

Interdependence

Science has relied on experiment and theory

Simulation, Data Mining, Analysis

Theory - Greece, 400 BC

Experiment - Italy, 1500 AD

Simulation - Europe, 1980 AD

For problems which are:

- too large/small

- too fast/slow

- too complex

- too expensive, unethical, ...

- Testing Understanding

13

Interdependence

Theory

Experiment

Computing

Models

Data

Data

14

Database Growth

PDB protein structures

15

16

Globus Toolkit® History

[Chart: downloads per month from ftp.globus.org, 1997 to 2002; vertical axis 0 to 30000]

DARPA, NSF, and DOE begin funding Grid work

NASA begins funding Grid work, DOE adds support

The Grid: Blueprint for a New Computing Infrastructure published

GT 1.0.0 Released

Early Application Successes Reported

NSF & European Commission Initiate Many New Grid Projects

Anatomy of the Grid Paper Released

Significant Commercial Interest in Grids

Physiology of the Grid Paper Released

GT 2.0 Released

Does not include downloads from: NMI, UK eScience, EU Datagrid, IBM, Platform, etc.

17

Encompassing Vision

data archives

sensor nets

computers

software

colleagues

instruments

18

People & Industry

Global Grid Forum

GGF2 260 Jul 01

GGF3 220 Oct 01

GGF4 400 Feb 02

GGF5 900 Jul 02

GGF6 450 Oct 02

GGF7 >1000 Mar 03

UK All Hands

AHM’02 350 Sep 02

GlobusWorld 1 450 Jan 03

IBM This week

“IBM DRIVES GRID COMPUTING FOR COMMERCIAL BUSINESS WITH TEN NEW GRID OFFERINGS”

Targets: Financial, Life Sciences, Automotive & Aerospace, Governments

Partners: Platform, DataSynapse, Avaki, Entropia, United Devices

IBM last 20 months

Leaders of OGSI

Development teams

Grid Jamboree

GGF

[Chart: vertical axis 0 to 900; horizontal axis GGF1 to GGF5]

19

20

High-Altitude Views

A Rallying Cry

Meeting a Hard Challenge requires Many Minds

Operating & Maintaining Infrastructure requires Many Hands & Many Companies

Another Stab at Distributed Computing

Hard Challenge: Intellectually and Practically Important

Dependable Ubiquity over Heterogeneity & Fallibility

An Ambitious Virtual Machine

Consistent large scale computational environments

A Global Operating System

Collective Resources, Common Management

21

An Architectural View

Grid Plumbing & Security Infrastructure

Scheduling Accounting Authorisation

Monitoring Diagnosis Logging

Application

Data & Compute Resources

Operations Teams

Distributed Providers

Application Users

Common Application Platform for Group of Applications

Application & Platform Developers

22

Open Grid Services Infrastructure

Confluence of Web Services & Grid

Consistent Interface Description

Based on WSDL 1.2 proposal

Extend Properties

Separate Binding from Interface

Function Composition & Inheritance

Exploit WS* Investment

Grid Features

Security

Life-Time Management

Service (state) Information via Data Elements

Discovery

Grouping

Notification

OGSI Version 1 Proposal at GGF7 (March 03)

23

Open Grid Services Architecture

Ubiquitous Building Blocks

Using OGSI Platform

Open & Extensible

Encourage Refactoring Experiments

Initially

The Globus 2 model

Except State Information now distributed

Example New Features

Global Name Mapping Service

Replication and Caching Service

Data Access & Integration

Metering, Logging, Authorisation, Charging, …

24

Grid Challenge

Balancing “Direct” Access to the “Platforms” with Abstraction & Virtualisation

Developers often have exploitable application knowledge

Automation necessary & helpful

Interface matching, operation validation, …

Optimisation at many scales

There isn’t enough effort to develop Languages & Abstractions

25

26

Data Integration

Data Resource 1

Data Resource 2

Scientist with Idea

1) Find Data

2) Extract Data

3) Transform Data

4) Combine Data

5) Interpret Data
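The five steps on this slide can be sketched as a small pipeline. This is only an illustration and not something taken from the talk: the two in-memory resources, their field names (`gene`, `expression`, `symbol`, `pathway`), the sample values and every function name are hypothetical.

```python
# Hypothetical stand-ins for "Data Resource 1" and "Data Resource 2".
RESOURCE_1 = [
    {"gene": "abc1", "expression": 2.5},
    {"gene": "xyz2", "expression": 0.4},
]
RESOURCE_2 = [
    {"symbol": "ABC1", "pathway": "lipid transport"},
    {"symbol": "DEF3", "pathway": "cell cycle"},
]


def find_data():
    """1) Find Data: locate the resources relevant to the idea."""
    return RESOURCE_1, RESOURCE_2


def extract_data(resource, fields):
    """2) Extract Data: keep only the fields of interest."""
    return [{f: row[f] for f in fields} for row in resource]


def transform_data(rows, key):
    """3) Transform Data: normalise the join key to upper case."""
    return [{**row, key: row[key].upper()} for row in rows]


def combine_data(left, left_key, right, right_key):
    """4) Combine Data: inner join of the two extracts on their keys."""
    index = {row[right_key]: row for row in right}
    return [
        {**row, **index[row[left_key]]}
        for row in left
        if row[left_key] in index
    ]


def interpret_data(rows, threshold):
    """5) Interpret Data: list pathways whose expression exceeds a threshold."""
    return sorted({r["pathway"] for r in rows if r["expression"] > threshold})


def integrate(threshold=1.0):
    """Run the five steps in order and return the interpreted result."""
    r1, r2 = find_data()
    expr = transform_data(extract_data(r1, ["gene", "expression"]), "gene")
    path = extract_data(r2, ["symbol", "pathway"])
    combined = combine_data(expr, "gene", path, "symbol")
    return interpret_data(combined, threshold)


if __name__ == "__main__":
    print(integrate())
```

Even in this toy form, the transform step is where the two resources' conventions have to be reconciled (here, only the case of the key), which is the part the later slides argue needs metadata and automation.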

27

Wellcome Trust: Cardiovascular Functional Genomics

Glasgow Edinburgh

Leicester

Oxford

London

Netherlands

Shared data

Public curated data

28

Oxford

Glasgow

Cardiff

Southampton

London

Belfast

Daresbury Lab

RAL

OGSA-DAI Partners

EPCC & NeSC

Newcastle

IBM USA

IBM Hursley

Oracle

Manchester

EPCC & NeSC, IBM UK, IBM USA, Manchester e-SC, Newcastle e-SC, Oracle

£3 million, 18 months, started February 2002

Cambridge

Hinxton

30

DAI Key Services

GridDataService (GDS): Access to data & DB operations

GridDataServiceFactory (GDSF): Makes GDS & GDSF

GridDataServiceRegistry (GDSR): Discovery of GDS(F) & Data

GridDataTranslationService (GDTS): Translates or Transforms Data

GridDataTransportDepot (GDTD): Data transport with persistence

Integrated Structured Data Transport

Relational & XML models supported

Role-based Authorisation

Binary structured files (later)

31

DAI Architecture

Grid Infrastructure

Scheduling Accounting

Monitoring Diagnosis

Data Intensive Applications for Science X

Compute, Data & Storage Resources

Distributed

Authorisation

Data Access Services

Data Integration Services

Structured Data

Simulation, Analysis & Integration Technology for Science X

Data Intensive X Scientists

Data Integration Architecture

GridFTP Naming Caching

Generic Virtual Data Access and Integration Technology

32

1a. Request to Registry for sources of data about “x”

1b. Registry responds with Factory handle

2a. Request to Factory for access to database

2b. Factory creates GridDataService to manage access

2c. Factory returns handle of GDS to client

3a. Client queries GDS with XPath, SQL, etc

3b. GDS interacts with database

3c. Results of query returned to client as XML

SOAP/HTTP

service creation

API interactions

Registry

Factory

Grid Data Service

Client

XML / Relational database
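The registry, factory and service interaction in steps 1a to 3c can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the OGSA-DAI API: the class names follow the service names on the "DAI Key Services" slide, but the method names (`register`, `find`, `create_service`, `query`), the `protein` table and its sample rows are hypothetical, plain method calls stand in for SOAP/HTTP, and an in-memory SQLite database stands in for the relational database.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Stand-in for a GDS: manages access to one database (steps 3a-3c)."""

    def __init__(self, connection):
        self._connection = connection

    def query(self, sql, params=()):
        # 3b. The GDS interacts with the database on the client's behalf.
        cursor = self._connection.execute(sql, params)
        columns = [d[0] for d in cursor.description]
        # 3c. Results of the query are returned to the client as XML.
        root = ET.Element("results")
        for row in cursor:
            item = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(item, name).text = str(value)
        return ET.tostring(root, encoding="unicode")


class GridDataServiceFactory:
    """Stand-in for a GDSF: creates a GDS on request (steps 2a-2c)."""

    def __init__(self, connection):
        self._connection = connection

    def create_service(self):
        # 2b. Create a GDS to manage access; 2c. return its handle.
        return GridDataService(self._connection)


class GridDataServiceRegistry:
    """Stand-in for a GDSR: maps a topic to a factory (steps 1a-1b)."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find(self, topic):
        # 1a. Request for sources of data about a topic; 1b. factory handle.
        return self._factories[topic]


def build_demo_registry():
    """Set up one hypothetical data source and publish it in a registry."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE protein (name TEXT, length INTEGER)")
    connection.executemany(
        "INSERT INTO protein VALUES (?, ?)", [("alpha", 393), ("beta", 110)]
    )
    registry = GridDataServiceRegistry()
    registry.register("protein", GridDataServiceFactory(connection))
    return registry


def run_client(registry):
    """Walk through steps 1a to 3c as the client would."""
    factory = registry.find("protein")   # 1a, 1b
    gds = factory.create_service()       # 2a, 2b, 2c
    return gds.query(                    # 3a, 3b, 3c
        "SELECT name, length FROM protein WHERE length > ?", (200,)
    )


if __name__ == "__main__":
    print(run_client(build_demo_registry()))
```

The point of the indirection is that the client only ever names a topic; which database answers, and how the service in front of it is created, stays behind the registry and factory.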

33

1a. Request to Registry for sources of data about “x” & “y”

1b. Registry responds with Factory handle

2a. Request to Factory for access to and integration of databases

2b. Factory creates GridDataServices network

2c. Factory returns handle of GDS to client

3a. Client submits set of queries to GDS with XPath, SQL, etc

3c. Results of queries returned to consumer as XML or binary

SOAP/HTTP

service creation

API interactions

Registry

Factory

Client

XML / Relational database

Consumer

XML / Relational database

GDS

GDS

GDS

GDS

GDS

3b. Tell consumer
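What distinguishes this slide from the previous one is that results go to a separate consumer and not back to the client. A minimal sketch of that delivery pattern follows, assuming hypothetical names throughout (`IntegratingService`, `Consumer`, `notify`, `deliver`, the tables `x` and `y` and their rows); in-memory SQLite databases stand in for the data resources, and plain Python tuples stand in for the XML or binary results mentioned on the slide.

```python
import sqlite3


def make_source(table, rows):
    """Create an in-memory database standing in for one data resource."""
    connection = sqlite3.connect(":memory:")
    connection.execute(f"CREATE TABLE {table} (id TEXT, value TEXT)")
    connection.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return connection


class Consumer:
    """Third party that receives the results in place of the client."""

    def __init__(self):
        self.expected = 0
        self.received = []

    def notify(self, count):
        # 3b. Tell the consumer how many result sets to expect.
        self.expected = count

    def deliver(self, rows):
        # 3c. Results of a query are returned to the consumer.
        self.received.append(rows)


class IntegratingService:
    """Stand-in for the network of GDS created by the factory (step 2b)."""

    def __init__(self, sources):
        self._sources = sources  # maps a source name to its connection

    def submit(self, queries, consumer):
        # 3a. The client submits a set of (source name, SQL) queries.
        consumer.notify(len(queries))
        for source_name, sql in queries:
            rows = self._sources[source_name].execute(sql).fetchall()
            consumer.deliver(rows)


def run_demo():
    """Submit one query per source and return the consumer that got them."""
    service = IntegratingService({
        "x": make_source("x", [("k1", "x-one"), ("k2", "x-two")]),
        "y": make_source("y", [("k1", "y-one")]),
    })
    consumer = Consumer()
    service.submit(
        [
            ("x", "SELECT id, value FROM x ORDER BY id"),
            ("y", "SELECT id, value FROM y ORDER BY id"),
        ],
        consumer,
    )
    return consumer


if __name__ == "__main__":
    demo = run_demo()
    print(demo.expected, demo.received)
```

Routing results straight to the consumer means large result sets never pass through the client that merely orchestrated the queries.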

34

Biomedical (or ANY) Data

Opportunities

Global Production of Published Data

Volume Diversity

Combination Analysis Discovery

Challenges

Data Huggers

Meagre metadata

Ease of Use

Automated, optimised integration

Traceability, Dependability

Opportunities

Specialised Indexing

Structurally varied replication

Consistent Structured Universe of Discourse

Data & Computation Integration

Challenges

Approximate Matching

Multi-scale optimisation

Bad habits / industrial structures

Safety and Multi-scale optimisation

35

Data Integration Challenges

High-Level Languages

Describing the Data Extraction Recipes

Describing the Sources & Components

Metadata that drives automation & validation

Mobility

Code & Data

Integrating Existing DB technology

Moving the DBMS to the Grid context

New Optimisation Challenges

Data & Computation & Storage & Movement

Shared Distributed Annotation Systems

How to Reference

Provenance & Acknowledgement

36

37

Challenges

A Programming & Development Model

Dependability at this Scale

Foundations for Trust

Raising the Level of Automation

Supporting New Forms of Collaboration

Data

38