Transcript of: Tony Doyle (University of Glasgow, [email protected]), "GridPP – From Prototype To Production", GridPP10 Meeting, CERN, 2 June 2004


Outline

• GridPP Project
• Introduction
• UK Context
• Components:
  A. Management
  B. Middleware
  C. Applications
  D. Tier-2
  E. Tier-1
  F. Tier-0
• Challenges:
  1. Middleware Validation
  2. Improving Efficiency
  3. Meeting Experiment Requirements
  4. ..via The Grid?
  5. Work Group Computing
  6. Events.. To Files.. To Events
  7. Software Distribution
  8. Distributed Analysis
  9. Production Accounting
  10. Sharing Resources
• Summary


GridPP – A UK Computing Grid for Particle Physics

GridPP

19 UK Universities, CCLRC (RAL & Daresbury) and CERN

Funded by the Particle Physics and Astronomy Research Council (PPARC)

GridPP1 - Sept. 2001-2004 £17m "From Web to Grid"

GridPP2 – Sept. 2004-2007 £16(+1)m "From Prototype to Production"


GridPP in Context

[Diagram: GridPP shown within the UK Core e-Science Programme, alongside the Institutes, the Tier-2 Centres, the Tier-1/A, CERN/LCG, EGEE, the Grid Support Centre, the Experiments, Middleware/Security/Networking, and Applications Development/Integration. Not to scale!]


GridPP1 Components (6/Feb/2004)

[Pie chart: GridPP1 funding split across CERN, DataGrid, Tier-1/A, Applications and Operations, with shares of £3.57m, £5.67m, £3.74m, £2.08m and £1.84m (pairings as in the original chart).]

• LHC Computing Grid Project (LCG): Applications, Fabrics, Technology and Deployment
• European DataGrid (EDG): Middleware Development
• UK Tier-1/A Regional Centre: Hardware and Manpower
• Grid Application Development: LHC and US Experiments + Lattice QCD
• Management, Travel etc.


GridPP2 Components (May 2004)

[Pie chart: GridPP2 funding split across Tier-1/A Hardware, Tier-1/A Operations, Tier-2 Operations, Applications, M/S/N, LCG-2, Management/Travel and Operations, with shares of £0.75m, £2.62m, £3.02m, £0.88m, £0.69m, £2.75m, £2.79m, £1.00m and £2.40m (pairings as in the original chart).]

A. Management, Travel, Operations
B. Middleware, Security and Network Development
C. Grid Application Development: LHC and US Experiments + Lattice QCD + Phenomenology
D. Tier-2 Deployment: 4 Regional Centres - M/S/N support and System Management
E. Tier-1/A Deployment: Hardware, System Management, Experiment Support
F. LHC Computing Grid Project (LCG Phase 2) [review]


A. GridPP Management

[Organogram of the GridPP1 (GridPP2) management structure: Collaboration Board, Project Management Board, Project Leader, Project Manager, Technical (Deployment) Board, Experiments (User) Board, (Production Manager), (Dissemination Officer), and liaison with GGF, LCG, EDG (EGEE) and UK e-Science, supported by a Project Map and Risk Register.]


GridPP PMB Who’s Who

CB Chair: Steve Lloyd
Project Leader: Tony Doyle
Deputy Project Leader: John Gordon
Project Manager: Dave Britton
User Board Chair: Roger Barlow
Deployment Board Chair: Dave Kelsey
Applications Coordinator: Roger Jones
Middleware Coordinator: Robin Middleton
Tier-2 Board Chair: Steve Lloyd
Tier-1 Board Chair: Tony Doyle
Production Manager: Jeremy Coles
Dissemination Officer: Sarah Pearce
CERN Liaison: Tony Cass
UK e-Science Liaison: Neil Geddes
GGF Liaison: Pete Clarke
PPARC Head of e-Science: Guy Rickett

"External input": "authority" via the Collaboration Board; "reporting" via the Project Manager; "strategic" from the User Board and Deployment Board; "external" from the Dissemination Officer and liaison members.

Roles: http://ppewww.ph.gla.ac.uk/~doyle/gridpp2/roles/
Context: http://www.gridpp.ac.uk/pmb/docs/PMB-36-Work_Areas-1.4.doc


A. Management Structure, in LCG Context

[Diagram: the CB and PMB above the Deployment Board (Tier-1/Tier-2, testbeds, rollout; service specification and provision) and the User Board (requirements, application development, user feedback), with the middleware work areas (Metadata, Workload, Storage, Network, Security, Information and Monitoring), set against LCG, EGEE, ARDA and the experiments.]


A. GridPP Management: Staff Effort

A. Management, Travel, Operations

GridPP2 Roles                       FTE
Project Leader + Admin. Assistant   0.67
Project Manager                     0.9
CB and Tier-2 Board Chair           0.5
Applications Coordinator            0.5
Middleware Coordinator              0.5
DB Chair                            0.5
Total                               3.57

GridPP2 Roles                       FTE
Production Manager                  1.0
Dissemination Officer               1.0
Total                               2.0

Reporting lines: the Production Manager reports via the Deputy Project Leader to the EGEE SA1 Infrastructure activity; the Dissemination Officer via the Project Manager and, partially, to the EGEE NA2 Dissemination activity.


GridPP2 Project: Managing the Middleware
B. Middleware, Security and Network Development

[Diagram: the project's four layers (I. Experiment Layer, II. Application Middleware, III. Grid Middleware, IV. Facilities and Fabrics) together with the User Board and Deployment Board (Tier-1/Tier-2, testbeds, rollout; service specification and provision; requirements, application development, user feedback), and the middleware work areas (Metadata, Workload, Storage, Network, Security, Information and Monitoring), shown in the context of the PMB, LCG, EGEE, ARDA and the experiments.]


B. Middleware, Security and Network Development

M/S/N builds upon UK strengths as part of international development:
• Configuration Management
• Storage Interfaces
• Network Monitoring
• Security
• Information Services
• Grid Data Management

[Grouped in the original diagram under Middleware, Security and Networking.]


B. Middleware, Security and Network Development: Staff Effort

GridPP2 Work Area          PPARC funding   Other funding
Metadata                   1.0             0.0
Storage Management         2.0             0.0
Workload Management        1.0             3.0*
Security                   3.5             0.0
Information & Monitoring   4.0             4.0
Network Sector             2.0             3.0
LHC Applications           1.0             0.0
Totals                     14.5            10.0

Reporting line: via the Middleware Coordinator and also to the LCG/EGEE JRA1 Middleware area, if agreed, within the LCG/EGEE work areas.


C. Application Development

[Diagram: the SAM data-handling architecture, from client applications and the request formulator/planner, through collective services (catalog protocols, significant event logger, naming service, database and catalog managers, SAM resource management, batch systems (LSF, FBS, PBS, Condor), data mover, job services, storage/job/cache/request managers), connectivity and resource protocols (CORBA, UDP; file transfer: ftp, bbftp, rcp, GridFTP; mass storage: e.g. encp, hpss), and authentication and security (GSI, SAM-specific user/group/node/station registration, bbftp 'cookie'), down to the fabric (compute elements, disk and tape storage elements, LANs and WANs, resource and services catalog, replica catalog, metadata catalog, code repository). Names in quotes are SAM-given software component names; marked components are to be replaced or added/enhanced using PPDG and Grid tools.]

Application areas: GANGA, SAMGrid, Lattice QCD, AliEn → ARDA, CMS, BaBar.


C. Application Development: Staff Effort

C. Grid Application Development: LHC and US Experiments + Lattice QCD + Phenomenology

GridPP2 Work Area     FTE
ATLAS/LHCb (GANGA)    2.0
ATLAS                 2.5
BaBar                 2.0
CDF/D0 (SAM)          2.0
CDF                   1.0
CMS                   3.0
D0                    1.0
LHCb                  2.0
Portal                1.0
UKQCD                 1.0
PhenoGrid             1.0
Total                 18.5

Reporting line: via the Applications Coordinator.


D. UK Tier-2 Centres

NorthGrid ****: Daresbury, Lancaster, Liverpool, Manchester, Sheffield
SouthGrid *: Birmingham, Bristol, Cambridge, Oxford, RAL PPD, Warwick
ScotGrid *: Durham, Edinburgh, Glasgow
LondonGrid ***: Brunel, Imperial, QMUL, RHUL, UCL

Current UK status: 10 sites via LCG (2 at RAL)


D. The UK Testbed: Hidden Sector


D. UK Tier-2 Centres: Staff Effort

D. Tier-2 Deployment: 4 Regional Centres - M/S/N support and System Management

GridPP2 Work Area           FTE
Security                    1.0
Resource Broker             1.0
Network                     0.5
Data Management             1.0
Storage Management          1.0
VO Management               0.5
ScotGrid Operations         1.0
NorthGrid Operations        4.5
Southern Grid Operations    1.0
London Grid Operations      2.5
Total                       14.0 [+4.0]

Reporting line: via the Tier-2 Board Chair for Operations staff. UK Support Posts report via the Production Manager and also to the Deputy Project Leader for the EGEE SA1 Infrastructure activity.


E. The UK Tier-1/A Centre

• High quality data services
• National and international role
• UK focus for international Grid development

[Chart: usage by experiment: LHCb, ATLAS, CMS, BaBar.]

April 2004:
• 700 dual-CPU nodes
• 80 TB disk
• 60 TB tape (capacity 1 PB)

Grid Operations Centre


E. The UK Tier-1/A Centre: Staff Effort

E. Tier-1/A Deployment: Hardware, System Management, Experiment Support

GridPP2 Work Area   PPARC funding   CCLRC funding
CPU                 2.0             0.0
Disk                1.5             0.0
Tape                1.5             1.0
Core Services       1.5             0.5
Operations          2.0             0.5
Networking          0.0             0.5
Deployment          2.0             0.0
Experiments         2.0             0.0
Management          1.0             0.5
Totals              13.5            3.0

Reporting line: via the Tier-1 Manager to the Tier-1/A Board.


Real Time Grid Monitoring

LCG2, 1 June 2004


E. Grid Operations

• Grid Operations Centre - Core Operational Tasks:
  - Monitor infrastructure, components and services (a toy probe is sketched below)
  - Troubleshooting
  - Verification of new sites joining the Grid
  - Acceptance tests of new middleware releases
  - Verify suppliers are meeting SLAs
  - Performance tuning and optimisation
  - Publishing usage figures and accounts
  - Grid information services
  - Monitoring services
  - Resource brokering
  - Allocation and scheduling services
  - Replica data catalogues
  - Authorisation services
  - Accounting services

• Grid Support Centre - Core Support Tasks:
  - Running the UK Certificate Authority
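The monitoring and site-verification tasks above lend themselves to simple automated probes. The following is a minimal sketch of that idea, not GridPP's actual tooling; the host names and ports are hypothetical placeholders.

```python
import socket

# Hypothetical service endpoints an operations centre might probe.
# Real GridPP/LCG monitoring used dedicated tools; this only illustrates the idea.
SERVICES = {
    "gatekeeper": ("ce.example-tier2.ac.uk", 2119),
    "gridftp": ("se.example-tier2.ac.uk", 2811),
    "info-service": ("bdii.example-tier1.ac.uk", 2170),
}

def is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def site_report(services=SERVICES) -> dict:
    """Probe each service once and return a name -> up/down map."""
    return {name: is_reachable(host, port) for name, (host, port) in services.items()}

if __name__ == "__main__":
    for name, up in site_report().items():
        print(f"{name:15s} {'UP' if up else 'DOWN'}")
```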


E. Grid Operations: Staff Effort

GridPP2 Work Area     FTE
Tier-2 Coordinators   +4.0
Operation Centre      +3.0
Documentation         +1.0
Other                 +1.5
Total                 +9.5

Reporting line: via the Deputy Project Leader to the EGEE SA1 Infrastructure activity.


F. Tier 0 and LCG: Foundation Programme

• Aim: build upon Phase 1

• Ensure development programmes are linked

• Project management: GridPP and LCG

• Shared expertise

• LCG establishes the global computing infrastructure

• Allows all participating physicists to exploit LHC data

• Earmarked UK funding to be reviewed in Autumn 2004

Required Foundation: LCG Fabric, Technology and Deployment

F. LHC Computing Grid Project (LCG Phase 2) [review]


The Challenges Ahead I: Implementing the Validation Process

[Diagram: the middleware certification pipeline. Work Packages add unit-tested code to the repository; nightly builds and automated tests run against a Development Testbed (~15 CPU); tagged releases selected for certification undergo Grid certification and application certification on a Certification Testbed (~40 CPU), with problem reports fed back for fixing; certified releases selected for deployment go to an Application Testbed (~1000 CPU) as certified public releases for use by the applications, 24x7. The Build System, Test Group, Integration Team and Applications Representatives carry out individual WP tests, integration and overall release tests, producing release candidates, tagged releases and certified releases. The process defines test frameworks, test support, test policies, test documentation and test platforms/compilers.]
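As a rough illustration of the promotion logic in the pipeline above (not the actual LCG/EDG build system), a release only moves to the next testbed if every stage at the previous level passes; the stage names and outcomes below are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Release:
    tag: str
    passed_stages: list = field(default_factory=list)

# Ordered promotion path, loosely following the slide: WP unit tests ->
# nightly build/auto tests -> Grid certification -> application certification.
STAGES = ["unit", "nightly-build", "grid-certification", "app-certification"]

def run_stage(release: Release, stage: str, results: dict) -> bool:
    """Record the outcome of one (simulated) test stage."""
    ok = results.get(stage, False)
    if ok:
        release.passed_stages.append(stage)
    return ok

def certify(release: Release, results: dict) -> str:
    """Promote a tagged release stage by stage; stop at the first failure."""
    for stage in STAGES:
        if not run_stage(release, stage, results):
            return f"{release.tag}: held back at '{stage}' (problem report filed)"
    return f"{release.tag}: certified public release for the application testbed"

if __name__ == "__main__":
    outcomes = {"unit": True, "nightly-build": True,
                "grid-certification": True, "app-certification": False}
    print(certify(Release("lcg2-2004-06"), outcomes))
```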


The Challenges Ahead II: Improving Grid “Efficiency”

[Plot: efficiency (successful jobs / jobs submitted, 0% to 100%) by month from Dec 2002 to Feb 2004, for CMS (EDG v1.4), ATLAS (EDG v1.4), LHCb (EDG v1.4), LCG1 (EDG v2.0) and the EDG application testbed v2.x.]
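The efficiency plotted above is simply successful jobs divided by submitted jobs per month. A minimal sketch of that calculation, with made-up counts standing in for the real monitoring data:

```python
# (month, submitted, successful) tuples; the numbers below are placeholders,
# not the measured EDG/LCG figures from the plot.
job_counts = [
    ("2003-01", 1200, 540),
    ("2003-02", 1500, 700),
    ("2004-02", 2000, 1500),
]

def monthly_efficiency(counts):
    """Return month -> successful/submitted as a fraction (0.0 when nothing ran)."""
    eff = {}
    for month, submitted, successful in counts:
        eff[month] = successful / submitted if submitted else 0.0
    return eff

if __name__ == "__main__":
    for month, e in monthly_efficiency(job_counts).items():
        print(f"{month}: {e:.0%}")
```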


The Challenges Ahead III: Meeting Experiment Requirements (UK)

[Charts: UK CPU requirement (kSI2000, 0 to 12,000) and disk requirement (TB, 0 to 2,500) by year, 2004 to 2007, broken down by experiment (ATLAS, CMS, LHCb, ALICE, Phenomenology, ZEUS, UKQCD, UKDMC, MINOS, MICE, LISA, D0, CRESST, CDF, BaBar, ANTARES), LHC versus non-LHC.]

Total requirement:

Year             2004   2005   2006   2007
CPU [kSI2000]    2395   4066   6380   9965
Disk [TB]         369    735   1424   2285
Tape [TB]         376    752   1542   2623

In international context: Q2 2004 LCG resources.
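The totals above grow by factors of roughly four (CPU) to seven (tape) between 2004 and 2007. A small sketch computing the year-on-year and overall growth factors directly from the table:

```python
# UK total requirements, taken from the table above.
requirements = {
    "CPU [kSI2000]": {2004: 2395, 2005: 4066, 2006: 6380, 2007: 9965},
    "Disk [TB]":     {2004: 369,  2005: 735,  2006: 1424, 2007: 2285},
    "Tape [TB]":     {2004: 376,  2005: 752,  2006: 1542, 2007: 2623},
}

def growth_factors(series: dict) -> list:
    """Return [(year, factor vs previous year), ...] for an ordered year->value map."""
    years = sorted(series)
    return [(y, series[y] / series[prev]) for prev, y in zip(years, years[1:])]

if __name__ == "__main__":
    for name, series in requirements.items():
        yearly = ", ".join(f"{y}: x{f:.2f}" for y, f in growth_factors(series))
        overall = series[2007] / series[2004]
        print(f"{name}: {yearly}; 2004 -> 2007: x{overall:.1f}")
```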


The Challenges Ahead IV: Using (Anticipated) Grid Resources

[Figure: dynamic Grid optimisation over the JANET network.]

2004: ~7,000 1 GHz CPUs, ~400 TB disk
2007: ~30,000 1 GHz CPUs, ~2,200 TB disk
(note x2 scale change)


The Challenges Ahead V: Work Group Computing


The Challenges Ahead VI: Events.. to Files.. to Events

[Diagram: event data (RAW, ESD, AOD, TAG) and an "Interesting Events List" mapped onto data files distributed across Tier-0 (International), Tier-1 (National), Tier-2 (Regional) and Tier-3 (Local) centres, for events 1, 2 and 3.]

• VOMS-enhanced Grid certificates to access databases via metadata (a toy lookup is sketched below)
• Non-Trivial..
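The step from events back to files is essentially a metadata lookup: a TAG-level selection yields an event list, which must then be resolved to the files (and the sites holding them) that contain the corresponding data. A toy sketch under that assumption, with invented event identifiers, attributes and file names:

```python
# Toy event-level metadata: event id -> (TAG attributes, file holding its AOD record).
# All identifiers and attributes here are invented for illustration only.
tag_db = {
    1: {"n_jets": 4, "met_gev": 120.0, "aod_file": "aod_001.root"},
    2: {"n_jets": 2, "met_gev": 15.0,  "aod_file": "aod_001.root"},
    3: {"n_jets": 5, "met_gev": 210.0, "aod_file": "aod_002.root"},
}

# Toy replica catalogue: logical file name -> sites holding a copy.
replica_catalogue = {
    "aod_001.root": ["RAL-Tier1", "Glasgow-Tier2"],
    "aod_002.root": ["CERN-Tier0", "RAL-Tier1"],
}

def select_events(predicate):
    """Run a TAG-level selection and return the 'interesting events list'."""
    return [evt for evt, attrs in tag_db.items() if predicate(attrs)]

def files_for(events):
    """Map selected events to the files (and replica sites) needed to read them."""
    files = {tag_db[evt]["aod_file"] for evt in events}
    return {f: replica_catalogue.get(f, []) for f in files}

if __name__ == "__main__":
    interesting = select_events(lambda a: a["met_gev"] > 100)
    print("events:", interesting)
    print("files:", files_for(interesting))
```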


The Challenges Ahead VII: Software Distribution

• ATLAS Data Challenge (DC2) this year to validate the world-wide computing model
• Packaging, distribution and installation. Scale: one release build takes 10 hours and produces 2.5 GB of files
• Complexity: 500 packages, millions of lines of code, 100s of developers and 1000s of users
  - The ATLAS collaboration is widely distributed: 140 institutes, all wanting to use the software
  - Needs 'push-button', easy installation.. (see the sketch after the diagram below)

[Diagram: Step 1, Monte Carlo Data Challenges: physics models drive detector simulation, producing Monte Carlo truth data and MC raw data; reconstruction then yields MC event summary data and MC event tags. Step 2, Real Data: the trigger system and Level 3 trigger feed data acquisition, producing raw data and trigger tags; reconstruction yields event summary data (ESD) and event tags, using calibration data and run conditions.]
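'Push-button' installation ultimately requires a machine-checkable release manifest: a list of packages or files with versions and checksums that an install tool can verify before declaring a site usable. A minimal sketch of that verification step, assuming a simple path-to-checksum manifest (file names and checksums are placeholders, not the ATLAS release layout):

```python
import hashlib
from pathlib import Path

# A release manifest as an install tool might ship it: relative path -> expected SHA-256.
# Entries here are placeholders; a real release would list thousands of files.
MANIFEST = {
    "lib/libExample.so": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "share/config.xml":  "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large release files need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_release(install_root: str, manifest: dict = MANIFEST) -> list:
    """Return a list of missing or corrupted files; an empty list means the install is good."""
    problems = []
    root = Path(install_root)
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(target) != expected:
            problems.append(f"checksum mismatch: {rel_path}")
    return problems

if __name__ == "__main__":
    print(verify_release("/opt/example-release") or "release verified")
```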


The Challenges Ahead VIII: Distributed Analysis
Complex workflow… LCG/ARDA Development

1. AliEn (ALICE Grid) provided a pre-Grid implementation [Perl scripts]
2. ARDA provides a framework for PP application middleware


The Challenges Ahead IX: Production Accounting

• Online monitoring
• Automatic accounting
• Meeting LCG and other requirements

GridPP Grid Report for Tue, 1 Jun 2004 14:00:47 +0100
CPUs Total:            1055
Hosts up:              442
Hosts down:            82
Avg Load (15, 5, 1m):  33%, 35%, 36%
Localtime:             2004-06-01 14:00

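Production accounting boils down to aggregating per-job usage into per-site and per-VO (experiment) totals, in the spirit of the report above. A toy aggregation over invented usage records:

```python
from collections import defaultdict

# Invented per-job usage records: (site, vo, cpu_hours). Not real GridPP accounting data.
usage_records = [
    ("RAL-Tier1", "atlas", 120.0),
    ("RAL-Tier1", "lhcb", 45.5),
    ("Glasgow-Tier2", "atlas", 30.0),
    ("Imperial-Tier2", "cms", 80.0),
]

def totals_by(key_index: int, records=usage_records) -> dict:
    """Sum CPU hours grouped by one field of the record (0 = site, 1 = VO)."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)

if __name__ == "__main__":
    print("per site:", totals_by(0))
    print("per VO:  ", totals_by(1))
```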

The Challenges Ahead X: Sharing… MoUs, Guidelines and Policies

• Disk/CPU resources allocated to each "group"
• The Grid is based on distributed resources; a "group" is an experiment
• An institute is typically involved in many experiments
• Institutes define priorities on computing resources via OPEN policy statements*
• All jobs submitted via Globus authentication; certificates identified by user and experiment

• Need to implement Grid "priority" (a toy fair-share ordering is sketched below):
  - What is the minimum amount of data to deliver at a time for a job?
  - Where should files be stored?
  - Which data access/storing activities have the highest priority?
  - How are resources shared among groups?
  - What if users belong to multiple groups?
  - How many jobs per group are allowed?
  - What processing activities are allowed at each site?
  - To which sites should data access and processing activities be sent?
  - How should the resources of a local cluster of PCs be shared among groups?

• Tier-2 discussion prior to the Collaboration Meeting… issues will arise which require ALL Tier centres to define/sign up to an MoU and publish a policy (see Steve's talk)

* Implemented by site administrators, with OPEN policies defined at each site based on e.g. the case made to the funding authority.

What's new? The ability to monitor and allocate unused resources. We will be judged by how well we work as a set of Virtual Organisations.
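Many of the policy questions above reduce to some form of fair share: compare each group's recent usage with its agreed allocation and favour the group furthest below its share. A toy ordering function under that assumption, with invented allocations and usage figures:

```python
# Agreed shares and recent usage (e.g. CPU hours) per experiment "group".
# Numbers are invented to illustrate the ordering, not GridPP allocations.
allocation = {"atlas": 0.40, "cms": 0.30, "lhcb": 0.20, "babar": 0.10}
recent_usage = {"atlas": 900.0, "cms": 300.0, "lhcb": 250.0, "babar": 50.0}

def fair_share_order(alloc: dict, usage: dict) -> list:
    """Order groups so the one furthest below its share (usage/allocation) comes first."""
    total = sum(usage.values()) or 1.0
    def pressure(group):
        return (usage.get(group, 0.0) / total) / alloc[group]
    return sorted(alloc, key=pressure)

if __name__ == "__main__":
    # With these numbers BaBar is furthest below its share, ATLAS furthest above.
    print("scheduling order:", fair_share_order(allocation, recent_usage))
```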


GridPP Summary: From Web to Grid
GridPP – Theory and Experiment

• UK GridPP started 1/9/01
• EU DataGrid: first middleware ~1/9/01; development requires a testbed with feedback - an "Operational Grid"
• Fit into UK e-Science structures
• Experience in distributed computing is essential to build and exploit the Grid
• Scale in the UK? 0.5 PBytes and 2,000 distributed CPUs

GridPP in Sept 2004:
• Grid jobs are being submitted now.. the user feedback loop is important..
• All experiments have immediate requirements
• In current experiment production, "The Grid" is a small component
• Non-technical issues: recognising context, building upon expertise, defining roles, sharing resources
• The major deployment activity is LCG/EGEE: we contribute significantly to LCG and our success depends critically on LCG
• A "Production Grid" will be difficult to realise: GridPP2 planning is underway as part of LCG/EGEE
• Work Areas and Roles are defined
• Many challenges ahead..


GridPP Summary: From Prototype to Production

[Timeline diagram, 2001 → 2004 → 2007: from separate experiments, resources and multiple accounts (BaBar, D0, CDF, ATLAS, CMS, LHCb, ALICE; 19 UK Institutes; RAL Computer Centre; CERN Computer Centre), through prototype Grids (SAMGrid, BaBarGrid, EDG, GANGA, LCG, EGEE, ARDA; UK Prototype Tier-1/A Centre, 4 UK Prototype Tier-2 Centres, CERN Prototype Tier-0 Centre), to 'One' Production Grid (LCG; UK Tier-1/A Centre, 4 UK Tier-2 Centres, CERN Tier-0 Centre).]