Tony Doyle - University of Glasgow, 4 July 2005, GridPP13 Collaboration Meeting: GridPP Overview


Page 1:

4 July 2005 GridPP13 Collaboration Meeting Tony Doyle - University of Glasgow

GridPP Overview

Tony Doyle

Page 2:

Contents

• Technical Design Reports
• Timescales
• Oversight Committee Summary
  – Current concerns
  – Actions (and how these were addressed)
  – Feedback from the July 1 (OC7) meeting
• “Get Fit” Plan and Problem Solving
• Beyond GridPP2..

Page 3:

June Reports

Computing Technical Design Reports:
http://doc.cern.ch/archive/electronic/cern/preprints/lhcc/public/

ALICE: lhcc-2005-018.pdf
ATLAS: lhcc-2005-022.pdf
CMS: lhcc-2005-023.pdf
LHCb: lhcc-2005-019.pdf
LCG: lhcc-2005-024.pdf

LCG Baseline Services Group Report:
http://cern.ch/LCG/peb/bs/BSReport-v1.0.pdf

Contains all you (probably) need to know about LHC computing

Page 4:

Timescales

• Service Challenges – UK deployment plans

Page 5:

Table columns: Functionality Requirement, OMII, VDT/GT, LCG/gLite, Other, Comment

• Storage Element: Yes; SRM via dCache, DPM or CASTOR. Comment: LCG includes Storage Resource Management capability
• Basic File Transfer: Yes; GridFTP; Yes. Comment: LCG includes GridFTP
• Reliable File Transfer: File Transfer Service. Comment: FTS is built on top of GridFTP
• Catalogue Services: RLS; LCG File Catalogue, gLite FireMan. Comment: Central catalogues adequate, high throughput needed
• Data Management tools: OMII Data Service (upload / download); LCG tools (replica management, etc.). Comment: gLite File Placement Service under development
• Compute Element: OMII Job Service; Gatekeeper; Yes. Comment: LCG uses Globus with mods
• Workload Management: Manual resource allocation & job submission; Condor-G; Resource Broker. Comment: RB builds on Globus, Condor-G
• VO Agents: Comment: Perform localised activities on behalf of VO
• VO Membership Services: Tools for account management, no GridMapFile equivalent; CAS; VOMS. Comment: CAS does not provide all the needed functionality
• DataBase Services: MySQL, PostgreSQL, ORACLE. Comment: Off-the-shelf offerings are adequate
• Posix-like I/O: GFAL, gLite I/O; Xrootd
• Application Software Installation Tools: Yes. Comment: Tools already exist in LCG-2, e.g. PACMAN
• Job Monitoring: Monalisa, Netlogger; Logging & Bookkeeping service, R-GMA
• Reliable Messaging: Comment: Tools such as Jabber are used by experiments (e.g. DIRAC for LHCb)
• Information System: MDS (GLUE); Yes; BDII. Comment: LCG based on BDII and GLUE schema

Fits on a page.
Concentrate on robustness and scale.
Experiments have assigned priorities.

Page 6:

July Documents

PPARC Oversight Committee Papers
Seventh GridPP Oversight Committee (July 2005)

• Executive Summary
• Project Map Link to Project Map Database (Excel) Version (v2)
• Resource Report
• LCG Report
• EGEE Report
• Deployment Report
• Middleware/Security/Network Report
• Applications Report
• User Board Report
• Tier-1/A Report
• Tier-2 Report
• Dissemination Report
• UK Analysis
• Metrics and Deployment
• Middleware Planning
• Experiment engagement questionnaire

See http://www.gridpp.ac.uk/docs/oversight/

Addressed various concerns of the OC

Page 7:

Exec2 Summary

• GridPP2 has already met 21% of its original targets with 86% of the metrics within specification

• “Get fit” plan described (requested by OC)

• gLite 1 was released in April as planned but components have not yet been deployed or their robustness tested by the experiments

• Service Challenge (SC) 2 addressing networking was a success at CERN and the Tier-1

• SC3 addressing file transfers for the experiments is about to commence

• Long-term concern: hardware at the Tier-1 in 2007-08

• Short-term concerns: under-utilisation of resources and the deployment of Tier-2 resources

Page 8:

RAL joins labs worldwide in successful Service Challenge 2

• The GridPP team at Rutherford Appleton Laboratory (RAL) in Oxfordshire recently joined computing centres around the world in a networking challenge that saw RAL transfer 60 terabytes of data over a ten-day period. A home user with a 512 kilobit per second broadband connection would be waiting 30 years to complete a download of the same size.
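The broadband comparison can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 kbit = 1000 bits) and a 365.25-day year; the slide states none of these conventions, and the function names are illustrative only.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY  # assumption: Julian year


def transfer_seconds(size_bytes: float, rate_bits_per_s: float) -> float:
    """Time needed to move size_bytes over a link sustaining rate_bits_per_s."""
    return size_bytes * 8 / rate_bits_per_s


def average_rate_bits_per_s(size_bytes: float, seconds: float) -> float:
    """Average rate implied by moving size_bytes in the given number of seconds."""
    return size_bytes * 8 / seconds


SC2_BYTES = 60e12  # 60 terabytes, decimal units assumed

home_years = transfer_seconds(SC2_BYTES, 512e3) / SECONDS_PER_YEAR
ral_rate_mbit = average_rate_bits_per_s(SC2_BYTES, 10 * SECONDS_PER_DAY) / 1e6

print(f"512 kbit/s home connection: {home_years:.1f} years")
print(f"RAL average over ten days: {ral_rate_mbit:.0f} Mbit/s")
```

Under these assumptions the home download takes a little under 30 years, consistent with the figure quoted on the slide, and the ten-day transfer corresponds to an average of roughly 0.5 Gbit/s.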

Page 9:

gLite 1

Page 10:

100 green sites sitting on a grid

• Thu 16 Jun 2005
• Last week the UK CIC-on-duty team celebrated the milestone of having 100 sites passing the Sites Functional Test. Thanks to all the sites who responded promptly to trouble tickets raised by the UK team during their shift.

Page 11:

Current concern 1. under-utilisation

• Under-utilisation of existing Tier-1/A resources

• improving overall and w.r.t. Grid fraction from 2004 to 2005

[Figure legend: Non-Grid, Grid]

Page 12:

Current concern 2. under-delivery

• The current situation is somewhat better than these 2005 Q1 numbers indicate

• Some late procurements (OK given under-utilisation)

• Technical problems (being overcome)

                   CPU (KSI2K)                    Disk (TB)
Site        Promised Delivered  Ratio    Promised Delivered  Ratio
Brunel            30         4    13%          21         0     0%
Imperial         420       384    91%          28         0     1%
QMUL             317       247    78%          29        25    88%
RHUL             204       167    82%          23         6    24%
UCL               60       108   180%           1         1   111%
Lancaster        510       101    20%          87         2     2%
Liverpool        605         0     0%          80         0     0%
Manchester      1305        65     5%         373         9     2%
Sheffield        183        29    16%           3         3   100%
Birmingham       136       103    76%           9         9    97%
Bristol           39        38    98%           2         2    99%
Cambridge         33        12    38%           4         4    80%
Oxford           119       102    85%          19        19   100%
RAL PPD           98        99   101%          12         6    51%
Warwick            0         0                  0         0
Durham            86        15    17%           5         5   100%
Edinburgh          7         5    74%          71         1     1%
Glasgow          246         1     1%          15         3    17%
London          1031       910    88%         102        32    31%
NorthGrid       2602       195     7%         543        14     3%
ScotGrid         340        22     6%          90         9     9%
SouthGrid        425       354    83%          46        39    85%
Total           4397      1481    34%         781        93    12%
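The regional rows appear to be sums over the member sites. As a minimal sketch, the London CPU row can be reproduced from the five London site rows; the grouping of sites into London is inferred from the totals rather than stated on the slide, and the helper name is illustrative.

```python
# (promised, delivered) CPU in KSI2K, copied from the table on this slide.
# Assumption: these five sites make up the "London" row.
london_cpu = {
    "Brunel": (30, 4),
    "Imperial": (420, 384),
    "QMUL": (317, 247),
    "RHUL": (204, 167),
    "UCL": (60, 108),
}


def delivery_ratio(sites):
    """Sum promised and delivered capacity and return the delivered fraction."""
    promised = sum(p for p, _ in sites.values())
    delivered = sum(d for _, d in sites.values())
    return promised, delivered, delivered / promised


promised, delivered, ratio = delivery_ratio(london_cpu)
print(f"London CPU: {delivered}/{promised} KSI2K = {ratio:.0%}")
```

The result, 910 of 1031 KSI2K or 88%, matches the London row, which suggests the per-site ratios can be aggregated in the same way for the other Tier-2s.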

Page 13:

Longer-Term concern: allocations

Disk (TB)
               2004   2005   2006   2007   2008   2009   2010
ALICE             5      1     10     13     26     46     80
ATLAS            27     68    257    508    887   1249   1892
CMS              40     80     74    128    227    343    503
LHCb             15     25    108    191    343    450    545
TOTAL           191    298    604   1100   1945   2633   3641
LHC TOTAL        87    174    450    841   1484   2087   3020
LHC Fraction    46%    58%    74%    76%    76%    79%    83%

CPU (kSI2k)
               2004   2005   2006   2007   2008   2009   2010
ALICE            14      1     24     24     48     84    147
ATLAS           400    400    529    801   1571   2593   3504
CMS              86    200    205    283    449    661    916
LHCb             90     50    222    384    644    868   1290
TOTAL           796   1282   1604   2167   3891   5516   7358
LHC TOTAL       590    651    980   1492   2712   4206   5857
LHC Fraction    74%    51%    61%    69%    70%    76%    80%

Tape (TB)
               2004   2005   2006   2007   2008   2009   2010
ALICE             4      1     10     13     26     46     80
ATLAS             0     14    150    377   1033   2026   2790
CMS              50    206    400    483    670   1148   1663
LHCb             15     30    104    207    346    714   1178
TOTAL           239    331    891   1280   2316   4130   5944
LHC TOTAL        69    250    664   1080   2074   3934   5710
LHC Fraction    29%    76%    74%    84%    90%    95%    96%

Starting point: fair shares input to BaBar and LHC MoUs
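The "LHC TOTAL" and "LHC Fraction" rows follow from the per-experiment rows and the TOTAL row. A short sketch for the 2004 column, assuming that TOTAL also includes non-LHC users (which is why the fraction is below 100%); variable names are illustrative.

```python
# 2004 allocations as (disk TB, CPU kSI2k, tape TB), copied from the table.
alloc_2004 = {
    "ALICE": (5, 14, 4),
    "ATLAS": (27, 400, 0),
    "CMS": (40, 86, 50),
    "LHCb": (15, 90, 15),
}
total_2004 = (191, 796, 239)  # TOTAL row; assumed to include non-LHC users

# Column-wise sum over the four LHC experiments.
lhc_total = tuple(sum(col) for col in zip(*alloc_2004.values()))

# LHC share of each resource, rounded to whole percent.
lhc_fraction = tuple(
    round(100 * lhc / total) for lhc, total in zip(lhc_total, total_2004)
)

print(lhc_total)     # (87, 590, 69)
print(lhc_fraction)  # (46, 74, 29)
```

Some later columns differ from the slide by one unit (for example 2005 tape), presumably because the slide values were rounded before summing.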

Page 14:

Metrics and Deployment

• GridPP is a significant contributor to EGEE (20%)

• CPU utilisation is low

• Disk utilisation is climbing (but very low)

[Figure: UK % of total CPU (percentage contribution) vs. date, June 2004 to June 2005]

[Figure: % of job slots used vs. date, June 2004 to June 2005; series: % EGEE slots used, % UK slots used]

[Figure: UK storage used (TB disk storage used) vs. date]

Page 15:

Metrics and Deployment

• Sites upgrade improvements – quarterly upgrades within 3 weeks

• gradual improvement in site configuration and stability

• Reflects systematic approach and measurable improvements in deployment

[Figure: number of sites at each release (LCG-2_3_0, LCG-2_3_1, LCG-2_4_0) vs. date, 24/01/2005 to 06/06/2005]

[Figure: gstat metric value vs. date; series: EGEE, GridPP]

Page 16:

GridPP Deployment Status 2/7/05 (9/1/05)

        totalCPU      freeCPU       runJob     waitJob    seAvail TB     seUsed TB      maxCPU        avgCPU
Total   2966 (2029)   1666 (1402)   843 (95)   31 (480)   74.28 (8.69)   16.54 (4.55)   3145 (2549)   2802 (1994)

Measurable Improvements

Page 17:

Actions

GridPP to submit the proposal for LCG phase 2 funding to the Committee prior to its submission to Science Committee (minute 4.9).

• Done. 27 page report inc. input from OC at http://www.gridpp.ac.uk/docs/gridpp2/SC_GridPP2_LCG_1.0.doc
  unfunded

GridPP to clarify the situation with regard to ATLAS production run tests for the next physics workshop (minute 5.3).

• See News Item http://www.gridpp.ac.uk/news/-1119651840.463358.wlg
• (and slide)

GridPP to provide an update on progress resolving problems caused by mismatches between local batch systems and the capabilities of the grid Resource Broker (minute 6.3).

• (See slide)

GridPP to more fully document its alignment with each of the individual experiments (minute 15.2).

• An experiment engagement questionnaire has been used (initial input in February and further [updated] input in June). See http://www.gridpp.ac.uk/eb/workdoc/gridusebyexpts_0605.doc

Page 18:

ATLAS steps up Grid production

In a large-scale exercise in the weeks leading up to the workshop, about 8.5 million Monte Carlo simulated events were produced on the Grid. The events were produced using three Grids: LCG, Grid3 in the US and NorduGrid. Of the 65% produced on LCG, approximately one sixth used GridPP resources in the UK. A total of 573,315 jobs were run. On the best day 13k jobs ran, corresponding to a production rate of 7.5Hz. In parallel with all the major computing developments on the Grid, first real cosmic data has now been taken with the ATLAS detector in situ - see image. This is the first trickle in what will eventually become a torrent of data.

Page 19:

RB Action

GridPP to provide an update on progress resolving problems caused by mismatches between local batch systems and the capabilities of the grid Resource broker (minute 6.3).

• The problem of connecting the local CE to a batch queue is largely overcome – many (all shared) sites now do this.

• There were problems subsequently deploying the accounting system (APEL) to point to the local batch system.

• Overcome (13 of 18 sites), but not as straightforward as it could be.

• The JDL from the job is not passed to the local system. Hence there is no way for the local scheduler to use info from the Grid scheduler.

• This is a limitation from a (shared) site viewpoint (attempting to balance Grid and local jobs).

• The short term solution is to set up separate batch queues.

• It is not a limitation for the experiments (affects efficiency).

• It is noted as a requirement and it is intended that this will be delivered in Year 2 of JRA1 for the WMS.

Page 20:

Actions

GridPP to define its usage policy with respect to Tier-1 allocations (minute 15.4).

• See http://www.gridpp.ac.uk/docs/oversight/GridPP-PMB-57-Tier1A_1.0.doc and documents within (“fair shares” using PPARC Form X information)

GridPP to produce an updated risk register (minute 15.5).

• Incorporated in the new Project Map (with 7 “high” risks) at http://www.gridpp.ac.uk/pmb/ProjectManagement/GridPP2_ProjectMap_2.htm

GridPP to produce a “get-fit” plan for production metrics (minute 15.6).

• See Metrics and Deployment document http://www.gridpp.ac.uk/docs/oversight/GridPP-PMB-64-Metrics.doc and its incorporation into the Project Map

GridPP to define its metrics for job success (minute 15.7).

• Adopted EGEE-wide definition at http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php (See slides)

GridPP to produce a statement of intent regarding its adoption of gLite (minute 15.8).

• See Middleware Selection document http://www.gridpp.ac.uk/docs/oversight/GridPP-PMB-65-Middleware.doc

Page 21:

Metrics Action

GridPP to define its metrics for job success (minute 15.7).

• GridPP adopts the EGEE-wide definition at http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php

The (web-based) QA system accounts for Workload Management System registered job successes (that can then be categorised by Virtual Organisation or Resource Broker). Before introducing the figures it should be understood that there are caveats:

• It only measures what the WMS “sees”
  – doesn't catch failure of WMS to register job in the first place (but this is a rare occurrence)
  – if a job fails half way through the script (for example tries but fails to copy a file) but the script completes successfully then WMS sees everything as OK.
  – If a VO (e.g. LHCb) deploys an agent then the WMS only registers the success of the initial (python) script: strategy enables higher overall LHCb performance (combined push-PULL model). (This currently leads to other problems in overall accounting should contention become an issue).
  – Overall: an end user may see either:
    1. a worse efficiency
       • failed job for other hidden e.g. data management problems
    2. a better efficiency by
       • choosing selected sites according to the Site Functional Test performance index;
       • deploying an agent to initiate real jobs at sites where the agent succeeded.

• Physicists are “smart” and now “see” > 90% efficiency but the definition here is one defined within a given VO adopting their own methods (and from informed input from people currently submitting jobs to the system).
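The second caveat (a step fails inside the job script, yet the script still exits successfully) can be illustrated with a toy job wrapper. This is a sketch only: the step functions and both wrappers are hypothetical stand-ins, not part of any LCG or WMS tool, and integer return values play the role of process exit statuses.

```python
def setup():
    return 0  # exit status 0 means success


def copy_input_file():
    return 1  # simulated failure, e.g. a file copy that did not work


def analyse():
    return 0


STEPS = [setup, copy_input_file, analyse]


def run_ignoring_failures(steps):
    """Mimics a script that carries on after a failed step and exits 0."""
    for step in steps:
        step()  # the status is discarded, so the failure stays hidden
    return 0


def run_strict(steps):
    """Stops at the first failing step and reports its status to the caller."""
    for step in steps:
        status = step()
        if status != 0:
            return status
    return 0


print("lenient wrapper exit status:", run_ignoring_failures(STEPS))
print("strict wrapper exit status:", run_strict(STEPS))
```

A monitoring system that only sees the wrapper's final status would count the lenient job as a success, which is the over-counting effect the caveat describes.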

Page 22:

Overview

Integrated over all VOs and RBs for first half of 2005

Successes/Day: 13806
Success %: 64%

• Key point: Improving from 42% to 78% during 2005

[For the UK RB (lcgrb01.gridpp.rl.ac.uk)
Successes/Day: 319
Success %: 69%]

Page 23:

LHC VOs

                 ALICE   ATLAS   CMS    LHCb
Successes/Day    N/A     2796    452    3463
Success %        42%     83%     61%    68%

Page 24:

Other VOs

                 BaBar   CDF    D0     BioMed
Successes/Day    37      1      207    1074
Success %        76%     30%    84%    76%

PMB request: please enable the BioMed VO at your site

Page 25:

Interlude..

Angels & Demons introduces the character of Robert Langdon, professor of religious iconology and art history at Harvard University. As the novel begins, he's awakened in the middle of the night by a phone call from Maximilian Kohler, the director of CERN, the world's largest scientific research facility in Geneva, Switzerland. One of their top physicists, Leonardo Vetra, had been murdered, with his chest branded with the word "Illuminati." Leonardo Vetra created antimatter in canisters to simulate the Big Bang. Vetra's murder, though, allows one of the canisters to be stolen. Langdon and Vittoria Vetra are quickly sent off to Rome and Vatican City, to help find the canister and return it to CERN before it explodes at midnight...

Page 26:

Agents and Daemons

The jobs that are sent to the LCG-2 Resource Broker (RB) do not contain any particular LHCb job as payload, but are only executing a simple script, which downloads and installs a standard DIRAC agent. Since the only environment necessary for the agent to run is the Python interpreter, this is perfectly possible on all the LCG sites. This pilot-agent is configured to use the hosting Worker Node (WN) as a DIRAC CE. Once this is done, the WN is reserved for the DIRAC WMS and is effectively turned into a virtual DIRAC production site for the time of reservation. This way allowed for efficient use of the LCG resources during the DC 2004 (over 5000 concurrent jobs at peak) with a low effective failure rate, despite the rather high intrinsic failure rate of LCG (about 40%).
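The pull model described here can be sketched as a small simulation: pilots land on worker nodes, and only a pilot that starts successfully fetches a real job from the central queue, so a pilot failure never consumes a payload. The function, its arguments and the 40% failure pattern in the usage lines are illustrative assumptions, not DIRAC code.

```python
from collections import deque


def run_pilots(payloads, node_is_healthy):
    """Simulate pilots pulling payload jobs from a central task queue.

    payloads: real jobs waiting in the (hypothetical) central queue.
    node_is_healthy: one bool per pilot; False means the pilot itself
    failed on its worker node before it could ask for work.
    Returns (completed payloads, pilots lost, payloads still queued).
    """
    queue = deque(payloads)
    completed, pilots_lost = [], 0
    for healthy in node_is_healthy:
        if not healthy:
            pilots_lost += 1  # pilot lost, but no payload was attached to it
            continue
        if not queue:
            break  # nothing left to pull
        completed.append(queue.popleft())  # payload is fetched only now
    return completed, pilots_lost, list(queue)


# Ten pilots, four of which fail (mirroring a 40% intrinsic failure rate).
pilots = [True, False, True, True, False, True, False, True, False, True]
jobs = [f"job-{i}" for i in range(6)]
done, lost, waiting = run_pilots(jobs, pilots)
print(f"payloads completed: {len(done)}, pilots lost: {lost}, still queued: {len(waiting)}")
```

In this toy run four of ten pilots are lost, yet all six payloads complete, which is the sense in which the effective failure rate seen by the experiment can be much lower than the intrinsic one.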

Page 27:

The future for the experiments?

The technologies used in this production are based on C++ (LHCb software), Python (DIRAC tools), Jabber/XMPP (instant messaging protocol used for reliable communication between components of the central services) and XML-RPC (the protocol used to communicate between jobs and central services). ORACLE and MySQL are the two databases behind all of the services. ORACLE was used for the production and bookkeeping databases, and MySQL for the workload management and AliEn FC systems.

This way allowed for efficient use of the LCG resources during the DC 2004 (over 5000 concurrent jobs at peak) with a low effective failure rate, despite the rather high intrinsic failure rate of LCG (about 40%).

Page 28:

OC Preliminary Feedback

ALL earlier actions were considered as “done” from OC perspective

GridPP to investigate alternative procurement strategies in order to improve Tier-1/A utilisation

Actions:
Tier-1/A Board
I. evaluate alternative approaches
User Board – THIS MEETING
II. improve experiment estimates

GridPP to associate more resources for technical documentation (for end users and system administrators)

Actions:
• Internal advertising: is anyone within GridPP willing/able to take up the role of “Documentation Officer”?
• (There will be an incentive for this)
• If this fails, to advertise the post using role description (being drafted)
Deployment Board – THIS MEETING

Page 29:

OC Preliminary Feedback

• GridPP to develop a deployment model that works for smaller T2 centres in association with CERN

• GridPP to provide a gap analysis for LCG (using the baseline services and the [classified] experiment components as described in the TDRs)

• GridPP to address UB questionnaire outcomes (perceptions as well as actual shortcomings)

• GridPP to document the high-level "value" GridPP is adding/delivering (using Project Map)

• OC8 in February 2006 “important” (not “G8 on Wednesday”)

Page 30:

[Figure: Project Map boxes for production metrics 0.100 to 0.147]

Production Grid Metrics

• Set SMART (Specific Measurable Achievable Realistic Time-phased) Goals

The “Get Fit” Plan

Page 31:

0
Status Date: 31-Mar-05
Owner: Jeremy Coles

Number Due Date Status

0.100 On going OK

0.101 On going OK

0.102 On going OK

0.103 On going OK

0.104 On going NOT OK

0.105 On going NOT OK

0.106 On going NOT OK

0.107 On going OK

0.108 On going NOT OK

0.109 On going OK

0.110 On going OK

0.111 On going NOT OK

0.112 On going NOT OK

0.113 On going NOT OK

0.114 On going NOT OK

0.115 On going OK

0.116 On going OK

0.117 OK

0.118 OK

0.119 OK

0.120 On going OK

0.121 On going OK

0.122 On going OK

0.123 On going OK

0.124 On going OK

0.125 On going OK

0.126 OK

0.127 OK

0.128 On going OK

0.129 OK

0.130 On going OK

0.131 On going NOT OK

0.132 On going OK

0.133 On going OK

0.134 On going OK

0.135 On going OK

0.136 On going OK

0.137 On going OK

0.138 On going OK

0.139 On going OK

0.140 On going OK

0.141 OK

0.142 On going OK

0.143 On going NOT OK

0.144 OK

0.145 On going OK

0.146 On going OK

0.147 On going OK

Production Metrics

Title

GridPP security audit
T1 participation in GOC service challenges
Sites comply with LCG/EGEE security policy
Number of GridPP (site) system security incidents in the last quarter
Number of EGEE Grid security incidents in the last quarter
Accumulated scheduled downtime in last quarter
Average number of sites per quarter available in VO selections (N/a)
GridPP helpdesk functioning adequately
Fraction of Site Functional Tests passed over the last quarter
GridPP deployment web-pages up-to-date
Training needs addressed
Tier-2s delivering to LCG MoU
Site operating system upgrades
Quarterly operational performance review
Tier-1 delivering to LCG MoU
Deployment team meetings
UK wide deployment support active
Tier-1 service disaster recovery plans up to date
Production service risks and issues log available and up to date
T1 meeting "other" user commitments
GridPP LCG middleware testbed operational
UB schedule implemented and upheld
UK contribution to LHC experiments
GridPP disk storage available
GridPP disk storage available to LCG/EGEE
GridPP Tape storage available
GridPP Tape storage available to LCG/EGEE
T1 participating in 3D database phases
T2s participation in GOC service challenges
GridPP participating in EGEE security challenges
UK contribution to non-LHC experiments
Job failure rates
Fraction of available Disk used in quarter
Fraction of available Tape used in quarter
Percentage of total jobs run via the Grid
Number of sites publishing LCG accounting data
Fraction of available KSI2K used in quarter
GridPP KSI2K Available to EGEE/LCG
Number of active users
Number of supported VOs
Number of LCG/EGEE Job Slots Published by UK
Fraction of LCG/EGEE Jobs Slots Used
GridPP KSI2K Available
T1 meeting JRA1 commitments
T1 meeting pre-production service commitments
T1 support for GOC
Fraction of UK sites in Production
Number of registered users

“I take it plea bargaining is out of the question?”

• See Dave’s talk

Page 32:

Our 14 problems…

– 0.104: Number of LCG/EGEE job slots published by the UK. The current total is 2477 and the target was 3000.
– 0.105: Number of LCG/EGEE jobs slots used. The current fraction is 19% compared to a target of 70%. This demonstrates that 0.104 above is clearly not an issue but that usage is presently low.
– 0.106: GridPP KSI2K available: By the end of March 2005 the combined Tier-1 and Tier-2 CPU power was expected to be 5184 KSI2K compared to 2277 KSI2K achieved. This number is dominated by the 4397 KSI2K expected from the Tier-2s which has been slowly becoming available.
– 0.108: GridPP disk storage available: Similar to 0.106 above. Only 280TB available compared to 968TB anticipated but the situation is improving.
– 0.111: GridPP tape storage made available to LCG/EGEE. At present the tape storage is being used but not really via the Grid route.
– 0.112: Fraction of available KSI2K used in quarter: at present a rough estimate shows about 42% of the available CPU was used compared to a target value of 70%.
– 0.113: Fraction of available disk used in quarter: This is estimated at 64% compared to the target of 70%.
– 0.114: Fraction of available Tape used in quarter: This is estimated at 61% compared to the target of 70%.
– 0.131: Tier-1 service disaster recovery plans up to date: This has not been updated within the last 6 months.
– 0.143: Accumulated scheduled downtime in the last quarter: The current value of 418 days is almost identical to the (current) target of 411 days. The metric expects the 25% figure to reduce to 5% by the third year.
– 3.6.3: LCG Deployment evaluation reports: first report due in March 05 was delayed to the second quarter.
– 5.2.4: Tier-2 Hardware realisation: This flags the same issue as 0.106 and 0.108 above. Tier-2 hardware has been delayed but the situation is improving.
– 5.2.7: Quarterly reports received within 1 month of the end of the quarter: The 05Q1 reports were received late. Some of the delay was due to the unfortunate timing of EGEE meetings.
– 6.2.11: Non-HEP applications tested on the GridPP Grid (submitted via the NGS submission mechanism). The NGS submission mechanism is not yet adequate.
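For the metrics above that quote both a measured value and a target, the size of each gap can be compared directly. The sketch below ranks them by relative shortfall (1 - measured/target); this ranking is an illustrative choice made here, not a GridPP procedure, and only metrics with two explicit numbers on the slide are included.

```python
# (metric id, measured, target), copied from the list on this slide.
metrics = [
    ("0.104", 2477, 3000),  # job slots published
    ("0.105", 0.19, 0.70),  # fraction of job slots used
    ("0.106", 2277, 5184),  # KSI2K available
    ("0.108", 280, 968),    # disk TB available
    ("0.112", 0.42, 0.70),  # fraction of available KSI2K used
    ("0.113", 0.64, 0.70),  # fraction of available disk used
    ("0.114", 0.61, 0.70),  # fraction of available tape used
]


def shortfall(measured, target):
    """Relative distance below target: 0.0 means on target, 1.0 means nothing delivered."""
    return 1 - measured / target


ranked = sorted(metrics, key=lambda m: shortfall(m[1], m[2]), reverse=True)
for metric_id, measured, target in ranked:
    print(f"{metric_id}: {shortfall(measured, target):.0%} below target")
```

On these numbers the largest relative gaps are job-slot usage (0.105) and disk availability (0.108), while disk usage (0.113) is closest to its target.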

Page 33:

The “Get Fit” Plan

• … not (yet) “The Final Solution”
• We hope this drives the right behaviour
• Plea bargaining is (probably) OK..

Page 34:

Some Problem Solving Strategies

Page 35:

Beyond GridPP2..

LHC EXPLOITATION PLANNING REVIEW

Input is requested from the UK project spokespersons, for ATLAS and CMS for each of the financial years 2008/9 to 2011/12, and for LHCb, ALICE and GridPP for 2007/8 to 2011/12.

Physics programme

Please give a brief outline of the planned physics programme. Please also indicate how this planned programme could be enhanced with additional resources. In total this should be no more than 3 sides of A4. The aim is to understand the incremental physics return from increasing resources.

Input will be based upon PPAP roadmap input E-Science and LCG-2 (26 Oct 2004) and feedback from CB (12 Jan & 7 July 2005)

Page 36:

Problem Solving and Improved Communication

• “Communication, in essence, is the shift of a particle from one part of space to another part of space. A particle is the thing being communicated. It can be an object, a written message, a spoken word or an idea. In its crudest definition, this is communication.
• This simple view of communication leads to the full definition:
• Communication is the consideration and action of impelling an impulse or particle from source-point across a distance to receipt-point, with the intention of bringing into being at the receipt-point a duplication and understanding of that which emanated from the source-point..”
• from The Scientology Handbook

This may be a clue to how we will overcome our problems

But we can always improve this..

Page 37:

Summary

• LHC Technical Design Reports define an endpoint
• Responsive-mode deployment/development
• Timescales for LHC are soon – first cosmics data taken
• Oversight Committee – improve “efficiency”
• Some particular issues:
  – Tier-1/A utilisation
  – Documentation Officer
• “Get Fit” plan endorsed by OC – requires support from everyone to improve metrics
  – There are 14 deployment problems (some interdependency) that need to be solved
  – Many areas are now quantifiable (significant progress here)
  – Service Challenges will help focus attention
  – Improved communication and documentation (become a scientologist?!)
• Aim: measured end-to-end performance improvements during 2005
• Beyond GridPP2: input required over the summer to PPARC LHC exploitation planning review