GridPP: Meeting the Particle Physics Computing Challenge
Tony Doyle, University of Glasgow
AHM05 Meeting, 21 September 2005
Contents
“The particle physicists are now well on their way to constructing a genuinely global particle physics Grid to enable them to exploit the massive data streams expected from the Large Hadron Collider in CERN that will turn on in 2007.” Tony Hey, AHM 2005 Introduction
1. Why? – LHC Motivation (“one in a billion events”, “20 million readout channels”, “1000s of physicists”, “10 million lines of code”)
2. What? – The World’s Largest Grid (according to The Economist)
3. How? – “Get Fit Plan” and Current Status (197 sites, 13,797 CPUs, 5 PB storage)
4. When? – Accounting and Planning Overview (“50 PetaBytes of data”, “100,000 of today’s processors”, “2007-08”)
Reference: http://www.allhands.org.uk/2005/proceedings/papers/349.pdf
4 LHC Experiments
ALICE – heavy-ion collisions, to create quark-gluon plasmas: 50,000 particles in each collision
LHCb – to study the differences between matter and antimatter, producing over 100 million b and b-bar mesons each year
ATLAS – general purpose: origin of mass, supersymmetry, micro black holes? 2,000 scientists from 34 countries
CMS – general purpose detector: 1,800 scientists from 150 institutes
“One Grid to Rule Them All”?
Why (particularly) the LHC?
1. Rare Phenomena – Huge Background: the Higgs signal sits some 9 orders of magnitude below the rate of all interactions (“one in a billion events”)
2. Complexity (“20 million readout channels”)
What are the Grid challenges?
Must:
• share data between thousands of scientists with multiple interests
• link major (Tier-0 [Tier-1]) and minor (Tier-1 [Tier-2]) computer centres
• ensure all data accessible anywhere, anytime
• grow rapidly, yet remain reliable for more than a decade
• cope with different management policies of different centres
• ensure data security
• be up and running routinely by 2007
What are the Grid challenges?
Data Management, Security and Sharing
1. Software process
2. Software efficiency
3. Deployment planning
4. Link centres
5. Share data
6. Manage data
7. Install software
8. Analyse data
9. Accounting
10. Policies
Grid Overview
Aim: by 2008 (full year’s data taking)
- CPU ~100 MSi2k (100,000 CPUs)
- Storage ~80 PB
- Involving >100 institutes worldwide
- Build on complex middleware being developed in advanced Grid technology projects, both in Europe (gLite) and in the USA (VDT)
1. Prototype went live in September 2003 in 12 countries
2. Extensively tested by the LHC experiments in September 2004
3. Currently (September 2005) 197 sites, 13,797 CPUs, 5 PB storage
Tier Structure
Tier 0 – CERN computer centre (offline farm, fed by the experiments’ online systems)
Tier 1 – National centres: RAL, UK; France; Italy; Germany; USA; ...
Tier 2 – Regional groups: ScotGrid, NorthGrid, SouthGrid, London
Tier 3 – Institutes: e.g. Glasgow, Edinburgh, Durham
Tier 4 – Workstations
Functionality for the LHC Experiments
• The basic functionality of the Tier-1s is:
  ALICE: Reconstruction, Chaotic Analysis
  ATLAS: Reconstruction, Scheduled Analysis/skimming, Calibration
  CMS: Reconstruction
  LHCb: Reconstruction, Scheduled skimming, Analysis
• The basic functionality of the Tier-2s is:
  ALICE: Simulation Production, Analysis
  ATLAS: Simulation, Analysis, Calibration
  CMS: Analysis, All Simulation Production
  LHCb: Simulation Production, No analysis
Technical Design Reports (June 2005)
Computing Technical Design Reports: http://doc.cern.ch/archive/electronic/cern/preprints/lhcc/public/
ALICE: lhcc-2005-018.pdf
ATLAS: lhcc-2005-022.pdf
CMS: lhcc-2005-023.pdf
LHCb: lhcc-2005-019.pdf
LCG: lhcc-2005-024.pdf
LCG Baseline Services Group Report: http://cern.ch/LCG/peb/bs/BSReport-v1.0.pdf
Contains all you (probably) need to know about LHC computing.
End of prototype phase.
Timescales
• Service Challenges – UK deployment plans
• End point: April ’07
• Context: first real (cosmics) data ’05
Baseline Functionality
Requirements and available solutions (OMII, VDT/GT, LCG/gLite, other), with comments:
• Storage Element: Yes; SRM via dCache, DPM or CASTOR – LCG includes Storage Resource Management capability
• Basic File Transfer: Yes; GridFTP; Yes – LCG includes GridFTP
• Reliable File Transfer: File Transfer Service – FTS is built on top of GridFTP
• Catalogue Services: RLS; LCG File Catalogue, gLite FireMan – central catalogues adequate, high throughput needed
• Data Management tools: OMII Data Service (upload/download); LCG tools (replica management, etc.) – gLite File Placement Service under development
• Compute Element: OMII Job Service; Gatekeeper; Yes – LCG uses Globus with mods
• Workload Management: manual resource allocation & job submission; Condor-G; Resource Broker – RB builds on Globus, Condor-G
• VO Agents: perform localised activities on behalf of the VO
• VO Membership Services: tools for account management, no GridMapFile equivalent; CAS; VOMS – CAS does not provide all the needed functionality
• DataBase Services: MySQL, PostgreSQL, ORACLE – off-the-shelf offerings are adequate
• Posix-like I/O: GFAL, gLite I/O; xrootd
• Application Software Installation Tools: Yes – tools already exist in LCG-2, e.g. PACMAN
• Job Monitoring: MonALISA, NetLogger; Logging & Bookkeeping service, R-GMA
• Reliable Messaging: tools such as Jabber are used by experiments (e.g. DIRAC for LHCb)
• Information System: MDS (GLUE); Yes; BDII – LCG based on BDII and GLUE schema
Concentrate on robustness and scale. Experiments have assigned external middleware priorities.
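To make the distinction between “Basic File Transfer” and “Reliable File Transfer” concrete, here is a minimal conceptual sketch (not the actual FTS implementation or API): a reliable transfer layer essentially wraps the bare GridFTP client in retry and bookkeeping logic. The endpoint URLs and retry policy below are illustrative assumptions.

    # Conceptual sketch: retrying a plain GridFTP copy, to illustrate what a
    # "Reliable File Transfer" layer such as FTS adds on top of basic GridFTP.
    # The URLs and retry policy are illustrative only.
    import subprocess
    import time

    def reliable_copy(src_url, dst_url, max_retries=3, backoff_s=30):
        """Retry a basic globus-url-copy until it succeeds or retries run out."""
        for attempt in range(1, max_retries + 1):
            result = subprocess.run(["globus-url-copy", src_url, dst_url])
            if result.returncode == 0:
                return True                      # transfer succeeded
            time.sleep(backoff_s * attempt)      # simple backoff before retrying
        return False                             # give up; a real service would queue/log it

    if __name__ == "__main__":
        ok = reliable_copy("gsiftp://tier1.example.org/data/run001.dat",
                           "file:///tmp/run001.dat")
        print("transfer", "succeeded" if ok else "failed")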
Exec2 Summary
• GridPP2 has already met 21% of its original targets with 86% of the metrics within specification
• “Get fit” deployment plan in place: LCG 2.6 deployed at 16 sites as a preliminary production service
• gLite 1 was released in April as planned, but components have not yet been deployed or their robustness tested by the experiments (1.3 available on the pre-production service)
• Service Challenge (SC)2 addressing networking was a success at CERN and the RAL Tier-1 in April 2005
• SC3 also addressing file transfers has just been completed
• Long-term concern: planning for 2007-08 (LHC startup)
• Short-term concerns: some under-utilisation of resources and the deployment of Tier-2 resources
At the end of GridPP2 Year 1, the initial foundations of “The Production Grid” are built. The focus is on “efficiency”.
People and Roles
Institutions: Birmingham, Bristol, Brunel, CERN, Cambridge, CCLRC, Durham, Edinburgh, Glasgow, Imperial, Lancaster, Liverpool, Manchester, Oxford, PPARC, QMUL, RHUL, Sheffield, Sussex, Swansea, Warwick, UCL
[Screenshot of the GridPP membership database (“All People” worksheet), with people categorised by organisational role (CB, PMB, DB, UB, OC, Tier-1 and Tier-2 Boards, Deployment Team, Dissemination), application (ALICE, ATLAS, CMS, LHCb, BaBar, CDF, D0, UKQCD, PhenoGrid, Portal, other applications), middleware area (InfoMon, Metadata, Networking, Security, Storage, WLMS) and infrastructure (Tier-1; the London, NorthGrid, SouthGrid and ScotGrid Tier-2s; EGEE; Tier-2 middleware and hardware support).]
• More than 100 people in the UK
• http://www.gridpp.ac.uk/members/
Project Map
[Project Map screenshot: the full grid of GridPP2 milestone and metric items across the six project areas and their sub-areas (LCG, development, deployment, LHC applications such as ATLAS, CMS, LHCb and GANGA, non-LHC applications such as BaBar, SamGrid, UKQCD, PhenoGrid and the Portal, M/S/N areas such as Metadata, Storage, Workload, Security, InfoMon and Networking, plus management, knowledge transfer, dissemination, interoperability and engagement). Each item is colour-coded: monitor OK / not OK; milestone complete / overdue / due soon / not due soon; item not active. Status date: 31/Mar/05.]
GridPP2 Goal: To develop and deploy a large scale production quality grid in the UK for the use of the Particle Physics community
GridPP Deployment Status 18/9/05 [2/7/05] (9/1/05)
Totals over all sites:
totalCPU: 3070 [2966] (2029)
freeCPU: 2247 [1666] (1402)
runJob: 458 [843] (95)
waitJob: 52 [31] (480)
seAvail TB: 90.89 [74.28] (8.69)
seUsed TB: 31.61 [16.54] (4.55)
maxCPU: 3118 [3145] (2549)
avgCPU: 2784 [2802] (1994)
Measurable improvements:
1. Sites functional-tested
2. 3000 CPUs
3. Storage via SRM interfaces
4. UK+Ireland federation
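These snapshot numbers also quantify the “under-utilisation of resources” concern raised in the Exec2 Summary. A quick back-of-the-envelope check (assuming, purely for illustration, that each running job occupies one CPU slot):

    # Rough occupancy estimate from the 18/9/05 snapshot above.
    # Assumes one running job occupies one CPU slot (illustrative assumption).
    total_cpu = 3070
    free_cpu = 2247
    running_jobs = 458

    occupancy = running_jobs / total_cpu      # ~0.15, i.e. ~15% of CPUs running jobs
    idle_fraction = free_cpu / total_cpu      # ~0.73, i.e. ~73% of CPUs free

    print(f"occupancy ~{occupancy:.0%}, idle ~{idle_fraction:.0%}")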
New Grid Monitoring Maps
Demo and Google Map: http://gridportal.hep.ph.ic.ac.uk/rtm/ and http://map.gridpp.ac.uk/
Preliminary Production Grid Status
Accounting
LCG Tier-1 Planning
[Chart: planned Tier-1 CPU capacity (kSI2K), 2006-2010, for PIC, Barcelona; FNAL, US; BNL, US; RAL, UK; ASGC, Taipei; Nordic Data Grid Facility; NIKHEF/SARA, NL; CNAF, Italy; CC-IN2P3, France; GridKA, Germany; TRIUMF, Canada.]
RAL, UK – pledged (2006) and planned to be pledged (2007-10; (a) bottom up / (b) top down, ~50% uncertainty):
CPU (kSI2K): 2006: 980; 2007: 1492 / 1234; 2008: 2712 / 3943; 2009: 4206 / 6321; 2010: 5857 / 10734
Disk (TB): 2006: 450; 2007: 841 / 630; 2008: 1484 / 2232; 2009: 2087 / 3300; 2010: 3020 / 5475
Tape (TB): 2006: 664; 2007: 1080 / 555; 2008: 2074 / 2115; 2009: 3934 / 4007; 2010: 5710 / 6402
LCG Tier-1 Planning
(CPU & Storage)
[Charts: planned Tier-1 CPU capacity (kSI2K), 2006-2010, per centre, and total offered by the external Tier-1s vs total requested.]
Experiment requests are large: e.g. in 2008, CPU ~50 MSi2k and storage ~50 PB! They can be met globally except in 2008. The UK plans to contribute >7% [currently contributes >10%].
First LCG Tier-1 Compute Law: CPU:Storage ~1 [kSi2k/TB]
Second LCG Tier-1 Storage Law: Disk:Tape ~1
(The number to remember is... 1)
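A worked illustration of these rules of thumb, under one consistent reading (CPU:Storage ~1 taken as kSi2k per TB of total storage, then Disk:Tape ~1 splitting that storage evenly); the 50 MSi2k input is the 2008 request quoted above, everything else is approximate:

    # Apply the rough Tier-1 "laws" above to the 2008 request of ~50 MSi2k CPU.
    cpu_ksi2k = 50_000                   # ~50 MSi2k requested in 2008

    total_storage_tb = cpu_ksi2k / 1.0   # First Law: ~1 kSi2k per TB -> ~50,000 TB (~50 PB)
    disk_tb = total_storage_tb / 2       # Second Law: Disk:Tape ~1, split evenly
    tape_tb = total_storage_tb / 2

    print(f"storage ~{total_storage_tb/1000:.0f} PB "
          f"(disk ~{disk_tb/1000:.0f} PB, tape ~{tape_tb/1000:.0f} PB)")

which reproduces the ~50 PB storage figure quoted with the 2008 request.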
LCG Tier-1 Planning
(Storage)
[Charts: planned Tier-1 disk and tape capacity (TB), 2006-2010, per centre, and total offered vs total requested for each.]
LCG Tier-2 Planning
UK, Sum of all Federations
Pledged (2006) and planned to be pledged (2007-10; (a) bottom up / (b) top down, ~100% uncertainty):
CPU (kSI2K): 2006: 3800; 2007: 3840 / 1592; 2008: 4830 / 4251; 2009: 5410 / 6127; 2010: 6010 / 9272
Disk (TB): 2006: 530; 2007: 540 / 258; 2008: 600 / 1174; 2009: 660 / 2150; 2010: 720 / 3406
Third LCG Tier-2 Compute Law: Tier-1:Tier-2 CPU ~1
Zeroth LCG Law: there is no Zeroth Law – all is uncertain
Fifth LCG Tier-2 Storage Law: CPU:Disk ~5 [kSi2k/TB]
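A quick sanity check of these Tier-2 ratios against the UK 2008 bottom-up figures above (a sketch that only divides numbers already on this and the previous slides):

    # Check the Tier-2 "laws" against the UK 2008 bottom-up estimates quoted above.
    tier2_cpu_ksi2k = 4830    # UK Tier-2 CPU, 2008
    tier2_disk_tb = 600       # UK Tier-2 disk, 2008
    tier1_cpu_ksi2k = 2712    # RAL Tier-1 CPU, 2008

    print(f"Tier-1:Tier-2 CPU ~{tier1_cpu_ksi2k / tier2_cpu_ksi2k:.1f}")          # ~0.6
    print(f"Tier-2 CPU:Disk ~{tier2_cpu_ksi2k / tier2_disk_tb:.1f} kSi2k/TB")     # ~8

Both come out within the stated ~100% uncertainty of the quoted ratios (~1 and ~5).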
Production Grid Metrics
• Set SMART (Specific Measurable Achievable Realistic Time-phased) Goals
• Systematic approach and measurable improvements in deployment area
• See “Grid Deployment and Operations for EGEE, LCG and GridPP” (Jeremy Coles), which provides context for Grid “efficiency”
The “Get Fit” Plan
Service Challenges
• SC2 (April): RAL joined computing centres around the world in a networking challenge, transferring 60 TeraBytes of data over ten days.
• SC3 (September): RAL to CERN (T1-T0) at rates of up to 650 Mb/s; e.g. Edinburgh to RAL (T2-T1) at rates of up to 480 Mb/s.
• UKLight service tested from Lancaster to RAL.
• Overall, the File Transfer Service is very reliable, with the failure rate now below 1%.
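For context, the SC2 figure implies a sustained rate comparable to the SC3 peaks; a one-line conversion of the 60 TB over ten days quoted above:

    # Average rate implied by SC2: 60 TB transferred over ten days.
    terabytes = 60
    days = 10

    bits = terabytes * 1e12 * 8              # TB -> bits (decimal terabytes assumed)
    seconds = days * 24 * 3600
    avg_mbps = bits / seconds / 1e6          # ~556 Mb/s sustained average

    print(f"~{avg_mbps:.0f} Mb/s sustained")  # comparable to the SC3 peak of 650 Mb/s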
Middleware Development
Configuration Management
Storage Interfaces
Network Monitoring
Security
Information Services
Grid Data Management
gLite Status
1.2 installed on the Grid pre-production service
1.3 some components have been upgraded
1.4 upgrades to VOMS and registration tools, plus additional bulk job submission components
LCG 2.6 is the (August) production release
UK’s R-GMA incorporated (production and pre-production)
LCG 3 will be based upon gLite.
Application Development
e.g. Reprocessing DØ data with SAMGrid (Frederic Villeneuve-Seguier)
ATLAS, LHCb, CMS, BaBar (SLAC), SAMGrid (FermiLab), QCDGrid
Workload Management
Efficiency Overview – integrated over all VOs and RBs:
Successes/Day: 12722
Success %: 67%, improving from 42% to 70-80% during 2005
• Problems identified: half WMS (Grid), half JDL (User)
LHC VOs: ALICE / ATLAS / CMS / LHCb
Successes/Day: N/A / 2435 / 448 / 3463
Success %: 53% / 84% / 59% / 68%
Note. Some caveats, see http://egee-jra2.web.cern.ch/EGEE-JRA2/QoS/JobsMetrics/JobMetrics.htm
Selection by experiments of “production sites” using Site Functional Tests (currently ~110 of the 197 sites) or use of pre-test software agents leads to >90% experiment production efficiency
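The success percentages translate directly into attempted job counts; a small sketch using only the numbers quoted on these two slides (the arithmetic is illustrative, not taken from the EGEE metrics page):

    # Estimate attempted jobs/day from the quoted successes/day and success rates.
    overall = {"all VOs": (12722, 0.67)}
    lhc_vos = {"ATLAS": (2435, 0.84), "CMS": (448, 0.59), "LHCb": (3463, 0.68)}

    for name, (successes, rate) in {**overall, **lhc_vos}.items():
        attempts = successes / rate          # successes = attempts * success rate
        failures = attempts - successes
        print(f"{name}: ~{attempts:,.0f} attempts/day, ~{failures:,.0f} failures/day")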
“UK contributes to EGEE’s battle with malaria”
BioMed: Successes/Day 1107, Success % 77%
WISDOM (Wide In Silico Docking On Malaria)
The first biomedical data challenge for drug discovery, which ran on the EGEE grid production service from 11 July 2005 until 19 August 2005.
GridPP resources in the UK contributed ~100,000 kSI2k-hours from 9 sites
Number of Biomedical jobs processed by country
Normalised CPU hours contributed to thebiomedical VO for UK sites, July-August 2005
1. Why? 2. What? 3. How? 4. When?
From the Particle Physics perspective, the Grid is:
1. mainly (but not just) for physicists, more generally for those needing to utilise large-scale computing resources efficiently and securely
2. a) a working production-scale system running today
   b) about seamless discovery of computing resources
   c) using evolving standards for interoperation
   d) the basis for computing in the 21st Century
   e) not (yet) as seamless, robust or efficient as end-users need
3. methods outlined – please come to the PPARC stand, Jeremy Coles’ talk
4. a) now at “preliminary production service” level, for simple(r) applications (e.g. experiment Monte Carlo production)
   b) 2007 for a fully tested 24x7 LHC service (a large distributed computing resource) for more complex applications (e.g. data analysis)
   c) planned to meet the LHC Computing Challenge